Introducing the first blog post in a new series from Eric Vogelpohl, Managing Director of Tech Strategy at Blueprint. In the series, Eric will discuss the crucial elements that make a successful data ecosystem framework and why a comprehensive data strategy is key to future-proofing your business. Be sure to follow our socials to get updates when each post is published. This week Eric is covering data acquisition sources, patterns, and best practices for your data lakehouse.
Organizations looking to future-proof their business will find success or failure determined by a few critical elements of their strategy, the first being the development of a robust data estate. Prioritizing this has a significant impact on the business, particularly through cost savings and improved scalability. While a robust data estate delivers high-quality data for downstream processing and engineering, its implementation requires organizations to first acquire data from a wide range of sources, including traditional databases, SaaS platforms, APIs, files, and third-party data providers. To support this, modern data acquisition platforms like Fivetran, Azure Data Factory, and Informatica Cloud are designed to work with lakehouse platforms such as Databricks and its Delta Lake storage format. Blueprint treats the data estate as the foundation for all your digital transformation strategies, helping you leverage industry-leading services to adapt faster, innovate better, and expand more efficiently.
Let’s dive deep into the data acquisition process, starting with the sources of data eligible for the data lake. These include:
- Transactional databases like SQL Server, PostgreSQL, MySQL, and others
- SaaS platforms like Workday and Salesforce
- IoT time-series data
- Semi-structured data like JSON, XML, and CSV
Once data sources have been identified, organizations can determine the most appropriate data acquisition pattern and interval for each source. There are several patterns to choose from, including batch incremental, full data loads, CDC, and file pick-up.
Batch incremental is a pattern where data is acquired periodically, and only new or modified data is retrieved from the source system. This pattern is useful when dealing with large data sets that are frequently updated. It often relies on a date key in the source system to establish the ‘last used location’ or watermark. Full data loads, on the other hand, involve acquiring all data from a source system. This pattern is useful when dealing with smaller data sets that are not updated frequently. Full data loads are also useful as an initial load to ensure that all data is captured.
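To make the watermark idea concrete, here is a minimal sketch of batch-incremental acquisition against a relational source, using an in-memory SQLite table as a stand-in. The table name, column names, and watermark value are all illustrative, not part of any real system.

```python
# Sketch: batch-incremental extraction driven by a date-key watermark.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2024-01-01T00:00:00"),
    (2, "2024-01-05T00:00:00"),
    (3, "2024-01-09T00:00:00"),
])

def incremental_extract(conn, watermark: str):
    """Pull only rows modified since the last stored watermark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # The new watermark is the highest date key seen in this batch;
    # if nothing changed, the old watermark carries forward.
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, wm = incremental_extract(conn, "2024-01-03T00:00:00")
print(len(rows), wm)  # only the 2 rows newer than the watermark
```

The key design point is that the watermark is advanced from the data itself, so a failed run can simply be retried without skipping records.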
Change data capture (CDC) involves capturing changes to a source system’s data in real time or near real time. CDC is useful in situations where immediate access to new data is necessary. File pick-up is a pattern where data is acquired by reading files from a specified location, such as a network file share or an FTP server. This pattern is useful when dealing with data generated by external systems or partners.
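The file pick-up pattern can be sketched in a few lines: scan a landing folder, process only files not seen before, and remember what has been handled. The folder layout and file names below are illustrative; a production version would persist the processed set rather than hold it in memory.

```python
# Sketch: file pick-up from a landing folder, skipping already-seen files.
import tempfile
from pathlib import Path

landing = Path(tempfile.mkdtemp())  # stand-in for a file share or FTP drop
(landing / "partner_feed_1.csv").write_text("id,value\n1,a\n")
(landing / "partner_feed_2.csv").write_text("id,value\n2,b\n")

processed: set[str] = set()

def pick_up_new_files(folder: Path, seen: set[str]) -> list[Path]:
    """Return files in the folder that have not been processed yet."""
    new_files = [p for p in sorted(folder.glob("*.csv")) if p.name not in seen]
    seen.update(p.name for p in new_files)
    return new_files

first_pass = pick_up_new_files(landing, processed)
second_pass = pick_up_new_files(landing, processed)  # nothing new arrived
print(len(first_pass), len(second_pass))
```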
Other acquisition patterns or a hybrid of those above may be appropriate for specific sources or scenarios, and it is important to choose the best approach for the data being acquired.
It’s worth noting that the chosen acquisition pattern will also determine the frequency with which data is acquired. For example, some data sources may require frequent updates to stay up to date, while others may be static or infrequently updated. Understanding the different acquisition patterns and choosing the best one for each data source is crucial for building a successful data estate.
The Bronze layer
An important concept in the Databricks Lakehouse architecture is the Bronze layer, which keeps data in its raw form to preserve its integrity, support recovery or rebuilds, and provide a point of audit should one be needed. When data is first acquired, it is stored in the Bronze layer in its original form, without any transformations or modifications. This preserves the data’s lineage and makes it easier to access and analyze in its original state. The Bronze layer serves as a staging area for all incoming data, where it is immediately available to downstream data engineering and transformation processes within the data pipeline.
Data is processed from the Bronze layer to the Silver layer, where it is transformed and optimized for analytical use. The ability to write data directly to storage as raw, without any transformation or modification, is a hallmark of the data lake architecture, which helps to save money on ingestion-processing and enables faster access to data for downstream processing and engineering.
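A minimal sketch of the "write raw" idea: land each payload byte-for-byte as received, partitioned by source and ingestion date, with no parsing or cleanup on the way in. The folder layout and names are assumptions for illustration, not a Databricks requirement.

```python
# Sketch: landing raw payloads in a Bronze layer, untouched.
import json
import tempfile
from datetime import date
from pathlib import Path

bronze_root = Path(tempfile.mkdtemp())  # stand-in for cloud object storage

def land_raw(source: str, payload: bytes) -> Path:
    """Write the payload under bronze/<source>/<ingest_date>/ exactly as received."""
    target_dir = bronze_root / source / date.today().isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"batch_{len(list(target_dir.iterdir()))}.json"
    target.write_bytes(payload)  # no transformation: raw form preserves lineage
    return target

raw = json.dumps({"order_id": 42, "status": "new"}).encode()
path = land_raw("orders_api", raw)
print(path.read_bytes() == raw)  # the Bronze copy is identical to the source bytes
```

Because nothing is altered on write, the Bronze copy can always be replayed to rebuild downstream Silver tables.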
Extract, load, transform
The ELT pattern (extract, load, and transform) is a common data processing pattern in the Lakehouse architecture. Data is first extracted from various sources, then loaded into the data lake in its raw form, and finally transformed into higher-quality tables as the data pipeline progresses. One benefit of writing data directly to storage as raw data in the Bronze layer is that it can help organizations save on ingestion and processing costs: by not transforming data immediately, an organization avoids paying to process and transform data that may never be needed.
Additionally, the retention schedule of data in the Bronze layer is important to consider. Organizations need to determine how long they need to keep data in the Bronze layer before it is transformed or moved to a different layer. This retention schedule can be determined by factors such as compliance regulations, data usage patterns, and storage costs. By having a clear retention schedule, organizations can avoid unnecessary storage costs and ensure that data is retained for as long as it is needed.
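A retention schedule ultimately reduces to a periodic sweep of the Bronze storage. The sketch below deletes raw files older than a retention window based on file modification time; the 90-day window and file layout are illustrative assumptions, and real policies should come from compliance and usage requirements.

```python
# Sketch: a Bronze retention sweep that removes files past the retention window.
import os
import tempfile
import time
from pathlib import Path

RETENTION_DAYS = 90  # illustrative policy, not a recommendation
bronze_root = Path(tempfile.mkdtemp())

old_file = bronze_root / "stale.json"
new_file = bronze_root / "fresh.json"
old_file.write_text("{}")
new_file.write_text("{}")
# Backdate the stale file's mtime so it falls outside the retention window.
backdated = time.time() - (RETENTION_DAYS * 86400 + 60)
os.utime(old_file, (backdated, backdated))

def sweep(root: Path, retention_days: int) -> list[str]:
    """Delete files older than the retention window; return their names."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for p in root.rglob("*.json"):
        if p.stat().st_mtime < cutoff:
            p.unlink()
            removed.append(p.name)
    return removed

removed = sweep(bronze_root, RETENTION_DAYS)
print(removed)  # only the stale file is removed
```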
Databricks and Python for data acquisition
There are many first-class data acquisition platforms, including Fivetran, Matillion, Azure Data Factory, and others. Blueprint is skilled with these tools and equally adept at leveraging Databricks natively for data acquisition routines when that makes the most sense from a cost-management and labor point of view. Native Python routines in Databricks provide a highly customizable approach to data acquisition: with Python code, organizations can acquire data from nearly any source, including JDBC endpoints, APIs, and other storage locations. This approach is a good fit when a client or firm is comfortable with a ‘build’ strategy over a ‘buy’ strategy such as Fivetran or another data acquisition tool.
One of the benefits of using Python for data acquisition is the ability to customize data acquisition pipelines. Python can be used to acquire, transform, and process data in real-time, batch, or streaming mode. Additionally, Python provides a highly flexible approach to data acquisition as it can be customized to meet the specific requirements of each organization’s data ecosystem.
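One way to structure such a custom pipeline is a small registry of acquisition routines, so new sources plug in without touching the pipeline itself. This is a hand-rolled sketch, not a Databricks API; the source names are hypothetical, and the routines are stubbed where a real implementation would call an HTTP API or run a JDBC query.

```python
# Sketch: a pluggable Python acquisition pipeline built on a source registry.
from typing import Callable, Dict, Iterable, List

sources: Dict[str, Callable[[], Iterable[dict]]] = {}

def register(name: str):
    """Decorator that registers an acquisition routine under a source name."""
    def wrap(fn):
        sources[name] = fn
        return fn
    return wrap

@register("crm_api")
def pull_crm() -> List[dict]:
    # A real routine would call the SaaS platform's API here.
    return [{"id": 1, "name": "Acme"}]

@register("erp_jdbc")
def pull_erp() -> List[dict]:
    # A real routine would run a JDBC/SQL query here.
    return [{"id": 10, "amount": 250.0}]

def run_pipeline() -> Dict[str, int]:
    """Acquire from every registered source and report row counts landed."""
    return {name: len(list(fn())) for name, fn in sources.items()}

print(run_pipeline())
```

Adding a new source is then a matter of writing one decorated function, which is the flexibility the ‘build’ strategy buys.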
At Blueprint, our team has extensive experience with leading data acquisition platforms and can help organizations replace legacy technologies to reduce data engineering labor and costs while increasing scalability.
Ready to get started?
About the Author
Eric Vogelpohl is the Managing Director of Tech Strategy at Blueprint. He’s a proven IT professional with more than 20 years of experience and a high degree of technical and business acumen. He has an insatiable passion for all things tech, pro-cloud/SaaS, leadership, learning, and sharing ideas on how technology can turn data into information and transform user experiences. He is well-known for his dynamic and engaging speaking sessions at meetups, conferences, and industry events.