The Ultimate Guide To Data Lakes: Warehouse Integration

Your organization could be leaving valuable data out of its analytics with its existing data warehouse.

By integrating a data lake with its existing data warehouse, the organization can extract powerful analytics it may have been missing.

As outlined in our previous post, modernizing an enterprise’s data management technology and processes with a data lake is highly recommended because of the value a data lake can bring. However, that does not mean the current data warehouse architecture should be removed. Data warehouses are tried-and-true assets that continue to create value and can be integrated into a data lake architecture. In fact, data management environments are steadily moving toward integrated approaches that combine the benefits of both data warehouses and data lakes.

Adding a data lake can extend the useful life of your data warehouse and protect the mature investment it represents. Using both diversifies your data storage, so your data scientists and developers can choose the appropriate option for storage, processing, and analytics.

In the image below we have outlined an example of how the two systems can coexist in your enterprise data environment:

Figure 1 – Data Warehouse & Data Lake: Traditional Meets Modern

Below is an overview of conventional approaches for integrating a data lake with existing data warehouse architecture.

Ingest and process in the data lake

In this use case, all data is ingested and stored in the data lake, which serves as the initial staging area. Cheaper computing resources in the lake then process the data, and the results can be saved to the data warehouse while the staging data remains in the lake.
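The flow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a local temporary folder stands in for the data lake, an in-memory SQLite database stands in for the data warehouse, and the names `stage_raw`, `process_and_load`, and `daily_sales` are hypothetical, not part of any product.

```python
import csv
import sqlite3
import tempfile
from collections import defaultdict
from pathlib import Path


def stage_raw(lake_dir: Path, name: str, rows: list[dict]) -> Path:
    """Land raw records in the lake untouched (a local folder stands in for the lake)."""
    path = lake_dir / "staging" / f"{name}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def process_and_load(staged: Path, warehouse: sqlite3.Connection) -> None:
    """Aggregate the staged data and save only the result to the warehouse."""
    totals: dict[str, float] = defaultdict(float)
    with staged.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["day"]] += float(row["amount"])
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales (day TEXT PRIMARY KEY, total REAL)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", sorted(totals.items())
    )
    warehouse.commit()


# Hypothetical sample data; SQLite in memory stands in for the warehouse.
lake = Path(tempfile.mkdtemp())
warehouse = sqlite3.connect(":memory:")
staged = stage_raw(lake, "sales", [
    {"day": "2024-01-01", "amount": "10.5"},
    {"day": "2024-01-01", "amount": "4.5"},
    {"day": "2024-01-02", "amount": "7.0"},
])
process_and_load(staged, warehouse)
result = warehouse.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall()
```

Note that the staged file is never deleted: the warehouse receives only the aggregated result, and the raw data stays in the lake for later reprocessing.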

Warehouse as a data source in the data lake

Another option is to process all data, including warehouse data sources, with data lake resources. This option can provide a standard interface and processing API to all data repositories in your enterprise. It also enables your team to learn one technology for all or most data processing needs. Technologies exist to run massively parallel SQL queries while simultaneously integrating with advanced computation and algorithm libraries. However, not all SQL queries can be executed efficiently on a cluster or grid of nodes, so understanding the performance tradeoffs is critical.
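To make the "standard interface" idea concrete, here is a minimal single-machine sketch. It assumes an in-memory SQLite database as the warehouse and a CSV string as a lake file, and the helper names (`warehouse_source`, `lake_csv_source`, `join_on`) are hypothetical. A real deployment would use a distributed engine; this only shows the shape of treating the warehouse as one more source.

```python
import csv
import io
import sqlite3
from typing import Callable, Iterator

Row = dict[str, str]
Source = Callable[[], Iterator[Row]]


def warehouse_source(conn: sqlite3.Connection, table: str) -> Source:
    """Expose a warehouse table through the same row-iterator interface as lake files."""
    def read() -> Iterator[Row]:
        cursor = conn.execute(f"SELECT * FROM {table}")  # table name is trusted here
        columns = [d[0] for d in cursor.description]
        for values in cursor:
            yield {c: str(v) for c, v in zip(columns, values)}
    return read


def lake_csv_source(text: str) -> Source:
    """Expose raw CSV text (standing in for a lake file) as rows."""
    def read() -> Iterator[Row]:
        yield from csv.DictReader(io.StringIO(text))
    return read


def join_on(left: Source, right: Source, key: str) -> list[Row]:
    """Hash join that works on any two sources, regardless of where they live."""
    index = {row[key]: row for row in right()}
    return [{**row, **index[row[key]]} for row in left() if row[key] in index]


# Hypothetical sample data: customers live in the warehouse, clicks in the lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [("c1", "EU"), ("c2", "US")])
clicks = "customer_id,page\nc1,/home\nc3,/pricing\nc2,/docs\n"

joined = join_on(lake_csv_source(clicks), warehouse_source(conn, "customers"), "customer_id")
```

The processing code (`join_on`) never needs to know which side came from the warehouse, which is the point of a single interface. The performance caveat still applies: a join like this is cheap on one machine but may require moving a lot of data between nodes on a cluster.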

Data warehouse for reporting

Maintain your data warehouse as the core platform that supports standard reporting. Considerable testing and due diligence typically go into confirming the accuracy of the queries and computations behind these reports, especially for financial or other information that must be highly accurate. A common practice is to store raw data in the lake, process and transform it with lake resources, and finally store the transformed data in the data warehouse. By doing this, the reports and dashboards reading from the warehouse do not require any change.
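A small sketch can show why the reports do not need to change. It assumes an in-memory SQLite database as the warehouse, JSON lines as the raw lake data, and a hypothetical `revenue` table and `transform` function. The key point is that the transformation adapts the raw data to the warehouse's existing schema, so the report query stays exactly as it was.

```python
import json
import sqlite3

# Existing report query; nothing below modifies it.
REPORT_SQL = "SELECT region, SUM(amount_usd) FROM revenue GROUP BY region ORDER BY region"


def transform(raw_line: str) -> tuple[str, float]:
    """Convert one raw lake record into the warehouse's existing (region, amount_usd) shape."""
    event = json.loads(raw_line)
    region = event["region"].strip().upper()
    amount_usd = event["amount_cents"] / 100
    return region, amount_usd


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE revenue (region TEXT, amount_usd REAL)")
warehouse.execute("INSERT INTO revenue VALUES ('EU', 20.0)")  # data loaded the old way
before = warehouse.execute(REPORT_SQL).fetchall()

# Hypothetical raw records as they might sit in the lake: untrimmed, in cents.
raw_lake_lines = [
    '{"region": " eu ", "amount_cents": 1250}',
    '{"region": "us", "amount_cents": 800}',
]
warehouse.executemany("INSERT INTO revenue VALUES (?, ?)", map(transform, raw_lake_lines))
after = warehouse.execute(REPORT_SQL).fetchall()
```

The same `REPORT_SQL` runs before and after the lake-sourced load; only the data feeding the warehouse has changed.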

Archive data

Use a data lake to archive data that once lived in the warehouse. Storing these archives on commodity hardware in the lake costs less than keeping them in the warehouse.
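An archive job along these lines might look like the sketch below. As before, the setup is an assumption for illustration: a temporary folder plays the lake, in-memory SQLite plays the warehouse, and the `orders` table and `archive_before` function are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def archive_before(warehouse: sqlite3.Connection, lake_dir: Path, cutoff: str) -> Path:
    """Copy rows older than `cutoff` to a lake file, then delete them from the warehouse."""
    # ISO-8601 date strings compare correctly as plain text.
    rows = warehouse.execute(
        "SELECT order_id, order_date, total FROM orders "
        "WHERE order_date < ? ORDER BY order_id",
        (cutoff,),
    ).fetchall()
    path = lake_dir / "archive" / f"orders_before_{cutoff}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "order_date", "total"])
        writer.writerows(rows)
    # Delete only after the archive file has been written successfully.
    warehouse.execute("DELETE FROM orders WHERE order_date < ?", (cutoff,))
    warehouse.commit()
    return path


# Hypothetical sample data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, total REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2021-05-01", 10.0),
    (2, "2022-11-30", 20.0),
    (3, "2023-06-15", 30.0),
])
lake = Path(tempfile.mkdtemp())
archive_path = archive_before(warehouse, lake, "2023-01-01")
archived = archive_path.read_text().splitlines()
remaining = warehouse.execute("SELECT order_id FROM orders").fetchall()
```

Writing the archive before deleting is a deliberate ordering: if the write fails, the warehouse still holds the data.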

We’ve shown that a data lake can be integrated with an existing data warehouse to significantly enhance the functionality and output of your data storage infrastructure.

In the next and final article in our data lake series, we will cover best practices in data lake architecture. If you missed the first three, here are the links:

