Is your organization keeping up with the volume, variety, velocity, and veracity of data?
To compete in today’s markets, it’s imperative to have a highly agile data management system.
Agility is just one of the differences between data lakes and data warehouses. In this second article of our data lake series, we will outline the fundamental differences between these two storage repositories.
If you missed the first article, which defines what a data lake is, visit part 1 of the series: Defining Data Lakes.
Both data lakes and data warehouses are central storage repositories that integrate data from multiple disparate business intelligence and operational systems and can be used to guide management decisions. They look and sound similar on the surface, but when you look behind the curtain, you will find that they are very different. Here are five fundamental differences your organization should be aware of:
Structure
Data warehouses have a pre-defined and structured container for data. The storage format, which is typically relational, defines the structured tabular data (rows and columns) and the data types allowed in each column, such as text, numbers, or dates. This definition must be designed and created in a schema before writing any rows of data.
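As a minimal sketch of schema-on-write, the following Python snippet uses SQLite as a stand-in for a warehouse table (the table and column names are illustrative). A schema must exist before any row can land, and a field the schema never anticipated has nowhere to go:

```python
import sqlite3

# An in-memory database stands in for a warehouse table; the schema
# (columns and types) must exist before any row can be written.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, total REAL)"
)

# Rows that match the schema are accepted...
conn.execute("INSERT INTO orders VALUES (1, 'Acme', 99.50)")

# ...but a field the schema never defined is rejected outright.
try:
    conn.execute(
        "INSERT INTO orders (order_id, coupon_code) VALUES (2, 'SAVE10')"
    )
except sqlite3.OperationalError as err:
    print(err)  # table orders has no column named coupon_code
```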
Data lakes, in contrast, can store unstructured data (plain text, emails, documents, PDFs), binary data (images, audio, video), semi-structured data (logs, CSV, XML, JSON), and structured data from relational database management systems. Since the schema is not defined before the data is stored, it must be established before the data is read; otherwise, the application attempting to process the data would not know what the data represents or which types are available.
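The complementary schema-on-read pattern can be sketched in a few lines of Python: raw records are kept exactly as produced, and the consuming code decides at read time which fields it needs and how to interpret them (the event fields below are made up for illustration):

```python
import json

# Raw events land in the lake exactly as produced; no schema is
# enforced at write time, so records can differ in shape.
raw_events = [
    '{"user": "ada", "action": "login"}',
    '{"user": "grace", "action": "purchase", "total": 42.0}',
]

# The schema is applied at read time: the consuming code decides
# which fields it needs and how to interpret them.
for line in raw_events:
    event = json.loads(line)
    total = float(event.get("total", 0.0))  # interpreted on read
    print(event["user"], event["action"], total)
```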
Data extraction
Data warehouses can be used by analysts and comparatively less specialized users. However, due to the sheer volume, variety of formats, and rapid evolution of the information in data lakes, extracting value from lakes requires data professionals and data scientists.
Velocity
The rate at which data can be ingested into a data lake cannot be matched by a warehouse without specialized and expensive hardware. Due to their distributed design, data lakes can scale out to increase ingestion speed. This level of velocity has enabled use cases like sentiment analysis from constant streams of tweets or real-time quality control from manufacturing sensor data.
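To make the scale-out idea concrete, here is a minimal single-process sketch of lake-style ingestion, assuming a simple hour-partitioned, append-only layout (the paths and function name are hypothetical). Because writers share nothing, a real deployment can raise throughput simply by running many such writers in parallel across nodes:

```python
import json
import time
from pathlib import Path

def ingest(event: dict, root: Path = Path("lake/raw/tweets")) -> None:
    """Append one raw event to an hour-partitioned file.

    Writers share nothing: each node appends to its own partition
    files, so adding nodes adds ingestion throughput.
    """
    hour = time.strftime("%Y-%m-%d-%H", time.gmtime())
    partition = root / f"dt={hour}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "events.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

ingest({"user": "ada", "text": "sensor reading ok", "ts": time.time()})
```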
Agility
Due to their schema-on-write characteristic, warehouses are less agile at addressing changing business needs. Schema-on-write is self-limiting: if data is not defined in the schema, it cannot be stored.
When choosing the data structure for storage, you are also choosing what you will not store, now and in the future. This static structure and taxonomy dictate the kinds of analysis that are possible. Changing the schema to accommodate new data and analyses requires changes to ingestion, to extract-transform-load (ETL) processes, and to any APIs or visualizations that access the data.
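The contrast can be sketched in a few lines (again using SQLite as a stand-in warehouse, with illustrative column names): the warehouse needs a migration before the first new value can land, while the lake absorbs the same change as just another key in the raw record:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")

# Warehouse: a new question ("which orders used a coupon?") forces a
# schema migration before the first new row can be written, and every
# ETL job, API, and visualization touching the table must follow suit.
conn.execute("ALTER TABLE orders ADD COLUMN coupon_code TEXT")
conn.execute("INSERT INTO orders VALUES (1, 12.0, 'SAVE10')")

# Lake: the same change is just one more key in the raw record; files
# already stored and readers that ignore the new key are untouched.
new_event = {"order_id": 2, "total": 30.0, "coupon_code": "SAVE10"}
```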
This self-limiting nature can benefit the warehouse designer by providing tight control over the intended uses of the EDW (enterprise data warehouse) and discouraging bad practices by developers and users. Data lakes, by contrast, allow enterprises to store a data set in its native format as a complete snapshot and use part or all of it for an immediate purpose.
Later, an entirely different need may require pairing that data set with another one. Data lakes allow enterprises to adapt quickly to changing needs without any changes to the core architecture or storage.
It is impossible to ascertain upfront all the insights an organization could derive from all of its data sources. Organizations commonly start with an initial list of questions, and more relevant or specific questions invariably emerge once analysis of the data begins.
The ability to steer in different directions allows ad hoc analysis and data discovery driven by the data professional's train of thought. Data lakes enable the use of data technologies in a way that matches human thought processes.
Costs
Warehouses can store vast amounts of data, but the relational database model they use was not originally designed to run on distributed hardware architectures with multiple nodes storing data from the same schema. Distributed database management systems used by warehouses are notorious for performance and latency problems caused by joining data across nodes. Because of this, warehouses can scale up but cannot scale out cost-effectively. Scaling up is more expensive than scaling out because the hardware is purpose-built and the resource pools, such as CPU, memory, and storage, must be substantial. The opportunity to create and afford a data lake was made possible by open source big data technologies that can reliably scale horizontally on commodity-grade hardware.
As you can see, each repository is favored by some of these differences. But as data lakes become ubiquitous as a standard data storage practice, their security, historically a point in warehouses' favor, will soon be equivalent or better. We believe the agility and favorable long-term costs of data lakes make them the best storage option for organizations.
In our next article, we will provide an in-depth analysis of the value that data lakes add.