Data lakes can significantly improve your organization's data warehousing and data storage capabilities.

If you are considering a data lake, make sure you avoid the following common but often-overlooked mistakes.

Knowing data lake best practices is tremendously valuable. In our experience building data lake architectures, we have gained significant insight from many mistakes and poor practices. The following eight points are the most common mistakes we see; by avoiding them, you will ensure you are incorporating data lake best practices.

Failing to develop plausible use cases

Data lakes are new, and the technology is appealing as we work to keep up with growing data and analytics demands. But there should be actual use cases addressing business needs before you attempt to add a data lake to your enterprise. Extensive use case analysis sets clear expectations for all stakeholders and increases the chances of a successful implementation that reaches your goals.

Exporting a relational database to files in the data lake

Exporting a database of normalized tables, such as a snowflake schema, and ingesting it into the lake as-is will lead to performance problems and latency. Big data technologies were not designed for small data files or for data that must be joined on foreign keys. A better approach is to extract large, flat, denormalized files.
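As a minimal sketch of that approach, the snippet below joins normalized tables in the source database and writes one flat, columnar file for ingestion. The connection string, table names, and column names are hypothetical, and it assumes pandas, SQLAlchemy, a suitable database driver, and pyarrow are available.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical source database; replace with your own connection details.
engine = create_engine("postgresql://user:password@warehouse-host/sales")

# Join the normalized tables in the source database so the lake receives one
# large, flat file instead of many small files that require foreign-key joins.
query = """
    SELECT o.order_id, o.order_date, o.quantity,
           c.customer_name, c.region,
           p.product_name, p.category
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products  p ON o.product_id  = p.product_id
"""

flat = pd.read_sql(query, engine)

# Write a single denormalized, columnar file ready to land in the lake.
flat.to_parquet("orders_denormalized.parquet", index=False)
```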

Creating data pond silos or data swamps

Often, smaller repositories are created by different groups, either by ingesting different data or by applying different processing schemas to the same data sets. Part of this problem is reinforced by old, once-acceptable habits such as data marts: subsets of the data warehouse oriented to a single department. Schema-on-read is easily abused because you do not have to plan for how the data will be used before storing it, but that does not mean you should skip the planning. If your organization fails to create descriptive metadata and an underlying process to maintain it, your data management will lack consistency, lineage, quality, and semantic coherence, and users will not know what data is stored in the lake. This can spiral out of control until the lake becomes a technology of little value that thwarts business intelligence goals, or worse, a costly liability when your organization makes bad decisions based on analysis from a data swamp. Plan for the big picture, agree on schemas, (re)design-before-read, share these documents along with your other technical publications, and train staff across all departments to prevent silos and data swamps.
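One lightweight way to start is to record descriptive metadata alongside each dataset as it lands in the lake. The sketch below writes a small JSON catalog entry; the fields and the catalog location are assumptions for illustration, not the API of any specific catalog product.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def register_dataset(catalog_dir: str, name: str, owner: str,
                     source: str, schema: dict, description: str) -> Path:
    """Write a small JSON catalog entry so users can discover what is in the lake."""
    entry = {
        "name": name,
        "owner": owner,                    # who is responsible for quality
        "source": source,                  # lineage: where the data came from
        "schema": schema,                  # agreed-upon column names and types
        "description": description,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    path = Path(catalog_dir) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entry, indent=2))
    return path

# Hypothetical example: register the denormalized orders extract from the previous section.
register_dataset(
    catalog_dir="/data/lake/catalog",
    name="orders_denormalized",
    owner="sales-analytics",
    source="warehouse-host/sales (orders, customers, products)",
    schema={"order_id": "bigint", "order_date": "date", "customer_name": "string"},
    description="Flat orders extract joined with customer and product attributes.",
)
```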

Manual installation and configuration

What do you do when teams feel they do not have the time or resources to use IT automation for their grids and clusters? Don't be an organization that learns the hard way that manual configuration management does not pay off in the long run, or even the short term. Use IT automation tools such as Docker Machine, Rancher, or Chef to remediate this. These tools will lower maintenance and support time and expose your team to microservice architecture. Instead of building one monolithic application, or a set of them, for your data lake, consider developing the data lake as a suite of independently deployable, small, modular services, each of which serves a unique purpose and communicates through a distinct, lightweight mechanism to achieve a business goal. If you do not have a professional DevOps group or a resource filling this role, establishing that team is critical to the success and day-to-day operations of your data lake architecture.
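To make the idea concrete, here is a minimal sketch of provisioning services in code rather than by hand, using the Docker SDK for Python. It assumes a running Docker daemon and the docker package; the image and service names are hypothetical stand-ins for your own data lake services.

```python
import docker

client = docker.from_env()

# Each small, independently deployable service is defined in code, so rebuilding
# or scaling it is a repeatable operation rather than a manual one.
services = [
    {"name": "ingest-service",  "image": "example/ingest:1.0",  "ports": {"8080/tcp": 8080}},
    {"name": "catalog-service", "image": "example/catalog:1.0", "ports": {"8081/tcp": 8081}},
]

for svc in services:
    client.containers.run(
        svc["image"],
        name=svc["name"],
        ports=svc["ports"],
        detach=True,                              # run as a long-lived background service
        restart_policy={"Name": "on-failure"},    # restart automatically if it crashes
    )
```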

Using RAID for data nodes

Big data technologies like Hadoop stripe blocks of data across multiple nodes running on commodity hardware, and RAID stripes data across multiple disks. This striping ensures fault tolerance and better read/write performance than a single disk or node. RAID striping on DataNodes causes striping at both the application layer and the hardware layer, and this duplication results in a lower-performing solution. Leave the striping to Hadoop. When using multiple disks per DataNode, configure them as just a bunch of disks (JBOD), each individually mounted. Since the NameNode is a single point of failure in Hadoop, RAID is recommended on NameNodes.
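As a minimal sanity check of a JBOD layout, the sketch below verifies that each DataNode data directory (the paths you would list in the HDFS dfs.datanode.data.dir property) sits on its own device. The example paths are hypothetical and must exist on the host for the check to run.

```python
import os

# Hypothetical DataNode data directories, one per physical disk.
data_dirs = ["/data/disk1/hdfs", "/data/disk2/hdfs", "/data/disk3/hdfs"]

devices = {}
for d in data_dirs:
    dev = os.stat(d).st_dev          # device ID of the filesystem holding this path
    devices.setdefault(dev, []).append(d)

for dev, dirs in devices.items():
    if len(dirs) > 1:
        print(f"WARNING: {dirs} share one device; expected one disk per directory (JBOD).")
    else:
        print(f"OK: {dirs[0]} is on its own device.")
```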

Using logical volume management for data nodes

Logical volume management provides many features to system administrators. One of the most common uses is creating a single logical volume from multiple physical disks, or JBOD. Other features include striping, mirroring, and runtime adjustments to volume sizes. Logical volume management is not recommended for DataNodes: the additional layer between the filesystem and the disk degrades performance, and the earlier reasoning for avoiding RAID striping applies here as well. There is no need to grow volumes on the hundreds of DataNodes in your environment when you can simply add a few more DataNodes. The Linux Logical Volume Manager is enabled by default on some Linux distributions. On Windows, consider steering clear of dynamic disks and the Logical Disk Manager.

Using SAN/NAS for data nodes

Do not use a SAN or NAS for Hadoop big data. If your IT organization has not worked with big data before, procuring commodity machines with local disks will feel counter-intuitive to the team. But the team must ditch the decade-old practice of purchasing only diskless blades and virtualized storage connected to a SAN or NAS. Big data software processes extensive data by breaking the workload into smaller tasks and then executing each task independently on a node that contains the data that task needs. Using a SAN or NAS creates a single location where all of these tasks must seek their data; this poorly performing design causes multiple nodes to contend for the same resource and increases network hops. A fundamental best practice in big data is to move the computation to the data. A SAN or NAS puts a network between the computation and the data, diminishing data I/O when high-end performance is necessary.

Containers vs. virtual machines

Neither containers nor virtual machines are a bad practice; the topic is worth noting here because of the performance-oriented nature of this list. Your organization can virtualize the NameNode and much of the rest of the data lake, but bare metal for DataNodes will provide the best performance. Your infrastructure can achieve most of the advantages of virtualization, and more, with Docker containers and DevOps tools. More cloud vendors are now offering bare-metal options.

In conclusion, many people feel the need to choose between a data lake and an enterprise data warehouse (EDW), assuming one is better than the other. In reality, data lakes and warehouses were designed at different times, under different constraints, and for different use cases. A data lake can substitute for the functionality of an EDW while creating new and incremental value, but lakes still have the feature gaps common to emerging technologies. A data lake is a favorable choice when you are starting from a blank slate without a central repository for data intelligence, and your enterprise does not require any of the features a data lake lacks. If you already have an EDW, or are starting fresh but immediately need the mature features found in data warehouses, a polyglot approach might be best.

Data lakes have proven their value over data warehouses. The traditional approach of manually curated data warehouses, providing limited views of data and the ability to answer only specific questions chosen at schema design time, must make room for a new entrant that can store and process more data at a lower cost while understanding that the data intelligence and analysis needs of its users are always changing.

This final post concludes our series on data lakes. If you missed the first four, here are the links:

  1. Defining Data Lakes
  2. Data Lake vs. Data Warehouse
  3. The Value of Data Lakes
  4. Data Warehouse Integration