If you feel like you’re “drowning” in jargon and buzzwords surrounding the recent developments in data lakes and their ilk, you are not alone. A recent TDWI survey showed rapidly increasing adoption of data lakes as a source for big data analytics, though it also revealed barriers to success and confusion around the value of implementation. Much of this confusion stems from myths and misperceptions around the technical and business uses of a data lake. This article examines the proper use of a data lake, and how sound governance can prevent it from becoming the dreaded data swamp.
To be clear, a data lake is not a data management platform, in that it is not an integrated, centrally-managed source for enterprise data. Rather, it is a storage repository that holds a vast amount of raw data in its native format. Most importantly, the structure and requirements of this data are not defined until the data is actually used. This is the difference between “schema-on-write” (common to most traditional data warehouses, where structure is imposed as data is loaded) and “schema-on-read” (where structure is applied only when the data is queried, typically using separate compute resources). Schema-on-read allows the untransformed source data to be quickly and cheaply loaded for immediate availability. There is a perception that “data in its raw form” means unorganized, but an unorganized data lake quickly becomes a swamp . . . a quagmire of disparate data sources that no one wants to wade into.
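The schema-on-read idea can be sketched in a few lines of Python (a simplified illustration; the records and field names here are hypothetical): the same raw data lands untouched, and each consumer imposes its own structure only at read time.

```python
import json

# Raw, untransformed records land in the lake as-is. Schema-on-write would
# reject or reshape them at load time; schema-on-read defers that decision.
raw_records = [
    '{"id": 1, "name": "Acme", "region": "EU", "revenue": 1200}',
    '{"id": 2, "name": "Globex", "clicks": 57}',  # different shape -- still accepted
]

def read_with_schema(lines, fields):
    """Apply a caller-chosen schema only at the moment the data is read."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers impose two different schemas on the same raw data.
sales_view = list(read_with_schema(raw_records, ["id", "name", "revenue"]))
web_view = list(read_with_schema(raw_records, ["id", "clicks"]))
```

Note that neither "view" requires reloading or transforming the stored records; absent fields simply come back empty, which is exactly the flexibility (and the governance risk) of the lake.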
The physical storage of data in a lake should still have some level of organization and, just as importantly, be cataloged with a rich layer of metadata. A common practice is to create “buckets” or top-level folders, segregated by source, to quickly identify the nature and potential usability of the data. These can be further subdivided into logical groupings stored as files. A metadata catalog around these files should be created and maintained, and it can largely be crowdsourced; business users and data analysts within an organization are often best placed to identify the nature of the lake’s data.
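A minimal sketch of such a crowdsourced catalog, in Python, might look like the following (the bucket paths, source names, and contributor labels are all hypothetical, not a prescribed standard):

```python
# A lightweight, crowdsourced metadata catalog: each entry records what a
# lake file contains and who identified it. All names below are illustrative.
catalog = {}

def register(path, source, description, contributed_by):
    """Record the nature of a file in the lake and who described it."""
    catalog[path] = {
        "source": source,
        "description": description,
        "contributed_by": contributed_by,
    }

# Top-level "buckets" segregated by source system.
register("crm/contacts_2024.json", "CRM export",
         "Customer contact records, raw JSON", "data_analyst_1")
register("weblogs/clickstream_raw/", "Web servers",
         "Unparsed clickstream logs", "marketing_user")

def find(keyword):
    """Let users locate candidate data sets by searching descriptions."""
    return [path for path, meta in catalog.items()
            if keyword.lower() in meta["description"].lower()]
```

Even a registry this simple keeps the lake searchable; without it, the folder of raw files is exactly the swamp the article warns against.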
The data lake evolved as an on-premises, file-based repository built largely around Hadoop. Specialized skills are needed to utilize this data (such as proficiency in MapReduce or R), which has been a barrier to entry for non-technical users. The cloud is changing this . . . “designed-for-the-cloud” data architecture solutions such as Snowflake and Matillion enable fast time-to-value and instant scalability to create a “data lake in the sky” that can be accessed through common SQL expressions. Snowflake uses the VARIANT datatype to store semi-structured data, which can be quickly transformed by Matillion into a row-column structure for easy consumption.
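Snowflake and Matillion perform this transformation in SQL and ELT pipelines; the Python sketch below only illustrates the shape of the operation, flattening a nested, VARIANT-style document into row-column records (the order document and its fields are hypothetical):

```python
# A nested, semi-structured document of the kind a VARIANT column might hold.
# The structure here is invented purely for illustration.
order_doc = {
    "order_id": 42,
    "customer": "Acme",
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

def flatten(doc):
    """Explode the nested item array into one flat row per item."""
    for item in doc["items"]:
        yield {
            "order_id": doc["order_id"],
            "customer": doc["customer"],
            "sku": item["sku"],
            "qty": item["qty"],
        }

rows = list(flatten(order_doc))
```

The output is the familiar row-column shape that reporting tools expect: one row per line item, with the parent order's attributes repeated on each row.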
Data lakes do not eliminate the need for data warehouses, i.e., structured, curated data stores optimized for analytics and reporting. In fact, lakes are well-suited as staging areas for DWs, particularly in a cloud environment where storage and compute resources are separated, and native ELT components are readily available. But users may (and should) query, mine and explore a well-organized data lake directly. An optimal way to provide a structured environment for non-technical data lake users is virtualization. Creating a data virtualization “layer” using a tool like Denodo sets up a level of structure for interacting with the lake without altering the original data. Through this approach, virtual “tables”, enriched with relationships and metadata, provide a gateway for non-technical users to get value from the data lake using conventional querying tools.
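Denodo builds virtual tables across many heterogeneous sources; the self-contained sqlite3 sketch below (with hypothetical table and column names) only demonstrates the core idea: a view that presents structured columns to the query tool while the raw, stored data remains untouched.

```python
import json
import sqlite3

# Raw documents land in the "lake" unchanged; a view supplies the structure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lake_raw (doc TEXT)")  # raw data, as landed
conn.executemany("INSERT INTO lake_raw VALUES (?)", [
    (json.dumps({"id": 1, "name": "Acme", "revenue": 1200}),),
    (json.dumps({"id": 2, "name": "Globex", "revenue": 800}),),
])

# The "virtualization layer": a view exposing row-column structure
# without modifying or copying the underlying documents.
conn.execute("""
    CREATE VIEW customers AS
    SELECT json_extract(doc, '$.id')      AS id,
           json_extract(doc, '$.name')    AS name,
           json_extract(doc, '$.revenue') AS revenue
    FROM lake_raw
""")

# Non-technical users query the view with conventional SQL.
result = conn.execute(
    "SELECT name, revenue FROM customers ORDER BY revenue DESC"
).fetchall()
```

The design point mirrors the article's argument: the view can be dropped, renamed, or enriched at any time, and nothing in `lake_raw` ever changes.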
The data lake is rapidly evolving, particularly as it moves to cloud-hosted data platforms and broadly-appealing consumption models like virtualization. Coupled with data warehouses (“water treatment plants”) and data marts (“bottled water”), a “data wetlands” ecosystem can provide a complete usage model for all enterprise data.