In this four-part series, we’ll explore the data lake ecosystem—its various components, supporting technologies, and how to best outfit your lake for success. In our first post, we explain how to instill governance on a growing data lake, with added insight from our partner Waterline Data.
As the variety of data that organizations want to leverage increases, many are creating data lakes: places to store data with a variety of structures and purposes. I recently co-authored a white paper, “Trifacta Data Wrangling for Hadoop,” that examines the typical components involved, as well as the importance of data preparation in supporting a data lake strategy. In this post, I’ll dive deeper into one particular component: data governance.
Growing Pains & Governance
While data preparation is critical to empowering as many users as possible with data, it has a side effect. The lake will grow (this is what you want!), but the more data that goes into it, the harder it becomes to find, share, and trust the right data. You’ll need to protect the data lake from becoming a mess of new data silos, even if they are collocated in one physical place, since breaking down silos was most likely the primary reason you decided to create a lake in the first place. So, how can you prevent data proliferation and silos?
A recent paper from 451 Research, “Sink or swim? Governance and Data Preparation are Key to a Functional Data Lake,” underscores this risk, stating: “Enterprises should seriously consider the data governance and management requirements before embarking on data lake projects to ensure that the functionality is available to turn the concept into reality.”
The Data Catalog
Finding the right data in a lake of millions of files is like finding one specific needle in a stack of needles. With a data catalog, however, a business analyst or data scientist can quickly zero in on the data they need without asking around, browsing through raw data, or waiting for IT to hand it over.
But the question is: How does one create a data catalog, given the volume of data in the lake and how often it changes? It’s not practical to manually explore and tag every file and field in the data lake. Since a data lake doesn’t require a data model, it’s very easy to get data in, but the flip side is that the business metadata still needs to be defined for the data to be understood and useful to business users. The challenge is that this has to happen as the data lands in the lake, which calls for automation so the business can start using the lake within minutes or hours, not months.
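To make the idea of automated tagging concrete, here is a minimal sketch in Python. It is not how Waterline Data or any other product works; the tag names, the patterns, the 80% threshold, and the `suggest_tags` and `catalog_file` helpers are all hypothetical and chosen only for illustration. It samples the values in each field and suggests a business tag when most of them match a known pattern.

```python
import re

# Hypothetical tag rules (assumptions for illustration only):
# tag name -> pattern that a value must fully match.
TAG_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def suggest_tags(values, threshold=0.8):
    """Return tags whose pattern matches at least `threshold` of the non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in TAG_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return tags


def catalog_file(columns):
    """Build a catalog entry mapping each field name to its suggested tags."""
    return {name: suggest_tags(values) for name, values in columns.items()}


# Sampled values from a newly landed file (made-up data).
sample = {
    "contact": ["ann@example.com", "bob@example.org", "", "cy@example.net"],
    "postal": ["94105", "10001-1234", "60601"],
    "notes": ["call back", "n/a", "2015-01-01"],
}
entry = catalog_file(sample)
print(entry)
```

A real catalog would go well beyond regular expressions, for example by profiling value distributions and learning from tags that users confirm or reject, but the shape is the same: tags are suggested as files land rather than assigned by hand.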
Waterline Data was founded to bring a sense of order to data lakes without restricting how they evolve. Waterline Data’s technology automatically creates a rich catalog of the data, including data lineage, and presents it in a self-service data catalog interface, so that anyone who needs the data can find it quickly and easily. What makes Waterline Data unique is not only the automated tagging, but also that the catalog can be augmented with tribal knowledge about the purpose and value of the data. Lastly, data governance policies can be applied to protect sensitive data, manage access, and increase trust in data quality and validity.
Waterline + Trifacta
The data catalog and data preparation go hand-in-hand, which means a Waterline Data and Trifacta partnership delivers even greater customer value. If you think about data discovery and data preparation as an assembly line, the data catalog makes sure all the right parts are ready to be assembled, while the finished product can also be put back in the catalog for re-use.
The Bottom Line
Creating a data lake is an exciting opportunity for your organization, but it’s important to ensure that it starts off on the right foot. Consider how you can leverage data preparation and an automated data inventory to empower users while still maintaining governance.
Stay tuned for our next post, where we’ll explore data ingestion and how its quality can impact your data lake. Also, to try data prep or the data catalog for free, you can download Trifacta and Waterline Data here.
photo credit: Al_HikesAZ Water Lilies – Balboa Park Botanical Garden via photopin (license)