Data Catalogs – Lift Your Data Fog

Posted on August 20th, 2018 | Cynthia Crossland

Everyone’s heard of cloud computing. But for those of us from San Francisco, where much of our lives are spent in low clouds, it’s fitting when we find much of our data is lost in the fog. Imagine yourself as a business analyst. From afar, much of your organization’s data is obscured, hidden by such a dense fog that you either don’t know it is there or don’t know how to use it. Yet the data appears clear when you are in close proximity to it or when it is so fresh that you know exactly what it contains. Let’s look at proximity and freshness a little more closely:

Proximity: You can identify valuable data that falls within your area of expertise, but the more distant the data is from your basic sphere of understanding, the less you understand the data and its value.

Freshness: The more time you spend studying a specific project’s data, the easier it is for you to understand the data’s value. Unfortunately, the fog lifts only temporarily. Once the project is completed, your memory begins to fade. In the meantime, the data itself begins to change and the fog returns.

The challenge is that proximity and freshness only work for a very small amount of data. Meanwhile, the variety, amount, and speed of data continue to grow. Trying to visualize all of the combined data becomes daunting. As one of Waterline Data’s customers recently said, “We have 100 million fields of data. How can anyone find anything?”

As Waterline Data’s CTO Alex Gorelik says, “It’s like looking for a specific book you want at a flea market. You can look all you want, but it’s going to take a long time to find it.” Alex adds, “You can invest all you want in faster data processing, faster analytics, and faster response times. But if your organization can’t discover, understand, and utilize your data fast enough—if you can’t quickly convert all that unstructured data into actionable business intelligence—you will have a tough time serving your customers, let alone competing with those who can efficiently capitalize on their data in today’s knowledge economy.”

Lifting the Fog for Good

The solution is for companies to lift the fog and keep it lifted, allowing business users to more readily find critical data on an ongoing basis. Fog lifts when the sun burns it off, and a data catalog can be your sun. It burns the fog off your data and allows you to see what was previously obscured.

The challenge is that many companies search for only the data they can see instead of stepping back for a comprehensive view. You may not know what data is even available because you can’t easily see it. This situation is compounded by the data security dilemma: you can’t access data without justifying why you need it, but how can you know if you need it if you don’t have access? This is where tribal knowledge usually comes in, but its results are spotty. The fact is people forget, leave the company, or make mistakes.

Waterline Data’s solution to the tribal knowledge problem is tagging. Adding automated tagging to your regular project workflow kickstarts the tagging process. We also emphasize the value of having subject matter experts (SMEs) curate the automated tagging results. In addition, SMEs can add ratings and annotations, which, along with automatically discovered lineage, provide a clear (and fog-free) understanding of data quality.

The SME’s review of the automated results creates a feedback loop that continually improves the accuracy of the automation through machine learning, but total accuracy is never assumed. Each automated tag is accompanied by a confidence percentage. The closer that figure is to 100%, the greater the likelihood that the tag is accurate. But the human element is always part of the equation. Data stewards, SMEs, or analysts can accept or reject a tag at any time. This combination of data set properties (data quality and lineage), governance (curation and stewardship), and social validation (ratings, notes, and reputation) establishes trust in your data classifications, which supports tighter control over accessing and provisioning your data.
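To make the idea concrete, here is a minimal sketch of a tag that carries an automated confidence score and can be accepted or rejected by a person. It is an illustration only, not Waterline Data’s implementation: the `Tag` class, its field names, the sample data, and the 0.9 threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tag:
    """Hypothetical automated tag with a confidence score and optional human review."""
    field_name: str                   # field the tag was applied to
    label: str                        # proposed classification
    confidence: float                 # automated confidence, 0.0 to 1.0
    accepted: Optional[bool] = None   # None until a person reviews the tag

    def review(self, accept: bool) -> None:
        """Record a steward's or SME's decision on this tag."""
        self.accepted = accept

    def is_trusted(self, threshold: float = 0.9) -> bool:
        """A human decision always wins; otherwise fall back to the score."""
        if self.accepted is not None:
            return self.accepted
        return self.confidence >= threshold


# Made-up sample tags, for illustration only.
tags = [
    Tag("cust_email", "email_address", 0.97),
    Tag("col_17", "postal_code", 0.62),
    Tag("ssn_hash", "phone_number", 0.95),
]

tags[1].review(accept=True)    # SME confirms a low-confidence tag
tags[2].review(accept=False)   # SME rejects a confident but wrong tag

trusted = [t.field_name for t in tags if t.is_trusted()]
print(trusted)  # ['cust_email', 'col_17']
```

The point of the sketch is the ordering of the checks: the confidence score is only a default, and a reviewer’s decision overrides it in either direction.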

Taking on the Data Governance Challenge

Currently, there are thousands of analysts, researchers, and data scientists requiring access to data for their jobs. But finding, provisioning, and governing that data is not only difficult, it’s very expensive. The difficulty and expense stem from the time users waste trying to navigate the convoluted paths between themselves and the data they need. Even after they’ve found it, there’s never a guarantee that users are getting access to the “right” data. Organizations can’t just blindly grant access to every employee. They must protect sensitive business information and personally identifiable data. So how does the organization protect the information while granting access to just those who need it?

Some organizations employ a top-down approach. The admin finds the data and removes any personally identifiable information before providing access to any user. For other companies, there’s the agile/self-provisioning approach in which a catalog is built and populated only with the metadata. Then, the data is “de-identified” and provisioned for each user’s request.

The problem with both approaches is the impracticality of manually finding, de-identifying, and authorizing access to data as it’s requested. No one wants to wait three months for someone to build a business glossary that maps back to the physical data. Everyone wants their data now. Yet data is trapped, sitting in silos because nobody has the time to evaluate its level of sensitivity and access rights.

Our approach quickly connects the right people to the right data by replacing the manual tagging of metadata with an automated process that quickly classifies all your data assets, including new data, while determining data lineage. After you understand what information is in your data set and where it came from, you can quickly and automatically make it available to authorized users.
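As a rough illustration of how classification tags could drive provisioning, the sketch below masks fields whose tags mark them as sensitive unless the requesting user is authorized. Everything in it is hypothetical: the `provision` function, the `SENSITIVE_LABELS` set, the label names, and the sample record are assumptions for this example and do not describe any real product API.

```python
# Assumed set of labels treated as sensitive, for illustration only.
SENSITIVE_LABELS = {"email_address", "national_id"}


def provision(record: dict, field_tags: dict, user_is_authorized: bool) -> dict:
    """Return a copy of the record with sensitive fields masked for unauthorized users."""
    result = {}
    for name, value in record.items():
        label = field_tags.get(name)  # None when the field has no tag
        if label in SENSITIVE_LABELS and not user_is_authorized:
            result[name] = "***"      # de-identify by masking the value
        else:
            result[name] = value
    return result


# Made-up record and tags.
record = {"cust_email": "a@example.com", "region": "West", "order_total": 42.5}
field_tags = {"cust_email": "email_address", "region": "sales_region"}

analyst_view = provision(record, field_tags, user_is_authorized=False)
steward_view = provision(record, field_tags, user_is_authorized=True)

print(analyst_view)  # {'cust_email': '***', 'region': 'West', 'order_total': 42.5}
print(steward_view)  # {'cust_email': 'a@example.com', 'region': 'West', 'order_total': 42.5}
```

Note that this sketch lets untagged fields through unchanged. A stricter policy would mask anything that has not yet been classified, which is one reason classifying every data asset matters.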

Intelligent Data Catalogs can remove the fog from an organization’s data and deliver the right data to the right people while controlling its access, providing greater efficiencies and business benefits.
