Data Catalog

There will continue to be a natural conflict between big data and business self-service

Posted on July 17th, 2017 | Cynthia Crossland

Two trends in big data have been tugging at each other particularly hard for the past several years, and frankly, I find it amazing that more people aren’t talking about how they are almost fundamentally opposed to one another. On one hand, there is the push for big data, and the general sentiment is more, more, more: more data, at a faster pace, and with greater variety. The thinking is that if I get more data, I can find new and bigger insights that will change my company, or perhaps the world, for the better.

On the other hand, there is the push for self-service. If I am a business analyst, I don’t want to be bothered with extra processes, such as going to IT to get access to data. Just let me at it! With some of the awesome new data analytics and visualization tools, plus the advent of Hadoop and schema-on-read, I don’t need IT to build me a nice star schema to do my job. Just get me the data so I can make use of it immediately, before it potentially loses value.

The problem is that as data gets bigger, it also gets messier. More data means more mess, which makes sorting the good data from the bad even more of a challenge. Also, with data governance regulations increasing and proper access controls being implemented, business professionals who want to take advantage of self-service are often blocked from data that has yet to be cleared for use. So they either don’t get access to the data they need or, if they do have access, they end up spending more time sorting through that data and formatting it for analysis, and less time actually using it to answer questions and make decisions.

In the end, the tension between big data and self-service has generated a lot of industry buzz, but too many projects buckle under it and fail to unlock the potential value of big data. So here is a little food for thought: if you build it, they won’t come (see Field of Dreams). If you want to enable self-service, you can’t just create a data swamp. Key considerations include:

  1. Give users an easy way to search for and find the data they are looking for. When they get search results, those results should include objective profiling information about the quality and provenance of the data, along with ratings and reviews from other people who have used it. Don’t forget, you want both the objective and the subjective information about data.
  2. Automated data governance processes are important. Manual governance of so much data is too slow. You need ways to automate the tagging and categorization of data on an ongoing basis, or your data lake will turn back into a swamp.
  3. Access control is critical. Most organizations can’t just let all users have access to all the data. In regulated industries, that will result in hefty fines. For instance, the upcoming General Data Protection Regulation (GDPR) will impose fines of up to 4% of worldwide annual revenue for violations. And just as with automated data governance, you need to automate this process so that new data coming in doesn’t sit in quarantine for months while you wait for someone to check that it is not sensitive.
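
To make these three considerations a little more concrete, here is a minimal sketch in Python of how automated tagging, search, and access control might fit together. Everything in it is an assumption made for illustration: the `CatalogEntry` class, the `tag_dataset`, `search`, and `can_access` functions, the two regular-expression patterns, and the 80% match threshold are all hypothetical and do not describe any particular product. Real sensitive-data detection would need to be far more thorough than two patterns.

```python
import re
from dataclasses import dataclass, field

# Hypothetical detection patterns, for illustration only.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


@dataclass
class CatalogEntry:
    """One dataset in the catalog: objective tags plus subjective ratings."""
    name: str
    tags: set = field(default_factory=set)      # filled in automatically
    ratings: list = field(default_factory=list)  # added by users over time


def tag_dataset(name, columns, threshold=0.8):
    """Tag a dataset when most sampled values in any column match a pattern.

    `columns` maps a column name to a list of sampled values.
    """
    entry = CatalogEntry(name)
    for values in columns.values():
        if not values:
            continue
        for tag, pattern in SENSITIVE_PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(str(v)))
            if hits / len(values) >= threshold:
                entry.tags.add(tag)
    return entry


def search(catalog, term):
    """Return entries whose name contains the search term (case-insensitive)."""
    return [e for e in catalog if term.lower() in e.name.lower()]


def can_access(user_clearances, entry):
    """Allow access only if the user is cleared for every tag on the dataset."""
    return entry.tags <= set(user_clearances)


catalog = [
    tag_dataset("customers", {"contact": ["a@b.com", "c@d.org"],
                              "city": ["Paris", "Rome"]}),
    tag_dataset("weather", {"temp": [20, 21]}),
]
```

Because the tagging runs as soon as a dataset is sampled, untagged data such as `weather` is available immediately, while `customers` is restricted to users cleared for the `email` tag, with no manual review step in between.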

Data self-service is a great idea, but it won’t be free. With all of the new big data coming your way, you will need to think through how to enable self-service in a way that is useful, governable, and secure, while maintaining the level of agility you are striving for.

Learn more about how a data catalog from Waterline can help resolve this tension.