Business Analytics

A Good Data Lake Starts with a Good Blueprint

Posted on May 16th, 2017 | Steve Wooledge

Big Data is making a lot of promises to the enterprise, but there are many challenges when it comes to building infrastructure that is flexible, scalable and capable of delivering the high level of performance that emerging applications require.

So where to begin? As with any architectural construct, the best place to start is with a good blueprint.

Commonly, the central construct of Big Data is the data lake. More than a repository, more than an analytics engine, the data lake is where raw, unstructured data is turned into actionable intelligence. But that’s not a magic, one-step process. For this reason, your enterprise needs to think very carefully about how the data lake is designed and executed. Failure to start with a well-crafted blueprint can have serious consequences, not only in the short term (such as the quality of your analytics) but in the long term, undermining the overall success of your Big Data strategy. Done right, your data lake architecture can span the gap between raw data and the broad range of end users who rely on it to answer their questions (and question their answers).

Based on our experience working with customers driving business value from Big Data, the data lake is best viewed as a sequence of three operational zones connected by a pair of transition phases. The function of each zone is not just to enhance the value of incoming data assets, but also to build workflows, establish access and security parameters, and systematically increase the data’s exposure to a larger community of business users.

The Landing Zone

Data’s initial contact with the data lake is the Landing Zone. Quite naturally, the Landing Zone must be efficient at ingesting data from external sources, but it also needs the capability to track data provenance, document data producers and provide a place for data to sit while your enterprise figures out what to do with it.
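To make that concrete, here is a minimal sketch of landing a file untouched while recording its provenance. The paths, manifest fields and `land_file` helper are purely illustrative, not a prescribed standard:

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Illustrative landing-zone root; adjust to your environment.
LANDING_ROOT = Path("/data/lake/landing")

def land_file(source_path: str, producer: str, source_system: str) -> Path:
    """Copy a raw file into the Landing Zone untouched and record its provenance."""
    src = Path(source_path)
    now = datetime.now(timezone.utc)

    # Land the data as-is: no modeling, filtering or summarizing at this stage.
    target_dir = LANDING_ROOT / source_system / now.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copy2(src, target)

    # Record where the data came from and who provided it.
    manifest = {
        "file": str(target),
        "producer": producer,
        "source_system": source_system,
        "ingested_at": now.isoformat(),
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
    }
    manifest_path = target.parent / (target.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return target
```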

Just as important is what not to do in the Landing Zone. There should be no modeling at this stage, nor any attempt to gain further understanding beyond where the data came from and who provided it. As such, the Landing Zone should be integrated with ETL and MDM platforms; StreamSets is a good example of a tool for developing ingestion flows and landing your data within the lake. Some visual discovery and analytics might be helpful, along with vertical-specific toolsets for regulatory and compliance purposes. Access should be limited largely to the data team, with heavily monitored role-based access control (RBAC) security firmly in place. Until your enterprise has a firmer handle on what this data is and what value to assign to it, the fewer people touching it the better.

Another ‘don’t’ in the Landing Zone: don’t select, summarize, scrap or suppress the data. Hadoop makes retaining data easier and cheaper than ever, leaving room for those decisions to be made later by others.

From here we enter the first transition. At this stage, the data team can reach out to data producers to more fully understand the source and then implement initial filtering using any number of policy-based mechanisms. At the same time, the data team begins standardizing column and data names, establishing data formats and giving the data other forms of structure. But still, there is no modeling or correlating.
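As a hedged sketch of that first transition, assuming PySpark and Parquet as the standardization layer (the paths and the created_at column are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("landing-to-curated").getOrCreate()

# Read the raw CSV exactly as it was landed; we are standardizing names and
# formats here, not modeling or correlating.
raw = spark.read.option("header", True).csv("/data/lake/landing/crm/2017/05/16/")

# Standardize column names to lower_snake_case without dropping anything.
standardized = raw.select(
    [F.col(c).alias(c.strip().lower().replace(" ", "_")) for c in raw.columns]
)

# Normalize obvious formats, e.g. parse timestamps into a common type.
# ("created_at" is an illustrative column name.)
if "created_at" in standardized.columns:
    standardized = standardized.withColumn("created_at", F.to_timestamp("created_at"))

# Write to a columnar format in the Curated Zone, still un-modeled.
standardized.write.mode("overwrite").parquet("/data/lake/curated/crm/contacts/")
```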

The Curated Zone

Then it’s on to the Curated Zone. This is where limited access is granted to others in the data value chain, such as use-case experts and related systems-level personnel. This broader access will still require RBAC security.

Exploration and data discovery are key functions at this point. Advanced analytics tools like Spark and R are a great way to look for relationships and patterns, and perform hypothesis testing. Data wrangling tools such as Trifacta or Paxata help with data transformation and cleansing. Visual analytics tools such as Arcadia Data are critical because the Curated Zone is where data-driven applications are built and tested. And since data applications are intended for use elsewhere in your enterprise, they will need to be field-tested and qualified for full production environments.
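For example, a quick exploratory pass in Spark might look like the following; the dataset, column names and metrics are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curated-exploration").getOrCreate()
contacts = spark.read.parquet("/data/lake/curated/crm/contacts/")

# Look for simple relationships before building a data application.
# (Columns are illustrative and assumed numeric.)
print(contacts.stat.corr("order_value", "minutes_on_site"))

# Quick pattern check: which segments drive volume and value?
(contacts
    .groupBy("region", "segment")
    .agg(F.count("*").alias("rows"), F.avg("order_value").alias("avg_order"))
    .orderBy(F.desc("rows"))
    .show(20))
```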

The next transition is all about moving these data applications downstream into the final Production Zone. This can be done at the user’s request, or once the data team decides the time is right based on usage levels and other metrics. The transition requires specialized toolsets to model the final stages of the application, create OLAP cubes and other derived data forms, and apply all the finishing touches the app needs for widespread consumption.
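One way to picture that step, again assuming Spark (the rollup dimensions and paths are illustrative): pre-compute the derived, cube-style aggregates that the production application will serve.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curated-to-production").getOrCreate()
contacts = spark.read.parquet("/data/lake/curated/crm/contacts/")

# Pre-compute an OLAP-style rollup (every combination of region x segment x month)
# so the production application serves aggregates instead of scanning raw detail.
rollup = (
    contacts
    .withColumn("month", F.date_format("created_at", "yyyy-MM"))
    .cube("region", "segment", "month")
    .agg(F.count("*").alias("rows"), F.sum("order_value").alias("total_order_value"))
)

rollup.write.mode("overwrite").parquet("/data/lake/production/crm/contact_cube/")
```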

In all likelihood, this Curated Zone is where today’s Database Administrators will operate, except now they will be known as Data Application Administrators or something similar. Their primary tasks will be to ensure security, performance, modeling, end-user authentication and workload management, essentially serving as the ‘last mile’ data professional to touch the now-refined data before it is released for general consumption.

The Production Zone

Finally, the data can be made available to a large number of internal and external users. Of course, access to the Production Zone is still limited to the applications and data that individual users are authorized to see. Application-level security becomes paramount here, so leverage tools that provide the flexibility to control visibility based on roles and policies within the Big Data platform, without redefining them in multiple applications (or locations).
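As a toy illustration of defining visibility rules once and applying them consistently (the roles, columns and data here are hypothetical; in a real deployment these policies would live in the platform’s security layer, such as Apache Ranger, rather than in application code):

```python
# Hypothetical central policy definitions: visibility rules are written once and
# every data application applies them the same way. In practice they would live
# in the platform's security layer, not in application code.
ROW_POLICIES = {
    "regional_analyst": lambda row: row["region"] == "EMEA",
    "executive": lambda row: True,
}
MASKED_COLUMNS = {
    "regional_analyst": {"contact_email"},
    "executive": set(),
}

def apply_policies(rows, role):
    """Filter rows and mask columns according to the user's role."""
    visible = [r for r in rows if ROW_POLICIES[role](r)]
    masked = MASKED_COLUMNS[role]
    return [{k: ("***" if k in masked else v) for k, v in r.items()} for r in visible]

# The same dataset yields two different views depending on the role.
data = [
    {"region": "EMEA", "contact_email": "a@example.com", "order_value": 120},
    {"region": "APAC", "contact_email": "b@example.com", "order_value": 90},
]
print(apply_policies(data, "regional_analyst"))
print(apply_policies(data, "executive"))
```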

Since most users will be accessing the Production Zone on a daily basis, perhaps even multiple times per day, it must be architected around ease of access, speed, interactivity and reliability. Web-based access to applications that integrate with your enterprise security (e.g., AD/LDAP or SAML) is ideal. Removing barriers and making it easier for more people to access the refined datasets and data applications will, in turn, increase the opportunities for reaping the benefits (i.e., ROI) of your Big Data architecture.

Properly designed, the data lake serves much the same purpose as a natural or man-made lake: holding back the flood waters so the local environment can flourish. With wave upon wave of sensor- and device-driven data about to hit your enterprise, the only way to keep from being washed away is to hold back the torrent intelligently enough to convert it into a nice, steady stream.

With a well-crafted blueprint in hand, your enterprise will not only be able to establish a functional and scalable data lake quickly, but a more effective one as well.