From Cron to Modern Data Stack (MDS): Dataflow Automation and Its Current State
The concepts that make today's technological miracles possible are defined by data. Enormous amounts of data are collected...
In the beginning, God created the heavens and the earth...which happened to spin off massive amounts of data. Man took that data and started using it to make business decisions. In order to effectively understand the past and make accurate predictions for the future, this data needed to be stored and processed. Thus data architecture came to be.
Allegories aside, we live in an era where data-driven decisions are both requested and required, and the tools to make them are increasingly democratized. From machine learning algorithms and neural nets to a simple dashboard built in Google Sheets, assembling, visualizing and making effective predictions with data is now within the purview of everyone. This access has increased competitive pressure on both the quality of answers and the time it takes to get them. We can now quickly ask and answer questions, from tracking your daily step count to dynamically pricing e-commerce products, that would have required a herculean effort (or been impossible) even a few decades ago.
But underpinning all of these advances and expectations are data architectures and engineering. And the quality and coherence of these architecture patterns will make or break any attempt at advanced analytics.
Making good, well-informed decisions is always the end goal. Performing queries, drawing graphs, training models and making predictions are some ways of moving towards this goal. But for these tools to be truly meaningful and effective in a business context, your whole organization needs to deploy a modern data architecture.
A modern data architecture has reliable, accurate data pipes running all the way from the user fumbling through your website to the decision-maker. Along the way, those pipes add third-party information and services into the mix, support analysis, join disparate data sources, perform enrichments, distribute data to internal servers, and train and deploy machine learning models. Finally, they pull all this data together again so the decision-maker can interact with, visualize, query and explore it in order to make the best choice.
The basic data lifecycle looks like this: specific data is collected, modeled, transformed, and visualized so that it can be placed in an actionable business context.
Figure 1 - High-level overview of data life cycle created by the Blue Orange Sales Team
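To make the four lifecycle stages concrete, here is a minimal Python sketch that chains collect, model, transform and visualize as plain functions. Everything in it is a hypothetical stand-in rather than our actual implementation: the `PageView` schema, the sample events and the function names are assumptions chosen only to illustrate how one stage feeds the next.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    """Modeled record: one typed row per raw event (hypothetical schema)."""
    user_id: str
    page: str
    seconds: float


def collect() -> list[dict]:
    """Collect: stand-in for reading raw events from a website or third-party API."""
    return [
        {"user": "u1", "page": "/pricing", "seconds": "12.5"},
        {"user": "u2", "page": "/pricing", "seconds": "30"},
        {"user": "u1", "page": "/docs", "seconds": "7.5"},
    ]


def model(raw_events: list[dict]) -> list[PageView]:
    """Model: give the raw, stringly-typed events a consistent, typed shape."""
    return [PageView(e["user"], e["page"], float(e["seconds"])) for e in raw_events]


def transform(views: list[PageView]) -> dict[str, float]:
    """Transform: aggregate total time spent per page."""
    totals: dict[str, float] = defaultdict(float)
    for view in views:
        totals[view.page] += view.seconds
    return dict(totals)


def visualize(totals: dict[str, float]) -> str:
    """Visualize: render a tiny text report a decision-maker could read."""
    rows = [f"{page:<10} {seconds:6.1f}s" for page, seconds in sorted(totals.items())]
    return "\n".join(rows)


def run_pipeline() -> str:
    """Run the four stages in order, each consuming the previous stage's output."""
    return visualize(transform(model(collect())))


if __name__ == "__main__":
    print(run_pipeline())
```

In a real architecture each function would be replaced by a separate system (an ingestion service, a warehouse schema, a transformation job, a dashboard), but the shape of the flow stays the same.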
Or, for a prettier and more complicated view, here’s a diagram of how we implement the modern data lake pattern:
Figure 2 - Data lake process created by the Blue Orange designers (the people who would actually build your custom dashboards)
So let’s explore, starting with the traditional approach to organizing business intelligence data.