From Cron to Modern Data Stack (MDS): Dataflow Automation and Its Current State

Josh Miramant

Posted On:
May 26, 2022

The technological miracles of today are built on data. Enormous amounts of data are collected and processed daily so that businesses and organizations can provide services to their customers, and without large-scale automation regulating and speeding up that processing, we wouldn’t be where we are.

The tools and technologies used to interpret and manage data have been changed, updated, and renamed over the years, but their function remains the same. Concepts that sound old-fashioned now, such as “scheduler”, “workflow”, or “job”, are still part of dataflow automation technology, just expressed in new terms.

But where did it all start? What were the very first steps down the dataflow automation path, and where are we today? In this article, we’ll go back in time, tracing the early days of cron and fast-forwarding through its progression towards the Modern Data Stack (MDS).

The Early Days of Cron 

Cron executes commands at specified dates and times.

This was the description of cron’s functionality back in 1974, in one of the Unix manual pages of the time. Cron functions as a scheduler, and it was introduced as a command-line utility in Version 6 Unix, one of the earliest versions of the operating system whose descendants still run on many of today’s computers.

Through a set of terse expressions that are not always easy to remember (even engineers keep cheat sheets handy), you can instruct your computer to run a specific command on a repeating schedule. The syntax is deliberately limited: a schedule is just five space-separated fields, followed by the command to execute.

Cron has you covered if you’re looking to automate a single action that repeats at a regular interval with a single short script. For instance, using this command you can tell your computer to wish you a happy new year on the first day of January, at midnight:

0 0 1 1 * echo happy new year 
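
For reference, the five fields read, left to right: minute, hour, day of month, month, and day of week, and an asterisk means “any value”. Annotated with comments, the same entry looks like this:

# ┌───────── minute (0-59)
# │ ┌─────── hour (0-23)
# │ │ ┌───── day of month (1-31)
# │ │ │ ┌─── month (1-12)
# │ │ │ │ ┌─ day of week (0-6, Sunday = 0)
# │ │ │ │ │
  0 0 1 1 * echo happy new year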



Relational Databases Mark Their Start

While engineers were happily using cron, another invention was on its way. Oracle is the most prominent example of what became known as relational databases, which brought with them more sophisticated ways of scheduling and automating computing work. The product went through many updates, but one of the major ones was the introduction of job queues (1995), which were described as:

Release 7.2 provides a new feature, job queues, that enables you to schedule periodic execution of PL/SQL code.

Relational databases included cron’s simple scheduling functions and more. If cron helped you set single reminders, job queues let you manage an entire to-do list, with the ability to change reminders and their order.

  • DBMS_JOB let users keep a record of the jobs to be executed and their order, monitor them, and mark them as “broken” if they failed to run.
  • DBMS_SCHEDULER later replaced DBMS_JOB with more advanced capabilities (a minimal sketch of its use follows below).
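
As a rough illustration of what scheduling inside the database looks like, here is a minimal PL/SQL sketch using DBMS_SCHEDULER. The job name, the target table, and the schedule are hypothetical, and the exact options vary across Oracle versions:

BEGIN
  -- Create a job that runs a small PL/SQL block every night at midnight.
  -- Job name, table name, and schedule are illustrative only.
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'nightly_cleanup',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN DELETE FROM staging_events WHERE loaded_at < SYSDATE - 7; END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=0; BYMINUTE=0',
    enabled         => TRUE);
END;
/

Unlike a crontab entry, the job lives inside the database, so its status and failures can be monitored from the same system.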

Data Warehouses and Data Integration Soar in Use

Oracle and other database providers were solving numerous problems for businesses, but they were also creating room for a new one: different applications had separate databases. All of these databases needed to be brought together into a single data warehouse, and this was done through data integration.

What this process actually entailed was the extraction, transformation, and loading of data (ETL) from the source systems to the data warehouse. The transformation step was necessary because the application databases and the data warehouse structured data in different schemas, so the reshaping had to happen inside the integration tools.

One of the earliest integration tool vendors is Informatica, founded in 1993, which became prominent with the release of its PowerCenter product five years later. PowerCenter, which is still used today, can manage jobs, their running order, metadata, and other details. From Informatica to Microsoft and Oracle, a range of data integration tools operate similarly today.

Data Grew into…Big Data

With the release of Hadoop, an open-source implementation of ideas first published by Google, a new way of storing and processing data was introduced to the market. This approach, which became known as the data lake in 2011, was based on storing data across many computers, in separate files. Even though this made processing more complex, storing data became cheaper and simpler.

But again, engineers were working on the next big thing: workflow orchestrators. These orchestrators were created to work with tools like Hadoop, and they solved the adoption problems that were surfacing in the industry. Earlier tools like Informatica were expensive and often led to vendor lock-in, whereas the new workflow orchestrators were open source and free, and users could easily modify and extend their capabilities.



Marching Towards the Modern Data Stack

But did it all stop there? Not at all. Things got more exciting with the development of cloud data warehouses starting around 2012. Companies like Amazon and Google made their contributions to the industry, and we saw:

  • The announcement of Amazon Redshift by Amazon Web Services
  • The release of BigQuery by Google Cloud Platform
  • The founding of Snowflake, the cloud data warehousing company

Cloud data warehouses made storing and processing data much easier. Data no longer had to be processed by a third-party tool before being loaded into the warehouse, since the warehouse could transform the data itself. This marked a shift from extract, transform, and load (ETL) to extract, load, and transform (ELT).
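
As a rough sketch of the ELT pattern, raw records are loaded as-is and then reshaped with plain SQL inside the warehouse. The table and column names below are hypothetical, and the exact loading syntax differs between Redshift, BigQuery, and Snowflake:

-- 1. Load: copy raw, unmodified records into a staging table
--    (the COPY syntax, file location, and credentials vary by warehouse).
COPY raw_orders FROM 's3://example-bucket/orders/';

-- 2. Transform: reshape the data with SQL inside the warehouse itself.
CREATE TABLE orders_clean AS
SELECT
  order_id,
  CAST(order_ts AS TIMESTAMP) AS ordered_at,
  UPPER(country_code)         AS country_code,
  amount_cents / 100.0        AS amount
FROM raw_orders
WHERE amount_cents IS NOT NULL;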

Later on, between 2013 and 2014, we witnessed the emergence of new technologies for processing distributed data, such as Spark 1.0 (commercialized by Databricks) and Kafka 0.8 (commercialized by Confluent). These tools cut processing times so dramatically that they made real-time processing, also known as streaming, possible.

Together, these tools began to be known as the technologies of the Modern Data Stack, which Andreessen Horowitz’s Emerging Architectures for Modern Data Infrastructure would define as:

a collection of tools and technologies that primarily deliver value over an API, and that are often cloud-based.

Final Thoughts

Just because a new technology enters the market doesn’t mean the previous ones are completely left behind. In most cases, new developments work alongside existing technologies. In an average company, it’s not surprising to find cron-scheduled scripts, data integration workflows, and stream processing all in one place.

Modern dataflows are inclusive and versatile, and they need to work with more than one technology. The modern data stack keeps expanding as new tools for analyzing and processing data are added. At Blue Orange Digital, we have worked with clients across industries and helped them integrate tools like dbt, Snowflake, and AWS SageMaker, among others, to boost their revenue by optimizing their use of data. Schedule a free 15-minute consultation with us here to learn how we can help your business too.