From Cron to Modern Data Stack (MDS): Dataflow Automation and Its Current State
Today's technological miracles are made possible by data. Enormous amounts of data are collected...
Data is complicated. (That’s why it’s called ‘data science’.) Managing its use and flow remains challenging, and adding data pipeline management only increases the complexity. Organizations continuously move data from one system to another, and it is no surprise that the meaning of individual values or tables in your data warehouse can be lost along the way.
Keeping track of who changed what avoids the mess caused when processes such as refactoring overlap with each other and break pipelines. With dbt (data build tool), your teams can take full ownership of the transformation step in ETL pipelines and seamlessly test, version-control, and track their data transformations.
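To make this concrete, here is a minimal sketch of a dbt model; the model, source, and column names are hypothetical:

```sql
-- models/staging/stg_orders.sql (hypothetical model name)
-- A dbt model is just a SELECT statement; dbt materializes it as a
-- view or table and tracks it in version control like any other file.
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    order_total,
    created_at
from {{ source('shop', 'orders') }}  -- source declared in a schema .yml file
where order_total is not null
```

Because every transformation lives in a file like this, changes show up in code review, and tests can assert, for example, that `order_id` is unique and not null.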
dbt works wonderfully for batch data transformations, but transforming streaming data in real time is a different beast, because batching the data at every step is unsuitable. Analytics engineers have long struggled to access streaming data: they had to either juggle multiple notebooks or write Scala. Not anymore!
Materialize, the first SQL-based platform that lets users work with streaming data, has developed an adapter for dbt. As I’m writing this, the adapter is still in beta, but it will probably be ready for production use cases very soon.
Finally, analysts can use Materialize as a data warehouse to transform real-time streaming data, build SQL pipelines, and take charge of streaming analytics as both users and creators. But what led to this integration?
Most analytics production use cases can be handled easily with batch-based tooling. Real-time use cases, however, are difficult to maintain, especially as they scale: growing volumes call for tools built for streaming.
Real-time use cases may be rare, but they come with expensive consequences. Imagine the damage a lack of proper access to real-time data could cause during inventory management or risk detection.
An alternative is to use lambda views to get some form of real-time analysis, but this only works up to a point, with small volumes of data. Beyond that, performance starts to diminish because the limits of SQL optimization have been reached.
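The lambda-view idea can be sketched as a union of pre-processed history with a query over the freshest raw data; the table and column names here are hypothetical:

```sql
-- Hypothetical lambda view: batch history plus today's raw events.
create view events_lambda as
select event_id, user_id, event_type, event_ts
from events_history_batch          -- pre-aggregated by the nightly batch job
where event_ts < current_date
union all
select event_id, user_id, event_type, event_ts
from events_raw_stream             -- raw landing table, scanned at query time
where event_ts >= current_date;
```

The raw-side scan on the second branch of the union is what degrades as volumes grow: every query re-reads the unprocessed data, and no amount of SQL optimization can hide that forever.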
Are there other tools that help process and stream real-time data? Yes. Are analysts working with them? No, or at least most of them aren’t, because those tools fall outside their skill set. Most analysts aren’t familiar with writing transformations in tools such as Kafka Streams; that would require them to code in Scala or Java and demand far more technical knowledge.
This work is therefore done by data engineers, who know these tools and manage transformations from the moment the pipelines are set up. Analysts rely on their output and can’t really contribute much to streaming data transformations, even though they could add immense value if they did.
The question that arises is: “What’s stopping them from doing so without having to learn these tools?” SQL has the answer. If streaming analytics were made accessible to analysts and anyone well-versed in SQL, we would see a significant improvement in analysis speed and data quality, and a reduction of silos among data teams.
This duo is bound to take data science and analytics teams to the next stage, but what are some of the exciting use cases that users can jump right into? Streaming real-time data can benefit many workloads, but two impressive cases stand out:
Materialize completely removes the need for streaming-specific data modeling and grants users ownership of real-time transformation workflows, much as they would have with a batch-based alternative. The background users bring also matters for reaping the full benefits of this integration.
Materialize builds materialized views designed for streaming. They are not cached results, as in most traditional data warehouses; instead, each materialized view is a long-lived query whose results are continuously updated as new data arrives. For users experienced with dbt, it may seem surprising that with the Materialize plugin they only need to execute ‘dbt run’ once.
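As a hedged sketch of what this looks like in Materialize (the broker, topic, and view names are hypothetical, and the exact syntax may differ between versions):

```sql
-- Define a source over a Kafka topic, then a continuously updated view.
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, sum(amount) AS total_revenue
FROM purchases
GROUP BY region;

-- The view is maintained incrementally; a plain SELECT always sees
-- up-to-date results without re-running the transformation.
SELECT * FROM revenue_by_region;
```

This is why a single ‘dbt run’ suffices: once the view is created, Materialize keeps it fresh on its own.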
On the other side, Materialize users should spend a little time learning dbt. With dbt, users can scale Materialize pipelines with the help of macros, which help deploy and parametrize views and, together with other features, shorten the time spent running processes.
For instance, you can use a Jinja loop to iterate over a list of similar statements instead of hand-writing hundreds of create source statements. In the words of Andy, Head of Community at Materialize:
“dbt is the best way to manage your Materialize deployment. With dbt, you can optimize your workflow with version control, macros, testing, and documentation.”
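The Jinja loop mentioned above might look like this sketch; the macro name is hypothetical, and the exact adapter syntax for creating sources from dbt may differ:

```sql
-- macros/create_kafka_sources.sql (hypothetical macro name)
{% macro create_kafka_sources(topics) %}
  {% for topic in topics %}
    CREATE SOURCE IF NOT EXISTS {{ topic }}_src
    FROM KAFKA BROKER 'kafka:9092' TOPIC '{{ topic }}'
    FORMAT BYTES;
  {% endfor %}
{% endmacro %}

-- Invoking it with a list of topics expands, at compile time, to one
-- CREATE SOURCE statement per topic:
-- {{ create_kafka_sources(['orders', 'shipments', 'returns']) }}
```

One parametrized macro replaces hundreds of near-identical statements, and changing the broker address means editing a single line.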
Many questions and issues with the batch paradigm haven’t even been raised, because they can’t be solved within that system. As streaming workflows become accessible to more data team members, processes will get smoother, and those issues will finally be voiced and solved properly.
Working with data can be daunting and exhausting at times. Blue Orange is equipped with a team of data scientists who take the hard work off your shoulders, optimize those sophisticated data pipelines, and turn them into digestible insights. Schedule a free 15-minute consultation with us here.