From Cron to Modern Data Stack (MDS): Dataflow Automation and Its Current State
The concepts that make today’s technological miracles possible are defined by data. Enormous amounts of data are collected...
There’s much buzz about data lakes vs. data warehouses today. Both are sophisticated tools, so it pays to be thoughtful before putting time and energy into a business case. Let’s start with definitions.
A data warehouse is an organized system that enforces high-quality standards. For instance, customer records can only be added if they meet data rules (e.g., every US state must be recorded as its two-character abbreviation). Due to these restrictions, data contained in data warehouses tends to be of high quality. There are trade-offs to consider as well. A data warehouse requires substantial ongoing maintenance effort, and it tends to be best for managing structured data.
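To make the idea of a data rule concrete, here is a minimal sketch in Python of a check applied at write time. The `Warehouse` class, the `CustomerRecord` fields, and the shortened list of state codes are all hypothetical and exist only for illustration; a real warehouse would typically enforce this kind of rule through its own constraints or its loading pipeline.

```python
from dataclasses import dataclass

# Partial list for illustration only; a real rule would cover every state code.
VALID_STATE_CODES = {"CA", "NY", "TX", "WA"}


@dataclass
class CustomerRecord:
    customer_id: int
    name: str
    state: str


class Warehouse:
    """Toy in-memory stand-in for a warehouse table that validates on write."""

    def __init__(self):
        self.rows = []

    def insert(self, record):
        # The rule is enforced before the record is stored, so bad data
        # never reaches the table.
        if record.state not in VALID_STATE_CODES:
            raise ValueError(f"rejected: invalid state code {record.state!r}")
        self.rows.append(record)
```

With this sketch, inserting a record with `state="CA"` succeeds, while one with `state="California"` raises a `ValueError` and is never stored. That rejection step is where both the quality and the maintenance cost come from: every new rule has to be written and kept up to date.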
Data warehouses are the right solution for cases where data integrity and accuracy are top priorities. For example, you might have financial data in a data warehouse so that your auditors have access to high-quality data. For routine activities like “produce a standard report each month,” a data warehouse is an excellent solution.
Data lakes are much more flexible than data warehouses.
Conceptually, any data can be poured into a data lake: nothing is turned away. Likewise, no user is turned away – anyone may use the data. You’re no longer dependent on a small group of overworked data analysts to get insights. To illustrate why data lakes are becoming more popular, take a look at how EMC uses data lakes to improve its marketing.
EMC, a technology company that produces digital storage products, uses data lakes to improve marketing productivity. According to a case study published by Intel, EMC reduced its marketing data query time from 4 hours to less than a minute. That means EMC can investigate many more ideas than ever before. As a result, the company can correctly predict “what and when a customer is going to purchase 80% of the time.” That level of accuracy means faster customer service because you know what customers want.
Most large companies already have data warehouses in place. Those systems serve a function. However, they require significant upfront effort in data management. They are also limited in the kinds of data they can manage – unstructured data is usually not welcome.
If you want to break new ground, setting up a data lake is your next step. A data lake is so much more than a standard-issue database.
With a data lake, the sky is the limit for your data. You can put almost anything into it: text, structured data sets, customer comments, invoice data and more. Unlike a data warehouse, you don’t have to do complex configuration for every new type of data you add. Skipping time-consuming up-front data administration and cleansing work means you have more time to extract useful insights from your data. The free-form nature of data lakes means you are not limited to the questions you can imagine today. Data you put into the data lake today could become helpful next year or five years from now. We’ll see an example of that next with natural disasters.
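First, a minimal Python sketch of what putting mixed data into a lake could look like. The `DataLake` class, the source names, and the sample payloads are hypothetical and invented for illustration; real data lakes are usually built on object storage, not an in-memory list. The point it shows is that nothing is checked when data arrives, and structure is applied only when someone reads the data (an approach often called schema-on-read).

```python
import json
from datetime import datetime, timezone


class DataLake:
    """Toy in-memory stand-in for a data lake: accepts any payload as-is."""

    def __init__(self):
        self.objects = []

    def put(self, source, payload):
        # No validation or schema check: the payload is stored unchanged,
        # tagged only with where it came from and when it arrived.
        self.objects.append({
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
        })

    def read(self, source, parse):
        # Structure is imposed only now, by the caller-supplied parser.
        return [parse(o["payload"]) for o in self.objects if o["source"] == source]


lake = DataLake()
# Hypothetical sample data: free text and JSON sit side by side.
lake.put("comments", "Love the product, but shipping was slow.")
lake.put("invoices", '{"invoice_id": 17, "total": 249.5}')
lake.put("invoices", '{"invoice_id": 18, "total": 80.0}')

# A question nobody planned for at ingestion time can still be asked later.
invoice_totals = lake.read("invoices", lambda raw: json.loads(raw)["total"])
```

The trade-off is the mirror image of the warehouse: writing is cheap and nothing is turned away, but each reader has to supply a parser and cope with whatever quality the raw data has.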
With a data warehouse, you need to know the end goal of your data analysis before you start. A data lake is different! By adding a wide variety of data, you can use the power of machine learning to discover new correlations. For example, Walmart discovered that hurricanes and snack purchases are correlated! Before a hurricane, the company found that Pop-Tart sales increased seven-fold! This correlation is an excellent example because few people would imagine that hurricanes could predict a specific type of sale.
At this point, you might be wondering which technologies to use. For insights on specific cloud technologies like Amazon Web Services, Google Cloud Platform, and Microsoft Azure, check out our post on “The Cloud War.”
Before you choose a data lake or a data warehouse, take a step back. These are data tools that need to operate in the service of your strategy. For instance, you might have a marketing optimization objective that requires an in-depth understanding of your customers. Your first step will involve developing a few questions, such as “What characteristics do our best customers have in common?” Once you have those questions in mind, you can start to build a data infrastructure to produce answers.
In our experience, most large companies will have a combination of data warehouses and data lakes. The data warehouse is needed for cases where data accuracy and integrity are non-negotiable, like financial reporting. A data lake is an excellent choice for situations where you want to encourage innovation, experimentation, and new ideas. In practice, data lakes tend to be most popular with sales, marketing, and customer service departments first. Once you become comfortable using data lakes in those areas, you can expand them to other areas.
If you’re still working out your approach, you’re not alone. Many companies we speak with are still developing their strategies. The good news: you still have great opportunities to build an advantage with your data. The bad news: your staff may not have the skills and capacity to execute a modern data strategy. Contact Blue Orange today to discuss your data opportunities.