Data Architecture: Legacy Edition

Will Thomas

What is BI?

Business intelligence (BI) is best defined by its end goal: making better business decisions. In Wikipedia’s parlance, BI “enables more effective strategic, tactical, and operational insights and decision-making” [3]. To do this, we need two things: data and a way to understand it.

Data Architecture Matters

So, making good, well-informed decisions is the goal. Performing queries, drawing graphs, training models, and making predictions are some of the ways of moving towards this goal. But without accurate, accessible, and relevant data, there is nothing to apply analytics to. For these tools to be truly meaningful and effective, your whole organization needs to deploy a modern data architecture.

Traditional BI has required businesses to invest in sophisticated infrastructure and expensive development projects, and to cover the continual support costs of keeping it responsive to changing business needs. Think of having to request a monthly sales report from IT, or of managing a server network internally.

Major advances in tooling, along with sophisticated data processing, have democratized the job of analytics, moving it from IT to business stakeholders. But let’s start with how the traditional approach to BI data processing evolved.

Legacy Data Architecture: Evolved System

 This is your brain...on a legacy data pipeline.

Figure 1: Legacy Dataflow

Figure 1 shows how a typical legacy data pipeline would look. This is usually not planned, but rather the result of evolving business needs. As more and more external services are added, and more independent internal teams make use of the data, the architecture evolves into a complex, tangled spider web of data connections. Each team has to figure out what data is available, where to find it, and what each column and field actually means (the data semantics), and then do its best (using custom code and legacy ETL software) to pull it all together so it can perform its computations.

This works initially. However, as the company grows, it becomes increasingly hard to manage. The process is very error-prone and requires significant, costly coordination and effort from each team. The end result is fragile and inaccurate. If any service, whether internal or external, changes at all, the code either breaks or, worse yet, produces inaccurate results.
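To make the "breaks or, worse yet, produces inaccurate results" failure mode concrete, here is a minimal sketch in Python. The CSV exports, column names, and function names are all hypothetical and invented for illustration; they do not come from any particular system. It contrasts a report that reads a column by position with one that reads it by name and checks that the column exists.

```python
import csv
import io

# Hypothetical export from an upstream service (original layout).
EXPORT_V1 = "order_id,amount,region\n1,100.0,EU\n2,250.0,US\n"
# The same service after an unannounced change: a numeric column was inserted.
EXPORT_V2 = "order_id,quantity,amount,region\n1,3,100.0,EU\n2,1,250.0,US\n"
# A later change: the column was renamed.
EXPORT_V3 = "order_id,quantity,amount_usd,region\n1,3,100.0,EU\n2,1,250.0,US\n"


def total_by_position(raw: str) -> float:
    """Fragile: assumes the amount is always the second column."""
    rows = list(csv.reader(io.StringIO(raw)))[1:]  # skip the header row
    return sum(float(row[1]) for row in rows)


def total_by_name(raw: str) -> float:
    """Sturdier: reads the column by name and fails loudly if it is missing."""
    reader = csv.DictReader(io.StringIO(raw))
    if "amount" not in (reader.fieldnames or []):
        raise KeyError("expected column 'amount' not found")
    return sum(float(row["amount"]) for row in reader)


if __name__ == "__main__":
    print(total_by_position(EXPORT_V1))  # correct total on the old layout
    print(total_by_position(EXPORT_V2))  # no error, but it sums 'quantity'
    print(total_by_name(EXPORT_V2))      # still the correct total
    try:
        total_by_name(EXPORT_V3)
    except KeyError as exc:
        print("schema change detected:", exc)
```

The positional version keeps running after the layout change and quietly reports the wrong number, which is the more dangerous of the two outcomes. The name-based version at least turns a renamed column into a visible error, although every team still has to write and maintain such checks on its own.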

On top of the architectural and design challenges, far more data is produced today than these systems were initially designed for, further stressing the legacy approach. Traffic is increasing, analytics is becoming more prevalent and information-rich, and media is now ubiquitous (e.g., screenshots, recorded mouse movements, images, and video). If that isn’t enough, large third-party sources of data, both free and premium, are also needed to enrich internal data sets. Every addition requires more code, more ETL operations, and ultimately much more processing.

Processing all this data cost-effectively is the next challenge. Advanced processing on this growing mass of data makes server loads difficult to balance. As with a power grid, provisioning computing power to meet peak load means that full utilization occurs for only part of the day or week. The result is expensive server allocations that often sit idle.
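The cost of provisioning for peak load can be shown with some simple arithmetic. The sketch below uses a made-up hourly load profile (the numbers are purely illustrative, not measurements) and computes what fraction of the provisioned server-hours is actually used when capacity is fixed at the daily peak.

```python
# Hypothetical load for one day, expressed as "servers needed" per hour.
# These numbers are illustrative assumptions, not real measurements.
hourly_load = [2, 2, 2, 2, 2, 3, 5, 8, 12, 16, 18, 20,
               20, 18, 16, 14, 12, 10, 8, 6, 4, 3, 2, 2]


def utilization(load, capacity):
    """Fraction of provisioned server-hours that are actually used."""
    return sum(load) / (capacity * len(load))


def idle_server_hours(load, capacity):
    """Server-hours paid for but left unused."""
    return capacity * len(load) - sum(load)


if __name__ == "__main__":
    peak = max(hourly_load)  # capacity sized for the busiest hour
    print(f"peak capacity: {peak} servers")
    print(f"utilization:   {utilization(hourly_load, peak):.1%}")
    print(f"idle hours:    {idle_server_hours(hourly_load, peak)}")
```

With this profile, a fleet sized for the busiest hour is used at well under half of its capacity over the day. Real load curves differ, but the shape of the problem is the same whenever capacity is fixed and demand is not.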

Finally, since the data is stored and only made available through proprietary services (designed by and for engineers), leveraging this data relies on direct support from engineering and IT. Any special information, query, or graph required by a decision-maker will need coordination and assistance from an engineer or data scientist. Even data discovery becomes a major hurdle. Management, and even the engineers who run the system, struggle to know what data is available, what data would prove useful, what the data means, and how to access it. The result is simplistic, slow decision-making and a slow-moving organization.

Legacy Challenges

To summarize the challenges facing a legacy data pipeline:

  • Hard to scale - difficult data discovery, data access, etc.
  • Error-prone.
  • Data access limited to specialists.
  • Expensive to maintain.
  • Redundancy of data and code.
  • Wasteful and costly computing allocation.

Next Up: An exploration of data transformation.