
Leverage Google Cloud and Dataplex to Build a Data Mesh

By Colin Van Dyke, May 27, 2022

Insights are the essence of an effective data cloud, and democratizing them is crucial to making data-driven decisions. Deriving meaningful insights requires a self-service data platform that scales with usage and spans data silos. Ownership issues emerge in the process as well.

Organizations need to distribute data ownership among their teams without losing business context, while maintaining governance and data quality across the lifecycle of distributed data. Google set out to address these issues with its latest update of Dataplex.

Dataplex is an intelligent data fabric that helps you make data available across analytics and data science tools. It unifies management, monitoring, and governance across data warehouses, data lakes, and data marts. Its built-in intelligence helps accelerate analytics agility and automate data lifecycle management and data quality.

Data mesh architectures are one of the most common use cases for Dataplex. Today at Blue Orange Digital we’re taking a more in-depth look at it, starting with what a data mesh is in the first place.



Understanding Data Mesh Architecture

Domain-agnostic, monolithic data architectures are fading into the past as tools emerge that let enterprises access and manage distributed, diverse data. Gartner has predicted that through 2025, 80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance. But how does data mesh fit into this new approach?

Centrally managed architectures slow down analytics agility, while fully decentralized architectures impede data governance and lead to silos and data duplication. The data mesh architecture emerged as a middle ground between siloed ownership and full democratization of data. The model, introduced by Zhamak Dehghani, is a modern data stack (MDS) approach that gives domain teams autonomy over their data and makes it easier to manage and monitor data across multiple domains.

Building a Data Mesh with Google Cloud

Functioning as a data management platform, Dataplex lets you build a data mesh of data domains across your organization while keeping central controls for monitoring and governing data across those domains. Your data and related artifacts such as logs, notebooks, or code can be organized in a Dataplex lake, which represents a data domain.
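To make the domain idea concrete, here is a minimal Python sketch of how a domain can be pictured: a lake that holds zones, each of which holds assets pointing at data where it already lives. This is only an illustrative in-memory model, not the Dataplex API. The class names, the `resources` helper, and the bucket and dataset paths are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class ZoneType(Enum):
    RAW = "raw"          # data in its original form
    CURATED = "curated"  # cleaned, structured data ready for analytics


@dataclass(frozen=True)
class Asset:
    """A pointer to data that stays in its existing storage location."""
    name: str
    resource: str  # hypothetical path of an existing bucket or dataset


@dataclass
class Zone:
    name: str
    zone_type: ZoneType
    assets: list = field(default_factory=list)


@dataclass
class Lake:
    """One lake per data domain (illustrative model only)."""
    domain: str
    zones: list = field(default_factory=list)

    def resources(self):
        # Collect the underlying storage locations across all zones.
        return [a.resource for z in self.zones for a in z.assets]


sales = Lake(
    domain="sales",
    zones=[
        Zone("landing", ZoneType.RAW,
             [Asset("orders-files", "gs://example-sales-raw")]),
        Zone("reporting", ZoneType.CURATED,
             [Asset("orders-tables", "projects/example-project/datasets/sales_curated")]),
    ],
)

print(sales.resources())
# ['gs://example-sales-raw', 'projects/example-project/datasets/sales_curated']
```

Note that the model only stores references to the bucket and the dataset. Nothing is copied, which mirrors the idea that a domain is a logical grouping rather than a new storage system.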

You don’t have to physically move data to a new location or a single storage system. All the data for a domain can be modeled as a group of Dataplex assets. Then, using Dataplex lakes and data zones, you can bring distributed data together based on its business context. This is a step toward managing your distributed data at scale. Let’s look deeper into Dataplex features:

Discover Metadata Among Data Sources on Autopilot 

Members of a domain can search and discover tables and file sets through cataloging and metadata management, and group them based on domain-specific semantics. Dataplex gathers the related metadata and keeps it up to date automatically.
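As a rough illustration of what searching and grouping by domain semantics means, the Python sketch below filters a tiny in-memory catalog and groups the matches by domain. It does not call Dataplex. The `Entry` fields, the sample entries, and the `search` and `group_by_domain` helpers are assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    kind: str        # "table" or "fileset"
    domain: str
    tags: frozenset  # free-form labels attached to the entry


# Hypothetical catalog contents, standing in for discovered metadata.
CATALOG = [
    Entry("orders", "table", "sales", frozenset({"pii", "daily"})),
    Entry("clickstream", "fileset", "marketing", frozenset({"raw"})),
    Entry("customers", "table", "sales", frozenset({"pii"})),
]


def search(catalog, *, kind=None, tag=None):
    """Return entries matching every filter that was supplied."""
    return [
        e for e in catalog
        if (kind is None or e.kind == kind) and (tag is None or tag in e.tags)
    ]


def group_by_domain(entries):
    """Map each domain to the names of its matching entries."""
    grouped = defaultdict(list)
    for e in entries:
        grouped[e.domain].append(e.name)
    return dict(grouped)


pii_tables = search(CATALOG, kind="table", tag="pii")
print(group_by_domain(pii_tables))  # {'sales': ['orders', 'customers']}
```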

Enable Tools Interoperability

Dataplex automatically makes the metadata it curates available as runtime metadata for federated open-source analytics with tools such as Spark SQL, Presto, and Hive. The same metadata is also exposed as external tables in BigQuery, which enables federated analytics there as well.



Scale Governance of Data

Data stewards and administrators can use Dataplex to manage IAM policies and control access to distributed data. They can govern data from a central point across multiple domains while still delegating ownership of it. Read and write permissions can be managed on the domains and on the linked physical storage resources. Integration with Cloud Logging and Cloud Monitoring (formerly Stackdriver) also adds observability through logs, audit logs, and data metrics.
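One way to picture central governance with delegated ownership is permission inheritance down a hierarchy: a grant made on a domain applies to everything under it, while a steward can still add narrower grants lower down. The Python sketch below models that idea only. It is not how Dataplex or IAM is implemented, and the `Node` class, the principals, and the `read`/`write` labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """A lake, zone, or asset in a hypothetical domain hierarchy."""
    name: str
    parent: Optional["Node"] = None
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def effective_permissions(self, principal: str) -> Set[str]:
        # Union of the grants on this node and on every ancestor.
        perms: Set[str] = set()
        node: Optional[Node] = self
        while node is not None:
            perms |= node.grants.get(principal, set())
            node = node.parent
        return perms


# Central grant at the domain level, delegated grant at the asset level.
lake = Node("sales-lake", grants={"group:sales-analysts": {"read"}})
zone = Node("curated", parent=lake)
asset = Node(
    "orders-tables",
    parent=zone,
    grants={"user:steward@example.com": {"read", "write"}},
)

print(sorted(asset.effective_permissions("group:sales-analysts")))      # ['read']
print(sorted(asset.effective_permissions("user:steward@example.com")))  # ['read', 'write']
print(sorted(lake.effective_permissions("user:steward@example.com")))   # []
```

The last line shows that a grant made on a single asset does not leak upward to the rest of the domain.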

High-Quality Data Access

Dataplex offers built-in data quality rules that you can run as data quality tasks across your data in Cloud Storage (GCS) and BigQuery to identify data issues automatically.
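To show the general shape of rule-based quality checks, here is a small Python sketch that applies named rules to rows and reports which rows fail each one. It does not use Dataplex's rule syntax or task runner. The `Rule` class, the two sample rules, and the sample rows are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    column: str
    check: Callable[[object], bool]  # returns True when the value is valid


# Hypothetical rules: a completeness check and a range check.
RULES = [
    Rule("order_id is present", "order_id", lambda v: v is not None),
    Rule("amount is non-negative", "amount",
         lambda v: isinstance(v, (int, float)) and v >= 0),
]


def run_rules(rows, rules):
    """Return {rule name: indexes of the rows that failed that rule}."""
    failures = {r.name: [] for r in rules}
    for i, row in enumerate(rows):
        for r in rules:
            if not r.check(row.get(r.column)):
                failures[r.name].append(i)
    return failures


rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": None, "amount": 5.00},
    {"order_id": 3, "amount": -2.50},
]

report = run_rules(rows, RULES)
print(report)
# {'order_id is present': [1], 'amount is non-negative': [2]}
```

Keeping each rule as a named, column-scoped check makes the report readable for the domain team that owns the data.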

Data Exploration in One Click

Data analysts, engineers, and data scientists can explore data and metadata, deploy data management workloads, and develop scripts through a built-in, self-serve, serverless data exploration experience. This means you can easily create domain-specific code artifacts, such as Jupyter notebooks or SQL scripts, and share them through the interface.

Simplified Data Management

Common tasks such as tiering, refining, or archiving data can be handled with the built-in data management tasks. Dataplex also enhances your data management experience by integrating with Google Cloud’s native data tools such as Dataflow, BigQuery, Dataproc Serverless, and Data Fusion.

Together, these capabilities (metadata, policies, interactive and production analytics infrastructure, data monitoring, and access controls) deliver on the premise of a data mesh: data as a product.

Final Thoughts

Dataplex lets you organize data in BigQuery and Cloud Storage into lakes and zones in a logical, manageable, and scalable way. It provides centralized governance and fuels your analytics for more useful insights. Organizations can keep data with its owners, delegate ownership across domains, and monitor all the processes in a single place.

Working with clients across sectors (energy, financial, healthcare, and IoT, to name a few), we’ve seen the significant impact that architectures like data mesh have on productivity and time savings. Reserve a 15-minute call with our experts to see how we can help your organization.

