
Paolo for Mia-Platform

Originally published at mia-platform.eu

From Data Warehouse to Data Fabric: the Evolution of Data Architecture

Over the past few decades, data has become the lifeblood of every organization, from e‑commerce giants to healthcare providers and government agencies. Collecting and managing this data effectively can give organizations valuable insights to aid decision‑making. However, this has proven to be a daunting task.

As important as data is, CIOinsight reports that only 10% of organizations feel that their company excels in data analytics management. Recognizing this significant data utilization gap, organizations have embraced modern data architectures to bridge the divide.

Data architectures are the structured frameworks and systems that define how data is organized, integrated, and accessed within an organization. The architecture sets the blueprint and establishes guidelines for data and how it flows through data storage systems.

This article discusses the evolution of data architecture, its underlying principles, and the advantages of adopting modern data architectures for effectively managing organizational data.

The evolution of data architecture

Over the years, data architecture has evolved to adapt to the growing needs of businesses. A notable transformation discussed in this section is the shift in data architecture from logical warehouses to data fabrics.

The Logical Warehouse

Logical warehouses, also known as data warehouses, have been the foundation of data management for decades. These data warehouses are central repositories designed to store data from different sources like transactional systems, application log files, relational databases, etc., providing a unified view of the information.

In general, logical warehouses use Extract, Transform, Load (ETL) processes to extract data from source systems, transform it to ensure consistency, and load it into the warehouse. Logical warehouses are intended primarily for queries and analysis and often contain large amounts of historical data.
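To make the ETL flow concrete, here is a minimal sketch in Python, where a CSV export stands in for a source system and a SQLite database stands in for the warehouse; the file names and table schema are illustrative assumptions, not a specific product's setup.

```python
import csv
import sqlite3

# Minimal ETL sketch: extract rows from a CSV export, normalize them,
# and load them into a warehouse table (SQLite stands in for the warehouse).

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        # Enforce a consistent schema: trim strings, cast amounts, trim dates.
        yield (
            row["order_id"].strip(),
            row["customer_id"].strip(),
            float(row["amount"]),
            row["order_date"][:10],  # keep YYYY-MM-DD only
        )

def load(records, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id TEXT PRIMARY KEY, customer_id TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", records
    )
    conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract("orders_export.csv")), conn)
```

In a real warehouse the transform step is where most of the effort goes: applying business rules, conforming dimensions, and enforcing consistency before anything is loaded for analysis.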

Challenges of Logical Warehouses

While logical warehouses served their purpose, they faced several challenges as data volumes increased. Some of the key limitations included:

  • Data Silos: Logical warehouses often resulted in data silos, where different departments or teams would maintain their own isolated data sets, leading to inconsistencies and duplication.
  • Performance: Because data had to pass through multiple processing stages before it was available for analysis, query and loading performance suffered as data volumes grew.
  • Scalability: Implementing data warehouses was complex and costly due to hardware limitations, and it required expertise in data modeling, ETL processes, and database management, making it harder to handle exponential data growth.

Data Lakes

In response to the data warehouse challenges, data lake architecture was introduced in 2010. Although data lakes are similar to data warehouses in purpose, they differ in that data lakes also cater to semi‑structured and unstructured data.

Data lakes' ability to store large amounts of data in its natural or raw format (a minimal storage sketch follows this list) helps us to:

  • Use data from multiple sources regardless of data types (unstructured, semi‑structured, structured);
  • Create focused portions of data specifically tailored to meet the requirements of a particular use case;
  • Discover new insights by leveraging advanced analytics and machine learning techniques without being limited by pre‑defined schemas.
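As a rough illustration of this schema-on-read approach, the sketch below lands files in their original format under source and date partitions, and applies a schema only when reading; the local folder, source names, and file names are assumptions standing in for object storage such as S3 or similar.

```python
import json
import shutil
from datetime import date
from pathlib import Path

# Minimal data lake sketch: land files in their raw format under
# source/date partitions, deferring schema decisions to read time.
LAKE_ROOT = Path("lake/raw")  # a local folder stands in for object storage

def land_raw(source: str, src_file: Path) -> Path:
    """Copy an incoming file as-is into the raw zone, partitioned by ingestion date."""
    target_dir = LAKE_ROOT / source / f"ingestion_date={date.today().isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src_file.name
    shutil.copy2(src_file, target)
    return target

def read_events(source: str):
    """Apply a schema only when reading (schema-on-read), here for JSON-lines events."""
    for path in (LAKE_ROOT / source).rglob("*.jsonl"):
        with open(path) as f:
            for line in f:
                yield json.loads(line)

# Usage (illustrative file names):
# land_raw("clickstream", Path("events_2024_06_01.jsonl"))
# land_raw("crm", Path("contacts.csv"))        # structured
# land_raw("support", Path("ticket_123.txt"))  # unstructured text
```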

Challenges of Data Lakes

This open-format approach made data lakes more popular than data warehouses. However, data lakes came with their own challenges: data ingested without strict standardization led to inconsistencies. Additionally, data stored in a lake often requires significant transformation and integration effort before it can be used for analysis, which can be complex and time‑consuming.

Data Mesh

Data Mesh architecture is an emerging approach that provides an alternative to the central data lake architecture. Data Mesh, coined by Zhamak Dehghani in 2019, is a decentralized data architecture that organizes data around specific business domains.

By introducing domain‑oriented data ownership, domain teams become accountable for their data and products, improving data quality and governance.

Traditional data lakes often encounter challenges related to scalability and performance when handling large volumes of data. However, data mesh architecture solves these scalability issues through its decentralized and self‑serve data infrastructure.

With each domain having the autonomy to choose the technologies and tools that best suit its needs, data mesh allows teams to scale their data storage and processing systems independently.
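As a loose illustration of domain-oriented ownership, the sketch below shows what a self-describing "data product" contract published by a domain team might look like; the fields, registry, and example values are hypothetical, not a standard data mesh API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a "data product" contract a domain team might publish
# so other domains can discover and consume its data in a self-serve way.
# Field names, the registry, and example values are assumptions.

@dataclass
class DataProduct:
    domain: str              # owning business domain
    name: str                # product name, e.g. "orders_daily"
    owner: str               # accountable team or contact
    output_port: str         # where consumers read it (URI, table, topic...)
    schema: dict = field(default_factory=dict)   # published, versioned schema
    sla_freshness_hours: int = 24                # quality/SLA commitment

registry: list[DataProduct] = []

def publish(product: DataProduct) -> None:
    """Register a data product so it is discoverable across the mesh."""
    registry.append(product)

publish(DataProduct(
    domain="sales",
    name="orders_daily",
    owner="sales-data-team@example.com",
    output_port="s3://sales/data-products/orders_daily/",
    schema={"order_id": "string", "amount": "decimal", "order_date": "date"},
))
```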

Data Fabrics

Data Fabric is an integrated data architecture that is adaptive, flexible, and secure. It is an architectural approach and technology framework that addresses data lake challenges by providing a unified and integrated view of data across various sources.

Data Fabric allows faster and more efficient access to data by abstracting away the technological complexities involved in data integration, transformation, and movement, so that anybody can use the data.
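To give a flavor of this abstraction, here is a minimal sketch of a unified access layer that hides how each source is read behind a single catalog interface; real data fabrics add metadata management, governance, and virtualization on top, and the source names and paths here are purely illustrative.

```python
import csv
import sqlite3

# Minimal sketch in the spirit of a data fabric: heterogeneous sources are
# registered once, and consumers fetch data through a single interface
# without knowing how each source is accessed.

class UnifiedCatalog:
    def __init__(self):
        self._sources = {}

    def register(self, name, reader):
        """Register a callable that yields rows (dicts) for a logical dataset."""
        self._sources[name] = reader

    def read(self, name):
        return list(self._sources[name]())

def read_csv(path):
    def _reader():
        with open(path, newline="") as f:
            yield from csv.DictReader(f)
    return _reader

def read_sqlite(db_path, query):
    def _reader():
        with sqlite3.connect(db_path) as conn:
            conn.row_factory = sqlite3.Row
            for row in conn.execute(query):
                yield dict(row)
    return _reader

catalog = UnifiedCatalog()
catalog.register("crm_contacts", read_csv("crm_contacts.csv"))
catalog.register("orders", read_sqlite("warehouse.db", "SELECT * FROM fact_orders"))
# Consumers ask for a dataset by name, regardless of where it lives:
# rows = catalog.read("orders")
```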

To understand the difference between the data mesh and data fabric architecture approach to achieve data quality and what data paradigm to adopt, check out this Mia‑Platform article Data Mesh vs Data Fabric: a tale of two new data paradigms.

Modern data architecture principles

According to Dataversity, data architecture principles refer to a collection of policies regulating the enterprise data framework and the operational rules for gathering, integrating, and managing data assets. These principles help us to create a consistent, reliable, and efficient data architecture that aligns with our organization's goals and objectives.

To effectively leverage data as a competitive asset, here are some of the common modern data architecture principles to follow:

  • Data Quality (DQ): Data quality is essential to any data architecture; establishing data quality standards and processes ensures data accuracy, completeness, and reliability. This principle calls for implementing data profiling, cleansing, and validation techniques to identify and rectify data issues early (a minimal validation sketch follows this list). Effectively managed and reliable data is crucial for developing precise models and robust schemas for extracting valuable insights.
  • Data Governance (DG): The Experian data quality report tells us that 78% of organizations globally are plagued with poor data governance, which leads to mistrust in the data and the insights drawn from it. Data governance requires that, at any point in the data lifecycle, data consumers know the location, format, usage, relationships, and any other relevant information about the data, helping to avoid data debt. Data governance is closely related to data quality, enabling data architectures to ensure data integrity and increased data quality.
  • Strive for consistency: This principle emphasizes maintaining uniformity across all data structures, formats, and processes throughout an organization. Using standard terms and vocabulary ensures that data is consistently defined across different systems, making it easier for developers and non‑developers to collaborate on the same projects. Consistency in data representation across the organization presents teams with a "single version of the truth", making data and data analysis easily understood by a wide range of users within the company.
  • Make data a shareable asset: This principle treats data as a valuable resource to be shared and accessed across different systems. It aims to eliminate data silos and encourages the design of data architectures that facilitate efficient data sharing. Doing so ensures that all stakeholders gain a comprehensive understanding of the company and its data, fostering collaboration and informed decision‑making.
  • Data Security and Privacy: Modern data architecture must ensure the confidentiality and integrity of data while protecting sensitive information from unauthorized access, breaches, or misuse.
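As a small illustration of the data quality principle above, the sketch below runs a few profiling and validation checks (completeness, validity) on incoming records; the field names and rules are illustrative assumptions rather than a full data quality framework.

```python
# Minimal data quality sketch: a few illustrative profiling and validation
# checks run before data is accepted downstream.

REQUIRED_FIELDS = ("order_id", "customer_id", "amount")

def validate(record: dict) -> list[str]:
    """Return a list of data quality issues found in a single record."""
    issues = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            issues.append(f"missing {name}")            # completeness
    try:
        if float(record.get("amount", 0)) < 0:
            issues.append("negative amount")            # validity
    except (TypeError, ValueError):
        issues.append("amount is not numeric")          # accuracy
    return issues

def profile(records: list[dict]) -> dict:
    """Simple profiling: how many records pass, and which issues occur."""
    report = {"total": len(records), "valid": 0, "issues": {}}
    for record in records:
        issues = validate(record)
        if not issues:
            report["valid"] += 1
        for issue in issues:
            report["issues"][issue] = report["issues"].get(issue, 0) + 1
    return report

print(profile([
    {"order_id": "A1", "customer_id": "C9", "amount": "19.90"},
    {"order_id": "A2", "customer_id": "", "amount": "-5"},
]))
```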

Characteristics of modern data architecture

This section discusses some of the characteristics of modern data architecture:

  • Automation: Modern data architectures use automated processes, tools, and technologies to optimize data‑related tasks and operations. Unlike traditional systems, modern data architectures allow complex processes to be built in hours or days using cloud‑based tools.
  • Adaptability: A modern data architecture must be flexible enough to respond to evolving business needs. It must support multiple kinds of users, query operations and deployments, data processing engines, pipelines, etc.
  • Scalability: Scalability is a crucial characteristic of modern data architecture that allows us to scale up or down quickly and affordably when business demands change.
  • Cost efficiency: Modern data architectures built on scalable infrastructure can accommodate increasing demands without overbuying hardware up front. In addition, modern data architectures often rely on cloud‑computing platforms with a "pay‑as‑you‑go" model, so we only pay for the resources we use.

Benefits of modern data architecture

Having covered modern data architectures and their characteristics, this section outlines their benefits for businesses and organizations.

  • Provides a comprehensive view of a company: Data integration enables data from various sources and systems within an organization to be collected, stored, and organized consistently. This integration ensures that the data at any point offers a comprehensive view of the company, "a single source of truth."
  • Reducing redundancy: By collecting and harmonizing data from different sources, data integration reduces overlapping and duplicated data fields across an organization.
  • Improved data quality: Modern data architectures incorporate techniques like data cleansing and validation, data standardization, data quality monitoring and remediation, etc., that ensure data reliability.

Conclusion

The evolution of data architecture has been driven by the growing importance of data in organizations. From traditional data warehouses to modern data mesh and data fabric approaches, these architectures addressed specific challenges and brought new opportunities.

By embracing modern data architectures, organizations can benefit from improved data quality and comprehensive data insights to unlock the full potential of their data and stay competitive in today's world.

An excellent example of the two prominent modern data architectures, data mesh and data fabric, working together is Mia‑Platform Fast Data. To learn more about Mia‑Platform Fast Data, read this article and download the free infographic.
