Logical Data Warehouse: One Data Warehouse to Rule Them All
Wouldn't it be great if we could store all our data in a single place? It would theoretically make access to data so much easier. Unfortunately, that's not possible for various reasons. Primarily, data is scattered across storage systems in multiple geographic locations. On top of that, different business units use different (and sometimes incompatible) technologies. Moreover, data ownership problems render this utopia unrealizable. However, accepting defeat and sitting on our hands is a costly option: managing scattered data is expensive, time-sensitive operations demand real-time data, and we must ensure that data is timely, accurate, reliable, discoverable, and accessible.
People tend to get creative when there is a pressing need and none of the available solutions works. That explains the proliferation of ideas like data virtualization, data mesh, and data fabric over the past few years. Underpinning all these ideas is the concept of the Logical Data Warehouse (LDW). Although Bill Inmon is credited with coining the term in 2004, it was Mark Beyer of Gartner who used it in 2009 in the sense that it is used today.
LDW—what does it do?
An LDW is a data architecture specifically designed to overcome the geographic, technological, and data ownership constraints cited above. This architectural layer is built on top of data warehouses and data lakes. It harmonizes the different components of the data infrastructure to create a single, integrated view of the data scattered across different storage locations. While achieving this feat, an LDW makes no use of the physical ELT and ETL processes generally used to copy and move data into a single store.
How does it work?
How does an LDW manage to bring together all the data, regardless of format, when the user has no idea where it resides? The LDW's power lies in how it capitalizes on metadata. An LDW does not physically house data; it holds metadata that defines the context of the data, making it possible for data consumers to find and access a specific piece of data without having to know where it is stored. Metadata eliminates the need to run ETL processes or replicate data: a consumer sends a simple information request and receives a unified view of the data in response.
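To make the idea concrete, here is a minimal sketch of metadata-driven access, with a hypothetical MetadataCatalog class and invented dataset names standing in for a real implementation. The catalog holds only descriptions of where each logical dataset lives; consumers ask for data by logical name, and the catalog resolves the physical location at request time.

```python
# Minimal, hypothetical sketch of metadata-driven access in an LDW.
# The catalog stores only metadata (where each logical dataset lives
# and how to reach it), never the data itself.

from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    logical_name: str       # what consumers ask for, e.g. "sales.orders"
    system: str             # backing store, e.g. "postgres" or "s3_lake"
    physical_location: str  # connection string, bucket path, etc.
    data_format: str        # "relational", "parquet", ...


class MetadataCatalog:
    """Maps logical dataset names to physical sources (names are invented)."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, meta: DatasetMetadata) -> None:
        self._entries[meta.logical_name] = meta

    def resolve(self, logical_name: str) -> DatasetMetadata:
        # Consumers never handle physical locations; the catalog resolves them.
        return self._entries[logical_name]


catalog = MetadataCatalog()
catalog.register(DatasetMetadata("sales.orders", "postgres",
                                 "postgres://warehouse/orders", "relational"))
catalog.register(DatasetMetadata("web.clickstream", "s3_lake",
                                 "s3://lake/clickstream/", "parquet"))

# A consumer asks for data by logical name only: no ETL, no copies.
meta = catalog.resolve("web.clickstream")
print(f"Fetch {meta.logical_name} from {meta.system} at {meta.physical_location}")
```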
The data needs of a modern enterprise are so sophisticated that no single data management system can address the demands of all user types and use cases. An LDW offers an all-encompassing, versatile, and flexible model that can, by drawing on data warehouses, data lakes, Hadoop clusters, NoSQL databases, and cloud-based platforms. Different components work in sync within the LDW framework to handle structured or unstructured data and relational or non-relational databases. Depending on what the user seeks, the result can approximate real time or provide a single view of the truth with full data historicity. Imagine two hypothetical data consumers: one who prioritizes data quality over access speed, and another who has modest standards for data consistency but no tolerance for latency. An LDW can lean on different components to satisfy both.
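A minimal sketch of how such routing might look, assuming an illustrative two-backend setup and an invented route_query policy; a production LDW would make this decision through its optimizer and metadata rather than a hard-coded table.

```python
# Hypothetical routing sketch: the LDW picks a backing component per
# consumer preference. The two backends and the policy table are
# illustrative assumptions, not a real product's API.

from enum import Enum


class Priority(Enum):
    CONSISTENCY = "consistency"  # curated, validated, historical data; slower
    LATENCY = "latency"          # fresh or cached data; weaker guarantees


BACKENDS = {
    Priority.CONSISTENCY: "enterprise_data_warehouse",  # governed, consistent
    Priority.LATENCY: "in_memory_cache",                # near-real-time reads
}


def route_query(logical_dataset: str, priority: Priority) -> str:
    # A real LDW would weigh cost, freshness, and load; this lookup
    # only illustrates the principle.
    return f"Serving '{logical_dataset}' from {BACKENDS[priority]}"


# The consumer who insists on data quality:
print(route_query("sales.orders", Priority.CONSISTENCY))
# The consumer who cannot tolerate latency:
print(route_query("sales.orders", Priority.LATENCY))
```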
Strengths and Weaknesses of LDW
- An LDW lowers the technical barriers for knowledge workers or "domain experts," practically turning them into "citizen data analysts." These people can self-serve without having to execute physical ELT and ETL processes. An LDW architecture empowers them to find and access data without relying on an already-overwhelmed IT department.
- The single logical access point an LDW builds minimizes the need for data movement and replication. This prevents data sprawl and degradation, improving data governance.
- An LDW architecture allows for data historicity, the lack of which was a weak point for data virtualization, as our CEO Mustafa explained earlier. An LDW can harness the capabilities of a Hadoop cluster to provide users with historical analysis. Coupled with real-time or near-real-time data analysis, it offers users the best of both worlds.
- The flexibility an LDW architecture affords its users is immense compared with a conventional enterprise data warehouse (EDW). An LDW is a composable architecture: new components can be added and removed with ease, which is a huge plus in an age when companies use multiple SaaS products and struggle to integrate them into the existing data infrastructure. An LDW removes the need to establish new ELT and ETL processes every time a new data source is added, as the sketch just after this list illustrates.
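Here is a hedged sketch of that composability, using an invented Connector protocol and a stubbed SalesforceConnector: plugging in a new SaaS source amounts to registering a connector in the virtual layer rather than building a new pipeline.

```python
# Hedged sketch of LDW composability (the Connector protocol and source
# names are invented for illustration). Adding a SaaS source is a
# registration in the virtual layer, not a new ETL pipeline.

from typing import Protocol


class Connector(Protocol):
    def fetch(self, query: str) -> list[dict]: ...


class SalesforceConnector:
    def fetch(self, query: str) -> list[dict]:
        # A real connector would call the SaaS API; stubbed here.
        return [{"source": "salesforce", "query": query}]


CONNECTORS: dict[str, Connector] = {}


def add_source(name: str, connector: Connector) -> None:
    # Composability in one line: nothing is copied, no pipeline is built.
    CONNECTORS[name] = connector


add_source("crm", SalesforceConnector())
print(CONNECTORS["crm"].fetch("SELECT account, amount FROM opportunities"))
```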
An LDW architecture is not without its weaknesses, and they follow directly from data virtualization, the technique an LDW uses for implementation. Luckily, caching can mitigate the scalability problems, as sketched below. Additionally, if any component of the architecture goes down, the data it serves becomes temporarily unavailable for analysis. As with data virtualization, this is a problem that takes a holistic maintenance approach to deal with.
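As an illustration of the caching point, here is a minimal sketch of a TTL (time-to-live) cache in front of the federated query layer; the fetch_from_sources callable and the five-minute TTL are assumptions for the example.

```python
# Illustrative TTL cache in front of a federated query layer. The
# fetch_from_sources callable and the 300-second TTL are assumptions.

import time

_CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 300  # tolerate five-minute-old results for repeat queries


def cached_query(sql: str, fetch_from_sources) -> object:
    now = time.time()
    hit = _CACHE.get(sql)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: the backing systems are not touched
    result = fetch_from_sources(sql)  # cache miss: federate as usual
    _CACHE[sql] = (now, result)
    return result


# Usage: the second call within the TTL is served from memory.
print(cached_query("SELECT * FROM sales.orders", lambda q: f"rows for {q}"))
print(cached_query("SELECT * FROM sales.orders", lambda q: f"rows for {q}"))
```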
How does an LDW fit with data virtualization, data mesh, and data fabric?
- Data virtualization is a specific data integration technique used to realize the LDW principles. It is what an LDW uses to bring together disparate sources of data without copying or moving the data.
- Data mesh is an approach that redefines how we think about data: it focuses on knowledge domains, treats data as a product, and prioritizes the empowerment of knowledge workers. The data mesh concept leverages the LDW to transform the existing data infrastructure so it can accommodate the needs of a modern enterprise.
- Data fabric is defined by Gartner as "an emerging data management design, ... not one single tool or technology." It is based on active metadata practices incorporating artificial intelligence (AI) and machine learning (ML). Thanks to AI and ML, a data fabric can learn patterns of data usage and serve information requests better over time. Data fabric is an end state for an LDW to reach once it matures.
The LDW is a data architecture modern enterprises should aspire to have. It does not replace the existing EDW but complements it with new components, turning the EDW into one cog in a well-oiled machine and reducing the cost of data management. It is the best practice in data management right now. At least until the next concept with an even cooler name emerges, that is.