Jun 22
Rethinking the Data Vault for Real-time Data

The data vault has long been viewed as a model best suited for historical and archival enterprise data. Its “insert-only”, business-process-oriented approach to raw, unadulterated data is ideal for low-maintenance storage of all enterprise-generated information from all systems. Use cases for data vaults have traditionally revolved around historical tracking and auditing; however, the perception has largely been that the model is ill-suited to analytics due to its many-to-many relationships and dispersed structure. In fact, data vaults are often used as a “lightly modeled stage” for traditional star-schema data warehouses.


But the data vault may be best suited for a use case that Kimball data models are not: real-time data capture and reporting. The need for real-time analytics is growing alongside IoT and edge computing devices. Whereas the denormalized, star-schema data warehouse is fine for analyzing historical data in aggregate (strategic queries), the tactical queries associated with real-time data may work best in a high-insert, low-latency structure like the data vault. Generally, there is no need to “transform” real-time data, so the DV’s “un-scrubbed” approach to data storage is fitting. In fact, a core philosophy of the data vault is that any business logic lives as far downstream as possible (preferably within the user’s analytics tools). Once the initial “contexts” for real-time data are established (see below), data can be loaded as fast as it is generated.


A quick recap of the data vault architecture: a DV contains three types of objects. Hubs represent business entities, their keys, and their sources. Links are the relationships between hubs. Satellites are the descriptive attributes of both hubs and links. Consider the example of a real-time bank transaction:

| Field | Value |
| --- | --- |
| Tx ID | 145559236701 |
| Datetime | 07-01-2020 14:23:29 |
| Customer | JMC553046 |
| Account | 9NOSB0185349 |
| Amount | -$25.00 |

The Tx ID would be loaded as a link between the Customer hub and the Account hub, called Transaction. The Datetime and Amount are attributes of that link and are therefore loaded into the Transaction satellite, and of course the Customer and Account are keys in their respective hubs. It is possible, and advisable, to load hub keys and their satellites up front as a batch, similar to a dimension load. This establishes the contexts for all future real-time loads and minimizes load time even further.
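
As a concrete sketch, the structures for this example might look like the following Snowflake-style DDL. All table and column names here are illustrative assumptions for this article, not a prescribed standard; real vaults typically add hashed keys and richer load metadata.

```sql
-- Hubs: business keys plus load metadata (insert-only)
CREATE TABLE hub_customer (
    customer_hk   VARCHAR       NOT NULL,  -- hash of the business key
    customer_id   VARCHAR       NOT NULL,  -- e.g. 'JMC553046'
    load_ts       TIMESTAMP_NTZ NOT NULL,
    record_source VARCHAR       NOT NULL
);

CREATE TABLE hub_account (
    account_hk    VARCHAR       NOT NULL,
    account_id    VARCHAR       NOT NULL,  -- e.g. '9NOSB0185349'
    load_ts       TIMESTAMP_NTZ NOT NULL,
    record_source VARCHAR       NOT NULL
);

-- Link: the Transaction relationship between Customer and Account
CREATE TABLE link_transaction (
    transaction_hk VARCHAR       NOT NULL, -- hash of tx_id + hub keys
    tx_id          VARCHAR       NOT NULL, -- e.g. '145559236701'
    customer_hk    VARCHAR       NOT NULL,
    account_hk     VARCHAR       NOT NULL,
    load_ts        TIMESTAMP_NTZ NOT NULL,
    record_source  VARCHAR       NOT NULL
);

-- Satellite: descriptive attributes of the Transaction link
CREATE TABLE sat_transaction (
    transaction_hk VARCHAR       NOT NULL,
    tx_datetime    TIMESTAMP_NTZ NOT NULL, -- e.g. '07-01-2020 14:23:29'
    amount         NUMBER(12,2)  NOT NULL, -- e.g. -25.00
    load_ts        TIMESTAMP_NTZ NOT NULL,
    record_source  VARCHAR       NOT NULL
);
```

With the hubs pre-loaded, each arriving transaction becomes two simple inserts (one link row, one satellite row) with no transformation in the load path.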


If loading real-time data into a DV model is rapid and efficient, how does the business then get value from the data? That depends on the desired use case. A popular deliverable is a “real-time dashboard” that delivers up-to-the-second metrics for production lines, web activity, etc. Queries against a traditional, on-prem data vault may be slow, depending on how the tables are indexed, but this is where the new generation of cloud-based data warehouses such as Snowflake shines. Query optimization, compute scalability, and alternate data types such as VARIANT enhance the performance of a Snowflake-connected analytical engine like ThoughtSpot. Additionally, schema-on-read approaches (e.g. views and data virtualization) can be employed to further improve the usability and performance of real-time data marts.
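
To sketch the schema-on-read idea using the illustrative tables above (again, the names are assumptions for this example), a view can flatten the vault into a dashboard-friendly shape at query time:

```sql
-- Illustrative reporting view: re-join the vault into a flat shape
-- at read time; no data is copied and no business logic is baked in.
-- Note: satellites are insert-only, so a production view would also
-- filter to the current satellite row per link.
CREATE OR REPLACE VIEW rpt_transactions AS
SELECT
    l.tx_id,
    c.customer_id,
    a.account_id,
    s.tx_datetime,
    s.amount
FROM link_transaction l
JOIN hub_customer    c ON c.customer_hk    = l.customer_hk
JOIN hub_account     a ON a.account_hk     = l.account_hk
JOIN sat_transaction s ON s.transaction_hk = l.transaction_hk;

-- Example dashboard tile: activity over the last five minutes
SELECT COUNT(*)    AS tx_count,
       SUM(amount) AS net_amount
FROM rpt_transactions
WHERE tx_datetime >= DATEADD(minute, -5, CURRENT_TIMESTAMP());
```

Because the view is resolved at query time, a change in downstream business rules is a view change rather than a reload, which is consistent with keeping transformation logic out of the vault itself.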


The data vault continues to grow as a viable enterprise data warehouse modeling approach. Its unmodified, “insert everything” methodology requires pushing transformative business logic closer to the end user, which may raise governance questions. It can be used as a “hybrid” gateway between staged data and the traditional EDW, although adding a third layer of data management may not always be palatable to the business. But as a repository for real-time transactional data, the vault can provide a “statement of facts for any given point in time” and a frictionless point of entry for operational data streams and up-to-the-minute analytics. That is where the model may truly come into its own.


Portions of this article were inspired by “The Business of Data Vault Modeling” by Dan Linstedt, Kent Graziano, and Hans Hultgren (2009).



About the Author:


Joe Caparula is a Senior Consultant with Pandata Group who specializes in delivering data modeling and data integration services for clients across several industries.