Two DV's in Tandem: Data Vaults and Data Virtualization

There has been growing interest in data vaults as an architecture suited to archiving and preserving enterprise-wide data. The normalized, "hub-and-link" structure is suited for parallel loading from multiple operational systems, and the philosophy of "all the data, all the time", regardless of quality, lends itself to rapid population of the vault without cumbersome governance and oversight. However, the data vault is ill-suited as a source for reporting and analytic querying: it separates business keys from their descriptive attributes, propagates many-to-many joins, and makes no distinction between high-quality and low-quality (or relevant and irrelevant) data. Companies wanting to report from a data vault are often faced with the unappealing task of doing two physical loads: one from the source systems to the vault, then another from the vault to data marts. There is a better alternative: data virtualization.

The data vault contains three main elements, all of them tables: hubs, links, and satellites. Hubs store the unique business keys and identify the source of those keys. Links capture the relationships between hubs (often many-to-many). Satellites contain the descriptive attributes for either a hub or a link. Data vaults usually place no limit on the number of hubs, links, or satellites . . . the structure often resembles an extended snowflake model. Modeling, implementing, and loading a data vault can proceed very rapidly, but structurally it is not optimized for query performance and analysis.
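To make the three elements concrete, here is a minimal sketch of a vault schema in SQLite. All table and column names (hub_customer, link_order, sat_customer, and so on) are illustrative assumptions, not a prescribed data vault standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: unique business keys plus load metadata (source system, load date)
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- surrogate/hash key
    customer_id   TEXT UNIQUE,        -- business key
    record_source TEXT,
    load_date     TEXT
);
CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_id    TEXT UNIQUE,
    record_source TEXT,
    load_date     TEXT
);

-- Link: a many-to-many relationship between hubs (here, an order event)
CREATE TABLE link_order (
    order_hk      TEXT PRIMARY KEY,
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    product_hk    TEXT REFERENCES hub_product(product_hk),
    record_source TEXT,
    load_date     TEXT
);

-- Satellite: descriptive attributes for a hub, versioned by load date
CREATE TABLE sat_customer (
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    load_date     TEXT,
    name          TEXT,
    region        TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
""")
```

Note how the descriptive attributes live only in the satellite: any business-friendly query must join hub, link, and satellite back together, which is exactly why the raw vault is awkward to report from.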

Instead of loading separate, physical dimensional marts from the vault, faster time-to-value can be achieved with data virtualization. Data virtualization implements a logical data model as an abstract (non-physical) layer presented to end users. Query optimization provides a quick return of the source data, no matter which analytical tools are being used. For the data vault, dimensional logical models can easily be created by using the hubs and their associated satellites as "dimensions" (i.e., the "master data" keys and their descriptions) and the links as "facts" (i.e., the transactional relationships between the master data).
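The hub-plus-satellite-as-dimension, link-as-fact mapping can be sketched with an ordinary SQL view standing in for the virtualization layer. This is a simplified illustration (a real virtualization product defines such views over remote sources); the table names and sample rows are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified vault tables (illustrative names and columns)
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE sat_customer (customer_hk TEXT, load_date TEXT, name TEXT, region TEXT);
CREATE TABLE link_order   (order_hk TEXT, customer_hk TEXT, amount REAL);

INSERT INTO hub_customer VALUES ('hk1', 'C-100');
INSERT INTO sat_customer VALUES ('hk1', '2024-01-01', 'Acme Co', 'EMEA');
INSERT INTO link_order   VALUES ('o1', 'hk1', 250.0), ('o2', 'hk1', 100.0);

-- Logical "fact" view: link rows joined to the hub/satellite "dimension"
CREATE VIEW v_order_fact AS
SELECT l.order_hk, h.customer_id, s.name, s.region, l.amount
FROM link_order l
JOIN hub_customer h ON h.customer_hk = l.customer_hk
JOIN sat_customer s ON s.customer_hk = l.customer_hk;
""")

# Analysts query the view dimensionally, never the raw vault tables
for row in conn.execute(
        "SELECT region, SUM(amount) FROM v_order_fact GROUP BY region"):
    print(row)  # ('EMEA', 350.0)
```

No data is physically moved: the view resolves the hub/link/satellite joins at query time, which is the essence of the virtualization approach.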

The logical models (separated by business unit or "fact group") can be expressed as separate "views" in the virtualization layer, each corresponding to a distinct data mart and consumable by non-technical business users. Each view may draw on many of the same source tables from the vault, particularly the satellite tables where the descriptive attributes are stored. If the physical vault resides on a scalable MPP data warehouse appliance such as Netezza or Teradata, queries through the virtualization layer will be very fast. To reduce total cost of ownership, the enterprise may choose to host the vault in the cloud on AWS or Microsoft Azure . . . combining this with a data virtualization cloud offering (such as Denodo Platform for AWS) would maximize the offloading of local resources, providing a total cloud data integration/analytics solution.
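The point that multiple business-unit views share the same vault tables can be shown in the same SQLite sketch. The view names (v_sales_orders, v_finance_orders) and columns are hypothetical; both views read the same satellite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared vault tables (illustrative)
CREATE TABLE sat_customer (customer_hk TEXT, name TEXT, region TEXT);
CREATE TABLE link_order   (order_hk TEXT, customer_hk TEXT, amount REAL, tax REAL);

INSERT INTO sat_customer VALUES ('hk1', 'Acme Co', 'EMEA');
INSERT INTO link_order   VALUES ('o1', 'hk1', 250.0, 50.0);

-- "Sales mart" view: customer context plus order amounts
CREATE VIEW v_sales_orders AS
SELECT s.name, s.region, l.amount
FROM link_order l JOIN sat_customer s ON s.customer_hk = l.customer_hk;

-- "Finance mart" view: the same satellite, different measures
CREATE VIEW v_finance_orders AS
SELECT s.name, l.amount, l.tax
FROM link_order l JOIN sat_customer s ON s.customer_hk = l.customer_hk;
""")
```

Because both "marts" are just view definitions, adding a new business-unit view is a metadata change rather than another physical load.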

No matter the platform (on-prem, cloud, or hybrid), "the two DVs" combine a uniform, "single-view" archive of all enterprise data with a rapid time-to-value data virtualization layer for business consumption, creating an end-to-end data integration/analytics platform for the organization.

About the Author:

Joe Caparula is a Managing Partner at Pandata Group where he specializes in data integration strategies for businesses. Have a question on the use of data virtualization as part of your data integration strategy? Email Joe at