Data Vault Modeling

Data Vault Modeling is a database modeling method that is designed to provide historical storage of data coming in from multiple operational systems. It is also a method of looking at historical data that, apart from the modeling aspect, deals with issues such as auditing, tracing of data, loading speed and resilience to change.

Data Vault Modeling focuses on several things. First, it emphasizes the need to trace of where all the data in the database came from. Second, it makes no distinction between good and bad data ("bad" meaning not conforming to business rules),^[1] leading to "a single version of the facts" versus "a single version of the truth",^[2] also expressed by Dan Linstedt as "all the data, all of the time". Third, the modeling method is designed to be resilient to change in the business environment where the data being stored is coming from, by explicitly separating structural information from descriptive attributes.^[3] Finally, Data Vault is designed to enable parallel loading as much as possible,^[4] so that you can scale out for very large implementations.
An alternative (and seldom used) name for the method is "Common Foundational Integration Modelling Architecture."^[5]

Basic notions

Data Vault attempts to solve the problem of dealing with change in the environment by separating the business keys (that do not mutate as often, because they uniquely identify a business entity) and the associations between those business keys, from the descriptive attributes of those keys.

The business keys and their associations are structural attributes, forming the skeleton of the data model. The Data Vault method has as one of its main axioms that real business keys only change when the business changes and are therefore the most stable elements from which to derive the structure of a historical database. If you use these keys as the backbone of a Data Warehouse, you can organize the rest of the data around them. This means that choosing the correct keys for the Hubs is of prime importance for the stability of your model.^[13] The keys are stored in tables with a few constraints on the structure. These key-tables are called Hubs.

Hubs

Hubs contain a list of unique business keys with low propensity to change. Hubs also contain a surrogate key for each Hub item and metadata describing the origin of the business key. The descriptive attributes for the information on the Hub (such as the description for the key, possibly in multiple languages) are stored in structures called Satellite tables which will be discussed below.
The Hub contains at least the following fields:^[14]

a surrogate key, used to connect the other structures to this table.
a business key, the driver for this hub. The business key can consist of multiple fields.
the record source, can be used to see where the business keys come from and if the primary loading system has all of the keys available in other systems as well.
optionally, you can also have metadata fields with information about manual updates (user/time) and the extraction date.

A Hub is not allowed to contain multiple business keys, except when two systems deliver the same business key but with collisions that have different meanings.
Hubs should normally have at least one satellite.^[14]

Hub example

This is an example for a Hub-table containing Cars, surprisingly called "Car" (H_CAR). The driving key is Vehicle Identification Number.

Pages

Translate

Friday, September 28, 2012