dbt

dbt (Data Build Tool) is a development framework that enables teams to build, test, and deploy analytics solutions faster. It can be used to create a unified development environment for all data engineering and data analyst teams. The open-source package Datavault4dbt by Scalefree extends dbt with macros for building a Data Vault 2.0-compliant data warehouse.

https://www.scalefree.com/consulting/datavault4dbt/

https://www.getdbt.com/

The challenge was carried out by Andreas Haas and Jan Binge.
The complete project, including sources, is available on GitHub (https://github.com/haas-binge/dwa-compare-dbt) and can be inspected, tried out, and tested there.

There is a table Kategorie in the webshop data which contains a parent-child relationship between KatID and OberKatID across several levels. It is implemented as a hierarchical link productcategory_hierarchy_l, using a standard macro from datavault4dbt, together with a status satellite to identify potential changes. Please see the attached PDF for more information.
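As a sketch of what such a model file can look like (table, column, and hash-key names here are illustrative, and the exact macro signature should be checked against the datavault4dbt documentation), the hierarchical link references the productcategory hub twice via two hash keys:

```sql
-- models/raw_vault/productcategory_hierarchy_l.sql
-- Hierarchical link: the productcategory hub is referenced twice,
-- once for the child category (KatID) and once for the parent (OberKatID).
{{ datavault4dbt.link(
    link_hashkey='hk_productcategory_hierarchy_l',
    foreign_hashkeys=['hk_productcategory_h', 'hk_parent_productcategory_h'],
    source_models='stage_kategorie'
) }}
```

The two hash keys are computed in the stage model, one from KatID and one from OberKatID, so the link rows pair each category with its parent.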

Datavault4dbt Krit 1

In the Wohnort delivery, several valid rows are delivered for each customer; the set key used to differentiate them is von. This constellation is modelled as a multi-active satellite. datavault4dbt provides a standard macro for multi-active satellites, in which multi_active_key must be set to von.

Each delivered set gets the same hash diff. That way, a change in one entry leads to the complete set being included as new. Please see the attached PDF for more information.
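A minimal sketch of such a model (payload column names are assumptions; the macro's parameter names should be verified against the datavault4dbt documentation):

```sql
-- models/raw_vault/customer_wohnort_ma_s.sql
-- Multi-active satellite: 'von' is the set key, so all rows of a set
-- share one hash diff, and a change in any row reloads the whole set.
{{ datavault4dbt.ma_sat_v0(
    parent_hashkey='hk_customer_h',
    src_hashdiff='hd_customer_wohnort_ma_s',
    src_ma_key='von',
    src_payload=['strasse', 'plz', 'ort'],
    source_model='stage_wohnort'
) }}
```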

Datavault4dbt Krit 2

To track changes of relationships, we defined status satellites (suffixed with _sts), which can be used for links or hubs. They contain a CDC attribute (either "I" for insert or "D" for delete).
The load logic depends on the type of data delivery (full, partial, CDC).
The idea is that no knowledge of the load logic is necessary further downstream.
If a link has a driving key, an effectivity satellite (suffixed with _es) is placed on top of the status satellite and the link; it end-dates all entries that are no longer valid. Please see the attached PDF for more information.
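To illustrate the idea for a full delivery (this is conceptual SQL only — datavault4dbt's macros encapsulate this logic, and all table and column names are assumptions), a delete record is derived for every key that exists in the vault but is missing from the stage:

```sql
-- Conceptual: derive cdc = 'D' records from a full delivery by comparing
-- the status satellite with the current stage. Selection of only the
-- latest status record per key is omitted for brevity.
select
    sts.hk_order_customer_l,
    current_timestamp as ldts,
    'D' as cdc
from order_customer_l_sts sts
left join stage_order stg
    on stg.hk_order_customer_l = sts.hk_order_customer_l
where stg.hk_order_customer_l is null   -- key no longer delivered
  and sts.cdc <> 'D'                    -- not already marked as deleted
```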

Datavault4dbt Krit 3

We decided to define a non-historized link (suffixed with _nhl) for this case. Solving it with a keyed instance or a dependent-child link would be possible as well. Please see the attached PDF for more information.
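A sketch of a non-historized link model (entity and payload names here are purely illustrative, and the macro's parameter names should be checked against the datavault4dbt documentation):

```sql
-- models/raw_vault/order_position_nhl.sql
-- Non-historized link: transactional payload is stored directly on the
-- link because the rows never change after being delivered.
{{ datavault4dbt.nh_link(
    link_hashkey='hk_order_position_nhl',
    foreign_hashkeys=['hk_order_h', 'hk_product_h'],
    payload=['menge', 'preis'],
    source_models='stage_position'
) }}
```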

Datavault4dbt Krit 4


There are orders from the webshop and the roadshow. The orders and positions of both source systems are loaded into the Raw Vault hubs order_h and position_h.
Descriptive data is stored in separate satellites for each source system.
Each source table loading into position_h must be defined as a source model in the standard datavault4dbt hub macro.

In the roadshow dataset no positionID is available, which is why we defined one at stage level by concatenating bestellungid and produktid. Please see the attached PDF for more information.
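A sketch of the roadshow stage model (the exact parameter shapes of the stage macro — in particular for derived_columns and hashed_columns — vary by datavault4dbt version and should be taken from its documentation; all names here are illustrative):

```sql
-- models/stages/stage_position_rs.sql
-- Roadshow stage: a surrogate positionID is derived by concatenating
-- bestellungid and produktid before the hash key is computed.
{{ datavault4dbt.stage(
    source_model='position_roadshow',
    ldts='load_ts',
    rsrc="'roadshow'",
    derived_columns={
        'positionid': "bestellungid || '-' || produktid"
    },
    hashed_columns={
        'hk_position_h': ['positionid'],
        'hd_position_rs_s': ['menge', 'preis']
    }
) }}
```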

Datavault4dbt Krit 5

The data for delivery adherence is delivered twice (on the first and the third delivery day), with changes to the data in between.
These changes should be reflected in the calculation of the delivery adherence.
datavault4dbt provides two macros to create a reference table and a historized reference table: ref_hub and ref_sat_v0.
They create a Data Vault structure very similar to the standard hub and satellite, with the difference that natural keys are used instead of hash keys. Please see the attached PDF for more information.
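A sketch of the pair of reference models (the parameter names of ref_hub and ref_sat_v0 are written from memory and must be verified against the datavault4dbt documentation; all entity and column names are assumptions):

```sql
-- models/raw_vault/delivery_adherence_ref_h.sql
-- Reference hub: keyed by the natural key instead of a hash key.
{{ datavault4dbt.ref_hub(
    ref_keys='adherence_code',
    source_models='stage_delivery_adherence'
) }}

-- models/raw_vault/delivery_adherence_ref_s.sql
-- Historized reference satellite: tracks the changes between the
-- first and the third delivery day.
{{ datavault4dbt.ref_sat_v0(
    parent_ref_keys='adherence_code',
    src_hashdiff='hd_delivery_adherence_ref_s',
    src_payload=['description', 'target_days'],
    source_model='stage_delivery_adherence'
) }}
```

In dbt each model lives in its own .sql file; the two are shown together here only for brevity.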

Datavault4dbt Krit 6

In the load layer, duplicates are detected with a window function. This test is part of the error mechanism described in #17.
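A sketch of such a duplicate check (table and column names are illustrative): more than one row per business key and load timestamp marks all affected rows as erroneous via the is_check_ok flag used by the error mechanism.

```sql
-- Load-layer duplicate check: count rows per business key within one load
-- and flag every row of a duplicated key as not ok.
select
    *,
    case
        when count(*) over (partition by kundeid, ldts) > 1 then false
        else true
    end as is_check_ok
from load_kunde
```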


In the load layer, rows without business keys are detected. This test is part of the error mechanism described in #17.
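The missing-business-key check follows the same pattern (names again illustrative):

```sql
-- Load-layer check: a row without a business key is flagged as not ok.
select
    *,
    (kundeid is not null) as is_check_ok
from load_kunde
```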

A very simple test case: the data of one customer (KundeID 107) is changed to a new value in delivery 2 and reverts to the values from delivery 1 in delivery 3. This type of change is handled correctly by the datavault4dbt macro sat_v0. Please see the attached PDF for more information.
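A sketch of such a satellite model (payload columns are assumptions; the macro's parameter names should be verified against the datavault4dbt documentation):

```sql
-- models/raw_vault/customer_ws_s.sql
-- Standard version-0 satellite: a hash diff that reappears (delivery 3
-- matching delivery 1) is stored as a new record, not silently skipped.
{{ datavault4dbt.sat_v0(
    parent_hashkey='hk_customer_h',
    src_hashdiff='hd_customer_ws_s',
    src_payload=['name', 'vorname', 'email'],
    source_model='stage_kunde_ws'
) }}
```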

Datavault4dbt Krit 9

In the Willibald data, business keys are deleted and reappear at a later date (e.g., customer ID '70' appears in the webshop data on the 14th and the 28th, but not on the 21st). This issue is solved by using status-tracking satellites (_sts) as explained in criterion #3.

Please see the attached pdf for more information.

Datavault4dbt Krit 10

The first delivery contains delivery addresses for which there is no record with the same customer ID (KundeID) in CUSTOMER (Kunde; KundeIDs 999, 998, and 997).
As part of the standard Raw Vault implementation, one of the sources of customer_h is also the customer column of the delivery address. That way, the three relevant rows are correctly loaded into all affected entities. Please see the attached PDF for more information.

Datavault4dbt Krit 11

Orders relevant for counting are deleted between the delivery dates: between deliveries 1 and 2, the orderIDs (BestellungID) 99, 220, and 465; between deliveries 2 and 3, the orderIDs (BestellungID) 1470 and 1288.

Within the status satellite order_ws_sts, a new record with cdc = 'D' is added for orderID 99 when delivery 2 is loaded. As a result, this order is no longer part of the "snapshot satellite" order_sns for any reporting_date after delivery 2.
Please see the attached pdf for more information.
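Conceptually, the snapshot logic can be sketched as follows (table and column names are assumptions, and QUALIFY is Snowflake syntax shown for brevity): for each reporting date, an order counts only if its most recent status record at that date is not a delete.

```sql
-- Conceptual query behind a snapshot satellite such as order_sns.
select reporting_date, hk_order_h
from (
    select
        snap.sdts as reporting_date,
        sts.hk_order_h,
        sts.cdc
    from snapshot_control snap
    join order_ws_sts sts
        on sts.ldts <= snap.sdts
    -- keep only the latest status record per order and reporting date
    qualify row_number() over (
        partition by snap.sdts, sts.hk_order_h
        order by sts.ldts desc
    ) = 1
)
where cdc <> 'D'   -- deleted orders drop out after their delete record
```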

Datavault4dbt Krit 12

We set this up as a hierarchical link. It is vital that both columns containing product-category information are loaded into the product-category hub.
Please see the attached pdf for more information.
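One way to sketch this (names are illustrative; the macro's parameter shapes should be checked against the datavault4dbt documentation): two stage models each hash their own column — KatID and OberKatID respectively — into the same hub hash key, and both feed the hub.

```sql
-- models/raw_vault/productcategory_h.sql
-- Both category columns end up in the same hub because each stage model
-- computes hk_productcategory_h from its own business-key column.
{{ datavault4dbt.hub(
    hashkey='hk_productcategory_h',
    business_keys=['kategorie_id'],
    source_models=['stage_kategorie', 'stage_kategorie_parent']
) }}
```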

Datavault4dbt Krit 13

We created a Power BI report to give access to the data. Please see the attached PDF for more information.

Datavault4dbt Krit 14

Business rules were created as views using standard SQL. Rules were implemented in the bridges (suffixed _bb) order_customer_bb and sales_bb, and in other business satellites (suffixed _bs). Please see the attached PDF for more information.

Datavault4dbt Krit 15

dbt provides data lineage at table level based on the references within the code. Column-level lineage is not available. Please see the attached PDF for more information.

Datavault4dbt Krit 16


Based on the column is_check_ok, erroneous data can easily be excluded from the further standard process and loaded into the error vault.
For each source system we defined one error satellite, with one column holding the complete raw data as JSON and one column with a summary of all failed checks. Please see the attached PDF for more information.
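The shape of such an error satellite load can be sketched as follows (names are assumptions; the JSON function depends on the target database — OBJECT_CONSTRUCT(*) is Snowflake syntax):

```sql
-- Conceptual error-vault load: rejected rows carry the full source record
-- as JSON plus a summary of the checks they failed.
select
    hk_customer_h,
    ldts,
    object_construct(*) as raw_data,       -- complete source row as JSON
    failed_checks       as error_summary   -- e.g. 'duplicate_key; missing_bk'
from load_kunde
where not is_check_ok
```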

Datavault4dbt Krit 17

dbt uses lineage information to calculate the order in which all dependent objects are built. There is no need to create specific loading chains. Please see the attached PDF for more information.

Datavault4dbt Krit 18

All programming artefacts in dbt are plain text files. dbt supports GitHub, GitLab, Bitbucket, Azure DevOps, etc. Deployment follows the Git workflow. Please see the attached PDF for more information.

Datavault4dbt Krit 19

In our project we use Jenkins to schedule and run dbt in a Docker container. Any other scheduler, such as Automic, could be used instead. Please see the attached PDF for more information.

Datavault4dbt Krit 20

Many databases are available as dbt targets. Not all of them are supported by datavault4dbt yet. Please see the attached PDF for more information.

Datavault4dbt Krit 21

Prerequisites for installing dbt are Python (3.7+) and Git.
We also installed Visual Studio Code on our Windows machines to edit the text files and perform the Git operations more comfortably. All other installations are then done using pip or dbt.
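As a sketch of the remaining installation steps (the adapter package depends on the target database, and the version range shown is only an example — check the datavault4dbt release notes for the supported range):

```shell
# Install dbt core plus a database adapter for the chosen target,
# e.g. Snowflake:
pip install dbt-core dbt-snowflake

# Add datavault4dbt to the project's packages.yml, then pull it:
#   packages:
#     - package: ScalefreeCOM/datavault4dbt
#       version: [">=1.0.0", "<2.0.0"]   # example version range
dbt deps
```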

Datavault4dbt Krit 22
