Agile Data Engine

Agile Data Engine (ADE) is a DataOps SaaS platform for developing and operating cloud-based data warehouses and lakehouses. It helps organizations increase the productivity of their data teams through continuous delivery of data products, so that teams can focus on delivering value to the business instead of managing infrastructure or building custom CI/CD pipelines.
The platform has six core capabilities: data modeling and transformations, continuous delivery, workflow orchestration, API connectivity, monitoring and testing, and insights. It is built on a metadata-driven, low-code approach that ensures knowledge about the solution is preserved regardless of personnel or platform changes. ADE supports multiple cloud database engines and various data modeling methodologies, including full automation for Data Vault and dimensional modeling.
As a multi-cloud solution, Agile Data Engine processes data within the customer's cloud environment while providing built-in automated schema changes, data quality tests, intelligent deployments, workflow management and DataOps observability.

In the source data, there are KatID and OberKatID, which represent the product hierarchy. From these attributes, a hierarchical link was created with ADE's built-in entity type LINK. References are used in ADE to express the parent-child relationship between the link entity L_PRODUCT_CATEGORY_HIERARCHY and the hub entity H_PRODUCT_CATEGORY. A status tracking satellite (entity type S_SAT) was added for the link entity to track the effectiveness of the hierarchy.
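A minimal DDL sketch of such a hierarchical link, assuming illustrative column names (ADE generates the actual structures from entity metadata, so this is an approximation, not ADE's generated DDL):

    -- Hierarchical link: two role-named references to the same hub.
    CREATE TABLE L_PRODUCT_CATEGORY_HIERARCHY (
        dv_id          CHAR(32)  NOT NULL,  -- hash of child + parent category keys
        dv_id_category CHAR(32)  NOT NULL,  -- reference to H_PRODUCT_CATEGORY (KatID)
        dv_id_parent   CHAR(32)  NOT NULL,  -- reference to H_PRODUCT_CATEGORY (OberKatID)
        dv_load_time   TIMESTAMP NOT NULL,
        dv_run_id      BIGINT    NOT NULL,
        PRIMARY KEY (dv_id)
    );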

Hierarchical Link

For multi-active satellites, ADE provides the built-in entity type SAT_MA. ADE handles multi-activity by adding a sub-sequence number (dv_running_id) to allow multiple active records per business key.
The hash difference (dv_datahash) is calculated over the ordered array of incoming records, so the set of data per business key can be historized as a whole.
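A rough Snowflake-style sketch of that pattern, assuming a staging table stg_category with a hypothetical multi-active attribute attribute_value (ADE generates the real load logic):

    -- Hash the ordered set of rows per business key, so any change in the
    -- set yields a new dv_datahash; dv_running_id numbers the active rows.
    WITH set_hash AS (
        SELECT kat_id,
               MD5(LISTAGG(attribute_value, '||')
                   WITHIN GROUP (ORDER BY attribute_value)) AS dv_datahash
        FROM stg_category
        GROUP BY kat_id
    )
    SELECT MD5(CAST(s.kat_id AS VARCHAR))                 AS dv_id,
           ROW_NUMBER() OVER (PARTITION BY s.kat_id
                              ORDER BY s.attribute_value) AS dv_running_id,
           h.dv_datahash,
           s.attribute_value
    FROM stg_category s
    JOIN set_hash h ON h.kat_id = s.kat_id;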

Multi-Active Satellite

To identify relationships and their validity, ADE provides the built-in entity type S_SAT, a status tracking satellite that tracks the effectiveness of records by the driving key in hub or link tables. ADE distinguishes between incoming data that is incremental (changed data) and full table extracts: each status satellite load can be toggled between full extract mode, which detects hard deletions as well as relationship changes, and incremental mode, which only detects relationship changes in the changed data.
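A simplified sketch of how full extract mode can detect hard deletions, with assumed table names (the logic ADE generates differs in detail):

    -- Keys that are currently active in the status satellite but missing
    -- from the staged full extract have been hard-deleted in the source.
    SELECT s.dv_id
    FROM   STS_ENTITY s
    WHERE  s.dv_status = 'A'              -- latest status row is 'active'
      AND  NOT EXISTS (
             SELECT 1
             FROM   stg_full_extract f
             WHERE  MD5(CAST(f.business_key AS VARCHAR)) = s.dv_id
           );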

Validity in Relationships

For a table without a primary key, whose records usually occur only once, a transactional link (also known as a non-historized link) was used. ADE's standard entity type LINK can be used for transactional links. A common naming convention is to add a "T" to the name, for example L_T_DELIVERY. The link primary key (DV_ID) is a composite of all foreign keys.
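A sketch of the corresponding load, assuming the delivery transaction references an order and a product (foreign key columns are illustrative):

    -- DV_ID is a hash composed of all foreign keys of the transaction.
    INSERT INTO L_T_DELIVERY (dv_id, dv_id_order, dv_id_product, dv_load_time)
    SELECT MD5(bestellungid || '||' || produktid) AS dv_id,
           MD5(bestellungid)                      AS dv_id_order,
           MD5(produktid)                         AS dv_id_product,
           CURRENT_TIMESTAMP
    FROM   stg_delivery;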

M:N Table without Primary Key

Business keys from both sources were loaded to the same hubs: H_ORDER (bestellungid) and H_ORDER_LINE (Roadshow: bestellungid+produkt, Webshop: bestellungid+posid). In ADE's entity-driven approach, this meant defining separate loads for both sources on the same hubs. Satellites were split by source system and by the granularity of the data, meaning the order data from Roadshow was split into two satellites. Link tables were separated by source system, since the unit of work differs between these sources.
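In plain SQL terms, the two loads on H_ORDER could look roughly like this (staging table names and the dv_record_source column are assumptions; ADE generates the loads from metadata):

    -- Load 1: Roadshow source.
    INSERT INTO H_ORDER (dv_id, bestellungid, dv_load_time, dv_record_source)
    SELECT DISTINCT MD5(r.bestellungid), r.bestellungid, CURRENT_TIMESTAMP, 'ROADSHOW'
    FROM   stg_roadshow_order r
    WHERE  NOT EXISTS (SELECT 1 FROM H_ORDER h WHERE h.dv_id = MD5(r.bestellungid));

    -- Load 2: Webshop source, defined separately on the same hub.
    INSERT INTO H_ORDER (dv_id, bestellungid, dv_load_time, dv_record_source)
    SELECT DISTINCT MD5(w.bestellungid), w.bestellungid, CURRENT_TIMESTAMP, 'WEBSHOP'
    FROM   stg_webshop_order w
    WHERE  NOT EXISTS (SELECT 1 FROM H_ORDER h WHERE h.dv_id = MD5(w.bestellungid));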

Early Integration


Historized RefTable


Two duplicate scenarios were identified: true duplicates, and two different products sharing the same ProduktId=21. For true duplicates, only distinct values were inserted into the satellite (a sketch of that distinct insert follows the list below). The ProduktId=21 issue, where two different products shared the same ID, can be handled in two ways:

1) Load the data as-is and filter in the Business Vault/mart while investigating with the source system team and adding test cases, or
2) Prevent duplicate keys from being loaded within batches using ADE's GATEKEEPER load step.
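A minimal Snowflake-style sketch of the distinct insert, with assumed table and column names:

    -- Keep one row per business key and data hash within the batch,
    -- so true duplicates never reach the satellite.
    INSERT INTO S_PRODUCT (dv_id, dv_datahash, dv_load_time, produkt_name)
    SELECT MD5(produktid), MD5(produkt_name), CURRENT_TIMESTAMP, produkt_name
    FROM   stg_product
    QUALIFY ROW_NUMBER() OVER (PARTITION BY produktid, MD5(produkt_name)
                               ORDER BY produktid) = 1;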

Duplicate Data

Records without business keys are loaded to Raw Data Vault hub entities with an unknown business key, marked as '-1'. This default logic is configurable.

A record without a business key should not be inserted into a satellite table. By default such records are inserted, but there is a configuration to prevent these "zero keys" from being inserted into a satellite. With ADE load steps and load options, one can prevent zero keys from being loaded and attach a template that loads the zero-key records to the Error Vault instead.
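On the hub side, the default '-1' logic corresponds to something like this sketch (column names assumed):

    -- Missing business keys collapse to the configurable unknown member '-1'.
    INSERT INTO H_CUSTOMER (dv_id, kundeid, dv_load_time)
    SELECT DISTINCT MD5(COALESCE(CAST(kundeid AS VARCHAR), '-1')),
                    COALESCE(CAST(kundeid AS VARCHAR), '-1'),
                    CURRENT_TIMESTAMP
    FROM   stg_customer;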

Rows without Business Key

Customer KundeId=107 changes between deliveries, and all of that history is stored in the satellite. By default, a dv_datahash attribute is added to each satellite (SAT) entity in ADE. The common pattern is to map all descriptive attributes to dv_datahash, so that attribute changes can be detected through the hash difference between records.
The default attributes dv_load_time and dv_run_id can be used to track when a record was loaded and in which batch.
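A simplified sketch of the hash-difference check (the descriptive columns name and city are assumptions; ADE generates the actual satellite load):

    -- Insert a row only when the incoming hash differs from the latest
    -- stored hash for the same key, or the key is new.
    INSERT INTO S_CUSTOMER_WEBSHOP (dv_id, dv_datahash, dv_load_time, dv_run_id, name, city)
    SELECT MD5(s.kundeid), MD5(s.name || '||' || s.city),
           CURRENT_TIMESTAMP, 42, s.name, s.city     -- 42: batch id from orchestration
    FROM   stg_customer s
    LEFT JOIN (
        SELECT dv_id, dv_datahash,
               ROW_NUMBER() OVER (PARTITION BY dv_id
                                  ORDER BY dv_load_time DESC) AS rn
        FROM   S_CUSTOMER_WEBSHOP
    ) cur ON cur.dv_id = MD5(s.kundeid) AND cur.rn = 1
    WHERE  cur.dv_id IS NULL
       OR  cur.dv_datahash <> MD5(s.name || '||' || s.city);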

Changes in Customer Data

Customer KundeId=70 appeared in delivery 1, was deleted in delivery 2 and reappeared in delivery 3. The solution was to use a status tracking satellite (S_SAT) in full extract mode to track hard deletes in the source data. This status satellite, STS_CUSTOMER, was attached to the hub H_CUSTOMER.

With this solution pattern, we can query the hard deletes from the Data Vault and see when the record was active and when it was inactive.
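For example, the history of KundeId=70 can be read directly from the status satellite (dv_status values and column names are illustrative):

    -- One row per status change: active after delivery 1, deleted after
    -- delivery 2, active again after delivery 3.
    SELECT h.kundeid, s.dv_status, s.dv_load_time
    FROM   H_CUSTOMER h
    JOIN   STS_CUSTOMER s ON s.dv_id = h.dv_id
    WHERE  h.kundeid = '70'
    ORDER BY s.dv_load_time;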

Deletions of Customers

There are delivery addresses with a CustomerID that is missing from the customer data. The standard Data Vault approach is used, meaning H_CUSTOMER is loaded from all sources where the business key occurs. CustomerIDs 999, 998 and 997 are loaded to H_CUSTOMER and to the link L_DELIVERY_ADDRESS_CUSTOMER from the delivery address source (Lieferaddresse). As there are no customer records, the satellite S_CUSTOMER_WEBSHOP does not contain those customers.
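Such orphaned keys are easy to find afterwards, for example with a query along these lines (business key column name assumed):

    -- Customers known only from the delivery address source: present in
    -- the hub, absent from the describing satellite.
    SELECT h.kundeid
    FROM   H_CUSTOMER h
    WHERE  NOT EXISTS (SELECT 1
                       FROM   S_CUSTOMER_WEBSHOP s
                       WHERE  s.dv_id = h.dv_id);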

Invalid Foreign Keys

A status tracking satellite can be used to track hard deletes in the data. In this case, the status tracking satellite STS_ORDER was created in full extract mode in ADE, and the load logic was generated automatically. This means deletions are historized correctly, which enables auditability.

Deletion of Orders

The change in product hierarchy was handled with a Point-in-Time (PIT) entity, which is a default entity type in ADE.
The PIT table is populated from the hierarchical link, the status tracking satellite and the related Raw Data Vault entities.
With those in place, it is possible to see changes in the dimension depending on the reporting date.
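A sketch of reading the dimension as of a reporting date via the PIT table (table and column names are illustrative; ADE generates the PIT structure and load):

    -- The PIT row pins, per snapshot date, which satellite version was valid.
    SELECT h.kat_id, sat.category_name
    FROM   PIT_PRODUCT_CATEGORY p
    JOIN   H_PRODUCT_CATEGORY h ON h.dv_id = p.dv_id
    JOIN   S_PRODUCT_CATEGORY sat
           ON  sat.dv_id = p.dv_id
           AND sat.dv_load_time = p.s_product_category_load_time
    WHERE  p.snapshot_date = DATE '2024-06-30';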

Changes in Dimensions

To produce the key figures, a data mart was used to publish the data for the dashboard. In the data mart, a fact table with associated dimensions was implemented. We used a Databricks Dashboard to visualize the results.
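An illustrative mart query of the kind behind such a dashboard (fact and dimension names are assumptions):

    -- Revenue per product category from the star schema.
    SELECT d.category_name, SUM(f.amount) AS revenue
    FROM   F_ORDER_LINE f
    JOIN   D_PRODUCT d ON d.product_key = f.product_key
    GROUP  BY d.category_name
    ORDER  BY revenue DESC;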

Key Figures

Agile Data Engine is a SQL-based tool, which means all business rules are implemented as SQL. In this case, a Business Data Vault was used with Point-in-Time tables and views containing the business logic.
It is ADE best practice to separate business logic into dedicated packages, keeping changes to the business logic as modular as possible.
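As an illustration of that pattern (the rule and all names below are hypothetical, not the project's actual logic):

    -- Business rule kept as a view in its own package, so it can be
    -- changed without touching the Raw Data Vault.
    CREATE VIEW BV_ORDER_CLASSIFIED AS
    SELECT o.bestellungid,
           CASE WHEN s.amount >= 1000 THEN 'LARGE' ELSE 'STANDARD' END AS order_size
    FROM   H_ORDER o
    JOIN   S_ORDER s ON s.dv_id = o.dv_id;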

Business Rules


Agile Data Engine provides a real-time, interactive data lineage view at the entity level, based on the defined entity loads. A high-level lineage is also available at the data package level.
The visual representation of data lineage is maintained automatically based on the actual load implementations.
Data lineage can also be accessed from the ADE Insights module or via the External API, which enables integration with third-party tools such as data catalog solutions.

Data Lineage

Agile Data Engine provides a set of built-in error handling mechanisms:
– Promotion and deployment errors (including database schema changes) are captured and handled by the Deployment Management module.
– Workflow orchestration load errors are handled automatically by an integrated Airflow instance per environment.
– Data quality checks: smoke tests, gatekeepers and alerts.
Errors can be reported via a selection of notification channels or directly in the ADE UI.

Error Handling


Workflow management is an integral, built-in part of Agile Data Engine, and no administration or configuration effort is required from end users. Agile Data Engine uses an embedded, fully managed Airflow instance per environment to execute the workflows.
Reusable workflow definitions are flexible and configurable in the UI, code generation is fully automated, and the definitions form part of the internal CI/CD pipeline.

Orchestration

Agile Data Engine's built-in and fully integrated Deployment Management module manages the CI/CD processes. The runtime metadata of each environment is stored in an internal repository and includes both database structure definitions and data load metadata.
ADE is able to determine the optimal way to deploy the object structure SQL and workflow management Python code into an environment, i.e. it identifies the needed changes by comparing the state of objects instead of deploying code sequentially.

Deployment


In ADE, workflows can be scheduled in several ways:

– The most common way is to schedule them using cron expressions (for example, 0 6 * * * for a daily run at 06:00).
– When workflows have interdependencies, they can be triggered upon completion of the workflows they depend on; a workflow can also be run on its own.
– Integration with external tools allows running workflows or checking their status via API calls.
– For development, testing or fixing production runs, data loads can be triggered manually.

Scheduling

As of Q4 2024, Agile Data Engine supports the following database systems as targets:

– Databricks SQL
– Snowflake
– Amazon Redshift
– Azure SQL Database
– Azure Synapse SQL (formerly Azure SQL DW)
– Microsoft Fabric
– Google BigQuery

Supported Databases

Agile Data Engine is a SaaS cloud offering, and no installation is required (no additional components, virtual machines or client agents).
ADE services are accessed via a web browser.

A new Agile Data Engine tenant setup involves a few basic steps:
– Configuration of the target databases
– Definition of runtimes (environments)
– Privileges management
– Setting up naming conventions and development standards

Installation
