Author: Sakshi Agrawal
This section gives a brief idea about how the Incremental load strategy in DBT works, how efficient and beneficial it is in terms of CDC, and finally saving time and resources.
In this blog, we have tried to explain the implementation that we have done in one of our projects considering incremental load strategy with DBT.
Let's start with what exactly DBT is.
DBT (Data Build Tool )
Data Build Tool(DBT), is a Python application that is a transformation tool that focuses on the Transformation part in ELT (Extract, Load, Transform) processes.
It mainly aims at performing data engineering activities before the data is used for analytics.
DBT performs transformation using select statements, which in turn convert into tables and views, thus making the transformation process simple and effective.
This tool is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineerings best practices like modularity, portability, CI/CD, and documentation.
Now that we have a brief knowledge of DBT in order to understand incremental strategy, let’s first look into what is a non-incremental strategy with DBT
WHAT IS A NON-INCREMENTAL STRATEGY?
Non-incremental strategy is nothing but a general append data load approach
Here all the records in the tables are transformed and processed in every run which is not the case with incremental
In our project, we have 2 layers in DBT, staging, and mart layers.
The non-incremental approach is applied to source database tables present in snowflake. For non-incremental we don't need a staging layer; it will directly get the data from the source database present in snowflake and put it in the mart layer of DBT.
Further as per requirement, incremental and non-incremental strategy is performed in the mart's layer
As we covered non-incremental strategy, moving forward we will see what are materializations and what connects incremental strategy with Materialization
Materialization in DBT
Materialization in DBT is a strategy for maintaining the DBT models in a warehouse.
DBT provides four types of inbuilt materializations
Every time a model is run, it is rebuilt as one of the above 4 types e.g. as a View or a Table or in a different way as handled by the Incremental or Ephemeral type. We will be focusing on the Incremental type in this blog
Let's take a look at Incremental materialization provided by DBT.
WHAT IS INCREMENTAL STRATEGY
Incremental models allow DBT to insert or update records into a table since the last time that DBT was run. With help of incremental models, you can significantly reduce the build time by just transforming new records(Only the latest records will be processed). Incremental models require extra configuration and are part of the advanced usage of DBT.
For Eg - Your database table consists of 10,000 records that have been processed already, now 500 new records are added to the table so instead of processing all the 10500 records, the incremental strategy will only process, and transform the newly added 500 in the next run which will save the compute time as well as resources.
Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, DBT transforms only the rows in your source data that you tell DBT to filter for, inserting them into the target table which is the table that has already been built.
Let’s have a look at an example that we have considered.
In Our Expense Management System, we have our source database (EXPENSE_MANAGEMENT_DB) and streams created over its tables to capture the CDC data in Snowflake we have connected our snowflake to DBT using partner connect, once the data is in the streams further we have created a couple of models in DBT as below -
In Staging models, the sources are the streams that are created in Snowflake. Once the data is in streams, it will be further consumed by DBT staging models and incremental strategy will be performed on top of it in the mart's layer, i.e, n the first run it will consider all rows coming from the source but in the subsequent run, it will consider only the rows coming from streams. Implementation of incremental strategy along with some rules will be seen as we proceed with this blog
In Marts models, if the run is incremental run the sources will be staging models and if it is a non-incremental run/normal run the sources will be directly the source database from snowflake which is already mentioned in the sources.yml file in the model structure of DBT.
HOW TO IMPLEMENT THE INCREMENTAL MATERIALIZATION?
As mentioned earlier, we are trying to get the CDC data from snowflake streams in the staging model, here Source is bank_staging which is nothing but a model created when we are consuming data from Snowflake streams
Bank_check2 is a stream attached to the source database on which transformation has to be performed
Once the data is in staging models they are further populated in marts models. There we have a macro i.e is_incremental(), which takes care of the filter part for selecting only incremental records. More about is_Incremental() is mentioned in the further section. Incremental materialization will be applied when run hits this part of the code and fulfills the conditions written in this macros
The below configurations have to be defined prior to going to the is_incremental macro
Materialized = incremental: This suggests that materialization used in this model is incremental, i.e., incremental models
The unique_key constraint: This parameter which can be treated as optional determines whether a record has new values and should be updated or not. By using unique_key, you can ensure that each row from the source table is represented by a single row in your incremental model, without any duplicates.
Not specifying a unique_key can result in append-only behavior, which means nothing but DBT inserts all rows returned by the model's SQL into the preexisting target table regardless of whether the rows represent duplicates
Tags: dbt tags are very useful to select models depending on the situation by taking advantage of model selection syntax, for more info related to tags please refer below link
THE IS_INCREMENTAL() MACRO:
The is_incremental() macro will return True if:
The destination table already exists in the database
DBT is not running in full-refresh mode
The running model is configured with materialized='incremental'
Incremental models in DBT are defined with select statements, with the materialization defined in a config block within the same file
For eg - in our case, for the is_incremental() macro we have included the logic to populate the table with source as bank_staging which is being populated with streams
The ref function is used to give the reference of the staging model which in turn is built from snowflake streams, below screenshot depicts the is_incremental() macro used in our project
There might be a need to rebuild your incremental model, force DBT to rebuild the entire incremental model from scratch, and use the --full-refresh flag on the command line. The full refresh flag will cause DBT to drop the existing target table in the database before rebuilding the model for all time.
In Addition, we have performed some prehook expressions which will truncate all the table before building the model
WORKING OF INCREMENTAL MODEL BEHIND THE SCENES :
A merge statement in DBT is used to insert new records and update existing records.
On warehouses that do not support merge statements, a merge statement is implemented by first using a delete statement which deletes records in the target table that are to be updated, and then an insert statement is executed
For more info please refer to the below link: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models#how-do-incremental-models-work-behind-the-scenes
To summarize, we saw,
The different materialization strategies supported by DBT
How incremental Strategy allows you to run transformations only on the delta records through the is_incremental macro using the filter conditions
The configurations that have to be declared prior to is_incremental macro in model
Finally the benefits of using Incremental strategy help you to save your time as well as Compute resources.
Hope this blog helps you if you are in a similar situation and enables you to leverage the use of DBT in the best possible way.