top of page

DBT Incremental Strategy

Updated: Mar 16

Author: Sakshi Agrawal


Overview

This section gives a brief idea about how the Incremental load strategy in DBT works, how efficient and beneficial it is in terms of CDC, and finally saving time and resources.


In this blog, we have tried to explain the implementation that we have done in one of our projects considering incremental load strategy with DBT.

Let's start with what exactly DBT is.


DBT (Data Build Tool )
  • Data Build Tool(DBT), is a Python application that is a transformation tool that focuses on the Transformation part in ELT (Extract, Load, Transform) processes.

  • It mainly aims at performing data engineering activities before the data is used for analytics.

  • DBT performs transformation using select statements, which in turn convert into tables and views, thus making the transformation process simple and effective.

  • This tool is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineerings best practices like modularity, portability, CI/CD, and documentation.



  • Now that we have a brief knowledge of DBT in order to understand incremental strategy, let’s first look into what is a non-incremental strategy with DBT


WHAT IS A NON-INCREMENTAL STRATEGY?
  • Non-incremental strategy is nothing but a general append data load approach

Here all the records in the tables are transformed and processed in every run which is not the case with incremental


In our project, we have 2 layers in DBT, staging, and mart layers.


The non-incremental approach is applied to source database tables present in snowflake. For non-incremental we don't need a staging layer; it will directly get the data from the source database present in snowflake and put it in the mart layer of DBT.


Further as per requirement, incremental and non-incremental strategy is performed in the mart's layer



  • As we covered non-incremental strategy, moving forward we will see what are materializations and what connects incremental strategy with Materialization


Materialization in DBT
  • Materialization in DBT is a strategy for maintaining the DBT models in a warehouse.

  • DBT provides four types of inbuilt materializations


1. Table

2. View

3. Incremental

4. ephemeral


  • Every time a model is run, it is rebuilt as one of the above 4 types e.g. as a View or a Table or in a different way as handled by the Incremental or Ephemeral type. We will be focusing on the Incremental type in this blog


Let's take a look at Incremental materialization provided by DBT.


WHAT IS INCREMENTAL STRATEGY
  • Incremental models allow DBT to insert or update records into a table since the last time that DBT was run. With help of incremental models, you can significantly reduce the build time by just transforming new records(Only the latest records will be processed). Incremental models require extra configuration and are part of the advanced usage of DBT.

  • For Eg - Your database table consists of 10,000 records that have been processed already, now 500 new records are added to the table so instead of processing all the 10500 records, the incremental strategy will only process, and transform the newly added 500 in the next run which will save the compute time as well as resources.

  • Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, DBT transforms only the rows in your source data that you tell DBT to filter for, inserting them into the target table which is the table that has already been built.

  • Let’s have a look at an example that we have considered.


In Our Expense Management System, we have our source database (EXPENSE_MANAGEMENT_DB) and streams created over its tables to capture the CDC data in Snowflake we have connected our snowflake to DBT using partner connect, once the data is in the streams further we have created a couple of models in DBT as below -


1. Staging

  • In Staging models, the sources are the streams that are created in Snowflake. Once the data is in streams, it will be further consumed by DBT staging models and incremental strategy will be performed on top of it in the mart's layer, i.e, n the first run it will consider all rows coming from the source but in the subsequent run, it will consider only the rows coming from streams. Implementation of incremental strategy along with some rules will be seen as we proceed with this blog

2. Marts

  • In Marts models, if the run is incremental run the sources will be staging models and if it is a non-incremental run/normal run the sources will be directly the source database from snowflake which is already mentioned in the sources.yml file in the model structure of DBT.



HOW TO IMPLEMENT THE INCREMENTAL MATERIALIZATION?
  • As mentioned earlier, we are trying to get the CDC data from snowflake streams in the staging model, here Source is bank_staging which is nothing but a model created when we are consuming data from Snowflake streams




  • Bank_check2 is a stream attached to the source database on which transformation has to be performed

  • Once the data is in staging models they are further populated in marts models. There we have a macro i.e is_incremental(), which takes care of the filter part for selecting only incremental records. More about is_Incremental() is mentioned in the further section. Incremental materialization will be applied when run hits this part of the code and fulfills the conditions written in this macros


The below configurations have to be defined prior to going to the is_incremental macro



  • Materialized = incremental: This suggests that materialization used in this model is incremental, i.e., incremental models

  • The unique_key constraint: This parameter which can be treated as optional determines whether a record has new values and should be updated or not. By using unique_key, you can ensure that each row from the source table is represented by a single row in your incremental model, without any duplicates.

  • Not specifying a unique_key can result in append-only behavior, which means nothing but DBT inserts all rows returned by the model's SQL into the preexisting target table regardless of whether the rows represent duplicates

  • Tags: dbt tags are very useful to select models depending on the situation by taking advantage of model selection syntax, for more info related to tags please refer below link


THE IS_INCREMENTAL() MACRO:

The is_incremental() macro will return True if:

  • The destination table already exists in the database

  • DBT is not running in full-refresh mode

  • The running model is configured with materialized='incremental'

  • Incremental models in DBT are defined with select statements, with the materialization defined in a config block within the same file



  • For eg - in our case, for the is_incremental() macro we have included the logic to populate the table with source as bank_staging which is being populated with streams

  • The ref function is used to give the reference of the staging model which in turn is built from snowflake streams, below screenshot depicts the is_incremental() macro used in our project



  • There might be a need to rebuild your incremental model, force DBT to rebuild the entire incremental model from scratch, and use the --full-refresh flag on the command line. The full refresh flag will cause DBT to drop the existing target table in the database before rebuilding the model for all time.



  • In Addition, we have performed some prehook expressions which will truncate all the table before building the model




WORKING OF INCREMENTAL MODEL BEHIND THE SCENES :

CONCLUSION

To summarize, we saw,

  • The different materialization strategies supported by DBT

  • How incremental Strategy allows you to run transformations only on the delta records through the is_incremental macro using the filter conditions

  • The configurations that have to be declared prior to is_incremental macro in model

  • Finally the benefits of using Incremental strategy help you to save your time as well as Compute resources.

Hope this blog helps you if you are in a similar situation and enables you to leverage the use of DBT in the best possible way.


REFERENCES

3,606 views0 comments

Recent Posts

See All
bottom of page