DBT Incremental Strategy

Author: Sakshi Agrawal


Overview

This section gives a brief idea about how the Incremental load strategy in Dbt works , how efficient and beneficial it is in terms of CDC , finally saving time and resources.

In this blog we have tried to explain the implementation that we have done in one of our projects considering incremental load strategy with dbt .

Let's start with what exactly DBT is.


DBT (Data Build Tool )
  • Data Build Tool(DBT), is a Python application which is a transformation tool that focuses on the Transformation part in ELT (Extract, Load, Transform) processes.

  • It mainly aims at performing data engineering activities before the data is used for analytics.

  • DBT performs transformation using select statements, which in turn converts into tables and views, thus making the transformation process simple and effective.

  • This tool is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.



  • Now as we have a brief knowledge of DBT in order to understand incremental strategy , let’s first look into what is a non-incremental strategy with DBT


WHAT IS NON-INCREMENTAL STRATEGY ?
  • Non-incremental strategy is nothing but a general append data load approach

Here all the records in the tables are transformed , processed in every run which is not the case with incremental


In our project we have 2 layers in dbt , staging and mart layer.


Non-incremental approach is applied on source database tables present in snowflake. For non-incremental we don't need a staging layer; it will directly get the data from the source database present in snowflake and put it in mart layer of dbt .


Further as per requirement, incremental and non-incremental strategy is performed in marts layer



  • As we covered non-incremental strategy , moving forward we will see what is materializations and what connects incremental strategy with Materialization


Materialization in DBT
  • Materialization in DBT are strategies for maintaining the dbt models in a warehouse.

  • Dbt provides four types of inbuilt materializations


1.Table

2.View

3.Incremental

4.ephemeral


  • Everytime a model is run , it is rebuilt as one of the above 4 types e.g. as a View or a Table or in a different way as handled by Incremental or Ephemeral type. We will be focusing on Incremental type in this blog


Let's take a look at Incremental materialization provided by dbt


WHAT IS INCREMENTAL STRATEGY
  • Incremental models allow dbt to insert or update records into a table since the last time that dbt was run.With help of incremental models ,you can significantly reduce the build time by just transforming new records(Only the latest records will be processed).Incremental models require extra configuration and is part of advanced usage of dbt.

  • For Eg - Your database table consists of 10,000 records which have been processed already ,now 500 new records are added to the table so instead of processing all the 10500 records, the incremental strategy will only process , transform the newly added 500 in the next run which will save the compute time as well as resources.

  • Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table which is the table that has already been built.

  • Let’s have a look at an example that we have considered ,


In Our Expense management System we have our source database (EXPENSE_MANAGEMENT_DB) and streams created over it’s tables to capture the CDC data in Snowflake we have connected our snowflake to dbt using partner connect , once the data is in the streams further we have created couple of models in dbt as below -


1. Staging

  • In Staging models the sources are the streams which are created in Snowflake. Once the data is in streams, it will be further consumed by dbt staging models and incremental strategy will be performed on top of it in the marts layer ,i.e n first run it will consider all rows coming from source but in subsequent run it will consider only the rows coming from streams. Implementation of incremental strategy along with some rules will be seen as we proceed with this blog

2. Marts

  • In Marts models, if the run is incremental run the sources will be staging models and if it is non-incremental run/normal run the sources will be directly the source database from snowflake which are already mentioned in sources.yml file in model structure of dbt \



HOW TO IMPLEMENT THE INCREMENTAL MATERIALIZATION ?
  • As mentioned earlier, we are trying to get the CDC data from snowflake streams in staging model , here Source is bank_staging which is nothing but a model created when we are consuming data from Snowflake streams




  • Bank_check2 is a stream attached to the source database on which transformation has to be performed

  • Once the data is in staging models they are further being populated in marts models . There we have a macro i.e is_incremental() ,which takes care of the filter part for selecting only incremental records . More about is_Incremental() is mentioned in the further section. Incremental materialization will be applied when run hits this part of code and fulfills the conditions written in this macros


Below configurations have to be defined prior going to the is_incremental macro



  • Materialized = incremental : This suggest that materialization used in this model is incremental , i.e incremental models

  • The unique_key constraint : This parameter which can be treated as optional determines whether a record has new values and should be updated or not. By using unique_key, you can ensure that each row from the source table is represented by a single row in your incremental model, without any duplicates.

  • Not specifying a unique_key can result in append-only behavior, which means nothing but dbt inserts all rows returned by the model's SQL into the preexisting target table regardless for whether the rows represent duplicates

  • Tags : dbt tags is very useful to select models depending on the situation by taking advantage of model selection syntax , for more info related to tags please refer below link


THE IS_INCREMENTAL() MACRO:

The is_incremental() macro will return True if:

  • The destination table already exists in the database

  • DBT is not running in full-refresh mode

  • The running model is configured with materialized='incremental'

  • Incremental models in dbt are defined with select statements, with the materialization defined in a config block within the same file



  • For eg - in our case , for is_incremental() macro we have included the logic to populate the table with source as bank_staging which is being populated with streams

  • The ref function is used to give the reference of staging model which in turn is built from snowflake streams , below screenshot depicts the is_incremental() macro used in our project



  • There might be a need to rebuild your incremental model ,to force dbt to rebuild the entire incremental model from scratch, use the --full-refresh flag on the command line. The full refresh flag will cause dbt to drop the existing target table in the database before rebuilding the model for all-time.



  • In Addition we have performed some prehook expressions which will truncate all the table before building the model




WORKING OF INCREMENTAL MODEL BEHIND THE SCENES :

CONCLUSION

To summarize, we saw ,

  • The different materialization strategies supported by DBT

  • How incremental Strategy allows you to run transformations only on the delta records through the is_incremental macro using the filter conditions

  • The configurations that has to be declared prior to is_incremental macro in model

  • Finally the benefits of using Incremental strategy helping you to save your time as well as Compute resources .

Hope this blog helps you if you are in a similar situation and enables you to leverage the use of DBT in the best possible way.


REFERENCES

185 views0 comments

Recent Posts

See All