Author: Rini Shiny M
This blog describes the integration of DBT with Snowflake.
Fig 1. Role of DBT in data analytics
Introduction to DBT
DBT (Data Build Tool), a Python application is a transformation tool, which focuses on the Transformation part in ELT (Extract, Load, Transform) processes. It aims at performing data engineering activities before the data is used for analytics. DBT performs transformation using select statements, which in turn convert into tables and views, thus making the transformation process simple and effective.
DBT comes in two variants:
i) DBT CLI – Command Line Interface, which is run by a terminal. ii)DBT Cloud – A web-based application along with an IDE.
Fig 2. Data Build Tool
Key features of DBT
1. Documentation The documentation offered by DBT is updated as to when models are developed. It is also easily accessible throughout the development. The documentation is generated based on the descriptions provided, dependencies between models, SQL files, sources, and the tests defined.
2. Data Lineage The data pipeline in DBT is represented in the form of lineage graphs. This gives proper visibility of the data and how the data maps with the business logic. The complete flow of data from the source to the target system is shown in the graph, thus providing transparency.
3. Version Control Integration of GIT with DBT has made version control come hand in hand. All the models, tests generated, sources, packages used, and other configurations used in the project are versioned in the connected GIT repository.
4. Testing DBT comes with prebuilt unique, not null, referential integrity, and accepted value testing. Jinja and SQL can be used to write custom test cases to add more testing features apart from the default tests.
5. Reusability In DBT, it is possible to develop models that can be reused by using the Jinja framework.
6. Data refreshes within DBT Cloud Using DBT, it is possible to schedule refresh at the production environment according to the business requirement without the need for an orchestration tool.
Why DBT and Snowflake?
Snowflake is a data warehouse hosted as a Software-as-a-Service (SaaS) which is faster, user-friendly, and more flexible than a traditional data warehouse. Snowflake’s data warehouse is developed using a SQL database engine with an architecture specifically designed for the cloud environment. Snowflake is built using multi-cluster, shared data architecture- This makes the Data Storage and Compute layer, and the query processing layer separated. DBT is an open-source tool that is used to manage the ELT load in Snowflake. DBT can be used with Snowflake for the following features.
Converting tables to views- It is sufficient to change the materialization in a single config file to change a table to a view.
Stored Procedure- The stored procedures created in dbt are shown in the models, which could be accessed and modified.
Combining transformation logic- DBT groups similar transformation logic together using DBT tags. Version Control- DBT supports version control by integrating with GitHub.
Open-source community- Could enhance the development by sharing experiences from fellow developers rather than starting from scratch.
Integration of DBT with Snowflake
Step 1: Snowflake account creation Create an account with Snowflake and note the highlighted details which will be used for database connection with DBT.
Fig 3. Snowflake Worksheet
Step 2: DBT Cloud sign up
Create an account with DBT using Try dbt Cloud Free. Once signed in, create a new project. To set up the created project, refer to the below steps.
Fig 4. DBT Project
Click on begin to proceed. Give an appropriate DBT project name on the next screen and hit continue. In the next screen, select Snowflake from the list of data warehouses.
Step 3: Connecting to Snowflake instance
The following fields are required when creating a Snowflake connection:
Account- The Snowflake account to connect to. If the URL for the Snowflake account is like abc12345.east-us-2.azure.snowflakecomputing.com, then the account name should be abc12345.east-us-2.azure.
Role- Role to be used after connecting to Snowflake. (Optional field)
Database- Establish a connection with this logical database in the data warehouse to run queries.
Warehouse- The virtual warehouse to use for running queries.
Username / Password – Enter the Snowflake username (specifically, the login_name) and the corresponding user’s Snowflake password, which will be used to authenticate DBT Cloud to run queries in Snowflake on behalf of a Snowflake user.
Key Pair – The Keypair auth method is based on Snowflake’s Key Pair Authentication to authenticate the credentials when accessed from a DBT Cloud project. After generating an encrypted key pair, rsa_public_key should be set for the Snowflake user for the authentication process.
Fig 5. Snowflake Connection with DBT
Once the credentials are given, test the connection. On a successful connection, proceed to connect DBT with an empty GIT repository.