Author: Siddharth Ghelani
Introduction
Continuous Integration and Continuous Deployment (CI/CD) are at the heart of DataOps best practices, allowing us to automate the build and deployment of Data Engineering workflows. DBT Cloud (the SaaS offering used to develop and orchestrate data transformation pipelines with DBT) currently offers Continuous Integration features only for GitHub, GitLab, and Azure DevOps-based source code repositories. If our DBT project is hosted on Bitbucket, this can make it challenging to implement these critical aspects of DataOps. This blog focuses on how to implement CI/CD for a DBT project using Bitbucket Pipelines.
Overview
The flow of events is as follows:
For working on a new feature, the developer creates a new feature branch within the DBT repository in Bitbucket.
After the work is completed, the developer raises a pull request for that branch from DBT Cloud, which redirects to Bitbucket's pull request page. There, the developer fills in the required details and submits the request.
As soon as a pull request is opened, a Continuous Integration pipeline is triggered within Bitbucket, which builds and tests the DBT models within a temporary schema in Snowflake's production environment.
At the end of the Continuous Integration run, the temporary schema will be dropped.
Once the pull request is approved and the changes are merged into the main branch, a Continuous Deployment pipeline is triggered within Bitbucket, which deploys all the models to Snowflake's production environment.
Prerequisites
Bitbucket account hosting the git repository for the DBT project
DBT cloud account for code development
Snowflake account containing the Development and Production Database
Assumptions
The source data to be transformed resides in the production database <prod_db_name>, and the transformed data is also written to an appropriate schema within the same database.
The DBT models have been configured to build all the models inside a single schema in the production database.
As part of the continuous integration pipeline, a new schema will be created by DBT for building and testing the models within the production database and will be dropped at the end of the run.
Implementation
1. Create warehouse, user, and role in Snowflake with sufficient permissions to perform continuous integration activities
use role sysadmin;
create or replace warehouse <ci_warehouse_name>
warehouse_size = xsmall
auto_suspend = 300
auto_resume = true
initially_suspended = true;
use role securityadmin;
create role <ci_role_name>;
create or replace user <ci_user_name>
password = '**'
default_role = <ci_role_name>;
grant role <ci_role_name> to user <ci_user_name>;
// Grant access to read the input data from the <prod_db_name>
grant usage on all schemas in database <prod_db_name> to role <ci_role_name>;
grant select on all tables in database <prod_db_name> to role <ci_role_name>;
grant select on all views in database <prod_db_name> to role <ci_role_name>;
grant all on database <prod_db_name> to role <ci_role_name>;
// Grant access to warehouse
grant all on warehouse <ci_warehouse_name> to role <ci_role_name>;
// Grant role to sysadmin
grant role <ci_role_name> to role sysadmin;
2. Create warehouse, user, and role in Snowflake with sufficient permissions to perform continuous deployment activities
use role sysadmin;
create or replace warehouse <cd_warehouse_name>
warehouse_size = xsmall
auto_suspend = 300
auto_resume = true
initially_suspended = true;
use role securityadmin;
create role <cd_role_name>;
create or replace user <cd_user_name>
password = '**'
default_role = <cd_role_name>;
grant role <cd_role_name> to user <cd_user_name>;
// Grant access to read the input data from the <prod_db_name>
grant usage on all schemas in database <prod_db_name> to role <cd_role_name>;
grant select on all tables in database <prod_db_name> to role <cd_role_name>;
grant select on all views in database <prod_db_name> to role <cd_role_name>;
grant all on database <prod_db_name> to role <cd_role_name>;
// Grant access to warehouse
grant all on warehouse <cd_warehouse_name> to role <cd_role_name>;
// Grant role to sysadmin
grant role <cd_role_name> to role sysadmin;
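As an optional check (not part of the original steps), the privileges granted to the two roles can be reviewed with Snowflake's show commands, using the same placeholder role names as above:
// Review the privileges granted to the CI and CD roles
show grants to role <ci_role_name>;
show grants to role <cd_role_name>;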
3. The profiles for CI and CD are maintained in the profiles.yml file within the profiles folder of the DBT project repo. The passwords, which will later be stored in Bitbucket repository variables, are referenced here using environment variables
dbt_project:
  target: ci
  outputs:
    ci:
      type: snowflake
      account: <snowflake_account_identifier>
      user: <ci_user_name>
      password: "{{ env_var('CI_USER_PASSWORD') }}"
      role: <ci_role_name>
      database: <prod_db_name>
      warehouse: <ci_warehouse_name>
      schema: <ci_schema_name>
      threads: 4
    cd:
      type: snowflake
      account: <snowflake_account_identifier>
      user: <cd_user_name>
      password: "{{ env_var('CD_USER_PASSWORD') }}"
      role: <cd_role_name>
      database: <prod_db_name>
      warehouse: <cd_warehouse_name>
      schema: <cd_schema_name>
      threads: 4
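Before committing, the profiles can optionally be validated locally with dbt debug (a quick sanity check, assuming DBT is installed locally, the placeholders have been replaced with real values, and the passwords are exported as the environment variables referenced above):
$ export CI_USER_PASSWORD='<ci_user_password>'
$ dbt debug --profiles-dir ./profiles --target ci
$ export CD_USER_PASSWORD='<cd_user_password>'
$ dbt debug --profiles-dir ./profiles --target cd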
4. Make sure that the profile name mentioned in the profiles.yml file matches the one in the dbt_project.yml file
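For reference, a minimal sketch of what the relevant part of dbt_project.yml could look like, assuming the profile is named dbt_project as in the profiles.yml above (the project name and materialization are illustrative placeholders):
name: dbt_project
version: '1.0.0'
config-version: 2
profile: dbt_project        # must match the profile name in profiles.yml
models:
  dbt_project:
    +materialized: table    # no custom +schema here, so all models build into the target schema from the profile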

5. Create a DBT macro to drop the test schema created for Continuous Integration within Snowflake and place it in the macros folder in the DBT project
{% macro drop_test_env() %}
    {#-
        To run it:
        $ dbt run-operation drop_test_env
    -#}
    {% set sql -%}
        drop schema if exists <prod_db_name>.<ci_schema_name> cascade
    {%- endset %}
    {% do run_query(sql) %}
{% endmacro %}
6. Enable Pipelines for the Bitbucket repository (under Repository settings > Pipelines > Settings)

7. Create a pipeline within Bitbucket. A file named bitbucket-pipelines.yml is created as part of this process; modify its contents with the sample code below
# Using a Docker image with DBT installed
image: fishtownanalytics/dbt:1.0.0

pipelines:
  # CI pipeline: triggered as soon as a pull request is created
  pull-requests:
    '**':
      - step:
          name: 'Build and Test'
          script:
            # Download DBT dependencies, if any
            - dbt deps --profiles-dir ./profiles
            # Compile DBT models
            - dbt compile --profiles-dir ./profiles
            # Run DBT models within the test schema
            - dbt run --profiles-dir ./profiles
            # Run automated tests within the test schema
            - dbt test --profiles-dir ./profiles
          after-script:
            # Run the DBT macro to drop the test schema in Snowflake
            - dbt run-operation drop_test_env --profiles-dir ./profiles
  # CD pipeline: triggered as soon as changes have been merged into the main branch
  branches:
    <main_branch_name>:
      - step:
          name: Deploy to production
          script:
            # Download DBT dependencies, if any
            - dbt deps --profiles-dir ./profiles --target cd
            # Compile DBT models
            - dbt compile --profiles-dir ./profiles --target cd
            # Run DBT models within the prod database
            - dbt run --profiles-dir ./profiles --target cd
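Two details worth noting: Bitbucket runs the after-script commands whether the preceding script succeeds or fails, so the temporary CI schema is dropped even if a model build or test fails. Also, the CI steps rely on ci being the default target in profiles.yml, while the deployment steps select the cd target explicitly with --target cd.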
8. Store the passwords for the CI user and CD user securely as repository variables in Bitbucket. The variable names must match the ones referenced in profiles.yml (CI_USER_PASSWORD and CD_USER_PASSWORD), and they should be marked as secured so their values are masked in the pipeline logs

Demo
1. We will open a pull request to merge the code in our feature branch into the main branch by clicking the open pull request option.

2. Once we do that, we are redirected to the Bitbucket page to create the pull request.

3. Once the request is created, it is sent to the reviewer for approval and merging. As soon as the pull request is created, the Bitbucket pipeline starts running in the background.

4. All 18 models were built, and all 48 tests defined in the project passed in the temporary schema created in Snowflake.


5. The pull request now shows a green checkmark, indicating that the integration pipeline completed successfully and the changes are ready to be merged into the main branch.

6. The pull request has been approved and is ready to be merged.


7. Once the code is merged, the Continuous Deployment pipeline is triggered automatically.

8. The pipeline ran successfully and deployed all the changes to the Snowflake prod environment

9. The pull request shows a green checkmark, which signifies that the deployment pipeline ran successfully.

Conclusion
This is how we can build a simple CI/CD pipeline with Bitbucket for our DBT project. The Continuous Integration pipeline shown here builds all the DBT models in the temporary schema within the production database. Instead, we could build only those models, and their children, that have been updated or newly created since the last time the DBT job ran in production. This can be achieved using state deferral, a process known as Slim Continuous Integration. The first article mentioned in the References section of this document goes into detail on how this can be achieved.
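As an illustration only (not something set up in this post), the CI run and test commands could be adapted along these lines, assuming the manifest.json produced by the production deployment has been made available to the CI step at a hypothetical path ./prod-run-artifacts:
# Build and test only models that are new or modified relative to production,
# deferring unmodified upstream references to the production schema
- dbt run --profiles-dir ./profiles --select state:modified+ --defer --state ./prod-run-artifacts
- dbt test --profiles-dir ./profiles --select state:modified+ --defer --state ./prod-run-artifacts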