CI/CD with DBT Cloud and Bitbucket

Author: Siddharth Ghelani


Introduction

Continuous Integration and Continuous Deployment is at the heart of DataOps best practices which allow us to automate the Build and Deployment of Data Engineering workflows. DBT Cloud (SAAS offering used to develop and orchestrate Data Transformation pipelines using DBT), only offers Continuous Integration features for Github, Gitlab and Azure Devops based source code repositories, fow now.If our DBT project is hosted on Bitbucket then this could lead to challenges in implementing these critical aspects of DataOps. This blog will focus on how to implement CI/CD using Bitbucket Pipelines for a DBT project.


Overview

The flow of events is as follows:

  1. For working on a new feature, the developer creates a new feature branch within the DBT repository in Bitbucket.

  2. After the work is completed, the developer would raise a pull request for that branch from DBT cloud which will redirect to the Bitbucket’s page. There he or she will fill in all the required details and save the request.

  3. As soon as a pull request is opened a Continuous Integration pipeline will be triggered within Bitbucket which will build and test the DBT models within a temporary schema in Snowflake’s production environment.

  4. At the end of the Continuous Integration run, the temporary schema will be dropped.

  5. Once the Pull request is approved and changes are merged with the main branch, a continuous deployment pipeline will be triggered within Bitbucket, which will deploy all the models in Snowflake’s production environment.


Prerequisites
  • Bitbucket account hosting the git repository for the DBT project

  • DBT cloud account for code development

  • Snowflake account containing the Development and Production Database


Assumptions
  1. The source data which is to be transformed is present in the production database <prod_db_name> and the new data is also written to an appropriate schema within the same database.

  2. The DBT models have been configured to build all the models inside a single schema in the production database.

  3. As part of the continuous integration pipeline, a new schema will be created by DBT for building and testing the models within the production database and will be dropped at the end of the run.


Implementation

1. Create warehouse, user, role in Snowflake with sufficient permissions to perform continuous integration activities

use role sysadmin;


create or replace warehouse <ci_warehouse_name>

warehouse_size = xsmall

auto_suspend = 300

auto_resume = true

initially_suspended = true;


Use role securityadmin;


Create role <ci_role_name>;


create or replace user <ci_user_name>

password = '**’'

default_role = <ci_role_name>;


grant role <ci_role_name> to user <ci_user_name>;


// Grant access to read the input data from the <prod_db_name>

grant usage on all schemas in database <prod_db_name> to role <ci_role_name>;

grant select on all tables in database <prod_db_name> to role <ci_role_name>;

grant select on all views in database <prod_db_name> to role <ci_role_name>;

grant all on database <prod_db_name> to role <ci_role_name>;


// Grant access to warehouse

grant all on warehouse <ci_warehouse_name> to role <ci_role_name>;


// Grant role to sysadmin

grant role <ci_role_name> to role sysadmin;


2. Create warehouse, user, role in Snowflake with sufficient permissions to perform continuous deployment activities

use role sysadmin;


create or replace warehouse <cd_warehouse_name>

warehouse_size = xsmall

auto_suspend = 300

auto_resume = true

initially_suspended = true;


Use role securityadmin;


Create role <cd_role_name>;

create or replace user <cd_user_name>

password = '**’'

default_role = <cd_role_name>;


grant role <cd_role_name> to user <cd_user_name>;


// Grant access to read the input data from the <prod_db_name>

grant usage on all schemas in database <prod_db_name> to role <cd_role_name>;

grant select on all tables in database <prod_db_name> to role <cd_role_name>;

grant select on all views in database <prod_db_name> to role <cd_role_name>;

grant all on database <prod_db_name> to role <cd_role_name>;



// Grant access to warehouse

grant all on warehouse <cd_warehouse_name> to role <cd_role_name>;


// Grant role to sysadmin

grant role <cd_role_name> to role sysadmin;


3. The profiles for CI and CD are maintained in the profiles.yml file within the profiles folder of the DBT project repo.The passwords which will be later stored in Bitbucket repository variables are referenced here using environment variables


dbt_project:

target: ci

outputs:

ci:

type: snowflake

account: <snowflake_account_identifier>

user: <ci_user_name>

password: "{{ env_var('CI_USER_PASSWORD') }}"

role: <ci_role_name>

database: <prod_db_name>

warehouse: <ci_warehouse_name>

schema: <ci_schema_name>

threads: 4

cd:

type: snowflake

account: <snowflake_account_identifier>

user: <cd_user_name>

password: "{{ env_var('CD_USER_PASSWORD') }}"

role: <cd_role_name>

database: <prod_db_name>

warehouse: <cd_warehouse_name>

schema: <cd_schema_name>

threads: 4


4. Make sure that the profile name mentioned in profiles.yml file matches the one in dbt_project.yml file



5. Create a DBT macro to drop the test schema created for Continuous Integration within Snowflake and place it in the macros folder in the DBT project


{% macro drop_test_env() %}

{#-

To run it:

$ dbt run-operation drop_test_env


-#}

{% set sql -%}

DROP schema if exists <prod_db_name>.<ci_schema_name> cascade

{%- endset %}

{% do run_query(sql) %}

{% endmacro %}


6. Enable Pipeline for the Bitbucket repository



7. Create a pipeline within Bitbucket. A file with the name bitbucket-pipelines.yml gets created while creating the pipeline. Modify the contents of this file with the sample code as specified below


# Using Docker image with DBT installed

image: fishtownanalytics/dbt:1.0.0


pipelines:

# CI pipeline Will get triggered as soon as Pull request is created

pull-requests:

'**':

- step:

name: 'Build and Test'

script:

#Downloading DBT dependencies, if any

- dbt deps --profiles-dir ./profiles

#Compile DBT models

- dbt compile --profiles-dir ./profiles

#Run DBT models within test database

- dbt run --profiles-dir ./profiles

#Run automated unit tests within test database

- dbt test --profiles-dir ./profiles

after-script:

#Running DBT macro to drop test database in Snowflake

- dbt run-operation drop_test_env --profiles-dir ./profiles


# CD pipeline Will get triggered as soon as changes have been merged with the main branch

branches:

<main_branch_name>:

- step:

name: Deploy to production

script:

#Downloading DBT dependencies, if any

- dbt deps --profiles-dir ./profiles --target cd

#Compile DBT models

- dbt compile --profiles-dir ./profiles --target cd

#Run DBT models within prod database

- dbt run --profiles-dir ./profiles --target cd


8. Store the passwords for CI user and CD user, securely in the repository variables in Bitbucket



Demo

1. We will open a pull request to push this code in our feature branch to the main branch by clicking on the open pull request option.



2. Once we do that, it will send us to the bitbucket page to create a pull request.



3. Once a request is created it is sent to the reviewer to receive approval and merge the request.As soon as the pull request is created, the Bitbucket pipeline starts running in the background.



4. All the 18 Models got Build and 48 tests defined in the project got passed in the temporary database created in Snowflake




5. The Pull request now shows a green checkmark to indicate that the integration pipeline completed successfully and the changes are good to be merged with the main branch.



6. The Pull request has been approved and is ready to be merged




7. Once the code is merged another Continuous Deployment pipeline got auto triggered



8. The pipeline ran successfully and deployed all the changes to the Snowflake prod environment



9. The merge request shows a green checkmark which signified that the deployment pipeline ran successfully



Conclusion

This is how we can build a simple CI/CD pipeline with Bitbucket for our DBT project. The code for the Continuous Integration pipeline, which builds all the DBT models in the temporary schema within the production database. We can instead build only those models and their children, which have been updated or newly created since the last time the DBT job ran in production. This can be achieved using state deferral and this process is called Slim Continuous Integration. The 1st article mentioned in the References section of this document goes into detail on how this can be achieved.


References
40 views0 comments

Recent Posts

See All