Data Ops With DBT Tool

Author: Anand Jain


What is Data Ops?

What DevOps is to the software development life cycle is what data ops are to the data world. It is an agile way of making and tracking changes with teams. Collaborating with people with different requirements

eg. Data engineers, Data analysts, Data scientists.


Why data ops?

In reality, changes are pushed to prod environments manually, and with no version control, tracking changes in source code is a tedious job. DataOps comes with features like versioning and CI/CD, which reduces the time to prod for any new approved change that lies idle in a development environment.


Features of data ops
  1. Version control

  2. Environment management

  3. Continuous integration

  4. Continuous deployment

  5. Automated testing

  6. Documentation


What is DBT?

The data build tool is a transformation tool. It is a python based tool that operates over a cloud warehouse, e.g. snowflake. DBT is responsible for bringing out transformation in data present in the warehouse without incurring additional storage costs by importing data to DBT. DBT performs transformation without moving the data out of the warehouse cloud.


“Data engineers and analytics engineers use data build tool to perform data transformation by using simple SQL select statements. These statements take the form of views and tables”


DBT docs[2]


DBT offerings
  • Transformation logic

  • Orchestration

  • Change management

  • Data tests and documentation


Leveraging DBT to implement DataOps
  • Implement Version control

  • Create environments

  • Transformations

  • Perform testing using DBT's built-in generic tests.

  • Generate documentation

  • Automation and CI/CD using jobs

Prerequisite:
  • Git lab account

  • DBT account

  • Snowflake account

  • Implement Version control

  • Setup a new project in DBT

  • Setup connection with snowflake

  • Integrate your Gitlab account to allow webhook triggered jobs

  • Add GitLab repository

  • Create 2 Environments

  • The development environment is already present, which uses default settings



  • Create a new prod environment by the name “Analytical” with settings as correct snowflake settings for your PROD DB connection, and make sure to choose the right branch name for your Prod environment.

  • Here we are using our main branch for the prod environment.



  • Mention the name of the schema in your prod environment



  • Transformation

  • Start developing

  • Setup main branch

  • Initialize your project.



  • Perform first commit on the main branch



  • Verify-in GitLab



  • Create a new dev branch



  • Now we have two different branches, each for two different environments.

  • Main: Production branch

  • Dev: development branch



  • Create directories and begin writing SQL select statements in an organized manner; as a good practice, make sure to use functions like ref, sources rather than an absolute path to your Database (avoid using DB_NAME.SCHEMA_NAME.TABLE_NAME).





  • Verify this transformation using the DBT run



  • You should be able to see a new transformed table in your snowflake instance corresponding to the defined schema in the development environment.



  • Commit changes to the dev branch

  • Testing

  • DBT offers two types of testings

  • Singular test: Custom tests that you can write based on your requirements

  • Generic tests: Four predefined tests in DBT

  • Unique: checks uniqueness of a column values

  • Not_null: make sure that there are no null values in the column

  • Accepted_values: check if the values in the column belong to a predefined list of values

  • Relationships: check for referential integrity

  • One can refer here[3] for tests in DBT

  • Writing tests in source files to implement testing at the source.



  • Running tests

  • In DBT, run the command

  • DBT test: to perform tests on all data of all models

  • DBT test --select +my_model: to run tests on data on the current model and its ancestors



  • Documenting

  • Documenting our models is an important part of data ops. Documenting saves time for users who are new to a project or people who are working in teams where working in a collaborative environment is a necessity. One can view the documents and find a way into the mess.

  • DBT offers to document while coding our models; we can write descriptions in

“yml” files while describing our data



  • Documents can be generated using the DBT command

  • dbt docs generate

  • Newly generated documents can be viewed after the above command is completed.



  • Documents generated are web-based pages including details such as

  • Information about your project: including model code, a DAG (Directed acyclic graph) of your project, any tests you've added to a column, etc.

  • Details regarding your data warehouse: including table sizes and column data types. This information is derived from the information schema.



  • Lineage graph or “mini-map” of DAG

  • This graph explains the stages a table has undergone with other tables to produce a required transformed table

  • Here green blocks are source tables



  • Automation and CI/CD using jobs

  • Each job consists of running DBT commands; this example includes jobs with the following commands.

  • dbt run

  • dbt test

  • dbt docs generate

  • Create three jobs

  • Job run at PR: runs at the pull request

  • Nightly refresh: runs at a scheduled time

  • Custom run: runs manually.

  • Job run at pull request (CI)

  • Every time a user generates a pull request on the main branch after making changes to the dev branch, this job will be triggered to create a temporary schema with all the transformations for verification and relationship check. Once the PR is approved and merged, this temporary schema will be removed from the snowflake instance.




  • Create a pull request on the main branch after committing changes to the dev branch.



  • Job is triggered, and changes can be reflected in the Snowflake instance.




  • Nightly refresh (CD)

  • Schedule the job for a specific time at night for your new changes or transformation to be applied to your prod environment, so every day, the stakeholder has the latest and fresh transformed data.



  • Successful previous runs



  • Custom Run

  • Create an additional job with similar DBT commands as the above jobs, and this time don't use any triggers. This job can be used to push changes from the main branch to the prod environment manually.

  • Jobs in the production environment



Using DBT with the above features reduces our time to deployment to both prod and dev environments. Developers can maintain source code using version control capability and collaborate with a team of multiple developers. One can leverage the fundamentals of DevOps of software development in DBT with features like continuous integration, continuous deployment, automated testing using generic and custom tests, and create a well-documented project.


References

[1] Rise of data ops

[2] DBT docs

[3] DBT test

DataOps Manifesto

Enable CI

KIPI dbt and data ops




10 views0 comments

Recent Posts

See All