Author: Anand Jain
What is DataOps?
DataOps is to the data world what DevOps is to the software development life cycle: an agile way of making and tracking changes while collaborating across teams with different requirements,
e.g., data engineers, data analysts, and data scientists.
Why DataOps?
In reality, changes are often pushed to production environments manually, and without version control, tracking changes in source code is a tedious job. DataOps brings features such as versioning and CI/CD, which reduce the time to production for any newly approved change that would otherwise lie idle in a development environment.
Features of DataOps
Version control
Environment management
Continuous integration
Continuous deployment
Automated testing
Documentation
What is DBT?
The data build tool (DBT) is a Python-based transformation tool that operates on top of a cloud warehouse such as Snowflake. DBT transforms the data already present in the warehouse, so no additional storage cost is incurred by importing data into DBT; the transformation happens without moving the data out of the cloud warehouse.
“Data engineers and analytics engineers use a data build tool to perform data transformation by using simple SQL select statements. These statements take the form of views and tables.”
DBT docs[2]
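For example, a model is simply a .sql file containing a select statement; when DBT runs, it wraps the statement in the DDL needed to create a view or table in the warehouse. A minimal sketch, with illustrative model and column names:

    -- models/daily_revenue.sql
    -- A model is just a select statement; DBT materializes it
    -- as a view or table in the warehouse when it runs.
    select
        order_date,
        sum(amount) as total_revenue
    from {{ ref('stg_orders') }}   -- ref() resolves another model, stg_orders
    group by order_date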
DBT offerings
Transformation logic
Orchestration
Change management
Data tests and documentation
Leveraging DBT to implement DataOps
Implement version control
Create environments
Transformations
Perform testing using DBT's built-in generic tests
Generate documentation
Automation and CI/CD using jobs
Prerequisite:
GitLab account
DBT Cloud account
Snowflake account
Implement version control
Set up a new project in DBT
Set up a connection with Snowflake
Integrate your GitLab account to allow webhook-triggered jobs
Add the GitLab repository
Create two environments
The development environment is already present and uses the default settings.

Create a new production environment named “Analytical”, configure it with the correct Snowflake settings for your production database connection, and make sure to choose the right branch for your production environment.
Here we are using the main branch for the production environment.

Specify the name of the schema in your production environment.

Transformation
Start developing
Set up the main branch
Initialize your project.
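In DBT Cloud, initializing scaffolds the standard project layout, including a dbt_project.yml file at the root. A minimal sketch of that file, following dbt 1.x conventions (the project and profile names are illustrative):

    # dbt_project.yml
    name: 'analytics'              # illustrative project name
    version: '1.0.0'
    config-version: 2
    profile: 'snowflake_profile'   # connection profile; managed by DBT Cloud here
    model-paths: ["models"]        # where model .sql files live
    models:
      analytics:
        +materialized: view        # default materialization for all models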

Perform the first commit on the main branch.

Verify in GitLab

Create a new dev branch.

Now we have two branches, one for each environment.
main: production branch
dev: development branch

Create directories and organize your SQL select statements within them; as a good practice, use the ref and source functions rather than absolute paths to your database (avoid DB_NAME.SCHEMA_NAME.TABLE_NAME).
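A sketch of this practice with hypothetical source and model names: source() resolves raw tables declared in a .yml file, and ref() resolves other models, which lets DBT build its dependency graph and swap schemas between environments.

    -- models/staging/stg_orders.sql
    -- source() resolves to a raw table declared in a sources .yml file
    select
        order_id,
        customer_id,
        order_date,
        amount
    from {{ source('raw', 'orders') }}

    -- models/marts/customer_orders.sql
    -- ref() resolves to another model rather than a hard-coded
    -- DB_NAME.SCHEMA_NAME.TABLE_NAME path
    select
        customer_id,
        count(order_id) as order_count
    from {{ ref('stg_orders') }}
    group by customer_id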



Verify this transformation by running dbt run.

You should now see a new transformed table in your Snowflake instance, in the schema defined for the development environment.

Commit changes to the dev branch
Testing
DBT offers two types of tests:
Singular tests: custom tests that you write based on your requirements (see the sketch after this list)
Generic tests: four predefined tests in DBT
unique: checks that a column's values are unique
not_null: checks that the column contains no null values
accepted_values: checks that the column's values belong to a predefined list
relationships: checks referential integrity against another model or source
Refer to the DBT documentation[3] for more on tests.
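A singular test is a select statement saved under the tests directory; the test fails if the query returns any rows. A minimal sketch, assuming the illustrative customer_orders model from earlier:

    -- tests/assert_order_count_positive.sql
    -- Singular test: selects the rows that violate the assertion,
    -- so the test passes only when zero rows are returned.
    select *
    from {{ ref('customer_orders') }}
    where order_count <= 0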
Write tests in your source .yml files to implement testing at the source.
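A sketch of such a source file, with hypothetical source, table, and column names; each entry under tests attaches one of the generic tests to a column:

    # models/staging/sources.yml
    version: 2
    sources:
      - name: raw
        database: RAW_DB               # illustrative database name
        tables:
          - name: orders
            columns:
              - name: order_id
                tests:
                  - unique
                  - not_null
              - name: status
                tests:
                  - accepted_values:
                      values: ['placed', 'shipped', 'returned']
              - name: customer_id
                tests:
                  - relationships:
                      to: source('raw', 'customers')
                      field: customer_id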

Running tests
In DBT, run one of the following commands:
dbt test: runs tests on all models
dbt test --select +my_model: runs tests on the selected model and all of its ancestors

Documenting
Documenting our models is an important part of DataOps. Documentation saves time for users who are new to a project and for teams where collaboration is a necessity; one can browse the docs to find a way into an unfamiliar codebase.
DBT lets us document while coding our models; we can write descriptions in the same .yml files in which we describe our data.
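For example, descriptions sit alongside the tests in the same .yml file (the model and column names are illustrative):

    # models/marts/schema.yml
    version: 2
    models:
      - name: customer_orders
        description: "One row per customer with their total order count."
        columns:
          - name: customer_id
            description: "Unique identifier of the customer."
            tests:
              - unique
              - not_null
          - name: order_count
            description: "Total number of orders placed by the customer."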

Documents can be generated using the DBT command
dbt docs generate
The newly generated documents can be viewed once the command completes.

The generated documents are web-based pages that include details such as:
Information about your project: including model code, a DAG (directed acyclic graph) of your project, any tests you've added to a column, etc.
Details regarding your data warehouse: including table sizes and column data types. This information is derived from the information schema.

Lineage graph, or “mini-map”, of the DAG
This graph shows the upstream tables a model depends on and the stages the data passes through to produce the required transformed table.
Here, the green blocks are source tables.

Automation and CI/CD using jobs
Each job consists of a sequence of DBT commands; the jobs in this example use the following commands:
dbt run
dbt test
dbt docs generate
Create three jobs:
Job run at PR: triggered by a pull request
Nightly refresh: runs at a scheduled time
Custom run: runs manually
Job run at pull request (CI)
Every time a user opens a pull request against the main branch after making changes on the dev branch, this job is triggered; it builds all the transformations in a temporary schema for verification and relationship checks. Once the PR is approved and merged, this temporary schema is removed from the Snowflake instance.
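In DBT Cloud, this amounts to a job whose trigger is the GitLab webhook rather than a schedule. A sketch of the job settings (the labels are illustrative and may differ slightly across DBT Cloud versions):

    Job name:    Job run at PR
    Environment: Analytical          # or a dedicated CI environment
    Commands:    dbt run
                 dbt test
    Trigger:     Run on pull requests (webhook from the GitLab integration)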


Create a pull request against the main branch after committing changes to the dev branch.

The job is triggered, and the changes are reflected in the Snowflake instance.


Nightly refresh (CD)
Schedule the job to run at a specific time each night so that new changes and transformations are applied to your production environment, and stakeholders have fresh, transformed data every day.
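DBT Cloud job schedules can also be expressed in cron syntax; for example, a nightly run at 02:00 UTC (the time is illustrative):

    # minute hour day-of-month month day-of-week
    0 2 * * *    # every day at 02:00 UTC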

Successful previous runs

Custom Run
Create an additional job with the same DBT commands as the jobs above, but this time without any triggers. This job can be used to manually push changes from the main branch to the production environment.
Jobs in the production environment

Using DBT with the above features reduces time to deployment in both production and development environments. Developers can maintain source code with version control and collaborate in teams of multiple developers. One can bring the fundamentals of DevOps from software development into DBT with continuous integration, continuous deployment, automated testing using generic and singular tests, and a well-documented project.
References
[1] Rise of DataOps
[2] DBT docs
[3] DBT tests