Authors: Ankita Deep, Sakshi Agrawal, Yash Chavan, Kashish Vyas
DataOps is emerging as a transformative methodology, especially from a data engineering perspective, and can be leveraged across different projects to fulfill multiple requirements in one go. This paper provides a prescriptive look at how DataOps behaves with dbt for data practitioners.
DBT (Data Build Tool)
DBT (Data Build Tool) is a Python-based application used to transform data in ELT (Extract, Load, Transform) processes; it handles the transformation step.
It focuses on performing data engineering activities before the data is used for analytics.
DBT is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices such as CI/CD, portability, modularity, and documentation.
Scope of DBT in DataOps
The first step was to establish the connection between Snowflake and dbt after creating a trial account on dbt Cloud. For dbt configuration and the Snowflake connection, you can refer to this link: Data teams with DBT cloud #4
Then, as per dbt standards, some sample models and configuration files were created to perform transformations. Based on these models and transformations, observations on how DataOps takes shape in different areas of dbt are given below:
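As a reference point, a minimal dbt model looks like the following (the source name, file name, and columns here are illustrative, not the actual models from this project):

```sql
-- models/stg_orders.sql: a simple staging model, materialized as a view.
{{ config(materialized='view') }}

select
    order_id,
    customer_id,
    order_date
from {{ source('raw', 'orders') }}
```

Running `dbt run` compiles this Jinja-templated SQL and creates the corresponding view in Snowflake.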
1. Version Control:
Once dbt Cloud is connected to the Git repository, we can create branches: each user maintains their own feature branch, makes changes there, and, once those changes are tested, merges them into the main branch.
Through version control, we can also achieve collaboration: different users can make changes in their own feature branches and share them with colleagues. In dbt, we tried this by creating a second account and maintaining a feature branch and a main branch linked to the GitHub repository.
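The branch-then-merge workflow described above can be sketched with plain git commands (the repository and file names here are illustrative; the sketch uses a throwaway local repo, whereas in practice the branch would be pushed to GitHub and merged via a Pull Request):

```shell
set -e
repo=$(mktemp -d)                 # throwaway repo for the sketch
cd "$repo"
git init -q -b main
git config user.email dev@example.com
git config user.name dev

echo "select 1 as id" > model.sql
git add model.sql && git commit -qm "initial model"

git checkout -qb feature/customer-model    # personal feature branch
echo "select 2 as id" > model.sql
git commit -qam "update model"             # work happens on the branch

git checkout -q main
git merge -q feature/customer-model        # merge tested work into main
git log --oneline | head -1
```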
2. Continuous Deployment:
DBT Cloud provides the tools that let analytics engineers own the full end-to-end development and deployment process. Only two environments are needed: development and production.
We achieve this by merging the work from our feature branch into the main branch and setting up a job to orchestrate the execution of the models in production.
Once the job is set up, triggering it is effectively triggering the CI/CD pipeline.
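Conceptually, such a deployment job boils down to an environment, a list of dbt commands, and a trigger. In dbt Cloud this is configured through the UI; the sketch below only summarizes those settings, and the names and schedule are assumptions:

```yaml
# Illustrative summary of a dbt Cloud deployment job (not an actual config file).
job:
  name: "Production run"
  environment: Production
  commands:
    - dbt run
    - dbt test
  triggers:
    schedule:
      cron: "0 6 * * *"   # every day at 06:00
```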
For more information about how to implement Continuous Deployment with dbt, please see the link below: Data teams with DBT Cloud #13
3. Continuous Integration:
dbt Cloud makes it effortless to test every change you make before deploying that code into production. Once you've connected your GitHub account, you can configure jobs to run when new Pull Requests (referred to as Merge Requests in GitLab) are opened against your dbt repo. When these jobs complete, their statuses are shown directly in the Pull Request.
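A CI job of this kind typically builds and tests only the models changed in the Pull Request, using dbt's state-based selection ("Slim CI"). The sketch below is illustrative and assumes the artifacts of a previous production run are available to compare against (in dbt Cloud this is the "defer to a previous run state" setting):

```yaml
# Illustrative summary of a CI job triggered by Pull Requests.
ci_job:
  triggers:
    github_webhook: true                     # run when a PR is opened/updated
  commands:
    - dbt build --select state:modified+     # only changed models and their children
```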
4. Job Scheduling:
While creating a deployment job in dbt, we can also schedule it as needed: every day, on a specific day of the week, or via a cron expression.
This lets us run a dbt job on a schedule, rather than running dbt commands manually from the command line.
dbt Cloud's scheduling interface makes it possible to schedule jobs by time of day, day of the week, or a recurring interval.
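For the custom-cron option, standard five-field cron expressions are used; a few illustrative examples:

```
0 * * * *     # every hour, on the hour
0 6 * * 1-5   # 06:00 on weekdays
30 2 * * 0    # 02:30 every Sunday
```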
5. Data Governance:
DBT provides data lineage through its DAG, which in turn helps with data cataloging.
Within models, we can also apply data masking to Snowflake tables through dbt, which further supports data governance.
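On Snowflake, such masking is implemented with masking policies, which dbt can apply (for example, from a post-hook or macro). A minimal sketch, with an illustrative policy, role, table, and column:

```sql
-- Mask the email column for all roles except ANALYST_FULL.
create masking policy email_mask as (val string) returns string ->
  case
    when current_role() in ('ANALYST_FULL') then val
    else '*** MASKED ***'
  end;

alter table customers modify column email
  set masking policy email_mask;
```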
6. Automated Testing:
dbt comes with accessible data testing. Out of the box, it ships with a set of 4 generic tests:
not_null — no null/missing values occur in a column
unique — every value in the column is unique
relationships — referential integrity between a primary and a foreign key (not database-enforced; rather, dbt generates code to check the RI)
accepted_values — the column only contains values within a specific domain; the allowed values can be supplied from a seed, another dbt concept for data enrichment that can also be referenced like a model.
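All four tests are declared in a properties file alongside the models. The model, column, and value names below are illustrative:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` then compiles each declaration into a SQL query that fails if any row violates the condition.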
dbt is able to produce a static web page with a data dictionary by pulling in information from your dbt project as well as your Snowflake information_schema.
It also furnishes an interactive DAG so you can see the full lineage of your models.
So no matter how large your project grows, it remains easy to understand what's happening with the help of dbt's documentation.
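Generating and browsing this documentation site takes two commands (shown here as a sketch; `dbt docs serve` applies to the dbt CLI, while dbt Cloud hosts the generated site for you):

```shell
dbt docs generate   # compile the project and build the catalog
dbt docs serve      # serve the static docs site locally
```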
From the points above, it's evident that all aspects of DataOps functionality line up with dbt Cloud as well.
By making use of this, we can take a well-defined approach to the data engineering part of our projects, making our work more efficient and precise.