Author: Avinash Kr. Pandey
What Is DataOps?
DataOps is an agile data management practice that governs the movement of data from source to destination. It brings different data teams together around shared processes that support data-focused initiatives.
Benefits of DataOps
Reduced manual testing.
Organized data flow.
Real-time insights.
Consistent, trustworthy data.
What Makes DataOps Different from DevOps?
DataOps and DevOps are similar in many respects, such as delivering business value, continuous integration, continuous deployment, and automation, but despite these similarities they differ in focus. DataOps practitioners primarily concentrate on building models and insights that serve business targets, whereas DevOps practitioners focus on delivering better software to consumers (a mostly code-centric concern).
Why Is dbt a Good Fit for DataOps?
Besides being a fine data transformation tool, dbt also helps organizations follow standard software engineering practices such as automated testing, documentation, extensibility, CI/CD, and data governance.
Supported DataOps Features in dbt
Prerequisites:
dbt and Snowflake basics.
Basic knowledge of pull requests and Git.
DataOps pipeline implementation with dbt (high-level diagram)
Getting Started with dbt
Go to https://www.getdbt.com/signup/, fill in the details, and sign up if you don't already have a dbt account.
Once signed up, log in to dbt and create a new project by clicking the New Project button in the top-right corner.
Set up a Snowflake connection by providing the appropriate information: source database details, warehouse, role, username, password, target schema, etc.
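In dbt Cloud this connection is configured through the UI. For reference, the equivalent dbt Core connection profile looks roughly like the sketch below; the account locator, role, warehouse, database, and schema names are all placeholders, not values from this project.

```yaml
# ~/.dbt/profiles.yml -- illustrative values only
my_snowflake_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: abc12345.us-east-1    # your Snowflake account locator
      user: dbt_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS
      schema: DBT_DEV                # target schema for this developer
      threads: 4
```

Keeping the password in an environment variable, as sketched here, avoids committing credentials to the repository.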
After that, establish a connection with the GitLab repository.
If you don't have a GitLab account, go to the GitLab sign-in page and create one; after account creation, create a new project in GitLab itself, which will create a new repository for your project.
Once Snowflake and GitLab are set up, click Start Developing, and we are ready to write, test, and run our statements in dbt.
Version Control in dbt
Version control gives each user a dedicated development sandbox, so any code is tested and reviewed before being merged into the master (main) branch. Once dbt is connected to the GitLab repository, we can create a branch in dbt for each user.
1. Set Up the Project Configuration
Go to the Develop section. This spins up the IDE (Integrated Development Environment) in which you will develop your dbt project.
Initialize your project.
After initialization, check out a new feature branch, which is read-write, in order to start developing. Once you create a custom branch, the main branch becomes read-only.
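The dbt Cloud IDE drives this branching through Git under the hood. A minimal command-line equivalent is sketched below; the temporary directory stands in for the connected GitLab repository, and the branch name is made up.

```shell
# Sketch of the branching flow; repo location and branch name are illustrative.
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q                           # stand-in for the connected GitLab repo
git checkout -q -b feature/my-models  # read-write feature branch for this user
git branch --show-current             # confirm which branch we develop on
```

Every developer gets their own feature branch, which is what keeps the main branch read-only and review-gated.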
dbt supports two types of environments: development and deployment. By default it creates a development environment for your project from the details you supplied when creating it, but if you prefer, you can delete that one and create a new development environment yourself.
Typically, any project needs both of these environments.
Since dbt has already created a development environment, we will only create a deployment environment here.
Go to the Environments option in dbt, click New Environment, and fill in the required details for the Snowflake production database.
dbt gives complete freedom for data transformation through plain SQL statements, along with native test support, so users can easily transform their data as required. Follow these steps:
Go to the Develop area by clicking the hamburger icon; if you are on the main branch, check out your feature branch.
Create a source.yml file under the models folder, specifying the source database, schema, and tables present in Snowflake.
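A minimal source.yml might look like the sketch below; the source name, database, schema, and table names are hypothetical, not taken from this project.

```yaml
# models/source.yml -- database/schema/table names are illustrative
version: 2

sources:
  - name: raw_data
    database: RAW        # Snowflake database holding the raw tables
    schema: PUBLIC       # schema within that database
    tables:
      - name: customers
```

Models can then reference these tables with the source Jinja function instead of hard-coding database and schema names.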
Then create your .sql files (called models in dbt) in an organized manner inside the models folder. It is a best practice to have staging models that refer to the sources, and dimension models that refer to the staging models.
As an example transformation, I converted my JSON data into a relational representation and performed type casting and column-length truncation in my staging models; here I use the source Jinja function to refer to the source.yml file.
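A staging model along those lines could be sketched as follows, assuming the raw table stores a JSON document in a Snowflake variant column named payload; the model, column, and source names are all hypothetical.

```sql
-- models/staging/stg_customers.sql -- illustrative names throughout
select
    payload:id::integer                 as customer_id,   -- type cast
    left(payload:name::varchar, 100)    as customer_name, -- truncate to 100 chars
    payload:signup_date::date           as signup_date
from {{ source('raw_data', 'customers') }}
```

The colon syntax (payload:id) is Snowflake's JSON path notation, and the double-colon performs the cast, which is what turns the semi-structured payload into relational columns.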
I also reordered the columns in the dimension models by referring to the staging models with the ref Jinja function.
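A dimension model that reorders columns by referring to a staging model via the ref Jinja function might look like this sketch; the model and column names are hypothetical.

```sql
-- models/dim/dim_customers.sql -- names are illustrative
select
    customer_name,   -- reordered: name first
    customer_id,
    signup_date
from {{ ref('stg_customers') }}
```

Using ref rather than the table's physical name lets dbt build the dependency graph and resolve the correct schema per environment.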
After every change to the models, we can run them simply by typing the dbt run command in the terminal below. After every successful dbt run, the changed (transformed) data is visible in the configured Snowflake environment.
Data Systemizing (Data Cataloging)
For every model we have created in dbt, we can see a lineage diagram, which is a directed acyclic graph (DAG); this in turn helps us with data cataloging.
If required, we can also use dbt to mask confidential data present in the source tables, which helps with data governance.
dbt supports two types of tests: singular and generic. Singular tests are written by the user as .sql files under the tests folder; generic tests are predefined by dbt itself and can be used directly in the source.yml file (or in a new test .yml file) inside the models folder.
dbt ships four generic tests: not_null, unique, accepted_values, and relationships.
If you don't want to clutter your source file, it is better to create a separate test .yml file, as specified below.
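Such a separate test file could look like the following sketch, assuming a hypothetical stg_customers model with customer_id and status columns.

```yaml
# models/test.yml -- model and column names are illustrative
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive']
```

Each entry attaches one of dbt's built-in generic tests to a column, and dbt test compiles them into SQL queries that fail if any rows violate the rule.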
As a use case for the singular test, I compared the row counts of the source and dimension models to check whether they hold the same number of records before and after transformation.
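A singular test of that kind is a .sql file under the tests folder that passes when it returns zero rows; the sketch below assumes the hypothetical source and model names used earlier in the examples.

```sql
-- tests/assert_row_counts_match.sql -- names are illustrative
-- Returns a row (i.e. fails) only when source and dimension counts differ.
select src.n as source_rows, dim.n as dimension_rows
from (select count(*) as n from {{ source('raw_data', 'customers') }}) src,
     (select count(*) as n from {{ ref('dim_customers') }}) dim
where src.n != dim.n
```

dbt treats any returned rows as failures, so a count mismatch surfaces directly in the dbt test output.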
We can test our models directly by typing the dbt test command in the terminal. To test a particular model, run dbt test --select model_name.
Good documentation is key for users to understand and find the files we have created for them. dbt helps create documentation for the project, covering model code, tests, the DAG, any added descriptions, and more. Simply run dbt docs generate in the terminal; once the command succeeds, click the View Docs option at the top to see your project docs.
To automate the commands and processes we would otherwise type manually in the terminal, dbt provides jobs. We can schedule jobs to automate processes such as dbt run and dbt test, and these jobs are also flexible enough to be triggered on demand.
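A job is essentially an ordered list of dbt commands; a typical sequence (configuration entries, not a script) might be:

```text
dbt run
dbt test
dbt docs generate
```

Running tests immediately after the build, as sketched here, means a scheduled refresh fails fast when the new data violates expectations.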
To create a job, go to the Jobs section in dbt, click the New Job button, and fill in all the required information, including the commands you want the job to run.
As mentioned earlier, users can also schedule these tasks to fit their requirements.
Continuous integration is the ongoing merging of code changes from the feature branch into the main branch. Along with that, it is important to push all code commits to our GitLab repository (feature branch). To merge, we create a pull request (PR), and we can also configure jobs that run when a PR is created.
To create a job that triggers on a pull request, go to the webhook option in the job-creation flow and tick its checkbox.
To create a PR:
Go to the Develop section and first commit your changes on the feature branch.
Then click the Open Pull Request button.
This takes us to our repository, where we assign a reviewer for the created PR (a merge request in GitLab terms). Once the reviewer approves it, all the code from the feature branch is merged into the main branch of the repo.
We can also check, in dbt, the status of the job we scheduled to run on PRs. Internally, that job creates a temporary schema in Snowflake containing all the transformations, for verification and relationship checks. The temporary schema is dropped automatically once the PR is approved and the code changes are merged.
Similar to continuous integration, continuous deployment is the ongoing delivery of transformed data from the production environment in dbt to the production environment in Snowflake. For continuous deployment we can schedule jobs at a specific time in the production environment, so teams always have fresh, up-to-date data every day; as a result, newly created data is loaded into the configured target schema in Snowflake.
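Job schedules in dbt accept cron syntax; as an illustration (the time is an assumption, not from this project), a daily 06:00 UTC refresh would be expressed as:

```text
0 6 * * *
```

Scheduling the production job shortly before the working day starts is a common way to guarantee fresh data each morning.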
From the aspects covered above, we can conclude that the combination of dbt and GitLab lets us perform almost every facet of DataOps smoothly, bringing efficiency and consistency to an organized data flow in the workspace.