Perform Auditing Jobs In Matillion

Author: Ramandeep Bhasin


What is Data Auditing?

Data auditing is the process of checking the quality of data to ensure that it is accurate and error free. It improves data quality so that analytics on that data can produce more accurate results. By performing data auditing, you increase the reliability of your data and can expect better results when creating useful business insights.


(Source: https://www.altec-inc.com/)

Prerequisites

There are two prerequisites that you should complete first:

  • Get a basic idea of the Matillion components. You can discover more about Matillion by clicking on this link. Sign up there and choose the course named Building a Data Warehouse using Matillion.


(Source: https://academy.matillion.com/certifications)
  • You must be familiar with Environment Variables before starting. You can learn about Environment Variables by clicking on this link.


Data Auditing in Matillion

One can perform various auditing jobs in Matillion. We will discuss some of them here, such as:

  • Checking Snowflake Tasks

  • Checking Partial Loading of Data

  • Checking Null values in the Data

  • Tracking the Users who Accessed the Data

  • Reloading of Data


We will discuss the above-mentioned jobs in detail. There are many more possible auditing jobs; we are covering only a few of them. Let's start with the first job.


Checking Snowflake Tasks

We will now build this job by adding a few Matillion components. You can adapt the job to your needs. Consider a scenario in which data is ingested into Snowflake via Snowpipe, Streams, and Tasks, where the purpose of the Tasks is to ingest data into the final table. Consider two tasks here: a Root Task and a Child Task. The Root Task ingests data into the final table from the Stream, and the Child Task creates a record in an auditing table. You can design the auditing table according to your requirements; see the example screenshot of the auditing table for reference.
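As a rough sketch, the Root Task / Child Task pair described above might look like the following in Snowflake SQL. The warehouse, stream, and column names here are assumptions for illustration, not part of the original setup:

```sql
-- Root task: drain new rows from the stream into the final table
-- (MY_WH, MY_STREAM, and the column list are placeholders).
CREATE OR REPLACE TASK ROOT_TASK
  WAREHOUSE = MY_WH
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('MY_STREAM')
AS
  INSERT INTO TABLE1 SELECT ID, NAV, CURRENT_DATE FROM MY_STREAM;

-- Child task: record what was ingested today in the auditing table.
CREATE OR REPLACE TASK CHILD_TASK
  WAREHOUSE = MY_WH
  AFTER ROOT_TASK
AS
  INSERT INTO JOB_STATUS_LOG (TABLE_NAME, ROW_COUNT, RECORD_CREATED_DATE)
  SELECT 'TABLE1', COUNT(*), CURRENT_DATE
  FROM TABLE1
  WHERE RECORD_CREATED_DATE = CURRENT_DATE;

-- Child tasks must be resumed before the root task.
ALTER TASK CHILD_TASK RESUME;
ALTER TASK ROOT_TASK RESUME;
```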



This auditing table has the name of the table, the number of rows that were ingested, and the ingestion date, i.e. Record Created Date. Now follow the steps below in Matillion to perform this job.
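For reference, a minimal DDL for an auditing table with those three columns might look like this (the column types are assumptions):

```sql
-- Hypothetical DDL for the auditing table described above.
CREATE TABLE IF NOT EXISTS JOB_STATUS_LOG (
    TABLE_NAME          VARCHAR,  -- name of the table that was loaded
    ROW_COUNT           NUMBER,   -- number of rows ingested
    RECORD_CREATED_DATE DATE      -- ingestion date
);
```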


Step 1: After signing in to your Matillion account, go to the panel on the left where your jobs are displayed. Right-click there and select Add Orchestration Job. Fill in the name of the job and then click OK.



Step 2: Switch to the newly created job. Now, under the Components section, drag and drop the following components onto the canvas.

  • 1 - Query Result to Scalar Component

  • 1 - Retry Component

  • 2 - SNS Message Component


After dragging and dropping the components, connect them as shown below.



Step 3: Now click on the Query Result to Scalar component. Enter a SQL query that returns the Row_Count from Job_Status_Log where Record_Created_Date is the current date, filtered by Table_Name. If these terms are confusing, first review the auditing table described above. Taking the example auditing table called Job_Status_Log, the fields of the Query Result to Scalar component should be filled in as follows.

  • Name: Any suitable name such as Check Auditing Table

  • SQL Query: SELECT ROW_COUNT AS COUNT_VALUE FROM JOB_STATUS_LOG WHERE RECORD_CREATED_DATE = CURRENT_DATE AND TABLE_NAME = 'TABLE1'

  • Scalar Variable Mapping: Map the COUNT environment variable (you first need to create it) to COUNT_VALUE, and the Query Result to Scalar component is ready to use.



Step 4: Now click on the Retry component. This component is attached above the Query Result to Scalar component and retries the query that is fed into it. Fill in the Name, Number of Retries, and Retry Delay as shown below.



Step 5: Now click on the SNS Message 1 component and fill in the required details such as Name, AWS Region, Topic Name (which you have to create for task success in your AWS account), Subject, and Message.



Step 6: Now click on the SNS Message 2 component and fill in the required details such as Name, AWS Region, Topic Name (which you have to create for task failure in your AWS account), Subject, and Message.



Step 7: Our job is now complete. You will get an email indicating whether or not the task is working, as shown below.



Checking Partial Loading of Data

Now we know whether the task is ingesting data into the final table or not, but we still do not know whether the data ingested into the final table is fully or only partially loaded. For that, we are going to create a new job, Checking Partial Loading of Data. Follow the steps below in Matillion to perform this job.


Step 1: Go to the panel on the left where your jobs are displayed. Right-click there and select Add Orchestration Job. Fill in the name of the job and then click OK.



Step 2: Switch to the newly created job. Now, under the Components section, drag and drop the following components onto the canvas.

  • 2 - Query Result to Scalar Component

  • 1 - If Component

  • 1 - SNS Message Component

  • 1 - End Success Component


After dragging and dropping the components, connect them as shown below.



Step 3: Now click on the “Query Result To Scalar 0” component and fill in the details such as Name, SQL Query (related to the daily ingested count), and Scalar Variable Mapping. Samples of these fields are as follows.

  • Name: Any suitable name such as Check Auditing Table

  • SQL Query: SELECT COUNT(*) COUNT_INGEST FROM TABLE1 WHERE RECORD_CREATED_DATE = CURRENT_DATE

  • Scalar Variable Mapping: Map the INGEST_COUNT environment variable to the COUNT_INGEST column.


Step 4: Now click on the If component and apply a condition on the environment variable that we mapped in the previous step, such as INGEST_COUNT equals a desired number. Suppose the desired number is 185; then enter this condition in the Condition field.




Step 5: Now click on the “Query Result To Scalar 1” component. In this component, enter a SQL query that finds the missing data that was not loaded. The query differs from case to case, depending on the tables you have created. Also map the environment variables to the desired columns in the Scalar Variable Mapping field.
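One possible shape for such a "missing data" query, assuming a reference table (here called DIM_TABLE, a placeholder) that holds the ids expected to arrive each day, is a left join against today's load:

```sql
-- Compare the ids that should have arrived today (from a reference table)
-- against the ids that actually landed in TABLE1, and collect the gap
-- into a single comma-separated string for the email.
SELECT LISTAGG(D.ID, ', ') AS MISSING_IDS
FROM DIM_TABLE D
LEFT JOIN (
    SELECT DISTINCT ID
    FROM TABLE1
    WHERE RECORD_CREATED_DATE = CURRENT_DATE
) T ON D.ID = T.ID
WHERE T.ID IS NULL;
```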



Step 6: Now click on the “SNS Message 0” component. Fill in the fields such as Name, AWS Region, Topic Name (which you can create in your AWS account for partial loading), Subject, and Message.



Step 7: Your job is now complete. You will get an email regarding partial loading, as shown below.



Checking Null values in the Data

This auditing job checks for nulls in our daily ingested data. Here we again use the Query Result to Scalar component. If null values are detected by the Query Result to Scalar component, an email is sent listing the records that contain null values. Follow the steps below to perform this job.


Step 1: Go to the panel on the left where your jobs are displayed. Right-click there and select Add Orchestration Job. Fill in the name of the job and then click OK.



Step 2: Switch to the newly created job. Now, under the Components section, drag and drop the following components onto the canvas.

  • 1 - Query Result to Scalar Component

  • 1 - If Component

  • 1 - SNS Message Component

  • 1 - End Success Component


After dragging and dropping the components, connect them as shown below.



Step 3: Now click on the Query Result to Scalar component and fill in the fields such as Name, SQL Query, and Scalar Variable Mapping. Samples of these fields are as follows.

  • Name: Fill in an appropriate name according to your needs.

  • SQL Query: SELECT LISTAGG(ID, ', ') AS NULL_LIST FROM TABLE1 WHERE NAV IS NULL AND RECORD_CREATED_DATE = CURRENT_DATE

  • Scalar Variable Mapping: Map the NULL_LIST environment variable to the NULL_LIST column of the above SQL query.



Step 4: Click on the If component, then apply a condition in the Condition field. The condition can be based on the NULL_LIST environment variable. Enter the condition as shown below.



Step 5: Now click on the SNS Message component. Fill in the fields such as Name, AWS Region, Topic Name (which you can create in your AWS account for null values), Subject, and Message.



Step 6: The job has been created. You will get an email regarding the null values detected, as shown below.



Tracking the Users who Accessed the Data

Suppose your Snowflake account has a table that contains highly confidential data, and you want to find out who accessed that data in case of any mishap. This job checks the Access History and reports the users who accessed the confidential data.


Note: For security reasons, we do not recommend using the Account Admin role directly, but Access History can only be accessed by Account Admin. So you need to create a Task that inserts Access History data into a table, so that your Matillion user can perform an Access History auditing job. Now follow the steps below in Matillion to perform this job.
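A sketch of such an ACCOUNTADMIN-owned task might look like the following. The warehouse name, the fully qualified table name, and the daily schedule are assumptions; note also that the ACCOUNT_USAGE.ACCESS_HISTORY view can lag behind real time by a few hours, so copying the previous day's rows is safer:

```sql
-- Hypothetical task that copies yesterday's Access History rows for one
-- confidential table into a table readable by the Matillion user.
-- MY_WH and DB.SCHEMA.CONFIDENTIAL_TABLE are placeholders.
CREATE OR REPLACE TASK COPY_ACCESS_HISTORY
  WAREHOUSE = MY_WH
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
  INSERT INTO PII_ACCESS_AUDITING_TABLE (USER_NAME, DATE)
  SELECT USER_NAME, QUERY_START_TIME::DATE
  FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY,
       LATERAL FLATTEN(INPUT => BASE_OBJECTS_ACCESSED) OBJ
  WHERE OBJ.VALUE:"objectName"::STRING = 'DB.SCHEMA.CONFIDENTIAL_TABLE'
    AND QUERY_START_TIME::DATE = CURRENT_DATE - 1;
```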


Step 1: Go to the panel on the left where your jobs are displayed. Right-click there and select Add Orchestration Job. Fill in the name of the job and then click OK.



Step 2: Switch to the newly created job. Now, under the Components section, drag and drop the following components onto the canvas.

  • 1 - Query Result to Scalar Component

  • 1 - If Component

  • 1 - SNS Message Component

  • 1 - End Success Component


After dragging and dropping the components, connect them as shown below.



Step 3: Now click on the Query Result to Scalar component and fill in the fields: Name, SQL Query (a query that reports the users who accessed the confidential data), and Scalar Variable Mapping (map the User environment variable, which you will create, to the column name returned by the SQL query).



Example of SQL Query:

WITH RESULT AS

(SELECT DISTINCT USER_NAME AS USER_NAME FROM PII_ACCESS_AUDITING_TABLE WHERE DATE = CURRENT_DATE)

SELECT LISTAGG(USER_NAME, ', ') AS DISTINCT_USER_NAME FROM RESULT


Step 4: Click on the If component and apply a condition on the User environment variable. Enter the condition as shown below.



Step 5: Click on the SNS Message component and fill in the fields such as Name, AWS Region, Topic Name (which you can create in your AWS account for Access History), Subject, and Message.



Step 6: Your job is now complete. You will get an email regarding the users who accessed the confidential data, as shown below.



Reloading of Data

There can be many methods of reloading data; we are going to discuss one of them. We use a Lambda function containing a Python script whose purpose is to create a CSV file and put it into an S3 bucket. From S3, Snowpipe, Streams, and Tasks come into the picture, and the data is finally ingested into the final table. For the reloading, we pass the ids that we want to reload as variables using the SNS component. To perform this job in Matillion, follow the steps below.
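The S3-to-Snowflake plumbing described above could be sketched roughly as follows; the stage, pipe, and table names are placeholders, and the file format options are assumptions about the CSV that the Lambda function writes:

```sql
-- Hypothetical Snowpipe that auto-ingests CSV files from the S3 stage
-- into a staging table (@MY_S3_STAGE and STAGING_TABLE are placeholders).
CREATE OR REPLACE PIPE RELOAD_PIPE AUTO_INGEST = TRUE AS
  COPY INTO STAGING_TABLE
  FROM @MY_S3_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Stream on the staging table; a task then drains it into the final table.
CREATE OR REPLACE STREAM MY_STREAM ON TABLE STAGING_TABLE;
```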


Step 1: Go to the panel on the left where your jobs are displayed. Right-click there and select Add Orchestration Job. Fill in the name of the job and then click OK.



Step 2: Switch to the newly created job. Now, under the Components section, drag and drop the following components onto the canvas.

  • 1 - Query Result to Scalar Component

  • 1 - If Component

  • 1 - SNS Message Component

  • 1 - End Success Component


After dragging and dropping the components, connect them as shown below.



Step 3: Click on the Query Result to Scalar component and fill in the fields: Name, SQL Query (a query that reports the ids that were not loaded), and Scalar Variable Mapping (map an environment variable to the ids that we get from the SQL query).



Sample SQL Query:

WITH RESULT AS
(SELECT * FROM
(SELECT ID AS DIM_ID, NAME FROM DIM_TABLE) AS DMF
LEFT JOIN
(SELECT DISTINCT ID AS TRAN_ID FROM TABLE1 WHERE RECORD_CREATED_DATE = CURRENT_DATE) AS TMF
ON
DIM_ID = TRAN_ID)
SELECT LISTAGG(DIM_ID, ', ') AS ID_LIST, LISTAGG(NAME, ', ') AS NAME_LIST
FROM RESULT WHERE TRAN_ID IS NULL


Note: The above query will differ for different data models. It is only a sample, where DIM_TABLE is the dimension table containing all the ids and TABLE1 is the transaction table containing historical and daily ingested data.


Step 4: Click on the If component and apply a condition in the Condition field. A sample condition is shown below.



Here, List is the environment variable that we used in the previous step of this job.


Step 5: Now click on the SNS Message component and fill in the fields such as Name, AWS Region, Topic Name (which you can create in your AWS account for reloading), Subject, and Message.



Note: Here we pass the same “List” environment variable, and the same variable is passed to the Lambda function in our AWS account. The code of the Lambda function can vary according to the use case.


Conclusion

You now have an idea of how to create auditing jobs in Matillion. As you can see, Matillion plays an important role in performing auditing jobs: it is a simple and effective way to monitor or audit your data. In the end, Matillion is a solid choice whether you need to perform ELT operations or to monitor and audit your data.

