Authors: Twinkle Viswanathan & Tejaswini R
Data Governance (DG) is a process of managing data availability, usability, integrity, cataloging, and security in enterprise systems. Data Governance in Snowflake is driven by an organization’s internal data management standards and policies. It helps us to understand, categorize and provide safe and easy access to our data.
To understand data governance better, let’s look at a scenario using data from the IPL (Indian Premier League). IPL data doesn't just revolve around player, match, or team statistics. It also involves a lot of revenue and finance data, such as the number of tickets sold and the food sold to the audience. Some of this data, like players' health reports and personal details, is sensitive and should not be disclosed to the public, whereas details like the number of runs scored or wickets taken should be shared with the public. Here’s where data governance comes into the picture. It helps us deliver the right data to the right people at the right time.
Snowflake provides several features for implementing data governance in our organization. Let’s look at some of them.
Data classification helps us identify sensitive information in our data; mapping that information to a Tag lets us group all the sensitive data together. Once the sensitive data is identified, we can use row-level or column-level security policies to provide limited, safe access to the right data. With these governance features in place, it is important to monitor whether users can access the data they should and cannot access the data they shouldn’t. For this, Snowflake provides the ACCESS_HISTORY view in the ACCOUNT_USAGE schema, which lets us monitor access to different objects.
Object tags in Snowflake enable data stewards to track sensitive data for compliance, discovery, protection, and resource usage use cases, through either a centralized or a decentralized data governance management approach.
In this blog, we’ll go through, at a high level, how Tags can be used for data governance in Snowflake.
What are tags?
Tags help us attach a particular piece of information to an artifact. We have all used hashtags on social media sites like Twitter or Instagram. Whenever we search for a particular hashtag, all the posts tagged with it show up in our feed, which gives us visibility into all similar posts. Snowflake Object Tagging uses a similar concept.
Tags follow a property called tag lineage, which means a tag set on a high-level object is inherited by all its child objects (a Tag set on a database will be inherited by its tables, views, columns, etc.).
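Tag lineage can be pictured with a small Python sketch. This is only a toy model, not Snowflake code: the object names, the data_owner tag, and its values are made up, and the sketch assumes that a value set directly on a child object takes precedence over an inherited one.

```python
# Toy model of tag lineage: each object points to its parent container.
# All object names below are hypothetical.
PARENT = {
    "ipl_db": None,
    "ipl_db.stats": "ipl_db",
    "ipl_db.stats.players": "ipl_db.stats",
    "ipl_db.stats.players.health_report": "ipl_db.stats.players",
}

# Tags set directly on objects (hypothetical tag name and values).
DIRECT_TAGS = {
    "ipl_db": {"data_owner": "ipl_analytics"},
    "ipl_db.stats.players.health_report": {"data_owner": "medical_team"},
}


def effective_tag(obj, tag):
    """Walk up from obj until the object itself or an ancestor carries the tag."""
    while obj is not None:
        value = DIRECT_TAGS.get(obj, {}).get(tag)
        if value is not None:
            return value
        obj = PARENT[obj]
    return None  # the tag is not set anywhere in the lineage


# The table inherits from the database; the column has its own value.
print(effective_tag("ipl_db.stats.players", "data_owner"))
print(effective_tag("ipl_db.stats.players.health_report", "data_owner"))
```

In this sketch the players table picks up the value set on the database, while the health_report column keeps the value set directly on it.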
Why use Tags?
Tags help us map account-level objects and database-level objects together.
Data cataloging: organizing data into different groups.
Consistent results with data replication: Tags present in the primary database are also replicated to the secondary database.
Row-level and column-level security: Tags can be used for both. We’ll look into this in detail in this blog.
Centralized or decentralized data governance management:
In centralized data governance, we can create a separate role and database with the required privileges for all tags and map those tags to different objects in different databases.
In decentralized data governance, we can create tags in any database and use them within that database.
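The difference between the two approaches shows up in where the tag lives. The following Python sketch only builds the SQL text for each approach and does not connect to Snowflake; the database, schema, table, and tag names are all hypothetical, and the role and privilege setup is left out.

```python
def create_tag_sql(tag_fqn):
    """Build a CREATE TAG statement for a fully qualified tag name."""
    return f"CREATE TAG IF NOT EXISTS {tag_fqn}"


def set_table_tag_sql(table_fqn, tag_fqn, value):
    """Build an ALTER TABLE ... SET TAG statement; single quotes are doubled."""
    escaped = value.replace("'", "''")
    return f"ALTER TABLE {table_fqn} SET TAG {tag_fqn} = '{escaped}'"


# Centralized: one tag in a dedicated governance database, applied everywhere.
central_tag = "governance_db.tags.department"  # hypothetical name
centralized = [
    create_tag_sql(central_tag),
    set_table_tag_sql("hr_db.public.employees", central_tag, "hr"),
    set_table_tag_sql("analytics_db.public.events", central_tag, "analytics"),
]

# Decentralized: the tag is created and used inside a single database.
local_tag = "hr_db.public.department"  # hypothetical name
decentralized = [
    create_tag_sql(local_tag),
    set_table_tag_sql("hr_db.public.employees", local_tag, "hr"),
]

for statement in centralized + decentralized:
    print(statement + ";")
```

The statements are printed rather than executed so they can be reviewed first; running them would require a role with the appropriate tag privileges.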
Data cataloging use case: Consider a table that stores country names in a column named country, while another table stores country names in a column called region. Tags can be used to catalog columns containing similar data together, which makes the data easier to find and use. Tags can catalog not only columns but also other database objects and their metadata.
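The cataloging idea above can be sketched in a few lines of Python. This is a toy illustration rather than Snowflake code: the table names, the semantic_type tag, and its values are assumptions made for the example.

```python
from collections import defaultdict

# (table, column) -> tags assigned to that column; all names are made up.
COLUMN_TAGS = {
    ("customers", "country"): {"semantic_type": "country_name"},
    ("sales", "region"): {"semantic_type": "country_name"},
    ("sales", "amount"): {"semantic_type": "currency_amount"},
}


def catalog_by_tag(column_tags, tag):
    """Group columns by the value they carry for the given tag."""
    groups = defaultdict(list)
    for (table, column), tags in column_tags.items():
        if tag in tags:
            groups[tags[tag]].append(f"{table}.{column}")
    return {value: sorted(columns) for value, columns in groups.items()}


catalog = catalog_by_tag(COLUMN_TAGS, "semantic_type")
print(catalog["country_name"])
```

Even though the two columns have different names, they land in the same group because they share a tag value.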
Security Use Case: Consider a scenario where you need to give all transaction data to your analyst but must hide sensitive data, such as credit card numbers and PINs, to prevent unauthorized access. We can identify all the sensitive information and use Tags to separate sensitive from non-sensitive data, which in turn enables efficient data masking. We have discussed Data Masking using tags with the implementation procedure here.
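As a rough illustration of how tags can drive masking, here is a small Python sketch. It is a simulation, not a Snowflake masking policy: the column names, tag values, and the PAYMENTS_ADMIN role are hypothetical, and the sketch assumes that columns without a tag are treated as non-sensitive.

```python
# Column name -> sensitivity tag value (hypothetical columns and values).
SENSITIVITY_TAGS = {
    "card_number": "sensitive",
    "pin": "sensitive",
    "amount": "non_sensitive",
    "txn_date": "non_sensitive",
}

# Roles allowed to see sensitive values in clear text (hypothetical).
PRIVILEGED_ROLES = {"PAYMENTS_ADMIN"}


def mask_row(row, role):
    """Return a copy of row with sensitive columns masked for ordinary roles."""
    if role in PRIVILEGED_ROLES:
        return dict(row)
    return {
        col: "****" if SENSITIVITY_TAGS.get(col) == "sensitive" else val
        for col, val in row.items()
    }


row = {
    "card_number": "4111111111111111",
    "pin": "1234",
    "amount": 250.0,
    "txn_date": "2022-05-01",
}
print(mask_row(row, "ANALYST"))
print(mask_row(row, "PAYMENTS_ADMIN"))
```

The masking rule is written once against the tag value, so tagging a new column as sensitive is enough to have it masked for the analyst.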
Identifying sensitive information in our data could be a time-consuming process. To overcome this, Snowflake has introduced a feature called Data Classification.
Data Classification is a new Snowflake feature, still in preview for Enterprise Edition and above at the time of writing, that helps us dynamically identify sensitive information. Once the sensitive data has been identified, it can be mapped to tags using a stored procedure.
Resource usage Use Case:
Consider an organization with different departments such as HR, analytics, etc. Each department has its own databases, warehouses, and many other objects. The organization’s accountant needs to find the total Snowflake expenditure for each department. Tags can be used to map database objects and account-level objects together. Once Object Tags are applied across the account, the accountant can find the storage and compute costs for each department, so resource usage can be tracked per group.
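The accountant's roll-up can be sketched in Python as follows. The object names, department tag values, and cost figures are invented for illustration; in practice the usage numbers would come from Snowflake's usage views rather than a hard-coded list.

```python
from collections import defaultdict

# Object name -> department tag value (hypothetical objects and values).
OBJECT_TAGS = {
    "HR_WH": "hr",
    "HR_DB": "hr",
    "ANALYTICS_WH": "analytics",
    "ANALYTICS_DB": "analytics",
}

# (object name, cost) pairs; the figures are made up for the example.
USAGE = [
    ("HR_WH", 12.5),
    ("HR_DB", 3.0),
    ("ANALYTICS_WH", 40.0),
    ("ANALYTICS_DB", 7.5),
    ("SCRATCH_WH", 2.0),  # carries no tag
]


def cost_by_department(usage, object_tags):
    """Sum costs per department tag; untagged objects go into 'untagged'."""
    totals = defaultdict(float)
    for obj, cost in usage:
        totals[object_tags.get(obj, "untagged")] += cost
    return dict(totals)


print(cost_by_department(USAGE, OBJECT_TAGS))
```

Keeping an explicit untagged bucket makes it easy to spot objects whose cost is not yet attributed to any department.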
We can also track the queries executed by different departments using a session parameter called QUERY_TAG. We have discussed Query Tags here.
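As a minimal sketch of setting a query tag, the Python helper below builds the ALTER SESSION statement for a given department. It only produces the SQL text; the department value is hypothetical, and escaping a single quote by doubling it is an assumption about how the value should be quoted.

```python
def query_tag_statement(department):
    """Build the statement that sets QUERY_TAG for the current session."""
    escaped = department.replace("'", "''")  # double any single quotes
    return f"ALTER SESSION SET QUERY_TAG = '{escaped}'"


# Hypothetical department label; run the statement at the start of a session.
print(query_tag_statement("hr"))
```

Queries run in the session after this statement carry the tag, which is what allows them to be grouped by department later.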
Having data governance in place is essential, especially in a world of ever-expanding data lakes that are used across different data platforms and data applications. Snowflake’s object tags provide a one-stop view of all sensitive objects and of how, where, and by whom they are used. Any of the above use cases and methods can be implemented depending on the organization’s needs.