Author: Aman Raj
Previously in Snowflake, we had to analyze the data, design the schema, create the table, and then load the data to match that schema. With the Schema on Read feature, that upfront work goes away: Snowflake derives the schema from the data itself. The feature is built on three new functions used for schema detection:
INFER_SCHEMA reads the files in the staging area and retrieves their column definitions, so the schema can be created automatically.
GENERATE_COLUMN_DESCRIPTION returns a list of the detected columns, which can be used as the column definitions in a CREATE TABLE, CREATE VIEW, or CREATE EXTERNAL TABLE statement.
CREATE TABLE ... USING TEMPLATE creates the table directly from the detected schema, without writing the column definitions by hand.
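As a quick sketch of the second function, GENERATE_COLUMN_DESCRIPTION can be wrapped around the output of INFER_SCHEMA to produce a ready-made column list (the stage and file format names here match the ones created in the snippets later in this post):

```sql
-- Turn the inferred schema into a column-definition string
-- suitable for pasting into a CREATE TABLE statement.
select generate_column_description(
         array_agg(object_construct(*)), 'table') as column_list
from table(
  infer_schema(
    location    => '@parquet_stage',
    file_format => 'parquet_format'
  )
);
```

The second argument ('table') tells the function which kind of DDL the description is intended for; 'view' and 'external_table' are the other options.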
Why Schema on Read? What's the benefit?
For a long time, the workflow was to analyze the data, design a schema, and then load the data to fit it. Even when people could load this type of data into a system quickly, there was still plenty of work ahead of them to retrieve it in a form that could be analyzed or processed, and the person writing the queries had to work out the schema themselves. There can also be a query performance penalty in this process (compared with querying columns in a relational database).
Not so with the "Snowflake Elastic Data Warehouse" (#ElasticDW)! With Snowflake, we can load semi-structured data directly into a relational table, query it with SQL statements, and join it with other structured data. Snowflake automatically keeps track of the self-describing schema, so no ETL and no painful shredding of the data are required.
How to achieve the Schema on read feature?
Snowflake introduced a new data type called VARIANT that allows the user to load semi-structured data into a column of a relational table. No Hadoop or NoSQL system is needed in the data warehouse architecture to hold the semi-structured data; it is handled by a relational, SQL-based cloud database. That means that as the data changes over time, no re-coding of ETL (or even ELT) code is required to adapt: if the schema changes, the VARIANT column absorbs it.
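To make this concrete, here is a minimal sketch of a VARIANT column in action. The table name and the JSON fields are illustrative, not from the original post:

```sql
-- A single VARIANT column holds raw semi-structured documents.
create or replace table raw_events (v variant);

-- Parse a JSON literal into the VARIANT column. In practice the
-- data would arrive via COPY INTO from a stage.
insert into raw_events
  select parse_json('{"user": "alice", "action": "login", "meta": {"ip": "10.0.0.1"}}');

-- Query nested fields with path notation; no upfront schema needed.
select v:user::string    as user_name,
       v:action::string  as action,
       v:meta.ip::string as ip_address
from raw_events;
```

If a new field appears in later documents, nothing breaks: existing queries keep working, and the new field becomes queryable through the same path notation.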
Code snippets for INFER_SCHEMA and Schema on Read in Snowflake.
Create file format:
create or replace file format parquet_format TYPE=parquet;
Create external stage:
create or replace stage parquet_stage url='s3://team4a-1/parquet_file/parquet_table/'
credentials=(AWS_KEY_ID='XXXXXXXXXXXXXXXXXXXXXXXXX' AWS_SECRET_KEY='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
file_format = parquet_format;
Infer the schema from the staged files and create the table from it:
select * from TABLE(INFER_SCHEMA (LOCATION=>'@parquet_stage',FILE_FORMAT=>'parquet_format'));
select ARRAY_AGG(OBJECT_CONSTRUCT(*)) from TABLE(INFER_SCHEMA (LOCATION=>'@parquet_stage',FILE_FORMAT=>'parquet_format'));
create or replace table Teacher_parquet using template(select ARRAY_AGG(OBJECT_CONSTRUCT(*)) from TABLE(INFER_SCHEMA (LOCATION=>'@parquet_stage',FILE_FORMAT=>'parquet_format')));
copy into Teacher_parquet from @parquet_stage MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE FILE_FORMAT=(FORMAT_NAME=parquet_format);
select * from Teacher_parquet;
Compared with the previous way of working, where the schema had to be designed first and the loads built around it, Schema on Read is a real saver of both time and effort. Snowflake manages the schema on its own, leaving the developer far less of the burden of analyzing typical fact and dimension tables and creating the schema by hand.