top of page

Schema On Read

Updated: Mar 14

Author: Amarendra Singh


Introduction:
  • In this blog, we are going to understand when we have semi-structure files in a local or any cloud location and have to ingest them in the Snowflake ecosystem. It’s hard to manually create column names in Snowflake and then load them in the table.

  • Because that takes lots of effort and is a manual process.

  • Supported Files formats are Apache Parquet, Apache Avro, and ORC files.

  • Using infer_schema we can handle semi-structured data.

  • Infer_schema automatically detects the column name and the type of expression in the staged file.

  • And then, we need to store it in a template. Using this template, we created the table in our Snowflake ecosystem.


When is this feature required?
  • Suppose you have a bulk of semi-structured files with lots of columns in the external table, user table, and Internally named stage except the staging table.

  • We have to ingest this data and also create an automated data pipeline.


Using these 3 features we can automate this process:

1. Schema On Read In Snowflake(INFER_SCHEMA)

  • Using this Snowflake feature, we can automatically manage dynamic column names for the semi-structured data. (here Parquet)

  • Using this code, we can fetch the data from the internal Stage or external Stage.



INFER_SCHEMA(

LOCATION => '{ internalStage | externalStage }'

, FILE_FORMAT => '<file_format_name>'

, FILES => '<file_name>' [ , '<file_name>' ] [ , ... ]

, IGNORE_CASE => TRUE | FALSE

)

  • In this scenario, we are loading the parquet file from amazon S3 to the Snowflake external Stage.

2. GENERATE_COLUMN_DESCRIPTION

  • Using this feature generates a list of columns from stages files that had semi-structured data using inferred schema output results. The output from INFER_SCHEMA can be used as input for creating a table, external table, or view based on column definitions of the stages files.

  • After generating the column description, I have to create a template to store this output column name.

  • Using this, we can create the table as per requirement.



3. CREATE TABLE … USING TEMPLATE

  • Using generate_column_description we can create the template, and using this template, we can create a table with a column name.

  • Using this template, we have created the table in our Snowflake ecosystem.



  • This feature is not supported for the internal table stage.

create file format my_parquet_format

type = parquet;


Retrieve the column definitions for Parquet files

-- Query the INFER_SCHEMA function.

select *

from table(

infer_schema(

location=>'@mystage'

, file_format=>'my_parquet_format'

)

);


select *

from table(

infer_schema(

location=>'@mystage'

, file_format=>'my_parquet_format'

, ignore_case=>true

)

);


Overall Code as per demo data sets URL: Infer_Schema.docx - Google Docs


Note:
  • Using these Snowflake features help you to load bulk semi-structured data in your Snowflake ecosystem.




36 views0 comments
bottom of page