Schema On Read

Author: Amarendra Singh


Introduction:
  • In this blog we will look at what to do when we have semi-structured files in a local or cloud location and need to ingest them into the Snowflake ecosystem. Manually defining the column names in Snowflake and then loading the table is hard.

  • That approach takes a lot of effort and is an entirely manual process.

  • Supported file formats are Apache Parquet, Apache Avro, and ORC.

  • Using INFER_SCHEMA we can handle this semi-structured data.

  • INFER_SCHEMA automatically detects the column names and types in the staged files.

  • We then store this output in a template, and using that template we create the table in our Snowflake ecosystem.


When is this feature required?
  • Suppose you have a bulk of semi-structured files with lots of columns in an external stage, a user stage, or an internal named stage (the table stage is not supported).

  • We have to ingest this data and also create an automated data pipeline.


Using these 3 features we can automate this process:

1. Schema On Read in Snowflake (INFER_SCHEMA)

  • Using this Snowflake feature we can automatically manage dynamic column names for semi-structured data (here, Parquet).

  • Using the syntax below, we can detect the column definitions of files in an internal or external stage:



INFER_SCHEMA(
  LOCATION => '{ internalStage | externalStage }'
  , FILE_FORMAT => '<file_format_name>'
  [ , FILES => '<file_name>' [ , '<file_name>' ] [ , ... ] ]
  [ , IGNORE_CASE => TRUE | FALSE ]
)

  • In this scenario we are loading Parquet files from Amazon S3 via a Snowflake external stage.
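For context, the external stage referenced as @mystage in the queries below could be created over S3 roughly like this; it is only a sketch, and the bucket path and storage integration name are hypothetical placeholders, not taken from the demo:

-- Sketch: external stage over S3 (bucket path and integration name are placeholders)
create or replace stage mystage
  url = 's3://my-bucket/parquet/'            -- hypothetical bucket path
  storage_integration = my_s3_integration;   -- hypothetical storage integration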

2. GENERATE_COLUMN_DESCRIPTION

  • This feature generates a list of column definitions from staged files containing semi-structured data, based on the INFER_SCHEMA output. The result can be used as the column definitions when creating a table, external table, or view over the staged files.

  • After generating the column description, we create a template to store these output column definitions.

  • Using this template we can create the table as per requirement; a sketch follows below.
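As a rough illustration, the INFER_SCHEMA output can be aggregated and passed to GENERATE_COLUMN_DESCRIPTION, reusing the @mystage stage and my_parquet_format file format defined in the demo further below; the column alias is a placeholder:

-- Sketch: generate column definitions (as text) from the inferred schema
select generate_column_description(array_agg(object_construct(*)), 'table') as column_description
  from table(
    infer_schema(
      location=>'@mystage'
      , file_format=>'my_parquet_format'
    )
  );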



3. CREATE TABLE … USING TEMPLATE

  • Using GENERATE_COLUMN_DESCRIPTION we can build the template, and using this template we can create a table with the detected column names (see the sketch after this list).

  • Using this template the table is created in our Snowflake ecosystem.



  • This feature is not supported for the internal table stage.
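A minimal sketch of the pattern, assuming a placeholder table name mytable and reusing the @mystage stage and my_parquet_format file format from the demo below:

-- Sketch: create the table directly from the inferred schema
create table mytable
  using template (
    select array_agg(object_construct(*))
      from table(
        infer_schema(
          location=>'@mystage'
          , file_format=>'my_parquet_format'
        )
      )
  );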

create file format my_parquet_format
  type = parquet;


Retrieve the column definitions for Parquet files

-- Query the INFER_SCHEMA function.
select *
  from table(
    infer_schema(
      location=>'@mystage'
      , file_format=>'my_parquet_format'
    )
  );


-- Query again with IGNORE_CASE enabled, so detected column names are treated as
-- case-insensitive and returned in uppercase.
select *
  from table(
    infer_schema(
      location=>'@mystage'
      , file_format=>'my_parquet_format'
      , ignore_case=>true
    )
  );
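Once the table exists, the staged Parquet files can be loaded into it by column name. This is only a sketch, reusing the placeholder table name mytable from the earlier step:

-- Sketch: load the staged files into the created table, matching columns by name
copy into mytable
  from @mystage
  file_format = (format_name = 'my_parquet_format')
  match_by_column_name = case_insensitive;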


Overall code as per the demo data sets: Infer_Schema.docx (Google Docs)


Note:
  • These Snowflake features help you load bulk semi-structured data into your Snowflake ecosystem.



