Bigquery to Snowflake Migration

Author: Anuj Karn



Introduction:

Database and data warehouse technology is evolving at a very fast pace. Nowadays, many tools help you connect different Data Warehouses to each other so that you can take advantage of both technologies. One of the processes is integrating Snowflake to BigQuery.


The cloud has become the most popular way for businesses to store data. This is because it comes with several advantages compared to on-premise storage options. When storing data in the cloud, there is no need to select, install, configure and manage any hardware. The cloud is therefore an ideal choice for all businesses that do not want to spend resources on it and reduce the problem of scalability.


One of the top data warehouses utilized by businesses worldwide is Snowflake, along with BigQuery. For businesses, Snowflake is a well-liked cloud storage alternative. It permits businesses to create a data warehouse to house their data.After storing data in a data warehouse, companies will want to gain insight from the data. Such insights can then help business managers make evidence-based decisions. Google BigQuery is Google's data warehousing platform that is serverless, cost-effective, highly scalable, and has built-in machine learning. By connecting Snowflake with BigQuery, companies can take advantage of both data warehouses.


Prerequisite:

Let us take a scenario where businesses want their information that is generated every day in Google BigQuery to move into a Snowflake database in a nightly process.In this article, you will get a process to automate data migration from bigquery to snowflake.


Steps to migrate the data:

→ Follow the below steps to perform history load from GCP bigquery (google trends datasets) to snowflake


Overall the process consists of 2 steps, one is to export the data from bigquery to a storage bucket, other is to load it in Snowflake.


Step 1: Export data from bigquery datasets to google cloud storage bucket:


In this use case, we are using a bucket called mini_4a. Login to GCP console, navigate to bigquery and execute the query shown below to export data from international_top_rising_terms dataset. Customize the query as per requirement.


EXPORT DATA OPTIONS( uri=’<uri of gcs bucket>’, format=’<file format>’, compression=’<compression method>’, header=TRUE, field_delimiter=’<delimiter>’ AS <SELECT QUERY AS PER REQUIREMENT>

You can refer to the example below:



After executing the above query, navigate to google cloud storage bucket and check if data files are created in the bucket.



As shown above, data files will be created in the destination bucket.


Step 2: Create an external stage in snowflake that links to GCS bucket


→ To create an external stage with gcp in snowflake, we need a storage integration object. Follow the below steps to create a storage integration object. Storage integration objects can only be created by account admins. Switch to account admin role and execute the following query by providing appropriate values for storage_allowed_location. The bucket specified in storage_allowed_location should match with the bucket created in the previous step.

CREATE STORAGE INTEGRATION gcp_sf_integration TYPE = EXTERNAL_STAGE STORAGE_PROVIDER = GCS ENABLED = TRUE STORAGE_ALLOWED_LOCATIONS = (‘gcs://mini_4a’);

→ Get the service account principle linked to the storage integration object using the below command.


DESCRIBE INTEGRATION gcp_sf_integration



Output will look like:



The value corresponding to property no:5 is the principle for the storage integration object. Copy the corresponding property value and save it for later use.


→ Now, navigate to the IAM section in the gcp console and select Roles from the left side scroll bar (highlighted in yellow). Roles in GCP are a set of privileges which can be attached to a user or service account. We need to create a role that has necessary permissions to access the gcp bucket.


Below is a screenshot of IAM & Admin which will help to navigate to Roles:



Click on Roles and click on create role as shown below:



Fill in the necessary details like role name etc. In the add permissions section, shown below:



Click on add permissions and the below dialogue box will appear.



In the highlighted area, search for the below permissions and click on add at the end.



Once the above permissions are added to your role, your role assignments should look like:



Once the role is created, navigate to the GCS bucket and select your bucket using the check box and click on add principle shown above and enter the service account principle value that you have saved from snowflake in the beginning of step-2. In the add role section, select the role that you have just created earlier.



Click on save and head back to snowflake. Now create an external stage using the storage integration object by following the below query.


create or replace stage my_gcs_stage url = ‘gcs://mini_4a/’ storage_integration = gcp_sf_integration file_format = gcp_history_load_format;


We have successfully created an external stage in snowflake that links to a bucket in gcp.


→ To verify stage creation, execute the show stages command and check if your stage is visible.



Export data from external stage to snowflake table:

→ Create a file format as per your data files in the GCS bucket, using the below query.


CREATE or replace FILE FORMAT gcp_history_load_format TYPE = 'CSV' COMPRESSION = 'GZIP' FIELD_DELIMITER = '~' RECORD_DELIMITER = '\n' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' TRIM_SPACE = TRUE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134' DATE_FORMAT = 'YYYY-MM-DD' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');


Once the file format is created, we can use the copy into command to load data to the destination table in snowflake using the below query.


copy into international_top_rising_terms from @my_gcs_stage file_format=gcp_history_load_format purge=True;


→ Let’s verify data load using count(*),



As shown above, all 11.3 million records from GCP google trends dataset are loaded to snowflake. Repeat this process for each dataset that you want to use.


Conclusion:

As migration is an ongoing activity and also most time consuming too. It's best to automate the migration process so that whenever there is change in dataset, then we can get the data reflected in near real time.


We hope that you were able to automate migration of the dataset from Google Bigquery to Snowflake Datawarehouse and data is in Snowflake by following one of the many ways.


Reference:
30 views0 comments

Recent Posts

See All