Author: Anuj Karn
Database and data warehouse technology are evolving at a very fast pace. Nowadays, many tools help you connect different Data Warehouses to each other so that you can take advantage of both technologies. One of the processes is integrating Snowflake into BigQuery.
The cloud has become the most popular way for businesses to store data. This is because it comes with several advantages compared to on-premise storage options. When storing data in the cloud, there is no need to select, install, configure, and manage any hardware. The cloud is, therefore an ideal choice for all businesses that do not want to spend resources on it and reduce the problem of scalability.
One of the top data warehouses utilized by businesses worldwide is Snowflake, along with BigQuery. For businesses, Snowflake is a well-liked cloud storage alternative. It permits businesses to create a data warehouse to house their data. After storing data in a data warehouse, companies will want to gain insight from the data. Such insights can then help business managers make evidence-based decisions. Google BigQuery is Google's data warehousing platform that is serverless, cost-effective, highly scalable and has built-in machine learning. By connecting Snowflake with BigQuery, companies can take advantage of both data warehouses.
Let us take a scenario where businesses want their information that is generated every day in Google BigQuery to move into a Snowflake database in a nightly process. In this article, you will get a process to automate data migration from bigquery to snowflake.
Steps to migrate the data:
→ Follow the below steps to perform a history load from GCP bigquery (google trends datasets) to snowflake
Overall the process consists of 2 steps, one is to export the data from bigquery to a storage bucket, other is to load it in Snowflake.
Step 1: Export data from bigquery datasets to google cloud storage bucket:
In this use case, we are using a bucket called mini_4a. Login to the GCP console, navigate to bigquery, and execute the query shown below to export data from international_top_rising_terms dataset. Customize the query as per requirement.
EXPORT DATA OPTIONS( uri=’<uri of gcs bucket>’, format=’<file format>’, compression=’<compression method>’, header=TRUE, field_delimiter=’<delimiter>’ AS <SELECT QUERY AS PER REQUIREMENT>
You can refer to the example below:
After executing the above query, navigate to the google cloud storage bucket and check if data files are created in the bucket.
As shown above, data files will be created in the destination bucket.
Step 2: Create an external stage in snowflake that links to GCS bucket
→ To create an external stage with GCP in snowflake, we need a storage integration object. Follow the below steps to create a storage integration object. Storage integration objects can only be created by account admins. Switch to the account admin role and execute the following query by providing appropriate values for storage_allowed_location. The bucket specified in storage_allowed_location should match the bucket created in the previous step.
CREATE STORAGE INTEGRATION gcp_sf_integration TYPE = EXTERNAL_STAGE STORAGE_PROVIDER = GCS ENABLED = TRUE STORAGE_ALLOWED_LOCATIONS = (‘gcs://mini_4a’);
→ Get the service account principle linked to the storage integration object using the below command.
DESCRIBE INTEGRATION gcp_sf_integration
The output will look like this:
The value corresponding to property no:5 is the principle for the storage integration object. Copy the corresponding property value and save it for later use.
→ Now, navigate to the IAM section in the GCP console and select Roles from the left side scroll bar (highlighted in yellow). Roles in GCP are a set of privileges that can be attached to a user or service account. We need to create a role that has the necessary permissions to access the GCP bucket.
Below is a screenshot of IAM & Admin, which will help to navigate to Roles:
Click on Roles and click on create a role as shown below:
Fill in the necessary details like role name etc. In the add permissions section, shown below:
Click on add permissions, and the below dialogue box will appear.
In the highlighted area, search for the below permissions and click on add at the end.
Once the above permissions are added to your role, your role assignments should look like this:
Once the role is created, navigate to the GCS bucket and select your bucket using the check box and click on add principle shown above and enter the service account principle value that you have saved from snowflake at the beginning of step-2. In the add role section, select the role that you have just created earlier.
Click on save and head back to snowflake. Now create an external stage using the storage integration object by following the below query.
create or replace stage my_gcs_stage url = ‘gcs://mini_4a/’ storage_integration = gcp_sf_integration file_format = gcp_history_load_format;
We have successfully created an external stage in snowflake that links to a bucket in GCP.
→ To verify stage creation, execute the show stages command and check if your stage is visible.
Export data from external stage to snowflake table:
→ Create a file format as per your data files in the GCS bucket using the below query.
CREATE or replace FILE FORMAT gcp_history_load_format TYPE = 'CSV' COMPRESSION = 'GZIP' FIELD_DELIMITER = '~' RECORD_DELIMITER = '\n' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' TRIM_SPACE = TRUE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134' DATE_FORMAT = 'YYYY-MM-DD' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
Once the file format is created, we can use the copy into a command to load data to the destination table in snowflake using the below query.
copy into international_top_rising_terms from @my_gcs_stage file_format=gcp_history_load_format purge=True;
→ Let’s verify data load using count(*),
As shown above, all 11.3 million records from the GCP google trends dataset are loaded to snowflake. Repeat this process for each dataset that you want to use.
As migration is an ongoing activity and also the most time-consuming too. It's best to automate the migration process so that whenever there is a change in the dataset, then we can get the data reflected in near real-time.
We hope that you were able to automate the migration of the dataset from Google Bigquery to Snowflake Datawarehouse and that data is in Snowflake by following one of the many ways.