Author: Sanket Nalawade
This blog describes our migration of the existing data warehouse from Redshift to Snowflake.
Redshift warehouse architecture:
The existing data warehouse, Redshift, ran a number of ETL jobs (Talend, Matillion) and real-time data replication jobs (AWS DMS). After data cleaning and transformation were complete, the data analytics/science team used the same data for analysis and business decisions. A number of Tableau reports and dashboards were also built from this data, with Tableau connecting directly to Redshift.
This process was stable for a number of years and served its purpose at the time. But as our data grew into hundreds of TBs and the number of users querying it increased, we started to encounter concurrency issues whenever multiple ETL jobs loaded the same table or users tried to access it at the same time. Many times we had to manually re-run ETLs because Redshift disk/CPU utilization crossed 90% and all ETL and Tableau processes developed latency. We had to set aside our development tasks and work on resolving these latency issues. Migrating to a different Redshift cluster configuration could take a full 48 hours, and we weren't sure it would even resolve our speed issues.
We needed a platform with a much more scalable database solution that could dynamically be scaled based on user requirements and SLA expectations. Snowflake offers these capabilities, and we decided to use it as the consumption platform for all end-user reporting and access needs.
Migrating to Snowflake:
We used HVR, a Snowflake-partnered real-time data replication tool, as a replacement for AWS DMS. For all our ETL needs, we kept the same Talend and Matillion tools along with Python scripting. The master tables were migrated from Redshift to S3 using the UNLOAD command and then loaded into Snowflake using the COPY command. All other transaction tables were copied from their respective source databases using HVR and Talend.
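The UNLOAD/COPY path for the master tables can be sketched roughly as follows. The table name, bucket path, IAM role, stage name, and credentials below are illustrative placeholders, not our actual objects:

```sql
-- Redshift side: export the table to S3 (bucket and IAM role are placeholders)
UNLOAD ('SELECT * FROM public.customers')
TO 's3://my-migration-bucket/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
DELIMITER '|' GZIP ALLOWOVERWRITE;

-- Snowflake side: point an external stage at the unloaded files, then load them
CREATE OR REPLACE STAGE migration_stage
  URL = 's3://my-migration-bucket/customers/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

COPY INTO customers
FROM @migration_stage
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP);
```

Matching the UNLOAD delimiter and compression settings in the Snowflake FILE_FORMAT is the part that is easiest to get wrong.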
One of the big advantages of Snowflake's architecture is the separation between storage and compute. We built a dedicated warehouse for each of our major applications and named each warehouse so that it was easy to recognize who within the organization was using it. Once we granted teams permission to use their warehouse, it was easy to identify the cost associated with each application and business unit. This is super helpful and something we could not do in Redshift.
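A warehouse-per-application setup can be sketched like this; the warehouse and role names are illustrative, not the ones we actually use:

```sql
-- One warehouse per application/team, named after the consumer
CREATE WAREHOUSE IF NOT EXISTS TABLEAU_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 300      -- suspend after 5 idle minutes to save credits
  AUTO_RESUME    = TRUE;    -- wake up automatically on the next query

-- Grant usage to the team's role; their spend then shows up under this
-- warehouse in Snowflake's billing/usage views
GRANT USAGE ON WAREHOUSE TABLEAU_WH TO ROLE ANALYTICS_ROLE;
```

Because each team's queries run on its own warehouse, per-application cost attribution falls out of the warehouse name for free.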
Snowflake also removes the need to analyze and vacuum tables regularly, which was a recurring chore in Redshift, much like the challenges of scaling Redshift up or down. Redshift resize operations can be quite expensive and result in significant downtime. Since compute and storage are separate in Snowflake, you don't have to copy data around to scale up or down; you can switch compute capacity as you see fit. With Redshift, you have to manage specific servers even though the service is virtual. Overall, there is more management involved with Redshift than with Snowflake.
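Resizing in Snowflake is a one-line statement rather than a cluster migration; a minimal sketch (the warehouse name is a placeholder):

```sql
-- Scale up for a heavy ETL window; takes effect for subsequent queries,
-- with no data copy or downtime involved
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'XLARGE';

-- Scale back down once the load is done
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'SMALL';
```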
Additional benefits include:
1. Time Travel lets us query, clone, or restore tables, schemas, and databases as of any point within a chosen retention window.
2. Zero-copy cloning lets us quickly clone databases, schemas, and tables without physically copying the data.
3. Result caching returns results for a repeated query instantly, as long as the underlying data has not changed since the last execution.
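The Time Travel and cloning features above look like this in practice; table and database names are illustrative:

```sql
-- Query a table as it was 30 minutes ago (within the retention window)
SELECT * FROM orders AT(OFFSET => -60 * 30);

-- Recover a table that was dropped by mistake
UNDROP TABLE orders;

-- Zero-copy clone of an entire database, e.g., to spin up a test environment
CREATE DATABASE analytics_dev CLONE analytics;
```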
Challenges during migration:
Snowflake is a completely different data warehouse, and a number of Redshift commands and functionalities are either unsupported there or behave differently.
The SQL challenge: Redshift and Snowflake use slightly different variants of SQL syntax. Here are the main differences that you might need to consider while migrating the code:
| Redshift | Snowflake |
| -------- | --------- |
| select GETDATE() | select CURRENT_TIMESTAMP() |
| select TRUNC(CURRENT_TIMESTAMP()) | select DATE_TRUNC(DAY, CURRENT_TIMESTAMP()) |
| select SYSDATE | select TO_TIMESTAMP_NTZ(CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())) |
| select DATE(CURRENT_TIMESTAMP()) | select TO_DATE(CURRENT_TIMESTAMP()) |
| date_time_utc > GETDATE() - 7 | date_time_utc > dateadd('DAY', -7, TO_TIMESTAMP_NTZ(LOCALTIMESTAMP)) |
| select coalesce('a') | select coalesce('a', 'b') (with only one argument, Snowflake raises "SQL compilation error: error line 1 at position 7 not enough arguments for function [COALESCE('a')], expected 2, got 1") |
| DISTSTYLE | not used |
| DISTKEY | not used |
| SORTKEY | not used |
| SORTKEY(col) | not used |
| ENCODE | not used |
| interleaved | not used |
| bpchar | char |
| character varying | varchar |
| 'now'::character varying | current_timestamp |
| identity(start, 0, ([0-9],[0-9])::text) | identity(start, 1) |
DML changes: Redshift has a more lenient syntax for deleting rows: it accepts DELETE table_name without the FROM keyword, whereas Snowflake requires the full, correct syntax.
| Redshift | Snowflake |
| -------- | --------- |
| DELETE customers | DELETE FROM customers |
Semi-Structured Data: Both Snowflake and Redshift provide parsing capabilities for semi-structured data. You might need to change SQL to the format below if you are parsing JSON in Redshift, as Snowflake syntax is different.
| Redshift | Snowflake |
| -------- | --------- |
| json_extract_path_text(properties, 'name') | parse_json(properties):name |
| select json_extract_array_element_text('[111,112,113]', 2); | select parse_json('[111,112,113]')[2]; |
Snowflake's VARIANT data type, which can hold semi-structured data (JSON, Avro, ORC, Parquet, or XML), proved valuable to us and performed well. Snowflake stores VARIANT data internally in an efficient, compressed, columnar binary representation for better performance and efficiency.
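A minimal sketch of the VARIANT workflow; the table, column, and field names are illustrative:

```sql
-- Landing table with a VARIANT column for raw JSON documents
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
SELECT PARSE_JSON('{"name": "alice", "tags": [111, 112, 113]}');

-- Colon/bracket notation navigates the document; cast to a SQL type at the end
SELECT payload:name::STRING  AS user_name,
       payload:tags[2]::INT  AS third_tag
FROM raw_events;
```

Because the documents are shredded into a columnar form internally, such queries prune to just the referenced fields rather than re-parsing the whole JSON blob.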
Both data warehouses ran live in parallel for a few weeks so we could compare data between the two. Once we were confident the migration was successful, we gradually switched applications over to Snowflake as their main data source for analysis.
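The parallel-run comparison boils down to running the same aggregates on both warehouses and diffing the output. A minimal sketch of one such check, with illustrative table and column names (run once on Redshift and once on Snowflake, then compare the rows):

```sql
-- Row count plus a simple checksum-style aggregate per table;
-- mismatches between the two warehouses flag tables to investigate
SELECT 'customers'      AS table_name,
       COUNT(*)         AS row_count,
       SUM(order_total) AS total_amount
FROM customers;
```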