Author: Tejaswini R
Snowflake’s multi-cluster shared data architecture that separates compute and storage layers enables great flexibility and scalability of storage and compute resources. This allows users to scale up compute resources when a high volume of data is to be loaded faster or resource intensive queries are to be executed without degrading the compute performance. In this blog, we discuss ways to monitor and identify bottlenecks that cause delay in data operations.
Long Running Queries
Queries taking up a longer time to execute than expected affects overall performance causing delays in data operations critical to organizations. Long running queries could be due to one or more of the several reasons such as,
Inappropriate virtual warehouse size: Choosing an appropriate warehouse size considering the size and complexity of queries is not easy. For example, a large data load operation taking up a long time could probably not just be solved by increasing the warehouse size, as a single large file would still use a single thread in a single node in the warehouse. The large files must be split into smaller files to make use of the snowflake Multi Cluster feature.
Queued queries: When workloads increase, queries queue up slowing down the execution time. The Multi Cluster feature which can automatically scale out can be configured for this.
Disk Spilling: When query results don’t fit in the local drive, snowflake uses remote storage. As a result some of the bytes are “spilled” to remote storage which consumes more time to fetch a result.
Complex queries or inaccurate joins that a developer has executed causing eating up of compute resources.
Monitoring Performance and Identifying Long Running Queries
Snowflake provides several ways to identify and diagnose slow running queries and also monitor the performance in virtual warehouses.
System Views and functions
Query History table in account_usage schema and query history table function in information_schema provides all of the information related to a query. Information such as the query text, execution time, warehouse name, database etc can be used to identify long running queries along with the users running them or warehouses used to execute them.
select * from table(information_schema.query_history()) order by start_time;
Query Profile Page
This page provides all the execution details for a specific query and helps diagnose the problem for a long running query
Load Monitoring Chart
The Load monitoring chart in the warehouses tab of the classic web UI provides an insight to the load on individual warehouses. This helps to make decisions on increasing or decreasing warehouse sizes and cluster size.
Optimizing performance monitoring
All the above options to identify and monitor warehouse performance might not be feasible in large organizations where several users could execute multiple queries concurrently. Of which, many of the “Bad queries” as in long running queries might go unnoticed. On a longer run this would be disadvantageous not only by causing delays in operations but also results in bigger cost bills. One(Any user or admin) must have clear visibility on what were the “Good” and “Bad” queries that were executed on a day or in the past week.
The above picture shows a dashboard created on Snowsight by Mastiff where we have made use of snowflake query history information to categorize queries into good or bad not only based on how long it took to be executed but also on priority levels such as who executed them and which warehouse size was used.
Now we can drill down into the queries that were bad and use all of the information provided by the system views. One can dive into whether it was the type of query that caused the bottlenecks or was it the warehouse size or cluster size that caused a query to be bad. Furthermore, actions can be taken by changing or optimizing a complex query, changing the warehouse size, cluster size, adding cluster keys to tables, optimizing the file size for load etc.
In addition to this we have also configured notifications and alerts for users to be aware of the number of bad queries executed over a period of time, how long they were or still running etc. So that users can make decisions on whether or not to cancel a long running query and also make optimal changes to reduce the query execution time.
To know more about how to implement this tool contact email@example.com