Azure Sql Data Warehouse is Azure’s data warehouse solution that uses massive parallel processing to execute complex queries on petabytes of information.
SQL Data Warehouse stores information in relational tables in columnar format, reducing storage costs and improving query performance.
Based on an architecture that distributes data processing to multiple nodes, SQL Data Warehouse separates computation from data storage, which allows scaling independently of the size of stored data.
1. Applications connect and send T-SQL commands to the control node. This control node runs the MPP engine that optimizes queries for parallel processing and sends them to computation nodes.
2. The computation nodes execute queries on the information stored in Azure Storage.
3. The Azure Storage layer “shards” the information into distributions to optimize system performance.
4. Finally, the DMS (Data Movement Service) is responsible for moving data between nodes to return appropriate results when queries are executed in parallel.
A distribution is the storage and processing unit used by SQL Data Warehouse to distribute information and execute queries in parallel.
When a query is executed on SQL Data Warehouse, work is broken down into 60 smaller queries that are executed in parallel. Each of these queries is executed on a distribution and each computation node handles one or more of these 60 distributions.
As we will see below, there are three types of distribution that can be applied to tables.
Tables can be persisted temporarily or permanently in the SQL Data Warehouse storage or in storage external to the data warehouse.
- Permanent tables store their information in the Azure Storage layer of the data warehouse. The table and its information remain after the user who created the table has logged out.
- Temporary tables only exist whilst the session of the user who created the table is active.
- External tables are those whose data is in Azure Blob Storage or Azure Data Lake Store. These tables are useful for loading information to the data warehouse.
There are three ways in which information from a table can be distributed within the data warehouse:
Hash-distributed: The data is distributed based on the value of a column defined in the table as a distribution column. This distribution is the best for achieving performance improvement in large tables or queries that perform joins of several tables. Choosing the distribution key is not easy but in general, these premises should be followed:
- Choose a column whose values do not have to be updated.
- Choose a column whose values are evenly distributed.
- Choose a column that is a candidate for JOIN, GROUP BY, DISTINCT, OVER or HAVING and that is not of the date type.
Replicated: Cada nodo de cómputo tiene una copia completa de la tabla. Las consultas van mucho más rápido en las tablas replicadas, puesto que no es necesario realizar ningún movimiento de datos entre los nodos de cómputo. Sin embargo requieren almacenamiento adicional y no funcionan bien con tamaños de tabla por encima de 2 GB.
Round robin: Los datos de la tabla se distribuyen de manera uniforme y aleatoria entre todas las particiones. La carga de datos en este tipo de tablas es muy rápida, pero las consultas son penalizadas porque suelen requerir un mayor movimiento de datos entre los nodos de cómputo que las otras dos distribuciones.
The query optimizer uses the statistics from the columns of a table when creating the plan for the execution of the query. To improve query performance it is important to create statistics, especially in the columns used in JOIN-type operations.
By default, automatic creation of statistics is enabled in the SQL Data Warehouse. With this functionality, SQL Data Warehouse analyses incoming user queries and creates individual column statistics for those columns that do not have statistics. When this process is executed, it may result in a drop in query performance if the columns do not yet have statistics created.
It is advisable to update statistics daily in columns that receive a lot of new information that could cause the distribution of the data to change. A good example of a column to be updated would be the aggregation of a new date in a date type column.
At present there are three types of indexes that can be applied to tables when they are created:
Clustered columnstore indexes: This is the default index when index type is not specified in the table creation. They offer the highest level of data understanding and the best overall query performance. This is usually the best option for large tables especially from 100 million rows.
Heap tables: When we specify this type of table, what we are really saying is that no indexing is used. For this reason, using this type of indexing is a good option when we need to load information in the data warehouse in staging tables before performing any other operation. This is because without indexing, the load is much faster.
Clustered indexes: Clustered index is the recommended index when you want to retrieve a few rows of the table quickly. For example, when we want to retrieve a record of a table from its ID.
Workload management is the ability to optimize overall query performance for environments with multiple concurrent users.
The class of resources associated with the user determines the performance of a query. Resource classes are predetermined limits in SQL Data Warehouse that manage computing resources and concurrency for query execution. The use of resource classes allows the number of queries that are executed concurrently and the computing resources assigned to each query to be limited.
There are two types of resource classes:
Static: The same amount of memory is assigned to the queries regardless of the performance of the cluster. The horizontal scaling of the cluster increases the memory and consequently with this type of resource class the concurrency increases. This type is recommended to optimize concurrence when the data warehouse data volume remains constant.
Dynamic: A variable amount of memory is assigned, depending on the current performance of the cluster. Horizontal cluster scaling makes the queries obtain a bigger amount of memory automatically. This kind of resource class is recommended to work with data warehouses that have a variable or increasing amount of data.
Data can be loaded into a SQL Data Warehouse cluster in multiple ways; however, the recommended method in most cases is to use Polybase. Polybase is a technology that allows access to external information stored in Azure Blob Storage or Azure Data Lake Store.
Polybase supports the following file types:
- Text files separated by delimiters in UTF-8 and UTF-16 encoding
In addition, Polybase supports Gzip or Snappy compressed files and can be run in the following ways:
- Polybase with T-SQL: The ideal method when the information is in Azure Blob Storage or Azure Data Lake Store and it allows greater control over the loading process.
- Polybase with SSIS: The ideal method when the information to be loaded is in SQL Server either in Azure or in the “OnPremise” environment. SSIS defines the mapping between source and destination and orchestrates the upload processes.
- Polybase with Azure Data Factory: Data factory is a tool that orchestrates and plans the execution of pipelines. Polybase is used in pipelines where information is loaded to the data warehouse data factory.
- Polybase with Azure DataBricks: Polybase allows reading tables from the data warehouse into Spark dataframes and writing Spark dataframes into tables from the data warehouse.
Follow our blog in which we will continue publishing short reviews about Azure solutions.