Serverless SQL pools in an Azure Synapse Analytics workspace enable you to query data placed in Azure Data Lake, Dataverse, or Azure Cosmos DB without the need to import it into a database. You just need to create a table on top of your Parquet, Delta Lake, or Cosmos DB data and use the T-SQL language to query it. The following example shows how to create an external table on top of Delta Lake data stored in Azure Data Lake storage:
CREATE EXTERNAL TABLE Covid (
    date_rep date,
    cases int,
    geo_id varchar(6)
) WITH (
    LOCATION = 'covid', --> the root folder containing the Delta Lake files
    DATA_SOURCE = DeltaLakeStorage,
    FILE_FORMAT = DeltaLakeFormat
);
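Once the external table is created, you can query it with plain T-SQL like any other table. As a quick sketch against the table defined above (the aggregation is just for illustration):
SELECT geo_id, SUM(cases) AS total_cases
FROM Covid
GROUP BY geo_id
ORDER BY total_cases DESC;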
One of the biggest challenges that you might face while creating a schema on your datasets is the proper choice of column types. In files and collections, you will frequently find generic number or string columns, and you might be tempted to use a type like VARCHAR(MAX) or VARCHAR(8000) to represent them, or you might not be sure whether to use bigint or smallint for the numbers. Be aware that choosing large column types might impact the performance of your queries.
You should minimize the column types to improve the performance and concurrency of the queries.
Type size impacts performance of SQL operations
Type minimization is a well-known technique in SQL Server and Azure SQL databases.
The type size is especially important for the columns that you use in JOIN conditions or that appear in the GROUP BY list. Ideally, you should join the datasets on int/bigint columns and avoid strings and GUIDs. Even if you need to use strings, make sure that they are the smallest possible size. The performance of GROUP BY and similar operations directly depends on the amount of data used to group the rows. If you are processing large data sets, oversized columns might have a big impact on the performance of your queries.
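As an illustration, assume two hypothetical external tables, Sales and Stores, that share an integer store key. Joining and grouping on the narrow int column keeps the data that the compute nodes exchange and sort small:
SELECT s.store_id, SUM(sa.amount) AS total_amount
FROM Sales AS sa
JOIN Stores AS s
    ON sa.store_id = s.store_id -- int join key instead of a GUID or long string
GROUP BY s.store_id;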
Type sizes impact concurrency
The serverless SQL pool is a distributed computing system that executes concurrent queries on a set of distributed compute nodes. Multiple compute nodes are running the parts of a distributed query plan that read the underlying files, join the data sets, group, and aggregate results. Different queries might try to use the same compute nodes to execute the parts of the queries.
Oversized column types like VARCHAR(MAX) might trick the query optimizer into allocating more resources than are needed. The allocation is based on an estimate, and the over-allocated resources will not be used in the actual execution because they are not needed. If a compute node needs 100 MB to sort the results, it will use only those 100 MB even though the query optimizer allocated 4 GB of memory for the task on that compute node.
Over-allocation will not help, but it might impact the concurrency of your workload. Parts of the queries might be granted more resources on some nodes than they will ever use. A part of a query that effectively uses 100 MB of memory might get 4 GB if the query optimizer believes that the columns contain a lot of data. However, although it uses just 100 MB, it will not release the allocated 4 GB, because it cannot know before the end of the task that the remaining memory will not be needed.
Why is this bad? The part of the query uses only the memory that it needs and does not leverage the other resources that were allocated for it. However, other queries that need to place their operations on that compute node must wait for those resources to be released before they can deploy their parts of the queries. You might end up with queries that are unnecessarily waiting for other queries to release resources before they can start execution.
Type sizes might bring overhead in the distributed execution
The serverless SQL pool is a distributed computing system where multiple compute nodes are running the parts of a distributed query plan that read the underlying files, join the data sets, group, and aggregate results.
The number and organization of compute nodes that will execute the tasks within the query plan depend on the schema and data size.
Oversizing the column types might cause additional overhead in the distributed environment. Some simple operations that could be efficiently completed with a single compute node might be spread across tens of distributed compute nodes.
Imagine a query SELECT TOP 10 * FROM table_with_1024_columns where all columns are VARCHAR(MAX), with a potential size of 2 GB per cell. The query optimizer might assume that you are reading 2 TB per row (1024 x 2 GB) and allocate multiple compute nodes to handle this query. More time might then be spent exchanging these 10 rows between the distributed components than on the actual reading.
The serverless SQL pools use both type sizes and statistics to estimate the amount of data in a column, but the estimates might still be skewed by oversized column types.
Best practices
Try to estimate the size of each column and use, for example, VARCHAR(30) for names, VARCHAR(100) for addresses, and so on. There are other values, such as SSNs, codes, and abbreviations, where you can guess the maximum size.
This is especially important if you are using the OPENROWSET function without a WITH schema, where you let the OPENROWSET function infer all types. This function represents all strings as VARCHAR(8000) to avoid possible truncation errors, but this is not good for performance on larger data sets. Make sure that you add the minimized types in the WITH clause when you expose the function to the end users.
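For illustration, here is a sketch of such a query (the storage URL and column names are hypothetical) where the WITH clause pins each column to the smallest suitable type instead of relying on the inferred VARCHAR(8000):
SELECT geo_id, SUM(cases) AS total_cases
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/data/covid/*.parquet', -- hypothetical path
    FORMAT = 'PARQUET'
) WITH (
    date_rep date,
    cases smallint,    -- assumed to be enough for daily counts in this data set
    geo_id varchar(6)  -- short country/region codes instead of VARCHAR(8000)
) AS rows
GROUP BY geo_id;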
If you have description columns, you will probably need to use VARCHAR(MAX) or VARCHAR(8000), but try to make the other columns smaller. Even if you represent dates or times as strings, try to calculate the minimum size required to store these values; for example, an ISO date such as 2021-03-15 fits in VARCHAR(10).
Conclusion
Properly sized column types might significantly improve the performance of your queries and the concurrency of your workload. This is one of the best practices that you should apply to optimize your schema.
Posted at https://sl.advdat.com/30ERKyC