Advanced Data Solutions : Optimize database schema in serverless SQL pools using QPI library

Serverless SQL pools enable you to query the data stored in Azure Data Lake Storage, Cosmos DB analytical store, or Dataverse, without the need to import your data into the database tables. For optimal performance, it is very important to apply the best practices and optimize the schema/queries. Some of the must-have best practices are:

Make sure that your client (PowerBI), serverless, and storage layer are in the same region.
If you are using ADLS make sure that the files are evenly sized in the range from 100 MB to 10 GB.
Make sure that you are using optimal schema and types. Organize your tables/views in star/snowflake model, pre-aggregate measures, avoid using VARCHAR(MAX) type for smaller strings, don’t use NVARCHAR when you are using UTF-8 encoded data, use BIN2_UTF8 collation on the strings where you are doing the filtering.

You can find more of the best practices here. These best practices are very important because some issues might cause performance degradation. You might be surprised how applying some of these best practices might improve the performance of your workload.

The last item that is related to schema optimization is sometimes hard to check. You would need to look at your schema, inspect all columns and find what to optimize. If you have a large schema, this might not be an easy task. But you can make your life easier if you use the QPI helper library that can detect schema issues for you.

QPI library

Query Performance Insights (QPI) is an open-source and free set of T-SQL views that you can set up on your database and get some useful information about your database. You can simply install this library by executing this script. Once you execute this script (you can manually review it and confirm that it just references some system DMVs), you can use the views in QPI schema to find the recommendations for optimizing your database or troubleshoot your queries. In this article, I will talk about the view that returns the recommendations that explain how to optimize the schema (column types) in your serverless database.

How to find perf recommendations?

Once you set up the QPI library by executing this script, you should be able to query the views that are created with this script. You can read the recommendations by querying qpi.recommendations view:

This view will scan your tables, views, and columns, and give you some hints on how to optimize your schema. You will see the name of the recommendation, score (higher score = more impact on perf) schema, object name (view or table name), and column name (if it is applicable) where you can apply the recommendation. In the last column, you will see the reason why the recommendation is generated.

Recommendation types

Every recommendation has a type because you can apply different optimizations in different cases. In the following sections will be explained the most common recommendation types.

Optimize column type

The column types should be minimized so the query optimizer could more accurately estimate the resources for the query (you don’t want to run SELECT TOP 10 on 11 VMs because the query is overestimated). You should try to avoid big types such as VARCHAR(MAX) or VARCHAR(8000) where possible. The QPI library will inspect all types, find the ones that are maybe too large (for example LOB types or big VARCHAR types). Look at these columns and check if you could use the smaller types.

For example, let’s assume that the (dbo,demographics,population) is shown in the recommendation. You should open the definition of the dbo.demographics view, and see could you change the types of the column:

CREATE OR ALTER VIEW dbo.demographics AS
SELECT  *
FROM OPENROWSET( BULK 'https://mystorage.blob.core.windows.net/csv/*.csv', FORMAT = 'CSV'    )
WITH (
    country_code VARCHAR (5),
    population smallint
) AS r

The population column is bigint, but maybe it could be changed to smallint.

Optimize key column type

There are some columns that might be used in the join conditions (like primary and foreign keys in standard databases). These columns should have more restrictions. If you are joining tables using the equality of the columns, the columns should be up to 4-8 bytes. Joining tables using long string columns degrades performance. This recommendation will show schema, table, and the column where the QPI library believes that you might be used as the keys in join condition and will recommend using smaller types if possible.

Optimize string filters

You might have some string columns that are used to filter data. The filtering on Parquet, Delta, and CosmosDB data is much faster if the collation is Latin1_General_100_BIN2_UTF8. Apply this collation on the string filter column if your queries are slicing data using this column as a predicate.

For example, let’s assume that the (dbo,demographics,country_code) is shown in the recommendation. You should open the definition of the dbo.demographics view, and add the COLLATE Latin1_General_100_BIN2_UTF8 clause after the type definition:

CREATE OR ALTER VIEW dbo.demographics AS
SELECT  *
FROM OPENROWSET( BULK 'https://storage.blob.core.windows.net/d', FORMAT = 'DELTA' )
WITH (
    country_code VARCHAR (5) COLLATE Latin1_General_100_BIN2_UTF8,
    population smallint
) AS r

Make this change only if you are sure that you are filtering data using this column. Forcing BIN2 collation might cause unexpected results on sort operations because BIN2 collation uses ordering by character codes.

Use better column types

You might have some numeric, date, or GUID columns that are formally declared as VARCHAR, VARBINARY, etc. If you see this recommendation, look at the column’s type and think it is the wrong type assigned to the column.

For example, let’s assume that the (dbo,demographics,country_hash) and (dbo,demographics,date_modified) are shown in the recommendation. You should open the definition of the view dbo.demographics, see would you need to change the types:

CREATE OR ALTER VIEW dbo.demographics AS
SELECT  *
FROM OPENROWSET( BULK 'https://storage.blob.core.windows.net/d', FORMAT = 'DELTA' )
WITH (
    country_hash VARCHAR (8000),
    date_modified VARCHAR(30)
) AS r

A hashed value is usually a shorter string so the VARCHAR(8000) type might be overhead. If the column is named date_modified, probably should be represented as datetime2 instead of the string.

Use VARCHAR UTF-8 type

String values in some formats such as Delta or Parquet files are encoded using UTF-8 Unicode encoding. When you read the string values from these files, you should represent them as VARCHAR type with some UTF-8 encoding. You will see "User VARCHAR UTF-8 type" recommendation when you need to change your NVARCHAR columns to the VARCHAR type if you are querying Parquet, Delta, CosmosDB, or CSV files with the UTF-8 encoding. The string values encoded as UTF-8 should be represented as the VARCHAR type with UTF8 collation.

For example, let’s assume that the (dbo,demographics,stateName) is shown in the recommendation. You should open the definition of the dbo.demographics view, see would you need to change the type:

CREATE OR ALTER VIEW dbo.demographics AS
SELECT *
FROM OPENROWSET( BULK 'https://storage.blob.core.windows.net/data/*.parquet', FORMAT='PARQUET'  )
WITH (
    [stateName] NVARCHAR (50),  --> replace with VARCHAR (50) Latin1_General_100_BIN2_UTF8
    [population] bigint
) AS [r]

The Parquet format contains UTF-8 encoded strings, so the type of the string column should be a VARCHAR type with some UTF8 collation, instead of NVARCHAR.

You can find more information about UTF-8 collations in this post.

Use NVARCHAR type

CSV files can be saved as Unicode files with UTF-16 encoding. In this case, you should represent the string values from UTF-16 encoded files as the NVARCHAR types. You will see the recommendation "Use NVARCHAR types" if you set the VARCHAR type on UTF-16 file. If you see this recommendation, you should change your VARCHAR columns to the NVARCHAR type, because you are querying CSV files with the UTF-16 encoding. String values encoded as UTF-16 should be represented as the NVARCHAR type with UTF8 collation.

Replace table with a partitioned view

One important recommendation is to partition your data sets. At the time of writing this article, the external tables do not support querying partitioned data sets. The QPI library will find these tables and provide you with a recommendation to replace these tables with the partitioned views.

Remove duplicate references

Sometimes, you might have two tables or views referencing the same data on storage. Although this is not an error, it is a management burden because you will need to keep both schemas in sync and optimize both schemas. You might end up with some queries running fast and the others slow just because the second group references the table that has not optimized types as the first one. Avoid using multiple tables referencing the same data if possible and establish a 1:1 relationship between tables/views and data.

Conclusion

The QPI library is easy to use tool for identifying potential problems in your database schema that could impact the performance of your queries. With this library, you can see the recommendations and apply the changes described in the recommendations to optimize your database schema.

Posted at https://sl.advdat.com/3CsaZdihttps://sl.advdat.com/3CsaZdi

Wednesday, March 9, 2022

Optimize database schema in serverless SQL pools using QPI library