Tuesday, September 7, 2021

SQL Server CE: Multiple single-column statistics

The SQL Server Query Optimizer is a cost-based Query Optimizer. For one specific query and tables,  SQL Server creates the query plan based on the estimated rows. Hence, understanding how SQL Server calculate the estimated rows helps you troubleshoot query plan related issues.

I’m going to start a series of posts talking about SQL Server CE(Legacy/New).

 

The way that SQL Server calculates the estimated row of one single predicate is straightforward and simple.

Well, it becomes a little bit complex when SQL Server handles more than one predicates. Unlike the one single predicate, SQL Server has different strategies for Legacy CE and New CE.

In Legacy CE, SQL Server assumes that data distribution on different columns are independent; In New CE, SQL Server assumes that data distribution on different columns are correlated. 

For short, the estimated rows returned by New CE is generally greater than the rows returned by Legacy CE.

 

Here are examples , adventure 2019 OLTP  database is used in this example

 

------- Data prepare-------------------------------------------

alter database [AdventureWorks2019] set compatibility_level=150

go

use [AdventureWorks2019]

go

if exists(select 1 from sys.tables where name='SalesOrderDetail')

drop table SalesOrderDetail

go

select * into SalesOrderDetail from Sales.SalesOrderDetail---- import all the data into new table SalesOrderDetail.

go

------- Data prepare-------------------------------------------

 

-----  I'll use following T-SQL query, which return 533 rows, to demonstrate

select * from SalesOrderDetail where ProductID=711 and UnitPrice=20.1865

 

 

Legacy CE

1.use trace 9481 to make sure old CE is used

select * from SalesOrderDetail where ProductID=711 and UnitPrice=20.1865 option(querytraceon 9481,recompile)--force old CE

Liwei_13-1630986877333.png

 

 

 

2.The estimated rows is 38, while the actual returned rows is 533. Let's dive into to see how the estimated row 38 calculated. Please note, if your estimated row if different than mine, please update the stats with fullscan.

update statistics SalesOrderDetail with fullscan.

 

3.Show the statistics

select *From sys.stats where object_id=object_id('SalesOrderDetail')

As where clause has two different columns, two different columns are created automatically

Liwei_14-1630986877338.png

 

 

4.The two predicates are connected by AND, with selectivity P0 and P1. The combined selectivity is:P0*P1. Let's dive into

1)Estimated row: P0*P1*Card.

        P0 is the selectivity of first column, P1 is the selectivity of second column. The order does not mater.

        Card is the total number of rows ,which is 121317 in this case.

       there are two columns in where clause: ProductID and UnitPrice

2)Selectivity of ProductID(P0): 0.02547046

ProductId:711

Estimation rows:3090

Selectivity: 3090/121317=0.02547046

 

dbcc show_statistics(SalesOrderDetail,_WA_Sys_00000005_0E8E2250)

Liwei_15-1630986877346.png

 

3)Selectivity of UnitPrice: 0.01230660

UnitPrice:20.1865

Estimation rows:1493

Selectivity :1493/121317=0.01230660

 

dbcc show_statistics(SalesOrderDetail,_WA_Sys_00000007_0E8E2250)

Liwei_16-1630986877354.png

 

 

4)Total estimation:P0*P1*'table cardinality' =0.02547046*0.01230660*121317=38.0273914872384120, is rounded down to 38.

 

5.If there are more than two predicates, just multiples the predicate:P0*P1*P1….*PN.

 

 

New CE

1.The formula of combined selectivity in New CE is totally different:: P0 *P1^(1/2) *P2^(1/4) *P3^(1/8)

2.P0,P1,P2,P3 are the selectivity of each predicate where P0<P1<P2<P3.

3.If there are more than 4 columns in the where clause, only  the first 4 are counted, the rest of them are ignored.

In practice, Please use following formula instead if there are more than 2 predicate.

Liwei_17-1630986877355.png

 

 

 

 

1.Run follow querying in New CE, which is the default setting for compatibility level 150

select * from SalesOrderDetail where ProductID=711 and UnitPrice=20.1865

Liwei_18-1630986877358.png

 

 

 

2.The estimated rows is 238,  much closer than OLD CE value 38. Let's dive into to see how the estimated row 238 calculated.

 

3.Let's see how it's calculated.

1)In this case, we have two predicates, so the combined predicate is P0*P1^(1/2)

 

2)Selectivity of ProductID: 0.02547046

ProductId:711

Estimation rows:3090

Selectivity: 3090/121317=0.02547046

 

dbcc show_statistics(SalesOrderDetail,_WA_Sys_00000005_0E8E2250)

Liwei_19-1630986877369.png

 

3)Selectivity of UnitPrice: 0.01230660

UnitPrice:20.1865

Estimation rows:1493

Selectivity :1493/121317=0.01230660

 

dbcc show_statistics(SalesOrderDetail,_WA_Sys_00000007_0E8E2250)

Liwei_20-1630986877374.png

 

 

 

4)Because selectivity of UnitPrice( 0.0125629219) is less than selectivity of ProductID(0.02547046)

P0=0.01230660

P1=0.02547046

Combined selectivity :

Liwei_23-1630986877376.png=0.01230660*sqrt(0.02547046)=0.00196406781623722

Total estimated row is:

Liwei_24-1630986877376.png

 

 =0.00196406781623722*121317=238.274815262451,is rounded down to 238.

 

 

 

Homework:

----T-SQL

use AdventureWorks2019

go

IF exists(select 1 from sys.tables where name='Product' and schema_id=schema_id('dbo'))

drop table Product

go

IF exists(select 1 from sys.tables where name='SalesOrderDetail' and schema_id=schema_id('dbo'))

drop table SalesOrderDetail

go

select * into SalesOrderDetail  from Sales.SalesOrderDetail

GO

--create five statistics

create statistics I_ProductID on SalesOrderDetail(ProductID) with fullscan

create statistics I_orderqty on SalesOrderDetail(orderqty) with fullscan

create statistics I_SpecialOfferid on SalesOrderDetail(SpecialOfferid) with fullscan

create statistics I_modifiedDate on SalesOrderDetail(modifiedDate) with fullscan

create statistics I_unitpricediscount on SalesOrderDetail(unitpricediscount) with fullscan

 

 

 

 

1.A DBA is doing some tests, he finds that the estimated rows is reduced when he puts more columns in the where clause.  For example, the estimated row of following query is 474

select *From SalesOrderDetail where  SpecialOfferid=1  and orderqty=1 and modifiedDate='2013-06-30 00:00:00.000'  and ProductId=708  option(recompile,querytraceon 2363)

The estimated row is reduced to 4 after he put another column in the where clause: unitpricediscount=0.30

Liwei_26-1630986877382.png

 

 

 

As tests continues, the customers finds that it's not always true. After he replaces the unitpricediscount=0.30  with unitpricediscount=0.00, the estimated row is still 474, not decreased. Why ?

Liwei_27-1630986877387.png

 

 

2.Please test the same query using Legacy CE, do you get the same result?

 

 

 

 

 

Posted at https://sl.advdat.com/3BMOHBt