Monday, February 21, 2022

SQL Mysteries: Why is my SQL Server experiencing lots of 17830 (TCP 10054) errors?

Moved from: bobsql.com

 

I was reviewing a test run this week that had more SQL 17830 (TCP 10054 / 0x2746) errors than I could explain from just kills or login timeouts.

 

I started looking at the output in the connectivity ring buffer and the matching XEvent, error_reported event where error_number = 17830. I thought I would find my answer among the login timing information (Reference: https://blogs.msdn.microsoft.com/bobsql/2019/02/10/sql-mysteries-sql-server-login-timeouts-a-debugging-story.) Unfortunately, the timings were always small and not something the SQL Server engine was doing, instead the client was closing the connection. In fact, I was able to break the reproduction down to a single connection using an idle SQL Server so there was no impact from other workloads.

 

Perhaps the problem was a lag on the client or TCP layer?

I was expecting to see the client start the connection by sending the TCP SYN, perhaps taking a long time to get to the server. I thought the network trace might reveal a SYN, a long delay that exceeded the connection timeout and a close (RST) from the client. Instead, what I saw was the SYN (tcp open) followed quickly by the RST (tcp close.)

 

BobDorr_1-1645466834258.png

 

  • May articles explain that a client SYN followed by a server RST usually means the server is not listening on the port. This was not the pattern.
  • Other articles highlight that a client SYN followed by client RST could be something like firewall blocking the outgoing traffic. I turned off the firewall and was still able to reproduce the problem.

What was causing the SQL Server networking client to call the TCP open and then call TCP close without exceeding the connection timeout and without attempting the TDS login activities?

 

To make the scenario a bit more interesting:

  • The pattern was only happening when using SQLClient connections, ODBC connections didn’t exhibit the same behavior
  • The pattern only happened on certain lab systems

With some help from Dylan and Brian I was able to break down a reproduction to a Powershell script using SQLClient to connect to SQL Server.

 

Powershell Script

while (1 -eq 1)

{

    $connectionString = ‘Data Source=MyServer,1433;database=master;User ID=sa;Password=xxxxxxxxx;Pooling=False’

    $sqlConnection = New-Object System.Data.SqlClient.SqlConnection $connectionString

    $sqlConnection.Open()

    $sqlConnection.Close()

}

 

Attaching the debugger to the powershell repro I narrowed the issue to a close being called during the open.

 

ws2_32.dll!closesocket
System.Data.dll!TcpConnection::CloseOutstandingSocket
System.Data.dll!Tcp::SocketOpenParallel
System.Data.dll!Tcp::ParallelOpen
System.Data.dll!Tcp::Open
System.Data.dll!SNIOpenSync
System.Data.dll!Connect

What I discovered was the SQLClient attempting to connect to the server using transparent networking Ip resolution. The servers exhibiting the behavior had multiple IP addresses registered with DNS and the servers that didn’t exhibit the behavior had a single IP address registration. The SQLClient attempts to connect to all IP addresses returned from getaddrinfo, in parallel, if the target IP address registration count is between 2 and 64. The SQLClient starts asynchronous, tcp open (SYN) requests to the listed IP addresses. The first connection to accept the SYN ACK from the server (accept the open request) wins and the other, parallel, open requests are closed (RST.) This explains why I always saw a burst of open requests (SYN) to the server but only one of them succeeded and others were closed (RST.)

The SQLClient provides a connection property to control the transparent network ip resolution behavior. 

 

Property = TransparentNetworkIPResolution : https://docs.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.connectionstring?view=netframework-4.8
Setting the properly to false in the connection string disables the parallel activities removing the SYN, RST 17830 (10054) pattern for the secondary connection attempts.

 

I am not recommending you run out and disable the transparent network ip resolution on your clients. The design where the first connection to be opened wins is a nice addition to the connection capabilities.

 

I wrote this blog to help you troubleshoot additional 10054 error patterns. By pointing out the behavior of the SYN, RST occurring from the client, in a back-to-back fashion, you are able to filter these (treat as noise) and focus on other issues that might be causing 10054 errors on your system.

Posted at https://sl.advdat.com/33BXlHBhttps://sl.advdat.com/33BXlHB