Thursday, February 24, 2022

How It Works: How is SQL Server Error 833, 15 Sec I/O Detected

Moved from:


I was approached to revisit the details of how Error Message 833 is detected and reported.


SQL Server has encountered %d occurrence(s) of I/O requests taking longer than %d seconds to complete on file [%ls] in database [%ls] (%d).  The OS file handle is 0x%p.  The offset of the latest long I/O is: %#016I64x.

Older versions of SQL Server leveraged the Lazy Writer and I/O completion callback routines to record vs report the error condition.  Newer versions of SQL Server have refined the detection, now the logic is included as part of the Scheduler Monitor All the I/O requests contain a start time (invariant RDTSC ticks), the OVERLAPPED structure and additional tracking information.  The I/O tracking structures are all allocated out of the same memory so we don’t deadlock trying to allocate them from a memory manager/object that has to do I/O when low on memory.


When Scheduler Monitor does its checks (5 sec intervals) we walk over the in-motion, I/O entries and check the OVERLAPPED status to see if the I/O request is still pending at the kernel.  If (Overlapped->Internal), decimal value, is 259 the I/O is pending at kernel level and when the elapsed time becomes greater than 15 seconds we record the slow I/O condition.


The process is a bit more involved, using the I/O request dispenser and a check limit of 1024, I/O requests in which Scheduler Monitor will traverse in a single check, but basically SQL Server runs the I/O tracking list and checks the result of HasOverlappedIoCompleted.  If the OVERLAPPED->Internal value is no longer 259 the issue is an SOS scheduling issue in processing the I/O and we don’t record it as a slow I/O.  The sys.dm_os_pending_io_requests DMV shows the IsPending column which is just the outcome of HasOverlappedIoCompleted which is used for the column output so you can see the I/O state.  There is no timer per I/O involved from Windows as Scheduler Monitor acts as the monitor.


SQL Server maintains a 5-minute reporting interval, per file to prevent flood of error log as the target to when we will write the warning to the error log.   XEvent could be a better target when troubleshooting the problem (XeSqlPkg::long_io_detected fires at the time of recording in most instances.)


To recap, Scheduler Monitor does the work to look at the I/O state and when pending and greater than 15 seconds the 833 logic is invoked.

Posted at