Moved from: bobsql.com
Overview
SQL Server relies on Forced-Unit-Access (Fua) I/O subsystem capabilities to provide data durability, detailed in the following documents: SQL Server 2000 I/O Basic and SQL Server I/O Basics, Chapter 2
Durability: “In database systems, durability is the ACID property which guarantees transactions that have committed will survive permanently. For example, if a flight booking reports that a seat has successfully been booked, then the seat will remain booked even if the system crashes.” – https://en.wikipedia.org/wiki/Durability_(database_systems)
Durability is a cornerstone of any database system and starting with SQL Server 2017 on Linux Cumulative Update 6 (CU6), SQL Server on Linux enables “Forced Flush” behavior as described in this article, improving durability on non-Fua optimized systems.
“Be sure to deploy SQL Server 2017 CU6 or newer for best data durability and performance results.“
Microsoft SQL Server Database Engine Input/Output Requirements
The Back Story
My involvement began when looking at the Write-Ahead-Logging (WAL) and the write behaviors on a Linux SQL Server. The investigation revealed that some of the Linux file systems may not be Fua optimized. Instead of passing the Fua bit with the write request (like Windows) Linux generates a separate device level flush request. Device level flushing may have an impact on your I/O caching, read ahead or other behaviors of the storage system. The “forced flush” changes in SQL Server avoid flushes, when possible, in order to improve performance on non-optimized Fua file systems.
Starting with the Linux 4.18 kernel additions to the Linux XFS file system can send the Fua bit along with the write request so “forced flush” behavior is not needed and the performance of writes is significantly improved. Testing shows that by using the Fua bit with the data, write request can reduce the I/O traffic by ~50% for a SQL Server, write-intensive workload. |
Performance points of reference: The following charts highlight the increase in performance.
4 Socket, TPCC 9,108 – Old kernel 36,618 – Patched kernel |
|
2 Socket, TPCC 9,606 – Old kernel 18,840 – Patched kernel |
|
During my investigation, I traced, tested, reviewed, and instrumented the Linux kernel to understand the SQL Server I/O durability needs and associated impacts. This effort resulted in:
· Changes to the SQL Server Engine
· Changes to the SQL Server Linux Host Extension
· Validation with the Windows Storage team
· Validation with Windows Hyper-V team
· Validation with Windows Azure Storage team
· Conversations with Red Hat, HP, SUSE and Intel
· Changes in the Linux file system
Before all was said and done, all of these people and more helped with various activities:
· Eugene, Mike, Slava, Scott, Peter, Steve and others doing code reviews on my SQL Server and Host Extension changes.
· Val and Dylan performing power outage durability testing and adding new tests.
· Mitch and Ben doing functionality testing with the matrix of trace flags, mssql.conf options, Red Hat, Centos, Ubuntu, SUSE, and Windows using xfs, xfs opt, ext4, refs and ntfs.
· Pat validating Docker functionality and updating SQLIOSim to allow forced flush patterns.
· Patrick and Purvi doing performance and regression analytics.
· Kapil building and validating behavior on the Linux kernel changes for xfs (Note: I have tagged it xfs opt (xfs optimizations) in this document for reference.)
· Amit, Lee, Vaqar and Mike handling release activities and documentation.
· Neal, Matt, and others from Windows Storage, Windows Azure Storage, Windows Hyper-V, … validating Windows behaviors.
· Red Hat, SUSE, HPE, Intel helping and confirming Linux behaviors and making file system code changes.
· Pradeep, Venu and Suresh handling support issues.
Special thanks to Dave Chinner and the Red Hat team, Dan Williams, Hamesh Patel and their colleagues from Intel, Scott Norton and his colleagues, Josh Paulson and members of the Microsoft Linux teams, Jan Kara from SUSE and many others.
Terms
Before diving into details, the terminology is important.
Term |
Type |
Description |
||
Stable Media |
Concept |
Any storage device that can survive a power outage. In the 90’s we talked about saving data on spinning media. SSDs, persistent memory, battery-backed controllers broaden the range of devices surviving power outage from traditional, spinning media.
|
||
WAL |
Concept |
Write-Ahead Logging or journaling. The log records, which can be used to recreate the data, are stored in stable media before the data buffers are written and data buffers are stored in stable media before the supporting log records can be deleted.
|
||
Fua |
Hardware |
Fua = Forced Unit Access. Fua is the standard used to issue an I/O write command in stable data storage. Fua has been a long-standing part of the SCSI specifications.
Caution: Non-SCSI storage implementations often silently ignore the request.
|
||
Device Flush |
Hardware |
A device level flush forces any dirty buffer located in volatile storage onto stable media. Flushing a non-battery backed disk or cache device can be performance impacting and repeated device flush commands reduce the optimized capabilities, such as, writes near current head placement, wear level reduction, caching mechanisms, etc. |
||
FILE_FLAG_NO_BUFFERING |
Windows |
FILE_FLAG_NO_BUFFERING is the Win32, CreateFile API flags and attributes setting to bypass file system cache. This flag does NOT imply flush behaviors, only the avoidance of the system file cache. FILE_FLAG_NO_BUFFERING maps to the O_DIRECT flag for the Linux open command.
Warning: The flag causes the file system cache of a guest VM to be bypassed but the VM or container may not pass the option to the host.
|
||
O_DIRECT |
Linux |
Linux open command flag used to bypass file system cache.
Warning: The flag causes the file system cache of a guest VM to be bypassed but the VM or container may not pass the option to the host.
|
||
FILE_FLAG_WRITE_THROUGH |
Windows |
FILE_FLAG_WRITE_THROUGH is the Win32, CreateFile API flags and attributes setting generating Forced Unit Access (Fua) writes. A write honoring Fua is issued with the Fua flag and does not return from the write command until transferred to stable media.
FILE_FLAG_WRITE_THROUGH maps to the O_DSYNC flag option for the Linux open command.
The FILE_FLAG_WRITE_THROUGH flag does NOT imply FILE_FLAG_NO_BUFFERING behavior. To achieve no buffering and Fua, both the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags must be specified.
|
||
O_DSYNC | fdatasync | RWF_ODSYNC |
Linux |
The O_DSYNC | RWF_ODSYNC | fdatasync indicate to the Linux file systems that the ‘data integrity standard’ is to be upheld for I/O operations. The data integrity standard makes sure the data and supporting meta data, needed to retrieve the data, are stored in stable media.
O_SYNC and fsync indicate to the Linux file system to uphold the ‘file integrity standard.’ O_SYNC is a superset of the O_DSYNC as the standard indicates that all file metadata (access timestamps for example) are stored in stable media.
RWF_ODSYNC is a newer variation of O_DSYNC. The RWF_ODSYNC can be set as a flag for each I/O request. Whereas, opening the file with O_DSYNC establishes flushing behavior for all writes to the file. SQL Server does NOT make use of RWF_ODSYNC.
|
||
FILE_FLAG_OVERLAPPED |
Windows |
FILE_FLAG_OVERLAPPED is the Win32, CreateFile API flags and attributes setting enabling asynchronous API behavior. Control from APIs such as WriteFile and ReadFile returns as soon as the I/O is queued. The operating system signals completion when the I/O stack finishes the request.
|
||
AIO | io_submit | Kernel I/O (kio) |
Linux |
Linux provides different system calls (syscall)/commands for asynchronous behavior. · write command – synchronous syscall, does not return until the write completes. · aio_write | io_submit – asynchronous syscall, returning control as soon as the I/O request is queued. SQL Server uses aio* for asynchronous I/O operations.
|
||
Synchronized |
Concept |
Not to be confused with the synchronous behavior of the API/ABI invocations. In the context of I/O, synchronized indicates that the storage and in-memory copies of the data are the same. O_DSYNC often uses this term to indicate that the in-memory buffer is in sync with the storage media.
|
||
Data Integrity Standard |
Concept |
This data integrity standard is a subset of the file integrity standard. In summary, the data integrity standard is the ability of the file system to store and retrieve data over a power outage. This includes the data itself as well as the tracking structures for the storage.
Refer to the ISO documentation on data integrity for complete details. CSA ISO/IEC 9945-1:2005 (R2009) 3.375 Synchronized I/O Data Integrity Completion
|
||
File Integrity Standard |
Concept |
This file integrity standard is a superset of the data integrity standard. The file integrity standard is the ability for the file system to store and retrieve data and metadata written over a power outage including attributes such as the time of last access and permissions.
Refer to the ISO documentation on file integrity for complete details: CSA ISO/IEC 9945-1:2005 (R2009) 3.376 Synchronized I/O File Integrity Completion
|
||
O_NONBLOCK |
Linux |
Linux open command flag value to avoid blocking during an I/O call. This is not a FILE_FLAG_OVERLAPPED implementation for the Linux ABIs. Instead, the flag is used to decide if a wait will occur while attempting to issue the I/O request. If the thread would become blocked the attempt to issue the request is aborted and control is returned to the caller.
Overlapped will issue the I/O and return STATUS_PENDING, whereas the O_NONBLOCK returns EAGAIN and the caller must retry the entire I/O operation.
|
||
O_NOATIME |
Linux |
Linux open command flag to avoid updating the time of last access which can reduce metadata writes. The O_NOATIME flag does NOT disable MTIME or CTIME modifications.
|
||
REQ_FUA |
Linux |
Block device write request using the Fua capabilities when performing a write. The Fua bit is sent with the write signaling stable media storage is required.
|
||
REQ_PREFLUSH |
Linux |
A block device write request (0 bytes – write[10, 12 or 16 bytes] SCSI command) requesting a device flush.
|
||
DpoFua |
Linux |
DPO = Disable Page Out (for writes instructs I/O subsystem to avoid adding write data to the cache, for reads prevents replacing existing entries in cache.)
Fua and Dpo are separate bit values in the SCSI SCB. SQL Server only requires Fua and not DPO capabilities.
|
||
Forced Flush Behavior |
SQL Server |
When enabled the Linux SQL Server opens the file with FILE_FLAG_NO_BUFFERING and without FILE_FLAG_WRITE_THROUGH (no FUA.) SQL Server log and checkpoint made durable with calls to FlushFileBuffers.
|
Common File Systems Used For SQL Server Deployments
ntfs | Windows New Technology File System |
refs | Windows Resilient File System |
xfs | High performance, 64-bit journaling file system for Linux |
ext4 | Fourth extended file system with journaling for Linux |
Supported platforms and file systems for SQL Server on Linux
Handling of Fua
The Linux Host Extension(HE) and SQLPAL work together to translate Win32 CreateFile, WriteFile, ReadFile and other I/O calls into Linux system calls. SQL Server opens data and log files using CreateFile with FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH. The flags are converted to the Linux open system call using the O_DIRECT and O_DSYNC flags, respectively.
When SQL Server performs a write, using a file opened with FILE_FLAG_WRITE_THROUGH(O_DSYNC), Linux performs the multi-step operation as a data write followed by the complete device flush request instead of adding the Fua bit to the data write request.
- Issue write to block device for the data
- Receive the data write complete event in the block device’s write completion routine
- If O_DSYNC requested, issue block device flush
- Wait for block device flush request completion
- Return control to write caller
Every write, requiring O_DSYNC behavior, also invokes a block device flush on Linux. The repeated use of a write request followed by a flush request may be detrimental to performance and will increase traffic on the I/O bus.
Note: The 4.18 Linux Kernel updates allow the XFS file system and a device supporting Fua writes to avoid the additional block device flush request.
Prior to the Linux Kernel 4.18 updates, Linux could use Fua but only for the file system journaling writes and not data writes. Linux queries the device for the DpoFua support capabilities and when the device reports Fua support (DpoFua = 1) the journaling is optimized to use Fua writes. The data storage always uses the multi-step flush process.
Windows does NOT query the device for support, instead any file opened with FILE_FLAG_WRITE_THROUGH sends the Fua option with the write request where the target device is expected to honor the Fua option. In some instances, the target silently ignores the request. The Windows implementation places the responsibility of making sure the I/O subsystem is Fua compliant on the database and system administrator.
Fua (Forced Unit Access)
Microsoft SQL Server requires Fua compliant storage (stable media) ensuring the data is stored in stable media before a write request returns to SQL Server. For example, if you run SQL Server on an IDE or SATA drive, you may need to disable the physical disk cache to accommodate WAL (durability), requirements. The SQL Server 2017 CU6, forced flush feature, provides durability assistance on Linux systems without Fua.
Note: Disabling the write cache using the Windows, properties dialog may not suffice. Some drives are not compliant with the dialog options and require special utilities to configure the drive cache. Linux may need similar utilities to control various hardware cache installations.
You can explore your system using the df and lsblk commands. When Linux mounts a block device, Linux queries the attributes and capabilities of the block device. One such capability is the ability to support DpoFua for write requests. When DpoFua=1 is reported, the device is reporting support for Fua.
Note: These commands may require root permissions to execute.
df -T
Filesystem Type 1K-blocks Used Available Use% Mounted on
devtmpfs devtmpfs 4062840 0 4062840 0% /dev
tmpfs tmpfs 4078624 0 4078624 0% /dev/shm
tmpfs tmpfs 4078624 8880 4069744 1% /run
tmpfs tmpfs 4078624 0 4078624 0% /sys/fs/cgroup
/dev/mapper/cl-root xfs 52403200 13010400 39392800 25% /
/dev/sda1 xfs 1038336 431168 607168 42% /boot
/dev/mapper/cl-home xfs 114361796 22488816 91872980 20% /home
tmpfs tmpfs 815728 0 815728 0% /run/user/1000
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0 2:0 1 4K 0 disk
sda 8:0 0 168G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 167G 0 part
├─cl-root 253:0 0 50G 0 lvm /
├─cl-swap 253:1 0 7.9G 0 lvm [SWAP]
└─cl-home 253:2 0 109.1G 0 lvm /home
sdb 8:16 0 1G 0 disk
sr0 11:0 1 1024M 0 rom
The /dev/sda1 entry reports an xfs file system and the device id of 8:1 (major:minor.) The block device is sda (8:0.) The entries for sda1, sda2 are partitions on the same block device.
You may also use lshw to explore disk information such as sector size and vendor which are helpful in determining additional device details.
lshw -class disk
description: SCSI Disk
product: Virtual Disk
vendor: Msft
physical id: 0.0.0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: 1.0
size: 168GiB (180GB)
capabilities: partitioned partitioned:dos
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=000a0600
Use dmsg or sg_modes to reveal the DpoFua state.
dmesg | grep sda
[ 1.266561] sd 0:0:0:0: [sda] 352321536 512-byte logical blocks: (180 GB/168 GiB)
[ 1.266562] sd 0:0:0:0: [sda] 4096-byte physical blocks
[ 1.266892] sd 0:0:0:0: [sda] Write Protect is off
[ 1.266894] sd 0:0:0:0: [sda] Mode Sense: 0f 00 10 00
[ 1.266949] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, supports DPO and Fua
[ 1.271227] sda: sda1 sda2
[ 1.274579] sd 0:0:0:0: [sda] Attached SCSI disk
[ 4.637845] XFS (sda1): Mounting V5 Filesystem
[ 4.661270] XFS (sda1): Ending clean mount
Note: /dev/sda is the block device and /dev/sda1, /dev/sda2 are partitions of the block device.
sg_modes /dev/sda1
Msft Virtual Disk 1.0 peripheral_type: disk [0x0]
Mode parameter header from MODE SENSE(10):
Mode data length=20, medium type=0x00, WP=0, DpoFua=1, longlba=0
Block descriptor length=0
>> Caching, page_control: current
00 08 0a 04 00 00 00 00 00 00 00 00 00
If your system supports Fua and you have the Linux Kernel 4.18 or newer updates, you can enable SQL Server trace flag -T3979 and use /opt/mssql/bin/mssql-conf set control.alternatewritethrough 0. SQL Server will use Fua write behavior patterns instead of Forced Flush. File systems supporting optimized Fua align SQL Server on Linux with SQL Server on Windows behavior and performance.
Note: If DpoFua=0 is reported the Linux file systems revert to the write, flush behavior to support O_DSYNC. If you are unsure about the Fua support use forced flush behavior.
Note: DpoFua optimizations may be ignored during the first data write to an offset in the file, as documented here.
Note: Some systems embellish the Fua support capabilities. You should always validate the Fua capabilities with your device and hardware vendor.
Beware The Cache
Disk drives often employ non-battery backed caches, even SSD drives can contain volatile, staging caches. Make sure the entire I/O stack from the SQL Server to the stable media location honors the Fua intent properly.
blktrace: Block Trace Shows All
A great place to start understanding how Linux is handling Fua requests is block trace (blktrace.) The block trace allows you to monitor I/O requests and the associated options being used.
The blktrace command and format options are documented here. The following example traces the block device (/dev/sda.) This sample outputs the timestamp, pid, sector, device major:minor, command, sequence, action, flags, block count and worker information.
blktrace -d /dev/sda -o – | blkparse -i – -f “%5T.%9t %10p %-10S %D %2c %8s %2a %3d %10n %32C\n“
- The sample output shows an O_DIRECT | O_DSYNC write from the IOPerfTest application (source contained at end of this blog) on a non-optimized, Linux file system.
- The WS flags indicate (W)rite at sector 16518176 with (S)ynchronized behavior (buffer and storage match.) The (Act)ion column shows the I/O request being (I)nserted into the (Q)ueue and the system workers performing, (G)etting, the FWS request.
Timestamp PID Sector Device Cmd Seq Act Flags Block Count
3.636934795 3348 16518176 8,0 7 443 3348 A WS 16 IOPerfTest <—I/O started
3.636935395 3348 18617376 8,0 7 444 3348 A WS 16 IOPerfTest
3.636936095 3348 18617376 8,0 7 445 3348 Q WS 16 IOPerfTest
3.636941795 3348 18617376 8,0 7 446 3348 G WS 16 IOPerfTest
3.636942295 3348 0 8,0 7 447 3348 P N 0 IOPerfTest
3.636945195 3348 18617376 8,0 7 448 3348 I WS 16 IOPerfTest
3.636946295 3348 0 8,0 7 449 3348 U N 0 IOPerfTest
3.636947795 3348 18617376 8,0 7 450 3348 D WS 16 IOPerfTest
3.638838077 3054 0 8,0 7 452 3054 A FWS 0 kworker/7:2
3.638838677 3054 0 8,0 7 453 3054 Q FWS 0 kworker/7:2
3.638843177 3054 0 8,0 7 454 3054 G FWS 0 kworker/7:2
3.638844377 3054 0 8,0 7 455 3054 I FWS 0 kworker/7:2 <-I/O completed
- A FWS with block count of 0 is the block device flush command
- A WFS is a REQ_WRITE with the REQ_FUA request
When the ‘FW’ pattern is present it shows a REQ_PREFLUSH | REQ_WRITE being sent to the device. What we are really looking for is a ‘WFS’ request showing a REQ_WRITE with REQ_FUA, without the secondary FWS request, using optimized Fua capabilities and avoiding the secondary flush.
Until Linux Kernel 4.18, XFS, the REQ_FUA is not honored for data requests, only journaling, and is always performed as an additional device flush request FWS.
To achieve best performance and durability you want the WFS (Write, REQ_FUA, Synchronized) pattern without FWS for each write. The output below is the block trace output on an XFS file system using optimized Fua capabilities.
13.944683500 2948 25671928 8,0 0 120245 I WFS 8 IOPerfTest <- D=Issued, WFS=Write,REQ_FUA <—I/O started
13.944683500 2948 25671928 8,0 0 120246 D WFS 8 IOPerfTest <- D=Completed, WFS=Write,REQ_FUA <-I/O completed
Using O_DIRECT and O_DSYNC Properly
There are conditions in the Linux file system which require kernel ‘fix-ups’ in order to support the I/O request. If the request does not meet the expected criteria a new, kernel buffer(s) is created, data copied, block device transfers completed using the temporary buffer(s) and then the temporary buffer is released. Using the Linux file system tracing you can confirm the direct I/O request is aligned on the proper memory boundary for DMA transfer, offset starts on a block device boundary, etc. to avoid temporary buffered activities.
Enabling Trace (xfs, ext4, block device tracing)
echo 1 > /sys/kernel/debug/tracing/events/block/enable
echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
echo 1 > /sys/kernel/debug/tracing/events/ext4/enable
… run reproduction …
cat /sys/kernel/debug/tracing/trace
The trace output shows a direct write and the file system performing a flush (generic_write_sync->xfs_file_fsync.). Removing generic_write_sync from each write is key to how optimized Fua improves performance.
# _—–=> irqs-off
# / _—-=> need-resched
# | / _—=> hardirq/softirq
# || / _–=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
IOPerfTest-2733 [004] …. 264.566322: xfs_file_direct_write: dev 253:2 ino 0x69 size 0x100000 offset 0xae000 count 0x2000
IOPerfTest-2733 [004] …. 264.566324: xfs_ilock: dev 253:2 ino 0x69 flags ILOCK_EXCL caller xfs_file_iomap_begin [xfs]
IOPerfTest-2733 [004] …. 264.566325: xfs_iunlock: dev 253:2 ino 0x69 flags ILOCK_EXCL caller xfs_file_iomap_begin [xfs]
IOPerfTest-2733 [004] …. 264.566325: xfs_iomap_found: dev 253:2 ino 0x69 size 0x100000 offset 0xae000 count 8192 type invalid startoff 0xae startblock 272 blockcount 0x2
IOPerfTest-2733 [004] …. 264.566345: xfs_iunlock: dev 253:2 ino 0x69 flags IOLOCK_SHARED caller xfs_file_dio_aio_write [xfs]
kworker/4:1-110 [004] …. 264.566761: xfs_end_io_direct_write: dev 253:2 ino 0x69 isize 0x100000 disize 0x100000 offset 0xae000 count 8192
kworker/4:1-110 [004] …. 264.566763: xfs_file_fsync: dev 253:2 ino 0x69
Forced Flush Behavior
From the first research I did, it would appear that you just have to do things differently on Linux than Windows to achieve durability. However, I have spent two decades working with Fua on Windows and the block traces were telling me a different story. The journaling was using Fua writes and that lead to the lengthy discussion and changes in the Linux 4.18 kernel to enable optimized Fua writes for data. However, that still left SQL Server with a requirement to run on Linux installations that are not Fua optimized, which became the source of forced flush settings. (For reference: As I studied the problem I found that other database vendors provide settings and adjustments to control fsync* calls or open file with and without O_DSYNC.)
The forced flush behavior is a choreographed sequence between SQL Server, SQLPAL and the Host Extension (HE.) To accommodate a wide range of installations and hardware required four (4) options across SQL Server and the HE. Before we dive into the options, let’s first take a look at the forced flush behavior activities employed by SQL Server.
When in forced flush mode, SQL Server assumes the system does NOT provide Fua capabilities and instead issues FlushFileBuffers (translated to fdatasync on Linux) at critical locations. The following shows the flow to harden the transaction log for a commit request.
- A commit requests that all Log Sequence Numbers (LSN) are stored in stable media up to and including the commit record
- The commit record is added to the log buffer
- The log buffer is closed out and written to disk (Only no buffering requested)
- Write completes and FlushFileBuffers is issued to harden
The advantage of allowing the SQL Server log writer to issue FlushFileBuffers is a reduction in flush requests over the default write, flush of the Linux kernel. SQL Server can pack multiple log records from multiple sessions in the same log buffer and log buffers without commit requests don’t require individual flush requests. When the flush request occurs, all previous writes achieve hardened state via the block device flush request.
The following diagram shows the SQL Server checkpoint activity with forced flush behavior is enabled. Since the writes are not being done with Fua, FlushFileBuffers for all the files in the database takes place at the conclusion of the checkpoint. The individual writes take place with no buffering and at the end of the checkpoint phase, the data is hardened. Only after the FlushFileBuffer calls successfully occur can the log records can be truncated.
Note: SQL Server In-memory optimized, File Stream, Backup and other storage operations issue FlushFileBuffers as well. For example, a backup is not considered complete until FlushFileBuffers returns successfully. If FlushFileBuffers returns a failure the backup is terminated in error.
Settings and Adjustments
SQL Server Trace Flags
The forced flush behavior is controlled using SQL Server trace flags.
Note: These flags are global (startup only) to the SQL Server instance. The entire SQL Server instance uses the configured settings.
Flag |
Description |
The Elevator Pitch (thanks to Jamie Burroughs) |
-T3979 |
“Request Fua Writes” SQL Server opens database data and log files with FILE_FLAG_WRITE_THROUGH, mapping to O_DSYNC in the Host Extension, asking for Fua writes. Linux Default: Trace flag considered Off * * The trace flag does not apply to SQL Server on Linux SQL 2017 RTM thru CU5. The behavior is assumed enabled prior to CU 6. ** The trace flag does not apply to Windows. On Windows it is always assumed to be enabled. |
On Linux, you need a file system that provides DpuFua=1 enablement or you have confirmed your system is O_DIRECT safe.
By enabling the trace flag you disable the batch flush requests from SQL Server and are trusting SQLPAL, HE and the I/O subsystem configuration to achieve durability. |
-T3982 |
“Avoid Fua Writes” Forces SQL Server to open database data and log files without FILE_FLAG_WRITE_THROUGH and issue the FlushFileBuffers calls to complete transactions and checkpoint activities. Linux Default: Trace flag is considered On *Microsoft retains the policy that SQL Server should always be run on a system with stable media configurations. The trace flag is NOT supported on Windows. This article highlights these support boundaries. It was written by David Campbell, Bob Ward, Keith Elmore and I many years ago and the forced flush feature does not alter the intent: “SQL Server requires systems to support ‘ guaranteed delivery to stable media ’ as outlined under the Microsoft SQL Server Always-On Storage Solution Review program. For more information about the input and output requirements for the SQL Server database engine.” |
When running on Linux kernels or file systems that do not support optimized Fua capabilities.
The trace flag and forced flush behavior is intended to be a workaround for Linux installations until the SQL Server instance id deployed on a system providing optimized, Fua capabilities. |
Host Extension Settings
The trace flags provide the ability to control how SQL Server requests write behavior and the associated hardening activities. The Host Extension (HE) provides 2 control options (control.writethrough and control.alternatewritethrough) which can be configured using mssql-conf set.
Note: These flags are global (startup only) for the entire SQLPAL instance.
Setting |
Default |
Description |
The Elevator Pitch |
writethrough |
1 |
Used to mask the O_DSYNC flag for the open invocation.
|
For most installations it is unsafe to disable the write-through. For example, SQL Server 2019 supports DTC/XA transactions. Disabling write through could compromise the DTC transaction logging durability.
If your system is deemed O_DIRECT safe it may be acceptable to disable write-through. When DpoFua = 0 or you are on a non-Fua optimized installation, you may need to disable the write through. By removing the O_DSYNC from the open, you remove the write, flush Linux kernel activity. However, you must configure SQL Server with –T8932 to maintain durability. A safer choice, for this situation, is to maintain write through and enable alternate write through behavior. |
alternatewritethough |
1 |
Controls how the write, flush behavior takes place. When enabled the HE opens the file without O_DSYNC and during I/O completion, batches fdatasync requests based on the major, parent device id.
Note: Alternate write through settings only apply when writethrough=1. |
When running on a DpoFuo = 0 or non-Fua optimized installation the alternate writethrough performs better than allowing the Linux kernel to handle the write, flush on every write.
On non-optimized, Fua systems using alternatewritethrough and –T3982 is the best performance and durability option. |
The following is a high-level flowchart of the settings and outcomes.
Safety and Performance
To help make sense of all of this I have broken down the various trace flag and configuration options by data safety and performance.
Safe Installation (Best performance):
– SQL Server 2017 CU6 or newer
– Linux kernel 4.18 or newer with xfs storage
– T3979
– DpuFua=1
– control.writethrough = 1
– control.alternatewritethrough = 0
Safe Installation (Best Grouping Performance – Forced Flush from SQL Server and HE):
– SQL Server 2017 CU6 or newer
– T3982
– DpuFua = 0
– control.writethrough = 1
– control.alternatewritethrough =1
Safe Installation (Midrange Grouping Performance – Grouped device level flushes from HE):
– SQL Server 2017 CU6 or newer
– T3979
– DpuFua = 0
– control.writethrough = 1
– control.alternatewritethrough =1
Safe Installation (Slowest Performance – Linux kernel write, flush O_DSYNC behavior):
– SQL Server 2017 CU6 or newer
– T3979
– DpuFua = 0
– control.writethrough = 1
– control.alternatewritethrough =0
Alternative Installation (O_DIRECT Safe Reliance on I/O Subsystem):
– SQL Server 2017 CU6 or newer
– T3979
– DpuFua = 0
– control.writethrough = 0
– control.alternatewritethrough =0
* Note: This achieves the prior configuration of SQL Server 2017 RTM to CU5 on Linux using only O_DIRECT are relying on O_DIRECT safe installations.
SQL Server 2017 |
Linux RTM-CU5 |
Linux CU6+ |
Windows RTM-CU5 |
Windows CU6+ |
||||||
|
xfs |
xfs opt |
ext4 |
xfs |
xfs opt |
ext4 |
ntfs |
refs |
ntfs |
refs |
-T3979 |
SafeD |
SafeD |
SafeD |
SafeX |
Safe |
SafeX |
Safe |
Safe |
Safe |
Safe |
-T3982 |
N/A |
N/A |
N/A |
SafeX |
Safe |
SafeX |
N/A |
N/A |
N/A |
N/A |
State |
Description |
SafeD |
Unless the system meets O_DIRECT safe criteria (many do not), you are exposed to possible data loss. |
Safe |
Requires alternatewritethrough=0, writethrough=1 and I/O subsystem supporting optimized Fua and DpoFua = 1. |
SafeX |
Requires alternatewritethrough=1, writethrough =1 enabling alternate flush behavior. |
O_DIRECT Safe
O_DIRECT safe refers to is the ability to perform writes with O_DIRECT and without O_DSYNC and maintain durability. If the I/O subsystem avoids the proper buffering mechanism until it reaches stable media (battery-backed cache for example) then the Fua option may not be required. However, to be O_DIRECT safe you must ensure the kernel, drivers, and hardware are able to place a write into non-volatile storage before indicating the write is complete.
I am unaware of tracing mechanisms to determine if the system is O_DIRECT safe. Microsoft has done extensive testing of durability on systems believed to be O_DIRECT safe and SQL Server for Linux RTM thru CU5 was configured for O_DIRECT safe write activities. However, Microsoft does not recommend reliance on O_DIRECT safe configurations and instead provides SQL Server 2017 CU6 and newer builds with explicit durability capabilities (forced flush or optimized Fua writes.)
If you choose to configure your system for O_DIRECT safe writes be sure to employ an appropriate backup or high availability strategy.
Caution
Unless you are using an updated file system, it is unsafe to assume the REQ_FUA behavior. SQL Server automatically enables -T3982 (forced flush behavior) on SQL Server for Linux. If you are able to, verify that a write on your system meets the following criteria, you may be able to achieve data integrity compliance without Fua capabilities as long as you are able to make sure the O_DIRECT path supports write completion when the data has been stored on stable media. – “O_DIRECT Safe.”
- I/O subsystem and drivers do not cache FILE_FLAG_NO_BUFFERING (O_DIRECT) in volatile cache. All cache involvement must be power outage hardened.
- All files in the database are on these stable media, compliant I/O paths (data, log, checkpoint files, standby recovery files, backups, …)
- All SQL Server support files are on the stable media (cluster files, registry files, …)
If you are unsure maintain the forced flush behavior configuration.
Virtual Environments (VMs and Containers)
Virtual environments may not provide the write behavior required to maintain durability. Microsoft has validated Hyper-V, Windows Azure Virtual Machines, Windows Storage and various Docker implementations pass the flush requirements to the host properly. When performing a Fua write the Fua requirement must be passed to the host and the host must carry the write in stable media. If the virtualization or host caching does not support the Fua requirement you should use forced flush behaviors to maintain durability.
How did I get DpoFua=1 enabled for a Hyper-V installation of Linux?
On virtual systems, additional effort may be required to confirm the virtualization technology passes the Fua intent to the host and the host honors the request. You may use a directly mounted device (direct pass-through drive for Hyper-V or iSCSI supporting Fua capabilities) and/or confirm the VHD | VHDx storage path supports Fua.
WARNING: Use caution as the Hyper-V configuration does not query the device for Fua support and trusts the setting provided.
Using the Hyper-V, PowerShell cmdlets, from Windows 2016, you can configure Fua reporting for the guest virtual machine.
$a = Get-VMHardDiskDrive -VMName “xxxxxxxxxxx”
foreach ($b in $a) { $b; $b.WriteHardeningMethod }
Set-VMHardDiskDrive -VMName “xxxxxxxxxxx” -OverrideCacheAttributes WriteCacheAndFuaEnabled
Note: You may need to add a controller type such as -ControllerType SCSI
Linux Kernel 4.18 Updates
I have worked at Microsoft since 1994 and this issue truly points to the spirit of ‘Microsoft Loves Linux.’ I was strongly supported, funded and encouraged to track down the issue and make it better for customers. The Linux committer and vendor engagements where insightful and helpful and, in the end, I am able to proudly say that Microsoft helped to improve Linux.
Several code changes were made to the imap, block device, and XFS components to accommodate optimized Fua writes. The changes in the imap and block device areas can be leveraged by other file systems but as Dec 2018 (this post) only XFS has enabled optimized Fua data write capabilities.
Key Commit References in 4.18 and newer kernel releases
- ed5c3e6 xfs: move generic_write_sync calls inwards
- 4f8ff44 iomap: iomap_dio_rw() handles all sync writes
- 0ce9144 block: add blk_queue_fua() helper function
- 3460cac iomap: Use FUA for pure data O_DSYNC DIO writes
Support
Microsoft has tested and confirmed the Fua changes are included in Linux kernel build 4.18 and newer kernels while Red Hat has tested and confirmed the changes are included in RHEL 8.0.
Hint: Use the uname -sr command to discover the Linux Kernel version.
Tracing SQL Server Behavior (XEvent)
The following XEvents can be used to monitor the flush behavior employed by SQL Server.
XEvents (Debug channel)
· flush*
· make_durable*
· checkpoint*
-T3979 (Force Flush Disabled) Checkpoint
-T3982 (Force Flush Enabled) Checkpoint
Notice the make_writes_durable and flush_file_buffers events for the database log and data files used for ACI(D)urability.
Testing and Futures (Warning: Plans May Change)
When it comes to database consistency there is no discussion on doing the right thing. When I told Slava I was going to put in the forced flush behavior and it would have a performance impact, he simply said do it! Not 2 seconds passed between us making the decision. As a team, we would figure out ways to improve the performance, but the safety of the data took priority.
When SQL Server 7.0 shipped Microsoft SQL Server made a distinct change to transaction log writes, aligning on sector boundaries and removing the 2K page rewrites of the 6.x and Sybase code to improve ACID properties of the SQL Server. This generated lots of calls to Microsoft Support because it changed the I/O performance characteristics in ways customers didn’t expect and lead to the first code I added to SQL70IOStress, which I would later rewrite into what you know today as SQLIOSim.
I mention this because folks like Keith Elmore, Bob Ward and I were running tests in the lab. In fact, I was pulling the power between the power supply and the hard drive, waiting a few minutes, plugging the power back in and validating the durability and consistency provided by the I/O subsystem. This lasted a couple of days until I electric-arc’ed the drive and burnt out the power supply. The lab manager would no longer allow me to manually pull the power. It just so happened that Ed Harper was on our team and he built me a wire wrapped board. It had a few D-Batteries and a switch on it with a plug on one end for the power supply and a plug on the other for the drive. I used the generic driver sample and could signal the switch to open and close which I could then signal from SQL70IOStress and perform 1000s of ACID related tests overnight. I won’t even tell you how many disk drives I have worn out doing I/O testing.
Creating SQL 2017 CU6 updates involved revisiting similar tests. The SQL Server test infrastructure can inject fail points. Dylan added additional fail points to checkpoint, lazy writer, and log writer. Ben, Mitch, and Kapil performed physical pull the plug and kill the VM testing. Pat added forced flush behavior to SQLIOSim, which we have been running in a Linux package, in our daily lab runs since before SQL Server 2017 was released. We also added additional coverage to containers, high-availability, replication, backup, restore and recovery test suites.
Met Tejas, a long time Microsoft SQL Server Support Escalation Engineer, is a Program Manager for SQL Server on Linux. He is working on the Linux experience, testing and capabilities surrounding SQL Server WAL and ACID capabilities. Some of the areas he is currently looking into are as follows (as always these a goals we would like to achieve
/opt/mssql/bin/mssql-conf | Providing single selection capability to configure the various Fua settings for a target environment |
Auto configuration | Can SQLPAL and SQL Server use the DpoFua detection to automatically do the right thing |
SQLIOSim | Provide the SQLIOSim package used in Microsoft SQL Server’s daily testing to customers for yum/apt-get install sqliosim testing capabilities |
DiskSpd (SQLIO) | Provide a package to customers for yum/apt-get install diskspd testing capabilities https://blogs.msdn.microsoft.com/sql_server_team/sqlio-disk-subsystem-benchmark-tool-is-being-retired/ |
Crash Monkey | Evaluating facilities such as Crash Monkey and B3 to see if we can include it with our daily and stress testing activities |
Vendors | When we released SQLIOSim for Windows many vendors added it to their test suites. We hope to achieve similar adoption from the Linux distribution vendors |
Azure | For years the SQL Azure storage team has and continues to run SQLIOSim in their daily tests and with storage, fail points as well. Expand the coverage to Linux guests running SQLIOSim |
WHCL Storage Certification SQL Server I/O reliability Program |
For more than a decade, SQLIOSim has been part of the WHCL Storage Certification for Windows as well as the SQL Server I/O Reliability Program. Vendors, such as Red Hat, have similar programs for Linux who Tejas is engaging |
As of this post, XFS on Linux 4.18 or newer is Fua optimized and we have also reached out to additional, Linux file system committers. For example, the EXT4 committers are aware of the changes and evaluating the changes.
Odds and Ends
Updates and changes like this tend to expose some odds and ends.
Checkpoint Record Flush
SQL Server creates a log record (LOP_BUF_WRITE) to assist recovery. The record has an array of page ids written to disk. The checkpoint flush record has a limited size and is written whenever it becomes full. The flushing of the checkpoint record requires FlushFileBuffers when forced flush behavior is enabled to maintain durability.
All Data Files For Same Database Flushed
Forced flush loops over all database files in the target database and issues FlushFileBuffers to complete a checkpoint. The data files may be on the same block device and the secondary flushes may be redundant but are inexpensive. Future SQL Server updates may optimize this behavior.
TEMPDB Is Faster
You may find that TEMPDB is faster using -T3982. TEMPDB is always rebuilt upon SQL Server restart and forced flush removes the FILE_FLAG_WRITE_THOUGH requirement of TEMPDB. Since TEMPDB does not need to issue checkpoints, FlushFileBuffers may be skipped. In fact, our performance team is evaluating a SQL Server 2019 change that never opens the TEMPDB file with the FILE_FLAG_WRITE_THROUGH.
RWF_ODSYNC
Another choice, provided in the Linux Fall 2017 Update, allows individual I/O requests (io_submit) to set the RWF_ODSYNC flag, forcing O_DSYNC for individual write requests. SQL Server does not use RWF_ODSYNC.
Simple Test and Results
Raw Results from Fedora 4.17 to 4.18 using SQL Server 2019 CTP 2.2
— Win10, 8 proc, 32GB ram host use master drop database IF EXISTS fuaTest — Always stamp disk space on creation declare @start datetime = GetUtcDate() declare @end datetime = GetUtcDate() — Control when checkpoint is executed — Do some inserts declare @start datetime = GetUtcDate() — Execute checkpoint |
IOPerfTest
#include <stdio.h> |
SCSI Fua Write Command Reference [write(10), write(12), and write(16)]
https://www.seagate.com/staticfiles/support/disc/manuals/scsi/100293068a.pdf
Related References
- https://blogs.msdn.microsoft.com/psssql/2013/06/03/how-it-works-the-io-path-sql-server-running-in-windows-azure-virtual-machine-iaas
- https://blogs.msdn.microsoft.com/bobsql/2018/12/10/sql-server-instant-file-initialization-setfilevaliddata-windows-vs-fallocate-linux
- https://blogs.msdn.microsoft.com/bobsql/2018/07/31/sql-server-on-linux-how-is-delete-on-close-handled
- https://blogs.msdn.microsoft.com/bobsql/2017/01/31/sql-server-on-linux-scattergather-vectored-io
- https://blogs.msdn.microsoft.com/psssql/2010/03/24/how-it-works-bob-dorrs-sql-server-io-presentation
- https://blogs.msdn.microsoft.com/psssql/2011/01/07/discussion-about-sql-server-io
- https://technet.microsoft.com/en-us/library/cc966412.aspx
- https://blogs.msdn.microsoft.com/psssql/2008/11/12/how-it-works-sqliosim-running-average-target-duration-discarded-buffers
- https://blogs.msdn.microsoft.com/psssql/2008/04/04/sqliosim-is-not-an-io-performance-tuning-tool
- https://support.microsoft.com/en-us/help/231619/how-to-use-the-sqliosim-utility-to-simulate-sql-server-activity-on-a-d
- https://blogs.msdn.microsoft.com/psssql/2007/02/21/sql-server-urban-legends-discussed
Posted at https://sl.advdat.com/3s9Ln1jhttps://sl.advdat.com/3s9Ln1j