Thursday, October 7, 2021

Azure SQL Database unavailable after switching from BYOK CMK to system-managed TDE key

Introduction:

Recently we have received a support case where an Azure SQL Database became unavailable after switching from BYOK Custom-Managed Keys (CMK) to system-managed TDE keys. The switch itself went fine, but the issue occurred much later when removing access to the BYOK in the Azure Key Vault.

 

This article provides you with technical background details and with steps for avoiding this type of incident. It is filling the gap until the public documentation is updated accordingly. Please see the reference section below for links to the current documentation.

 

Problem Description:

In this Azure SQL Database scenario, the customer's security team had first decided that TDE with BYOK is required to have the best security options in place. This was later deemed unnecessary so on September 12, the database was switched back to the system-managed TDE key. As there were no issues and the available documentation didn't recommend anything different, the access to the Azure Key Vault and the BYOK was revoked on September 28, more than two weeks later. Within minutes after revoking permissions, the database become unavailable. The issue was mitigated by re-granting the AKV permissions.

 

Cause:

As the current documentation indicates, rotating the TDE key or changing the protector mode only requires the re-encryption of the Database Encryption Key. This would rather take a few seconds or maybe minutes, but not days and weeks. Changing the TDE key or the protector mode does not require decrypting and re-encrypting the data itself.

 

The investigation showed that the cause was not related to the data itself, but to the transaction log of the database. When TDE is enabled for a database, the Virtual Log Files (VLFs) in the transaction log are also encrypted. When changing the TDE keys, the currently active VLF, which is encrypted by the previous key, will be closed. The next VLF will become active and be encrypted with the new key. The transaction log file then consists of VLFs encrypted by the previous TDE key and VLFs encrypted by the new key.

 

When the issue occurred, some of the VLFs in the transaction log were still using the custom TDE encryptor, although the switch had been made 16 days earlier. The transaction log had not yet rolled over through the VLFs during the period. In this specific case, the VLF rollover was slow because it was a serverless database, being deactivated some of the time, and because there was almost no workload.

 

Troubleshooting:

To identify if you might be affected by this issue, you first need to find out what TDE key is currently active for your Azure SQL Database. You can use the DMV sys.dm_database_encryption_keys to return information about the encryption state of the database and its associated database encryption keys.

The DMV will show you the thumbprint of the current TDE protector of the database:

 

-- execute in your database (not master)
SELECT [database_id], [encryption_state], [encryption_state_desc], 
[encryptor_type], [key_algorithm], [key_length], [encryptor_thumbprint]
FROM[sys].[dm_database_encryption_keys]
/* ASYMMETRIC KEY means Azure Key Vault, CERTIFICATE means service-managed key */

 

Sample output for a database encrypted with a with system-managed key - note the thumbprint in the last column:

Holger_Linke_0-1633616916323.png

 

The second step is to check what TDE protector thumbprints are still used by the VLFs of your database. You can use the DMV sys.dm_db_log_info to return the VLFs and their TDE protector, or rather the protector thumbprints in use. Each different thumbprint is referring to a different key in Azure Key Vault or the system-managed certificate:

 

-- execute in your database (not master)
SELECT database_id, file_id, vlf_size_mb, vlf_sequence_number, 
vlf_active, vlf_status, vlf_encryptor_thumbprint
FROM sys.dm_db_log_info (db_id())

 

Sample output for a database encrypted with a system-managed key - compare the thumbprint to the query result from step 1:

Holger_Linke_1-1633617042521.png

Output from the same database after encryption was disabled, immediately after decryption:

Holger_Linke_2-1633617606503.png

 

Solution:

If you still find references to the old TDE protector thumbprint, you can try to flush these references from the database by shrinking the transaction log file. The VLFs are used in a round-robin fashion and shrinking the log file is forcing a faster wrap-around.

You can shrink the log file (fileid=2 of the database) with either of the following SQL commands:

 

-- execute in your database (not master)
DBCC SHRINKFILE (2, TRUNCATEONLY) 
DBCC SHRINKFILE (2)

 

 

You may have to do this repeatedly after running some workload, depending on which of the VLFs are carrying the old thumbprint. After executing some workload on the database and re-running the Shrink operation, the references to the old encryptor will eventually be removed from the transaction log. Once this has happened, you may go ahead and remove the access to the old key.

 

Sample output of sys.dm_db_log_info continued from above:

Output after running DBCC SHRINKFILE (2):

Holger_Linke_3-1633617760551.png

Output after a short while - note that the encrypted VLF is now inactive:

Holger_Linke_4-1633617760552.png

The last reference to the old thumbprint will be flushed shortly after and you will be safe with removing permissions to the old BYOK and the AKV.

 

References:

VLF behaviour:

https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-log-architecture-and-management-guide?view=sql-server-ver15

 

sys.dm_db_log_info:

https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-db-log-info-transact-sql?view=sql-server-ver15

 

sys.dm_database_encryption_keys:

https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-encryption-keys-transact-sql?view=sql-server-ver15

 

Shrinking transaction log file:

https://docs.microsoft.com/en-us/azure/azure-sql/database/file-space-manage#shrinking-transaction-log-file

 

Azure SQL Transparent Data Encryption with customer-managed key:

https://docs.microsoft.com/en-us/azure/azure-sql/database/transparent-data-encryption-byok-overview

"Keep all previously used keys in AKV even after switching to service-managed keys. It ensures database backups can be restored with the TDE protectors stored in AKV. "

 

Rotate the Transparent Data Encryption (TDE) protector

https://docs.microsoft.com/en-us/azure/azure-sql/database/transparent-data-encryption-byok-key-rotation?tabs=azure-powershell

"Do not delete previous versions of the key after a rollover. When keys are rolled over, some data is still encrypted with the previous keys, such as older database backups."

 

Remove a Transparent Data Encryption (TDE) protector using PowerShell

https://docs.microsoft.com/en-us/azure/azure-sql/database/transparent-data-encryption-byok-remove-tde-protector

 

Msg 33111 error after SQL Server TDE certificate or key rotation

https://docs.microsoft.com/en-us/troubleshoot/sql/admin/error-after-perform-tde-certificate-key-rotation

"Cause: When changing the certificate or keys, the current active Virtual Log File (VLF)-which is encrypted by the previous key-will be closed. The next available VLF (or newly created VLF) will be used and encrypted by the new certification. At this stage, the transaction log file retains log records encrypted by the previous certificate as well as log records encrypted by new certificate."

 

Take a COPY_ONLY backup of TDE protected database on Azure SQL Managed Instance

https://techcommunity.microsoft.com/t5/azure-sql/take-a-copy-only-backup-of-tde-protected-database-on-azure-sql/ba-p/643407

(describes the steps of identifying encrypted VLFs and how to flush them off the log)

 

Posted at https://sl.advdat.com/2ZUotPR