Wednesday, January 12, 2022

How-to test failover group connectivity between primary and secondary SQL Managed Instances

This blog post is one of the 2 posts in the tiny SQL Managed Instance series on how-to determine the status of the connectivity from inside the Azure SQL Managed Instance. They will help you to determine if the SQL MI is able to reach a certain service, such as a different SQL MI or any other supported Azure service, for example.

 

If you are interested in other posts on how-to discover different aspects of SQL MI - please visit the  http://aka.ms/sqlmi-howto, which serves as a placeholder for the series.

 

When you set up a failover group between primary and secondary SQL Managed Instances in two different regions, each instance is isolated using an independent virtual network. Replication traffic needs to be allowed between these VNets.

To allow this kind of traffic, one of the prerequisites is:

  • "You need to set up your Network Security Groups (NSG) such that ports 5022 and the range 11000-11999 are open inbound and outbound for connections from the subnet of the other managed instance. This is to allow replication traffic between the instances."

Strategy

We can use SQL Agent, available on SQL Managed Instance, to run some failover group related connectivity tests between both instances.

The following script will create a new job on SQL Agent named TestFoGConnection and add 2 steps, one to test port 5022 and another to test HADR port.

Proper values for some parameters need to be specified (@server, @node and @port), but we also have a query to generate them.

The script will trigger the job and check status every 5 seconds until it completes.

You’ll need to run the following steps twice, so you can test connectivity in both ways.

 

Steps

  1. Connect to secondary instance.

  2. Run the following script to generate parameters:

 

SELECT 'DECLARE  NVARCHAR(512) = N'''+ value + ''''
FROM sys.dm_hadr_fabric_config_parameters
WHERE parameter_name  = 'DnsRecordName'
UNION
SELECT 'DECLARE @node NVARCHAR(512) = N'''+ NodeName + '.' + Cluster + ''''
FROM 
    (SELECT REPLACE(fr.node_name,'.','') AS NodeName, JoinCol = 1
FROM sys.dm_hadr_fabric_partitions fp
JOIN sys.dm_hadr_fabric_replicas fr ON fp.partition_id = fr.partition_id
JOIN sys.dm_hadr_fabric_nodes fn ON fr.node_name = fn.node_name
WHERE service_name like '%ManagedServer%' and replica_role = 2) t1
LEFT JOIN
    (SELECT value as Cluster, JoinCol = 1
FROM sys.dm_hadr_fabric_config_parameters
WHERE parameter_name  = 'ClusterName') t2
ON (t1.JoinCol = t2.JoinCol)
UNION
SELECT 'DECLARE @port NVARCHAR(512) = N'''+ value + ''''
FROM sys.dm_hadr_fabric_config_parameters
WHERE parameter_name = 'HadrPort';

 

 

 

You will get something like:

parameters.png

Save the result to be used in next steps. Note that the parameters we just generated may change after any reconfiguration, be sure to generate them again if needed.

 

  1. Now that we know what endpoints on secondary instance should we test against, let's connect to primary instance.

  2. Paste the following script

 

--START
-- Parameters section
DECLARE @node NVARCHAR(512) = N' '
DECLARE @port NVARCHAR(512) = N' '
DECLARE  NVARCHAR(512) = N' '

--Script section
IF EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'TestFoGConnection')
EXEC msdb.dbo.sp_delete_job @job_name=N'TestFoGConnection', @delete_unused_schedule=1

DECLARE @jobId BINARY(16), @cmd NVARCHAR(MAX)

EXEC  msdb.dbo.sp_add_job @job_name=N'TestFoGConnection', @enabled=1, @job_id = @jobId    OUTPUT

SET @cmd = (N'tnc ' +  + N' -port 5022 | select ComputerName, RemoteAddress,    TcpTestSucceeded | Format-List')
EXEC msdb.dbo.sp_add_jobstep @job_id=@jobId, _name=N'Test Port 5022'
, _id=1, @cmdexec_success_code=0, _success_action=3, _fail_action=3
, @subsystem=N'PowerShell', @command=@cmd, @database_name=N'master'

SET @cmd = (N'tnc ' + @node + N' -port ' + @port +' | select ComputerName,    RemoteAddress, TcpTestSucceeded | Format-List')
EXEC msdb.dbo.sp_add_jobstep @job_id=@jobId, _name=N'Test HADR Port'
, _id=2, @cmdexec_success_code=0, @subsystem=N'PowerShell', @command=@cmd,    @database_name=N'master'

EXEC msdb.dbo.sp_add_jobserver @job_id = @jobId, _name = N'(local)'
GO
EXEC msdb.dbo.sp_start_job @job_name=N'TestFoGConnection'
GO
--Check status every 5 seconds
DECLARE @RunStatus INT 
SET @RunStatus=10
WHILE ( @RunStatus >= 4)
BEGIN
SELECT distinct @RunStatus  = run_status
FROM [msdb].[dbo].[sysjobhistory] JH JOIN [msdb].[dbo].[sysjobs] J ON JH.job_id= J.   job_id 
WHERE J.name=N'TestFoGConnection' and step_id = 0
WAITFOR DELAY '00:00:05'; 
END

--Get logs once job completes
SELECT [step_name]
,SUBSTRING([message], CHARINDEX('TcpTestSucceeded',[message]), CHARINDEX('Process Exit',   [message])-CHARINDEX('TcpTestSucceeded',[message])) as TcpTestResult
,SUBSTRING([message], CHARINDEX('RemoteAddress',[message]), CHARINDEX   ('TcpTestSucceeded',[message])-CHARINDEX('RemoteAddress',[message])) as    RemoteAddressResult
,[run_status] ,[run_duration], [message]
FROM [msdb].[dbo].[sysjobhistory] JH JOIN [msdb].[dbo].[sysjobs] J ON JH.job_id= J.   job_id
WHERE J.name=N'TestFoGConnection' and step_id <> 0
--END

 

 

 

  1. Replace the parameters with the values you got from step #2, like:

 

--START
-- Parameters section
DECLARE @node NVARCHAR(512) = N'DB80C1.tr1234.francecentral1-a.worker.database.windows.net'
DECLARE @port NVARCHAR(512) = N'11002'
DECLARE  NVARCHAR(512) = N'mi46.abcd5db2d92.database.windows.net'

 

 

  1. Run the script and check the results, you will get something like:

    test result.png

    Verify the results:

    • The outcome of each test at TcpTestSucceeded should be TcpTestSucceeded : True.
    • Check if the resolved IP Address matches the range for the destination subnet, goal is to spot bad DNS resolution (like missing or wrong value).

Now, let's run the same test, but in the opposite direction

You just tested the connectivity from the primary against the secondary. We need to do the same the other way around.

Repeat the previous steps but now connect to primary to generate the parameters and then use those parameters to run the test from the secondary instance to confirm that secondary can also reach primary instance on those endpoints.

Next steps

In case there's any test failing (TcpTestSucceeded : False), this is usually a client-side networking issue (like DNS issue or a port being blocked) that you will need to pursue with your local network administrator.

We strongly recommend you request assistance from your network administrator, some validations you may do together are:

  • The Network Security Groups (NSG) on the primary managed instance subnet allows:

    • Inbound access on ports 5022 and the range 11000-11999
    • Outbound access on ports 5022 and the range 11000-11999
  • The Network Security Groups (NSG) on the secondary managed instance subnet allows:

    • Inbound access on ports 5022 and the range 11000-11999
    • Outbound access on ports 5022 and the range 11000-11999
  • The two SQL Managed Instance VNets do not have overlapping IP addresses.

  • Any networking device used (like firewalls, NVAs) do not block the traffic mentioned above.

  • Routing is properly configured, and asymmetric routing is avoided.

  • If you are using virtual network peering between different regions, ensure that global virtual network peering is supported. See more at Enabling geo-replication between managed instance virtual networks

  • If you are using peering via VPN gateway, ensure the two virtual networks are properly peered.

 

Posted at https://sl.advdat.com/3firv56