Advanced Data Solutions : SQL Server 2019 PolyBase feature High Availability

One of the most exciting aspects of SQL Server 2019 are the enhanced capabilities around the PolyBase feature that enables access to non-SQL Server external data sources through a process referred to as virtualization. The data remains at the external data source but can be queried from with SQL Server using native Transact-SQL just as if the data were stored locally.

As the PolyBase feature is implemented into mission-critical environments, organizations will seek high-availability solutions to ensure continuous operation of externalized data sources.

The abridged version of this article is that the PolyBase feature is compatible with Always On Failover Cluster Instances, but is incompatible (and therefore unsupported) with Always On Availability Groups. If that statement provides the needed information, then you can probably stop reading.

If you are curious as to why this is the case, the remainder of this article details why the PolyBase features works with one high availability deployment topology but not the other.

PolyBase Services

First, a brief review of the PolyBase services is important as their behavior is influenced by the chosen high-availability design.

SQL Server PolyBase Engine

Service executable: mpdwsvc.exe -dweng
Parses queries
Generates query plans
Distributes work to compute nodes
Processes compute node results and presents results back to the client

SQL Server PolyBase Data Movement Service

Service executable: mpdwsvc.exe -dms
Transfer data between external data sources as well as between PolyBase head and compute nodes
Capable of inserting data into the Hadoop and Azure Blob Storage data sources

Always On Failover Cluster Instances

With Always On Failover Cluster Instances, Windows Server Failover Clustering (WSFC) is leveraged to provide high availability. In terms of SQL Server, a Failover Cluster Instance (FCI) is a single instance of SQL Server across one or more Windows Server Failover Clustering (WSFC) nodes.

A Failover Cluster Instance of SQL Server uses a network name that moves between the WSFC nodes to provide high availability.

For this section of the article, the PolyBase scale-out topology is as follows:

PolyBase Head node: PBFCISQL0

Description: A SQL Server 2019 CU4 FCI comprised of two (2) cluster nodes

Windows Server Failover Cluster nodes:

PBFCI00
PBFCI01

The view from Failover Cluster Manager.

The view using the Get-ClusterResource PowerShell cmdlet.

PolyBase Compute node: PBSO0

Description: A stand-alone SQL Server 2019 CU4 instance.

As shown above, the PolyBase Engine and DMS services are both installed as Generic Service resource types in the SQL Server group resource group. As such, when the SQL Server resource group moves from one node to another, the PolyBase Engine and DMS services are taken offline.

Why is this important?

In a PolyBase scale out topology, compute nodes open a persistent connection to the host serving as the head node.

With the environment shown above, compute node “PBSO0” creates a persistent connection to “PBFCISQL0”, which is the Failover Cluster Instance (FCI) acting as the PolyBase head node.

When Failover Cluster resource group “SQLserver” and its associated resources (which include the PolyBase Engine and DMS services) moves from one node to another, the persistent connections from compute node “PBSO0” to head node “PBFCISQL0” are terminated as the SQL Server instance is taken offline.

Let us examine this process more closely.

The “PBFCISQL0” FCI SQL Network Name is currently hosted on the cluster node PBFCI00 and as shown below there are numerous connections from the PolyBase scale-out compute node “PBSO0”.

Now let us move the SQL Server FCI Resource Group to the second cluster node, “PBFCI01”.

Now that all the resources in the SQL Server Resource Group are back online, now on the “PBFCI01” cluster node, let us review the logs on the “PBSO0” PolyBase scale-out compute node.

We will focus on the “<Instance Name>_<Server Name>_Dms_movement.log” log file, which as shown below is named “MSSQLSERVER_PBSO0_Dms_movement.log”

Contained in lines 1 – 3, we see that the connection between PolyBase scale-out compute node “PBSO0” and the PolyBase head node, “PBFCISQL0” is terminated. This disconnection event is repeated several for each thread that was connected to the PolyBase head node.

In fact, the PolyBase DMS service on the “PBSO0” scale-out compute node terminates completely, as is evidenced in lines 5 and 6 above as well as in the System Event Log as shown below.

In lines 14 – 17 we see that the “PBSO0” PolyBase scale-out compute node has successfully reconnected with the “PBFCISQL0” head node. Additionally, the System Event Log shows the PolyBase DMS service successfully restarting.

Back on the "PBFCISQL0" SQL Failover Cluster instance and PolyBase scale-out head node, we again see connections established from the "PBSO0" PolyBase scale-out compute node.

Always On Availability Groups

At the time of this blog’s authoring, high availability for the PolyBase head node using Always On Availability Groups is not a supported configuration. And here’s why.

For this section of the blog post, the PolyBase scale-out topology is as follows:

Always On Availability Group Listener (head node endpoint): ExternalDBLis

Description: A SQL Server 2019 CU11 Always On Availability Group comprised of two replica nodes

Always On Availability Group Replica nodes:

PBAO1
PBAO2

The view from SQL Server Management Studio:

PolyBase Compute node: PBSO001

Description: A stand-alone SQL Server 2019 CU11 instance

When the “ExternalAG” availability group transitions from one replica to another, such as from PBAO1 to PBAO2, the PolyBase Services (Engine and Data Movement Service) are not automatically stopped and/or restarted on either of the Always On Availability Group replica servers. As a result, the PolyBase scale-out compute node in our topology, PBSO001, is unaware that the availability group has moved from one replica to another.

Consequently, the connection between the replica hosting the Always On Availability Group Listener and the scale-out compute node is marked as a “dead” connection and the Data Movement Service on the scale-out compute node is taken offline and a message similar to that shown below is recorded in the PolyBase DMS log.

The reason for this is that once established, the data and control connections between the DMS and Polybase Engine service, which in our Always On Availability Group is whichever replica is hosting the Availability Group Listener, are persistent. So, while the Availability Group Listener transitions from one replica to another, the PolyBase Engine service is not restarted.

And that is the key difference between Always On Failover Cluster Instances and Always On Availability Groups when deployed with PolyBase. In an Always On Failover Cluster Instance, the PolyBase Engine and DMS services moves from one node to another, the persistent connections from compute nodes and the head node are terminated as the SQL Server instance is taken offline.

Hopefully this article made clear the interaction between the PolyBase feature and common SQL Server high availability solutions. To summarize, if high availability for PolyBase is a requirement for your environment, the only viable deployment topology at this time are Always On Failover Cluster Instances.

Posted at https://sl.advdat.com/3bf6h5V

Wednesday, October 27, 2021

SQL Server 2019 PolyBase feature High Availability