Friday, July 16, 2021

Failover Clustering in Azure

Azure is a cloud computing platform with an ever-expanding set of services to help you build solutions to meet your business goals. Azure services range from simple web services for hosting your business presence in the cloud to running fully virtualized computers for you to run your custom software solutions.  With over 60 regions globally, 200+ products, and over 17,000 services and applications, Azure has everything you need in a cloud.

 

One of the products that can serve as the compute infrastructure for your service or application is Failover Clustering.  Failover Clustering can be a traditional cluster or it can be running Storage Spaces Direct.  No matter the choice, there are a few configuration changes that must be made after cluster creation to ensure connectivity.  Starting in Windows Server 2019, and moving forward, we have added detection into the cluster creation process that will automatically do some of this configuration for you.

 

Let's first talk about the Cluster Network Name.  The Cluster Network Name is used to provide an alternate computer name for an entity that exists on a network.  When it is created, it will also create a Cluster IP Address resource that provides an identity to the group, allowing the group to be accessed by network clients.  When in Azure, an additional Azure Load Balancer must be created with a separate IP Address so that the name can be reached.

 

In Windows Server 2019, and moving forward, we have added detection during the cluster creation process to see if the cluster is being created in Azure.  A new property has been added to the cluster to help you determine what we have detected.  To view it, run the following command; its output is shown beneath it:

 

Get-Cluster | fl DetectedCloudPlatform

 

DetectedCloudPlatform            : Azure

 

As a side note, if it detects it is on-premises or any other cloud provider, the response will be None.
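The same property can also be read from a script to branch on the detected platform.  This is only a minimal sketch, assuming it is run on a cluster node with the FailoverClusters PowerShell module available; the output messages are purely illustrative.

```powershell
# Read the platform the cluster detected at creation time (Windows Server 2019 and later).
$platform = (Get-Cluster).DetectedCloudPlatform

if ("$platform" -eq 'Azure') {
    Write-Output 'Cluster was detected as running in Azure.'
} else {
    # On-premises or any other cloud provider reports None.
    Write-Output "Detected platform: $platform"
}
```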

 

If Azure is detected, there are several configurations it will add, and the first is with the Cluster Name.  Instead of the traditional Cluster Name and Cluster IP Address, it will now create the Cluster Name as a Distributed Network Name (DNN) automatically.  If you have worked with Scale Out File Servers (SOFS), it is the same type of distributed name.  A Distributed Network Name is a name in the Cluster that does not use a clustered IP Address.  It is a name that is published in DNS using the IP Addresses of all the nodes in the Cluster.  Since it uses the IP Addresses of the nodes, a load balancer is not needed.  This is how it looks in Failover Cluster Manager.

 

[Image: CNO-DNN.png, the Cluster Name shown as a Distributed Network Name in Failover Cluster Manager]

 

As a side note, the name is created automatically as a DNN only when the machines are in Azure.  However, we have added the ability to create it as a DNN on-premises if desired.  When you create the Cluster on-premises using Failover Cluster Manager or Windows Admin Center, it will be created with the name and IP Address.  However, using PowerShell, you have a new switch, -ManagementPointNetworkType, that can be used with New-Cluster to create it as a DNN.  -ManagementPointNetworkType accepts several values that define the type of name it will be.

 

New-Cluster -ManagementPointNetworkType:x

 

Singleton : Traditional Cluster Name and Cluster IP Address
Distributed : Create as DNN and use node IP Addresses
Automatic : Detect if on-premises or Azure (default)
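As an illustration, creating an on-premises cluster with a DNN might look like the sketch below.  The cluster and node names are hypothetical placeholders, and no cluster IP Address is supplied because a DNN uses the IP Addresses of the nodes.

```powershell
# Hypothetical names: replace MyCluster, Node1, and Node2 with your own.
# Distributed creates the Cluster Name as a DNN instead of a name plus Cluster IP Address.
New-Cluster -Name MyCluster -Node Node1, Node2 -ManagementPointNetworkType Distributed
```

Leaving the switch off, or passing Automatic, lets the cluster decide based on whether it detects Azure.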

 

Moving on, one of the next things we will change is the network communication thresholds.  Communication between nodes is crucial in keeping them up and talking to ensure high availability.  As a refresher, you have several settings that control the length of wait times and the number of failures before we determine a node to be down and remove it from cluster membership.  These are those settings.

 

Windows 2019 / Azure Stack HCI

Parameter               Default          Maximum
SameSubnetDelay         1 second         2 seconds
SameSubnetThreshold     10 heartbeats    120 heartbeats
CrossSubnetDelay        1 second         4 seconds
CrossSubnetThreshold    20 heartbeats    120 heartbeats
CrossSiteDelay          1 second         4 seconds
CrossSiteThreshold      20 heartbeats    120 heartbeats

 

It is important to understand that both the delay and threshold have a cumulative effect on the total health detection.  For example, setting SameSubnetDelay to send a heartbeat every 1 second and setting SameSubnetThreshold to 10 missed heartbeats before taking recovery means that the cluster can have a total network tolerance of 10 seconds before recovery action is taken.  The higher the numbers, the longer it will take to detect that a node is not responding.  In general, continuing to send frequent heartbeats but having greater thresholds is the preferred method.  The primary scenario for increasing the Delay is if there are ingress / egress charges for data sent between nodes.  When we have detected that the cluster is in Azure, we will automatically increase the thresholds to their maximum values.
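To see what a cluster is currently using, the values can be read from the cluster object and multiplied out.  This is a rough sketch, run on a cluster node; it assumes the delay properties are reported in milliseconds (so 1 second appears as 1000), which is worth confirming against your own output.

```powershell
$cluster = Get-Cluster

# Show the current heartbeat settings.
$cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold, CrossSiteDelay, CrossSiteThreshold

# Tolerance = delay between heartbeats x number of missed heartbeats allowed.
# Assumption: the delay is stored in milliseconds, hence the division by 1000.
$toleranceSeconds = ($cluster.SameSubnetDelay * $cluster.SameSubnetThreshold) / 1000
Write-Output "Same-subnet tolerance: $toleranceSeconds seconds"
```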

 

Please refer to the Tuning Failover Cluster Network Thresholds blog to change these values.

 

The last thing I want to talk about is Azure host maintenance.  Maintenance on a compute host is something you cannot get around, as patches, driver/firmware updates, etc. need to be done periodically.  The same goes for hosts in Azure or any other cloud provider.  So what to do with the virtual machines running on those hosts is something that needs to be considered by the Azure administrators.  There are basically only a couple of things they can do: leave the VMs where they are or move them off.  The decision to move or stay can simply come down to how long the maintenance will take and whether it needs a reboot.  No matter how quickly it can be applied, if a reboot is needed, the VMs are going to move off.  However, if the maintenance being done doesn't need a reboot and is quick, the virtual machine is simply frozen.

 

As a client, you may very well never know anything happened, and that is the goal.  But there could be times when you notice it because you cannot connect, you are hung, a cluster node drops out of membership, etc.  From a client perspective, there is no way of knowing what happened.  You must trust that the administrators have no issues and make the right decisions.

 

But what if you, as an administrator, received a heads up of impending host maintenance and could make the decision yourself?  Well, that leads to the other new feature we added.  With Windows Server 2019, we added integration with and awareness of Azure Host Maintenance and improved the experience by monitoring for Azure Scheduled Events.  For this to fully be done, all clustered VMs must be in the same Azure Availability Zone.  When a host has maintenance scheduled, we will now detect it and log an event in the virtual machine's FailoverClustering/Operational channel.  We have also included actions that you can configure based on the event.

 

First, let's talk about the events you could see.  This is an example of one of those events.

 

Log: FailoverClustering/Operational
Level: Warning
Event ID: 1139
symbol="NODE_MAINTENANCE_DETECTED"
Description: The cluster service has detected an Azure host maintenance event has been scheduled. This maintenance event may cause the node hosting the virtual machine to become unavailable during this time.

 

Node: VMNode1
Approximate Time: 2021/07/16-17:30:00.000
Details: ' EventId = 4FE57A76-7754-48FD-9B45-48387A36CD19
EventStatus = Scheduled Event
Type = Freeze
Resource Type = VirtualMachine

 

As you can see, this event was triggered because a host maintenance event has been scheduled.  It provides several other things of interest.

 

1. The time the event is to occur

2. The event ID, which someone from the Azure Team could look up if a support ticket were raised

3. What it will do with the virtual machine

 

There are actually 3 events you could see.

 

Event ID 1136:  Host maintenance is imminent and about to occur

Event ID 1139:  Host maintenance has been detected

Event ID 1140:  Host maintenance has been rescheduled
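To check whether any of these events have been logged on a node, the channel can be queried with Get-WinEvent.  This is a sketch only; it assumes the full channel name is Microsoft-Windows-FailoverClustering/Operational, and the limit of 20 events is an arbitrary choice.

```powershell
# Look for the three host maintenance events in the cluster operational channel.
$filter = @{
    LogName = 'Microsoft-Windows-FailoverClustering/Operational'
    Id      = 1136, 1139, 1140
}

# -ErrorAction SilentlyContinue keeps the command quiet when no matching events exist.
Get-WinEvent -FilterHashtable $filter -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message
```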

 

Now that you have the events, the next thing is to decide if you want to define an action.  We have created two new cluster properties: DetectManagedEvents and DetectManagedEventsThreshold.  DetectManagedEvents is for the action you wish to have occur when it detects an event is scheduled.  DetectManagedEventsThreshold is the amount of time before that action is taken.  The options for each of these are as follows:

 

DetectManagedEvents

0 = Do not Log Azure Scheduled Events <-- default for on-premises
1 = Log Azure Scheduled Events <-- default in Azure
2 = Avoid Placement (don’t move roles to this node)
3 = Pause and drain when Scheduled Event is detected
4 = Pause, drain, and failback when Scheduled Event is detected

 

DetectManagedEventsThreshold

60 seconds <-- default
Amount of time before taking action

 

Note: These settings only apply when the virtual machine is in Azure.  They do not take effect on any other platform (e.g., a third-party cloud provider, Hyper-V, Azure Stack Hub/HCI/Edge, etc.).
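As one possible example, the two properties can be set like other cluster common properties.  The sketch below picks option 3 (pause and drain) and keeps the default threshold; the values are illustrative choices, not a recommendation.

```powershell
# Pause and drain the node when a Scheduled Event is detected (value 3 from the list above).
(Get-Cluster).DetectManagedEvents = 3

# Amount of time before taking action, in seconds (60 is the default).
(Get-Cluster).DetectManagedEventsThreshold = 60

# Confirm the values.
Get-Cluster | Format-List DetectManagedEvents, DetectManagedEventsThreshold
```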

 

In closing, we recognized that there are some configurations needed when a Failover Cluster is in Azure.  By adding these new features, we have taken some of the burden away from you as an administrator by automatically making these changes for you.

 

Thanks

John Marlin

Senior Program Manager

Twitter: @Johnmarlin_MSFT

 

Posted at https://sl.advdat.com/2UlXcn9