Advanced Data Solutions : Disaster Recovery options for Azure Database for MySQL

An increasing number of customers are migrating their MySQL workloads to Azure Database for MySQL and often the related applications are read-heavy and support customers with operations in different geographical locations. Azure Database for MySQL makes it easy to architect, deploy, and manage diverse business requirements. One of the popular patterns in architecting applications is the use of read replicas for enhancing the read experience of end-users. To improve the experience even further, customers can place data as near to the users as possible using cross-region read replicas.

In addition to the performance, customers also have a critical need for business continuity through reliable Disaster Recovery (DR). Organizations can structure the behavior of their DR solution in multiple ways based on the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) of the application. In Azure, commonly used DR approaches are geo-restore and cross-region read-replicas. To address this scenario, we’ll discuss an architecture pattern that provides simple DR management for applications that maintain multiple read-replicas.

A common architecture that customers use for their critical application(s) in Azure includes a primary database instance in one region together with read replicas deployed across multiple regions as shown in the following diagram.

Figure 1-Disaster Recovery Architecture using Azure Database for MySQL Read-Replica

In this diagram, the primary Azure region is East US2, which hosts myrw-1 as the primary read-write instance. The Central US region hosts myrw-2, which is designated as the DR instance should an outage that exceeds the application RTO SLAs occur in the primary region. The DR instance remains in-sync asynchronously based on the Read Replica functionality in Azure Database for MySQL.

Notice that the primary instance (myrw-1) also has eight (8) additional read-replicas across multiple Azure regions, which allows for offloading read-only workloads local to the application users, as called out in the following table.

Azure region	Primary (read-write)	Read replica
East US2	myrw-1	myrr-1 myrr-2
Central US		myrw-2 myrr-5 myrr-6
East US		myrr-3 myrr-4
North Central US		myrr-7 myrr-8

Table-1 – List of Azure Database for MySQL instances

If there's an unplanned outage in East US2 (which hosts the primary read-write instance), the customer can stop replication and promote myrw-2 as the new primary read-write instance. As soon as replication stops and the application is switched to the new primary myrw-2 read-write instance, the read-replicas from the former primary instance (myrw-1) become obsolete and won't receive any new changes to the database.

Because the other regions remain active, the customer must drop the former replicas, including those in new primary region (Central US), and instantiate new read-replicas from the new primary myrw-2 read-write instance. This ensures that read-only workloads continue to be offloaded in the respective regions. When the East US2 (original primary) region is restored, the customer must drop old primary read-write and the two (2) read-replicas in that region.

In summary, promoting a read-replica as a read-write instance in this scenario requires the following steps:

Stop the read-replica between the myrw-1 and myrw-2 instances, making each Azure Database for MySQL read-write instances.
Update the DNS if required, and then switch the application to the new primary instance (myrw-2).
Drop the old replicas in the Central US, East US, and North Central US regions.
Provision new read-replicas in the appropriate Azure regions.
When the East US2 region has been restored, drop the old read-write instance and read-replicas.
Provision new read-replicas in East US2, and then designate one of the read-replicas as the new disaster recovery instance.

The new configuration is shown in the following diagram.

Figure 2 - DR Architecture using Azure Database for MySQL Read-Replica after the fail-over

Azure region	Primary (read-write)	Read replica
East US2 (down)	myrw-1	myrr-1 myrr-2
Central US	myrw-2	myrr-5 myrr-6
East US		myrr-3 myrr-4
North Central US		myrr-7 myrr-8

Table-2 – List of Azure Database for MySQL instances while East US is down

As noted, the ability for the system to fail-over to a new DR instance requires provisioning new read-replicas. The time required to provision a new read-replica on an existing Azure Database for MySQL instance depends on many factors, such as the size of the instance, the network latency, etc. Over time, as your application continues to grow, the time required to provision new read-replicas after a fail-over also increases. Restructuring the read-replicas post fail-over has its overheads.

The following section proposes an alternate solution to address this. The architecture described below also caters to scenarios in which there's a need for a site-swap every 3-6 months, as it reduces the overhead of replica restructuring.

Alternate DR architecture

An alternative to using an architecture similar to that described above is to use native MySQL replication between the primary read-write instance (myrw-1) and the secondary read-write (myrw-2) instance. Each is an independent instance of Azure Database for MySQL with its own read replicas for offloading read-only workloads in the respective regions. This alternative architecture is shown in the following diagram.

Figure 3- Azure Database MySQL DR Architecture using MySQL Data-In Replication

Azure region	Primary (read-write)	Secondary (read-write)	Read replica
East US2 (down)	myrw-1		myrr-1 myrr-2
Central US		myrw-2	myrr-5 myrr-6
East US			myrr-3 myrr-4
North Central US			myrr-7 myrr-8

Table-3 – List of Azure Database for MySQL instances using MySQL replication

Because each Azure Database for MySQL instance has its own set of read-replicas to offload read-only workloads, when the primary instance goes down, the replication stops and impacts only the associated read-replicas. Other read-replicas against the secondary Azure Database for MySQL instance will continue to function. The only downside of this architecture is the operational overhead that is required to maintain data-in replication. An example of operational overhead would be reconfiguring data-in replication and firewall rules after scaling Azure Database for MySQL up or down as it may change the IP address. However, using this approach eliminates the need to re-provision read-replicas after the fail-over. For more information about configuring data-in replication, see How to configure Azure Database for MySQL Data-in Replication. There is no impact to the IP address of the Azure Database for MySQL server (per the second section, Configure the Source MySQL Server) resulting from the manual restart or because of Azure planned maintenance.

Maintaining and automating an Azure Database for MySQL instance and Replication

Azure Database for MySQL provides multiple metrics that you can use to monitor the health of your instance and any associated read-replicas. Before initiating a planned failover, be sure to check the active_connections and seconds_behind_master metrics.

az monitor metrics list --resource {ResourceName} --metrics seconds_behind_master
az monitor metrics list --resource {ResourceName} -- metrics active_connections

Azure also provides multiple options for automating the entire architecture by using the Azure CLI or PowerShell. Customers can automate read-replica re-provisioning by using the following Azure CLI commands.

az mysql server replica list - Lists all the read replicas for a given server.
az mysql server replica stop - Stops replication to a read replica and makes it a read/write server.
az mysql server delete – Deletes a server.

Conclusion

As discussed, depending on your application RTO/RPO and other requirements, you can leverage geo-backups, read replica, or MySQL native replication in architecting your HADR strategy. Geo-restore and cross-region read-replicas cater to majority of the scenarios, except when there are multiple replicas spread across regions or a need for periodic site-swaps. Data-in replication helps alleviate the overhead in such cases.

If you have any feedback or questions about setting up the above architecture, please leave a comment below or email us at AskAzureDBforMySQL@service.microsoft.com.

Posted at https://sl.advdat.com/2UCkrJd

Thursday, July 8, 2021

Disaster Recovery options for Azure Database for MySQL

Alternate DR architecture

Maintaining and automating an Azure Database for MySQL instance and Replication

Conclusion