VCDX #200 Blog of one VMware Infrastructure Designer: VMware Metro Storage Cluster

Yesterday morning I had a design discussion with one of my customers about HA and DR solutions. That same afternoon we discussed the VMware Metro Storage Cluster topic within our internal team, which inspired me to write this blog article and use it as a reference for similar discussions in the future. By the way, I presented this topic at a local VMUG meeting two years ago, so you can find the original slides here on SlideShare. In this blog post, I would like to document the topics, architectures, and conclusions I discussed with several folks.

Stretched (aka active/active) clusters are a very popular infrastructure architecture pattern nowadays. The VMware implementation of this active/active cluster pattern is vMSC (VMware Metro Storage Cluster). The official VMware vSphere Metro Storage Cluster Recommended Practices can be found here. Let's start with a definition of what vMSC is and is not from the HA (High Availability), DA (Disaster Avoidance) and DR (Disaster Recovery) perspectives.

vMSC (VMware Metro Storage Cluster) is:

A High Availability solution extending infrastructure high availability across two availability zones (sites within metro distance)
A Disaster Avoidance solution enabling live migration of VMs not only across ESXi hosts within a single availability zone (local cluster) but also to another availability zone (another site)

vMSC (VMware Metro Storage Cluster) is a great High Availability and Disaster Avoidance technology, but it is NOT a pure Disaster Recovery solution, even though it can help with two specific disaster scenarios (failure of one of the two storage systems, single site failure). Why is it not a pure DR solution? Here are a few reasons:

vMSC requires Storage Metro Cluster technology, which joins two storage systems into a single distributed storage system allowing stretched storage volumes (LUNs), but this creates a single fault zone for situations when LUNs are locked or badly served by the storage system. It is great for HA but not good for DR. Such a single fault zone can lead to a total cluster outage in situations like those described here - http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005201 , https://kb.vmware.com/kb/2113956
The vMSC compute cluster (vSphere cluster) has to be stretched across two availability zones, which creates a single fault zone. Such a single fault zone can lead to a total cluster outage in situations like the one described here - https://kb.vmware.com/kb/56492
DR is not only about infrastructure but also about applications, people and processes.
DR should be business service oriented, therefore from the IT perspective, DR is more about applications than infrastructure.
DR should be tested on a regular basis. Can you afford to power off the whole site and test that all VMs will be restarted on the other side? Are you sure the applications (or more importantly business services) will survive such a test? I know a few environments where they can afford it, but most enterprise customers cannot.
DR should allow going back into the past, therefore the solution should be able to leverage old data recovery points. Recovery from old recovery points should be possible at the application group level and not only for the whole infrastructure.
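
The last point can be illustrated with a minimal Python sketch. It assumes each application group keeps its own list of recovery point timestamps and that you know roughly when the problem (for example, data corruption) started for each affected group. The function names and the sample group name are hypothetical and do not correspond to any VMware API.

```python
from datetime import datetime
from typing import Optional


def latest_point_before(points: list[datetime], cutoff: datetime) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before `cutoff`, if any."""
    return max((p for p in points if p < cutoff), default=None)


def plan_rollback(
    points_by_group: dict[str, list[datetime]],
    cutoff_by_group: dict[str, datetime],
) -> dict[str, Optional[datetime]]:
    """Choose a recovery point only for the affected application groups.

    Groups that are not listed in `cutoff_by_group` are left untouched,
    which is the point: recovery is per application group, not per site.
    """
    return {
        group: latest_point_before(points_by_group.get(group, []), cutoff)
        for group, cutoff in cutoff_by_group.items()
    }


# Hypothetical sample data: one application group with three recovery points.
erp_points = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
plan = plan_rollback({"erp": erp_points}, {"erp": datetime(2024, 1, 1, 8, 30)})
print(plan)
```

A stretched cluster on its own has no such notion of going back in time, because both sites always see the same current state of the stretched volume.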

Combination of HA and DR solutions

Any HA solution should be combined with some DR solution. At a minimum, such a DR solution is any classic backup solution with local or even remote-site backup repositories. The typical challenge with any backup solution is RTO (Recovery Time Objective) because:

You must have the infrastructure hardware ready for workloads to be restored and powered on
Recovery from traditional backup repositories is usually very time-consuming and may or may not fulfill the RTO requirement
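
To make the RTO concern more tangible, here is a rough back-of-the-envelope sketch in Python. The data size, sustained restore throughput, and hardware preparation time are purely illustrative assumptions, not measurements from any backup product, and real restores rarely sustain a constant throughput.

```python
def estimated_restore_hours(data_tb: float, throughput_mb_s: float, prep_hours: float = 0.0) -> float:
    """Rough restore time: hardware preparation plus bulk data transfer.

    data_tb         -- amount of data to restore, in decimal terabytes
    throughput_mb_s -- assumed sustained restore throughput in MB/s
    prep_hours      -- assumed time to get the target hardware ready
    """
    data_mb = data_tb * 1_000_000          # decimal TB -> MB
    transfer_hours = data_mb / throughput_mb_s / 3600
    return prep_hours + transfer_hours


def meets_rto(data_tb: float, throughput_mb_s: float, rto_hours: float, prep_hours: float = 0.0) -> bool:
    """True if the rough estimate fits within the required RTO."""
    return estimated_restore_hours(data_tb, throughput_mb_s, prep_hours) <= rto_hours


# Illustrative numbers only: 20 TB at 500 MB/s with 2 hours of hardware preparation.
hours = estimated_restore_hours(20, 500, prep_hours=2)
print(f"estimated restore time: {hours:.1f} h, fits a 4 h RTO: {meets_rto(20, 500, 4, prep_hours=2)}")
```

Even with these optimistic numbers the restore takes more than half a day, which is why a backup alone often cannot satisfy a tight RTO.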

That's the reason why orchestrated DR with storage replication and snapshots is usually a better DR solution than a classic backup. vMSC can be safely combined with storage-based DR solutions that offer lower RTO SLAs. VMware has a specific Disaster Recovery product called Site Recovery Manager (SRM) that provides orchestrated vSphere or storage replication and automated workload recovery. With such a combination, you get Cross Site High Availability and Cross Site Disaster Avoidance provided by vMSC, and pure Disaster Recovery provided by SRM. Such a combination is not so common, at least in my region, because it is relatively expensive. That's the reason customers usually have to choose only one solution. Now, let's think about why infrastructure guys prefer vMSC over pure DR like SRM. Here are the reasons:

It is "simpler" and much easier to implement and operate
There is no need to understand, configure and test application dependencies
It can be "wrongly" claimed to be a DR solution

It is not very well known, but VMware SRM nowadays supports Disaster Recovery and Avoidance on top of stretched storage. It is described in the last architecture concept below.

So let's have a look at various architecture concepts for cross-site HA and DR with VMware products.

VMware Metro Storage Cluster (vMSC) - High Availability and Disaster Avoidance Solution

VMware Metro Storage Cluster (vMSC)

In the figure above, I have depicted a VMware Metro Storage Cluster, which consists of:

Two availability zones (Site A, Site B)
Single centralized vSphere Management (vCenter A)
Single stretched storage volume(s) distributed across two storage systems, each in a different availability zone (Site A, Site B)
VMware vSphere Cluster stretched across two availability zones (Site A, Site B)
Third location (Site C) for the storage witness. If a third site is not available, the witness can be placed in Site A or B, but then the storage administrator is the real arbiter in case of potential split-brain scenarios
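
The role of the witness can be sketched with a deliberately simplified arbitration model in Python. This is a generic quorum illustration under my own assumptions (a single preferred site, a witness that simply picks one side), not the algorithm of any particular storage metro cluster product; real implementations differ in many details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    a_alive: bool = True          # storage system in Site A is running
    b_alive: bool = True          # storage system in Site B is running
    inter_site_link: bool = True  # replication link between Site A and Site B
    a_sees_witness: bool = True   # Site A can reach the witness in Site C
    b_sees_witness: bool = True   # Site B can reach the witness in Site C


def serving_sites(s: State, preferred: str = "A") -> set[str]:
    """Return the sites allowed to keep serving the stretched volume.

    An empty set means nobody may continue automatically, so a human
    (the storage administrator) has to arbitrate.
    """
    if s.a_alive and s.b_alive and s.inter_site_link:
        return {"A", "B"}  # healthy: active/active

    # Partition or site failure: only a site that reaches the witness may continue.
    candidates = set()
    if s.a_alive and s.a_sees_witness:
        candidates.add("A")
    if s.b_alive and s.b_sees_witness:
        candidates.add("B")

    # Both sides reach the witness: it must pick exactly one to avoid split-brain.
    if candidates == {"A", "B"}:
        return {preferred}
    return candidates


print(serving_sites(State()))
print(serving_sites(State(inter_site_link=False)))
print(serving_sites(State(a_alive=False)))
print(serving_sites(State(inter_site_link=False, a_sees_witness=False, b_sees_witness=False)))
```

The last case shows why witness placement matters: if the witness lives in Site A and Site A fails, Site B loses both its peer and the witness at once, and the model returns an empty set, meaning a human has to decide.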

Advantages of such an architecture are

Cross-site high availability (positive impact on Availability, thus Business Continuity)
Cross-site vMotion (good for Disaster Avoidance)
Protects against single storage system (storage in one site) failure scenario
Protects against single availability zone (one site) failure scenario
Self-initiated fail-over procedure.

Drawbacks

vMSC is a tightly integrated distributed cluster system combining vSphere HA Cluster and Storage Metro Cluster, therefore it is a potential single fault zone. Stretched LUNs are a single fault zone for issues caused by the distributed storage system or by bad behavior of the cluster filesystem (VMFS)
Typically, the third location is required for storage witness
It is usually very difficult to test HA
It is almost impossible to test DR

VMware Site Recovery Manager in Classic Architecture - Disaster Recovery Solution

VMware Site Recovery Manager - Classic Architecture

In the figure above, I have depicted the classic architecture of the VMware DR solution (Site Recovery Manager), which consists of:

Two availability zones (Site A, Site B)
Two independent vSphere Management servers (vCenter A, vCenter B)
Two independent DR orchestration servers (SRM A, SRM B)
Two independent vSphere Clusters
Two independent storage systems, one in Site A and the second in Site B
Synchronous or asynchronous data replication between storage systems
Snapshots (multiple recovery points) on the backup site are optional but highly recommended if you take DR planning seriously.

Advantages of such an architecture are

Cross-site disaster recoverability (positive impact on Recoverability, thus Business Continuity)
Maximal infrastructure independence, therefore we have two independent fault zones. The only connection between the two sites is storage (data) replication.
Human-driven and well-tested disaster recovery procedure.
Disaster Avoidance (migration of applications between sites) can be achieved, but only with business service downtime. The Protection Group has to be shut down on one site and restarted on the other site.

Drawbacks

Disaster Avoidance without service disruption is not available.
Usually, there is a huge level of effort involved in application dependency mapping, and application-specific recovery plans (automated or semi-automated run books) have to be planned, created and tested
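
To show what application dependency mapping means in practice, here is a minimal Python sketch that turns a dependency map into ordered recovery waves using the standard library `graphlib` module (Python 3.9+). The service names are hypothetical, and this is only an illustration of the idea, not how SRM stores or executes its recovery plans.

```python
from graphlib import TopologicalSorter


def recovery_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into start-up waves.

    `depends_on` maps each service to the services it needs running first.
    Every service in a wave depends only on services from earlier waves.
    A circular dependency raises graphlib.CycleError.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical business service: web front end -> app tier -> database,
# with DNS and directory services as shared infrastructure.
dependencies = {
    "dns": set(),
    "ad": set(),
    "db": {"dns"},
    "app": {"db", "ad"},
    "web": {"app"},
}

for number, wave in enumerate(recovery_waves(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The hard part in a real project is not the ordering itself but collecting a correct dependency map from application owners and keeping it up to date, which is where most of the effort goes.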

VMware Site Recovery Manager in Stretched Storage Architecture - Disaster Recovery and Avoidance Solution

VMware Site Recovery Manager - Stretched Storage Architecture

In the last figure, I have depicted the new architecture of the VMware DR solution (Site Recovery Manager). In this architecture, SRM supports stretched storage volumes, but everything else is independent and specific to each site. The solution consists of:

Two availability zones (Site A, Site B)
Two independent vSphere Management servers (vCenter A, vCenter B)
Two independent DR orchestration servers (SRM A, SRM B)
Two independent vSphere Clusters
Single distributed storage system having storage volumes stretched across Site A and Site B
Snapshots (multiple recovery points) on the backup site are optional but highly recommended if you take DR planning seriously.

Advantages of such an architecture are

Cross-site disaster recoverability (positive impact on Recoverability, thus Business Continuity)
Maximal infrastructure independence, therefore we have two independent fault zones. The only connection between the two sites is storage (data) replication.
Human-driven and well-tested disaster recovery procedure.
Disaster Avoidance without service disruption leveraging cross vCenter vMotion technology.

Drawbacks

Usually, there is a huge level of effort involved in application dependency mapping, and application-specific recovery plans (automated or semi-automated run books) have to be planned, created and tested
The virtual machine's internal identifier (moRef ID) changes after a cross vCenter vMotion, therefore your supporting solutions (backup software, monitoring software, etc.) must not depend on this identifier.
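
The moRef ID drawback can be illustrated with a small Python sketch of how a supporting tool might track VMs. The record layout, the identifier values, and the `stable_id` field are hypothetical; which identifier actually survives a cross vCenter vMotion in your environment (for example, an instance UUID) is an assumption you should verify for your vSphere version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VmRecord:
    vcenter: str    # vCenter that currently manages the VM
    moref: str      # managed object reference, only meaningful within one vCenter
    stable_id: str  # ASSUMPTION: some identifier that survives the migration
    name: str


def find_by_moref(inventory: list[VmRecord], vcenter: str, moref: str) -> Optional[VmRecord]:
    """Fragile lookup: breaks as soon as the VM moves to another vCenter."""
    return next((vm for vm in inventory if vm.vcenter == vcenter and vm.moref == moref), None)


def find_by_stable_id(inventory: list[VmRecord], stable_id: str) -> Optional[VmRecord]:
    """More robust lookup: independent of the managing vCenter."""
    return next((vm for vm in inventory if vm.stable_id == stable_id), None)


# Hypothetical inventory snapshots before and after a cross vCenter vMotion.
before = [VmRecord("vcenter-a", "vm-101", "id-0001", "erp-db")]
after = [VmRecord("vcenter-b", "vm-77", "id-0001", "erp-db")]

print(find_by_moref(after, "vcenter-a", "vm-101"))  # the old reference no longer resolves
print(find_by_stable_id(after, "id-0001"))          # still finds the same VM
```

A backup or monitoring tool that stores only the moRef would treat the migrated VM as a brand new object and lose its history, which is exactly the dependency to avoid.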

CONCLUSION

Infrastructure availability and recoverability are two independent infrastructure qualities. Both of them have a positive impact on business continuity, but each addresses a different situation. High Availability solutions increase the reliability of the system through more redundancy and self-healing automated failover among redundant system components. Recoverability solutions back up data from one system and allow a full recovery in another, independent system. Both solutions can and should be combined in compliance with SLA/OLA requirements.

VMware Metro Storage Cluster is a great High Availability technology, but it should not be used as a replacement for disaster recovery technology. VMware Metro Storage Cluster is not a Disaster Recovery solution, even though it can protect the system against two specific disaster scenarios (single site failure, single storage system failure). You also do not call a VMware "vSphere HA Cluster" a DR solution, even though it can protect you against a single ESXi host failure.

The final infrastructure architecture always depends on the specific use cases, requirements and expectations of the particular customer, but expectations should be set correctly and we should know what the designed system does and what it does not. It is always better to know the potential risks than to have unknown risks. For known risks, a mitigation or contingency plan can be prepared and communicated to system users and business clients. You cannot do that for unknown risks.

Other resources

There are other posts on the blogosphere explaining what vMSC is and is NOT.

VMware vSphere Metro Storage Cluster (VMware vMSC)

“VMware vMSC can give organizations many of the benefits that a local high-availability cluster provides, but with geographically separate sites. Stretched clustering, also called distributed clustering, allows an organization to move virtual machines (VMs) between two data centers for failover or proactive load balancing. VMs in a metro storage cluster can be live migrated between sites with vSphere vMotion and vSphere Storage vMotion. The configuration is designed for disaster avoidance in environments where downtime cannot be tolerated, but should not be used as an organization's primary disaster recovery approach.”

Another very nice technical write-up about vMSC is here - The dark side of stretched clusters

And here is my VMUG presentation "Metro Cluster High Availability or SRM Disaster Recovery?".