Monday, May 11, 2020

Undocumented HA Advanced Option - das.restartVmsWithoutResourceChecks

Some time ago, a colleague of mine (@stan_jurena) was challenged by a VMware customer who experienced an APD (All Paths Down) storage situation across the whole HA cluster and expected the VMs to be killed by the VMware hypervisor (ESXi), because the HA cluster APD response was set to "Power off and restart VMs - Aggressive restart policy". To be honest, I had the same expectation. However, after a discussion with VMware engineering, we were told that the primary role of an HA cluster is to keep VMs up and running, so the "Aggressive restart policy" restarts a VM only under certain conditions, which are much better described in the vSphere Client 7 UI. See the screenshot below.


APD Aggressive restart policy
A VM will be powered off if HA determines the VM can be restarted on a different host, or if HA cannot detect the resources on other hosts because of network connectivity loss (network partition).

So, what does it mean? The Aggressive restart policy is the same as the Conservative one, but extended to cover network partitioning. This can be helpful when you have IP storage and experience IP network issues, but it does not help when you have a dedicated Fibre Channel SAN and the storage is unavailable to the whole vSphere cluster.
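The difference between the two policies can be written down as a small decision function. This is only a sketch of the behaviour described in the UI text above, not VMware's implementation; the names `ApdPolicy` and `vm_is_powered_off` are made up for illustration.

```python
from enum import Enum


class ApdPolicy(Enum):
    CONSERVATIVE = "conservative"
    AGGRESSIVE = "aggressive"


def vm_is_powered_off(policy, can_restart_elsewhere, network_partitioned):
    """Return True if HA would power off (and restart) a VM affected by APD.

    can_restart_elsewhere: HA determined that another host can run the VM.
    network_partitioned:   HA cannot see resources on other hosts.
    """
    if can_restart_elsewhere:
        # Both policies act when a restart target is known.
        return True
    # Only the Aggressive policy also acts during a network partition.
    return policy is ApdPolicy.AGGRESSIVE and network_partitioned
```

In the Fibre Channel scenario (no host can restart the VM, no network partition), this model returns False for both policies, which matches what the customer observed.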

We explained to VMware engineering that there are situations when it is much better to kill all VMs than to keep compute (VMs) running without available storage. Based on these discussions, a Feature Request was created, internally named the "super aggressive" APD option. I'm happy to see that it was implemented and released in vSphere 7 as the vSphere advanced option:
das.restartVmsWithoutResourceChecks = false (default) / true (super aggressive)
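To illustrate the effect of the option, here is a minimal sketch in Python. It assumes, as the option name and the "super aggressive" nickname suggest, that setting the option to true makes HA skip the resource checks and power off the affected VM in any case. The function names and the dictionary of advanced options are hypothetical and only model the behaviour; this is not an HA API.

```python
OPTION_KEY = "das.restartVmsWithoutResourceChecks"


def option_enabled(advanced_options):
    """Read the option from a dict of HA advanced options (string values)."""
    # The default is "false", as stated above.
    return advanced_options.get(OPTION_KEY, "false").strip().lower() == "true"


def apd_powers_off_vm(advanced_options, can_restart_elsewhere, network_partitioned):
    """Aggressive APD response, optionally made 'super aggressive'."""
    if option_enabled(advanced_options):
        # Assumption: resource checks are skipped, so the VM is always powered off.
        return True
    return can_restart_elsewhere or network_partitioned


# Whole-cluster storage outage on a dedicated FC SAN:
# no host can restart the VM and the network is not partitioned.
for options in ({}, {OPTION_KEY: "true"}):
    print(options, apd_powers_off_vm(options, False, False))
```

With the default setting the VM keeps running in this scenario; with the option set to true it is powered off.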
I think this advanced option will be very useful for infrastructure architects / technical designers who have a good justification for using it. Here are my typical justifications:
  • When the storage subsystem is unavailable for some time, a Linux operating system switches the file system to read-only mode, which has a negative impact on running applications. Such a situation typically leads to a server restart anyway.
  • When you have an OS/application clustering solution (for example MSCS) on top of vSphere clustering, with one application node on one vSphere cluster and another application node on a different vSphere cluster, you prefer to kill the VM (app node) on the problematic cluster (without available storage) so that it fails over to the app node (VM) running on the healthy cluster.
Hope this makes sense.

Please leave a comment if you find this advanced option useful. VMware Engineering might consider adding this option to the GUI, based on feedback from vSphere architects / technical designers.

References
  • Duncan Epping wrote a blog post about this option.
  • For other "Advanced configuration options for VMware High Availability in vSphere 5.x and 6.x" check VMware KB 2033250.

