Saturday, June 20, 2015

VMware HA Error During VLT Failure

I have received the following message in my mailbox ...
Hi.
I have a customer that has been testing Force10 VLT with peer routing and VMware and has encountered the following warning message on all hosts during failover of the switches (S4810s), but only when the primary VLT node fails:
“vSphere HA Agent on this host could not reach isolation address 10.100.0.1”
Does this impact HA at all?  Is there a solution?
Thanks
Paul 

Force10 is the legacy product name of Dell S-Series datacenter networking. Force10 S4810s are datacenter L3 switches. If you don't know what Force10 VLT is, look here. Generally it is comparable to Cisco virtual Port Channel (vPC), Juniper MC-LAG, Arista MLAG, etc.

I think my answer can be valuable for the broader networking and virtualization community, so here it is ...

First of all, let's make some assumptions:
  • Force10 VLT is used for multi-chassis LAG capability
  • Force10 VLT peer routing is enabled in the VLT domain to achieve L3 routing redundancy
  • 10.100.0.1 is the IP address of the VLAN interface on the Force10 S4810 (primary VLT node), and this particular VLAN is used for vSphere management.
  • 10.100.0.2 is the IP address of the same VLAN interface on the Force10 S4810 secondary VLT node.
  • vSphere 5.x or later is used.

Root cause with explanation:
When the primary Force10 VLT node is down, ping to 10.100.0.1 does not work because peer routing is essentially an ARP proxy operating at L2. The secondary node will forward traffic on behalf of the primary node's MAC address, but the primary's IP address 10.100.0.1 no longer answers at L3, therefore ICMP to that address fails.
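If you want to verify this behavior yourself, you can probe both VLT node addresses from any machine on the management VLAN while the failover test is running. The sketch below is only an illustration: it uses the addresses assumed above and Linux ping syntax, nothing specific to HA.

#!/usr/bin/env python
# Illustrative sketch: probe both VLT node SVI addresses during the failover test.
# 10.100.0.1 / 10.100.0.2 are the addresses assumed earlier in this post.
import subprocess

VLT_NODE_ADDRESSES = ["10.100.0.1", "10.100.0.2"]

def is_reachable(address, count=3, timeout_sec=2):
    """Return True when the address answers ICMP echo (Linux ping syntax)."""
    return subprocess.call(
        ["ping", "-c", str(count), "-W", str(timeout_sec), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ) == 0

for address in VLT_NODE_ADDRESSES:
    state = "answers" if is_reachable(address) else "does NOT answer"
    print("{0} {1} ICMP".format(address, state))

During a primary VLT node failure you should see 10.100.0.1 not answering while 10.100.0.2 keeps answering, which matches the HA warning above.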

VMware HA clusters (vSphere 5 and above) use both a network and a storage heartbeat mechanism. The network mechanism uses the two probes listed below (see the sketch after the list).
  1. ESXi hosts in the cluster send network heartbeats to each other. This keeps working during a primary VLT node failure.
  2. ESXi hosts also ping the HA isolation addresses (the default HA isolation address is the default gateway, therefore 10.100.0.1 in your particular case). This does not work during a primary VLT node failure.
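To make it obvious why the failed ping alone is harmless, here is a deliberately simplified sketch of the isolation decision logic. This is my own illustration, not VMware's actual code: a host declares itself isolated only when it receives no heartbeats from the other hosts AND cannot reach any isolation address.

# Simplified illustration of the HA isolation decision (not VMware's actual code).
def host_declares_isolation(heartbeats_from_peers_received, reachable_isolation_addresses):
    """A host considers itself isolated only when BOTH probes fail at the same time."""
    no_peer_heartbeats = not heartbeats_from_peers_received
    no_isolation_address_reachable = len(reachable_isolation_addresses) == 0
    return no_peer_heartbeats and no_isolation_address_reachable

# Primary VLT node failed: 10.100.0.1 (default gateway) does not answer,
# but host-to-host heartbeats still flow through the secondary VLT node.
print(host_declares_isolation(True, []))    # False -> the host is NOT declared isolated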

That’s the reason the VMware HA cluster logs the warning above.

Is there any impact?
There is no impact on the HA cluster because:
  • It is just an informative message; probe (1) still works correctly, so there is still network visibility among the ESXi hosts in the cluster.
  • From vSphere 5 onward there is also the storage (datastore) heartbeat mechanism, which can compensate for lost network visibility among the ESXi hosts in the cluster.

Are there any potential improvements?
Yes, there are. You can configure multiple HA isolation addresses to mitigate default gateway unavailability. In your particular case I would recommend using two IP addresses (10.100.0.1 and 10.100.0.2), because at least one VLT node will always be available.


For more information on how to configure multiple HA isolation addresses, look at http://kb.vmware.com/kb/1002117
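If you prefer to script the change instead of clicking through the vSphere Client, here is a rough pyVmomi sketch. The vCenter address, credentials and cluster name are placeholders; the das.isolationaddress0/1 advanced options are the ones described in the KB article above.

#!/usr/bin/env python
# Rough pyVmomi sketch: set two HA isolation addresses on a cluster.
# vCenter address, credentials and cluster name are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="***")  # certificate handling omitted for brevity
content = si.RetrieveContent()

# Find the HA cluster by name (placeholder name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Management-Cluster")

# das.isolationaddress0/1 are the advanced options documented in KB 1002117.
spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.option = [
    vim.option.OptionValue(key="das.isolationaddress0", value="10.100.0.1"),
    vim.option.OptionValue(key="das.isolationaddress1", value="10.100.0.2"),
]
cluster.ReconfigureComputeResource_Task(spec, modify=True)

Disconnect(si)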
