Wednesday, May 06, 2015

DELL Force10 VLT and vSphere Networking

DELL Force10 VLT is a multi-chassis LAG technology. I have written several blog posts about VLT, so for a VLT introduction look at http://blog.igics.com/2014/05/dell-force10-vlt-virtual-link-trunking.html. All Force10 related posts are listed here. By the way, DELL Force10 S-Series switches have been renamed to DELL S-Series switches with DNOS 9 (DNOS stands for DELL Network Operating System), however I'll keep using Force10 and FTOS in this series to keep it uniform.

In this blog post I would like to discuss a specific Force10 VLT failure scenario: what happens when the VLTi fails.

A VLT domain is actually a cluster of two VLT nodes (peers). One node is configured as primary and the other as secondary. The VLTi is the peer link between the two VLT nodes. The main role of the VLTi peer link is to synchronize MAC address to interface assignments, which are used for optimal traffic forwarding over VLT port-channels. In other words, when everything is up and running, data traffic over VLT port-channels (virtual LAGs) is optimized and the optimal link is chosen to eliminate traffic crossing the VLTi. The VLTi carries data traffic only when a VLT link fails on one node while the corresponding VLT link on the other node is still available.
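For illustration, a minimal VLT domain configuration on FTOS might look like the sketch below. The port-channel number, member interfaces and unit-id are hypothetical and the syntax may differ slightly between FTOS versions; the second peer needs a mirrored configuration with unit-id 1.

! VLTi peer link between the two VLT nodes
interface Port-channel 128
 description VLTi peer link
 channel-member fortyGigE 0/56
 channel-member fortyGigE 0/60
 no shutdown
!
! VLT domain definition (same domain id on both peers)
vlt domain 1
 peer-link port-channel 128
 unit-id 0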

Now you may ask what happens in case of a VLTi failure. In this situation the backup link kicks in and acts as a backup communication link for the VLT domain cluster. This situation is called a split-brain scenario and the exact behavior is nicely described in the VLT Reference Guide:
"The backup heartbeat messages are exchanged between the VLT peers through the backup links of the OOB Management network. When the VLTI link (port-channel) fails, the MAC/ARP entries cannot be synchronized between the VLT peers through the failed VLTI link, hence the Secondary VLT Peer shuts the VLT port-channel forcing the traffic from the ToR switches to flow only through the primary VLT peer to avoid traffic black-hole. Similarly the return traffic on layer-3 also reaches the primary VLT node. This is Split-brain scenario and when the VLTI link is restored, the secondary VLT peer waits for the pre-configured time (delay-restore) for the MAC/ARP tables to synchronize before passing the traffic. In case of both VLTi and backup link failure, both the VLT nodes take primary role and continue to pass the traffic, if the system mac is configured on both the VLT peers. However there would not be MAC/ARP synchronization."
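The knobs mentioned in the guide map to the vlt domain configuration roughly as follows (a sketch only; the backup destination IP, timer value and system MAC are placeholders):

vlt domain 1
 ! heartbeat to the peer's OOB management IP (the backup link)
 back-up destination 10.16.1.2
 ! seconds to wait after VLTi restore for MAC/ARP tables to sync
 delay-restore 90
 ! identical system MAC must be configured on both peers
 system-mac mac-address 02:01:e8:00:00:01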
With all that being said, let's look at some typical VLT topologies with a VMware ESXi host. The Force10 S4810 is an L3 switch, therefore the VLT domain can provide both switching and routing services. A single upstream router provides external connectivity. The ESXi host has two physical NIC interfaces.

First topology

The first topology uses VMware switch-independent connectivity. This is a very common and popular ESXi network connectivity option because of its simplicity for the vSphere administrator.
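On the ESXi side this is simply the default teaming policy of the standard vSwitch. For completeness, a sketch of checking and setting it with esxcli (vSwitch0 is an assumed name):

# show the current teaming policy
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
# route based on originating virtual port - the default switch-independent policy
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=portid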

The problem with this topology arises when the VLTi peer link fails (red cross in the drawing). We already know that in this scenario the backup link kicks in and the VLT links on the secondary node are intentionally disabled (black cross in the drawing). However, our ESXi host is not connected via VLT, therefore the server-facing port stays up. The VLT domain doesn't know anything about the VMware vSwitch topology, therefore it must keep the port up, which results in a black-hole scenario (black circle in the drawing) for virtual machines pinned to VMware vSwitch Uplink 2.
I hear you. You ask what the solution for this problem is. I think there are two solutions. The first, out-of-the-box solution is to use VLT all the way down to the ESXi host, which is depicted in the second topology later in this post. The second solution could be to leverage UFD (Uplink Failure Detection) and track some VLT ports together with the server-facing ports. I have not tested this scenario yet, but I think it should work (a rough sketch is below) and there is a big probability I'll have to test it soon.
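As noted, I haven't tested it, but such a UFD configuration might look roughly like the following sketch, where Port-channel 10 stands for a VLT port-channel and TenGigabitEthernet 0/10 for the ESXi-facing port (both made up):

uplink-state-group 1
 description shut server-facing ports when VLT uplinks go down
 ! upstream: the VLT port-channel the secondary node disables on VLTi failure
 upstream Port-channel 10
 ! downstream: orphan port connected to the ESXi host
 downstream TenGigabitEthernet 0/10

When UFD shuts the downstream port, the ESXi NIC loses link and the vSwitch fails the affected virtual machines over to the other uplink.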

Second topology

The second topology leverages VMware LACP. LACP connectivity is obviously more VLT friendly because the VLT is established all the way down to the server, and the downlink to the ESXi host is correctly disabled. Virtual machines are not pinned directly to VMware vSwitch uplinks; instead they are connected through the LACP virtual interface. That's the reason you will not experience a black-hole scenario for any virtual machine.
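A sketch of the switch-side configuration for such a VLT port-channel down to the ESXi host (interface and port-channel numbers are made up; the same LAG must be configured on the second VLT peer):

! LACP member port toward one ESXi host NIC
interface TenGigabitEthernet 0/10
 port-channel-protocol LACP
  port-channel 10 mode active
 no shutdown
!
! VLT port-channel - vlt-peer-lag pairs it with the same LAG on the peer
interface Port-channel 10
 description ESXi host downlink
 switchport
 vlt-peer-lag port-channel 10
 no shutdown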

Conclusion

Server virtualization is nowadays in every modern datacenter. That's the reason why virtual networking has to be taken into account in any datacenter network design. VMware switch-independent NIC teaming is simple for the vSphere administrator, but it can negatively impact network availability in some scenarios. Unfortunately, the VMware standard virtual switch doesn't support dynamic port-channel (LACP), only static port-channel. Static port-channel should work correctly with VLT, but LACP is recommended because of its keep-alive mechanism. LACP is available only with the VMware distributed virtual switch, which requires the highest VMware license (vSphere Enterprise Plus edition). A VMware distributed virtual switch with an LACP uplink is the best solution for Force10 VLT. In case of a budget or technical constraint, you have to design an alternative solution leveraging either static port-channel (VMware calls it "IP Hash load balancing") or FTOS UFD (Uplink Failure Detection) to mitigate the risk of a black-hole scenario.
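If you end up with the static port-channel alternative, the ESXi side might be configured like this (vSwitch0 is an assumption; the physical switch ports must then form a static LAG, not LACP):

# IP hash load balancing = static port-channel on the standard vSwitch
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash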

Update 2015-05-13:
I have just realized that NPAR is actually a technical constraint preventing the use of port-channel technology on the ESXi host virtual switch. NPAR technology allows switch-independent network partitioning of physical NIC ports into multiple logical NICs. However, a port-channel cannot be configured on NPAR-enabled NICs, therefore UFD is probably the only solution to avoid the black-hole scenario when the VLT peer link fails.

1 comment:

Unknown said...

In topology 1, to avoid the black-hole scenario we could implement a UFD solution tracking the VLTi links as upstream and the interfaces connected to the ESXi host as downstream. It will solve the black-hole scenario, but in case of any switch failure, UFD will bring down the interfaces connected to the ESXi host on the only remaining active switch as well, which will result in the whole environment being down.