Wednesday, September 11, 2013

What type of NIC teaming, load balancing and physical switch configuration to use for VMware's VXLAN?

As a former Cisco UCS Architect I have been observing the VXLAN initiative for almost two years, so I was looking forward to a real customer project. Finally it is here. I'm working on a vSphere design for vCloud Director (vCD). To be honest, I'm responsible only for the vSphere design; someone else is doing the vCD design, because I'm not a vCD expert and have just conceptual, high-level vCD knowledge. I'm not planning to change that in the near future, because I'm more focused on next-generation infrastructure, and vCD is, in my opinion, just another piece of software for selling IaaS. I'm not saying it is not important. It is actually very important, because IaaS is not just technology but a business process. However, nobody knows everything, and I leave some work for other architects :-)

We all know that vCD sits on top of vSphere, providing multi-tenancy and other IaaS constructs, and since vCD 5.1 the network multi-tenancy segmentation is done by the VXLAN network overlay. Therefore I finally have the opportunity to plan, design and implement VXLAN for a real customer.

Right now I'm designing the network part of the vSphere architecture, and I describe the VXLAN-oriented design decision point below.

VMware VXLAN Information sources:
I would like to thank Duncan for his blog post from October 2012, right before VMworld 2012 in Barcelona, where VXLAN was officially introduced by VMware. Even though it is an unofficial information source, it is very informative, and I'm verifying it against official VMware documentation and white papers. Unfortunately, I have realized that there is a lack of trustworthy, publicly available technical information to this day, and some of the information is contradictory. See below what confusion I'm facing; I would be very happy if someone helped me jump out of the circle.

Design decision point:
What type of NIC teaming, load balancing and physical switch configuration to use for VMware's VXLAN?

Requirements:
  • R1: Fully supported solution
  • R2: vSphere 5.1 and vCloud Director 5.1
  • R3: VMware vCloud Networking and Security (aka vCNS or vShield) with the VMware distributed virtual switch
  • R4: Network virtualization and multi-tenant segmentation with the VXLAN network overlay
  • R5: Leverage standard access datacenter switches like the Cisco Nexus 5000, Force10 S4810, etc.
Constraints:
  • C1: The LACP 5-tuple hash algorithm is not available on the current standard access datacenter physical switches mentioned in requirement R5
  • C2: VMware Virtual Port ID load balancing is not supported with VXLAN (Source: S3)
  • C3: VMware LBT load balancing is not supported with VXLAN (Source: S3)
  • C4: LACP must be used with the 5-tuple hash algorithm (Source: S3, S2, S1 on page 48). [THIS IS A STRANGE CONSTRAINT. WHY IS IT HASH DEPENDENT?] Updated 2013-09-11: It looks like there is a bug in the VMware documentation and KB article. Thanks @DuncanYB and @fojta for the confirmation and internal VMware escalations.
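To make the hash question more concrete, here is a small Python sketch of my own (not VMware's implementation; the VTEP addresses and inner flows are hypothetical, and vCNS-era VXLAN used UDP destination port 8472). VXLAN encapsulates all VM-to-VM traffic between two hosts inside the same outer VTEP IP pair, and puts per-flow entropy only into the outer UDP source port, which is derived from a hash of the inner frame. So a LAG hash that looks only at IP addresses collapses everything onto one link, while a hash that includes L4 ports (the 5-tuple) can spread the flows:

```python
import zlib

VTEP_A, VTEP_B = "10.0.0.1", "10.0.0.2"   # hypothetical tunnel endpoints
VXLAN_DST_PORT = 8472                      # legacy UDP port used by vCNS-era VXLAN

def outer_src_port(inner_flow):
    """Derive the outer UDP source port from a hash of the inner flow,
    as VXLAN recommends (ephemeral range 49152-65535)."""
    h = zlib.crc32(repr(inner_flow).encode())
    return 49152 + (h % 16384)

def lag_link(hash_fields, n_links=2):
    """Pick a LAG member link from a tuple of header fields (toy hash)."""
    return zlib.crc32(repr(hash_fields).encode()) % n_links

# Eight distinct inner VM-to-VM flows between the two hosts.
inner_flows = [("192.168.1.%d" % i, "192.168.2.%d" % i, 1000 + i, 80)
               for i in range(8)]

# IP-only hash: every encapsulated flow presents the same outer IP pair,
# so all traffic lands on a single LAG member.
ip_only = {lag_link((VTEP_A, VTEP_B)) for f in inner_flows}

# 5-tuple hash: the varying outer source port lets flows spread across links.
five_tuple = {lag_link((VTEP_A, VTEP_B, outer_src_port(f), VXLAN_DST_PORT))
              for f in inner_flows}

print("links used, IP-only hash:", len(ip_only))
print("links used, 5-tuple hash:", len(five_tuple))
```

The point of the sketch: the 5-tuple hash is about load-balancing effectiveness, not correctness, which is why constraint C4 looked so strange to me.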
Available Options:
  • Option 1: Virtual Port ID
  • Option 2: Load based Teaming
  • Option 3: LACP
  • Option 4: Explicit fail-over

Option comparison:
  • Option 1: not supported because of C2
  • Option 2: not supported because of C3
  • Option 3: supported
  • Option 4: supported, but not optimal because only one NIC is used for network traffic.
Design decision and justification:
Based on the available information, options 3 and 4 comply with the requirements and constraints. Option 3 is better because network traffic is load balanced across physical NICs, which is not the case for option 4.
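For illustration, the physical-switch side of option 3 could look roughly like this on a Nexus 5000 (a sketch only; the interface numbers and port-channel ID are hypothetical, and note that the N5k load-balance keywords stop short of a full 5-tuple hash, which is exactly constraint C1):

```
! Hypothetical NX-OS sketch of the switch side of option 3 (LACP)
interface port-channel10
  description ESXi host uplinks
  switchport mode trunk

interface Ethernet1/1-2
  switchport mode trunk
  channel-group 10 mode active   ! "active" = LACP; "on" would be a static channel

! Use the richest hash the platform offers so VXLAN flows spread across links
port-channel load-balance ethernet source-dest-port
```

On the vSphere 5.1 side, LACP is enabled on the distributed switch uplink port group through the vSphere Web Client, with one LAG per host.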

Other alternatives not compliant with all requirements:
  • Alt 1: Use physical switches with 5-tuple hash load balancing. That means high-end switch models like the Nexus 7000, Force10 E-Series, etc.
  • Alt 2: Use the Cisco Nexus 1000V with VXLAN. It supports LACP with any hash algorithm; the 5-tuple hash is also recommended but not strictly required.
Conclusion:
I hope some of the information in constraints C2, C3, and C4 is wrong and will be clarified by VMware. I'll tweet this blog post to some VMware experts and hope someone will help me jump out of the decision circle.
If you have any official or unofficial information related to this topic, or you see anything where I'm wrong, please feel free to speak up in the comments.
Updated 2013-09-11: Constraint C4 doesn't exist, and the VMware documentation will be updated.
Based on the updated information, both LACP and "Explicit fail-over" teaming/load balancing are supported for VXLAN. LACP is the better way to go, and "Explicit fail-over" is an alternative in case LACP is not achievable in your environment.

2 comments:

Duncan Epping (VMware) said...

Hi David,

Could be I am misreading this post, but there are two supported ways of deploying VXLAN today. My apologies for the VMware documentation not being up to date and my blog post being slightly confusing. I have updated my post and have requested the documentation and KB to be updated.

Anyway, these are the two options you have:
1) port channel (static / LACP)
2) specified fail-over order

In the case of port channels it is recommended to use a 5-tuple hash for load balancing effectiveness. This is no hard requirement though, so if your switches do not support it, that is not a problem for VXLAN; it might just lead to a less balanced network.

Hope this helps, and again, I have updated my post and requested the Docs + KB to be updated (this will take time though).

David Pasek said...

Thanks Duncan for the absolutely clear public statement. I would also like to thank Fojta, who replied to me privately with the same information.