Wednesday, January 28, 2015

The fastest vMotion over Force10 MXL?

Here is a question I got yesterday ...
My customer has two M1000e chassis in a single rack with MXL blade switches in fabrics A and B. MXL fabric B is connected to a 10G EqualLogic SAN. The goal is to allow vMotion to occur very fast between the two chassis using fabric A, without going through the top-of-rack 10G switch. The question is: what interconnect between the A fabrics in both chassis is best?
Is VLT or stacking preferred?
Is it best to vertically stack chassis 1 MXL A1 to chassis 2 MXL A1 and then LACP to TOR S4810?
Or is it better to horizontally stack chassis 1 MXL A1 to chassis 1 MXL A2 and then LACP to TOR S4810?
Let's do a consultative design exercise.

Requirements
  • R1 - vMotion between the two chassis must be as fast as possible
  • R2 - vMotion traffic stays within fabric A without going to the TOR switch
  • R3 - fabric B is used only for iSCSI
Constraints
  • C1 - 2x Dell M1000e blade chassis
  • C2 - Each blade chassis has Force10 MXL blade switch IO modules in fabric slots A1 and A2 for Ethernet/IP traffic
  • C3 - Each blade chassis has Force10 MXL blade switch IO modules in fabric slots B1 and B2 for iSCSI traffic
  • C4 - VMware vSphere ESXi hypervisor on each blade server
  • C5 - maximum of 8 concurrent vMotions per ESXi host
Assumptions
  • A1 - Blade servers have 2x 10Gb NIC ports connected to fabric A (A1, A2)
  • A2 - Blade servers have 2x 10Gb NIC ports connected to fabric B (B1, B2)
  • A3 - 16x half-height blade servers are used in each blade chassis
  • A4 - Each ESXi server has NIC teaming with dual homing to the A1 and A2 IO modules
Design decisions and justification
  • MXL switches in fabric A1 of both chassis are stacked vertically (and likewise in fabric A2) via a 160Gb interconnect for east/west traffic, giving a 1:1 fan-in/out ratio.
  • Vertical stacking allows single management of the two switches in fabric A1 while still allowing non-disruptive firmware upgrades, because fabric A2 is an independent fault zone and NIC teaming will handle automated failover. That is the reason horizontal stacking is not used.
  • Northbound connectivity for north/south traffic is done via a VLT port-channel giving 80Gb of total upstream bandwidth per MXL, which is a 2:1 fan-in/out ratio. Both ratios are worked out in the sketch after this list.
  • Top-of-rack switches (2x Force10 S4810) are formed into a single VLT domain (aka virtual chassis) to get a loop-free topology and utilize the full upstream bandwidth.
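To make the fan-in/out numbers concrete, here is a minimal Python sketch of the underlying arithmetic. It is just back-of-the-envelope math over the figures assumed above (16 half-height blades with one 10Gb port toward each fabric A MXL, a 160Gb stack interconnect, and an 80Gb VLT port-channel per MXL), not anything that runs in the design itself.

from math import gcd

# Figures taken from the assumptions and design decisions above.
BLADES_PER_CHASSIS = 16
NIC_PORT_SPEED_GB = 10       # one 10Gb port per blade toward each fabric A MXL
STACK_BANDWIDTH_GB = 160     # vertical stack interconnect between the two MXLs
UPLINK_BANDWIDTH_GB = 80     # VLT port-channel toward the S4810 pair, per MXL

server_facing_gb = BLADES_PER_CHASSIS * NIC_PORT_SPEED_GB   # 160 Gb per MXL

def fan_ratio(downstream_gb: int, upstream_gb: int) -> str:
    """Express a fan-in/out ratio in simplified N:M form."""
    g = gcd(downstream_gb, upstream_gb)
    return f"{downstream_gb // g}:{upstream_gb // g}"

print("East/west (stack) ratio:", fan_ratio(server_facing_gb, STACK_BANDWIDTH_GB))   # 1:1
print("North/south (VLT) ratio:", fan_ratio(server_facing_gb, UPLINK_BANDWIDTH_GB))  # 2:1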
Design impact
  • vMotion VMkernel interfaces have to be configured on the same physical NIC (vmnic) on each ESXi host. This keeps vMotion traffic inside fabric A without unwanted TOR switch traffic (see the configuration sketch after this list).
  • Multi-NIC vMotion cannot be used; otherwise vMotion traffic between A1 and A2 could go through the TOR switch, which is against requirement R2.
  • LACP/EtherChannel teaming cannot be used because the upstream A1 and A2 MXL switches are not in the same stack or VLT domain. Therefore IP-hash-based load balancing cannot be used, traffic from a single VM will always go over a single physical NIC, and a single VM will not be able to use more than 10Gb.
  • VM traffic for particular port groups (L2 segments) should be configured as active/standby consistently across servers, so that east-west traffic within a given VLAN stays optimal in the non-degraded state and VM traffic does not flow across the TOR switch.
  • L3 traffic will be routed over the TOR switches.
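To illustrate the first and fourth impacts above, here is a minimal pyVmomi sketch that pins an existing vMotion port group to the same active/standby vmnic order on every ESXi host. It is only a sketch under assumptions: the vCenter address, credentials, port group name, vSwitch name, VLAN ID and vmnic names are placeholders, not values from this design.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: adapt vCenter address, credentials, names, VLAN ID and vmnics.
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="***",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.HostSystem], True)

for host in view.view:
    net_sys = host.configManager.networkSystem

    # Pin the existing "vMotion" port group to vmnic0 (active) / vmnic1 (standby)
    # so vMotion stays on the same fabric A IO module on every host.
    nic_order = vim.host.NetworkPolicy.NicOrderPolicy(activeNic=["vmnic0"],
                                                      standbyNic=["vmnic1"])
    teaming = vim.host.NetworkPolicy.NicTeamingPolicy(nicOrder=nic_order)
    policy = vim.host.NetworkPolicy(nicTeaming=teaming)

    # UpdatePortGroup replaces the spec of an existing port group, so the
    # VLAN ID and vSwitch name must be supplied again here.
    spec = vim.host.PortGroup.Specification(name="vMotion",
                                            vlanId=23,            # placeholder
                                            vswitchName="vSwitch0",
                                            policy=policy)
    net_sys.UpdatePortGroup(pgName="vMotion", portgrp=spec)
    print("Updated vMotion teaming on", host.name)

Disconnect(si)

The same pinning can of course be done per host with esxcli or in the vSphere client; the point is only that the active/standby order has to be identical on every host, and that VM port groups get their own consistent active/standby order per the fourth bullet above.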
Alternative
  • Leveraging vSphere Multi-NIC vMotion could improve vMotion performance (R1), but would be against R2 because vMotion traffic would flow over the TOR switch.
Design decision qualities
  • Availability: Great
  • Performance: Very good for vMotion, Good for VM Traffic 
  • Manageability: Good - just two logical switches to manage
  • Scalability: limited - a stack supports a maximum of 6 members
Logical design drawing


2 comments:

Anonymous said...

I would go for connecting all the Ax MXL switches into the same stack. This would probably remove all the design impacts you mentioned and leave you with one issue, which is the firmware upgrade.

Firmware upgrade is not a daily or weekly task, and when required it can be done during non-working hours.

David Pasek said...

Yes. It is definitely an alternative if a maintenance window is available.