Thursday, September 12, 2019

Datacenter Fabric for HCI in scale

I'm currently designing a brand new data center based on VMware HCI for one of my customers. Conceptually, we are planning to have two sites in the metro distance (~10 km) for disaster avoidance and cross-site high availability. For me, a cross-site high availability (stretched metro clusters) is not a disaster recovery solution, so we will have the third location (200km+ far from the primary location) for DR purposes. We are also considering remote clusters in branch offices to provide services outside the datacenters. The overall concept is depicted in the drawing below.
Datacenter conceptual design
The capacity requirements lead to a decent number of servers, therefore the datacenter size will be somewhere around 10+ racks per site. The rack design is pretty simple. Top of rack switches and standard rack servers connected to TOR switches optimally by 4x 25Gb link.
Server rack design
However, we will have 10+ racks per site so there is the question of how we will connect racks together? Well, we have two options. Traditionally (Access / Aggregation / Core) or Leaf/Spine. In our case, both options are valid and pretty similar because if we would use Traditional topology, we would go with collapsed Access / Core topology anyway, however, the are two key differences between Access/Core network and Leaf/Spine fabric

  1. Leaf/Spine is always L3 internally keeping L2/L3 boundary at TORs. This is good because it splits the L2 fault domain and mitigates risks of L2 networking with STP issues, broadcast storms, unknown unicast flooding, etc.
  2. Leaf/Spine supports additional fabric services like automation, L2 over L3 to have L2 across the racks (PODs), Life Cycle Management (fabric level firmware upgrades, rollbacks, etc.), Single Point of Management/Visibility, etc.

We are designing the brand new data center which has to be in production for the next 15+ years and we will need very good connectivity within the data center supporting huge east-west traffic, therefore leaf-spine fabric topology based on 100Gb/25Gb ethernet makes a perfect sense. The concept of data center fabric in leaf-spine topology is depicted in the figure below.
Datacenter fabric
Ok. So, conceptually and logically we know what we want but how to design it physically and what products to choose?

I've just starting to work for VMware as HCI Specialist supporting Dell within our Dell Synergy Acceleration Team, so there is no doubt VxRAIL makes a perfect sense here and it perfectly fits into the concept. However, we need a data center fabric to connect all racks within each site and also interconnect two sites together.

I have found that Dell EMC has SmartFabric Services. You can watch the high-level introduction at
https://www.dellemc.com/en-us/video-collateral/dellemc-smartfabric-services-for-vxrail.htm

SmartFabric Services seems very tempting.  To be honest, I do not have any experience with Dell EMC SmartFabric so far, however, my regular readers know that I was testing, designing and implementing Dell Networking a few years ago. At that time I was blogging about Force10 Networking (FTOS 9) technology and published a series of blog posts available at https://www.vcdx200.com/p/series.html 

However, DellEMC SmartFabric Services are based on a newer switch operating system (OS 10) which I do not have any experience yet, therefore, I did some research and I found a very interesting blog posts about Dell EMC SmartFabric published by Mike Orth aka @HCIdiver and Hasan Mansur. Here are links to blog posts:


So the next step is to get more familiar with DellEMC SmartFabric Services because it can significantly simplify data center operations and split the duties between the datacenter full-stack engineers (compute/storage/network) and traditional network team.

I personally believe, that datacenter full-stack engineer should be responsible not only for compute and storage (HCI) but also for data center fabric. And the traditional networking team takes the responsibility at network rack where the fabric is connected to the external network. You can treat datacenter fabric as a new generation of SAN which is nowadays operated by storage guys anyway, right?

Hope this makes sense and if there is anybody with Dell EMC SmartFabric experience or with a similar design, please feel free to leave the comment below or contact me on the twitter @david_pasek.

No comments: