Wednesday, October 09, 2013

Two (2) or four (4) socket servers for vSphere infrastructure?

Last week I had an interesting discussion with customer subject matter experts and VMware PSO experts about using 2-socket versus 4-socket servers for VMware vSphere infrastructure in an IaaS cloud environment. I was impressed by how difficult it is to persuade infrastructure professionals of the benefits of 4-socket servers in some cases.

Although it seems like a pretty easy question, it is actually more complex when we analyze it deeper. It is a common question from many of my customers, and because the answer is a typical "it depends", I've decided to blog about it.

Let's start with some general statements:
  • 2-socket servers are designed and used for general business workloads
  • 2-socket servers are less expensive
  • 4-socket servers are very expensive
  • 4-socket servers are designed and used for high performance and mission-critical workloads
  • failure of a single 4-socket server node in a vSphere cluster has a bigger impact on capacity
All these general statements are relative, so what is really better for a particular environment depends on the customer's requirements, the expected workload size, and the chosen hardware platform.

It is important to note that at the time of writing this post there are two types of 4-socket Intel servers on the market: servers with the Intel E7 CPU family and servers with the Intel E5-4600 family. Comparing the Intel E7-4870 (10 cores, 2.4GHz) with an Intel E5-4650 (8 cores, 2.7GHz), you'll find that the E5 server outperforms the E7 server in the following benchmarks:
  •  CAE
  •  SPECfp*_rate_base2006
  •  Numerical Weather
  •  Financial Services
  •  Life Sciences
  •  Linpack AVX
  •  SPECint*_rate_base2006
The E7 server outperforms the E5 server in the following benchmarks:
  •  Java* Middleware
  •  OLTP Database
  •  Enterprise Resource Planning
The CPU family comparison is taken from here.

Intel E7 processors are designed for mission-critical workloads, and the E5-4600 family for general workloads with high CPU performance requirements. E7 processors are therefore "very" (I would say more) expensive, while the price difference between E5-4600 (4-socket) and E5-2600 (2-socket) servers is usually less than 10 or 20 percent, although it can vary among hardware vendors.

Server consolidation is the most common use case for server virtualization. Before any server consolidation it is highly recommended to do "AS-IS" capacity monitoring and "TO-BE" capacity planning with consolidation scenarios. There are plenty of tools for such an exercise, for example VMware Capacity Planner, PlateSpin Recon, CiRBA, etc. However, if we design a green-field environment and there is no legacy environment to monitor, we have to define the expected average and maximum VM ourselves. So, let's define the average and maximum workload we are planning to virtualize in a single VM.

Let's assume our typical VM is configured as
  • 1 vCPU consuming 333 MHz CPU
  • 1 vCPU consuming 1/3 of one CPU Thread
  • 4GB RAM
and our maximal VM (aka monster VM) is configured as
  • 8 vCPU
  • 128 GB RAM
So which physical servers should we use for virtualization in such an environment? E7 CPUs are significantly more expensive, so let's compare a 2-socket server with the E5-2665 (2.4GHz) against a 4-socket server with the E5-4640 (2.4GHz). Here are our server options in detail.

4S-SERVER: A single 4-socket E5-4640 server (8 cores per socket) has 32 cores and 64 CPU threads (logical CPUs) when hyper-threading is enabled. Total CPU capacity is 76.8 GHz. From the RAM perspective it can accommodate 48 DIMMs (4 sockets x 4 channels x 3 DIMMs).

2S-SERVER: A single 2-socket E5-2665 server (8 cores per socket) has 16 cores and 32 CPU threads (logical CPUs) when hyper-threading is enabled. Total CPU capacity is 38.4 GHz. From the RAM perspective it can accommodate 24 DIMMs (2 sockets x 4 channels x 3 DIMMs).
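The figures above are simple multiplication, and a few lines of Python can sketch them (the socket, core, clock, and DIMM-slot counts come from the two server descriptions; the 16GB DIMM size is the one used in the RAM totals in this post):

```python
# Raw per-server capacity for the two options, using the socket,
# core, clock, and DIMM-slot figures from the text (16GB DIMMs assumed).
def server_capacity(sockets, cores_per_socket, ghz,
                    channels=4, dimms_per_channel=3, dimm_gb=16):
    cores = sockets * cores_per_socket
    threads = 2 * cores                      # logical CPUs with hyper-threading
    total_ghz = cores * ghz
    dimm_slots = sockets * channels * dimms_per_channel
    return {"cores": cores, "threads": threads, "ghz": total_ghz,
            "dimm_slots": dimm_slots, "ram_gb": dimm_slots * dimm_gb}

print("4S-SERVER:", server_capacity(4, 8, 2.4))  # 32 cores, 64 threads, 76.8 GHz, 48 DIMMs, 768GB
print("2S-SERVER:", server_capacity(2, 8, 2.4))  # 16 cores, 32 threads, 38.4 GHz, 24 DIMMs, 384GB
```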

So at first look, 8 x 4-socket servers have the same compute capacity and performance as 16 x 2-socket servers, right? A 4-socket server can accommodate double the number of DIMMs, so the total RAM capacity of 8 x 4-socket servers and 16 x 2-socket servers is also the same: 768GB per 4S-SERVER (384GB per 2S-SERVER) with 16GB DIMMs, or 1536GB (768GB) with 32GB DIMMs.

If we build a vSphere cluster with 8 x 4S-SERVERs or 16 x 2S-SERVERs, we have the same total raw capacity and performance, but 16 x 2S-SERVERs will beat 8 x 4S-SERVERs in real available capacity, because when a single server fails we lose just 1/16 of the capacity and performance instead of 1/8.
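The "1/16 versus 1/8" argument is easy to quantify; a minimal sketch, assuming the per-host GHz figures above and a single host failure:

```python
# Capacity remaining after one host failure: the same 614.4 GHz raw
# cluster loses a bigger slice when it is built from fewer, bigger hosts.
def remaining_after_failure(hosts, ghz_per_host, failed=1):
    raw = hosts * ghz_per_host
    remaining = (hosts - failed) * ghz_per_host
    return round(remaining, 1), round(remaining / raw, 4)

print("8 x 4S-SERVER: ", remaining_after_failure(8, 76.8))   # loses 12.5% of capacity
print("16 x 2S-SERVER:", remaining_after_failure(16, 38.4))  # loses 6.25% of capacity
```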

Is it true or not?

Yes, from memory perspective.
Yes and sometimes No, from CPU performance.

Let's concentrate on CPU performance and compare the CPU performance of the DELL 2-socket server M620 (E5-2665/2.4GHz) with the DELL 4-socket server M820 (E5-4640/2.4GHz). We all know that 1MHz on two different systems does not represent comparable performance, so the question is how to compare CPU performance. The answer is CPU normalization. Good, good ... but wait, how can we normalize CPU performance? The answer is a CPU benchmark. Good ... but which benchmark?

Below are listed different benchmark results for a single host, so based on these results we can discuss in depth which system is better for a particular environment. Please note that some benchmark results are not available or published, so I use results from similar systems. I believe that is accurate enough for our comparison.

2S-SERVER: M620 (E5-2665/2.4GHz)
  • SPECint_rate2006: 611
  • SPECfp_rate2006: 467
  • VMmark: 5.1 (calculation: 2x M620 VMmark (E5-2680) is 10.20 @ 10 tiles; 10.2 / 2)
  • SPECvirt_sc2013: 236.15 (calculation: 1x HP DL380p G8 SPECvirt_sc2013 (E5-2690) is 472.3 @ 27; 472.3 / 2)
4S-SERVER: M820 (E5-4640/2.4GHz)
  • SPECint_rate2006: 1080
  • SPECfp_rate2006: 811
  • VMmark: 10.175 (calculation: 2x HP DL560 VMmark (E5-4650) is 20.35 @ 18 tiles; 20.35 / 2)
  • SPECvirt_sc2013: 454.05 (calculation: 1x HP DL560 SPECvirt_sc2013 (E5-4650) is 908.1 @ 53; 908.1 / 2)
Note 1: DELL 4S-SERVER VMware benchmark results are not published, so I use results for HP DL560 servers.
Note 2: Some SPECvirt_sc2013 results are not available for VMware vSphere, so I use results for Red Hat KVM.

Based on the results above, I prepared a performance benchmark comparison table:

Benchmark    | 2x 2S  | 1x 4S  | 4S against 2x 2S
-------------|--------|--------|-----------------
SPECint      | 1222   | 1080   | 88.38%
SPECfp       | 934    | 811    | 86.83%
VMmark       | 10.2   | 10.175 | 99.75%
SPECvirt_sc  | 472.3  | 454.05 | 96.13%

So what does this mean? My interpretation is that 2-socket servers are better for raw mathematical operations (integer and floating point), but for more real-life workloads 4-socket servers have generally the same performance as 2-socket servers, with more cores/threads per single system.

BTW: It seems to me that CPU performance normalization based on SPECint and/or SPECfp is not fair to 4-socket servers. That's exactly what PlateSpin Recon uses for CPU normalization.

We can say that there is no per-MHz performance difference between our 2S-SERVER and 4S-SERVER. So what is the advantage of 4-socket servers based on E5-4600 CPUs? CPU performance is not only about MHz but also about CPU scheduling (aka multi-threading). The 4S-SERVER advantage is the higher number of logical CPUs, which has a positive impact on co-scheduling the vCPUs of vSMP virtual machines. Although vCPU co-scheduling has been dramatically improved since ESX 3.0, some co-scheduling is required anyway. Co-scheduling executes a set of threads or processes at the same time to achieve high performance. Because multiple cooperating threads or processes frequently synchronize with each other, not executing them concurrently would only increase the latency of synchronization. For more information about co-scheduling look at https://communities.vmware.com/docs/DOC-4960

In our example we are planning to have monster VMs with 8 vCPUs, so the 64 logical CPUs in a 4S-SERVER offer potentially more scheduling opportunities than the 32 logical CPUs in a 2S-SERVER. As far as I know, the tiles used in virtualization benchmarks (tiles are groups of VMs) usually have up to 2 vCPUs per VM, so I think the co-scheduling issue is not covered by these benchmarks.
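To illustrate why more logical CPUs help an 8-vCPU VM, here is a toy probability model (my own simplification for illustration, not how the ESXi scheduler actually works): if each logical CPU is independently busy with some probability, what is the chance that at least 8 of them are free at the same instant?

```python
from math import comb

# Toy model (illustration only, not the real ESXi co-scheduler):
# probability that at least `need` logical CPUs are simultaneously
# free when each one is busy independently with probability p_busy.
def p_enough_free(lcpus, need, p_busy):
    p_free = 1.0 - p_busy
    return sum(comb(lcpus, k) * p_free**k * p_busy**(lcpus - k)
               for k in range(need, lcpus + 1))

for lcpus in (32, 64):
    print(f"{lcpus} logical CPUs: P(>=8 free) =",
          round(p_enough_free(lcpus, 8, p_busy=0.85), 3))
```

Under this (admittedly crude) model, doubling the number of logical CPUs raises the chance of finding 8 free execution contexts considerably, which is the intuition behind the scheduling-opportunity argument.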

So the final decision depends on the expected number of monster VMs, which can affect the real performance of workloads inside these monster VMs. CPU overloading can be monitored with the ESX metric %RDY (vCPU is ready but no pCPU is available) and co-scheduling execution delays with the metric %CSTP (vCPU stopped because of co-scheduling). Recommended thresholds are discussed here, but every environment has different quality requirements, so your thresholds can be different; it depends on what SLA quality you want to offer and what type of applications you want to run on top of the virtual infrastructure.
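As an illustration, here is a tiny sketch that flags VMs whose esxtop-style counters exceed chosen thresholds. The sample readings and the 10%/3% limits are my assumptions for the example, not official recommendations; as noted above, pick thresholds that match your own SLA:

```python
# Flag VMs whose CPU contention counters exceed chosen thresholds.
# Sample readings and the 10%/3% limits are illustrative assumptions.
RDY_LIMIT = 10.0    # %RDY: vCPU ready to run but no pCPU available
CSTP_LIMIT = 3.0    # %CSTP: vCPU co-stopped because of co-scheduling

samples = {                                   # hypothetical per-VM readings
    "web01":     {"rdy": 2.1,  "cstp": 0.2},
    "monster01": {"rdy": 14.5, "cstp": 6.8},  # 8-vCPU monster VM
}

def contention_warnings(samples, rdy_limit=RDY_LIMIT, cstp_limit=CSTP_LIMIT):
    warnings = []
    for vm, m in sorted(samples.items()):
        if m["rdy"] > rdy_limit:
            warnings.append(f"{vm}: %RDY={m['rdy']} (CPU overcommitment)")
        if m["cstp"] > cstp_limit:
            warnings.append(f"{vm}: %CSTP={m['cstp']} (co-scheduling delays)")
    return warnings

print(contention_warnings(samples))
```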

Anyway, co-scheduling of monster VMs is a serious issue for IaaS cloud providers, because it is really hard to explain to customers that fewer vCPUs can paradoxically give them better CPU performance. I call this phenomenon the "VIRTUALIZATION PARADOX".

The final hardware selection or recommendation always depends on the justification of the vSphere architect, who has to carefully analyze the specific requirements and constraints of the particular environment and reasonably justify the selected option. We should also remember that there can be other requirements favoring a specific platform. An example of such an "other" requirement (sometimes a constraint) is when blade servers are to be used. In 2-socket blade servers it is usually very difficult, and sometimes even impossible, to avoid a single point of failure in the NIC/HBA/CNA adapter. 4-socket blade servers are usually full height (dual slot) and therefore the I/O cards are doubled ... but that's another topic.
