VCDX #200 Blog of one VMware Infrastructure Designer: November 2020

Sunday, November 29, 2020

Virtual Machine Advanced Configuration Options

First and foremost, it is worth mentioning, that it is definitely not recommended to change any advanced settings unless you know what you are doing and you are fully aware of all potential impacts. VMware default settings are the best for general use covering the majority of use cases, however, when you have some specific requirements you might need to do the VM tuning and change some advanced virtual machine configuration options. In this blog post, I'm trying to document advanced configuration options I've found useful in some specific design decisions.

Time synchronization

time.synchronize.tools.startup

Description:
Type: Boolean
Values:

true / 1 (default)
false / 0

time.synchronize.restore

Description:
Type: Boolean
Values:

true / 1 (default)
false / 0

time.synchronize.shrink

Description:
Type: Boolean
Values:

true / 1 (default)
false / 0

time.synchronize.continue

Description:
Type: Boolean
Values:

true / 1 (default)
false / 0

time.synchronize.resume.disk

Description:
Type: Boolean
Values:

true / 1 (default)
false / 0

Relevant resources:

Ethernet

ethernetX.rxDataRingEnabled

Description: When set to true, Receive Data Ring is enabled
Type: Boolean
Values: true/false, 1/0
Relevant resources:

VMXNET3 RX Ring Buffer Exhaustion and Packet Loss

ethernetX.pnicFeatures

Description: Enable specific pNIC features for particular VM vNIC
Type: Int
Values:

4 - Enable multi-queue (NetQueue RSS) in particular VM vNIC

ethernetX.ctxPerDev

Description: Enable transmit traffic multi-threading for VM or vNIC
Type: Int
Values:

1 - one CPU thread per vNIC
2 (default) - one transmit CPU thread per VM
3 - multiple (2 to 8) TX threads for particular VM vNIC

Relevant resources:

ethernetX.udpRSS

Description: vSphere ESXi 6.7 supports Receive Side Scaling (RSS) for UDP which provides a significant improvement in throughput. Use this feature for workloads that are sensitive to network latency and bandwidth.
Type: Boolean
Values:

true / 1
false / 0 (default)

Relevant resources:

Enhance NSX Edge Performance after ESXi Host Upgrade

Isolation

With the isolation option, you can restrict file operations between the virtual machine and the host system, and between the virtual machine and other virtual machines.

VMware virtual machines can work both in a vSphere environment and on hosted virtualization platforms such as VMware Workstation and VMware Fusion. Certain virtual machine parameters do not need to be enabled when you run a virtual machine in a vSphere environment. Disable these parameters to reduce the potential for vulnerabilities.

Following advanced settings are booleans (true/false) with default value false. You can disable it by changing the value to true.

isolation.tools.unity.push.update.disable
isolation.tools.ghi.launchmenu.change
isolation.tools.memSchedFakeSampleStats.disable
isolation.tools.getCreds.disable
isolation.tools.ghi.autologon.disable
isolation.bios.bbs.disable
isolation.tools.hgfsServerSet.disable
isolation.tools.vmxDnDVersionGet.disable
isolation.tools.diskShrink.disable
isolation.tools.memSchedFakeSampleStats.disable
isolation.tools.guestDnDVersionSet.disable
isolation.tools.unityActive.disable
isolation.tools.diskWiper.disable

Relevant resources:

Snapshots

snapshot.maxSnapshots

Description: Limit the number of snapshots of Virtual Machine
Type: Int
Value: max number of snapshots
Relevant resources:

Remote Display

RemoteDisplay.maxConnections

Description: Limit the number of simultaneous connections to Virtual Machine
Type: Int
Value: max number of snapshots
Relevant resources:

Opening a console to a virtual machine fails with the error: Unable to connect to the MKS: Console access to the virtual machine cannot be granted since the connection limit of 0 has been reached (2015407)

Tuesday, November 24, 2020

vSAN 7 Update 1 - What's new in Cloud Native Storage

vSAN 7 U1 comes with new features also in Cloud Native Storage area, so let's look at what's new.

PersistentVolumeClaim expansion

Kubernetes v1.11 offered volume expansion by editing the PersistentVolumeClaim object. Please note, that volume shrink is not supported and extension must be done offline. Online expansion is not supported in U1 but planned on the roadmap.

Static Provisioning in Supervisor Cluster

This feature allows exposing an existing storage volume within a K8s cluster integrated within vSphere Hypervisor Cluster (aka Supervisor Cluster, vSphere with K8s, Project Pacific).

vVols Support for vSphere K8s and TKG Service

Supporting external storage deployments on vK8s and TKG using vVols.

Data Protection for Modern Applications

vSphere 7.0 U1 comes with support Dell PowerProtect and Velero backup for Pacific Supervisor and TKG clusters. Velero only option to initiate snapshots from supervisor Velero plugin and store on S3.

vSAN Direct

vSAN Direct is the feature introducing Directly Attach Storage (typically physical HDD) for object storage solutions running on top of vSphere.

There will not be a shared vSAN Datastore like typical vSAN has but vSAN Direct Datastores are allowing connect physical disks directly to virtual appliances or containers on top of vSphere/vSAN Cluster providing Object Storage services and bypassing traditional vSAN datapath.

Hope you find it useful.

Monday, November 23, 2020

Why HTTPS is faster than HTTP?

Recently, I was planning, preparing, and executing a network performance test plan, including TCP, UDP, HTTP, and HTTPS throughput benchmarks. The intention of the test plan was the network throughput comparison between two particular NICs

Intel X710
QLogic FastLinQ QL41xxx

There was a reason for such exercise (reproduction of specific NIC driver behavior) and I will probably write another blog post about it, but today I would like to raise another topic. During the analysis of testing results, I've observed very interesting HTTPS throughput results in comparison to HTTP throughput. These results were observed on both types of NICs, therefore, it should not be a benefit of specific NIC hardware or driver.

Here is the Test Lab Environment:

2x ESXi hosts

Server Platform: HPE ProLiant DL560 Gen10
CPU: Intel Cascade Lake based Xeon
BIOS: U34 | Date (ISO-8601): 2020-04-08
NIC1: Intel X710, driver i40en version: 1.9.5, firmware 10.51.5
NIC2: QLogic QL41xxx, driver qedentv version: 3.11.16.0, firmware mfw 8.52.9.0 storm
OS/Hypervisor: VMware ESXi 6.7.0 build-16075168 (6.7 U3)

1x Physical Switch

10Gb switch ports << network bottleneck by purpose, because customer is using 10Gb switch ports as well

Below are the observed interesting HTTP and HTTPS results.

HTTP

HTTPS

OBSERVATION, EXPLANATION, AND CONCLUSION

We have observed

HTTP throughput between 5 and 6 Gbps
HTTPS throughput between 8 and 9 Gbps

which means 50% higher throughput of HTTPS over HTTP. Normally, we would be expecting HTTP transfer faster than HTTPS as HTTPS requires encryption, which should end-up with some CPU overhead. Encryption overhead is questionable, but nobody would expect HTTPS significantly faster than HTTP, right? That's the reason I was asking myself,

why HTTPS overachieved HTTP results on HPE Lab with the latest Intel CPUs?

Here is my process of the "issue" troubleshooting or better to say, root cause analysis.

When I execute the same test in my home lab (old Intel CPUs model Intel Xeon E5-2620) with both VMs running in the single ESXi host (everything running in software without hardware physics) I can achieve 22 Gbps for both protocols (HTTP and HTTPS). This means, that software can generate enough traffic to saturate 10 Gb NICs. However, the results are almost identical without a big difference between HTTP and HTTPS. It is worth to say, CPU in my home lab does not support Intel QuickAssist.
I did further research and found the following documents

Asynchronous Advantages for Web Servers (high-level explanation)
Programming Intel® QuickAssist Technology Hardware Accelerators for Optimal Performance (low-level explanation)
Data plane acceleration technologies: realizing the potential of network virtualization (high-level explanation)

Conclusion

In my home lab, I have old Intel CPUs models (Intel Xeon CPU E5-2620 0 @ 2.00GHz), that's the reason HTTP and HTTPS throughputs are identical.
In the HPE test lab, there are the latest Intel CPU models, therefore, HTTPS can be offloaded and client/server communication can leverage asynchronous advantages for web servers using Intel® QuickAssist Technology introduced in the Intel Xeon E5-2600 v3 product family.
It is worth to mention, that it is not only about CPU hardware acceleration, but also about software code which must be written in the form, hardware acceleration can leverage for a positive impact on performance. This is the case of OpenSSL 1.1.0, and NGINX 1.10 to boost HTTPS server efficiency.

Lesson learned

When you are virtualizing network functions, it is worth considering the latest CPUs, as it can have a significant impact on overall system performance and throughput. Does not matter, if such network function virtualization is done by VMware NSX or other virtualization or containerization platforms.

Investigation continues

To be honest, I do not know if I really fully understand the root cause of such behavior. I still wonder why HTTPS is 50% faster than HTTP, and if CPU offloading is the only factor for such performance gain.

I'll try to run the test plan on other hardware platforms, compare results, and do some further research to understand much deeper. Unfortunately, I do not have direct access to the latest x86 servers of other vendors, so it can take a while. If you have access to some modern x86 hardware and want to run my test plan by yourself, you can download the test plan document from here. If you will invest some time into the testing, please share your results in the comments below this article or simply send me an e-mail.

Hope this blog post is informative, and as always, any comment or idea is very welcome.

Saturday, November 21, 2020

Understanding vSAN Architecture Components for better troubleshooting

VMware vSAN becomes more and more popular, thus more often used as primary storage in data centers and server rooms. Sometimes, as with any IT technology, is necessary to do the troubleshooting. Understanding of architecture and components interactions is essential for effective troubleshooting of vSAN. Over years, I have collected some vSAN architectural information into a slide deck I made available at https://www.slideshare.net/davidpasek/vsan-architecture-components

In the slide deck are the slides with the following sections ...

vSAN Terminology

CMMDS - Cluster Monitoring, Membership, and Directory Service
CLOMD - Cluster Level Object Manager Daemon
OSFSD - Object Storage File System Daemon
CLOM - Cluster Level Object Manager
OSFS - Object Storage File System
RDT - Reliable Datagram Transport
VSANVP - Virtual SAN Vendor Provider
SPBM - Storage Policy-Based Management
UUID - Universally unique identifier
SSD - Solid-State Drive
MD - Magnetic disk
VSA - Virtual Storage Appliance
RVC - Ruby vSphere Console

Architecture components

CMMDS

Cluster Monitoring, Membership, and Directory Service

CLOM

Cluster Level Object Manager Daemon

Distributed Object Manager
Each object in a vSAN cluster has a DOM owner and a DOM client

LSOM

Local Log Structured Object Manager
LSOM works with local disks

Reliable Datagram Transport

Components interaction

Architecture & I/O Flow

Troubleshooting tools

vsan.observer
vsan.disks_info
vsan.disks_stats
vsan.disk_object_info
vsan.cmmds_find

ESXCLI

esxcli vsan debug disk list

Objects tools

/usr/lib/vmware/osfs/bin/objtool

How to use vSAN Observer

SSH somewhere where you have RVC. It can be for example VCSA or HCIbench

ssh root@[IP-ADDRESS-OF-VCSA]

Run RVC command-line interface and connect to your vCenter where you have vSphere cluster with vSAN service enabled. RVC requires the password of the administrator in your vSphere domain.

rvc administrator@[IP-ADDRESS-OF-VCSA]

Start vSAN Observer on your vSphere cluster with vSAN service enabled

vsan.observer -r /localhost/[vDatacenter]/computers/[vSphere & vSAN Cluster]

Go to vSAN Observer web interface

vSAN Observer is available at https://[IP-ADDRESS-OF-VCSA]:8010

Slide deck includes little more info so download it from https://www.slideshare.net/davidpasek/vsan-architecture-components

If you have to troubleshoot vSAN, I highly recommend to follow the process documented at "Troubleshooting vSAN Performance".

Hope it helps the broader VMware community.

If you know some other detail or troubleshooting tool, please leave a comment below this post.

Thursday, November 05, 2020

NSX-T Edge Node performance profiles

It is good to know that NSX-T Edge Node has multiple performance profiles. Those profiles will change the # of vCPU for DPDK and so leave more or less vCPU for other services such as LB:

default (best for L2/L3 traffic)
LB TCP (best for L4 traffic)
LB HTTP (best for HTTP traffic)
LB HTTPS (best for HTTPS traffic)

Now you can ask how to choose Load Balancer Performance profile. SSH to the edge node and use CLI.

 nsx-edgebm3> set load-balancer perf-profile  
  http   Performance profile type argument  
  https   Performance profile type argument  
  l4    Performance profile type argument  
 Note: You may be prompted to restart the dataplane or reboot the Edge Node if there are changes in the profile in # of cores used by LB.  
 To go back to default profile:  
 nsx-edgebm3> clear load-balancer perf-profile

VCDX #200
Blog of one VMware Infrastructure Designer

Pages

Sunday, November 29, 2020

Virtual Machine Advanced Configuration Options

Tuesday, November 24, 2020

vSAN 7 Update 1 - What's new in Cloud Native Storage