Saturday, September 01, 2018

New with vSphere 6.7 U1 - Enhanced Load Balancing Path Selection Policy

With the release of vSphere 6.7 U1, there are now sub-policy options for VMW_PSP_RR to enable active monitoring of the paths. The policy considers path latency and pending IOs on each active path. This is accomplished with an algorithm that monitors active paths and calculates average latency per path based on either time and/or the number of IOs. When the module is loaded, the latency logic will get triggered and the first 16 IOs per path are used to calculate the latency. The remaining IOs will then be directed based on the results of the algorithm’s calculations to use the path with the least latency. When using the latency mechanism, the Round Robin policy can dynamically select the optimal path and achieve better load balancing results.

The user must enable the configuration option to use latency based sub-policy for VMW_PSP_RR:
esxcfg-advcfg -s 1 /Misc/EnablePSPLatencyPolicy
Update: this setting is the default from 6.7 U1

To switch to latency based sub-policy, use the following command:
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency
If you want to change the default evaluation time or the number of sampling IOs to evaluate latency,
use the following commands.

For Latency evaluation time (default is 15000 = 15 sec):
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency --latency-eval-time=18000
For the number of sampling IOs:
esxcli storage nmp psp roundrobin deviceconfig set -d -- type=latency --num-sampling-cycles=32
To check the device configuration and sub-policy:
esxcli storage nmp device list -d
The diagram below shows how sampling IOs are monitored on paths P1, P2, and P3 and eventually selected. The time “t” sampling window starts. In the sampling window, IOs are issued on each path in Round Robin fashion and their round-trip time is monitored. Path P1 took 10ms to complete in total for 16 sampling IOs. Similarly, path P2 took 20ms for the same number of sampling IOs and path P3 took 30ms. As path P1 has the lowest latency, path P1 will be selected more often for IOs. Then the sampling window again starts at ‘T’. Both “m” and “T” are tunable parameters but we would suggest to not change these parameters as they are set to a default value based on the experiments ran internally while implementing it.

The diagram, how sampling IOs are monitored and selected.
Legend: 
T = Interval after sampling should start again 
m = Sampling IOs per path

t1 < t2 < t3 ---------------> 10ms < 20ms < 30ms 
t1/m < t2/m < t3/m -----> 10/16 < 20/16 < 30/16

With the testing, VMware found that with the new latency monitoring policy, even with latency introduced up to 100ms on half the paths, the PSP sub-policy maintained almost full throughput.
Setting the values for the round robin sub-policy can be accomplished via CLI or using host-profiles. Cody Hosterman describes here how to configure the Latency Policy as a SATP Rule via ESXCLI and PowerCLI.

Source: StorageHub : vSphere 6.7 Core Storage

References & External Links
Cody Hosterman: LATENCY ROUND ROBIN PSP IN ESXI 6.7 UPDATE 1

VMworld US 2018 - VIN2416BU - Core Storage Best Practices

As in previous years, William Lam (www.virtuallyghetto.com) has published URLs to VMworld US 2018 Breakout Sessions. William wrote the blog post about it and created GitHub repo vmworld2018-session-urls available at http://vmwa.re/vmworld2018. Direct link to US sessions is here https://github.com/lamw/vmworld2018-session-urls/blob/master/vmworld-us-playback-urls.md

I'm going to watch sessions from areas of my interest and write my thoughts and interesting findings in my blog. So stay tuned and come back to read future posts, if interested. Let's start with VMworld 2018 session VIN2416BU - Core Storage Best Practices Speakers: Jason Massae, Cody Hosterman.  This technical session is about vSphere core storage topics. In the beginning, Jason shared with the audience GSS top storage issues and customers challenges. These are PSA, iSCSI, VMFS, NFS, VVols, Trim/Unmap, Queueing, Troubleshooting. General recommendation by Jason and Cody is to validate any vSphere storage change and adjustment with particular storage vendor and VMware. Customers should change advanced settings only when recommended by storage vendor or VMware GSS.

SATP and PSP
Cody follows with the basic explanation of how SATP and PSP works. Then he explains why some storage vendors recommend adjusting the default Round Robin I/Os quantity per single path until switching to the next path. The reason is not the performance but faster failover in case of some storage paths issues. If you want to know how to change such setting, read VMware KB 2069356.

iSCSI
Jason continues with the iSCSI topic which is, based on VMware GSS, the #1 problem. The first recommendation is to not expose LUNs used for the virtual environment to some other external systems or functions. The only exception might be RDM LUNs used for OS level clustering, but this is the special use case. Another topic is the teaming and port binding. Some kind of the teaming is highly recommended. The iSCSI port binding is preferred. The port binding will give you load balancing and fail-over not only on link failure but also on SCSI sense code. This will help you in situations when the network path is OK but the target LUN is not available for whatever reasons. I wrote the blog post about this topic here. It is about the advanced option enable_action_OnRetryErrors and as far as I know, it was not enabled by default in vSphere 6.5 but probably is in 6.7. I did not test it so it should be validated before implemented into production. Jason explained, that in vSphere 6.0 and below, port binding did not allow you to do network L3 routing, therefore NIC teaming was the only way to go in environments where initiators and targets (iSCSI portals) were in different IP subnets. However, since vSphere 6.5, iSCSI can leverage dedicated TCP/IP stack, therefore default gateway can be specified for VMkernel port used for iSCSI. After, Jason explains the difference among software iSCSI adapter, dependent iSCSI adapters (iSCSI offloaded to the hardware), iSER (iSCSI over RDMA).

The mic is handed over to Cody, and Cody is sharing with the audience the information about the new Round Robin enhanced policy, available in vSphere 6.7 U1 which considers I/O latency for path selection. For more info read my blog post here.

VMFS
Cody moves to VMFS topic. VMFS 6 has a lot of enhancements. You cannot upgrade VMFS 5 to VMFS 6, therefore you have to create new datastore, format it to VMFS 6 and use storage vMotion to migrate VMs from VMFS 5. Cody is trying to answer typical customers questions. How big datastores I should do? How many VMs I should accommodate in a single datastore? Do not expect the exact answer. The answer is, it depends on your storage architecture, performance, and capabilities. Everything begins with the question if your array supports VAAI (VMware API for Array Integration) but there are other questions you have to answer your self, like what granularity of recoverability do you expect, etc. Cody joked a little bit and at the beginning of VMFS section of the presentation, he shared his opinion, the best VMFS datastore is VVols datastore :-)

NFS
Jason continued with NFS best practices. The interesting one was about the usage of Jumbo Frames. Jason's Jumbo Frames recommendation is to use it only when it is already configured in your network. In other words, it is not worth to enable it and believe you will get significantly better performance.

VVols
Another topic is VVols. Cody starts with general explanation what VVols are and are NOT. Cody highlight that with VVols you can finally use VVols snapshots without any performance overhead because it is offloaded to storage hardware. Cody explains the basic of VVols terminology and architecture. VVols use one or more Protocol Endpoints. Protocol Endpoint is nothing else then LUN with ID 254 and used as Administrative Logical Unit (ALU). Protocol Endpoint is used to handle the data but within the storage, there are Virtual Volumes also known as Subsidiary Logical Units (SLU) having VVol Sub-LUN IDs (for example 254:7) where vSphere storage objects (VM home directory, VMDKs, SWAPs, Snapshots) are stored. This session is not about VVols, therefore, it is really just a brief overview.

Trim/UNMAP and Space Reclamation on Thin Provisioned storage
The mic is handed over to Jason who starts another topic - Trim/UNMAP and Space Reclamation. Jason informs auditorium that Trim/UNMAP functionality depends on vSphere version and other circumstances.

In vSphere 6.0, Trim/UNMAP is a manual operation (esxcli storage vmfs unmap). In vSphere 6.5, it is automated with VMFS-6, it is enabled by default, can be configured to Low Priority or Off and it takes 12-24 hours to clean up space. In vSphere 6.7 adds configurable throughput limits where the default is Low Priority.

Space reclamation also works differently on each vSphere edition.

  • In vSphere 6.0 it works when virtual disks are thin, VM hardware is 11+, and only for MS Windows OS. It does NOT work with VMware snapshots, CBT enabled, Linux OS, UNMAPs are misaligned. 
  • In vSphere 6.5, it works when virtual disks are thin, it works for Windows and Linux OS, VM hardware is 11+ for MS Windows OS, VM hardware 13 for Linux OS, CBT can be enabled, UNMAPs can be misaligned. It does NOT work when VM has VMware snapshots, with thick virtual disks, when Virtual NVMe adapter is used.
  • In vSphere 6.7it works when virtual disks are thin, it works for Windows and Linux OS, VM hardware is 11+ for MS Windows OS, VM hardware 13 for Linux OS, CBT can be enabled, UNMAPs can be misaligned, VM snapshots supported, Virtual NVMe adapter is supported. It does NOT work for thick provisioned virtual disks.
Queuing
The mic is handed over back to Cody to speak about queueing. At the beginning of this section, Cody explains how storage queuing works in vSphere. He shares the default values of HBA Device Queues (aka HBA LUN queues) for different HBA types:

  • QLogic - 64
  • Brocade - 32
  • Emulex - 32
  • Cisco UCS (VIC) - 32
  • Software iSCSI - 128
HBA Device Queue is an HBA setting which controls how many I/Os may be queued on a device (aka LUN). Default values are configurable via esxcli. Changing requires reboot. Details are documented in VMware KB 1267. After the explanation of "HBA Device Queue", Cody explains DQLEN which is Hypervisor level device queue limit. The actual device queue depth is a minimum from "HBA Device Queue" and DQLEN. In a mathematical formula, it is a MIN("HBA Device Queue", DQLEN). Therefore, if you increase DQLEN you have to also adjust "HBA Device Queue" for some real effect. VMFS DQLEN defaults are:

  • VMFS - 32
  • RDM - 32
  • VVols Protocol Endpoints - 128 (Scsi.ScsiVVolPESNRO)
After basic problem explanation, Cody does some quick math to stress that default settings are ok for most environments and it usually does not make sense to change it. However, if you have specific storage performance requirements, you have to understand the whole end-to-end storage stack and then you can adjust it appropriately and do a performance tunning. If you do any change, you should do it on all ESXi hosts in the cluster to keep performance consistent after migration from one ESXi host to another.

Storage DRS (SDRS) and Storage I/O Control (SIOC)
Cody continues with SDRS and SIOC. SDRS and SIOC are two different technologies.

SDRS moves VMS around based on hitting a latency threshold. This is the VM observed latency which includes any latency induced by queueing.

SIOC controls throttles VMs based on hitting a datastore latency threshold. SIOC is using a device (LUN) latency to throttle device queue depth automatically when the device is stressed. If SIOC kicks in, it takes into consideration VMDK shares, therefore VMDKs with higher shares have more frequent access to stressed (overloaded) storage device.

Jason notes that what Cody explained is how SIOC version 1 works but there is also SIOC version 2 introduced in vSphere 6.5. SIOC v1 and v2 are different. SIOC v1 looks at datastore level, SIOC v2 is policy based setting. It is good to know that SIOC v1 and SIOC v2 can co-exist on vSphere 6.5+. SIOC V2 is considerably different from a user experience perspective when compared to V1. SIOCv2 is implemented using IO Filter framework Storage IO Control category. SIOC V2 can be managed using SPBM Policies. What this means is that you create a policy which contains your SIOC specifications, and these policies are then attached to virtual machines. One thing to note is that IO Filter based IOPS does not look at the size of the IO. For example, there is no normalization like in SIOC v1 so that a 64K IOP is not equal to 2 x 32K IOPS. It is a fixed value of IOPS irrespective of the size of the IO. For more information about SIOC look here.

Troubleshooting 
The last section of the presentation is about troubleshooting. It is presented by Jason. When you have to do a storage performance troubleshooting, start with reviewing performance graph and try to narrow down the issue. Looks at VM and ESXi host. Select only problem component. You have to build your troubleshooting toolbox. It should include

  • Performance Graph
  • ESXTOP
  • vRealize LogInsight
  • vSphere On-disk Metadata Analyzer (VOMA)
  • Enable CEIP (Customer Experience Improvement Program) which shares troubleshooting data with VMware Support (GSS). CEIP is actually call-home functionality which is opt-out in vSphere 6.5+

Session evaluation 
This is a very good technical session for infrastructure folks responsible for VMware vSphere and storage interoperability. I highly recommend to watch it.