Saturday, September 01, 2018

New with vSphere 6.7 U1 - Enhanced Load Balancing Path Selection Policy

With the release of vSphere 6.7 U1, VMW_PSP_RR gains sub-policy options that enable active monitoring of the paths. The policy considers path latency and pending IOs on each active path. An algorithm monitors the active paths and calculates the average latency per path based on elapsed time, the number of IOs, or both. When the module is loaded, the latency logic is triggered and the first 16 IOs per path are used to calculate the latency. The remaining IOs are then directed, based on the results of the algorithm’s calculations, to the path with the least latency. With the latency mechanism, the Round Robin policy can dynamically select the optimal path and achieve better load balancing results.
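The sampling behaviour described above can be sketched as a toy model. This is not VMware's implementation: the class and method names (`PathStats`, `LatencySelector`, `next_path`) are hypothetical, pending IOs are ignored, and the steady state is simplified to "always pick the lowest average latency", whereas the real policy only favours that path more often.

```python
from dataclasses import dataclass


@dataclass
class PathStats:
    """Latency samples collected for one path during a sampling window."""
    name: str
    total_latency_ms: float = 0.0
    samples: int = 0

    def record(self, latency_ms: float) -> None:
        """Add the round-trip time of one completed sampling IO."""
        self.total_latency_ms += latency_ms
        self.samples += 1

    @property
    def average_ms(self) -> float:
        # A path with no samples yet is treated as infinitely slow.
        return self.total_latency_ms / self.samples if self.samples else float("inf")


class LatencySelector:
    """Toy model: sample m IOs per path, then prefer the lowest-latency path."""

    def __init__(self, path_names, sampling_ios=16):
        self.sampling_ios = sampling_ios  # "m" in the article
        self.paths = [PathStats(n) for n in path_names]
        self._rr_index = 0

    def sampling_done(self) -> bool:
        return all(p.samples >= self.sampling_ios for p in self.paths)

    def next_path(self) -> PathStats:
        if not self.sampling_done():
            # Sampling phase: plain round robin over paths that still need samples.
            while True:
                candidate = self.paths[self._rr_index % len(self.paths)]
                self._rr_index += 1
                if candidate.samples < self.sampling_ios:
                    return candidate
        # Steady state (simplified assumption): lowest average latency wins.
        return min(self.paths, key=lambda p: p.average_ms)

    def start_new_window(self) -> None:
        """Discard old samples, as would happen when a new sampling window begins."""
        for p in self.paths:
            p.total_latency_ms = 0.0
            p.samples = 0
        self._rr_index = 0


# Demo with made-up per-IO latencies for three paths.
simulated_latency_ms = {"P1": 0.625, "P2": 1.25, "P3": 1.875}
selector = LatencySelector(["P1", "P2", "P3"], sampling_ios=16)
while not selector.sampling_done():
    path = selector.next_path()
    path.record(simulated_latency_ms[path.name])
print(selector.next_path().name)  # prints P1
```

Keeping a running total and a sample count per path is enough to compute the average, so the model needs no per-IO history.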

To use the latency-based sub-policy for VMW_PSP_RR, the following configuration option must first be enabled:
esxcfg-advcfg -s 1 /Misc/EnablePSPLatencyPolicy
Update: this setting is enabled by default as of 6.7 U1.

To switch to the latency-based sub-policy, use the following command:
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency
To change the default evaluation time or the number of sampling IOs used to evaluate latency, use the following commands.

For Latency evaluation time (default is 15000 = 15 sec):
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency --latency-eval-time=18000
For the number of sampling IOs:
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency --num-sampling-cycles=32
To check the device configuration and sub-policy:
esxcli storage nmp device list -d
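When the same settings have to be applied to many devices, it can help to generate the command lines rather than type them. The following is a minimal sketch that only builds the strings and does not run anything. The function name `latency_policy_commands` and the device ID in the demo are hypothetical placeholders, and passing both tuning flags in a single invocation is an assumption; the flags themselves are the ones shown above.

```python
import shlex


def latency_policy_commands(device_id, eval_time_ms=None, sampling_cycles=None):
    """Return [set_command, check_command] as shell-safe strings for one device."""
    set_cmd = [
        "esxcli", "storage", "nmp", "psp", "roundrobin", "deviceconfig", "set",
        "-d", device_id, "--type=latency",
    ]
    # Only append the tuning flags when the caller overrides the defaults.
    if eval_time_ms is not None:
        set_cmd.append(f"--latency-eval-time={eval_time_ms}")
    if sampling_cycles is not None:
        set_cmd.append(f"--num-sampling-cycles={sampling_cycles}")

    # Command to verify the device configuration and sub-policy afterwards.
    check_cmd = ["esxcli", "storage", "nmp", "device", "list", "-d", device_id]

    return [shlex.join(set_cmd), shlex.join(check_cmd)]


# Demo: the device ID below is a made-up placeholder, not a real device.
for line in latency_policy_commands("naa.0123456789abcdef",
                                    eval_time_ms=18000, sampling_cycles=32):
    print(line)
```

The output can be reviewed first and then pasted into an ESXi shell session.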
The diagram below shows how sampling IOs are monitored on paths P1, P2, and P3 and how a path is eventually selected. The sampling window starts at time “t”. Within the sampling window, IOs are issued on each path in Round Robin fashion and their round-trip time is monitored. Path P1 took a total of 10ms to complete its 16 sampling IOs. Similarly, path P2 took 20ms for the same number of sampling IOs and path P3 took 30ms. Because path P1 has the lowest latency, it will be selected more often for IOs. The sampling window then starts again after the interval “T”. Both “m” and “T” are tunable parameters, but we suggest not changing them, as their defaults are based on experiments run internally during implementation.

Diagram: how sampling IOs are monitored and selected.
Legend: 
T = Interval after which sampling should start again
m = Sampling IOs per path

t1 < t2 < t3 ---------------> 10ms < 20ms < 30ms 
t1/m < t2/m < t3/m -----> 10/16 < 20/16 < 30/16
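Worked out with the figures from the example (m = 16 sampling IOs per path), the average latency per IO on each path is as follows. The symbol $\bar{t}_i$ is introduced here only for illustration and does not appear in the original diagram.

```latex
\bar{t}_1 = \frac{t_1}{m} = \frac{10\,\text{ms}}{16} = 0.625\,\text{ms}, \qquad
\bar{t}_2 = \frac{t_2}{m} = \frac{20\,\text{ms}}{16} = 1.25\,\text{ms}, \qquad
\bar{t}_3 = \frac{t_3}{m} = \frac{30\,\text{ms}}{16} = 1.875\,\text{ms}
```

Since every path is sampled with the same m, comparing the totals and comparing the per-IO averages give the same ordering, with P1 lowest.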

In testing, VMware found that with the new latency monitoring policy, even with up to 100ms of latency introduced on half the paths, the PSP sub-policy maintained almost full throughput.
Setting the values for the round robin sub-policy can be accomplished via CLI or using host-profiles. Cody Hosterman describes here how to configure the Latency Policy as a SATP Rule via ESXCLI and PowerCLI.

Source: StorageHub : vSphere 6.7 Core Storage

References & External Links
Cody Hosterman: LATENCY ROUND ROBIN PSP IN ESXI 6.7 UPDATE 1
