Monday, December 15, 2014

Disk queue depth in an ESXi environment

On the internet there is a lot of information about ESXi and disk queue depth, but I didn't find a single document explaining all the details I wanted to know in a format that is easy to consume. Different vendors have their specific recommendations and best practices, but without a deeper explanation of the underlying principles and a holistic view. Some documents are incomplete, and others contain, in my opinion, wrong or at least strange information. That's the reason I decided to write this blog post, mainly for my own purposes, but I hope it helps other people from the VMware community.

When I did my research I found Frank Denneman's blog post from March 2009 where almost everything is explained very well and accurately, but the technology has changed a little bit over time. Another great document is here on VMware Community Docs.

Disclaimer: I work for DELL Services, therefore I work on a lot of projects with DELL Compellent Storage Systems, DELL Servers and QLogic HBAs. This blog post is therefore based on "DELL Compellent Best Practices with VMware vSphere 5.x" and "QLogic Best Practices for VMware vSphere 5.x". I have all this equipment in the lab, so I was able to test it and verify what's going on. However, although this blog post contains information about specific hardware, the general principles are the same for any other HBAs and storage systems.

Some blog sections are just copied and pasted from the public documents mentioned above, so all credits go there. Other sections are based on my lab tests, experiments and thoughts.

What is queue depth?

Queue depth is defined as the number of disk transactions that are allowed to be “in flight” between an initiator and a target (parallel I/O transaction), where the initiator is typically an ESXi host HBA port/iSCSI initiator and the target is typically the Storage Center front-end port. Since any given target can have multiple initiators sending it data, the initiator queue depth is generally used to throttle the number of parallel transactions being sent to a target to keep it from becoming “flooded”. When this happens, the transactions start to pile up causing higher latencies and degraded performance. That being said, while increasing the queue depth can sometimes increase performance, if it is set too high, there is an increased risk of over-driving (over-loading and saturating) the storage array. As data travels between the application and the storage array, there are several places that the queue depth can be set to throttle the number of concurrent (parallel) disk transactions. The most common places queue depth can be modified are:
  • The application itself (Default=dependent on application)
  • The virtual SCSI card driver in the guest (Default=32)
  • The VMFS layer (DSNRO) (Default=32)
  • The HBA VMkernel Module driver (Default=64)
  • The HBA BIOS (Default=Varies)

HBA Queue Depth = AQLEN 

(Default varies, but on my QLogic card it is 2176)

Here are the Compellent-recommended BIOS settings for a QLogic Fibre Channel card:
  1. The “connection options” field should be set to 1 for point to point only
  2. The “login retry count” field should be set to 60 attempts
  3. The “port down retry” count field should be set to 60 attempts
  4. The “link down timeout” field should be set to 30 seconds
  5. The “queue depth” (or “Execution Throttle”) field should be set to 255. This queue depth can be set to 255 because the ESXi VMkernel driver module and DSNRO can more conveniently control the queue depth
I don't think point (5) is correct and up to date in the Compellent Best Practices. Based on the QLogic documentation, an HBA "Queue Depth" setting doesn't exist in the QLogic BIOS, and "Execution Throttle" is not used anymore. The QLogic adapter queue depth is managed only by the driver, and the default value is 64 for vSphere 5.x. For more info look here.

I believe the QLogic Host Bus Adapter has a queue depth of 2176 because this number is visible in esxtop as AQLEN (Adapter Queue Length) for each particular HBA (vmhba).
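You can verify this on your own host in esxtop:
~ # esxtop
Press d to switch to the disk adapter view and check the AQLEN column for each vmhba.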

HBA LUN Queue Depth 

(Default=64)
The HBA LUN Queue Depth is 64 by default but can be changed via the VMkernel Module driver. Compellent recommends setting the HBA max queue depth to 255, but this queue depth is only used when a single VM is running on a single LUN (disk). If more VMs are running on the LUN, then DSNRO takes precedence. However, when SIOC (Storage I/O Control) is used, DSNRO is not used at all and the HBA VMkernel Module Driver Queue Depth value is used as each device's queue depth (DQLEN). I think SIOC is beneficial in this situation because you can have deeper queues with dynamic queue management across all ESXi hosts connected to a particular LUN. Please note that special attention has to be paid to the correct SIOC latency threshold, especially on LUNs with sub-LUN tiering.

If you want to check your disk (aka LUN) queue depth value, it is nicely visible in esxtop as DQLEN.
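For example:
~ # esxtop
Press u to switch to the disk device view; DQLEN is displayed next to each device (LUN).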

So here is the procedure for changing the HBA VMkernel Module Driver Queue Depth.

Find the appropriate driver name for the module that is loaded for the QLogic HBA:
~ # esxcli system module list | grep ql
qlnativefc                          true        true
In our case, the name of the module for the QLogic driver is qlnativefc.

Set the QLogic driver queue depth and timeouts using the esxcli command:
esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60"
You have to reboot the ESXi host to apply module parameter changes.

Below is a description of the QLogic parameters we have just changed:
  • ql2xmaxqdepth (int) - Maximum queue depth to report for target devices. 
  • ql2xloginretrycount (int) - Specify an alternate value for the NVRAM login retry count. 
  • qlport_down_retry (int) - Maximum number of command retries to a port that returns a PORT-DOWN status.
You can verify QLogic driver options by the following command:
esxcfg-module --get-options qlnativefc 
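On my host the output looks something like this (the exact formatting may differ between driver and ESXi versions):
qlnativefc enabled = 1 options = 'ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60'
The same information, including parameter descriptions, can be listed with:
esxcli system module parameters list -m qlnativefc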
The applied change (after an ESXi reboot) is visible in esxtop on disk devices as DQLEN.

VMFS Layer DSNRO 

(Default=32)
When two or more virtual machines share a LUN (logical unit number), this parameter controls the total number of outstanding commands permitted from all virtual machines collectively on the host to that LUN (this setting is not per virtual machine). For more information see KB.

The HBA LUN queue depth (DQLEN) determines how many commands the HBA is willing to accept and process per LUN and target. If a single virtual machine is issuing IO, the HBA LUN Queue Depth (DQLEN) setting is applicable. When multiple VMs are simultaneously issuing IOs to the LUN, the ESXi parameter Disk.SchedNumReqOutstanding (DSNRO) becomes the leading parameter and the HBA LUN Queue Depth is ignored. This statement is only true when SIOC is not enabled on the LUN (datastore).

The parameter can be changed to a value between 1 and 256 per device (the device id and value below are placeholders):
esxcli storage core device set -d <device_id> -O <value>
The current value, "No of outstanding IOs with competing worlds" (DSNRO), can be listed for each device:
~ # esxcli storage core device list -d naa.6000d3100025e7000000000000000098
naa.6000d3100025e7000000000000000098
   Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
   Has Settable Display Name: true
   Size: 1048576
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.6000d3100025e7000000000000000098
   Vendor: COMPELNT
   Model: Compellent Vol
   Revision: 0504
   SCSI Level: 5
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: supported
   Other UIDs: vml.02000200006000d3100025e7000000000000000098436f6d70656c
   Is Local SAS Device: false
   Is Boot USB Device: false
   No of outstanding IOs with competing worlds: 32
The issue with DSNRO is that it must be set per device (LUN) on all ESXi hosts, so it has a negative impact on implementation and overall manageability. It is usually left at the default value anyway, because 32 outstanding (async) IOs are good enough on low-latency SANs and properly designed storage devices with good response times. Bigger queues are beneficial on higher-latency SANs and/or storage systems which need more parallelism (more parallel threads) to draw out the I/O the storage is physically capable of. It doesn't help in situations when the storage doesn't have enough back-end performance (usually spindles). Bigger queues can push more outstanding IOs to the storage system, which can overload the storage and make everything even worse. That's the reason why I think it is a good idea to leave DSNRO at the default value.
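If you nevertheless decide to change DSNRO, it has to be done per device and repeated on every ESXi host. Here is a minimal shell sketch of how it could be scripted on one host; it assumes ESXi 5.5 (where DSNRO is settable per device via -O) and that all Compellent volumes should get the same value. The naa.6000d31 prefix (taken from the example device above) and the value 64 are just illustrative:
# set DSNRO to 64 on all Compellent devices on this host
for DEV in $(esxcli storage core device list | grep '^naa.6000d31'); do
   esxcli storage core device set -d $DEV -O 64
done
The same loop has to be run on every ESXi host accessing those LUNs, which is exactly the manageability burden mentioned above.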

And once again: DSNRO is not used when SIOC is enabled, because in that case the "HBA LUN Queue Depth" is the leading parameter for the disk queue depth (DQLEN).

Storage Target Port Queue Depth

Queues exist in a lot of places, and you have to understand the path from end to end. A queue exists on the storage array controller port as well; this is called the "Target Port Queue Depth". Modern midrange storage arrays, like most EMC and HP arrays, can handle around 2048 outstanding IOs.
2048 IOs sounds like a lot, but most of the time multiple servers communicate with the storage controller at the same time. Because a port can only service one request at a time, additional requests are placed in the queue, and when the storage controller port receives more than 2048 IO requests, the queue gets flooded. When the queue depth is reached, a QFULL status is returned and the storage controller issues an IO throttling command to the host to suspend further requests until space in the queue becomes available. The ESXi host accepts the IO throttling command and decreases the LUN queue depth to the minimum value, which is 1! The QFULL condition is handled by the QLogic driver itself; the QFULL status is not returned to the OS. Some storage devices return BUSY rather than QFULL. BUSY errors are logged in /var/log/vmkernel.log. Not every BUSY error is a QFULL error (for reference, BUSY is SCSI status 0x08 while QUEUE FULL, also known as TASK SET FULL, is status 0x28)! Check the SCSI sense codes indicated in the vmkernel message to determine what type of error it is. The VMkernel checks every 2 seconds whether the QFULL condition is resolved. If it is, the VMkernel slowly increases the LUN queue depth back to its normal value; usually, this can take up to 60 seconds.

Compellent SC8000 front-end ports act as targets and have a target port queue depth of 1900+. What does 1900+ mean? Compellent is an excellent storage operating system running on top of commodity servers, IO cards, disk enclosures and disks. By the way, is that software-defined storage or not? But back to queues: 1900+ means that the queue depth depends on the IO card used for the front-end port, and 1900 is the guaranteed minimum. So for our case let's assume a storage target port queue depth of 2048.

Putting it all together gives us the holistic view.
Here are the end-to-end disk queue parameters:
  • DSNRO is 32 - this is the default ESXi value.
  • HBA LUN Queue Depth is 64 by default in QLogic HBAs. Compellent recommends changing it to 255 via a parameter of the HBA VMkernel Module Driver.
  • HBA Queue Depth is 2176.
  • Compellent Target Ports Queue Depths are 2048.
Everything is depicted in the figure below.
Disk Queue Depth - Holistic View

What is the optimal HBA LUN Queue Depth?

We already know that 64 is the default value for QLogic and that it can be changed via the VMkernel Module driver. So what is the optimal HBA LUN Queue Depth?

To prevent flooding the target port queue, the product of the number of host paths, the HBA LUN Queue Depth, and the number of LUNs presented to the host through the port must not exceed the target port queue depth. In short: T >= P * Q * L, where

T = Target Port Queue Depth (2048)
P = Paths connected to the target port (1)
Q = HBA LUN Queue Depth (?)
L = number of LUNs presented to the host through this port (20)

So when I have 20 LUNs exposed from the Compellent storage system, the HBA LUN Queue Depth should be at most Q = T / P / L = 2048 / 1 / 20 = 102.4 for a single ESXi host. When I have 16 hosts, the HBA LUN Queue Depth must also be divided by 16, so the value for this particular scenario is 6.4.
This number is without any overbooking, which is not realistic. The current QLogic HBA LUN Queue Depth default value (64) introduces a fan-in/fan-out ratio of 10:1, which is probably based on long-term experience with virtualized workloads. In the past, QLogic had lower default values, 16 in ESX 3.5 and 32 in ESX 4.x, which correspond to lower fan-in/fan-out ratios (2.5:1 and 5:1).

With the Compellent-recommended QLogic HBA LUN Queue Depth of 255 we would have, in this particular case, a fan-in/fan-out ratio of 40:1. But only when SIOC is used. And when SIOC is used and the normalized datastore latency is higher than the SIOC threshold, dynamic queue management kicks in and disk queues are automatically throttled. So all is good.

What is my real Disk Queue Depth?

VMkernel Disk Queue Depth (DQLEN) is the number that matters. However, this number is dynamic and depends on several factors. Here are the scenarios you can have:
Scenario 1 - SIOC enabled: When you have a datastore with SIOC enabled, your DQLEN will be set to the HBA LUN Queue Depth. It is 64 by default, but Compellent recommends setting it to 255. However, Compellent doesn't recommend using SIOC.
Scenario 2 - SIOC disabled and only one VM on the datastore: When you have just one VM on the datastore, your DQLEN will be set to the HBA LUN Queue Depth. But how often do you have just one VM on a datastore?
Scenario 3 - SIOC disabled and two or more VMs on the datastore: In this scenario your DQLEN will be set to DSNRO, which is 32 by default and takes precedence over the HBA LUN Queue Depth.

ESXi Disk Queue Management - None, Adaptive Queuing or Storage I/O Control?

By default, ESXi doesn't use any Disk Queue Length (DQLEN) throttling mechanism, and your DQLEN is set to DSNRO (32) when two or more VMs (vDisks) are on the datastore. When only one VM disk is on the datastore, your DQLEN will be set to the HBA LUN Queue Depth (default=64, Compellent recommends 255).

When you have enabled Adaptive Queuing or Storage I/O Control, your Disk Queue Length (DQLEN) will be throttled automatically when I/O congestion occurs. The difference between Adaptive Queuing and SIOC is how I/O congestion is detected.

Adaptive Queuing waits for a SCSI sense code with BUSY or QUEUE FULL status on the I/O path. For Adaptive Queuing, there are some advanced parameters that must be set on a per-host basis. From the Configuration tab, select Software Advanced Settings and navigate to Disk; the two parameters you need are Disk.QFullSampleSize and Disk.QFullThreshold. By default, QFullSampleSize is set to 0, meaning that it is disabled. When this value is set, Adaptive Queuing kicks in and halves the queue depth when this number of queue-full conditions is reported by the array. QFullThreshold is the number of good statuses to receive before incrementing the queue depth again.
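Both parameters can also be set from the command line. The following is based on VMware KB 1008113; the values below are examples only, and the right numbers should be agreed upon with your storage vendor:
esxcli system settings advanced set -o /Disk/QFullSampleSize -i 32
esxcli system settings advanced set -o /Disk/QFullThreshold -i 4
The KB also describes a per-device variant available since ESXi 5.1 (esxcli storage core device set -d <device_id> --queue-full-sample-size 32 --queue-full-threshold 4), which is why the Queue Full Sample Size and Queue Full Threshold fields appear in the device listing earlier in this post.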

Storage I/O Control uses the concept of a congestion threshold, which is based on latency. However, it is not the plain latency of the datastore on one particular ESXi host; a sophisticated algorithm computes a normalized datastore latency from the datastore latencies across all ESXi hosts using that particular datastore. On top of the datastore-wide normalization, there is another type of normalization which takes into account only "normal" I/O sizes. What is a normal I/O size? I cannot find this detail, but I think I read somewhere that the normal I/O size is between 2KB and 16KB. Update 2016-10-07: Now that I work for VMware I have access to the ESXi source code, and if I read the code correctly it seems to me that the I/O size normalization works a little bit differently. I cannot publish the exact formula as it is VMware intellectual property and an implementation detail, but the whole idea is that, for example, 1MB I/Os would significantly skew the latency against normally accepted latency values in the storage industry; therefore, the latency of bigger I/Os is adjusted based on I/O size.

Information in the above section is based on Cormac Hogan's blog post.

Conclusion

Based on the information above, I think the default ESXi and QLogic values are best for general workloads, and tuning queues is not something I would recommend. It is important to know that queue tuning does not help with performance in most cases. Queues help with latency during transient I/O bursts. If you have storage performance issues, it is usually because of an overloaded storage system, not because of queues in the network path. A bigger queue depth can help you push more parallel I/Os to the storage subsystem. This is beneficial only when the target storage system can actually handle more I/Os. It cannot help in situations when the target storage system is overloaded and you expect more performance just by increasing the queue depth.

The question I still need to answer for myself is why Compellent doesn't recommend VMware Storage I/O Control and Adaptive Queuing for dynamic queue management, which is in my opinion a very good thing.

UPDATE 2014-12-16: I have downloaded the latest Compellent Best Practices and there is a new statement about SIOC. SIOC can be enabled, but you must understand what is happening and whether it is beneficial for you. Here is the snippet from the document ...
SIOC is a feature that was introduced in ESX/ESXi 4.1 to help VMware administrators regulate storage performance and provide fairness across hosts sharing a LUN. Due to factors such as Data Progression and the fact that Storage Center uses a shared pool of disk spindles, it is recommended that caution is exercised when using this feature. Due to how Data Progression migrates portions of volumes into different storage tiers and RAID levels at the block level, this could ultimately affect the latency of the volume, and trigger the resource scheduler at inappropriate times. Practically speaking, it may not make sense to use SIOC unless pinning particular volumes into specific tiers of disk.
I still believe SIOC is the way to go, and special attention has to be paid to the SIOC latency threshold. Compellent recommends keeping it at the default value of 30 milliseconds, which makes perfect sense. The storage system will do all the hard work for you, but when there is huge congestion and the normalized latency is too high, dynamic ESXi disk queue management can kick in. It makes sense to me.

I also believe that Adaptive Queuing is a really good and practical safety mechanism for situations when your storage array has full queues on its front-end ports or LUNs. However, it is applicable only with storage arrays that support Adaptive Queuing and send SCSI sense codes about the queue-full condition back to ESXi. If Adaptive Queuing is not used, even SIOC cannot help you with LUN/datastore issues (datastore disconnections), because the SIOC algorithm is based on device response time and not on a queue-full storage response. Therefore, I strongly recommend enabling SIOC together with Adaptive Queuing unless your storage vendor has a really good justification not to do so. SIOC will help you with storage traffic throttling during high device response times, and Adaptive Queuing will help when storage array queues are full and the device cannot accept new I/Os. However, Adaptive Queuing should be configured in concert with your storage vendor. For more information on how to enable Adaptive Queuing, read VMware KB 1008113. Please note that SIOC and Adaptive Queuing are just safety mechanisms to mitigate the impact of storage issues; the root cause belongs to capacity planning on the storage array.

UPDATE 2020-10-22: SIOC throttles queues based on hitting a datastore latency threshold. This is the disk (device) latency, not including any additional latency induced by further VMkernel queueing. SIOC uses a minimal threshold of 5 ms response time before it kicks in. This can be good for traditional storage arrays with rotational disks; however, in the all-flash era you expect response times between 1 and 3 ms, right? Sometimes even sub-millisecond response times. SIOC will not help you achieve fairness and defined SLAs/OLAs in such all-flash environments, but you can at least do dynamic queue management in situations where the response time increases above 5 ms.

If any of you with a deep understanding of vSphere and storage architectures see an error in my analysis, please let me know so that I can correct it appropriately.


3 comments:

ONDREJ said...

Hi David, IMHO the SIOC is not recommended by the Compellent team because the SIOC algorithm assumes that a single RAID group = a single datastore. While the above is true for a common setup of LSI, MSA and EMC arrays, the common setup of a Compellent array is to have a single disk pool, with all the datastores sharing all the spindles within a pool. So, increased load on a single datastore will be visible as increased latency on all other datastores. Under such conditions, SIOC cannot act properly, of course.

David Pasek said...

Ondrej,
first of all thanks for your comment and opinion, even though I don't agree with you, because SIOC just does a heuristic measurement of normalized latency and throttles disk queues when the normalized latency is higher than the SIOC threshold. As proof, below are snippets from the best practices documents of other storage systems that also leverage disk pools and automated storage tiering. They don't have any issue with SIOC and generally recommend using it together with their proprietary pooling and sub-LUN tiering.

This is from EMC VNX Best Practices "... SIOC can be used along with VNX FAST VP ...".

This is from HP 3PAR Best Practices "... vSphere Storage I/O Control (SIOC) is a vSphere feature which manages ESXi device-queue depths, while AO is an HP 3PAR feature which moves data between storage tiers at scheduled intervals depending upon usage patterns. The two features operate at different layers, so there is no conflict between them."

David Pasek said...

Ondrej,
I have just read the latest Compellent vSphere Best Practices again, and the statement about SIOC is not so strict, just saying that it may not make sense to use SIOC. However, when SIOC is used, the default latency threshold (30 ms) is recommended. Based on my understanding, I still believe SIOC is beneficial.

I have updated the blog post to reflect this change.