Wednesday, June 05, 2019

How to get more IOPS from a single VM?

Yesterday, I got a typical storage performance question. Here is the question ...
"I am running a test with my customer to see how many IOPS we can get from a single VM working with an HDS all-flash array. The best that I could get with IOmeter was 32K IOPS with 3 ms latency at 8 KB blocks. No matter what other block size or number of outstanding IOs I choose, I am unable to get more than 32K. On the other hand, I can't find any bottlenecks across the paths or storage. I use the PVSCSI storage controller. Latency and queues look to be OK."
IOmeter is a good storage test tool. However, you have to understand basic storage principles to plan your storage performance test and interpret its results properly. Storage is the most crucial component of any vSphere infrastructure, so I have some experience with IOmeter and storage performance tests in general. Here are my thoughts about this question.

First things first: every shared storage system requires specific I/O scheduling so that it does NOT give the whole performance to a single worker. A storage worker is the compute process or thread sending storage I/Os down to the storage subsystem. If you think about it, it makes perfect sense, as it mitigates the noisy neighbor problem. When you invest a lot of money in a shared storage system, you most probably want to use it for multiple servers, right? It does not matter whether these servers are physical (ESXi hosts) or virtual (VMs). To get the most performance from shared storage, you must use multiple workers and optimally spread them across multiple servers and multiple storage devices (aka LUNs, volumes, datastores).

IOmeter allows you to use:

  • Multiple workers on a single server (aka Manager)
  • Outstanding I/Os within a single worker (asynchronous I/O to a disk queue without waiting for acknowledgment)
  • Multiple Managers: the manager is the server generating the storage workload (multiple workers) and reporting results to a central IOmeter GUI. This is where IOmeter dynamos come into play.
To test the performance limits of a shared storage subsystem, it is always a good idea to use multiple servers (IOmeter managers, nowadays usually VMs) with multiple workers on each server, spread across multiple storage devices (datastores / LUNs). This gives you multiple storage queues, which means more parallel I/Os. Parallelism is what gives you more performance when such performance exists on the shared storage. If it does not exist there, queueing will not help you boost performance. If you want, you can also leverage Outstanding I/Os to fill the disk queue(s) more quickly and put additional pressure on the storage subsystem, but this is not necessary if the number of workers equals the available queue depth. Outstanding I/Os can help you generate more I/Os with fewer workers, but they do not help you get more performance when your queues are full. You will just increase response times without any performance gain.
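
The effect of a full queue can be illustrated with a small model. This is only a sketch under simplifying assumptions that are mine, not measured behavior: a single queue, a constant service time, and a storage backend that always has spare performance. The function names `offered_load` and `simulate` are hypothetical.

```python
from typing import Tuple


def offered_load(managers: int, workers_per_manager: int,
                 outstanding_per_worker: int) -> int:
    """Total number of I/Os the test tries to keep in flight at once."""
    return managers * workers_per_manager * outstanding_per_worker


def simulate(offered: int, queue_depth: int,
             service_time_ms: float) -> Tuple[float, float]:
    """Return (IOPS, observed latency in ms) for a simplified single queue.

    Assumption: only `queue_depth` I/Os are serviced in parallel; anything
    beyond that waits in front of the queue and only adds response time.
    """
    in_service = min(offered, queue_depth)
    iops = in_service / service_time_ms * 1000
    # Little's law: latency = I/Os in flight / throughput
    latency_ms = offered / iops * 1000
    return iops, latency_ms


if __name__ == "__main__":
    for outstanding in (4, 8, 16, 32):
        load = offered_load(managers=1, workers_per_manager=8,
                            outstanding_per_worker=outstanding)
        iops, latency = simulate(load, queue_depth=64, service_time_ms=2.0)
        print(f"{load:4d} in flight -> {iops:8,.0f} IOPS, {latency:5.1f} ms")
```

In this model, IOPS grow until the 64 queue slots are occupied. Beyond that point only the observed latency keeps rising, which is the behavior described above.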

As an example of an IOmeter performance test, the image below shows the results of distributed IOmeter performance tests on a 2-node vSAN that I recently planned, designed, implemented, and tested for one of my customers. There is just one disk group (1x SSD cache, 4x SSD capacity).


The storage performance test above used 8 VMs, and each VM was running 8 storage workers.
I tested different storage patterns (I/O size, R/W ratio, 100% random access). The performance is pretty good, right? However, I would not be able to get such performance from a single VM with a single vDisk.
Note: vSAN has a significant advantage in comparison to traditional storage because you do not need to deal with LUN queueing (HBA Device Queue Depth), as there are no LUNs. On the other hand, with vSAN you have to think about the total performance available for a single vDisk, which boils down to the vSAN disk group layout and the distribution of vDisk object components across physical disks. But that's another topic, as the asker is using traditional storage with LUNs.

Unfortunately, using multiple VMs is not a solution for the asker, as he is trying to get all the I/Os from a single VM.

The question states that a single VM cannot get more than 32K IOPS and that the observed I/O response time is 3 ms. The asker is curious why he cannot get more IOPS from the single VM.

Well, there can be multiple reasons, but let's assume the physical storage is capable of providing more than 32K IOPS. I think that more IOPS cannot be achieved because only one VM is used and IOmeter is using a single vDisk with a single queue. The situation is depicted in the drawing below.


So, let's do the simple math for this particular situation ...
  • We have a single vDisk queue with the default queue depth of 64 (we use the Paravirtual SCSI adapter; non-paravirtualized SCSI adapters have a queue depth of 32)
  • We have a QLogic HBA with a default queue depth of 64 (HBAs from other vendors, such as Emulex, have a default queue depth of 32, which would be another bottleneck on the storage path)
  • The storage has an average service time (response time) of around 3 ms
We have to understand the following basic principles
  • IOPS is the number of I/O operations per second
  • Queue depth 64 = 64 I/O operations in parallel = 64 slots for I/O operations
  • Each of these 64 I/Os stays in the vDisk queue until the SCSI response from the LUN comes back
  • All other I/Os have to wait until there is a free I/O slot in the queue.
And here is the math ...

Q1: How many I/Os can be delivered per millisecond in this situation?
A1: 64 (queue depth) / 3 (service time in ms) = 21.33 I/Os per millisecond

Q2: How many I/Os can be delivered per second?
A2: That is easy: 1,000 times more than per millisecond. So, 21.33 x 1,000 = 21,333 I/Os per second ~= 21.3K IOPS
 
The asker claims he gets 32K IOPS with a 3 ms response time, so it seems that the real response time of the storage is better than 3 ms. The math above tells me that the storage response time in this particular exercise is somewhere around 2 ms. There can be other mechanisms boosting performance, for example I/O coalescing, but let's keep it simple.
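
The 2 ms estimate comes from inverting the relation used in A1 and A2. As a sketch, with QD standing for the end-to-end queue depth and T for the average service time (symbols introduced here only for illustration):

```latex
\[
\text{IOPS} = \frac{QD}{T}
\quad\Longrightarrow\quad
T = \frac{QD}{\text{IOPS}}
  = \frac{64}{32{,}000\ \text{s}^{-1}}
  = 0.002\ \text{s}
  = 2\ \text{ms}
\]
```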

If the storage were able to service an I/O in 1 ms, we would be able to get ~64K IOPS.
If the storage were able to service an I/O in 2 ms, we would be able to get ~32K IOPS.
If the storage were able to service an I/O in 3 ms, we would be able to get ~21K IOPS.
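
The three cases above can be reproduced with a few lines of Python. This is a minimal sketch of the same back-of-the-envelope math; the function name `max_iops` is hypothetical, and the model assumes a constant service time and a queue that is always full.

```python
def max_iops(queue_depth: int, service_time_ms: float) -> float:
    """Upper bound on IOPS for a full queue with a constant service time."""
    if service_time_ms <= 0:
        raise ValueError("service_time_ms must be positive")
    # queue_depth I/Os complete every service_time_ms milliseconds
    return queue_depth / service_time_ms * 1000


if __name__ == "__main__":
    for service_time_ms in (1.0, 2.0, 3.0):
        iops = max_iops(queue_depth=64, service_time_ms=service_time_ms)
        print(f"{service_time_ms:.0f} ms -> ~{iops:,.0f} IOPS")
    # 1 ms -> ~64,000 IOPS
    # 2 ms -> ~32,000 IOPS
    # 3 ms -> ~21,333 IOPS
```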

The math above works only if the END-2-END queue depth is 64. This would be the case when a QLogic HBA is used, as it has an HBA LUN Queue Depth of 64. An Emulex HBA has an HBA LUN Queue Depth of 32, so a higher vDisk Queue Depth (64) would not help.
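
The END-2-END idea can be expressed as taking the smallest queue on the path. The sketch below uses only the default values quoted in this article; the layer names and the helper functions are hypothetical, and a real storage path has more layers (for example on the array side) that are not modeled here.

```python
from typing import Dict


def effective_queue_depth(layer_depths: Dict[str, int]) -> int:
    """The smallest queue on the path caps the I/Os in flight end to end."""
    return min(layer_depths.values())


def max_iops(queue_depth: int, service_time_ms: float) -> float:
    """Upper bound on IOPS for a full queue with a constant service time."""
    return queue_depth / service_time_ms * 1000


# Default values as quoted in the article (assumed, not read from a system)
qlogic_path = {"vDisk (PVSCSI)": 64, "HBA LUN queue (QLogic)": 64}
emulex_path = {"vDisk (PVSCSI)": 64, "HBA LUN queue (Emulex)": 32}

if __name__ == "__main__":
    for name, path in (("QLogic", qlogic_path), ("Emulex", emulex_path)):
        depth = effective_queue_depth(path)
        print(f"{name}: end-to-end depth {depth}, "
              f"~{max_iops(depth, 2.0):,.0f} IOPS at 2 ms")
    # QLogic: end-to-end depth 64, ~32,000 IOPS at 2 ms
    # Emulex: end-to-end depth 32, ~16,000 IOPS at 2 ms
```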
 
Hope the principle is clear now.

So how can I boost storage performance for a single VM? If you really need to get more IOPS from a single VM, you have only the following three options:
  1. Increase the queue depth, not only on the vDisk itself but END-2-END. THIS IS GENERALLY NOT RECOMMENDED, as you really must know what you are doing and it can have a negative impact on the overall shared storage. However, if you need it and have a justification for it, you can try to tune the system.
  2. Use a storage system with a lower service time (response time). For example, a sub-millisecond storage system (say 0.5 ms) will give you more IOPS at the same queue depth than a storage system with a higher service time (say 3 ms).
  3. Leverage multiple vDisks spread across multiple vSCSI controllers and datastores (LUNs). This gives you more (total) queue depth in a distributed fashion. However, it puts additional requirements on your real application, as it needs a filesystem or another mechanism supporting multiple storage devices (vDisks).
I hope options 1 and 2 are clear. Option 3 is depicted in the figure below.
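
The gain from option 3 can be estimated by summing the queue depths of the individual vDisks. This is a rough sketch with assumptions of my own: each vDisk sits on its own datastore (LUN) with an independent end-to-end queue, the workload is spread evenly, and the storage keeps the same service time under the higher load. The function name `aggregate_max_iops` is hypothetical.

```python
from typing import Sequence


def aggregate_max_iops(vdisk_queue_depths: Sequence[int],
                       service_time_ms: float) -> float:
    """Upper bound on IOPS when I/O is spread across independent queues."""
    total_depth = sum(vdisk_queue_depths)
    return total_depth / service_time_ms * 1000


if __name__ == "__main__":
    single = aggregate_max_iops([64], service_time_ms=2.0)
    four = aggregate_max_iops([64, 64, 64, 64], service_time_ms=2.0)
    print(f"1 vDisk : ~{single:,.0f} IOPS")
    print(f"4 vDisks: ~{four:,.0f} IOPS")
    # 1 vDisk : ~32,000 IOPS
    # 4 vDisks: ~128,000 IOPS
```

The result is only an upper bound. It holds as long as the shared storage really has that much performance to give.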


CONCLUSION
In a typical VMware vSphere environment, you use the shared storage system from multiple ESXi hosts and multiple VMs with vDisks on multiple datastores (LUNs). That's the reason why the default queue depth usually makes perfect sense, as it provides fairness among all storage consumers. If you have a storage system with, let's say, a 2 ms response time and a queue depth of 32, you can still get around 16K IOPS. This should be good enough for any typical enterprise application, and I usually recommend using IOPS limits to restrict some VMs (vDisks) even more. This is how storage performance tiering can be achieved very simply on a VMware SDDC with unified infrastructure. If you need higher storage performance, your application is specific, and you should do a specific design and leverage specific technologies or tunings.

By the way, I like a statement by Howard Marks (@DeepStorageNet) that I heard on his storage technology podcast "GrayBeards". It goes something like ...
"There are only two storage performance types - good enough and not good enough." 
This is very true.
 
Hope this writeup helps the broader VMware community.
