Wednesday, January 11, 2017

Using esxtop to identify storage performance issues for ESX / ESXi

ESXi performance statistics are exposed to administrators through the vSphere Clients. There you can see real-time performance statistics, which are collected in 5-minute intervals, where each interval consists of fifteen 20-second samples. A 20-second sample is obviously pretty coarse for storage performance, where we work on a millisecond or even microsecond scale.
20 seconds is 20,000 milliseconds.
Let's be clear here: we will never have full visibility, but a smaller monitoring sample will give us a better clue about what is really happening inside the system. It is similar to a microscope.

The smallest monitoring samples can be achieved with the ESXi utility esxtop. The default esxtop delay between monitoring points (samples) is 5 seconds. However, it can be lowered to as little as 2 seconds with the parameter -d 2

For real analysis, the esxtop data must be exported to an external file. In esxtop terminology this is batch mode, and it is enabled with the parameter -b

Another important factor is which statistics (metrics) we are going to collect. It is best to collect all statistics, because during performance analysis you have to correlate multiple values against each other. This is achieved with the parameter -a

The last parameter is -n, which defines how many iterations you want to perform in batch mode. So in the example below we will have 30 iterations with a 2-second delay between them, which means we will monitor for 60 seconds in total.

esxtop -b -a -d 2 -n 30 > esxtop-data.csv
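
If you want to cover a longer window, the iteration count is simply the duration divided by the delay. Here is a minimal Python sketch of that arithmetic. The helper name `esxtop_batch_command` and the output file names are made up for illustration, and it only builds the command string; it does not run esxtop.

```python
import math

def esxtop_batch_command(duration_s, delay_s=2, outfile="esxtop-data.csv"):
    """Build an esxtop batch-mode command line that covers duration_s seconds.

    Hypothetical helper: it only formats the command shown in this post.
    """
    if delay_s < 2:
        # The post states that 2 seconds is the lowest delay esxtop accepts.
        raise ValueError("esxtop delay cannot be lower than 2 seconds")
    # Round up so the whole requested window is covered.
    iterations = math.ceil(duration_s / delay_s)
    return f"esxtop -b -a -d {delay_s} -n {iterations} > {outfile}"

# 60 seconds at a 2-second delay gives the 30 iterations used above.
print(esxtop_batch_command(60))
# A full day at the same delay needs 43200 iterations.
print(esxtop_batch_command(24 * 3600, outfile="esxtop-24h.csv"))
```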

For all esxtop parameters see the help output below.

 [root@esx11:~] esxtop -h  
 usage: esxtop [-h] [-v] [-b] [-l] [-s] [-a] [-c config file] [-R vm-support-dir-path]   
         [-d delay] [-n iterations]  
        [-export-entity entity-file] [-import-entity entity-file]   
        -h prints this help menu.  
        -v prints version.  
        -b enables batch mode.  
        -l locks the esxtop objects to those available in the first snapshot.  
        -s enables secure mode.  
        -a show all statistics.  
        -c sets the esxtop configuration file, which by default is .esxtop60rc  
        -R enables replay mode.  
        -d sets the delay between updates in seconds.  
        -n runs esxtop for only n iterations. Use "-n infinity" to run esxtop forever.  
        -----Experimental Features-------------  
        -export-entity writes the entity ids into a file, which can be modified  
         to select interesting entities.  
        -import-entity reads the file of selected entities. If this opion   
         is used, esxtop only shows the data for the selected entities.  

It is important to know that esxtop gives you significantly more statistics than you can see at the vSphere Client level. That's another important benefit of esxtop. But each benefit also has some drawbacks or impact. The impact is that a single esxtop output line can contain several thousand statistic counters. For example, an ESXi 6.0 host with just 2 running VMs in my home lab has 27,314 counters. My customer's production ESXi host has over 330,000 counters! So the output file can get pretty large if you run esxtop for 24 hours. Count on it.
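
To check how many counters your own host produces, you can count the columns in the first line of the batch output. The sketch below assumes the file is a CSV whose first row is a header and whose first column is the timestamp. The sample header is made up to stand in for a real esxtop-data.csv, and `count_counters` is just an illustrative name.

```python
import csv
import io

def count_counters(csv_file):
    """Return the number of counter columns in an esxtop batch-mode CSV."""
    header = next(csv.reader(csv_file))
    # Assumption: the first column is the timestamp, everything else is a counter.
    return len(header) - 1

# Tiny made-up sample; host and counter names are illustrative only.
sample = io.StringIO(
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx11\Memory\Free MBytes","\\esx11\Physical Disk(vmhba0)\Reads/sec"'
    + "\n"
    + '"01/11/2017 10:00:00","1024","12.5"\n'
)
print(count_counters(sample))

# With a real file you would open it instead:
# with open("esxtop-data.csv", newline="") as f:
#     print(count_counters(f))
```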

The file contains very interesting counters. The following counters for physical disk devices are the most interesting:
### Response times
Average Guest MilliSec/Command
Average Kernel MilliSec/Command
Average Queue MilliSec/Command
Average Queue MilliSec/Read
Average Driver MilliSec/Command
Average Driver MilliSec/Write
### Queue
Adapter Q Depth
### IOPS
Reads/sec
Writes/sec
Commands/sec
### MB/s
MBytes Read/sec
MBytes Written/sec
### Split commands
Split Commands/sec
### SCSI Reservations
Reserves/sec
Failed Reserves/sec
Conflicts/sec
### Failures
Failed Commands/sec
Failed Reads/sec
Failed Writes/sec
Failed Bytes Read/sec
Failed Bytes Written/sec
Aborts/sec
Resets/sec
Some of the counters above are not available in the vSphere Client, but the big benefit is that esxtop gives you data in 2-second intervals, which is much better granularity.
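
As a rough illustration of how one of these counters could be pulled out of the output file, here is a minimal Python sketch. It assumes that column names follow the pattern \\host\Group(instance)\Counter and that the first row is the header. The function name `summarize`, the device name and the values are made up, and this is not the script mentioned later in this post.

```python
import csv
import io
from statistics import mean

def summarize(csv_file, group="Physical Disk",
              counter="Average Driver MilliSec/Command"):
    """Return {column name: (mean, max)} for every column matching group/counter."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Assumption: column names look like \\host\Group(instance)\Counter
    prefix = "\\" + group + "("
    suffix = "\\" + counter
    wanted = [i for i, name in enumerate(header)
              if prefix in name and name.endswith(suffix)]
    values = {i: [] for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                values[i].append(float(row[i]))
            except (ValueError, IndexError):
                pass  # skip empty or malformed cells
    return {header[i]: (mean(v), max(v)) for i, v in values.items() if v}

# Made-up two-sample file; the device name "vmhba0-naa.1" is hypothetical.
lines = [
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx11\Physical Disk(vmhba0-naa.1)\Average Driver MilliSec/Command","\\esx11\Physical Disk(vmhba0-naa.1)\Reads/sec"',
    '"01/11/2017 10:00:00","1.0","100"',
    '"01/11/2017 10:00:02","3.0","120"',
]
for name, (avg, peak) in summarize(io.StringIO("\n".join(lines))).items():
    print(name, "avg:", avg, "max:", peak)
```

Matching on the counter name rather than on a column position is deliberate, because the number and order of columns differ from host to host.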

I hear your questions: so what now? How do you analyze the esxtop output file?
Well, you can replay it back in esxtop or you can use any of the following tools

  • VisualEsxtop
  • perfmon
  • Excel
  • esxplot
To be honest, none of the tools above fulfilled my requirements, therefore I'm writing my own Python script for esxtop output analysis.

I will blog about it in the next post, when the script is good enough for public usage and published on GitHub.

Stay tuned.
