Friday, May 03, 2019

Storage and Fabric latencies - difference in order of magnitude

It is well known that the storage industry is undergoing a big transformation. Flash-based SSDs are changing the old storage paradigm and delivering the fast computing required nowadays by modern applications behind digital transformation projects.

So Flash is great, but it is also about the bus and the protocol over which the Flash is connected.
We have traditional storage interface protocols (SCSI, SATA, and SAS), but these were designed for magnetic disks, which is why Flash behind these legacy interfaces cannot leverage the full potential of Flash technology. That is why we have NVMe (a new storage interface protocol over PCIe) and even 3D XPoint memory (Intel Optane).

It is all about latency and available bandwidth. Total throughput depends on the I/O size and the achievable transaction rate (IOPS). The IOPS figures for the storage media below are what a single worker can achieve with a random-access, 100% read, 4 KB I/O workload. Multiple workers can achieve higher total performance, but at the cost of higher latency. The simple model behind these numbers is sketched in the example below.
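To make the arithmetic used in the sections below explicit, here is a minimal Python sketch of that single-worker model. It assumes only the relationship latency ≈ 1 second / IOPS and throughput ≈ IOPS × I/O size; the IOPS figures are the rough numbers used in this post (the NVMe value is implied by the ~80 μs latency quoted below), not measurements.

# Single outstanding synchronous I/O: average latency ~= 1 s / IOPS,
# and throughput ~= IOPS * I/O size. A rough model, not a benchmark.

def latency_ms(iops):
    """Average service time per I/O in milliseconds for one outstanding I/O."""
    return 1000.0 / iops

def throughput_mb_s(iops, io_size_kb=4):
    """Throughput in MB/s for a given IOPS at a given I/O size."""
    return iops * io_size_kb / 1024.0

# Rough single-worker IOPS figures used in this post
devices = {
    "SATA 7.2k RPM HDD": 80,
    "SAS 15k RPM HDD": 200,
    "SAS SSD (mixed use)": 4_000,
    "NVMe SSD": 12_500,      # implied by the ~80 us latency quoted below
}

for name, iops in devices.items():
    print(f"{name:20s} {latency_ms(iops):9.3f} ms per I/O "
          f"{throughput_mb_s(iops):8.2f} MB/s at 4 KB")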

Latency orders of magnitude:
  • ms - milliseconds - 0.001 of a second = 10^-3 s
  • μs - microseconds - 0.000001 of a second = 10^-6 s
  • ns - nanoseconds - 0.000000001 of a second = 10^-9 s
Storage Latencies

SATA - magnetic disk 7.2k RPM ~= 80 I/O per second (IOPS) = 1,000 ms / 80 = 12.5 ms
SAS - magnetic disk 15k RPM ~= 200 I/O per second (IOPS) = 1,000 ms / 200 = 5 ms

SAS - Solid State Disk (SSD), mixed use, SFF ~= 4,000 I/O per second (IOPS) = 1,000 ms / 4,000 = 0.25 ms = 250 μs

NVMe over RoCE - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000 ms / ??? = 0.100 ms = 100 μs
NVMe - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000 ms / ??? = 0.080 ms = 80 μs

DIMM - 3D XPoint memory (Intel Optane) ~= latency of less than 500 ns (0.5 μs)

Ethernet Fabric Latencies

Gigabit Ethernet - 125 MB/s ~= 25-65 μs
10G Ethernet - 1.25 GB/s ~= ? μs (sockets application) / 1.3 μs (RDMA application)
40G Ethernet - 5 GB/s ~= ? μs (sockets application) / 1.3 μs (RDMA application)

InfiniBand and Omni-Path Fabrics Latencies

10Gb/s SDR - 1 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
20Gb/s DDR - 2 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
40Gb/s QDR - 4 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
40Gb/s FDR-10 - 5.16 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
56Gb/s FDR - 6.82 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
100Gb/s EDR - 12.08 GB/s  ~=  1.01 μs (Mellanox ConnectX-4)
100Gb/s Omni-Path - 12.36 GB/s  ~=  1.04 μs (Intel 100G Omni-Path)

RAM Latency

DIMM - DDR4 SDRAM ~=  75 ns (local NUMA access) - 120 ns (remote NUMA access)

Visualization

Latencies are very well visualized in the figure below.


Conclusion

It is good to realize what latencies we should expect from different infrastructure subsystems:
  • RAM ~= 100 ns
  • 3D XPoint memory ~= 500 ns
  • Modern Fabrics ~= 1-4 μs
  • NVMe ~= 80 μs
  • NVMe over RoCE ~= 100 μs
  • SAS SSD ~= 250 μs
  • SATA/SAS magnetic disks ~= 5-12.5 ms
The latency order of magnitude is important for several reasons. Let's focus on one of them - latency monitoring. It has always been a challenge to monitor traditional storage systems, because a 5-minute or even 1-minute sampling interval is simply too long for millisecond latencies, and the average does not tell you anything about microbursts. In lower-latency (μs or even ns) systems, a 5-minute interval is like an eternity. The average, minimum, and maximum over a 5-minute interval might not help you understand what is really happening there. Deeper mathematical statistics would be needed to get real, valuable visibility into telemetry data. Percentiles are good, but histograms can help even more. The small sketch below shows why.
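Just to illustrate the point with made-up numbers (not measurements from any particular device), here is a minimal Python sketch: a 5-minute stream of roughly 20 μs latencies contains a half-second microburst around 2 ms. The average barely moves, while the high percentiles and a simple histogram make the burst obvious.

import random
random.seed(1)

# ~5 minutes of samples at 1,000 IOPS: baseline ~20 us latency, plus a short
# microburst of ~2 ms. All values are purely illustrative.
samples_us = [random.gauss(20, 3) for _ in range(300_000)]
samples_us[150_000:150_500] = [random.gauss(2_000, 200) for _ in range(500)]

samples_us.sort()
n = len(samples_us)

print(f"avg   = {sum(samples_us) / n:8.1f} us")          # barely changes
print(f"p50   = {samples_us[n // 2]:8.1f} us")
print(f"p99   = {samples_us[int(n * 0.99)]:8.1f} us")
print(f"p99.9 = {samples_us[int(n * 0.999)]:8.1f} us")   # the burst shows up here
print(f"max   = {samples_us[-1]:8.1f} us")

# A coarse latency histogram, similar in spirit to vscsiStats buckets
buckets_us = [50, 100, 500, 1_000, 5_000]
counts = [0] * (len(buckets_us) + 1)
for s in samples_us:
    for i, bound in enumerate(buckets_us):
        if s <= bound:
            counts[i] += 1
            break
    else:
        counts[-1] += 1

for bound, count in zip(buckets_us + [float("inf")], counts):
    print(f"<= {bound:>7} us : {count}")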
The Wavefront links above talk mainly about application monitoring, but do we have such telemetry granularity in hardware? Mellanox Spectrum claims Real-time Network Visibility, but it seems to me to be an exception. Intel had an open-source project, "The Snap Telemetry Framework"; however, it appears to have been discontinued. And what about other components? To be honest, I do not know, and it seems to me that real-time visibility is not a big priority for the infrastructure industry. However, operating systems, hypervisors, and software-defined storage could help here. VMware vSphere Performance Manager, available via the vCenter SOAP API, can provide "real-time" monitoring (a minimal query sketch is shown after this paragraph). I put "real-time" in quotes because it provides 20-second samples (min, max, average) for metrics on leaf objects. Is that good enough? Well, not really. It is better than a 5-minute or 1-minute sample, but still very long for sub-millisecond latencies, and minimum, maximum, and average do not carry enough information for some decisions. Histograms could help here. ESXi has the good old tool vscsiStats, which supports latency histograms of I/Os in microseconds (μs) per virtual machine. Unfortunately, there is no officially supported vCenter API for this tool, so it is usually used for short-term manual performance troubleshooting and not for continuous latency monitoring. William Lam has published a blog post and scripts on how to leverage the ESXi API to get vscsiStats histograms. It would be great to be able to get histograms for some objects through vCenter in a supported way and expose such information to external monitoring tools. #FEATURE-REQUEST
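For completeness, here is a minimal pyVmomi sketch of pulling those 20-second "real-time" samples from the vCenter PerformanceManager. The vCenter address, credentials, VM name, and the choice of the virtualDisk.totalReadLatency.average counter are just assumptions for illustration; note that the counter is reported in milliseconds and averaged over each whole 20-second window, which is exactly the granularity problem discussed above.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical connection details - replace with your own environment
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
perf = content.perfManager

# Map "group.name.rollup" -> counter ID, then pick a virtual disk latency counter
counters = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
            for c in perf.perfCounter}
counter_id = counters["virtualDisk.totalReadLatency.average"]

# Find a VM by name (hypothetical name)
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "my-test-vm")

# Query the last ~5 minutes of 20-second "real-time" samples
spec = vim.PerformanceManager.QuerySpec(
    entity=vm,
    metricId=[vim.PerformanceManager.MetricId(counterId=counter_id, instance="*")],
    intervalId=20,   # the real-time sampling interval
    maxSample=15)    # 15 x 20 s = 5 minutes

for series in perf.QueryPerf(querySpec=[spec])[0].value:
    # Each value is an average over a whole 20-second window, in milliseconds
    print(series.id.instance, series.value)

Disconnect(si)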

Hope this is informative and educational.

Other sources:
Performance Characteristics of Common Network Fabrics: https://www.microway.com/knowledge-center-articles/performance-characteristics-of-common-network-fabrics/
Real-time Network Visibility: http://www.mellanox.com/related-docs/whitepapers/WP_Real-time_Network_Visibility.pdf
Johan van Amersfoort and Frank Denneman present a NUMA deep dive: https://youtu.be/VnfFk1W1MqE
Cormac Hogan: Getting Started with vscsiStats: https://cormachogan.com/2013/07/10/getting-started-with-vscsistats/
William Lam: Retrieving vscsiStats Using the vSphere 5.1 API
