Sunday, June 30, 2013

Simple UNIX Shell Script for generating disk IO traffic

Here is a simple UNIX shell script for generating disk I/O load.
#!/bin/sh
# Generate disk I/O load with ten parallel dd writers.
# Note: /dev/random can be very slow on some systems; /dev/urandom or /dev/zero
# may be used instead if higher write throughput is needed.
dd_threads="0 1 2 3 4 5 6 7 8 9"
finish () {
  # stop all dd processes and clean up the temporary files
  killall dd
  for i in $dd_threads
  do
    rm -f /var/tmp/dd.$i.test
  done
  exit 0
}
trap 'finish' INT
while true
do
  # start one background dd writer per "thread" ...
  for i in $dd_threads
  do
    dd if=/dev/random of=/var/tmp/dd.$i.test bs=512 count=100000 &
  done
  # ... and wait for the whole batch to finish before starting the next round,
  # otherwise the loop would spawn an unbounded number of dd processes
  wait
done
The generated IOs (aka TPS, transactions per second) can be watched with the following command:
iostat -d -c 100000
The script can be terminated by pressing CTRL-C.
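If you are on Linux with the sysstat package installed (an assumption on my side, not something from the original script), an extended per-device report refreshed every second may be easier to read:
# extended device statistics, refreshed every second (sysstat syntax)
iostat -d -x 1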

Thursday, June 27, 2013

Calculating optimal segment size and stripe size for storage LUN backing vSphere VMFS Datastore

A colleague of mine (BTW a very good storage expert) asked me what the best segment size is for a storage LUN used as a VMware vSphere datastore (VMFS). Recommendations vary among storage vendors and models, but I think the basic principles are the same for any storage.

I found the explanation in the IBM Redbook [SOURCE: IBM RedBook redp-4609-01] the most descriptive, so here it is.
The term segment size refers to the amount of data that is written to one disk drive in an array before writing to the next disk drive in the array. For example, in a RAID5, 4+1 array with a segment size of 128 KB, the first 128 KB of the LUN storage capacity is written to the first disk drive and the next 128 KB to the second disk drive. For a RAID1, 2+2 array, 128 KB of an I/O is written to each of the two data disk drives and to the mirrors. If the I/O size is larger than the number of disk drives times 128 KB, this pattern repeats until the entire I/O is completed. For very large I/O requests, the optimal segment size for a RAID array is one that distributes a single host I/O across all data disk drives.
The formula for optimal segment size is:
LUN segment size = LUN stripe width ÷ number of data disk drives 
For RAID 5, the number of data disk drives is equal to the number of disk drives in the array minus 1, for example:
RAID5, 4+1 with a 64 KB segment size = (5-1) * 64KB = 256 KB stripe width 
For RAID 1, the number of data disk drives is equal to the number of disk drives divided by 2, for example:
RAID 10, 2+2 with a 64 KB segment size = (2) * 64 KB = 128 KB stripe width 
For small I/O requests, the segment size must be large enough to minimize the number of segments (disk drives in the LUN) that must be accessed to satisfy the I/O request, that is, to minimize segment boundary crossings. 
For IOPS environments, set the segment size to 256KB or larger, so that the stripe width is at least as large as the median I/O size. 
IBM Best practice: For most implementations set the segment size of VMware data partitions to 256KB.
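To make the arithmetic explicit, here is a minimal shell sketch (my own illustration, not part of the Redbook text) that computes the data stripe width from the segment size and the number of data disk drives:
#!/bin/sh
# stripe width = segment size * number of data disk drives
segment_kb=64
for data_drives in 4 2   # RAID 5 (4+1) has 4 data drives, RAID 10 (2+2) has 2
do
  echo "${data_drives} data drives with ${segment_kb}KB segments -> $((segment_kb * data_drives))KB stripe width"
done
This prints 256 KB and 128 KB, matching the two Redbook examples above.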

Note: If I am decoding IBM terminology correctly, the term "stripe width" used by IBM is actually the "data stripe size". We need to clarify the terminology, because the term "stripe width" is normally used for the number of disks in a RAID group. The "data stripe size" is the payload without the parity; the parity is stored in another segment (or segments), depending on the selected RAID level.

To make the terminology clear, I've created a RAID 5 (4+1) segment/stripe visualization depicted below.

RAID 5 (4+1) striping example

Even though I found this IBM description very informative, I'm not sure why they recommend using a 256KB segment size for VMware. It is true that the biggest IO size issued from ESX is by default 32MB, because ESX splits bigger IOs issued by the guest OS into multiple IOs (for more information about the big-IO split see this blog post). However, the most important factor is the IO size issued by the guest OSes. If you want to monitor the max/average/median IO size on ESX, you can use the vscsiStats tool already included in ESXi exactly for this purpose. It can show you histograms, which is really cool (for more information about vscsiStats read this excellent blog post); a sketch of typical usage is shown below. Based on all these assumptions, and also on my own IO size monitoring in the field, it seems to me that the average IO size issued from ESX is usually somewhere between 32 and 64KB. So let's use 64KB as the average data stripe (the IO size issued by the OS). Then, for RAID 5 (4+1), the data stripe will be composed of 4 segments and the optimal segment size in this particular case should be 16KB (64/4).
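For reference, a minimal vscsiStats workflow sketch (run from the ESXi shell; the world-group ID below is a placeholder you have to look up first with the -l option):
# list VMs and their world group IDs
vscsiStats -l
# start collecting statistics for the chosen VM
vscsiStats -s -w <worldgroup-id>
# after the workload has run for a while, print the IO length histogram
vscsiStats -p ioLength -w <worldgroup-id>
# stop the collection
vscsiStats -x -w <worldgroup-id>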

Am I right, or did I miss something? Any comments are welcome and highly appreciated.

Update 2014/01/31:
I discuss this topic very frequently with a colleague of mine who works as a DELL storage specialist. The theory is nice, but only a real test can prove any theory. Recently he performed a set of IOmeter tests against a DELL PV MD3600f, which is actually the same array as the IBM DS3500. He found that performance is optimal (# of IOPS versus response times) when the segment size is as close as possible to the IO size issued by the operating system. So the key takeaway from this exercise is that the optimal segment size for the example above is not 16KB but 64KB. Now I understand IBM's general recommendation (best practice) to use a 256KB segment size for VMware workloads, as this is the biggest segment size that can be chosen. A small worked comparison is shown below.
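A minimal sketch of the arithmetic behind this takeaway (my own illustration, assuming a 64KB guest IO against a RAID 5 4+1 group): with a 16KB segment a single IO is spread across four drives, while with a 64KB segment it is served by a single drive.
#!/bin/sh
# how many data drives does a single 64KB IO touch for a given segment size?
io_kb=64
for segment_kb in 16 64
do
  drives=$(( (io_kb + segment_kb - 1) / segment_kb ))
  echo "${segment_kb}KB segment -> one ${io_kb}KB IO touches ${drives} data drive(s)"
done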

Update 2014/07/23:
After thinking more about this topic, I've realized that the idea of using a segment size bigger than your biggest IO size can make sense for several reasons:

  • each IO gets a single spindle (disk) to handle it, so it uses the queues down the route and is served within the latency of a single spindle, which is the minimum achievable for this single IO, right?
  • a typical virtual infrastructure environment runs several VMs generating many IOs in parallel, depending on the queues available in the guest OS and the ESX-layer disk scheduler settings (see more here on Duncan Epping's blog), so at the end of the day you are able to generate a lot of IOPS from different threads and the load is evenly distributed across the RAID group
However, please note that all this discussion relates to legacy (traditional) storage architectures. Some modern (virtualized) storage arrays do some magic on their controllers, such as I/O coalescing. I/O coalescing is an IO optimization that reorders and merges smaller IO writes into a bigger IO in the controller cache and sends this bigger IO down to the disks. This can significantly change segment size recommendations, so please try to understand your particular storage architecture, or follow the storage vendor's best practices and try to understand the reason behind those recommendations for your particular use case. I remember that EMC CLARiiONs coalesced IOs into 64KB blocks.

Wednesday, June 26, 2013

IOBlazer

IOBlazer is a multi-platform storage stack micro-benchmark. IOBlazer runs on Linux, Windows and OSX and it is capable of generating a highly customizable workload. Parameters like IO size and pattern, burstiness (number of outstanding IOs), burst interarrival time, read vs. write mix, buffered vs. direct IO, etc., can be configured independently. IOBlazer is also capable of playing back VSCSI traces captured using vscsiStats. The performance metrics reported are throughput (in terms of both IOPS and bytes/s) and IO latency.
IOBlazer evolved from a minimalist MS SQL Server emulator which focused solely on the IO component of said workload. The original tool had limited capabilities, as it was able to generate only a very specific workload based on the MS SQL Server IO model (Asynchronous, Un-buffered, Gather/Scatter). IOBlazer now has a far more generic IO model, but two limitations still remain:
  1. The alignment of memory accesses on 4 KB boundaries (i.e., a memory page)
  2. The alignment of disk accesses on 512 B boundaries (i.e., a disk sector).
Both limitations are required by the gather/scatter and un-buffered IO models.
A very useful new feature is the capability to playback VSCSI traces captured on VMware ESX through the vscsiStats utility. This allows IOBlazer to generate a synthetic workload absolutely identical to the disk activity of a Virtual Machine, ensuring 100% experiment repeatability.

TBD - TEST & WRITE REVIEW

PXE Manager for vCenter

PXE Manager for vCenter enables ESXi host state (firmware) management and provisioning. Specifically, it allows:
  • Automated provisioning of new ESXi hosts stateless and stateful (no ESX)
  • ESXi host state (firmware) backup, restore, and archiving with retention
  • ESXi builds repository management (stateless and stateful)
  • ESXi Patch management
  • Multi vCenter support
  • Multi network support with agents (Linux CentOS virtual appliance will be available later)
  • Wake on LAN
  • Hosts memtest
  • vCenter plugin
  • Deploy directly to VMware Cloud Director
  • Deploy to Cisco UCS blades
TBD - TEST & WRITE REVIEW

vBenchmark

vBenchmark provides a succinct set of metrics for your VMware virtualized private cloud. Additionally, if you choose to contribute your metrics to the community repository, vBenchmark allows you to compare your metrics against those of comparable companies in your peer group. The data you submit is anonymized and encrypted for secure transmission.

Key Features:

  • Retrieves metrics across one or multiple vCenter servers
  • Allows inclusion or exclusion of hosts at the cluster level
  • Allows you to save queries and compare over time to measure changes as your environment evolves
  • Allows you to define your peer group by geographic region, industry and company size, to see how you stack up
TBD - TEST & WRITE REVIEW

Tuesday, June 25, 2013

How to create your own vSphere Performance Statistics Collector

StatsFeeder is a tool that enables performance metrics to be retrieved from vCenter and sent to multiple destinations, including 3rd-party systems. The goal of StatsFeeder is to make it easier to collect statistics in a scalable manner. The user specifies the statistics to be collected in an XML file, and StatsFeeder will collect and persist these stats. The default persistence mechanism is comma-separated values, but the user can extend it to persist the data in a variety of formats, including a standard relational database or a key-value store. StatsFeeder is written leveraging significant experience with the performance APIs, allowing the metrics to be retrieved in the most efficient manner possible.
The white paper StatsFeeder: An Extensible Statistics Collection Framework for Virtualized Environments can give you a better understanding of how it works and how to leverage it.




Monday, June 24, 2013

vCenter Single Sign-On Design Decision Point

When you design vSphere 5.1, you have to implement vCenter SSO. Therefore, you have to make a design decision about which SSO mode to choose.

There are actually three available options

  1. Basic
  2. HA (not to be confused with vSphere HA)
  3. Multisite
Justin King wrote an excellent blog post about SSO here, and it is a worthwhile source of information for making the right design decision. I fully agree with Justin and recommend Basic SSO to my customers whenever possible. SSO server protection can be achieved by standard backup/restore methods, and SSO availability can be increased by vSphere HA. All these methods are well known and have been used for a long time.

You have to use Multisite SSO when vCenter Linked Mode is required, but think twice about whether you really need it and whether the benefits outweigh the drawbacks.

Thursday, June 20, 2013

Force10 Open Automation Guide - Configuration and Command Line Reference

This document describes the components and uses of the Open Automation Framework designed to run on the Force10 Operating System (FTOS), including:
• Smart Scripting
• Virtual Server Networking (VSN)
• Programmatic Management
• Web graphic user interface (GUI) and HTTP Server

http://www.force10networks.com/CSPortal20/KnowledgeBase/DOCUMENTATION/CLIConfig/FTOS/Automation_2.2.0_4-Mar-2013.pdf

Tuesday, June 18, 2013

How to – use vmkping to verify Jumbo Frames

Here is a nice blog post about Jumbo Frame configuration on vSphere and how to test that it works as expected. This is BTW an excellent test for Operational Verification (aka a Test Plan). A quick sketch of the test command is shown below.
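A minimal sketch of the verification, assuming a 9000-byte MTU configured end to end (the target address is a placeholder): a ping with an 8972-byte payload and the don't-fragment flag set only succeeds when jumbo frames are passed along the whole path.
# 8972 = 9000 MTU - 20 bytes IP header - 8 bytes ICMP header; -d = do not fragment
vmkping -d -s 8972 <vmkernel-ip-of-remote-host>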

Architectural Decisions

Josh Odgers – VMware Certified Design Expert (VCDX) #90 – is continuously building a database of architectural decisions, available at http://www.joshodgers.com/architectural-decisions/

It is a very nice example of one approach to architecture.
 

Monday, June 17, 2013

PowerCLI One-Liners to make your VMware environment rock out!

Christopher Kusek wrote an excellent blog post about useful PowerCLI scripts that fit on a single line. He calls them one-liners. These one-liners can significantly help you with daily vSphere administration. On top of that, you can very easily learn PowerCLI constructs just by reading these one-liners.

http://www.pkguild.com/2013/06/powercli-one-liners-to-make-your-vmware-environment-rock-out/

Tuesday, June 04, 2013

Software Defined Networking - SDN

SDN is another big topic in the modern virtualized datacenter, so it is worth understanding what it is and how it can help us solve real datacenter challenges.

Brad Hedlund's explanation "What is Network Virtualization"
http://bradhedlund.com/2013/05/28/what-is-network-virtualization/
Brad Hedlund is a very well-known networking expert. He now works for VMware | Nicira, participating in the VMware NSX product, which should be the next network virtualization platform (aka network hypervisor). He is ex-CISCO and ex-DELL | Force10, so there is a big probability he fully understands what is going on.

It is obvious that "dynamic service insertion" is the most important thing in SDN. OpenFlow and CISCO vPath are both trying to do it, but each in a different way: the same goal with a different approach. Which is better? Who knows? The future and real-world experience will show us. Jason Edelman's blog post very nicely and clearly compares both approaches.
http://www.jedelman.com/1/post/2013/04/openflow-vpath-and-sdn.html

CISCO, as a long-term networking leader and pioneer, of course has its own vision of SDN. For CISCO, the Nexus 1000V and virtual network overlays play a pivotal role in Software Defined Networks. A very nice explanation of the CISCO approach is available at
http://blogs.cisco.com/datacenter/nexus-1000v-and-virtual-network-overlays-play-pivotal-role-in-software-defined-networks/