Tuesday, October 15, 2013

iSCSI NetGear datastore issues

Yesterday I had a phone call from my neighbor, who works as a vSphere admin for a local system integrator. He was in the middle of an upgrade from vSphere 4.1 to vSphere 5.5 and had run into trouble.

He decided to move to vSphere 5.5 not by an in-place upgrade but by running two environments side by side: the legacy one (vSphere 4.1) and the new one (vSphere 5.5). Each environment had its own vCenter, and he used one iSCSI datastore connected to both environments as a transfer datastore. He called me because he had problems powering on a particular VM stored on the transfer datastore and registered on an ESXi 5.5 host managed by vCenter 5.5. When the VM power-on was initiated, it ran for some time and then the task failed at, he told me, 25%.

I remember we discussed some time ago whether it is better to use vSphere 5.1 or to go directly to the very new vSphere 5.5. My answer was "it depends", but in the end we agreed that in a small environment it is possible to go directly to vSphere 5.5 and accept some risk. That's the reason why I felt a little bit guilty.

As we are neighbors, he came over to my garden. He smoked several cigarettes, probably to organize his thoughts, and we discussed the potential root cause and other best practices, including migration possibilities. All those general ideas and recommendations were just best practices and hypotheses. In the end we agreed that we had to look at the log files to understand what was really happening and what issue he was experiencing.

I have to say I like troubleshooting ... the first log file to check in such situations is obviously /var/log/vmkernel.log

As he is more Microsoft (GUI) than *nix (CLI) oriented, I walked him through, over the phone, how to enable SSH, log in to ESXi and check the log file.

When we ran the command
tail -f /var/log/vmkernel.log 
the troubleshooting was almost done. Lots of SCSI errors were continuously being logged to vmkernel.log. The SCSI errors included the following useful information:
H:0x0 D:0x2 P:0x0  SCSI sense keys: 0x0B 0x24 0x00
Let's translate the log file information into human language ... the device returned CHECK CONDITION (D:0x2) with the sense key "aborted command" (0x0B), and the additional sense code 0x24 with qualifier 0x00 is defined in the standard SCSI tables as "invalid field in CDB".
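The same translation can be done with a small script. This is only a minimal sketch in Python: the function name `decode_scsi_status` is made up for illustration, the regular expression assumes the log excerpt looks like the one quoted above (real vmkernel.log lines may be worded differently), and the lookup tables cover only a handful of standard SCSI codes, not the full T10 list.

```python
import re

# Standard SCSI device status codes (small subset).
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}

# Standard SCSI sense keys.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Additional sense code / qualifier pairs. Only the pair seen in this
# post is listed; extend the table from the T10 list as needed.
ASC_ASCQ = {
    (0x24, 0x00): "INVALID FIELD IN CDB",
}

# Assumed line shape: "H:<hex> D:<hex> P:<hex> ... sense ...: <key> <asc> <ascq>"
PATTERN = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)"
    r".*?sense[^:]*:\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def decode_scsi_status(line):
    """Return a dict describing the SCSI status in a log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host": host,      # raw host status code, not translated here
        "plugin": plugin,  # raw plugin status code, not translated here
        "device": DEVICE_STATUS.get(device, "unknown"),
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "additional": ASC_ASCQ.get((asc, ascq), "not in this table"),
    }


if __name__ == "__main__":
    sample = "H:0x0 D:0x2 P:0x0  SCSI sense keys: 0x0B 0x24 0x00"
    print(decode_scsi_status(sample))
```

Host and plugin codes are left as raw numbers on purpose, because their meanings are specific to the hypervisor rather than part of the SCSI standard.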
However, the root cause was obvious ... it is a storage-related issue. We tried to create a directory on the affected datastore and it took almost 30 seconds, which supported our assumption of a storage issue. The problematic datastore was backed by iSCSI NetGear storage. The same operation on another datastore, backed by different storage connected directly by SAS, was, of course, immediate.
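The directory-creation check can be turned into a number rather than a feeling. The following is a rough sketch in Python, not something we ran that day: `time_mkdir` is a hypothetical helper, and the datastore paths in the usage note are placeholders. It creates a uniquely named directory, measures how long the creation took, and removes the directory again.

```python
import os
import tempfile
import time


def time_mkdir(parent):
    """Create a scratch directory under parent and return the seconds it took.

    The directory is removed again after the measurement.
    """
    start = time.time()
    path = tempfile.mkdtemp(prefix="latency-test-", dir=parent)
    elapsed = time.time() - start
    os.rmdir(path)
    return elapsed


if __name__ == "__main__":
    # Demonstration against a local temporary directory.
    scratch = tempfile.mkdtemp()
    print("mkdir took %.3f s" % time_mkdir(scratch))
    os.rmdir(scratch)
```

Calling it once for a path on the suspect datastore and once for a path on a known-good datastore (for example `/vmfs/volumes/transfer` and `/vmfs/volumes/local-sas`, both hypothetical names) gives two numbers to compare. A single measurement is noisy, so repeating it a few times is a good idea.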

So I asked him again (we had talked about the HCL at the beginning of our general discussion) whether he had checked the HCL. He confirmed he had, but said he would double-check. An hour later he sent me a message that the storage model is supported but the firmware must be upgraded to work correctly with ESX 5.5.

All my "ad-hoc consulting" was done just as quick help for a friend of mine, so I don't even know which NetGear iSCSI storage my neighbor has, but I will ask him and update this post because it can help other people.

Update 10/16/2013:
I have been informed that the exact NetGear iSCSI storage model is "NetGear Ready NAS 3100". I checked the VMware HCL myself and at the moment it is supported only for ESX 5.1 with firmware RAIDiator-x86 4.2.21. So I warned my neighbor that even if it works with the new firmware, this configuration will be unsupported. Another lesson from this - don't trust anybody and validate everything yourself :-)

So what is the conclusion of this story? Plan, plan and plan again before any vSphere upgrade. Don't just check hardware models on the HCL, check the firmware versions as well. Modern hardware and operating systems (including hypervisors) are very software dependent, so firmware versions matter.
