Wednesday, May 25, 2016

ESXi: How to mask a storage device that is causing issues

I have heard about an issue with ESXi 6 Update 2 and HP 3PAR storage where VVOLs are enabled. I have been told that the issue is caused by an unsupported SCSI command being issued to the PE LUN (256). PE stands for Protocol Endpoint; it is the technical VVOL LUN that provides the data path between the ESXi host and the remote storage system.
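
By the way, a Protocol Endpoint can be identified on an ESXi host by the "Is VVOL PE" field in the device list output (the same field appears in the full device listing later in this post). A minimal sketch:

# Show the device identifier together with the VVOL PE flag;
# a Protocol Endpoint reports "Is VVOL PE: true". Adjust the pattern
# if your devices are named eui.* or mpx.* instead of naa.*:
esxcli storage core device list | grep -E "^naa|Is VVOL PE"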

Observed symptoms:
  • ESXi 6 Update 2 exhibits issues (ESXi disconnects from vCenter, the console is very slow)
  • Hosts may take a long time to reconnect to vCenter after a reboot, or may enter a "Not Responding" state in vCenter Server
  • Storage-related tasks such as an HBA rescan may take a very long time to complete
  • I have been told that ESXi 6 Update 1 does not experience such issues (the entries appear in the log file, but no other symptoms occur)
Below is a snippet from the log file:

 2016-05-18T11:31:27.319Z cpu1:242967)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.2ff70002ac0150c3" - issuing command 0x43a657470fc0  
 2016-05-18T11:31:27.320Z cpu31:33602)WARNING: NMP: nmpCompleteRetryForPath:352: Retry cmd 0x28 (0x43a657470fc0) to dev "naa.2ff70002ac0150c3" failed on path "vmhba0:C0:T2:L256" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.  
 2016-05-18T11:31:27.320Z cpu31:33602)WARNING: NMP: nmp_PathDetermineFailure:2973: Cmd (0x28) PDL error (0x5/0x25/0x0) - path vmhba0:C0:T2:L256 device naa.2ff70002ac0150c3 - triggering path failover  
 2016-05-18T11:31:27.320Z cpu31:33602)WARNING: NMP: nmpCompleteRetryForPath:382: Logical device "naa.2ff70002ac0150c3": awaiting fast path state update before retrying failed command again.  

Possible workarounds
  • Restarting hostd on the ESXi host helps, so SSH to the ESXi hosts was enabled for quick resolution in case of problems (see the command below)
  • LUN masking of LUN 256
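For completeness, a minimal sketch of the hostd restart. It restarts only the management agent; running VMs are not affected, but the host briefly disconnects from vCenter while hostd comes back up.

# Restart the ESXi management agent (hostd) from an SSH session:
/etc/init.d/hostd restart
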
UPDATE 2016-09-30: There is most probably another workaround: changing the Disk.MaxLUN parameter on the ESXi hosts as described in VMware KB 1998 (see the sketch below).
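
A minimal sketch of the Disk.MaxLUN workaround, assuming the PE is presented as LUN 256 and no LUN with an ID above 255 is in use (verify your LUN numbering first; ESXi scans LUN IDs 0 through Disk.MaxLUN minus 1):

# Limit the LUN scan to IDs 0-255 so LUN 256 (the PE) is never scanned:
esxcli system settings advanced set -o /Disk/MaxLUN -i 256
# Verify the new value:
esxcli system settings advanced list -o /Disk/MaxLUN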

Final solution
  • Application of the HP 3PAR firmware patch (unfortunately the patch is not available for the current firmware, so a firmware upgrade has to be planned and executed)
  • Investigation of the root cause why ESXi 6 Update 2 is more sensitive than ESXi 6 Update 1
Immediate steps
  • Application of the workarounds mentioned above

HOME LAB EXERCISE
I have tested in my home lab how to mask a particular LUN on an ESXi host, just to be sure I know how to do it.

Below is a quick solution for impatient readers.

Let's say we have the following device with the following path.
  • Device: naa.6589cfc000000bf5e731ffc99ec35186
  • Path: vmhba36:C0:T0:L1
LUN Masking
esxcli storage core claimrule add -P MASK_PATH -r 500 -t location -A vmhba36 -C 0 -T 0 -L 1
esxcli storage core claimrule load
esxcli storage core claiming reclaim -d naa.6589cfc000000bf5e731ffc99ec35186

LUN Unmasking
esxcli storage core claimrule remove --rule 500
esxcli storage core claimrule load
esxcli storage core claiming unclaim --type=path --path=vmhba36:C0:T0:L1
esxcli storage core claimrule run

... continue reading for details.

LUN MASKING DETAILS
The exact LUN masking procedure is documented in the vSphere 6 documentation. It is also covered in KB articles 1009449 and 1014953.

List storage devices

 [root@esx02:~] esxcli storage core device list  
 naa.6589cfc000000bf5e731ffc99ec35186  
   Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000bf5e731ffc99ec35186)  
   Has Settable Display Name: true  
   Size: 10240  
   Device Type: Direct-Access  
   Multipath Plugin: NMP  
   Devfs Path: /vmfs/devices/disks/naa.6589cfc000000bf5e731ffc99ec35186  
   Vendor: FreeNAS  
   Model: iSCSI Disk  
   Revision: 0123  
   SCSI Level: 6  
   Is Pseudo: false  
   Status: degraded  
   Is RDM Capable: true  
   Is Local: false  
   Is Removable: false  
   Is SSD: true  
   Is VVOL PE: false  
   Is Offline: false  
   Is Perennially Reserved: false  
   Queue Full Sample Size: 0  
   Queue Full Threshold: 0  
   Thin Provisioning Status: yes  
   Attached Filters:  
   VAAI Status: supported  
   Other UIDs: vml.010001000030303530353661386131633830300000695343534920  
   Is Shared Clusterwide: true  
   Is Local SAS Device: false  
   Is SAS: false  
   Is USB: false  
   Is Boot USB Device: false  
   Is Boot Device: false  
   Device Max Queue Depth: 128  
   No of outstanding IOs with competing worlds: 32  
   Drive Type: unknown  
   RAID Level: unknown  
   Number of Physical Drives: unknown  
   Protection Enabled: false  
   PI Activated: false  
   PI Type: 0  
   PI Protection Mask: NO PROTECTION  
   Supported Guard Types: NO GUARD SUPPORT  
   DIX Enabled: false  
   DIX Guard Type: NO GUARD SUPPORT  
   Emulated DIX/DIF Enabled: false
  
 naa.6589cfc000000ac12355fe604028bf21  
   Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000ac12355fe604028bf21)  
   Has Settable Display Name: true  
   Size: 10240  
   Device Type: Direct-Access  
   Multipath Plugin: NMP  
   Devfs Path: /vmfs/devices/disks/naa.6589cfc000000ac12355fe604028bf21  
   Vendor: FreeNAS  
   Model: iSCSI Disk  
   Revision: 0123  
   SCSI Level: 6  
   Is Pseudo: false  
   Status: degraded  
   Is RDM Capable: true  
   Is Local: false  
   Is Removable: false  
   Is SSD: true  
   Is VVOL PE: false  
   Is Offline: false  
   Is Perennially Reserved: false  
   Queue Full Sample Size: 0  
   Queue Full Threshold: 0  
   Thin Provisioning Status: yes  
   Attached Filters:  
   VAAI Status: supported  
   Other UIDs: vml.010002000030303530353661386131633830310000695343534920  
   Is Shared Clusterwide: true  
   Is Local SAS Device: false  
   Is SAS: false  
   Is USB: false  
   Is Boot USB Device: false  
   Is Boot Device: false  
   Device Max Queue Depth: 128  
   No of outstanding IOs with competing worlds: 32  
   Drive Type: unknown  
   RAID Level: unknown  
   Number of Physical Drives: unknown  
   Protection Enabled: false  
   PI Activated: false  
   PI Type: 0  
   PI Protection Mask: NO PROTECTION  
   Supported Guard Types: NO GUARD SUPPORT  
   DIX Enabled: false  
   DIX Guard Type: NO GUARD SUPPORT  
   Emulated DIX/DIF Enabled: false  

So we have two devices with the following NAA IDs:
  • naa.6589cfc000000bf5e731ffc99ec35186
  • naa.6589cfc000000ac12355fe604028bf21
Now let's list the paths of both iSCSI devices.

[root@esx02:~] esxcli storage nmp path list
iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000bf5e731ffc99ec35186
   Runtime Name: vmhba36:C0:T0:L1
   Device: naa.6589cfc000000bf5e731ffc99ec35186
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000bf5e731ffc99ec35186)
   Group State: active
   Array Priority: 0
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}
   Path Selection Policy Path Config: {current path; rank: 0}

iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000ac12355fe604028bf21
   Runtime Name: vmhba36:C0:T0:L2
   Device: naa.6589cfc000000ac12355fe604028bf21
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000ac12355fe604028bf21)
   Group State: active
   Array Priority: 0
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}
   Path Selection Policy Path Config: {current path; rank: 0}

Let's mask the iSCSI device exposed as LUN 1.
So the path we want to mask is vmhba36:C0:T0:L1 and the device UID is naa.6589cfc000000bf5e731ffc99ec35186.

So let's create a masking rule for the path above. In this particular case we have just a single path, because this home lab setup has only one path to the iSCSI target. In a real environment there are usually multiple paths, and all of them should be masked (see the multipath sketch after the commands below).

 esxcli storage core claimrule add -P MASK_PATH -r 500 -t location -A vmhba36 -C 0 -T 0 -L 1
 esxcli storage core claimrule load  
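
In a real multipath environment the same approach applies, one rule per path. A hypothetical sketch (the adapter and target numbers are made up for illustration; list your actual paths with esxcli storage nmp path list first):

# Hypothetical example: mask LUN 256 reachable via two HBAs and two targets.
# Each path gets its own MASK_PATH rule with a unique rule ID:
esxcli storage core claimrule add -P MASK_PATH -r 500 -t location -A vmhba1 -C 0 -T 0 -L 256
esxcli storage core claimrule add -P MASK_PATH -r 501 -t location -A vmhba1 -C 0 -T 1 -L 256
esxcli storage core claimrule add -P MASK_PATH -r 502 -t location -A vmhba2 -C 0 -T 0 -L 256
esxcli storage core claimrule add -P MASK_PATH -r 503 -t location -A vmhba2 -C 0 -T 1 -L 256
esxcli storage core claimrule load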

We can list our claim rules to see the result

 [root@esx02:~] esxcli storage core claimrule list  
 Rule Class   Rule  Class    Type       Plugin     Matches                                    XCOPY Use Array Reported Values  XCOPY Use Multiple Segments  XCOPY Max Transfer Size
 ----------  -----  -------  ---------  ---------  -----------------------------------------  -------------------------------  ---------------------------  -----------------------
 MP              0  runtime  transport  NMP        transport=usb                              false                            false                                              0
 MP              1  runtime  transport  NMP        transport=sata                             false                            false                                              0
 MP              2  runtime  transport  NMP        transport=ide                              false                            false                                              0
 MP              3  runtime  transport  NMP        transport=block                            false                            false                                              0
 MP              4  runtime  transport  NMP        transport=unknown                          false                            false                                              0
 MP            101  runtime  vendor     MASK_PATH  vendor=DELL model=Universal Xport          false                            false                                              0
 MP            101  file     vendor     MASK_PATH  vendor=DELL model=Universal Xport          false                            false                                              0
 MP            500  runtime  location   MASK_PATH  adapter=vmhba36 channel=0 target=0 lun=1   false                            false                                              0
 MP            500  file     location   MASK_PATH  adapter=vmhba36 channel=0 target=0 lun=1   false                            false                                              0
 MP          65535  runtime  vendor     NMP        vendor=* model=*                           false                            false                                              0

We can see that the new claim rule (500) is present in the configuration file (/etc/vmware/esx.conf) and also loaded in runtime.

However, to actually mask the device without rebooting the ESXi host, we have to reclaim the device.

 [root@esx02:~] esxcli storage core claiming reclaim -d naa.6589cfc000000bf5e731ffc99ec35186  

The device disappears from the ESXi host immediately; a host reboot is not needed.
So we are done. The device is no longer visible to the ESXi host.
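
To double-check, we can query the masked device directly; the lookup should now fail because MASK_PATH owns the path:

# Expected to fail with an error that the device cannot be found,
# since the device is now masked:
esxcli storage core device list -d naa.6589cfc000000bf5e731ffc99ec35186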

Note: I was unsuccessful when testing LUN masking with a local device. Therefore I assume that LUN masking works only with remote disks (iSCSI, Fibre Channel).

LUN UNMASKING
Just in case you would like to unmask the device and use it again, here is the procedure.

Let's start by removing the claim rule for our previously masked path.

 [root@esx02:~] esxcli storage core claimrule remove --rule 500  
 [root@esx02:~] esxcli storage core claimrule list  
 Rule Class   Rule  Class    Type       Plugin     Matches                                    XCOPY Use Array Reported Values  XCOPY Use Multiple Segments  XCOPY Max Transfer Size
 ----------  -----  -------  ---------  ---------  -----------------------------------------  -------------------------------  ---------------------------  -----------------------
 MP              0  runtime  transport  NMP        transport=usb                              false                            false                                              0
 MP              1  runtime  transport  NMP        transport=sata                             false                            false                                              0
 MP              2  runtime  transport  NMP        transport=ide                              false                            false                                              0
 MP              3  runtime  transport  NMP        transport=block                            false                            false                                              0
 MP              4  runtime  transport  NMP        transport=unknown                          false                            false                                              0
 MP            101  runtime  vendor     MASK_PATH  vendor=DELL model=Universal Xport          false                            false                                              0
 MP            101  file     vendor     MASK_PATH  vendor=DELL model=Universal Xport          false                            false                                              0
 MP            500  runtime  location   MASK_PATH  adapter=vmhba36 channel=0 target=0 lun=1   false                            false                                              0
 MP          65535  runtime  vendor     NMP        vendor=* model=*                           false                            false                                              0
 [root@esx02:~]   

You can see that the rule has been removed from the configuration file, but it is still loaded in runtime. We have to reload the claim rules from the file into runtime.

 [root@esx02:~] esxcli storage core claimrule load  
 [root@esx02:~] esxcli storage core claimrule list  
 Rule Class   Rule  Class    Type       Plugin     Matches                             XCOPY Use Array Reported Values  XCOPY Use Multiple Segments  XCOPY Max Transfer Size
 ----------  -----  -------  ---------  ---------  ----------------------------------  -------------------------------  ---------------------------  -----------------------
 MP              0  runtime  transport  NMP        transport=usb                       false                            false                                              0
 MP              1  runtime  transport  NMP        transport=sata                      false                            false                                              0
 MP              2  runtime  transport  NMP        transport=ide                       false                            false                                              0
 MP              3  runtime  transport  NMP        transport=block                     false                            false                                              0
 MP              4  runtime  transport  NMP        transport=unknown                   false                            false                                              0
 MP            101  runtime  vendor     MASK_PATH  vendor=DELL model=Universal Xport   false                            false                                              0
 MP            101  file     vendor     MASK_PATH  vendor=DELL model=Universal Xport   false                            false                                              0
 MP          65535  runtime  vendor     NMP        vendor=* model=*                    false                            false                                              0
 [root@esx02:~]   

Here we go. Now there is no rule with ID 500.

But the device is still not visible, and we cannot execute the command
esxcli storage core claiming reclaim -d naa.6589cfc000000bf5e731ffc99ec35186
because the device is not visible to the ESXi host. We masked it, right? So this is exactly how it should behave.

An ESXi host reboot would probably help, but can we do it without rebooting?
The answer is yes, we can.
We have to unclaim the path to our device and re-run the claim rules.

 esxcli storage core claiming unclaim --type=path --path=vmhba36:C0:T0:L1  
 esxcli storage core claimrule run  
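
Should the paths not reappear right away, an extra rescan of all adapters may help (an optional step, not strictly required):

# Optional: rescan all storage adapters to rediscover the unmasked paths:
esxcli storage core adapter rescan --all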

Now we can see both paths to the iSCSI LUNs again.

 [root@esx02:~] esxcli storage nmp path list  
 iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000bf5e731ffc99ec35186  
   Runtime Name: vmhba36:C0:T0:L1  
   Device: naa.6589cfc000000bf5e731ffc99ec35186  
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000bf5e731ffc99ec35186)  
   Group State: active  
   Array Priority: 0  
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}  
   Path Selection Policy Path Config: {current path; rank: 0}  
 iqn.1998-01.com.vmware:esx02-096fde38-00023d000001,iqn.2005-10.org.freenas.ctl:test,t,257-naa.6589cfc000000ac12355fe604028bf21  
   Runtime Name: vmhba36:C0:T0:L2  
   Device: naa.6589cfc000000ac12355fe604028bf21  
   Device Display Name: FreeNAS iSCSI Disk (naa.6589cfc000000ac12355fe604028bf21)  
   Group State: active  
   Array Priority: 0  
   Storage Array Type Path Config: {TPG_id=1,TPG_state=AO,RTP_id=3,RTP_health=UP}  
   Path Selection Policy Path Config: {current path; rank: 0}  

I hope this helps other VMware users who need LUN masking / unmasking.

2 comments:

Anonymous said...

What effect would masking off this LUN have for VVOL operation on the ESXi host?

David Pasek said...

This particular customer doesn't want to use VVOLs, but the storage system exposes the Protocol Endpoint (LUN 256) by default and it cannot be disabled on the storage side. Therefore masking can help, and it should not have any impact on the customer because they are not using VVOLs at the moment.