VCDX #200 Blog of one VMware Infrastructure Designer: December 2017

Thursday, December 21, 2017

SDRS Initial Placement - interim storage lease between recommendation and provisioning

Every day we learn something new. In the past, I blogged about SDRS behavior in several blog posts.

Recently (a few months ago), I was informed about an interesting SDRS behavior which is not exposed through the standard GUI or advanced settings but is available through the API. This functionality was not very well known even within VMware, so I have decided to blog about it.

Long story short ...

The vSphere API call for SDRS initial placement can lease the recommended storage resources for some time.

What does it mean? Just after making recommendations, SDRS can lease the storage space on the recommended datastores to hold an interim reservation for somebody who is, most probably, going to do provisioning. By default, SDRS does not lease storage space on recommended datastores; therefore, you can observe provisioning failures in some situations. I have simulated such a situation in Test #3 of the test plan available here. Such situations are not very common when you do manual provisioning, but the probability is higher when automated provisioning is in use, so you can experience such issues in environments with VMware vRealize Automation (vRA) or vCloud Director (vCD).

And now the secret I did not know ... SDRS has had a solution for such issues since vSphere 5.1. When somebody (vRA, vCD, or anybody else who wants to deploy a VM) asks for an SDRS recommendation via an API call, that call can include a specific parameter (resourceLeaseDurationSec) which instructs vSphere to block the recommended storage space on the datastores solely for the provisioning of that specific SDRS recommendation. It is worth mentioning that the lease is released immediately after provisioning; therefore, the time defined in resourceLeaseDurationSec is actually the maximum reservation time of the resource, just in case whoever wanted to do the provisioning changes their mind and decides not to deploy the VM. This avoids unnecessary storage space reservations.
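To make the lease idea more tangible, here is a minimal Python sketch. It is only a toy model and not the SDRS implementation: the Datastore class, its methods and the numbers are all hypothetical, and the only thing borrowed from the real API is the idea of a lease duration in seconds.

```python
from dataclasses import dataclass, field


@dataclass
class Datastore:
    """Toy datastore (hypothetical model, not a vSphere object)."""
    name: str
    capacity_gb: int
    used_gb: int = 0
    # lease_id -> (reserved GB, expiry time in seconds)
    leases: dict = field(default_factory=dict)

    def free_gb(self, now: float) -> int:
        """Free space after subtracting used space and unexpired leases."""
        # Drop leases whose maximum reservation time has passed.
        self.leases = {k: v for k, v in self.leases.items() if v[1] > now}
        reserved = sum(size for size, _ in self.leases.values())
        return self.capacity_gb - self.used_gb - reserved

    def recommend(self, lease_id: str, size_gb: int, now: float,
                  lease_duration_sec: int = 0) -> bool:
        """Say whether size_gb fits; optionally hold the space for the caller."""
        if self.free_gb(now) < size_gb:
            return False
        if lease_duration_sec > 0:
            self.leases[lease_id] = (size_gb, now + lease_duration_sec)
        return True

    def provision(self, lease_id: str, size_gb: int, now: float) -> bool:
        """Consume space; the caller's own lease is released immediately."""
        available = self.free_gb(now)  # also drops expired leases
        own = self.leases.get(lease_id)
        if own is not None:
            available += own[0]  # our own lease does not count against us
        if available < size_gb:
            return False
        self.leases.pop(lease_id, None)
        self.used_gb += size_gb
        return True


# Without a lease, two requesters both get a positive recommendation
# for the same 60 GB, and the second provisioning fails.
ds = Datastore("ds1", capacity_gb=100)
print(ds.recommend("A", 60, now=0), ds.recommend("B", 60, now=1))
print(ds.provision("A", 60, now=2), ds.provision("B", 60, now=3))
```

With lease_duration_sec set, the second requester would already be refused at recommendation time, and an unused lease would simply expire after the given number of seconds.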

If you want to know the details, check the API documentation. Here is what the vSphere API documentation says about placeSpec.resourceLeaseDurationSec: Resource lease duration in seconds. If the duration is within bounds, Storage DRS will hold onto resources needed for applying recommendations generated as part of that call. Only initial placement recommendations generated by storage DRS can reserve resources this way.

The resourceLeaseDurationSec parameter is used in StoragePlacementSpec, which encapsulates all of the inputs passed to the VcStorageResourceManager method recommendDatastores. VcStoragePlacementSpec is documented here.

So that sounds good, right? Well, there is one issue with this approach. SDRS can give the provisioning application several recommendations (multiple datastores), which would lead to blocking more storage space than is really needed. VMware engineering is aware of this issue and at the moment is working at least with the vRA BU to solve it. As far as I know, the final solution will be a special SDRS setting to return a single recommendation. However, this is planned as a specific integration optimization between SDRS and vRA provisioning.

UPDATE: details about the special SDRS setting are described at https://www.vcdx200.com/2018/06/undocumented-sdrs-advanced-options.html

The challenge with vRA storage (vSphere vDisk) provisioning and SDRS is depicted in the figure below.

To be honest, there is another design consideration and potential risk associated with this solution. If resourceLeaseDurationSec is used and an external application (vRA, vCloud Director, or other) uses it incorrectly, it can eventually block storage space in the datastore cluster and cause a Denial of Service (DoS). Incorrect usage would be to ask SDRS for recommendations that block the recommended storage space but then not provision anything; the storage would stay blocked for the defined time and would not be available for other provisioning until the lease expires.

Tuesday, December 19, 2017

What ESXi command will create kernel panic and result in a PSOD?

This is a very short post, but I want to publish it at least for myself so that I can find this trick much more quickly next time.

Sometimes, especially during testing of vSphere HA, it can be useful to simulate a PSOD (Purple Screen of Death). I did some googling and found the article "What ESXi command will create kernel panic and result in a PSOD?". Long story short, a PSOD can be triggered by the following ESXi console command:

vsish -e set /reliability/crashMe/Panic 1

Of course, you have to SSH to the particular ESXi host before you can run the command above.

All credit goes to the IT Pro Today article available here http://www.itprotoday.com/virtualization/q-what-esxi-command-will-create-kernel-panic-and-result-psod

Sunday, December 17, 2017

No Storage, No vSphere, No Datacenter

In the past, I have had a lot of discussions with different customers and partners about various storage issues with VMware vSphere. The problem was always identified as a physical storage or SAN issue, and the VMware support recommendation was to contact the particular storage vendor. That was always a true and correct recommendation. However, such storage issues always have a catastrophic, or at least huge, impact not only on the virtualized workloads running on the impacted datastore but also on the manageability of VMware vSphere, because intensive ESXi logging affects the hostd and vpxa services and ends up with the ESXi host disconnecting from vCenter and with very slow direct manageability of the ESXi host. Such issues should be resolved by fixing the storage issue, but in the meantime vSphere admins do not have visibility into part of, or even the whole, vSphere environment. Therefore, they usually restart the impacted ESXi hosts, which has a negative impact on the availability of VMs running even on datastores that are not impacted. Users usually classify such situations as a whole datacenter outage. You can imagine how heated the discussions between VMware and storage teams are in such situations, and I often use the generic expression ...

"NO STORAGE, NO DATACENTER"

Well, there is no doubt that storage is the most important piece of the datacenter. The VMware ESXi hypervisor is usually just a storage I/O passthrough component with some additional intelligence like

native storage multipathing (NMP),
fair storage I/O scheduling (SIOC),
I/O filtering (VAIO),
etc.

And probably due to such additional intelligence, VMware customers usually expect that VMware vSphere will do some magic to mitigate physical storage or SAN related issues. First of all, it is logical and obvious that VMware vSphere cannot solve the issue of physical infrastructure. However, there can be some specific scenarios when the storage device is not available through one path but available via another path.

In this blog post, I would like to share my recent findings about storage issues and VMware native multipathing. Let's start with the visualization of storage multipathing over a Fibre Channel SAN. Usually, there are two independent SANs (A and B). Each ESXi HBA is connected to a different SAN. From the storage point of view, each storage controller (two storage controllers are depicted in the figure below) is connected to a different SAN through different storage front-end ports. An HBA port is a storage initiator, and storage front-end ports are usually storage targets.

The I/O sent from ESXi hosts to their assigned logical unit numbers (LUNs) travels through a specific route that starts with an HBA and ends at a LUN. This route is referred to as a path. Each host, in a properly designed infrastructure, should have more than one path to each LUN. VMware generally recommends four storage paths, but the optimal number of paths depends on the particular storage architecture. In the figure above, we have the following four paths to LUN 1:

vmhba0:C0:T0:L1
vmhba0:C0:T2:L1
vmhba1:C0:T1:L1
vmhba1:C0:T3:L1
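The path names above follow the adapter:channel:target:LUN convention of ESXi runtime names. As a small illustration, the Python sketch below splits the four names into their components and groups the targets by HBA. The StoragePath type and parse_path function are hypothetical helper names of mine, not part of any VMware tooling.

```python
import re
from typing import NamedTuple


class StoragePath(NamedTuple):
    adapter: str   # HBA, e.g. vmhba0
    channel: int   # C number
    target: int    # T number (storage front-end port)
    lun: int       # L number


_PATH_RE = re.compile(r"^(vmhba\d+):C(\d+):T(\d+):L(\d+)$")


def parse_path(name: str) -> StoragePath:
    """Split a runtime path name such as vmhba0:C0:T0:L1 into its parts."""
    match = _PATH_RE.match(name)
    if match is None:
        raise ValueError(f"not a runtime path name: {name!r}")
    adapter, channel, target, lun = match.groups()
    return StoragePath(adapter, int(channel), int(target), int(lun))


names = ["vmhba0:C0:T0:L1", "vmhba0:C0:T2:L1",
         "vmhba1:C0:T1:L1", "vmhba1:C0:T3:L1"]
paths = [parse_path(n) for n in names]

# Group the targets reachable through each HBA.
targets_by_adapter = {}
for p in paths:
    targets_by_adapter.setdefault(p.adapter, []).append(p.target)

print(targets_by_adapter)  # {'vmhba0': [0, 2], 'vmhba1': [1, 3]}
```

Each HBA reaches two different targets, and all four paths end at the same LUN number.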

Note: The storage system usually exports multiple LUNs with additional paths, but we use a single LUN (LUN 1) here for simplicity.

The ESXi host sees LUN 1 as four independent devices (volumes), but because ESXi has a native multipathing driver, these four devices are identified as the same LUN (LUN 1 in our case). ESXi therefore automatically collapses them into a single device with four independent paths. Storage I/Os to such a device are distributed across multiple paths based on the multipathing policy. ESXi has three native multipathing policies:

fixed (FIXED),
most recently used (aka MRU) and
round robin (RR).

The multipathing policy type dictates how multiple I/Os are distributed across the available paths, but once a single I/O is sent through a particular path, it sticks to that path until the path is claimed as dead. A single I/O flow is depicted below.

The most commonly used SCSI commands are

Inquiry (Requests general information of the target device)
Test Unit Ready aka TUR (Checks whether the target device is ready for the transfer operation)
Read (Transfers data from the SCSI target device)
Write (Transfers data to the SCSI target device)
Request Sense (Requests the sense data of the last command)
Read Capacity (Requests the storage capacity information)

If the LUN accepts the SCSI command, everything is great and shiny. However, when a LUN at the end of a storage path experiences some problems, the ESXi host sends the Test Unit Ready (TUR) command to the storage target (the particular storage front-end port) to confirm that the path to the LUN is down before initiating a path failover. When ESXi receives any TUR response from the storage system, it considers the path up and running; the storage system repeatedly returns a retry request, and no failover is triggered even though the TUR returns error responses and the LUN is effectively not ready. A typical TUR SCSI command response should be "TEST_UNIT_READY", but in case of a problem the storage system returns the following responses:

SCSI_HOST_BUS_BUSY 0x02
SCSI_HOST_SOFT_ERROR 0x0b
SCSI_HOST_RETRY 0x0c

The particular I/O flow happens over a single selected path, and VMware native multipathing will not try another path even if there is some probability that the LUN could be ready via another path. Let me say it again. The default behavior is that ...

... the storage path does not fail over when the path to the target is up and sending responses back to the initiator, even if the LUN is not available for whatever reason.

The reason for such conservative vSphere behavior is that an enterprise storage system and SAN should simply work, and storage vendors claim storage availability higher than 99.999%. Multipathing usually solves problems with the path to the storage system (to the storage target ports) but not problems on the storage system itself (LUN unavailability for unknown reasons). I personally believe that the physical storage system has other ways to tell the ESXi host that a particular path is not available at the moment and to instruct the ESXi multipathing driver not to issue I/Os via that path, if it is necessary and the storage system has no other way to deliver the I/O to the LUN. However, the reality is that on some storage systems a LUN is not available through one path (TUR returns errors) but works via another path. This is a typical interoperability issue. However, I have just been informed that there is a way to resolve it: you can use the enable_action_OnRetryErrors option.

What is the advanced option enable_action_OnRetryErrors?

This option allows the ESXi host to mark a problematic path as dead. After marking the path as dead, the host can trigger a failover and use an alternative working path. I assume that if the LUN is not available via any path, all paths will be claimed as dead until the LUN works again. See VMware KB 2106770 (Storage path does not fail over when TUR command repeatedly returns retry requests) for instructions on how to enable or disable the option.
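The difference between the default behavior and the behavior with the option enabled can be sketched in a few lines of Python. This is a simplified model of my understanding described above, not the real NMP logic: the issue_io function is hypothetical, and the "OK" and "NO_RESPONSE" strings are placeholders of mine for a healthy TUR response and for a target that does not answer at all.

```python
# TUR statuses that ask the initiator to retry (names taken from the list above).
RETRY_STATUSES = {"SCSI_HOST_BUS_BUSY", "SCSI_HOST_SOFT_ERROR", "SCSI_HOST_RETRY"}


def issue_io(paths, tur_status, action_on_retry_errors):
    """Return (path that served the I/O or None, list of paths marked dead).

    tur_status maps a path name to "OK", "NO_RESPONSE" or a retry status.
    """
    dead = []
    for path in paths:
        status = tur_status[path]
        if status == "OK":
            return path, dead
        if status == "NO_RESPONSE":
            # Classic case: the target port is silent, so the path fails over.
            dead.append(path)
            continue
        if status in RETRY_STATUSES and not action_on_retry_errors:
            # Default: the target answers, the path counts as up, and the
            # I/O keeps retrying on it. No failover happens.
            return None, dead
        # Option enabled: the problematic path is marked dead, try the next one.
        dead.append(path)
    return None, dead


paths = ["vmhba0:C0:T0:L1", "vmhba0:C0:T2:L1",
         "vmhba1:C0:T1:L1", "vmhba1:C0:T3:L1"]
status = {p: "OK" for p in paths}
status["vmhba0:C0:T0:L1"] = "SCSI_HOST_RETRY"

print(issue_io(paths, status, action_on_retry_errors=False))
print(issue_io(paths, status, action_on_retry_errors=True))
```

In the first call the I/O stays stuck on the first path, while in the second call the first path is marked dead and the I/O is served by the second path.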

Now you can ask when a storage path claimed as dead will become active again once the LUN is back and available. All paths claimed as dead are periodically evaluated. The Fibre Channel path state is evaluated at a fixed interval or when there is an I/O error and TUR returns nothing, which is not our case here. The path evaluation interval is defined in seconds via the advanced configuration option Disk.PathEvalTime. The default value is 300 seconds. This means that the path state is evaluated every 5 minutes unless an error is reported sooner on that path, in which case the path state might change depending on the interpretation of the reported error. However, I have been told that this standard disk path evaluation DOES NOT return a path to the active state when it was claimed as dead by the OnRetryErrors action. My understanding of the reason for such behavior is that the storage path had some errors, so it is better not to put it back into production, to avoid a flip-flop situation.

Let me stress again that such intelligent and proactive failover behavior based on TUR responses is not the default one, at least not in vSphere 6.5 and below. There are some rumors that it can change in the next vSphere release, but there has not been any official messaging so far. I personally think that more intelligent behavior is better for VMware customers, who usually expect such cleverness from vSphere and are negatively surprised by how vSphere behaves in case of storage issues on some paths. So the intelligent and proactive failover behavior based on TUR responses can be an additional piece of cleverness in VMware vSphere native multipathing. However, it is important to say that it would help only with a few specific behaviors or misbehaviors of some storage systems, and the basic rule is still valid ... "NO STORAGE, NO DATACENTER".

Disclaimer: This is my current understanding of how vSphere ESXi handles storage I/O, based on my long experience in the field, tests in the lab, design and implementation projects, and knowledge I have gained from the documentation, VMware KBs and books. If you want to know more, please check the relevant references below and do your own research. I do not know whether my understanding of this topic is complete or whether I have misunderstood something. Therefore, please leave any feedback in the comments and we can discuss it further, because only deep constructive discussions lead to further knowledge.

References:

Khalil, Mostafa : Storage Implementation in vSphere 5.0 (VMware Press)
Mao Tao (IBM) : Tour the Linux generic SCSI driver
VMware KB : Storage path does not fail over when TUR command repeatedly returns retry requests (2106770)
VMware Documentation : No Failover for Storage Path When TUR Command Is Unsuccessful
VMware KB : ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios (67006)

Tuesday, December 12, 2017

Start order of software services in VMware vCenter Server Appliance 6.5 U1

In the past, I documented the start order of services in VMware vCenter Server Appliance 6.0 U2.

Back then, I simply stopped all services in VCSA, started them again and documented the order.

Commands to do that are
service-control --stop --all
service-control --start --all

I did the same in vCenter Server Appliance 6.5 U1, and the services were started in the following order ...

lwsmd (Likewise Service Manager)
vmafdd (VMware Authentication Framework)
vmdird (VMware Directory Service)
vmcad (VMware Certificate Service)
vmware-sts-idmd (VMware Identity Management Service)
vmware-stsd (VMware Security Token Service)
vmdnsd (VMware Domain Name Service)
vmware-psc-client (VMware Platform Services Controller Client)
vmon (VMware Service Lifecycle Manager)

I was very surprised that there were no other services like vmware-vpostgres, vpxd, etc. I found out that the rest of the VCSA services are started by the vmon service. To understand their start order, we have to stop these services and start them again:

/usr/lib/vmware-vmon/vmon-cli --batchstop ALL

/usr/lib/vmware-vmon/vmon-cli --batchstart ALL

vmon-cli does not report anything to standard output, but it is very verbose in the log file located at /var/log/vmware/vmon/vmon-syslog.log, so a grep of the log can help to understand the start order of vmon-controlled services.

 root@vc01 [ /var/log/vmware/vmon ]# grep "Executing op START on service" vmon-sys  
 17-12-12T09:44:23.639142+00:00 notice vmon Executing op START on service eam...  
 17-12-12T09:44:23.643113+00:00 notice vmon Executing op START on service cis-license...  
 17-12-12T09:44:23.643619+00:00 notice vmon Executing op START on service rhttpproxy...  
 17-12-12T09:44:23.644161+00:00 notice vmon Executing op START on service vmonapi...  
 17-12-12T09:44:23.644704+00:00 notice vmon Executing op START on service statsmonitor...  
 17-12-12T09:44:23.645413+00:00 notice vmon Executing op START on service applmgmt...  
 17-12-12T09:44:26.076456+00:00 notice vmon Executing op START on service sca...  
 17-12-12T09:44:26.139508+00:00 notice vmon Executing op START on service vsphere-client...  
 17-12-12T09:44:26.199049+00:00 notice vmon Executing op START on service cm...  
 17-12-12T09:44:26.199579+00:00 notice vmon Executing op START on service vsphere-ui...  
 17-12-12T09:44:26.200095+00:00 notice vmon Executing op START on service vmware-vpostgres...  
 17-12-12T09:45:33.427357+00:00 notice vmon Executing op START on service vpxd-svcs...  
 17-12-12T09:45:33.431203+00:00 notice vmon Executing op START on service vapi-endpoint...  
 17-12-12T09:46:54.874107+00:00 notice vmon Executing op START on service vpxd...  
 17-12-12T09:47:28.148275+00:00 notice vmon Executing op START on service sps...  
 17-12-12T09:47:28.169502+00:00 notice vmon Executing op START on service content-library...  
 17-12-12T09:47:28.176130+00:00 notice vmon Executing op START on service vsm...  
 17-12-12T09:47:28.195833+00:00 notice vmon Executing op START on service updatemgr...  
 17-12-12T09:47:28.206981+00:00 notice vmon Executing op START on service pschealth...  
 17-12-12T09:47:28.220975+00:00 notice vmon Executing op START on service vsan-health...

eam (VMware ESX Agent Manager)
cis-license (VMware License Service)
rhttpproxy (VMware HTTP Reverse Proxy)
vmonapi
statsmonitor
applmgmt
sca (VMware Service Control Agent)
vsphere-client
cm (Component Manager / Content Library Service)
vsphere-ui
vmware-vpostgres (VMware Postgres)
vpxd-svcs
vapi-endpoint
vpxd (VMware vCenter Server)
sps (VMware vSphere Profile-Driven Storage Service)
content-library
vsm
updatemgr
pschealth
vsan-health
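The same list can also be extracted from the log programmatically. Below is a minimal Python sketch which assumes the log line format shown in the grep output above; the start_order function and the regular expression are my own illustration, not a VMware tool.

```python
import re

# Matches lines like:
# 17-12-12T09:44:23.639142+00:00 notice vmon Executing op START on service eam...
START_RE = re.compile(
    r"^\s*(\S+) notice vmon Executing op START on service (\S+?)\.\.\.\s*$"
)


def start_order(log_lines):
    """Return service names in the order vmon started them."""
    order = []
    for line in log_lines:
        match = START_RE.match(line)
        if match:
            order.append(match.group(2))
    return order


# A few lines copied from the output above; in real use you would
# iterate over the opened log file instead.
sample = [
    " 17-12-12T09:44:23.639142+00:00 notice vmon Executing op START on service eam...",
    " 17-12-12T09:44:26.200095+00:00 notice vmon Executing op START on service vmware-vpostgres...",
    " 17-12-12T09:45:33.427357+00:00 notice vmon Executing op START on service vpxd-svcs...",
    " 17-12-12T09:46:54.874107+00:00 notice vmon Executing op START on service vpxd...",
]

print(start_order(sample))  # ['eam', 'vmware-vpostgres', 'vpxd-svcs', 'vpxd']
```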

I hope this helps other folks in the VMware community.

References

Brian Graf : Restarting vCenter Services in vSphere 6.5

Friday, December 01, 2017

vSphere Switch Independent Teaming or LACP?

I have answered this question a lot of times during the last couple of years, so I have finally decided to write a blog post on this topic. Unfortunately, the answer always depends on specific factors (requirements and constraints) of the particular environment, so do not expect a short answer. Instead of a simple answer, I will compare LBT and LACP.

I assume you (my reader) are familiar with LACP, but do you know what LBT is? If not, here is a short explanation.

VMware LBT (Load Based Teaming) is an advanced switch independent teaming policy available on the VMware DVS. It pins each VM vNIC to a particular physical uplink in round-robin fashion, but if the network traffic on a particular physical NIC is higher than 75% of its total bandwidth over 30 seconds, it initiates rebalancing across the available physical uplinks (physical NICs of the ESXi host) to avoid network congestion on that uplink.
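The two steps described above (round-robin pinning and rebalancing above the 75% threshold) can be sketched in Python as follows. This is only a toy model: the function names are hypothetical, the loads are assumed to be averages over the 30-second window, and the choice to move the busiest vNIC to the least loaded uplink is my assumption, not the documented VMware algorithm.

```python
def initial_pinning(vnics, uplinks):
    """Pin each vNIC to an uplink in round-robin order."""
    return {vnic: uplinks[i % len(uplinks)] for i, vnic in enumerate(vnics)}


def rebalance(pinning, vnic_mbps, uplinks, capacity_mbps, threshold=0.75):
    """Move vNICs away from uplinks loaded above threshold * capacity.

    vnic_mbps holds the average load of each vNIC over the observation
    window. Returns a new pinning dictionary.
    """
    pinning = dict(pinning)
    limit = threshold * capacity_mbps
    load = {u: 0.0 for u in uplinks}
    for vnic, uplink in pinning.items():
        load[uplink] += vnic_mbps[vnic]

    for src in uplinks:
        if load[src] <= limit:
            continue
        # vNICs on the congested uplink, busiest first (assumed strategy).
        movers = sorted((v for v, u in pinning.items() if u == src),
                        key=lambda v: vnic_mbps[v], reverse=True)
        for vnic in movers:
            dst = min(uplinks, key=lambda u: load[u])
            if dst == src or load[dst] + vnic_mbps[vnic] > limit:
                continue  # moving this vNIC would not help
            pinning[vnic] = dst
            load[src] -= vnic_mbps[vnic]
            load[dst] += vnic_mbps[vnic]
            if load[src] <= limit:
                break
    return pinning


uplinks = ["vmnic0", "vmnic1"]
pinning = initial_pinning(["vm1", "vm2", "vm3", "vm4"], uplinks)
load = {"vm1": 5000, "vm2": 1000, "vm3": 4000, "vm4": 500}  # Mbps
print(pinning)
print(rebalance(pinning, load, uplinks, capacity_mbps=10000))
```

In this example vmnic0 carries 9000 Mbps of a 10 Gb link, which is above the 75% limit, so vm1 is moved to vmnic1 and both uplinks end up below the threshold.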

If you are not familiar with basic VMware vSphere networking, read my previous blog post "Back to the basics - VMware vSphere networking" before continuing.

What we are comparing here is switch independent teaming and LACP. LACP is a capability of the VMware Distributed Virtual Switch (VDS); therefore, I assume you have the vSphere Enterprise Plus license and a VDS. If you have a VDS, my other assumption is that you are already considering LBT, as it is the best of the switch independent teaming algorithms available on VDS.

LBT versus LACP comparison

Option 1: Switch Independent Teaming (LBT - Load Based Teaming)
Option 2: LACP

LBT advantages

Fully independent of upstream physical switches
Simple configuration
Beacon probing can be used. Note: Beacon probing requires at least 3 physical NICs.

LBT disadvantages

A single VM cannot handle traffic higher than the bandwidth of a single physical NIC.
Traffic is load-balanced across links from the ESXi perspective (egress traffic), but only at VM NIC granularity, and returning traffic (ingress traffic) is forwarded over the same link as egress traffic.

LACP advantages

One of the main LACP advantages is the continuous heartbeat between the two sides of the link (ESXi physical NIC port and switch port). VMware's LACP sends LACPDUs every 30 seconds, but it can be reconfigured to fast mode, in which LACPDUs are exchanged every 1 second. This improves failover in case of link failure and also helps when link status (up/down) detection does not work well.
A single VM can, in theory, handle higher traffic than a single physical NIC because of the load-balancing algorithm.
Traffic can be load-balanced from both sides of the link (virtual link channel, port-channel, etc.): from the ESXi perspective by ESXi, and from the switch perspective by the load-balancing set on the switch side. Proper configuration on both sides is required.
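To illustrate why a single VM can use more than one physical NIC only "in theory", here is a toy hash-based link selection in Python. The lag_member function is hypothetical and uses a simple XOR of source and destination IP addresses; real LACP implementations offer several hash algorithms, and which one applies depends on the configuration on each side.

```python
import ipaddress


def lag_member(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Toy IP hash: XOR both addresses, then take modulo the link count."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_links


vm_ip = "10.0.0.10"
for peer in ("10.0.0.20", "10.0.0.21"):
    # The same source/destination pair always maps to the same link.
    print(peer, "-> link", lag_member(vm_ip, peer, n_links=2))
```

In this sketch, traffic from one VM to two different peers lands on two different links, so the VM can exceed the bandwidth of one NIC in aggregate, while each individual source/destination pair still stays on a single link.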

LACP disadvantages

ESXi Network Dump Collector does not work if the Management vmkernel port has been configured to use EtherChannel/LACP
VMware vSphere beacon probing cannot be used
LACP is not supported with software iSCSI port binding.
LACP support settings are not available in host profiles.

CONCLUSION AND ANSWER

So which option is better? Well, it depends.

When you do not have direct or indirect control of the physical network infrastructure, switch independent teaming is generally a much simpler and safer solution; therefore, LBT is the better choice.

If you trust your network vendor's LACP implementation and you have some control over, or trust in, your physical switch configuration, LACP is the better choice because of LACPDU heartbeating and multiple load-balancing hash algorithms, which can, in theory, improve network bandwidth for a single VM's network traffic and can be configured on both sides of the link channel. Another advantage is that LACP works better with multi-chassis LAG (MLAG) technologies like Cisco vPC, Dell Force10 VLT, Arista MLAG, etc. Generally, multi-chassis LAG "orphan ports" (ports without LACP) are not recommended by MLAG switch vendors because they do not have control of the end-point.

So the final decision is, as always, up to you, but this blog post should help you make the right decision for your specific environment.

Any other opinions, advantages, disadvantages, and ideas are welcome, so do not hesitate to write a comment.

****************************************************************

References to other resources:

[1] Check "Limitations of the LACP Support on a vSphere Distributed Switch" in the documentation here.

FAQ related to LBT and LACP comparison

Q: VMware's LACP sends LACPDUs every 30 seconds. Is there any way to configure the LACPDU frequency to 1 second?

A: Yes.

You can use the command "esxcli network vswitch dvs vmware lacp timeout set". It allows you to set advanced timeout settings for LACP.

Description:

set ... Set long/short timeout for vmnics in one LACP LAG

Cmd options:

-l|--lag-id= The ID of LAG to be configured. (required)

-n|--nic-name= The nic name. If it is set, then only this vmnic in the lag will be configured.

-t|--timeout Set long or short timeout: 1 for short timeout and 0 for long timeout. (required)

-s|--vds= The name of VDS. (required)

Relevant blog post on this topic "VMware vSphere DVS LACP timers".
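If you need to run this on many hosts, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the options listed above; the lacp_timeout_cmd function is a hypothetical helper, and the VDS name, LAG ID and vmnic name used in the example are made-up values.

```python
import shlex


def lacp_timeout_cmd(vds, lag_id, fast, nic_name=None):
    """Build the esxcli argument list for setting the LACP timeout.

    fast=True maps to -t 1 (short timeout), fast=False to -t 0 (long timeout).
    """
    argv = ["esxcli", "network", "vswitch", "dvs", "vmware", "lacp",
            "timeout", "set",
            "-s", vds,
            "-l", str(lag_id),
            "-t", "1" if fast else "0"]
    if nic_name:
        # Optional: limit the change to a single vmnic in the LAG.
        argv += ["-n", nic_name]
    return argv


# Hypothetical VDS name, LAG ID and vmnic, for illustration only.
print(shlex.join(lacp_timeout_cmd("DVS01", 1, fast=True)))
print(shlex.join(lacp_timeout_cmd("DVS01", 1, fast=False, nic_name="vmnic2")))
```

The resulting string can then be executed in an SSH session on the ESXi host.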

Q: Can ESXi display the LACP settings of an established LACP session on a particular ESXi host? Something like "show lacp" on a Cisco switch?

A: Yes. You can use the command "esxcli network vswitch dvs vmware lacp status get". It should be equivalent to "show lacp" on a Cisco physical switch.

Q: How does VMware vSwitch Beacon Probing work?

A: Read the following blog posts

Q: What is the Beacon Probing interval?

A: 1 second

Q: Does ESXi beacon probing send beacons to every VLAN?

A: Yes, but only to VLANs (portgroups) where at least one VM is connected. It does not make sense to test failure on VLANs where nothing is connected.
