VCDX #200 Blog of one VMware Infrastructure Designer: July 2016

Monday, July 25, 2016

How to read BIOS settings from HP server

Sometimes it is pretty handy to be able to read the BIOS settings of a modern HP server. Let's assume your server has an out-of-band remote management card (aka HP iLO).

HP iLO 4 and above supports a RESTful API. Here is a snippet from the "HPE iLO 4 User Guide".

iLO RESTful API

iLO 4 2.00 and later includes the iLO RESTful API. The iLO RESTful API is a management interface that server management tools can use to perform server configuration, inventory, and monitoring via iLO. A REST client, such as the RESTful Interface Tool, sends HTTPS operations to the iLO web server to GET and PATCH JSON-formatted data, and to configure supported iLO and server settings, such as the UEFI BIOS settings.

So you can use REST API calls directly or, if you like PowerShell, you can simplify things with the ready-made HP cmdlets.

The following PowerShell code should show the power versus performance profile configured for the system.

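 # iLO address must be a quoted string  
 $ilo = "192.168.0.100"  
 # Open a BIOS connection through iLO and keep the connection object  
 $bios = Connect-HPBIOS $ilo -Username "username" -Password "password"  
 # Query the power profile over that connection, then close it  
 Get-HPBIOSPowerProfile -Connection $bios  
 Disconnect-HPBIOS -Connection $bios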
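
If you prefer to call the REST API directly instead of installing the cmdlets, the same information can be fetched with curl. The following is only a sketch under assumptions that are not part of the snippet above: that your iLO firmware exposes the BIOS settings at /redfish/v1/systems/1/bios/, and that the power profile is reported in an attribute named PowerProfile. The helper names ilo_bios_url and extract_power_profile are hypothetical. Check the API reference for your iLO version before relying on it.

```shell
#!/bin/sh
# Sketch only: the URI and the "PowerProfile" attribute name are assumptions,
# verify them against the API reference for your iLO firmware.

# Build the URL of the (assumed) BIOS settings resource for a given iLO address.
ilo_bios_url() {
    printf 'https://%s/redfish/v1/systems/1/bios/\n' "$1"
}

# Read a JSON document on stdin and print the value of "PowerProfile", if present.
extract_power_profile() {
    sed -n 's/.*"PowerProfile"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example call (not executed here; -k skips verification of the usually
# self-signed iLO certificate, -u passes the iLO credentials):
#   curl -sk -u username:password "$(ilo_bios_url 192.168.0.100)" | extract_power_profile
```

The text extraction with sed keeps the sketch free of extra dependencies; if jq is available on your management station it is the more robust way to read the JSON.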

Other sources:

RESTful Application Programming Interface (API) - Redfish 1.0 Conformance
PowerShell Gallery - HPRESTCmdlets
GitHub : HewlettPackard/PowerShell-ProLiant-SDK
HP releases BIOS PowerShell cmdlets

Sunday, July 24, 2016

ESXi PSOD and HeartbeatPanicTimeout

A Purple Screen of Death (PSOD) is a diagnostic screen with white type on a purple background that is displayed when the VMkernel of an ESX/ESXi host experiences a critical error, becomes inoperative and terminates any virtual machines that are running. For more info look here.

Nobody is happy to see a PSOD on an ESXi host, but it is important to say that it is just another safety mechanism to protect your server workloads, because a PSOD is intentionally initiated by the ESXi vmkernel when something really bad happens at a low level. It is usually related to a hardware, firmware or driver issue. You can find further information in the VMware KB article - Interpreting an ESX/ESXi host purple diagnostic screen (1004250).

The main purpose of this blog post is to explain the timing of a PSOD for just a single type of error message - "Lost heartbeat". If no heartbeat is received within a certain time interval, the PSOD looks like the screenshot below.

no heartbeat

There is no doubt that something serious has happened in the ESXi vmkernel. However, regardless of what exactly happened, the following two vSphere advanced settings control the time interval within which a heartbeat must be received; otherwise a PSOD is triggered.

ESXi - Misc.HeartbeatPanicTimeout
VPXD (aka vCenter) - vpxd.das.heartbeatPanicMaxTimeout

Let's start with the ESXi advanced setting Misc.HeartbeatPanicTimeout. It defines the interval in seconds after which the vmkernel panics if no heartbeat is received. Please don't confuse this "Panic Heartbeat" with the "HA network heartbeat". These two heartbeats are very different. The "HA network heartbeat" is the heartbeat mechanism between HA cluster members (master<->slaves) over the Ethernet network, whereas the "Panic Heartbeat" is a heartbeat inside a single ESXi host between the vmkernel and COS software components. You can see the "Panic Heartbeat" setting by issuing the following esxcli command

esxcli system settings advanced list | grep -A10 /Misc/HeartbeatPanicTimeout

 [root@esx01:~] esxcli system settings advanced list | grep -A10 /Misc/HeartbeatPanicTimeout  
   Path: /Misc/HeartbeatPanicTimeout  
   Type: integer  
   Int Value: 14  
   Default Int Value: 14  
   Min Value: 1  
   Max Value: 86400  
   String Value:  
   Default String Value:  
   Valid Characters:  
   Description: Interval in seconds after which to panic if no heartbeats received
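
If you only need the current value, for example to compare several hosts, the listing above can be reduced to a single number. This is a minimal sketch that assumes the output layout shown above; the function name current_int_value is hypothetical.

```shell
#!/bin/sh
# Print the current "Int Value" field from `esxcli system settings advanced list`
# output supplied on stdin. The "Default Int Value" line is skipped because
# its first word is "Default", not "Int".
current_int_value() {
    awk '$1 == "Int" && $2 == "Value:" { print $3; exit }'
}

# Example call on an ESXi host (not executed here):
#   esxcli system settings advanced list -o /Misc/HeartbeatPanicTimeout | current_int_value
```

The -o option limits the listing to a single setting, so the grep -A10 used above is not needed in that case.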

My tests show that Misc.HeartbeatPanicTimeout has different values in different situations. The default value is always 14 seconds, but

if you have a single standalone ESXi host that is not a member of an HA cluster, the effective value is 900 seconds
if your ESXi host is a member of a vSphere HA cluster, the value is 14 seconds

So now we know that the value on an ESXi host with HA enabled is 14 seconds (panicTimeoutMS = 14000), and it usually works without any problem. However, if you decide, for whatever reason, to change this value, it is worth knowing that on an HA-enabled ESXi host the HA code applies a hardcoded cap of 60 seconds to it. It is a cap, so it does not change the value if it is already less than 60. However, if you set, for example, the value 900, it will be capped to 60 seconds anyway. I tested this in vSphere 6/ESXi 6 and it works exactly like that, and I assume it works the same way in vSphere 5/ESXi 5.
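
The capping rule can be written down in a few lines. This is only an illustration of the behavior observed in the test above, not the actual HA code; the function name effective_panic_timeout and its yes/no argument are hypothetical.

```shell
#!/bin/sh
# Sketch of the capping rule as described in this post (not actual HA code).
HA_PANIC_TIMEOUT_CAP=60   # cap in seconds observed on HA-enabled hosts

# Usage: effective_panic_timeout CONFIGURED_SECONDS HA_ENABLED(yes|no)
effective_panic_timeout() {
    configured=$1
    ha_enabled=$2
    if [ "$ha_enabled" = "yes" ] && [ "$configured" -gt "$HA_PANIC_TIMEOUT_CAP" ]; then
        echo "$HA_PANIC_TIMEOUT_CAP"   # a value above the cap is reduced to the cap
    else
        echo "$configured"             # at or below the cap, or no HA: unchanged
    fi
}
```

For example, effective_panic_timeout 900 yes prints 60, while effective_panic_timeout 14 yes prints 14.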

Side note: It was very different in vSphere 4/ESXi 4, because the HA cluster was rewritten from scratch in vSphere 5, but that is history now and I hope nobody uses vSphere 4 anymore.

Behavior justification:
The behavior described in the paragraph above makes perfect sense if you ask me. If you have a standalone ESXi host and you are experiencing some hardware issue, it is better to wait 900 seconds (15 minutes) before ESXi goes into the PSOD state, because the virtual machines running on top of this ESXi host cannot be automatically restarted on other ESXi hosts anyway. And guess what, if an ESXi host has a significant hardware failure, it most probably has a negative impact on the virtual machines running on top of this particular ESXi host, right? Unfortunately, if you have just a single ESXi host, vSphere cannot do anything for you.

On the other hand, if the affected ESXi host is a member of a vSphere HA cluster, then it is better to wait only 14 seconds (by default), or at most 60 seconds, and put the ESXi host into PSOD sooner, because the HA cluster will restart the affected virtual machines automatically. That helps to mitigate the risk of unavailable virtual machines and, with them, the application services running inside these virtual machines.

So that's the explanation of how the ESXi setting /Misc/HeartbeatPanicTimeout behaves. Now we can look at what the vpxd.das.heartbeatPanicMaxTimeout setting is. My understanding is that vpxd.das.heartbeatPanicMaxTimeout is the vCenter (VPXD) global configuration for the ESXi advanced setting Misc.HeartbeatPanicTimeout. But don't forget that the HA cluster caps the Misc.HeartbeatPanicTimeout value on ESXi hosts as described above.

You can read further details about vpxd.das.heartbeatPanicMaxTimeout in VMware KB 2033250, but I think that the following description is a little bit misleading.

"This option impacts how long it takes for a host impacted by a PSOD to release file locks and hence allow HA to restart virtual machines that were running on it. If not specified, 60s is used. HA sets the host Misc.HeartbeatPanicTimeout advanced option to the value of this HA option. The HA option is in seconds."

My understanding is that the description should be reworded to something like ...

"This option is in seconds and impacts how long it takes for an ESXi host experiencing some critical issue to go into a PSOD. vpxd.das.heartbeatPanicMaxTimeout is a global vCenter setting for the ESXi advanced option Misc.HeartbeatPanicTimeout on managed hosts; however, Misc.HeartbeatPanicTimeout is adjusted automatically in certain situations.
On a standalone ESXi host 900s is used. On an ESXi host in a vSphere HA cluster it is automatically changed to 14s and capped to a maximum of 60s. This setting has an indirect impact on when file locks are released and hence when the HA cluster can restart the virtual machines that were running on the affected ESXi host."

Potential side effects and impacts

ESXi HA Cluster restart of virtual machines - if your Misc.HeartbeatPanicTimeout is set to 60 seconds, then the HA cluster will most probably try to restart VMs on other ESXi hosts, because the network heartbeat (also 14 seconds) will not be received. However, because the host is not yet in PSOD, the file locks still exist and the VM restart will be unsuccessful.
ESXi Host Profiles - if you use the same host profile for both HA-protected and non-protected ESXi hosts, then the compliance check can report a difference in Misc.HeartbeatPanicTimeout.


Friday, July 15, 2016

DELL Force10 : DNS, Time and Syslog server configuration

It is generally good practice to have time synchronized on all network devices and to configure remote logging (syslog) to a centralized syslog server for proper troubleshooting and problem management. Force10 switches are no exception, so let's configure time synchronization and remote logging to my central syslog server - VMware Log Insight in my case.

I would like to use hostnames instead of IP addresses so let's start with DNS resolution, continue with time settings and finalize the mission with remote syslog configuration.

Below are my environment details:

My DNS server is 192.168.4.21
DNS domain name is home.uw.cz
I will use the following three internet NTP servers/pools - ntp.cesnet.cz, ntp.gts.cz and cz.pool.ntp.org
My syslog server is at syslog.home.uw.cz

Step 1/ DNS resolution configuration

f10-s60#conf
f10-s60(conf)#ip name-server 192.168.4.21
f10-s60(conf)#ip domain-name home.uw.cz
f10-s60(conf)#ip domain-lookup
f10-s60(conf)#exit

Don't forget to configure "ip domain-lookup" because it is the command which enables domain name resolution.

Now let's test name resolution by pinging www.google.com

f10-s60#ping www.google.com
Translating "www.google.com"
...domain server (192.168.4.21) [OK]
Type Ctrl-C to abort.
Sending 5, 100-byte ICMP Echos to 172.217.16.164, timeout is 2 seconds:
!!!!!
Success rate is 100.0 percent (5/5), round-trip min/avg/max = 40/44/60 (ms)

We should also test a local hostname in its long (fully qualified) format

f10-s60#ping esx01.home.uw.cz
Translating "esx01.home.uw.cz"
...domain server (192.168.4.21) [OK]
Type Ctrl-C to abort.
Sending 5, 100-byte ICMP Echos to 192.168.4.101, timeout is 2 seconds:
!!!!!
Success rate is 100.0 percent (5/5), round-trip min/avg/max = 0/0/0 (ms)

and short format

f10-s60#ping esx01
Translating "esx01"
...domain server (192.168.4.21) [OK]
Type Ctrl-C to abort.
Sending 5, 100-byte ICMP Echos to 192.168.4.101, timeout is 2 seconds:
!!!!!
Success rate is 100.0 percent (5/5), round-trip min/avg/max = 0/0/0 (ms)

Step 2/ Set current date, time and NTP synchronization

You have to decide whether you want to use GMT or local time. The hardware time should always be set to GMT, and you can configure a timezone and summer-time on top of it if you wish. So let's set the GMT time first.

f10-s60#calendar set 15:12:46 july 15 2016

and test it

f10-s60#sho calendar
15:12:39 Fri Jul 15 2016

Ok, so hardware time is set correctly to GMT.

If you really want to play with timezone and summer-time, you can do it in conf mode with the following commands.

f10-s60(conf)#clock ?
summer-time Configure summer (daylight savings) time
timezone Configure time zone

I prefer to keep GMT time everywhere because it, in my opinion, simplifies troubleshooting, problem management and capacity planning.
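
The commands above set the clock by hand but do not show the NTP part. As a sketch only, the small shell helper below prints the lines one would expect to enter in conf mode for the three NTP servers from the environment details. It assumes that FTOS accepts ntp server followed by a hostname; the function name ftos_ntp_config is hypothetical, so check the command reference for your FTOS version first.

```shell
#!/bin/sh
# Print one "ntp server" configuration line per argument.
# Assumption: FTOS accepts "ntp server <hostname>" in conf mode.
ftos_ntp_config() {
    for server in "$@"; do
        printf 'ntp server %s\n' "$server"
    done
}

# Generate the lines for the three servers used in this environment.
ftos_ntp_config ntp.cesnet.cz ntp.gts.cz cz.pool.ntp.org
```

Hostnames only work here because DNS resolution was configured in Step 1.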

Step 3/ Configuration of remote logging

By default, FTOS doesn't put the date and time on log messages. It uses uptime (time since the last boot), so you can only see when something happened relative to the last system boot. However, because we already have the time configured properly, it is a good idea to change this default behavior to use date and time.

f10-s60(conf)#service timestamps log datetime

To be honest, you generally don't need the date and time on log messages, because the remote syslog server adds its own date and time to them, but I prefer to have both times - the time from the device and the time when the message arrived at the syslog server. If you want to disable time stamping on syslog messages, use no service timestamps [log | debug].

And now, finally, let's configure the remote syslog server with a single configuration command

f10-s60(conf)#logging syslog.home.uw.cz
Translating "syslog.home.uw.cz"
Translating "syslog.home.uw.cz"
...domain server (192.168.4.21) [OK]

And we are done. Now you can see incoming log messages on your syslog server. See the screenshot of my VMware Log Insight syslog server.

VMware Log Insight with Force10 log messages.

Hope you find it useful and as always - any comment is much appreciated.
