Tuesday, February 11, 2020

Host cannot communicate with one or more other nodes in the vSAN enabled cluster

I work as a VMware HCI Specialist, so I do a lot of vSAN testing and demonstrations in my home lab. The only reasonable way to test and demonstrate different vSAN configurations and topologies effectively is to run vSAN in a nested environment. Thanks to nested virtualization, I can build any type of vSAN cluster very easily and quickly.

Recently I ran into an issue in a 3-node (nested) vSAN cluster. The vSAN datastore showed the capacity of just a single node instead of three, and the hosts reported the error message "Host cannot communicate with one or more other nodes in the vSAN enabled cluster".

My first idea was a networking issue, but ping between the nodes worked fine, so it was not a physical network problem. This is a lab environment, so all services (management, vMotion, vSAN) are enabled on a single VMkernel interface (vmk0) and everything is pretty straightforward.

So what's the problem?

I did some Google searching and found that other people had seen the same error message when they had problems with vSAN unicast agents.

Here is the command to list the unicast agents on a vSAN node:

esxcli vsan cluster unicastagent list

I ran it in my environment.
Grrrr. The list is empty! It is empty on all ESXi hosts in my 3-node vSAN cluster.

Let's try to configure it manually.

Each vSAN node should have a connection to agents on other vSAN nodes in the cluster.

For example, each node in a 4-node vSAN cluster should list 3 unicast agents, one per peer:

 [root@n-esx04:~] esxcli vsan cluster unicastagent list  
 NodeUuid               IsWitness Supports Unicast IP Address    Port Iface Name Cert Thumbprint  
 ------------------------------------ --------- ---------------- -------------- ----- ---------- -----------------------------------------------------------  
 5e3ec640-c033-7c7d-888f-00505692f54d     0       true 192.168.11.105 12321       18:F3:B7:9F:66:C4:C4:3E:0F:7D:69:BB:55:92:BC:A3:AC:E4:DD:5F  
 5df792b0-f49f-6d76-45af-005056a89963     0       true 192.168.11.107 12321       20:4C:C1:48:F5:2D:04:16:55:F1:D3:F1:4C:26:B5:C4:23:E5:B4:12  
 5e3e467a-1c1b-f803-3d0f-00505692ddc7     0       true 192.168.11.106 12321       53:99:00:B8:9D:1A:97:42:C0:10:C0:AF:8C:AD:91:59:22:8E:C9:79  
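The "one entry per peer" rule above can be checked mechanically. The following is only a minimal sketch: the function names count_unicast_agents and check_unicast_agents are hypothetical helpers, not part of esxcli, and the sketch assumes the column layout shown in the listing above, where every agent row starts with a node UUID.

```shell
# count_unicast_agents: reads `esxcli vsan cluster unicastagent list` output
# on stdin and prints the number of agent rows (lines that start with a UUID).
# The header and the dashed separator line are not counted.
count_unicast_agents() {
  awk '/^[ \t]*[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+[ \t]/ { n++ }
       END { print n + 0 }'
}

# check_unicast_agents CLUSTER_SIZE: reads the same output on stdin and
# succeeds only when there is exactly one agent entry per peer node.
check_unicast_agents() {
  expected=$(( $1 - 1 ))
  actual=$(count_unicast_agents)
  echo "unicast agents: $actual (expected $expected)"
  [ "$actual" -eq "$expected" ]
}

# Possible usage on a host in a 3-node cluster:
#   esxcli vsan cluster unicastagent list | check_unicast_agents 3
```

With an empty list, as in my broken cluster, the check would report 0 agents and fail.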

We need to get the local UUID of each cluster node.

 [root@n-esx08:~] esxcli vsan cluster get  
 Cluster Information  
   Enabled: true  
   Current Local Time: 2020-02-11T08:32:55Z  
   Local Node UUID: 5df792b0-f49f-6d76-45af-005056a89963  
   Local Node Type: NORMAL  
   Local Node State: MASTER  
   Local Node Health State: HEALTHY  
   Sub-Cluster Master UUID: 5df792b0-f49f-6d76-45af-005056a89963  
   Sub-Cluster Backup UUID:  
   Sub-Cluster UUID: 52c99c6b-6b7a-3e67-4430-4c0aeb96f3f4  
   Sub-Cluster Membership Entry Revision: 0  
   Sub-Cluster Member Count: 1  
   Sub-Cluster Member UUIDs: 5df792b0-f49f-6d76-45af-005056a89963  
   Sub-Cluster Member HostNames: n-esx08.home.uw.cz  
   Sub-Cluster Membership UUID: f8d4415e-aca5-a597-636d-005056997c1d  
   Unicast Mode Enabled: true  
   Maintenance Mode State: ON  
   Config Generation: 7ef88f9d-a402-48e3-8d3f-2c33f951fce1 6 2020-02-10T21:58:16.349  
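When repeating this on several hosts, it can be handy to pull single values out of that output instead of reading the whole block. This is a small sketch under the assumption that the output keeps the "Name: value" layout shown above; vsan_field is a hypothetical helper name, not an esxcli feature.

```shell
# vsan_field NAME: reads `esxcli vsan cluster get` output on stdin and prints
# the value of the first "NAME: value" line, with surrounding blanks trimmed.
vsan_field() {
  awk -v name="$1" -F': ' '
    { key = $1; sub(/^[ \t]+/, "", key) }
    key == name { val = $2; sub(/[ \t]+$/, "", val); print val; exit }
  '
}

# Possible usage on a host:
#   esxcli vsan cluster get | vsan_field "Local Node UUID"
#   esxcli vsan cluster get | vsan_field "Sub-Cluster Member Count"
```

The two fields worth comparing across hosts are the local node UUID and the member count.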

So here are my nodes with their IP addresses and vSAN node UUIDs:
n-esx08 - 192.168.11.108 - 5df792b0-f49f-6d76-45af-005056a89963
n-esx09 - 192.168.11.109 - 5df792b0-f49f-6d76-45af-005056a89963
n-esx10 - 192.168.11.110 - 5df792b0-f49f-6d76-45af-005056a89963

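And now the problem is clear. All vSAN nodes have the same UUID, which would explain why each host sees only itself as a cluster member (note "Sub-Cluster Member Count: 1" in the output above).
Why? Let's check the ESXi system UUID on each ESXi host.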

 [root@n-esx08:~] esxcli system uuid get  
 5df792b0-f49f-6d76-45af-005056a89963  
 [root@n-esx08:~]  

 [root@n-esx09:~] esxcli system uuid get  
 5df792b0-f49f-6d76-45af-005056a89963  
 [root@n-esx09:~]  

 [root@n-esx10:~] esxcli system uuid get  
 5df792b0-f49f-6d76-45af-005056a89963  
 [root@n-esx10:~]  
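With more than a handful of hosts, comparing UUIDs by eye gets error-prone. The sketch below flags duplicates from a list of "hostname uuid" pairs. The function name find_duplicate_uuids and the input format are my own assumptions for illustration, and how the pairs are collected (for example over SSH) is left open.

```shell
# find_duplicate_uuids: reads "hostname uuid" pairs on stdin and prints every
# UUID reported by more than one host, followed by the hosts that share it.
find_duplicate_uuids() {
  awk '
    NF >= 2 { hosts[$2] = hosts[$2] " " $1; count[$2]++ }
    END {
      for (u in count)
        if (count[u] > 1) print u ":" hosts[u]
    }
  '
}

# Possible usage, assuming SSH is enabled on the hosts (an assumption):
#   for h in n-esx08 n-esx09 n-esx10; do
#     echo "$h $(ssh root@$h esxcli system uuid get)"
#   done | find_duplicate_uuids
```

No output means every host reported a unique system UUID.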


Note: if you want to check the UUIDs of all ESXi hosts at once, use the following PowerCLI:

 Get-VMHost | Select Name,  
   @{N='HW BIOS Uuid';E={$_.Extensiondata.Hardware.SystemInfo.Uuid}},  
   @{N='ESXi System UUid';E={(Get-Esxcli -VMHost $_).system.uuid.get()}}  

So the root cause is obvious.
I use nested ESXi hosts to test vSAN, and I forgot to regenerate the system UUID after cloning them.
The solution is easy: delete the UUID entry from /etc/vmware/esx.conf and restart the ESXi hosts.

[Screenshot: ESXi system UUID in /etc/vmware/esx.conf]

You can do it from the command line as well:

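# Blank the whole /system/uuid entry (including its leading slash); a new UUID is generated on the next boot
sed -i 's#/system/uuid.*##' /etc/vmware/esx.conf
reboot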
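Before editing esx.conf in place, it may be worth previewing what the sed expression would do. The following is just a sketch: preview_uuid_removal and has_system_uuid are hypothetical helper names, and the expression is a variant anchored to the start of the line, assuming the entry has the form /system/uuid = "..." at the beginning of a line.

```shell
# preview_uuid_removal FILE: prints FILE with the /system/uuid entry blanked,
# without modifying the file itself (a dry run of the sed expression).
preview_uuid_removal() {
  sed 's#^/system/uuid.*##' "$1"
}

# has_system_uuid FILE: succeeds while a /system/uuid entry is still present.
has_system_uuid() {
  grep -q '^/system/uuid' "$1"
}

# Possible usage on a host:
#   has_system_uuid /etc/vmware/esx.conf && echo "UUID entry present"
#   preview_uuid_removal /etc/vmware/esx.conf | grep -c uuid
```

Running the preview against a copy of the file first keeps the real configuration untouched until you are happy with the result.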

So we have identified the problem and we are done. After the ESXi hosts restart, the vSAN node UUIDs change automatically and the vSAN unicast agents are configured automatically on the vSAN nodes as well.

However, if you are interested in how to manually add a connection to a unicast agent on a particular node, you would execute the following command

esxcli vsan cluster unicastagent add -a [IP address of the remote unicast agent] -U [supports unicast] -u [UUID of the remote node] -t [type]

Anyway, such a manual configuration should not be necessary and you should do it only when instructed by VMware support.
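Purely for illustration, the manual command above can be generated for every peer of a node without executing anything. This is a sketch with assumptions: print_unicast_add_cmds is a hypothetical helper, the input is a list of "ip uuid" pairs you have collected yourself, the type is assumed to be node, and the port is assumed to be 12321 as seen in the unicast agent listing earlier.

```shell
# print_unicast_add_cmds LOCAL_UUID: reads "ip uuid" pairs on stdin and prints
# one `esxcli vsan cluster unicastagent add` command per peer, skipping the
# local node. Nothing is executed; the commands are only printed for review.
print_unicast_add_cmds() {
  local_uuid=$1
  while read -r ip uuid; do
    [ -n "$uuid" ] || continue                   # skip blank or short lines
    [ "$uuid" = "$local_uuid" ] && continue      # a node never adds itself
    echo "esxcli vsan cluster unicastagent add -t node -u $uuid -U true -a $ip -p 12321"
  done
}

# Possible usage (peers.txt is a hypothetical file with one "ip uuid" per line):
#   print_unicast_add_cmds "$(esxcli system uuid get)" < peers.txt
```

Printing the commands instead of running them leaves the decision to apply them with you, or with VMware support.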

Hope this helps someone else in the VMware community.

2 comments:

wojcieh said...

Thanks a lot.

wayne said...

Same behavior here =) Saved me tons of Time, thx for sharing this!