Problems Patching 2-Node vSAN File Services Clusters

Lately I’ve been rolling out vSAN File Services (vSAN FS) to my ROBO sites. It’s been a fairly smooth endeavor, but I’ve noticed a few quirks related to patching with Lifecycle Manager. Below are some of the problems I encountered while patching 2-node vSAN File Services clusters, along with the solutions I found.

Unable to remediate from the cluster level

This is a 2-node ROBO cluster on 7.0U2a with a shared witness. The witness node had already been updated to 7.0U2d in preparation for patching my physical ESXi hosts. As usual, I selected my cluster and pre-made baseline, started the remediation, and went to the kitchen for some coffee. When I returned I saw the message “Remediate entity Failed”, and after some investigation the problem appears to be that Lifecycle Manager is unable to place the first host into maintenance mode. From what I could see, Lifecycle Manager gets stuck while trying to clear the host of VMs, and the maintenance mode task is eventually aborted. This makes sense, because the vSAN FS containers are not meant to be moved between hosts. To work around this, I manually put the first host into maintenance mode and patched the hosts one at a time.
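If you’d rather script that step than click through the UI, here’s a rough pyVmomi sketch. The vCenter address, credentials, and host name are placeholders, and the “ensure accessibility” vSAN decommission mode mirrors the default data migration option in the client:

```python
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Lab-only shortcut: skip certificate verification.
context = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com',
                  user='administrator@vsphere.local',
                  pwd='********',
                  sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == 'esxi01.example.com')

    # 2-node vSAN: keep objects accessible instead of evacuating all data.
    spec = vim.host.MaintenanceSpec(
        vsanMode=vim.vsan.host.DecommissionMode(
            objectAction='ensureObjectAccessibility'))

    # Enter maintenance mode without trying to relocate powered-off VMs.
    WaitForTask(host.EnterMaintenanceMode_Task(
        timeout=0, evacuatePoweredOffVms=False, maintenanceSpec=spec))
    print('Host is in maintenance mode, ready to patch.')
finally:
    Disconnect(si)
```

Once the host is in maintenance mode, you can remediate it individually from Lifecycle Manager.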


Stuck on “Remediate File Services”

After the patches were pushed to the first host and it rebooted, I noticed that the progress seemed to be stuck on “Remediate vSAN File Services” for a very long time. After about 30 minutes I started poking around and discovered that part of the patching process involves deleting and redeploying the vSAN FS container. Because the host is in maintenance mode, the newly deployed container is unable to power on and finish its configuration. While looking at the tasks I saw that the patch installation had completed successfully and the process appeared to be waiting only on the vSAN FS remediation. To fix this, I manually took the host out of maintenance mode, which allowed the container to power on. Once the container was running, the “Remediate File Services” task was able to finish successfully.
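Scripting this is just the reverse of the step above. A short sketch, continuing from the same connection and host lookup as the earlier snippet; the “vSAN File Service Node” name prefix is an assumption about the default agent VM naming:

```python
from pyVim.task import WaitForTask

# Take the patched host back out of maintenance mode so the redeployed
# vSAN FS agent VM can power on and finish its configuration.
WaitForTask(host.ExitMaintenanceMode_Task(timeout=0))

# Optional: check that the file services agent VM on this host powers back on.
# The name prefix is an assumption about the default naming.
for vm in host.vm:
    if vm.name.startswith('vSAN File Service Node'):
        print(vm.name, vm.runtime.powerState)
```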


Need to repair vSAN Objects

Before moving on to the next host I wanted to validate that my share was still available and healthy, and noticed a “Reduced availability with no rebuild” health check warning. Not exactly surprising, since it is a 2-node cluster and one node was offline for around 45 minutes. To remedy this, I selected the “Repair Objects Immediately” option and gave it a few minutes to work its magic. It’s worth mentioning that even though the repair would have happened by default after an hour, it’s an important step to ensure that the vSAN objects are healthy before taking the next host offline. After I validated that my health checks were green, I moved on and repeated the steps for the second host.
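If you want to script that validation before touching the second host, the vSAN Management SDK for Python exposes the cluster health summary. A rough sketch, assuming the SDK’s vsanapiutils bindings are available and reusing the connection and SSL context from the first snippet; the cluster name is a placeholder:

```python
import vsanapiutils
from pyVmomi import vim

# Locate the cluster object (placeholder name).
cluster = None
for dc in si.RetrieveContent().rootFolder.childEntity:
    if not isinstance(dc, vim.Datacenter):
        continue
    for entity in dc.hostFolder.childEntity:
        if isinstance(entity, vim.ClusterComputeResource) and entity.name == 'MyRoboCluster':
            cluster = entity

# Grab the vSAN managed objects from vCenter and query the health summary.
vcMos = vsanapiutils.GetVsanVcMos(si._stub, context=context)
vhs = vcMos['vsan-cluster-health-system']
summary = vhs.QueryClusterHealthSummary(
    cluster=cluster, includeObjUuids=True,
    fields=['timestamp', 'clusterStatus'], fetchFromCache=False)

# Don't start on the second host until everything reports green.
print('Cluster health:', summary.clusterStatus.status)
for hostStatus in summary.clusterStatus.trackedHostsStatus:
    print(hostStatus.hostname, hostStatus.status)
```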

The last thing I did was submit a bug report SR to VMware. I’ll update this post with any additional information it provides.

About: Greg Russell

Greg Russell is a Principal Architect working in Healthcare IT on the East Coast. His primary focus is vSAN, replication, and disaster recovery solutions.