VCF upgrade pre-check

Since VCF has a very short update cycle, it’s important to stay up to date with the latest release or, as most of my clients do, n-1. With VCF it’s not really necessary to stay on version n-1, since VMware tests each bundle before releasing it to the world. But that’s always at the client’s discretion (can’t fault them for erring on the safe side, now can I?).

Before each VCF upgrade it’s important to always do the built-in pre-check from the SDDC manager. There you will learn about any discrepancies or issues that need to be addressed before you can safely execute the operation. But as any sysadmin knows, that’s not good enough. Below I’ll show you what other things I double (or triple) check before I hit the ‘upgrade now’ button.

Green marks all around!

It goes without saying that you should always let your CAB and application owners know what versions you are upgrading to. The checks below are well past the approval stage and just before the actual implementation stage.

Visual inspection

First and foremost, look at your environment. It’s as simple as that. Log in to vCenter, click each datacenter, each cluster, and check for any weirdness.

Check the health of your environment and of vSAN, and review any open issues. Don’t forget to run a DVS Health Check to make sure all the VLANs are configured on all your ESXi hosts, just to be sure.

That’s what I like to see

Use your brain! If your spidey-sense gets all tingly, find out what caused it. Upgrading your entire SDDC is not something you want to rush. As Ron Swanson says: never half-ass two things. Whole-ass one thing instead.

DRS

Next up: DRS rules and settings. This is something that can trip you up if you’re not careful. During the pre-check you might get the following error:

Why the NSX manager throws this error though…

For some reason this originates from the NSX manager, even though it is in no way related to it. The error message does point you in the right direction though: a host cannot enter maintenance mode due to a pinned VM.

However, there are no DRS rules specifically pinning a VM to this host in my configuration. There are quite a few DRS rules, but none of them is a ‘must’ rule, and all of them contain more than one host. Then I remembered that we’re running a backup program that really doesn’t play well with vMotions, and we had to prevent them from happening or we’d spend the rest of the day cleaning up zombie VMDKs.

We did this by leveraging the VM Overrides functionality. However, instead of putting DRS on ‘Manual’ we accidentally put it on ‘Disabled’. Disabled really isn’t something you should be using for DRS (unless specifically necessary), since it has some undesirable behavior.

DRS is disabled

Setting the DRS automation level to ‘Manual’ solves this specific problem in the SDDC manager pre-check.

Manual is the lowest you should go

Of course, having a VM set to ‘Manual’ still prevents the automated ESXi upgrades from progressing. VCF handles this quite neatly: when the upgrade workflow hits the error of a host not entering maintenance mode due to a timeout, you can click ‘retry’ once you have resolved the issue, and VCF continues along its merry way.

The preferred solution is obviously having no VM overrides or DRS rules pinning to one host at all, but reality is often disappointing.
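To make that check repeatable, here is a minimal sketch of how you could scan a list of VM overrides for the two situations described above. It assumes you have already exported the overrides into plain dictionaries (for example from an RVTools export); the field names `vm`, `drs_enabled` and `behavior`, and the sample VM names, are made up for illustration and are not a vCenter API.

```python
def find_risky_overrides(overrides):
    """Return (vm_name, advice) tuples for overrides that can block maintenance mode.

    Each override is a dict with the hypothetical keys:
      vm          - VM name
      drs_enabled - False when the override is set to 'Disabled'
      behavior    - automation level, e.g. 'manual' or 'fullyAutomated'
    """
    findings = []
    for override in overrides:
        if not override["drs_enabled"]:
            # 'Disabled' is what made the pre-check fail in the story above.
            findings.append((override["vm"], "DRS disabled: set to Manual or remove the override"))
        elif override["behavior"] == "manual":
            # 'Manual' passes the pre-check but still needs a hand during the ESXi upgrade.
            findings.append((override["vm"], "DRS manual: migrate this VM yourself during the host upgrade"))
    return findings


# Sample data, invented for this example.
sample_overrides = [
    {"vm": "backup-proxy-01", "drs_enabled": False, "behavior": None},
    {"vm": "backup-proxy-02", "drs_enabled": True, "behavior": "manual"},
    {"vm": "app-01", "drs_enabled": True, "behavior": "fullyAutomated"},
]

findings = find_risky_overrides(sample_overrides)
for vm_name, advice in findings:
    print(f"{vm_name}: {advice}")
```

Anything this flags as disabled is worth fixing before the pre-check; anything flagged as manual is a VM you will have to move yourself when its host is up for upgrade.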

Backups and snapshots

VCF creates a snapshot before it upgrades a component. Which is nice. But it is always a good idea to create a snapshot yourself of all the components before you even start your first pre-check. This way you have a full rollback scenario should it be necessary.

Be mindful that VCF has no official rollback in place to go back a version after it completes the upgrade!

Because there is no official path, make sure you have your backups in place. Depending on your RPO you could even just make a backup specifically right before the upgrade. The following components are the most important:

  • SDDC manager
  • Lifecycle Manager
  • vCenter(s) and PSC
  • NSX Manager(s) (controllers not so much)
  • vROps, Log Insight

Basically, everything in the SDDC-VM resource pool VCF has created for you.

ESXi upgrade order

If you have VMs that you need to move manually, it is good to know that VCF doesn’t upgrade ESXi hosts in order of hostname; it does so in order of ID. If you really need to know the order, here’s how:

  1. Go to https://<SDDC_Manager_IP>/inventory/hosts and log in with admin and the password for the REST API user
  2. Retrieve the output and filter for the "id":"<long string>" field.
  3. Sort by the id field; that’s the order in which VCF will upgrade the ESXi hosts.

With this order you can plan ahead to move your VMs.
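If you’d rather not sort those IDs by eye, the filtering and sorting steps can be sketched in a few lines of Python. This is an illustration only: I’m assuming the output has been saved as a JSON list of host objects, that each object carries a `hostName` field next to `id`, and that a plain string sort on `id` matches the order VCF uses. The sample IDs and hostnames are invented.

```python
import json


def upgrade_order(inventory_json):
    """Return hostnames sorted by their 'id' field (plain string sort)."""
    hosts = json.loads(inventory_json)
    return [host["hostName"] for host in sorted(hosts, key=lambda host: host["id"])]


# Made-up sample of what the saved inventory output might look like.
sample_inventory = json.dumps([
    {"id": "d4b8f0a2-6c1e-4f3a-9b7d-2e5c8a1f0b36", "hostName": "esx01.example.local"},
    {"id": "1f9e7c3b-5a2d-4e8f-b6c0-7d3a9e2f4c81", "hostName": "esx02.example.local"},
    {"id": "8c2a4e6f-0b1d-4a7c-8e3f-5b9d1c7a2e40", "hostName": "esx03.example.local"},
])

order = upgrade_order(sample_inventory)
for position, hostname in enumerate(order, start=1):
    print(f"{position}. {hostname}")
```

With this sample data, esx02 would go first, then esx03, then esx01, which shows why you can’t rely on hostname order.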

Closing thoughts

I’m a big fan of RVTools (who isn’t?). Before every upgrade I let it run and export it all to Excel. Should there be any funny business with configurations or DRS rules, I at least have that to fall back on.

From my experience you should plan at least 30 minutes for each component upgrade (SDDC, vCenter, NSX) and for each ESXi upgrade. Plan accordingly.
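As a quick back-of-the-envelope aid, that rule of thumb can be turned into a small calculation. The 30 minutes per step comes from my experience above; the host count of 8 is just an example, and since 30 minutes is a minimum, treat the result as a lower bound rather than a promise.

```python
def estimate_upgrade_minutes(component_count, host_count, minutes_each=30):
    """Lower-bound estimate: every component and every ESXi host takes minutes_each."""
    return (component_count + host_count) * minutes_each


# Three components (SDDC, vCenter, NSX) plus an example cluster of 8 hosts.
total_minutes = estimate_upgrade_minutes(component_count=3, host_count=8)
print(f"{total_minutes} minutes (about {total_minutes / 60:.1f} hours)")
# prints "330 minutes (about 5.5 hours)"
```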

I’m very confident in the robustness of the upgrade procedure and have no issues with running the ESXi part at night. But it doesn’t hurt to keep an eye on the operation (plus it’s fun to look at automated processes I think).

I haven’t run into any issues upgrading our VCF deployments so far. My thinking is that it can be attributed to not going outside of the allowed operations in VCF, and listening to the warning vCenter gives you when logging in:

It’s true!