Certificates and the NSX lookupservice

Certificates: without them the Internet would not be possible (or at least it would be very, very unsafe). Unfortunately, security and user-friendliness often pull in opposite directions. Working with certificates is always risky business, and you’re bound to run into some issues along the way.

Last week I replaced the vCenter and NSX Machine SSL certificates in one of our environments. Having done this previously, and having learned from the troubles I had then, I thought I was in for a smooth ride.

I was wrong.

CertificateValidationException

Anyone who has replaced the certificates in vSphere 6.x with NSX has probably run into this issue at one point or another:

The dreadful NSX LookupService failure. You replace your certificates, get the SSO working smoothly, but then there’s that one other red bubble that just doesn’t want to turn green! Maddening!

There are a lot of posts out there that tell you how to fix it, posts that I’ve used in the past and that worked like a charm. But this time was different.

ls_update_certs.py

What all these posts have in common is that they rely on a discrepancy between the presented thumbprint and the actual Machine SSL certificate. The thumbprint is presented by the Security Token Service (STS) and is (sometimes?) not updated by the certificate manager utility in vCenter.

Using the Managed Object Browser (MOB) you can find the STS service, locate the certificate that it has, extract the thumbprint and use it in a Python script. This Python script, ls_update_certs.py, then finds the certificate by this thumbprint, updates it with the correct certificate (which you have to supply) and presto change-o, now it works! Well, according to almost all the blogs out there.
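For reference, a thumbprint is nothing more than a hash of the DER-encoded certificate, so you can compute one yourself and compare it with what the MOB shows. Below is a minimal Python sketch. The function name is my own and not part of any VMware tool, and the SHA-1, colon-separated, upper-case format is an assumption about what you are comparing against.

```python
import base64
import hashlib
import re

# Matches one PEM certificate block and captures its base64 payload.
_PEM_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----(.+?)-----END CERTIFICATE-----", re.DOTALL
)


def cert_thumbprint(pem_text, algorithm="sha1"):
    """Return the colon-separated hex thumbprint of the first PEM cert found.

    Hypothetical helper for illustration; SHA-1 is an assumed default.
    """
    match = _PEM_RE.search(pem_text)
    if match is None:
        raise ValueError("no PEM certificate found")
    # Drop line breaks inside the payload, then decode base64 to raw DER bytes.
    der = base64.b64decode("".join(match.group(1).split()))
    digest = hashlib.new(algorithm, der).hexdigest().upper()
    # Group the hex digest into byte pairs: "AB:CD:EF:..."
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Feed it the contents of the certificate file you plan to pass as --certfile, and compare the result with the thumbprint you pulled out of the MOB.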

The fix to all your problems:

python /usr/lib/vmidentity/tools/scripts/ls_update_certs.py --url https://psc.domain.com/lookupservice/sdk --fingerprint <fingerprint> --certfile <path_to_certfile> --user Username --password Password

Every blog on the subject

But what if it doesn’t?

So there you are, your maintenance window minutes ticking away. Every blog you can find tells you that your problems are solved. And yet you have that red bubble still staring away at you.

Why won’t you go away?!

The problem I was facing was that the thumbprint STS presented to NSX belonged to the correct certificate! Everything was fine, according to, well, everything! The MOB showed no discrepancies, and the commands that echo the certificates showed the same values for the Machine SSL certificate and the Lookup Service. Nothing I did showed me where stuff was wrong.

vecs-cli

vCenter stores its certificates in the VMware Endpoint Certificate Store (VECS). vecs-cli is the tool that allows you to manipulate this certificate store directly. Naturally, this is a dangerous place to be, but with great responsibility also comes great power (right?).

With vecs-cli you can list the certificate stores, what certificates are in the stores, create certificates, delete certificates, etc. Using this tool I was able to finally find a discrepancy which resulted in finding the root cause.

With the command vecs-cli store list, you are presented with a list of all the different certificate stores that VECS currently has.

List of certificate stores in VECS

You can see the two stores I’m most interested in: MACHINE_SSL_CERT and STS_INTERNAL_SSL_CERT. Using the same utility I can check what the certificate is that is in these stores: vecs-cli entry list --store STS_INTERNAL_SSL_CERT and vecs-cli entry list --store MACHINE_SSL_CERT.

I spy with my little eye…

And lo and behold, there’s a difference! You can see by the alias that they are supposed to be the same certificate (__MACHINE_CERT) but the actual certificate itself is different.
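If you would rather not compare walls of base64 by eye, the same check can be scripted. This is a rough Python sketch, assuming it runs on the appliance itself and that vecs-cli entry getcert prints the PEM certificate to stdout. The function names are hypothetical, and the run parameter exists only so the logic can be exercised without a real vCenter.

```python
import subprocess

# Path as used elsewhere in this post.
VECS_CLI = "/usr/lib/vmware-vmafd/bin/vecs-cli"


def get_store_cert(store, alias="__MACHINE_CERT", run=subprocess.run):
    """Return the text printed by `vecs-cli entry getcert` for one entry."""
    result = run(
        [VECS_CLI, "entry", "getcert", "--store", store, "--alias", alias],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def normalise(pem_text):
    """Remove all whitespace so line-wrapping differences are ignored."""
    return "".join(pem_text.split())


def stores_match(store_a, store_b, alias="__MACHINE_CERT", run=subprocess.run):
    """True if both stores hold the same certificate under the given alias."""
    cert_a = normalise(get_store_cert(store_a, alias, run))
    cert_b = normalise(get_store_cert(store_b, alias, run))
    return cert_a == cert_b
```

In my situation, stores_match("MACHINE_SSL_CERT", "STS_INTERNAL_SSL_CERT") would have returned False.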

From that point on it was a simple case of backing up the wrong/old certificate, copying over the correct one, and restarting the services.

  1. Back up the existing certificate and key for __MACHINE_CERT in the STS_INTERNAL_SSL_CERT store
/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT > oldmachine.crt
/usr/lib/vmware-vmafd/bin/vecs-cli entry getkey --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT > oldmachine.key
  2. Copy out the __MACHINE_CERT certificate and key from the MACHINE_SSL_CERT store
/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT > machine.crt
/usr/lib/vmware-vmafd/bin/vecs-cli entry getkey --store MACHINE_SSL_CERT --alias __MACHINE_CERT > machine.key
  3. Delete the old entry from the STS_INTERNAL_SSL_CERT store
/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT -y
  4. Add the copied certificate and key to the STS_INTERNAL_SSL_CERT store
/usr/lib/vmware-vmafd/bin/vecs-cli entry create --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT --cert machine.crt --key machine.key
  5. Restart all services
service-control --stop --all && service-control --start --all
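
Because the order matters here (back up first, delete second), it can help to lay the whole sequence out and review it before touching anything. The sketch below only builds and prints the commands; it never executes them. build_plan and render are names I made up for illustration, and the file names are simply the ones used in the steps above.

```python
import shlex

VECS_CLI = "/usr/lib/vmware-vmafd/bin/vecs-cli"


def build_plan(source="MACHINE_SSL_CERT", target="STS_INTERNAL_SSL_CERT",
               alias="__MACHINE_CERT"):
    """Return the steps as (argv, output_file) pairs, in execution order.

    output_file is None when the command's stdout is not saved to a file.
    """
    def entry(action, store, *extra):
        return [VECS_CLI, "entry", action,
                "--store", store, "--alias", alias, *extra]

    return [
        # 1. Back up what is currently in the target store.
        (entry("getcert", target), "oldmachine.crt"),
        (entry("getkey", target), "oldmachine.key"),
        # 2. Export the good certificate and key from the source store.
        (entry("getcert", source), "machine.crt"),
        (entry("getkey", source), "machine.key"),
        # 3. Remove the stale entry, 4. recreate it from the exported files.
        (entry("delete", target, "-y"), None),
        (entry("create", target,
               "--cert", "machine.crt", "--key", "machine.key"), None),
        # 5. Restart all services.
        (["service-control", "--stop", "--all"], None),
        (["service-control", "--start", "--all"], None),
    ]


def render(plan):
    """Format each step as a shell-style line for review (nothing is run)."""
    lines = []
    for argv, output_file in plan:
        line = " ".join(shlex.quote(arg) for arg in argv)
        if output_file is not None:
            line += " > " + shlex.quote(output_file)
        lines.append(line)
    return lines
```

Calling print("\n".join(render(build_plan()))) gives you the full list to read through before the maintenance window starts.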

And now my problem is fixed.

Vindication!