NSX BGP Neighbors down

While deploying a new T0 Gateway in a new Edge Node Cluster in NSX, I got a little stumped when I saw the following:

4 BGP neighbors down

My BGP sessions wouldn’t come up.

Now, I’m not a networking wizard by any stretch of the imagination, but this T0 and the Edge Nodes on which it resides were configured exactly the same as another T0 and Edge Node Cluster in the same lab. The only differences were the IP addresses and AS number; both were shifted over by one.

Even more interesting, I saw these messages in the logs:

logs with the message connection from (BGP neighbor IP) rejected due to admin shutdown

Why does it say “admin shutdown”?

This would mean I had disabled it somewhere in the configuration, but for the life of me I couldn’t find a setting anywhere that could have this effect.

Troubleshooting

I was able to ping all the things I expected: the direct neighbor was reachable, but the neighbor in the other AZ was not. That makes sense, since without a BGP session we don’t have any routing information for it yet.

This means IP addresses and masks are fine, VLANs are correct, MTU as well, no weirdness with ports being down.
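These checks can be run straight from the Edge Node CLI. A rough sketch, assuming the standard NSX-T edge commands (the neighbor IP and packet size below are placeholders for your own values):

```
# From the Edge Node CLI (admin mode); 203.0.113.1 stands in for the BGP neighbor IP.
ping 203.0.113.1                          # basic L3 reachability
ping 203.0.113.1 size 1572 dfbit enable   # MTU check with the don't-fragment bit set
get interfaces                            # admin/op state of the uplink interfaces
```

If the large ping with DF set fails while the plain ping succeeds, MTU along the path is the likely culprit; in my case both came back fine.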

The log message means that the neighbor’s BGP connection attempts (TCP sessions towards port 179) are actually being received.

The log message also suggests that BGP is correctly configured on either end, since a misconfiguration would result in a different message. If, for example, the timers were off, we would be presented with a different error.

And looking at all the settings, as mentioned before, everything is identical to the other T0 and Edge Nodes.

All the interfaces on the T0 were showing expected results:

uplink interface showing admin and op_state 'up'
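For reference, this is roughly how the BGP session state can be inspected on the Edge Node CLI (the VRF id below is a placeholder; look it up with the first command):

```
get logical-routers          # note the VRF id of the SERVICE_ROUTER_TIER0
vrf 1                        # enter that VRF (id is a placeholder)
get bgp neighbor summary     # neighbors stuck in Idle/Active instead of Established
```

In a healthy setup the neighbors show as Established; here they never got past the initial states.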

Then I noticed that I didn’t have any tunnels on the Edge Nodes, and no TEP addresses either. See below.

2 edge nodes without TEP addresses and no tunnels active
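The missing tunnels can be confirmed from the Edge Node CLI as well; a sketch, assuming the usual NSX-T edge commands:

```
get tunnel-ports             # should list the TEP interfaces; empty in this case
get bfd-sessions             # BFD sessions running over the tunnels; none present here
```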

Now, this didn’t immediately spark my ‘a-ha!’ moment, since I didn’t have any VMs connected to this network. Nothing configured so far was related to Overlay networking either; it was all ‘real’ networks and VLANs.

Digging tunnels

Creating a T1 and an Overlay Segment, and connecting a powered-up VM to it, would cause the tunnels to be created.

At least, they should’ve, but this didn’t happen either.

Turns out: I had missed adding an Overlay Transport Zone to my Edge Nodes, and had only connected them to my VLAN ‘edge-tz’ Transport Zone:

Edge node configuration with only 'edge-tz' as the configured Transport Zone

Simply adding this fixed everything. Tunnels immediately started appearing and, even more interesting, BGP started working as well.

BGP and VLAN configurations don’t rely on Overlay networking, so membership of the Overlay Transport Zone wasn’t something I expected as a possible fix.

So why is that?

Communication is key

And finally, it was a PEBCAK-error all along (surprise!).

I forgot one simple thing:

Edge Nodes use the auto-plumbed tunnels to exchange information with each other. Without any TEP addresses, or an Overlay Transport Zone as a fabric to connect to, they cannot build these tunnels.

So without any TEPs configured, the Edge Nodes simply cannot communicate. To prevent any strange behaviour (like a split-brain scenario), NSX administratively disables the interfaces until at least one tunnel is up.

My colleague Bart ran into a similar issue whilst deploying a one-armed Load Balancer, read about it on his blog here.
