
Duplicate Connections + Authorization Failures After Resetting Cluster Connections Leading to Lost Messages Across the Cluster #5500

Open
geofffranks opened this issue Jun 6, 2024 · 3 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@geofffranks

Observed behavior

  1. nats-server-1 gets overwhelmed, logging repeated slow consumer and readloop warnings.
  2. nats-server-1 closes its connections to the other NATS servers in the cluster (nats-server-2 and nats-server-3).
  3. Connections are re-established to nats-server-2 and nats-server-3.
  4. For nats-server-2, two route connections are established.
  5. nats-server-1 detects a duplicate connection to nats-server-2 and tears it down.
  6. nats-server-2 detects a duplicate connection to nats-server-1 and tears it down.
  7. nats-server-1 repeatedly tries to reconnect to nats-server-2 but receives Route Error 'Authorization Violation' and Router connection closed: Parse Error errors until the nats-server-1 process is restarted.
  8. Until nats-server-1 is restarted, consumers connected to nats-server-1 do not receive messages originally published to nats-server-2, and consumers connected to nats-server-2 do not receive messages originally published to nats-server-1.

We have logs from all three servers for this issue, but need to confirm with our customer that they can be shared.

Expected behavior

  1. nats-server-1 and nats-server-2 either do not create a duplicate route connection, or resolve it correctly, with no authorization errors as a result.
  2. No messages are lost.

Server and client version

nats-server v2.10.14

Host environment

VMware IaaS, x86_64 architecture, Ubuntu 22.04. nats-server runs as a direct process deployed via BOSH from https://github.com/cloudfoundry/nats-release.

Steps to reproduce

The only way I was able to reproduce this was by saturating the network and putting a heavy load on NATS. It also helps to size the VM down to 1 CPU and as little compute capacity as possible. A rough sketch of the approach is below.
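Roughly, the reproduction looked like this. It is a sketch only, not the exact commands we ran: the interface name, subject, message counts, and the `tc`/`nats bench` parameters are placeholder assumptions to illustrate "saturate the network and load NATS".

```sh
# Sketch only: degrade the network on the nats-server-1 VM
# (interface name and delay/loss values are placeholders).
sudo tc qdisc add dev eth0 root netem delay 200ms 50ms loss 5%

# Drive heavy publish/subscribe load at the cluster
# (subject, counts, and sizes are arbitrary).
nats bench repro.load --pub 20 --sub 20 --msgs 5000000 --size 1024 \
  --server nats://10.154.192.48:4222

# Remove the network shaping afterwards.
sudo tc qdisc del dev eth0 root netem
```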

@geofffranks geofffranks added the defect Suspected defect such as a bug or regression label Jun 6, 2024
@wallyqs
Member

wallyqs commented Jun 6, 2024

Thanks for the report @geofffranks, the logs would be very helpful. I assume the routes are using auth?

@geofffranks
Author

geofffranks commented Jun 7, 2024

Thanks for the quick response! Here are the logs, and an example config file - yes, auth is in play :)

nats-server-1: 10.154.192.48
nats-server-2: 10.154.192.49
nats-server-3: 10.154.192.50
nats-server-3.log
nats-server-2.log
nats-server-1.log
nats.conf.txt
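For context, the cluster/route portion of the config follows the usual pattern sketched below. This is not the customer's actual nats.conf (see nats.conf.txt above); the cluster name and credentials are placeholders, shown only to illustrate that the routes authenticate with a user/password, which is where the Authorization Violation errors come from.

```conf
# Illustrative sketch only; cluster name and credentials are placeholders,
# not the values from the attached nats.conf.txt.
port: 4222

cluster {
  name: "nats"
  listen: 0.0.0.0:6222

  # Route authorization: every incoming route connection must present these credentials.
  authorization {
    user: cluster_user
    password: cluster_pass
    timeout: 2
  }

  # Routes to the other cluster members, embedding the same credentials.
  routes: [
    nats-route://cluster_user:cluster_pass@10.154.192.48:6222
    nats-route://cluster_user:cluster_pass@10.154.192.49:6222
    nats-route://cluster_user:cluster_pass@10.154.192.50:6222
  ]
}
```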

@geofffranks
Author

@wallyqs Our customer is wondering whether high load on the VMs running nats-server, caused by underlying IaaS CPU contention, is likely to trigger this issue.

From our perspective that seems likely, given that the workload did not change significantly and yet a single node suddenly experienced a plethora of slow consumers/producers, followed by the cluster disconnection. But having confirmation from y'all would be helpful.
