
Duplicate Connections + Authorization Failures After Resetting Cluster Connections Leading to Lost Messages Across the Cluster #5500

Open
geofffranks opened this issue Jun 6, 2024 · 3 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@geofffranks

Observed behavior

  1. nats-server-1 gets overwhelmed, logging repeated slow consumer and readloop warnings.
  2. nats-server-1 closes its connections to the other NATS servers in the cluster (nats-server-2 and nats-server-3).
  3. Connections are re-established to nats-server-2 and nats-server-3.
  4. For nats-server-2, two route connections are established.
  5. nats-server-1 detects a duplicate connection to nats-server-2 and tears it down.
  6. nats-server-2 detects a duplicate connection to nats-server-1 and tears it down.
  7. nats-server-1 repeatedly tries to reconnect to nats-server-2 but receives Route Error 'Authorization Violation' and Router connection closed: Parse Error errors until the nats-server-1 process is restarted.
  8. Until nats-server-1 is restarted, consumers connected to nats-server-1 do not receive messages originally published to nats-server-2, and consumers connected to nats-server-2 do not receive messages originally published to nats-server-1.

We have logs from all three servers for this issue, but need to confirm with our customer that they can be shared.

Expected behavior

  1. nats-server-1 and nats-server-2 either do not create a duplicate route connection, or resolve it correctly, with no authorization errors as a result.
  2. No messages are lost.

Server and client version

nats-server v2.10.14

Host environment

VMware IaaS, x86_64 architecture, Ubuntu 22.04. nats-server runs as a direct process deployed via BOSH from https://github.com/cloudfoundry/nats-release.

Steps to reproduce

The only way I was able to reproduce this was by saturating the network and putting a heavy load on NATS. It also helps to size the VM down to 1 CPU and as little compute capacity as possible. A rough sketch of the approach is below.
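Roughly, the reproduction looked like this. It is a sketch only, not the exact commands we ran: the interface name, subject, message counts, and the `tc`/`nats bench` parameters are placeholder assumptions to illustrate "saturate the network and load NATS".

```sh
# Sketch only: degrade the network on the nats-server-1 VM
# (interface name and delay/loss values are placeholders).
sudo tc qdisc add dev eth0 root netem delay 200ms 50ms loss 5%

# Drive heavy publish/subscribe load at the cluster
# (subject, counts, and sizes are arbitrary).
nats bench repro.load --pub 20 --sub 20 --msgs 5000000 --size 1024 \
  --server nats://10.154.192.48:4222

# Remove the network shaping afterwards.
sudo tc qdisc del dev eth0 root netem
```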

@geofffranks geofffranks added the defect Suspected defect such as a bug or regression label Jun 6, 2024
@wallyqs
Member

wallyqs commented Jun 6, 2024

Thanks for the report @geofffranks, the logs would be very helpful. I assume the routes are using auth?

@geofffranks
Author

geofffranks commented Jun 7, 2024

Thanks for the quick response! Here are the logs, and an example config file - yes, auth is in play :)

nats-server-1: 10.154.192.48
nats-server-2: 10.154.192.49
nats-server-3: 10.154.192.50
nats-server-3.log
nats-server-2.log
nats-server-1.log
nats.conf.txt
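For context, the cluster/route portion of the config follows the usual pattern sketched below. This is not the customer's actual nats.conf (see nats.conf.txt above); the cluster name and credentials are placeholders, shown only to illustrate that the routes authenticate with a user/password, which is where the Authorization Violation errors come from.

```conf
# Illustrative sketch only; cluster name and credentials are placeholders,
# not the values from the attached nats.conf.txt.
port: 4222

cluster {
  name: "nats"
  listen: 0.0.0.0:6222

  # Route authorization: every incoming route connection must present these credentials.
  authorization {
    user: cluster_user
    password: cluster_pass
    timeout: 2
  }

  # Routes to the other cluster members, embedding the same credentials.
  routes: [
    nats-route://cluster_user:cluster_pass@10.154.192.48:6222
    nats-route://cluster_user:cluster_pass@10.154.192.49:6222
    nats-route://cluster_user:cluster_pass@10.154.192.50:6222
  ]
}
```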

@geofffranks
Author

@wallyqs Our customer is wondering whether high load on the VMs running nats-server, caused by underlying IaaS CPU contention, is likely to trigger this issue.

From our perspective that seems likely, given that the workload did not change significantly and yet a single node suddenly experienced a plethora of slow consumers/producers, followed by the cluster disconnection. But having confirmation from y'all would be helpful.
