Memory leak of nats cluster #5518

Open
Steel551454 opened this issue Jun 11, 2024 · 45 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@Steel551454

Observed behavior

We have a NATS cluster of three nodes (NATS version 2.10.16). The server configuration is:

host: 127.0.0.1
port: 4222

server_name: nats-02-cluster
accounts {
 $SYS { users = [ { user: "nats", pass: "PASS" } ] }
}

jetstream {
  store_dir=/var/lib/nats
  max_memory_store: 1024Mb
  max_file_store: 819200Mb
}

cluster {
  name: cluster
  listen: 127.0.0.1:6222
  routes: [
    nats-route://nats-00-cluster:6226
    nats-route://nats-01-cluster:6226
    nats-route://nats-02-cluster:6226
  ]
  compression: {
    mode: s2_auto
    rtt_thresholds: [10ms, 50ms, 100ms]
  }
}

http_port: 8222
max_connections: 64K
max_control_line: 4KB
max_payload: 8MB
max_pending: 64MB
max_subscriptions: 0
log_file: /var/log/nats/nats-server.log

Cluster interaction occurs via nginx:

upstream nats {
    server 127.0.0.1:4222;
}

server {
    listen        127.0.0.1:4224 so_keepalive=1m:5s:2;
    listen        192.168.1.2:4224 so_keepalive=1m:5s:2;

    access_log    off;
    tcp_nodelay   on;
    preread_buffer_size 64k;
    proxy_pass    nats;
}

upstream nats-cluster {
    server 127.0.0.1:6222;
}

server {
    listen        127.0.0.1:6226 so_keepalive=1m:5s:2;
    listen        192.168.1.2:6226 so_keepalive=1m:5s:2;

    access_log    off;
    tcp_nodelay   on;
    preread_buffer_size 64k;
    proxy_pass    nats-cluster;
}

Events are forwarded to NATS by the Vector service. The average throughput is 80k events per second (~90 MB/s). The sink configuration is:

  nats:
    type: "nats"
    inputs:
      - "upstreams.other"
    url: "nats://127.0.0.1:4222"
    request:
      rate_limit_num: 70000
    buffer:
      type: memory
      max_events: 2000
    subject: "{{ type }}"
    acknowledgements:
      enabled: true
    encoding:
      codec: json

Memory usage increases continuously until it reaches the host limit (60 GB), at which point the OOM killer terminates the NATS service.
The NATS profiles are attached:
profiles.tar.gz

Expected behavior

Service memory should not leak

Server and client version

nats-server: 2.10.16
nats: 0.1.4

Host environment

No response

Steps to reproduce

No response

@Steel551454 Steel551454 added the defect Suspected defect such as a bug or regression label Jun 11, 2024
@neilalexander
Member

Thanks for providing the memory profiles!

Can you please try disabling route compression by changing mode from s2_auto to off and see if there's an improvement?
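For example, against the cluster block posted above (a minimal sketch; the rest of the block stays as it is):

cluster {
  # name, listen and routes unchanged from the config above
  compression: {
    mode: off
  }
}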

@Steel551454
Author

We did that; no changes followed.

@Steel551454
Author

I removed nginx and now the nodes communicate with each other directly. Memory continues to leak.

@Steel551454
Author

profiles.zip
These are the current profiles.

@neilalexander
Member

Your latest profile suggests there are still a lot of allocations in the route S2 writer; are you sure route compression was disabled properly? You may need to do a rolling restart of the cluster nodes to ensure it has taken effect.

@Steel551454
Author

You are right: I forgot to turn off compression on one server

@Steel551454
Author

profiles.zip

After we disabled nginx and removed compression, memory continues to leak.

@neilalexander
Member

OK, this latest profile shows a different type of memory build-up than before (this one shows Raft append entries; last time that wasn't evident).

Can you please post more details about your cluster?
What spec of machines are the cluster nodes running on?
Are all of the cluster nodes the same CPU/RAM/disk-wise?
Do you see these build-ups on a single node or multiple?

@Steel551454
Author

The NATS cluster is running on servers with the following specifications: 64 GB RAM, Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz, 890 GB SSD. All servers are identical. We use 10GE network cards. The operating system is Arch Linux. Memory usage on the servers is uneven: the node with the most primary replicas consumes the most memory.

@derekcollison
Member

Do you use async publish for JetStream?

@Steel551454
Author

Steel551454 commented Jun 12, 2024

Honestly, I'm not sure how this is implemented in vector.dev. Here is a link to the module: https://vector.dev/docs/reference/configuration/sinks/nats/
https://github.com/vectordotdev/vector/tree/master/src/sinks/nats

@Steel551454
Author

I reviewed the source code of the NATS module and saw the function call to async_nats.

@derekcollison
Member

Maybe we can have @Jarema take a look since it's using the Rust client.

@Jarema
Member

Jarema commented Jun 13, 2024

@derekcollison A quick glance shows that vector is using Core NATS publish, so not even JetStream async publish.

@derekcollison
Member

OK, it is very easy to overload the system in that case. This will balloon the internal append entries, since that pipeline needs to internally queue messages and then write them to the store.

@Steel551454
Author

Will you fix this? Or do we need to make changes on our end?

@neilalexander
Member

The issue needs to be rectified in Vector by switching from Core NATS publishes to JetStream publishes, as currently the Core NATS publishes can potentially send data into JetStream faster than it can be processed. This explains the build-up of append entries in memory that you are seeing.

It looks like there's already an issue tracking this on their repository: vectordotdev/vector#10534
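For illustration, a minimal sketch of the difference using the Go client (Vector itself uses the Rust async_nats client; the subject and payload here are made up):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Core NATS publish: returns as soon as the bytes are buffered locally,
	// with no confirmation that JetStream stored the message, so a fast
	// producer can outrun the stream.
	if err := nc.Publish("events.app", []byte("payload")); err != nil {
		log.Fatal(err)
	}

	// JetStream publish: blocks until the stream leader acknowledges the
	// write, so an overloaded stream slows the publisher down instead of
	// letting append entries pile up in server memory.
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := js.Publish("events.app", []byte("payload")); err != nil {
		log.Printf("stream did not ack: %v", err)
	}
}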

@Steel551454
Author

We will die faster than the task above will be completed :) (It was created in 2021)

Is there any chance you could add some kind of setting to limit the rate of JetStream forwarding?

@derekcollison
Member

We would not approach it that way; we should not slow down normal NATS Core publishers because of a misconfiguration.

We are considering a way to protect the server by dropping AppendEntry msgs from the NRG (raft) layer. That would avoid memory bloat but would cause the system to thrash a bit catching up the NRG followers when they detect gaps from the dropped messages.

@Jarema
Member

Jarema commented Jun 13, 2024

@Steel551454 I plan to contribute to the issue mentioned above sometime in Q3 and introduce JetStream support. The current sink does not actually support acks, despite what the docs say.

@Steel551454
Author

Let's say we turn off JetStream. Where in the NATS settings can we specify where to store events?

@Jarema
Member

Jarema commented Jun 13, 2024

If you turn off JetStream, messages will not be stored anywhere; delivery becomes at-most-once. You need a subscriber application that processes them.

In JetStream, you can define the store directory here:

jetstream {
  store_dir: /path
}

or by providing the -sd flag.
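For example (paths are illustrative):

nats-server -c /etc/nats/nats.conf -sd /var/lib/nats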

@Steel551454
Author

Do you happen to have a simple proxy written in Go that transforms Core NATS publishes into JetStream publishes?
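Something like this minimal sketch (Go client; the URL, the "events.>" subject filter, and the assumption of an existing stream capturing "ingest.>" are illustrative):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Bound the number of in-flight JetStream publishes; once 256 acks are
	// outstanding, PublishAsync stalls, which limits how much the bridge buffers.
	js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
	if err != nil {
		log.Fatal(err)
	}

	// Receive plain Core NATS messages and re-publish them into JetStream
	// under a different prefix so the bridge never sees its own output.
	_, err = nc.Subscribe("events.>", func(m *nats.Msg) {
		if _, err := js.PublishAsync("ingest."+m.Subject, m.Data); err != nil {
			log.Printf("republish failed: %v", err)
		}
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the bridge running
}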

@Steel551454
Author

Today we replaced the vector.dev pipeline with redpanda-connect, which has a plugin for working with NATS JetStream. The memory leak issue has not been resolved. Attached is an archive with profiles.
profiles.zip

@Steel551454
Author

@derekcollison, I'm sorry to bother you, but switching our pipeline to JetStream did not solve the memory leak issue. Maybe we should add some explicit limiter? The situation where a cluster node crashes due to OOM cannot be considered good.

@derekcollison
Member

Agreed, we could simply drop messages and not place them into the stream. The system will complain about high lag getting messages into the stream; that should show up in the log.

However, in this case I would imagine you want the system to store the messages. So you either need to slow down the publisher or speed up the storage mechanism, meaning running multiple parallel streams and having the NATS system transparently partition the subject space across them.
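A rough sketch of what that partitioning could look like, using deterministic subject mapping with the partition() transform (subject names and the partition count of three are illustrative):

# server config, global account: remap events.<type> to events.<type>.<0..2>
mappings: {
  "events.*": "events.{{wildcard(1)}}.{{partition(3,1)}}"
}

# then create one stream per partition, e.g.
#   EVENTS-0 with subjects ["events.*.0"]
#   EVENTS-1 with subjects ["events.*.1"]
#   EVENTS-2 with subjects ["events.*.2"]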

@Steel551454
Author

Do I understand correctly that in our case, for faster message storage in streams, we should launch multiple instances (preferably on different servers) and distribute the streams among several instances?

@Steel551454
Author

Or do you have something else in mind?

@Steel551454
Author

And another question: would the memory leak situation change if we used NVMe disks instead of traditional SSDs to store the events?

@derekcollison
Member

Yes, that is correct. @jnmoyne can help with how that gets put together.

The memory growth is not a leak: the publishing layer does not wait and publishes as fast as NATS Core allows (NATS Core can do >10M msgs/s vs around ~250k msgs/s for JetStream), so the system is simply holding onto all the staged messages waiting to be stored into the stream.

NVMe probably would not make a difference in this case IMO.

@derekcollison
Member

Is the stream an R1 or R3+?

@Steel551454
Author

We are currently experimenting with R1, R2, and R3. The build-up occurs on one, two, and three servers respectively.

@Steel551454
Author

You mentioned that inserting events through JetStream runs at around 250k/s. That is twice as much as we are currently inserting.

@derekcollison
Member

YMMV but for R1 on decent hardware that should be a good estimate.

@Steel551454
Author

Could you clarify for me how the service handles events that have not yet been written to disk and are still stored in memory? In our situation, the NATS server is consuming all memory, so in this case, will old events that have not been written to disk be deleted from memory? And what happens to events for a stream with R3 set, but which have not yet been flushed to disk?

@derekcollison
Member

The memory is an internal queue for the stream processing, which is currently unbounded but we plan on putting limits in place.

@Steel551454
Author

Is there a Prometheus metric that indicates how many events have not yet been saved to disk (written to the internal queue)?

@Steel551454
Author

I have not received an answer to my question yet: what happens to events that are stored in the internal queue and have not yet been successfully written to disk?

@derekcollison
Member

derekcollison commented Jun 17, 2024 via email

@Steel551454
Author

@derekcollison Sorry, I asked the question incorrectly. I wanted to know what happens to events in the internal queue if the service is shut down correctly and in the case of an OOM. And is there a Prometheus metric tracking this queue?

@derekcollison
Member

They are lost and not committed. Currently we are not exposing this metric but will do so most likely in 2.11.

@Steel551454
Author

@derekcollison I reread our discussion, and it seems to me that we misunderstood each other in the middle of our dialogue. Earlier, I mentioned that when switching from vector.dev to redpanda-connect (this service supports jetstream), memory continued to leak. Perhaps this is because it's not memory leakage from the application anymore but rather a buildup of the write buffer to disk. Could you please correct me on this?

@derekcollison
Member

If Redpanda Connect is doing synchronous js.Publish calls, that would not make much sense; are we sure that is what is happening?

We could triage the system and tell you what is going on. If you want to do that, grab a memory profile via the NATS CLI from all servers in the system and share them with us. CPU profiles and stacksz are helpful as well.
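One way to capture those, assuming the profiling port is enabled on each node so the standard Go pprof endpoints are exposed (the port number is illustrative):

# in the server config on each node
prof_port: 65432

# then, per node
curl -o mem.prof   "http://127.0.0.1:65432/debug/pprof/allocs"
curl -o cpu.prof   "http://127.0.0.1:65432/debug/pprof/profile?seconds=30"
curl -o stacks.txt "http://127.0.0.1:65432/debug/pprof/goroutine?debug=2"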

@Steel551454
Author

In the end, we rechecked the event publishing to NATS using JetStream. There is no memory leak. I have only a few questions left. What happens to the events stored in JetStream with disk and memory storage configured? Are they gradually read from memory and saved to disk? And what happens to the events in memory when the service is shut down correctly? I hope that when the NATS service finishes its work, it flushes all events from memory to disk and only then shuts down.

@derekcollison
Member

If the stream is backed by a file store, then yes, they are flushed to disk.

In your case I believe the publishers are publishing faster than the stream can process, and the build-up in memory is from the pending messages being held there.
