Memory leak of nats cluster #5518

Open
Steel551454 opened this issue Jun 11, 2024 · 45 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@Steel551454

Observed behavior

We have a NATS cluster of three nodes (NATS version 2.10.16). The server configuration is:

host: 127.0.0.1
port: 4222

server_name: nats-02-cluster
accounts {
 $SYS { users = [ { user: "nats", pass: "PASS" } ] }
}

jetstream {
  store_dir=/var/lib/nats
  max_memory_store: 1024Mb
  max_file_store: 819200Mb
}

cluster {
  name: cluster
  listen: 127.0.0.1:6222
  routes: [
    nats-route://nats-00-cluster:6226
    nats-route://nats-01-cluster:6226
    nats-route://nats-02-cluster:6226
  ]
  compression: {
    mode: s2_auto
    rtt_thresholds: [10ms, 50ms, 100ms]
  }
}

http_port: 8222
max_connections: 64K
max_control_line: 4KB
max_payload: 8MB
max_pending: 64MB
max_subscriptions: 0
log_file: /var/log/nats/nats-server.log

Cluster interaction occurs via nginx:

upstream nats {
    server 127.0.0.1:4222;
}

server {
    listen        127.0.0.1:4224 so_keepalive=1m:5s:2;
    listen        192.168.1.2:4224 so_keepalive=1m:5s:2;

    access_log    off;
    tcp_nodelay   on;
    preread_buffer_size 64k;
    proxy_pass    nats;
}

upstream nats-cluster {
    server 127.0.0.1:6222;
}

server {
    listen        127.0.0.1:6226 so_keepalive=1m:5s:2;
    listen        192.168.1.2:6226 so_keepalive=1m:5s:2;

    access_log    off;
    tcp_nodelay   on;
    preread_buffer_size 64k;
    proxy_pass    nats-cluster;
}

Events are forwarded to NATS by the Vector service. The average throughput is 80k events per second (~90 MB/s). The sink configuration is:

  nats:
    type: "nats"
    inputs:
      - "upstreams.other"
    url: "nats://127.0.0.1:4222"
    request:
      rate_limit_num: 70000
    buffer:
      type: memory
      max_events: 2000
    subject: "{{ type }}"
    acknowledgements:
      enabled: true
    encoding:
      codec: json

Memory usage increases continuously until it reaches the host limit (60 GB), at which point the OOM killer terminates the NATS service.
The NATS profiles are attached:
profiles.tar.gz

Expected behavior

Service memory should not leak

Server and client version

nats-server: 2.10.16
nats: 0.1.4

Host environment

No response

Steps to reproduce

No response

@Steel551454 Steel551454 added the defect Suspected defect such as a bug or regression label Jun 11, 2024
@neilalexander
Member

Thanks for providing the memory profiles!

Can you please try disabling route compression by changing mode from s2_auto to off and see if there's an improvement?
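For example, against the cluster block posted above (a minimal sketch; the rest of the block stays as it is):

cluster {
  # name, listen and routes unchanged from the config above
  compression: {
    mode: off
  }
}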

@Steel551454
Author

We did that; no changes followed.

@Steel551454
Author

I removed nginx and now the nodes communicate with each other directly. Memory continues to leak.

@Steel551454
Author

profiles.zip
These are the current profiles.

@neilalexander
Member

Your latest profile suggests there are still a lot of allocations in the route S2 writer; are you sure route compression was disabled properly? You may need to do a rolling restart of the cluster nodes to ensure it has taken effect.

@Steel551454
Author

You are right: I forgot to turn off compression on one server

@Steel551454
Author

profiles.zip

After we disabled nginx and removed compression, memory continues to leak.

@neilalexander
Member

OK, this latest profile shows a different type of memory build-up than before (this one shows Raft append entries; last time that wasn't evident).

Can you please post more details about your cluster?
What spec of machines are the cluster nodes running on?
Are all of the cluster nodes the same CPU/RAM/disk-wise?
Do you see these build-ups on a single node or multiple?

@Steel551454
Author

The NATS cluster is running on servers with the following specifications: 64 GB RAM, Intel(R) Xeon(R) E-2236 CPU @ 3.40GHz, 890 GB SSD. All servers are identical. We use 10GE network cards. The operating system is Arch Linux. Memory usage on the servers is uneven: the node with the most primary replicas consumes the most memory.

@derekcollison
Member

Do you use async publish for JetStream?

@Steel551454
Author

Steel551454 commented Jun 12, 2024

Honestly, I'm not sure how this is implemented in vector.dev. Here is a link to the module: https://vector.dev/docs/reference/configuration/sinks/nats/
https://github.com/vectordotdev/vector/tree/master/src/sinks/nats

@Steel551454
Author

I reviewed the source code of the NATS module and saw the function call to async_nats.

@derekcollison
Member

Maybe we can have @Jarema take a look since it's using the Rust client.

@Jarema
Member

Jarema commented Jun 13, 2024

@derekcollison A quick glance shows that vector is using Core NATS publish, so not even JetStream async publish.

@derekcollison
Member

OK, it is very easy to overload the system in that case. This will balloon the internal append entries, since that pipeline needs to internally queue messages and then write them to the store.

@Steel551454
Author

Will you fix this? Or do we need to make changes on our end?

@neilalexander
Member

The issue needs to be rectified in Vector by switching from Core NATS publishes to JetStream publishes, as currently the Core NATS publishes can potentially send data into JetStream faster than it can be processed. This explains the build-up of append entries in memory that you are seeing.

It looks like there's already an issue tracking this on their repository: vectordotdev/vector#10534
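For illustration, a minimal sketch of the difference using the Go client (Vector itself uses the Rust async_nats client; the subject and payload here are made up):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Core NATS publish: returns as soon as the bytes are buffered locally,
	// with no confirmation that JetStream stored the message, so a fast
	// producer can outrun the stream.
	if err := nc.Publish("events.app", []byte("payload")); err != nil {
		log.Fatal(err)
	}

	// JetStream publish: blocks until the stream leader acknowledges the
	// write, so an overloaded stream slows the publisher down instead of
	// letting append entries pile up in server memory.
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := js.Publish("events.app", []byte("payload")); err != nil {
		log.Printf("stream did not ack: %v", err)
	}
}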

@Steel551454
Author

We will die faster than the task above will be completed :) (It was created in 2021)

Is there any chance you could add some kind of setting to limit the rate of JetStream forwarding?

@derekcollison
Member

We would not approach it that way; we should not slow down normal NATS Core publishers because of a misconfiguration.

We are considering a way to protect the server by dropping AppendEntry msgs from the NRG (raft) layer. That would avoid memory bloat but would cause the system to thrash a bit catching up the NRG followers when they detect gaps from the dropped messages.

@Jarema
Member

Jarema commented Jun 13, 2024

@Steel551454 I plan to contribute to the issue mentioned above sometime in Q3 and introduce JetStream support. The current sink does not actually support acks, despite what the docs say.

@Steel551454
Author

Let's say we turn off JetStream. Where in the NATS settings can we specify where to store events?

@Jarema
Member

Jarema commented Jun 13, 2024

If you turn off JetStream, messages will not be stored anywhere; delivery becomes at-most-once. You need a subscriber application that processes them.

In JetStream, you can define the store directory here:

jetstream {
  store_dir: /path
}

or by providing the -sd flag.
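For example (paths are illustrative):

nats-server -c /etc/nats/nats.conf -sd /var/lib/nats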

@Steel551454
Author

Do you happen to have a simple proxy written in Go that transforms Core NATS publishes into JetStream publishes?
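Something like this minimal sketch (Go client; the URL, the "events.>" subject filter, and the assumption of an existing stream capturing "ingest.>" are illustrative):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Bound the number of in-flight JetStream publishes; once 256 acks are
	// outstanding, PublishAsync stalls, which limits how much the bridge buffers.
	js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
	if err != nil {
		log.Fatal(err)
	}

	// Receive plain Core NATS messages and re-publish them into JetStream
	// under a different prefix so the bridge never sees its own output.
	_, err = nc.Subscribe("events.>", func(m *nats.Msg) {
		if _, err := js.PublishAsync("ingest."+m.Subject, m.Data); err != nil {
			log.Printf("republish failed: %v", err)
		}
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the bridge running
}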

@Steel551454
Author

Today we replaced the vector.dev pipeline with redpanda-connect, which has a plugin for working with NATS JetStream. The memory leak issue has not been resolved. Attached is an archive with profiles.
profiles.zip

@Steel551454
Author

@derekcollison, I'm sorry to bother you, but switching our pipeline to JetStream did not solve the memory leak issue. Maybe we should add some explicit limiter? The situation where a cluster node crashes due to OOM cannot be considered good.

@derekcollison
Member

Agreed, we could simply drop messages and not place them into the stream. The system will complain about high lag getting messages into the stream; that should show up in the log.

However, in this case I would imagine you want the system to store the messages. So you either need to slow down the publisher or speed up the storage mechanism, meaning running multiple parallel streams and having the NATS system transparently partition the subject space across them.
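A rough sketch of what that partitioning could look like, using deterministic subject mapping with the partition() transform (subject names and the partition count of three are illustrative):

# server config, global account: remap events.<type> to events.<type>.<0..2>
mappings: {
  "events.*": "events.{{wildcard(1)}}.{{partition(3,1)}}"
}

# then create one stream per partition, e.g.
#   EVENTS-0 with subjects ["events.*.0"]
#   EVENTS-1 with subjects ["events.*.1"]
#   EVENTS-2 with subjects ["events.*.2"]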

@Steel551454
Author

Do I understand correctly that in our case, for faster message storage in streams, we should launch multiple instances (preferably on different servers) and distribute the streams among several instances?

@Steel551454
Author

Or do you have something else in mind?

@Steel551454
Author

And another question: would the memory leak situation change if we used NVMe disks instead of traditional SSDs to store the events?

@derekcollison
Member

Yes, that is correct. @jnmoyne can help with how that gets put together.

The memory growth is not a leak: the publishing layer does not wait and publishes as fast as NATS Core allows (NATS Core can do >10M msgs/s vs around ~250k msgs/s for JetStream), so the system is simply holding onto all the staged messages waiting to be stored into the stream.

NVMe probably would not make a difference in this case IMO.

@derekcollison
Member

Is the stream an R1 or R3+?

@Steel551454
Author

We are currently experimenting with R1, R2, and R3. The build-up occurs on one, two, and three servers respectively.

@Steel551454
Author

You mentioned that inserting events through JetStream runs at around 250k/s. That is twice as much as we are currently inserting.

@derekcollison
Member

YMMV but for R1 on decent hardware that should be a good estimate.

@Steel551454
Author

Could you clarify for me how the service handles events that have not yet been written to disk and are still stored in memory? In our situation, the NATS server is consuming all memory, so in this case, will old events that have not been written to disk be deleted from memory? And what happens to events for a stream with R3 set, but which have not yet been flushed to disk?

@derekcollison
Member

The memory is an internal queue for the stream processing, which is currently unbounded but we plan on putting limits in place.

@Steel551454
Author

Is there a Prometheus metric that indicates how many events have not yet been saved to disk (written to the internal queue)?

@Steel551454
Author

I have not received an answer to my question yet: what happens to events that are stored in the internal queue and have not yet been successfully written to disk?

@derekcollison
Member

derekcollison commented Jun 17, 2024 via email

@Steel551454
Author

@derekcollison Sorry, I asked the question incorrectly. I wanted to know what happens to events in the internal queue if the service is shut down correctly and in the case of an OOM. And is there a Prometheus metric tracking this queue?

@derekcollison
Member

They are lost and not committed. Currently we are not exposing this metric but will do so most likely in 2.11.

@Steel551454
Author

@derekcollison I reread our discussion, and it seems to me that we misunderstood each other in the middle of our dialogue. Earlier, I mentioned that when switching from vector.dev to redpanda-connect (this service supports jetstream), memory continued to leak. Perhaps this is because it's not memory leakage from the application anymore but rather a buildup of the write buffer to disk. Could you please correct me on this?

@derekcollison
Member

If Redpanda Connect is doing synchronous js.Publish calls, that would not make much sense; are we sure that is what is happening?

We could triage the system and tell you what is going on. If you want to do that, grab a memory profile via the NATS CLI from all servers in the system and share them with us. CPU profiles and stacksz are helpful as well.
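One way to capture those, assuming the profiling port is enabled on each node so the standard Go pprof endpoints are exposed (the port number is illustrative):

# in the server config on each node
prof_port: 65432

# then, per node
curl -o mem.prof   "http://127.0.0.1:65432/debug/pprof/allocs"
curl -o cpu.prof   "http://127.0.0.1:65432/debug/pprof/profile?seconds=30"
curl -o stacks.txt "http://127.0.0.1:65432/debug/pprof/goroutine?debug=2"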

@Steel551454
Author

In the end, we rechecked the event publishing to NATS using JetStream. There is no memory leak. I have only a few questions left. What happens to the events stored in JetStream with disk and memory storage configured? Are they gradually read from memory and saved to disk? And what happens to the events in memory when the service is shut down correctly? I hope that when the NATS service finishes its work, it flushes all events from memory to disk and only then shuts down.

@derekcollison
Member

If the stream is backed by a file store, then yes, they are flushed to disk.

In your case I believe the publishers are publishing faster than the stream can process, and the build-up in memory is from the pending messages being held there.
