
JetStream: high disk IO, CPU usage, timeouts (Internal subscription took too long) #5595

Open
tierpod opened this issue Jun 26, 2024 · 4 comments
Labels
defect Suspected defect such as a bug or regression

Comments

tierpod commented Jun 26, 2024

Observed behavior

We are seeing strange NATS JetStream behavior: it periodically produces a huge amount of disk I/O (reads and writes) even with no clients connected (0 producers, 0 consumers). As a result we see high CPU usage during the reads and writes, and timeouts in the log:

Internal subscription on "$JS.API.CONSUMER.CREATE.XXX" took too long: 26.464577809
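
For context, the client-side call behind that API subject is consumer creation. A minimal Go sketch with the nats.go client follows; the stream name "XXX", the durable name "worker", and the 30-second MaxWait are illustrative assumptions, not values taken from this report.

// Sketch only: the kind of client call that issues a $JS.API.CONSUMER.CREATE.<stream> request.
// Stream and durable names are placeholders.
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Raise the JetStream API request timeout; the CONSUMER.CREATE request in the
    // log above took ~26 s, far beyond the client library's default wait.
    js, err := nc.JetStream(nats.MaxWait(30 * time.Second))
    if err != nil {
        log.Fatal(err)
    }

    // Creating a durable pull consumer sends the CONSUMER.CREATE API request.
    if _, err := js.AddConsumer("XXX", &nats.ConsumerConfig{
        Durable:   "worker",
        AckPolicy: nats.AckExplicitPolicy,
    }); err != nil {
        log.Fatal(err)
    }
}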

Stream configuration:

     Acknowledgements: true
            Retention: File - WorkQueue
             Replicas: 1
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: 30 GiB
          Maximum Age: 30d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited
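
For illustration, the configuration above roughly corresponds to the following nats.go stream definition; the stream name, subject, and the surrounding js JetStream context (obtained as in the sketch earlier) are assumptions.

// Sketch only: recreates the stream settings shown above.
func createStream(js nats.JetStreamContext) error {
    _, err := js.AddStream(&nats.StreamConfig{
        Name:       "XXX",                 // placeholder name
        Subjects:   []string{"xxx.>"},     // placeholder subject
        Storage:    nats.FileStorage,      // File storage
        Retention:  nats.WorkQueuePolicy,  // WorkQueue retention
        Replicas:   1,
        Discard:    nats.DiscardOld,
        Duplicates: 2 * time.Minute,
        MaxBytes:   30 * 1024 * 1024 * 1024, // 30 GiB
        MaxAge:     30 * 24 * time.Hour,     // 30 days
    })
    return err
}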

State:

             Messages: 37,886,565
                Bytes: 19 GiB
             FirstSeq: 19,941,809 @ 2024-06-25T07:57:13 UTC
              LastSeq: 65,677,737 @ 2024-06-25T10:28:36 UTC
     Deleted Messages: 7849364
     Active Consumers: 10

One service produced all of these messages; another service consumes them in batches (more slowly than the producer), roughly as in the sketch below.
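
A minimal sketch of that batch-pull pattern with nats.go; the subject, durable name, batch size, and wait time are assumptions, and it reuses the js context from the earlier sketch.

// Sketch only: pull a batch from the work queue and ack each message,
// which removes it from the stream under WorkQueue retention.
func consumeBatch(js nats.JetStreamContext) error {
    sub, err := js.PullSubscribe("xxx.>", "worker")
    if err != nil {
        return err
    }
    msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
    if err != nil {
        return err
    }
    for _, m := range msgs {
        if err := m.Ack(); err != nil {
            return err
        }
    }
    return nil
}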

I'm not sure how to reproduce it, but I have a backup of this "broken" stream locally and did some experiments with previous versions.

[image 1]

NATS 2.10.14 (1) and 2.10.12 (2): after restoring this stream from the backup (the first spike), I see only one such I/O/CPU spike after 3-5 minutes, and after that NATS works fine.

NATS 2.10.16 (3): after restoring this stream from the backup (the first spike), I see repeating spikes every N minutes. N changes if I change the jetstream.sync_interval configuration:

[image 2]
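
For anyone unfamiliar with that setting, it lives in the jetstream block of the server configuration file; a minimal sketch follows (the store_dir path and the 2m value are only examples, not this deployment's actual settings).

jetstream {
    store_dir: "/var/lib/nats"
    # How often JetStream syncs message store blocks to disk;
    # it also accepts "always" to sync on every write.
    sync_interval: "2m"
}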

I captured a pprof dump, and these functions are at the top: syncBlocks, compact, loadMsgsWithLock:

[image 3]

After most of the messages are consumed, these spikes become less pronounced, so the effect depends on stream size (data size or number of messages).

Expected behavior

We currently run NATS 2.8.4 in production and do not see this issue there. I know the storage format changed in 2.10, but we did a fresh installation of NATS without migrating the current production storage.

Server and client version

nats-server: v2.10.16; messages were pushed to the stream by Nats.net 1.1.5.0

Host environment

OracleLinux 9.4, Linux 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux; NATS runs directly on the host, without containers.

Steps to reproduce

No response

derekcollison (Member) commented:

Could you share the backup of the stream with us?

derek@synadia.com


tierpod commented Jun 27, 2024

Sorry, but I'm not allowed to share the backup. I'll try to reproduce it using fake data/workload and let you know.

yuzhou-nj commented:

Deleted Messages: 7849364 <-- Maybe the nats-server is busy deleting messages.


tierpod commented Jul 18, 2024

This is our main assumption as well. It looks like we have a quite busy queue: the producer pushes a lot of messages into it, and several consumers pull messages in batches and ack each message individually. WorkQueue retention was chosen because we don't need to keep messages after they are acked, so we wanted to save filesystem storage space. But after a lot of deletes (which cause I/O operations and latency), NATS starts compacting the filesystem storage (which causes further I/O and latency).
