
JetStream: high disk IO, CPU usage, timeouts (Internal subscription took too long) #5595

Open
tierpod opened this issue Jun 26, 2024 · 4 comments
Labels
defect Suspected defect such as a bug or regression

Comments

tierpod commented Jun 26, 2024

Observed behavior

We are seeing strange NATS JetStream behavior: it periodically produces a huge amount of disk I/O (reads and writes) even with no clients connected (0 producers, 0 consumers). As a result we see high CPU usage during the reads and writes, and timeouts in the log:

Internal subscription on "$JS.API.CONSUMER.CREATE.XXX" took too long: 26.464577809
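
For context, the client-side call behind that API subject is consumer creation. A minimal Go sketch with the nats.go client follows; the stream name "XXX", the durable name "worker", and the 30-second MaxWait are illustrative assumptions, not values taken from this report.

// Sketch only: the kind of client call that issues a $JS.API.CONSUMER.CREATE.<stream> request.
// Stream and durable names are placeholders.
package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Raise the JetStream API request timeout; the CONSUMER.CREATE request in the
    // log above took ~26 s, far beyond the client library's default wait.
    js, err := nc.JetStream(nats.MaxWait(30 * time.Second))
    if err != nil {
        log.Fatal(err)
    }

    // Creating a durable pull consumer sends the CONSUMER.CREATE API request.
    if _, err := js.AddConsumer("XXX", &nats.ConsumerConfig{
        Durable:   "worker",
        AckPolicy: nats.AckExplicitPolicy,
    }); err != nil {
        log.Fatal(err)
    }
}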

Stream configuration:

     Acknowledgements: true
            Retention: File - WorkQueue
             Replicas: 1
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false
     Maximum Messages: unlimited
        Maximum Bytes: 30 GiB
          Maximum Age: 30d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited
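
For illustration, the configuration above roughly corresponds to the following nats.go stream definition; the stream name, subject, and the surrounding js JetStream context (obtained as in the sketch earlier) are assumptions.

// Sketch only: recreates the stream settings shown above.
func createStream(js nats.JetStreamContext) error {
    _, err := js.AddStream(&nats.StreamConfig{
        Name:       "XXX",                 // placeholder name
        Subjects:   []string{"xxx.>"},     // placeholder subject
        Storage:    nats.FileStorage,      // File storage
        Retention:  nats.WorkQueuePolicy,  // WorkQueue retention
        Replicas:   1,
        Discard:    nats.DiscardOld,
        Duplicates: 2 * time.Minute,
        MaxBytes:   30 * 1024 * 1024 * 1024, // 30 GiB
        MaxAge:     30 * 24 * time.Hour,     // 30 days
    })
    return err
}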

State:

             Messages: 37,886,565
                Bytes: 19 GiB
             FirstSeq: 19,941,809 @ 2024-06-25T07:57:13 UTC
              LastSeq: 65,677,737 @ 2024-06-25T10:28:36 UTC
     Deleted Messages: 7849364
     Active Consumers: 10

One service produced all of these messages; another service consumes them in batches (more slowly than the producer), roughly as in the sketch below.
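
A minimal sketch of that batch-pull pattern with nats.go; the subject, durable name, batch size, and wait time are assumptions, and it reuses the js context from the earlier sketch.

// Sketch only: pull a batch from the work queue and ack each message,
// which removes it from the stream under WorkQueue retention.
func consumeBatch(js nats.JetStreamContext) error {
    sub, err := js.PullSubscribe("xxx.>", "worker")
    if err != nil {
        return err
    }
    msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
    if err != nil {
        return err
    }
    for _, m := range msgs {
        if err := m.Ack(); err != nil {
            return err
        }
    }
    return nil
}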

I'm not sure how to reproduce it, but I have a backup of this "broken" stream locally and did some experiments with previous versions.

[image 1]

NATS 2.10.14 (1) and 2.10.12 (2): after restoring this stream from the backup (the first spike), I see only one such I/O/CPU spike after 3-5 minutes, and after that NATS works fine.

NATS 2.10.16 (3): after restoring this stream from the backup (the first spike), I see repeating spikes every N minutes. N changes if I change the jetstream.sync_interval configuration:

[image 2]
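
For anyone unfamiliar with that setting, it lives in the jetstream block of the server configuration file; a minimal sketch follows (the store_dir path and the 2m value are only examples, not this deployment's actual settings).

jetstream {
    store_dir: "/var/lib/nats"
    # How often JetStream syncs message store blocks to disk;
    # it also accepts "always" to sync on every write.
    sync_interval: "2m"
}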

I captured a pprof dump, and these functions are at the top: syncBlocks, compact, loadMsgsWithLock:

[image 3]

After most of the messages are consumed, these spikes become less pronounced, so the effect depends on stream size (data size or number of messages).

Expected behavior

We currently run NATS 2.8.4 in production and do not see this issue there. I know the storage format changed in 2.10, but we did a fresh installation of NATS without migrating the current production storage.

Server and client version

nats-server: v2.10.16; messages were pushed to the stream by Nats.net 1.1.5.0

Host environment

OracleLinux 9.4, Linux 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux; NATS runs directly on the host, without containers.

Steps to reproduce

No response

derekcollison (Member) commented:

Could you share the backup of the stream with us?

derek@synadia.com


tierpod commented Jun 27, 2024

Sorry, but I'm not allowed to share the backup. I'll try to reproduce it using fake data/workload and let you know.

yuzhou-nj commented:

Deleted Messages: 7849364 <-- Maybe the nats-server is busy deleting messages.


tierpod commented Jul 18, 2024

This is our main assumption as well. It looks like we have a quite busy queue: the producer pushes a lot of messages into it, and several consumers pull messages in batches and ack each message individually. WorkQueue retention was chosen because we don't need to keep messages after they are acked, so we wanted to save filesystem storage space. But after a lot of deletes (which cause I/O operations and latency), NATS starts compacting the filesystem storage (which causes further I/O and latency).
