You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We experience strange nats jetstream behavior: it periodically produces huge amount of disk IO (reads and writes) even without any clients connected (0 producers, 0 consumers connected). As a result - high CPU usage during reads and writes and timeouts in log:
Internal subscription on "$JS.API.CONSUMER.CREATE.XXX" took too long: 26.464577809
Stream configuration:
Acknowledgements: true
Retention: File - WorkQueue
Replicas: 1
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Maximum Messages: unlimited
Maximum Bytes: 30 GiB
Maximum Age: 30d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
State:
Messages: 37,886,565
Bytes: 19 GiB
FirstSeq: 19,941,809 @ 2024-06-25T07:57:13 UTC
LastSeq: 65,677,737 @ 2024-06-25T10:28:36 UTC
Deleted Messages: 7849364
Active Consumers: 10
One service produced all these messages, another service should consume them in batches (slower than producer).
I'm not sure how to reproduce it, but I have a backup of this "broken" stream locally and did some experiments with previous versions.
Nats 2.10.14 (1), 2.10.12 (2): after restoring this stream from backup (first spike), I see only 1 such IO/CPU spike after 3-5 minutes and later nats works fine.
Nats 2.10.16 (3) : after restoring this stream from backup (first spike), I see repeating peaks every N minutes. N changes if I change jetstream.sync_interval configuration:
I created pprof dump and see these functions in top: syncBlocks, compact, loadMsgsWithLock:
After consuming most messages these spikes becomes less visible, so it depends on stream size (data size or number of messages).
Expected behavior
Now, we use nats 2.8.4 in production and there is no such issue. I know that storage was changed in 2.10, but we did fresh installation of nats, without migrating current production storage.
Server and client version
nats-server: v2.10.16; messages were pushed to stream by Nats.net 1.1.5.0
Host environment
OracleLinux 9.4, Linux 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux, nats without containers.
Steps to reproduce
No response
The text was updated successfully, but these errors were encountered:
This is our main assumption. It looks like we have quite busy queue - producer pushes a lot of messages into queue, several consumers pull messages in batches and send acks on each message. WorkQueue was chosen because we don't need to keep messages after ack, so we wanted to save filesystem storage space. But after a lot of deletes (that leads to IO operations and latency) nats starts to compact filesystem storage (that leads to IO operation and latency too).
Observed behavior
We experience strange nats jetstream behavior: it periodically produces huge amount of disk IO (reads and writes) even without any clients connected (0 producers, 0 consumers connected). As a result - high CPU usage during reads and writes and timeouts in log:
Stream configuration:
One service produced all these messages, another service should consume them in batches (slower than producer).
I'm not sure how to reproduce it, but I have a backup of this "broken" stream locally and did some experiments with previous versions.
Nats 2.10.14 (1), 2.10.12 (2): after restoring this stream from backup (first spike), I see only 1 such IO/CPU spike after 3-5 minutes and later nats works fine.
Nats 2.10.16 (3) : after restoring this stream from backup (first spike), I see repeating peaks every N minutes. N changes if I change
jetstream.sync_interval
configuration:I created pprof dump and see these functions in top: syncBlocks, compact, loadMsgsWithLock:
After consuming most messages these spikes becomes less visible, so it depends on stream size (data size or number of messages).
Expected behavior
Now, we use nats 2.8.4 in production and there is no such issue. I know that storage was changed in 2.10, but we did fresh installation of nats, without migrating current production storage.
Server and client version
nats-server: v2.10.16; messages were pushed to stream by Nats.net 1.1.5.0
Host environment
OracleLinux 9.4,
Linux 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
, nats without containers.Steps to reproduce
No response
The text was updated successfully, but these errors were encountered: