Observed behavior
As part of testing system failure and recovery on a local 3-blade server running a NATS R3 cluster in k8s, we have somehow ended up in an unexpectedly inconsistent state that NATS seems incapable of recovering from.
We first observed the issue when consuming messages from an R3 Stream: different messages were returned depending on which node the Consumer was created on. We're unsure of what exactly caused it, but it was somehow triggered by intentionally pulling the power cables from all servers and booting them back up.
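For context on how the divergence shows up from a client, here is a minimal sketch (Go, using nats.go) that creates an ephemeral ordered consumer through a connection to each node and prints the stream sequences it receives. The localhost URLs are assumptions matching the docker-compose repro further down, and since consumer placement is ultimately decided by the cluster, which replica actually serves the consumer can vary between runs:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical per-node client URLs; adjust to the actual deployment.
	for _, url := range []string{"nats://localhost:4222", "nats://localhost:4223", "nats://localhost:4224"} {
		nc, err := nats.Connect(url)
		if err != nil {
			log.Fatalf("%s: %v", url, err)
		}
		js, err := nc.JetStream()
		if err != nil {
			log.Fatalf("%s: %v", url, err)
		}
		// Ephemeral ordered consumer over the stream's subjects.
		sub, err := js.SubscribeSync("config.system.>", nats.OrderedConsumer())
		if err != nil {
			log.Fatalf("%s: %v", url, err)
		}
		var seqs []uint64
		for {
			msg, err := sub.NextMsg(2 * time.Second)
			if err != nil {
				break // timeout: no more messages
			}
			meta, _ := msg.Metadata()
			seqs = append(seqs, meta.Sequence.Stream)
		}
		fmt.Printf("%s delivered stream sequences: %v\n", url, seqs)
		sub.Unsubscribe()
		nc.Close()
	}
}
```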
Possible assumptions / Wild Guesses
The stream state is only verified against the Meta stream state. This can cause the actual stream data to be completely out of sync. It seems to me that NATS is missing some data integrity checks on the actual stream data when recovering state from it.
The stream disk state is only verified based on Last Sequence, and recovery fails for Streams that have a mismatch on First Sequence. There is a var errFirstSequenceMismatch = errors.New("first sequence mismatch") in the nats-server code base, but no place where it is returned. This could indicate that at some point there was a check for it, and that maybe it should be put back in place.
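To illustrate the kind of check being suggested (a purely hypothetical sketch, not the actual nats-server recovery code): compare the first sequence the replicated state expects against what the on-disk store actually recovered, instead of trusting Last Sequence alone.

```go
package main

import (
	"errors"
	"fmt"
)

// Purely illustrative; this is NOT the actual nats-server code path.
var errFirstSequenceMismatch = errors.New("first sequence mismatch")

// verifyFirstSeq returns an error when the recovered on-disk first sequence
// disagrees with the first sequence the replicated state expects.
func verifyFirstSeq(expectedFirst, recoveredFirst uint64) error {
	if expectedFirst != recoveredFirst {
		return errFirstSequenceMismatch
	}
	return nil
}

func main() {
	// Using the values observed below: nats-2 recovered first sequence 778
	// while nats-0 and nats-1 report 1,027.
	fmt.Println(verifyFirstSeq(1027, 778)) // first sequence mismatch
}
```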
I understand that RAFT only cares about the log index, which I assume translates to Last Sequence, but from what I understand of NATS's RAFT implementation, it does not follow the Leader Append-Only property because it actually removes old messages from a Stream. I would think that removing messages would not be related to the RAFT log at all, but I have reason to believe that's not the case, according to the NATS JetStream Clustering docs:
NATS server optimizes this by combining the data plane for replicating messages with the messages RAFT would normally use to ensure consensus
I'm happy to run more tests and provide whatever extra info is needed. After trying to look into this myself, we thought it would make sense to have the actual NATS maintainers shed some light on this instead of doing further guesswork.
Expected behavior
NATS would be able to recover a consistent state across all replicas.
Server and client version
Observed in
2.10.16
2.10.17-RC.6
Host environment
The NATS server is configured with 10GB of file storage and no in-memory storage. The underlying volumes are provisioned with OpenEBS LocalPV on top of a multi-disk ZFS pool of SSD storage.
Steps to reproduce
I have tried reproducing it locally without any success, and have not tried reproducing it again in our test cluster so as not to lose the state we have ended up in. I'm mostly convinced we have run into the same issue on a production cluster before, but there were too many variables to pinpoint what caused it at the time.
Here's a repo containing the minimum state and all configurations I used that match our cluster setup. It's not a reproduction of what caused the root issue but rather a snapshot of the broken state. I dumped all the state files from the server and re-created the cluster locally using docker-compose after removing all the other streams and consumers. After restarting nodes multiple times I could still observe the state drift and no error logs, which indicates to me that NATS is unaware of it.
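For anyone spinning up the repro, here is a minimal sketch (Go, using nats.go) of how the reported stream state can be compared. The localhost ports are an assumption about how the docker-compose setup exposes each node, and since stream info requests are answered by the current stream leader, a leader step-down between runs is needed to sample each replica, as in the stream info outputs below.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical client ports for the three docker-compose nodes.
	for _, url := range []string{"nats://localhost:4222", "nats://localhost:4223", "nats://localhost:4224"} {
		nc, err := nats.Connect(url)
		if err != nil {
			log.Fatalf("%s: %v", url, err)
		}
		js, err := nc.JetStream()
		if err != nil {
			log.Fatalf("%s: %v", url, err)
		}
		info, err := js.StreamInfo("system-config")
		if err != nil {
			log.Fatalf("%s: %v", url, err)
		}
		// Print the state fields that disagree between replicas in this report.
		fmt.Printf("%s: msgs=%d first=%d last=%d\n",
			url, info.State.Msgs, info.State.FirstSeq, info.State.LastSeq)
		nc.Close()
	}
}
```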
Further investigation
Having no helpful logs and not enough knowledge of how to debug these kinds of issues using the CLI, I decided to look at the files that make up the JetStream storage.
I focused on a stream named system-config. After finding it under the correct account directory there were 2 files: <index>.blk and index.db. nats-0 and nats-1 contained a 258.blk whereas nats-2 contained a 1.blk. There was an obvious correlation between that and the messages I was getting from the stream: I always got the same results from nats-0 and nats-1, which differed from nats-2. The stream data in nats-2 looked correct with 7 messages, matching the number of subjects with a max of 1 message per subject, whereas the other 2 nodes only contained the last 3 messages of the stream. So essentially Last Sequence matches, but the number of messages, the number of subjects and First Sequence don't.
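As a quick way to compare those block files across the dumped node directories, here is a small sketch. The paths assume the repro repo layout (data/<node>/...), the USERS account, and the usual filestore layout where the block files and index.db live under the stream's msgs/ subdirectory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Assumed repro layout: data/<node>/jetstream/USERS/streams/system-config/msgs/
	for _, node := range []string{"nats-0", "nats-1", "nats-2"} {
		dir := filepath.Join("data", node, "jetstream", "USERS", "streams", "system-config", "msgs")
		entries, err := os.ReadDir(dir)
		if err != nil {
			fmt.Printf("%s: %v\n", node, err)
			continue
		}
		for _, e := range entries {
			info, err := e.Info()
			if err != nil {
				continue
			}
			// List each block/index file and its size per node.
			fmt.Printf("%s: %s (%d bytes)\n", node, e.Name(), info.Size())
		}
	}
}
```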
Stream info from node-0
Information for Stream system-config created 2024-06-13 14:49:29
Subjects: config.system.>
Replicas: 3
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 1
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-0
Replica: nats-1, current, seen 489ms ago
Replica: nats-2, current, seen 490ms ago
State:
Messages: 3
Bytes: 217 B
First Sequence: 1,027 @ 2024-06-19 16:14:50 UTC
Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
Active Consumers: 0
Number of Subjects: 3
Stream info from node-1
Information for Stream system-config created 2024-06-13 14:49:29
Subjects: config.system.>
Replicas: 3
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 1
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-1
Replica: nats-0, current, seen 491ms ago
Replica: nats-2, current, seen 490ms ago
State:
Messages: 3
Bytes: 217 B
First Sequence: 1,027 @ 2024-06-19 16:14:50 UTC
Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
Active Consumers: 0
Number of Subjects: 3
Stream info from node-2
Information for Stream system-config created 2024-06-13 14:49:29
Subjects: config.system.>
Replicas: 3
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 1
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-2
Replica: nats-0, current, seen 461ms ago
Replica: nats-1, current, seen 461ms ago
State:
Messages: 7
Bytes: 1.1 KiB
First Sequence: 778 @ 2024-06-19 17:19:31 UTC
Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
Deleted Messages: 245
Active Consumers: 0
Number of Subjects: 7
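As a side note, each node's reported state is internally consistent on its own terms: on nats-2, sequences 778 through 1,029 span 252 entries, of which 245 are deleted, leaving the 7 messages shown; on nats-0 and nats-1, sequences 1,027 through 1,029 account for exactly the 3 messages shown. The disagreement is purely between replicas, not within any single store.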
Here are a couple of observations I made by intentionally manipulating the stream state data.
If we run the following in the repro repo:
docker-compose up -d to start the cluster
nats stream cluster step-down system-config until you get nats-2 as the leader
docker-compose stop nats-1
sudo rm -rf data/nats-1/jetstream/SYS/_js_/* data/nats-1/jetstream/USERS/streams/system-config
docker-compose up nats-1
The state is correctly restored from the leader (nats-2).
But if instead we only remove the stream data, leaving the Meta stream data in place, with sudo rm -rf data/nats-1/jetstream/USERS/streams/system-config, then nats-1 comes back with a stream containing no data at all while reporting it as current.
Information for Stream system-config created 2024-06-13 14:49:29
Subjects: config.system.>
Replicas: 3
Storage: File
Options:
Retention: Limits
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 1
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats
Leader: nats-1
Replica: nats-0, current, seen 491ms ago
Replica: nats-2, current, seen 491ms ago
State:
Messages: 0
Bytes: 0 B
First Sequence: 0
Last Sequence: 0
Active Consumers: 0