Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orphan messages in the stream #5542

Closed
debajyoti-truefoundry opened this issue Jun 15, 2024 · 7 comments
Closed

Orphan messages in the stream #5542

debajyoti-truefoundry opened this issue Jun 15, 2024 · 7 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@debajyoti-truefoundry
Copy link

Observed behavior

I started a discussion on the Slack channel but opened an issue, too.

I am using NATS 2.10.12-alpine3.19. I have a stream like this below,

Information for Stream tfy-agent-state-buffer created 2024-05-20 14:19:40

              Subjects: tfy-agent-state-buffer.>
              Replicas: 3
               Storage: File

Options:

             Retention: WorkQueue
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 15m0s
            Direct Get: true
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: true

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: 20
         Maximum Bytes: unlimited
           Maximum Age: 1d0h0m0s
  Maximum Message Size: 50 KiB
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: truefoundry-nats-2
               Replica: truefoundry-nats-0, current, seen 445ms ago
               Replica: truefoundry-nats-1, current, seen 445ms ago

State:

              Messages: 1,891
                 Bytes: 956 KiB
        First Sequence: 9,780,767 @ 2024-06-15 00:49:51 UTC
         Last Sequence: 9,813,689 @ 2024-06-15 08:20:20 UTC
      Deleted Messages: 31,032
      Active Consumers: 173
    Number of Subjects: 1,226

As you can see, 1891 messages are in the stream. If I look at filter subject tfy-agent-state-buffer.tfy-devtest-euwe1.>

❯ nats -s nats://localhost:4222 --creds prod-nats.creds stream subjects tfy-agent-state-buffer | grep tfy-devtest-euwe1
│ tfy-agent-state-buffer.tfy-devtest-euwe1.application.active-version.business-ws.customer-supslack-bot                                 │ 1     │
│ tfy-agent-state-buffer.tfy-devtest-euwe1.job-run.status.truefoundcivo-usage-tracker-1718399700                                        │ 1     │
...

But the consumer for the above filter subject shows no pending messages.

❯ nats -s nats://localhost:4222 --creds prod-nats.creds consumer info tfy-agent-state-buffer cluster-tfy-devtest-euwe1
Information for Consumer tfy-agent-state-buffer > cluster-tfy-devtest-euwe1 created 2024-06-14T18:16:33+05:30

Configuration:

                    Name: cluster-tfy-devtest-euwe1
               Pull Mode: true
          Filter Subject: tfy-agent-state-buffer.tfy-devtest-euwe1.>
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
      Maximum Deliveries: 2,880
         Max Ack Pending: 1,000
       Max Waiting Pulls: 512

Cluster Information:

                    Name: nats
                  Leader: truefoundry-nats-2
                 Replica: truefoundry-nats-0, current, seen 667ms ago
                 Replica: truefoundry-nats-1, current, seen 667ms ago

State:

  Last Delivered Message: Consumer sequence: 132,187 Stream sequence: 9,813,760 Last delivery: 3m35s ago
    Acknowledgment Floor: Consumer sequence: 132,187 Stream sequence: 9,813,698 Last Ack: 3m35s ago
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 1 of maximum 512
  • All these commands were executed by directly connecting to the stream and the consumer leader.
  • We do not execute Nak or Term in our system on messages.
  • Even with the consumer report command, I do not see any pending or unprocessed messages on any consumer.
  • We use the subject rollups and dedup feature while publishing messages in the stream.

Our consumption flow looks something like this,

  1. We create or update the consumer for a filter subject.
    a. We fetch max 50 messages from the consumer using Fetch.
    b. We process these 50 messages serially and in order.
    c. After processing each message, we execute DoubleAck. We do not process message N if N - 1 is not double-acked
    successfully.
    d. Go back to (a) after the batch is processed completely.

In a different environment, we are running nats:2.10.14-alpine3.19
I am noticing the same issue there, too.

❯ nats -s ... --creds devtest-nats.creds stream get --last-for "tfy-agent-state-buffer.tfy-willk-usce.>"
? Select a Stream tfy-agent-state-buffer
Item: tfy-agent-state-buffer#990719 received 2024-06-15 00:38:11.357294174 +0000 UTC on Subject tfy-agent-state-buffer.tfy-willk-usce.application.active-version.tfy-agent.tfy-agent-cp

Headers:
  Nats-Rollup: sub
  Nats-Msg-Id: 343a147c7869f6291d2759731f4eb666c15ff76c

...

❯ nats -s ... --creds devtest-nats.creds consumer info tfy-agent-state-buffer cluster-tfy-willk-usce
Information for Consumer tfy-agent-state-buffer > cluster-tfy-willk-usce created 2024-06-05T16:46:08+05:30

Configuration:

                    Name: cluster-tfy-willk-usce
               Pull Mode: true
          Filter Subject: tfy-agent-state-buffer.tfy-willk-usce.>
          Deliver Policy: All
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
      Maximum Deliveries: 2,880
         Max Ack Pending: 1,000
       Max Waiting Pulls: 512

Cluster Information:

                    Name: nats
                  Leader: nats-1
                 Replica: nats-0, current, seen 699ms ago
                 Replica: nats-2, current, seen 699ms ago

State:

  Last Delivered Message: Consumer sequence: 3,517 Stream sequence: 998,846 Last delivery: 58.14s ago
    Acknowledgment Floor: Consumer sequence: 3,517 Stream sequence: 998,846 Last Ack: 58.13s ago
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 1 of maximum 512

❯ nats -s ... --creds devtest-nats.creds stream get --last-for "tfy-agent-state-buffer.tfy-willk-usce.>"
? Select a Stream tfy-agent-state-buffer
Item: tfy-agent-state-buffer#990719 received 2024-06-15 00:38:11.357294174 +0000 UTC on Subject tfy-agent-state-buffer.tfy-willk-usce.application.active-version.tfy-agent.tfy-agent-cp

Headers:
  Nats-Rollup: sub
  Nats-Msg-Id: 343a147c7869f6291d2759731f4eb666c15ff76c

Expected behavior

The stream will not have any orphan messages.

Server and client version

The server version is mentioned above.

CLI

❯ nats --version
0.1.4

On the consumption side, we use the Go client v1.34.1.
On the publisher side, we use "nats.ws": "^1.21.0".

Host environment

No response

Steps to reproduce

No response

@debajyoti-truefoundry debajyoti-truefoundry added the defect Suspected defect such as a bug or regression label Jun 15, 2024
@debajyoti-truefoundry
Copy link
Author

We found this issue while debugging a potential orphan message scenario. We were not receiving the messages in the stream from the interested consumer. Even though there were messages in the stream, the consumer was not showing any outstanding acks or unprocessed messages.

On the Slack thread, @ripienaar suggested that there can be some edge cases due to using the rollup feature in a work queue stream, but I have not verified this yet.

@debajyoti-truefoundry
Copy link
Author

We use rollups to replace existing messages from the stream for some subjects if they have not been processed yet.

We could not model this using the per-subject message limit as, in some cases, I need to keep multiple messages with the same subject.

Sometimes, we want a message of a particular subject to replace any existing unprocessed messages in the stream with the same subject.

We decided to use rollups to model this.

@derekcollison
Copy link
Member

We have made some fixes in the area of stream sync. I would suggest trying 2.10.17-RC3 (formal release will be next week).

https://github.com/nats-io/nats-server/releases

@debajyoti-truefoundry
Copy link
Author

Cool. We will try it once the release is done on our devtest environment.

@sammy007
Copy link

On 2.10.17 either ordering is broken or orphans even when rollup is used.

@Jarema
Copy link
Member

Jarema commented Jul 1, 2024

@sammy007 please provide details and insights that lead you to this conclusion.

@debajyoti-truefoundry
Copy link
Author

debajyoti-truefoundry commented Jul 18, 2024

We are running 2.10.17 and have not noticed this issue again yet. We noticed that 2.10.18 also had related fixes and upgraded on one of our environments. I will close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect Suspected defect such as a bug or regression
4 participants