Stream does not store messages until restart or leader election #6391

Open
VadimZhiltsov opened this issue Jan 21, 2025 · 4 comments
Labels
defect Suspected defect such as a bug or regression

Comments


VadimZhiltsov commented Jan 21, 2025

Observed behavior

We observe that streams sometimes stop storing messages, with no apparent time correlation.

As a result, the publisher service starts logging the error nats: no response from stream. For some reason it receives a "no responders" error, as if the stream did not exist or a leader election were in progress.

Let's call the problematic subject some.subject; it is stored in JetStream.
It was part of a large stream, but it was the only problematic subject in that stream; all other subjects in the same stream were stored correctly.
To investigate, we moved the subject into its own stream, but the problem is still there.

For that subject we also had a Core NATS subscriber, which for a long time caused "nats: timeout" and "nats: invalid jetstream response" errors instead of the real error (see the issue reported in the client repo). Once we removed the core subscriber, we started getting the nats: no response from stream error.

The incident happens from time to time (approximately once every two weeks).

Our observations:

  • at that time we do not see any leader election in the logs, or anything else suspicious
  • cluster resource consumption is normal (CPU, memory, disk)
  • all other streams and subjects are stored correctly

At first we fixed the problem by restarting NATS, but later we found that it is enough to trigger a leader election for the problematic stream, after which the system behaves normally again. Sometimes the problem heals itself, but we suppose that is because a leader election happened.
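
The manual fix described above (forcing a leader election for a single stream) goes through the JetStream API. As a rough sketch, the Go helper below only builds the request subject for a stream leader step-down. `stepDownSubject` is a hypothetical name, and whether a domain prefix is needed (the server config in this report sets `domain: product`) depends on where the client connects, which is an assumption here. The `nats stream cluster step-down` CLI command issues the same kind of request.

```go
package main

import (
	"fmt"
	"strings"
)

// stepDownSubject builds the JetStream API subject that asks the current
// leader of a stream to step down, which triggers a new leader election.
// An empty domain selects the default "$JS.API" prefix.
// Hypothetical helper for illustration; not part of any client library.
func stepDownSubject(domain, stream string) (string, error) {
	// Stream names must not be empty or contain whitespace, '.', '*' or '>'.
	if stream == "" || strings.ContainsAny(stream, " .*>\t\r\n") {
		return "", fmt.Errorf("invalid stream name %q", stream)
	}
	prefix := "$JS.API"
	if domain != "" {
		prefix = "$JS." + domain + ".API"
	}
	return prefix + ".STREAM.LEADER.STEPDOWN." + stream, nil
}

func main() {
	// Stream name taken from the Terraform definition in this report;
	// the "product" domain is only needed when crossing JetStream domains.
	for _, domain := range []string{"", "product"} {
		subj, err := stepDownSubject(domain, "SOME_STREAM_NAME")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(subj)
	}
}
```

A client would then send a request to that subject and check the JSON reply for an error before assuming that a new leader was elected.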

We would appreciate any help, because we cannot think of a cause other than a NATS server issue.

Expected behavior

Stream stores published messages

Server and client version

server: v2.10.20
client: v1.37.0

Host environment

Terraform definitions for stream:

resource "jetstream_stream" "STEAM_RESOURCE_NAME" {
  name = "SOME_STREAM_NAME"
  subjects = [
    "some.subject",
  ]
  ack         = true
  sample_freq = 100
  storage     = "file"
  replicas    = 3
  max_bytes   = local.max_bytes * 3 # 300 MB
  max_age     = local.max_age * 7   # 7 days
}

3 nodes cluster with similar configs:

listen: ip:4222
http: 0.0.0.0:8080
server_name: name
jetstream {
    store_dir: /data-nats
    domain: product
}
cluster {
  name: name
  listen: ip:4223
  no_advertise = true
  authorization {
    user: user_nats_cluster_production
    password: ****
    timeout: 0.75
  }
.
  routes = [
    nats-route://url1:4223
    nats-route://url2:4223
  ]
}
mqtt {
    listen 172.17.99.70:4225
}
debug:   true
include ./leaf.conf
include ./account.conf

Steps to reproduce

No response

VadimZhiltsov added the defect label Jan 21, 2025
neilalexander (Member) commented

Does this happen on the latest 2.10.24 version?

> For that subject we also had core subscriber, which for a long time generated us "nats: timeout" and "nats: invalid jetstream response". So once we removed the core subscriber, we were able to get nats: no response from stream error.

You need to be very careful when using Core NATS subs on the same subjects as JetStream streams. If you can avoid doing this then all the better, but if you can't, those Core NATS subs must not send back acks/replies, otherwise you will break clients. You might find that the stream RePublish option is a safer approach (the stream republishes messages to a new Core NATS subject once they are accepted).
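
To illustrate the point about replies: a JetStream publish is a request, and the client treats the first reply it receives as the stream's acknowledgement. The Go sketch below is a simplified model of that classification, not the nats.go implementation. `classifyReply`, `pubAck`, and the two error variables are hypothetical stand-ins, and the `noResponders` flag models the server's "no responders" status. It shows why a reply from a Core NATS subscriber can surface as an invalid-response error, while a publish that nobody answers surfaces as no response from stream.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Stand-ins for the client errors quoted in this thread.
// These are NOT the nats.go error values.
var (
	errNoStreamResponse = errors.New("no response from stream")
	errInvalidAck       = errors.New("invalid jetstream response")
)

// pubAck models the fields of a JetStream publish acknowledgement
// that matter for this sketch.
type pubAck struct {
	Stream string `json:"stream"`
	Seq    uint64 `json:"seq"`
	Error  *struct {
		Code        int    `json:"code"`
		Description string `json:"description"`
	} `json:"error,omitempty"`
}

// classifyReply interprets the first reply to a publish request.
// noResponders is true when nothing was listening on the subject.
func classifyReply(noResponders bool, data []byte) (*pubAck, error) {
	if noResponders {
		return nil, errNoStreamResponse
	}
	var ack pubAck
	if err := json.Unmarshal(data, &ack); err != nil {
		// Not JSON at all, e.g. a plain "ok" from a core subscriber.
		return nil, errInvalidAck
	}
	if ack.Error != nil {
		return nil, fmt.Errorf("stream error %d: %s", ack.Error.Code, ack.Error.Description)
	}
	if ack.Stream == "" {
		// Valid JSON, but not a publish acknowledgement.
		return nil, errInvalidAck
	}
	return &ack, nil
}

func main() {
	replies := [][]byte{
		[]byte(`{"stream":"SOME_STREAM_NAME","seq":42}`), // ack from the stream leader
		[]byte(`ok`),                // reply sent by a core subscriber
		[]byte(`{"status":"done"}`), // JSON reply that is not an ack
	}
	for _, r := range replies {
		ack, err := classifyReply(false, r)
		fmt.Println(ack != nil, err)
	}
	_, err := classifyReply(true, nil)
	fmt.Println(err)
}
```

If the core subscriber's reply arrives before the stream's acknowledgement, the publisher sees the second or third case even though the message may have been stored, which is why the subscriber should not reply at all.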


VadimZhiltsov commented Jan 21, 2025

@neilalexander we haven't tried it: we have been fighting this problem since September 2024, and at that time 2.10.20 was the latest version. Is there a chance the newer releases contain a fix?

VadimZhiltsov changed the title from "Stream does not store messages until restart or leader delection" to "Stream does not store messages until restart or leader election" Jan 21, 2025
markovichecha commented

We are dealing with the same issue right now. We also use Core NATS together with JetStream, and some of the streams get stuck when publishing messages, with the error nats: no response from stream.


VadimZhiltsov commented Jan 23, 2025

@neilalexander we have updated to 2.10.24, but it will take time to prove that it helped, as the issue happens once every 2-3 weeks.

@markovichecha since you have a similar issue, could you share your NATS server and NATS client versions?
