
Server sometimes stops handling webhook events #352

Open
StefanNemeth opened this issue Feb 7, 2025 · 3 comments · Fixed by #349
Assignees
StefanNemeth
Labels
application-server, bug (Something isn't working), todo

Comments

@StefanNemeth
Contributor

StefanNemeth commented Feb 7, 2025

In some instances, we noticed that our app server stops handling webhook events, seemingly at random.

Server logs

This is an excerpt from the logs that contains the last event we handled and the first errors we got after that.

application-server-1  | 2025-02-07T15:51:01.632Z  INFO 1 --- [Helios] [         nats:4] .a.h.w.g.GitHubWorkflowRunMessageHandler : Received worfklow run event for repository: ls1intum/Artemis, workflow run: https://api.github.com/repos/ls1intum/Artemis/actions/runs/13202866255, action: completed
[....]
application-server-1  | 2025-02-07T15:51:06.956Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl  : pullStatusError, Connection: 13, Subscription: 615556999, Consumer Name: xy23djLX, Status:Status{code=409, message='Consumer Deleted'}
[....]
application-server-1  | 2025-02-07T15:53:59.905Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl  : heartbeatAlarm, Connection: 13, Subscription: 94524147, Consumer Name: xy23djLX, lastStreamSequence: 58810, lastConsumerSequence: 40771

Workaround

Currently, when the issue happens, we restart the app server, use the GitHub sync, and manually fix deployments. This works most of the time, and the issue doesn't happen frequently.

What we know so far

What we should do

  • Make sure our events are handled in a timely manner by setting a maximum handling time, e.g. a timeout for GitHub API calls, and by tracking slow handlers
  • Implement the durable consumer (there is already a PR with that fix: Use durable consumer for GitHub webhook events #349)
  • Log the errors to Sentry as well and add a notification for errors like "heartbeatAlarm" so we know about them early
  • If the PR fix (Use durable consumer for GitHub webhook events #349) takes longer, we can manually raise the InactiveThreshold to 20+ seconds as a hot fix for now
  • If we don't acknowledge a message within a certain timeframe, it will be redelivered by NATS, so we should make sure we set that config (ack wait) as well (see the configuration sketch after this list)
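
A minimal sketch of what a durable pull consumer with these settings could look like using the NATS Java client. The consumer name, subject, and duration values are hypothetical and would need to match our actual setup:

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;
import io.nats.client.api.ConsumerConfiguration;

import java.time.Duration;
import java.util.List;

public class WebhookConsumerSketch {

    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStream js = nc.jetStream();

            // Durable consumer: survives restarts and is not cleaned up after a short
            // period of inactivity (avoids the 409 "Consumer Deleted" errors).
            ConsumerConfiguration cc = ConsumerConfiguration.builder()
                    .durable("github-webhook-consumer")        // hypothetical consumer name
                    .ackWait(Duration.ofSeconds(30))           // redeliver if not acked in time
                    .inactiveThreshold(Duration.ofSeconds(30)) // generous inactivity threshold
                    .build();

            PullSubscribeOptions options = PullSubscribeOptions.builder()
                    .configuration(cc)
                    .build();

            // Hypothetical subject for GitHub webhook events
            JetStreamSubscription sub = js.subscribe("github.webhook.>", options);

            while (true) {
                // Fetch a small batch and ack each message only after it was handled,
                // so unhandled messages are redelivered after the ack wait expires.
                List<Message> messages = sub.fetch(10, Duration.ofSeconds(2));
                for (Message msg : messages) {
                    handle(msg);
                    msg.ack();
                }
            }
        }
    }

    private static void handle(Message msg) {
        // Placeholder for the actual webhook handling logic.
        System.out.println("Handling " + msg.getSubject());
    }
}
```

The exact values, and whether we use a pull or push consumer, should of course follow whatever #349 ends up doing.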
@StefanNemeth StefanNemeth added the bug Something isn't working label Feb 7, 2025
@StefanNemeth StefanNemeth self-assigned this Feb 7, 2025
@egekocabas egekocabas linked a pull request Feb 7, 2025 that will close this issue
@egekocabas
Member

So that's why I am reopening this issue.

@egekocabas egekocabas reopened this Feb 8, 2025
@StefanNemeth
Contributor Author

There might also still be some todos left within "What we should do". I'd propose we add some kind of log when a handler's duration exceeds a certain threshold, e.g. along the lines of the sketch below. I can also deal with it tomorrow.
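
A minimal sketch of what such a threshold log could look like; the class name, method name, and threshold value are hypothetical placeholders:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TimedWebhookHandler {

    private static final Logger log = LoggerFactory.getLogger(TimedWebhookHandler.class);

    // Hypothetical threshold; any handler run slower than this gets a warning.
    private static final long SLOW_HANDLER_THRESHOLD_MS = 5_000;

    public void handleWorkflowRunEvent(String payload) {
        long start = System.currentTimeMillis();
        try {
            // ... actual handling of the webhook event ...
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > SLOW_HANDLER_THRESHOLD_MS) {
                log.warn("Slow webhook handler: handleWorkflowRunEvent took {} ms", elapsed);
            }
        }
    }
}
```

Such a warning could also be forwarded to Sentry so we notice slow handlers before the consumer runs into trouble.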

@egekocabas
Member

egekocabas commented Feb 8, 2025

> There might also still be some todos left within "What we should do". I'd propose we add some kind of log when a handler's duration exceeds a certain threshold. I can also deal with it tomorrow.

I think we set the threshold values (NATS) generously, and the long-running webhook listener problem is mostly solved.

I am aware that it shouldn't be taking that long, but I suggest we focus on the sprint tasks and keep this issue in mind. Then, when we finish our tasks, let's add those hard limits or find out what is taking so long. What do you say?
