
Server sometimes stops handling webhook events #352

Open
StefanNemeth opened this issue Feb 7, 2025 · 3 comments · Fixed by #349
Assignees
StefanNemeth
Labels
application-server, bug (Something isn't working), todo

Comments

@StefanNemeth
Contributor

StefanNemeth commented Feb 7, 2025

In some instances, we noticed that our app server stops handling webhook events, seemingly at random.

Server logs

This is an excerpt from the logs that contains the last event we handled and the first errors we got after that.

application-server-1  | 2025-02-07T15:51:01.632Z  INFO 1 --- [Helios] [         nats:4] .a.h.w.g.GitHubWorkflowRunMessageHandler : Received worfklow run event for repository: ls1intum/Artemis, workflow run: https://api.github.com/repos/ls1intum/Artemis/actions/runs/13202866255, action: completed
[....]
application-server-1  | 2025-02-07T15:51:06.956Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl  : pullStatusError, Connection: 13, Subscription: 615556999, Consumer Name: xy23djLX, Status:Status{code=409, message='Consumer Deleted'}
[....]
application-server-1  | 2025-02-07T15:53:59.905Z ERROR 1 --- [Helios] [pool-3-thread-1] i.n.client.impl.ErrorListenerLoggerImpl  : heartbeatAlarm, Connection: 13, Subscription: 94524147, Consumer Name: xy23djLX, lastStreamSequence: 58810, lastConsumerSequence: 40771

Workaround

Currently, when the issue happens, we restart the app server, use the GitHub sync, and manually fix deployments. This works most of the time, and the issue doesn't happen frequently.

What we know so far

What we should do

  • Make sure our events are handled in a timely manner by setting a maximum handling time, e.g. a timeout for GitHub API calls, and by tracking slow handlers
  • Implement the durable consumer (there is already a PR with that fix: Use durable consumer for GitHub webhook events #349)
  • Log the errors to Sentry as well and add a notification for errors like "heartbeatAlarm" so we know about them early
  • If the PR fix (Use durable consumer for GitHub webhook events #349) takes longer, we can manually raise the InactiveThreshold to 20+ seconds as a hot fix for now
  • If we don't acknowledge a message within a certain timeframe, it will be redelivered by NATS, so we should make sure we set that config (ack wait) as well (see the configuration sketch after this list)
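
A minimal sketch of what a durable pull consumer with these settings could look like using the NATS Java client. The consumer name, subject, and duration values are hypothetical and would need to match our actual setup:

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;
import io.nats.client.api.ConsumerConfiguration;

import java.time.Duration;
import java.util.List;

public class WebhookConsumerSketch {

    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStream js = nc.jetStream();

            // Durable consumer: survives restarts and is not cleaned up after a short
            // period of inactivity (avoids the 409 "Consumer Deleted" errors).
            ConsumerConfiguration cc = ConsumerConfiguration.builder()
                    .durable("github-webhook-consumer")        // hypothetical consumer name
                    .ackWait(Duration.ofSeconds(30))           // redeliver if not acked in time
                    .inactiveThreshold(Duration.ofSeconds(30)) // generous inactivity threshold
                    .build();

            PullSubscribeOptions options = PullSubscribeOptions.builder()
                    .configuration(cc)
                    .build();

            // Hypothetical subject for GitHub webhook events
            JetStreamSubscription sub = js.subscribe("github.webhook.>", options);

            while (true) {
                // Fetch a small batch and ack each message only after it was handled,
                // so unhandled messages are redelivered after the ack wait expires.
                List<Message> messages = sub.fetch(10, Duration.ofSeconds(2));
                for (Message msg : messages) {
                    handle(msg);
                    msg.ack();
                }
            }
        }
    }

    private static void handle(Message msg) {
        // Placeholder for the actual webhook handling logic.
        System.out.println("Handling " + msg.getSubject());
    }
}
```

The exact values, and whether we use a pull or push consumer, should of course follow whatever #349 ends up doing.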
@StefanNemeth StefanNemeth added the bug Something isn't working label Feb 7, 2025
@StefanNemeth StefanNemeth self-assigned this Feb 7, 2025
@egekocabas egekocabas linked a pull request Feb 7, 2025 that will close this issue
@egekocabas
Member

So that's why I am reopening this issue.

@egekocabas egekocabas reopened this Feb 8, 2025
@StefanNemeth
Contributor Author

There might also still be some todos left within "What we should do". I'd propose we add some kind of log when a handler's duration exceeds a certain threshold, e.g. along the lines of the sketch below. I can also deal with it tomorrow.
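
A minimal sketch of what such a threshold log could look like; the class name, method name, and threshold value are hypothetical placeholders:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TimedWebhookHandler {

    private static final Logger log = LoggerFactory.getLogger(TimedWebhookHandler.class);

    // Hypothetical threshold; any handler run slower than this gets a warning.
    private static final long SLOW_HANDLER_THRESHOLD_MS = 5_000;

    public void handleWorkflowRunEvent(String payload) {
        long start = System.currentTimeMillis();
        try {
            // ... actual handling of the webhook event ...
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > SLOW_HANDLER_THRESHOLD_MS) {
                log.warn("Slow webhook handler: handleWorkflowRunEvent took {} ms", elapsed);
            }
        }
    }
}
```

Such a warning could also be forwarded to Sentry so we notice slow handlers before the consumer runs into trouble.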

@egekocabas
Member

egekocabas commented Feb 8, 2025

> There might also still be some todos left within "What we should do". I'd propose we add some kind of log when a handler's duration exceeds a certain threshold. I can also deal with it tomorrow.

I think we set the threshold values (NATS) generously, and the long-running webhook listener problem is mostly solved.

I am aware that it shouldn't be taking that long, but I suggest we focus on the sprint tasks and keep this issue in mind. Then, when we finish our tasks, let's add those hard limits or find out what is taking so long. What do you say?
