fix: Use durable consumer for GitHub webhook events #349

Merged
egekocabas merged 11 commits into staging from fix/use-durable-consumer on Feb 8, 2025

@egekocabas (Member) commented Feb 7, 2025

Motivation

Currently, we are using an ephemeral consumer, which causes all webhook events to be reprocessed from scratch whenever we redeploy. Switching to a durable consumer ensures event processing continuity and may help resolve the connection issues experienced during our launch. Additionally, these changes lay the foundation for future runtime updates to consumer subjects.

Furthermore, to handle cases where NATS unexpectedly removes our durable consumer (e.g., due to inactivity or other issues), we introduced a NatsErrorListener that detects consumer deletion errors and triggers consumer reinitialization.

Description

  • Durable Consumer Support
    • Added the NATS_DURABLE_CONSUMER_NAME environment variable.
    • Updated the deployment action, Docker configuration, documentation, and the environment variable in the repository settings.
    • Updated the Docker Compose configuration to use nats:2.10.25-alpine instead of nats:alpine for consistency with production. This ensures that features like createOrUpdateConsumer work as expected.
  • NATS Consumer Enhancements
    • Introduced an updateSubjects method that triggers setupOrUpdateConsumer with the latest subjects. It is needed to update the consumed subjects at runtime, which the follow-up PR feat: Dynamic repo registration #351 relies on.
    • Made both setupOrUpdateConsumer and updateSubjects synchronized in NatsConsumerService to ensure thread safety during runtime subject updates.
  • Consumer Error Handling & Auto-Recovery
    • Added NatsErrorListener to detect critical NATS errors, such as Consumer Deleted (409) errors.
    • If NATS removes the consumer, NatsErrorListener logs the event and calls reinitializeConsumer from NatsConsumerService to recreate the consumer automatically.
    • Addressed a circular dependency between NatsConsumerService and NatsErrorListener by using setter injection with @Lazy and @Autowired annotations to break the dependency loop while ensuring correct bean initialization.
  • Consumer Configuration Enhancements via Environment Variables
    • NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES (default: 30) specifies the time (in minutes) after which an inactive consumer is removed.
    • NATS_CONSUMER_ACK_WAIT_SECONDS (default: 60) specifies the time (in seconds) that NATS waits for a message acknowledgment before resending the message.
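To make the settings and the locking described above concrete, here is a minimal plain-Java sketch. It is not the PR's code: ConsumerSettings, ConsumerServiceSketch, and any consumer or subject names used with them are hypothetical, and no NATS client calls are made. Only the environment variable names and their defaults (30 minutes, 60 seconds) come from this description; the placeholder in setupOrUpdateConsumer marks where the real service would create or update the consumer.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Function;

// Hypothetical settings holder: the variable names and defaults come from this PR,
// the class itself is illustrative only.
final class ConsumerSettings {
    final String durableName;         // null means an ephemeral consumer
    final Duration inactiveThreshold; // NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES (default 30)
    final Duration ackWait;           // NATS_CONSUMER_ACK_WAIT_SECONDS (default 60)

    private ConsumerSettings(String durableName, Duration inactiveThreshold, Duration ackWait) {
        this.durableName = durableName;
        this.inactiveThreshold = inactiveThreshold;
        this.ackWait = ackWait;
    }

    // Reads values through a lookup such as System::getenv, so it can be tested with a Map.
    static ConsumerSettings fromEnv(Function<String, String> env) {
        String durable = env.apply("NATS_DURABLE_CONSUMER_NAME");
        if (durable != null && durable.isBlank()) {
            durable = null; // treat an empty variable as "not set"
        }
        long inactiveMinutes = parseOrDefault(env.apply("NATS_CONSUMER_INACTIVE_THRESHOLD_MINUTES"), 30);
        long ackWaitSeconds = parseOrDefault(env.apply("NATS_CONSUMER_ACK_WAIT_SECONDS"), 60);
        return new ConsumerSettings(durable, Duration.ofMinutes(inactiveMinutes), Duration.ofSeconds(ackWaitSeconds));
    }

    private static long parseOrDefault(String raw, long fallback) {
        return (raw == null || raw.isBlank()) ? fallback : Long.parseLong(raw.trim());
    }
}

// Hypothetical service skeleton: both methods read and replace shared state
// (the subject list and the consumer), so they synchronize on the same monitor.
class ConsumerServiceSketch {
    private final ConsumerSettings settings;
    private List<String> subjects;
    private int configurationsApplied = 0; // counts (re)configurations, for illustration only

    ConsumerServiceSketch(ConsumerSettings settings, List<String> initialSubjects) {
        this.settings = settings;
        this.subjects = List.copyOf(initialSubjects);
    }

    synchronized void setupOrUpdateConsumer() {
        // Placeholder: the real service would create or update the NATS consumer here,
        // using settings.durableName, settings.ackWait, settings.inactiveThreshold
        // and the current subjects as filters.
        configurationsApplied++;
    }

    synchronized void updateSubjects(List<String> newSubjects) {
        subjects = List.copyOf(newSubjects);
        setupOrUpdateConsumer(); // safe: Java monitors are reentrant for the same thread
    }

    synchronized List<String> subjects() { return subjects; }

    synchronized int configurationsApplied() { return configurationsApplied; }
}
```

Because both methods lock the same object, a runtime updateSubjects call cannot interleave with a setupOrUpdateConsumer call that is already in progress on another thread.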

Durable Consumer Behavior and Runtime Updates

  • NATS Sequence and Durable Consumer
    • With a durable consumer, the last processed event sequence is maintained. On redeployments, the consumer resumes from the last acknowledged sequence, ensuring that events are not reprocessed.
    • Example: If the consumer has processed up to sequence 100 and a redeploy occurs, it will resume processing from sequence 101 rather than reprocessing sequences 1–100.
  • Runtime Updates on Subjects
    • The updateSubjects method allows the consumer to update the subjects it listens to at runtime. This is essential when the GitHub App is installed on new repositories or removed from existing ones.
    • Important Note: When a new subject is added, previous events for that subject will not be processed. In our case, this is acceptable because:
      • Installation Scenario: The GitHub App installation event establishes the starting point for that repository—there are no prior events to process.
      • Removal Scenario: Once the GitHub App is removed, webhook events for that repository cease.
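The sequence behavior above can be illustrated with a toy in-memory model. ToyStream and its methods are hypothetical and do not reflect the NATS client API or server internals. The model assumes each delivered message is acknowledged immediately and that a consumer's position moves past messages that do not match its filters, which is the behavior this description relies on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy in-memory model (NOT the NATS API) of a stream plus durable consumer state.
// The "server" remembers, per durable name, the last stream sequence the consumer
// has moved past; a name it has never seen starts from the beginning.
final class ToyStream {
    private static final class Msg {
        final long seq;
        final String subject;

        Msg(long seq, String subject) {
            this.seq = seq;
            this.subject = subject;
        }
    }

    private final List<Msg> messages = new ArrayList<>();
    private final Map<String, Long> positionByDurable = new HashMap<>();

    long publish(String subject) {
        long seq = messages.size() + 1L; // stream sequences start at 1
        messages.add(new Msg(seq, subject));
        return seq;
    }

    // Delivers (and, in this model, immediately acknowledges) every message after the
    // stored position whose subject is in the current filter set.
    List<Long> consumeAvailable(String durableName, Set<String> filterSubjects) {
        long position = positionByDurable.getOrDefault(durableName, 0L);
        List<Long> delivered = new ArrayList<>();
        for (Msg m : messages) {
            if (m.seq > position && filterSubjects.contains(m.subject)) {
                delivered.add(m.seq);
            }
        }
        // Non-matching messages are skipped, but the position still moves past them.
        positionByDurable.put(durableName, (long) messages.size());
        return delivered;
    }
}
```

In this model, 100 events on a hypothetical subject github.repoA.push are delivered once; after a simulated redeploy using the same durable name, only sequence 101 is delivered. An event for github.repoB.push published while the consumer still filtered only on repoA (sequence 102) is skipped, and the next repoB event (sequence 103) is delivered once repoB is in the filters.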

Testing Instructions

Durable stream

  1. Update your .env file under src/application-server and set NATS_DURABLE_CONSUMER_NAME.
  2. Start the services with Docker Compose.
  3. Start the application-server and verify that it processes events from the beginning.
  4. Once processing is complete, stop the application-server and restart it.
  5. Confirm that it does not re-process past events.

Reconnect (recreate) Consumer

  • It doesn't matter whether NATS_DURABLE_CONSUMER_NAME is set; repeat the steps below both with and without it.
  • Start Docker Compose and the application-server.
  • Example log:
Consumer created or updated with name '<your-consumer-name>' and configuration: ConsumerConfiguration { ... }
Successfully started consuming messages.

Consumer Name: (randomly generated or durable name)
Stream: github

  • Connect to a NATS CLI Environment
    • docker run --rm -it --network=host natsio/nats-box
  • List Consumers in the github Stream
    • nats consumer list github --server=nats://<NATS_AUTH_TOKEN>@localhost:4222
    • Use the NATS_AUTH_TOKEN defined in your project root .env file.
  • Example output:
Consumers for Stream github:

    TfGIIr1Z
    BoNZPSs9
    ...

Find your consumer in the list and confirm that it exists.

  • Delete the Consumer
    • nats consumer delete github <your-consumer-name> --server=nats://<NATS_AUTH_TOKEN>@localhost:4222
  • Expected log output:
[SEVERE] pullStatusError, Consumer Name: TfGIIr1Z, Status: Status{code=409, message='Consumer Deleted'}
Consumer 'TfGIIr1Z' was deleted, triggering reinitialization of consumer...
NATS Consumer reinitialization process started.
Consumer created or updated with name 'NewConsumerName' and configuration: ConsumerConfiguration { ... }
Successfully started consuming messages.
  • Whether your consumer is durable or ephemeral does not matter: once NATS removes the consumer and sends Consumer Deleted, its sequence data is gone from NATS, so the reinitialized consumer starts delivering events from the beginning (as set by the consumer configuration).
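The detection and recovery flow in the log above can be sketched in plain Java. PullStatus, ConsumerDeletedHandler, and ReinitializingConsumerService are hypothetical stand-ins, not the PR's NatsErrorListener or NatsConsumerService and not jnats types; in the real code the check would run inside the pullStatusError callback that appears in the log. The Supplier handed to the handler is a plain-Java analogue of Spring's @Lazy injection, which gives a bean a proxy that resolves its target on first use, so the two objects can depend on each other without either constructor needing the other to be fully built.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for the status NATS reports on a pull error (code + message).
final class PullStatus {
    final int code;
    final String message;

    PullStatus(int code, String message) {
        this.code = code;
        this.message = message;
    }
}

// Hypothetical handler with the check described in this PR.
class ConsumerDeletedHandler {
    private final Supplier<ReinitializingConsumerService> service; // resolved lazily, on first use

    ConsumerDeletedHandler(Supplier<ReinitializingConsumerService> service) {
        this.service = service;
    }

    // Returns true when the status means NATS deleted the consumer.
    boolean onPullStatusError(String consumerName, PullStatus status) {
        boolean deleted = status.code == 409
                && status.message != null
                && status.message.contains("Consumer Deleted"); // 409 can carry other conflicts too
        if (deleted) {
            System.out.println("Consumer '" + consumerName + "' was deleted, triggering reinitialization of consumer...");
            service.get().reinitializeConsumer();
        }
        return deleted;
    }
}

// Hypothetical service: it needs the handler (to register it on the connection), and
// the handler needs the service, which is the circular dependency described in this PR.
class ReinitializingConsumerService {
    private final ConsumerDeletedHandler handler;
    private int generation = 0; // how many times the consumer was recreated

    ReinitializingConsumerService(ConsumerDeletedHandler handler) {
        this.handler = handler;
    }

    ConsumerDeletedHandler handler() { return handler; }

    synchronized void reinitializeConsumer() {
        // Placeholder for recreating the consumer via setupOrUpdateConsumer().
        generation++;
    }

    synchronized int generation() { return generation; }

    // Breaks the cycle: the handler gets a Supplier that is filled in once the service exists.
    static ReinitializingConsumerService wire() {
        AtomicReference<ReinitializingConsumerService> ref = new AtomicReference<>();
        ConsumerDeletedHandler handler = new ConsumerDeletedHandler(ref::get);
        ReinitializingConsumerService service = new ReinitializingConsumerService(handler);
        ref.set(service); // must happen before any NATS error can reach the handler
        return service;
    }
}
```

Checking the message as well as the code keeps unrelated 409 conflicts from triggering a reinitialization.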

@github-actions github-actions bot added the bug, documentation, github-config, application-server, and size:S labels Feb 7, 2025

codacy-production bot commented Feb 7, 2025

Coverage summary from Codacy


| Coverage variation | Diff coverage |
|---|---|
| Report missing for 4a3d005 [1] | 0.00% |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (4a3d005) | Report missing | Report missing | Report missing |
| Head commit (1f7bc20) | 4500 | 320 | 7.11% |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#349) | 98 | 0 | 0.00% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%



Footnotes

  1. Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.

@github-actions github-actions bot added size:L and removed size:S labels Feb 7, 2025
@egekocabas egekocabas changed the title chore: Use durable consumer for GitHub webhook events fix: Use durable consumer for GitHub webhook events Feb 7, 2025
@egekocabas egekocabas marked this pull request as ready for review February 7, 2025 16:15
@egekocabas egekocabas requested a review from a team as a code owner February 7, 2025 16:15
@egekocabas egekocabas linked an issue Feb 7, 2025 that may be closed by this pull request
@TurkerKoc (Contributor) left a comment

Thanks for dealing with our NATS configuration. I've tested NATS as described in the description. Everything seems to work perfectly. I believe that this one will resolve our NATS heartbeat errors in prod, since it increases the thresholds and also reinitializes the durable consumer.

Thanks for the great work and detailed description of the PR! I've added 1-2 comments. Let's merge this one soon 🚀


private final NatsConsumerService natsConsumerService;

@Autowired
Contributor

I guess this Autowired annotation is unnecessary because Spring automatically autowires the only constructor of a bean class, even if it's not annotated. Since NatsErrorListener has only one constructor, we can remove the annotation.

Member Author

I wasn't really sure whether the @Autowired should stay, since there is a circular dependency between NatsErrorListener and NatsConsumerService, and all the documentation and articles about solving it pointed me toward also using @Autowired.

The Javadoc of the @Lazy annotation also suggests using it together with @Autowired:
(screenshot of the @Lazy Javadoc)

As I understand it, when Spring resolves bean initialization and injection at startup, a class with only one constructor doesn't need @Autowired on that constructor.

I also tested without the Autowired and it worked 🥳

But I'd propose keeping it here, if that makes sense 😄

@egekocabas egekocabas merged commit 5aa9cc6 into staging Feb 8, 2025
15 checks passed
@egekocabas egekocabas deleted the fix/use-durable-consumer branch February 8, 2025 22:07
Labels
application-server, bug, documentation, github-config, ready for review, size:L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Server sometimes stops handling webhook events
2 participants