[DRAFT]: Connection Reliability Improvements #675
Overview
This PR focuses on improving connection reliability in nats.py, addressing persistent stability issues observed in long-running applications and in scenarios with intermittent network disruptions, like the ones we've been experiencing with one of our compute providers (lol). I was able to mitigate some of it with additional logic in our codebase that helps nats reconnect in these instances and rebind the subscriptions, but honestly that was an ugly band-aid and it didn't work 100% of the time. So, after repeated encounters with elusive connectivity bugs and noting similar experiences among others in the community (#598), I decided to take a stab at fixing the issue at its core.
Here's what this PR includes:
1. Improved Ping Loop Error Handling
asyncio.InvalidStateError.
ErrStaleConnection when ping anomalies occur.
2. Enhanced Read Loop Stability
(max_read_timeouts, defaults to 3) to fine-tune sensitivity.
ConnectionResetError and asyncio.InvalidStateError to improve resilience and provide clearer debug information.
3. Reliable Request/Response Handling
4. Proactive Connection Health Checks
_check_connection_health(), a method designed to proactively test and re-establish the connection if necessary.
Linked Issues
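For reviewers who want a concrete picture of the read-loop change in item 2 above, here is a minimal sketch of a bounded read-timeout counter. Only the name max_read_timeouts and its default of 3 come from this PR. Everything else is an assumption for illustration: that the limit counts consecutive timeouts, the read_chunk and on_data callables, the read_timeout parameter, and the StaleConnectionError class are all hypothetical stand-ins rather than the actual nats.py code.

```python
import asyncio


class StaleConnectionError(Exception):
    """Hypothetical stand-in for the client's stale-connection error."""


async def read_loop(read_chunk, on_data, max_read_timeouts=3, read_timeout=1.0):
    """Read until EOF, raising once too many reads in a row time out.

    `read_chunk` is a hypothetical zero-argument coroutine function that
    returns bytes (b"" on EOF). It is not the nats.py transport API.
    """
    consecutive_timeouts = 0
    while True:
        try:
            data = await asyncio.wait_for(read_chunk(), timeout=read_timeout)
        except asyncio.TimeoutError:
            # Assumption: the limit applies to consecutive timeouts.
            consecutive_timeouts += 1
            if consecutive_timeouts >= max_read_timeouts:
                raise StaleConnectionError(
                    f"{consecutive_timeouts} consecutive read timeouts"
                )
            continue
        consecutive_timeouts = 0  # any completed read resets the counter
        if not data:
            return  # EOF: the peer closed the connection
        on_data(data)
```

In this sketch a single slow read does not tear the connection down, while a run of max_read_timeouts silent reads does. A real client would presumably hand the error to its reconnect logic instead of letting it propagate.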
This PR addresses stability concerns raised in:
Recommended Testing Scenarios
These improvements should noticeably enhance stability, particularly in environments with:
Impact and Compatibility
(max_read_timeouts) default to safe values and can be tuned as needed without affecting existing usage.
Contributing Statement
As always, feedback is welcome, and I'm happy to iterate as needed!
Cheers,
Fielding