[DRAFT]: Connection Reliability Improvements #675

Draft · fielding wants to merge 1 commit into main

Conversation

fielding

Overview

This PR focuses on improving connection reliability in nats.py, addressing persistent stability issues observed in long-running applications or in scenarios with intermittent network disruptions, like the ones we've been experiencing from one of our compute providers (lol). I was able to mitigate some of it, but not all of it, with additional logic in our own codebase that helped nats reconnect in these instances and rebind the subscriptions, but honestly that was an ugly band-aid and it didn't work 100% of the time. So, after repeated encounters with elusive connectivity bugs, and noting similar experiences among others in the community (#598), I decided to take a stab at improving the issue at the core.

Here's what this PR includes:

1. Improved Ping Loop Error Handling

  • Enhanced error handling within the ping loop to prevent silent failures.
  • Properly catches and handles asyncio.InvalidStateError.
  • Adds a catch-all exception handler to ensure the ping loop never silently stalls.
  • Forces a proper disconnect with ErrStaleConnection when ping anomalies occur (see the sketch after this list).
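
A hedged, self-contained sketch of that flow is below. This is not the actual diff: the Conn stub and helper names like process_op_err are illustrative stand-ins for the client internals.

```python
import asyncio


class StaleConnectionError(Exception):
    """Stand-in for nats.py's ErrStaleConnection."""


class Conn:
    """Minimal stub standing in for the client; not the real class."""
    ping_interval = 2.0
    max_outstanding_pings = 2

    def __init__(self):
        self.outstanding_pings = 0

    async def send_ping(self):
        self.outstanding_pings += 1  # a real client decrements on PONG

    async def process_op_err(self, err):
        # In nats.py this path tears the connection down and schedules
        # a reconnect attempt.
        print(f"forcing disconnect: {err!r}")


async def ping_loop(nc: Conn):
    while True:
        await asyncio.sleep(nc.ping_interval)
        try:
            if nc.outstanding_pings > nc.max_outstanding_pings:
                # Too many unanswered PINGs: flag the connection as
                # stale instead of spinning silently.
                await nc.process_op_err(StaleConnectionError("stale connection"))
                return
            await nc.send_ping()
        except asyncio.InvalidStateError as e:
            # A ping future was resolved or cancelled out from under us;
            # treat it as a ping anomaly and force a proper disconnect.
            await nc.process_op_err(StaleConnectionError(str(e)))
            return
        except asyncio.CancelledError:
            raise  # normal shutdown path, let it propagate
        except Exception as e:
            # Catch-all so the ping loop can never stall silently.
            await nc.process_op_err(e)
            return


asyncio.run(ping_loop(Conn()))
```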

2. Enhanced Read Loop Stability

  • Implements timeout detection for read operations, introducing a consecutive timeout counter to identify potentially stalled connections.
  • Adds a configurable client option (max_read_timeouts, defaults to 3) to fine-tune sensitivity.
  • Explicit handling for ConnectionResetError and asyncio.InvalidStateError to improve resilience and provide clearer debug information (see the sketch after this list).
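
A hedged sketch of what such a read loop can look like is below. The default of 3 mirrors the new max_read_timeouts option; the 5-second per-read timeout and the on_op_err/process_data callbacks are assumptions of this sketch, not the PR's actual internals.

```python
import asyncio


async def read_loop(reader: asyncio.StreamReader, on_op_err, process_data,
                    max_read_timeouts: int = 3, read_timeout: float = 5.0):
    consecutive_timeouts = 0
    while True:
        try:
            data = await asyncio.wait_for(reader.read(65536), read_timeout)
            consecutive_timeouts = 0  # any successful read resets the counter
            if not data:
                # EOF: the server closed the connection.
                await on_op_err(ConnectionResetError("EOF from server"))
                return
            await process_data(data)
        except asyncio.TimeoutError:
            consecutive_timeouts += 1
            if consecutive_timeouts >= max_read_timeouts:
                # Several reads in a row yielded nothing: the connection
                # is likely stalled even though the socket never errored.
                await on_op_err(TimeoutError("consecutive read timeouts"))
                return
        except (ConnectionResetError, asyncio.InvalidStateError) as e:
            # Handle transport resets and bad future states explicitly so
            # they surface in debug output instead of vanishing.
            await on_op_err(e)
            return
```

Resetting the counter on any successful read means only a sustained silence trips the threshold, which keeps the check from flagging merely quiet (but healthy) connections on its own.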

3. Reliable Request/Response Handling

  • Adds pre-flight connection health checks before issuing requests.
  • Improves internal cleanup for request/response calls to prevent subtle resource leaks.
  • Strengthens timeout and cancellation logic to guard against orphaned or stale futures (see the sketch after this list).
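
To make the cleanup concrete, here is a hedged sketch of the request path described above. check_connection_health, resp_map, and publish_request are illustrative stand-ins, not the PR's actual internals.

```python
import asyncio
import uuid


async def request(nc, subject: str, payload: bytes, timeout: float = 2.0):
    # Pre-flight check: verify (and if needed re-establish) the connection
    # before allocating a response future, so the call fails fast instead
    # of hanging on a dead socket.
    await nc.check_connection_health()

    token = uuid.uuid4().hex
    future = asyncio.get_running_loop().create_future()
    nc.resp_map[token] = future
    try:
        await nc.publish_request(subject, token, payload)
        return await asyncio.wait_for(future, timeout)
    except (asyncio.TimeoutError, asyncio.CancelledError):
        # Cancel the pending future so a late reply cannot resolve it
        # after the caller has already given up.
        if not future.done():
            future.cancel()
        raise
    finally:
        # Always drop the mapping, on success or failure, so no response
        # future is left orphaned (the subtle leak mentioned above).
        nc.resp_map.pop(token, None)
```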

4. Proactive Connection Health Checks

  • Introduces _check_connection_health(), a method designed to proactively test and re-establish the connection if necessary.
  • Utilized in critical paths like request handling to ensure robustness under varying network conditions (see the sketch after this list).
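
A hedged sketch of what such a health check can look like is below. It leans on flush(), which performs a PING/PONG round trip in nats.py; process_op_err is again an illustrative stand-in for the client's error path.

```python
import asyncio


async def check_connection_health(nc, ping_timeout: float = 2.0) -> None:
    # Fast path: refuse outright if the client is already closed.
    if nc.is_closed:
        raise ConnectionError("connection is closed")

    try:
        # Active probe: bound a PING/PONG round trip with a timeout so a
        # half-dead socket cannot hang the caller.
        await asyncio.wait_for(nc.flush(), ping_timeout)
    except asyncio.TimeoutError:
        # No PONG in time: report a stale connection so the client's
        # reconnect machinery can re-establish the connection.
        await nc.process_op_err(TimeoutError("health-check ping timed out"))
```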

Linked Issues

This PR addresses stability concerns raised in #598.

Recommended Testing Scenarios

These improvements should noticeably enhance stability, particularly in environments with:

  • Long-running applications (24+ hours uptime)
  • Frequent or intermittent connectivity disruptions
  • Intensive request-response workloads
  • Heavy usage of JetStream or Key-Value operations

Impact and Compatibility

  • Backward Compatibility: Fully backward compatible. Existing interfaces remain unchanged.
  • Configuration: The new option (max_read_timeouts) defaults to a safe value and can be tuned as needed without affecting existing usage (see the usage sketch after this list).
  • Robustness: Designed to gracefully handle and recover from various edge cases previously causing silent connection failures.
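
For example, opting in is a one-line change at connect time. Note this only works against the PR branch installed as described in the comment below; max_read_timeouts does not exist in released nats.py, while ping_interval is an existing option shown for contrast.

```python
import asyncio
import nats


async def main():
    nc = await nats.connect(
        "nats://127.0.0.1:4222",
        ping_interval=20,       # existing option, unchanged
        max_read_timeouts=3,    # new in this PR; 3 is the default
    )
    response = await nc.request("greet", b"hello", timeout=2.0)
    print(response.data)
    await nc.close()


asyncio.run(main())
```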

Contributing Statement

  • This contribution is my original work.
  • I license the work to the NATS project under the project's Apache 2.0 license.

As always, feedback is welcome, and I'm happy to iterate as needed!

Cheers,
Fielding


fielding commented Mar 13, 2025

I'm going to leave this as a draft until I can get some concrete data from my own testing in the wild or from others who might be having similar issues and can help to test.

If you want to install and test this commit, fielding@4a20463, you can simply do:

pip install git+https://github.com/fielding/nats.py.git@4a20463b521962c83ec58e0cc1a4d1f72fd98440

or there's a way to install a specific PR by number, but I can't remember exactly what it is -.-

@fielding (Author)

My own concerns with this:

  • Performance Overhead: Timeouts and health checks add slight overhead. For high-throughput applications, this could be noticeable. Consider profiling under load. This is my main concern.

  • Testing Gaps: The bug may or may not be specific to the scenarios mentioned in the description, which makes it difficult to test and assess reliably.

@fielding (Author)

So far, testing shows a positive change in behavior over the last 5 days. Prior to these changes, the client would normally fail to reconnect somewhere around the 36-48 hour mark.
