Add shard-aware reconnection policies with support for scheduling constraints #473
Conversation
Force-pushed from 0b80886 to f62dfa3
Force-pushed from dbb3ad1 to cbb4719
Shouldn't we have some warning / info level log when backoff is taking place?

I would rather not do it; it is not useful and can potentially pollute the log.

Do you know what caused the test failure? It is a unit test that at first glance should be fully deterministic, so the failure is unexpected.

It is a known issue; the conversion goes wrong somewhere.
Force-pushed from a43ccd1 to b0fd069
Force-pushed from f47313f to 9dfd9ec
General comment: integration tests for new policies are definitely needed here.
cassandra/policies.py
Outdated
session: Session
reconnection_policy: ReconnectionPolicy
lock = threading.Lock
schedule: Optional[Iterator[float]]
It should be `lock: threading.Lock`.
fixed
cassandra/policies.py
Outdated
if self.shard_reconnection_scope == ShardReconnectionPolicyScope.Cluster:
    scope_hash = "global-cluster-scope"
else:
    scope_hash = host_id
When operating on enums, it is usually good to perform exhaustiveness checks. If in the future someone adds a new variant to this enum, your code would (incorrectly) treat it as `Host` scope. Instead, make an `elif` branch for `Host`, and then an `else` that throws an error.
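A minimal sketch of that suggestion, reusing the names from the diff above (the error message wording is illustrative):

```python
if self.shard_reconnection_scope == ShardReconnectionPolicyScope.Cluster:
    scope_hash = "global-cluster-scope"
elif self.shard_reconnection_scope == ShardReconnectionPolicyScope.Host:
    scope_hash = host_id
else:
    # A future enum variant fails loudly instead of silently
    # being treated as Host scope.
    raise ValueError(
        f"unhandled ShardReconnectionPolicyScope: {self.shard_reconnection_scope}")
```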
fixed
cassandra/policies.py
Outdated
scope_info = self.scopes.get(scope_hash, 0)
if not scope_info:
    scope_info = _ScopeBucket(self.session, self.reconnection_policy)
    self.scopes[scope_hash] = scope_info
scope_info.add(self._execute, scheduled_key, method, *args, **kwargs)
return True
So `scope_info` here is at first either `_ScopeBucket` or `int`. I think it would be more idiomatic to use `None`.
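A sketch of the `None`-based variant, assuming the surrounding code from the diff above:

```python
scope_info = self.scopes.get(scope_hash)  # None when the scope is absent
if scope_info is None:
    scope_info = _ScopeBucket(self.session, self.reconnection_policy)
    self.scopes[scope_hash] = scope_info
scope_info.add(self._execute, scheduled_key, method, *args, **kwargs)
return True
```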
fixed
def schedule(
    self,
    host_id: str,
    shard_id: int,
    method: Callable[..., None],
`shard_id` is `int` here, interesting. What is going to be passed for Cassandra? 0? Or maybe `None`, and the type hint is just wrong?
For Cassandra this code is not used; it is used only when the host has shard info.
cassandra/policies.py
Outdated
class NoConcurrentShardReconnectionPolicy(ShardReconnectionPolicy):
    """
    A shard reconnection policy that allows only one pending connection per scope, where scope can be `Host` or `Cluster`.
    For backoff it uses `ReconnectionPolicy`; when there are no more reconnections scheduled, the backoff schedule is restarted.
    For all scopes it does not allow scheduling multiple reconnections for the same host+shard; it silently ignores attempts to do that.

    On `new_scheduler` it instantiates a scheduler that behaves according to the policy.
    """
    shard_reconnection_scope: ShardReconnectionPolicyScope
    reconnection_policy: ReconnectionPolicy
Ok I really tried to get the hang of the code here, but failed.
What I thought before:
- ReconnectionPolicy, according to its comments, defines the schedule used when trying to reconnect to a DOWN node.
- For some reason (don't know if a good one, as there is no discussion about it in the PR), instead of extending the driver to use it for populating the connection pool too, you decided to introduce a new mechanism for that, totally separate from ReconnectionPolicy.

But now I see ReconnectionPolicy used inside ShardReconnectionPolicy?! So a policy that steers reconnections to a failed node is now used inside a policy that re-fills the connection pool. I cannot make sense of it.
This PR needs a thorough explanation of the newly introduced interfaces:
- What are the things that are passed to `schedule`? What is this method, when and how many times are we supposed to call it? Which APIs can block and which cannot? What about thread safety - what can they assume?
- How is ReconnectionPolicy different from ShardReconnectionPolicy? Names differ only in "Shard", so initially I thought it is a shard-aware version of ReconnectionPolicy, but that does not seem to be the case.
- What are the pros and cons of the taken approach, and what other approaches did you consider?
> Ok I really tried to get the hang of the code here, but failed. [...] How is ReconnectionPolicy different from ShardReconnectionPolicy? Names differ only in "Shard", so initially I thought it is a shard-aware version of ReconnectionPolicy, but that does not seem to be the case.

I have changed the name of `ReconnectionPolicy`, added another type to it, and added some description of why it accepts both types. I have also added documentation to the interfaces and implementations, and renamed all the classes and abstracts involved.

> What are the pros and cons of the taken approach, what other approaches did you consider?

I will add this information to the PR description.
cassandra/policies.py
Outdated
""" | ||
items: List[Tuple[Callable[..., None], Tuple[Any, ...], dict[str, Any]]] | ||
session: Session |
When I see such a complicated type, I immediately think that it should be simplified. If I understand this code well, you could introduce a `Callback` type that has fields `callable`, `args`, `kwargs`.
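A self-contained sketch of such a `Callback` type; the `run` helper is illustrative, not part of the PR:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

@dataclass
class Callback:
    """Bundles a callable with its positional and keyword arguments."""
    callable: Callable[..., None]
    args: Tuple[Any, ...] = ()
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def run(self) -> None:
        self.callable(*self.args, **self.kwargs)

# items then becomes simply: List[Callback]
```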
done
tests/unit/test_policies.py
Outdated
class MockLock:
    def __init__(self):
        self.acquire_calls = 0
        self.release_calls = 0

    def __enter__(self):
        self.acquire_calls += 1

    def __exit__(self, exc_type, exc_value, traceback):
        self.release_calls += 1
I don't see this used anywhere. Why is it here?
A leftover from an old test; removed.
cassandra/policies.py
Outdated
scheduled_key = f'{host_id}-{shard_id}'
if self.already_scheduled.get(scheduled_key):
    return

self.already_scheduled[scheduled_key] = True
if not self.session.is_shutdown:
For example here, in `_NoDelayShardReconnectionScheduler`: it performs the check in an obviously non-thread-safe way. So if it can be called concurrently, then multiple schedules for the same key are possible, despite `already_scheduled` trying to prevent that. So now I'm thinking that maybe it can't be called concurrently? OTOH `already_scheduled` uses a lock, which is an extremely strong signal that concurrency is at play here. And now I have no idea what to think, because nothing is explained anywhere.
Yeah, the assumption was that it is not a big deal, since `_open_connection_to_missing_shard` will take care of the second connection. But after looking at it I realised that it will close the old one, which can lead to lost responses. Added a lock here.
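A sketch of the lock-guarded check-and-set described above; the lock attribute name is illustrative:

```python
with self._lock:  # threading.Lock guarding already_scheduled
    if self.already_scheduled.get(scheduled_key):
        return  # another caller already scheduled this host+shard
    self.already_scheduled[scheduled_key] = True
```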
cassandra/policies.py
Outdated
def _get_delay(self) -> float:
    if self.schedule is None:
        self.schedule = self.reconnection_policy.new_schedule()
    try:
        return next(self.schedule)
    except StopIteration:
        self.schedule = self.reconnection_policy.new_schedule()
        return next(self.schedule)
Is there a situation where `self.schedule` can really be `None` here, or is it just a precaution condition that should never really be entered? If it is a precaution, it is fine to have it, but there should be a comment explaining that.
I thought that `self.schedule` can only be `None` when `running` is false (btw the opposite is not true: `running` is initialized to `False`, but `schedule` is initialized to non-`None`), and I only see calls to `_get_delay` when `running` should be `True`.
Got rid of the `None` case completely.
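Presumably the method then reduces to something like this sketch, with `self.schedule` always initialized from the backoff policy:

```python
def _get_delay(self) -> float:
    try:
        return next(self.schedule)
    except StopIteration:
        # The schedule iterator is exhausted: start a fresh one.
        self.schedule = self.reconnection_policy.new_schedule()
        return next(self.schedule)
```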
cassandra/cluster.py
Outdated
def empty(self):
    return len(self._scheduled_tasks) == 0 and self._queue.empty()
Where is this used?
It used to be part of the tests; now it is unused. Removed.
Force-pushed from 9dfd9ec to aebc540
Add abstract classes `ShardReconnectionPolicy` and `ShardReconnectionScheduler`, and implementations:
- `NoDelayShardReconnectionPolicy` - a policy that represents the old behavior of having no delay and no concurrency restriction.
- `NoConcurrentShardReconnectionPolicy` - a policy that limits concurrent reconnections to 1 per scope and introduces a delay between reconnections within the scope.
Inject the shard reconnection policy into cluster, session, connection and host pool. Drop the pending-connections tracking logic, since the policy does that. Fix some tests that mock Cluster, session, connection or host pool.
Force-pushed from aebc540 to 61668de
The patchset lacks documentation, which would have helped to understand the feature and when/how to use it. Is documentation a separate repo / commit?
A scope for ``ShardConnectionBackoffPolicy``, in particular ``LimitedConcurrencyShardConnectionBackoffPolicy``

Scope defines concurrency limitation scope, for instance:
``LimitedConcurrencyShardConnectionBackoffPolicy`` - allows only one pending connection per scope, if you set it to Cluster,
Was there any ask for 1 connection per cluster? What's the usefulness? I can understand 1 per host, 1 per rack, maybe even 1 per DC. 1 per cluster is not performant, not highly available.
I will update the description; it limits concurrency to `max_concurrency` per scope.
""" | ||
A shard connection backoff policy that allows only ``max_concurrent`` concurrent connection per scope. | ||
Scope could be ``Host``or ``Cluster`` | ||
For backoff calculation ir needs ``ShardConnectionBackoffSchedule`` or ``ReconnectionPolicy``, since both share same API. |
typo: ir
Pull Request Overview
This PR adds shard-aware reconnection policies with support for scheduling constraints. Key changes include new policy implementations and schedulers in cassandra/policies.py, modifications to connection management in cassandra/pool.py and cassandra/cluster.py, and comprehensive tests in both unit and integration suites to validate the new behavior.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
File | Description
---|---
tests/unit/test_shard_aware.py | Adds tests for both immediate and delayed reconnection behavior using new policies.
tests/unit/test_policies.py | Introduces extensive tests for scope bucket and scheduler behavior.
tests/unit/test_host_connection_pool.py | Updates HostConnectionPool tests to integrate the new scheduler.
tests/integration/long/test_policies.py | Validates backoff policies and correct connection formation across shards.
tests/integration/__init__.py | Adds a marker for tests designed for Scylla-specific behavior.
cassandra/pool.py | Refactors connection replacements to use the new scheduler instead of direct submission.
cassandra/policies.py | Implements new scheduler classes and backoff policies for shard connections.
cassandra/cluster.py | Exposes a new property and uses the scheduler for initializing shard connections.
Introduce `ShardReconnectionPolicy` and its implementations:
- `NoDelayShardReconnectionPolicy`: avoids reconnection delay and ensures at most one reconnection per host+shard.
- `NoConcurrentShardReconnectionPolicy`: limits concurrent reconnections to 1 per scope (Cluster or Host) using a backoff policy.

This feature enables finer control over shard reconnection behavior, helping prevent reconnection storms.
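A hypothetical usage sketch; the `shard_reconnection_policy` keyword and the constructor arguments below are illustrative, since the final API is still under review (`ConstantReconnectionPolicy` is the driver's existing backoff policy):

```python
from cassandra.cluster import Cluster
from cassandra.policies import (
    ConstantReconnectionPolicy,
    NoConcurrentShardReconnectionPolicy,
    ShardReconnectionPolicyScope,
)

# Allow at most one pending shard reconnection per host,
# with a constant 1-second backoff between attempts.
cluster = Cluster(
    contact_points=["127.0.0.1"],
    shard_reconnection_policy=NoConcurrentShardReconnectionPolicy(
        ShardReconnectionPolicyScope.Host,
        ConstantReconnectionPolicy(delay=1.0),
    ),
)
session = cluster.connect()
```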
Fixes: #483
Pre-review checklist
- [ ] I have updated the relevant documentation in `./docs/source/`.
- [ ] I have added `Fixes:` annotations to the PR description.