Connection pool size configuration #281

Open
muzarski wants to merge 12 commits into master from core-connections-per-host-shard

Conversation

@muzarski (Collaborator) commented Apr 23, 2025

Fixes: #240
Ref: #132

Depends on: #280 (I need metrics support to implement one integration test).

Implements:

  • cass_cluster_set_core_connections_per_host
  • cass_cluster_set_core_connections_per_shard (as an extension to the cpp-driver API; see the usage sketch below)
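
For illustration, a rough usage sketch of both setters from the C API. The per-shard signature is my assumption, mirroring the existing per-host setter; it is not copied from this PR:

#include <cassandra.h>

void configure_pool(CassCluster* cluster) {
  // cpp-driver-compatible setter: 2 pooled connections per host
  cass_cluster_set_core_connections_per_host(cluster, 2);

  // extension introduced by this PR: 2 pooled connections per shard
  // (assumed to take the cluster and a connection count, like the per-host setter)
  cass_cluster_set_core_connections_per_shard(cluster, 2);
}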

Integration tests

Existing ErrorsConnectionTimeouts metrics test

CASSANDRA_INTEGRATION_TEST_F(MetricsTests, ErrorsConnectionTimeouts) {
  CHECK_FAILURE;

  Session session =
      default_cluster().with_core_connections_per_host(2).with_connect_timeout(1).connect(
          "", false); // Quick connection timeout and no assertion

  CassMetrics metrics = session.metrics();
  EXPECT_GE(2u, metrics.errors.connection_timeouts);
}

I don't know who implemented this test, but (IMO) it's just wrong. My understanding of the intentions is:

  1. we set 2 connections per host
  2. we set low connect timeout
  3. we expect at least 2 connection timeouts registered in the metrics

If those were the intentions, then EXPECT_GE(2u, metrics.errors.connection_timeouts); is wrong - the arguments should be the other way around. As written, it asserts 2 >= connection_timeouts.
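
Under that reading, the fix would be to swap the arguments so the assertion becomes connection_timeouts >= 2:

  EXPECT_GE(metrics.errors.connection_timeouts, 2u); // connection_timeouts >= 2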

OK, so let's say we fix it. It still won't work with the rust driver. The Session object will not even be built - we will fail to open the control connection (because of timeouts) and fail to fetch the metadata. In particular, we won't even be able to call Session::get_metrics() - cass_session_get_metrics will return early with a log error message.

Thus, I'm not enabling this test.

Fun fact: in my local setup, this test passes. Even though the connect timeout is low, it is not triggered. As a result, metrics.errors.connection_timeouts is 0, which means that the aforementioned assertion passes as well.

My new StatsShardConnections metrics test

I implemented a simple test that configures the pool size to 2 connections per shard (cass_cluster_set_core_connections_per_shard) and checks whether all connections are registered in the metrics. The expected value is at least nr_hosts * nr_shards * 2.
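
For orientation, a simplified sketch of what such a test could look like, in the style of the quoted test above. This is not the code from this PR; with_core_connections_per_shard, nr_hosts and nr_shards are illustrative placeholders for the actual builder method and the cluster's node/shard counts:

CASSANDRA_INTEGRATION_TEST_F(MetricsTests, StatsShardConnections) {
  CHECK_FAILURE;

  // Pool size: 2 connections per shard (cass_cluster_set_core_connections_per_shard).
  Session session = default_cluster().with_core_connections_per_shard(2).connect();

  // At least nr_hosts * nr_shards * 2 connections should be registered in the metrics.
  CassMetrics metrics = session.metrics();
  EXPECT_GE(metrics.stats.total_connections, nr_hosts * nr_shards * 2u);
}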

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
  • I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

@muzarski muzarski self-assigned this Apr 23, 2025
@muzarski muzarski added the P1 P1 priority item - very important label Apr 23, 2025
@muzarski muzarski added this to the 0.5 milestone Apr 23, 2025
@muzarski muzarski marked this pull request as draft April 23, 2025 11:57
muzarski added 12 commits April 24, 2025 14:08
We 0-initialize deprecated fields, so we retain binary compatibility
with cpp-driver.

Since we implemented metrics, we can enable the HeartbeatFailed test with some
adjustments to log filtering.

This test seems to fail under valgrind. This is why I enable it to run
alongside the other tests that cannot be run under valgrind.

Note: The original test seems to be flaky for Cassandra. The following scenario
occurred:
1. node2 is paused
2. the keepaliver notifies the pool refiller about that
3. the refiller removes the connection to node2 (metrics::total_connections -= 1)
4. in the test, we read get_metrics().total_connections, see that it is < initial_connections, and leave the loop
5. the refiller tries to open a connection again (metrics::total_connections += 1)
6. we read get_metrics().total_connections again and expect it to be
   less than initial_connections - but it is not.

To combat this, I adjusted the test so that the same metrics snapshot is used
both to leave the loop and to make the assertion. In this case, the aforementioned
"unlucky" scenario cannot happen.
To be quite honest, I thought we'd be able to enable more tests, but:

1. StatsConnections
Requires `cass_cluster_set_num_threads_io`, `cass_cluster_set_core_connections_per_host`
and `cass_cluster_set_constant_reconnect`.

2. ErrorsConnectionTimeouts
Requires `cass_cluster_set_core_connections_per_host`.

3. SpeculativeExecutionRequests
Requires `cass_session_get_speculative_execution_metrics`.

4. Requests
This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latencies are rounded to 0 (at least on my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher (>= 1 ms) latencies in the test?
Without this, valgrind complains about access to uninitialized memory.

Why does this test not work without adjustments? Because rust-driver
collects latencies with millisecond granularity. As a result, most of the
latencies during the tests in a local setup are rounded to 0 ms.

This is why we need to somehow simulate higher latencies during the test.
There is one piece of code that the user controls and that rust-driver executes
between the start and end time measurements - namely `HistoryListener::log_attempt_start`.

This is where we can add a sleep to simulate higher latencies in a local setup.
And so, I implemented `SleepingHistoryListener`, which does just that. In addition,
I implemented a testing API to set such a listener on the statement.

The "Requests" test is adjusted accordingly, and enabled.

Note: Since all latencies during the test in a local setup are now expected to be around 1 ms,
I loosened the stddev check to be `>= 0` instead of `> 0`.

I haven't set the default value yet. This will be done once
cass_cluster_set_core_connections_per_shard is introduced and implemented (later in this PR).

This serves as an extension to the cpp-driver API.
The default connection pool size is 1 per shard.

This tests `cass_cluster_set_core_connections_per_shard` as well as
`cass_session_get_metrics`.
@muzarski muzarski force-pushed the core-connections-per-host-shard branch from 4b4ec8e to f70522d on April 24, 2025 12:28
@muzarski muzarski requested review from wprzytula and Lorak-mmk and removed request for wprzytula April 24, 2025 13:23
@muzarski muzarski marked this pull request as ready for review April 24, 2025 13:25