Connection pool size configuration #281

Open
muzarski wants to merge 12 commits into master from core-connections-per-host-shard

Conversation

@muzarski (Collaborator) commented Apr 23, 2025

Fixes: #240
Ref: #132

Depends on: #280 (I need metrics support to implement one integration test).

Implements:

  • cass_cluster_set_core_connections_per_host
  • cass_cluster_set_core_connections_per_shard (as an extension to the cpp-driver API; see the usage sketch below)
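
For illustration, a rough usage sketch of both setters from the C API. The per-shard signature is my assumption, mirroring the existing per-host setter; it is not copied from this PR:

#include <cassandra.h>

void configure_pool(CassCluster* cluster) {
  // cpp-driver-compatible setter: 2 pooled connections per host
  cass_cluster_set_core_connections_per_host(cluster, 2);

  // extension introduced by this PR: 2 pooled connections per shard
  // (assumed to take the cluster and a connection count, like the per-host setter)
  cass_cluster_set_core_connections_per_shard(cluster, 2);
}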

Integration tests

Existing ErrorsConnectionTimeouts metrics test

CASSANDRA_INTEGRATION_TEST_F(MetricsTests, ErrorsConnectionTimeouts) {
  CHECK_FAILURE;

  Session session =
      default_cluster().with_core_connections_per_host(2).with_connect_timeout(1).connect(
          "", false); // Quick connection timeout and no assertion

  CassMetrics metrics = session.metrics();
  EXPECT_GE(2u, metrics.errors.connection_timeouts);
}

I don't know who implemented this test, but (IMO) it's just wrong. My understanding of the intentions is:

  1. we set 2 connections per host
  2. we set low connect timeout
  3. we expect at least 2 connection timeouts registered in the metrics

If those were the intentions, then EXPECT_GE(2u, metrics.errors.connection_timeouts); is wrong - the arguments should be the other way around. As written, it asserts 2 >= connection_timeouts.
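
Under that reading, the fix would be to swap the arguments so the assertion becomes connection_timeouts >= 2:

  EXPECT_GE(metrics.errors.connection_timeouts, 2u); // connection_timeouts >= 2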

OK, so let's say we fix it. It still won't work with the rust driver. The Session object will not even be built - we will fail to open the control connection (because of timeouts) and fail to fetch the metadata. In particular, we won't even be able to call Session::get_metrics() - cass_session_get_metrics will return early with a log error message.

Thus, I'm not enabling this test.

Fun fact: in my local setup, this test passes. Even though the connect timeout is low, it is not triggered. As a result, metrics.errors.connection_timeouts is 0, which means that the aforementioned assertion passes as well.

My new StatsShardConnections metrics test

I implemented a simple test that configures the pool size to 2 connections per shard (cass_cluster_set_core_connections_per_shard) and checks whether all connections are registered in the metrics. The expected value is at least nr_hosts * nr_shards * 2.
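
For orientation, a simplified sketch of what such a test could look like, in the style of the quoted test above. This is not the code from this PR; with_core_connections_per_shard, nr_hosts and nr_shards are illustrative placeholders for the actual builder method and the cluster's node/shard counts:

CASSANDRA_INTEGRATION_TEST_F(MetricsTests, StatsShardConnections) {
  CHECK_FAILURE;

  // Pool size: 2 connections per shard (cass_cluster_set_core_connections_per_shard).
  Session session = default_cluster().with_core_connections_per_shard(2).connect();

  // At least nr_hosts * nr_shards * 2 connections should be registered in the metrics.
  CassMetrics metrics = session.metrics();
  EXPECT_GE(metrics.stats.total_connections, nr_hosts * nr_shards * 2u);
}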

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
  • I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

@muzarski muzarski self-assigned this Apr 23, 2025
@muzarski muzarski added the P1 P1 priority item - very important label Apr 23, 2025
@muzarski muzarski added this to the 0.5 milestone Apr 23, 2025
@muzarski muzarski marked this pull request as draft April 23, 2025 11:57
muzarski added 12 commits April 24, 2025 14:08
We 0-initialize deprecated fields, so we retain binary compatibility
with cpp-driver.

Since we implemented metrics, we can enable the HeartbeatFailed test with some
adjustments to log filtering.

This test seems to fail under valgrind. This is why I enable it to run
alongside the other tests that cannot be run under valgrind.

Note: The original test seems to be flaky for Cassandra. The following scenario
occurred:
1. node2 is paused
2. the keepaliver notifies the pool refiller about that
3. the refiller removes the connection to node2 (metrics::total_connections -= 1)
4. in the test, we read get_metrics().total_connections, see that it is < initial_connections, and leave the loop
5. the refiller tries to open a connection again (metrics::total_connections += 1)
6. we read get_metrics().total_connections again and expect it to be
   less than initial_connections - but it is not.

To combat this, I adjusted the test so that the same metrics snapshot is used
both to leave the loop and to make the assertion. In this case, the aforementioned
"unlucky" scenario cannot happen.
To be quite honest, I thought we'd be able to enable more tests, but:

1. StatsConnections
Requires `cass_cluster_set_num_threads_io`, `cass_cluster_set_core_connections_per_host`
and `cass_cluster_set_constant_reconnect`.

2. ErrorsConnectionTimeouts
Requires `cass_cluster_set_core_connections_per_host`.

3. SpeculativeExecutionRequests
Requires `cass_session_get_speculative_execution_metrics`.

4. Requests
This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latencies are rounded to 0 (at least on my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher (>= 1 ms) latencies in the test?
Without this, valgrind complains about access to uninitialized memory.

Why does this test not work without adjustments? Because rust-driver
collects latencies with millisecond granularity. As a result, most of the
latencies during the tests in a local setup are rounded to 0 ms.

This is why we need to somehow simulate higher latencies during the test.
There is one piece of code that the user controls and that rust-driver executes
between the start and end time measurements - namely `HistoryListener::log_attempt_start`.

This is where we can add a sleep to simulate higher latencies in a local setup.
And so, I implemented `SleepingHistoryListener`, which does just that. In addition,
I implemented a testing API to set such a listener on the statement.

The "Requests" test is adjusted accordingly, and enabled.

Note: Since all latencies during the test in a local setup are now expected to be around 1 ms,
I loosened the stddev check to be `>= 0` instead of `> 0`.

I haven't set the default value yet. This will be done once
cass_cluster_set_core_connections_per_shard is introduced and implemented (later in this PR).

This serves as an extension to the cpp-driver API.
The default connection pool size is 1 per shard.

This tests `cass_cluster_set_core_connections_per_shard` as well as
`cass_session_get_metrics`.
@muzarski muzarski force-pushed the core-connections-per-host-shard branch from 4b4ec8e to f70522d on April 24, 2025 12:28
@muzarski muzarski requested review from wprzytula and Lorak-mmk and removed request for wprzytula April 24, 2025 13:23
@muzarski muzarski marked this pull request as ready for review April 24, 2025 13:25