implement `cass_session_get_metrics` and enable some integration tests #280

muzarski · 2025-04-18T15:23:31Z

Fixes: #257
Ref: #132

This PR implements cass_session_get_metrics and enables two integration tests.

Metrics tests

To be quite honest, I thought we'd be able to enable more tests, but:

StatsConnections

Requires cass_cluster_set_num_threads_io, cass_cluster_set_core_connections_per_host
and cass_cluster_set_constant_reconnect.

ErrorsConnectionTimeouts

Requires cass_cluster_set_core_connections_per_host.

SpeculativeExecutionRequests

Requires cass_session_get_speculative_execution_metrics.

Requests

This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latencies are rounded to 0 (at least on my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?

Thanks to @Lorak-mmk's suggestion, I was able to apply the HistoryListener hack in the test, which lets us enforce >=1ms latencies for the requests.
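To illustrate the granularity problem described above, here is a minimal sketch. The helper names are hypothetical and are not taken from either driver, and it assumes the conversion to whole milliseconds truncates (the PR only says the values end up "rounded to 0"). It shows the same sub-millisecond latency being nonzero when kept in microseconds and zero when kept in milliseconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helpers (not taken from either driver): record the same
// elapsed time at two different granularities.
std::int64_t latency_in_micros(std::chrono::nanoseconds elapsed) {
  assert(elapsed.count() >= 0);  // latencies are never negative
  return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}

std::int64_t latency_in_millis(std::chrono::nanoseconds elapsed) {
  assert(elapsed.count() >= 0);
  // duration_cast truncates toward zero, so anything below 1ms becomes 0.
  return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

With a 350µs request, `latency_in_micros` keeps the value while `latency_in_millis` yields 0, which is why a `> 0` expectation on the mean or median can fail on a fast local machine.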

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
PR description sums up the changes and reasons why they should be introduced.
~~[ ] I have implemented Rust unit tests for the features/changes introduced.~~
I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

muzarski · 2025-04-18T16:01:33Z

HeartbeatFailed test failed. It worked for me locally - I'll investigate.

muzarski · 2025-04-18T19:19:06Z

OK, it looks like the HeartbeatFailed test is totally unreliable under valgrind. This makes sense, because the test is timeout-sensitive and valgrind increases the execution time significantly. This is similar to the test that used cass_future_wait_timed.

I've enabled the test under [C*/SCYLLA]_NO_VALGRIND_TEST_FILTER.

muzarski · 2025-04-18T19:32:01Z

v1.1: Enabled the Requests test as well. This test does not work for me locally without valgrind (because sub-millisecond latencies are rounded to 0ms), but it seems to always pass under valgrind (latencies are higher than 1ms). I think it's worth giving it a shot and running it in our CI: the tests run under valgrind there, and I expect the GH Actions runner to be slower than my local machine.

muzarski · 2025-04-18T19:52:08Z

v1.2: Well, I gave it a shot, but it failed for one of the Scylla releases. So it is unfortunately flaky (and will remain so unless we address microsecond granularity on the rust-driver side). Disabled this test again.

Lorak-mmk

> I'm not sure how to address this one. Is there any way to force
> higher >=1ms latencies in the test?

What about your idea with setting coalescing to > 1ms?

muzarski · 2025-04-22T09:31:36Z

> What about your idea with setting coalescing to > 1ms?

It is not reliable enough. Note that a burst of requests may be handled (and appended to the request queue) just before the coalescing timer expires. Such a delay then has almost no effect on the mean request latency.

Lorak-mmk · 2025-04-22T09:46:41Z

Hmm.... When is the "before" measurement of the latency done? If it is before LBP is consulted you could put sleep(1ms) in custom LBP.

Lorak-mmk · 2025-04-22T09:47:20Z

Another possible place for that would be history listener.

muzarski · 2025-04-22T11:20:29Z

> Hmm.... When is the "before" measurement of the latency done? If it is before LBP is consulted you could put sleep(1ms) in custom LBP.

I see that we measure latency per-attempt. We completely ignore failing attempts when measuring the latency. Is it expected behaviour?

Lorak-mmk · 2025-04-22T11:39:48Z

I have no idea tbh. @piodul ?

This means that both approaches (LBP, history) are not viable, right?

muzarski · 2025-04-22T12:09:37Z

> This means that both approaches (LBP, history) are not viable, right?

LBP will definitely not work, but the history hack might actually work. I'll try it out.

piodul · 2025-04-22T12:14:02Z

> I have no idea tbh. @piodul ?

Me neither. What do other drivers measure?

muzarski · 2025-04-23T10:03:41Z

v1.3: Applied the HistoryListener hack to increase the request latency during the test. The idea behind this is explained in the commit message. Thanks to that, we can enable the Requests metrics test. All changes are contained in the last commit.

muzarski · 2025-04-24T10:55:25Z

Rebased on master

Since we implemented metrics, we can enable the HeartbeatFailed test with some adjustments to log filtering. This test seems to fail under valgrind, which is why I enabled it to run alongside the other tests that cannot be run under valgrind.

Note: The original test seems to be flaky for Cassandra. The following scenario occurred:
1. node2 is paused
2. keepaliver notifies the pool refiller about that
3. refiller removes the connection to node2 (metrics::total_connections -= 1)
4. in the test, we read get_metrics().total_connections < initial_connections, so we leave the loop
5. refiller tries to open a connection again (metrics::total_connections += 1)
6. we read get_metrics().total_connections again and expect it to be less than initial_connections, but it is not.

To combat this, I adjusted the test so that the same metrics snapshot is used both to leave the loop and to make the assertion. With this change, the aforementioned "unlucky" scenario cannot happen.
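The race described in this commit message can be reproduced in isolation. The sketch below is not the actual test code: `FakeSession`, `racy_check`, `snapshot_check`, and the `max_polls` limit are hypothetical names introduced for illustration. The fake session replays a scripted sequence of `total_connections` values, standing in for the refiller dropping and then reopening a connection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a session whose metric changes between reads.
struct FakeSession {
  std::vector<std::uint64_t> scripted_totals;  // value returned by each read
  std::size_t next = 0;

  explicit FakeSession(std::vector<std::uint64_t> totals)
      : scripted_totals(std::move(totals)) {
    assert(!scripted_totals.empty());
  }

  std::uint64_t total_connections() {
    // Once the script is exhausted, keep returning its last value.
    std::size_t i = next < scripted_totals.size() ? next++ : scripted_totals.size() - 1;
    return scripted_totals[i];
  }
};

// Racy variant: the loop condition and the final check are separate reads.
bool racy_check(FakeSession& session, std::uint64_t initial, int max_polls = 100) {
  int polls = 0;
  while (session.total_connections() >= initial && ++polls < max_polls) {
  }
  return session.total_connections() < initial;  // second, independent read
}

// Snapshot variant: the read that ends the loop is the one that is checked.
bool snapshot_check(FakeSession& session, std::uint64_t initial, int max_polls = 100) {
  std::uint64_t snapshot = session.total_connections();
  for (int polls = 1; snapshot >= initial && polls < max_polls; ++polls) {
    snapshot = session.total_connections();
  }
  return snapshot < initial;
}
```

With the script `{4, 4, 3, 4}` and `initial == 4`, the racy variant leaves the loop on the read that returns 3 and then checks a fresh read that returns 4, so it fails. The snapshot variant checks the same 3 that ended the loop.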

Why does this test not work without adjustments? Because rust-driver collects latencies with millisecond granularity. As a result, most of the latencies measured during the tests in a local setup are rounded to 0ms, so we need some way to simulate higher latencies during the test.

There is one piece of code that the user controls and that rust-driver executes between the start and end time measurements, namely `HistoryListener::log_attempt_start`. This is where we can add a sleep to simulate higher latencies in a local setup. So I implemented `SleepingHistoryListener`, which does just that, along with a testing API to set such a listener on the statement. The "Requests" test is adjusted accordingly and enabled.

Note: Since all latencies during the test in a local setup are now expected to be around 1ms, I loosened the stddev check to be `>= 0` instead of `> 0`.
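The idea can be modelled in a few lines. This is only a sketch of the timing argument, not the rust-driver or wrapper implementation: `measure_attempt_millis`, `sleeping_hook`, and `fast_request` are hypothetical names, and the hook stands in for `HistoryListener::log_attempt_start` on the assumption that it runs between the two time measurements.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical model of a per-attempt latency measurement with a
// user-controlled hook executed between the start and end timestamps.
std::int64_t measure_attempt_millis(const std::function<void()>& on_attempt_start,
                                    const std::function<void()>& send_request) {
  assert(on_attempt_start && send_request);
  const auto start = std::chrono::steady_clock::now();
  on_attempt_start();  // plays the role of the history listener callback
  send_request();
  const auto elapsed = std::chrono::steady_clock::now() - start;
  // Whole milliseconds only, mirroring the millisecond granularity.
  return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// Hook that sleeps for at least 1ms, pushing the measured latency to >= 1ms.
void sleeping_hook() { std::this_thread::sleep_for(std::chrono::milliseconds(1)); }

// A request that completes almost instantly, as in a fast local setup.
void fast_request() {}
```

Calling `measure_attempt_millis(sleeping_hook, fast_request)` reports at least 1, even though the request itself takes well under a millisecond. If every sample lands on the same whole millisecond, the standard deviation is exactly 0, which matches the loosened `>= 0` check.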

muzarski · 2025-04-24T12:08:34Z

Rebased on master (sped up ci)

scylla-rust-wrapper/src/session.rs

wprzytula · 2025-04-25T10:40:43Z

tests/src/integration/tests/test_heartbeat.cpp

  Cluster cluster =
      default_cluster().with_connection_heartbeat_interval(1).with_connection_idle_timeout(5);
  connect(cluster);

  cass_uint64_t initial_connections = session_.metrics().stats.total_connections;
  pause_node(2);
  start_timer();
-  while (session_.metrics().stats.total_connections >= initial_connections &&
+
+  CassMetrics metrics = session_.metrics();


Suggested change

-  CassMetrics metrics = session_.metrics();
+  CassMetrics const metrics = session_.metrics();

Why? We update the metrics in a loop.

Makefile

scylla-rust-wrapper/src/integration_testing.rs

muzarski · 2025-04-25T15:33:52Z

@wprzytula I addressed all of your comments. There was nothing to change.

muzarski self-assigned this Apr 18, 2025

muzarski added the P1 P1 priority item - very important label Apr 18, 2025

muzarski added this to the 0.5 milestone Apr 18, 2025

muzarski requested review from Lorak-mmk and wprzytula April 18, 2025 15:24

muzarski marked this pull request as draft April 18, 2025 16:01

muzarski removed request for Lorak-mmk and wprzytula April 18, 2025 16:01

muzarski force-pushed the metrics branch 3 times, most recently from 55ca2d9 to 3b5129d April 18, 2025 18:55

muzarski marked this pull request as ready for review April 18, 2025 19:20

muzarski requested review from wprzytula and Lorak-mmk April 18, 2025 19:20

muzarski force-pushed the metrics branch from 3712200 to 3b5129d April 18, 2025 19:50

Lorak-mmk approved these changes Apr 22, 2025

muzarski force-pushed the metrics branch from 9dfbd1d to 1bbefbf April 23, 2025 09:32

muzarski requested a review from Lorak-mmk April 23, 2025 10:03

Lorak-mmk approved these changes Apr 23, 2025

muzarski force-pushed the metrics branch from 1bbefbf to 91ebc41 April 24, 2025 10:54

muzarski mentioned this pull request Apr 24, 2025

ci: build in release #282

Merged

6 tasks

muzarski added 8 commits April 24, 2025 14:08

build: generate bindings for CassMetrics

885c211

cargo: add metrics feature to scylla crate

8cdb310

session: implement cass_session_get_metrics

4cf81da

We 0-initialize deprecated fields, so we retain the binary compatibility with cpp-driver.

testing: remove cass_session_get_metrics from unimplemented

4431eee

testing: explicitly 0-initialize CassMetrics struct

87de4fe

Without this, valgrind complains about access to uninitialized memory.
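As a rough illustration of the zero-initialization mentioned above, the sketch below uses `FakeMetrics`, a hypothetical, trimmed-down stand-in and not the real `CassMetrics` layout. An empty initializer list zeroes every member, so fields that are never written afterwards still hold defined values when read.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down stand-in for a C metrics struct.
struct FakeStats {
  std::uint64_t total_connections;
  std::uint64_t available_connections;
};

struct FakeErrors {
  std::uint64_t connection_timeouts;
};

struct FakeMetrics {
  FakeStats stats;
  FakeErrors errors;
};

FakeMetrics make_zeroed_metrics() {
  // An empty initializer list value-initializes every member to zero.
  // A plain `FakeMetrics metrics;` would leave the members indeterminate.
  FakeMetrics metrics = {};
  assert(metrics.stats.total_connections == 0);  // sanity check
  return metrics;
}
```

A struct initialized this way can be handed to a function that fills in only some of its fields without a later read touching uninitialized memory.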

muzarski force-pushed the metrics branch from 91ebc41 to 0f5ea42 April 24, 2025 12:08

muzarski mentioned this pull request Apr 24, 2025

Connection pool size configuration #281

Open

5 tasks

wprzytula requested changes Apr 25, 2025

muzarski requested a review from wprzytula April 25, 2025 15:33
