Skip to content

implement cass_session_get_metrics and enable some integration tests #280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

muzarski
Copy link
Collaborator

@muzarski muzarski commented Apr 18, 2025

Fixes: #257
Ref: #132

This PR implements cass_session_get_metrics and enables two integration tests.

Metrics tests

To be quite honest, I thought we'll be able to enable more tests, but:

  1. StatsConnections

Requires cass_cluster_set_num_threads_io, cass_cluster_set_core _connections_per_host
and cass_cluster_set_constant_reconnect.

  1. ErrorsConnectionTimeouts

Requires cass_cluster_set_core_connections_per_host.

  1. SpeculativeExecutionRequests

Requires cass_session_get_speculative_execution_metrics.

  1. Requests

This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latency is rounded to 0 (at least for my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?
Thanks to @Lorak-mmk suggestion, I was able to apply the HistoryListener hack in the test - this allows us to enforce >=1ms latencies for the requests.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have implemented Rust unit tests for the features/changes introduced.
  • I have enabled appropriate tests in .github/workflows/build.yml in gtest_filter.
  • I have enabled appropriate tests in .github/workflows/cassandra.yml in gtest_filter.

@muzarski muzarski self-assigned this Apr 18, 2025
@muzarski muzarski added the P1 P1 priority item - very important label Apr 18, 2025
@muzarski muzarski added this to the 0.5 milestone Apr 18, 2025
@muzarski muzarski requested review from Lorak-mmk and wprzytula April 18, 2025 15:24
@muzarski
Copy link
Collaborator Author

HeartbeatFailed test failed. It worked for me locally - I'll investigate.

@muzarski muzarski marked this pull request as draft April 18, 2025 16:01
@muzarski muzarski force-pushed the metrics branch 3 times, most recently from 55ca2d9 to 3b5129d Compare April 18, 2025 18:55
@muzarski
Copy link
Collaborator Author

Ok, it looks like the HearbeatFailed test is totally unreliable under vagrind. This makes sense, because this test is timeout-sensitive and valgrind increases the execution time significantly. This is similar to the test that used cass_future_wait_timed.

I've enabled the test under [C*/SCYLLA]_NO_VALGRIND_TEST_FILTER.

@muzarski muzarski marked this pull request as ready for review April 18, 2025 19:20
@muzarski muzarski requested review from wprzytula and Lorak-mmk April 18, 2025 19:20
@muzarski
Copy link
Collaborator Author

v1.1: Enabled Requests test as well. This test does not work for me locally without valgrind (because sub-millisecond latencies are rounded to 0ms). But it seems to always pass under valgrind (latencies are higher than 1ms). I think it's worth giving it a shot and run it in our CI - the tests are running under valgrind and I expect GH actions runner to be slower than my local machine.

@muzarski
Copy link
Collaborator Author

muzarski commented Apr 18, 2025

v1.1: Enabled Requests test as well. This test does not work for me locally without valgrind (because sub-millisecond latencies are rounded to 0ms). But it seems to always pass under valgrind (latencies are higher than 1ms). I think it's worth giving it a shot and run it in our CI - the tests are running under valgrind and I expect GH actions runner to be slower than my local machine.

v1.2: Well, I gave it a shot, but it failed for one of the Scylla releases..... So it unfortunately is flaky (and will be, unless we address microsecond granularity on the rust-driver side). Disabled this test again.

Copy link
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?

What about your idea with setting coalescing to > 1ms?

@muzarski
Copy link
Collaborator Author

muzarski commented Apr 22, 2025

What about your idea with setting coalescing to > 1ms?

It is not reliable enough.. Notice that there may be a burst of requests that are handled (and appended to request queue) just before the coalescing timer expires. Then such delay has almost no effect on the mean request latency value.

@Lorak-mmk
Copy link
Collaborator

Hmm.... When is the "before" measurement of the latency done? If it is before LBP is consulted you could put sleep(1ms) in custom LBP.

@Lorak-mmk
Copy link
Collaborator

Another possible place for that would be history listener.

@muzarski
Copy link
Collaborator Author

Hmm.... When is the "before" measurement of the latency done? If it is before LBP is consulted you could put sleep(1ms) in custom LBP.

I see that we measure latency per-attempt. We completely ignore failing attempts when measuring the latency. Is it expected behaviour?

@Lorak-mmk
Copy link
Collaborator

I have no idea tbh. @piodul ?

This means that both approaches (LBP, history) are not viable, right?

@muzarski
Copy link
Collaborator Author

This means that both approaches (LBP, history) are not viable, right?

LBP will not work for sure. But history hack might work actually. I'll try it out.

@piodul
Copy link
Collaborator

piodul commented Apr 22, 2025

I have no idea tbh. @piodul ?

Me neither. What do other drivers measure?

@muzarski
Copy link
Collaborator Author

v1.3: Applied the HistoryListener hack to increase the request latency during the test. The idea behind this is explained in the commit message. Thanks to that, we can enable Requests metrics test. All changes are contained in the last commit.

@muzarski muzarski requested a review from Lorak-mmk April 23, 2025 10:03
@muzarski
Copy link
Collaborator Author

Rebased on master

@muzarski muzarski mentioned this pull request Apr 24, 2025
6 tasks
We 0-initialize deprecated fields, so we retain the binary compatibility
with cpp-driver.
Since we implemented metrics, we can enable HeartbeatFailed test with some
adjustments to log filtering.

This test seems to fail under valgrind. This is why I enable it to run
next to other test that cannot be run under valgrind.

Note: The original test seems to be flaky for Cassandra. The following scenario
occurred:
1. node2 is paused
2. keepaliver notifies the pool refiller about that
3. refiller removes the connection to node2 (metrics::total_connections -= 1)
4. in the test, we read get_metrics().total_connections < initial_connections - we go out of the loop
5. refiller tries to open a connection again (metrics::total_connections += 1)
6. we read get_metrics().total_connections, and expect total_connections to be
less than initial_connections - but it is not.

This is why, to combat this, I adjusted the test so the same metrics snapshot
is used to leave the loop and make an assertion. In this case, the aforementioned
"unlucky" scenario will not happen.
To be quite honest, I thought we'll be able to enable more tests, but:

1. StatsConnections
Requires `cass_cluster_set_num_threads_io`, `cass_cluster_set_core_connections_per_host`
and `cass_cluster_set_constant_reconnect`.

2. ErrorsConnectionTimeouts
Requires `cass_cluster_set_core_connections_per_host`.

3. SpeculativeExecutionRequests
Requires `cass_session_get_speculative_execution_metrics`.

4. Requests
This one is interesting. It turns out that cpp-driver stores latency
stats as microseconds, while rust-driver stores them as milliseconds.

Because of that, the mean and median latency is rounded to 0 (at least for my machine).
The test expects them to be greater than 0, which makes sense assuming
the driver collects the stats with microsecond precision.

I'm not sure how to address this one. Is there any way to force
higher >=1ms latencies in the test?
Without this, valgrind complains about access to uninitialized memory.
Why does this test not work without adjustments? Well, this is because
rust-driver collects latencies with millisecond graunularity. In result,
most of the latencies during the tests in local setup are rounded to 0ms.

This is why, we somehow need to simulate higher latencies during the test.
There is one piece of code that user controls and is executed in rust-driver
in between start and end time measurements - namely `HistoryListener::log_attempt_start`.

This is where we can add a sleep to simulate higher latencies in local setup.
And so, I implemented `SleepingHistoryListener` that does just that. In addition,
I implemented the testing API to set such listener on the statement.

The "Requests" test is adjusted accordingly, and enabled.

Note: Since all latencies during the test in local setup are now expected to be around 1ms,
I loosened the stddev check to be `>= 0` instead of `> 0`.
@muzarski
Copy link
Collaborator Author

Rebased on master (sped up ci)

Cluster cluster =
default_cluster().with_connection_heartbeat_interval(1).with_connection_idle_timeout(5);
connect(cluster);

cass_uint64_t initial_connections = session_.metrics().stats.total_connections;
pause_node(2);
start_timer();
while (session_.metrics().stats.total_connections >= initial_connections &&

CassMetrics metrics = session_.metrics();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏️

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CassMetrics metrics = session_.metrics();
CassMetrics const metrics = session_.metrics();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? We update the metrics in a loop.

@muzarski muzarski requested a review from wprzytula April 25, 2025 15:33
@muzarski
Copy link
Collaborator Author

@wprzytula I addressed all of your commits. There was nothing to change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 P1 priority item - very important
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement metrics on cpp-rust-driver side (cass_session_get_metrics)
4 participants