joao-r-reis
Collaborator

@joao-r-reis joao-r-reis commented Jun 23, 2020

Currently cluster.Connect() starts the initialization task of the control connection, then creates the session, then initializes the session.

This PR changes this behavior so that cluster.Connect() creates the session object and returns it to the user immediately. The session initialization task is started in the background, and it starts the cluster initialization if the cluster is not initialized yet.

If a user attempts to do something with the session, cluster or cluster.Metadata that requires the initialization to be finished, the method will await/block until the initialization task is complete.

I also implemented CSHARP-698 in this PR. It uses the reconnection policy to keep retrying the control connection initialization in the background. After an initialization attempt fails, all method calls (session.Execute, etc.) will throw the original exception until a new attempt is started. When a new attempt is started, the methods will block again until the current attempt finishes.
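A minimal sketch of the "return immediately, block on first use" behavior described above. `LazySession`, `ConnectAsync` and `ExecuteAsync` are hypothetical names made up for illustration and are not the driver's types or this PR's implementation. The sketch also leaves out the background retries driven by the reconnection policy and only shows a single initialization attempt.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a session; NOT the driver's Session class.
public sealed class LazySession
{
    private readonly Task _initTask;

    // The constructor starts initialization in the background and returns
    // immediately, without waiting for it to finish.
    public LazySession(Func<Task> initializeAsync)
    {
        _initTask = Task.Run(initializeAsync);
    }

    // Explicit wait, for callers who want initialization finished up front.
    public Task ConnectAsync() => _initTask;

    // Any operation that needs the initialized state awaits the task first.
    // If the initialization attempt failed, awaiting rethrows its exception.
    public async Task<string> ExecuteAsync(string query)
    {
        await _initTask.ConfigureAwait(false);
        return "executed: " + query;
    }
}
```

With this shape, a caller that never waits explicitly still gets correct behavior, because every operation awaits the same initialization task before doing any work.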

This PR also adds Session.Connect() for users that want to wait for the initialization to be finished.

Several API changes are required in order for this to be a good experience for the user.

The Cluster.Metadata property no longer blocks. This PR also introduces an IMetadata interface and an IMetadataSnapshotProvider interface that declares some Metadata methods which do not block.

ILoadBalancingPolicy.Initialize is called by the initialization method, but an ICluster instance is passed to it. If the user attempts to access Metadata from that method, it will deadlock, because accessing Metadata requires the initialization task to be finished. This PR changes this method so that it receives an IMetadataSnapshotProvider instance instead (which is inherited by IMetadata). It also changes the method to support async (it returns Task).

A lot of methods from ICluster and ISession are moved to Metadata. This simplifies the driver's API because these methods and properties were just wrappers of those in Metadata, so it makes more sense for them to be on the IMetadata interface only.

Also updated the upgrade guide so it's easier to get the overall picture of these API changes by reading the guide.

@joao-r-reis
Collaborator Author

This PR doesn't unify Session and Cluster yet but a lot of these changes were made with the unification in mind which will come in a following PR.

I'm opening this PR before fixing the test code in order to allow for an earlier review of the API changes/design @jorgebay

@joao-r-reis joao-r-reis requested a review from jorgebay June 23, 2020 19:56
@joao-r-reis joao-r-reis added this to the 4.0.0 milestone Jun 24, 2020
@joao-r-reis joao-r-reis changed the base branch from master to 4.x June 24, 2020 09:44
* Remove IDisposable implementation
* Move ControlConnection construction to Metadata
* Add Metadata InitializeAsync and ShutdownAsync
- add IMetadata interface and change Metadata to be internal
- add *Async methods to IMetadata
- change internal references from Metadata to InternalMetadata
- add *Snapshot methods to IMetadata
@joao-r-reis joao-r-reis removed this from the 4.0.0 milestone Sep 10, 2020
@dvyskrebets

Hi @joao-r-reis
I'd like to ask you about this PR: are there any plans to finish this item?
My team has an issue with connecting to Cassandra. Sometimes, for some reason, we see Cassandra.NoHostAvailableException and the code gets stuck in that state even if the cluster is actually available. We're trying to find the reason for the issue, and we think it might be related to how the session is handled at the driver level.
So if you have any advice on this problem or a timeline for this PR, it'd be great to get help from you (I guess this PR relates to the described issue, though I'm not 100% sure).
Thanks

@joao-r-reis
Collaborator Author

This was being worked on when we had a plan to release a new major version but those plans have for the most part been abandoned for now so there is no timeline for this PR.

When the cluster initialization fails, it is recommended to close the cluster object and create it again (from the builder that you already have). If you are building the cluster object and then registering it as a singleton, for example, the initialization will happen when the first session is created (likely through the DI container). If that fails, there is a failure case where all subsequent session creations from that same cluster object will fail.

You should create the session during app startup and only then register it as a singleton. If that fails, you can let the app crash so that whatever deployment tooling you're using retries the launch, or you can implement retries of this first session creation during startup. Alternatively, change the DI configuration so that a new cluster object is created if the first session creation fails.
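The startup retry mentioned above could look roughly like the sketch below. `StartupRetry.CreateWithRetryAsync` is a hypothetical helper, not part of the driver, and the attempt count and delay are arbitrary assumptions. The key point is that the delegate runs from scratch on every attempt, so it can build a fresh cluster object each time instead of reusing one whose initialization already failed.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper; NOT part of the driver.
public static class StartupRetry
{
    // Calls createAsync up to maxAttempts times, waiting `delay` after each
    // failed attempt. The exception from the last attempt propagates to the
    // caller, so the app can still crash and be relaunched by its tooling.
    public static async Task<T> CreateWithRetryAsync<T>(
        Func<Task<T>> createAsync, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await createAsync().ConfigureAwait(false);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Swallow the failure and try again after a pause.
                await Task.Delay(delay).ConfigureAwait(false);
            }
        }
    }
}
```

In an app, the delegate would build the cluster from the existing builder, create the session, and dispose the cluster before rethrowing if session creation fails. Only the session returned by a successful attempt would then be registered as a singleton.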

@dvyskrebets

dvyskrebets commented Jul 14, 2025

@joao-r-reis Thanks for the quick response. Can I ask a quick question, though it's a bit off topic? Sometimes our team experiences issues with the connection pool: no new connections are opened. Any suggestions on which direction we should dig in to fix this?

@joao-r-reis
Collaborator Author

That's too vague, I'd have to take a look at the driver logs (at least INFO level)

@dvyskrebets

dvyskrebets commented Jul 17, 2025

Again, thank you for looking into this. Before pasting logs I need to clarify one important thing: in our setup we use a load balancer in front of Cassandra. So the dotnet app (some API, let's say) goes to the load balancer, which directs the request to Cassandra itself (it's an Azure internal load balancer, if this is important). I think this might be a cause of the issue, though no concrete explanation has been found so far.
The problem is that sometimes (quite a lot tbh) the connection pool is exhausted and Cassandra is reported as unavailable even though it actually is available.

Logs from latest to oldest (I tried to leave only meaningful info):

Jul 14, 2025 @ 07:28:21.091	Creating a new connection to "10.31.12.28:9042"	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:21.056	Cluster #39446293 ["profilesdb"] has been initialized.	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:21.056	Cluster Connected using binary protocol version: [V4]	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:21.054	Local datacenter was not specified. In the next major release of the driver applications will be required to specify the local datacenter in the load balancing policy. Available datacenters: dc1.	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:21.040	Finished building TokenMap for 13 keyspaces and 1 hosts. It took 42 milliseconds.	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.988	Rebuilding token map	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.893	Updating keyspaces metadata	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.868	Retrieving keyspaces metadata	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.857	Connection established to "10.31.12.28:9042" using protocol version "4". Building token map...	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.839	Node list retrieved successfully	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.796	Refreshing node list	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:20.644	Cancelling in Connection #47793487 to "10.31.12.28:9042", 0 pending operations and write queue 0	mentionstream-cluster-app-6447bc9dc4-27jr8
Jul 14, 2025 @ 07:28:20.642	Protocol version DseV2 not supported, trying with version 4	mentionstream-cluster-app-6447bc9dc4-27jr8
Jul 14, 2025 @ 07:28:19.705	Trying to connect the ControlConnection	mentionstream-cluster-app-6447bc9dc4-27jr8
Jul 14, 2025 @ 07:28:19.703	Connecting to cluster using "DataStax C# Driver for Apache Cassandra v3.20.1"	mentionstream-cluster-app-6447bc9dc4-27jr8
Jul 14, 2025 @ 07:28:19.328	Cancelling in Connection #42194491 to "10.31.12.28:9042", 0 pending operations and write queue 0	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:19.319	Protocol version DseV2 not supported, trying with version 4	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:18.341	Trying to connect the ControlConnection	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 07:28:18.340	Connecting to cluster using "DataStax C# Driver for Apache Cassandra v3.20.1"	mentionstream-cluster-app-6447bc9dc4-pthzm
Jul 14, 2025 @ 06:07:18.034	Connection to host "10.31.12.28:9042" switching to keyspace "profiledb"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:18.031	Re-preparing 2 queries on 10.31.12.28:9042	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:18.028	Host "10.31.12.28:9042" is now UP	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:18.027	Connection to "10.31.12.28:9042" opened successfully, pool #20314855 length: 1	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:17.956	Creating a new connection to "10.31.12.28:9042"	mentionstream-cluster-app-6447bc9dc4-wkg6q
-- from now on error logs start to appear 
Jul 14, 2025 @ 06:07:13.024	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.024	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.023	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	RequestHandler received exception "Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	RequestHandler received exception "Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.31.12.28:9042: OperationTimedOutException 'The host 10.31.12.28:9042 did not reply before timeout 12000ms')
   at Cassandra.Requests.RequestExecution.Start(Boolean currentHostRetry)
   at Cassandra.Requests.RequestExecution.RetryExecution(Boolean currentHostRetry, Host host)"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.022	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:13.021	The host 10.31.12.28:9042 did not reply before timeout 12000ms
...
Jul 14, 2025 @ 06:07:12.963	Retrying request: "ExecuteRequest"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:12.965	Connection to "10.31.12.28:9042" considered as unhealthy after 100 timed out operations	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:12.965	RequestHandler received exception "Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:12.965	The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:12.963	RequestHandler received exception "Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms"	mentionstream-cluster-app-6447bc9dc4-wkg6q
Jul 14, 2025 @ 06:07:12.963	RequestHandler received exception "Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms"
Jul 14, 2025 @ 04:40:46.279	ControlConnection reconnected to host "10.31.12.28:9042"	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.278	Finished building TokenMap for 13 keyspaces and 1 hosts. It took 0 milliseconds.	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.278	Updating keyspaces metadata	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.278	Rebuilding token map	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.276	Retrieving keyspaces metadata	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.275	Connection established to "10.31.12.28:9042" using protocol version "4". Building token map...	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.275	Node list retrieved successfully	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.272	Refreshing node list	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.201	Cancelling in Connection #60120211 to "10.31.12.28:9042", 0 pending operations and write queue 0	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.199	Received heartbeat request exception System.Net.Sockets.SocketException (107): Transport endpoint is not connected	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.199	Idle timeout exception, connection to "10.31.12.28:9042" used in control connection is disposed, triggering a reconnection. Exception: "System.Net.Sockets.SocketException (107): Transport endpoint is not connected"	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.199	Received heartbeat request exception System.Net.Sockets.SocketException (107): Transport endpoint is not connected	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.199	Idle timeout exception, connection to "10.31.12.28:9042" used in control connection is disposed, triggering a reconnection. Exception: "System.Net.Sockets.SocketException (107): Transport endpoint is not connected"	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:46.198	Trying to reconnect the ControlConnection	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:43.987	Received heartbeat request exception Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:43.987	Received heartbeat request exception Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:38.911	Received heartbeat request exception Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:38.911	Received heartbeat request exception Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-jnwl9
Jul 14, 2025 @ 04:40:33.907	Received heartbeat request exception Cassandra.OperationTimedOutException: The host 10.31.12.28:9042 did not reply before timeout 12000ms	mentionstream-cluster-app-6447bc9dc4-jnwl9

@joao-r-reis
Collaborator Author

joao-r-reis commented Jul 17, 2025

Using a load balancer in front of Cassandra is very odd and can lead to issues because the drivers have load balancing policies that maintain a specific number of connections to each node. How are you telling the driver what the addresses of the load balancers are? The driver gets the addresses from the system.local and system.peers tables but I assume those addresses are from the nodes themselves in your setup unless you're using some kind of customized version of Cassandra

@dvyskrebets

dvyskrebets commented Jul 21, 2025

This is Cassandra in a Kubernetes cluster. The load balancer is used to expose private id to an external service (a web app in another cluster). At this point, I can't say for 100% how it should be configured correctly.
I believe this setup is used as the base for Cassandra: https://github.com/k8ssandra/k8ssandra/
