Skip to content

Conversation

@adrian-niculescu
Copy link
Contributor

@adrian-niculescu adrian-niculescu commented Nov 10, 2025

Follow-up: fixes #721

PR #789 added a mutex but used tryLock(), which released the mutex before connection completed, still allowing overlapping renegotiations (sorry, @davidliu).

Fix

  • Use withLock() instead of tryLock() so negotiation attempts wait instead of being skipped
  • Hold the mutex until connection completes (or times out) to prevent overlapping renegotiations
  • Move hasPublished = true before the client.isConnected check for proper reconnect handling
  • Add @Volatile to hasPublished for thread-safe memory visibility
  • Ensure the negotiation mutex only waits until the publisher connection is either connected or definitively failed (waitUntilConnectedOrDisconnected), so failed negotiations don't delay reconnect.

@changeset-bot
Copy link

changeset-bot bot commented Nov 10, 2025

🦋 Changeset detected

Latest commit: 405373a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
client-sdk-android Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@adrian-niculescu adrian-niculescu force-pushed the fix/publisher-negotiation-first-publish branch 2 times, most recently from f157148 to 50f86d5 Compare November 10, 2025 14:17
@adrian-niculescu adrian-niculescu marked this pull request as draft November 13, 2025 17:19
@adrian-niculescu adrian-niculescu marked this pull request as ready for review November 13, 2025 18:47
// Best-effort wait for publisher to connect before releasing mutex
// to avoid overlapping renegotiations.
withTimeoutOrNull(MAX_ICE_CONNECT_TIMEOUT_MS.toLong()) {
publisherObserver.waitUntilConnected()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there's a change that's on my todo list that should ensure we ignore offers/answers that come from old negotiations (server offer/answers now come with ids associated with them). I wonder if that's the missing piece here that will alleviate the issue of overlapping renegotiations.

My main concern here is that waiting for each negotiation to finish each time will cause the overall connection time to increase quite a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidliu - fair to be careful about this. I analyzed the code flow to understand the actual impact:

The mutex + waitUntilConnected() is only held while the publisher PeerConnection is not yet in CONNECTED. Once it's connected, additional negotiatePublisher() calls return immediately, so normal renegotiations don't pay extra latency.

Where your concern is valid is in a failure + soft reconnect path: if the initial negotiation fails and the publisher goes to FAILED, the current code still holds the mutex until the 20s timeout, so the first soft reconnect attempt can't start its own negotiation until that timeout elapses. That can add up to MAX_ICE_CONNECT_TIMEOUT_MS (~20s) to the reconnection time in that scenario.

A targeted fix would be to have waitUntilConnected() return not only when the connection becomes CONNECTED, but also when it becomes "disconnected" (FAILED/CLOSED). That way we still serialize negotiations, but we don't delay reconnect: the mutex is released promptly on failure and the reconnect negotiation can start immediately. The negotiation-ID filtering you mentioned would complement this, but doesn't by itself remove the extra wait.

Copy link
Contributor Author

@adrian-niculescu adrian-niculescu Nov 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidliu Quick follow-up on this after digging deeper into the flow and your latency concern.

Based on the analysis I mentioned earlier, I updated the implementation to address the failure + soft reconnect case without giving up the “no overlapping renegotiations” guarantee:

  • The mutex is still only used to serialize publisher negotiations, but the wait helper is now waitUntilConnectedOrDisconnected(), which returns as soon as the publisher PeerConnection is either CONNECTED or has definitively terminated (FAILED/CLOSED).
  • In the normal / steady state (publisher already connected), additional negotiatePublisher() calls effectively don’t wait — the first emission is already CONNECTED, so renegotiations don’t pay extra latency.
  • In the failure + soft reconnect scenario you called out, we no longer hold the mutex for the full MAX_ICE_CONNECT_TIMEOUT_MS when the connection has already failed. The helper exits as soon as the publisher reaches a terminal failure state, so the soft reconnect can immediately start a fresh negotiation instead of being blocked for ~20s.
  • DISCONNECTED is still treated as transient (per the existing isDisconnected() extension), so we keep waiting through temporary drops where ICE can still recover.

So the behavior now is: serialize negotiations while the connection is in a transient state, but release the lock promptly once it’s either successfully connected or clearly dead. Your planned change to use negotiation IDs on offers/answers will still be a nice complementary layer on the signaling side; this PR just tightens the local sequencing so we don’t introduce avoidable reconnect latency in the failure path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So taking a closer look at the code, I still think the waitUntilConnectedOrDisconnected is unneeded?

With the withLock implementation serializing the calls to createAndSendOffer, this should handle subsequent calls to it. The scenario I'm thinking of is if multiple things want to kick off a negotiation, it should aggregate all the negotiations if one's already in-flight, and handle them all in a single renegotiation, rather than each extra request handled sequentially.

I think I'll also need to add a lock directly to createAndSendOffer, since it is called directly elsewhere.

@adrian-niculescu adrian-niculescu force-pushed the fix/publisher-negotiation-first-publish branch 2 times, most recently from 248f106 to e85dfcb Compare November 16, 2025 20:13
@adrian-niculescu adrian-niculescu marked this pull request as draft November 16, 2025 20:22
@adrian-niculescu adrian-niculescu force-pushed the fix/publisher-negotiation-first-publish branch from e85dfcb to 429f14a Compare November 16, 2025 20:49
@adrian-niculescu adrian-niculescu marked this pull request as ready for review November 16, 2025 20:55
@adrian-niculescu adrian-niculescu force-pushed the fix/publisher-negotiation-first-publish branch from 429f14a to e9153df Compare November 16, 2025 20:58
@adrian-niculescu adrian-niculescu force-pushed the fix/publisher-negotiation-first-publish branch from 0a3bb58 to 96a5cf8 Compare November 17, 2025 18:18
@adrian-niculescu
Copy link
Contributor Author

adrian-niculescu commented Nov 24, 2025

@davidliu, closing this PR as your changes in #813 solved the overlapping negotiation issue using offer ID tracking, which avoids the connection time penalty that the serialization approach in this PR would introduce.

The independent fixes from this PR (adding @Volatile to hasPublished and moving its assignment before the connection check) have been extracted into #814.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Connection instability when receiving and sending data messages immediately after joining a room

2 participants