@adrian-niculescu
Contributor

Summary

This PR fixes a race condition in negotiatePublisher() that causes ICE gathering failures and connection timeouts when sending data messages immediately after connecting to a room.

Issue

Fixes #721

Root Cause Analysis

The detailed analysis is in this comment: #721 (comment)

TL;DR

When the server responds with both subscriberPrimary=true and fastPublish=true, two concurrent calls to negotiatePublisher() occur:

  1. From RTCEngine.join() at connection time (when fastPublish=true)
  2. From ensurePublisherConnected() when sending the first data message

The existing debounce mechanism fails with concurrent coroutines due to non-atomic read-modify-write operations on the shared debounceJob variable. This results in:

  • Multiple parallel SDP offer/answer exchanges
  • Conflicting ICE gathering sessions
  • Publisher ICE state transitions: DISCONNECTED → FAILED
  • Connection timeout after 20 seconds
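A minimal sketch of how a check-then-act debounce can let two callers through. This is not the SDK's actual debounce code: `NaiveDebouncer`, `negotiate()`, and the `negotiations` counter are hypothetical names, and the `yield()` call only stands in for the window in which a second coroutine can run between the read and the write of `debounceJob`.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.yield

// Hypothetical debouncer: the read of debounceJob and the write to it
// are separate steps, so they are not atomic with respect to other callers.
class NaiveDebouncer(private val scope: CoroutineScope) {
    private var debounceJob: Job? = null

    // Counts how many negotiations were actually started.
    var negotiations = 0
        private set

    suspend fun negotiate() {
        // Step 1: read the shared state.
        val alreadyRunning = debounceJob?.isActive == true
        // Stand-in for the gap in which another coroutine may be scheduled.
        yield()
        // Step 2: act on the (possibly stale) read and write the shared state.
        if (!alreadyRunning) {
            debounceJob = scope.launch {
                negotiations++
                delay(20) // placeholder for the SDP offer/answer exchange
            }
        }
    }
}

// Runs two callers against one debouncer and reports how many negotiations started.
fun runNaiveDebounceDemo(): Int = runBlocking {
    val debouncer = coroutineScope {
        val d = NaiveDebouncer(this)
        launch { d.negotiate() } // e.g. the call made at join time
        launch { d.negotiate() } // e.g. the call made for the first data message
        d
    } // coroutineScope returns only after every child coroutine has finished
    debouncer.negotiations
}

fun main() {
    println(runNaiveDebounceDemo()) // prints 2: both callers saw "not running"
}
```

Both callers read `debounceJob` before either one assigns it, so each starts its own negotiation.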

The Fix

Added a Mutex that guards negotiatePublisher() with tryLock(), so overlapping calls are skipped instead of running concurrently:

  • First call acquires the lock and runs negotiation
  • Second call detects the lock is held and returns immediately (no-op)
  • Only one negotiation runs at a time
  • No conflicting ICE gathering
  • Publisher connects reliably
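The tryLock() pattern described above can be sketched as follows. This is an illustration under assumptions rather than the merged diff: `GuardedNegotiator`, `negotiate()`, and the `negotiations` counter are hypothetical, and the `delay` is a placeholder for the real offer/answer exchange. Only `Mutex`, `tryLock()`, and `unlock()` from `kotlinx.coroutines.sync` are real API.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex

// Hypothetical wrapper showing a tryLock() guard around a negotiation.
class GuardedNegotiator {
    private val negotiationMutex = Mutex()

    // Counts how many negotiations were actually started.
    var negotiations = 0
        private set

    // Returns true if this call ran the negotiation, false if it was skipped.
    suspend fun negotiate(): Boolean {
        // tryLock() never suspends: it returns false if the lock is already held.
        if (!negotiationMutex.tryLock()) return false
        try {
            negotiations++
            delay(20) // placeholder for the SDP offer/answer exchange
            return true
        } finally {
            // Release the lock even if the negotiation throws or is cancelled.
            negotiationMutex.unlock()
        }
    }
}

// Runs two overlapping callers and reports the started count and per-call results.
fun runGuardedDemo(): Pair<Int, List<Boolean>> = runBlocking {
    val negotiator = GuardedNegotiator()
    val results = listOf(
        async { negotiator.negotiate() }, // acquires the lock and negotiates
        async { negotiator.negotiate() }, // finds the lock held and returns false
    ).awaitAll()
    negotiator.negotiations to results
}

fun main() {
    println(runGuardedDemo()) // prints (1, [true, false])
}
```

One trade-off of this pattern is that a call arriving while a negotiation is in flight is dropped rather than queued, which matches the no-op behaviour described for the second call.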

Testing

This fix has been tested in production with a VoIP application handling incoming calls that:

  1. Connect to LiveKit room with subscriberPrimary=true
  2. Immediately send "RINGING" data message
  3. Send "ANSWER" data message when user answers

Before: ~20% failure rate with publisher ICE timeouts
After: 0% failure rate in extensive testing

Impact

This should fix the issue reported in #721 by @Christophe-DC and others who experience connection instability when sending data messages immediately after connecting to a room.

@changeset-bot

changeset-bot bot commented Oct 30, 2025

🦋 Changeset detected

Latest commit: 8877634

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
client-sdk-android Patch

Added Mutex to serialize concurrent negotiatePublisher() calls that were
causing ICE gathering failures when subscriberPrimary=true and fastPublish=true.

The race condition occurred when:
1. negotiatePublisher() called from RTCEngine.join() (fastPublish=true)
2. negotiatePublisher() called again from ensurePublisherConnected()
3. Both coroutines executed concurrently, bypassing the debounce mechanism
4. Multiple parallel SDP offers corrupted ICE state

The fix uses tryLock() to make the second call a no-op, ensuring only
one negotiation runs at a time.

Fixes: #721
@adrian-niculescu force-pushed the fix-publisher-negotiation-race-721 branch from dbefc91 to 8877634 on October 30, 2025 23:12
@adrian-niculescu changed the title from "Fix publisher negotiation race condition causing ICE timeouts (#721)" to "#721 Fix publisher negotiation race condition causing ICE timeouts" on Oct 30, 2025
@davidliu
Contributor

davidliu commented Nov 2, 2025

Wow! Thank you so much for the investigation here!

@davidliu merged commit 810988e into livekit:main on Nov 2, 2025
2 checks passed
@davidliu mentioned this pull request on Nov 2, 2025
@adrian-niculescu
Contributor Author

> Wow! Thank you so much for the investigation here!

Happy to help! Really appreciate your quick turnaround and for getting the fix published so fast.

Development

Successfully merging this pull request may close these issues.

Connection instability when receiving and sending data messages immediately after joining a room
