#721 Fix publisher negotiation race condition causing ICE timeouts #789
Summary
This PR fixes a race condition in `negotiatePublisher()` that causes ICE gathering failures and connection timeouts when sending data messages immediately after connecting to a room.
Issue
Fixes #721
Root Cause Analysis
The detailed analysis is in this comment: #721 (comment)
TL;DR
When the server responds with both `subscriberPrimary=true` and `fastPublish=true`, two concurrent calls to `negotiatePublisher()` occur:
- `RTCEngine.join()` at connection time (when `fastPublish=true`)
- `ensurePublisherConnected()` when sending the first data message

The existing debounce mechanism fails with concurrent coroutines because the read-modify-write operations on the shared `debounceJob` variable are not atomic. This results in the ICE gathering failures and connection timeouts described in the Summary.
The Fix
Added a `Mutex` to serialize `negotiatePublisher()` calls using `tryLock()`.
Testing
This fix has been tested in production with a VoIP application handling incoming calls that:
- `subscriberPrimary=true`

Before: ~20% failure rate with publisher ICE timeouts
After: 0% failure rate in extensive testing
Impact
This should also fix the issue reported in #721 by @Christophe-DC and others experiencing connection instability when sending data messages immediately after room connection.
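
For readers who want to see the shape of the guard described under The Fix, here is a minimal sketch. It is not the code from this PR: `NegotiationGuard`, `runConcurrentCallers`, and the simulated `delay` are hypothetical names introduced for illustration, and what the SDK actually does when `tryLock()` fails is assumed here to be "skip the duplicate call". Only `Mutex`, `tryLock()`, and `unlock()` are real kotlinx.coroutines APIs.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the engine's negotiation entry point.
class NegotiationGuard(private val negotiate: suspend () -> Unit) {
    private val mutex = Mutex()

    // Returns true if this call ran the negotiation,
    // false if another negotiation was already in flight.
    suspend fun negotiatePublisher(): Boolean {
        // tryLock() is an atomic test-and-set, so exactly one
        // concurrent caller can pass this point at a time.
        if (!mutex.tryLock()) return false
        try {
            negotiate()
            return true
        } finally {
            mutex.unlock()
        }
    }
}

// Launches `callers` concurrent calls and reports how many
// negotiations actually started.
fun runConcurrentCallers(callers: Int): Int = runBlocking {
    val started = AtomicInteger(0)
    val guard = NegotiationGuard {
        started.incrementAndGet()
        delay(50) // assumed stand-in for the offer/answer round trip
    }
    (1..callers).map { async { guard.negotiatePublisher() } }.awaitAll()
    started.get()
}

fun main() {
    // Two callers mirror the join() and ensurePublisherConnected() paths.
    println("negotiations started: ${runConcurrentCallers(2)}")
}
```

With two callers, only the first one acquires the lock while the second returns immediately, so a single negotiation starts. A design point worth checking in the real change: silently dropping the losing call is only safe if the in-flight negotiation already covers what the second caller needed.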