fix: participant rejoin issues: asymmetric reconnection and track subscription delays #937
Open: rokk4 wants to merge 6 commits into livekit:main from rokk4:rokk4/rejoin-track-freeze-fix
+716 −48
Fixes race condition where tracks arriving before participant metadata were permanently dropped from the pending queue after timeout, causing 10-60 second delays or complete failures when participants rejoin.

Changes:
1. Retry transient failures: Modified _flushPendingTracks() to differentiate between transient (notTrackMetadataFound) and permanent failures. Transient failures now keep tracks in queue for retry instead of removing them.
2. Additional flush trigger: Added listener to flush pending tracks when SignalParticipantUpdateEvent contains track publications, ensuring tracks are subscribed once metadata becomes available.
3. Improved logging: Transient failures logged at fine level to reduce noise, permanent failures at severe level for visibility.

The fix maintains the existing timeout configuration from connectOptions while enabling retry logic that resolves the race condition where:
- WebRTC track arrives first → queued
- ParticipantInfo arrives → participant created → flush fails (no publications)
- TrackPublishedResponse arrives later → second flush succeeds

This reduces track subscription latency after rejoin from 10-60s to <1s and improves reliability on slower devices where the race condition was more pronounced.

Related: livekit#928
… logic

Combines defensive and reactive approaches to fix race condition where tracks arriving before participant metadata caused 10-60s delays or failures on rejoin.

Root Cause:
When a participant rejoins, WebRTC tracks can arrive before signaling metadata. The previous logic had three critical gaps:
1. Tracks queued but dropped on timeout (no retry)
2. Missing flush triggers when metadata finally arrives
3. Insufficient deferral check (only participant existence, not publication)

Solution - Three-Layer Defense:

1. PREVENTIVE: Enhanced deferral logic (NEW)
Check not just participant existence, but also publication metadata:
- connectionState != connected (pre-connection tracks)
- participant == null (tracks before participant)
- publication == null (tracks before metadata) ← NEW CHECK
This prevents premature subscription attempts that would timeout.

2. REACTIVE: Retry transient failures
Modified _flushPendingTracks() to differentiate failure types:
- notTrackMetadataFound → return false (keep in queue, retry)
- Other failures → return true (remove from queue)
Handles micro-timing races where flush happens before metadata processed.

3. AGGRESSIVE: Additional flush trigger
Added SignalParticipantUpdateEvent listener to flush when track publications arrive, ensuring queued tracks are processed promptly.

Impact:
- Reduces rejoin latency from 10-60s to <1s
- Eliminates frozen frames on rejoin
- More robust on slower devices (reduced CPU-dependent timing sensitivity)
- Maintains configurable timeout from connectOptions

The combined approach is superior because:
- Prevention reduces unnecessary timeout waits
- Retry ensures recovery from edge cases
- Aggressive flush ensures timely processing
- Event-driven design scales better than polling

Related: livekit#928
When a participant leaves and rejoins a LiveKit room, the SFU assigns them a new participant sid while keeping the same identity. This caused the staying participant to not receive tracks from the rejoining participant. The issue was in _getOrCreateRemoteParticipant() which only checked by identity. When it found an existing participant with the old sid, it returned that stale participant object without emitting ParticipantConnectedEvent or setting up new track subscriptions. Now detects sid mismatches and properly cleans up the old participant before creating a new one, ensuring all events are emitted and subscriptions work correctly for both audio and video tracks.
- Remove 'hide logger' from pending_track_queue import (logger not exported)
- Add explicit type annotation for result variable
- Fix import ordering in pending_track_queue_test.dart
- Update PR description to reflect current status:
  - Asymmetric reconnection issue fully resolved
  - Brief 5-10s freeze still occurs on rejoin but self-recovers
  - Both video and audio affected during freeze (not just audio)
Fix Track Subscription Issues on Participant Rejoin
@cloudwebrtc @hiroshihorie could you have a look please?
Problem Statement
This PR addresses critical track subscription failures that occur when participants leave and rejoin LiveKit rooms. Users experienced two distinct but related issues:
Issue 1: Delayed Track Subscriptions (10-60s delays, sometimes permanent failure)
When a participant rejoins a room, their video would freeze or audio would be missing for 10-60 seconds, or sometimes fail entirely. This was particularly pronounced on slower devices.
Root Cause: A race condition in track subscription timing, where WebRTC tracks arrive before the signaling metadata that describes them.
Current Status: After implementing the three-layer defense (deferral, retry, flush triggers), the issue has significantly improved. Tracks now subscribe much faster in most cases. However, rejoining participants may still experience brief 5-10 second freezes where initial video frames render but then freeze with no audio, before eventually recovering. This residual issue is under investigation and appears related to publication metadata arrival timing rather than the pending track queue itself.
Issue 2: Asymmetric Reconnection (Black Screen/No Audio) - ✅ FIXED
When a participant leaves and rejoins, the participant who stayed in the room sees a black screen and hears no audio from the rejoining participant. However, the rejoining participant can see and hear the staying participant perfectly.
Root Cause: When participants rejoin, the SFU assigns them a new participant `sid` while keeping the same `identity`. The SDK's `_getOrCreateRemoteParticipant()` only checked by identity and returned the stale participant object with the old sid, preventing proper track subscription setup.
Status: ✅ Resolved - The sid mismatch detection fix completely resolves this issue. Staying participants now properly receive audio and video from rejoining participants.
Solutions Implemented
1. Enhanced Pending Track Queue with Three-Layer Defense
A. PREVENTIVE: Enhanced Deferral Logic
This prevents premature subscription attempts that would timeout, addressing the root cause rather than just symptoms.
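The three deferral conditions named in the commit message (connection state, participant, publication) could be sketched roughly as follows. This is an illustrative sketch only: `ConnectionState`, `RemoteParticipantStub`, `publishedTrackSids`, and `shouldDefer` are hypothetical stand-ins, not the SDK's actual types or method names.

```dart
// Hypothetical stand-in types; the real SDK classes differ.
enum ConnectionState { disconnected, connecting, connected }

class RemoteParticipantStub {
  RemoteParticipantStub(this.identity);

  final String identity;

  // Track sids for which publication metadata has already arrived.
  final Set<String> publishedTrackSids = {};
}

/// Returns true when an incoming track should be queued instead of being
/// subscribed immediately.
bool shouldDefer({
  required ConnectionState connectionState,
  required RemoteParticipantStub? participant,
  required String trackSid,
}) {
  // Track arrived before the room finished connecting.
  if (connectionState != ConnectionState.connected) return true;
  // Track arrived before the participant is known.
  if (participant == null) return true;
  // Track arrived before its publication metadata (the new check).
  if (!participant.publishedTrackSids.contains(trackSid)) return true;
  return false;
}
```

With a check of this shape, a track is only handed to the subscription path once all three preconditions hold, so the subscription attempt does not have to wait out a timeout for metadata that has not arrived yet.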
B. REACTIVE: Retry Transient Failures
Modified `_flushPendingTracks()` to differentiate failure types:
- `notTrackMetadataFound` → return `false` (keep in queue for retry)
- Other failures → return `true` (remove from queue permanently)
Handles micro-timing races where flush happens milliseconds before metadata is processed.
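A minimal sketch of the keep-or-remove decision, assuming a queue that is flushed more than once (for example when the participant is created and again when publication metadata arrives). `SubscribeResult`, `PendingTrack`, and `PendingQueue` are hypothetical names used for illustration; only `notTrackMetadataFound` is taken from the PR text.

```dart
// Hypothetical result type; only notTrackMetadataFound is named in the PR.
enum SubscribeResult { ok, notTrackMetadataFound, otherFailure }

class PendingTrack {
  PendingTrack(this.trackSid);

  final String trackSid;
}

class PendingQueue {
  final List<PendingTrack> _pending = [];

  int get length => _pending.length;

  void enqueue(PendingTrack track) => _pending.add(track);

  /// Attempts every queued track once. Entries are removed on success or on
  /// a permanent failure, and kept when the failure is transient.
  void flush(SubscribeResult Function(PendingTrack track) subscribe) {
    _pending.removeWhere((track) {
      final SubscribeResult result = subscribe(track);
      // Transient: metadata has not arrived yet, so keep the entry queued.
      return result != SubscribeResult.notTrackMetadataFound;
    });
  }
}
```

In this sketch the first flush leaves a track in the queue if its metadata is still missing, and a later flush, triggered once the publication is known, removes it after a successful subscription.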
C. AGGRESSIVE: Additional Flush Trigger
Added a `SignalParticipantUpdateEvent` listener to immediately flush pending tracks when track publications arrive, ensuring queued tracks are processed promptly without waiting for the timeout.
Impact:
- Maintains the configurable timeout from `connectOptions`
2. Lifecycle Hooks for Pending Track Queue
Added proper lifecycle integration:
- `onConnected()` - Flush tracks when room connects
- `onDisconnected()` - Clear queue and cancel timers
- `onReconnecting()` - Pause processing during reconnection
Why: The pending track queue needs to coordinate with room connection state to avoid processing tracks during invalid states and to ensure timely cleanup.
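One way the three hooks might interact with a queue is sketched below. The hook names come from the list above; the class `PendingTrackLifecycle`, its `onFlush` callback, and the `_paused` flag are assumptions made for illustration, and timer handling is left out to keep the example short.

```dart
/// Hypothetical queue wrapper showing how connection-state hooks could gate
/// processing. Not the SDK's implementation.
class PendingTrackLifecycle {
  PendingTrackLifecycle(this.onFlush);

  /// Called with the batch of track sids that are ready to be processed.
  final void Function(List<String> trackSids) onFlush;

  final List<String> _pending = [];
  bool _paused = false;

  int get length => _pending.length;

  void enqueue(String trackSid) => _pending.add(trackSid);

  /// Room connected: resume processing and flush whatever is queued.
  void onConnected() {
    _paused = false;
    flush();
  }

  /// Reconnecting: keep entries queued but do not process them.
  void onReconnecting() {
    _paused = true;
  }

  /// Disconnected: drop everything so stale tracks are never processed.
  void onDisconnected() {
    _pending.clear();
    _paused = false;
  }

  void flush() {
    if (_paused || _pending.isEmpty) return;
    final batch = List<String>.of(_pending);
    _pending.clear();
    onFlush(batch);
  }
}
```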
3. Queue Hygiene and TTL
Implemented:
- A time-to-live for queued tracks (`pendingTrackTTL`)
- A cap on the number of queued tracks (`maxPendingTracks`)
Why: Without TTL and limits, the queue could grow unbounded and retain stale tracks indefinitely, causing memory leaks and incorrect behavior.
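A rough sketch of how a TTL and a size cap could bound the queue. The option names `pendingTrackTTL` and `maxPendingTracks` are from the PR; the default values, the drop-oldest eviction policy, and the injectable `now` clock are assumptions chosen for the example and may not match the actual change.

```dart
class _Entry {
  _Entry(this.trackSid, this.enqueuedAt);

  final String trackSid;
  final DateTime enqueuedAt;
}

/// Hypothetical bounded queue; defaults and eviction policy are assumptions.
class BoundedPendingQueue {
  BoundedPendingQueue({
    this.pendingTrackTTL = const Duration(seconds: 10),
    this.maxPendingTracks = 32,
    DateTime Function()? now,
  }) : _now = now ?? (() => DateTime.now());

  final Duration pendingTrackTTL;
  final int maxPendingTracks;
  final DateTime Function() _now;
  final List<_Entry> _entries = [];

  void enqueue(String trackSid) {
    _evictExpired();
    // Assumed policy: drop the oldest entries once the cap is reached.
    while (_entries.isNotEmpty && _entries.length >= maxPendingTracks) {
      _entries.removeAt(0);
    }
    _entries.add(_Entry(trackSid, _now()));
  }

  /// Track sids that are still within their TTL, oldest first.
  List<String> get trackSids {
    _evictExpired();
    return [for (final e in _entries) e.trackSid];
  }

  void _evictExpired() {
    final cutoff = _now().subtract(pendingTrackTTL);
    _entries.removeWhere((e) => e.enqueuedAt.isBefore(cutoff));
  }
}
```

Passing the clock in as a function keeps the expiry logic testable without real delays.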
4. Fix Asymmetric Reconnection Issue
Why: When a participant rejoins with a new sid, we must:
- Clean up the old participant (emitting `ParticipantDisconnectedEvent`)
- Create a new participant object for the new sid
- Emit `ParticipantConnectedEvent` and `TrackPublishedEvent` for the staying participant
Without this, the staying participant would try to use the old participant object with a stale sid, causing subscription failures and missing E2EE keys.
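The sid mismatch handling could look roughly like the sketch below. `Participant`, `ParticipantRegistry`, and the string-based `events` log are hypothetical simplifications for illustration; the real `_getOrCreateRemoteParticipant()` works with the SDK's own participant and event types, and track publication events are omitted here.

```dart
class Participant {
  Participant(this.identity, this.sid);

  final String identity;
  final String sid;
}

/// Hypothetical registry keyed by identity that also compares sids.
class ParticipantRegistry {
  final Map<String, Participant> _byIdentity = {};

  /// Event names recorded as strings, standing in for real event objects.
  final List<String> events = [];

  Participant getOrCreate(String identity, String sid) {
    final existing = _byIdentity[identity];
    // Same identity and same sid: reuse the existing object.
    if (existing != null && existing.sid == sid) return existing;

    if (existing != null) {
      // Same identity but a new sid: the participant rejoined, so the old
      // object is stale and must be cleaned up first.
      _byIdentity.remove(identity);
      events.add('ParticipantDisconnectedEvent(${existing.sid})');
    }

    final created = Participant(identity, sid);
    _byIdentity[identity] = created;
    events.add('ParticipantConnectedEvent($sid)');
    return created;
  }
}
```

A lookup by identity alone would return the first object on rejoin; comparing the sid as well is what forces the disconnect and connect events to be emitted for the new session.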
Testing
Cross-SDK Context
This issue has been addressed in other LiveKit SDKs:
This PR brings the Flutter SDK to parity with other client SDKs in handling these edge cases.
Breaking Changes
None. All changes are backward compatible:
- New configuration options (`maxPendingTracks`, `pendingTrackTTL`) have sensible defaults
Known Limitations
While this PR significantly improves track subscription reliability, there is one remaining issue:
Brief Track Freeze on Rejoin (5-10 seconds):
Related Issues
#928
Summary: This PR significantly improves track subscription reliability during participant rejoin scenarios through a combination of preventive (enhanced deferral), reactive (retry logic), and corrective (sid mismatch detection) approaches. The asymmetric reconnection issue is completely resolved, and track subscription latency is greatly reduced from 10-60s to 5-10s with automatic recovery.