fix: participant rejoin issues: asymmetric reconnection and track subscription delays #937

rokk4 · 2025-12-02T22:14:51Z

Fix Track Subscription Issues on Participant Rejoin

@cloudwebrtc @hiroshihorie could you have a look please?

Problem Statement

This PR addresses critical track subscription failures that occur when participants leave and rejoin LiveKit rooms. Users experienced two distinct but related issues:

Issue 1: Delayed Track Subscriptions (10-60s delays, sometimes permanent failure)

When a participant rejoins a room, their video would freeze or audio would be missing for 10-60 seconds, or sometimes fail entirely. This was particularly pronounced on slower devices.

Root Cause: Race condition in track subscription timing where WebRTC tracks arrive before signaling metadata:

WebRTC track arrives → queued in pending track queue
ParticipantInfo arrives → participant created → flush fails (publications not yet available)
Track sits in queue until timeout (10-60s)
TrackPublishedResponse arrives later but track already timed out

Current Status: After implementing the three-layer defense (deferral, retry, flush triggers), the issue has significantly improved. Tracks now subscribe much faster in most cases. However, rejoining participants may still experience brief 5-10 second freezes where initial video frames render but then freeze with no audio, before eventually recovering. This residual issue is under investigation and appears related to publication metadata arrival timing rather than the pending track queue itself.

Issue 2: Asymmetric Reconnection (Black Screen/No Audio) - ✅ FIXED

When a participant leaves and rejoins, the participant who stayed in the room sees a black screen and hears no audio from the rejoining participant. However, the rejoining participant can see and hear the staying participant perfectly.

Root Cause: When participants rejoin, the SFU assigns them a new participant sid while keeping the same identity. The SDK's _getOrCreateRemoteParticipant() only checked by identity and returned the stale participant object with the old sid, preventing proper track subscription setup.

Status: ✅ Resolved - The sid mismatch detection fix completely resolves this issue. Staying participants now properly receive audio and video from rejoining participants.

Solutions Implemented

1. Enhanced Pending Track Queue with Three-Layer Defense

A. PREVENTIVE: Enhanced Deferral Logic

// Before: Only checked if participant exists
if (participant == null) { /* defer */ }

// After: Comprehensive checks
if (connectionState != connected ||     // Pre-connection tracks
    participant == null ||              // Tracks before participant
    publication == null) {              // Tracks before metadata ← NEW
  /* defer and queue */
}

This prevents premature subscription attempts that would timeout, addressing the root cause rather than just symptoms.

B. REACTIVE: Retry Transient Failures
Modified _flushPendingTracks() to differentiate failure types:

notTrackMetadataFound → return false (keep in queue for retry)
Other failures → return true (remove from queue permanently)

Handles micro-timing races where flush happens milliseconds before metadata is processed.

C. AGGRESSIVE: Additional Flush Trigger
Added SignalParticipantUpdateEvent listener to immediately flush pending tracks when track publications arrive, ensuring queued tracks are processed promptly without waiting for timeout.

Impact:

✅ Significantly reduces rejoin latency from 10-60s to 5-10s in most cases
⚠️ Brief Video freeze and audio delay may still occur during rejoin (~5-10s) - under investigation
✅ Video tracks subscribe faster and more reliably
✅ More robust on slower devices (reduced CPU-dependent timing sensitivity)
✅ Maintains configurable timeout from connectOptions
✅ Tracks now properly retry instead of failing permanently

2. Lifecycle Hooks for Pending Track Queue

Added proper lifecycle integration:

onConnected() - Flush tracks when room connects
onDisconnected() - Clear queue and cancel timers
onReconnecting() - Pause processing during reconnection

Why: The pending track queue needs to coordinate with room connection state to avoid processing tracks during invalid states and to ensure timely cleanup.

3. Queue Hygiene and TTL

Implemented:

Automatic removal of expired tracks (5s TTL by default, configurable via pendingTrackTTL)
Limit on max pending tracks (100 by default, configurable via maxPendingTracks)
Serialized flush operations to prevent concurrent modifications
Proper cleanup on participant disconnect

Why: Without TTL and limits, the queue could grow unbounded and retain stale tracks indefinitely, causing memory leaks and incorrect behavior.

4. Fix Asymmetric Reconnection Issue

if (participant != null) {
  // NEW: Detect rejoin with different sid
  if (participant.sid != info.sid) {
    logger.fine('Participant ${info.identity} rejoined with new sid ${info.sid} (old: ${participant.sid})');
    await _handleParticipantDisconnect(info.identity);
    // Fall through to create new participant
  } else {
    // Existing participant, just refresh
    _pendingTrackQueue.refreshParticipant(participant.sid, reason: 'participant reused');
    return ParticipantCreationResult(participant: participant, newPublications: const []);
  }
}

Why: When a participant rejoins with a new sid, we must:

Clean up the old participant (dispose tracks, emit ParticipantDisconnectedEvent)
Create a new participant with the new sid
Emit ParticipantConnectedEvent and TrackPublishedEvent for the staying participant
Properly set up track subscriptions with the new sid
Re-establish E2EE key exchange

Without this, the staying participant would try to use the old participant object with a stale sid, causing subscription failures and missing E2EE keys.

Testing

✅ All existing unit tests pass
✅ Pending track queue tests validate TTL, lifecycle hooks, and serialization
✅ Room E2E tests confirm proper event emission and participant management
✅ Manually verified rejoin scenarios:
- ✅ Asymmetric reconnection issue completely resolved
- ⚠️ Brief 5-10 second audio delay and video freeze may still occur on rejoin (under investigation)
- ✅ Eventually all tracks subscribe and work correctly

Cross-SDK Context

This issue has been addressed in other LiveKit SDKs:

iOS SDK: PR #434 - Similar pending track queue hardening
Android SDK: PR #483 - Participant rejoin handling
React SDK: Has built-in retry logic and robust state management

This PR brings the Flutter SDK to parity with other client SDKs in handling these edge cases.

Breaking Changes

None. All changes are backward compatible:

New options (maxPendingTracks, pendingTrackTTL) have sensible defaults
Existing behavior is preserved for normal connection flows
Only affects edge cases that previously failed

Known Limitations

While this PR significantly improves track subscription reliability, there is one remaining issue:

Brief Track Freeze on Rejoin (5-10 seconds):

Rejoining participants may experience a short freeze before receiving full audio/video from staying participants
Video typically shows a few initial frames, then freezes with no audio
After 5-10 seconds, both audio and video recover and work correctly
This appears to be related to publication metadata arrival timing rather than the pending track queue
The freeze self-resolves and all tracks work correctly after recovery
Further investigation needed to eliminate this remaining delay

Related Issues

#928

Summary: This PR significantly improves track subscription reliability during participant rejoin scenarios through a combination of preventive (enhanced deferral), reactive (retry logic), and corrective (sid mismatch detection) approaches. The asymmetric reconnection issue is completely resolved, and track subscription latency is greatly reduced from 10-60s to 5-10s with automatic recovery.

Fixes race condition where tracks arriving before participant metadata were permanently dropped from the pending queue after timeout, causing 10-60 second delays or complete failures when participants rejoin. Changes: 1. Retry transient failures: Modified _flushPendingTracks() to differentiate between transient (notTrackMetadataFound) and permanent failures. Transient failures now keep tracks in queue for retry instead of removing them. 2. Additional flush trigger: Added listener to flush pending tracks when SignalParticipantUpdateEvent contains track publications, ensuring tracks are subscribed once metadata becomes available. 3. Improved logging: Transient failures logged at fine level to reduce noise, permanent failures at severe level for visibility. The fix maintains the existing timeout configuration from connectOptions while enabling retry logic that resolves the race condition where: - WebRTC track arrives first → queued - ParticipantInfo arrives → participant created → flush fails (no publications) - TrackPublishedResponse arrives later → second flush succeeds This reduces track subscription latency after rejoin from 10-60s to <1s and improves reliability on slower devices where the race condition was more pronounced. Related: livekit#928

… logic Combines defensive and reactive approaches to fix race condition where tracks arriving before participant metadata caused 10-60s delays or failures on rejoin. Root Cause: When a participant rejoins, WebRTC tracks can arrive before signaling metadata. The previous logic had three critical gaps: 1. Tracks queued but dropped on timeout (no retry) 2. Missing flush triggers when metadata finally arrives 3. Insufficient deferral check (only participant existence, not publication) Solution - Three-Layer Defense: 1. PREVENTIVE: Enhanced deferral logic (NEW) Check not just participant existence, but also publication metadata: - connectionState != connected (pre-connection tracks) - participant == null (tracks before participant) - publication == null (tracks before metadata) ← NEW CHECK This prevents premature subscription attempts that would timeout. 2. REACTIVE: Retry transient failures Modified _flushPendingTracks() to differentiate failure types: - notTrackMetadataFound → return false (keep in queue, retry) - Other failures → return true (remove from queue) Handles micro-timing races where flush happens before metadata processed. 3. AGGRESSIVE: Additional flush trigger Added SignalParticipantUpdateEvent listener to flush when track publications arrive, ensuring queued tracks are processed promptly. Impact: - Reduces rejoin latency from 10-60s to <1s - Eliminates frozen frames on rejoin - More robust on slower devices (reduced CPU-dependent timing sensitivity) - Maintains configurable timeout from connectOptions The combined approach is superior because: - Prevention reduces unnecessary timeout waits - Retry ensures recovery from edge cases - Aggressive flush ensures timely processing - Event-driven design scales better than polling Related: livekit#928

CLAassistant · 2025-12-02T22:14:57Z

All committers have signed the CLA.

When a participant leaves and rejoins a LiveKit room, the SFU assigns them a new participant sid while keeping the same identity. This caused the staying participant to not receive tracks from the rejoining participant. The issue was in _getOrCreateRemoteParticipant() which only checked by identity. When it found an existing participant with the old sid, it returned that stale participant object without emitting ParticipantConnectedEvent or setting up new track subscriptions. Now detects sid mismatches and properly cleans up the old participant before creating a new one, ensuring all events are emitted and subscriptions work correctly for both audio and video tracks.

- Remove 'hide logger' from pending_track_queue import (logger not exported) - Add explicit type annotation for result variable - Fix import ordering in pending_track_queue_test.dart - Update PR description to reflect current status: - Asymmetric reconnection issue fully resolved - Brief 5-10s freeze still occurs on rejoin but self-recovers - Both video and audio affected during freeze (not just audio)

rokk4 added 3 commits December 2, 2025 18:32

Harden pending track queue lifecycle

b025e84

rokk4 marked this pull request as draft December 2, 2025 22:15

rokk4 changed the title ~~Rokk4/rejoin track freeze fix~~ WIP: rejoin track freeze fix Dec 2, 2025

rokk4 added 3 commits December 5, 2025 10:37

correct gitignore

82353ec

rokk4 marked this pull request as ready for review December 5, 2025 10:53

rokk4 changed the title ~~WIP: rejoin track freeze fix~~ Fix participant rejoin issues: asymmetric reconnection and track subscription delays Dec 5, 2025

rokk4 changed the title ~~Fix participant rejoin issues: asymmetric reconnection and track subscription delays~~ fix: participant rejoin issues: asymmetric reconnection and track subscription delays Dec 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: participant rejoin issues: asymmetric reconnection and track subscription delays #937

fix: participant rejoin issues: asymmetric reconnection and track subscription delays #937

rokk4 commented Dec 2, 2025 •

edited

Loading

Uh oh!

CLAassistant commented Dec 2, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: participant rejoin issues: asymmetric reconnection and track subscription delays #937

Are you sure you want to change the base?

fix: participant rejoin issues: asymmetric reconnection and track subscription delays #937

Conversation

rokk4 commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fix Track Subscription Issues on Participant Rejoin

Problem Statement

Issue 1: Delayed Track Subscriptions (10-60s delays, sometimes permanent failure)

Issue 2: Asymmetric Reconnection (Black Screen/No Audio) - ✅ FIXED

Solutions Implemented

1. Enhanced Pending Track Queue with Three-Layer Defense

2. Lifecycle Hooks for Pending Track Queue

3. Queue Hygiene and TTL

4. Fix Asymmetric Reconnection Issue

Testing

Cross-SDK Context

Breaking Changes

Known Limitations

Related Issues

Uh oh!

CLAassistant commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rokk4 commented Dec 2, 2025 •

edited

Loading

CLAassistant commented Dec 2, 2025 •

edited

Loading