@rokk4 rokk4 commented Dec 2, 2025

Fix Track Subscription Issues on Participant Rejoin

@cloudwebrtc @hiroshihorie could you have a look please?

Problem Statement

This PR addresses critical track subscription failures that occur when participants leave and rejoin LiveKit rooms. Users experienced two distinct but related issues:

Issue 1: Delayed Track Subscriptions (10-60s delays, sometimes permanent failure)

When a participant rejoins a room, their video would freeze or audio would be missing for 10-60 seconds, or sometimes fail entirely. This was particularly pronounced on slower devices.

Root Cause: Race condition in track subscription timing where WebRTC tracks arrive before signaling metadata:

  1. WebRTC track arrives → queued in pending track queue
  2. ParticipantInfo arrives → participant created → flush fails (publications not yet available)
  3. Track sits in queue until timeout (10-60s)
  4. TrackPublishedResponse arrives later but track already timed out

Current Status: After implementing the three-layer defense (deferral, retry, flush triggers), the issue has significantly improved. Tracks now subscribe much faster in most cases. However, rejoining participants may still experience brief 5-10 second freezes where initial video frames render but then freeze with no audio, before eventually recovering. This residual issue is under investigation and appears related to publication metadata arrival timing rather than the pending track queue itself.

Issue 2: Asymmetric Reconnection (Black Screen/No Audio) - ✅ FIXED

When a participant leaves and rejoins, the participant who stayed in the room sees a black screen and hears no audio from the rejoining participant. However, the rejoining participant can see and hear the staying participant perfectly.

Root Cause: When participants rejoin, the SFU assigns them a new participant sid while keeping the same identity. The SDK's _getOrCreateRemoteParticipant() only checked by identity and returned the stale participant object with the old sid, preventing proper track subscription setup.

Status: Resolved. The sid mismatch detection fix completely resolves this issue. Staying participants now properly receive audio and video from rejoining participants.

Solutions Implemented

1. Enhanced Pending Track Queue with Three-Layer Defense

A. PREVENTIVE: Enhanced Deferral Logic

// Before: Only checked if participant exists
if (participant == null) { /* defer */ }

// After: Comprehensive checks
if (connectionState != connected ||     // Pre-connection tracks
    participant == null ||              // Tracks before participant
    publication == null) {              // Tracks before metadata ← NEW
  /* defer and queue */
}

This prevents premature subscription attempts that would time out, addressing the root cause rather than just the symptoms.
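As a rough illustration of the three deferral conditions, the check can be written as a standalone predicate. This is a minimal sketch, not the SDK's code: `ConnState`, `RemoteParticipantStub`, and `shouldDeferTrack` are hypothetical names, and the publication lookup is reduced to a set of known track sids.

```dart
// Hypothetical stand-ins for the SDK's types; names are illustrative only.
enum ConnState { disconnected, connecting, connected, reconnecting }

class RemoteParticipantStub {
  RemoteParticipantStub(this.sid, this.publicationSids);
  final String sid;
  // Track sids for which publication metadata has already arrived.
  final Set<String> publicationSids;
}

/// Returns true when an incoming WebRTC track should be queued rather than
/// subscribed immediately.
bool shouldDeferTrack({
  required ConnState connectionState,
  required RemoteParticipantStub? participant,
  required String trackSid,
}) {
  if (connectionState != ConnState.connected) return true; // pre-connection track
  if (participant == null) return true; // track before participant
  // Track before metadata: no publication known for this track sid yet.
  return !participant.publicationSids.contains(trackSid);
}
```

Only a track whose room is connected, whose participant exists, and whose publication is already known skips the queue.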

B. REACTIVE: Retry Transient Failures
Modified _flushPendingTracks() to differentiate failure types:

  • notTrackMetadataFound → return false (keep in queue for retry)
  • Other failures → return true (remove from queue permanently)

This handles micro-timing races where a flush runs milliseconds before the metadata is processed.
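The keep-or-remove rule can be sketched as follows. The names `SubscribeResult`, `PendingTrack`, and `flushPending` are assumptions made for illustration; only the `notTrackMetadataFound` case name comes from the description above.

```dart
// Hypothetical names; a sketch of the retry idea, not the SDK's code.
enum SubscribeResult { subscribed, notTrackMetadataFound, failed }

class PendingTrack {
  PendingTrack(this.trackSid);
  final String trackSid;
}

/// Attempts every queued track once. A transient "metadata not found"
/// result keeps the entry for the next flush; any other result (success
/// or permanent failure) removes it from the queue.
void flushPending(
  List<PendingTrack> queue,
  SubscribeResult Function(PendingTrack) trySubscribe,
) {
  queue.removeWhere((track) {
    final result = trySubscribe(track);
    return result != SubscribeResult.notTrackMetadataFound;
  });
}
```

A second flush after the metadata arrives then succeeds for the entries that were kept.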

C. AGGRESSIVE: Additional Flush Trigger
Added SignalParticipantUpdateEvent listener to immediately flush pending tracks when track publications arrive, ensuring queued tracks are processed promptly without waiting for timeout.
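One way to picture the extra trigger is a stream listener that flushes only when an update actually carries publications. This is a sketch under assumptions: `ParticipantUpdate` is a hypothetical stand-in for `SignalParticipantUpdateEvent`, and `flushOnPublications` is not an SDK function.

```dart
import 'dart:async';

// Hypothetical event type standing in for SignalParticipantUpdateEvent.
class ParticipantUpdate {
  ParticipantUpdate(this.participantSid, this.publishedTrackSids);
  final String participantSid;
  final List<String> publishedTrackSids;
}

/// Calls [flush] whenever an update carries at least one track publication.
/// Returns the subscription so the caller can cancel it on disconnect.
StreamSubscription<ParticipantUpdate> flushOnPublications(
  Stream<ParticipantUpdate> updates,
  void Function(String participantSid) flush,
) {
  return updates
      .where((update) => update.publishedTrackSids.isNotEmpty)
      .listen((update) => flush(update.participantSid));
}
```

Because the flush is driven by the event itself, queued tracks do not have to wait for a timer to fire.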

Impact:

  • ✅ Significantly reduces rejoin latency from 10-60s to 5-10s in most cases
  • ⚠️ Brief video freeze and audio delay may still occur during rejoin (~5-10s) - under investigation
  • ✅ Video tracks subscribe faster and more reliably
  • ✅ More robust on slower devices (reduced CPU-dependent timing sensitivity)
  • ✅ Maintains configurable timeout from connectOptions
  • ✅ Tracks now properly retry instead of failing permanently

2. Lifecycle Hooks for Pending Track Queue

Added proper lifecycle integration:

  • onConnected() - Flush tracks when room connects
  • onDisconnected() - Clear queue and cancel timers
  • onReconnecting() - Pause processing during reconnection

Why: The pending track queue needs to coordinate with room connection state to avoid processing tracks during invalid states and to ensure timely cleanup.
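The coordination between the hooks and the queue might look roughly like this. `LifecycleAwareQueue` is a hypothetical class written for illustration; it omits timers entirely, so the "cancel timers" part of `onDisconnected()` is not shown.

```dart
// Hypothetical minimal queue illustrating the three hooks.
class LifecycleAwareQueue {
  LifecycleAwareQueue(this._subscribe);

  final void Function(String trackSid) _subscribe;
  final List<String> _pending = [];
  bool _connected = false;

  int get length => _pending.length;

  /// Queues while disconnected or reconnecting, subscribes directly otherwise.
  void enqueue(String trackSid) {
    if (_connected) {
      _subscribe(trackSid);
    } else {
      _pending.add(trackSid);
    }
  }

  void onConnected() {
    _connected = true;
    // Drain a copy so the handler may enqueue again without modifying
    // the list that is being iterated.
    final drained = List<String>.of(_pending);
    _pending.clear();
    drained.forEach(_subscribe);
  }

  // Pause processing: new tracks are queued until the next onConnected().
  void onReconnecting() => _connected = false;

  void onDisconnected() {
    _connected = false;
    _pending.clear(); // drop stale entries
  }
}
```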

3. Queue Hygiene and TTL

Implemented:

  • Automatic removal of expired tracks (5s TTL by default, configurable via pendingTrackTTL)
  • Limit on max pending tracks (100 by default, configurable via maxPendingTracks)
  • Serialized flush operations to prevent concurrent modifications
  • Proper cleanup on participant disconnect

Why: Without TTL and limits, the queue could grow unbounded and retain stale tracks indefinitely, causing memory leaks and incorrect behavior.
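A bounded queue with time-based expiry can be sketched as below. The class name, the injected clock, and the drop-oldest eviction policy are assumptions for illustration; the description above only states the defaults (5s TTL, 100 entries), which the sketch mirrors.

```dart
// Hypothetical sketch of a bounded queue with TTL-based expiry. The clock
// is injected so the behaviour can be exercised without real waiting.
class BoundedTtlQueue {
  BoundedTtlQueue({
    this.ttl = const Duration(seconds: 5),
    this.maxPending = 100,
    DateTime Function()? now,
  })  : assert(maxPending > 0),
        _now = now ?? (() => DateTime.now());

  final Duration ttl;
  final int maxPending;
  final DateTime Function() _now;
  final List<_Entry> _entries = [];

  int get length => _entries.length;

  /// Adds a track, evicting expired entries first and then the oldest
  /// entry if the queue is still full (eviction policy is an assumption).
  void add(String trackSid) {
    removeExpired();
    if (_entries.length >= maxPending) _entries.removeAt(0);
    _entries.add(_Entry(trackSid, _now()));
  }

  /// Drops every entry older than [ttl].
  void removeExpired() {
    final cutoff = _now().subtract(ttl);
    _entries.removeWhere((e) => e.enqueuedAt.isBefore(cutoff));
  }
}

class _Entry {
  _Entry(this.trackSid, this.enqueuedAt);
  final String trackSid;
  final DateTime enqueuedAt;
}
```

With both bounds in place, the queue's size is capped by `maxPending` and no entry outlives `ttl` once `removeExpired()` or `add()` runs.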

4. Fix Asymmetric Reconnection Issue

if (participant != null) {
  // NEW: Detect rejoin with different sid
  if (participant.sid != info.sid) {
    logger.fine('Participant ${info.identity} rejoined with new sid ${info.sid} (old: ${participant.sid})');
    await _handleParticipantDisconnect(info.identity);
    // Fall through to create new participant
  } else {
    // Existing participant, just refresh
    _pendingTrackQueue.refreshParticipant(participant.sid, reason: 'participant reused');
    return ParticipantCreationResult(participant: participant, newPublications: const []);
  }
}

Why: When a participant rejoins with a new sid, we must:

  1. Clean up the old participant (dispose tracks, emit ParticipantDisconnectedEvent)
  2. Create a new participant with the new sid
  3. Emit ParticipantConnectedEvent and TrackPublishedEvent for the staying participant
  4. Properly set up track subscriptions with the new sid
  5. Re-establish E2EE key exchange

Without this, the staying participant would try to use the old participant object with a stale sid, causing subscription failures and missing E2EE keys.
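The same idea in a self-contained form, assuming a registry keyed by identity: `ParticipantStub`, `ParticipantRegistry`, and the string events are hypothetical stand-ins for the SDK's participant objects and emitted room events, and the E2EE step is left out.

```dart
// Hypothetical registry keyed by identity, as described above.
class ParticipantStub {
  ParticipantStub(this.identity, this.sid);
  final String identity;
  final String sid;
}

class ParticipantRegistry {
  final Map<String, ParticipantStub> _byIdentity = {};
  final List<String> events = []; // stand-in for emitted room events

  /// Returns the existing participant when identity and sid both match;
  /// otherwise tears down any stale entry and creates a fresh one.
  ParticipantStub getOrCreate(String identity, String sid) {
    final existing = _byIdentity[identity];
    if (existing != null) {
      if (existing.sid == sid) return existing;
      // Same identity, new sid: treat as a leave followed by a join.
      _byIdentity.remove(identity);
      events.add('disconnected:${existing.sid}');
    }
    final created = ParticipantStub(identity, sid);
    _byIdentity[identity] = created;
    events.add('connected:$sid');
    return created;
  }
}
```

Looking up by identity alone would return the stale object on rejoin; comparing the sid as well is what forces the disconnect and connect pair.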

Testing

  • ✅ All existing unit tests pass
  • ✅ Pending track queue tests validate TTL, lifecycle hooks, and serialization
  • ✅ Room E2E tests confirm proper event emission and participant management
  • ✅ Manually verified rejoin scenarios:
    • ✅ Asymmetric reconnection issue completely resolved
    • ⚠️ Brief 5-10 second audio delay and video freeze may still occur on rejoin (under investigation)
    • ✅ Eventually all tracks subscribe and work correctly

Cross-SDK Context

This issue has been addressed in other LiveKit SDKs:

  • iOS SDK: PR #434 - Similar pending track queue hardening
  • Android SDK: PR #483 - Participant rejoin handling
  • React SDK: Has built-in retry logic and robust state management

This PR brings the Flutter SDK to parity with other client SDKs in handling these edge cases.

Breaking Changes

None. All changes are backward compatible:

  • New options (maxPendingTracks, pendingTrackTTL) have sensible defaults
  • Existing behavior is preserved for normal connection flows
  • Only affects edge cases that previously failed

Known Limitations

While this PR significantly improves track subscription reliability, there is one remaining issue:

Brief Track Freeze on Rejoin (5-10 seconds):

  • Rejoining participants may experience a short freeze before receiving full audio/video from staying participants
  • Video typically shows a few initial frames, then freezes with no audio
  • After 5-10 seconds, both audio and video recover and work correctly
  • This appears to be related to publication metadata arrival timing rather than the pending track queue
  • The freeze self-resolves and all tracks work correctly after recovery
  • Further investigation needed to eliminate this remaining delay

Related Issues

#928


Summary: This PR significantly improves track subscription reliability during participant rejoin scenarios through a combination of preventive (enhanced deferral), reactive (retry logic), and corrective (sid mismatch detection) approaches. The asymmetric reconnection issue is completely resolved, and track subscription latency is greatly reduced from 10-60s to 5-10s with automatic recovery.

rokk4 added 3 commits December 2, 2025 18:32
Fixes race condition where tracks arriving before participant metadata
were permanently dropped from the pending queue after timeout, causing
10-60 second delays or complete failures when participants rejoin.

Changes:
1. Retry transient failures: Modified _flushPendingTracks() to differentiate
   between transient (notTrackMetadataFound) and permanent failures. Transient
   failures now keep tracks in queue for retry instead of removing them.

2. Additional flush trigger: Added listener to flush pending tracks when
   SignalParticipantUpdateEvent contains track publications, ensuring tracks
   are subscribed once metadata becomes available.

3. Improved logging: Transient failures logged at fine level to reduce noise,
   permanent failures at severe level for visibility.

The fix maintains the existing timeout configuration from connectOptions
while enabling retry logic that resolves the race condition where:
- WebRTC track arrives first → queued
- ParticipantInfo arrives → participant created → flush fails (no publications)
- TrackPublishedResponse arrives later → second flush succeeds

This reduces track subscription latency after rejoin from 10-60s to <1s
and improves reliability on slower devices where the race condition was
more pronounced.

Related: livekit#928
… logic

Combines defensive and reactive approaches to fix race condition where tracks
arriving before participant metadata caused 10-60s delays or failures on rejoin.

Root Cause:
When a participant rejoins, WebRTC tracks can arrive before signaling metadata.
The previous logic had three critical gaps:
1. Tracks queued but dropped on timeout (no retry)
2. Missing flush triggers when metadata finally arrives
3. Insufficient deferral check (only participant existence, not publication)

Solution - Three-Layer Defense:

1. PREVENTIVE: Enhanced deferral logic (NEW)
   Check not just participant existence, but also publication metadata:
   - connectionState != connected (pre-connection tracks)
   - participant == null (tracks before participant)
   - publication == null (tracks before metadata) ← NEW CHECK

   This prevents premature subscription attempts that would timeout.

2. REACTIVE: Retry transient failures
   Modified _flushPendingTracks() to differentiate failure types:
   - notTrackMetadataFound → return false (keep in queue, retry)
   - Other failures → return true (remove from queue)

   Handles micro-timing races where flush happens before metadata processed.

3. AGGRESSIVE: Additional flush trigger
   Added SignalParticipantUpdateEvent listener to flush when track
   publications arrive, ensuring queued tracks are processed promptly.

Impact:
- Reduces rejoin latency from 10-60s to <1s
- Eliminates frozen frames on rejoin
- More robust on slower devices (reduced CPU-dependent timing sensitivity)
- Maintains configurable timeout from connectOptions

The combined approach is superior because:
- Prevention reduces unnecessary timeout waits
- Retry ensures recovery from edge cases
- Aggressive flush ensures timely processing
- Event-driven design scales better than polling

Related: livekit#928
CLAassistant commented Dec 2, 2025

CLA assistant check
All committers have signed the CLA.

@rokk4 rokk4 marked this pull request as draft December 2, 2025 22:15
@rokk4 rokk4 changed the title Rokk4/rejoin track freeze fix WIP: rejoin track freeze fix Dec 2, 2025
rokk4 added 3 commits December 5, 2025 10:37
When a participant leaves and rejoins a LiveKit room, the SFU assigns them
a new participant sid while keeping the same identity. This caused the staying
participant to not receive tracks from the rejoining participant.

The issue was in _getOrCreateRemoteParticipant() which only checked by identity.
When it found an existing participant with the old sid, it returned that stale
participant object without emitting ParticipantConnectedEvent or setting up
new track subscriptions.

Now detects sid mismatches and properly cleans up the old participant before
creating a new one, ensuring all events are emitted and subscriptions work
correctly for both audio and video tracks.

- Remove 'hide logger' from pending_track_queue import (logger not exported)
- Add explicit type annotation for result variable
- Fix import ordering in pending_track_queue_test.dart
- Update PR description to reflect current status:
  - Asymmetric reconnection issue fully resolved
  - Brief 5-10s freeze still occurs on rejoin but self-recovers
  - Both video and audio affected during freeze (not just audio)
@rokk4 rokk4 marked this pull request as ready for review December 5, 2025 10:53
@rokk4 rokk4 changed the title WIP: rejoin track freeze fix Fix participant rejoin issues: asymmetric reconnection and track subscription delays Dec 5, 2025
@rokk4 rokk4 changed the title Fix participant rejoin issues: asymmetric reconnection and track subscription delays fix: participant rejoin issues: asymmetric reconnection and track subscription delays Dec 5, 2025