daniel-lxs (Collaborator) commented Aug 15, 2025

Summary

This PR implements a telemetry queuing system to ensure telemetry events are not lost during network outages or server downtime.

Problem

Currently, the Roo Code Extension sends telemetry using a "fire and forget" approach. When telemetry events fail to send (due to network issues, server downtime, or other connectivity problems), they are simply lost with no retry mechanism.

Solution

Key Components:

  1. TelemetryQueueManager - Manages persistent event storage

    • Stores failed events to disk as JSON
    • Implements exponential backoff retry (1s → 60s max)
    • Events persist for 24 hours before expiring
    • Queue limited to 100 events to manage file size
    • Aggressive cleanup every 5 minutes
  2. QueuedTelemetryClient - Base class for telemetry clients

    • Automatic queuing on send failure
    • Background retry every 30 seconds
    • Graceful online/offline transitions
  3. Updated PostHogTelemetryClient

    • Now extends QueuedTelemetryClient
    • Disables PostHog's internal queue for better control
    • Forces immediate flush to detect network errors
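
The retry timing described above (1s growing to a 60s cap) could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the constant names, the doubling factor, and the exact signature of getRetryDelay are assumptions.

```typescript
// Hypothetical constants mirroring the values in the description (1s base, 60s cap).
const BASE_DELAY_MS = 1000
const MAX_DELAY_MS = 60000

/**
 * Delay before the next attempt, given how many attempts have already failed.
 * Assumes the delay doubles on each failure: 1s, 2s, 4s, ... capped at 60s.
 */
function getRetryDelay(retryCount: number): number {
  const exponent = Math.max(0, Math.floor(retryCount))
  // Clamp the exponent so 2 ** exponent stays a sane number for very large counts.
  const factor = 2 ** Math.min(exponent, 30)
  return Math.min(BASE_DELAY_MS * factor, MAX_DELAY_MS)
}
```

With these assumed values the cap is reached on the seventh attempt, so a long outage settles at one retry per minute per event.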

Features:

  • Per-workspace storage - Each VS Code workspace has its own queue file
  • Persistent storage - Events survive extension restarts
  • Smart retry - Exponential backoff prevents server overload
  • No retry limit - Events keep trying for 24 hours
  • Automatic cleanup - Old events removed, file size managed
  • Debug logging - Comprehensive logs for testing (set DEBUG_TELEMETRY=true)
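
As a rough sketch of the cleanup rules listed above (24-hour expiry and a 100-event cap), assuming events are kept oldest-first and that the oldest events are the ones dropped when the cap is exceeded. The PR text does not say which events are evicted, and the QueuedEvent shape and pruneQueue name are hypothetical.

```typescript
// Hypothetical event shape; the real queue entries may carry more fields.
interface QueuedEvent {
  id: string
  timestamp: number // epoch milliseconds when the event was first queued
  payload: unknown
}

const MAX_EVENT_AGE_MS = 24 * 60 * 60 * 1000 // 24 hours
const MAX_QUEUE_SIZE = 100

/** Returns a new array with expired events removed and at most 100 entries. */
function pruneQueue(events: QueuedEvent[], now: number): QueuedEvent[] {
  const fresh = events.filter((event) => now - event.timestamp < MAX_EVENT_AGE_MS)
  // Assumption: when over the cap, keep the newest events (the tail of the array).
  return fresh.length > MAX_QUEUE_SIZE ? fresh.slice(fresh.length - MAX_QUEUE_SIZE) : fresh
}
```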

Testing:

  • Comprehensive unit tests for TelemetryQueueManager
  • Tests for QueuedTelemetryClient retry logic
  • Updated PostHogTelemetryClient tests

Changes

  • Added for persistent queue management
  • Added base class
  • Updated to use queuing system
  • Added comprehensive tests
  • Added debug logging throughout

How to Test

  1. Set environment variable
  2. Disconnect network or block PostHog endpoint
  3. Trigger telemetry events in the extension
  4. Check logs for queuing messages
  5. Restore network connection
  6. Verify events are sent from queue

The queue file is stored at:


Important

Introduces a telemetry queuing system with persistent storage and retry logic to handle failed telemetry events due to network issues.

  • Behavior:
    • Introduces TelemetryQueueManager for persistent event storage and retry logic with exponential backoff in TelemetryQueueManager.ts.
    • Adds QueuedTelemetryClient as a base class for telemetry clients to handle automatic queuing and retry in QueuedTelemetryClient.ts.
    • Updates PostHogTelemetryClient to extend QueuedTelemetryClient, disabling PostHog's internal queue and handling retries in PostHogTelemetryClient.ts.
  • Features:
    • Per-workspace storage for telemetry events.
    • Persistent storage across extension restarts.
    • Exponential backoff for retries, with no retry limit within 24 hours.
    • Automatic cleanup of old events and file size management.
    • Debug logging for testing purposes.
  • Testing:
    • Adds unit tests for TelemetryQueueManager in TelemetryQueueManager.test.ts.
    • Adds tests for QueuedTelemetryClient retry logic in QueuedTelemetryClient.test.ts.
    • Updates PostHogTelemetryClient tests in PostHogTelemetryClient.test.ts.
  • Misc:
    • Exports new classes in index.ts.
    • Updates extension.ts to register the new telemetry client with debug mode support.

This description was created by Ellipsis for a897849.

- Implement TelemetryQueueManager for persistent event storage
- Add QueuedTelemetryClient base class with retry logic
- Update PostHogTelemetryClient to use queuing system
- Store queue per-workspace to avoid conflicts
- Add exponential backoff retry (1s to 60s max)
- Events persist for 24 hours before expiring
- Queue limited to 100 events to manage file size
- Add comprehensive tests for queue functionality
- Disable PostHog's internal queue for better control

This ensures telemetry events are not lost during network
outages or server downtime, with events persisted to disk
and retried automatically when connectivity is restored.
@daniel-lxs daniel-lxs requested review from mrubens, cte and jr as code owners August 15, 2025 17:00
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels Aug 15, 2025
@roomote roomote bot left a comment

Thank you for your contribution! I've reviewed the telemetry queue persistence implementation and found several issues that need attention. The overall architecture is solid, but there are some critical issues around production logging and potential race conditions that should be addressed.


// Force immediate flush to detect network errors
// This will throw if there's a network issue
await this.client.flush()

The flush() call could fail for reasons other than network issues. Could we add more specific error handling here to distinguish between different failure types? This would help with debugging and potentially allow for different retry strategies based on the error type.
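
One possible shape for this, as a sketch only: the PR does not specify what errors flush() throws, so the code below assumes the thrown value may carry a Node.js-style `code` string or an HTTP-style numeric `status`. The names FailureKind, classifyFlushError, and shouldQueueForRetry are hypothetical.

```typescript
type FailureKind = "network" | "server" | "rejected" | "unknown"

// Common Node.js system error codes that indicate a connectivity problem.
const NETWORK_ERROR_CODES = new Set(["ECONNREFUSED", "ECONNRESET", "ENOTFOUND", "ETIMEDOUT", "EAI_AGAIN"])

/** Classifies a thrown value, assuming optional `code` and `status` properties. */
function classifyFlushError(error: unknown): FailureKind {
  if (typeof error !== "object" || error === null) return "unknown"
  const { code, status } = error as { code?: unknown; status?: unknown }
  if (typeof code === "string" && NETWORK_ERROR_CODES.has(code)) return "network"
  if (typeof status === "number") {
    // 429 and 5xx are usually transient; other 4xx responses mean the payload was refused.
    if (status === 429 || status >= 500) return "server"
    if (status >= 400) return "rejected"
  }
  return "unknown"
}

/** Retrying a rejected payload is unlikely to help, so only the other kinds are queued. */
function shouldQueueForRetry(kind: FailureKind): boolean {
  return kind !== "rejected"
}
```

Treating "unknown" as retryable is a conservative choice here, since the 24-hour expiry already bounds how long such events stay in the queue.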

/**
* Persist queue to disk
*/
private async persistQueue(): Promise<void> {

The persistQueue() method writes to disk asynchronously without awaiting. Could this cause race conditions if multiple rapid enqueue operations occur? Consider using a write queue or debouncing mechanism to prevent potential data corruption.

public updateTelemetryState(didUserOptIn: boolean): void {
this.telemetryEnabled = didUserOptIn
}
}

Could we define a proper interface for the test client instead of using 'any' type? This would improve type safety and make the tests more maintainable.

})

describe("getRetryDelay", () => {
it("should calculate exponential backoff correctly", () => {

I noticed there's no test for the scenario where the queue file is corrupted (invalid JSON). The loadQueue method handles this case, but it would be good to have explicit test coverage.
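
For illustration, the corrupted-file case could be isolated into a pure parsing helper that is easy to test. This is a sketch under assumptions: the comment above only says loadQueue handles invalid JSON, not how, so falling back to an empty queue is assumed here, and the StoredEvent shape and parseQueueFile name are hypothetical.

```typescript
// Hypothetical minimal shape of a persisted queue entry.
interface StoredEvent {
  id: string
  timestamp: number
  payload?: unknown
}

function isStoredEvent(value: unknown): value is StoredEvent {
  if (typeof value !== "object" || value === null) return false
  const candidate = value as Record<string, unknown>
  return typeof candidate["id"] === "string" && typeof candidate["timestamp"] === "number"
}

/** Parses queue file contents; corrupted or unexpected data yields an empty queue. */
function parseQueueFile(contents: string): StoredEvent[] {
  try {
    const parsed: unknown = JSON.parse(contents)
    if (!Array.isArray(parsed)) return []
    // Drop individual entries that do not look like queued events.
    return parsed.filter(isStoredEvent)
  } catch {
    // Invalid JSON: assume the safest recovery is to start with an empty queue.
    return []
  }
}
```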

- Make console.log statements conditional based on debug flag
- Add singleton reset method for TelemetryQueueManager
- Fix race condition in persistQueue with debouncing and promise tracking
- Improve error handling in PostHogTelemetryClient to differentiate error types
- Make retry interval configurable in QueuedTelemetryClient
- Update tests to reflect 24-hour event expiration instead of retry-based removal
- Add test for handling corrupted JSON queue files
- Fix test expectations for retry count and event filtering
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Review] in Roo Code Roadmap Aug 15, 2025
- Move pendingPersist flag clearing inside persistQueue() method
- Implement loop-based draining to handle concurrent persist requests
- Add comprehensive tests for concurrent operations
- Ensures no telemetry events are lost during rapid enqueue operations

The previous implementation had a lost-notification bug where the pendingPersist
flag was cleared in the setImmediate callback before calling persistQueue().
This could cause events enqueued during an in-flight persist to remain unpersisted.

The fix implements Option A: clearing the flag inside persistQueue() and using
a while loop to drain all pending requests, ensuring any enqueue that happens
during a persist operation triggers another persist pass immediately after.
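
A minimal sketch of the loop-based draining idea described above, not the PR's actual code. PersistScheduler, requestPersist, and the injected `write` callback are hypothetical names; only the pendingPersist flag and the while-loop approach come from the commit message.

```typescript
class PersistScheduler {
  private pendingPersist = false
  private draining = false
  private current: Promise<void> = Promise.resolve()

  // `write` is assumed to snapshot the in-memory queue and write it to disk.
  constructor(private readonly write: () => Promise<void>) {}

  /** Marks the queue dirty and makes sure exactly one drain loop is running. */
  requestPersist(): Promise<void> {
    this.pendingPersist = true
    if (!this.draining) {
      this.draining = true
      this.current = this.drain()
    }
    return this.current
  }

  private async drain(): Promise<void> {
    try {
      while (this.pendingPersist) {
        // Clear the flag before writing: a request that arrives while the
        // write is in flight sets it again and triggers another pass.
        this.pendingPersist = false
        await this.write()
      }
    } finally {
      // Runs in the same synchronous step as the loop exit, so no request
      // can slip in between the final flag check and this reset.
      this.draining = false
    }
  }
}
```

Because the flag is cleared inside the loop rather than by the caller, an enqueue that lands during a write cannot be missed: it either sees the loop still running and is picked up by the next pass, or it starts a new loop.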
- Remove unused _isNetworkError variable in PostHogTelemetryClient
- Add TelemetryQueueManager.resetInstance() call in extension deactivation to prevent memory leaks and stale data when switching workspaces
- Removed all console.info and console.log statements that were used for debugging
- Kept console.error statements for actual error reporting
- Fixed ESLint warnings for unused error variables
- Telemetry debug logs are now completely removed to prevent console spam
@daniel-lxs daniel-lxs closed this Aug 21, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Review] to Done in Roo Code Roadmap Aug 21, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 21, 2025
Labels
enhancement (New feature or request), PR - Needs Review, size:XXL (This PR changes 1000+ lines, ignoring generated files.)

2 participants