feat: add telemetry queue persistence system #7133

daniel-lxs · 2025-08-15T17:00:47Z

Summary

This PR implements a telemetry queuing system to ensure telemetry events are not lost during network outages or server downtime.

Problem

Currently, the Roo Code Extension sends telemetry using a "fire and forget" approach. When telemetry events fail to send (due to network issues, server downtime, or other connectivity problems), they are simply lost with no retry mechanism.

Solution

Key Components:

TelemetryQueueManager - Manages persistent event storage
- Stores failed events to disk as JSON
- Implements exponential backoff retry (1s → 60s max)
- Events persist for 24 hours before expiring
- Queue limited to 100 events to manage file size
- Aggressive cleanup every 5 minutes
QueuedTelemetryClient - Base class for telemetry clients
- Automatic queuing on send failure
- Background retry every 30 seconds
- Graceful online/offline transitions
Updated PostHogTelemetryClient
- Now extends QueuedTelemetryClient
- Disables PostHog's internal queue for better control
- Forces immediate flush to detect network errors
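For reviewers who want to sanity-check the retry timing, here is a minimal sketch of an exponential backoff capped at 60s. The doubling factor and the constant names are assumptions for illustration; the PR only states the 1s starting delay and the 60s ceiling.

```typescript
// Hypothetical constants mirroring the limits in the description (1s initial, 60s max).
const INITIAL_DELAY_MS = 1000
const MAX_DELAY_MS = 60000

/**
 * Delay before the next retry. `retryCount` is the number of attempts that
 * have already failed (0 for the first retry). Doubling is an assumed factor.
 */
function getRetryDelay(retryCount: number): number {
	// Treat negative, fractional, or non-finite counts defensively.
	const attempts = Number.isFinite(retryCount) ? Math.max(0, Math.floor(retryCount)) : 0
	return Math.min(INITIAL_DELAY_MS * 2 ** attempts, MAX_DELAY_MS)
}
```

With these assumptions the delays run 1s, 2s, 4s, 8s, 16s, 32s and then stay at 60s for every later attempt.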

Features:

✅ Per-workspace storage - Each VS Code workspace has its own queue file
✅ Persistent storage - Events survive extension restarts
✅ Smart retry - Exponential backoff prevents server overload
✅ No retry limit - Events keep trying for 24 hours
✅ Automatic cleanup - Old events removed, file size managed
✅ Debug logging - Comprehensive logs for testing (set DEBUG_TELEMETRY=true)
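The expiry and size limits listed above could be enforced in one cleanup pass. The sketch below assumes events are appended oldest-first and that the oldest events are dropped when the queue is over the limit; the `QueuedEvent` shape and the function name are hypothetical, not taken from the PR.

```typescript
// Hypothetical event shape; the real one in the PR may carry more fields.
interface QueuedEvent {
	id: string
	timestamp: number // epoch ms when the event was first queued
	retryCount: number
}

const MAX_QUEUE_SIZE = 100
const MAX_EVENT_AGE_MS = 24 * 60 * 60 * 1000

/** Returns a new queue with expired events removed and at most 100 entries. */
function pruneQueue(queue: QueuedEvent[], now: number): QueuedEvent[] {
	// Drop events that have been queued for 24 hours or longer.
	const fresh = queue.filter((event) => now - event.timestamp < MAX_EVENT_AGE_MS)
	// Keep only the newest entries (assumes the array is ordered oldest-first).
	return fresh.slice(-MAX_QUEUE_SIZE)
}
```

Running a pass like this on every enqueue, in addition to the periodic cleanup, would keep the file bounded even during a long outage.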

Testing:

Comprehensive unit tests for TelemetryQueueManager
Tests for QueuedTelemetryClient retry logic
Updated PostHogTelemetryClient tests

Changes

Added for persistent queue management
Added base class
Updated to use queuing system
Added comprehensive tests
Added debug logging throughout

How to Test

Set environment variable
Disconnect network or block PostHog endpoint
Trigger telemetry events in the extension
Check logs for queuing messages
Restore network connection
Verify events are sent from queue

The queue file is stored at:

Important

Introduces a telemetry queuing system with persistent storage and retry logic to handle failed telemetry events due to network issues.

Behavior:
- Introduces TelemetryQueueManager for persistent event storage and retry logic with exponential backoff in TelemetryQueueManager.ts.
- Adds QueuedTelemetryClient as a base class for telemetry clients to handle automatic queuing and retry in QueuedTelemetryClient.ts.
- Updates PostHogTelemetryClient to extend QueuedTelemetryClient, disabling PostHog's internal queue and handling retries in PostHogTelemetryClient.ts.
Features:
- Per-workspace storage for telemetry events.
- Persistent storage across extension restarts.
- Exponential backoff for retries, with no retry limit within 24 hours.
- Automatic cleanup of old events and file size management.
- Debug logging for testing purposes.
Testing:
- Adds unit tests for TelemetryQueueManager in TelemetryQueueManager.test.ts.
- Adds tests for QueuedTelemetryClient retry logic in QueuedTelemetryClient.test.ts.
- Updates PostHogTelemetryClient tests in PostHogTelemetryClient.test.ts.
Misc:
- Exports new classes in index.ts.
- Updates extension.ts to register the new telemetry client with debug mode support.
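To make the queue-on-failure behavior concrete, here is a rough sketch of what a queuing base class might look like. It keeps the queue in memory only and omits backoff and persistence; `QueuedClientSketch`, `sendEvent`, `retryQueued`, and `queuedCount` are illustrative names, not the PR's actual API.

```typescript
// Hypothetical event shape for illustration.
interface TelemetryEvent {
	name: string
	properties?: Record<string, unknown>
}

abstract class QueuedClientSketch {
	protected readonly queue: TelemetryEvent[] = []

	/** Subclasses perform the real network send and throw on failure. */
	protected abstract sendEvent(event: TelemetryEvent): Promise<void>

	get queuedCount(): number {
		return this.queue.length
	}

	/** Try to send immediately; keep the event for a later retry if that fails. */
	async capture(event: TelemetryEvent): Promise<void> {
		try {
			await this.sendEvent(event)
		} catch {
			this.queue.push(event)
		}
	}

	/** Re-send everything currently queued; events that fail again are re-queued. */
	async retryQueued(): Promise<number> {
		const pending = this.queue.splice(0, this.queue.length)
		let sent = 0
		for (const event of pending) {
			try {
				await this.sendEvent(event)
				sent++
			} catch {
				this.queue.push(event)
			}
		}
		return sent
	}
}
```

A concrete client would only need to implement `sendEvent`; a timer (every 30 seconds in this PR) would call something like `retryQueued`.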

This description was created for a897849. You can customize this summary. It will automatically update as commits are pushed.

- Implement TelemetryQueueManager for persistent event storage
- Add QueuedTelemetryClient base class with retry logic
- Update PostHogTelemetryClient to use queuing system
- Store queue per-workspace to avoid conflicts
- Add exponential backoff retry (1s to 60s max)
- Events persist for 24 hours before expiring
- Queue limited to 100 events to manage file size
- Add comprehensive tests for queue functionality
- Disable PostHog's internal queue for better control

This ensures telemetry events are not lost during network outages or server downtime, with events persisted to disk and retried automatically when connectivity is restored.

roomote

Thank you for your contribution! I've reviewed the telemetry queue persistence implementation and found several issues that need attention. The overall architecture is solid, but there are some critical issues around production logging and potential race conditions that should be addressed.

packages/telemetry/src/TelemetryQueueManager.ts

roomote · 2025-08-15T17:06:04Z

packages/telemetry/src/PostHogTelemetryClient.ts

+
+			// Force immediate flush to detect network errors
+			// This will throw if there's a network issue
+			await this.client.flush()


The flush() call could fail for reasons other than network issues. Could we add more specific error handling here to distinguish between different failure types? This would help with debugging and potentially allow for different retry strategies based on the error type.
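One possible shape for that error handling is a small classifier that the catch block around flush() could call. The error codes below are standard Node.js system error codes; the `status` property is an assumption about how an HTTP failure might be surfaced and is not confirmed for the PostHog client. The function and type names are hypothetical.

```typescript
type SendFailureKind = "network" | "server" | "client" | "unknown"

// Node.js system error codes that usually indicate a connectivity problem.
const NETWORK_ERROR_CODES = ["ECONNREFUSED", "ECONNRESET", "ENOTFOUND", "ETIMEDOUT", "EAI_AGAIN", "ENETUNREACH"]

function classifySendError(error: unknown): SendFailureKind {
	if (typeof error !== "object" || error === null) return "unknown"
	// Assumption: failures expose `code` (system errors) or `status` (HTTP responses).
	const { code, status } = error as { code?: unknown; status?: unknown }
	if (typeof code === "string" && NETWORK_ERROR_CODES.indexOf(code) !== -1) return "network"
	if (typeof status === "number") {
		if (status >= 500) return "server"
		if (status >= 400) return "client"
	}
	return "unknown"
}
```

Network and server failures would normally be queued for retry, while a client error such as a malformed payload is unlikely to succeed on a later attempt and could be dropped or logged instead.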

packages/telemetry/src/TelemetryQueueManager.ts

roomote · 2025-08-15T17:06:04Z

packages/telemetry/src/TelemetryQueueManager.ts

+	/**
+	 * Persist queue to disk
+	 */
+	private async persistQueue(): Promise<void> {


The persistQueue() method writes to disk asynchronously without awaiting. Could this cause race conditions if multiple rapid enqueue operations occur? Consider using a write queue or debouncing mechanism to prevent potential data corruption.

packages/telemetry/src/QueuedTelemetryClient.ts

roomote · 2025-08-15T17:06:04Z

packages/telemetry/src/__tests__/QueuedTelemetryClient.test.ts

+	public updateTelemetryState(didUserOptIn: boolean): void {
+		this.telemetryEnabled = didUserOptIn
+	}
+}


Could we define a proper interface for the test client instead of using 'any' type? This would improve type safety and make the tests more maintainable.
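As a starting point, the test client could implement a narrow interface instead of being typed as `any`. This is only a sketch: `TestTelemetryClient`, `FakeTelemetryClient`, and `isTelemetryEnabled` are hypothetical names, and only `updateTelemetryState` comes from the snippet above.

```typescript
// Hypothetical interface covering just what the tests need to call.
interface TestTelemetryClient {
	updateTelemetryState(didUserOptIn: boolean): void
	isTelemetryEnabled(): boolean
}

class FakeTelemetryClient implements TestTelemetryClient {
	private telemetryEnabled = false

	public updateTelemetryState(didUserOptIn: boolean): void {
		this.telemetryEnabled = didUserOptIn
	}

	public isTelemetryEnabled(): boolean {
		return this.telemetryEnabled
	}
}
```

Tests can then declare `const client: TestTelemetryClient = new FakeTelemetryClient()` and get compile-time checks on every call.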

roomote · 2025-08-15T17:06:04Z

packages/telemetry/src/__tests__/TelemetryQueueManager.test.ts

+	})
+
+	describe("getRetryDelay", () => {
+		it("should calculate exponential backoff correctly", () => {


I noticed there's no test for the scenario where the queue file is corrupted (invalid JSON). The loadQueue method handles this case, but it would be good to have explicit test coverage.
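For reference, the behavior such a test would pin down might look like the sketch below, which falls back to an empty queue when the file content cannot be parsed. Whether the real loadQueue resets to empty or does something else is an assumption here; `parseQueueFile` and `StoredEvent` are hypothetical names.

```typescript
// Hypothetical minimal event shape used for validation.
interface StoredEvent {
	id: string
	timestamp: number
}

function isStoredEvent(value: unknown): value is StoredEvent {
	if (typeof value !== "object" || value === null) return false
	const candidate = value as Record<string, unknown>
	return typeof candidate["id"] === "string" && typeof candidate["timestamp"] === "number"
}

/** Parses the raw file content, returning an empty queue for corrupted input. */
function parseQueueFile(raw: string): StoredEvent[] {
	try {
		const parsed: unknown = JSON.parse(raw)
		// Anything other than an array of well-formed events is discarded.
		return Array.isArray(parsed) ? parsed.filter(isStoredEvent) : []
	} catch {
		// Invalid JSON: start over with an empty queue rather than crashing.
		return []
	}
}
```

A test can then feed in truncated JSON, a non-array document, and an array containing one malformed entry, and assert that only valid events survive.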

- Make console.log statements conditional based on debug flag
- Add singleton reset method for TelemetryQueueManager
- Fix race condition in persistQueue with debouncing and promise tracking
- Improve error handling in PostHogTelemetryClient to differentiate error types
- Make retry interval configurable in QueuedTelemetryClient
- Update tests to reflect 24-hour event expiration instead of retry-based removal
- Add test for handling corrupted JSON queue files
- Fix test expectations for retry count and event filtering

packages/telemetry/src/PostHogTelemetryClient.ts

- Move pendingPersist flag clearing inside persistQueue() method
- Implement loop-based draining to handle concurrent persist requests
- Add comprehensive tests for concurrent operations
- Ensures no telemetry events are lost during rapid enqueue operations

The previous implementation had a lost-notification bug where the pendingPersist flag was cleared in the setImmediate callback before calling persistQueue(). This could cause events enqueued during an in-flight persist to remain unpersisted. The fix implements Option A: clearing the flag inside persistQueue() and using a while loop to drain all pending requests, ensuring any enqueue that happens during a persist operation triggers another persist pass immediately after.
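The drain loop described in this commit can be illustrated with a small standalone sketch. This is not the PR's code: it uses plain promises instead of setImmediate, and `PersistScheduler`, `requestPersist`, `getSnapshot`, and `write` are hypothetical names. Only the `pendingPersist` flag and the "clear inside the loop" idea come from the commit message.

```typescript
type WriteFn = (snapshot: string[]) => Promise<void>

class PersistScheduler {
	private pendingPersist = false
	private running = false
	private current: Promise<void> = Promise.resolve()
	private readonly getSnapshot: () => string[]
	private readonly write: WriteFn

	constructor(getSnapshot: () => string[], write: WriteFn) {
		this.getSnapshot = getSnapshot
		this.write = write
	}

	/** Mark the queue dirty and make sure exactly one drain loop is running. */
	requestPersist(): Promise<void> {
		this.pendingPersist = true
		if (!this.running) {
			this.running = true
			this.current = this.persistQueue()
		}
		return this.current
	}

	private async persistQueue(): Promise<void> {
		try {
			// Clear the flag before each write so a request that arrives while the
			// write is in flight sets it again and triggers another pass.
			while (this.pendingPersist) {
				this.pendingPersist = false
				await this.write(this.getSnapshot())
			}
		} finally {
			this.running = false
		}
	}
}
```

Because the flag is re-checked after every write, an enqueue that lands during a write is always followed by one more write with the newer snapshot, and at most one write is in flight at a time.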

- Remove unused _isNetworkError variable in PostHogTelemetryClient
- Add TelemetryQueueManager.resetInstance() call in extension deactivation to prevent memory leaks and stale data when switching workspaces

- Removed all console.info and console.log statements that were used for debugging
- Kept console.error statements for actual error reporting
- Fixed ESLint warnings for unused error variables
- Telemetry debug logs are now completely removed to prevent console spam

daniel-lxs added 2 commits August 15, 2025 11:28

fix: update PostHogTelemetryClient tests to include context parameter

fe70368

daniel-lxs requested review from mrubens, cte and jr as code owners August 15, 2025 17:00

github-project-automation bot added this to Roo Code Roadmap and Roo Code Roadmap Aug 15, 2025

github-project-automation bot moved this to Triage in Roo Code Roadmap Aug 15, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Aug 15, 2025

dosubot bot added the size:XXL (this PR changes 1000+ lines, ignoring generated files) and enhancement (new feature or request) labels Aug 15, 2025

roomote bot reviewed Aug 15, 2025


ellipsis-dev bot reviewed Aug 15, 2025


packages/telemetry/src/PostHogTelemetryClient.ts (outdated, resolved)

daniel-lxs moved this from Triage to PR [Needs Review] in Roo Code Roadmap Aug 15, 2025

daniel-lxs added 3 commits August 15, 2025 13:55

fix: address PR review comments

1f08909


hannesrudolph added the PR - Needs Review label Aug 16, 2025

daniel-lxs closed this Aug 21, 2025

github-project-automation bot moved this from PR [Needs Review] to Done in Roo Code Roadmap Aug 21, 2025

github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 21, 2025
