@roomote roomote bot commented Aug 19, 2025

Automatically exclude large non-code files and common game engine assets from checkpoints via dynamic scanning, and log the number of excluded files with a sample. This prevents disk bloat from large assets (e.g., Unity/Unreal, media, archives) while keeping code files checkpointed. Touched files: src/services/checkpoints/excludes.ts, src/services/checkpoints/ShadowCheckpointService.ts


Important

Automatically exclude large non-code files from checkpoints, with logging and configurable size threshold.

  • Behavior:
    • Automatically excludes large non-code files from checkpoints using getExcludePatternsWithStats in excludes.ts.
    • Logs number of excluded files and a sample in ShadowCheckpointService.ts.
    • Supports environment variable ROO_CHECKPOINTS_LARGE_FILE_THRESHOLD_MB to configure file size threshold.
  • Functions:
    • getExcludePatternsWithStats in excludes.ts returns patterns and stats for excluded files.
    • getLargeFileAutoExcludePatterns in excludes.ts scans for large files, excluding them based on size and type.
  • Tests:
    • Added tests in ShadowCheckpointService.spec.ts to verify exclude file creation and behavior.
    • Added tests in excludes.spec.ts to verify pattern generation and error handling.
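
The size-and-type rule summarized above can be pictured roughly as follows. This is only a sketch, not the code in excludes.ts: the helper names (resolveThresholdBytes, shouldAutoExclude), the extension list, and the fallback rules for invalid values are assumptions made for illustration. Only the environment variable name comes from this PR, and the 10 MB default is the figure cited later in the thread.

```typescript
// Hypothetical sketch; not the implementation from excludes.ts.
const DEFAULT_THRESHOLD_MB = 10 // default value cited later in this thread

// Assumed, illustrative set of extensions that count as "code".
const CODE_EXTENSIONS = new Set([".ts", ".tsx", ".js", ".json", ".py", ".cs", ".cpp", ".h", ".md"])

type Env = Record<string, string | undefined>

/** Resolve the size threshold in bytes, falling back to the default. */
function resolveThresholdBytes(env: Env): number {
    const raw = env["ROO_CHECKPOINTS_LARGE_FILE_THRESHOLD_MB"]
    const parsed = raw === undefined ? NaN : Number(raw)
    // Assumption: non-numeric, zero, or negative values fall back to the default.
    const mb = Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_THRESHOLD_MB
    return Math.round(mb * 1024 * 1024)
}

/** A file is auto-excluded only if it is both large and not recognized as code. */
function shouldAutoExclude(relPath: string, sizeBytes: number, thresholdBytes: number): boolean {
    const base = relPath.slice(relPath.lastIndexOf("/") + 1)
    const dot = base.lastIndexOf(".")
    const ext = dot >= 0 ? base.slice(dot).toLowerCase() : ""
    if (CODE_EXTENSIONS.has(ext)) return false
    // Assumption: strictly greater than the threshold counts as "large".
    return sizeBytes > thresholdBytes
}
```

Passing the environment in as a parameter, rather than reading process.env inside the function, keeps the threshold logic easy to exercise in unit tests.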

This description was created by Ellipsis for a52dad9. You can customize this summary. It will automatically update as commits are pushed.

@roomote roomote bot requested review from mrubens, cte and jr as code owners August 19, 2025 19:01
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Aug 19, 2025
@dosubot dosubot bot added the enhancement New feature or request label Aug 19, 2025
Author

@roomote roomote bot left a comment

Reviewing my own code is like debugging in a mirror - everything looks backwards but the bugs are still mine.

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Aug 19, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Draft / In Progress] in Roo Code Roadmap Aug 19, 2025
@daniel-lxs daniel-lxs marked this pull request as draft August 19, 2025 19:25
Author

roomote bot commented Aug 19, 2025

Thanks @adamhill! Great point about leveraging the existing git-lfs filter. I'll update the implementation to pre-filter files using the git-lfs extensions before scanning, which should significantly improve performance by avoiding unnecessary file system operations on already-ignored files. Working on this now.

@hannesrudolph hannesrudolph added PR - Draft / In Progress and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Aug 19, 2025
- Modified getLargeFileAutoExcludePatterns to accept git-lfs patterns as parameter
- Use ripgrep exclusion flags to pre-filter git-lfs managed files before size checking
- This avoids unnecessary file system operations on already-ignored files
- Added comprehensive unit tests for the optimization

As suggested by @adamhill, this leverages the existing git-lfs filter to improve performance
Author

roomote bot commented Aug 19, 2025

Hi @adamhill! I've implemented your suggestion. Here's what I did:

Changes Made:

  1. Modified getLargeFileAutoExcludePatterns to accept git-lfs patterns as a parameter
  2. Pre-filter files using ripgrep exclusion flags (-g !pattern) to skip git-lfs managed files before checking their sizes
  3. Updated getExcludePatternsWithStats to pass the git-lfs patterns to the large file scanner

Performance Improvement:

  • Files managed by git-lfs are now excluded at the ripgrep level, avoiding unnecessary fs.stat() calls
  • This significantly reduces I/O operations when scanning repositories with many git-lfs tracked files

Testing:

  • Added comprehensive unit tests to verify the optimization works correctly
  • Tests confirm that git-lfs patterns are properly converted to ripgrep exclusion flags
  • All existing tests continue to pass

The optimization leverages the existing git-lfs filter exactly as you suggested, preventing redundant file system operations on already-ignored files. CI checks are currently running.
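
For readers unfamiliar with the ripgrep flag, the conversion described above might look something like the sketch below. The helper name buildRipgrepArgs and the choice of --files as the base flag are assumptions for illustration; the PR's actual argument list is not shown in this thread. The -g flag with a leading ! is ripgrep's standard way to exclude a glob.

```typescript
// Hypothetical sketch; helper name and base flags are assumptions.

/** Build ripgrep arguments that list files while skipping Git LFS patterns. */
function buildRipgrepArgs(lfsPatterns: readonly string[]): string[] {
    // "--files" makes ripgrep print the files it would search instead of searching them.
    const args = ["--files"]
    for (const pattern of lfsPatterns) {
        const trimmed = pattern.trim()
        if (trimmed.length === 0) continue // ignore blank entries
        // "-g !<glob>" tells ripgrep to skip paths matching <glob>.
        args.push("-g", `!${trimmed}`)
    }
    return args
}
```

Because the arguments are passed as separate array elements to the spawned process, the ! prefix needs no shell escaping.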

Collaborator

@hannesrudolph hannesrudolph left a comment

Thank you for your contribution! I've reviewed the changes and have some suggestions for improvement.

…TS_LARGE_FILE_THRESHOLD_MB; improve error reporting for auto-exclude (ripgrep/fs.stat); log diagnostics in ShadowCheckpointService; add tests
@hannesrudolph
Collaborator

hannesrudolph commented Aug 19, 2025

Addressed review comments. Summary of changes:

  • Configurable threshold for large-file auto-exclusion via ROO_CHECKPOINTS_LARGE_FILE_THRESHOLD_MB (default 10 MB). Applied to dynamic scan and stats. Files: src/services/checkpoints/excludes.ts
  • LFS pre-filter preserved and tested. Ripgrep excludes Git LFS patterns using "-g !pattern". Tests updated. File: src/services/checkpoints/tests/excludes.spec.ts
  • Error handling & diagnostics. Distinguish ripgrep failures vs fs.stat failures; return errorCounts; ShadowCheckpointService logs diagnostics when non-zero. Files: src/services/checkpoints/excludes.ts, src/services/checkpoints/ShadowCheckpointService.ts
  • Logging behavior: single summary only when exclusions occur; shows threshold and sample (up to 10).
  • Tests added/updated: threshold override via env, ripgrep failure path, fs.stat error counting. All tests passing.

Ready for re-review.
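
The logging and diagnostics behavior listed above could be sketched along these lines. The field names thresholdBytes, largeFilesExcluded, errorCounts, and fsStatErrors appear in this PR; the ripgrepErrors and sample field names, the formatExcludeLogs helper, and the message wording are assumptions for illustration only.

```typescript
// Hypothetical sketch; message wording and some field names are assumed.
interface ErrorCounts {
    ripgrepErrors: number // assumed name for ripgrep failures
    fsStatErrors: number
}

interface ExcludeStats {
    thresholdBytes: number
    largeFilesExcluded: number
    sample: string[] // assumed name for the sample of excluded paths
    errorCounts: ErrorCounts
}

/** Produce at most one summary line and one diagnostics line. */
function formatExcludeLogs(stats: ExcludeStats): string[] {
    const lines: string[] = []
    // The summary appears only when something was actually excluded.
    if (stats.largeFilesExcluded > 0) {
        const mb = Math.round(stats.thresholdBytes / (1024 * 1024))
        const sample = stats.sample.slice(0, 10).join(", ") // cap the sample at 10 paths
        lines.push(`Excluded ${stats.largeFilesExcluded} large file(s) over ${mb} MB. Sample: ${sample}`)
    }
    // Diagnostics appear only when at least one error counter is non-zero.
    const { ripgrepErrors, fsStatErrors } = stats.errorCounts
    if (ripgrepErrors > 0 || fsStatErrors > 0) {
        lines.push(`Auto-exclude diagnostics: ripgrepErrors=${ripgrepErrors}, fsStatErrors=${fsStatErrors}`)
    }
    return lines
}
```

Returning the lines instead of logging directly leaves the choice of logger to the caller, which would be ShadowCheckpointService in this PR.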

hannesrudolph and others added 2 commits August 19, 2025 17:28
- Modified ShadowCheckpointService to check if exclude file exists before regenerating
- Avoids expensive file scanning (1-3 seconds) on every checkpoint initialization
- Only runs the scan on initial shadow repo creation or when exclude file is missing
- Added refreshExcludePatterns() method for manual refresh if needed
- Maintains backwards compatibility while fixing performance regression
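
The "scan only when needed" behavior described in this commit can be sketched as below. This is an illustration, not the code in ShadowCheckpointService.ts: the ensureExcludeFile helper, the injected exists and scanAndWrite callbacks, and the force option (standing in for a manual refresh or a forced initial write) are all hypothetical.

```typescript
// Hypothetical sketch; names and the dependency-injection shape are assumptions.
interface ExcludeFileDeps {
    exists(path: string): boolean // e.g. a wrapper around a file-existence check
    scanAndWrite(path: string): void // the expensive scan plus file write
}

/** Run the expensive scan only when the exclude file is missing or a refresh is forced. */
function ensureExcludeFile(
    path: string,
    deps: ExcludeFileDeps,
    opts: { force?: boolean } = {},
): "skipped" | "written" {
    if (!opts.force && deps.exists(path)) return "skipped"
    deps.scanAndWrite(path)
    return "written"
}
```

Injecting the file-system operations keeps the decision logic synchronous and testable without touching disk.
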
@hannesrudolph hannesrudolph moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Aug 19, 2025
@hannesrudolph hannesrudolph moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Aug 19, 2025
…etup

- Force write exclude file on initial creation since git creates a default one
- Fix test to delete exclude file before re-initialization to test LFS patterns
- Add verification that exclude patterns are properly applied
@hannesrudolph hannesrudolph moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Aug 20, 2025
@hannesrudolph hannesrudolph marked this pull request as ready for review August 20, 2025 00:09
@Copilot Copilot AI review requested due to automatic review settings August 20, 2025 00:09
@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements automatic exclusion of large non-code files from checkpoints to prevent disk bloat while maintaining code file checkpointing. The feature dynamically scans for large files, respects Git LFS patterns, and provides configurable thresholds with detailed logging.

  • Adds dynamic large file scanning with configurable size thresholds (default 10MB)
  • Implements game engine asset pattern exclusions (Unity, Unreal)
  • Introduces logging for excluded files with error tracking and diagnostics

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File | Description
src/services/checkpoints/excludes.ts | Adds core functionality for large file detection, LFS pre-filtering, and configurable thresholds
src/services/checkpoints/ShadowCheckpointService.ts | Integrates new exclude patterns with stats and adds logging for excluded files
src/services/checkpoints/tests/excludes.spec.ts | Comprehensive tests for new large file exclusion functionality
src/services/checkpoints/tests/ShadowCheckpointService.spec.ts | Updates tests to handle exclude file regeneration behavior

for (const item of items) {
    if ((item as any).type !== "file") continue

    const rel = (item as any).path

Copilot AI Aug 20, 2025

The type assertion (item as any) bypasses TypeScript's type safety. Consider defining a proper interface for the ripgrep result items or using type guards to safely access the type property.

Suggested change
- const rel = (item as any).path
+ if (item.type !== "file") continue
+ const rel = item.path
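
One way to realize the type-guard option mentioned in this comment is sketched below. The type and path properties are taken from the snippet under review; the RipgrepFileItem interface and the isRipgrepFileItem and collectFilePaths helpers are hypothetical names, and the real item shape returned by executeRipgrep may carry more fields.

```typescript
// Hypothetical sketch; interface and helper names are assumptions.
interface RipgrepFileItem {
    type: "file"
    path: string
}

/** Type guard: narrows an unknown value to a file item without using `any`. */
function isRipgrepFileItem(item: unknown): item is RipgrepFileItem {
    if (typeof item !== "object" || item === null) return false
    const candidate = item as Record<string, unknown>
    return candidate["type"] === "file" && typeof candidate["path"] === "string"
}

/** Collect relative paths of file items, skipping anything else. */
function collectFilePaths(items: readonly unknown[]): string[] {
    const paths: string[] = []
    for (const item of items) {
        if (!isRipgrepFileItem(item)) continue
        paths.push(item.path) // item is typed as RipgrepFileItem here
    }
    return paths
}
```

With the guard in place, the loop body reads item.path directly and the compiler checks the access.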


for (const item of items) {
    if ((item as any).type !== "file") continue

    const rel = (item as any).path

Copilot AI Aug 20, 2025

Another unsafe type assertion (item as any). The same interface or type guard solution would improve type safety here as well.

Suggested change
- const rel = (item as any).path
+ if (item.type !== "file") continue
+ const rel = item.path


let fsStatErrors = 0

try {
    const rgResult = await executeRipgrep({ args, workspacePath, limit: 50000 })

Copilot AI Aug 20, 2025

The hardcoded limit of 50000 should be extracted as a named constant or made configurable. This magic number makes the code harder to maintain and understand.

Suggested change
- const rgResult = await executeRipgrep({ args, workspacePath, limit: 50000 })
+ const rgResult = await executeRipgrep({ args, workspacePath, limit: getConfiguredRipgrepFileLimit() })



const mb = Math.round(stats.thresholdBytes / (1024 * 1024))

if (stats?.largeFilesExcluded && stats.largeFilesExcluded > 0) {

Copilot AI Aug 20, 2025

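[nitpick] The condition can be simplified to if (stats.largeFilesExcluded > 0). The optional chaining is redundant because stats is already dereferenced without it on the preceding line (stats.thresholdBytes), and the separate truthiness check is redundant because undefined > 0 evaluates to false at runtime.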

Suggested change
- if (stats?.largeFilesExcluded && stats.largeFilesExcluded > 0) {
+ if (stats.largeFilesExcluded > 0) {


Labels
enhancement New feature or request PR - Needs Preliminary Review size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Status: Done
3 participants