A read-only Windows CLI tool written in Go that recursively scans user directories, identifies duplicate files via SHA-256 hashing, and flags large files — all without modifying or deleting anything.
| Feature | Details |
|---|---|
| Duplicate Detection | SHA-256 hash comparison across all scanned files |
| Large File Flagging | Configurable threshold (default 500 MB) |
| Read-Only | Never modifies, moves, or deletes files |
| CPU Throttling | Limits to 50% CPU (configurable via cpu_percent_limit) |
| Concurrent Hashing | Worker pool sized to half your CPU cores |
| Permission Handling | Gracefully skips files/directories with access errors |
| System Dir Exclusion | Skips AppData, $Recycle.Bin, node_modules, .git, etc. |
| Dual Output | Console summary + full JSON report |
| YAML Configuration | All settings in config.yaml |
| Structured Logging | JSON-formatted logs via Zap |

    scanner/
    ├── cmd/
    │   └── scanner/
    │       └── main.go          # CLI entry point
    ├── internal/
    │   ├── analyzer/
    │   │   ├── analyzer.go      # Duplicate & large-file analysis
    │   │   └── types.go         # Analysis data types
    │   ├── config/
    │   │   └── config.go        # YAML config loading & defaults
    │   ├── logger/
    │   │   └── logger.go        # Structured logging setup
    │   ├── reporter/
    │   │   └── reporter.go      # Console & JSON output
    │   └── scanner/
    │       ├── scanner.go       # Recursive file walker + hasher
    │       └── types.go         # Scan data types
    ├── config.yaml              # Default configuration
    ├── go.mod
    └── README.md

- Go 1.21+ installed and on your PATH
- Windows 10/11 (targets `C:\Users` by default)
- Run from a terminal with access to the directories you want to scan

    cd C:\Users\YourName\Projects
    git clone <repo-url> scanner
    cd scanner
    go mod tidy

Open `config.yaml` and adjust any settings. Key options:

| Setting | Default | Description |
|---|---|---|
| `scanner.root_path` | `C:\Users` | Directory tree to scan |
| `scanner.max_workers` | Half your CPUs | Concurrent hashing goroutines |
| `scanner.cpu_percent_limit` | `0.5` | Max CPU fraction (0.0–1.0) |
| `scanner.ignored_dirs` | See file | Directories to skip |
| `analyzer.large_file_size_bytes` | `524288000` | Large file threshold (500 MB) |
| `reporter.json_output_path` | `scan_results.json` | JSON report location |
| `logger.level` | `info` | Log verbosity |
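
As a rough illustration of how the defaults in this table fit together, the sketch below expresses them as a Go struct with a small validation step. The `Config` type, its field names, and the `defaultConfig` and `validate` helpers are hypothetical and are not taken from `internal/config`. YAML loading is left out so the example stays within the standard library, and treating a limit of exactly 0.0 as invalid is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// Config mirrors the settings table. The type and field names are
// illustrative assumptions, not the tool's actual internal/config types.
type Config struct {
	RootPath           string
	MaxWorkers         int
	CPUPercentLimit    float64
	LargeFileSizeBytes int64
	JSONOutputPath     string
	LogLevel           string
}

// defaultConfig returns the defaults documented in the table.
func defaultConfig() Config {
	workers := runtime.NumCPU() / 2
	if workers < 1 {
		workers = 1 // a single-core machine still needs one worker
	}
	return Config{
		RootPath:           `C:\Users`,
		MaxWorkers:         workers,
		CPUPercentLimit:    0.5,
		LargeFileSizeBytes: 500 * 1024 * 1024, // 524288000 bytes
		JSONOutputPath:     "scan_results.json",
		LogLevel:           "info",
	}
}

// validate rejects values outside the documented ranges.
// Rejecting 0.0 is an assumption made for this sketch.
func (c Config) validate() error {
	if c.CPUPercentLimit <= 0 || c.CPUPercentLimit > 1 {
		return fmt.Errorf("cpu_percent_limit must be in (0.0, 1.0], got %v", c.CPUPercentLimit)
	}
	if c.MaxWorkers < 1 {
		return fmt.Errorf("max_workers must be at least 1, got %d", c.MaxWorkers)
	}
	return nil
}

func main() {
	cfg := defaultConfig()
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```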
    go build -o scanner.exe ./cmd/scanner

Default (uses `config.yaml`):

    .\scanner.exe

With CLI overrides:

    # Scan a specific user profile
    .\scanner.exe -root "C:\Users\JohnDoe"

    # Custom config file and output
    .\scanner.exe -config myconfig.yaml -output results.json

    # Show version
    .\scanner.exe -version

- Console: A formatted summary prints automatically
- JSON report: Written to `scan_results.json` (or your override path)
- Logs: Structured JSON logs written to `scanner.log`

| Flag | Type | Default | Description |
|---|---|---|---|
| `-config` | string | `config.yaml` | Path to YAML config file |
| `-root` | string | (from config) | Override scan root directory |
| `-output` | string | (from config) | Override JSON output file path |
| `-version` | bool | `false` | Print version and exit |
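
The four flags above map naturally onto Go's standard `flag` package. The following is a minimal sketch of that mapping, not the code in `cmd/scanner/main.go`. The `options` struct and `parseFlags` function are hypothetical names, and the use of an empty string to mean "keep the value from the config file" is an assumption.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the four documented flags. The names are illustrative.
type options struct {
	configPath string
	root       string
	output     string
	version    bool
}

// parseFlags parses args (without the program name) into options.
func parseFlags(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("scanner", flag.ContinueOnError)
	fs.StringVar(&o.configPath, "config", "config.yaml", "Path to YAML config file")
	fs.StringVar(&o.root, "root", "", "Override scan root directory")
	fs.StringVar(&o.output, "output", "", "Override JSON output file path")
	fs.BoolVar(&o.version, "version", false, "Print version and exit")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if o.version {
		fmt.Println("scanner (version string omitted in this sketch)")
		return
	}
	// Assumption: an empty -root or -output keeps the config file's value.
	fmt.Printf("config=%q root=%q output=%q\n", o.configPath, o.root, o.output)
}
```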

    {
      "scan_result": {
        "root_path": "C:\\Users",
        "total_files": 12345,
        "total_dirs": 678,
        "skipped_dirs": 42,
        "permission_errors": 3,
        "files": [ ... ],
        "duration": "2m34s",
        "scanned_at": "2026-02-15T10:30:00Z"
      },
      "analysis": {
        "total_files_analyzed": 12345,
        "duplicate_groups": [
          {
            "sha256": "abc123...",
            "size_bytes": 1048576,
            "count": 3,
            "files": [
              { "path": "C:\\Users\\...", "mod_time": "..." }
            ]
          }
        ],
        "wasted_bytes": 2097152,
        "large_files": [
          {
            "path": "C:\\Users\\...\\large.iso",
            "size_bytes": 734003200,
            "size_mb": 700.00
          }
        ]
      },
      "generated_at": "2026-02-15T10:32:34Z"
    }

The application follows a three-phase pipeline:

- Scanner (`internal/scanner`): Walks the directory tree, collects file metadata, and computes SHA-256 hashes using a bounded worker pool with CPU throttling via `GOMAXPROCS`.
- Analyzer (`internal/analyzer`): Pure in-memory analysis that groups files by hash to find duplicates and filters by size threshold for large files.
- Reporter (`internal/reporter`): Formats and outputs results to both console (human-readable table) and JSON file (machine-readable).

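
The analyzer phase described above can be sketched as a map keyed by hash plus a size filter. This is an illustration under stated assumptions rather than the contents of `internal/analyzer`: the `FileInfo` and `DuplicateGroup` types and both function names are hypothetical, and whether the real threshold comparison is `>=` or `>` is not specified in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// FileInfo is a hypothetical stand-in for the scanner's per-file record.
type FileInfo struct {
	Path   string
	Size   int64
	SHA256 string
}

// DuplicateGroup collects files that share one hash.
type DuplicateGroup struct {
	SHA256 string
	Size   int64
	Files  []FileInfo
}

// findDuplicates groups files by hash and keeps groups with two or more
// members. It also returns the bytes occupied by the redundant copies,
// counting every member of a group except one.
func findDuplicates(files []FileInfo) ([]DuplicateGroup, int64) {
	byHash := make(map[string][]FileInfo)
	for _, f := range files {
		byHash[f.SHA256] = append(byHash[f.SHA256], f)
	}
	var groups []DuplicateGroup
	var wasted int64
	for hash, members := range byHash {
		if len(members) < 2 {
			continue
		}
		size := members[0].Size
		groups = append(groups, DuplicateGroup{SHA256: hash, Size: size, Files: members})
		wasted += size * int64(len(members)-1)
	}
	// Map iteration order is random, so sort for stable output.
	sort.Slice(groups, func(i, j int) bool { return groups[i].SHA256 < groups[j].SHA256 })
	return groups, wasted
}

// findLargeFiles returns files at or above the threshold (>= is an assumption).
func findLargeFiles(files []FileInfo, threshold int64) []FileInfo {
	var large []FileInfo
	for _, f := range files {
		if f.Size >= threshold {
			large = append(large, f)
		}
	}
	return large
}

func main() {
	files := []FileInfo{
		{Path: `a\one.bin`, Size: 1048576, SHA256: "aaa"},
		{Path: `b\one-copy.bin`, Size: 1048576, SHA256: "aaa"},
		{Path: `c\one-copy2.bin`, Size: 1048576, SHA256: "aaa"},
		{Path: `d\large.iso`, Size: 734003200, SHA256: "bbb"},
	}
	groups, wasted := findDuplicates(files)
	fmt.Println(len(groups), wasted)                   // prints: 1 2097152
	fmt.Println(len(findLargeFiles(files, 524288000))) // prints: 1
}
```

With three 1 MiB copies of the same content, two copies are redundant, which matches the `wasted_bytes` value of 2097152 in the sample report.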
All three phases are read-only with respect to the scanned files: the tool opens them with `os.Open()` (read-only) and never calls any write, rename, or delete operation on them. The only files it writes are its own JSON report and log.
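
To make the read-only claim concrete, here is a minimal sketch of hashing a file through a handle obtained from `os.Open`, which opens with `O_RDONLY`. The `hashFile` and `demo` names are hypothetical and this is not the code in `internal/scanner`. The demo writes only to a throwaway temp directory so that there is something to hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// hashFile streams a file through SHA-256 without loading it into memory.
// os.Open returns a read-only handle, so this function cannot modify the file.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// demo creates a small sample file in a temp directory (not in any scanned
// tree), hashes it, and removes the directory again.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "hashdemo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "sample.txt")
	if err := os.WriteFile(path, []byte("hello"), 0o600); err != nil {
		return "", err
	}
	return hashFile(path)
}

func main() {
	sum, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sum)
}
```

Streaming with `io.Copy` keeps memory use flat regardless of file size, which matters when the scan reaches multi-gigabyte files.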
- Files are opened with `os.Open()`, which is read-only
- No `os.Remove`, `os.Rename`, or `os.WriteFile` is called on scanned files
- Permission errors are logged and skipped; they do not crash the program
- CPU usage is capped by both `GOMAXPROCS` and worker pool size
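
The last two guarantees can be illustrated together: a directory walk that counts and skips unreadable entries, followed by a bounded worker pool. This is a sketch under assumptions, not the tool's implementation. `workerCount`, `collectFiles`, and `processAll` are hypothetical names, and the ignored-directory set is copied from the feature list. Note that `runtime.GOMAXPROCS` limits how many OS threads run Go code at once, which approximates a CPU cap rather than enforcing an exact percentage.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"runtime"
	"sync"
)

// workerCount converts a CPU fraction into a worker count, with a minimum of 1.
func workerCount(cpus int, limit float64) int {
	n := int(float64(cpus) * limit)
	if n < 1 {
		n = 1
	}
	return n
}

// collectFiles walks root, skipping ignored directory names and any entry
// that cannot be read. It returns the file paths and the number of errors skipped.
func collectFiles(root string, ignored map[string]bool) ([]string, int) {
	var paths []string
	skippedErrs := 0
	_ = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			skippedErrs++ // e.g. access denied: count it and keep walking
			return nil
		}
		if d.IsDir() {
			if path != root && ignored[d.Name()] {
				return fs.SkipDir
			}
			return nil
		}
		paths = append(paths, path)
		return nil
	})
	return paths, skippedErrs
}

// processAll runs fn over paths using at most `workers` goroutines.
func processAll(paths []string, workers int, fn func(string)) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				fn(p)
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
}

func main() {
	workers := workerCount(runtime.NumCPU(), 0.5)
	runtime.GOMAXPROCS(workers) // approximate CPU cap, see the note above

	ignored := map[string]bool{"AppData": true, "$Recycle.Bin": true, "node_modules": true, ".git": true}
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	paths, skipped := collectFiles(root, ignored)

	var mu sync.Mutex
	processed := 0
	processAll(paths, workers, func(string) {
		mu.Lock()
		processed++ // a real worker would hash the file here
		mu.Unlock()
	})
	fmt.Printf("files=%d processed=%d skipped_errors=%d workers=%d\n", len(paths), processed, skipped, workers)
}
```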
| Issue | Solution |
|---|---|
| "cannot access root path" | Verify the path exists and you have read permission |
| Many permission errors | Run as Administrator, or scan a narrower subtree |
| Slow scan | Increase max_workers or cpu_percent_limit if the CPU is the bottleneck; on a slow disk, try reducing max_workers instead |
| Missing duplicates | Lower min_duplicate_size in config |
| Large JSON file | The full file list is included; filter with jq or similar |
License: MIT