sidneyshafer/scanner

Windows File Scanner CLI

A read-only Windows CLI tool written in Go that recursively scans user directories, identifies duplicate files by comparing SHA-256 hashes, and flags large files. It never modifies, moves, or deletes the files it scans.

Features

| Feature | Details |
| --- | --- |
| Duplicate Detection | SHA-256 hash comparison across all scanned files |
| Large File Flagging | Configurable threshold (default 500 MB) |
| Read-Only | Never modifies, moves, or deletes files |
| CPU Throttling | Limits to 50% CPU (configurable via cpu_percent_limit) |
| Concurrent Hashing | Worker pool sized to half your CPU cores |
| Permission Handling | Gracefully skips files/directories with access errors |
| System Dir Exclusion | Skips AppData, $Recycle.Bin, node_modules, .git, etc. |
| Dual Output | Console summary + full JSON report |
| YAML Configuration | All settings in config.yaml |
| Structured Logging | JSON-formatted logs via Zap |

Project Structure

scanner/
├── cmd/
│   └── scanner/
│       └── main.go              # CLI entry point
├── internal/
│   ├── analyzer/
│   │   ├── analyzer.go          # Duplicate & large-file analysis
│   │   └── types.go             # Analysis data types
│   ├── config/
│   │   └── config.go            # YAML config loading & defaults
│   ├── logger/
│   │   └── logger.go            # Structured logging setup
│   ├── reporter/
│   │   └── reporter.go          # Console & JSON output
│   └── scanner/
│       ├── scanner.go           # Recursive file walker + hasher
│       └── types.go             # Scan data types
├── config.yaml                  # Default configuration
├── go.mod
└── README.md

Prerequisites

  • Go 1.21+ installed and on your PATH
  • Windows 10/11 (targets C:\Users by default)
  • Run from a terminal with access to the directories you want to scan

Step-by-Step Guide

1. Clone or Download

cd C:\Users\YourName\Projects
git clone <repo-url> scanner
cd scanner

2. Install Dependencies

go mod tidy

3. Review Configuration

Open config.yaml and adjust any settings. Key options:

| Setting | Default | Description |
| --- | --- | --- |
| scanner.root_path | C:\Users | Directory tree to scan |
| scanner.max_workers | Half your CPUs | Concurrent hashing goroutines |
| scanner.cpu_percent_limit | 0.5 | Max CPU fraction (0.0–1.0) |
| scanner.ignored_dirs | See file | Directories to skip |
| analyzer.large_file_size_bytes | 524288000 | Large file threshold (500 MB) |
| reporter.json_output_path | scan_results.json | JSON report location |
| logger.level | info | Log verbosity |
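
As a rough illustration of how the settings above could map onto Go types, here is a minimal sketch. The YAML keys and default values come from the table; the Go type names, field names, and the `DefaultConfig` and `halfCPUs` helpers are hypothetical and may not match internal/config/config.go. The ignored-directory list is only the partial list named under Features, and the YAML decoding step is left out so the block needs nothing beyond the standard library.

```go
package main

import (
	"fmt"
	"runtime"
)

// Config mirrors the settings table. Type and field names are hypothetical;
// only the YAML keys and the default values are taken from the README.
type Config struct {
	Scanner  ScannerConfig  `yaml:"scanner"`
	Analyzer AnalyzerConfig `yaml:"analyzer"`
	Reporter ReporterConfig `yaml:"reporter"`
	Logger   LoggerConfig   `yaml:"logger"`
}

type ScannerConfig struct {
	RootPath        string   `yaml:"root_path"`
	MaxWorkers      int      `yaml:"max_workers"`
	CPUPercentLimit float64  `yaml:"cpu_percent_limit"`
	IgnoredDirs     []string `yaml:"ignored_dirs"`
}

type AnalyzerConfig struct {
	LargeFileSizeBytes int64 `yaml:"large_file_size_bytes"`
}

type ReporterConfig struct {
	JSONOutputPath string `yaml:"json_output_path"`
}

type LoggerConfig struct {
	Level string `yaml:"level"`
}

// halfCPUs returns half of numCPU, but never less than 1.
func halfCPUs(numCPU int) int {
	if n := numCPU / 2; n > 1 {
		return n
	}
	return 1
}

// DefaultConfig returns the documented defaults.
func DefaultConfig() Config {
	return Config{
		Scanner: ScannerConfig{
			RootPath:        `C:\Users`,
			MaxWorkers:      halfCPUs(runtime.NumCPU()),
			CPUPercentLimit: 0.5,
			// Partial list: the README names these and adds "etc.".
			IgnoredDirs: []string{"AppData", "$Recycle.Bin", "node_modules", ".git"},
		},
		Analyzer: AnalyzerConfig{LargeFileSizeBytes: 500 * 1024 * 1024}, // 524288000
		Reporter: ReporterConfig{JSONOutputPath: "scan_results.json"},
		Logger:   LoggerConfig{Level: "info"},
	}
}

func main() {
	fmt.Printf("%+v\n", DefaultConfig())
}
```

Starting from a fully populated default struct and then decoding the YAML file over it is a common way to let config.yaml omit any setting it does not need to change.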

4. Build

go build -o scanner.exe ./cmd/scanner

5. Run

Default (uses config.yaml):

.\scanner.exe

With CLI overrides:

# Scan a specific user profile
.\scanner.exe -root "C:\Users\JohnDoe"

# Custom config file and output
.\scanner.exe -config myconfig.yaml -output results.json

# Show version
.\scanner.exe -version

6. Read the Results

  • Console: A formatted summary prints automatically
  • JSON report: Written to scan_results.json (or your override path)
  • Logs: Structured JSON logs written to scanner.log

CLI Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| -config | string | config.yaml | Path to YAML config file |
| -root | string | (from config) | Override scan root directory |
| -output | string | (from config) | Override JSON output file path |
| -version | bool | false | Print version and exit |
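
The flags above could be declared with Go's standard `flag` package roughly as follows. This is a sketch, not the contents of cmd/scanner/main.go: the `Options` type and `parseFlags` function are hypothetical names, and an empty string is assumed to mean "fall back to the value from config.yaml".

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Options holds the parsed flag values (hypothetical type).
type Options struct {
	ConfigPath  string
	Root        string // empty means: use scanner.root_path from the config
	Output      string // empty means: use reporter.json_output_path from the config
	ShowVersion bool
}

// parseFlags uses its own FlagSet so it can be called with any argument list.
func parseFlags(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("scanner", flag.ContinueOnError)
	fs.StringVar(&o.ConfigPath, "config", "config.yaml", "Path to YAML config file")
	fs.StringVar(&o.Root, "root", "", "Override scan root directory")
	fs.StringVar(&o.Output, "output", "", "Override JSON output file path")
	fs.BoolVar(&o.ShowVersion, "version", false, "Print version and exit")
	err := fs.Parse(args)
	return o, err
}

func main() {
	opts, err := parseFlags(os.Args[1:])
	if err != nil {
		// The FlagSet has already printed the error and usage text.
		os.Exit(2)
	}
	fmt.Printf("%+v\n", opts)
}
```

Go's `flag` package accepts both `-root` and `--root`, so either spelling works on the command line.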

JSON Report Format

{
  "scan_result": {
    "root_path": "C:\\Users",
    "total_files": 12345,
    "total_dirs": 678,
    "skipped_dirs": 42,
    "permission_errors": 3,
    "files": [ ... ],
    "duration": "2m34s",
    "scanned_at": "2026-02-15T10:30:00Z"
  },
  "analysis": {
    "total_files_analyzed": 12345,
    "duplicate_groups": [
      {
        "sha256": "abc123...",
        "size_bytes": 1048576,
        "count": 3,
        "files": [
          { "path": "C:\\Users\\...", "mod_time": "..." }
        ]
      }
    ],
    "wasted_bytes": 2097152,
    "large_files": [
      {
        "path": "C:\\Users\\...\\large.iso",
        "size_bytes": 734003200,
        "size_mb": 700.00
      }
    ]
  },
  "generated_at": "2026-02-15T10:32:34Z"
}
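
For consuming the report from another Go program, a sketch like the one below may help. The struct field names are hypothetical; the JSON keys are taken from the example above. The elided `scan_result.files` entries are not modelled because their shape is not shown, and timestamps are kept as plain strings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Report models the keys shown in the example report; unknown keys are ignored.
type Report struct {
	ScanResult struct {
		RootPath         string `json:"root_path"`
		TotalFiles       int    `json:"total_files"`
		TotalDirs        int    `json:"total_dirs"`
		SkippedDirs      int    `json:"skipped_dirs"`
		PermissionErrors int    `json:"permission_errors"`
		Duration         string `json:"duration"`
		ScannedAt        string `json:"scanned_at"`
	} `json:"scan_result"`
	Analysis struct {
		TotalFilesAnalyzed int              `json:"total_files_analyzed"`
		DuplicateGroups    []DuplicateGroup `json:"duplicate_groups"`
		WastedBytes        int64            `json:"wasted_bytes"`
		LargeFiles         []LargeFile      `json:"large_files"`
	} `json:"analysis"`
	GeneratedAt string `json:"generated_at"`
}

type DuplicateGroup struct {
	SHA256    string `json:"sha256"`
	SizeBytes int64  `json:"size_bytes"`
	Count     int    `json:"count"`
	Files     []struct {
		Path    string `json:"path"`
		ModTime string `json:"mod_time"`
	} `json:"files"`
}

type LargeFile struct {
	Path      string  `json:"path"`
	SizeBytes int64   `json:"size_bytes"`
	SizeMB    float64 `json:"size_mb"`
}

// sample is a trimmed copy of the example report, with elisions left as-is.
const sample = `{
  "scan_result": {"root_path": "C:\\Users", "total_files": 12345, "duration": "2m34s"},
  "analysis": {
    "total_files_analyzed": 12345,
    "duplicate_groups": [
      {"sha256": "abc123...", "size_bytes": 1048576, "count": 3,
       "files": [{"path": "C:\\Users\\...", "mod_time": "..."}]}
    ],
    "wasted_bytes": 2097152,
    "large_files": [
      {"path": "C:\\Users\\...\\large.iso", "size_bytes": 734003200, "size_mb": 700.00}
    ]
  },
  "generated_at": "2026-02-15T10:32:34Z"
}`

// parseReport decodes a report document.
func parseReport(data []byte) (Report, error) {
	var r Report
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	r, err := parseReport([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%d duplicate group(s), %d wasted bytes, %d large file(s)\n",
		len(r.Analysis.DuplicateGroups), r.Analysis.WastedBytes, len(r.Analysis.LargeFiles))
}
```

In a real run you would read scan_results.json with `os.ReadFile` and pass the bytes to `parseReport` instead of using the embedded sample.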

Architecture

The application follows a three-phase pipeline:

  1. Scanner (internal/scanner): Walks the directory tree, collects file metadata, and computes SHA-256 hashes using a bounded worker pool with CPU throttling via GOMAXPROCS.

  2. Analyzer (internal/analyzer): Pure in-memory analysis — groups files by hash to find duplicates and filters by size threshold for large files.

  3. Reporter (internal/reporter): Formats and outputs results to both console (human-readable table) and JSON file (machine-readable).
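
The Analyzer phase described above can be sketched in a few lines. The types and function names here are hypothetical, not those in internal/analyzer. Two details are assumptions: wasted bytes are computed as size × (copies − 1), which matches the example report (three 1,048,576-byte copies give 2,097,152), and a file counts as large when its size is at or above the threshold.

```go
package main

import (
	"fmt"
	"sort"
)

// FileInfo is the per-file data the scanner phase would hand over (hypothetical).
type FileInfo struct {
	Path   string
	Size   int64
	SHA256 string
}

// DuplicateGroup lists every path that shares one hash.
type DuplicateGroup struct {
	SHA256    string
	SizeBytes int64
	Paths     []string
}

// findDuplicates groups files by hash and returns the groups with two or more
// members, plus the bytes that would be freed by keeping one copy per group.
func findDuplicates(files []FileInfo) ([]DuplicateGroup, int64) {
	byHash := make(map[string][]FileInfo)
	for _, f := range files {
		if f.SHA256 == "" {
			continue // file could not be hashed
		}
		byHash[f.SHA256] = append(byHash[f.SHA256], f)
	}
	var groups []DuplicateGroup
	var wasted int64
	for hash, members := range byHash {
		if len(members) < 2 {
			continue
		}
		g := DuplicateGroup{SHA256: hash, SizeBytes: members[0].Size}
		for _, m := range members {
			g.Paths = append(g.Paths, m.Path)
		}
		groups = append(groups, g)
		wasted += g.SizeBytes * int64(len(members)-1)
	}
	// Map iteration order is random, so sort for stable output.
	sort.Slice(groups, func(i, j int) bool { return groups[i].SHA256 < groups[j].SHA256 })
	return groups, wasted
}

// findLargeFiles returns files whose size is at or above threshold (assumed inclusive).
func findLargeFiles(files []FileInfo, threshold int64) []FileInfo {
	var out []FileInfo
	for _, f := range files {
		if f.Size >= threshold {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	files := []FileInfo{
		{Path: "a.bin", Size: 1048576, SHA256: "h1"},
		{Path: "b.bin", Size: 1048576, SHA256: "h1"},
		{Path: "c.iso", Size: 734003200, SHA256: "h2"},
	}
	groups, wasted := findDuplicates(files)
	large := findLargeFiles(files, 524288000)
	fmt.Println(len(groups), wasted, len(large))
}
```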

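All three phases are read-only with respect to the scanned files: the tool opens them with os.Open(), which opens a file for reading only, and never calls any write, rename, or delete operation on them. The only files it writes are its own outputs, the JSON report and the log file.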

Safety Guarantees

  • Files are opened with os.Open(), which opens them for reading only
  • No os.Remove, os.Rename, or os.WriteFile is called on scanned files
  • Permission errors are logged and skipped; they do not crash the program
  • CPU usage is bounded by GOMAXPROCS (which limits how many OS threads run Go code at once) and by the worker pool size

Troubleshooting

| Issue | Solution |
| --- | --- |
| "cannot access root path" | Verify the path exists and you have read permission |
| Many permission errors | Run as Administrator, or scan a narrower subtree |
| Slow scan | Increase cpu_percent_limit or max_workers; on a spinning disk, fewer workers may help by reducing seek contention |
| Missing duplicates | Lower min_duplicate_size in config |
| Large JSON file | The full file list is included; filter with jq or similar |

License

MIT
