fix(source): stream wordlist file instead of loading into memory by oritwoen · Pull Request #76 · oritwoen/vuke

oritwoen · 2026-03-13T13:11:54Z

WordlistSource read the entire file into Vec<String> before processing anything. A 2GB wordlist means 2GB+ of resident memory just for the input buffer, before any key derivation starts.

Switched to streaming with read_line() in chunks of 10k lines. Each chunk gets processed through Rayon par_chunks(1000) the same way as before, then dropped before the next chunk loads. Memory usage is now bounded by chunk size, not file size.

Progress bar tracks bytes read from the file (via read_line() return value) instead of line count, so it works without knowing total lines upfront and handles CRLF correctly.
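The streaming loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `stream_chunks` and the `handle` callback are hypothetical names, and trimming of line endings is assumed. It shows `read_line()` supplying the byte count (which includes `\n` or `\r\n`) and the chunk buffer being cleared before more lines are read.

```rust
use std::io::{self, BufRead};

/// Streams `reader` line by line and hands `handle` at most `chunk_size`
/// non-empty lines at a time.
/// Returns (non-empty lines handled, total bytes read).
/// `stream_chunks` and `handle` are hypothetical names for illustration.
fn stream_chunks<R: BufRead>(
    mut reader: R,
    chunk_size: usize,
    mut handle: impl FnMut(&[String]),
) -> io::Result<(u64, u64)> {
    let mut chunk: Vec<String> = Vec::with_capacity(chunk_size);
    let mut line = String::new();
    let (mut lines, mut bytes) = (0u64, 0u64);

    loop {
        line.clear();
        // read_line returns the raw byte count, including "\n" or "\r\n".
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // end of input
        }
        bytes += n as u64;

        let word = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
        if word.is_empty() {
            continue; // blank lines are skipped but their bytes still count
        }
        chunk.push(word.to_owned());

        if chunk.len() >= chunk_size {
            handle(&chunk);
            lines += chunk.len() as u64;
            chunk.clear(); // release the lines before reading more
        }
    }

    // Flush the final partial chunk, if any.
    if !chunk.is_empty() {
        handle(&chunk);
        lines += chunk.len() as u64;
    }
    Ok((lines, bytes))
}
```

Calling it with a `BufReader` over the wordlist file and a `chunk_size` of 10_000 would match the numbers in the description; at most one chunk of lines is held in memory at any time.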

Closes #71


qodo-code-review · 2026-03-13T13:12:08Z

Review Summary by Qodo

Stream wordlist file to reduce memory consumption

🐞 Bug fix ✨ Enhancement

Walkthroughs

Description

• Stream wordlist files in 10k-line chunks instead of loading entire file into memory
• Progress bar now tracks bytes read from file instead of line count
• Extract chunk processing logic into separate process_chunk() function for clarity
• Add comprehensive unit tests for file validation and edge cases

Diagram

flowchart LR
  A["Load entire file<br/>into Vec"] -->|Before| B["High memory usage<br/>for large files"]
  C["Stream file in<br/>10k-line chunks"] -->|After| D["Bounded memory<br/>by chunk size"]
  D --> E["Process chunks<br/>with Rayon"]
  E --> F["Track progress<br/>by bytes read"]

File Changes

1. src/source/wordlist.rs Bug fix, enhancement, tests +172/-49

Implement streaming wordlist processing with memory bounds

• Changed WordlistSource to store file path instead of pre-loaded Vec<String>
• Implemented streaming file reading with read_line() in 10k-line chunks
• Progress bar now tracks bytes consumed instead of line count for accurate progress
• Extracted chunk processing into separate process_chunk() function using Rayon parallelism
• Added file validation in from_file() to check existence and file type
• Added unit tests covering file not found, directory rejection, empty files, blank-line skipping, and invalid UTF-8 handling

src/source/wordlist.rs

qodo-code-review · 2026-03-13T13:12:09Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (2) 📎 Requirement gaps (0)

1. ~~Progress misses final chunk~~ ☑ 🐞 Bug ✓ Correctness

Description

WordlistSource::process only updates the progress bar when a full 10k-line chunk is processed, and
it never updates the bar after processing the final partial chunk. This makes progress appear
stalled for wordlists with <10k non-empty lines and leaves the bar short of 100% for any wordlist
whose last chunk is smaller than CHUNK_SIZE.

Code

src/source/wordlist.rs[R85-100]

+            if chunk.len() >= CHUNK_SIZE {
+                process_chunk(
+                    &chunk, transforms, deriver, matcher, output, &stats, &matches,
+                );
+                pb.set_position(bytes_consumed);
+                chunk.clear();
+            }
+        }
+
+        if !chunk.is_empty() {
+            process_chunk(
+                &chunk, transforms, deriver, matcher, output, &stats, &matches,
+            );
+        }

       pb.finish_and_clear();

Evidence

The progress bar position is only set inside the if chunk.len() >= CHUNK_SIZE block; the tail
process_chunk call does not update progress at all before finishing. Other sources update the
progress bar for every processed batch, so this behavior is inconsistent and causes the UI to show
no/incorrect progress for common inputs.

src/source/wordlist.rs[48-100]
src/source/range.rs[50-90]
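One way to close this gap, sketched under assumptions: `Progress` below is a hypothetical stand-in for the real progress bar (the PR calls `pb.set_position(bytes_consumed)`), and `process_with_progress` is an illustrative name. The point is only that the position is also set after the tail chunk, before the bar is finished.

```rust
/// Hypothetical stand-in for a progress bar: records the last position set.
#[derive(Default)]
struct Progress {
    position: u64,
}

impl Progress {
    fn set_position(&mut self, pos: u64) {
        self.position = pos;
    }
}

/// Processes `lines` in chunks of `chunk_size` and updates progress after
/// every flush, including the final partial one. Returns the flush count.
fn process_with_progress(lines: &[&str], chunk_size: usize, pb: &mut Progress) -> usize {
    let mut chunk: Vec<&str> = Vec::new();
    let mut bytes_consumed = 0u64;
    let mut flushes = 0;

    for &line in lines {
        bytes_consumed += line.len() as u64 + 1; // +1 for the newline
        chunk.push(line);
        if chunk.len() >= chunk_size {
            flushes += 1; // stand-in for process_chunk(...)
            pb.set_position(bytes_consumed);
            chunk.clear();
        }
    }

    if !chunk.is_empty() {
        flushes += 1; // stand-in for the tail process_chunk(...)
    }
    // Set the position unconditionally so the bar reflects the tail chunk
    // even when no full chunk was ever reached.
    pb.set_position(bytes_consumed);
    flushes
}
```

With three short lines and a chunk size of 2, the bar ends at the total byte count rather than stopping after the first full chunk.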

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`WordlistSource::process` only calls `pb.set_position(bytes_consumed)` after processing a full `CHUNK_SIZE` chunk, and does not update the progress bar after processing the final partial chunk. For wordlists with fewer than `CHUNK_SIZE` non-empty lines, the progress bar never advances; for larger wordlists it may never reflect the final bytes consumed.
## Issue Context
The progress bar is initialized with `file_size` and intended to track `bytes_consumed`. Right now the bar is only advanced in the `chunk.len() >= CHUNK_SIZE` branch.
## Fix Focus Areas
- src/source/wordlist.rs[62-101]


2. Missing import block separator 📘 Rule violation ✓ Correctness

Description

The super:: and crate:: import groups are adjacent without the required single blank line
separator. This violates the mandated import ordering/layout for Rust files.

Code

src/source/wordlist.rs[R12-16]

use super::{ProcessStats, Source};
use crate::derive::KeyDeriver;
use crate::matcher::Matcher;
use crate::output::Output;
use crate::transform::{Input, Transform};

Evidence

The checklist requires a single blank line between super:: imports and crate:: imports when both
groups exist; the diff shows use super::{...}; immediately followed by use crate::... with no
blank line.

Rule 88803: Order Rust import statements by origin and scope
src/source/wordlist.rs[12-16]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The file violates the required Rust import grouping because the `super::` and `crate::` blocks are not separated by exactly one blank line.
## Issue Context
Compliance requires grouped imports ordered by origin with a single blank line between groups.
## Fix Focus Areas
- src/source/wordlist.rs[12-16]


3. Tests rely on .unwrap() 📘 Rule violation ⛯ Reliability

Description

New inline tests use .unwrap() for fallible operations, which acts as an implicit assertion via
panicking rather than explicit test assertions. This does not meet the rule requiring tests to use
assertion macros instead of unwrap()/expect() as substitutes for verifying behavior.

Code

src/source/wordlist.rs[R172-184]

+    fn process_empty_file() {
+        let mut file = NamedTempFile::new().unwrap();
+        file.write_all(b"").unwrap();
+
+        let source = WordlistSource::from_file(file.path()).unwrap();
+        let deriver = KeyDeriver::new();
+        let output = ConsoleOutput::new();
+        let transforms: Vec<Box<dyn Transform>> = Vec::new();
+
+        let stats = source
+            .process(&transforms, &deriver, None, &output)
+            .unwrap();
+        assert_eq!(stats.inputs_processed, 0);

Evidence

The compliance item forbids using panicking .unwrap()/.expect() as a substitute for assertions
in inline tests; the added tests call .unwrap() on fallible setup and on the process() result
instead of explicitly asserting success/failure via assert!/assert_eq! patterns.

Rule 88813: Write Rust tests as inline modules at the end of the source file
src/source/wordlist.rs[172-184]
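A test in the style the rule asks for might look like the sketch below. `count_inputs` is a hypothetical helper standing in for `process()`, chosen so the example needs nothing beyond the standard library; the relevant part is that the `Result` is asserted on explicitly instead of being unwrapped.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper standing in for `process()`: counts non-empty lines.
fn count_inputs<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut n = 0;
    for line in reader.lines() {
        if !line?.trim().is_empty() {
            n += 1;
        }
    }
    Ok(n)
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    #[test]
    fn empty_input_yields_zero() {
        let result = count_inputs(Cursor::new(""));
        // Assert on the Result explicitly instead of unwrapping it.
        assert!(result.is_ok(), "expected Ok, got {result:?}");
        assert_eq!(result.ok(), Some(0));
    }

    #[test]
    fn invalid_utf8_is_an_error() {
        let result = count_inputs(Cursor::new(vec![0xff, 0xfe, b'\n']));
        // Check the error kind rather than panicking on the failure path.
        assert!(matches!(result, Err(ref e) if e.kind() == io::ErrorKind::InvalidData));
    }
}
```

Whether fallible test setup (such as creating a temp file) also has to avoid `.unwrap()` depends on how strictly the rule is read; the sketch only covers the call under test.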

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Inline unit tests added in `src/source/wordlist.rs` use `.unwrap()` for fallible operations, which is treated as a panicking substitute for explicit assertions.
## Issue Context
The compliance rule requires tests to be written with standard assertion macros and to avoid using `unwrap()`/`expect()` as substitutes for asserting expected behavior.
## Fix Focus Areas
- src/source/wordlist.rs[172-218]


coderabbitai · 2026-03-13T13:12:09Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84c63cde-fbca-4b0e-a139-b3cf12c7753e

📥 Commits

Reviewing files that changed from the base of the PR and between 311fe4f and 5c5f232.

📒 Files selected for processing (1)

src/source/wordlist.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/source/wordlist.rs

📜 Recent review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: benchmarks

📝 Walkthrough

Walkthrough

WordlistSource now stores a PathBuf and streams the wordlist file in fixed-size line chunks, processing each chunk via a new process_chunk helper. File existence/type is validated in from_file(). Progress is tracked by bytes read; blank lines and invalid UTF‑8 are skipped and accounted for in returned stats.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Streaming Wordlist Processing `src/source/wordlist.rs`, `tests/...` | Replaced in-memory `Vec<String>` with `path: PathBuf`. `from_file()` now validates path and file type. Processing switched to buffered streaming: read lines into batches of up to CHUNK_SIZE, skip blank/invalid-UTF8 lines, and call `process_chunk()` per batch, which applies transforms, deriver, matcher, and output (parallelized per batch). Progress updates use bytes read; inputs_processed/bytes_consumed updated. Tests added/updated for missing file, directory-as-input, empty file, blank-line skipping, and invalid UTF‑8 handling. |

Sequence Diagram(s)

sequenceDiagram
    participant File as File (wordlist)
    participant Reader as BufReader
    participant Chunker as Chunk Buffer
    participant Worker as Parallel Workers
    participant Deriver as Key Deriver
    participant Matcher as Matcher
    participant Output as Output Sink

    File->>Reader: open & stream bytes
    Reader->>Chunker: read lines, skip blank/invalid UTF-8
    Chunker->>Worker: emit chunk (CHUNK_SIZE)
    Worker->>Deriver: apply transforms -> derive keys
    Deriver->>Matcher: check matches (optional)
    Matcher->>Output: write matched results / stats
    Worker->>Chunker: report inputs_processed & bytes_consumed
    Chunker->>Reader: request next lines (loop)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

Chunks stream in where stacks once grew,
Bytes counted, bad UTF skipped through.
Batches march, derive, and test,
Memory breathes — the file's at rest. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | Title accurately describes the main change: switching from in-memory loading to streaming the wordlist file. |
| Description check | ✅ Passed | Description directly addresses the problem (2GB file = 2GB+ resident memory) and explains the solution (streaming chunks with Rayon processing). |
| Linked Issues check | ✅ Passed | Changes fully implement `#71` requirements: streams file in bounded chunks (10k lines), processes with Rayon par_chunks(1000), replaces line-count progress with byte-based tracking. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the three core objectives: bounded-size chunk streaming, Rayon parallelism per chunk, and byte-based progress tracking. No unrelated modifications detected. |


codeant-ai · 2026-03-13T13:12:56Z

Sequence Diagram

This PR changes wordlist handling from full-file preloading to incremental streaming in fixed-size chunks. The flow keeps parallel key generation per chunk while tracking progress by bytes read, so memory usage stays bounded for very large files.

sequenceDiagram
    participant Runner
    participant WordlistSource
    participant WordlistFile
    participant ChunkProcessor
    participant Output

    Runner->>WordlistSource: Start processing
    WordlistSource->>WordlistFile: Open file and read file size
    loop Stream lines in chunks
        WordlistSource->>WordlistFile: Read next line and count bytes
        WordlistSource->>WordlistSource: Buffer non empty lines and update progress
    end
    WordlistSource->>ChunkProcessor: Process chunk with parallel batches
    ChunkProcessor->>Output: Write derived keys or matched hits
    WordlistSource-->>Runner: Return process stats

Generated by CodeAnt AI

codspeed-hq · 2026-03-13T13:15:29Z

Merging this PR will not alter performance

✅ 7 untouched benchmarks

Comparing fix/wordlist-streaming-memory (5c5f232) with main (afb45b5)

cubic-dev-ai

2 issues found across 1 file

Confidence score: 3/5

There is a concrete regression risk in src/source/wordlist.rs: the new chunking can cap Rayon at 10 batches, which may underutilize higher-core systems and slow large wordlist processing versus the previous par_chunks(1000) behavior.
A secondary low-severity issue in src/source/wordlist.rs affects progress accuracy on invalid UTF-8 input, because read_line() already consumed the full bad line before InvalidData, so the byte counter falls behind.
Score is 3 because the top issue is a medium-severity, high-confidence performance regression (user-visible on large workloads), while the other issue is mostly progress-reporting correctness rather than core functionality.
Pay close attention to src/source/wordlist.rs - parallel chunk sizing and invalid UTF-8 progress accounting can both produce user-visible behavior changes.
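The arithmetic behind the "10 batches" figure, plus one possible way to scale the batch size with the worker count, can be sketched as follows. The heuristic in `batch_size_for` and all function names are assumptions for illustration, not something the PR or the reviewer proposes; whether a 10-batch cap is actually slower would need measuring.

```rust
use std::thread;

/// Number of parallel batches a chunk splits into (ceiling division).
fn batch_count(chunk_len: usize, batch_size: usize) -> usize {
    let batch_size = batch_size.max(1);
    (chunk_len + batch_size - 1) / batch_size
}

/// Hypothetical heuristic: choose a batch size that gives each worker about
/// `batches_per_worker` batches, but never go below `min_batch` lines.
fn batch_size_for(
    chunk_len: usize,
    workers: usize,
    batches_per_worker: usize,
    min_batch: usize,
) -> usize {
    let target_batches = workers.max(1) * batches_per_worker.max(1);
    let size = (chunk_len + target_batches - 1) / target_batches;
    size.max(min_batch).max(1)
}

/// Logical CPU count as reported by the standard library, falling back to 1.
fn available_workers() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

With a 10,000-line chunk and 1,000-line batches, `batch_count` gives 10. On a hypothetical 32-thread machine, `batch_size_for(10_000, 32, 2, 64)` gives 157-line batches, which is 64 batches per chunk. Raising the chunk size instead would have a similar effect at the cost of more memory per chunk.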

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/source/wordlist.rs">

<violation number="1" location="src/source/wordlist.rs:18">
P2: This chunk size caps Rayon parallelism at 10 batches. On boxes with more than 10 worker threads, part of the pool goes idle here and large wordlists get slower than the old whole-file `par_chunks(1000)` path.</violation>

<violation number="2" location="src/source/wordlist.rs:69">
P3: This progress update is wrong for invalid UTF-8 lines. `read_line()` has already consumed the whole bad line before it returns `InvalidData`, so adding 1 byte leaves the bar far behind on corrupted wordlists.</violation>
</file>
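For the second violation, one alternative to `read_line()` is to read raw bytes with `read_until()` and validate UTF-8 separately, so the byte count is known even for a bad line. This is a different technique from the one in the PR, sketched here with a hypothetical `next_line` helper.

```rust
use std::io::{self, BufRead};

/// Reads one line as raw bytes and validates UTF-8 as a separate step.
/// Returns Ok(None) at end of input. Otherwise returns the exact number of
/// bytes consumed plus the decoded line, or None in place of the line when
/// the bytes are not valid UTF-8.
/// `next_line` is a hypothetical helper name.
fn next_line<R: BufRead>(
    reader: &mut R,
    buf: &mut Vec<u8>,
) -> io::Result<Option<(usize, Option<String>)>> {
    buf.clear();
    // read_until reports every byte it consumed, including the delimiter.
    let n = reader.read_until(b'\n', buf)?;
    if n == 0 {
        return Ok(None);
    }
    let text = std::str::from_utf8(buf.as_slice())
        .ok()
        .map(|s| s.trim_end_matches(|c: char| c == '\n' || c == '\r').to_owned());
    Ok(Some((n, text)))
}
```

The caller can add `n` to `bytes_consumed` unconditionally and skip the line when the decoded text is `None`, so progress stays accurate on corrupted wordlists.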

Architecture diagram

sequenceDiagram
    participant App as Application Runner
    participant WLS as WordlistSource
    participant FS as File System
    participant PB as Progress Bar
    participant Rayon as Rayon (Parallel Workers)
    participant Core as Transform/Deriver/Matcher

    App->>WLS: process(transforms, deriver, matcher, output)
    
    WLS->>FS: NEW: Get file metadata (size)
    FS-->>WLS: File size in bytes
    
    WLS->>PB: CHANGED: Initialize with total bytes
    WLS->>FS: Open file for streaming
    
    loop Until End of File
        loop Read into Chunk (up to 10k lines)
            WLS->>FS: NEW: read_line()
            alt Invalid UTF-8
                WLS->>WLS: Skip line & increment byte count
            else Valid Line
                WLS->>WLS: Trim and push to Vec<String>
            end
        end

        Note over WLS,Rayon: Memory is bounded by 10k strings
        
        WLS->>Rayon: CHANGED: process_chunk(lines)
        
        loop Parallel Batches (1k lines)
            Rayon->>Core: Apply Transforms
            Core->>Core: Derive Keys
            opt Matcher present
                Core->>Core: Check Match
                Core->>App: CHANGED: Report Hit (Atomic update)
            end
        end
        
        Rayon-->>WLS: Batch complete
        WLS->>PB: NEW: set_position(bytes_consumed)
        WLS->>WLS: NEW: chunk.clear() (Free memory)
    end

    WLS->>PB: finish_and_clear()
    WLS-->>App: Return ProcessStats (inputs_processed, etc.)


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/source/wordlist.rs`:
- Around line 67-70: The InvalidData branch underestimates bytes_consumed
(currently +=1); instead query the actual stream position and set bytes_consumed
from it: when handling Err(e) if e.kind() == std::io::ErrorKind::InvalidData,
call Seek::stream_position() on the underlying reader (e.g. the File inside your
BufReader) and update bytes_consumed = position as returned, falling back to the
previous conservative increment only if stream_position() itself errors; update
the branch that contains read_line(), bytes_consumed, and the InvalidData match
to use this accurate seek-based position.
- Around line 133-142: The code currently swallows errors by calling .ok() on
output.hit(...) and output.key(...); replace those .ok() calls with the ?
operator so any I/O errors from ConsoleOutput.key() or ConsoleOutput.hit()
(e.g., from writeln!) propagate up to the caller. In the block using matcher /
m.check(&derived) and the else branch that calls output.key(source,
transform.name(), &derived), change the call sites to use ? and return Result
from the surrounding function so errors bubble to the caller that performs
flush/handling.
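The second comment (propagating output errors with `?` instead of discarding them with `.ok()`) can be illustrated with a small sequential sketch. `KeySink`, `WriterSink`, `emit_all`, and `Broken` are hypothetical names, and the real `Output` trait may have a different shape. Inside a Rayon closure, `?` cannot return from the outer function directly, so something like Rayon's `try_for_each` would probably be needed there.

```rust
use std::io::{self, Write};

/// Hypothetical sink mirroring an output with a fallible `key` method.
trait KeySink {
    fn key(&mut self, source: &str, derived: &str) -> io::Result<()>;
}

/// Writes each key as a tab-separated line to any `Write` implementation.
struct WriterSink<W: Write>(W);

impl<W: Write> KeySink for WriterSink<W> {
    fn key(&mut self, source: &str, derived: &str) -> io::Result<()> {
        writeln!(self.0, "{source}\t{derived}")
    }
}

/// Emits every pair and stops at the first I/O error instead of hiding it.
fn emit_all(sink: &mut dyn KeySink, pairs: &[(&str, &str)]) -> io::Result<usize> {
    let mut written = 0;
    for (source, derived) in pairs {
        // With `.ok()` a failed write would be silently dropped here.
        sink.key(source, derived)?;
        written += 1;
    }
    Ok(written)
}

/// A writer that always fails, used to show the error reaching the caller.
struct Broken;

impl Write for Broken {
    fn write(&mut self, _: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "closed"))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Passing `WriterSink(Broken)` to `emit_all` returns the `BrokenPipe` error to the caller, which can then decide whether to abort or report it.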


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a60d9b89-79b0-4d89-a4f9-144c8aa101ef

📥 Commits

Reviewing files that changed from the base of the PR and between afb45b5 and 03faec4.

📒 Files selected for processing (1)

src/source/wordlist.rs

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: cubic · AI code reviewer
GitHub Check: benchmarks

🧰 Additional context used

📓 Path-based instructions (8)

src/source/*.rs

📄 CodeRabbit inference engine (src/source/AGENTS.md)

Create new source in src/source/{name}.rs file

Files:

src/source/wordlist.rs

src/source/!(mod).rs

📄 CodeRabbit inference engine (src/source/AGENTS.md)

src/source/!(mod).rs: Implement Source trait with process() method accepting transforms, deriver, matcher, and output parameters
Use Rayon par_chunks() for batch processing and parallelism in source implementations
Report progress via optional ProgressBar using indicatif::ProgressBar in process() method
All sources must implement Send + Sync traits for thread safety

Files:

src/source/wordlist.rs

**/*.rs