fix(source): stream wordlist file instead of loading into memory #76
WordlistSource used to read the entire file into Vec<String> before processing. For large wordlists (multi-GB), this consumed memory proportional to file size regardless of batch processing needs. Switched to streaming with read_line() in chunks of 10k lines, keeping Rayon parallelism within each chunk. Progress bar now tracks bytes read from the file instead of line count. Closes #71
Review Summary by Qodo
Stream wordlist file to reduce memory consumption
Walkthroughs
Description
• Stream wordlist files in 10k-line chunks instead of loading entire file into memory
• Progress bar now tracks bytes read from file instead of line count
• Extract chunk processing logic into separate process_chunk() function for clarity
• Add comprehensive unit tests for file validation and edge cases
Diagram
flowchart LR
A["Load entire file<br/>into Vec"] -->|Before| B["High memory usage<br/>for large files"]
C["Stream file in<br/>10k-line chunks"] -->|After| D["Bounded memory<br/>by chunk size"]
D --> E["Process chunks<br/>with Rayon"]
E --> F["Track progress<br/>by bytes read"]
File Changes
1. src/source/wordlist.rs
No actionable comments were generated in the recent review. 🎉
📝 WalkthroughWalkthroughWordlistSource now stores a Changes
Sequence Diagram(s)
sequenceDiagram
participant File as File (wordlist)
participant Reader as BufReader
participant Chunker as Chunk Buffer
participant Worker as Parallel Workers
participant Deriver as Key Deriver
participant Matcher as Matcher
participant Output as Output Sink
File->>Reader: open & stream bytes
Reader->>Chunker: read lines, skip blank/invalid UTF-8
Chunker->>Worker: emit chunk (CHUNK_SIZE)
Worker->>Deriver: apply transforms -> derive keys
Deriver->>Matcher: check matches (optional)
Matcher->>Output: write matched results / stats
Worker->>Chunker: report inputs_processed & bytes_consumed
Chunker->>Reader: request next lines (loop)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Sequence Diagram
This PR changes wordlist handling from full-file preloading to incremental streaming in fixed-size chunks. The flow keeps parallel key generation per chunk while tracking progress by bytes read, so memory usage stays bounded for very large files.
sequenceDiagram
participant Runner
participant WordlistSource
participant WordlistFile
participant ChunkProcessor
participant Output
Runner->>WordlistSource: Start processing
WordlistSource->>WordlistFile: Open file and read file size
loop Stream lines in chunks
WordlistSource->>WordlistFile: Read next line and count bytes
WordlistSource->>WordlistSource: Buffer non empty lines and update progress
end
WordlistSource->>ChunkProcessor: Process chunk with parallel batches
ChunkProcessor->>Output: Write derived keys or matched hits
WordlistSource-->>Runner: Return process stats
Generated by CodeAnt AI
2 issues found across 1 file
Confidence score: 3/5
- There is a concrete regression risk in `src/source/wordlist.rs`: the new chunking can cap Rayon at 10 batches, which may underutilize higher-core systems and slow large wordlist processing versus the previous `par_chunks(1000)` behavior.
- A secondary low-severity issue in `src/source/wordlist.rs` affects progress accuracy on invalid UTF-8 input, because `read_line()` already consumed the full bad line before `InvalidData`, so the byte counter falls behind.
- Score is 3 because the top issue is a medium-severity, high-confidence performance regression (user-visible on large workloads), while the other issue is mostly progress-reporting correctness rather than core functionality.
- Pay close attention to `src/source/wordlist.rs`: parallel chunk sizing and invalid UTF-8 progress accounting can both produce user-visible behavior changes.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/source/wordlist.rs">
<violation number="1" location="src/source/wordlist.rs:18">
P2: This chunk size caps Rayon parallelism at 10 batches. On boxes with more than 10 worker threads, part of the pool goes idle here and large wordlists get slower than the old whole-file `par_chunks(1000)` path.</violation>
<violation number="2" location="src/source/wordlist.rs:69">
P3: This progress update is wrong for invalid UTF-8 lines. `read_line()` has already consumed the whole bad line before it returns `InvalidData`, so adding 1 byte leaves the bar far behind on corrupted wordlists.</violation>
</file>
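One way to keep the byte counter exact on corrupted input, sketched here as an alternative to `read_line()` rather than as the PR's approach: read each line as raw bytes with `read_until()`, which returns the number of bytes consumed whether or not they are valid UTF-8, and validate afterwards. The names `RawLine` and `read_counted_line` are hypothetical and exist only for this illustration.

```rust
use std::io::{self, BufRead};

/// Outcome of reading one raw line (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum RawLine {
    /// A valid UTF-8 line with the trailing "\n" or "\r\n" removed.
    Valid(String),
    /// The bytes were consumed but are not valid UTF-8.
    InvalidUtf8,
    /// Nothing left to read.
    Eof,
}

/// Reads one line as raw bytes and returns it together with the exact
/// number of bytes consumed from the reader.
pub fn read_counted_line<R: BufRead>(
    reader: &mut R,
    buf: &mut Vec<u8>,
) -> io::Result<(RawLine, u64)> {
    buf.clear();
    // read_until reports every byte it consumed, including the delimiter.
    let n = reader.read_until(b'\n', buf)?;
    if n == 0 {
        return Ok((RawLine::Eof, 0));
    }
    // UTF-8 validation happens after the byte count is already known.
    let line = match std::str::from_utf8(buf.as_slice()) {
        Ok(s) => RawLine::Valid(
            s.trim_end_matches(|c| c == '\r' || c == '\n').to_owned(),
        ),
        Err(_) => RawLine::InvalidUtf8,
    };
    Ok((line, n as u64))
}
```

A caller would add the returned count to its running total for every outcome, so skipped lines still advance the progress bar by their true length.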
Architecture diagram
sequenceDiagram
participant App as Application Runner
participant WLS as WordlistSource
participant FS as File System
participant PB as Progress Bar
participant Rayon as Rayon (Parallel Workers)
participant Core as Transform/Deriver/Matcher
App->>WLS: process(transforms, deriver, matcher, output)
WLS->>FS: NEW: Get file metadata (size)
FS-->>WLS: File size in bytes
WLS->>PB: CHANGED: Initialize with total bytes
WLS->>FS: Open file for streaming
loop Until End of File
loop Read into Chunk (up to 10k lines)
WLS->>FS: NEW: read_line()
alt Invalid UTF-8
WLS->>WLS: Skip line & increment byte count
else Valid Line
WLS->>WLS: Trim and push to Vec<String>
end
end
Note over WLS,Rayon: Memory is bounded by 10k strings
WLS->>Rayon: CHANGED: process_chunk(lines)
loop Parallel Batches (1k lines)
Rayon->>Core: Apply Transforms
Core->>Core: Derive Keys
opt Matcher present
Core->>Core: Check Match
Core->>App: CHANGED: Report Hit (Atomic update)
end
end
Rayon-->>WLS: Batch complete
WLS->>PB: NEW: set_position(bytes_consumed)
WLS->>WLS: NEW: chunk.clear() (Free memory)
end
WLS->>PB: finish_and_clear()
WLS-->>App: Return ProcessStats (inputs_processed, etc.)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/source/wordlist.rs`:
- Around line 67-70: The InvalidData branch underestimates bytes_consumed
(currently +=1); instead query the actual stream position and set bytes_consumed
from it: when handling Err(e) if e.kind() == std::io::ErrorKind::InvalidData,
call Seek::stream_position() on the underlying reader (e.g. the File inside your
BufReader) and update bytes_consumed = position as returned, falling back to the
previous conservative increment only if stream_position() itself errors; update
the branch that contains read_line(), bytes_consumed, and the InvalidData match
to use this accurate seek-based position.
- Around line 133-142: The code currently swallows errors by calling .ok() on
output.hit(...) and output.key(...); replace those .ok() calls with the ?
operator so any I/O errors from ConsoleOutput.key() or ConsoleOutput.hit()
(e.g., from writeln!) propagate up to the caller. In the block using matcher /
m.check(&derived) and the else branch that calls output.key(source,
transform.name(), &derived), change the call sites to use ? and return Result
from the surrounding function so errors bubble to the caller that performs
flush/handling.
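The seek-based suggestion in the first inline comment can be sketched roughly as follows. This is a minimal illustration, assuming a `BufReader` over a seekable source; `count_with_stream_position` is a hypothetical helper and not code from the PR. It relies on `BufReader`'s `stream_position()`, which accounts for bytes still sitting in the internal buffer, and falls back to the conservative `+1` only if the position query itself fails.

```rust
use std::io::{self, BufRead, BufReader, ErrorKind, Read, Seek};

/// Counts valid UTF-8 lines and returns `(valid_lines, bytes_consumed)`.
/// Hypothetical helper for illustration only.
pub fn count_with_stream_position<R: Read + Seek>(inner: R) -> io::Result<(usize, u64)> {
    let mut reader = BufReader::new(inner);
    let mut line = String::new();
    let mut valid = 0usize;
    let mut bytes_consumed = 0u64;

    loop {
        line.clear();
        match reader.read_line(&mut line) {
            Ok(0) => break, // end of file
            Ok(n) => {
                bytes_consumed += n as u64;
                valid += 1;
            }
            Err(e) if e.kind() == ErrorKind::InvalidData => {
                // The bad line has already been consumed, so resynchronise
                // the counter from the reader's logical position.
                bytes_consumed = reader
                    .stream_position()
                    .unwrap_or(bytes_consumed + 1);
            }
            Err(e) => return Err(e),
        }
    }
    Ok((valid, bytes_consumed))
}
```

With an in-memory `std::io::Cursor` as the source, the final count matches the input length even when one of the lines is invalid UTF-8.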
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a60d9b89-79b0-4d89-a4f9-144c8aa101ef
📒 Files selected for processing (1)
src/source/wordlist.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: cubic · AI code reviewer
- GitHub Check: benchmarks
🧰 Additional context used
📓 Path-based instructions (8)
src/source/*.rs
📄 CodeRabbit inference engine (src/source/AGENTS.md)
Create new source in `src/source/{name}.rs` file
Files:
src/source/wordlist.rs
src/source/!(mod).rs
📄 CodeRabbit inference engine (src/source/AGENTS.md)
`src/source/!(mod).rs`:
- Implement `Source` trait with `process()` method accepting `transforms`, `deriver`, `matcher`, and `output` parameters
- Use Rayon `par_chunks()` for batch processing and parallelism in source implementations
- Report progress via optional `ProgressBar` using `indicatif::ProgressBar` in `process()` method
- All sources must implement `Send + Sync` traits for thread safety
Files:
src/source/wordlist.rs
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
`**/*.rs`:
- Order imports as: external crates → std → blank line → `super::` → blank line → `crate::`
- Prefer `?` operator over `.unwrap()` in new code
- Use `PascalCase` for types and structs
- Use `snake_case` for function and method names
- Use `SCREAMING_SNAKE_CASE` for constants
- Use `snake_case` for file and module names
Files:
src/source/wordlist.rs
src/{derive,matcher,network,benchmark,provider,transform,analyze,source,output,gpu,storage}/**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
Implement custom error enums with `Display` and `Error` trait implementations for domain modules
Files:
src/source/wordlist.rs
src/{transform,analyze,source}/**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
Suffix struct names by role: `{Name}Transform`, `{Name}Analyzer`, `{Name}Source`
Files:
src/source/wordlist.rs
src/{transform,source}/**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
Implement batch processing for transforms and sources using `&[Input]` batches via Rayon `par_chunks()`
Files:
src/source/wordlist.rs
src/**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
`src/**/*.rs`:
- Use `indicatif::ProgressBar` for long-running operations
- Place inline tests in `#[cfg(test)] mod tests` at the end of each file, using standard `assert!` and `assert_eq!` macros
- Use `tempfile` crate for file I/O tests
Files:
src/source/wordlist.rs
src/{transform,source,analyze}/**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
Use `AtomicBool` for early termination signaling across Rayon threads
Files:
src/source/wordlist.rs
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/stdin.rs : Implement line-by-line streaming in `StdinSource` for memory efficiency
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/!(mod).rs : Use Rayon `par_chunks()` for batch processing and parallelism in source implementations
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/transform/AGENTS.md:0-0
Timestamp: 2026-03-02T14:18:44.316Z
Learning: Applies to src/transform/*.rs : Process inputs as batch operations using `&[Input]` as input and `&mut Vec<(String, Key)>` as output, where the first tuple element is a human-readable source description
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/{transform,source}/**/*.rs : Implement batch processing for transforms and sources using `&[Input]` batches via Rayon `par_chunks()`
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: New input source implementations must implement the `Source` trait, update `SourceType` enum, and update `main.rs`
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/!(mod).rs : Implement `Source` trait with `process()` method accepting `transforms`, `deriver`, `matcher`, and `output` parameters
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/stdin.rs : Implement line-by-line streaming in `StdinSource` for memory efficiency
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/!(mod).rs : Use Rayon `par_chunks()` for batch processing and parallelism in source implementations
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/{transform,source}/**/*.rs : Implement batch processing for transforms and sources using `&[Input]` batches via Rayon `par_chunks()`
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-02T14:18:44.316Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/transform/AGENTS.md:0-0
Timestamp: 2026-03-02T14:18:44.316Z
Learning: Applies to src/transform/*.rs : Process inputs as batch operations using `&[Input]` as input and `&mut Vec<(String, Key)>` as output, where the first tuple element is a human-readable source description
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/src/main.rs : Update `create_source()` function in `src/main.rs` to handle new source variants
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/!(mod).rs : Implement `Source` trait with `process()` method accepting `transforms`, `deriver`, `matcher`, and `output` parameters
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-05T12:40:29.177Z
Learnt from: oritwoen
Repo: oritwoen/vuke PR: 64
File: src/source/mod.rs:26-32
Timestamp: 2026-03-05T12:40:29.177Z
Learning: In `oritwoen/vuke`, `AGENTS.md` and `src/source/AGENTS.md` describe an **aspirational/future** `Source::process` trait signature (generic `T: Transform`, `O: Output`, with `no_gpu: bool` and `progress: Option<&ProgressBar>` parameters). The **current** implementation in `src/source/mod.rs` uses dynamic dispatch: `fn process(&self, transforms: &[Box<dyn Transform>], deriver: &KeyDeriver, matcher: Option<&Matcher>, output: &dyn Output) -> Result<ProcessStats>`. Migrating to the generic form is blocked by object-safety requirements (`Box<dyn Source>` is used in `src/main.rs`) and is a tracked future refactor. Do not flag the current dynamic-dispatch signature as non-conformant with AGENTS.md.
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: New input source implementations must implement the `Source` trait, update `SourceType` enum, and update `main.rs`
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/main.rs : CLI command and subcommand changes go in `src/main.rs` using clap derive macros
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/src/main.rs : Update `SourceCommand` enum in `src/main.rs` when adding a new source type
Applied to files:
src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/**/*.rs : Use `tempfile` crate for file I/O tests
Applied to files:
src/source/wordlist.rs
🧬 Code graph analysis (1)
src/source/wordlist.rs (4)
src/derive.rs (2)
derive(78-165)
new(62-67)
src/source/files.rs (1)
from_file(21-31)
src/transform/input.rs (1)
from_string(33-41)
src/output/mod.rs (1)
key(40-40)
🔇 Additional comments (3)
src/source/wordlist.rs (3)
18-24: Good sizing choices for chunk/batch constants.
10K lines per chunk with 1K batches gives 10 parallel units per chunk - reasonable balance between memory footprint and parallelism.
27-37: Solid validation pattern, consistent with files.rs.
Early path validation prevents confusing errors later during process(). Storing path instead of loaded content is the right call for streaming.
150-219: Good test coverage for the streaming edge cases.
Tests cover: missing file, directory instead of file, empty file, blank-line skipping, and UTF-8 errors. Uses tempfile as per guidelines. The UTF-8 test correctly expects 3 inputs (valid, "invalid bytes" text, test) with the `\xff\xfe\n` line skipped.
- Update progress bar after final partial chunk
- Use stream_position() for accurate byte tracking on invalid UTF-8
- Bump CHUNK_SIZE to 100k for better Rayon utilization on many-core machines
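The effect of the chunk size on available parallelism comes down to simple arithmetic. The sketch below assumes, as the reviews describe, that one 1,000-line batch is the unit of work handed to the thread pool; both function names are hypothetical and the thread count is an example value, not a measurement from the PR.

```rust
/// Number of batches produced when a chunk of `chunk_lines` lines is split
/// into batches of `batch_lines` lines (the last batch may be shorter).
pub fn batches_per_chunk(chunk_lines: usize, batch_lines: usize) -> usize {
    assert!(batch_lines > 0, "batch size must be non-zero");
    (chunk_lines + batch_lines - 1) / batch_lines
}

/// Upper bound on worker threads that can be busy on a single chunk:
/// there cannot be more busy threads than batches, or than pool threads.
pub fn max_busy_threads(chunk_lines: usize, batch_lines: usize, pool_threads: usize) -> usize {
    batches_per_chunk(chunk_lines, batch_lines).min(pool_threads)
}
```

For a pool of 32 threads, a 10,000-line chunk yields 10 batches and so at most 10 busy threads, while a 100,000-line chunk yields 100 batches and can keep all 32 occupied. The trade-off is a proportionally larger chunk buffer held in memory.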
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/source/wordlist.rs">
<violation number="1" location="src/source/wordlist.rs:98">
P2: This only updates progress after a non-empty chunk is processed, so files with lots of blank or invalid lines look stuck at 0% and an all-skipped file never reports byte progress at all.</violation>
</file>
WordlistSource read the entire file into `Vec<String>` before processing anything. A 2GB wordlist means 2GB+ of resident memory just for the input buffer, before any key derivation starts.

Switched to streaming with `read_line()` in chunks of 10k lines. Each chunk gets processed through Rayon `par_chunks(1000)` the same way as before, then dropped before the next chunk loads. Memory usage is now bounded by chunk size, not file size.

Progress bar tracks bytes read from the file (via `read_line()` return value) instead of line count, so it works without knowing total lines upfront and handles CRLF correctly.

Closes #71