fix(source): stream wordlist file instead of loading into memory #76

Merged
oritwoen merged 3 commits into main from fix/wordlist-streaming-memory on Mar 13, 2026
@oritwoen
Owner

WordlistSource read the entire file into Vec<String> before processing anything. A 2GB wordlist means 2GB+ of resident memory just for the input buffer, before any key derivation starts.

Switched to streaming with read_line() in chunks of 10k lines. Each chunk gets processed through Rayon par_chunks(1000) the same way as before, then dropped before the next chunk loads. Memory usage is now bounded by chunk size, not file size.

Progress bar tracks bytes read from the file (via read_line() return value) instead of line count, so it works without knowing total lines upfront and handles CRLF correctly.
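
A minimal sketch of the chunked streaming loop described above, using only the standard library. The function name `stream_in_chunks`, its signature, and the `on_chunk` callback are assumptions for illustration and are not taken from the actual diff; the callback stands in for the Rayon-backed chunk processing.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper (not the PR's code): reads `reader` line by line,
/// hands out batches of at most `chunk_size` non-blank lines, and returns
/// the total number of bytes consumed from the reader.
pub fn stream_in_chunks<R, F>(mut reader: R, chunk_size: usize, mut on_chunk: F) -> io::Result<u64>
where
    R: BufRead,
    F: FnMut(&[String]),
{
    let mut chunk: Vec<String> = Vec::with_capacity(chunk_size);
    let mut line = String::new();
    let mut bytes_consumed: u64 = 0;

    loop {
        line.clear();
        // read_line returns the number of bytes read, including "\n" or "\r\n",
        // so byte-based progress stays correct for CRLF files.
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // end of file
        }
        bytes_consumed += n as u64;

        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue; // blank lines count toward bytes but are not queued
        }
        chunk.push(trimmed.to_string());

        if chunk.len() >= chunk_size {
            on_chunk(&chunk);
            chunk.clear(); // drop the lines before reading the next chunk
        }
    }

    // Flush the final partial chunk.
    if !chunk.is_empty() {
        on_chunk(&chunk);
    }
    Ok(bytes_consumed)
}
```

Peak memory in this sketch is governed by `chunk_size` and the longest line rather than by the size of the file, which is the property the description is after.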

Closes #71

WordlistSource used to read the entire file into Vec<String> before
processing. For large wordlists (multi-GB), this consumed memory
proportional to file size regardless of batch processing needs.

Switched to streaming with read_line() in chunks of 10k lines,
keeping Rayon parallelism within each chunk. Progress bar now
tracks bytes read from the file instead of line count.

Closes #71
@qodo-code-review

Review Summary by Qodo

Stream wordlist file to reduce memory consumption

🐞 Bug fix ✨ Enhancement

Walkthroughs

Description
• Stream wordlist files in 10k-line chunks instead of loading entire file into memory
• Progress bar now tracks bytes read from file instead of line count
• Extract chunk processing logic into separate process_chunk() function for clarity
• Add comprehensive unit tests for file validation and edge cases
Diagram
flowchart LR
  A["Load entire file<br/>into Vec"] -->|Before| B["High memory usage<br/>for large files"]
  C["Stream file in<br/>10k-line chunks"] -->|After| D["Bounded memory<br/>by chunk size"]
  D --> E["Process chunks<br/>with Rayon"]
  E --> F["Track progress<br/>by bytes read"]
File Changes

1. src/source/wordlist.rs Bug fix, enhancement, tests +172/-49

Implement streaming wordlist processing with memory bounds

• Changed WordlistSource to store file path instead of pre-loaded Vec<String>
• Implemented streaming file reading with read_line() in 10k-line chunks
• Progress bar now tracks bytes consumed instead of line count for accurate progress
• Extracted chunk processing into separate process_chunk() function using Rayon parallelism
• Added file validation in from_file() to check existence and file type
• Added unit tests covering file not found, directory rejection, empty files, blank line skipping, and invalid UTF-8 handling


qodo-code-review Bot commented Mar 13, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (2) 📎 Requirement gaps (0)


Action required

1. Progress misses final chunk 🐞 Bug ✓ Correctness
Description
WordlistSource::process only updates the progress bar when a full 10k-line chunk is processed, and
it never updates the bar after processing the final partial chunk. This makes progress appear
stalled for wordlists with <10k non-empty lines and leaves the bar short of 100% for any wordlist
whose last chunk is smaller than CHUNK_SIZE.
Code

src/source/wordlist.rs[R85-100]

+            if chunk.len() >= CHUNK_SIZE {
+                process_chunk(
+                    &chunk, transforms, deriver, matcher, output, &stats, &matches,
+                );
+                pb.set_position(bytes_consumed);
+                chunk.clear();
+            }
+        }
+
+        if !chunk.is_empty() {
+            process_chunk(
+                &chunk, transforms, deriver, matcher, output, &stats, &matches,
+            );
+        }

       pb.finish_and_clear();
Evidence
The progress bar position is only set inside the if chunk.len() >= CHUNK_SIZE block; the tail
process_chunk call does not update progress at all before finishing. Other sources update the
progress bar for every processed batch, so this behavior is inconsistent and causes the UI to show
no/incorrect progress for common inputs.

src/source/wordlist.rs[48-100]
src/source/range.rs[50-90]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`WordlistSource::process` only calls `pb.set_position(bytes_consumed)` after processing a full `CHUNK_SIZE` chunk, and does not update the progress bar after processing the final partial chunk. For wordlists with fewer than `CHUNK_SIZE` non-empty lines, the progress bar never advances; for larger wordlists it may never reflect the final bytes consumed.
## Issue Context
The progress bar is initialized with `file_size` and intended to track `bytes_consumed`. Right now the bar is only advanced in the `chunk.len() >= CHUNK_SIZE` branch.
## Fix Focus Areas
- src/source/wordlist.rs[62-101]
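
One way to picture the fix is sketched here with the standard library only. The function `process_with_progress` and the `set_position` closure are hypothetical stand-ins for `WordlistSource::process` and `pb.set_position`; the sketch also assumes every line is followed by a single `\n` byte.

```rust
/// Hypothetical sketch: processes `lines` in chunks of `chunk_size` and
/// reports the byte position after every chunk, including the last partial one.
/// Returns the number of lines processed.
pub fn process_with_progress<F: FnMut(u64)>(
    lines: &[&str],
    chunk_size: usize,
    mut set_position: F,
) -> usize {
    let mut chunk: Vec<&str> = Vec::with_capacity(chunk_size);
    let mut bytes_consumed: u64 = 0;
    let mut processed = 0usize;

    for &line in lines {
        // Assumption: one trailing "\n" byte per line.
        bytes_consumed += line.len() as u64 + 1;
        chunk.push(line);

        if chunk.len() >= chunk_size {
            processed += chunk.len(); // stand-in for process_chunk(...)
            set_position(bytes_consumed);
            chunk.clear();
        }
    }

    if !chunk.is_empty() {
        processed += chunk.len(); // stand-in for process_chunk(...)
        // The update the review found missing: report progress for the tail too.
        set_position(bytes_consumed);
    }
    processed
}
```

With three lines and a chunk size of two, the closure is called once after the full chunk and once more after the one-line tail, so the final reported position equals the total bytes read.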


Remediation recommended

2. Missing import block separator 📘 Rule violation ✓ Correctness
Description
The super:: and crate:: import groups are adjacent without the required single blank line
separator. This violates the mandated import ordering/layout for Rust files.
Code

src/source/wordlist.rs[R12-16]

use super::{ProcessStats, Source};
use crate::derive::KeyDeriver;
use crate::matcher::Matcher;
use crate::output::Output;
use crate::transform::{Input, Transform};
Evidence
The checklist requires a single blank line between super:: imports and crate:: imports when both
groups exist; the diff shows use super::{...}; immediately followed by use crate::... with no
blank line.

Rule 88803: Order Rust import statements by origin and scope
src/source/wordlist.rs[12-16]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The file violates the required Rust import grouping because the `super::` and `crate::` blocks are not separated by exactly one blank line.
## Issue Context
Compliance requires grouped imports ordered by origin with a single blank line between groups.
## Fix Focus Areas
- src/source/wordlist.rs[12-16]


3. Tests rely on .unwrap() 📘 Rule violation ⛯ Reliability
Description
New inline tests use .unwrap() for fallible operations, which acts as an implicit assertion via
panicking rather than explicit test assertions. This does not meet the rule requiring tests to use
assertion macros instead of unwrap()/expect() as substitutes for verifying behavior.
Code

src/source/wordlist.rs[R172-184]

+    fn process_empty_file() {
+        let mut file = NamedTempFile::new().unwrap();
+        file.write_all(b"").unwrap();
+
+        let source = WordlistSource::from_file(file.path()).unwrap();
+        let deriver = KeyDeriver::new();
+        let output = ConsoleOutput::new();
+        let transforms: Vec<Box<dyn Transform>> = Vec::new();
+
+        let stats = source
+            .process(&transforms, &deriver, None, &output)
+            .unwrap();
+        assert_eq!(stats.inputs_processed, 0);
Evidence
The compliance item forbids using panicking .unwrap()/.expect() as a substitute for assertions
in inline tests; the added tests call .unwrap() on fallible setup and on the process() result
instead of explicitly asserting success/failure via assert!/assert_eq! patterns.

Rule 88813: Write Rust tests as inline modules at the end of the source file
src/source/wordlist.rs[172-184]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Inline unit tests added in `src/source/wordlist.rs` use `.unwrap()` for fallible operations, which is treated as a panicking substitute for explicit assertions.
## Issue Context
The compliance rule requires tests to be written with standard assertion macros and to avoid using `unwrap()`/`expect()` as substitutes for asserting expected behavior.
## Fix Focus Areas
- src/source/wordlist.rs[172-218]
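
A rough sketch of what an assertion-based check could look like, under the assumption that the rule is satisfied by asserting on the `Result` itself. `count_non_blank` is a hypothetical stand-in for `WordlistSource::process`, and in a real module the check function would carry `#[test]` inside `#[cfg(test)] mod tests`.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical stand-in for the code under test: counts non-blank lines.
pub fn count_non_blank<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}

/// Asserts on the Result directly instead of calling `.unwrap()`, so a
/// failure reports the expectation rather than a bare panic.
pub fn empty_input_yields_zero() {
    let result = count_non_blank(Cursor::new(&b""[..]));
    assert!(matches!(result, Ok(0)), "expected Ok(0), got {:?}", result);
}
```

The `matches!` form keeps the success condition and the expected value in a single assertion, and the failure message shows the actual `Result`.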


@coderabbitai

coderabbitai Bot commented Mar 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84c63cde-fbca-4b0e-a139-b3cf12c7753e

📥 Commits

Reviewing files that changed from the base of the PR and between 311fe4f and 5c5f232.

📒 Files selected for processing (1)
  • src/source/wordlist.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/source/wordlist.rs
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: benchmarks

📝 Walkthrough

Walkthrough

WordlistSource now stores a PathBuf and streams the wordlist file in fixed-size line chunks, processing each chunk via a new process_chunk helper. File existence/type is validated in from_file(). Progress is tracked by bytes read; blank lines and invalid UTF‑8 are skipped and accounted for in returned stats.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Streaming Wordlist Processing (src/source/wordlist.rs, tests/...) | Replaced in-memory Vec<String> with path: PathBuf. from_file() now validates the path and file type. Processing switched to buffered streaming: read lines into chunks of up to CHUNK_SIZE, skip blank and invalid-UTF-8 lines, and call process_chunk() per chunk, which applies transforms, deriver, matcher, and output (parallelized per batch). Progress updates use bytes read; inputs_processed and bytes_consumed are updated. Tests added or updated for missing file, directory-as-input, empty file, blank-line skipping, and invalid UTF-8 handling. |
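
The per-chunk batch parallelism can be sketched with the standard library alone. This swaps Rayon's `par_chunks()` for `std::thread::scope` so the block has no external dependencies; `process_chunk_parallel` and the shared `processed` counter are assumptions for illustration, and the filter inside each thread is a placeholder for the transform, derive, and match steps.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical sketch: splits `chunk` into batches of `batch_size` and
/// handles each batch on its own scoped thread, counting processed inputs
/// in a shared atomic. The real code uses Rayon's thread pool instead of
/// spawning one thread per batch.
pub fn process_chunk_parallel(chunk: &[String], batch_size: usize, processed: &AtomicU64) {
    thread::scope(|s| {
        // `max(1)` guards against a zero batch size, which `chunks` rejects.
        for batch in chunk.chunks(batch_size.max(1)) {
            s.spawn(move || {
                // Placeholder for transforms -> derive -> match -> output.
                let n = batch.iter().filter(|w| !w.is_empty()).count() as u64;
                processed.fetch_add(n, Ordering::Relaxed);
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

Because the scope joins every thread before returning, the caller can clear the chunk buffer right after the call, which matches the load, process, drop cycle in the summary.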

Sequence Diagram(s)

sequenceDiagram
    participant File as File (wordlist)
    participant Reader as BufReader
    participant Chunker as Chunk Buffer
    participant Worker as Parallel Workers
    participant Deriver as Key Deriver
    participant Matcher as Matcher
    participant Output as Output Sink

    File->>Reader: open & stream bytes
    Reader->>Chunker: read lines, skip blank/invalid UTF-8
    Chunker->>Worker: emit chunk (CHUNK_SIZE)
    Worker->>Deriver: apply transforms -> derive keys
    Deriver->>Matcher: check matches (optional)
    Matcher->>Output: write matched results / stats
    Worker->>Chunker: report inputs_processed & bytes_consumed
    Chunker->>Reader: request next lines (loop)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

Chunks stream in where stacks once grew,
Bytes counted, bad UTF skipped through.
Batches march, derive, and test,
Memory breathes — the file's at rest. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title accurately describes the main change: switching from in-memory loading to streaming the wordlist file. |
| Description check | ✅ Passed | Description directly addresses the problem (2GB file = 2GB+ resident memory) and explains the solution (streaming chunks with Rayon processing). |
| Linked Issues check | ✅ Passed | Changes fully implement #71 requirements: streams file in bounded chunks (10k lines), processes with Rayon par_chunks(1000), replaces line-count progress with byte-based tracking. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the three core objectives: bounded-size chunk streaming, Rayon parallelism per chunk, and byte-based progress tracking. No unrelated modifications detected. |


@codeant-ai

codeant-ai Bot commented Mar 13, 2026

Sequence Diagram

This PR changes wordlist handling from full-file preloading to incremental streaming in fixed-size chunks. The flow keeps parallel key generation per chunk while tracking progress by bytes read, so memory usage stays bounded for very large files.

sequenceDiagram
    participant Runner
    participant WordlistSource
    participant WordlistFile
    participant ChunkProcessor
    participant Output

    Runner->>WordlistSource: Start processing
    WordlistSource->>WordlistFile: Open file and read file size
    loop Stream lines in chunks
        WordlistSource->>WordlistFile: Read next line and count bytes
        WordlistSource->>WordlistSource: Buffer non empty lines and update progress
    end
    WordlistSource->>ChunkProcessor: Process chunk with parallel batches
    ChunkProcessor->>Output: Write derived keys or matched hits
    WordlistSource-->>Runner: Return process stats

Generated by CodeAnt AI

@codspeed-hq
Contributor

codspeed-hq Bot commented Mar 13, 2026

Merging this PR will not alter performance

✅ 7 untouched benchmarks


Comparing fix/wordlist-streaming-memory (5c5f232) with main (afb45b5)



@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file

Confidence score: 3/5

  • There is a concrete regression risk in src/source/wordlist.rs: the new chunking can cap Rayon at 10 batches, which may underutilize higher-core systems and slow large wordlist processing versus the previous par_chunks(1000) behavior.
  • A secondary low-severity issue in src/source/wordlist.rs affects progress accuracy on invalid UTF-8 input, because read_line() already consumed the full bad line before InvalidData, so the byte counter falls behind.
  • Score is 3 because the top issue is a medium-severity, high-confidence performance regression (user-visible on large workloads), while the other issue is mostly progress-reporting correctness rather than core functionality.
  • Pay close attention to src/source/wordlist.rs - parallel chunk sizing and invalid UTF-8 progress accounting can both produce user-visible behavior changes.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/source/wordlist.rs">

<violation number="1" location="src/source/wordlist.rs:18">
P2: This chunk size caps Rayon parallelism at 10 batches. On boxes with more than 10 worker threads, part of the pool goes idle here and large wordlists get slower than the old whole-file `par_chunks(1000)` path.</violation>

<violation number="2" location="src/source/wordlist.rs:69">
P3: This progress update is wrong for invalid UTF-8 lines. `read_line()` has already consumed the whole bad line before it returns `InvalidData`, so adding 1 byte leaves the bar far behind on corrupted wordlists.</violation>
</file>
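
The arithmetic behind violation 1 can be sketched as follows. The helper names are hypothetical; the 10k and 1k figures come from the PR description, and the 100k figure from the follow-up commit in this PR that bumps CHUNK_SIZE.

```rust
/// Number of batches a full chunk splits into (ceiling division).
pub fn batches_per_chunk(chunk_size: usize, batch_size: usize) -> usize {
    assert!(batch_size > 0, "batch_size must be non-zero");
    (chunk_size + batch_size - 1) / batch_size
}

/// Upper bound on workers that can be busy at once while one chunk is
/// processed: there cannot be more busy workers than batches.
pub fn max_busy_workers(chunk_size: usize, batch_size: usize, workers: usize) -> usize {
    batches_per_chunk(chunk_size, batch_size).min(workers)
}
```

For a 10,000-line chunk in batches of 1,000 this gives 10 batches, so a 16-thread pool can keep at most 10 workers busy per chunk; a 100,000-line chunk yields 100 batches and can occupy all 16. The final partial chunk of a file can still produce fewer batches than there are workers.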
Architecture diagram
sequenceDiagram
    participant App as Application Runner
    participant WLS as WordlistSource
    participant FS as File System
    participant PB as Progress Bar
    participant Rayon as Rayon (Parallel Workers)
    participant Core as Transform/Deriver/Matcher

    App->>WLS: process(transforms, deriver, matcher, output)
    
    WLS->>FS: NEW: Get file metadata (size)
    FS-->>WLS: File size in bytes
    
    WLS->>PB: CHANGED: Initialize with total bytes
    WLS->>FS: Open file for streaming
    
    loop Until End of File
        loop Read into Chunk (up to 10k lines)
            WLS->>FS: NEW: read_line()
            alt Invalid UTF-8
                WLS->>WLS: Skip line & increment byte count
            else Valid Line
                WLS->>WLS: Trim and push to Vec<String>
            end
        end

        Note over WLS,Rayon: Memory is bounded by 10k strings
        
        WLS->>Rayon: CHANGED: process_chunk(lines)
        
        loop Parallel Batches (1k lines)
            Rayon->>Core: Apply Transforms
            Core->>Core: Derive Keys
            opt Matcher present
                Core->>Core: Check Match
                Core->>App: CHANGED: Report Hit (Atomic update)
            end
        end
        
        Rayon-->>WLS: Batch complete
        WLS->>PB: NEW: set_position(bytes_consumed)
        WLS->>WLS: NEW: chunk.clear() (Free memory)
    end

    WLS->>PB: finish_and_clear()
    WLS-->>App: Return ProcessStats (inputs_processed, etc.)

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/source/wordlist.rs`:
- Around line 67-70: The InvalidData branch underestimates bytes_consumed
(currently +=1); instead query the actual stream position and set bytes_consumed
from it: when handling Err(e) if e.kind() == std::io::ErrorKind::InvalidData,
call Seek::stream_position() on the underlying reader (e.g. the File inside your
BufReader) and update bytes_consumed = position as returned, falling back to the
previous conservative increment only if stream_position() itself errors; update
the branch that contains read_line(), bytes_consumed, and the InvalidData match
to use this accurate seek-based position.
- Around line 133-142: The code currently swallows errors by calling .ok() on
output.hit(...) and output.key(...); replace those .ok() calls with the ?
operator so any I/O errors from ConsoleOutput.key() or ConsoleOutput.hit()
(e.g., from writeln!) propagate up to the caller. In the block using matcher /
m.check(&derived) and the else branch that calls output.key(source,
transform.name(), &derived), change the call sites to use ? and return Result
from the surrounding function so errors bubble to the caller that performs
flush/handling.
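
The first suggestion can be sketched like this, as a minimal standalone example rather than the PR's actual loop. `count_valid_lines` is a hypothetical name; the sketch relies on `read_line()` having already consumed the offending line when it returns `InvalidData`, and on `BufReader`'s `Seek::stream_position()` reporting the logical position with buffered bytes accounted for.

```rust
use std::io::{self, BufRead, BufReader, ErrorKind, Read, Seek};

/// Hypothetical sketch: counts non-blank valid UTF-8 lines and tracks bytes
/// consumed, resynchronising from the stream position after an invalid line.
pub fn count_valid_lines<R: Read + Seek>(inner: R) -> io::Result<(usize, u64)> {
    let mut reader = BufReader::new(inner);
    let mut line = String::new();
    let mut bytes_consumed: u64 = 0;
    let mut valid = 0usize;

    loop {
        line.clear();
        match reader.read_line(&mut line) {
            Ok(0) => break, // end of input
            Ok(n) => {
                bytes_consumed += n as u64;
                if !line.trim().is_empty() {
                    valid += 1;
                }
            }
            Err(e) if e.kind() == ErrorKind::InvalidData => {
                // The bad line was already consumed, so ask the reader where
                // it is; fall back to the conservative +1 only if that fails.
                bytes_consumed = reader.stream_position().unwrap_or(bytes_consumed + 1);
            }
            Err(e) => return Err(e),
        }
    }
    Ok((valid, bytes_consumed))
}
```

On input such as `valid\n\xff\xfe\ntest\n`, the invalid middle line is skipped while the byte counter still ends at the full input length, so a byte-based progress bar reaches 100%.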
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a60d9b89-79b0-4d89-a4f9-144c8aa101ef

📥 Commits

Reviewing files that changed from the base of the PR and between afb45b5 and 03faec4.

📒 Files selected for processing (1)
  • src/source/wordlist.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: benchmarks
🧰 Additional context used
📓 Path-based instructions (8)
src/source/*.rs

📄 CodeRabbit inference engine (src/source/AGENTS.md)

Create new source in src/source/{name}.rs file

Files:

  • src/source/wordlist.rs
src/source/!(mod).rs

📄 CodeRabbit inference engine (src/source/AGENTS.md)

src/source/!(mod).rs: Implement Source trait with process() method accepting transforms, deriver, matcher, and output parameters
Use Rayon par_chunks() for batch processing and parallelism in source implementations
Report progress via optional ProgressBar using indicatif::ProgressBar in process() method
All sources must implement Send + Sync traits for thread safety

Files:

  • src/source/wordlist.rs
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Order imports as: external crates → std → blank line → super:: → blank line → crate::
Prefer ? operator over .unwrap() in new code
Use PascalCase for types and structs
Use snake_case for function and method names
Use SCREAMING_SNAKE_CASE for constants
Use snake_case for file and module names

Files:

  • src/source/wordlist.rs
src/{derive,matcher,network,benchmark,provider,transform,analyze,source,output,gpu,storage}/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

Implement custom error enums with Display and Error trait implementations for domain modules

Files:

  • src/source/wordlist.rs
src/{transform,analyze,source}/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

Suffix struct names by role: {Name}Transform, {Name}Analyzer, {Name}Source

Files:

  • src/source/wordlist.rs
src/{transform,source}/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

Implement batch processing for transforms and sources using &[Input] batches via Rayon par_chunks()

Files:

  • src/source/wordlist.rs
src/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.rs: Use indicatif::ProgressBar for long-running operations
Place inline tests in #[cfg(test)] mod tests at the end of each file, using standard assert! and assert_eq! macros
Use tempfile crate for file I/O tests

Files:

  • src/source/wordlist.rs
src/{transform,source,analyze}/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

Use AtomicBool for early termination signaling across Rayon threads

Files:

  • src/source/wordlist.rs
🧠 Learnings (12)
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/stdin.rs : Implement line-by-line streaming in `StdinSource` for memory efficiency

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/!(mod).rs : Use Rayon `par_chunks()` for batch processing and parallelism in source implementations

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/{transform,source}/**/*.rs : Implement batch processing for transforms and sources using `&[Input]` batches via Rayon `par_chunks()`

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-02T14:18:44.316Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/transform/AGENTS.md:0-0
Timestamp: 2026-03-02T14:18:44.316Z
Learning: Applies to src/transform/*.rs : Process inputs as batch operations using `&[Input]` as input and `&mut Vec<(String, Key)>` as output, where the first tuple element is a human-readable source description

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/src/main.rs : Update `create_source()` function in `src/main.rs` to handle new source variants

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/!(mod).rs : Implement `Source` trait with `process()` method accepting `transforms`, `deriver`, `matcher`, and `output` parameters

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-05T12:40:29.177Z
Learnt from: oritwoen
Repo: oritwoen/vuke PR: 64
File: src/source/mod.rs:26-32
Timestamp: 2026-03-05T12:40:29.177Z
Learning: In `oritwoen/vuke`, `AGENTS.md` and `src/source/AGENTS.md` describe an **aspirational/future** `Source::process` trait signature (generic `T: Transform`, `O: Output`, with `no_gpu: bool` and `progress: Option<&ProgressBar>` parameters). The **current** implementation in `src/source/mod.rs` uses dynamic dispatch: `fn process(&self, transforms: &[Box<dyn Transform>], deriver: &KeyDeriver, matcher: Option<&Matcher>, output: &dyn Output) -> Result<ProcessStats>`. Migrating to the generic form is blocked by object-safety requirements (`Box<dyn Source>` is used in `src/main.rs`) and is a tracked future refactor. Do not flag the current dynamic-dispatch signature as non-conformant with AGENTS.md.

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: New input source implementations must implement the `Source` trait, update `SourceType` enum, and update `main.rs`

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/main.rs : CLI command and subcommand changes go in `src/main.rs` using clap derive macros

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-05T12:48:44.245Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: src/source/AGENTS.md:0-0
Timestamp: 2026-03-05T12:48:44.245Z
Learning: Applies to src/source/src/main.rs : Update `SourceCommand` enum in `src/main.rs` when adding a new source type

Applied to files:

  • src/source/wordlist.rs
📚 Learning: 2026-03-12T18:24:52.371Z
Learnt from: CR
Repo: oritwoen/vuke PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T18:24:52.371Z
Learning: Applies to src/**/*.rs : Use `tempfile` crate for file I/O tests

Applied to files:

  • src/source/wordlist.rs
🧬 Code graph analysis (1)
src/source/wordlist.rs (4)
src/derive.rs (2)
  • derive (78-165)
  • new (62-67)
src/source/files.rs (1)
  • from_file (21-31)
src/transform/input.rs (1)
  • from_string (33-41)
src/output/mod.rs (1)
  • key (40-40)
🔇 Additional comments (3)
src/source/wordlist.rs (3)

18-24: Good sizing choices for chunk/batch constants.

10K lines per chunk with 1K batches gives 10 parallel units per chunk - reasonable balance between memory footprint and parallelism.


27-37: Solid validation pattern, consistent with files.rs.

Early path validation prevents confusing errors later during process(). Storing path instead of loaded content is the right call for streaming.


150-219: Good test coverage for the streaming edge cases.

Tests cover: missing file, directory instead of file, empty file, blank-line skipping, and UTF-8 errors. Uses tempfile as per guidelines. The UTF-8 test correctly expects 3 inputs (valid, "invalid bytes" text, test) with the \xff\xfe\n line skipped.

- Update progress bar after final partial chunk
- Use stream_position() for accurate byte tracking on invalid UTF-8
- Bump CHUNK_SIZE to 100k for better Rayon utilization on many-core machines

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/source/wordlist.rs">

<violation number="1" location="src/source/wordlist.rs:98">
P2: This only updates progress after a non-empty chunk is processed, so files with lots of blank or invalid lines look stuck at 0% and an all-skipped file never reports byte progress at all.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@cubic-dev-ai cubic-dev-ai Bot left a comment


0 issues found across 1 file (changes from recent commits).

Requires human review: This is a significant refactor of core data processing logic, involving streaming file I/O and parallel processing, which requires human review to ensure data integrity.

@oritwoen oritwoen merged commit f83ad78 into main Mar 13, 2026
4 checks passed
@oritwoen oritwoen deleted the fix/wordlist-streaming-memory branch March 13, 2026 13:33
Development

Successfully merging this pull request may close these issues.

WordlistSource loads entire file into memory before processing
