Problem
Currently, vuke writes results to the console or to simple files (CSV/TXT). Results are lost once a scan completes, making historical analysis impossible.
With future GPU acceleration (#9), we'll generate billions of keys per session. We need persistent, queryable storage that:
- Handles terabyte-scale data (2³² seeds × ~100 bytes = 400GB per milksad scan)
- Supports streaming writes at 100+ MB/s (GPU generation speed)
- Is vendor-agnostic (not locked to any cloud provider)
- Enables efficient querying without loading everything into RAM
Proposed Solution
Architecture
vuke generate/scan
        │
        ▼
┌─────────────────────────────────────────┐
│ StorageBackend (trait)                  │
├─────────────────────────────────────────┤
│ write_batch(&mut self, records) → ()    │
│ flush(&mut self) → PathBuf              │
│ schema(&self) → Schema                  │
└─────────────────────────────────────────┘
        │
        ├──────────────┬──────────────┐
        ▼              ▼              ▼
 ParquetBackend    CSVBackend     (future)
        │
        ▼
┌─────────────────────────────────────────┐
│ Partitioned local files                 │
│   results/                              │
│     transform=milksad/                  │
│       2025-01-15_chunk_0001.parquet     │
│       2025-01-15_chunk_0002.parquet     │
└─────────────────────────────────────────┘
        │
        ▼ (async, optional)
  Cloud Upload (R2/S3/GCS)
  + Iceberg catalog registration
        │
        ▼
  Query via DuckDB / R2 SQL / Spark / Polars
Why Parquet?
- Columnar format - excellent compression (10-20x for this data)
- Industry standard - readable by every analytics tool
- GPU-friendly - RAPIDS/cuDF read Parquet natively
- Vendor-agnostic - just files, uploadable anywhere
Schema
source: string # input value (seed, passphrase, etc.)
private_key: binary[32] # raw 32-byte key
transform: string # "sha256", "milksad", "lcg:glibc", etc.
wif_compressed: string # WIF format
p2pkh: string # legacy address
p2wpkh: string # native segwit address
timestamp: timestamp # generation time
matched_target: string? # nullable - only for hits
Partitioning Strategy
- By transform (first level) - different workloads query different transforms
- By date (second level) - time-based pruning
- Auto-rotate chunks at N records or M megabytes
Feature Flags
[features]
default = []
storage = ["parquet", "arrow"] # Parquet output
storage-query = ["storage", "duckdb"] # + embedded query (optional)
CLI Integration
# Enable storage
vuke generate --storage ./results --transform milksad range --start 1 --end 1000000
# Query results (with storage-query feature)
vuke query ./results "SELECT COUNT(*) FROM results WHERE transform = 'milksad'"
Implementation Phases
Phase 1: Core Storage (MVP)
- --storage flag integration
Phase 2: Query Support
- vuke query subcommand
Phase 3: Cloud Integration (future)
Phase 4: GPU Optimization (future)
Open Questions
- Chunk size - 100MB? 256MB? Configurable?
- Compression - zstd (best ratio) vs lz4 (fastest)?
- Include ALL generated keys or only hits? - Decision: ALL (for reproducibility)
- Separate crate or in-tree module? - TBD based on dependency weight
References
Related