
Storage: Persistent Parquet-based result storage for TB-scale analysis #25

@oritwoen

Description

Problem

Currently, vuke writes results to the console or to simple files (CSV/TXT). Results are lost once a scan completes, making historical analysis impossible.

With future GPU acceleration (#9), we'll generate billions of keys per session. We need persistent, queryable storage that:

  • Handles terabyte-scale data (2³² seeds × ~100 bytes ≈ 430 GB per milksad scan)
  • Supports streaming writes at 100+ MB/s (GPU generation speed)
  • Is vendor-agnostic (not locked to any cloud provider)
  • Enables efficient querying without loading everything into RAM

Proposed Solution

Architecture

vuke generate/scan
        │
        ▼
┌─────────────────────────────────────────┐
│         StorageBackend (trait)          │
├─────────────────────────────────────────┤
│  write_batch(&mut self, records) → ()   │
│  flush(&mut self) → PathBuf             │
│  schema(&self) → Schema                 │
└─────────────────────────────────────────┘
        │
        ├──────────────┬──────────────┐
        ▼              ▼              ▼
   ParquetBackend  CSVBackend    (future)
        │
        ▼
┌─────────────────────────────────────────┐
│  Partitioned local files                │
│  results/                               │
│    transform=milksad/                   │
│      2025-01-15_chunk_0001.parquet      │
│      2025-01-15_chunk_0002.parquet      │
└─────────────────────────────────────────┘
        │
        ▼ (async, optional)
   Cloud Upload (R2/S3/GCS)
   + Iceberg catalog registration
        │
        ▼
   Query via DuckDB / R2 SQL / Spark / Polars
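The `StorageBackend` trait from the diagram could look roughly like the following Rust sketch. `Record`, `Schema`, and the in-memory `MemBackend` used to exercise the trait are placeholders invented for illustration, not the project's actual types:

```rust
use std::path::PathBuf;

// Placeholder types; the real implementation would wrap Arrow types.
#[derive(Clone)]
struct Record {
    source: String,
    transform: String,
}

struct Schema {
    fields: Vec<&'static str>,
}

// Mirrors the StorageBackend trait in the architecture diagram.
trait StorageBackend {
    /// Buffer a batch of records for writing.
    fn write_batch(&mut self, records: &[Record]);
    /// Flush buffered records to disk and return the written file's path.
    fn flush(&mut self) -> PathBuf;
    /// Describe the on-disk schema.
    fn schema(&self) -> Schema;
}

// Minimal in-memory backend, used here only to exercise the trait;
// ParquetBackend / CSVBackend would implement the same interface.
struct MemBackend {
    buffered: Vec<Record>,
    flushed: usize,
}

impl StorageBackend for MemBackend {
    fn write_batch(&mut self, records: &[Record]) {
        self.buffered.extend_from_slice(records);
    }
    fn flush(&mut self) -> PathBuf {
        self.flushed += self.buffered.len();
        self.buffered.clear();
        PathBuf::from("results/transform=milksad/2025-01-15_chunk_0001.parquet")
    }
    fn schema(&self) -> Schema {
        Schema { fields: vec!["source", "private_key", "transform"] }
    }
}

fn main() {
    let mut backend = MemBackend { buffered: Vec::new(), flushed: 0 };
    backend.write_batch(&[Record {
        source: "1".to_string(),
        transform: "milksad".to_string(),
    }]);
    let path = backend.flush();
    println!("flushed {} record(s) to {}", backend.flushed, path.display());
}
```

Keeping the trait this small means the GPU writer only ever talks to `write_batch`/`flush`, so backends can be swapped via feature flags without touching the generation path.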

Why Parquet?

  • Columnar format - excellent compression (10-20x for this data)
  • Industry standard - readable by every analytics tool
  • GPU-friendly - RAPIDS/cuDF read Parquet natively
  • Vendor-agnostic - just files, uploadable anywhere

Schema

source: string          # input value (seed, passphrase, etc.)
private_key: binary[32] # raw 32-byte key
transform: string       # "sha256", "milksad", "lcg:glibc", etc.
wif_compressed: string  # WIF format
p2pkh: string           # legacy address
p2wpkh: string          # native segwit address
timestamp: timestamp    # generation time
matched_target: string? # nullable - only for hits
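As a sketch, the schema maps onto a plain Rust record like the one below. Field names follow the schema above; `KeyRecord` itself and the epoch-millisecond timestamp representation are illustrative assumptions, not the project's actual code:

```rust
/// One generated-key record, mirroring the proposed schema.
/// `timestamp_ms` holds epoch milliseconds here for simplicity;
/// the on-disk column would use a proper Parquet timestamp type.
#[derive(Debug, Clone)]
struct KeyRecord {
    source: String,                 // input value (seed, passphrase, etc.)
    private_key: [u8; 32],          // raw 32-byte key -> fixed-size binary column
    transform: String,              // "sha256", "milksad", "lcg:glibc", ...
    wif_compressed: String,         // WIF format
    p2pkh: String,                  // legacy address
    p2wpkh: String,                 // native segwit address
    timestamp_ms: i64,              // generation time
    matched_target: Option<String>, // nullable: set only for hits
}

fn main() {
    let record = KeyRecord {
        source: "12345".to_string(),
        private_key: [0u8; 32],
        transform: "milksad".to_string(),
        wif_compressed: String::new(),
        p2pkh: String::new(),
        p2wpkh: String::new(),
        timestamp_ms: 0,
        matched_target: None,
    };
    println!("key bytes: {}", record.private_key.len());
}
```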

Partitioning Strategy

  • By transform (first level) - different workloads query different transforms
  • By date (second level) - time-based pruning
  • Auto-rotate chunks at N records or M megabytes
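The partitioning and rotation rules above can be sketched in a few lines of Rust. The helpers `chunk_path` and `should_rotate` are hypothetical names for illustration; the path layout matches the `results/transform=.../<date>_chunk_<n>.parquet` structure in the diagram:

```rust
use std::path::PathBuf;

/// Build a partitioned path: results/transform=<t>/<date>_chunk_<nnnn>.parquet
fn chunk_path(root: &str, transform: &str, date: &str, chunk: u32) -> PathBuf {
    PathBuf::from(root)
        .join(format!("transform={}", transform))
        .join(format!("{}_chunk_{:04}.parquet", date, chunk))
}

/// Rotate to a new chunk when EITHER limit is reached (N records or M bytes).
fn should_rotate(records: u64, bytes: u64, max_records: u64, max_bytes: u64) -> bool {
    records >= max_records || bytes >= max_bytes
}

fn main() {
    let path = chunk_path("results", "milksad", "2025-01-15", 1);
    println!("{}", path.display());
    println!("{}", should_rotate(500_000, 0, 1_000_000, 256 * 1024 * 1024));
}
```

Putting `transform=` first means tools like DuckDB and Spark can prune entire directories via Hive-style partition discovery before reading a single row group.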

Feature Flags

[features]
default = []
storage = ["parquet", "arrow"]           # Parquet output
storage-query = ["storage", "duckdb"]    # + embedded query (optional)

CLI Integration

# Enable storage
vuke generate --storage ./results --transform milksad range --start 1 --end 1000000

# Query results (with storage-query feature)
vuke query ./results "SELECT COUNT(*) FROM results WHERE transform = 'milksad'"

Implementation Phases

Phase 1: Core Storage (MVP)

Phase 2: Query Support

Phase 3: Cloud Integration (future)

Phase 4: GPU Optimization (future)

Open Questions

  1. Chunk size - 100MB? 256MB? Configurable?
  2. Compression - zstd (best ratio) vs lz4 (fastest)?
  3. Include ALL generated keys or only hits? - Decision: ALL (for reproducibility)
  4. Separate crate or in-tree module? - TBD based on dependency weight
