
Storage: Persistent Parquet-based result storage for TB-scale analysis #25

@oritwoen

Description

Problem

Currently, vuke writes results to the console or to simple files (CSV/TXT). Results are lost once a scan completes, making historical analysis impossible.

With future GPU acceleration (#9), we'll generate billions of keys per session. We need persistent, queryable storage that:

  • Handles terabyte-scale data (2³² seeds × ~100 bytes ≈ 430 GB per milksad scan)
  • Supports streaming writes at 100+ MB/s (GPU generation speed)
  • Is vendor-agnostic (not locked to any cloud provider)
  • Enables efficient querying without loading everything into RAM

Proposed Solution

Architecture

vuke generate/scan
        │
        ▼
┌─────────────────────────────────────────┐
│         StorageBackend (trait)          │
├─────────────────────────────────────────┤
│  write_batch(&mut self, records) → ()   │
│  flush(&mut self) → PathBuf             │
│  schema(&self) → Schema                 │
└─────────────────────────────────────────┘
        │
        ├──────────────┬──────────────┐
        ▼              ▼              ▼
   ParquetBackend  CSVBackend    (future)
        │
        ▼
┌─────────────────────────────────────────┐
│  Partitioned local files                │
│  results/                               │
│    transform=milksad/                   │
│      2025-01-15_chunk_0001.parquet      │
│      2025-01-15_chunk_0002.parquet      │
└─────────────────────────────────────────┘
        │
        ▼ (async, optional)
   Cloud Upload (R2/S3/GCS)
   + Iceberg catalog registration
        │
        ▼
   Query via DuckDB / R2 SQL / Spark / Polars
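The `StorageBackend` trait from the diagram could look roughly like the following Rust sketch. `Record`, `Schema`, and the in-memory `MemBackend` used to exercise the trait are placeholders invented for illustration, not the project's actual types:

```rust
use std::path::PathBuf;

// Placeholder types; the real implementation would wrap Arrow types.
#[derive(Clone)]
struct Record {
    source: String,
    transform: String,
}

struct Schema {
    fields: Vec<&'static str>,
}

// Mirrors the StorageBackend trait in the architecture diagram.
trait StorageBackend {
    /// Buffer a batch of records for writing.
    fn write_batch(&mut self, records: &[Record]);
    /// Flush buffered records to disk and return the written file's path.
    fn flush(&mut self) -> PathBuf;
    /// Describe the on-disk schema.
    fn schema(&self) -> Schema;
}

// Minimal in-memory backend, used here only to exercise the trait;
// ParquetBackend / CSVBackend would implement the same interface.
struct MemBackend {
    buffered: Vec<Record>,
    flushed: usize,
}

impl StorageBackend for MemBackend {
    fn write_batch(&mut self, records: &[Record]) {
        self.buffered.extend_from_slice(records);
    }
    fn flush(&mut self) -> PathBuf {
        self.flushed += self.buffered.len();
        self.buffered.clear();
        PathBuf::from("results/transform=milksad/2025-01-15_chunk_0001.parquet")
    }
    fn schema(&self) -> Schema {
        Schema { fields: vec!["source", "private_key", "transform"] }
    }
}

fn main() {
    let mut backend = MemBackend { buffered: Vec::new(), flushed: 0 };
    backend.write_batch(&[Record {
        source: "1".to_string(),
        transform: "milksad".to_string(),
    }]);
    let path = backend.flush();
    println!("flushed {} record(s) to {}", backend.flushed, path.display());
}
```

Keeping the trait this small means the GPU writer only ever talks to `write_batch`/`flush`, so backends can be swapped via feature flags without touching the generation path.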

Why Parquet?

  • Columnar format - excellent compression (10-20x for this data)
  • Industry standard - readable by every analytics tool
  • GPU-friendly - RAPIDS/cuDF read Parquet natively
  • Vendor-agnostic - just files, uploadable anywhere

Schema

source: string          # input value (seed, passphrase, etc.)
private_key: binary[32] # raw 32-byte key
transform: string       # "sha256", "milksad", "lcg:glibc", etc.
wif_compressed: string  # WIF format
p2pkh: string           # legacy address
p2wpkh: string          # native segwit address
timestamp: timestamp    # generation time
matched_target: string? # nullable - only for hits
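As a sketch, the schema maps onto a plain Rust record like the one below. Field names follow the schema above; `KeyRecord` itself and the epoch-millisecond timestamp representation are illustrative assumptions, not the project's actual code:

```rust
/// One generated-key record, mirroring the proposed schema.
/// `timestamp_ms` holds epoch milliseconds here for simplicity;
/// the on-disk column would use a proper Parquet timestamp type.
#[derive(Debug, Clone)]
struct KeyRecord {
    source: String,                 // input value (seed, passphrase, etc.)
    private_key: [u8; 32],          // raw 32-byte key -> fixed-size binary column
    transform: String,              // "sha256", "milksad", "lcg:glibc", ...
    wif_compressed: String,         // WIF format
    p2pkh: String,                  // legacy address
    p2wpkh: String,                 // native segwit address
    timestamp_ms: i64,              // generation time
    matched_target: Option<String>, // nullable: set only for hits
}

fn main() {
    let record = KeyRecord {
        source: "12345".to_string(),
        private_key: [0u8; 32],
        transform: "milksad".to_string(),
        wif_compressed: String::new(),
        p2pkh: String::new(),
        p2wpkh: String::new(),
        timestamp_ms: 0,
        matched_target: None,
    };
    println!("key bytes: {}", record.private_key.len());
}
```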

Partitioning Strategy

  • By transform (first level) - different workloads query different transforms
  • By date (second level) - time-based pruning
  • Auto-rotate chunks at N records or M megabytes
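The partitioning and rotation rules above can be sketched in a few lines of Rust. The helpers `chunk_path` and `should_rotate` are hypothetical names for illustration; the path layout matches the `results/transform=.../<date>_chunk_<n>.parquet` structure in the diagram:

```rust
use std::path::PathBuf;

/// Build a partitioned path: results/transform=<t>/<date>_chunk_<nnnn>.parquet
fn chunk_path(root: &str, transform: &str, date: &str, chunk: u32) -> PathBuf {
    PathBuf::from(root)
        .join(format!("transform={}", transform))
        .join(format!("{}_chunk_{:04}.parquet", date, chunk))
}

/// Rotate to a new chunk when EITHER limit is reached (N records or M bytes).
fn should_rotate(records: u64, bytes: u64, max_records: u64, max_bytes: u64) -> bool {
    records >= max_records || bytes >= max_bytes
}

fn main() {
    let path = chunk_path("results", "milksad", "2025-01-15", 1);
    println!("{}", path.display());
    println!("{}", should_rotate(500_000, 0, 1_000_000, 256 * 1024 * 1024));
}
```

Putting `transform=` first means tools like DuckDB and Spark can prune entire directories via Hive-style partition discovery before reading a single row group.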

Feature Flags

[features]
default = []
storage = ["parquet", "arrow"]           # Parquet output
storage-query = ["storage", "duckdb"]    # + embedded query (optional)

CLI Integration

# Enable storage
vuke generate --storage ./results --transform milksad range --start 1 --end 1000000

# Query results (with storage-query feature)
vuke query ./results "SELECT COUNT(*) FROM results WHERE transform = 'milksad'"

Implementation Phases

Phase 1: Core Storage (MVP)

Phase 2: Query Support

Phase 3: Cloud Integration (future)

Phase 4: GPU Optimization (future)

Open Questions

  1. Chunk size - 100MB? 256MB? Configurable?
  2. Compression - zstd (best ratio) vs lz4 (fastest)?
  3. Include ALL generated keys or only hits? - Decision: ALL (for reproducibility)
  4. Separate crate or in-tree module? - TBD based on dependency weight
