ShardPack is a storage format for large-scale datasets designed to solve common data pipeline bottlenecks by reducing the amount of required I/O operations. It provides random access reads, efficient partial data loading, and comprehensive metadata handling through a fast implementation that integrates with Python.
- [✅] Direct data access without archive parsing
- [✅] Random-access and partial reads
- [✅] Self-describing dataset organization
- [✅] Fast compression support
- [🛠️] FHE support
- [🛠️] Distributed training support