Efficient storage format for large-scale datasets designed to solve common data pipeline bottlenecks.

jsam/shardpack

shardpack

ShardPack is a storage format for large-scale datasets, designed to relieve common data pipeline bottlenecks by reducing the number of I/O operations required. It provides random-access reads, efficient partial data loading, and comprehensive metadata handling through a fast implementation that integrates with Python.

Key features:

  • [✅] Direct data access without archive parsing
  • [✅] Random-access and partial reads
  • [✅] Self-describing dataset organization
  • [✅] Fast compression support
  • [🛠️] FHE support
  • [🛠️] Distributed training support
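
To make "random-access and partial reads" concrete, here is a minimal sketch of the general technique: an offset index stored alongside the raw records, so a reader can seek straight to the bytes it needs instead of scanning an archive. This is not ShardPack's actual on-disk layout or API. The layout (records, then a JSON index, then an 8-byte index length) and the names `write_pack`, `read_index`, and `read_record` are hypothetical and chosen only for illustration.

```python
import json
import os
import struct
import tempfile

# Hypothetical layout (an assumption, not ShardPack's format):
#   [record bytes ...][JSON index][8-byte little-endian index length]


def write_pack(path, records):
    """Write named byte records, then an index of (offset, length) per record."""
    index = {}
    with open(path, "wb") as f:
        for name, payload in records.items():
            index[name] = (f.tell(), len(payload))
            f.write(payload)
        index_bytes = json.dumps(index).encode("utf-8")
        f.write(index_bytes)
        f.write(struct.pack("<Q", len(index_bytes)))


def read_index(path):
    """Load the index by reading only the tail of the file."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (index_len,) = struct.unpack("<Q", f.read(8))
        f.seek(-(8 + index_len), os.SEEK_END)
        return json.loads(f.read(index_len).decode("utf-8"))


def read_record(path, index, name, start=0, length=None):
    """Read one record, or a byte range within it, with a single seek."""
    offset, size = index[name]
    if start < 0 or start > size:
        raise ValueError("start is outside the record")
    # Clamp the requested range to the end of the record.
    end = size if length is None else min(size, start + length)
    with open(path, "rb") as f:
        f.seek(offset + start)
        return f.read(end - start)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.pack")
    write_pack(path, {"a": b"hello world", "b": b"0123456789"})
    index = read_index(path)
    print(read_record(path, index, "b"))                     # whole record
    print(read_record(path, index, "a", start=6, length=5))  # partial read
```

Each lookup costs one read of the file tail to load the index plus one seek per record, regardless of how many records the file holds. That is the kind of I/O reduction the feature list describes.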
