Replies: 9 comments 17 replies
-
Amazon S3 Express One Zone high-performance storage class

AWS have recently launched a new storage tier for low-latency access to small files, at hundreds of thousands of operations per second. They claim "up to 10x better performance than the S3 Standard storage class", with single-digit millisecond latencies. It's expensive though! https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/ (Also see Azure Ultra Disk, introduced in 2019: sub-ms latency and high IOPS.)
-
Key-value storage & compression at the device level with NVMe v2 (for SSDs)

I could be wrong, but I think this could be super exciting for Zarr! It's almost as if this new hardware was designed for Zarr! The NVMe v2 standard includes a standard API for key-value storage. To be specific: the SSD itself maintains the key-value lookup table, and the SSD will (optionally) compress and decompress the value. In other words: the SSD hardware implements Zarr's store interface (a compressed key-value mapping) directly.
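To make the idea concrete, here's a minimal Python sketch of how a Zarr-style store could map onto a device-level key-value interface. This is purely illustrative: the `device` object and its method names (`retrieve`, `store`, `delete`, `list_keys`) are placeholders I've invented, not a real NVMe-KV API (real access would go via a vendor SDK or a kernel passthrough).

```python
# Purely illustrative sketch: a Zarr-style store backed by a device-level
# key-value interface. The `device` object and its method names are invented
# placeholders, NOT a real NVMe-KV API.
from collections.abc import MutableMapping


class HypotheticalNVMeKVStore(MutableMapping):
    """Maps Zarr store keys straight to device KV pairs: no filesystem,
    no path-to-byte-range translation, no logical block addresses."""

    def __init__(self, device):
        self.device = device  # e.g. a handle to an NVMe namespace opened in KV mode

    def __getitem__(self, key: str) -> bytes:
        value = self.device.retrieve(key.encode())  # one device command per key
        if value is None:
            raise KeyError(key)
        return value  # the device may have transparently decompressed this for us

    def __setitem__(self, key: str, value: bytes) -> None:
        self.device.store(key.encode(), value)  # the device may compress transparently

    def __delitem__(self, key: str) -> None:
        self.device.delete(key.encode())

    def __iter__(self):
        yield from (k.decode() for k in self.device.list_keys())

    def __len__(self) -> int:
        return sum(1 for _ in self)
```

Note that there's no batched "multi-get" in this sketch, which relates to the disadvantage discussed below.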
This could provide a large performance increase for Zarr. At the moment, Zarr has to translate from a store key to a path and byte range. The operating system then maps that path and byte range to a sequence of logical block addresses (where each block is usually 4 kbytes). The SSD then translates those logical block addresses to the physical location of the data in flash. In contrast, with NVMe's key-value storage, we'd give the key directly to the SSD, entirely skipping the filesystem. And we could have tiny chunk sizes (just a few bytes) if we wanted.

One disadvantage is that - as far as I can tell - there's no way to tell the SSD to get multiple values in a single operation. Each KV lookup requires one operation. This is in contrast to traditional block storage, where you can say "get me all the data between these two offsets".

I'll think about supporting NVMe key-value storage in light-speed-io. I'd imagine that we'd be able to make use of NVMe key-value storage in at least two ways: on our own hardware, we could install NVMe v2 SSDs today; and, in the near future, I'd guess that cloud object storage providers will start to use NVMe key-value storage to provide super-fast cloud object stores.

Further reading:
-
Random access to flash memory using CXL

In yesterday's Pangeo meeting, @TomNicholas talked about how frustrating it can be to think about chunks (the talk was deliberately provocative 🙂): scientists don't want to think about chunks. Chunks are an implementation detail. Like Tom, I'd also love to live in a world where we don't have to think about chunks (on disk). One way to achieve that would be to lay the data out on disk in whatever way you want1, and to have storage hardware that can provide random access to any byte on disk (instead of having to put data into chunks of at least 4 kbytes, and probably more like a few megabytes). It turns out that CXL allows access to terabytes of flash memory, as if it were RAM2. Kioxia recently demonstrated 1.3 TB of flash, connected over CXL. Maybe this is one route to get very-high-performance, random access to any arbitrary byte. This isn't available to buy yet, but it might be worth keeping an eye on. UPDATE: But flash isn't truly random access. You can only access it in units of pages (kbytes).

Footnotes
-
Here's a great one! Modern storage is plenty fast. It is the APIs that are bad.
https://itnext.io/modern-storage-is-plenty-fast-it-is-the-apis-that-are-bad-6a68319fbc1a
-
Mojo 🔥 — "the programming language for all AI developers"Mojo is a new, compiled programming language. You can download a preview version of Mojo. Mojo aims to be a super-set of Python (although it isn't there yet). Mojo is currently closed-source, but aims to be open-source soon. Mojo is only about 1 year old. The company behind Mojo is called Modular. Modular have raised about $130 million in VC funding, and Modular is co-founded by Chris Lattner, co-founder of LLVM, Clang, MLIR, and Swift. Modular recently announced their "MAX Platform": "The Modular Accelerated Xecution (MAX) platform is an integrated suite of tools for AI deployment that power all your AI workloads". I think Modular's main aim is to significantly reduce the compute costs for running ML inference (see this recent interview with Chris Lattner for more info). Of particular interest to us is that Mojo is laser-focused on speed of execution. Mojo compiles into MLIR (multi-level intermediate representation). This enables lots of optimisations (such as "kernel fusion"), and should also allow a single codebase to target multiple hardware architectures (including CPUs, TPUs, GPUs, etc.). SIMD is a core data type for Mojo (to the extent that scalars are represented as SIMD types with An optimistic take is that, after Mojo matures a little more, Mojo could allow us to re-implement large parts of the Scientific Python stack in such a way that we'd only have to write single implementations for each algorithm, and the Mojo compiler (+ MLIR + LLVM) would automatically optimise our code for each platform. Mojo would automatically fuse operations in our data pipelines (this isn't the same as the query optimisation that databases do, because "kernel fusion" doesn't know anything about IO, but kernel fusion could still be interesting for our compute-bound tasks). However, Mojo definitely isn't ready to replace Scientific Python libraries. Mojo can't yet call C or C++ code (although this is on their roadmap). Also, CPython can't call code written in Mojo. So Mojo is "viral", kind of like the GPL license: Code written in Mojo forces all down-stream users to also use Mojo (see this discussion). At the time of writing, fixing this isn't on Mojo's official roadmap, although an informal Discord discussion suggests that this is on their radar. Today, I can't use Mojo for my As others have commented on before, we've seen similar "hype cycles" before. There was a time when "Swift for TensorFlow" was the new hottness. Or, before that, Julia. It's hard (impossible?) to predict ahead-of-time which languages will really bed in. In some ways, I'm even more excited about "Rust for Python" now, after I've read about Mojo. It's true that Rust doesn't include every optimisation that Mojo promises (like auto-tuning). But - like Mojo going from MLIR to LLVM - the Rust compiler performs two broad levels of optimisation: first on a mid-level intermediate representation (MIR), and again within LLVM. And Rust has experimental support for SIMD data types in the standard library. Tuning Rust for performance is well documented. And Mojo is planning to implement a borrow checker, which helps to validate that borrow checking is a good idea. And, unlike Mojo, Rust makes it super-easy to call C/C++ code, and super-easy to interact with CPython. That said, I'll definitely be keeping a close eye on Mojo. Whilst Mojo isn't ready yet, in the future it might be an interesting language to help accelerate compute-bound Scientific Python tasks. 
And, in Chris Lattner's interview on The Latent Space podcast (at around 40:40), Chris very briefly mentions speeding up data loading for ML, which is very close to my heart.
-
NVDRAM: Micron's byte-addressable, high-performance, non-volatile memory based on ferro-electric RAM (FeRAM)

Micron are working on NVDRAM: a non-volatile memory technology that is byte-addressable, faster than NAND flash (the technology behind modern SSDs), more energy-efficient than NAND flash, and has the potential to have a high data storage density. But it'll be slower than SDRAM. And it's unclear if it'll actually become a product!

NVDRAM is exciting for folks (like me) who want fast, random access to multi-dimensional arrays that are too large to fit into RAM. To give some context: despite my love for Zarr(!), I'd actually like to live in a world where we don't have to chunk our data on disk 🙂. In this dream world (which may never materialise!), we'd store uncompressed1 ndim arrays on disk in the same simple structure that we store ndim arrays in RAM. The trouble is that, to be able to read arbitrary sub-selections of the data, we need to be able to address individual bytes2. For example, say we have a 2D uint8 array of size 1024x1024 in row-major layout, and we want to read a 64x64 crop. Then we'd want to read 64 bytes (the first row of our 64x64 crop), then skip ahead 960 bytes to the start of the next row, then read another 64 bytes (the second row), and so on. If you try to do this on existing HDDs and SSDs then you end up reading a lot of data that you don't need. That's because HDDs and SSDs aren't byte-addressable: on a HDD, the smallest unit you can read is a sector (usually 512 bytes). On an SSD, the smallest unit you can read is a page (usually 4,096 bytes). So, on an SSD, if you want to read just a single byte, then you end up reading 4,096 bytes (the entire page that contains the single byte you want). (There's a small worked sketch of this arithmetic at the end of this comment.)

NVDRAM is interesting because it is byte-addressable, and is faster than NAND flash. But it's possible that NVDRAM may never become a product. That said, it sounds like multiple memory companies are exploring similar technologies, and Micron might use parts of NVDRAM in a product.

In some ways, the main point of this comment is just to illustrate that NAND flash isn't necessarily the "last word" for high-performance non-volatile storage. There are several exciting technologies in the pipeline 🙂 that might be very interesting for folks who need high-performance access to large multi-dimensional arrays. And these technologies further illustrate that there's a need for a computationally efficient and high-performance IO software stack, because the old assumption that IO is orders of magnitude slower than RAM is becoming less and less accurate.

Footnotes
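Here's the small worked sketch of the read-amplification arithmetic from the 64x64-crop example above (the crop position is an arbitrary value I've made up for illustration):

```python
# Worked sketch of the read-amplification arithmetic for a 64x64 crop of a
# 1024x1024 row-major uint8 array, read from a page-addressable SSD.
ROW_STRIDE = 1024          # bytes between the starts of consecutive rows
CROP_H, CROP_W = 64, 64
row0, col0 = 100, 200      # top-left corner of the crop (made up for the example)
PAGE = 4096                # typical SSD page size in bytes

# The byte ranges we actually want: 64 reads of 64 bytes each.
wanted = [(r * ROW_STRIDE + col0, CROP_W) for r in range(row0, row0 + CROP_H)]
wanted_bytes = sum(length for _, length in wanted)  # 64 * 64 = 4,096 bytes

# On a page-addressable SSD, every 64-byte read drags in whole 4,096-byte pages.
pages_touched = {
    page
    for offset, length in wanted
    for page in range(offset // PAGE, (offset + length - 1) // PAGE + 1)
}
ssd_bytes = len(pages_touched) * PAGE

print(f"bytes wanted: {wanted_bytes}, bytes the SSD must read: {ssd_bytes} "
      f"({ssd_bytes / wanted_bytes:.0f}x read amplification)")
```

On byte-addressable storage, the second number would equal the first.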
-
The One Billion Row Challenge

The One Billion Row Challenge is a programming challenge where folks compete to create the fastest programme that can process a text file with 1 billion rows. The "official" competition is restricted to Java; but plenty of folks are giving it a shot in different languages. Each row contains a temperature reading for a single location, in the form <station name>;<temperature> (for example, Hamburg;12.0).
There are guaranteed to be no more than 10,000 unique locations. So, on average, the text file will contain 100,000 measurements per location. The output of the programme is the min, mean, and max temperature for each location. The source text file is 12 GBytes and is stored in a RAM disk. From what I understand, the slowest part of the process is parsing the strings in the text file. All submissions are benchmarked on a single machine. Each submission is allowed to use 8 CPU cores of a Zen 2 AMD EPYC 7502P (released in 2019) with 128 GB of RAM. The naive baseline solution takes almost 5 minutes. The fastest solution so far runs in 2.5 seconds(!) when restricted to 8 CPU cores. When allowed to use all 32 CPU cores, the fastest solution runs in 0.8 seconds! The deadline for submissions is 31st Jan 2024. There is no cash prize.

Why do I mention this on a Zarr discussion forum?!

I mention this for several reasons:

CPUs are fast 🙂

It's interesting that the best solution can process a 12 GB text file in 0.8 seconds. This informally supports my hunch that, in general, a lot of processing on "big data" should operate at the speed of the IO. Modern CPUs can go very fast if we let them. Especially because, in the Zarr community, our data is usually stored in pretty CPU-friendly forms (n-dim arrays). In contrast, a text file is quite a pain to parse into numerical values. To pick an example at random, in Open Climate Fix we have a script which converts GRIB files to Zarr. This takes about 30 minutes to process 50 GBytes of GRIB files (on a single machine, from an SSD), even though there's very little actual "processing" happening. My hunch is that it should be possible to perform these simple "rechunking" operations at the speed of the IO. So, with a PCIe 4 SSD able to sustain about 5 GBytes/s, it should only take around 20 seconds to convert 50 GB. That's approximately 100 times faster than the current solution!

Maybe we could run a similar competition?

Maybe the Zarr / Pangeo / Scientific Python community could run similar programming competitions. The aim would be to harness the collective intelligence of hundreds of people to improve the performance of an algorithm that's important to us. A non-trivial problem is to find a programming challenge that's simple to explain and simple to implement a naive solution for, but realistic enough to actually be useful to the community. Perhaps something like loading 1 TByte of multi-dimensional Zarr chunks and computing the min, mean, and max?
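To make that last suggestion a bit more concrete, here's a rough sketch of what a naive baseline for such a challenge might look like. The dataset path is made up, and I'm assuming the data is a single Zarr array that zarr-python can open; a real challenge spec would pin all of this down.

```python
# A rough sketch of a naive baseline for the hypothetical challenge described
# above. "measurements.zarr" is an invented path; I'm assuming a single Zarr
# array readable by zarr-python.
import itertools

import numpy as np
import zarr

z = zarr.open("measurements.zarr", mode="r")  # hypothetical ~1 TB array

running_min, running_max, running_sum, count = np.inf, -np.inf, 0.0, 0

# Walk the array one chunk at a time, so only one chunk is decompressed in RAM.
chunk_starts = [range(0, size, chunk) for size, chunk in zip(z.shape, z.chunks)]
for starts in itertools.product(*chunk_starts):
    selection = tuple(slice(s, s + c) for s, c in zip(starts, z.chunks))
    block = z[selection]  # reads and decompresses (at most) one chunk
    running_min = min(running_min, float(block.min()))
    running_max = max(running_max, float(block.max()))
    running_sum += float(block.sum())
    count += block.size

print("min:", running_min, "mean:", running_sum / count, "max:", running_max)
```

The fun of the competition would be in seeing how much faster than a baseline like this people could go (concurrent chunk reads, SIMD reductions, overlapping IO with compute, and so on).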
-
Paper: "BtrBlocks: Efficient Columnar Compression for Data Lakes"This paper came out in June 2023. The DOI is 10.1145/3589263. The authors are Kuschewski, Sauerwein, Alhomssi, and Leis. The paper contains lots of juicy ideas and experimental results. Their first assertion is that, when reading cloud storage buckets from VMs with fast network adaptors (100 Gbps+), then using "heavyweight" compressors like zstd mean that throughput is CPU-bound, and not IO-bound. The authors show that it's possible to achieve near-optimal throughput (and good compression ratios) using a "cascade" of simple encodings (RLE, dictionary, etc.). Especially if the implementations are SIMD-optimised. The authors also describe a novel encoding for floating point numbers. The paper goes on to describe a simple algorithm for automatically select the best combination of "simple" encodings. In some cases, this uses simple summary statistics of each block. The paper finishes with an empirical evaluation of BtrBlocks. (I first heard about the BtrBlocks paper whilst reading about Vortex, and exciting new "Apache Arrow-compatible toolkit for working with compressed array data".) It feels like Zarr could adopt several ideas from this paper (and, indeed, Zarr already is using some of these ideas!). |
-
SSD News: 3 million I/O operations per second at 4 kB random reads, with a 122 TB SSD due in Q1 2025

Phison have just announced that they'll start manufacturing enterprise SSDs under their new "Pascari" brand. The X200 SSDs are capable of an impressive 3 million I/O operations per second (IOPS) for 4 kByte random reads (so they should sustain 3M x 4 kB = 12 GB/sec when reading random 4 kB chunks). And they are planning to release a 122 TB SSD in Q1 2025.

My view is that very exciting things are happening in the world of storage hardware, and that these improvements will inevitably have knock-on effects for the storage of large multi-dimensional arrays. Yes, this very high-performance storage hardware will probably first be deployed by the small group of orgs and researchers who build their own on-prem high-performance hardware. But it feels inevitable that cloud storage buckets will get a lot faster if/when SSDs replace HDDs for mainstream cloud storage buckets, combined with 200 Gbps and 400 Gbps network interface cards.
-
For those of us interested in improving the performance of Zarr, I thought it might be fun to have a little "discussion group" of literature on high-performance storage systems 🙂.
Paper: TensorBank: Tensor Lakehouse for Foundation Model Training
Here's a paper to kick us off: Kienzler et al., Sept 2023, "TensorBank: Tensor Lakehouse for Foundation Model Training" on arXiv. All eleven authors are from IBM Research.
Their focus is very similar to mine: training large ML models on huge volumes of multi-dimensional data, especially numerical weather predictions and satellite data. (Many of the same authors are involved in IBM's work on foundational ML models for weather & climate.)
To quote their paper:
The authors use cloud object storage, Zarr, Dask, and Xarray. They use HTTP range reads to read subsets of chunks.
The authors introduce domain-specific Hierarchical Statistical Indices (HSI):
(Jack's note: Is this a bit like ZEP 5 - Zarr-based Chunk-level Accumulation in Reduced Dimensions?)
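To illustrate the general flavour (this is my own toy sketch, not the paper's actual HSI implementation): keep per-chunk summary statistics in a small side-car index, then consult that index to decide which chunks can be skipped for a given query.

```python
# Toy sketch of chunk-level statistics used to skip chunks (my own illustration,
# not the paper's HSI design).
import numpy as np


def build_chunk_index(array: np.ndarray, chunk_len: int) -> list[dict]:
    """Record min/max for each chunk of a 1-D array."""
    index = []
    for start in range(0, len(array), chunk_len):
        chunk = array[start:start + chunk_len]
        index.append({"start": start, "min": chunk.min(), "max": chunk.max()})
    return index


def chunks_to_read(index: list[dict], lo: float, hi: float) -> list[int]:
    """Return the starts of chunks whose [min, max] range overlaps [lo, hi]."""
    return [e["start"] for e in index if e["max"] >= lo and e["min"] <= hi]


data = np.sin(np.linspace(0, 20, 1_000_000)).astype(np.float32)
index = build_chunk_index(data, chunk_len=100_000)
# Only chunks that might contain values above 0.99 need to be fetched and decompressed.
print(chunks_to_read(index, lo=0.99, hi=1.0))
```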
The authors run some performance analyses from single VMs. They want to find out if their stack can saturate the network IO (using a 50 Gbit/s link in an HPC environment, and 25 Gbit/s NICs from an AWS instance to S3). In their HPC setup, they saturated the 50 Gbit/s link using 10 parallel threads, corresponding to ~6.1 GB/s. On AWS, they required 128 threads to saturate the network bandwidth, achieving ~3.1 GB/s. The huge number of threads required on AWS is curious. The authors say: "Although not yet verified, we assume HTTP imposes some significant overhead and latency in comparison to the protocol used by GPFS".
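As a back-of-envelope check of those numbers (my own arithmetic, not from the paper):

```python
# Back-of-envelope check of the throughput numbers quoted above.
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert link speed from Gbit/s to GB/s (ignoring protocol overheads)."""
    return gbit_per_s / 8

hpc_link = gbit_to_gbyte(50)   # ~6.25 GB/s, so ~6.1 GB/s is essentially saturated
aws_link = gbit_to_gbyte(25)   # ~3.125 GB/s, so ~3.1 GB/s is essentially saturated

# Per-thread throughput needed to saturate each link:
print(f"HPC/GPFS: {hpc_link / 10:.2f} GB/s per thread (10 threads)")
print(f"AWS/S3:   {aws_link / 128:.3f} GB/s per thread (128 threads)")
```

That works out at roughly 0.6 GB/s per thread over GPFS versus roughly 25 MB/s per thread to S3, which makes the authors' suspicion about per-request HTTP overhead seem plausible.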
Jack's thoughts: It's great to see other people training large ML models from Zarr! The paper contributes an interesting idea (the HSI) which feels like it could be the basis for a ZEP?! The authors' stack is Zarr, Dask, Xarray, and some custom code on top to implement the HSI. TBH, it's kind of a shame that the authors didn't mention Zarr, Dask, or xarray in the title or abstract. But ho hum 🙂 (although they do cite the pangeo paper).