Skip to content

Bit-Packing Codec for ADC Data Optimization #2942

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

asimchoudhary
Copy link
Contributor

This PR implements a prototype BitPackingCodec that efficiently stores ADC integer values (10-bit, 12-bit) that typically waste space when stored as 16-bit integers.

Key Features

  • Reduces storage requirements by ~37.5% for 10-bit ADC data
  • Maintains data integrity through careful bit manipulation
  • Simple API requiring only bits_per_value and original_dtype parameters

Implementation Details

The codec works by:

  1. Accepting arrays with values that only use N bits (where N < standard bit widths)
  2. Packing these values contiguously, crossing byte boundaries as needed
  3. Unpacking values during decoding while preserving original data integrity

Key algorithms:

  • Bit Masking: Extracts only necessary bits
  • Byte Boundary Handling: Special logic for values that span multiple bytes
  • Efficient Unpacking: Reconstructs original values without data loss

Demo

Tried to bitpack an array of 4 values with dtype = uint16 into 10 bits

Packing Data
image

Unpacking Data
image

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Apr 1, 2025
@d-v-b
Copy link
Contributor

d-v-b commented Apr 1, 2025

thanks for working on this. It seems like numcodecs already has a packbits codec, and @jbms has a specification for a more general form of that codec.

I'm not an expert but it seems to me that your PR and these other packbits codecs are doing something slightly different. Because the names are so similar, it might be good to disambiguate the various functionalities here.

@LDeakin
Copy link
Contributor

LDeakin commented Apr 1, 2025

This would be best suited as an array-to-bytes codec (like packbits) because:

  • The order of the input bytes going into your bytes-to-bytes codec depends on the endian parameter of the bytes codec
  • You would not need to pass in the original_dtype
    • Also, the dtype of the input to an array-to-bytes codec (in chunk_spec) may not match the dtype of the array, array-to-array codecs can change it

@asimchoudhary
Copy link
Contributor Author

Hi @d-v-b,

Thank you for your feedback and for highlighting these related implementations!

You're absolutely right—the naming similarity could lead to confusion. Would ADCPackCodec be a clearer alternative?

Looking forward to your thoughts!

@asimchoudhary
Copy link
Contributor Author

Hii, @LDeakin

Thank you for the insightful feedback!

You’re absolutely right. ArrayBytesCodec is a better fit than BytesBytesCodec for this bit-packing implementation , shifting to an ArrayBytesCodec would simplify the implementation by eliminating the need for original_dtype and ensuring better alignment with the pipeline.

Initially, I modeled this after BloscCodec as a BytesBytesCodec, viewing it mainly as compression. But as I dug deeper, I realized the array-aware nature of bit-packing might suit ArrayArrayCodec better. My hesitation was around handling NDBuffer objects correctly.

Would you recommend ArrayBytesCodec over ArrayArrayCodec here?
I’ll refactor accordingly to simplify the code .

Appreciate your guidance!

@jbms
Copy link

jbms commented Apr 1, 2025

This should definitely be an array -> bytes codec.

However, potentially it could make sense to combine this with the packbits codec that I proposed (as an extension of the numcodecs packbits codec).

My packbits proposal is designed to work with bool as well as non-whole-number-of-bytes dtypes like int2, int4, float4_*, that I also proposed as extensions.

In contrast, your codec proposed here works with the existing whole-number-of-bytes data types and retains only the specified number of least-significant bits.

Would it make sense to extend my packbits proposal to allow optionally specifying a range of bits to retain, e.g. start_bit and end_bit? I can imagine that someone may wish to keep just the most-significant bits in some cases.

@jbms
Copy link

jbms commented Apr 1, 2025

Note: A separate issue is that your implementation, being pure Python that loops over each element, will surely be very slow. You will want to implement in cython or otherwise ensure the bulk of the work gets handled by optimized native code.

@asimchoudhary
Copy link
Contributor Author

Thank you for your detailed feedback, @jbms!

I'll definitely refactor this as an array → bytes codec as you and @LDeakin suggested.

Your suggestion about extending packbits with start_bit and end_bit parameters is excellent! This would indeed provide more flexibility for selecting which bits to retain. I'd be happy to help.

Regarding performance, you're absolutely right. My current implementation is a proof-of-concept focused on correctness, but production use would require optimization. I'd be interested in exploring:

  1. Cython implementation for the core bit manipulation logic
  2. Use of native code wherever possible.
  3. Numba for JIT compilation

@jbms
Copy link

jbms commented Apr 8, 2025

I modified my packbits proposal to include a first_bit and last_bit option:
https://github.com/zarr-developers/zarr-extensions/pull/3/files#diff-8c721f8bc1d30cef0875281a1a9f922d762a0998f1601cd05022e544ed8df271

It is now a fair bit more complicated though --- but still it probably makes sense to keep those options integrated rather than as a separate codec. There could still be partial implementations of packbits that do not support them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants