
[Packaging] Shrink wheel ~35 % via nvcc --compress-mode=size #1704


Conversation


@trmanish commented Jul 10, 2025

What this PR does

  • Appends --compress-mode=size to CMAKE_CUDA_FLAGS for nvcc ≥ 12.4.
  • No runtime or API changes.
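The gating logic can be sketched in Python. The `--compress-mode=size` flag and the 12.4 threshold come from the PR; the helper name and version parsing here are illustrative assumptions, not the PR's actual CMake code:

```python
def wants_size_compression(nvcc_version: str) -> bool:
    """Return True when nvcc is new enough for --compress-mode=size.

    `nvcc_version` is a dotted string such as "12.4" or "12.6.85".
    Hypothetical helper; the PR implements this check in CMake.
    """
    major, minor = (int(part) for part in nvcc_version.split(".")[:2])
    return (major, minor) >= (12, 4)

# The PR appends the flag only when the version check passes:
cuda_flags = ["-O3"]
if wants_size_compression("12.6"):
    cuda_flags.append("--compress-mode=size")
```

On older toolchains the check fails and the flag list is untouched, which is what keeps nvcc < 12.4 builds unchanged.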

Impact

  • Wheel shrinks 69 MB → 45 MB (≈ 35 %).

Compatibility

  • Builds with nvcc < 12.4 are unchanged; the flag is gated by a version check.
  • Decompression adds only a few hundred ms to the first `import bitsandbytes`.

@matthewdouglas
Member

Thanks, appreciate the suggestion! I have the same concern mentioned over on PyTorch regarding support for users with older drivers: pytorch/pytorch#157791 (comment)

Mainly it seems that this would require cu124+ users to have the 550+ driver, while currently we should still have compatibility for driver version 525+.

So we will have to weigh that as a consideration.

@trmanish
Author

I believe an earlier comment on the original PR said it wouldn't have that requirement:

pytorch/pytorch#157791 (comment)

But I believe the latest from PyTorch is as follows:

pytorch/pytorch#157791 (comment)

However, my understanding is (please correct me if I'm wrong) that the only variant that would be built with --compress-mode=size is the cu124 wheel, and that wheel already implies a 550-series driver. Users on 525/535 stay on the cu122/cu121 wheels, which this PR leaves untouched.

Options

  • Merge as-is – compression only affects the cu124 wheel, so there is no compatibility regression for existing users.
  • Opt-in flag – guard it behind ENABLE_BNB_CUDA_COMPRESSION=1; default off.
  • Dual wheels – publish both bitsandbytes-cu124.whl and -cu124-slim.whl.
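The opt-in option could be a tiny guard like the sketch below. ENABLE_BNB_CUDA_COMPRESSION is the variable name proposed above; the helper function itself is an illustrative assumption, not code from the PR:

```python
import os

def compression_enabled() -> bool:
    """Opt-in guard: compression is applied only when the user
    explicitly sets ENABLE_BNB_CUDA_COMPRESSION=1; default off.
    Hypothetical sketch of the proposed opt-in behavior."""
    return os.environ.get("ENABLE_BNB_CUDA_COMPRESSION", "0") == "1"

# Off by default; a user opts in by exporting the variable:
os.environ["ENABLE_BNB_CUDA_COMPRESSION"] = "1"
print(compression_enabled())
```

The default-off behavior means existing build pipelines see no change unless they deliberately enable compression.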

@matthewdouglas
Member

Hi,

We appreciate the effort and explanation, but unfortunately we cannot merge this. The assumption that applying this only to cu124+ builds limits the scope is flawed: cu124+ builds can still be used with older driver versions thanks to CUDA's Minor Version Compatibility. This means that the 12.4, 12.6, 12.8, and 12.9 builds can currently run on systems with driver v525.

We have sufficient evidence that this is a valid usage scenario. For example, vLLM received a similar PR and had to revert it for this reason: vllm-project/vllm#20853. See also: pytorch/pytorch#157791 (comment)

The three most recent minor PyTorch releases use cu124+ builds by default, and it's supported on systems with drivers v525+.

Publishing additional wheels adds extra complexity that we do not wish to take on.

With that said, we will use this option when we start producing builds for CUDA 13, which by default will use the "balanced" compression mode; at that point we can guarantee that all users support the "size" mode as well.

Additionally, we will explore further ways to limit our binary sizes:

  • Dropping support for older GPUs. In particular, Maxwell and Pascal could be dropped. Right now we do not support these in the CUDA 12.8/12.9 builds, but we're open to dropping support entirely.
  • Removing binaries for CUDA 12.0, 12.2, 12.3, 12.5, as PyTorch is not typically built with these versions. We would instead take advantage of compatibility and only load one of CUDA 11.8, 12.1, 12.4, 12.6, 12.8, and 12.9.
  • Eventually, we'll follow PyTorch's lead and drop the CUDA 11.8 build as well.
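The "only load one of" plan above amounts to selecting the newest shipped binary that does not exceed the detected CUDA version. This is a hypothetical sketch of that selection, not bitsandbytes' actual loader; the function name and version-tuple representation are assumptions:

```python
def pick_binary(runtime_version: tuple[int, int],
                available: list[tuple[int, int]]) -> tuple[int, int]:
    """Pick the newest shipped CUDA build not exceeding the detected
    CUDA runtime version, relying on minor-version compatibility.
    Hypothetical helper for illustration only."""
    candidates = [v for v in sorted(available) if v <= runtime_version]
    if not candidates:
        raise RuntimeError(f"no compatible binary for CUDA {runtime_version}")
    return candidates[-1]

# The builds the comment proposes to keep shipping:
shipped = [(11, 8), (12, 1), (12, 4), (12, 6), (12, 8), (12, 9)]
print(pick_binary((12, 5), shipped))  # a 12.5 system falls back to the 12.4 build
```

Under this scheme a system on CUDA 12.5 would load the 12.4 binary, so the 12.5-specific build (like 12.0, 12.2, and 12.3) no longer needs to ship.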
