Commit 99ff7f7

Merge branch 'main' into tp

2 parents: 804a4f7 + 62c7ea0
File tree: 246 files changed, +2635 -1100 lines

.circleci/config.yml (+2 -2)

@@ -58,7 +58,7 @@ jobs:
   - run:
       name: "Prepare pipeline parameters"
       command: |
-        python utils/process_test_artifacts.py
+        python utils/process_test_artifacts.py

   # To avoid too long generated_config.yaml on the continuation orb, we pass the links to the artifacts as parameters.
   # Otherwise the list of tests was just too big. Explicit is good but for that it was a limitation.
@@ -110,7 +110,7 @@ jobs:
   - run:
       name: "Prepare pipeline parameters"
      command: |
-        python utils/process_test_artifacts.py
+        python utils/process_test_artifacts.py

   # To avoid too long generated_config.yaml on the continuation orb, we pass the links to the artifacts as parameters.
   # Otherwise the list of tests was just too big. Explicit is good but for that it was a limitation.

.github/ISSUE_TEMPLATE/bug-report.yml (+1)

@@ -106,6 +106,7 @@ body:
       label: Reproduction
       description: |
         Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
+        Please include relevant config information with your code, for example your Trainers, TRL, Peft, and DeepSpeed configs.
         If you have code snippets, error messages, stack traces please provide them here as well.
         Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
         Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.

docker/transformers-quantization-latest-gpu/Dockerfile (+3)

@@ -53,6 +53,9 @@ RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
 # Add vptq for quantization testing
 RUN python3 -m pip install --no-cache-dir vptq

+# Add spqr for quantization testing
+RUN python3 -m pip install --no-cache-dir spqr_quant[gpu]
+
 # Add hqq for quantization testing
 RUN python3 -m pip install --no-cache-dir hqq

docs/source/en/_toctree.yml (+4)

@@ -166,6 +166,8 @@
   - local: quantization/aqlm
     title: AQLM
   - local: quantization/vptq
+    title: SpQR
+  - local: quantization/spqr
     title: VPTQ
   - local: quantization/quanto
     title: Quanto
@@ -185,6 +187,8 @@
     title: BitNet
   - local: quantization/compressed_tensors
     title: compressed-tensors
+  - local: quantization/finegrained_fp8
+    title: Fine-grained FP8
   - local: quantization/contribute
     title: Contribute new quantization method
   title: Quantization Methods

docs/source/en/main_classes/data_collator.md (+3)

@@ -71,3 +71,6 @@ Examples of use can be found in the [example scripts](../examples) or [example n

 [[autodoc]] data.data_collator.DataCollatorWithFlattening

+# DataCollatorForMultipleChoice
+
+[[autodoc]] data.data_collator.DataCollatorForMultipleChoice
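The new autodoc entry exposes the multiple-choice collator that previously existed only as a copy-paste snippet in the task guide. A minimal sketch of how it might be used (the checkpoint and the feature layout are illustrative assumptions, not part of this diff):

```py
from transformers import AutoTokenizer, DataCollatorForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")  # assumed checkpoint
collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

# Each feature holds one tokenized sequence per answer choice plus the index of the correct choice.
features = [
    {
        "input_ids": [
            tokenizer("The sky is", "blue").input_ids,
            tokenizer("The sky is", "a potato").input_ids,
        ],
        "label": 0,
    }
]

batch = collator(features)
print(batch["input_ids"].shape)  # (batch_size, num_choices, padded_sequence_length)
print(batch["labels"])           # tensor([0])
```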

docs/source/en/main_classes/quantization.md (+8)

@@ -80,3 +80,11 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
 ## BitNetConfig

 [[autodoc]] BitNetConfig
+
+## SpQRConfig
+
+[[autodoc]] SpQRConfig
+
+## FineGrainedFP8Config
+
+[[autodoc]] FineGrainedFP8Config

docs/source/en/model_doc/helium.md (+4 -8)

@@ -107,24 +107,20 @@ Tips:

 ## Usage tips

-`Helium` can be found on the [Huggingface Hub](https://huggingface.co/collections/kyutai/helium-1-preview)
+`Helium` can be found on the [Huggingface Hub](https://huggingface.co/models?other=helium)

 In the following, we demonstrate how to use `helium-1-preview` for the inference.

 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
 >>> device = "cuda" # the device to load the model onto

->>> model = AutoModelForCausalLM.from_pretrained("helium-1-preview", device_map="auto")
->>> tokenizer = AutoTokenizer.from_pretrained("helium-1-preview")
+>>> model = AutoModelForCausalLM.from_pretrained("kyutai/helium-1-preview-2b", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")

 >>> prompt = "Give me a short introduction to large language model."

->>> messages = [{"role": "user", "content": prompt}]
-
->>> text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-
->>> model_inputs = tokenizer([text], return_tensors="pt").to(device)
+>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(device)

 >>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
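The hunk ends inside the code block, before the generated tokens are decoded. For reference, a self-contained version of the updated snippet might look like the sketch below; the dtype choice and the final decode step are assumptions, not part of the diff.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kyutai/helium-1-preview-2b"  # checkpoint referenced in the diff
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."
model_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)

# Decode only the newly generated tokens (assumed continuation of the guide's snippet).
print(tokenizer.decode(generated_ids[0, model_inputs.input_ids.shape[1]:], skip_special_tokens=True))
```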

docs/source/en/quantization/finegrained_fp8.md (new file, +62)

@@ -0,0 +1,62 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Fine-grained FP8

With the fine-grained FP8 quantization method, you can quantize your model to FP8 (W8A8):

- Weights are quantized to 8 bits (FP8) per 2D block (e.g. `weight_block_size=(128, 128)`), inspired by the DeepSeek implementation.
- Activations are quantized to 8 bits (FP8) per group per token, with the group size matching that of the weights along the input channels (128 by default).

This method was implemented to support the DeepSeek-V3 and DeepSeek-R1 models; see the paper [here](https://arxiv.org/pdf/2412.19437). The image below explains the quantization scheme:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/b7b3b34bf826a6423ea82ffc57ecac80c46c3c76/transformers/quantization/quantization_deepseek.png)

> [!TIP]
> You need a GPU with compute capability >= 9.0 (e.g. H100).

Before you begin, make sure the following libraries are installed with their latest version:

```bash
pip install --upgrade accelerate torch
```

> [!TIP]
> You need to install a torch version compatible with your GPU's CUDA version.

By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in, such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file and automatically load the most memory-optimal data type.

```py
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = FineGrainedFP8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

A quantized model can be saved with `save_pretrained` and reloaded with `from_pretrained`.

```py
quant_path = "/path/to/save/quantized/model"
quantized_model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
```
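To make the per-block weight scheme concrete, here is a small illustrative sketch of 128x128 block quantization to FP8 (e4m3). This is not the library's internal implementation; the block size, FP8 format, and scale layout are assumptions based on the description above, and it requires a torch version with FP8 dtypes.

```py
import torch

def quantize_fp8_blockwise(weight: torch.Tensor, block: int = 128):
    """Toy per-(block x block) FP8 quantization: one float32 scale per tile.

    Assumes the weight dimensions are divisible by `block`.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    rows, cols = weight.shape
    qweight = torch.empty_like(weight, dtype=torch.float8_e4m3fn)
    scales = torch.empty(rows // block, cols // block, dtype=torch.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = weight[i:i + block, j:j + block]
            scale = tile.abs().max().clamp(min=1e-12) / fp8_max  # map the tile into FP8 range
            qweight[i:i + block, j:j + block] = (tile / scale).to(torch.float8_e4m3fn)
            scales[i // block, j // block] = scale
    return qweight, scales

w = torch.randn(256, 256)
qw, s = quantize_fp8_blockwise(w)

# Dequantize one tile to check that the round-trip error stays small.
dequant = qw[:128, :128].to(torch.float32) * s[0, 0]
print((w[:128, :128] - dequant).abs().max())
```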

docs/source/en/quantization/overview.md (+2 -1)

@@ -61,7 +61,8 @@ Use the table below to help you decide which quantization method to use.
 | [FBGEMM_FP8](./fbgemm_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
 | [torchao](./torchao.md) | 🟢 | | 🟢 | 🔴 | 🟡 <sub>5</sub> | 🔴 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
 | [VPTQ](./vptq.md) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
-
+| [SpQR](./spqr.md) | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 3 | 🔴 | 🟢 | 🟢 | https://github.com/Vahe1994/SpQR/ |
+| [FINEGRAINED_FP8](./finegrained_fp8.md) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | |
 <Tip>

 **1:** bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.

docs/source/en/quantization/spqr.md (new file, +35)

@@ -0,0 +1,35 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# SpQR

The [SpQR](https://github.com/Vahe1994/SpQR) quantization algorithm uses a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers, as detailed in [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078).

To SpQR-quantize a model, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.

Load a pre-SpQR-quantized model with [`~PreTrainedModel.from_pretrained`].

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

quantized_model = AutoModelForCausalLM.from_pretrained(
    "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
    torch_dtype=torch.half,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
```
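The new doc stops after loading; generation would then presumably follow the usual `generate` API, for example (the prompt is illustrative, not from the doc):

```python
inputs = tokenizer("The recipe for a good pizza starts with", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```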

docs/source/en/tasks/multiple_choice.md (+5 -90)

@@ -109,99 +109,14 @@ The preprocessing function you want to create needs to:
 To apply the preprocessing function over the entire dataset, use 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

 ```py
-tokenized_swag = swag.map(preprocess_function, batched=True)
+>>> tokenized_swag = swag.map(preprocess_function, batched=True)
 ```

-🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt the [`DataCollatorWithPadding`] to create a batch of examples. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
-
-`DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results:
-
-<frameworkcontent>
-<pt>
-```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import torch
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="pt",
-...         )
-
-...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
-...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
-...         return batch
-```
-</pt>
-<tf>
+To create a batch of examples, it's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. [`DataCollatorForMultipleChoice`] flattens all the model inputs, applies padding, and then unflattens the results.
 ```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import tensorflow as tf
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="tf",
-...         )
-
-...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
-...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
-...         return batch
+>>> from transformers import DataCollatorForMultipleChoice
+>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
 ```
-</tf>
-</frameworkcontent>

 ## Evaluate

@@ -271,7 +186,7 @@ At this point, only three steps remain:
 ...     train_dataset=tokenized_swag["train"],
 ...     eval_dataset=tokenized_swag["validation"],
 ...     processing_class=tokenizer,
-...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
+...     data_collator=collator,
 ...     compute_metrics=compute_metrics,
 ... )
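For readers following the updated guide, a quick check of what the new collator produces might look like the sketch below; it assumes the `tokenized_swag` dataset and `tokenizer` prepared earlier in the guide, and the padded length depends on the batch.

```py
>>> features = [tokenized_swag["train"][i] for i in range(2)]
>>> features = [
...     {"input_ids": f["input_ids"], "attention_mask": f["attention_mask"], "label": f["label"]}
...     for f in features
... ]
>>> batch = collator(features)
>>> list(batch["input_ids"].shape)  # [batch_size, num_choices, padded_sequence_length]
```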
