`.github/ISSUE_TEMPLATE/bug-report.yml` (+1)
@@ -106,6 +106,7 @@ body:
      label: Reproduction
      description: |
        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
+       Please include relevant config information with your code, for example your Trainer, TRL, PEFT, and DeepSpeed configs.
        If you have code snippets, error messages, or stack traces, please provide them here as well.
        Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
        Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Fine-grained FP8

With the fine-grained FP8 quantization method, you can quantize your model to FP8 (W8A8):

- The weights are quantized to 8 bits (FP8) per 2D block (e.g. `weight_block_size=(128, 128)`), inspired by the DeepSeek implementation.
- Activations are quantized to 8 bits (FP8) per group per token, with the group size matching that of the weights along the input channels (128 by default).

This method was implemented to add support for the DeepSeek-V3 and DeepSeek-R1 models; see the [paper](https://arxiv.org/pdf/2412.19437) for details. The image below explains the quantization scheme.

> You need a GPU with compute capability >= 9.0 (e.g. H100).
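To make the per-block scheme concrete, here is a toy sketch (not the transformers implementation) of computing one scale per 2D weight block; the function name and the use of the FP8 E4M3 maximum value of 448 are illustrative assumptions:

```py
import torch

def per_block_scales(weight: torch.Tensor, block_size=(128, 128)) -> torch.Tensor:
    # FP8 E4M3 can represent magnitudes up to 448, so each block's max magnitude
    # is mapped onto that range by its own scale.
    fp8_max = 448.0
    rows, cols = weight.shape
    br, bc = block_size
    scales = torch.empty(rows // br, cols // bc)
    for i in range(0, rows, br):
        for j in range(0, cols, bc):
            block = weight[i:i + br, j:j + bc]
            # One scale per 2D block of the weight matrix.
            scales[i // br, j // bc] = block.abs().max() / fp8_max
    return scales

w = torch.randn(256, 256)
print(per_block_scales(w).shape)  # torch.Size([2, 2])
```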
Before you begin, make sure the following libraries are installed at their latest versions:

```bash
pip install --upgrade accelerate torch
```

> [!TIP]
> You need to install a torch version that is compatible with the CUDA version of your GPU.
By default, the weights are loaded in full precision (torch.float32) regardless of the actual data type the weights are stored in, such as torch.float16. Set `torch_dtype="auto"` to load the weights in the data type defined in a model's `config.json` file and automatically use the most memory-optimal data type.

```py
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
```
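The example above is cut off in this excerpt. A minimal sketch of how it might continue, where the checkpoint id is only a placeholder and the default `FineGrainedFP8Config()` settings are assumed:

```py
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint; any causal LM works

# Quantize the weights to FP8 while loading, using the default block size.
quantization_config = FineGrainedFP8Config()
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```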
**1:** bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# SpQR

The [SpQR](https://github.com/Vahe1994/SpQR) quantization algorithm uses a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers, as detailed in [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078).

To SpQR-quantize a model, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.

Load a pre-SpQR-quantized model with [`~PreTrainedModel.from_pretrained`].

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
```
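The code block above is truncated in this excerpt. A minimal sketch of how loading might continue, where the Hub id is a placeholder for any checkpoint that was quantized with SpQR:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder Hub id; substitute a real SpQR-quantized checkpoint.
model_id = "some-user/Llama-2-7b-SpQR-3bit"

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.half,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```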
`docs/source/en/tasks/multiple_choice.md` (+5, -90)
@@ -109,99 +109,14 @@ The preprocessing function you want to create needs to:
 To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
-🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt the [`DataCollatorWithPadding`] to create a batch of examples. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
-`DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results:
+To create a batch of examples, it's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. [`DataCollatorForMultipleChoice`] flattens all the model inputs, applies padding, and then unflattens the results.
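A minimal sketch of how the collator might be instantiated, assuming it accepts the same `tokenizer=` constructor argument as [`DataCollatorWithPadding`] and that the BERT checkpoint here is only an example:

```py
from transformers import AutoTokenizer, DataCollatorForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")  # example checkpoint
# Pads each flattened (batch * num_choices) input to the longest sequence in the batch,
# then reshapes the tensors back to (batch, num_choices, seq_len).
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
```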