# Mistral

[Mistral](https://huggingface.co/papers/2310.06825) is a 7B parameter language model, available as a pretrained and instruction-tuned variant, focused on balancing the scaling costs of large models with performance and efficient inference. It uses sliding window attention (SWA) trained with an 8K context length and a fixed cache size to handle longer sequences more effectively. Grouped-query attention (GQA) speeds up inference and reduces memory requirements. Mistral also features a byte-fallback BPE tokenizer to improve token handling and efficiency by ensuring characters are never mapped to out-of-vocabulary tokens.
Mistral was introduced in [this blog post](https://mistral.ai/news/announcing-mistral-7b/) by Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.

The introduction of the blog post says:

*Mistral AI team is proud to release Mistral 7B, the most powerful language model for its size to date.*

Mistral-7B is the first large language model (LLM) released by [mistral.ai](https://mistral.ai/).
### Architectural details
Mistral-7B is a decoder-only Transformer with the following architectural choices:
- Sliding Window Attention (SWA) - trained with an 8K context length and a fixed cache size, with a theoretical attention span of 128K tokens.
- Grouped-query attention (GQA) - speeds up inference and reduces memory requirements.
- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
For more details, refer to the [release blog post](https://mistral.ai/news/announcing-mistral-7b/).
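These choices are reflected directly in [`MistralConfig`]. As a rough sketch (assuming the default configuration values still mirror Mistral-7B-v0.1), you can inspect them like this:

```python
from transformers import MistralConfig

config = MistralConfig()  # defaults correspond to the Mistral-7B architecture
print(config.sliding_window)       # attention window size used by sliding window attention
print(config.num_attention_heads)  # number of query heads
print(config.num_key_value_heads)  # fewer key/value heads -> grouped-query attention
```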
### License
`Mistral-7B` is released under the Apache 2.0 license.
## Usage tips
You can find all the original Mistral checkpoints under the [Mistral AI](https://huggingface.co/mistralai) organization. The Mistral team has released 3 checkpoints:

- a base model, [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), which has been pretrained to predict the next token on internet-scale data.
- an instruction-tuned model, [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), which is the base model optimized for chat using supervised fine-tuning (SFT) and direct preference optimization (DPO).
- an improved instruction-tuned model, [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), which improves upon v0.1.

> [!TIP]
> Click on the Mistral models in the right sidebar for more examples of how to apply Mistral to different language tasks.
The examples below demonstrate how to chat with Mistral using [`Pipeline`] or [`AutoModel`], and from the command line.
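For quick experiments, a minimal [`Pipeline`] sketch (assuming the instruction-tuned `mistralai/Mistral-7B-Instruct-v0.2` checkpoint and a GPU with enough free memory) could look like this:

```python
import torch
from transformers import pipeline

chatbot = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [{"role": "user", "content": "What is your favourite condiment?"}]
# The text-generation pipeline accepts chat-style messages and returns the conversation
# including the model's reply.
print(chatbot(messages, max_new_tokens=64)[0]["generated_text"])
```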
... {"role": "user", "content": "What is your favourite condiment?"},
49
+
... {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
50
+
... {"role": "user", "content": "Do you have mayonnaise recipes?"}
... {"role": "user", "content": "What is your favourite condiment?"},
@@ -96,59 +77,20 @@ The instruction tuned model can be used as follows:
96
77
"Mayonnaise can be made as follows: (...)"
97
78
```
98
79
99
-
As can be seen, the instruction-tuned model requires a [chat template](../chat_templating) to be applied to make sure the inputs are prepared in the right format.
## Speeding up Mistral by using Flash Attention
The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.
```bash
pip install -U flash-attn --no-build-isolation
```
Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). Make sure to load your model in half-precision (e.g. `torch.float16`) as well.
To load and run a model using Flash Attention 2, pass `attn_implementation="flash_attention_2"` to [`~PreTrainedModel.from_pretrained`].
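A minimal sketch of what that can look like (assuming `flash-attn` is installed and a compatible GPU is available; the prompt is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,               # half precision, as required by Flash Attention 2
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("My favourite condiment is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```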
The model can also be run from the command line with Flash Attention 2 enabled:

```bash
echo -e "My favorite condiment is" | transformers-cli chat --model_name_or_path mistralai/Mistral-7B-v0.3 --torch_dtype auto --device 0 --attn_implementation flash_attention_2
```
### Expected speedups
Below is an expected speedup diagram comparing pure inference time between the native implementation in Transformers using the `mistralai/Mistral-7B-v0.1` checkpoint and the Flash Attention 2 version of the model.

The current implementation also supports the sliding window attention mechanism and memory-efficient cache management.
To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`).
The Flash Attention 2 model also uses a more memory-efficient cache slicing mechanism. As recommended by the official Mistral implementation, which uses a rolling cache, we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"`, and use the absolute position of the current token to compute the positional embedding.
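As a small illustration of the left-padding requirement (a sketch only; the prompts are placeholders), the tokenizer can be configured like this before batched generation:

```python
from transformers import AutoTokenizer

# Left padding keeps the most recent tokens at the end of every row,
# which is what the fixed-size rolling cache expects during batched generation.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token by default

batch = tokenizer(
    ["My favourite condiment is", "The capital of France is"],
    padding=True,
    return_tensors="pt",
)
```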
## Shrinking down Mistral using quantization

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for the available quantization backends.
As the Mistral model has 7 billion parameters, it would require about 14GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, you can shrink the model using [quantization](../quantization.md). If the model is quantized to 4 bits (half a byte per parameter), only about 3.5GB of RAM is required.
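The back-of-the-envelope arithmetic behind those numbers:

```python
num_params = 7e9                      # 7 billion parameters
print(num_params * 2 / 1e9)           # float16: 2 bytes per parameter   -> ~14 GB
print(num_params * 0.5 / 1e9)         # 4-bit:   0.5 bytes per parameter -> ~3.5 GB
```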
Quantizing a model is as simple as passing a `quantization_config` when loading the model. The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits (refer to [this page](../quantization.md) for other quantization methods).
```python
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

>>> # specify how to quantize the model
>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_quant_type="nf4",
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=quantization_config, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

>>> prompt = "My favourite condiment is"
>>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
>>> generated_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
"The expected output"
```
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/mistralai/mistral-src).
Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
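A minimal sketch of how it can be used (assuming a recent `transformers` version that ships the visualizer and an instruction-tuned checkpoint such as `mistralai/Mistral-7B-Instruct-v0.2`):

```python
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

# Renders which tokens each position can attend to for the given prompt.
visualizer = AttentionMaskVisualizer("mistralai/Mistral-7B-Instruct-v0.2")
visualizer("Do you have mayonnaise recipes?")
```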
## Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mistral. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- A demo notebook to perform supervised fine-tuning (SFT) of Mistral-7B can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb). 🌎
- A [blog post](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl) on how to fine-tune LLMs in 2024 using Hugging Face tooling. 🌎
- The [Alignment Handbook](https://github.com/huggingface/alignment-handbook) by Hugging Face includes scripts and recipes to perform supervised fine-tuning (SFT) and direct preference optimization with Mistral-7B. This includes scripts for full fine-tuning, QLoRA on a single GPU, as well as multi-GPU fine-tuning.
- [Causal language modeling task guide](../tasks/language_modeling)