
Commit 39be9a0

Merge remote-tracking branch 'origin/v2' into feature/damian/no_kv_cache
2 parents: 105b1d5 + a2aaa51

58 files changed: +2496 −490 lines

(Large commits have some content hidden by default; only a subset of the changed files is shown below.)

.github/workflows/mlc_config.json (+2 −1)

@@ -1,7 +1,8 @@
 {
     "aliveStatusCodes": [
         0,
-        200
+        200,
+        403,
     ],
     "ignorePatterns": [
         {
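The change above whitelists HTTP 403 for the markdown-link-check job: many sites return 403 to non-browser clients, so live links were being reported as dead. As a rough, editorial sketch of the policy this config encodes (not part of the commit; the URL and helper are illustrative):

```python
import urllib.error
import urllib.request

# Codes the config treats as "alive"; 0 is the checker's code for requests
# that produced no HTTP status at all.
ALIVE_STATUS_CODES = {0, 200, 403}

def link_is_alive(url: str) -> bool:
    try:
        status = urllib.request.urlopen(url, timeout=10).status
    except urllib.error.HTTPError as err:
        status = err.code  # an error response still carries a status, e.g. 403
    except urllib.error.URLError:
        status = 0  # no HTTP response at all
    return status in ALIVE_STATUS_CODES

print(link_is_alive("https://example.com"))
```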

.github/workflows/test-check.yaml (+3 −3)

@@ -26,7 +26,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.8, 3.9, '3.10']
+        python-version: [3.8, 3.9, '3.11']
         os: [ubuntu-20.04]
     runs-on: ${{ matrix.os }}
     steps:
@@ -52,7 +52,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.8, 3.9, '3.10']
+        python-version: [3.8, 3.9, '3.11']
         os: [ubuntu-20.04]
     runs-on: ${{ matrix.os }}
     steps:
@@ -97,6 +97,6 @@ jobs:
       - name: "Clean sparsezoo directory"
         run: rm -r sparsezoo/
       - name: ⚙️ Install dependencies
-        run: pip install .[dev,server,image_classification,transformers,haystack]
+        run: pip install .[dev,haystack]
       - name: Run integrations tests
         run: make test_integrations
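An aside on the quoting in the matrix above: YAML parses an unquoted 3.10 as the float 3.1, so two-digit minor versions must be quoted ('3.10', '3.11') while 3.8 and 3.9 are safe bare. A quick demonstration of the pitfall, assuming PyYAML is available:

```python
import yaml

# Unquoted 3.10 is a YAML float and collapses to 3.1
print(yaml.safe_load("versions: [3.8, 3.9, 3.10]"))
# {'versions': [3.8, 3.9, 3.1]}

# Quoting preserves the version string
print(yaml.safe_load("versions: [3.8, 3.9, '3.10']"))
# {'versions': [3.8, 3.9, '3.10']}
```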

CONTRIBUTING.md (+1 −1)

@@ -77,7 +77,7 @@ For documentation edits, include:
 
 ## Question or Problem
 
-- Sign up or log in to our [**Neural Maggic Community Slack**](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). We are growing the community member by member and happy to see you there. Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
+- Sign up or log in to our [**Neural Magic Community Slack**](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). We are growing the community member by member and happy to see you there. Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
 
 Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!

README.md (+6 −1)

@@ -222,7 +222,7 @@ For more general questions about Neural Magic, [complete this form.](http://neur
 
 ### License
 
-- **DeepSparse Community** is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
+- **DeepSparse Community** is free to use and is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
 Some source code, example files, and scripts included in the DeepSparse GitHub repository or directory are licensed under the [Apache License Version 2.0](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE) as noted.
 
 - **DeepSparse Enterprise** requires a Trial License or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications.
@@ -283,3 +283,8 @@ Find this project useful in your research or other communications? Please consid
   bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```
+# All Thanks To Our Contributors
+
+<a href="https://github.com/neuralmagic/deepsparse/graphs/contributors">
+<img src="https://contrib.rocks/image?repo=neuralmagic/deepsparse" />
+</a>

docker/Dockerfile (+7 −0)

@@ -6,6 +6,7 @@ ARG BRANCH
 
 FROM python:3.8.16-slim-bullseye@sha256:322e38e3056cf87280ad80be615a6282aae768090f30d43d99abe413e1dd081a AS base
 ARG VENV
+ARG BRANCH
 
 RUN set -Eeuxo \
     && apt-get update \
@@ -30,6 +31,8 @@ ENV PATH="${VENV}/bin:$PATH"
 ENV PIP_DEFAULT_TIMEOUT=200
 ARG VERSION
 ARG MODE=""
+ARG BRANCH
+
 RUN \
     if [ -n "$BRANCH" ] ; then \
       echo Installing from BRANCH && \
@@ -60,6 +63,7 @@ ENV PATH="${VENV}/bin:$PATH"
 ENV PIP_DEFAULT_TIMEOUT=200
 ARG VERSION
 ARG MODE=""
+ARG BRANCH
 RUN \
     if [ -n "$BRANCH" ] ; then \
       echo Installing from BRANCH && \
@@ -88,6 +92,8 @@ ENV PATH="${VENV}/bin:$PATH"
 ENV PIP_DEFAULT_TIMEOUT=200
 ARG VERSION
 ARG MODE
+ARG BRANCH
+
 RUN \
     if [ -n "$BRANCH" ] ; then \
      echo Installing from BRANCH with editable mode && \
@@ -117,5 +123,6 @@ ARG VENV
 COPY --from=build $VENV $VENV
 ENV PATH="${VENV}/bin:$PATH"
 HEALTHCHECK CMD python -c 'import deepsparse'
+RUN pip list | grep deepsparse
 CMD bash
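Two notes on this diff: a Dockerfile `ARG` is scoped to the build stage it is declared in, so the top-level `ARG BRANCH` must be re-declared after each `FROM` that needs it, which is why the declaration appears once per stage. The new `RUN pip list | grep deepsparse` makes the build fail fast if the package did not install. A rough Python analogue of that sanity check (illustrative only; the image itself uses pip and grep):

```python
import importlib.metadata
import sys

# Fail fast if deepsparse is missing, mirroring `pip list | grep deepsparse`
try:
    version = importlib.metadata.version("deepsparse")
except importlib.metadata.PackageNotFoundError:
    sys.exit("deepsparse is not installed in this image")

print(f"deepsparse {version} is installed")
```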

docs/llms/integration-langchain.md (+4 −4)

@@ -23,7 +23,7 @@ It is broken into two parts: installation and then examples of DeepSparse usage.
 
 - Install the Python packages with `pip install deepsparse-nightly langchain`
 - Choose a [SparseZoo model](https://sparsezoo.neuralmagic.com/?useCase=text_generation) or export a support model to ONNX [using Optimum](https://github.com/neuralmagic/notebooks/blob/main/notebooks/opt-text-generation-deepsparse-quickstart/OPT_Text_Generation_DeepSparse_Quickstart.ipynb)
-- Models hosted on HuggingFace are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
+- Models hosted on Hugging Face are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
 
 ## Using DeepSparse With LangChain
 
@@ -41,7 +41,7 @@ llm = DeepSparse(model='zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggin
 print(llm('def fib():'))
 ```
 ## Streaming
-The DeepSparse LangChain wrapper also supports per token output streaming:
+The DeepSparse LangChain wrapper also supports per-token output streaming:
 
 ```python
 from langchain.llms import DeepSparse
@@ -53,7 +53,7 @@ for chunk in llm.stream("Tell me a joke", stop=["'","\n"]):
     print(chunk, end='', flush=True)
 ```
 ## Using Instruction Fine-tune Models With DeepSparse
-Here's an example of how to prompt an instruction fine-tuned model using DeepSparse and the MPT-Instruct model:
+Here's an example of how to prompt an instruction in a fine-tuned model using DeepSparse and the MPT-Instruct model:
 ```python
 prompt="""
 Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is quantization? ### Response:
@@ -84,7 +84,7 @@ List how to Become a great software engineer
 By TechRadar Staff
 Here are some tips on how to become a great software engineer:
 1. Develop good programming skills: To become a great software engineer, you need to have a strong understanding of programming concepts and techniques. You should be able to write clean, efficient code that meets the requirements of the project.
-2. Learn new technologies: To stay up-to in the field, you should be familiar with new technologies and programming languages. You should also be able to adapt to new environments and work with different tools and platforms.
+2. Learn new technologies: To stay up-to-date in the field, you should be familiar with new technologies and programming languages. You should also be able to adapt to new environments and work with different tools and platforms.
 3. Build a portfolio: To showcase your skills, you should build a portfolio of your work. This will help you showcase your skills and abilities to potential employers.
 4. Network: Networking is an important aspect of your career. You should attend industry events and conferences to meet other professionals in the field.
 5. Stay up-to-date with industry trends: Stay up-to-date with industry trends and developments. This will help you stay relevant in your field and help you stay ahead of your competition.
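For reference, the streaming snippet these hunks touch, assembled into one runnable block. The model id follows the installation notes above; the `streaming=True` constructor flag is an assumption, since the hunk starts mid-example:

```python
from langchain.llms import DeepSparse

llm = DeepSparse(
    model="hf:mgoin/TinyStories-33M-quant-deepsparse",
    streaming=True,  # assumed flag for per-token streaming; see the doc fragments above
)

# Tokens print as they are generated rather than all at once
for chunk in llm.stream("Tell me a joke", stop=["'", "\n"]):
    print(chunk, end="", flush=True)
```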

docs/llms/text-generation-pipeline.md (+11 −11)

@@ -20,15 +20,15 @@ This user guide explains how to run inference of text generation models with Dee
 
 ## **Installation**
 
-DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:
 
 ```bash
 pip install -U deepsparse-nightly[llm]
 ```
 
 #### **System Requirements**
 
-- Hardware: x86 AVX2, AVX512, AVX512-VNNI and ARM v8.2+.
+- Hardware: x86 AVX2, AVX-512, AVX-512 VNNI, and ARM v8.2+.
 - Operating System: Linux (MacOS will be supported soon)
 - Python: v3.8-3.11
 
@@ -49,7 +49,7 @@ prompt = "Below is an instruction that describes a task. Write a response that a
 output = pipeline(prompt=prompt)
 print(output.generations[0].text)
 
-# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
+# >> Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.
 ```
 
 > **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
@@ -58,11 +58,11 @@ print(output.generations[0].text)
 
 DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
 
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs created by Neural Magic.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At present, we suggest only using LLM ONNX graphs created by Neural Magic.***
 >
 ### **SparseZoo Stubs**
 
-SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifes a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pre-trained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
 
 ```python
 model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
@@ -91,7 +91,7 @@ pipeline = TextGeneration(model="./local-model/deployment")
 ```
 
 ### **Hugging Face Models**
-Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
+Hugging Face models that conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
 
 ```python
 from deepsparse import TextGeneration
@@ -176,7 +176,7 @@ print(f"finished_reason: {output.generations[0].finished_reason}")
 
 ## **Generation Configuration**
 
-`TextGeneration` can be configured to alter several variables in generation.
+`TextGeneration` can be configured to alter several variables in a generation.
 
 The following examples use a quantized 33M parameter TinyStories model for quick compilation:
 ```python
@@ -186,7 +186,7 @@ model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
 pipeline = TextGeneration(model=model_id)
 ```
 
-### **Creating A `GenerationConfig`**
+### **Creating a `GenerationConfig`**
 
 The `GenerationConfig` can be created in three ways:
 - Via `transformers.GenerationConfig`:
@@ -267,15 +267,15 @@ for generated_text in output.generations[0]:
 # >> Princess peach jumped from the balcony and ran after her. Jill jumped to the floor and followed
 ```
 
-#### Controling the Output Length
+#### Controlling the Output Length
 - `max_new_tokens`: maximum number of tokens to generate. Default is `None`
 ```python
 output = pipeline(prompt=prompt, max_new_tokens=10)
 print(f"{prompt}{output.generations[0].text}")
 # >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
 ```
 
-#### Controling the Sampling
+#### Controlling the Sampling
 - `do_sample`: If True, will apply sampling from the probability distribution computed from the logits rather than deterministic greedy sampling. Default is `False`
 ```python
 output = pipeline(prompt=prompt, do_sample=True, max_new_tokens=15)
@@ -286,7 +286,7 @@ print(f"{prompt}{output.generations[0].text}")
 # >> Princess peach jumped from the balcony and landed in front of her. She stood proudly and exclaimed, “I did
 ```
 
-- `temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`
+- `temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always taking the highest score, 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`
 ```python
 # more random
 output = pipeline(prompt=prompt, do_sample=True, temperature=1.5, max_new_tokens=15)
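The generation-parameter hunks above are easier to follow end to end, so here they are combined into one runnable sketch using the TinyStories model the doc recommends for quick compilation (sampled outputs vary run to run):

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
prompt = "Princess peach jumped from the balcony"

# Greedy decoding, capped at 10 new tokens
output = pipeline(prompt=prompt, max_new_tokens=10)
print(f"{prompt}{output.generations[0].text}")

# Sampled decoding with a higher temperature (more random)
output = pipeline(prompt=prompt, do_sample=True, temperature=1.5, max_new_tokens=15)
print(f"{prompt}{output.generations[0].text}")
```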

research/mpt/README.md (+9 −15)

@@ -1,8 +1,8 @@
-*LAST UPDATED: 10/11/2023*
+*LAST UPDATED: 11/24/2023*
 
 # **Sparse Finetuned LLMs with DeepSparse**
 
-DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT and Meta's Llama 2.
 Check out our paper [Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927)
 
 In this research overview, we will discuss:
@@ -11,7 +11,7 @@ In this research overview, we will discuss:
 
 ## **Sparse Finetuning Research**
 
-We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
+We show that MPT-7B and Llama-2-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
 
 When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
 
@@ -23,16 +23,16 @@ Fine-tuning is useful for two main reasons:
 1. It can teach the model *how to respond* to input (often called **instruction tuning**).
 2. It can teach the model *new information* (often called **domain adaptation**).
 
-
 An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
 
 The key insight from [our paper](https://arxiv.org/abs/2310.06927) is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k runs 7x faster than the dense baseline with DeepSparse!
 
 <div align="center">
-<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9" width="60%"/>
+<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/f9a86726-12f5-4926-8d8c-668c449faa84" width="60%"/>
 </div>
 
 - [See the paper on Arxiv](https://arxiv.org/abs/2310.06927)
+- [See our Llama 2 expansion blog on the initial paper](https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/)
 
 ### **How Is This Useful For Real World Use?**
 
@@ -46,17 +46,14 @@ Install the DeepSparse Nightly build (requires Linux):
 pip install -U deepsparse-nightly[llm]
 ```
 
-The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
-
-### MPT-7B on GSM
+The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
 
 We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
 
 ```python
 from deepsparse import TextGeneration
 
-model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
-pipeline = TextGeneration(model_path=model)
+pipeline = TextGeneration(model_path="zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized")
 
 prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
 output = pipeline(prompt=prompt)
@@ -84,11 +81,8 @@ print(output.generations[0].text)
 ### >> #### 5
 ```
 
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs created by Neural Magic's team***
-
-
 #### Other Resources
-- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
+- [Check out all the GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
 - [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
 - [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
 
@@ -97,7 +91,7 @@ print(output.generations[0].text)
 Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:
 
 - **Productizing Sparse Fine Tuning**: Enable external users to apply the sparse fine-tuning to business datasets
-- **Expanding Model Support**: Apply sparse fine-tuning results to Llama2 and Mistral models
+- **Expanding Model Support**: Apply sparse fine-tuning results to Mistral models
 - **Pushing to Higher Sparsity**: Improving our pruning algorithms to reach higher sparsity
 - **Building General Sparse Model**: Create sparse model that can perform well on general tasks like OpenLLM leaderboard
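For convenience, the README's updated quickstart assembled into one block; the first run downloads the sparse Llama-2 GSM model from SparseZoo and compiles it, which takes a few minutes:

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(
    model_path="zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
)

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May"
)
output = pipeline(prompt=prompt)
print(output.generations[0].text)
```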
