CONTRIBUTING.md (+1 -1)
@@ -77,7 +77,7 @@ For documentation edits, include:
## Question or Problem
- - Sign up or log in to our [**Neural Maggic Community Slack**](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). We are growing the community member by member and happy to see you there. Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
+ - Sign up or log in to our [**Neural Magic Community Slack**](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). We are growing the community member by member and happy to see you there. Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
README.md (+6 -1)
@@ -222,7 +222,7 @@ For more general questions about Neural Magic, [complete this form.](http://neur
### License
- -**DeepSparse Community** is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
+ -**DeepSparse Community** is free to use and is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
Some source code, example files, and scripts included in the DeepSparse GitHub repository or directory are licensed under the [Apache License Version 2.0](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE) as noted.
-**DeepSparse Enterprise** requires a Trial License or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications.
@@ -283,3 +283,8 @@ Find this project useful in your research or other communications? Please consid
docs/llms/integration-langchain.md (+4 -4)
@@ -23,7 +23,7 @@ It is broken into two parts: installation and then examples of DeepSparse usage.
- Install the Python packages with `pip install deepsparse-nightly langchain`
- Choose a [SparseZoo model](https://sparsezoo.neuralmagic.com/?useCase=text_generation) or export a supported model to ONNX [using Optimum](https://github.com/neuralmagic/notebooks/blob/main/notebooks/opt-text-generation-deepsparse-quickstart/OPT_Text_Generation_DeepSparse_Quickstart.ipynb)
- - Models hosted on HuggingFace are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
+ - Models hosted on Hugging Face are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
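Taken together, the steps above amount to something like the following minimal sketch. It assumes the `DeepSparse` wrapper takes the model id through its `model` argument (as it does in the streaming example further down); the prompt string is illustrative only.

```python
# Minimal sketch, not a definitive API reference. Assumes
# `pip install deepsparse-nightly langchain` has already been run.
from langchain.llms import DeepSparse

# "hf:" model id from the bullet above; `model=` argument assumed from the streaming example below.
llm = DeepSparse(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
print(llm("Once upon a time"))  # illustrative prompt
```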
- The DeepSparse LangChain wrapper also supports pertoken output streaming:
+ The DeepSparse LangChain wrapper also supports per-token output streaming:
```python
from langchain.llms import DeepSparse
@@ -53,7 +53,7 @@ for chunk in llm.stream("Tell me a joke", stop=["'","\n"]):
    print(chunk, end='', flush=True)
```
## Using Instruction Fine-tune Models With DeepSparse
- Here's an example of how to prompt an instruction fine-tuned model using DeepSparse and the MPT-Instruct model:
+ Here's an example of how to prompt an instruction in a fine-tuned model using DeepSparse and the MPT-Instruct model:
```python
prompt="""
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is quantization? ### Response:
@@ -84,7 +84,7 @@ List how to Become a great software engineer
By TechRadar Staff
Here are some tips on how to become a great software engineer:
1. Develop good programming skills: To become a great software engineer, you need to have a strong understanding of programming concepts and techniques. You should be able to write clean, efficient code that meets the requirements of the project.
- 2. Learn new technologies: To stay up-to in the field, you should be familiar with new technologies and programming languages. You should also be able to adapt to new environments and work with different tools and platforms.
+ 2. Learn new technologies: To stay up-to-date in the field, you should be familiar with new technologies and programming languages. You should also be able to adapt to new environments and work with different tools and platforms.
3. Build a portfolio: To showcase your skills, you should build a portfolio of your work. This will help you showcase your skills and abilities to potential employers.
4. Network: Networking is an important aspect of your career. You should attend industry events and conferences to meet other professionals in the field.
5. Stay up-to-date with industry trends: Stay up-to-date with industry trends and developments. This will help you stay relevant in your field and help you stay ahead of your competition.
docs/llms/text-generation-pipeline.md (+11 -11)
@@ -20,15 +20,15 @@ This user guide explains how to run inference of text generation models with Dee
## **Installation**
- DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPi:
+ DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:
```bash
pip install -U deepsparse-nightly[llm]
```
#### **System Requirements**
- - Hardware: x86 AVX2, AVX512, AVX512-VNNI and ARM v8.2+.
+ - Hardware: x86 AVX2, AVX-512, AVX-512 VNNI, and ARM v8.2+.
- Operating System: Linux (MacOS will be supported soon)
- Python: v3.8-3.11
@@ -49,7 +49,7 @@ prompt = "Below is an instruction that describes a task. Write a response that a
output = pipeline(prompt=prompt)
print(output.generations[0].text)
- # >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
+ # >> Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.
```
> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
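Putting the note above into practice, here is a hedged sketch of a complete quickstart using the small TinyStories model. The `prompt=` call and `generations[0].text` access mirror the fragments shown in this hunk; the `model=` keyword and the prompt text itself are assumptions for illustration.

```python
from deepsparse import TextGeneration

# Small model from the note above; compiles quickly compared with the 7B model.
model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model=model_path)  # `model=` keyword assumed from this guide's quickstart

output = pipeline(prompt="Once upon a time there was a princess who")  # illustrative prompt
print(output.generations[0].text)
```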
DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
- > **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs created by Neural Magic.***
+ > **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At present, we suggest only using LLM ONNX graphs created by Neural Magic.***
>
### **SparseZoo Stubs**
- SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized`identifes a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+ SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized`identifies a 50% pruned-quantized pre-trained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
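As a concrete illustration of the sentence above, a minimal sketch of passing that stub to `TextGeneration`; the stub comes from this hunk, while the prompt string and call keywords are only illustrative assumptions.

```python
from deepsparse import TextGeneration

# Stub from the paragraph above; DeepSparse downloads and caches the ONNX file on first use.
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

output = pipeline(prompt="What is Kubernetes?")  # illustrative prompt
print(output.generations[0].text)
```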
- Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
+ Hugging Face models that conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
# >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
```
- #### Controling the Sampling
+ #### Controlling the Sampling
-`do_sample`: If True, will apply sampling from the probability distribution computed from the logits rather than deterministic greedy sampling. Default is `False`
# >> Princess peach jumped from the balcony and landed in front of her. She stood proudly and exclaimed, “I did
```
- -`temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`
+ -`temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always taking the highest score, 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`
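To make the two parameters above concrete, a hedged sketch of enabling sampling. Passing `do_sample` and `temperature` directly as keyword arguments in the pipeline call is an assumption based on the parameter descriptions in this section, not a verbatim excerpt from the guide.

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")

# do_sample=True switches from greedy decoding to sampling over the logits;
# temperature rescales the distribution (values near 0 behave almost greedily).
output = pipeline(
    prompt="Princess Peach jumped from the balcony and",  # illustrative prompt
    do_sample=True,
    temperature=0.7,
)
print(output.generations[0].text)
```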
research/mpt/README.md (+9 -15)
@@ -1,8 +1,8 @@
- *LAST UPDATED: 10/11/2023*
+ *LAST UPDATED: 11/24/2023*
# **Sparse Finetuned LLMs with DeepSparse**
- DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+ DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT and Meta's Llama 2.
Check out our paper [Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927)
In this research overview, we will discuss:
@@ -11,7 +11,7 @@ In this research overview, we will discuss:
## **Sparse Finetuning Research**
- We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
+ We show that MPT-7B and Llama-2-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
@@ -23,16 +23,16 @@ Fine-tuning is useful for two main reasons:
1. It can teach the model *how to respond* to input (often called **instruction tuning**).
2. It can teach the model *new information* (often called **domain adaptation**).
An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
The key insight from [our paper](https://arxiv.org/abs/2310.06927) is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
- The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
- ### MPT-7B on GSM
+ The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
```python
from deepsparse import TextGeneration
- model ="zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
prompt ="Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs created by Neural Magic's team***
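For context, a sketch of how the code fragment above could be completed into a runnable call. Only the `model =` stub and `prompt =` string are taken from the diff; the keyword names and token limit are assumptions based on the `TextGeneration` usage elsewhere in these docs.

```python
from deepsparse import TextGeneration

model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"

pipeline = TextGeneration(model=model)                 # downloads and caches the ONNX model
output = pipeline(prompt=prompt, max_new_tokens=128)   # token limit is an illustrative choice
print(output.generations[0].text)
```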
#### Other Resources
- -[Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
+ -[Check out all the GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
-[Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
-[Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)