
Commit 39be9a0

Merge remote-tracking branch 'origin/v2' into feature/damian/no_kv_cache
2 parents: 105b1d5 + a2aaa51

58 files changed: +2496 −490 lines

(Large commits have some content hidden by default; only a subset of the changed files is shown below.)

.github/workflows/mlc_config.json (+2 −1)

@@ -1,7 +1,8 @@
 {
     "aliveStatusCodes": [
         0,
-        200
+        200,
+        403,
     ],
     "ignorePatterns": [
         {
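The change above whitelists HTTP 403 for the markdown-link-check job: many sites return 403 to non-browser clients, so live links were being reported as dead. As a rough, editorial sketch of the policy this config encodes (not part of the commit; the URL and helper are illustrative):

```python
import urllib.error
import urllib.request

# Codes the config treats as "alive"; 0 is the checker's code for requests
# that produced no HTTP status at all.
ALIVE_STATUS_CODES = {0, 200, 403}

def link_is_alive(url: str) -> bool:
    try:
        status = urllib.request.urlopen(url, timeout=10).status
    except urllib.error.HTTPError as err:
        status = err.code  # an error response still carries a status, e.g. 403
    except urllib.error.URLError:
        status = 0  # no HTTP response at all
    return status in ALIVE_STATUS_CODES

print(link_is_alive("https://example.com"))
```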

.github/workflows/test-check.yaml (+3 −3)

@@ -26,7 +26,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.8, 3.9, '3.10']
+        python-version: [3.8, 3.9, '3.11']
         os: [ubuntu-20.04]
     runs-on: ${{ matrix.os }}
     steps:
@@ -52,7 +52,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.8, 3.9, '3.10']
+        python-version: [3.8, 3.9, '3.11']
         os: [ubuntu-20.04]
     runs-on: ${{ matrix.os }}
     steps:
@@ -97,6 +97,6 @@ jobs:
       - name: "Clean sparsezoo directory"
         run: rm -r sparsezoo/
       - name: ⚙️ Install dependencies
-        run: pip install .[dev,server,image_classification,transformers,haystack]
+        run: pip install .[dev,haystack]
       - name: Run integrations tests
         run: make test_integrations
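An aside on the quoting in the matrix above: YAML parses an unquoted 3.10 as the float 3.1, so two-digit minor versions must be quoted ('3.10', '3.11') while 3.8 and 3.9 are safe bare. A quick demonstration of the pitfall, assuming PyYAML is available:

```python
import yaml

# Unquoted 3.10 is a YAML float and collapses to 3.1
print(yaml.safe_load("versions: [3.8, 3.9, 3.10]"))
# {'versions': [3.8, 3.9, 3.1]}

# Quoting preserves the version string
print(yaml.safe_load("versions: [3.8, 3.9, '3.10']"))
# {'versions': [3.8, 3.9, '3.10']}
```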

CONTRIBUTING.md (+1 −1)

@@ -77,7 +77,7 @@ For documentation edits, include:
 
 ## Question or Problem
 
-- Sign up or log in to our [**Neural Maggic Community Slack**](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). We are growing the community member by member and happy to see you there. Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
+- Sign up or log in to our [**Neural Magic Community Slack**](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). We are growing the community member by member and happy to see you there. Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!
 
 Post all other questions including support or how to contribute. Don’t forget to search through existing discussions to avoid duplication! Thanks!

README.md (+6 −1)

@@ -222,7 +222,7 @@ For more general questions about Neural Magic, [complete this form.](http://neur
 
 ### License
 
-- **DeepSparse Community** is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
+- **DeepSparse Community** is free to use and is licensed under the [Neural Magic DeepSparse Community License.](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE-NEURALMAGIC)
 Some source code, example files, and scripts included in the DeepSparse GitHub repository or directory are licensed under the [Apache License Version 2.0](https://github.com/neuralmagic/deepsparse/blob/main/LICENSE) as noted.
 
 - **DeepSparse Enterprise** requires a Trial License or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications.
@@ -283,3 +283,8 @@ Find this project useful in your research or other communications? Please consid
   bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```
+# All Thanks To Our Contributors
+
+<a href="https://github.com/neuralmagic/deepsparse/graphs/contributors">
+<img src="https://contrib.rocks/image?repo=neuralmagic/deepsparse" />
+</a>

docker/Dockerfile (+7 −0)

@@ -6,6 +6,7 @@ ARG BRANCH
 
 FROM python:3.8.16-slim-bullseye@sha256:322e38e3056cf87280ad80be615a6282aae768090f30d43d99abe413e1dd081a AS base
 ARG VENV
+ARG BRANCH
 
 RUN set -Eeuxo \
     && apt-get update \
@@ -30,6 +31,8 @@ ENV PATH="${VENV}/bin:$PATH"
 ENV PIP_DEFAULT_TIMEOUT=200
 ARG VERSION
 ARG MODE=""
+ARG BRANCH
+
 RUN \
     if [ -n "$BRANCH" ] ; then \
       echo Installing from BRANCH && \
@@ -60,6 +63,7 @@ ENV PATH="${VENV}/bin:$PATH"
 ENV PIP_DEFAULT_TIMEOUT=200
 ARG VERSION
 ARG MODE=""
+ARG BRANCH
 RUN \
     if [ -n "$BRANCH" ] ; then \
       echo Installing from BRANCH && \
@@ -88,6 +92,8 @@ ENV PATH="${VENV}/bin:$PATH"
 ENV PIP_DEFAULT_TIMEOUT=200
 ARG VERSION
 ARG MODE
+ARG BRANCH
+
 RUN \
     if [ -n "$BRANCH" ] ; then \
      echo Installing from BRANCH with editable mode && \
@@ -117,5 +123,6 @@ ARG VENV
 COPY --from=build $VENV $VENV
 ENV PATH="${VENV}/bin:$PATH"
 HEALTHCHECK CMD python -c 'import deepsparse'
+RUN pip list | grep deepsparse
 CMD bash
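Two notes on this diff: a Dockerfile `ARG` is scoped to the build stage it is declared in, so the top-level `ARG BRANCH` must be re-declared after each `FROM` that needs it, which is why the declaration appears once per stage. The new `RUN pip list | grep deepsparse` makes the build fail fast if the package did not install. A rough Python analogue of that sanity check (illustrative only; the image itself uses pip and grep):

```python
import importlib.metadata
import sys

# Fail fast if deepsparse is missing, mirroring `pip list | grep deepsparse`
try:
    version = importlib.metadata.version("deepsparse")
except importlib.metadata.PackageNotFoundError:
    sys.exit("deepsparse is not installed in this image")

print(f"deepsparse {version} is installed")
```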

docs/llms/integration-langchain.md (+4 −4)

@@ -23,7 +23,7 @@ It is broken into two parts: installation and then examples of DeepSparse usage.
 
 - Install the Python packages with `pip install deepsparse-nightly langchain`
 - Choose a [SparseZoo model](https://sparsezoo.neuralmagic.com/?useCase=text_generation) or export a support model to ONNX [using Optimum](https://github.com/neuralmagic/notebooks/blob/main/notebooks/opt-text-generation-deepsparse-quickstart/OPT_Text_Generation_DeepSparse_Quickstart.ipynb)
-- Models hosted on HuggingFace are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
+- Models hosted on Hugging Face are also supported by prepending `"hf:"` to the model id, such as [`"hf:mgoin/TinyStories-33M-quant-deepsparse"`](https://huggingface.co/mgoin/TinyStories-33M-quant-deepsparse)
 
 ## Using DeepSparse With LangChain
 
@@ -41,7 +41,7 @@ llm = DeepSparse(model='zoo:nlg/text_generation/codegen_mono-350m/pytorch/huggin
 print(llm('def fib():'))
 ```
 ## Streaming
-The DeepSparse LangChain wrapper also supports per token output streaming:
+The DeepSparse LangChain wrapper also supports per-token output streaming:
 
 ```python
 from langchain.llms import DeepSparse
@@ -53,7 +53,7 @@ for chunk in llm.stream("Tell me a joke", stop=["'","\n"]):
     print(chunk, end='', flush=True)
 ```
 ## Using Instruction Fine-tune Models With DeepSparse
-Here's an example of how to prompt an instruction fine-tuned model using DeepSparse and the MPT-Instruct model:
+Here's an example of how to prompt an instruction in a fine-tuned model using DeepSparse and the MPT-Instruct model:
 ```python
 prompt="""
 Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is quantization? ### Response:
@@ -84,7 +84,7 @@ List how to Become a great software engineer
 By TechRadar Staff
 Here are some tips on how to become a great software engineer:
 1. Develop good programming skills: To become a great software engineer, you need to have a strong understanding of programming concepts and techniques. You should be able to write clean, efficient code that meets the requirements of the project.
-2. Learn new technologies: To stay up-to in the field, you should be familiar with new technologies and programming languages. You should also be able to adapt to new environments and work with different tools and platforms.
+2. Learn new technologies: To stay up-to-date in the field, you should be familiar with new technologies and programming languages. You should also be able to adapt to new environments and work with different tools and platforms.
 3. Build a portfolio: To showcase your skills, you should build a portfolio of your work. This will help you showcase your skills and abilities to potential employers.
 4. Network: Networking is an important aspect of your career. You should attend industry events and conferences to meet other professionals in the field.
 5. Stay up-to-date with industry trends: Stay up-to-date with industry trends and developments. This will help you stay relevant in your field and help you stay ahead of your competition.
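For reference, the streaming snippet these hunks touch, assembled into one runnable block. The model id follows the installation notes above; the `streaming=True` constructor flag is an assumption, since the hunk starts mid-example:

```python
from langchain.llms import DeepSparse

llm = DeepSparse(
    model="hf:mgoin/TinyStories-33M-quant-deepsparse",
    streaming=True,  # assumed flag for per-token streaming; see the doc fragments above
)

# Tokens print as they are generated rather than all at once
for chunk in llm.stream("Tell me a joke", stop=["'", "\n"]):
    print(chunk, end="", flush=True)
```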

docs/llms/text-generation-pipeline.md (+11 −11)

@@ -20,15 +20,15 @@ This user guide explains how to run inference of text generation models with Dee
 
 ## **Installation**
 
-DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:
 
 ```bash
 pip install -U deepsparse-nightly[llm]
 ```
 
 #### **System Requirements**
 
-- Hardware: x86 AVX2, AVX512, AVX512-VNNI and ARM v8.2+.
+- Hardware: x86 AVX2, AVX-512, AVX-512 VNNI, and ARM v8.2+.
 - Operating System: Linux (MacOS will be supported soon)
 - Python: v3.8-3.11
 
@@ -49,7 +49,7 @@ prompt = "Below is an instruction that describes a task. Write a response that a
 output = pipeline(prompt=prompt)
 print(output.generations[0].text)
 
-# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
+# >> Kubernetes is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.
 ```
 
 > **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
@@ -58,11 +58,11 @@ print(output.generations[0].text)
 
 DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
 
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs created by Neural Magic.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At present, we suggest only using LLM ONNX graphs created by Neural Magic.***
 >
 ### **SparseZoo Stubs**
 
-SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifes a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pre-trained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
 
 ```python
 model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
@@ -91,7 +91,7 @@ pipeline = TextGeneration(model="./local-model/deployment")
 ```
 
 ### **Hugging Face Models**
-Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
+Hugging Face models that conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).
 
 ```python
 from deepsparse import TextGeneration
@@ -176,7 +176,7 @@ print(f"finished_reason: {output.generations[0].finished_reason}")
 
 ## **Generation Configuration**
 
-`TextGeneration` can be configured to alter several variables in generation.
+`TextGeneration` can be configured to alter several variables in a generation.
 
 The following examples use a quantized 33M parameter TinyStories model for quick compilation:
 ```python
@@ -186,7 +186,7 @@ model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
 pipeline = TextGeneration(model=model_id)
 ```
 
-### **Creating A `GenerationConfig`**
+### **Creating a `GenerationConfig`**
 
 The `GenerationConfig` can be created in three ways:
 - Via `transformers.GenerationConfig`:
@@ -267,15 +267,15 @@ for generated_text in output.generations[0]:
 # >> Princess peach jumped from the balcony and ran after her. Jill jumped to the floor and followed
 ```
 
-#### Controling the Output Length
+#### Controlling the Output Length
 - `max_new_tokens`: maximum number of tokens to generate. Default is `None`
 ```python
 output = pipeline(prompt=prompt, max_new_tokens=10)
 print(f"{prompt}{output.generations[0].text}")
 # >> Princess peach jumped from the balcony and landed on the ground. She was so happy that she
 ```
 
-#### Controling the Sampling
+#### Controlling the Sampling
 - `do_sample`: If True, will apply sampling from the probability distribution computed from the logits rather than deterministic greedy sampling. Default is `False`
 ```python
 output = pipeline(prompt=prompt, do_sample=True, max_new_tokens=15)
@@ -286,7 +286,7 @@ print(f"{prompt}{output.generations[0].text}")
 # >> Princess peach jumped from the balcony and landed in front of her. She stood proudly and exclaimed, “I did
 ```
 
-- `temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`
+- `temperature`: The temperature of the sampling operation. 1 means regular sampling, 0 means always taking the highest score, 100.0 is close to uniform probability. If `0.0`, temperature is turned off. Default is `0.0`
 ```python
 # more random
 output = pipeline(prompt=prompt, do_sample=True, temperature=1.5, max_new_tokens=15)
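The generation-parameter hunks above are easier to follow end to end, so here they are combined into one runnable sketch using the TinyStories model the doc recommends for quick compilation (sampled outputs vary run to run):

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
prompt = "Princess peach jumped from the balcony"

# Greedy decoding, capped at 10 new tokens
output = pipeline(prompt=prompt, max_new_tokens=10)
print(f"{prompt}{output.generations[0].text}")

# Sampled decoding with a higher temperature (more random)
output = pipeline(prompt=prompt, do_sample=True, temperature=1.5, max_new_tokens=15)
print(f"{prompt}{output.generations[0].text}")
```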

research/mpt/README.md (+9 −15)

@@ -1,8 +1,8 @@
-*LAST UPDATED: 10/11/2023*
+*LAST UPDATED: 11/24/2023*
 
 # **Sparse Finetuned LLMs with DeepSparse**
 
-DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT and Meta's Llama 2.
 Check out our paper [Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927)
 
 In this research overview, we will discuss:
@@ -11,7 +11,7 @@ In this research overview, we will discuss:
 
 ## **Sparse Finetuning Research**
 
-We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
+We show that MPT-7B and Llama-2-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
 
 When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
 
@@ -23,16 +23,16 @@ Fine-tuning is useful for two main reasons:
 1. It can teach the model *how to respond* to input (often called **instruction tuning**).
 2. It can teach the model *new information* (often called **domain adaptation**).
 
-
 An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
 
 The key insight from [our paper](https://arxiv.org/abs/2310.06927) is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k runs 7x faster than the dense baseline with DeepSparse!
 
 <div align="center">
-<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9" width="60%"/>
+<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/f9a86726-12f5-4926-8d8c-668c449faa84" width="60%"/>
 </div>
 
 - [See the paper on Arxiv](https://arxiv.org/abs/2310.06927)
+- [See our Llama 2 expansion blog on the initial paper](https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/)
 
 ### **How Is This Useful For Real World Use?**
 
@@ -46,17 +46,14 @@ Install the DeepSparse Nightly build (requires Linux):
 pip install -U deepsparse-nightly[llm]
 ```
 
-The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
-
-### MPT-7B on GSM
+The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
 
 We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
 
 ```python
 from deepsparse import TextGeneration
 
-model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
-pipeline = TextGeneration(model_path=model)
+pipeline = TextGeneration(model_path="zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized")
 
 prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
 output = pipeline(prompt=prompt)
@@ -84,11 +81,8 @@ print(output.generations[0].text)
 ### >> #### 5
 ```
 
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs created by Neural Magic's team***
-
-
 #### Other Resources
-- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
+- [Check out all the GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
 - [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
 - [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
 
@@ -97,7 +91,7 @@ print(output.generations[0].text)
 Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:
 
 - **Productizing Sparse Fine Tuning**: Enable external users to apply the sparse fine-tuning to business datasets
-- **Expanding Model Support**: Apply sparse fine-tuning results to Llama2 and Mistral models
+- **Expanding Model Support**: Apply sparse fine-tuning results to Mistral models
 - **Pushing to Higher Sparsity**: Improving our pruning algorithms to reach higher sparsity
 - **Building General Sparse Model**: Create sparse model that can perform well on general tasks like OpenLLM leaderboard
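For convenience, the README's updated quickstart assembled into one block; the first run downloads the sparse Llama-2 GSM model from SparseZoo and compiles it, which takes a few minutes:

```python
from deepsparse import TextGeneration

pipeline = TextGeneration(
    model_path="zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
)

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May"
)
output = pipeline(prompt=prompt)
print(output.generations[0].text)
```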
