
Commit a733f36

Merge pull request #271 from yiliu30/cpu_infer_doc
Update the CPU inference doc
2 parents: 54d1ec3 + 7b4ff6e


README.md

Lines changed: 16 additions & 1 deletion
@@ -79,7 +79,22 @@ model = GenericLoraKbitModel('tiiuae/falcon-7b')
 # Run the fine-tuning
 model.finetune(dataset)
 ```
-4. __CPU inference__ - Now you can use just your CPU for inference of any LLM. _CAUTION : The inference process may be sluggish because CPUs lack the required computational capacity for efficient inference_.
+
+4. __CPU inference__ - The CPU, including laptop CPUs, is now fully equipped to handle LLM inference. We integrated [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers) to conserve memory by compressing the model with [weight-only quantization algorithms](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md) and to accelerate inference by leveraging its highly optimized kernels on Intel platforms.
+
+```python
+# Make the necessary imports
+from xturing.models import BaseModel
+
+# Initialize the model: quantize it with weight-only algorithms
+# and replace the linear layers with Itrex's qbits_linear kernel
+model = BaseModel.create("llama2_int8")
+
+# Once the model has been quantized, run inference directly
+output = model.generate(texts=["Why LLM models are becoming so important?"])
+print(output)
+```
+
 5. __Batch integration__ - By tweaking the 'batch_size' in the .generate() and .evaluate() functions, you can expedite results. Using a 'batch_size' greater than 1 typically enhances processing efficiency.
 ```python
 # Make the necessary imports
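
The trailing diff context cuts off the batch-integration snippet after its first line. As a rough sketch (not part of this commit), a call using the `batch_size` argument described in item 5 could look like the following; the model name and prompt strings are reused from the CPU-inference example purely for illustration:

```python
# Make the necessary imports
from xturing.models import BaseModel

# Load a model; "llama2_int8" is reused here from the CPU-inference example above
model = BaseModel.create("llama2_int8")

# Pass several prompts and a batch_size greater than 1 so generation runs in batches
outputs = model.generate(
    texts=[
        "Why LLM models are becoming so important?",
        "What are the benefits of weight-only quantization?",
    ],
    batch_size=2,
)
print(outputs)
```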
