
Commit e90b253

Add README.md file in the kleidiai example for llama.cpp

Signed-off-by: Gian Marco Iodice <[email protected]>
1 parent 142a96e

File tree: 2 files changed, +141 −1 lines changed

kleidiai-examples/llama_cpp/LICENSE

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2017-2024 Arm Limited
+Copyright (c) 2024 Arm Limited

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

kleidiai-examples/llama_cpp/README.md

Lines changed: 140 additions & 0 deletions

<!--
MIT License

Copyright (c) 2024 Arm Limited

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->

<h1><b>Running llama.cpp with KleidiAI Int4 matrix-multiplication (matmul) micro-kernels</b></h1>

## Prerequisites

- Experience with Arm® cross-compilation on Android™
- Proficiency with Android™ shell commands
- An Android™ device with an Arm® CPU that supports the <strong>FEAT_DotProd</strong> (dotprod) and <strong>FEAT_I8MM</strong> (i8mm) features

## Dependencies

- A laptop/PC with a Linux®-based operating system (tested on Ubuntu® 20.04.4 LTS)
- The Android™ NDK (minimum version: r25), which can be downloaded from [here](https://developer.android.com/ndk/downloads)
- The Android™ SDK Platform command-line tools, which can be downloaded from [here](https://developer.android.com/tools/releases/platform-tools)

## Goal

In this guide, we will show you how to apply a patch on top of llama.cpp to enable the <strong>[KleidiAI](https://gitlab.arm.com/kleidi/kleidiai)</strong> int4 matmul micro-kernels with per-block quantization (c32).

> ℹ️ In the context of llama.cpp, this int4 format is called <strong>Q4_0</strong>.

These KleidiAI micro-kernels were fundamental to the Cookie and Ada chatbot, which Arm® showcased to demonstrate large language models (LLMs) running on existing flagship and premium mobile CPUs based on Arm® technology. You can learn more about the demo in <strong>[this](https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/generative-ai-on-mobile-on-arm-cpu)</strong> blog post.

<p align="center">
<video autoplay src="https://community.arm.com/cfs-file/__key/telligent-evolution-videotranscoding-securefilestorage/communityserver-blogs-components-weblogfiles-00-00-00-38-23/phi_2D00_3-demo.mp4.mp4" width="640" height="480" controls></video>
</p>

> ⚠️ This guide is intended as a demonstration of how to integrate the KleidiAI int4 matmul optimized routines into llama.cpp.

## Target Arm® CPUs

Arm® CPUs with the <strong>FEAT_DotProd</strong> (dotprod) and <strong>FEAT_I8MM</strong> (i8mm) features.

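You can check whether a device's CPU reports these features through the kernel's `/proc/cpuinfo`: on Linux/Android, FEAT_DotProd appears as the `asimddp` flag and FEAT_I8MM as `i8mm`. The sketch below runs the check against an example `Features` line; on a real device you would feed it the output of `adb shell cat /proc/cpuinfo`.

```shell
# Check for the dotprod (asimddp) and i8mm CPU feature flags.
# On a real device, replace the example line with:
#   features=$(adb shell cat /proc/cpuinfo)
features="Features : fp asimd aes pmull sha2 crc32 atomics fphp asimdhp asimddp i8mm"
if echo "$features" | grep -q 'asimddp' && echo "$features" | grep -q 'i8mm'; then
  echo "CPU supports dotprod and i8mm"
fi
```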
## Running llama.cpp with KleidiAI

Connect your Android™ device to your computer and open a terminal. Then, follow these steps to apply the patch with the KleidiAI backend on top of llama.cpp.

### Step 1:

Clone the [llama.cpp](https://github.com/ggerganov/llama.cpp) repository:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
```

### Step 2:

Enter the `llama.cpp/` directory and check out the `6fcd1331efbfbb89c8c96eba2321bb7b4d0c40e4` commit:

```bash
cd llama.cpp
git checkout 6fcd1331efbfbb89c8c96eba2321bb7b4d0c40e4
```

This commit is pinned because it provides a stable base for applying the patch with the KleidiAI backend for llama.cpp.

### Step 3:

Copy [this](0001-Use-KleidiAI-Int4-Matmul-micro-kernels-in-llama.cpp.patch) patch into the `llama.cpp/` directory. It contains the code changes that enable the KleidiAI optimizations in llama.cpp.

### Step 4:

Apply the patch with the KleidiAI backend:

```bash
git apply 0001-Use-KleidiAI-Int4-Matmul-micro-kernels-in-llama.cpp.patch
```

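If you want to verify that a patch applies cleanly before it modifies your tree, `git apply --check` performs a dry run. A minimal sketch on a scratch repository (in llama.cpp the real invocation would target the KleidiAI patch file above):

```shell
# Demonstrate `git apply --check` (dry run) on a scratch repository.
# In llama.cpp you would run:
#   git apply --check 0001-Use-KleidiAI-Int4-Matmul-micro-kernels-in-llama.cpp.patch
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
printf 'hello\n' > file.txt
git add file.txt
git -c user.email=you@example.com -c user.name=you commit -qm "init"

printf 'hello world\n' > file.txt     # make a change and capture it as a patch
git diff > ../change.patch
git checkout -q -- file.txt           # restore the pristine tree

git apply --check ../change.patch && echo "patch applies cleanly"
git apply ../change.patch             # now actually apply it
```

If `--check` reports conflicts, the working tree is left untouched, which is why it is a safe first step.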
### Step 5:

Build the llama.cpp project for Android™:

```bash
mkdir build && cd build

export NDK_PATH="your-android-ndk-path"

cmake -DLLAMA_KLEIDIAI=ON -DLLAMA_OPENMP=OFF \
      -DCMAKE_TOOLCHAIN_FILE=${NDK_PATH}/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 \
      -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod \
      -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod ..

make -j4
```

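If the cmake configure step fails, a common cause is `NDK_PATH` not pointing at an actual NDK installation. A small guard you can run first (it only inspects the path exported above):

```shell
# Guard: the Android toolchain file must exist under ${NDK_PATH} before
# invoking cmake; otherwise cmake fails with a less obvious error.
toolchain="${NDK_PATH}/build/cmake/android.toolchain.cmake"
if [ -f "$toolchain" ]; then
  echo "toolchain file found"
else
  echo "NDK_PATH does not point at an Android NDK installation"
fi
```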
### Step 6:

Download a Large Language Model (LLM) in `.gguf` format with `Q4_0` weights. For example, you can download the <strong>Phi-2</strong> model from [here](https://huggingface.co/TheBloke/phi-2-GGUF/blob/main/phi-2.Q4_0.gguf).

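A downloaded model can be sanity-checked before pushing it to the device: GGUF files begin with the 4-byte magic `GGUF`. A minimal sketch, using a stand-in file so it runs anywhere; point the same check at your real `phi-2.Q4_0.gguf`:

```shell
# Check the 4-byte GGUF magic at the start of the file.
# Demonstrated on a stand-in file; run the same check on phi-2.Q4_0.gguf.
printf 'GGUF\003\000\000\000' > model.gguf   # stand-in for a real download
if [ "$(head -c 4 model.gguf)" = "GGUF" ]; then
  echo "looks like a GGUF file"
fi
```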
### Step 7:

Push the `llama-cli` binary and the `.gguf` file to `/data/local/tmp` on your Android™ device:

```bash
adb push bin/llama-cli /data/local/tmp
adb push phi-2.Q4_0.gguf /data/local/tmp
```

119+
### Step 8:

Enter your Android™ device:

```bash
adb shell
```

Then, go to `/data/local/tmp`:

```bash
cd /data/local/tmp
```

### Step 9:

Run the model inference with the `llama-cli` binary using 4 CPU cores:

```bash
./llama-cli -m phi-2.Q4_0.gguf -p "Write a code in C for bubble sorting" -n 32 -t 4
```

That’s all for this guide!
