<!--
  MIT License

  Copyright (c) 2024 Arm Limited

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
  in the Software without restriction, including without limitation the rights
  to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
  copies of the Software, and to permit persons to whom the Software is
  furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included in all
  copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  SOFTWARE.
-->

<h1><b>Running llama.cpp with KleidiAI Int4 matrix-multiplication (matmul) micro-kernels</b></h1>

## Prerequisites

- Experience with Arm® cross-compilation on Android™
- Proficiency with Android™ shell commands
- An Android™ device with an Arm® CPU that supports the <strong>FEAT_DotProd</strong> (dotprod) and <strong>FEAT_I8MM</strong> (i8mm) features

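You can check for both features up front: on Android™ devices they show up in `/proc/cpuinfo` as `asimddp` (FEAT_DotProd) and `i8mm` (FEAT_I8MM). The `features` string below is a hypothetical example of such a line; on a connected device you would capture the real one with `adb shell grep -m1 Features /proc/cpuinfo`:

```shell
# Hypothetical "Features" line; on a real device, capture it with:
#   features=$(adb shell grep -m1 Features /proc/cpuinfo)
features="fp asimd aes sha1 sha2 crc32 atomics fphp asimdhp asimddp i8mm"

# FEAT_DotProd is reported as "asimddp", FEAT_I8MM as "i8mm".
for flag in asimddp i8mm; do
  case " $features " in
    *" $flag "*) echo "$flag: present" ;;
    *)           echo "$flag: MISSING" ;;
  esac
done
```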
## Dependencies
- A laptop/PC with a Linux®-based operating system (tested on Ubuntu® 20.04.4 LTS)
- The Android™ NDK (minimum version: r25), which can be downloaded from [here](https://developer.android.com/ndk/downloads)
- The Android™ SDK Platform command-line tools, which can be downloaded from [here](https://developer.android.com/tools/releases/platform-tools)

## Goal

In this guide, we will show you how to apply a patch on top of llama.cpp to enable the <strong>[KleidiAI](https://gitlab.arm.com/kleidi/kleidiai)</strong> int4 matmul micro-kernels with per-block quantization (c32).

> ℹ️ In the context of llama.cpp, this int4 format is called <strong>Q4_0</strong>.

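To make the storage cost concrete: a Q4_0 block packs 32 weights as 4-bit values (16 bytes) plus one fp16 scale (2 bytes), so each 32-weight block takes 18 bytes, i.e. 4.5 bits per weight. A quick sanity check of that arithmetic:

```shell
# Q4_0 block layout: 2-byte fp16 scale + 32 four-bit quants packed into 16 bytes.
scale_bytes=2
quant_bytes=$((32 / 2))
block_bytes=$((scale_bytes + quant_bytes))
echo "bytes per 32-weight block: $block_bytes"
awk -v b="$block_bytes" 'BEGIN { printf "bits per weight: %.1f\n", b * 8 / 32 }'
```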
These KleidiAI micro-kernels were fundamental to the Cookie and Ada chatbot, which Arm® showcased to demonstrate large language models (LLMs) running on existing flagship and premium mobile CPUs based on Arm® technology. You can learn more about the demo in <strong>[this](https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/generative-ai-on-mobile-on-arm-cpu)</strong> blog post.

<p align="center">
<video autoplay src="https://community.arm.com/cfs-file/__key/telligent-evolution-videotranscoding-securefilestorage/communityserver-blogs-components-weblogfiles-00-00-00-38-23/phi_2D00_3-demo.mp4.mp4" width="640" height="480" controls></video>
</p>

> ⚠️ This guide is intended as a demonstration of how to integrate the optimized KleidiAI int4 matmul routines in llama.cpp.

## Target Arm® CPUs

Arm® CPUs with the <strong>FEAT_DotProd</strong> (dotprod) and <strong>FEAT_I8MM</strong> (i8mm) features.

## Running llama.cpp with KleidiAI

Connect your Android™ device to your computer and open a terminal. Then, follow these steps to apply the patch with the KleidiAI backend on top of llama.cpp.

### Step 1:

Clone the [llama.cpp](https://github.com/ggerganov/llama.cpp) repository:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
```

### Step 2:

Enter the `llama.cpp/` directory, and check out the `6fcd1331efbfbb89c8c96eba2321bb7b4d0c40e4` commit:

```bash
cd llama.cpp
git checkout 6fcd1331efbfbb89c8c96eba2321bb7b4d0c40e4
```

Pinning this commit provides a known-good base for the KleidiAI patch; on other commits the patch may not apply cleanly.

### Step 3:

Copy [this](0001-Use-KleidiAI-Int4-Matmul-micro-kernels-in-llama.cpp.patch) patch, which contains the code changes that enable the KleidiAI optimizations in llama.cpp, into the `llama.cpp/` directory.

### Step 4:

Apply the patch with the KleidiAI backend:

```bash
git apply 0001-Use-KleidiAI-Int4-Matmul-micro-kernels-in-llama.cpp.patch
```

### Step 5:

Build the llama.cpp project for Android™:

```bash
mkdir build && cd build

export NDK_PATH="your-android-ndk-path"

cmake -DLLAMA_KLEIDIAI=ON \
      -DLLAMA_OPENMP=OFF \
      -DCMAKE_TOOLCHAIN_FILE=${NDK_PATH}/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a \
      -DANDROID_PLATFORM=android-23 \
      -DCMAKE_C_FLAGS=-march=armv8.2a+i8mm+dotprod \
      -DCMAKE_CXX_FLAGS=-march=armv8.2a+i8mm+dotprod \
      ..

make -j4
```

### Step 6:

Download a Large Language Model (LLM) in `.gguf` format with `Q4_0` weights. For example, you can download the <strong>Phi-2</strong> model from [here](https://huggingface.co/TheBloke/phi-2-GGUF/blob/main/phi-2.Q4_0.gguf).
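
The link above is the file's web view; Hugging Face typically serves the raw file from a `/resolve/` URL rather than `/blob/` (URL pattern assumed here). The sketch below derives the direct link and sanity-checks the expected size: Phi-2 has roughly 2.7 billion parameters and Q4_0 costs about 4.5 bits per weight, so the file should be on the order of 1.5 GB (slightly more in practice, since some tensors stay in higher precision):

```shell
# Turn the web-view link into a direct-download link (pattern assumed):
page="https://huggingface.co/TheBloke/phi-2-GGUF/blob/main/phi-2.Q4_0.gguf"
url=$(printf '%s\n' "$page" | sed 's|/blob/|/resolve/|')
echo "$url"
# wget "$url"   # uncomment to download

# Rough lower-bound size estimate: ~2.7e9 params * 4.5 bits / 8 bits-per-byte:
awk 'BEGIN { printf "expected size: ~%.1f GB\n", 2.7e9 * 4.5 / 8 / 1e9 }'
```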
### Step 7:

Push the `llama-cli` binary and the `.gguf` file to `/data/local/tmp` on your Android™ device:

```bash
adb push bin/llama-cli /data/local/tmp
adb push phi-2.Q4_0.gguf /data/local/tmp
```

### Step 8:

Enter your Android™ device:

```bash
adb shell
```

Then, go to `/data/local/tmp`:

```bash
cd /data/local/tmp
```

### Step 9:

Run the model inference with the `llama-cli` binary, using 4 CPU cores:

```bash
./llama-cli -m phi-2.Q4_0.gguf -p "Write a code in C for bubble sorting" -n 32 -t 4
```
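
Here, `-m` selects the model file, `-p` the prompt, `-n` the number of tokens to generate, and `-t` the number of CPU threads. The best thread count varies by device (big.LITTLE phones often peak at the number of big cores), so a simple sweep over `-t`, comparing the timing summary each run prints, is one way to find it. A sketch, meant to be run on the device from `/data/local/tmp`:

```shell
# Sketch: try a few thread counts and compare the printed timing summaries.
for t in 1 2 4 8; do
  # Skip gracefully when llama-cli is not present (i.e. not running on the device).
  [ -x ./llama-cli ] || { echo "./llama-cli not found; run this on the device"; break; }
  echo "=== $t threads ==="
  ./llama-cli -m phi-2.Q4_0.gguf -p "Hello" -n 16 -t "$t"
done
```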

That’s all for this guide!