Build and use ik_llama.cpp with CPU or CPU+CUDA

Built on top of ikawrakow/ik_llama.cpp and llama-swap

All commands are provided for Podman and Docker.

The CPU or CUDA sections under Build and Run are enough to get up and running.

Overview

Build
Run
Troubleshooting
Extra
Credits

Build

Using docker-bake (Recommended)

The project uses Docker Bake for building multiple targets efficiently.

CPU Variant

docker buildx bake --builder ik-llama-builder full swap

Or with custom tags:

REPO_OWNER=yourname docker buildx bake --builder ik-llama-builder \
  -f ./docker-bake.hcl \
  full swap

CUDA Variant

First, set the CUDA version and GPU architecture in ik_llama-cuda.Containerfile:

CUDA_DOCKER_ARCH: Your GPU's compute capability without the dot (e.g., 86 for RTX 30*, 89 for RTX 40*, 120 for RTX 50*)
CUDA_VERSION: CUDA Toolkit version (e.g., 12.6.2, 13.1.1)

VARIANT=cu12 docker buildx bake --builder ik-llama-builder full swap
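The value for CUDA_DOCKER_ARCH can be derived from the compute capability the driver reports. The sketch below assumes the variable expects the dot-free form used for 86 and 89, and that the installed driver is recent enough for nvidia-smi to support the compute_cap query field. cc_to_arch is a hypothetical helper name, not part of the repository.

```shell
#!/bin/sh
# cc_to_arch (hypothetical helper): turn a compute capability such as "8.6"
# into the dot-free form "86".
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# With a recent NVIDIA driver, nvidia-smi can report the capability directly.
# The query is guarded so the script still runs on hosts without a GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
  echo "CUDA_DOCKER_ARCH=$(cc_to_arch "$cap")"
fi

cc_to_arch 8.6    # prints 86
cc_to_arch 12.0   # prints 120
```

On a host with several different GPUs only the first one is used here, so check the output before copying it into the Containerfile.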

Build Targets

Each variant builds two image tags:

full: Includes llama-server, llama-quantize, and other utilities.
swap: Includes only llama-swap and llama-server.

Local Development

Clone the repository: git clone https://github.com/ikawrakow/ik_llama.cpp
Enter the repo: cd ik_llama.cpp
Build with docker-bake as shown above, or with the build-local.sh script.

Run

Download .gguf model files to a local directory (e.g., /my_local_files/gguf).
Map that directory to /models inside the container.
Start the container with one of the commands below, then open http://localhost:9292 in a browser.
API endpoints are available at http://localhost:9292/v1 for use by other applications.

CPU

podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap

docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap

CUDA

Install the NVIDIA drivers and CUDA on the host.
For Docker, install the NVIDIA Container Toolkit.
For Podman, set up CDI (Container Device Interface).
Identify your GPU:
- CUDA GPU Compute Capability (e.g., 8.6 for RTX 30*, 8.9 for RTX 40*, 12.0 for RTX 50*)
- CUDA Toolkit supported version

podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap

docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap
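Once a container is running, the API can be checked from the host. The sketch below assumes the server exposes an OpenAI-style /models route under /v1. list_models is a hypothetical helper name, and the port matches the -p 9292:8080 mapping used above.

```shell
#!/bin/sh
# Base URL from the Run section: host port 9292 maps to the container's 8080.
BASE_URL="http://localhost:9292/v1"

# list_models (hypothetical helper): assumes an OpenAI-style /models route.
# -f fails on HTTP errors, -sS hides progress but keeps error messages,
# --max-time bounds the wait if nothing is listening.
list_models() {
  curl -fsS --max-time 5 "$BASE_URL/models"
}

# Only call the server when curl is present; print a hint if it is unreachable.
if command -v curl >/dev/null 2>&1; then
  list_models || echo "No response from $BASE_URL; is the container running?"
fi
```

If you published a different host port, change BASE_URL to match.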

Troubleshooting

If CUDA is not available, use ik_llama-cpu instead.
If models are not found, ensure you mount the correct directory: -v /my_local_files/gguf:/models:ro
If you need to install Podman or Docker, follow the Podman Installation or Install Docker Engine guide for your OS.
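For the "models are not found" case, it can help to confirm that the host directory really contains .gguf files before mounting it. This is a minimal sketch: has_gguf is a hypothetical helper, MODELS_DIR is an assumed variable name, and only the top level of the directory is checked.

```shell
#!/bin/sh
# has_gguf (hypothetical helper): succeeds if DIR holds at least one .gguf file.
has_gguf() {
  dir=$1
  [ -d "$dir" ] || return 1
  for f in "$dir"/*.gguf; do
    # An unmatched glob stays literal, so test that the path really exists.
    [ -e "$f" ] && return 0
  done
  return 1
}

# MODELS_DIR is an assumed variable; the default is the example path from this README.
MODELS_DIR="${MODELS_DIR:-/my_local_files/gguf}"
if has_gguf "$MODELS_DIR"; then
  echo "OK: $MODELS_DIR contains .gguf files"
else
  echo "No .gguf files in $MODELS_DIR; check the path given to -v"
fi
```

Models stored in subdirectories are not detected by this check.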

Extra

Custom commit: Build a specific ik_llama.cpp commit by modifying the Containerfile or using build args.

docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full

Using the tools in the full image:

$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...

docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...

Customize llama-swap config: Save ./docker/ik_llama-cpu-swap.config.yaml or ./docker/ik_llama-cuda-swap.config.yaml locally (e.g., under /my_local_files/), then map it to /app/config.yaml inside the container by appending -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro to your podman run ... or docker run ... command.
Run in background: Replace -it with -d: podman run -d ... or docker run -d .... To stop it: podman stop ik_llama or docker stop ik_llama.
GGML_NATIVE: If you build the image on a different machine than the one that will run it, change -DGGML_NATIVE=ON to -DGGML_NATIVE=OFF in the Containerfile.
KV quantization types: To use more KV quantization types, build with -DGGML_IQK_FA_ALL_QUANTS=ON.
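The config-mount and background options above can be combined into a single command. The sketch below only prints the command so it can be reviewed before running. run_cmd is a hypothetical helper, the image name is the CPU swap image from the Run section, and paths containing spaces are not handled.

```shell
#!/bin/sh
# run_cmd (hypothetical helper): print, but do not execute, a run command.
# Usage: run_cmd ENGINE MODELS_DIR [CONFIG_FILE]
run_cmd() {
  engine=$1
  models=$2
  config=${3:-}
  # -d replaces -it so the container runs in the background.
  cmd="$engine run -d --name ik_llama --rm -p 9292:8080 -v $models:/models:ro"
  # Mount a custom llama-swap config only when one is given.
  if [ -n "$config" ]; then
    cmd="$cmd -v $config:/app/config.yaml:ro"
  fi
  echo "$cmd localhost/ik_llama-cpu:swap"
}

run_cmd podman /my_local_files/gguf /my_local_files/ik_llama-cpu-swap.config.yaml
```

Copy the printed line into a terminal to start the container, and stop it later with podman stop ik_llama or docker stop ik_llama.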

Clean up unused CUDA images: If you experiment with several CUDA_VERSION values, delete the unused base images (each is several GB):

podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \
  podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04

Build without llama-swap: Change --target swap to --target server in docker-bake or Containerfiles.
Pre-made quants: Look for premade quants from ubergarm.
GGUF tools: Build custom quants with Thireus's tools.
Download prebuilt binaries: Thireus's fork of ik_llama.cpp publishes release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.
KoboldCPP experience: Croco.Cpp is a fork of KoboldCPP that runs inference on GGUF/GGML models on CPU/CUDA with KoboldAI's UI. It is powered partly by ik_llama.cpp and is compatible with most of ikawrakow's quants except Bitnet.

Credits

All credits to the awesome community:

llama-swap
