Built on top of ikawrakow/ik_llama.cpp and llama-swap
All commands are provided for Podman and Docker.
CPU or CUDA sections under Build and Run are enough to get up and running.
The project uses Docker Bake for building multiple targets efficiently.
docker buildx bake --builder ik-llama-builder full swapOr with custom tags:
REPO_OWNER=yourname docker buildx bake --builder ik-llama-builder \
-f ./docker-bake.hcl \
full swapFirst, set the CUDA version and GPU architecture in ik_llama-cuda.Containerfile:
CUDA_DOCKER_ARCH: Your GPU's compute capability (e.g.,86for RTX 30*,89for RTX 40*,12.0for RTX 50*)CUDA_VERSION: CUDA Toolkit version (e.g.,12.6.2,13.1.1)
VARIANT=cu12 docker buildx bake --builder ik-llama-builder full swapBuilds two image tags per variant:
full: Includesllama-server,llama-quantize, and other utilities.swap: Includes onlyllama-swapandllama-server.
- Clone the repository:
git clone https://github.com/ikawrakow/ik_llama.cpp - Enter the repo:
cd ik_llama.cpp - Use either docker-bake or build-local.sh as shown above.
- Download
.ggufmodel files to your favorite directory (e.g.,/my_local_files/gguf). - Map it to
/modelsinside the container. - Open browser
http://localhost:9292and enjoy the features. - API endpoints are available at
http://localhost:9292/v1for use in other applications.
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swapdocker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap- Install Nvidia Drivers and CUDA on the host.
- For Docker, install NVIDIA Container Toolkit
- For Podman, install CDI Container Device Interface
- Identify your GPU:
- CUDA GPU Compute Capability (e.g.,
8.6for RTX30*,8.9for RTX40*,12.0for RTX50*) - CUDA Toolkit supported version
- CUDA GPU Compute Capability (e.g.,
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swapdocker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap- If CUDA is not available, use
ik_llama-cpuinstead. - If models are not found, ensure you mount the correct directory:
-v /my_local_files/gguf:/models:ro - If you need to install
podmanordockerfollow the Podman Installation or Install Docker Engine for your OS.
- Custom commit: Build a specific
ik_llama.cppcommit by modifying the Containerfile or using build args.
docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full- Using the tools in the
fullimage:
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...-
Customize
llama-swapconfig: Save the./docker/ik_llama-cpu-swap.config.yamlor./docker/ik_llama-cuda-swap.config.yamllocally (e.g., under/my_local_files/) then map it to/app/config.yamlinside the container appending-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:roto yourpodman run ...ordocker run .... -
Run in background: Replace
-itwith-d:podman run -d ...ordocker run -d .... To stop it:podman stop ik_llamaordocker stop ik_llama. -
GGML_NATIVE: If you build the image on a different machine, change
-DGGML_NATIVE=ONto-DGGML_NATIVE=OFFin the.Containerfile. -
KV quantization types: To use more KV quantization types, build with
-DGGML_IQK_FA_ALL_QUANTS=ON. -
Cleanup unused CUDA images: If you experiment with several
CUDA_VERSION, delete unused images (they are several GB):podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \ podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04 -
Build without
llama-swap: Change--target swapto--target serverin docker-bake or Containerfiles. -
Pre-made quants: Look for premade quants from ubergarm.
-
GGUF tools: Build custom quants with Thireus's tools.
-
Download prebuilt binaries: Download from ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.
-
KoboldCPP experience: Croco.Cpp is a fork of KoboldCPP inferring GGUF/GGML models on CPU/Cuda with KoboldAI's UI. It's powered partly by IK_LLama.cpp, and compatible with most of Ikawrakow's quants except Bitnet.
All credits to the awesome community: