This blueprint provides a framework for testing inference on CPUs using the Ollama platform with supported models such as Mistral and Gemma. Unlike GPU-dependent solutions, it is designed for environments where CPU inference is preferred or required. It offers guidelines and configuration settings for deploying a CPU inference service, enabling performance evaluation and reliability testing. Ollama's lightweight, efficient architecture makes it well suited to benchmarking and optimizing CPU-based inference workloads.
The blueprint covers running large language models on CPUs with Ollama and includes two main deployment strategies:
- Serving pre-saved models directly from Object Storage
- Pulling models from Ollama and saving them to Object Storage
Title | Description |
---|---|
CPU inference with Mistral and BM.Standard.E4 | Deploys CPU inference with Mistral on a BM.Standard.E4.128 bare metal node (CPU only, no GPUs). |
CPU inference with Gemma and BM.Standard.E5.192 | Deploys CPU inference with Gemma on a BM.Standard.E5.192 bare metal node (CPU only, no GPUs). |
CPU inference with Mistral and VM.Standard.E4.Flex | Deploys CPU inference with Mistral on a VM.Standard.E4.Flex instance (CPU only, no GPUs). |
You can access these pre-filled samples from the portal.
CPU inference is ideal for:
- Low-throughput or cost-sensitive deployments
- Offline testing and validation
- Prototyping without GPU dependency
Ollama supports several high-quality open-source LLMs. Below is a small set of commonly used models:
Model Name | Description |
---|---|
gemma | Lightweight open LLM by Google |
llama2 | Meta’s large language model |
mistral | Open-weight performant LLM |
phi3 | Microsoft’s compact LLM |
If you've already pushed your model to Object Storage, use the following service-mode recipe to run it. Ensure your model files are in the blob + manifest format used by Ollama.
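If you are unsure what that format looks like, you can pull a model locally with the Ollama CLI and inspect its on-disk layout before uploading to the bucket. This is only illustrative and assumes a default local Ollama installation; the job-mode recipe later in this blueprint automates the pull-and-save step:

```bash
# Pull a model locally and inspect how Ollama lays it out on disk
# (default models directory is ~/.ollama/models).
ollama pull gemma
ls ~/.ollama/models
# blobs/      -> sha256-* layer files holding the model weights and metadata
# manifests/  -> registry.ollama.ai/library/gemma/<tag> manifest referencing those blobs
```
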
Field | Description |
---|---|
recipe_id | cpu_inference – Identifier for the recipe |
recipe_mode | service – Used for long-running inference |
deployment_name | Custom name for the deployment |
recipe_image_uri | URI for the container image in OCIR |
recipe_node_shape | OCI shape, e.g., BM.Standard.E4.128 |
input_object_storage | Object Storage bucket mounted as input |
recipe_container_env | List of environment variables |
recipe_replica_count | Number of replicas |
recipe_container_port | Port to expose the container |
recipe_node_pool_size | Number of nodes in the pool |
recipe_node_boot_volume_size_in_gbs | Boot volume size in GB |
recipe_container_command_args | Arguments for the container command |
recipe_ephemeral_storage_size | Temporary scratch storage |

```json
{
"recipe_id": "cpu_inference",
"recipe_mode": "service",
"deployment_name": "gemma and BME4 service",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_service_v0.2",
"recipe_node_shape": "BM.Standard.E4.128",
"input_object_storage": [
{
"bucket_name": "ollama-models",
"mount_location": "/models",
"volume_size_in_gbs": 20
}
],
"recipe_container_env": [
{ "key": "MODEL_NAME", "value": "gemma" },
{ "key": "PROMPT", "value": "What is the capital of France?" }
],
"recipe_replica_count": 1,
"recipe_container_port": "11434",
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_container_command_args": [
"--input_directory", "/models", "--model_name", "gemma"
],
"recipe_ephemeral_storage_size": 100
}
```
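To deploy, you can use the pre-filled portal samples mentioned above, or save the JSON to a file and submit it programmatically if your installation exposes the blueprints REST API. The sketch below is only illustrative; the API URL, the /deployment path, and the file name are placeholders to verify against your own installation:

```bash
# Hypothetical submission of the recipe JSON; confirm the endpoint for your blueprints installation.
curl -X POST "<BLUEPRINT_API_URL>/deployment" \
  -H "Content-Type: application/json" \
  -d @cpu_inference_service.json
```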
Once deployed, send inference requests to the model via the exposed port:
```bash
curl http://<PUBLIC_IP>:11434/api/generate -d '{
  "model": "gemma",
  "prompt": "What is the capital of France?",
  "stream": false
}'
```
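You can also confirm that the model was loaded from the mounted bucket by listing the models the server has available through Ollama's standard /api/tags endpoint:

```bash
# List the models currently available to the Ollama server.
curl -s http://<PUBLIC_IP>:11434/api/tags
# With jq installed, print just the model names:
curl -s http://<PUBLIC_IP>:11434/api/tags | jq -r '.models[].name'
```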
For deployments exposed through a public hostname (for example via nip.io), you can call the same API and use jq to extract just the generated text from the streamed response; the second example adds -k to skip TLS verification for endpoints with self-signed certificates:

```bash
curl -L -X POST https://cpu-inference-mismistral.130-162-199-33.nip.io/api/generate \
  -d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
  | jq -r 'select(.response) | .response' | paste -sd " "

curl -L -k -X POST https://cpu-inference-mistral-flexe4.130-162-199-33.nip.io/api/generate \
  -d '{ "model": "mistral", "prompt": "What is the capital of Germany?" }' \
  | jq -r 'select(.response) | .response' | paste -sd " "
```
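Since a common goal of this blueprint is benchmarking CPU inference, a minimal sketch like the one below can collect rough end-to-end latencies against a deployed service using curl's built-in timing; the host and model name are placeholders to adjust for your deployment:

```bash
# Rough latency check: send the same prompt several times and report curl's total time per request.
HOST="http://<PUBLIC_IP>:11434"   # placeholder: your deployment's endpoint
MODEL="gemma"                     # placeholder: the model served by the recipe

for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    "$HOST/api/generate" \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"What is the capital of France?\", \"stream\": false}"
done
```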
To download a model from Ollama and store it in Object Storage, use the job-mode recipe below.
Field | Description |
---|---|
recipe_id | cpu_inference – Same recipe base |
recipe_mode | job – One-time job to save a model |
deployment_name | Custom name for the saving job |
recipe_image_uri | OCIR URI of the saver image |
recipe_node_shape | Compute shape used for the job |
output_object_storage | Where to store pulled models |
recipe_container_env | Environment variables including model name |
recipe_replica_count | Set to 1 |
recipe_container_port | Typically 11434 for Ollama |
recipe_node_pool_size | Set to 1 |
recipe_node_boot_volume_size_in_gbs | Size in GB |
recipe_container_command_args | Set output directory and model name |
recipe_ephemeral_storage_size | Temporary storage |

```json
{
"recipe_id": "cpu_inference",
"recipe_mode": "job",
"deployment_name": "gemma and BME4 saver",
"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:cpu_inference_saver_v0.2",
"recipe_node_shape": "BM.Standard.E4.128",
"output_object_storage": [
{
"bucket_name": "ollama-models",
"mount_location": "/models",
"volume_size_in_gbs": 20
}
],
"recipe_container_env": [
{ "key": "MODEL_NAME", "value": "gemma" },
{ "key": "PROMPT", "value": "What is the capital of France?" }
],
"recipe_replica_count": 1,
"recipe_container_port": "11434",
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_container_command_args": [
"--output_directory", "/models", "--model_name", "gemma"
],
"recipe_ephemeral_storage_size": 100
}
```
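After the job completes, you can verify that the model files were written to the bucket, for example with the OCI CLI (assuming it is configured for the same tenancy and region):

```bash
# List the objects the saver job uploaded to the bucket.
oci os object list --bucket-name ollama-models --output table
```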
- Ensure the required OCI IAM policies are in place so the deployment can read from and write to Object Storage.
- Confirm that bucket region and deployment region match.
- Use the job-mode recipe once to save a model, and the service-mode recipe repeatedly to serve it.