---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## The AFM-4.5B model

[AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 8 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.

In this Learning Path, you'll deploy [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on an Arm-based Google Cloud Axion instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You'll also evaluate model quality using perplexity, a common metric for measuring how well a language model predicts text.
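
For reference, perplexity is the exponentiated average negative log-likelihood of a token sequence, so lower values are better. A minimal statement of the conventional formula is shown below (Llama.cpp's implementation details, such as context windowing, may differ slightly):

```latex
\mathrm{PPL}(x_{1:N}) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\,(x_i \mid x_{<i}) \right)
```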

This hands-on guide helps developers build cost-efficient, high-performance LLM applications on modern Arm server infrastructure using open-source tools and real-world deployment practices.

### LLM deployment workflow on Google Axion

- **Provision compute**: launch a Google Cloud instance using an Axion-based instance type (for example, `c4a-standard-16`)

- **Set up your environment**: install the required build tools and dependencies (such as CMake, Python, and Git)

- **Build the inference engine**: clone the [Llama.cpp](https://github.com/ggerganov/llama.cpp) repository and compile the project for your Arm-based environment

- **Prepare the model**: download the **AFM-4.5B** model files from Hugging Face and use Llama.cpp's quantization tools to reduce model size and optimize performance

- **Run inference**: load the quantized model and run sample prompts using Llama.cpp

- **Evaluate model quality**: calculate **perplexity** or use other metrics to assess model performance

{{< notice Note >}}
You can reuse this deployment flow with other models supported by Llama.cpp by swapping out the model file and adjusting quantization settings.
{{< /notice >}}

---
title: Provision your Axion environment
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Requirements

Before you begin, make sure you have the following:

- A Google Cloud account
- Permission to launch a Compute Engine Axion instance of type `c4a-standard-16` (or larger)
- At least 128 GB of available storage

If you're new to Google Cloud, check out the Learning Path [Getting Started with Google Cloud](/learning-paths/servers-and-cloud-computing/csp/google/).

## Launch and configure the Compute Engine instance

In the left sidebar of the [Compute Engine dashboard](https://console.cloud.google.com/compute), select **VM instances**, and then **Create instance**.

Use the following settings to configure your instance:

- **Name**: `arcee-axion-instance`
- **Region** and **Zone**: a region and zone where C4A instances are available
- Select **General purpose**, then click **C4A**
- **Machine type**: `c4a-standard-16` or larger

## Configure OS and Storage

In the left sidebar, select **OS and storage**.

Under **Operating system and storage**, click on **Change**.

Set the size of the disk to 128 GB, then click on **Select**.

## Review and launch the instance

Leave the other settings as they are.

When you're ready, click on **Create** to create your Compute Engine instance.
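
If you prefer the command line, a roughly equivalent `gcloud` invocation is sketched below. The zone and image family here are assumptions; substitute your own region, zone, and preferred Ubuntu release:

```bash
# Create an Axion (Arm64) instance with a 128 GB boot disk.
# Adjust --zone to a region where C4A machine types are available.
gcloud compute instances create arcee-axion-instance \
  --zone=us-central1-a \
  --machine-type=c4a-standard-16 \
  --image-family=ubuntu-2404-lts-arm64 \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=128GB
```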

## Monitor the instance launch

After a few seconds, you should see that your instance is ready.

If the launch fails, double-check your settings and permissions, and try again.

## Connect to your instance

Open the **SSH** dropdown list, and select **Open in browser window**.

Your browser may ask you to authenticate. Once you've done that, a terminal window will open.

You are now connected to your Ubuntu instance running on Axion.

{{% notice Note %}}
**Region**: make sure you're launching in your preferred Google Cloud region.
**Storage**: 128 GB is sufficient for the AFM-4.5B model and dependencies.
{{% /notice %}}

---
title: Configure your Axion environment
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this step, you'll set up the Axion instance with the tools and dependencies required to build and run the Arcee Foundation Model. This includes installing system packages and a Python environment.

## Update the package list

Run the following command to update your local APT package index:

```bash
sudo apt-get update
```

This step ensures you have the most recent metadata about available packages, including versions and dependencies. It helps prevent conflicts when installing new packages.

## Install system dependencies

Install the build tools and Python environment:

```bash
sudo apt-get install cmake gcc g++ git python3 python3-pip python3-virtualenv libcurl4-openssl-dev unzip -y
```

This command installs the following tools and dependencies:

- **CMake**: cross-platform build system generator used to compile and build Llama.cpp

- **GCC and G++**: GNU C and C++ compilers for compiling native code

- **Git**: version control system for cloning repositories

- **Python 3**: Python interpreter for running Python-based tools and scripts

- **Pip**: Python package manager

- **Virtualenv**: tool for creating isolated Python environments

- **libcurl4-openssl-dev**: development files for the curl HTTP library

- **Unzip**: tool to extract `.zip` files (used in some model downloads)

The `-y` flag automatically approves the installation of all packages without prompting.
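
To confirm the toolchain is in place before moving on, you can print a few versions (the exact version numbers will vary with your Ubuntu release):

```bash
cmake --version
gcc --version
python3 --version
git --version
```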

## Ready for build and deployment

After completing the setup, your instance includes the following tools and environments:

- A complete C/C++ development environment for building Llama.cpp
- Python 3, pip, and virtualenv for managing Python tools and environments
- Git for cloning repositories
- All required dependencies for compiling optimized Arm64 binaries

You're now ready to build Llama.cpp and download the Arcee Foundation Model.
---
title: Build Llama.cpp
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Build the Llama.cpp inference engine

In this step, you'll build Llama.cpp from source. Llama.cpp is a high-performance C/C++ inference engine for LLaMA-family and other large language models, optimized for a range of hardware platforms, including Arm-based processors like Google Axion.

Even though AFM-4.5B uses a custom model architecture, you can still use the standard Llama.cpp repository - Arcee AI has contributed the necessary modeling code upstream.

## Clone the repository

```bash
git clone https://github.com/ggerganov/llama.cpp
```

This command clones the Llama.cpp repository from GitHub to your local machine. The repository contains the source code, build scripts, and documentation needed to compile the inference engine.

## Navigate to the project directory

```bash
cd llama.cpp
```

Change into the llama.cpp directory to run the build process. This directory contains the `CMakeLists.txt` file and all source code.

## Configure the build with CMake

```bash
cmake -B .
```

This command configures the build system using CMake:

- `-B .` tells CMake to generate build files in the current directory
- CMake detects your system's compiler, libraries, and hardware capabilities
- It produces Makefiles (on Linux) or platform-specific build scripts for compiling the project


If you're running on Axion, the CMake output should include hardware-specific optimizations targeting the Neoverse V2 architecture. These optimizations are crucial for achieving high performance on Axion:

```output
-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+dotprod+i8mm+sve
```

These features enable advanced CPU instructions that accelerate inference performance on Arm64:

- **DOTPROD (Dot Product)**: hardware-accelerated dot product operations for neural network workloads

- **SVE (Scalable Vector Extension)**: advanced vector processing capabilities that can handle variable-length vectors up to 2048 bits, providing significant performance improvements for matrix operations

- **MATMUL_INT8**: integer matrix multiplication instructions optimized for transformer workloads

- **FMA (Fused Multiply-Add)**: fused multiply-add operations that speed up floating-point math

- **FP16 vector arithmetic**: 16-bit floating-point vector operations that reduce memory use with minimal loss of precision

## Compile the project

```bash
cmake --build . --config Release -j16
```

This command compiles the Llama.cpp source code:

- `--build .` tells CMake to build the project in the current directory
- `--config Release` enables optimizations and strips debug symbols
- `-j16` runs the build with 16 parallel jobs, which speeds up compilation on multi-core systems like Axion

The build process compiles the C++ source code into executable binaries optimized for the Arm64 architecture. Compilation typically takes under a minute.
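
If your instance has a different vCPU count, you can match the number of jobs to the available cores instead of hard-coding 16:

```bash
# nproc reports the number of available processing units
cmake --build . --config Release -j$(nproc)
```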

## Key binaries after compilation

After compilation, you'll find several key command-line tools in the `bin` directory:
- `llama-cli`: the main inference executable for running LLaMA models
- `llama-server`: a web server for serving model inference over HTTP
- `llama-quantize`: a tool for model quantization to reduce memory usage
- Additional utilities for model conversion and optimization

You can find more tools and usage details in the llama.cpp [GitHub repository](https://github.com/ggml-org/llama.cpp/tree/master/tools).

These binaries are specifically optimized for the Arm architecture and will provide excellent performance on your Axion instance.
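
As a quick sanity check, you can print the version information of the main binary (the exact output depends on the commit you built):

```bash
./bin/llama-cli --version
```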
---
title: Install Python dependencies
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Overview

In this step, you'll set up a Python virtual environment and install the required dependencies for working with Llama.cpp. This ensures you have a clean, isolated Python environment with all the necessary packages for model optimization.

## Create a Python virtual environment

```bash
virtualenv env-llama-cpp
```

This command creates a new Python virtual environment named `env-llama-cpp`, which has the following benefits:
- Provides an isolated Python environment to prevent package conflicts between projects
- Creates a local directory containing its own Python interpreter and installation space
- Ensures Llama.cpp dependencies don’t interfere with your global Python setup
- Supports reproducible and portable development environments

## Activate the virtual environment

Run the following command to activate the virtual environment:

```bash
source env-llama-cpp/bin/activate
```
This command does the following:

- Runs the activation script, which modifies your shell environment
- Updates your shell prompt to show `env-llama-cpp`, indicating the environment is active
- Updates `PATH` so that the environment's Python interpreter is used first
- Ensures all `pip` commands install packages into the isolated environment
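
When you're done working in the environment, you can leave it at any time by running `deactivate`:

```bash
deactivate
```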

## Upgrade pip to the latest version

Before installing dependencies, it’s a good idea to upgrade pip:

```bash
pip install --upgrade pip
```
This command:

- Ensures you have the latest version of pip
- Helps avoid compatibility issues with modern packages
- Applies the `--upgrade` flag to fetch and install the newest release
- Brings in security patches and better dependency resolution logic

## Install project dependencies

From the root of the `llama.cpp` repository, use the following command to install all required Python packages:

```bash
pip install -r requirements.txt
```

This command does the following:

- Uses the `-r` flag to read the list of dependencies from `requirements.txt`
- Installs the exact package versions required for the project
- Ensures consistency across development environments and contributors
- Includes packages for model loading, inference, and Python bindings for `llama.cpp`

This step sets up everything you need to run AFM-4.5B in your Python environment.
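
To spot-check the installation, you can try importing one of the core packages (this assumes NumPy is listed in the `requirements.txt` file, which is the case for current Llama.cpp releases):

```bash
python -c "import numpy; print(numpy.__version__)"
```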

## What the environment includes

After the installation completes, your virtual environment includes:
- **NumPy**: for numerical computations and array operations
- **Requests**: for HTTP operations and API calls
- **Other dependencies**: additional packages required by llama.cpp's Python bindings and utilities

Your environment is now ready to run Python scripts that integrate with the compiled Llama.cpp binaries.

{{< notice Tip >}}
Before running any Python commands, make sure your virtual environment is activated.
{{< /notice >}}

