
Commit abac95f

Merge pull request #2150 from aryan8433/main
Adding learning path for distributed inference with llama.cpp on Arm
2 parents 93c762c + 742f41c commit abac95f

File tree

6 files changed (+432, -96 lines)


assets/contributors.csv

Lines changed: 97 additions & 96 deletions
@@ -1,96 +1,97 @@
(The 96 existing rows were removed and re-added unchanged; the resulting file contents follow.)
author,company,github,linkedin,twitter,website
Jason Andrews,Arm,jasonrandrews,jason-andrews-7b05a8,,
Pareena Verma,Arm,pareenaverma,pareena-verma-7853607,,
Ronan Synnott,Arm,,ronansynnott,,
Florent Lebeau,Arm,,,,
Brenda Strech,Remote.It,bstrech,bstrech,@remote_it,www.remote.it
Liliya Wu,Arm,Liliyaw,liliya-wu-8b6227216,,
Julio Suarez,Arm,jsrz,juliosuarez,,
Gabriel Peterson,Arm,gabrieldpeterson,gabrieldpeterson,@gabedpeterson,https://corteximplant.com/@gabe
Christopher Seidl,Arm,,,,
Michael Hall,Arm,,,,
Kasper Mecklenburg,Arm,,,,
Mathias Brossard,Arm,,,,
Julie Gaskin,Arm,,,,
Pranay Bakre,Arm,,,,
Elham Harirpoush,Arm,,,,
Frédéric -lefred- Descamps,OCI,,,,lefred.be
Frédéric -lefred- Descamps,OCI,,,,lefred.be
Kristof Beyls,Arm,,,,
David Spickett,Arm,,,,
Uma Ramalingam,Arm,uma-ramalingam,,,
Konstantinos Margaritis,VectorCamp,markos,konstantinosmargaritis,@freevec1,https://vectorcamp.gr/
Diego Russo,Arm,diegorusso,diegor,diegor,https://www.diegor.it
Jonathan Davies,Arm,,,,
Zhengjun Xing,Arm,,,,
Leandro Nunes,Arm,,,,
Dawid Borycki,,dawidborycki,,,
Ying Yu,Arm,,,,
Bolt Liu,Arm,,,,
Roberto Lopez Mendez,Arm,,,,
Arnaud de Grandmaison,Arm,Arnaud-de-Grandmaison-ARM,arnauddegrandmaison,,
Jose-Emilio Munoz-Lopez,Arm,,,,
James Whitaker,Arm,,,,
Johanna Skinnider,Arm,,,,
Varun Chari,Arm,,,,
Adnan AlSinan,Arm,,,,
Graham Woodward,Arm,,,,
Basma El Gaabouri,Arm,,,,
Gayathri Narayana Yegna Narayanan,Arm,,,,
Alexandros Lamprineas,Arm,,,,
Annie Tallund,Arm,annietllnd,annietallund,,
Cyril Rohr,RunsOn,crohr,cyrilrohr,,
Rin Dobrescu,Arm,,,,
Przemyslaw Wirkus,Arm,PrzemekWirkus,przemyslaw-wirkus-78b73352,,
Nader Zouaoui,Day Devs,nader-zouaoui,nader-zouaoui,@zouaoui_nader,https://daydevs.com/
Alaaeddine Chakroun,Day Devs,Alaaeddine-Chakroun,alaaeddine-chakroun,,https://daydevs.com/
Koki Mitsunami,Arm,,kmitsunami,,
Chen Zhang,Zilliz,,,,
Tianyu Li,Arm,,,,
Georgios Mermigkis,VectorCamp,gMerm,georgios-mermigkis,,https://vectorcamp.gr/
Ben Clark,Arm,,,,
Han Yin,Arm,hanyin-arm,nacosiren,,
Willen Yang,Arm,,,,
Daniel Gubay,,,,,
Paul Howard,,,,,
Iago Calvo Lista,Arm,,,,
Stephen Theobald,Arm,,,,
ThirdAI,,,,,
Preema Merlin Dsouza,,,,,
Dominica Abena O. Amanfo,,,,,
Arm,,,,,
Albin Bernhardsson,,,,,
Przemyslaw Wirkus,,,,,
Zach Lasiuk,,,,,
Daniel Nguyen,,,,,
Joe Stech,Arm,JoeStech,joestech,,
visualSilicon,,,,,
Konstantinos Margaritis,VectorCamp,,,,
Kieran Hejmadi,,,,,
Alex Su,,,,,
Chaodong Gong,,,,,
Owen Wu,Arm,,,,
Koki Mitsunami,,,,,
Nikhil Gupta,,,,,
Nobel Chowdary Mandepudi,Arm,,,,
Ravi Malhotra,Arm,,,,
Masoud Koleini,,,,,
Na Li,Arm,,,,
Tom Pilar,,,,,
Cyril Rohr,,,,,
Odin Shen,Arm,odincodeshen,odin-shen-lmshen,,
Avin Zarlez,Arm,AvinZarlez,avinzarlez,,https://www.avinzarlez.com/
Shuheng Deng,Arm,,,,
Yiyang Fan,Arm,,,,
Julien Jayat,Arm,JulienJayat-Arm,julien-jayat-a980a397,,
Geremy Cohen,Arm,geremyCohen,geremyinanutshell,,
Barbara Corriero,Arm,,,,
Nina Drozd,Arm,NinaARM,ninadrozd,,
Jun He,Arm,JunHe77,jun-he-91969822,,
Gian Marco Iodice,Arm,,,,
Aude Vuilliomenet,Arm,,,,
Andrew Kilroy,Arm,,,,
Peter Harris,Arm,,,,
Chenying Kuo,Adlink,evshary,evshary,,
William Liang,,wyliang,,,
Waheed Brown,Arm,https://github.com/armwaheed,https://www.linkedin.com/in/waheedbrown/,,
Aryan Bhusari,Arm,,https://www.linkedin.com/in/aryanbhusari,,
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: Distributed inference using llama.cpp

draft: true
cascade:
    draft: true

minutes_to_complete: 30

who_is_this_for: This learning path is for developers with some experience using llama.cpp who want to learn about distributed inference.

learning_objectives:
    - Set up the main host and worker nodes using llama.cpp
    - Run a large quantized model (e.g., Llama 3.1 405B) on CPUs in a distributed manner on Arm machines

prerequisites:
    - An AWS Graviton4 c8g.16xlarge instance to test Arm performance optimizations, or any [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server.
    - Familiarity with [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
    - Familiarity with AWS

author: Aryan Bhusari

### Tags
skilllevels: Introductory
subjects: ML
armips:
    - Neoverse
tools_software_languages:
    - LLM
    - GenAI
    - AWS
operatingsystems:
    - Linux

further_reading:
    - resource:
        title: Llama.cpp rpc-server code
        link: https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc
        type: Code

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---
(Binary file added, 61.7 KB; preview not shown.)
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
---
title: Overview and Worker Node Configuration
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Before you begin
The instructions in this Learning Path are for any Arm server running Ubuntu 24.04.2 LTS. You will need at least three Arm server instances, each with at least 64 cores and 128 GB of RAM, to run this example; a Q4_0 quantization of a 405B-parameter model occupies well over 200 GB, which is more than any single one of these instances can hold. The instructions have been tested on AWS Graviton4 c8g.16xlarge instances.

## Overview
llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. About a year before this Learning Path was published, rgerganov's RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines, even when the model does not fit into the memory of a single machine. In this Learning Path, you will learn how to run a 405B-parameter model on Arm-based CPUs.

For the purposes of this demonstration, the following experimental setup is used:
- Total number of instances: 3
- Instance type: c8g.16xlarge
- Model: Llama-3.1-405B_Q4_0.gguf

One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where all the actual computation is performed.

## Implementation

1. To get started, follow [this Learning Path](/learning-paths/servers-and-cloud-computing/llama-cpu) up to the step where you clone the llama.cpp repository. Since this setup involves multiple instances (or devices), you will need to replicate the initial setup on each device. Specifically, after executing the command below on all devices, continue with this Learning Path starting from Step 2.

```bash
git clone https://github.com/ggerganov/llama.cpp
```
2. Build the llama.cpp library with the RPC feature enabled by compiling it with the `-DGGML_RPC=ON` flag:
```bash
cd llama.cpp
mkdir -p build-rpc
cd build-rpc
cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON
cmake --build . --config Release
```

`llama.cpp` is now built in the `build-rpc/bin` directory.
Check that `llama.cpp` has built correctly by running the help command:
```bash
cd build-rpc
bin/llama-cli -h
```
If everything was built correctly, you should see a list of all the available flags that can be used with llama-cli.
3. Now, choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the most compute should be selected, especially for a model as large as 405B. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers.

Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured, restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
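For example, if all three instances sit in the same VPC and share a security group, a single ingress rule scoped to the VPC CIDR keeps the RPC port closed to the public internet. The AWS CLI sketch below illustrates the idea; the security group ID and CIDR block are placeholders for your own values:
```bash
# Allow the llama.cpp RPC port (50052) only from addresses inside the VPC.
# The security group ID and CIDR below are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 50052 \
  --cidr 10.0.0.0/16
```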
Use the following command to start listening on the worker nodes:
```bash
bin/rpc-server -p 50052 -H 0.0.0.0 -t 64
```
Below are the available flag options that can be used with the rpc-server functionality:

```output
-h, --help                show this help message and exit
-t, --threads             number of threads for the CPU backend (default: 6)
-d DEV, --device          device to use
-H HOST, --host HOST      host to bind to (default: 127.0.0.1)
-p PORT, --port PORT      port to bind to (default: 50052)
-m MEM, --mem MEM         backend memory size (in MB)
-c, --cache               enable local file cache
```
Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it is acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access.
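With both workers listening, you can verify from the master node that each RPC port is reachable, and later point llama.cpp at the workers through its `--rpc` flag. The commands below are a minimal sketch: the worker IP addresses, model path, and prompt are placeholders to replace with your own values.
```bash
# Check that each worker's RPC port is reachable from the master node
# (worker IP addresses are placeholders).
nc -zv 172.31.0.11 50052
nc -zv 172.31.0.12 50052

# Sketch of a distributed run from the master node: --rpc lists the worker endpoints
# (model path and prompt are placeholders).
bin/llama-cli -m /path/to/Llama-3.1-405B_Q4_0.gguf \
  --rpc 172.31.0.11:50052,172.31.0.12:50052 \
  -p "Tell me about Arm Neoverse." -n 64
```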
