diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md
index 4871cd65e6..e1b4943183 100644
--- a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md
@@ -1,9 +1,9 @@
 ---
 title: Distributed inference using llama.cpp
-draft: true
-cascade:
-  draft: true
+# draft: true
+# cascade:
+# draft: true
 
 minutes_to_complete: 30
diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md
index 6838a42e06..3a8e687e5c 100644
--- a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md
+++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md
@@ -1,64 +1,132 @@
 ---
-title: Overview and Worker Node Configuration
+title: Convert the model to GGUF and quantize it (Optional)
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Assumptions
+1. Python is installed on your AWS (or other cloud provider) instance.
+2. You have access to Meta’s gated repository for the Llama 3.1 model family.
+3. You have generated a Hugging Face token for downloading models.
+4. You are performing this download on the master node.
+5. You have cloned the llama.cpp repository.
+6. Basic libraries such as transformers, torch, and sentencepiece are already installed.
+7. There is 2 TB+ of storage space on your device.
+8. The model conversion was performed on a c8g.24xlarge instance. Please make sure you have an equivalent machine.
-## Before you begin
-The instructions in this Learning Path are for any Arm server running Ubuntu 24.04.2 LTS. You will need at least three Arm server instances with at least 64 cores and 128GB of RAM to run this example. The instructions have been tested on an AWS Graviton4 c8g.16xlarge instance
+{{% notice Note %}}The time mentioned on the Introduction page covers only setting up distributed inference (the process on the next page). The GGUF conversion described on this page may take 6+ hours. You may skip this page if you already have a GGUF model of your choice; before skipping, read the paragraph that precedes "1. Make a virtual environment".{{% /notice %}}
-## Overview
-llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. Just over a year ago from its publication date, rgerganov’s RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines—even when the models don’t fit into the memory of a single machine. In this learning path, we’ll explore how to run a 405B parameter model on Arm-based CPUs.
-
-For the purposes of this demonstration, the following experimental setup will be used:
-- Total number of instances: 3
-- Instance type: c8g.16xlarge
-- Model: Llama-3.1-405B_Q4_0.gguf
-
-One of the three nodes will serve as the master node, which physically hosts the model file. The other two nodes will act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes.
The master node forwards inference requests to the worker nodes, where all the actual computation is performed.
-
-## Implementation
-
-1. To get started, follow [this learning path](/learning-paths/servers-and-cloud-computing/llama-cpu) up to the step where you clone the llama.cpp repository. Since this setup involves multiple instances (or devices), you will need to replicate the initial setup on each device. Specifically, after executing the command below on all devices, continue with this learning path starting from Step 2.
+## Procedure
+Implement the following steps to download, convert, and quantize the 405B parameter model released by Meta [here](https://huggingface.co/meta-llama/Llama-3.1-405B).
+To get started, follow [this learning path](/learning-paths/servers-and-cloud-computing/llama-cpu) up to the step where you clone the llama.cpp repository. Since this setup involves multiple instances (or devices), you will need to replicate the initial setup on each device. Specifically, after executing the command below on all devices, continue with this learning path starting from Step 2.
 
 ```bash
 git clone https://github.com/ggerganov/llama.cpp
 ```
-2. Now we can build the llama.cpp library with the RPC feature enabled by compiling it with the -DLLAMA_RPC=ON flag
 
+##### 1. Make a virtual environment
+Create a Python virtual environment (update the Python version in the following commands if needed):
 ```bash
-cd llama.cpp
-mkdir -p build-rpc
-cd build-rpc
-cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON
-cmake --build . --config Release
+apt update
+apt install python3.12-venv
+python3 -m venv myenv
+source myenv/bin/activate
 ```
+##### 2. Download the model
+Install the Hugging Face Hub library in the virtual environment:
+```bash
+pip3 install huggingface_hub
+
-`llama.cpp` is now built in the `build-rpc/bin` directory.
-Check that `llama.cpp` has built correctly by running the help command:
+```
+Create a Python file named download.py:
 ```bash
-cd build-rpc
-bin/llama-cli -h
+vi download.py
 ```
-If everything was built correctly, you should see a list of all the available flags that can be used with llama-cli.
-3. Now, choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the highest compute should be selected—especially for a 405B model. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers.
-
-Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master—such as model parameters, tokens, hidden states, and other inference-related information.
-{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured—restricting access to only trusted IPs or internal VPC traffic.
This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
-Use the following command to start the listening on the worker nodes:
+Write the following code to it:
+```python
+import os
+from huggingface_hub import snapshot_download
+model_id = "meta-llama/Llama-3.1-405B"
+local_dir = "llama-hf"
+# Create the directory if it doesn't exist
+os.makedirs(local_dir, exist_ok=True)
+# Download the model snapshot
+snapshot_download(
+    repo_id=model_id,
+    local_dir=local_dir,
+    revision="main",
+    token="your_hf_token",
+    allow_patterns=["*.md", "*.json", "*.safetensors"]
+)
+```
+Execute the file:
 ```bash
-bin/rpc-server -p 50052 -H 0.0.0.0 -t 64
+python3 download.py
+```
+##### 3. Convert the model from .safetensors to GGUF and quantize
+The following commands install the dependencies needed for the conversion, convert the model to the .gguf format, and quantize it to Q4_0:
+```bash
+pip3 install -r llama.cpp/requirements.txt
+python3 llama.cpp/convert_hf_to_gguf.py llama-hf
+cd llama.cpp/build-rpc
+bin/llama-quantize ../../llama-hf/Llama-Hf-406B-F16.gguf Q4_0
+```
+You may rename the resulting file to model.gguf and use it. There are other quantization options as well, as shown below:
+```bash
+bin/llama-quantize -h
 ```
-Below are the available flag options that can be used with the rpc-server functionality:
-
 ```output
--h, --help show this help message and exit
--t, --threads number of threads for the CPU backend (default: 6)
--d DEV, --device device to use
--H HOST, --host HOST host to bind to (default: 127.0.0.1)
--p PORT, --port PORT port to bind to (default: 50052)
--m MEM, --mem MEM backend memory size (in MB)
--c, --cache enable local file cache
+usage: bin/llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type]
+       [--token-embedding-type] [--tensor-type] [--keep-split] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]
+
+  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
+  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
+  --pure: Disable k-quant mixtures and quantize all tensors to the same type
+  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
+  --include-weights tensor_name: use importance matrix for this/these tensor(s)
+  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
+  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
+  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
+  --tensor-type TENSOR=TYPE: quantize this tensor to this ggml_type. example: --tensor-type attn_q=q8_0
+      Advanced option to selectively quantize tensors. May be specified multiple times.
+  --keep-split: will generate quantized model in the same shards as input
+  --override-kv KEY=TYPE:VALUE
+      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
+Note: --include-weights and --exclude-weights cannot be used together + +Allowed quantization types: + 2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B + 3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B + 8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B + 9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B + 19 or IQ2_XXS : 2.06 bpw quantization + 20 or IQ2_XS : 2.31 bpw quantization + 28 or IQ2_S : 2.5 bpw quantization + 29 or IQ2_M : 2.7 bpw quantization + 24 or IQ1_S : 1.56 bpw quantization + 31 or IQ1_M : 1.75 bpw quantization + 36 or TQ1_0 : 1.69 bpw ternarization + 37 or TQ2_0 : 2.06 bpw ternarization + 10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B + 21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B + 23 or IQ3_XXS : 3.06 bpw quantization + 26 or IQ3_S : 3.44 bpw quantization + 27 or IQ3_M : 3.66 bpw quantization mix + 12 or Q3_K : alias for Q3_K_M + 22 or IQ3_XS : 3.3 bpw quantization + 11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B + 12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B + 13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B + 25 or IQ4_NL : 4.50 bpw non-linear quantization + 30 or IQ4_XS : 4.25 bpw non-linear quantization + 15 or Q4_K : alias for Q4_K_M + 14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B + 15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B + 17 or Q5_K : alias for Q5_K_M + 16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B + 17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B + 18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B + 7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B + 1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B + 32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B + 0 or F32 : 26.00G @ 7B + COPY : only copy tensors, no quantizing ``` -Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it’s acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md index 65a020ccab..2602156f4e 100644 --- a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md +++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-2.md @@ -1,213 +1,64 @@ --- -title: Configuring Master Node +title: Overview and Worker Node Configuration weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- +## Before you begin +The instructions in this Learning Path are for any Arm server running Ubuntu 24.04.2 LTS. You will need at least three Arm server instances with at least 64 cores and 128GB of RAM to run this example. The instructions have been tested on an AWS Graviton4 c8g.16xlarge instance. -(continued)
-4. In this learning path, we will use the following three IP addresses for the nodes. +## Overview +llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. Just over a year ago from the publication date of this article, rgerganov’s RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines—even when the models don’t fit into the memory of a single machine. In this learning path, we’ll explore how to run a 405B parameter model on Arm-based CPUs. -```bash -master_ip =" 172.31.110.10" -worker_ips = "172.31.110.11,172.31.110.12" -``` -Note that these IPs may be different in your setup. You can find the IP address of your AWS instance using the command provided below. -```bash -curl http://169.254.169.254/latest/meta-data/local-ipv4 -``` +For the purposes of this demonstration, the following experimental setup will be used: +- Total number of instances: 3 +- Instance type: c8g.16xlarge +- Model: Llama-3.1-405B_Q4_0.gguf -Now, on the master node, you can verify communication with the worker nodes using the following command on master node: -```bash -telnet 172.31.110.11 50052 -``` -If the backend server is set up correctly, the output of the `telnet` command should look like the following: -```bash -Trying 172.31.110.11... -Connected to 172.31.110.11. -Escape character is '^]'. -``` -Finally, you can execute the following command, to execute distributed inference: -```bash -bin/llama-cli -m /home/ubuntu/model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99 -``` -{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}
-The model file for this experiment is hosted on Arm’s private AWS S3 bucket. If you don’t have access to it, you can find a publicly available version of the model on Hugging Face. -The output: -```output -build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu -main: llama backend init -main: load the model and apply lora adapter, if any -llama_model_load_from_file_impl: using device RPC[172.31.110.11:50052] (RPC[172.31.110.11:50052]) - 126497 MiB free -llama_model_load_from_file_impl: using device RPC[172.31.110.12:50052] (RPC[172.31.110.12:50052]) - 126497 MiB free -llama_model_loader: loaded meta data with 30 key-value pairs and 1138 tensors from /home/ubuntu/Llama-3.1-405B_Q4_0.gguf (version GGUF V3 (latest)) -llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. -llama_model_loader: - kv 0: general.architecture str = llama -llama_model_loader: - kv 1: general.type str = model -llama_model_loader: - kv 2: general.name str = Llama Hf -llama_model_loader: - kv 3: general.size_label str = 406B -llama_model_loader: - kv 4: general.license str = llama3.1 -llama_model_loader: - kv 5: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... -llama_model_loader: - kv 6: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... -llama_model_loader: - kv 7: llama.block_count u32 = 126 -llama_model_loader: - kv 8: llama.context_length u32 = 131072 -llama_model_loader: - kv 9: llama.embedding_length u32 = 16384 -llama_model_loader: - kv 10: llama.feed_forward_length u32 = 53248 -llama_model_loader: - kv 11: llama.attention.head_count u32 = 128 -llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 -llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 -llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 -llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 -llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 -llama_model_loader: - kv 17: llama.vocab_size u32 = 128256 -llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 128 -llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2 -llama_model_loader: - kv 20: tokenizer.ggml.pre str = llama-bpe -llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... -llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... -llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
-llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 128000 -llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 128001 -llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool = true -llama_model_loader: - kv 27: tokenizer.ggml.add_sep_token bool = false -llama_model_loader: - kv 28: general.quantization_version u32 = 2 -llama_model_loader: - kv 29: general.file_type u32 = 2 -llama_model_loader: - type f32: 254 tensors -llama_model_loader: - type q4_0: 883 tensors -llama_model_loader: - type q6_K: 1 tensors -print_info: file format = GGUF V3 (latest) -print_info: file type = Q4_0 -print_info: file size = 213.13 GiB (4.51 BPW) -load: special tokens cache size = 256 -load: token to piece cache size = 0.7999 MB -print_info: arch = llama -print_info: vocab_only = 0 -print_info: n_ctx_train = 131072 -print_info: n_embd = 16384 -print_info: n_layer = 126 -print_info: n_head = 128 -print_info: n_head_kv = 8 -print_info: n_rot = 128 -print_info: n_swa = 0 -print_info: is_swa_any = 0 -print_info: n_embd_head_k = 128 -print_info: n_embd_head_v = 128 -print_info: n_gqa = 16 -print_info: n_embd_k_gqa = 1024 -print_info: n_embd_v_gqa = 1024 -print_info: f_norm_eps = 0.0e+00 -print_info: f_norm_rms_eps = 1.0e-05 -print_info: f_clamp_kqv = 0.0e+00 -print_info: f_max_alibi_bias = 0.0e+00 -print_info: f_logit_scale = 0.0e+00 -print_info: f_attn_scale = 0.0e+00 -print_info: n_ff = 53248 -print_info: n_expert = 0 -print_info: n_expert_used = 0 -print_info: causal attn = 1 -print_info: pooling type = 0 -print_info: rope type = 0 -print_info: rope scaling = linear -print_info: freq_base_train = 500000.0 -print_info: freq_scale_train = 1 -print_info: n_ctx_orig_yarn = 131072 -print_info: rope_finetuned = unknown -print_info: model type = ?B -print_info: model params = 405.85 B -print_info: general.name = Llama Hf -print_info: vocab type = BPE -print_info: n_vocab = 128256 -print_info: n_merges = 280147 -print_info: BOS token = 128000 '<|begin_of_text|>' -print_info: EOS token = 128001 '<|end_of_text|>' -print_info: EOT token = 128009 '<|eot_id|>' -print_info: EOM token = 128008 '<|eom_id|>' -print_info: LF token = 198 'Ċ' -print_info: EOG token = 128001 '<|end_of_text|>' -print_info: EOG token = 128008 '<|eom_id|>' -print_info: EOG token = 128009 '<|eot_id|>' -print_info: max token length = 256 -load_tensors: loading model tensors, this can take a while... (mmap = true) -.................................................................................................... 
-llama_context: constructing llama_context -llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache -llama_context: n_seq_max = 1 -llama_context: n_ctx = 4096 -llama_context: n_ctx_per_seq = 4096 -llama_context: n_batch = 2048 -llama_context: n_ubatch = 512 -llama_context: causal_attn = 1 -llama_context: flash_attn = 0 -llama_context: kv_unified = true -llama_context: freq_base = 500000.0 -llama_context: freq_scale = 1 -llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized -llama_context: CPU output buffer size = 0.49 MiB -llama_kv_cache_unified: RPC[172.31.110.11:50052] KV buffer size = 800.00 MiB -llama_kv_cache_unified: RPC[172.31.110.12:50052] KV buffer size = 784.00 MiB -llama_kv_cache_unified: CPU KV buffer size = 432.00 MiB -llama_kv_cache_unified: size = 2016.00 MiB ( 4096 cells, 126 layers, 1/ 1 seqs), K (f16): 1008.00 MiB, V (f16): 1008.00 MiB -llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility -llama_context: RPC[172.31.110.11:50052] compute buffer size = 1160.00 MiB -llama_context: RPC[172.31.110.12:50052] compute buffer size = 1160.00 MiB -llama_context: CPU compute buffer size = 1160.01 MiB -llama_context: graph nodes = 4668 -llama_context: graph splits = 4 -common_init_from_params: added <|end_of_text|> logit bias = -inf -common_init_from_params: added <|eom_id|> logit bias = -inf -common_init_from_params: added <|eot_id|> logit bias = -inf -common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 -common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) -main: llama threadpool init, n_threads = 64 +One of the three nodes will serve as the master node, which physically hosts the model file. The other two nodes will act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where all the actual computation is performed. -system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 | +## Implementation -sampler seed: 4077122424 -sampler params: - repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 - dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 - top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 - mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 -sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist -generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1 +1. Once you have the model.gguf ready and llama.cpp cloned (described on previous page), you can proceed to step 2. -Tell me a joke! (or a funny story) -Thread starter Fiver -This thread is for any jokes you may want to share with other members. Please keep them clean! -Reactions: Fiver -A duck walks into a bar, and asks the bartender, "Have you got any bread?" -The bartender says, "No, we don't have any bread." -The duck leaves. -A few minutes later, the duck returns, and asks the bartender, "Have you got any bread?" 
-The bartender says, "No, I told you, we don't have any bread." -A few minutes later, the duck returns, and asks the bartender, +2. Now we can build the llama.cpp library with the RPC feature enabled by compiling it with the -DLLAMA_RPC=ON flag +```bash +apt install -y cmake build-essential +apt install -y g++ +apt install -y libcurl4-openssl-dev +cd llama.cpp +mkdir -p build-rpc +cd build-rpc +cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON +cmake --build . --config Release +``` -llama_perf_sampler_print: sampling time = 9.48 ms / 133 runs ( 0.07 ms per token, 14032.50 tokens per second) -llama_perf_context_print: load time = 1796754.73 ms -llama_perf_context_print: prompt eval time = 1925.98 ms / 5 tokens ( 385.20 ms per token, 2.60 tokens per second) -llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609.68 ms per token, 1.64 tokens per second) -llama_perf_context_print: total time = 79394.06 ms / 132 tokens -llama_perf_context_print: graphs reused = 0 +`llama.cpp` is now built in the `build-rpc/bin` directory. +Check that `llama.cpp` has built correctly by running the help command: +```bash +cd build-rpc +bin/llama-cli -h ``` -That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality. The following table provides brief description of the metrics from `llama_perf`:
+If everything was built correctly, you should see a list of all the available flags that can be used with llama-cli. -| Log Line | Description | -|-------------------|-----------------------------------------------------------------------------| -| sampling time | Time spent choosing next tokens using sampling strategy (e.g., top-k, top-p). | -| load time | Time to load the model into memory and initialize weights/buffers. | -| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache). | -| eval time | Time to generate output tokens by forward-passing through the model. | -| total time | Total time for both prompt processing and token generation (excludes model load). | +3. Now, choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the highest compute should be selected—especially for a 405B model. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers. -Lastly to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference: +Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master—such as model parameters, tokens, hidden states, and other inference-related information. +{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured—restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}} +Use the following command to start the listening on the worker nodes: ```bash -bin/llama-server -m /home/ubuntu/model.gguf --port 8080 --rpc "$worker_ips" -ngl 99 +bin/rpc-server -p 50052 -H 0.0.0.0 -t 64 ``` -At the very end of the output to the above command, you will see something like the following: +Below are the available flag options that can be used with the rpc-server functionality: + ```output -main: server is listening on http://127.0.0.1:8080 - starting the main loop -srv update_slots: all slots are idle +-h, --help show this help message and exit +-t, --threads number of threads for the CPU backend (default: 6) +-d DEV, --device device to use +-H HOST, --host HOST host to bind to (default: 127.0.0.1) +-p PORT, --port PORT port to bind to (default: 50052) +-m MEM, --mem MEM backend memory size (in MB) +-c, --cache enable local file cache ``` - +Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it’s acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access. 
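+If a worker has less memory to spare, the `-m` flag listed above can cap how much memory the backend advertises to the master. The following is a minimal sketch rather than part of the tested setup; the 100000 MB value is an assumed example that you should size to your instance, and `ss` is used only as a quick check that the server is listening:
+```bash
+# Start the RPC server, exposing roughly 100 GB to the master (assumed value; adjust per instance)
+bin/rpc-server -p 50052 -H 0.0.0.0 -t 64 -m 100000
+
+# From another shell on the worker, confirm the port is listening
+ss -tlnp | grep 50052
+```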
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md new file mode 100644 index 0000000000..e22e0cdd3b --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md @@ -0,0 +1,218 @@ +--- +title: Configuring Master Node +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- +(continued)
+4. In this learning path, we will use the following three IP addresses for the nodes.
+
+```bash
+master_ip="172.31.110.10"
+worker_ips="172.31.110.11,172.31.110.12"
+```
+Note that these IPs may be different in your setup. You can find the IP address of your AWS instance using the command provided below.
+```bash
+curl http://169.254.169.254/latest/meta-data/local-ipv4
+```
+
+Now, on the master node, you can verify communication with the worker nodes using the following command:
+```bash
+telnet 172.31.110.11 50052
+```
+If the backend server is set up correctly, the output of the `telnet` command should look like the following:
+```output
+Trying 172.31.110.11...
+Connected to 172.31.110.11.
+Escape character is '^]'.
+```
+Finally, execute the following command to run distributed inference:
+```bash
+bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
+```
+Here are short definitions of the flags used in the above command:
+- `-n`: the maximum number of output tokens to generate
+- `--rpc`: a comma-separated list of backend workers
+- `-ngl`: the number of model layers to place on the backend workers (999 offloads all layers to the workers)
+
+{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}
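+If your workers do not have enough free memory for a full offload, you can keep some layers on the master by lowering `-ngl`. This is a minimal sketch rather than part of the tested run; the value 64 is an arbitrary example for this 126-layer model, and it assumes the same model path and `worker_ips` variable used above:
+```bash
+# Offload only 64 of the model's layers to the RPC workers; the remaining layers run on the master's CPU
+bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 64
+```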
+The model.gguf (Llama-3.1-405B_Q4_0) used in this experiment is hosted on Arm’s private AWS S3 bucket. If you are using a device (or instance) with lower capacity, you may want to run a smaller model such as the 8B or 70B parameter variants. Quantized versions of these models are available [here](https://huggingface.co/aryan-arm). The 405B model is not available at the previously mentioned link; however, you can produce it yourself by following the optional [model conversion and quantization page](/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1) of this learning path.
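+If you go with one of the smaller prequantized models, you can fetch a single GGUF file with the same `huggingface_hub` library used on the conversion page. This is a minimal sketch; the repository ID and filename below are placeholders, so substitute the actual repo and GGUF file you pick from the link above:
+```python
+from huggingface_hub import hf_hub_download
+
+# Placeholder repo and filename: replace with the quantized GGUF you selected
+model_path = hf_hub_download(
+    repo_id="aryan-arm/<repo-name>",
+    filename="<model>.gguf",
+    local_dir=".",
+)
+print(model_path)
+```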
+ +The output: +```output +build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu +main: llama backend init +main: load the model and apply lora adapter, if any +llama_model_load_from_file_impl: using device RPC[172.31.110.11:50052] (RPC[172.31.110.11:50052]) - 126497 MiB free +llama_model_load_from_file_impl: using device RPC[172.31.110.12:50052] (RPC[172.31.110.12:50052]) - 126497 MiB free +llama_model_loader: loaded meta data with 30 key-value pairs and 1138 tensors from /home/ubuntu/Llama-3.1-405B_Q4_0.gguf (version GGUF V3 (latest)) +llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. +llama_model_loader: - kv 0: general.architecture str = llama +llama_model_loader: - kv 1: general.type str = model +llama_model_loader: - kv 2: general.name str = Llama Hf +llama_model_loader: - kv 3: general.size_label str = 406B +llama_model_loader: - kv 4: general.license str = llama3.1 +llama_model_loader: - kv 5: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... +llama_model_loader: - kv 6: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... +llama_model_loader: - kv 7: llama.block_count u32 = 126 +llama_model_loader: - kv 8: llama.context_length u32 = 131072 +llama_model_loader: - kv 9: llama.embedding_length u32 = 16384 +llama_model_loader: - kv 10: llama.feed_forward_length u32 = 53248 +llama_model_loader: - kv 11: llama.attention.head_count u32 = 128 +llama_model_loader: - kv 12: llama.attention.head_count_kv u32 = 8 +llama_model_loader: - kv 13: llama.rope.freq_base f32 = 500000.000000 +llama_model_loader: - kv 14: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 +llama_model_loader: - kv 15: llama.attention.key_length u32 = 128 +llama_model_loader: - kv 16: llama.attention.value_length u32 = 128 +llama_model_loader: - kv 17: llama.vocab_size u32 = 128256 +llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 128 +llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2 +llama_model_loader: - kv 20: tokenizer.ggml.pre str = llama-bpe +llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... +llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... +llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
+llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 128000 +llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 128001 +llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool = true +llama_model_loader: - kv 27: tokenizer.ggml.add_sep_token bool = false +llama_model_loader: - kv 28: general.quantization_version u32 = 2 +llama_model_loader: - kv 29: general.file_type u32 = 2 +llama_model_loader: - type f32: 254 tensors +llama_model_loader: - type q4_0: 883 tensors +llama_model_loader: - type q6_K: 1 tensors +print_info: file format = GGUF V3 (latest) +print_info: file type = Q4_0 +print_info: file size = 213.13 GiB (4.51 BPW) +load: special tokens cache size = 256 +load: token to piece cache size = 0.7999 MB +print_info: arch = llama +print_info: vocab_only = 0 +print_info: n_ctx_train = 131072 +print_info: n_embd = 16384 +print_info: n_layer = 126 +print_info: n_head = 128 +print_info: n_head_kv = 8 +print_info: n_rot = 128 +print_info: n_swa = 0 +print_info: is_swa_any = 0 +print_info: n_embd_head_k = 128 +print_info: n_embd_head_v = 128 +print_info: n_gqa = 16 +print_info: n_embd_k_gqa = 1024 +print_info: n_embd_v_gqa = 1024 +print_info: f_norm_eps = 0.0e+00 +print_info: f_norm_rms_eps = 1.0e-05 +print_info: f_clamp_kqv = 0.0e+00 +print_info: f_max_alibi_bias = 0.0e+00 +print_info: f_logit_scale = 0.0e+00 +print_info: f_attn_scale = 0.0e+00 +print_info: n_ff = 53248 +print_info: n_expert = 0 +print_info: n_expert_used = 0 +print_info: causal attn = 1 +print_info: pooling type = 0 +print_info: rope type = 0 +print_info: rope scaling = linear +print_info: freq_base_train = 500000.0 +print_info: freq_scale_train = 1 +print_info: n_ctx_orig_yarn = 131072 +print_info: rope_finetuned = unknown +print_info: model type = ?B +print_info: model params = 405.85 B +print_info: general.name = Llama Hf +print_info: vocab type = BPE +print_info: n_vocab = 128256 +print_info: n_merges = 280147 +print_info: BOS token = 128000 '<|begin_of_text|>' +print_info: EOS token = 128001 '<|end_of_text|>' +print_info: EOT token = 128009 '<|eot_id|>' +print_info: EOM token = 128008 '<|eom_id|>' +print_info: LF token = 198 'Ċ' +print_info: EOG token = 128001 '<|end_of_text|>' +print_info: EOG token = 128008 '<|eom_id|>' +print_info: EOG token = 128009 '<|eot_id|>' +print_info: max token length = 256 +load_tensors: loading model tensors, this can take a while... (mmap = true) +.................................................................................................... 
+llama_context: constructing llama_context +llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache +llama_context: n_seq_max = 1 +llama_context: n_ctx = 4096 +llama_context: n_ctx_per_seq = 4096 +llama_context: n_batch = 2048 +llama_context: n_ubatch = 512 +llama_context: causal_attn = 1 +llama_context: flash_attn = 0 +llama_context: kv_unified = true +llama_context: freq_base = 500000.0 +llama_context: freq_scale = 1 +llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized +llama_context: CPU output buffer size = 0.49 MiB +llama_kv_cache_unified: RPC[172.31.110.11:50052] KV buffer size = 800.00 MiB +llama_kv_cache_unified: RPC[172.31.110.12:50052] KV buffer size = 784.00 MiB +llama_kv_cache_unified: CPU KV buffer size = 432.00 MiB +llama_kv_cache_unified: size = 2016.00 MiB ( 4096 cells, 126 layers, 1/ 1 seqs), K (f16): 1008.00 MiB, V (f16): 1008.00 MiB +llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility +llama_context: RPC[172.31.110.11:50052] compute buffer size = 1160.00 MiB +llama_context: RPC[172.31.110.12:50052] compute buffer size = 1160.00 MiB +llama_context: CPU compute buffer size = 1160.01 MiB +llama_context: graph nodes = 4668 +llama_context: graph splits = 4 +common_init_from_params: added <|end_of_text|> logit bias = -inf +common_init_from_params: added <|eom_id|> logit bias = -inf +common_init_from_params: added <|eot_id|> logit bias = -inf +common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096 +common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) +main: llama threadpool init, n_threads = 64 + +system_info: n_threads = 64 (n_threads_batch = 64) / 64 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 | + +sampler seed: 4077122424 +sampler params: + repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 + dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096 + top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800 + mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 +sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist +generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1 + +Tell me a joke! (or a funny story) +Thread starter Fiver +This thread is for any jokes you may want to share with other members. Please keep them clean! +Reactions: Fiver +A duck walks into a bar, and asks the bartender, "Have you got any bread?" +The bartender says, "No, we don't have any bread." +The duck leaves. +A few minutes later, the duck returns, and asks the bartender, "Have you got any bread?" +The bartender says, "No, I told you, we don't have any bread." 
+A few minutes later, the duck returns, and asks the bartender,
+
+llama_perf_sampler_print: sampling time = 9.48 ms / 133 runs ( 0.07 ms per token, 14032.50 tokens per second)
+llama_perf_context_print: load time = 1796754.73 ms
+llama_perf_context_print: prompt eval time = 1925.98 ms / 5 tokens ( 385.20 ms per token, 2.60 tokens per second)
+llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609.68 ms per token, 1.64 tokens per second)
+llama_perf_context_print: total time = 79394.06 ms / 132 tokens
+llama_perf_context_print: graphs reused = 0
+```
+That's it! You have successfully run the Llama 3.1 405B model on CPUs with the power of llama.cpp RPC functionality. The following table provides a brief description of the metrics from `llama_perf`:
+
+| Log Line | Description |
+|-------------------|-----------------------------------------------------------------------------|
+| sampling time | Time spent choosing next tokens using the sampling strategy (e.g., top-k, top-p). |
+| load time | Time to load the model into memory and initialize weights/buffers. |
+| prompt eval time | Time to process the input prompt tokens before generation (fills the KV cache). |
+| eval time | Time to generate output tokens by forward-passing through the model. |
+| total time | Total time for both prompt processing and token generation (excludes model load). |
+
+Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet showing how to set up llama-server for distributed inference:
+```bash
+bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 999
+```
+At the very end of the output of the above command, you will see something like the following:
+```output
+main: server is listening on http://127.0.0.1:8080 - starting the main loop
+srv update_slots: all slots are idle
+```
+
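+Once the server reports that it is listening, you can send a request to the OpenAI-compatible endpoint from the master node. The following is a minimal sketch assuming the default host and the port used above; llama-server serves the single model it was started with, so the `model` field is only a label here:
+```bash
+# Query the OpenAI-compatible chat completions endpoint exposed by llama-server
+curl http://127.0.0.1:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "model.gguf",
+        "messages": [{"role": "user", "content": "Tell me a joke"}],
+        "max_tokens": 64
+      }'
+```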