AMD 7600XT - llama off loads to GPU but low TpS #707
I've been trying to use an AMD 7600XT with a Pi 5. `DISPLAY=:0 glxinfo -B` gives:
I get:
but performance is really low and I can't really use the LLM:
If I also open nvtop, everything goes smoothly and I can have a normal conversation, but I still get:
which seems a bit odd to me, since the conversation is really fluid. Also, if I try a beefier model (mistral-7b-instruct-v0.2.Q8_0.gguf), I can't even load it onto the GPU, even though the model is only around 7 GB:
Not sure if this is just Vulkan's fault, though.
You may need to adjust the `GGML_VK_FORCE_MAX_ALLOCATION_SIZE` value due to some bugs in the driver on arm64. See geerlingguy/ollama-benchmark#1 and have a read through my notes towards the bottom of that original issue post.
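For reference, that variable is set in the environment (in bytes) before launching llama.cpp. A minimal sketch — the 2 GiB cap and the `llama-cli` invocation below are illustrative assumptions, not values taken from the linked issue; tune the cap for your card and driver:

```shell
# Illustrative only: cap the Vulkan backend's maximum single allocation at
# 2 GiB (value is in bytes) to work around arm64 driver allocation bugs.
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=$((2 * 1024 * 1024 * 1024))
echo "$GGML_VK_FORCE_MAX_ALLOCATION_SIZE"

# Then launch llama.cpp as usual in the same shell, e.g. (path and flags
# are assumptions for this sketch):
# ./llama-cli -m mistral-7b-instruct-v0.2.Q8_0.gguf -ngl 99
```

With a cap in place, large tensors get split across several smaller allocations instead of one oversized one, which is what trips the buggy driver path.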