[pull] master from ggml-org:master by pull[bot] · Pull Request #1209 · LongLeCE/llama.cpp

pull · 2026-05-22T14:42:03Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


* pi : update
* ci : fix ios build
* ci : fix android
* ci : fix apple builds
* cmake : add install() for impl libraries

  Add `install(TARGETS <target> LIBRARY)` for all `-impl` libraries that were changed from STATIC to shared (controlled by `BUILD_SHARED_LIBS`) in commit bb28c1f. Without this, `cmake --install` fails to copy the shared libraries, causing runtime errors like:

  `llama-server: error while loading shared libraries: libllama-server-impl.so`

  Ref: #23494 (comment)

  Assisted-by: llama.cpp:local pi
* ci : fix xcframework build

* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup

  BF16 models had no dedicated token generation kernel; they fell through to the generic full-GEMM path, resulting in ~14% memory bandwidth utilization on Intel Arc GPUs. This adds BF16 support to the DMMV (dequantize mul-mat-vec) path, matching the existing F16 implementation.

  Fixes #20478
* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0

  The qk=1 kernel (used for F16 and BF16) iterates with stride 2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last warp iteration accesses elements at col >= ncols, producing NaN for the final row and wrong values for interior rows.

  Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] % (2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16 launcher to match. Quantized types use block-structured kernels with different access patterns and keep the existing DMMV_X check.

  Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70. Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
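
The bounds condition in the BF16 DMMV fix comes down to simple arithmetic. Below is a minimal C++ sketch, assuming `DMMV_X = 32` as quoted in the commit message. `WeightType`, `bf16_to_f32`, `dmmv_qk1_overrun`, and `can_use_dmmv` are hypothetical names, not ggml-sycl functions, and the overrun helper models only how far the last stride-64 iteration reaches, not the kernel's per-thread indexing. The BF16 helper reflects the standard format: a BF16 value is the upper 16 bits of an IEEE-754 float32.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: DMMV_X = 32, as quoted in the commit message.
constexpr int kDmmvX = 32;

// Hypothetical stand-in for tensor types; Q4_0 represents "quantized types".
enum class WeightType { F16, BF16, Q4_0 };

// BF16 keeps the top 16 bits of a float32, so widening is a 16-bit shift.
inline float bf16_to_f32(uint16_t h) {
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Simplified model of the qk=1 loop: each iteration covers 2*DMMV_X columns,
// so the last iteration reaches the next multiple of 2*DMMV_X.
// Returns how many columns are read past the end of the row.
inline int dmmv_qk1_overrun(int ncols) {
    assert(ncols > 0);
    const int stride = 2 * kDmmvX;
    const int covered = ((ncols + stride - 1) / stride) * stride;
    return covered - ncols;
}

// Hypothetical eligibility check mirroring the tightened rule:
// F16/BF16 need ncols % (2*DMMV_X) == 0, quantized types keep ncols % DMMV_X.
inline bool can_use_dmmv(WeightType type, int ncols) {
    switch (type) {
        case WeightType::F16:
        case WeightType::BF16:
            return ncols % (2 * kDmmvX) == 0;
        default:
            return ncols % kDmmvX == 0;
    }
}
```

With these definitions, `dmmv_qk1_overrun(1056)` is 32: 1056 = 33 × 32 passes the old `DMMV_X` check, but the last stride-64 iteration runs to column 1088, which is consistent with the k=1056 failures reported in the commit message. `can_use_dmmv(WeightType::BF16, 1056)` is therefore false under the tightened rule, while a quantized type of the same width still passes.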

- change `k_copy_src1_to_contiguous` so that it uses a precomputed contiguous mapping in which all rows "owned" by an expert sit in one slice with known start and end offsets
- switch the `O(n_as * n_routed_rows)` procedure to a counting sort-based one with `O(n_as + n_routed_rows)` complexity
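
The change to a counting sort keyed on expert id can be illustrated on the CPU. The sketch below is a minimal C++ version, assuming each routed row carries a single expert id in `[0, n_as)`. `group_rows_by_expert` and `ExpertSlices` are hypothetical names; the actual change lives in the backend's `k_copy_src1_to_contiguous` path and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rows owned by expert e occupy the half-open slice
// [offsets[e], offsets[e + 1]) of `rows`.
struct ExpertSlices {
    std::vector<int32_t> offsets;  // size n_as + 1
    std::vector<int32_t> rows;     // routed row indices, grouped by expert
};

// expert_of_row[i] is the expert that routed row i is assigned to.
// One counting pass, one prefix sum, one scatter pass: O(n_as + n_routed_rows).
// Stable: rows keep their original relative order within each expert.
ExpertSlices group_rows_by_expert(const std::vector<int32_t>& expert_of_row, int32_t n_as) {
    ExpertSlices out;
    out.offsets.assign(n_as + 1, 0);
    out.rows.resize(expert_of_row.size());

    // 1. Count rows per expert (shifted by one slot for the prefix sum).
    for (int32_t e : expert_of_row) {
        assert(e >= 0 && e < n_as);
        ++out.offsets[e + 1];
    }
    // 2. Prefix sum: offsets[e] becomes the start of expert e's slice.
    for (int32_t e = 0; e < n_as; ++e) {
        out.offsets[e + 1] += out.offsets[e];
    }
    // 3. Scatter each row into the next free position of its expert's slice.
    std::vector<int32_t> cursor(out.offsets.begin(), out.offsets.end() - 1);
    for (std::size_t i = 0; i < expert_of_row.size(); ++i) {
        out.rows[cursor[expert_of_row[i]]++] = static_cast<int32_t>(i);
    }
    return out;
}
```

Each pass touches every row or every expert once, which is where the `O(n_as + n_routed_rows)` bound comes from, and the `offsets` array is a per-expert start/end mapping of the kind the first bullet describes. For example, rows routed to experts `{2, 0, 2, 1, 0}` with `n_as = 3` yield `offsets = {0, 2, 3, 5}` and `rows = {1, 4, 3, 0, 2}`.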

ggerganov and others added 10 commits May 22, 2026 11:46

vocab : fix HybridDNA tokenizer (#23466)

afcda09

* vocab : mark hybriddna k-mers to avoid BPE token collisions * improved loop --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

cmake : build router app only during standalone builds (#23521)

9c92e96

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

ggml-zendnn : add Q8_0 quantization support (#23414)

99d4026

* ggml-zendnn : add Q8_0 quantization support * ggml-zendnn : sync with latest ZenDNN * ggml-zendnn : address review comments for Q8_0
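
For reference, Q8_0 in ggml groups values into blocks of 32 signed 8-bit integers that share one scale `d = max|x| / 127`. The C++ sketch below shows that reference quantize/dequantize round trip. It is not the ZenDNN code path; `BlockQ8_0`, `quantize_q8_0`, and `dequantize_q8_0` are hypothetical names, and the scale is kept as a `float` for simplicity, whereas ggml stores it as fp16.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kQK8_0 = 32;  // values per Q8_0 block

// Simplified Q8_0 block; assumption: float scale instead of ggml's fp16.
struct BlockQ8_0 {
    float d;                        // per-block scale
    std::array<int8_t, kQK8_0> qs;  // quantized values
};

// Quantize n floats (n must be a multiple of 32): d = max|x| / 127, q = round(x / d).
std::vector<BlockQ8_0> quantize_q8_0(const float* x, int n) {
    assert(n % kQK8_0 == 0);
    std::vector<BlockQ8_0> blocks(n / kQK8_0);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const float* xb = x + b * kQK8_0;
        float amax = 0.0f;
        for (int j = 0; j < kQK8_0; ++j) amax = std::fmax(amax, std::fabs(xb[j]));
        const float d = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;  // all-zero block -> q = 0
        blocks[b].d = d;
        for (int j = 0; j < kQK8_0; ++j) {
            blocks[b].qs[j] = static_cast<int8_t>(std::lround(xb[j] * id));
        }
    }
    return blocks;
}

// Dequantize: x is approximately d * q.
std::vector<float> dequantize_q8_0(const std::vector<BlockQ8_0>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kQK8_0);
    for (const BlockQ8_0& blk : blocks) {
        for (int8_t q : blk.qs) out.push_back(blk.d * q);
    }
    return out;
}
```

Because each value is rounded to the nearest multiple of `d`, the round-trip error per element is at most about `d / 2`, i.e. roughly 0.4% of the block's largest magnitude.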

docs: Update documentation with Granite 4.0/4.1 (#23404)

95feeab

SYCL : gated_delta_net K>1 (#23174)

56f16f2

* sycl_gated_delta_net K>1 * editor_config

sycl : Level Zero detection in ggml_sycl_init (#23097)

bcfd198

* [SYCL] Centralize Level Zero detection in ggml_sycl_init * use the same wording * get back the warning

perplexity : fix integer overflow (#23496)

ef570f6

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

pull Bot locked and limited conversation to collaborators May 22, 2026

pull Bot added the ⤵️ pull label May 22, 2026

pull Bot merged commit ef570f6 into LongLeCE:master May 22, 2026

github-actions Bot added the documentation, examples, python, ggml, SYCL, server, devops, build, android, AMD ZenDNN labels May 22, 2026

pull[bot] merged 10 commits into LongLeCE:master from ggml-org:master
