[pull] master from ggml-org:master by pull[bot] · Pull Request #1034 · LongLeCE/llama.cpp

pull · 2026-03-31T08:42:02Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…i-compat (#21090) * server/webui: cleanup dual representation approach, simplify to openai-compat * feat: Fix regression for Agentic Loop UI * chore: update webui build output * refactor: Post-review code improvements * chore: update webui build output * refactor: Cleanup * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* fix: include API key in CORS proxy requests for MCP connections When llama-server is started with --api-key-file and --webui-mcp-proxy, the /cors-proxy endpoint requires authentication. The WebUI was not including the Authorization header in proxy requests, causing MCP connections to fail with 401. Inject getAuthHeaders() into requestInit when useProxy is true so the proxy request carries the Bearer token alongside the forwarded target headers. Fixes #21167 * fix: simplify headers assignment based on reviewer suggestion Apply buildProxiedHeaders only when useProxy is true, pass headers directly to the transport otherwise.

…gfault on failed model load (#21082) * common: add bounds check in common_init_result::sampler to prevent segfault on failed model load * Revert a308e58 * Add regression test * Remove regression test for init-fail sampler check

The build info is now only for debug, so we avoid the duplicate with `--version`. The UTF-8 setup at the beginning is needed to avoid logging garbage on Windows. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

- emdeddings → embeddings (gemma3.cpp, gemma3n-iswa.cpp, gemma-embedding.cpp) - imlpemented → implemented (llama-adapter.cpp) - interere → interfere (llama-graph.cpp) - overridde → overridden (chat.cpp) - stastistics → statistics (ngram-map.h) - layed → laid (llama-kv-cache.h) - worster → worst (llama-context.cpp) - sequantial → sequential (llama-batch.h)

* webui: no more gzip * try changing a small line * Revert "try changing a small line" This reverts commit 0d7a353. * fix lint * fix test * rebuild * split into html/css/js * lint * chore: update webui build output * chore: Update git hooks script * server: update webui build output * chore: Update pre-commit hook * refactor: Cleanup --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* CANN: fix multi-thread set_tensor race conditions When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues: 1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data. 2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data. 3. The global g_nz_workspaces array had unprotected concurrent access. Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes. Add per-device mutex to g_nz_workspaces to prevent data races. * CANN: fix L2_NORM ignoring eps parameter The L2_NORM implementation was not using the eps parameter from op_params, causing incorrect results when eps is large (e.g. 10.0). The CPU reference computes scale = 1/fmaxf(norm, eps), so add a Clamp step to clamp the norm to at least eps before dividing. * ggml/cann: compare op_params for POOL_2D in ACL graph cache matching When ACL graph mode is enabled, the graph LRU cache checks whether a cached graph matches the current computation graph. Previously, GGML_OP_POOL_2D was not included in the op_params comparison, so two POOL_2D nodes with different pooling parameters (kernel size, stride, padding) but identical tensor shapes and addresses could incorrectly reuse a cached graph, leading to wrong results or aclnn errors. Add GGML_OP_POOL_2D to the list of ops that require op_params matching in ggml_graph_node_properties::has_matching_properties(). * cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison The ACL graph LRU cache was incorrectly reusing cached graphs for operations with different tensor types or op_params, causing test failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD, RMS_NORM_MUL_ADD, and ADD_RMS_NORM. Changes: - Add node_type and src_type[] fields to ggml_graph_node_properties so the cache can distinguish tensors with different types but identical ne/nb (e.g. f16 and bf16 both have 2-byte elements) - Compare op_params unconditionally for all ops instead of only for SCALE/UNARY/GLU/ROPE/POOL_2D

``` $ build/bin/llama-server -hf unsloth/Qwen3.5-0.8B-GGUF common_download_file_single_online: HEAD failed, status: 404 no remote preset found, skipping Downloading mmproj-BF16.gguf ——————————————————————————————————————— 100% Downloading Qwen3.5-0.8B-Q4_K_M.gguf ——————————————————————————————— 100% ... ``` Signed-off-by: Adrien Gallouët <angt@huggingface.co>

Galunid and others added 2 commits March 31, 2026 10:14

common : Disable backend sampling if reasoning budget is enabled (#21209

5ce013c

)

vendor : update BoringSSL to 0.20260327.0 (#21211)

26dac84

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

pull Bot locked and limited conversation to collaborators Mar 31, 2026

pull Bot added the ⤵️ pull label Mar 31, 2026

github-actions Bot added examples server labels Mar 31, 2026

satishkc7 and others added 3 commits March 31, 2026 10:52

sycl : enhance fattn perf (#21185)

62278ce

github-actions Bot added ggml SYCL labels Mar 31, 2026

common : move up common_init() and fix Windows UTF-8 logs (#21176)

41361c8

The build info is now only for debug, so we avoid the duplicate with `--version`. The UTF-8 setup at the beginning is needed to avoid logging garbage on Windows. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

github-actions Bot added the testing label Mar 31, 2026

ggerganov added 2 commits March 31, 2026 14:00

ggml : bump version to 0.9.9 (ggml/1449)

0be6c7c

sync : ggml

9281dd1

github-actions Bot added the script label Mar 31, 2026

CI: Enable CPU and Vulkan ARM64 Release (#21207)

eec6f85

github-actions Bot added the devops label Mar 31, 2026

github-actions Bot added the model label Mar 31, 2026

aldehir and others added 3 commits March 31, 2026 13:52

common : gpt-oss handle builtin and unsolicited tool calls (#21213)

624733d

github-actions Bot added the Ascend NPU label Mar 31, 2026

pull Bot merged commit 6307ec0 into LongLeCE:master Mar 31, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] master from ggml-org:master#1034

[pull] master from ggml-org:master#1034
pull[bot] merged 15 commits into
LongLeCE:masterfrom
ggml-org:master

pull Bot commented Mar 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

Conversation

pull Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

pull Bot commented Mar 31, 2026 •

edited

Loading