[pull] master from ggml-org:master by pull[bot] · Pull Request #1075 · LongLeCE/llama.cpp

pull · 2026-04-14T14:42:03Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

…1870)

* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
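The creation condition described in the two commits above can be sketched as a single predicate. This is a minimal illustration, not the code from the PR: the `reasoning_params` struct and the `should_create_rbudget` function are hypothetical names, and treating "tags are set" as "both tags are non-empty" is an assumption. Only `reasoning_budget_tokens`, `grammar_lazy`, and the thinking tag names come from the commit message.

```cpp
#include <string>

// Hypothetical stand-in for the relevant sampling parameters. The field names
// follow the commit message; the struct itself is illustrative only.
struct reasoning_params {
    std::string thinking_start_tag;            // empty when no thinking block is defined
    std::string thinking_end_tag;
    int         reasoning_budget_tokens = -1;  // -1 = no budget requested (the default)
    bool        grammar_lazy            = false;
};

// Returns true when the reasoning budget sampler should be created.
// The sampler is skipped only when no budget is set AND the grammar is not lazy.
bool should_create_rbudget(const reasoning_params & p) {
    // Assumption: the sampler is only relevant when both tags are configured.
    const bool has_tags   = !p.thinking_start_tag.empty() && !p.thinking_end_tag.empty();
    const bool has_budget = p.reasoning_budget_tokens >= 0;
    return has_tags && (has_budget || p.grammar_lazy);
}
```

With the default budget of -1 and a non-lazy grammar the predicate is false, which per the commit message keeps backend sampling enabled. An explicit budget of 0 or more, or a lazy grammar, makes it true.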

…21644)

* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for Chrome, I'm blaming Dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
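For readers unfamiliar with the term, register tiling with f32 accumulation can be sketched on the CPU as follows. This is a hedged illustration only: the actual change lives in a GPU shader that is not shown in this PR page, and the function name `matmul_register_tiled` and the 4x4 tile size are assumptions. Each output tile keeps its partial sums in a small local `float` array (standing in for registers) and writes them back once at the end.

```cpp
#include <cstddef>
#include <vector>

// Illustrative tile sizes; real kernels tune these per device.
constexpr std::size_t TILE_M = 4;
constexpr std::size_t TILE_N = 4;

// Computes C = A * B for row-major A (M x K) and B (K x N).
// Each TILE_M x TILE_N block of C is accumulated in a local f32 array.
std::vector<float> matmul_register_tiled(const std::vector<float> & a,
                                         const std::vector<float> & b,
                                         std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE_M) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE_N) {
            float acc[TILE_M][TILE_N] = {};  // f32 accumulators for this tile
            for (std::size_t k = 0; k < K; ++k) {
                for (std::size_t ti = 0; ti < TILE_M && i0 + ti < M; ++ti) {
                    const float a_val = a[(i0 + ti) * K + k];
                    for (std::size_t tj = 0; tj < TILE_N && j0 + tj < N; ++tj) {
                        acc[ti][tj] += a_val * b[k * N + (j0 + tj)];
                    }
                }
            }
            // Write the finished tile back, skipping out-of-range rows and columns.
            for (std::size_t ti = 0; ti < TILE_M && i0 + ti < M; ++ti) {
                for (std::size_t tj = 0; tj < TILE_N && j0 + tj < N; ++tj) {
                    c[(i0 + ti) * N + (j0 + tj)] = acc[ti][tj];
                }
            }
        }
    }
    return c;
}
```

In a shader the inputs may be stored in a narrower type such as f16. Keeping the accumulators in f32 limits the rounding error that builds up over the K loop.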

* cmake: fix CMP0194 warning on Windows with MSVC

Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.

The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.

This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).

Closes #20311

* cmake: apply cisc's formatting suggestion

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>

…device supports it (#21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md

ngxson and others added 13 commits April 14, 2026 11:09

server: support OAI /v1/audio/transcriptions API (#21863)

e489a5c

* server: support OAI /v1/audio/transcriptions API
* address autoreview comments
* correct default response_format value

vulkan: Support GGML_TYPE_NVFP4 (#21455)

6a6780a

This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32.
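As background for the dequant path, the sketch below decodes packed 4-bit E2M1 values (the element encoding used by FP4 formats such as NVFP4) and applies a shared block scale. It is not the Vulkan shader from this commit and it does not reproduce ggml's block layout: the function name `dequant_fp4_block`, the low-nibble-first packing order, and passing the scale as an already decoded `float` are all assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// The 16 values representable by a 4-bit E2M1 float
// (1 sign bit, 2 exponent bits, 1 mantissa bit).
constexpr std::array<float, 16> kE2M1 = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// Dequantizes one block of packed FP4 values to f32.
// Assumptions: two values per byte with the low nibble first, and a block
// scale that the caller has already converted to float.
std::vector<float> dequant_fp4_block(const std::vector<std::uint8_t> & packed, float scale) {
    std::vector<float> out;
    out.reserve(packed.size() * 2);
    for (std::uint8_t byte : packed) {
        out.push_back(kE2M1[byte & 0x0F] * scale);  // low nibble
        out.push_back(kE2M1[byte >> 4]   * scale);  // high nibble
    }
    return out;
}
```

A mul_mat built on this kind of dequantization works on the resulting floating-point values, which matches the commit's note that everything goes through fp16/fp32 rather than the dp4/q8_1 integer path.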

ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (#21559)

2e05f06

vendor : update BoringSSL to 0.20260413.0 (#21881)

be76dd0

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

metal : add XIELU unary op (#20802)

aa0f189

ci : re-enable mac workflows (#21894)

f4b5bf2

* ci : re-enable mac workflows
* vulkan : fix compile warning

mtmd: add mtmd_image_tokens_get_decoder_pos() API (#21851)

707c0b7

* mtmd: add mtmd_image_tokens_get_decoder_pos() API
* consistent naming
* fix build

metal : fix FA support logic (#21898)

c0de6ed

ggml : remove ggml-ext.h (#21869)

fae3a28

* ggml: correct placement of ggml-ext.h
* ggml : remove ggml-ext.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

pull Bot locked and limited conversation to collaborators Apr 14, 2026

pull Bot added the ⤵️ pull label Apr 14, 2026

pull Bot merged commit fae3a28 into LongLeCE:master Apr 14, 2026

github-actions Bot added the documentation, Apple Metal, testing, examples, ggml, server, Vulkan, devops, and WebGPU labels Apr 14, 2026

pull[bot] merged 13 commits into LongLeCE:master from ggml-org:master

9 participants