[GPU] native gguf support by riverlijunjie · Pull Request #36365 · openvinotoolkit/openvino

riverlijunjie · 2026-06-12T01:26:46Z

Details:

- Add a GGUF FE to read native gguf model files
- Pass the original gguf weights to the GPU plugin primitive
- Create a primitive multi-impl framework to support gguf fc
- Use a new OCL kernel to support native gguf fc computation:
  - prefill: ocl_transcode + onednn (WIP ocl gguf fc kernel)
  - decoding: ocl gguf fc kernel
- Support gguf data types:
  - gguf_q4_0
  - gguf_q4_1
  - gguf_q5_0
  - gguf_q5_1
  - gguf_q8_0
  - gguf_q8_1
  - gguf_iq4_nl
  - gguf_q2_k
  - gguf_q3_k
  - gguf_q4_k
  - gguf_q5_k
  - gguf_q6_k
  - gguf_q8_k
  - gguf_iq2_xxs
  - gguf_iq2_xs
  - gguf_iq3_xxs
  - gguf_iq1_s
  - gguf_iq3_s
  - gguf_iq2_s
  - gguf_iq4_xs
  - gguf_iq1_m
  - gguf_tq1_0
  - gguf_tq2_0
- Test result
  - model: https://huggingface.co/Qwen/Qwen3-8B-GGUF/blob/main/Qwen3-8B-Q5_K_M.gguf
  - platform: PTL 12Xe
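
For context on what one of the listed block types holds, here is a minimal C++ sketch of decoding gguf_q4_0, the simplest type in the list. It follows ggml's reference layout (an f16 scale plus 32 four-bit quants, 18 bytes per block). The names `BlockQ4_0`, `half_to_float` and `dequantize_row_q4_0` are illustrative and are not taken from this PR, whose kernels do the equivalent work in OpenCL.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One Q4_0 block in ggml's layout: f16 scale + 32 quants packed two per byte.
constexpr std::size_t kQ4_0BlockElems = 32;
struct BlockQ4_0 {
    std::uint16_t d;                        // f16 scale, stored as raw bits
    std::uint8_t qs[kQ4_0BlockElems / 2];   // low nibble and high nibble per byte
};
static_assert(sizeof(BlockQ4_0) == 18, "Q4_0 block must be 18 bytes");

// Convert IEEE-754 half-precision bits to float (zero, subnormal, inf, NaN included).
inline float half_to_float(std::uint16_t h) {
    const std::uint32_t sign = (h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;
    const std::uint32_t man = h & 0x3FFu;
    if (exp == 0) {
        // Zero or subnormal: value is man * 2^-24.
        const float v = std::ldexp(static_cast<float>(man), -24);
        return sign ? -v : v;
    }
    const std::uint32_t bits = (exp == 31)
        ? (sign | 0x7F800000u | (man << 13))          // inf / NaN
        : (sign | ((exp + 112u) << 23) | (man << 13)); // rebias 15 -> 127
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Decode n elements (n must be a multiple of 32) from Q4_0 blocks into floats.
// Low nibbles fill the first half of each block, high nibbles the second half.
inline void dequantize_row_q4_0(const BlockQ4_0* blocks, std::size_t n, float* out) {
    assert(n % kQ4_0BlockElems == 0);
    const std::size_t num_blocks = n / kQ4_0BlockElems;
    for (std::size_t i = 0; i < num_blocks; ++i) {
        const float d = half_to_float(blocks[i].d);
        float* dst = out + i * kQ4_0BlockElems;
        for (std::size_t j = 0; j < kQ4_0BlockElems / 2; ++j) {
            const int lo = (blocks[i].qs[j] & 0x0F) - 8;
            const int hi = (blocks[i].qs[j] >> 4) - 8;
            dst[j] = static_cast<float>(lo) * d;
            dst[j + kQ4_0BlockElems / 2] = static_cast<float>(hi) * d;
        }
    }
}
```

The other block types follow the same idea with more elaborate per-block metadata (sub-block scales, minimums, codebooks), which is why the weights are treated as opaque blocks rather than as plain u4/i8 tensors.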

Tickets:

ticket-id

AI Assistance:

AI assistance used: no / yes
If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

- Add fc_compressed_generate_opt.cpp/.hpp: new OCL v2 implementation for W4A8 (4-bit weights, 8-bit activations) FC generate-phase inference
- Override get_arguments() to use weights_memory() for dep[1] to handle dynamic weight reordering correctly
- Fix act_scale_index: remove count()==0 guard that caused false negatives with stale compile-time shapes
- Set DEBUG_FCCmpOpt=0 for production; no diagnostic stdout logging

When the multi-impl pool switches from OneDNN to OCL (first generate step during warmup), update_weights() was NOT called. This left _impl_params->weights_layout pointing to the OneDNN internal format (set during prefill). weights_memory() then returned OneDNN-reordered bytes to the OCL kernel, which reads raw u4 format → garbage output.

Fix: call update_weights() inside switch_impl_to() after _impl is set to the new impl. This re-caches the weights with the correct expected layout for the newly active impl (no-reorder for OCL → original u4 bytes; reorder for OneDNN → internal format).

Also removes leftover debug std::cout for M=1 retry messages.
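
The fix described above can be illustrated with a small self-contained model. This is only a sketch under assumptions: `WeightsFormat`, `Impl` and `PrimitiveInst` are hypothetical stand-ins for the plugin's real classes, and only the names `switch_impl_to`, `update_weights` and `weights_memory` come from the commit message. The point it shows is that the cached weights layout has to be refreshed at the moment the active impl changes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the weight layouts involved in the bug.
enum class WeightsFormat { RawU4, OneDnnInternal };

// Hypothetical impl descriptor: each impl expects weights in one layout.
struct Impl {
    std::string name;
    WeightsFormat expected;
};

class PrimitiveInst {
public:
    explicit PrimitiveInst(std::vector<Impl> pool) : _pool(std::move(pool)) {
        assert(!_pool.empty());
        switch_impl_to(0);
    }

    // Switch the active impl and immediately re-cache the weights for it.
    // Without the update_weights() call, the cache would keep the layout
    // of the previously active impl.
    void switch_impl_to(std::size_t idx) {
        assert(idx < _pool.size());
        _active = idx;
        update_weights();
    }

    // Layout of the weights that would be handed to the active kernel.
    WeightsFormat weights_memory() const { return _cached; }
    const Impl& impl() const { return _pool[_active]; }

private:
    // Re-cache the weights in the layout the active impl expects.
    void update_weights() { _cached = _pool[_active].expected; }

    std::vector<Impl> _pool;
    std::size_t _active = 0;
    WeightsFormat _cached = WeightsFormat::RawU4;
};
```

With a pool of `{onednn, ocl}`, switching from index 0 to index 1 makes `weights_memory()` report `RawU4` again, which corresponds to the OCL kernel receiving the original u4 bytes.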

…ative GGUF FrontEnd (qwen3)

Add native GGUF support to OpenVINO Core and a new GGUF FrontEnd.

OV Core:
- 23 opaque gguf_* block element types (enum order matches GGML ggml_type), with TypeInfo (is_real=false, is_quantized=true, per-type signedness, rounded-up bitwidth) and block_byte_size()/block_elem_count()/is_gguf_block() accessors + free functions.
- Block-aware sizing in ov::util::get_memory_size and op::v0::Constant (ceil_div(n, block_elem) * block_bytes); element_iterator treats GGUF as non-byte; visualize_tree / xml deserialize handle the new types.

Transformation guards: every pass that could rewrite/reshape/fold/convert an opaque GGUF Constant now early-outs on is_gguf_block(): Convert validate, ConstantFolding (pins disable_constant_folding), nop_elimination, transpose_sinking (ts_base), convert_fc_to_compressed, convert_precision, mark_dequantization_subgraph.

GGUF FrontEnd (src/frontends/gguf/):
- Registered ov::frontend::FrontEnd "gguf" (Core::read_model("*.gguf")).
- mmap'd zero-copy weights emitted directly as gguf_* Constants (no u4+Convert+Multiply decompose); FullyConnectedCompressed with empty weight_scales/weight_zero_points (scale lives inside the block).
- Full qwen3 builder (GQA attention, SwiGLU MLP, RMSNorm, RoPE, KV-cache); throws for non-qwen3 architectures.
- rt-info schema under the "gguf" top segment (metadata + tokenizer).
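
The block-aware sizing rule quoted above, ceil_div(n, block_elem) * block_bytes, can be sketched as follows. The helper names (`GgufBlockInfo`, `gguf_memory_size`) are hypothetical and do not reflect the actual signatures of ov::util::get_memory_size; the per-type element counts and byte sizes are those of ggml's block definitions for the five baseline formats.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptor: how many elements one block covers and how many
// bytes it occupies. Values below follow ggml's block struct sizes.
struct GgufBlockInfo {
    std::size_t elem_count;
    std::size_t byte_size;
};

constexpr GgufBlockInfo kQ4_0{32, 18};    // f16 scale + 16 bytes of nibbles
constexpr GgufBlockInfo kQ8_0{32, 34};    // f16 scale + 32 int8 quants
constexpr GgufBlockInfo kQ4_K{256, 144};  // super-block of 256 elements
constexpr GgufBlockInfo kQ5_K{256, 176};
constexpr GgufBlockInfo kQ6_K{256, 210};

// Integer division rounded up.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Bytes needed to store n elements of a block-quantised type. A partially
// filled trailing block still occupies a whole block.
inline std::size_t gguf_memory_size(std::size_t n, GgufBlockInfo block) {
    return ceil_div(n, block.elem_count) * block.byte_size;
}
```

For example, a 4096 x 4096 weight in Q4_K needs 65536 blocks of 144 bytes, i.e. 9437184 bytes, whereas a per-element bit width would not describe the embedded scales correctly. This is the reason the sizing has to go through the block accessors instead of bitwidth * n / 8.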

…t) + baseline kernels

Add native GPU consumption of GGUF block weights. A GGUF FullyConnected node (gguf_* weight Constant, scale/zp embedded in the block) runs entirely on the GPU with the weight decoded in-kernel; no host-side dequant.

ocl::FCGGUFOpt dispatches by M:
- M <= threshold (decode / short prompt): native OCL GEMV kernel (fc_gguf_opt.cl); decodes GGUF blocks in registers, any M, ≥ baseline BW.
- M > threshold (prefill / long prompt): transcode + OneDNN WOQ GEMM; fc_gguf_transcode.cl requantises GGUF blocks into an i4/i8 + f16-per-group scale scratchpad (never an f16/f32 weight), then a directly-constructed dnnl::matmul (LRU-cached by (type,M,K,N)) consumes it via DPAS.

Threshold via OV_GPU_GGUF_PREFILL_THRESHOLD (default 32).

Baseline formats Q4_0/Q4_K/Q5_K/Q6_K/Q8_0 (Q4_*->i4, Q5/6/8->i8); decoders mirror ggml-quants.c / the FE CPU reference. Scratchpad via get_internal_buffer_descs (sized by static K,N only); OneDNN engine/stream and get_onednn_memory() bind the cldnn buffers with no copy.

Plumbing & no-corruption guards:
- block-aware layout::bytes_count / data_type_traits::size_of; jitter maps GGUF block types to opaque uchar.
- ConvertGGUFFullyConnectedCompressed lowers the FE's internal FullyConnectedCompressed (GGUF weight) into the GPU op (Placeholder/dummy scale).
- onednn FC, generic + GPU convert_fc_to_compressed, legacy dynamic FC selector and FullyConnectedHorizontalFusion all early-out on is_gguf_block weights.
- ov::supported_gguf_types property = Q4_0/Q4_K/Q5_K/Q6_K/Q8_0.

Validated on qwen3-4b-q4_0: decode (M=1) predicts " Paris"; prefill (M=64) transcode path agrees with the GEMV path (argmax match, finite logits).
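
The M-based dispatch described above can be sketched roughly as below. Only the environment variable name OV_GPU_GGUF_PREFILL_THRESHOLD and its default of 32 come from the commit message; `FcPath`, `parse_prefill_threshold`, `prefill_threshold` and `select_fc_path` are hypothetical names, and how the plugin actually parses the variable is an assumption here.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical labels for the two execution paths named in the commit message.
enum class FcPath { NativeGemv, TranscodeOneDnn };

// Parse a non-negative decimal threshold; fall back to the default when the
// text is missing or not a plain number (assumed behaviour, not from the PR).
inline std::size_t parse_prefill_threshold(const char* text, std::size_t fallback = 32) {
    if (text == nullptr || *text < '0' || *text > '9') {
        return fallback;
    }
    char* end = nullptr;
    const unsigned long long value = std::strtoull(text, &end, 10);
    return (*end == '\0') ? static_cast<std::size_t>(value) : fallback;
}

// Read the threshold from the environment variable named in the PR.
inline std::size_t prefill_threshold() {
    return parse_prefill_threshold(std::getenv("OV_GPU_GGUF_PREFILL_THRESHOLD"));
}

// M <= threshold: decode in registers with the GEMV kernel.
// M >  threshold: transcode to i4/i8 + scales, then run the OneDNN matmul.
inline FcPath select_fc_path(std::size_t m, std::size_t threshold) {
    return m <= threshold ? FcPath::NativeGemv : FcPath::TranscodeOneDnn;
}
```

With the default threshold, the validation cases in the commit message map to `select_fc_path(1, 32)` (GEMV path) and `select_fc_path(64, 32)` (transcode path).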

riverlijunjie added 26 commits March 13, 2026 14:31

Runtime Primitive Implementation Switching Policy

fbfcc26

Add_impl_pool policy for different impls

6c6db79

Add ocl type gemm_impl

610b564

Fix gemm ocl impl cannot be chosen issue

fa43e17

:Add i8 support for gemm_ocl

c6265d8

Debug ocl_gemm kernel

cf38645

Fixed ocl_gemm accuracy issue

98c0dca

Optimize ocl_gemm

4550015

Opt ocl_gemm by N-parallel

8314e87

Opt ocl_gemm by K-split

974242d

Merge branch 'master' into river/enhance_kernel_scheduler

0f2b093

Opt ocl gemv and add cm gemv

12f7049

Fix gemv ocl issue and disabled gemv cm

01abb53

Fix prefill gemv multi-compiling issue

2fbac8b

Continue opt kernel

8863cfa

refine some logic

61cc0e8

Fixed kernel build error

81f76cf

Merge branch 'master' into river/enhance_kernel_scheduler

5561951

Optimize fc_gguf_opt.cl kernel to achieve about 80% roofline

5808faf

Fix crash issue

e30c649

optimize prefill from 24s to 0.638s for 1K input token

d914b66

Fixed double-copy of weight at prefill

8244d0d

github-actions Bot added labels on Jun 12, 2026:
- category: inference (OpenVINO Runtime library - Inference)
- category: Core (OpenVINO Core (aka ngraph))
- category: IE Tests (OpenVINO Test: plugins and common)
- category: GPU (OpenVINO GPU plugin)

github-actions Bot added labels on Jun 12, 2026:
- category: build (OpenVINO cmake script / infra)
- category: transformations (OpenVINO Runtime library - Transformations)
- category: CPP API (OpenVINO CPP API bindings)
- no-match-files

Merge branch 'master' into river/gguf_support

04ffacb

riverlijunjie added the do_not_review and do_not_merge labels on Jun 12, 2026

Add some test cases

c668e29

riverlijunjie wants to merge 28 commits into openvinotoolkit:master from riverlijunjie:river/gguf_support
