[GPU] native gguf support #36365

Draft
riverlijunjie wants to merge 28 commits into openvinotoolkit:master from riverlijunjie:river/gguf_support
@riverlijunjie riverlijunjie commented Jun 12, 2026

Contributor

Details:

  • Add a GGUF frontend (FE) to read native GGUF model files

  • Pass the original GGUF weights to the GPU plugin primitive

  • Create a primitive multi-impl framework to support GGUF FC

  • Use new OCL kernels for native GGUF FC computation:

    • prefill: ocl_transcode + onednn (WIP ocl gguf fc kernel)
    • decoding: ocl gguf fc kernel
  • Supported GGUF data types:

    • gguf_q4_0
    • gguf_q4_1
    • gguf_q5_0
    • gguf_q5_1
    • gguf_q8_0
    • gguf_q8_1
    • gguf_iq4_nl
    • gguf_q2_k
    • gguf_q3_k
    • gguf_q4_k
    • gguf_q5_k
    • gguf_q6_k
    • gguf_q8_k
    • gguf_iq2_xxs
    • gguf_iq2_xs
    • gguf_iq3_xxs
    • gguf_iq1_s
    • gguf_iq3_s
    • gguf_iq2_s
    • gguf_iq4_xs
    • gguf_iq1_m
    • gguf_tq1_0
    • gguf_tq2_0
  • Test result
    model: https://huggingface.co/Qwen/Qwen3-8B-GGUF/blob/main/Qwen3-8B-Q5_K_M.gguf
    platform: PTL 12Xe

image

Tickets:

  • ticket-id

AI Assistance:

  • AI assistance used: no / yes
  • If yes, summarize how AI was used and what human validation was performed (build/tests/manual checks).

- Add fc_compressed_generate_opt.cpp/.hpp: new OCL v2 implementation
  for W4A8 (4-bit weights, 8-bit activations) FC generate-phase inference
- Override get_arguments() to use weights_memory() for dep[1] to handle
  dynamic weight reordering correctly
- Fix act_scale_index: remove count()==0 guard that caused false negatives
  with stale compile-time shapes
- Set DEBUG_FCCmpOpt=0 for production; no diagnostic stdout logging

When the multi-impl pool switches from OneDNN to OCL (first generate
step during warmup), update_weights() was NOT called. This left
_impl_params->weights_layout pointing to the OneDNN internal format
(set during prefill). weights_memory() then returned OneDNN-reordered
bytes to the OCL kernel, which reads raw u4 format → garbage output.

Fix: call update_weights() inside switch_impl_to() after _impl is
set to the new impl. This re-caches the weights with the correct
expected layout for the newly active impl (no-reorder for OCL →
original u4 bytes; reorder for OneDNN → internal format).

Also removes leftover debug std::cout for M=1 retry messages.
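
The ordering issue described above can be illustrated with a toy model. This is only a sketch: `PrimitiveInst`, `ImplKind`, `WeightsLayout` and `expected_layout` are hypothetical stand-ins, not the GPU plugin's real classes. Only the names `switch_impl_to`, `update_weights` and `weights_memory` are taken from the commit message, and their bodies here are assumptions.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two implementations and their weight layouts.
enum class ImplKind { OneDNN, OCL };
enum class WeightsLayout { OriginalU4, OneDnnInternal };

struct PrimitiveInst {
    ImplKind impl = ImplKind::OneDNN;
    // State after prefill: weights were reordered for OneDNN.
    WeightsLayout cached_layout = WeightsLayout::OneDnnInternal;

    // Layout each impl expects: OCL reads raw u4 bytes, OneDNN its own format.
    static WeightsLayout expected_layout(ImplKind kind) {
        return kind == ImplKind::OCL ? WeightsLayout::OriginalU4
                                     : WeightsLayout::OneDnnInternal;
    }

    // Re-cache the weights in the layout of the currently active impl.
    void update_weights() { cached_layout = expected_layout(impl); }

    // The fix: refresh the weight cache right after the impl changes,
    // so the new impl never sees the previous impl's layout.
    void switch_impl_to(ImplKind next) {
        impl = next;
        update_weights();
    }

    WeightsLayout weights_memory() const { return cached_layout; }
};
```

Without the `update_weights()` call inside `switch_impl_to`, the toy instance would still report `OneDnnInternal` after switching to OCL, which corresponds to the garbage-output case described in the message.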
…ative GGUF FrontEnd (qwen3)

Add native GGUF support to OpenVINO Core and a new GGUF FrontEnd

OV Core:
  - 23 opaque gguf_* block element types (enum order matches GGML ggml_type),
    with TypeInfo (is_real=false, is_quantized=true, per-type signedness,
    rounded-up bitwidth) and block_byte_size()/block_elem_count()/is_gguf_block()
    accessors + free functions.
  - Block-aware sizing in ov::util::get_memory_size and op::v0::Constant
    (ceil_div(n, block_elem) * block_bytes); element_iterator treats GGUF as
    non-byte; visualize_tree / xml deserialize handle the new types.
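
The block-aware sizing formula above, ceil_div(n, block_elem) * block_bytes, can be sketched as follows. The helper names (`GgufBlockInfo`, `gguf_memory_size`) are hypothetical and are not the accessors added in this PR; the block sizes are the ones GGML defines for these formats (for example Q4_0 stores 32 elements in 18 bytes: one f16 scale plus 16 bytes of packed nibbles).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical descriptor: elements per block and bytes per block.
struct GgufBlockInfo {
    std::size_t elem_count;
    std::size_t byte_size;
};

// Block geometry as defined by GGML for a few formats.
constexpr GgufBlockInfo kQ4_0{32, 18};
constexpr GgufBlockInfo kQ8_0{32, 34};
constexpr GgufBlockInfo kQ4_K{256, 144};
constexpr GgufBlockInfo kQ6_K{256, 210};

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Bytes needed to store n elements: a partial trailing block still
// occupies a whole block.
inline std::size_t gguf_memory_size(std::size_t n, GgufBlockInfo info) {
    assert(info.elem_count > 0);
    return ceil_div(n, info.elem_count) * info.byte_size;
}
```

For instance, a 4096-element Q4_0 row takes 128 blocks of 18 bytes, while 33 elements still need two full blocks.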

Transformation guards — every pass that could rewrite/reshape/
fold/convert an opaque GGUF Constant now early-outs on is_gguf_block():
  Convert validate, ConstantFolding (pins disable_constant_folding),
  nop_elimination, transpose_sinking (ts_base), convert_fc_to_compressed,
  convert_precision, mark_dequantization_subgraph.

GGUF FrontEnd  — src/frontends/gguf/:
  - Registered ov::frontend::FrontEnd "gguf" (Core::read_model("*.gguf")).
  - mmap'd zero-copy weights emitted directly as gguf_* Constants
    (no u4+Convert+Multiply decompose); FullyConnectedCompressed with empty
    weight_scales/weight_zero_points (scale lives inside the block).
  - Full qwen3 builder (GQA attention, SwiGLU MLP, RMSNorm, RoPE, KV-cache);
    throws for non-qwen3 architectures.
  - rt-info schema under the "gguf" top segment (metadata + tokenizer).
…t) + baseline kernels

Add native GPU consumption of GGUF block weights.
A GGUF FullyConnected node (gguf_* weight Constant, scale/zp embedded in the block) runs entirely on
the GPU with the weight decoded in-kernel; no host-side dequant.
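
To make "scale embedded in the block" concrete, here is a host-side C++ sketch of decoding one Q4_0 block, following the layout used by ggml-quants.c (an f16 scale followed by 16 bytes holding 32 4-bit values, each offset by 8). It is an illustration only, not the OCL kernel from this PR; `BlockQ4_0`, `half_to_float` and `dequantize_q4_0` are names chosen for this example.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// One Q4_0 block: f16 scale (raw bits) + 32 packed 4-bit quants.
struct BlockQ4_0 {
    std::uint16_t d;       // scale, IEEE half-precision bits
    std::uint8_t qs[16];   // low nibble = element j, high nibble = element j + 16
};
static_assert(sizeof(BlockQ4_0) == 18, "Q4_0 block must be 18 bytes");

// Convert IEEE half-precision bits to float.
inline float half_to_float(std::uint16_t h) {
    const float sign = (h >> 15) ? -1.0f : 1.0f;
    const int exp = (h >> 10) & 0x1F;
    const int man = h & 0x3FF;
    if (exp == 0) {  // zero or subnormal
        return sign * std::ldexp(static_cast<float>(man), -24);
    }
    if (exp == 31) {  // infinity or NaN
        return man ? std::numeric_limits<float>::quiet_NaN()
                   : sign * std::numeric_limits<float>::infinity();
    }
    return sign * std::ldexp(static_cast<float>(man + 1024), exp - 25);
}

// Decode 32 weights from one block into out[0..31].
inline void dequantize_q4_0(const BlockQ4_0& b, float* out) {
    const float d = half_to_float(b.d);
    for (int j = 0; j < 16; ++j) {
        out[j]      = static_cast<float>((b.qs[j] & 0x0F) - 8) * d;
        out[j + 16] = static_cast<float>((b.qs[j] >> 4) - 8) * d;
    }
}
```

Because the scale travels with the quantized values, a consumer needs no separate scale or zero-point tensor, which matches the empty weight_scales/weight_zero_points inputs mentioned for the frontend.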

ocl::FCGGUFOpt dispatches by M:
  - M <= threshold (decode / short prompt): native OCL GEMV kernel
    (fc_gguf_opt.cl) — decodes GGUF blocks in registers, any M, ≥ baseline BW.
  - M >  threshold (prefill / long prompt): transcode + OneDNN WOQ GEMM —
    fc_gguf_transcode.cl requantises GGUF blocks into an i4/i8 + f16-per-group
    scale scratchpad (never an f16/f32 weight), then a directly-constructed
    dnnl::matmul (LRU-cached by (type,M,K,N)) consumes it via DPAS. Threshold
    via OV_GPU_GGUF_PREFILL_THRESHOLD (default 32).
  Baseline formats Q4_0/Q4_K/Q5_K/Q6_K/Q8_0 (Q4_*->i4, Q5/6/8->i8); decoders
  mirror ggml-quants.c / the FE CPU reference. Scratchpad via
  get_internal_buffer_descs (sized by static K,N only); OneDNN engine/stream and
  get_onednn_memory() bind the cldnn buffers with no copy.
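
The M-based dispatch described above can be sketched like this. Only the environment variable name OV_GPU_GGUF_PREFILL_THRESHOLD and its default of 32 come from the commit message; the enum, the function names and the fallback behaviour for malformed values are assumptions made for illustration and may differ from the actual ocl::FCGGUFOpt code.

```cpp
#include <cstdlib>

// Hypothetical labels for the two execution paths.
enum class GgufFcPath { NativeGemv, TranscodeOneDnn };

// Parse the threshold; fall back to the default on missing or malformed input
// (the fallback policy is an assumption of this sketch).
inline long parse_threshold(const char* env_value, long default_value = 32) {
    if (env_value == nullptr || *env_value == '\0') {
        return default_value;
    }
    char* end = nullptr;
    const long v = std::strtol(env_value, &end, 10);
    const bool ok = (end != env_value) && (*end == '\0') && (v >= 0);
    return ok ? v : default_value;
}

// M <= threshold: decode / short prompt, native GEMV kernel.
// M >  threshold: prefill / long prompt, transcode + OneDNN GEMM.
inline GgufFcPath select_path(long m, long threshold) {
    return m <= threshold ? GgufFcPath::NativeGemv
                          : GgufFcPath::TranscodeOneDnn;
}

inline GgufFcPath select_path_from_env(long m) {
    return select_path(
        m, parse_threshold(std::getenv("OV_GPU_GGUF_PREFILL_THRESHOLD")));
}
```

With the default threshold, M=1 (decode) and M=32 take the GEMV path, and M=64 (the prefill case used in the validation) takes the transcode path.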

Plumbing & no-corruption guards:
  - block-aware layout::bytes_count / data_type_traits::size_of; jitter maps GGUF
    block types to opaque uchar.
  - ConvertGGUFFullyConnectedCompressed lowers the FE's internal
    FullyConnectedCompressed (GGUF weight) into the GPU op (Placeholder/dummy
    scale).
  - onednn FC, generic + GPU convert_fc_to_compressed, legacy dynamic FC selector
    and FullyConnectedHorizontalFusion all early-out on is_gguf_block weights.
  - ov::supported_gguf_types property = Q4_0/Q4_K/Q5_K/Q6_K/Q8_0.

Validated on qwen3-4b-q4_0: decode (M=1) predicts " Paris"; prefill (M=64)
transcode path agrees with the GEMV path (argmax match, finite logits).
@github-actions github-actions Bot added category: inference OpenVINO Runtime library - Inference category: Core OpenVINO Core (aka ngraph) category: IE Tests OpenVINO Test: plugins and common category: GPU OpenVINO GPU plugin labels Jun 12, 2026
@github-actions github-actions Bot added category: build OpenVINO cmake script / infra category: transformations OpenVINO Runtime library - Transformations category: CPP API OpenVINO CPP API bindings no-match-files labels Jun 12, 2026
Labels

category: build OpenVINO cmake script / infra category: Core OpenVINO Core (aka ngraph) category: CPP API OpenVINO CPP API bindings category: GPU OpenVINO GPU plugin category: IE Tests OpenVINO Test: plugins and common category: inference OpenVINO Runtime library - Inference category: transformations OpenVINO Runtime library - Transformations do_not_merge do_not_review no-match-files
