[GPU] native gguf support#36365
Draft
riverlijunjie wants to merge 28 commits into
- Add fc_compressed_generate_opt.cpp/.hpp: new OCL v2 implementation for W4A8 (4-bit weights, 8-bit activations) FC generate-phase inference
- Override get_arguments() to use weights_memory() for dep[1] to handle dynamic weight reordering correctly
- Fix act_scale_index: remove count()==0 guard that caused false negatives with stale compile-time shapes
- Set DEBUG_FCCmpOpt=0 for production; no diagnostic stdout logging
When the multi-impl pool switches from OneDNN to OCL (first generate step during warmup), update_weights() was NOT called. This left _impl_params->weights_layout pointing to the OneDNN internal format (set during prefill). weights_memory() then returned OneDNN-reordered bytes to the OCL kernel, which reads raw u4 format → garbage output.

Fix: call update_weights() inside switch_impl_to() after _impl is set to the new impl. This re-caches the weights with the correct expected layout for the newly active impl (no-reorder for OCL → original u4 bytes; reorder for OneDNN → internal format).

Also removes leftover debug std::cout for M=1 retry messages.
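
The ordering issue described above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ToyPrimitiveInst`, `ImplKind`, and the byte-reversal used as a stand-in for a OneDNN reorder are hypothetical and are not the plugin's real classes or layouts. Only the names update_weights(), switch_impl_to() and weights_memory() come from the commit message.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two impl families named in the commit message.
enum class ImplKind { OneDNN, OCL };

// Hypothetical toy model of a primitive instance with a weight cache.
struct ToyPrimitiveInst {
    std::vector<std::uint8_t> original_weights;  // raw bytes as stored in the model
    std::vector<std::uint8_t> cached_weights;    // what weights_memory() returns
    ImplKind active = ImplKind::OneDNN;

    // Re-cache the weights in the layout the currently active impl expects.
    void update_weights() {
        cached_weights = original_weights;
        if (active == ImplKind::OneDNN) {
            // Placeholder for a real reorder: any transformation that changes
            // the byte order is enough to show the failure mode.
            std::reverse(cached_weights.begin(), cached_weights.end());
        }
    }

    // The fix: refresh the cache right after the active impl changes.
    void switch_impl_to(ImplKind next) {
        active = next;
        update_weights();
    }

    const std::vector<std::uint8_t>& weights_memory() const { return cached_weights; }
};
```

Without the update_weights() call inside switch_impl_to(), the cache in this toy would still hold the reordered bytes after switching to OCL, which is the stale-layout condition the commit describes.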
…ative GGUF FrontEnd (qwen3)
Add native GGUF support to OpenVINO Core and a new GGUF FrontEnd
OV Core:
- 23 opaque gguf_* block element types (enum order matches GGML ggml_type),
with TypeInfo (is_real=false, is_quantized=true, per-type signedness,
rounded-up bitwidth) and block_byte_size()/block_elem_count()/is_gguf_block()
accessors + free functions.
- Block-aware sizing in ov::util::get_memory_size and op::v0::Constant
(ceil_div(n, block_elem) * block_bytes); element_iterator treats GGUF as
non-byte; visualize_tree / xml deserialize handle the new types.
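
The block-aware sizing rule can be sketched in isolation. The `BlockFormat` struct, the `k*` constants and `gguf_memory_size` below are hypothetical names for illustration and are not the PR's API (which exposes block_byte_size() and block_elem_count()). The block geometries are the ones defined by ggml's block structs for the five baseline formats.

```cpp
#include <cstddef>

// Hypothetical descriptor: elements per block and bytes per block.
struct BlockFormat {
    std::size_t elem_count;
    std::size_t byte_size;
};

// Geometry as defined by ggml's block_q* structs.
constexpr BlockFormat kQ4_0{32, 18};    // f16 scale + 16 bytes of packed 4-bit quants
constexpr BlockFormat kQ8_0{32, 34};    // f16 scale + 32 int8 quants
constexpr BlockFormat kQ4_K{256, 144};
constexpr BlockFormat kQ5_K{256, 176};
constexpr BlockFormat kQ6_K{256, 210};

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Bytes needed to store n elements: a partially filled last block
// still occupies a whole block.
constexpr std::size_t gguf_memory_size(std::size_t n, BlockFormat f) {
    return ceil_div(n, f.elem_count) * f.byte_size;
}

// A 4096-element Q4_0 row is 128 blocks of 18 bytes.
static_assert(gguf_memory_size(4096, kQ4_0) == 2304, "Q4_0 row size");
```

A plain `n * bitwidth / 8` computation would undercount here, since each block also carries its scale (and, for K-quants, sub-block metadata).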
Transformation guards — every pass that could rewrite/reshape/
fold/convert an opaque GGUF Constant now early-outs on is_gguf_block():
Convert validate, ConstantFolding (pins disable_constant_folding),
nop_elimination, transpose_sinking (ts_base), convert_fc_to_compressed,
convert_precision, mark_dequantization_subgraph.
GGUF FrontEnd — src/frontends/gguf/:
- Registered ov::frontend::FrontEnd "gguf" (Core::read_model("*.gguf")).
- mmap'd zero-copy weights emitted directly as gguf_* Constants
(no u4+Convert+Multiply decompose); FullyConnectedCompressed with empty
weight_scales/weight_zero_points (scale lives inside the block).
- Full qwen3 builder (GQA attention, SwiGLU MLP, RMSNorm, RoPE, KV-cache);
throws for non-qwen3 architectures.
- rt-info schema under the "gguf" top segment (metadata + tokenizer).
…t) + baseline kernels
Add native GPU consumption of GGUF block weights.
A GGUF FullyConnected node (gguf_* weight Constant, scale/zp embedded in the block) runs entirely on
the GPU with the weight decoded in-kernel; no host-side dequant.
ocl::FCGGUFOpt dispatches by M:
- M <= threshold (decode / short prompt): native OCL GEMV kernel
(fc_gguf_opt.cl) — decodes GGUF blocks in registers, any M, ≥ baseline BW.
- M > threshold (prefill / long prompt): transcode + OneDNN WOQ GEMM —
fc_gguf_transcode.cl requantises GGUF blocks into an i4/i8 + f16-per-group
scale scratchpad (never an f16/f32 weight), then a directly-constructed
dnnl::matmul (LRU-cached by (type,M,K,N)) consumes it via DPAS. Threshold
via OV_GPU_GGUF_PREFILL_THRESHOLD (default 32).
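
The M-based dispatch can be sketched as follows. `FcPath`, `prefill_threshold` and `select_path` are hypothetical names, and the fallback to 32 for an unparsable value is an assumption of this sketch; the commit message only states the environment variable name and the default.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical labels for the two execution paths described above.
enum class FcPath { OclGemv, TranscodeOnednn };

// Read OV_GPU_GGUF_PREFILL_THRESHOLD; use 32 when it is unset or empty.
// Falling back to 32 on a malformed value is an assumption of this sketch.
inline std::size_t prefill_threshold() {
    const char* env = std::getenv("OV_GPU_GGUF_PREFILL_THRESHOLD");
    if (env == nullptr || *env == '\0') {
        return 32;
    }
    char* end = nullptr;
    const unsigned long long value = std::strtoull(env, &end, 10);
    return (*end == '\0') ? static_cast<std::size_t>(value) : 32;
}

// M <= threshold goes to the native GEMV kernel, larger M to transcode + GEMM.
inline FcPath select_path(std::size_t m, std::size_t threshold) {
    return m <= threshold ? FcPath::OclGemv : FcPath::TranscodeOnednn;
}
```

With the default threshold, a decode step (M=1) selects the GEMV path and a 64-token prefill selects the transcode path, matching the two cases validated in the commit message.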
Baseline formats Q4_0/Q4_K/Q5_K/Q6_K/Q8_0 (Q4_*->i4, Q5/6/8->i8); decoders
mirror ggml-quants.c / the FE CPU reference. Scratchpad via
get_internal_buffer_descs (sized by static K,N only); OneDNN engine/stream and
get_onednn_memory() bind the cldnn buffers with no copy.
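
For reference, a host-side C++ sketch of the simplest baseline decoder, Q4_0, following the element order used by ggml's dequantize_row_q4_0. This is not the PR's OpenCL kernel; the struct and function names are hypothetical, and the f16 conversion is a portable stand-in written for this example.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Same layout as ggml's block_q4_0: an f16 scale followed by 16 bytes
// that pack 32 4-bit quants.
struct BlockQ4_0 {
    std::uint16_t d;      // scale as an IEEE-754 binary16 bit pattern
    std::uint8_t qs[16];  // low nibble: element j, high nibble: element j + 16
};
static_assert(sizeof(BlockQ4_0) == 18, "Q4_0 block is 18 bytes");

// Portable binary16 -> float conversion.
inline float half_to_float(std::uint16_t h) {
    const int sign = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1F;
    const int mantissa = h & 0x3FF;
    float value;
    if (exponent == 0) {
        value = std::ldexp(static_cast<float>(mantissa), -24);  // zero or subnormal
    } else if (exponent == 31) {
        value = mantissa ? std::numeric_limits<float>::quiet_NaN()
                         : std::numeric_limits<float>::infinity();
    } else {
        value = std::ldexp(static_cast<float>(mantissa + 1024), exponent - 25);
    }
    return sign ? -value : value;
}

// Decode one block into 32 floats: each nibble is an unsigned value in
// [0, 15] that is re-centred by subtracting 8 and multiplied by the scale.
inline void dequantize_q4_0(const BlockQ4_0& block, float out[32]) {
    const float d = half_to_float(block.d);
    for (int j = 0; j < 16; ++j) {
        out[j] = d * static_cast<float>((block.qs[j] & 0x0F) - 8);
        out[j + 16] = d * static_cast<float>((block.qs[j] >> 4) - 8);
    }
}
```

A scalar reference like this is mainly useful for checking an in-kernel decoder against known block contents.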
Plumbing & no-corruption guards:
- block-aware layout::bytes_count / data_type_traits::size_of; jitter maps GGUF
block types to opaque uchar.
- ConvertGGUFFullyConnectedCompressed lowers the FE's internal
FullyConnectedCompressed (GGUF weight) into the GPU op (Placeholder/dummy
scale).
- onednn FC, generic + GPU convert_fc_to_compressed, legacy dynamic FC selector
and FullyConnectedHorizontalFusion all early-out on is_gguf_block weights.
- ov::supported_gguf_types property = Q4_0/Q4_K/Q5_K/Q6_K/Q8_0.
Validated on qwen3-4b-q4_0: decode (M=1) predicts " Paris"; prefill (M=64)
transcode path agrees with the GEMV path (argmax match, finite logits).
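
The agreement criterion mentioned above (argmax match, finite logits) can be expressed as a small check. The helper names are hypothetical and this is not the PR's test code, only one way to state the criterion.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Index of the largest logit (first one on ties). Expects a non-empty vector.
inline std::size_t argmax(const std::vector<float>& logits) {
    return static_cast<std::size_t>(
        std::distance(logits.begin(), std::max_element(logits.begin(), logits.end())));
}

// True when no logit is NaN or infinite.
inline bool all_finite(const std::vector<float>& logits) {
    return std::all_of(logits.begin(), logits.end(),
                       [](float x) { return std::isfinite(x); });
}

// Two paths agree when both produce finite logits of the same length
// and pick the same top token.
inline bool paths_agree(const std::vector<float>& a, const std::vector<float>& b) {
    return !a.empty() && a.size() == b.size() &&
           all_finite(a) && all_finite(b) && argmax(a) == argmax(b);
}
```

Argmax agreement is a coarse check: it tolerates the small numeric differences expected from requantising into i4/i8, but it does not bound the per-logit error.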
Details:
Add a GGUF FE to read native GGUF model files
Pass the original GGUF weights to the GPU plugin primitive
Create a primitive multi-impl framework to support GGUF FC
Use new OCL kernel to support native gguf's fc computation:
Support gguf data type
Test result
model: https://huggingface.co/Qwen/Qwen3-8B-GGUF/blob/main/Qwen3-8B-Q5_K_M.gguf
platform: PTL 12Xe
Tickets:
AI Assistance: