Skip to content

Add Granite 4.1 Vision (granite4_vision)#45597

Open
artem-spector wants to merge 58 commits intohuggingface:mainfrom
artem-spector:add-gv41
Open

Add Granite 4.1 Vision (granite4_vision)#45597
artem-spector wants to merge 58 commits intohuggingface:mainfrom
artem-spector:add-gv41

Conversation

@artem-spector
Copy link
Copy Markdown

What does this PR do?

Adds built-in support for Granite 4.1 Vision (granite4_vision), IBM's multimodal vision-language model for enterprise document understanding.

Architecture highlights

  • Vision encoder: SigLIP2 (google/siglip2-so400m-patch16-384), tiled 384×384 patches
  • Window Q-Former projector: 4×4 patch windows compressed to 2×2 query tokens via cross-attention (downsample_rate="4/8")
  • DeepStack feature injection: 8 vision-to-LLM injection points across two mechanisms:
    • LayerDeepstack: features from 4 vision encoder depths injected at 4 LLM layers (reversed order — deepest vision → earliest LLM)
    • SpatialDeepstack: deepest features split into 4 spatial offset groups (TL/TR/BL/BR), injected at 4 later LLM layers
  • Language model: GraniteForCausalLM (3.5B) with a rank-256 LoRA adapter (same-repo, LM-only)

Files added

File Purpose
modular_granite4_vision.py Source of truth — inherits from LLaVA-Next, overrides novel components
configuration_granite4_vision.py Config (generated)
modeling_granite4_vision.py Model (generated)
processing_granite4_vision.py Unified processor (generated)
image_processing_granite4_vision.py Torchvision-based image processor
image_processing_pil_granite4_vision.py PIL/NumPy image processor
tests/models/granite4_vision/ Modeling, image processing, and processor tests
docs/source/en/model_doc/granite4_vision.md Model documentation

Auto-registration

  • Config: auto-generated via configuration_granite4_vision.py model_type
  • Modeling: MODEL_MAPPING_NAMES + MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
  • Processing + image processing: registered in respective auto files

Tests

  • Unit tests pass locally (pytest tests/models/granite4_vision/ -x -q)
  • @slow integration tests load real checkpoint and assert outputs within tolerance
  • make style and make check-repo pass (3 remaining failures are pre-existing upstream issues: mlinter version mismatch and Sam3Lite incomplete model)

Before submitting

  • This PR is not a duplicate
  • I have read the contributor guidelines
  • The documentation reflects the changes
  • The tests pass

Related

@artem-spector
Copy link
Copy Markdown
Author

artem-spector commented Apr 23, 2026

I've traced the root cause of the check_repository_consistency and tests_torch failures to a specific upstream commit:

[Sam3LiteText] Remove unnecessary modules/configs (#45535) (7439ac0)

This commit removed Sam3LiteTextViTConfig and Sam3LiteTextVisionConfig from the modeling file but left them referenced in auto_mappings.py, causing:

  • AttributeError: module transformers has no attribute Sam3LiteTextViTConfig (357 test failures)
  • check_repo failure: Sam3LiteTextVisionConfig appears in CONFIG_MAPPING_NAMES but is not defined

This is reproducible on main independently of our PR.

Question for reviewers: Should we include a fix for this in our PR (removing the stale entries from auto_mappings.py), or would you prefer to handle it in a separate hotfix? Happy to do either.

@artem-spector
Copy link
Copy Markdown
Author

Opened a dedicated issue for the upstream regression: #45600

@zucchini-nlp
Copy link
Copy Markdown
Member

LMK when ready for review, and ig this PR supersedes #45350?

@artem-spector
Copy link
Copy Markdown
Author

artem-spector commented Apr 25, 2026

@zucchini-nlp - yes, this PR supersedes #45350. Its our team that is responsible for producing/release IBM vision models.
This PR is ready for review from my side.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artem-spector great usage of modular!

Seems like the model uses granite llm as backbone with deepstack features. We will need to add an llm class in that case, since calling each backbone layer manually doesn't align well with our API. We can use modular to copy everything except for a single forward

As per adapters, can you explain how the weights are released? I am not really sure we have to manually add merge_adapters, prob I can suggest a cleaner way

Comment on lines +41 to +48
```bibtex
@misc{granite-vision-4.1-4b,
title={Granite Vision 4.1},
author={IBM Granite Vision Team},
year={2026},
url={https://huggingface.co/ibm-granite/granite-vision-4.1-4b}
}
```
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we dont need a bibtext entry and as long as there is a link to HF papers/arxiv, that is enought

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed.

Comment on lines +66 to +68
device=0,
torch_dtype=torch.bfloat16,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these two are by default "auto" so we dont need to manually set

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed.


processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, torch_dtype is "auto" by default and can be deleted

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed.

Comment on lines +165 to +177
## Notes

- The model includes LoRA adapters. Call `model.merge_lora_adapters()` after loading to merge them into base weights for faster inference.

- Set `padding_side="left"` during batched generation for more accurate results.

```py
processor.tokenizer.padding_side = "left"
```

- The model supports specialized task tags for document extraction: `<chart2csv>`, `<chart2summary>`, `<chart2code>`, `<tables_html>`, `<tables_otsl>`, `<tables_json>`. Pass these as the text prompt along with a document image.

- For key-value pair extraction, provide a JSON schema describing the fields to extract. The model returns structured JSON matching the schema.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets move this block as Usage Tips section, before the usage example code snippets

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — moved to a "Usage Tips" section before the code examples.

@@ -0,0 +1,155 @@
import math
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following "one-model - one-file" philosophy, it is better put inside modular/modeling files

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — downsampling_granite4_vision.py deleted, all contents inlined into the modular.

Comment on lines 81 to 87
"openai-privacy-filter": "OpenAIPrivacyFilterConfig",
"lasr": "LasrCTCConfig",
"wav2vec2-with-lm": "Wav2Vec2Config",
"granite4-vision": "Granite4VisionConfig",
"hy-v3": "HYV3Config",
"slanet": "SLANetConfig",
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few bad rebases :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed the stale entries introduced by bad rebases.

Comment thread src/transformers/conversion_mapping.py Outdated
Comment on lines +463 to +466
WeightRenaming(
source_patterns=r"(vision_tower\.)vision_model\.",
target_patterns=r"\1",
),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not needed anymore, we added PrefixWeights recently and fixed all llava models

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed the granite4_vision entry from conversion_mapping.py.

@@ -0,0 +1,253 @@
# Copyright 2025 IBM. All rights reserved.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2026 :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — Copyright 2026 IBM and The HuggingFace Team.

Comment on lines +50 to +57
class Granite4VisionModelTester(VLMModelTester):
base_model_class = Granite4VisionModel
config_class = Granite4VisionConfig
conditional_generation_class = Granite4VisionForConditionalGeneration
text_config_class = GraniteConfig
vision_config_class = CLIPVisionConfig

def __init__(self, parent, **kwargs):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need only this tester, since processing is identical to llava-next. Thanks for using VLMTester 🤩

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed test_image_processing_granite4_vision.py entirely (processing is identical to LlavaNext, no re-definition needed).

Comment thread utils/check_repo.py Outdated
Comment on lines +551 to +557
"granite4_vision",
"falcon3",
"megatron_gpt2",
"code_llama",
"hy_v3",
"openai_privacy_filter",
"slanet",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also bad rebase

@artem-spector
Copy link
Copy Markdown
Author

@zucchini-nlp I'm ready for the second round :)

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @artem-spector ! Great work, glad to see a singlr modular file.

Left some comments on further cleaning-up, and a few questions. I just noticed that all image placeholders are filled with zeros, and not sure if that is intended. If we actually have no image features scattered, can we stop adding that many placeholders without quality degradation? Dummy and unnecessary token ids increase total length and looks like a waste of resources

model_id = "ibm-granite/granite-vision-4.1-4b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).eval()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: [...], device_map="auto") without eval, shoudl be already in eval mode when loaded

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed .eval().

("glm_image", {"pil": "GlmImageImageProcessorPil", "torchvision": "GlmImageImageProcessor"}),
("glpn", {"pil": "GLPNImageProcessorPil", "torchvision": "GLPNImageProcessor"}),
("got_ocr2", {"pil": "GotOcr2ImageProcessorPil", "torchvision": "GotOcr2ImageProcessor"}),
("granite4_vision", {"pil": "Granite4VisionImageProcessorPil", "torchvision": "Granite4VisionImageProcessor"}),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can delete this, already mapped to 'LlavaNext'

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed.

@@ -0,0 +1,734 @@
# Copyright 2025 IBM. All rights reserved.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ig you forgot to commit and push 😄



@dataclass
class Granite4VisionImageFeaturesOutput(ModelOutput):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think same as qwen3_vl.BaseModelOutputWithDeepstackFeatures

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — Granite4VisionImageFeaturesOutput now inherits BaseModelOutputWithPooling (same approach as qwen3_vl.BaseModelOutputWithDeepstackFeatures), with a deepstack_features field added.

Comment on lines +95 to +97
class Granite4VisionTextConfig(PreTrainedConfig):
model_type = "granite4_vision_text"
base_config_key = "text_config"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I am following, are we not supposed to inherit from GraniteConfig? Current class has no attributes

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — Granite4VisionTextConfig inherits GraniteConfig directly. It has no additional attributes because the text config is fully specified by GraniteConfig; the subclass exists only to set model_type = "granite4_vision_text" and base_config_key = "text_config".

past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
deepstack_features=outputs.deepstack_features,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only diff from llava-next is returning deepstack_features?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — the main differences from LlavaNextForConditionalGeneration are: (1) the model uses Granite4VisionModel which has deepstack injection, (2) logits are scaled by text_config.logits_scaling, and (3) deepstack_features is threaded through the output. Everything else is inherited.

logits_to_keep=logits_to_keep,
**kwargs,
)
model_inputs = self._init_hybrid_cache(**model_inputs)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should be able to delete this line to inti cache. Correct cache should be init by generationMixin._prepare_cache_for_generation automatically

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed; GenerationMixin._prepare_cache_for_generation handles cache initialization.

Comment on lines +140 to +146
@unittest.skip("Granite4VisionImageFeaturesOutput has no hidden_states field")
def test_get_image_features_hidden_states(self):
pass

@unittest.skip("Granite4VisionImageFeaturesOutput has no attentions field")
def test_get_image_features_attentions(self):
pass
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two should be fixed when we add **kwargs and return a BaseModelOutputWithDeepstackFeatures

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — Granite4VisionImageFeaturesOutput now inherits BaseModelOutputWithPooling so all five tests pass. skip_test_image_features_output_shape = True remains because last_hidden_state isn't meaningful for this output type, but hidden_states, pooler_output, and field presence checks all pass.

Comment on lines +148 to +154
@unittest.skip("Base model forward returns ModelOutputWithPast, not CausalLMOutput with loss")
def test_training(self):
pass

@unittest.skip("QFormer submodules not initialized by init_weights from meta device")
def test_can_init_all_missing_weights(self):
pass
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these also should be fixable, I don't think this is a valid reason to skip

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done for test_training — the skip was stale; Granite4VisionForConditionalGeneration computes a loss, and the framework skips Granite4VisionModel automatically via MODEL_MAPPING_NAMES.\n\ntest_can_init_all_missing_weights remains skipped: Blip2QFormerModel submodules aren't initialized from meta device by our _init_weightsBlip2QFormerPreTrainedModel._init_weights only handles Blip2ForConditionalGeneration instances, not the standalone Blip2QFormerModel. Happy to investigate further if you'd like.

Comment thread utils/check_config_attributes.py Outdated
Comment on lines +102 to +105
"Granite4VisionConfig": [
"multimodal_projector_bias",
"projector_hidden_act",
],
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if actually not used, lets just delete them from config class. When inheriting config, you can define

multimodal_projector_bias = AtributeErro()

and it will not copy this field

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — multimodal_projector_bias and projector_hidden_act are shadowed with AttributeError() at class level on Granite4VisionConfig, so the framework sees them as intentionally inaccessible. The SPECIAL_CASES_TO_ALLOW entry was removed.

@artem-spector
Copy link
Copy Markdown
Author

Hi @zucchini-nlp, sorry for delay - I was chasing a performance problem, which turn out to be not real.

Summary of changes:

  • Granite4VisionImageFeaturesOutput now inherits BaseModelOutputWithPooling instead of bare ModelOutput, which unblocks the common image-features tests
  • Added @capture_outputs to Granite4VisionTextModel.forward so output_hidden_states propagates correctly through generation
  • Removed the manual DynamicCache init from forward — GenerationMixin._prepare_cache_for_generation handles this
  • get_image_features now properly passes output_hidden_states through (via kwarg or config fallback)
  • qformer_config is now fully specified at init time rather than patched post-super().init()
  • _can_record_outputs and _deepstack_inject moved up to Granite4VisionPreTrainedModel
  • One-letter variable names renamed for clarity; unused inherited config attrs now raise AttributeError with a clear message

Open question — test_can_init_all_missing_weights:

This test verifies that every weight can be re-initialized from meta device via _init_weights. It fails because WindowQFormerDownsampler embeds a Blip2QFormerModel (a full PreTrainedModel from a different family). When from_pretrained(None, state_dict={}) walks submodules, our _init_weights has no isinstance branches for QFormer's internal layers (attention, embeddings, etc.), so they stay uninitialized.
Is there a recommended approach here, or is skipping acceptable for embedded third-party PreTrainedModel submodules?

@zucchini-nlp
Copy link
Copy Markdown
Member

When from_pretrained(None, state_dict={}) walks submodules, our _init_weights has no isinstance branches for QFormer's internal layers (attention, embeddings, etc.), so they stay uninitialized.

That is weird, usually our initializer walks down each sub-module and initializes its weights. Unless Qformer has special buffers that need to be init, it should be fine. I can check it on Monday and do another review pass

@artem-spector
Copy link
Copy Markdown
Author

When from_pretrained(None, state_dict={}) walks submodules, our _init_weights has no isinstance branches for QFormer's internal layers (attention, embeddings, etc.), so they stay uninitialized.

That is weird, usually our initializer walks down each sub-module and initializes its weights. Unless Qformer has special buffers that need to be init, it should be fine. I can check it on Monday and do another review pass

will re-check too

@artem-spector
Copy link
Copy Markdown
Author

When from_pretrained(None, state_dict={}) walks submodules, our _init_weights has no isinstance branches for QFormer's internal layers (attention, embeddings, etc.), so they stay uninitialized.

That is weird, usually our initializer walks down each sub-module and initializes its weights. Unless Qformer has special buffers that need to be init, it should be fine. I can check it on Monday and do another review pass

will re-check too

you right, test_can_init_all_missing_weights works

Copy link
Copy Markdown
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, last comments on style, flagged a few that weren't addressed/pushed from prev review

Approving so we will request core maintainer review to merge after fixing left-over comments

("glm_image", {"pil": "GlmImageImageProcessorPil", "torchvision": "GlmImageImageProcessor"}),
("glpn", {"pil": "GLPNImageProcessorPil", "torchvision": "GLPNImageProcessor"}),
("got_ocr2", {"pil": "GotOcr2ImageProcessorPil", "torchvision": "GotOcr2ImageProcessor"}),
("granite4_vision", {"pil": "LlavaNextImageProcessorPil", "torchvision": "LlavaNextImageProcessor"}),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this belong in image_processing_auto.py. Was it auto-generated or added manually?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed. Not sure why it was added, but image_processing_auto.py already has it. Duplicate deleted.

Comment on lines +146 to +154
# Peek at vision hidden_size before super() to build a fully-specified qformer_config,
# avoiding any runtime field patching after super().
if isinstance(self.vision_config, dict):
vision_hidden_size = self.vision_config.get("hidden_size", 1152)
elif self.vision_config is not None:
vision_hidden_size = self.vision_config.hidden_size
else:
vision_hidden_size = 1152

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vision/text configs cannot be None, no? In other words, we create a default config if not passes, just like with qformer below. Then we won't need a magical 1152 hardcoded here

If that is about the order, you can call super() at the beginning and it will unwrap first. Also we can completely override __post_init__ by calling PreTrainedConfig.__post_init__(**kwargs) instead of super

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — call super().__post_init__(**kwargs) first so vision_config is already a typed object when we build qformer_config.

Comment on lines +287 to +293
return (
features.view(batch, side, side, channels)
.view(batch, num_win, window_size, num_win, window_size, channels)
.transpose(2, 3)
.flatten(0, 2)
.flatten(1, 2)
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultra nit: for sake of readability I prefer to split into several lines. Same comment for unwindowed_raster

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — split into separate assignment lines in both methods.

Comment on lines +344 to +355
if isinstance(module, Granite4VisionTextRotaryEmbedding):
# Non-persistent buffers (inv_freq, original_inv_freq) are replaced with
# torch.empty_like() garbage by _move_missing_keys_from_meta_to_device.
# Recompute them here so _initialize_missing_keys restores correct values.
rope_type = module.config.rope_parameters.get("rope_type", "default")
if rope_type != "default":
rope_init_fn = ROPE_INIT_FUNCTIONS[rope_type]
else:
rope_init_fn = module.compute_default_rope_parameters
inv_freq, attention_scaling = rope_init_fn(module.config, module.inv_freq.device)
init.copy_(module.inv_freq, inv_freq)
init.copy_(module.original_inv_freq, inv_freq)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't push commit? 😄 Same for LN and embed modules, calling super is usually enough unless model requires special weight init

Comment on lines +368 to +369
base_model_prefix = "model"
_no_split_modules = ["Granite4VisionTextDecoderLayer"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment, I think you have one local commit unpushed

Comment thread src/transformers/models/granite4_vision/modular_granite4_vision.py Outdated
Comment thread src/transformers/models/granite4_vision/modular_granite4_vision.py Outdated
Comment thread src/transformers/models/granite4_vision/modular_granite4_vision.py
Comment thread src/transformers/models/granite4_vision/modular_granite4_vision.py Outdated
Comment thread src/transformers/models/granite4_vision/modular_granite4_vision.py Outdated
@zucchini-nlp
Copy link
Copy Markdown
Member

run-slow: granite4_vision

@artem-spector
Copy link
Copy Markdown
Author

@zucchini-nlp , thanks for the approval!
All suggestions have been implemented, except one for _init_weights: the Embedding/LayerNorm/RMSNorm branches are necessary, since the parent LlavaNextPreTrainedModel._init_weights doesn't handle these module types.

@zucchini-nlp
Copy link
Copy Markdown
Member

@artem-spector not the parent llava-next, it is handled in base PreTrainedModel class :)

Comment on lines +151 to +156
if isinstance(self.vision_config, dict):
vision_hidden_size = self.vision_config.get("hidden_size", 1152)
elif self.vision_config is not None:
vision_hidden_size = self.vision_config.hidden_size
else:
vision_hidden_size = 1152
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really what I meant. Vision config has to be a config obj by this moment, so you only call vision_hidden_size = self.vision_config.hidden_size

Ordering correctly should do, move viision config code block before and that is all

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — three clean blocks: dict→typed conversion before super(), super().__post_init__ deserializes vision_config, then default qformer_config built from self.vision_config.hidden_size. No fallbacks or hardcoded values.

Comment on lines +658 to +659
@merge_with_config_defaults
@can_return_tuple
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets delete all, so it is copied completely. Current state is missing auto docstring

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_image_features body is entirely custom (deepstack multi-layer extraction + spatial projectors, different return type) so the method needs to stay in modular — deleting it would copy the LlavaNext version and break the model. Added @auto_docstring and @can_return_tuple.

@zucchini-nlp
Copy link
Copy Markdown
Member

A couple comments only, I think we had a misunderstanding with prev round. And requested review from a core maintainer (@vasqu) , then we can merge

@zucchini-nlp zucchini-nlp requested a review from vasqu May 5, 2026 07:41
artemspector and others added 24 commits May 5, 2026 13:46
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ctor naming

- Granite4VisionTextConfig restored as proper GraniteConfig subclass
- Granite4VisionConfig.__post_init__: convert dict->config before super() so
  _attn_implementation.setter sees config objects; patch vision-size fields after super()
- Use CONFIG_MAPPING/AutoModel at module top-level (no lazy imports)
- Add _can_record_outputs to Granite4VisionTextModel for hidden_states/attentions
  capture via @capture_outputs decorator
- Add Granite4VisionTextAttention/TextDecoderLayer stubs in modular so converter
  generates the registry entries pointing to the correct layer classes
- WindowQFormerDownsampler renamed to Granite4VisionWindowQFormerDownsampler
- interpolate_downsample/spatial_offset_downsample take explicit size args (not config)
- Remove output_attentions/output_hidden_states from forward signatures (handled by
  @capture_outputs and **kwargs); use BaseModelOutputWithPast return type
- Remove prepare_inputs_for_generation (handled by parent)
- Remove _init_hybrid_cache (GraniteMoeHybrid leftover from 4.0)
- auto_mappings.py: use LlavaNextImageProcessor(Pil) instead of model-specific copies
- docs: remove .eval() from example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…onfig attrs

- _windowed_raster/unwindowed_raster: x -> features, x_win -> windowed_features
- __init__: q, w -> query_side_str, window_side_str
- Granite4VisionConfig: shadow LlavaNextConfig's multimodal_projector_bias and
  projector_hidden_act with AttributeError() so check_config_attributes passes
  without SPECIAL_CASES_TO_ALLOW entry
- Remove Granite4VisionConfig from check_config_attributes.py SPECIAL_CASES_TO_ALLOW

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…d patching

Peek at vision_config.hidden_size (or its dict equivalent) before super() and
include hidden_size, num_attention_heads, encoder_hidden_size directly in the
CONFIG_MAPPING["blip_2_qformer"]() constructor call. This avoids mutating the
config object after super().__post_init__() runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…ainedModel

Follows qwen3_vl pattern: _can_record_outputs and _deepstack_inject belong on
the shared PreTrainedModel base class, not on TextModel. TextAttention/
TextDecoderLayer stubs are defined before PreTrainedModel so the converter
generates locally-scoped classes for the _can_record_outputs registry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…cCache

- Granite4VisionImageFeaturesOutput now inherits BaseModelOutputWithPooling so
  the common test framework can introspect last_hidden_state/pooler_output/
  hidden_states/attentions fields (removes 5 test skips)
- Add @capture_outputs to Granite4VisionTextModel.forward so output_hidden_states
  is collected via hooks and propagated through to the causal LM output (fixes
  test_assisted_decoding_matches_greedy_search)
- get_image_features: populate hidden_states from vision tower when
  output_hidden_states=True (via kwarg or config); removes test skip
- Remove stale test_training skip (ForConditionalGeneration computes loss;
  base model is skipped automatically by MODEL_MAPPING_NAMES check)
- Delete DynamicCache init block from Granite4VisionTextModel.forward;
  GenerationMixin._prepare_cache_for_generation handles this
- Import BaseModelOutputWithPooling, capture_outputs; drop DynamicCache import

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…ader

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test actually passes — framework's recursive apply() handles QFormer
submodule weights correctly via the nn.Linear branch of _init_weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…RMSNorm branches

These were missing, causing test_can_init_all_missing_weights to fail in CI
when weights initialized from meta device didn't match __init__ values.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Granite4VisionTextRMSNorm is defined in the generated file, not modular,
so referencing it by name in modular_granite4_vision.py is an undefined name.
Use attribute-based detection (has weight + variance_epsilon, not Linear) instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove duplicate image processor entry from auto_mappings.py (already in image_processing_auto.py)
- Split _windowed_raster and _unwindowed_raster chains into multiple lines
- Move _deepstack_inject from Granite4VisionPreTrainedModel into Granite4VisionTextModel (only used there)
- Remove nn.Embedding/nn.LayerNorm branches from _init_weights (super handles them)
- Remove use_spatial_sampling conditional guard — spatial projectors always active in released weights
- Remove pixel_values.size(0) > 0 check (bad copy from llava-next)
- Add inline comment explaining the intentional masked_fill(..., 0.0) for deepstack
- Replace manual vision_feature_layer/strategy fallbacks with @merge_with_config_defaults

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…p hardcoded 1152

super().__post_init__() deserializes vision_config from dict to a typed object,
so qformer_config can read vision_config.hidden_size directly without a fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
Changes:
- __post_init__: call super() first so vision_config is deserialized before
  building qformer_config default; qformer_config dict converted before super()
  to avoid _attn_implementation setter hitting a raw dict
- import merge_with_config_defaults from ...utils.generic (not ...utils)
- restore _init_weights Embedding/LayerNorm branches — needed for
  test_can_init_all_missing_weights (LlavaNextPreTrainedModel doesn't handle them)

Regenerate modeling and configuration files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
The Granite4VisionTextModel.forward(self.language_model, ...) pattern was
added to bypass nn.Module.__call__ overhead, but compare_implementations
shows no measurable difference (1.05x total vs card, within noise).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
check_auto.py requires granite4_vision_text to appear in CONFIG_MAPPING_NAMES
and MODEL_TYPE_TO_MODULE_NAME.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
Spatial sampling is always enabled in all released weights
(use_spatial_sampling: true in config.json), so the conditional guard
was already removed from modeling code. Remove the now-unused attribute
from the config to fix check_config_attributes.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_size in tester

QFormer defaulted to intermediate_size=3072 even in tests, pushing the
tiny test model to 2.29M params (limit is 1M). Set intermediate_size=64
in get_config(). Also remove stale use_spatial_sampling from tester.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- C14: reorder __post_init__ — dict→typed qformer_config conversion stays before
  super() (needed for _attn_implementation propagation), then super().__post_init__
  deserializes vision_config, then default qformer_config built from
  self.vision_config.hidden_size with no dict fallback or hardcoded 1152
- C15: move base_model_prefix/_no_split_modules to Granite4VisionPreTrainedModel;
  remove from Granite4VisionTextModel
- C16: add @can_return_tuple + @auto_docstring to get_image_features override

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
…ructure

Three clean blocks: guard fields, dict→typed conversion before super(),
default build from self.vision_config.hidden_size after super(). No fallbacks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: artemspector <artems@il.ibm.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, granite4_vision

@artem-spector
Copy link
Copy Markdown
Author

@zucchini-nlp thanks for clarifying! All three addressed and pushed.

Copy link
Copy Markdown
Contributor

@vasqu vasqu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo, looks already pretty good 🤗 it's mostly nits and smaller things


<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">

not on you but since we started to only support torch either way, that badge doesn't make much sense anymore


## Usage Tips

- Set `padding_side="left"` during batched generation for more accurate results.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be the default in the configs itself? I think you can just add "padding_side": "left" to the tokenizer json and it should always adopt left padding then

model_id = "ibm-granite/granite-vision-4.1-4b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
model = AutoModelForImageTextToText.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

otherwise we will (always) run on cpu here

@@ -0,0 +1,28 @@
# Copyright 2025 IBM. All rights reserved.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2025 IBM. All rights reserved.
# Copyright 2026 IBM. All rights reserved.

# ── Output classes ──────────────────────────────────────────────────────────


@dataclass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@dataclass

I think we need to remove the decorator (so modular inherits everything) or add auto docstring itself so

@auto_docstring
@dataclass

(the order matters)

Same for the output classes below

Comment on lines +722 to +723
@merge_with_config_defaults
@can_return_tuple
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@merge_with_config_defaults
@can_return_tuple

Comment on lines +757 to +769
loss = None
logits = self.lm_head(hidden_states)
logits = logits / self.config.text_config.logits_scaling
if labels is not None:
loss = self.loss_function(
logits,
labels,
vocab_size=self.config.text_config.vocab_size,
**kwargs,
)

if isinstance(logits_to_keep, int) and logits_to_keep > 0:
logits = logits[:, -logits_to_keep:, :]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
loss = None
logits = self.lm_head(hidden_states)
logits = logits / self.config.text_config.logits_scaling
if labels is not None:
loss = self.loss_function(
logits,
labels,
vocab_size=self.config.text_config.vocab_size,
**kwargs,
)
if isinstance(logits_to_keep, int) and logits_to_keep > 0:
logits = logits[:, -logits_to_keep:, :]
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
logits = self.lm_head(hidden_states[:, slice_indices, :])
logits = logits / self.config.text_config.logits_scaling
loss = None
if labels is not None:
loss = self.loss_function(
logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
)

let's follow the original more closely, the current way doesnt really make use of logits to keep because we calculate everything either way

Comment on lines +125 to +135
@pytest.mark.xfail(reason="This architecture seems to not compute gradients for some layer.")
def test_training_gradient_checkpointing(self):
super().test_training_gradient_checkpointing()

@pytest.mark.xfail(reason="This architecture seems to not compute gradients for some layer.")
def test_training_gradient_checkpointing_use_reentrant_false(self):
super().test_training_gradient_checkpointing_use_reentrant_false()

@pytest.mark.xfail(reason="This architecture seems to not compute gradients for some layer.")
def test_training_gradient_checkpointing_use_reentrant_true(self):
super().test_training_gradient_checkpointing_use_reentrant_true()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_all_params_have_gradient=False instead - I assume the text backbone is a MoE or similar? That can easily cause this

def setUp(self):
self.processor = AutoProcessor.from_pretrained(self.model_id)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
self.image = Image.open(requests.get(url, stream=True).raw)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use url_to_local

  1. URLS_FOR_TESTING_DATA = [
    (check if it is already there re url)
  2. Use from ...test_processing_common import url_to_local_path on the url which resolves it to a cache path (if available else normal url)

@@ -0,0 +1,122 @@
# Copyright 2025 IBM. All rights reserved.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2025 IBM. All rights reserved.
# Copyright 2026 IBM. All rights reserved.

@vasqu vasqu added the New model label May 5, 2026
@zucchini-nlp
Copy link
Copy Markdown
Member

zucchini-nlp commented May 5, 2026

@artem-spector we are releasing 5.8.0 today so if you could address all comments until the evening, we are very glad to add granite-vision in the release

If you need help, or you are afk, just ping me and I can fix the rest. It is all just nits about styling and better modular usage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants