
Conversation


UlrickBL commented Oct 1, 2025

Description

This pull request introduces support for Vision-Language Models (VLMs) in the environments and the GRPO trainer. It works by tracking pixel values and image grids as the base model inputs and encoding images as Base64 to comply with the vLLM/OpenAI chat format. The implementation works with both standard text tokenizers and multimodal processors/mixins. It also adds image and answer logging to the WandB table to simplify data analysis during training.
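For context, the image handling boils down to encoding each image as a base64 data URL inside the OpenAI-style message content. A minimal sketch of that conversion (the helper name and message layout here are illustrative, not the exact code in this PR):

```python
import base64
import io

from PIL import Image


def image_to_data_url(image: Image.Image, fmt: str = "PNG") -> str:
    """Encode a PIL image as a base64 data URL accepted by OpenAI-style chat APIs."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{fmt.lower()};base64,{encoded}"


# Illustrative multimodal chat message mixing text and an encoded image.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Transcribe the text in this image."},
        {"type": "image_url", "image_url": {"url": image_to_data_url(Image.open("sample.png"))}},
    ],
}
```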

The motivation for adding VLM support is strategic: I believe Vision-Language environments are critical for advancing AGI and Reinforcement Learning (RL) research. This feature was necessary to begin testing several promising, high-value environments.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass
  • New tests have been added to cover the changes
  • Tests have been run locally with uv run pytest

It was tested end to end with 3 Prime-RL environments:

OCR VL with Qwen 2.5 VL 3B and 7B: https://app.primeintellect.ai/dashboard/environments/ulrick-bl/ocr-vl (single-turn, image)

Rebus VL Thinking with Qwen 2.5 VL 7B: https://app.primeintellect.ai/dashboard/environments/ulrick-bl/rebus-vl-thinking (single-turn, image)

Semantix with Qwen 2.5 0.5B: https://app.primeintellect.ai/dashboard/environments/ulrick-bl/semantic (multi-turn, text)

Test Coverage

  • Current coverage: 33%
  • Coverage after changes: 29%

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

…ment and grpo trainer, mainly the get lgps function
…utting TODO where I think I need to pass vision data
… need to check how it goes with text only and need to test it with Qwen 2.5 VL
…o improve prompt so some pass + need to fix for text only
…e on the reward calculation of the env for Japanese

UlrickBL commented Oct 3, 2025

Sounds good, I'll work on that!

… of outputs and inputs to model utils with lazy import
…fferent data in same batch with pixel values

UlrickBL commented Oct 8, 2025

Hello @willccbb,

For point 2, I managed to clean up the problematic dependencies, handle lazy imports, and reorganize the code into utils/image_utils.py and utils/processing_utils.py. Is the current pattern OK with you?
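For reference, the lazy-import pattern looks roughly like the sketch below; the function body and name are illustrative rather than the exact contents of utils/image_utils.py:

```python
# utils/image_utils.py (sketch): keep heavy imports out of module import time,
# so text-only runs never pay for, or require, the vision dependencies.

def pil_to_base64(image) -> str:
    """Encode an image, importing PIL only when an image is actually processed."""
    import base64
    import io

    from PIL import Image  # heavy/optional dependency, imported on first use

    if not isinstance(image, Image.Image):
        image = Image.open(image)
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```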

For point 1, I tried to find a task that was challenging enough to demonstrate the relevance of the training, but not too expensive to run.

I used an OCR environment I set up on Prime Hub (ocr-vl): https://app.primeintellect.ai/dashboard/environments/ulrick-bl/ocr-vl

I trained Qwen 2.5 VL 3B on the "hi" (Hindi) scope, since the model doesn’t perform very well on this task. The reward is mainly based on format and CER.
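For clarity, CER here means character error rate, i.e. edit distance divided by the reference length. A minimal sketch of a CER-style reward (not the environment's exact reward function):

```python
def character_error_rate(prediction: str, reference: str) -> float:
    """Levenshtein distance between the strings, normalized by the reference length."""
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(prediction, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(reference), 1)


def ocr_reward(prediction: str, reference: str) -> float:
    """Higher is better: 1.0 for a perfect transcription, decreasing with CER."""
    return max(0.0, 1.0 - character_error_rate(prediction, reference))
```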

Note that there are some issues in the dataset I use as the base for the environment, such as examples where the captured page is obscured by a popup; my small training setup was very sensitive to these:
(example image of a popup-obscured page omitted)

I trained with the following setup:
Qwen 2.5 VL 3B with LoRA rank 16 on 2x A100 40GB

```python
args = vf.grpo_defaults(run_name="ocr-vl")
args.per_device_train_batch_size = 8
args.num_generations = 16
args.gradient_accumulation_steps = 2
args.max_steps = 1000
args.eval_strategy = "steps"
args.eval_steps = 2
args.max_tokens = 1024
args.vllm_server_port = 8000
args.fp16 = True
args.temperature = 0.4
args.learning_rate = 1e-5
args.lr_scheduler_type = "cosine"
args.warmup_steps = 10
```
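For reference, a rough sketch of how that configuration is wired into a run; `get_model_and_tokenizer`, `load_environment`, and the `GRPOTrainer` keyword names below are assumptions about the verifiers API rather than lines from this PR:

```python
import verifiers as vf
from peft import LoraConfig

# Hypothetical wiring of the config above into a training run; exact helper
# names may differ between verifiers versions.
model, processor = vf.get_model_and_tokenizer("Qwen/Qwen2.5-VL-3B-Instruct")
env = vf.load_environment("ocr-vl")

trainer = vf.GRPOTrainer(
    model=model,
    processing_class=processor,
    env=env,
    args=args,  # the vf.grpo_defaults(...) object configured above
    peft_config=LoraConfig(r=16, lora_alpha=32),  # LoRA rank 16 as in the run described
)
trainer.train()
```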
(training reward curve image omitted)

I would say the training is stable and started off very well. The first slowdown in reward progression was due to a series of poor-quality images in the data, like the ones I showed earlier. Nevertheless, we can observe the model improving and maintaining stable training performance on the task, which highlights the relevance of the implementation.

If needed, I can spend some time cleaning the dataset and retraining it.

Let’s keep in touch if there’s anything else to adjust, test, or adapt.


anaszil commented Oct 29, 2025

+1 @willccbb, any idea when this PR might be merged or available?
