Uniformize LlavaNextVideoProcessor kwargs #35613

Merged

yonigozlan
Member

What does this PR do?

Adds uniformized processors following #31911 for LlavaNextVideoProcessor.

Fixes #35602

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@zucchini-nlp zucchini-nlp left a comment

Sorry for the late review, I forgot about this PR.

Cool that we're standardizing video LLMs. Overall LGTM, we just need a few tests with video processors to make sure nothing breaks.

images: ImageInput = None,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
audio=None,
Member

Tiny concern about squeezing audio in before videos, but I don't think any user passes videos as a positional arg, so we are probably okay.

I don't want to add more complexity by trying to validate the order of audio and videos now, so let's leave it as is and just take note of this.
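
To make the concern concrete, here is a minimal sketch of how inserting a parameter before `videos` changes what a positional call binds to. The two functions are hypothetical stand-ins, not the actual processor signatures; in particular, the "before" ordering is an assumption, since the thread only shows the new `images, text, audio` ordering.

```python
# Hypothetical stand-ins for a processor __call__ before and after the change.
# The "before" signature is assumed for illustration only.
def call_before(images=None, text=None, videos=None, **kwargs):
    return {"images": images, "text": text, "audio": None, "videos": videos}


def call_after(images=None, text=None, audio=None, videos=None, **kwargs):
    return {"images": images, "text": text, "audio": audio, "videos": videos}


clip = ["frame0", "frame1"]  # placeholder for real video frames

# Positional call: the third argument used to bind to `videos`, now it binds to `audio`.
old_positional = call_before(None, "describe the clip", clip)
new_positional = call_after(None, "describe the clip", clip)

# Keyword call: unaffected by the reordering.
new_keyword = call_after(text="describe the clip", videos=clip)

print(old_positional["videos"])
print(new_positional["videos"], new_positional["audio"])
print(new_keyword["videos"])
```

Callers that pass `videos=` by keyword keep working either way, which is why the reordering is considered low risk here.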

@@ -39,7 +39,7 @@ class LlavaOnevisionProcessorKwargs(ProcessingKwargs, total=False):
 "padding": False,
 },
 "image_kwargs": {},
-"video_kwargs": {},
+"videos_kwargs": {},
Member

nice!

Comment on lines +348 to +350
model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **processing_kwargs).to(
    dtype=self.torch_dtype
)
Member

am I right that we don't need legacy=False anymore?

Member Author

I ended up keeping it because I had put v5.0.0 as the deprecation version

def test_processor_to_json_string(self):
    processor = self.get_processor()
    obj = json.loads(processor.to_json_string())
    print(processor)
Member

print 😄

Member Author

Oops thanks for catching that :)



@require_vision
class LlavaNextVideoProcessorTest(ProcessorTesterMixin, unittest.TestCase):
Member

afaik ProcessorTesterMixin doesn't test videos_kwargs yet. I think we need to add video tests to make sure that llava-next-video processor works as expected

Member Author

Indeed! I added tests for videos_kwargs, very similar to those for images_kwargs :)

@yonigozlan force-pushed the uniformize-llava-next-video-processor branch from a75909c to fd69d5c on January 14, 2025 19:54
@yonigozlan
Member Author

Thanks for the feedback @zucchini-nlp!
I also just noticed an issue with the LlavaOneVision processor tests: test_chat_template_dict fails with this error:

src/transformers/models/llava_onevision/processing_llava_onevision.py:167: in __call__
    one_video = to_numpy_array(video_inputs.get("pixel_values_videos")[0])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def to_numpy_array(img) -> np.ndarray:
        if not is_valid_image(img):
>           raise ValueError(f"Invalid image type: {type(img)}")
E           ValueError: Invalid image type: <class 'list'>

src/transformers/image_utils.py:231: ValueError

The same failure occurs on main, so it seems unrelated to this PR.

@zucchini-nlp
Member

Yep, will be fixed by #35660

@qubvel qubvel removed their request for review January 20, 2025 18:23
Collaborator

@ArthurZucker ArthurZucker left a comment

cool thanks!

@yonigozlan
Member Author

I had to make some modifications to the llava_next_video processor tests following PR #35953.
@zucchini-nlp Could you confirm these modifications are fine? I had to add return_tensors=None in some of the chat_template tests so that the output types are consistent when comparing.

Member

@zucchini-nlp zucchini-nlp left a comment

Yep, LGTM, thanks!

Contributor

@molbap molbap left a comment

LGTM as well. For the common processing tests for videos, there are a couple of models that were recently merged or will be merged soon; it would be cool to check that they work!

Comment on lines 40 to 41
"image_kwargs": {},
"videos_kwargs": {},
Contributor

Not sure you have to specify an empty dictionary here!
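
As a rough illustration of why the empty entries may be redundant, here is a simplified sketch of merging per-modality defaults with overrides. It is not the transformers implementation: `merge_defaults` and `MODALITY_GROUPS` are hypothetical names, and the sketch assumes the merge step falls back to an empty dict when a group is missing from the defaults.

```python
from typing import Any, Dict

# Assumed group names for this sketch; the thread mentions images_kwargs and videos_kwargs.
MODALITY_GROUPS = ("text_kwargs", "images_kwargs", "videos_kwargs")

KwargGroups = Dict[str, Dict[str, Any]]


def merge_defaults(defaults: KwargGroups, overrides: KwargGroups) -> KwargGroups:
    """Merge per-group defaults with overrides; overrides win on conflicts."""
    merged: KwargGroups = {}
    for group in MODALITY_GROUPS:
        # A group missing from `defaults` is treated exactly like an empty one.
        merged[group] = {**defaults.get(group, {}), **overrides.get(group, {})}
    return merged


with_empty = {"text_kwargs": {"padding": False}, "images_kwargs": {}, "videos_kwargs": {}}
without_empty = {"text_kwargs": {"padding": False}}
overrides = {"videos_kwargs": {"do_rescale": True}}

result_with = merge_defaults(with_empty, overrides)
result_without = merge_defaults(without_empty, overrides)
print(result_with == result_without)
```

Under that assumption, both default dictionaries produce the same merged result, so listing the empty groups is a matter of readability rather than behavior.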

@yonigozlan yonigozlan merged commit 9b479a2 into huggingface:main Feb 18, 2025
25 checks passed
5 participants