A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
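As a rough illustration of what "structurally aligning visual and textual embeddings" typically involves (this is a minimal sketch assuming a LLaVA-style design, not this repository's actual architecture), a small projector can map frozen vision-encoder features into the language model's token-embedding space so image tokens and text tokens share one sequence:

```python
# Hedged sketch: LLaVA-style visual-textual alignment (an assumption,
# not the linked repository's code). A small MLP projects vision-encoder
# patch features into the LLM's embedding space; the dimensions and the
# two-layer projector are illustrative choices.
import torch
import torch.nn as nn


class VisionToTextProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)


# Usage: project dummy vision features and concatenate with text embeddings.
projector = VisionToTextProjector()
vision_feats = torch.randn(2, 256, 1024)   # e.g. ViT patch features
text_embeds = torch.randn(2, 32, 4096)     # LLM token embeddings
visual_tokens = projector(vision_feats)
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # (2, 288, 4096)
```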
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
Code for ACL 2023 Oral Paper: ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
The official implementation for the ICCV 2023 paper "Grounded Image Text Matching with Mismatched Relation Reasoning".
Cross-aware Early Fusion with Stage-divided Vision and Language Transformer Encoders for Referring Image Segmentation (Published in IEEE TMM 2023)
Code for ECIR 2023 paper "Dialogue-to-Video Retrieval"
Multimodal Agentic GenAI Workflow – Seamlessly blends retrieval and generation for intelligent storytelling
Streamlit App Combining Vision, Language, and Audio AI Models
Explore the rich flavors of Indian desserts with TunedLlavaDelights. Using LLaVA fine-tuning, our project unveils detailed nutritional profiles, taste notes, and optimal consumption times for beloved sweets. Dive into a fusion of AI innovation and culinary tradition.
Socratic models for multimodal reasoning & image captioning
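The Socratic Models pattern composes pretrained models through language: a zero-shot vision-language model turns the image into text, and a text-only LLM reasons over that description. The sketch below is an assumption-laden illustration rather than the linked repository's implementation; the model name, candidate captions, and image path are placeholders.

```python
# Hedged sketch of the Socratic Models pattern: CLIP ranks candidate
# image descriptions, and the best one is handed to a text-only LLM as
# a language prompt. Illustration only; not the repository's code.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
candidates = [
    "a photo of a dog playing in a park",
    "a photo of a plate of food",
    "a photo of a city street at night",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # (1, num_candidates)
best_caption = candidates[logits_per_image.argmax().item()]

# The visual description becomes a prompt for a text-only LLM; the LLM
# call itself is omitted to avoid assuming a specific API.
prompt = f"Image description: {best_caption}\nQuestion: What is happening here?\nAnswer:"
print(prompt)
```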