The official implementation for:
SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval
Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging.
In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user's intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings.
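To make the SQAF idea concrete, here is a minimal, hypothetical sketch (not the repo's exact code) of fusing a CLIP-composed query with CLIP text embeddings of MLLM-generated target captions; the weighting scheme and `alpha` are illustrative assumptions:

```python
# Hypothetical SQAF sketch: augment the composed CLIP query with the mean
# embedding of MLLM-generated target captions (alpha is an assumed weight).
import torch
import torch.nn.functional as F

def sqaf_query(clip_query_emb: torch.Tensor,   # (d,) reference-image + modification-text query
               caption_embs: torch.Tensor,     # (n, d) CLIP text embeddings of the captions
               alpha: float = 0.8) -> torch.Tensor:
    cap = F.normalize(caption_embs, dim=-1).mean(dim=0)
    fused = alpha * F.normalize(clip_query_emb, dim=-1) + (1.0 - alpha) * cap
    return F.normalize(fused, dim=-1)

# Global retrieval: rank the gallery by cosine similarity to the fused query.
# scores = gallery_embs @ sqaf_query(query_emb, caption_embs)  # gallery_embs: (N, d), L2-normalized
```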
Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
First, clone the repository to a desired location.
Prerequisites
The following commands will create a local Anaconda environment with the necessary packages installed.
conda create -n square_cir -y python=3.8
conda activate square_cir
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
export PYTHONPATH=$(pwd)
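Optionally, you can sanity-check the environment from the repository root; this assumes the requirements install torch and that OpenAI's CLIP package exposes the `clip` module:

```python
# Quick environment check (run from the repository root).
import torch
import clip  # installed from github.com/openai/CLIP

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())
```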
Download Pre-trained Weights
We use pre-trained BLIP models with ViT-B and ViT-L backbones. To download the BLIP checkpoints, please refer to the following links:
- BLIP w/ ViT-B (129M)
- BLIP w/ ViT-B fine-tuned on Image-Text Retrieval (COCO)
- BLIP w/ ViT-B fine-tuned on Image-Text Retrieval (Flickr30k)
- BLIP w/ ViT-L (129M)
- BLIP w/ ViT-L fine-tuned on Image-Text Retrieval (COCO)
- BLIP w/ ViT-L fine-tuned on Image-Text Retrieval (Flickr30k)
For the CLIP models, the weights are downloaded automatically from the Hugging Face model hub, so you don't need to download them manually. Here are the model identifiers on the Hub (a loading sketch follows the list):
- CLIP-ViT-B-32: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- CLIP-ViT-L-14: laion/CLIP-ViT-L-14-laion2B-s32B-b82K
- CLIP-ViT-H-14: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- CLIP-ViT-G-14: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
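The exact loader depends on the codebase; as one possibility, these LAION checkpoints can be pulled from the Hub with `open_clip` (a hedged sketch, assuming the `open_clip_torch` package is installed):

```python
# Sketch: load a LAION CLIP checkpoint from the Hugging Face Hub via open_clip.
import open_clip

model_id = "hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()  # weights are cached locally on first use
```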
The downloaded BLIP checkpoints should be placed in the models folder, as shown below; a quick loading check follows the listing.
models/
model_base.pth
model_base_retrieval_coco.pth
model_base_retrieval_flickr.pth
model_large.pth
model_large_retrieval_coco.pth
model_large_retrieval_flickr.pth
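To verify a downloaded checkpoint, you can try loading it with torch; the `'model'` key is how the official BLIP releases typically package the state dict, but treat that as an assumption:

```python
# Sanity-check a downloaded BLIP checkpoint.
import torch

ckpt = torch.load("models/model_base.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # official BLIP checkpoints usually wrap weights in 'model'
print(f"{len(state)} tensors loaded")
```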
FashionIQ Dataset
The FashionIQ dataset can be downloaded from the following link:
The dataset should be placed in the fashionIQ_dataset folder.
fashionIQ_dataset/
labeled_images_cir_cleaned.json
captions/
cap.dress.test.json
cap.dress.train.json
cap.dress.val.json
...
image_splits/
split.dress.test.json
split.dress.train.json
split.dress.val.json
...
images/
245600258X.png
978980539X.png
...
CIRR Dataset
The CIRR dataset can be downloaded from the following link:
The dataset should be placed in the cirr_dataset folder.
cirr_dataset/
train/
0/
train-10108-0-img0.png
train-10108-0-img1.png
train-10108-1-img0.png
...
1/
train-10056-0-img0.png
train-10056-0-img1.png
train-10056-1-img0.png
...
...
dev/
dev-0-0-img0.png
dev-0-0-img1.png
dev-0-1-img0.png
...
test1/
test1-0-0-img0.png
test1-0-0-img1.png
test1-0-1-img0.png
...
cirr/
captions/
cap.rc2.test1.json
cap.rc2.train.json
cap.rc2.val.json
image_splits/
split.rc2.test1.json
split.rc2.train.json
split.rc2.val.json
CIRCO Dataset
The CIRCO dataset can be downloaded from the following link:
The dataset should be placed in the circo_dataset folder.
circo_dataset/
COCO2017_unlabeled/
annotations/
image_info_unlabeled2017.json
unlabeled2017/
000000000008.jpg
000000000013.jpg
000000000022.jpg
...
annotations/
test.json
val.json
GeneCIS Dataset
The GeneCIS dataset can be downloaded from the following link:
The Visual Genome download link from the official GeneCIS instructions is no longer available. Please use the following link to download the Visual Genome dataset instead:
- Visual Genome dataset 1.2
- The images come in two zips: `images.zip` and `images2.zip`. Extract both of them into the `VG_100K_all` folder (see the extraction sketch below).
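A minimal extraction sketch (assuming the zips sit in the current directory and may contain an internal folder level that needs flattening):

```python
# Extract images.zip and images2.zip into genecis_dataset/VG_100K_all/,
# flattening any internal directory structure inside the archives.
import pathlib
import zipfile

dest = pathlib.Path("genecis_dataset/VG_100K_all")
dest.mkdir(parents=True, exist_ok=True)
for archive in ("images.zip", "images2.zip"):
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith(".jpg"):
                (dest / pathlib.Path(name).name).write_bytes(zf.read(name))
```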
The dataset should be placed in the genecis_dataset folder.
genecis_dataset/
genecis/
change_attribute.json
change_object.json
focus_attribute.json
focus_object.json
val2017/
000000000139.jpg
000000000285.jpg
000000000632.jpg
...
VG_100K_all/
1.jpg
2.jpg
3.jpg
...
Note
- Please modify the `requirements.txt` file if you use a different torch version built for a different CUDA version.
- Make sure to set `PYTHONPATH` to the current directory; otherwise, the code will not be able to find the necessary modules.
We first need to run the MLLM to generate image captions for each dataset.
Please read the MLLM Label Guide for how to use the MLLM to generate the image captions.
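As a rough illustration only (the MLLM Label Guide is authoritative), caption generation might look like the following against an OpenAI-compatible endpoint; the model name and prompt are placeholders, not necessarily what this repo uses:

```python
# Hypothetical captioning call against an OpenAI-compatible MLLM endpoint.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY (and optionally a custom base_url) is set

def caption_image(path: str, model: str = "gpt-4o-mini") -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one concise sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```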
After running the MLLM to generate the image captions, please read the following documents for each dataset about how to run the SQAF part to get the initial ranked list:
In this stage, we take the initial ranked list from the SQAF stage as input and rerank the top-K candidates with the MLLM, as sketched below.
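To illustrate the single-pass idea, here is a hedged sketch of assembling the top-K candidates into one index-marked grid image for the MLLM; the grid size and marking style are assumptions:

```python
# Hypothetical sketch: tile top-K candidate images into one grid with visual
# index marks so the MLLM can compare all candidates in a single pass.
from PIL import Image, ImageDraw

def build_candidate_grid(paths, cols=5, cell=224):
    rows = (len(paths) + cols - 1) // cols
    grid = Image.new("RGB", (cols * cell, rows * cell), "white")
    draw = ImageDraw.Draw(grid)
    for i, path in enumerate(paths):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        x, y = (i % cols) * cell, (i // cols) * cell
        grid.paste(img, (x, y))
        draw.text((x + 4, y + 4), str(i + 1), fill="red")  # candidate index mark
    return grid
```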
Please read the following documents for each dataset about how to run the EBR part:
We use the same MIT License as Bi-BlipCIR, CLIP4Cir, BLIP, and WeiMoCIR.
Special thanks to Bi-BlipCIR; we use its code to evaluate the performance of our proposed method. If you find this code useful for your research, please consider citing the original paper:
@misc{wu2025square,
title={SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval},
author={Ren-Di Wu and Yu-Yen Lin and Huei-Fang Yang},
year={2025},
eprint={2509.26330},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.26330},
}