The official implementation for:
SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval
Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging.
In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user's intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings.
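To make the SQAF idea concrete, here is a minimal, hypothetical sketch (not the repo's exact code) of fusing a CLIP-composed query with CLIP text embeddings of MLLM-generated target captions; the weighting scheme and `alpha` are illustrative assumptions:

```python
# Hypothetical SQAF sketch: augment the composed CLIP query with the mean
# embedding of MLLM-generated target captions (alpha is an assumed weight).
import torch
import torch.nn.functional as F

def sqaf_query(clip_query_emb: torch.Tensor,   # (d,) reference-image + modification-text query
               caption_embs: torch.Tensor,     # (n, d) CLIP text embeddings of the captions
               alpha: float = 0.8) -> torch.Tensor:
    cap = F.normalize(caption_embs, dim=-1).mean(dim=0)
    fused = alpha * F.normalize(clip_query_emb, dim=-1) + (1.0 - alpha) * cap
    return F.normalize(fused, dim=-1)

# Global retrieval: rank the gallery by cosine similarity to the fused query.
# scores = gallery_embs @ sqaf_query(query_emb, caption_embs)  # gallery_embs: (N, d), L2-normalized
```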
Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
First, clone the repository to a desired location.
Prerequisites
The following commands will create a local Anaconda environment with the necessary packages installed.
conda create -n square_cir -y python=3.8
conda activate square_cir
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
export PYTHONPATH=$(pwd)
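Optionally, you can sanity-check the environment from the repository root; this assumes the requirements install torch and that OpenAI's CLIP package exposes the `clip` module:

```python
# Quick environment check (run from the repository root).
import torch
import clip  # installed from github.com/openai/CLIP

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())
```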
Download Pre-trained Weights
We use pre-trained BLIP models with ViT-B and ViT-L backbones. To download the BLIP checkpoints, please refer to the following links:
- BLIP w/ ViT-B (129M)
- BLIP w/ ViT-B fine-tuned on Image-Text Retrieval (COCO)
- BLIP w/ ViT-B fine-tuned on Image-Text Retrieval (Flickr30k)
- BLIP w/ ViT-L (129M)
- BLIP w/ ViT-L fine-tuned on Image-Text Retrieval (COCO)
- BLIP w/ ViT-L fine-tuned on Image-Text Retrieval (Flickr30k)
For the CLIP models, the weights are downloaded automatically from the Hugging Face model hub, so you don't need to download them manually. Here are the model identifiers on the Hub (a loading sketch follows the list):
- CLIP-ViT-B-32: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- CLIP-ViT-L-14: laion/CLIP-ViT-L-14-laion2B-s32B-b82K
- CLIP-ViT-H-14: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- CLIP-ViT-G-14: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
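The exact loader depends on the codebase; as one possibility, these LAION checkpoints can be pulled from the Hub with `open_clip` (a hedged sketch, assuming the `open_clip_torch` package is installed):

```python
# Sketch: load a LAION CLIP checkpoint from the Hugging Face Hub via open_clip.
import open_clip

model_id = "hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()  # weights are cached locally on first use
```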
The downloaded BLIP checkpoints should be placed in the models folder, as shown below; a quick loading check follows the listing.
models/
model_base.pth
model_base_retrieval_coco.pth
model_base_retrieval_flickr.pth
model_large.pth
model_large_retrieval_coco.pth
model_large_retrieval_flickr.pth
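To verify a downloaded checkpoint, you can try loading it with torch; the `'model'` key is how the official BLIP releases typically package the state dict, but treat that as an assumption:

```python
# Sanity-check a downloaded BLIP checkpoint.
import torch

ckpt = torch.load("models/model_base.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # official BLIP checkpoints usually wrap weights in 'model'
print(f"{len(state)} tensors loaded")
```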
FashionIQ Dataset
The FashionIQ dataset can be downloaded from the following link:
The dataset should be placed in the fashionIQ_dataset folder.
fashionIQ_dataset/
labeled_images_cir_cleaned.json
captions/
cap.dress.test.json
cap.dress.train.json
cap.dress.val.json
...
image_splits/
split.dress.test.json
split.dress.train.json
split.dress.val.json
...
images/
245600258X.png
978980539X.png
...
CIRR Dataset
The CIRR dataset can be downloaded from the following link:
The dataset should be placed in the cirr_dataset folder.
cirr_dataset/
train/
0/
train-10108-0-img0.png
train-10108-0-img1.png
train-10108-1-img0.png
...
1/
train-10056-0-img0.png
train-10056-0-img1.png
train-10056-1-img0.png
...
...
dev/
dev-0-0-img0.png
dev-0-0-img1.png
dev-0-1-img0.png
...
test1/
test1-0-0-img0.png
test1-0-0-img1.png
test1-0-1-img0.png
...
cirr/
captions/
cap.rc2.test1.json
cap.rc2.train.json
cap.rc2.val.json
image_splits/
split.rc2.test1.json
split.rc2.train.json
split.rc2.val.json
CIRCO Dataset
The CIRCO dataset can be downloaded from the following link:
The dataset should be placed in the circo_dataset folder.
circo_dataset/
COCO2017_unlabeled/
annotations/
image_info_unlabeled2017.json
unlabeled2017/
000000000008.jpg
000000000013.jpg
000000000022.jpg
...
annotations/
test.json
val.json
GeneCIS Dataset
The GeneCIS dataset can be downloaded from the following link:
The Visual Genome download link from the official GeneCIS instructions is no longer available. Please use the following link to download the Visual Genome dataset instead:
- Visual Genome dataset 1.2
- The images come in two zips: `images.zip` and `images2.zip`. Extract both of them into the `VG_100K_all` folder (see the extraction sketch below).
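A minimal extraction sketch (assuming the zips sit in the current directory and may contain an internal folder level that needs flattening):

```python
# Extract images.zip and images2.zip into genecis_dataset/VG_100K_all/,
# flattening any internal directory structure inside the archives.
import pathlib
import zipfile

dest = pathlib.Path("genecis_dataset/VG_100K_all")
dest.mkdir(parents=True, exist_ok=True)
for archive in ("images.zip", "images2.zip"):
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith(".jpg"):
                (dest / pathlib.Path(name).name).write_bytes(zf.read(name))
```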
The dataset should be placed in the genecis_dataset folder.
genecis_dataset/
genecis/
change_attribute.json
change_object.json
focus_attribute.json
focus_object.json
val2017/
000000000139.jpg
000000000285.jpg
000000000632.jpg
...
VG_100K_all/
1.jpg
2.jpg
3.jpg
...
Note
- Please modify the `requirements.txt` file if you use a different torch version built for a different CUDA version.
- Make sure to set `PYTHONPATH` to the current directory; otherwise, the code will not be able to find the necessary modules.
We first need to run the MLLM to generate image captions for each dataset.
Please read the MLLM Label Guide for how to use the MLLM to generate the image captions.
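As a rough illustration only (the MLLM Label Guide is authoritative), caption generation might look like the following against an OpenAI-compatible endpoint; the model name and prompt are placeholders, not necessarily what this repo uses:

```python
# Hypothetical captioning call against an OpenAI-compatible MLLM endpoint.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY (and optionally a custom base_url) is set

def caption_image(path: str, model: str = "gpt-4o-mini") -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one concise sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```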
After running the MLLM to generate the image captions, please read the following documents for each dataset about how to run the SQAF part to get the initial ranked list:
In this stage, we take the initial ranked list from the SQAF stage as input and rerank the top-K candidates with the MLLM, as sketched below.
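To illustrate the single-pass idea, here is a hedged sketch of assembling the top-K candidates into one index-marked grid image for the MLLM; the grid size and marking style are assumptions:

```python
# Hypothetical sketch: tile top-K candidate images into one grid with visual
# index marks so the MLLM can compare all candidates in a single pass.
from PIL import Image, ImageDraw

def build_candidate_grid(paths, cols=5, cell=224):
    rows = (len(paths) + cols - 1) // cols
    grid = Image.new("RGB", (cols * cell, rows * cell), "white")
    draw = ImageDraw.Draw(grid)
    for i, path in enumerate(paths):
        img = Image.open(path).convert("RGB").resize((cell, cell))
        x, y = (i % cols) * cell, (i // cols) * cell
        grid.paste(img, (x, y))
        draw.text((x + 4, y + 4), str(i + 1), fill="red")  # candidate index mark
    return grid
```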
Please read the following documents for each dataset about how to run the EBR part:
We use the same MIT License as Bi-BlipCIR, CLIP4Cir, BLIP, and WeiMoCIR.
Special thanks to Bi-BlipCIR; we use its code to evaluate the performance of our proposed method. If you find this code useful for your research, please consider citing the original paper:
@misc{wu2025square,
title={SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval},
author={Ren-Di Wu and Yu-Yen Lin and Huei-Fang Yang},
year={2025},
eprint={2509.26330},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.26330},
}