Authors' code for the paper "MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer", ACM MM 2022.
- Python == 3.9
- PyTorch == 1.12.1
- stanfordcorenlp == 3.9.1.1 with stanford-corenlp-4.2.2
- transformers == 4.12.5
- pycocoevalcap (https://github.com/sks3i/pycocoevalcap)
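The requirements above can be set up, for example, as sketched below. This is a minimal sketch, not the authors' exact procedure: the environment name is made up, and the plain pip install of torch may need to be adapted to your CUDA version.

```bash
# Hypothetical environment setup; versions are taken from the requirements list above.
conda create -n mmt python=3.9
conda activate mmt
pip install torch==1.12.1 transformers==4.12.5 stanfordcorenlp==3.9.1.1
# pycocoevalcap: see https://github.com/sks3i/pycocoevalcap for installation.
# The Stanford CoreNLP 4.2.2 Java package is downloaded separately
# (https://stanfordnlp.github.io/CoreNLP/) and its path is passed to stanfordcorenlp.
```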
VIST-E: download SIS-with-labels.tar.gz (https://visionandlanguage.net/VIST/dataset.html) and the image features (https://vist-arel.s3.amazonaws.com/resnet_features.zip), and put them in `data/VIST-E`.
LSMDC-E: download the LSMDC 2021 version (task1_2021.zip, resnet152_200.zip) (https://sites.google.com/site/describingmovies/home) and put them in `data/LSMDC-E`. NOTE: Due to the LSMDC agreement, we cannot share the data with any third party.
We utilize GloVe embeddings; please download `glove.6B.300d.txt` and put it in `data/`.
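Of these downloads, the VIST ResNet features and the GloVe vectors have direct links; a possible fetch sketch follows. The GloVe URL is an assumption (the standard Stanford NLP mirror), since the README only names the file.

```bash
# Hypothetical download commands; the GloVe URL is not from this README.
mkdir -p data/VIST-E data/LSMDC-E
wget https://vist-arel.s3.amazonaws.com/resnet_features.zip -P data/VIST-E/
wget https://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip glove.6B.300d.txt -d data/   # extract only the 300d vectors
```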
VIST-E (see the command sketch after this list):
- Unzip SIS-with-labels.tar.gz to `data/VIST-E`.
- Unzip the conv features in resnet_features.zip to a folder `data/VIST-E/image_features` without any subfolders.
- Run `data/VIST-E/annotations.py`.
- Run `data/VIST-E/img_feat_path.py`.
- Run `data/VIST-E/pro_label.py`.
- Run `data/embed_vocab.py` and make sure the parameter `dataset` is set to VIST-E.
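A minimal command-line sketch of the steps above, assuming the archives have been placed in `data/VIST-E` and that the scripts can be run from the repository root (they may instead need to be run from their own directory if they use relative paths internally):

```bash
# Hypothetical preprocessing sequence for VIST-E.
tar -xzf data/VIST-E/SIS-with-labels.tar.gz -C data/VIST-E
unzip -j data/VIST-E/resnet_features.zip -d data/VIST-E/image_features   # -j keeps the folder flat
python data/VIST-E/annotations.py
python data/VIST-E/img_feat_path.py
python data/VIST-E/pro_label.py
python data/embed_vocab.py   # set its `dataset` parameter to VIST-E first
```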
LSMDC-E (see the command sketch after this list):
- Unzip task1_2021.zip to `data/LSMDC-E`.
- Unzip all resnet features in resnet152_200.zip to a folder `data/LSMDC-E/image_features` without any subfolders.
- Run `data/LSMDC-E/prepro_vocab.py`.
- Run `data/embed_vocab.py` and make sure the parameter `dataset` is set to LSMDC-E.
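Analogously for LSMDC-E, under the same assumptions as the VIST-E sketch:

```bash
# Hypothetical preprocessing sequence for LSMDC-E.
unzip data/LSMDC-E/task1_2021.zip -d data/LSMDC-E
unzip -j data/LSMDC-E/resnet152_200.zip -d data/LSMDC-E/image_features   # -j keeps the folder flat
python data/LSMDC-E/prepro_vocab.py
python data/embed_vocab.py   # set its `dataset` parameter to LSMDC-E first
```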
- Set parameters in `utils/opts.py`.
- Run `train.py` to train a model.
- Run `eval.py` to evaluate a model (an example invocation is sketched after the recommended settings).

Recommended Settings
VIST-E w BERT:
python train.py --dataset VIST-E --use_bert True --num_head 4 --weight_decay 0 --grad_clip_value 0
VIST-E w/o BERT:
python train.py --dataset VIST-E --use_bert False --num_head 4 --weight_decay 1e-5 --grad_clip_value 0
LSMDC-E w BERT:
python train.py --dataset LSMDC-E --use_bert True --num_head 8 --weight_decay 1e-5 --grad_clip_value 0.1
LSMDC-E w/o BERT:
python train.py --dataset LSMDC-E --use_bert False --num_head 8 --weight_decay 1e-5 --grad_clip_value 0.1
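Evaluation presumably reads the same options from `utils/opts.py`; the flags below simply mirror the training commands and are an assumption, and an option pointing to the trained checkpoint may also be needed (check `utils/opts.py`).

```bash
# Hypothetical evaluation command; the exact flags depend on utils/opts.py.
python eval.py --dataset VIST-E --use_bert True --num_head 4
```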
If you find our work or the code useful, please consider citing our paper:
@inproceedings{10.1145/3503161.3548022,
author = {Xue, Dizhan and Qian, Shengsheng and Fang, Quan and Xu, Changsheng},
title = {MMT: Image-Guided Story Ending Generation with Multimodal Memory Transformer},
year = {2022},
doi = {10.1145/3503161.3548022},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
pages = {750–758},
numpages = {9},
}