This repository contains code for our paper:
Pixels or Positions? Benchmarking Modalities in Group Activity Recognition.
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem
This repository provides all the code necessary to reproduce the results in our paper. The SoccerNet-GAR dataset used in this paper will be provided through the OpenSportsLab HuggingFace repository. We will also provide pretrained models.
You can follow either of these options to set up the environment.

Option 1: run the setup script.

```bash
chmod +x setup.sh
bash setup.sh
```

Option 2: create the environment manually.

```bash
conda create -y -n pvp python=3.10
conda activate pvp
python -m pip install torch==2.5.1+cu118 torchvision==0.20.1+cu118 torchaudio==2.5.1+cu118 --index-url https://download.pytorch.org/whl/cu118
python -m pip install torch-cluster -f https://data.pyg.org/whl/torch-2.5.1+cu118.html
python -m pip install torch-geometric seaborn numpy tqdm transformers opencv-python matplotlib datetime scikit-learn pyarrow==17.0 fastparquet easydict timm
```

The SoccerNet-GAR dataset is available on HuggingFace.
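For example, a minimal download sketch using `huggingface_hub` (the repository ID below is a placeholder; use the identifier published by OpenSportsLab):

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id: substitute the official OpenSportsLab dataset identifier.
snapshot_download(
    repo_id="OpenSportsLab/SoccerNet-GAR",
    repo_type="dataset",
    local_dir="data",
)
```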
Alternatively, you can generate the dataset from PFF FC source data.
SoccerNet-GAR is derived from the PFF FC World Cup 2022 dataset. Sign up for their newsletter to gain access to their Google Drive link. Download the folders inside Event Data/March 14, 2025 and Tracking Data, then place them in data/events and data/tracking, respectively.
```bash
python utils/pff_to_soccernet_gar.py --modality tracking
python utils/pff_to_soccernet_gar.py --modality video
```

The dataset is organized in the following format:
```
data/
├── video_dataset/
│   ├── train/
│   │   ├── train.json
│   │   └── videos/
│   ├── valid/
│   │   ├── valid.json
│   │   └── videos/
│   ├── test/
│   │   ├── test.json
│   │   └── videos/
│   └── preprocessed/
│       ├── train/
│       │   ├── clips.json
│       │   └── *.npy
│       ├── valid/
│       │   ├── clips.json
│       │   └── *.npy
│       └── test/
│           ├── clips.json
│           └── *.npy
└── tracking_dataset/
    ├── train/
    │   ├── train.json
    │   └── videos/
    ├── valid/
    │   ├── valid.json
    │   └── videos/
    └── test/
        ├── test.json
        └── videos/
```
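As a quick sanity check of the layout above, a small sketch that only relies on the paths shown in the tree (not on the repo's data loaders):

```python
from pathlib import Path

DATA = Path("data")

# Each split of both modalities should contain its annotation file.
for modality in ["video_dataset", "tracking_dataset"]:
    for split in ["train", "valid", "test"]:
        ann = DATA / modality / split / f"{split}.json"
        print(f"{ann}: {'found' if ann.exists() else 'missing'}")

# The video preprocessing step caches per-clip features as .npy files.
for split in ["train", "valid", "test"]:
    npy_files = list((DATA / "video_dataset" / "preprocessed" / split).glob("*.npy"))
    print(f"preprocessed/{split}: {len(npy_files)} .npy clips")
```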
Train the tracking model with:

```bash
python scripts/train_tracking.py \
    --data-dir data/tracking_dataset \
    --output-dir outputs/tracking \
    --conv-type gin \
    --edge positional \
    --temporal-model attention
```

Graph convolution operators (`--conv-type`):

| Operator | Argument |
|---|---|
| Graph Convolutional Network | graphconv |
| Graph Attention Network (GATv2) | gat |
| GraphSAGE | sage |
| Graph Isomorphism Network | gin |
| Edge Convolution | edgeconv |
| Generalized Aggregation | gen |
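As an illustration of how the `--conv-type` arguments could map to PyTorch Geometric layers (a sketch only; the layer choices and hyperparameters in the repo may differ):

```python
import torch.nn as nn
from torch_geometric.nn import (
    GCNConv, GATv2Conv, SAGEConv, GINConv, EdgeConv, GENConv,
)

def build_conv(conv_type: str, in_dim: int, out_dim: int):
    """Illustrative mapping from a --conv-type argument to a PyG layer."""
    if conv_type == "graphconv":
        return GCNConv(in_dim, out_dim)
    if conv_type == "gat":
        return GATv2Conv(in_dim, out_dim)
    if conv_type == "sage":
        return SAGEConv(in_dim, out_dim)
    if conv_type == "gin":
        # GIN wraps a small MLP applied to the aggregated neighbour features.
        return GINConv(nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)))
    if conv_type == "edgeconv":
        # EdgeConv consumes concatenated [x_i, x_j - x_i] pairs, hence 2 * in_dim.
        return EdgeConv(nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU()))
    if conv_type == "gen":
        return GENConv(in_dim, out_dim)
    raise ValueError(f"unknown conv type: {conv_type}")
```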
Edge construction strategies (`--edge`):

| Edge Type | Argument |
|---|---|
| Positional (role-based) | positional |
| K-Nearest Neighbors | knn |
| Ball K-Nearest Neighbors | ball_knn |
| Distance Threshold | distance |
| Ball Distance Threshold | ball_distance |
| Fully Connected | full |
| No Edges | none |
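For intuition, a sketch of how some of these edge types could be built from player positions with `torch-cluster` (the role-based and ball-centric variants depend on repo internals and are not sketched):

```python
import torch
from torch_cluster import knn_graph, radius_graph

def build_edges(positions: torch.Tensor, edge_type: str, k: int = 5, radius: float = 10.0):
    """Illustrative edge construction from (num_players, 2) pitch positions."""
    if edge_type == "knn":
        return knn_graph(positions, k=k)          # connect each player to its k nearest neighbours
    if edge_type == "distance":
        return radius_graph(positions, r=radius)  # connect players closer than `radius`
    if edge_type == "full":
        n = positions.size(0)
        idx = torch.arange(n)
        row, col = torch.meshgrid(idx, idx, indexing="ij")
        mask = row != col                         # fully connected, without self-loops
        return torch.stack([row[mask], col[mask]])
    if edge_type == "none":
        return torch.empty(2, 0, dtype=torch.long)
    raise ValueError(f"edge type {edge_type!r} not covered in this sketch")
```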
Temporal aggregation methods (`--temporal-model`):

| Method | Argument |
|---|---|
| Mean Pooling | pool |
| Max Pooling | maxpool |
| Bidirectional LSTM | bilstm |
| Temporal Convolutional Network | tcn |
| Multi-Head Self-Attention | attention |
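As an example of the last option, a minimal attention-based temporal head over per-frame embeddings (a sketch; the repo's head dimensions and pooling may differ):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Illustrative multi-head self-attention over per-frame clip features."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -> clip embedding: (batch, dim)
        attended, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return self.norm(attended + frame_feats).mean(dim=1)

# Example: 2 clips of 16 frames with 256-d features -> (2, 256) clip embeddings.
print(AttentionPooling()(torch.randn(2, 16, 256)).shape)
```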
Frozen backbone:

```bash
python scripts/train_video.py \
    --data-dir data/video_dataset \
    --output-dir outputs/video \
    --backbone videomae2 \
    --temporal-model maxpool \
    --freeze-backbone
```

Full fine-tuning:

```bash
python scripts/train_video.py \
    --data-dir data/video_dataset \
    --output-dir outputs/video \
    --backbone videomae2 \
    --temporal-model maxpool
```

Video backbones (`--backbone`):

| Model | Argument | Type |
|---|---|---|
| DINOv3 ViT-B/16 | dinov3 | Image |
| CLIP ViT-B/16 | clip | Image |
| VideoMAE | videomae | Video |
| VideoMAEv2 | videomae2 | Video |
The image backbones support the same temporal models as tracking.
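To illustrate what `--freeze-backbone` implies, a minimal sketch with the Hugging Face `transformers` VideoMAE model (the checkpoint below is a public VideoMAE release, not necessarily the VideoMAEv2 weights used in the repo):

```python
import torch
from transformers import VideoMAEModel

backbone = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# --freeze-backbone: stop gradients through the encoder so only the temporal head trains.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

# VideoMAE expects pixel_values of shape (batch, num_frames, channels, height, width).
dummy = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    feats = backbone(pixel_values=dummy).last_hidden_state  # (1, num_tokens, hidden_dim)
```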
The best tracking model is provided at weights/tracking/best_model.pt. For tracking, use the following command.

```bash
python scripts/infer_tracking.py \
    --checkpoint weights/tracking/best_model.pt \
    --data-dir data/tracking_dataset \
    --conv-type gin \
    --edge positional \
    --temporal-model attention
```

For video, use the following command.
```bash
python scripts/infer_video.py \
    --checkpoint weights/video/best_model.pt \
    --preprocessed-dir data/video_dataset/preprocessed \
    --backbone videomae2 \
    --temporal-model maxpool
```

Trained models will also be provided. Currently, the best tracking model is provided at weights/tracking/best_model.pt and the best video model can be downloaded from here.
If you have any questions related to the code, feel free to contact karkidrishya1@gmail.com.
If you find our work useful, please consider citing our paper.
```bibtex
@article{karki2025pixels,
  title={Pixels or Positions? Benchmarking Modalities in Group Activity Recognition},
  author={Karki, Drishya and Ramazanova, Merey and Cioppa, Anthony and Giancola, Silvio and Ghanem, Bernard},
  journal={arXiv preprint arXiv:2511.12606},
  year={2025}
}
```