Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

arXiv: https://arxiv.org/abs/2511.12606

This repository contains code for our paper:

Pixels or Positions? Benchmarking Modalities in Group Activity Recognition.
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem

Overview

This repository provides all the code necessary to reproduce the results in our paper. The SoccerNet-GAR dataset used for this paper will be provided through the OpenSportsLab HuggingFace repository. We will also provide pretrained models.

Environment

You can follow either of these options to set up the environment.

Option 1. Use the provided setup.sh script.

chmod +x setup.sh
bash setup.sh

Option 2. Manual installation

conda create -y -n pvp python=3.10
conda activate pvp
python -m pip install torch==2.5.1+cu118 torchvision==0.20.1+cu118 torchaudio==2.5.1+cu118 --index-url https://download.pytorch.org/whl/cu118
python -m pip install torch-cluster -f https://data.pyg.org/whl/torch-2.5.1+cu118.html
python -m pip install torch-geometric seaborn numpy tqdm transformers opencv-python matplotlib datetime scikit-learn pyarrow==17.0 fastparquet easydict timm
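
Once the environment is ready, a quick sanity check (not part of the repository) can confirm that the core dependencies import and that CUDA is visible:

# sanity_check.py (illustrative, not included in the repository)
import torch
import torch_cluster   # noqa: F401  imported only to confirm the wheel installed
import torch_geometric

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torch_geometric:", torch_geometric.__version__)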

Data

The SoccerNet-GAR dataset is available on HuggingFace.

Alternatively, you can generate the dataset from PFF FC source data.

SoccerNet-GAR is derived from the PFF FC World Cup 2022 dataset. Sign up for their newsletter to gain access to their Google Drive link. Download the folders inside Event Data/March 14, 2025 and Tracking Data, then place them in data/events and data/tracking, respectively.

python utils/pff_to_soccernet_gar.py --modality tracking
python utils/pff_to_soccernet_gar.py --modality video

The dataset is organized in the following format:

data/
├── video_dataset/
│   ├── train/
│   │   ├── train.json
│   │   └── videos/
│   ├── valid/
│   │   ├── valid.json
│   │   └── videos/
│   ├── test/
│   │   ├── test.json
│   │   └── videos/
│   └── preprocessed/
│       ├── train/
│       │   ├── clips.json
│       │   └── *.npy
│       ├── valid/
│       │   ├── clips.json
│       │   └── *.npy
│       └── test/
│           ├── clips.json
│           └── *.npy
└── tracking_dataset/
    ├── train/
    │   ├── train.json
    │   └── videos/
    ├── valid/
    │   ├── valid.json
    │   └── videos/
    └── test/
        ├── test.json
        └── videos/
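
To get a feel for a generated split, a small inspection snippet such as the one below can be used. It is only a sketch, not part of the repository, and the exact contents of clips.json may differ from what it assumes.

import json
from pathlib import Path
import numpy as np

split_dir = Path("data/video_dataset/preprocessed/train")

# clips.json is assumed to describe the clips cached as .npy arrays alongside it.
with open(split_dir / "clips.json") as f:
    clips = json.load(f)
print("clip entries:", len(clips))

# Peek at one cached array; the shape depends on the preprocessing settings.
sample = next(split_dir.glob("*.npy"))
print(sample.name, np.load(sample).shape)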

Execution

Tracking

python scripts/train_tracking.py \
    --data-dir data/tracking_dataset \
    --output-dir outputs/tracking \
    --conv-type gin \
    --edge positional \
    --temporal-model attention 

Graph Convolution Operators (--conv-type)

Operator                          Argument
Graph Convolutional Network       graphconv
Graph Attention Network (GATv2)   gat
GraphSAGE                         sage
Graph Isomorphism Network         gin
Edge Convolution                  edgeconv
Generalized Aggregation           gen
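
These argument values correspond to standard PyTorch Geometric operators. The factory below is a sketch of such a mapping, not the repository's actual code, and the hidden MLPs for GIN and EdgeConv are illustrative.

from torch.nn import Linear, ReLU, Sequential
from torch_geometric.nn import GraphConv, GATv2Conv, SAGEConv, GINConv, EdgeConv, GENConv

def build_conv(conv_type: str, in_dim: int, out_dim: int):
    # Illustrative mapping from --conv-type values to PyTorch Geometric operators.
    mlp = Sequential(Linear(in_dim, out_dim), ReLU(), Linear(out_dim, out_dim))
    edge_mlp = Sequential(Linear(2 * in_dim, out_dim), ReLU(), Linear(out_dim, out_dim))
    factories = {
        "graphconv": lambda: GraphConv(in_dim, out_dim),
        "gat": lambda: GATv2Conv(in_dim, out_dim),
        "sage": lambda: SAGEConv(in_dim, out_dim),
        "gin": lambda: GINConv(mlp),
        "edgeconv": lambda: EdgeConv(edge_mlp),
        "gen": lambda: GENConv(in_dim, out_dim),
    }
    return factories[conv_type]()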

Edge Connectivity (--edge)

Edge Type                   Argument
Positional (role-based)     positional
K-Nearest Neighbors         knn
Ball K-Nearest Neighbors    ball_knn
Distance Threshold          distance
Ball Distance Threshold     ball_distance
Fully Connected             full
No Edges                    none
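
As an illustration of these connectivity patterns, the sketch below builds a few of them with torch_cluster. It mirrors the option names but is an assumption about the implementation; the positional and ball-centred variants are not reproduced here.

import torch
from torch_cluster import knn_graph, radius_graph

def build_edges(positions: torch.Tensor, edge_type: str,
                k: int = 4, radius: float = 0.1) -> torch.Tensor:
    # positions: [num_players, 2] pitch coordinates for one frame.
    n = positions.size(0)
    if edge_type == "knn":
        return knn_graph(positions, k=k)            # connect each player to its k nearest neighbours
    if edge_type == "distance":
        return radius_graph(positions, r=radius)    # connect players closer than the threshold
    if edge_type == "full":
        idx = torch.arange(n)
        src, dst = torch.meshgrid(idx, idx, indexing="ij")
        mask = src != dst                           # fully connected, no self-loops
        return torch.stack([src[mask], dst[mask]])
    if edge_type == "none":
        return torch.empty(2, 0, dtype=torch.long)  # isolated nodes
    raise ValueError(f"edge type not sketched here: {edge_type}")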

Temporal Aggregation (--temporal-model)

Method                           Argument
Mean Pooling                     pool
Max Pooling                      maxpool
Bidirectional LSTM               bilstm
Temporal Convolutional Network   tcn
Multi-Head Self-Attention        attention
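
Each temporal model reduces the per-frame embedding sequence to a single clip-level vector. A minimal sketch of the two pooling variants is shown below; the LSTM, TCN and attention variants are learned modules and are not reproduced here.

import torch

def aggregate(frame_feats: torch.Tensor, method: str) -> torch.Tensor:
    # frame_feats: [num_frames, feat_dim] sequence of per-frame embeddings for one clip.
    if method == "pool":
        return frame_feats.mean(dim=0)         # mean pooling over time
    if method == "maxpool":
        return frame_feats.max(dim=0).values   # max pooling over time
    # bilstm, tcn and attention are trainable modules in the repository.
    raise NotImplementedError(method)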

Video

Frozen-backbone

python scripts/train_video.py \
    --data-dir data/video_dataset \
    --output-dir outputs/video \
    --backbone videomae2 \
    --temporal-model maxpool \
    --freeze-backbone

Full-finetuning

python scripts/train_video.py \
    --data-dir data/video_dataset \
    --output-dir outputs/video \
    --backbone videomae2 \
    --temporal-model maxpool 
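
The only difference between the two modes is whether the backbone weights receive gradients. Below is a hedged sketch of what --freeze-backbone implies, assuming the backbone is loaded through Hugging Face transformers; the public VideoMAE base checkpoint is used purely for illustration and may not match the paper's weights.

import torch
from transformers import VideoMAEModel

backbone = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

freeze_backbone = True
if freeze_backbone:
    for p in backbone.parameters():
        p.requires_grad = False   # --freeze-backbone: only the temporal model and head are trained
    backbone.eval()

trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print("trainable backbone parameters:", trainable)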

Backbone Architectures (--backbone)

Model             Argument    Type
DINOv3 ViT-B/16   dinov3      Image
CLIP ViT-B/16     clip        Image
VideoMAE          videomae    Video
VideoMAEv2        videomae2   Video

The image backbones use the same temporal models as tracking.
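
For instance, an image backbone can be run frame by frame and the resulting sequence reduced with one of the temporal models above. The sketch below uses the public CLIP ViT-B/16 checkpoint and max pooling; it is an assumption about how the image path is wired, not the repository's exact code.

import torch
from transformers import CLIPVisionModel

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
backbone.eval()

frames = torch.randn(16, 3, 224, 224)                          # one clip of 16 RGB frames
with torch.no_grad():
    per_frame = backbone(pixel_values=frames).pooler_output    # [16, 768] per-frame features
clip_feature = per_frame.max(dim=0).values                     # maxpool over time
print(clip_feature.shape)                                      # torch.Size([768])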

Evaluation

The best tracking model is provided in the weights folder, at weights/tracking/best_model.pt.

python scripts/infer_tracking.py \
    --checkpoint weights/tracking/best_model.pt \
    --data-dir data/tracking_dataset \
    --conv-type gin \
    --edge positional \
    --temporal-model attention

For video, use the following command.

python scripts/infer_video.py \
    --checkpoint weights/video/best_model.pt \
    --preprocessed-dir data/video_dataset/preprocessed \
    --backbone videomae2 \
    --temporal-model maxpool

Trained Models

Trained models will also be provided. Currently, the best tracking model is included at weights/tracking/best_model.pt, and the best video model can be downloaded from here.

Contact

If you have any questions related to the code, feel free to contact karkidrishya1@gmail.com.

References

If you find our work useful, please consider citing our paper.

@article{karki2025pixels,
  title={Pixels or Positions? Benchmarking Modalities in Group Activity Recognition},
  author={Karki, Drishya and Ramazanova, Merey and Cioppa, Anthony and Giancola, Silvio and Ghanem, Bernard},
  journal={arXiv preprint arXiv:2511.12606},
  year={2025}
}
