This repository contains code for automated gaze estimation in redfronted lemurs, adapted from the Gaze-LLE framework. This work is part of a PhD thesis and provides the first automated approach to learning gaze detection in lemurs directly from visual input.
Acknowledgment: This work builds on Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg (CVPR 2025 Highlight). We thank the authors for their foundational work on gaze target estimation.
The automated analysis of individual primate behavior in the wild must address two fundamental questions: who is the actor, and what behavior is being performed? While the identification of individuals has been addressed separately, the focus of this work is on behavior recognition, specifically the detection and localization of gaze behavior.
In animal behavior research, the set of possible actions is typically formalized in an ethogram, which provides a structured list of behaviors with precise definitions. Computer vision models have the potential to automate primate behavior annotations in the wild, enabling large-scale analysis of social groups and individual behavior.
Gaze is of particular interest because:
- Long-distance interaction: Unlike behaviors such as grooming, gaze can occur between spatially separated individuals, so it is difficult to detect from close-up interactions alone.
- Simultaneous interactions: Multiple gazes can happen simultaneously in a group, requiring models that can resolve which individual is gazing at which target.
- Social learning context: Understanding who is learning from whom in social learning experiments requires knowing who is attending to the demonstrator.
Gaze is best annotated in footage from cameras that overlook all feeding boxes and their surroundings, so that long-range gazes are captured as well. In our dataset, gaze annotations are:
- Temporally and spatially localized
- Aligned with tracked individuals
- Manually curated to ensure accuracy
- Associated with feeding box interactions (the gaze targets are individuals manipulating a feeding box)
However, the annotation process introduces several sources of noise, including identity ambiguities, temporal gaps in gaze annotation, and approximations of gaze targets.
The above figure shows example frames from our lemur gaze estimation dataset. Each example is spatio-temporally localized: it is obtained by joining gaze annotations with tracks and then manually revising the resulting frames.
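The join between gaze annotations and tracks could look roughly like the sketch below. It is an illustration only, not the pipeline used for this dataset: the `GazeEvent` fields, the layout of `tracks`, and the choice of the target's box center as the gaze point are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class GazeEvent:
    """Hypothetical annotation record: who looks at whom, and when."""
    gazer: str   # ID of the individual that is looking
    target: str  # ID of the individual being looked at
    start: int   # first frame of the event (inclusive)
    end: int     # last frame of the event (inclusive)


def box_center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def join_events_with_tracks(events, tracks):
    """Turn gaze events into per-frame samples.

    tracks is assumed to map individual ID -> {frame index: (x1, y1, x2, y2)}.
    """
    samples = []
    for ev in events:
        gazer_track = tracks.get(ev.gazer, {})
        target_track = tracks.get(ev.target, {})
        for frame in range(ev.start, ev.end + 1):
            # Keep only frames in which both individuals are tracked.
            if frame in gazer_track and frame in target_track:
                samples.append({
                    "frame": frame,
                    "gazer": ev.gazer,
                    "bbox": gazer_track[frame],
                    # Assumption: approximate the gaze target by the
                    # center of the target individual's box.
                    "gaze_point": box_center(target_track[frame]),
                })
    return samples
```

Frames where either individual has no track are dropped here, which is one way the temporal gaps mentioned above can arise.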
We present the first approach to learning automated gaze detection in lemurs directly from visual input. Our work:
- Demonstrates the feasibility of adapting foundation models (Gaze-LLE) to animal gaze estimation
- Provides insights into the performance and limitations of automated gaze estimation for ecological applications
- Releases code and models for the lemur gaze estimation task
We evaluate the adapted Gaze-LLE model on our lemur gaze dataset. The model is built on a frozen DINOv2 backbone with a lightweight gaze decoder, trained to predict spatial heatmaps of gaze targets.
| Dataset | Samples | AUC ↑ | L2 ↓ | Inout AP ↑ |
|---|---|---|---|---|
| Unfiltered | 90,745 | 0.9887 | 0.0297 | 0.8075 |
| Manually filtered | 34,519 | 0.9914 | 0.0263 | 0.8455 |
| Max 100 frames per track | 19,227 | 0.9913 | 0.0267 | 0.8519 |
| Clustered | 12,277 | 0.9905 | 0.0267 | 0.8397 |
The table shows a comparison of different training sets. We report the average over three runs.
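As an illustration of the "Max 100 frames per track" setting, the sketch below caps the number of samples taken from one track. How the frames are selected is not stated here, so the evenly spaced selection is an assumption, and `cap_frames_per_track` is a hypothetical helper rather than a function from this repository.

```python
def cap_frames_per_track(frames, max_frames=100):
    """Return at most max_frames items from one track, keeping their order.

    Assumption: frames are picked at evenly spaced positions. Random
    sampling or taking the first max_frames would be equally plausible.
    """
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    n = len(frames)
    if n <= max_frames:
        return list(frames)
    # Integer arithmetic gives strictly increasing indices when n > max_frames.
    indices = [i * n // max_frames for i in range(max_frames)]
    return [frames[i] for i in indices]
```

Short tracks are returned unchanged, so the cap only thins out long tracks that would otherwise dominate the training set.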
We investigate the benefit of unfreezing the last layers of the otherwise frozen DINOv2 backbone and fine-tuning them with various learning rates. Unfreezing improves all three metrics for lemur gaze estimation, most clearly Inout AP (+0.0396 and +0.0205).
| Method | AUC ↑ | L2 ↓ | Inout AP ↑ |
|---|---|---|---|
| Manually filtered | 0.9914 | 0.0263 | 0.8455 |
| + unfreeze last layers | 0.9921 (+0.0007) | 0.0261 (-0.0002) | 0.8851 (+0.0396) |
| Max 100 frames per track | 0.9913 | 0.0267 | 0.8519 |
| + unfreeze last layers | 0.9923 (+0.0010) | 0.0257 (-0.0010) | 0.8724 (+0.0205) |
The table shows results of unfreezing the last layers of the backbone. We report the average over three runs.
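For readers unfamiliar with the L2 column, the sketch below shows one way to turn a predicted heatmap into a gaze point and measure its distance to the annotation. It assumes that L2 is the Euclidean distance in normalized image coordinates and that the prediction is the center of the peak heatmap cell. The evaluation code may use different conventions, and both function names are hypothetical.

```python
import math


def heatmap_argmax(heatmap):
    """Normalized (x, y) of the peak cell center in a 2D heatmap (list of rows)."""
    height = len(heatmap)
    width = len(heatmap[0])
    best_row, best_col = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    # Use the cell center so the result lies strictly inside [0, 1].
    return ((best_col + 0.5) / width, (best_row + 0.5) / height)


def l2_distance(pred, target):
    """Euclidean distance between two normalized (x, y) points."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])


# Tiny example: the peak is in row 1, column 2 of a 2x4 heatmap.
pred = heatmap_argmax([[0.0, 0.1, 0.0, 0.0],
                       [0.0, 0.2, 0.9, 0.1]])
error = l2_distance(pred, (0.61, 0.52))
```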
Fine-tuning the backbone with a learning rate of 5e-5 for ViT-B and 1e-4 for ViT-L yields the best performance, indicating that careful adaptation of the pretrained encoder helps transfer the model to the lemur domain.
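Selecting the last backbone blocks and giving them their own learning rate might be sketched as follows. This is not the training code of this repository. The `blocks.<index>.` parameter naming, the number of unfrozen blocks, and the decoder learning rate are assumptions that the caller has to supply. With PyTorch, `named_params` could come from `model.named_parameters()`, and the returned list has the per-group format that `torch.optim` optimizers accept.

```python
import re

# Assumption: transformer block parameters are named like "blocks.11.attn.qkv.weight".
BLOCK_PATTERN = re.compile(r"(?:^|\.)blocks\.(\d+)\.")


def split_backbone_params(named_params, num_blocks, num_unfrozen):
    """Split (name, param) pairs into frozen and trainable lists.

    Only parameters inside the last num_unfrozen blocks become trainable.
    """
    first_trainable = num_blocks - num_unfrozen
    frozen, trainable = [], []
    for name, param in named_params:
        match = BLOCK_PATTERN.search(name)
        if match and int(match.group(1)) >= first_trainable:
            trainable.append((name, param))
        else:
            frozen.append((name, param))
    return frozen, trainable


def build_param_groups(decoder_params, trainable_backbone, decoder_lr, backbone_lr):
    """Build optimizer parameter groups with separate learning rates."""
    return [
        {"params": list(decoder_params), "lr": decoder_lr},
        {"params": [p for _, p in trainable_backbone], "lr": backbone_lr},
    ]
```

In a PyTorch setup, one would additionally set `requires_grad = False` on the frozen parameters and `True` on the trainable ones before building the optimizer.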
The above figure shows qualitative results from our gaze estimation model: (A) correct predictions where the model accurately localizes the gaze target, and (B) failure cases where the model struggles due to occlusion, small target size, or ambiguous gaze direction.
Clone this repository, then create and activate the conda environment:
conda env create -f environment.yml
conda activate gazelle
pip install -e .
If your system supports it, consider installing xformers to speed up attention computation.
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
Before training, download and preprocess your gaze annotation data. Training scripts assume:
- Frames with individual bounding boxes
- Gaze target annotations (spatial locations or bounding boxes)
- Train/validation/test annotations defined in JSON format:
[
  {
    "path": "images/A_e9_c7",
    "width": 1920,
    "height": 1080,
    "frames": [
      {
        "path": "images/A_e9_c7/A_e9_c7_frame16494.jpg",
        "heads": [
          {
            "bbox": [1141, 698, 1278, 869],
            "bbox_norm": [0.59, 0.65, 0.67, 0.80],
            "gazex": [1179],
            "gazex_norm": [0.61],
            "gazey": [557],
            "gazey_norm": [0.52],
            "inout": 1
          },
          {
            "bbox": [1133, 480, 1224, 633],
            "bbox_norm": [0.59, 0.44, 0.64, 0.59],
            "gazex": [-1],
            "gazex_norm": [-0.0005],
            "gazey": [-1],
            "gazey_norm": [-0.0009],
            "inout": 0
          }
        ]
      },
      {
        "path": "images/A_e9_c7/A_e9_c7_frame16495.jpg",
        "heads": [...]
      },
      ...
    ]
  },
  ...
]
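Before training, it can help to sanity-check an annotation file against this format. The sketch below is a hypothetical helper, not part of the repository. It infers the field semantics from the example above: `bbox_norm` is `bbox` divided by the sequence width and height, and `inout` is 1 when the gaze target lies inside the frame and 0 otherwise (with coordinates of -1). Treat both as assumptions.

```python
import json

# One sequence with one frame, copied from the format example.
SAMPLE = """
[
  {
    "path": "images/A_e9_c7",
    "width": 1920,
    "height": 1080,
    "frames": [
      {
        "path": "images/A_e9_c7/A_e9_c7_frame16494.jpg",
        "heads": [
          {"bbox": [1141, 698, 1278, 869], "bbox_norm": [0.59, 0.65, 0.67, 0.80],
           "gazex": [1179], "gazex_norm": [0.61],
           "gazey": [557], "gazey_norm": [0.52], "inout": 1},
          {"bbox": [1133, 480, 1224, 633], "bbox_norm": [0.59, 0.44, 0.64, 0.59],
           "gazex": [-1], "gazex_norm": [-0.0005],
           "gazey": [-1], "gazey_norm": [-0.0009], "inout": 0}
        ]
      }
    ]
  }
]
"""


def check_head(head, width, height, tol=0.01):
    """Return a list of problems found in one head annotation."""
    problems = []
    x1, y1, x2, y2 = head["bbox"]
    expected = [x1 / width, y1 / height, x2 / width, y2 / height]
    if any(abs(e - n) > tol for e, n in zip(expected, head["bbox_norm"])):
        problems.append("bbox_norm does not match bbox")
    if head["inout"] not in (0, 1):
        problems.append("inout must be 0 or 1")
    elif head["inout"] == 1:
        gx, gy = head["gazex_norm"][0], head["gazey_norm"][0]
        if not (0 <= gx <= 1 and 0 <= gy <= 1):
            problems.append("in-frame gaze point outside [0, 1]")
    return problems


def validate(sequences):
    """Collect (frame path, problem) pairs over all sequences."""
    issues = []
    for seq in sequences:
        for frame in seq["frames"]:
            for head in frame["heads"]:
                for problem in check_head(head, seq["width"], seq["height"]):
                    issues.append((frame["path"], problem))
    return issues


issues = validate(json.loads(SAMPLE))
```

For a real file, replace `json.loads(SAMPLE)` with `json.load(f)` on the opened annotation file. The tolerance of 0.01 allows for normalized values that were rounded to two decimals, as in the example.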
