Lemur Gaze Estimation: Fine-tuning Gaze-LLE for Primate Behavior Analysis

This repository contains code for automated gaze estimation in redfronted lemurs, adapted from the Gaze-LLE framework. This work is part of a PhD thesis and provides the first automated approach to learning gaze detection in lemurs directly from visual input.

Acknowledgment: This work builds on Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg (CVPR 2025 Highlight). We thank the authors for their foundational work on gaze target estimation.

Motivation

The automated analysis of individual primate behavior in the wild must address two fundamental questions: who is the actor, and what behavior is being performed? While the identification of individuals has been addressed separately, the focus of this work is on behavior recognition, specifically the detection and localization of gaze behavior.

Why Gaze in Lemurs?

In animal behavior research, the set of possible actions is typically formalized in an ethogram, which provides a structured list of behaviors with precise definitions. Computer vision models have the potential to automate primate behavior annotations in the wild, enabling large-scale analysis of social groups and individual behavior.

Gaze is of particular interest because:

Long-distance interaction: Unlike behaviors such as grooming, gaze can occur between individuals who are spatially separated, so it cannot be detected from close-up views of interacting animals alone.
Simultaneous interactions: Multiple gazes can happen simultaneously in a group, requiring models that can resolve which individual is gazing at which target.
Social learning context: Understanding who is learning from whom in social learning experiments requires knowing who is attending to the demonstrator.

Gazes are best annotated in footage from cameras that overview all feeding boxes and their surroundings, so that even long-range gazes are captured. In our dataset, gaze annotations are:

Temporally and spatially localized
Aligned with tracked individuals
Manually curated to ensure accuracy
Associated with feeding box interactions (the gaze target is an individual manipulating a feeding box)

However, this process introduces several sources of noise, including identity ambiguities, temporal gaps in gaze annotation, and approximations of gaze targets.

The above figure shows example frames from our lemur gaze estimation dataset, displaying spatio-temporally localized gaze examples as a product of joining gaze annotations with tracks and manually revising the resulting frames.

Main Contributions

We present the first approach to learning automated gaze detection in lemurs directly from visual input. Our work:

Demonstrates the feasibility of adapting foundation models (Gaze-LLE) to animal gaze estimation
Provides insights into the performance and limitations of automated gaze estimation for ecological applications
Releases code and models for the lemur gaze estimation task

Results

Model Performance on Lemur Gaze

We evaluate the adapted Gaze-LLE model on our lemur gaze dataset. The model is built on a frozen DINOv2 backbone with a lightweight gaze decoder, trained to predict spatial heatmaps of gaze targets.

Dataset	Samples	AUC ↑	L2 ↓	Inout AP ↑
Unfiltered	90,745	0.9887	0.0297	0.8075
Manually filtered	34,519	0.9914	0.0263	0.8455
Max 100 frames per track	19,227	0.9913	0.0267	0.8519
Clustered	12,277	0.9905	0.0267	0.8397

The table shows a comparison of different training sets. We report the average over three runs.
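
To make the L2 column easier to interpret, the sketch below computes a distance between the peak of a predicted heatmap and an annotated gaze point. It assumes that L2 is measured in normalized image coordinates and that the predicted location is the center of the highest-scoring heatmap cell. The evaluation code in this repository may differ in these details, and the function names `heatmap_argmax` and `l2_distance` are hypothetical.

```python
import math


def heatmap_argmax(heatmap):
    """Return (x_norm, y_norm) of the center of the highest-scoring cell.

    `heatmap` is a 2D list of scores indexed as heatmap[row][col].
    """
    height = len(heatmap)
    width = len(heatmap[0])
    best_value = float("-inf")
    best_row, best_col = 0, 0
    for row_idx, row in enumerate(heatmap):
        for col_idx, value in enumerate(row):
            if value > best_value:
                best_value = value
                best_row, best_col = row_idx, col_idx
    # Assumption: use the cell center and normalize by the heatmap size.
    return (best_col + 0.5) / width, (best_row + 0.5) / height


def l2_distance(heatmap, gaze_x_norm, gaze_y_norm):
    """Euclidean distance between the heatmap peak and the annotated target,
    both expressed in normalized [0, 1] image coordinates."""
    pred_x, pred_y = heatmap_argmax(heatmap)
    return math.hypot(pred_x - gaze_x_norm, pred_y - gaze_y_norm)


# Toy 4x4 heatmap with its peak at row 1, column 2.
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.1],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
peak = heatmap_argmax(toy_heatmap)
error = l2_distance(toy_heatmap, 0.625, 0.875)
```

Under these assumptions, a value around 0.03 in the table would correspond to an average error of roughly 3% of the image extent.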

Fine-tuning the Backbone

We investigate the benefit of unfreezing the last layers of the otherwise frozen DINOv2 backbone and fine-tuning them with various learning rates. Unfreezing raises Inout AP by roughly 0.02 to 0.04, while AUC and L2 improve only marginally.

Method	AUC ↑	L2 ↓	Inout AP ↑
Manually filtered	0.9914	0.0263	0.8455
+ unfreeze last layers	0.9921 (+0.0007)	0.0261 (-0.0002)	0.8851 (+0.0396)
Max 100 frames per track	0.9913	0.0267	0.8519
+ unfreeze last layers	0.9923 (+0.0010)	0.0257 (-0.0010)	0.8724 (+0.0205)

The table shows results of unfreezing the last layers of the backbone. We report the average over three runs.

Fine-tuning the backbone with a learning rate of 5e-5 for ViT-B and 1e-4 for ViT-L yields the best performance, demonstrating that careful adaptation of the pretrained encoder helps transfer the model to the lemur domain.
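
As a rough illustration of selective unfreezing, the sketch below splits backbone parameter names into a frozen set and an unfrozen set by transformer block index. It assumes that parameter names contain a `blocks.<index>.` segment, as is common in ViT implementations. The function `split_backbone_params` and the argument `num_unfrozen` are hypothetical, and the number of layers actually unfrozen in our experiments is not implied by this example.

```python
import re

# Assumption: transformer block parameters are named "...blocks.<index>.<rest>".
_BLOCK_PATTERN = re.compile(r"(?:^|\.)blocks\.(\d+)\.")


def split_backbone_params(param_names, num_blocks, num_unfrozen):
    """Split backbone parameter names into (frozen, unfrozen) lists.

    Parameters belonging to the last `num_unfrozen` of `num_blocks`
    transformer blocks are unfrozen; everything else stays frozen.
    """
    if not 0 <= num_unfrozen <= num_blocks:
        raise ValueError("num_unfrozen must be between 0 and num_blocks")
    first_unfrozen = num_blocks - num_unfrozen
    frozen, unfrozen = [], []
    for name in param_names:
        match = _BLOCK_PATTERN.search(name)
        if match and int(match.group(1)) >= first_unfrozen:
            unfrozen.append(name)
        else:
            frozen.append(name)
    return frozen, unfrozen


# Illustrative names only; real names depend on the backbone implementation.
example_names = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.10.mlp.fc1.weight",
    "blocks.11.mlp.fc1.weight",
]
frozen_names, unfrozen_names = split_backbone_params(
    example_names, num_blocks=12, num_unfrozen=2
)
```

In PyTorch, the unfrozen parameters could then have `requires_grad` set to `True` and be placed in their own optimizer parameter group using the backbone learning rate reported above, kept separate from the decoder's parameter group.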

Qualitative Results

The above figure shows qualitative results from our gaze estimation model: (A) correct predictions where the model accurately localizes the gaze target, and (B) failure cases where the model struggles due to occlusion, small target size, or ambiguous gaze direction.

Installation

Clone this repository, then create and activate the conda environment:

conda env create -f environment.yml
conda activate gazelle
pip install -e .

If your system supports it, consider installing xformers to speed up attention computation.

pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118

Data preparation

Before training, download and preprocess your gaze annotation data. Training scripts assume:

Frames with individual bounding boxes
Gaze target annotations (spatial locations or bounding boxes)
Train/validation/test annotations defined in JSON format:

[
    {
        "path": "images/A_e9_c7",
        "width": 1920,
        "height": 1080,
        "frames": [
            {
                "path": "images/A_e9_c7/A_e9_c7_frame16494.jpg",
                "heads": [
                    {
                        "bbox": [1141, 698, 1278, 869],
                        "bbox_norm": [0.59, 0.65, 0.67, 0.80],
                        "gazex": [1179],
                        "gazex_norm": [0.61],
                        "gazey": [557],
                        "gazey_norm": [0.52],
                        "inout": 1
                    },
                    {
                        "bbox": [1133, 480, 1224, 633],
                        "bbox_norm": [0.59, 0.44, 0.64, 0.59],
                        "gazex": [-1],
                        "gazex_norm": [-0.0005],
                        "gazey": [-1],
                        "gazey_norm": [-0.0009],
                        "inout": 0
                    }
                ]
            },
            {
                "path": "images/A_e9_c7/A_e9_c7_frame16495.jpg",
                "heads": [...]
            },
            ...
        ]
    },
    ...
]
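
Before training, it can help to sanity-check annotation files against this layout. The sketch below is one possible check, not part of the repository. It assumes that `bbox` is `[xmin, ymin, xmax, ymax]` in pixels, that the `_norm` fields are the pixel values divided by the sequence `width` and `height`, and that `inout` equal to 1 means the gaze target lies inside the frame. These assumptions are consistent with the example values above. The function names `check_head` and `validate` are hypothetical.

```python
def check_head(head, width, height, tol=0.01):
    """Return a list of problems found in a single head annotation."""
    problems = []
    xmin, ymin, xmax, ymax = head["bbox"]
    if not (xmin < xmax and ymin < ymax):
        problems.append("bbox is not [xmin, ymin, xmax, ymax]")
    # Assumption: bbox_norm is bbox divided by the frame width and height.
    expected = [xmin / width, ymin / height, xmax / width, ymax / height]
    if any(abs(g - e) > tol for g, e in zip(head["bbox_norm"], expected)):
        problems.append("bbox_norm does not match bbox")
    # Assumption: inout == 1 marks a gaze target inside the frame.
    if head["inout"] == 1:
        gaze_x, gaze_y = head["gazex"][0], head["gazey"][0]
        if not (0 <= gaze_x < width and 0 <= gaze_y < height):
            problems.append("gaze target lies outside the frame")
    return problems


def validate(sequences):
    """Map each frame path to the problems found in its head annotations."""
    report = {}
    for sequence in sequences:
        for frame in sequence["frames"]:
            found = []
            for head in frame["heads"]:
                found.extend(
                    check_head(head, sequence["width"], sequence["height"])
                )
            if found:
                report[frame["path"]] = found
    return report


# The two heads from the example above, as an in-memory annotation list.
# A real file would be read with json.load() instead.
sample = [{
    "path": "images/A_e9_c7", "width": 1920, "height": 1080,
    "frames": [{
        "path": "images/A_e9_c7/A_e9_c7_frame16494.jpg",
        "heads": [
            {"bbox": [1141, 698, 1278, 869],
             "bbox_norm": [0.59, 0.65, 0.67, 0.80],
             "gazex": [1179], "gazex_norm": [0.61],
             "gazey": [557], "gazey_norm": [0.52], "inout": 1},
            {"bbox": [1133, 480, 1224, 633],
             "bbox_norm": [0.59, 0.44, 0.64, 0.59],
             "gazex": [-1], "gazex_norm": [-0.0005],
             "gazey": [-1], "gazey_norm": [-0.0009], "inout": 0},
        ],
    }],
}]
report = validate(sample)
```

An empty report means every head passed the checks. The tolerance `tol` allows for the two-decimal rounding used in the normalized fields.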
