ecker-lab/lemur-gaze-estimation

Lemur Gaze Estimation: Fine-tuning Gaze-LLE for Primate Behavior Analysis

This repository contains code for automated gaze estimation in redfronted lemurs, adapted from the Gaze-LLE framework. This work is part of a PhD thesis and provides the first automated approach to learning gaze detection in lemurs directly from visual input.

Acknowledgment: This work builds on Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg (CVPR 2025 Highlight). We thank the authors for their foundational work on gaze target estimation.

Motivation

The automated analysis of individual primate behavior in the wild must address two fundamental questions: who is the actor, and what behavior is being performed? While the identification of individuals has been addressed separately, the focus of this work is on behavior recognition, specifically the detection and localization of gaze behavior.

Why Gaze in Lemurs?

In animal behavior research, the set of possible actions is typically formalized in an ethogram, which provides a structured list of behaviors with precise definitions. Computer vision models have the potential to automate primate behavior annotations in the wild, enabling large-scale analysis of social groups and individual behavior.

Gaze is of particular interest because:

  1. Long-distance interaction: Unlike behaviors like grooming, gaze can occur between individuals who are spatially separated, making it challenging to detect from close-up interactions alone.
  2. Simultaneous interactions: Multiple gazes can happen simultaneously in a group, requiring models that can resolve which individual is gazing at which target.
  3. Social learning context: Understanding who is learning from whom in social learning experiments requires knowing who is attending to the demonstrator.

Gazes are best annotated in footage from cameras that overview all feeding boxes and their surroundings, so that even long-range gazes are captured. In our dataset, gaze annotations are:

  • Temporally and spatially localized
  • Aligned with tracked individuals
  • Manually curated to ensure accuracy
  • Associated with feeding box interactions (the gaze target is an individual manipulating a feeding box)

However, this process introduces several sources of noise, including identity ambiguities, temporal gaps in gaze annotation, and approximations of gaze targets.

Example frames from lemur gaze estimation dataset

The above figure shows example frames from our lemur gaze estimation dataset. The spatio-temporally localized gaze examples were obtained by joining gaze annotations with tracks and manually revising the resulting frames.

Main Contributions

We present the first approach to learning automated gaze detection in lemurs directly from visual input. Our work:

  1. Demonstrates the feasibility of adapting foundation models (Gaze-LLE) to animal gaze estimation
  2. Provides insights into the performance and limitations of automated gaze estimation for ecological applications
  3. Releases code and models for the lemur gaze estimation task

Results

Model Performance on Lemur Gaze

We evaluate the adapted Gaze-LLE model on our lemur gaze dataset. The model is built on a frozen DINOv2 backbone with a lightweight gaze decoder, trained to predict spatial heatmaps of gaze targets.

Dataset                    Samples   AUC ↑    L2 ↓     Inout AP ↑
Unfiltered                 90,745    0.9887   0.0297   0.8075
Manually filtered          34,519    0.9914   0.0263   0.8455
Max 100 frames per track   19,227    0.9913   0.0267   0.8519
Clustered                  12,277    0.9905   0.0267   0.8397

The table shows a comparison of different training sets. We report the average over three runs.
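
This README does not spell out how the metrics are computed. As a rough illustration of what an L2 score on heatmap predictions usually means, the sketch below takes the peak of a predicted heatmap and measures its Euclidean distance to the annotated gaze point in normalized image coordinates. The function names and the choice of the cell centre as the predicted point are assumptions made for this example, not the repository's evaluation code.

```python
import math


def heatmap_peak(heatmap):
    """Return the (x, y) centre of the highest-valued cell, normalized to [0, 1].

    `heatmap` is a list of rows (row-major). Using the cell centre is an
    assumption of this sketch; other conventions use the cell's top-left corner.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return (best_c + 0.5) / cols, (best_r + 0.5) / rows


def l2_error(heatmap, gaze_x_norm, gaze_y_norm):
    """Euclidean distance between the heatmap peak and the normalized gaze target."""
    pred_x, pred_y = heatmap_peak(heatmap)
    return math.hypot(pred_x - gaze_x_norm, pred_y - gaze_y_norm)
```

Under this convention the error is expressed as a fraction of the image size, which would make scores comparable across resolutions. Samples whose gaze target lies outside the frame (`inout` of 0) have no valid target and would need to be excluded before averaging.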

Fine-tuning the Backbone

We investigate whether unfreezing and fine-tuning the last layers of the otherwise frozen DINOv2 backbone helps, using various learning rates. Unfreezing improves all three metrics on both training sets tested, most notably Inout AP (+0.0205 to +0.0396).

Method                     AUC ↑              L2 ↓               Inout AP ↑
Manually filtered          0.9914             0.0263             0.8455
+ unfreeze last layers     0.9921 (+0.0007)   0.0261 (-0.0002)   0.8851 (+0.0396)
Max 100 frames per track   0.9913             0.0267             0.8519
+ unfreeze last layers     0.9923 (+0.0010)   0.0257 (-0.0010)   0.8724 (+0.0205)

The table shows results of unfreezing the last layers of the backbone. We report the average over three runs.

Fine-tuning the backbone with a learning rate of 5e-5 for ViT-B and 1e-4 for ViT-L yields the best performance, demonstrating that careful adaptation of the pretrained encoder helps transfer the model to the lemur domain.
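
One way to set up this kind of selective fine-tuning is to split the parameters into two optimizer groups with separate learning rates. The sketch below is a minimal illustration, not the training code of this repository. It assumes backbone parameters are named `backbone.blocks.<i>.…` (a hypothetical naming scheme), and the number of unfrozen blocks and the decoder learning rate are placeholders that this README does not specify.

```python
import re


def build_param_groups(named_params, num_blocks, num_unfrozen,
                       backbone_lr, decoder_lr):
    """Freeze all but the last `num_unfrozen` backbone blocks and return
    optimizer parameter groups with separate learning rates.

    `named_params` is an iterable of (name, parameter) pairs, for example
    the result of `model.named_parameters()` in PyTorch.
    """
    first_unfrozen = num_blocks - num_unfrozen
    backbone, decoder = [], []
    for name, param in named_params:
        if name.startswith("backbone."):
            # Assumed naming: transformer blocks live under backbone.blocks.<i>.
            match = re.match(r"backbone\.blocks\.(\d+)\.", name)
            if match and int(match.group(1)) >= first_unfrozen:
                param.requires_grad = True
                backbone.append(param)
            else:
                # Patch embedding, early blocks, etc. stay frozen.
                param.requires_grad = False
        else:
            # Everything outside the backbone (the gaze decoder) is trained.
            param.requires_grad = True
            decoder.append(param)
    groups = [
        {"params": decoder, "lr": decoder_lr},
        {"params": backbone, "lr": backbone_lr},
    ]
    # Drop empty groups, e.g. when num_unfrozen is 0.
    return [group for group in groups if group["params"]]
```

With PyTorch, the returned list can be passed directly to an optimizer such as `torch.optim.AdamW(groups)`. For a ViT-B backbone (12 transformer blocks), `backbone_lr=5e-5` would match the value reported above; `num_unfrozen` and `decoder_lr` would need to be chosen by the user.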

Qualitative Results

Qualitative gaze estimation results on lemurs

The above figure shows qualitative results from our gaze estimation model: (A) correct predictions where the model accurately localizes the gaze target, and (B) failure cases where the model struggles due to occlusion, small target size, or ambiguous gaze direction.

Installation

Clone this repo and create the virtual environment.

conda env create -f environment.yml
conda activate gazelle
pip install -e .

If your system supports it, consider installing xformers to speed up attention computation.

pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118

Data preparation

Before training, download and preprocess your gaze annotation data. Training scripts assume:

  • Frames with individual bounding boxes
  • Gaze target annotations (spatial locations or bounding boxes)
  • Train/validation/test annotations defined in JSON format:
[
    {
        "path": "images/A_e9_c7",
        "width": 1920,
        "height": 1080,
        "frames": [
            {
                "path": "images/A_e9_c7/A_e9_c7_frame16494.jpg",
                "heads": [
                    {
                        "bbox": [1141, 698, 1278, 869],
                        "bbox_norm": [0.59, 0.65, 0.67, 0.80],
                        "gazex": [1179],
                        "gazex_norm": [0.61],
                        "gazey": [557],
                        "gazey_norm": [0.52],
                        "inout": 1
                    },
                    {
                        "bbox": [1133, 480, 1224, 633],
                        "bbox_norm": [0.59, 0.44, 0.64, 0.59],
                        "gazex": [-1],
                        "gazex_norm": [-0.0005],
                        "gazey": [-1],
                        "gazey_norm": [-0.0009],
                        "inout": 0
                    }
                ]
            },
            {
                "path": "images/A_e9_c7/A_e9_c7_frame16495.jpg",
                "heads": [...]
            },
            ...
        ]
    },
    ...
]
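
To make the layout above concrete, the following sketch builds one `heads` entry from pixel coordinates and checks a sequence entry for internal consistency. The helper names `make_head` and `check_sequence` are hypothetical and not part of this repository. The sketch assumes that the `*_norm` fields are pixel values divided by the sequence `width` or `height`, and that a gaze of `-1` with `inout` set to 0 marks a target outside the frame, which is what the example values suggest.

```python
import json


def make_head(bbox, gaze_xy, width, height):
    """Build one "heads" entry; pass gaze_xy=None for an out-of-frame gaze."""
    x0, y0, x1, y1 = bbox
    inside = gaze_xy is not None
    gx, gy = gaze_xy if inside else (-1, -1)
    return {
        "bbox": [x0, y0, x1, y1],
        "bbox_norm": [x0 / width, y0 / height, x1 / width, y1 / height],
        "gazex": [gx],
        "gazex_norm": [gx / width],
        "gazey": [gy],
        "gazey_norm": [gy / height],
        "inout": 1 if inside else 0,
    }


def check_sequence(seq, tol=0.01):
    """Raise ValueError if a sequence entry is inconsistent with the layout."""
    w, h = seq["width"], seq["height"]
    for frame in seq["frames"]:
        for head in frame["heads"]:
            x0, y0, x1, y1 = head["bbox"]
            if not (0 <= x0 < x1 <= w and 0 <= y0 < y1 <= h):
                raise ValueError(f"bbox out of bounds in {frame['path']}")
            expected = [x0 / w, y0 / h, x1 / w, y1 / h]
            if any(abs(a - b) > tol for a, b in zip(head["bbox_norm"], expected)):
                raise ValueError(f"bbox_norm mismatch in {frame['path']}")
            if head["inout"] not in (0, 1):
                raise ValueError(f"inout must be 0 or 1 in {frame['path']}")
            if head["inout"] == 1:
                gx, gy = head["gazex"][0], head["gazey"][0]
                if not (0 <= gx < w and 0 <= gy < h):
                    raise ValueError(f"gaze out of frame in {frame['path']}")
                if (abs(head["gazex_norm"][0] - gx / w) > tol
                        or abs(head["gazey_norm"][0] - gy / h) > tol):
                    raise ValueError(f"gaze norm mismatch in {frame['path']}")


# Usage: rebuild the first frame of the example above and serialize it.
sequence = {
    "path": "images/A_e9_c7",
    "width": 1920,
    "height": 1080,
    "frames": [{
        "path": "images/A_e9_c7/A_e9_c7_frame16494.jpg",
        "heads": [
            make_head([1141, 698, 1278, 869], (1179, 557), 1920, 1080),
            make_head([1133, 480, 1224, 633], None, 1920, 1080),
        ],
    }],
}
check_sequence(sequence)
annotation_json = json.dumps([sequence], indent=4)
```

The normalized values in the example above appear rounded, so the check uses a tolerance rather than exact equality.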
