Fix EfficientNet AMP NaN grads (FP16-only guard) #170
nkundiushuti wants to merge 1 commit into main from
Conversation
Pull request overview
Fixes a fine-tuning instability in the EfficientNet AVEX backbone where CUDA AMP FP16 can yield NaN gradients (while the forward pass and loss remain finite). The fix forces the EfficientNet feature extractor to run in FP32, but only under CUDA autocast FP16.
Changes:
- Add a CUDA-autocast-FP16 guard in `EfficientNet.forward()` to run `model.features(...)` in FP32 (autocast disabled).
- Preserve the existing gradient-checkpointing behavior inside both the guarded and un-guarded branches.
```python
needs_guard = x.is_cuda and torch.is_autocast_enabled() and torch.get_autocast_dtype("cuda") == torch.float16
if needs_guard:
    with torch.autocast(device_type="cuda", enabled=False):
        if self.gradient_checkpointing and self.training:
            features = self._checkpointed_features(x.float())
        else:
            features = self.model.features(x.float())
else:
    if self.gradient_checkpointing and self.training:
        features = self._checkpointed_features(x)
    else:
        features = self.model.features(x)
```
`needs_guard` currently triggers whenever CUDA autocast FP16 is enabled, including `model.eval()` / `torch.no_grad()` paths. Since the reported issue is NaN gradients, consider gating this guard on `self.training` and/or `torch.is_grad_enabled()` so FP16 autocast inference/eval keeps the expected throughput and memory benefits.
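The suggested gating could be sketched as a small predicate (hypothetical helper name `needs_fp32_guard`; in the real code the inputs would come from `x.is_cuda`, `torch.is_autocast_enabled()`, `torch.get_autocast_dtype("cuda")`, `self.training`, and `torch.is_grad_enabled()`):

```python
def needs_fp32_guard(is_cuda, autocast_enabled, autocast_dtype,
                     training, grad_enabled):
    """Decide whether to force FP32 for model.features(...).

    Hypothetical sketch of the reviewer's suggestion: only guard when
    gradients can actually flow, so FP16 autocast inference keeps its
    throughput and memory benefits.
    """
    return (
        is_cuda
        and autocast_enabled
        and autocast_dtype == "float16"   # leave BF16 autocast untouched
        and training                      # skip the guard in model.eval()
        and grad_enabled                  # skip the guard under torch.no_grad()
    )

# Training under CUDA FP16 autocast: guard applies.
assert needs_fp32_guard(True, True, "float16", True, True)
# Eval or no_grad under FP16 autocast: guard skipped.
assert not needs_fp32_guard(True, True, "float16", False, True)
assert not needs_fp32_guard(True, True, "float16", True, False)
# BF16 autocast: guard skipped regardless of mode.
assert not needs_fp32_guard(True, True, "bfloat16", True, True)
```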
This AMP stability behavior change isn't covered by the existing EfficientNet unit tests. Please add a CUDA-only test that exercises `forward()` under `torch.autocast("cuda", dtype=torch.float16)` and asserts the guarded path is taken (e.g., by checking the returned feature dtype when `return_features_only=True`, or by instrumenting/patching `model.features`).
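A CUDA-only test along those lines might look like the following sketch. The `EfficientNet()` constructor call and the `return_features_only=True` flag are assumptions about the repo's API; torch is imported lazily inside the test so the module loads without it.

```python
def test_fp16_autocast_runs_features_in_fp32():
    """Sketch of the requested CUDA-only regression test.

    Assumes the backbone can return raw features (return_features_only=True)
    and that the guarded path yields FP32 features even under FP16 autocast.
    """
    import pytest
    torch = pytest.importorskip("torch")
    if not torch.cuda.is_available():
        pytest.skip("CUDA required to exercise the FP16 autocast guard")

    model = EfficientNet().cuda().train()  # hypothetical constructor
    x = torch.randn(2, 3, 224, 224, device="cuda")

    with torch.autocast("cuda", dtype=torch.float16):
        features = model(x, return_features_only=True)

    # The FP32 guard should win over autocast's FP16 cast.
    assert features.dtype == torch.float32
```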
Fix for #169
This fixes a reproducible fine-tuning failure where `esp_aves2_effnetb0_all` (and the Animalspeak checkpoint) produce NaN gradients under CUDA autocast FP16 while the forward pass and loss remain finite. The fix runs EfficientNet's `model.features(...)` in FP32 only when the autocast dtype is FP16, leaving BF16 autocast unaffected. Verified via `tmp_bn_nan_repro.py` on `gs://representation-learning/models/efficientnet_animalspeak.pt` with AMP enabled.
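For reference, a minimal NaN-gradient scan of the kind such a repro script performs could look like this hypothetical helper (the actual `tmp_bn_nan_repro.py` is not shown here; torch is imported lazily so the sketch loads without it):

```python
def find_nonfinite_grads(model):
    """Return names of parameters whose gradients contain NaN/Inf.

    Hypothetical sketch: call after loss.backward() run under
    torch.autocast("cuda", dtype=torch.float16) to check for the bug.
    """
    import torch  # lazy import: only needed when actually scanning grads
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]
```

An empty list after `backward()` indicates the guarded FP32 path is holding; a non-empty list reproduces the original failure.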