
Fix EfficientNet AMP NaN grads (FP16-only guard)#170

Open
nkundiushuti wants to merge 1 commit into main from marius/nan-bug-retraining

Conversation

@nkundiushuti
Contributor

Fix for #169
This fixes a reproducible fine-tuning failure where esp_aves2_effnetb0_all (and the Animalspeak checkpoint) produce NaN gradients under CUDA autocast FP16 while forward/loss remain finite. The fix runs EfficientNet’s model.features(...) in FP32 only when autocast dtype is FP16, keeping BF16 autocast unaffected. Verified via tmp_bn_nan_repro.py on gs://representation-learning/models/efficientnet_animalspeak.pt with AMP enabled.


Copilot AI left a comment


Pull request overview

Fixes a fine-tuning instability in the EfficientNet AVEX backbone where CUDA AMP FP16 can yield NaN gradients (while forward/loss remain finite) by forcing the EfficientNet feature extractor to run in FP32 only under CUDA autocast FP16.

Changes:

  • Add a CUDA-autocast-FP16 guard in EfficientNet.forward() to run model.features(...) under FP32 (autocast disabled).
  • Preserve the existing gradient-checkpointing behavior inside the guarded/un-guarded branches.


Comment on lines +225 to +236
needs_guard = (
    x.is_cuda
    and torch.is_autocast_enabled()
    and torch.get_autocast_dtype("cuda") == torch.float16
)
if needs_guard:
    # FP16 CUDA autocast: run the feature extractor in FP32 with autocast disabled.
    with torch.autocast(device_type="cuda", enabled=False):
        if self.gradient_checkpointing and self.training:
            features = self._checkpointed_features(x.float())
        else:
            features = self.model.features(x.float())
else:
    # Unchanged non-FP16 path, including BF16 autocast.
    if self.gradient_checkpointing and self.training:
        features = self._checkpointed_features(x)
    else:
        features = self.model.features(x)

Copilot AI Mar 30, 2026


needs_guard currently triggers whenever CUDA autocast FP16 is enabled, including model.eval() / torch.no_grad() paths. Since the reported issue is NaN gradients, consider gating this guard on self.training and/or torch.is_grad_enabled() so FP16 autocast inference/eval keeps the expected throughput and memory benefits.
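A sketch of that tighter gating, with plain booleans standing in for `self.training`, `torch.is_grad_enabled()`, and the autocast state (names are illustrative, not the PR's code):

```python
def needs_fp32_guard(is_cuda: bool, autocast_enabled: bool, autocast_dtype: str,
                     training: bool, grad_enabled: bool) -> bool:
    """Guard only when NaN gradients can actually occur: FP16 CUDA autocast
    combined with a training-mode forward that will be backpropagated."""
    fp16_autocast = is_cuda and autocast_enabled and autocast_dtype == "float16"
    return fp16_autocast and training and grad_enabled

# eval()/no_grad() inference keeps full FP16 throughput and memory savings:
print(needs_fp32_guard(True, True, "float16", training=False, grad_enabled=False))  # False
# the fine-tuning path still gets the FP32 guard:
print(needs_fp32_guard(True, True, "float16", training=True, grad_enabled=True))    # True
```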

Comment on lines +225 to +236

Copilot AI Mar 30, 2026


This AMP stability behavior change isn’t covered by the existing EfficientNet unit tests. Please add a CUDA-only test that exercises forward() under torch.autocast("cuda", dtype=torch.float16) and asserts the guarded path is taken (e.g., by checking returned feature dtype when return_features_only=True, or by instrumenting/patching model.features).

