18 changes: 15 additions & 3 deletions avex/models/efficientnet.py
@@ -218,10 +218,22 @@ def forward(
     x = self.process_audio(x)

     # Extract features with optional gradient checkpointing
-    if self.gradient_checkpointing and self.training:
-        features = self._checkpointed_features(x)
+    # Empirically, EfficientNet's backward under CUDA autocast can produce NaN
+    # gradients even when activations and loss are finite. When autocast is
+    # enabled, run the feature extractor in FP32 for stability while keeping
+    # AMP for the rest of the training loop.
+    needs_guard = x.is_cuda and torch.is_autocast_enabled() and torch.get_autocast_dtype("cuda") == torch.float16
+    if needs_guard:
+        with torch.autocast(device_type="cuda", enabled=False):
+            if self.gradient_checkpointing and self.training:
+                features = self._checkpointed_features(x.float())
+            else:
+                features = self.model.features(x.float())
     else:
-        features = self.model.features(x)
+        if self.gradient_checkpointing and self.training:
+            features = self._checkpointed_features(x)
+        else:
+            features = self.model.features(x)
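The guard pattern in this hunk can be sketched in isolation (a minimal toy stand-in, assuming nothing about avex's classes; CPU bfloat16 autocast substitutes here for CUDA FP16 so the sketch runs anywhere):

```python
import torch


class GuardedExtractor(torch.nn.Module):
    """Toy stand-in: run a numerically sensitive submodule in FP32."""

    def __init__(self) -> None:
        super().__init__()
        self.features = torch.nn.Conv2d(1, 4, kernel_size=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Disable the ambient autocast context and force FP32 inputs,
        # mirroring the PR's guard around self.model.features(x).
        with torch.autocast(device_type="cpu", enabled=False):
            return self.features(x.float())


model = GuardedExtractor()
x = torch.randn(1, 1, 8, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype)  # torch.float32: the guard kept the extractor out of autocast
```

Without the `enabled=False` block, the convolution under autocast would return a reduced-precision tensor; the guard confines AMP to the surrounding code.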
Comment on lines +225 to +236
Copilot AI, Mar 30, 2026

needs_guard currently triggers whenever CUDA autocast FP16 is enabled, including model.eval() / torch.no_grad() paths. Since the reported issue is NaN gradients, consider gating this guard on self.training and/or torch.is_grad_enabled() so FP16 autocast inference/eval keeps the expected throughput and memory benefits.
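The suggested gating could look like the following (a hypothetical helper, not code from the PR; the extra training/grad-mode checks are the reviewer's suggestion):

```python
import torch


def needs_fp32_guard(x: torch.Tensor, training: bool) -> bool:
    """Trigger the FP32 fallback only when NaN gradients are possible.

    Hypothetical refinement of the PR's check: additionally require
    training mode and grad mode, so FP16 autocast eval/inference keeps
    its throughput and memory benefits. Short-circuit evaluation means
    the autocast queries only run for CUDA tensors.
    """
    return (
        training
        and torch.is_grad_enabled()
        and x.is_cuda
        and torch.is_autocast_enabled()
        and torch.get_autocast_dtype("cuda") == torch.float16
    )
```

Inside `forward()` this would be called as `needs_fp32_guard(x, self.training)`.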
Comment on lines +225 to +236
Copilot AI, Mar 30, 2026

This AMP stability behavior change isn't covered by the existing EfficientNet unit tests. Please add a CUDA-only test that exercises forward() under torch.autocast("cuda", dtype=torch.float16) and asserts the guarded path is taken (e.g., by checking the returned feature dtype when return_features_only=True, or by instrumenting/patching model.features).
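A sketch of the requested CUDA-only check, written against a hypothetical toy stand-in rather than the real avex wrapper (which isn't imported here); the early return stands in for a pytest.mark.skipif when CUDA is absent:

```python
import torch


class ToyEffNet(torch.nn.Module):
    """Hypothetical stand-in for the EfficientNet wrapper under test."""

    def __init__(self) -> None:
        super().__init__()
        self.features = torch.nn.Conv2d(1, 4, kernel_size=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same guard shape as the PR's forward()
        needs_guard = (
            x.is_cuda
            and torch.is_autocast_enabled()
            and torch.get_autocast_dtype("cuda") == torch.float16
        )
        if needs_guard:
            with torch.autocast(device_type="cuda", enabled=False):
                return self.features(x.float())
        return self.features(x)


def test_fp16_autocast_guard() -> str:
    if not torch.cuda.is_available():
        return "skipped"  # CUDA-only, mirroring @pytest.mark.skipif
    model = ToyEffNet().cuda()
    x = torch.randn(1, 1, 8, 8, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out = model(x)
    # The guarded path runs in FP32, so features come back float32 rather
    # than the float16 that autocast would otherwise produce.
    assert out.dtype == torch.float32
    return "passed"


print(test_fp16_autocast_guard())
```

Asserting on the returned dtype is the lighter-weight of the two options the review suggests; patching model.features with a recording mock would additionally confirm the call went through the guarded branch.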

     # Return unpooled spatial features if requested
     if self.return_features_only: