[megatron, trainer] fix: respect calculate_entropy config in megatron actor update#6016

Merged
wuxibin89 merged 2 commits into verl-project:main from MaxwellJryao:fix/megatron-actor-calculate-entropy
Apr 16, 2026

Conversation

@MaxwellJryao (Contributor)

What does this PR do?

Fixes calculate_entropy config being ignored in megatron_actor.update_actor and ray_trainer._update_actor (legacy_worker_impl=disable path).

Previously, both only checked entropy_coeff != 0 to decide whether to compute entropy during training. This meant setting calculate_entropy=True had no effect when entropy_coeff=0, unlike dp_actor, which already respects the calculate_entropy flag (line 586):

# dp_actor (correct behavior)
calculate_entropy = self.config.calculate_entropy or (entropy_coeff != 0)

This is especially problematic in bypass_mode where _compute_old_log_prob is skipped entirely — there was no way to get actor/entropy metrics without also adding entropy to the loss.

Not duplicating an existing PR: searched for `calculate_entropy megatron` and `entropy bypass_mode`; no related open PRs found.

AI assistance was used (Claude) for code analysis and patch generation. All changes have been reviewed and validated by a human.

Checklist Before Starting

  • I have searched for similar PRs
  • PR title follows [{modules}] {type}: {description} format

Design & Code Changes

Three minimal changes to align megatron actor with dp_actor behavior:

  1. verl/workers/actor/megatron_actor.py (update_actor, line 791):

    # Before:
    calculate_entropy = self.config.entropy_coeff != 0
    # After:
    calculate_entropy = self.config.get("calculate_entropy", False) or (self.config.entropy_coeff != 0)
  2. verl/workers/actor/megatron_actor.py (loss_func, line 550-557): Decouple entropy metric logging from entropy loss. Always log actor/entropy when calculate_entropy=True; only add entropy to policy_loss when entropy_coeff != 0.

    # Before: unconditionally modifies loss
    policy_loss = pg_loss - entropy_coeff * entropy_loss
    # After: log metric first, only modify loss if needed
    stats["actor/entropy"] = entropy_loss.detach().item()
    if entropy_coeff != 0:
        policy_loss = pg_loss - entropy_coeff * entropy_loss
  3. verl/trainer/ppo/ray_trainer.py (_update_actor, line 1227):

    # Before:
    calculate_entropy = self.config.actor_rollout_ref.actor.entropy_coeff != 0.0
    # After:
    calculate_entropy = self.config.actor_rollout_ref.actor.get("calculate_entropy", False) or (
        self.config.actor_rollout_ref.actor.entropy_coeff != 0.0
    )
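The three changes above reduce to two small pieces of logic. A minimal sketch of both, using plain Python; the helper names `resolve_calculate_entropy` and `combine_loss` are hypothetical stand-ins for illustration, not the actual verl functions:

```python
# Sketch of the patched logic. Only the two conditions mirror the patch;
# everything else (names, plain-float types) is illustrative.

def resolve_calculate_entropy(config: dict, entropy_coeff: float) -> bool:
    # Changes 1 and 3: compute entropy if explicitly requested via the
    # calculate_entropy flag OR if the loss needs it (nonzero coefficient).
    return config.get("calculate_entropy", False) or (entropy_coeff != 0)

def combine_loss(pg_loss: float, entropy_loss: float,
                 entropy_coeff: float, stats: dict) -> float:
    # Change 2: always log the entropy metric once entropy was computed...
    stats["actor/entropy"] = entropy_loss
    # ...but only fold it into the policy loss when the coefficient is nonzero.
    policy_loss = pg_loss
    if entropy_coeff != 0:
        policy_loss = pg_loss - entropy_coeff * entropy_loss
    return policy_loss
```

The key design point is the decoupling: metric logging depends on whether entropy was computed at all, while loss modification depends only on `entropy_coeff`.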

Backward Compatibility

| User Config | Before | After |
| --- | --- | --- |
| `calculate_entropy=False`, `entropy_coeff=0` (default) | No entropy computed | No entropy computed (unchanged) |
| `calculate_entropy=False`, `entropy_coeff!=0` | Entropy computed + in loss | Entropy computed + in loss + metric logged (loss unchanged) |
| `calculate_entropy=True`, `entropy_coeff=0` | Ignored | Entropy computed, metric logged, loss unchanged ✅ |
| `calculate_entropy=True`, `entropy_coeff!=0` | Entropy computed + in loss | Entropy computed + in loss + metric logged (loss unchanged) |
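The four rows of the table can be checked mechanically. A small sketch; the condition is copied from the patch, while `behavior` and its return shape are illustrative only:

```python
from itertools import product

def behavior(calculate_entropy: bool, entropy_coeff: float):
    # Condition from the patch: entropy is computed if the flag is set
    # or the loss term needs it.
    computed = calculate_entropy or (entropy_coeff != 0)
    # Entropy enters the loss only when the coefficient is nonzero.
    in_loss = computed and entropy_coeff != 0
    return computed, in_loss

# Enumerate the four configurations from the table.
for flag, coeff in product([False, True], [0.0, 0.01]):
    print(f"calculate_entropy={flag}, entropy_coeff={coeff}: "
          f"computed={behavior(flag, coeff)[0]}, in_loss={behavior(flag, coeff)[1]}")
```

Only the `calculate_entropy=True, entropy_coeff=0` row changes behavior: entropy is now computed and logged without entering the loss.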

Test

  • pre-commit run passed all 12 checks (ruff, ruff format, mypy, config generation, license, device API, DataProto usage, naming conventions, compile, etc.)
  • This change is a minimal logic fix (3 lines of condition change + 2 lines of metric logging) that aligns megatron_actor with the existing dp_actor behavior. It does not introduce new APIs or change existing behavior for users who don't set calculate_entropy=True.

Checklist Before Submitting

  • Read the Contribute Guide
  • Pre-commit checks applied and all passed
  • CI tests (existing tests should cover this; no new API introduced)

…date

Previously, megatron_actor.update_actor and ray_trainer._update_actor
only checked entropy_coeff != 0 to decide whether to compute entropy
during training. This meant setting calculate_entropy=True in the actor
config had no effect when entropy_coeff=0, unlike dp_actor which already
respected the calculate_entropy flag.

This is especially problematic in bypass_mode where _compute_old_log_prob
is skipped entirely — there was no way to get actor/entropy metrics
without also adding entropy to the loss.

Changes:
- megatron_actor: check calculate_entropy config in addition to
  entropy_coeff, consistent with dp_actor behavior
- megatron_actor loss_func: log actor/entropy metric regardless of
  entropy_coeff; only add entropy to loss when entropy_coeff != 0
- ray_trainer: same calculate_entropy logic fix for the
  use_legacy_worker_impl=disable path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the PPO trainer and Megatron actor to allow entropy calculation and logging even when the entropy coefficient is zero, controlled by a new calculate_entropy configuration flag. The logic for the calculate_entropy variable was updated in both ray_trainer.py and megatron_actor.py, and the Megatron actor now logs entropy statistics independently of the loss calculation. I have no feedback to provide.

wuxibin89 previously approved these changes Apr 16, 2026
@wuxibin89 wuxibin89 merged commit 45bfd3c into verl-project:main Apr 16, 2026
66 of 73 checks passed