feat: Support Linear Cross Entropy fused kernel #1322
Conversation
Code Review
This pull request introduces a fused linear cross-entropy (LCE) optimization using Triton kernels to avoid materializing large logit tensors, which significantly reduces memory overhead and improves performance for the Megatron backend. The implementation includes a context manager for capturing hidden states, a custom autograd function, and comprehensive benchmarking and testing suites. Review feedback identifies opportunities to improve telemetry accuracy by using actual logit values instead of logprobs, suggests moving kernel alignment assertions into a compatibility check for graceful fallbacks, and recommends using in-place operations in the backward pass for better efficiency.
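For orientation, the capture-plus-autograd design the summary describes could look roughly like the sketch below. All names are illustrative, not the actual AReaL API, and this reference version still materialises logits; the real Triton kernel fuses the matmul and log-softmax to avoid that.

```python
import torch

class LinearCrossEntropySketch(torch.autograd.Function):
    """Minimal reference sketch of a linear cross-entropy autograd Function."""

    @staticmethod
    def forward(ctx, hidden, weight, labels):
        # hidden: [tokens, d_model], weight: [vocab, d_model], labels: [tokens]
        logits = hidden.float() @ weight.float().t()
        logprobs = torch.log_softmax(logits, dim=-1)
        ctx.save_for_backward(hidden, weight, labels)
        # Negative log-likelihood per token.
        return -logprobs.gather(1, labels.unsqueeze(1)).squeeze(1)

    @staticmethod
    def backward(ctx, grad_out):
        hidden, weight, labels = ctx.saved_tensors
        # Recompute instead of storing the [tokens, vocab] tensor.
        logits = hidden.float() @ weight.float().t()
        d_logits = torch.softmax(logits, dim=-1)
        d_logits[torch.arange(labels.numel(), device=labels.device), labels] -= 1.0
        d_logits *= grad_out.unsqueeze(1).float()
        d_hidden = (d_logits @ weight.float()).to(hidden.dtype)
        d_weight = (d_logits.t() @ hidden.float()).to(weight.dtype)
        return d_hidden, d_weight, None
```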
| Case | rtol | atol |
|---|---|---|
| Forward float32 | 1e-5 | 1e-5 |
| Forward bfloat16 | 2e-2 | 2e-2 |
| Forward float16 | 1e-2 | 1e-2 |
| Temperature float32 | 1e-5 | 1e-5 |
| Backward hidden.grad | 1e-4 | 1e-4 |
| Backward weight.grad small/medium | 1e-4 | 1e-4 |
| Backward weight.grad large | 1e-4 | 5e-4 |
These tolerances are strict enough to catch real numerical regressions while allowing expected low-precision accumulation drift.
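For illustration, tolerances like these map directly onto `torch.testing.assert_close`; a minimal sketch (the helper and its arguments are hypothetical, not the test suite's actual code):

```python
import torch

# Per-case tolerances from the table above.
TOLS = {
    torch.float32: dict(rtol=1e-5, atol=1e-5),
    torch.bfloat16: dict(rtol=2e-2, atol=2e-2),
    torch.float16: dict(rtol=1e-2, atol=1e-2),
}

def check_forward(fused_out, ref_out, dtype):
    # Compare fused-kernel output against the materialised reference path.
    torch.testing.assert_close(fused_out, ref_out, **TOLS[dtype])
```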
Benchmark Purpose

The benchmark measures fused vs. materialised latency and peak memory at the kernel level. This provides a focused way to evaluate the kernel-level benefit independently from full end-to-end training noise; the sketch below shows the general shape of such a harness.
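A generic timing-and-memory harness sketch, assuming a CUDA device (this is not the benchmark script's actual code):

```python
import torch

def bench(fn, *args, warmup=5, iters=20):
    """Average latency (ms) and peak memory (bytes) of a CUDA callable."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters, torch.cuda.max_memory_allocated()
```

Running both the fused and the materialised path through the same harness isolates the kernel's latency/memory benefit from everything else in the training loop.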
Future FSDP Adaptation

The current pull request implements the core fused LCE capability and integrates it into the Megatron engine first. After this PR is merged, the same design can be adapted to the FSDP engine.
garrett4wade
left a comment
LGTM except for several coding style issues; we can make the code look much better.
Also, please fix the pre-commit error with `pre-commit run --all-files`.
```python
fused_weight = mb_input.orig_mb.get(FUSED_LCE_WEIGHT_KEY)
if (
    fused_weight is not None
    and output.dtype != fused_weight.dtype
):
    output = output.to(fused_weight.dtype)
mb_input.orig_mb[FUSED_LCE_HIDDEN_KEY] = output
```
Since we usually require fp32 logits, will this downcast operation cause a precision issue?
The fused LCE kernel internally accumulates the matrix multiplication in fp32. Therefore, even with bf16 input hidden states, the precision of the logits and log-softmax computations within the kernel remains fully preserved in fp32.
In practice, the non-fused computation path follows:
bf16 hidden → bf16 matmul → bf16 logits → fp32 logits (upcast by Float16Module) → fp32 log-softmax.
In contrast, the fused path maintains fp32 accumulation throughout the entire computation, ensuring its numerical precision is at least on par with, if not better than, the non-fused baseline.
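A toy PyTorch comparison of the two paths (illustrative only; the real kernel performs the fp32 accumulation inside Triton rather than by upcasting inputs):

```python
import torch

hidden = torch.randn(4, 8).bfloat16()
weight = torch.randn(16, 8).bfloat16()

# Non-fused baseline: logits are produced in bf16, then upcast.
baseline = torch.log_softmax((hidden @ weight.t()).float(), dim=-1)

# Fused-path behaviour: the matmul accumulates in fp32 throughout.
fused_like = torch.log_softmax(hidden.float() @ weight.float().t(), dim=-1)

# Any gap comes from the rounding incurred when logits are stored in bf16.
print((baseline - fused_like).abs().max())
```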
@garrett4wade Thank you for the feedback. I've updated the code based on your suggestions and resolved the pre-commit error. Looking forward to your review! Note: The pre-commit check in CI is still failing due to a bad commit message. This is expected and can be safely ignored.







Description
Adds a fused Linear Cross Entropy (LCE) path for Megatron training to avoid materialising full `[tokens, vocab]` logits.

Key changes:

- Fused Triton kernels compute `logprobs` and entropy without materialising the logits.
- The TP `d_hidden` all-reduce is handled in backward.

Related Issue
Fixes #TBD
Type of Change
Checklist

- Pre-commit checks pass (`pre-commit run --all-files`).
- Docs build successfully (`./docs/build_all.sh`).
- Branch is up to date with `main`.
- PR reviewed/created via the `/review-pr` and `/create-pr` commands.

Breaking Change Details (if applicable):
N/A
Additional Context
Key files:

- `areal/utils/kernel/kernels.py`: implements the Triton fused LCE kernels, including forward logprob/entropy computation and split-N backward.
- `areal/utils/kernel/linear_cross_entropy.py`: exposes the fused LCE autograd function and handles the TP `d_hidden` all-reduce in backward.
- `areal/utils/functional/linear_cross_entropy.py`: provides AReaL-facing wrappers with fallback to the materialised reference path.
- `areal/engine/megatron_utils/fused_lce_capture.py`: captures LM-head hidden states and weights without materialising logits.
- `areal/engine/megatron_engine.py`: wires fused LCE into the Megatron training/logprob path behind `actor.use_fused_linear_ce`.
- `tests/test_linear_cross_entropy.py` and `tests/torchrun/run_lce_tp2.py`: cover single-GPU and TP=2 correctness/performance checks.
- `benchmark/bench_linear_cross_entropy.py`: provides standalone fused vs materialised latency/memory benchmarking, including TP mode.

Need help? Check the Contributing Guide or ask in
https://github.com/inclusionAI/AReaL/discussions!