
Running lowermodulelds multiple times needs to be thinlto only, breaking fortran at present #122891

Open
JonChesterfield opened this issue Jan 14, 2025 · 1 comment
Labels
LTO Link time optimization (regular/full LTO or ThinLTO)

Comments

@JonChesterfield
Collaborator

As of #85626 and #75333, the lowermodulelds codegen pass is run as part of LTO. That doesn't work: the pass was designed to run once, as part of codegen, where it can globally allocate variables to the LDS space.

There is a narrow exemption carved out to unblock ThinLTO: provided the IR module is carved up into independent modules prior to codegen, with no calls or references between them, running the allocator on each subgraph works fine, and a second run during codegen over the entire module is a no-op. There's a partial check to notice when that invariant is breached: if the input IR has some allocated variables and some non-allocated variables, the pass aborts. That's the error message which @ergawy reported to me for Fortran.
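
The partial check described above can be sketched as follows. This is a hypothetical model, not the pass's actual code: `LDSGlobal`, `PassAction`, and `decide` are invented names for illustration, standing in for globals that do or do not already carry an `!absolute_symbol` annotation.

```
#include <cassert>
#include <vector>

// Hypothetical model: an LDS global either was already assigned an address
// by a previous run of the pass (IsAbsolute) or was not.
struct LDSGlobal {
  bool IsAbsolute;
};

enum class PassAction {
  RunAllocator, // nothing allocated yet: the normal single run
  NoOp,         // everything already allocated: a second run does nothing
  Abort         // mixed state: the ThinLTO invariant was breached
};

// Sketch of the decision only; the real pass does much more.
PassAction decide(const std::vector<LDSGlobal> &Globals) {
  bool AnyAbsolute = false, AnyPlain = false;
  for (const auto &G : Globals)
    (G.IsAbsolute ? AnyAbsolute : AnyPlain) = true;
  if (AnyAbsolute && AnyPlain)
    return PassAction::Abort; // the failure mode reported for Fortran
  return AnyAbsolute ? PassAction::NoOp : PassAction::RunAllocator;
}
```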

I think the pass should be added to the ThinLTO pipeline and removed from the full LTO pipeline, and otherwise only run once, during codegen.

Arguments could be made that the pass should cope with being run multiple times on various bits of IR that are later spliced together. The problem with running on subgraphs is that the reachability test can't be done: we don't know whether a call to an external function accesses some visible LDS, or whether a function can be called by an external kernel. A correct lowering then looks like a lot of table lookups and overallocation, at which point the user is likely to discover they've run out of LDS and/or have occupancy problems.
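
The overallocation problem can be illustrated with a toy model. `Kernel`, `ldsBytesFor`, and the surrounding types are hypothetical names invented for this sketch; the real analysis is far more involved. The point is only that any call leaving the subgraph forces the conservative answer "everything might be reached":

```
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical model: a kernel in one subgraph of a partitioned module.
// DirectLDS is the LDS it provably touches; if it calls any function
// defined outside the subgraph, reachability is unknowable from here.
struct Kernel {
  std::set<std::string> DirectLDS;
  bool CallsExternal;
};

// A "correct" lowering on a subgraph must assume an external callee can
// touch any externally visible LDS variable, so it overallocates.
size_t ldsBytesFor(const Kernel &K,
                   const std::map<std::string, size_t> &VisibleLDS) {
  size_t Total = 0;
  if (K.CallsExternal) {
    for (const auto &Entry : VisibleLDS)
      Total += Entry.second; // conservative: everything might be reached
    return Total;
  }
  for (const auto &Name : K.DirectLDS)
    Total += VisibleLDS.at(Name); // precise: only what it actually uses
  return Total;
}
```

With an 8-byte and a 16-byte variable visible, a kernel using only the first needs 8 bytes when the whole module is visible, but must reserve all 24 the moment it calls out of the subgraph; that reserved-but-unused space is where the LDS exhaustion and occupancy problems come from.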

Tagging Joseph as well since I think he moved the openmp pipeline to use LTO at some point, in which case that's also vulnerable to this pattern. I suspect we're getting by at present because O0 doesn't get much use and most compilation flows are straightforward, e.g. they don't run opt on bits of IR manually.

@JonChesterfield
Collaborator Author

JonChesterfield commented Jan 14, 2025

The bug is in AMDGPUTargetMachine:

```
  PB.registerFullLinkTimeOptimizationLastEPCallback(
      [this](ModulePassManager &PM, OptimizationLevel Level) {
        // We want to support the -lto-partitions=N option as "best effort".
        // For that, we need to lower LDS earlier in the pipeline before the
        // module is partitioned for codegen.
        if (EnableSwLowerLDS)
          PM.addPass(AMDGPUSwLowerLDSPass(*this));
        if (EnableLowerModuleLDS)
          PM.addPass(AMDGPULowerModuleLDSPass(*this));
        if (Level != OptimizationLevel::O0) {
          // Do we really need internalization in LTO?
          if (InternalizeSymbols) {
            PM.addPass(InternalizePass(mustPreserveGV));
            PM.addPass(GlobalDCEPass());
          }
          if (EnableAMDGPUAttributor)
            PM.addPass(AMDGPUAttributorPass(*this));
        }
      });
```

Fortran ROCm build unblocked by ROCm#243

@EugeneZelenko EugeneZelenko added LTO Link time optimization (regular/full LTO or ThinLTO) and removed new issue labels Jan 14, 2025
ergawy added a commit to ergawy/llvm-project that referenced this issue Jan 14, 2025
This is a workaround to resolve https://ontrack-internal.amd.com/browse/SWDEV-502923

The AMDGPULowerModuleLDS pass is run 2 times. Both times are run on the
combined/linked module for the entire input during LTO.

The first run happens when lto::backend executes the optimization
pipeline (by calling opt)
(https://github.com/llvm/llvm-project/blob/main/llvm/lib/LTO/LTOBackend.cpp#L551)
and the second run happens when lto::backend later calls
codegen (https://github.com/llvm/llvm-project/blob/main/llvm/lib/LTO/LTOBackend.cpp#L558).

A crash happens in the second run because between the first and second
runs, a new GV is added to the combined module:

```
@llvm.amdgcn.kernel.__omp_offloading_fc00_600030__QQmain_l5.lds = internal addrspace(3) global %llvm.amdgcn.kernel.__omp_offloading_fc00_600030__QQmain_l5.lds.t poison, align 16, !absolute_symbol !0
```

This is the only absolute variable and it is not there yet during the
first run of AMDGPULowerModuleLDS.

In addition to the above absolute variable, we have other GVs that
remain in the combined module (during the second run of the pass):
```
@_ZZ35__kmpc_nvptx_teams_reduce_nowait_v2E14ChunkTeamCount = internal unnamed_addr addrspace(3) global i32 undef, align 4
@_ZZ35__kmpc_nvptx_teams_reduce_nowait_v2E5Bound = internal unnamed_addr addrspace(3) global i32 undef, align 4
```
which are not absolute, causing the issue (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUMemoryUtils.cpp#L250).

Discussing this with Jon Chesterfield: running the pass twice is actually
incorrect (it should be run only once, during code-gen; the first run of
the pass creates
"@llvm.amdgcn.kernel.__omp_offloading_fc00_600030__QQmain_l5.lds", which
makes the second run fail). Commenting out these 2 lines "fixes" the
issue:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp#L863-L864.
We raised an issue upstream to further discuss proper fixes:
llvm#122891.
ergawy added a commit to ROCm/llvm-project that referenced this issue Jan 14, 2025