feat[cartesian]: DaCe bridge refactor: OIR -> TreeIR -> ScheduleTree -> SDFG #2067
Merged
Changes from 137 commits
159 commits
a7b6bbf
WIP: skeleton of new oir -> stree -> sdfg workflow
romanc 47fe7e9
WIP: basic skeleton for mapping i, j, and k loops
romanc 8dddcd5
WIP: move tasklet generation into its own visitor
romanc 1f41d90
WIP: passing fields to tree descriptor repo
romanc ec5b16b
WIP: First version of memlet genration
romanc db6493c
WIP: use symbols where symbols should be used
romanc 3f7cfeb
Flip `symbols` to be a mapping between type and symbol name.
FlorianDeconinck 4094e32
Cast / BinaryOp / Literal
FlorianDeconinck cb50076
Fix in tasklet field name for negative offset
FlorianDeconinck a9c4086
Fix scoping of intervals/vertical loop
FlorianDeconinck 06b8070
(WIP) Adding FORWARD/BACKWARD serial loop (interval range badly compu…
FlorianDeconinck 8f78f1d
Raise in all visitor in `oir_to_tasklet`
FlorianDeconinck 73b3d3d
Update dace with correct symbol dict type
romanc c25e9dc
Fix sequential vertical loops
romanc ab4e884
Add support for passing parameters
romanc bf93566
WIP: Support for variable K offset (reads)
romanc 6f59f9e
UnaryOp & remaining visitor clean up pass
FlorianDeconinck 481e6ca
Add scalar inputs & unary operation example stencils
FlorianDeconinck ec9ecc6
Native Functions + LocalScalar
FlorianDeconinck bd252ae
Fix glue code (tm) for Temporary using magic function to swap strides…
FlorianDeconinck 2af4531
If/Else in `stree` (WIP SDFG)
FlorianDeconinck 4559ab8
Clean `TreeIR` of redudant `children` when using `TreeScope`
FlorianDeconinck 750dd4f
While (tree + SDFG)
FlorianDeconinck f78dba2
Refactor ScopeManager -> ContextPushPop to match oir_to_treeir
FlorianDeconinck 52e57a7
Regions
FlorianDeconinck 39b68c5
Use proper symbol from `dcir.Axis`
FlorianDeconinck fe66b5a
Ternary op & cast on expression for OirToTreeIR
FlorianDeconinck 195014f
Lint
FlorianDeconinck 9355633
Clean up Mask visitation with generic visit
FlorianDeconinck c31ca1c
Revert test generation only dace:cpu
romanc dc42e65
Fix: don't try to specialize transient strides for scalars
romanc 910dc2e
tmp: run ci with stree branch
romanc d691e11
tmp ci: skip gt4py/next tests
romanc 1f2c53b
fix offset calculation in oir -> treeir
romanc a3945ff
fix translate unary operations from oir -> treeir
romanc 7230997
Fix built-in literals (true/false) in tasklet code
romanc 31a0f1e
fix: ctx is an expected keyword arg
romanc 7c111e6
Fix: translate built-in literals in oir -> treeir step
romanc ad4dfc4
Support for 2d field access in if/while condition
romanc 634afd8
keep dimensionality information in treeir
romanc d9719e5
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 0135576
trying to square uv.lock file again
romanc d34c7c7
Notes on how to update the duck-tape uv.lock file
romanc 71385a0
Preserve order of operations oir -> treeir
romanc f90374d
Fix: store dimensions of temporary arrays
romanc 140e922
Fix array shape and horizontal domain shift
romanc 90f2397
Fix: pass on all kwargs to lower visitors
romanc a237e89
Fix: variable K offset memlet max size, allow casts
romanc 0d13b4d
Duck-tape version of k-extent analysis
romanc 6b515e5
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc c6f278d
Fix: test_negative_origin_i
romanc ba9c979
Fix: Variable offset K are relative indices
romanc fc102ba
Fix: add k-shift in variable offset K
romanc 6fe86db
support for data dimensions in tasklet codegen
romanc 6afb779
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 6abad8a
While conditions are handled as a separate tasklet that outputs into …
FlorianDeconinck 4214697
Fix native @dace.program usage with stencils
FlorianDeconinck bd48520
Add `scalar` to wrapper
FlorianDeconinck 220f214
Fix types (and add some)
romanc 02efcd3
Guard against using code that will be torched soon
romanc 26f711e
General if statement with evaluation tasklet
romanc fd751aa
Fixed horizontal mask conditions
romanc 7ffd955
fix lazy stencil tests
romanc d0447c8
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 093c4eb
GPU residency: don't flip scalar to Global device array, because, you…
FlorianDeconinck dd20834
Flag more code for later deletion
FlorianDeconinck aee933a
Fix frozen SDFG caching bug - orchestration now specializes bounds co…
FlorianDeconinck e7b5e89
Lint
FlorianDeconinck 601d332
context manager for just tree ir
romanc 2452776
cleanly separate dace-cartesian and dace-next dependencies
romanc 592eaf5
Fixup: fix noxfile (and update docs)
romanc e1b2194
Specialize transient strides for wrapper SDFG
FlorianDeconinck 3ac83ca
Apply correct residency to arrays
FlorianDeconinck 8ee7849
Revert "Apply correct residency to arrays"
FlorianDeconinck 1a36cf3
Cleanup types
romanc cf1963f
Specialize transients in all levels of nested SDFGs
romanc 7e56965
Reapply "Apply correct residency to arrays"
romanc 50f851b
Test: Also move transients to the GPU
romanc ace4547
Set cpu/gpu map schedule
romanc 9ca5a5a
Docs
FlorianDeconinck 8271b38
Keep vertical loop schedule on the Host
FlorianDeconinck 4911689
Fix previous commit... GPU should be sequential
FlorianDeconinck b939046
Fix Cache opt skipping
FlorianDeconinck 8400f80
GPU tests running locally
romanc 409b5ae
Updated DaCe version (only report cycles if we find them)
romanc db55422
More protection aginst stuff we sholdn't run anymore
romanc cb179ed
Do not call tests of things that are about to be torched.
romanc 383e568
Fix wrapper edge/memlet to inner_sdfg need
FlorianDeconinck 0fb8dec
Merge remote-tracking branch 'romanc/oir-to-stree' into oir-to-stree
FlorianDeconinck 1c75fe3
Minor cleanup: just moving some code around
romanc 6a7632a
Remove dcir from new code by copying Axis from daceir to treeir
romanc 00623d8
The purge: delete old bridge
romanc be29182
Fix caching system by _not_ trashing the build_options deep downstrea…
FlorianDeconinck e3557f9
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc aabd6e5
Fix typo
romanc 55e14be
Shape `dace.Array` properly
FlorianDeconinck 658b266
Merge remote-tracking branch 'romanc/oir-to-stree' into oir-to-stree
FlorianDeconinck 525378d
Merge branch 'main' into oir-to-stree
FlorianDeconinck 027f7a9
Fix missing scalar in OIR parameter list by comparing it to the origi…
FlorianDeconinck 785c1af
Merge remote-tracking branch 'romanc/oir-to-stree' into oir-to-stree
FlorianDeconinck bd6d188
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc c294575
Undo changes to github workflows
romanc 378268a
Undo changes to top-level README
romanc 54c48c4
Update comments in pyproject toml
romanc bab881e
Undo debug changes
romanc 68c2db9
cleanups loading/saving sdfgs
romanc cb4a1dd
WIP: ADRs for the schedule tree feature
romanc fd6c1ab
Temporary fields are shaped ignoring regions
FlorianDeconinck 0282e1d
Merge remote-tracking branch 'romanc/oir-to-stree' into oir-to-stree
FlorianDeconinck 3c0beaa
Revert "Temporary fields are shaped ignoring regions"
FlorianDeconinck 3272330
Fix typo in template
romanc bb96e1f
first draft of schedule tree adr
romanc 6095ec0
Merge branch 'oir-to-stree' of github.com:romanc/gt4py into oir-to-stree
romanc a3765f8
Don't cache frozen_sdfg in cwd
romanc 2c16934
Update dace ADRS, add one for cuda backend
romanc 75186e9
Last pass on the ADRs (for now)
romanc cee91eb
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 03ab3db
Fixup: fix bad merge
romanc b073e3e
Type hints in dace_backend
romanc c979a73
Merge dace/symbol_utils into dace/utils
romanc 08365dd
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 9f250cf
Cleanup OIR -> TreeIR visitor init
romanc 1eda341
cleanups in oir_to_tasklet
romanc 95850e2
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 2b28cdd
Minor cleanup refactors
romanc eae0d09
When adding `scalar` to wrapper SDFG, properly flag transientness
FlorianDeconinck f958944
Merge remote-tracking branch 'romanc/oir-to-stree' into oir-to-stree
FlorianDeconinck 14daecc
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc a61ebdd
Merge branch 'main' into oir-to-stree
FlorianDeconinck c1a075f
Update DaCe version (remove debug print statements)
romanc fddd114
Block caching in SDFGManager in case of no-caching policy on the builder
romanc 06cd753
Fixup: caching
romanc 036589e
Change dace cpu layout to match loop structure
romanc 11ce4b6
Update Dace version (merge v1/maintenance)
romanc 5113142
No need to pass transients of the inner_sdfg into the inner_sdfg
romanc ceb90e0
Philips code review, part 2
romanc 34670b2
Fix the bugs introduced in the review commit
romanc 7069b7f
Use default allocation lifetime since we don't have infinite memory
romanc 9354fb3
Fixing typos in ADRs
romanc eaeaf1c
Save stree next to unspecialized sdfg
romanc 5c5414e
Fixup allocation lifetime of temporaries needs to be global
romanc 58352a4
Only save stree if configured. Allow a previous stree to exist
romanc f27c491
Fix SDFG wrapping for GlobalTables (no cartesian axis, only data dims)
FlorianDeconinck 5217ba9
Normalize field arguments, escape non fields
FlorianDeconinck b201f0a
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 4691619
Merge branch 'oir-to-stree' of github.com:romanc/gt4py into oir-to-stree
romanc 339d708
Fix linting issues
romanc b7677e9
Fix gt4py tests again
romanc 3aed349
Re-enable DaCe's DDE by patching it
romanc bf2be21
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc 66ded75
Move dace/stree branch to GridTools/dace fork
romanc 69a5673
Revert tempoary to Lifetime.Persistent to fix memory leak (temporary …
FlorianDeconinck c34ccc2
"one last" code review commit
romanc 3b5702e
Revert: dace_backend: get ranges only for defined axes
romanc f17fd7c
Append data dimensions index in the same tuple for Absolute Indexing
FlorianDeconinck 6ec132a
Revert "Append data dimensions index in the same tuple for Absolute I…
FlorianDeconinck 62e3361
Preliminary support for NView nodes
romanc 5a6ccfc
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc a2a7cd7
Merge remote-tracking branch 'origin/main' into oir-to-stree
romanc
17 changes: 17 additions & 0 deletions
docs/development/ADRs/cartesian/backend-cuda-feature-freeze.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| # Cuda backend: Feature freeze | ||
|
|
||
| In the context of (backend) feature development, facing maintainability/duplication concerns, we decided to put a feature freeze on the `cuda` backend and focus on the `dace:gpu` backends instead to keep the number of backends manageable. | ||
|
|
||
| ## Context | ||
|
|
||
| The introduction of the [`dace:*`](./backend-dace.md) backends brought up the question of backend redundancy. In particular, it seems that `cuda` and `dace:gpu` backends serve similar purposes. | ||
|
|
||
| `dace:gpu` backends not only generate code for different graphics cards, they also share substantial code paths with the `dace:cpu` backend. This simplifies (backend) feature development. | ||
|
|
||
| ## Decision | ||
|
|
||
| We decided to put a feature freeze on the `cuda` backend, focusing on the `dace:*` backends instead. While we don't drop the backend, new DSL features won't be available in the `cuda` backend. New features will error out cleanly and suggest using the `dace:gpu` backend instead. | ||
|
|
||
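As an illustration of "error out cleanly", the following is a minimal sketch of such a guard. All names (`FROZEN_BACKENDS`, `FeatureFrozenError`, `ensure_feature_supported`) are hypothetical and do not reflect GT4Py's actual implementation; the sketch only shows the idea of rejecting a new feature on a frozen backend while pointing the user to the suggested replacement.

```python
from __future__ import annotations

# Hypothetical mapping: frozen backend -> backend to suggest instead.
FROZEN_BACKENDS = {"cuda": "dace:gpu"}


class FeatureFrozenError(NotImplementedError):
    """Raised when a new DSL feature is requested on a frozen backend."""


def ensure_feature_supported(backend: str, feature: str, new_features: set[str]) -> None:
    """Raise a descriptive error if `feature` was added after the freeze.

    `new_features` is an assumed registry of features introduced after the
    feature freeze; older features keep working on the frozen backend.
    """
    replacement = FROZEN_BACKENDS.get(backend)
    if replacement is not None and feature in new_features:
        raise FeatureFrozenError(
            f"Feature '{feature}' is not available in the '{backend}' backend "
            f"(feature freeze). Consider using the '{replacement}' backend instead."
        )
```

With this shape, features that predate the freeze pass through unchanged, and only features listed as new trigger the error.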
| ## Consequences | ||
|
|
||
| While the `cuda` backend only targets NVIDIA cards, the `dace:*` backends can generate code for NVIDIA and AMD graphics cards. Furthermore, `dace:cpu` and `dace:gpu` backends share large parts of the transpilation layers because code generation is deferred to DaCe and depends only on the SDFG. This allows us to develop many (backend) features for the `dace:*` backends in one place. | ||
73 changes: 73 additions & 0 deletions
docs/development/ADRs/cartesian/backend-dace-schedule-tree.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| # DaCe backends: Schedule tree | ||
|
|
||
| In the context of [DaCe backends](./backend-dace.md), facing tech debt, a lack of understanding of the current stack, and underperforming map and state fusion, we decided to rewrite substantial parts of the DaCe backends with so-called "Schedule Trees" to achieve hardware-dependent macro-level optimizations (e.g. loop merging and loop re-ordering) at a new IR level before going down to SDFGs. We considered writing custom SDFG fusion passes and accept that we have to contribute a conversion from Schedule Tree to SDFG in DaCe. | ||
|
|
||
| ## Context | ||
|
|
||
| Three forces drove this drastic change: | ||
|
|
||
| 1. We were unhappy with the performance of the DaCe backends, especially on CPU. | ||
| 2. We had little understanding of the previous GT4Py-DaCe bridge. | ||
| 3. The previous GT4Py-DaCe bridge accumulated a lot of tech debt, making it clumsy to work with and hard to inject major changes. | ||
|
|
||
| ## Decision | ||
|
|
||
| We chose to directly translate GT4Py's optimization IR (OIR) to DaCe's schedule tree (and from there to SDFG and code generation) because this allows us to separate macro-level and data-specific optimizations. DaCe's schedule tree is ideally suited for schedule-level optimizations like loop re-ordering or loop merges with over-computation. The (simplified) pipeline looks like this: | ||
|
|
||
| ```mermaid | ||
| flowchart LR | ||
| oir[" | ||
| OIR | ||
| (GT4Py) | ||
| "] | ||
| treeir[" | ||
| Tree IR | ||
| (GT4Py) | ||
| "] | ||
| stree[" | ||
| Schedule tree | ||
| (DaCe) | ||
| "] | ||
| sdfg[" | ||
| SDFG | ||
| (DaCe) | ||
| "] | ||
| codegen[" | ||
| Code generation | ||
| (per target) | ||
| "] | ||
|
|
||
| oir --> treeir --> stree --> sdfg --> codegen | ||
| ``` | ||
|
|
||
| OIR to Tree IR conversion has two visitors in separate files: | ||
|
|
||
| 1. `dace/oir_to_treeir.py` transpiles control flow elements | ||
| 2. `dace/oir_to_tasklet.py` transpiles computations (i.e. bodies of control flow elements) | ||
|
|
||
| While this incurs some code duplication (e.g. resolving index accesses), it allows for separation of concerns: Everything that is related to the schedule is handled in `oir_to_treeir.py`. Note, for example, that we keep the distinction between horizontal masks and general `if` statements. This distinction is kept because horizontal regions might influence scheduling decisions, while general `if` statements do not. | ||
|
|
||
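The split between a schedule visitor and a tasklet visitor can be sketched as follows. This is a simplified illustration, not the code in `oir_to_treeir.py` or `oir_to_tasklet.py`: the node classes (`Assign`, `If`, `HorizontalMask`) and the dictionary-based tree are hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


# Hypothetical, heavily simplified stand-ins for OIR nodes.
@dataclass
class Assign:
    target: str
    expr: str


@dataclass
class If:
    condition: str
    body: list


@dataclass
class HorizontalMask:
    region: str
    body: list


class TaskletVisitor:
    """Translates computations (bodies of control flow) into code strings."""

    def visit(self, stmt: Assign) -> str:
        return f"{stmt.target} = {stmt.expr}"


class ScheduleVisitor:
    """Translates control flow and delegates computations to TaskletVisitor."""

    def __init__(self) -> None:
        self.tasklets = TaskletVisitor()

    def visit(self, node):
        if isinstance(node, HorizontalMask):
            # Kept distinct from `If`: regions may influence scheduling later.
            children = [self.visit(child) for child in node.body]
            return {"kind": "horizontal_mask", "region": node.region, "children": children}
        if isinstance(node, If):
            children = [self.visit(child) for child in node.body]
            return {"kind": "if", "condition": node.condition, "children": children}
        if isinstance(node, Assign):
            return {"kind": "tasklet", "code": self.tasklets.visit(node)}
        raise NotImplementedError(type(node).__name__)
```

The schedule visitor never inspects expressions, and the tasklet visitor never sees control flow, which is the separation of concerns described above.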
| The subsequent conversion from Tree IR to schedule tree is a straightforward visitor located in `dace/treeir_to_stree.py`. Notice the simplicity of that visitor. | ||
|
|
||
| ## Consequences | ||
|
|
||
| The schedule tree introduces a transpilation layer ideally suited for macro-level optimizations, which target the program's execution schedule. This is particularly interesting for the DaCe backends because we use the same backend pipeline to generate code for CPU and GPU targets. | ||
|
|
||
| In particular, the schedule tree makes it easy to re-order or otherwise modify the loop structure. This not only allows us to generate hardware-specific loop order and tile sizes, but also gives us fine-grained control over loop merges and/or which loops to generate in the first place. For example, going directly from OIR to Tree IR allows us to translate horizontal regions to either `if` statements inside a bigger horizontal loop (for small regions) or break them out into separate loops (for bigger regions) if that makes sense for the target architecture. | ||
|
|
||
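The choice between masking a horizontal region inside the full-domain loop and emitting a separate loop could look like the sketch below. The `Region` class, the `lower_region` function, and the `threshold` value are assumptions made for illustration; the ADR does not specify how (or at which size) the decision is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular index region in the horizontal plane."""

    i: range
    j: range

    def size(self) -> int:
        return len(self.i) * len(self.j)


def lower_region(region: Region, domain: Region, threshold: float = 0.25) -> str:
    """Pick a lowering strategy based on the region's share of the domain.

    The threshold is an arbitrary placeholder; a real heuristic would
    likely depend on the target architecture.
    """
    fraction = region.size() / domain.size()
    if fraction < threshold:
        # Small region: keep one big loop and guard the body with an `if`.
        return "mask_in_domain_loop"
    # Large region: a dedicated loop avoids evaluating the mask everywhere.
    return "separate_loop"
```

For a 100 x 100 domain, a one-column edge region covers 1% of the points and would be masked, while a region covering half the domain would get its own loop.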
| ## Alternatives considered | ||
|
|
||
| ### OIR -> SDFG -> schedule tree -> SDFG | ||
|
|
||
| - Seems smart because it allows us to keep the current OIR -> SDFG bridge, i.e. no need to write an OIR -> schedule tree bridge, | ||
| - but the first SDFG is unnecessary and translation times are a real problem | ||
| - and we were unhappy with the OIR -> SDFG bridge anyway | ||
| - and, in addition, we lose some context between OIR and schedule tree (e.g. horizontal regions). | ||
|
|
||
| ### Improve the existing SDFG map fusion | ||
|
|
||
| GT4Py next has gone this route and an improved version is merged in the mainline version of DaCe. We think we'll need a custom map fusion pass which lets us decide low-level things like under which circumstances over-computation is desirable. A general map fusion pass will never be able to allow this. | ||
|
|
||
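To make "over-computation" concrete, here is a small, generic sketch in plain Python (no DaCe API involved; the function names are made up for this example). Fusing the two loops removes the intermediate array `tmp`, at the price of recomputing its values for each consumer. Whether that trade is worthwhile depends on the cost of the recomputed expression and on the target hardware, which is the kind of decision a custom pass would need to make.

```python
def unfused(a: list[float]) -> list[float]:
    """Two loops: materialize `tmp`, then consume it with a 3-point access."""
    n = len(a)
    tmp = [2.0 * a[i] for i in range(n)]
    return [tmp[i - 1] + tmp[i + 1] for i in range(1, n - 1)]


def fused_with_overcomputation(a: list[float]) -> list[float]:
    """One loop: recompute the `tmp` values on the fly (no temporary array)."""
    n = len(a)
    out = []
    for i in range(1, n - 1):
        left = 2.0 * a[i - 1]   # recomputed; also needed by iteration i - 2
        right = 2.0 * a[i + 1]  # recomputed; also needed by iteration i + 2
        out.append(left + right)
    return out
```

Both functions produce the same result; the fused version performs roughly twice the multiplications but needs no storage for `tmp`.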
| ### Write custom map fusion based on SDFG syntax | ||
|
|
||
| Possible, but a lot more cumbersome than writing the same transformation based on the schedule tree syntax. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| # DaCe backends: DaCe version | ||
|
|
||
| In the context of the [DaCe backend](./backend-dace.md) and the [schedule tree](./backend-dace-schedule-tree.md), facing time pressure, we decided to stay on the `v1.x` branch of DaCe to minimize up-front cost and deliver CPU performance as fast as possible. We considered updating to the mainline version of DaCe and accept the follow-up cost of partial rewrites once DaCe `v2` releases. | ||
|
|
||
| ## Context | ||
|
|
||
| The currently released version of DaCe is on the `v1.x` branch. However, the mainline branch moved on (with breaking changes) to what is supposed to be DaCe `v2`. All feature development is supposed to be merged against mainline. Only bug fixes are allowed on the `v1.x` branch. | ||
|
|
||
| The [schedule tree](./backend-dace-schedule-tree.md) feature will need changes in DaCe, in particular to translate schedule trees into SDFG. We are unfamiliar with the breaking changes in DaCe. | ||
|
||
|
|
||
| ## Decision | ||
|
|
||
| We decided to build a first version of the schedule tree feature against the `v1.x` version of DaCe. | ||
|
|
||
| ## Consequences | ||
|
|
||
| - We'll be able to code against a familiar API (e.g. same as the previous GT4Py-DaCe bridge). | ||
| - In DaCe, we won't be able to merge changes into `v1.x`. We'll work on a branch and later refactor the schedule tree -> SDFG transformation to control flow regions in DaCe `v2`. | ||
|
|
||
| ## Alternatives considered | ||
|
|
||
| ### Update to DaCe mainline first | ||
|
|
||
| - Good because mainline DaCe is accepting new features while `v1.x` is closed for new feature development. | ||
| - Bad because it incurs an up-front cost, which we are trying to minimize to get results fast. | ||
| - Bad because we aren't trained to use the new control flow regions. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # DaCe backends | ||
|
|
||
| In the context of performance optimization, facing the fragmentedness of NWP code, we decided to implement a backend based on DaCe to unlock full-program optimization. We accept the downside of having to maintain that (additional) performance backend. | ||
|
||
|
|
||
| ## Context | ||
|
|
||
| NWP codes aren't like your typical optimization problem homework where 80% of runtime is spent within a single stencil which you can then optimize to oblivion. Instead, computations in NWP codes are fragmented and scattered all over the place with parts in-between that move memory around. Stencil-only optimizations don't cut through this. DaCe allows us to do (data-flow) optimization on the full program, not only inside stencils. As a nice side-effect, DaCe offers code generation to CPU and GPU targets. | ||
|
|
||
| ## Decision | ||
|
|
||
| We chose to add DaCe backends, `dace:cpu` and `dace:gpu`, for CPU and GPU targets because we need full-program optimization to get the best possible performance. | ||
|
|
||
| ## Consequences | ||
|
|
||
| We will need to maintain the `dace:*` backends. If we keep adding more and more backends, maintainability will be a question down the road. We thus decided to put a [feature freeze](./backend-cuda-feature-freeze.md) on the `cuda` backend, focussing on `dace:*` backends instead. | ||
|
|
||
| Compared to the [`cuda` backend](./backend-cuda-feature-freeze.md), which only targets NVIDIA cards, we get support for both NVIDIA and AMD cards with the `dace:gpu` backends. | ||
|
|
||
| ## Alternatives considered | ||
|
|
||
| @Florian: Did we consider alternatives (back then)? | ||
|
|
||
| ## References | ||
|
|
||
| [DaCe Promo Website](http://dace.is/fast) | [DaCe GitHub](https://github.com/spcl/dace) | [DaCe Documentation](https://spcldace.readthedocs.io/en/latest/) | ||