forked from tensorflow/tensorflow
Develop upstream sync 20250113 #2802
Open
alekstheod wants to merge 1,286 commits into develop-upstream from develop-upstream-sync-20250113
+98,035 −39,380
Conversation
This should have been removed during an earlier refactoring. PiperOrigin-RevId: 713237188
This method was renamed but the staging function was kept; switch to the renamed variant. PiperOrigin-RevId: 713242393
Imported from GitHub PR openxla/xla#19649

The goal of this change is to introduce an external dependency on the ROCm library and tools. Building XLA with the hermetic ROCm is done by using these env variables:
--repo_env=OS=ubuntu_20.04 --repo_env=ROCM_VERSION=6.2.0

To use only the hermetic libs, define this flag:
--@local_config_rocm//rocm:use_rocm_hermetic_rpath=True
This flag makes rpaths and configs look inside the sandbox. If the flag is not set, the default installation paths are used, e.g. /opt/rocm.

One has to provide the OS version and the ROCm version to initialize a proper ROCm repository. If these flags are not set, the default ROCm installation will be used to build XLA.

depends-on: openxla/xla#19691

Copybara import of the project:
-- cf744eca78f697144e122c6a9d1aa8fc52722b20 by Alexandros Theodoridis <[email protected]>: Implement hermetic rocm dependency
-- 4f4ad859ec3143fdb04f7792541c61b98c708397 by Alexandros Theodoridis <[email protected]>: Add missing dependency
-- 8e164f765b45b5e5d118b02695fd6d6e2b0b232d by Alexandros Theodoridis <[email protected]>: Add missing dependency and remove so files from data
-- 35538f4922b5b28b9debd0ce17bb15b83b5921fc by Alexandros Theodoridis <[email protected]>: Rename setting to use_rocm_hermetic_rpath
-- 58d140220e9e58572c9a7ae3de2ec1ea189566d3 by Alexandros Theodoridis <[email protected]>: Fix build for cuda and cpu

Merging this change closes tensorflow#19649

PiperOrigin-RevId: 713248195
PiperOrigin-RevId: 713248622
`std::mismatch` should be called with an end iterator for the second range (the four-iterator overload) when there is no guarantee on the element count of the second range. PiperOrigin-RevId: 713264159
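A minimal sketch of the pattern described above (illustrative only, not the XLA code; the names are hypothetical): the four-iterator overload stops at whichever range ends first, so a shorter second range is never read out of bounds.
```cpp
#include <algorithm>
#include <vector>

// Illustrative only: with the four-iterator overload, std::mismatch stops at
// whichever range ends first, so a shorter second range is never overrun.
bool HasCommonPrefix(const std::vector<int>& a, const std::vector<int>& b) {
  auto [it_a, it_b] = std::mismatch(a.begin(), a.end(), b.begin(), b.end());
  // Either one range was fully consumed, or the first mismatch was found.
  return it_a == a.end() || it_b == b.end();
}

int main() {
  std::vector<int> a = {1, 2, 3};
  std::vector<int> b = {1, 2};  // shorter second range: safe with the 4-arg form
  return HasCommonPrefix(a, b) ? 0 : 1;
}
```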
Move comparison of executable != nullptr _before_ calling std::move(executable). This is really only used for logging, but definitely adds confusion to the logs when it's always 0 :). PiperOrigin-RevId: 713272260
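A hedged sketch of the ordering issue (names are hypothetical, not taken from the XLA sources): the null check must be evaluated before `std::move` leaves the local `unique_ptr` empty, otherwise the log always reports 0.
```cpp
#include <cstdio>
#include <memory>
#include <utility>

struct Executable {};  // hypothetical stand-in

void Launch(std::unique_ptr<Executable> exe) { /* consumes the executable */ }

void Run(std::unique_ptr<Executable> executable) {
  // Check before the move: after std::move(executable) the local pointer is
  // empty, so logging it afterwards would always print 0.
  const bool has_executable = (executable != nullptr);
  Launch(std::move(executable));
  std::printf("has_executable=%d\n", has_executable);
}

int main() {
  Run(std::make_unique<Executable>());
  Run(nullptr);
}
```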
…andle and SupportsSendRecvCallbacks PiperOrigin-RevId: 713276521
…Usage It was always set to false by the callers. PiperOrigin-RevId: 713277020
….cseConstants=false` to avoid constant folding and CSE which is expensive. PiperOrigin-RevId: 713277781
…lso be used in vectorizing AtomicRMW in follow-up changes. PiperOrigin-RevId: 713281944
PiperOrigin-RevId: 713282226
The algorithm was checking whether to write to the output by comparing the current slice index with the number of indices per warp. This works only when the indices are perfectly tiled, e.g. 50 indices per warp with a total of 2000 indices. As soon as we have 2001 indices, the last warp processes 1 update slice but never writes it out. Also simplified the logic for the update loop that accumulates elements in registers: instead of an scf.if inside the xla.loop, we now have two different xla.loops in the two branches of an scf.if, which either overwrite the accumulator or combine it with the new data. PiperOrigin-RevId: 713296321
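A simplified host-side sketch of the boundary condition (hypothetical numbers, not the MLIR emitter): the last warp may own fewer slices than `indices_per_warp`, so the write guard must come from the remaining index count rather than the full tile size.
```cpp
#include <algorithm>
#include <cstdio>

int main() {
  const int total_indices = 2001;  // the 2001-index case from the description
  const int indices_per_warp = 50;
  const int num_warps =
      (total_indices + indices_per_warp - 1) / indices_per_warp;  // ceil div

  int written = 0;
  for (int warp = 0; warp < num_warps; ++warp) {
    const int begin = warp * indices_per_warp;
    // Bound by the remaining count, not by indices_per_warp, so the last
    // warp's single update slice is still written out.
    const int end = std::min(begin + indices_per_warp, total_indices);
    for (int slice = begin; slice < end; ++slice) {
      ++written;
    }
  }
  std::printf("written %d of %d update slices\n", written, total_indices);
  return written == total_indices ? 0 : 1;
}
```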
… when adding vectorization for AtomicRMW which will only be available for Hopper. PiperOrigin-RevId: 713297711
…tives APIs to acquire communicator in CollectiveThunk Implement Cliques support for XLA:CPU collectives for consistency with XLA:GPU. Further unification will be in followup CLs. PiperOrigin-RevId: 713305764
In preparation for larger changes, this entry point is being disabled here for now. PiperOrigin-RevId: 713316210
PiperOrigin-RevId: 713318085
Changed to use signature_key for the Run() method's input/output maps, since it aligns with the other parameters. PiperOrigin-RevId: 713323123
…ing path. PiperOrigin-RevId: 713323821
PiperOrigin-RevId: 713330122
…when one CollectivePermute (cp) depends on the other. When we insert a control dependency from the send-start of one cp to the recv-start of another, we need to make sure that the cps are in post order. PiperOrigin-RevId: 713336414
PiperOrigin-RevId: 713346163
…sizes. PiperOrigin-RevId: 713353992
PiperOrigin-RevId: 7133574
…ll-to-all operation. The following example shows the detailed method.
```
base_shape: (32,32,32,32)
mesh: a=2, b=4
old sharding: P('a', 'b', None, None), local shape (16,8,32,32)
new sharding: P(None, None, 'a', 'b'), local shape (32,32,16,8)

// Step 1. Merge sharding axes to a single dimension
reshape (16,8,32,32) -> (16,8,2,16,4,8)
transpose (16,8,2,16,4,8) -> (2,4,16,8,16,8) with permutation (2,4,0,1,3,5)
reshape (2,4,16,8,16,8) -> (8,16,8,16,8)

// Step 2. Apply the all-to-all
all-to-all on (8,16,8,16,8) with split_dimension = 0

// Step 3. Split sharding axes to multiple dimensions
reshape (8,16,8,16,8) -> (2,4,16,8,16,8)
transpose (2,4,16,8,16,8) -> (2,16,4,8,16,8) with permutation (0,2,1,3,4,5)
reshape (2,16,4,8,16,8) -> (32,32,16,8)
```
PiperOrigin-RevId: 713362037
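A small standalone check of the shape bookkeeping in the example above (illustrative only, not the SPMD partitioner code): applying the stated permutation to the reshaped local shape reproduces the shape fed into the all-to-all, and the element count stays constant.
```cpp
#include <cstdio>
#include <vector>

// Apply a dimension permutation to a shape, as a transpose would.
std::vector<long> Permute(const std::vector<long>& shape,
                          const std::vector<int>& perm) {
  std::vector<long> out(shape.size());
  for (size_t i = 0; i < shape.size(); ++i) out[i] = shape[perm[i]];
  return out;
}

long NumElements(const std::vector<long>& shape) {
  long n = 1;
  for (long d : shape) n *= d;
  return n;
}

int main() {
  // Step 1 of the example: (16,8,32,32) reshaped to (16,8,2,16,4,8), then
  // transposed with permutation (2,4,0,1,3,5) to give (2,4,16,8,16,8).
  std::vector<long> reshaped = {16, 8, 2, 16, 4, 8};
  std::vector<long> transposed = Permute(reshaped, {2, 4, 0, 1, 3, 5});
  for (long d : transposed) std::printf("%ld ", d);  // prints 2 4 16 8 16 8
  std::printf("\nelements: %ld (== 16*8*32*32 = %ld)\n",
              NumElements(transposed), 16L * 8 * 32 * 32);
}
```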
PiperOrigin-RevId: 713372912
PiperOrigin-RevId: 713374730
…parameters. PiperOrigin-RevId: 713389880
PiperOrigin-RevId: 713394310
PiperOrigin-RevId: 713395731
This CL takes care of:
1. Migrating the targets
```
tensorflow/compiler/xla:test
tensorflow/compiler/xla:test_helpers
tensorflow/compiler/xla/service:pattern_matcher_gmock
```
to tensorflow/compiler/xla/hlo/testlib
2. Setting up build aliases in xla or xla/service/, ensuring external dependencies are still satisfied.

Phase II will take care of the migration of external projects' dependencies.

PiperOrigin-RevId: 713400473
…cies. PiperOrigin-RevId: 714845417
…ns in manual sharding group.

Imported from GitHub PR openxla/xla#20808

This is a small fix in GSPMD partitioning for collective-permute instructions added in a manual sharding group. In JAX, we can add a `ppermute` instruction in shard_map. In cases where shard_map has auto axes specified, collective-permuting an operand, even with the same sharding, ends up as an `all-gather` followed by a collective permute, which leads to inefficient collectives. The correct and efficient way is to partition the collective permute as an element-wise op.

The unit test added provides a repro. Also, the JAX unit test in https://github.com/jax-ml/jax/blob/fa9c7edf736516052df6eab22947bc627d0deca3/tests/shard_map_test.py#L2167 gives a real-world JAX example.

Copybara import of the project:
-- 8ee6ecd51f6e4aae8e3d92a6a439a60f53ab02ae by Yunlong Liu <[email protected]>: A hacky fix on partitioning collective permute.
-- e50e87696defb290f7561a7808ee42ebbc11e144 by Yunlong Liu <[email protected]>: Local change.
-- 84eb38597c783a4488774823c2c464296a8c54c7 by Yunlong Liu <[email protected]>: Simplifies sharding in tests.

Merging this change closes tensorflow#20808

PiperOrigin-RevId: 714851861
The generated .a library has a different name depending on whether it is a CUDA or ROCm build. PiperOrigin-RevId: 714853250
PiperOrigin-RevId: 714855571
PiperOrigin-RevId: 714863244
alekstheod force-pushed the develop-upstream-sync-20250113 branch 15 times, most recently from afe115b to 0e026ce on January 15, 2025 at 13:42
alekstheod force-pushed the develop-upstream-sync-20250113 branch from 0e026ce to e1d9704 on January 15, 2025 at 14:16
Nightly build failing tests list for openxla:
Vs failing xla tests in this PR:
We decided to skip the int4 Triton tests and investigate the issue separately from the weekly sync.
alekstheod force-pushed the develop-upstream-sync-20250113 branch from d123032 to a5407d3 on January 16, 2025 at 15:21
Weekly sync 13/01/2025