
Conversation

pruthvistony
Contributor

  • Add Hipify as a git submodule
  • Trigger hipify from cmake build
  • TP_USE_ROCM controls the trigger, which will be set to ON
    when building on ROCm

@pruthvistony
Contributor Author

@lw, @jeffdaily, @jithunnair-amd,
Please review the changes.

@jithunnair-amd

@pruthvistony I think we discussed this before, but just to make sure: could the build_amd.py be part of hipify-torch so that it doesn't have to be added to the hipifying project's sources? Is there anything that's project-specific that cannot be passed in as arguments to build_amd.py?

@pruthvistony
Contributor Author

@pruthvistony I think we discussed this before, but just to make sure: could the build_amd.py be part of hipify-torch so that it doesn't have to be added to the hipifying project's sources? Is there anything that's project-specific that cannot be passed in as arguments to build_amd.py?

As discussed, build_amd.py can be moved to the hipify-torch repo and triggered directly from cmake. It is just that passing all the parameters, such as regex lists, can be tricky within cmake. Currently I don't see any specific problem with passing arguments through build_amd.py, and I have kept the current change the same as it is used in other projects. We can update this as an improvement later, since it is not a blocker for the other ROCm build related changes.
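
For illustration, a minimal sketch of how regex-like include/ignore lists could be forwarded from cmake as repeated command-line flags; the flag names below are assumptions, not the current build_amd.py interface:

# Hypothetical sketch: repeated --include / --ignore flags collect pattern
# lists, so cmake can forward filters one flag at a time instead of passing
# a single quoted list.
import argparse

parser = argparse.ArgumentParser(description='Forward hipify filters from cmake')
parser.add_argument('--include', action='append', default=[],
                    help='Pattern of files to hipify (may be repeated)')
parser.add_argument('--ignore', action='append', default=[],
                    help='Pattern of files to skip (may be repeated)')
args = parser.parse_args()

includes = args.include or ['*']  # hipify everything unless told otherwise
ignores = args.ignore
print('includes:', includes, 'ignores:', ignores)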

Contributor

@lw left a comment

I could be OK with merging this, but (correct me if I'm wrong) I don't think this is yet usable, is it? If anyone tried to run with TP_USE_ROCM the build would most likely fail. Should we fix all that and ensure everything works before we do that?

Also, while I guess we cannot run any ROCm tests on CircleCI (because it's lacking the necessary hardware), I think we should be able to at least add a ROCm build job. Do you think you could do that? Thanks!

@@ -0,0 +1,49 @@
#!/usr/bin/env python
Contributor

We have been putting some assorted tools in the tensorpipe/misc folder. It would be good to use a single directory for all these.

Contributor Author

I am in the process of moving this file into the hipify-torch repo, which is added as a git submodule:
https://github.com/ROCmSoftwarePlatform/hipify-torch.git
Once the PRs into hipify-torch are merged, I will update this PR.

Comment on lines 13 to 17
sys.path.append(os.path.realpath(os.path.join(
os.path.dirname(__file__),
os.path.pardir,
os.path.pardir,
'third_party')))
Contributor

This could work, though it means that this file "assumes" a certain structure in the repo, which means it's not really standalone and thus possibly brittle. I now wonder whether it could be more robust to have the path of the hipify module passed in via a command-line argument?
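
As a hedged sketch of that alternative (the --hipify-dir flag name is hypothetical, not part of the current script):

# Accept the hipify location on the command line instead of hard-coding a
# path relative to this file's position in the repo.
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('--hipify-dir', required=True,
                    help='Directory containing the hipify Python package')
args, _ = parser.parse_known_args()

# Make the caller-supplied directory importable before loading hipify.
sys.path.insert(0, args.hipify_dir)
from hipify import hipify_python  # noqa: E402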

Contributor Author

Since this file will be moved to the hipify-torch repo, the above code will not be required.

output_directory=args.output_directory,
includes=includes,
ignores=ignores,
is_pytorch_extension=True)
Contributor

What does the is_pytorch_extension flag do? TensorPipe is not a PyTorch extension.

Contributor Author

Yes, TensorPipe is not an extension, but this parameter is used internally in hipify.
I will let @jithunnair-amd comment with more info here.

That nomenclature made sense until now :) Essentially, we needed a way to preserve the "legacy" way of hipifying for PyTorch source code, but use the "new and improved" way for other code which used hipify, which were basically PyTorch extensions so far. We'll assess the best way to generalize this, but this flag is needed for now until that happens.

Contributor

Ok, got it, just wanted to check. It would be nice if we could add a comment to explain this, so I won't wonder about it again the next time I read this code. (Though if this code is moved to the hipify repo it won't matter anyways)
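
For example, the requested comment could look roughly like this; a sketch only, with placeholder paths, and wording inferred from the reply above rather than from hipify's documentation:

from hipify import hipify_python

hipify_python.hipify(
    project_directory='.',   # placeholder; build_amd.py fills in real paths
    output_directory='.',
    includes=['*'],
    ignores=[],
    # TensorPipe is not a PyTorch extension, but hipify currently keys its
    # newer (non-legacy) hipification behaviour off this flag, so it stays
    # True here until hipify generalizes the option.
    is_pytorch_extension=True)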


from hipify import hipify_python

parser = argparse.ArgumentParser(description='Top-level script for HIPifying, filling in most common parameters')
Contributor

It is generally more idiomatic to put all the logic of a runnable Python entry-point script in a def main(): function and then invoke it with if __name__ == "__main__": main().
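
A minimal sketch of that structure, reusing the argparse entry point from the diff above (the two arguments shown are illustrative, not the full option set):

#!/usr/bin/env python
import argparse

from hipify import hipify_python


def main():
    parser = argparse.ArgumentParser(
        description='Top-level script for HIPifying, filling in most common parameters')
    parser.add_argument('--project-directory', required=True)
    parser.add_argument('--output-directory', required=True)
    args = parser.parse_args()

    hipify_python.hipify(
        project_directory=args.project_directory,
        output_directory=args.output_directory,
        is_pytorch_extension=True)


if __name__ == "__main__":
    main()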

Contributor Author

Will check on it and update the code.

Contributor Author

Updated the code as suggested and committed it to the hipify repo.

Comment on lines +37 to +39
if(TP_USE_CUDA AND TP_USE_ROCM)
message(FATAL_ERROR "Tensorpipe can be built either for CUDA or ROCm, TP_USE_CUDA and TP_USE_ROCM both are set, erroring out!!!!")
endif()
Contributor

I am not suggesting to do this now, but how difficult would that limitation be to lift?

Contributor

For the first version of ROCm support we intend to still use the CudaBuffer name (and the cuda_ipc, ... ones) for the HIP version of TensorPipe as well, out of simplicity, hence using the CUDA and HIP versions together would cause name conflicts. Eventually I think we could/should fix this.

Contributor Author

We are thinking the TensorPipe build for CUDA or ROCm will be mutually exclusive, since the PyTorch lib is also mutually exclusive.
Even for channels there will only be HIP-based channels: hip_basic, hip_gdr, hip_ipc, ... .
So the builds will have the corresponding flags turned ON, and I didn't get what the concern is here. Please let me know more details.

Contributor

I think the question here was motivated by the fact that in principle there shouldn't be any hard blocker to having both CUDA and ROCm (once we fix the name conflicts), right? Hence this is mainly a comment about "code style". In practice though, yes, when used from PyTorch they will be mutually exclusive, hence no worries about this.

@pruthvistony
Contributor Author

I could be OK with merging this, but (correct me if I'm wrong) I don't think this is yet usable, is it? If anyone tried to run with TP_USE_ROCM the build would most likely fail. Should we fix all that and ensure everything works before we do that?

Also, while I guess we cannot run any ROCm tests on CircleCI (because it's lacking the necessary hardware), I think we should be able to at least add a ROCm build job. Do you think you could do that? Thanks!

Currently, PyTorch ROCm tests are executed on Jenkins.
The changes are not yet usable; I am raising the changes in multiple PRs, as was suggested previously. The next PR, which builds TensorPipe with 'TP_USE_ROCM' enabled, is in draft mode, but it is not complete due to an unsupported API, for which we are working with the HIP team.

Regarding the ROCm build, I believe a job can be set up. Regarding the tests, I will check on the necessary hardware and get back.

@lw
Contributor

lw commented Jul 13, 2021

Currently, PyTorch ROCm tests are executed on Jenkins.

Yes, I saw that. However, that's because they are both built and run on AMD GPUs, right? If we only built, without running, we could do so on machines without any GPUs. (PyTorch already builds the CUDA version on CPU-only machines.) It wouldn't be perfect, but it would at least catch build issues, for example ones caused by hipification.

@pruthvistony
Contributor Author

Moved all the hipify related files into the hipify-torch repo.

[submodule "third_party/libnop"]
path = third_party/libnop
url = https://github.com/google/libnop.git
[submodule "third_party/hipify"]
Contributor

OOC, this dependency does not seem to exist in PyTorch's .gitmodules. Any reason for this difference?

Yes, that's because PyTorch currently has hipify logic as part of its source code at https://github.com/pytorch/pytorch/blob/master/torch/utils/hipify
We can't use that for Tensorpipe as it'll create a circular dependency.

Contributor Author

Yes, in PyTorch hipify is not used as a git submodule. For new projects to hipify, we are using the hipify-torch repo as a git submodule, so that all the hipification code is in a single place, providing many interfaces.

Contributor

Do we plan to change PyTorch's hipify to use the same strategy?

Yes, we would like to in the long run, but PyTorch being a much bigger codebase, and having a more complex hipification strategy, will require much more coordination and effort.
