Use a thread pool when building runtimes #6133
Conversation
This parallelizes the compilations and dramatically reduces the time to build runtimes.

As part of this, teach the driver infrastructure to have an option to control the use of threads and to build the relevant thread pool and thread it into the various APIs.

However, it requires our `ClangRunner` to become thread-safe and to invoke Clang in a way that is thread-safe. This is somewhat challenging as the code in `clang_main` is distinctly _not_ thread-safe.

To address this, the relevant logic of `clang_main`, especially the CC1 execution, is extracted into our runner and cleaned up to be much more appropriate in a multithreaded context. Much of this code should eventually be factored back into Clang, but that will be a follow-up patch to upstream.
llvm::SingleThreadExecutor single_thread({.ThreadsRequested = 1});
std::optional<llvm::DefaultThreadPool> threads;
driver_env_.thread_pool = &single_thread;
if (options.threads) {
  threads.emplace(llvm::optimal_concurrency());
  driver_env_.thread_pool = &*threads;
}
We seem to be creating this SingleThreadExecutor at multiple levels, both here and inside the ClangRunner. Could we consolidate it to a single place, maybe by making ClangRunner always expect to receive a non-null executor?
We could... I was just a bit torn about doing so, as it adds quite a bit of complexity to building and using the runner that is only needed if you're actually building runtimes.
Another alternative would be to have ClangRunner either accept a pre-built path or a thread pool to use for on-demand building, and remove the boolean option.
Do you foresee wanting to use thread pools other than SingleThreadExecutor and DefaultThreadPool? Maybe we could tell it if we want threads or not and have ClangRunner make what it needs?
enum class BuildRuntimesOnDemand {
  UsePrebuiltOnly,
  BuildOnSingleThreaded,
  BuildOnWorkerThreads,
};
Taking either a path XOR a thread pool also sounds good, whichever you prefer.
Do you foresee wanting to use thread pools other than SingleThreadExecutor and DefaultThreadPool?
Some possibility. One thing I wonder is whether we'll want a (much) larger thread pool to fully absorb the latency, but I'd like to avoid that given the overhead.
Maybe we could tell it if we want threads or not and have ClangRunner make what it needs?
When using threads, my expectation is that it'll be very desirable to use the existing thread pool to avoid paying the cost of spinning one up and forking all the threads.
It also allows more global management of the load.
This is why I somewhat like the driver either accepting or building a thread pool, and then making it available for any commands or subcommands to use.
Taking either a path XOR a thread pool also sounds good, whichever you prefer.
I think what I'm liking is to take one of:
- A thread pool, enabling on-demand building in that pool.
- A pre-built path that will be used.
- Nothing, disabling on-demand building; it's up to the caller to use the runner in a way compatible with that.
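That three-way choice can be modeled directly in the type system. Here is a hedged sketch with illustrative names (not the actual ClangRunner API), using `std::variant` so the runner holds exactly one of the three modes:

```cpp
#include <string>
#include <variant>

// Illustrative stand-ins: a pool handle enabling on-demand building, a
// pre-built runtimes path, or nothing at all.
struct OnDemandPool { unsigned num_threads; };
struct PrebuiltRuntimes { std::string path; };
struct NoRuntimes {};

// The runner holds exactly one of the three options discussed above.
using RuntimesMode = std::variant<OnDemandPool, PrebuiltRuntimes, NoRuntimes>;

// Only the pool-carrying mode permits building runtimes on demand.
bool AllowsOnDemandBuild(const RuntimesMode& mode) {
  return std::holds_alternative<OnDemandPool>(mode);
}
```

One nice property of this shape is that the invalid combination (a boolean asking for on-demand building with no pool to build in) simply cannot be expressed.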
Actually, I thought more about this and I think I had fundamentally the wrong structure for this API.
I've restructured everything so that we have a simple constructor that only accepts the necessary components. Then there are three variations on Run: using on-demand runtimes with a cache and thread pool, using pre-built runtimes, and using no runtimes. I've also updated comments and callers accordingly. I think this ends up clearer, but PTAL and let me know.
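In very rough outline, the restructuring described might look like this. The names are illustrative guesses at the shape, not the real signatures in the PR:

```cpp
#include <string>

// Hedged sketch: a minimal constructor plus three Run variants, one per
// runtimes strategy. Bodies here are trivial placeholders.
class RunnerSketch {
 public:
  RunnerSketch() = default;  // only the necessary components

  // Variant 1: build runtimes on demand, with a cache and a thread pool.
  bool RunWithOnDemandRuntimes(const std::string& cache_dir,
                               unsigned pool_threads) {
    return !cache_dir.empty() && pool_threads > 0;
  }

  // Variant 2: use runtimes already built at a known path.
  bool RunWithPrebuiltRuntimes(const std::string& runtimes_path) {
    return !runtimes_path.empty();
  }

  // Variant 3: run without runtimes; the invocation must not need them.
  bool RunWithoutRuntimes() { return true; }
};
```

Moving the choice from constructor state into the Run entry points keeps the constructor simple and makes each call site state which strategy it actually needs.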
Co-authored-by: Dana Jansens <[email protected]>
Thanks, PTAL!
toolchain/driver/clang_runner.cpp
Outdated
llvm::SmallVector<llvm::NewArchiveMember, 0> unwrapped_objs;
unwrapped_objs.reserve(objs.size());
for (auto& obj : objs) {
  unwrapped_objs.push_back(*std::move(obj));
}
If you enjoy some code golf...
llvm::SmallVector<llvm::NewArchiveMember, 0> unwrapped_objs(
llvm::map_range(objs, [](auto& obj) { return *std::move(obj); }));
Yeah, I somewhat prefer the simplicity of the loop here... It's a close call though, happy to switch if it helps a lot.
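For reference, the unwrapping pattern under discussion, sketched with standard-library stand-ins (`std::optional` and `std::vector` in place of the Expected-like wrapper and `llvm::SmallVector` in the real code):

```cpp
#include <optional>
#include <string>
#include <vector>

// Mirrors the loop from the diff: move each already-checked value out of
// its wrapper. std::optional stands in for the wrapper type; the real code
// holds llvm::NewArchiveMember values in an llvm::SmallVector.
std::vector<std::string> UnwrapAll(
    std::vector<std::optional<std::string>> objs) {
  std::vector<std::string> unwrapped;
  unwrapped.reserve(objs.size());
  for (auto& obj : objs) {
    // Dereferencing the moved-from optional yields an rvalue reference,
    // so push_back moves rather than copies.
    unwrapped.push_back(*std::move(obj));
  }
  return unwrapped;
}
```

The `map_range` one-liner in the comment above builds the destination container from a transforming iterator range instead; both forms move each element exactly once.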
Co-authored-by: Dana Jansens <[email protected]>
PTAL, I think I found a better API structure.
Yes, the new API looks like a really nice improvement, thanks. LGTM
Co-authored-by: Dana Jansens <[email protected]>
This parallelizes the compilations and dramatically reduces the time to build runtimes.

As part of this, teach the driver infrastructure to have an option to control the use of threads and to build the relevant thread pool and thread it into the various APIs.

However, it requires our `ClangRunner` to become thread-safe and to invoke Clang in a way that is thread-safe. This is somewhat challenging as the code in `clang_main` is distinctly not thread-safe.

To address this, the relevant logic of `clang_main`, especially the CC1 execution, is extracted into our runner and cleaned up to be much more appropriate in a multithreaded context. Much of this code should eventually be factored back into Clang, but that will be a follow-up patch to upstream.

Last but not least, this rearranges the `ClangRunner` API to make a bit more sense out of the different options for building runtimes, and have a clean model for which things need to be passed in at which points.