Noticeable overhead in crossbeam-epoch under many Locals #852
oh, also, i have a small reproducible program and monkey-patched …
Yet another solution would probably be starting to accept a new type parameter (with a default type to be compatible) over the deque types. That way, each thread pool is isolated from the others in terms of the epoch `Global`. Maybe this impl strategy is similar to `std::HashMap`'s default hasher type parameter.
@taiki-e hey, I know your status shows …
If we handle this on our end, it seems increasing `PINNINGS_BETWEEN_COLLECT` or making it configurable is okay. Which should be chosen depends on what the adverse effects of increasing it would be. (By the way, I'm not sure why you are using a thread pool specifically intended for expensive CPU-bound computations like rayon to handle a large number of tasks that are mostly idling.)
cc @cuviper: Any thoughts on this idea (or this issue)?
With the caveat that I only have a rough intuition of how `crossbeam-epoch` works:

This sounds like it could be fine for stuff that's really internal to the pool, like the thread-specific deques, but is it a problem to delay collection for stuff that straddles that boundary? I'm thinking of the pool's injector, which may be pushed from outside the pool and popped within, or anything the user brings themselves that crosses the pool. If that pool goes idle while the rest of the process is still working heavily, there could be memory in that pool whose collection is long delayed.

I think this could make sense for the thread deques, if we create them with a pool-specific collector.

Which solution is that, just the increased constant?

That also sounds like a reasonable idea to me.
really appreciate you spending time on this issue. :)
thx for confirmation. i'm thinking of submitting a pr to make it configurable.
as far as i tested, increasing it looks fine.
Well, recent rayon can park nicely with no outstanding tasks. (very much kudos to @cuviper, ref: https://github.com/rayon-rs/rayon/blob/master/RELEASES.md#release-rayon-140--rayon-core-180-2020-08-24 ) So, we want to maximize burst task throughput utilizing all cores, and rayon is very handy for that with its ecosystem of integrations.
thx for confirmation again! I'll create a pr to rayon as well. stay tuned. :)
yeah, the increased constant. here's rayon's one: ryoqun/rayon@48efe66 (note that it just points at upstream crossbeam revs, because the dependency is transitive; i.e. no actual code changes in rayon per se)
now, i think a process-wide singleton really-global `Global` isn't ideal. As you already pointed out in the preceding paragraph, I reached the same conclusion: as long as threads are closely grouped into an individual grouping like a thread pool, there should be no concern with multiple `Global`s. @taiki-e @cuviper So, I'll move this forward with the configuration approach. Really, thanks for the advice. Maybe I'll introduce …
Rayon's queues only deal with opaque …
I made this work locally. However, I'm pivoting to the new feature flag approach due to...:

So, I want to take the lazy man's approach #862 along with rayon's corresponding pr (rayon-rs/rayon#958). @taiki-e, @cuviper: could you take a quick glance to decide whether this kind of ad-hoc feature addition could ever be merged into your crates? Certainly, I want to avoid vendoring... Or, do you have the stamina to review my original approach's prs?
fyi, the original approach's newer api would look like this:

```rust
use crossbeam_deque::{Collector, CustomCollector, LocalHandle};
use once_cell::sync::Lazy; // assuming once_cell for lazy statics

static MY_COLLECTOR: Lazy<Collector> = Lazy::new(Collector::new);

// note that, maybe i need to provide a macro_rules! for the following boilerplate code.
thread_local! {
    static MY_HANDLE: LocalHandle = MY_COLLECTOR.register();
}

struct MyCustomCollector;

impl CustomCollector for MyCustomCollector {
    #[inline]
    fn collector() -> &'static Collector {
        &MY_COLLECTOR
    }

    #[inline]
    fn handle() -> &'static std::thread::LocalKey<LocalHandle> {
        &MY_HANDLE
    }
}

fn foo() {
    let pool = rayon::ThreadPoolBuilder::<_, MyCustomCollector>::new()
        ...;
    pool.install_for_iters(|type_marker| { // this is new! type_marker is parameterized with CustomCollector
        vec
            .into_par_iter()
            .install_type(type_marker) // this is new too! we need to propagate the static collector type to the actual iters.
            .map(...)
            ...
    });
}
```
As an admittedly uninvolved bystander but a user of Rayon, I would like to add that I think at least this kind of solution might be better put into a private fork of …
@adamreichold thanks for chiming in. :) love collective wisdom. your idea sounds doable. so, i created solana-labs/solana#26555
I was not imagining any public API changes in rayon, not even feature flags. My thought was that we would just internally create a collector in rayon's `Registry`.
Hi, nice to meet you and thanks for maintaining this crate. :)

I found some kind of performance issue spanning cross-crate code interactions, ultimately resulting in many cpu cycles being wasted in `crossbeam-epoch`. And I'm wondering where the proper fix should be placed across those related crates.

In short, there are circumstances where `crossbeam-epoch`'s epoch bookkeeping code shows significant overhead. And our production code was hit by it.

The dirtiest hack would be reducing the frequency of gc collection (= `global().collect(&guard);`) by increasing `PINNINGS_BETWEEN_COLLECT` like this:

However, I'm not fully sure this is the right fix (esp. regarding its ramifications due to reduced garbage collections).
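For reference, the hack amounts to a one-line bump of the constant in crossbeam-epoch's `src/internal.rs` (the upstream value is 128 at the time of writing; the replacement value below is just an arbitrary example, not the value used in production):

```diff
--- a/crossbeam-epoch/src/internal.rs
+++ b/crossbeam-epoch/src/internal.rs
@@
-    const PINNINGS_BETWEEN_COLLECT: usize = 128;
+    const PINNINGS_BETWEEN_COLLECT: usize = 128 * 1024;
```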
Let me explain the rather complicated context a bit.

Firstly, `crossbeam-epoch` is used by `crossbeam-deque` (duh), which in turn is used by `rayon` (a task scheduler library) for its task queues, and then by `solana-validator` (which experienced the performance issue; solana-labs/solana#22603). So far, so good. It's just a normal use case of rayon for multi cores by application code.

The twist is that `solana-validator` holds many `rayon` thread pools managed by its internal subsystems. So, it well exceeds the system's core count by a great factor, like 2000 threads on a 64 core machine.

(We know this is a bit silly as a setup. But no subsystem is at 100% cpu persistently; rather, they're mostly idling. On the other hand, we want to maximize processing throughput and minimize latency at the time of peak load. Also, casual `top -H` introspection and granular kernel thread priority tuning are handy. Lastly, sharing a single thread pool, or several, would introduce unneeded synchronization cost across subsystems and implementation complexities in the `solana-validator` code. All in all, independent thread pools make sense to us, at least for now.)

So, those whopping 2000 (rayon) threads mean that all of them are registered as `Local`s with the singleton `crossbeam-epoch` `Global`. Then, `global().collect()` suddenly becomes very slow because it does a linear scan over the `Local`s (= O(n))...

(Then, this affects all independent rayon pools inside the process, because of the use of the singleton `Global`. This seemingly-unrelated subsystem performance degradation was hard to debug, by the way...)

Regarding the extent of the overhead, I observed that a certain rayon-heavy subsystem is ~5% faster in walltime with the above 1-line hack alone. And I also saw a 100x reduction of the overhead with Linux's `perf`.
Possible solutions:

- A better data structure than a linear scan over the `Local`s?
- Separate `Global`s for each thread pool as a kind of scope?
- Fewer threads in `solana-validator`?