-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Use thread_local for loader_life_support to improve performance #5830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on pybind#5705 (comment) and pybind#5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome.
in particular, Google seems to think that thread_local was available as of iOS 9, but previous discussion seems to say it's iOS 17? https://stackoverflow.com/a/29929949 |
@b-pass Is there a chance that you could help reviewing this PR? (See top of PR description why I thought of asking you: "but does not need to be isolated to a particular interpreter") |
static thread_local fake_thread_specific_storage storage; | ||
return storage; | ||
#else | ||
return get_internals().loader_life_support_tls; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pybind11's thread_specific_storage
should be the same performance as a static thread_local (just a little less convienient). On most platforms it ends up just being pthread's TLS which is the same thing GCC uses to do thread_local
.
So, why not just use static thread_specific_storage<loader_life_support>
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pybind11's thread_specific_storage should be the same performance as a static thread_local (just a little less convienient). On most platforms it ends up just being pthread's TLS which is the same thing GCC uses to do thread_local.
I don't think this is correct, but I'll get clearer data about it and come back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's the code generated for static thread_local std::string* p = nullptr; return p;
by a smattering of clang and gcc versions. no pthread TLS. https://godbolt.org/z/jvTG4jEso
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough....
Then, I think then someone should fix the thread_specific_storage
class. It doesn't make sense to do this here in this function but let the other places using thread_specific_storage continue to suffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix the thread_specific_storage class. It doesn't make sense to do this here in this function but let the other places using thread_specific_storage continue to suffer.
Wouldn't that break ABI? Is that OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that break ABI?
Two uses of thread_specific_storage are members of internals
(this loader_life_support is one, tstate
is another, which is used in GIL management). So, yes, changing those would break ABI. This PR would abandon one of them.
I suppose one could fix the other usages of thread_specific_storage and leave these two until its time for an ABI break.... internals_pp_manager
can be changed without an ABI break AFAICT, each different module would have its own local copy of that code and its thread locals, sharing the actual internals
pointer amongst them through the InterpreterState dict.
I'm willing to refactor that in a separate PR using whatever method is decided here for conditionally compiling these thread_locals out.
In the past we gave people an option to override PYBIND11_INTERNALS_VERSION. It was a bit messy, but maybe worth a consideration here?
It might make sense to add some mechanism to accumulate ABI breaks within a pp flag like that, rather than forget about them. Maybe out of scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to add some mechanism to accumulate ABI breaks within a pp flag like that, rather than forget about them. Maybe out of scope for this PR.
What we had was pretty basic, easy to see in one view here (when support for old internals versions was removed):
https://github.com/pybind/pybind11/pull/5530/files
Not pretty, but it served the purpose; for ~2 years actually.
We'll have to be careful about test coverage, in ci.yml.
I think it would fit in this PR.
Another route is to get #5800 reviewed and merged. It's an XXL review job though, and we have to determine who's taking on maintaining that code long term. @oremanj for viz (btw @oremanj: did you see my direct email from Aug 21?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's the code generated for
static thread_local std::string* p = nullptr; return p;
by a smattering of clang and gcc versions. no pthread TLS. https://godbolt.org/z/jvTG4jEso
This was misleading; adding -fPIC (as you would generally do to build a shared library such Python extension IIUC, though pybind11_benchmark doesn't seem to do it) causes this code to use _tls_get_addr()
(apparently _tlv_get_addr
on Mac). It's not pthread_getspecific
though, and runtime measurement shows that we are actually getting an improvement despite my incorrect assembly reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rwgk I did not see such an email, nor have I been able to find it searching just now - can you resend it and/or tell me what subject to search for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oremanj I sent another email, from my corp address to your corp address, subject "pybind11 hello".
include/pybind11/detail/common.h
Outdated
@@ -1344,5 +1344,8 @@ constexpr | |||
# define PYBIND11_BACKWARD_COMPATIBILITY_TP_DICTOFFSET | |||
#endif | |||
|
|||
// TODO: determine which platforms cannot use thread_local. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was one of the iOS platforms. But that should've failed one of the PR Checks (since this is unconditionally 1 right now) and didn't. Meaning all of the Checks pybind11 currently has support thread_local
anyway.... can that be right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might have fixed the issue with iOS, you have to set some minimum version to be able to use it and it wasn't letting us set the minimum version. But now I think it's working.
Description
As explained in a new code comment,
loader_life_support
needs to bethread_local
but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definition.Performance before:
After:
A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.
Open questions:
thread_local
? I see concerns about old iOS versions on ci: ios #5705 (comment) and fix: modify the internals pointer-to-pointer implementation to not usethread_local
#5709; is there anything else?Suggested changelog entry: