Skip to content

Conversation

swolchok
Copy link
Contributor

@swolchok swolchok commented Sep 5, 2025

Description

As explained in a new code comment, loader_life_support needs to be thread_local but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definition.

Performance before:

  • on M4 Max using pybind/pybind11_benchmark unmodified repo:
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
  • Linux server:
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 120 nsec per loop

After:

  • M4 Max:
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
  • Linux server:
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 101 nsec per loop

A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.

Open questions:

  • How do we determine whether we can safely use thread_local? I see concerns about old iOS versions on ci: ios #5705 (comment) and fix: modify the internals pointer-to-pointer implementation to not use thread_local #5709; is there anything else?
  • Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter on the same thread"? I think it's fine, but can it happen?
  • Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome.

Suggested changelog entry:

  • Reduced function call overhead by using thread_local for loader_life_support when possible.

As explained in a new code comment, `loader_life_support` needs to be
`thread_local` but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definiton.

Performance before:
- on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```

- Linux server:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 120 nsec per loop
```

After:
- M4 Max:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```

- Linux server:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 101 nsec per loop
```

A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.

Open questions:

- How do we determine whether we can safely use `thread_local`? I see
  concerns about old iOS versions on
  pybind#5705 (comment)
  and pybind#5709; is there anything
  else?
- Do we have a test that covers "function called in one interpreter
  calls a C++ function that causes a function call in another
  interpreter"? I think it's fine, but can it happen?
- Are we happy with what we think will happen in the case where
  multiple extensions compiled with and without this PR interoperate?
  I think it's fine -- each dispatch pushes and cleans up its own
  state -- but a second opinion is certainly welcome.
@swolchok
Copy link
Contributor Author

swolchok commented Sep 5, 2025

old iOS versions

in particular, Google seems to think that thread_local was available as of iOS 9, but previous discussion seems to say it's iOS 17? https://stackoverflow.com/a/29929949

@rwgk
Copy link
Collaborator

rwgk commented Sep 5, 2025

@b-pass Is there a chance that you could help reviewing this PR? (See top of PR description why I thought of asking you: "but does not need to be isolated to a particular interpreter")

Comment on lines 72 to 75
static thread_local fake_thread_specific_storage storage;
return storage;
#else
return get_internals().loader_life_support_tls;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pybind11's thread_specific_storage should be the same performance as a static thread_local (just a little less convienient). On most platforms it ends up just being pthread's TLS which is the same thing GCC uses to do thread_local.

So, why not just use static thread_specific_storage<loader_life_support> here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pybind11's thread_specific_storage should be the same performance as a static thread_local (just a little less convienient). On most platforms it ends up just being pthread's TLS which is the same thing GCC uses to do thread_local.

I don't think this is correct, but I'll get clearer data about it and come back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the code generated for static thread_local std::string* p = nullptr; return p; by a smattering of clang and gcc versions. no pthread TLS. https://godbolt.org/z/jvTG4jEso

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough....

Then, I think then someone should fix the thread_specific_storage class. It doesn't make sense to do this here in this function but let the other places using thread_specific_storage continue to suffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix the thread_specific_storage class. It doesn't make sense to do this here in this function but let the other places using thread_specific_storage continue to suffer.

Wouldn't that break ABI? Is that OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that break ABI?

Two uses of thread_specific_storage are members of internals (this loader_life_support is one, tstate is another, which is used in GIL management). So, yes, changing those would break ABI. This PR would abandon one of them.

I suppose one could fix the other usages of thread_specific_storage and leave these two until its time for an ABI break.... internals_pp_manager can be changed without an ABI break AFAICT, each different module would have its own local copy of that code and its thread locals, sharing the actual internals pointer amongst them through the InterpreterState dict.

I'm willing to refactor that in a separate PR using whatever method is decided here for conditionally compiling these thread_locals out.

In the past we gave people an option to override PYBIND11_INTERNALS_VERSION. It was a bit messy, but maybe worth a consideration here?

It might make sense to add some mechanism to accumulate ABI breaks within a pp flag like that, rather than forget about them. Maybe out of scope for this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to add some mechanism to accumulate ABI breaks within a pp flag like that, rather than forget about them. Maybe out of scope for this PR.

What we had was pretty basic, easy to see in one view here (when support for old internals versions was removed):

https://github.com/pybind/pybind11/pull/5530/files

Not pretty, but it served the purpose; for ~2 years actually.

We'll have to be careful about test coverage, in ci.yml.

I think it would fit in this PR.

Another route is to get #5800 reviewed and merged. It's an XXL review job though, and we have to determine who's taking on maintaining that code long term. @oremanj for viz (btw @oremanj: did you see my direct email from Aug 21?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the code generated for static thread_local std::string* p = nullptr; return p; by a smattering of clang and gcc versions. no pthread TLS. https://godbolt.org/z/jvTG4jEso

This was misleading; adding -fPIC (as you would generally do to build a shared library such Python extension IIUC, though pybind11_benchmark doesn't seem to do it) causes this code to use _tls_get_addr() (apparently _tlv_get_addr on Mac). It's not pthread_getspecific though, and runtime measurement shows that we are actually getting an improvement despite my incorrect assembly reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwgk I did not see such an email, nor have I been able to find it searching just now - can you resend it and/or tell me what subject to search for?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oremanj I sent another email, from my corp address to your corp address, subject "pybind11 hello".

@@ -1344,5 +1344,8 @@ constexpr
# define PYBIND11_BACKWARD_COMPATIBILITY_TP_DICTOFFSET
#endif

// TODO: determine which platforms cannot use thread_local.
Copy link
Contributor

@b-pass b-pass Sep 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was one of the iOS platforms. But that should've failed one of the PR Checks (since this is unconditionally 1 right now) and didn't. Meaning all of the Checks pybind11 currently has support thread_local anyway.... can that be right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might have fixed the issue with iOS, you have to set some minimum version to be able to use it and it wasn't letting us set the minimum version. But now I think it's working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants