
Make the specializing interpreter thread-safe in --disable-gil builds #115999

Open
17 tasks done
Tracked by #108219
swtaarrs opened this issue Feb 27, 2024 · 13 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-free-threading type-feature A feature request or enhancement

Comments

@swtaarrs
Member

swtaarrs commented Feb 27, 2024

Feature or enhancement

Proposal:

In free-threaded builds, the specializing adaptive interpreter needs to be made thread-safe. We should start with a small PR to simply disable it in free-threaded builds, which will be correct but will incur a performance penalty. Then we can work out how to properly support specialization in a free-threaded build.

These two commits from Sam's nogil-3.12 branch can serve as inspiration:

  1. specialize: make specialization thread-safe
  2. specialize: optimize for single-threaded programs

There are two primary concerns to balance while implementing this functionality on main:

  1. Runtime overhead: There should be no performance impact on normal builds, and minimal performance impact on single-threaded code running in free-threaded builds.
  2. Reducing code duplication/divergence: We should come up with a design that is minimally disruptive to ongoing work on the specializing interpreter. It should be easy for other devs to keep the free-threaded build working without having to know too much about it.
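The short-term state (specialization off whenever the GIL may be disabled) can be observed at runtime. As a minimal sketch: `sys._is_gil_enabled()` was added in CPython 3.13 and reports whether the GIL is currently active; the `getattr` fallback below is a hedge for older versions, where the function does not exist and the GIL is always on.

```python
import sys

# sys._is_gil_enabled() exists in CPython 3.13+; on older versions we
# conservatively assume the GIL is on. In the initial free-threaded
# builds, specialization was simply disabled whenever the GIL might
# be off.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
print("GIL enabled:", gil_enabled)
```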

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

Specialization Families


@brandtbucher
Member

(subscribing myself)

colesbury pushed a commit that referenced this issue Mar 1, 2024
…aded builds (#116013)

For now, disable all specialization when the GIL might be disabled.
@swtaarrs swtaarrs removed their assignment Mar 1, 2024
@swtaarrs
Member Author

swtaarrs commented Mar 1, 2024

This is now a performance (rather than correctness) issue for free-threaded builds, so I'm going to focus on more time-sensitive issues for a while.

woodruffw pushed a commit to woodruffw-forks/cpython that referenced this issue Mar 4, 2024
…e-threaded builds (python#116013)

For now, disable all specialization when the GIL might be disabled.
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…e-threaded builds (python#116013)

For now, disable all specialization when the GIL might be disabled.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…e-threaded builds (python#116013)

For now, disable all specialization when the GIL might be disabled.
@corona10
Member

@swtaarrs Out of curiosity, is there any progress or plan for this issue?

@Fidget-Spinner
Member

@corona10 I'm planning to work on this after I get the deferred reference stack in. However, there are no concrete plans as of now. I'm really happy for you or anyone else to propose a design for the specializing interpreter with free-threaded safety!

@corona10
Member

@Fidget-Spinner cc @swtaarrs
Nice. I was also thinking about how to make it thread-safe in a seamless way, since I agree with @swtaarrs.
But I don't have a good idea for solving this yet, since I'm not working on this task full-time :)
So I'll be happy to see you come up with a good plan.
(I'm curious whether we could make the caches a per-thread mechanism...)

By the way, in the short term, could we enable the specializer only for the main thread if we can't solve this before 3.13 is released?
That would make it easy to track the performance degradation relative to the default build, because most pyperformance benchmarks are single-threaded :)

@Fidget-Spinner
Member

@corona10 for 3.13, I think we're generally focusing on scalability across multiple cores rather than single-threaded perf. It's a bit too near feature freeze for me to feel safe re-enabling specialization at this point. There are still a lot of unsolved problems, even with specialization only on the main thread. Consider the following:

  1. Two threads, A and B, share the same code object. A is the main thread.
  2. Thread B is in LOAD_ATTR_METHOD_WITH_VALUES's action (past the guards, in the middle of loading the method).
  3. Thread A is in LOAD_ATTR_METHOD_WITH_VALUES's guard, but then deopts, meaning the method reference is now most likely dead/invalid.
  4. Thread B loads LOAD_ATTR_METHOD_WITH_VALUES's method and is now holding a dangling pointer.
  5. Thread B pushes the dangling pointer onto the stack. Everything crashes.
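The shape of that race can be sketched in Python. On GIL builds this is harmless (it just churns the specialization machinery), but it is exactly the pattern that could leave thread B holding a dangling method pointer if the free-threaded specializer were unsynchronized. The class and method names below are purely illustrative.

```python
import threading

class Point:
    def __init__(self):
        self.x = 1
    def get(self):
        return self.x

p = Point()
done = threading.Event()

def reader(results):
    # Thread B: repeatedly performs the attribute load that
    # LOAD_ATTR_METHOD_WITH_VALUES specializes.
    while not done.is_set():
        results.append(p.get())

def writer():
    # Thread A: replaces the method, forcing a deopt. Without
    # synchronization, the old method object could be freed while
    # thread B is mid-instruction.
    for _ in range(1000):
        Point.get = lambda self: self.x
    done.set()

results = []
t = threading.Thread(target=reader, args=(results,))
t.start()
writer()
t.join()
assert all(r == 1 for r in results)
```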

I'm reading a few papers to get some inspiration and also looking at how CRuby and other runtimes deal with this. Will post back when I have an actual plan.

@mpage mpage self-assigned this Aug 8, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 13, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 17, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 25, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 26, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 28, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 30, 2024
mpage added a commit to mpage/cpython that referenced this issue Oct 5, 2024
mpage added a commit to mpage/cpython that referenced this issue Oct 7, 2024
colesbury pushed a commit that referenced this issue Oct 8, 2024
Stop the world when invalidating function versions

The tier1 interpreter specializes `CALL` instructions based on the values
of certain function attributes (e.g. `__code__`, `__defaults__`). The tier1
interpreter uses function versions to verify that the attributes of a function
during execution of a specialization match those seen during specialization.
A function's version is initialized in `MAKE_FUNCTION` and is invalidated when
any of the critical function attributes are changed. The tier1 interpreter stores
the function version in the inline cache during specialization. A guard is used by
the specialized instruction to verify that the version of the function on the operand
stack matches the cached version (and therefore has all of the expected attributes).
It is assumed that once the guard passes, all attributes will remain unchanged
while executing the rest of the specialized instruction.

Stopping the world when invalidating function versions ensures that all critical
function attributes will remain unchanged after the function version guard passes
in free-threaded builds. It's important to note that this is only true if the remainder
of the specialized instruction does not enter and exit a stop-the-world point.

We will stop the world the first time any of the following function attributes
are mutated:

- defaults
- vectorcall
- kwdefaults
- closure
- code

This should happen rarely, and only happens once per function, so the performance
impact on the majority of code should be minimal.

Additionally, refactor the API for manipulating function versions to more clearly
match the stated semantics.
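The invalidation described above is observable from Python. A minimal sketch: mutating one of the critical attributes (`__defaults__` here; likewise `__code__` or `__kwdefaults__`) invalidates the function's version, forcing specialized `CALL` sites to re-guard, and in free-threaded builds the first such mutation per function triggers the stop-the-world.

```python
def f(a, b=10):
    return a + b

# Warm the call site so the adaptive interpreter can specialize CALL
# on f's function version.
for _ in range(100):
    assert f(1) == 11

# Mutating a critical attribute invalidates the cached version; the
# next call fails the version guard and sees the new defaults.
f.__defaults__ = (20,)
assert f(1) == 21
```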
efimov-mikhail pushed a commit to efimov-mikhail/cpython that referenced this issue Oct 9, 2024
…ython#124997)

ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…thon#126607)

Enable specialization of LOAD_GLOBAL in free-threaded builds.

Thread-safety of specialization in free-threaded builds is provided by the following:

- A critical section is held on both the globals and builtins objects during specialization. This ensures we get an atomic view of both builtins and globals during specialization.
- Generation of new keys versions is made atomic in free-threaded builds.
- Existing helpers are used to atomically modify the opcode.

Thread-safety of specialized instructions in free-threaded builds is provided by the following:

- Relaxed atomics are used when loading and storing dict keys versions. This avoids potential data races, as the dict keys versions are read without holding the dictionary's per-object lock in version guards.
- Dict keys objects are passed from keys version guards to the downstream uops. This ensures that we are loading from the correct offset in the keys object. Once a unicode key has been stored in a keys object for a combined dictionary in free-threaded builds, the offset it is stored at will never be reused for a different key. Once the version guard passes, we know that we are reading from the correct offset.
- The dictionary read fast path is used to read values from the dictionary once we know the correct offset.
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…E` (python#126600)

Add free-threaded specialization for the `UNPACK_SEQUENCE` opcode.
`UNPACK_SEQUENCE_TUPLE`/`UNPACK_SEQUENCE_TWO_TUPLE` are already thread-safe since tuples are immutable.
`UNPACK_SEQUENCE_LIST` is not thread-safe because of the nature of lists (nothing prevents another thread from adding items to or removing them from the list while the instruction is executing). To achieve thread safety we add a critical section to the implementation of `UNPACK_SEQUENCE_LIST`, specifically around the parts where we check the size of the list and push items onto the stack.


---------

Co-authored-by: Matt Page <[email protected]>
Co-authored-by: mpage <[email protected]>
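For context, these are the bytecode patterns involved; the comments describe the free-threaded behavior summarized in the commit message above, not anything visible at the Python level.

```python
# UNPACK_SEQUENCE_TUPLE / UNPACK_SEQUENCE_TWO_TUPLE: no locking
# needed, since tuples are immutable.
a, b = (1, 2)

# UNPACK_SEQUENCE_LIST: another thread could resize the list between
# the length check and the pushes, so in free-threaded builds that
# window is wrapped in the list's critical section.
items = [3, 4]
c, d = items

assert (a, b, c, d) == (1, 2, 3, 4)
```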
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
Record success in `specialize`

This matches the existing behavior where we increment the success
stat for the generic opcode each time we successfully specialize
an instruction.
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…ython#127169)

The specialization only depends on the type, so no special thread-safety
considerations there.

STORE_SUBSCR_LIST_INT needs to lock the list before modifying it.

`_PyDict_SetItem_Take2` already internally locks the dictionary using a
critical section.
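The two store paths the commit message distinguishes look like this at the Python level; the locking described (per-object lock for the list store, internal critical section for the dict store) happens inside the interpreter and is invisible here.

```python
# STORE_SUBSCR specializes purely on the receiver type.
row = [0, 0, 0]
row[1] = 5            # STORE_SUBSCR_LIST_INT: locks the list in
                      # free-threaded builds before modifying it.
table = {}
table["row"] = row    # dict path: _PyDict_SetItem_Take2 already takes
                      # the dict's critical section internally.
assert table["row"][1] == 5
```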
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…pythongh-127128)

Use existing helpers to atomically modify the bytecode.  Add unit tests
to ensure specializing is happening as expected.  Add test_specialize.py
that can be used with ThreadSanitizer to detect data races.  
Fix thread safety issue with cell_set_contents().
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…h-127426)

No additional thread safety changes are required.  Note that sending to
a generator that is shared between threads is currently not safe in the
free-threaded build.
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…-threaded builds (python#127123)

The CALL family of instructions was mostly thread-safe already and only required a small number of changes, which are documented below.

A few changes were needed to make CALL_ALLOC_AND_ENTER_INIT thread-safe:

- Added _PyType_LookupRefAndVersion, which returns the type version corresponding to the returned ref.
- Added _PyType_CacheInitForSpecialization, which takes an init method and the corresponding type version, and only populates the specialization cache if the current type version matches the supplied version. This prevents potentially caching a stale value in free-threaded builds if we race with an update to __init__.
- Only cache __init__ functions that are deferred in free-threaded builds. This ensures that the reference to __init__ stored in the specialization cache is valid if the type version guard in _CHECK_AND_ALLOCATE_OBJECT passes.
- Fix a bug in _CREATE_INIT_FRAME where the frame was pushed to the stack on failure.

A few other miscellaneous changes were also needed:

- Use {LOCK,UNLOCK}_OBJECT in LIST_APPEND. This ensures that the list's per-object lock is held while we are appending to it.
- Add the missing co_tlbc for _Py_InitCleanup.
- Stop/start the world around setting the eval frame hook. This allows us to read interp->eval_frame non-atomically and preserves the documented behavior of _CHECK_PEP_523.
mpage added a commit to mpage/cpython that referenced this issue Jan 14, 2025
mpage added a commit that referenced this issue Jan 14, 2025
… free-threaded builds (#128164)

Finish specialization for LOAD_ATTR in the free-threaded build by adding support for class and instance receivers.
Yhg1s added a commit that referenced this issue Jan 29, 2025
Enable free-threaded specialization of LOAD_CONST.
colesbury added a commit to colesbury/cpython that referenced this issue Feb 7, 2025
Fix a few thread-safety bugs to enable test_opcache when run with TSAN:

 * Use relaxed atomics when clearing `ht->_spec_cache.getitem`
   (pythongh-115999)
 * Add temporary suppression for type slot modifications (pythongh-127266)
 * Use atomic load when reading `*dictptr`
colesbury added a commit to colesbury/cpython that referenced this issue Feb 10, 2025
colesbury added a commit that referenced this issue Feb 11, 2025
Yhg1s added a commit that referenced this issue Feb 19, 2025
Make tuple iteration more thread-safe, and actually test concurrent iteration of tuple, range and list. (This is prep work for enabling specialization of FOR_ITER in free-threaded builds.) The basic premise is:

Iterating over a shared iterable (list, tuple or range) should be safe, not involve data races, and behave like iteration normally does.

Using a shared iterator should not crash or involve data races, and should only produce items regular iteration would produce. It is not guaranteed to produce all items, or produce each item only once. (This is not the case for range iteration even after this PR.)

Providing stronger guarantees is possible for some of these iterators, but it's not always straightforward and can significantly hamper the common case. Since iterators are generally not shared between threads, and many iterators (like generators) simply cannot be used concurrently at all, it's better to make sharing iterators without explicit synchronization clearly wrong.

Specific issues fixed in order to make the tests pass:

 - List iteration could occasionally fail an assertion when a shared list was shrunk and an item past the new end was retrieved concurrently. There's still some unsafety when deleting/inserting multiple items through for example slice assignment, which uses memmove/memcpy.

 - Tuple iteration could occasionally crash when the iterator's reference to the tuple was cleared on exhaustion. Like with list iteration, in free-threaded builds we can't safely and efficiently clear the iterator's reference to the iterable (doing it safely would mean extra, slow refcount operations), so just keep the iterable reference around.
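The guarantee above (no crashes or out-of-band items, but possibly skipped or duplicated items) can be exercised by deliberately sharing one iterator between two threads. On GIL builds this drains cleanly; the point of the commit is that free-threaded builds must not crash on the same pattern.

```python
import threading

data = list(range(50_000))
shared = iter(data)  # one iterator deliberately shared by two threads

def drain(out):
    for item in shared:
        out.append(item)

got_a, got_b = [], []
ta = threading.Thread(target=drain, args=(got_a,))
tb = threading.Thread(target=drain, args=(got_b,))
ta.start(); tb.start()
ta.join(); tb.join()

# Every produced item is one regular iteration would produce; under
# free-threading, items may be skipped or seen in both lists.
assert set(got_a) | set(got_b) <= set(data)
```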
colesbury added a commit to colesbury/cpython that referenced this issue Feb 26, 2025
(cherry picked from commit f151d27)

Co-authored-by: Sam Gross <[email protected]>
colesbury added a commit that referenced this issue Feb 26, 2025

(cherry picked from commit f151d27)
Yhg1s added a commit that referenced this issue Mar 12, 2025
Add free-threaded versions of existing specialization for FOR_ITER (list, tuples, fast range iterators and generators), without significantly affecting their thread-safety. (Iterating over shared lists/tuples/ranges should be fine like before. Reusing iterators between threads is not fine, like before. Sharing generators between threads is a recipe for significant crashes, like before.)
@corona10
Member

@mpage @colesbury @Yhg1s Can we close the issue now that #128798 is merged?

@colesbury
Contributor

Yes

@corona10
Member

And maybe it's worth updating What's New for it?

@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Mar 15, 2025
plashchynski pushed a commit to plashchynski/cpython that referenced this issue Mar 17, 2025
…n#128798)

Add free-threaded versions of existing specialization for FOR_ITER (list, tuples, fast range iterators and generators), without significantly affecting their thread-safety. (Iterating over shared lists/tuples/ranges should be fine like before. Reusing iterators between threads is not fine, like before. Sharing generators between threads is a recipe for significant crashes, like before.)