-
Notifications
You must be signed in to change notification settings - Fork 232
First version of architecture overview doc #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
10bc4b6
First version of architecture overview doc
nicoddemus b29ff73
Apply small review requests
nicoddemus c62effd
Reword reason why workers must keep a single test on queue always
nicoddemus 1716767
Drop "node" from "workers" and "master"
nicoddemus 0a5bdfc
Add a FAQ section
nicoddemus File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Overview # | ||
|
||
`xdist` works by spawning one or more **workers**, which are controlled | ||
by the **master**. Each **worker** is responsible for performing | ||
a full test collection and afterwards running tests as dictated by the **master**. | ||
|
||
The execution flow is: | ||
|
||
1. **master** spawns one or more **workers** at the beginning of | ||
the test session. The communication between **master** and **worker** nodes makes use of | ||
[execnet](http://codespeak.net/execnet/) and its [gateways](http://codespeak.net/execnet/basics.html#gateways-bootstrapping-python-interpreters). | ||
The actual interpreters executing the code for the **workers** might | ||
be remote or local. | ||
|
||
1. Each **worker** itself is a mini pytest runner. **workers** at this | ||
point perform a full test collection, sending back the collected | ||
test-ids back to the **master** which does not | ||
perform any collection itself. | ||
|
||
1. The **master** receives the result of the collection from all nodes. | ||
At this point the **master** performs some sanity check to ensure that | ||
all **workers** collected the same tests (including order), bailing out otherwise. | ||
If all is well, it converts the list of test-ids into a list of simple | ||
indexes, where each index corresponds to the position of that test in the | ||
original collection list. This works because all nodes have the same | ||
collection list, and saves bandwidth because the **master** can now tell | ||
one of the workers to just *execute test index 3* index of passing the | ||
full test id. | ||
|
||
1. If **dist-mode** is **each**: the **master** just sends the full list | ||
of test indexes to each node at this moment. | ||
|
||
1. If **dist-mode** is **load**: the **master** takes around 25% of the | ||
tests and sends them one by one to each **worker** in a round robin | ||
fashion. The rest of the tests will be distributed later as **workers** | ||
finish tests (see below). | ||
|
||
1. **workers** re-implement `pytest_runtestloop`: pytest's default implementation | ||
basically loops over all collected items in the `session` object and executes | ||
the `pytest_runtest_protocol` for each test item, but in xdist **workers** sit idly | ||
waiting for **master** to send tests for execution. As tests are | ||
received by **workers**, `pytest_runtest_protocol` is executed for each test. | ||
Here it worth noting an implementation detail: **workers** always must keep at | ||
least one test item on their queue due to how the `pytest_runtest_protocol(item, nextitem)` | ||
hook is defined: in order to pass the `nextitem` to the hook, the worker must wait for more | ||
instructions from master before executing that remaining test. If it receives more tests, | ||
then it can safely call `pytest_runtest_protocol` because it knows what the `nextitem` parameter will be. | ||
If it receives a "shutdown" signal, then it can execute the hook passing `nextitem` as `None`. | ||
|
||
1. As tests are started and completed at the **workers**, the results are sent | ||
back to the **master**, which then just forwards the results to | ||
the appropriate pytest hooks: `pytest_runtest_logstart` and | ||
`pytest_runtest_logreport`. This way other plugins (for example `junitxml`) | ||
can work normally. The **master** (when in dist-mode **load**) | ||
decides to send more tests to a node when a test completes, using | ||
some heuristics such as test durations and how many tests each **worker** | ||
still has to run. | ||
|
||
1. When the **master** has no more pending tests it will | ||
send a "shutdown" signal to all **workers**, which will then run their | ||
remaining tests to completion and shut down. At this point the | ||
**master** will sit waiting for **workers** to shut down, still | ||
processing events such as `pytest_runtest_logreport`. | ||
|
||
## FAQ ## | ||
|
||
> Why does each worker do its own collection, as opposed to having | ||
the master collect once and distribute from that collection to the workers? | ||
|
||
If collection was performed by master then it would have to | ||
serialize collected items to send them through the wire, as workers live in another process. | ||
The problem is that test items are not easily (impossible?) to serialize, as they contain references to | ||
the test functions, fixture managers, config objects, etc. Even if one manages to serialize it, | ||
it seems it would be very hard to get it right and easy to break by any small change in pytest. | ||
|
||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Architecture question: Why does each worker do its own collection, as opposed to having the master collect once and distribute from that collection to the workers?
This would drop the need for the master to do the sanity check you describe later, and would perhaps also facilitate the work we are discussing in #18.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the collection was performed by master then it would have to serialize collected items to send them through the wire, as workers live in another process. The problem is that test items are not easily (impossible?) to serialize, as they contain references to the test functions, fixture managers, config objects, etc. Even if one manages to serialize it, it seems it would be very hard to get it right and easy to break by any small change in pytest. At least that's my understanding.
IIRC @hpk42 commented in a mailing list that
xdist
at its beginning performed collection at master, but that proved too troublesome. Perhaps he can share more here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I know from the Apache Spark project that reliable object serialization can be a tough problem to solve.
This is off-topic as far as this review is concerned, but perhaps now is good time to revisit this issue now that libraries like Dill and cloudpickle exist. Perhaps it is not as difficult to solve as it once was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a FAQ at the bottom with this answer, thought it worth doing.