
Use process pool for test runner #4614


Merged: 7 commits, Mar 10, 2025

Conversation

@HollandDM (Contributor) commented Feb 23, 2025

fixes #4590.

This PR introduces a testEnableWorkStealing setting for TestModule. When enabled, the tests in the module are distributed by a work-stealing process pool and run in parallel with each other.

Internally, when testEnableWorkStealing is set to true, Mill will try to spawn at most --jobs test runners for each test group. Each runner competes to "steal" test classes from a shared folder to run. This increases the throughput of the whole test group, as slow tests no longer block fast tests from running. Each test runner has its own working folder and log files used to communicate with the Mill parent process.
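For reference, opting in looks roughly like the following build file sketch. The module layout, Scala version, and test framework below are made up for illustration; only the `def testEnableWorkStealing = true` override comes from this PR.

```scala
// Build file sketch (hypothetical module; only testEnableWorkStealing is from this PR)
import mill._, scalalib._

object foo extends ScalaModule {
  def scalaVersion = "2.13.15"

  object test extends ScalaTests with TestModule.Utest {
    def ivyDeps = Agg(ivy"com.lihaoyi::utest:0.8.4")
    // opt in to the work-stealing test runner process pool for this module
    def testEnableWorkStealing = true
  }
}
```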

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from c09aa01 to 20eccb3 on February 23, 2025 16:10
@HollandDM (Contributor Author)

A local run on my machine yields similar test times when running scalalib.test.testForked with grouping on and off. But do I need some kind of benchmark for this?

@lihaoyi (Member) commented Feb 24, 2025

Let's go with the multiple-JVMs-one-thread-each architecture described in the original ticket, rather than the one-JVM-multiple-threads architecture you have in this PR.

Having each JVM be single-threaded running tests makes things like resource usage much more attributable and deterministic, since the resource footprint of each JVM can only be due to the one test it is currently running. Whereas if you put a bunch of tests into a single JVM and it crashes due to OOM, you have no idea which test is at fault.

In terms of performance, I think the goal should be:

  1. Running heavyweight test classes like scalalib.test.testForked, it should have no significant performance difference with testForkGrouping enabled or disabled

  2. Running lightweight test classes, say 1000 lightweight test classes that do nothing, it should be much more performant than having testForkGrouping configured to one-test-per-class, and comparable to testForkGrouping disabled (although maybe not quite the same, depending on whether the overhead of N different JVMs or the speedup from running N-way in parallel wins out)

@HollandDM (Contributor Author)

Hmm, looks like I misunderstood the idea of the original ticket. From what you said, it seems like each test group can now spawn multiple subprocesses (with respect to the total --jobs config), and they can perform work stealing between them, right?
Also, for the performance goals, number 1 is easy to set up. But for number 2, do we have an existing test like this, or should I create a new one and add it to this PR as well?

@lihaoyi (Member) commented Feb 24, 2025

@HollandDM that's right.

In terms of the (2) benchmarks, feel free to include it in this PR as an integration test. You can have the test code generate the necessary project files before running the test to avoid having to commit all the boilerplate code
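As a sketch of that idea, the integration test could write out a large number of trivial test sources before invoking Mill. Everything below (paths, package, suite names, the choice of utest) is hypothetical:

```scala
// Sketch: generate N do-nothing test classes on disk instead of committing them
def generateLightweightTests(testSrcDir: os.Path, count: Int = 1000): Unit = {
  os.makeDir.all(testSrcDir)
  for (i <- 0 until count) {
    os.write.over(
      testSrcDir / s"Dummy${i}Tests.scala",
      s"""package bench
         |import utest._
         |object Dummy${i}Tests extends TestSuite {
         |  val tests = Tests { test("noop") { assert(true) } }
         |}
         |""".stripMargin
    )
  }
}
```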

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from 040daa7 to d96c887 on February 28, 2025 06:01
@HollandDM (Contributor Author)

@lihaoyi I've updated this to the multiple-JVMs-one-thread-each architecture, with the work-stealing capability.
The PR is still something of a PoC to make sure the approach is correct, so please verify it first.
I'll try to clean up and pass all the test cases in the following commits.

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch 2 times, most recently from 6612052 to c9ca9c4 on February 28, 2025 16:27
@lihaoyi (Member) commented Mar 3, 2025

@HollandDM can you write an English summary of how the implementation works? That would make it much easier to review than trying to reverse-engineer your diff from scratch.

@lihaoyi (Member) commented Mar 3, 2025

At a first glance, the IPC protocol looks far too complicated for what this task requires. For example:

  1. In terms of stealing work, we can get around the whole locking-stealing-file protocol by writing each test group's test class names to files on disk, that each subprocess tries to atomically os.move to claim the class. Such a simple disk-based queue would be more than sufficient for the scales we're looking at (O(num-test-source-files)) and would have a lot fewer moving parts than the mem-mapped in-memory version you have here

  2. Do we really need 8 different cluster states? Since the test module has to complete running every test class eventually anyway, each subprocess could just continually take work off its (possibly shared) queue until there is no more work, and then shut down. There should be no need for subprocesses ever to be blocked, stopped, or in other states: either it is actively running tests, or it's done when there are no more test classes to claim. We never add test classes to the queue that weren't there to start, so we don't need to worry about keeping a subprocess around idle to use it later

  3. In terms of managing concurrency, if the parent Mill process runs each subprocess in a ctx.fork.async(dest, key, msg){...} block, it will automatically claim the async slot before each subprocess starts, and release it when the subprocess ends. The subprocess shouldn't need to worry about this: the fact that the subprocess exists means that a slot is claimed for it, and the fact that the subprocess hasn't exited means it is doing work running some test class. Once the subprocess exits, the Mill parent process can release the async slot, which using ctx.fork.async happens automatically

Before coding everything up, it's worth sketching out exactly what dataflows are necessary between the parent and worker processes. What I can think of are:

  1. Parent needs to pass shared test class queue to workers (can be done via files on disk)
  2. Workers need to signal completion to parent (can be done by exiting)
  3. Workers need to pass test results to parent (can be done at the end via the same code path we currently use, writing JSON to a file on disk)
  4. Workers need to pass name of currently-running test suite to parent (maybe write it to a file on disk the parent polls?)
  5. Workers need to stream logs to parent (can be done via stdout/stderr just as it is done now)

I think those are the only ways in which the parent and worker processes need to interact? If we limit our implementation to these dataflows, that will greatly simplify the implementation:

Parent:

  1. The parent assigns each test group a queue folder and writes that test group's classes as files within it
  2. For i in NUM_CORES:
    1. The parent loops through every test group, spawning a ctx.fork.async that first checks if the test queue folder is empty, exits early if there are no tests left, otherwise runs a single subprocess worker for that test group
  3. While each ctx.fork.async is running, polls a file on disk assigned to each worker containing the running test class, and updates the status prompt via setPromptDetail
  4. Once all ctx.fork.asyncs have completed, collect the test result files on disk, parse/aggregate them, and return them as the result of the test command

Worker:

  1. List the files in the test queue folder
    1. If the list is non-empty, try to claim a file with os.move(file, os.tmp())
      1. If you successfully claimed the file, then write the test class name to a file on disk for the parent to pick up, then run the test class associated with that file
      2. If you failed to claim the file, loop around and try again
    2. If the list is empty, exit.

If we want more features in future that our implementation doesn't support we can refactor the code when the time comes.
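A condensed sketch of that parent/worker split, using os-lib calls only. The helper names and signatures below are placeholders rather than this PR's actual code, and the ctx.fork.async wiring on the parent side is elided:

```scala
import scala.util.Try

// Parent side (sketch): one queue folder per test group, one file per test class
def prepareQueue(queueFolder: os.Path, testClasses: Seq[String]): Unit = {
  os.makeDir.all(queueFolder)
  // the file name doubles as the unit of work to be claimed
  testClasses.foreach(cls => os.write.over(queueFolder / cls, cls))
}

// Worker side (sketch): claim classes one at a time until the queue folder is empty
def workerLoop(queueFolder: os.Path, claimedFolder: os.Path, statusFile: os.Path)
              (runTestClass: String => Unit): Unit = {
  os.makeDir.all(claimedFolder)
  var done = false
  while (!done) {
    os.list(queueFolder).headOption match {
      case None => done = true // queue drained: exit, letting the parent release the slot
      case Some(file) =>
        // an atomic rename either claims the class or fails because another worker won
        val claimed = Try(os.move(file, claimedFolder / file.last)).isSuccess
        if (claimed) {
          os.write.over(statusFile, file.last) // let the parent's poller show the running class
          runTestClass(file.last)
        }
    }
  }
}
```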

@HollandDM (Contributor Author) commented Mar 3, 2025

Let me explain the IPC protocol, as it is the heart of this implementation.
In this implementation:

  • the host Mill process decides: 1. when to spawn a subprocess; 2. when to signal a subprocess to stop; 3. which subprocess a stealer will steal from.
  • each subprocess signals the cluster: 1. whether it is running normally or slowly (blocking); 2. whether it wants to steal work.
  • each subprocess signals the other subprocesses: 1. whether it wants to steal; 2. whether it permits others to steal from it.

For the host: at first it spawns one subprocess to handle all the test cases. As time goes on, if any subprocess wants to steal, the host finds another subprocess to be the victim, prioritizing blocked ones. The host also keeps track of the state each subprocess is in at any moment: if more subprocesses are blocking than stealing, the host will try to spawn a new one to accommodate the need to offload work; if more are stealing than blocking, it will signal denial to the stealers.

For the subprocesses: they periodically notify the host of their running state. If a subprocess is allowed to steal from a victim, a second channel between the pair is used to coordinate the steal (the tests to steal are written to disk, and the other side is signalled to check them). If a subprocess is denied from stealing, it can keep retrying for a while, but ultimately it will stop.

Yes, this is quite a complicated work-stealing system. The reasons I went this way are:

  • It is a genuine work-stealing system: the stealing itself happens without going through the host. The host only pairs up the stealer and the victim, and the pain of coordinating the steal is offloaded to that pair of subprocesses.
  • The stealing happens in batches, and if one test is blocked because it is running too slowly, the other tests are guaranteed to be picked up by other subprocesses. If we instead coordinated by using the host as the test provider, we could not guarantee this behavior unless we handed out one test at a time (which I think is worse because of the I/O overhead compared to running a test). This of course has more moving parts, and a test can be moved multiple times (but at most O(log(num-test-source-files)) times).
  • The number of states is quite overwhelming, but it is needed to convey the state of each party, and because I did not want to use any locking mechanism while doing IPC, the number of states is expected to be larger than with locking.
  • In terms of managing concurrency, I did it like this to respect the --jobs parameter to the fullest. When the host decides to spawn a new subprocess, it needs to "ask" the executor for a slot, and it can be denied. Spawning normally via ctx.fork.async (which is a decorated new Future[A]) leaves the number of subprocesses for one test group at up to --jobs. At first I thought this could lead to jobs * jobs subprocesses being spawned in the worst case, but after a closer look I don't think that is the case, so the concurrency-management problem can indeed be reduced to just using ctx.fork.async.

Your suggestion is way simpler and should work fine. The only thing I'm not too sure about is how many tests we should write to a single file for a subprocess to pick up. If it is one, then subprocesses can end up scanning and contending for files on disk heavily. If it is more, then quick test cases can be blocked behind one slow test case, and I think work stealing is required in that situation, which would lead to something similar to this design.

@lihaoyi (Member) commented Mar 3, 2025

Got it. What you say makes sense, but let's go with the approach I suggested and see if we can make that work without the complexity of a full peer-to-peer work-stealing model.

> The only thing that I'm not too sure about is how many tests we should write to a single file for a subprocess to pick up. If it is one, then subprocesses can end up scanning and contending for files on disk heavily. If it is more, then quick test cases can be blocked behind one slow test case, and I think work stealing is required in that situation, which would lead to something similar to this design.

Most things work on the granularity of test classes, so that would be the natural unit of work here. In that case the workers can just pick work directly from the shared queue folder, until the folder is empty.

I don't think we need to worry too much about the performance of the disk queue; we won't have more than a few thousand test classes per module, and the operations we need to do are cheap ones (just checking whether the folder is empty, and trying an os.move on an arbitrary file to claim it).

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from c9ca9c4 to 33ba34b on March 5, 2025 02:00
@HollandDM (Contributor Author) commented Mar 5, 2025

@lihaoyi I updated the PR to use the simplified version you came up with.
The performance is also great; it seems we indeed don't need the full-blown work-stealing cluster I designed originally.
The only downside for now is that the number of awaiting threads gets very large when testGroup is used.
I'll try to fix the CI in the following commits if we're good to go with this implementation.

@lihaoyi (Member) commented Mar 5, 2025

@HollandDM thanks, will take a look

if (files.nonEmpty) {
var shouldRetry = false
val offset = Random.nextInt(files.size)
try {
@lihaoyi (Member), Mar 5, 2025

We can do val shouldRetry = try { to avoid the var here
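i.e. roughly the following shape, where the body and exception handling are placeholders for whatever the original try block does:

```scala
// Sketch: let the try expression itself produce the flag instead of mutating a var
val shouldRetry =
  try {
    // ... original claim attempt goes here ...
    false // claim succeeded, no retry needed
  } catch {
    case _: Throwable => true // lost the race for this file, retry with another one
  }
```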

val taskQueue = mutable.Queue.empty[Task]

@tailrec
def stealFromSelectorFolder(): Boolean = {
Member:

Can we move stealFromSelectorFolder out of stealTasks into its own top-level method and make it return an Option[String] with the successfully stolen class name? That will help encapsulate things a bit more and keep the code easier to understand
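Something along these lines, for example; the folder parameters and the way the claimed file is parked are illustrative only:

```scala
// Sketch: a standalone helper that claims one selector file and returns the stolen
// class name, or None once the shared queue folder is exhausted
def stealFromSelectorFolder(selectorFolder: os.Path, stealFolder: os.Path): Option[String] = {
  os.list(selectorFolder).iterator
    .map { file =>
      try { os.move(file, stealFolder / file.last); Some(file.last) }
      catch { case _: Throwable => None } // another runner claimed this one first
    }
    .collectFirst { case Some(className) => className }
}
```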

// we successfully stole a selector
val selectors = os.read.lines(stealFolder / s"selector-stolen")
val classFilter = TestRunnerUtils.globFilter(selectors)
val tasks = runner.tasks(
Member:

Is tasks always of length 1 here? If so we can add an assert and simplify the code accordingly

Contributor Author:

I'm not really sure whether the tasks returned always have the same length as the taskdefs input.

Comment on lines 266 to 275
@tailrec
def stealTaskLoop(): Unit = {
if (taskQueue.nonEmpty) {
val next = taskQueue.dequeue().execute(
new EventHandler {
def handle(event: Event) = {
testReporter.logStart(event)
events.add(event)
testReporter.logFinish(event)
}
},
Array(new Logger {
def debug(msg: String) = ctx.log.outputStream.println(msg)
def error(msg: String) = ctx.log.outputStream.println(msg)
def ansiCodesSupported() = true
def warn(msg: String) = ctx.log.outputStream.println(msg)
def trace(t: Throwable) = t.printStackTrace(ctx.log.outputStream)
def info(msg: String) = ctx.log.outputStream.println(msg)
})
)

taskQueue.enqueueAll(next)
stealTaskLoop()
} else if (stealFromSelectorFolder()) {
stealTaskLoop()
} else {
()
}
}
@lihaoyi (Member), Mar 5, 2025

This @tailrec method can probably be more easily understood written as nested loops, something like:

while({
  stealFromSelectorFolder() match{
    case Some(className) => 
      for(cls <- classes) executeTaskForCls(cls)
      true
    case None => false
  }
})()

The inner loop is always over a fixed collection of elements while the outer loop runs until a condition is met, so having it be two separate loops makes these properties a bit more obvious than having them mixed together in one big tailrec function

Contributor Author:

This is more about the coding style used in these repositories. In my other open-source PRs, people tend to prefer tailrec over while loops. It looks like it's the other way around here. I'm happy to change all the tailrec to while loops if you'd like. I don't really have a preference.

val events = new ConcurrentLinkedQueue[Event]()
val doneMessage = {

val taskQueue = mutable.Queue.empty[Task]
Member:

With the changes suggested for stealFromSelectorFolder and stealTaskLoop, we should be able to remove this mutable taskQueue as well

val selectorFolder = base / "selectors"
os.makeDir.all(selectorFolder)
selectors2.zipWithIndex.foreach { case (s, i) =>
os.write.over(selectorFolder / s"selector-$i", s)
Member:

How about we name the files according to the class name s rather than the index i? If we need to debug things being able to ls to see the queued test classes would be much easier than having to dig into each file individually
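A minimal version of that suggestion, assuming selectors2 holds the class-name strings as in the snippet above:

```scala
// Sketch: key each queue file by the class name itself, so `ls` shows what is still pending
selectors2.foreach { className =>
  os.write.over(selectorFolder / className, className)
}
```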

Comment on lines 108 to 116
def runTestRunnerSubprocesses(selectors2: Seq[String], base: os.Path) = {
val selectorFolder = prepareTestSelectorFolder(selectors2, base)

val resultFutures = Range(0, selectors2.length).map { index =>
val indexStr = index.toString
Task.fork.async(base / indexStr, indexStr, s"Test process $indexStr") {
(indexStr, runTestRunnerSubprocess(index, base / indexStr, selectorFolder))
}
}
@lihaoyi (Member), Mar 5, 2025

We can use some heuristics to optimize the way we spawn the worker processes here and avoid the large number of processes you mentioned:

  1. We don't ever want to have more processes per group than testClassList.length, since they won't have anything to do even if every process takes one test class

  2. We don't ever want to have more processes per group than ctx.jobs, since that's the parallelism limit

  3. For each Task.fork.async we can check the selectors folder once before running runTestRunnerSubprocess, so that if that group's tests are all claimed before the process starts we can skip the process overhead.

There are probably some other heuristics we can add, but all together this should be enough to mitigate the problem with too many processes: any Task.fork.asyncs that do not have any work to do will be cheap and exit quickly without subprocess overhead, so even if we queue up a lot of them there shouldn't be much cost
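A sketch of those caps, reusing the names from the snippet above (selectors2, base, selectorFolder, Task.fork.async, runTestRunnerSubprocess) and the ctx.jobs value mentioned in the comment; wrapping the result in Option to skip the empty case is a sketch-level simplification, not the PR's actual code:

```scala
// Sketch: never spawn more workers per group than there are test classes or than --jobs,
// and bail out cheaply if the queue is already drained by the time the async block runs
val numWorkers = math.min(ctx.jobs, selectors2.length)

val resultFutures = Range(0, numWorkers).map { index =>
  val indexStr = index.toString
  Task.fork.async(base / indexStr, indexStr, s"Test process $indexStr") {
    if (os.list(selectorFolder).isEmpty) (indexStr, None) // all classes already claimed
    else (indexStr, Some(runTestRunnerSubprocess(index, base / indexStr, selectorFolder)))
  }
}
```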

@HollandDM (Contributor Author), Mar 5, 2025

The implementation already has the check before spawning, so it's true that the number of meaningful threads that spawn subprocesses is small.
What I meant was that when using testGroup, the server calls runTestRunnerSubprocesses for each test group. These Futures then only await runTestRunnerSubprocess (the real test-running work) to return. So we get a lot of awaiting Futures hanging around, and they all show up in the prompt.

@HollandDM (Contributor Author), Mar 5, 2025

It looks exactly like the thing you saw in #4611

Member:

Got it. Since runTestRunnerSubprocesses already spawns fork.async blocks, what if we move the caller out into normal synchronous code? So instead of

          Task.fork.async(Task.dest / folderName, paddedIndex, groupPromptMessage) {
            (folderName, runTestRunnerSubprocesses(testClassList, Task.dest / folderName))
          }

Just

          (folderName, runTestRunnerSubprocesses(testClassList, Task.dest / folderName))

That way there won't be a middle-layer of async blocks that do nothing except wait for their children, which should reduce the number of blocked tasks showing up in the command line prompt

Contributor Author:

This would make the test groups run sequentially with each other rather than concurrently.
But I understand your thinking now, and I think I can make use of it: maybe I'll flatten all the work out to the outer layer and fork from there.

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from b2edc51 to 8af8cbc on March 5, 2025 09:01
@HollandDM (Contributor Author) commented Mar 5, 2025

@lihaoyi Updated. I replaced the @tailrec methods with while loops, and also flattened all the subprocess calls out to the outermost level and awaited all of them there.

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from d346139 to 9ae5330 on March 5, 2025 13:40
@HollandDM (Contributor Author) commented Mar 6, 2025

CI failed massively, so I did some investigation.
It seems that the example tests can be troublesome: because we now always have multiple test runners per group, the order of the output log can be non-deterministic, and that would make all the example tests fail if they relied on a deterministic output log.
One way to keep the same behavior is, for each subprocess in a group, to sort the logs into the correct order (where to find it? the result of discoveredTestClasses maybe?). This can be done with two strategies:

- Defer all the logs and, after everything ends, write them out in the correct order. This delays the output until the end, and I don't think we want that.
- Write the logs in order as soon as possible, e.g. if we receive logs in the order 1 4 2 6 3 5, then we output four times: [1], [2], [3, 4], [5, 6]. This is more responsive, but requires some kind of concurrent algorithm.

Another way is to change ExampleTester to check expectedLine by existence only, but I'm afraid this could affect too many areas outside of the Scala tests (Kotlin, Python, JavaScript, etc.).

Or we could update all related tests to use -j 1, but this would require explicit knowledge from contributors every time they add new examples.

Also, in order to avoid some move contention, each subprocess currently picks files totally at random, so we would need to update that too.

False alarm

@lihaoyi (Member) commented Mar 6, 2025

@HollandDM could you elaborate on the CI failures? AFAIK the example tests already check output in a non-ordered fashion: as long as each line in the example expected output appears in the actual output, the test passes, regardless of what order the actual output is in. So if your tests are failing, it seems like there must be some other issue there other than the difference in output ordering

@HollandDM (Contributor Author) commented Mar 6, 2025

Oh true, now that I check the code carefully, it does check by existence. I guess looking at all the --- Expected output ---------- markers made me dizzy. Sorry for the false alarm.
Let me try fixing the CI.

Comment on lines 333 to 334
new Runnable {
override def run(): Unit = {
Member:

I think this can be shortened using the lambda syntax () => {...} without explicitly instantiating an anonymous subclass of Runnable
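i.e. since Runnable has a single abstract method, a Scala 2.12+ lambda can stand in for the anonymous class:

```scala
// Sketch: SAM conversion lets a lambda replace the explicit anonymous Runnable subclass
val task: Runnable = () => {
  // ... same body as before ...
}
```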

stealFolder: os.Path,
selectorFolder: os.Path
): Array[Task] = {
val stealLog = stealFolder / os.up / "steal.log"
Member:

This can now be written stealFolder / "../steal.log", or rather stealFolder / "../status.txt" since it's not really a log file but a snapshot of the current status

@HollandDM (Contributor Author), Mar 8, 2025

The file is appended to while the runner is working, so it can be used to check the order of the tests that were executed by the runner.

@HollandDM (Contributor Author), Mar 8, 2025

I also like "/ os.up" more; I think it's more verbose and explicit.

Member:

Got it, it makes sense then. We can leave it as is

Member:

Maybe have a comment somewhere explaining how this works?

Comment on lines 244 to 245
while (true) {
val files = os.list(selectorFolder)
Member:

I think we can simplify this further given the current design: we just need to run os.list once and then loop over all the tests to try and steal each one in order. Since the set of files in selectorFolder never grows, running os.list a second time will never pick up anything the first list didn't.

Let's also get rid of the Random.nextInt selection and just pick up the tests in order from top to bottom. That will give a bit more determinism to how things run, and although it is still nondeterministic due to parallelism, it would help people follow the order in which tests are being run.

Contributor Author:

I did it like this because I want to check whether we have any files left, and only steal from the leftovers.

Member:

The os.move already effectively checks whether the file is still there before taking it, so we don't need to run an entire os.list each time we steal a single file. We could add an os.exists check before the os.move if we want to avoid throwing and catching an error in the common case.
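A minimal form of that loop, with the surrounding names (selectorFolder, stealFolder, the run step) treated as placeholders:

```scala
// Sketch: list once, walk the files in order, and only attempt the claim if the file
// still exists, so the common already-taken case does not throw at all
for (file <- os.list(selectorFolder) if os.exists(file)) {
  try {
    os.move(file, stealFolder / file.last)
    runTestClass(file.last) // hypothetical: run the class we just claimed
  } catch {
    case _: Throwable => () // another worker claimed it between the check and the move
  }
}
```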

val paddedGroupIndex = mill.internal.Util.leftPad(groupIndex.toString, maxGroupLength, '0')
val paddedProcessIndex =
mill.internal.Util.leftPad(processIndex.toString, maxProcessLength, '0')
val processFolder = baseFolder / processIndex.toString
Member:

Let's call this baseFolder / s"worker-$processIndex", so if someone looks on disk it's a bit more obvious what is going on

@lihaoyi (Member) commented Mar 7, 2025

Right now the ctx.fork.async tasks all have no name. We should fix that for the default case when work stealing is disabled, and when work stealing is enabled let's give them simple names worker-$processIndex if there's one group and worker-$groupIndex-$processIndex if there's multiple groups. We don't need to list the test classes, since we don't know up front which test classes each worker will take, and the in-progress test will show up in the ticker on the right anyway

@lihaoyi (Member) commented Mar 7, 2025

I think this looks pretty good. Adding def testEnableWorkStealing = true in my manual testing does indeed make it work, though the missing labels when testEnableWorkStealing is not enabled need to be fixed. Left some last comments, but once those are resolved I think the code is ready to merge.

Next steps are probably:

  1. Update the PR description to accurately reflect the design of the PR after all the revisions and updates
  2. Open a second PR backporting it to the 0.12.x branch. Looking at the code I think all the APIs changed are internal or private so there shouldn't be binary compat issues.

Once both PRs are ready we can merge it and close out the bounty

@HollandDM (Contributor Author) commented Mar 8, 2025

> Right now the ctx.fork.async tasks all have no name. We should fix that for the default case when work stealing is disabled, and when work stealing is enabled let's give them simple names worker-$processIndex if there's one group and worker-$groupIndex-$processIndex if there's multiple groups. We don't need to list the test classes, since we don't know up front which test classes each worker will take, and the in-progress test will show up in the ticker on the right anyway

I think there is a bug in MultiLogger that causes this:
when creating a new MultiLogger, it does not get its own message/keySuffix values, so when we call setPromptLine, it uses the default value of "".

I tried a simple fix like this:

  private[mill] override def message = logger1.message ++ logger2.message
  private[mill] override def keySuffix = logger1.keySuffix ++ logger2.keySuffix

And the log is good again, but I'm not sure this is the right way to do it.

Would you like me to include this fix in this PR? If not, I can skip the problem here since it'll be fixed in another PR.

@lihaoyi (Member) commented Mar 8, 2025

@HollandDM your proposed fix looks reasonable. Let's include it in this PR.

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from 747de59 to 47dff00 on March 8, 2025 14:51
@HollandDM (Contributor Author) commented Mar 8, 2025

@lihaoyi updated, please check whether you are happy with the current state of the PR.

In the meantime, I'll assume the PR is good to go and create a backport for the 0.12.x branch using this PR as the source.

@HollandDM HollandDM requested a review from lihaoyi March 8, 2025 14:53
Comment on lines 6 to 19
@Test
public void test1() throws Exception {
testGreeting("Storm", 38);
}

@Test
public void test2() throws Exception {
testGreeting("Bella", 25);
}

@Test
public void test3() throws Exception {
testGreeting("Cameron", 32);
}
Member:

Do we really need more than one test per file? If not we could consolidate each file to a single test case to greatly reduce the verbosity of these test examples

@@ -0,0 +1,38 @@
// Test stealing is an opt-in, powerful feature that enables parallel test execution while maintaining complete reliability and debuggability.
Member:

"while maintaining complete reliability and debuggability" is meaningless. You should write why people would turn this on: because it provides both the speedups from parallelism and the efficiency of re-using JVMs between tests, allowing it to be used on all sorts of test suites without downside. And we will likely make it the default in future.

@HollandDM HollandDM force-pushed the work-stealing-test-runner branch from 75f6fa8 to 43225cf on March 9, 2025 04:22
@HollandDM HollandDM requested a review from lihaoyi March 9, 2025 04:24
@HollandDM (Contributor Author)

Updated the test examples to only have one test case each, and fixed some wording in the adoc.

@HollandDM HollandDM marked this pull request as ready for review March 9, 2025 15:41
@HollandDM (Contributor Author)

@lihaoyi the backport for 0.12.x is ready at #4697

@lihaoyi (Member) commented Mar 10, 2025

I think this looks good, thanks @HollandDM! Let's get #4679 green and merged as well, and send me your international bank transfer details and I will send you the bounty

@lihaoyi lihaoyi merged commit f733779 into com-lihaoyi:main Mar 10, 2025
41 of 42 checks passed
lihaoyi pushed a commit that referenced this pull request Mar 10, 2025
ported for 0.12.x from #4614

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
@lefou lefou added this to the 0.13.0 milestone Mar 10, 2025
lihaoyi added a commit that referenced this pull request Mar 11, 2025
Turns on #4614 by default in
0.13.x, leaving the default as off in 0.12.x
lihaoyi added a commit that referenced this pull request Mar 13, 2025
…nc task and de-prioritizing subsequent ones (#4714)

This is a follow up improvement to
#4701 which uses an explicit
priority queue rather than relying on ad-hoc LIFO ordering. By
prioritizing the first async task and de-prioritizing subsequent async
tasks, this encourages Mill to re-use the same async task JVM to do more
work, minimizing JVM startup overhead

When testing this out on the Netty example with `testParallelism`,
prioritizing the first testrunner JVM over all others does seem to make
a material difference. I manually performed ad-hoc benchmarks of the
time taken to run the following command, basically running all the tests
that we run in the example integration test:

```
 /Users/lihaoyi/Github/mill/out/dist/launcher.dest/run 'codec-{dns,haproxy,http,http2,memcache,mqtt,redis,smtp,socks,stomp,xml}.test' + 'transport-{blockhound-tests,native-unix-common,sctp}.test'
```

- `priority = -1` (`fork.async`s always run before other tasks): ~20s
- `priority = 1` (`fork.async`s always run after other tasks): ~20s
- `priority = if (processIndex == 0) -1 else processIndex` (The first
`fork.async` in a test module runs before all other tasks, subsequent
`fork.async`s run after all other tasks): ~11s

- As a comparison, `testForkGrouping` with 1-element groups: 50s

Notably, this first-JVM priority has similar timings with
`testParallelism = false` on this benchmark, where the previous approach
had some penalty (though not as bad as `testForkGrouping`). With these
improvements, the parallel test runner should be self-tuning enough to
turn on by default (#4614)
without needing manual configuration in most cases

This also lays the foundation to other prioritization strategies later.
For example, we may want to prioritize historically-slower tasks over
faster tasks, all else being equal, to avoid long-running stragglers
delaying a command from completing after everything else has finished.
But such improvements can come later

Successfully merging this pull request may close these issues.

Implement a dynamic work-stealing test fork runner (1500USD Bounty)