race condition and missing results in aggregation #732

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

jtnord wants to merge 6 commits into jenkinsci:master from jtnord:junit-aggregation

Member

jtnord commented Sep 8, 2025 •

edited

Loading

In attempting to diagnose an ATH failure I noticed what looks like a potential race condition that could lead to results not being correctly updated.

There is also issue where if jobs are not specified and are to be inferre,d then these are obtained from the dependency graph at action creation time, but the graph may not be correct at this point in time as it is built asynchronously.

There is also an other timing issue.
For example in rare cases (but maybe not so rare in our ATH) if 2 builds finish at T1, a web request comes in a T2 and then another build fishes at T3 followed by a web request at T4.

When T1 arrives the timestamp is set to T1, and then in T2 the current value is checked against the timestamp and set to T1+1

When T3 arrives the timestamp is set to T3.

THe issue occurs if you have multiple CPUs and updating the code is fast and T1 = T2 = T3.

When the request at T4 arrives the code thinks everything is up-to-date when it is not.

Testing done

internal ATH consistently failed before these changes, and passes after.

Submitter checklist

Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
Ensure that the pull request title represents the desired changelog entry
Please describe what you did
Link to relevant issues in GitHub or Jira
Link to relevant pull requests, esp. upstream and downstream changes
Ensure you have provided tests that demonstrate the feature works or the issue is fixed

jtnord marked this pull request as ready for review

September 17, 2025 10:23

jtnord requested a review from a team as a code owner

September 17, 2025 10:23

timja added the bug label

timja approved these changes

View reviewed changes

jtnord marked this pull request as draft

September 17, 2025 10:26

jtnord commented

View reviewed changes

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java Outdated Show resolved Hide resolved

jtnord added 3 commits

September 17, 2025 11:54


          potential race condition in result aggregation

017fc05

In attempting to diagnose an ATH failure I noticed what looks like a
potential race condition that could lead to results not being correctly
updated.

For example in rare cases (but maybe not so rare in our ATH) if 2 builds finish at
T1, a web request comes in a T2 and then another build fishes at T3
followed by a web request at T4.

When T1 arrives the timestamp is set to T1, and then in T2 the current
value is checked against the timestamp and set to T1+1

When T3 arrives the timestamp is set to T3.

THe issue occurs if you have multiple CPUs and updating the code is fast
and T1 = T2 = T3.

When the request at T4 arrives the code thinks everything is up-to-date when it is not.


          update on finalized

d5ff59a


          If the jobs are not specified we need to calculate them on demand

54d4976

The dependencies may not be calculated when we create the action.

The dependency graph is updated asynchronously by Jenkins, so we may not
have the full graph when the TestResultAction is created

This caused the job calculation to be fixed in stone, and potentially
incorrect if the test results where obtained before the graph calculation had completed.

By deferring obtaining the downstream jobs we reduce the risk of partial
results being shown, and they will be eventually complete (when some run
completes).
We are not computing the graph so this should have little overhead on
the action.

jtnord force-pushed the junit-aggregation branch from 0c60cce to 54d4976 Compare

September 18, 2025 10:03

jtnord changed the title ~~potential race condition in result aggregation~~ race condition and missing results in aggregation


          filter downstream projects based on permissions

fa3c186

jtnord commented

View reviewed changes

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java

    
                      private transient List<AbstractProject> noFingerprints;

                      @SuppressWarnings("deprecation") // calls getProject in constructor, so needs owner immediately

Member Author

jtnord Sep 18, 2025

no longer true as calculation is defered, so the constructor was cleaned up (left as deprecated as it was public)

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java

Comment on lines 163 to 165

    
                          if (jobs == null) {

                              return getProject().getTransitiveDownstreamProjects();

                          }

Member Author

jtnord Sep 18, 2025

this fixes missing results for downstream projects if the dependency graph has not yet been rebuilt.

It may still miss some things, but we will at least be eventually consistent as Runs complete. If you have an Jenkins that quiet this could last for some time, but there is not DependencyGraphListener or another way to monitor it for changes that I am aware of.

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java

Comment on lines -152 to -162

    
                          if (jobs == null) {

                              // resolve null as the transitive downstream jobs

                              StringBuilder buf = new StringBuilder();

                              for (AbstractProject p : getProject().getTransitiveDownstreamProjects()) {

                                  if (buf.length() > 0) {

                                      buf.append(',');

                                  }

                                  buf.append(p.getFullName());

                              }

                              jobs = buf.toString();

                          }

Member Author

jtnord Sep 18, 2025

moved to getJobs() so this is updated as the dependencyGraph is recomputed.

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java

Comment on lines 166 to 172

    
                              Set<AbstractProject> projects = getProject().getTransitiveDownstreamProjects();

                              return projects.stream()

                                      .filter(AbstractProject.class::isInstance)

                                      .map(AbstractProject.class::cast)

                                      .filter(p -> p.hasPermission(Item.READ))

                                      .collect(Collectors.toSet());

                          }

Member Author

jtnord Sep 18, 2025

previously the code would set the string list of jobs which we would then recover below with access control checks. this puts the access control here with Item.READ.

jtnord requested a review from timja

September 18, 2025 11:18

jtnord marked this pull request as ready for review

September 18, 2025 11:18

jtnord commented

View reviewed changes

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java

    
                       * Gets the jobs to be monitored.

                       */

                      public Collection<AbstractProject> getJobs() {

                          if (jobs == null) {

Member Author

jtnord Sep 18, 2025

this was the main fix for the failing internal ATH.

timja reviewed

View reviewed changes

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java Outdated Show resolved Hide resolved


          Update src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.…

…java

Co-authored-by: Tim Jacomb <[email protected]>

jtnord commented

View reviewed changes

src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java Outdated Show resolved Hide resolved


          Update src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.…

8c9b148

…java

timja approved these changes

View reviewed changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug