Skip to content

Conversation

jtnord
Copy link
Member

@jtnord jtnord commented Sep 8, 2025

In attempting to diagnose an ATH failure I noticed what looks like a potential race condition that could lead to results not being correctly updated.

There is also issue where if jobs are not specified and are to be inferre,d then these are obtained from the dependency graph at action creation time, but the graph may not be correct at this point in time as it is built asynchronously.

There is also an other timing issue.
For example in rare cases (but maybe not so rare in our ATH) if 2 builds finish at T1, a web request comes in a T2 and then another build fishes at T3 followed by a web request at T4.

When T1 arrives the timestamp is set to T1, and then in T2 the current value is checked against the timestamp and set to T1+1

When T3 arrives the timestamp is set to T3.

THe issue occurs if you have multiple CPUs and updating the code is fast and T1 = T2 = T3.

When the request at T4 arrives the code thinks everything is up-to-date when it is not.

Testing done

internal ATH consistently failed before these changes, and passes after.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@jtnord jtnord marked this pull request as ready for review September 17, 2025 10:23
@jtnord jtnord requested a review from a team as a code owner September 17, 2025 10:23
@timja timja added the bug label Sep 17, 2025
@jtnord jtnord marked this pull request as draft September 17, 2025 10:26
In attempting to diagnose an ATH failure I noticed what looks like a
potential race condition that could lead to results not being correctly
updated.

For example in rare cases (but maybe not so rare in our ATH) if 2 builds finish at
T1, a web request comes in a T2 and then another build fishes at T3
followed by a web request at T4.

When T1 arrives the timestamp is set to T1, and then in T2 the current
value is checked against the timestamp and set to T1+1

When T3 arrives the timestamp is set to T3.

THe issue occurs if you have multiple CPUs and updating the code is fast
and T1 = T2 = T3.

When the request at T4 arrives the code thinks everything is up-to-date when it is not.
The dependencies may not be calculated when we create the action.

The dependency graph is updated asynchronously by Jenkins, so we may not
have the full graph when the TestResultAction is created

This caused the job calculation to be fixed in stone, and potentially
incorrect if the test results where obtained before the graph calculation had completed.

By deferring obtaining the downstream jobs we reduce the risk of partial
results being shown, and they will be eventually complete (when some run
completes).
We are not computing the graph so this should have little overhead on
the action.
@jtnord jtnord changed the title potential race condition in result aggregation race condition and missing results in aggregation Sep 18, 2025

private transient List<AbstractProject> noFingerprints;

@SuppressWarnings("deprecation") // calls getProject in constructor, so needs owner immediately
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer true as calculation is defered, so the constructor was cleaned up (left as deprecated as it was public)

Comment on lines 163 to 165
if (jobs == null) {
return getProject().getTransitiveDownstreamProjects();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes missing results for downstream projects if the dependency graph has not yet been rebuilt.

It may still miss some things, but we will at least be eventually consistent as Runs complete. If you have an Jenkins that quiet this could last for some time, but there is not DependencyGraphListener or another way to monitor it for changes that I am aware of.

Comment on lines -152 to -162
if (jobs == null) {
// resolve null as the transitive downstream jobs
StringBuilder buf = new StringBuilder();
for (AbstractProject p : getProject().getTransitiveDownstreamProjects()) {
if (buf.length() > 0) {
buf.append(',');
}
buf.append(p.getFullName());
}
jobs = buf.toString();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to getJobs() so this is updated as the dependencyGraph is recomputed.

Comment on lines 166 to 172
Set<AbstractProject> projects = getProject().getTransitiveDownstreamProjects();
return projects.stream()
.filter(AbstractProject.class::isInstance)
.map(AbstractProject.class::cast)
.filter(p -> p.hasPermission(Item.READ))
.collect(Collectors.toSet());
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously the code would set the string list of jobs which we would then recover below with access control checks. this puts the access control here with Item.READ.

@jtnord jtnord requested a review from timja September 18, 2025 11:18
@jtnord jtnord marked this pull request as ready for review September 18, 2025 11:18
* Gets the jobs to be monitored.
*/
public Collection<AbstractProject> getJobs() {
if (jobs == null) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the main fix for the failing internal ATH.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants