-
Notifications
You must be signed in to change notification settings - Fork 346
race condition and missing results in aggregation #732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java
Outdated
Show resolved
Hide resolved
In attempting to diagnose an ATH failure I noticed what looks like a potential race condition that could lead to results not being correctly updated. For example in rare cases (but maybe not so rare in our ATH) if 2 builds finish at T1, a web request comes in a T2 and then another build fishes at T3 followed by a web request at T4. When T1 arrives the timestamp is set to T1, and then in T2 the current value is checked against the timestamp and set to T1+1 When T3 arrives the timestamp is set to T3. THe issue occurs if you have multiple CPUs and updating the code is fast and T1 = T2 = T3. When the request at T4 arrives the code thinks everything is up-to-date when it is not.
The dependencies may not be calculated when we create the action. The dependency graph is updated asynchronously by Jenkins, so we may not have the full graph when the TestResultAction is created This caused the job calculation to be fixed in stone, and potentially incorrect if the test results where obtained before the graph calculation had completed. By deferring obtaining the downstream jobs we reduce the risk of partial results being shown, and they will be eventually complete (when some run completes). We are not computing the graph so this should have little overhead on the action.
0c60cce
to
54d4976
Compare
|
||
private transient List<AbstractProject> noFingerprints; | ||
|
||
@SuppressWarnings("deprecation") // calls getProject in constructor, so needs owner immediately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer true as calculation is defered, so the constructor was cleaned up (left as deprecated as it was public)
if (jobs == null) { | ||
return getProject().getTransitiveDownstreamProjects(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fixes missing results for downstream projects if the dependency graph has not yet been rebuilt.
It may still miss some things, but we will at least be eventually consistent as Runs complete. If you have an Jenkins that quiet this could last for some time, but there is not DependencyGraphListener or another way to monitor it for changes that I am aware of.
if (jobs == null) { | ||
// resolve null as the transitive downstream jobs | ||
StringBuilder buf = new StringBuilder(); | ||
for (AbstractProject p : getProject().getTransitiveDownstreamProjects()) { | ||
if (buf.length() > 0) { | ||
buf.append(','); | ||
} | ||
buf.append(p.getFullName()); | ||
} | ||
jobs = buf.toString(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to getJobs() so this is updated as the dependencyGraph is recomputed.
Set<AbstractProject> projects = getProject().getTransitiveDownstreamProjects(); | ||
return projects.stream() | ||
.filter(AbstractProject.class::isInstance) | ||
.map(AbstractProject.class::cast) | ||
.filter(p -> p.hasPermission(Item.READ)) | ||
.collect(Collectors.toSet()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously the code would set the string list of jobs which we would then recover below with access control checks. this puts the access control here with Item.READ.
* Gets the jobs to be monitored. | ||
*/ | ||
public Collection<AbstractProject> getJobs() { | ||
if (jobs == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the main fix for the failing internal ATH.
src/main/java/hudson/tasks/test/AggregatedTestResultPublisher.java
Outdated
Show resolved
Hide resolved
…java Co-authored-by: Tim Jacomb <[email protected]>
In attempting to diagnose an ATH failure I noticed what looks like a potential race condition that could lead to results not being correctly updated.
There is also issue where if jobs are not specified and are to be inferre,d then these are obtained from the dependency graph at action creation time, but the graph may not be correct at this point in time as it is built asynchronously.
There is also an other timing issue.
For example in rare cases (but maybe not so rare in our ATH) if 2 builds finish at T1, a web request comes in a T2 and then another build fishes at T3 followed by a web request at T4.
When T1 arrives the timestamp is set to T1, and then in T2 the current value is checked against the timestamp and set to T1+1
When T3 arrives the timestamp is set to T3.
THe issue occurs if you have multiple CPUs and updating the code is fast and T1 = T2 = T3.
When the request at T4 arrives the code thinks everything is up-to-date when it is not.
Testing done
internal ATH consistently failed before these changes, and passes after.
Submitter checklist