@jackshirazi
Contributor

closes #2458

@jackshirazi jackshirazi requested a review from a team as a code owner November 24, 2025 12:42
      profilingTask = scheduler.schedule(this, delay, TimeUnit.MILLISECONDS);
    }
  } finally {
    profilerLock.unlock();
Contributor Author

@jackshirazi jackshirazi Nov 24, 2025

For reviewers: this is the same code block, just moved into a try-finally block so that it runs under a lock. This ensures that setProfilerInterval takes effect correctly. The race is as follows: run() finishes a profiling session and prepares to schedule the next one, reading the old interval (e.g., 10s). Concurrently, the test calls setProfilerInterval (e.g., 100ms) and then reschedule(), which tries to cancel the current task. If the two race, run() might successfully schedule the next task with the old 10s delay, effectively ignoring the update. The test then waits for the next session, expecting it within 100ms, but it doesn't happen for 10s, causing a timeout.

It's also possible for an interruption to occur just before the lock is acquired. This is handled by re-reading Thread.currentThread().isInterrupted() after acquiring the lock, so that no task is scheduled if an interruption occurred just beforehand.
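The locking scheme described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the class `IntervalSchedulerSketch` and the methods `scheduleNext`, `setIntervalAndReschedule`, and `pendingDelayMillis` are hypothetical names, and the real `run()` does considerably more work.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified stand-in for the profiler's scheduling logic. */
class IntervalSchedulerSketch implements Runnable {
  private final ReentrantLock profilerLock = new ReentrantLock();
  private final ScheduledExecutorService scheduler;
  private ScheduledFuture<?> profilingTask; // guarded by profilerLock
  private long intervalMillis; // guarded by profilerLock

  IntervalSchedulerSketch(ScheduledExecutorService scheduler, long intervalMillis) {
    this.scheduler = scheduler;
    this.intervalMillis = intervalMillis;
  }

  @Override
  public void run() {
    // A profiling session would run here, outside the lock.
    scheduleNext();
  }

  void scheduleNext() {
    profilerLock.lock();
    try {
      // Re-read the interrupt flag after acquiring the lock: a cancel(true) may have
      // interrupted this thread after the session ended but before the lock was taken.
      if (!Thread.currentThread().isInterrupted()) {
        profilingTask = scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
      }
    } finally {
      profilerLock.unlock();
    }
  }

  void setIntervalAndReschedule(long newIntervalMillis) {
    profilerLock.lock();
    try {
      // The interval update, the cancel, and the reschedule happen under one lock,
      // so a concurrent scheduleNext() cannot slip in with the old interval.
      intervalMillis = newIntervalMillis;
      if (profilingTask != null) {
        profilingTask.cancel(true);
      }
      profilingTask = scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
    } finally {
      profilerLock.unlock();
    }
  }

  /** Remaining delay of the pending task in milliseconds, or -1 if none is scheduled. */
  long pendingDelayMillis() {
    profilerLock.lock();
    try {
      return profilingTask == null ? -1 : profilingTask.getDelay(TimeUnit.MILLISECONDS);
    } finally {
      profilerLock.unlock();
    }
  }
}
```

Because both paths take the same lock, whichever of them schedules last does so with the current interval; the stale 10s task can no longer outlive an update to 100ms.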

      }
    }
  } finally {
    profilerLock.unlock();
Contributor Author

for reviewer, same code block just wrapped in the lock try-finally

    consumeActivationEventsFromRingBufferAndWriteToFile(profilingDuration);
  } finally {
    String stopMessage = profiler.execute("stop");
    logger.fine(stopMessage);
Contributor Author

@jackshirazi jackshirazi Nov 24, 2025

For reviewers: same code block, but restructured to ensure that profiler.execute("stop") is always called. Previously, an earlier interrupt would go straight to the catch block (below), bypassing the stop. This is most likely the cause of the test flakiness ("Profiler already started"): the profiler was not stopped correctly.
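The shape of that restructuring can be sketched as below. The `Profiler` and `EventConsumer` interfaces and the `runSession` method are hypothetical, minimal stand-ins introduced for illustration; only the `execute("stop")` call in a `finally` block mirrors the snippet above.

```java
class StopInFinallySketch {
  /** Hypothetical minimal profiler API; only execute(String) mirrors the snippet above. */
  interface Profiler {
    String execute(String command);
  }

  /** Hypothetical stand-in for the interruptible work done during a session. */
  interface EventConsumer {
    void consume() throws InterruptedException;
  }

  static void runSession(Profiler profiler, EventConsumer consumer) {
    try {
      profiler.execute("start");
      try {
        consumer.consume();
      } finally {
        // In a finally block because consume() can be interrupted; without it,
        // an interrupt would skip the stop and leave the profiler running.
        profiler.execute("stop");
      }
    } catch (InterruptedException e) {
      // Restore the flag so callers can still observe the interruption.
      Thread.currentThread().interrupt();
    }
  }
}
```

With this structure an interrupted session still issues "stop" before control reaches the catch block, so the next "start" does not find the profiler already running.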

Contributor

You could add a comment explaining that this is in a finally block because the preceding code can be interrupted.

Contributor Author

done

@jackshirazi jackshirazi changed the title from "Fix laky test" to "Fix flaky test" Nov 24, 2025
@jackshirazi jackshirazi self-assigned this Nov 24, 2025

boolean interrupted = Thread.currentThread().isInterrupted();
boolean continueProfilingSession =
    config.isNonStopProfiling() && !interrupted && postProcessingEnabled;
Contributor

Maybe we could inline the interrupted variable and add a comment explaining why we need the second call to Thread.currentThread().isInterrupted(): to get an up-to-date status rather than reusing the value from the previous call.
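A small sketch of why a stored flag can go stale. The class and method names are hypothetical, and the interrupt is issued by the thread on itself purely to simulate a `cancel(true)` arriving from another thread between the two reads.

```java
class InterruptSnapshotSketch {
  /** Returns {snapshot, fresh}: the stored value and the value re-read after an interrupt. */
  static boolean[] snapshotVersusFreshRead() {
    Thread.interrupted(); // start from a known state: clear any pending interrupt

    boolean snapshot = Thread.currentThread().isInterrupted(); // false at this point

    // Simulates an interrupt delivered after the snapshot was taken.
    Thread.currentThread().interrupt();

    // isInterrupted() reads the live flag without clearing it, so this sees the interrupt.
    boolean fresh = Thread.currentThread().isInterrupted();

    Thread.interrupted(); // clear the flag again so the sketch has no side effects
    return new boolean[] {snapshot, fresh};
  }
}
```

The stored value only says what was true when it was read; a decision made later, such as whether to schedule another session, needs a fresh read.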

Contributor Author

done

@github-actions github-actions bot requested a review from SylvainJuge December 5, 2025 15:06
@trask
Member

trask commented Dec 5, 2025

@jackshirazi does the CI failure on this PR look like the same issue you were trying to fix here, or a different one? https://scans.gradle.com/s/gfqf5jlzidn6q

@jackshirazi
Contributor Author

It's related. I had left shutdown() in place instead of shutdownNow(), which meant the previous test session could occasionally not finish before the next one started. I've added an interrupt for that too in the latest commit.
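The difference between the two calls can be sketched with a plain `ExecutorService`. The class and method names below are hypothetical, and the 60-second sleep merely stands in for a long-running profiling session.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ShutdownNowSketch {
  /** Returns true if the executor terminated within timeoutMillis after being shut down. */
  static boolean stopLongTask(boolean interrupt, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    executor.execute(() -> {
      started.countDown();
      try {
        Thread.sleep(60_000); // stands in for a long profiling session
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // interrupted: finish promptly
      }
    });
    try {
      started.await(); // make sure the task is actually running
      if (interrupt) {
        executor.shutdownNow(); // interrupts the running task
      } else {
        executor.shutdown(); // only stops accepting new tasks; the running one continues
      }
      return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      executor.shutdownNow(); // cleanup so the sketch never leaves a thread behind
    }
  }
}
```

With `shutdown()` the in-flight task keeps running until it finishes on its own, which is how a previous test's session can overlap with the next test's start.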

@trask
Member

trask commented Dec 8, 2025

another failure in the latest SamplingProfilerTest: https://scans.gradle.com/s/2dbaqboab5fr6

@jackshirazi
Contributor Author

lol, this is not getting better

@jackshirazi
Contributor Author

@SylvainJuge and/or @JonasKunz I need a careful review; this has turned into significantly more extensive changes than I expected. There are a bunch of different things:

  • the change to let the test framework handle the temp dir - that's standard and should be solid
  • the changes to better manage interrupts, including a new lock - should be good; there's a chunk of them, but I don't think any of them can have a negative effect
  • the changes to ensure the profiler is stopped - should be fine
  • the change to handle empty jfr files - should be fine
  • the PostProcessing management changes - this worries me a little, especially the setProfilingSessionOngoing(postProcessingEnabled) to setProfilingSessionOngoing(true) change; it seems to be needed to avoid flaking, and doesn't have any bad effects as far as I can tell from working through the code paths, but I need that to be considered carefully
  • the ProfilingActivationListener.ensureInitialized() call is only in the test framework, so should be fine

}

/** For testing only. */
public InferredSpansProcessorBuilder tempDir(@Nullable File tempDir) {
Contributor

[minor] Not related to this PR, but we could probably replace usage of File with Path to avoid having to call toFile.

Comment on lines +796 to +798
if (profilingTask != null) {
  profilingTask.cancel(true);
}
Contributor

[minor] I would suggest also wrapping the access to profilingTask, and the potential modification of its state, in the profilerLock, at least for consistency with the other call to .cancel(true) above. There is, however, no side effect from potentially calling this twice, as the return value of cancel(...) is ignored.

Comment on lines +96 to +98
if (fileSize == 0) {
  return;
}
Contributor

Is the empty-file case covered by any existing test? I haven't seen any change related to this in the tests.
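A check for this case could look roughly like the sketch below. It does not use the project's real classes: `processIfNotEmpty` is a hypothetical stand-in for the guarded method, returning a boolean only so that the early return is observable, and `processTempFileWith` is a hypothetical helper that exercises it against a temporary file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class EmptyFileGuardSketch {
  /** Hypothetical stand-in for the guarded method: false means it bailed out early. */
  static boolean processIfNotEmpty(Path file) {
    try {
      long fileSize = Files.size(file);
      if (fileSize == 0) {
        return false; // nothing to read; mirrors the early return in the diff
      }
      // The real code would parse the file contents here.
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writes content to a temporary file, runs the guard against it, then cleans up. */
  static boolean processTempFileWith(String content) {
    try {
      Path file = Files.createTempFile("sketch", ".jfr");
      try {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        return processIfNotEmpty(file);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A real test would assert on the observable behaviour of the actual class, for example that no exception is thrown and no spans are produced for an empty file.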

class SamplingProfilerTest {

  static {
    ProfilingActivationListener.ensureInitialized();
Contributor

What is the benefit of explicitly calling this in the tests and not in production code? If there is one, a comment explaining it would be welcome.

Comment on lines +429 to +439
if (e.getMessage() != null && e.getMessage().contains("already started")) {
  logger.fine("Profiler already started. Stopping and restarting.");
  try {
    profiler.stop();
  } catch (RuntimeException ignore) {
    logger.log(Level.FINE, "Ignored error on stopping profiler", ignore);
  }
  startMessage = profiler.execute(startCommand);
} else {
  throw e;
}
Contributor

Is this a known failure mode of async profiler? If so, then a dedicated method wrapping this "when an exception is thrown: stop, start, and try again" logic could make it slightly more readable.
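One possible shape for such a helper is sketched below. The `Profiler` interface is a hypothetical, minimal stand-in with just the two calls used in the snippet, the method name `startOrRestart` is invented for illustration, and logging is omitted. Catching `RuntimeException` is an assumption; the exception type caught in the PR is not visible in the quoted lines.

```java
class RestartOnAlreadyStartedSketch {
  /** Hypothetical minimal profiler API with the two calls used in the snippet above. */
  interface Profiler {
    String execute(String command);

    void stop();
  }

  /** Starts the profiler; if it reports "already started", stops it and tries once more. */
  static String startOrRestart(Profiler profiler, String startCommand) {
    try {
      return profiler.execute(startCommand);
    } catch (RuntimeException e) {
      if (e.getMessage() == null || !e.getMessage().contains("already started")) {
        throw e; // unrelated failure: propagate unchanged
      }
      try {
        profiler.stop();
      } catch (RuntimeException ignored) {
        // Best effort: a failed stop should not prevent the retry below.
      }
      return profiler.execute(startCommand);
    }
  }
}
```

The call site then reduces to a single assignment such as `startMessage = startOrRestart(profiler, startCommand);`, and the retry policy lives in one place.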

Development

Successfully merging this pull request may close these issues.

Flaky tests in inferred-spans

4 participants