-
Notifications
You must be signed in to change notification settings - Fork 169
Fix flaky test #2469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix flaky test #2469
Conversation
| profilingTask = scheduler.schedule(this, delay, TimeUnit.MILLISECONDS); | ||
| } | ||
| } finally { | ||
| profilerLock.unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for reviewer, same code block, just pushed into try-finally block to use a lock. This ensures that setProfilerInterval is correct. The race is that run() finishes a profiling session and prepares to schedule the next one. It reads the old interval (e.g., 10s). Concurrently, the test calls setProfilerInterval (e.g., 100ms) and then reschedule(). reschedule() tries to cancel the current task. If they race, run() might successfully schedule the next task with the old 10s delay, effectively ignoring the update. The test then waits for the next session (expecting it in 100ms), but it doesn't happen for 10s, causing a timeout.
it's actually possible for an interruption to occur just before the lock was acquired. This should is handled by re-reading Thread.currentThread().isInterrupted() to ensure no task is scheduled if an interruption occurred just before acquiring the lock
...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
| } finally { | ||
| profilerLock.unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for reviewer, same code block just wrapped in the lock try-finally
| consumeActivationEventsFromRingBufferAndWriteToFile(profilingDuration); | ||
| } finally { | ||
| String stopMessage = profiler.execute("stop"); | ||
| logger.fine(stopMessage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for reviewer. same code block, but restructured to ensure that profiler.execute("stop"); is definitely called as an earlier interrupt will go straight to the catch block (below) bypassing the stop - this is most likely the test flakiness ("Profiler already started") - the profiler not correctly stopped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could add a comment that explains that this is in a finally block because the preceding code be interrupted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
…spans/internal/SamplingProfiler.java
|
|
||
| boolean interrupted = Thread.currentThread().isInterrupted(); | ||
| boolean continueProfilingSession = | ||
| config.isNonStopProfiling() && !interrupted && postProcessingEnabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could inline interrupted variable and add a comment why we need the double call to Thread.currentThread().isInterrupted() to get an up-to-date status without using the value from the previous call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java
Outdated
Show resolved
Hide resolved
|
@jackshirazi does the CI failure on this PR look like same or different issue than you were trying to fix here? https://scans.gradle.com/s/gfqf5jlzidn6q |
|
it's related. I left shutdown() instead of shutdownNow() which meant the previous test session can on occasion not finish before the next is started. I've added an interrupt for that too with the latest commit |
|
another failure in the latest SamplingProfilerTest: https://scans.gradle.com/s/2dbaqboab5fr6 |
|
lol, this is not getting better |
|
@SylvainJuge and/or @JonasKunz I need a careful review, this has turned into significantly more extensive changes than I expected. there's a bunch of different things
|
| } | ||
|
|
||
| /** For testing only. */ | ||
| public InferredSpansProcessorBuilder tempDir(@Nullable File tempDir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] not related to this PR, but we could probably replace usage of File with Path to prevent having to call toFile.
| if (profilingTask != null) { | ||
| profilingTask.cancel(true); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] I would suggest to also wrap the access to profilingTask and potentially modifying its state with the profilerLock, at least for consistency with the other call to .cancel(true) above. There is however no side effect on potentially calling this twice as the return value of cancel(...) is ignored.
| if (fileSize == 0) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the case of empty file covered by any existing test ? I haven't seen any change related to this in the tests.
| class SamplingProfilerTest { | ||
|
|
||
| static { | ||
| ProfilingActivationListener.ensureInitialized(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the benefit of explicitly calling this in the tests and not in production code ? is there any benefit of doing this and if yes maybe a comment could be welcome.
| if (e.getMessage() != null && e.getMessage().contains("already started")) { | ||
| logger.fine("Profiler already started. Stopping and restarting."); | ||
| try { | ||
| profiler.stop(); | ||
| } catch (RuntimeException ignore) { | ||
| logger.log(Level.FINE, "Ignored error on stopping profiler", ignore); | ||
| } | ||
| startMessage = profiler.execute(startCommand); | ||
| } else { | ||
| throw e; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is a known failure mode of async profiler here ? If so, then maybe having a dedicated method to wrap this "when exception is thrown stop, start and try again" logic could make it slightly more readeable.
closes #2458