Fix flaky test #2469

jackshirazi · 2025-11-24T12:42:05Z

jackshirazi · 2025-11-24T12:49:56Z

...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java

+        profilingTask = scheduler.schedule(this, delay, TimeUnit.MILLISECONDS);
+      }
+    } finally {
+      profilerLock.unlock();


For the reviewer: this is the same code block, just moved into a try-finally block so that it runs under a lock. This ensures that setProfilerInterval takes effect. The race: run() finishes a profiling session and prepares to schedule the next one, reading the old interval (e.g., 10s). Concurrently, the test calls setProfilerInterval (e.g., 100ms) and then reschedule(), which tries to cancel the current task. If the two race, run() can still schedule the next task with the old 10s delay, effectively ignoring the update. The test then waits for the next session, expecting it within 100ms, but it does not start for 10s, causing a timeout.

It's actually possible for an interruption to occur just before the lock is acquired. This is handled by re-reading Thread.currentThread().isInterrupted(), which ensures that no task is scheduled if an interruption occurred just before acquiring the lock.
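The locking idea discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the actual SamplingProfiler code: the class `ReschedulingSketch`, the session counter, and the `awaitSessions` and `shutdown` helpers are hypothetical names introduced only for the example. Both the "schedule next" step in run() and reschedule() take the same lock, and run() re-reads the interrupt flag under the lock before scheduling.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the scheduling logic; not the real SamplingProfiler. */
final class ReschedulingSketch implements Runnable {
  private final ReentrantLock profilerLock = new ReentrantLock();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicInteger sessions = new AtomicInteger();
  private volatile long intervalMillis;
  private ScheduledFuture<?> profilingTask; // guarded by profilerLock

  ReschedulingSketch(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  @Override
  public void run() {
    sessions.incrementAndGet(); // stands in for one profiling session
    profilerLock.lock();
    try {
      // Re-read the flag under the lock: reschedule() may have cancelled
      // (and so interrupted) this task just before the lock was acquired.
      if (!Thread.currentThread().isInterrupted()) {
        profilingTask = scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
      }
    } finally {
      profilerLock.unlock();
    }
  }

  void setProfilerInterval(long millis) {
    intervalMillis = millis;
  }

  /** Cancels any pending or running task and schedules a new one with the current interval. */
  void reschedule() {
    profilerLock.lock();
    try {
      if (profilingTask != null) {
        profilingTask.cancel(true);
      }
      profilingTask = scheduler.schedule(this, intervalMillis, TimeUnit.MILLISECONDS);
    } finally {
      profilerLock.unlock();
    }
  }

  /** Polls until at least {@code expected} sessions ran or the timeout elapsed. */
  boolean awaitSessions(int expected, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (sessions.get() < expected && System.nanoTime() < deadline) {
      try {
        Thread.sleep(5);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return sessions.get() >= expected;
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}
```

Because both paths hold the same lock, a reschedule() that runs after run() has scheduled with the old delay cancels that task and replaces it, and a reschedule() that runs before is noticed through the interrupt flag.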

jackshirazi · 2025-11-24T12:51:37Z

...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java

+        }
      }
+    } finally {
+      profilerLock.unlock();


For the reviewer: same code block, just wrapped in the lock's try-finally.

jackshirazi · 2025-11-24T12:53:41Z

...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java

+        consumeActivationEventsFromRingBufferAndWriteToFile(profilingDuration);
+      } finally {
+        String stopMessage = profiler.execute("stop");
+        logger.fine(stopMessage);


For the reviewer: same code block, but restructured to ensure that profiler.execute("stop") is always called. Previously, an earlier interrupt went straight to the catch block (below), bypassing the stop. This is most likely the cause of the test flakiness ("Profiler already started"): the profiler was not correctly stopped.

You could add a comment explaining that this is in a finally block because the preceding code can be interrupted.
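A minimal sketch of the pattern, assuming a hypothetical `FakeProfiler` that only records the commands it receives (the real profiler API is not reproduced here). The interruptible work sits in the try block, so the stop command runs even when an interrupt cuts the session short.

```java
import java.util.ArrayList;
import java.util.List;

final class StopInFinallySketch {
  /** Hypothetical profiler stand-in that records the commands it receives. */
  static final class FakeProfiler {
    final List<String> commands = new ArrayList<>();
    boolean started;

    String execute(String command) {
      commands.add(command);
      if (command.equals("start")) {
        if (started) {
          throw new IllegalStateException("Profiler already started");
        }
        started = true;
      } else if (command.equals("stop")) {
        started = false;
      }
      return command + " ok";
    }
  }

  static void profile(FakeProfiler profiler, long durationMillis) throws InterruptedException {
    profiler.execute("start");
    try {
      // Stands in for consuming events for the session duration; may be interrupted.
      Thread.sleep(durationMillis);
    } finally {
      // In a finally block because the preceding code can be interrupted.
      profiler.execute("stop");
    }
  }

  /** Runs one session with an interrupt already pending; returns true if it was interrupted. */
  static boolean runInterruptedSession(FakeProfiler profiler) {
    Thread.currentThread().interrupt(); // a pending interrupt makes sleep throw immediately
    try {
      profile(profiler, 1_000);
      return false;
    } catch (InterruptedException e) {
      return true; // the exception also clears the interrupt flag
    }
  }
}
```

Without the finally block, the second interrupted session in this sketch would fail with "Profiler already started", which matches the symptom described above.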

SylvainJuge · 2025-11-26T08:27:57Z

...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java


    boolean interrupted = Thread.currentThread().isInterrupted();
    boolean continueProfilingSession =
        config.isNonStopProfiling() && !interrupted && postProcessingEnabled;


Maybe we could inline the interrupted variable and add a comment explaining why we need a second call to Thread.currentThread().isInterrupted(): to get an up-to-date status rather than reusing the value from the previous call.
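As a small illustration of why a previously read value cannot be reused, the sketch below (with the hypothetical class name `InterruptFlagSketch`) reads the flag, simulates an interrupt arriving afterwards, and reads it again. It relies only on standard `Thread` behavior: `isInterrupted()` reports the flag without clearing it, while the static `Thread.interrupted()` clears it.

```java
/** Demonstrates why a cached interrupt status can go stale. */
final class InterruptFlagSketch {
  /** Returns {cachedBeforeInterrupt, freshAfterInterrupt, secondFreshRead}. */
  static boolean[] observe() {
    Thread.interrupted(); // start from a known state by clearing any pending interrupt
    boolean cached = Thread.currentThread().isInterrupted(); // read before the interrupt
    Thread.currentThread().interrupt(); // stands in for cancel(true) issued by another thread
    boolean fresh = Thread.currentThread().isInterrupted(); // the cached value is now stale
    boolean again = Thread.currentThread().isInterrupted(); // isInterrupted() does not clear the flag
    Thread.interrupted(); // leave the calling thread with a clean flag
    return new boolean[] {cached, fresh, again};
  }
}
```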

trask · 2025-12-05T16:30:47Z

@jackshirazi does the CI failure on this PR look like the same issue you were trying to fix here, or a different one? https://scans.gradle.com/s/gfqf5jlzidn6q

jackshirazi · 2025-12-08T15:54:16Z

It's related. I left shutdown() instead of shutdownNow(), which meant the previous test session could occasionally not finish before the next one started. I've added an interrupt for that too in the latest commit.
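The difference between the two calls can be sketched with a plain `ExecutorService`. This is an illustration under assumptions, not the test's actual teardown code: `ShutdownSketch` and `terminatesQuickly` are hypothetical names, and the 60-second sleep stands in for a long profiling session. `shutdown()` lets a running task finish, whereas `shutdownNow()` also interrupts it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ShutdownSketch {
  /** Returns true if the executor terminated within waitMillis after being shut down. */
  static boolean terminatesQuickly(boolean interrupt, long waitMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    executor.submit(() -> {
      started.countDown();
      try {
        Thread.sleep(60_000); // stands in for a long-running session
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // interrupted: end the session early
      }
    });
    try {
      started.await(); // make sure the task is actually running
      if (interrupt) {
        executor.shutdownNow(); // interrupts the running task
      } else {
        executor.shutdown(); // only stops accepting new tasks
      }
      return executor.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow(); // always release the worker thread
    }
  }
}
```

With `shutdown()` the previous session keeps running in the background, which is how it can overlap with the next test.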

trask · 2025-12-08T16:14:41Z

another failure in the latest SamplingProfilerTest: https://scans.gradle.com/s/2dbaqboab5fr6

jackshirazi · 2025-12-08T16:35:59Z

lol, this is not getting better

jackshirazi · 2025-12-08T23:50:27Z

@SylvainJuge and/or @JonasKunz I need a careful review; this has turned into significantly more extensive changes than I expected. There's a bunch of different things:

- the change to let the test framework handle the temp dir: that's standard and should be solid
- the changes to better manage interrupts, including a new lock: should be good; there's a chunk of them, but I don't think any of them can have a negative effect
- the changes to ensure the profiler is stopped: should be fine
- the change to handle empty jfr files: should be fine
- the PostProcessing management changes: this worries me a little, especially the change from setProfilingSessionOngoing(postProcessingEnabled) to setProfilingSessionOngoing(true); it seems to be needed to avoid flaking, and doesn't have any bad effects as far as I can tell from working through the code paths, but I need that to be considered carefully
- the ProfilingActivationListener.ensureInitialized() call is only in the test framework, so should be fine

SylvainJuge · 2025-12-09T13:13:14Z

...pans/src/main/java/io/opentelemetry/contrib/inferredspans/InferredSpansProcessorBuilder.java

  }

+  /** For testing only. */
+  public InferredSpansProcessorBuilder tempDir(@Nullable File tempDir) {


[minor] Not related to this PR, but we could probably replace usage of File with Path to avoid having to call toFile().

SylvainJuge · 2025-12-09T13:17:09Z

...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java

+    if (profilingTask != null) {
+      profilingTask.cancel(true);
+    }


[minor] I would suggest also wrapping the access to profilingTask, and the potential modification of its state, in the profilerLock, at least for consistency with the other call to .cancel(true) above. There is, however, no side effect from potentially calling this twice, as the return value of cancel(...) is ignored.

SylvainJuge · 2025-12-09T13:18:18Z

...s/src/main/java/io/opentelemetry/contrib/inferredspans/internal/asyncprofiler/JfrParser.java

+    if (fileSize == 0) {
+      return;
+    }


Is the empty-file case covered by any existing test? I haven't seen any related change in the tests.
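One way such a case could be exercised is sketched below. Everything here is hypothetical: `EmptyFileGuardSketch.parse` is a stand-in that only mirrors the `fileSize == 0` guard and is not the real JfrParser API, and the helper simply creates a temporary file with the given content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class EmptyFileGuardSketch {
  /** Hypothetical parser entry point: returns the number of bytes it would parse. */
  static long parse(Path file) {
    try {
      long fileSize = Files.size(file);
      if (fileSize == 0) {
        return 0; // empty file: nothing to parse, return early
      }
      return fileSize; // a real parser would read the header and events here
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writes the content to a fresh temp file, parses it, and cleans up. */
  static long parseFreshTempFile(byte[] content) {
    try {
      Path file = Files.createTempFile("sketch", ".jfr");
      try {
        Files.write(file, content);
        return parse(file);
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test for the real parser would presumably assert that parsing a zero-length file completes without throwing, in the same spirit as the first call below.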

SylvainJuge · 2025-12-09T13:19:15Z

...pans/src/test/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfilerTest.java

 class SamplingProfilerTest {

+  static {
+    ProfilingActivationListener.ensureInitialized();


What is the benefit of calling this explicitly in the tests and not in production code? If there is one, a comment would be welcome.

SylvainJuge · 2025-12-09T13:22:07Z

...ed-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java

+        if (e.getMessage() != null && e.getMessage().contains("already started")) {
+          logger.fine("Profiler already started. Stopping and restarting.");
+          try {
+            profiler.stop();
+          } catch (RuntimeException ignore) {
+            logger.log(Level.FINE, "Ignored error on stopping profiler", ignore);
+          }
+          startMessage = profiler.execute(startCommand);
+        } else {
+          throw e;
+        }


Is this a known failure mode of async profiler? If so, a dedicated method wrapping this "when the exception is thrown, stop, start and try again" logic could make it slightly more readable.
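A rough sketch of what such a dedicated method might look like. The `Profiler` interface, the `startWithRestart` name, and the use of `IllegalStateException` are assumptions made for illustration; the diff above does not show the exception type or the profiler's full API.

```java
final class RestartOnAlreadyStartedSketch {
  /** Hypothetical minimal profiler API; the real profiler's interface may differ. */
  interface Profiler {
    String execute(String command);

    void stop();
  }

  /** Starts the profiler; if it reports "already started", stops it and tries once more. */
  static String startWithRestart(Profiler profiler, String startCommand) {
    try {
      return profiler.execute(startCommand);
    } catch (IllegalStateException e) { // assumed exception type
      if (e.getMessage() == null || !e.getMessage().contains("already started")) {
        throw e;
      }
      try {
        profiler.stop();
      } catch (RuntimeException ignored) {
        // best effort: a failed stop should not prevent the restart attempt
      }
      return profiler.execute(startCommand);
    }
  }

  /** Test double that fails to start while it is already running. */
  static final class FakeProfiler implements Profiler {
    boolean started;
    int stopCalls;

    @Override
    public String execute(String command) {
      if (started) {
        throw new IllegalStateException("Profiler already started");
      }
      started = true;
      return "Profiling started";
    }

    @Override
    public void stop() {
      stopCalls++;
      started = false;
    }
  }
}
```

Keeping the retry in one method leaves the call site as a single line and makes the "stop, then start again" path easy to test in isolation.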

Update SamplingProfiler.java

b963213

jackshirazi requested a review from a team as a code owner November 24, 2025 12:42

github-actions bot requested review from JonasKunz and SylvainJuge November 24, 2025 12:42

jackshirazi changed the title ~~Fix laky test~~ Fix flaky test Nov 24, 2025

Update inferred-spans/src/main/java/io/opentelemetry/contrib/inferredspans/internal/SamplingProfiler.java

a8ba1cb

jackshirazi self-assigned this Nov 24, 2025

SylvainJuge approved these changes Nov 26, 2025

laurit reviewed Dec 1, 2025

review feedback

b065a8d

github-actions bot requested a review from SylvainJuge December 5, 2025 15:06

end previous session quicker

30aa33a

jackshirazi added 2 commits December 8, 2025 15:59

suppress interruption warning

cd3b46b

spotless

3765386

jackshirazi added 8 commits December 8, 2025 16:54

handle empty files sent to the jfr parser

373a947

ensure that a running profiler is stopped then started when a new test is run

7ecda2d

ignore error

6678fc6

move temp dir to control of unit test framework

40834c4

don't postprocess if it's disabled

500018f

statically initialize ProfilingActivationListener in test

7101cfb

fake comment to bump for more test runs

1dbda97

test out partly forcing post processing

8fa0815

jackshirazi added 4 commits December 8, 2025 22:44

fake comment again to bump for more test runs

1b5207c

test out further post processing handling

da32568

fake comment again2 to bump for more test runs

2113a0a

remove fake comment

010cac8

SylvainJuge reviewed Dec 9, 2025
