
Spring boot runtime metrics #13078

Open: wants to merge 11 commits into main


zeitlinger
Member

Fixes #12812

@zeitlinger zeitlinger requested a review from a team as a code owner January 21, 2025 13:21
@zeitlinger zeitlinger self-assigned this Jan 21, 2025
@github-actions github-actions bot added the test native This label can be applied to PRs to trigger them to run native tests label Jan 21, 2025
@zeitlinger
Member Author

@jeanbisutti Can you help me with the native failures?

  1. Not sure if this is transient:
Failures (1):
  JUnit Jupiter:OtelSpringStarterSmokeTest:shouldSendTelemetry()
    MethodSource [className = 'io.opentelemetry.spring.smoketest.OtelSpringStarterSmokeTest', methodName = 'shouldSendTelemetry', methodParameterTypes = '']
    => org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a Lambda expression in io.opentelemetry.instrumentation.testing.InstrumentationTestRunner
Expecting actual not to be empty within 10 seconds.
       org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
       org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
       org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
       org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
       org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:790)
       [...]
     Caused by: org.awaitility.core.DeadlockException: Deadlocked threads detected:


       org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:159)
       [...]
  2. Warning: "numLogsCapturedBeforeOtelInstall value of the OpenTelemetry appender is too small." Should we increase the buffer?

  3. Thread started: this is for JFR. I'll try @PreDestroy for this.

The web application [ROOT] appears to have started a thread named [BatchLogRecordProcessor_WorkerThread-1] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
 org.graalvm.nativeimage.builder/com.oracle.svm.core.posix.headers.Pthread.pthread_cond_timedwait(Pthread.java)
 org.graalvm.nativeimage.builder/com.oracle.svm.core.posix.thread.PosixParker.park0(PosixPlatformThreads.java:379)
 org.graalvm.nativeimage.builder/com.oracle.svm.core.posix.thread.PosixParker.park(PosixPlatformThreads.java:354)
 org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.parkCurrentPlatformOrCarrierThread(PlatformThreads.java:1001)
 java.base/jdk.internal.misc.Unsafe.park(Unsafe.java:56)
 java.base/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
 java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1763)
 java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:435)
 io.opentelemetry.sdk.logs.export.BatchLogRecordProcessor$Worker.run(BatchLogRecordProcessor.java:246)
 java.base/java.lang.Thread.runWith(Thread.java:1596)
 java.base/java.lang.Thread.run(Thread.java:1583)
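The frames in this trace show the worker parked in ArrayBlockingQueue.poll with a timeout, which is what an idle batching worker looks like: the thread stays alive until its owner shuts it down, so a container that inspects leftover threads when the web application stops will report it if that shutdown has not run yet. The sketch below is a simplified, hypothetical model of that pattern (PollingWorker is not the SDK's BatchLogRecordProcessor), with an explicit close() that a lifecycle callback such as @PreDestroy could invoke:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a batching worker; NOT the OpenTelemetry SDK's BatchLogRecordProcessor.
final class PollingWorker implements AutoCloseable {
  private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
  private final List<String> exported = new ArrayList<>();
  private final Thread thread;
  private volatile boolean running = true;

  PollingWorker(String threadName) {
    thread = new Thread(this::run, threadName);
    thread.setDaemon(true);
    thread.start();
  }

  private void run() {
    // Keep going until close() is called, then drain whatever is still queued.
    while (running || !queue.isEmpty()) {
      try {
        // Idle workers park here, like the ArrayBlockingQueue.poll frame in the trace.
        String item = queue.poll(100, TimeUnit.MILLISECONDS);
        if (item != null) {
          synchronized (exported) {
            exported.add(item); // stand-in for exporting a batch
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  boolean offer(String item) {
    return queue.offer(item);
  }

  List<String> exported() {
    synchronized (exported) {
      return new ArrayList<>(exported);
    }
  }

  boolean isAlive() {
    return thread.isAlive();
  }

  // Signals the worker to stop and waits for it; a @PreDestroy method could call this.
  @Override
  public void close() {
    running = false;
    try {
      thread.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

In this model, close() clears the running flag and joins the thread; the worker drains anything still queued before it exits, so items offered before close() are not lost.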

@jeanbisutti
Member

@zeitlinger About 1., it seems to be an Awaitility issue. Does the problem only appear with the new changes? Perhaps it would be worth doing something like

But I am not sure it would be a good thing to do today; it would require further investigation.

About 2., the default value of numLogsCapturedBeforeOtelInstall is already high: 1,000 logs. I suspect that the warning is related to something specific to the test.
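The numLogsCapturedBeforeOtelInstall warning suggests a bounded buffer that holds log events emitted before the OpenTelemetry SDK is installed and replays them afterwards. The sketch below models that idea with a hypothetical PreInstallBuffer class (not the appender's actual code): once the capacity, playing the role of numLogsCapturedBeforeOtelInstall, is exhausted, further events are dropped and a warning is printed once. It assumes a single thread appends and installs.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch of capture-before-install buffering; not the appender's real code.
// Assumes a single thread calls append() and install().
final class PreInstallBuffer<E> {
  private final BlockingQueue<E> pending;
  private final AtomicBoolean warned = new AtomicBoolean();
  private final AtomicInteger dropped = new AtomicInteger();
  private volatile Consumer<E> sink; // null until the SDK is "installed"

  PreInstallBuffer(int capacity) {
    this.pending = new ArrayBlockingQueue<>(capacity);
  }

  void append(E event) {
    Consumer<E> current = sink;
    if (current != null) {
      current.accept(event); // after install, events go straight through
      return;
    }
    if (!pending.offer(event)) { // buffer full: drop the event and warn once
      dropped.incrementAndGet();
      if (warned.compareAndSet(false, true)) {
        System.err.println("capacity too small: dropping events captured before install");
      }
    }
  }

  // Replays buffered events in arrival order, then switches to direct delivery.
  void install(Consumer<E> newSink) {
    E event;
    while ((event = pending.poll()) != null) {
      newSink.accept(event);
    }
    sink = newSink;
  }

  int dropped() {
    return dropped.get();
  }
}
```

With a capacity of 2, a third event logged before install is dropped, which is the situation the warning reports; a test that logs heavily during startup could hit this even with a generous default.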

About 3., it seems related to Tomcat's memory leak detection. With the full log, we could tell whether it really comes from Tomcat. It does not seem possible to stop the BatchLogRecordProcessor thread. I'm surprised it could be JFR-related. @jack-berg, do you know if any users have already reported the following log?

appears to have started a thread named [BatchLogRecordProcessor_WorkerThread-1] but has failed to stop it.

Native tests of this PR are failing during the native compilation step:

[native-image-plugin] Native Image written to: /home/runner/work/opentelemetry-java-instrumentation/opentelemetry-java-instrumentation/smoke-tests-otel-starter/spring-boot-3.2/build/native/nativeTestCompile

[Incubating] Problems report is available at: file:///home/runner/work/opentelemetry-java-instrumentation/opentelemetry-java-instrumentation/build/reports/problems/problems-report.html

I would focus on the JMX and JFR metrics for GraalVM native execution. GraalVM supports some JFR events, but not all of them, so I'm not sure all the JFR metrics can work in native mode today.
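One way to act on this is to gate JFR-based metrics on whether the code is running as a native image. GraalVM sets the system property org.graalvm.nativeimage.imagecode (the same property the useThreads() change in this PR checks) to "buildtime" during image generation and "runtime" in the native executable; on a regular JVM it is absent. The class and the policy below (keep JMX, drop JFR in native mode unless explicitly requested) are hypothetical, a sketch of the idea rather than what the starter does:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical selector for runtime-metric sources; not part of the OpenTelemetry starter.
final class MetricSourceSelector {
  enum Source { JMX, JFR }

  private MetricSourceSelector() {}

  // GraalVM sets this property to "runtime" inside a running native executable
  // (and to "buildtime" during image generation); a regular JVM leaves it unset.
  static boolean isNativeImageRuntime() {
    return "runtime".equals(System.getProperty("org.graalvm.nativeimage.imagecode"));
  }

  // Assumed policy: JMX everywhere; JFR on the JVM, or in native mode only when forced.
  static Set<Source> select(boolean nativeImage, boolean forceJfrInNative) {
    Set<Source> sources = EnumSet.of(Source.JMX);
    if (!nativeImage || forceJfrInNative) {
      sources.add(Source.JFR);
    }
    return sources;
  }
}
```

Usage would be along the lines of `select(isNativeImageRuntime(), false)`; checking for "runtime" rather than any non-null value avoids treating image build time as native execution.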

@jack-berg
Member

@jack-berg, would you know if some users have already reported the following log?

I haven't seen that log before.

@zeitlinger
Member Author

We also hit com.oracle.svm.core.jdk.UnsupportedFeatureError: ThreadMXBean methods; see oracle/graal#6101.

@zeitlinger zeitlinger force-pushed the spring-boot-runtime-metrics branch from c0de41a to b3c0f10 Compare January 28, 2025 08:25
@zeitlinger
Member Author

@jeanbisutti It turned out that all the prior errors were just a side effect of a JMX issue, which is now resolved.

Can you take a look again?

@@ -209,6 +209,11 @@ void shouldSendTelemetry() {
OtelSpringStarterSmokeTestController.METER_SCOPE_NAME,
OtelSpringStarterSmokeTestController.TEST_HISTOGRAM,
AbstractIterableAssert::isNotEmpty);
// runtime metrics
testing.waitAndAssertMetrics(
Member

It seems to verify a thread-based runtime metric.

It may be worth also verifying that JMX-based and JFR-based runtime metrics work with the OpenTelemetry starter in both non-native and native modes.

Member Author

Added, @jeanbisutti.

Member

@zeitlinger

I have not noticed assertions for JFR-based runtime metrics (io.opentelemetry.runtime-telemetry-java17 instrumentation scope). Could you please add them?

Also, could you please add more assertions for JMX-based runtime metrics (jvm.cpu.time, ...)? It would be great to check one metric per MXBean used.

This way, we would be more confident that things work well in native mode!
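Checking one metric per MXBean means knowing which bean each JMX-based metric reads. The mapping assumed below (jvm.memory.used from MemoryPoolMXBean, jvm.thread.count from ThreadMXBean, jvm.class.loaded from ClassLoadingMXBean, jvm.cpu.time from the com.sun.management extension of OperatingSystemMXBean) is based on the JVM semantic conventions, not on anything stated in this PR. The hypothetical MxBeanProbe class reads each bean directly through java.lang.management, which could also help narrow down which bean calls a native image rejects:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical probe: reads one value from each platform MXBean a JMX-based metric is assumed to use.
final class MxBeanProbe {
  private MxBeanProbe() {}

  static Map<String, Long> sample() {
    Map<String, Long> values = new LinkedHashMap<>();

    // Assumed source of jvm.memory.used: usage summed across memory pools.
    long used = 0;
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
      if (usage != null) {
        used += usage.getUsed();
      }
    }
    values.put("MemoryPoolMXBean", used);

    // Assumed source of jvm.thread.count.
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    values.put("ThreadMXBean", (long) threads.getThreadCount());

    // Assumed source of jvm.class.loaded.
    ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
    values.put("ClassLoadingMXBean", classes.getTotalLoadedClassCount());

    // Assumed source of jvm.cpu.time; process CPU time is on the com.sun.management extension.
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof com.sun.management.OperatingSystemMXBean) {
      // Nanoseconds, or -1 if the platform does not support it.
      values.put(
          "OperatingSystemMXBean",
          ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime());
    }
    return values;
  }
}
```

A smoke test could then assert one metric per entry; a bean that throws here in native mode would be a likely explanation for a missing metric.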

Member Author

I have not noticed assertions for JFR-based runtime metrics (io.opentelemetry.runtime-telemetry-java17 instrumentation scope). Could you please add them?

It's here:

protected void assertAdditionalMetrics() {
  // JFR based metrics
  testing.waitAndAssertMetrics(
      "io.opentelemetry.runtime-telemetry-java17",
      "jvm.cpu.limit",
      AbstractIterableAssert::isNotEmpty);
}

Member Author

Also, could you please add more assertions for JMX-based runtime metrics (jvm.cpu.time, ...

Here's one:

// JMX based metrics
testing.waitAndAssertMetrics(
    "io.opentelemetry.runtime-telemetry-java8",
    "jvm.memory.used",
    AbstractIterableAssert::isNotEmpty);

Or do you want one for each JMX bean?


@zeitlinger zeitlinger force-pushed the spring-boot-runtime-metrics branch from 55b2482 to ab2439a Compare January 28, 2025 18:10
@@ -101,7 +101,7 @@ private JfrRuntimeMetrics(OpenTelemetry openTelemetry, Predicate<JfrFeature> fea
recordingStream.onEvent(handler.getEventName(), handler);
});
recordingStream.onMetadata(event -> startUpLatch.countDown());
- Thread daemonRunner = new Thread(() -> recordingStream.start());
+ Thread daemonRunner = new Thread(recordingStream::start, "JFR-Metrics-Runner");
Member

Perhaps mention OpenTelemetry in the thread name?
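RecordingStream.start() processes events on the calling thread and only returns once the stream is closed, which is why the change in this hunk hands it to a dedicated thread. A minimal sketch of that pattern with the reviewer's suggestion applied; the thread name and the JfrStreamRunner class are hypothetical, and it assumes JDK 17+ with JFR available:

```java
import jdk.jfr.consumer.RecordingStream;

// Hypothetical helper for running a JFR RecordingStream in the background (JDK 17+).
final class JfrStreamRunner {
  // Illustrative name that mentions OpenTelemetry, per the review suggestion.
  static final String THREAD_NAME = "OpenTelemetry-JFR-Metrics-Runner";

  private JfrStreamRunner() {}

  // start() processes events on the calling thread until the stream is closed,
  // so it gets a dedicated daemon thread.
  static Thread startInBackground(RecordingStream stream) {
    Thread runner = new Thread(stream::start, THREAD_NAME);
    runner.setDaemon(true); // never keep the JVM alive just for metrics
    runner.start();
    return runner;
  }

  // Closing the stream makes start() return; then wait briefly for the thread to end.
  static boolean stop(RecordingStream stream, Thread runner, long timeoutMillis) {
    stream.close();
    try {
      runner.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !runner.isAlive();
  }
}
```

The daemon flag only keeps the thread from holding the JVM open; close() is what actually ends the recording, so a container that checks for leftover threads would still need close() to be called on shutdown.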

@@ -31,6 +24,10 @@ public final class RuntimeMetricsBuilder {

private boolean disableJmx = false;
private boolean enableExperimentalJmxTelemetry = false;
private Consumer<Runnable> shutdownHook =
runnable -> {
Runtime.getRuntime().addShutdownHook(new Thread(runnable, "RuntimeMetricsShutdownHook"));
Member

Perhaps mention OpenTelemetry in the thread name?
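The builder field in this hunk stores the shutdown action as a Consumer<Runnable> that defaults to registering a JVM shutdown hook. Keeping it pluggable presumably lets a caller such as the Spring starter tie cleanup to its own lifecycle instead. A hypothetical, self-contained version of that shape (class names and the thread name are illustrative, with OpenTelemetry in the name as suggested):

```java
import java.util.function.Consumer;

// Hypothetical builder showing a pluggable shutdown hook, shaped like the field in the diff.
final class MetricsHandleBuilder {
  // Default: run cleanup from a named JVM shutdown hook thread.
  private Consumer<Runnable> shutdownHook =
      runnable ->
          Runtime.getRuntime()
              .addShutdownHook(new Thread(runnable, "OpenTelemetry-RuntimeMetrics-ShutdownHook"));

  // Lets a framework (for example a container lifecycle callback) own cleanup instead.
  MetricsHandleBuilder setShutdownHook(Consumer<Runnable> shutdownHook) {
    this.shutdownHook = shutdownHook;
    return this;
  }

  MetricsHandle build() {
    MetricsHandle handle = new MetricsHandle();
    shutdownHook.accept(handle::close); // register the cleanup action
    return handle;
  }
}

// Hypothetical stand-in for the object that owns JMX/JFR resources.
final class MetricsHandle implements AutoCloseable {
  private volatile boolean closed;

  @Override
  public void close() {
    closed = true; // a real implementation would release observers and JFR streams here
  }

  boolean isClosed() {
    return closed;
  }
}
```

Passing a collecting consumer (for example `hooks::add`) instead of the default keeps tests from leaking JVM shutdown hooks, and the collected Runnable can be run explicitly when the application context stops.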

@@ -66,6 +66,9 @@ graalvmNative {
// Workaround for https://github.com/junit-team/junit5/issues/3405
buildArgs.add("--initialize-at-build-time=org.junit.platform.launcher.core.LauncherConfig")
buildArgs.add("--initialize-at-build-time=org.junit.jupiter.engine.config.InstantiatingConfigurationParameterConverter")

// enable JFR - see https://www.graalvm.org/22.0/reference-manual/native-image/JFR/
Member

This documentation is for an old GraalVM version.

Here is the latest one:

Suggested change
// enable JFR - see https://www.graalvm.org/22.0/reference-manual/native-image/JFR/
// enable JFR - see https://www.graalvm.org/latest/reference-manual/native-image/debugging-and-diagnostics/JFR/

@jeanbisutti
Member

@roberttoyonaga If you have time, it would be great if you could have a look at this PR.

/**
* Create and start {@link RuntimeMetrics}.
*
* <p>Listens for select JFR events, extracts data, and records to various metrics. Recording will
Contributor

I think this javadoc should be adjusted. JFR recording and the JFR-based metrics are only available on JDK 17+, because they rely on the JFR streaming feature, which was only introduced in JDK 14.
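A runtime guard along these lines could accompany the adjusted javadoc: JFR-based metrics are skipped unless the JVM is at least JDK 17 and the JFR streaming API (jdk.jfr.consumer.RecordingStream, added in JDK 14) can be loaded. JfrSupport is a hypothetical name, not a class from this PR:

```java
// Hypothetical guard for JFR-streaming-based metrics.
final class JfrSupport {
  private JfrSupport() {}

  static boolean jfrStreamingAvailable() {
    // The JFR-based metrics target JDK 17+; the streaming API itself arrived in JDK 14.
    if (Runtime.version().feature() < 17) {
      return false;
    }
    try {
      // Fails if the jdk.jfr module is not present in this runtime image.
      Class.forName("jdk.jfr.consumer.RecordingStream");
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }
}
```

The class lookup matters for trimmed runtime images (for example jlink images without jdk.jfr), where the JDK version alone would give the wrong answer.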


public void startFromInstrumentationConfig(InstrumentationConfig config) {
/*
By default, don't use any JFR metrics. May change this once semantic conventions are updated.
Contributor

I think this comment should not mention JFR, since JFR is only used via JFR streaming on Java 17+.

return new RuntimeMetricsBuilder(openTelemetry);
}

/** Stop recording JFR events. */
Contributor

Same with this javadoc

private static boolean useThreads() {
// GraalVM native image does not support ThreadMXBean yet
// see https://github.com/oracle/graal/issues/6101
return !isJava9OrNewer() || System.getProperty("org.graalvm.nativeimage.imagecode") != null;
Contributor

@roberttoyonaga roberttoyonaga Jan 29, 2025

I think GraalVM Native Image would work fine with Threads::java8Callback. Native Image implements some ThreadMXBean functionality, notably everything Threads::java8Callback needs (see here and here). However, Threads::java9AndNewerCallback still won't work with Native Image, since we don't support ThreadMXBean#getAllThreadIds() or ThreadMXBean#getThreadInfo() yet.
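The difference comes down to which ThreadMXBean methods each callback calls. The sketch below contrasts the two styles; it assumes, without having checked the PR's Threads class, that the Java 8 path needs only the aggregate counters getThreadCount() and getDaemonThreadCount(), while the newer path walks per-thread data via getAllThreadIds() and getThreadInfo(), the methods Native Image does not support yet. ThreadCounts is a hypothetical class:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical comparison of the two thread-metric styles discussed above.
final class ThreadCounts {
  private ThreadCounts() {}

  // Aggregate style (assumed java8Callback needs): returns {daemon, nonDaemon}.
  static long[] daemonAndNonDaemon(ThreadMXBean bean) {
    long daemon = bean.getDaemonThreadCount();
    long total = bean.getThreadCount();
    return new long[] {daemon, total - daemon};
  }

  // Per-thread style: requires getAllThreadIds() and getThreadInfo(long[]).
  static Map<Thread.State, Long> byState(ThreadMXBean bean) {
    Map<Thread.State, Long> counts = new EnumMap<>(Thread.State.class);
    for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
      if (info != null) { // null when a thread exited between the two calls
        counts.merge(info.getThreadState(), 1L, Long::sum);
      }
    }
    return counts;
  }

  static ThreadMXBean platformBean() {
    return ManagementFactory.getThreadMXBean();
  }
}
```

Combined with a native-image check like the one in useThreads(), a native executable could fall back to the aggregate style and still report a thread count, just without a per-state breakdown.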

Labels: test native (This label can be applied to PRs to trigger them to run native tests)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Add runtime-telemetry to spring starter
4 participants