diff --git a/benchmarks/README.md b/benchmarks/README.md index b3bdb19..85a7954 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -66,7 +66,7 @@ JMH parameters can be configured in `benchmarks/build.gradle.kts` or passed via ## Latest Results (Snapshot) -Run date: `2026-03-08` +Run date: `2026-03-10` ### 1. Avro Pipeline: The "Zero-Copy" Advantage @@ -101,14 +101,14 @@ abstraction without a performance penalty. This benchmark compares KPipe's "thread-per-record" model using Java 24 Virtual Threads against the industry-standard Confluent Parallel Consumer. -| Benchmark | Mode | Cnt | Score | Error | Units | -|-----------------------------------------------------------|--------:|-----:|----------:|------------:|--------:| -| `ParallelProcessingBenchmark.confluentParallelProcessing` | `thrpt` | `16` | `329.594` | `+/- 0.757` | `ops/s` | -| `ParallelProcessingBenchmark.kpipeParallelProcessing` | `thrpt` | `16` | `331.248` | `+/- 0.774` | `ops/s` | +| Benchmark | Mode | Cnt | Score | Error | Units | +|-----------------------------------------------------------|--------:|-----:|------------:|-------------:|--------:| +| `ParallelProcessingBenchmark.confluentParallelProcessing` | `thrpt` | `16` | `3,235.415` | `+/- 14.876` | `ops/s` | +| `ParallelProcessingBenchmark.kpipeParallelProcessing` | `thrpt` | `16` | `3,306.732` | `+/- 3.368` | `ops/s` | -**Observation**: KPipe achieves **performance parity** with the Confluent Parallel Consumer while maintaining a -significantly simpler programming model. We reach these numbers using standard Java 24 Virtual Threads, avoiding the -complexity of managed thread pools or proprietary scheduling logic. +**Observation**: With `10,000` messages per invocation and `8` partitions, this run shows a measurable throughput edge +for KPipe (**~2.2%** over Confluent). 
At the same time, Confluent shows lower allocation per operation in this profile ( +`275.078 B/op` vs `1457.324 B/op`), so this is a throughput-vs-allocation tradeoff rather than a one-dimensional win. ## Understanding Results @@ -122,11 +122,11 @@ Based on the latest snapshot results, we can derive the following throughput expectations when Kafka I/O is excluded. - **JSON (In-Memory)**: Up to **~405,000 records/s**. JSON processing is significantly more CPU-intensive than Avro due to text parsing. -- **End-to-End Parallel Processing**: **~331,000 messages/s**. This is the most realistic metric as it includes a real - Kafka broker (embedded), network polling, and Virtual Thread scheduling. +- **End-to-End Parallel Processing**: **~3,235 to ~3,307 messages/s**. The reported score is already a per-message rate, + because JMH divides each invocation's time by the `@OperationsPerInvocation(10000)` count; do not multiply it by `10,000` again. -> **Note**: The `ParallelProcessingBenchmark` uses `@OperationsPerInvocation(1000)`, so the reported `331.248 ops/s` is -> normalized to reflect **331,248 messages per second**. +> **Note**: The `ParallelProcessingBenchmark` uses `@OperationsPerInvocation(10000)`, so the reported +> `3,306.732 ops/s` already means roughly **3,307 messages per second**. Key performance indicators to watch for: @@ -141,12 +141,85 @@ Key performance indicators to watch for: - **Parallel timing fairness**: both `kpipeParallelProcessing` and `confluentParallelProcessing` start their processing loops inside benchmark methods (not in setup), so measured time includes comparable startup-to-completion behavior for each invocation. -- **Parallel throughput normalization**: `ParallelProcessingBenchmark` uses `@OperationsPerInvocation(1000)`, so its +- **Parallel throughput normalization**: `ParallelProcessingBenchmark` uses `@OperationsPerInvocation(10000)`, so its reported throughput is normalized per processed message rather than per full benchmark invocation.
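JMH's `@OperationsPerInvocation` normalization can be sanity-checked with simple arithmetic. The sketch below uses hypothetical numbers (a ~3 s invocation that processes 10,000 messages, consistent with the table's ~3,300 ops/s scores); `OpsNormalizationCheck` is an illustrative class name, not part of the benchmark suite:

```java
// Sanity check for JMH's @OperationsPerInvocation normalization.
// Assumption (hypothetical numbers): one benchmark invocation seeds and awaits
// 10,000 messages and takes about 3 seconds end to end.
public class OpsNormalizationCheck {

    public static void main(String[] args) {
        int opsPerInvocation = 10_000;   // mirrors @OperationsPerInvocation(10000)
        double invocationSeconds = 3.0;  // hypothetical wall time per invocation

        // JMH divides each invocation's elapsed time by the declared operation
        // count, so the printed "ops/s" is already a per-message rate.
        double reportedScore = opsPerInvocation / invocationSeconds;

        // A per-invocation reading would instead be ~0.33 invocations/s.
        double invocationsPerSecond = 1.0 / invocationSeconds;

        System.out.printf("score = %.0f ops/s (messages/s), %.2f invocations/s%n",
            reportedScore, invocationsPerSecond);
    }
}
```

Under these assumptions the computed score lands near the ~3,300 ops/s reported in the table, which is why the score should be read directly as messages per second.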
- **Logging noise control**: KPipe parallel benchmark uses a no-op sink in benchmark runs to avoid console I/O from distorting throughput numbers. +- **CPU efficiency (Linux only)**: compare CPI and related normalized counters from `perfnorm` for + `kpipeParallelProcessing` vs `confluentParallelProcessing`. +- **Platform caveat for CPI**: macOS runs can still compare throughput and GC behavior, but CPI should be + collected/reported only from Linux perf-enabled runs. ## Requirements - **Java 24+**: Required for Virtual Threads (Project Loom). - **Gradle**: Used to compile and execute the benchmark harness. + +### CPU/CPI Profiling for the Parallel Benchmark + +For the KPipe vs Confluent parallel comparison, keep the benchmark target fixed and enable a profiler: + +```bash +# Linux: collect normalized hardware counters (includes CPI) +./gradlew :benchmarks:jmh \ + -Pjmh.includes='ParallelProcessingBenchmark' \ + -Pjmh.profilers='perfnorm' \ + -Pjmh.resultFormat=TEXT + +# macOS: CPI is not available via perf counters in JMH; use GC counters as a CPU-adjacent signal instead +./gradlew :benchmarks:jmh \ + -Pjmh.includes='ParallelProcessingBenchmark' \ + -Pjmh.profilers='gc' \ + -Pjmh.resultFormat=TEXT +``` + +You can also use the helper script: + +```bash +# Linux (CPI mode) +PROFILE_MODE=cpi INCLUDES='ParallelProcessingBenchmark' ./scripts/run-benchmarks.sh + +# macOS (falls back from cpi -> gc with a warning) +PROFILE_MODE=cpi INCLUDES='ParallelProcessingBenchmark' ./scripts/run-benchmarks.sh + +# Heap/allocation view (portable) +PROFILE_MODE=heap INCLUDES='ParallelProcessingBenchmark' ./scripts/run-benchmarks.sh + +# Thread/runtime view (HotSpot) +PROFILE_MODE=threads INCLUDES='ParallelProcessingBenchmark' ./scripts/run-benchmarks.sh +``` + +Supported `PROFILE_MODE` values in `scripts/run-benchmarks.sh`: + +- `none`: no JMH profiler +- `gc`: allocation and GC counters (`gc`) +- `heap`: allocation/GC plus HotSpot GC internals (`gc,hs_gc`) +- `threads`: HotSpot thread/runtime signal
(`hs_thr,hs_rt`) +- `cpi`: Linux `perfnorm` (falls back to `gc` on macOS) + +Interpretation guidance for KPipe vs Confluent: + +- Throughput (`ops/s`) remains the primary metric. +- On Linux, `perfnorm` adds normalized counters; compare CPI (`cycles`/`instructions`) between the two benchmarks. +- Lower CPI at similar throughput usually indicates better instruction-path efficiency. +- On macOS, use throughput plus GC metrics; do not claim CPI without Linux perf counters. + +### Parallel Comparison Graph + +The latest visual comparison for KPipe vs Confluent parallel processing is at: + +- `benchmarks/graphs/parallel_processing_gc_comparison.svg` + +![KPipe vs Confluent Parallel Benchmark](graphs/parallel_processing_gc_comparison.svg) + +To regenerate the source benchmark results before producing or refreshing the graph: + +```bash +./gradlew :benchmarks:jmh \ + -Pjmh.includes='ParallelProcessingBenchmark' \ + -Pjmh.profilers='gc' \ + -Pjmh.resultFormat=TEXT +``` + +> Note: JMH output is written to `benchmarks/build/results/jmh/results.<format>`, where `<format>` is the +> result format name, lowercased. For `TEXT`, the file is typically `benchmarks/build/results/jmh/results.text`.
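To make the CPI interpretation above concrete, here is an illustrative sketch; the cycle and instruction counts are made-up placeholder values (not measured data), and `CpiComparison`/`PerfNormSample` are hypothetical names:

```java
// Illustrative CPI comparison from perfnorm-style counters.
// All counter values below are made-up placeholders, not measured data.
public class CpiComparison {

    // perfnorm reports hardware counters normalized per benchmark operation.
    record PerfNormSample(double cyclesPerOp, double instructionsPerOp) {
        double cpi() {
            return cyclesPerOp / instructionsPerOp; // cycles per instruction
        }
    }

    public static void main(String[] args) {
        var kpipe = new PerfNormSample(9.0e6, 6.0e6);     // hypothetical
        var confluent = new PerfNormSample(9.6e6, 6.0e6); // hypothetical

        System.out.printf("KPipe CPI = %.2f, Confluent CPI = %.2f%n",
            kpipe.cpi(), confluent.cpi());
        // At similar throughput, the lower CPI suggests a more efficient
        // instruction path for that implementation.
    }
}
```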
diff --git a/benchmarks/build.gradle.kts b/benchmarks/build.gradle.kts index da675f9..18b1fe8 100644 --- a/benchmarks/build.gradle.kts +++ b/benchmarks/build.gradle.kts @@ -1,5 +1,3 @@ -import org.gradle.api.artifacts.VersionCatalogsExtension - plugins { java id("me.champeau.jmh") version "0.7.3" @@ -20,36 +18,43 @@ dependencies { // Logging for JMH forks implementation(libsCatalog.findLibrary("slf4jSimple").get()) - // Apache Kafka test-kit for embedded benchmark broker - val kafkaVersion = libsCatalog.findVersion("kafka").get().requiredVersion - implementation(libsCatalog.findLibrary("kafkaScala213").get()) - implementation("org.apache.kafka:kafka_2.13:$kafkaVersion:test") - implementation("org.apache.kafka:kafka-clients:$kafkaVersion:test") - implementation("org.apache.kafka:kafka-server-common:$kafkaVersion:test") + implementation(libsCatalog.findLibrary("kafkaTestCommonRuntime").get()) + implementation(libsCatalog.findLibrary("junitJupiterApi").get()) } jmh { - warmupIterations = providers.gradleProperty("jmh.warmupIterations").orNull?.toIntOrNull() ?: 3 - iterations = providers.gradleProperty("jmh.iterations").orNull?.toIntOrNull() ?: 5 - fork = providers.gradleProperty("jmh.fork").orNull?.toIntOrNull() ?: 1 - threads = providers.gradleProperty("jmh.threads").orNull?.toIntOrNull() ?: 1 - - providers - .gradleProperty("jmh.includes") - .orNull - ?.split(',') - ?.map(String::trim) - ?.filter(String::isNotEmpty) - ?.takeIf { it.isNotEmpty() } - ?.let { includes = it } + fun intProp(name: String, default: Int): Int { + return providers.gradleProperty(name).orNull?.toIntOrNull() ?: default + } + + fun stringProp(name: String, default: String): String { + return providers.gradleProperty(name).orNull?.trim()?.takeIf { it.isNotEmpty() } ?: default + } + + fun csvProp(name: String): List<String>?
{ + return providers.gradleProperty(name).orNull?.split(',')?.map(String::trim)?.filter(String::isNotEmpty) + ?.takeIf { it.isNotEmpty() } + } + + warmupIterations = intProp("jmh.warmupIterations", 3) + iterations = intProp("jmh.iterations", 5) + fork = intProp("jmh.fork", 1) + threads = intProp("jmh.threads", 1) + + csvProp("jmh.includes")?.let { includes = it } + csvProp("jmh.profilers")?.let { profilers = it } + val jmhResultFormat = stringProp("jmh.resultFormat", "TEXT") val jmhTmpDir = layout.buildDirectory.dir("tmp/jmh").get().asFile.absolutePath + val jmhResultFile = layout.buildDirectory.file("results/jmh/results.${jmhResultFormat.lowercase()}").get().asFile benchmarkMode = listOf("thrpt") timeUnit = "s" failOnError = true forceGC = true + resultFormat = jmhResultFormat + resultsFile = jmhResultFile jvmArgs = listOf("-Djava.io.tmpdir=$jmhTmpDir") } diff --git a/benchmarks/graphs/parallel_processing_gc_comparison.svg b/benchmarks/graphs/parallel_processing_gc_comparison.svg new file mode 100644 index 0000000..2d5354c --- /dev/null +++ b/benchmarks/graphs/parallel_processing_gc_comparison.svg @@ -0,0 +1,123 @@ [New SVG chart; markup omitted in this view. Recoverable text content: title "ParallelProcessingBenchmark: KPipe vs Confluent (Latest Baseline)"; subtitle "JMH Throughput + gc profiler (16 samples), TARGET_MESSAGES=10,000, partitions=8". Throughput (ops/s): Confluent 3235.415 +/- 14.876, KPipe 3306.732 +/- 3.368; readout: delta +2.20% (KPipe over Confluent), gap +71.317 ops/s. Cost profile: gc.alloc.rate (MB/sec) Confluent 0.729 vs KPipe 3.826; gc.alloc.rate.norm (B/op) Confluent 275.078 vs KPipe 1457.324. GC activity (lower is better): gc.count Confluent 80 vs KPipe 43; gc.time (ms) Confluent 128 vs KPipe 72. Summary: throughput, KPipe leads by +2.20% in this baseline; allocation, Confluent is lower on MB/sec and B/op; GC behavior, KPipe shows fewer events and less total GC time here.] diff --git a/benchmarks/src/jmh/java/org/kpipe/benchmarks/ParallelProcessingBenchmarkInfrastructure.java b/benchmarks/src/jmh/java/org/kpipe/benchmarks/ParallelProcessingBenchmarkInfrastructure.java index 9075cc1..87b1ca6 100644 --- a/benchmarks/src/jmh/java/org/kpipe/benchmarks/ParallelProcessingBenchmarkInfrastructure.java +++ b/benchmarks/src/jmh/java/org/kpipe/benchmarks/ParallelProcessingBenchmarkInfrastructure.java @@ -9,8 +9,6 @@ import java.util.UUID; import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicInteger; -import kafka.testkit.KafkaClusterTestKit; -import kafka.testkit.TestKitNodes; import org.apache.kafka.clients.admin.Admin; import org.apache.kafka.clients.admin.NewTopic; import org.apache.kafka.clients.consumer.ConsumerConfig; @@ -20,6 +18,8 @@ import org.apache.kafka.clients.producer.ProducerRecord; import org.apache.kafka.common.serialization.ByteArrayDeserializer; import org.apache.kafka.common.serialization.ByteArraySerializer; +import org.apache.kafka.common.test.KafkaClusterTestKit; +import org.apache.kafka.common.test.TestKitNodes; import org.kpipe.consumer.FunctionalConsumer; import org.openjdk.jmh.annotations.Level; import org.openjdk.jmh.annotations.Scope; @@ -50,7 +50,10 @@ public final class ParallelProcessingBenchmarkInfrastructure { static final String TOPIC = "benchmark-topic"; /// Number of records seeded and awaited per invocation. - static final int TARGET_MESSAGES = 1000; + static final int TARGET_MESSAGES = 10_000; + + /// Topic partitions used to expose parallel scheduler behavior. + static final int TOPIC_PARTITIONS = 8; /// Safety timeout for per-invocation message completion checks.
private static final long MAX_WAIT_NANOS = TimeUnit.SECONDS.toNanos(30); @@ -159,7 +162,7 @@ private static void createTopicIfMissing(final Properties clientProperties) { adminProps.putAll(clientProperties); try (final var admin = Admin.create(adminProps)) { - admin.createTopics(Collections.singletonList(new NewTopic(TOPIC, 1, (short) 1))).all().get(); + admin.createTopics(Collections.singletonList(new NewTopic(TOPIC, TOPIC_PARTITIONS, (short) 1))).all().get(); } catch (final Exception e) { throw new IllegalStateException("Unable to create benchmark topic", e); } diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml index 24000d4..1b66edd 100644 --- a/gradle/libs.versions.toml +++ b/gradle/libs.versions.toml @@ -13,6 +13,7 @@ parallelConsumer = "0.5.3.0" kafkaClients = { module = "org.apache.kafka:kafka-clients", version.ref = "kafka" } kafkaScala213 = { module = "org.apache.kafka:kafka_2.13", version.ref = "kafka" } kafkaServerCommon = { module = "org.apache.kafka:kafka-server-common", version.ref = "kafka" } +kafkaTestCommonRuntime = { module = "org.apache.kafka:kafka-test-common-runtime", version.ref = "kafka" } avro = { module = "org.apache.avro:avro", version.ref = "avro" } dslJson = { module = "com.dslplatform:dsl-json", version.ref = "dslJson" } @@ -33,4 +34,3 @@ testcontainersJunitJupiter = { module = "org.testcontainers:testcontainers-junit testcontainersKafka = { module = "org.testcontainers:testcontainers-kafka", version.ref = "testcontainers" } parallelConsumerCore = { module = "io.confluent.parallelconsumer:parallel-consumer-core", version.ref = "parallelConsumer" } - diff --git a/scripts/run-benchmarks.sh b/scripts/run-benchmarks.sh index f78a524..a98136f 100755 --- a/scripts/run-benchmarks.sh +++ b/scripts/run-benchmarks.sh @@ -8,34 +8,88 @@ ITERATIONS="${ITERATIONS:-8}" FORK="${FORK:-2}" THREADS="${THREADS:-1}" INCLUDES="${INCLUDES:-}" +PROFILE_MODE="${PROFILE_MODE:-none}" # none | gc | heap | threads | cpi 
+RESULT_FORMAT="${RESULT_FORMAT:-TEXT}" LOG_FILE="benchmarks_execution.log" +OS_NAME="$(uname -s)" +PROFILERS="" + +case "$PROFILE_MODE" in + none) + ;; + gc) + PROFILERS="gc" + ;; + heap) + # heap-oriented signal: allocation + GC counters + HotSpot GC internals + PROFILERS="gc,hs_gc" + ;; + threads) + # thread/runtime-oriented signal from HotSpot + PROFILERS="hs_thr,hs_rt" + ;; + cpi) + if [ "$OS_NAME" = "Linux" ]; then + # perfnorm reports normalized HW counters including CPI (cycles/instruction) + PROFILERS="perfnorm" + else + echo "WARN: PROFILE_MODE=cpi requires Linux perf events. Falling back to gc profiler on $OS_NAME." + PROFILERS="gc" + fi + ;; + *) + echo "ERROR: Unsupported PROFILE_MODE='$PROFILE_MODE'." + echo "Supported modes: none, gc, heap, threads, cpi" + exit 1 + ;; +esac + echo "Starting KPipe Benchmarks..." echo "Results will be saved to $LOG_FILE" -echo "Run config: warmup=$WARMUP iterations=$ITERATIONS fork=$FORK threads=$THREADS includes=${INCLUDES:-}" - -# Create stable temp directory for JMH -mkdir -p benchmarks/build/tmp/jmh && -cd .. 
+echo "Run config: warmup=$WARMUP iterations=$ITERATIONS fork=$FORK threads=$THREADS includes=${INCLUDES:-} profile=$PROFILE_MODE profilers=${PROFILERS:-} resultFormat=$RESULT_FORMAT" -# Clean and run all benchmarks +# Clean and run benchmarks from repository root GRADLE_CMD=(./gradlew :benchmarks:clean :benchmarks:jmh \ -Pjmh.warmupIterations="$WARMUP" \ -Pjmh.iterations="$ITERATIONS" \ -Pjmh.fork="$FORK" \ - -Pjmh.threads="$THREADS") + -Pjmh.threads="$THREADS" \ + -Pjmh.resultFormat="$RESULT_FORMAT") if [ -n "$INCLUDES" ]; then GRADLE_CMD+=("-Pjmh.includes=$INCLUDES") fi +if [ -n "$PROFILERS" ]; then + GRADLE_CMD+=("-Pjmh.profilers=$PROFILERS") +fi + "${GRADLE_CMD[@]}" 2>&1 | tee "$LOG_FILE" echo "--------------------------------------------------" echo "Benchmark Summary:" -if [ -f "benchmarks/build/results/jmh/results.txt" ]; then - cat benchmarks/build/results/jmh/results.txt + +RESULT_EXT="$(echo "$RESULT_FORMAT" | tr '[:upper:]' '[:lower:]')" +CANDIDATES=( + "benchmarks/build/results/jmh/results.${RESULT_EXT}" + "benchmarks/build/results/jmh/results.text" + "benchmarks/build/results/jmh/results.txt" +) + +SUMMARY_FILE="" +for file in "${CANDIDATES[@]}"; do + if [ -f "$file" ]; then + SUMMARY_FILE="$file" + break + fi +done + +if [ -n "$SUMMARY_FILE" ]; then + cat "$SUMMARY_FILE" else - echo "Results file not found. Check $LOG_FILE for details." + echo "Results file not found. Checked: ${CANDIDATES[*]}. Check $LOG_FILE for details." fi + +echo "Profiler outputs (if enabled) are emitted next to JMH results under benchmarks/build/results/jmh/." echo "--------------------------------------------------"
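For reference, the results-file naming rule that `benchmarks/build.gradle.kts` applies (and that the script's candidate list mirrors) can be sketched as follows; `JmhResultsPath` and `resultsFileName` are hypothetical names used only for illustration:

```java
import java.util.Locale;

// Hypothetical helper mirroring the naming rule in benchmarks/build.gradle.kts:
// results file = benchmarks/build/results/jmh/results.<resultFormat lowercased>
public class JmhResultsPath {

    static String resultsFileName(String resultFormat) {
        return "results." + resultFormat.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(resultsFileName("TEXT")); // results.text
        System.out.println(resultsFileName("JSON")); // results.json
        System.out.println(resultsFileName("CSV"));  // results.csv
    }
}
```

This is why the script also probes the legacy `results.txt` name as a fallback: only the lowercased format string is guaranteed by the build configuration.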