Add benchmark overview doc #1528
Open: cijothomas wants to merge 25 commits into open-telemetry:main from cijothomas:cijothomas/benchoverviewdoc1

Commits (25, all by cijothomas):

- a459ee4 Add benchmark overview doc
- 38ca568 use throughput per core
- 557c524 nits
- 3357270 Merge branch 'main' into cijothomas/benchoverviewdoc1
- bad5cb6 Merge branch 'main' into cijothomas/benchoverviewdoc1
- 6f33600 few feedback addressed
- 2a68ded consistent wordings
- 0c461ad Merge branch 'main' into cijothomas/benchoverviewdoc1
- e95ab5c Merge branch 'main' into cijothomas/benchoverviewdoc1
- 132fd9b fill in available data
- 18259e0 few tweaks to clarfy
- 8c8bfcc Merge branch 'main' into cijothomas/benchoverviewdoc1
- 8961a15 Merge branch 'main' into cijothomas/benchoverviewdoc1
- 4efa9a1 snaity
- 218676b md
- b501207 add passthrough
- 2dbd3f1 Merge branch 'main' into cijothomas/benchoverviewdoc1
- c8cc9b8 fill some data for standard load
- 6cc3316 fill more values
- a1abd95 include passthrough
- fa63a4f Merge branch 'main' into cijothomas/benchoverviewdoc1
- e28d208 lengthh
- c407c88 Merge branch 'main' into cijothomas/benchoverviewdoc1
- 3c67b1d clarify syslog
- 029b0b3 Merge branch 'main' into cijothomas/benchoverviewdoc1

# OpenTelemetry Arrow Performance Summary

## Overview

The OpenTelemetry Arrow (OTel Arrow) project is currently in **Phase 2**,
building an end-to-end Arrow-based telemetry pipeline in Rust. Phase 1 focused
on collector-to-collector traffic compression using the OTAP protocol, achieving
significant network bandwidth savings. Phase 2 expands this foundation by
implementing the entire in-process pipeline using Apache Arrow's columnar
format, targeting substantial improvements in data processing efficiency while
maintaining the network efficiency gains from Phase 1.

The OTel Arrow dataflow engine, implemented in Rust, provides predictable
performance characteristics and efficient resource utilization across varying
load conditions. The engine uses a [thread-per-core
architecture](#thread-per-core-design) where resource consumption scales with
the number of configured cores.

This document presents a curated set of key performance metrics across different
load scenarios and test configurations. For the complete set of automated
performance tests (continuous, nightly, saturation, idle state, and binary size
benchmarks), see [Detailed Benchmark
Results](benchmarks.md#current-performance-results).

### Test Environment

All performance tests are executed on a bare-metal compute instance with the
following specifications:

- **CPU**: 64 physical cores / 128 logical cores (x86-64 architecture)
- **Memory**: 512 GB RAM
- **Platform**: Oracle Bare Metal Instance
- **OS**: Oracle Linux 8

This consistent, high-performance environment ensures reproducible results and
allows for comprehensive testing across various CPU core configurations (1, 4,
8 cores, and so on) by constraining the OTel Arrow dataflow engine to specific
core allocations.

### Performance Metrics

#### Idle State Performance

Baseline resource consumption with no active telemetry traffic, measured after
startup stabilization over a 60-second period. These metrics represent the
post-initialization idle state and validate a minimal resource footprint. Note
that longer-duration soak testing for memory leak detection is outside the scope
of this benchmark summary.

| Configuration   | CPU Usage | Memory Usage |
|-----------------|-----------|--------------|
| Single Core     | 0.06%     | 27 MB        |
| All Cores (128) | 2.5%      | 600 MB       |

*Note: CPU usage is normalized (percentage of total system capacity). Memory
usage scales with core count due to the [thread-per-core
architecture](#thread-per-core-design).*

These baseline metrics validate that the engine maintains a minimal resource
footprint when idle, ensuring efficient operation in environments with variable
telemetry loads.

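To make the normalized figures above concrete, they can be converted into
approximate core-equivalents for the 128-logical-core test machine. The
arithmetic below was added for this summary as an illustration; it is not part
of the benchmark harness.

```rust
// Illustrative arithmetic only: converts the normalized idle CPU figure above
// (2.5% of total system capacity on the 128-logical-core machine) into
// approximate core-equivalents.
fn main() {
    let logical_cores = 128.0_f64;
    let idle_all_cores_pct = 2.5_f64; // normalized: percent of total capacity

    let core_equivalents = logical_cores * idle_all_cores_pct / 100.0;
    println!("idle (all cores): ~{core_equivalents:.1} core-equivalents"); // ~3.2
}
```
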
#### Standard Load Performance (Single Core)

Resource utilization at 100,000 log records per second (100K logs/sec) on a
single CPU core. Tests are conducted with five different batch sizes to
demonstrate the impact of batching on performance.

**Test Parameters:**

- Total input load: 100,000 log records/second
- Average log record size: 1 KB
- Batch sizes tested: 10, 100, 1000, 5000, and 10000 records per request
- Test duration: 60 seconds

This wide range of batch sizes evaluates performance across diverse deployment
scenarios. Small batches (10-100) represent edge collectors or real-time
streaming requirements, while large batches (1000-10000) represent gateway
collectors and high-throughput aggregation points. This approach ensures a fair
assessment, highlighting both the overhead for small batches and the significant
efficiency gains inherent to Arrow's columnar format at larger batch sizes.

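Because the total input rate is fixed, the batch size also fixes how many
requests per second the load generator issues, which is part of why per-request
overhead and network usage differ so much across the rows below. The arithmetic
is derived from the stated test parameters and is illustrative only.

```rust
// Derived from the stated test parameters (100,000 logs/sec, ~1 KB per record);
// illustrative only, not part of the benchmark harness.
fn main() {
    const LOGS_PER_SEC: u64 = 100_000;
    const RECORD_SIZE_KB: u64 = 1;

    for batch_size in [10u64, 100, 1_000, 5_000, 10_000] {
        let requests_per_sec = LOGS_PER_SEC / batch_size;
        let payload_kb_per_request = batch_size * RECORD_SIZE_KB;
        // e.g. batch size 1000 -> 100 requests/sec, each carrying ~1000 KB of raw log payload
        println!(
            "batch size {batch_size:>6}: {requests_per_sec:>6} requests/sec, ~{payload_kb_per_request} KB raw payload/request"
        );
    }
}
```
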
<!-- TODO: We need to add BatchingProcessor to tests. -->
<!-- TODO: Batch size influence might be most relevant when we
do aggregation/transform etc. -->
<!-- TODO: Add benchmark tests for batch sizes 10 and 100. -->

##### Standard Load - OTAP -> OTAP (Native Protocol)

| Batch Size  | CPU Usage | Memory Usage | Network In | Network Out |
|-------------|-----------|--------------|------------|-------------|
| 10/batch    | TBD       | TBD          | TBD        | TBD         |
| 100/batch   | TBD       | TBD          | TBD        | TBD         |
| 1000/batch  | 17%       | 46 MB        | 718 KB/s   | 750 KB/s    |
| 5000/batch  | 7%        | 50 MB        | 390 KB/s   | 422 KB/s    |
| 10000/batch | 5%        | 58 MB        | 350 KB/s   | 383 KB/s    |

This represents the optimal scenario where the dataflow engine operates with its
native protocol end-to-end, eliminating protocol conversion overhead.

##### Standard Load - OTLP -> OTLP (Standard Protocol)

| Batch Size  | CPU Usage | Memory Usage | Network In | Network Out |
|-------------|-----------|--------------|------------|-------------|
| 10/batch    | TBD       | TBD          | TBD        | TBD         |
| 100/batch   | TBD       | TBD          | TBD        | TBD         |
| 1000/batch  | 43%       | 52 MB        | 2.1 MB/s   | 2.2 MB/s    |
| 5000/batch  | 40%       | 79 MB        | 1.9 MB/s   | 2.0 MB/s    |
| 10000/batch | 40%       | 80 MB        | 1.8 MB/s   | 1.9 MB/s    |

This scenario processes OTLP end-to-end using the standard OpenTelemetry
protocol, providing a baseline for comparison with traditional OTLP-based
pipelines.

#### Saturation Performance (Single Core)

Maximum throughput achievable on a single CPU core at full utilization. This
establishes the baseline "unit of capacity" for capacity planning.

**Test Parameters:**

- Batch size: 500 records per request
- Load: Continuously increased until the CPU core is fully saturated
- Test duration: 60 seconds at maximum load

##### Pass-through Mode

Forwarding without data transformation. Represents the minimum engine overhead
for load balancing and routing use cases.

| Protocol                | Max Throughput | CPU Usage | Memory Usage |
|-------------------------|----------------|-----------|--------------|
| OTAP -> OTAP (Native)   | TBD            | TBD       | TBD          |
| OTLP -> OTLP (Standard) | ~566K logs/sec | ~98%      | ~45 MB       |

##### With Processing

Includes an attribute processor to force data materialization. Represents
typical production workloads where collectors perform transformations such as
filtering, attribute enrichment, renaming, or aggregation.

| Protocol                | Max Throughput | CPU Usage | Memory Usage |
|-------------------------|----------------|-----------|--------------|
| OTAP -> OTAP (Native)   | TBD            | TBD       | TBD          |
| OTLP -> OTLP (Standard) | ~238K logs/sec | ~99%      | ~38 MB       |

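As a rough illustration of how this single-core "unit of capacity" can feed
capacity planning, the sketch below sizes a deployment from the ~238K logs/sec
with-processing figure above. The 0.8 efficiency factor is an assumption made
for the example; measured scaling-efficiency values appear in the next section.

```rust
// Back-of-the-envelope sizing from the single-core saturation figure above
// (~238K logs/sec with attribute processing). Illustrative only; the 0.8
// efficiency factor is an assumption, not a measured value.
fn cores_needed(target_logs_per_sec: f64, per_core_logs_per_sec: f64, efficiency: f64) -> u32 {
    (target_logs_per_sec / (per_core_logs_per_sec * efficiency)).ceil() as u32
}

fn main() {
    let cores = cores_needed(1_000_000.0, 238_000.0, 0.8);
    println!("~{cores} cores to sustain 1M logs/sec"); // ~6
}
```
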
#### Scalability

How throughput scales when adding CPU cores. The [thread-per-core
architecture](#thread-per-core-design) enables near-linear scaling by
eliminating shared-state synchronization overhead.

**Test Parameters:**

- Batch size: 512 records per request
- Protocol: OTLP -> OTLP with attribute processing
- Load: Maximum sustained throughput at each core count

| CPU Cores | Max Throughput  | CPU Usage | Scaling Efficiency | Memory Usage |
|-----------|-----------------|-----------|--------------------|--------------|
| 1 Core    | ~238K logs/sec  | ~99%      | 100% (baseline)    | ~38 MB       |
| 2 Cores   | ~414K logs/sec  | ~91%      | 87%                | ~55 MB       |
| 4 Cores   | ~653K logs/sec  | ~70%*     | 69%                | ~80 MB       |
| 8 Cores   | ~1.47M logs/sec | ~82%*     | 78%                | ~178 MB      |
| 16 Cores  | ~2.37M logs/sec | ~78%*     | 62%                | ~288 MB      |
| 24 Cores  | ~3.67M logs/sec | ~77%*     | 64%                | ~461 MB      |

(*) CPU was not fully saturated at these core counts; see the TODO below.

TODO: Use more load generators to saturate the CPU at higher core counts.

Scaling Efficiency = (Throughput at N cores) / (N * Single-core throughput)

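Applying the formula to the measured values above reproduces the table's
efficiency column; for example:

```rust
// Worked example of the Scaling Efficiency formula, using values from the
// table above (~238K logs/sec on 1 core, ~414K on 2 cores, ~653K on 4 cores).
fn scaling_efficiency(throughput_at_n: f64, n_cores: f64, single_core: f64) -> f64 {
    throughput_at_n / (n_cores * single_core)
}

fn main() {
    let two = scaling_efficiency(414_000.0, 2.0, 238_000.0);
    let four = scaling_efficiency(653_000.0, 4.0, 238_000.0);
    println!("2 cores: {:.0}%", two * 100.0);  // ~87%, matching the table
    println!("4 cores: {:.0}%", four * 100.0); // ~69%, matching the table
}
```
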
### Architecture

The OTel Arrow dataflow engine is built in Rust to achieve high throughput and
low latency. The columnar data representation and zero-copy processing
capabilities enable efficient handling of telemetry data at scale.

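The engine's actual OTAP record schemas live in the otap-dataflow crates and are
not reproduced here. As a rough illustration of what a columnar log batch looks
like, the sketch below builds a small Arrow `RecordBatch` with the Rust `arrow`
crate, using a made-up three-column schema.

```rust
// Illustration of a columnar log batch with the Rust `arrow` crate. The field
// names and types are made up for this example; they are not the engine's
// actual OTAP schemas.
use std::sync::Arc;

use arrow::array::{StringArray, TimestampNanosecondArray, UInt8Array};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("time_unix_nano", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("severity_number", DataType::UInt8, false),
        Field::new("body", DataType::Utf8, false),
    ]));

    // Each column is one contiguous buffer, so filters, aggregations, and
    // compression work on whole columns instead of walking record by record.
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(TimestampNanosecondArray::from(vec![
                1_700_000_000_000_000_000_i64,
                1_700_000_000_100_000_000,
            ])),
            Arc::new(UInt8Array::from(vec![9u8, 13])), // INFO, WARN severity numbers
            Arc::new(StringArray::from(vec!["user logged in", "disk almost full"])),
        ],
    )?;

    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```
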
#### Thread-Per-Core Design

The dataflow engine supports a configurable runtime execution model, using a
**thread-per-core architecture** that eliminates traditional concurrency
overhead. This design allows (a minimal sketch of the pattern follows the list):

- **CPU Affinity Control**: Pipelines can be pinned to specific CPU cores or
  groups through configuration
- **NUMA Optimization**: Memory and CPU assignments can be coordinated for
  Non-Uniform Memory Access (NUMA) architectures
- **Workload Isolation**: Different telemetry signals or tenants can be assigned
  to dedicated CPU resources, preventing resource contention
- **Reduced Synchronization**: Thread-per-core design minimizes lock contention
  and context switching overhead

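The sketch below shows the general thread-per-core pattern, not the engine's
actual code: each worker thread is pinned to one core, exclusively owns its
pipeline state, and receives work over its own channel, so the hot path needs no
locks or shared mutable state. The `core_affinity` crate is used here only as a
convenient way to demonstrate pinning; it is an assumption of this sketch, not
necessarily what the engine uses.

```rust
// Minimal sketch of the thread-per-core pattern; not the engine's actual code.
use std::sync::mpsc;
use std::thread;

struct PipelineWorker {
    processed: u64, // owned by exactly one thread: no synchronization required
}

impl PipelineWorker {
    fn process(&mut self, _record: String) {
        // parse, transform, and forward the record here
        self.processed += 1;
    }
}

fn main() {
    let cores = core_affinity::get_core_ids().expect("could not enumerate cores");
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    // One pinned worker per configured core (4 here, for illustration).
    for core in cores.into_iter().take(4) {
        let (tx, rx) = mpsc::channel::<String>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            core_affinity::set_for_current(core); // pin this worker to one core
            let mut worker = PipelineWorker { processed: 0 };
            for record in rx {
                worker.process(record);
            }
            worker.processed
        }));
    }

    // Route records to workers, e.g. by tenant or signal, to isolate workloads.
    for (i, record) in ["log a", "log b", "log c"].iter().enumerate() {
        senders[i % senders.len()].send(record.to_string()).unwrap();
    }
    drop(senders); // close the channels so the workers drain and exit

    for handle in handles {
        println!("worker processed {} records", handle.join().unwrap());
    }
}
```
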
For detailed technical documentation, see the [OTAP Dataflow Engine
Documentation](../rust/otap-dataflow/README.md) and [Phase 2
Design](phase2-design.md).

---

## Comparative Analysis: OTel Arrow vs OpenTelemetry Collector

### Methodology

To provide a fair and meaningful comparison between the OTel Arrow dataflow
engine and the OpenTelemetry Collector, we use **Syslog (UDP/TCP)** as the
ingress protocol for both systems.

#### Rationale for Syslog-Based Comparison

Syslog was specifically chosen as the input protocol because:

1. Neutral Ground: Syslog is neither OTLP (OpenTelemetry Protocol) nor OTAP
   (OpenTelemetry Arrow Protocol), ensuring neither system has a native protocol
   advantage
2. Real-World Relevance: Syslog is widely deployed in production environments,
   particularly for log aggregation from network devices, legacy systems, and
   infrastructure components
3. Conversion Overhead: Both systems must perform meaningful work to convert
   incoming Syslog messages into their internal representations (see the sketch
   after this list):
   - **OTel Collector**: Converts to Go-based `pdata` (protocol data) structures
   - **OTel Arrow**: Converts to Arrow-based columnar memory format
4. Complete Pipeline Test: This approach validates the full pipeline efficiency,
   including parsing, transformation, and serialization stages

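As a minimal sketch of the conversion work described in item 3 (and not the code
of either system), the snippet below splits one RFC 3164 line into priority,
host, and message fields; both pipelines must do at least this much before
building their internal representations.

```rust
// Minimal sketch of RFC 3164 parsing; illustrative only, and not taken from
// either the OTel Collector or the OTel Arrow engine.
#[derive(Debug)]
struct SyslogRecord<'a> {
    facility: u8,
    severity: u8,
    host: &'a str,
    message: &'a str,
}

fn parse_rfc3164(line: &str) -> Option<SyslogRecord<'_>> {
    // Example input:
    // "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8"
    let rest = line.strip_prefix('<')?;
    let (pri, rest) = rest.split_once('>')?;
    let pri: u8 = pri.parse().ok()?;

    // Skip the fixed 15-character RFC 3164 timestamp plus the following space.
    let rest = rest.get(16..)?;
    let (host, message) = rest.split_once(' ')?;

    Some(SyslogRecord {
        facility: pri / 8, // <34> -> facility 4 (auth)
        severity: pri % 8, // <34> -> severity 2 (critical)
        host,
        message,
    })
}

fn main() {
    let line = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8";
    println!("{:?}", parse_rfc3164(line));
}
```
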
Both OTLP and OTAP are used as output protocols to measure egress performance
across different serialization formats.

### Performance Comparison

**Test Parameters:**

- Input protocol: Syslog RFC 3164 (UDP)
- Input load: 100,000 messages/second
- Output protocols: OTLP and OTAP
- Test duration: 60 seconds

#### Standard Load (100K Syslog Messages/sec) - OTLP Output

| Metric                    | OTel Collector | OTel Arrow | Improvement |
|---------------------------|----------------|------------|-------------|
| CPU Usage                 | TBD            | TBD        | TBD         |
| Memory Usage              | TBD            | TBD        | TBD         |
| Network Egress            | TBD            | TBD        | TBD         |
| Throughput (messages/sec) | TBD            | TBD        | TBD         |

#### Standard Load (100K Syslog Messages/sec) - OTAP Output

| Metric                    | OTel Collector | OTel Arrow | Improvement |
|---------------------------|----------------|------------|-------------|
| CPU Usage                 | TBD            | TBD        | TBD         |
| Memory Usage              | TBD            | TBD        | TBD         |
| Network Egress            | TBD            | TBD        | TBD         |
| Throughput (messages/sec) | TBD            | TBD        | TBD         |

### Key Findings

To be populated with analysis once benchmark data is available.

The comparative analysis will demonstrate:

- Relative efficiency of Arrow-based columnar processing vs traditional
  row-oriented data structures
- Memory allocation patterns and garbage collection impact (Rust vs Go)
- Throughput characteristics under varying load conditions

---

## Additional Resources

- [Detailed Benchmark Results from Phase 2](benchmarks.md)
- [Phase 1 Benchmark Results](benchmarks-phase1.md)
- [OTAP Dataflow Engine Documentation](../rust/otap-dataflow/README.md)
- [Project Phases Overview](project-phases.md)