[Feature] Implement Prometheus Metrics Support for Multi-Client Mode #55

@WarningRan

Description

The initial implementation of Prometheus metrics in PR #52 was explicitly scoped to single-client mode (n=1). Now that PR #53 (Add multi-client support) is merged, we need to extend this functionality.

Currently, the prom_metrics field is null in multi-client output. The key implementation difference is that a multi-client run involves multiple non-concurrent client processes, each with its own start and end time, so a single run-wide time window no longer describes any one client.
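
Because per-client attribution relies on the client windows not overlapping, a quick sanity check on the collected windows may be useful before querying. This is only a sketch: the function name `find_overlaps` and the `{instance: (start, end)}` shape are hypothetical and do not come from the existing code.

```python
from typing import Dict, List, Tuple


def find_overlaps(windows: Dict[str, Tuple[float, float]]) -> List[Tuple[str, str]]:
    """Return adjacent client pairs (ordered by start time) whose windows overlap.

    An empty list means no two windows overlap at all: if any pair overlapped,
    some pair that is adjacent in start-time order would overlap too.
    """
    ordered = sorted(windows.items(), key=lambda item: item[1][0])
    overlaps = []
    for (name_a, (_, end_a)), (name_b, (start_b, _)) in zip(ordered, ordered[1:]):
        # A window that starts exactly when the previous one ends is not an overlap.
        if start_b < end_a:
            overlaps.append((name_a, name_b))
    return overlaps
```

If the list is non-empty, pod-level metrics for those windows would mix usage from more than one client.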

Expected Behavior

When running the benchmark in multi-client mode with --enable_prom_metrics, the system must be able to:

  1. Utilize Individual Timestamps: Use the specific client_time_start and client_time_end logs captured for each client process to define the time window for metric collection.
  2. Individually Isolate Metrics: Query Prometheus/Thanos to collect resource usage (CPU, memory, etc.) for the Runner Pod. The key is to filter these metrics based on the unique start and end timestamps of each client.
    • ⚠️ OPEN TO DISCUSSION ⚠️
      Should the metrics represent only the client's individual process usage, or the Runner Pod's total usage during the client's execution window?
  3. Complete Output: Associate the isolated metrics with the correct client's JSON output payload.
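
The three steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the helper names (`build_range_query`, `attach_prom_metrics`), the injected `fetch` callable, the per-client dict shape, and the choice of the cAdvisor counter `container_cpu_usage_seconds_total` are all hypothetical. The key names follow the `client_start_time`/`client_end_time` spelling used in the technical notes and may differ in the real output. The parameters match the Prometheus `/api/v1/query_range` HTTP endpoint (`query`, `start`, `end`, `step`).

```python
from typing import Any, Callable, Dict, List


def build_range_query(pod_name: str, start: float, end: float, step: str = "15s") -> Dict[str, str]:
    """Build query_range parameters covering one client's time window."""
    if end <= start:
        raise ValueError("client end time must be after its start time")
    # Assumption: pod-level CPU usage, i.e. the whole Runner Pod during the window.
    promql = f'sum(rate(container_cpu_usage_seconds_total{{pod="{pod_name}"}}[1m]))'
    return {
        "query": promql,
        "start": f"{start:.3f}",  # Unix timestamps in seconds
        "end": f"{end:.3f}",
        "step": step,
    }


def attach_prom_metrics(
    clients: List[Dict[str, Any]],
    pod_name: str,
    fetch: Callable[[Dict[str, str]], Any],
) -> List[Dict[str, Any]]:
    """Return copies of the client payloads with prom_metrics filled per client."""
    results = []
    for client in clients:
        params = build_range_query(
            pod_name, client["client_start_time"], client["client_end_time"]
        )
        enriched = dict(client)  # leave the input payload untouched
        enriched["prom_metrics"] = fetch(params)
        results.append(enriched)
    return results
```

In a real run, `fetch` would wrap an HTTP GET against the Prometheus or Thanos query endpoint; keeping it injectable lets the windowing logic be tested without a live server.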

Technical Notes from Current Implementation (PR #52 Context)

  • Current Single-Client Logic: The existing logic in process_server correctly uses client_start_time and client_end_time to define the time window for the single client.
  • Multi-Client Challenge: The multi-client runner likely executes clients sequentially. The process script must iterate through the logs/outputs for each client instance (client.log.<instance>), extract its unique start/end times, and query Prometheus accordingly.
  • Metric Scope (Needs Decision):
    • Since all clients run within the same Runner Pod, the primary metric source remains the RUNNER_POD_NAME.
    • The team needs to determine if resource usage should be calculated as:
      • Option A: The pod's total usage during the client's time window.
      • Option B: An attempt to isolate client-specific process metrics (if available) within that time window.
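
The per-instance log iteration described above might look like the sketch below. The log line format is an assumption made purely for illustration (a line such as `client_start_time=1700000000.5`); the real `client.log.<instance>` files may record the timestamps differently, and the helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Tuple

# Hypothetical log format: "client_start_time=<unix seconds>" and
# "client_end_time=<unix seconds>" somewhere in each client log.
_TS_PATTERN = re.compile(r"\b(client_start_time|client_end_time)=(\d+(?:\.\d+)?)")


def parse_client_window(log_text: str) -> Tuple[float, float]:
    """Extract (start, end) from one client's log text."""
    found: Dict[str, float] = {}
    for name, value in _TS_PATTERN.findall(log_text):
        found[name] = float(value)  # if a key repeats, the last value wins
    missing = {"client_start_time", "client_end_time"} - found.keys()
    if missing:
        raise ValueError(f"missing timestamps: {sorted(missing)}")
    return found["client_start_time"], found["client_end_time"]


def collect_client_windows(log_dir: str) -> Dict[str, Tuple[float, float]]:
    """Map each instance suffix of client.log.<instance> to its time window."""
    windows = {}
    for path in sorted(Path(log_dir).glob("client.log.*")):
        instance = path.name[len("client.log."):]
        windows[instance] = parse_client_window(path.read_text())
    return windows
```

Each resulting window can then drive one Prometheus query against RUNNER_POD_NAME, which corresponds to Option A; Option B would need an additional per-process metric source that this sketch does not assume exists.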

Labels: enhancement, help wanted