Describe the bug
Current scripts in the benchmarks/single_node directory (e.g., https://github.com/SemiAnalysisAI/InferenceX/blob/adbaae52ddf2569ddf2e793f5ad0a56f2f2a4d13/benchmarks/single_node/qwen3.5_fp8_mi325x.sh) follow this pattern to establish GPU monitoring for every InferenceX benchmark run:
1. start_gpu_monitor
2. Start LLM inference engine as a background process (e.g. SGLang, vLLM)
3. wait_for_server_ready
4. Run benchmark for the model (key result to report for InferenceX benchmark)
5. stop_gpu_monitor
As-is, the GPU energy consumption and other metrics can be skewed by irregularities in LLM inference engine startup (compilation, kernel autotuning, etc.) and by the 5-second sleeps used while polling the /health route for server readiness. Cross-referencing other logs can identify some timestamps that fall outside the actual benchmark window, but nothing in the CSV itself indicates which rows belong to the benchmark.
Expected behavior
Monitoring data should only correspond to the benchmark execution window. Alternatively, providing a way to identify which portion of the data corresponds to benchmark execution would also be acceptable.
Potential solutions
Either adding a marker to gpu_metrics.csv or delaying the start_gpu_monitor call until the benchmark is ready to run (moving step 1 above to just before step 4) would help disambiguate the results.
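A minimal sketch of the marker idea, assuming a sidecar file is acceptable rather than writing into gpu_metrics.csv directly (how the monitor writes that file is not shown here, so appending to it could conflict with the writer). The helper name mark_benchmark_phase, the MARKER_FILE variable, and the file name gpu_metrics_markers.csv are all hypothetical; only start_gpu_monitor, wait_for_server_ready, and stop_gpu_monitor come from the existing scripts. Timestamps are Unix epoch seconds, which may need adjusting to match whatever format the monitor uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record benchmark window boundaries in a sidecar CSV.
# MARKER_FILE and its default name are assumptions, not part of the repo.
MARKER_FILE="${MARKER_FILE:-gpu_metrics_markers.csv}"

mark_benchmark_phase() {
    local phase="$1"
    # Write the header once, when the file is missing or empty.
    if [ ! -s "$MARKER_FILE" ]; then
        echo "timestamp,phase" > "$MARKER_FILE"
    fi
    # Unix epoch seconds (assumed format; adapt to the monitor's timestamps).
    echo "$(date +%s),${phase}" >> "$MARKER_FILE"
}

# Intended call order inside a benchmark script (existing functions not redefined here):
#   start_gpu_monitor
#   ... launch inference engine in the background ...
#   wait_for_server_ready
#   mark_benchmark_phase benchmark_start
#   ... run benchmark ...
#   mark_benchmark_phase benchmark_end
#   stop_gpu_monitor
```

The alternative of moving start_gpu_monitor to just after wait_for_server_ready needs no helper at all, at the cost of discarding the startup data entirely.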
Additional context
The startup costs of a pinned version of vLLM (likely similar for other engines) should be relatively consistent on a given GitHub Actions runner, but an inference engine's startup performance can change across versions. The ability to separate startup costs from the benchmark data is therefore important.
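As an illustration of separating the two after the fact, here is a sketch of a post-processing filter that keeps only the rows inside a known benchmark window. It assumes the first column of the CSV is a Unix epoch timestamp and the first line is a header; the actual layout of gpu_metrics.csv may differ, and the function name filter_benchmark_window is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the header plus rows whose timestamp lies in [start, end].
# Assumes column 1 is a Unix epoch timestamp and line 1 is a header row.
filter_benchmark_window() {
    local csv="$1" start="$2" end="$3"
    awk -F',' -v s="$start" -v e="$end" \
        'NR == 1 || ($1 + 0 >= s + 0 && $1 + 0 <= e + 0)' "$csv"
}

# Example call (timestamps are placeholders):
#   filter_benchmark_window gpu_metrics.csv 1700000000 1700000600 > gpu_metrics_benchmark.csv
```

The start and end values still have to come from somewhere, such as benchmark logs or an explicit marker written by the script.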