Skip to content

Measure plugin initializing latency and fix queue latency#7350

Open
fg91 wants to merge 2 commits into
masterfrom
fg91/feat/measure-initializing-latency
Open

Measure plugin initializing latency and fix queue latency#7350
fg91 wants to merge 2 commits into
masterfrom
fg91/feat/measure-initializing-latency

Conversation

@fg91
Copy link
Copy Markdown
Member

@fg91 fg91 commented May 7, 2026

Why are the changes needed?

  • Flytepropeller currently emits a metric measuring how much time is spent by a plugin in the queued phase, i.e. a Pod waiting for being scheduled on a node.
    • This metric currently undercounts the queued time because the time stamp when we transitioned into this phase is reset too often. This PR fixes this.
  • I now also add a metric measuring how much time is spent by a plugin in the initializing phase, i.e. how long it takes for a Pod to transition from ContainerCreating to Running.

What changes were proposed in this pull request?

  • Fix measurement of phase transition time stamps
  • Add a new metric that measures duration in initializing phase

How was this patch tested?

  • Added unit test
  • Ran propeller with this change in a staging cluster and compared metrics to measured transitioning times

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@fg91 fg91 added added Merged changes that add new functionality fixed For any bug fixes labels May 7, 2026
@github-actions github-actions Bot added the flyte label May 7, 2026
}
}

// Emit the initializing latency if the task has just transitioned from Initializing to Running.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This metric is new. (I want to measure progress regarding startup times, i.e. image pulls.)

PluginStateVersion: pluginTrns.pluginStateVersion,
PluginPhase: pluginTrns.pInfo.Phase(),
PluginPhaseVersion: pluginTrns.pInfo.Version(),
LastPhaseUpdatedAt: time.Now(),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We called this not only when transitioning from one phase to another but also within a single phase when the phase version was increased. This lead to wrong queue latency measurements.

// off this timestamp; if version bumps reset it, both metrics undercount
// and only capture the final segment before exiting the phase.
func Test_task_Handle_LastPhaseUpdatedAt(t *testing.T) {
inputs := &core.LiteralMap{
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The boilerplate is mostly taken from existing tests above.

Comment thread go.mod
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fg91 added 2 commits May 7, 2026 13:26
Signed-off-by: Fabio Grätz <fabio@cusp.ai>
Signed-off-by: Fabio Grätz <fabio@cusp.ai>
@fg91 fg91 force-pushed the fg91/feat/measure-initializing-latency branch from 18a326c to 7150e00 Compare May 7, 2026 13:26
@fg91 fg91 requested a review from Sovietaced May 7, 2026 13:29
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.97%. Comparing base (cd7371f) to head (7150e00).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7350      +/-   ##
==========================================
+ Coverage   56.96%   56.97%   +0.01%     
==========================================
  Files         931      931              
  Lines       58276    58284       +8     
==========================================
+ Hits        33198    33209      +11     
+ Misses      22019    22016       -3     
  Partials     3059     3059              
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.15% <ø> (ø)
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.09% <ø> (ø)
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.19% <ø> (ø)
unittests-flytepropeller 53.75% <100.00%> (+0.04%) ⬆️
unittests-flytestdlib 62.62% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

added Merged changes that add new functionality fixed For any bug fixes flyte

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant