OCPBUGS-55755: gather P50, P95 and P99 for etcd disk metrics from all CI runs #70577
Conversation
Data will be charted and analyzed to help inform the alert thresholds we should ship with.
@dgoodwin: This pull request references Jira Issue OCPBUGS-55755, which is invalid:
Comment: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/pj-rehearse periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips

@dgoodwin: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse periodic-ci-openshift-release-master-ci-4.21-e2e-aws-vn-ovn

@dgoodwin: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@dgoodwin: The following test failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
While I think the P99 0.003 number appears correct across this job run, it does not capture the fact that this metric, when viewed over a [5m] window, was actually above the upstream threshold of 0.01, though never for more than 2 minutes. The highest spike was around 0.04, 4x the documented upstream limit. The cluster was perfectly healthy at the time. Checking an Azure run next.
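The divergence described here — a healthy whole-run P99 alongside short [5m]-window spikes — can be sketched with synthetic data; the scrape interval, run length, and sample values below are illustrative assumptions, not figures from the actual run:

```python
import statistics

SCRAPE_S = 15                          # hypothetical scrape interval (seconds)
samples = [0.003] * 1440               # ~6h of steady fsync readings (seconds)
samples[700:708] = [0.04] * 8          # a ~2-minute spike, 4x the 0.01 limit

# Whole-run P99: the spike is well under 1% of samples, so it vanishes.
run_p99 = statistics.quantiles(samples, n=100)[98]

# Windowed view: the worst reading inside any 5-minute window.
win = 300 // SCRAPE_S                  # 20 samples per window
windowed_max = max(max(samples[i:i + win])
                   for i in range(len(samples) - win + 1))

print(f"whole-run P99 = {run_p99:.3f}s, worst 5m window = {windowed_max:.3f}s")
```

The whole-run P99 stays at the baseline while every 5-minute window that overlaps the burst reports the spike, which is why a per-run percentile alone would miss an alert-worthy excursion.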
/pj-rehearse periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn

@dgoodwin: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@dgoodwin: requesting more than one rehearsal in one comment is not supported. If you would like to rehearse multiple specific jobs, please separate the job names by a space in a single command.
[REHEARSALNOTIFIER]
A total of 32626 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. A full list of affected jobs can be found here. Once you are satisfied with the results of the rehearsals, comment:
@dgoodwin: This pull request references Jira Issue OCPBUGS-55755, which is invalid:
Comment: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/pj-rehearse periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn

@dgoodwin: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
GCP: AWS: Azure: At least on these specific runs, we're nearly passing the upstream recommendation (0.01 for fsync, 0.025 for commit) at P99 across an entire job run, but there are spikes above it which, depending on how long they last, could alert.
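A sketch of how the gathered per-run P99s could be checked against the upstream recommendations quoted above; the per-platform numbers are hypothetical placeholders, not the actual rehearsal results:

```python
# Upstream etcd latency guidance (seconds), as cited in this discussion:
# ~0.01s for WAL fsync P99 and ~0.025s for backend commit P99.
UPSTREAM = {"fsync_p99": 0.010, "commit_p99": 0.025}

# Hypothetical whole-run P99s for the three rehearsed platforms
# (illustrative only; real numbers come from the gathered CI data).
runs = {
    "aws":   {"fsync_p99": 0.009, "commit_p99": 0.022},
    "gcp":   {"fsync_p99": 0.008, "commit_p99": 0.024},
    "azure": {"fsync_p99": 0.011, "commit_p99": 0.020},
}

within_limit = {}
for platform, metrics in runs.items():
    for name, value in metrics.items():
        ok = value <= UPSTREAM[name]
        within_limit[(platform, name)] = ok
        print(platform, name, value, "OK" if ok else "above upstream limit")
```

Chartting many runs this way, rather than eyeballing individual dashboards, is the point of gathering the percentiles from every CI run.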
/jira refresh
@dgoodwin: This pull request references Jira Issue OCPBUGS-55755, which is valid. 3 validations were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/pj-rehearse ack

@dgoodwin: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dgoodwin, petr-muller. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@dgoodwin: Jira Issue OCPBUGS-55755: Some pull requests linked via external trackers have merged. The following pull request, linked via external tracker, has not merged:
All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh. Jira Issue OCPBUGS-55755 has not been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Fix included in accepted release 4.21.0-0.nightly-2025-11-03-191704
Data will be charted and analyzed to help improve the alert thresholds we ship with, which have been set too high.
I suspect the upstream thresholds are unreasonably low, perhaps reflecting an optimal case by their standards; from our perspective, our CI may be proving that clusters can operate healthily above those values. The goal is to find thresholds we know are safe, analyze fleet data for the same, and combine both to determine what to do.