@anuragthehatter
Contributor

anuragthehatter commented Sep 16, 2025

Make the FDP IPv4 QE jobs run by default and set TIMEOUT to 60 min (agreed with the OCP QE infra maintainers) to accommodate the IPsec enable/disable/test use case, which takes ~50 min and is currently causing interruptions in recent FDP workflows.

#68237 got messed up by a bulk merge and force pushes, sorry. Opened this new, clean one.

cc @martinkennelly @jluhrsen

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci bot requested review from jcaamano and kyrtapz on September 16, 2025 02:59
@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.20-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@deepsm007
Contributor

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.20-e2e-aws-ovn-fdp-qe pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@deepsm007: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.20-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@jluhrsen
Contributor

/lgtm

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Sep 19, 2025
@asood-rh
Contributor

@anuragthehatter Do we need @martinkennelly's approval for this PR to merge? There is not much code change apart from the flags and the timeout.

@martinkennelly
Contributor

@anuragthehatter how many times have you executed this job and what's the pass rate? Thanks

@anuragthehatter
Contributor Author

anuragthehatter commented Sep 22, 2025

> @anuragthehatter how many times have you executed this job and what's the pass rate? Thanks

Executed 3 times in the rehearsals above, on different releases. It's 100% on all 3 of those jobs.
The historical pass rate for this trigger is almost 100%; I would say 9/10 based on our past QE triggers. Hence we want to make it default, as discussed in past team meetings. It also helps catch issues early in FDP merges and default OVN dev PRs.

@martinkennelly
Contributor

/lgtm

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 22, 2025
@martinkennelly
Contributor

> > @anuragthehatter how many times have you executed this job and what's the pass rate? Thanks
>
> Executed 3 times in the rehearsals above, on different releases. It's 100% on all 3 of those jobs. The historical pass rate for this trigger is almost 100%; I would say 9/10 based on our past QE triggers. Hence we want to make it default, as discussed in past team meetings. It also helps catch issues early in FDP merges and default OVN dev PRs.

I think there's a command to run it many times to look for flakes. The reason I ask is that I saw it fail recently on a d/s merge, and it was a flake. I trust your opinion here, and if we are wrong, we can revert this PR.
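A rough sketch of why a handful of green rehearsals says little about flakiness. It assumes runs are independent and takes the 9/10 pass rate quoted above at face value; the function names are hypothetical and not part of any CI tooling.

```python
def prob_all_pass(pass_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return pass_rate ** runs


def prob_at_least_one_failure(pass_rate: float, runs: int) -> float:
    """Probability of seeing one or more failures in `runs` independent runs."""
    return 1.0 - prob_all_pass(pass_rate, runs)


# Assumed pass rate of 0.9 (the "9/10" estimate from this thread).
for runs in (3, 10):
    print(runs, round(prob_all_pass(0.9, runs), 3))
```

Under these assumptions, three runs all pass about 73% of the time, so a clean 3/3 is quite likely even for a job that flakes one time in ten; ten runs all pass only about 35% of the time, which is why a larger run count is more informative.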

@anuragthehatter
Contributor Author

anuragthehatter commented Sep 22, 2025

@martinkennelly Yes. That d/s merge hit the IPsec use case timeout issue. That use case usually takes 45-52 minutes across various platforms, hence this PR sets TEST_TIMEOUT: "60", which has been tested and accommodates it.
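For a sense of the margin this leaves, here is a minimal sketch. Only the 45-52 minute range and the 60 minute timeout come from this thread; the helper name is hypothetical.

```python
def timeout_headroom(timeout_min: int, observed_min: list[int]) -> int:
    """Minutes left between the slowest observed run and the timeout."""
    if not observed_min:
        raise ValueError("need at least one observed duration")
    return timeout_min - max(observed_min)


# Slowest reported IPsec run (52 min) against the 60 min timeout.
print(timeout_headroom(60, [45, 52]))
```

A negative result would mean the slowest observed run already exceeds the timeout.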

@asood-rh
Contributor

4.21: a node went missing. I think the AWS cluster could be using spot instances.

SDN: OCP-40908:SDN Allow from hostnetwork policy to allow traffic from hostnetwork pods on LoadBalancerService endpoint strategy cluster (10s)
{failed OCP-40908:SDN Allow from hostnetwork policy to allow traffic from hostnetwork pods on LoadBalancerService endpoint strategy cluster failed 
        Scenario: OCP-40908:SDN Allow from hostnetwork policy to allow traffic from hostnetwork pods on LoadBalancerService endpoint strategy cluster

Given I store the ready and schedulable nodes in the :nodes clipboard

Message:

        nodes 'ip-10-0-25-232.us-east-2.compute.internal' not found (BushSlicer::ResourceNotFoundError)
./lib/openshift/resource.rb:66:in `get_checked'
./lib/openshift/resource.rb:130:in `get_cached_prop'
./lib/openshift/node.rb:110:in `ready?'
./features/step_definitions/node.rb:76:in `block (2 levels) in '
./features/step_definitions/node.rb:76:in `select'
./features/step_definitions/node.rb:76:in `/^I store the( schedulable| ready and schedulable)?( windows)? (node|master|worker)s in the(?: :(\S+))? clipboard(?: excluding "(.+?)")?$/'
features/tierN/networking/network-policy.feature:2578:in `I store the ready and schedulable nodes in the :nodes clipboard'

@anuragthehatter
Contributor Author

> 4.21: a node went missing. I think the AWS cluster could be using spot instances.
>
> SDN: OCP-40908:SDN Allow from hostnetwork policy to allow traffic from hostnetwork pods on LoadBalancerService endpoint strategy cluster (10s)
> {failed OCP-40908:SDN Allow from hostnetwork policy to allow traffic from hostnetwork pods on LoadBalancerService endpoint strategy cluster failed
> Scenario: OCP-40908:SDN Allow from hostnetwork policy to allow traffic from hostnetwork pods on LoadBalancerService endpoint strategy cluster
>
> Given I store the ready and schedulable nodes in the :nodes clipboard
>
> Message:
>
> nodes 'ip-10-0-25-232.us-east-2.compute.internal' not found (BushSlicer::ResourceNotFoundError)
> ./lib/openshift/resource.rb:66:in `get_checked'
> ./lib/openshift/resource.rb:130:in `get_cached_prop'
> ./lib/openshift/node.rb:110:in `ready?'
> ./features/step_definitions/node.rb:76:in `block (2 levels) in '
> ./features/step_definitions/node.rb:76:in `select'
> ./features/step_definitions/node.rb:76:in `/^I store the( schedulable| ready and schedulable)?( windows)? (node|master|worker)s in the(?: :(\S+))? clipboard(?: excluding "(.+?)")?$/'
> features/tierN/networking/network-policy.feature:2578:in `I store the ready and schedulable nodes in the :nodes clipboard'

Hmm, the hostnetwork pod use cases are also failing due to the OTP migration; those 3 failed cases need compat_otp.SetNamespacePrivileged(oc, ns).

@martinkennelly
Contributor

I'll be away for a week, but can you folks handle any fixes and retry here? Since I found issues on the first run, ye may have to run this job many times to shake out flakes. Unfortunately, there seems to be no aggregate for rehearsals... idk why; it's needed. I looked at the docs for rehearsal commands and didn't find any aggregate :/ @jluhrsen do you know any good command to run the jobs at each release, main-3, 10 times?

@anuragthehatter
Contributor Author

> I'll be away for a week, but can you folks handle any fixes and retry here? Since I found issues on the first run, ye may have to run this job many times to shake out flakes. Unfortunately, there seems to be no aggregate for rehearsals... idk why; it's needed. I looked at the docs for rehearsal commands and didn't find any aggregate :/ @jluhrsen do you know any good command to run the jobs at each release, main-3, 10 times?

Sure @martinkennelly, yep. I figured our OTE migration has again impacted a few more cases. Will fix them and have a PR up soon: https://github.com/openshift/openshift-tests-private/pull/27809. Once that merges, we can rehearse again.

A few points to note:

  • We can ignore the 4.22 errors, as that branch is not yet ready on the QE side with the right configs or agents. QE infra will fix that when we get closer to 4.22.
  • For 4.21, I have seen a few failures due to the OTE migration in QE. I have a test PR for it: https://github.com/openshift/openshift-tests-private/pull/27809
  • For 4.17, cluster creation failed at the payload image step, so no worry there.
  • All other versions passed.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.21-e2e-aws-ovn-fdp-qe
/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.17-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot
Contributor

@anuragthehatter: requesting more than one rehearsal in one comment is not supported. If you would like to rehearse multiple specific jobs, please separate the job names by a space in a single command.
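As the bot notes, several jobs go into one command, separated by spaces. A minimal sketch of assembling such a comment follows; the helper name is hypothetical and only the `/pj-rehearse {test-name} {test-name}` form is taken from this thread.

```python
def build_rehearse_comment(job_names):
    """Join job names into a single /pj-rehearse comment line."""
    jobs = []
    for name in job_names:
        name = name.strip()
        if not name:
            continue  # skip empty entries
        if any(ch.isspace() for ch in name):
            raise ValueError(f"job name must not contain whitespace: {name!r}")
        if name not in jobs:
            jobs.append(name)  # drop duplicates, keep first-seen order
    if not jobs:
        raise ValueError("at least one job name is required")
    return "/pj-rehearse " + " ".join(jobs)


print(build_rehearse_comment([
    "pull-ci-openshift-ovn-kubernetes-release-4.21-e2e-aws-ovn-fdp-qe",
    "pull-ci-openshift-ovn-kubernetes-release-4.17-e2e-aws-ovn-fdp-qe",
]))
```

The printed line is what would be posted as a single PR comment, instead of two `/pj-rehearse` lines in one comment.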

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.17-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.17-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.21-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/test generated-config

openshift-ci-robot removed the rehearsals-ack label (signifies that rehearsal jobs have been acknowledged) on Oct 24, 2025
@anuragthehatter
Contributor Author

/pj-rehearse more

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.22-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.21-e2e-aws-ovn-fdp-qe

@openshift-ci-robot
Contributor

@anuragthehatter: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@anuragthehatter
Contributor Author

@martinkennelly Based on the various changes and commits over the last several weeks, I am confident that we have achieved almost 100% stability across releases. Unfortunately, we had OTE library migrations that caused flakes, and an upstream migration will also happen in the near future, which might cause flakes again; the ERT team, along with QE, will step in to fix those if needed.

From a stability POV, we have achieved consistent results, as tested here. Env install failures will always be out of our control :)

Please note: we can ignore the 4.22 runs at the moment. QE sets the Polarion tags and creates the agents needed for future releases when we officially enter that release, so 4.22 doesn't have the backend QE infra ready to run tests properly at the moment.

Let me know if you have any comments; otherwise it should be good to merge now. Also, if we see any flakes in the future, they will be tracked via Tracker.

Resolved conflicts in ovn-kubernetes config files by accepting PR changes
to make FDP IPv4 QE jobs default (always_run: true) with 60 min timeout.
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@anuragthehatter: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.14-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.15-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.16-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.17-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.18-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.20-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.21-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.22-e2e-aws-ovn-fdp-qe | openshift/ovn-kubernetes | presubmit | Presubmit changed |
| pull-ci-openshift-ovn-kubernetes-release-4.22-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.20-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.18-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.15-e2e-aws-live-migration-sdn-ovn-rollback | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.21-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.19-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.16-e2e-aws-ovn-local-gateway | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |
| pull-ci-openshift-ovn-kubernetes-release-4.17-e2e-aws-ovn | openshift/ovn-kubernetes | presubmit | Ci-operator config changed |

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci
Contributor

openshift-ci bot commented Nov 28, 2025

@anuragthehatter: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/openshift/ovn-kubernetes/release-4.18/e2e-aws-ovn-fdp-qe | 11f503d | link | unknown | /pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.18-e2e-aws-ovn-fdp-qe |
| ci/rehearse/openshift/ovn-kubernetes/release-4.22/e2e-aws-ovn-fdp-qe | 11f503d | link | unknown | /pj-rehearse pull-ci-openshift-ovn-kubernetes-release-4.22-e2e-aws-ovn-fdp-qe |
| ci/prow/generated-config | 135cd26 | link | true | /test generated-config |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@anuragthehatter
Contributor Author

Opened #71975

Labels

do-not-merge/hold: indicates that a PR should not merge because someone has issued a /hold command.
