Skip to content

Aborting forever and untriggerd on_failure node when aborting failure caused by UserError#6749

Open
0yukali0 wants to merge 1 commit into
flyteorg:masterfrom
0yukali0:6695
Open

Aborting forever and untriggerd on_failure node when aborting failure caused by UserError#6749
0yukali0 wants to merge 1 commit into
flyteorg:masterfrom
0yukali0:6695

Conversation

@0yukali0

Copy link
Copy Markdown
Contributor

Tracking issue

#6695

Why are the changes needed?

  1. Solving Aborting forever when failure of aborting happens
  2. Set on failure function in failing state node when aborting fails

What changes were proposed in this pull request?

  1. Node state transfer to failing when an user error happen in Aborting and finalizing steps.
  2. Exceeding max retries make the workflow state failed in order to trigger on failure node.

How was this patch tested?

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@flyte-bot

Copy link
Copy Markdown
Collaborator

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.
You can change draft PR review settings here, or contact your Bito workspace admin at eduardo@union.ai.

@0yukali0 0yukali0 changed the title 6695 Aborting forever and untriggerd on_failure node when aborting failed caused by UserError Nov 19, 2025
@0yukali0 0yukali0 changed the title Aborting forever and untriggerd on_failure node when aborting failed caused by UserError Aborting forever and untriggerd on_failure node when aborting failure caused by UserError Nov 19, 2025
@codecov

codecov Bot commented Nov 19, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.26%. Comparing base (de65211) to head (144cb78).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6749      +/-   ##
==========================================
+ Coverage   57.16%   58.26%   +1.10%     
==========================================
  Files         925      710     -215     
  Lines       57823    42362   -15461     
==========================================
- Hits        33052    24682    -8370     
+ Misses      21753    15441    -6312     
+ Partials     3018     2239     -779     
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.15% <ø> (+0.03%) ⬆️
unittests-flytecopilot 43.06% <ø> (?)
unittests-flytectl 64.09% <ø> (ø)
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.36% <ø> (ø)
unittests-flytepropeller ?
unittests-flytestdlib 62.78% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@davidmirror-ops

Copy link
Copy Markdown
Contributor

@0yukali0 thanks for contributing. Any help you need?

@0yukali0

0yukali0 commented Dec 4, 2025

Copy link
Copy Markdown
Contributor Author

@davidmirror-ops
Thank you for your concern.
I am updating tests and tracing the failed tests recently.
Some failure are not in my expect, so i am going figure out where result in problems.

Comment on lines +107 to +130
if mutableW.Status.FailedAttempts == maxRetries {
var err error
_ = controllerutil.AddFinalizer(mutableW, Finalizer)
SetDefinitionVersionIfEmpty(mutableW, v1alpha1.LatestWorkflowDefinitionVersion)

func() {
t := p.metrics.RawWorkflowTraversalTime.Start(ctx)
defer func() {
t.Stop()
if r := recover(); r != nil {
stack := debug.Stack()
err = fmt.Errorf("panic when reconciling workflow, Stack: [%s]", string(stack))
logger.Errorf(ctx, err.Error())
p.metrics.PanicObserved.Inc(ctx)
}
}()
err = p.workflowExecutor.HandleFlyteWorkflow(ctx, mutableW)
}()
if err != nil {
logger.Errorf(ctx, "Error when trying to reconcile workflow. Error [%v]. Error Type[%v]",
err, reflect.TypeOf(err))
p.metrics.SystemError.Inc(ctx)
return nil, err
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind explain a bit why we need this here?

Signed-off-by: Yuteng Chen <a08h0283@gmail.com>
@github-actions github-actions Bot added the flyte label Jun 19, 2026
@0yukali0 0yukali0 marked this pull request as ready for review June 19, 2026 14:54
@0yukali0

Copy link
Copy Markdown
Contributor Author

UserCodeError triggers the On_failure node and enter to the failed state.
Screenshot From 2026-06-19 22-47-40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants