[One Workflow] fix: http step error handling lacks structured error data #243395

skynetigor · 2025-11-18T17:10:52Z

Summary

closes: https://github.com/elastic/security-team/issues/14737

🔧 HTTP Step Error Handling Enhancement

Structured Error Response

Changed error format from string to structured object containing:
- type: Error type classification (e.g., HttpRequestError, ConnectionRefused, HttpRequestCancelledError)
- message: Human-readable error message
- details: Additional context (status code, headers, response body when available)

HTTP Step Improvements

Introduced mapAxiosError() method to transform Axios errors into structured execution errors
HTTP errors now preserve:
- Status code and status text from HTTP responses
- Response headers
- Response body/data
- Request configuration for debugging
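A minimal sketch of what such a mapping could look like, not the PR's actual `mapAxiosError()` implementation. The `AxiosLikeError` interface is a hypothetical structural stand-in for the Axios error fields used here, so the example runs without importing axios. How error codes map to the error types listed above is an assumption.

```typescript
// Shape described in this PR; the exact field types are assumptions.
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

// Hypothetical stand-in for the subset of an Axios error used in this sketch.
interface AxiosLikeError {
  message: string;
  code?: string;
  config?: { url?: string; method?: string };
  response?: {
    status: number;
    statusText: string;
    headers: Record<string, string>;
    data: unknown;
  };
}

function mapAxiosError(error: AxiosLikeError): ExecutionError {
  // Assumption: cancelled requests surface with the ERR_CANCELED code.
  if (error.code === 'ERR_CANCELED') {
    return { type: 'HttpRequestCancelledError', message: error.message };
  }
  // Assumption: refused connections surface with the ECONNREFUSED code.
  if (error.code === 'ECONNREFUSED') {
    return {
      type: 'ConnectionRefused',
      message: error.message,
      details: { config: error.config },
    };
  }
  // Everything else keeps whatever response data is available.
  return {
    type: 'HttpRequestError',
    message: error.message,
    details: {
      status: error.response?.status,
      statusText: error.response?.statusText,
      headers: error.response?.headers,
      data: error.response?.data,
      config: error.config,
    },
  };
}
```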

Error Type Schema

Added ExecutionError schema with fields: type, message, details
Updated EsWorkflowExecution.error and EsWorkflowStepExecution.error types from string | null to ExecutionError | null
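For illustration, the type change could be pictured roughly as follows. The field names come from the description above; the exact field types and the `isExecutionError` guard are assumptions added for this sketch and are not taken from the PR.

```typescript
// Hypothetical mirror of the ExecutionError shape (type, message, details).
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

// Assumed helper: a runtime guard for places where a stored value
// might still be a legacy plain-string error.
function isExecutionError(value: unknown): value is ExecutionError {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return typeof candidate.type === 'string' && typeof candidate.message === 'string';
}

// Before this PR the field was `string | null`; now it is `ExecutionError | null`.
const stepError: ExecutionError | null = {
  type: 'HttpRequestError',
  message: 'Request failed with status code 503',
  details: { status: 503, statusText: 'Service Unavailable' },
};
```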

⚙️ Error Handling Infrastructure

Error Mapping Utility

Created mapError() utility function to standardize error conversion
Handles JavaScript Error objects, string errors, and existing ExecutionError objects
Used throughout the codebase for consistent error handling
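The three input cases listed above could be handled along these lines. This is a sketch under stated assumptions, not the utility from the PR: the fallback type names `'Error'` and `'UnknownError'` and the use of `error.name` as the type are illustrative choices.

```typescript
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

function isExecutionError(value: unknown): value is ExecutionError {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return typeof candidate.type === 'string' && typeof candidate.message === 'string';
}

function mapError(error: unknown): ExecutionError {
  // JavaScript Error objects: use the error name as the type (assumption).
  if (error instanceof Error) {
    return { type: error.name, message: error.message };
  }
  // Already structured: pass through unchanged.
  if (isExecutionError(error)) {
    return error;
  }
  // Plain string errors: wrap with a generic type (assumption).
  if (typeof error === 'string') {
    return { type: 'Error', message: error };
  }
  // Anything else: stringify so no information is silently dropped.
  return { type: 'UnknownError', message: String(error) };
}
```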

Conditional Error Recovery

Retry logic now supports conditional retries:
- Added condition field to WorkflowRetrySchema (KQL expression, Liquid expression, or boolean)
- Retry only executes when condition evaluates to true
- Example: condition: "${{error.type == 'NetworkError'}}"
Continue on failure now supports conditions:
- continue field accepts a boolean, or a Liquid or KQL expression string
- Workflow continues only if condition is met
- Stores condition in graph node configuration

Context Manager Enhancement

Added evaluateBooleanExpressionInContext() method for evaluating conditional expressions
Supports KQL expressions, Liquid expressions, and boolean values
Error objects accessible in fallback scope via steps.<stepName>.error
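One way to picture the dispatch between the three condition kinds is sketched below. It is not the method added in this PR. The `Evaluators` interface is hypothetical and stands in for the real Liquid and KQL engines, the `${{ ... }}` detection rule is inferred from the example condition shown earlier, and treating a missing condition as `true` is an assumption.

```typescript
type Condition = boolean | string;
type Context = Record<string, unknown>;

// Hypothetical injection point for the real template and KQL engines.
interface Evaluators {
  template: (expression: string, context: Context) => boolean;
  kql: (expression: string, context: Context) => boolean;
}

// Matches a string fully wrapped in ${{ ... }} and captures the inner expression.
const TEMPLATE_PATTERN = /^\$\{\{([\s\S]*)\}\}$/;

function evaluateBooleanExpression(
  condition: Condition | undefined,
  context: Context,
  evaluators: Evaluators
): boolean {
  // Assumption: no condition means the recovery action always applies.
  if (condition === undefined) {
    return true;
  }
  if (typeof condition === 'boolean') {
    return condition;
  }
  const match = TEMPLATE_PATTERN.exec(condition.trim());
  if (match) {
    return evaluators.template((match[1] ?? '').trim(), context);
  }
  // Assumption: any other string is treated as a KQL expression.
  return evaluators.kql(condition, context);
}
```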

🔄 Error Propagation Improvements

Node Error Catching Interface

Updated NodeWithErrorCatching.catchError() signature to accept StepExecutionRuntime parameter
Enables error-catching nodes to inspect failed step context before deciding recovery action
Used by retry and continue nodes to evaluate conditional recovery

Retry Behavior

Retry node now fails with the original step error once the maximum number of attempts is reached (instead of a generic "max attempts exceeded" message)
Condition evaluation occurs before each retry attempt
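The two behaviors above can be sketched as a simplified, synchronous loop. This is an illustration only: the real engine runs steps asynchronously through graph nodes, and `StepResult`, `RetryOptions`, `maxRetries`, and `shouldRetry` are hypothetical names. Counting `maxRetries` as retries after the first attempt is also an assumption.

```typescript
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

type StepResult<T> = { ok: true; value: T } | { ok: false; error: ExecutionError };

interface RetryOptions {
  // Number of retries allowed after the first attempt (assumption).
  maxRetries: number;
  // Optional condition, checked against the latest error before each retry.
  shouldRetry?: (error: ExecutionError) => boolean;
}

function runWithRetry<T>(step: () => StepResult<T>, options: RetryOptions): StepResult<T> {
  let result = step();
  let retries = 0;
  while (!result.ok && retries < options.maxRetries) {
    // The condition is evaluated before each retry attempt; a false result stops retrying.
    if (options.shouldRetry && !options.shouldRetry(result.error)) {
      break;
    }
    retries += 1;
    result = step();
  }
  // On exhaustion the step's own error is returned unchanged,
  // rather than being replaced by a generic "max attempts exceeded" error.
  return result;
}
```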

Workflow Context

Fallback steps can now access error information via context
Error available as steps.<stepName>.error in fallback scope
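As a rough picture of what a fallback step sees, the context might be shaped like this. The `StepState` and `WorkflowContext` interfaces, the `getStepError` helper, and the step name `fetchData` are hypothetical; only the `steps.<stepName>.error` path comes from the description above.

```typescript
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

// Hypothetical per-step state; only the `error` field is taken from the PR description.
interface StepState {
  output?: unknown;
  error?: ExecutionError | null;
}

interface WorkflowContext {
  steps: Record<string, StepState>;
}

// Reads steps.<stepName>.error, returning null when the step is unknown or did not fail.
function getStepError(context: WorkflowContext, stepName: string): ExecutionError | null {
  return context.steps[stepName]?.error ?? null;
}

// Example context as a fallback step might see it after `fetchData` failed.
const fallbackContext: WorkflowContext = {
  steps: {
    fetchData: {
      error: {
        type: 'HttpRequestError',
        message: 'Request failed with status code 500',
        details: { status: 500 },
      },
    },
  },
};
```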

📝 Logging & Observability

Workflow event logger updated to handle ExecutionError objects
Error details (type, message, details) logged consistently across all error types
HTTP request errors tagged with error type for better filtering

✅ Test Coverage

Updated all integration tests to expect structured error objects
Added test coverage for conditional retry and continue scenarios
Tests verify error details are preserved and accessible in fallback steps

Checklist

Check the PR satisfies following conditions.

Reviewers should verify this PR satisfies this list as well.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
Flaky Test Runner was used on any tests changed
The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
Review the backport guidelines and apply applicable backport:* labels.

Identify risks

Does this PR introduce any risks? For example, consider risks like hard to test bugs, performance regression, potential of data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

See some risk examples
...

Copilot

Pull Request Overview

This PR enhances error handling in the One Workflow execution engine by replacing string-based error storage with structured ExecutionError objects. The changes enable better error introspection and conditional error handling in retry and continue steps.

Key Changes:

Introduced ExecutionError schema with type, message, and optional details fields
Added conditional error handling support for retry and continue steps via KQL expressions
Enhanced HTTP step error handling to include detailed response information
Updated all error-related code paths to use structured error objects

Reviewed Changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 5 comments.

File	Description
`src/platform/packages/shared/kbn-workflows/spec/schema.ts`	Added ExecutionError schema and condition field to retry/continue configurations
`src/platform/packages/shared/kbn-workflows/types/v1.ts`	Updated error fields to use ExecutionError type
`src/platform/plugins/shared/workflows_execution_engine/server/utils/map_error/map_error.ts`	New utility to convert errors to ExecutionError format
`src/platform/plugins/shared/workflows_execution_engine/server/step/http_step/http_step_impl.ts`	Enhanced HTTP error handling with structured error details
`src/platform/plugins/shared/workflows_execution_engine/server/step/on_failure/retry_step/enter_retry_node_impl.ts`	Added conditional retry logic based on error properties
`src/platform/plugins/shared/workflows_execution_engine/server/step/on_failure/continue_step/enter_continue_node_impl.ts`	Added conditional continue logic based on error properties
`src/platform/plugins/shared/workflows_execution_engine/server/workflow_context_manager/workflow_context_manager.ts`	Added evaluateBooleanExpressionInContext method and error context injection
Multiple test files	Updated assertions to match new structured error format

Comments suppressed due to low confidence (1)

src/platform/plugins/shared/workflows_execution_engine/integration_tests/tests/on_failure_continue.test.ts:1

The comment is misleading in the context of the continue test. Since the condition is not met and retries don't occur, the duration comment doesn't accurately describe this test case. The comment should reflect that no retries are expected here.

elasticmachine · 2025-11-20T13:31:55Z

💔 Build Failed

Failed CI Steps

Test Failures

[job] [logs] Jest Tests #18 / on_failure graph {
name: 'step level on-failure should override workflow level on-failure',
fallbackActionNodeId: 'fallbackAction',
workflow: [Object]
} should configure continue node correctly
[job] [logs] Jest Tests #18 / on_failure graph {
name: 'step level on-failure',
fallbackActionNodeId: 'fallbackAction',
workflow: [Object]
} should configure continue node correctly
[job] [logs] Jest Tests #18 / on_failure graph {
name: 'workflow level on-failure',
fallbackActionNodeId: 'workflow-level-on-failure_testRetryConnectorStep_fallbackAction',
workflow: [Object]
} should configure continue node correctly
[job] [logs] FTR Configs #71 / Serverless Observability feature flag testing - Deployment-agnostic Observability Agent API integration tests Observability Agent tool: observability.get_anomaly_detection_jobs "after all" hook for "returns job without anomalies when time range excludes them"
[job] [logs] FTR Configs #71 / Serverless Observability feature flag testing - Deployment-agnostic Observability Agent API integration tests Observability Agent tool: observability.get_anomaly_detection_jobs "before all" hook for "filters by specific job ID"

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`@kbn/workflows`	335	337	+2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id	before	after	diff
`workflowsManagement`	2.1MB	2.1MB	+179.0B

Unknown metric groups

API count

id	before	after	diff
`@kbn/workflows`	365	367	+2

ESLint disabled line counts

id	before	after	diff
`workflowsExecutionEngine`	40	39	-1

Total ESLint disabled count

id	before	after	diff
`workflowsExecutionEngine`	54	53	-1

History

semd and others added 30 commits November 12, 2025 09:56

make cancelation great again

541d95e

Merge branch '14649/show_execution_cancelation_fix' into 14316-Single…

2f6b23e

…-step-execution-displays-skeletons-for-not-executed-steps

in case of single step execution display skeletons only for being exe…

d52d7cf

…cuted step

Merge branch 'main' into 14316-Single-step-execution-displays-skeleto…

6c01c3c

…ns-for-not-executed-steps

Merge branch 'main' into 14316-Single-step-execution-displays-skeleto…

a152298

…ns-for-not-executed-steps

add unit-tests

da151ee

Merge branch 'main' into 14316-Single-step-execution-displays-skeleto…

4cc40b8

…ns-for-not-executed-steps

temp

cb6dc64

Merge branch 'main' into 14709-Workflow-having-wait-step-with-long-du…

95222e0

…ration-will-cancel-only-when-delay-resolves

fixes

dfbc29c

fixes

93cca7b

temp

a73eeb8

fix bugs

3a1a25f

fixes

c0892c8

fixes

21ff0c3

fixes

e68f7bf

Merge branch 'main' into 14709-Workflow-having-wait-step-with-long-du…

3f7a2d6

…ration-will-cancel-only-when-delay-resolves

Update plugin.ts

0868cc7

add abortable timeout

332b6d8

fixes

69fa57d

Update plugin.ts

9b8fe2a

fixes

9b78019

refactor

3f3813e

Update run_node.ts

0a9d1b4

Update process_node_stack_monitoring.ts

28ef8d6

fixes

7d107fc

Update plugin.ts

44b95f1

refactor

f25ea22

fix tests for wait step

3ded572

fix tests

15c5f28

skynetigor added 6 commits November 18, 2025 15:41

fixes

8c65d1e

add continue condition support

48b8f71

add retry condition support

9dbdb96

Merge branch '14709-Workflow-having-wait-step-with-long-duration-will…

0be6b29

…-cancel-only-when-delay-resolves' into 14737-HTTP-Step-Error-Handling-Lacks-Structured-Error-Data

add condition support for retry

9ac3ef3

Merge branch 'main' into 14737-HTTP-Step-Error-Handling-Lacks-Structu…

9841753

…red-Error-Data

skynetigor added the release_note:skip, backport:skip, and Team:One Workflow labels Nov 18, 2025

skynetigor added 13 commits November 18, 2025 18:12

Update build_execution_graph.ts

6991572

Update catch_error.ts

330d8ff

fixes

101a05e

fixe integ tests

b69d76d

add test case to check that error is available from step context

89d4b36

add testcase for continue expression

6c47430

Update on_failure_retry.test.ts

bc5e081

Update on_failure_retry.test.ts

19c888f

add evaluateBooleanExpressionInContext

d60d1e5

Merge branch 'elastic:main' into 14737-HTTP-Step-Error-Handling-Lacks…

e7f5d73

…-Structured-Error-Data

add error to global context for steps inside fallback scope

b036c6e

Merge branch '14737-HTTP-Step-Error-Handling-Lacks-Structured-Error-D…

be95e2d

…ata' of https://github.com/skynetigor/kibana into 14737-HTTP-Step-Error-Handling-Lacks-Structured-Error-Data

fixes

b9b8d6e

skynetigor requested a review from Copilot November 20, 2025 11:19

Copilot AI reviewed Nov 20, 2025

skynetigor added 2 commits November 20, 2025 12:37

fixes

3e1e79e

Merge branch 'main' into 14737-HTTP-Step-Error-Handling-Lacks-Structu…

4b961db

…red-Error-Data

skynetigor marked this pull request as ready for review November 20, 2025 11:45

skynetigor requested a review from a team as a code owner November 20, 2025 11:45

Changes from node scripts/eslint_all_files --no-cache --fix

3db7267
