@skynetigor commented Nov 18, 2025

Summary

closes: https://github.com/elastic/security-team/issues/14737

🔧 HTTP Step Error Handling Enhancement

Structured Error Response

  • Changed error format from string to structured object containing:
    • type: Error type classification (e.g., HttpRequestError, ConnectionRefused, HttpRequestCancelledError)
    • message: Human-readable error message
    • details: Additional context (status code, headers, response body when available)
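
As a rough illustration of the change, the structured error could look like the sketch below. The interface is inferred from the field list above and the sample values are invented; the actual schema in the PR may differ.

```typescript
// Hypothetical shape inferred from the bullet list above; not copied from the PR.
interface ExecutionError {
  type: string; // error classification, e.g. 'HttpRequestError'
  message: string; // human-readable description
  details?: Record<string, unknown>; // optional context such as status code or headers
}

// Before: the error was stored as a plain string.
const legacyError: string | null = 'Request failed with status code 404';

// After: the same failure as a structured object (values are illustrative).
const structuredError: ExecutionError | null = {
  type: 'HttpRequestError',
  message: 'Request failed with status code 404',
  details: { status: 404, statusText: 'Not Found' },
};

// Consumers can branch on the type instead of parsing the message text.
function isHttpError(error: ExecutionError | null): boolean {
  return error !== null && error.type === 'HttpRequestError';
}
```

Branching on `type` is what makes the conditional retry and continue features described later possible.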

HTTP Step Improvements

  • Introduced mapAxiosError() method to transform Axios errors into structured execution errors
  • HTTP errors now preserve:
    • Status code and status text from HTTP responses
    • Response headers
    • Response body/data
    • Request configuration for debugging
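
A minimal sketch of what such a mapping might look like follows. `AxiosLikeError` is a hypothetical stand-in for the subset of Axios error fields used here, so the block runs without the library. The mapping from error codes to the type names listed earlier is an assumption, not the PR's actual logic.

```typescript
// Hypothetical shape; see the ExecutionError schema described in this PR.
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

// Stand-in for the Axios error fields this sketch reads (assumption: a subset only).
interface AxiosLikeError {
  message: string;
  code?: string;
  config?: { url?: string; method?: string };
  response?: {
    status: number;
    statusText: string;
    headers: Record<string, string>;
    data: unknown;
  };
}

function mapAxiosError(error: AxiosLikeError): ExecutionError {
  // Assumed mapping: a cancelled request carries the code 'ERR_CANCELED'.
  if (error.code === 'ERR_CANCELED') {
    return { type: 'HttpRequestCancelledError', message: error.message };
  }
  // Assumed mapping: a refused connection carries the Node.js code 'ECONNREFUSED'.
  if (error.code === 'ECONNREFUSED') {
    return {
      type: 'ConnectionRefused',
      message: error.message,
      details: { config: error.config },
    };
  }
  // Everything else is treated as a generic HTTP request error.
  const details: Record<string, unknown> = { config: error.config };
  if (error.response) {
    details.status = error.response.status;
    details.statusText = error.response.statusText;
    details.headers = error.response.headers;
    details.data = error.response.data;
  }
  return { type: 'HttpRequestError', message: error.message, details };
}
```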

Error Type Schema

  • Added ExecutionError schema with fields: type, message, details
  • Updated EsWorkflowExecution.error and EsWorkflowStepExecution.error types from string | null to ExecutionError | null

⚙️ Error Handling Infrastructure

Error Mapping Utility

  • Created mapError() utility function to standardize error conversion
  • Handles JavaScript Error objects, string errors, and existing ExecutionError objects
  • Used throughout the codebase for consistent error handling
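
The three input kinds above could be normalized roughly as follows. This is a sketch under assumptions: the fallback type name `'Error'`, the `isExecutionError` helper, and the use of `error.name` as the type are illustrative choices, not the PR's implementation.

```typescript
// Hypothetical shape; see the ExecutionError schema described in this PR.
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

// Hypothetical type guard: treats any object with string `type` and `message` as an ExecutionError.
function isExecutionError(value: unknown): value is ExecutionError {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  return typeof candidate.type === 'string' && typeof candidate.message === 'string';
}

function mapError(error: unknown): ExecutionError {
  // JavaScript Error objects: use the error name as the type (assumption).
  if (error instanceof Error) {
    return { type: error.name, message: error.message };
  }
  // Plain strings: wrap them with a generic type (assumption).
  if (typeof error === 'string') {
    return { type: 'Error', message: error };
  }
  // Already structured: pass through unchanged.
  if (isExecutionError(error)) {
    return error;
  }
  // Anything else: stringify so a message is never lost.
  return { type: 'Error', message: String(error) };
}
```

Passing existing `ExecutionError` objects through unchanged keeps the function safe to call more than once on the same error.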

Conditional Error Recovery

  • Retry logic now supports conditional retries:

    • Added condition field to WorkflowRetrySchema (KQL expression, Liquid expression, or boolean)
    • Retry only executes when condition evaluates to true
    • Example: condition: "${{error.type == 'NetworkError'}}"
  • Continue on failure now supports conditions:

    • continue field accepts a boolean or a LiquidJS or KQL expression string
    • Workflow continues only if condition is met
    • Stores condition in graph node configuration

Context Manager Enhancement

  • Added evaluateBooleanExpressionInContext() method for evaluating conditional expressions
  • Supports KQL expressions, LiquidJS expressions, and boolean values
  • Error objects accessible in fallback scope via steps.<stepName>.error
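
To show the general idea, here is a toy sketch of evaluating a condition against a context. It is not the PR's implementation: real KQL and LiquidJS evaluation is replaced by a tiny stand-in that only understands the form `${{ path == 'literal' }}`, and the context layout and step name `fetch` are assumptions.

```typescript
type Condition = boolean | string;
type Context = Record<string, unknown>;

// Walk a dotted path such as 'steps.fetch.error.type' through nested objects.
function resolvePath(context: Context, path: string): unknown {
  let current: unknown = context;
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) {
      return undefined;
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function evaluateBooleanExpressionInContext(condition: Condition, context: Context): boolean {
  // Boolean values are returned as-is.
  if (typeof condition === 'boolean') {
    return condition;
  }
  // Toy stand-in for KQL/LiquidJS: supports only `${{ path == 'literal' }}`.
  const match = /^\$\{\{\s*([\w.]+)\s*==\s*'([^']*)'\s*\}\}$/.exec(condition.trim());
  if (!match) {
    throw new Error(`Unsupported expression: ${condition}`);
  }
  const [, path, literal] = match;
  return resolvePath(context, path) === literal;
}

// Illustrative context: the failed step's error exposed under steps.<stepName>.error.
const context: Context = {
  error: { type: 'NetworkError', message: 'socket hang up' },
  steps: { fetch: { error: { type: 'NetworkError', message: 'socket hang up' } } },
};
```

With this context, the example expression from the description, `"${{error.type == 'NetworkError'}}"`, evaluates to true in the sketch.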

🔄 Error Propagation Improvements

Node Error Catching Interface

  • Updated NodeWithErrorCatching.catchError() signature to accept StepExecutionRuntime parameter
  • Enables error-catching nodes to inspect failed step context before deciding recovery action
  • Used by retry and continue nodes to evaluate conditional recovery

Retry Behavior

  • Retry node now fails with the original step error after max attempts (instead of generic "max attempts exceeded" message)
  • Condition evaluation occurs before each retry attempt
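
The retry behavior described above can be sketched as a simple loop. This is a synchronous simplification under assumptions: `runWithRetry`, `maxAttempts`, and `shouldRetry` are hypothetical names, and `maxAttempts` is taken to mean the number of retries after the first failure, which the description does not specify.

```typescript
// Hypothetical shape; see the ExecutionError schema described in this PR.
interface ExecutionError {
  type: string;
  message: string;
  details?: Record<string, unknown>;
}

interface RetryOptions {
  maxAttempts: number; // assumption: number of retries allowed after the first failure
  shouldRetry?: (error: ExecutionError) => boolean; // stand-in for the evaluated condition
}

function runWithRetry<T>(step: () => T, options: RetryOptions): T {
  let retries = 0;
  for (;;) {
    try {
      return step();
    } catch (caught) {
      // The sketch assumes steps throw ExecutionError-shaped values.
      const error = caught as ExecutionError;
      // The condition is checked before each retry attempt.
      const conditionMet = options.shouldRetry ? options.shouldRetry(error) : true;
      if (!conditionMet || retries >= options.maxAttempts) {
        // Fail with the original step error rather than a generic message.
        throw error;
      }
      retries += 1;
    }
  }
}
```

Rethrowing the caught value keeps `type`, `message`, and `details` intact for any fallback steps or logs that run after the retry node gives up.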

Workflow Context

  • Fallback steps can now access error information via context
  • Error available as steps.<stepName>.error in fallback scope

📝 Logging & Observability

  • Workflow event logger updated to handle ExecutionError objects
  • Error details (type, message, details) logged consistently across all error types
  • HTTP request errors tagged with error type for better filtering

✅ Test Coverage

  • Updated all integration tests to expect structured error objects
  • Added test coverage for conditional retry and continue scenarios
  • Tests verify error details are preserved and accessible in fallback steps

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • Flaky Test Runner was used on any tests changed
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Identify risks

Does this PR introduce any risks? For example, consider risks like hard-to-test bugs, performance regressions, or potential data loss.

Describe the risk, its severity, and mitigation for each identified risk. Invite stakeholders and evaluate how to proceed before merging.

@skynetigor added the release_note:skip, backport:skip, and Team:One Workflow labels Nov 18, 2025
@skynetigor requested a review from Copilot November 20, 2025 11:19
Copilot AI left a comment

Pull Request Overview

This PR enhances error handling in the One Workflow execution engine by replacing string-based error storage with structured ExecutionError objects. The changes enable better error introspection and conditional error handling in retry and continue steps.

Key Changes:

  • Introduced ExecutionError schema with type, message, and optional details fields
  • Added conditional error handling support for retry and continue steps via KQL expressions
  • Enhanced HTTP step error handling to include detailed response information
  • Updated all error-related code paths to use structured error objects

Reviewed Changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/platform/packages/shared/kbn-workflows/spec/schema.ts | Added ExecutionError schema and condition field to retry/continue configurations |
| src/platform/packages/shared/kbn-workflows/types/v1.ts | Updated error fields to use ExecutionError type |
| src/platform/plugins/shared/workflows_execution_engine/server/utils/map_error/map_error.ts | New utility to convert errors to ExecutionError format |
| src/platform/plugins/shared/workflows_execution_engine/server/step/http_step/http_step_impl.ts | Enhanced HTTP error handling with structured error details |
| src/platform/plugins/shared/workflows_execution_engine/server/step/on_failure/retry_step/enter_retry_node_impl.ts | Added conditional retry logic based on error properties |
| src/platform/plugins/shared/workflows_execution_engine/server/step/on_failure/continue_step/enter_continue_node_impl.ts | Added conditional continue logic based on error properties |
| src/platform/plugins/shared/workflows_execution_engine/server/workflow_context_manager/workflow_context_manager.ts | Added evaluateBooleanExpressionInContext method and error context injection |
| Multiple test files | Updated assertions to match new structured error format |

Comments suppressed due to low confidence (1)

src/platform/plugins/shared/workflows_execution_engine/integration_tests/tests/on_failure_continue.test.ts:1

  • The comment is misleading in the context of the continue test. Since the condition is not met and retries don't occur, the duration comment doesn't accurately describe this test case. The comment should reflect that no retries are expected here.

@skynetigor marked this pull request as ready for review November 20, 2025 11:45
@skynetigor requested a review from a team as a code owner November 20, 2025 11:45
elasticmachine commented Nov 20, 2025

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #18 / on_failure graph {
    name: 'step level on-failure should override workflow level on-failure',
    fallbackActionNodeId: 'fallbackAction',
    workflow: [Object]
    } should configure continue node correctly
  • [job] [logs] Jest Tests #18 / on_failure graph {
    name: 'step level on-failure',
    fallbackActionNodeId: 'fallbackAction',
    workflow: [Object]
    } should configure continue node correctly
  • [job] [logs] Jest Tests #18 / on_failure graph {
    name: 'workflow level on-failure',
    fallbackActionNodeId: 'workflow-level-on-failure_testRetryConnectorStep_fallbackAction',
    workflow: [Object]
    } should configure continue node correctly
  • [job] [logs] FTR Configs #71 / Serverless Observability feature flag testing - Deployment-agnostic Observability Agent API integration tests Observability Agent tool: observability.get_anomaly_detection_jobs "after all" hook for "returns job without anomalies when time range excludes them"
  • [job] [logs] FTR Configs #71 / Serverless Observability feature flag testing - Deployment-agnostic Observability Agent API integration tests Observability Agent tool: observability.get_anomaly_detection_jobs "before all" hook for "filters by specific job ID"

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/workflows 335 337 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
workflowsManagement 2.1MB 2.1MB +179.0B
Unknown metric groups

API count

id before after diff
@kbn/workflows 365 367 +2

ESLint disabled line counts

id before after diff
workflowsExecutionEngine 40 39 -1

Total ESLint disabled count

id before after diff
workflowsExecutionEngine 54 53 -1
