
market-trends-agent: add code-based evaluators + observability wiring #1413

Merged
BharathiSrini merged 4 commits into awslabs:main from harniva14:update-market-trends-code-evaluators on Apr 30, 2026

Conversation

harniva14 (Contributor) commented:

Summary

Adds 5 custom code-based evaluators to the Market Trends Agent sample, along with end-to-end deploy / invoke / results tooling, and updates the agent itself to emit the telemetry required for evaluation.

Tested end-to-end in us-west-2 with all 5 evaluators firing correctly and zero Lambda errors across two full traffic rounds.

What changed

New — custom code-based evaluators (evaluators/)

| Evaluator | Level | What it checks |
|---|---|---|
| `mt_schema_validator` | TRACE | Tool outputs conform to expected structure (ticker+price, multi-headline news) |
| `mt_stock_price_drift` | TRACE | Prices quoted by the agent are within 2% of live Yahoo Finance reference |
| `mt_pii_regex` | TRACE | No SSN / credit-card (Luhn-validated) / IBAN / US phone / email in agent output |
| `mt_pii_comprehend` | SESSION | Amazon Comprehend PII scan; flags high-risk types and benign PII overuse |
| `mt_workflow_contract_gsr` | SESSION | Agent satisfied required tool-call contract groups (profile + market data) |
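As a rough illustration of the Luhn-validated credit-card branch of `mt_pii_regex` described in the table above, the core idea is that a candidate digit run only counts as a finding if it passes the Luhn checksum, which filters out most random digit sequences. The regex and function names below are hypothetical, not the actual Lambda code:

```python
import re

# Candidate card numbers: 13-16 digits, optionally separated by spaces/hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if len(digits) < 13:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return only the digit runs that survive the Luhn filter."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```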

Ships with:

  • evaluators/scripts/deploy.py — idempotent end-to-end deploy (IAM roles, Lambdas, evaluator registration, online eval config)
  • evaluators/scripts/invoke.py — 4 built-in traffic scenarios (broker intro, returning broker, PII bait, anonymous chitchat)
  • evaluators/scripts/results.py — CloudWatch eval-results viewer
  • evaluators/iam/ — trust + permissions policy templates
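The 2% tolerance rule from the `mt_stock_price_drift` row of the table above reduces to a relative-error comparison. A minimal sketch, assuming the Lambda has already extracted the quoted price from the trace and fetched a live reference quote:

```python
def within_drift(quoted: float, reference: float, tolerance: float = 0.02) -> bool:
    """True if `quoted` is within `tolerance` relative error of `reference`."""
    if reference <= 0:
        return False  # no meaningful reference price to compare against
    return abs(quoted - reference) / reference <= tolerance
```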

Agent updates

  • LangchainInstrumentor registered in market_trends_agent.py so gen_ai.tool.* spans flow to aws/spans where evaluators can read them.
  • deploy.py rewritten to use bedrock-agentcore-control boto3 client + CodeBuild directly — no dependency on bedrock-agentcore-starter-toolkit. Matches the post-GA (2026-03-31) deployment path for code-based evaluators.
  • pyproject.toml — pins boto3 >= 1.42.0 (required for list_evaluators / create_evaluator / create_online_evaluation_config); adds opentelemetry-instrumentation-langchain with wrapt<1.16.0 to avoid a known instrumentation TypeError.
  • Hardcoded "us-east-1" in test_agent.py, test_broker_card.py, tools/browser_tool.py, tools/broker_card_tools.py replaced with os.getenv("AWS_REGION", "us-east-1").
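The region change in the last bullet follows the standard environment-variable fallback pattern; `resolve_region` is an illustrative name, the actual files call `os.getenv` inline:

```python
import os

def resolve_region() -> str:
    # Prefer the AWS_REGION environment variable; keep us-east-1
    # as the fallback so existing behavior is unchanged when unset.
    return os.getenv("AWS_REGION", "us-east-1")
```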

Docs

  • README: new Evaluating Your Agent with Custom Code-Based Evaluators section (how they work, IAM requirements, deploy/test/view-results flow, AgentCore CLI usage, cleanup ordering).
  • README: troubleshooting updates (CodeBuild bootstrap, image-tag retag, memory-delete latency, eval result latency).
  • README: IAM section split into agent role vs evaluator Lambda role.
  • CHANGELOG.md added.
  • Architecture diagram updated to show the evaluator layer + Comprehend integration.

Removed

  • Dockerfile, .dockerignore — container is built by AWS CodeBuild at deploy time, no local Docker required.

How to test

cd 02-use-cases/market-trends-agent
uv sync

# Deploy the agent
AWS_REGION=us-west-2 uv run python deploy.py --region us-west-2

# Deploy the evaluators + online eval config
AWS_REGION=us-west-2 uv run python evaluators/scripts/deploy.py

# Generate traffic against 4 scenarios
AWS_REGION=us-west-2 uv run python evaluators/scripts/invoke.py

# Wait ~5–10 min for online evaluation to score sessions, then:
AWS_REGION=us-west-2 uv run python evaluators/scripts/results.py --minutes 30

Notes for reviewers

  • All resource names and IAM roles are scoped to this sample (MarketTrends*, market-trends-eval-*) so deploy/cleanup scripts won't collide with other resources in the account.
  • Evaluator Lambdas are minimal and dependency-free, except mt_pii_comprehend, which calls comprehend:DetectPiiEntities.
  • Existing cleanup.py only covers agent-side resources; the README now documents that the Cleanup Evaluators block must run first.
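The group-satisfaction rule in `mt_workflow_contract_gsr` (see the evaluator table above) amounts to: every required contract group must have at least one of its tools called during the session. A hypothetical sketch with made-up group and tool names:

```python
# Illustrative contract config, not the real one shipped in the sample.
REQUIRED_GROUPS = {
    "profile": {"get_broker_profile", "lookup_broker"},
    "market_data": {"get_stock_price", "get_market_news"},
}

def contract_satisfied(called_tools: set[str],
                       groups: dict[str, set[str]] = REQUIRED_GROUPS) -> bool:
    # Pass only if every group intersects the set of tools actually called.
    return all(group & called_tools for group in groups.values())
```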

- Add 5 Lambda-backed code-based evaluators (schema_validator, stock_price_drift,
  pii_regex, pii_comprehend, workflow_contract_gsr) with online evaluation config
- Add evaluator deploy/invoke/results scripts under evaluators/scripts/
- Enable LangchainInstrumentor so gen_ai.tool.* spans flow to AgentCore Observability
- Replace hardcoded us-east-1 with AWS_REGION env var fallback across agent and tests
- Rewrite deploy.py to use CodeBuild + bedrock-agentcore-control directly (no starter toolkit dep)
- Pin boto3 >= 1.42.0 for Evaluations control-plane APIs
- Update README: evaluator documentation, IAM split, troubleshooting, cleanup ordering
- Update architecture diagram to reflect evaluator layer
- Remove Dockerfile and .dockerignore (container built by CodeBuild, no local Docker needed)
github-actions bot added the 02-use-cases label on Apr 28, 2026
github-actions bot commented on Apr 28, 2026:

Latest scan for commit: 77e7fda | Updated: 2026-04-30 21:46:40 UTC

Security Scan Results

Scan Metadata

  • Project: ASH
  • Scan executed: 2026-04-30T21:46:20+00:00
  • ASH version: 3.0.0

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Column Explanations:

Severity Levels (S/C/H/M/L/I):

  • Suppressed (S): Security findings that have been explicitly suppressed/ignored and don't affect the scanner's pass/fail status
  • Critical (C): The most severe security vulnerabilities requiring immediate remediation (e.g., SQL injection, remote code execution)
  • High (H): Serious security vulnerabilities that should be addressed promptly (e.g., authentication bypasses, privilege escalation)
  • Medium (M): Moderate security risks that should be addressed in normal development cycles (e.g., weak encryption, input validation issues)
  • Low (L): Minor security concerns with limited impact (e.g., information disclosure, weak recommendations)
  • Info (I): Informational findings for awareness with minimal security risk (e.g., code quality suggestions, best practice recommendations)

Other Columns:

  • Time: Duration taken by each scanner to complete its analysis
  • Action: Total number of actionable findings at or above the configured severity threshold that require attention

Scanner Results:

  • PASSED: Scanner found no security issues at or above the configured severity threshold - code is clean for this scanner
  • FAILED: Scanner found security vulnerabilities at or above the threshold that require attention and remediation
  • MISSING: Scanner could not run because required dependencies/tools are not installed or available
  • SKIPPED: Scanner was intentionally disabled or excluded from this scan
  • ERROR: Scanner encountered an execution error and could not complete successfully

Severity Thresholds (Thresh Column):

  • CRITICAL: Only Critical severity findings cause scanner to fail
  • HIGH: High and Critical severity findings cause scanner to fail
  • MEDIUM (MED): Medium, High, and Critical severity findings cause scanner to fail
  • LOW: Low, Medium, High, and Critical severity findings cause scanner to fail
  • ALL: Any finding of any severity level causes scanner to fail

Threshold Source: Values in parentheses indicate where the threshold is configured:

  • (g) = global: Set in the global_settings section of ASH configuration
  • (c) = config: Set in the individual scanner configuration section
  • (s) = scanner: Default threshold built into the scanner itself

Statistics calculation:

  • All statistics are calculated from the final aggregated SARIF report
  • Suppressed findings are counted separately and do not contribute to actionable findings
  • Scanner status is determined by comparing actionable findings to the threshold
| Scanner | S | C | H | M | L | I | Time | Action | Result | Thresh |
|---|---|---|---|---|---|---|---|---|---|
| bandit | 0 | 0 | 0 | 0 | 1 | 0 | 1.1s | 0 | PASSED | MED (g) |
| cdk-nag | 0 | 0 | 0 | 0 | 0 | 0 | 29.8s | 0 | PASSED | MED (g) |
| cfn-nag | 0 | 0 | 0 | 0 | 0 | 0 | 37ms | 0 | PASSED | MED (g) |
| checkov | 0 | 0 | 0 | 0 | 0 | 0 | 5.1s | 0 | PASSED | MED (g) |
| detect-secrets | 0 | 0 | 0 | 0 | 0 | 0 | 1.1s | 0 | PASSED | MED (g) |
| grype | 0 | 0 | 0 | 0 | 0 | 0 | 41.3s | 0 | PASSED | MED (g) |
| npm-audit | 0 | 0 | 0 | 0 | 0 | 0 | 167ms | 0 | PASSED | MED (g) |
| opengrep | 0 | 0 | 0 | 0 | 0 | 0 | <1ms | 0 | SKIPPED | MED (g) |
| semgrep | 0 | 0 | 0 | 0 | 0 | 0 | <1ms | 0 | MISSING | MED (g) |
| syft | 0 | 0 | 0 | 0 | 0 | 0 | 2.5s | 0 | PASSED | MED (g) |

@BharathiSrini BharathiSrini self-requested a review April 30, 2026 16:56
Comment thread 02-use-cases/market-trends-agent/test_broker_card.py
BharathiSrini (Collaborator) commented:

Please run linters

- test_broker_card.py: add 'import os' (F821 from linter)
- stock_price_drift/lambda_function.py: reject non-https reference URLs
  before urlopen() and annotate with nosec B310 / noqa S310 (Bandit)
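The guard the review asks for can be sketched as a scheme check ahead of `urlopen()`; `assert_https` is an illustrative name, not the function in the PR:

```python
from urllib.parse import urlparse

def assert_https(url: str) -> str:
    """Raise if the reference URL is not https; return it unchanged otherwise."""
    if urlparse(url).scheme != "https":
        raise ValueError(f"refusing non-https reference URL: {url}")
    return url

# Hypothetical call site, with the Bandit annotation the review requests:
# with urlopen(assert_https(ref_url)) as resp:  # nosec B310 - scheme validated above
#     payload = resp.read()
```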
@BharathiSrini BharathiSrini merged commit 27b7022 into awslabs:main Apr 30, 2026
8 checks passed
