
market-trends-agent: add code-based evaluators + observability wiring #1413

Merged
BharathiSrini merged 4 commits into awslabs:main from harniva14:update-market-trends-code-evaluators on Apr 30, 2026

Conversation

harniva14 (Contributor) commented:

Summary

Adds 5 custom code-based evaluators to the Market Trends Agent sample, along with end-to-end deploy / invoke / results tooling, and updates the agent itself to emit the telemetry required for evaluation.

Tested end-to-end in us-west-2 with all 5 evaluators firing correctly and zero Lambda errors across two full traffic rounds.

What changed

New — custom code-based evaluators (evaluators/)

| Evaluator | Level | What it checks |
|---|---|---|
| `mt_schema_validator` | TRACE | Tool outputs conform to expected structure (ticker+price, multi-headline news) |
| `mt_stock_price_drift` | TRACE | Prices quoted by the agent are within 2% of live Yahoo Finance reference |
| `mt_pii_regex` | TRACE | No SSN / credit-card (Luhn-validated) / IBAN / US phone / email in agent output |
| `mt_pii_comprehend` | SESSION | Amazon Comprehend PII scan; flags high-risk types and benign PII overuse |
| `mt_workflow_contract_gsr` | SESSION | Agent satisfied required tool-call contract groups (profile + market data) |
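As a rough illustration of the Luhn-validated credit-card branch of `mt_pii_regex` described in the table above, the core idea is that a candidate digit run only counts as a finding if it passes the Luhn checksum, which filters out most random digit sequences. The regex and function names below are hypothetical, not the actual Lambda code:

```python
import re

# Candidate card numbers: 13-16 digits, optionally separated by spaces/hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if len(digits) < 13:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return only the digit runs that survive the Luhn filter."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```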

Ships with:

  • evaluators/scripts/deploy.py — idempotent end-to-end deploy (IAM roles, Lambdas, evaluator registration, online eval config)
  • evaluators/scripts/invoke.py — 4 built-in traffic scenarios (broker intro, returning broker, PII bait, anonymous chitchat)
  • evaluators/scripts/results.py — CloudWatch eval-results viewer
  • evaluators/iam/ — trust + permissions policy templates
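The 2% tolerance rule from the `mt_stock_price_drift` row of the table above reduces to a relative-error comparison. A minimal sketch, assuming the Lambda has already extracted the quoted price from the trace and fetched a live reference quote:

```python
def within_drift(quoted: float, reference: float, tolerance: float = 0.02) -> bool:
    """True if `quoted` is within `tolerance` relative error of `reference`."""
    if reference <= 0:
        return False  # no meaningful reference price to compare against
    return abs(quoted - reference) / reference <= tolerance
```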

Agent updates

  • LangchainInstrumentor registered in market_trends_agent.py so gen_ai.tool.* spans flow to aws/spans where evaluators can read them.
  • deploy.py rewritten to use bedrock-agentcore-control boto3 client + CodeBuild directly — no dependency on bedrock-agentcore-starter-toolkit. Matches the post-GA (2026-03-31) deployment path for code-based evaluators.
  • pyproject.toml — pins boto3 >= 1.42.0 (required for list_evaluators / create_evaluator / create_online_evaluation_config); adds opentelemetry-instrumentation-langchain with wrapt<1.16.0 to avoid a known instrumentation TypeError.
  • Hardcoded "us-east-1" in test_agent.py, test_broker_card.py, tools/browser_tool.py, tools/broker_card_tools.py replaced with os.getenv("AWS_REGION", "us-east-1").
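The region change in the last bullet follows the standard environment-variable fallback pattern; `resolve_region` is an illustrative name, the actual files call `os.getenv` inline:

```python
import os

def resolve_region() -> str:
    # Prefer the AWS_REGION environment variable; keep us-east-1
    # as the fallback so existing behavior is unchanged when unset.
    return os.getenv("AWS_REGION", "us-east-1")
```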

Docs

  • README: new Evaluating Your Agent with Custom Code-Based Evaluators section (how they work, IAM requirements, deploy/test/view-results flow, AgentCore CLI usage, cleanup ordering).
  • README: troubleshooting updates (CodeBuild bootstrap, image-tag retag, memory-delete latency, eval result latency).
  • README: IAM section split into agent role vs evaluator Lambda role.
  • CHANGELOG.md added.
  • Architecture diagram updated to show the evaluator layer + Comprehend integration.

Removed

  • Dockerfile, .dockerignore — container is built by AWS CodeBuild at deploy time, no local Docker required.

How to test

cd 02-use-cases/market-trends-agent
uv sync

# Deploy the agent
AWS_REGION=us-west-2 uv run python deploy.py --region us-west-2

# Deploy the evaluators + online eval config
AWS_REGION=us-west-2 uv run python evaluators/scripts/deploy.py

# Generate traffic against 4 scenarios
AWS_REGION=us-west-2 uv run python evaluators/scripts/invoke.py

# Wait ~5–10 min for online evaluation to score sessions, then:
AWS_REGION=us-west-2 uv run python evaluators/scripts/results.py --minutes 30

Notes for reviewers

  • All resource names and IAM roles are scoped to this sample (MarketTrends*, market-trends-eval-*) so deploy/cleanup scripts won't collide with other resources in the account.
  • Evaluator Lambdas are minimal and dependency-free, except mt_pii_comprehend, which calls comprehend:DetectPiiEntities.
  • Existing cleanup.py only covers agent-side resources; the README now documents that the Cleanup Evaluators block must run first.
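The group-satisfaction rule in `mt_workflow_contract_gsr` (see the evaluator table above) amounts to: every required contract group must have at least one of its tools called during the session. A hypothetical sketch with made-up group and tool names:

```python
# Illustrative contract config, not the real one shipped in the sample.
REQUIRED_GROUPS = {
    "profile": {"get_broker_profile", "lookup_broker"},
    "market_data": {"get_stock_price", "get_market_news"},
}

def contract_satisfied(called_tools: set[str],
                       groups: dict[str, set[str]] = REQUIRED_GROUPS) -> bool:
    # Pass only if every group intersects the set of tools actually called.
    return all(group & called_tools for group in groups.values())
```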

- Add 5 Lambda-backed code-based evaluators (schema_validator, stock_price_drift,
  pii_regex, pii_comprehend, workflow_contract_gsr) with online evaluation config
- Add evaluator deploy/invoke/results scripts under evaluators/scripts/
- Enable LangchainInstrumentor so gen_ai.tool.* spans flow to AgentCore Observability
- Replace hardcoded us-east-1 with AWS_REGION env var fallback across agent and tests
- Rewrite deploy.py to use CodeBuild + bedrock-agentcore-control directly (no starter toolkit dep)
- Pin boto3 >= 1.42.0 for Evaluations control-plane APIs
- Update README: evaluator documentation, IAM split, troubleshooting, cleanup ordering
- Update architecture diagram to reflect evaluator layer
- Remove Dockerfile and .dockerignore (container built by CodeBuild, no local Docker needed)
github-actions bot added the 02-use-cases label on Apr 28, 2026
github-actions bot commented on Apr 28, 2026:

Latest scan for commit: 77e7fda | Updated: 2026-04-30 21:46:40 UTC

Security Scan Results

Scan Metadata

  • Project: ASH
  • Scan executed: 2026-04-30T21:46:20+00:00
  • ASH version: 3.0.0

Summary

Scanner Results

The table below shows findings by scanner, with status based on severity thresholds and dependencies:

Column Explanations:

Severity Levels (S/C/H/M/L/I):

  • Suppressed (S): Security findings that have been explicitly suppressed/ignored and don't affect the scanner's pass/fail status
  • Critical (C): The most severe security vulnerabilities requiring immediate remediation (e.g., SQL injection, remote code execution)
  • High (H): Serious security vulnerabilities that should be addressed promptly (e.g., authentication bypasses, privilege escalation)
  • Medium (M): Moderate security risks that should be addressed in normal development cycles (e.g., weak encryption, input validation issues)
  • Low (L): Minor security concerns with limited impact (e.g., information disclosure, weak recommendations)
  • Info (I): Informational findings for awareness with minimal security risk (e.g., code quality suggestions, best practice recommendations)

Other Columns:

  • Time: Duration taken by each scanner to complete its analysis
  • Action: Total number of actionable findings at or above the configured severity threshold that require attention

Scanner Results:

  • PASSED: Scanner found no security issues at or above the configured severity threshold - code is clean for this scanner
  • FAILED: Scanner found security vulnerabilities at or above the threshold that require attention and remediation
  • MISSING: Scanner could not run because required dependencies/tools are not installed or available
  • SKIPPED: Scanner was intentionally disabled or excluded from this scan
  • ERROR: Scanner encountered an execution error and could not complete successfully

Severity Thresholds (Thresh Column):

  • CRITICAL: Only Critical severity findings cause scanner to fail
  • HIGH: High and Critical severity findings cause scanner to fail
  • MEDIUM (MED): Medium, High, and Critical severity findings cause scanner to fail
  • LOW: Low, Medium, High, and Critical severity findings cause scanner to fail
  • ALL: Any finding of any severity level causes scanner to fail

Threshold Source: Values in parentheses indicate where the threshold is configured:

  • (g) = global: Set in the global_settings section of ASH configuration
  • (c) = config: Set in the individual scanner configuration section
  • (s) = scanner: Default threshold built into the scanner itself

Statistics calculation:

  • All statistics are calculated from the final aggregated SARIF report
  • Suppressed findings are counted separately and do not contribute to actionable findings
  • Scanner status is determined by comparing actionable findings to the threshold
| Scanner | S | C | H | M | L | I | Time | Action | Result | Thresh |
|---|---|---|---|---|---|---|---|---|---|
| bandit | 0 | 0 | 0 | 0 | 1 | 0 | 1.1s | 0 | PASSED | MED (g) |
| cdk-nag | 0 | 0 | 0 | 0 | 0 | 0 | 29.8s | 0 | PASSED | MED (g) |
| cfn-nag | 0 | 0 | 0 | 0 | 0 | 0 | 37ms | 0 | PASSED | MED (g) |
| checkov | 0 | 0 | 0 | 0 | 0 | 0 | 5.1s | 0 | PASSED | MED (g) |
| detect-secrets | 0 | 0 | 0 | 0 | 0 | 0 | 1.1s | 0 | PASSED | MED (g) |
| grype | 0 | 0 | 0 | 0 | 0 | 0 | 41.3s | 0 | PASSED | MED (g) |
| npm-audit | 0 | 0 | 0 | 0 | 0 | 0 | 167ms | 0 | PASSED | MED (g) |
| opengrep | 0 | 0 | 0 | 0 | 0 | 0 | <1ms | 0 | SKIPPED | MED (g) |
| semgrep | 0 | 0 | 0 | 0 | 0 | 0 | <1ms | 0 | MISSING | MED (g) |
| syft | 0 | 0 | 0 | 0 | 0 | 0 | 2.5s | 0 | PASSED | MED (g) |

@BharathiSrini BharathiSrini self-requested a review April 30, 2026 16:56
Comment thread 02-use-cases/market-trends-agent/test_broker_card.py
BharathiSrini (Collaborator) commented:

Please run linters

- test_broker_card.py: add 'import os' (F821 from linter)
- stock_price_drift/lambda_function.py: reject non-https reference URLs
  before urlopen() and annotate with nosec B310 / noqa S310 (Bandit)
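The guard the review asks for can be sketched as a scheme check ahead of `urlopen()`; `assert_https` is an illustrative name, not the function in the PR:

```python
from urllib.parse import urlparse

def assert_https(url: str) -> str:
    """Raise if the reference URL is not https; return it unchanged otherwise."""
    if urlparse(url).scheme != "https":
        raise ValueError(f"refusing non-https reference URL: {url}")
    return url

# Hypothetical call site, with the Bandit annotation the review requests:
# with urlopen(assert_https(ref_url)) as resp:  # nosec B310 - scheme validated above
#     payload = resp.read()
```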
@BharathiSrini BharathiSrini merged commit 27b7022 into awslabs:main Apr 30, 2026
8 checks passed
