Merged
Changes from 2 commits
2 changes: 1 addition & 1 deletion docs/sdk/concepts/connecting-servers.mdx
@@ -176,7 +176,7 @@ const manager = new MCPClientManager({

await manager.connectToServer("myServer");

// Get tools in AI SDK format
// Get tools for TestAgent
const tools = await manager.getTools();

// Create agent with those tools
37 changes: 37 additions & 0 deletions docs/sdk/concepts/testing-with-llms.mdx
@@ -143,6 +143,43 @@ const agent = new TestAgent({
});
```

## Control Multi-Step Loops with stopWhen

Use `stopWhen` when you want to stop the multi-step loop after a particular step completes:

```typescript
import { hasToolCall } from "@mcpjam/sdk";

// Stop after the step where the tool is called
const result = await agent.prompt("Search for open tasks", {
stopWhen: hasToolCall("search_tasks"),
});

expect(result.hasToolCall("search_tasks")).toBe(true);
```

<Tip>
`stopWhen` does not skip tool execution. It controls whether the prompt loop continues after the current step completes, and `TestAgent` also applies `stepCountIs(maxSteps)` as a safety guard.
</Tip>

## Bound Prompt Runtime with timeout

Use `timeout` when you want to bound how long `TestAgent.prompt()` can run:

```typescript
const result = await agent.prompt("Run a long workflow", {
timeout: { totalMs: 10_000, stepMs: 2_500 },
});

if (result.hasError()) {
console.error(result.getError());
}
```

<Tip>
`timeout` accepts either a bare `number` or an object with `totalMs`, `stepMs`, and `chunkMs`. In practice, `number` and `totalMs` cap the full prompt, `stepMs` caps each step, and `chunkMs` mainly matters in streaming flows. The runtime creates an internal abort signal, so tools can stop early if their implementation respects the provided `abortSignal`.
</Tip>
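To make the abort behavior concrete, here is a sketch of a hypothetical tool body (`slowSearch` is illustrative, not part of the SDK) that checks the `abortSignal` the runtime passes to `execute()`, so a `timeout` can actually interrupt its work early:

```typescript
// Hypothetical tool body: long-running work that polls the abort signal
// between units of work and stops early once it fires.
async function slowSearch(
  chunks: number,
  abortSignal?: AbortSignal,
): Promise<{ aborted: boolean; completed: number }> {
  for (let i = 0; i < chunks; i++) {
    if (abortSignal?.aborted) {
      return { aborted: true, completed: i }; // stop early, report progress
    }
    await new Promise((resolve) => setTimeout(resolve, 10)); // one unit of work
  }
  return { aborted: false, completed: chunks };
}

// Simulate the runtime's internal abort signal firing after ~35ms.
const controller = new AbortController();
setTimeout(() => controller.abort(), 35);

const outcome = await slowSearch(100, controller.signal);
console.log(outcome.aborted); // true — the tool noticed the signal and stopped
```

A tool that never checks the signal would instead keep running in the background after the prompt returns an error result.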

## Writing Assertions

Use validators to assert tool call behavior:
2 changes: 1 addition & 1 deletion docs/sdk/reference/llm-providers.mdx
@@ -384,7 +384,7 @@ const { provider, modelId } = parseLLMString("anthropic/claude-sonnet-4-20250514"

### createModelFromString()

Create a Vercel AI SDK model directly.
Create a provider model instance directly.

```typescript
import { createModelFromString } from "@mcpjam/sdk";
2 changes: 1 addition & 1 deletion docs/sdk/reference/prompt-result.mdx
@@ -328,7 +328,7 @@ getMessages(): CoreMessage[]

#### Returns

`CoreMessage[]` - Vercel AI SDK message format.
`CoreMessage[]` - The full conversation message format used by the SDK.

#### Example

71 changes: 70 additions & 1 deletion docs/sdk/reference/test-agent.mdx
@@ -4,7 +4,7 @@ description: "API reference for TestAgent"
icon: "book"
---

The `TestAgent` class wraps LLM providers via the Vercel AI SDK, enabling you to run prompts with MCP tools. It handles the agentic loop and returns rich result objects.
The `TestAgent` class runs prompts with MCP tools enabled. It handles the multi-step prompt loop and returns rich result objects.

## Import

@@ -76,6 +76,8 @@ prompt(
| Property | Type | Description |
|----------|------|-------------|
| `context` | `PromptResult \| PromptResult[]` | Previous result(s) for multi-turn conversations |
| `stopWhen` | `StopCondition<ToolSet> \| Array<StopCondition<ToolSet>>` | Additional conditions for the multi-step prompt loop. Tools still execute normally. `TestAgent` always applies `stepCountIs(maxSteps)` as a safety guard. |
| `timeout` | `number \| { totalMs?: number; stepMs?: number; chunkMs?: number }` | Bounds prompt runtime. `number` and `totalMs` cap the full prompt, `stepMs` caps each generation step, and `chunkMs` is accepted for parity but is mainly relevant to streaming APIs. |

#### Returns

@@ -84,6 +86,8 @@
#### Example

```typescript
import { hasToolCall } from "@mcpjam/sdk";

// Simple prompt
const result = await agent.prompt("Add 2 and 3");

@@ -93,12 +97,31 @@ const r2 = await agent.prompt("Mark it complete", { context: r1 });

// Multiple context items
const r3 = await agent.prompt("Show summary", { context: [r1, r2] });

// Stop the loop after the step where a tool is called
const r4 = await agent.prompt("Search for tasks", {
stopWhen: hasToolCall("search_tasks"),
});
console.log(r4.hasToolCall("search_tasks"));

// Bound prompt runtime
const r5 = await agent.prompt("Run a long workflow", {
timeout: { totalMs: 10_000, stepMs: 2_500 },
});

if (r5.hasError()) {
console.error(r5.getError());
}
```

<Note>
`prompt()` never throws exceptions. Errors are captured in the `PromptResult`. Check `result.hasError()` to detect failures.
</Note>

<Info>
`timeout` bounds prompt runtime. The runtime creates an internal abort signal, so tools can stop early if their implementation respects the provided `abortSignal`. If a tool ignores that signal, its underlying work may continue briefly after the prompt returns an error result.
</Info>

---

## Model String Format
@@ -233,6 +256,52 @@ Setting `maxSteps` too low may prevent complex tasks from completing. Setting it

---

## Control Multi-Step Loops with stopWhen

Use `stopWhen` to control whether the agent starts another step after the current step completes.

```typescript
import { hasToolCall } from "@mcpjam/sdk";

// Stop after the step where "search_tasks" is called
const result = await agent.prompt("Find my open tasks", {
stopWhen: hasToolCall("search_tasks"),
});

expect(result.hasToolCall("search_tasks")).toBe(true);

// Stop after any of multiple conditions
const result2 = await agent.prompt("Do something", {
stopWhen: [hasToolCall("tool_a"), hasToolCall("tool_b")],
});
```

<Info>
`stopWhen` does not skip tool execution. It controls whether the prompt loop continues after the current step completes, and `TestAgent` also applies `stepCountIs(maxSteps)` as a safety guard.
</Info>
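As a mental model, a stop condition is a predicate over the steps completed so far — the shape that `hasToolCall` and `stepCountIs` share. The sketch below uses simplified stand-in types (`MinimalStep` and `MiniStopCondition` are illustrative, not the SDK's exported types) to show how such a predicate behaves:

```typescript
// Simplified sketch of the stop-condition shape: a predicate over the
// steps completed so far; returning true ends the loop after the current step.
type MinimalStep = { text: string };
type MiniStopCondition = (opts: { steps: MinimalStep[] }) => boolean;

// Hypothetical custom condition: stop once any step's text contains "DONE".
const mentionsDone: MiniStopCondition = ({ steps }) =>
  steps.some((step) => step.text.includes("DONE"));

// The loop would evaluate the condition after each step:
console.log(mentionsDone({ steps: [{ text: "working..." }] })); // false
console.log(
  mentionsDone({ steps: [{ text: "working..." }, { text: "DONE" }] }),
); // true
```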

---

## Bound Prompt Runtime with timeout

Use `timeout` when you want to bound how long `TestAgent.prompt()` can run:

```typescript
const result = await agent.prompt("Run a long workflow", {
timeout: 10_000,
});

const result2 = await agent.prompt("Run a long workflow", {
timeout: { totalMs: 10_000, stepMs: 2_500, chunkMs: 1_000 },
});
```

<Info>
`chunkMs` is accepted for parity, but it is mainly useful for streaming APIs. For `TestAgent.prompt()`, `number`, `totalMs`, and `stepMs` are the main settings to focus on.
</Info>

---

## Complete Example

```typescript
20 changes: 19 additions & 1 deletion sdk/README.md
@@ -163,7 +163,7 @@ await manager.connectToServer("asana", {
},
});

// Get tools for AI SDK integration
// Get tools for TestAgent
const tools = await manager.getToolsForAiSdk(["everything", "asana"]);

// Direct MCP operations
@@ -197,13 +197,31 @@ const agent = new TestAgent({
});

import { hasToolCall } from "@mcpjam/sdk";

// Run a prompt
const result = await agent.prompt("Add 2 and 3");

// Multi-turn with context
const r1 = await agent.prompt("Who am I?");
const r2 = await agent.prompt("List my projects", { context: [r1] });

// Stop the loop after the step where a tool is called
const r3 = await agent.prompt("Search tasks", {
stopWhen: hasToolCall("search_tasks"),
});
r3.hasToolCall("search_tasks"); // true

// Bound prompt runtime
const r4 = await agent.prompt("Run a long workflow", {
timeout: { totalMs: 10_000, stepMs: 2_500 },
});
r4.hasError(); // true if the prompt timed out
```

`stopWhen` does not skip tool execution. It controls whether the prompt loop continues after the current step completes, and `TestAgent` also applies `stepCountIs(maxSteps)` as a safety guard.

`timeout` bounds prompt runtime. `number` and `totalMs` cap the full prompt, `stepMs` caps each step, and `chunkMs` is accepted for parity but mainly matters in streaming flows. The runtime creates an internal abort signal, so tools can stop early if their implementation respects the provided `abortSignal`.

**Supported providers:** `openai`, `anthropic`, `azure`, `google`, `mistral`, `deepseek`, `ollama`, `openrouter`, `xai`

</details>
21 changes: 19 additions & 2 deletions sdk/skills/create-mcp-eval/SKILL.md
@@ -172,7 +172,7 @@ await manager.connectToServer("server-id", {
env: { API_KEY: "..." },
});

// Get AI SDK-compatible tools for TestAgent
// Get tools for TestAgent
const tools = await manager.getToolsForAiSdk(["server-id"]);

// Cleanup
@@ -194,12 +194,26 @@
});

import { hasToolCall } from "@mcpjam/sdk";

// Single prompt
const result = await agent.prompt("List all projects");

// Multi-turn with context
const r1 = await agent.prompt("Get my user profile");
const r2 = await agent.prompt("List workspaces for that user", { context: r1 });

// Stop the loop after the step where a tool is called
const r3 = await agent.prompt("Search tasks", {
stopWhen: hasToolCall("search_tasks"),
});
r3.hasToolCall("search_tasks"); // true

// Bound prompt runtime
const r4 = await agent.prompt("Run a long workflow", {
timeout: { totalMs: 10_000, stepMs: 2_500 },
});
r4.hasError(); // true if the prompt timed out

// Mock agent for deterministic tests (no LLM needed)
const mockAgent = TestAgent.mock(async (message) =>
PromptResult.from({
@@ -216,6 +230,10 @@
);
```

`stopWhen` does not skip tool execution. It controls whether the prompt loop continues after the current step completes, and `TestAgent` also applies `stepCountIs(maxSteps)` as a safety guard.

`timeout` bounds prompt runtime. `number` and `totalMs` cap the full prompt, `stepMs` caps each step, and `chunkMs` is accepted for parity but mainly matters in streaming flows. The runtime creates an internal abort signal, so tools can stop early if their implementation respects the provided `abortSignal`.
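One way to picture the accepted `timeout` forms is as a normalization to a single object shape. This is a sketch under that assumption (`TimeoutLike` and `normalizeTimeout` are illustrative names; the SDK's internal handling may differ):

```typescript
// Sketch: the accepted timeout forms, normalized to one object shape.
type TimeoutLike =
  | number
  | { totalMs?: number; stepMs?: number; chunkMs?: number };

function normalizeTimeout(timeout: TimeoutLike): {
  totalMs?: number;
  stepMs?: number;
  chunkMs?: number;
} {
  // A bare number is shorthand for a total cap on the whole prompt.
  return typeof timeout === "number" ? { totalMs: timeout } : timeout;
}

console.log(normalizeTimeout(10_000)); // { totalMs: 10000 }
console.log(normalizeTimeout({ stepMs: 2_500 })); // { stepMs: 2500 }
```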

### PromptResult — Inspect Agent Responses

```typescript
@@ -998,4 +1016,3 @@
expect(result.hasToolCall("search_tasks")).toBe(true);
}, 90_000);
```

40 changes: 40 additions & 0 deletions sdk/src/EvalAgent.ts
@@ -1,3 +1,4 @@
import type { StopCondition, TimeoutConfiguration, ToolSet } from "ai";
import type { PromptResult } from "./PromptResult.js";

/**
@@ -6,6 +7,45 @@ import type { PromptResult } from "./PromptResult.js";
export interface PromptOptions {
/** Previous PromptResult(s) to include as conversation context for multi-turn conversations */
context?: PromptResult | PromptResult[];

/**
* Additional stop conditions for the agentic loop.
* Evaluated after each step completes (tools execute normally).
* `stepCountIs(maxSteps)` is always applied as a safety guard
* in addition to any conditions provided here.
*
* Import helpers like `hasToolCall` and `stepCountIs` from `"@mcpjam/sdk"`.
*
* @example
* ```typescript
* import { hasToolCall } from "@mcpjam/sdk";
*
* // Stop the loop after the step where "search_tasks" is called
* const result = await agent.prompt("Find my tasks", {
* stopWhen: hasToolCall("search_tasks"),
* });
* expect(result.hasToolCall("search_tasks")).toBe(true);
*
* // Multiple conditions (any one being true stops the loop)
* const result = await agent.prompt("Do something", {
* stopWhen: [hasToolCall("tool_a"), hasToolCall("tool_b")],
* });
* ```
*/
stopWhen?: StopCondition<ToolSet> | Array<StopCondition<ToolSet>>;

/**
* Timeout for the prompt runtime.
*
* - `number`: total timeout for the entire prompt call in milliseconds
* - `{ totalMs }`: total timeout across all steps
* - `{ stepMs }`: timeout for each generation step
* - `{ chunkMs }`: accepted for parity and primarily relevant to streaming APIs
*
* The runtime creates an internal abort signal. Tools can stop early if they
* respect the `abortSignal` passed to `execute()`.
*/
timeout?: TimeoutConfiguration;
}

/**
29 changes: 23 additions & 6 deletions sdk/src/TestAgent.ts
@@ -3,7 +3,7 @@
*/

import { generateText, stepCountIs, dynamicTool, jsonSchema } from "ai";
import type { ToolSet, ModelMessage, UserModelMessage } from "ai";
import type { StopCondition, ToolSet, ModelMessage, UserModelMessage } from "ai";
import { CallToolResultSchema } from "@modelcontextprotocol/sdk/types.js";
import { createModelFromString, parseLLMString } from "./model-factory.js";
import type { CreateModelOptions } from "./model-factory.js";
@@ -318,6 +318,17 @@ export class TestAgent implements EvalAgent {
return instrumented;
}

private resolveStopWhen(
stopWhen?: PromptOptions["stopWhen"]
): StopCondition<ToolSet> | Array<StopCondition<ToolSet>> {
if (stopWhen == null) {
return stepCountIs(this.maxSteps);
}

const conditions = Array.isArray(stopWhen) ? stopWhen : [stopWhen];
return [stepCountIs(this.maxSteps), ...conditions];
}

/**
* Build an array of ModelMessages from previous PromptResult(s) for multi-turn context.
* @param context - Single PromptResult or array of PromptResults to include as context
@@ -383,10 +394,13 @@
const model = createModelFromString(this.model, modelOptions);

// Instrument tools to track MCP execution time
const instrumentedTools = this.createInstrumentedTools((ms) => {
totalMcpMs += ms;
stepMcpMs += ms; // Accumulate per-step for LLM calculation
}, widgetSnapshots);
const instrumentedTools = this.createInstrumentedTools(
(ms) => {
totalMcpMs += ms;
stepMcpMs += ms; // Accumulate per-step for LLM calculation
},
widgetSnapshots
);

// Build messages array if context is provided for multi-turn
const contextMessages = this.buildContextMessages(options?.context);
@@ -405,9 +419,12 @@
...(this.temperature !== undefined && {
temperature: this.temperature,
}),
...(options?.timeout !== undefined && {
timeout: options.timeout,
}),
// Use stopWhen with stepCountIs for controlling max agentic steps
// AI SDK v6+ uses this instead of maxSteps
stopWhen: stepCountIs(this.maxSteps),
stopWhen: this.resolveStopWhen(options?.stopWhen),
onStepFinish: () => {
const now = Date.now();
const stepDuration = now - lastStepEndTime;
4 changes: 4 additions & 0 deletions sdk/src/index.ts
@@ -82,6 +82,10 @@ export { EvalReportingError, SdkError } from "./errors.js";
// EvalAgent interface (for deterministic testing without concrete TestAgent)
export type { EvalAgent, PromptOptions } from "./EvalAgent.js";

// AI SDK stop condition helpers re-exported for TestAgent.prompt()
export { hasToolCall, stepCountIs } from "ai";
export type { StopCondition } from "ai";

// TestAgent
export { TestAgent } from "./TestAgent.js";
export type { TestAgentConfig } from "./TestAgent.js";