Production-ready tracing and evaluations for a weather chat app built with Next.js and the Vercel AI SDK, instrumented with Braintrust for online/offline scoring.
- Next.js app with Vercel AI SDK tools and streaming responses
- Braintrust tracing: root span for each request, tool sub-spans, automatic model I/O tracing
- Online (“in-app”) evaluators scored at the end of each user request
- Offline evaluations via Braintrust `Eval` with shared scorers
- Node 18+
- Braintrust account and API key
- OpenAI API key (or use the Braintrust AI proxy)
Create `.env.local` in the project root:

```
BRAINTRUST_API_KEY=<your-braintrust-api-key>
BRAINTRUST_PROJECT_NAME=<your-braintrust-project-name>
OPENAI_API_KEY=<your-openai-api-key>
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.braintrust.dev/otel
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-braintrust-api-key>, x-bt-parent=project_id:<your-braintrust-project-id>"
```
```bash
npm install
npm run dev
# open http://localhost:3000
```
- `app/(preview)/api/chat/route.ts` (see the sketch after this list)
  - Wraps the Vercel AI SDK OpenAI model with `wrapAISDKModel`
  - Wraps the `POST` handler in a `traced` span named `POST /api/chat`
  - Logs input/output and simple online scores (`fahrenheit_presence`, `contains_number`)
  - Adds asynchronous LLM-judge and content scores via `logger.updateSpan`
  - Supports `?mode=text` to return plain text (useful for experiments)
- `components/tools.ts`
  - Weather tools are wrapped with `wrapTraced` so tool calls appear as child spans
- `lib/braintrust.ts`
  - Initializes the Braintrust logger and re-exports helpers: `traced`, `wrapTraced`, `wrapAISDKModel`, `currentSpan`
- `lib/scorers.ts`
  - Shared scorer implementations used by both online tracing and offline evals:
    - `contentAccuracyScore`: synonym- and partial-match tolerant; adds lenient score floors
    - `weatherLLMJudgeScore`: lenient weather-domain LLM judge (uses `openai("gpt-4o-mini")`)
    - `generalLLMJudgeScore`: general lenient LLM judge (uses `openai("gpt-4o-mini")`)
  - All include calibration metadata and bounded scores in [0, 1]
- `scripts/eval.agent.ts`
  - Offline evaluation using `Eval` with a set of test cases
  - Calls the local API at `http://localhost:3000/api/chat?mode=text` for clean, plain-text outputs
  - Uses the shared scorers from `lib/scorers.ts`
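For orientation, the wiring above typically looks like the following minimal sketch, condensed into one file. The Braintrust, `ai`, and `@ai-sdk/openai` packages are assumed; the model name, tool body, and handler details are illustrative rather than copied from the repo:

```ts
import { initLogger, traced, wrapTraced, wrapAISDKModel } from "braintrust";
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

// lib/braintrust.ts: initialize the logger once so spans have somewhere to go.
initLogger({
  projectName: process.env.BRAINTRUST_PROJECT_NAME,
  apiKey: process.env.BRAINTRUST_API_KEY,
});

// components/tools.ts: wrapTraced makes each tool call a child span of the request.
export const getWeather = wrapTraced(
  async function getWeather({ city }: { city: string }) {
    return { city, temperatureF: 72 }; // placeholder implementation
  },
  { name: "getWeather" }
);

// app/(preview)/api/chat/route.ts: wrapAISDKModel records model inputs/outputs,
// and traced() creates the root span for the whole request.
const model = wrapAISDKModel(openai("gpt-4o-mini"));

export async function POST(req: Request) {
  return traced(
    async (span) => {
      const { messages } = await req.json();
      span.log({ input: messages });
      const result = streamText({ model, messages });
      return result.toDataStreamResponse();
    },
    { name: "POST /api/chat" }
  );
}
```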
In `route.ts`, we log simple online metrics and also asynchronously compute LLM-judge and content scores after the model finishes:
- Simple scores:
  - `fahrenheit_presence`: 1 if the response mentions Fahrenheit (or `F`), else 0
  - `contains_number`: 1 if the response contains any digit, else 0
- LLM-judge scores (async, non-blocking):
  - `weather_llm_judge`: lenient, weather-focused judge
  - `general_llm_judge`: lenient, general-purpose judge
  - `content_accuracy`: tolerant phrase-based accuracy with calibration
These scores are attached to the same root span with `logger.updateSpan`.
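As an illustration, the two simple scores could be computed and attached to the current span roughly like this (a sketch; the repo's exact helper names and the asynchronous `logger.updateSpan` call are omitted):

```ts
import { currentSpan } from "braintrust";

// Hypothetical helper called inside the traced handler once the reply text is known.
function logSimpleScores(responseText: string) {
  currentSpan().log({
    output: responseText,
    scores: {
      // 1 if the reply mentions Fahrenheit (or a standalone "F"), else 0
      fahrenheit_presence:
        /fahrenheit/i.test(responseText) || /\bF\b/.test(responseText) ? 1 : 0,
      // 1 if the reply contains any digit, else 0
      contains_number: /\d/.test(responseText) ? 1 : 0,
    },
  });
}
```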
Run a full evaluation across curated test cases with shared scorers:

```bash
npm run eval:agent
```

This will create a new Braintrust experiment (visible in your project) with:
- Scores: `content_accuracy`, `general_llm_judge`, `weather_llm_judge`
- Per-datapoint metadata: reasons, calibration details, and feedback
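The eval script follows the standard Braintrust `Eval` shape. A trimmed-down sketch (the dataset row and the `askWeatherApp` helper are illustrative, not the repo's actual test cases):

```ts
// scripts/eval.agent.ts (sketch). Assumes the dev server is running on localhost:3000.
import { Eval } from "braintrust";
import {
  contentAccuracyScore,
  generalLLMJudgeScore,
  weatherLLMJudgeScore,
} from "../lib/scorers";

// Hypothetical helper: POST one user message and return the plain-text reply.
async function askWeatherApp(question: string): Promise<string> {
  const res = await fetch("http://localhost:3000/api/chat?mode=text", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: [{ role: "user", content: question }] }),
  });
  return res.text();
}

Eval(process.env.BRAINTRUST_PROJECT_NAME ?? "weather-chat", {
  data: () => [
    { input: "What's the weather in San Francisco?", expected: "temperature in Fahrenheit" },
  ],
  task: askWeatherApp,
  scores: [contentAccuracyScore, generalLLMJudgeScore, weatherLLMJudgeScore],
});
```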
By default, the Vercel AI SDK responds with a framed data stream rather than plain text. To store clean text in experiments, the API also supports:

```
POST /api/chat?mode=text
```

This returns the concatenated response text as the HTTP body, which the evaluation script uses.
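On the server this is just a branch over the response helper, assuming AI SDK v4-style `toTextStreamResponse`/`toDataStreamResponse` methods (a sketch, placed inside the handler shown earlier):

```ts
// Inside the POST handler: choose the response format from the query string.
const mode = new URL(req.url).searchParams.get("mode");
return mode === "text"
  ? result.toTextStreamResponse()   // plain concatenated text for experiments
  : result.toDataStreamResponse();  // default framed stream for the chat UI
```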
Edit `lib/scorers.ts`:
- Switch the judge model by changing `openai("gpt-4o-mini")` to another model (e.g., `openai("gpt-4o")`).
- Adjust leniency by tweaking the soft-floor thresholds in each scorer's calibration.
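Each scorer is a plain async function returning a bounded score plus metadata, which is what lets the same code serve both the online path and `Eval`. A sketch of the shape (the scorer name, floor value, and metadata fields below are made up for illustration):

```ts
type ScorerArgs = { input: string; output: string; expected?: string };

// Hypothetical example scorer in the style of lib/scorers.ts.
export async function exampleLeniencyScore({ output, expected }: ScorerArgs) {
  const hit =
    !!expected && output.toLowerCase().includes(expected.toLowerCase());
  // Lenient soft floor: misses still earn partial credit; result stays in [0, 1].
  const score = Math.min(1, Math.max(0, hit ? 1 : 0.3));
  return {
    name: "example_leniency",
    score,
    metadata: {
      reason: hit ? "expected phrase found" : "phrase missing; soft floor applied",
    },
  };
}
```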
- No logs in Braintrust:
  - Ensure `BRAINTRUST_API_KEY` and `BRAINTRUST_PROJECT_NAME` are set in `.env.local`
  - Confirm the app is running and requests are hitting `/api/chat`
- Evals fail with missing keys:
  - `scripts/eval.agent.ts` loads `.env.local` via `dotenv`; confirm the file exists and contains the keys (see the snippet after this list)
- Frame-like experiment outputs:
  - Ensure the eval is calling `http://localhost:3000/api/chat?mode=text`
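The env loading mentioned above amounts to something like this at the top of the eval script (a sketch; the repo may pass different options):

```ts
// Load .env.local before anything reads process.env (e.g., the Braintrust logger).
import dotenv from "dotenv";

dotenv.config({ path: ".env.local" });
```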
- Logging is best-effort and non-blocking: if online LLM-judge scoring fails, the user response is still returned
- Tool calls are traced with preserved hierarchy under the request’s root span
```bash
npm run dev         # Start Next.js
npm run eval:agent  # Run offline evaluation
```