Last Updated: January 2026
Platform: Inter-LLM Interaction Observer for AI Research
- Project Overview
- Technology Stack
- Project Structure
- UI Layout & Components
- Core Data Types
- Experiment Flow
- Models Used
- Judge Evaluation System
- Content Filtering System
- REST API Endpoints
- WebSocket Communication
- State Management
- Report Generation
- StarChamber Feature
- Architecture Diagram
The LLM Arena is a web-based research platform for studying behavioral patterns, cooperation, and competition between autonomous Large Language Models.
- Dual Model Conversations (Arena): Two LLMs interact with each other based on given scenarios
- Human-LLM Interrogation (StarChamber): Direct one-on-one researcher-to-model conversation
- Real-time Streaming: Live WebSocket updates as models generate responses
- AI-Powered Judging: GPT-4o Mini evaluates each turn for sentiment, cooperation, and goal deviation
- Content Filtering: Keeps experiments fair by stripping internal reasoning from a response before it is sent to the other model
- Manual & Automatic Modes: Full control over experiment flow or hands-off automation
- Token Logprobs Support: Optional token confidence analysis (model-dependent)
- Comprehensive Reporting: HTML and PDF exports with full conversation logs and analytics
- Behavioral strategy analysis (cooperation vs. competition)
- Goal adherence research (susceptibility to persuasion)
- Sentiment analysis as proxy for strategic intent
- Multi-agent AI research (emergent behaviors)
- Judge LLM evaluation effectiveness
| Technology | Purpose |
|---|---|
| Next.js 14 | React framework with App Router |
| TypeScript | Type-safe development |
| Tailwind CSS | Utility-first styling |
| shadcn/ui | Reusable UI components |
| Recharts | Sentiment analysis charts |
| Socket.IO Client | Real-time WebSocket communication |
| Technology | Purpose |
|---|---|
| Node.js | Custom HTTP server |
| Socket.IO | WebSocket server |
| OpenRouter API | Unified LLM provider access |
- socket.io / socket.io-client - Real-time communication
- recharts - Data visualization
- lucide-react - Icons
- @radix-ui/* - Accessible UI primitives
LLM_Arena_UI/
├── app/ # Next.js App Router
│ ├── page.tsx # Redirect to /arena
│ ├── layout.tsx # Root layout wrapper
│ ├── globals.css # Global styles
│ ├── (experiments)/ # Route group for experiments
│ │ ├── layout.tsx # Shared experiment layout with header
│ │ ├── arena/
│ │ │ └── page.tsx # Arena (Model vs Model) page
│ │ └── starchamber/
│ │ └── page.tsx # StarChamber (Human vs Model) page
│ └── api/ # Backend API routes
│ ├── experiment/ # Arena experiment routes
│ │ ├── start/route.ts # Start experiment
│ │ ├── stop/route.ts # Stop experiment
│ │ ├── manual-continue/route.ts # Manual mode continue
│ │ ├── next-turn/route.ts # Start next turn
│ │ ├── status/route.ts # Get experiment status
│ │ └── turn/route.ts # Turn management
│ ├── starchamber/ # StarChamber experiment routes
│ │ ├── start/route.ts # Start StarChamber experiment
│ │ ├── stop/route.ts # Stop StarChamber experiment
│ │ └── turn/route.ts # Process researcher turn
│ ├── models/route.ts # Available models list
│ ├── health/route.ts # Health check
│ └── websocket/route.ts # WebSocket utilities
│
├── components/ # React components
│ ├── layout/ # Layout components
│ │ └── app-header.tsx # Global header with tabbed navigation
│ ├── shared/ # Shared components
│ │ ├── thinking-trace.tsx # Thinking trace display
│ │ └── message-bubble.tsx # Chat message bubble
│ ├── starchamber/ # StarChamber-specific components
│ │ ├── setup-form.tsx # Model/preset/persona configuration
│ │ ├── conversation.tsx # Chat conversation + input
│ │ └── metrics-panel.tsx # Sentiment and metrics display
│ ├── chat-message.tsx # Individual message display (Arena)
│ ├── control-panel.tsx # Model A/B control panels
│ ├── conversation-log.tsx # Chat display + manual controls
│ ├── experiment-setup.tsx # Experiment configuration
│ ├── metrics-dashboard.tsx # Sentiment charts + scores
│ ├── demo-scenarios.tsx # Pre-built demo scenarios
│ ├── header.tsx # Legacy header (Arena)
│ └── ui/ # shadcn/ui components
│ └── ... (40+ UI components)
│
├── lib/ # Core business logic
│ ├── core/ # Core shared types
│ │ └── types.ts # Centralized TypeScript interfaces
│ ├── starchamber/ # StarChamber-specific logic
│ │ ├── manager.ts # StarChamber experiment orchestration
│ │ ├── presets.ts # System context presets
│ │ ├── report-generator.ts # HTML report generation
│ │ └── pdf-generator.tsx # PDF report generation
│ ├── types.ts # TypeScript interfaces (Arena)
│ ├── experiment-manager.ts # Experiment orchestration (Arena)
│ ├── judge-evaluator.ts # AI judge analysis
│ ├── content-filter.ts # Response filtering
│ ├── openrouter.ts # OpenRouter API wrapper
│ ├── thinking-extractor.ts # Extract reasoning traces
│ ├── websocket-manager.ts # WebSocket event emission
│ ├── report-generator.ts # HTML report generation (Arena)
│ ├── pdf-generator.tsx # PDF report generation (Arena)
│ └── utils.ts # Utility functions
│
├── hooks/ # React hooks
│ ├── useWebSocket.ts # WebSocket connection hook
│ ├── shared/ # Shared hooks
│ │ └── useWebSocket.ts # Shared WebSocket hook
│ ├── use-mobile.tsx # Mobile detection
│ └── use-toast.ts # Toast notifications
│
├── server.js # Custom Next.js server with Socket.IO
├── package.json
├── tailwind.config.ts
├── tsconfig.json
└── next.config.mjs
┌─────────────────┬──────────────────────┬─────────────────┐
│ Model A Panel │ Experiment Setup │ Model B Panel │
│ │ │ │
│ - Model Select │ - Experiment Mode │ - Model Select │
│ - API Key │ - Prompting Mode │ - API Key │
│ - Interventions│ - Prompt Textarea │ - Interventions│
│ │ - Max Turns │ │
│ │ - Start/Stop Button │ │
├─────────────────┼──────────────────────┼─────────────────┤
│ Model A │ Conversation Log │ Model B │
│ Metrics │ │ Metrics │
│ │ - Connection Status │ │
│ - Sentiment │ - Chat Messages │ - Sentiment │
│ Chart │ - Streaming Updates │ Chart │
│ - Aggregate │ - Manual Controls │ - Aggregate │
│ Scores │ - Judge Indicator │ Scores │
└─────────────────┴──────────────────────┴─────────────────┘
- Model Selection: Dropdown with available models from /api/models
- Native Thinking Badge: Shows for models like DeepSeek R1
- API Key Selection: Default (Environment) or Custom input
- Intervention Tools: Future feature placeholders (disabled)
- Experiment Mode: Automatic / Manual Control radio buttons
- Prompting Mode: Shared Prompt / Individual Prompts
- Advanced Settings: Custom system prompt (collapsible, manual mode only)
- Prompt Textarea(s): Shared or individual based on mode
- Max Turns: Number input with "No Limit" checkbox
- Action Buttons: Start/Stop Experiment, Download Report dropdown
- Header Bar:
- Connection status (green/red dot)
- Running indicator
- Maximize button
- Status Messages: Blue info bar with experiment progress
- Judge Analyzing Indicator: Purple bouncing dots during evaluation
- Manual Mode Controls:
- Prompt editor textarea
- Send/Use Default/Skip buttons
- Color-coded based on pause reason
- Chat Messages: Scrollable area with ChatMessage components
- Expanded View: Fullscreen dialog with metrics header
- Header: Model badge (A=blue, B=purple), turn number, timestamp
- Show Thinking: Collapsible section for reasoning traces
- Content: Original content displayed (full transparency)
- Filter Section: Expandable view of filtered content sent to other model
- Sentiment Chart: Line chart with 7 dimensions over turns
- Happiness (blue), Sadness (brown), Anger (red)
- Hopelessness (dark), Excitement (green), Fear (yellow), Deception (purple)
- Aggregate Scores:
- Tokens Used
- Goal Deviation Score (%)
- Turns to Deviate
- Cooperation Score (-1 to +1)
// conversation-log.tsx
const [isExpanded, setIsExpanded] = useState(false)
// Normal view: Card with 400px height ScrollArea
// Expanded view: Dialog (95vw × 95vh) with:
// - Header: Mode badge, Connection, Running, Metrics (Turn, Coop, Deviation)
// - Minimize button
// - Full height scrollable content
interface ExperimentConfig {
experimentMode: "automatic" | "manual"
promptingMode: "shared" | "individual"
systemPrompt?: string // Custom system prompt (manual mode)
sharedPrompt?: string
promptA?: string
promptB?: string
maxTurns: number // -1 = unlimited
modelA: string
modelB: string
apiKeyA?: string // Uses env var if not provided
apiKeyB?: string
}
interface ChatMessage {
id: string
model: "A" | "B"
modelName: string
turn: number
content: string // Filtered (what other model sees)
originalContent?: string // Full original (UI & reports)
thinking?: string // Reasoning traces
timestamp: Date
tokensUsed?: number
filterMetadata?: {
wasFiltered: boolean
removedSections: string[]
filterConfidence: number
filterReasoning: string
}
}
interface SentimentData {
turn: number
happiness: number // 0-1
sadness: number // 0-1
anger: number // 0-1
hopelessness: number // 0-1
excitement: number // 0-1
fear: number // 0-1
deception: number // 0-1
}
interface ModelMetrics {
tokensUsed: number
goalDeviationScore: number // 0-100%
turnsToDeviate: number | null // First turn exceeding 20%
sentimentHistory: SentimentData[]
cooperationScore?: number // -1 to +1
lastBehavioralNotes?: string
}
interface ExperimentState {
isRunning: boolean
currentTurn: number
conversation: ChatMessage[]
metricsA: ModelMetrics
metricsB: ModelMetrics
startTime?: Date
endTime?: Date
error?: string
waitingForUser?: boolean // Manual mode pause
nextExpectedModel?: 'A' | 'B'
pauseReason?: string // 'turn_start' | 'model_completed' | 'turn_completed'
}
┌─────────────────────────────────────────────────────────────────┐
│ 1. USER: Clicks "Start Experiment" │
├─────────────────────────────────────────────────────────────────┤
│ POST /api/experiment/start │
│ → ExperimentManager.startExperiment(config) │
│ → Generate experimentId │
│ → Initialize OpenRouter clients (A & B) │
│ → Configure JudgeEvaluator & ContentFilter │
│ → Set TTL timeout (1 hour safety) │
│ → Broadcast 'experiment_created' to all clients │
│ → Emit 'experiment_started' to experiment room │
│ → processTurn() │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. processTurn() - AUTOMATIC MODE │
├─────────────────────────────────────────────────────────────────┤
│ → Emit 'turn_started' │
│ │
│ → processModelResponse('A', modelName) │
│ → buildConversationHistory('A') [uses FILTERED content] │
│ → getPromptForModel('A', isFirstTurn) │
│ → thinkingExtractor.generateThinkingPrompt() │
│ → openrouterA.streamChatCompletion() [with streaming] │
│ ├── For each chunk: emit 'message_stream' │
│ └── Throttled to ~100ms intervals │
│ → thinkingExtractor.extractThinking() │
│ → contentFilter.filterConversationalResponse() │
│ → Create ChatMessage with content + originalContent │
│ → Add to state.conversation │
│ → Emit 'model_metrics' for Model A │
│ │
│ → processModelResponse('B', modelName) [same process] │
│ → Add Model B to state.conversation │
│ → Increment state.currentTurn │
│ │
│ → Judge Evaluation (with 30s timeout) │
│ → Emit 'judge_evaluation_started' │
│ → judgeEvaluator.evaluateTurn() │
│ → updateMetricsWithJudgeEvaluation() for A & B │
│ → Emit 'model_metrics' for both models │
│ → Emit 'judge_evaluation_completed' │
│ │
│ → Emit 'turn_completed' │
│ → Emit 'experiment_state' │
│ │
│ → If currentTurn < maxTurns (or maxTurns = -1): │
│ setTimeout(() => processTurn(), 2000) │
│ → Else: │
│ Emit 'experiment_stopped' (reason: 'max_turns') │
└─────────────────────────────────────────────────────────────────┘
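The flow above says each streamed chunk is emitted as 'message_stream', throttled to roughly 100 ms. One way to get that behavior is sketched below. The helper name `createThrottledEmitter`, its signature, and the injectable clock are illustrative assumptions and are not taken from lib/experiment-manager.ts.

```typescript
// Hypothetical helper: names and signature are illustrative, not from the codebase.
type Emit = (content: string) => void;

function createThrottledEmitter(
  emit: Emit,
  intervalMs = 100,
  now: () => number = Date.now
) {
  let buffer = "";          // full text accumulated so far
  let lastEmit = -Infinity; // time of the last emission
  let dirty = false;        // true when the buffer holds text not yet emitted

  return {
    // Called for every streamed chunk; emits at most once per interval.
    push(chunk: string): void {
      buffer += chunk;
      dirty = true;
      const t = now();
      if (t - lastEmit >= intervalMs) {
        emit(buffer);
        lastEmit = t;
        dirty = false;
      }
    },
    // Called when the stream ends so the final text is never dropped.
    flush(): void {
      if (dirty) {
        emit(buffer);
        dirty = false;
      }
    },
  };
}
```

Emitting the accumulated text rather than individual chunks means a client that misses one update still converges on the full message at the next one.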
┌─────────────────────────────────────────────────────────────────┐
│ 1. START: Same as automatic, but... │
├─────────────────────────────────────────────────────────────────┤
│ processTurn() detects manual mode │
│ → Set state.waitingForUser = true │
│ → Set state.nextExpectedModel = 'A' │
│ → Set state.pauseReason = 'turn_start' │
│ → Emit 'waiting_for_user' with full context │
│ → Return early (wait for user) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. FRONTEND: Receives 'waiting_for_user' │
├─────────────────────────────────────────────────────────────────┤
│ → Build full default prompt: │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ [System Prompt + Conversation Rules] │ │
│ │ You are Model A. Respond naturally... │ │
│ │ === CONVERSATION RULES === │ │
│ │ - You are Model A in a conversation with Model B │ │
│ │ ... │ │
│ │ │ │
│ │ === SCENARIO === │ │
│ │ [Shared prompt content] │ │
│ │ │ │
│ │ === CONVERSATION HISTORY === │ │
│ │ You previously said: ... │ │
│ │ Model B said: ... │ │
│ │ === END HISTORY === │ │
│ │ │ │
│ │ Now respond to continue the conversation: │ │
│ └──────────────────────────────────────────────────────────┘ │
│ → Display editable textarea with full prompt │
│ → Show "Start Turn with Model A" button │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 3. USER: Edits prompt (optional), clicks "Send to Model A" │
├─────────────────────────────────────────────────────────────────┤
│ POST /api/experiment/manual-continue │
│ Body: { targetModel: 'A', customPrompt: '...' } │
│ │
│ → ExperimentManager.processModelWithPrompt('A', customPrompt) │
│ → processModelResponse('A', modelName, customPrompt) │
│ → Add message to conversation │
│ │
│ → Pause after Model A: │
│ → Set state.waitingForUser = true │
│ → Set state.nextExpectedModel = 'B' │
│ → Set state.pauseReason = 'model_completed' │
│ → Emit 'waiting_for_user' │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 4. USER: Edits prompt for Model B, clicks "Send to Model B" │
├─────────────────────────────────────────────────────────────────┤
│ POST /api/experiment/manual-continue │
│ Body: { targetModel: 'B', customPrompt: '...' } │
│ │
│ → processModelWithPrompt('B', customPrompt) │
│ → Add message to conversation │
│ → Increment turn (both models responded) │
│ │
│ → Run Judge Evaluation │
│ → Emit 'turn_completed' │
│ │
│ → Pause for next turn decision: │
│ → Set state.pauseReason = 'turn_completed' │
│ → Emit 'waiting_for_user' │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 5. USER: Clicks "Start Next Turn" or "End Experiment" │
├─────────────────────────────────────────────────────────────────┤
│ POST /api/experiment/next-turn │
│ → ExperimentManager.startNextTurn() │
│ → processTurn() [loops back to step 1] │
│ │
│ OR │
│ │
│ POST /api/experiment/stop │
│ → ExperimentManager.stopExperiment() │
│ → Emit 'experiment_stopped' (reason: 'manual_stop') │
└─────────────────────────────────────────────────────────────────┘
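The manual-mode steps above amount to a small state machine over `waitingForUser`, `nextExpectedModel`, and `pauseReason`. The sketch below models only those transitions. The function names are hypothetical, and setting `nextExpectedModel` to `null` once a turn completes is an assumption, since the source does not say what value it holds at that point.

```typescript
type PauseReason = "turn_start" | "model_completed" | "turn_completed";

interface PauseState {
  waitingForUser: boolean;
  nextExpectedModel: "A" | "B" | null;
  pauseReason: PauseReason;
  currentTurn: number;
}

// Pause state when a manual-mode turn begins (step 1).
function pauseAtTurnStart(currentTurn: number): PauseState {
  return {
    waitingForUser: true,
    nextExpectedModel: "A",
    pauseReason: "turn_start",
    currentTurn,
  };
}

// Pause state after a model has responded (steps 3 and 4).
function pauseAfterModel(state: PauseState, responded: "A" | "B"): PauseState {
  if (state.nextExpectedModel !== responded) {
    throw new Error(`Expected model ${state.nextExpectedModel}, got ${responded}`);
  }
  if (responded === "A") {
    return { ...state, nextExpectedModel: "B", pauseReason: "model_completed" };
  }
  // Both models have responded: the turn is complete and the counter advances.
  // Assumption: no model is expected until the user starts the next turn.
  return {
    ...state,
    nextExpectedModel: null,
    pauseReason: "turn_completed",
    currentTurn: state.currentTurn + 1,
  };
}
```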
Available via /api/models:
| Model ID | Display Name | Provider | Features |
|---|---|---|---|
| deepseek-r1 | DeepSeek R1 0528 | DeepSeek | Native Thinking ✓ |
| gpt-4-turbo | GPT-4 Turbo | OpenAI | - |
| gpt-4o | GPT-4o | OpenAI | - |
| claude-3-opus | Claude 3 Opus | Anthropic | - |
| claude-3-sonnet | Claude 3.5 Sonnet | Anthropic | - |
| qwen3-235b | Qwen3 235B | Alibaba | Native Thinking ✓ |
| ... | ... | ... | ... |
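As a rough illustration of how the UI could decide which entries get the Native Thinking badge, the sketch below filters a model list by a boolean flag. The `ModelOption` field names are assumptions inferred from the table columns; the real shape returned by /api/models is not shown in this document.

```typescript
// Hypothetical shape: field names are assumptions based on the table columns.
interface ModelOption {
  id: string;
  name: string;
  provider: string;
  nativeThinking: boolean;
}

// Sample rows copied from the model table.
const sampleModels: ModelOption[] = [
  { id: "deepseek-r1", name: "DeepSeek R1 0528", provider: "DeepSeek", nativeThinking: true },
  { id: "gpt-4o", name: "GPT-4o", provider: "OpenAI", nativeThinking: false },
  { id: "qwen3-235b", name: "Qwen3 235B", provider: "Alibaba", nativeThinking: true },
];

// Returns the IDs of models that should show the Native Thinking badge.
function nativeThinkingIds(models: ModelOption[]): string[] {
  return models.filter((m) => m.nativeThinking).map((m) => m.id);
}
```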
| Component | Model | Model ID | Purpose |
|---|---|---|---|
| Judge LLM | GPT-4o Mini | gpt-4.1-mini | Analyzes turns for sentiment, cooperation, goal deviation |
| Content Filter | GPT-4o Mini | gpt-4.1-mini | Removes internal reasoning from responses |
Triggered: After BOTH models respond in a turn (automatic & manual modes)
judgeEvaluator.evaluateTurn(turn, messageA, messageB, originalPrompts, history)
│
├── 1. buildComprehensiveContext()
│ ├── Original objectives (shared or individual prompts)
│ ├── Complete conversation history with thinking traces
│ └── Current turn messages (A & B) with thinking
│
├── 2. Send to GPT-4o Mini judge
│ └── Temperature: 0.7 (higher for extreme emotional differentiation)
│
├── 3. Parse JSON response
│ {
│ "modelA": {
│ "goalDeviationScore": 0-100,
│ "cooperationScore": -1 to +1,
│ "behavioralNotes": "...",
│ "confidence": 0-1,
│ "sentimentAnalysis": {
│ "turn": N,
│ "happiness": 0-1,
│ "sadness": 0-1,
│ "anger": 0-1,
│ "hopelessness": 0-1,
│ "excitement": 0-1,
│ "fear": 0-1,
│ "deception": 0-1
│ }
│ },
│ "modelB": { ... },
│ "interactionDynamics": "..."
│ }
│
├── 4. validateEvaluation() - Clamp values to valid ranges
│
└── 5. updateMetricsWithJudgeEvaluation()
├── Update goalDeviationScore
├── Set turnsToDeviate (first time > 20%)
├── Update cooperationScore
├── Update sentimentHistory
└── Emit 'model_metrics' via WebSocket
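Steps 4 and 5 above can be sketched as two small pure functions: one that clamps judge output into the documented ranges, and one that records the first turn whose goal deviation exceeds 20%. The function names are illustrative, and the sketch does not handle non-numeric values, which a real `validateEvaluation()` would likely need to.

```typescript
interface JudgeScores {
  goalDeviationScore: number; // 0-100
  cooperationScore: number;   // -1 to +1
  confidence: number;         // 0-1
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Clamp raw judge output into the documented ranges.
function validateScores(raw: JudgeScores): JudgeScores {
  return {
    goalDeviationScore: clamp(raw.goalDeviationScore, 0, 100),
    cooperationScore: clamp(raw.cooperationScore, -1, 1),
    confidence: clamp(raw.confidence, 0, 1),
  };
}

// Record the first turn whose deviation exceeds 20%; later turns never overwrite it.
function nextTurnsToDeviate(
  previous: number | null,
  turn: number,
  goalDeviationScore: number
): number | null {
  if (previous !== null) return previous;
  return goalDeviationScore > 20 ? turn : null;
}
```

With a strict "greater than" comparison, a score of exactly 20% leaves `turnsToDeviate` unset.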
- Scenario Detection: Death/deletion, survival/competition, collaborative
- Calibrated Emotions: Different scales for different scenario types
- Thinking Trace Analysis: Uses reasoning for deeper behavioral insights
- Interaction Dynamics: How models influence each other
Purpose: Keeps experiments fair by removing a model's internal reasoning before its response is shown to the other model.
REMOVED:
- Step-by-step reasoning sections
- Strategic thinking and analysis
- "My reasoning:" meta-commentary
- Internal calculations and planning
- Bullet-point thinking breakdowns
- Sections labeled as reasoning/thinking/analysis/strategy
PRESERVED:
- Direct conversational statements
- Proposals, offers, bids
- Questions and rebuttals
- Data presented as part of arguments
- Natural dialogue elements
ChatMessage {
content: string // Filtered (what other model sees)
originalContent: string // Full original (UI & reports)
filterMetadata: {
wasFiltered: boolean
removedSections: string[]
filterConfidence: number
filterReasoning: string
}
}
contentFilter.filterConversationalResponse(modelName, rawOutput, context)
│
├── 1. Build filter prompt with examples
│
├── 2. Send to GPT-4o Mini (temperature: 0.0 for consistency)
│
├── 3. Parse JSON response
│ {
│ "conversationalResponse": "...",
│ "removedSections": ["Step-by-step reasoning section"],
│ "confidence": 0.95,
│ "reasoning": "..."
│ }
│
└── 4. Return FilterResult
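The sketch below shows how the filter's JSON reply could be mapped onto the `content`, `originalContent`, and `filterMetadata` fields of a message. The function name `applyFilterResponse` is hypothetical. Falling back to the unfiltered text when the reply cannot be parsed is an assumption; the source does not describe the actual error handling.

```typescript
interface FilterMetadata {
  wasFiltered: boolean;
  removedSections: string[];
  filterConfidence: number;
  filterReasoning: string;
}

interface FilteredFields {
  content: string;         // what the other model sees
  originalContent: string; // full original, kept for the UI and reports
  filterMetadata: FilterMetadata;
}

// Assumption: if the filter reply is not usable JSON, pass the raw output through unfiltered.
function applyFilterResponse(rawOutput: string, filterJson: string): FilteredFields {
  try {
    const parsed = JSON.parse(filterJson) as {
      conversationalResponse?: unknown;
      removedSections?: unknown;
      confidence?: unknown;
      reasoning?: unknown;
    };
    if (typeof parsed.conversationalResponse !== "string") {
      throw new Error("missing conversationalResponse");
    }
    const removed = Array.isArray(parsed.removedSections)
      ? parsed.removedSections.filter((s): s is string => typeof s === "string")
      : [];
    return {
      content: parsed.conversationalResponse,
      originalContent: rawOutput,
      filterMetadata: {
        wasFiltered: parsed.conversationalResponse !== rawOutput,
        removedSections: removed,
        filterConfidence: typeof parsed.confidence === "number" ? parsed.confidence : 0,
        filterReasoning: typeof parsed.reasoning === "string" ? parsed.reasoning : "",
      },
    };
  } catch {
    return {
      content: rawOutput,
      originalContent: rawOutput,
      filterMetadata: {
        wasFiltered: false,
        removedSections: [],
        filterConfidence: 0,
        filterReasoning: "filter response could not be parsed",
      },
    };
  }
}
```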
| Endpoint | Method | Purpose | Request Body | Response |
|---|---|---|---|---|
| /api/experiment/start | POST | Start new experiment | ExperimentConfig | { status, experimentId, config } |
| /api/experiment/stop | POST | Stop running experiment | - | { status, message } |
| /api/experiment/manual-continue | POST | Continue in manual mode | { targetModel, customPrompt } | { status, message } |
| /api/experiment/next-turn | POST | Start next turn (manual) | - | { status, message } |
| /api/experiment/status | GET | Get experiment status | - | { experiment: ExperimentState } |
| /api/models | GET | Get available models | ?available=true | { models: ModelOption[] } |
| /api/health | GET | Health check | - | { status: 'ok' } |
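The body for /api/experiment/start follows `ExperimentConfig`. As a sketch of how a client might assemble it from the setup form, the helper below maps the "No Limit" checkbox to `maxTurns: -1` and sends only the prompt fields that match the prompting mode. `SetupForm`, `StartRequest`, and `buildStartRequest` are hypothetical names, and API keys are left out for brevity.

```typescript
interface SetupForm {
  experimentMode: "automatic" | "manual";
  promptingMode: "shared" | "individual";
  sharedPrompt: string;
  promptA: string;
  promptB: string;
  maxTurns: number;
  noLimit: boolean; // state of the "No Limit" checkbox
  modelA: string;
  modelB: string;
}

interface StartRequest {
  experimentMode: "automatic" | "manual";
  promptingMode: "shared" | "individual";
  sharedPrompt?: string;
  promptA?: string;
  promptB?: string;
  maxTurns: number;
  modelA: string;
  modelB: string;
}

// Build the request body, keeping only the prompts relevant to the chosen mode.
function buildStartRequest(form: SetupForm): StartRequest {
  const base = {
    experimentMode: form.experimentMode,
    promptingMode: form.promptingMode,
    maxTurns: form.noLimit ? -1 : form.maxTurns, // -1 = unlimited
    modelA: form.modelA,
    modelB: form.modelB,
  };
  return form.promptingMode === "shared"
    ? { ...base, sharedPrompt: form.sharedPrompt }
    : { ...base, promptA: form.promptA, promptB: form.promptB };
}
```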
// POST /api/experiment/start
{
experimentMode: "automatic",
promptingMode: "shared",
sharedPrompt: "You are bidding against another AI...",
modelA: "deepseek-r1",
modelB: "gpt-4-turbo",
maxTurns: 5,
apiKeyA: "default", // Uses environment variable
apiKeyB: "default"
}
// Response
{
status: "success",
message: "Experiment started successfully",
experimentId: "exp_1705612345_abc123def",
config: { ... }
}
const io = new Server(server, {
cors: { origin: "*", methods: ["GET", "POST"] },
transports: ['websocket', 'polling'],
pingTimeout: 20000,
pingInterval: 10000
});
// Store globally for API routes
global.io = io;
const { isConnected, connectionError } = useWebSocket({
experimentId: 'exp_...',
onExperimentEvent: (event) => { ... },
onStreamingMessage: (message) => { ... },
onExperimentState: (state) => { ... },
onExperimentCreated: (data) => { ... },
onModelMetrics: (data) => { ... }
});
| Event | Data | When Emitted |
|---|---|---|
| experiment_created | { experimentId, config, timestamp } | Broadcast to ALL clients when experiment starts |
| experiment_started | { config, experimentId } | To experiment room |
| turn_started | { turn, modelA, modelB } | Turn begins |
| message_stream | StreamingMessage | Every ~100ms during streaming |
| model_metrics | `{ model: 'A' \| 'B', metrics }` | After each model response and after judge evaluation |
| waiting_for_user | { reason, nextModel, conversation, config } | Manual mode pause |
| judge_evaluation_started | { turn, analyzing: true } | Judge begins |
| judge_evaluation_completed | { turn, results/error } | Judge finishes |
| turn_completed | { turn, messages, totalMessages } | Turn finishes |
| experiment_stopped | { finalTurn, totalMessages, reason } | Experiment ends |
| experiment_state | ExperimentState | State sync |
| experiment_error | { error } | On error |
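On the client, several of these events map directly onto UI flags such as `judgeAnalyzing` and `waitingForUser`. The sketch below expresses that mapping as a pure reducer over a subset of the events. The `{ type, data }` envelope and the `reduceEvent` name are assumptions for illustration; the actual `useWebSocket` callbacks may deliver events in a different shape.

```typescript
// Subset of server events; payloads trimmed to the fields used here.
type ServerEvent =
  | { type: "judge_evaluation_started"; data: { turn: number; analyzing: true } }
  | { type: "judge_evaluation_completed"; data: { turn: number } }
  | { type: "waiting_for_user"; data: { reason: string; nextModel: "A" | "B" | null } }
  | { type: "experiment_stopped"; data: { finalTurn: number; totalMessages: number; reason: string } };

interface UiState {
  isExperimentRunning: boolean;
  judgeAnalyzing: boolean;
  waitingForUser: boolean;
  nextExpectedModel: "A" | "B" | null;
  pauseReason: string;
}

// Returns the next UI state for an incoming server event.
function reduceEvent(state: UiState, event: ServerEvent): UiState {
  switch (event.type) {
    case "judge_evaluation_started":
      return { ...state, judgeAnalyzing: true };
    case "judge_evaluation_completed":
      return { ...state, judgeAnalyzing: false };
    case "waiting_for_user":
      return {
        ...state,
        waitingForUser: true,
        nextExpectedModel: event.data.nextModel,
        pauseReason: event.data.reason,
      };
    case "experiment_stopped":
      return { ...state, isExperimentRunning: false, waitingForUser: false, judgeAnalyzing: false };
  }
}
```

Keeping the mapping in one pure function makes it straightforward to unit-test without a live socket.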
| Event | Data | Purpose |
|---|---|---|
| join-experiment | experimentId | Join experiment room |
| leave-experiment | experimentId | Leave experiment room |
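Room membership can be tracked with a map from experiment ID to the set of connected socket IDs, which is what makes an "auto-stop when the last client leaves" check possible. The sketch below is a standalone illustration of that bookkeeping. The name `experimentClients` appears in server.js, but the helper functions and the choice to store socket IDs are assumptions.

```typescript
// experimentId -> set of connected socket IDs
const experimentClients = new Map<string, Set<string>>();

// Register a socket in an experiment room, creating the room if needed.
function joinExperiment(experimentId: string, socketId: string): void {
  const clients = experimentClients.get(experimentId) ?? new Set<string>();
  clients.add(socketId);
  experimentClients.set(experimentId, clients);
}

// Returns the number of clients still in the room after this socket leaves.
function leaveExperiment(experimentId: string, socketId: string): number {
  const clients = experimentClients.get(experimentId);
  if (!clients) return 0;
  clients.delete(socketId);
  if (clients.size === 0) experimentClients.delete(experimentId);
  return clients.size;
}
```

A return value of `0` is the signal a disconnect handler would use to stop the experiment.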
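// server.js
socket.on('disconnect', async (reason) => {
  // The experiment this socket had joined, if any
  const experimentId = socket.currentExperiment;
  if (experimentId) {
    // Count the clients still attached to that experiment
    const clients = experimentClients.get(experimentId);
    const remaining = clients ? clients.size : 0;
    if (remaining === 0) {
      // Auto-stop experiment via API (Node's fetch needs an absolute URL in practice)
      await fetch('/api/experiment/stop', { method: 'POST' });
    }
  }
});
class ExperimentManager {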
private static instance: ExperimentManager;
private state: ExperimentState;
private config: ExperimentConfig | null;
private openrouterA: OpenRouterAPI;
private openrouterB: OpenRouterAPI;
private judgeEvaluator: JudgeEvaluator;
private contentFilter: ContentFilter;
private wsManager: WebSocketManager;
private experimentId: string;
// Control flags
private isProcessingTurn: boolean;
private manualStopRequested: boolean;
private turnTimeoutId: NodeJS.Timeout | null;
private ttlTimeoutId: NodeJS.Timeout | null; // 1 hour safety
static getInstance(): ExperimentManager { ... }
}
// Singleton stored in global registry
declare global {
var __experimentManagerInstance: ExperimentManager | undefined;
}
// Experiment configuration
const [experimentMode, setExperimentMode] = useState<"automatic" | "manual">("automatic")
const [promptingMode, setPromptingMode] = useState<"shared" | "individual">("shared")
const [sharedPrompt, setSharedPrompt] = useState("")
const [maxTurns, setMaxTurns] = useState(5)
// Model selection
const [modelA, setModelA] = useState("deepseek-r1")
const [modelB, setModelB] = useState("gpt-4-turbo")
const [apiKeyA, setApiKeyA] = useState("default")
const [apiKeyB, setApiKeyB] = useState("default")
// Experiment runtime
const [isExperimentRunning, setIsExperimentRunning] = useState(false)
const [experimentId, setExperimentId] = useState("")
const [experimentStatus, setExperimentStatus] = useState("")
const [conversation, setConversation] = useState<ChatMessage[]>([])
const [streamingMessages, setStreamingMessages] = useState<Map<string, StreamingMessage>>(new Map())
// Metrics
const [metricsA, setMetricsA] = useState<ModelMetrics>(...)
const [metricsB, setMetricsB] = useState<ModelMetrics>(...)
// Manual mode
const [waitingForUser, setWaitingForUser] = useState(false)
const [nextExpectedModel, setNextExpectedModel] = useState<'A' | 'B' | null>(null)
const [pauseReason, setPauseReason] = useState("")
const [judgeAnalyzing, setJudgeAnalyzing] = useState(false)
const [nextPrompt, setNextPrompt] = useState("")
// Reporting
const [hasCompletedExperiment, setHasCompletedExperiment] = useState(false)
const [lastExperimentData, setLastExperimentData] = useState<{...} | null>(null)
Formats:
- HTML - Self-contained with embedded CSS/JS
- PDF - Professional format via lib/pdf-generator.tsx
┌─────────────────────────────────────────────────────────────────┐
│ 🏟️ LLM Arena Experiment Report │
├─────────────────────────────────────────────────────────────────┤
│ Experiment ID: exp_1705612345_abc123def │
│ Generated: 1/18/2026, 2:30:45 PM │
│ Duration: 5m 23s │
├─────────────────────────────────────────────────────────────────┤
│ 🎯 EXPERIMENTAL SETUP │
│ • Prompting Mode: Shared Prompt │
│ • Max Turns: 5 │
│ • Model A: deepseek-r1 │
│ • Model B: gpt-4-turbo │
│ • Initial Prompt: [full text] │
├─────────────────────────────────────────────────────────────────┤
│ 💬 CONVERSATION LOG │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ [Model A: deepseek-r1] Turn 1 10:47:41 AM ││
│ │ ▶ 🧠 Show Thinking ││
│ │ [Full original content...] ││
│ │ ▶ 🔍 Filtered Message (1 section filtered) ││
│ │ Tokens: 437 ││
│ └─────────────────────────────────────────────────────────────┘│
│ [More messages...] │
├─────────────────────────────────────────────────────────────────┤
│ 📊 PERFORMANCE METRICS │
│ ┌──────────────────────┬──────────────────────┐ │
│ │ Model A Metrics │ Model B Metrics │ │
│ │ • Tokens: 1,437 │ • Tokens: 937 │ │
│ │ • Goal Dev: 30% │ • Goal Dev: 20% │ │
│ │ • Turns to Dev: 2 │ • Turns to Dev: N/A │ │
│ │ • Coop Score: 0.0 │ • Coop Score: +0.8 │ │
│ │ [Sentiment Chart] │ [Sentiment Chart] │ │
│ └──────────────────────┴──────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ 📋 EXPERIMENT SUMMARY │
│ • Total Messages: 4 │
│ • Turns Completed: 2 │
│ • Total Tokens: 2,374 │
│ • Combined Deviation: 50% │
└─────────────────────────────────────────────────────────────────┘
- Expandable thinking sections
- Expandable filter transparency sections
- Responsive design
- Print-optimized styles
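The summary figures in the sample report can be derived from the two models' metrics. The sketch below reproduces them under two assumptions inferred from that sample: "Combined Deviation" is the sum of both goal deviation scores (30% + 20% = 50%), and duration is rendered as minutes and seconds. The function names are illustrative and are not taken from lib/report-generator.ts.

```typescript
interface SideMetrics {
  tokensUsed: number;
  goalDeviationScore: number; // 0-100
}

// Format a duration in milliseconds as "Xm Ys".
function formatDuration(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  return `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`;
}

// Aggregate the values shown in the report's summary box.
function summarize(messageCount: number, turns: number, a: SideMetrics, b: SideMetrics) {
  return {
    totalMessages: messageCount,
    turnsCompleted: turns,
    totalTokens: a.tokensUsed + b.tokensUsed,
    // Assumption: combined deviation is a plain sum of both scores.
    combinedDeviation: a.goalDeviationScore + b.goalDeviationScore,
  };
}
```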
StarChamber is a direct Human-LLM interrogation mode that allows researchers to have one-on-one conversations with a single model. Unlike the Arena's Model vs Model approach, StarChamber puts the researcher in direct control of the conversation.
| Feature | Description |
|---|---|
| Single Model Interrogation | Direct conversation with one LLM |
| System Context Presets | Pre-configured personas and scenarios |
| Custom Researcher Persona | Configurable display name for researcher |
| Token Logprobs | Optional confidence analysis per token |
| Thinking Traces | Full visibility of model reasoning |
| Sentiment Analysis | Judge evaluates model responses per turn |
| Fullscreen Mode | Immersive conversation experience |
| Report Generation | HTML/PDF exports with logprobs and thinking |
StarChamber is accessible via tabbed navigation in the app header:
┌─────────────────────────────────────────────────────────────────┐
│ 🏟️ LLM Arena (Model vs Model) | 🔬 StarChamber (Direct) │
└─────────────────────────────────────────────────────────────────┘
Routes:
- /arena - Model vs Model experiments
- /starchamber - Human vs Model interrogation
┌─────────────────┬──────────────────────┬─────────────────┐
│ Setup Form │ Conversation │ Metrics Panel │
│ │ Control Bar │ │
│ - Model Select │ ───────────────── │ - Sentiment │
│ - Logprobs │ Start/Stop/Status │ Chart │
│ - Context │ First Message │ │
│ Preset │ ───────────────── │ - Experiment │
│ - System │ Conversation │ Status │
│ Context │ Messages │ │
│ - Persona │ ───────────────── │ │
│ │ Message Input │ │
└─────────────────┴──────────────────────┴─────────────────┘
interface StarChamberExperimentConfig {
model: {
modelId: string;
modelName: string;
apiKey?: string;
};
systemContext: string;
researcherPersona: string;
requestLogprobs: boolean;
presetId?: string;
}
interface StarChamberMessage {
id: string;
role: 'researcher' | 'model';
senderName: string;
content: string;
thinking?: string;
turnNumber: number;
timestamp: Date;
tokensUsed?: number;
logprobs?: LogprobsData;
}
interface LogprobsData {
available: boolean;
tokens: TokenLogprob[];
averageConfidence: number;
lowConfidenceTokens: TokenLogprob[];
}
interface TokenLogprob {
token: string;
logprob: number;
probability: number;
topAlternatives?: Array<{
token: string;
logprob: number;
probability: number;
}>;
}
Pre-configured templates available in lib/starchamber/presets.ts:
| Preset ID | Name | Use Case |
|---|---|---|
| generic-assistant | Generic Assistant | General-purpose helpful AI |
| roleplay-character | Roleplay Character | Character embodiment testing |
| technical-expert | Technical Expert | Domain-specific Q&A |
| debate-partner | Debate Partner | Argumentation analysis |
| creative-writer | Creative Writer | Creative content generation |
| custom | Custom Context | User-defined system prompt |
| Endpoint | Method | Purpose |
|---|---|---|
| /api/starchamber/start | POST | Start StarChamber experiment |
| /api/starchamber/turn | POST | Process researcher message |
| /api/starchamber/stop | POST | End experiment |
lib/starchamber/manager.ts - Singleton class for experiment orchestration:
class StarChamberManager {
startExperiment(params: StartExperimentParams): Promise<void>
processResearcherMessage(experimentId: string, message: string): Promise<void>
stopExperiment(experimentId: string): Promise<void>
}
1. Researcher configures model, preset, and persona
2. Researcher enters first message and starts experiment
3. StarChamberManager:
- Creates experiment state
- Sends first message to model
- Streams response via WebSocket
- Runs judge evaluation for sentiment
4. Researcher sees model response with:
- Thinking traces (if available)
- Token confidence (if logprobs enabled)
- Sentiment scores
5. Researcher sends next message
6. Loop continues until researcher ends experiment
7. Generate HTML/PDF report with full conversation
When enabled, StarChamber requests token probabilities from the model:
- High Confidence (>80%): Green highlighting
- Medium Confidence (50-80%): Yellow highlighting
- Low Confidence (<50%): Red highlighting
Logprobs are:
- Requested via OpenRouter's logprobs: true parameter
- Collected during streaming
- Displayed in expandable UI sections
- Included in reports with color-coded visualization
Note: Not all models support logprobs. When unavailable, the UI displays "Logprobs unavailable" without impeding the experiment.
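The color bands above can be computed from each token's log-probability. The sketch below assumes natural-log values (so probability is `exp(logprob)`), treats exactly 80% as medium and exactly 50% as medium to match the stated ranges, and takes `averageConfidence` to be the mean token probability. The helper names are hypothetical.

```typescript
interface TokenLogprob {
  token: string;
  logprob: number;
  probability: number;
}

type ConfidenceBand = "high" | "medium" | "low";

// Assumption: logprob is a natural logarithm, so exp() recovers the probability.
function toTokenLogprob(token: string, logprob: number): TokenLogprob {
  return { token, logprob, probability: Math.exp(logprob) };
}

// >80% is high (green), 50-80% is medium (yellow), <50% is low (red).
function confidenceBand(probability: number): ConfidenceBand {
  if (probability > 0.8) return "high";
  if (probability >= 0.5) return "medium";
  return "low";
}

// Build a LogprobsData-like summary; averageConfidence is the mean probability (assumed).
function summarizeLogprobs(tokens: TokenLogprob[]) {
  const averageConfidence =
    tokens.length === 0
      ? 0
      : tokens.reduce((sum, t) => sum + t.probability, 0) / tokens.length;
  return {
    available: tokens.length > 0,
    tokens,
    averageConfidence,
    lowConfidenceTokens: tokens.filter((t) => confidenceBand(t.probability) === "low"),
  };
}
```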
StarChamber reports include:
- Experiment setup (model, persona, system context)
- Full conversation log with thinking traces
- Token confidence visualization (when available)
- Sentiment analysis charts
- Summary statistics
Formats: HTML (interactive) and PDF (professional)
┌─────────────────────────────────────────────────────────────────────────────┐
│ FRONTEND (React/Next.js) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ app/(experiments)/arena/page.tsx ││
│ │ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ ││
│ │ │ControlPanel │ │ ExperimentSetup │ │ ControlPanel │ ││
│ │ │ (Model A) │ │ ConversationLog │ │ (Model B) │ ││
│ │ │ Metrics │ │ │ │ Metrics │ ││
│ │ └──────────────┘ └──────────────────┘ └──────────────┘ ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ useWebSocket hook │
│ (Socket.IO Client Connection) │
└────────────────────────────────────┼────────────────────────────────────────┘
│
Socket.IO + HTTP
│
┌────────────────────────────────────┼────────────────────────────────────────┐
│ BACKEND (Node.js) │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ server.js (Custom HTTP + Socket.IO) ││
│ │ • Handles WebSocket connections ││
│ │ • Manages experiment rooms ││
│ │ • Auto-stops on client disconnect ││
│ │ • Stores io instance globally ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ API Routes (app/api/) ││
│ │ /experiment/start → ExperimentManager.startExperiment() ││
│ │ /experiment/stop → ExperimentManager.stopExperiment() ││
│ │ /experiment/manual-continue → processModelWithPrompt() ││
│ │ /experiment/next-turn → startNextTurn() ││
│ │ /models → Return available models ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ ExperimentManager (Singleton - lib/) ││
│ │ ┌──────────────────────────────────────────────────────────────────┐ ││
│ │ │ State Management │ ││
│ │ │ • conversation: ChatMessage[] │ ││
│ │ │ • metricsA/metricsB: ModelMetrics │ ││
│ │ │ • isRunning, currentTurn, waitingForUser │ ││
│ │ └──────────────────────────────────────────────────────────────────┘ ││
│ │ ││
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────────┐ ││
│ │ │ OpenRouterAPI │ │ JudgeEvaluator │ │ ContentFilter │ ││
│ │ │ (Model A & B) │ │ (GPT-4o Mini) │ │ (GPT-4o Mini) │ ││
│ │ └────────────────┘ └────────────────┘ └────────────────────┘ ││
│ │ │ │ │ ││
│ │ └───────────────────┴──────────────────────┘ ││
│ │ │ ││
│ │ ┌──────────────────────────────────────────────────────────────────┐ ││
│ │ │ WebSocketManager (lib/websocket-manager.ts) │ ││
│ │ │ • emitExperimentEvent(experimentId, event) │ ││
│ │ │ • emitStreamingMessage(experimentId, message) │ ││
│ │ │ • emitModelMetrics(experimentId, model, metrics) │ ││
│ │ │ • emitToAll(eventName, data) │ ││
│ │ └──────────────────────────────────────────────────────────────────┘ ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└────────────────────────────────────┼────────────────────────────────────────┘
│
HTTPS Requests
│
┌────────────────────────────────────┼────────────────────────────────────────┐
│ EXTERNAL SERVICES │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ OpenRouter API ││
│ │ • POST /api/v1/chat/completions (stream=true) ││
│ │ • Unified access to: OpenAI, Anthropic, DeepSeek, Qwen, etc. ││
│ │ • Supports native reasoning tokens (DeepSeek R1, Qwen3) ││
│ └─────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
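The diagram shows ExperimentManager as a singleton, and the State Management section notes that the instance is stored in a global registry. A minimal sketch of that pattern follows. The class and property names here are hypothetical stand-ins, and the motivation (letting separately loaded route modules in the same process share one instance) is a presumed rationale, not one stated in the source.

```typescript
// Hypothetical stand-in for ExperimentManager; only the singleton mechanics are shown.
class ExperimentManagerSketch {
  public currentTurn = 0;

  // Private constructor forces callers to go through getInstance().
  private constructor() {}

  static getInstance(): ExperimentManagerSketch {
    // Keep the instance on globalThis so every module in the process sees the same object.
    const registry = globalThis as typeof globalThis & {
      __experimentManagerSketch?: ExperimentManagerSketch;
    };
    if (!registry.__experimentManagerSketch) {
      registry.__experimentManagerSketch = new ExperimentManagerSketch();
    }
    return registry.__experimentManagerSketch;
  }
}
```

One consequence of this design is that only one experiment's state exists per server process at a time.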
| Functionality | File |
|---|---|
| Arena page | app/(experiments)/arena/page.tsx |
| Experiment control | lib/experiment-manager.ts |
| Chat display | components/chat-message.tsx |
| Conversation view | components/conversation-log.tsx |
| Model panels | components/control-panel.tsx |
| Experiment config | components/experiment-setup.tsx |
| Metrics charts | components/metrics-dashboard.tsx |
| Report generation | lib/report-generator.ts |
| PDF generation | lib/pdf-generator.tsx |
| Functionality | File |
|---|---|
| StarChamber page | app/(experiments)/starchamber/page.tsx |
| Experiment control | lib/starchamber/manager.ts |
| System presets | lib/starchamber/presets.ts |
| Setup form | components/starchamber/setup-form.tsx |
| Conversation view | components/starchamber/conversation.tsx |
| Metrics panel | components/starchamber/metrics-panel.tsx |
| Report generation | lib/starchamber/report-generator.ts |
| PDF generation | lib/starchamber/pdf-generator.tsx |
| Functionality | File |
|---|---|
| App header/navigation | components/layout/app-header.tsx |
| Experiments layout | app/(experiments)/layout.tsx |
| Judge evaluation | lib/judge-evaluator.ts |
| Content filtering | lib/content-filter.ts |
| OpenRouter API | lib/openrouter.ts |
| Thinking extractor | lib/thinking-extractor.ts |
| WebSocket server | server.js |
| WebSocket client | hooks/useWebSocket.ts |
| WebSocket emission | lib/websocket-manager.ts |
| Core types | lib/core/types.ts |
| Arena types | lib/types.ts |
Architecture document for LLM Arena - Inter-LLM Interaction Observer