Run this after every rebuild to generate a clean bug report. Do each test in order. Screenshot anything that looks wrong. After finishing all tests, copy
`%APPDATA%\guide-ide\logs\guide-main.log` to `output/session_log.txt` and tell the agent to read it.
- Fresh rebuild complete (`npm run build` or electron rebuild done)
- App launched from the new build
- Log file cleared (agent did this — verifiable: log size should be <5KB at start)
- No prior chat history loaded (start a fresh chat)
- Recommended model for most tests: a mid-size model (e.g. Qwen3-4B or similar)
- Note the exact model name you're using for each test
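The "<5KB at start" check on the log can be scripted. The following is a minimal sketch, assuming Python is available on the test machine; the helper name `log_is_fresh` is hypothetical, and treating a missing log file as "cleared" is an assumption, not documented app behavior.

```python
import os

MAX_FRESH_LOG_BYTES = 5 * 1024  # the "<5KB" threshold from the checklist


def log_is_fresh(log_path: str, max_bytes: int = MAX_FRESH_LOG_BYTES) -> bool:
    """Return True if the log file is missing or smaller than max_bytes."""
    if not os.path.exists(log_path):
        return True  # assumption: a missing log counts as cleared
    return os.path.getsize(log_path) < max_bytes


if __name__ == "__main__":
    # Path taken from this checklist; APPDATA is normally only set on Windows.
    appdata = os.environ.get("APPDATA", "")
    path = os.path.join(appdata, "guide-ide", "logs", "guide-main.log")
    print("log is fresh:", log_is_fresh(path))
```

If this prints `False`, the log still holds entries from an earlier session and should be cleared before starting Test 1.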
Purpose: Confirm the model responds normally, no artifacts.
Steps:
- Type: `What is 17 times 43?`
- Wait for full response
- Observe: response should be plain text, no JSON visible, no blank bubble
Pass: Model answers (answer is 731). Clean text, no raw JSON, no blank.
Fail signals: Blank bubble / raw JSON flash / response wipes mid-stream
Screenshot if: anything looks wrong
Purpose: Confirm cancelled generations are NOT stored in context and NOT shown as a response bubble.
Steps:
- Type a long-running prompt: `Write me a 1000-word essay on the history of the Roman Empire. Include all emperors.`
- While it's streaming (don't wait for finish), immediately send a NEW message: `What is 2 + 2?`
- Wait for the second message to complete
- Observe:
- The cancelled generation should NOT appear as a completed response bubble
- The second message should answer "4" cleanly
- Repeat cancel 2-3 more times back-to-back to stress it
Pass: No [Generation cancelled] text ever appears in any bubble. Second message answers normally.
Fail signals: [Generation cancelled] visible in chat, or subsequent responses echo cancel text
Purpose: Check that tool pills appear correctly, don't flash raw JSON, and stay visible after completion.
Steps:
- Type: `Search the web for the latest news about AI today and summarize what you find`
- Watch the streaming bubble carefully while tools are running:
- Do the tool spinner pills appear? (should see animated spinner with tool name)
- Do you see ANY raw JSON text (`{"tool": "web_search", "params": {...}}`) flash in the bubble?
- After the tool finishes, do the pills convert to checkmarks (✓) or disappear?
- Wait for the full response
- Observe the final rendered message
Pass: Spinner pill appears → converts to ✓ pill → final response prose appears. NO raw JSON ever visible.
Fail signals: Raw JSON flash, pills vanishing before response, blank response after tools run
Screenshot if: raw JSON visible at any moment / pills disappear
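A raw JSON flash is easy to miss by eye. If the bubble text can be captured (for example, pasted from a screenshot transcription or pulled from the log), a small scanner can flag it. This is a minimal sketch: the function name `find_raw_tool_calls` is hypothetical, and the only thing it relies on from this checklist is the tool-call shape shown above, an object with `"tool"` and `"params"` keys.

```python
import json


def find_raw_tool_calls(text: str) -> list[dict]:
    """Return every complete JSON object in text that has "tool" and "params" keys."""
    decoder = json.JSONDecoder()
    found = []
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj, end = None, pos + 1  # not valid JSON here; try the next brace
        if isinstance(obj, dict) and "tool" in obj and "params" in obj:
            found.append(obj)
            pos = text.find("{", end)  # skip past the matched tool call
        else:
            pos = text.find("{", pos + 1)
    return found


if __name__ == "__main__":
    bubble = 'Searching... {"tool": "web_search", "params": {"query": "AI news"}}'
    print(find_raw_tool_calls(bubble))
```

The scanner only catches complete JSON objects. A tool call cut off mid-stream would not parse, so a screenshot remains the primary evidence.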
Purpose: Test a sequence of multiple tool calls in one response.
Steps:
- Type: `Search the web for the current price of NVIDIA stock, then search for AMD stock price, then tell me which is higher`
- Watch both tool spinner pills appear and resolve
- Observe the final answer cites both prices
Pass: Two tool pills appear in sequence (or together), both resolve to ✓, final prose answers the question.
Fail signals: Only one tool executes / blank bubble / response wipes after tool result arrives / raw JSON visible
Purpose: Confirm code blocks render correctly, no tool-call misclassification.
Steps:
- Type: `Write me a Python function to calculate the Fibonacci sequence up to n terms`
- Wait for full response
- Observe: should get a clean code block with Python syntax highlighting
Pass: Clean fenced code block rendered with syntax highlighting. No tool pills. No raw JSON.
Fail signals: Code shows as tool call / raw JSON visible / code block rendered as plain text
Purpose: Test with a small model (Llama 3.2 3B, Qwen3-0.6B, or DeepSeek-R1-1.5B) to catch small-model-specific bugs.
Switch to your smallest available model first.
Steps:
- Type: `Search the web for today's weather in Dallas Texas`
- Observe whether:
- Model generates a tool call correctly
- `## TOOL CALL` or `### TOOL CALL` headers appear in the bubble (BUG-012 signal)
- Tool executes or gets blocked (BUG-002 signal: blank bubble)
- Response appears after tool
Pass: Tool fires, result comes back, response contains weather info. No stray headers.
Fail signals:
- Blank bubble after tool = BUG-002 (chat-hard-gate strips everything)
- `## TOOL CALL` visible in prose = BUG-012 (reasoning header leaking)
- Tool never fires = possible task-type misclassification
- Model hallucinates results without using tool = BUG-013 signal
IMPORTANT: Note the exact model name in your bug report
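The BUG-012 signal can also be checked mechanically on captured response text. The following is a minimal sketch; the helper name `has_stray_tool_call_header` is hypothetical, and it assumes the leaked header starts a line exactly as `## TOOL CALL` or `### TOOL CALL`, which is how this checklist spells it.

```python
import re

# Matches a line that begins with "## TOOL CALL" or "### TOOL CALL".
TOOL_CALL_HEADER = re.compile(r"^#{2,3} TOOL CALL\b", re.MULTILINE)


def has_stray_tool_call_header(text: str) -> bool:
    """Return True if any line of text starts with a leaked TOOL CALL header."""
    return TOOL_CALL_HEADER.search(text) is not None


if __name__ == "__main__":
    response = "Let me check.\n## TOOL CALL\nweb_search: weather Dallas Texas"
    print(has_stray_tool_call_header(response))
```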
Purpose: Test whether the agent can complete a 2-3 step task without stalling or hallucinating intermediate steps.
Steps:
- Type: `Find the capital of France, then tell me what the population of that city is`
- Watch the agent:
- Does it plan the steps?
- Does it call a search for "capital of France"?
- Does it call a second search for "population of Paris"?
- Does it synthesize both results into a final answer?
Pass: Two search calls, both resolve, final answer says Paris is the capital with its population (~2.1M city / ~12M metro).
Fail signals:
- Agent answers from memory without searching = BUG-013 (no tool use on multi-step)
- Stalls after first search = broken iteration
- Final message is blank = BUG-003 (tool-result-only FINAL RETURN)
Purpose: Confirm response text doesn't wipe or roll back mid-generation (BUG-002/NEW-B regression).
Steps:
- Type: `Explain how a neural network learns using backpropagation. Be detailed, at least 3 paragraphs.`
- Watch the text stream in
- Does text ever disappear mid-stream? Does the bubble go blank and restart?
Pass: Text streams continuously without any rollback or wipe.
Fail signals: Text disappears and restarts from beginning / bubble goes blank mid-stream
Purpose: Stress test the queue and supersede logic under rapid fire.
Steps:
- Type `Hello` and immediately (within 1 second) type `How are you`, then immediately `What is 5+5`
- Only the last message should complete. The first two should be cancelled cleanly.
- After it settles, send one more normal message: `Explain what just happened`
- Verify the model gives a coherent answer without weird context corruption
Pass: Only the last queued message gets a full response. No [Generation cancelled] pollution. Final message is coherent.
Fail signals: Multiple responses appearing, [Generation cancelled] in context, model confused about what was asked
Purpose: Verify the model name display is correct and doesn't silently switch.
Steps:
- Note what model is shown in the UI
- Send a message: `What model are you? What are your capabilities?`
- Note response
- If model switching is available, trigger it and observe if the UI updates
Pass: Model name is displayed correctly. If it switches, UI shows the new model name.
Fail signals: Model switches silently without any indication in UI = BUG-010
- Copy the log: `Copy-Item "$env:APPDATA\guide-ide\logs\guide-main.log" "C:\Users\brend\IDE\output\session_log.txt"`
- Tell the agent: "Fresh gauntlet complete. Log is copied to output/session_log.txt. Here's what I observed: [list issues from above]"
- Agent will:
- Read the log
- Cross-reference with your observations
- Populate BUG_REPORT.md with numbered bugs in priority order
- Ask for approval before fixing anything
While running tests, note YES/NO for each:
- Raw JSON ever visible in any streaming bubble?
- Any blank response bubbles?
- `[Generation cancelled]` text appears in any bubble?
- Tool pills disappear before response arrives?
- Response text wipes/restarts mid-stream?
- `## TOOL CALL` or `### TOOL CALL` visible in any response?
- Agent answers from memory instead of using tools when asked to search?
- Small model gives blank response when using tools?
- Multiple responses generated for a single cancelled prompt?
- Model name displayed incorrectly?
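The YES/NO answers can be recorded in a small structure and reduced to the list of observed failures before handing off to the agent. This is a minimal sketch: the question strings are paraphrased from the checklist above, the helper name `observed_failures` is hypothetical, and treating unanswered questions as NO is an assumption.

```python
# Checklist questions, paraphrased from the list above. Every question is
# phrased so that YES means a fail signal was observed.
CHECKLIST = [
    "Raw JSON visible in any streaming bubble",
    "Blank response bubble",
    "[Generation cancelled] text in any bubble",
    "Tool pills disappear before response arrives",
    "Response text wipes or restarts mid-stream",
    "## TOOL CALL or ### TOOL CALL visible in any response",
    "Agent answers from memory instead of searching",
    "Small model gives blank response when using tools",
    "Multiple responses for a single cancelled prompt",
    "Model name displayed incorrectly",
]


def observed_failures(answers: dict[str, bool]) -> list[str]:
    """Return the questions answered YES (True), in checklist order.

    Unanswered questions are treated as NO; that default is an assumption.
    """
    return [q for q in CHECKLIST if answers.get(q, False)]


if __name__ == "__main__":
    answers = {"Blank response bubble": True, "Model name displayed incorrectly": False}
    for item in observed_failures(answers):
        print("YES:", item)
```

The resulting list can be pasted into the "Here's what I observed" message alongside the model name used for each test.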