Added error classification prompt
Megh-Thakkar authored Jan 21, 2025
1 parent fd8fd95 commit b8c85b1
Showing 1 changed file with 167 additions and 1 deletion.
168 changes: 167 additions & 1 deletion src/agentlab/analyze/error_analysis.py
@@ -54,6 +54,156 @@
Action: {action}
"""

ERROR_CLASSIFICATION_PROMPT = """
You are an expert evaluator that classifies web agent failures according to a predefined taxonomy.
Below are the high-level definitions of each top-level category (Agent Errors, Language Model Errors, and Benchmark/Environment Errors),
followed by an explanation of the inputs you will receive (planning history, chain of thought, etc.),
a set of labeled examples for reference (few-shot), and finally the classification task you must complete.
--------------------------------------------------------------------------------
TAXONOMY DEFINITIONS
--------------------------------------------------------------------------------
1. AGENT ERRORS
These errors arise when agents interact with web interfaces and fail due to limitations in perception, navigation, or manipulation.
- Navigation & Planning Errors
The agent cannot construct or execute a correct sequence of actions to reach its goal
(e.g., getting lost on a website, failing to recover from missteps, or using incorrect search terms).
- Interaction Execution Errors
The agent enters data in the wrong format, forgets to click "Submit" after typing,
repeats the same failing action without adaptation, or loses track of the changing webpage state.
- Information Processing Errors
The agent misreads or misinterprets visible data (e.g., extracting the wrong field values),
misconstrues relationships between pieces of information, or fails to validate data against task requirements.
- Observation & Action Errors
The agent fails to observe important updates in the environment (e.g., not noticing the page reloaded)
or misaligns its actions (clicks the wrong element or stale link).
2. LANGUAGE MODEL ERRORS
These errors result from the model's inability to correctly interpret or reason about the task at a higher level,
independent of the low-level web interactions.
- Task Understanding Errors
The agent misreads or misunderstands the user's objective (goal interpretation),
loses crucial context (context loss), or performs actions beyond or short of the intended scope.
- Reasoning Failures
The agent's logic is flawed (logical inference errors), behaves inconsistently across multiple steps,
or fails to prioritize important subtasks when handling complex goals.
3. BENCHMARK & ENVIRONMENT ERRORS
These errors are external to the agent's logic and the language model's reasoning,
arising from flaws in the system, network, or evaluation framework itself.
- System Errors
Network failures, API downtime, or dynamic web changes that break the agent's assumptions (e.g., layout shifts).
- Benchmark Design Errors
Ambiguous or contradictory task specifications, incorrect validation criteria (where correct solutions are flagged as failures),
or inflexible evaluation systems that fail to account for valid alternative solutions.
--------------------------------------------------------------------------------
INPUT DESCRIPTION
--------------------------------------------------------------------------------
You will receive the following for each scenario:
1. User Goal
- The original objective provided by the user (e.g., "Open a GitLab issue labeled 'help wanted'").
2. Planning / Thought History
- The internal reasoning or plan the agent considered. May include branches of logic or key decision points.
3. Current Observation (HTML / AX Tree Snippet)
- The webpage structure or state that the agent sees at a given point in time.
4. Historical change summaries
- A list of summaries of changes in the observation that the agent has seen during the course of actions.
5. Action History
- A record of the agent's step-by-step actions in the web environment (clicks, form entries, navigations, etc.)
along with immediate outcomes or errors.
Using these inputs, you must categorize the observed failure (or success) under the appropriate category or categories.
--------------------------------------------------------------------------------
FEW-SHOT CLASSIFICATION EXAMPLES
--------------------------------------------------------------------------------
1) EXAMPLE A (Benchmark Error - Benchmark Design Error)
• Context: The agent correctly finds a cheaper product meeting the user's criteria,
but the benchmark expects a more expensive product and marks the solution as wrong.
• Classification: ["Benchmark Design Error"]
• Justification: The agent's solution is objectively valid, but the evaluation framework is too rigid
and does not allow an alternative correct solution.
2) EXAMPLE B (Agent Error - Interaction Execution)
• Context: The agent repeatedly clicks "Show report" after entering dates in the wrong format.
Each time, the site resets to default dates. The agent never notices and keeps doing the same thing.
• Classification: ["Agent Error - Interaction Execution"]
• Justification: The agent used an invalid input format ("Format Errors"), then repeated the failing action
without adaptation ("Action Repetition").
3) EXAMPLE C (Benchmark Error - Benchmark Design Error)
• Context: The user asks, "Where is the nearest In-N-Out to Upitts?"
The query is ambiguous because "Upitts" is not a standard location.
The agent flounders, eventually returning "No In-N-Out found," which is incorrect for the region.
• Classification: ["Benchmark Design Error"]
• Justification: The task goal is poorly specified ("Upitts" is ambiguous or unrealistic),
leading the agent astray due to unclear context.
4) EXAMPLE D (Language Model Error - Task Understanding)
• Context: The user says, "In the repository myorg/myrepo, locate any issues labeled 'help wanted'
that are older than 30 days and add a comment saying 'I can help fix this.'"
The agent's planning notes mention searching for existing issues but quickly pivot to creating a brand-new issue
with label 'help wanted,' ignoring the user's actual request to find and comment on old issues.
• Classification: ["Language Model Error - Task Understanding"]
• Justification: The agent misunderstood the user's goal. Instead of searching for and commenting on existing issues,
it focused on creating a new issue. This is a misinterpretation of the instructions,
not a mechanical error in clicking or input format.
--------------------------------------------------------------------------------
CLASSIFICATION TASK
--------------------------------------------------------------------------------
1. Read through:
- The planning and thought history
- The action history
- The current HTML or AX Tree observation
- The user goal
2. Decide if the failure is:
- An Agent Error (which subcategory/subcategories),
- A Language Model Error (which subcategory/subcategories),
- A Benchmark/Environment Error (which subcategory/subcategories),
- Or a combination thereof (multi-label if needed).
3. Provide a brief explanation justifying your classification, referencing specific steps if helpful.
4. If the agent succeeds (no error), label the errorCategory accordingly as "Success".
Output Format Example:
{{
"errorCategory": ["Agent Error - Navigation & Planning"],
"explanation": "The agent opened the wrong GitLab page and never recovered..."
}}
Please follow this structure at every step. Keep your responses concise and clear. Below are the details.
Overall goal: {goal}
LLM Plan and thought history: {plan}
Current Observation: {current_observation}
Historical change summaries: {historical_summaries}
Action history: {action_history}
"""


def _diff(past_obs, current_obs):
    """TODO: Implement the diff function.
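
The prompt added in this hunk asks the model to reply with a JSON object holding `errorCategory` (a list of labels) and `explanation`. A minimal sketch of how such a reply might be parsed is shown here. `ErrorClassification` and `parse_classification` are hypothetical names that do not appear in the commit, and the tolerance for text around the JSON object is an assumption about how a model might answer.

```python
import json
from dataclasses import dataclass


@dataclass
class ErrorClassification:
    """Hypothetical container for one classifier reply (not part of the commit)."""

    categories: list[str]
    explanation: str


def parse_classification(reply: str) -> ErrorClassification:
    """Parse a reply shaped like the prompt's Output Format Example."""
    # Assumption: the model may wrap the JSON in extra text, so slice from
    # the first "{" to the last "}" before decoding.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start : end + 1])

    categories = data.get("errorCategory")
    explanation = data.get("explanation")
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("errorCategory must be a list of strings")
    if not isinstance(explanation, str):
        raise ValueError("explanation must be a string")
    return ErrorClassification(categories, explanation)
```

The sketch checks only the shape of the reply. It does not validate label names against the taxonomy, since the few-shot examples in the prompt use more than one label style.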
@@ -111,8 +261,24 @@ class EpisodeAnalysis:
@dataclass
class EpisodeSummarizer:

-    cange_summarizer: ChangeSummarizer = None
+    change_summarizer: ChangeSummarizer = None

    def summarize(episode: list[StepInfo]) -> EpisodeAnalysis:
        """Run Change Summarizer for every step in the episode or extract a pre-computed one."""
        pass


@dataclass
class EpisodeErrorSummarizer(EpisodeSummarizer):

    change_summarizer: ChangeSummarizer = None

    def make_prompt(self, current_observation, action_history, historical_summaries, goal, plan):
        """TODO: Implement the prompt."""
        return ERROR_CLASSIFICATION_PROMPT.format(
            goal=goal,
            plan=plan,
            current_observation=current_observation,
            historical_summaries=historical_summaries,
            action_history=action_history,
        )
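
`make_prompt` fills the template with `str.format`, which treats every single `{` or `}` as the start or end of a replacement field. A template that also contains a literal JSON example therefore needs its braces doubled (`{{` and `}}`). The sketch below illustrates this on a hypothetical miniature template; `UNESCAPED`, `ESCAPED`, and `try_format` are illustrative names, not part of the commit.

```python
# Hypothetical miniature templates standing in for ERROR_CLASSIFICATION_PROMPT.
UNESCAPED = 'Output: {\n"errorCategory": []\n}\nOverall goal: {goal}'
ESCAPED = 'Output: {{\n"errorCategory": []\n}}\nOverall goal: {goal}'


def try_format(template: str, **fields) -> str:
    """Return the formatted template, or a short note if str.format rejects it."""
    try:
        return template.format(**fields)
    except (KeyError, ValueError, IndexError) as exc:
        # str.format parsed the literal JSON braces as a replacement field.
        return f"format failed: {type(exc).__name__}"


unescaped_result = try_format(UNESCAPED, goal="Open an issue")
escaped_result = try_format(ESCAPED, goal="Open an issue")
```

With the doubled braces, the formatted text contains a single `{` and `}` around the JSON example, and only `{goal}` is substituted.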
