A systematic evaluation framework for agentic AI systems across diverse architectural configurations and enterprise use cases.
AgentArch provides empirical insights into how different design dimensions interact within complex multi-agent systems. This benchmark evaluates 18 distinct agentic configurations across state-of-the-art large language models, examining four critical system dimensions:
| Dimension | Configurations Tested |
|---|---|
| 🤖 Orchestration | Single-agent vs. multi-agent systems |
| 🔧 Tool Calling | ReAct vs. function calling approaches |
| 💾 Memory | Complete vs. summarized memory management |
| 🧮 Thinking Tools | Mathematical reasoning and information synthesis tools |
TL;DR: No one-size-fits-all solution exists for enterprise agentic systems
| Finding | Impact |
|---|---|
| 🎯 No Universal Architecture | Models demonstrate significant architectural preferences that vary by use case complexity |
| 📉 Performance Gaps | Even top models achieve only 35.3% success on complex enterprise tasks and 70.8% on simpler workflows |
| ⚠️ Multi-Agent ReAct Limitations | Consistent underperformance across all models in multi-agent ReAct configurations |
| 🚨 Reliability Challenges | Pass^K scores peak at only 6.34%, indicating fundamental gaps for production deployment |
```bash
# Clone the repository
git clone https://github.com/ServiceNow/AgentArch.git
cd AgentArch

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# 🔑 Replace placeholders with real API keys and endpoints
```
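`.env.example` lists the variables the framework expects. As a rough, hypothetical sketch (the variable names below are placeholders, not the repo's actual schema), a filled-in `.env` might look like:

```bash
# Hypothetical .env layout; consult .env.example for the real variable names
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
LLM_ENDPOINT=https://your-endpoint.example.com
```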
Run a quick single-agent evaluation:

```bash
python -m src.run \
--mode single_agent \
--usecase requesting_time_off \
--model claude_sonnet_4 \
--agent_type function_calling \
--project test \
--debug
```
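The same entry point presumably drives the other configurations. For instance, a multi-agent ReAct run on the triage use case might look like this (the `--mode` and `--agent_type` values are inferred from the dimensions described below, not verified against the CLI):

```bash
python -m src.run \
  --mode multi_agent \
  --usecase triage_cases \
  --model claude_sonnet_4 \
  --agent_type react \
  --project test
```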
```
AgentArch/
├── 📁 configs/
│   ├── 🔧 mocked_data/
│   │   ├── requesting_time_off_mocked_tool_calls.json
│   │   └── triage_cases_mocked_tool_calls.json
│   ├── ⚙️ use_case_configs/
│   │   ├── requesting_time_off.yaml
│   │   └── triage_cases.yaml
│   └── 📝 prompts.yaml
├── 📁 src/
│   ├── 🛠️ tools/
│   ├── 🔧 utils/
│   ├── 🤖 agent.py
│   ├── 📊 metrics.py
│   └── ▶️ run.py        # Main execution script
├── 📄 .env.example
├── 📄 .gitignore
├── 📄 LICENSE
└── 📄 requirements.txt
```
**Use Case 1: Requesting Time Off**

| Aspect | Details |
|---|---|
| 🎯 Complexity | Basic multi-step reasoning with clear success criteria |
| 🛠️ Tools | 8 custom enterprise tools |
| 🤖 Agents | 3 specialized agents |
| 💡 Challenges | Date calculations, leave balance verification, policy compliance |
**Use Case 2: Triage Cases**

| Aspect | Details |
|---|---|
| 🎯 Complexity | Intelligent classification and escalation decisions |
| 🛠️ Tools | 31 custom enterprise tools |
| 🤖 Agents | 9 specialized agents |
| 💡 Challenges | Ambiguous request handling, context preservation, routing logic |
| Provider | Models | Status |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini | ✅ |
| Meta | LLaMA 3.3 70B | ✅ |
| Anthropic | Claude Sonnet 4 | ✅ |
*The framework also includes support for evaluating Gemini and Qwen family models.*
**Orchestration Strategies**
- **Multi-agent, orchestrator-mediated:** centralized task assignment with mediated communication
- **Multi-agent, direct communication:** initial task assignment with direct agent-to-agent communication
- **Single agent:** a unified agent with access to all tools
**Tool-Calling Approaches**
- **Function calling:** direct tool selection using native model capabilities
- **ReAct:** structured reasoning-action framework with explicit thought processes (see the sketch below)
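To make the contrast concrete, here is an illustrative sketch; the tool name, arguments, and trace format are hypothetical, not AgentArch's actual prompts or schemas:

```python
# Illustrative contrast only; tool names and formats are hypothetical,
# not AgentArch's implementation.

# Function calling: the model returns a structured call through its native API.
function_call = {
    "name": "get_leave_balance",             # hypothetical enterprise tool
    "arguments": {"employee_id": "E123"},
}

# ReAct: the model interleaves explicit reasoning and actions as plain text,
# and the harness parses the Action / Action Input lines to execute tools.
react_trace = """\
Thought: I need the remaining PTO balance before approving this request.
Action: get_leave_balance
Action Input: {"employee_id": "E123"}
Observation: {"pto_days_remaining": 12}
Thought: 12 days remain, so the 3-day request can be approved.
Final Answer: approve
"""
```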
**Memory Management**
- **Complete:** full visibility into all previous tool calls and responses
- **Summarized:** condensed information sharing to manage context length (see the sketch below)
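A minimal sketch of the difference, assuming a simple list-of-dicts tool history (`build_context` and the record fields are hypothetical, not the repo's code):

```python
# Hypothetical context builder contrasting the two memory modes.
def build_context(history: list[dict], mode: str = "complete") -> str:
    if mode == "complete":
        # Complete memory: every prior tool call and response, verbatim.
        return "\n".join(f"{h['tool']}({h['args']}) -> {h['result']}" for h in history)
    # Summarized memory: a condensed digest to bound context length.
    return f"{len(history)} prior tool calls; latest result: {history[-1]['result']}"

history = [
    {"tool": "get_leave_balance", "args": {"employee_id": "E123"}, "result": "12 days"},
    {"tool": "get_company_holidays", "args": {"year": 2025}, "result": "10 holidays"},
]
print(build_context(history, mode="summarized"))
```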
**Thinking Tools**
- **Mathematical reasoning:** structured mathematical reasoning and calculations (see the sketch below)
- **Information synthesis:** information organization and analysis capabilities
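As an example of why such tools help, a calculator-style thinking tool lets a non-reasoning model delegate arithmetic (such as leave-day math) to deterministic code. A minimal sketch, assuming a simple expression-string interface (hypothetical; the benchmark's actual tool API may differ):

```python
import ast
import operator as op

# Hypothetical calculator "thinking tool": safely evaluates arithmetic
# expressions the model emits, instead of trusting its mental math.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calc(expression: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

print(calc("21 - 3 - 5"))  # e.g. PTO days left after two approved requests
```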
Success requires simultaneous achievement of:
- ✅ Correct tool selection
- ✅ Accurate tool arguments (100% accuracy required)
- ✅ Correct final decision
**Metrics**
- **Pass@1:** average single-trial success rate, estimated over k = 8 trials
- **Pass^K:** probability that all k trials succeed (see the sketch below)

**Failure diagnostics**
- 🚫 Hallucination rates (selecting non-existent tools or agents)
- 🔁 Tool repetition rates
- ❌ Missing required tools
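A minimal sketch of how the two headline metrics follow from per-task trial outcomes (my reading of the definitions above; not the repo's actual `metrics.py`):

```python
# `results` maps each task to k boolean trial outcomes (k = 8 in the benchmark).
def pass_at_1(trials: list[bool]) -> float:
    # Expected single-attempt success rate, estimated over the k trials.
    return sum(trials) / len(trials)

def pass_hat_k(trials: list[bool]) -> float:
    # 1.0 only if every one of the k trials succeeded: strict reliability.
    return float(all(trials))

results = {"requesting_time_off/case_01": [True, True, False, True, True, True, False, True]}
print(sum(map(pass_at_1, results.values())) / len(results))   # mean Pass@1
print(sum(map(pass_hat_k, results.values())) / len(results))  # mean Pass^K
```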
| Recommendation | Rationale |
|---|---|
| ❌ Avoid Multi-Agent ReAct | Poor performance across all tested models |
| ✅ Use Multi-Agent for Final Decisions | Higher accuracy in decision-making despite tool selection challenges |
| 🎯 Model-Specific Architectures | Test multiple configurations rather than assuming a universal optimum |
| 🧮 Thinking Tools for Non-Reasoning Models | Significant performance improvements on calculation-heavy tasks |
| Focus Area | Insight |
|---|---|
| 🔄 Architecture-Use Case Interaction | Models perform optimally under different architectures depending on task complexity |
| ⚖️ Reliability vs Performance | Consider both accuracy and consistency for enterprise deployment |
| 💾 Memory Management Impact | Minimal performance differences between complete and summarized memory |
```bibtex
@misc{bogavelli2025agentarchcomprehensivebenchmarkevaluate,
      title={AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise},
      author={Tara Bogavelli and Roshnee Sharma and Hari Subramani},
      year={2025},
      eprint={2509.10769},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.10769},
}
```

AgentArch is licensed under the Apache 2.0 License.
For questions or collaboration opportunities:
⭐ If this project helps your research, please consider giving it a star! ⭐