Reference Dataset

Dipkumar Patel edited this page Feb 4, 2026 · 1 revision

The reference dataset is the core of PaperBanana's in-context learning approach. The Retriever agent selects relevant examples from this set to guide diagram generation.

Current State

The implementation includes 13 curated methodology diagrams spanning four categories:

| Category | Description | Examples |
|---|---|---|
| Agent & Reasoning | Agent architectures, reasoning chains, tool-use pipelines | Multi-agent coordination, ReAct loops |
| Vision & Perception | Detection, segmentation, multimodal architectures | Object detection pipelines, VLM architectures |
| Generative & Learning | Training frameworks, generative model architectures | Diffusion pipelines, GAN architectures |
| Science & Applications | Domain-specific architectures outside core ML | Drug discovery, climate modeling |

How the Paper Does It

The original paper by Zhu et al. constructs a much larger reference set:

  1. Sample 2,000 papers from NeurIPS 2025
  2. Parse PDFs with MinerU to extract methodology sections and figures
  3. Filter to papers with methodology diagrams (2,000 → 1,359)
  4. Keep only diagrams whose aspect ratio falls in the range [1.5, 2.5] (1,359 → 610)
  5. Categorize diagrams using Gemini into four classes
  6. Human curation for quality (610 → 584)
  7. Split into test set (292) and reference set (292)
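
Steps 4 and 7 above can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's code: the `Figure` type and both function names are hypothetical, the aspect ratio is assumed to be width divided by height, and the random 50/50 split is an assumption, since the steps above do not say how the split is drawn.

```python
import random
from dataclasses import dataclass


@dataclass
class Figure:
    """Hypothetical record for one extracted methodology diagram."""
    paper_id: str
    width: int
    height: int


def in_aspect_range(fig, low=1.5, high=2.5):
    """Step 4: keep figures whose width/height ratio lies in [low, high].

    Assumes aspect ratio means width divided by height.
    """
    return low <= fig.width / fig.height <= high


def split_half(items, seed=0):
    """Step 7: shuffle deterministically, then cut the list in half.

    A seeded random split is an assumption made for this sketch.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]
```

Applied to the 584 curated diagrams, `split_half` yields two lists of 292 items each, matching the test and reference set sizes listed above.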

Structure

Each reference example is a directory under data/reference_sets/:

data/reference_sets/
├── example_name/
│   ├── methodology.txt    # Extracted methodology section
│   ├── diagram.png        # Methodology diagram image
│   └── metadata.json      # Caption and metadata
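
A minimal sketch of reading this layout from disk, assuming only the three file names shown in the tree. The `ReferenceExample` class and `load_reference_set` function are hypothetical names for illustration and are not taken from the PaperBanana codebase.

```python
import json
from dataclasses import dataclass
from pathlib import Path

# File names come from the directory layout above.
EXPECTED_FILES = ("methodology.txt", "diagram.png", "metadata.json")


@dataclass
class ReferenceExample:
    """Hypothetical in-memory view of one example directory."""
    name: str
    methodology: str
    diagram_path: Path
    metadata: dict


def load_reference_set(root):
    """Load every complete example directory under root, sorted by name."""
    examples = []
    for directory in sorted(Path(root).iterdir()):
        if not directory.is_dir():
            continue
        # Skip directories that are missing any of the three expected files.
        if not all((directory / name).is_file() for name in EXPECTED_FILES):
            continue
        methodology = (directory / "methodology.txt").read_text(encoding="utf-8")
        with open(directory / "metadata.json", encoding="utf-8") as handle:
            metadata = json.load(handle)
        examples.append(
            ReferenceExample(
                name=directory.name,
                methodology=methodology,
                diagram_path=directory / "diagram.png",
                metadata=metadata,
            )
        )
    return examples
```

Calling `load_reference_set("data/reference_sets")` from the repository root would return one `ReferenceExample` per complete directory. The image itself is left on disk and referenced by path, so the caller decides how to open it.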

metadata.json format:

{
  "paper_title": "Full paper title",
  "arxiv_id": "2601.23265",
  "figure_number": 2,
  "caption": "Original figure caption",
  "category": "agent_reasoning",
  "source_url": "https://arxiv.org/abs/2601.23265",
  "aspect_ratio": 1.85
}

Valid categories: agent_reasoning, vision_perception, generative_learning, science_applications
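
Before adding an example, a quick check along these lines can catch a misspelled category or a missing field. This is a sketch under assumptions: the field names and category identifiers come from the format above, but treating every field as required, the expected types, and the `validate_metadata` helper itself are illustrative choices rather than rules enforced by PaperBanana.

```python
# Category identifiers listed above.
VALID_CATEGORIES = {
    "agent_reasoning",
    "vision_perception",
    "generative_learning",
    "science_applications",
}

# Assumption: every field in the sample metadata.json is required,
# with the type it has in that sample.
REQUIRED_FIELDS = {
    "paper_title": str,
    "arxiv_id": str,
    "figure_number": int,
    "caption": str,
    "category": str,
    "source_url": str,
    "aspect_ratio": (int, float),
}


def validate_metadata(meta):
    """Return a list of problems; an empty list means the metadata looks valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif isinstance(meta[field], bool) or not isinstance(meta[field], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for {field}")
    category = meta.get("category")
    if isinstance(category, str) and category not in VALID_CATEGORIES:
        problems.append(f"unknown category: {category}")
    return problems
```

The function returns a list of messages rather than raising on the first failure, so a contributor sees every problem in one pass.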

Why Reference Quality Matters

The Planner agent uses retrieved reference examples as few-shot demonstrations. If the reference examples have clear layouts, consistent styling, and accurate methodology-to-diagram mappings, the Planner produces better descriptions. If the references are cluttered or poorly matched, the output degrades accordingly.
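
To make the few-shot idea concrete, the sketch below assembles a text-only prompt from retrieved examples. It is a simplified illustration, not PaperBanana's actual prompt: the `build_planner_prompt` function, the section labels, and the choice to pass captions instead of images are all assumptions.

```python
def build_planner_prompt(references, target_methodology):
    """Assemble a few-shot prompt from retrieved references.

    references: list of (methodology_text, caption) pairs.
    The layout of the prompt is hypothetical and chosen for readability.
    """
    parts = []
    for index, (methodology, caption) in enumerate(references, start=1):
        parts.append(
            f"Example {index}\n"
            f"Methodology:\n{methodology}\n"
            f"Diagram caption:\n{caption}\n"
        )
    # The target comes last so the model continues the established pattern.
    parts.append(
        f"Target\nMethodology:\n{target_methodology}\nDiagram description:"
    )
    return "\n".join(parts)
```

Every reference is copied into the prompt verbatim, so a cluttered or mismatched example reaches the model just as directly as a clean one.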

Improving the reference set is therefore the single highest-leverage way to improve PaperBanana's output quality.

Contributing Examples

See Adding Reference Examples for how to contribute new reference examples.
