dippatel1994 edited this page Feb 13, 2026 · 3 revisions

FAQ

General

Is this the official implementation?

No. This is an unofficial, community-driven implementation based on the publicly available paper. It is not affiliated with or endorsed by the original authors or Google Research. The official code has not been released yet.

How close is this to the original paper?

We followed the paper's described methodology as closely as possible. The five-agent pipeline, two-phase architecture, and iterative refinement process match the paper. The main differences are in the reference dataset (we use 13 curated examples vs. the paper's 292) and potentially in prompt details, which the paper doesn't fully specify.

Does it cost money to use?

PaperBanana can run on Google Gemini's free tier. You need a free API key from Google AI Studio. The free tier has rate limits but is sufficient for normal usage. Alternatively, you can use OpenRouter for pay-per-use access to many models. The package itself is MIT licensed and free on PyPI.

Can I use it for my paper submission?

Yes. The output is yours. However, we'd recommend treating generated diagrams as a strong starting point rather than final camera-ready figures. Review the output for accuracy and make manual adjustments as needed.

Is PaperBanana on PyPI?

Yes. pip install paperbanana installs the CLI and Python API. pip install paperbanana[mcp] adds MCP server support. See Installation for all options.
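The two install variants, as shell commands (package and extra names are the ones stated above):

```shell
# Base install: CLI and Python API
pip install paperbanana

# With MCP server support (quotes keep some shells from globbing the brackets)
pip install "paperbanana[mcp]"
```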

Is the MCP server listed on registries?

Yes. PaperBanana is published on the Official MCP Registry and submitted to mcp.so. You can also find it on PyPI.

Technical

Why Gemini specifically?

The original paper uses Gemini for both VLM and image generation. We followed their choice to stay as close to the described system as possible. However, you can now use other models via OpenRouter, including Claude, GPT-4, and Llama.

Can I use Claude or GPT-4 instead of Gemini?

Yes! Use the OpenRouter provider. Set OPENROUTER_API_KEY and use --vlm-provider openrouter --vlm-model anthropic/claude-3.5-sonnet. See the OpenRouter Provider page for full details.
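For example, in a shell session (the `paperbanana` command name is an assumption based on the package name; the two flags are the documented ones):

```shell
export OPENROUTER_API_KEY="your-openrouter-key"

# Route planning/critique through Claude via OpenRouter
paperbanana --vlm-provider openrouter \
            --vlm-model anthropic/claude-3.5-sonnet
```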

Why only 13 reference examples instead of 292?

Curating 292 high-quality (methodology text, diagram, caption) tuples requires significant manual effort. The paper describes using 2,000 NeurIPS papers as the starting point. Our 13 examples were manually verified to be clean and representative across the four categories. We're actively looking for community contributions to expand this. See Adding Reference Examples.

How long does generation take?

Typically 30-90 seconds for a single diagram with 3 refinement iterations. Most of the time is spent on API calls. Reducing iterations to 1-2 speeds things up at the cost of some output quality.

Can I use local models instead of cloud APIs?

Not yet out of the box, but the provider system supports this. Someone would need to implement an Ollama or similar provider. The challenge is that local image generation models (Stable Diffusion, FLUX) produce a different style than Gemini's native generation, so prompt templates may need adjustment. This is an open area for contribution. See Research Directions for more ideas.
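As a rough illustration of what such a contribution might look like, here is a hedged sketch. The base-class name, method signature, and registry below are illustrative assumptions, not PaperBanana's actual interface; see Adding a New Provider for the real one.

```python
# Illustrative sketch only: class names, signatures, and the registry are
# assumptions, not PaperBanana's real provider interface.
from abc import ABC, abstractmethod


class VLMProvider(ABC):
    """Minimal stand-in for the provider interface (assumed shape)."""

    @abstractmethod
    def generate_text(self, prompt: str) -> str:
        ...


# Hypothetical registry mapping provider names to implementations.
PROVIDERS: dict[str, type[VLMProvider]] = {}


def register_provider(name: str):
    """Decorator that records a provider class under a name."""
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap


@register_provider("ollama")
class OllamaProvider(VLMProvider):
    """Example: a local-model provider backed by an Ollama HTTP endpoint."""

    def __init__(self, model: str = "llama3",
                 host: str = "http://localhost:11434"):
        self.model = model
        self.host = host

    def generate_text(self, prompt: str) -> str:
        # A real implementation would POST to {host}/api/generate;
        # stubbed here so the sketch stays self-contained.
        return f"[{self.model}] response to: {prompt[:40]}"
```

Once registered, the CLI's provider lookup (whatever form it actually takes) could resolve the name to this class and route planning calls through the local model.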

Does the MCP server work with Windsurf/Zed/other editors?

It should work with any editor that supports the MCP specification. We've tested with Cursor and Claude Desktop. If you get it working with another client, let us know and we'll add configuration examples to the MCP Server Setup page.

What platforms are supported?

PaperBanana is tested on Ubuntu, Windows, and macOS across Python 3.10, 3.11, and 3.12. All platforms should work identically.

Output Quality

The diagram doesn't look right. What can I do?

Several things affect output quality:

  1. Input text specificity: More detailed methodology descriptions produce better diagrams. Vague descriptions give the Planner less to work with.
  2. Caption clarity: The caption should describe the communicative intent, not just label the figure.
  3. Re-running: Generation is non-deterministic. Running the same input again sometimes produces better results.
  4. Iterations: More refinement rounds (up to 3) generally help, with diminishing returns beyond that.
  5. Try different models: If using OpenRouter, experiment with different VLM models. Claude and GPT-4 may produce different planning outputs than Gemini.

Why does it sometimes produce results plots instead of architecture diagrams?

The Retriever may select poorly matched reference examples. This can happen when the methodology text is ambiguous about what kind of visualization is needed. Being explicit in the caption (e.g., "System architecture diagram showing..." rather than just "Overview of our method") helps.

Can it generate diagrams in specific styles (e.g., matching my paper's existing figures)?

Not currently. The Stylist applies NeurIPS-style guidelines uniformly. Supporting custom style references is a possible future enhancement. See Research Directions for more on planned improvements.

Providers

What providers are currently supported?

Provider | VLM | Image Gen | Notes
--- | --- | --- | ---
Google Gemini | Yes | Yes | Default, free tier available
OpenRouter | Yes | Yes | Access to 100+ models

Can I mix providers?

Yes. You can use one provider for VLM (planning, critique) and another for image generation. For example, Claude via OpenRouter for planning and Gemini for image generation. Both API keys need to be set.
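A sketch of such a mixed setup, assuming the flag names shown elsewhere in this FAQ and a `GEMINI_API_KEY` variable name (an assumption; check the Installation page for the exact name):

```shell
# Claude (via OpenRouter) handles planning/critique;
# image generation stays on the Gemini default.
export OPENROUTER_API_KEY="your-openrouter-key"
export GEMINI_API_KEY="your-gemini-key"   # variable name is an assumption

paperbanana --vlm-provider openrouter \
            --vlm-model anthropic/claude-3.5-sonnet
```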

How do I add a new provider?

See Adding a New Provider. The provider system is modular: you implement the interface and register the provider. PRs welcome!
