🔬 arXiv MCP Research Assistant
AI-powered academic research platform with RAG retrieval, BM25 + TF-IDF hybrid search, TextRank summarization, K-Means paper clustering, a knowledge graph, NER, and an autonomous multi-step agent, built on the Model Context Protocol (MCP).
🧠 AI / RAG Intelligence
| Feature | Description |
|---|---|
| TF-IDF Semantic Search | Pure-Python TF-IDF vectoriser with cosine similarity; no API keys, no sklearn |
| BM25 Retriever | Okapi BM25 ranking (the algorithm behind Elasticsearch/Lucene) |
| Hybrid Search + RRF | Two-stage TF-IDF + BM25 → Reciprocal Rank Fusion → production-grade retrieval |
| Sentence-Level Chunking | Abstracts split into ~300-char sentence chunks for fine-grained retrieval |
| RAG Context Builder | Retrieve top-k chunks and assemble grounded context for LLM prompts |
| TextRank Summarizer | Graph-based multi-document extractive summarization (PageRank on sentences) |
| Named Entity Recognition | Regex NER for AI models, datasets, metrics, and tasks (100+ patterns) |
| Paper Clustering (K-Means) | Unsupervised grouping on TF-IDF vectors with K-Means++ init and auto-labeling |
| Paper Similarity Matrix | Pairwise cosine similarity with interactive Plotly heatmap |
| Query Expansion | 25+ domain-specific synonym mappings for better recall |
| 9-Dimension Gap Analysis | Scans abstracts against Scalability, Privacy, Explainability, Fairness, Reproducibility, Efficiency, Multi-Modal, Real-World Deployment, Human Evaluation |
| Research Question Generator | Combines gaps + keywords + methodologies into actionable questions |
| Topic Drift / Evolution | Track keyword emergence and decline year-over-year |
| Knowledge Graph | Concept co-occurrence network with interactive Plotly visualization |
| Autonomous Research Agent | 8-step pipeline: expand → search → chunk → index → retrieve → gaps → questions → review |
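
As a rough illustration of the sentence-level chunking and TF-IDF cosine retrieval described above, here is a minimal pure-Python sketch. The names `chunk_sentences` and `TinyTfidf` are hypothetical and are not taken from `utils.py`; the smoothed IDF formula and the greedy packing of sentences into ~300-character chunks are assumptions about how such a component might work, not a description of the project's actual code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_sentences(text, max_chars=300):
    """Greedily pack whole sentences into chunks of at most ~max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


class TinyTfidf:
    """Hypothetical stand-in for a TF-IDF index (not the project's class)."""

    def __init__(self, docs):
        self.docs = docs
        tokenized = [tokenize(d) for d in docs]
        n = len(docs)
        df = Counter(term for toks in tokenized for term in set(toks))
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(toks) for toks in tokenized]

    def _vectorize(self, tokens):
        """Build an L2-normalised sparse TF-IDF vector as a dict."""
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * self.idf[t] for t, c in tf.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query, k=3):
        """Return up to k (document, cosine score) pairs with a positive score."""
        q = self._vectorize(tokenize(query))
        scored = []
        for i, vec in enumerate(self.vectors):
            # Both vectors are unit length, so the dot product is the cosine.
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            scored.append((score, i))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.docs[i], s) for s, i in scored[:k] if s > 0]
```

To build a RAG context under these assumptions, one would run `chunk_sentences` over each abstract, index the resulting chunks with `TinyTfidf`, and join the top-k chunks returned by `search` into the prompt.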
Analysis & Visualisation
| Feature | Description |
|---|---|
| Literature Review | Auto-generated comprehensive reviews with contribution analysis |
| Trend Analysis | Publication timeline, growth rate, emerging keywords |
| Author Network | Prolific-author ranking and collaboration-size distribution |
| Interactive Charts | Plotly: timeline, categories, keywords, methods, authors, similarity heatmap, knowledge graph |
| Multi-Format Export | BibTeX, APA citations, full Markdown review |
⚡ MCP Server (Model Context Protocol)
19 MCP tools exposed for Claude Desktop, Cursor, or any MCP client, including all AI/RAG tools.
┌──────────────────────────────────────────────────────┐
│                   Streamlit Web UI                   │
│   Search · AI/RAG · Visualisations · Export · Chat   │
└──────────────────────────┬───────────────────────────┘
                           │
                    ┌──────▼──────┐
                    │  utils.py   │  Shared Core (~1600 lines)
                    │             │  ├─ arXiv API client (async)
                    │             │  ├─ TFIDFIndex + BM25Index classes
                    │             │  ├─ Hybrid retrieval + RRF
                    │             │  ├─ Sentence chunker
                    │             │  ├─ TextRank summarizer (PageRank)
                    │             │  ├─ Named entity extraction (NER)
                    │             │  ├─ K-Means paper clustering
                    │             │  ├─ Query expansion (synonym map)
                    │             │  ├─ Gap analysis engine (9 dims)
                    │             │  ├─ Topic drift / evolution
                    │             │  ├─ Knowledge graph builder
                    │             │  ├─ run_research_agent(): 8-step
                    │             │  ├─ NLP keyword extraction
                    │             │  └─ BibTeX/APA/MD export
                    └──────┬──────┘
                           │
             ┌─────────────┴──────────────┐
             │                            │
      ┌──────▼───────┐             ┌──────▼──────┐
      │   FastMCP    │             │  arXiv API  │
      │    Server    │             │ (HTTP/XML)  │
      │  (19 tools)  │             └─────────────┘
      └──────┬───────┘
             │
Claude Desktop / Cursor / any MCP client
User Query
    │
    ├──► TF-IDF Ranking (cosine similarity) ──┐
    │                                         ├──► Reciprocal Rank Fusion ──► Results
    └──► BM25 Ranking (Okapi BM25 k1/b) ──────┘

Alternative paths:
Query → Expansion (25+ synonyms) → arXiv Search
Query → Sentence Chunking → TF-IDF Index → Cosine Retrieval → RAG Context
Papers ──► TextRank (PageRank on sentence graph) ──► Extractive Summary
Papers ──► TF-IDF Vectors ──► K-Means++ ──► Auto-Labeled Clusters
Papers ──► Regex NER ──► Models / Datasets / Metrics / Tasks
Papers ──► Keyword Extraction per Year ──► Emerging / Declining Terms
Papers ──► Concept Extraction ──► Co-Occurrence Graph ──► Knowledge Graph
Papers ──► 9-Dimension Regex Scan ──► Gap Analysis ──► Research Questions
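
The fusion step in the pipeline above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the project's own code: the function name `reciprocal_rank_fusion` is hypothetical, and the constant `k=60` follows the value used in the Cormack et al. (2009) paper; whether `utils.py` uses the same constant is an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical usage: fuse a TF-IDF ranking with a BM25 ranking.
tfidf_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([tfidf_ranking, bm25_ranking])
```

Because RRF only looks at ranks, the TF-IDF cosine scores and BM25 scores never need to be put on a common scale, which is the usual reason for choosing it over a weighted sum of raw scores.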
| File | Lines | Purpose |
|---|---|---|
| utils.py | ~1600 | TF-IDF + BM25 engines, hybrid retrieval + RRF, TextRank summarizer, NER, K-Means clustering, topic drift, knowledge graph, gap analysis, research agent, arXiv API, export |
| research_server.py | ~600 | FastMCP server with 19 tool endpoints |
| streamlit_app.py | ~830 | Tabbed web UI with full AI/RAG dashboard |
mcp-for-research-paper.streamlit.app
git clone https://github.com/<your-username>/MCP-for-Research-paper.git
cd MCP-for-Research-paper
pip install -r requirements.txt
streamlit run streamlit_app.py
MCP Server (for Claude Desktop / Cursor)
python research_server.py
{
  "mcpServers": {
    "arxiv-research": {
      "command": "python",
      "args": ["path/to/research_server.py"]
    }
  }
}
🛠️ MCP Tools Reference
| Tool | Description |
|---|---|
| search_papers | Search arXiv with query expansion |
| analyze_papers | Full literature review |
| compare_papers | Side-by-side paper comparison |
| track_research_trends | Trend analysis report |
| export_bibtex | BibTeX export |
| export_citations | APA or Markdown citations |

| Tool | Description |
|---|---|
| semantic_search | TF-IDF search across all saved papers |
| smart_search | Hybrid TF-IDF + BM25 with RRF |
| ask_papers | RAG-style QA against paper corpus |
| find_related | Cosine similarity to find related papers |
| summarize_corpus | TextRank multi-document summarization |
| extract_named_entities | NER: models, datasets, metrics, tasks |
| cluster_topic | K-Means paper clustering |
| topic_evolution | Temporal keyword drift analysis |
| knowledge_graph | Concept co-occurrence network |
| identify_gaps | 9-dimension research gap analysis |
| suggest_questions | Auto-generated research questions |
| research_agent | Full 8-step autonomous pipeline |
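
To give a sense of what a tool such as `summarize_corpus` might do internally, the following is a minimal TextRank sketch: sentences are graph nodes, edges are weighted by word overlap, and a PageRank-style power iteration scores them. The function names, the overlap similarity (shared words divided by the sum of log sentence lengths, as in the original TextRank paper), the damping factor of 0.85, and the fixed iteration count are all assumptions for illustration; the project's implementation may differ.

```python
import math
import re


def _tokens(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def _similarity(a, b):
    """Word-overlap similarity normalised by log sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank_summary(sentences, top_n=2, damping=0.85, iterations=50):
    """Pick the top_n highest-ranked sentences, returned in original order."""
    n = len(sentences)
    if n == 0:
        return []
    toks = [_tokens(s) for s in sentences]
    # Symmetric weighted adjacency matrix with no self-loops.
    weights = [
        [_similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Simplification: isolated sentences pass on no score at all.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    best = sorted(range(n), key=lambda i: (-scores[i], i))[:top_n]
    return [sentences[i] for i in sorted(best)]
```

For multi-document summarization under these assumptions, the sentences of all abstracts would be pooled into one list before ranking, so that sentences echoed across several papers reinforce each other.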
🧪 Technical Highlights (Resume-Worthy)
Okapi BM25: Full implementation with k1/b parameters and IDF scoring, the same ranking function that powers Elasticsearch
Reciprocal Rank Fusion: Merges TF-IDF and BM25 ranked lists (Cormack et al. 2009); used by production search systems like Pinecone and Weaviate
TextRank / PageRank: Graph-based sentence ranking for extractive multi-document summarization, built on PageRank, the algorithm behind Google's original search ranking
K-Means++ Clustering: Smart centroid initialization on TF-IDF vectors with automatic cluster labeling from top centroid terms
Knowledge Graph: Concept co-occurrence network with interactive Plotly force-directed visualization
Named Entity Recognition: 100+ regex patterns covering modern AI models (GPT-4, LLaMA, BERT...), datasets (ImageNet, SQuAD...), metrics (BLEU, F1...), and tasks
Pure-Python ML: All algorithms (TF-IDF, BM25, K-Means, TextRank, NER) implemented from scratch, with zero sklearn/numpy dependency
RAG without LLM API: Full retrieval-augmented generation architecture (chunk → index → retrieve → assemble) without paid API keys
Multi-Step Agent: 8-step autonomous pipeline mirroring LangChain/AutoGPT agent patterns
Model Context Protocol: 19 MCP tools exposable to Claude Desktop, Cursor, or any MCP client
Async Throughout: aiohttp for non-blocking arXiv API calls; the Streamlit bridge handles the sync-to-async boundary
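
The Okapi BM25 highlight above mentions the k1/b parameters and IDF scoring; a minimal sketch of that scoring function follows. `TinyBM25` is a hypothetical name and not the project's `BM25Index`; the defaults `k1=1.5` and `b=0.75` and the non-negative Lucene-style IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter


class TinyBM25:
    """Hypothetical minimal Okapi BM25 scorer over pre-tokenized documents."""

    def __init__(self, tokenized_docs, k1=1.5, b=0.75):
        self.k1 = k1  # term-frequency saturation
        self.b = b    # strength of document-length normalisation
        self.doc_tfs = [Counter(doc) for doc in tokenized_docs]
        self.doc_lens = [len(doc) for doc in tokenized_docs]
        n = len(tokenized_docs)
        self.avgdl = sum(self.doc_lens) / n if n else 0.0
        df = Counter(term for doc in tokenized_docs for term in set(doc))
        # Lucene-style IDF (an assumption): log(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {
            t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()
        }

    def score(self, query_tokens, doc_index):
        """BM25 score of one document for the given query tokens."""
        tf = self.doc_tfs[doc_index]
        length_ratio = self.doc_lens[doc_index] / self.avgdl if self.avgdl else 0.0
        length_norm = 1 - self.b + self.b * length_ratio
        total = 0.0
        for term in query_tokens:
            freq = tf.get(term, 0)
            if freq:
                total += (
                    self.idf[term] * freq * (self.k1 + 1)
                    / (freq + self.k1 * length_norm)
                )
        return total

    def rank(self, query_tokens):
        """Return (doc_index, score) pairs with positive score, best first."""
        scored = [(i, self.score(query_tokens, i)) for i in range(len(self.doc_tfs))]
        scored = [pair for pair in scored if pair[1] > 0]
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Raising `k1` lets repeated terms keep adding to the score for longer, while `b` closer to 1 penalises long documents more strongly; `b=0` disables length normalisation entirely.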
Fork the repo
Create a feature branch (git checkout -b feat/my-feature)
Commit your changes (git commit -m "feat: add awesome feature")
Push & open a Pull Request
MIT. See LICENSE for details.