ClusterChat: A Multi-Feature Search for Corpus Exploration

Ashish Chouhan, Saifeldin Mandour, and Michael Gertz

Heidelberg University

Contact us at: {chouhan, gertz}@informatik.uni-heidelberg.de, [email protected]


Table of Contents
  1. About The Project
  2. Project Structure
  3. Getting Started
  4. Cite our work
  5. License
  6. Acknowledgments

About The Project

Video demonstration: here

Abstract

Exploring large-scale text corpora presents a significant challenge in the biomedical, finance, and legal domains, where vast numbers of documents are continuously published. Traditional search methodologies, i.e., keyword-based search, often retrieve documents in isolation, limiting the user's ability to understand corpus-wide trends and relationships. We present $\textit{ClusterChat}$ (demo video and source code: https://github.com/achouhan93/ClusterChat), an open-source system for corpus exploration that integrates cluster-based organization based on textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a PubMed dataset of four million abstracts, demonstrating that $\textit{ClusterChat}$ enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections.


Project Structure

The $\textit{ClusterChat}$ framework provides a web-based tool for exploring PubMed abstracts, utilizing backend components for document clustering and retrieval-augmented generation (RAG). It employs BERTopic and LangChain for the backend processing, with Cosmograph used for interactive visualizations in the frontend. This setup supports both multi-feature search on abstracts and natural language query capabilities for enhanced corpus navigation.

Backend

Folder: backend/

  1. Data Collection and Storage (1.embedding_data_storage): PubMed abstracts from 2020–2024 were collected and stored in OpenSearch, yielding about four million abstracts with metadata.

  2. Topic Modeling and Clustering Information (2.topic_modelling and 4. cluster_information): Abstracts are embedded with NeuML/pubmedbert-base-embeddings, reduced in dimensionality via UMAP, and clustered using HDBSCAN. Keywords and labels for each cluster are generated using BM25 and GPT-4o-mini, and stored in OpenSearch.

  3. RAG Pipeline (3.rag_pipeline): For question answering, abstracts are segmented into sentences, yielding around 46 million sentence embeddings indexed in OpenSearch. Document-level queries retrieve contextually relevant sentence chunks, which are then passed to Mixtral-8x7B to generate answers with citations pointing to the respective PubMed abstracts.
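
The document-level retrieval step in item 3 can be illustrated with a small, self-contained sketch. It is not the repository's implementation: the real pipeline embeds sentences with NeuML/pubmedbert-base-embeddings and searches them in OpenSearch, whereas this toy uses word-count vectors and an in-memory dictionary as stand-ins. The function names (split_sentences, embed, cosine, retrieve) and the sample document ids are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(abstract: str) -> list:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, abstracts: dict, k: int = 2) -> list:
    """Return the k sentences most similar to the question as (score, doc_id, sentence)."""
    q = embed(question)
    scored = []
    for doc_id, abstract in abstracts.items():
        for sentence in split_sentences(abstract):
            scored.append((cosine(q, embed(sentence)), doc_id, sentence))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical mini-corpus; the ids only mimic how a citation could point back to a source.
abstracts = {
    "PMID-A": "Immunotherapy improves survival in melanoma. Side effects were mild.",
    "PMID-B": "Gene editing corrects a rare genetic disorder. Long-term safety is unknown.",
}
for score, doc_id, sentence in retrieve("Does immunotherapy improve survival?", abstracts, k=1):
    print(f"[{doc_id}] {sentence} (score={score:.2f})")
```

Keeping the source id next to each retrieved sentence is what lets a generated answer cite the abstract it came from; in the actual system the retrieved sentences would be handed to the language model as context.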

Frontend


Figure 1: Overview of the web-based $\textit{ClusterChat}$ interface. The interface includes four main features: 1) a chat panel on the top-left for corpus and document-level question answering; 2) a metadata information panel on the bottom-left for displaying metadata information of the selected documents; 3) a central cluster visualization map showing research topics like “Cancer Treatment” and “Genetic Disorders”; 4) a search panel at the top to perform a lexical and semantic search on “Abstract” text and a lexical search on “Title” text.

Folder: app/

  1. Cluster Overview: Visualizes thematic clusters, like “Cancer Treatment,” and allows for intuitive exploration.

  2. Search and Filtering: Filters documents by date, keywords, semantics, and clusters, refining corpus exploration.

  3. Question-Answering Interface: Supports document-level and corpus-level queries, allowing users to ask both targeted and broad questions about selected clusters or the entire corpus.
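
As an illustration of how a lexical search could be combined with date and cluster filters, the sketch below builds an OpenSearch query body as a plain Python dictionary. It is an assumption about how such a filter might be expressed, not the query the app actually sends: the field names (abstract, date, cluster_id) and the helper build_search_query are hypothetical, and semantic (embedding-based) search is not covered.

```python
import json
from typing import Optional


def build_search_query(text: str, start: str, end: str,
                       cluster_id: Optional[str] = None) -> dict:
    """Build a bool query: lexical match on the abstract, filtered by date and cluster."""
    # Filters restrict the result set without affecting the relevance score.
    filters = [{"range": {"date": {"gte": start, "lte": end}}}]
    if cluster_id is not None:
        filters.append({"term": {"cluster_id": cluster_id}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"abstract": text}}],
                "filter": filters,
            }
        }
    }


query = build_search_query("cancer treatment", "2020-01-01", "2024-12-31", cluster_id="42")
print(json.dumps(query, indent=2))
```

Placing the date range and cluster constraint under filter rather than must keeps them out of scoring, so ranking is driven only by the text match.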


Getting Started

Clone the repository by running the command below:

git clone https://github.com/achouhan93/ClusterChat.git

Navigate to the cloned repository folder:

cd ClusterChat

Once the repository is cloned and you are inside its folder, set up the backend and the frontend as described below.

Setting up Backend

Follow the steps below to set up the Python environment (tested with Python 3.9.0):

  1. Create a virtual environment with venv (or conda):
    python -m venv .venv
  2. Activate the virtual environment:
    source .venv/bin/activate
  3. Install all necessary dependencies:
    pip install -r requirements.txt
  4. Rename .env-example to .env and populate the file with the required credentials:
CLUSTER_CHAT_LOG_EXE_PATH="logs/insights_execution.log"
CLUSTER_CHAT_LOG_PATH="logs/"

# Required for Backend functionalities, i.e., Embedding creation and storage, 
# Topic Modeling and Clustering information construction and storage,
# Retrieval Augmented Generation (RAG) or QA Pipeline to work

# Opensearch Connection Details
OPENSEARCH_USERNAME = "your_opensearch_username"
OPENSEARCH_PASSWORD = "your_opensearch_password"
OPENSEARCH_PORT=your_opensearch_port
CLUSTER_CHAT_OPENSEARCH_HOST="your_opensearch_host_name"

CLUSTER_CHAT_OPENSEARCH_SOURCE_INDEX="frameintell_pubmed"
CLUSTER_CHAT_OPENSEARCH_TARGET_INDEX_COMPLETE="frameintell_pubmed_abstract_embeddings"
CLUSTER_CHAT_OPENSEARCH_TARGET_INDEX_SENTENCE="frameintell_pubmed_sentence_embeddings"
CLUSTER_CHAT_CLUSTER_INFORMATION_INDEX="frameintell_clusterchat_clusterinformation"
CLUSTER_CHAT_DOCUMENT_INFORMATION_INDEX="frameintell_clusterchat_documentinformation"

# HuggingFace Key
HUGGINGFACE_AUTH_KEY = "your-huggingface-api-key"

## Required for embedding computation for Abstract and Sentences
CLUSTER_CHAT_EMBEDDING_MODEL="NeuML/pubmedbert-base-embeddings"
## Required for topic label and topic description generation
OPENAI_API_KEY = "your-openai-key"
## Required for Answer Generation in the QA Pipeline
MODEL_CONFIGS = '{"mixtral7B": {"temperature": 0.3, "max_tokens": 100, "huggingface_model":"mistralai/Mixtral-8x7B-Instruct-v0.1", "repetition_penalty":1.2, "stop_sequences":["<|endoftext|>", "</s>"]}}'

# For storage of the BERTopic models at the intermediate stage
MODEL_PATH = "./intermediate_results/"

# Required for frontend
APP_URL="http://localhost:5173"
OPENSEARCH_NODE="https://your-opensearch-hostname:your-opensearch-port"
  5. Start the backend server:
    cd backend/3.\ rag_pipeline
    uvicorn main:app --reload --port 8100
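
Before starting the server, it can help to check that the values in .env are well-formed. The sketch below is a hypothetical sanity check, not code from the repository: it assumes the variables have already been loaded into the process environment (for example by a dotenv loader), and it sets placeholder defaults only so that it runs on its own. The helper name load_settings and the returned keys are illustrative.

```python
import json
import os

# Placeholder values so the sketch runs standalone; a real setup would load .env instead.
os.environ.setdefault("CLUSTER_CHAT_OPENSEARCH_HOST", "localhost")
os.environ.setdefault("OPENSEARCH_PORT", "9200")
os.environ.setdefault(
    "MODEL_CONFIGS",
    '{"mixtral7B": {"temperature": 0.3, "max_tokens": 100, '
    '"huggingface_model": "mistralai/Mixtral-8x7B-Instruct-v0.1"}}',
)


def load_settings() -> dict:
    """Read connection and model settings from the environment and validate them."""
    host = os.environ["CLUSTER_CHAT_OPENSEARCH_HOST"]
    port = int(os.environ["OPENSEARCH_PORT"])  # raises ValueError if the port is not numeric
    model_configs = json.loads(os.environ["MODEL_CONFIGS"])  # raises if the JSON is malformed
    return {"opensearch_url": f"https://{host}:{port}", "model_configs": model_configs}


settings = load_settings()
print(settings["opensearch_url"])
for name, cfg in settings["model_configs"].items():
    print(name, "->", cfg["huggingface_model"])
```

MODEL_CONFIGS is a JSON string wrapped in single quotes, so a stray quote or trailing comma is easy to miss; parsing it up front surfaces that kind of mistake before the QA pipeline is first used.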

Setting up Frontend

Follow the steps below to set up the frontend:

  1. Navigate to the app folder:

    cd app
  2. Install frontend dependencies:

    npm install
  3. Start the frontend server:

    npm run dev


Cite our work

No current information


License

We use the standard MIT license for code artifacts. See license/LICENSE.txt for more information.


Acknowledgments

We thank the Bundesministerium für Bildung und Forschung (BMBF) for funding this research within the FrameIntell project.
