Daksha Ladia, Snigdha Ansu, Vasileios Vittis
This project introduces a novel approach that bridges the gap between user intent and document relevance using pseudo-query generation and large language models (LLMs). By chunking documents into passages to create pseudo-queries and transforming user queries into detailed, multifaceted queries with LLMs, we compare them on a per-document basis to rank and retrieve the most relevant results.
The system pipeline used in our project is outlined below.
Current information retrieval systems often fail to capture implicit context necessary for producing relevant documents, due to limitations of short, ambiguous queries. This project addresses these complexities by generating long, context-rich queries to improve retrieval accuracy.
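As a concrete illustration of turning a short query into a context-rich one, a minimal sketch of an LLM expansion prompt is shown below; the prompt wording and helper function are our own illustrative assumptions, not the project's exact prompt:

```python
def build_expansion_prompt(query: str, n_facets: int = 3) -> str:
    """Construct an illustrative prompt asking an LLM to expand a short,
    ambiguous query into a longer, multifaceted one (hypothetical wording)."""
    return (
        f"Rewrite the search query '{query}' as a detailed, context-rich "
        f"query covering at least {n_facets} facets of the underlying "
        "information need. Return only the rewritten query."
    )

# The resulting prompt would be sent to a generative model of choice.
prompt = build_expansion_prompt("statin side effects")
```

The expanded query returned by the model then replaces the original short query for the retrieval stage.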
The project's methodology includes:
- Pseudo-Query Generation: Segmenting documents into chunks to generate contextually rich pseudo-queries using generative models.
- Training: Using autoregressive models to train on generated pseudo-queries and corresponding documents.
- Inference and Retrieval: Utilizing trained models to generate queries and retrieve relevant documents.
- Document Segmentation: Documents are segmented into chunks.
- Generative Modeling: A pretrained model (e.g., FLAN-T5-Large) generates pseudo-queries from document chunks.
- Diversity Filtering: Redundant queries are filtered out to maintain query quality.
- Model Training: An autoregressive model maps queries to the best matching pseudo-queries.
- Document Retrieval: The system retrieves documents based on query-pseudo-query relevance scores.
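The diversity-filtering step above can be sketched as a simple near-duplicate check over generated pseudo-queries; the token-set Jaccard measure and the 0.8 threshold below are illustrative assumptions, not the project's actual filter:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def filter_redundant(queries, threshold=0.8):
    """Keep a pseudo-query only if it is not too similar to any
    already-kept one (threshold value assumed for illustration)."""
    kept = []
    for q in queries:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```

A greedy first-come-first-kept pass like this is cheap; an embedding-based similarity could be swapped in for a semantically stronger filter.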
The project was evaluated using two datasets:
- NFCorpus: Focuses on medical information retrieval.
- SciFact: Tailored for scientific document retrieval.
We employed metrics such as Precision, Recall, and NDCG to assess the performance of our retrieval system.
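For reference, NDCG@k can be computed from graded relevance labels as follows; this is a minimal sketch using the standard log2 discount:

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

Here `rels` is the list of relevance grades in ranked order; a perfect ranking yields an NDCG of 1.0.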
Our approach has shown significant improvements over traditional retrieval methods, particularly in aligning complex queries with relevant document content.
Below are links to the project resources, organized by dataset and methodology:
- T5-CPQG + GPT 4o-mini (WordNet + Pretrained LLM)
- T5-CPQG + Cross Encoder Fine Tuned
- T5-CPQG + GPT 2.0 Fine Tuned
- T5-CPQG + T5-small Fine Tuned
- BM25, Dense Retrieval, QOQA + BM25, QOQA + Dense Retrieval baseline implementations on the NFCorpus dataset
- BM25, Dense Retrieval, QOQA + BM25, QOQA + Dense Retrieval baseline implementations on the SciFact dataset
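The BM25 baseline used above can be sketched in a few lines of pure Python; the `k1` and `b` values below are the common Okapi defaults, assumed here for illustration:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with
    Okapi BM25 (standard IDF with +0.5 smoothing)."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each term across the corpus.
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

Ranking documents by these scores reproduces the classic lexical baseline that the dense and QOQA variants are compared against.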
Data files used for experimentation can be found in this folder: https://drive.google.com/drive/folders/191D9QMsCVku2V1aCE0ZlkWvDqCzXlWQ3