
Complex Query Synthesis for Enhanced Information Retrieval

Team Members

Daksha Ladia, Snigdha Ansu, Vasileios Vittis

Abstract

This project introduces a novel approach that bridges the gap between user intent and document relevance using pseudo-query generation and large language models (LLMs). By chunking documents into passages to create pseudo-queries and transforming user queries into detailed, multifaceted queries with LLMs, we compare them on a per-document basis to rank and retrieve the most relevant results.

System Overview

Here is an overview of the system pipeline used in our project: [System Pipeline diagram]

Problem Statement

Current information retrieval systems often fail to capture the implicit context needed to surface relevant documents, because user queries are typically short and ambiguous. This project addresses that limitation by generating long, context-rich queries to improve retrieval accuracy.

Approach

The project's methodology includes:

  • Pseudo-Query Generation: Segmenting documents into chunks to generate contextually rich pseudo-queries using generative models.
  • Training: Using autoregressive models to train on generated pseudo-queries and corresponding documents.
  • Inference and Retrieval: Utilizing trained models to generate queries and retrieve relevant documents.
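The pseudo-query generation step above can be sketched as follows. This is a minimal illustration assuming the HuggingFace `transformers` library and FLAN-T5-Large (named in the Detailed Steps); the prompt wording, sampling settings, and function names are assumptions, not the project's actual code.

```python
# Sketch of pseudo-query generation with FLAN-T5 (hypothetical prompt
# and generation parameters; the original implementation may differ).
from typing import List


def build_prompt(chunk: str) -> str:
    """Wrap a document chunk in an instruction prompt for query generation."""
    return f"Generate a search query that this passage would answer: {chunk}"


def generate_pseudo_queries(chunks: List[str], n_per_chunk: int = 3) -> List[str]:
    """Generate several pseudo-queries per chunk with FLAN-T5-Large.

    Requires the `transformers` package; model name and sampling
    settings are illustrative assumptions.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
    queries: List[str] = []
    for chunk in chunks:
        inputs = tokenizer(build_prompt(chunk), return_tensors="pt", truncation=True)
        outputs = model.generate(
            **inputs,
            num_return_sequences=n_per_chunk,
            do_sample=True,       # sampling encourages diverse queries per chunk
            max_new_tokens=32,
        )
        queries.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return queries
```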

Detailed Steps

  1. Document Segmentation: Documents are segmented into chunks.
  2. Generative Modeling: A pretrained model (e.g., FLAN-T5-Large) generates pseudo-queries from document chunks.
  3. Diversity Filtering: Redundant queries are filtered out to maintain query quality.
  4. Model Training: An autoregressive model learns to map user queries to the best-matching pseudo-queries.
  5. Document Retrieval: The system retrieves documents based on query-pseudo-query relevance scores.
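Steps 1 and 3 above can be sketched in a few lines. The window size, overlap, and Jaccard-similarity threshold below are illustrative assumptions; the project may use a different chunking scheme or a stronger (e.g., embedding-based) similarity measure for filtering.

```python
# Minimal sketches of document segmentation (step 1) and diversity
# filtering (step 3). All parameter values are illustrative assumptions.
from typing import List


def chunk_document(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    """Split a document into overlapping word windows."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two queries."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def filter_redundant(queries: List[str], threshold: float = 0.7) -> List[str]:
    """Keep a query only if it is not too similar to any already kept."""
    kept: List[str] = []
    for q in queries:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept
```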

Experiments

The project was evaluated using two datasets:

  • NFCorpus: Focuses on medical information retrieval.
  • SciFact: Tailored for scientific document retrieval.

Evaluation Metrics

We employed metrics such as Precision, Recall, and NDCG to assess the performance of our retrieval system.

Results

Our approach has shown significant improvements over traditional retrieval methods, particularly in aligning complex queries with relevant document content.

Code and Resources

Below are links to the project resources, organized by dataset and methodology:

NFCorpus Dataset

SciFact Dataset

Baselines

Data files used for experimentation can be found in this folder: https://drive.google.com/drive/folders/191D9QMsCVku2V1aCE0ZlkWvDqCzXlWQ3
