A text search engine is an application that allows users to search for and retrieve documents or information within large collections of text. Such engines are designed to parse, index, and retrieve textual data efficiently.
The project is divided into two modules:
- INDEX
- PROMPT
The Index module implements the main program for building an inverted index from a collection of documents. This module takes care of building the lexicon and the document index.
The Prompt module implements the main program for querying the index built by the Index module. This module takes care of querying the index and returning the results to the user.
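The core data structure the Index module builds can be pictured with a minimal in-memory sketch (illustrative only; the class and method names below are not the project's actual API, and the real index is far larger than what fits in memory):

```java
import java.util.*;

// Minimal in-memory sketch of an inverted index: each term maps to its
// postings (the documents it appears in, with per-document term frequency),
// while the lexicon tracks each term's document frequency.
// Names here are illustrative, not the actual classes of the Index module.
public class TinyInvertedIndex {
    // term -> (docId -> term frequency)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    // term -> document frequency
    private final Map<String, Integer> lexicon = new TreeMap<>();

    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            Map<Integer, Integer> plist =
                postings.computeIfAbsent(token, t -> new LinkedHashMap<>());
            plist.merge(docId, 1, Integer::sum);
            lexicon.put(token, plist.size());
        }
    }

    public Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term, Map.of()).keySet();
    }
}
```

Querying then amounts to looking up each query term's postings and combining them, which is what the Prompt module does on top of the on-disk structures.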
```bash
git clone https://github.com/mattiadido95/textualSearchEngine.git
cd textualSearchEngine
```
If you want to download the dataset manually and put it in the correct folder, follow the instructions in the next section. Otherwise, you can skip the download section and run the program directly with the run.sh script.
You must use the document collection available on this page: https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020. Scroll down the page until you come to a section titled “Passage ranking dataset”, and download the first link in the table, collection.tar.gz. Note that this is a collection of 8.8M documents, also called passages, about 2.2GB in size. Put the collection in the folder textualSearchEngine/data/collection/ and extract it.
You must use the queries available at this link: https://msmarco.blob.core.windows.net/msmarcoranking/queries.tar.gz. Put the queries in the folder textualSearchEngine/data/collection/ and extract them.
You must use the qrels available at this link: https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.tsv. Put the qrels in the folder textualSearchEngine/data/collection/ (this is already a plain .tsv file, so no extraction is needed).
At the end of this procedure you should have the following folder structure:
```
textualSearchEngine
├── data
│   ├── collection
│   │   ├── collection.tar.gz
│   │   ├── collection.tsv
│   │   ├── queries.dev.tsv
│   │   └── qrels.dev.tsv
```
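For reference, each line of collection.tsv follows the MS MARCO passage format: a passage id and the passage text separated by a single tab. Parsing one line can be sketched as follows (illustrative code, not the project's actual parser):

```java
// Sketch: parse one line of the MS MARCO collection.tsv, whose lines
// have the form "<passage id>\t<passage text>".
// Illustrative only; the project's own reader may differ.
public class TsvLineParser {
    public static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("malformed line: " + line);
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}
```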
```bash
cd textualSearchEngine
bash run.sh
```
Once run.sh has started you will be able to choose from the following menu:
```
Select an option:
1. Run indexing program.
2. Run prompt program.
3. Exit
```
Select option 1 to start the indexing program. You will have to enter the parameters choosing from the following options:
```
Enter parameters for indexing
params:
-compressed: Enable compressed reading of the collection in tar.gz format. Default: uncompressed reading
-stemmer: Enable Porter Stemming in document preprocessing. Default: disabled
```
Once the parameters have been entered, the indexing program will start and the lexicon and index will be built.
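The preprocessing step controlled by -stemmer can be pictured like this (a sketch: the toy suffix stripper below only stands in for the real Porter stemmer, and the class name is illustrative):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of document preprocessing with an optional stemming step.
// The suffix stripping below is a toy stand-in for the Porter stemmer
// actually used by the project when -stemmer is enabled.
public class Preprocessor {
    private final boolean stemming;

    public Preprocessor(boolean stemming) { this.stemming = stemming; }

    public List<String> process(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty())
                .map(t -> stemming ? stem(t) : t)
                .collect(Collectors.toList());
    }

    // Toy stemmer: strips a couple of common English suffixes.
    private static String stem(String t) {
        if (t.endsWith("ing") && t.length() > 5) return t.substring(0, t.length() - 3);
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }
}
```

With stemming enabled, variants such as "running" and "runs" collapse toward a common root, which shrinks the lexicon and lets a query match morphological variants of its terms.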
Select option 2 to start the prompt program. You will have to enter the parameters choosing from the following options:
```
Enter parameters for the prompt:
List of params:
-scoring <value>: Specify the scoring function [BM25, TFIDF]. Default: TFIDF
-topK <value>: Specify the number of documents to return. Default: 10
-dynamic: Enable dynamic pruning using MAXSCORE. Default: disabled
-conjunctive: Enable conjunctive mode. Default: disjunctive
```
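The two scoring functions selectable with -scoring can be sketched as follows (standard textbook formulations; the exact variants and constants used by the project, such as BM25's k1 and b, may differ):

```java
// Sketch of the per-term contributions of the two scoring functions
// selectable with -scoring. Standard textbook formulations; the
// project's exact variants and constants may differ.
public class Scorers {
    // tf-idf contribution of one term in one document:
    // (1 + log tf) * log(N / df).
    public static double tfidf(int tf, int df, int numDocs) {
        if (tf == 0) return 0.0;
        return (1 + Math.log(tf)) * Math.log((double) numDocs / df);
    }

    // BM25 contribution of one term, with the usual defaults k1=1.2, b=0.75.
    // Longer-than-average documents are penalized via the length norm.
    public static double bm25(int tf, int df, int numDocs,
                              double docLen, double avgDocLen) {
        double k1 = 1.2, b = 0.75;
        double idf = Math.log((double) numDocs / df);
        double norm = k1 * ((1 - b) + b * docLen / avgDocLen);
        return idf * tf / (tf + norm);
    }
}
```

A document's final score for a query is the sum of these per-term contributions over the query terms it contains; the top-K documents by score are returned.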
Once the parameters have been entered, you will be able to choose from the following menu:
```
Welcome to the search engine!
MENU:
- insert 1 to search
- insert 2 to evaluate searchEngine
- insert 3 calculate upper bounds for dynamic pruning
- insert 10 to exit
```
- Option 1 to start the search engine: with this option you will be asked to enter a query and the search result will be returned.
- Option 2 to start the evaluation of the search engine: with this option the search engine evaluation program will be started. This program processes a pre-established set of queries and evaluates the results obtained with trec_eval.
- Option 3 to calculate the TUBs (term upper bounds) and the DUBs (document upper bounds) for dynamic pruning: with this option the program computes the TUB value for each term present in the lexicon and the DUB value for each document present in the index.
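The idea behind MAXSCORE pruning can be sketched as follows (a simplified illustration of how term upper bounds are used; the real implementation applies this while traversing posting lists, which the sketch omits):

```java
import java.util.*;

// Simplified illustration of the MAXSCORE idea: query terms are sorted by
// their score upper bound (TUB), and terms whose cumulative upper bound
// cannot lift a document past the current top-K threshold become
// "non-essential" -- a document matching only those terms can be skipped
// without being fully scored.
public class MaxScoreSketch {
    // Returns how many terms (taken in ascending-TUB order) are
    // non-essential for the given top-K entry threshold.
    public static int nonEssentialCount(double[] upperBounds, double threshold) {
        double[] tubs = upperBounds.clone();
        Arrays.sort(tubs); // ascending order of upper bounds
        double cumulative = 0.0;
        int count = 0;
        for (double tub : tubs) {
            cumulative += tub;
            if (cumulative <= threshold) count++;
            else break;
        }
        return count;
    }
}
```

As the top-K threshold rises during query processing, more terms become non-essential and larger portions of their posting lists can be skipped, which is why precomputing the upper bounds (Option 3) pays off at query time.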
- Java 17