Update docsprint to reflect the new xapian-letor. #18

Open · wants to merge 3 commits into base: master
1 change: 1 addition & 0 deletions advanced/index.rst
@@ -7,6 +7,7 @@ Advanced features
postingsource
unigramlm
custom_weighting
learning_to_rank
admin_notes
scalability
replication
142 changes: 142 additions & 0 deletions advanced/learning_to_rank.rst
@@ -0,0 +1,142 @@

.. Copyright (C) 2011 Parth Gupta
.. Copyright (C) 2016 Ayush Tomar


=======================
Xapian Learning-to-Rank
=======================

.. contents:: Table of Contents


Introduction
============

Learning-to-Rank (LTR) can be viewed as a weighting scheme which involves machine learning. The main idea behind LTR is to use a machine learning model to bring up relevant documents which are given a low ranking by probabilistic techniques such as BM25. A model is trained on relevance judgements provided by a user for a set of queries and a corpus of documents. This model is then used to re-rank the matchset so that more relevant documents appear higher in the ranking. Learning-to-Rank has recently gained immense popularity and attention among researchers.

LTR can be broadly seen as two stages: learning the model, and ranking. Learning the model takes a training file as input and produces a model. Given this learnt model, when a new query comes in, scores can be assigned to the documents associated with it.

Preparing the Training file
---------------------------

Currently the ranking models supported by LTR are supervised learning models. A supervised learning model requires labelled training data as input. To learn a model using LTR you need to provide the training data in the following format.

.. code-block:: none

0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 ... 18:0.750000 19:1.000000 #docid = 1123323
1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 ... 18:0.500000 19:0.023400 #docid = 4222333

Here each row represents a document for the specified query. The first column is the relevance label, which can take non-negative values. The second column is the query-id, and the last column is the docid. The columns in between hold the feature values as feature_id:value pairs.

As mentioned before, this process requires a training file in the above format. The xapian-letor API can generate such a training file for you, but you have to supply some information first:

1. Query file: This file contains the queries to be used for learning,
   together with their ids. It should be formatted like this::

2010001 'landslide,malaysia'
2010002 'search,engine'
2010003 'Monuments,of,India'
2010004 'Indian,food'

where 2010xxx is the query-id, followed by a comma-separated query in
single quotes.

2. Qrel file: This is the file containing relevance judgements. It should
be formatted in this way::

2010003 Q0 19243417 1
2010003 Q0 3256433 1
2010003 Q0 275014 1
2010003 Q0 298021 0
2010003 Q0 1456811 0

where the first column is the query-id, the third column is the document-id, and the fourth column is the relevance label, which is 0 for irrelevant and 1 for relevant. The second column is often referred to as 'iter', but isn't important for us. All the fields are whitespace-delimited. This is the standard format of almost all relevance judgement files. If you have relevance judgements in a different format, you can convert them to this format using a text processing tool such as 'awk'.

3. Collection Index: Here you supply the path to the index of the corpus. If
   you have 'title' information in the collection marked up with some xml/html
   tag or so, then add::

       indexer.index(title, 1, "S");

You can refer to the "Indexing" section under the "A practical example" heading for the Collection Index. The database created in the practical example will be used as the collection index here. In particular, we are going to use all the documents which contain the term "watch", which will serve as the query for the examples.
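
The awk conversion mentioned above for the Qrel file can be sketched as follows. This is a stand-alone illustration, not part of xapian-letor, and it assumes a hypothetical input layout where each judgement line is ordered as "docid query-id label":

```shell
# Hypothetical input layout: "docid query-id label", one judgement per line.
# awk reorders each line into the standard "query-id Q0 docid label" format.
printf '19243417 2010003 1\n298021 2010003 0\n' |
    awk '{ print $2, "Q0", $1, $3 }'
# prints:
# 2010003 Q0 19243417 1
# 2010003 Q0 298021 0
```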

Provided with this information, the API is capable of creating a training file in the format mentioned above, which can then be used for learning a model.

To prepare a training file, run the following command from the top level directory. This example assumes that you have created the db from the first example in the "Indexing" section under the "A practical example" heading, and that you have installed xapian-letor.

.. code-block:: none

$ xapian-prepare-trainingfile --db=db data/query.txt data/qrel.txt training_data.txt

xapian-prepare-trainingfile is a utility installed as part of xapian-letor. This should create a training_data.txt with values similar to those in data/training_data.txt.

The source code for xapian-prepare-trainingfile is at `xapian/xapian-letor/bin/xapian-prepare-trainingfile.cc <https://github.com/xapian/xapian/blob/master/xapian-letor/bin/xapian-prepare-trainingfile.cc>`_.

Learning the Model
------------------

In xapian-letor we support the following learning algorithms:

1. `ListNET <http://dl.acm.org/citation.cfm?id=1273513>`_
2. `Ranking-SVM <http://dl.acm.org/citation.cfm?id=775067>`_
3. `ListMLE <http://icml2008.cs.helsinki.fi/papers/167.pdf>`_

You can use any one of these rankers to learn the model. The command line tool xapian-train uses ListNET as the ranker for learning. To learn a model, run the following command from the top level directory.

.. code-block:: none

$ xapian-train --db=db data/training_data.txt "ListNET_Ranker"

Ranking
-------

After we have built a model, it's quite straightforward to get a real score for a particular document for a given query. Here we supply the initially retrieved ranked list to the ranking function, which converts each document to a feature vector of the same dimensions used in training and assigns it a new score. The list is then re-ranked according to the new scores.

Here’s the significant part of the example code to implement ranking.

.. xapianexample:: search_letor

A full copy of this code is available in :xapian-code-example:`^`

You can run this code as follows to re-rank the list of documents retrieved from the db which contain the term "watch", in the order of relevance given in data/qrel.

.. xapianrunexample:: index1
:silent:
:args: data/100-objects-v1.csv db

.. xapiantrain:: search_letor

.. xapianrunexample:: search_letor
:args: db ListNET_Ranker watch
:letor:

Features
========

Features play a major role in learning. In LTR, features are mainly of three types: query dependent, document dependent (pagerank, inLink/outLink number, number of children, etc.) and query-document pair dependent (TF-IDF score, BM25 score, etc.).

Currently we have incorporated 19 features which are described below. These features are statistically tested in `Nallapati2004 <http://dl.acm.org/citation.cfm?id=1009006>`_.

Here c(w,D) denotes the count of term w in document D, and C represents the collection. 'n' is the total number of terms in the query.
:math:`|.|` is the size-of function and idf(.) is the inverse document frequency.


1. :math:`\sum_{q_i \in Q \cap D} \log{\left( c(q_i,D) \right)}`

2. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\right)}`

3. :math:`\sum_{q_i \in Q \cap D} \log{\left(idf(q_i) \right) }`

4. :math:`\sum_{q_i \in Q \cap D} \log{\left( \frac{|C|}{c(q_i,C)} \right)}`

5. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}idf(q_i)\right)}`

6. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\frac{|C|}{c(q_i,C)}\right)}`


All of the above 6 features are calculated for the 'title only', 'body only' and 'whole' document, giving 6*3=18 features in total. The 19th feature is the Xapian weighting scheme score assigned to the document (by default this is BM25). The API gives you a choice of which specific features to use. By default, all 19 features defined above are used.

One thing to note is that all the feature values are `normalized at Query-Level <https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm>`_. That means the values of a particular feature for a particular query are divided by that query's maximum value for the feature, so all feature values lie between 0 and 1. This normalization helps for unbiased learning.

Nallapati, R. Discriminative models for information retrieval. In Proceedings of SIGIR 2004, pp. 64-71.
98 changes: 98 additions & 0 deletions code/c++/search_letor.cc
@@ -0,0 +1,98 @@
/** @file search_letor.cc
*/
/* Copyright (C) 2004,2005,2006,2007,2008,2009,2010,2015 Olly Betts
* Copyright (C) 2011 Parth Gupta
* Copyright (C) 2016 Ayush Tomar
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; either version 2 of the
* License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
* USA
*/
#include <xapian-letor.h>

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

static void show_usage()
{
cout << "Usage: search_letor DBPATH MODEL_METADATA_KEY QUERY\n";
}

// Start of example code
void rank_letor(string db_path, string model_key, string query_)
{
Xapian::Stem stemmer("english");
Xapian::doccount msize = 10;
Xapian::QueryParser parser;
parser.add_prefix("title", "S");
parser.add_prefix("subject", "S");
Xapian::Database db(db_path);
parser.set_database(db);
parser.set_default_op(Xapian::Query::OP_OR);
parser.set_stemmer(stemmer);
parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
Xapian::Query query_no_prefix = parser.parse_query(query_,
parser.FLAG_DEFAULT);
// query with title as default prefix
Xapian::Query query_default_prefix = parser.parse_query(query_,
parser.FLAG_DEFAULT,
"S");
// Combine queries
Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, query_no_prefix,
query_default_prefix);
Xapian::Enquire enquire(db);
enquire.set_query(query);
Xapian::MSet mset = enquire.get_mset(0, msize);

cout << "Docids before re-ranking by LTR model:" << endl;
for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
Xapian::Document doc = i.get_document();
string data = doc.get_data();
cout << *i << ": [" << i.get_weight() << "]\n" << data << "\n";
}

// Initialise Ranker object with ListNETRanker instance, db path and query.
// See Ranker documentation for available Ranker subclass options.
Xapian::ListNETRanker ranker;
ranker.set_database_path(db_path);
ranker.set_query(query);

// Re-rank the existing mset using the letor model.
ranker.rank(mset, model_key);

cout << "Docids after re-ranking by LTR model:\n" << endl;

for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
Xapian::Document doc = i.get_document();
string data = doc.get_data();
cout << *i << ": [" << i.get_weight() << "]\n" << data << "\n";
}
}
// End of example code.

int main(int argc, char** argv)
{
if (argc != 4) {
show_usage();
return 1;
}
string db_path = argv[1];
string model_key = argv[2];
string query = argv[3];
rank_letor(db_path, model_key, query);
return 0;
}
59 changes: 59 additions & 0 deletions code/expected.out/search_letor.out
@@ -0,0 +1,59 @@
Docids before re-ranking by LTR model:
4: [6.7291]
1970-377
Watch with Chinese duplex escapement
Watch with Chinese duplex escapement
13: [6.29324]
1985-1538
Watch timer by P
Watch timer by P.L. Instruments Ltd., Welwyn Garden City; with tuning fork standard and spark printer
33: [5.77657]
1986-516
A device by Favag of Neuchatel which enables a stop watch to
A device by Favag of Neuchatel which enables a stop watch to be operated by an electric signal, 1961
18: [5.76966]
1987-1145
Solar/Sidereal verge watch with epicyclic maintaining power
Solar/Sidereal verge watch with epicyclic maintaining power by James Green, London, no. 5969, in silver pair case hallmarked 1776, believed to be the earliest English watch to indicate solar and sidereal time with seconds.
36: [5.52705]
1986-91
Universal 'Tri-Compax' chronographic wrist watch
'Tri-Compax' calendar chronograph wristwatch in gold case, by Universal, Geneva, Switzerland, 1960-1986.
15: [5.05174]
1985-1914
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet lever escapement. The dial shows a helmeted Dan Dare holding a ray gun, with a monster approaching. The emblem of the " Eagle" children's comic is impressed in the back of case.
46: [2.5204]
1883-68
Model by Dent of mechanism for setting hands and winding up
Model by Dent of mechanism for setting hands and winding up a keyless watch
Docids after re-ranking by LTR model:

46: [-0.0075442]
1883-68
Model by Dent of mechanism for setting hands and winding up
Model by Dent of mechanism for setting hands and winding up a keyless watch
15: [-0.0298056]
1985-1914
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet lever escapement. The dial shows a helmeted Dan Dare holding a ray gun, with a monster approaching. The emblem of the " Eagle" children's comic is impressed in the back of case.
33: [-0.0305018]
1986-516
A device by Favag of Neuchatel which enables a stop watch to
A device by Favag of Neuchatel which enables a stop watch to be operated by an electric signal, 1961
36: [-0.032863]
1986-91
Universal 'Tri-Compax' chronographic wrist watch
'Tri-Compax' calendar chronograph wristwatch in gold case, by Universal, Geneva, Switzerland, 1960-1986.
18: [-0.0338239]
1987-1145
Solar/Sidereal verge watch with epicyclic maintaining power
Solar/Sidereal verge watch with epicyclic maintaining power by James Green, London, no. 5969, in silver pair case hallmarked 1776, believed to be the earliest English watch to indicate solar and sidereal time with seconds.
13: [-0.0403684]
1985-1538
Watch timer by P
Watch timer by P.L. Instruments Ltd., Welwyn Garden City; with tuning fork standard and spark printer
4: [-0.0447454]
1970-377
Watch with Chinese duplex escapement
Watch with Chinese duplex escapement