Update docsprint to reflect the new xapian-letor. #18

Open · wants to merge 3 commits into base: master
1 change: 1 addition & 0 deletions advanced/index.rst
@@ -7,6 +7,7 @@ Advanced features
postingsource
unigramlm
custom_weighting
learning_to_rank
admin_notes
scalability
replication
142 changes: 142 additions & 0 deletions advanced/learning_to_rank.rst
@@ -0,0 +1,142 @@

.. Copyright (C) 2011 Parth Gupta
.. Copyright (C) 2016 Ayush Tomar


=======================
Xapian Learning-to-Rank
=======================

.. contents:: Table of Contents


Introduction
============

Learning-to-Rank (LTR) can be viewed as a weighting scheme which involves machine learning. The main idea behind LTR is to use a machine learning model to bring up relevant documents which are given a low ranking by probabilistic techniques such as BM25. A model is trained on relevance judgements provided by a user for a set of queries and a corpus of documents. This model is then used to re-rank the matchset so that more relevant documents appear higher in the ranking. Learning-to-Rank has recently gained immense popularity and attention among researchers.

LTR can be broadly seen as two stages: learning the model, and ranking. Learning the model takes a training file as input and produces a model. Given this learnt model, when a new query comes in, scores can be assigned to the documents associated with it.

Preparing the Training file
---------------------------

Currently the ranking models supported by LTR are supervised learning models. A supervised learning model requires labelled training data as input. To learn a model using LTR you need to provide the training data in the following format.

.. code-block:: none

0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 ... 18:0.750000 19:1.000000 #docid = 1123323
1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 ... 18:0.500000 19:0.023400 #docid = 4222333

Here each row represents a document for the specified query. The first column is the relevance label, which can take non-negative values. The second column is the query-id, and the last column is the docid. The columns in between hold the feature values as feature_id:value pairs.

As mentioned before, this process requires a training file in the above format. The xapian-letor API can generate such a training file for you, but you have to supply some information first:

1. Query file: This file contains the queries to be used for learning,
   together with their ids. It should be formatted like this::

2010001 'landslide,malaysia'
2010002 'search,engine'
2010003 'Monuments,of,India'
2010004 'Indian,food'

where 2010xxx is the query-id, followed by a comma-separated query in
single quotes.

2. Qrel file: This is the file containing relevance judgements. It should
be formatted in this way::

2010003 Q0 19243417 1
2010003 Q0 3256433 1
2010003 Q0 275014 1
2010003 Q0 298021 0
2010003 Q0 1456811 0

where the first column is the query-id, the third column is the document-id, and the fourth column is the relevance label, which is 0 for irrelevant and 1 for relevant. The second column is often referred to as 'iter', but isn't important for us. All the fields are whitespace-delimited. This is the standard format of almost all relevance judgement files. If you have relevance judgements in a different format, you can convert them to this format using a text processing tool such as 'awk'.

3. Collection Index: Here you supply the path to the index of the corpus. If
   you have 'title' information in the collection marked up with some xml/html
   tag or so, then add::

       indexer.index(title, 1, "S");

You can refer to the "Indexing" section under the "A practical example" heading for the Collection Index. The database created in the practical example will be used as the collection index here. In particular, we are going to use all the documents which contain the term "watch", which will serve as the query for the examples.
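
The awk conversion mentioned above for the Qrel file can be sketched as follows. This is a stand-alone illustration, not part of xapian-letor, and it assumes a hypothetical input layout where each judgement line is ordered as "docid query-id label":

```shell
# Hypothetical input layout: "docid query-id label", one judgement per line.
# awk reorders each line into the standard "query-id Q0 docid label" format.
printf '19243417 2010003 1\n298021 2010003 0\n' |
    awk '{ print $2, "Q0", $1, $3 }'
# prints:
# 2010003 Q0 19243417 1
# 2010003 Q0 298021 0
```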

Provided with this information, the API is capable of creating a training file in the format mentioned above, which can then be used for learning a model.

To prepare a training file, run the following command from the top level directory. This example assumes that you have created the db from the first example in the "Indexing" section under the "A practical example" heading, and that you have installed xapian-letor.

.. code-block:: none

$ xapian-prepare-trainingfile --db=db data/query.txt data/qrel.txt training_data.txt

xapian-prepare-trainingfile is a utility installed as part of xapian-letor. This should create a training_data.txt with values similar to those in data/training_data.txt.

The source code for xapian-prepare-trainingfile is at `xapian/xapian-letor/bin/xapian-prepare-trainingfile.cc <https://github.com/xapian/xapian/blob/master/xapian-letor/bin/xapian-prepare-trainingfile.cc>`_.

Learning the Model
------------------

In xapian-letor we support the following learning algorithms:

1. `ListNET <http://dl.acm.org/citation.cfm?id=1273513>`_
2. `Ranking-SVM <http://dl.acm.org/citation.cfm?id=775067>`_
3. `ListMLE <http://icml2008.cs.helsinki.fi/papers/167.pdf>`_

You can use any one of these rankers to learn the model. The command line tool xapian-train uses ListNET as the ranker for learning. To learn a model, run the following command from the top level directory.

.. code-block:: none

$ xapian-train --db=db data/training_data.txt "ListNET_Ranker"

Ranking
-------

After we have built a model, it's quite straightforward to get a real score for a particular document for a given query. Here we supply the initially retrieved ranked list to the ranking function, which converts each document to a feature vector of the same dimensions used in training and assigns it a new score. The list is then re-ranked according to the new scores.

Here’s the significant part of the example code to implement ranking.

.. xapianexample:: search_letor

A full copy of this code is available in :xapian-code-example:`^`

You can run this code as follows to re-rank the list of documents retrieved from the db which contain the term "watch", in the order of relevance given in data/qrel.

.. xapianrunexample:: index1
:silent:
:args: data/100-objects-v1.csv db

.. xapiantrain:: search_letor

.. xapianrunexample:: search_letor
:args: db ListNET_Ranker watch
:letor:

Features
========

Features play a major role in learning. In LTR, features are mainly of three types: query dependent, document dependent (pagerank, inLink/outLink number, number of children, etc.) and query-document pair dependent (TF-IDF score, BM25 score, etc.).

Currently we have incorporated 19 features which are described below. These features are statistically tested in `Nallapati2004 <http://dl.acm.org/citation.cfm?id=1009006>`_.

Here c(w,D) denotes the count of term w in document D, and C represents the collection. 'n' is the total number of terms in the query.
:math:`|.|` is the size-of function and idf(.) is the inverse document frequency.


1. :math:`\sum_{q_i \in Q \cap D} \log{\left( c(q_i,D) \right)}`

2. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\right)}`

3. :math:`\sum_{q_i \in Q \cap D} \log{\left(idf(q_i) \right) }`

4. :math:`\sum_{q_i \in Q \cap D} \log{\left( \frac{|C|}{c(q_i,C)} \right)}`

5. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}idf(q_i)\right)}`

6. :math:`\sum_{i=1}^{n}\log{\left(1+\frac{c\left(q_i,D\right)}{|D|}\frac{|C|}{c(q_i,C)}\right)}`


All of the above 6 features are calculated for the 'title only', 'body only' and 'whole' document, giving 6*3=18 features in total. The 19th feature is the Xapian weighting scheme score assigned to the document (by default this is BM25). The API gives you a choice of which specific features to use. By default, all 19 features defined above are used.

One thing to note is that all the feature values are `normalized at Query-Level <https://trac.xapian.org/wiki/GSoC2011/LTR/Notes#QueryLevelNorm>`_. That means the values of a particular feature for a particular query are divided by that query's maximum value for the feature, so all feature values lie between 0 and 1. This normalization helps for unbiased learning.

Nallapati, R. Discriminative models for information retrieval. In Proceedings of SIGIR 2004, pp. 64-71.
98 changes: 98 additions & 0 deletions code/c++/search_letor.cc
@@ -0,0 +1,98 @@
/** @file search_letor.cc
*/
/* Copyright (C) 2004,2005,2006,2007,2008,2009,2010,2015 Olly Betts
* Copyright (C) 2011 Parth Gupta
* Copyright (C) 2016 Ayush Tomar
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; either version 2 of the
* License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
* USA
*/
#include <xapian-letor.h>

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

static void show_usage()
{
cout << "Usage: search_letor DBPATH MODEL_METADATA_KEY QUERY\n";
}

// Start of example code
void rank_letor(string db_path, string model_key, string query_)
{
Xapian::Stem stemmer("english");
Xapian::doccount msize = 10;
Xapian::QueryParser parser;
parser.add_prefix("title", "S");
parser.add_prefix("subject", "S");
Xapian::Database db(db_path);
parser.set_database(db);
parser.set_default_op(Xapian::Query::OP_OR);
parser.set_stemmer(stemmer);
parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
Xapian::Query query_no_prefix = parser.parse_query(query_,
parser.FLAG_DEFAULT);
// query with title as default prefix
Xapian::Query query_default_prefix = parser.parse_query(query_,
parser.FLAG_DEFAULT,
"S");
// Combine queries
Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, query_no_prefix,
query_default_prefix);
Xapian::Enquire enquire(db);
enquire.set_query(query);
Xapian::MSet mset = enquire.get_mset(0, msize);

cout << "Docids before re-ranking by LTR model:" << endl;
for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
Xapian::Document doc = i.get_document();
string data = doc.get_data();
cout << *i << ": [" << i.get_weight() << "]\n" << data << "\n";
}

// Initialise Ranker object with ListNETRanker instance, db path and query.
// See Ranker documentation for available Ranker subclass options.
Xapian::ListNETRanker ranker;
ranker.set_database_path(db_path);
ranker.set_query(query);

// Re-rank the existing mset using the letor model.
ranker.rank(mset, model_key);

cout << "Docids after re-ranking by LTR model:\n" << endl;

for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
Xapian::Document doc = i.get_document();
string data = doc.get_data();
cout << *i << ": [" << i.get_weight() << "]\n" << data << "\n";
}
}
// End of example code.

int main(int argc, char** argv)
{
if (argc != 4) {
show_usage();
return 1;
}
string db_path = argv[1];
string model_key = argv[2];
string query = argv[3];
rank_letor(db_path, model_key, query);
return 0;
}
59 changes: 59 additions & 0 deletions code/expected.out/search_letor.out
@@ -0,0 +1,59 @@
Docids before re-ranking by LTR model:
4: [6.7291]
1970-377
Watch with Chinese duplex escapement
Watch with Chinese duplex escapement
13: [6.29324]
1985-1538
Watch timer by P
Watch timer by P.L. Instruments Ltd., Welwyn Garden City; with tuning fork standard and spark printer
33: [5.77657]
1986-516
A device by Favag of Neuchatel which enables a stop watch to
A device by Favag of Neuchatel which enables a stop watch to be operated by an electric signal, 1961
18: [5.76966]
1987-1145
Solar/Sidereal verge watch with epicyclic maintaining power
Solar/Sidereal verge watch with epicyclic maintaining power by James Green, London, no. 5969, in silver pair case hallmarked 1776, believed to be the earliest English watch to indicate solar and sidereal time with seconds.
36: [5.52705]
1986-91
Universal 'Tri-Compax' chronographic wrist watch
'Tri-Compax' calendar chronograph wristwatch in gold case, by Universal, Geneva, Switzerland, 1960-1986.
15: [5.05174]
1985-1914
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet lever escapement. The dial shows a helmeted Dan Dare holding a ray gun, with a monster approaching. The emblem of the " Eagle" children's comic is impressed in the back of case.
46: [2.5204]
1883-68
Model by Dent of mechanism for setting hands and winding up
Model by Dent of mechanism for setting hands and winding up a keyless watch
Docids after re-ranking by LTR model:

46: [-0.0075442]
1883-68
Model by Dent of mechanism for setting hands and winding up
Model by Dent of mechanism for setting hands and winding up a keyless watch
15: [-0.0298056]
1985-1914
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet
Ingersoll "Dan Dare" automaton pocket watch with pin-pallet lever escapement. The dial shows a helmeted Dan Dare holding a ray gun, with a monster approaching. The emblem of the " Eagle" children's comic is impressed in the back of case.
33: [-0.0305018]
1986-516
A device by Favag of Neuchatel which enables a stop watch to
A device by Favag of Neuchatel which enables a stop watch to be operated by an electric signal, 1961
36: [-0.032863]
1986-91
Universal 'Tri-Compax' chronographic wrist watch
'Tri-Compax' calendar chronograph wristwatch in gold case, by Universal, Geneva, Switzerland, 1960-1986.
18: [-0.0338239]
1987-1145
Solar/Sidereal verge watch with epicyclic maintaining power
Solar/Sidereal verge watch with epicyclic maintaining power by James Green, London, no. 5969, in silver pair case hallmarked 1776, believed to be the earliest English watch to indicate solar and sidereal time with seconds.
13: [-0.0403684]
1985-1538
Watch timer by P
Watch timer by P.L. Instruments Ltd., Welwyn Garden City; with tuning fork standard and spark printer
4: [-0.0447454]
1970-377
Watch with Chinese duplex escapement
Watch with Chinese duplex escapement