Add howto for Diversification #24
base: master
Changes from 3 commits
53f3e52
747cba9
6254efe
f8e98e3
c1e276b
6b3fcbf
4aec046
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
Diversification of Search Results | ||
================================= | ||
|
||
.. contents:: Table of contents | ||
|
||
Introduction | ||
------------ | ||
|
||
Xapian allows for diversification of documents which are stored in the form of an MSet. | ||
This feature is a well-known technique in information retrieval used to increase | ||
user satisfaction, especially for ambiguous queries. | ||
|
||
Xapian currently has an implementation of an *implicit* method (using documents as features, | ||
|
||
as opposed to using query-based features such as query logs) adapted from the C :sup:`2` - GLS method described in Scalable and Efficient Web Search Results Diversification, Naini et al. 2016. This avoids the cost of providing external features such as query | ||
|
||
logs, while still achieving the desired diversification effect, which according to | ||
|
||
the paper is reasonable enough for practical uses as tested on the public data set - ClueWeb09 with TREC Web 09/10 queries. | ||
|
||
API | ||
--- | ||
|
||
Diversification on an MSet of results can be achieved by using the | ||
:xapian-class:`Diversify` class, e.g.::
|
||
|
||
// Query a database and get 10 results, where 'enq' is an instantiated | ||
// Enquire object over a database | ||
matches = enq.get_mset(0, 10) | ||
Rather than giving code in a particular language (or here what appears to be Python but with C++ comments!) it's better to put the example code under By using these macros, people can provide translations of the examples to other languages, and so we can have versions of this guide in C++, Python, Perl, PHP, etc. The other thing these macros do is ensure that the shown example code actually works and produces any output shown.

Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples. |
||
|
||
Now, cluster the 10 candidate documents into 4 clusters and use (at most) top-2 | ||
documents from each cluster for diversification:: | ||
|
||
k, r = 4, 2 | ||
// Instantiate Diversify object | ||
d = xapian.Diversify(k, r) | ||
|
||
Perform diversification over 'matches' and obtain an ordered list of documents:: | ||
|
||
dset = d.get_dmset(matches) | ||
These two snippets go closely together, and the general approach we've taken is to show the whole snippet and then explain it with text before and after. Further, I'd suggest that you need to explain how to use the generated

I think it probably makes more sense to have diversification reorder the With the currently implemented diversification API you'd also have to rework all the application code which uses the returned This would also make it easier to document here, since using the diversified For explicit clustering, I think you get a As for how to do this, there's a simple API to support that which was added for letor by ayushp last year, which is based on setting new weights for each item with Or perhaps better, reorder the

I think I'd favour a direct re-ordering in this case, which would retain the original weights.

Yes, the original weights are probably useful to some API users.

I had a look at the code, and I'm thinking about refactoring it to instead provide a new So usage would look like:
And then you can display Currently we effectively get a reordered list of docids out, so I think we might need to have some sort of mapping to allow us to efficiently find the MSet entry with a particular docid - that could be created as the MSet is scanned to seed the clustering.

I've refactored the code to change the API as above, and from updating the test code it does seem more natural. However while checking if it gives the same results as before I get a segfault which I can reproduce with the unrefactored code:
The segfault was a bug in LCDClusterer - the relevance weight was used as a unique key, which doesn't work well in cases where multiple documents have exactly the same weight. Fixed by xapian/xapian@be80054.

I've been looking at the diversification code some more. I tried some test queries and what I always seem to get out is a result with the top k results in the same order, and the rest with their order reversed. I would expect the first k to change in many cases, and the relative order of remaining entries should probably be preserved (since in the absence of any other reason to order them, the original ordering seems the way to go).

I dug in a bit and found that the diversification uses the cluster centroids, but the LCDClusterer doesn't set these, so every centroid has no terms and zero magnitude - effectively every centroid is the zero point of the term space. That would explain the lack of useful reordering. IIRC the diversification code was written using the existing K-means clustering (which does set centroids) and then LCDClusterer was added, so I suspect that's how we ended up here.

Also I had a look at the paper, and noticed that says "The next cluster center is chosen as the one that maximizes the sum of distances to the previous center(s)" but our implementation has:
That seems to do the right thing for the second cluster (and for the first we seem to correctly take the top ranked doc) but for the third and onwards we only consider the distance to the previous cluster centre, and not also the distances to the ones before that. @uppinder Do you have any useful insights? |
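To make the difference between the two selection rules concrete, here is a minimal, self-contained Python sketch. It is not Xapian's LCDClusterer code: the function names, the `use_all_previous` flag, the use of Euclidean distance, and the toy 2-D points are all assumptions made for illustration. Documents are assumed to be given in rank order, so index 0 is the top-ranked document and becomes the first centre.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_centres(points, k, use_all_previous=True):
    """Greedily pick up to k centre indices from rank-ordered points.

    use_all_previous=True  -> maximise the SUM of distances to all centres
                              chosen so far (the rule quoted from the paper).
    use_all_previous=False -> maximise the distance to the most recent
                              centre only (the behaviour queried above).
    """
    centres = [0]  # the top-ranked document seeds the first cluster
    while len(centres) < min(k, len(points)):
        best, best_score = None, -1.0
        for i, p in enumerate(points):
            if i in centres:
                continue
            if use_all_previous:
                score = sum(euclidean(p, points[c]) for c in centres)
            else:
                score = euclidean(p, points[centres[-1]])
            if score > best_score:  # ties keep the higher-ranked document
                best, best_score = i, score
        centres.append(best)
    return centres


# Toy data (hypothetical): point 2 sits next to the first centre,
# point 3 is far from both of the first two centres.
points = [(0, 0), (10, 0), (0, 1), (5, 8)]

print(pick_centres(points, 3))                          # [0, 1, 3]
print(pick_centres(points, 3, use_all_previous=False))  # [0, 1, 2]
```

Both rules agree on the first two centres, which matches the observation above. They diverge from the third centre onwards: looking only at the previous centre picks point 2, which is almost a duplicate of the first centre, while summing over all previous centres picks point 3.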
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,3 +13,4 @@ How To... | |
synonyms | ||
weighting_scheme | ||
iterate_all_docs | ||
diversification |
This doesn't explain what diversification actually does, although the glossary entry does this very well. Using similar wording may feel like duplication, but would actually be a good thing here.