-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add howto for Diversification #24
base: master
Are you sure you want to change the base?
Conversation
|
||
// Query a database and get 10 results, where 'enq' is an instantiated | ||
// Enquire object over a database | ||
matches = enq.get_mset(0, 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than giving code in a particular language (or here what appears to be Python but with C++ comments!) it's better to use put the example code under code/
and use .. xapianexample::
to pull in parts of it. You can show the results of running the code with ..xapianrunexample::
(see for example howtos/boolean_filters.rst
for uses of these).
By using these macros, people can provide translations of the examples to other languages, and so we can have a versions of this guide in C++, Python, Perl, PHP, etc.
The other thing these macros do is ensure that the shown example code actually works and produces any output shown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples.
|
||
// Query a database and get 10 results, where 'enq' is an instantiated | ||
// Enquire object over a database | ||
matches = enq.get_mset(0, 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples.
|
||
Xapian allows for diversification of documents which are stored in the form of an MSet. | ||
This feature is a well-known technique in information retrieval used to increase | ||
user satisfaction, especially for ambiguous queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't explain what diversification actually does, although the glossary entry does this very well. Using similar wording may feel like duplication, but would actually be a good thing here.
|
||
Perform diversification over 'matches' and obtain an ordered list of documents:: | ||
|
||
dset = d.get_dmset(matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two snippets go closely together, and the general approach we've taken is to show the whole snippet and then explain it with text before and after. Further, I'd suggest that you need to explain how to use the generated dset
, since otherwise people reading this document will be left with an object they'll still have to look up in the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it probably makes more sense to have diversification reorder the MSet
. If it worked like that, then adding diversification to an existing application would just be a matter of adding a bit of code between the application code which calls get_mset()
and the application code which uses the returned MSet
.
With the currently implemented diversification API you'd also have to rework all the application code which uses the returned MSet
to instead use a DocumentSet
. That's especially annoying if you want to support diversification as an option, since then you need two versions of that code.
This would also make it easier to document here, since using the diversified MSet
is then really no different to using any MSet
.
For explicit clustering, I think you get a DocumentSet
object for each cluster, so there it has more of a reason for the class to exist in that case. To return clustered results via an MSet
we'd need to add some sort of cluster-id in the MSet
or some other way to indicate what the clusters are. It also seems less likely you'd want to treat explicitly clustered results the same way as unclustered results.
As for how to do this, there's a simple API to support that which was added for letor by ayushp last year, which is based on setting new weights for each item with replace_weights()
and then calling sort_by_relevance()
. That could be used here, but we'd have to adjust the weights of the entries to get the desired order - for example, crudely setting weight to 1.0 / desired_rank
would work.
Or perhaps better, reorder the MSet
's internals more directly - diversification is part of xapian-core, so doesn't have to be constrained to use public API methods (unlike xapian-letor).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd favour a direct re-ordering in this case, which would retain the original weights.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the original weights are probably useful to some API users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at the code, and I'm thinking about refactoring it to instead provide a new diversify
method on Xapian::MSet
which takes the same parameters that a Diversify
constructor currently does. The Diversify
class would then not be needed at all.
So usage would look like:
matches = enq.get_mset(0, 10);
matches.diversify(4, 2);
And then you can display matches
as you would without diversification.
Currently we effectively get a reordered list of docids out, so I think we might need to have some sort of mapping to allow us to efficiently find the MSet entry with a particular docid - that could be created as the MSet is scanned to seed the clustering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored the code to change the API as above, and from updating the test code it does seem more natural.
However while checking if it gives the same results as before I get a segfault which I can reproduce with the unrefactored code:
==149288== Invalid read of size 8
==149288== at 0x49A3A4B: Xapian::LCDClusterer::cluster(Xapian::MSet const&) (lcd_clusterer.cc:132)
==149288== by 0x49A81FB: Xapian::Diversify::Internal::get_dmset(Xapian::MSet const&) (diversify.cc:181)
==149288== by 0x49A880C: Xapian::Diversify::get_dmset(Xapian::MSet const&) (diversify.cc:75)
==149288== by 0x10D657: main (quest.cc:458)
==149288== Address 0x0 is not stack'd, malloc'd or (recently) free'd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The segfault was a bug in LCDClusterer - the relevance weight was used as a unique key, which doesn't work well in cases where multiple documents have exactly the same weight. Fixed by xapian/xapian@be80054.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking at the diversification code some more. I tried some test queries and what I always seem to get out is a result with the top k results in the same order, and the rest with their order reversed. I would expect the first k to change in many cases, and the relative order of remaining entries should probably be preserved (since in the absence of any other reason to order them, the original ordering seems the way to go).
I dug in a bit and found that the diversification uses the cluster centroids, but the LCDClusterer doesn't set these, so every centroid has no terms and zero magnitude - effectively every centroid is the zero point of the term space. That would explain the lack of useful reordering.
IIRC the diversification code was written using the existing K-means clustering (which does set centroids) and then LCDClusterer was added, so I suspect that's how we ended up here.
Also I had a look at the paper, and noticed that says "The next cluster center is chosen as the one that maximizes the sum of distances to the previous center(s)" but our implementation has:
// Select a new cluster center which is the point that is farthest away
// from the current cluster center
cluster_center = dist_vector.back().first;
That seems to do the right thing for the second cluster (and for the first we seem to correctly take the top ranked doc) but for the third and onwards we only consider the distance to the previous cluster centre, and not also the distances to the ones before that.
@uppinder Do you have any useful insights?
Contains a howto for diversification with code snippets in python.