Add howto for Diversification #24

uppinder · 2018-08-04T16:21:24Z

Contains a howto for diversification with code snippets in python.

howtos/diversification.rst

ojwb · 2018-08-09T03:15:34Z

howtos/diversification.rst

+
+    // Query a database and get 10 results, where 'enq' is an instantiated
+    // Enquire object over a database
+    matches = enq.get_mset(0, 10)


Rather than giving code in a particular language (or here what appears to be Python but with C++ comments!) it's better to use put the example code under code/ and use .. xapianexample:: to pull in parts of it. You can show the results of running the code with ..xapianrunexample:: (see for example howtos/boolean_filters.rst for uses of these).

By using these macros, people can provide translations of the examples to other languages, and so we can have a versions of this guide in C++, Python, Perl, PHP, etc.

The other thing these macros do is ensure that the shown example code actually works and produces any output shown.

Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples.

howtos/diversification.rst

jaylett · 2018-08-19T17:02:45Z

howtos/diversification.rst

+
+    // Query a database and get 10 results, where 'enq' is an instantiated
+    // Enquire object over a database
+    matches = enq.get_mset(0, 10)


Further, this encourages creating a complete working program; however, you don't have to show all the code in the documentation. It's generally a good idea to base your code on one of the other examples, which you can reference and then describe just what you've changed. In this case, you wouldn't need this snippet of code, which is common in existing search examples.

jaylett · 2018-08-19T17:03:42Z

howtos/diversification.rst

+
+Xapian allows for diversification of documents which are stored in the form of an MSet.
+This feature is a well-known technique in information retrieval used to increase
+user satisfaction, especially for ambiguous queries.


This doesn't explain what diversification actually does, although the glossary entry does this very well. Using similar wording may feel like duplication, but would actually be a good thing here.

howtos/diversification.rst

jaylett · 2018-08-19T17:07:36Z

howtos/diversification.rst

+
+Perform diversification over 'matches' and obtain an ordered list of documents::
+
+    dset = d.get_dmset(matches)


These two snippets go closely together, and the general approach we've taken is to show the whole snippet and then explain it with text before and after. Further, I'd suggest that you need to explain how to use the generated dset, since otherwise people reading this document will be left with an object they'll still have to look up in the API.

I think it probably makes more sense to have diversification reorder the MSet. If it worked like that, then adding diversification to an existing application would just be a matter of adding a bit of code between the application code which calls get_mset() and the application code which uses the returned MSet.

With the currently implemented diversification API you'd also have to rework all the application code which uses the returned MSet to instead use a DocumentSet. That's especially annoying if you want to support diversification as an option, since then you need two versions of that code.

This would also make it easier to document here, since using the diversified MSet is then really no different to using any MSet.

For explicit clustering, I think you get a DocumentSet object for each cluster, so there it has more of a reason for the class to exist in that case. To return clustered results via an MSet we'd need to add some sort of cluster-id in the MSet or some other way to indicate what the clusters are. It also seems less likely you'd want to treat explicitly clustered results the same way as unclustered results.

As for how to do this, there's a simple API to support that which was added for letor by ayushp last year, which is based on setting new weights for each item with replace_weights() and then calling sort_by_relevance(). That could be used here, but we'd have to adjust the weights of the entries to get the desired order - for example, crudely setting weight to 1.0 / desired_rank would work.

Or perhaps better, reorder the MSet's internals more directly - diversification is part of xapian-core, so doesn't have to be constrained to use public API methods (unlike xapian-letor).

I think I'd favour a direct re-ordering in this case, which would retain the original weights.

Yes, the original weights are probably useful to some API users.

I had a look at the code, and I'm thinking about refactoring it to instead provide a new diversify method on Xapian::MSet which takes the same parameters that a Diversify constructor currently does. The Diversify class would then not be needed at all.

So usage would look like:

matches = enq.get_mset(0, 10); matches.diversify(4, 2);

And then you can display matches as you would without diversification.

Currently we effectively get a reordered list of docids out, so I think we might need to have some sort of mapping to allow us to efficiently find the MSet entry with a particular docid - that could be created as the MSet is scanned to seed the clustering.

I've refactored the code to change the API as above, and from updating the test code it does seem more natural.

However while checking if it gives the same results as before I get a segfault which I can reproduce with the unrefactored code:

==149288== Invalid read of size 8 ==149288== at 0x49A3A4B: Xapian::LCDClusterer::cluster(Xapian::MSet const&) (lcd_clusterer.cc:132) ==149288== by 0x49A81FB: Xapian::Diversify::Internal::get_dmset(Xapian::MSet const&) (diversify.cc:181) ==149288== by 0x49A880C: Xapian::Diversify::get_dmset(Xapian::MSet const&) (diversify.cc:75) ==149288== by 0x10D657: main (quest.cc:458) ==149288== Address 0x0 is not stack'd, malloc'd or (recently) free'd

The segfault was a bug in LCDClusterer - the relevance weight was used as a unique key, which doesn't work well in cases where multiple documents have exactly the same weight. Fixed by xapian/xapian@be80054.

I've been looking at the diversification code some more. I tried some test queries and what I always seem to get out is a result with the top k results in the same order, and the rest with their order reversed. I would expect the first k to change in many cases, and the relative order of remaining entries should probably be preserved (since in the absence of any other reason to order them, the original ordering seems the way to go).

I dug in a bit and found that the diversification uses the cluster centroids, but the LCDClusterer doesn't set these, so every centroid has no terms and zero magnitude - effectively every centroid is the zero point of the term space. That would explain the lack of useful reordering.

IIRC the diversification code was written using the existing K-means clustering (which does set centroids) and then LCDClusterer was added, so I suspect that's how we ended up here.

Also I had a look at the paper, and noticed that says "The next cluster center is chosen as the one that maximizes the sum of distances to the previous center(s)" but our implementation has:

// Select a new cluster center which is the point that is farthest away // from the current cluster center cluster_center = dist_vector.back().first;

That seems to do the right thing for the second cluster (and for the first we seem to correctly take the top ranked doc) but for the third and onwards we only consider the distance to the previous cluster centre, and not also the distances to the ones before that.

@uppinder Do you have any useful insights?

uppinder added 3 commits July 21, 2018 15:51

Update copyright year

53f3e52

Add 'diversification' to glossary

747cba9

Add howto for diversification

6254efe

ojwb reviewed Aug 9, 2018

View reviewed changes

howtos/diversification.rst Outdated Show resolved Hide resolved

ojwb reviewed Aug 9, 2018

View reviewed changes

Fix typo

f8e98e3

jaylett requested changes Aug 19, 2018

View reviewed changes

ojwb added 3 commits August 6, 2020 19:48

fixup! Add howto for diversification

c1e276b

fixup! Add howto for diversification

6b3fcbf

Split long sentence as suggested by James Aylett

4aec046

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add howto for Diversification #24

Add howto for Diversification #24

uppinder commented Aug 4, 2018

ojwb Aug 9, 2018 •

edited

Loading

jaylett Aug 19, 2018

jaylett Aug 19, 2018

jaylett Aug 19, 2018

jaylett Aug 19, 2018

ojwb Aug 20, 2018

jaylett Aug 21, 2018

ojwb Aug 22, 2018

ojwb Aug 8, 2020

ojwb Aug 13, 2020

ojwb Aug 13, 2020 •

edited

Loading

ojwb Aug 21, 2020


		Perform diversification over 'matches' and obtain an ordered list of documents::

		dset = d.get_dmset(matches)

Add howto for Diversification #24

Are you sure you want to change the base?

Add howto for Diversification #24

Conversation

uppinder commented Aug 4, 2018

ojwb Aug 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ojwb Aug 13, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ojwb Aug 9, 2018 •

edited

Loading

ojwb Aug 13, 2020 •

edited

Loading