CS562 2018 Spring Final Project -- Diversification Among Search Results in k-Nearest Neighbor Search Problem
To solve the problem of returning too many homogeneous results in any type of search, several algorithms have been created for generation a diversed set of results while maintaining closeness to query. Specifically, this paper will deal with several content-based diversity for refining
Often when people search on the web, they prefer a more heterogeneous result in web search. In some situations, search result that only contain homogeneous information may not prove to be all that useful, or satisfactory to users. For example, when we use web application such as Yelp to look for food restuarants nearby, we may expect a diverse set of restaurants in this area, instead of restaurants that all offer food of the same style. Users of such applications have been found to prefer some sort of diversity, and that applies to other types of recommendation system. To increase user satisfaction, algorithms should be implemented to return a "diverse" set of results from a data set. This is known as the diversification problem.
There is no specific definition for diversity within a dataset. Definitions of diversity include content-based, novelty-based, and coverage-based diversity. Content-based diversity is defined based on similarity (or dissimilarity) of items. For novelty-based and coverage-based diversity, documents are labeled with some pre-defined information (usually called "concept" or "information nugget") for each document, while the previous one seeks to maximize information gain for each incoming document, and the later tries to maximize number of covered information for the returned set. In some research, several definitions are combined to form a hybrid model, and require additional learning to obtain proper weightings for each model.
Search for diverse nearest-neighbor is an instance of the p-dispersion problem, which is to choose
It is important to solve the diversification problem because diversity not only satisfies users by providing broader result from ambiguity queries in recommendation system, it also allows users to find out additional information related to their requirement and thus help narrow down subsequent queries. By solving the diversification problem and increase variety to the answer, we are one step closer to having optimal search results.
The problem we are trying to solve is to add diversity onto solution for original
In general
As for heuristics used for speed-up and approximation, these strategies can be simply categorized into greedy heuristics and swap heuristics. Greedy heuristics set up some criteria to decide whether an input should be included in the answer set from the input, and keep testing the next nearest candidate until reach the required answer size. Swap heuristics usually initialize the result with original k-NN result, and trying to replace some points with other candidate points which can increase diversity in each iteration.
In this project, we implemented algorithms that tries to find a diversed set of results within a dataset; in the case of this paper, diversity is defined by the content-based definition. One of the algorithms we used is MOTLEY algorithm, which can produce near-optimal solution to the
MOTLEY is a content-based, greedy-heuristic
MOTLEY defines a binary function to define whether two points are diverse to each other or not. First it computes
Which can be obtained by weighted sum on sorted difference of 1D distance over all dimensions in the diversity-attribute space. The threshold
In addition to MOTLEY algorithm, we will also present another greedy heuristic algorithm for solving the diversification problem. The simple greedy heuristic algorithm first retrieves a candidate set
To compare the performance, we performed experiments on the real world dataset used in original MOTLEY paper. Forest Cover is a dataset contains geographical information of around 580,000 tuples. In our experiment, we take horizontal
and vertical distance to nearest water as point-attributes, and using elevation, aspect and slope for diversity-attributes. We conducted two rounds of experiments on this dataset. In the first round, both
Note that for the experiments, data structure used in
For all experiments, we tried to find out the 10 nearest-neighbors as answer. For
To evaluate the performance, we compute the following metrics from the answer set:
- Distance between farthest point to the query in point-attribute space.
- Average distance between points in the answer set to the query in point-attribute space.
- Average pairwise diversity of the answer set in diversity-attribute space.
Note that in the real world cases, especially in map applications, (2) and (3) are meaningful to user and service provider, because we want to obtain the largest variety from the result within the smallest overall range.
From figures (1)~(4), we can find the difference between searching in the whole space and doing different search in separated spaces. Running
We can also notice that MOTLEY generate result with much higher diversity, only with the cost of moderate increase on average distance from query point.
From above experiences, we’ve found that both separating original space into spatial (point-attribute) space and diversity space, and usage of MOTLEY algorithm, can refine result of diversed
- Drosou, Marina, and Evaggelia Pitoura. "Search result diversification." SIGMOD record 39.1 (2010): 41-47. Link
- Jain, Anoop, Parag Sarda, and Jayant R. Haritsa. "Providing diversity in k-nearest neighbor query results." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, 2004. [1] [2]
- Forest Cover dataset Link