Welcome to the Ball-Tree and BC-Tree GitHub!
In this repository, we target at solving the problem of Point-to-Hyperplane Nearest Neighbor Search (P2HNNS). Given a set of data points in a (
This problem plays a vital role in many research domains. For example, in the applications of pool-based active learning with SVMs, the goal is to request labels for the data points closest (with minimum margin) to the SVM's decision hyperplane to reduce human efforts for annotation. Moreover, motivated by the success of SVM for classification, the maximum margin clustering aims at finding the hyperplane maximizing the minimum margin to the data, which can separate the data from different classes. Such applications require finding the data points that are closest to the hyperplane.
This repository provides the implementations and experiments of our work entitled Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search that has been published in IEEE ICDE 2023. You could also find our arXiv version here. We implement Ball-Tree and BC-Tree for performing P2HNNS in high-dimensional spaces. To make a systematic comparison, we include two state-of-the-art hyperplane hashing schemes NH and FH (published in SIGMOD 2021) as baselines for evaluations, where their implementation codes can be found here.
We study the performance of Ball-Tree and BC-Tree on 16 real-world data sets, i.e., Music, GloVe, Sift, UKBench, Tiny, Msong, NUSW, Cifar-10, Sun, LabelMe, Gist, Enron, Trevi, P53, Deep100M, and Sift100M. Their statistics are summarized as follows.
Data Sets | # Data Points | # Dim | # Queries | Data Size | Data Type |
---|---|---|---|---|---|
Music | 1,000,000 | 100 | 100 | 386 MB | Rating |
GloVe | 1,183,514 | 100 | 100 | 460 MB | Text |
Sift | 985,462 | 128 | 100 | 485 MB | Image |
UKBench | 1,097,907 | 128 | 100 | 541 MB | Image |
Tiny | 1,000,000 | 384 | 100 | 1.5 GB | Image |
Msong | 992,272 | 420 | 100 | 1.6 GB | Audio |
NUSW | 268,643 | 500 | 100 | 514 MB | Image |
Cifar-10 | 50,000 | 512 | 100 | 98 MB | Image |
Sun | 79,106 | 512 | 100 | 155 MB | Image |
LabelMe | 181,093 | 512 | 100 | 355 MB | Image |
Gist | 982,694 | 960 | 100 | 3.6 GB | Image |
Enron | 94,987 | 1,369 | 100 | 497 MB | Text |
Trevi | 100,900 | 4,096 | 100 | 1.6 GB | Image |
P53 | 31,153 | 5,408 | 100 | 643 MB | Biology |
Deep100M | 100,000,000 | 96 | 100 | 36.1 GB | Image |
Sift100M | 99,986,452 | 128 | 100 | 48.0 GB | Image |
Different from NH and FH that generate the hyperplane queries uniformly at random from a hypercube, we generate the hyperplane queries uniformly at random from a
Except for the link to download the data sets and hyperplane queries, we have also enclosed the sourece codes to generate the hyperplane query and the source codes for data sets deduplication, so that users can conduct experiemnts on any data sets of interests. Please refer to the fold pre_process/
for more details.
To reduce the I/O cost, all the datasets and queries are stored in a binary format. For a dataset (or query set) with n vectors in d dimensions, the binary file comparises with a float
array of n*d
length. Please specify the cardinality n and data dimension d in the bash scripts in advance.
- Ubuntu 18.04 (or higher version)
- g++ 8.3.1 (or higher version).
- Python 3.7 (or higher version)
The source codes are implemented by C++, which requires g++-8
with c++17
(or higher version) for compilation. Thus, please check whether the g++-8
is installed before the compilation. If not, we provide a way to install g++-8
in Ubuntu 18.04
as follows.
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-8
sudo apt-get install gcc-8 (optional)
Users can use the following commands to compile the C++ source codes:
cd methods/
make -j
We provide the bash scripts to run all experiments. Once you have downloaded the data sets and queries and completed the compilation, you can reproduce the experiments by simply running the following commands:
cd methods/
bash run.sh
Finally, we privode python scripts for visualization. These scripts require python 3.7
(or higher versions) with numpy, scipy, and matplotlib installed. If not, you might need to use anaconda
to create a new virtual environment and use pip
to install those packages.
After you have conducted all experiments, you can plot all the figures in our paper with the following commands.
cd scripts/
python3 plot.py
or
cd scripts/
python plot.py
Thank you for being patient in reading the user manual. We will appreciate using the following BibTeX to cite this work when you use these source codes in your paper.
@inproceedings{huang2023lightweight,
title={Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search},
author={Huang, Qiang and Tung, Anthony K. H.},
booktitle={2023 IEEE 39th International Conference on Data Engineering (ICDE)},
pages={436--449},
year={2023},
organization={IEEE}
}
It is welcome to contact me ([email protected]) if you meet any issue. Thank you for your interest!