This framework includes the implementation of the following algorithms.
- Distributed Random Projection Trees (DRPT) algorithm constructs a K-Nearest Neighbor Graph (KNNG) for Approximate Nearest Neighbor Search (ANNS) from a given dataset.
- Force2Vec graph emdbedding algorithm which can be used to visualize the resulting K-Nearest Neighbor Graph (KNNG) produced by DRPT.
Theese algorithms are implemented in C++ and leverages MPI for distributed processing, making it suitable for large-scale datasets across multiple nodes in HPC environments.
Ensure the following tools and libraries are installed before building the project. The source code was compiled and run successfully in NERSC Perlmutter.
GCC version >= 13.2
OpenMP version >= 4.5
CMake version >= 3.30.2
intel version >= 2024.1.0
cray-mpich version >= 8.1.30
CombBLAS
Eigen version >= 3.3.9
PCG version >= 0.98
Some helpful links for installation can be found at GCC, OpenMP and Environment Setup.
Please refer to the following guides to install CombBLAS, Eigen, and PCG respectively:
- CombBLAS - Installation Guide
- Eigen - Installation Guide
- PCG - Download & Setup
- Load/ install the modules mentioned in System Requirments.
- Clone the repository using the following command.
git clone https://github.com/HipGraph/DistViz.git
- Set the following environment variables
export COMBLAS_ROOT=<path/to/CombBLAS/root>
export EIGEN_ROOT=<path/to/Eigen/root>
export PCG_ROOT=<path/to/PCG/root>
- Run the following commands in order to compile and build the framework using CMake.
cd DistViz
mkdir build
cd build
cmake -DCMAKE_CXX_FLAGS="-fopenmp" ..
make all
Let’s run a couple of examples to demonstrate how to execute the DRPT algorithm on a dataset and visualize the resulting K-Nearest Neighbor Graph (KNNG) using the Force2Vec embedding algorithm using this framework.
This test run is designed for High-Performance Computing (HPC) environments with Slurm as the workload manager. Ensure you have access to an HPC cluster running Slurm before proceeding.
The following two datasets are used to demonstrate the example execution.
- PenDigits dataset
- MNIST dataset
PenDigits is a smaller (compared to MNIST) dataset consisting of handwritten digits.
- Dataset size - 1797
- Dimension - 65
- Number of classes - 10
Note: Since the PenDigits dataset is relatively small in size, it is already included in test/data folder.
- Navigate into the test folder.
cd test
- Before running the framework, create a directory to store the outputs. For example:
mkdir output
- Run the framework by submitting the Slurm script run_distviz_on_pendigits.sh, as a batch job.
sbatch run_distviz_on_pendigits.sh
- This script will run the DRPT algorithm on the PenDigits dataset located in test/data/pen_digits and generates the KNNG and its embeddings.
- Here, the K value is set to 10, and the workload is distributed among 2 nodes.
- Once the job completes:
- Check the output log file (
%j.log
) for progress and any errors. - Verify that the KNNG output file (
knng.txt
) appears in your specified output directory. - The
knng.txt
file should contain an edge list representing the KNNG graph. - Additionally, confirm that an
embedding.txt
file is generated in the output directory. This file contains the generated embeddings corresponding to the nodes in the KNNG, which can be used for visualization and analysis.
- Check the output log file (
- In order to generate the embeddings, convert the
knng.txt
file into Matrix Market (.mtx
) format by adding the following two lines to the top of the file and renaming the extension tomtx
instead oftxt
.
%%MatrixMarket matrix coordinate pattern general
1797 1797 16173
- First two 1797s represent the number of nodes (rows and columns of the adjacency matrix).
- The last 16173 is the number of edges (non-zero entries).
- Make sure you have python (version >= 3.6) is installed. You can check your version with:
python3 --version
- Next create and activate a python virtual environment to keep dependencies organized:
pip install venv
python3 -m venv venv
source venv/bin/activate
- Install the following python dependnies using
pip install
.
scikit-learn version >= 0.24.2
matplotlib version >= 3.3.4
networkx version >= 2.5.1
community version >= 1.0.0
- Run the python script run_visualization.py (in the visualization directory) to generate the visualization of the KNNG:
cd visualization
python3 -u ./run_visualization.py ../output/knng.mtx 1 ../output/embedding.txt 2 ../data/pen_digits/digit_labels.txt pen_digits_knng_embedding
Note: digit_labels.txt represents the actual node lables of the dataset.
-
After running the script, check if a PDF file with the visualization has been generated.
- Open the PDF to ensure the embeddings are properly visualized as follows. You should see a clear structure where similar data points (data points belonging to the same label) cluster together, reflecting the generated KNNG.
MNIST is a widely used dataset which consists of a set of handwritten digits.
- Dataset size - 60000
- Dimension - 784
- Number of classes - 10
Note: Since the MNIST dataset is relatively large in size, it is not included in the repository.
- Navigate into the test folder.
cd test
- Download and extract the MNIST dataset (
train-images.idx3-ubyte
dataset) into the test/data/MNIST folder by using the following url.
- Before running the framework, make sure that you have a directory to store the outputs. If not, create a folder as follows.
mkdir output
- Run the framework by submitting the Slurm script run_distviz_on_mnist.sh, as a batch job.
sbatch run_distviz_on_mnist.sh
- This script will run the DRPT algorithm on the MNIST dataset located in test/data/MNIST and generates the KNNG and its embeddings.
- Here, the K value is set to 10, and the workload is distributed among 4 nodes.
- Once the job completes check whether following output files have been generated as before.
- Output log file (
%j.log
) - KNNG output file (
knng.txt
) - Embeddings output file (
embedding.txt
)
- Output log file (
- In order to generate the embeddings, convert the
knng.txt
file into Matrix Market (.mtx
) format by adding the following two lines to the top of the file and renaming the extension tomtx
instead oftxt
.
%%MatrixMarket matrix coordinate pattern general
60000 60000 540000
- Make sure you have completed the prerequisites mentioned in the PenDigits example. If the Python virtual environment is not already activated, please activate it using the folling command.
source venv/bin/activate
- Run the python script run_visualization.py (in the visualization directory) to generate the visualization of the KNNG:
cd visualization
python3 -u ./run_visualization.py ../output/knng.mtx 1 ../output/embedding.txt 2 ../data/MNIST/mnist.labels.txt mnist_knng_embedding
Note: mnist.labels.txt represents the actual node lables of the dataset.
-
After running the script, check if a PDF file with the visualization has been generated.
- Open the PDF to ensure the embeddings are properly visualized as follows. You should see a clear structure where similar data points (data points belonging to the same label) cluster together, reflecting the generated KNNG.