Skip to content

Minor optimizations for getDataByLabel? #643

Description

@LTLA

In my KNN-related applications, I often create the HNSW index and pass it to downstream functions without the original data. When I need to do a KNN search within the data, I use HierarchicalNSW::getDataByLabel() to extract the data vector before passing it to searchKnn().

This works pretty well, but some examination of the source reveals that getDataByLabel() does a surprising amount of work:

hnswlib/hnswlib/hnswalg.h

Lines 826 to 847 in c1b9b79

std::vector<data_t> getDataByLabel(labeltype label) const {
// lock all operations with element by label
std::unique_lock <std::mutex> lock_label(getLabelOpMutex(label));
std::unique_lock <std::mutex> lock_table(label_lookup_lock);
auto search = label_lookup_.find(label);
if (search == label_lookup_.end() || isMarkedDeleted(search->second)) {
throw std::runtime_error("Label not found");
}
tableint internalId = search->second;
lock_table.unlock();
char* data_ptrv = getDataByInternalId(internalId);
size_t dim = *((size_t *) dist_func_param_);
std::vector<data_t> data;
data_t* data_ptr = (data_t*) data_ptrv;
for (size_t i = 0; i < dim; i++) {
data.push_back(*data_ptr);
data_ptr += 1;
}
return data;
}

i.e., multiple locks that might cause contention in a multi-threaded context, plus an allocation.

Might you consider adding a streamlined version like:

    const char* getDataByLabelRaw(labeltype label) const {
        auto search = label_lookup_.find(label);
        if (search == label_lookup_.end() || isMarkedDeleted(search->second)) {
            throw std::runtime_error("Label not found");
        }
        tableint internalId = search->second;
        return getDataByInternalId(internalId);
    }

where the pointer could be directly passed to searchKnn()? This avoids the allocation of and copy to a new vector, and skips the various locks under the assumption that the user is not adding/deleting points at the same time. Probably won't make much of a performance difference in the single-threaded case, but is a bit more ergonomic as I can just pass the pointer directly to searchKnn().

FWIW we could then streamline getDataByLabel() itself to:

    template<typename data_t>
    std::vector<data_t> getDataByLabel(labeltype label) const {
        // lock all operations with element by label
        std::unique_lock <std::mutex> lock_label(getLabelOpMutex(label));
        
        std::unique_lock <std::mutex> lock_table(label_lookup_lock);
        auto data_ptrv = getDataByLabelRaw(label);
        lock_table.unlock();

        size_t dim = *((size_t *) dist_func_param_);
        data_t* data_ptr = (data_t*) data_ptrv;
        return std::vector<data_t>(data_ptr, data_ptr + dim);
    }

(TBH I don't really understand the motivation for the locks - are people really reading from the index at the same time as they're adding/deleting points? That seems like it would have all kinds of issues from race conditions. But if this is a common operation, one might also consider switching to a std::shared_mutex + std::shared_lock so that parallel reads don't contend with each other.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions