Background
Both the LocalGraph and RemoteGraph classes implement the concept of an Iterator to allow operators, such as EdgeMap(), to access their connectivity information. Because Iterators are used as a template parameter in EdgeMap, they need a public interface that satisfies the constraints of template substitution into this function: they must be constructible from the factory function Graph::GetIterator, and they must implement the two methods HasNext() and Next(). Take a look at the class declarations for RemoteGraph::Iterator and LocalGraph::Iterator in famgraph.hpp.
In the case of LocalGraph::Iterator, HasNext() just checks whether the iterator will return a list of neighbors if Next() is called, and Next() simply does an indirection into the LocalGraph's in-memory adjacency array, returning a pointer to the start of the current vertex's adjacency list along with the number of elements in the list.

For RemoteGraph::Iterator, HasNext() works exactly the same way. Next() is a little more involved because it has to use RDMA to read the current vertex's adjacency list into a smaller memory buffer, and then return a pointer to this buffer after it ensures that the RDMA is complete.
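To make the duck-typed contract concrete, here is a minimal, self-contained sketch of an iterator over an in-memory CSR-style adjacency array and a generic EdgeMap-style driver. ToyIterator, ForEachEdge, and this mirror of AdjacencyList are illustrative names under assumed layouts, not the actual famgraph declarations:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical mirror of the record that Next() returns.
struct AdjacencyList {
  std::uint32_t v;             // vertex id
  std::uint64_t num_edges;     // length of the neighbor list
  const std::uint32_t *edges;  // pointer into graph memory (or an RDMA buffer)
};

// Stand-in iterator showing the HasNext()/Next() contract EdgeMap relies on.
class ToyIterator {
public:
  ToyIterator(std::vector<std::uint64_t> index, std::vector<std::uint32_t> adjacency)
      : index_(std::move(index)), adjacency_(std::move(adjacency)) {}

  bool HasNext() const noexcept { return current_vertex_ + 1 < index_.size(); }

  AdjacencyList Next() noexcept {
    auto const v = current_vertex_++;
    auto const start = index_[v];
    auto const num_edges = index_[v + 1] - start;
    return { static_cast<std::uint32_t>(v), num_edges, adjacency_.data() + start };
  }

private:
  std::vector<std::uint64_t> index_;      // index_[v]..index_[v+1] delimits v's neighbors
  std::vector<std::uint32_t> adjacency_;  // flattened neighbor lists
  std::size_t current_vertex_ = 0;
};

// A generic EdgeMap-style driver: any type exposing HasNext()/Next() substitutes.
template <typename Iterator, typename F>
void ForEachEdge(Iterator it, F &&f) {
  while (it.HasNext()) {
    auto const list = it.Next();
    for (std::uint64_t i = 0; i < list.num_edges; ++i) f(list.v, list.edges[i]);
  }
}
```

Because EdgeMap only requires these two methods via template substitution, LocalGraph::Iterator and RemoteGraph::Iterator can differ arbitrarily in how they fetch edges.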
How the RDMA works
The "smaller memory buffer" mentioned above is allocated inside the function famgraph::RemoteGraph::CreateInstance in graph.cpp. The relevant lines are:

```cpp
auto const edge_window_size = index.max_out_degree
  * static_cast<unsigned long>(rdma_channels)
  * sizeof(uint32_t);
auto const edge_window =
  fam_control->CreateRegion(edge_window_size, false, true);
```

The size of this buffer is chosen to be at least as big as the neighborhood of the largest vertex (we can change this later, but it keeps things simpler for now), and that gets multiplied by the number of worker threads (channels) and the size of a vertex label (4 bytes for now). I have placed all of the RDMA code for memory allocation/registration with the network card, and for creation of IB work requests, inside the class FAM::FamControl, so that the RemoteGraph class can issue logical RDMA operations like CreateRegion, Read, Write, etc. without having to import any IB verbs interfaces.
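The sizing rule above is simple enough to capture in one helper; EdgeWindowSize is a hypothetical name for illustration, not part of the famgraph API:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the edge-window sizing rule: one slot large enough for the
// biggest neighborhood, per RDMA channel, with 4-byte vertex labels.
constexpr std::size_t EdgeWindowSize(std::uint32_t max_out_degree,
                                     unsigned long rdma_channels) {
  return static_cast<std::size_t>(max_out_degree) * rdma_channels
         * sizeof(std::uint32_t);
}
```

For example, a graph whose largest neighborhood has 1000 vertices, read over 4 channels, needs a 16000-byte window.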
In the same function you can see how the in-local-memory index array gets created:

```cpp
auto index = fgidx::DenseIndex::CreateInstance(index_file, edges);
```

and you can also see how the remote-memory adjacency array gets created:

```cpp
auto fam_control = std::make_unique<FAM::FamControl>(
  grpc_addr, ipoib_addr, ipoib_port, rdma_channels);
auto const adjacency_file = fam_control->MmapRemoteFile(adj_file);
```

What the last line does behind the scenes is send a gRPC message to the server asking it to find the file named "adj_file", memory map it, and register this memory region with the network card so that RDMA can be performed on it.
Data path
When a client of RemoteGraph::Iterator calls Next(), the iterator first needs to figure out what the next vertex is and how many vertices, starting from this vertex and without any gaps, can fit into local memory; then an RDMA request is sent:

```cpp
famgraph::AdjacencyList famgraph::RemoteGraph::Iterator::Next() noexcept
{
  auto const v = this->current_vertex_++;
  if (v >= this->current_window_.end_exclusive) {
    this->current_window_ = this->MaximalRange(v); // Calculate set of vertices we want
    this->FillWindow(this->current_window_);       // Perform RDMA and wait
    this->cursor = static_cast<uint32_t *>(this->edge_buffer_.p);
  }
```

Then, once the data is in local memory, the pointer is returned to the caller:

```cpp
  auto const [start_inclusive, end_exclusive] = this->graph_.idx_[v];
  auto const num_edges = end_exclusive - start_inclusive;
  auto edges = const_cast<const uint32_t *>(this->cursor);
  this->cursor += num_edges;
  return { v, num_edges, edges };
}
```

The change
The current implementation of Next() performs lookahead for consecutive runs of vertices, e.g. [3, 4, 5], inside this->MaximalRange(v), and then dispatches a single RDMA read inside this->FillWindow(this->current_window_):
```cpp
this->graph_.fam_control_->Read(this->edge_buffer_.p,
  raddr,
  length * sizeof(uint32_t),
  lkey,
  rkey,
  this->channel_);
```

If we look at the API exported by FAM::FamControl, we can see that in addition to the Read method above, there is another overload that takes a vector of remote addresses and posts them all as a single chained RDMA read:

```cpp
void Read(void *laddr,
  std::vector<FamSegment> const &segs,
  uint32_t lkey,
  uint32_t rkey,
  unsigned long channel) noexcept;
```

We can leverage this in the following way. As the code stands, if we are reading the neighborhoods of [1, 2, 3, 5, 6, 7], and assuming we have enough space for all of the edges in our buffer, the iterator will still split the list into [1, 2, 3] and [5, 6, 7] and perform two separate fam_control->Read calls. If we use the second overload of Read, however, we can take the two consecutive runs [1, 2, 3] and [5, 6, 7], put them in a vector, and make a single call to Read, which behind the scenes will create two RDMA requests and chain them together before posting to the network card.
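As a hedged sketch of the planned change, the per-run windows could be turned into the segment vector for the chained overload roughly as follows. The real FamSegment layout lives in the FAM headers and may differ; the field names, the BuildSegments helper, and the address arithmetic here are assumptions for illustration:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical shape of a remote segment; the real FAM::FamSegment may differ.
struct FamSegment {
  std::uint64_t raddr;   // remote start address of this run's edges
  std::uint64_t length;  // bytes to read
};

// Turn a list of consecutive-run windows, given as [start_inclusive,
// end_exclusive) offsets into the CSR edge index, into one segment each, so
// a single chained Read() replaces N separate posts. The remote adjacency
// array is assumed to start at adjacency_base with 4-byte vertex labels.
std::vector<FamSegment> BuildSegments(
    std::uint64_t adjacency_base,
    std::vector<std::pair<std::uint64_t, std::uint64_t>> const &edge_ranges) {
  std::vector<FamSegment> segs;
  segs.reserve(edge_ranges.size());
  for (auto const &[start_inclusive, end_exclusive] : edge_ranges) {
    segs.push_back({ adjacency_base + start_inclusive * sizeof(std::uint32_t),
                     (end_exclusive - start_inclusive) * sizeof(std::uint32_t) });
  }
  return segs;
}
```

The local side stays a single contiguous edge_buffer_: because the segments are read back-to-back into one laddr, the existing cursor arithmetic in Next() should not need to change.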
Testing
Because we are changing the implementation of RemoteGraph::Iterator and not adding/changing any interfaces, the existing test cases for RemoteGraph should cover the change. For debugging purposes, it will be easiest to focus on the tests in graph_tests.cpp first, because they do not use TBB.
This test case:

```cpp
TEST_CASE("RemoteGraph Construction", "[famgraph]")
{
  int const rdma_channels = 1;
  auto [graph, graph_base] = CreateGraph<famgraph::RemoteGraph>(
    memserver_grpc_addr, ipoib_addr, ipoib_port, rdma_channels);
```

is probably where you want to start with testing. Take a look at the graphs that this test case currently uses. It may be helpful to rig a small example graph that has the properties we are looking to stress here (i.e. one that we know will trigger chained reads).
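When rigging such a graph, it can help to check how a vertex list splits into maximal consecutive runs, since each run becomes one segment of a chained read; a graph that yields two or more runs is the case we want to exercise. ConsecutiveRuns below is an illustrative helper, not famgraph code:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Split a sorted vertex list into maximal runs of consecutive ids, the unit
// that MaximalRange() effectively discovers one run at a time.
// Returns (begin, end_exclusive) pairs.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
ConsecutiveRuns(std::vector<std::uint32_t> const &vertices) {
  std::vector<std::pair<std::uint32_t, std::uint32_t>> runs;
  for (std::size_t i = 0; i < vertices.size();) {
    std::size_t j = i + 1;
    while (j < vertices.size() && vertices[j] == vertices[j - 1] + 1) ++j;
    runs.emplace_back(vertices[i], vertices[j - 1] + 1);
    i = j;
  }
  return runs;
}
```

For the example from above, [1, 2, 3, 5, 6, 7] yields the two runs [1, 4) and [5, 8), so a test graph with that vertex pattern would exercise a two-segment chained read.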
Once the tests in graph_tests.cpp pass, try algorithm_tests.cpp. These tests use TBB for multithreading, but the way I have designed things, none of the data inside an iterator is both shared and mutable, so if the change works on the single-threaded unit tests, it should work on the algorithm ones as well.