Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion tensorpipe/common/ibv_lib.h
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,8 @@ namespace tensorpipe {
_(query_gid, int, (IbvLib::context*, uint8_t, int, IbvLib::gid*)) \
_(query_port, int, (IbvLib::context*, uint8_t, IbvLib::port_attr*)) \
_(reg_mr, IbvLib::mr*, (IbvLib::pd*, void*, size_t, int)) \
_(wc_status_str, const char*, (IbvLib::wc_status))
_(wc_status_str, const char*, (IbvLib::wc_status)) \
_(port_state_str, const char*, (IbvLib::port_state))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you keep this list sorted alphabetically?


// Wrapper for libibverbs.

Expand Down
26 changes: 25 additions & 1 deletion tensorpipe/transport/ibv/reactor.cc
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,32 @@ namespace ibv {

Reactor::Reactor(IbvLib ibvLib, IbvDeviceList deviceList)
: ibvLib_(std::move(ibvLib)) {
bool found = false;
TP_DCHECK_GE(deviceList.size(), 1);
ctx_ = createIbvContext(getIbvLib(), deviceList[0]);

// If the deviceList contains multiple ibv devices, we will select the
// device of the port whose port_state is active, instead of just selecting
// the first device in the deviceList by default.
for (int i = 0; i < deviceList.size(); i++) {
IbvContext tp_ctx_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our naming convention is camelCase. Also, a trailing underscore means that a name is a private class member, whereas this is just a local variable. Could you just name this ctx?

IbvLib::port_attr portAttr;
std::memset(&portAttr, 0, sizeof(portAttr));
tp_ctx_ = createIbvContext(getIbvLib(), deviceList[i]);
TP_CHECK_IBV_INT(ibvLib.query_port(tp_ctx_.get(), kPortNum, &portAttr));
if (portAttr.state == IbvLib::port_state::PORT_ACTIVE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port_state is just an enum, not an enum class, hence its values should be accessed just as IbvLib::PORT_ACTIVE.

ctx_ = std::move(tp_ctx_);
found = true;
break;
} else {
TP_VLOG(8) << "IbvDevice " << deviceList[i].name << " port "
<< unsigned(kPortNum) << " state is "
<< ibvLib.port_state_str(portAttr.state)
<< " , so skip this device";
}
}

TP_THROW_ASSERT_IF(found == false) << "Unable to find available ibv device";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't find any usable devices we shouldn't consider it an error (and crash the program), instead we should just disable the ibv transport. The logic to do so happens here:

if (deviceList.size() == 0) {
TP_VLOG(7) << "IBV transport is not viable because it couldn't find any "
<< "InfiniBand NICs";
return nullptr;
}

Could you move your code to that file? You will probably need to change the constructor of the Reactor class so that it takes a IbvContext object, instead of an IbvDeviceList.


pd_ = createIbvProtectionDomain(getIbvLib(), ctx_);
cq_ = createIbvCompletionQueue(
getIbvLib(),
Expand Down