This repository was archived by the owner on Nov 17, 2023. It is now read-only.

distributed training port bind error #14114

Open
ENegatiVY opened this issue Feb 11, 2019 · 4 comments

@ENegatiVY

ENegatiVY commented Feb 11, 2019


Description

I am trying distributed training on two Ubuntu servers. Each of them has one GPU, but this may not be the problem.

I installed mxnet-cu90 with pip, and I also cloned mxnet (https://github.com/apache/incubator-mxnet) into my home directory.

The command is simple:
~/incubator-mxnet/tools/launch.py -H host -n 2 python3 store.py
or
~/incubator-mxnet/tools/launch.py -H host -n 2 python3 image-classification.py
with some additional network configuration arguments.

The host file contains:
server1
server2
Both servers are reachable over SSH without a password.

Environment info (Required)

Two Ubuntu 16.04 servers, each with one GPU.



Error Message
Traceback (most recent call last):
  File "store.py", line 3, in <module>
    store = kv.create('dist')
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore.py", line 674, in create
    ctypes.byref(handle)))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node.port) != (-1) bind failed
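
For a "bind failed" error like this one, a quick sanity check is whether each machine can bind a TCP socket on the address its own hostname resolves to. The sketch below uses only Python's standard `socket` module. `can_bind` is a hypothetical helper, not part of MXNet or ps-lite, and it does not reproduce how ps-lite picks its interface or port; it only rules out basic name-resolution and bind problems.

```python
import socket


def can_bind(host, port=0):
    """Return True if a TCP socket can be bound to (host, port).

    port=0 asks the OS for any free port. Hypothetical diagnostic
    helper; it is not part of MXNet or ps-lite.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return True
    except OSError:
        # Covers address-not-available, port-in-use and
        # name-resolution failures (socket.gaierror is an OSError).
        return False


if __name__ == "__main__":
    name = socket.gethostname()
    print("loopback:", can_bind("127.0.0.1"))
    print("{}: {}".format(name, can_bind(name)))
```

If the second check fails on either server while the loopback check passes, the hostname probably resolves to an address that is not assigned to a local interface.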

Minimum reproducible example

store.py code
from mxnet import kv, nd
store = kv.create('dist')
shape = (2, 3)
x = nd.random_uniform(shape=shape)
store.init('weight', x)
print('=== init "weight" ==={}'.format(x))
from mxnet import gpu,cpu
ctx = [gpu(0), cpu(0)]
y = [nd.zeros(shape, ctx=c) for c in ctx]
store.pull('weight', out=y)
print('=== pull "weight" to {} ===\n{}'.format(ctx, y))

Steps to reproduce


  1. ~/incubator-mxnet/tools/launch.py -H host -n 2 python3 store.py

What have you tried to solve it?

  1. https://stackoverflow.com/questions/6024003/why-doesnt-zeromq-work-on-localhost I can't find similar code in my example.
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.

@roywei
Member

roywei commented Feb 12, 2019

@TPchanger You can try using IP addresses instead of host names in your host file. Refer to dmlc/ps-lite#139.
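
One way to try this is to generate the host file from the existing names. The sketch below is only an illustration: `to_ip_hostfile` is a hypothetical helper, and it assumes the names resolve to the addresses the servers actually listen on (check `/etc/hosts` if a name resolves to a loopback address such as 127.0.1.1).

```python
import socket


def to_ip_hostfile(hostnames, resolve=socket.gethostbyname):
    """Return host-file text with one resolved IPv4 address per line.

    `resolve` can be swapped out (for example with a dict lookup)
    when the names are not resolvable on the current machine.
    """
    return "\n".join(resolve(name) for name in hostnames) + "\n"


# Usage, assuming "server1" and "server2" (the names from this issue)
# are resolvable where the script runs:
#
#     with open("host", "w") as f:
#         f.write(to_ip_hostfile(["server1", "server2"]))
```

The resulting file can then be passed to launch.py with -H exactly as before.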

@yuxihu
Member

yuxihu commented Feb 12, 2019

store = kv.create('dist')

This might not be related to the error, but 'dist' does not seem to be a valid option. Refer to the supported options here.
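
A small guard in the script can catch an undocumented name before the kvstore is created. This is a sketch under assumptions: the tuple below is the set of names I believe the `mxnet.kvstore.create` docstring lists, so verify it against the installed MXNet version, and `check_kvstore_type` is a hypothetical helper, not an MXNet API.

```python
# Assumed from the mxnet.kvstore.create docstring; verify against
# the MXNet version you have installed.
DOCUMENTED_KVSTORE_TYPES = (
    "local",
    "device",
    "nccl",
    "dist_sync",
    "dist_device_sync",
    "dist_async",
)


def check_kvstore_type(name):
    """Return `name` unchanged if it is a documented kvstore type."""
    if name not in DOCUMENTED_KVSTORE_TYPES:
        raise ValueError(
            "unknown kvstore type {!r}; expected one of {}".format(
                name, ", ".join(DOCUMENTED_KVSTORE_TYPES)
            )
        )
    return name


# In store.py this would be used as:
#     store = kv.create(check_kvstore_type("dist_sync"))
```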

@anirudhacharya
Member

@yuxihu Should we use an enum for these options, or validate their values before use? See also https://github.com/dmlc/ps-lite/blob/aee325276bccb092f516df0bce30d3a8333f4038/src/postoffice.cc#L12, where DMLC_ROLE is checked for not-null but its value is not validated (scheduler/server/worker).
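
To make the enum idea concrete, here is a rough Python sketch of validating the role value. ps-lite itself is C++, so this only illustrates the shape of the check; `Role` and `read_role` are hypothetical names and do not exist in ps-lite or MXNet.

```python
import os
from enum import Enum


class Role(Enum):
    """The three role values mentioned above."""
    SCHEDULER = "scheduler"
    SERVER = "server"
    WORKER = "worker"


def read_role(environ=os.environ):
    """Read DMLC_ROLE and reject missing or unknown values."""
    raw = environ.get("DMLC_ROLE")
    if raw is None:
        raise KeyError("DMLC_ROLE is not set")
    try:
        return Role(raw)
    except ValueError:
        allowed = ", ".join(role.value for role in Role)
        raise ValueError(
            "DMLC_ROLE={!r} is not one of: {}".format(raw, allowed)
        ) from None
```

With a check like this, a typo in the role would fail immediately with a readable message instead of surfacing later as an unrelated error.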
