Support worker/server nodes use cluster ip #136

Open: wants to merge 1 commit into base: master

@solin319 (Contributor) commented Apr 23, 2018

Bind every node's listening port to 0.0.0.0 instead of a specific address.
This lets all nodes (workers and servers) use a cluster IP in Kubernetes.
When launching distributed jobs with Kubernetes, we can then set DMLC_NODE_HOST and DMLC_PS_ROOT_URI to the cluster IP.
@mli
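
For illustration, here is a minimal sketch of the environment a launcher might assemble for one node under this change. `DMLC_NODE_HOST` and `DMLC_PS_ROOT_URI` come from the description above; `DMLC_ROLE`, `DMLC_PS_ROOT_PORT`, `DMLC_NUM_WORKER`, and `DMLC_NUM_SERVER` are the other usual ps-lite variables. The helper name `build_dmlc_env`, the IP addresses, and the port are hypothetical and not part of this PR.

```python
def build_dmlc_env(role, node_cluster_ip, scheduler_cluster_ip,
                   scheduler_port, num_workers, num_servers):
    """Build the environment variables for one ps-lite node (hypothetical helper)."""
    if role not in ("scheduler", "server", "worker"):
        raise ValueError("unknown role: %r" % role)
    return {
        "DMLC_ROLE": role,
        # Address other nodes use to reach this node: the cluster IP.
        "DMLC_NODE_HOST": node_cluster_ip,
        # Address of the scheduler: also a cluster IP.
        "DMLC_PS_ROOT_URI": scheduler_cluster_ip,
        "DMLC_PS_ROOT_PORT": str(scheduler_port),
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
    }


# Example values only; real cluster IPs are assigned by Kubernetes.
env = build_dmlc_env("worker", "10.96.0.12", "10.96.0.10", 9091, 2, 2)
for key in sorted(env):
    print(key + "=" + env[key])
```

In a real deployment these values would typically be placed in the pod spec rather than computed in Python; the sketch only shows which variable carries which address.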

sswv commented Apr 23, 2018

In my Kubernetes-based MXNet cluster, I also ran into this issue with the Cluster IP. A Cluster IP can be reached through virtual routing, but a socket cannot bind to it. So DMLC_NODE_HOST should not be used both for binding on the server and for access from the client. I made some hacky modifications in my MXNet cluster to work around it.

I think it is necessary to differentiate "the IP for socket binding at the server" from "the IP for access from the client". Simply using 0.0.0.0 when binding the socket at the server is a good idea. TensorFlow does the same in its GrpcServer.
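
The distinction can be sketched with plain sockets. This is not the ps-lite code touched by this PR, only a minimal illustration: the server binds to 0.0.0.0, while the client connects through a separate "advertised" address. The function names are hypothetical, and the loopback address stands in for a cluster IP, which cannot be reproduced outside Kubernetes.

```python
import socket


def start_listener(port=0):
    """Bind to all interfaces; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))  # bind address: never the cluster IP
    srv.listen(1)
    srv.settimeout(5)
    return srv


def round_trip(advertised_host):
    """Connect through the advertised address and echo one message."""
    srv = start_listener()
    port = srv.getsockname()[1]
    # The client only knows the advertised address, not the bind address.
    client = socket.create_connection((advertised_host, port), timeout=5)
    conn, _ = srv.accept()
    try:
        client.sendall(b"ping")
        return conn.recv(4)
    finally:
        conn.close()
        client.close()
        srv.close()


# 127.0.0.1 plays the role of the address handed to clients.
print(round_trip("127.0.0.1"))
```

Because the listener accepts traffic on every local interface, whatever routable address the client is given (here loopback, in Kubernetes the cluster IP) works without the server ever binding to it.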

@coldsheephot

I think this is necessary when using k8s to create MXNet distributed jobs.
