This repository has been archived by the owner on Dec 7, 2018. It is now read-only.

Clear stale nodes #105

Open
doits opened this issue Jul 16, 2015 · 10 comments

Comments

@doits

doits commented Jul 16, 2015

I've played around with DCell a little bit, but now I have this:

DCell::Node.all.length
=> 75
DCell::Node.all.map(&:addr).uniq.length
=> 60

I only have two nodes running right now, but it still lists 75 of them. It also lists multiple nodes with the same address (which shouldn't be possible, right?). Is there any way to clear stale/dead/removed nodes?
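
To see which addresses are registered more than once, the node list can be grouped by address. The following is only a minimal sketch in plain Ruby: `FakeNode`, its `id` attribute, and the sample data are hypothetical stand-ins for whatever `DCell::Node.all` returns, and only `addr` is taken from the snippet above.

```ruby
# Hypothetical stand-in for a DCell node; only `addr` appears in the thread.
FakeNode = Struct.new(:id, :addr)

# Returns a hash of address => nodes for every address registered more than once.
def duplicate_addresses(nodes)
  nodes.group_by(&:addr).select { |_addr, group| group.size > 1 }
end

# Made-up sample data for illustration.
nodes = [
  FakeNode.new("a", "tcp://127.0.0.1:7000"),
  FakeNode.new("b", "tcp://127.0.0.1:7001"),
  FakeNode.new("c", "tcp://127.0.0.1:7000") # stale entry reusing an address
]

duplicate_addresses(nodes).each do |addr, group|
  puts "#{addr}: #{group.map(&:id).join(', ')}"
end
```

With a real registry, the same grouping could presumably be applied to `DCell::Node.all` directly, assuming each node responds to `addr` as it does above.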

@doits
Author

doits commented Jul 16, 2015

I've also noticed that programs using DCell hang for a really long time on exit after displaying

 DEBUG -- : Terminating 89 actors...

I flushed the Redis DB manually and it went back to normal, but shouldn't stale nodes be cleared automatically?

@Asmod4n
Contributor

Asmod4n commented Jul 16, 2015

ZeroMQ is "stateless" when it comes to connections: you can still send messages to a peer that is disconnected, and it will automatically send those messages when the peer comes back online.

@Asmod4n
Contributor

Asmod4n commented Jul 16, 2015

But if needed, one could implement a ping/pong mechanism for DCell that disconnects inactive nodes.

@doits
Author

doits commented Jul 16, 2015

At least it should not hang (on termination or sending messages to nodes) when a lot of stale nodes are present.

@Asmod4n
Contributor

Asmod4n commented Jul 16, 2015

One would have to set the sndtime to 0 for each ZMQ socket on shutdown so that it discards all remaining messages.

@doits
Author

doits commented Jul 16, 2015

Yeah, that's a good idea: if there are remaining messages on shutdown, output a warning and discard them after waiting, for example, 10 seconds (user-configurable).

A configurable timeout for when a node hangs would also be great. For example, when I try DCell::Node['which_is_dead'].all, it hangs for a really long time. It should throw an exception after a user-configurable time (or, if it already does so after too long a time, that time should be configurable :-)).
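
The kind of user-configurable timeout requested here could be sketched with Ruby's standard `Timeout` module. This is only an illustration under stated assumptions, not how DCell handles it: `NodeTimeoutError` and `with_node_timeout` are hypothetical names, and `sleep` stands in for a blocking call to a dead node.

```ruby
require "timeout"

# Hypothetical error class for this sketch; not part of DCell.
class NodeTimeoutError < StandardError; end

# Runs the block and raises NodeTimeoutError if it takes longer than `seconds`.
def with_node_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  raise NodeTimeoutError, "node did not respond within #{seconds}s"
end

begin
  # `sleep` stands in for a blocking call to a node that is no longer there.
  with_node_timeout(0.1) { sleep 1 }
rescue NodeTimeoutError => e
  puts e.message
end
```

`Timeout.timeout` interrupts the block from another thread, so a real implementation would more likely put the timeout on the socket or request itself.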

@niamster
Contributor

@doits it's already like this in master. Dead nodes are not taken into account (though they are still present in the DB).

@tarcieri
Member

At one point nodes healthchecked other nodes and marked them down if they didn't get responses. Did that get lost along the way?

@niamster
Contributor

@tarcieri @doits in current master there are 3 ways to bypass dead nodes:

  • node#ping(timeout) lets you check whether a node is alive before trying to touch it
  • a periodic heartbeat interrupts requests to nodes that passed away in the meantime (10 sec by default)
  • node lifebeat: the client won't try to connect to a node that didn't update its status within some timeout (20 sec by default)

If you access an actor by ID (without specifying the node), you get all actors with the requested ID from all alive nodes: scratchy example
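
The lifebeat idea from the list above can be sketched as a simple filter over last-seen timestamps. This is a hedged illustration in plain Ruby, not DCell's actual code: the `last_seen` hash, the `alive_nodes` helper, and the `LIFEBEAT_TIMEOUT` constant are all hypothetical. Only the 20-second default comes from the comment.

```ruby
# Default taken from the comment above; the constant name is hypothetical.
LIFEBEAT_TIMEOUT = 20 # seconds

# `last_seen` maps a node id to the Time of its last status update.
# Returns the ids of nodes whose last update is within the timeout.
def alive_nodes(last_seen, now: Time.now, timeout: LIFEBEAT_TIMEOUT)
  last_seen.select { |_id, seen_at| now - seen_at <= timeout }.keys
end

# Made-up sample data for illustration.
now = Time.now
last_seen = {
  "node_a" => now - 5,   # updated recently
  "node_b" => now - 120  # stale entry that is still present in the DB
}

puts alive_nodes(last_seen, now: now).inspect
```

Stale entries stay in the registry in this sketch and are merely skipped, which matches the description that dead nodes are ignored but still present in the DB.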

@doits
Author

doits commented Jul 16, 2015

I switched to master and things go much smoother now. I haven't had enough time to test it yet, so maybe I can say more tomorrow. Thanks for the explanation!


4 participants