This repository has been archived by the owner on Sep 6, 2018. It is now read-only.

Feature: access to the len(server.c) #160

Open
philips opened this issue Jan 22, 2014 · 22 comments

Comments

@philips
Member

philips commented Jan 22, 2014

It would be good to have public access to this channel or a way to poll len(server.c) so we can tell if the queue is getting backed up.

https://github.com/goraft/raft/blob/master/server.go#L171
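For reference, a minimal sketch of what a poll-style accessor could look like without exposing the channel itself. The `QueuedEventCount` name and the simplified types here are assumptions for illustration, not part of the current API:

```go
package raft

// Simplified stand-ins for go-raft's internals: the real struct lives in
// server.go and the buffered event channel field is named c there
// (currently sized at 256).
type ev struct{}

type server struct {
	c chan *ev
}

// QueuedEventCount is a hypothetical accessor: it reports how many events
// are currently waiting in the channel without handing out the channel.
func (s *server) QueuedEventCount() int {
	return len(s.c)
}
```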

@philips
Member Author

philips commented Jan 22, 2014

/cc @bcwaldon

@benbjohnson
Contributor

@philips I'd rather not expose the channel directly. If we're interested in checking if the queue is backed up then I'd rather make an API for that. Maybe another threshold event for it?
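Something along these lines, perhaps, modeled loosely on an event-listener style. The event name, listener registry, and threshold field below are all hypothetical:

```go
package raft

// QueueBacklogThresholdEventType is a hypothetical event name; it is not
// part of the current API.
const QueueBacklogThresholdEventType = "queueBacklogThreshold"

type ev struct{}

type server struct {
	c                chan *ev
	backlogThreshold int
	listeners        map[string][]func(queueLen int)
}

// dispatchBacklogEvent fires the hypothetical threshold event whenever the
// pending queue length crosses the configured threshold. In go-raft this
// check could live in the main event loop where events are dequeued.
func (s *server) dispatchBacklogEvent() {
	if n := len(s.c); n >= s.backlogThreshold {
		for _, fn := range s.listeners[QueueBacklogThresholdEventType] {
			fn(n)
		}
	}
}
```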

@philips
Member Author

philips commented Jan 22, 2014

@benbjohnson Yea, I guess a threshold event would work.

@mreiferson

Hmmm, curious as to the thinking behind the buffer of 256?

I would argue that it might be prudent to reduce the size of (or eliminate) this buffer entirely... This would propagate back-pressure up the chain (all the way to the clients sending commands, I assume).

But, it's likely that I don't know enough about the codebase to understand how it's used 😄

@philips
Member Author

philips commented Jan 23, 2014

@benbjohnson I agree. I think we should measure, then consider getting rid of the sendAsync codepath. I don't know off the top of my head why we decided to allow an essentially infinite event backlog via sendAsync.

@benbjohnson
Contributor

@philips @mreiferson The idea of sendAsync() is to allow the heartbeat to continue even if the server gets backed up by something like a long disk write. That's why there's a buffer on Server.c.

However, I agree that a buffered channel isn't the right approach. A better approach would be to eliminate the buffered channel and do processing of the peer AE responses through a goroutine.

What do you think of that?
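One way to read that suggestion, sketched with illustrative names only (these are not go-raft's actual identifiers): a dedicated goroutine receives the peer AppendEntries responses and hands them to the server loop, so the heartbeat senders never write into a large hidden buffer.

```go
package raft

// Illustrative stand-ins only; not go-raft's real types.
type AppendEntriesResponse struct {
	Term    uint64
	Success bool
}

type server struct {
	c        chan interface{}            // server event channel, unbuffered in this sketch
	aeRespCh chan *AppendEntriesResponse // peer goroutines deliver responses here
	stopped  chan struct{}
}

// processAEResponses forwards peer responses to the server loop. Any
// back-pressure from a slow server loop shows up here instead of in a
// buffered Server.c.
func (s *server) processAEResponses() {
	for {
		select {
		case resp := <-s.aeRespCh:
			s.c <- resp // blocks until the server loop is ready
		case <-s.stopped:
			return
		}
	}
}
```

The trade-off discussed in this thread still applies: with no buffer anywhere, a slow server loop back-pressures the peers directly.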

@mreiferson

Well, it's not really possible to discriminate between a failed server and one that is merely unresponsive (for example, blocked on disk IO). Given that dilemma, would it not be "correct" for a heartbeat to fail due to resource contention? (i.e. wouldn't we want that unresponsive peer demoted, for example, if it was the leader?)

My general feeling is that I would want back-pressure as a result of any kind of resource contention propagated to peers (and clients) so that they can make decisions about how to proceed.

I think I need to better understand the semantics of sendAsync(), though. I'll poke around later this evening.

@benbjohnson
Contributor

@mreiferson That's a good point -- a go-raft instance indefinitely hung on a disk write would hang the cluster. I think people will just need to bump up the heartbeat interval if they're running on something like EBS, where 99th percentile latency can be higher than the heartbeat interval.

@philips
Member Author

philips commented Jan 23, 2014

@benbjohnson Yea, exactly. :(

@mreiferson

hah, right 😄

The benefit of removing this hidden internal buffer (and others?) is that when disk IO degrades, the behavior will be well defined and unsurprising.

@philips
Member Author

philips commented Jan 24, 2014

@benbjohnson How should we make this decision? It seems like we need to do a few things:

  • Make the channel size adjustable at construction
  • Benchmark how quickly goraft can drain a queue
  • Make a reasonable default

My hunches on the second two:

  • Based on my current observations this will have everything to do with the network, not go-raft
  • Based on, say, a 50ms worst-case latency and a 1ms average completion time, we get ~50 (round it to 64?)

/cc @xiangli-cmu
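Picking up the first and last bullets above, a minimal sketch of a constructor with a configurable queue size in place of the hard-coded 256. The names, the default, and the constructor signature are illustrative assumptions:

```go
package raft

// DefaultEventQueueSize follows the sizing hunch above: with ~50ms
// worst-case network latency and ~1ms average per-event processing time,
// roughly 50 events can pile up, so round up to the next power of two.
const DefaultEventQueueSize = 64

type ev struct{}

type server struct {
	c chan *ev
}

// newServer is a hypothetical constructor variant: the queue size becomes a
// parameter instead of a constant, with 0 meaning "use the default".
func newServer(queueSize int) *server {
	if queueSize <= 0 {
		queueSize = DefaultEventQueueSize
	}
	return &server{c: make(chan *ev, queueSize)}
}
```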

@mreiferson

Is the intended purpose of the buffer to be able to batch work to reduce round-trip overhead of more/smaller AE requests to peers?

If not, and it is simply a workaround for potential resource contention, the other option is to:

  • remove the buffer

End-users can still work around this potential issue by adjusting timeouts, right?

@xiang90
Contributor

xiang90 commented Jan 24, 2014

@mreiferson Our raft server runs in a single for loop, collecting events via a buffered channel from:

  1. HTTP triggered go-routines.
  2. Peer go-routines when the server is a leader.

Whether the channel is buffered or not, the HTTP-triggered go-routines cannot proceed until the server finishes processing their requests, so I don't think the buffer helps reduce latency here.

The reason to have a buffer is that our peer go-routines do not require a response from the server, so we do not block them from doing another round of heartbeats.

I can see us being able to remove the buffer completely after some refactoring.

/cc @philips @benbjohnson
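For readers less familiar with the codebase, a condensed sketch of that pattern: one serving loop, a blocking send path for HTTP-triggered goroutines, and a fire-and-forget path for peer goroutines. These are simplified stand-ins for go-raft's internal send/sendAsync machinery, not the actual code:

```go
package raft

type ev struct {
	value   interface{}
	replyCh chan error // nil for fire-and-forget sends (e.g. peer responses)
}

type server struct {
	c       chan *ev
	stopped chan struct{}
}

// loop is the single serving for-loop: every state change funnels through
// here, one event at a time.
func (s *server) loop() {
	for {
		select {
		case e := <-s.c:
			err := s.process(e) // apply the event to the server state
			if e.replyCh != nil {
				e.replyCh <- err // HTTP-triggered goroutines wait on this
			}
		case <-s.stopped:
			return
		}
	}
}

func (s *server) process(e *ev) error { return nil } // placeholder

// send is what an HTTP-triggered goroutine uses: it cannot make progress
// until the loop has processed its request, buffered channel or not.
func (s *server) send(value interface{}) error {
	e := &ev{value: value, replyCh: make(chan error, 1)}
	s.c <- e
	return <-e.replyCh
}

// sendAsync is the fire-and-forget path used by peer goroutines: with the
// buffer removed, this send blocks until the loop picks the event up,
// which is exactly the back-pressure discussed in this thread.
func (s *server) sendAsync(value interface{}) {
	s.c <- &ev{value: value}
}
```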

@mreiferson

Agreed, I'm in favor of refactoring in order to remove the buffer. I'm assuming this is related to your #167 (comment), which I'm also 👍 on.

@benbjohnson
Contributor

Sorry for the delay in responding. I've been out sick for a couple days. I'm 👍 and 👌 (and any other related emoji) on removing the buffer.

Unrelated: do you guys think it would make sense to maintain the heartbeat in the Server instead of in the Peer? We could kick off goroutines at regular intervals within the server event loop so we wouldn't have peers acting quite so non-deterministically. Just throwing that idea out there.
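A rough sketch of that idea, assuming a ticker owned by the server loop that fans heartbeats out to peers on each tick. All names here are hypothetical:

```go
package raft

import "time"

type peer struct{ name string }

type ev struct{}

type server struct {
	peers             map[string]*peer
	heartbeatInterval time.Duration
	c                 chan *ev
	stopped           chan struct{}
}

// leaderLoop owns a single ticker instead of each Peer running its own
// timer; heartbeats are fanned out from the event loop on every tick.
func (s *server) leaderLoop() {
	ticker := time.NewTicker(s.heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, p := range s.peers {
				go s.sendHeartbeat(p) // one short-lived goroutine per peer per tick
			}
		case <-s.c:
			// process other events as usual
		case <-s.stopped:
			return
		}
	}
}

func (s *server) sendHeartbeat(p *peer) {
	// Issue the AppendEntries RPC to p and report the result back to the
	// event loop (omitted in this sketch).
}
```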

@mreiferson

@benbjohnson I'm not sure. I think, to start off, the safest and simplest approach would be for heartbeats not to be asynchronous at all, so they can be trusted as a truer reflection of a peer's liveness. (Perhaps we should move this discussion to #173.)

@mreiferson

(edit: meant "heartbeats" not "goroutines" 😄)

@xiang90
Contributor

xiang90 commented Jan 30, 2014

@mreiferson It is not feasible to make the heartbeat strictly synchronous. We cannot afford to let the server sit idle waiting for a reply from a peer.

@mreiferson

right, "async" and "sync" are not really the right words - a better way to phrase what I'm thinking is "not in complete isolation"

@xiang90
Contributor

xiang90 commented Jan 30, 2014

@mreiferson I am trying to make the heartbeat totally isolated from the state of the server. The responsibility of the heartbeat routine should be just sending read-only data out to peers and getting the replies back to the server routine via a channel. It should not directly change any other state.
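A small sketch of that separation of responsibilities, assuming a heartbeat routine that only reads an immutable snapshot and reports results over a channel. The names and types are illustrative only:

```go
package raft

// heartbeatRequest is a read-only snapshot of what must be sent to one peer.
type heartbeatRequest struct {
	peerName string
	term     uint64
	// a copy of the entries / commit index to send would go here
}

// heartbeatReply carries the peer's answer back to the server routine.
type heartbeatReply struct {
	peerName string
	term     uint64
	success  bool
}

// heartbeat sends one round of read-only requests and forwards every reply
// to the server routine via replyCh; the server loop alone updates state.
func heartbeat(reqs []heartbeatRequest, replyCh chan<- heartbeatReply) {
	for _, req := range reqs {
		go func(r heartbeatRequest) {
			reply := doAppendEntriesRPC(r) // network call, no shared state touched
			replyCh <- reply
		}(req)
	}
}

func doAppendEntriesRPC(r heartbeatRequest) heartbeatReply {
	// Placeholder for the actual RPC to the peer.
	return heartbeatReply{peerName: r.peerName, term: r.term, success: true}
}
```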

@mreiferson

@xiangli-cmu I'll try to find some time later to detail what I'm thinking on #173 - I suspect, based on your pull requests and comments on other issues, that we're interested in achieving the same end result 😄

@xiang90
Contributor

xiang90 commented Jan 30, 2014

@mreiferson Cool. Thanks.
