Feature: access to the len(server.c) #160
Comments
/cc @bcwaldon
@philips I'd rather not expose the channel directly. If we're interested in checking whether the queue is backed up, then I'd rather make an API for that. Maybe another threshold event for it?
@benbjohnson Yea, I guess a threshold event would work.
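For concreteness, here is a minimal sketch of what such a backlog-threshold event could look like; the event name, threshold value, and dispatch plumbing are invented for illustration and are not goraft's actual API:

```go
package main

import "fmt"

// Hypothetical sketch: the event name, threshold, and dispatch helper are
// invented for illustration and are not goraft's actual API.
const QueueBacklogThresholdEventType = "queueBacklogThreshold"

type server struct {
	c         chan string            // internal event queue (server.c)
	threshold int                    // backlog level that triggers the event
	listeners map[string][]func(int) // event type -> listener callbacks
}

func (s *server) addEventListener(eventType string, fn func(int)) {
	s.listeners[eventType] = append(s.listeners[eventType], fn)
}

func (s *server) dispatch(eventType string, value int) {
	for _, fn := range s.listeners[eventType] {
		fn(value)
	}
}

// send enqueues an event and fires the threshold event when the queue
// starts backing up, instead of exposing len(s.c) directly.
func (s *server) send(ev string) {
	s.c <- ev
	if n := len(s.c); n >= s.threshold {
		s.dispatch(QueueBacklogThresholdEventType, n)
	}
}

func main() {
	s := &server{c: make(chan string, 256), threshold: 192, listeners: map[string][]func(int){}}
	s.addEventListener(QueueBacklogThresholdEventType, func(n int) {
		fmt.Printf("event queue backed up: %d pending entries\n", n)
	})
	for i := 0; i < 200; i++ {
		s.send("AppendEntriesRequest")
	}
}
```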
Hmmm, curious as to the thinking behind the buffer of 256? I would argue that it might be prudent to reduce the size of (or eliminate) this buffer entirely... This would propagate back-pressure up the chain (all the way to clients sending commands, I assume). But, it's likely that I don't know enough about the codebase to understand how it's used 😄
@benbjohnson I agree. I think we should measure first, then consider getting rid of the sendAsync codepath. I don't know off the top of my head why we decided to let there be an essentially infinite event backlog via sendAsync.
@philips @mreiferson The idea behind the buffer was to let the peer goroutines hand off their responses without blocking on the server loop. However, I agree that a buffered channel isn't the right approach. A better approach would be to eliminate the buffered channel and process the peer AE responses through a dedicated goroutine. What do you think of that?
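A rough sketch of that shape, with made-up types standing in for goraft's internals: an unbuffered channel carries the peer AppendEntries responses and one dedicated goroutine drains it, so nothing queues up invisibly.

```go
package main

import (
	"fmt"
	"time"
)

// Made-up types standing in for goraft's internals: an unbuffered channel
// carries peer AppendEntries responses, and a dedicated goroutine drains it.
type appendEntriesResponse struct {
	peer    string
	term    uint64
	success bool
}

func main() {
	responses := make(chan appendEntriesResponse) // unbuffered: back-pressure is visible

	// Dedicated response-processing goroutine (stand-in for the server loop).
	go func() {
		for resp := range responses {
			fmt.Printf("peer=%s term=%d success=%v\n", resp.peer, resp.term, resp.success)
		}
	}()

	// Peer heartbeat goroutines hand their responses straight to the processor.
	for _, peer := range []string{"node1", "node2"} {
		go func(name string) {
			responses <- appendEntriesResponse{peer: name, term: 1, success: true}
		}(peer)
	}

	time.Sleep(100 * time.Millisecond) // crude wait so the demo goroutines can finish
}
```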
Well, it's not really possible to discriminate between a failed server and one that is merely unresponsive (for example, blocked on disk IO). Given that dilemma, would it not be "correct" for a heartbeat to fail due to resource contention? (i.e. wouldn't we want that unresponsive peer demoted, for example, if it was the leader?) My general feeling is that I would want back-pressure from any kind of resource contention propagated to peers (and clients) so that they can make decisions about how to proceed. I think I need to better understand the semantics of sendAsync.
@mreiferson That's a good point -- a go-raft instance indefinitely hung on a disk write would hang the cluster. I think people will just need to bump up the heartbeat interval if they're running on something like EBS, where 99th-percentile latency can be higher than the heartbeat interval.
@benbjohnson Yea, exactly. :(
hah, right 😄 The benefit of removing this hidden internal buffer (and others?) is that when disk IO degrades, the behavior will be well defined and unsurprising.
@benbjohnson How should we make this decision? It seems like we need to do a few things:
My hunches on the second two:
/cc @xiangli-cmu
Is the intended purpose of the buffer to be able to batch work to reduce round-trip overhead of more/smaller AE requests to peers? If not, and it is simply a workaround for potential resource contention, the other option is to:
End-users can still work around this potential issue by adjusting timeouts, right?
@mreiferson Our raft server runs in a single for loop, collecting events via a buffered channel from the HTTP-triggered goroutines and the peer goroutines.
No matter whether the channel is buffered or not, the HTTP-triggered goroutines cannot proceed until the server finishes processing their requests, so I don't think the buffer helps reduce latency there. The reason to have a buffered channel is that our peer goroutine does not need the response from the server, so we do not block it from doing another round of heartbeats. I can see that we could remove the buffer completely after some refactoring. /cc @philips @benbjohnson
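A stripped-down illustration of the loop structure described above; the event type and the send/sendAsync helpers here are simplified stand-ins for goraft's internals, not the real implementation.

```go
package main

import "fmt"

// Simplified stand-ins for goraft's internals, not the real implementation.
type event struct {
	target      interface{} // e.g. a command or an AppendEntries request
	returnValue interface{}
	c           chan error // per-event channel a synchronous sender waits on
}

type server struct {
	c chan *event // the queue in question (server.c), buffered at 256 today
}

// send is the path used by the HTTP-triggered goroutines: they must wait for
// the server loop to process the event, so the buffer buys them no latency.
func (s *server) send(value interface{}) (interface{}, error) {
	e := &event{target: value, c: make(chan error, 1)}
	s.c <- e
	err := <-e.c // blocks until the loop has handled the event
	return e.returnValue, err
}

// sendAsync is the path used by the peer goroutines: they do not need the
// response, and the buffer only lets them start the next heartbeat round
// without waiting for the loop.
func (s *server) sendAsync(value interface{}) {
	s.c <- &event{target: value}
}

func (s *server) loop() {
	for e := range s.c {
		fmt.Printf("processing %v\n", e.target)
		if e.c != nil {
			e.c <- nil // unblock a synchronous sender
		}
	}
}

func main() {
	s := &server{c: make(chan *event, 256)}
	go s.loop()
	s.sendAsync("AppendEntriesResponse") // fire-and-forget from a peer goroutine
	if _, err := s.send("JoinCommand"); err != nil {
		fmt.Println(err)
	}
}
```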
Agreed, I'm in favor of refactoring in order to remove the buffer. I'm assuming this is related to your #167 (comment), which I'm also 👍 on.
Sorry for the delay in responding. I've been out sick for a couple days. I'm 👍 and 👌 (and any other related emoji) on removing the buffer. Unrelated: do you guys think it would make sense to maintain the heartbeat in the
@benbjohnson I'm not sure. I think, to start off with, the safest and simplest approach would be for heartbeats to not be asynchronous at all, so that they can be trusted as a truer reflection of a peer's liveness. (Perhaps we should move this discussion to #173.)
(edit: meant "heartbeats" not "goroutines" 😄)
@mreiferson It is not feasible to make the heartbeat strictly synchronous. We cannot afford to let the server sit idle waiting for a reply from a peer.
right, "async" and "sync" are not really the right words - a better way to phrase what I'm thinking is "not in complete isolation" |
@mreiferson I am trying to make the heartbeat totally isolated from the state of the server. The responsibility of the heartbeat routine should be just to send read-only data to peers and pass the replies back to the server routine via a channel. It should not directly change the state of anything else.
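A small sketch of that isolation, under the assumption that the heartbeat goroutine only gets an immutable snapshot and a reply channel; all names here are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: the heartbeat goroutine gets an immutable snapshot and a
// reply channel; it never touches server state directly.
type heartbeatRequest struct { // read-only snapshot handed to the peer goroutine
	term        uint64
	commitIndex uint64
}

type heartbeatReply struct {
	peer    string
	term    uint64
	success bool
}

func heartbeat(peer string, req heartbeatRequest, replies chan<- heartbeatReply, stop <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Pretend RPC: in the real server this would be an AppendEntries call.
			replies <- heartbeatReply{peer: peer, term: req.term, success: true}
		}
	}
}

func main() {
	replies := make(chan heartbeatReply)
	stop := make(chan struct{})
	go heartbeat("node2", heartbeatRequest{term: 3, commitIndex: 10}, replies, stop)

	// Only the server routine (here, main) reacts to replies and changes state.
	for i := 0; i < 3; i++ {
		r := <-replies
		fmt.Printf("reply from %s: term=%d success=%v\n", r.peer, r.term, r.success)
	}
	close(stop)
}
```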
@xiangli-cmu I'll try to find some time later to detail what I'm thinking on #173 - I suspect, based on your pull requests and comments on other issues, that we're interested in achieving the same end result 😄
@mreiferson Cool. Thanks.
It would be good to have public access to this channel or a way to poll len(server.c) so we can tell if the queue is getting backed up.
https://github.com/goraft/raft/blob/master/server.go#L171
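For reference, the kind of accessor being requested could be as small as the following; the method name is hypothetical and this is only a sketch of the idea, not something goraft actually exposes.

```go
package main

import "fmt"

// Hypothetical sketch of the requested accessor; the method name is made up
// and goraft does not expose this as-is.
type Server struct {
	c chan interface{} // internal event queue (server.c)
}

// PendingEvents reports how backed up the internal event queue is.
func (s *Server) PendingEvents() int {
	return len(s.c)
}

func main() {
	s := &Server{c: make(chan interface{}, 256)}
	s.c <- "event"
	fmt.Println(s.PendingEvents()) // prints 1
}
```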