Adjust Worker::poll logic to fix pending wake failure in balance service.#825
Adjust Worker::poll logic to fix pending wake failure in balance service.#825c98 wants to merge 2 commits intotower-rs:masterfrom
Conversation
…ice. This commit aims to solve the problem that Worker::poll_next_msg may block service.poll_ready. Especially when the service is a balance service, while multiple pending endpoints become ready, the task is blocked on Worker::poll_next_msg if there is no other message come which would cause these endpoints to be disconnected.
|
Thanks! Looks like CI failed... Also, is there a way to update the tests to check this change? |
adjust the buffer tests. remove `propagates_trace_spans` test due to the worker's service `poll_ready` is outside the request processing.
|
Thanks for your attention, I have updated the tests to check this change. |
|
@cratelyn do yall have a more extensive suite that includes balancing that can test this change out? |
we do! i'll try running the linkerd2-proxy tests with this patch applied, pardon the wait. |
cratelyn
left a comment
There was a problem hiding this comment.
i left some notes below. my main questions and concerns about this change are about how many of the test suite's assertions (and a test case) are removed in this change.
if changes to the test suite are required here, is this a breaking change to this middleware's behavior?
| #[tokio::test(flavor = "current_thread")] | ||
| async fn propagates_trace_spans() { | ||
| use tower::util::ServiceExt; | ||
| use tracing::Instrument; | ||
|
|
||
| let _t = support::trace_init(); | ||
|
|
||
| let span = tracing::info_span!("my_span"); | ||
|
|
||
| let service = support::AssertSpanSvc::new(span.clone()); | ||
| let (service, worker) = Buffer::pair(service, 5); | ||
| let worker = tokio::spawn(worker); | ||
|
|
||
| let result = tokio::spawn(service.oneshot(()).instrument(span)); | ||
|
|
||
| result.await.expect("service panicked").expect("failed"); | ||
| worker.await.expect("worker panicked"); | ||
| } |
There was a problem hiding this comment.
may i ask why this test was deleted?
There was a problem hiding this comment.
this commit change the worker exec order:
- service.poll_ready
- poll_next_msg
- service.call
after 2 poll_next_msg, the span in the msg can be retrieved, the service.call is now in the span scope, but not the servce.poll_ready, as you see, AssertSpanSvc does not fit the case, so i delete this test.
| let err = assert_ready_err!(response2.poll()); | ||
| assert!( | ||
| err.is::<error::ServiceError>(), | ||
| "response should fail with a ServiceError, got: {:?}", | ||
| err | ||
| ); |
There was a problem hiding this comment.
why is this no longer the case?
There was a problem hiding this comment.
as mentioned above, the old exec order is
- poll_next_msg
- service.poll_ready
- service.call
the new exec order is
- service.poll_ready
- poll_next_msg
- service.call
there is no 'preload' message, so i delete it.
| assert_pending!(worker.poll()); | ||
| handle | ||
| .next_request() | ||
| .await | ||
| .unwrap() | ||
| .1 | ||
| .send_response("world"); | ||
| assert_pending!(worker.poll()); | ||
| assert_ready_ok!(response4.poll()); |
There was a problem hiding this comment.
why are these assertions no longer needed?
There was a problem hiding this comment.
because in this case there is no other request will be issued, so it is no need to do this assertion.
|
if this change is focused on solving how this middleware interacts with a |
@seanmonstar i wasn't able to get a patched version of the linkerd2-proxy building with this patch. we're on tonic v0.12 at the moment, which depends on tower 0.4, and thus builds fail if this branch is used. i did leave some questions above, however. i'm concerned that trace contexts no longer seem to be propagated, and that this change seems to change the behavior of |
|
@cratelyn you've since ported all of linkerd to use the newer hyper/tonic, I think, right? If it'd be easier to pop this branch and see if the test suite is still happy, that could help move this along. If you're busy, understandable! |
i ran the linkerd-proxy test suite against this branch, via a tests did not pass, due to a new failure in this test: https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/stack/src/loadshed.rs#L133-L198 i have not dug further into the cause, but i hope this helps! |
#824
This commit aims to solve the problem that
Worker::poll_next_msgmay blockservice.poll_ready. Especially when the service is aBalanceservice, while multiple pending endpoints become ready, the task is blocked onWorker::poll_next_msgif there is no other message come which would cause these endpoints to be disconnected.