Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue-3: add device timeout notify to rdma partition #3187

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

vladstepanyuk
Copy link
Contributor

#3
Adding volume notifying if some of requests time outed or rdma endpoint unavailable

Copy link
Contributor

Hi! Thank you for contributing!
The tests on this PR will run after a maintainer adds an ok-to-test label to this PR manually. Thank you for your patience!

@vladstepanyuk vladstepanyuk marked this pull request as draft March 14, 2025 09:58
@komarevtsev-d komarevtsev-d added blockstore Add this label to run only cloud/blockstore build and tests on PR large-tests Launch large tests for PR ok-to-test Label to approve test launch for external members labels Mar 14, 2025
@github-actions github-actions bot removed the ok-to-test Label to approve test launch for external members label Mar 14, 2025
@komarevtsev-d komarevtsev-d added the ok-to-test Label to approve test launch for external members label Mar 14, 2025
@vladstepanyuk vladstepanyuk marked this pull request as ready for review March 14, 2025 10:11
@github-actions github-actions bot removed the ok-to-test Label to approve test launch for external members label Mar 14, 2025
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit be705e7.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3887 3887 0 0 0 0

@komarevtsev-d komarevtsev-d added the ok-to-test Label to approve test launch for external members label Mar 17, 2025
@github-actions github-actions bot removed the ok-to-test Label to approve test launch for external members label Mar 17, 2025
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 3eb347f.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3888 3888 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 66289d8.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3888 3888 0 0 0 0


auto& dCtx = DeviceTimeouted[deviceUUID];
dCtx.FirstErrorTs = ctx.Now();
dCtx.ParentWasNotified = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

В interconnect партишионе есть проблема: таймаут запроса может прийти после того как всё раздуплилось. Тут такое может быть?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

если пришел таймаут то ответ на реквест уже не получим, если ты спрашивал про это.
Здесь

void DropTimedOutRequests()
{
auto endpoints = Endpoints.Get();
for (const auto& endpoint: *endpoints) {
if (!endpoint->CheckState(EEndpointState::Connected)) {
continue;
}
auto requests = endpoint->ActiveRequests.PopTimedOutRequests(
DurationToCyclesSafe(Config->MaxResponseDelay));
for (auto& request: requests) {
endpoint->AbortRequest(
std::move(request),
E_TIMEOUT,
"request timeout");
}
}
}

мы извлекаем таймаученный запрос отвекчаем на него и соответственно уничтожаем

requestInfo.Value.ActorId,
std::make_unique<
TEvNonreplPartitionPrivate::TEvCancelRequest>(
EReason::Canceled));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А где обработка TEvCancelRequest ?

Copy link
Contributor Author

@vladstepanyuk vladstepanyuk Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Вынесу всю логику с отменой запросов в отдельный пр

{
NCloud::Send(
ctx,
requestInfo.Value.ActorId,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Где заполняется поле ActorId?

Copy link
Contributor Author

@vladstepanyuk vladstepanyuk Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Вынесу всю логику с отменой запросов в отдельный пр

@@ -67,6 +111,11 @@ class TNonreplicatedPartitionRdmaActor final

TRequestInfoPtr Poisoner;

struct TDeviceTimeoutCtx {
TInstant FirstErrorTs;
bool ParentWasNotified = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

А что такое parent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartConfig->GetParentActorId()
т.е. волум

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

переименовал в VolumeWasNotified

@github-actions github-actions bot removed the ok-to-test Label to approve test launch for external members label Mar 20, 2025
Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit df05605.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3888 3888 0 0 0 0

@vladstepanyuk vladstepanyuk requested a review from sharpeye March 24, 2025 11:35
NCloud::Send(
ctx,
PartConfig->GetParentActorId(),
std::make_unique<TEvVolumePrivate::TEvDeviceTimeoutedRequest>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

мы же собираемся переименовать Timeouted в TimedOut?

можно в отдельном ПРе

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo: some tests FAILED for commit 3a73d29.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3888 3884 0 4 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 3903229.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3888 3888 0 0 0 0

Copy link
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo: all tests PASSED for commit 180d548.

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
3910 3910 0 0 0 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blockstore Add this label to run only cloud/blockstore build and tests on PR large-tests Launch large tests for PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants