-
Notifications
You must be signed in to change notification settings - Fork 201
scx_lavd: Fix stale migration roles in load balancer, remove redundant condition, and optimize the stealing loop #3109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
In current implementation, plan_x_cpdom_migration() returns early when no domain was overloaded without updating the per-domain migration role: cpdomc->is_stealer, cpdomc->is_stealee, and sys_stat.nr_stealee. Under normal conditions these fields are updated in the classification loop, marking each domain as stealer, stealee, or neutral given by its current load. When the overloaded condition returns early, the loop never runs and the remained roles still impact the load balancer's decision. Places make migration decisions: 1. consume_task() checks is_stealer before calling try_to_steal_task() 2. try_to_steal_task() only steals from stealee domains 3. the idle donation path in pick_idle_cpu() uses is_stealer/is_stealee to decide cross-domain task placement As a result, a domain marked as stealer or stealee in a previous round could keep that role even after plan_x_cpdom_migration() determines that no migration was needed. This induces cross-domain migrations to continue based on stale decisions. Fix this by jumping to the classification loop instead of returning early, making sure the roles are recalculated. Fixes: 17c7b27 ("scx_lavd: Make the load balancing core compaction- and capacity-aware.") Signed-off-by: Gavin Guo <[email protected]>
The first clause of this condition:
(stealee_threshold <= max_sc_load || overflow_running)
is the negation of the previous if statement:
(stealee_threshold > max_sc_load) && !overflow_running
So, if we reach this point, that check already failed and the second
clause is always true.
Signed-off-by: Gavin Guo <[email protected]>
Add an early exit in try_to_steal_task() when sys_stat.nr_stealee is zero. This avoids the probabilistic calculation and nested loop traversal when no domain is marked as a stealee. Together with the commit that ensures the classification loop runs when there is no domain's scaled loading over the stealee_threshold, this guarantees sys_stat.nr_stealee accurately reflects the current state and avoid redundant hunting for tasks to be stolen. Signed-off-by: Gavin Guo <[email protected]>
| * --------------------------------------> | ||
| */ | ||
| return 0; | ||
| goto calc_stealer_stealee; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although stealer/stealee roles are likely to be set to false at the end of each round in regular cases, it is still useful to update them immediately to avoid reusing the stale values in some cases, such as full of pinned/non-migratable tasks in a domain.
My only doubt is if it is guaranteed that both stealer/stealee flags will set to false in the following bpf_for block in this condition. It seeems not guaranteed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for the inactive domains, the is_stealee is always true. To rerun the for
loop will not be helpful to fix the stale roles.
For the active domains, sc_load is dynamic, recalculated each round based on:
- cur_util_sum or avg_util_sum (cpu utilization)
- nr_queued_task (queue length)
My original idea behind the goto enhancement is that the sc_load could be
dynamically changed based on the utilization and the queue length. The reason is
that in the previous round, the role is stealee, then the program runs to a less
cpu-intensive block, leading to subside domain's cpu utilization and consume
more tasks on the queue, decreasing the queue length. In the next round of
scaled loading calculation, it may not be a stealee anymore.
Conversely, the original role is stealer, the user is actively increasing the
load. The system would become busier. And previous stealer domain would become a
neutral one. In both cases, the stale roles would incorrectly induce the load
balancer to misjudge the circumstance and make load balancing suboptimal.
It can be demonstrated in the following table:
| Round | Domain A | Domain B | System State |
|---|---|---|---|
| N | Stealee | Stealer | Imbalanced |
| N+1 | sc_load dropped (should be Neutral) | sc_load increased (Neutral) | Balanced |
However, when the try_to_steal_task() function is called, it will use the stale
roles to make migration decisions despite the current system is balanced.
The fix ensures that in Round N+1, the roles are recalculated—domain A becomes
neutral, domain B becomes neutral, and no incorrect migration continues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. As long as both stealer and stealee flags are unset in this case, I am okay with that.
| /* | ||
| * No stealee, nothing to steal. | ||
| */ | ||
| if (!sys_stat.nr_stealee) | ||
| return false; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I feel this is a bit redundant since calling try_to_steal_task() is anyway gated by cpdomc->is_stealer. Is it possible that there is a stealer but no stealee? Logically, that does not make sense. Can we somehow guarantee that there is no such case triggered in `plan_x_cpdom_migration()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that logically, a “stealer but no stealee” state shouldn’t exist. As we focus on resolving the overloading domains. However, there are still several scenarios:
- It's valid in a big system. Some domains are under-loaded, but no domains are stealees.
- In a dynamic system, the domain could become a stealer due to the loading
changes. Particularly, the cur_util_sum fluctuates quickly. - Even the system continues to migrate tasks and set both is_stealee and
is_stealer to false, however, if the number of stealers is more than stealees,
the system would continue to migrate tasks. - The asynchronous update by the timer is also possible that when
consume_task() is called, the roles are not updated completely yet. - The stale issue described above. Despite the stealer increases the loading
(become neutral) and the all the stealees has been balanced by other stealers.
But, the recalculation is not conducted and the stale stealers are still there.
Even though there exists a stealer, it's not harmful. In pick_idle_cpu(), the stealer
would be selected as a sticky domain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though there exists a stealer, it's not harmful. In pick_idle_cpu(), the stealer would be selected as a sticky domain.
When there is no stealee, why not turn off all stealer flags? Since plan_x_cpdom_migration() runs once in a while, imposing a bit more overhead (2-pass scanning: first for stealee then for stealer) there would be better. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed again the sticky domain logic in pick_idle_cpu(), and found the
stealer is chosen only when there exists stealee, and not based on is_stealer alone.
if (!cpdc || !READ_ONCE(cpdc->is_stealee))
goto skip_fully_idle_neighbor;This means that the stealer is not used without a stealee. So, the last statement
Even though there exists a stealer, it's not harmful. In pick_idle_cpu(), the stealer
would be selected as a sticky domain.
of my previous comment is weak.
Given above justification, we can safely set the stealer to false when there is
no stealee. How about the following patch?
/*
* If there is no overloaded domain (no stealees), skip load balancing.
* Clear all stealer/stealee roles to prevent stale state from previous
* balancing rounds from triggering incorrect migrations. While some
* domains may be underloaded (stealers), migration is unnecessary
* without overloaded domains (stealees) to steal from.
* <~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~>
* [stealer_threshold ... avg_sc_load ... max_sc_load ... stealee_threshold]
* -------------------------------------->
*/
if ((stealee_threshold > max_sc_load) && !overflow_running) {
bpf_for(cpdom_id, 0, nr_cpdoms) {
if (cpdom_id >= LAVD_CPDOM_MAX_NR)
break;
cpdomc = MEMBER_VPTR(cpdom_ctxs, [cpdom_id]);
WRITE_ONCE(cpdomc->is_stealer, false);
WRITE_ONCE(cpdomc->is_stealee, false);
}
sys_stat.nr_stealee = 0;
return 0;
}Maybe separate these into another patch to not mix the logic. Also, I put the
logic in comment to make it clearer.
/*
* At this point, there is at least one overloaded domain (stealee),
* indicated by the following condition, which is the negation of the
* previous if-statement:
* stealee_threshold <= max_sc_load || overflow_running
*
* Adjust the stealer threshold to minimum scaled load to ensure that
* there exists at least one stealer.
*/
if (stealer_threshold < min_sc_load) {However, we still need to bear in mind that this is to avoid the stale state and
stealer still could be observed while there is no stealee when the number of
stealers is more than stealees, and the asynchronous update situation. But, this
is harmless in this case. The impact is for the try_to_steal_task() to run more
loops to select the non-existing stealee domain to steal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both changes look good to me. Please force-push another version.
This series fixes a bug where cross-domain migration decisions were based on
stale stealer/stealee roles.
plan_x_cpdom_migration() returned early when no domain was overloaded, which
skips the classification loop that updates is_stealer/is_stealee roles. The
stale roles from previous rounds caused consume_task() and pick_idle_cpu() to
continue migrating tasks even when the system loading is neutral and the load
balancer determined no migration was needed.
The fix replaces the early return with a goto to ensure the classification loop
runs, guarantees roles are refreshed and sys_stat.nr_stealee accurately
reflects the current state.
With correct sys_stat.nr_stealee, we add an early exit in try_to_steal_task()
when nr_stealee is zero. This avoids the nested neighbor traversal when there is
nothing to steal. The enhancement converts a previously unused counter into an
optimization.
PATCH [1/3]: Fix stale roles by always running the classification loop
PATCH [2/3]: Remove a redundant condition
PATCH [3/3]: Skip try_to_steal_task() iteration when nr_stealee is zero