* [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE
@ 2025-11-23 4:04 Tingjia Cao
2025-11-24 6:07 ` K Prateek Nayak
0 siblings, 1 reply; 3+ messages in thread
From: Tingjia Cao @ 2025-11-23 4:04 UTC (permalink / raw)
To: peterz, Vincent Guittot, mingo, juri.lelli, dietmar.eggemann,
linux-kernel
[-- Attachment #1.1: Type: text/plain, Size: 988 bytes --]
Recently, we encountered an issue that sync wakeup kthread didn't choose
the current CPU though the waker is the only runnable task. It is caused by
a conflict between delayed dequeue feature and select_idle_sibling function.
With the DELAYED_DEQUEUE mechanism enabled, a task that goes to sleep may
not be removed from the runqueue immediately. As a result, nr_running may
overcount the number of runnable tasks. Inside select_idle_sibling, there
is a special case for sync wakeup:
if (is_per_cpu_kthread(current) &&
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
asym_fits_cpu(...)) {
return prev;
}
For "this_rq()->nr_running <= 1": we should use the real running-tasks rq
to check whether to place the wake-up task to the current cpu.
To fix this (patch attached), we can use the true number of runnable tasks
by subtracting the delayed-dequeue count:
this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1
Best,
Tingjia
[-- Attachment #1.2: Type: text/html, Size: 1185 bytes --]
[-- Attachment #2: fix-select_idle_sibling-vs-DELAYED_DEQUEUE.patch --]
[-- Type: application/octet-stream, Size: 823 bytes --]
From 2540ac815e9cfa47e984a828139526e290f4f459 Mon Sep 17 00:00:00 2001
From: Tingjia-0v0 <tjcao980311@gmail.com>
Date: Sat, 22 Nov 2025 21:42:00 -0600
Subject: [PATCH] fix select_idle_sibling vs DELAYED_DEQUEUE
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..d60a3f5ebeca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7869,7 +7869,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (is_per_cpu_kthread(current) &&
in_task() &&
prev == smp_processor_id() &&
- this_rq()->nr_running <= 1 &&
+ this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1 &&
asym_fits_cpu(task_util, util_min, util_max, prev)) {
return prev;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE
2025-11-23 4:04 [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE Tingjia Cao
@ 2025-11-24 6:07 ` K Prateek Nayak
[not found] ` <CABcWv9_D5g9RTnu2+briWO-AZXySfpNSNORYxEoOAkHp2nE7Eg@mail.gmail.com>
0 siblings, 1 reply; 3+ messages in thread
From: K Prateek Nayak @ 2025-11-24 6:07 UTC (permalink / raw)
To: Tingjia Cao, peterz, Vincent Guittot, mingo, juri.lelli,
dietmar.eggemann, linux-kernel
Hello Tingjia,
On 11/23/2025 9:34 AM, Tingjia Cao wrote:
> Recently, we encountered an issue that sync wakeup kthread didn't choose the current CPU though the waker is the only runnable task. It is caused by a conflict between delayed dequeue feature and select_idle_sibling function.
>
> With the DELAYED_DEQUEUE mechanism enabled, a task that goes to sleep may not be removed from the runqueue immediately. As a result, nr_running may overcount the number of runnable tasks. Inside select_idle_sibling, there is a special case for sync wakeup:
>
> if (is_per_cpu_kthread(current) &&
> in_task() &&
> prev == smp_processor_id() &&
> this_rq()->nr_running <= 1 &&
> asym_fits_cpu(...)) {
> return prev;
> }
>
> For "this_rq()->nr_running <= 1": we should use the real running-tasks rq to check whether to place the wake-up task to the current cpu.
>
> To fix this (patch attached), we can use the true number of runnable tasks by subtracting the delayed-dequeue count:
>
> this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1
This is a very transient state - tasks cannot be delayed without other
runnable tasks at the time of dequeue and soon after the dequeue of
last runnable task, all the pending delayed tasks would get dequeued.
The window is actually very small. Does this make a difference in
your workload performance?
Once all tasks are dequeued, the newidle balance should run on the CPU
going idle to help reduce any imbalance.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE
[not found] ` <CABcWv9_D5g9RTnu2+briWO-AZXySfpNSNORYxEoOAkHp2nE7Eg@mail.gmail.com>
@ 2025-12-02 6:26 ` Tingjia Cao
0 siblings, 0 replies; 3+ messages in thread
From: Tingjia Cao @ 2025-12-02 6:26 UTC (permalink / raw)
To: K Prateek Nayak; +Cc: linux-kernel
Hello Prateek,
Thanks for the explanation, and we didn't observe a clear performance
impact in our workloads.
We originally noticed this issue because some CPU selection decisions
for sync wakeups looked unexpected occasionally. To better understand
the behavior, we constructed a deterministic test that issues a
sync_wakeup in the window where (1) the waker is the only runnable
task, and (2) there is a delayed task on its runqueue. In this case,
wake_affine_idle returns the parent's CPU that is about to be idle,
but select_idle_sibling chooses an idle sibling instead (lost the
chance to use the warm core; the idle sibling can be running at a low
frequency, the child’s performance will be affected more).
We think it's good to keep the sync-wakeup logic consistent between
wake_affine_idle() and select_idle_sibling(). wake_affine_idle() use
the predicate "nr_running - cfs_h_nr_delayed(this_rq()) <= 1" for sync
wakeup, but the later select_idle_sibling use the predicate
"nr_running <=1" for sync wakeup.
Sorry for the duplicate email; I added linux-kernel to CC this time.
Best,
Tingjia
On Tue, Dec 2, 2025 at 12:19 AM Tingjia Cao <tjcao980311@gmail.com> wrote:
>
> Hello Prateek,
>
> Thanks for the explanation, and we didn't observe a clear performance impact in our workloads.
>
> We originally noticed this issue because some CPU selection decisions for sync wakeups looked unexpected occasionally. To better understand the behavior, we constructed a deterministic test that issues a sync_wakeup in the window where (1) the waker is the only runnable task, and (2) there is a delayed task on its runqueue. In this case, wake_affine_idle returns the parent's CPU that is about to be idle, but select_idle_sibling chooses an idle sibling instead (lost the chance to use the warm core; the idle sibling can be running at a low frequency, the child’s performance will be affected more).
>
> We think it's good to keep the sync-wakeup logic consistent between wake_affine_idle() and select_idle_sibling(). wake_affine_idle() use the predicate "nr_running - cfs_h_nr_delayed(this_rq()) <= 1" for sync wakeup, but the later select_idle_sibling use the predicate "nr_running <=1" for sync wakeup.
>
> Best,
> Tingjia
>
> On Mon, Nov 24, 2025 at 12:07 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Tingjia,
>>
>> On 11/23/2025 9:34 AM, Tingjia Cao wrote:
>> > Recently, we encountered an issue that sync wakeup kthread didn't choose the current CPU though the waker is the only runnable task. It is caused by a conflict between delayed dequeue feature and select_idle_sibling function.
>> >
>> > With the DELAYED_DEQUEUE mechanism enabled, a task that goes to sleep may not be removed from the runqueue immediately. As a result, nr_running may overcount the number of runnable tasks. Inside select_idle_sibling, there is a special case for sync wakeup:
>> >
>> > if (is_per_cpu_kthread(current) &&
>> > in_task() &&
>> > prev == smp_processor_id() &&
>> > this_rq()->nr_running <= 1 &&
>> > asym_fits_cpu(...)) {
>> > return prev;
>> > }
>> >
>> > For "this_rq()->nr_running <= 1": we should use the real running-tasks rq to check whether to place the wake-up task to the current cpu.
>> >
>> > To fix this (patch attached), we can use the true number of runnable tasks by subtracting the delayed-dequeue count:
>> >
>> > this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1
>>
>> This is a very transient state - tasks cannot be delayed without other
>> runnable tasks at the time of dequeue and soon after the dequeue of
>> last runnable task, all the pending delayed tasks would get dequeued.
>> The window is actually very small. Does this make a difference in
>> your workload performance?
>>
>> Once all tasks are dequeued, the newidle balance should run on the CPU
>> going idle to help reduce any imbalance.
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2025-12-02 6:26 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-23 4:04 [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE Tingjia Cao
2025-11-24 6:07 ` K Prateek Nayak
[not found] ` <CABcWv9_D5g9RTnu2+briWO-AZXySfpNSNORYxEoOAkHp2nE7Eg@mail.gmail.com>
2025-12-02 6:26 ` Tingjia Cao
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox