[RFC][PATCH] sched_ext: Allow consuming local tasks when aborting

* [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
@ 2026-05-07 13:56 Christian Loehle
  2026-05-08 14:14 ` Andrea Righi
  2026-05-08 15:28 ` Tejun Heo
  0 siblings, 2 replies; 6+ messages in thread
From: Christian Loehle @ 2026-05-07 13:56 UTC (permalink / raw)
  To: sched-ext; +Cc: linux-kernel, tj, void, arighi, changwoo, Christian Loehle

When aborting, consume_dispatch_q() breaks out of the task iteration
loop entirely for non-bypass DSQs. This prevents CPUs from consuming
even their own tasks (where rq == task_rq) from any DSQ.

This causes a deadlock during CPU hotplug:

1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
   setting sch->aborting and queuing the disable_work on the helper
   kthread.

2. The helper kthread (and other tasks) are stuck on the global or
   user DSQs because bypass mode hasn't been entered yet.

3. No CPU can consume these tasks due to the aborting break, so the
   helper never runs scx_root_disable() -> scx_bypass().

4. The cpuhp thread is stuck in balance_hotplug_wait() because the
   dying CPU's rq never drains.

Tasks on user DSQs are equally affected: BPF schedulers can dispatch
RCU and other critical kthreads to user DSQs, causing RCU stalls when
those tasks become unconsumable.

The aborting check was added to prevent live-locks from the remote task
migration path (consume_remote_task() -> goto retry), and also to avoid
holding the dsq->lock for too long.

Change the break to skip only remote tasks via continue, allowing each
CPU to still consume tasks already on its own rq. This unblocks the
helper kthread, lets bypass mode activate, and allows both hotplug and
RCU grace periods to complete.

Fixes: 5ebec443fb96 ("sched_ext: Exit dispatch and move operations immediately when aborting")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
RFC:
I guess this reintroduces the live-lock where a BPF scheduler has a
highly contended DSQ with a lot of tasks and the outer loop holds
dsq->lock, so the bypass still takes too long to activate. Is there a
better way?
I also couldn't trigger a lockup through that. Did I just not have
the right platform (e.g. 2x Intel 8480c)? Should we add a selftest
for this too, then?

 kernel/sched/ext.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 345aa11b84b2..3cce200708b0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2463,10 +2463,13 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 		 * a contended DSQ, or the outer retry loop can repeatedly race
 		 * against scx_bypass() dequeueing tasks from @dsq trying to put
 		 * the system into the bypass mode. This can easily live-lock the
-		 * machine. If aborting, exit from all non-bypass DSQs.
+		 * machine. If aborting, skip remote tasks from non-bypass DSQs
+		 * but still allow consuming local tasks to prevent deadlocks
+		 * during CPU hotplug where the dying CPU must drain its rq.
 		 */
-		if (unlikely(READ_ONCE(sch->aborting)) && dsq->id != SCX_DSQ_BYPASS)
-			break;
+		if (unlikely(READ_ONCE(sch->aborting)) &&
+		    dsq->id != SCX_DSQ_BYPASS && rq != task_rq)
+			continue;

 		if (rq == task_rq) {
 			task_unlink_from_dsq(p, dsq);
-- 
2.34.1

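As a rough illustration of the break-versus-continue change in the patch
above, here is a small user-space C model of the scan loop. It is not kernel
code: struct toy_task, enum abort_policy and toy_consume() are hypothetical
names, the DSQ is assumed to be a non-bypass one, and remote consumption is
simplified to always succeed when not aborting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task queued on a DSQ. */
struct toy_task {
	int cpu;	/* CPU whose rq the task currently belongs to */
};

/* How the scan reacts to the aborting flag (names are assumptions). */
enum abort_policy {
	ABORT_BREAK,		/* before the patch: stop scanning entirely */
	ABORT_SKIP_REMOTE,	/* with the patch: skip remote tasks only */
};

/*
 * Scan @tasks in queue order on behalf of @this_cpu. Returns the index of
 * the task that would be consumed, or -1 if nothing is consumed.
 */
static int toy_consume(const struct toy_task *tasks, size_t nr, int this_cpu,
		       bool aborting, enum abort_policy policy)
{
	assert(tasks || !nr);

	for (size_t i = 0; i < nr; i++) {
		bool local = tasks[i].cpu == this_cpu;

		if (aborting) {
			if (policy == ABORT_BREAK)
				break;		/* nothing is consumable */
			if (!local)
				continue;	/* keep looking for a local task */
		}

		/*
		 * Simplification: when not aborting, the first task is taken
		 * whether it is local or remote.
		 */
		return (int)i;
	}

	return -1;
}
```

With a queue whose tasks belong to CPUs 1, 2 and 0, in that order, a scan on
behalf of CPU 0 while aborting returns -1 under ABORT_BREAK and index 2 under
ABORT_SKIP_REMOTE. The second policy walks past every remote task before it
finds a local one, which is the lock hold time concern raised in the RFC
notes.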

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
  2026-05-07 13:56 [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting Christian Loehle
@ 2026-05-08 14:14 ` Andrea Righi
  2026-05-08 15:45   ` Christian Loehle
  2026-05-08 15:28 ` Tejun Heo
  1 sibling, 1 reply; 6+ messages in thread
From: Andrea Righi @ 2026-05-08 14:14 UTC (permalink / raw)
  To: Christian Loehle; +Cc: sched-ext, linux-kernel, tj, void, changwoo

Hi Christian,

On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> When aborting, consume_dispatch_q() breaks out of the task iteration
> loop entirely for non-bypass DSQs. This prevents CPUs from consuming
> even their own tasks (where rq == task_rq) from any DSQ.
> 
> This causes a deadlock during CPU hotplug:
> 
> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
>    setting sch->aborting and queuing the disable_work on the helper
>    kthread.
> 
> 2. The helper kthread (and other tasks) are stuck on the global or
>    user DSQs because bypass mode hasn't been entered yet.
> 
> 3. No CPU can consume these tasks due to the aborting break, so the
>    helper never runs scx_root_disable() -> scx_bypass().
> 
> 4. The cpuhp thread is stuck in balance_hotplug_wait() because the
>    dying CPU's rq never drains.
> 
> Tasks on user DSQs are equally affected: BPF schedulers can dispatch
> RCU and other critical kthreads to user DSQs, causing RCU stalls when
> those tasks become unconsumable.
> 
> The aborting check was added to prevent live-locks from the remote task
> migration path (consume_remote_task() -> goto retry), and also to avoid
> holding the dsq->lock for too long.
> 
> Change the break to skip only remote tasks via continue, allowing each
> CPU to still consume tasks already on its own rq. This unblocks the
> helper kthread, lets bypass mode activate, and allows both hotplug and
> RCU grace periods to complete.

Have you been able to reproduce this stall condition?

When the kernel forces bypass, scx_bypass() explicitly walks every CPU's
runnable_list and cycles tasks through DEQUEUE_SAVE | DEQUEUE_MOVE so
dispatching stops depending on BPF.

On CPU hotplug the helper kthread (and all the other critical kthreads) should
also be on the runnable_list, so they should be moved to SCX_DSQ_BYPASS and
consume_dispatch_q() should be able to consume them.

Maybe the problem is that in do_enqueue_task() we keep tasks on the local DSQ
when !scx_rq_online(rq); instead, we should prioritize the bypass condition.

Does something like the following make sense to you?

Thanks,
-Andrea

 kernel/sched/ext.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7ac7d10a41bef..277110d950c30 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1901,6 +1901,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	p->scx.flags &= ~SCX_TASK_IMMED;
 
+	/*
+	 * Check bypass before testing the rq online state: bypass mode stops
+	 * processing local DSQs, so tasks should be routed through
+	 * SCX_DSQ_BYPASS rather than dispatched to the local DSQ during CPU
+	 * hotplug events.
+	 */
+	if (scx_bypassing(sch, cpu_of(rq))) {
+		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
+		goto bypass;
+	}
+
 	/*
 	 * If !scx_rq_online(), we already told the BPF scheduler that the CPU
 	 * is offline and are just running the hotplug path. Don't bother the
@@ -1909,11 +1920,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	if (!scx_rq_online(rq))
 		goto local;
 
-	if (scx_bypassing(sch, cpu_of(rq))) {
-		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
-		goto bypass;
-	}
-
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;
 

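To make the effect of reordering the two checks easier to see, here is a
minimal user-space C sketch of the routing decision. It is only a model:
enum toy_route, enum toy_order and toy_enqueue_route() are hypothetical
names, and everything after the two checks (direct dispatch, the BPF enqueue
path) is collapsed into TOY_OTHER.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the modeled enqueue path sends a task (names are assumptions). */
enum toy_route {
	TOY_LOCAL,	/* stays on the rq's local DSQ */
	TOY_BYPASS,	/* routed through the bypass DSQ */
	TOY_OTHER,	/* anything decided later in the function */
};

/* Which of the two checks runs first. */
enum toy_order {
	ONLINE_CHECK_FIRST,	/* order before the proposed change */
	BYPASS_CHECK_FIRST,	/* order after the proposed change */
};

static enum toy_route toy_enqueue_route(bool rq_online, bool bypassing,
					enum toy_order order)
{
	assert(order == ONLINE_CHECK_FIRST || order == BYPASS_CHECK_FIRST);

	/* Proposed order: bypass wins even when the rq is offline. */
	if (order == BYPASS_CHECK_FIRST && bypassing)
		return TOY_BYPASS;

	/* An offline rq keeps the task on its local DSQ. */
	if (!rq_online)
		return TOY_LOCAL;

	/* Original position of the bypass check. */
	if (bypassing)
		return TOY_BYPASS;

	return TOY_OTHER;
}
```

The two orders differ in exactly one case: an offline rq while bypassing.
toy_enqueue_route(false, true, ONLINE_CHECK_FIRST) yields TOY_LOCAL, while
toy_enqueue_route(false, true, BYPASS_CHECK_FIRST) yields TOY_BYPASS.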

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
  2026-05-08 14:14 ` Andrea Righi
@ 2026-05-08 15:45   ` Christian Loehle
  0 siblings, 0 replies; 6+ messages in thread
From: Christian Loehle @ 2026-05-08 15:45 UTC (permalink / raw)
  To: Andrea Righi; +Cc: sched-ext, linux-kernel, tj, void, changwoo

On 5/8/26 15:14, Andrea Righi wrote:
> Hi Christian,
> 
> On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
>> When aborting, consume_dispatch_q() breaks out of the task iteration
>> loop entirely for non-bypass DSQs. This prevents CPUs from consuming
>> even their own tasks (where rq == task_rq) from any DSQ.
>>
>> This causes a deadlock during CPU hotplug:
>>
>> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
>>    setting sch->aborting and queuing the disable_work on the helper
>>    kthread.
>>
>> 2. The helper kthread (and other tasks) are stuck on the global or
>>    user DSQs because bypass mode hasn't been entered yet.
>>
>> 3. No CPU can consume these tasks due to the aborting break, so the
>>    helper never runs scx_root_disable() -> scx_bypass().
>>
>> 4. The cpuhp thread is stuck in balance_hotplug_wait() because the
>>    dying CPU's rq never drains.
>>
>> Tasks on user DSQs are equally affected: BPF schedulers can dispatch
>> RCU and other critical kthreads to user DSQs, causing RCU stalls when
>> those tasks become unconsumable.
>>
>> The aborting check was added to prevent live-locks from the remote task
>> migration path (consume_remote_task() -> goto retry), and also to avoid
>> holding the dsq->lock for too long.
>>
>> Change the break to skip only remote tasks via continue, allowing each
>> CPU to still consume tasks already on its own rq. This unblocks the
>> helper kthread, lets bypass mode activate, and allows both hotplug and
>> RCU grace periods to complete.
> 
> Have you been able to reproduce this stall condition?

Yes, the hotplug selftest reproduces this for me occasionally. With a
100-iteration loop around the 4 test cases, I guess it's up to 100%.

> 
> When the kernel forces bypass, scx_bypass() explicitly walks every CPU's
> runnable_list and cycles tasks through DEQUEUE_SAVE | DEQUEUE_MOVE so
> dispatching stops depending on BPF.
> 
> On CPU hotplug the helper kthread (and all the other critical kthreads) should
> also be on the runnable_list, so they should be moved to SCX_DSQ_BYPASS and
> consume_dispatch_q() should be able to consume them.
> 
> Maybe the problem is that in do_enqueue_task() we keep tasks on the local DSQ
> when !scx_rq_online(rq); instead, we should prioritize the bypass condition.
> 
> Does something like the following make sense to you?
> 
> Thanks,
> -Andrea
> 
>  kernel/sched/ext.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 7ac7d10a41bef..277110d950c30 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1901,6 +1901,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 */
>  	p->scx.flags &= ~SCX_TASK_IMMED;
>  
> +	/*
> +	 * Check bypass before testing the rq online state: bypass mode stops
> +	 * processing local DSQs, so tasks should be routed through
> +	 * SCX_DSQ_BYPASS rather than dispatched to the local DSQ during CPU
> +	 * hotplug events.
> +	 */
> +	if (scx_bypassing(sch, cpu_of(rq))) {
> +		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> +		goto bypass;
> +	}
> +
>  	/*
>  	 * If !scx_rq_online(), we already told the BPF scheduler that the CPU
>  	 * is offline and are just running the hotplug path. Don't bother the
> @@ -1909,11 +1920,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	if (!scx_rq_online(rq))
>  		goto local;
>  
> -	if (scx_bypassing(sch, cpu_of(rq))) {
> -		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> -		goto bypass;
> -	}
> -
>  	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
>  		goto direct;
>  
> 

Unfortunately that also locks up, let me go have another look.


* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
  2026-05-07 13:56 [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting Christian Loehle
  2026-05-08 14:14 ` Andrea Righi
@ 2026-05-08 15:28 ` Tejun Heo
  2026-05-08 15:47   ` Andrea Righi
  1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2026-05-08 15:28 UTC (permalink / raw)
  To: Christian Loehle; +Cc: sched-ext, linux-kernel, void, arighi, changwoo

Hello,

On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
>    setting sch->aborting and queuing the disable_work on the helper
>    kthread.
> 
> 2. The helper kthread (and other tasks) are stuck on the global or
>    user DSQs because bypass mode hasn't been entered yet.

The helper thread runs in the RT class, so it doesn't go through SCX at all.
Can you try Andrea's patch?

> RFC:
> I guess this reintroduces the live-lock where a BPF scheduler has a
> highly contended DSQ with a lot of tasks and the outer loop holds
> dsq->lock, so the bypass still takes too long to activate. Is there a
> better way?
> I also couldn't trigger a lockup through that. Did I just not have
> the right platform (e.g. 2x Intel 8480c)? Should we add a selftest
> for this too, then?

Dual Sapphire Rapids is where the problem was initially observed, and I could
also reproduce it on dual-socket Zen 2. SPRs are way more susceptible tho.
I *think* I was running scx_simple with some mixture of saturating
stress-ng. It wasn't that difficult to reproduce. We should probably
document the repro somewhere. I'm not sure selftests is a good place to host
this sort of repro.

Thanks.

-- 
tejun


* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
  2026-05-08 15:28 ` Tejun Heo
@ 2026-05-08 15:47   ` Andrea Righi
  2026-05-08 17:59     ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Andrea Righi @ 2026-05-08 15:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christian Loehle, sched-ext, linux-kernel, void, changwoo

Hi Tejun,

On Fri, May 08, 2026 at 05:28:29AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> > 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
> >    setting sch->aborting and queuing the disable_work on the helper
> >    kthread.
> > 
> > 2. The helper kthread (and other tasks) are stuck on the global or
> >    user DSQs because bypass mode hasn't been entered yet.
> 
> The helper thread runs in the RT class, so it doesn't go through SCX at all.
> Can you try Andrea's patch?
> 
> > RFC:
> > I guess this reintroduces the live-lock where a BPF scheduler has a
> > highly contended DSQ with a lot of tasks and the outer loop holds
> > dsq->lock, so the bypass still takes too long to activate. Is there a
> > better way?
> > I also couldn't trigger a lockup through that. Did I just not have
> > the right platform (e.g. 2x Intel 8480c)? Should we add a selftest
> > for this too, then?
> 
> Dual Sapphire Rapids is where the problem was initially observed, and I could
> also reproduce it on dual-socket Zen 2. SPRs are way more susceptible tho.
> I *think* I was running scx_simple with some mixture of saturating
> stress-ng. It wasn't that difficult to reproduce. We should probably
> document the repro somewhere. I'm not sure selftests is a good place to host
> this sort of repro.

There are a few selftests that use stress-ng in tools/testing/selftests. Maybe
we can put a script there that calls stress-ng, if present, together with a
sched similar to scx_simple, and skips the test if stress-ng isn't present. Do
you remember the stress-ng command you were using? We can probably even
reproduce the issue by adding something to the C part of the scheduler that
mimics what stress-ng is doing.

Thanks,
-Andrea


* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
  2026-05-08 15:47   ` Andrea Righi
@ 2026-05-08 17:59     ` Tejun Heo
  0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-08 17:59 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Christian Loehle, sched-ext, linux-kernel, void, changwoo,
	Emil Tsalapatis

Hello, Andrea.

On Fri, May 08, 2026 at 05:47:36PM +0200, Andrea Righi wrote:
> There are a few selftests that use stress-ng in tools/testing/selftests. Maybe
> we can put a script there that calls stress-ng, if present, together with a
> sched similar to scx_simple, and skips the test if stress-ng isn't present. Do
> you remember the stress-ng command you were using? We can probably even
> reproduce the issue by adding something to the C part of the scheduler that
> mimics what stress-ng is doing.

Dug it out of shell history on the test box. It was the workload from
b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"):

  stress-ng --race-sched 1024
  stress-ng --workload 80 --workload-threads 10

Run both in parallel on a 2x EPYC 7642 while flipping an SCX scheduler on
and off in a loop and the live-lock reproduced reliably.

Thanks.

--
tejun

