* [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
@ 2026-05-07 13:56 Christian Loehle
2026-05-08 14:14 ` Andrea Righi
2026-05-08 15:28 ` Tejun Heo
0 siblings, 2 replies; 6+ messages in thread
From: Christian Loehle @ 2026-05-07 13:56 UTC (permalink / raw)
To: sched-ext; +Cc: linux-kernel, tj, void, arighi, changwoo, Christian Loehle

When aborting, consume_dispatch_q() breaks out of the task iteration
loop entirely for non-bypass DSQs. This prevents CPUs from consuming
even their own tasks (where rq == task_rq) from any DSQ.

This causes a deadlock during CPU hotplug:

1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
   setting sch->aborting and queuing the disable_work on the helper
   kthread.

2. The helper kthread (and other tasks) are stuck on the global or
   user DSQs because bypass mode hasn't been entered yet.

3. No CPU can consume these tasks due to the aborting break, so the
   helper never runs scx_root_disable() -> scx_bypass().

4. The cpuhp thread is stuck in balance_hotplug_wait() because the
   dying CPU's rq never drains.

Tasks on user DSQs are equally affected: BPF schedulers can dispatch
RCU and other critical kthreads to user DSQs, causing RCU stalls when
those tasks become unconsumable.

The aborting check was added to prevent live-locks from the remote task
migration path (consume_remote_task() -> goto retry), but also avoid
holding the dsq->lock for too long.

Change the break to skip only remote tasks via continue, allowing each
CPU to still consume tasks already on its own rq. This unblocks the
helper kthread, lets bypass mode activate, and allows both hotplug and
RCU grace periods to complete.

Fixes: 5ebec443fb96 ("sched_ext: Exit dispatch and move operations immediately when aborting")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
RFC:
I guess this reintroduces the live-lock of a BPF scheduler having a
highly contended DSQ with a lot of tasks and the outer loop holding
dsq->lock, so that it still takes too long for the bypass to
activate. Is there a better way?
I also couldn't trigger a lockup through that. Did I just not have
the right platform (e.g. 2x Intel 8480c)? Should we add a selftest
for this too, then?

kernel/sched/ext.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 345aa11b84b2..3cce200708b0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2463,10 +2463,13 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 		 * a contended DSQ, or the outer retry loop can repeatedly race
 		 * against scx_bypass() dequeueing tasks from @dsq trying to put
 		 * the system into the bypass mode. This can easily live-lock the
-		 * machine. If aborting, exit from all non-bypass DSQs.
+		 * machine. If aborting, skip remote tasks from non-bypass DSQs
+		 * but still allow consuming local tasks to prevent deadlocks
+		 * during CPU hotplug where the dying CPU must drain its rq.
 		 */
-		if (unlikely(READ_ONCE(sch->aborting)) && dsq->id != SCX_DSQ_BYPASS)
-			break;
+		if (unlikely(READ_ONCE(sch->aborting)) && dsq->id != SCX_DSQ_BYPASS
+		    && rq != task_rq)
+			continue;
 
 		if (rq == task_rq) {
 			task_unlink_from_dsq(p, dsq);
--
2.34.1
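
For readers following the diff, here is a minimal userspace sketch of the loop change. It is not kernel code: `toy_task`, `toy_consume` and `abort_policy` are hypothetical stand-ins, and the remote-task migration path is reduced to "consumable when not aborting". It only contrasts the old break with the proposed continue for remote tasks, and says nothing about whether this is the real cause of the hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the kernel's real types. */
enum abort_policy {
	POLICY_BREAK,		/* pre-patch: leave the loop on the first task */
	POLICY_SKIP_REMOTE,	/* RFC: skip only tasks sitting on another rq */
};

struct toy_task {
	int task_cpu;		/* CPU whose rq the task currently belongs to */
};

/*
 * Walk @tasks in queue order on behalf of @cpu and return the index of the
 * first task this CPU may consume, or -1 if it finds none.
 */
static int toy_consume(const struct toy_task *tasks, size_t nr, int cpu,
		       bool aborting, bool is_bypass_dsq,
		       enum abort_policy policy)
{
	assert(tasks || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		bool local = tasks[i].task_cpu == cpu;

		if (aborting && !is_bypass_dsq) {
			if (policy == POLICY_BREAK)
				break;		/* nothing is consumed at all */
			if (!local)
				continue;	/* skip remote, keep scanning */
		}

		/*
		 * Simplification: when not aborting, a remote task is
		 * treated as consumable without modelling migration.
		 */
		return (int)i;
	}
	return -1;
}
```

With a queue ordered {cpu1, cpu0, cpu1}, CPU 0 finds nothing under POLICY_BREAK while aborting, but reaches its own task at index 1 under POLICY_SKIP_REMOTE. A CPU with no local task on the queue walks every entry before giving up, which is the lock-hold cost raised in the RFC notes.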
^ permalink raw reply related [flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
2026-05-07 13:56 [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting Christian Loehle
@ 2026-05-08 14:14 ` Andrea Righi
2026-05-08 15:45 ` Christian Loehle
2026-05-08 15:28 ` Tejun Heo
1 sibling, 1 reply; 6+ messages in thread
From: Andrea Righi @ 2026-05-08 14:14 UTC (permalink / raw)
To: Christian Loehle; +Cc: sched-ext, linux-kernel, tj, void, changwoo

Hi Christian,

On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> When aborting, consume_dispatch_q() breaks out of the task iteration
> loop entirely for non-bypass DSQs. This prevents CPUs from consuming
> even their own tasks (where rq == task_rq) from any DSQ.
>
> This causes a deadlock during CPU hotplug:
>
> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
>    setting sch->aborting and queuing the disable_work on the helper
>    kthread.
>
> 2. The helper kthread (and other tasks) are stuck on the global or
>    user DSQs because bypass mode hasn't been entered yet.
>
> 3. No CPU can consume these tasks due to the aborting break, so the
>    helper never runs scx_root_disable() -> scx_bypass().
>
> 4. The cpuhp thread is stuck in balance_hotplug_wait() because the
>    dying CPU's rq never drains.
>
> Tasks on user DSQs are equally affected: BPF schedulers can dispatch
> RCU and other critical kthreads to user DSQs, causing RCU stalls when
> those tasks become unconsumable.
>
> The aborting check was added to prevent live-locks from the remote task
> migration path (consume_remote_task() -> goto retry), but also avoid
> holding the dsq->lock for too long.
>
> Change the break to skip only remote tasks via continue, allowing each
> CPU to still consume tasks already on its own rq. This unblocks the
> helper kthread, lets bypass mode activate, and allows both hotplug and
> RCU grace periods to complete.

Have you been able to reproduce this stall condition?

When the kernel forces bypass, scx_bypass() explicitly walks every CPU's
runnable_list and cycles tasks through DEQUEUE_SAVE | DEQUEUE_MOVE so
dispatching stops depending on BPF.

On CPU hotplug the helper kthread (and all the other critical kthreads) should
be also in the runnable_list, so they should be moved to SCX_DSQ_BYPASS and
consume_dispatch_q() should be able to consume them.

Maybe the problem is that in do_enqueue_task() we keep tasks on the local DSQ
when !scx_rq_online(rq), instead we should prioritize the bypass condition.

Does something like the following make sense to you?

Thanks,
-Andrea

 kernel/sched/ext.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7ac7d10a41bef..277110d950c30 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1901,6 +1901,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	p->scx.flags &= ~SCX_TASK_IMMED;
 
+	/*
+	 * Check bypass before testing the rq online state: bypass mode stops
+	 * processing local DSQs, so tasks should be routed through
+	 * SCX_DSQ_BYPASS rather than dispatched to the local DSQ during CPU
+	 * hotplug events.
+	 */
+	if (scx_bypassing(sch, cpu_of(rq))) {
+		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
+		goto bypass;
+	}
+
 	/*
 	 * If !scx_rq_online(), we already told the BPF scheduler that the CPU
 	 * is offline and are just running the hotplug path. Don't bother the
@@ -1909,11 +1920,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	if (!scx_rq_online(rq))
 		goto local;
 
-	if (scx_bypassing(sch, cpu_of(rq))) {
-		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
-		goto bypass;
-	}
-
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;
 

^ permalink raw reply related [flat|nested] 6+ messages in thread
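
The effect of swapping the two checks in do_enqueue_task() can be illustrated with a toy routing function. This is a sketch only: `toy_target`, `route_online_first`, `route_bypass_first` and `toy_count_disagreements` are made-up names, the two booleans stand in for scx_rq_online(rq) and scx_bypassing(sch, cpu_of(rq)), and everything after the two checks is collapsed into a single "other" outcome.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes; the kernel uses goto labels, not an enum. */
enum toy_target {
	TOY_LOCAL,	/* stands in for "goto local" */
	TOY_BYPASS,	/* stands in for "goto bypass" */
	TOY_OTHER,	/* everything after the two checks */
};

/* Order before the diff: the rq-online test runs first. */
static enum toy_target route_online_first(bool rq_online, bool bypassing)
{
	if (!rq_online)
		return TOY_LOCAL;
	if (bypassing)
		return TOY_BYPASS;
	return TOY_OTHER;
}

/* Order proposed by the diff: the bypass test runs first. */
static enum toy_target route_bypass_first(bool rq_online, bool bypassing)
{
	if (bypassing)
		return TOY_BYPASS;
	if (!rq_online)
		return TOY_LOCAL;
	return TOY_OTHER;
}

/* Count the input combinations on which the two orders disagree. */
static int toy_count_disagreements(void)
{
	int n = 0;

	for (int online = 0; online <= 1; online++) {
		for (int bypass = 0; bypass <= 1; bypass++) {
			if (route_online_first(online, bypass) !=
			    route_bypass_first(online, bypass)) {
				/* Only an offline rq during bypass differs. */
				assert(!online && bypass);
				n++;
			}
		}
	}
	return n;
}
```

In this model the two orders differ on exactly one input, an offline rq while bypassing: the old order keeps the task on the local DSQ, the new one sends it to the bypass DSQ.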

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
2026-05-08 14:14 ` Andrea Righi
@ 2026-05-08 15:45 ` Christian Loehle
0 siblings, 0 replies; 6+ messages in thread
From: Christian Loehle @ 2026-05-08 15:45 UTC (permalink / raw)
To: Andrea Righi; +Cc: sched-ext, linux-kernel, tj, void, changwoo

On 5/8/26 15:14, Andrea Righi wrote:
> Hi Christian,
>
> On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
>> When aborting, consume_dispatch_q() breaks out of the task iteration
>> loop entirely for non-bypass DSQs. This prevents CPUs from consuming
>> even their own tasks (where rq == task_rq) from any DSQ.
>>
>> This causes a deadlock during CPU hotplug:
>>
>> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
>>    setting sch->aborting and queuing the disable_work on the helper
>>    kthread.
>>
>> 2. The helper kthread (and other tasks) are stuck on the global or
>>    user DSQs because bypass mode hasn't been entered yet.
>>
>> 3. No CPU can consume these tasks due to the aborting break, so the
>>    helper never runs scx_root_disable() -> scx_bypass().
>>
>> 4. The cpuhp thread is stuck in balance_hotplug_wait() because the
>>    dying CPU's rq never drains.
>>
>> Tasks on user DSQs are equally affected: BPF schedulers can dispatch
>> RCU and other critical kthreads to user DSQs, causing RCU stalls when
>> those tasks become unconsumable.
>>
>> The aborting check was added to prevent live-locks from the remote task
>> migration path (consume_remote_task() -> goto retry), but also avoid
>> holding the dsq->lock for too long.
>>
>> Change the break to skip only remote tasks via continue, allowing each
>> CPU to still consume tasks already on its own rq. This unblocks the
>> helper kthread, lets bypass mode activate, and allows both hotplug and
>> RCU grace periods to complete.
>
> Have you been able to reproduce this stall condition?

Yes, the hotplug selftest reproduces this for me occasionally, I guess with
a 100-iteration loop around the 4 test cases it's up to 100%.

>
> When the kernel forces bypass, scx_bypass() explicitly walks every CPU's
> runnable_list and cycles tasks through DEQUEUE_SAVE | DEQUEUE_MOVE so
> dispatching stops depending on BPF.
>
> On CPU hotplug the helper kthread (and all the other critical kthreads) should
> be also in the runnable_list, so they should be moved to SCX_DSQ_BYPASS and
> consume_dispatch_q() should be able to consume them.
>
> Maybe the problem is that in do_enqueue_task() we keep tasks on the local DSQ
> when !scx_rq_online(rq), instead we should prioritize the bypass condition.
>
> Does something like the following make sense to you?
>
> Thanks,
> -Andrea
>
>  kernel/sched/ext.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 7ac7d10a41bef..277110d950c30 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1901,6 +1901,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 */
>  	p->scx.flags &= ~SCX_TASK_IMMED;
>  
> +	/*
> +	 * Check bypass before testing the rq online state: bypass mode stops
> +	 * processing local DSQs, so tasks should be routed through
> +	 * SCX_DSQ_BYPASS rather than dispatched to the local DSQ during CPU
> +	 * hotplug events.
> +	 */
> +	if (scx_bypassing(sch, cpu_of(rq))) {
> +		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> +		goto bypass;
> +	}
> +
>  	/*
>  	 * If !scx_rq_online(), we already told the BPF scheduler that the CPU
>  	 * is offline and are just running the hotplug path. Don't bother the
> @@ -1909,11 +1920,6 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	if (!scx_rq_online(rq))
>  		goto local;
>  
> -	if (scx_bypassing(sch, cpu_of(rq))) {
> -		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> -		goto bypass;
> -	}
> -
>  	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
>  		goto direct;
>
>

Unfortunately that also locks up, let me go have another look.

^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
2026-05-07 13:56 [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting Christian Loehle
2026-05-08 14:14 ` Andrea Righi
@ 2026-05-08 15:28 ` Tejun Heo
2026-05-08 15:47 ` Andrea Righi
1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2026-05-08 15:28 UTC (permalink / raw)
To: Christian Loehle; +Cc: sched-ext, linux-kernel, void, arighi, changwoo

Hello,

On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
>    setting sch->aborting and queuing the disable_work on the helper
>    kthread.
>
> 2. The helper kthread (and other tasks) are stuck on the global or
>    user DSQs because bypass mode hasn't been entered yet.

The helper thread runs RT class, so it doesn't go through SCX at all. Can
you try Andrea's patch?

> RFC:
> I guess this reintroduces the live-lock of a BPF scheduler having a
> highly contended DSQ with a lot of tasks and the outer loop holding
> dsq->lock and therefore it still taking too long for the bypass to
> activate, is there a better way?
> I also couldn't trigger a lockup through that, did I just not have
> the right platform (e.g. 2x Intel 8480c). Should we add a selftest
> for this too, then?

Dual Sapphire Rapids is where the problem was initially observed and I could
also reproduce on dual socket Zen 2 too. SPRs are way more susceptible tho.
I *think* I was running scx_simple with some mixture of saturating
stress-ng. It wasn't that difficult to reproduce. We should probably
document the repro somewhere. I'm not sure selftests is a good place to host
this sort of repros.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
2026-05-08 15:28 ` Tejun Heo
@ 2026-05-08 15:47 ` Andrea Righi
2026-05-08 17:59 ` Tejun Heo
0 siblings, 1 reply; 6+ messages in thread
From: Andrea Righi @ 2026-05-08 15:47 UTC (permalink / raw)
To: Tejun Heo; +Cc: Christian Loehle, sched-ext, linux-kernel, void, changwoo

Hi Tejun,

On Fri, May 08, 2026 at 05:28:29AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, May 07, 2026 at 02:56:42PM +0100, Christian Loehle wrote:
> > 1. The BPF scheduler's cpu_offline callback calls scx_bpf_exit(),
> >    setting sch->aborting and queuing the disable_work on the helper
> >    kthread.
> >
> > 2. The helper kthread (and other tasks) are stuck on the global or
> >    user DSQs because bypass mode hasn't been entered yet.
>
> The helper thread runs RT class, so it doesn't go through SCX at all. Can
> you try Andrea's patch?
>
> > RFC:
> > I guess this reintroduces the live-lock of a BPF scheduler having a
> > highly contended DSQ with a lot of tasks and the outer loop holding
> > dsq->lock and therefore it still taking too long for the bypass to
> > activate, is there a better way?
> > I also couldn't trigger a lockup through that, did I just not have
> > the right platform (e.g. 2x Intel 8480c). Should we add a selftest
> > for this too, then?
>
> Dual Sapphire Rapids is where the problem was initially observed and I could
> also reproduce on dual socket Zen 2 too. SPRs are way more susceptible tho.
> I *think* I was running scx_simple with some mixture of saturating
> stress-ng. It wasn't that difficult to reproduce. We should probably
> document the repro somewhere. I'm not sure selftests is a good place to host
> this sort of repros.

There are a few selftests that use stress-ng in tools/testing/selftests, maybe
we can put a script there calling stress-ng, if present, and a sched similar to
scx_simple and if stress-ng isn't present, skip the test. Do you remember the
stress-ng command you were using? Probably we can even reproduce the issue
adding something to the C part of the scheduler that mimics what stress-ng is
doing.

Thanks,
-Andrea

^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: [RFC][PATCH] sched_ext: Allow consuming local tasks when aborting
2026-05-08 15:47 ` Andrea Righi
@ 2026-05-08 17:59 ` Tejun Heo
0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2026-05-08 17:59 UTC (permalink / raw)
To: Andrea Righi
Cc: Christian Loehle, sched-ext, linux-kernel, void, changwoo, Emil Tsalapatis

Hello, Andrea.

On Fri, May 08, 2026 at 05:47:36PM +0200, Andrea Righi wrote:
> There are few selftests that use stress-ng in tools/testing/selftests, maybe we
> can put a script there calling stress-ng, if present, and a sched similar to
> scx_simple and if stress-ng isn't present, skip the test. Do you remember the
> stress-ng command you were using? Probably we can even reproduce the issue
> adding something to the C part of the scheduler that mimics what stress-ng is
> doing.

Dug it out of shell history on the test box. It was the workload from
b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"):

  stress-ng --race-sched 1024
  stress-ng --workload 80 --workload-threads 10

Run both in parallel on a 2x EPYC 7642 while flipping a SCX scheduler on
and off in a loop and the live-lock reproduced reliably.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 6+ messages in thread