* [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues
@ 2026-03-10 1:16 Tejun Heo
2026-03-10 1:16 ` [PATCH 1/5] sched_ext: Fix sub_detach op check to test the parent's ops Tejun Heo
` (6 more replies)
0 siblings, 7 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10 1:16 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
Tejun Heo
Hello,
Cheng-Yang reported a lockdep circular dependency between scx_sched_lock and
rq->__lock. scx_bypass() and sysrq_handle_sched_ext_dump() take
scx_sched_lock -> rq lock, while scx_claim_exit() (reachable from many paths
with rq lock held) takes rq -> scx_sched_lock. In addition, scx_disable()
directly calling kthread_queue_work() under scx_sched_lock creates another
chain through worker->lock -> pi_lock -> rq->__lock.
This patchset fixes these issues:
1. Fix wrong sub_detach op check.
2. Add scx_dump_lock and dump_disabled to decouple dump from scx_sched_lock.
3. Always bounce scx_disable() through irq_work to avoid lock nesting.
4. Flip scx_bypass() lock order and drop scx_sched_lock from sysrq dump.
5. Reject sub-sched attachment to a disabled parent.
Tested on three machines (16-CPU QEMU, 192-CPU dual-socket EPYC, AMD Ryzen)
with lockdep trigger tests and an 11-test stress suite covering
attach/detach, nesting, reverse teardown, rapid cycling, error injection,
SysRq-D/S dump/exit, and combined stress. Lockdep triggered on baseline,
clean after patches.
Based on sched_ext/for-7.1 (b8840942644c).
1. sched_ext: Fix sub_detach op check to test the parent's ops
2. sched_ext: Add scx_dump_lock and dump_disabled
3. sched_ext: Always bounce scx_disable() through irq_work
4. sched_ext: Fix scx_sched_lock / rq lock ordering
5. sched_ext: Reject sub-sched attachment to a disabled parent
kernel/sched/ext.c | 59 +++++++++++++++++++++++++++++++++++----------
kernel/sched/ext_internal.h | 3 ++-
2 files changed, 48 insertions(+), 14 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/5] sched_ext: Fix sub_detach op check to test the parent's ops
From: Tejun Heo @ 2026-03-10 1:16 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
Tejun Heo
sub_detach is the parent's op called to notify the parent that a child
is detaching. Test parent->ops.sub_detach instead of sch->ops.sub_detach.
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 07476355bfd5..8bf4b51ad0e5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5438,7 +5438,7 @@ static void scx_sub_disable(struct scx_sched *sch)
*/
wake_up_all(&scx_unlink_waitq);
- if (sch->ops.sub_detach && sch->sub_attached) {
+ if (parent->ops.sub_detach && sch->sub_attached) {
struct scx_sub_detach_args sub_detach_args = {
.ops = &sch->ops,
.cgroup_path = sch->cgrp_path,
--
2.53.0
* [PATCH 2/5] sched_ext: Add scx_dump_lock and dump_disabled
From: Tejun Heo @ 2026-03-10 1:16 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
Tejun Heo
Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that
debug dumping can be safely disabled during sched teardown without
relying on scx_sched_lock. This is a prep for the next patch which
decouples the sysrq dump path from scx_sched_lock to resolve a lock
ordering issue.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 25 ++++++++++++++++++++++---
kernel/sched/ext_internal.h | 1 +
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8bf4b51ad0e5..d76a47b782a7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -136,6 +136,8 @@ static DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock);
static struct scx_bstr_buf scx_exit_bstr_buf;
/* ops debug dump */
+static DEFINE_RAW_SPINLOCK(scx_dump_lock);
+
struct scx_dump_data {
s32 cpu;
bool first;
@@ -5279,6 +5281,17 @@ static void scx_unlink_sched(struct scx_sched *sch)
refresh_watchdog();
}
+/*
+ * Called to disable future dumps and wait for in-progress one while disabling
+ * @sch. Once @sch becomes empty during disable, there's no point in dumping it.
+ * This prevents calling dump ops on a dead sch.
+ */
+static void scx_disable_dump(struct scx_sched *sch)
+{
+ guard(raw_spinlock_irqsave)(&scx_dump_lock);
+ sch->dump_disabled = true;
+}
+
#ifdef CONFIG_EXT_SUB_SCHED
static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
@@ -5414,6 +5427,8 @@ static void scx_sub_disable(struct scx_sched *sch)
}
scx_task_iter_stop(&sti);
+ scx_disable_dump(sch);
+
scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
@@ -5525,6 +5540,8 @@ static void scx_root_disable(struct scx_sched *sch)
}
scx_task_iter_stop(&sti);
+ scx_disable_dump(sch);
+
scx_cgroup_lock();
set_cgroup_sched(sch_cgroup(sch), NULL);
scx_cgroup_unlock();
@@ -5680,7 +5697,7 @@ static __printf(2, 3) void dump_line(struct seq_buf *s, const char *fmt, ...)
#ifdef CONFIG_TRACEPOINTS
if (trace_sched_ext_dump_enabled()) {
- /* protected by scx_dump_state()::dump_lock */
+ /* protected by scx_dump_lock */
static char line_buf[SCX_EXIT_MSG_LEN];
va_start(args, fmt);
@@ -5842,7 +5859,6 @@ static void scx_dump_task(struct scx_sched *sch,
static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
size_t dump_len, bool dump_all_tasks)
{
- static DEFINE_RAW_SPINLOCK(dump_lock);
static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n";
struct scx_dump_ctx dctx = {
.kind = ei->kind,
@@ -5856,7 +5872,10 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
char *buf;
int cpu;
- guard(raw_spinlock_irqsave)(&dump_lock);
+ guard(raw_spinlock_irqsave)(&scx_dump_lock);
+
+ if (sch->dump_disabled)
+ return;
seq_buf_init(&s, ei->dump, dump_len);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index bec4d22890b0..3623de2c30a1 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1003,6 +1003,7 @@ struct scx_sched {
atomic_t bypass_dsp_enable_depth;
bool aborting;
+ bool dump_disabled; /* protected by scx_dump_lock */
u32 dsp_max_batch;
s32 level;
--
2.53.0
* [PATCH 3/5] sched_ext: Always bounce scx_disable() through irq_work
From: Tejun Heo @ 2026-03-10 1:16 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
Tejun Heo
scx_disable() directly called kthread_queue_work(), which can acquire
worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to
call while holding locks that conflict with this chain - in particular,
scx_claim_exit() calls scx_disable() for each descendant while holding
scx_sched_lock, which nests inside rq->__lock in scx_bypass().
The error path (scx_vexit()) was already bouncing through irq_work to
avoid this issue. Generalize the pattern to all scx_disable() calls by
always going through irq_work. irq_work_queue() is lockless and safe to
call from any context, and the actual kthread_queue_work() call happens
in the irq_work handler outside any locks.
Rename error_irq_work to disable_irq_work to reflect the broader usage.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 12 ++++++------
kernel/sched/ext_internal.h | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d76a47b782a7..cf28a8f62ad0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4498,7 +4498,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
struct scx_dispatch_q *dsq;
int cpu, node;
- irq_work_sync(&sch->error_irq_work);
+ irq_work_sync(&sch->disable_irq_work);
kthread_destroy_worker(sch->helper);
timer_shutdown_sync(&sch->bypass_lb_timer);
@@ -5679,7 +5679,7 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
{
guard(preempt)();
if (scx_claim_exit(sch, kind))
- kthread_queue_work(sch->helper, &sch->disable_work);
+ irq_work_queue(&sch->disable_irq_work);
}
static void dump_newline(struct seq_buf *s)
@@ -6012,9 +6012,9 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
trunc_marker, sizeof(trunc_marker));
}
-static void scx_error_irq_workfn(struct irq_work *irq_work)
+static void scx_disable_irq_workfn(struct irq_work *irq_work)
{
- struct scx_sched *sch = container_of(irq_work, struct scx_sched, error_irq_work);
+ struct scx_sched *sch = container_of(irq_work, struct scx_sched, disable_irq_work);
struct scx_exit_info *ei = sch->exit_info;
if (ei->kind >= SCX_EXIT_ERROR)
@@ -6048,7 +6048,7 @@ static bool scx_vexit(struct scx_sched *sch,
ei->kind = kind;
ei->reason = scx_exit_reason(ei->kind);
- irq_work_queue(&sch->error_irq_work);
+ irq_work_queue(&sch->disable_irq_work);
return true;
}
@@ -6184,7 +6184,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
sch->slice_dfl = SCX_SLICE_DFL;
atomic_set(&sch->exit_kind, SCX_EXIT_NONE);
- init_irq_work(&sch->error_irq_work, scx_error_irq_workfn);
+ init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn);
kthread_init_work(&sch->disable_work, scx_disable_workfn);
timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0);
sch->ops = *ops;
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 3623de2c30a1..c78dadaadab8 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1042,7 +1042,7 @@ struct scx_sched {
struct kobject kobj;
struct kthread_worker *helper;
- struct irq_work error_irq_work;
+ struct irq_work disable_irq_work;
struct kthread_work disable_work;
struct timer_list bypass_lb_timer;
struct rcu_work rcu_work;
--
2.53.0
* [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
From: Tejun Heo @ 2026-03-10 1:16 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
Tejun Heo
There are two sites that nest rq lock inside scx_sched_lock:
- scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
per-cpu bypass flags and re-enqueue tasks.
- sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
scheds, scx_dump_state() then takes rq lock per CPU for dump.
And scx_claim_exit() takes scx_sched_lock to propagate exits to
descendants. It can be reached from scx_tick(), BPF kfuncs, and many
other paths with rq lock already held, creating the reverse ordering:
rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
Fix by flipping scx_bypass() to take rq lock first, and dropping
scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
already RCU-traversable and scx_dump_lock now prevents dumping a dead
sched. This makes the consistent ordering rq lock -> scx_sched_lock.
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index cf28a8f62ad0..677c1c6c64bf 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5097,8 +5097,8 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
struct rq *rq = cpu_rq(cpu);
struct task_struct *p, *n;
- raw_spin_lock(&scx_sched_lock);
raw_spin_rq_lock(rq);
+ raw_spin_lock(&scx_sched_lock);
scx_for_each_descendant_pre(pos, sch) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
@@ -7240,8 +7240,6 @@ static void sysrq_handle_sched_ext_dump(u8 key)
struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
struct scx_sched *sch;
- guard(raw_spinlock_irqsave)(&scx_sched_lock);
-
list_for_each_entry_rcu(sch, &scx_sched_all, all)
scx_dump_state(sch, &ei, 0, false);
}
--
2.53.0
* [PATCH 5/5] sched_ext: Reject sub-sched attachment to a disabled parent
From: Tejun Heo @ 2026-03-10 1:16 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
Tejun Heo
scx_claim_exit() propagates exits to descendants under scx_sched_lock.
A sub-sched being attached concurrently could be missed if it links
after the propagation. Check the parent's exit_kind in scx_link_sched()
under scx_sched_lock to interlock against scx_claim_exit() - either the
parent sees the child in its iteration or the child sees the parent's
non-NONE exit_kind and fails attachment.
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 677c1c6c64bf..5d31e65e5596 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5247,6 +5247,17 @@ static s32 scx_link_sched(struct scx_sched *sch)
s32 ret;
if (parent) {
+ /*
+ * scx_claim_exit() propagates exit_kind transition to
+ * its sub-scheds while holding scx_sched_lock - either
+ * we can see the parent's non-NONE exit_kind or the
+ * parent can shoot us down.
+ */
+ if (atomic_read(&parent->exit_kind) != SCX_EXIT_NONE) {
+ scx_error(sch, "parent disabled");
+ return -ENOENT;
+ }
+
ret = rhashtable_lookup_insert_fast(&scx_sched_hash,
&sch->hash_node, scx_sched_hash_params);
if (ret) {
@@ -5638,6 +5649,11 @@ static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind)
* serialized, running them in separate threads allows parallelizing
* ops.exit(), which can take arbitrarily long prolonging bypass mode.
*
+ * To guarantee forward progress, this propagation must be in-line so
+ * that ->aborting is synchronously asserted for all sub-scheds. The
+ * propagation is also the interlocking point against sub-sched
+ * attachment. See scx_link_sched().
+ *
* This doesn't cause recursions as propagation only takes place for
* non-propagation exits.
*/
--
2.53.0
* Re: [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
From: Cheng-Yang Chou @ 2026-03-10 5:18 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Andrea Righi, Changwoo Min, sched-ext,
Emil Tsalapatis, linux-kernel, jserv
Hi Tejun,
On Mon, Mar 09, 2026 at 03:16:52PM -1000, Tejun Heo wrote:
> Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
Just a little nit: this link is invalid.
It should use the modern lore.kernel.org domain instead:
Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
--
Thanks,
Cheng-Yang
* Re: [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
From: Andrea Righi @ 2026-03-10 6:39 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
Cheng-Yang Chou, linux-kernel
Hi Tejun,
On Mon, Mar 09, 2026 at 03:16:52PM -1000, Tejun Heo wrote:
> There are two sites that nest rq lock inside scx_sched_lock:
>
> - scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
> per-cpu bypass flags and re-enqueue tasks.
>
> - sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
> scheds, scx_dump_state() then takes rq lock per CPU for dump.
>
> And scx_claim_exit() takes scx_sched_lock to propagate exits to
> descendants. It can be reached from scx_tick(), BPF kfuncs, and many
> other paths with rq lock already held, creating the reverse ordering:
>
> rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
>
> Fix by flipping scx_bypass() to take rq lock first, and dropping
> scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
> already RCU-traversable and scx_dump_lock now prevents dumping a dead
> sched. This makes the consistent ordering rq lock -> scx_sched_lock.
>
> Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
> Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index cf28a8f62ad0..677c1c6c64bf 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5097,8 +5097,8 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
> struct rq *rq = cpu_rq(cpu);
> struct task_struct *p, *n;
>
> - raw_spin_lock(&scx_sched_lock);
> raw_spin_rq_lock(rq);
> + raw_spin_lock(&scx_sched_lock);
>
> scx_for_each_descendant_pre(pos, sch) {
> struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
> @@ -7240,8 +7240,6 @@ static void sysrq_handle_sched_ext_dump(u8 key)
> struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
> struct scx_sched *sch;
>
> - guard(raw_spinlock_irqsave)(&scx_sched_lock);
> -
Don't we need RCU protection here?
> list_for_each_entry_rcu(sch, &scx_sched_all, all)
> scx_dump_state(sch, &ei, 0, false);
> }
Thanks,
-Andrea
* Re: [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
From: Andrea Righi @ 2026-03-10 6:47 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
Cheng-Yang Chou, linux-kernel
On Tue, Mar 10, 2026 at 07:39:10AM +0100, Andrea Righi wrote:
> Hi Tejun,
>
> On Mon, Mar 09, 2026 at 03:16:52PM -1000, Tejun Heo wrote:
> > There are two sites that nest rq lock inside scx_sched_lock:
> >
> > - scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
> > per-cpu bypass flags and re-enqueue tasks.
> >
> > - sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
> > scheds, scx_dump_state() then takes rq lock per CPU for dump.
> >
> > And scx_claim_exit() takes scx_sched_lock to propagate exits to
> > descendants. It can be reached from scx_tick(), BPF kfuncs, and many
> > other paths with rq lock already held, creating the reverse ordering:
> >
> > rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
> >
> > Fix by flipping scx_bypass() to take rq lock first, and dropping
> > scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
> > already RCU-traversable and scx_dump_lock now prevents dumping a dead
> > sched. This makes the consistent ordering rq lock -> scx_sched_lock.
> >
> > Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> > Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
> > Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> > kernel/sched/ext.c | 4 +---
> > 1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index cf28a8f62ad0..677c1c6c64bf 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -5097,8 +5097,8 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
> > struct rq *rq = cpu_rq(cpu);
> > struct task_struct *p, *n;
> >
> > - raw_spin_lock(&scx_sched_lock);
> > raw_spin_rq_lock(rq);
> > + raw_spin_lock(&scx_sched_lock);
> >
> > scx_for_each_descendant_pre(pos, sch) {
> > struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
> > @@ -7240,8 +7240,6 @@ static void sysrq_handle_sched_ext_dump(u8 key)
> > struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
> > struct scx_sched *sch;
> >
> > - guard(raw_spinlock_irqsave)(&scx_sched_lock);
> > -
>
> Don't we need RCU protection here?
Nevermind, __handle_sysrq() is already doing rcu_read_lock/unlock(), so
this looks good.
Sorry for the noise,
-Andrea
>
> > list_for_each_entry_rcu(sch, &scx_sched_all, all)
> > scx_dump_state(sch, &ei, 0, false);
> > }
>
> Thanks,
> -Andrea
* Re: [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues
From: Andrea Righi @ 2026-03-10 6:50 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
Cheng-Yang Chou, linux-kernel
On Mon, Mar 09, 2026 at 03:16:48PM -1000, Tejun Heo wrote:
> Hello,
>
> Cheng-Yang reported a lockdep circular dependency between scx_sched_lock and
> rq->__lock. scx_bypass() and sysrq_handle_sched_ext_dump() take
> scx_sched_lock -> rq lock, while scx_claim_exit() (reachable from many paths
> with rq lock held) takes rq -> scx_sched_lock. In addition, scx_disable()
> directly calling kthread_queue_work() under scx_sched_lock creates another
> chain through worker->lock -> pi_lock -> rq->__lock.
>
> This patchset fixes these issues:
>
> 1. Fix wrong sub_detach op check.
> 2. Add scx_dump_lock and dump_disabled to decouple dump from scx_sched_lock.
> 3. Always bounce scx_disable() through irq_work to avoid lock nesting.
> 4. Flip scx_bypass() lock order and drop scx_sched_lock from sysrq dump.
> 5. Reject sub-sched attachment to a disabled parent.
>
> Tested on three machines (16-CPU QEMU, 192-CPU dual-socket EPYC, AMD Ryzen)
> with lockdep trigger tests and an 11-test stress suite covering
> attach/detach, nesting, reverse teardown, rapid cycling, error injection,
> SysRq-D/S dump/exit, and combined stress. Lockdep triggered on baseline,
> clean after patches.
With the comment from Cheng-Yang about fixing the link in patch 4/5.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
* Re: [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues
From: Tejun Heo @ 2026-03-10 17:17 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel
Applied to sched_ext/for-7.1 with Andrea's Reviewed-by tags and
corrected Link tag in patch 4.
Thanks.
--
tejun