public inbox for linux-kernel@vger.kernel.org
* [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues
@ 2026-03-10  1:16 Tejun Heo
  2026-03-10  1:16 ` [PATCH 1/5] sched_ext: Fix sub_detach op check to test the parent's ops Tejun Heo
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10  1:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
	Tejun Heo

Hello,

Cheng-Yang reported a lockdep circular dependency between scx_sched_lock and
rq->__lock. scx_bypass() and sysrq_handle_sched_ext_dump() take
scx_sched_lock -> rq lock, while scx_claim_exit() (reachable from many paths
with rq lock held) takes rq -> scx_sched_lock. In addition, scx_disable()
directly calling kthread_queue_work() under scx_sched_lock creates another
chain through worker->lock -> pi_lock -> rq->__lock.

This patchset fixes these issues:

1. Fix wrong sub_detach op check.
2. Add scx_dump_lock and dump_disabled to decouple dump from scx_sched_lock.
3. Always bounce scx_disable() through irq_work to avoid lock nesting.
4. Flip scx_bypass() lock order and drop scx_sched_lock from sysrq dump.
5. Reject sub-sched attachment to a disabled parent.

Tested on three machines (16-CPU QEMU, 192-CPU dual-socket EPYC, AMD Ryzen)
with lockdep trigger tests and an 11-test stress suite covering
attach/detach, nesting, reverse teardown, rapid cycling, error injection,
SysRq-D/S dump/exit, and combined stress. Lockdep triggered on baseline,
clean after patches.

Based on sched_ext/for-7.1 (b8840942644c).

 1. sched_ext: Fix sub_detach op check to test the parent's ops
 2. sched_ext: Add scx_dump_lock and dump_disabled
 3. sched_ext: Always bounce scx_disable() through irq_work
 4. sched_ext: Fix scx_sched_lock / rq lock ordering
 5. sched_ext: Reject sub-sched attachment to a disabled parent

 kernel/sched/ext.c          | 59 +++++++++++++++++++++++++++++++++++----------
 kernel/sched/ext_internal.h |  3 ++-
 2 files changed, 48 insertions(+), 14 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/5] sched_ext: Fix sub_detach op check to test the parent's ops
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
@ 2026-03-10  1:16 ` Tejun Heo
  2026-03-10  1:16 ` [PATCH 2/5] sched_ext: Add scx_dump_lock and dump_disabled Tejun Heo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10  1:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
	Tejun Heo

sub_detach is the parent's op called to notify the parent that a child
is detaching. Test parent->ops.sub_detach instead of sch->ops.sub_detach.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 07476355bfd5..8bf4b51ad0e5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5438,7 +5438,7 @@ static void scx_sub_disable(struct scx_sched *sch)
 	 */
 	wake_up_all(&scx_unlink_waitq);
 
-	if (sch->ops.sub_detach && sch->sub_attached) {
+	if (parent->ops.sub_detach && sch->sub_attached) {
 		struct scx_sub_detach_args sub_detach_args = {
 			.ops = &sch->ops,
 			.cgroup_path = sch->cgrp_path,
-- 
2.53.0



* [PATCH 2/5] sched_ext: Add scx_dump_lock and dump_disabled
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
  2026-03-10  1:16 ` [PATCH 1/5] sched_ext: Fix sub_detach op check to test the parent's ops Tejun Heo
@ 2026-03-10  1:16 ` Tejun Heo
  2026-03-10  1:16 ` [PATCH 3/5] sched_ext: Always bounce scx_disable() through irq_work Tejun Heo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10  1:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
	Tejun Heo

Add a dedicated scx_dump_lock and per-sched dump_disabled flag so that
debug dumping can be safely disabled during sched teardown without
relying on scx_sched_lock. This is a prep for the next patch which
decouples the sysrq dump path from scx_sched_lock to resolve a lock
ordering issue.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          | 25 ++++++++++++++++++++++---
 kernel/sched/ext_internal.h |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8bf4b51ad0e5..d76a47b782a7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -136,6 +136,8 @@ static DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock);
 static struct scx_bstr_buf scx_exit_bstr_buf;
 
 /* ops debug dump */
+static DEFINE_RAW_SPINLOCK(scx_dump_lock);
+
 struct scx_dump_data {
 	s32			cpu;
 	bool			first;
@@ -5279,6 +5281,17 @@ static void scx_unlink_sched(struct scx_sched *sch)
 	refresh_watchdog();
 }
 
+/*
+ * Called to disable future dumps and wait for any in-progress dump while
+ * disabling @sch. Once @sch becomes empty during disable, there's no point
+ * in dumping it, and this prevents calling dump ops on a dead sch.
+ */
+static void scx_disable_dump(struct scx_sched *sch)
+{
+	guard(raw_spinlock_irqsave)(&scx_dump_lock);
+	sch->dump_disabled = true;
+}
+
 #ifdef CONFIG_EXT_SUB_SCHED
 static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
 
@@ -5414,6 +5427,8 @@ static void scx_sub_disable(struct scx_sched *sch)
 	}
 	scx_task_iter_stop(&sti);
 
+	scx_disable_dump(sch);
+
 	scx_cgroup_unlock();
 	percpu_up_write(&scx_fork_rwsem);
 
@@ -5525,6 +5540,8 @@ static void scx_root_disable(struct scx_sched *sch)
 	}
 	scx_task_iter_stop(&sti);
 
+	scx_disable_dump(sch);
+
 	scx_cgroup_lock();
 	set_cgroup_sched(sch_cgroup(sch), NULL);
 	scx_cgroup_unlock();
@@ -5680,7 +5697,7 @@ static __printf(2, 3) void dump_line(struct seq_buf *s, const char *fmt, ...)
 
 #ifdef CONFIG_TRACEPOINTS
 	if (trace_sched_ext_dump_enabled()) {
-		/* protected by scx_dump_state()::dump_lock */
+		/* protected by scx_dump_lock */
 		static char line_buf[SCX_EXIT_MSG_LEN];
 
 		va_start(args, fmt);
@@ -5842,7 +5859,6 @@ static void scx_dump_task(struct scx_sched *sch,
 static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
 			   size_t dump_len, bool dump_all_tasks)
 {
-	static DEFINE_RAW_SPINLOCK(dump_lock);
 	static const char trunc_marker[] = "\n\n~~~~ TRUNCATED ~~~~\n";
 	struct scx_dump_ctx dctx = {
 		.kind = ei->kind,
@@ -5856,7 +5872,10 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
 	char *buf;
 	int cpu;
 
-	guard(raw_spinlock_irqsave)(&dump_lock);
+	guard(raw_spinlock_irqsave)(&scx_dump_lock);
+
+	if (sch->dump_disabled)
+		return;
 
 	seq_buf_init(&s, ei->dump, dump_len);
 
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index bec4d22890b0..3623de2c30a1 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1003,6 +1003,7 @@ struct scx_sched {
 	atomic_t		bypass_dsp_enable_depth;
 
 	bool			aborting;
+	bool			dump_disabled;	/* protected by scx_dump_lock */
 	u32			dsp_max_batch;
 	s32			level;
 
-- 
2.53.0



* [PATCH 3/5] sched_ext: Always bounce scx_disable() through irq_work
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
  2026-03-10  1:16 ` [PATCH 1/5] sched_ext: Fix sub_detach op check to test the parent's ops Tejun Heo
  2026-03-10  1:16 ` [PATCH 2/5] sched_ext: Add scx_dump_lock and dump_disabled Tejun Heo
@ 2026-03-10  1:16 ` Tejun Heo
  2026-03-10  1:16 ` [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering Tejun Heo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10  1:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
	Tejun Heo

scx_disable() directly called kthread_queue_work() which can acquire
worker->lock, pi_lock and rq->__lock. This made scx_disable() unsafe to
call while holding locks that conflict with this chain - in particular,
scx_claim_exit() calls scx_disable() for each descendant while holding
scx_sched_lock, which nests inside rq->__lock in scx_bypass().

The error path (scx_vexit()) was already bouncing through irq_work to
avoid this issue. Generalize the pattern to all scx_disable() calls by
always going through irq_work. irq_work_queue() is lockless and safe to
call from any context, and the actual kthread_queue_work() call happens
in the irq_work handler outside any locks.

Rename error_irq_work to disable_irq_work to reflect the broader usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          | 12 ++++++------
 kernel/sched/ext_internal.h |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d76a47b782a7..cf28a8f62ad0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4498,7 +4498,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 	struct scx_dispatch_q *dsq;
 	int cpu, node;
 
-	irq_work_sync(&sch->error_irq_work);
+	irq_work_sync(&sch->disable_irq_work);
 	kthread_destroy_worker(sch->helper);
 	timer_shutdown_sync(&sch->bypass_lb_timer);
 
@@ -5679,7 +5679,7 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
 {
 	guard(preempt)();
 	if (scx_claim_exit(sch, kind))
-		kthread_queue_work(sch->helper, &sch->disable_work);
+		irq_work_queue(&sch->disable_irq_work);
 }
 
 static void dump_newline(struct seq_buf *s)
@@ -6012,9 +6012,9 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
 		       trunc_marker, sizeof(trunc_marker));
 }
 
-static void scx_error_irq_workfn(struct irq_work *irq_work)
+static void scx_disable_irq_workfn(struct irq_work *irq_work)
 {
-	struct scx_sched *sch = container_of(irq_work, struct scx_sched, error_irq_work);
+	struct scx_sched *sch = container_of(irq_work, struct scx_sched, disable_irq_work);
 	struct scx_exit_info *ei = sch->exit_info;
 
 	if (ei->kind >= SCX_EXIT_ERROR)
@@ -6048,7 +6048,7 @@ static bool scx_vexit(struct scx_sched *sch,
 	ei->kind = kind;
 	ei->reason = scx_exit_reason(ei->kind);
 
-	irq_work_queue(&sch->error_irq_work);
+	irq_work_queue(&sch->disable_irq_work);
 	return true;
 }
 
@@ -6184,7 +6184,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
 
 	sch->slice_dfl = SCX_SLICE_DFL;
 	atomic_set(&sch->exit_kind, SCX_EXIT_NONE);
-	init_irq_work(&sch->error_irq_work, scx_error_irq_workfn);
+	init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn);
 	kthread_init_work(&sch->disable_work, scx_disable_workfn);
 	timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0);
 	sch->ops = *ops;
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 3623de2c30a1..c78dadaadab8 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1042,7 +1042,7 @@ struct scx_sched {
 	struct kobject		kobj;
 
 	struct kthread_worker	*helper;
-	struct irq_work		error_irq_work;
+	struct irq_work		disable_irq_work;
 	struct kthread_work	disable_work;
 	struct timer_list	bypass_lb_timer;
 	struct rcu_work		rcu_work;
-- 
2.53.0



* [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
                   ` (2 preceding siblings ...)
  2026-03-10  1:16 ` [PATCH 3/5] sched_ext: Always bounce scx_disable() through irq_work Tejun Heo
@ 2026-03-10  1:16 ` Tejun Heo
  2026-03-10  5:18   ` Cheng-Yang Chou
  2026-03-10  6:39   ` Andrea Righi
  2026-03-10  1:16 ` [PATCH 5/5] sched_ext: Reject sub-sched attachment to a disabled parent Tejun Heo
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10  1:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
	Tejun Heo

There are two sites that nest rq lock inside scx_sched_lock:

- scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
  per-cpu bypass flags and re-enqueue tasks.

- sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
  scheds, scx_dump_state() then takes rq lock per CPU for dump.

And scx_claim_exit() takes scx_sched_lock to propagate exits to
descendants. It can be reached from scx_tick(), BPF kfuncs, and many
other paths with rq lock already held, creating the reverse ordering:

  rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock

Fix by flipping scx_bypass() to take rq lock first, and dropping
scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
already RCU-traversable and scx_dump_lock now prevents dumping a dead
sched. This makes the consistent ordering rq lock -> scx_sched_lock.

Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index cf28a8f62ad0..677c1c6c64bf 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5097,8 +5097,8 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
 		struct rq *rq = cpu_rq(cpu);
 		struct task_struct *p, *n;
 
-		raw_spin_lock(&scx_sched_lock);
 		raw_spin_rq_lock(rq);
+		raw_spin_lock(&scx_sched_lock);
 
 		scx_for_each_descendant_pre(pos, sch) {
 			struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
@@ -7240,8 +7240,6 @@ static void sysrq_handle_sched_ext_dump(u8 key)
 	struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
 	struct scx_sched *sch;
 
-	guard(raw_spinlock_irqsave)(&scx_sched_lock);
-
 	list_for_each_entry_rcu(sch, &scx_sched_all, all)
 		scx_dump_state(sch, &ei, 0, false);
 }
-- 
2.53.0



* [PATCH 5/5] sched_ext: Reject sub-sched attachment to a disabled parent
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
                   ` (3 preceding siblings ...)
  2026-03-10  1:16 ` [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering Tejun Heo
@ 2026-03-10  1:16 ` Tejun Heo
  2026-03-10  6:50 ` [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Andrea Righi
  2026-03-10 17:17 ` Tejun Heo
  6 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10  1:16 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel,
	Tejun Heo

scx_claim_exit() propagates exits to descendants under scx_sched_lock.
A sub-sched being attached concurrently could be missed if it links
after the propagation. Check the parent's exit_kind in scx_link_sched()
under scx_sched_lock to interlock against scx_claim_exit() - either the
parent sees the child in its iteration or the child sees the parent's
non-NONE exit_kind and fails attachment.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 677c1c6c64bf..5d31e65e5596 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5247,6 +5247,17 @@ static s32 scx_link_sched(struct scx_sched *sch)
 		s32 ret;
 
 		if (parent) {
+			/*
+			 * scx_claim_exit() propagates exit_kind transition to
+			 * its sub-scheds while holding scx_sched_lock - either
+			 * we can see the parent's non-NONE exit_kind or the
+			 * parent can shoot us down.
+			 */
+			if (atomic_read(&parent->exit_kind) != SCX_EXIT_NONE) {
+				scx_error(sch, "parent disabled");
+				return -ENOENT;
+			}
+
 			ret = rhashtable_lookup_insert_fast(&scx_sched_hash,
 					&sch->hash_node, scx_sched_hash_params);
 			if (ret) {
@@ -5638,6 +5649,11 @@ static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind)
 	 * serialized, running them in separate threads allows parallelizing
 	 * ops.exit(), which can take arbitrarily long prolonging bypass mode.
 	 *
+	 * To guarantee forward progress, this propagation must be in-line so
+	 * that ->aborting is synchronously asserted for all sub-scheds. The
+	 * propagation is also the interlocking point against sub-sched
+	 * attachment. See scx_link_sched().
+	 *
 	 * This doesn't cause recursions as propagation only takes place for
 	 * non-propagation exits.
 	 */
-- 
2.53.0



* Re: [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
  2026-03-10  1:16 ` [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering Tejun Heo
@ 2026-03-10  5:18   ` Cheng-Yang Chou
  2026-03-10  6:39   ` Andrea Righi
  1 sibling, 0 replies; 11+ messages in thread
From: Cheng-Yang Chou @ 2026-03-10  5:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, sched-ext,
	Emil Tsalapatis, linux-kernel, jserv

Hi Tejun,

On Mon, Mar 09, 2026 at 03:16:52PM -1000, Tejun Heo wrote:
> Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com

Just a little nit: this link is invalid. It should
use the modern lore.kernel.org domain instead:

Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com

-- 
Thanks,
Cheng-Yang


* Re: [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
  2026-03-10  1:16 ` [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering Tejun Heo
  2026-03-10  5:18   ` Cheng-Yang Chou
@ 2026-03-10  6:39   ` Andrea Righi
  2026-03-10  6:47     ` Andrea Righi
  1 sibling, 1 reply; 11+ messages in thread
From: Andrea Righi @ 2026-03-10  6:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
	Cheng-Yang Chou, linux-kernel

Hi Tejun,

On Mon, Mar 09, 2026 at 03:16:52PM -1000, Tejun Heo wrote:
> There are two sites that nest rq lock inside scx_sched_lock:
> 
> - scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
>   per-cpu bypass flags and re-enqueue tasks.
> 
> - sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
>   scheds, scx_dump_state() then takes rq lock per CPU for dump.
> 
> And scx_claim_exit() takes scx_sched_lock to propagate exits to
> descendants. It can be reached from scx_tick(), BPF kfuncs, and many
> other paths with rq lock already held, creating the reverse ordering:
> 
>   rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
> 
> Fix by flipping scx_bypass() to take rq lock first, and dropping
> scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
> already RCU-traversable and scx_dump_lock now prevents dumping a dead
> sched. This makes the consistent ordering rq lock -> scx_sched_lock.
> 
> Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
> Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index cf28a8f62ad0..677c1c6c64bf 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5097,8 +5097,8 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
>  		struct rq *rq = cpu_rq(cpu);
>  		struct task_struct *p, *n;
>  
> -		raw_spin_lock(&scx_sched_lock);
>  		raw_spin_rq_lock(rq);
> +		raw_spin_lock(&scx_sched_lock);
>  
>  		scx_for_each_descendant_pre(pos, sch) {
>  			struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
> @@ -7240,8 +7240,6 @@ static void sysrq_handle_sched_ext_dump(u8 key)
>  	struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
>  	struct scx_sched *sch;
>  
> -	guard(raw_spinlock_irqsave)(&scx_sched_lock);
> -

Don't we need RCU protection here?

>  	list_for_each_entry_rcu(sch, &scx_sched_all, all)
>  		scx_dump_state(sch, &ei, 0, false);
>  }

Thanks,
-Andrea


* Re: [PATCH 4/5] sched_ext: Fix scx_sched_lock / rq lock ordering
  2026-03-10  6:39   ` Andrea Righi
@ 2026-03-10  6:47     ` Andrea Righi
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-03-10  6:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
	Cheng-Yang Chou, linux-kernel

On Tue, Mar 10, 2026 at 07:39:10AM +0100, Andrea Righi wrote:
> Hi Tejun,
> 
> On Mon, Mar 09, 2026 at 03:16:52PM -1000, Tejun Heo wrote:
> > There are two sites that nest rq lock inside scx_sched_lock:
> > 
> > - scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
> >   per-cpu bypass flags and re-enqueue tasks.
> > 
> > - sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
> >   scheds, scx_dump_state() then takes rq lock per CPU for dump.
> > 
> > And scx_claim_exit() takes scx_sched_lock to propagate exits to
> > descendants. It can be reached from scx_tick(), BPF kfuncs, and many
> > other paths with rq lock already held, creating the reverse ordering:
> > 
> >   rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
> > 
> > Fix by flipping scx_bypass() to take rq lock first, and dropping
> > scx_sched_lock from sysrq_handle_sched_ext_dump() as scx_sched_all is
> > already RCU-traversable and scx_dump_lock now prevents dumping a dead
> > sched. This makes the consistent ordering rq lock -> scx_sched_lock.
> > 
> > Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
> > Link: http://lkml.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
> > Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> >  kernel/sched/ext.c | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index cf28a8f62ad0..677c1c6c64bf 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -5097,8 +5097,8 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
> >  		struct rq *rq = cpu_rq(cpu);
> >  		struct task_struct *p, *n;
> >  
> > -		raw_spin_lock(&scx_sched_lock);
> >  		raw_spin_rq_lock(rq);
> > +		raw_spin_lock(&scx_sched_lock);
> >  
> >  		scx_for_each_descendant_pre(pos, sch) {
> >  			struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
> > @@ -7240,8 +7240,6 @@ static void sysrq_handle_sched_ext_dump(u8 key)
> >  	struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
> >  	struct scx_sched *sch;
> >  
> > -	guard(raw_spinlock_irqsave)(&scx_sched_lock);
> > -
> 
> Don't we need RCU protection here?

Nevermind, __handle_sysrq() is already doing rcu_read_lock/unlock(), so
this looks good.

Sorry for the noise,
-Andrea

> 
> >  	list_for_each_entry_rcu(sch, &scx_sched_all, all)
> >  		scx_dump_state(sch, &ei, 0, false);
> >  }
> 
> Thanks,
> -Andrea


* Re: [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
                   ` (4 preceding siblings ...)
  2026-03-10  1:16 ` [PATCH 5/5] sched_ext: Reject sub-sched attachment to a disabled parent Tejun Heo
@ 2026-03-10  6:50 ` Andrea Righi
  2026-03-10 17:17 ` Tejun Heo
  6 siblings, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-03-10  6:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
	Cheng-Yang Chou, linux-kernel

On Mon, Mar 09, 2026 at 03:16:48PM -1000, Tejun Heo wrote:
> Hello,
> 
> Cheng-Yang reported a lockdep circular dependency between scx_sched_lock and
> rq->__lock. scx_bypass() and sysrq_handle_sched_ext_dump() take
> scx_sched_lock -> rq lock, while scx_claim_exit() (reachable from many paths
> with rq lock held) takes rq -> scx_sched_lock. In addition, scx_disable()
> directly calling kthread_queue_work() under scx_sched_lock creates another
> chain through worker->lock -> pi_lock -> rq->__lock.
> 
> This patchset fixes these issues:
> 
> 1. Fix wrong sub_detach op check.
> 2. Add scx_dump_lock and dump_disabled to decouple dump from scx_sched_lock.
> 3. Always bounce scx_disable() through irq_work to avoid lock nesting.
> 4. Flip scx_bypass() lock order and drop scx_sched_lock from sysrq dump.
> 5. Reject sub-sched attachment to a disabled parent.
> 
> Tested on three machines (16-CPU QEMU, 192-CPU dual-socket EPYC, AMD Ryzen)
> with lockdep trigger tests and an 11-test stress suite covering
> attach/detach, nesting, reverse teardown, rapid cycling, error injection,
> SysRq-D/S dump/exit, and combined stress. Lockdep triggered on baseline,
> clean after patches.

With the comment from Cheng-Yang about fixing the link in patch 4/5.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea


* Re: [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues
  2026-03-10  1:16 [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Tejun Heo
                   ` (5 preceding siblings ...)
  2026-03-10  6:50 ` [PATCHSET sched_ext/for-7.1] Fix sub-sched locking issues Andrea Righi
@ 2026-03-10 17:17 ` Tejun Heo
  6 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-10 17:17 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, Cheng-Yang Chou, linux-kernel

Applied to sched_ext/for-7.1 with Andrea's Reviewed-by tags and
corrected Link tag in patch 4.

Thanks.

--
tejun

