public inbox for linux-kernel@vger.kernel.org
* [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability
@ 2025-11-09 18:30 Tejun Heo
  2025-11-09 18:31 ` [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode Tejun Heo
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:30 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel

Hello,

This patchset improves bypass mode scalability on large systems with many
runnable tasks.

Problem 1: Per-node DSQ contention with affinitized tasks

When bypass mode is triggered, tasks are routed through fallback dispatch
queues. Originally, bypass used a single global DSQ, but this didn't scale on
NUMA machines and could lead to livelocks. It was changed to use per-node
global DSQs with a breather mechanism that injects delays during bypass mode
switching to reduce lock contention. This resolved the cross-node issues and
has worked well for most cases.

However, Dan Schatzberg found that per-node global DSQs can still livelock in
a different scenario: On systems with many CPUs and many threads pinned to
different small subsets of CPUs, each CPU often has to scan through many
tasks it cannot run to find the one task it can run. With high CPU counts,
this scanning overhead causes severe DSQ lock contention that can livelock
the system, preventing bypass mode activation from completing at all.

The patchset addresses this by switching to per-CPU bypass DSQs to eliminate
the shared DSQ contention. However, per-CPU DSQs alone aren't enough - CPUs
can still get stuck in long iteration loops during dispatch and move
operations. The existing breather mechanism helps with lock contention but
doesn't help when CPUs are trapped in these loops. The patchset replaces the
breather with immediate exits from dispatch and move operations when
aborting. Since these operations only run during scheduler abort, there's no
need to maintain normal operation semantics, making immediate exit both
simpler and more effective.

As an additional safety net, the patchset hooks up the hardlockup detector.
The contention can be so severe that hardlockup can be the first sign of
trouble. For example, running scx_simple (which uses a single global DSQ)
with many affinitized tasks causes all CPUs to contend on the DSQ lock while
doing long scans, triggering hardlockup before other warnings appear.

Problem 2: Task concentration with per-CPU DSQs

The switch to per-CPU DSQs introduces a new failure mode. If the BPF
scheduler severely skews task placement before triggering bypass in a highly
over-saturated system, most tasks can end up concentrated on a few CPUs.
Those CPUs then accumulate queues that are too long to drain in a reasonable
time, leading to RCU stalls and hung tasks.

This is addressed by implementing a simple timer-based load balancer that
redistributes tasks across CPUs within each NUMA node.

The patchset also uses shorter time slices in bypass mode for faster forward
progress.

The patchset has been tested on a 192 CPU dual socket AMD EPYC machine with
~20k runnable tasks:

- For problem 1 (contention): 20k runnable threads in 20 cgroups affinitized
  to different CPU subsets running scx_simple. This creates the worst-case
  contention scenario where every CPU must scan through many incompatible
  tasks. The system can now reliably survive and kick out the scheduler.

- For problem 2 (concentration): scx_cpu0 (included in this series) queues
  all tasks to CPU0, creating worst-case task concentration. Without these
  changes, disabling the scheduler leads to RCU stalls and hung tasks. With
  these changes, disable completes in about a second.

This patchset contains the following 13 patches:

 0001-sched_ext-Don-t-set-ddsp_dsq_id-during-select_cpu-in.patch
 0002-sched_ext-Make-slice-values-tunable-and-use-shorter-.patch
 0003-sched_ext-Refactor-do_enqueue_task-local-and-global-.patch
 0004-sched_ext-Use-per-CPU-DSQs-instead-of-per-node-globa.patch
 0005-sched_ext-Simplify-breather-mechanism-with-scx_abort.patch
 0006-sched_ext-Exit-dispatch-and-move-operations-immediat.patch
 0007-sched_ext-Make-scx_exit-and-scx_vexit-return-bool.patch
 0008-sched_ext-Refactor-lockup-handlers-into-handle_locku.patch
 0009-sched_ext-Make-handle_lockup-propagate-scx_verror-re.patch
 0010-sched_ext-Hook-up-hardlockup-detector.patch
 0011-sched_ext-Add-scx_cpu0-example-scheduler.patch
 0012-sched_ext-Factor-out-scx_dsq_list_node-cursor-initia.patch
 0013-sched_ext-Implement-load-balancer-for-bypass-mode.patch

Based on sched_ext/for-6.19 (5a629ecbcdff).

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-bypass-scalability

 include/linux/sched/ext.h        |  20 ++
 include/trace/events/sched_ext.h |  39 +++
 kernel/sched/ext.c               | 505 +++++++++++++++++++++++++++++----------
 kernel/sched/ext_internal.h      |   6 +
 kernel/sched/sched.h             |   1 +
 kernel/watchdog.c                |   9 +
 tools/sched_ext/Makefile         |   2 +-
 tools/sched_ext/scx_cpu0.bpf.c   |  84 +++++++
 tools/sched_ext/scx_cpu0.c       | 106 ++++++++
 9 files changed, 642 insertions(+), 130 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  6:57   ` Andrea Righi
  2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo, Patrick Lu

In the default CPU selection path used during bypass mode, select_task_rq_scx()
set p->scx.ddsp_dsq_id to SCX_DSQ_LOCAL to emulate direct dispatch. However,
do_enqueue_task() ignores ddsp_dsq_id in bypass mode and queues to the global
DSQ, leaving ddsp_dsq_id dangling. This triggers WARN_ON_ONCE() in
mark_direct_dispatch() if the task later gets direct dispatched.

Don't use direct dispatch from bypass mode. Just return the selected CPU, which
has the effect of waking up the picked idle CPU. Later patches will implement
per-CPU bypass DSQs to resolve this issue more properly.

Reported-by: Patrick Lu <patlu@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 652a364e9e4c..cf8d86a2585c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2521,12 +2521,8 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		s32 cpu;
 
 		cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, NULL, 0);
-		if (cpu >= 0) {
-			refill_task_slice_dfl(sch, p);
-			p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
-		} else {
+		if (cpu < 0)
 			cpu = prev_cpu;
-		}
 		p->scx.selected_cpu = cpu;
 
 		if (rq_bypass)
-- 
2.51.1



* [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
  2025-11-09 18:31 ` [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  7:03   ` Andrea Righi
                     ` (2 more replies)
  2025-11-09 18:31 ` [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
                   ` (10 subsequent siblings)
  12 siblings, 3 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode where the
primary goal is ensuring all tasks can make forward progress.

Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make both the default and bypass
slice values tunable through module parameters (slice_dfl_us and
slice_bypass_us, adjustable between 100us and 100ms) to make it easier to test
whether slice durations are a factor in problem cases. Note that the configured
values are applied when bypass mode is switched and thus are guaranteed to take
effect only on scheduler [un]load.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h | 11 +++++++++++
 kernel/sched/ext.c        | 37 ++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index eb776b094d36..9f5b0f2be310 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -17,7 +17,18 @@
 enum scx_public_consts {
 	SCX_OPS_NAME_LEN	= 128,
 
+	/*
+	 * %SCX_SLICE_DFL is used to refill slices when the BPF scheduler misses
+	 * to set the slice for a task that is selected for execution.
+	 * %SCX_EV_REFILL_SLICE_DFL counts the number of times the default slice
+	 * refill has been triggered.
+	 *
+	 * %SCX_SLICE_BYPASS is used as the slice for all tasks in the bypass
+	 * mode. As mkaing forward progress for all tasks is the main goal of
+	 * mode. As making forward progress for all tasks is the main goal of
+	 */
 	SCX_SLICE_DFL		= 20 * 1000000,	/* 20ms */
+	SCX_SLICE_BYPASS	=  5 * 1000000, /*  5ms */
 	SCX_SLICE_INF		= U64_MAX,	/* infinite, implies nohz */
 };
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index cf8d86a2585c..2ce226018dbe 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -143,6 +143,35 @@ static struct scx_dump_data scx_dump_data = {
 /* /sys/kernel/sched_ext interface */
 static struct kset *scx_kset;
 
+/*
+ * Parameters that can be adjusted through /sys/module/sched_ext/parameters.
+ * There usually is no reason to modify these as normal scheduler operation
+ * shouldn't be affected by them. The knobs are primarily for debugging.
+ */
+static u64 scx_slice_dfl = SCX_SLICE_DFL;
+static unsigned int scx_slice_dfl_us = SCX_SLICE_DFL / NSEC_PER_USEC;
+static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC;
+
+static int set_slice_us(const char *val, const struct kernel_param *kp)
+{
+	return param_set_uint_minmax(val, kp, 100, 100 * USEC_PER_MSEC);
+}
+
+static const struct kernel_param_ops slice_us_param_ops = {
+	.set = set_slice_us,
+	.get = param_get_uint,
+};
+
+#undef MODULE_PARAM_PREFIX
+#define MODULE_PARAM_PREFIX	"sched_ext."
+
+module_param_cb(slice_dfl_us, &slice_us_param_ops, &scx_slice_dfl_us, 0600);
+MODULE_PARM_DESC(slice_dfl_us, "default slice in microseconds, applied on [un]load (100us to 100ms)");
+module_param_cb(slice_bypass_us, &slice_us_param_ops, &scx_slice_bypass_us, 0600);
+MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]load (100us to 100ms)");
+
+#undef MODULE_PARAM_PREFIX
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched_ext.h>
 
@@ -919,7 +948,7 @@ static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
 
 static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p)
 {
-	p->scx.slice = SCX_SLICE_DFL;
+	p->scx.slice = scx_slice_dfl;
 	__scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1);
 }
 
@@ -2892,7 +2921,7 @@ void init_scx_entity(struct sched_ext_entity *scx)
 	INIT_LIST_HEAD(&scx->runnable_node);
 	scx->runnable_at = jiffies;
 	scx->ddsp_dsq_id = SCX_DSQ_INVALID;
-	scx->slice = SCX_SLICE_DFL;
+	scx->slice = scx_slice_dfl;
 }
 
 void scx_pre_fork(struct task_struct *p)
@@ -3770,6 +3799,7 @@ static void scx_bypass(bool bypass)
 		WARN_ON_ONCE(scx_bypass_depth <= 0);
 		if (scx_bypass_depth != 1)
 			goto unlock;
+		scx_slice_dfl = scx_slice_bypass_us * NSEC_PER_USEC;
 		bypass_timestamp = ktime_get_ns();
 		if (sch)
 			scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1);
@@ -3778,6 +3808,7 @@ static void scx_bypass(bool bypass)
 		WARN_ON_ONCE(scx_bypass_depth < 0);
 		if (scx_bypass_depth != 0)
 			goto unlock;
+		scx_slice_dfl = scx_slice_dfl_us * NSEC_PER_USEC;
 		if (sch)
 			scx_add_event(sch, SCX_EV_BYPASS_DURATION,
 				      ktime_get_ns() - bypass_timestamp);
@@ -4776,7 +4807,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 			queue_flags |= DEQUEUE_CLASS;
 
 		scoped_guard (sched_change, p, queue_flags) {
-			p->scx.slice = SCX_SLICE_DFL;
+			p->scx.slice = scx_slice_dfl;
 			p->sched_class = new_class;
 		}
 	}
-- 
2.51.1



* [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
  2025-11-09 18:31 ` [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode Tejun Heo
  2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  7:21   ` Andrea Righi
  2025-11-09 18:31 ` [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

The local and global DSQ enqueue paths in do_enqueue_task() share the same
slice refill logic. Factor out the common code into a shared enqueue label.
This makes adding new enqueue cases easier. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2ce226018dbe..a29bfadde89d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1282,6 +1282,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 {
 	struct scx_sched *sch = scx_root;
 	struct task_struct **ddsp_taskp;
+	struct scx_dispatch_q *dsq;
 	unsigned long qseq;
 
 	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
@@ -1349,8 +1350,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 direct:
 	direct_dispatch(sch, p, enq_flags);
 	return;
-
+local_norefill:
+	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+	return;
 local:
+	dsq = &rq->scx.local_dsq;
+	goto enqueue;
+global:
+	dsq = find_global_dsq(sch, p);
+	goto enqueue;
+
+enqueue:
 	/*
 	 * For task-ordering, slice refill must be treated as implying the end
 	 * of the current slice. Otherwise, the longer @p stays on the CPU, the
@@ -1358,14 +1368,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
-local_norefill:
-	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
-	return;
-
-global:
-	touch_core_sched(rq, p);	/* see the comment in local: */
-	refill_task_slice_dfl(sch, p);
-	dispatch_enqueue(sch, find_global_dsq(sch, p), p, enq_flags);
+	dispatch_enqueue(sch, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
-- 
2.51.1



* [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (2 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  7:42   ` Andrea Righi
  2025-11-11 15:31   ` Dan Schatzberg
  2025-11-09 18:31 ` [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

When bypass mode is activated, tasks are routed through a fallback dispatch
queue instead of the BPF scheduler. Originally, bypass mode used a single
global DSQ, but this didn't scale well on NUMA machines and could lead to
livelocks. In b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"),
this was changed to use per-node global DSQs, which resolved the
cross-node-related livelocks.

However, Dan Schatzberg found that per-node global DSQ can also livelock in a
different scenario: On a NUMA node with many CPUs and many threads pinned to
different small subsets of CPUs, each CPU often has to scan through many tasks
it cannot run to find the one task it can run. With a high number of CPUs,
this scanning overhead can easily cause livelocks.

Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
on the CPU that it's currently on. Because the default idle CPU selection
policy and direct dispatch are both active during bypass, this works well in
most cases including the above.

However, this does have a failure mode in highly over-saturated systems where
tasks are concentrated on a single CPU. If the BPF scheduler places most tasks
on one CPU and then triggers bypass mode, bypass mode will keep those tasks on
that one CPU, which can lead to failures such as RCU stalls as the queue may be
too long for that CPU to drain in a reasonable time. This will be addressed
with a load balancer in a future patch. The bypass DSQ is kept separate from
the local DSQ to allow the load balancer to move tasks between bypass DSQs.

Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h |  1 +
 kernel/sched/ext.c        | 16 +++++++++++++---
 kernel/sched/sched.h      |  1 +
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 9f5b0f2be310..e1502faf6241 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -57,6 +57,7 @@ enum scx_dsq_id_flags {
 	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
 	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
 	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
+	SCX_DSQ_BYPASS		= SCX_DSQ_FLAG_BUILTIN | 3,
 	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
 	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
 };
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a29bfadde89d..4b8b91494947 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1301,7 +1301,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 
 	if (scx_rq_bypassing(rq)) {
 		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
-		goto global;
+		goto bypass;
 	}
 
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
@@ -1359,6 +1359,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 global:
 	dsq = find_global_dsq(sch, p);
 	goto enqueue;
+bypass:
+	dsq = &task_rq(p)->scx.bypass_dsq;
+	goto enqueue;
 
 enqueue:
 	/*
@@ -2157,8 +2160,14 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	if (consume_global_dsq(sch, rq))
 		goto has_tasks;
 
-	if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
-	    scx_rq_bypassing(rq) || !scx_rq_online(rq))
+	if (scx_rq_bypassing(rq)) {
+		if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
+			goto has_tasks;
+		else
+			goto no_tasks;
+	}
+
+	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
 		goto no_tasks;
 
 	dspc->rq = rq;
@@ -5370,6 +5379,7 @@ void __init init_sched_ext_class(void)
 		int  n = cpu_to_node(cpu);
 
 		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
+		init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
 		INIT_LIST_HEAD(&rq->scx.runnable_list);
 		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27aae2a298f8..5991133a4849 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -808,6 +808,7 @@ struct scx_rq {
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
+	struct scx_dispatch_q	bypass_dsq;
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
 
-- 
2.51.1



* [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (3 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  7:45   ` Andrea Righi
  2025-11-11 15:34   ` Dan Schatzberg
  2025-11-09 18:31 ` [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid
live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the
ops breather and eject BPF scheduler on softlockup") to prevent live-locks by
injecting delays when CPUs are trapped in dispatch paths.

Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup
(unsigned long) with separate increment/decrement and cleanup operations. The
breather is only activated when aborting, so tie it directly to the exit
mechanism. Replace both variables with an scx_aborting flag, set when the exit
is claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to
consolidate exit_kind claiming and breather enablement. This eliminates
scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass().

The breather mechanism will be replaced by a different abort mechanism in a
future patch. This simplification prepares for that change.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 54 +++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4b8b91494947..905d01f74687 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -33,9 +33,8 @@ static DEFINE_MUTEX(scx_enable_mutex);
 DEFINE_STATIC_KEY_FALSE(__scx_enabled);
 DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
 static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
-static unsigned long scx_in_softlockup;
-static atomic_t scx_breather_depth = ATOMIC_INIT(0);
 static int scx_bypass_depth;
+static bool scx_aborting;
 static bool scx_init_task_enabled;
 static bool scx_switching_all;
 DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
@@ -1834,7 +1833,7 @@ static void scx_breather(struct rq *rq)
 
 	lockdep_assert_rq_held(rq);
 
-	if (likely(!atomic_read(&scx_breather_depth)))
+	if (likely(!READ_ONCE(scx_aborting)))
 		return;
 
 	raw_spin_rq_unlock(rq);
@@ -1843,9 +1842,9 @@ static void scx_breather(struct rq *rq)
 
 	do {
 		int cnt = 1024;
-		while (atomic_read(&scx_breather_depth) && --cnt)
+		while (READ_ONCE(scx_aborting) && --cnt)
 			cpu_relax();
-	} while (atomic_read(&scx_breather_depth) &&
+	} while (READ_ONCE(scx_aborting) &&
 		 time_before64(ktime_get_ns(), until));
 
 	raw_spin_rq_lock(rq);
@@ -3740,30 +3739,14 @@ void scx_softlockup(u32 dur_s)
 		goto out_unlock;
 	}
 
-	/* allow only one instance, cleared at the end of scx_bypass() */
-	if (test_and_set_bit(0, &scx_in_softlockup))
-		goto out_unlock;
-
 	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
 			smp_processor_id(), dur_s, scx_root->ops.name);
 
-	/*
-	 * Some CPUs may be trapped in the dispatch paths. Enable breather
-	 * immediately; otherwise, we might even be able to get to scx_bypass().
-	 */
-	atomic_inc(&scx_breather_depth);
-
 	scx_error(sch, "soft lockup - CPU#%d stuck for %us", smp_processor_id(), dur_s);
 out_unlock:
 	rcu_read_unlock();
 }
 
-static void scx_clear_softlockup(void)
-{
-	if (test_and_clear_bit(0, &scx_in_softlockup))
-		atomic_dec(&scx_breather_depth);
-}
-
 /**
  * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
  * @bypass: true for bypass, false for unbypass
@@ -3826,8 +3809,6 @@ static void scx_bypass(bool bypass)
 				      ktime_get_ns() - bypass_timestamp);
 	}
 
-	atomic_inc(&scx_breather_depth);
-
 	/*
 	 * No task property is changing. We just need to make sure all currently
 	 * queued tasks are re-queued according to the new scx_rq_bypassing()
@@ -3883,10 +3864,8 @@ static void scx_bypass(bool bypass)
 		raw_spin_rq_unlock(rq);
 	}
 
-	atomic_dec(&scx_breather_depth);
 unlock:
 	raw_spin_unlock_irqrestore(&bypass_lock, flags);
-	scx_clear_softlockup();
 }
 
 static void free_exit_info(struct scx_exit_info *ei)
@@ -3981,6 +3960,7 @@ static void scx_disable_workfn(struct kthread_work *work)
 
 	/* guarantee forward progress by bypassing scx_ops */
 	scx_bypass(true);
+	WRITE_ONCE(scx_aborting, false);
 
 	switch (scx_set_enable_state(SCX_DISABLING)) {
 	case SCX_DISABLING:
@@ -4103,9 +4083,24 @@ static void scx_disable_workfn(struct kthread_work *work)
 	scx_bypass(false);
 }
 
-static void scx_disable(enum scx_exit_kind kind)
+static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind)
 {
 	int none = SCX_EXIT_NONE;
+
+	if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind))
+		return false;
+
+	/*
+	 * Some CPUs may be trapped in the dispatch paths. Enable breather
+	 * immediately; otherwise, we might not even be able to get to
+	 * scx_bypass().
+	 */
+	WRITE_ONCE(scx_aborting, true);
+	return true;
+}
+
+static void scx_disable(enum scx_exit_kind kind)
+{
 	struct scx_sched *sch;
 
 	if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE))
@@ -4114,7 +4109,7 @@ static void scx_disable(enum scx_exit_kind kind)
 	rcu_read_lock();
 	sch = rcu_dereference(scx_root);
 	if (sch) {
-		atomic_try_cmpxchg(&sch->exit_kind, &none, kind);
+		scx_claim_exit(sch, kind);
 		kthread_queue_work(sch->helper, &sch->disable_work);
 	}
 	rcu_read_unlock();
@@ -4435,9 +4430,8 @@ static void scx_vexit(struct scx_sched *sch,
 		      const char *fmt, va_list args)
 {
 	struct scx_exit_info *ei = sch->exit_info;
-	int none = SCX_EXIT_NONE;
 
-	if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind))
+	if (!scx_claim_exit(sch, kind))
 		return;
 
 	ei->exit_code = exit_code;
@@ -4653,6 +4647,8 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	 */
 	WARN_ON_ONCE(scx_set_enable_state(SCX_ENABLING) != SCX_DISABLED);
 	WARN_ON_ONCE(scx_root);
+	if (WARN_ON_ONCE(READ_ONCE(scx_aborting)))
+		WRITE_ONCE(scx_aborting, false);
 
 	atomic_long_set(&scx_nr_rejected, 0);
 
-- 
2.51.1



* [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (4 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:20   ` Andrea Righi
  2025-11-11 15:46   ` Dan Schatzberg
  2025-11-09 18:31 ` [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
the breather mechanism to inject delays during bypass mode switching. It
maintains operation semantics unchanged while reducing lock contention to avoid
live-locks on large NUMA systems.

However, the breather only activates when exiting the scheduler, so there's no
need to maintain operation semantics. Simplify by exiting dispatch and move
operations immediately when scx_aborting is set. In consume_dispatch_q(), break
out of the task iteration loop. In scx_dsq_move(), return early before
acquiring locks.

This also fixes cases the breather mechanism cannot handle. When a large system
has many runnable threads affinitized to different CPU subsets and the BPF
scheduler places them all into a single DSQ, many CPUs can scan the DSQ
concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
for extended periods, leading to various failure modes. The breather cannot
solve this because once a CPU is inside the consume loop, there is no exit
point. The new mechanism fixes this by exiting the loop immediately.

The bypass DSQ is exempted to ensure the bypass mechanism itself can make
progress.

Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 62 ++++++++++++++--------------------------------
 1 file changed, 18 insertions(+), 44 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 905d01f74687..afa89ca3659e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1821,48 +1821,11 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 	return dst_rq;
 }
 
-/*
- * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
- * banging on the same DSQ on a large NUMA system to the point where switching
- * to the bypass mode can take a long time. Inject artificial delays while the
- * bypass mode is switching to guarantee timely completion.
- */
-static void scx_breather(struct rq *rq)
-{
-	u64 until;
-
-	lockdep_assert_rq_held(rq);
-
-	if (likely(!READ_ONCE(scx_aborting)))
-		return;
-
-	raw_spin_rq_unlock(rq);
-
-	until = ktime_get_ns() + NSEC_PER_MSEC;
-
-	do {
-		int cnt = 1024;
-		while (READ_ONCE(scx_aborting) && --cnt)
-			cpu_relax();
-	} while (READ_ONCE(scx_aborting) &&
-		 time_before64(ktime_get_ns(), until));
-
-	raw_spin_rq_lock(rq);
-}
-
 static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 			       struct scx_dispatch_q *dsq)
 {
 	struct task_struct *p;
 retry:
-	/*
-	 * This retry loop can repeatedly race against scx_bypass() dequeueing
-	 * tasks from @dsq trying to put the system into the bypass mode. On
-	 * some multi-socket machines (e.g. 2x Intel 8480c), this can live-lock
-	 * the machine into soft lockups. Give a breather.
-	 */
-	scx_breather(rq);
-
 	/*
 	 * The caller can't expect to successfully consume a task if the task's
 	 * addition to @dsq isn't guaranteed to be visible somehow. Test
@@ -1876,6 +1839,17 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 	nldsq_for_each_task(p, dsq) {
 		struct rq *task_rq = task_rq(p);
 
+		/*
+		 * This loop can lead to multiple lockup scenarios, e.g. the BPF
+		 * scheduler can put an enormous number of affinitized tasks into
+		 * a contended DSQ, or the outer retry loop can repeatedly race
+		 * against scx_bypass() dequeueing tasks from @dsq trying to put
+		 * the system into the bypass mode. This can easily live-lock the
+		 * machine. If aborting, exit from all non-bypass DSQs.
+		 */
+		if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS)
+			break;
+
 		if (rq == task_rq) {
 			task_unlink_from_dsq(p, dsq);
 			move_local_task_to_local_dsq(p, 0, dsq, rq);
@@ -5635,6 +5609,13 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 	    !scx_kf_allowed(sch, SCX_KF_DISPATCH))
 		return false;
 
+	/*
+	 * If the BPF scheduler keeps calling this function repeatedly, it can
+	 * cause similar live-lock conditions as consume_dispatch_q().
+	 */
+	if (unlikely(scx_aborting))
+		return false;
+
 	/*
 	 * Can be called from either ops.dispatch() locking this_rq() or any
 	 * context where no rq lock is held. If latter, lock @p's task_rq which
@@ -5655,13 +5636,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 		raw_spin_rq_lock(src_rq);
 	}
 
-	/*
-	 * If the BPF scheduler keeps calling this function repeatedly, it can
-	 * cause similar live-lock conditions as consume_dispatch_q(). Insert a
-	 * breather if necessary.
-	 */
-	scx_breather(src_rq);
-
 	locked_rq = src_rq;
 	raw_spin_lock(&src_dsq->lock);
 
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (5 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:28   ` Andrea Righi
  2025-11-11 15:48   ` Dan Schatzberg
  2025-11-09 18:31 ` [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

Make scx_exit() and scx_vexit() return bool indicating whether the calling
thread successfully claimed the exit. This will be used by the abort mechanism
added in a later patch.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afa89ca3659e..033c8b8e88e8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -177,18 +177,21 @@ MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]
 static void process_ddsp_deferred_locals(struct rq *rq);
 static u32 reenq_local(struct rq *rq);
 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
-static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
+static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
 		      s64 exit_code, const char *fmt, va_list args);
 
-static __printf(4, 5) void scx_exit(struct scx_sched *sch,
+static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
 				    enum scx_exit_kind kind, s64 exit_code,
 				    const char *fmt, ...)
 {
 	va_list args;
+	bool ret;
 
 	va_start(args, fmt);
-	scx_vexit(sch, kind, exit_code, fmt, args);
+	ret = scx_vexit(sch, kind, exit_code, fmt, args);
 	va_end(args);
+
+	return ret;
 }
 
 #define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
@@ -4399,14 +4402,14 @@ static void scx_error_irq_workfn(struct irq_work *irq_work)
 	kthread_queue_work(sch->helper, &sch->disable_work);
 }
 
-static void scx_vexit(struct scx_sched *sch,
+static bool scx_vexit(struct scx_sched *sch,
 		      enum scx_exit_kind kind, s64 exit_code,
 		      const char *fmt, va_list args)
 {
 	struct scx_exit_info *ei = sch->exit_info;
 
 	if (!scx_claim_exit(sch, kind))
-		return;
+		return false;
 
 	ei->exit_code = exit_code;
 #ifdef CONFIG_STACKTRACE
@@ -4423,6 +4426,7 @@ static void scx_vexit(struct scx_sched *sch,
 	ei->reason = scx_exit_reason(ei->kind);
 
 	irq_work_queue(&sch->error_irq_work);
+	return true;
 }
 
 static int alloc_kick_syncs(void)
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup()
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (6 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:29   ` Andrea Righi
  2025-11-11 15:49   ` Dan Schatzberg
  2025-11-09 18:31 ` [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check whether
the scheduler is enabled under the RCU read lock and trigger an error if so.
Extract the common pattern into a new handle_lockup() helper, add an
scx_verror() macro, and use guard(rcu)().

This simplifies both handlers, reduces code duplication, and prepares for
hardlockup handling.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 65 ++++++++++++++++++----------------------------
 1 file changed, 25 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 033c8b8e88e8..5c75b0125dfe 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -195,6 +195,7 @@ static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
 }
 
 #define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
+#define scx_verror(sch, fmt, args)	scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
 
 #define SCX_HAS_OP(sch, op)	test_bit(SCX_OP_IDX(op), (sch)->has_op)
 
@@ -3653,39 +3654,40 @@ bool scx_allow_ttwu_queue(const struct task_struct *p)
 	return false;
 }
 
-/**
- * scx_rcu_cpu_stall - sched_ext RCU CPU stall handler
- *
- * While there are various reasons why RCU CPU stalls can occur on a system
- * that may not be caused by the current BPF scheduler, try kicking out the
- * current scheduler in an attempt to recover the system to a good state before
- * issuing panics.
- */
-bool scx_rcu_cpu_stall(void)
+static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
 {
 	struct scx_sched *sch;
+	va_list args;
 
-	rcu_read_lock();
+	guard(rcu)();
 
 	sch = rcu_dereference(scx_root);
-	if (unlikely(!sch)) {
-		rcu_read_unlock();
+	if (unlikely(!sch))
 		return false;
-	}
 
 	switch (scx_enable_state()) {
 	case SCX_ENABLING:
 	case SCX_ENABLED:
-		break;
+		va_start(args, fmt);
+		scx_verror(sch, fmt, args);
+		va_end(args);
+		return true;
 	default:
-		rcu_read_unlock();
 		return false;
 	}
+}
 
-	scx_error(sch, "RCU CPU stall detected!");
-	rcu_read_unlock();
-
-	return true;
+/**
+ * scx_rcu_cpu_stall - sched_ext RCU CPU stall handler
+ *
+ * While there are various reasons why RCU CPU stalls can occur on a system
+ * that may not be caused by the current BPF scheduler, try kicking out the
+ * current scheduler in an attempt to recover the system to a good state before
+ * issuing panics.
+ */
+bool scx_rcu_cpu_stall(void)
+{
+	return handle_lockup("RCU CPU stall detected!");
 }
 
 /**
@@ -3700,28 +3702,11 @@ bool scx_rcu_cpu_stall(void)
  */
 void scx_softlockup(u32 dur_s)
 {
-	struct scx_sched *sch;
-
-	rcu_read_lock();
-
-	sch = rcu_dereference(scx_root);
-	if (unlikely(!sch))
-		goto out_unlock;
-
-	switch (scx_enable_state()) {
-	case SCX_ENABLING:
-	case SCX_ENABLED:
-		break;
-	default:
-		goto out_unlock;
-	}
-
-	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
-			smp_processor_id(), dur_s, scx_root->ops.name);
+	if (!handle_lockup("soft lockup - CPU %d stuck for %us", smp_processor_id(), dur_s))
+		return;
 
-	scx_error(sch, "soft lockup - CPU#%d stuck for %us", smp_processor_id(), dur_s);
-out_unlock:
-	rcu_read_unlock();
+	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU %d stuck for %us, disabling BPF scheduler\n",
+			smp_processor_id(), dur_s);
 }
 
 /**
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (7 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:29   ` Andrea Righi
  2025-11-09 18:31 ` [PATCH 10/13] sched_ext: Hook up hardlockup detector Tejun Heo
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

handle_lockup() currently calls scx_verror() but ignores its return value,
always returning true when the scheduler is enabled. Make it capture and return
the result from scx_verror(). This prepares for hardlockup handling.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5c75b0125dfe..4507bc4f0b5c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3658,6 +3658,7 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
 {
 	struct scx_sched *sch;
 	va_list args;
+	bool ret;
 
 	guard(rcu)();
 
@@ -3669,9 +3670,9 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
 	case SCX_ENABLING:
 	case SCX_ENABLED:
 		va_start(args, fmt);
-		scx_verror(sch, fmt, args);
+		ret = scx_verror(sch, fmt, args);
 		va_end(args);
-		return true;
+		return ret;
 	default:
 		return false;
 	}
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 10/13] sched_ext: Hook up hardlockup detector
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (8 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:31   ` Andrea Righi
  2025-11-09 18:31 ` [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo, Douglas Anderson, Andrew Morton

A poorly behaving BPF scheduler can trigger a hard lockup. For example, on a
large system with many tasks pinned to different subsets of CPUs, if the BPF
scheduler puts all tasks in a single DSQ and directs all CPUs at it, the DSQ
lock can become contended to the point where the hardlockup detector triggers.
Unfortunately, a hard lockup can be the first signal of such a situation, so
hardlockup handling is required.

Hook scx_hardlockup() into the hardlockup detector to try kicking out the
current scheduler in an attempt to recover the system to a good state. This
handling can delay the watchdog from taking its own action by one polling
period; however, given that the only remediation for a hard lockup is crashing
the system, this is likely an acceptable trade-off.

Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h |  1 +
 kernel/sched/ext.c        | 18 ++++++++++++++++++
 kernel/watchdog.c         |  9 +++++++++
 3 files changed, 28 insertions(+)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index e1502faf6241..12561a3fcee4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -223,6 +223,7 @@ struct sched_ext_entity {
 void sched_ext_dead(struct task_struct *p);
 void print_scx_info(const char *log_lvl, struct task_struct *p);
 void scx_softlockup(u32 dur_s);
+bool scx_hardlockup(void);
 bool scx_rcu_cpu_stall(void);
 
 #else	/* !CONFIG_SCHED_CLASS_EXT */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4507bc4f0b5c..bd66178e5927 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3710,6 +3710,24 @@ void scx_softlockup(u32 dur_s)
 			smp_processor_id(), dur_s);
 }
 
+/**
+ * scx_hardlockup - sched_ext hardlockup handler
+ *
+ * A poorly behaving BPF scheduler can trigger hard lockup by e.g. putting
+ * numerous affinitized tasks in a single queue and directing all CPUs at it.
+ * Try kicking out the current scheduler in an attempt to recover the system to
+ * a good state before taking more drastic actions.
+ */
+bool scx_hardlockup(void)
+{
+	if (!handle_lockup("hard lockup - CPU %d", smp_processor_id()))
+		return false;
+
+	printk_deferred(KERN_ERR "sched_ext: Hard lockup - CPU %d, disabling BPF scheduler\n",
+			smp_processor_id());
+	return true;
+}
+
 /**
  * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
  * @bypass: true for bypass, false for unbypass
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 5b62d1002783..8dfac4a8f587 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -196,6 +196,15 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
 #ifdef CONFIG_SYSFS
 		++hardlockup_count;
 #endif
+		/*
+		 * A poorly behaving BPF scheduler can trigger hard lockup by
+		 * e.g. putting numerous affinitized tasks in a single queue and
+		 * directing all CPUs at it. The following call can return true
+		 * only once when sched_ext is enabled and will immediately
+		 * abort the BPF scheduler and print out a warning message.
+		 */
+		if (scx_hardlockup())
+			return;
 
 		/* Only print hardlockups once. */
 		if (per_cpu(watchdog_hardlockup_warned, cpu))
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (9 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 10/13] sched_ext: Hook up hardlockup detector Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:36   ` Andrea Righi
  2025-11-09 18:31 ` [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
  2025-11-09 18:31 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and
only dispatches them from CPU0 in FIFO order. This is useful for testing bypass
behavior when many tasks are concentrated on a single CPU. If the load balancer
doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is
long and there's only one CPU working on it.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/Makefile       |   2 +-
 tools/sched_ext/scx_cpu0.bpf.c |  84 ++++++++++++++++++++++++++
 tools/sched_ext/scx_cpu0.c     | 106 +++++++++++++++++++++++++++++++++
 3 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 tools/sched_ext/scx_cpu0.bpf.c
 create mode 100644 tools/sched_ext/scx_cpu0.c

diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index d68780e2e03d..069b0bc38e55 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -187,7 +187,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP
 
 SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
 
-c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg
+c-sched-targets = scx_simple scx_cpu0 scx_qmap scx_central scx_flatcg
 
 $(addprefix $(BINDIR)/,$(c-sched-targets)): \
 	$(BINDIR)/%: \
diff --git a/tools/sched_ext/scx_cpu0.bpf.c b/tools/sched_ext/scx_cpu0.bpf.c
new file mode 100644
index 000000000000..8626bd369f60
--- /dev/null
+++ b/tools/sched_ext/scx_cpu0.bpf.c
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A CPU0 scheduler.
+ *
+ * This scheduler queues all tasks to a shared DSQ and only dispatches them on
+ * CPU0 in FIFO order. This is useful for testing bypass behavior when many
+ * tasks are concentrated on a single CPU. If the load balancer doesn't work,
+ * bypass mode can trigger task hangs or RCU stalls as the queue is long and
+ * there's only one CPU working on it.
+ *
+ * - Statistics tracking how many tasks are queued to local and CPU0 DSQs.
+ * - Termination notification for userspace.
+ *
+ * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+const volatile u32 nr_cpus = 32;	/* !0 for veristat, set during init */
+
+UEI_DEFINE(uei);
+
+/*
+ * We create a custom DSQ with ID 0 that we dispatch to and consume from on
+ * CPU0.
+ */
+#define DSQ_CPU0 0
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(key_size, sizeof(u32));
+	__uint(value_size, sizeof(u64));
+	__uint(max_entries, 2);			/* [local, cpu0] */
+} stats SEC(".maps");
+
+static void stat_inc(u32 idx)
+{
+	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+	if (cnt_p)
+		(*cnt_p)++;
+}
+
+s32 BPF_STRUCT_OPS(cpu0_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+	return 0;
+}
+
+void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	if (p->nr_cpus_allowed < nr_cpus) {
+		stat_inc(0);	/* count local queueing */
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+		return;
+	}
+
+	stat_inc(1);	/* count cpu0 queueing */
+	scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev)
+{
+	if (cpu == 0)
+		scx_bpf_dsq_move_to_local(DSQ_CPU0);
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init)
+{
+	return scx_bpf_create_dsq(DSQ_CPU0, -1);
+}
+
+void BPF_STRUCT_OPS(cpu0_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(cpu0_ops,
+	       .select_cpu		= (void *)cpu0_select_cpu,
+	       .enqueue			= (void *)cpu0_enqueue,
+	       .dispatch		= (void *)cpu0_dispatch,
+	       .init			= (void *)cpu0_init,
+	       .exit			= (void *)cpu0_exit,
+	       .name			= "cpu0");
diff --git a/tools/sched_ext/scx_cpu0.c b/tools/sched_ext/scx_cpu0.c
new file mode 100644
index 000000000000..1e4fa4ab8da9
--- /dev/null
+++ b/tools/sched_ext/scx_cpu0.c
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <assert.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_cpu0.bpf.skel.h"
+
+const char help_fmt[] =
+"A cpu0 sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-v]\n"
+"\n"
+"  -v            Print libbpf debug messages\n"
+"  -h            Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int sig)
+{
+	exit_req = 1;
+}
+
+static void read_stats(struct scx_cpu0 *skel, __u64 *stats)
+{
+	int nr_cpus = libbpf_num_possible_cpus();
+	assert(nr_cpus > 0);
+	__u64 cnts[2][nr_cpus];
+	__u32 idx;
+
+	memset(stats, 0, sizeof(stats[0]) * 2);
+
+	for (idx = 0; idx < 2; idx++) {
+		int ret, cpu;
+
+		ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+					  &idx, cnts[idx]);
+		if (ret < 0)
+			continue;
+		for (cpu = 0; cpu < nr_cpus; cpu++)
+			stats[idx] += cnts[idx][cpu];
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct scx_cpu0 *skel;
+	struct bpf_link *link;
+	__u32 opt;
+	__u64 ecode;
+
+	libbpf_set_print(libbpf_print_fn);
+	signal(SIGINT, sigint_handler);
+	signal(SIGTERM, sigint_handler);
+restart:
+	skel = SCX_OPS_OPEN(cpu0_ops, scx_cpu0);
+
+	skel->rodata->nr_cpus = libbpf_num_possible_cpus();
+
+	while ((opt = getopt(argc, argv, "vh")) != -1) {
+		switch (opt) {
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			fprintf(stderr, help_fmt, basename(argv[0]));
+			return opt != 'h';
+		}
+	}
+
+	SCX_OPS_LOAD(skel, cpu0_ops, scx_cpu0, uei);
+	link = SCX_OPS_ATTACH(skel, cpu0_ops, scx_cpu0);
+
+	while (!exit_req && !UEI_EXITED(skel, uei)) {
+		__u64 stats[2];
+
+		read_stats(skel, stats);
+		printf("local=%llu cpu0=%llu\n", stats[0], stats[1]);
+		fflush(stdout);
+		sleep(1);
+	}
+
+	bpf_link__destroy(link);
+	ecode = UEI_REPORT(skel, uei);
+	scx_cpu0__destroy(skel);
+
+	if (UEI_ECODE_RESTART(ecode))
+		goto restart;
+	return 0;
+}
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (10 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  8:37   ` Andrea Righi
  2025-11-09 18:31 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
macro in preparation for additional users.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h | 7 +++++++
 kernel/sched/ext.c        | 5 ++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 12561a3fcee4..280828b13608 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -149,6 +149,13 @@ struct scx_dsq_list_node {
 	u32			priv;		/* can be used by iter cursor */
 };
 
+#define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv)				\
+	(struct scx_dsq_list_node) {						\
+		.node = LIST_HEAD_INIT((__node).node),				\
+		.flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags),			\
+		.priv = (__priv),						\
+	}
+
 /*
  * The following is embedded in task_struct and contains all fields necessary
  * for a task to be scheduled by SCX.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index bd66178e5927..4b2cc6cc8cb2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6252,9 +6252,8 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
 	if (!kit->dsq)
 		return -ENOENT;
 
-	INIT_LIST_HEAD(&kit->cursor.node);
-	kit->cursor.flags = SCX_DSQ_LNODE_ITER_CURSOR | flags;
-	kit->cursor.priv = READ_ONCE(kit->dsq->seq);
+	kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, flags,
+					   READ_ONCE(kit->dsq->seq));
 
 	return 0;
 }
-- 
2.51.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 13/13] sched_ext: Implement load balancer for bypass mode
  2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
                   ` (11 preceding siblings ...)
  2025-11-09 18:31 ` [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
@ 2025-11-09 18:31 ` Tejun Heo
  2025-11-10  9:38   ` Andrea Righi
  12 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-09 18:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel,
	Tejun Heo

In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode on highly over-saturated systems where a
BPF scheduler can severely skew task placement before bypass is triggered. If
most tasks end up concentrated on a few CPUs, those CPUs can accumulate queues
that are too long to drain in a reasonable time, leading to RCU stalls and hung
tasks.

Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (default 500ms,
tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
CPUs to underloaded ones.

When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.

This has been tested on a 192-CPU dual-socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
long time to complete. With the load balancer, disable completes in about a
second.

The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.

Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/trace/events/sched_ext.h |  39 +++++
 kernel/sched/ext.c               | 236 ++++++++++++++++++++++++++++++-
 kernel/sched/ext_internal.h      |   6 +
 3 files changed, 278 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sched_ext.h b/include/trace/events/sched_ext.h
index 50e4b712735a..d1bf5acd59c5 100644
--- a/include/trace/events/sched_ext.h
+++ b/include/trace/events/sched_ext.h
@@ -45,6 +45,45 @@ TRACE_EVENT(sched_ext_event,
 	)
 );
 
+TRACE_EVENT(sched_ext_bypass_lb,
+
+	TP_PROTO(__u32 node, __u32 nr_cpus, __u32 nr_tasks, __u32 nr_balanced,
+		 __u32 before_min, __u32 before_max,
+		 __u32 after_min, __u32 after_max),
+
+	TP_ARGS(node, nr_cpus, nr_tasks, nr_balanced,
+		before_min, before_max, after_min, after_max),
+
+	TP_STRUCT__entry(
+		__field(	__u32,		node		)
+		__field(	__u32,		nr_cpus		)
+		__field(	__u32,		nr_tasks	)
+		__field(	__u32,		nr_balanced	)
+		__field(	__u32,		before_min	)
+		__field(	__u32,		before_max	)
+		__field(	__u32,		after_min	)
+		__field(	__u32,		after_max	)
+	),
+
+	TP_fast_assign(
+		__entry->node		= node;
+		__entry->nr_cpus	= nr_cpus;
+		__entry->nr_tasks	= nr_tasks;
+		__entry->nr_balanced	= nr_balanced;
+		__entry->before_min	= before_min;
+		__entry->before_max	= before_max;
+		__entry->after_min	= after_min;
+		__entry->after_max	= after_max;
+	),
+
+	TP_printk("node %u: nr_cpus=%u nr_tasks=%u nr_balanced=%u min=%u->%u max=%u->%u",
+		  __entry->node, __entry->nr_cpus,
+		  __entry->nr_tasks, __entry->nr_balanced,
+		  __entry->before_min, __entry->after_min,
+		  __entry->before_max, __entry->after_max
+	)
+);
+
 #endif /* _TRACE_SCHED_EXT_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4b2cc6cc8cb2..39b6e7895152 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -34,6 +34,8 @@ DEFINE_STATIC_KEY_FALSE(__scx_enabled);
 DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
 static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
 static int scx_bypass_depth;
+static cpumask_var_t scx_bypass_lb_donee_cpumask;
+static cpumask_var_t scx_bypass_lb_resched_cpumask;
 static bool scx_aborting;
 static bool scx_init_task_enabled;
 static bool scx_switching_all;
@@ -150,6 +152,7 @@ static struct kset *scx_kset;
 static u64 scx_slice_dfl = SCX_SLICE_DFL;
 static unsigned int scx_slice_dfl_us = SCX_SLICE_DFL / NSEC_PER_USEC;
 static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC;
+static unsigned int scx_bypass_lb_intv_us = SCX_BYPASS_LB_DFL_INTV_US;
 
 static int set_slice_us(const char *val, const struct kernel_param *kp)
 {
@@ -161,6 +164,16 @@ static const struct kernel_param_ops slice_us_param_ops = {
 	.get = param_get_uint,
 };
 
+static int set_bypass_lb_intv_us(const char *val, const struct kernel_param *kp)
+{
+	return param_set_uint_minmax(val, kp, 0, 10 * USEC_PER_SEC);
+}
+
+static const struct kernel_param_ops bypass_lb_intv_us_param_ops = {
+	.set = set_bypass_lb_intv_us,
+	.get = param_get_uint,
+};
+
 #undef MODULE_PARAM_PREFIX
 #define MODULE_PARAM_PREFIX	"sched_ext."
 
@@ -168,6 +181,8 @@ module_param_cb(slice_dfl_us, &slice_us_param_ops, &scx_slice_dfl_us, 0600);
 MODULE_PARM_DESC(slice_dfl_us, "default slice in microseconds, applied on [un]load (100us to 100ms)");
 module_param_cb(slice_bypass_us, &slice_us_param_ops, &scx_slice_bypass_us, 0600);
 MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]load (100us to 100ms)");
+module_param_cb(bypass_lb_intv_us, &bypass_lb_intv_us_param_ops, &scx_bypass_lb_intv_us, 0600);
+MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microseconds (0 (disable) to 10s)");
 
 #undef MODULE_PARAM_PREFIX
 
@@ -965,7 +980,9 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 		     !RB_EMPTY_NODE(&p->scx.dsq_priq));
 
 	if (!is_local) {
-		raw_spin_lock(&dsq->lock);
+		raw_spin_lock_nested(&dsq->lock,
+			(enq_flags & SCX_ENQ_NESTED) ? SINGLE_DEPTH_NESTING : 0);
+
 		if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
 			scx_error(sch, "attempting to dispatch to a destroyed dsq");
 			/* fall back to the global dsq */
@@ -3728,6 +3745,204 @@ bool scx_hardlockup(void)
 	return true;
 }
 
+static u32 bypass_lb_cpu(struct scx_sched *sch, struct scx_dispatch_q *donor_dsq,
+			 struct cpumask *donee_mask, struct cpumask *resched_mask,
+			 u32 nr_donor_target, u32 nr_donee_target)
+{
+	struct task_struct *p, *n;
+	struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, 0, 0);
+	s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target;
+	u32 nr_balanced = 0, min_delta_us;
+
+	/*
+	 * All we want to guarantee is reasonable forward progress. No reason to
+	 * fine tune. Assuming every task on @donor_dsq runs their full slice,
+	 * consider offloading iff the total queued duration is over the
+	 * threshold.
+	 */
+	min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV;
+	if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us))
+		return 0;
+
+	raw_spin_lock_irq(&donor_dsq->lock);
+	list_add(&cursor.node, &donor_dsq->list);
+resume:
+	n = container_of(&cursor, struct task_struct, scx.dsq_list);
+	n = nldsq_next_task(donor_dsq, n, false);
+
+	while ((p = n)) {
+		struct rq *donee_rq;
+		struct scx_dispatch_q *donee_dsq;
+		int donee;
+
+		n = nldsq_next_task(donor_dsq, n, false);
+
+		if (donor_dsq->nr <= nr_donor_target)
+			break;
+
+		if (cpumask_empty(donee_mask))
+			break;
+
+		donee = cpumask_any_and_distribute(donee_mask, p->cpus_ptr);
+		if (donee >= nr_cpu_ids)
+			continue;
+
+		donee_rq = cpu_rq(donee);
+		donee_dsq = &donee_rq->scx.bypass_dsq;
+
+		/*
+		 * $p's rq is not locked but $p's DSQ lock protects its
+		 * scheduling properties making this test safe.
+		 */
+		if (!task_can_run_on_remote_rq(sch, p, donee_rq, false))
+			continue;
+
+		/*
+		 * Moving $p from one non-local DSQ to another. The source DSQ
+		 * is already locked. Do an abbreviated dequeue and then perform
+		 * enqueue without unlocking $donor_dsq.
+		 *
+		 * We don't want to drop and reacquire the lock on each
+		 * iteration as @donor_dsq can be very long and potentially
+		 * highly contended. Donee DSQs are less likely to be contended.
+		 * The nested locking is safe as only this LB moves tasks
+		 * between bypass DSQs.
+		 */
+		task_unlink_from_dsq(p, donor_dsq);
+		p->scx.dsq = NULL;
+		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+
+		/*
+		 * $donee might have been idle and need to be woken up. No need
+		 * to be clever. Kick every CPU that receives tasks.
+		 */
+		cpumask_set_cpu(donee, resched_mask);
+
+		if (READ_ONCE(donee_dsq->nr) >= nr_donee_target)
+			cpumask_clear_cpu(donee, donee_mask);
+
+		nr_balanced++;
+		if (!(nr_balanced % SCX_BYPASS_LB_BATCH) && n) {
+			list_move_tail(&cursor.node, &n->scx.dsq_list.node);
+			raw_spin_unlock_irq(&donor_dsq->lock);
+			cpu_relax();
+			raw_spin_lock_irq(&donor_dsq->lock);
+			goto resume;
+		}
+	}
+
+	list_del_init(&cursor.node);
+	raw_spin_unlock_irq(&donor_dsq->lock);
+
+	return nr_balanced;
+}
+
+static void bypass_lb_node(struct scx_sched *sch, int node)
+{
+	const struct cpumask *node_mask = cpumask_of_node(node);
+	struct cpumask *donee_mask = scx_bypass_lb_donee_cpumask;
+	struct cpumask *resched_mask = scx_bypass_lb_resched_cpumask;
+	u32 nr_tasks = 0, nr_cpus = 0, nr_balanced = 0;
+	u32 nr_target, nr_donor_target;
+	u32 before_min = U32_MAX, before_max = 0;
+	u32 after_min = U32_MAX, after_max = 0;
+	int cpu;
+
+	/* count the target tasks and CPUs */
+	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
+		u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr);
+
+		nr_tasks += nr;
+		nr_cpus++;
+
+		before_min = min(nr, before_min);
+		before_max = max(nr, before_max);
+	}
+
+	if (!nr_cpus)
+		return;
+
+	/*
+	 * We don't want CPUs to have more than $nr_donor_target tasks and want
+	 * balancing to fill donee CPUs up to $nr_target. Once targets are
+	 * calculated, find the donee CPUs.
+	 */
+	nr_target = DIV_ROUND_UP(nr_tasks, nr_cpus);
+	nr_donor_target = DIV_ROUND_UP(nr_target * SCX_BYPASS_LB_DONOR_PCT, 100);
+
+	cpumask_clear(donee_mask);
+	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
+		if (READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr) < nr_target)
+			cpumask_set_cpu(cpu, donee_mask);
+	}
+
+	/* iterate !donee CPUs and see if they should be offloaded */
+	cpumask_clear(resched_mask);
+	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
+		struct rq *rq = cpu_rq(cpu);
+		struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq;
+
+		if (cpumask_empty(donee_mask))
+			break;
+		if (cpumask_test_cpu(cpu, donee_mask))
+			continue;
+		if (READ_ONCE(donor_dsq->nr) <= nr_donor_target)
+			continue;
+
+		nr_balanced += bypass_lb_cpu(sch, donor_dsq,
+					     donee_mask, resched_mask,
+					     nr_donor_target, nr_target);
+	}
+
+	for_each_cpu(cpu, resched_mask) {
+		struct rq *rq = cpu_rq(cpu);
+
+		raw_spin_rq_lock_irq(rq);
+		resched_curr(rq);
+		raw_spin_rq_unlock_irq(rq);
+	}
+
+	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
+		u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr);
+
+		after_min = min(nr, after_min);
+		after_max = max(nr, after_max);
+	}
+
+	trace_sched_ext_bypass_lb(node, nr_cpus, nr_tasks, nr_balanced,
+				  before_min, before_max, after_min, after_max);
+}
+
+/*
+ * In bypass mode, all tasks are put on the per-CPU bypass DSQs. If the machine
+ * is over-saturated and the BPF scheduler has skewed tasks onto a few CPUs, some
+ * bypass DSQs can be overloaded. If there are enough tasks to saturate other
+ * lightly loaded CPUs, such imbalance can lead to very high execution latency
+ * on the overloaded CPUs and thus to hung tasks and RCU stalls. To avoid such
+ * outcomes, a simple load balancing mechanism is implemented by the following
+ * timer which runs periodically while bypass mode is in effect.
+ */
+static void scx_bypass_lb_timerfn(struct timer_list *timer)
+{
+	struct scx_sched *sch;
+	int node;
+	u32 intv_us;
+
+	sch = rcu_dereference_all(scx_root);
+	if (unlikely(!sch) || !READ_ONCE(scx_bypass_depth))
+		return;
+
+	for_each_node_with_cpus(node)
+		bypass_lb_node(sch, node);
+
+	intv_us = READ_ONCE(scx_bypass_lb_intv_us);
+	if (intv_us)
+		mod_timer(timer, jiffies + usecs_to_jiffies(intv_us));
+}
+
+static DEFINE_TIMER(scx_bypass_lb_timer, scx_bypass_lb_timerfn);
+
 /**
  * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
  * @bypass: true for bypass, false for unbypass
@@ -3771,7 +3986,9 @@ static void scx_bypass(bool bypass)
 	sch = rcu_dereference_bh(scx_root);
 
 	if (bypass) {
-		scx_bypass_depth++;
+		u32 intv_us;
+
+		WRITE_ONCE(scx_bypass_depth, scx_bypass_depth + 1);
 		WARN_ON_ONCE(scx_bypass_depth <= 0);
 		if (scx_bypass_depth != 1)
 			goto unlock;
@@ -3779,8 +3996,15 @@ static void scx_bypass(bool bypass)
 		bypass_timestamp = ktime_get_ns();
 		if (sch)
 			scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1);
+
+		intv_us = READ_ONCE(scx_bypass_lb_intv_us);
+		if (intv_us && !timer_pending(&scx_bypass_lb_timer)) {
+			scx_bypass_lb_timer.expires =
+				jiffies + usecs_to_jiffies(intv_us);
+			add_timer_global(&scx_bypass_lb_timer);
+		}
 	} else {
-		scx_bypass_depth--;
+		WRITE_ONCE(scx_bypass_depth, scx_bypass_depth - 1);
 		WARN_ON_ONCE(scx_bypass_depth < 0);
 		if (scx_bypass_depth != 0)
 			goto unlock;
@@ -7036,6 +7260,12 @@ static int __init scx_init(void)
 		return ret;
 	}
 
+	if (!alloc_cpumask_var(&scx_bypass_lb_donee_cpumask, GFP_KERNEL) ||
+	    !alloc_cpumask_var(&scx_bypass_lb_resched_cpumask, GFP_KERNEL)) {
+		pr_err("sched_ext: Failed to allocate cpumasks\n");
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 __initcall(scx_init);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index dd6f25fb6159..386c677e4c9a 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -23,6 +23,11 @@ enum scx_consts {
 	 * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
 	 */
 	SCX_TASK_ITER_BATCH		= 32,
+
+	SCX_BYPASS_LB_DFL_INTV_US	= 500 * USEC_PER_MSEC,
+	SCX_BYPASS_LB_DONOR_PCT		= 125,
+	SCX_BYPASS_LB_MIN_DELTA_DIV	= 4,
+	SCX_BYPASS_LB_BATCH		= 256,
 };
 
 enum scx_exit_kind {
@@ -963,6 +968,7 @@ enum scx_enq_flags {
 
 	SCX_ENQ_CLEAR_OPSS	= 1LLU << 56,
 	SCX_ENQ_DSQ_PRIQ	= 1LLU << 57,
+	SCX_ENQ_NESTED		= 1LLU << 58,
 };
 
 enum scx_deq_flags {
-- 
2.51.1
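[Editor's note: the donor/donee target math in bypass_lb_node() above can be modeled stand-alone. The following is a rough user-space sketch, not the kernel code; LB_DONOR_PCT mirrors SCX_BYPASS_LB_DONOR_PCT from the ext_internal.h hunk and everything else is redefined locally for illustration.]

```c
#include <assert.h>

#define LB_DONOR_PCT		125

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* even-split target: no CPU should end up with more than ceil(tasks/cpus) */
static unsigned int lb_nr_target(unsigned int nr_tasks, unsigned int nr_cpus)
{
	return DIV_ROUND_UP(nr_tasks, nr_cpus);
}

/* CPUs above 125% of the even split are treated as donors */
static unsigned int lb_nr_donor_target(unsigned int nr_target)
{
	return DIV_ROUND_UP(nr_target * LB_DONOR_PCT, 100);
}
```

A CPU whose bypass_dsq.nr is below lb_nr_target() becomes a donee; one above lb_nr_donor_target() is scanned by bypass_lb_cpu() for tasks to offload.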


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode
  2025-11-09 18:31 ` [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode Tejun Heo
@ 2025-11-10  6:57   ` Andrea Righi
  2025-11-10 16:08     ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  6:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel, Patrick Lu

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:00AM -1000, Tejun Heo wrote:
> In the default CPU selection path used during bypass mode, select_task_rq_scx()
> set p->scx.ddsp_dsq_id to SCX_DSQ_LOCAL to emulate direct dispatch. However,
> do_enqueue_task() ignores ddsp_dsq_id in bypass mode and queues to the global
> DSQ, leaving ddsp_dsq_id dangling. This triggers WARN_ON_ONCE() in
> mark_direct_dispatch() if the task later gets direct dispatched.

The patch makes sense and I was actually testing something similar to fix
https://github.com/sched-ext/scx/issues/2758.

However, in dispatch_enqueue() we're always clearing p->scx.ddsp_dsq_id
(SCX_DSQ_INVALID), even when we're targeting the global DSQ due to bypass
mode, so in this scenario we shouldn't see a stale ddsp_dsq_id. Am I
missing something?

Thanks,
-Andrea

> 
> Don't use direct dispatch from bypass. Just return the selected CPU, which has
> the effect of waking up the picked idle CPU. Later patches will implement
> per-CPU bypass DSQs to resolve this issue in a more proper way.
> 
> Reported-by: Patrick Lu <patlu@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 652a364e9e4c..cf8d86a2585c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2521,12 +2521,8 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>  		s32 cpu;
>  
>  		cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, NULL, 0);
> -		if (cpu >= 0) {
> -			refill_task_slice_dfl(sch, p);
> -			p->scx.ddsp_dsq_id = SCX_DSQ_LOCAL;
> -		} else {
> +		if (cpu < 0)
>  			cpu = prev_cpu;
> -		}
>  		p->scx.selected_cpu = cpu;
>  
>  		if (rq_bypass)
> -- 
> 2.51.1
> 


* Re: [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
@ 2025-11-10  7:03   ` Andrea Righi
  2025-11-10  7:59     ` Andrea Righi
  2025-11-10 16:21     ` Tejun Heo
  2025-11-10  8:22   ` Andrea Righi
  2025-11-11 14:57   ` Dan Schatzberg
  2 siblings, 2 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  7:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:01AM -1000, Tejun Heo wrote:
> There have been reported cases of bypass mode not making forward progress fast
> enough. The 20ms default slice is unnecessarily long for bypass mode where the
> primary goal is ensuring all tasks can make forward progress.
> 
> Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
> switch to it when entering bypass mode. Also make both the default and bypass
> slice values tunable through module parameters (slice_dfl_us and
> slice_bypass_us, adjustable between 100us and 100ms) to make it easier to test
> whether slice durations are a factor in problem cases. Note that the configured
> values are applied through bypass mode switching and thus are guaranteed to
> apply only during scheduler [un]load operations.

IIRC Changwoo suggested to introduce a tunable to change the default time
slice in the past.

I agree that slice_bypass_us can be a tunable in sysfs, but I think it'd be
nicer if the default time slice would be a property of sched_ext_ops, is
there any reason to not do that?

Thanks,
-Andrea

> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/sched/ext.h | 11 +++++++++++
>  kernel/sched/ext.c        | 37 ++++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index eb776b094d36..9f5b0f2be310 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -17,7 +17,18 @@
>  enum scx_public_consts {
>  	SCX_OPS_NAME_LEN	= 128,
>  
> +	/*
> +	 * %SCX_SLICE_DFL is used to refill slices when the BPF scheduler misses
> +	 * to set the slice for a task that is selected for execution.
> +	 * %SCX_EV_REFILL_SLICE_DFL counts the number of times the default slice
> +	 * refill has been triggered.
> +	 *
> +	 * %SCX_SLICE_BYPASS is used as the slice for all tasks in the bypass
> +	 * mode. As mkaing forward progress for all tasks is the main goal of
> +	 * the bypass mode, a shorter slice is used.
> +	 */
>  	SCX_SLICE_DFL		= 20 * 1000000,	/* 20ms */
> +	SCX_SLICE_BYPASS	=  5 * 1000000, /*  5ms */
>  	SCX_SLICE_INF		= U64_MAX,	/* infinite, implies nohz */
>  };
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index cf8d86a2585c..2ce226018dbe 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -143,6 +143,35 @@ static struct scx_dump_data scx_dump_data = {
>  /* /sys/kernel/sched_ext interface */
>  static struct kset *scx_kset;
>  
> +/*
> > + * Parameters that can be adjusted through /sys/module/sched_ext/parameters.
> > + * There usually is no reason to modify these as normal scheduler operation
> + * shouldn't be affected by them. The knobs are primarily for debugging.
> + */
> +static u64 scx_slice_dfl = SCX_SLICE_DFL;
> +static unsigned int scx_slice_dfl_us = SCX_SLICE_DFL / NSEC_PER_USEC;
> +static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC;
> +
> +static int set_slice_us(const char *val, const struct kernel_param *kp)
> +{
> +	return param_set_uint_minmax(val, kp, 100, 100 * USEC_PER_MSEC);
> +}
> +
> +static const struct kernel_param_ops slice_us_param_ops = {
> +	.set = set_slice_us,
> +	.get = param_get_uint,
> +};
> +
> +#undef MODULE_PARAM_PREFIX
> +#define MODULE_PARAM_PREFIX	"sched_ext."
> +
> +module_param_cb(slice_dfl_us, &slice_us_param_ops, &scx_slice_dfl_us, 0600);
> +MODULE_PARM_DESC(slice_dfl_us, "default slice in microseconds, applied on [un]load (100us to 100ms)");
> +module_param_cb(slice_bypass_us, &slice_us_param_ops, &scx_slice_bypass_us, 0600);
> +MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]load (100us to 100ms)");
> +
> +#undef MODULE_PARAM_PREFIX
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/sched_ext.h>
>  
> @@ -919,7 +948,7 @@ static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
>  
>  static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p)
>  {
> -	p->scx.slice = SCX_SLICE_DFL;
> +	p->scx.slice = scx_slice_dfl;
>  	__scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1);
>  }
>  
> @@ -2892,7 +2921,7 @@ void init_scx_entity(struct sched_ext_entity *scx)
>  	INIT_LIST_HEAD(&scx->runnable_node);
>  	scx->runnable_at = jiffies;
>  	scx->ddsp_dsq_id = SCX_DSQ_INVALID;
> -	scx->slice = SCX_SLICE_DFL;
> +	scx->slice = scx_slice_dfl;
>  }
>  
>  void scx_pre_fork(struct task_struct *p)
> @@ -3770,6 +3799,7 @@ static void scx_bypass(bool bypass)
>  		WARN_ON_ONCE(scx_bypass_depth <= 0);
>  		if (scx_bypass_depth != 1)
>  			goto unlock;
> +		scx_slice_dfl = scx_slice_bypass_us * NSEC_PER_USEC;
>  		bypass_timestamp = ktime_get_ns();
>  		if (sch)
>  			scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1);
> @@ -3778,6 +3808,7 @@ static void scx_bypass(bool bypass)
>  		WARN_ON_ONCE(scx_bypass_depth < 0);
>  		if (scx_bypass_depth != 0)
>  			goto unlock;
> +		scx_slice_dfl = scx_slice_dfl_us * NSEC_PER_USEC;
>  		if (sch)
>  			scx_add_event(sch, SCX_EV_BYPASS_DURATION,
>  				      ktime_get_ns() - bypass_timestamp);
> @@ -4776,7 +4807,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  			queue_flags |= DEQUEUE_CLASS;
>  
>  		scoped_guard (sched_change, p, queue_flags) {
> -			p->scx.slice = SCX_SLICE_DFL;
> +			p->scx.slice = scx_slice_dfl;
>  			p->sched_class = new_class;
>  		}
>  	}
> -- 
> 2.51.1
> 
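[Editor's note: the slice tunables discussed above are easy to model: set_slice_us() accepts only 100us to 100ms via param_set_uint_minmax(), and scx_bypass() applies the accepted value by converting microseconds to nanoseconds. A hedged stand-alone sketch in plain C, with constants redefined locally — not the kernel implementation:]

```c
#include <assert.h>

#define NSEC_PER_USEC	1000ULL
#define USEC_PER_MSEC	1000U

/* model of set_slice_us(): only values in [100us, 100ms] are accepted */
static int slice_us_valid(unsigned int us)
{
	return us >= 100 && us <= 100 * USEC_PER_MSEC;
}

/* scx_bypass() applies the tunable by converting microseconds to ns */
static unsigned long long slice_us_to_ns(unsigned int us)
{
	return us * NSEC_PER_USEC;
}
```

For example, the 5ms SCX_SLICE_BYPASS default corresponds to slice_bypass_us = 5000, i.e. 5000000ns after conversion.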


* Re: [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths
  2025-11-09 18:31 ` [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
@ 2025-11-10  7:21   ` Andrea Righi
  0 siblings, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  7:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:02AM -1000, Tejun Heo wrote:
> The local and global DSQ enqueue paths in do_enqueue_task() share the same
> slice refill logic. Factor out the common code into a shared enqueue label.
> This makes adding new enqueue cases easier. No functional changes.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 2ce226018dbe..a29bfadde89d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1282,6 +1282,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  {
>  	struct scx_sched *sch = scx_root;
>  	struct task_struct **ddsp_taskp;
> +	struct scx_dispatch_q *dsq;
>  	unsigned long qseq;
>  
>  	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
> @@ -1349,8 +1350,17 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  direct:
>  	direct_dispatch(sch, p, enq_flags);
>  	return;
> -
> +local_norefill:
> +	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
> +	return;
>  local:
> +	dsq = &rq->scx.local_dsq;
> +	goto enqueue;
> +global:
> +	dsq = find_global_dsq(sch, p);
> +	goto enqueue;
> +
> +enqueue:
>  	/*
>  	 * For task-ordering, slice refill must be treated as implying the end
>  	 * of the current slice. Otherwise, the longer @p stays on the CPU, the
> @@ -1358,14 +1368,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 */
>  	touch_core_sched(rq, p);
>  	refill_task_slice_dfl(sch, p);
> -local_norefill:
> -	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
> -	return;
> -
> -global:
> -	touch_core_sched(rq, p);	/* see the comment in local: */
> -	refill_task_slice_dfl(sch, p);
> -	dispatch_enqueue(sch, find_global_dsq(sch, p), p, enq_flags);
> +	dispatch_enqueue(sch, dsq, p, enq_flags);
>  }
>  
>  static bool task_runnable(const struct task_struct *p)
> -- 
> 2.51.1
> 


* Re: [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  2025-11-09 18:31 ` [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
@ 2025-11-10  7:42   ` Andrea Righi
  2025-11-10 16:42     ` Tejun Heo
  2025-11-11 15:31   ` Dan Schatzberg
  1 sibling, 1 reply; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  7:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:03AM -1000, Tejun Heo wrote:
> When bypass mode is activated, tasks are routed through a fallback dispatch
> queue instead of the BPF scheduler. Originally, bypass mode used a single
> global DSQ, but this didn't scale well on NUMA machines and could lead to
> livelocks. In b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"),
> this was changed to use per-node global DSQs, which resolved the
> cross-node-related livelocks.
> 
> However, Dan Schatzberg found that per-node global DSQ can also livelock in a
> different scenario: On a NUMA node with many CPUs and many threads pinned to
> different small subsets of CPUs, each CPU often has to scan through many tasks
> it cannot run to find the one task it can run. With a high number of CPUs,
> this scanning overhead can easily cause livelocks.
> 
> Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
> on the CPU that it's currently on. Because the default idle CPU selection
> policy and direct dispatch are both active during bypass, this works well in
> most cases including the above.

Is there any reason not to reuse rq->scx.local_dsq for this?

Thanks,
-Andrea

> 
> However, this does have a failure mode in highly over-saturated systems where
> tasks are concentrated on a single CPU. If the BPF scheduler places most tasks
> on one CPU and then triggers bypass mode, bypass mode will keep those tasks on
> that one CPU, which can lead to failures such as RCU stalls as the queue may be
> too long for that CPU to drain in a reasonable time. This will be addressed
> with a load balancer in a future patch. The bypass DSQ is kept separate from
> the local DSQ to allow the load balancer to move tasks between bypass DSQs.
> 
> Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/sched/ext.h |  1 +
>  kernel/sched/ext.c        | 16 +++++++++++++---
>  kernel/sched/sched.h      |  1 +
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 9f5b0f2be310..e1502faf6241 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -57,6 +57,7 @@ enum scx_dsq_id_flags {
>  	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
>  	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
>  	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
> +	SCX_DSQ_BYPASS		= SCX_DSQ_FLAG_BUILTIN | 3,
>  	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
>  	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
>  };
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a29bfadde89d..4b8b91494947 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1301,7 +1301,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  
>  	if (scx_rq_bypassing(rq)) {
>  		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> -		goto global;
> +		goto bypass;
>  	}
>  
>  	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> @@ -1359,6 +1359,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  global:
>  	dsq = find_global_dsq(sch, p);
>  	goto enqueue;
> +bypass:
> +	dsq = &task_rq(p)->scx.bypass_dsq;
> +	goto enqueue;
>  
>  enqueue:
>  	/*
> @@ -2157,8 +2160,14 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>  	if (consume_global_dsq(sch, rq))
>  		goto has_tasks;
>  
> -	if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
> -	    scx_rq_bypassing(rq) || !scx_rq_online(rq))
> +	if (scx_rq_bypassing(rq)) {
> +		if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
> +			goto has_tasks;
> +		else
> +			goto no_tasks;
> +	}
> +
> +	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
>  		goto no_tasks;
>  
>  	dspc->rq = rq;
> @@ -5370,6 +5379,7 @@ void __init init_sched_ext_class(void)
>  		int  n = cpu_to_node(cpu);
>  
>  		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
> +		init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
>  		INIT_LIST_HEAD(&rq->scx.runnable_list);
>  		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 27aae2a298f8..5991133a4849 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -808,6 +808,7 @@ struct scx_rq {
>  	struct balance_callback	deferred_bal_cb;
>  	struct irq_work		deferred_irq_work;
>  	struct irq_work		kick_cpus_irq_work;
> +	struct scx_dispatch_q	bypass_dsq;
>  };
>  #endif /* CONFIG_SCHED_CLASS_EXT */
>  
> -- 
> 2.51.1
> 


* Re: [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag
  2025-11-09 18:31 ` [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
@ 2025-11-10  7:45   ` Andrea Righi
  2025-11-11 15:34   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  7:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:04AM -1000, Tejun Heo wrote:
> The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid
> live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the
> ops breather and eject BPF scheduler on softlockup") to prevent live-locks by
> injecting delays when CPUs are trapped in dispatch paths.
> 
> Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup
> (unsigned long) with separate increment/decrement and cleanup operations. The
> breather is only activated when aborting, so tie it directly to the exit
> mechanism. Replace both variables with scx_aborting flag set when exit is
> claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to
> consolidate exit_kind claiming and breather enablement. This eliminates
> scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass().
> 
> The breather mechanism will be replaced by a different abort mechanism in a
> future patch. This simplification prepares for that change.

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 54 +++++++++++++++++++++-------------------------
>  1 file changed, 25 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 4b8b91494947..905d01f74687 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -33,9 +33,8 @@ static DEFINE_MUTEX(scx_enable_mutex);
>  DEFINE_STATIC_KEY_FALSE(__scx_enabled);
>  DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
>  static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
> -static unsigned long scx_in_softlockup;
> -static atomic_t scx_breather_depth = ATOMIC_INIT(0);
>  static int scx_bypass_depth;
> +static bool scx_aborting;
>  static bool scx_init_task_enabled;
>  static bool scx_switching_all;
>  DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
> @@ -1834,7 +1833,7 @@ static void scx_breather(struct rq *rq)
>  
>  	lockdep_assert_rq_held(rq);
>  
> -	if (likely(!atomic_read(&scx_breather_depth)))
> +	if (likely(!READ_ONCE(scx_aborting)))
>  		return;
>  
>  	raw_spin_rq_unlock(rq);
> @@ -1843,9 +1842,9 @@ static void scx_breather(struct rq *rq)
>  
>  	do {
>  		int cnt = 1024;
> -		while (atomic_read(&scx_breather_depth) && --cnt)
> +		while (READ_ONCE(scx_aborting) && --cnt)
>  			cpu_relax();
> -	} while (atomic_read(&scx_breather_depth) &&
> +	} while (READ_ONCE(scx_aborting) &&
>  		 time_before64(ktime_get_ns(), until));
>  
>  	raw_spin_rq_lock(rq);
> @@ -3740,30 +3739,14 @@ void scx_softlockup(u32 dur_s)
>  		goto out_unlock;
>  	}
>  
> -	/* allow only one instance, cleared at the end of scx_bypass() */
> -	if (test_and_set_bit(0, &scx_in_softlockup))
> -		goto out_unlock;
> -
>  	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
>  			smp_processor_id(), dur_s, scx_root->ops.name);
>  
> -	/*
> -	 * Some CPUs may be trapped in the dispatch paths. Enable breather
> -	 * immediately; otherwise, we might even be able to get to scx_bypass().
> -	 */
> -	atomic_inc(&scx_breather_depth);
> -
>  	scx_error(sch, "soft lockup - CPU#%d stuck for %us", smp_processor_id(), dur_s);
>  out_unlock:
>  	rcu_read_unlock();
>  }
>  
> -static void scx_clear_softlockup(void)
> -{
> -	if (test_and_clear_bit(0, &scx_in_softlockup))
> -		atomic_dec(&scx_breather_depth);
> -}
> -
>  /**
>   * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
>   * @bypass: true for bypass, false for unbypass
> @@ -3826,8 +3809,6 @@ static void scx_bypass(bool bypass)
>  				      ktime_get_ns() - bypass_timestamp);
>  	}
>  
> -	atomic_inc(&scx_breather_depth);
> -
>  	/*
>  	 * No task property is changing. We just need to make sure all currently
>  	 * queued tasks are re-queued according to the new scx_rq_bypassing()
> @@ -3883,10 +3864,8 @@ static void scx_bypass(bool bypass)
>  		raw_spin_rq_unlock(rq);
>  	}
>  
> -	atomic_dec(&scx_breather_depth);
>  unlock:
>  	raw_spin_unlock_irqrestore(&bypass_lock, flags);
> -	scx_clear_softlockup();
>  }
>  
>  static void free_exit_info(struct scx_exit_info *ei)
> @@ -3981,6 +3960,7 @@ static void scx_disable_workfn(struct kthread_work *work)
>  
>  	/* guarantee forward progress by bypassing scx_ops */
>  	scx_bypass(true);
> +	WRITE_ONCE(scx_aborting, false);
>  
>  	switch (scx_set_enable_state(SCX_DISABLING)) {
>  	case SCX_DISABLING:
> @@ -4103,9 +4083,24 @@ static void scx_disable_workfn(struct kthread_work *work)
>  	scx_bypass(false);
>  }
>  
> -static void scx_disable(enum scx_exit_kind kind)
> +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind)
>  {
>  	int none = SCX_EXIT_NONE;
> +
> +	if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind))
> +		return false;
> +
> +	/*
> +	 * Some CPUs may be trapped in the dispatch paths. Enable breather
> +	 * immediately; otherwise, we might not even be able to get to
> +	 * scx_bypass().
> +	 */
> +	WRITE_ONCE(scx_aborting, true);
> +	return true;
> +}
> +
> +static void scx_disable(enum scx_exit_kind kind)
> +{
>  	struct scx_sched *sch;
>  
>  	if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE))
> @@ -4114,7 +4109,7 @@ static void scx_disable(enum scx_exit_kind kind)
>  	rcu_read_lock();
>  	sch = rcu_dereference(scx_root);
>  	if (sch) {
> -		atomic_try_cmpxchg(&sch->exit_kind, &none, kind);
> +		scx_claim_exit(sch, kind);
>  		kthread_queue_work(sch->helper, &sch->disable_work);
>  	}
>  	rcu_read_unlock();
> @@ -4435,9 +4430,8 @@ static void scx_vexit(struct scx_sched *sch,
>  		      const char *fmt, va_list args)
>  {
>  	struct scx_exit_info *ei = sch->exit_info;
> -	int none = SCX_EXIT_NONE;
>  
> -	if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind))
> +	if (!scx_claim_exit(sch, kind))
>  		return;
>  
>  	ei->exit_code = exit_code;
> @@ -4653,6 +4647,8 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	 */
>  	WARN_ON_ONCE(scx_set_enable_state(SCX_ENABLING) != SCX_DISABLED);
>  	WARN_ON_ONCE(scx_root);
> +	if (WARN_ON_ONCE(READ_ONCE(scx_aborting)))
> +		WRITE_ONCE(scx_aborting, false);
>  
>  	atomic_long_set(&scx_nr_rejected, 0);
>  
> -- 
> 2.51.1
> 
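[Editor's note: the claim-then-flag pattern that scx_claim_exit() introduces in the patch above — first claimant of exit_kind wins, then raises the abort flag so trapped CPUs can bail out — can be sketched stand-alone. This is a hypothetical model using C11 stdatomic, not the kernel's atomic_t API:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { EXIT_NONE = 0, EXIT_ERROR = 1024 };	/* illustrative values */

static _Atomic int exit_kind = EXIT_NONE;
static atomic_bool aborting;

/* first caller to move exit_kind away from NONE wins and raises the flag */
static bool claim_exit(int kind)
{
	int none = EXIT_NONE;

	if (!atomic_compare_exchange_strong(&exit_kind, &none, kind))
		return false;
	atomic_store(&aborting, true);	/* lets CPUs stuck in dispatch bail out */
	return true;
}
```

Subsequent callers lose the cmpxchg and return false, so the exit reason recorded by the first claimant is never overwritten.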


* Re: [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-10  7:03   ` Andrea Righi
@ 2025-11-10  7:59     ` Andrea Righi
  2025-11-10 16:21     ` Tejun Heo
  1 sibling, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  7:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Mon, Nov 10, 2025 at 08:03:45AM +0100, Andrea Righi wrote:
> Hi Tejun,
> 
> On Sun, Nov 09, 2025 at 08:31:01AM -1000, Tejun Heo wrote:
> > There have been reported cases of bypass mode not making forward progress fast
> > enough. The 20ms default slice is unnecessarily long for bypass mode where the
> > primary goal is ensuring all tasks can make forward progress.
> > 
> > Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
> > switch to it when entering bypass mode. Also make both the default and bypass
> > slice values tunable through module parameters (slice_dfl_us and
> > slice_bypass_us, adjustable between 100us and 100ms) to make it easier to test
> > whether slice durations are a factor in problem cases. Note that the configured
> > values are applied through bypass mode switching and thus are guaranteed to
> > apply only during scheduler [un]load operations.
> 
> IIRC Changwoo suggested to introduce a tunable to change the default time
> slice in the past.
> 
> I agree that slice_bypass_us can be a tunable in sysfs, but I think it'd be
> nicer if the default time slice would be a property of sched_ext_ops, is
> there any reason to not do that?

Moreover (not necessarily for this patchset, we can add this later), should
we turn SCX_SLICE_DFL into a special value (e.g., 0) and have the
schedulers that currently rely on it automatically pick up the new global
default time slice internally?

Thanks,
-Andrea

> 
> Thanks,
> -Andrea
> 
> > 
> > Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> > Cc: Emil Tsalapatis <etsal@meta.com>
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> >  include/linux/sched/ext.h | 11 +++++++++++
> >  kernel/sched/ext.c        | 37 ++++++++++++++++++++++++++++++++++---
> >  2 files changed, 45 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index eb776b094d36..9f5b0f2be310 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -17,7 +17,18 @@
> >  enum scx_public_consts {
> >  	SCX_OPS_NAME_LEN	= 128,
> >  
> > +	/*
> > +	 * %SCX_SLICE_DFL is used to refill slices when the BPF scheduler fails
> > +	 * to set the slice for a task that is selected for execution.
> > +	 * %SCX_EV_REFILL_SLICE_DFL counts the number of times the default slice
> > +	 * refill has been triggered.
> > +	 *
> > +	 * %SCX_SLICE_BYPASS is used as the slice for all tasks in the bypass
> > +	 * mode. As mkaing forward progress for all tasks is the main goal of
> > +	 * the bypass mode, a shorter slice is used.
> > +	 */
> >  	SCX_SLICE_DFL		= 20 * 1000000,	/* 20ms */
> > +	SCX_SLICE_BYPASS	=  5 * 1000000, /*  5ms */
> >  	SCX_SLICE_INF		= U64_MAX,	/* infinite, implies nohz */
> >  };
> >  
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index cf8d86a2585c..2ce226018dbe 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -143,6 +143,35 @@ static struct scx_dump_data scx_dump_data = {
> >  /* /sys/kernel/sched_ext interface */
> >  static struct kset *scx_kset;
> >  
> > +/*
> > + * Parameters that can be adjusted through /sys/module/sched_ext/parameters.
> > + * There usually is no reason to modify these as normal scheduler operation
> > + * shouldn't be affected by them. The knobs are primarily for debugging.
> > + */
> > +static u64 scx_slice_dfl = SCX_SLICE_DFL;
> > +static unsigned int scx_slice_dfl_us = SCX_SLICE_DFL / NSEC_PER_USEC;
> > +static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC;
> > +
> > +static int set_slice_us(const char *val, const struct kernel_param *kp)
> > +{
> > +	return param_set_uint_minmax(val, kp, 100, 100 * USEC_PER_MSEC);
> > +}
> > +
> > +static const struct kernel_param_ops slice_us_param_ops = {
> > +	.set = set_slice_us,
> > +	.get = param_get_uint,
> > +};
> > +
> > +#undef MODULE_PARAM_PREFIX
> > +#define MODULE_PARAM_PREFIX	"sched_ext."
> > +
> > +module_param_cb(slice_dfl_us, &slice_us_param_ops, &scx_slice_dfl_us, 0600);
> > +MODULE_PARM_DESC(slice_dfl_us, "default slice in microseconds, applied on [un]load (100us to 100ms)");
> > +module_param_cb(slice_bypass_us, &slice_us_param_ops, &scx_slice_bypass_us, 0600);
> > +MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]load (100us to 100ms)");
> > +
> > +#undef MODULE_PARAM_PREFIX
> > +
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/sched_ext.h>
> >  
> > @@ -919,7 +948,7 @@ static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
> >  
> >  static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p)
> >  {
> > -	p->scx.slice = SCX_SLICE_DFL;
> > +	p->scx.slice = scx_slice_dfl;
> >  	__scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1);
> >  }
> >  
> > @@ -2892,7 +2921,7 @@ void init_scx_entity(struct sched_ext_entity *scx)
> >  	INIT_LIST_HEAD(&scx->runnable_node);
> >  	scx->runnable_at = jiffies;
> >  	scx->ddsp_dsq_id = SCX_DSQ_INVALID;
> > -	scx->slice = SCX_SLICE_DFL;
> > +	scx->slice = scx_slice_dfl;
> >  }
> >  
> >  void scx_pre_fork(struct task_struct *p)
> > @@ -3770,6 +3799,7 @@ static void scx_bypass(bool bypass)
> >  		WARN_ON_ONCE(scx_bypass_depth <= 0);
> >  		if (scx_bypass_depth != 1)
> >  			goto unlock;
> > +		scx_slice_dfl = scx_slice_bypass_us * NSEC_PER_USEC;
> >  		bypass_timestamp = ktime_get_ns();
> >  		if (sch)
> >  			scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1);
> > @@ -3778,6 +3808,7 @@ static void scx_bypass(bool bypass)
> >  		WARN_ON_ONCE(scx_bypass_depth < 0);
> >  		if (scx_bypass_depth != 0)
> >  			goto unlock;
> > +		scx_slice_dfl = scx_slice_dfl_us * NSEC_PER_USEC;
> >  		if (sch)
> >  			scx_add_event(sch, SCX_EV_BYPASS_DURATION,
> >  				      ktime_get_ns() - bypass_timestamp);
> > @@ -4776,7 +4807,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
> >  			queue_flags |= DEQUEUE_CLASS;
> >  
> >  		scoped_guard (sched_change, p, queue_flags) {
> > -			p->scx.slice = SCX_SLICE_DFL;
> > +			p->scx.slice = scx_slice_dfl;
> >  			p->sched_class = new_class;
> >  		}
> >  	}
> > -- 
> > 2.51.1
> > 


* Re: [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting
  2025-11-09 18:31 ` [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
@ 2025-11-10  8:20   ` Andrea Righi
  2025-11-10 18:51     ` Tejun Heo
  2025-11-11 15:46   ` Dan Schatzberg
  1 sibling, 1 reply; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:05AM -1000, Tejun Heo wrote:
> 62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
> the breather mechanism to inject delays during bypass mode switching. It
> maintains operation semantics unchanged while reducing lock contention to avoid
> live-locks on large NUMA systems.
> 
> However, the breather only activates when exiting the scheduler, so there's no
> need to maintain operation semantics. Simplify by exiting dispatch and move
> operations immediately when scx_aborting is set. In consume_dispatch_q(), break
> out of the task iteration loop. In scx_dsq_move(), return early before
> acquiring locks.
> 
> This also fixes cases the breather mechanism cannot handle. When a large system
> has many runnable threads affinitized to different CPU subsets and the BPF
> scheduler places them all into a single DSQ, many CPUs can scan the DSQ
> concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
> for extended periods, leading to various failure modes. The breather cannot
> solve this because once in the consume loop, there's no exit. The new mechanism
> fixes this by exiting the loop immediately.
> 
> The bypass DSQ is exempted to ensure the bypass mechanism itself can make
> progress.
> 
> Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 62 ++++++++++++++--------------------------------
>  1 file changed, 18 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 905d01f74687..afa89ca3659e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1821,48 +1821,11 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  	return dst_rq;
>  }
>  
> -/*
> - * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
> - * banging on the same DSQ on a large NUMA system to the point where switching
> - * to the bypass mode can take a long time. Inject artificial delays while the
> - * bypass mode is switching to guarantee timely completion.
> - */
> -static void scx_breather(struct rq *rq)
> -{
> -	u64 until;
> -
> -	lockdep_assert_rq_held(rq);
> -
> -	if (likely(!READ_ONCE(scx_aborting)))
> -		return;
> -
> -	raw_spin_rq_unlock(rq);
> -
> -	until = ktime_get_ns() + NSEC_PER_MSEC;
> -
> -	do {
> -		int cnt = 1024;
> -		while (READ_ONCE(scx_aborting) && --cnt)
> -			cpu_relax();
> -	} while (READ_ONCE(scx_aborting) &&
> -		 time_before64(ktime_get_ns(), until));
> -
> -	raw_spin_rq_lock(rq);
> -}
> -
>  static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
>  			       struct scx_dispatch_q *dsq)
>  {
>  	struct task_struct *p;
>  retry:
> -	/*
> -	 * This retry loop can repeatedly race against scx_bypass() dequeueing
> -	 * tasks from @dsq trying to put the system into the bypass mode. On
> -	 * some multi-socket machines (e.g. 2x Intel 8480c), this can live-lock
> -	 * the machine into soft lockups. Give a breather.
> -	 */
> -	scx_breather(rq);
> -
>  	/*
>  	 * The caller can't expect to successfully consume a task if the task's
>  	 * addition to @dsq isn't guaranteed to be visible somehow. Test
> @@ -1876,6 +1839,17 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
>  	nldsq_for_each_task(p, dsq) {
>  		struct rq *task_rq = task_rq(p);
>  
> +		/*
> +		 * This loop can lead to multiple lockup scenarios, e.g. the BPF
> +		 * scheduler can put an enormous number of affinitized tasks into
> +		 * a contended DSQ, or the outer retry loop can repeatedly race
> +		 * against scx_bypass() dequeueing tasks from @dsq trying to put
> +		 * the system into the bypass mode. This can easily live-lock the
> +		 * machine. If aborting, exit from all non-bypass DSQs.
> +		 */
> +		if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS)
> +			break;
> +
>  		if (rq == task_rq) {
>  			task_unlink_from_dsq(p, dsq);
>  			move_local_task_to_local_dsq(p, 0, dsq, rq);
> @@ -5635,6 +5609,13 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>  	    !scx_kf_allowed(sch, SCX_KF_DISPATCH))
>  		return false;
>  
> +	/*
> +	 * If the BPF scheduler keeps calling this function repeatedly, it can
> +	 * cause similar live-lock conditions as consume_dispatch_q().
> +	 */
> +	if (unlikely(scx_aborting))

READ_ONCE(scx_aborting)?

Thanks,
-Andrea

> +		return false;
> +
>  	/*
>  	 * Can be called from either ops.dispatch() locking this_rq() or any
>  	 * context where no rq lock is held. If latter, lock @p's task_rq which
> @@ -5655,13 +5636,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>  		raw_spin_rq_lock(src_rq);
>  	}
>  
> -	/*
> -	 * If the BPF scheduler keeps calling this function repeatedly, it can
> -	 * cause similar live-lock conditions as consume_dispatch_q(). Insert a
> -	 * breather if necessary.
> -	 */
> -	scx_breather(src_rq);
> -
>  	locked_rq = src_rq;
>  	raw_spin_lock(&src_dsq->lock);
>  
> -- 
> 2.51.1
> 


* Re: [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
  2025-11-10  7:03   ` Andrea Righi
@ 2025-11-10  8:22   ` Andrea Righi
  2025-11-11 14:57   ` Dan Schatzberg
  2 siblings, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:01AM -1000, Tejun Heo wrote:
...
> +	/*
> +	 * %SCX_SLICE_DFL is used to refill slices when the BPF scheduler fails
> +	 * to set the slice for a task that is selected for execution.
> +	 * %SCX_EV_REFILL_SLICE_DFL counts the number of times the default slice
> +	 * refill has been triggered.
> +	 *
> +	 * %SCX_SLICE_BYPASS is used as the slice for all tasks in the bypass
> +	 * mode. As mkaing forward progress for all tasks is the main goal of

Small typo: s/mkaing/making/

-Andrea

> +	 * the bypass mode, a shorter slice is used.
> +	 */
>  	SCX_SLICE_DFL		= 20 * 1000000,	/* 20ms */
> +	SCX_SLICE_BYPASS	=  5 * 1000000, /*  5ms */
>  	SCX_SLICE_INF		= U64_MAX,	/* infinite, implies nohz */
>  };
>  


* Re: [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool
  2025-11-09 18:31 ` [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
@ 2025-11-10  8:28   ` Andrea Righi
  2025-11-11 15:48   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:06AM -1000, Tejun Heo wrote:
> Make scx_exit() and scx_vexit() return bool indicating whether the calling
> thread successfully claimed the exit. This will be used by the abort mechanism
> added in a later patch.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  kernel/sched/ext.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index afa89ca3659e..033c8b8e88e8 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -177,18 +177,21 @@ MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]
>  static void process_ddsp_deferred_locals(struct rq *rq);
>  static u32 reenq_local(struct rq *rq);
>  static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
> -static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
> +static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
>  		      s64 exit_code, const char *fmt, va_list args);
>  
> -static __printf(4, 5) void scx_exit(struct scx_sched *sch,
> +static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
>  				    enum scx_exit_kind kind, s64 exit_code,
>  				    const char *fmt, ...)
>  {
>  	va_list args;
> +	bool ret;
>  
>  	va_start(args, fmt);
> -	scx_vexit(sch, kind, exit_code, fmt, args);
> +	ret = scx_vexit(sch, kind, exit_code, fmt, args);
>  	va_end(args);
> +
> +	return ret;
>  }
>  
>  #define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
> @@ -4399,14 +4402,14 @@ static void scx_error_irq_workfn(struct irq_work *irq_work)
>  	kthread_queue_work(sch->helper, &sch->disable_work);
>  }
>  
> -static void scx_vexit(struct scx_sched *sch,
> +static bool scx_vexit(struct scx_sched *sch,
>  		      enum scx_exit_kind kind, s64 exit_code,
>  		      const char *fmt, va_list args)
>  {
>  	struct scx_exit_info *ei = sch->exit_info;
>  
>  	if (!scx_claim_exit(sch, kind))
> -		return;
> +		return false;
>  
>  	ei->exit_code = exit_code;
>  #ifdef CONFIG_STACKTRACE
> @@ -4423,6 +4426,7 @@ static void scx_vexit(struct scx_sched *sch,
>  	ei->reason = scx_exit_reason(ei->kind);
>  
>  	irq_work_queue(&sch->error_irq_work);
> +	return true;
>  }
>  
>  static int alloc_kick_syncs(void)
> -- 
> 2.51.1
> 


* Re: [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup()
  2025-11-09 18:31 ` [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
@ 2025-11-10  8:29   ` Andrea Righi
  2025-11-11 15:49   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:07AM -1000, Tejun Heo wrote:
> scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the
> scheduler is enabled under RCU read lock and trigger an error if so. Extract
> the common pattern into handle_lockup() helper. Add scx_verror() macro and use
> guard(rcu)().
> 
> This simplifies both handlers, reduces code duplication, and prepares for
> hardlockup handling.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  kernel/sched/ext.c | 65 ++++++++++++++++++----------------------------
>  1 file changed, 25 insertions(+), 40 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 033c8b8e88e8..5c75b0125dfe 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -195,6 +195,7 @@ static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
>  }
>  
>  #define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
> +#define scx_verror(sch, fmt, args)	scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
>  
>  #define SCX_HAS_OP(sch, op)	test_bit(SCX_OP_IDX(op), (sch)->has_op)
>  
> @@ -3653,39 +3654,40 @@ bool scx_allow_ttwu_queue(const struct task_struct *p)
>  	return false;
>  }
>  
> -/**
> - * scx_rcu_cpu_stall - sched_ext RCU CPU stall handler
> - *
> - * While there are various reasons why RCU CPU stalls can occur on a system
> - * that may not be caused by the current BPF scheduler, try kicking out the
> - * current scheduler in an attempt to recover the system to a good state before
> - * issuing panics.
> - */
> -bool scx_rcu_cpu_stall(void)
> +static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
>  {
>  	struct scx_sched *sch;
> +	va_list args;
>  
> -	rcu_read_lock();
> +	guard(rcu)();
>  
>  	sch = rcu_dereference(scx_root);
> -	if (unlikely(!sch)) {
> -		rcu_read_unlock();
> +	if (unlikely(!sch))
>  		return false;
> -	}
>  
>  	switch (scx_enable_state()) {
>  	case SCX_ENABLING:
>  	case SCX_ENABLED:
> -		break;
> +		va_start(args, fmt);
> +		scx_verror(sch, fmt, args);
> +		va_end(args);
> +		return true;
>  	default:
> -		rcu_read_unlock();
>  		return false;
>  	}
> +}
>  
> -	scx_error(sch, "RCU CPU stall detected!");
> -	rcu_read_unlock();
> -
> -	return true;
> +/**
> + * scx_rcu_cpu_stall - sched_ext RCU CPU stall handler
> + *
> + * While there are various reasons why RCU CPU stalls can occur on a system
> + * that may not be caused by the current BPF scheduler, try kicking out the
> + * current scheduler in an attempt to recover the system to a good state before
> + * issuing panics.
> + */
> +bool scx_rcu_cpu_stall(void)
> +{
> +	return handle_lockup("RCU CPU stall detected!");
>  }
>  
>  /**
> @@ -3700,28 +3702,11 @@ bool scx_rcu_cpu_stall(void)
>   */
>  void scx_softlockup(u32 dur_s)
>  {
> -	struct scx_sched *sch;
> -
> -	rcu_read_lock();
> -
> -	sch = rcu_dereference(scx_root);
> -	if (unlikely(!sch))
> -		goto out_unlock;
> -
> -	switch (scx_enable_state()) {
> -	case SCX_ENABLING:
> -	case SCX_ENABLED:
> -		break;
> -	default:
> -		goto out_unlock;
> -	}
> -
> -	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
> -			smp_processor_id(), dur_s, scx_root->ops.name);
> +	if (!handle_lockup("soft lockup - CPU %d stuck for %us", smp_processor_id(), dur_s))
> +		return;
>  
> -	scx_error(sch, "soft lockup - CPU#%d stuck for %us", smp_processor_id(), dur_s);
> -out_unlock:
> -	rcu_read_unlock();
> +	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU %d stuck for %us, disabling BPF scheduler\n",
> +			smp_processor_id(), dur_s);
>  }
>  
>  /**
> -- 
> 2.51.1
> 


* Re: [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result
  2025-11-09 18:31 ` [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
@ 2025-11-10  8:29   ` Andrea Righi
  0 siblings, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:08AM -1000, Tejun Heo wrote:
> handle_lockup() currently calls scx_verror() but ignores its return value,
> always returning true when the scheduler is enabled. Make it capture and return
> the result from scx_verror(). This prepares for hardlockup handling.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  kernel/sched/ext.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5c75b0125dfe..4507bc4f0b5c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3658,6 +3658,7 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
>  {
>  	struct scx_sched *sch;
>  	va_list args;
> +	bool ret;
>  
>  	guard(rcu)();
>  
> @@ -3669,9 +3670,9 @@ static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
>  	case SCX_ENABLING:
>  	case SCX_ENABLED:
>  		va_start(args, fmt);
> -		scx_verror(sch, fmt, args);
> +		ret = scx_verror(sch, fmt, args);
>  		va_end(args);
> -		return true;
> +		return ret;
>  	default:
>  		return false;
>  	}
> -- 
> 2.51.1
> 


* Re: [PATCH 10/13] sched_ext: Hook up hardlockup detector
  2025-11-09 18:31 ` [PATCH 10/13] sched_ext: Hook up hardlockup detector Tejun Heo
@ 2025-11-10  8:31   ` Andrea Righi
  0 siblings, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel, Douglas Anderson, Andrew Morton

On Sun, Nov 09, 2025 at 08:31:09AM -1000, Tejun Heo wrote:
> A poorly behaving BPF scheduler can trigger hard lockup. For example, on a
> large system with many tasks pinned to different subsets of CPUs, if the BPF
> scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock
> can be contended to the point where hardlockup triggers. Unfortunately,
> hardlockup can be the first signal out of such situations, thus requiring
> hardlockup handling.
> 
> Hook scx_hardlockup() into the hardlockup detector to try kicking out the
> current scheduler in an attempt to recover the system to a good state. The
> handling strategy can delay watchdog taking its own action by one polling
> period; however, given that the only remediation for hardlockup is crash, this
> is likely an acceptable trade-off.
> 
> Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Makes sense to me, from a sched_ext perspective:

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  include/linux/sched/ext.h |  1 +
>  kernel/sched/ext.c        | 18 ++++++++++++++++++
>  kernel/watchdog.c         |  9 +++++++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index e1502faf6241..12561a3fcee4 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -223,6 +223,7 @@ struct sched_ext_entity {
>  void sched_ext_dead(struct task_struct *p);
>  void print_scx_info(const char *log_lvl, struct task_struct *p);
>  void scx_softlockup(u32 dur_s);
> +bool scx_hardlockup(void);
>  bool scx_rcu_cpu_stall(void);
>  
>  #else	/* !CONFIG_SCHED_CLASS_EXT */
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 4507bc4f0b5c..bd66178e5927 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3710,6 +3710,24 @@ void scx_softlockup(u32 dur_s)
>  			smp_processor_id(), dur_s);
>  }
>  
> +/**
> + * scx_hardlockup - sched_ext hardlockup handler
> + *
> + * A poorly behaving BPF scheduler can trigger hard lockup by e.g. putting
> + * numerous affinitized tasks in a single queue and directing all CPUs at it.
> + * Try kicking out the current scheduler in an attempt to recover the system to
> + * a good state before taking more drastic actions.
> + */
> +bool scx_hardlockup(void)
> +{
> +	if (!handle_lockup("hard lockup - CPU %d", smp_processor_id()))
> +		return false;
> +
> +	printk_deferred(KERN_ERR "sched_ext: Hard lockup - CPU %d, disabling BPF scheduler\n",
> +			smp_processor_id());
> +	return true;
> +}
> +
>  /**
>   * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
>   * @bypass: true for bypass, false for unbypass
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 5b62d1002783..8dfac4a8f587 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -196,6 +196,15 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  #ifdef CONFIG_SYSFS
>  		++hardlockup_count;
>  #endif
> +		/*
> +		 * A poorly behaving BPF scheduler can trigger hard lockup by
> +		 * e.g. putting numerous affinitized tasks in a single queue and
> +		 * directing all CPUs at it. The following call can return true
> +		 * only once when sched_ext is enabled and will immediately
> +		 * abort the BPF scheduler and print out a warning message.
> +		 */
> +		if (scx_hardlockup())
> +			return;
>  
>  		/* Only print hardlockups once. */
>  		if (per_cpu(watchdog_hardlockup_warned, cpu))
> -- 
> 2.51.1
> 


* Re: [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler
  2025-11-09 18:31 ` [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
@ 2025-11-10  8:36   ` Andrea Righi
  2025-11-10 18:44     ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:10AM -1000, Tejun Heo wrote:
> Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and
> only dispatches them from CPU0 in FIFO order. This is useful for testing bypass
> behavior when many tasks are concentrated on a single CPU. If the load balancer
> doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is
> long and there's only one CPU working on it.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  tools/sched_ext/Makefile       |   2 +-
>  tools/sched_ext/scx_cpu0.bpf.c |  84 ++++++++++++++++++++++++++
>  tools/sched_ext/scx_cpu0.c     | 106 +++++++++++++++++++++++++++++++++
>  3 files changed, 191 insertions(+), 1 deletion(-)
>  create mode 100644 tools/sched_ext/scx_cpu0.bpf.c
>  create mode 100644 tools/sched_ext/scx_cpu0.c
> 
> diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
> index d68780e2e03d..069b0bc38e55 100644
> --- a/tools/sched_ext/Makefile
> +++ b/tools/sched_ext/Makefile
> @@ -187,7 +187,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP
>  
>  SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
>  
> -c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg
> +c-sched-targets = scx_simple scx_cpu0 scx_qmap scx_central scx_flatcg
>  
>  $(addprefix $(BINDIR)/,$(c-sched-targets)): \
>  	$(BINDIR)/%: \
> diff --git a/tools/sched_ext/scx_cpu0.bpf.c b/tools/sched_ext/scx_cpu0.bpf.c
> new file mode 100644
> index 000000000000..8626bd369f60
> --- /dev/null
> +++ b/tools/sched_ext/scx_cpu0.bpf.c
> @@ -0,0 +1,84 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * A CPU0 scheduler.
> + *
> + * This scheduler queues all tasks to a shared DSQ and only dispatches them on
> + * CPU0 in FIFO order. This is useful for testing bypass behavior when many
> + * tasks are concentrated on a single CPU. If the load balancer doesn't work,
> + * bypass mode can trigger task hangs or RCU stalls as the queue is long and
> + * there's only one CPU working on it.
> + *
> + * - Statistics tracking how many tasks are queued to local and CPU0 DSQs.
> + * - Termination notification for userspace.
> + *
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
> + */
> +#include <scx/common.bpf.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +const volatile u32 nr_cpus = 32;	/* !0 for veristat, set during init */
> +
> +UEI_DEFINE(uei);
> +
> +/*
> + * We create a custom DSQ with ID 0 that we dispatch to and consume from on
> + * CPU0.
> + */
> +#define DSQ_CPU0 0
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +	__uint(key_size, sizeof(u32));
> +	__uint(value_size, sizeof(u64));
> +	__uint(max_entries, 2);			/* [local, cpu0] */
> +} stats SEC(".maps");
> +
> +static void stat_inc(u32 idx)
> +{
> +	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
> +	if (cnt_p)
> +		(*cnt_p)++;
> +}
> +
> +s32 BPF_STRUCT_OPS(cpu0_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
> +{
> +	return 0;
> +}
> +
> +void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	if (p->nr_cpus_allowed < nr_cpus) {

We could be even more aggressive with DSQ_CPU0 and check
bpf_cpumask_test_cpu(0, p->cpus_ptr), but this is fine as well.

> +		stat_inc(0);	/* count local queueing */
> +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

And this is why I was suggesting to automatically fallback to the new
global default time slice internally. In this case do we want to preserve
the old 20ms default or automatically switch to the new one?

Apart than these minor details that we can address later:

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> +		return;
> +	}
> +
> +	stat_inc(1);	/* count cpu0 queueing */
> +	scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
> +}
> +
> +void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	if (cpu == 0)
> +		scx_bpf_dsq_move_to_local(DSQ_CPU0);
> +}
> +
> +s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init)
> +{
> +	return scx_bpf_create_dsq(DSQ_CPU0, -1);
> +}
> +
> +void BPF_STRUCT_OPS(cpu0_exit, struct scx_exit_info *ei)
> +{
> +	UEI_RECORD(uei, ei);
> +}
> +
> +SCX_OPS_DEFINE(cpu0_ops,
> +	       .select_cpu		= (void *)cpu0_select_cpu,
> +	       .enqueue			= (void *)cpu0_enqueue,
> +	       .dispatch		= (void *)cpu0_dispatch,
> +	       .init			= (void *)cpu0_init,
> +	       .exit			= (void *)cpu0_exit,
> +	       .name			= "cpu0");
> diff --git a/tools/sched_ext/scx_cpu0.c b/tools/sched_ext/scx_cpu0.c
> new file mode 100644
> index 000000000000..1e4fa4ab8da9
> --- /dev/null
> +++ b/tools/sched_ext/scx_cpu0.c
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
> + */
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <assert.h>
> +#include <libgen.h>
> +#include <bpf/bpf.h>
> +#include <scx/common.h>
> +#include "scx_cpu0.bpf.skel.h"
> +
> +const char help_fmt[] =
> +"A cpu0 sched_ext scheduler.\n"
> +"\n"
> +"See the top-level comment in .bpf.c for more details.\n"
> +"\n"
> +"Usage: %s [-v]\n"
> +"\n"
> +"  -v            Print libbpf debug messages\n"
> +"  -h            Display this help and exit\n";
> +
> +static bool verbose;
> +static volatile int exit_req;
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +	if (level == LIBBPF_DEBUG && !verbose)
> +		return 0;
> +	return vfprintf(stderr, format, args);
> +}
> +
> +static void sigint_handler(int sig)
> +{
> +	exit_req = 1;
> +}
> +
> +static void read_stats(struct scx_cpu0 *skel, __u64 *stats)
> +{
> +	int nr_cpus = libbpf_num_possible_cpus();
> +	assert(nr_cpus > 0);
> +	__u64 cnts[2][nr_cpus];
> +	__u32 idx;
> +
> +	memset(stats, 0, sizeof(stats[0]) * 2);
> +
> +	for (idx = 0; idx < 2; idx++) {
> +		int ret, cpu;
> +
> +		ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
> +					  &idx, cnts[idx]);
> +		if (ret < 0)
> +			continue;
> +		for (cpu = 0; cpu < nr_cpus; cpu++)
> +			stats[idx] += cnts[idx][cpu];
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	struct scx_cpu0 *skel;
> +	struct bpf_link *link;
> +	__u32 opt;
> +	__u64 ecode;
> +
> +	libbpf_set_print(libbpf_print_fn);
> +	signal(SIGINT, sigint_handler);
> +	signal(SIGTERM, sigint_handler);
> +restart:
> +	skel = SCX_OPS_OPEN(cpu0_ops, scx_cpu0);
> +
> +	skel->rodata->nr_cpus = libbpf_num_possible_cpus();
> +
> +	while ((opt = getopt(argc, argv, "vh")) != -1) {
> +		switch (opt) {
> +		case 'v':
> +			verbose = true;
> +			break;
> +		default:
> +			fprintf(stderr, help_fmt, basename(argv[0]));
> +			return opt != 'h';
> +		}
> +	}
> +
> +	SCX_OPS_LOAD(skel, cpu0_ops, scx_cpu0, uei);
> +	link = SCX_OPS_ATTACH(skel, cpu0_ops, scx_cpu0);
> +
> +	while (!exit_req && !UEI_EXITED(skel, uei)) {
> +		__u64 stats[2];
> +
> +		read_stats(skel, stats);
> +		printf("local=%llu cpu0=%llu\n", stats[0], stats[1]);
> +		fflush(stdout);
> +		sleep(1);
> +	}
> +
> +	bpf_link__destroy(link);
> +	ecode = UEI_REPORT(skel, uei);
> +	scx_cpu0__destroy(skel);
> +
> +	if (UEI_ECODE_RESTART(ecode))
> +		goto restart;
> +	return 0;
> +}
> -- 
> 2.51.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
  2025-11-09 18:31 ` [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
@ 2025-11-10  8:37   ` Andrea Righi
  0 siblings, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  8:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:11AM -1000, Tejun Heo wrote:
> Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
> macro in preparation for additional users.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
>  include/linux/sched/ext.h | 7 +++++++
>  kernel/sched/ext.c        | 5 ++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 12561a3fcee4..280828b13608 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -149,6 +149,13 @@ struct scx_dsq_list_node {
>  	u32			priv;		/* can be used by iter cursor */
>  };
>  
> +#define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv)				\
> +	(struct scx_dsq_list_node) {						\
> +		.node = LIST_HEAD_INIT((__node).node),				\
> +		.flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags),			\
> +		.priv = (__priv),						\
> +	}
> +
>  /*
>   * The following is embedded in task_struct and contains all fields necessary
>   * for a task to be scheduled by SCX.
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index bd66178e5927..4b2cc6cc8cb2 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6252,9 +6252,8 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
>  	if (!kit->dsq)
>  		return -ENOENT;
>  
> -	INIT_LIST_HEAD(&kit->cursor.node);
> -	kit->cursor.flags = SCX_DSQ_LNODE_ITER_CURSOR | flags;
> -	kit->cursor.priv = READ_ONCE(kit->dsq->seq);
> +	kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, flags,
> +					   READ_ONCE(kit->dsq->seq));
>  
>  	return 0;
>  }
> -- 
> 2.51.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/13] sched_ext: Implement load balancer for bypass mode
  2025-11-09 18:31 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
@ 2025-11-10  9:38   ` Andrea Righi
  2025-11-10 19:21     ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Andrea Righi @ 2025-11-10  9:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:12AM -1000, Tejun Heo wrote:
> In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
> in most cases, there is a failure mode where a BPF scheduler can skew task
> placement severely before triggering bypass in highly over-saturated systems.
> If most tasks end up concentrated on a few CPUs, those CPUs can accumulate
> queues that are too long to drain in a reasonable time, leading to RCU stalls
> and hung tasks.
> 
> Implement a simple timer-based load balancer that redistributes tasks across
> CPUs within each NUMA node. The balancer runs periodically (default 500ms,
> tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
> CPUs to underloaded ones.
> 
> When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
> to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
> donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
> and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
> balancer timer function reads scx_bypass_depth locklessly to check whether
> bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
> with the READ_ONCE() in the timer function.
> 
> This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k
> runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
> all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
> disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
> long time to complete. With the load balancer, disable completes in about a
> second.
> 
> The load balancing operation can be monitored using the sched_ext_bypass_lb
> tracepoint and disabled by setting bypass_lb_intv_us to 0.

In general, I really like to have a default load balancer implementation in
the sched_ext core, even if initially it's only used for bypass mode for
now. In the future, we could also consider reusing this in the regular
scheduling path somehow and not just for bypass.

Comments below.

> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
...
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -34,6 +34,8 @@ DEFINE_STATIC_KEY_FALSE(__scx_enabled);
>  DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
>  static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
>  static int scx_bypass_depth;
> +static cpumask_var_t scx_bypass_lb_donee_cpumask;
> +static cpumask_var_t scx_bypass_lb_resched_cpumask;
>  static bool scx_aborting;
>  static bool scx_init_task_enabled;
>  static bool scx_switching_all;
> @@ -150,6 +152,7 @@ static struct kset *scx_kset;
>  static u64 scx_slice_dfl = SCX_SLICE_DFL;
>  static unsigned int scx_slice_dfl_us = SCX_SLICE_DFL / NSEC_PER_USEC;
>  static unsigned int scx_slice_bypass_us = SCX_SLICE_BYPASS / NSEC_PER_USEC;
> +static unsigned int scx_bypass_lb_intv_us = SCX_BYPASS_LB_DFL_INTV_US;
>  
>  static int set_slice_us(const char *val, const struct kernel_param *kp)
>  {
> @@ -161,6 +164,16 @@ static const struct kernel_param_ops slice_us_param_ops = {
>  	.get = param_get_uint,
>  };
>  
> +static int set_bypass_lb_intv_us(const char *val, const struct kernel_param *kp)
> +{
> +	return param_set_uint_minmax(val, kp, 0, 10 * USEC_PER_SEC);
> +}
> +
> +static const struct kernel_param_ops bypass_lb_intv_us_param_ops = {
> +	.set = set_bypass_lb_intv_us,
> +	.get = param_get_uint,
> +};
> +
>  #undef MODULE_PARAM_PREFIX
>  #define MODULE_PARAM_PREFIX	"sched_ext."
>  
> @@ -168,6 +181,8 @@ module_param_cb(slice_dfl_us, &slice_us_param_ops, &scx_slice_dfl_us, 0600);
>  MODULE_PARM_DESC(slice_dfl_us, "default slice in microseconds, applied on [un]load (100us to 100ms)");
>  module_param_cb(slice_bypass_us, &slice_us_param_ops, &scx_slice_bypass_us, 0600);
>  MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]load (100us to 100ms)");
> +module_param_cb(bypass_lb_intv_us, &bypass_lb_intv_us_param_ops, &scx_bypass_lb_intv_us, 0600);
> +MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microseconds (0 (disable) to 10s)");
>  
>  #undef MODULE_PARAM_PREFIX
>  
> @@ -965,7 +980,9 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>  		     !RB_EMPTY_NODE(&p->scx.dsq_priq));
>  
>  	if (!is_local) {
> -		raw_spin_lock(&dsq->lock);
> +		raw_spin_lock_nested(&dsq->lock,
> +			(enq_flags & SCX_ENQ_NESTED) ? SINGLE_DEPTH_NESTING : 0);
> +
>  		if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
>  			scx_error(sch, "attempting to dispatch to a destroyed dsq");
>  			/* fall back to the global dsq */

Outside the context of the patch we're doing:

			/* fall back to the global dsq */
			raw_spin_unlock(&dsq->lock);
			dsq = find_global_dsq(sch, p);
			raw_spin_lock(&dsq->lock);

I think we should preserve the nested lock annotation when locking the
global DSQ as well and do:

		raw_spin_lock_nested(&dsq->lock,
			(enq_flags & SCX_ENQ_NESTED) ? SINGLE_DEPTH_NESTING : 0);

It seems correct either way, but without this I think we could potentially
trigger false positive lockdep warnings.

> @@ -3728,6 +3745,204 @@ bool scx_hardlockup(void)
>  	return true;
>  }
>  
> +static u32 bypass_lb_cpu(struct scx_sched *sch, struct scx_dispatch_q *donor_dsq,
> +			 struct cpumask *donee_mask, struct cpumask *resched_mask,
> +			 u32 nr_donor_target, u32 nr_donee_target)
> +{
> +	struct task_struct *p, *n;
> +	struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, 0, 0);
> +	s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target;
> +	u32 nr_balanced = 0, min_delta_us;
> +
> +	/*
> +	 * All we want to guarantee is reasonable forward progress. No reason to
> +	 * fine tune. Assuming every task on @donor_dsq runs their full slice,
> +	 * consider offloading iff the total queued duration is over the
> +	 * threshold.
> +	 */
> +	min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV;
> +	if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us))
> +		return 0;
> +
> +	raw_spin_lock_irq(&donor_dsq->lock);
> +	list_add(&cursor.node, &donor_dsq->list);
> +resume:
> +	n = container_of(&cursor, struct task_struct, scx.dsq_list);
> +	n = nldsq_next_task(donor_dsq, n, false);
> +
> +	while ((p = n)) {
> +		struct rq *donee_rq;
> +		struct scx_dispatch_q *donee_dsq;
> +		int donee;
> +
> +		n = nldsq_next_task(donor_dsq, n, false);
> +
> +		if (donor_dsq->nr <= nr_donor_target)
> +			break;
> +
> +		if (cpumask_empty(donee_mask))
> +			break;
> +
> +		donee = cpumask_any_and_distribute(donee_mask, p->cpus_ptr);
> +		if (donee >= nr_cpu_ids)
> +			continue;
> +
> +		donee_rq = cpu_rq(donee);
> +		donee_dsq = &donee_rq->scx.bypass_dsq;
> +
> +		/*
> +		 * $p's rq is not locked but $p's DSQ lock protects its
> +		 * scheduling properties making this test safe.
> +		 */
> +		if (!task_can_run_on_remote_rq(sch, p, donee_rq, false))
> +			continue;
> +
> +		/*
> +		 * Moving $p from one non-local DSQ to another. The source DSQ
> +		 * is already locked. Do an abbreviated dequeue and then perform
> +		 * enqueue without unlocking $donor_dsq.
> +		 *
> +		 * We don't want to drop and reacquire the lock on each
> +		 * iteration as @donor_dsq can be very long and potentially
> +		 * highly contended. Donee DSQs are less likely to be contended.
> +		 * The nested locking is safe as only this LB moves tasks
> +		 * between bypass DSQs.
> +		 */
> +		task_unlink_from_dsq(p, donor_dsq);
> +		p->scx.dsq = NULL;
> +		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);

Are we racing with dispatch_dequeue() and the holding_cpu dancing here?

If I read correctly, dispatch_dequeue() reads p->scx.dsq without holding
the lock, then acquires the lock on that DSQ, but between the read and lock
acquisition, the load balancer can move the task to a different DSQ.

Maybe we should change dispatch_dequeue() as well to verify after locking
that we locked the correct DSQ, and retry if the task was moved.

Thanks,
-Andrea

> +
> +		/*
> +		 * $donee might have been idle and need to be woken up. No need
> +		 * to be clever. Kick every CPU that receives tasks.
> +		 */
> +		cpumask_set_cpu(donee, resched_mask);
> +
> +		if (READ_ONCE(donee_dsq->nr) >= nr_donee_target)
> +			cpumask_clear_cpu(donee, donee_mask);
> +
> +		nr_balanced++;
> +		if (!(nr_balanced % SCX_BYPASS_LB_BATCH) && n) {
> +			list_move_tail(&cursor.node, &n->scx.dsq_list.node);
> +			raw_spin_unlock_irq(&donor_dsq->lock);
> +			cpu_relax();
> +			raw_spin_lock_irq(&donor_dsq->lock);
> +			goto resume;
> +		}
> +	}
> +
> +	list_del_init(&cursor.node);
> +	raw_spin_unlock_irq(&donor_dsq->lock);
> +
> +	return nr_balanced;
> +}
> +
> +static void bypass_lb_node(struct scx_sched *sch, int node)
> +{
> +	const struct cpumask *node_mask = cpumask_of_node(node);
> +	struct cpumask *donee_mask = scx_bypass_lb_donee_cpumask;
> +	struct cpumask *resched_mask = scx_bypass_lb_resched_cpumask;
> +	u32 nr_tasks = 0, nr_cpus = 0, nr_balanced = 0;
> +	u32 nr_target, nr_donor_target;
> +	u32 before_min = U32_MAX, before_max = 0;
> +	u32 after_min = U32_MAX, after_max = 0;
> +	int cpu;
> +
> +	/* count the target tasks and CPUs */
> +	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
> +		u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr);
> +
> +		nr_tasks += nr;
> +		nr_cpus++;
> +
> +		before_min = min(nr, before_min);
> +		before_max = max(nr, before_max);
> +	}
> +
> +	if (!nr_cpus)
> +		return;
> +
> +	/*
> > +	 * We don't want CPUs to have more than $nr_donor_target tasks and we
> > +	 * want balancing to fill donee CPUs up to $nr_target. Once targets
> > +	 * are calculated, find the donee CPUs.
> +	 */
> +	nr_target = DIV_ROUND_UP(nr_tasks, nr_cpus);
> +	nr_donor_target = DIV_ROUND_UP(nr_target * SCX_BYPASS_LB_DONOR_PCT, 100);
> +
> +	cpumask_clear(donee_mask);
> +	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
> +		if (READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr) < nr_target)
> +			cpumask_set_cpu(cpu, donee_mask);
> +	}
> +
> +	/* iterate !donee CPUs and see if they should be offloaded */
> +	cpumask_clear(resched_mask);
> +	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
> +		struct rq *rq = cpu_rq(cpu);
> +		struct scx_dispatch_q *donor_dsq = &rq->scx.bypass_dsq;
> +
> +		if (cpumask_empty(donee_mask))
> +			break;
> +		if (cpumask_test_cpu(cpu, donee_mask))
> +			continue;
> +		if (READ_ONCE(donor_dsq->nr) <= nr_donor_target)
> +			continue;
> +
> +		nr_balanced += bypass_lb_cpu(sch, donor_dsq,
> +					     donee_mask, resched_mask,
> +					     nr_donor_target, nr_target);
> +	}
> +
> +	for_each_cpu(cpu, resched_mask) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		raw_spin_rq_lock_irq(rq);
> +		resched_curr(rq);
> +		raw_spin_rq_unlock_irq(rq);
> +	}
> +
> +	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
> +		u32 nr = READ_ONCE(cpu_rq(cpu)->scx.bypass_dsq.nr);
> +
> +		after_min = min(nr, after_min);
> +		after_max = max(nr, after_max);
> +
> +	}
> +
> +	trace_sched_ext_bypass_lb(node, nr_cpus, nr_tasks, nr_balanced,
> +				  before_min, before_max, after_min, after_max);
> +}
> +
> +/*
> + * In bypass mode, all tasks are put on the per-CPU bypass DSQs. If the machine
> + * is over-saturated and the BPF scheduler skewed tasks into few CPUs, some
> + * bypass DSQs can be overloaded. If there are enough tasks to saturate other
> + * lightly loaded CPUs, such imbalance can lead to very high execution latency
> + * on the overloaded CPUs and thus to hung tasks and RCU stalls. To avoid such
> + * outcomes, a simple load balancing mechanism is implemented by the following
> + * timer which runs periodically while bypass mode is in effect.
> + */
> +static void scx_bypass_lb_timerfn(struct timer_list *timer)
> +{
> +	struct scx_sched *sch;
> +	int node;
> +	u32 intv_us;
> +
> +	sch = rcu_dereference_all(scx_root);
> +	if (unlikely(!sch) || !READ_ONCE(scx_bypass_depth))
> +		return;
> +
> +	for_each_node_with_cpus(node)
> +		bypass_lb_node(sch, node);
> +
> +	intv_us = READ_ONCE(scx_bypass_lb_intv_us);
> +	if (intv_us)
> +		mod_timer(timer, jiffies + usecs_to_jiffies(intv_us));
> +}
> +
> +static DEFINE_TIMER(scx_bypass_lb_timer, scx_bypass_lb_timerfn);
> +
>  /**
>   * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
>   * @bypass: true for bypass, false for unbypass
> @@ -3771,7 +3986,9 @@ static void scx_bypass(bool bypass)
>  	sch = rcu_dereference_bh(scx_root);
>  
>  	if (bypass) {
> -		scx_bypass_depth++;
> +		u32 intv_us;
> +
> +		WRITE_ONCE(scx_bypass_depth, scx_bypass_depth + 1);
>  		WARN_ON_ONCE(scx_bypass_depth <= 0);
>  		if (scx_bypass_depth != 1)
>  			goto unlock;
> @@ -3779,8 +3996,15 @@ static void scx_bypass(bool bypass)
>  		bypass_timestamp = ktime_get_ns();
>  		if (sch)
>  			scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1);
> +
> +		intv_us = READ_ONCE(scx_bypass_lb_intv_us);
> +		if (intv_us && !timer_pending(&scx_bypass_lb_timer)) {
> +			scx_bypass_lb_timer.expires =
> +				jiffies + usecs_to_jiffies(intv_us);
> +			add_timer_global(&scx_bypass_lb_timer);
> +		}
>  	} else {
> -		scx_bypass_depth--;
> +		WRITE_ONCE(scx_bypass_depth, scx_bypass_depth - 1);
>  		WARN_ON_ONCE(scx_bypass_depth < 0);
>  		if (scx_bypass_depth != 0)
>  			goto unlock;
> @@ -7036,6 +7260,12 @@ static int __init scx_init(void)
>  		return ret;
>  	}
>  
> +	if (!alloc_cpumask_var(&scx_bypass_lb_donee_cpumask, GFP_KERNEL) ||
> +	    !alloc_cpumask_var(&scx_bypass_lb_resched_cpumask, GFP_KERNEL)) {
> +		pr_err("sched_ext: Failed to allocate cpumasks\n");
> +		return -ENOMEM;
> +	}
> +
>  	return 0;
>  }
>  __initcall(scx_init);
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index dd6f25fb6159..386c677e4c9a 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -23,6 +23,11 @@ enum scx_consts {
>  	 * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
>  	 */
>  	SCX_TASK_ITER_BATCH		= 32,
> +
> +	SCX_BYPASS_LB_DFL_INTV_US	= 500 * USEC_PER_MSEC,
> +	SCX_BYPASS_LB_DONOR_PCT		= 125,
> +	SCX_BYPASS_LB_MIN_DELTA_DIV	= 4,
> +	SCX_BYPASS_LB_BATCH		= 256,
>  };
>  
>  enum scx_exit_kind {
> @@ -963,6 +968,7 @@ enum scx_enq_flags {
>  
>  	SCX_ENQ_CLEAR_OPSS	= 1LLU << 56,
>  	SCX_ENQ_DSQ_PRIQ	= 1LLU << 57,
> +	SCX_ENQ_NESTED		= 1LLU << 58,
>  };
>  
>  enum scx_deq_flags {
> -- 
> 2.51.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode
  2025-11-10  6:57   ` Andrea Righi
@ 2025-11-10 16:08     ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 16:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel, Patrick Lu

Hello,

On Mon, Nov 10, 2025 at 07:57:18AM +0100, Andrea Righi wrote:
> On Sun, Nov 09, 2025 at 08:31:00AM -1000, Tejun Heo wrote:
> > In the default CPU selection path used during bypass mode, select_task_rq_scx()
> > set p->scx.ddsp_dsq_id to SCX_DSQ_LOCAL to emulate direct dispatch. However,
> > do_enqueue_task() ignores ddsp_dsq_id in bypass mode and queues to the global
> > DSQ, leaving ddsp_dsq_id dangling. This triggers WARN_ON_ONCE() in
> > mark_direct_dispatch() if the task later gets direct dispatched.
> 
> The patch makes sense and I was actually testing something similar to fix
> https://github.com/sched-ext/scx/issues/2758.
> 
> However, in dispatch_enqueue() we're always clearing p->scx.ddsp_dsq_id
> (SCX_DSQ_INVALID), even when we're targeting the global DSQ due to bypass
> mode, so in this scenario we shouldn't see a stale ddsp_dsq_id. Am I
> missing something?

I think you're right. The bug fix part was a wrong assumption on my part.
Will update the description.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-10  7:03   ` Andrea Righi
  2025-11-10  7:59     ` Andrea Righi
@ 2025-11-10 16:21     ` Tejun Heo
  2025-11-10 16:22       ` Tejun Heo
  1 sibling, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 16:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hello,

On Mon, Nov 10, 2025 at 08:03:37AM +0100, Andrea Righi wrote:
> I agree that slice_bypass_us can be a tunable in sysfs, but I think it'd be
> nicer if the default time slice would be a property of sched_ext_ops, is
> there any reason to not do that?

My thinking was that a scheduler should always be able to avoid using the
default slice. Even if we allow the default slice to be overridden by the
scheduler, it's still very crude as it will apply the same slice to all
tasks. I'm not necessarily against moving it into ops but a bit unsure how
useful it is.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-10 16:21     ` Tejun Heo
@ 2025-11-10 16:22       ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 16:22 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Mon, Nov 10, 2025 at 06:21:10AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 10, 2025 at 08:03:37AM +0100, Andrea Righi wrote:
> > I agree that slice_bypass_us can be a tunable in sysfs, but I think it'd be
> > nicer if the default time slice would be a property of sched_ext_ops, is
> > there any reason to not do that?
> 
> My thinking was that a scheduler should always be able to avoid using the
> default slice. Even if we allow the default slice to be overridden by the
> scheduler, it's still very crude as it will apply the same slice to all
> tasks. I'm not necessarily against moving it into ops but a bit unsure how
> useful it is.

Hmm... for now, let me drop slice_dfl knob from this patch. We can address
this separately.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  2025-11-10  7:42   ` Andrea Righi
@ 2025-11-10 16:42     ` Tejun Heo
  2025-11-10 17:30       ` Andrea Righi
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 16:42 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hello,

On Mon, Nov 10, 2025 at 08:42:47AM +0100, Andrea Righi wrote:
> On Sun, Nov 09, 2025 at 08:31:03AM -1000, Tejun Heo wrote:
> > Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
> > on the CPU that it's currently on. Because the default idle CPU selection
> > policy and direct dispatch are both active during bypass, this works well in
> > most cases including the above.
> 
> Is there any reason not to reuse rq->scx.local_dsq for this?
...
> > The bypass DSQ is kept separate from
> > the local DSQ to allow the load balancer to move tasks between bypass DSQs.

This is the explanation for that. More detailed explanation is that local
DSQs are protected by rq locks and that makes load balancing across them
more complicated - ie. we can't keep scanning and transferring while holding
the source DSQ and if the system is already heavily contended, the system
may already be melting down on rq locks.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  2025-11-10 16:42     ` Tejun Heo
@ 2025-11-10 17:30       ` Andrea Righi
  0 siblings, 0 replies; 45+ messages in thread
From: Andrea Righi @ 2025-11-10 17:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Mon, Nov 10, 2025 at 06:42:56AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 10, 2025 at 08:42:47AM +0100, Andrea Righi wrote:
> > On Sun, Nov 09, 2025 at 08:31:03AM -1000, Tejun Heo wrote:
> > > Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
> > > on the CPU that it's currently on. Because the default idle CPU selection
> > > policy and direct dispatch are both active during bypass, this works well in
> > > most cases including the above.
> > 
> > Is there any reason not to reuse rq->scx.local_dsq for this?
> ...
> > > The bypass DSQ is kept separate from
> > > the local DSQ to allow the load balancer to move tasks between bypass DSQs.
> 
> This is the explanation for that. More detailed explanation is that local
> DSQs are protected by rq locks and that makes load balancing across them
> more complicated - ie. we can't keep scanning and transferring while holding
> the source DSQ and if the system is already heavily contended, the system
> may already be melting down on rq locks.

Ok, thanks for the explanation, makes sense and it's definitely better than
what we have right now, so:

Reviewed-by: Andrea Righi <arighi@nvidia.com>

-Andrea

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler
  2025-11-10  8:36   ` Andrea Righi
@ 2025-11-10 18:44     ` Tejun Heo
  2025-11-10 21:06       ` Andrea Righi
  0 siblings, 1 reply; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 18:44 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Mon, Nov 10, 2025 at 09:36:46AM +0100, Andrea Righi wrote:
> > +void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
> > +{
> > +	if (p->nr_cpus_allowed < nr_cpus) {
> 
> We could be even more aggressive with DSQ_CPU0 and check
> bpf_cpumask_test_cpu(0, p->cpus_ptr), but this is fine as well.

I did the following instead:

  void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
  {
          /*
           * select_cpu() always picks CPU0. If @p is not on CPU0, it can't run on
           * CPU 0. Queue on whichever CPU it's currently on.
           */
          if (scx_bpf_task_cpu(p) != 0) {
                  stat_inc(0);	/* count local queueing */
                  scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                  return;
          }

          stat_inc(1);	/* count cpu0 queueing */
          scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
  }

This should be safe against migration disabled tasks and so on.

> > +		stat_inc(0);	/* count local queueing */
> > +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> 
> And this is why I was suggesting to automatically fallback to the new
> global default time slice internally. In this case do we want to preserve
> the old 20ms default or automatically switch to the new one?

Maybe SCX_SLICE_DFL can become runtime loaded const volatile but anyone
who's using it is just saying "I don't care". As long as it's not something
that breaks the system left and right, does it matter what exact value it
is?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting
  2025-11-10  8:20   ` Andrea Righi
@ 2025-11-10 18:51     ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 18:51 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Mon, Nov 10, 2025 at 09:20:15AM +0100, Andrea Righi wrote:
> > @@ -5635,6 +5609,13 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> >  	    !scx_kf_allowed(sch, SCX_KF_DISPATCH))
> >  		return false;
> >  
> > +	/*
> > +	 * If the BPF scheduler keeps calling this function repeatedly, it can
> > +	 * cause similar live-lock conditions as consume_dispatch_q().
> > +	 */
> > +	if (unlikely(scx_aborting))
> 
> READ_ONCE(scx_aborting)?

Updated.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/13] sched_ext: Implement load balancer for bypass mode
  2025-11-10  9:38   ` Andrea Righi
@ 2025-11-10 19:21     ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 19:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hello,

On Mon, Nov 10, 2025 at 10:38:43AM +0100, Andrea Righi wrote:
> > @@ -965,7 +980,9 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> >  		     !RB_EMPTY_NODE(&p->scx.dsq_priq));
> >  
> >  	if (!is_local) {
> > -		raw_spin_lock(&dsq->lock);
> > +		raw_spin_lock_nested(&dsq->lock,
> > +			(enq_flags & SCX_ENQ_NESTED) ? SINGLE_DEPTH_NESTING : 0);
> > +
> >  		if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
> >  			scx_error(sch, "attempting to dispatch to a destroyed dsq");
> >  			/* fall back to the global dsq */
> 
> Outside the context of the patch we're doing:
> 
> 			/* fall back to the global dsq */
> 			raw_spin_unlock(&dsq->lock);
> 			dsq = find_global_dsq(sch, p);
> 			raw_spin_lock(&dsq->lock);
> 
> I think we should preserve the nested lock annotation when locking the
> global DSQ as well and do:
> 
> 		raw_spin_lock_nested(&dsq->lock,
> 			(enq_flags & SCX_ENQ_NESTED) ? SINGLE_DEPTH_NESTING : 0);
> 
> It seems correct either way, but without this I think we could potentially
> trigger false positive lockdep warnings.

That'd be a bug. I'll add an explicit WARN. I don't think falling back to
global DSQ quietly makes sense - e.g. global DSQ is not even consumed in
bypass mode anymore.

> > +		/*
> > +		 * Moving $p from one non-local DSQ to another. The source DSQ
> > +		 * is already locked. Do an abbreviated dequeue and then perform
> > +		 * enqueue without unlocking $donor_dsq.
> > +		 *
> > +		 * We don't want to drop and reacquire the lock on each
> > +		 * iteration as @donor_dsq can be very long and potentially
> > +		 * highly contended. Donee DSQs are less likely to be contended.
> > +		 * The nested locking is safe as only this LB moves tasks
> > +		 * between bypass DSQs.
> > +		 */
> > +		task_unlink_from_dsq(p, donor_dsq);
> > +		p->scx.dsq = NULL;
> > +		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
> 
> Are we racing with dispatch_dequeue() and the holding_cpu dancing here?
> 
> If I read correctly, dispatch_dequeue() reads p->scx.dsq without holding
> the lock, then acquires the lock on that DSQ, but between the read and lock
> acquisition, the load balancer can move the task to a different DSQ.
> 
> Maybe we should change dispatch_dequeue() as well to verify after locking
> that we locked the correct DSQ, and retry if the task was moved.

Right, this is a bug. The LB should hold the source rq lock too. Let me
update the code and add a lockdep annotation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler
  2025-11-10 18:44     ` Tejun Heo
@ 2025-11-10 21:06       ` Andrea Righi
  2025-11-10 22:08         ` Tejun Heo
  0 siblings, 1 reply; 45+ messages in thread
From: Andrea Righi @ 2025-11-10 21:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

On Mon, Nov 10, 2025 at 08:44:12AM -1000, Tejun Heo wrote:
> On Mon, Nov 10, 2025 at 09:36:46AM +0100, Andrea Righi wrote:
> > > +void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
> > > +{
> > > +	if (p->nr_cpus_allowed < nr_cpus) {
> > 
> > We could be even more aggressive with DSQ_CPU0 and check
> > bpf_cpumask_test_cpu(0, p->cpus_ptr), but this is fine as well.
> 
> I did the following instead:
> 
>   void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
>   {
>           /*
>            * select_cpu() always picks CPU0. If @p is not on CPU0, it can't run on
>            * CPU 0. Queue on whichever CPU it's currently only.
>            */
>           if (scx_bpf_task_cpu(p) != 0) {
>                   stat_inc(0);	/* count local queueing */
>                   scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
>                   return;
>           }
> 
>           stat_inc(1);	/* count cpu0 queueing */
>           scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
>   }
> 
> This should be safe against migration disabled tasks and so on.

Looks good.

> 
> > > +		stat_inc(0);	/* count local queueing */
> > > +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> > 
> > And this is why I was suggesting to automatically fallback to the new
> > global default time slice internally. In this case do we want to preserve
> > the old 20ms default or automatically switch to the new one?
> 
> Maybe SCX_SLICE_DFL can become a runtime-loaded const volatile, but
> anyone who's using it is just saying "I don't care". As long as it's not
> something that breaks the system left and right, does it matter what the
> exact value is?

I agree that if a scheduler uses SCX_SLICE_DFL it shouldn't care too much
about the exact value.

My concern was more about schedulers that are quite paranoid about
latency: even if something isn't handled properly (a direct dispatch to
the wrong CPU, a task being rescheduled internally, etc.), we'd still
have the guarantee that a task's time slice can't exceed a known upper
bound. But that could be handled by being able to set a default time
slice (somehow), and it can be addressed separately.

So yeah, in this case the exact value of SCX_SLICE_DFL probably doesn't
matter much.
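
(For completeness: a scheduler that wants a hard upper bound regardless
of SCX_SLICE_DFL could already carry its own load-time knob. Hypothetical
BPF-side sketch, names made up:)

```c
/* Hypothetical knob: userspace may override this before load. */
const volatile u64 slice_max_ns = 20ULL * 1000 * 1000;	/* 20ms */

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Never hand out more than the configured upper bound. */
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, slice_max_ns, enq_flags);
}
```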

-Andrea

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler
  2025-11-10 21:06       ` Andrea Righi
@ 2025-11-10 22:08         ` Tejun Heo
  0 siblings, 0 replies; 45+ messages in thread
From: Tejun Heo @ 2025-11-10 22:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Dan Schatzberg, Emil Tsalapatis,
	sched-ext, linux-kernel

Hello,

On Mon, Nov 10, 2025 at 10:06:29PM +0100, Andrea Righi wrote:
> I agree that if a scheduler uses SCX_SLICE_DFL it shouldn't care too much
> about the exact value.
> 
> My concern was more about schedulers that are quite paranoid about
> latency: even if something isn't handled properly (a direct dispatch to
> the wrong CPU, a task being rescheduled internally, etc.), we'd still
> have the guarantee that a task's time slice can't exceed a known upper
> bound. But that could be handled by being able to set a default time
> slice (somehow), and it can be addressed separately.
> 
> So yeah, in this case the exact value of SCX_SLICE_DFL probably doesn't
> matter much.

AFAICS, all cases where we use the default slice can be avoided by setting
the right SCX_OPS_ENQ_* flags and not letting through tasks with zero
slice. i.e. if the scheduler needs to control slice distribution closely,
it can do so, and if something leaks, that can be detected through the
events, although it may be helpful to add a strict mode where such leaks
can be tracked down more easily.

This is not necessarily an argument against making the default slice
configurable. The fact that we use the default slice for bypassing was a
reason to be more cautious about exposing it (as that can affect system
recoverability), but with the bypass slice separated out, that's less of a
concern. So, yeah, I think making it ops-configurable is fine too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice in bypass mode
  2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
  2025-11-10  7:03   ` Andrea Righi
  2025-11-10  8:22   ` Andrea Righi
@ 2025-11-11 14:57   ` Dan Schatzberg
  2 siblings, 0 replies; 45+ messages in thread
From: Dan Schatzberg @ 2025-11-11 14:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:01AM -1000, Tejun Heo wrote:
> @@ -919,7 +948,7 @@ static void dsq_mod_nr(struct scx_dispatch_q *dsq, s32 delta)
>  
>  static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p)
>  {
> -	p->scx.slice = SCX_SLICE_DFL;
> +	p->scx.slice = scx_slice_dfl;

Do you need to use READ_ONCE here given that this can be modified concurrently?

>  	__scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1);
>  }
>  
> @@ -2892,7 +2921,7 @@ void init_scx_entity(struct sched_ext_entity *scx)
>  	INIT_LIST_HEAD(&scx->runnable_node);
>  	scx->runnable_at = jiffies;
>  	scx->ddsp_dsq_id = SCX_DSQ_INVALID;
> -	scx->slice = SCX_SLICE_DFL;
> +	scx->slice = scx_slice_dfl;
>  }
>  
>  void scx_pre_fork(struct task_struct *p)
> @@ -3770,6 +3799,7 @@ static void scx_bypass(bool bypass)
>  		WARN_ON_ONCE(scx_bypass_depth <= 0);
>  		if (scx_bypass_depth != 1)
>  			goto unlock;
> +		scx_slice_dfl = scx_slice_bypass_us * NSEC_PER_USEC;

Similarly WRITE_ONCE here

>  		bypass_timestamp = ktime_get_ns();
>  		if (sch)
>  			scx_add_event(sch, SCX_EV_BYPASS_ACTIVATE, 1);
> @@ -3778,6 +3808,7 @@ static void scx_bypass(bool bypass)
>  		WARN_ON_ONCE(scx_bypass_depth < 0);
>  		if (scx_bypass_depth != 0)
>  			goto unlock;
> +		scx_slice_dfl = scx_slice_dfl_us * NSEC_PER_USEC;

And here

>  		if (sch)
>  			scx_add_event(sch, SCX_EV_BYPASS_DURATION,
>  				      ktime_get_ns() - bypass_timestamp);
> @@ -4776,7 +4807,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  			queue_flags |= DEQUEUE_CLASS;
>  
>  		scoped_guard (sched_change, p, queue_flags) {
> -			p->scx.slice = SCX_SLICE_DFL;
> +			p->scx.slice = scx_slice_dfl;

And here
>  			p->sched_class = new_class;
>  		}
>  	}
> -- 
> 2.51.1
> 
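
(For the record, READ_ONCE()/WRITE_ONCE() on a naturally-aligned word
like this correspond roughly to relaxed atomics in userspace C11 terms.
Illustrative sketch only, names hypothetical:)

```c
#include <stdatomic.h>

/*
 * Userspace analogue of the lockless scx_slice_dfl accesses: one path
 * updates the slice on bypass [de]activation while others read it
 * locklessly. Relaxed atomics play the role of READ_ONCE()/WRITE_ONCE():
 * no ordering is implied, but loads and stores cannot tear.
 */
static _Atomic unsigned long long slice_dfl_ns = 20000000ULL;	/* 20ms */

static unsigned long long slice_dfl_read(void)
{
	return atomic_load_explicit(&slice_dfl_ns, memory_order_relaxed);
}

static void slice_dfl_write(unsigned long long ns)
{
	atomic_store_explicit(&slice_dfl_ns, ns, memory_order_relaxed);
}
```

(No ordering guarantee is needed here, only freedom from load/store
tearing.)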

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
  2025-11-09 18:31 ` [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
  2025-11-10  7:42   ` Andrea Righi
@ 2025-11-11 15:31   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Dan Schatzberg @ 2025-11-11 15:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:03AM -1000, Tejun Heo wrote:
> When bypass mode is activated, tasks are routed through a fallback dispatch
> queue instead of the BPF scheduler. Originally, bypass mode used a single
> global DSQ, but this didn't scale well on NUMA machines and could lead to
> livelocks. In b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node"),
> this was changed to use per-node global DSQs, which resolved the
> cross-node-related livelocks.
> 
> However, Dan Schatzberg found that per-node global DSQ can also livelock in a
> different scenario: On a NUMA node with many CPUs and many threads pinned to
> different small subsets of CPUs, each CPU often has to scan through many tasks
> it cannot run to find the one task it can run. With a high number of CPUs,
> this scanning overhead can easily cause livelocks.
> 
> Change bypass mode to use dedicated per-CPU bypass DSQs. Each task is queued
> on the CPU that it's currently on. Because the default idle CPU selection
> policy and direct dispatch are both active during bypass, this works well in
> most cases including the above.
> 
> However, this does have a failure mode in highly over-saturated systems where
> tasks are concentrated on a single CPU. If the BPF scheduler places most tasks
> on one CPU and then triggers bypass mode, bypass mode will keep those tasks on
> that one CPU, which can lead to failures such as RCU stalls as the queue may be
> too long for that CPU to drain in a reasonable time. This will be addressed
> with a load balancer in a future patch. The bypass DSQ is kept separate from
> the local DSQ to allow the load balancer to move tasks between bypass DSQs.
> 
> Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/sched/ext.h |  1 +
>  kernel/sched/ext.c        | 16 +++++++++++++---
>  kernel/sched/sched.h      |  1 +
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index 9f5b0f2be310..e1502faf6241 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -57,6 +57,7 @@ enum scx_dsq_id_flags {
>  	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
>  	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
>  	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
> +	SCX_DSQ_BYPASS		= SCX_DSQ_FLAG_BUILTIN | 3,
>  	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
>  	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
>  };
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a29bfadde89d..4b8b91494947 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1301,7 +1301,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  
>  	if (scx_rq_bypassing(rq)) {
>  		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
> -		goto global;
> +		goto bypass;
>  	}
>  
>  	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> @@ -1359,6 +1359,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  global:
>  	dsq = find_global_dsq(sch, p);
>  	goto enqueue;
> +bypass:
> +	dsq = &task_rq(p)->scx.bypass_dsq;
> +	goto enqueue;
>  
>  enqueue:
>  	/*
> @@ -2157,8 +2160,14 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>  	if (consume_global_dsq(sch, rq))
>  		goto has_tasks;
>  
> -	if (unlikely(!SCX_HAS_OP(sch, dispatch)) ||
> -	    scx_rq_bypassing(rq) || !scx_rq_online(rq))
> +	if (scx_rq_bypassing(rq)) {
> +		if (consume_dispatch_q(sch, rq, &rq->scx.bypass_dsq))
> +			goto has_tasks;
> +		else
> +			goto no_tasks;
> +	}
> +
> +	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
>  		goto no_tasks;
>  
>  	dspc->rq = rq;
> @@ -5370,6 +5379,7 @@ void __init init_sched_ext_class(void)
>  		int  n = cpu_to_node(cpu);
>  
>  		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
> +		init_dsq(&rq->scx.bypass_dsq, SCX_DSQ_BYPASS);
>  		INIT_LIST_HEAD(&rq->scx.runnable_list);
>  		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 27aae2a298f8..5991133a4849 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -808,6 +808,7 @@ struct scx_rq {
>  	struct balance_callback	deferred_bal_cb;
>  	struct irq_work		deferred_irq_work;
>  	struct irq_work		kick_cpus_irq_work;
> +	struct scx_dispatch_q	bypass_dsq;
>  };
>  #endif /* CONFIG_SCHED_CLASS_EXT */
>  
> -- 
> 2.51.1
> 

Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag
  2025-11-09 18:31 ` [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
  2025-11-10  7:45   ` Andrea Righi
@ 2025-11-11 15:34   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Dan Schatzberg @ 2025-11-11 15:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:04AM -1000, Tejun Heo wrote:
> The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid
> live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the
> ops breather and eject BPF scheduler on softlockup") to prevent live-locks by
> injecting delays when CPUs are trapped in dispatch paths.
> 
> Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup
> (unsigned long) with separate increment/decrement and cleanup operations. The
> breather is only activated when aborting, so tie it directly to the exit
> mechanism. Replace both variables with scx_aborting flag set when exit is
> claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to
> consolidate exit_kind claiming and breather enablement. This eliminates
> scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass().
> 
> The breather mechanism will be replaced by a different abort mechanism in a
> future patch. This simplification prepares for that change.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 54 +++++++++++++++++++++-------------------------
>  1 file changed, 25 insertions(+), 29 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 4b8b91494947..905d01f74687 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -33,9 +33,8 @@ static DEFINE_MUTEX(scx_enable_mutex);
>  DEFINE_STATIC_KEY_FALSE(__scx_enabled);
>  DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
>  static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
> -static unsigned long scx_in_softlockup;
> -static atomic_t scx_breather_depth = ATOMIC_INIT(0);
>  static int scx_bypass_depth;
> +static bool scx_aborting;
>  static bool scx_init_task_enabled;
>  static bool scx_switching_all;
>  DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
> @@ -1834,7 +1833,7 @@ static void scx_breather(struct rq *rq)
>  
>  	lockdep_assert_rq_held(rq);
>  
> -	if (likely(!atomic_read(&scx_breather_depth)))
> +	if (likely(!READ_ONCE(scx_aborting)))
>  		return;
>  
>  	raw_spin_rq_unlock(rq);
> @@ -1843,9 +1842,9 @@ static void scx_breather(struct rq *rq)
>  
>  	do {
>  		int cnt = 1024;
> -		while (atomic_read(&scx_breather_depth) && --cnt)
> +		while (READ_ONCE(scx_aborting) && --cnt)
>  			cpu_relax();
> -	} while (atomic_read(&scx_breather_depth) &&
> +	} while (READ_ONCE(scx_aborting) &&
>  		 time_before64(ktime_get_ns(), until));
>  
>  	raw_spin_rq_lock(rq);
> @@ -3740,30 +3739,14 @@ void scx_softlockup(u32 dur_s)
>  		goto out_unlock;
>  	}
>  
> -	/* allow only one instance, cleared at the end of scx_bypass() */
> -	if (test_and_set_bit(0, &scx_in_softlockup))
> -		goto out_unlock;
> -
>  	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
>  			smp_processor_id(), dur_s, scx_root->ops.name);
>  
> -	/*
> -	 * Some CPUs may be trapped in the dispatch paths. Enable breather
> -	 * immediately; otherwise, we might even be able to get to scx_bypass().
> -	 */
> -	atomic_inc(&scx_breather_depth);
> -
>  	scx_error(sch, "soft lockup - CPU#%d stuck for %us", smp_processor_id(), dur_s);
>  out_unlock:
>  	rcu_read_unlock();
>  }
>  
> -static void scx_clear_softlockup(void)
> -{
> -	if (test_and_clear_bit(0, &scx_in_softlockup))
> -		atomic_dec(&scx_breather_depth);
> -}
> -
>  /**
>   * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
>   * @bypass: true for bypass, false for unbypass
> @@ -3826,8 +3809,6 @@ static void scx_bypass(bool bypass)
>  				      ktime_get_ns() - bypass_timestamp);
>  	}
>  
> -	atomic_inc(&scx_breather_depth);
> -
>  	/*
>  	 * No task property is changing. We just need to make sure all currently
>  	 * queued tasks are re-queued according to the new scx_rq_bypassing()
> @@ -3883,10 +3864,8 @@ static void scx_bypass(bool bypass)
>  		raw_spin_rq_unlock(rq);
>  	}
>  
> -	atomic_dec(&scx_breather_depth);
>  unlock:
>  	raw_spin_unlock_irqrestore(&bypass_lock, flags);
> -	scx_clear_softlockup();
>  }
>  
>  static void free_exit_info(struct scx_exit_info *ei)
> @@ -3981,6 +3960,7 @@ static void scx_disable_workfn(struct kthread_work *work)
>  
>  	/* guarantee forward progress by bypassing scx_ops */
>  	scx_bypass(true);
> +	WRITE_ONCE(scx_aborting, false);
>  
>  	switch (scx_set_enable_state(SCX_DISABLING)) {
>  	case SCX_DISABLING:
> @@ -4103,9 +4083,24 @@ static void scx_disable_workfn(struct kthread_work *work)
>  	scx_bypass(false);
>  }
>  
> -static void scx_disable(enum scx_exit_kind kind)
> +static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind)
>  {
>  	int none = SCX_EXIT_NONE;
> +
> +	if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind))
> +		return false;
> +
> +	/*
> +	 * Some CPUs may be trapped in the dispatch paths. Enable breather
> +	 * immediately; otherwise, we might not even be able to get to
> +	 * scx_bypass().
> +	 */
> +	WRITE_ONCE(scx_aborting, true);
> +	return true;
> +}
> +
> +static void scx_disable(enum scx_exit_kind kind)
> +{
>  	struct scx_sched *sch;
>  
>  	if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE))
> @@ -4114,7 +4109,7 @@ static void scx_disable(enum scx_exit_kind kind)
>  	rcu_read_lock();
>  	sch = rcu_dereference(scx_root);
>  	if (sch) {
> -		atomic_try_cmpxchg(&sch->exit_kind, &none, kind);
> +		scx_claim_exit(sch, kind);
>  		kthread_queue_work(sch->helper, &sch->disable_work);
>  	}
>  	rcu_read_unlock();
> @@ -4435,9 +4430,8 @@ static void scx_vexit(struct scx_sched *sch,
>  		      const char *fmt, va_list args)
>  {
>  	struct scx_exit_info *ei = sch->exit_info;
> -	int none = SCX_EXIT_NONE;
>  
> -	if (!atomic_try_cmpxchg(&sch->exit_kind, &none, kind))
> +	if (!scx_claim_exit(sch, kind))
>  		return;
>  
>  	ei->exit_code = exit_code;
> @@ -4653,6 +4647,8 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	 */
>  	WARN_ON_ONCE(scx_set_enable_state(SCX_ENABLING) != SCX_DISABLED);
>  	WARN_ON_ONCE(scx_root);
> +	if (WARN_ON_ONCE(READ_ONCE(scx_aborting)))
> +		WRITE_ONCE(scx_aborting, false);
>  
>  	atomic_long_set(&scx_nr_rejected, 0);
>  
> -- 
> 2.51.1
> 
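
(Aside: scx_claim_exit() uses the standard claim idiom -- only the first
caller transitions exit_kind away from SCX_EXIT_NONE. A minimal userspace
C11 analogue, names illustrative:)

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Analogue of scx_claim_exit(): the compare-and-exchange succeeds only
 * for the first caller, who then owns the abort path; later callers see
 * a non-zero exit_kind and back off.
 */
static bool claim_exit(_Atomic int *exit_kind, int kind)
{
	int none = 0;	/* stands in for SCX_EXIT_NONE */

	return atomic_compare_exchange_strong(exit_kind, &none, kind);
}
```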

Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting
  2025-11-09 18:31 ` [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
  2025-11-10  8:20   ` Andrea Righi
@ 2025-11-11 15:46   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Dan Schatzberg @ 2025-11-11 15:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:05AM -1000, Tejun Heo wrote:
> 62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
> the breather mechanism to inject delays during bypass mode switching. It
> maintains operation semantics unchanged while reducing lock contention to avoid
> live-locks on large NUMA systems.
> 
> However, the breather only activates when exiting the scheduler, so there's no
> need to maintain operation semantics. Simplify by exiting dispatch and move
> operations immediately when scx_aborting is set. In consume_dispatch_q(), break
> out of the task iteration loop. In scx_dsq_move(), return early before
> acquiring locks.
> 
> This also fixes cases the breather mechanism cannot handle. When a large system
> has many runnable threads affinitized to different CPU subsets and the BPF
> scheduler places them all into a single DSQ, many CPUs can scan the DSQ
> concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
> for extended periods, leading to various failure modes. The breather cannot
> solve this because once in the consume loop, there's no exit. The new mechanism
> fixes this by exiting the loop immediately.
> 
> The bypass DSQ is exempted to ensure the bypass mechanism itself can make
> progress.
> 
> Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 62 ++++++++++++++--------------------------------
>  1 file changed, 18 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 905d01f74687..afa89ca3659e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1821,48 +1821,11 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  	return dst_rq;
>  }
>  
> -/*
> - * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
> - * banging on the same DSQ on a large NUMA system to the point where switching
> - * to the bypass mode can take a long time. Inject artificial delays while the
> - * bypass mode is switching to guarantee timely completion.
> - */
> -static void scx_breather(struct rq *rq)
> -{
> -	u64 until;
> -
> -	lockdep_assert_rq_held(rq);
> -
> -	if (likely(!READ_ONCE(scx_aborting)))
> -		return;
> -
> -	raw_spin_rq_unlock(rq);
> -
> -	until = ktime_get_ns() + NSEC_PER_MSEC;
> -
> -	do {
> -		int cnt = 1024;
> -		while (READ_ONCE(scx_aborting) && --cnt)
> -			cpu_relax();
> -	} while (READ_ONCE(scx_aborting) &&
> -		 time_before64(ktime_get_ns(), until));
> -
> -	raw_spin_rq_lock(rq);
> -}
> -
>  static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
>  			       struct scx_dispatch_q *dsq)
>  {
>  	struct task_struct *p;
>  retry:
> -	/*
> -	 * This retry loop can repeatedly race against scx_bypass() dequeueing
> -	 * tasks from @dsq trying to put the system into the bypass mode. On
> -	 * some multi-socket machines (e.g. 2x Intel 8480c), this can live-lock
> -	 * the machine into soft lockups. Give a breather.
> -	 */
> -	scx_breather(rq);
> -
>  	/*
>  	 * The caller can't expect to successfully consume a task if the task's
>  	 * addition to @dsq isn't guaranteed to be visible somehow. Test
> @@ -1876,6 +1839,17 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
>  	nldsq_for_each_task(p, dsq) {
>  		struct rq *task_rq = task_rq(p);
>  
> +		/*
> +		 * This loop can lead to multiple lockup scenarios, e.g. the BPF
> +		 * scheduler can put an enormous number of affinitized tasks into
> +		 * a contended DSQ, or the outer retry loop can repeatedly race
> +		 * against scx_bypass() dequeueing tasks from @dsq trying to put
> +		 * the system into the bypass mode. This can easily live-lock the
> +		 * machine. If aborting, exit from all non-bypass DSQs.
> +		 */
> +		if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS)
> +			break;
> +
>  		if (rq == task_rq) {
>  			task_unlink_from_dsq(p, dsq);
>  			move_local_task_to_local_dsq(p, 0, dsq, rq);
> @@ -5635,6 +5609,13 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>  	    !scx_kf_allowed(sch, SCX_KF_DISPATCH))
>  		return false;
>  
> +	/*
> +	 * If the BPF scheduler keeps calling this function repeatedly, it can
> +	 * cause similar live-lock conditions as consume_dispatch_q().
> +	 */
> +	if (unlikely(scx_aborting))
> +		return false;
> +
>  	/*
>  	 * Can be called from either ops.dispatch() locking this_rq() or any
>  	 * context where no rq lock is held. If latter, lock @p's task_rq which
> @@ -5655,13 +5636,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>  		raw_spin_rq_lock(src_rq);
>  	}
>  
> -	/*
> -	 * If the BPF scheduler keeps calling this function repeatedly, it can
> -	 * cause similar live-lock conditions as consume_dispatch_q(). Insert a
> -	 * breather if necessary.
> -	 */
> -	scx_breather(src_rq);
> -
>  	locked_rq = src_rq;
>  	raw_spin_lock(&src_dsq->lock);
>  
> -- 
> 2.51.1
> 

Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool
  2025-11-09 18:31 ` [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
  2025-11-10  8:28   ` Andrea Righi
@ 2025-11-11 15:48   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Dan Schatzberg @ 2025-11-11 15:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:06AM -1000, Tejun Heo wrote:
> Make scx_exit() and scx_vexit() return bool indicating whether the calling
> thread successfully claimed the exit. This will be used by the abort mechanism
> added in a later patch.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index afa89ca3659e..033c8b8e88e8 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -177,18 +177,21 @@ MODULE_PARM_DESC(slice_bypass_us, "bypass slice in microseconds, applied on [un]
>  static void process_ddsp_deferred_locals(struct rq *rq);
>  static u32 reenq_local(struct rq *rq);
>  static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
> -static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
> +static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
>  		      s64 exit_code, const char *fmt, va_list args);
>  
> -static __printf(4, 5) void scx_exit(struct scx_sched *sch,
> +static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
>  				    enum scx_exit_kind kind, s64 exit_code,
>  				    const char *fmt, ...)
>  {
>  	va_list args;
> +	bool ret;
>  
>  	va_start(args, fmt);
> -	scx_vexit(sch, kind, exit_code, fmt, args);
> +	ret = scx_vexit(sch, kind, exit_code, fmt, args);
>  	va_end(args);
> +
> +	return ret;
>  }
>  
>  #define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
> @@ -4399,14 +4402,14 @@ static void scx_error_irq_workfn(struct irq_work *irq_work)
>  	kthread_queue_work(sch->helper, &sch->disable_work);
>  }
>  
> -static void scx_vexit(struct scx_sched *sch,
> +static bool scx_vexit(struct scx_sched *sch,
>  		      enum scx_exit_kind kind, s64 exit_code,
>  		      const char *fmt, va_list args)
>  {
>  	struct scx_exit_info *ei = sch->exit_info;
>  
>  	if (!scx_claim_exit(sch, kind))
> -		return;
> +		return false;
>  
>  	ei->exit_code = exit_code;
>  #ifdef CONFIG_STACKTRACE
> @@ -4423,6 +4426,7 @@ static void scx_vexit(struct scx_sched *sch,
>  	ei->reason = scx_exit_reason(ei->kind);
>  
>  	irq_work_queue(&sch->error_irq_work);
> +	return true;
>  }
>  
>  static int alloc_kick_syncs(void)
> -- 
> 2.51.1
>

Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup()
  2025-11-09 18:31 ` [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
  2025-11-10  8:29   ` Andrea Righi
@ 2025-11-11 15:49   ` Dan Schatzberg
  1 sibling, 0 replies; 45+ messages in thread
From: Dan Schatzberg @ 2025-11-11 15:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

On Sun, Nov 09, 2025 at 08:31:07AM -1000, Tejun Heo wrote:
> scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the
> scheduler is enabled under RCU read lock and trigger an error if so. Extract
> the common pattern into handle_lockup() helper. Add scx_verror() macro and use
> guard(rcu)().
> 
> This simplifies both handlers, reduces code duplication, and prepares for
> hardlockup handling.
> 
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 65 ++++++++++++++++++----------------------------
>  1 file changed, 25 insertions(+), 40 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 033c8b8e88e8..5c75b0125dfe 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -195,6 +195,7 @@ static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
>  }
>  
>  #define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
> +#define scx_verror(sch, fmt, args)	scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
>  
>  #define SCX_HAS_OP(sch, op)	test_bit(SCX_OP_IDX(op), (sch)->has_op)
>  
> @@ -3653,39 +3654,40 @@ bool scx_allow_ttwu_queue(const struct task_struct *p)
>  	return false;
>  }
>  
> -/**
> - * scx_rcu_cpu_stall - sched_ext RCU CPU stall handler
> - *
> - * While there are various reasons why RCU CPU stalls can occur on a system
> - * that may not be caused by the current BPF scheduler, try kicking out the
> - * current scheduler in an attempt to recover the system to a good state before
> - * issuing panics.
> - */
> -bool scx_rcu_cpu_stall(void)
> +static __printf(1, 2) bool handle_lockup(const char *fmt, ...)
>  {
>  	struct scx_sched *sch;
> +	va_list args;
>  
> -	rcu_read_lock();
> +	guard(rcu)();
>  
>  	sch = rcu_dereference(scx_root);
> -	if (unlikely(!sch)) {
> -		rcu_read_unlock();
> +	if (unlikely(!sch))
>  		return false;
> -	}
>  
>  	switch (scx_enable_state()) {
>  	case SCX_ENABLING:
>  	case SCX_ENABLED:
> -		break;
> +		va_start(args, fmt);
> +		scx_verror(sch, fmt, args);
> +		va_end(args);
> +		return true;
>  	default:
> -		rcu_read_unlock();
>  		return false;
>  	}
> +}
>  
> -	scx_error(sch, "RCU CPU stall detected!");
> -	rcu_read_unlock();
> -
> -	return true;
> +/**
> + * scx_rcu_cpu_stall - sched_ext RCU CPU stall handler
> + *
> + * While there are various reasons why RCU CPU stalls can occur on a system
> + * that may not be caused by the current BPF scheduler, try kicking out the
> + * current scheduler in an attempt to recover the system to a good state before
> + * issuing panics.
> + */
> +bool scx_rcu_cpu_stall(void)
> +{
> +	return handle_lockup("RCU CPU stall detected!");
>  }
>  
>  /**
> @@ -3700,28 +3702,11 @@ bool scx_rcu_cpu_stall(void)
>   */
>  void scx_softlockup(u32 dur_s)
>  {
> -	struct scx_sched *sch;
> -
> -	rcu_read_lock();
> -
> -	sch = rcu_dereference(scx_root);
> -	if (unlikely(!sch))
> -		goto out_unlock;
> -
> -	switch (scx_enable_state()) {
> -	case SCX_ENABLING:
> -	case SCX_ENABLED:
> -		break;
> -	default:
> -		goto out_unlock;
> -	}
> -
> -	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
> -			smp_processor_id(), dur_s, scx_root->ops.name);
> +	if (!handle_lockup("soft lockup - CPU %d stuck for %us", smp_processor_id(), dur_s))
> +		return;
>  
> -	scx_error(sch, "soft lockup - CPU#%d stuck for %us", smp_processor_id(), dur_s);
> -out_unlock:
> -	rcu_read_unlock();
> +	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU %d stuck for %us, disabling BPF scheduler\n",
> +			smp_processor_id(), dur_s);
>  }
>  
>  /**
> -- 
> 2.51.1
> 
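
(Aside: the va_start()/va_end() forwarding into scx_verror() is the
standard v*-variant pattern. A standalone userspace illustration:)

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Standalone illustration of the __printf(1, 2)-style forwarding used by
 * handle_lockup(): the variadic wrapper captures its arguments once and
 * hands them to a va_list-taking variant.
 */
static int vreport(char *buf, size_t len, const char *fmt, va_list args)
{
	return vsnprintf(buf, len, fmt, args);
}

static int report(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = vreport(buf, len, fmt, args);
	va_end(args);
	return ret;
}
```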

Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2025-11-11 15:50 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
2025-11-09 18:31 ` [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode Tejun Heo
2025-11-10  6:57   ` Andrea Righi
2025-11-10 16:08     ` Tejun Heo
2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
2025-11-10  7:03   ` Andrea Righi
2025-11-10  7:59     ` Andrea Righi
2025-11-10 16:21     ` Tejun Heo
2025-11-10 16:22       ` Tejun Heo
2025-11-10  8:22   ` Andrea Righi
2025-11-11 14:57   ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
2025-11-10  7:21   ` Andrea Righi
2025-11-09 18:31 ` [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
2025-11-10  7:42   ` Andrea Righi
2025-11-10 16:42     ` Tejun Heo
2025-11-10 17:30       ` Andrea Righi
2025-11-11 15:31   ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
2025-11-10  7:45   ` Andrea Righi
2025-11-11 15:34   ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
2025-11-10  8:20   ` Andrea Righi
2025-11-10 18:51     ` Tejun Heo
2025-11-11 15:46   ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
2025-11-10  8:28   ` Andrea Righi
2025-11-11 15:48   ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
2025-11-10  8:29   ` Andrea Righi
2025-11-11 15:49   ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
2025-11-10  8:29   ` Andrea Righi
2025-11-09 18:31 ` [PATCH 10/13] sched_ext: Hook up hardlockup detector Tejun Heo
2025-11-10  8:31   ` Andrea Righi
2025-11-09 18:31 ` [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
2025-11-10  8:36   ` Andrea Righi
2025-11-10 18:44     ` Tejun Heo
2025-11-10 21:06       ` Andrea Righi
2025-11-10 22:08         ` Tejun Heo
2025-11-09 18:31 ` [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
2025-11-10  8:37   ` Andrea Righi
2025-11-09 18:31 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
2025-11-10  9:38   ` Andrea Righi
2025-11-10 19:21     ` Tejun Heo
