From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
Andrea Righi <andrea.righi@linux.dev>,
Changwoo Min <changwoo@igalia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>,
Emil Tsalapatis <etsal@meta.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
Tejun Heo <tj@kernel.org>, Andrea Righi <arighi@nvidia.com>
Subject: [PATCH 05/13] sched_ext: Exit dispatch and move operations immediately when aborting
Date: Tue, 11 Nov 2025 09:18:08 -1000 [thread overview]
Message-ID: <20251111191816.862797-6-tj@kernel.org> (raw)
In-Reply-To: <20251111191816.862797-1-tj@kernel.org>
62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
the breather mechanism to inject delays during bypass mode switching. It
maintains operation semantics unchanged while reducing lock contention to avoid
live-locks on large NUMA systems.
However, the breather only activates when exiting the scheduler, so there's no
need to maintain operation semantics. Simplify by exiting dispatch and move
operations immediately when scx_aborting is set. In consume_dispatch_q(), break
out of the task iteration loop. In scx_dsq_move(), return early before
acquiring locks.
This also fixes cases the breather mechanism cannot handle. When a large system
has many runnable threads affinitized to different CPU subsets and the BPF
scheduler places them all into a single DSQ, many CPUs can scan the DSQ
concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
for extended periods, leading to various failure modes. The breather cannot
solve this because once in the consume loop, there's no exit. The new mechanism
fixes this by exiting the loop immediately.
The bypass DSQ is exempted to ensure the bypass mechanism itself can make
progress.
v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 62 ++++++++++++++--------------------------------
1 file changed, 18 insertions(+), 44 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5da699cacde1..56a6d453543a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1818,48 +1818,11 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
return dst_rq;
}
-/*
- * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
- * banging on the same DSQ on a large NUMA system to the point where switching
- * to the bypass mode can take a long time. Inject artificial delays while the
- * bypass mode is switching to guarantee timely completion.
- */
-static void scx_breather(struct rq *rq)
-{
- u64 until;
-
- lockdep_assert_rq_held(rq);
-
- if (likely(!READ_ONCE(scx_aborting)))
- return;
-
- raw_spin_rq_unlock(rq);
-
- until = ktime_get_ns() + NSEC_PER_MSEC;
-
- do {
- int cnt = 1024;
- while (READ_ONCE(scx_aborting) && --cnt)
- cpu_relax();
- } while (READ_ONCE(scx_aborting) &&
- time_before64(ktime_get_ns(), until));
-
- raw_spin_rq_lock(rq);
-}
-
static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
struct scx_dispatch_q *dsq)
{
struct task_struct *p;
retry:
- /*
- * This retry loop can repeatedly race against scx_bypass() dequeueing
- * tasks from @dsq trying to put the system into the bypass mode. On
- * some multi-socket machines (e.g. 2x Intel 8480c), this can live-lock
- * the machine into soft lockups. Give a breather.
- */
- scx_breather(rq);
-
/*
* The caller can't expect to successfully consume a task if the task's
* addition to @dsq isn't guaranteed to be visible somehow. Test
@@ -1873,6 +1836,17 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
nldsq_for_each_task(p, dsq) {
struct rq *task_rq = task_rq(p);
+ /*
+ * This loop can lead to multiple lockup scenarios, e.g. the BPF
+ * scheduler can put an enormous number of affinitized tasks into
+ * a contended DSQ, or the outer retry loop can repeatedly race
+ * against scx_bypass() dequeueing tasks from @dsq trying to put
+ * the system into the bypass mode. This can easily live-lock the
+ * machine. If aborting, exit from all non-bypass DSQs.
+ */
+ if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS)
+ break;
+
if (rq == task_rq) {
task_unlink_from_dsq(p, dsq);
move_local_task_to_local_dsq(p, 0, dsq, rq);
@@ -5636,6 +5610,13 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
!scx_kf_allowed(sch, SCX_KF_DISPATCH))
return false;
+ /*
+ * If the BPF scheduler keeps calling this function repeatedly, it can
+ * cause similar live-lock conditions as consume_dispatch_q().
+ */
+ if (unlikely(READ_ONCE(scx_aborting)))
+ return false;
+
/*
* Can be called from either ops.dispatch() locking this_rq() or any
* context where no rq lock is held. If latter, lock @p's task_rq which
@@ -5656,13 +5637,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
raw_spin_rq_lock(src_rq);
}
- /*
- * If the BPF scheduler keeps calling this function repeatedly, it can
- * cause similar live-lock conditions as consume_dispatch_q(). Insert a
- * breather if necessary.
- */
- scx_breather(src_rq);
-
locked_rq = src_rq;
raw_spin_lock(&src_dsq->lock);
--
2.51.2
next prev parent reply other threads:[~2025-11-11 19:18 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-11 19:18 [PATCHSET v3 sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
2025-11-11 19:18 ` [PATCH 01/13] sched_ext: Use shorter slice in bypass mode Tejun Heo
2025-11-11 19:18 ` [PATCH 02/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
2025-11-11 19:18 ` [PATCH 03/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
2025-11-11 19:18 ` [PATCH 04/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
2025-11-11 19:18 ` Tejun Heo [this message]
2025-11-11 19:18 ` [PATCH 06/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
2025-11-11 19:18 ` [PATCH 07/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
2025-11-11 19:18 ` [PATCH 08/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
2025-11-11 19:18 ` [PATCH 09/13] sched_ext: Hook up hardlockup detector Tejun Heo
2025-11-11 19:19 ` Tejun Heo
2025-11-13 22:33 ` Doug Anderson
2025-11-14 1:25 ` Tejun Heo
2025-11-14 1:33 ` [PATCH sched_ext/for-6.19] sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs Tejun Heo
2025-11-14 2:00 ` Emil Tsalapatis
2025-11-14 7:32 ` Andrea Righi
2025-11-14 19:24 ` Doug Anderson
2025-11-14 21:15 ` Tejun Heo
2025-11-14 21:19 ` Tejun Heo
2025-11-11 19:18 ` [PATCH 10/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
2025-11-11 19:18 ` [PATCH 11/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
2025-11-11 19:18 ` [PATCH 12/13] sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() Tejun Heo
2025-11-11 19:18 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
2025-11-11 19:30 ` Emil Tsalapatis
2025-11-12 16:49 ` [PATCHSET v3 sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251111191816.862797-6-tj@kernel.org \
--to=tj@kernel.org \
--cc=andrea.righi@linux.dev \
--cc=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=etsal@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=schatzberg.dan@gmail.com \
--cc=sched-ext@lists.linux.dev \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox