From: Andrea Righi <andrea.righi@linux.dev>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Dan Schatzberg <schatzberg.dan@gmail.com>,
Emil Tsalapatis <etsal@meta.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting
Date: Mon, 10 Nov 2025 09:20:15 +0100
Message-ID: <aRGgP4UQwgOdhgwj@gpd4>
In-Reply-To: <20251109183112.2412147-7-tj@kernel.org>
Hi Tejun,
On Sun, Nov 09, 2025 at 08:31:05AM -1000, Tejun Heo wrote:
> 62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
> the breather mechanism to inject delays during bypass mode switching. It
> maintains operation semantics unchanged while reducing lock contention to avoid
> live-locks on large NUMA systems.
>
> However, the breather only activates when exiting the scheduler, so there's no
> need to maintain operation semantics. Simplify by exiting dispatch and move
> operations immediately when scx_aborting is set. In consume_dispatch_q(), break
> out of the task iteration loop. In scx_dsq_move(), return early before
> acquiring locks.
>
> This also fixes cases the breather mechanism cannot handle. When a large system
> has many runnable threads affinitized to different CPU subsets and the BPF
> scheduler places them all into a single DSQ, many CPUs can scan the DSQ
> concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
> for extended periods, leading to various failure modes. The breather cannot
> solve this because once in the consume loop, there's no exit. The new mechanism
> fixes this by exiting the loop immediately.
>
> The bypass DSQ is exempted to ensure the bypass mechanism itself can make
> progress.
>
> Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 62 ++++++++++++++--------------------------------
> 1 file changed, 18 insertions(+), 44 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 905d01f74687..afa89ca3659e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1821,48 +1821,11 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> return dst_rq;
> }
>
> -/*
> - * A poorly behaving BPF scheduler can live-lock the system by e.g. incessantly
> - * banging on the same DSQ on a large NUMA system to the point where switching
> - * to the bypass mode can take a long time. Inject artificial delays while the
> - * bypass mode is switching to guarantee timely completion.
> - */
> -static void scx_breather(struct rq *rq)
> -{
> - u64 until;
> -
> - lockdep_assert_rq_held(rq);
> -
> - if (likely(!READ_ONCE(scx_aborting)))
> - return;
> -
> - raw_spin_rq_unlock(rq);
> -
> - until = ktime_get_ns() + NSEC_PER_MSEC;
> -
> - do {
> - int cnt = 1024;
> - while (READ_ONCE(scx_aborting) && --cnt)
> - cpu_relax();
> - } while (READ_ONCE(scx_aborting) &&
> - time_before64(ktime_get_ns(), until));
> -
> - raw_spin_rq_lock(rq);
> -}
> -
> static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
> struct scx_dispatch_q *dsq)
> {
> struct task_struct *p;
> retry:
> - /*
> - * This retry loop can repeatedly race against scx_bypass() dequeueing
> - * tasks from @dsq trying to put the system into the bypass mode. On
> - * some multi-socket machines (e.g. 2x Intel 8480c), this can live-lock
> - * the machine into soft lockups. Give a breather.
> - */
> - scx_breather(rq);
> -
> /*
> * The caller can't expect to successfully consume a task if the task's
> * addition to @dsq isn't guaranteed to be visible somehow. Test
> @@ -1876,6 +1839,17 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
> nldsq_for_each_task(p, dsq) {
> struct rq *task_rq = task_rq(p);
>
> + /*
> + * This loop can lead to multiple lockup scenarios, e.g. the BPF
> + * scheduler can put an enormous number of affinitized tasks into
> + * a contended DSQ, or the outer retry loop can repeatedly race
> + * against scx_bypass() dequeueing tasks from @dsq trying to put
> + * the system into the bypass mode. This can easily live-lock the
> + * machine. If aborting, exit from all non-bypass DSQs.
> + */
> + if (unlikely(READ_ONCE(scx_aborting)) && dsq->id != SCX_DSQ_BYPASS)
> + break;
> +
> if (rq == task_rq) {
> task_unlink_from_dsq(p, dsq);
> move_local_task_to_local_dsq(p, 0, dsq, rq);
> @@ -5635,6 +5609,13 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> !scx_kf_allowed(sch, SCX_KF_DISPATCH))
> return false;
>
> + /*
> + * If the BPF scheduler keeps calling this function repeatedly, it can
> + * cause similar live-lock conditions as consume_dispatch_q().
> + */
> + if (unlikely(scx_aborting))
Should this be READ_ONCE(scx_aborting), to match the check in consume_dispatch_q()?
Thanks,
-Andrea
> + return false;
> +
> /*
> * Can be called from either ops.dispatch() locking this_rq() or any
> * context where no rq lock is held. If latter, lock @p's task_rq which
> @@ -5655,13 +5636,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> raw_spin_rq_lock(src_rq);
> }
>
> - /*
> - * If the BPF scheduler keeps calling this function repeatedly, it can
> - * cause similar live-lock conditions as consume_dispatch_q(). Insert a
> - * breather if necessary.
> - */
> - scx_breather(src_rq);
> -
> locked_rq = src_rq;
> raw_spin_lock(&src_dsq->lock);
>
> --
> 2.51.1
>
Thread overview: 45+ messages
2025-11-09 18:30 [PATCHSET sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
2025-11-09 18:31 ` [PATCH 01/13] sched_ext: Don't set ddsp_dsq_id during select_cpu in bypass mode Tejun Heo
2025-11-10 6:57 ` Andrea Righi
2025-11-10 16:08 ` Tejun Heo
2025-11-09 18:31 ` [PATCH 02/13] sched_ext: Make slice values tunable and use shorter slice " Tejun Heo
2025-11-10 7:03 ` Andrea Righi
2025-11-10 7:59 ` Andrea Righi
2025-11-10 16:21 ` Tejun Heo
2025-11-10 16:22 ` Tejun Heo
2025-11-10 8:22 ` Andrea Righi
2025-11-11 14:57 ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 03/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
2025-11-10 7:21 ` Andrea Righi
2025-11-09 18:31 ` [PATCH 04/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
2025-11-10 7:42 ` Andrea Righi
2025-11-10 16:42 ` Tejun Heo
2025-11-10 17:30 ` Andrea Righi
2025-11-11 15:31 ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 05/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
2025-11-10 7:45 ` Andrea Righi
2025-11-11 15:34 ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 06/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
2025-11-10 8:20 ` Andrea Righi [this message]
2025-11-10 18:51 ` Tejun Heo
2025-11-11 15:46 ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 07/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
2025-11-10 8:28 ` Andrea Righi
2025-11-11 15:48 ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 08/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
2025-11-10 8:29 ` Andrea Righi
2025-11-11 15:49 ` Dan Schatzberg
2025-11-09 18:31 ` [PATCH 09/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
2025-11-10 8:29 ` Andrea Righi
2025-11-09 18:31 ` [PATCH 10/13] sched_ext: Hook up hardlockup detector Tejun Heo
2025-11-10 8:31 ` Andrea Righi
2025-11-09 18:31 ` [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
2025-11-10 8:36 ` Andrea Righi
2025-11-10 18:44 ` Tejun Heo
2025-11-10 21:06 ` Andrea Righi
2025-11-10 22:08 ` Tejun Heo
2025-11-09 18:31 ` [PATCH 12/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
2025-11-10 8:37 ` Andrea Righi
2025-11-09 18:31 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
2025-11-10 9:38 ` Andrea Righi
2025-11-10 19:21 ` Tejun Heo