From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Cheng-Yang Chou <yphbchou0911@gmail.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Ching-Chun Huang <jserv@ccns.ncku.edu.tw>,
Chia-Ping Tsai <chia7712@gmail.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
Date: Mon, 13 Apr 2026 07:32:56 +0200
Message-ID: <adyACFXObn1rZNPh@gpd4>
In-Reply-To: <9e172bda49dade833db7118929332693@kernel.org>
Hi Tejun,
On Sun, Apr 12, 2026 at 05:30:52PM -1000, Tejun Heo wrote:
> scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's
> ops.dispatch() can pop from. When a CPU pops a task that can't run on it
> (e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ.
> consume_dispatch_q() then skips the task due to affinity mismatch, leaving it
> stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't
> cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick()
> returns false when softirq is pending) -- but can cause noticeable scheduling
> delays.
>
> After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run
> it. There's a small race window where the home CPU can enter idle before the
> kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this
> can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after
> and the home CPU drains the task.
>
> Rather than fully eliminating the warning by routing pinned tasks to local or
> global DSQs, the current code keeps them going through the normal BPF queue
> path and documents the race and the resulting warning in detail. scx_qmap is an
> example scheduler and having tasks go through the usual dispatch path is useful
> for testing. The detailed comment also serves as a reference for other
> schedulers that may encounter similar warnings.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> v2: Replaced the previous enqueue-side fix which kicked when a pinned task was
> enqueued. That was based on the theory that ops.select_cpu() being skipped
> meant the home CPU wouldn't be woken, which wasn't quite right --
> wakeup_preempt() kicks the target CPU regardless. Moved the fix to
> ops.dispatch() where the stranding is actually observable.
Looks good now!
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
>
> tools/sched_ext/scx_qmap.bpf.c | 40 ++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index f3587fb709c9..a4543c7ab25d 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -471,6 +471,46 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
> __sync_fetch_and_add(&nr_dispatched, 1);
>
> scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0);
> +
> + /*
> + * scx_qmap uses a global BPF queue that any CPU's
> + * dispatch can pop from. If this CPU popped a task that
> + * can't run here, it gets stranded on SHARED_DSQ after
> + * consume_dispatch_q() skips it. Kick the task's home
> + * CPU so it drains SHARED_DSQ.
> + *
> + * There's a race between the pop and the flush of the
> + * buffered dsq_insert:
> + *
> + *   CPU 0 (dispatching)         CPU 1 (home, idle)
> + *   ~~~~~~~~~~~~~~~~~~~         ~~~~~~~~~~~~~~~~~~~
> + *   pop from BPF queue
> + *   dsq_insert(buffered)
> + *                               balance:
> + *                                 SHARED_DSQ empty
> + *                                 BPF queue empty
> + *                                 -> goes idle
> + *   flush -> on SHARED
> + *   kick CPU 1
> + *                               wakes, drains task
> + *
> + * The kick prevents indefinite stalls but a per-CPU
> + * kthread like ksoftirqd can be briefly stranded when
> + * its home CPU enters idle with softirq pending,
> + * triggering:
> + *
> + * "NOHZ tick-stop error: local softirq work is pending, handler #N!!!"
> + *
> + * from report_idle_softirq(). The kick lands shortly
> + * after and the home CPU drains the task. This could be
> + * avoided by e.g. dispatching pinned tasks to local or
> + * global DSQs, but the current code is left as-is to
> + * document this class of issue -- other schedulers
> + * seeing similar warnings can use this as a reference.
> + */
> + if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
> + scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
> +
> bpf_task_release(p);
>
> batch--;
> --
> 2.53.0
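
For readers following the thread without a sched_ext setup, the stranding
and the effect of the kick can be sketched with a toy user-space model.
This is only an illustration under stated assumptions, not scx_qmap or
kernel code: struct sim, dispatch(), consume(), process_kicks() and
pinned_task_runs() are hypothetical names, the shared DSQ is a plain array,
and a "kick" is modelled as a flag that makes the target CPU run one
consume pass. The buffered-insert race and the NOHZ warning are not
modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4
#define DSQ_CAP 16

/* Hypothetical stand-in for a task with a CPU affinity mask. */
struct task {
	uint32_t allowed;	/* bit N set: task may run on CPU N */
	int home_cpu;		/* CPU to kick if the task is stranded */
	bool ran;
};

/* Hypothetical stand-in for SHARED_DSQ plus pending kicks. */
struct sim {
	struct task *dsq[DSQ_CAP];
	int dsq_len;
	bool kicked[NR_CPUS];
};

static bool can_run_on(const struct task *t, int cpu)
{
	return (t->allowed & (1u << cpu)) != 0;
}

/* Take the first task this CPU may run; tasks it may not run are skipped. */
static struct task *consume(struct sim *s, int cpu)
{
	for (int i = 0; i < s->dsq_len; i++) {
		if (!can_run_on(s->dsq[i], cpu))
			continue;
		struct task *t = s->dsq[i];
		for (int j = i; j + 1 < s->dsq_len; j++)
			s->dsq[j] = s->dsq[j + 1];
		s->dsq_len--;
		return t;
	}
	return NULL;
}

/* Model of a dispatch pass: insert the popped task, optionally kick, consume. */
static void dispatch(struct sim *s, int cpu, struct task *popped, bool kick_home)
{
	if (s->dsq_len < DSQ_CAP)
		s->dsq[s->dsq_len++] = popped;
	if (kick_home && !can_run_on(popped, cpu))
		s->kicked[popped->home_cpu] = true;

	struct task *t = consume(s, cpu);
	if (t)
		t->ran = true;
}

/* Every kicked CPU runs one consume pass of its own. */
static void process_kicks(struct sim *s)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!s->kicked[cpu])
			continue;
		s->kicked[cpu] = false;
		struct task *t = consume(s, cpu);
		if (t)
			t->ran = true;
	}
}

/* CPU 0 pops a task pinned to CPU 1; report whether the task got to run. */
bool pinned_task_runs(bool kick_home)
{
	struct sim s = {0};
	struct task t = { .allowed = 1u << 1, .home_cpu = 1, .ran = false };

	dispatch(&s, 0, &t, kick_home);
	process_kicks(&s);
	return t.ran;
}
```

In this model pinned_task_runs(false) leaves the task sitting in the shared
queue because CPU 0 skips it and nothing prompts CPU 1 to look, while
pinned_task_runs(true) lets CPU 1 drain it right after the kick. In the real
scheduler the task is not stuck forever without the kick, since the periodic
tick eventually brings the home CPU back into ops.dispatch(); the model only
shows why the delay appears.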
Thread overview: 8+ messages
2026-04-11 11:33 [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap Tejun Heo
2026-04-11 12:57 ` Cheng-Yang Chou
2026-04-11 14:27 ` Cheng-Yang Chou
2026-04-11 15:03 ` Andrea Righi
2026-04-13 3:30 ` [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded " Tejun Heo
2026-04-13 5:32 ` Andrea Righi [this message]
2026-04-13 5:38 ` Cheng-Yang Chou
2026-04-13 16:21 ` Tejun Heo