[PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap

public inbox for sched-ext@lists.linux.dev

* [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap
@ 2026-04-11 11:33 Tejun Heo
  2026-04-11 12:57 ` Cheng-Yang Chou
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Tejun Heo @ 2026-04-11 11:33 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Emil Tsalapatis, sched-ext, linux-kernel

scx_qmap uses global BPF queue maps for task dispatch. A task pinned to a
single CPU can only be dispatched by its home CPU's ops.dispatch(), but an
idle CPU won't call ops.dispatch() on its own. This leaves per-CPU kthreads
like ksoftirqd stranded, causing NOHZ tick-stop warnings from pending
softirqs.

Kick the target CPU with SCX_KICK_IDLE when enqueueing a pinned task.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index f3587fb709c9..09d1624fb869 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -314,6 +314,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 		__sync_fetch_and_add(&nr_highpri_queued, 1);
 	}
 	__sync_fetch_and_add(&nr_enqueued, 1);
+
+	/*
+	 * Kick idle target CPU for pinned tasks. Without this, the CPU can
+	 * idle while ksoftirqd is pending in the BPF queue, triggering NOHZ
+	 * tick-stop warnings.
+	 */
+	if (p->nr_cpus_allowed == 1)
+		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
 }

 /*
--
2.53.0
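
The enqueue-side check in this patch can be modelled outside the kernel. What follows is a minimal sketch in plain userspace C, not BPF code: `struct model_task`, `model_nr_cpus_allowed()` and `model_enqueue_kick_target()` are hypothetical stand-ins introduced here for illustration, and the affinity mask is assumed to fit in 64 bits.

```c
#include <stdint.h>

/* Hypothetical stand-in for the task fields the patch reads. */
struct model_task {
	uint64_t cpus_allowed;	/* bit N set => task may run on CPU N */
	int home_cpu;		/* models scx_bpf_task_cpu(p) */
};

/* Count the CPUs in the affinity mask (models p->nr_cpus_allowed). */
static int model_nr_cpus_allowed(const struct model_task *t)
{
	int n = 0;

	/* Clear the lowest set bit on each iteration. */
	for (uint64_t m = t->cpus_allowed; m; m &= m - 1)
		n++;
	return n;
}

/*
 * Return the CPU that would be kicked after enqueueing @t, or -1 when
 * no kick is issued. Only tasks pinned to a single CPU trigger a kick.
 */
static int model_enqueue_kick_target(const struct model_task *t)
{
	if (model_nr_cpus_allowed(t) == 1)
		return t->home_cpu;
	return -1;
}
```

With this model, a task whose mask is `1ULL << 3` yields kick target 3, while a task allowed on CPUs 0-3 yields -1.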


* Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap
  2026-04-11 11:33 [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap Tejun Heo
@ 2026-04-11 12:57 ` Cheng-Yang Chou
  2026-04-11 14:27   ` Cheng-Yang Chou
  2026-04-11 15:03 ` Andrea Righi
  2026-04-13  3:30 ` [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded " Tejun Heo
  2 siblings, 1 reply; 8+ messages in thread
From: Cheng-Yang Chou @ 2026-04-11 12:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel

Hi Tejun,

On Sat, Apr 11, 2026 at 01:33:56AM -1000, Tejun Heo wrote:
> scx_qmap uses global BPF queue maps for task dispatch. A task pinned to a
> single CPU can only be dispatched by its home CPU's ops.dispatch(), but an
> idle CPU won't call ops.dispatch() on its own. This leaves per-CPU kthreads
> like ksoftirqd stranded, causing NOHZ tick-stop warnings from pending
> softirqs.
> 
> Kick the target CPU with SCX_KICK_IDLE when enqueueing a pinned task.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  tools/sched_ext/scx_qmap.bpf.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index f3587fb709c9..09d1624fb869 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -314,6 +314,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
>  		__sync_fetch_and_add(&nr_highpri_queued, 1);
>  	}
>  	__sync_fetch_and_add(&nr_enqueued, 1);
> +
> +	/*
> +	 * Kick idle target CPU for pinned tasks. Without this, the CPU can
> +	 * idle while ksoftirqd is pending in the BPF queue, triggering NOHZ
> +	 * tick-stop warnings.
> +	 */
> +	if (p->nr_cpus_allowed == 1)
> +		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
>  }
> 
>  /*
> --
> 2.53.0

Looks good to me! The same issue exists in scx_userland where pinned
tasks can be dispatched to SCX_DSQ_GLOBAL without kicking the idle
target CPU. I'll follow up with a patch to add the same fix there!

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

-- 
Thanks,
Cheng-Yang


* Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap
  2026-04-11 12:57 ` Cheng-Yang Chou
@ 2026-04-11 14:27   ` Cheng-Yang Chou
  0 siblings, 0 replies; 8+ messages in thread
From: Cheng-Yang Chou @ 2026-04-11 14:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	sched-ext, linux-kernel, Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Sat, Apr 11, 2026 at 08:57:42PM +0800, Cheng-Yang Chou wrote:
> On Sat, Apr 11, 2026 at 01:33:56AM -1000, Tejun Heo wrote:
> > scx_qmap uses global BPF queue maps for task dispatch. A task pinned to a
> > single CPU can only be dispatched by its home CPU's ops.dispatch(), but an
> > idle CPU won't call ops.dispatch() on its own. This leaves per-CPU kthreads
> > like ksoftirqd stranded, causing NOHZ tick-stop warnings from pending
> > softirqs.
> > 
> > Kick the target CPU with SCX_KICK_IDLE when enqueueing a pinned task.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> >  tools/sched_ext/scx_qmap.bpf.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> > index f3587fb709c9..09d1624fb869 100644
> > --- a/tools/sched_ext/scx_qmap.bpf.c
> > +++ b/tools/sched_ext/scx_qmap.bpf.c
> > @@ -314,6 +314,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
> >  		__sync_fetch_and_add(&nr_highpri_queued, 1);
> >  	}
> >  	__sync_fetch_and_add(&nr_enqueued, 1);
> > +
> > +	/*
> > +	 * Kick idle target CPU for pinned tasks. Without this, the CPU can
> > +	 * idle while ksoftirqd is pending in the BPF queue, triggering NOHZ
> > +	 * tick-stop warnings.
> > +	 */
> > +	if (p->nr_cpus_allowed == 1)
> > +		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
> >  }
> > 
> >  /*
> > --
> > 2.53.0
> 
> Looks good to me! The same issue exists in scx_userland where pinned
> tasks can be dispatched to SCX_DSQ_GLOBAL without kicking the idle
> target CPU. I'll follow up with a patch to add the same fix there!
> 
> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

Actually, would it make more sense to fold this directly into the same
patch?

diff --git a/tools/sched_ext/scx_userland.bpf.c b/tools/sched_ext/scx_userland.bpf.c
index f29862b89386..56c53d457f45 100644
--- a/tools/sched_ext/scx_userland.bpf.c
+++ b/tools/sched_ext/scx_userland.bpf.c
@@ -195,6 +195,14 @@ static void enqueue_task_in_user_space(struct task_struct *p, u64 enq_flags)
                 */
                __sync_fetch_and_add(&nr_failed_enqueues, 1);
                scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+
+               /*
+                * Kick idle target CPU for pinned tasks. Without this, the CPU can
+                * idle while ksoftirqd is pending in the BPF queue, triggering NOHZ
+                * tick-stop warnings.
+                */
+               if (p->nr_cpus_allowed == 1)
+                       scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
        } else {
                __sync_fetch_and_add(&nr_user_enqueues, 1);
                set_usersched_needed();


-- 
Thanks,
Cheng-Yang


* Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap
  2026-04-11 11:33 [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap Tejun Heo
  2026-04-11 12:57 ` Cheng-Yang Chou
@ 2026-04-11 15:03 ` Andrea Righi
  2026-04-13  3:30 ` [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded " Tejun Heo
  2 siblings, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2026-04-11 15:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, sched-ext,
	linux-kernel

Hi Tejun,

On Sat, Apr 11, 2026 at 01:33:56AM -1000, Tejun Heo wrote:
> scx_qmap uses global BPF queue maps for task dispatch. A task pinned to a
> single CPU can only be dispatched by its home CPU's ops.dispatch(), but an
> idle CPU won't call ops.dispatch() on its own. This leaves per-CPU kthreads
> like ksoftirqd stranded, causing NOHZ tick-stop warnings from pending
> softirqs.
> 
> Kick the target CPU with SCX_KICK_IDLE when enqueueing a pinned task.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  tools/sched_ext/scx_qmap.bpf.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index f3587fb709c9..09d1624fb869 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -314,6 +314,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
>  		__sync_fetch_and_add(&nr_highpri_queued, 1);
>  	}
>  	__sync_fetch_and_add(&nr_enqueued, 1);
> +
> +	/*
> +	 * Kick idle target CPU for pinned tasks. Without this, the CPU can
> +	 * idle while ksoftirqd is pending in the BPF queue, triggering NOHZ
> +	 * tick-stop warnings.
> +	 */
> +	if (p->nr_cpus_allowed == 1)
> +		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);

I think we should kick the task's CPU in general, also when
p->nr_cpus_allowed == N, with N < nr_cpu_ids, otherwise we can have the same
problem if one of the N allowed CPUs is never awakened. Moreover, tasks will
have a better chance to keep running on the same CPU, which is nice, unless we
want to limit the amount of CPU wakeups.

If we want to be fancy we could even do something like this:

if (!__COMPAT_is_enq_cpu_selected(enq_flags) && !scx_bpf_task_running(p))
    scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);

In practice: if ops.select_cpu() was called there's no reason to do the kick,
because it's supposed to be done already in ops.select_cpu(). Similarly, if we have
a queued wakeup event, ops.select_cpu() was skipped, so we should explicitly
kick the CPU if the task was enqueued and it wasn't running already.

Thanks,
-Andrea
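
The condition suggested in this reply reduces to a two-input predicate. Here is a minimal sketch of it as plain userspace C, not BPF code: `struct enq_state` and `model_should_kick()` are hypothetical names introduced for illustration. The two booleans stand in for `__COMPAT_is_enq_cpu_selected(enq_flags)` and `scx_bpf_task_running(p)`.

```c
#include <stdbool.h>

/* Hypothetical model of the two facts the enqueue path would consult. */
struct enq_state {
	bool cpu_selected;	/* ops.select_cpu() ran for this enqueue */
	bool task_running;	/* the task is already running on a CPU */
};

/*
 * Return true when the enqueue path should kick the task's CPU itself:
 * ops.select_cpu() was skipped, so nothing else issued the kick, and the
 * task is not already running.
 */
static bool model_should_kick(const struct enq_state *s)
{
	return !s->cpu_selected && !s->task_running;
}
```

Of the four input combinations, only "select_cpu skipped, task not running" results in a kick.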


* [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  2026-04-11 11:33 [PATCH sched_ext/for-7.1] tools/sched_ext: Kick idle CPU for pinned tasks in scx_qmap Tejun Heo
  2026-04-11 12:57 ` Cheng-Yang Chou
  2026-04-11 15:03 ` Andrea Righi
@ 2026-04-13  3:30 ` Tejun Heo
  2026-04-13  5:32   ` Andrea Righi
                     ` (2 more replies)
  2 siblings, 3 replies; 8+ messages in thread
From: Tejun Heo @ 2026-04-13  3:30 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Emil Tsalapatis, Ching-Chun Huang,
	Chia-Ping Tsai, sched-ext, linux-kernel

scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's
ops.dispatch() can pop from. When a CPU pops a task that can't run on it
(e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ.
consume_dispatch_q() then skips the task due to affinity mismatch, leaving it
stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't
cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick()
returns false when softirq is pending) -- but can cause noticeable scheduling
delays.

After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run
it. There's a small race window where the home CPU can enter idle before the
kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this
can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after
and the home CPU drains the task.

Rather than fully eliminating the warning by routing pinned tasks to local or
global DSQs, the current code keeps them going through the normal BPF queue
path and documents the race and the resulting warning in detail. scx_qmap is an
example scheduler and having tasks go through the usual dispatch path is useful
for testing. The detailed comment also serves as a reference for other
schedulers that may encounter similar warnings.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
v2: Replaced the previous enqueue-side fix which kicked when a pinned task was
    enqueued. That was based on the theory that ops.select_cpu() being skipped
    meant the home CPU wouldn't be woken, which wasn't quite right --
    wakeup_preempt() kicks the target CPU regardless. Moved the fix to
    ops.dispatch() where the stranding is actually observable.

 tools/sched_ext/scx_qmap.bpf.c | 40 ++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index f3587fb709c9..a4543c7ab25d 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -471,6 +471,46 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 			__sync_fetch_and_add(&nr_dispatched, 1);

 			scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0);
+
+			/*
+			 * scx_qmap uses a global BPF queue that any CPU's
+			 * dispatch can pop from. If this CPU popped a task that
+			 * can't run here, it gets stranded on SHARED_DSQ after
+			 * consume_dispatch_q() skips it. Kick the task's home
+			 * CPU so it drains SHARED_DSQ.
+			 *
+			 * There's a race between the pop and the flush of the
+			 * buffered dsq_insert:
+			 *
+			 *  CPU 0 (dispatching)      CPU 1 (home, idle)
+			 *  ~~~~~~~~~~~~~~~~~~~      ~~~~~~~~~~~~~~~~~~~
+			 *  pop from BPF queue
+			 *  dsq_insert(buffered)
+			 *                           balance:
+			 *                             SHARED_DSQ empty
+			 *                             BPF queue empty
+			 *                             -> goes idle
+			 *  flush -> on SHARED
+			 *  kick CPU 1
+			 *                           wakes, drains task
+			 *
+			 * The kick prevents indefinite stalls but a per-CPU
+			 * kthread like ksoftirqd can be briefly stranded when
+			 * its home CPU enters idle with softirq pending,
+			 * triggering:
+			 *
+			 *  "NOHZ tick-stop error: local softirq work is pending, handler #N!!!"
+			 *
+			 * from report_idle_softirq(). The kick lands shortly
+			 * after and the home CPU drains the task. This could be
+			 * avoided by e.g. dispatching pinned tasks to local or
+			 * global DSQs, but the current code is left as-is to
+			 * document this class of issue -- other schedulers
+			 * seeing similar warnings can use this as a reference.
+			 */
+			if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
+				scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
+
 			bpf_task_release(p);

 			batch--;
--
2.53.0
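
The dispatch-side check and the race described in the comment above can be replayed in a small model. This is a minimal sketch in plain userspace C, not BPF code, and it simplifies the kernel behaviour: the `model_*` names and both structs are hypothetical, CPU numbers are assumed to be in the range 0-63, and the shared DSQ is reduced to a counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the task fields the patch reads. */
struct model_task {
	uint64_t cpus_allowed;	/* bit N set => task may run on CPU N */
	int home_cpu;		/* models scx_bpf_task_cpu(p) */
};

/* Models bpf_cpumask_test_cpu(cpu, p->cpus_ptr); cpu must be 0..63. */
static bool model_can_run_on(const struct model_task *t, int cpu)
{
	return (t->cpus_allowed >> cpu) & 1;
}

/*
 * Return the CPU to kick after @cpu moved @t to the shared DSQ, or -1
 * when @cpu can run the task itself and no kick is needed.
 */
static int model_dispatch_kick_target(int cpu, const struct model_task *t)
{
	if (!model_can_run_on(t, cpu))
		return t->home_cpu;
	return -1;
}

/* Reduced state of the home CPU and the shared DSQ. */
struct model_state {
	int shared_dsq_len;	/* tasks visible on the shared DSQ */
	bool home_idle;		/* home CPU has gone idle */
	int home_ran;		/* tasks the home CPU has picked up */
};

/* Home CPU looks for work; it goes idle if the shared DSQ is empty. */
static void model_home_balance(struct model_state *s)
{
	if (s->shared_dsq_len > 0) {
		s->shared_dsq_len--;
		s->home_ran++;
		s->home_idle = false;
	} else {
		s->home_idle = true;
	}
}

/* A kick wakes the home CPU, which then looks for work again. */
static void model_kick_home(struct model_state *s)
{
	s->home_idle = false;
	model_home_balance(s);
}
```

Replaying the interleaving from the comment: `model_home_balance()` on an empty state leaves the home CPU idle; the buffered insert then becomes visible (`shared_dsq_len = 1`) while the home CPU is still idle; `model_kick_home()` finally lets the home CPU pick the task up.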


* Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  2026-04-13  3:30 ` [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded " Tejun Heo
@ 2026-04-13  5:32   ` Andrea Righi
  2026-04-13  5:38   ` Cheng-Yang Chou
  2026-04-13 16:21   ` Tejun Heo
  2 siblings, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2026-04-13  5:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Cheng-Yang Chou, Emil Tsalapatis,
	Ching-Chun Huang, Chia-Ping Tsai, sched-ext, linux-kernel

Hi Tejun,

On Sun, Apr 12, 2026 at 05:30:52PM -1000, Tejun Heo wrote:
> scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's
> ops.dispatch() can pop from. When a CPU pops a task that can't run on it
> (e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ.
> consume_dispatch_q() then skips the task due to affinity mismatch, leaving it
> stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't
> cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick()
> returns false when softirq is pending) -- but can cause noticeable scheduling
> delays.
> 
> After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run
> it. There's a small race window where the home CPU can enter idle before the
> kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this
> can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after
> and the home CPU drains the task.
> 
> Rather than fully eliminating the warning by routing pinned tasks to local or
> global DSQs, the current code keeps them going through the normal BPF queue
> path and documents the race and the resulting warning in detail. scx_qmap is an
> example scheduler and having tasks go through the usual dispatch path is useful
> for testing. The detailed comment also serves as a reference for other
> schedulers that may encounter similar warnings.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> v2: Replaced the previous enqueue-side fix which kicked when a pinned task was
>     enqueued. That was based on the theory that ops.select_cpu() being skipped
>     meant the home CPU wouldn't be woken, which wasn't quite right --
>     wakeup_preempt() kicks the target CPU regardless. Moved the fix to
>     ops.dispatch() where the stranding is actually observable.

Looks good now!

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> 
>  tools/sched_ext/scx_qmap.bpf.c | 40 ++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index f3587fb709c9..a4543c7ab25d 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -471,6 +471,46 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
>  			__sync_fetch_and_add(&nr_dispatched, 1);
> 
>  			scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0);
> +
> +			/*
> +			 * scx_qmap uses a global BPF queue that any CPU's
> +			 * dispatch can pop from. If this CPU popped a task that
> +			 * can't run here, it gets stranded on SHARED_DSQ after
> +			 * consume_dispatch_q() skips it. Kick the task's home
> +			 * CPU so it drains SHARED_DSQ.
> +			 *
> +			 * There's a race between the pop and the flush of the
> +			 * buffered dsq_insert:
> +			 *
> +			 *  CPU 0 (dispatching)      CPU 1 (home, idle)
> +			 *  ~~~~~~~~~~~~~~~~~~~      ~~~~~~~~~~~~~~~~~~~
> +			 *  pop from BPF queue
> +			 *  dsq_insert(buffered)
> +			 *                           balance:
> +			 *                             SHARED_DSQ empty
> +			 *                             BPF queue empty
> +			 *                             -> goes idle
> +			 *  flush -> on SHARED
> +			 *  kick CPU 1
> +			 *                           wakes, drains task
> +			 *
> +			 * The kick prevents indefinite stalls but a per-CPU
> +			 * kthread like ksoftirqd can be briefly stranded when
> +			 * its home CPU enters idle with softirq pending,
> +			 * triggering:
> +			 *
> +			 *  "NOHZ tick-stop error: local softirq work is pending, handler #N!!!"
> +			 *
> +			 * from report_idle_softirq(). The kick lands shortly
> +			 * after and the home CPU drains the task. This could be
> +			 * avoided by e.g. dispatching pinned tasks to local or
> +			 * global DSQs, but the current code is left as-is to
> +			 * document this class of issue -- other schedulers
> +			 * seeing similar warnings can use this as a reference.
> +			 */
> +			if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
> +				scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
> +
>  			bpf_task_release(p);
> 
>  			batch--;
> --
> 2.53.0


* Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  2026-04-13  3:30 ` [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded " Tejun Heo
  2026-04-13  5:32   ` Andrea Righi
@ 2026-04-13  5:38   ` Cheng-Yang Chou
  2026-04-13 16:21   ` Tejun Heo
  2 siblings, 0 replies; 8+ messages in thread
From: Cheng-Yang Chou @ 2026-04-13  5:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
	Ching-Chun Huang, Chia-Ping Tsai, sched-ext, linux-kernel

Hi Tejun,

On Sun, Apr 12, 2026 at 05:30:52PM -1000, Tejun Heo wrote:
> scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's
> ops.dispatch() can pop from. When a CPU pops a task that can't run on it
> (e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ.
> consume_dispatch_q() then skips the task due to affinity mismatch, leaving it
> stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't
> cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick()
> returns false when softirq is pending) -- but can cause noticeable scheduling
> delays.
> 
> After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run
> it. There's a small race window where the home CPU can enter idle before the
> kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this
> can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after
> and the home CPU drains the task.
> 
> Rather than fully eliminating the warning by routing pinned tasks to local or
> global DSQs, the current code keeps them going through the normal BPF queue
> path and documents the race and the resulting warning in detail. scx_qmap is an
> example scheduler and having tasks go through the usual dispatch path is useful
> for testing. The detailed comment also serves as a reference for other
> schedulers that may encounter similar warnings.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> v2: Replaced the previous enqueue-side fix which kicked when a pinned task was
>     enqueued. That was based on the theory that ops.select_cpu() being skipped
>     meant the home CPU wouldn't be woken, which wasn't quite right --
>     wakeup_preempt() kicks the target CPU regardless. Moved the fix to
>     ops.dispatch() where the stranding is actually observable.
> 
>  tools/sched_ext/scx_qmap.bpf.c | 40 ++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index f3587fb709c9..a4543c7ab25d 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -471,6 +471,46 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
>  			__sync_fetch_and_add(&nr_dispatched, 1);
> 
>  			scx_bpf_dsq_insert(p, SHARED_DSQ, slice_ns, 0);
> +
> +			/*
> +			 * scx_qmap uses a global BPF queue that any CPU's
> +			 * dispatch can pop from. If this CPU popped a task that
> +			 * can't run here, it gets stranded on SHARED_DSQ after
> +			 * consume_dispatch_q() skips it. Kick the task's home
> +			 * CPU so it drains SHARED_DSQ.
> +			 *
> +			 * There's a race between the pop and the flush of the
> +			 * buffered dsq_insert:
> +			 *
> +			 *  CPU 0 (dispatching)      CPU 1 (home, idle)
> +			 *  ~~~~~~~~~~~~~~~~~~~      ~~~~~~~~~~~~~~~~~~~
> +			 *  pop from BPF queue
> +			 *  dsq_insert(buffered)
> +			 *                           balance:
> +			 *                             SHARED_DSQ empty
> +			 *                             BPF queue empty
> +			 *                             -> goes idle
> +			 *  flush -> on SHARED
> +			 *  kick CPU 1
> +			 *                           wakes, drains task
> +			 *
> +			 * The kick prevents indefinite stalls but a per-CPU
> +			 * kthread like ksoftirqd can be briefly stranded when
> +			 * its home CPU enters idle with softirq pending,
> +			 * triggering:
> +			 *
> +			 *  "NOHZ tick-stop error: local softirq work is pending, handler #N!!!"
> +			 *
> +			 * from report_idle_softirq(). The kick lands shortly
> +			 * after and the home CPU drains the task. This could be
> +			 * avoided by e.g. dispatching pinned tasks to local or
> +			 * global DSQs, but the current code is left as-is to
> +			 * document this class of issue -- other schedulers
> +			 * seeing similar warnings can use this as a reference.
> +			 */
> +			if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
> +				scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
> +
>  			bpf_task_release(p);
> 
>  			batch--;
> --
> 2.53.0

This makes sense.

I also realized my previous patch for scx_userland was unnecessary, as
the global DSQ logic handles this automatically. Sorry for the noise on
that one.

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

-- 
Thanks,
Cheng-Yang


* Re: [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  2026-04-13  3:30 ` [PATCH sched_ext/for-7.1] tools/sched_ext: Kick home CPU for stranded " Tejun Heo
  2026-04-13  5:32   ` Andrea Righi
  2026-04-13  5:38   ` Cheng-Yang Chou
@ 2026-04-13 16:21   ` Tejun Heo
  2 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-04-13 16:21 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Emil Tsalapatis, Ching-Chun Huang,
	Chia-Ping Tsai, sched-ext, linux-kernel

Hello,

> Tejun Heo (1): tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap

Applied to sched_ext/for-7.1.

Thanks.
--
tejun

