public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched_ext: Separate lock and first_task into distinct cache lines in scx_dispatch_q
@ 2026-02-28 13:06 David Carlier
  2026-02-28 17:28 ` Tejun Heo
  0 siblings, 1 reply; 3+ messages in thread
From: David Carlier @ 2026-02-28 13:06 UTC (permalink / raw)
  To: Tejun Heo, David Vernet; +Cc: linux-kernel, David Carlier

lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
the same cache line in struct scx_dispatch_q. Every lock acquire/release
by a dispatching CPU invalidates the line for all CPUs performing
lockless first_task peeks, causing unnecessary cache coherence traffic,
especially across NUMA nodes.

Add ____cacheline_aligned_in_smp to first_task to place it on its own
cache line, eliminating this false sharing on SMP systems. On
uniprocessor builds the annotation is a no-op, so no space is wasted.

On SMP, the trade-off is increased struct size: each scx_dispatch_q
grows by up to ~56 bytes of padding. There are two instances embedded
per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
allocated custom DSQs, so the total overhead scales with the number of
CPUs and active DSQs.

Signed-off-by: David Carlier <devnexen@gmail.com>
---
 include/linux/sched/ext.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d..2988df68a97a 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -70,7 +70,7 @@ enum scx_dsq_id_flags {
  */
 struct scx_dispatch_q {
 	raw_spinlock_t		lock;
-	struct task_struct __rcu *first_task; /* lockless peek at head */
+	struct task_struct __rcu *first_task ____cacheline_aligned_in_smp; /* lockless peek at head */
 	struct list_head	list;	/* tasks in dispatch order */
 	struct rb_root		priq;	/* used to order by p->scx.dsq_vtime */
 	u32			nr;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] sched_ext: Separate lock and first_task into distinct cache lines in scx_dispatch_q
  2026-02-28 13:06 [PATCH] sched_ext: Separate lock and first_task into distinct cache lines in scx_dispatch_q David Carlier
@ 2026-02-28 17:28 ` Tejun Heo
  2026-02-28 18:26   ` David CARLIER
  0 siblings, 1 reply; 3+ messages in thread
From: Tejun Heo @ 2026-02-28 17:28 UTC (permalink / raw)
  To: David Carlier; +Cc: David Vernet, linux-kernel

On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> the same cache line in struct scx_dispatch_q. Every lock acquire/release
> by a dispatching CPU invalidates the line for all CPUs performing
> lockless first_task peeks, causing unnecessary cache coherence traffic,
> especially across NUMA nodes.
> 
> Add ____cacheline_aligned_in_smp to first_task to place it on its own
> cache line, eliminating this false sharing on SMP systems. On
> uniprocessor builds the annotation is a no-op, so no space is wasted.
> 
> On SMP, the trade-off is increased struct size: each scx_dispatch_q
> grows by up to ~56 bytes of padding. There are two instances embedded
> per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> allocated custom DSQs, so the total overhead scales with the number of
> CPUs and active DSQs.

But first_task is read-mostly. How could it be? David, from now on, I'm not
going to apply these patches unless you provide backing experimental data.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] sched_ext: Separate lock and first_task into distinct cache lines in scx_dispatch_q
  2026-02-28 17:28 ` Tejun Heo
@ 2026-02-28 18:26   ` David CARLIER
  0 siblings, 0 replies; 3+ messages in thread
From: David CARLIER @ 2026-02-28 18:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Vernet, linux-kernel

Hi Tejun,

  You're right, I got the access pattern wrong. Looking at it more
carefully, first_task is written via rcu_assign_pointer() on every
enqueue, and on dequeues when the removed task is the head, all under
dsq->lock. Since the lock acquisition already brings the cache line
into exclusive state, writing first_task on the same line is
essentially free. The only lockless reader is
scx_bpf_dsq_nr_queued(), which isn't a hot path.

  Understood on requiring experimental data going forward. I'll make
sure to back any performance-related patches with benchmark numbers
and profiling output (perf c2c / perf stat).

  Sorry for the noise (again...).

On Sat, 28 Feb 2026 at 17:28, Tejun Heo <tj@kernel.org> wrote:
>
> On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> > lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> > the same cache line in struct scx_dispatch_q. Every lock acquire/release
> > by a dispatching CPU invalidates the line for all CPUs performing
> > lockless first_task peeks, causing unnecessary cache coherence traffic,
> > especially across NUMA nodes.
> >
> > Add ____cacheline_aligned_in_smp to first_task to place it on its own
> > cache line, eliminating this false sharing on SMP systems. On
> > uniprocessor builds the annotation is a no-op, so no space is wasted.
> >
> > On SMP, the trade-off is increased struct size: each scx_dispatch_q
> > grows by up to ~56 bytes of padding. There are two instances embedded
> > per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> > allocated custom DSQs, so the total overhead scales with the number of
> > CPUs and active DSQs.
>
> But first_task is read-mostly. How could it be? David, from now on, I'm not
> going to apply these patches unless you provide backing experimental data.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-02-28 18:26 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-28 13:06 [PATCH] sched_ext: Separate lock and first_task into distinct cache lines in scx_dispatch_q David Carlier
2026-02-28 17:28 ` Tejun Heo
2026-02-28 18:26   ` David CARLIER

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox