public inbox for sched-ext@lists.linux.dev
From: Tejun Heo <tj@kernel.org>
To: Andrea Righi <arighi@nvidia.com>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
	newton@meta.com
Subject: Re: [PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q()
Date: Sun, 15 Mar 2026 10:10:49 -1000	[thread overview]
Message-ID: <abcSSTZglXeHmgT4@slm.duckdns.org> (raw)
In-Reply-To: <abZ-li3Dq9_PXu58@gpd4>

(cc'ing Ryan Newton and quoting whole body)

On Sun, Mar 15, 2026 at 10:40:38AM +0100, Andrea Righi wrote:
> On Sat, Mar 14, 2026 at 10:58:05PM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> > 
> > On Sun, Mar 15, 2026 at 12:52:31AM +0100, Andrea Righi wrote:
> > ...
> > > Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
> > > show around 2-3x higher throughput with most of the scx schedulers with
> > > this change applied.
> > 
> > Can you share more details about the benchmark setup and results?
> 
> Just schbench and perf bench for now; it definitely needs more testing,
> but I wanted to send a patch to start a discussion about this (I should
> have added RFC to the subject, sorry).
> 
> > 
> > > +	/*
> > > +	 * Use trylock to avoid spinning on a contended DSQ; if we fail to
> > > +	 * acquire the lock, kick the CPU to retry on the next balance.
> > > +	 *
> > > +	 * In bypass mode simply spin to acquire the lock, since
> > > +	 * scx_kick_cpu() is suppressed.
> > > +	 */
> > > +	if (scx_bypassing(sch, cpu)) {
> > > +		raw_spin_lock(&dsq->lock);
> > > +	} else if (!raw_spin_trylock(&dsq->lock)) {
> > > +		scx_kick_cpu(sch, cpu, 0);
> > > +		return false;
> > > +	}
> > 
> > But I'm not sure this is what we wanna do. If we *really* want to do this,
> > maybe we can add a try_move variant; however, I'm pretty deeply skeptical
> > about the approach for a few reasons.
> > 
> > - If a shared DSQ becomes a bottleneck, the right thing to do would be
> >   introducing multiple DSQs and sharding them.
> 
> True, but then with multiple DSQs we also need a load balancer, and moving
> tasks across DSQs is not very efficient either. With a shared DSQ we do
> really well on latency, but under intense scheduling activity (e.g.,
> schbench) we get poor throughput, so all those scheduling-heavy benchmarks
> give a bad score to most of the scx schedulers.
> 
> With this applied, pretty much all the scx schedulers (scx_cosmos,
> scx_bpfland, scx_p2dq, scx_lavd) match EEVDF's schbench score (or even
> come out slightly ahead), without any noticeable impact on latency (I
> tested avg fps and tail latency with a few games).
> 
> > 
> > - This likely trades off fairness for bandwidth, and depending on the
> >   machine / workload this approach may lead to severe starvation. One can
> >   argue that controlled trade off between fairness and bandwidth is useful
> >   for some use cases. However, even if that is the case, I don't think
> >   trylock is the way to get there. If we think that low overhead high
> >   fan-out shared queue is desirable, it'd be better to introduce dedicated
> >   data structure which can do so in a controlled manner.
> 
> True, and I think with moderate CPU activity this may increase latency due
> to the additional kick/balance step when trylock fails (maybe control this
> behavior with a flag?).
> 
> That said, the throughput benefits seem significant. While schbench is
> probably an extreme case, the improvement there is substantial (2-3x),
> which suggests this approach might also benefit some more realistic
> workloads. I'm planning to run additional tests over the next few days to
> better understand this.
> 
> Based on the schbench results, it seems like a missed opportunity to drop
> this entirely. Can you elaborate more on the dedicated data structure you
> mentioned? Do you have something specific in mind?

IIRC, Ryan was experimenting with some data structures which trade off
fairness for scalability in BPF and saw promising results. Ryan, can you
please share what you did?

Thanks.

-- 
tejun

Thread overview: 4+ messages
2026-03-14 23:52 [PATCH sched_ext/for-7.1] sched_ext: Reduce DSQ lock contention in consume_dispatch_q() Andrea Righi
2026-03-15  8:58 ` Tejun Heo
2026-03-15  9:40   ` Andrea Righi
2026-03-15 20:10     ` Tejun Heo [this message]
