From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org,
bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
longman@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com,
void@manifault.com, arighi@nvidia.com, changwoo@igalia.com,
cgroups@vger.kernel.org, sched-ext@lists.linux.dev,
liuwenfang@honor.com, tglx@linutronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
Date: Tue, 16 Sep 2025 12:29:57 -1000 [thread overview]
Message-ID: <aMnk5Wcdr2q6BWqR@slm.duckdns.org> (raw)
In-Reply-To: <20250915083815.GB3289052@noisy.programming.kicks-ass.net>
Hello,
On Mon, Sep 15, 2025 at 10:38:15AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 12, 2025 at 07:56:21AM -1000, Tejun Heo wrote:
> > It *seems* that way to me. There are two other scenarios tho.
> >
> > - A task can move from a non-local DSQ to another non-local DSQ at any time
> > while queued. As this doesn't cause rq migration, we can probably just
> > overwrite p->srq_lock to the new one. Need to think about it a bit more.
>
> It can use task_on_rq_migrating(), exactly like 'normal' rq-to-rq
> migration:
>
> LOCK src_dsq->lock
> p->on_rq = TASK_ON_RQ_MIGRATING;
> task_unlink_from_dsq();
> UNLOCK src_dsq->lock
>
> LOCK dst_dsq->lock
> dispatch_enqueue()
> p->on_rq = TASK_ON_RQ_QUEUED;
> UNLOCK dst_dsq->lock
>
> Same reasoning as for the pick_task_scx() migration, if it observes
> !p->srq_lock, then it must observe MIGRATING and we'll spin-wait until
> QUEUED. At which point we'll see the new srq_lock.
I see.
> > - A task can be queued on a BPF data structure and thus may not be on any
> > DSQ. I think this can be handled by adding a raw_spinlock to task_struct
> > and treating the task as if it's on its own DSQ by pointing to that one,
> > and grabbing that lock when transferring that task from BPF side.
>
> Hmm, and BPF data structures cannot have a lock associated with them?
> I'm thinking they must, something is serializing all that.
>
> > So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
> > doing something simpler and just add a per-task lock which nests inside rq
> > lock which is always grabbed by [__]task_rq_lock() and optionally grabbed by
> > sched classes that want to migrate tasks without grabbing the source rq
> > lock? That way, we don't need to the lock pointer dancing while achieving
> > about the same result. From sched_ext's POV, grabbing that per-task lock is
> > likely going to be cheaper than doing the rq lock switching, so it's way
> > simpler and nothing gets worse.
>
> I *really* don't like that. Fundamentally a runqueue is 'rich' data
> structure. It has a container (list, tree, whatever) but also a pile of
> statistics (time, vtime, counts, load-sums, averages). Adding/removing a
> task from a runqueue needs all that serialized. A per-task lock simply
> cannot do this.
>
> If you've hidden this lock inside BPF such that C cannot access it, then
> your abstraction needs fixing. Surely it is possible to have a C DSQ to
> mirror whatever the BPF thing does. Add a few helpers for BPF to
> create/destroy DSQs (IDs) and a callback to map a task to a DSQ. Then
> the C part can use the DSQ lock, and hold it while calling into whatever
> BPF.
Most current schedulers (except for scx_qmap which is there just to demo how
to use BPF side queueing) use use DSQs to handle tasks the way you're
describing. However, BPF arena is becoming more accessible and gaining wider
usage, paired with purely BPF side synchronization constructs (spinlock or
some lockless data structure).
Long term, I think maintaining flexibility is of higher importance for
sched_ext than e.g. small performance improvements or even design or
implementation aesthetics. The primary purpose is enabling trying out new,
sometimes wild, things after all. As such, I don't think it'd be a good idea
to put strict restrictions on how the BPF side operates unless it affects
the ability to recover the system from a malfunctioning BPF scheduler, of
course.
> Additionally, it can sanity check the BPF thing, tasks cannot go
> 'missing' without C knowing wtf they went -- which is that bypass
> problem, no?
They are orthogonal. Even if all tasks are on DSQs, the scheduler may still
fail to dispatch some DSQs for too long, mess up the ordering inside, cause
excessive bouncing across them, and what not. So, the kernel side still
needs to be able to detect and contain failures. The only difference between
a task being on a DSQ or BPF side is that there needs to be extra per-task
state tracking to ensure that tasks only transit states in orderly fashion
(ie. don't dispatch the same task twice).
Thanks.
--
tejun
next prev parent reply other threads:[~2025-09-16 22:29 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
2025-09-11 9:06 ` K Prateek Nayak
2025-09-11 9:55 ` Peter Zijlstra
2025-09-11 10:10 ` Peter Zijlstra
2025-09-11 10:37 ` K Prateek Nayak
2025-10-06 15:21 ` Shrikanth Hegde
2025-10-06 18:14 ` Peter Zijlstra
2025-10-07 5:12 ` Shrikanth Hegde
2025-10-07 9:34 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 02/14] sched: Re-arrange the {EN,DE}QUEUE flags Peter Zijlstra
2025-09-10 15:44 ` [PATCH 03/14] sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern Peter Zijlstra
2025-09-10 15:44 ` [PATCH 04/14] sched: Cleanup sched_delayed handling for class switches Peter Zijlstra
2025-09-10 15:44 ` [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern Peter Zijlstra
2025-09-11 1:44 ` Tejun Heo
2025-09-10 15:44 ` [PATCH 06/14] sched: Fix migrate_disable_switch() locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking Peter Zijlstra
2025-10-30 0:12 ` Mark Brown
2025-10-30 9:07 ` Peter Zijlstra
2025-10-30 12:47 ` Mark Brown
2025-09-10 15:44 ` [PATCH 08/14] sched: Rename do_set_cpus_allowed() Peter Zijlstra
2025-09-10 15:44 ` [PATCH 09/14] sched: Make __do_set_cpus_allowed() use the sched_change pattern Peter Zijlstra
2025-09-10 15:44 ` [PATCH 10/14] sched: Add locking comments to sched_class methods Peter Zijlstra
2025-09-10 15:44 ` [PATCH 11/14] sched: Add flags to {put_prev,set_next}_task() methods Peter Zijlstra
2025-09-10 15:44 ` [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Peter Zijlstra
2025-09-12 0:19 ` Tejun Heo
2025-09-12 11:54 ` Peter Zijlstra
2025-09-12 14:11 ` Peter Zijlstra
2025-09-12 17:56 ` Tejun Heo
2025-09-15 8:38 ` Peter Zijlstra
2025-09-16 22:29 ` Tejun Heo [this message]
2025-09-16 22:41 ` Tejun Heo
2025-09-25 8:35 ` Peter Zijlstra
2025-09-25 21:43 ` Tejun Heo
2025-09-26 9:59 ` Peter Zijlstra
2025-09-26 16:48 ` Tejun Heo
2025-09-26 10:36 ` Peter Zijlstra
2025-09-26 21:39 ` Tejun Heo
2025-09-29 10:06 ` Peter Zijlstra
2025-09-30 23:49 ` Tejun Heo
2025-10-01 11:54 ` Peter Zijlstra
2025-10-02 23:32 ` Tejun Heo
2025-09-10 15:44 ` [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED Peter Zijlstra
2025-09-11 2:01 ` Tejun Heo
2025-09-11 9:42 ` Peter Zijlstra
2025-09-11 20:40 ` Tejun Heo
2025-09-12 14:19 ` Peter Zijlstra
2025-09-12 16:32 ` Tejun Heo
2025-09-13 22:32 ` Tejun Heo
2025-09-15 8:48 ` Peter Zijlstra
2025-09-25 13:10 ` Peter Zijlstra
2025-09-25 15:40 ` Tejun Heo
2025-09-25 15:53 ` Peter Zijlstra
2025-09-25 18:44 ` Tejun Heo
2025-09-10 15:44 ` [PATCH 14/14] sched/ext: Implement p->srq_lock support Peter Zijlstra
2025-09-10 16:07 ` Peter Zijlstra
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
2025-09-10 18:19 ` Peter Zijlstra
2025-09-10 18:35 ` Peter Zijlstra
2025-09-10 19:00 ` Andrea Righi
2025-09-11 9:58 ` Peter Zijlstra
2025-09-11 14:51 ` Andrea Righi
2025-09-11 14:00 ` Peter Zijlstra
2025-09-11 14:30 ` Peter Zijlstra
2025-09-11 14:48 ` Andrea Righi
2025-09-18 15:15 ` Christian Loehle
2025-09-25 9:00 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aMnk5Wcdr2q6BWqR@slm.duckdns.org \
--to=tj@kernel.org \
--cc=arighi@nvidia.com \
--cc=bsegall@google.com \
--cc=cgroups@vger.kernel.org \
--cc=changwoo@igalia.com \
--cc=dietmar.eggemann@arm.com \
--cc=hannes@cmpxchg.org \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=liuwenfang@honor.com \
--cc=longman@redhat.com \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=mkoutny@suse.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=sched-ext@lists.linux.dev \
--cc=tglx@linutronix.de \
--cc=vincent.guittot@linaro.org \
--cc=void@manifault.com \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox