Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()

Linux cgroups development
 help / color / mirror / Atom feed

From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	longman@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com,
	void@manifault.com, arighi@nvidia.com, changwoo@igalia.com,
	cgroups@vger.kernel.org, sched-ext@lists.linux.dev,
	liuwenfang@honor.com, tglx@linutronix.de
Subject: Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
Date: Fri, 12 Sep 2025 07:56:21 -1000	[thread overview]
Message-ID: <aMRexZ_SIUVgkIpZ@slm.duckdns.org> (raw)
In-Reply-To: <20250912115459.GZ3289052@noisy.programming.kicks-ass.net>

Hello,

On Fri, Sep 12, 2025 at 01:54:59PM +0200, Peter Zijlstra wrote:
> > With the !slock condition, the following scenario is possible:
> > 
> >   __task_rq_lock()
> >      slock = p->srq_lock; /* NULL */
> >                                                 dispatch_enqueue()
> >                                                   p->srq_lock = &dsq->lock;
> >                                                 enqueue finishes
> >      raw_spin_rq_lock(rq);
> >      rq is the same, $slock is NULL, return
> >   do something assuming p is locked down        p gets dispatched to another rq
> > 
> > I'm unclear on when p->srq_lock would be safe to set and clear, so the goal
> > is that whoever does [__]task_rq_lock() ends up waiting on the dsq lock that
> > the task is queued on, and if we can exclude other sched operations that
> > way, we don't have to hold source rq lock when moving the task to another rq
> > for execution, right?
> 
> Indeed. If !p->srq_lock then task_rq(p)->lock must be sufficient.
> 
> So for enqueue, which sets p->srq_lock, this must be done while holding
> task_rq(p)->lock.
> 
> So the above example should be serialized on task_rq(p)->lock, since
> __task_rq_lock() holds it, enqueue cannot happen. Conversely, if enqueue
> holds task_rq(p)->lock, then __task_rq_lock() will have to wait for
> that, and then observe the newly set p->srq_lock and cycle to take that.

For that to work, [__]task_rq_lock() would have to read p->srq_lock while
holdling rq_lock. Simple reordering, but yeah it'd help to have
setter/getter for explicit locking rules.

...
> We must do set_task_cpu(0) before task_unlink_from_dsq() (and I got this
> order wrong in yesterday's email).
> 
>   pick_task_ext() on CPU0		
>     lock DSQ
>     pick p
>     set_task_cpu(0)			task_rq_lock()
>     task_unlink_from_dsq()		  if !p->srq_lock, then task_rq(p) == 0
>       p->srq_lock = NULL;
>     p is moved to local DSQ
> 
> Perhaps the p->srq_lock store should be store-release, so that the cpu
> store is before.
> 
> Then if we observe p->srq_lock, we'll serialize against DSQ and all is
> well, if we observe !p->srq_lock then we must also observe task_rq(p) ==
> 0 and then we'll serialize on rq->lock.

I see, so the interlocking is between task_rq(p) and p->srq_lock - either
one sees the updated CPU or non-NULL srq_lock. As long as the one that
clears ->srq_lock has both the destination rq and DSQ locked, task_rq_lock()
either ends up waiting on ->srq_lock or sees updated CPU and has to loop
over and wait on the destination rq.

> Now let me see if there isn't an ABA issue here, consider:
> 
> pre: task_cpu(p) != 2, p->srq_lock = NULL
> 
>   CPU0				CPU1				CPU2
> 
>   __task_rq_lock()		enqueue_task_scx() 		pick_task_scx()
> 
> 				rq = task_rq(p);
> 				LOCK rq->lock
>   rq = task_rq(p)
>   LOCK rq->lock
>     .. waits
> 				LOCK dsq->lock
> 				enqueue on dsq
> 				p->srq_lock = &dsq->lock	
> 				UNLOCK dsq->lock
> 								LOCK dsq->lock
> 								pick p
> 				UNLOCK rq->lock
> 								set_task_cpu(2)
> 								task_unlink_from_dsq()
> 								  p->srq_lock = NULL;
> 								UNLOCK dsq->lock
>     .. resumes
>  
> At this point our CPU0's __task_rq_lock():
> 
>   - if it observes p->srq_lock, it will cycle taking that, only to then
>     find out p->srq_lock is no longer set, but then it must also see
>     task_rq() has changed, so the next cycle will block on CPU2's
>     rq->lock.
> 
>   - if it observes !p->srq_lock, then it cannot be the initial NULL,
>     since the initial task_rq(p)->lock ordering prohibits this. So it
>     must be the second NULL, which then also mandates we see the CPU
>     change and we'll cycle to take CPU2's rq->lock.
> 
> That is, I _think_ we're okay :-)

It *seems* that way to me. There are two other scenarios tho.

- A task can move from a non-local DSQ to another non-local DSQ at any time
  while queued. As this doesn't cause rq migration, we can probably just
  overwrite p->srq_lock to the new one. Need to think about it a bit more.

- A task can be queued on a BPF data structure and thus may not be on any
  DSQ. I think this can be handled by adding a raw_spinlock to task_struct
  and treating the task as if it's on its own DSQ by pointing to that one,
  and grabbing that lock when transferring that task from BPF side.

So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
doing something simpler and just add a per-task lock which nests inside rq
lock which is always grabbed by [__]task_rq_lock() and optionally grabbed by
sched classes that want to migrate tasks without grabbing the source rq
lock? That way, we don't need to the lock pointer dancing while achieving
about the same result. From sched_ext's POV, grabbing that per-task lock is
likely going to be cheaper than doing the rq lock switching, so it's way
simpler and nothing gets worse.

Thanks.

-- 
tejun

next prev parent reply	other threads:[~2025-09-12 17:56 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
2025-09-11  9:06   ` K Prateek Nayak
2025-09-11  9:55     ` Peter Zijlstra
2025-09-11 10:10       ` Peter Zijlstra
2025-09-11 10:37         ` K Prateek Nayak
2025-10-06 15:21   ` Shrikanth Hegde
2025-10-06 18:14     ` Peter Zijlstra
2025-10-07  5:12       ` Shrikanth Hegde
2025-10-07  9:34         ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 02/14] sched: Re-arrange the {EN,DE}QUEUE flags Peter Zijlstra
2025-09-10 15:44 ` [PATCH 03/14] sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern Peter Zijlstra
2025-09-10 15:44 ` [PATCH 04/14] sched: Cleanup sched_delayed handling for class switches Peter Zijlstra
2025-09-10 15:44 ` [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern Peter Zijlstra
2025-09-11  1:44   ` Tejun Heo
2025-09-10 15:44 ` [PATCH 06/14] sched: Fix migrate_disable_switch() locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking Peter Zijlstra
2025-10-30  0:12   ` Mark Brown
2025-10-30  9:07     ` Peter Zijlstra
2025-10-30 12:47       ` Mark Brown
2025-09-10 15:44 ` [PATCH 08/14] sched: Rename do_set_cpus_allowed() Peter Zijlstra
2025-09-10 15:44 ` [PATCH 09/14] sched: Make __do_set_cpus_allowed() use the sched_change pattern Peter Zijlstra
2025-09-10 15:44 ` [PATCH 10/14] sched: Add locking comments to sched_class methods Peter Zijlstra
2025-09-10 15:44 ` [PATCH 11/14] sched: Add flags to {put_prev,set_next}_task() methods Peter Zijlstra
2025-09-10 15:44 ` [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Peter Zijlstra
2025-09-12  0:19   ` Tejun Heo
2025-09-12 11:54     ` Peter Zijlstra
2025-09-12 14:11       ` Peter Zijlstra
2025-09-12 17:56       ` Tejun Heo [this message]
2025-09-15  8:38         ` Peter Zijlstra
2025-09-16 22:29           ` Tejun Heo
2025-09-16 22:41             ` Tejun Heo
2025-09-25  8:35               ` Peter Zijlstra
2025-09-25 21:43                 ` Tejun Heo
2025-09-26  9:59                   ` Peter Zijlstra
2025-09-26 16:48                     ` Tejun Heo
2025-09-26 10:36                   ` Peter Zijlstra
2025-09-26 21:39                     ` Tejun Heo
2025-09-29 10:06                       ` Peter Zijlstra
2025-09-30 23:49                         ` Tejun Heo
2025-10-01 11:54                           ` Peter Zijlstra
2025-10-02 23:32                             ` Tejun Heo
2025-09-10 15:44 ` [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED Peter Zijlstra
2025-09-11  2:01   ` Tejun Heo
2025-09-11  9:42     ` Peter Zijlstra
2025-09-11 20:40       ` Tejun Heo
2025-09-12 14:19         ` Peter Zijlstra
2025-09-12 16:32           ` Tejun Heo
2025-09-13 22:32             ` Tejun Heo
2025-09-15  8:48               ` Peter Zijlstra
2025-09-25 13:10             ` Peter Zijlstra
2025-09-25 15:40               ` Tejun Heo
2025-09-25 15:53                 ` Peter Zijlstra
2025-09-25 18:44                   ` Tejun Heo
2025-09-10 15:44 ` [PATCH 14/14] sched/ext: Implement p->srq_lock support Peter Zijlstra
2025-09-10 16:07   ` Peter Zijlstra
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
2025-09-10 18:19   ` Peter Zijlstra
2025-09-10 18:35   ` Peter Zijlstra
2025-09-10 19:00     ` Andrea Righi
2025-09-11  9:58     ` Peter Zijlstra
2025-09-11 14:51       ` Andrea Righi
2025-09-11 14:00   ` Peter Zijlstra
2025-09-11 14:30     ` Peter Zijlstra
2025-09-11 14:48       ` Andrea Righi
2025-09-18 15:15 ` Christian Loehle
2025-09-25  9:00   ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aMRexZ_SIUVgkIpZ@slm.duckdns.org \
    --to=tj@kernel.org \
    --cc=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liuwenfang@honor.com \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox