Linux cgroups development
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	longman@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com,
	void@manifault.com, arighi@nvidia.com, changwoo@igalia.com,
	cgroups@vger.kernel.org, sched-ext@lists.linux.dev,
	liuwenfang@honor.com, tglx@linutronix.de
Subject: Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
Date: Thu, 25 Sep 2025 17:53:23 +0200	[thread overview]
Message-ID: <20250925155323.GB4067720@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <aNVia1u-GVByUJtC@slm.duckdns.org>

On Thu, Sep 25, 2025 at 05:40:27AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Sep 25, 2025 at 03:10:25PM +0200, Peter Zijlstra wrote:
> ...
> > Well, either this or scx_tasks iterator will result in lock ops for
> > every task, this is unavoidable if we want the normal p->pi_lock,
> > rq->lock (dsq->lock) taken for every sched_change caller.
> > 
> > I have the below which I would like to include in the series such that I
> > can clean up all that DEQUEUE_LOCKED stuff a bit, this being the only
> > sched_change that's 'weird'.
> > 
> > Added 'bonus' is of course one less user of the runnable_list.
> > 
> > (also, I have to note, for_each_cpu with preemption disabled is asking
> > for trouble, the enormous core count machines are no longer super
> > esoteric)
> 
> Oh yeah, we can break up every N CPUs. There's no cross-CPU atomicity
> requirement.

Right.

> > +	/*
> > +	 * XXX online_mask is stable due to !preempt (per bypass_lock)
> > +	 * so could this be for_each_online_cpu() ?
> >  	 */
> 
> CPUs can go on and offline while CPUs are being bypassed. We can handle that
> in hotplug ops but I'm not sure the complexity is justified in this case.

Well, not in the current code, since the CPU running this has IRQs and
preemption disabled (per bypass_lock) and thus stop_machine, as used in
hotplug can't make progress.

That is; disabling preemption serializes against hotplug. This is
something that the scheduler relies on in quite a few places.

> >  	for_each_possible_cpu(cpu) {
> >  		struct rq *rq = cpu_rq(cpu);
> > -		struct task_struct *p, *n;
> >  
> >  		raw_spin_rq_lock(rq);
> > -
> >  		if (bypass) {
> >  			WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING);
> >  			rq->scx.flags |= SCX_RQ_BYPASSING;
> > @@ -4866,36 +4867,33 @@ static void scx_bypass(bool bypass)
> >  			WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING));
> >  			rq->scx.flags &= ~SCX_RQ_BYPASSING;
> 
> I may be using BYPASSING being set as all tasks having been cycled. Will
> check. We may need an extra state to note that bypass switching is complete.
> Hmm... the switching is not synchronized against scheduling operations
> anymore - ie. we can end up mixing regular op and bypassed operation for the
> same scheduling event (e.g. enqueue vs. task state transitions), which can
> lead subtle state inconsistencies on the BPF scheduler side. Either the
> bypassing state should become per-task, which likely has system
> recoverability issues under lock storm conditions, or maybe we can just
> shift it to the scheduling path - e.g. decide whether to bypass or not at
> the beginning of enqueue path and then stick to it until the whole operation
> is finished.

Makes sense.

> >  		}
> > +		raw_spin_rq_unlock(rq);
> > +	}
> > +
> > +	/* implicit RCU section due to bypass_lock */
> > +	for_each_process_thread(g, p) {
> 
> I don't think this is safe. p->tasks is unlinked from __unhash_process() but
> tasks can schedule between being unhashed and the final preemt_disable() in
> do_exit() and thus the above iteration can miss tasks which may currently be
> runnable.

Bah, you're quite right :/

> > +		unsigned int state;
> >  
> > +		guard(raw_spinlock)(&p->pi_lock);
> > +		if (p->flags & PF_EXITING || p->sched_class != &ext_sched_class)
> > +			continue;
> > +
> > +		state = READ_ONCE(p->__state);
> > +		if (state != TASK_RUNNING && state != TASK_WAKING)
> >  			continue;
> >  
> > +		guard(__task_rq_lock)(p);
> > +		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
> > +			/* nothing */ ;
> >  		}
> > +	}
> 
> This is significantly more expensive. On large systems, the number of
> threads can easily reach six digits. Iterating all of them while doing
> locking ops on each of them might become problematic depending on what the
> rest of the system is doing (unfortunately, it's not too difficult to cause
> meltdowns on some NUMA systems with cross-node traffic). I don't think
> p->tasks iterations can be broken up either.

I thought to have understood that bypass isn't something that happens
when the system is happy. As long as it completes at some point all this
should be fine right?

I mean, yeah, it'll take a while, but meh.

Also, we could run the thing at fair or FIFO-1 or something, to be
outside of ext itself. Possibly we can freeze all the ext tasks on
return to user to limit the amount of noise they generate.

> The change guard cleanups make sense
> regardless of how the rest develops. Would it make sense to land them first?
> Once we know what to do with the core scheduling locking, I'm sure we can
> find a way to make this work accordingly.

Yeah, definitely. Thing is, if we can get all sched_change users to be
the same, that all cleans up better.

But if cleaning this up gets to be too vexing, we can postpone that.




  reply	other threads:[~2025-09-25 15:53 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
2025-09-11  9:06   ` K Prateek Nayak
2025-09-11  9:55     ` Peter Zijlstra
2025-09-11 10:10       ` Peter Zijlstra
2025-09-11 10:37         ` K Prateek Nayak
2025-10-06 15:21   ` Shrikanth Hegde
2025-10-06 18:14     ` Peter Zijlstra
2025-10-07  5:12       ` Shrikanth Hegde
2025-10-07  9:34         ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 02/14] sched: Re-arrange the {EN,DE}QUEUE flags Peter Zijlstra
2025-09-10 15:44 ` [PATCH 03/14] sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern Peter Zijlstra
2025-09-10 15:44 ` [PATCH 04/14] sched: Cleanup sched_delayed handling for class switches Peter Zijlstra
2025-09-10 15:44 ` [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern Peter Zijlstra
2025-09-11  1:44   ` Tejun Heo
2025-09-10 15:44 ` [PATCH 06/14] sched: Fix migrate_disable_switch() locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking Peter Zijlstra
2025-10-30  0:12   ` Mark Brown
2025-10-30  9:07     ` Peter Zijlstra
2025-10-30 12:47       ` Mark Brown
2025-09-10 15:44 ` [PATCH 08/14] sched: Rename do_set_cpus_allowed() Peter Zijlstra
2025-09-10 15:44 ` [PATCH 09/14] sched: Make __do_set_cpus_allowed() use the sched_change pattern Peter Zijlstra
2025-09-10 15:44 ` [PATCH 10/14] sched: Add locking comments to sched_class methods Peter Zijlstra
2025-09-10 15:44 ` [PATCH 11/14] sched: Add flags to {put_prev,set_next}_task() methods Peter Zijlstra
2025-09-10 15:44 ` [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Peter Zijlstra
2025-09-12  0:19   ` Tejun Heo
2025-09-12 11:54     ` Peter Zijlstra
2025-09-12 14:11       ` Peter Zijlstra
2025-09-12 17:56       ` Tejun Heo
2025-09-15  8:38         ` Peter Zijlstra
2025-09-16 22:29           ` Tejun Heo
2025-09-16 22:41             ` Tejun Heo
2025-09-25  8:35               ` Peter Zijlstra
2025-09-25 21:43                 ` Tejun Heo
2025-09-26  9:59                   ` Peter Zijlstra
2025-09-26 16:48                     ` Tejun Heo
2025-09-26 10:36                   ` Peter Zijlstra
2025-09-26 21:39                     ` Tejun Heo
2025-09-29 10:06                       ` Peter Zijlstra
2025-09-30 23:49                         ` Tejun Heo
2025-10-01 11:54                           ` Peter Zijlstra
2025-10-02 23:32                             ` Tejun Heo
2025-09-10 15:44 ` [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED Peter Zijlstra
2025-09-11  2:01   ` Tejun Heo
2025-09-11  9:42     ` Peter Zijlstra
2025-09-11 20:40       ` Tejun Heo
2025-09-12 14:19         ` Peter Zijlstra
2025-09-12 16:32           ` Tejun Heo
2025-09-13 22:32             ` Tejun Heo
2025-09-15  8:48               ` Peter Zijlstra
2025-09-25 13:10             ` Peter Zijlstra
2025-09-25 15:40               ` Tejun Heo
2025-09-25 15:53                 ` Peter Zijlstra [this message]
2025-09-25 18:44                   ` Tejun Heo
2025-09-10 15:44 ` [PATCH 14/14] sched/ext: Implement p->srq_lock support Peter Zijlstra
2025-09-10 16:07   ` Peter Zijlstra
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
2025-09-10 18:19   ` Peter Zijlstra
2025-09-10 18:35   ` Peter Zijlstra
2025-09-10 19:00     ` Andrea Righi
2025-09-11  9:58     ` Peter Zijlstra
2025-09-11 14:51       ` Andrea Righi
2025-09-11 14:00   ` Peter Zijlstra
2025-09-11 14:30     ` Peter Zijlstra
2025-09-11 14:48       ` Andrea Righi
2025-09-18 15:15 ` Christian Loehle
2025-09-25  9:00   ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250925155323.GB4067720@noisy.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liuwenfang@honor.com \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=rostedt@goodmis.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox