public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	longman@redhat.com, hannes@cmpxchg.org, mkoutny@suse.com,
	void@manifault.com, arighi@nvidia.com, changwoo@igalia.com,
	cgroups@vger.kernel.org, sched-ext@lists.linux.dev,
	liuwenfang@honor.com, tglx@linutronix.de,
	alexei.starovoitov@gmail.com
Subject: Re: [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx()
Date: Wed, 8 Oct 2025 11:11:51 +0200	[thread overview]
Message-ID: <20251008091151.GS4067720@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <aOWKn6f0OtegV1q0@slm.duckdns.org>

On Tue, Oct 07, 2025 at 11:48:15AM -1000, Tejun Heo wrote:
> On Mon, Oct 06, 2025 at 12:46:52PM +0200, Peter Zijlstra wrote:
> > Hi,
> > 
> > So I had a poke at 'give @rf to pick_task() and fold balance_scx() into
> > pick_task_scx()' option to see how terrible it was. Turns out, not terrible at
> > all.
> > 
> > I've ran the sched_ext selftest and stress-ng --race-sched 0 thing with various
> > scx_* thingies on.
> 
> This is great. I was thinking that I needed to call pick_task() of other
> classes to detect the retry conditions but yeah enqueue() must be the
> triggering event and this is way neater. 

:-)

> Does this mean that balance() can be dropped from other classes too?

Possibly. But lets stick that on a todo list :-) I was also looking to
get rid of sched_class::pick_next_task() -- the only reason that
currently still exists is because of that fair cgroup nonsense.

I was looking at bringing back Rik's flatten series (easier now that the
cfs bandwidth throttle thing is fixed). That should get rid of that
hierarchical pick; and thus obviate that whole mess in
pick_next_task_fair().

> For the whole series:
> 
>  Acked-by: Tejun Heo <tj@kernel.org>
> 

Thanks!

Now, I do have a few questions from staring at this ext stuff for a
while:

 - So the 'RT' problem with balance_one() is due to:

    o rq-lock-break allowing higher prio task to come in
    o placing a task on the local dsq

   which results in that local-dsq getting 'starved'. The patches want
   to call switch_class(), which I suppose will work, this is something
   like:

   if (rq_modified_above(rq, &ext_sched_class)) {
      /*
       * We don't have a next task at this point, but can guess at its
       * class based on the highest set bit in the queue_mask.
       */
      scx_cpu_release(reason_from_mask(rq->queue_mask), NULL);
      return RETRY_TASK;
   }

   But I was thinking that we could also just stick that task back onto
   some global dsq, right? (presumably the one we just pulled it from is
   a good target). This would effectively 'undo' the balance_one().


 - finish_dispatch() -- one of the problems that I ran into with the
   shared rq lock implementation is that pick_task_scx() will in fact
   try and enqueue on a non-local dsq.

   The callchain is something like:

     pick_task_scx()
       bpf__sched_ext_ops_dispatch()
         scx_bpf_dsq_move_to_local()
	   flush_dispatch_buf()
	     dispatch_enqueue() // dsq->id != SCX_DSQ_LOCAL
 
   And this completely messes up the locking -- I'm not sure how to fix
   this yet. But adding this flush to do random other things to various
   code paths really complicates things. Per the function what we really
   want to do is move-to-local, but then we end up doing random other
   things instead :/


 - finish_dispatch() -- per the above problem I read this function and
   found that:

     "the BPF scheduler is allowed to dispatch tasks spuriously"

   and I had to go and buy a new WTF'o'meter again :/ Why would you
   allow such a thing? Detect the case because the BPF thing is
   untrusted and can do crazy things, sure. But then kill it dead; don't
   try and excuse badness.


 - scx_bpf_dsq_move_to_local() found per the above problem, but per its
   comment it is possible BPF calls this with its own locks held. This
   then results in:

   CPU1 

   try_to_wake_up()
     rq_lock();
     enqueue_task() := enqueue_task_scx()
       bpf__sched_ext_ops_something_or_other()
         your bpf area lock thing

   // rq->lock
   // bpf area lock

   CPU2
     bpf__sched_ext_whatever()
       bpf area lock
         scx_bpf_move_to_local()
	   rq_lock()

   // bpf area lock
   // rq->lock

  and we have a deadlock -- I thought BPF was supposed to be safe? And
  while the recent rqspinlock has a timeout, and there the bpf validator
  knows a spinlock exists and can prohibit kernel helper calls, this
  bpf area lock you have has no such things (afaict) and BPF can
  completely mess up the kernel. How is this okay?



  reply	other threads:[~2025-10-08  9:12 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-06 10:46 [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Peter Zijlstra
2025-10-06 10:46 ` [RFC][PATCH 1/3] sched: Detect per-class runqueue changes Peter Zijlstra
2025-10-07 10:08   ` Juri Lelli
2025-10-07 10:16     ` Peter Zijlstra
2025-10-07 10:26       ` Juri Lelli
2025-10-16  9:33   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-10-06 10:46 ` [RFC][PATCH 2/3] sched: Add support to pick functions to take rf Peter Zijlstra
2025-10-08 13:16   ` Vincent Guittot
2025-10-08 13:58     ` Peter Zijlstra
2025-10-08 15:22       ` Vincent Guittot
2025-10-08 20:34         ` Tejun Heo
2025-10-09  7:17           ` Vincent Guittot
2025-10-13 11:04             ` Peter Zijlstra
2025-10-13 11:09               ` Peter Zijlstra
2025-10-13 13:06                 ` Vincent Guittot
2025-10-13 17:20                   ` Tejun Heo
2025-10-16  9:33   ` [tip: sched/core] " tip-bot2 for Joel Fernandes
2025-10-06 10:46 ` [RFC][PATCH 3/3] sched/ext: Fold balance_scx() into pick_task_scx() Peter Zijlstra
2025-10-16  9:33   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-10-07 12:40 ` [RFC][PATCH 0/3] sched/ext: Cleanup pick_task_scx() Christian Loehle
2025-10-07 21:48 ` Tejun Heo
2025-10-08  9:11   ` Peter Zijlstra [this message]
2025-10-08 20:21     ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251008091151.GS4067720@noisy.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=alexei.starovoitov@gmail.com \
    --cc=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hannes@cmpxchg.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liuwenfang@honor.com \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=mkoutny@suse.com \
    --cc=rostedt@goodmis.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox