From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
	vschneid@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@kernel.org, joshdon@google.com,
	brho@google.com, pjt@google.com, derkling@google.com,
	haoluo@google.com, dvernet@meta.com, dschatzberg@meta.com,
	dskarlat@cs.cmu.edu, riel@surriel.com, changwoo@igalia.com,
	himadrics@inria.fr, memxor@gmail.com, andrea.righi@canonical.com,
	joel@joelfernandes.org, linux-kernel@vger.kernel.org,
	bpf@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class
Date: Sun, 5 May 2024 13:31:26 -1000
Message-ID: <ZjgWzhruwo8euPC0@slm.duckdns.org>
In-Reply-To: <20240503085232.GC30852@noisy.programming.kicks-ass.net>

Hello,

On Fri, May 03, 2024 at 10:52:32AM +0200, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote:
> > We can resurrect the discussion on that patchset but how is that connected
> > to sched_ext? 
> 
> I'm absolutely not taking any of this until at the very least the cgroup
> situation that's been created is solved. And even then, I fundamentally
> believe the approach to be detrimental to the scheduler eco-system.

Please see below for more on the flattened hierarchy patchset. However, no
matter how that discussion works out, what you seem to be suggesting -
suspending discussion or further push for upstream on sched_ext until a
mostly unrelated work is done - doesn't seem reasonable, especially when
most of the input that you have provided is not constructive.

Even if we agree that, for some reason, the two projects are linked and that
Meta and Google must push the flattened hierarchy patchset in order to land
sched_ext upstream, it should be obvious how your proposition puts us in an
impossible spot - A is a prerequisite for B but B isn't going to happen.
That's not a motivating situation for anyone.

If working on the flattened hierarchy patchset is something you want us to
commit to as a gesture of good will, we can surely consider that, but that
shouldn't block further discussions on sched_ext or its upstreaming.

(I reordered your comment about the number of sched_ext schedulers and
developer attention towards the end of the reply to avoid jumping back and
forth between subjects.)

> You guys Google/Facebook got us the cgroup thing, Google did a lot of

We can't divorce ourselves completely from the organizations that we work
for but the above is still a pretty broad stroke. Neither David nor I was
involved in the CPU controller design or implementation and I don't think
it's the same group of people on the Google side either. We sure can discuss
how to proceed on the flattened hierarchy patchset but I don't think the
picture you're painting is a fair depiction of the overall situation.

> the work for cpu-cgroup, and now you Facebook say you can't live with it
> because it's too expensive. Yes Rik did put a lot of effort into it, but
> Google shot it down. What am I to do?

You could have encouraged and guided the project if it felt important
enough. You didn't have to but that was an option.

> You Google/Facebook are touting collaboration, collaborate on fixing it.
> Instead of re-posting this over and over. After all, your main
> motivation for starting this was the cpu-cgroup overhead.

The hierarchical scheduling overhead isn't the main motivation for us. We
can't use the CPU controller for all workloads and while it'd be nice to
improve that, it's pretty easy to work around, especially with the constantly
increasing number of CPUs. Currently, most sched_ext experiments are without
cgroups and even when cgroups are considered, they're just used as grouping
hints.

In fact, we want to try implementing hierarchical scheduling by dynamically
soft-affinitizing cgroups to CPUs, which would be a bridge too far for the
in-kernel scheduler, at least for now, as it wouldn't be able to handle
custom affinities properly, but it is an idea worth exploring. Enabling
experiments like that is definitely one of our main motivations.

> From where I'm sitting, you created a problem (cpu-cgroup) and now
> you're creating an even bigger problem as a work-around. Very much not
> appreciated.

I have a hard time agreeing. These projects don't overlap all that much.
Their scopes are wildly different. That said, if this is somehow the
blocker, we can talk and try to find a solution but such a solution would
have to be reasonable from our end too. How else would it work?

> Witness the metric ton of toy schedulers written for it, that's all
> effort not put into improving the existing code.

This view works only if you assume that the entire world contains only a
handful of developers who can work on schedulers. The only way that would be
the case is if the barrier to entry is raised unreasonably high. Sometimes a
high barrier to entry can't be avoided or is beneficial. However, if it's
pushed up high enough to leave only a handful of people to work on an area
as large as scheduling, something probably is wrong.

You know better than anyone that there's no such thing as the perfect
scheduler for all, or even most, workloads. There are too many interacting
factors and second-order effects for a single implementation, no matter how
advanced, to be perfect or even great for the multitudes of situations that
scheduling encounters. With hardware and workloads becoming more complex,
the situation isn't getting any better. This partially explains why we can
easily achieve significantly better behaviors for specific workloads even
with a toy scheduler which is just there to demonstrate an idea.

The built-in scheduler has to be good enough for everyone, and, thanks to
the effort of you and the other sched maintainers, it serves that role
admirably. However, that requirement also comes with stringent constraints.
Radical ideas are difficult to play with. Each change has to make some sense
for every use case. Nothing drastic can be introduced unless the future path
can reasonably be forecast. So, development efforts must be highly
orchestrated and stay consistent, which justifies a higher barrier to entry
and strict control.

Yet, the many different ways in which even simple schedulers can demonstrate
sometimes significant behavior and performance benefits for specific
workloads suggest that there is a lot of low-hanging fruit in the area,
fruit that we can't easily reach from our current local optimum. A single
implementation which has to satisfy all users all the time is unlikely to be
an effective vehicle for mapping out such a landscape.

I believe we agree that we want more people contributing to the scheduling
area. We need that. However, I have a hard time seeing how that would be
achieved in the current structure. Most people can't afford to sink six
months, a year, two years into a project only to eventually be nacked
without any way to deploy and prove their ideas and efforts. Unfortunately,
that is where we end up today in many cases.

There are many smart people with bright ideas just outside the fence who are
eager to develop, tune and even just play with schedulers. I believe they
will flourish when they can work in an environment where scheduling
experimentation is accessible and encouraged. In fact, we are already seeing
that. Out of the four non-trivial sched_ext schedulers, three are either
primarily driven by or have significant contributions from people who had
not and wouldn't have worked on the in-kernel schedulers at all.

So, here's my proposition. Let's please open it up. sched_ext hooks into
sched infra but the contact surface is limited and we'll try our best to
stay out of your way. I can't promise that it won't ever get in your way,
but, if it ever does, just ping me and David. Resolving such situations
would be our highest priority. Let us and others try out crazy ideas and
find out what works.

Thanks.

-- 
tejun
