public inbox for linux-kernel@vger.kernel.org
From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
	vschneid@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@kernel.org, joshdon@google.com,
	brho@google.com, pjt@google.com, derkling@google.com,
	haoluo@google.com, dvernet@meta.com, dschatzberg@meta.com,
	dskarlat@cs.cmu.edu, riel@surriel.com,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	kernel-team@meta.com, Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCHSET v4] sched: Implement BPF extensible scheduler class
Date: Thu, 27 Jul 2023 14:12:05 -1000	[thread overview]
Message-ID: <ZMMH1WiYlipR0byf@slm.duckdns.org> (raw)
In-Reply-To: <20230726091752.GA3802077@hirez.programming.kicks-ass.net>

Hello, Peter.

On Wed, Jul 26, 2023 at 11:17:52AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 21, 2023 at 08:37:41AM -1000, Tejun Heo wrote:
> > We are comfortable with the current API. Everything we tried fit pretty
> > well. It will continue to evolve but sched_ext now seems mature enough for
> > initial inclusion. I suppose lack of response doesn't indicate tacit
> > agreement from everyone, so what are you guys all thinking?
> 
> I'm still hating the whole thing with a passion.
> 
> As can be seen from the wide-spread SCHED_DEBUG abuse; people are, in
> general, not interested in doing the right thing. They prod random
> numbers (as in really, some are just completely insane) until their
> workload improves and call it a day.

I think it'd be useful to add some detail about what's going on in
situations like the above. This of course won't apply directly to everyone,
but I suspect many will recognize at least some parts of it.

In many production setups, there are aspects of workload behaviors that are
difficult to understand comprehensively. The workloads are often massively
complex, constantly being developed by many people, and dynamically
interacting with external entities. As with any sufficiently complex system,
there are many emergent properties which are difficult to untangle
completely.

Add to that multiple generations of divergent hardware and most of the
software stack coming from third parties (including the kernel, from the
application team's POV), and people often, justifiably, feel as if they're
swimming in a sea of black boxes and emergent properties.

Scheduling, naturally, is one of the areas people look into when trying to
optimize system performance. The vast majority of people don't know the
scheduler code base well enough to hack on it. Even when they do, it's often
not easy to set up benchmarks in production environments and cycle through
different kernels. We (Meta) are a lot better at this than a couple of years
ago, but even now, swapping kernels and ramping workloads back up can take a
long time for certain workloads.

Given the circumstances, it's not surprising that people reach for tunable
knobs when they're trying to find out whether changing scheduling behaviors
would improve performance for their workloads. That's often the only option
available, and tuning the knobs frequently leads to some gains. Most people
aren't scheduling experts, and the causal relationships between changes and
results may not be direct or intuitive, so that's often where things end.
Given that nobody has found a scheduling behavior which is optimal for every
workload, and that the SCHED_DEBUG knobs are what people can access, it's an
expected outcome.
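To make the workflow concrete, the knob-prodding described above usually
looks something like the sketch below. This is illustrative only: the
debugfs paths assume a kernel built with CONFIG_SCHED_DEBUG and a mounted
debugfs, and the knob names shown here vary across kernel versions.

```shell
# Sketch of typical SCHED_DEBUG knob prodding. Paths assume
# CONFIG_SCHED_DEBUG and a mounted debugfs; knob names differ between
# kernel versions, so treat them as illustrative only.
SCHED_DBG=/sys/kernel/debug/sched

for knob in migration_cost_ns min_granularity_ns; do
    if [ -r "$SCHED_DBG/$knob" ]; then
        echo "current $knob: $(cat "$SCHED_DBG/$knob")"
    else
        echo "$knob unavailable (no CONFIG_SCHED_DEBUG or debugfs not mounted)"
    fi
done

# The "prod random numbers until the workload improves" step is just a
# write back as root, e.g.:
# echo 250000 > "$SCHED_DBG/migration_cost_ns"
```

The point is that this is essentially the entire interface most people
have: a handful of global nanosecond values with indirect, non-obvious
effects on their particular workload.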

If a consistent pattern repeats across multiple workloads, we can sometimes
work out why tuning a certain way makes sense and generalize it. That is, to
some degree, how we ended up focusing on the recent work-conservation
related projects.

Maybe the situation is not ideal, but I don't think it's that people aren't
interested in doing the right thing. They are doing what they can within the
confines of the available mechanisms, their expertise, and the time and
effort they can afford to invest.

One of the impediments when trying to connect these disparate data points
into something meaningful is the difficulty of experimentation. Trials are
confined to whatever combinations can be achieved with the SCHED_DEBUG
knobs, which is both limiting and obscuring. I believe we're a lot more
likely to learn about scheduling with sched_ext widely available than
without, as it would allow easier and wider-in-scope experimentation.

> There is not a single doubt in my mind that if I were to merge this,
> there will be Enterprise software out there that will mandate its own
> BPF sched thing, or else it won't work.
>
> They will not care, they will not contribute, they might even pull a
> RedHat and only share the code to customers.

I'm sure some will behave in ways which aren't the most conducive to
collective improvement of the upstream kernel. That said, I don't see how
this would be noticeably worsened by the inclusion of sched_ext. Most mobile
kernels and some production kernels in cloud environments already carry
significant custom modifications, and they're often addressing real problems
for their use cases.

It'd be ideal if everyone had the commitment and bandwidth to try their best
to merge their changes back, but it's also understandable why that can't
always be the case. Sometimes the work is too specific or underdeveloped; at
other times, the time and resources just aren't there. We can incentivize
and coerce, but that can be pushed only so far. However, we do have a much
easier time learning about what people are doing thanks to the GPL, which
all sched_ext programs would need to follow, exactly like the rest of the
kernel.

At least relatively speaking, scheduling doesn't seem like an area that is
particularly starved for developer bandwidth, although one can always hope
for more. Actual insights, and an easy way to experiment and collaborate to
discover them, seem like the bigger bottleneck. Hopefully, sched_ext will
widen the scope of things that people try. Even when they don't directly
contribute those changes back to CFS, if a strategy is effective and general
enough, others can learn from it and apply it to improve scheduling for
everyone.

Both Meta and Google are committed to sharing what we learn, both in terms
of code and insights. The example schedulers in the posting are all we
(Meta) have been experimenting with, except for some really hacky
soft-affinity trials, which will be generalized and shared too. David has
also been actively working to apply the shared-runqueue changes, which came
from earlier sched_ext experiments, to CFS. Google has been open-sourcing
their ghOSt framework and the schedulers built on top of it, which will be
ported to sched_ext in the future. Google is starting to see promising
results with search and will share their findings in code and through other
venues, including conferences.
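To give a sense of scale for the example schedulers mentioned above: a
minimal global-FIFO policy in the style of the posted scx_simple is only a
few lines of BPF C. The following is a sketch, not a verbatim copy of the
series; the header name, helper signatures (scx_bpf_dispatch() and friends),
and section names follow the posted patches and may differ in later
revisions.

```c
/* Sketch of a minimal sched_ext scheduler modeled on the scx_simple
 * example from this series: every runnable task is queued on the shared
 * global dispatch queue, which CPUs consume FIFO. Helper and constant
 * names (scx_bpf_dispatch, SCX_DSQ_GLOBAL, SCX_SLICE_DFL) are taken
 * from the posted patches and may have changed since. */
#include "scx_common.bpf.h"

char _license[] SEC("license") = "GPL"; /* sched_ext programs must be GPL */

/* Called when a task becomes runnable; put it on the global DSQ with the
 * default time slice. */
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops simple_ops = {
	.enqueue	= (void *)simple_enqueue,
	.name		= "simple",
};
```

Everything not overridden (CPU selection, slice handling, and so on) falls
back to sched_ext's defaults, which is part of why the interaction surface
with the rest of the scheduler stays small.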

> We all loose in that scenario. Not least me, because I get the
> additional maintenance burden.

sched_ext isn't that invasive to the core code, and its interactions with
other scheduling classes are very limited. It would make changing the core
scheduling APIs a bit more burdensome, but they have been relatively stable,
and both David and I would be on the hook if anything got in your way. I
don't see why this would significantly increase your maintenance burden.
It's a thing, but it's a thing in its own corner.

> I also don't see upsides to merging this. You all can play with
> schedulers out-of-tree just fine and then submit what actually works.

There is a huge difference between having a common framework upstream and
not having one. If it's in the kernel, everyone knows that it's widely
available and will remain so for a very long time. That removes the risk of
investing energy and effort into something which may or may not exist next
year.

It also has a standardizing effect, where different parties can exchange
code and ideas easily. It's so much more effective to build directly on
other people's work than to reimplement everything on your own or to
navigate a maze of different frameworks and patches against different
baseline kernel versions. I mean, these are the reasons we want things
upstreamed, right?

Thanks.

-- 
tejun


Thread overview: 54+ messages
2023-07-11  1:13 [PATCHSET v4] sched: Implement BPF extensible scheduler class Tejun Heo
2023-07-11  1:13 ` [PATCH 01/34] cgroup: Implement cgroup_show_cftypes() Tejun Heo
2023-07-11  1:13 ` [PATCH 02/34] sched: Restructure sched_class order sanity checks in sched_init() Tejun Heo
2023-07-11  1:13 ` [PATCH 03/34] sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() Tejun Heo
2023-07-11  1:13 ` [PATCH 04/34] sched: Add sched_class->reweight_task() Tejun Heo
2023-07-11  1:13 ` [PATCH 05/34] sched: Add sched_class->switching_to() and expose check_class_changing/changed() Tejun Heo
2023-07-11  1:13 ` [PATCH 06/34] sched: Factor out cgroup weight conversion functions Tejun Heo
2023-07-11  1:13 ` [PATCH 07/34] sched: Expose css_tg() and __setscheduler_prio() Tejun Heo
2023-07-11  1:13 ` [PATCH 08/34] sched: Enumerate CPU cgroup file types Tejun Heo
2023-07-11  1:13 ` [PATCH 09/34] sched: Add @reason to sched_class->rq_{on|off}line() Tejun Heo
2023-07-11  1:13 ` [PATCH 10/34] sched: Add normal_policy() Tejun Heo
2023-07-11  1:13 ` [PATCH 11/34] sched_ext: Add boilerplate for extensible scheduler class Tejun Heo
2023-07-11  1:13 ` [PATCH 12/34] sched_ext: Implement BPF " Tejun Heo
2023-07-11  9:21   ` Andrea Righi
2023-07-11 21:45     ` Tejun Heo
2023-08-16 11:45   ` Vishal Chourasia
2023-08-16 19:20     ` Tejun Heo
2023-07-11  1:13 ` [PATCH 13/34] sched_ext: Add scx_simple and scx_example_qmap example schedulers Tejun Heo
2023-07-11  1:13 ` [PATCH 14/34] sched_ext: Add sysrq-S which disables the BPF scheduler Tejun Heo
2023-07-11  1:13 ` [PATCH 15/34] sched_ext: Implement runnable task stall watchdog Tejun Heo
2023-07-11  1:13 ` [PATCH 16/34] sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT Tejun Heo
2023-07-11  1:13 ` [PATCH 17/34] sched_ext: Allow BPF schedulers to switch all eligible tasks into sched_ext Tejun Heo
2023-07-11  1:13 ` [PATCH 18/34] sched_ext: Implement scx_bpf_kick_cpu() and task preemption support Tejun Heo
2023-07-11  1:13 ` [PATCH 19/34] sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU Tejun Heo
2023-07-11  1:13 ` [PATCH 20/34] sched_ext: Make watchdog handle ops.dispatch() looping stall Tejun Heo
2023-07-11  1:13 ` [PATCH 21/34] sched_ext: Add task state tracking operations Tejun Heo
2023-07-11  1:13 ` [PATCH 22/34] sched_ext: Implement tickless support Tejun Heo
2023-07-11  1:13 ` [PATCH 23/34] sched_ext: Track tasks that are subjects of the in-flight SCX operation Tejun Heo
2023-07-11  1:13 ` [PATCH 24/34] sched_ext: Add cgroup support Tejun Heo
2023-07-11  1:13 ` [PATCH 25/34] sched_ext: Add a cgroup-based core-scheduling scheduler Tejun Heo
2023-07-11  1:13 ` [PATCH 26/34] sched_ext: Add a cgroup scheduler which uses flattened hierarchy Tejun Heo
2023-07-11  1:13 ` [PATCH 27/34] sched_ext: Implement SCX_KICK_WAIT Tejun Heo
2023-07-13 13:45   ` Andrea Righi
2023-07-13 18:32     ` Linus Torvalds
2023-07-13 19:48       ` Tejun Heo
2023-07-11  1:13 ` [PATCH 28/34] sched_ext: Implement sched_ext_ops.cpu_acquire/release() Tejun Heo
2023-07-11  1:13 ` [PATCH 29/34] sched_ext: Implement sched_ext_ops.cpu_online/offline() Tejun Heo
2023-07-11  1:13 ` [PATCH 30/34] sched_ext: Implement core-sched support Tejun Heo
2023-07-11  1:13 ` [PATCH 31/34] sched_ext: Add vtime-ordered priority queue to dispatch_q's Tejun Heo
2023-07-11  1:13 ` [PATCH 32/34] sched_ext: Documentation: scheduler: Document extensible scheduler class Tejun Heo
2023-07-11  1:13 ` [PATCH 33/34] sched_ext: Add a basic, userland vruntime scheduler Tejun Heo
2023-07-11  1:13 ` [PATCH 34/34] sched_ext: Add a rust userspace hybrid example scheduler Tejun Heo
2023-07-21 18:37 ` [PATCHSET v4] sched: Implement BPF extensible scheduler class Tejun Heo
2023-07-24 15:11   ` Barret Rhoden
2023-07-26  9:17   ` Peter Zijlstra
2023-07-28  0:12     ` Tejun Heo [this message]
2023-08-04  0:08       ` Tejun Heo
2023-08-11  1:16       ` Tejun Heo
2023-08-17 12:44       ` Mel Gorman
2023-08-24 21:31         ` Tejun Heo
2023-09-19 17:56           ` Tejun Heo
2023-09-26  9:20             ` Mel Gorman
2023-10-10 22:09               ` Tejun Heo
2023-08-25  0:26   ` Josh Don
