Re: [PATCHSET RFC] sched: Implement BPF extensible scheduler class

public inbox for linux-kernel@vger.kernel.org

From: Tejun Heo <tj@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Don <joshdon@google.com>,
	torvalds@linux-foundation.org, mingo@redhat.com,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
	vschneid@redhat.com, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@kernel.org, brho@google.com,
	pjt@google.com, derkling@google.com, haoluo@google.com,
	dvernet@meta.com, dschatzberg@meta.com, dskarlat@cs.cmu.edu,
	riel@surriel.com, linux-kernel@vger.kernel.org,
	bpf@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCHSET RFC] sched: Implement BPF extensible scheduler class
Date: Wed, 14 Dec 2022 12:23:14 -1000
Message-ID: <Y5pM0ralEr6coT25@slm.duckdns.org>
In-Reply-To: <Y5mPigH1bPatXNeB@hirez.programming.kicks-ass.net>

Hello,

On Wed, Dec 14, 2022 at 09:55:38AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 13, 2022 at 06:11:38PM -0800, Josh Don wrote:
> > Improving scheduling performance requires rapid iteration to explore
> > new policies and tune parameters, especially as hardware becomes more
> > heterogeneous, and applications become more complex. Waiting months
> > between evaluating scheduler policy changes is simply not scalable,
> > but this is the reality with large fleets that require time for
> > testing, qualification, and progressive rollout. The security angle
> > should be clear from how involved it was to integrate core scheduling,
> > for example.
> 
> Surely you can evaluate stuff on a small subset of machines -- I'm
> fairly sure I've had google and facebook people tell me they do just
> that, roll out the test kernel on tens to hundreds of thousand of
> machines instead of the stupid number and see how it behaves there.
>
> Statistics has something here I think, you can get a reliable
> representation of stuff without having to sample *everyone*.

Google guys probably have a lot to say here too and there may be many
commonalities, but here's how things are on our end.

We (Meta) experiment and debug at multiple levels. For example, when
qualifying a new kernel or feature, a common pattern we follow is
two-phased. The first phase is testing it on several well-known and widely
used workloads in a controlled experiment environment with a smaller number
of machines, usually some dozens, but it can go one or two orders of
magnitude higher. Once that looks okay, the second phase is to gradually
deploy while monitoring system-level behaviors (crashes, utilization,
latency and pressure metrics and so on) and feedback from service owners.

We run tens of thousands of different workloads in the fleet and we try hard
to do as much as possible in the first phase but many of the difficult and
subtle problems are only detectable in the second phase. When we detect such
problems in the second phase, we triage the problem and pull back deployment
if necessary and then restart after fixing.

As the overused saying goes, quantity has a quality of its own. The
workloads become largely opaque because there are too many of them doing too
many different things for anyone from the system side to examine each one.
In many cases, the best and sometimes only visibility we get is statistical
- comparing two chunks of the fleet which are large enough for the
statistical signals to overcome the noise. That threshold can be pretty
high. Multiple hundreds of thousands of machines being used for a test set
isn't all that uncommon.
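
As a rough illustration of why that threshold ends up so high, here is a
minimal sketch of the textbook sample-size estimate for comparing the mean
of a metric between two groups of machines (two-sided, two-sample z-test).
The function name and the inputs (a 0.5% shift, 50% machine-to-machine
spread) are made-up assumptions for illustration, not numbers from our
fleet, and real fleet comparisons involve far more than this.

```python
from math import ceil
from statistics import NormalDist


def machines_per_group(rel_effect, cv, alpha=0.05, power=0.8):
    """Machines needed in each group for a two-sided, two-sample z-test.

    rel_effect: smallest relative change in the metric worth detecting
    cv: machine-to-machine standard deviation divided by the mean
    alpha: accepted false-positive rate
    power: probability of detecting a real change of size rel_effect
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    # n = 2 * (z_alpha + z_power)^2 * (sigma / delta)^2, with sigma and
    # delta both expressed relative to the mean.
    return ceil(2 * ((z_alpha + z_power) * cv / rel_effect) ** 2)


# Hypothetical inputs: a 0.5% shift against 50% machine-to-machine spread.
print(machines_per_group(rel_effect=0.005, cv=0.5))
```

With these assumed inputs the estimate is already above 150,000 machines per
group, and halving the effect size roughly quadruples it.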

One complicating factor for the second phase is that we're deploying on the
production fleet running live production workloads. Besides the obvious fact
that users become mightily unhappy when machines crash, there are
complicating matters like limits on how many and which machines can be
rebooted at any given time due to interactions with capacity and
maintenance, which severely restrict how fast kernels can be iterated. A
full sweep through the fleet can easily take months.

Between a large number of opaque workloads and production constraints which
limit the type and speed of kernel iterations, our ability to experiment
with scheduling by modifying the kernel directly is severely limited. We can
do small things but trying out big ideas can become logistically
prohibitive.

Note that all these get even worse for public cloud operators. If we really
need to, we can at least find the service owner and talk with them. For
public cloud operators, the workloads are truly opaque.

There's yet another aspect which is caused by fleet dynamism. When we're
hunting down a scheduling misbehavior and want to test out specific ideas,
it can actually be pretty difficult to get back the same workload
composition after a reboot or crash. The fleet management layer will kick in
right away and the workloads get reallocated who-knows-where. This problem
is likely shared by smaller scale operations too. There are just a lot of
layers which are difficult to keep fixed across reboots and crashes. Even in the
same workload, the load balancer or dispatcher might behave very differently
for the machine after a reboot.

> I was given to believe this was a fairly rapid process.

Going back to the first phase where we're experimenting in a more controlled
environment. Yes, that is a faster process but only in comparison to the
second phase. Some controlled experiments, the faster ones, usually take
several hours to obtain a meaningful result. It just takes a while for
production workloads to start, jit-compile all the hot code paths, warm up
caches and so on. Others, unfortunately, take a lot longer to ramp up to the
degree where they can be compared against production numbers. Some of the
benchmarks stretch over multiple days.

With SCX, we can just keep hotswapping and tuning the scheduler behavior,
getting results in tens of minutes instead of multiple hours and without
worrying about crashing the test machines, which often has side-effects on
the benchmark setup - the benchmarks are often performed with shadowed
production traffic using the same production software and they get unhappy
when a lot of machines crash. These problems can easily take hours to
resolve.

> Just because you guys have more machines than is reasonable, doesn't
> mean we have to put BPF everywhere.

There are some problems which are specific to large operators like us or
google for sure, but many of these problems are shared by other use cases
which need to test with real-world applications. Even on mobile devices,
it's way easier and faster to have a running test environment set up and
iterate through scheduling behavior changes without worrying about crashing
the machine than having to cycle and redo the test setup for each iteration.

The productivity gain extends to individual kernel developers and
researchers. Just rebooting server-class hardware often takes upwards of
ten minutes, so most of us try to iterate as much on VMs as possible which
unfortunately doesn't work out too well for subtle performance issues. SCX
can easily cut down iteration time by an order of magnitude or more.

> Additionally, we don't merge and ship everybodies random debug patch
> either -- you're free to do whatever you need to iterate on your own and
> then send the patches that result from this experiment upstream. This is
> how development works, no?

We of course don't merge random debug patches which have limited usefulness
to a small number of use cases. However, we absolutely do ship code to
support debugging and development when the benefit outweighs the cost. Just
to list several examples - lockdep, perf, tracing, all the memory debug
options.

The argument is that, given the current situation including the hardware
and software landscape, a BPF extensible scheduling framework has enough
benefits to justify the cost.

Thanks.

-- 
tejun
