The Linux Kernel Mailing List
From: Peter Zijlstra <peterz@infradead.org>
To: mingo@kernel.org
Cc: longman@redhat.com, chenridong@huaweicloud.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, tj@kernel.org, hannes@cmpxchg.org,
	mkoutny@suse.com, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, jstultz@google.com,
	kprateek.nayak@amd.com, qyousef@layalina.io
Subject: [PATCH v2 00/10] sched: Flatten the pick
Date: Mon, 11 May 2026 13:31:04 +0200
Message-ID: <20260511113104.563854162@infradead.org>

Hi!

So cgroup scheduling has always been a pain in the arse. The problems start
with weight distribution and end with hierarchical picks and it all sucks.

The problems with weight distribution are related to that infernal global
fraction:

             tg->w * grq_i->w
   ge_i->w = ----------------
             \Sum_j grq_j->w

which we've approximated reasonably well by now. However, the immediate
consequence of this fraction is that the total group weight (tg->w) gets
fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
weight ends up being worth about one nice 19 task. And the more CPUs, the
tinier it gets. Combined with the fact that 256-CPU systems are relatively
common these days, this becomes painful.
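
To put numbers on the fragmentation, here is a minimal Python sketch of the
fraction above. It assumes the group has the default weight of 1024 and one
nice 0 task runnable on every CPU, and it uses the values from the kernel's
nice-to-weight table (nice 0 = 1024, nice 19 = 15). The function name is made
up for illustration and does not exist in the kernel.

```python
NICE_0_WEIGHT = 1024   # weight of a nice 0 task; assumed default group weight
NICE_19_WEIGHT = 15    # weight of a nice 19 task

def group_entity_weights(tg_weight, grq_weights):
    """ge_i->w = tg->w * grq_i->w / Sum_j grq_j->w, one value per CPU."""
    total = sum(grq_weights)
    if total == 0:
        # No runnable load anywhere: nothing to distribute.
        return [0.0 for _ in grq_weights]
    return [tg_weight * w / total for w in grq_weights]

# One nice 0 task of the group runnable on each of 64 CPUs.
per_cpu = group_entity_weights(NICE_0_WEIGHT, [NICE_0_WEIGHT] * 64)
print(per_cpu[0])   # 16.0, roughly one nice 19 task (15)
```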

The common 'solution' is to inflate the group weight by 'nr_cpus'; the
immediate problem with that is that when all load of a group gets concentrated
on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
exceeding nice -20.
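
The flip side can be sketched the same way. This assumes a default group
weight of 1024 inflated by nr_cpus, with all of the group's load on a single
CPU so that this one CPU receives the whole inflated weight; nice -20 is 88761
in the kernel's nice-to-weight table. The helper name is hypothetical.

```python
NICE_0_WEIGHT = 1024          # assumed default group weight
NICE_MINUS_20_WEIGHT = 88761  # weight of a nice -20 task

def concentrated_weight(tg_weight, nr_cpus):
    """Per-CPU group weight when tg->w is inflated by nr_cpus and every
    runnable task of the group sits on the same CPU."""
    return tg_weight * nr_cpus

for nr_cpus in (64, 256):
    w = concentrated_weight(NICE_0_WEIGHT, nr_cpus)
    print(nr_cpus, w, w > NICE_MINUS_20_WEIGHT)
```

With these assumptions the 64-CPU case lands at 65536, still below nice -20,
while the 256-CPU case lands at 262144, roughly three times nice -20.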

Additionally there are numerical limits on the max weight you can have before
the math starts suffering overflows. As such there is a definite limit on the
total group weight. Which has annoyed people ;-)

The first few patches add a knob /debug/sched/cgroup_mode and a few different
options on how to deal with this. My favourite is 'concur', but obviously that
is also the most expensive one :-/ It adds a tg->tasks counter which makes the
update_tg_load_avg() thing more expensive.

I have some ideas but I figured I ought to share these things before sinking
more time into it.


On to the hierarchical pick; this has been causing trouble for a very long
time. So once again an attempt at flattening it. The basic idea is to keep the
full hierarchical load tracking as-is, but keep all the runnable entities in a
single level. The immediate consequence of all this is of course that we need
to constantly re-compute the effective weight of each entity as things progress.

Reweight is done on:
 - enqueue
 - pick -- or rather set_next_entity(.first=true)
 - tick

So while the {en,de}queue operations are still O(depth) due to the full
accounting mess, the pick is now a single level, removing the intermediate
levels that obscure runnability etc.
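
As a rough mental model of what "effective weight" means in a single-level
pick, here is a small Python sketch. It is NOT taken from the patches: it is a
plain proportional-share model in which a task's weight is scaled by each
ancestor group's share (group entity weight over the load of the group's own
runqueue). The Group class and effective_weight() are hypothetical names.

```python
class Group:
    """A task group as seen on one CPU (hypothetical, for illustration only)."""
    def __init__(self, weight, load, parent=None):
        self.weight = weight   # weight of the group's entity in its parent
        self.load = load       # sum of the weights of its runnable children
        self.parent = parent   # enclosing group, None if directly under root

def effective_weight(task_weight, group):
    """Scale a task's weight by each ancestor's share, walking up to the root."""
    w = float(task_weight)
    while group is not None:
        if group.load <= 0:
            raise ValueError("a group with a runnable child must have load > 0")
        w *= group.weight / group.load
        group = group.parent
    return w

# Root holds task A (1024) and group G (1024); G holds two nice 0 tasks.
g = Group(weight=1024, load=2048)
print(effective_weight(1024, None))  # 1024.0 for A
print(effective_weight(1024, g))     # 512.0 for each task in G
```

In this model any change to a group's load changes the result for every task
below it, which is presumably why the value has to be refreshed at the points
listed above rather than computed once.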


For testing, I've done a little experiment: I dug out what is colloquially
known as a potato, a trusty old Sandy Bridge 2600K with an RX 580, and ran a
game on it. From GOG, I had available 'Shadows: Awakening', a fun title that
normally runs really well on this machine (provided you stick to 1080p).

To make it interesting, I added 8 (one for each logical CPU) copies of: 'nice
spin.sh'; this results in the game becoming almost unplayable, as in proper
terrible.

I used MangoHUD to record a few minutes of playtime for statistics, and then
quit the game and re-started it with a shorter slice set (base/10). This
results in the game being entirely playable -- not great, but definitely
playable.

  Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
  Intel Core i7-2600K
  AMD Radeon RX 580

  Shadows Awakening (GOG)

            default   slice(*)

  FPS min     3.8      20.6
      avg    48.0      57.2
      max    87.4      80.3

  FT  min     9.4       8.4
      avg    34.5      19.5
      max   107.4      37.2

  FPS (Frames Per Second)
  FT  (FrameTime)

  (*) Command prefix: 'chrt -o --sched-runtime 280000 0'
      effectively setting 'base_slice_ns/10'
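
For reference, the 280000 figure can be reproduced with a few lines of Python.
This is only a sketch and assumes the upstream defaults as I understand them: a
700000 ns base slice and logarithmic tunable scaling with the CPU count capped
at 8. The function name is made up.

```python
BASE_SLICE_NS = 700_000   # assumed upstream default, before CPU scaling

def scaled_base_slice_ns(nr_cpus):
    """Default base slice with log scaling: factor 1 + ilog2(min(cpus, 8))."""
    cpus = min(nr_cpus, 8)
    factor = 1 + (cpus.bit_length() - 1)   # bit_length() - 1 == ilog2, cpus >= 1
    return BASE_SLICE_NS * factor

base = scaled_base_slice_ns(8)   # the 8 logical CPUs of the test machine
print(base, base // 10)          # 2800000 280000
```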

I have not compared to a kernel without flat on; I just wanted to run
non-trivial workloads and play with slice to make sure everything 'works'.


Can also be had:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat

 include/linux/cpuset.h |    6 
 include/linux/sched.h  |    1 
 kernel/cgroup/cpuset.c |   15 
 kernel/sched/core.c    |   47 --
 kernel/sched/debug.c   |  171 +++++---
 kernel/sched/fair.c    | 1038 ++++++++++++++++++++++---------------------------
 kernel/sched/pelt.c    |    6 
 kernel/sched/sched.h   |   44 --
 8 files changed, 672 insertions(+), 656 deletions(-)

---
Changes since v1 ( https://patch.msgid.link/20260317095113.387450089@infradead.org ):
 - various Sashiko thingies
 - rebase atop current -tip



Thread overview:
2026-05-11 11:31 Peter Zijlstra [this message]
2026-05-11 11:31 ` [PATCH v2 01/10] sched/debug: Use char * instead of char (*)[] Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 02/10] sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 03/10] sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sections Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 04/10] sched/fair: Add cgroup_mode switch Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 05/10] sched/fair: Add cgroup_mode: UP Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 06/10] sched/fair: Add cgroup_mode: MAX Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 07/10] sched/fair: Add cgroup_mode: CONCUR Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 09/10] sched: Remove sched_class::pick_next_task() Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
2026-05-11 16:21   ` K Prateek Nayak
2026-05-11 19:23 ` [PATCH v2 00/10] sched: Flatten the pick Tejun Heo
