[PATCH v3 0/7] sched: Flatten the pick

* [PATCH v3 0/7] sched: Flatten the pick
@ 2026-06-05 12:40 Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 1/7] sched/fair: Add cgroup_mode switch Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef


Hi!

New version, same story [1]. TL;DR:

 - Adds new cgroup_mode knob and implements new policies to address the
   hierarchy level weight mismatch.

 - Builds upon that base to create a flat / single runqueue scheduler where the
   cgroup hierarchy is expressed through dynamic weight management.

I'm hoping to be able to merge these patches early in the next cycle (after
7.2-rc1).

Random benchmark:

Game vs 'for ((i=0; i<8; i++)) do nice ./spin.sh; done':

  Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
  Intel Core i7-2600K
  AMD Radeon RX 580

  Shadows Awakening (GOG)

                 default   slice(*)

  FPS  min          4.0       29.0
       avg         47.5       59.2
       max         83.7       83.7

  FT   min          9.3       10.2
       avg         34.0       17.0
       max        121.2       30.0

  FPS: Frames Per Second
  FT:  FrameTime

  (*) Command prefix: 'chrt -o --sched-runtime 100000 0'


Changes since v2:

 - merged debug and prep patches
 - fixed update_entity_lag() on dequeue (Vincent)
 - fixed throttle vs tick (Prateek)
 - fixed wakeup_preempt_fair()
 - rebased on tip/sched/core
 - rewritten cgroup_mode changelogs
 - reworked cgroup_mode concur
 - added cgroup_mode tasks
 - changed default cgroup_mode


[1] - https://lore.kernel.org/r/20260511113104.563854162@infradead.org

Can also be had:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat

 include/linux/cpuset.h |    6 
 include/linux/sched.h  |    1 
 kernel/cgroup/cpuset.c |   15 
 kernel/sched/core.c    |    5 
 kernel/sched/debug.c   |   89 ++++
 kernel/sched/fair.c    |  943 ++++++++++++++++++++++++-------------------------
 kernel/sched/pelt.c    |    6 
 kernel/sched/sched.h   |   30 -
 8 files changed, 607 insertions(+), 488 deletions(-)



* [PATCH v3 1/7] sched/fair: Add cgroup_mode switch
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 2/7] sched/fair: Add cgroup_mode: up Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

The effective task weight (W_t') for a task in cgroup g on CPU n is given by:

	                        W_t
	W_t' = W_g * F_g_n * ----------
	                     \Sum W_t_n

Where W_g is the group's weight (cpu.weight), F_g_n is the fraction of the
group weight for CPU n and W_t / \Sum W_t_n is the relative weight of this task
against all other tasks in the same group on the same CPU.

Furthermore, F_g_n is given by:

	        \Sum W_t_n
	F_g_n = ----------
	         \Sum W_t

That is, the fraction of the group's task weight on CPU n against that of the
whole group.

The problem is with F_g_n. The primary goal of this fraction is to make sure
that the relative weight of tasks is maintained when they are distributed over
CPUs. For example, consider 4 (equal weight) tasks and 2 CPUs with a 1:3
distribution. If F_g_n were simply 1 (no weight redistribution), the effective
relative weights (W_t') of the tasks in our group would be:

	CPU0	CPU1
	W_g	W_g/3
		W_g/3
		W_g/3

IOW, the lucky task on CPU0 would get an equal amount of weight as all 3 tasks
on CPU1 combined. However, with the weight redistribution, this becomes:

	CPU0	CPU1
	W_g/4	W_g/4
		W_g/4
		W_g/4

All tasks are equal weight (as intended). However, as is already evident from
this example, the more CPUs you add, the smaller F_g_n becomes, which creates a
disparity against tasks not in our group.
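
The two tables above can be reproduced with a small user-space sketch. This is
not kernel code: effective_weight() and its parameters are hypothetical names
that only restate the W_t' formula in integer arithmetic.

```c
/*
 * Hypothetical user-space model of the W_t' formula; not kernel code.
 *
 *   w_g          group weight (cpu.weight)
 *   w_t          weight of the task in question
 *   cpu_sum      \Sum W_t_n: weight of the group's tasks on this CPU
 *   group_sum    \Sum W_t:   weight of the group's tasks on all CPUs
 *   redistribute 0 means F_g_n = 1, non-zero means F_g_n = cpu_sum / group_sum
 */
static long effective_weight(long w_g, long w_t, long cpu_sum,
			     long group_sum, int redistribute)
{
	long w = w_g;

	if (redistribute)
		w = w * cpu_sum / group_sum;	/* W_g * F_g_n */

	return w * w_t / cpu_sum;		/* ... * W_t / \Sum W_t_n */
}
```

With W_g = 1200, unit task weights and the 1:3 split, this gives 1200 on CPU0
and 400 on CPU1 without redistribution, and 300 for every task with it.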

Specifically:

	avg(F_g_n) ~ 1/N

This leads to a weight mismatch in the hierarchy. IOW tasks cannot compete
fairly across hierarchy levels.

*Notably*, what is meant by avg(F_g_n) being proportional to 1/N is that when
there are at least N runnable tasks, the average of this fraction tends to 1/N.

For a hierarchy of depth d, this gets even worse, since that gets terms on the
order of:

	avg(F_g_n)^d ~ 1/(N^d)

Given fixed point arithmetic, this also leads to numerical trouble.
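
A rough sketch of the numerical trouble, assuming a weight is an integer with
nice-0 at 1024 (or 1 << 20 when scaled up) and that each hierarchy level
scales it by about 1/N. nested_weight() is a hypothetical helper, not kernel
code.

```c
/*
 * Hypothetical model, not kernel code: apply avg(F_g_n) ~ 1/N once per
 * hierarchy level using integer division, as fixed point arithmetic would.
 */
static long nested_weight(long weight, int ncpus, int depth)
{
	for (int i = 0; i < depth; i++)
		weight /= ncpus;

	return weight;
}
```

With N = 64, a weight of 1024 drops to 16 after one level and to 0 after two;
starting from 1 << 20 it is down to 4 after three levels.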

However, the meaning of "cpu.weight" is simple and intuitive: the total weight
of the cgroup. But as explored above, there is deception in this simplicity.

Prepare to add a few alternative methods for distributing weight.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -633,6 +633,76 @@ static void debugfs_fair_server_init(voi
 	}
 }

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static int cgroup_mode = 0;
+
+static const char *cgroup_mode_str[] = {
+	"smp",
+};
+
+static int sched_cgroup_mode(const char *str)
+{
+	for (int i = 0; i < ARRAY_SIZE(cgroup_mode_str); i++) {
+		if (!strcmp(str, cgroup_mode_str[i]))
+			return i;
+	}
+	return -EINVAL;
+}
+
+static ssize_t sched_cgroup_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	int mode;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+	mode = sched_cgroup_mode(strstrip(buf));
+	if (mode < 0)
+		return mode;
+
+	WRITE_ONCE(cgroup_mode, mode);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_cgroup_show(struct seq_file *m, void *v)
+{
+	int mode = READ_ONCE(cgroup_mode);
+
+	for (int i = 0; i < ARRAY_SIZE(cgroup_mode_str); i++) {
+		if (mode == i)
+			seq_puts(m, "(");
+		seq_puts(m, cgroup_mode_str[i]);
+		if (mode == i)
+			seq_puts(m, ")");
+
+		seq_puts(m, " ");
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static int sched_cgroup_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_cgroup_show, NULL);
+}
+
+static const struct file_operations sched_cgroup_fops = {
+	.open		= sched_cgroup_open,
+	.write		= sched_cgroup_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa, *llc;
@@ -686,6 +756,10 @@ static __init int sched_init_debug(void)

 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+	debugfs_create_file("cgroup_mode", 0644, debugfs_sched, NULL, &sched_cgroup_fops);
+#endif
+
 	debugfs_fair_server_init();
 #ifdef CONFIG_SCHED_CLASS_EXT
 	debugfs_ext_server_init();


* [PATCH v3 2/7] sched/fair: Add cgroup_mode: up
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 1/7] sched/fair: Add cgroup_mode switch Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-05 15:07   ` Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 3/7] sched/fair: Add cgroup_mode: max Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Instead of calculating the proportional fraction of the group weight for each
CPU, just give each CPU the full measure, ignoring these pesky SMP problems.

This makes the SMP cgroup fraction (F_g_n) equal to 1, and ensures a single
task in a cgroup competes on equal footing with a task in a level above.

However, as already explored, this is not a very good policy because it gets
the SMP weight distribution wrong. Included for completeness.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    5 ++++-
 kernel/sched/fair.c  |   31 +++++++++++++++++++++++++++++--
 kernel/sched/sched.h |    1 +
 3 files changed, 34 insertions(+), 3 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -271,6 +271,7 @@ static ssize_t sched_dynamic_write(struc
 	if (mode < 0)
 		return mode;
 
+	__sched_cgroup_mode_update(mode);
 	sched_dynamic_update(mode);
 
 	*ppos += cnt;
@@ -634,9 +635,11 @@ static void debugfs_fair_server_init(voi
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static int cgroup_mode = 0;
+static int cgroup_mode = 1;
 
+/* See __sched_cgroup_mode_update(). */
 static const char *cgroup_mode_str[] = {
+	"up",
 	"smp",
 };
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -38,6 +38,7 @@
 #include <linux/sched/isolation.h>
 #include <linux/sched/nohz.h>
 #include <linux/sched/prio.h>
+#include <linux/static_call.h>
 
 #include <linux/cpuidle.h>
 #include <linux/interrupt.h>
@@ -4800,7 +4801,7 @@ static inline int throttled_hierarchy(st
  *
  * hence icky!
  */
-static long calc_group_shares(struct cfs_rq *cfs_rq)
+static long calc_smp(struct cfs_rq *cfs_rq)
 {
 	long tg_weight, tg_shares, load, shares;
 	struct task_group *tg = cfs_rq->tg;
@@ -4835,6 +4836,32 @@ static long calc_group_shares(struct cfs
 }
 
 /*
+ * Ignore this pesky SMP stuff, use (4).
+ */
+static long calc_up_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	return READ_ONCE(tg->shares);
+}
+
+DEFINE_STATIC_CALL(calc_group_shares, calc_smp_shares);
+
+void __sched_cgroup_mode_update(int mode)
+{
+	long (*func)(struct cfs_rq *);
+	switch (mode) {
+	case 0:
+		func = &calc_up_shares;
+		break;
+	case 1:
+	default:
+		func = &calc_smp_shares;
+		break;
+	}
+	static_call_update(calc_group_shares, func);
+}
+
+/*
  * Recomputes the group entity based on the current state of its group
  * runqueue.
  */
@@ -4850,7 +4877,7 @@ static void update_cfs_group(struct sche
 	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
-	shares = calc_group_shares(gcfs_rq);
+	shares = static_call(calc_group_shares)(gcfs_rq);
 	if (unlikely(se->load.weight != shares))
 		reweight_entity(cfs_rq_of(se), se, shares);
 }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -571,6 +571,7 @@ extern void free_fair_sched_group(struct
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
 extern void online_fair_sched_group(struct task_group *tg);
 extern void unregister_fair_sched_group(struct task_group *tg);
+extern void __sched_cgroup_mode_update(int mode);
 #else /* !CONFIG_FAIR_GROUP_SCHED: */
 static inline void free_fair_sched_group(struct task_group *tg) { }
 static inline int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)




* [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 1/7] sched/fair: Add cgroup_mode switch Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 2/7] sched/fair: Add cgroup_mode: up Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-10 15:09   ` Waiman Long
  2026-06-05 12:40 ` [PATCH v3 4/7] sched/fair: Add cgroup_mode: concur Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

In order to keep the average CPU fraction avg(F_g_n) from becoming a tiny
'1/N', assume each cgroup is maximally concurrent and distribute 'N*weight',
such that:

	F_g_n' = N * F_g_n

Giving:

	avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1

And while this sounds like it solves things, remember what that ~ meant. There
is the corner case when a cgroup is minimally loaded, e.g. a single runnable
task. Therefore, limit the CPU fraction to that of a nice -20 task to avoid
getting too much load.

This last bit is what makes it different from a previous proposal to allow
raising cpu.weight to '100 * N'. That proposal would not limit the minimal
concurrency case and would result in a very large F_g_n. And just like
F_g_n << 1 is problematic, so is F_g_n >> 1, for the exact same reasons (it
would drown the kthreads, but it also risks overflowing the load values).

So while this might appear to be a better scheme than the current default
scheme, it doesn't really handle less than maximal concurrency nicely -- it
clips and introduces artificially large weights. So where the traditional SMP
mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
and vice-versa.
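
A simplified user-space sketch of the two regimes. The fraction() model below
only approximates the shares computation (it ignores load averages and
scale_load()), and all names are hypothetical. The nice -20 weight of 88761 is
the unscaled sched_prio_to_weight[0] value.

```c
#define MIN_SHARES	2L
#define NICE_M20_WEIGHT	88761L	/* unscaled weight of a nice -20 task */

static long clamp_long(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Simplified model: this CPU's part of 'total', clamped to [MIN_SHARES, max]. */
static long fraction(long total, long cpu_load, long group_load, long max)
{
	long shares = total * cpu_load;

	if (group_load)
		shares /= group_load;

	return clamp_long(shares, MIN_SHARES, max);
}

/* "smp": distribute tg_shares, never hand out more than tg_shares. */
static long smp_shares(long tg_shares, long cpu_load, long group_load)
{
	return fraction(tg_shares, cpu_load, group_load, tg_shares);
}

/* "max": distribute nr_cpus * tg_shares, clipped at nice -20. */
static long max_shares(long tg_shares, int nr_cpus, long cpu_load,
		       long group_load)
{
	return fraction(tg_shares * nr_cpus, cpu_load, group_load,
			NICE_M20_WEIGHT);
}
```

On a 512 CPU machine with tg_shares = 1024, a single spinner would get
512 * 1024 without the clip and gets 88761 with it. With one task on every
CPU, "max" yields 1024 per CPU where "smp" bottoms out at MIN_SHARES.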

The meaning of "cpu.weight" would be: weight per allowed CPU.

Included for completeness (and infrastructure).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/cpuset.h |    6 +++++
 kernel/cgroup/cpuset.c |   15 ++++++++++++++
 kernel/sched/debug.c   |    1 
 kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 69 insertions(+), 5 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
 extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern int cpuset_num_cpus(struct cgroup *cgroup);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
 	return false;
 }
 
+static inline int cpuset_num_cpus(struct cgroup *cgroup)
+{
+	return num_online_cpus();
+}
+
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
 	return changed;
 }
 
+int cpuset_num_cpus(struct cgroup *cgrp)
+{
+	int nr = num_online_cpus();
+	struct cpuset *cs;
+
+	if (is_in_v2_mode()) {
+		guard(rcu)();
+		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
+		if (cs)
+			nr = cpumask_weight(cs->effective_cpus);
+	}
+
+	return nr;
+}
+
 void __init cpuset_init_current_mems_allowed(void)
 {
 	nodes_setall(current->mems_allowed);
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -641,6 +641,7 @@ static int cgroup_mode = 1;
 static const char *cgroup_mode_str[] = {
 	"up",
 	"smp",
+	"max",
 };
 
 static int sched_cgroup_mode(const char *str)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4801,12 +4801,10 @@ static inline int throttled_hierarchy(st
  *
  * hence icky!
  */
-static long calc_smp(struct cfs_rq *cfs_rq)
+static long __calc_smp_shares(struct cfs_rq *cfs_rq, long tg_shares, long shares_max)
 {
-	long tg_weight, tg_shares, load, shares;
 	struct task_group *tg = cfs_rq->tg;
-
-	tg_shares = READ_ONCE(tg->shares);
+	long tg_weight, load, shares;
 
 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
@@ -4832,7 +4830,48 @@ static long calc_smp(struct cfs_rq *cfs_
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	return clamp_t(long, shares, MIN_SHARES, tg_shares);
+	return clamp_t(long, shares, MIN_SHARES, shares_max);
+}
+
+static int tg_cpus(struct task_group *tg)
+{
+	int nr = num_online_cpus();
+
+	if (cpusets_enabled()) {
+		struct cgroup *cgrp = tg->css.cgroup;
+		if (cgrp)
+			nr = cpuset_num_cpus(cgrp);
+	}
+
+	return nr;
+}
+
+/*
+ * Func: min(fraction(nr_cpus * tg->shares), nice -20)
+ *
+ * Scale tg->shares by the maximal number of CPUs; but clip the max shares at
+ * nice -20, otherwise a single spinner on a 512 CPU machine would result in
+ * 512*NICE_0_LOAD, which is also crazy.
+ */
+static long calc_max_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	int nr = tg_cpus(tg);
+	long tg_shares = READ_ONCE(tg->shares);
+	long max_shares = scale_load(sched_prio_to_weight[0]);
+	return __calc_smp_shares(cfs_rq, tg_shares * nr, max_shares);
+}
+
+/*
+ * Func: fraction(tg->shares)
+ *
+ * This infamously results in tiny shares when you have many CPUs.
+ */
+static long calc_smp_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long tg_shares = READ_ONCE(tg->shares);
+	return __calc_smp_shares(cfs_rq, tg_shares, tg_shares);
 }
 
 /*
@@ -4857,6 +4896,9 @@ void __sched_cgroup_mode_update(int mode
 	default:
 		func = &calc_smp_shares;
 		break;
+	case 2:
+		func = &calc_max_shares;
+		break;
 	}
 	static_call_update(calc_group_shares, func);
 }




* [PATCH v3 4/7] sched/fair: Add cgroup_mode: concur
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
                   ` (2 preceding siblings ...)
  2026-06-05 12:40 ` [PATCH v3 3/7] sched/fair: Add cgroup_mode: max Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 5/7] sched/fair: Add cgroup_mode: tasks Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Improve upon the previous scheme ("max") by no longer assuming maximal
concurrency. Instead scale by: 'min(nr_tasks, nr_cpus)'. This handles
the low concurrency cases more gracefully:

	F_g_n' = min(M, N) * F_g_n

Notably this is the first mode where:

	avg(F_g_n') = 1

In the single task case it reduces to ("smp") and then it nicely scales up
until it hits N, where it behaves like ("max").

This is no longer clipped at nice -20. Strictly speaking it isn't different
from the normal SMP scenario where all tasks are extremely unbalanced. There
are no unnatural inflations in this scheme.
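
A simplified user-space sketch of how the scaling behaves as the number of
tasks grows. As before, this only approximates the shares computation (plain
weights instead of load averages) and every name in it is hypothetical.

```c
#define MIN_SHARES	2L

static long clamp_long(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical model of "concur": distribute nr * tg_shares over the CPUs,
 * where nr = min(nr_tasks, nr_cpus), and allow at most that much per CPU.
 */
static long concur_shares(long tg_shares, int nr_tasks, int nr_cpus,
			  long cpu_load, long group_load)
{
	int nr = nr_tasks < nr_cpus ? nr_tasks : nr_cpus;
	long total = nr * tg_shares;
	long shares = total * cpu_load;

	if (group_load)
		shares /= group_load;

	return clamp_long(shares, MIN_SHARES, total);
}
```

With 8 CPUs, tg_shares = 1024 and nice-0 tasks spread evenly, each busy CPU
sees 1024 whether there are 1, 4, 8 or 16 tasks. If 4 tasks all pile onto one
CPU, that CPU sees 4096, the same as 4 tasks outside a cgroup would weigh.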

The meaning of "cpu.weight" would be: weight per active CPU.

NOTE: Compute the group wide number of tasks by extending the tg->load_avg
computation with tg->runnable_avg, since cfs_rq->runnable_avg is based on
cfs_rq->h_nr_running.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    1 +
 kernel/sched/fair.c  |   43 ++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |    3 +++
 3 files changed, 40 insertions(+), 7 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -641,6 +641,7 @@ static int cgroup_mode = 1;
 static const char *cgroup_mode_str[] = {
 	"up",
 	"smp",
+	"concur",
 	"max",
 };
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4846,6 +4846,11 @@ static int tg_cpus(struct task_group *tg
 	return nr;
 }
 
+static inline int tg_tasks(struct task_group *tg)
+{
+	return max(1, atomic_long_read(&tg->runnable_avg) >> SCHED_CAPACITY_SHIFT);
+}
+
 /*
  * Func: min(fraction(nr_cpus * tg->shares), nice -20)
  *
@@ -4863,6 +4868,20 @@ static long calc_max_shares(struct cfs_r
 }
 
 /*
+ * Func: fraction(nr * tg->shares); nr = min(nr_tasks, nr_cpus)
+ *
+ * Scales between "smp" and "max" in a natural way. No longer needs clipping
+ * since there are no unnatural inflations like with "max".
+ */
+static long calc_concur_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	int nr = min(tg_tasks(tg), tg_cpus(tg));
+	long tg_shares = READ_ONCE(tg->shares);
+	return __calc_smp_shares(cfs_rq, nr * tg_shares, nr * tg_shares);
+}
+
+/*
  * Func: fraction(tg->shares)
  *
  * This infamously results in tiny shares when you have many CPUs.
@@ -4897,6 +4916,9 @@ void __sched_cgroup_mode_update(int mode
 		func = &calc_smp_shares;
 		break;
 	case 2:
+		func = &calc_concur_shares;
+		break;
+	case 3:
 		func = &calc_max_shares;
 		break;
 	}
@@ -5043,7 +5065,7 @@ static inline bool cfs_rq_is_decayed(str
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta;
+	long dl, dr;
 	u64 now;
 
 	/*
@@ -5064,17 +5086,21 @@ static inline void update_tg_load_avg(st
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
-	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
-	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+	dl = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	dr = cfs_rq->avg.runnable_avg - cfs_rq->tg_runnable_avg_contrib;
+	if (abs(dl) > cfs_rq->tg_load_avg_contrib / 64 ||
+	    abs(dr) > cfs_rq->tg_runnable_avg_contrib / 64) {
+		atomic_long_add(dl, &cfs_rq->tg->load_avg);
+		atomic_long_add(dr, &cfs_rq->tg->runnable_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->tg_runnable_avg_contrib = cfs_rq->avg.runnable_avg;
 		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
 static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta;
+	long dl, dr;
 	u64 now;
 
 	/*
@@ -5084,9 +5110,12 @@ static inline void clear_tg_load_avg(str
 		return;
 
 	now = rq_clock(rq_of(cfs_rq));
-	delta = 0 - cfs_rq->tg_load_avg_contrib;
-	atomic_long_add(delta, &cfs_rq->tg->load_avg);
+	dl = 0 - cfs_rq->tg_load_avg_contrib;
+	dr = 0 - cfs_rq->tg_runnable_avg_contrib;
+	atomic_long_add(dl, &cfs_rq->tg->load_avg);
+	atomic_long_add(dr, &cfs_rq->tg->runnable_avg);
 	cfs_rq->tg_load_avg_contrib = 0;
+	cfs_rq->tg_runnable_avg_contrib = 0;
 	cfs_rq->last_update_tg_load_avg = now;
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -493,6 +493,8 @@ struct task_group {
 	 * will also be accessed at each tick.
 	 */
 	atomic_long_t		load_avg ____cacheline_aligned;
+	atomic_long_t		runnable_avg;
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -722,6 +724,7 @@ struct cfs_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
+	unsigned long		tg_runnable_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
 




* [PATCH v3 5/7] sched/fair: Add cgroup_mode: tasks
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
                   ` (3 preceding siblings ...)
  2026-06-05 12:40 ` [PATCH v3 4/7] sched/fair: Add cgroup_mode: concur Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 6/7] sched/fair: Change the default cgroup_mode to concur Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Since we are exploring this space; include a scheme that scales by total number
of runnable tasks. This results in:

	F_g_n' = M * F_g_n

This will obviously have avg(F_g_n') > 1 (it will be ~M/N in fact).

And while that sounds odd, it actually has a fairly straightforward meaning for
"cpu.weight": average weight per member task.

This is an entirely valid and workable option; it is, however, wildly different
from the traditional meaning.

Included for completeness (and curiosity).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    1 +
 kernel/sched/fair.c  |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -643,6 +643,7 @@ static const char *cgroup_mode_str[] = {
 	"smp",
 	"concur",
 	"max",
+	"tasks",
 };
 
 static int sched_cgroup_mode(const char *str)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4852,6 +4852,19 @@ static inline int tg_tasks(struct task_g
 }
 
 /*
+ * Func: fraction(nr_tasks * tg->shares)
+ *
+ * Scale tg->shares by the number of tasks.
+ */
+static long calc_tasks_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	int nr = tg_tasks(tg);
+	long tg_shares = READ_ONCE(tg->shares);
+	return __calc_smp_shares(cfs_rq, nr * tg_shares, nr * tg_shares);
+}
+
+/*
  * Func: min(fraction(nr_cpus * tg->shares), nice -20)
  *
  * Scale tg->shares by the maximal number of CPUs; but clip the max shares at
@@ -4921,6 +4934,9 @@ void __sched_cgroup_mode_update(int mode
 	case 3:
 		func = &calc_max_shares;
 		break;
+	case 4:
+		func = &calc_tasks_shares;
+		break;
 	}
 	static_call_update(calc_group_shares, func);
 }




* [PATCH v3 6/7] sched/fair: Change the default cgroup_mode to concur
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
                   ` (4 preceding siblings ...)
  2026-06-05 12:40 ` [PATCH v3 5/7] sched/fair: Add cgroup_mode: tasks Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-05 12:40 ` [PATCH v3 7/7] sched/eevdf: Move to a single runqueue Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

For all the reasons described in the preceding patches, the way cgroup
weight is computed is problematic. However, changing it is bound to
also lead to trouble. Esp. since people might have taken to inflating
the weight value where they can.

Since things are configurable, change the default and hope this serves
more people than it hurts, esp. in the longer run.

Specifically, this prepares for a flattened runqueue, where the hierarchical
weight becomes far more important (F_g^d terms), so getting rid of small F_g is
imperative.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    2 +-
 kernel/sched/fair.c  |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -635,7 +635,7 @@ static void debugfs_fair_server_init(voi
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static int cgroup_mode = 1;
+static int cgroup_mode = 2;
 
 /* See __sched_cgroup_mode_update(). */
 static const char *cgroup_mode_str[] = {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4921,7 +4921,7 @@ static long calc_up_shares(struct cfs_rq
 	return READ_ONCE(tg->shares);
 }
 
-DEFINE_STATIC_CALL(calc_group_shares, calc_smp_shares);
+DEFINE_STATIC_CALL(calc_group_shares, calc_concur_shares);
 
 void __sched_cgroup_mode_update(int mode)
 {
@@ -4931,10 +4931,10 @@ void __sched_cgroup_mode_update(int mode
 		func = &calc_up_shares;
 		break;
 	case 1:
-	default:
 		func = &calc_smp_shares;
 		break;
 	case 2:
+	default:
 		func = &calc_concur_shares;
 		break;
 	case 3:




* [PATCH v3 7/7] sched/eevdf: Move to a single runqueue
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
                   ` (5 preceding siblings ...)
  2026-06-05 12:40 ` [PATCH v3 6/7] sched/fair: Change the default cgroup_mode to concur Peter Zijlstra
@ 2026-06-05 12:40 ` Peter Zijlstra
  2026-06-09  5:37 ` [PATCH v3 0/7] sched: Flatten the pick K Prateek Nayak
  2026-06-12  2:29 ` Shubhang Kaushik
  8 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 12:40 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Change fair/cgroup to a single runqueue.

Infamously fair/cgroup isn't working for a number of people; typically
the complaint is latencies and/or overhead. The latency issue is due
to the intermediate entries that represent a combination of tasks and
thereby obfuscate the runnability of tasks.

The approach here is to leave the cgroup hierarchy as is; including
the intermediate enqueue/dequeue but move the actual EEVDF runqueue
outside. This means things like the shares_weight approximation are
fully preserved.

That is, given a hierarchy like:

          R
          |
          se--G1
              / \
        G2--se   se--G3
       / \           |
  T1--se se--T2      se--T3

This is fully maintained for load tracking, however the EEVDF parts of
cfs_rq/se go unused for the intermediates and are instead connected
like:

     _R_
    / | \
   T1 T2 T3

Since the effective weight of the entities is determined by the
hierarchy, this gets recomputed on enqueue, set_next_task() and tick.

Notably, the effective weight (se->h_load) is computed from the
hierarchical fraction: se->load / cfs_rq->load.
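
One plausible reading of that computation, as a user-space sketch: multiply
the weight of the top-level group entity by the fraction se->load /
cfs_rq->load at each level below it. struct entity and h_load() here are
hypothetical models and not the kernel's structures or the patch's code.

```c
#include <stddef.h>

/* Hypothetical model of a scheduling entity; not the kernel's struct. */
struct entity {
	long weight;			/* own weight (se->load) */
	long queue_weight;		/* weight of its runqueue (cfs_rq->load) */
	const struct entity *parent;	/* group entity above, NULL at the top */
};

/* Effective (flat) weight: the parent's effective weight times this level's fraction. */
static long h_load(const struct entity *se)
{
	if (!se->parent)
		return se->weight;

	return h_load(se->parent) * se->weight / se->queue_weight;
}
```

For the hierarchy above with every weight at 1024, this gives T1 and T2 256
each and T3 512, which together add up to G1's 1024.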

Since EEVDF is now exclusively operating on rq->cfs, it needs to
consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
only tasks can get delayed, simplifying some of the cgroup cleanup.

One place where additional information was required was
set_next_task() / put_prev_task(), where we need to track 'current'
both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
(cfs_rq->curr).

As a result of only having a single level to pick from, many of the
complications in pick_next_task() and preemption go away.

Since many of the hierarchical operations are still there, this won't
immediately fix the performance issues, but hopefully it will fix some
of the latency issues.

TODO: split struct cfs_rq / struct sched_entity
TODO: try and get rid of h_curr

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    5 
 kernel/sched/debug.c  |    9 
 kernel/sched/fair.c   |  803 +++++++++++++++++++++-----------------------------
 kernel/sched/pelt.c   |    6 
 kernel/sched/sched.h  |   26 -
 6 files changed, 375 insertions(+), 475 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -574,6 +574,7 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
+	struct load_weight		h_load;
 	struct rb_node			run_node;
 	u64				deadline;
 	u64				min_vruntime;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5652,11 +5652,8 @@ EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
  */
 static inline void prefetch_curr_exec_start(struct task_struct *p)
 {
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	struct sched_entity *curr = p->se.cfs_rq->curr;
-#else
 	struct sched_entity *curr = task_rq(p)->cfs.curr;
-#endif
+
 	prefetch(curr);
 	prefetch(&curr->exec_start);
 }
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -975,10 +975,11 @@ print_task(struct seq_file *m, struct rq
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
+	SEQ_printf(m, " %15s %5d %10ld %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
 		p->comm, task_pid_nr(p),
+		p->se.h_load.weight,
 		SPLIT_NS(p->se.vruntime),
-		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		entity_eligible(&rq->cfs, &p->se) ? 'E' : 'N',
 		SPLIT_NS(p->se.deadline),
 		p->se.custom_slice ? 'S' : ' ',
 		SPLIT_NS(p->se.slice),
@@ -1007,7 +1008,7 @@ static void print_rq(struct seq_file *m,
 
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, "runnable tasks:\n");
-	SEQ_printf(m, " S            task   PID       vruntime   eligible    "
+	SEQ_printf(m, " S            task   PID     weight       vruntime   eligible    "
 		   "deadline             slice          sum-exec      switches  "
 		   "prio         wait-time        sum-sleep       sum-block"
 #ifdef CONFIG_NUMA_BALANCING
@@ -1115,6 +1116,8 @@ void print_cfs_rq(struct seq_file *m, in
 			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
 			atomic_long_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
+			cfs_rq->h_load);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_CFS_BANDWIDTH
 	SEQ_printf(m, "  .%-30s: %d\n", "throttled",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -297,8 +297,8 @@ static u64 __calc_delta(u64 delta_exec,
  */
 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 {
-	if (unlikely(se->load.weight != NICE_0_LOAD))
-		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+	if (se->h_load.weight != NICE_0_LOAD)
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
 
 	return delta;
 }
@@ -428,38 +428,6 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
-static void
-find_matching_se(struct sched_entity **se, struct sched_entity **pse)
-{
-	int se_depth, pse_depth;
-
-	/*
-	 * preemption test can be made between sibling entities who are in the
-	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
-	 * both tasks until we find their ancestors who are siblings of common
-	 * parent.
-	 */
-
-	/* First walk up until both entities are at same depth */
-	se_depth = (*se)->depth;
-	pse_depth = (*pse)->depth;
-
-	while (se_depth > pse_depth) {
-		se_depth--;
-		*se = parent_entity(*se);
-	}
-
-	while (pse_depth > se_depth) {
-		pse_depth--;
-		*pse = parent_entity(*pse);
-	}
-
-	while (!is_same_group(*se, *pse)) {
-		*se = parent_entity(*se);
-		*pse = parent_entity(*pse);
-	}
-}
-
 static int tg_is_idle(struct task_group *tg)
 {
 	return tg->idle > 0;
@@ -503,11 +471,6 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
-static inline void
-find_matching_se(struct sched_entity **se, struct sched_entity **pse)
-{
-}
-
 static inline int tg_is_idle(struct task_group *tg)
 {
 	return 0;
@@ -686,7 +649,7 @@ static inline unsigned long avg_vruntime
 static inline void
 __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 	s64 w_vruntime, key = entity_key(cfs_rq, se);
 
 	w_vruntime = key * weight;
@@ -703,7 +666,7 @@ sum_w_vruntime_add_paranoid(struct cfs_r
 	s64 key, tmp;
 
 again:
-	weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 	key = entity_key(cfs_rq, se);
 
 	if (check_mul_overflow(key, weight, &key))
@@ -749,7 +712,7 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
 static void
 sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->sum_w_vruntime -= key * weight;
@@ -791,7 +754,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		s64 runtime = cfs_rq->sum_w_vruntime;
 
 		if (curr) {
-			unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
+			unsigned long w = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
 
 			runtime += entity_key(cfs_rq, curr) * w;
 			weight += w;
@@ -862,8 +825,6 @@ bool update_entity_lag(struct cfs_rq *cf
 	u64 avruntime = avg_vruntime(cfs_rq);
 	s64 vlag = entity_lag(cfs_rq, se, avruntime);
 
-	WARN_ON_ONCE(!se->on_rq);
-
 	if (se->sched_delayed) {
 		/* previous vlag < 0 otherwise se would not be delayed */
 		vlag = max(vlag, se->vlag);
@@ -899,7 +860,7 @@ static int vruntime_eligible(struct cfs_
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
-		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
+		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
 
 		avg += entity_key(cfs_rq, curr) * weight;
 		load += weight;
@@ -1040,6 +1001,9 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
+	WARN_ON_ONCE(!entity_is_task(se));
+
 	sum_w_vruntime_add(cfs_rq, se);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
@@ -1049,6 +1013,9 @@ static void __enqueue_entity(struct cfs_
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
+	WARN_ON_ONCE(!entity_is_task(se));
+
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	sum_w_vruntime_sub(cfs_rq, se);
@@ -1145,7 +1112,7 @@ static struct sched_entity *pick_eevdf(s
 	 * We can safely skip eligibility check if there is only one entity
 	 * in this cfs_rq, saving some cycles.
 	 */
-	if (cfs_rq->nr_queued == 1)
+	if (cfs_rq->h_nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
 	/*
@@ -1395,8 +1362,6 @@ static s64 update_se(struct rq *rq, stru
 	return delta_exec;
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 #ifdef CONFIG_SCHED_CACHE
 
 /*
@@ -1991,7 +1956,7 @@ static void update_curr(struct cfs_rq *c
 	 * not necessarily be the actual task running
 	 * (rq->curr.se). This is easy to confuse!
 	 */
-	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *curr = cfs_rq->h_curr;
 	struct rq *rq = rq_of(cfs_rq);
 	s64 delta_exec;
 	bool resched;
@@ -2003,26 +1968,29 @@ static void update_curr(struct cfs_rq *c
 	if (unlikely(delta_exec <= 0))
 		return;
 
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+	if (!entity_is_task(curr))
+		return;
+
+	cfs_rq = &rq->cfs;
+
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
 
-	if (entity_is_task(curr)) {
-		/*
-		 * If the fair_server is active, we need to account for the
-		 * fair_server time whether or not the task is running on
-		 * behalf of fair_server or not:
-		 *  - If the task is running on behalf of fair_server, we need
-		 *    to limit its time based on the assigned runtime.
-		 *  - Fair task that runs outside of fair_server should account
-		 *    against fair_server such that it can account for this time
-		 *    and possibly avoid running this period.
-		 */
-		dl_server_update(&rq->fair_server, delta_exec);
-	}
-
-	account_cfs_rq_runtime(cfs_rq, delta_exec);
+	/*
+	 * If the fair_server is active, we need to account for the
+	 * fair_server time whether or not the task is running on
+	 * behalf of fair_server or not:
+	 *  - If the task is running on behalf of fair_server, we need
+	 *    to limit its time based on the assigned runtime.
+	 *  - Fair task that runs outside of fair_server should account
+	 *    against fair_server such that it can account for this time
+	 *    and possibly avoid running this period.
+	 */
+	dl_server_update(&rq->fair_server, delta_exec);
 
-	if (cfs_rq->nr_queued == 1)
+	if (cfs_rq->h_nr_queued == 1)
 		return;
 
 	if (resched || !protect_slice(curr)) {
@@ -2033,7 +2001,10 @@ static void update_curr(struct cfs_rq *c
 
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->donor->se));
+	struct sched_entity *se = &rq->donor->se;
+
+	for_each_sched_entity(se)
+		update_curr(cfs_rq_of(se));
 }
 
 static inline void
@@ -2109,7 +2080,7 @@ update_stats_enqueue_fair(struct cfs_rq
 	 * Are we enqueueing a waiting task? (for current tasks
 	 * a dequeue/enqueue event is a NOP)
 	 */
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->h_curr)
 		update_stats_wait_start_fair(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -2127,7 +2098,7 @@ update_stats_dequeue_fair(struct cfs_rq
 	 * Mark the end of the wait period if dequeueing a
 	 * waiting task:
 	 */
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->h_curr)
 		update_stats_wait_end_fair(cfs_rq, se);
 
 	if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
@@ -4468,6 +4439,7 @@ static inline void update_scan_period(st
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
 	update_load_add(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
 		struct task_struct *p = task_of(se);
@@ -4483,6 +4455,7 @@ account_entity_enqueue(struct cfs_rq *cf
 static void
 account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
 		struct task_struct *p = task_of(se);
@@ -4564,7 +4537,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
 static void
 rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
 {
-	unsigned long old_weight = se->load.weight;
+	long old_weight = se->h_load.weight;
 
 	/*
 	 * VRUNTIME
@@ -4664,16 +4637,17 @@ rescale_entity(struct sched_entity *se,
 		se->vprot = div64_long(se->vprot * old_weight, weight);
 }
 
-static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight)
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			   unsigned long weight, bool on_rq)
 {
 	bool curr = cfs_rq->curr == se;
 	bool rel_vprot = false;
 	u64 avruntime = 0;
 
-	if (se->on_rq) {
-		/* commit outstanding execution time */
-		update_curr(cfs_rq);
+	if (se->h_load.weight == weight)
+		return;
+
+	if (on_rq) {
 		avruntime = avg_vruntime(cfs_rq);
 		se->vlag = entity_lag(cfs_rq, se, avruntime);
 		se->deadline -= avruntime;
@@ -4683,46 +4657,90 @@ static void reweight_entity(struct cfs_r
 			rel_vprot = true;
 		}
 
-		cfs_rq->nr_queued--;
+		cfs_rq->h_nr_queued--;
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
-		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
-	dequeue_load_avg(cfs_rq, se);
 
 	rescale_entity(se, weight, rel_vprot);
 
-	update_load_set(&se->load, weight);
+	update_load_set(&se->h_load, weight);
 
-	do {
-		u32 divider = get_pelt_divider(&se->avg);
-		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-	} while (0);
-
-	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
+	if (on_rq) {
 		if (rel_vprot)
 			se->vprot += avruntime;
 		se->deadline += avruntime;
 		se->rel_deadline = 0;
 		se->vruntime = avruntime - se->vlag;
 
-		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-		cfs_rq->nr_queued++;
+		cfs_rq->h_nr_queued++;
 	}
 }
 
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			    unsigned long weight)
+{
+	if (se->load.weight == weight)
+		return;
+
+	if (se->on_rq) {
+		WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
+		update_load_sub(&cfs_rq->load, se->load.weight);
+	}
+	dequeue_load_avg(cfs_rq, se);
+
+	update_load_set(&se->load, weight);
+
+	do {
+		u32 divider = get_pelt_divider(&se->avg);
+		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
+	} while (0);
+
+	enqueue_load_avg(cfs_rq, se);
+
+	if (se->on_rq)
+		update_load_add(&cfs_rq->load, se->load.weight);
+}
+
+/*
+ * weight = NICE_0_LOAD;
+ * for_each_sched_entity(se)
+ *   weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
+ */
+static __always_inline
+unsigned long __calc_prop_weight(struct cfs_rq *cfs_rq, struct sched_entity *se,
+				 unsigned long weight)
+{
+	weight *= se->load.weight;
+	if (parent_entity(se))
+		weight /= cfs_rq->load.weight;
+	else
+		weight /= NICE_0_LOAD;
+
+	return max(weight, MIN_SHARES);
+}
+
 static void reweight_task_fair(struct rq *rq, struct task_struct *p,
 			       const struct load_weight *lw)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	struct load_weight *load = &se->load;
+	unsigned long weight = NICE_0_LOAD;
 
-	reweight_entity(cfs_rq, se, lw->weight);
-	load->inv_weight = lw->inv_weight;
+	if (se->on_rq)
+		update_curr_fair(rq);
+
+	reweight_entity(cfs_rq_of(se), se, lw->weight);
+	se->load.inv_weight = lw->inv_weight;
+
+	if (!se->on_rq)
+		return;
+
+	for_each_sched_entity(se)
+		weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
+
+	reweight_eevdf(&rq->cfs, &p->se, weight, p->se.on_rq);
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -4958,8 +4976,7 @@ static void update_cfs_group(struct sche
 		return;
 
 	shares = static_call(calc_group_shares)(gcfs_rq);
-	if (unlikely(se->load.weight != shares))
-		reweight_entity(cfs_rq_of(se), se, shares);
+	reweight_entity(cfs_rq_of(se), se, shares);
 }
 
 #else /* !CONFIG_FAIR_GROUP_SCHED: */
@@ -5077,7 +5094,7 @@ static inline bool cfs_rq_is_decayed(str
  * differential update where we store the last value we propagated. This in
  * turn allows skipping updates if the differential is 'small'.
  *
- * Updating tg's load_avg is necessary before update_cfs_share().
+ * Updating tg's load_avg is necessary before update_cfs_group().
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
@@ -5544,7 +5561,7 @@ static void migrate_se_pelt_lag(struct s
  * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
  * avg. The immediate corollary is that all (fair) tasks must be attached.
  *
- * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
  *
  * Return: true if the load decayed or we removed load.
  *
@@ -6082,6 +6099,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	unsigned int nr_queued = cfs_rq->h_nr_queued;
 	bool update_zero = false;
 	s64 lag = 0;
 
@@ -6089,6 +6107,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		se->slice = sysctl_sched_base_slice;
 	vslice = calc_delta_fair(se->slice, se);
 
+	if (flags & ENQUEUE_QUEUED)
+		nr_queued -= 1;
+
 	/*
 	 * Due to how V is constructed as the weighted average of entities,
 	 * adding tasks with positive lag, or removing tasks with negative lag
@@ -6097,7 +6118,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
+	if (sched_feat(PLACE_LAG) && nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		long load, weight;
 
@@ -6157,9 +6178,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 */
 		load = cfs_rq->sum_weight;
 		if (curr && curr->on_rq)
-			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
+			load += avg_vruntime_weight(cfs_rq, curr->h_load.weight);
 
-		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
@@ -6218,22 +6239,8 @@ static void check_enqueue_throttle(struc
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 
 static void
-requeue_delayed_entity(struct sched_entity *se);
-
-static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool curr = cfs_rq->curr == se;
-
-	/*
-	 * If we're the current task, we must renormalise before calling
-	 * update_curr().
-	 */
-	if (curr)
-		place_entity(cfs_rq, se, flags);
-
-	update_curr(cfs_rq);
-
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
@@ -6252,13 +6259,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_cfs_group(se);
 
-	/*
-	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
-	 * we can place the entity.
-	 */
-	if (!curr)
-		place_entity(cfs_rq, se, flags);
-
 	account_entity_enqueue(cfs_rq, se);
 
 	/* Entity has migrated, no longer consider this task hot */
@@ -6267,8 +6267,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	if (!curr)
-		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
@@ -6286,21 +6284,19 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	}
 }
 
-static void __clear_buddies_next(struct sched_entity *se)
+static void set_next_buddy(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->next != se)
-			break;
-
-		cfs_rq->next = NULL;
-	}
+	if (WARN_ON_ONCE(!se->on_rq || se->sched_delayed))
+		return;
+	if (se_is_idle(se))
+		return;
+	cfs_rq->next = se;
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->next == se)
-		__clear_buddies_next(se);
+		cfs_rq->next = NULL;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -6311,7 +6307,7 @@ static void set_delayed(struct sched_ent
 
 	/*
 	 * Delayed se of cfs_rq have no tasks queued on them.
-	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * Do not adjust h_nr_runnable since __dequeue_task()
 	 * will account it for blocked tasks.
 	 */
 	if (!entity_is_task(se))
@@ -6344,45 +6340,16 @@ static void clear_delayed(struct sched_e
 	}
 }
 
-static bool
+static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool sleep = flags & DEQUEUE_SLEEP;
-	int action = 0;
-
-	update_curr(cfs_rq);
-	clear_buddies(cfs_rq, se);
-
-	if (flags & DEQUEUE_DELAYED) {
-		WARN_ON_ONCE(!se->sched_delayed);
-	} else {
-		bool delay = sleep;
-		/*
-		 * DELAY_DEQUEUE relies on spurious wakeups, special task
-		 * states must not suffer spurious wakeups, excempt them.
-		 */
-		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
-			delay = false;
-
-		WARN_ON_ONCE(delay && se->sched_delayed);
+	int action = UPDATE_TG;
 
-		if (sched_feat(DELAY_DEQUEUE) && delay &&
-		    !entity_eligible(cfs_rq, se)) {
-			if (entity_is_task(se))
-				action |= UPDATE_UTIL_EST;
-			update_load_avg(cfs_rq, se, action);
-			update_entity_lag(cfs_rq, se);
-			set_delayed(se);
-			return false;
-		}
-	}
-
-	action = UPDATE_TG;
 	if (entity_is_task(se)) {
 		if (task_on_rq_migrating(task_of(se)))
 			action |= DO_DETACH;
 
-		if (sleep && !(flags & DEQUEUE_DELAYED))
+		if ((flags & DEQUEUE_SLEEP) && !(flags & DEQUEUE_DELAYED))
 			action |= UPDATE_UTIL_EST;
 	}
 
@@ -6400,14 +6367,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
 
-	update_entity_lag(cfs_rq, se);
-	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
-		se->deadline -= se->vruntime;
-		se->rel_deadline = 1;
-	}
-
-	if (se != cfs_rq->curr)
-		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
@@ -6416,9 +6375,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	update_cfs_group(se);
 
-	if (flags & DEQUEUE_DELAYED)
-		clear_delayed(se);
-
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -6431,15 +6387,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 		}
 #endif
 	}
-
-	return true;
 }
 
 static void
-set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	clear_buddies(cfs_rq, se);
-
 	/* 'current' is not kept within the tree. */
 	if (se->on_rq) {
 		/*
@@ -6448,16 +6400,12 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 		 * runqueue.
 		 */
 		update_stats_wait_end_fair(cfs_rq, se);
-		__dequeue_entity(cfs_rq, se);
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-
-		if (first)
-			set_protect_slice(cfs_rq, se);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
-	WARN_ON_ONCE(cfs_rq->curr);
-	cfs_rq->curr = se;
+	WARN_ON_ONCE(cfs_rq->h_curr);
+	cfs_rq->h_curr = se;
 
 	/*
 	 * Track our maximum slice length, if the CPU's load is at
@@ -6477,23 +6425,17 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
+static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags);
 
-/*
- * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
- */
 static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool protect)
+pick_next_entity(struct rq *rq, bool protect)
 {
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
 
 	se = pick_eevdf(cfs_rq, protect);
 	if (se->sched_delayed) {
-		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+		__dequeue_task(rq, task_of(se), DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		/*
 		 * Must not reference @se again, see __block_task().
 		 */
@@ -6513,13 +6455,11 @@ static void put_prev_entity(struct cfs_r
 
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
-		/* Put 'current' back into the tree. */
-		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
 	}
-	WARN_ON_ONCE(cfs_rq->curr != prev);
-	cfs_rq->curr = NULL;
+	WARN_ON_ONCE(cfs_rq->h_curr != prev);
+	cfs_rq->h_curr = NULL;
 }
 
 static void
@@ -7074,7 +7014,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
-	if (rq->curr == rq->idle && rq->cfs.nr_queued)
+	if (rq->curr == rq->idle && rq->cfs.h_nr_queued)
 		resched_curr(rq);
 }
 
@@ -7409,7 +7349,7 @@ static void check_enqueue_throttle(struc
 		return;
 
 	/* an active group must be handled by the update_curr() path */
-	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+	if (!cfs_rq->runtime_enabled || cfs_rq->h_curr)
 		return;
 
 	/* ensure the group is not already throttled */
@@ -7781,7 +7721,7 @@ static void hrtick_start_fair(struct rq
 			resched_curr(rq);
 		return;
 	}
-	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+	delta = (se->h_load.weight * vdelta) / NICE_0_LOAD;
 
 	/*
 	 * Correct for instantaneous load of other classes.
@@ -7881,10 +7821,8 @@ static int choose_idle_cpu(int cpu, stru
 }
 
 static void
-requeue_delayed_entity(struct sched_entity *se)
+requeue_delayed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
 	/*
 	 * se->sched_delayed should imply: se->on_rq == 1.
 	 * Because a delayed entity is one that is still on
@@ -7894,19 +7832,58 @@ requeue_delayed_entity(struct sched_enti
 	WARN_ON_ONCE(!se->on_rq);
 
 	if (update_entity_lag(cfs_rq, se)) {
-		cfs_rq->nr_queued--;
+		cfs_rq->h_nr_queued--;
 		if (se != cfs_rq->curr)
 			__dequeue_entity(cfs_rq, se);
 		place_entity(cfs_rq, se, 0);
 		if (se != cfs_rq->curr)
 			__enqueue_entity(cfs_rq, se);
-		cfs_rq->nr_queued++;
+		cfs_rq->h_nr_queued++;
 	}
 
 	update_load_avg(cfs_rq, se, 0);
 	clear_delayed(se);
 }
 
+static unsigned long enqueue_hierarchy(struct task_struct *p, int flags)
+{
+	unsigned long weight = NICE_0_LOAD;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
+	struct sched_entity *se = &p->se;
+	int h_nr_idle = task_has_idle_policy(p);
+	int h_nr_runnable = 1;
+
+	if (task_new && se->sched_delayed)
+		h_nr_runnable = 0;
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		update_curr(cfs_rq);
+
+		if (!se->on_rq) {
+			enqueue_entity(cfs_rq, se, flags);
+		} else {
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			se_update_runnable(se);
+			update_cfs_group(se);
+		}
+
+		cfs_rq->h_nr_runnable += h_nr_runnable;
+		cfs_rq->h_nr_queued++;
+		cfs_rq->h_nr_idle += h_nr_idle;
+
+		if (cfs_rq_is_idle(cfs_rq))
+			h_nr_idle = 1;
+
+		weight = __calc_prop_weight(cfs_rq, se, weight);
+
+		flags = ENQUEUE_WAKEUP;
+	}
+
+	return weight;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -7915,13 +7892,12 @@ requeue_delayed_entity(struct sched_enti
 static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
-	struct sched_entity *se = &p->se;
-	int h_nr_idle = task_has_idle_policy(p);
-	int h_nr_runnable = 1;
-	int task_new = !(flags & ENQUEUE_WAKEUP);
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
-	u64 slice = 0;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	unsigned long weight;
+	bool curr;
 
 	if (task_is_throttled(p) && enqueue_throttled_task(p))
 		return;
@@ -7933,10 +7909,10 @@ enqueue_task_fair(struct rq *rq, struct
 	 * estimated utilization, before we update schedutil.
 	 */
 	if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
-		util_est_enqueue(&rq->cfs, p);
+		util_est_enqueue(cfs_rq, p);
 
 	if (flags & ENQUEUE_DELAYED) {
-		requeue_delayed_entity(se);
+		requeue_delayed_entity(cfs_rq, se);
 		return;
 	}
 
@@ -7948,57 +7924,22 @@ enqueue_task_fair(struct rq *rq, struct
 	if (p->in_iowait)
 		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
 
-	if (task_new && se->sched_delayed)
-		h_nr_runnable = 0;
-
-	for_each_sched_entity(se) {
-		if (se->on_rq) {
-			if (se->sched_delayed)
-				requeue_delayed_entity(se);
-			break;
-		}
-		cfs_rq = cfs_rq_of(se);
-
-		/*
-		 * Basically set the slice of group entries to the min_slice of
-		 * their respective cfs_rq. This ensures the group can service
-		 * its entities in the desired time-frame.
-		 */
-		if (slice) {
-			se->slice = slice;
-			se->custom_slice = 1;
-		}
-		enqueue_entity(cfs_rq, se, flags);
-		slice = cfs_rq_min_slice(cfs_rq);
-
-		cfs_rq->h_nr_runnable += h_nr_runnable;
-		cfs_rq->h_nr_queued++;
-		cfs_rq->h_nr_idle += h_nr_idle;
-
-		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = 1;
-
-		flags = ENQUEUE_WAKEUP;
-	}
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+	/*
+	 * XXX comment on the curr thing
+	 */
+	curr = (cfs_rq->curr == se);
+	if (curr)
+		place_entity(cfs_rq, se, flags);
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-		update_cfs_group(se);
+	if (se->on_rq && se->sched_delayed)
+		requeue_delayed_entity(cfs_rq, se);
 
-		se->slice = slice;
-		if (se != cfs_rq->curr)
-			min_vruntime_cb_propagate(&se->run_node, NULL);
-		slice = cfs_rq_min_slice(cfs_rq);
+	weight = enqueue_hierarchy(p, flags);
 
-		cfs_rq->h_nr_runnable += h_nr_runnable;
-		cfs_rq->h_nr_queued++;
-		cfs_rq->h_nr_idle += h_nr_idle;
-
-		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = 1;
+	if (!curr) {
+		reweight_eevdf(cfs_rq, se, weight, false);
+		place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);
+		__enqueue_entity(cfs_rq, se);
 	}
 
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
@@ -8029,105 +7970,109 @@ enqueue_task_fair(struct rq *rq, struct
 	hrtick_update(rq);
 }
 
-/*
- * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
- * failing half-way through and resume the dequeue later.
- *
- * Returns:
- * -1 - dequeue delayed
- *  0 - dequeue throttled
- *  1 - dequeue complete
- */
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
+static void dequeue_hierarchy(struct task_struct *p, int flags)
 {
-	bool was_sched_idle = sched_idle_rq(rq);
+	struct sched_entity *se = &p->se;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
 	bool task_throttled = flags & DEQUEUE_THROTTLE;
-	struct task_struct *p = NULL;
-	int h_nr_idle = 0;
-	int h_nr_queued = 0;
 	int h_nr_runnable = 0;
-	struct cfs_rq *cfs_rq;
-	u64 slice = 0;
+	int h_nr_idle = task_has_idle_policy(p);
+	bool dequeue = true;
 
-	if (entity_is_task(se)) {
-		p = task_of(se);
-		h_nr_queued = 1;
-		h_nr_idle = task_has_idle_policy(p);
-		if (task_sleep || task_delayed || !se->sched_delayed)
-			h_nr_runnable = 1;
-	}
+	if (task_sleep || task_delayed || !se->sched_delayed)
+		h_nr_runnable = 1;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		if (!dequeue_entity(cfs_rq, se, flags)) {
-			if (p && &p->se == se)
-				return -1;
+		update_curr(cfs_rq);
 
-			slice = cfs_rq_min_slice(cfs_rq);
-			break;
+		if (dequeue) {
+			dequeue_entity(cfs_rq, se, flags);
+			/* Don't dequeue parent if it has other entities besides us */
+			if (cfs_rq->load.weight)
+				dequeue = false;
+		} else {
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			se_update_runnable(se);
+			update_cfs_group(se);
 		}
 
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
-		cfs_rq->h_nr_queued -= h_nr_queued;
+		cfs_rq->h_nr_queued--;
 		cfs_rq->h_nr_idle -= h_nr_idle;
 
 		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = h_nr_queued;
+			h_nr_idle = 1;
 
 		if (throttled_hierarchy(cfs_rq) && task_throttled)
 			record_throttle_clock(cfs_rq);
 
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight) {
-			slice = cfs_rq_min_slice(cfs_rq);
-
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			/*
-			 * Bias pick_next to pick a task from this cfs_rq, as
-			 * p is sleeping when it is within its sched_slice.
-			 */
-			if (task_sleep && se)
-				set_next_buddy(se);
-			break;
-		}
 		flags |= DEQUEUE_SLEEP;
 		flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
 	}
+}
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+/*
+ * The part of dequeue_task_fair() that is needed to dequeue delayed tasks.
+ *
+ * Returns:
+ *   true  - dequeued
+ *   false - delayed
+ */
+static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	bool was_sched_idle = sched_idle_rq(rq);
+	bool task_sleep = flags & DEQUEUE_SLEEP;
+	bool task_delayed = flags & DEQUEUE_DELAYED;
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-		update_cfs_group(se);
+	clear_buddies(cfs_rq, se);
 
-		se->slice = slice;
-		if (se != cfs_rq->curr)
-			min_vruntime_cb_propagate(&se->run_node, NULL);
-		slice = cfs_rq_min_slice(cfs_rq);
+	update_curr(cfs_rq_of(se));
+	update_entity_lag(cfs_rq, se);
 
-		cfs_rq->h_nr_runnable -= h_nr_runnable;
-		cfs_rq->h_nr_queued -= h_nr_queued;
-		cfs_rq->h_nr_idle -= h_nr_idle;
+	if (flags & DEQUEUE_DELAYED) {
+		WARN_ON_ONCE(!se->sched_delayed);
+	} else {
+		bool delay = task_sleep;
+		/*
+		 * DELAY_DEQUEUE relies on spurious wakeups, special task
+		 * states must not suffer spurious wakeups, exempt them.
+		 */
+		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
+			delay = false;
 
-		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = h_nr_queued;
+		WARN_ON_ONCE(delay && se->sched_delayed);
 
-		if (throttled_hierarchy(cfs_rq) && task_throttled)
-			record_throttle_clock(cfs_rq);
+		if (sched_feat(DELAY_DEQUEUE) && delay &&
+		    !entity_eligible(cfs_rq, se)) {
+			update_load_avg(cfs_rq_of(se), se, UPDATE_UTIL_EST);
+			set_delayed(se);
+			return false;
+		}
 	}
 
-	sub_nr_running(rq, h_nr_queued);
+	dequeue_hierarchy(p, flags);
+
+	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
+		se->deadline -= se->vruntime;
+		se->rel_deadline = 1;
+	}
+	if (se != cfs_rq->curr)
+		__dequeue_entity(cfs_rq, se);
+
+	sub_nr_running(rq, 1);
 
 	/* balance early to pull high priority tasks */
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
-	if (p && task_delayed) {
+	if (task_delayed) {
+		clear_delayed(se);
+
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
 
@@ -8139,7 +8084,7 @@ static int dequeue_entities(struct rq *r
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return true;
 }
 
 /*
@@ -8157,11 +8102,11 @@ static bool dequeue_task_fair(struct rq
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);
 
-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (!__dequeue_task(rq, p, flags))
 		return false;
 
 	/*
-	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
+	 * Must not reference @p after __dequeue_task(DEQUEUE_DELAYED).
 	 */
 	return true;
 }
@@ -9749,19 +9694,6 @@ static void migrate_task_rq_fair(struct
 static void task_dead_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-
-	if (se->sched_delayed) {
-		struct rq_flags rf;
-		struct rq *rq;
-
-		rq = task_rq_lock(p, &rf);
-		if (se->sched_delayed) {
-			update_rq_clock(rq);
-			dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-		}
-		task_rq_unlock(rq, p, &rf);
-	}
-
 	remove_entity_load_avg(se);
 }
 
@@ -9795,21 +9727,10 @@ static void set_cpus_allowed_fair(struct
 	set_task_max_allowed_capacity(p);
 }
 
-static void set_next_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		if (WARN_ON_ONCE(!se->on_rq))
-			return;
-		if (se_is_idle(se))
-			return;
-		cfs_rq_of(se)->next = se;
-	}
-}
-
 enum preempt_wakeup_action {
 	PREEMPT_WAKEUP_NONE,	/* No preemption. */
 	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
-	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
+	PREEMPT_WAKEUP_PICK,	/* Let pick_eevdf() decide. */
 	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
 };
 
@@ -9826,7 +9747,7 @@ set_preempt_buddy(struct cfs_rq *cfs_rq,
 	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
 		return false;
 
-	set_next_buddy(pse);
+	set_next_buddy(cfs_rq, pse);
 	return true;
 }
 
@@ -9879,7 +9800,7 @@ static void wakeup_preempt_fair(struct r
 	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
-	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	int cse_is_idle, pse_is_idle;
 
 	/*
@@ -9916,7 +9837,6 @@ static void wakeup_preempt_fair(struct r
 	if (!sched_feat(WAKEUP_PREEMPTION))
 		return;
 
-	find_matching_se(&se, &pse);
 	WARN_ON_ONCE(!pse);
 
 	cse_is_idle = se_is_idle(se);
@@ -9944,8 +9864,7 @@ static void wakeup_preempt_fair(struct r
 	if (unlikely(!normal_policy(p->policy)))
 		return;
 
-	cfs_rq = cfs_rq_of(se);
-	update_curr(cfs_rq);
+	update_curr_fair(rq);
 	/*
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
@@ -9989,18 +9908,15 @@ static void wakeup_preempt_fair(struct r
 	}
 
 pick:
-	nse = pick_next_entity(rq, cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT);
-	/* If @p has become the most eligible task, force preemption */
-	if (nse == pse)
-		goto preempt;
-
-	/*
-	 * Because p is enqueued, nse being null can only mean that we
-	 * dequeued a delayed task. If there are still entities queued in
-	 * cfs, check if the next one will be p.
-	 */
-	if (!nse && cfs_rq->nr_queued)
-		goto pick;
+	if (cfs_rq->h_nr_queued) {
+		nse = pick_next_entity(rq, preempt_action != PREEMPT_WAKEUP_SHORT);
+		if (unlikely(!nse))
+			goto pick;
+
+		/* If @p has become the most eligible task, force preemption */
+		if (nse == pse)
+			goto preempt;
+	}
 
 	if (sched_feat(RUN_TO_PARITY))
 		update_protect_slice(cfs_rq, se);
@@ -10019,33 +9935,24 @@ static void wakeup_preempt_fair(struct r
 struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
-	struct cfs_rq *cfs_rq;
 	struct task_struct *p;
-	bool throttled;
 	int new_tasks;
 
 again:
-	cfs_rq = &rq->cfs;
-	if (!cfs_rq->nr_queued)
+	if (!cfs_rq->h_nr_queued)
 		goto idle;
 
-	throttled = false;
-
-	do {
-		/* Might not have done put_prev_entity() */
-		if (cfs_rq->curr && cfs_rq->curr->on_rq)
-			update_curr(cfs_rq);
+	/* Might not have done put_prev_entity() */
+	if (cfs_rq->curr && cfs_rq->curr->on_rq)
+		update_curr(cfs_rq);
 
-		se = pick_next_entity(rq, cfs_rq, true);
-		if (!se)
-			goto again;
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
+	se = pick_next_entity(rq, true);
+	if (!se)
+		goto again;
 
 	p = task_of(se);
-	if (unlikely(throttled))
-		task_throttle_setup_work(p);
 	return p;
 
 idle:
@@ -10079,7 +9986,7 @@ void fair_server_init(struct rq *rq)
 static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
 	struct sched_entity *se = &prev->se;
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *nse = NULL;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -10089,7 +9996,7 @@ static void put_prev_task_fair(struct rq
 
 	while (se) {
 		cfs_rq = cfs_rq_of(se);
-		if (!nse || cfs_rq->curr)
+		if (!nse || cfs_rq->h_curr)
 			put_prev_entity(cfs_rq, se);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		if (nse) {
@@ -10108,6 +10015,14 @@ static void put_prev_task_fair(struct rq
 #endif
 		se = parent_entity(se);
 	}
+
+	/* Put 'current' back into the tree. */
+	cfs_rq = &rq->cfs;
+	se = &prev->se;
+	WARN_ON_ONCE(cfs_rq->curr != se);
+	cfs_rq->curr = NULL;
+	if (se->on_rq)
+		__enqueue_entity(cfs_rq, se);
 }
 
 /*
@@ -10116,8 +10031,8 @@ static void put_prev_task_fair(struct rq
 static void yield_task_fair(struct rq *rq)
 {
 	struct task_struct *curr = rq->donor;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 
 	/*
 	 * Are we the only task in the tree?
@@ -10158,11 +10073,11 @@ static bool yield_to_task_fair(struct rq
 	struct sched_entity *se = &p->se;
 
 	/* !se->on_rq also covers throttled task */
-	if (!se->on_rq)
+	if (!se->on_rq || se->sched_delayed)
 		return false;
 
 	/* Tell the scheduler that we'd really like se to run next. */
-	set_next_buddy(se);
+	set_next_buddy(&task_rq(p)->cfs, se);
 
 	yield_task_fair(rq);
 
@@ -10501,15 +10416,10 @@ static inline long migrate_degrades_loca
  */
 static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
 {
-	struct cfs_rq *dst_cfs_rq;
+	struct cfs_rq *dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	dst_cfs_rq = tg_cfs_rq(task_group(p), dest_cpu);
-#else
-	dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
-#endif
-	if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
-	    !entity_eligible(task_cfs_rq(p), &p->se))
+	if (sched_feat(PLACE_LAG) && dst_cfs_rq->h_nr_queued &&
+	    !entity_eligible(&task_rq(p)->cfs, &p->se))
 		return 1;
 
 	return 0;
@@ -11292,7 +11202,7 @@ static void update_cfs_rq_h_load(struct
 	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
 		load = cfs_rq->h_load;
 		load = div64_ul(load * se->avg.load_avg,
-			cfs_rq_load_avg(cfs_rq) + 1);
+				cfs_rq_load_avg(cfs_rq) + 1);
 		cfs_rq = group_cfs_rq(se);
 		cfs_rq->h_load = load;
 		cfs_rq->last_h_load_update = now;
@@ -14684,7 +14594,7 @@ static inline void task_tick_core(struct
 	 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
 	 * if we need to give up the CPU.
 	 */
-	if (rq->core->core_forceidle_count && rq->cfs.nr_queued == 1 &&
+	if (rq->core->core_forceidle_count && rq->cfs.h_nr_queued == 1 &&
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
@@ -14893,30 +14803,8 @@ bool cfs_prio_less(const struct task_str
 
 	WARN_ON_ONCE(task_rq(b)->core != rq->core);
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	/*
-	 * Find an se in the hierarchy for tasks a and b, such that the se's
-	 * are immediate siblings.
-	 */
-	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
-		int sea_depth = sea->depth;
-		int seb_depth = seb->depth;
-
-		if (sea_depth >= seb_depth)
-			sea = parent_entity(sea);
-		if (sea_depth <= seb_depth)
-			seb = parent_entity(seb);
-	}
-
-	se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
-	se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
-
-	cfs_rqa = sea->cfs_rq;
-	cfs_rqb = seb->cfs_rq;
-#else /* !CONFIG_FAIR_GROUP_SCHED: */
 	cfs_rqa = &task_rq(a)->cfs;
 	cfs_rqb = &task_rq(b)->cfs;
-#endif /* !CONFIG_FAIR_GROUP_SCHED */
 
 	/*
 	 * Find delta after normalizing se's vruntime with its cfs_rq's
@@ -14955,11 +14843,20 @@ static inline void task_tick_core(struct
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct sched_entity *se = &curr->se;
-	struct cfs_rq *cfs_rq;
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		entity_tick(cfs_rq, se, queued);
+	if (se->on_rq) {
+		unsigned long weight = NICE_0_LOAD;
+		struct cfs_rq *cfs_rq;
+
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			entity_tick(cfs_rq, se, queued);
+
+			weight = __calc_prop_weight(cfs_rq, se, weight);
+		}
+
+		se = &curr->se;
+		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
 	}
 
 	if (queued)
@@ -14999,7 +14896,7 @@ prio_changed_fair(struct rq *rq, struct
 	if (p->prio == oldprio)
 		return;
 
-	if (rq->cfs.nr_queued == 1)
+	if (rq->cfs.h_nr_queued == 1)
 		return;
 
 	/*
@@ -15128,33 +15025,44 @@ static void switched_to_fair(struct rq *
 	}
 }
 
-/*
- * Account for a task changing its policy or group.
- *
- * This routine is mostly called to set cfs_rq->curr field when a task
- * migrates between groups/classes.
- */
 static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_entity *se = &p->se;
 	bool throttled = false;
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	unsigned long weight = NICE_0_LOAD;
+	bool on_rq = se->on_rq;
+
+	clear_buddies(cfs_rq, se);
+
+	if (on_rq)
+		__dequeue_entity(cfs_rq, se);
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
 
-		if (IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) &&
-		    first && cfs_rq->curr)
-			break;
+		if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
+		    !first || !cfs_rq->h_curr)
+			set_next_entity(cfs_rq, se);
 
-		set_next_entity(cfs_rq, se, first);
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
+
+		if (on_rq)
+			weight = __calc_prop_weight(cfs_rq, se, weight);
 	}
 
 	if (throttled)
 		task_throttle_setup_work(p);
 
 	se = &p->se;
+	cfs_rq->curr = se;
+
+	if (on_rq) {
+		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+		if (first)
+			set_protect_slice(cfs_rq, se);
+	}
 
 	if (task_on_rq_queued(p)) {
 		/*
@@ -15267,17 +15175,8 @@ void unregister_fair_sched_group(struct
 		struct sched_entity *se = tg_se(tg, cpu);
 		struct rq *rq = cpu_rq(cpu);
 
-		if (se) {
-			if (se->sched_delayed) {
-				guard(rq_lock_irqsave)(rq);
-				if (se->sched_delayed) {
-					update_rq_clock(rq);
-					dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-				}
-				list_del_leaf_cfs_rq(cfs_rq);
-			}
+		if (se)
 			remove_entity_load_avg(se);
-		}
 
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -206,7 +206,7 @@ ___update_load_sum(u64 now, struct sched
 	/*
 	 * running is a subset of runnable (weight) so running can't be set if
 	 * runnable is clear. But there are some corner cases where the current
-	 * se has been already dequeued but cfs_rq->curr still points to it.
+	 * se has been already dequeued but cfs_rq->h_curr still points to it.
 	 * This means that weight will be 0 but not running for a sched_entity
 	 * but also for a cfs_rq if the latter becomes idle. As an example,
 	 * this happens during sched_balance_newidle() which calls
@@ -307,7 +307,7 @@ int __update_load_avg_blocked_se(u64 now
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
-				cfs_rq->curr == se)) {
+				cfs_rq->h_curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
@@ -323,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, st
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
 				cfs_rq->h_nr_runnable,
-				cfs_rq->curr != NULL)) {
+				cfs_rq->h_curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -530,21 +530,8 @@ struct task_group {
 
 };
 
-#ifdef CONFIG_GROUP_SCHED_WEIGHT
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
-/*
- * A weight of 0 or 1 can cause arithmetics problems.
- * A weight of a cfs_rq is the sum of weights of which entities
- * are queued on this cfs_rq, so a weight of a entity should not be
- * too large, so as the shares value of a task group.
- * (The default weight is 1024 - so there's no practical
- *  limitation from this.)
- */
-#define MIN_SHARES		(1UL <<  1)
-#define MAX_SHARES		(1UL << 18)
-#endif
-
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 extern int walk_tg_tree_from(struct task_group *from,
@@ -631,6 +618,17 @@ static inline bool cfs_task_bw_constrain
 
 #endif /* !CONFIG_CGROUP_SCHED */
 
+/*
+ * A weight of 0 or 1 can cause arithmetics problems.
+ * A weight of a cfs_rq is the sum of weights of which entities
+ * are queued on this cfs_rq, so a weight of a entity should not be
+ * too large, so as the shares value of a task group.
+ * (The default weight is 1024 - so there's no practical
+ *  limitation from this.)
+ */
+#define MIN_SHARES		(1UL <<  1)
+#define MAX_SHARES		(1UL << 18)
+
 extern void unregister_rt_sched_group(struct task_group *tg);
 extern void free_rt_sched_group(struct task_group *tg);
 extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
@@ -709,6 +707,7 @@ struct cfs_rq {
 	/*
 	 * CFS load tracking
 	 */
+	struct sched_entity	*h_curr;
 	struct sched_avg	avg;
 #ifndef CONFIG_64BIT
 	u64			last_update_time_copy;
@@ -2575,6 +2574,7 @@ extern const u32		sched_prio_to_wmult[40
 #define ENQUEUE_MIGRATED	0x00040000
 #define ENQUEUE_INITIAL		0x00080000
 #define ENQUEUE_RQ_SELECTED	0x00100000
+#define ENQUEUE_QUEUED		0x00200000
 
 #define RETRY_TASK		((void *)-1UL)
 




* Re: [PATCH v3 2/7] sched/fair: Add cgroup_mode: up
  2026-06-05 12:40 ` [PATCH v3 2/7] sched/fair: Add cgroup_mode: up Peter Zijlstra
@ 2026-06-05 15:07   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-05 15:07 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

On Fri, Jun 05, 2026 at 02:40:15PM +0200, Peter Zijlstra wrote:
> Instead of calculating the proportional fraction of the group weight for each
> CPU, just give each CPU the full measure, ignoring these pesky SMP problems.
> 
> This makes the SMP cgroup fraction (F_g_n) equal to 1, and ensures a single
> task in a cgroup competes on equal footing to a task in a level above.
> 
> However, as already explored, this is not a very good policy because it gets
> the SMP weight distribution wrong. Included for completeness.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/debug.c |    5 ++++-
>  kernel/sched/fair.c  |   31 +++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |    1 +
>  3 files changed, 34 insertions(+), 3 deletions(-)
> 
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -271,6 +271,7 @@ static ssize_t sched_dynamic_write(struc
>  	if (mode < 0)
>  		return mode;
>  
> +	__sched_cgroup_mode_update(mode);
>  	sched_dynamic_update(mode);
>  
>  	*ppos += cnt;

Yeez, I'm not sure WTF happened here. Let me go fix that up.


* Re: [PATCH v3 0/7] sched: Flatten the pick
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
                   ` (6 preceding siblings ...)
  2026-06-05 12:40 ` [PATCH v3 7/7] sched/eevdf: Move to a single runqueue Peter Zijlstra
@ 2026-06-09  5:37 ` K Prateek Nayak
  2026-06-12  2:29 ` Shubhang Kaushik
  8 siblings, 0 replies; 16+ messages in thread
From: K Prateek Nayak @ 2026-06-09  5:37 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

Hello Peter,

On 6/5/2026 6:10 PM, Peter Zijlstra wrote:
> Can also be had:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat

Here are early performance numbers at commit 7eb85c33dd20 ("sched/eevdf:
Move to a single runqueue").

tl;dr

  Apart from a regression in tbench and schbench at super high
  utilization, all numbers for concur mode look good, so the 7.2-rc1
  target should be good.

  P.S. I won't be able to look into this until Thursday; Sorry in
  advance.

Note: I haven't gotten around to analyzing anything in depth; these are
just early numbers on microbenchmarks and DeathStarBench. Please take
them with a grain of salt.

tip was at commit f666241e6bd5 ("sched/fair: Unify cfs_rq throttling via
account_cfs_rq_runtime()") and each label is the individual cgroup
mode toggled on at queue:sched/flat:

Machine:

o 4th Generation EPYC system (Zen4c)
o 2 x 128C/256T (32LLCs)
o Boost enabled
o C2 disabled; MWAIT based C1 and POLL remained enabled

Benchmark numbers:


  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)            up[pct imp](CV)           smp[pct imp](CV)     max[pct imp](CV)        concur[pct imp](CV)      tasks[pct imp](CV)
   1-groups     1.00 [ -0.00](11.18)     1.01 [ -0.60](12.84)     1.12 [-12.08](12.50)     1.16 [-15.71]( 9.46)     1.07 [ -7.25](11.46)     0.99 [  1.21](15.44)
   2-groups     1.00 [ -0.00]( 4.73)     1.01 [ -1.00]( 9.29)     2.88 [-188.25](63.11)    0.97 [  2.75](11.62)     0.95 [  5.00](12.47)     0.96 [  4.00](11.72)
   4-groups     1.00 [ -0.00]( 2.90)     0.99 [  0.96]( 2.48)     0.99 [  0.72]( 2.38)     1.00 [  0.48]( 1.78)     0.99 [  0.96]( 6.61)     1.00 [ -0.00]( 5.23)
   8-groups     1.00 [ -0.00]( 1.82)     1.03 [ -2.51]( 2.81)     0.99 [  0.91]( 3.33)     0.99 [  0.91]( 2.40)     1.01 [ -0.68]( 2.44)     1.02 [ -1.60]( 2.45)
  16-groups     1.00 [ -0.00]( 3.05)     1.03 [ -2.96]( 1.97)     1.23 [-22.82](22.67)     1.01 [ -1.31]( 2.03)     0.99 [  0.99]( 2.48)     1.01 [ -0.66]( 2.76)

  Note: For smp variant runs, I think there was some system noise from
  an unrelated job that started by mistake. I have to go back and
  rerun to confirm if the regression holds.
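
  For readers decoding the table cells: each entry reads as
  "normalized value [pct imp](CV)". The sketch below shows one way the
  first two numbers can be derived. It assumes the value is the
  variant's mean divided by tip's mean, and that "pct imp" is the
  percentage change with the sign flipped for lower-is-better metrics,
  which matches the rows above (1.12 pairs with about -12 for
  hackbench). The helper names are hypothetical and are not taken from
  any benchmark harness.

```c
#include <assert.h>

/*
 * Hypothetical helper: variant mean relative to the baseline (tip) mean.
 * A result of 1.00 means "same as tip".
 */
static double normalized(double variant_mean, double baseline_mean)
{
	assert(baseline_mean != 0.0);
	return variant_mean / baseline_mean;
}

/*
 * Hypothetical helper: percentage improvement over the baseline.
 * Positive always means "better": for lower-is-better metrics
 * (e.g. time, latency) the sign of the raw change is flipped.
 */
static double pct_improvement(double variant_mean, double baseline_mean,
			      int lower_is_better)
{
	double delta;

	assert(baseline_mean != 0.0);
	delta = (variant_mean - baseline_mean) / baseline_mean * 100.0;
	return lower_is_better ? -delta : delta;
}
```

  With this convention a negative "pct imp" is a regression in every
  table, regardless of whether the metric is a time or a throughput.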
  
  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)       concur[pct imp](CV)       tasks[pct imp](CV)
      1     1.00 [  0.00]( 0.23)     1.02 [  1.53]( 0.10)     1.02 [  1.80]( 0.96)     1.00 [  0.19]( 0.17)     0.99 [ -0.58]( 0.18)     1.01 [  1.13]( 0.18)
      2     1.00 [  0.00]( 0.11)     1.01 [  1.13]( 0.04)     1.01 [  1.26]( 0.08)     1.00 [ -0.12]( 0.21)     0.99 [ -1.00]( 0.29)     1.01 [  0.51]( 0.02)
      4     1.00 [  0.00]( 0.11)     1.01 [  1.30]( 0.16)     1.02 [  2.26]( 0.48)     1.00 [  0.06]( 0.37)     0.99 [ -0.74]( 0.52)     1.01 [  1.35]( 0.35)
      8     1.00 [  0.00]( 0.24)     1.01 [  1.19]( 0.81)     1.02 [  2.17]( 0.52)     1.00 [  0.45]( 0.36)     0.99 [ -0.80]( 0.05)     1.01 [  0.92]( 0.39)
     16     1.00 [  0.00]( 0.15)     1.01 [  0.87]( 0.70)     1.03 [  2.56]( 0.18)     1.00 [ -0.41]( 0.46)     0.99 [ -0.99]( 0.36)     1.01 [  0.55]( 1.02)
     32     1.00 [  0.00]( 1.02)     1.02 [  2.42]( 0.36)     1.04 [  3.76]( 0.94)     1.01 [  1.20]( 0.19)     1.00 [ -0.10]( 0.44)     1.02 [  1.61]( 0.27)
     64     1.00 [  0.00]( 0.36)     1.02 [  1.92]( 1.71)     1.03 [  2.59]( 1.15)     0.99 [ -0.51]( 0.88)     1.01 [  1.19]( 0.29)     1.02 [  2.42]( 0.57)
    128     1.00 [  0.00]( 0.45)     1.01 [  1.11]( 1.37)     1.05 [  4.64]( 1.05)     1.01 [  0.98]( 2.47)     1.01 [  0.84]( 1.97)     1.03 [  2.56]( 1.22)
    256     1.00 [  0.00]( 0.06)     1.02 [  2.23]( 1.11)     1.02 [  2.17]( 0.69)     1.03 [  2.87]( 0.46)     1.03 [  2.57]( 0.41)     1.03 [  2.99]( 0.84)
    512     1.00 [  0.00]( 1.50)     0.92 [ -7.62]( 6.42)     1.02 [  1.94]( 2.12)     1.01 [  0.74]( 6.70)     0.94 [ -6.09]( 5.19)     0.98 [ -2.40]( 3.00)
   1024     1.00 [  0.00]( 0.07)     0.98 [ -1.51]( 0.30)     1.02 [  1.66]( 0.13)     0.97 [ -2.97]( 0.76)     0.94 [ -6.33]( 0.26)     0.95 [ -4.63]( 0.47)
   2048     1.00 [  0.00]( 0.25)     0.98 [ -1.57]( 0.59)     1.02 [  1.81]( 0.20)     0.98 [ -1.94]( 0.38)     0.94 [ -5.54]( 0.17)     0.95 [ -4.56]( 0.77)
  
  
  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)        concur[pct imp](CV)     tasks[pct imp](CV)
   Copy     1.00 [  0.00]( 0.51)     0.92 [ -7.61](12.64)     0.99 [ -0.79]( 0.30)     0.85 [-15.33](21.73)     0.99 [ -0.61]( 0.35)     0.99 [ -0.90]( 0.38)
  Scale     1.00 [  0.00]( 0.35)     0.90 [ -9.96](14.86)     0.99 [ -1.45]( 0.76)     0.85 [-14.73](21.14)     0.99 [ -1.12]( 0.86)     0.99 [ -1.13]( 0.71)
    Add     1.00 [  0.00]( 0.21)     0.93 [ -6.95](10.26)     0.98 [ -1.51]( 0.80)     0.95 [ -4.91]( 9.84)     0.99 [ -0.98]( 0.69)     0.99 [ -1.37]( 0.67)
  Triad     1.00 [  0.00]( 0.24)     0.89 [-10.61](15.55)     0.99 [ -1.46]( 0.72)     0.95 [ -4.87]( 9.50)     0.99 [ -0.90]( 0.62)     0.99 [ -1.44]( 0.61)
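
  The stream tables use HMean as the statistic while hackbench, tbench
  and netperf use AMean. As a reminder of the difference, here is a
  minimal sketch of both means over a set of per-run samples. The
  function names are hypothetical; how the actual harness aggregates
  runs is not stated in this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean: sum of the samples divided by their count. */
static double amean(const double *x, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/*
 * Harmonic mean: count divided by the sum of reciprocals.
 * Only defined here for strictly positive samples, such as
 * bandwidth figures in MB/s.
 */
static double hmean(const double *x, size_t n)
{
	double inv_sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		assert(x[i] > 0.0);
		inv_sum += 1.0 / x[i];
	}
	return (double)n / inv_sum;
}
```

  The harmonic mean never exceeds the arithmetic mean for positive
  samples, so a single slow run pulls an HMean result down more than
  it would an AMean result.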
  
  
  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)        concur[pct imp](CV)     tasks[pct imp](CV)
   Copy     1.00 [  0.00]( 1.52)     1.00 [ -0.16]( 0.56)     0.97 [ -2.54]( 2.30)     1.00 [  0.09]( 0.18)     1.00 [  0.17]( 0.44)     0.99 [ -0.74]( 1.49)
  Scale     1.00 [  0.00]( 1.43)     0.99 [ -0.63]( 0.55)     0.97 [ -2.72]( 2.44)     1.00 [  0.20]( 0.16)     0.99 [ -0.67]( 0.43)     0.99 [ -1.39]( 1.45)
    Add     1.00 [  0.00]( 1.06)     0.99 [ -1.27]( 0.42)     0.97 [ -3.08]( 1.95)     1.00 [ -0.13]( 0.19)     0.99 [ -1.19]( 0.34)     0.98 [ -1.76]( 1.07)
  Triad     1.00 [  0.00]( 1.10)     0.98 [ -1.55]( 0.41)     0.97 [ -3.40]( 1.95)     1.00 [ -0.42]( 0.16)     0.99 [ -1.49]( 0.35)     0.98 [ -2.08]( 1.11)
  
  
  ==================================================================
  Test          : netperf
  Units         : Normalized Throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:           tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)       concur[pct imp](CV)       tasks[pct imp](CV)
     1-clients     1.00 [  0.00]( 0.15)     1.02 [  2.24]( 0.12)     1.02 [  1.50]( 0.22)     1.07 [  6.86]( 2.15)     1.00 [  0.27]( 0.45)     1.02 [  1.86]( 0.15)
     2-clients     1.00 [  0.00]( 0.37)     1.03 [  2.50]( 0.80)     1.03 [  3.11]( 0.78)     1.07 [  7.15]( 0.96)     1.00 [ -0.43]( 0.54)     1.02 [  1.67]( 0.69)
     4-clients     1.00 [  0.00]( 0.26)     1.02 [  2.26]( 0.33)     1.02 [  2.02]( 0.37)     1.07 [  7.37]( 0.78)     1.00 [ -0.24]( 0.42)     1.02 [  1.84]( 0.22)
     8-clients     1.00 [  0.00]( 0.21)     1.02 [  2.47]( 0.55)     1.02 [  2.13]( 0.48)     1.08 [  7.60]( 0.57)     1.00 [ -0.15]( 0.37)     1.02 [  1.87]( 0.29)
    16-clients     1.00 [  0.00]( 0.23)     1.02 [  2.07]( 0.69)     1.02 [  1.87]( 0.42)     1.07 [  7.38]( 0.50)     0.99 [ -0.55]( 0.48)     1.02 [  1.78]( 0.25)
    32-clients     1.00 [  0.00]( 0.47)     1.02 [  2.14]( 0.63)     1.02 [  1.81]( 0.75)     1.07 [  7.44]( 0.94)     1.00 [ -0.38]( 0.53)     1.02 [  1.76]( 0.43)
    64-clients     1.00 [  0.00]( 0.91)     1.02 [  2.00]( 0.81)     1.02 [  1.74]( 0.96)     1.07 [  7.12]( 1.06)     1.00 [ -0.20]( 0.68)     1.02 [  1.63]( 0.76)
   128-clients     1.00 [  0.00]( 1.19)     1.01 [  1.36]( 1.28)     1.01 [  1.34]( 1.18)     1.06 [  6.37]( 1.47)     1.00 [ -0.49]( 1.15)     1.01 [  1.06]( 1.09)
   256-clients     1.00 [  0.00]( 1.00)     1.02 [  1.70]( 1.15)     1.02 [  1.64]( 1.18)     1.07 [  7.17]( 1.87)     1.00 [ -0.31]( 1.19)     1.01 [  1.29]( 1.12)
   512-clients     1.00 [  0.00]( 5.16)     1.00 [  0.02]( 6.48)     0.99 [ -0.60]( 4.31)     1.04 [  4.08]( 3.83)     1.00 [  0.02]( 2.52)     1.00 [  0.48]( 2.86)
   768-clients     1.00 [  0.00](34.61)     1.03 [  2.84](62.48)     1.00 [  0.26](30.91)     0.98 [ -2.12](13.41)     0.94 [ -6.34](11.15)     0.95 [ -4.82](10.61)
  1024-clients     1.00 [  0.00](41.78)     1.04 [  3.95](76.45)     1.01 [  0.99](40.23)     0.97 [ -2.58](11.79)     0.95 [ -5.36](11.21)     0.96 [ -4.01](12.98)
  
  
  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)        concur[pct imp](CV)     tasks[pct imp](CV)
     1     1.00 [ -0.00]( 5.88)     0.94 [  5.88](19.25)     0.88 [ 11.76]( 7.37)     0.88 [ 11.76](34.02)     1.06 [ -5.88](11.11)     0.94 [  5.88]( 3.69)
     2     1.00 [ -0.00](36.56)     0.97 [  3.03]( 9.52)     0.97 [  3.03]( 4.82)     1.00 [ -0.00](26.93)     0.97 [  3.03](14.53)     0.82 [ 18.18](13.87)
     4     1.00 [ -0.00]( 9.35)     0.97 [  3.12](18.25)     0.94 [  6.25]( 9.12)     1.00 [ -0.00](20.82)     1.00 [ -0.00]( 4.82)     0.88 [ 12.50](15.93)
     8     1.00 [ -0.00](23.38)     0.94 [  6.45](23.26)     1.26 [-25.81](14.43)     1.26 [-25.81](26.13)     1.06 [ -6.45]( 4.68)     0.97 [  3.23](10.00)
    16     1.00 [ -0.00]( 2.71)     1.04 [ -3.57]( 2.62)     1.00 [ -0.00]( 2.04)     0.98 [  1.79]( 2.76)     1.04 [ -3.57]( 0.00)     1.02 [ -1.79]( 2.70)
    32     1.00 [ -0.00]( 0.72)     1.00 [ -0.00]( 0.72)     0.96 [  3.75]( 1.30)     1.01 [ -1.25]( 0.71)     1.01 [ -1.25]( 1.88)     1.02 [ -2.50]( 1.86)
    64     1.00 [ -0.00]( 1.52)     0.96 [  4.41]( 1.18)     0.97 [  2.94]( 4.49)     0.94 [  5.88]( 1.19)     0.95 [  5.15]( 0.45)     0.96 [  4.41]( 0.45)
   128     1.00 [ -0.00]( 3.24)     0.96 [  3.80]( 0.91)     0.96 [  3.80]( 1.53)     0.95 [  4.64]( 0.26)     0.96 [  3.80]( 0.44)     0.97 [  2.53]( 1.32)
   256     1.00 [ -0.00]( 0.90)     0.92 [  7.60]( 0.75)     0.95 [  5.40]( 0.61)     0.92 [  7.80]( 0.55)     0.93 [  6.60]( 6.23)     1.00 [ -0.00]( 0.61)
   512     1.00 [ -0.00]( 1.76)     1.10 [-10.36]( 0.48)     1.04 [ -4.15]( 1.46)     0.97 [  3.45]( 7.07)     1.07 [ -6.56]( 2.54)     0.93 [  6.56]( 2.45)
   768     1.00 [ -0.00]( 2.79)     0.95 [  5.10]( 6.65)     1.71 [-70.87]( 4.67)     0.80 [ 20.13]( 1.81)     0.78 [ 21.80]( 1.19)     0.78 [ 22.36]( 1.46)
  1024     1.00 [ -0.00]( 0.86)     0.35 [ 64.79](18.07)     1.16 [-15.56]( 1.17)     1.00 [  0.32]( 3.65)     0.95 [  4.54]( 3.10)     0.90 [  9.72]( 2.33)
  
  
  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)       concur[pct imp](CV)       tasks[pct imp](CV)
     1     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [ -0.30]( 0.00)     1.00 [ -0.30]( 0.31)     0.99 [ -0.59]( 0.15)     1.00 [  0.00]( 0.00)
     2     1.00 [  0.00]( 0.00)     1.00 [  0.30]( 0.00)     1.00 [  0.00]( 0.61)     1.00 [  0.30]( 0.15)     1.00 [ -0.30]( 0.70)     1.00 [  0.00]( 0.61)
     4     1.00 [  0.00]( 0.00)     1.00 [  0.29]( 0.30)     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.00)     1.00 [ -0.29]( 0.15)     0.99 [ -0.59]( 0.31)
     8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.00)     1.00 [ -0.29]( 0.15)
    16     1.00 [  0.00]( 0.00)     1.00 [  0.29]( 0.15)     1.00 [  0.00]( 0.15)     1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.00)     1.00 [ -0.29]( 0.15)
    32     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.15)     1.00 [  0.29]( 0.15)     1.00 [  0.29]( 0.00)     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.15)
    64     1.00 [  0.00]( 0.15)     1.00 [  0.29]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.29]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.29]( 0.15)
   128     1.00 [  0.00](17.05)     1.01 [  0.59](14.45)     1.00 [  0.30]( 0.00)     1.01 [  0.59](10.49)     1.00 [  0.00]( 9.79)     1.00 [  0.30]( 0.31)
   256     1.00 [  0.00]( 0.59)     1.00 [  0.28]( 0.59)     1.00 [  0.28]( 0.44)     1.00 [ -0.28]( 0.82)     0.99 [ -0.57]( 0.39)     1.00 [  0.00]( 0.67)
   512     1.00 [  0.00]( 0.69)     1.01 [  1.50]( 1.66)     1.00 [  0.00]( 0.33)     0.99 [ -1.50]( 0.39)     0.99 [ -1.12]( 0.58)     1.00 [  0.00]( 0.78)
   768     1.00 [  0.00]( 1.13)     1.12 [ 11.51]( 9.06)     0.99 [ -1.15]( 0.21)     0.91 [ -8.52]( 0.47)     0.91 [ -8.98]( 0.47)     0.93 [ -6.67]( 0.84)
  1024     1.00 [  0.00]( 1.23)     1.12 [ 12.41]( 4.32)     1.01 [  1.10]( 0.71)     0.87 [-13.24]( 0.59)     0.87 [-12.97]( 0.49)     0.86 [-14.07]( 0.00)
  
  
  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)        concur[pct imp](CV)      tasks[pct imp](CV)
     1     1.00 [ -0.00](21.48)     2.75 [-175.00](14.06)    3.00 [-200.00](12.91)    3.62 [-262.50](29.64)    3.62 [-262.50](21.84)    2.25 [-125.00](12.45)
     2     1.00 [ -0.00]( 5.96)     1.33 [-33.33](24.78)     1.33 [-33.33](18.25)     1.11 [-11.11](25.37)     1.11 [-11.11]( 9.68)     1.11 [-11.11]( 5.00)
     4     1.00 [ -0.00]( 6.20)     1.38 [-37.50]( 4.84)     1.12 [-12.50]( 0.00)     1.25 [-25.00](21.51)     1.12 [-12.50](19.99)     1.38 [-37.50](12.81)
     8     1.00 [ -0.00]( 5.34)     1.20 [-20.00]( 0.00)     1.00 [ -0.00]( 5.00)     1.20 [-20.00]( 7.45)     1.20 [-20.00]( 0.00)     1.20 [-20.00]( 4.19)
    16     1.00 [ -0.00]( 5.53)     1.22 [-22.22]( 0.00)     1.11 [-11.11]( 0.00)     1.11 [-11.11]( 9.68)     1.33 [-33.33]( 0.00)     1.33 [-33.33]( 4.43)
    32     1.00 [ -0.00]( 5.53)     1.11 [-11.11]( 5.00)     1.00 [ -0.00]( 0.00)     1.11 [-11.11]( 5.00)     1.11 [-11.11]( 0.00)     1.11 [-11.11]( 0.00)
    64     1.00 [ -0.00](12.81)     1.00 [ -0.00]( 0.00)     0.91 [  9.09]( 5.34)     0.91 [  9.09]( 5.00)     0.91 [  9.09]( 5.00)     0.82 [ 18.18](10.68)
   128     1.00 [ -0.00](12.14)     1.12 [-12.50]( 4.97)     0.81 [ 18.75](13.62)     1.06 [ -6.25]( 5.26)     0.94 [  6.25](14.68)     0.81 [ 18.75]( 6.88)
   256     1.00 [ -0.00]( 4.83)     0.98 [  1.90]( 7.97)     0.98 [  2.37]( 2.41)     0.95 [  4.74]( 5.31)     0.95 [  5.21]( 3.25)     0.97 [  2.84]( 2.84)
   512     1.00 [ -0.00]( 0.00)     0.98 [  2.00]( 5.86)     0.96 [  3.99]( 1.61)     0.83 [ 16.83]( 1.08)     0.85 [ 14.84]( 7.53)     0.97 [  3.42]( 7.91)
   768     1.00 [ -0.00]( 0.71)     0.96 [  3.63](14.27)     0.96 [  3.73]( 1.19)     0.67 [ 32.98]( 0.00)     0.67 [ 32.98]( 0.00)     0.67 [ 32.98]( 0.15)
  1024     1.00 [ -0.00]( 2.55)     0.85 [ 14.99](14.23)     0.98 [  1.80]( 0.82)     0.79 [ 21.29]( 0.00)     0.79 [ 21.29]( 0.00)     0.79 [ 21.29]( 0.00)
  
  
  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:  tip[pct imp](CV)          up[pct imp](CV)         smp[pct imp](CV)         max[pct imp](CV)       concur[pct imp](CV)       tasks[pct imp](CV)
     1     1.00 [ -0.00]( 0.14)     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 0.14)     1.00 [ -0.26]( 0.27)     1.01 [ -0.53]( 0.27)     1.00 [ -0.00]( 0.14)
     2     1.00 [ -0.00]( 0.00)     1.00 [  0.26]( 0.14)     1.02 [ -2.36]( 1.51)     0.99 [  0.52]( 0.14)     1.00 [ -0.00]( 2.00)     1.00 [ -0.00]( 1.74)
     4     1.00 [ -0.00]( 0.00)     1.00 [  0.26]( 1.82)     1.03 [ -2.63]( 0.23)     0.99 [  0.53]( 0.14)     1.00 [ -0.26]( 1.61)     1.03 [ -3.42]( 1.79)
     8     1.00 [ -0.00]( 0.14)     0.99 [  0.79]( 0.00)     1.00 [ -0.00]( 0.00)     0.99 [  1.05]( 0.14)     1.00 [ -0.00]( 0.00)     1.00 [ -0.00]( 0.00)
    16     1.00 [ -0.00]( 0.14)     1.00 [ -0.00]( 1.16)     1.00 [ -0.26]( 1.28)     0.99 [  0.79]( 1.97)     1.00 [ -0.26]( 0.00)     1.02 [ -2.38]( 1.15)
    32     1.00 [ -0.00]( 0.23)     0.99 [  1.04]( 0.24)     1.00 [ -0.00]( 0.67)     0.99 [  1.04]( 0.14)     0.99 [  0.52]( 0.49)     0.99 [  1.30]( 0.95)
    64     1.00 [ -0.00](12.39)     1.72 [-71.69](26.18)     1.01 [ -1.29](28.90)     0.97 [  3.09]( 0.27)     0.98 [  2.06]( 0.41)     0.98 [  2.32]( 1.22)
   128     1.00 [ -0.00]( 4.75)     0.98 [  2.07](12.01)     0.99 [  0.59]( 0.41)     1.00 [  0.30]( 3.02)     1.00 [ -0.00]( 2.72)     0.99 [  0.59]( 0.77)
   256     1.00 [ -0.00]( 0.13)     0.99 [  0.76]( 0.13)     1.00 [ -0.00]( 0.13)     0.99 [  0.76]( 0.00)     0.99 [  0.51]( 0.13)     0.99 [  0.51]( 0.13)
   512     1.00 [ -0.00]( 7.49)     1.04 [ -4.36](28.15)     0.89 [ 10.50]( 4.15)     0.42 [ 57.93]( 9.03)     0.48 [ 52.23](25.18)     0.71 [ 28.83](25.82)
   768     1.00 [ -0.00]( 3.13)     1.31 [-31.19](10.91)     1.10 [-10.13]( 1.12)     0.71 [ 28.69]( 3.69)     0.70 [ 29.59]( 3.58)     0.84 [ 15.65]( 4.74)
  1024     1.00 [ -0.00]( 2.28)     1.44 [-44.48](16.05)     1.13 [-13.09]( 0.88)     0.85 [ 15.39]( 6.64)     0.85 [ 15.03]( 0.39)     0.81 [ 19.39]( 1.36)
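
For anyone reading these tables: the bracketed "pct imp" column appears to be the relative change against tip, with the sign arranged so that a positive number always means an improvement. A minimal C sketch of that arithmetic follows. It assumes this is how the columns were derived; the helper names are made up for illustration and do not come from any benchmark harness.

```c
#include <assert.h>
#include <stdbool.h>

/* Value normalized against the tip baseline (tip itself is 1.00). */
static double normalized(double tip, double val)
{
	assert(tip > 0.0);
	return val / tip;
}

/*
 * Percentage improvement relative to tip. Positive always means "better":
 * for lower-is-better metrics (latency) a smaller value is an improvement,
 * for higher-is-better metrics (throughput) a larger one is.
 */
static double pct_imp(double tip, double val, bool lower_is_better)
{
	double delta;

	assert(tip > 0.0);
	delta = lower_is_better ? tip - val : val - tip;
	return 100.0 * delta / tip;
}
```

With a hypothetical tip latency of 8us and a result of 22us, normalized() gives 2.75 and pct_imp() gives -175.00, the same shape as the 1-worker "up" entry in the wakeup-latency table.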
---

DeathStarBench was run on a 3rd Generation EPYC system (2 x 64C/128T)
with boost enabled and C2 disabled:

  ==================================================================
  Test          : DeathStarBench
  Units         : %diff compared to tip
  Interpretation: Higher is better
  Statistic     : Average throughput
  ==================================================================

  Scaling\cg_mode    up      smp      max    concur    tasks
      2x            1.63%  -0.32%   -0.03%   -0.65%    0.62%
      4x           -4.94%   1.38%   -6.59%    1.09%   -6.11%
      6x            1.83%   0.16%   -0.87%    2.24%   -0.23%

---

I won't get to look any deeper into any of the regressions until Thursday,
but overall the concur mode seems good in my testing for the most part.

-- 
Thanks and Regards,
Prateek


* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
  2026-06-05 12:40 ` [PATCH v3 3/7] sched/fair: Add cgroup_mode: max Peter Zijlstra
@ 2026-06-10 15:09   ` Waiman Long
  2026-06-10 15:42     ` Waiman Long
  2026-06-11 13:47     ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Waiman Long @ 2026-06-10 15:09 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On 6/5/26 8:40 AM, Peter Zijlstra wrote:
> In order to avoid the average CPU fraction avg(F_g_n) becoming tiny '1/N',
> assume each cgroup is maximally concurrent and distribute 'N*weight', such
> that:
>
> 	F_g_n' = N * F_g_n
>
> Giving:
>
> 	avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1
>
> And while this sounds like it solves things, remember what that ~ meant. There
> is the corner case when a cgroup is minimally loaded, eg a single runnable
> task, therefore limit the CPU fraction to that of a nice -20 task to avoid
> getting too much load.
>
> This last bit is what makes it different from a previous proposal to allow
> raising cpu.weight to '100 * N', that would not limit the minimal concurrency
> case and results in a very large F_g_n. And just like F_g_n << 1 is
> problematic, so is F_g_n >> 1 for the exact same reasons (it would drown the
> kthreads, but it also risks overflowing the load values).
>
> So while this might appear to be a better scheme than the current default
> scheme, it doesn't really handle less than maximal concurrency nicely -- it
> clips and introduces artificially large weights. So where the traditional SMP
> mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
> and vice-versa.
>
> The meaning of "cpu.weight" would be: weight per allowed CPU.
>
> Included for completeness (and infrastructure).
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   include/linux/cpuset.h |    6 +++++
>   kernel/cgroup/cpuset.c |   15 ++++++++++++++
>   kernel/sched/debug.c   |    1
>   kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
>   4 files changed, 69 insertions(+), 5 deletions(-)
>
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
>   extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
>   extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
>   extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
> +extern int cpuset_num_cpus(struct cgroup *cgroup);
>   extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>   #define cpuset_current_mems_allowed (current->mems_allowed)
>   void cpuset_init_current_mems_allowed(void);
> @@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
>   	return false;
>   }
>   
> +static inline int cpuset_num_cpus(struct cgroup *cgroup)
> +{
> +	return num_online_cpus();
> +}
> +
>   static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>   {
>   	return node_possible_map;
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
>   	return changed;
>   }
>   
> +int cpuset_num_cpus(struct cgroup *cgrp)
> +{
> +	int nr = num_online_cpus();
> +	struct cpuset *cs;
> +
> +	if (is_in_v2_mode()) {
> +		guard(rcu)();
> +		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
> +		if (cs)
> +			nr = cpumask_weight(cs->effective_cpus);
> +	}
> +
> +	return nr;
> +}
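
As an aside for readers following the changelog quoted above: the scaling F_g_n' = N * F_g_n and the nice -20 clip can be modeled in a few lines of userspace C. This is only a sketch of the arithmetic the changelog describes, under the assumption that the clip is applied to the per-CPU share; max_mode_share() and its parameters are hypothetical and are not taken from the fair.c hunk, which is not shown here. The two weight constants are the nice 0 and nice -20 entries of the kernel's sched_prio_to_weight[] table.

```c
#include <assert.h>
#include <stdint.h>

/* Load weights of a nice 0 and a nice -20 task (sched_prio_to_weight[]). */
#define NICE_0_WEIGHT		1024ULL
#define NICE_MINUS_20_WEIGHT	88761ULL

/*
 * Hypothetical model of the "max" mode share a group gets on one CPU.
 *
 * weight:     the group's weight, on the same scale as task weights
 * nr_cpus:    N, the number of CPUs the group may run on
 * cpu_load:   the group's load on this CPU
 * group_load: the group's load summed over all of its CPUs
 */
static uint64_t max_mode_share(uint64_t weight, unsigned int nr_cpus,
			       uint64_t cpu_load, uint64_t group_load)
{
	uint64_t share;

	assert(group_load > 0 && cpu_load <= group_load);

	/* Distribute N*weight in proportion to where the load sits. */
	share = weight * nr_cpus * cpu_load / group_load;

	/* Clip the minimally concurrent case at a nice -20 task. */
	if (share > NICE_MINUS_20_WEIGHT)
		share = NICE_MINUS_20_WEIGHT;

	return share;
}
```

With a nice 0 weight on 8 CPUs, a group spread evenly over all CPUs gets 1024 per CPU and a single-task group gets 8192. On 256 CPUs the single-task case would come to 262144 and is clipped to 88761.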

I just have a question about cgroup v1 support. I am assuming that 
cgroup v1 without the cpuset_v2_mode mount option is not supported. To 
fully support cgroup v1, you may have to use guarantee_active_cpus() to 
return the actual set of CPUs that the task can run on. Also there is a 
caveat about the arm64 specific task_cpu_possible_mask() for certain 
arm64 CPUs. That is for 32-bit binary running on 64-bit core which are 
allowed only on a selected subset of cores within the CPU.

This is probably not what you want to focus on right now, but it will be 
good to have a comment to list items that are not fully supported here.

Cheers,
Longman


* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
  2026-06-10 15:09   ` Waiman Long
@ 2026-06-10 15:42     ` Waiman Long
  2026-06-11 13:49       ` Peter Zijlstra
  2026-06-11 13:47     ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Waiman Long @ 2026-06-10 15:42 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On 6/10/26 11:09 AM, Waiman Long wrote:
> On 6/5/26 8:40 AM, Peter Zijlstra wrote:
>> In order to avoid the average CPU fraction avg(F_g_n) becoming tiny '1/N',
>> assume each cgroup is maximally concurrent and distribute 'N*weight', such
>> that:
>>
>>     F_g_n' = N * F_g_n
>>
>> Giving:
>>
>>     avg(F_g_n') = N*avg(F_g_n) ~ N * 1/N = 1
>>
>> And while this sounds like it solves things, remember what that ~ meant. There
>> is the corner case when a cgroup is minimally loaded, eg a single runnable
>> task, therefore limit the CPU fraction to that of a nice -20 task to avoid
>> getting too much load.
>>
>> This last bit is what makes it different from a previous proposal to allow
>> raising cpu.weight to '100 * N', that would not limit the minimal concurrency
>> case and results in a very large F_g_n. And just like F_g_n << 1 is
>> problematic, so is F_g_n >> 1 for the exact same reasons (it would drown the
>> kthreads, but it also risks overflowing the load values).
>>
>> So while this might appear to be a better scheme than the current default
>> scheme, it doesn't really handle less than maximal concurrency nicely -- it
>> clips and introduces artificially large weights. So where the traditional SMP
>> mode works well when nr_tasks << nr_cpus, MAX doesn't work well in that regime
>> and vice-versa.
>>
>> The meaning of "cpu.weight" would be: weight per allowed CPU.
>>
>> Included for completeness (and infrastructure).
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>   include/linux/cpuset.h |    6 +++++
>>   kernel/cgroup/cpuset.c |   15 ++++++++++++++
>>   kernel/sched/debug.c   |    1
>>   kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
>>   4 files changed, 69 insertions(+), 5 deletions(-)
>>
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
>>   extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
>>   extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
>>   extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
>> +extern int cpuset_num_cpus(struct cgroup *cgroup);
>>   extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>>   #define cpuset_current_mems_allowed (current->mems_allowed)
>>   void cpuset_init_current_mems_allowed(void);
>> @@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
>>       return false;
>>   }
>>
>> +static inline int cpuset_num_cpus(struct cgroup *cgroup)
>> +{
>> +    return num_online_cpus();
>> +}
>> +
>>   static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>>   {
>>       return node_possible_map;
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
>>       return changed;
>>   }
>>
>> +int cpuset_num_cpus(struct cgroup *cgrp)
>> +{
>> +    int nr = num_online_cpus();
>> +    struct cpuset *cs;
>> +
>> +    if (is_in_v2_mode()) {
>> +        guard(rcu)();
>> +        cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
>> +        if (cs)
>> +            nr = cpumask_weight(cs->effective_cpus);
>> +    }
>> +
>> +    return nr;
>> +}
>
> I just have a question about cgroup v1 support. I am assuming that 
> cgroup v1 without the cpuset_v2_mode mount option is not supported. To 
> fully support cgroup v1, you may have to use guarantee_active_cpus() 
> to return the actual set of CPUs that the task can run on. Also there 
> is a caveat about the arm64 specific task_cpu_possible_mask() for 
> certain arm64 CPUs. That is for 32-bit binary running on 64-bit core 
> which are allowed only on a selected subset of cores within the CPU.
>
> This is probably not what you want to focus on right now, but it will 
> be good to have a comment to list items that are not fully supported 
> here. 

FYI, you may have to take the callback_lock to ensure the stability of 
the effective_cpus mask.

Cheers,
Longman


* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
  2026-06-10 15:09   ` Waiman Long
  2026-06-10 15:42     ` Waiman Long
@ 2026-06-11 13:47     ` Peter Zijlstra
  2026-06-11 20:57       ` Waiman Long
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-11 13:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: mingo, chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Wed, Jun 10, 2026 at 11:09:59AM -0400, Waiman Long wrote:

> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
> >   	return changed;
> >   }
> > +int cpuset_num_cpus(struct cgroup *cgrp)
> > +{
> > +	int nr = num_online_cpus();
> > +	struct cpuset *cs;
> > +
> > +	if (is_in_v2_mode()) {
> > +		guard(rcu)();
> > +		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
> > +		if (cs)
> > +			nr = cpumask_weight(cs->effective_cpus);
> > +	}
> > +
> > +	return nr;
> > +}
> 
> I just have a question about cgroup v1 support. I am assuming that cgroup v1
> without the cpuset_v2_mode mount option is not supported. 

Correct.

> To fully support
> cgroup v1, you may have to use guarantee_active_cpus() to return the actual
> set of CPUs that the task can run on.

Except this is group based, we'd need an iteration of all tasks in the
group and compute a union of guarantee_active_cpus(). Which all seems
far too expensive and not worth the effort.
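
To make the cost concrete, here is a toy userspace model of that union. It is not kernel code: struct toy_task and its 64-bit 'allowed' mask are hypothetical stand-ins for a task and for whatever guarantee_active_cpus() would report for it. The point is only that the count needs one pass over every task in the group each time it is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy task: 'allowed' stands in for the task's usable CPU mask. */
struct toy_task {
	uint64_t allowed;
};

/* Count the set bits in a 64-bit mask. */
static unsigned int popcount64(uint64_t x)
{
	unsigned int n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

/*
 * Number of CPUs usable by at least one task in the group.
 * Cost is O(nr_tasks) per call, which is the expense objected to above.
 */
static unsigned int group_num_cpus_by_union(const struct toy_task *tasks,
					    size_t nr_tasks)
{
	uint64_t mask = 0;

	assert(tasks || nr_tasks == 0);

	for (size_t i = 0; i < nr_tasks; i++)
		mask |= tasks[i].allowed;

	return popcount64(mask);
}
```

Three tasks with masks 0x3, 0x6 and 0x10 cover CPUs 0, 1, 2 and 4, so the union counts 4 CPUs.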

> Also there is a caveat about the arm64 specific
> task_cpu_possible_mask() for certain arm64 CPUs. That is for 32-bit
> binary running on 64-bit core which are allowed only on a selected
> subset of cores within the CPU.
> 
> This is probably not what you want to focus on right now, but it will be
> good to have a comment to list items that are not fully supported here.

Will add a comment!
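
For illustration only, such a comment might read roughly as below. This is a userspace model of the quoted function, not the kernel patch: struct toy_cpuset, toy_cpuset_num_cpus() and its v2_mode and nr_online parameters are hypothetical stand-ins for the cpuset, is_in_v2_mode() and num_online_cpus(). The wording of the comment is a guess at what the thread asked for, not the text that will be merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy cpuset: a 64-bit mask stands in for cs->effective_cpus. */
struct toy_cpuset {
	uint64_t effective_cpus;
};

/* Count the set bits in a 64-bit mask. */
static unsigned int popcount64(uint64_t x)
{
	unsigned int n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

/*
 * Number of CPUs a group is allowed to use.
 *
 * Known limitations:
 *  - cgroup v1 without the cpuset_v2_mode mount option is not supported;
 *    all online CPUs are assumed in that case.
 *  - Per-task restrictions are not reflected, e.g. arm64's
 *    task_cpu_possible_mask() for 32-bit tasks.
 *  - The result is only a snapshot; the mask may change right after
 *    it has been read.
 */
static unsigned int toy_cpuset_num_cpus(const struct toy_cpuset *cs,
					bool v2_mode, unsigned int nr_online)
{
	unsigned int nr = nr_online;

	assert(nr_online > 0);

	if (v2_mode && cs)
		nr = popcount64(cs->effective_cpus);

	return nr;
}
```

With 16 CPUs online and an effective mask of 0xF0, the model returns 4 in v2 mode and falls back to 16 otherwise.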

* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
  2026-06-10 15:42     ` Waiman Long
@ 2026-06-11 13:49       ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2026-06-11 13:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: mingo, chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Wed, Jun 10, 2026 at 11:42:47AM -0400, Waiman Long wrote:

> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
> > >       return changed;
> > >   }
> > >
> > > +int cpuset_num_cpus(struct cgroup *cgrp)
> > > +{
> > > +    int nr = num_online_cpus();
> > > +    struct cpuset *cs;
> > > +
> > > +    if (is_in_v2_mode()) {
> > > +        guard(rcu)();
> > > +        cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
> > > +        if (cs)
> > > +            nr = cpumask_weight(cs->effective_cpus);
> > > +    }
> > > +
> > > +    return nr;
> > > +}

> FYI, you may have to take the callback_lock to ensure the stability of the
> effective_cpus mask.

That seems pointless; the moment we drop that lock, it's changeable
again. Either way around, nr is but a snapshot.

* Re: [PATCH v3 3/7] sched/fair: Add cgroup_mode: max
  2026-06-11 13:47     ` Peter Zijlstra
@ 2026-06-11 20:57       ` Waiman Long
  0 siblings, 0 replies; 16+ messages in thread
From: Waiman Long @ 2026-06-11 20:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, chenridong, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On 6/11/26 9:47 AM, Peter Zijlstra wrote:
> On Wed, Jun 10, 2026 at 11:09:59AM -0400, Waiman Long wrote:
>
>>> --- a/kernel/cgroup/cpuset.c
>>> +++ b/kernel/cgroup/cpuset.c
>>> @@ -4116,6 +4116,21 @@ bool cpuset_cpus_allowed_fallback(struct
>>>    	return changed;
>>>    }
>>> +int cpuset_num_cpus(struct cgroup *cgrp)
>>> +{
>>> +	int nr = num_online_cpus();
>>> +	struct cpuset *cs;
>>> +
>>> +	if (is_in_v2_mode()) {
>>> +		guard(rcu)();
>>> +		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
>>> +		if (cs)
>>> +			nr = cpumask_weight(cs->effective_cpus);
>>> +	}
>>> +
>>> +	return nr;
>>> +}
>> I just have a question about cgroup v1 support. I am assuming that cgroup v1
>> without the cpuset_v2_mode mount option is not supported.
> Correct.
>
>> To fully support
>> cgroup v1, you may have to use guarantee_active_cpus() to return the actual
>> set of CPUs that the task can run on.
> Except this is group based, we'd need an iteration of all tasks in the
> group and compute a union of guarantee_active_cpus(). Which all seems
> far too expensive and not worth the effort.
I thought so.
>
>> Also there is a caveat about the arm64 specific
>> task_cpu_possible_mask() for certain arm64 CPUs. That is for 32-bit
>> binary running on 64-bit core which are allowed only on a selected
>> subset of cores within the CPU.
>>
>> This is probably not what you want to focus on right now, but it will be
>> good to have a comment to list items that are not fully supported here.
> Will add a comment!
Thanks,
Longman


* Re: [PATCH v3 0/7] sched: Flatten the pick
  2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
                   ` (7 preceding siblings ...)
  2026-06-09  5:37 ` [PATCH v3 0/7] sched: Flatten the pick K Prateek Nayak
@ 2026-06-12  2:29 ` Shubhang Kaushik
  8 siblings, 0 replies; 16+ messages in thread
From: Shubhang Kaushik @ 2026-06-12  2:29 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Hello Peter,

I applied the `sched/flat` patchset from your tree on top of the
`tip/sched/core` base commit (9ebe5c3c29f62)/(7.1-rc2).

The evaluation was performed on an 80-core Ampere Altra
system running Fedora Linux 41.

Benchmark runs:

1. Hackbench (Execution time in seconds: lower is better)
     The data reveals a clear architectural pivot point at 4 tasks:
     - Low Concurrency (< 4 tasks): Regresses by +1.8% to +4.0%.
       Removing cgroup isolation boundaries expands the idle CPU search
       adding slight overhead to the wake-up path.
       * 1 Thread:  (+1.8%)
       * 2 Threads: (+4.0%)
       * 2 Procs:   (+3.3%)
     - Tipping Point (4 tasks): Performance is completely flat.
       * 4 Threads: (+0.03%)
       * 4 Procs:   (+0.1%)
     - High Concurrency (>= 8 tasks): Improves by -0.7% to -2.3%.
       Collapsing the tree structure down to a flat layout removes
       multi-layer load tracking updates (update_load_avg), saving cycles
       under load.
       * 8 Threads:  (-0.7%)
       * 16 Threads: (-1.8%)
       * 8 Procs:    (-1.2%)
       * 16 Procs:   (-2.3%)
       * 32 Procs:   (-1.6%)

2. Schbench (Wakeup Tail Latency)
     - 16 Threads (128kb footprint): 99.9th percentile tail latency drops
       significantly by -12.21% (us). Operating on a unified runqueue layer
       prevents induced group-level throttling.
     - 32 Threads (128kb footprint): 99.9th percentile tail latency
       regresses by +5.50% (us). Eliminating nested queues increases lock
       contention during heavy simultaneous wakeups.

3. Sysbench
     - Sysbench RAM: Throughput increases by +1.55% (MiB/sec). Fewer tree
       traversals reduce cache-line bouncing, freeing up cycles.

The patchset trades minor low-load performance for better scaling and
tighter tail latencies under distributed load. However, the majority of
these deltas remain small and sit near the measurement noise floor (<=
4%).

Regards,
Shubhang Kaushik

On Fri, 5 Jun 2026, Peter Zijlstra wrote:

>
> Hi!
>
> New version, same story [1]. TL;DR:
>
> - Adds new cgroup_mode knob and implements new policies to address the
>   hierarchy level weight mismatch.
>
> - Builds upon that base to create a flat / single runqueue scheduler where the
>   cgroup hierarchy is expressed through dynamic weight management.
>
> I'm hoping to be able to merge these patches early in the next cycle (after
> 7.2-rc1).
>
> Random benchmark:
>
> Game vs 'for ((i=0; i<8; i++)) do nice ./spin.sh; done':
>
>  Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
>  Intel Core i7-2600K
>  AMD Radeon RX 580
>
>  Shadows Awakening (GOG)
>
> 	  default slice(*)
>
>  FPS min   4.0   29.0
>      avg  47.5   59.2
>      max  83.7   83.7
>
>  FT  min   9.3   10.2
>      avg  34.0   17.0
>      max 121.2   30.0
>
>  FPS (Frames Per Second)
>  FT  (FrameTime)
>
>  [*] Command prefix: 'chrt -o --sched-runtime 100000 0'
>
>
> Changes since v2:
>
> - merged debug and prep patches
> - fixed update_entity_lag() on dequeue (Vincent)
> - fixed throttle vs tick (Prateek)
> - fixed wakeup_preempt_fair()
> - rebased on tip/sched/core
> - rewritten cgroup_mode changelogs
> - reworked cgroup_mode concur
> - added cgroup_mode tasks
> - changed default cgroup_mode
>
>
> [1] - https://lore.kernel.org/r/20260511113104.563854162@infradead.org
>
> Can also be had:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
>
> include/linux/cpuset.h |    6
> include/linux/sched.h  |    1
> kernel/cgroup/cpuset.c |   15
> kernel/sched/core.c    |    5
> kernel/sched/debug.c   |   89 ++++
> kernel/sched/fair.c    |  943 ++++++++++++++++++++++++-------------------------
> kernel/sched/pelt.c    |    6
> kernel/sched/sched.h   |   30 -
> 8 files changed, 607 insertions(+), 488 deletions(-)
>
>

Thread overview: 16+ messages
2026-06-05 12:40 [PATCH v3 0/7] sched: Flatten the pick Peter Zijlstra
2026-06-05 12:40 ` [PATCH v3 1/7] sched/fair: Add cgroup_mode switch Peter Zijlstra
2026-06-05 12:40 ` [PATCH v3 2/7] sched/fair: Add cgroup_mode: up Peter Zijlstra
2026-06-05 15:07   ` Peter Zijlstra
2026-06-05 12:40 ` [PATCH v3 3/7] sched/fair: Add cgroup_mode: max Peter Zijlstra
2026-06-10 15:09   ` Waiman Long
2026-06-10 15:42     ` Waiman Long
2026-06-11 13:49       ` Peter Zijlstra
2026-06-11 13:47     ` Peter Zijlstra
2026-06-11 20:57       ` Waiman Long
2026-06-05 12:40 ` [PATCH v3 4/7] sched/fair: Add cgroup_mode: concur Peter Zijlstra
2026-06-05 12:40 ` [PATCH v3 5/7] sched/fair: Add cgroup_mode: tasks Peter Zijlstra
2026-06-05 12:40 ` [PATCH v3 6/7] sched/fair: Change the default cgroup_mode to concur Peter Zijlstra
2026-06-05 12:40 ` [PATCH v3 7/7] sched/eevdf: Move to a single runqueue Peter Zijlstra
2026-06-09  5:37 ` [PATCH v3 0/7] sched: Flatten the pick K Prateek Nayak
2026-06-12  2:29 ` Shubhang Kaushik
