[PATCH v2 00/10] sched: Flatten the pick

* [PATCH v2 00/10] sched: Flatten the pick
@ 2026-05-11 11:31 Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 01/10] sched/debug: Use char * instead of char (*)[] Peter Zijlstra
                   ` (12 more replies)
  0 siblings, 13 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Hi!

So cgroup scheduling has always been a pain in the arse. The problems start
with weight distribution and end with hierarchical picks and it all sucks.

The problems with weight distribution are related to that infernal global
fraction:

             tg->w * grq_i->w
   ge_i->w = ----------------
             \Sum_j grq_j->w

which we've approximated reasonably well by now. However, the immediate
consequence of this fraction is that the total group weight (tg->w) gets
fragmented across all your CPUs. At 64 CPUs that means your per-cpu cgroup
weight ends up being worth about one nice 19 task, and with more CPUs it gets
tinier still. Combined with the fact that 256 CPU systems are relatively common
these days, this becomes painful.
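
To put a number on that fragmentation, here is a minimal userspace sketch of
the fraction above. It is not kernel code; group_entity_weight() and its
parameters are hypothetical names, and weights are the unscaled values from
sched_prio_to_weight[] (nice 0 = 1024, nice 19 = 15).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace helper (not kernel code): evaluates
 *   ge_i->w = tg->w * grq_i->w / Sum_j grq_j->w
 * for CPU 'i', given the per-CPU group runqueue weights in grq_w[].
 */
long group_entity_weight(long tg_w, const long *grq_w,
			 size_t nr_cpus, size_t i)
{
	long sum = 0;

	assert(i < nr_cpus);

	for (size_t j = 0; j < nr_cpus; j++)
		sum += grq_w[j];

	/* An idle group has no weight to distribute. */
	if (sum == 0)
		return 0;

	return tg_w * grq_w[i] / sum;
}
```

With tg_w = 1024 and 64 CPUs carrying equal load this gives 1024 / 64 = 16 per
CPU, right next to the weight of a single nice 19 task (15).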

The common 'solution' is to inflate the group weight by 'nr_cpus'; the
immediate problem with that is that when all load of a group gets concentrated
on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
exceeding nice -20.
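
The arithmetic behind that is short. A rough sketch, assuming a group with the
nice-0 weight of 1024 and using the unscaled sched_prio_to_weight[] values; the
helper names are made up for illustration:

```c
#include <assert.h>

/* Unscaled weights from the kernel's sched_prio_to_weight[] table. */
#define NICE_0_WEIGHT	1024L
#define NICE_M20_WEIGHT	88761L

/*
 * Hypothetical helper: weight seen on one CPU when the group weight is
 * inflated by nr_cpus and all of the group's load sits on that CPU, so
 * the fraction grq_i->w / Sum_j grq_j->w is 1.
 */
long inflated_single_cpu_weight(long tg_w, int nr_cpus)
{
	assert(nr_cpus > 0);
	return tg_w * nr_cpus;
}

/* Smallest CPU count at which the inflated weight exceeds nice -20. */
int cpus_to_exceed_nice_m20(long tg_w)
{
	assert(tg_w > 0);
	return (int)(NICE_M20_WEIGHT / tg_w) + 1;
}
```

For a 1024 weight group the crossover is at 87 CPUs (87 * 1024 = 89088 >
88761), and on a 256 CPU machine the concentrated weight is 262144, about
three times nice -20.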

Additionally there are numerical limits on the max weight you can have before
the math starts suffering overflows. As such there is a definite limit on the
total group weight. Which has annoyed people ;-)

The first few patches add a knob /debug/sched/cgroup_mode and a few different
options on how to deal with this. My favourite is 'concur', but obviously that
is also the most expensive one :-/ It adds a tg->tasks counter which makes the
update_tg_load_avg() thing more expensive.

I have some ideas but I figured I ought to share these things before sinking
more time into it.

On to the hierarchical pick; this has been causing trouble for a very long
time. So once again an attempt at flattening it. The basic idea is to keep the
full hierarchical load tracking as-is, but keep all the runnable entities in a
single level. The immediate consequence of all this is of course that we need to
constantly re-compute the effective weight of each entity as things progress.

Reweight is done on:
 - enqueue
 - pick -- or rather set_next_entity(.first=true)
 - tick

So while the {en,de}queue operations are still O(depth) due to the full
accounting mess, the pick is now a single level, removing the intermediate
levels that obscure runnability etc.
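
One plausible reading of 'effective weight' is the task's weight scaled by the
share its group holds at every level of the hierarchy. The sketch below shows
only that reading; struct level and flat_weight() are hypothetical names, and
the computation the series actually uses is not shown in this part of the
thread.

```c
#include <assert.h>
#include <stddef.h>

/* One level of the cgroup hierarchy as seen from a single CPU (hypothetical). */
struct level {
	long ge_weight;		/* weight of the group's entity in its parent */
	long grq_weight;	/* total weight queued on the group's runqueue */
};

/*
 * Scale a task's weight by the share it holds at every level, walking
 * from the task's own group (lv[0]) up towards the root.
 */
long flat_weight(long task_weight, const struct level *lv, size_t depth)
{
	long w = task_weight;

	for (size_t i = 0; i < depth; i++) {
		assert(lv[i].grq_weight > 0);
		w = w * lv[i].ge_weight / lv[i].grq_weight;
	}

	return w;
}
```

The walk is O(depth), which is the kind of cost that stays with enqueue and
tick; once every entity carries such a flat weight, the pick can compare them
in a single level.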

For testing, I did a little experiment: I dug out what is colloquially known
as a potato, a trusty old Sandybridge 2600k with an RX 580, and ran a game on
it. From GOG, I had available 'Shadows: Awakening', a fun title that normally
runs really well on this machine (provided you stick to 1080p).

To make it interesting, I added 8 (one for each logical CPU) copies of: 'nice
spin.sh'; this results in the game becoming almost unplayable, as in proper
terrible.

I used MangoHUD to record a few minutes of playtime for statistics, and then
quit the game and re-started it with a shorter slice set (base/10). This
results in the game being entirely playable -- not great, but definitely
playable.

  Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
  Intel Core i7-2600K
  AMD Radeon RX 580

  Shadows Awakening (GOG)

              default   slice[*]

  FPS min       3.8      20.6
      avg      48.0      57.2
      max      87.4      80.3

  FT  min       9.4       8.4
      avg      34.5      19.5
      max     107.4      37.2

  FPS (Frames Per Second)
  FT  (FrameTime)

  [*] Command prefix: 'chrt -o --sched-runtime 280000 0'
      effectively setting 'base_slice_ns/10'

I have not compared to a kernel without flat on; I just wanted to run non-trivial
workloads and play with slice to make sure everything 'works'.

Can also be had:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat

 include/linux/cpuset.h |    6 
 include/linux/sched.h  |    1 
 kernel/cgroup/cpuset.c |   15 
 kernel/sched/core.c    |   47 --
 kernel/sched/debug.c   |  171 +++++---
 kernel/sched/fair.c    | 1038 ++++++++++++++++++++++---------------------------
 kernel/sched/pelt.c    |    6 
 kernel/sched/sched.h   |   44 --
 8 files changed, 672 insertions(+), 656 deletions(-)

---
Changes since v1 ( https://patch.msgid.link/20260317095113.387450089@infradead.org ):
 - various Sashiko thingies
 - rebase atop current -tip


* [PATCH v2 01/10] sched/debug: Use char * instead of char (*)[]
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 02/10] sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode Peter Zijlstra
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Some of the fancy AI robots are getting 'upset'.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -136,7 +136,7 @@ sched_feat_write(struct file *filp, cons
 	if (cnt > 63)
 		cnt = 63;
 
-	if (copy_from_user(&buf, ubuf, cnt))
+	if (copy_from_user(buf, ubuf, cnt))
 		return -EFAULT;
 
 	buf[cnt] = 0;
@@ -221,7 +221,7 @@ static ssize_t sched_dynamic_write(struc
 	if (cnt > 15)
 		cnt = 15;
 
-	if (copy_from_user(&buf, ubuf, cnt))
+	if (copy_from_user(buf, ubuf, cnt))
 		return -EFAULT;
 
 	buf[cnt] = 0;




* [PATCH v2 02/10] sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 01/10] sched/debug: Use char * instead of char (*)[] Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 03/10] sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sections Peter Zijlstra
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Robots figured out you can read and write this concurrently and got
'upset'. Gemini even noted sched_dynamic_show() can generate
'confusing' output if it observed different values during the
printing.
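
The 'confusing' output is easiest to see in a userspace analogue. A minimal
sketch, using C11 relaxed atomics in place of READ_ONCE()/WRITE_ONCE(); the
mode names and the set_mode()/show_modes() helpers are placeholders, not the
kernel's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical mode table; the names are placeholders. */
static const char *const mode_names[] = { "none", "voluntary", "full" };
#define NR_MODES (sizeof(mode_names) / sizeof(mode_names[0]))

static _Atomic int current_mode;

void set_mode(int mode)
{
	atomic_store_explicit(&current_mode, mode, memory_order_relaxed);
}

/*
 * Format all modes into 'out', wrapping the active one in parentheses.
 * The mode is loaded once; both parentheses are decided from that
 * snapshot, so a concurrent set_mode() cannot unbalance the output.
 */
int show_modes(char *out, size_t len)
{
	int mode = atomic_load_explicit(&current_mode, memory_order_relaxed);
	size_t off = 0;

	assert(out != NULL);

	for (size_t i = 0; i < NR_MODES; i++) {
		int active = ((int)i == mode);
		int n = snprintf(out + off, len - off, "%s%s%s ",
				 active ? "(" : "", mode_names[i],
				 active ? ")" : "");

		if (n < 0 || (size_t)n >= len - off)
			return -1;	/* buffer too small */
		off += (size_t)n;
	}

	return 0;
}
```

Had show_modes() re-read current_mode for each comparison, a writer racing
with it could make it print an opening parenthesis without the closing one, or
mark two entries.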

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   15 ++++++++-------
 kernel/sched/debug.c |    5 +++--
 2 files changed, 11 insertions(+), 9 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7743,7 +7743,7 @@ static void __sched_dynamic_update(int m
 		break;
 	}
 
-	preempt_dynamic_mode = mode;
+	WRITE_ONCE(preempt_dynamic_mode, mode);
 }
 
 void sched_dynamic_update(int mode)
@@ -7784,12 +7784,13 @@ static void __init preempt_dynamic_init(
 	}
 }
 
-# define PREEMPT_MODEL_ACCESSOR(mode) \
-	bool preempt_model_##mode(void)						 \
-	{									 \
-		WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
-		return preempt_dynamic_mode == preempt_dynamic_##mode;		 \
-	}									 \
+# define PREEMPT_MODEL_ACCESSOR(mode)					\
+	bool preempt_model_##mode(void)					\
+	{								\
+		int mode = READ_ONCE(preempt_dynamic_mode);		\
+		WARN_ON_ONCE(mode == preempt_dynamic_undefined);	\
+		return mode == preempt_dynamic_##mode;			\
+	}								\
 	EXPORT_SYMBOL_GPL(preempt_model_##mode)
 
 PREEMPT_MODEL_ACCESSOR(none);
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -239,6 +239,7 @@ static ssize_t sched_dynamic_write(struc
 static int sched_dynamic_show(struct seq_file *m, void *v)
 {
 	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
+	int mode = READ_ONCE(preempt_dynamic_mode);
 	int j;
 
 	/* Count entries in NULL terminated preempt_modes */
@@ -247,10 +248,10 @@ static int sched_dynamic_show(struct seq
 	j -= !IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY);
 
 	for (; i < j; i++) {
-		if (preempt_dynamic_mode == i)
+		if (mode == i)
 			seq_puts(m, "(");
 		seq_puts(m, preempt_modes[i]);
-		if (preempt_dynamic_mode == i)
+		if (mode == i)
 			seq_puts(m, ")");
 
 		seq_puts(m, " ");




* [PATCH v2 03/10] sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sections
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 01/10] sched/debug: Use char * instead of char (*)[] Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 02/10] sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 04/10] sched/fair: Add cgroup_mode switch Peter Zijlstra
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |   92 ++++++++++++++++++++++++---------------------------
 1 file changed, 44 insertions(+), 48 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -446,6 +446,8 @@ static const struct file_operations fair
 	.release	= single_release,
 };
 
+static struct dentry *debugfs_sched;
+
 #ifdef CONFIG_SCHED_CLASS_EXT
 static ssize_t
 sched_ext_server_runtime_write(struct file *filp, const char __user *ubuf,
@@ -478,75 +480,92 @@ static const struct file_operations ext_
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
-#endif /* CONFIG_SCHED_CLASS_EXT */
 
 static ssize_t
-sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
-			       size_t cnt, loff_t *ppos)
+sched_ext_server_period_write(struct file *filp, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 
 	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
-					&rq->fair_server);
+					&rq->ext_server);
 }
 
-static int sched_fair_server_period_show(struct seq_file *m, void *v)
+static int sched_ext_server_period_show(struct seq_file *m, void *v)
 {
 	unsigned long cpu = (unsigned long) m->private;
 	struct rq *rq = cpu_rq(cpu);
 
-	return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server);
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server);
 }
 
-static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
+static int sched_ext_server_period_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, sched_fair_server_period_show, inode->i_private);
+	return single_open(filp, sched_ext_server_period_show, inode->i_private);
 }
 
-static const struct file_operations fair_server_period_fops = {
-	.open		= sched_fair_server_period_open,
-	.write		= sched_fair_server_period_write,
+static const struct file_operations ext_server_period_fops = {
+	.open		= sched_ext_server_period_open,
+	.write		= sched_ext_server_period_write,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
 
-#ifdef CONFIG_SCHED_CLASS_EXT
+static void debugfs_ext_server_init(void)
+{
+	struct dentry *d_ext;
+	unsigned long cpu;
+
+	d_ext = debugfs_create_dir("ext_server", debugfs_sched);
+	if (!d_ext)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct dentry *d_cpu;
+		char buf[32];
+
+		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
+		d_cpu = debugfs_create_dir(buf, d_ext);
+
+		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops);
+		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops);
+	}
+}
+#endif /* CONFIG_SCHED_CLASS_EXT */
+
 static ssize_t
-sched_ext_server_period_write(struct file *filp, const char __user *ubuf,
-			      size_t cnt, loff_t *ppos)
+sched_fair_server_period_write(struct file *filp, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
 {
 	long cpu = (long) ((struct seq_file *) filp->private_data)->private;
 	struct rq *rq = cpu_rq(cpu);
 
 	return sched_server_write_common(filp, ubuf, cnt, ppos, DL_PERIOD,
-					&rq->ext_server);
+					&rq->fair_server);
 }
 
-static int sched_ext_server_period_show(struct seq_file *m, void *v)
+static int sched_fair_server_period_show(struct seq_file *m, void *v)
 {
 	unsigned long cpu = (unsigned long) m->private;
 	struct rq *rq = cpu_rq(cpu);
 
-	return sched_server_show_common(m, v, DL_PERIOD, &rq->ext_server);
+	return sched_server_show_common(m, v, DL_PERIOD, &rq->fair_server);
 }
 
-static int sched_ext_server_period_open(struct inode *inode, struct file *filp)
+static int sched_fair_server_period_open(struct inode *inode, struct file *filp)
 {
-	return single_open(filp, sched_ext_server_period_show, inode->i_private);
+	return single_open(filp, sched_fair_server_period_show, inode->i_private);
 }
 
-static const struct file_operations ext_server_period_fops = {
-	.open		= sched_ext_server_period_open,
-	.write		= sched_ext_server_period_write,
+static const struct file_operations fair_server_period_fops = {
+	.open		= sched_fair_server_period_open,
+	.write		= sched_fair_server_period_write,
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
-#endif /* CONFIG_SCHED_CLASS_EXT */
-
-static struct dentry *debugfs_sched;
 
 static void debugfs_fair_server_init(void)
 {
@@ -569,29 +588,6 @@ static void debugfs_fair_server_init(voi
 	}
 }
 
-#ifdef CONFIG_SCHED_CLASS_EXT
-static void debugfs_ext_server_init(void)
-{
-	struct dentry *d_ext;
-	unsigned long cpu;
-
-	d_ext = debugfs_create_dir("ext_server", debugfs_sched);
-	if (!d_ext)
-		return;
-
-	for_each_possible_cpu(cpu) {
-		struct dentry *d_cpu;
-		char buf[32];
-
-		snprintf(buf, sizeof(buf), "cpu%lu", cpu);
-		d_cpu = debugfs_create_dir(buf, d_ext);
-
-		debugfs_create_file("runtime", 0644, d_cpu, (void *) cpu, &ext_server_runtime_fops);
-		debugfs_create_file("period", 0644, d_cpu, (void *) cpu, &ext_server_period_fops);
-	}
-}
-#endif /* CONFIG_SCHED_CLASS_EXT */
-
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;




* [PATCH v2 04/10] sched/fair: Add cgroup_mode switch
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (2 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 03/10] sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sections Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 05/10] sched/fair: Add cgroup_mode: UP Peter Zijlstra
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Since calc_group_shares() has issues with 'many' CPUs, specifically the
computed shares value gets to be roughly 1/nr_cpus, prepare to add a few
alternative methods.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |    1 
 2 files changed, 75 insertions(+)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -588,6 +588,76 @@ static void debugfs_fair_server_init(voi
 	}
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+int cgroup_mode = 0;
+
+static const char *cgroup_mode_str[] = {
+	"smp",
+};
+
+static int sched_cgroup_mode(const char *str)
+{
+	for (int i = 0; i < ARRAY_SIZE(cgroup_mode_str); i++) {
+		if (!strcmp(str, cgroup_mode_str[i]))
+			return i;
+	}
+	return -EINVAL;
+}
+
+static ssize_t sched_cgroup_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	int mode;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
+
+	buf[cnt] = 0;
+	mode = sched_cgroup_mode(strstrip(buf));
+	if (mode < 0)
+		return mode;
+
+	WRITE_ONCE(cgroup_mode, mode);
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_cgroup_show(struct seq_file *m, void *v)
+{
+	int mode = READ_ONCE(cgroup_mode);
+
+	for (int i = 0; i < ARRAY_SIZE(cgroup_mode_str); i++) {
+		if (mode == i)
+			seq_puts(m, "(");
+		seq_puts(m, cgroup_mode_str[i]);
+		if (mode == i)
+			seq_puts(m, ")");
+
+		seq_puts(m, " ");
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static int sched_cgroup_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_cgroup_show, NULL);
+}
+
+static const struct file_operations sched_cgroup_fops = {
+	.open		= sched_cgroup_open,
+	.write		= sched_cgroup_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
 static __init int sched_init_debug(void)
 {
 	struct dentry __maybe_unused *numa;
@@ -625,6 +695,10 @@ static __init int sched_init_debug(void)
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	debugfs_create_file("cgroup_mode", 0644, debugfs_sched, NULL, &sched_cgroup_fops);
+#endif
+
 	debugfs_fair_server_init();
 #ifdef CONFIG_SCHED_CLASS_EXT
 	debugfs_ext_server_init();
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -565,6 +565,7 @@ static inline struct task_group *css_tg(
 extern int tg_nop(struct task_group *tg, void *data);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+extern int cgroup_mode;
 extern void free_fair_sched_group(struct task_group *tg);
 extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
 extern void online_fair_sched_group(struct task_group *tg);




* [PATCH v2 05/10] sched/fair: Add cgroup_mode: UP
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (3 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 04/10] sched/fair: Add cgroup_mode switch Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 06/10] sched/fair: Add cgroup_mode: MAX Peter Zijlstra
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Instead of calculating the proportional fraction of tg->shares for
each CPU, just give each CPU the full measure, ignoring these pesky
SMP problems.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    3 ++-
 kernel/sched/fair.c  |   21 ++++++++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -589,9 +589,10 @@ static void debugfs_fair_server_init(voi
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-int cgroup_mode = 0;
+int cgroup_mode = 1;
 
 static const char *cgroup_mode_str[] = {
+	"up",
 	"smp",
 };
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4150,7 +4150,7 @@ static inline int throttled_hierarchy(st
  *
  * hence icky!
  */
-static long calc_group_shares(struct cfs_rq *cfs_rq)
+static long calc_smp_shares(struct cfs_rq *cfs_rq)
 {
 	long tg_weight, tg_shares, load, shares;
 	struct task_group *tg = cfs_rq->tg;
@@ -4185,6 +4185,25 @@ static long calc_group_shares(struct cfs
 }
 
 /*
+ * Ignore this pesky SMP stuff, use (4).
+ */
+static long calc_up_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	return READ_ONCE(tg->shares);
+}
+
+static long calc_group_shares(struct cfs_rq *cfs_rq)
+{
+	int mode = READ_ONCE(cgroup_mode);
+
+	if (mode == 0)
+		return calc_up_shares(cfs_rq);
+
+	return calc_smp_shares(cfs_rq);
+}
+
+/*
  * Recomputes the group entity based on the current state of its group
  * runqueue.
  */




* [PATCH v2 06/10] sched/fair: Add cgroup_mode: MAX
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (4 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 05/10] sched/fair: Add cgroup_mode: UP Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 07/10] sched/fair: Add cgroup_mode: CONCUR Peter Zijlstra
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

In order to avoid the CPU shares becoming tiny '1 / nr_cpus', assume each
cgroup is maximally concurrent and distribute 'nr_cpus * tg->shares',
such that each CPU ends up with a 'tg->shares' sized fraction (on
average).

There is the corner case, when a cgroup is minimally loaded, e.g. a
single spinner, therefore limit the CPU shares to that of a nice -20
task to avoid getting too much load.
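
A simplified userspace model of the two calculations, for playing with the
numbers. It is a sketch only: frac_shares() stands in for __calc_smp_shares()
but takes the per-CPU load and the group's total load directly as arguments
(the kernel derives them from cfs_rq and tg state), and weights are unscaled.

```c
#include <assert.h>

#define MIN_SHARES	2L
#define NICE_M20_WEIGHT	88761L	/* sched_prio_to_weight[0], unscaled */

static long clamp_long(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Simplified stand-in for __calc_smp_shares(): this CPU's fraction of
 * tg_shares, clamped to [MIN_SHARES, shares_max].
 *   load    - this CPU's group runqueue load
 *   tg_load - the group's load summed over all CPUs
 */
long frac_shares(long tg_shares, long load, long tg_load, long shares_max)
{
	long shares = tg_shares * load;

	assert(load >= 0 && tg_load >= 0);

	if (tg_load)
		shares /= tg_load;

	return clamp_long(shares, MIN_SHARES, shares_max);
}

/* "smp": fraction(tg->shares) */
long smp_shares(long tg_shares, long load, long tg_load)
{
	return frac_shares(tg_shares, load, tg_load, tg_shares);
}

/* "max": min(fraction(nr_cpus * tg->shares), nice -20) */
long max_mode_shares(long tg_shares, int nr_cpus, long load, long tg_load)
{
	return frac_shares(tg_shares * nr_cpus, load, tg_load, NICE_M20_WEIGHT);
}
```

With 64 evenly loaded CPUs and tg_shares = 1024, "smp" yields 16 per CPU while
"max" yields 1024. A single spinner on a 512 CPU machine would compute
512 * 1024 under "max" and gets clipped to 88761.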

It was previously suggested to allow raising cpu.weight to '100 * nr_cpus'
to combat this same problem, but the problem there is the above corner case:
allowing multiple cgroups with such immense weight onto the runqueue causes
significant problems.

It would drown the kthreads, but it also risks overflowing the load values.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/cpuset.h |    6 +++++
 kernel/cgroup/cpuset.c |   15 ++++++++++++++
 kernel/sched/debug.c   |    1 
 kernel/sched/fair.c    |   52 ++++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 69 insertions(+), 5 deletions(-)

--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -80,6 +80,7 @@ extern void lockdep_assert_cpuset_lock_h
 extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
+extern int cpuset_num_cpus(struct cgroup *cgroup);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -216,6 +217,11 @@ static inline bool cpuset_cpus_allowed_f
 	return false;
 }
 
+static inline int cpuset_num_cpus(struct cgroup *cgroup)
+{
+	return num_online_cpus();
+}
+
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4100,6 +4100,21 @@ bool cpuset_cpus_allowed_fallback(struct
 	return changed;
 }
 
+int cpuset_num_cpus(struct cgroup *cgrp)
+{
+	int nr = num_online_cpus();
+	struct cpuset *cs;
+
+	if (is_in_v2_mode()) {
+		guard(rcu)();
+		cs = css_cs(cgroup_e_css(cgrp, &cpuset_cgrp_subsys));
+		if (cs)
+			nr = cpumask_weight(cs->effective_cpus);
+	}
+
+	return nr;
+}
+
 void __init cpuset_init_current_mems_allowed(void)
 {
 	nodes_setall(current->mems_allowed);
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -594,6 +594,7 @@ int cgroup_mode = 1;
 static const char *cgroup_mode_str[] = {
 	"up",
 	"smp",
+	"max",
 };
 
 static int sched_cgroup_mode(const char *str)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4150,12 +4150,10 @@ static inline int throttled_hierarchy(st
  *
  * hence icky!
  */
-static long calc_smp_shares(struct cfs_rq *cfs_rq)
+static long __calc_smp_shares(struct cfs_rq *cfs_rq, long tg_shares, long shares_max)
 {
-	long tg_weight, tg_shares, load, shares;
 	struct task_group *tg = cfs_rq->tg;
-
-	tg_shares = READ_ONCE(tg->shares);
+	long tg_weight, load, shares;
 
 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
@@ -4181,7 +4179,48 @@ static long calc_smp_shares(struct cfs_r
 	 * case no task is runnable on a CPU MIN_SHARES=2 should be returned
 	 * instead of 0.
 	 */
-	return clamp_t(long, shares, MIN_SHARES, tg_shares);
+	return clamp_t(long, shares, MIN_SHARES, shares_max);
+}
+
+static int tg_cpus(struct task_group *tg)
+{
+	int nr = num_online_cpus();
+
+	if (cpusets_enabled()) {
+		struct cgroup *cgrp = tg->css.cgroup;
+		if (cgrp)
+			nr = cpuset_num_cpus(cgrp);
+	}
+
+	return nr;
+}
+
+/*
+ * Func: min(fraction(num_cpus * tg->shares), nice -20)
+ *
+ * Scale tg->shares by the maximal number of CPUs; but clip the max shares at
+ * nice -20, otherwise a single spinner on a 512 CPU machine would result in
+ * 512*NICE_0_LOAD, which is also crazy.
+ */
+static long calc_max_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	int nr = tg_cpus(tg);
+	long tg_shares = READ_ONCE(tg->shares);
+	long max_shares = scale_load(sched_prio_to_weight[0]);
+	return __calc_smp_shares(cfs_rq, tg_shares * nr, max_shares);
+}
+
+/*
+ * Func: fraction(tg->shares)
+ *
+ * This infamously results in tiny shares when you have many CPUs.
+ */
+static long calc_smp_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long tg_shares = READ_ONCE(tg->shares);
+	return __calc_smp_shares(cfs_rq, tg_shares, tg_shares);
 }
 
 /*
@@ -4200,6 +4239,9 @@ static long calc_group_shares(struct cfs
 	if (mode == 0)
 		return calc_up_shares(cfs_rq);
 
+	if (mode == 2)
+		return calc_max_shares(cfs_rq);
+
 	return calc_smp_shares(cfs_rq);
 }
 




* [PATCH v2 07/10] sched/fair: Add cgroup_mode: CONCUR
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (5 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 06/10] sched/fair: Add cgroup_mode: MAX Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

A variation of MAX; where instead of assuming maximal concurrency, this scales
with 'min(nr_tasks, nr_cpus)'. This handles the low concurrency cases more
gracefully, with the exception of CPU affinity.

Note: the tracking of tg->tasks is somewhat expensive :-/
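
The same kind of userspace model for this mode, again a sketch under stated
assumptions: concur_shares() is a hypothetical name, the per-CPU load, total
group load and task count are passed in directly, and weights are unscaled.

```c
#include <assert.h>

#define MIN_SHARES	2L
#define NICE_M20_WEIGHT	88761L	/* sched_prio_to_weight[0], unscaled */

/*
 * "concur": min(fraction(num * tg->shares), nice -20), where
 *           num = min(nr_tasks, nr_cpus)
 *   load    - this CPU's group runqueue load
 *   tg_load - the group's load summed over all CPUs
 */
long concur_shares(long tg_shares, int nr_tasks, int nr_cpus,
		   long load, long tg_load)
{
	long shares;
	int nr;

	assert(nr_cpus > 0 && load >= 0 && tg_load >= 0);

	/* Count at least one task, as tg_tasks() does. */
	if (nr_tasks < 1)
		nr_tasks = 1;
	nr = nr_tasks < nr_cpus ? nr_tasks : nr_cpus;

	shares = nr * tg_shares * load;
	if (tg_load)
		shares /= tg_load;

	if (shares < MIN_SHARES)
		return MIN_SHARES;
	if (shares > NICE_M20_WEIGHT)
		return NICE_M20_WEIGHT;
	return shares;
}
```

A single spinner on 512 CPUs now gets 1024 instead of the nice -20 clip. The
affinity exception shows up when, say, 4 tasks are pinned to one CPU: num is
4 but all load sits on that CPU, so it ends up with 4096.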

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/debug.c |    1 +
 kernel/sched/fair.c  |   39 ++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h |    3 +++
 3 files changed, 40 insertions(+), 3 deletions(-)

--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -594,6 +594,7 @@ int cgroup_mode = 1;
 static const char *cgroup_mode_str[] = {
 	"up",
 	"smp",
+	"concur",
 	"max",
 };
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4211,6 +4211,30 @@ static long calc_max_shares(struct cfs_r
 	return __calc_smp_shares(cfs_rq, tg_shares * nr, max_shares);
 }
 
+static inline int tg_tasks(struct task_group *tg)
+{
+	return max(1, atomic_long_read(&tg->tasks));
+}
+
+/*
+ * Func: min(fraction(num * tg->shares), nice -20); where
+ *       num = min(nr_tasks, nr_cpus)
+ *
+ * Similar to max, except scale with min(nr_tasks, nr_cpus), which gives
+ * a far more natural distribution. Can still create edge case using CPU
+ * affinity.
+ */
+static long calc_concur_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg = cfs_rq->tg;
+	int nr_cpus = tg_cpus(tg);
+	int nr_tasks = tg_tasks(tg);
+	int nr = min(nr_tasks, nr_cpus);
+	long tg_shares = READ_ONCE(tg->shares);
+	long max_shares = scale_load(sched_prio_to_weight[0]);
+	return __calc_smp_shares(cfs_rq, nr * tg_shares, max_shares);
+}
+
 /*
  * Func: fraction(tg->shares)
  *
@@ -4240,6 +4264,9 @@ static long calc_group_shares(struct cfs
 		return calc_up_shares(cfs_rq);
 
 	if (mode == 2)
+		return calc_concur_shares(cfs_rq);
+
+	if (mode == 3)
 		return calc_max_shares(cfs_rq);
 
 	return calc_smp_shares(cfs_rq);
@@ -4385,7 +4412,7 @@ static inline bool cfs_rq_is_decayed(str
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta;
+	long delta, dt;
 	u64 now;
 
 	/*
@@ -4407,16 +4434,19 @@ static inline void update_tg_load_avg(st
 		return;
 
 	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
-	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+	dt = cfs_rq->h_nr_queued - cfs_rq->tg_tasks_contrib;
+	if (dt || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		atomic_long_add(dt, &cfs_rq->tg->tasks);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->tg_tasks_contrib = cfs_rq->h_nr_queued;
 		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
 static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta;
+	long delta, dt;
 	u64 now;
 
 	/*
@@ -4427,8 +4457,11 @@ static inline void clear_tg_load_avg(str
 
 	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
 	delta = 0 - cfs_rq->tg_load_avg_contrib;
+	dt = 0 - cfs_rq->tg_tasks_contrib;
 	atomic_long_add(delta, &cfs_rq->tg->load_avg);
+	atomic_long_add(dt, &cfs_rq->tg->tasks);
 	cfs_rq->tg_load_avg_contrib = 0;
+	cfs_rq->tg_tasks_contrib = 0;
 	cfs_rq->last_update_tg_load_avg = now;
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -491,6 +491,8 @@ struct task_group {
 	 * will also be accessed at each tick.
 	 */
 	atomic_long_t		load_avg ____cacheline_aligned;
+	atomic_long_t		tasks;
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -720,6 +722,7 @@ struct cfs_rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
+	unsigned long		tg_tasks_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
 




* [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (6 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 07/10] sched/fair: Add cgroup_mode: CONCUR Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-12  5:37   ` K Prateek Nayak
                     ` (2 more replies)
  2026-05-11 11:31 ` [PATCH v2 09/10] sched: Remove sched_class::pick_next_task() Peter Zijlstra
                   ` (4 subsequent siblings)
  12 siblings, 3 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

With commit 50653216e4ff ("sched: Add support to pick functions to
take rf") removing the balance callback, the pick_task() callback is
in charge of newidle balancing.

This means pick_task_fair() should do so too. This hasn't been a
problem in practice because pick_next_task_fair() is used. However,
since we'll be removing that one shortly, make sure pick_task_fair()
is up to scratch.
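
As a rough illustration of the resulting control flow (a toy model,
not the kernel code): the pick has three outcomes once the runqueue is
empty, depending on what the newidle balance reports. All toy_* names
below are made up for this sketch; the balance hook stands in for
sched_balance_newidle(), which can drop and re-take rq->lock so that a
higher priority task may have appeared by the time it returns.

```c
#include <stddef.h>

struct toy_task { int id; };

/* Sentinel meaning "restart the whole pick"; stands in for RETRY_TASK. */
#define TOY_RETRY_TASK ((struct toy_task *)-1L)

struct toy_rq {
	struct toy_task *queued;		/* at most one queued task */
	int (*balance)(struct toy_rq *rq);	/* newidle balance stand-in */
};

static struct toy_task toy_pulled = { 42 };

/* Balance found nothing to pull. */
static int toy_balance_none(struct toy_rq *rq)
{
	(void)rq;
	return 0;
}

/* A higher priority task showed up while the lock was dropped. */
static int toy_balance_retry(struct toy_rq *rq)
{
	(void)rq;
	return -1;
}

/* Balance pulled one task onto this runqueue. */
static int toy_balance_pull(struct toy_rq *rq)
{
	rq->queued = &toy_pulled;
	return 1;
}

static struct toy_task *toy_pick_task(struct toy_rq *rq)
{
	int new_tasks;

again:
	if (rq->queued)
		return rq->queued;

	new_tasks = rq->balance(rq);
	if (new_tasks < 0)
		return TOY_RETRY_TASK;	/* caller restarts from the top class */
	if (new_tasks > 0)
		goto again;		/* something was pulled, pick it */
	return NULL;			/* genuinely idle */
}
```

With toy_balance_none the pick returns NULL, with toy_balance_retry it
returns the sentinel, and with toy_balance_pull it loops once and
returns the pulled task.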

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9215,16 +9215,18 @@ static void wakeup_preempt_fair(struct r
 }
 
 static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
+	__must_hold(__rq_lockp(rq))
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
 	struct task_struct *p;
 	bool throttled;
+	int new_tasks;
 
 again:
 	cfs_rq = &rq->cfs;
 	if (!cfs_rq->nr_queued)
-		return NULL;
+		goto idle;
 
 	throttled = false;
 
@@ -9245,6 +9247,14 @@ static struct task_struct *pick_task_fai
 	if (unlikely(throttled))
 		task_throttle_setup_work(p);
 	return p;
+
+idle:
+	new_tasks = sched_balance_newidle(rq, rf);
+	if (new_tasks < 0)
+		return RETRY_TASK;
+	if (new_tasks > 0)
+		goto again;
+	return NULL;
 }
 
 static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
@@ -9256,12 +9266,12 @@ pick_next_task_fair(struct rq *rq, struc
 {
 	struct sched_entity *se;
 	struct task_struct *p;
-	int new_tasks;
 
-again:
 	p = pick_task_fair(rq, rf);
+	if (unlikely(p == RETRY_TASK))
+		return p;
 	if (!p)
-		goto idle;
+		return p;
 	se = &p->se;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9311,29 +9321,11 @@ pick_next_task_fair(struct rq *rq, struc
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 	put_prev_set_next_task(rq, prev, p);
 	return p;
-
-idle:
-	if (rf) {
-		new_tasks = sched_balance_newidle(rq, rf);
-
-		/*
-		 * Because sched_balance_newidle() releases (and re-acquires)
-		 * rq->lock, it is possible for any higher priority task to
-		 * appear. In that case we must re-start the pick_next_entity()
-		 * loop.
-		 */
-		if (new_tasks < 0)
-			return RETRY_TASK;
-
-		if (new_tasks > 0)
-			goto again;
-	}
-
-	return NULL;
 }
 
 static struct task_struct *
 fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
+	__must_hold(__rq_lockp(dl_se->rq))
 {
 	return pick_task_fair(dl_se->rq, rf);
 }




* [PATCH v2 09/10] sched: Remove sched_class::pick_next_task()
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (7 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-19 15:14   ` Vincent Guittot
  2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

The reason for pick_next_task_fair() is the put/set optimization that
avoids touching the common ancestors. However, it is possible to
implement this in the put_prev_task() and set_next_task() calls as
used in put_prev_set_next_task().

Notably, put_prev_set_next_task() is the only site that:

 - calls put_prev_task() with a .next argument;
 - calls set_next_task() with .first = true.

This means that put_prev_task() can determine the common hierarchy and
stop there, and then set_next_task() can terminate where put_prev_task()
stopped.
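
To get a feel for how much of the hierarchy this skips, here is a toy
model (hypothetical toy_* names, and a simplified walk that first
aligns depths rather than the exact loop in the patch below). It
counts how many levels of prev's chain need a put before prev and next
share a runqueue; everything above that level is left alone.

```c
#include <stddef.h>

struct toy_se {
	struct toy_se *parent;	/* group entity we are queued in, NULL at root */
	int depth;		/* 0 for entities queued on the root runqueue */
};

/* Entities share a runqueue when they hang off the same parent. */
static int toy_same_group(const struct toy_se *a, const struct toy_se *b)
{
	return a->parent == b->parent;
}

/*
 * Number of levels of prev's chain that change 'current' when
 * switching from prev to next, including the level where both
 * chains become siblings.
 */
static int toy_put_levels(const struct toy_se *prev, const struct toy_se *next)
{
	int puts = 0;

	/* Levels that only exist below next's depth. */
	while (prev->depth > next->depth) {
		puts++;
		prev = prev->parent;
	}
	while (next->depth > prev->depth)
		next = next->parent;

	/* Ascend in lockstep until both are siblings. */
	while (!toy_same_group(prev, next)) {
		puts++;
		prev = prev->parent;
		next = next->parent;
	}

	return puts + 1;
}
```

For two tasks in the same group this yields 1 no matter how deep the
group sits, whereas a full walk would touch prev->depth + 1 levels.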

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   27 +++------
 kernel/sched/fair.c  |  139 +++++++++++++++++----------------------------------
 kernel/sched/sched.h |   14 -----
 3 files changed, 57 insertions(+), 123 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5980,16 +5980,15 @@ __pick_next_task(struct rq *rq, struct t
 	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
 		   rq->nr_running == rq->cfs.h_nr_queued)) {
 
-		p = pick_next_task_fair(rq, prev, rf);
+		p = pick_task_fair(rq, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart;
 
 		/* Assume the next prioritized class is idle_sched_class */
-		if (!p) {
+		if (!p)
 			p = pick_task_idle(rq, rf);
-			put_prev_set_next_task(rq, prev, p);
-		}
 
+		put_prev_set_next_task(rq, prev, p);
 		return p;
 	}
 
@@ -5997,20 +5996,12 @@ __pick_next_task(struct rq *rq, struct t
 	prev_balance(rq, prev, rf);
 
 	for_each_active_class(class) {
-		if (class->pick_next_task) {
-			p = class->pick_next_task(rq, prev, rf);
-			if (unlikely(p == RETRY_TASK))
-				goto restart;
-			if (p)
-				return p;
-		} else {
-			p = class->pick_task(rq, rf);
-			if (unlikely(p == RETRY_TASK))
-				goto restart;
-			if (p) {
-				put_prev_set_next_task(rq, prev, p);
-				return p;
-			}
+		p = class->pick_task(rq, rf);
+		if (unlikely(p == RETRY_TASK))
+			goto restart;
+		if (p) {
+			put_prev_set_next_task(rq, prev, p);
+			return p;
 		}
 	}
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9214,7 +9214,7 @@ static void wakeup_preempt_fair(struct r
 	resched_curr_lazy(rq);
 }
 
-static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
+struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
 	struct sched_entity *se;
@@ -9257,72 +9257,6 @@ static struct task_struct *pick_task_fai
 	return NULL;
 }
 
-static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
-static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
-
-struct task_struct *
-pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
-	__must_hold(__rq_lockp(rq))
-{
-	struct sched_entity *se;
-	struct task_struct *p;
-
-	p = pick_task_fair(rq, rf);
-	if (unlikely(p == RETRY_TASK))
-		return p;
-	if (!p)
-		return p;
-	se = &p->se;
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (prev->sched_class != &fair_sched_class)
-		goto simple;
-
-	__put_prev_set_next_dl_server(rq, prev, p);
-
-	/*
-	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
-	 * likely that a next task is from the same cgroup as the current.
-	 *
-	 * Therefore attempt to avoid putting and setting the entire cgroup
-	 * hierarchy, only change the part that actually changes.
-	 *
-	 * Since we haven't yet done put_prev_entity and if the selected task
-	 * is a different task than we started out with, try and touch the
-	 * least amount of cfs_rqs.
-	 */
-	if (prev != p) {
-		struct sched_entity *pse = &prev->se;
-		struct cfs_rq *cfs_rq;
-
-		while (!(cfs_rq = is_same_group(se, pse))) {
-			int se_depth = se->depth;
-			int pse_depth = pse->depth;
-
-			if (se_depth <= pse_depth) {
-				put_prev_entity(cfs_rq_of(pse), pse);
-				pse = parent_entity(pse);
-			}
-			if (se_depth >= pse_depth) {
-				set_next_entity(cfs_rq_of(se), se, true);
-				se = parent_entity(se);
-			}
-		}
-
-		put_prev_entity(cfs_rq, pse);
-		set_next_entity(cfs_rq, se, true);
-
-		__set_next_task_fair(rq, p, true);
-	}
-
-	return p;
-
-simple:
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-	put_prev_set_next_task(rq, prev, p);
-	return p;
-}
-
 static struct task_struct *
 fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
 	__must_hold(__rq_lockp(dl_se->rq))
@@ -9346,10 +9280,33 @@ static void put_prev_task_fair(struct rq
 {
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
+	struct sched_entity *nse = NULL;
 
-	for_each_sched_entity(se) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (next && next->sched_class == &fair_sched_class)
+		nse = &next->se;
+#endif
+
+	while (se) {
 		cfs_rq = cfs_rq_of(se);
-		put_prev_entity(cfs_rq, se);
+		if (!nse || cfs_rq->curr)
+			put_prev_entity(cfs_rq, se);
+#ifdef CONFIG_FAIR_GROUP_SCHED
+		if (nse) {
+			if (is_same_group(se, nse))
+				break;
+
+			int d = nse->depth - se->depth;
+			if (d >= 0) {
+				/* nse has equal or greater depth, ascend */
+				nse = parent_entity(nse);
+				/* if nse is the deeper, do not ascend se */
+				if (d > 0)
+					continue;
+			}
+		}
+#endif
+		se = parent_entity(se);
 	}
 }
 
@@ -13896,10 +13853,30 @@ static void switched_to_fair(struct rq *
 	}
 }
 
-static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
+/*
+ * Account for a task changing its policy or group.
+ *
+ * This routine is mostly called to set cfs_rq->curr field when a task
+ * migrates between groups/classes.
+ */
+static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_entity *se = &p->se;
 
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		if (IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) &&
+		    first && cfs_rq->curr)
+			break;
+
+		set_next_entity(cfs_rq, se, first);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		account_cfs_rq_runtime(cfs_rq, 0);
+	}
+
+	se = &p->se;
+
 	if (task_on_rq_queued(p)) {
 		/*
 		 * Move the next running task to the front of the list, so our
@@ -13919,27 +13896,6 @@ static void __set_next_task_fair(struct
 	sched_fair_update_stop_tick(rq, p);
 }
 
-/*
- * Account for a task changing its policy or group.
- *
- * This routine is mostly called to set cfs_rq->curr field when a task
- * migrates between groups/classes.
- */
-static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
-{
-	struct sched_entity *se = &p->se;
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-		set_next_entity(cfs_rq, se, first);
-		/* ensure bandwidth has been allocated on our new cfs_rq */
-		account_cfs_rq_runtime(cfs_rq, 0);
-	}
-
-	__set_next_task_fair(rq, p, first);
-}
-
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
@@ -14251,7 +14207,6 @@ DEFINE_SCHED_CLASS(fair) = {
 	.wakeup_preempt		= wakeup_preempt_fair,
 
 	.pick_task		= pick_task_fair,
-	.pick_next_task		= pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task          = set_next_task_fair,
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2555,17 +2555,6 @@ struct sched_class {
 	 * schedule/pick_next_task: rq->lock
 	 */
 	struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
-	/*
-	 * Optional! When implemented pick_next_task() should be equivalent to:
-	 *
-	 *   next = pick_task();
-	 *   if (next) {
-	 *       put_prev_task(prev);
-	 *       set_next_task_first(next);
-	 *   }
-	 */
-	struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
-					      struct rq_flags *rf);
 
 	/*
 	 * sched_change:
@@ -2789,8 +2778,7 @@ static inline bool sched_fair_runnable(s
 	return rq->cfs.nr_queued > 0;
 }
 
-extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
-					       struct rq_flags *rf);
+extern struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf);
 extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
 
 #define SCA_CHECK		0x01




* [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (8 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 09/10] sched: Remove sched_class::pick_next_task() Peter Zijlstra
@ 2026-05-11 11:31 ` Peter Zijlstra
  2026-05-11 16:21   ` K Prateek Nayak
                     ` (3 more replies)
  2026-05-11 19:23 ` [PATCH v2 00/10] sched: Flatten the pick Tejun Heo
                   ` (2 subsequent siblings)
  12 siblings, 4 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:31 UTC (permalink / raw)
  To: mingo
  Cc: longman, chenridong, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Change fair/cgroup to a single runqueue.

Infamously fair/cgroup isn't working for a number of people; typically
the complaint is latencies and/or overhead. The latency issue is due
to the intermediate entries that represent a combination of tasks and
thereby obfuscate the runnability of tasks.

The approach here is to leave the cgroup hierarchy as is; including
the intermediate enqueue/dequeue but move the actual EEVDF runqueue
outside. This means things like the shares_weight approximation are
fully preserved.

That is, given a hierarchy like:

        R
        |
        se--G1
            / \
      G2--se   se--G3
     / \           |
T1--se se--T2      se--T3

This is fully maintained for load tracking, however the EEVDF parts of
cfs_rq/se go unused for the intermediates and are instead connected
like:

     _R_
    / | \
   T1 T2 T3

Since the effective weight of the entities is determined by the
hierarchy, this gets recomputed on enqueue, set_next_task() and tick.

Notably, the effective weight (se->h_load) is computed from the
hierarchical fraction: se->load / cfs_rq->load.
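
A worked example for the hierarchy drawn above, as a standalone sketch
that mirrors the shape of __calc_prop_weight() in the patch below. The
toy_* names are made up, and NICE_0_LOAD is assumed to be the unscaled
1024 here to keep the numbers readable; MIN_SHARES is taken as 2.

```c
#include <stddef.h>

#define TOY_NICE_0_LOAD	1024UL	/* assumption: unscaled nice-0 weight */
#define TOY_MIN_SHARES	2UL

struct toy_se {
	unsigned long weight;		/* se->load.weight */
	unsigned long rq_weight;	/* load.weight of the cfs_rq we are on */
	struct toy_se *parent;		/* NULL when queued on the root */
};

/* One step of the product: scale by this level's se->load / cfs_rq->load. */
static unsigned long toy_prop_weight(const struct toy_se *se, unsigned long weight)
{
	weight *= se->weight;
	if (se->parent)
		weight /= se->rq_weight;
	else
		weight /= TOY_NICE_0_LOAD;

	return weight > TOY_MIN_SHARES ? weight : TOY_MIN_SHARES;
}

/* Walk from the task up to the root, accumulating the fraction. */
static unsigned long toy_h_weight(const struct toy_se *se)
{
	unsigned long weight = TOY_NICE_0_LOAD;

	for (; se; se = se->parent)
		weight = toy_prop_weight(se, weight);

	return weight;
}
```

With every se->load at 1024, G2 holding T1 and T2 (cfs_rq load 2048),
G3 holding T3 (1024) and G1 holding both group entities (2048), this
gives T1 = T2 = 256 and T3 = 512, which together add up to the 1024
that G1's entity carries on the root.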

Since EEVDF is now exclusively operating on rq->cfs, it needs to
consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
only tasks can get delayed, simplifying some of the cgroup cleanup.

One place where additional information was required was
set_next_task() / put_prev_task(), where we need to track 'current'
both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
(cfs_rq->curr).

As a result of only having a single level to pick from, many of the
complications in pick_next_task() and preemption go away.

Since many of the hierarchical operations are still there, this won't
immediately fix the performance issues, but hopefully it will fix some
of the latency issues.

TODO: split struct cfs_rq / struct sched_entity
TODO: try and get rid of h_curr

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    5 
 kernel/sched/debug.c  |    9 
 kernel/sched/fair.c   |  789 +++++++++++++++++++++-----------------------------
 kernel/sched/pelt.c   |    6 
 kernel/sched/sched.h  |   26 -
 6 files changed, 366 insertions(+), 470 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -575,6 +575,7 @@ struct sched_statistics {
 struct sched_entity {
 	/* For load-balancing: */
 	struct load_weight		load;
+	struct load_weight		h_load;
 	struct rb_node			run_node;
 	u64				deadline;
 	u64				min_vruntime;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5539,11 +5539,8 @@ EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
  */
 static inline void prefetch_curr_exec_start(struct task_struct *p)
 {
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	struct sched_entity *curr = p->se.cfs_rq->curr;
-#else
 	struct sched_entity *curr = task_rq(p)->cfs.curr;
-#endif
+
 	prefetch(curr);
 	prefetch(&curr->exec_start);
 }
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -911,10 +911,11 @@ print_task(struct seq_file *m, struct rq
 	else
 		SEQ_printf(m, " %c", task_state_to_char(p));
 
-	SEQ_printf(m, " %15s %5d %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
+	SEQ_printf(m, " %15s %5d %10ld %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
 		p->comm, task_pid_nr(p),
+		p->se.h_load.weight,
 		SPLIT_NS(p->se.vruntime),
-		entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
+		entity_eligible(&rq->cfs, &p->se) ? 'E' : 'N',
 		SPLIT_NS(p->se.deadline),
 		p->se.custom_slice ? 'S' : ' ',
 		SPLIT_NS(p->se.slice),
@@ -943,7 +944,7 @@ static void print_rq(struct seq_file *m,
 
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, "runnable tasks:\n");
-	SEQ_printf(m, " S            task   PID       vruntime   eligible    "
+	SEQ_printf(m, " S            task   PID     weight       vruntime   eligible    "
 		   "deadline             slice          sum-exec      switches  "
 		   "prio         wait-time        sum-sleep       sum-block"
 #ifdef CONFIG_NUMA_BALANCING
@@ -1051,6 +1052,8 @@ void print_cfs_rq(struct seq_file *m, in
 			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
 			atomic_long_read(&cfs_rq->tg->load_avg));
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
+			cfs_rq->h_load);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_CFS_BANDWIDTH
 	SEQ_printf(m, "  .%-30s: %d\n", "throttled",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -296,8 +296,8 @@ static u64 __calc_delta(u64 delta_exec,
  */
 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 {
-	if (unlikely(se->load.weight != NICE_0_LOAD))
-		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
+	if (se->h_load.weight != NICE_0_LOAD)
+		delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
 
 	return delta;
 }
@@ -427,38 +427,6 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
-static void
-find_matching_se(struct sched_entity **se, struct sched_entity **pse)
-{
-	int se_depth, pse_depth;
-
-	/*
-	 * preemption test can be made between sibling entities who are in the
-	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
-	 * both tasks until we find their ancestors who are siblings of common
-	 * parent.
-	 */
-
-	/* First walk up until both entities are at same depth */
-	se_depth = (*se)->depth;
-	pse_depth = (*pse)->depth;
-
-	while (se_depth > pse_depth) {
-		se_depth--;
-		*se = parent_entity(*se);
-	}
-
-	while (pse_depth > se_depth) {
-		pse_depth--;
-		*pse = parent_entity(*pse);
-	}
-
-	while (!is_same_group(*se, *pse)) {
-		*se = parent_entity(*se);
-		*pse = parent_entity(*pse);
-	}
-}
-
 static int tg_is_idle(struct task_group *tg)
 {
 	return tg->idle > 0;
@@ -502,11 +470,6 @@ static inline struct sched_entity *paren
 	return NULL;
 }
 
-static inline void
-find_matching_se(struct sched_entity **se, struct sched_entity **pse)
-{
-}
-
 static inline int tg_is_idle(struct task_group *tg)
 {
 	return 0;
@@ -685,7 +648,7 @@ static inline unsigned long avg_vruntime
 static inline void
 __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 	s64 w_vruntime, key = entity_key(cfs_rq, se);
 
 	w_vruntime = key * weight;
@@ -702,7 +665,7 @@ sum_w_vruntime_add_paranoid(struct cfs_r
 	s64 key, tmp;
 
 again:
-	weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 	key = entity_key(cfs_rq, se);
 
 	if (check_mul_overflow(key, weight, &key))
@@ -748,7 +711,7 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
 static void
 sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+	unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 	s64 key = entity_key(cfs_rq, se);
 
 	cfs_rq->sum_w_vruntime -= key * weight;
@@ -790,7 +753,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 		s64 runtime = cfs_rq->sum_w_vruntime;
 
 		if (curr) {
-			unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
+			unsigned long w = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
 
 			runtime += entity_key(cfs_rq, curr) * w;
 			weight += w;
@@ -861,8 +824,6 @@ bool update_entity_lag(struct cfs_rq *cf
 	u64 avruntime = avg_vruntime(cfs_rq);
 	s64 vlag = entity_lag(cfs_rq, se, avruntime);
 
-	WARN_ON_ONCE(!se->on_rq);
-
 	if (se->sched_delayed) {
 		/* previous vlag < 0 otherwise se would not be delayed */
 		vlag = max(vlag, se->vlag);
@@ -898,7 +859,7 @@ static int vruntime_eligible(struct cfs_
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
-		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
+		unsigned long weight = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
 
 		avg += entity_key(cfs_rq, curr) * weight;
 		load += weight;
@@ -1039,6 +1000,9 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
  */
 static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
+	WARN_ON_ONCE(!entity_is_task(se));
+
 	sum_w_vruntime_add(cfs_rq, se);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
@@ -1048,6 +1012,9 @@ static void __enqueue_entity(struct cfs_
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
+	WARN_ON_ONCE(!entity_is_task(se));
+
 	rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				  &min_vruntime_cb);
 	sum_w_vruntime_sub(cfs_rq, se);
@@ -1144,7 +1111,7 @@ static struct sched_entity *pick_eevdf(s
 	 * We can safely skip eligibility check if there is only one entity
 	 * in this cfs_rq, saving some cycles.
 	 */
-	if (cfs_rq->nr_queued == 1)
+	if (cfs_rq->h_nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
 	/*
@@ -1391,8 +1358,6 @@ static s64 update_se(struct rq *rq, stru
 	return delta_exec;
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Used by other classes to account runtime.
  */
@@ -1412,7 +1377,7 @@ static void update_curr(struct cfs_rq *c
 	 * not necessarily be the actual task running
 	 * (rq->curr.se). This is easy to confuse!
 	 */
-	struct sched_entity *curr = cfs_rq->curr;
+	struct sched_entity *curr = cfs_rq->h_curr;
 	struct rq *rq = rq_of(cfs_rq);
 	s64 delta_exec;
 	bool resched;
@@ -1424,26 +1389,29 @@ static void update_curr(struct cfs_rq *c
 	if (unlikely(delta_exec <= 0))
 		return;
 
+	account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+	if (!entity_is_task(curr))
+		return;
+
+	cfs_rq = &rq->cfs;
+
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	resched = update_deadline(cfs_rq, curr);
 
-	if (entity_is_task(curr)) {
-		/*
-		 * If the fair_server is active, we need to account for the
-		 * fair_server time whether or not the task is running on
-		 * behalf of fair_server or not:
-		 *  - If the task is running on behalf of fair_server, we need
-		 *    to limit its time based on the assigned runtime.
-		 *  - Fair task that runs outside of fair_server should account
-		 *    against fair_server such that it can account for this time
-		 *    and possibly avoid running this period.
-		 */
-		dl_server_update(&rq->fair_server, delta_exec);
-	}
-
-	account_cfs_rq_runtime(cfs_rq, delta_exec);
+	/*
+	 * If the fair_server is active, we need to account for the
+	 * fair_server time whether or not the task is running on
+	 * behalf of fair_server or not:
+	 *  - If the task is running on behalf of fair_server, we need
+	 *    to limit its time based on the assigned runtime.
+	 *  - Fair task that runs outside of fair_server should account
+	 *    against fair_server such that it can account for this time
+	 *    and possibly avoid running this period.
+	 */
+	dl_server_update(&rq->fair_server, delta_exec);
 
-	if (cfs_rq->nr_queued == 1)
+	if (cfs_rq->h_nr_queued == 1)
 		return;
 
 	if (resched || !protect_slice(curr)) {
@@ -1454,7 +1422,10 @@ static void update_curr(struct cfs_rq *c
 
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->donor->se));
+	struct sched_entity *se = &rq->donor->se;
+
+	for_each_sched_entity(se)
+		update_curr(cfs_rq_of(se));
 }
 
 static inline void
@@ -1530,7 +1501,7 @@ update_stats_enqueue_fair(struct cfs_rq
 	 * Are we enqueueing a waiting task? (for current tasks
 	 * a dequeue/enqueue event is a NOP)
 	 */
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->h_curr)
 		update_stats_wait_start_fair(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -1548,7 +1519,7 @@ update_stats_dequeue_fair(struct cfs_rq
 	 * Mark the end of the wait period if dequeueing a
 	 * waiting task:
 	 */
-	if (se != cfs_rq->curr)
+	if (se != cfs_rq->h_curr)
 		update_stats_wait_end_fair(cfs_rq, se);
 
 	if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
@@ -3875,6 +3846,7 @@ static inline void update_scan_period(st
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
 	update_load_add(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
 		struct rq *rq = rq_of(cfs_rq);
@@ -3888,6 +3860,7 @@ account_entity_enqueue(struct cfs_rq *cf
 static void
 account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
@@ -3965,7 +3938,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
 static void
 rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
 {
-	unsigned long old_weight = se->load.weight;
+	long old_weight = se->h_load.weight;
 
 	/*
 	 * VRUNTIME
@@ -4065,16 +4038,17 @@ rescale_entity(struct sched_entity *se,
 		se->vprot = div64_long(se->vprot * old_weight, weight);
 }
 
-static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
-			    unsigned long weight)
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			   unsigned long weight, bool on_rq)
 {
 	bool curr = cfs_rq->curr == se;
 	bool rel_vprot = false;
 	u64 avruntime = 0;
 
-	if (se->on_rq) {
-		/* commit outstanding execution time */
-		update_curr(cfs_rq);
+	if (se->h_load.weight == weight)
+		return;
+
+	if (on_rq) {
 		avruntime = avg_vruntime(cfs_rq);
 		se->vlag = entity_lag(cfs_rq, se, avruntime);
 		se->deadline -= avruntime;
@@ -4084,46 +4058,90 @@ static void reweight_entity(struct cfs_r
 			rel_vprot = true;
 		}
 
-		cfs_rq->nr_queued--;
+		cfs_rq->h_nr_queued--;
 		if (!curr)
 			__dequeue_entity(cfs_rq, se);
-		update_load_sub(&cfs_rq->load, se->load.weight);
 	}
-	dequeue_load_avg(cfs_rq, se);
 
 	rescale_entity(se, weight, rel_vprot);
 
-	update_load_set(&se->load, weight);
+	update_load_set(&se->h_load, weight);
 
-	do {
-		u32 divider = get_pelt_divider(&se->avg);
-		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
-	} while (0);
-
-	enqueue_load_avg(cfs_rq, se);
-	if (se->on_rq) {
+	if (on_rq) {
 		if (rel_vprot)
 			se->vprot += avruntime;
 		se->deadline += avruntime;
 		se->rel_deadline = 0;
 		se->vruntime = avruntime - se->vlag;
 
-		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
-		cfs_rq->nr_queued++;
+		cfs_rq->h_nr_queued++;
 	}
 }
 
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			    unsigned long weight)
+{
+	if (se->load.weight == weight)
+		return;
+
+	if (se->on_rq) {
+		WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
+		update_load_sub(&cfs_rq->load, se->load.weight);
+	}
+	dequeue_load_avg(cfs_rq, se);
+
+	update_load_set(&se->load, weight);
+
+	do {
+		u32 divider = get_pelt_divider(&se->avg);
+		se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
+	} while (0);
+
+	enqueue_load_avg(cfs_rq, se);
+
+	if (se->on_rq)
+		update_load_add(&cfs_rq->load, se->load.weight);
+}
+
+/*
+ * weight = NICE_0_LOAD;
+ * for_each_entity_se(se)
+ *   weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
+ */
+static __always_inline
+unsigned long __calc_prop_weight(struct cfs_rq *cfs_rq, struct sched_entity *se,
+				 unsigned long weight)
+{
+	weight *= se->load.weight;
+	if (parent_entity(se))
+		weight /= cfs_rq->load.weight;
+	else
+		weight /= NICE_0_LOAD;
+
+	return max(weight, MIN_SHARES);
+}
+
 static void reweight_task_fair(struct rq *rq, struct task_struct *p,
 			       const struct load_weight *lw)
 {
 	struct sched_entity *se = &p->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	struct load_weight *load = &se->load;
+	unsigned long weight = NICE_0_LOAD;
+
+	if (se->on_rq)
+		update_curr_fair(rq);
+
+	reweight_entity(cfs_rq_of(se), se, lw->weight);
+	se->load.inv_weight = lw->inv_weight;
+
+	if (!se->on_rq)
+		return;
+
+	for_each_sched_entity(se)
+		weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
 
-	reweight_entity(cfs_rq, se, lw->weight);
-	load->inv_weight = lw->inv_weight;
+	reweight_eevdf(&rq->cfs, &p->se, weight, p->se.on_rq);
 }
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
@@ -4331,7 +4349,6 @@ static long calc_group_shares(struct cfs
 static void update_cfs_group(struct sched_entity *se)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-	long shares;
 
 	/*
 	 * When a group becomes empty, preserve its weight. This matters for
@@ -4340,9 +4357,7 @@ static void update_cfs_group(struct sche
 	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
-	shares = calc_group_shares(gcfs_rq);
-	if (unlikely(se->load.weight != shares))
-		reweight_entity(cfs_rq_of(se), se, shares);
+	reweight_entity(cfs_rq_of(se), se, calc_group_shares(gcfs_rq));
 }
 
 #else /* !CONFIG_FAIR_GROUP_SCHED: */
@@ -4460,7 +4475,7 @@ static inline bool cfs_rq_is_decayed(str
  * differential update where we store the last value we propagated. This in
  * turn allows skipping updates if the differential is 'small'.
  *
- * Updating tg's load_avg is necessary before update_cfs_share().
+ * Updating tg's load_avg is necessary before update_cfs_group().
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
@@ -4926,7 +4941,7 @@ static void migrate_se_pelt_lag(struct s
  * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
  * avg. The immediate corollary is that all (fair) tasks must be attached.
  *
- * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
  *
  * Return: true if the load decayed or we removed load.
  *
@@ -5475,6 +5490,7 @@ static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+	unsigned int nr_queued = cfs_rq->h_nr_queued;
 	bool update_zero = false;
 	s64 lag = 0;
 
@@ -5482,6 +5498,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		se->slice = sysctl_sched_base_slice;
 	vslice = calc_delta_fair(se->slice, se);
 
+	if (flags & ENQUEUE_QUEUED)
+		nr_queued -= 1;
+
 	/*
 	 * Due to how V is constructed as the weighted average of entities,
 	 * adding tasks with positive lag, or removing tasks with negative lag
@@ -5490,7 +5509,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
+	if (sched_feat(PLACE_LAG) && nr_queued && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		long load, weight;
 
@@ -5550,9 +5569,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
 		 */
 		load = cfs_rq->sum_weight;
 		if (curr && curr->on_rq)
-			load += avg_vruntime_weight(cfs_rq, curr->load.weight);
+			load += avg_vruntime_weight(cfs_rq, curr->h_load.weight);
 
-		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+		weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
 		lag *= load + weight;
 		if (WARN_ON_ONCE(!load))
 			load = 1;
@@ -5611,22 +5630,8 @@ static void check_enqueue_throttle(struc
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 
 static void
-requeue_delayed_entity(struct sched_entity *se);
-
-static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool curr = cfs_rq->curr == se;
-
-	/*
-	 * If we're the current task, we must renormalise before calling
-	 * update_curr().
-	 */
-	if (curr)
-		place_entity(cfs_rq, se, flags);
-
-	update_curr(cfs_rq);
-
 	/*
 	 * When enqueuing a sched_entity, we must:
 	 *   - Update loads to have both entity and cfs_rq synced with now.
@@ -5645,13 +5650,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_cfs_group(se);
 
-	/*
-	 * XXX now that the entity has been re-weighted, and it's lag adjusted,
-	 * we can place the entity.
-	 */
-	if (!curr)
-		place_entity(cfs_rq, se, flags);
-
 	account_entity_enqueue(cfs_rq, se);
 
 	/* Entity has migrated, no longer consider this task hot */
@@ -5660,8 +5658,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
-	if (!curr)
-		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
@@ -5679,21 +5675,19 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	}
 }
 
-static void __clear_buddies_next(struct sched_entity *se)
+static void set_next_buddy(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
-		if (cfs_rq->next != se)
-			break;
-
-		cfs_rq->next = NULL;
-	}
+	if (WARN_ON_ONCE(!se->on_rq || se->sched_delayed))
+		return;
+	if (se_is_idle(se))
+		return;
+	cfs_rq->next = se;
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->next == se)
-		__clear_buddies_next(se);
+		cfs_rq->next = NULL;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -5704,7 +5698,7 @@ static void set_delayed(struct sched_ent
 
 	/*
 	 * Delayed se of cfs_rq have no tasks queued on them.
-	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * Do not adjust h_nr_runnable since __dequeue_task()
 	 * will account it for blocked tasks.
 	 */
 	if (!entity_is_task(se))
@@ -5737,37 +5731,11 @@ static void clear_delayed(struct sched_e
 	}
 }
 
-static bool
+static void
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	bool sleep = flags & DEQUEUE_SLEEP;
 	int action = UPDATE_TG;
 
-	update_curr(cfs_rq);
-	clear_buddies(cfs_rq, se);
-
-	if (flags & DEQUEUE_DELAYED) {
-		WARN_ON_ONCE(!se->sched_delayed);
-	} else {
-		bool delay = sleep;
-		/*
-		 * DELAY_DEQUEUE relies on spurious wakeups, special task
-		 * states must not suffer spurious wakeups, excempt them.
-		 */
-		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
-			delay = false;
-
-		WARN_ON_ONCE(delay && se->sched_delayed);
-
-		if (sched_feat(DELAY_DEQUEUE) && delay &&
-		    !entity_eligible(cfs_rq, se)) {
-			update_load_avg(cfs_rq, se, 0);
-			update_entity_lag(cfs_rq, se);
-			set_delayed(se);
-			return false;
-		}
-	}
-
 	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
 		action |= DO_DETACH;
 
@@ -5785,14 +5753,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
 
-	update_entity_lag(cfs_rq, se);
-	if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
-		se->deadline -= se->vruntime;
-		se->rel_deadline = 1;
-	}
-
-	if (se != cfs_rq->curr)
-		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
@@ -5801,9 +5761,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	update_cfs_group(se);
 
-	if (flags & DEQUEUE_DELAYED)
-		clear_delayed(se);
-
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -5816,15 +5773,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 		}
 #endif
 	}
-
-	return true;
 }
 
 static void
-set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	clear_buddies(cfs_rq, se);
-
 	/* 'current' is not kept within the tree. */
 	if (se->on_rq) {
 		/*
@@ -5833,16 +5786,12 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 		 * runqueue.
 		 */
 		update_stats_wait_end_fair(cfs_rq, se);
-		__dequeue_entity(cfs_rq, se);
 		update_load_avg(cfs_rq, se, UPDATE_TG);
-
-		if (first)
-			set_protect_slice(cfs_rq, se);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
-	WARN_ON_ONCE(cfs_rq->curr);
-	cfs_rq->curr = se;
+	WARN_ON_ONCE(cfs_rq->h_curr);
+	cfs_rq->h_curr = se;
 
 	/*
 	 * Track our maximum slice length, if the CPU's load is at
@@ -5862,23 +5811,17 @@ set_next_entity(struct cfs_rq *cfs_rq, s
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
+static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags);
 
-/*
- * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
- */
 static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool protect)
+pick_next_entity(struct rq *rq, bool protect)
 {
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
 
 	se = pick_eevdf(cfs_rq, protect);
 	if (se->sched_delayed) {
-		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+		__dequeue_task(rq, task_of(se), DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		/*
 		 * Must not reference @se again, see __block_task().
 		 */
@@ -5903,13 +5846,11 @@ static void put_prev_entity(struct cfs_r
 
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
-		/* Put 'current' back into the tree. */
-		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
 	}
-	WARN_ON_ONCE(cfs_rq->curr != prev);
-	cfs_rq->curr = NULL;
+	WARN_ON_ONCE(cfs_rq->h_curr != prev);
+	cfs_rq->h_curr = NULL;
 }
 
 static void
@@ -6062,7 +6003,7 @@ static void __account_cfs_rq_runtime(str
 	 * if we're unable to extend our runtime we resched so that the active
 	 * hierarchy can be throttled
 	 */
-	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
+	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->h_curr))
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -6420,7 +6361,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
-	if (rq->curr == rq->idle && rq->cfs.nr_queued)
+	if (rq->curr == rq->idle && rq->cfs.h_nr_queued)
 		resched_curr(rq);
 }
 
@@ -6761,7 +6702,7 @@ static void check_enqueue_throttle(struc
 		return;
 
 	/* an active group must be handled by the update_curr()->put() path */
-	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
+	if (!cfs_rq->runtime_enabled || cfs_rq->h_curr)
 		return;
 
 	/* ensure the group is not already throttled */
@@ -7156,7 +7097,7 @@ static void hrtick_start_fair(struct rq
 			resched_curr(rq);
 		return;
 	}
-	delta = (se->load.weight * vdelta) / NICE_0_LOAD;
+	delta = (se->h_load.weight * vdelta) / NICE_0_LOAD;
 
 	/*
 	 * Correct for instantaneous load of other classes.
@@ -7256,10 +7197,8 @@ static int choose_idle_cpu(int cpu, stru
 }
 
 static void
-requeue_delayed_entity(struct sched_entity *se)
+requeue_delayed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
 	/*
 	 * se->sched_delayed should imply: se->on_rq == 1.
 	 * Because a delayed entity is one that is still on
@@ -7269,19 +7208,58 @@ requeue_delayed_entity(struct sched_enti
 	WARN_ON_ONCE(!se->on_rq);
 
 	if (update_entity_lag(cfs_rq, se)) {
-		cfs_rq->nr_queued--;
+		cfs_rq->h_nr_queued--;
 		if (se != cfs_rq->curr)
 			__dequeue_entity(cfs_rq, se);
 		place_entity(cfs_rq, se, 0);
 		if (se != cfs_rq->curr)
 			__enqueue_entity(cfs_rq, se);
-		cfs_rq->nr_queued++;
+		cfs_rq->h_nr_queued++;
 	}
 
 	update_load_avg(cfs_rq, se, 0);
 	clear_delayed(se);
 }
 
+static unsigned long enqueue_hierarchy(struct task_struct *p, int flags)
+{
+	unsigned long weight = NICE_0_LOAD;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
+	struct sched_entity *se = &p->se;
+	int h_nr_idle = task_has_idle_policy(p);
+	int h_nr_runnable = 1;
+
+	if (task_new && se->sched_delayed)
+		h_nr_runnable = 0;
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		update_curr(cfs_rq);
+
+		if (!se->on_rq) {
+			enqueue_entity(cfs_rq, se, flags);
+		} else {
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			se_update_runnable(se);
+			update_cfs_group(se);
+		}
+
+		cfs_rq->h_nr_runnable += h_nr_runnable;
+		cfs_rq->h_nr_queued++;
+		cfs_rq->h_nr_idle += h_nr_idle;
+
+		if (cfs_rq_is_idle(cfs_rq))
+			h_nr_idle = 1;
+
+		weight = __calc_prop_weight(cfs_rq, se, weight);
+
+		flags = ENQUEUE_WAKEUP;
+	}
+
+	return weight;
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -7290,13 +7268,12 @@ requeue_delayed_entity(struct sched_enti
 static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	struct cfs_rq *cfs_rq;
-	struct sched_entity *se = &p->se;
-	int h_nr_idle = task_has_idle_policy(p);
-	int h_nr_runnable = 1;
-	int task_new = !(flags & ENQUEUE_WAKEUP);
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
-	u64 slice = 0;
+	int task_new = !(flags & ENQUEUE_WAKEUP);
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	unsigned long weight;
+	bool curr;
 
 	if (task_is_throttled(p) && enqueue_throttled_task(p))
 		return;
@@ -7308,10 +7285,10 @@ enqueue_task_fair(struct rq *rq, struct
 	 * estimated utilization, before we update schedutil.
 	 */
 	if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
-		util_est_enqueue(&rq->cfs, p);
+		util_est_enqueue(cfs_rq, p);
 
 	if (flags & ENQUEUE_DELAYED) {
-		requeue_delayed_entity(se);
+		requeue_delayed_entity(cfs_rq, se);
 		return;
 	}
 
@@ -7323,57 +7300,22 @@ enqueue_task_fair(struct rq *rq, struct
 	if (p->in_iowait)
 		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
 
-	if (task_new && se->sched_delayed)
-		h_nr_runnable = 0;
-
-	for_each_sched_entity(se) {
-		if (se->on_rq) {
-			if (se->sched_delayed)
-				requeue_delayed_entity(se);
-			break;
-		}
-		cfs_rq = cfs_rq_of(se);
-
-		/*
-		 * Basically set the slice of group entries to the min_slice of
-		 * their respective cfs_rq. This ensures the group can service
-		 * its entities in the desired time-frame.
-		 */
-		if (slice) {
-			se->slice = slice;
-			se->custom_slice = 1;
-		}
-		enqueue_entity(cfs_rq, se, flags);
-		slice = cfs_rq_min_slice(cfs_rq);
-
-		cfs_rq->h_nr_runnable += h_nr_runnable;
-		cfs_rq->h_nr_queued++;
-		cfs_rq->h_nr_idle += h_nr_idle;
-
-		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = 1;
-
-		flags = ENQUEUE_WAKEUP;
-	}
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-		update_cfs_group(se);
+	/*
+	 * XXX comment on the curr thing
+	 */
+	curr = (cfs_rq->curr == se);
+	if (curr)
+		place_entity(cfs_rq, se, flags);
 
-		se->slice = slice;
-		if (se != cfs_rq->curr)
-			min_vruntime_cb_propagate(&se->run_node, NULL);
-		slice = cfs_rq_min_slice(cfs_rq);
+	if (se->on_rq && se->sched_delayed)
+		requeue_delayed_entity(cfs_rq, se);
 
-		cfs_rq->h_nr_runnable += h_nr_runnable;
-		cfs_rq->h_nr_queued++;
-		cfs_rq->h_nr_idle += h_nr_idle;
+	weight = enqueue_hierarchy(p, flags);
 
-		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = 1;
+	if (!curr) {
+		reweight_eevdf(cfs_rq, se, weight, false);
+		place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);
+		__enqueue_entity(cfs_rq, se);
 	}
 
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
@@ -7404,105 +7346,107 @@ enqueue_task_fair(struct rq *rq, struct
 	hrtick_update(rq);
 }
 
-/*
- * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
- * failing half-way through and resume the dequeue later.
- *
- * Returns:
- * -1 - dequeue delayed
- *  0 - dequeue throttled
- *  1 - dequeue complete
- */
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
+static void dequeue_hierarchy(struct task_struct *p, int flags)
 {
-	bool was_sched_idle = sched_idle_rq(rq);
+	struct sched_entity *se = &p->se;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
 	bool task_throttled = flags & DEQUEUE_THROTTLE;
-	struct task_struct *p = NULL;
-	int h_nr_idle = 0;
-	int h_nr_queued = 0;
 	int h_nr_runnable = 0;
-	struct cfs_rq *cfs_rq;
-	u64 slice = 0;
+	int h_nr_idle = task_has_idle_policy(p);
+	bool dequeue = true;
 
-	if (entity_is_task(se)) {
-		p = task_of(se);
-		h_nr_queued = 1;
-		h_nr_idle = task_has_idle_policy(p);
-		if (task_sleep || task_delayed || !se->sched_delayed)
-			h_nr_runnable = 1;
-	}
+	if (task_sleep || task_delayed || !se->sched_delayed)
+		h_nr_runnable = 1;
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		if (!dequeue_entity(cfs_rq, se, flags)) {
-			if (p && &p->se == se)
-				return -1;
+		update_curr(cfs_rq);
 
-			slice = cfs_rq_min_slice(cfs_rq);
-			break;
+		if (dequeue) {
+			dequeue_entity(cfs_rq, se, flags);
+			/* Don't dequeue parent if it has other entities besides us */
+			if (cfs_rq->load.weight)
+				dequeue = false;
+		} else {
+			update_load_avg(cfs_rq, se, UPDATE_TG);
+			se_update_runnable(se);
+			update_cfs_group(se);
 		}
 
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
-		cfs_rq->h_nr_queued -= h_nr_queued;
+		cfs_rq->h_nr_queued--;
 		cfs_rq->h_nr_idle -= h_nr_idle;
 
 		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = h_nr_queued;
+			h_nr_idle = 1;
 
 		if (throttled_hierarchy(cfs_rq) && task_throttled)
 			record_throttle_clock(cfs_rq);
 
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight) {
-			slice = cfs_rq_min_slice(cfs_rq);
-
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			/*
-			 * Bias pick_next to pick a task from this cfs_rq, as
-			 * p is sleeping when it is within its sched_slice.
-			 */
-			if (task_sleep && se)
-				set_next_buddy(se);
-			break;
-		}
 		flags |= DEQUEUE_SLEEP;
 		flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
 	}
+}
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+/*
+ * The part of dequeue_task_fair() that is needed to dequeue delayed tasks.
+ *
+ * Returns:
+ *   true  - dequeued
+ *   false - delayed
+ */
+static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	bool was_sched_idle = sched_idle_rq(rq);
+	bool task_sleep = flags & DEQUEUE_SLEEP;
+	bool task_delayed = flags & DEQUEUE_DELAYED;
 
-		update_load_avg(cfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-		update_cfs_group(se);
+	clear_buddies(cfs_rq, se);
 
-		se->slice = slice;
-		if (se != cfs_rq->curr)
-			min_vruntime_cb_propagate(&se->run_node, NULL);
-		slice = cfs_rq_min_slice(cfs_rq);
+	if (flags & DEQUEUE_DELAYED) {
+		WARN_ON_ONCE(!se->sched_delayed);
+	} else {
+		bool delay = task_sleep;
+		/*
+		 * DELAY_DEQUEUE relies on spurious wakeups, special task
+		 * states must not suffer spurious wakeups, exempt them.
+		 */
+		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
+			delay = false;
 
-		cfs_rq->h_nr_runnable -= h_nr_runnable;
-		cfs_rq->h_nr_queued -= h_nr_queued;
-		cfs_rq->h_nr_idle -= h_nr_idle;
+		WARN_ON_ONCE(delay && se->sched_delayed);
 
-		if (cfs_rq_is_idle(cfs_rq))
-			h_nr_idle = h_nr_queued;
+		if (sched_feat(DELAY_DEQUEUE) && delay &&
+		    !entity_eligible(cfs_rq, se)) {
+			update_load_avg(cfs_rq_of(se), se, 0);
+			set_delayed(se);
+			return false;
+		}
+	}
 
-		if (throttled_hierarchy(cfs_rq) && task_throttled)
-			record_throttle_clock(cfs_rq);
+	dequeue_hierarchy(p, flags);
+
+	update_entity_lag(cfs_rq, se);
+	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
+		se->deadline -= se->vruntime;
+		se->rel_deadline = 1;
 	}
+	if (se != cfs_rq->curr)
+		__dequeue_entity(cfs_rq, se);
 
-	sub_nr_running(rq, h_nr_queued);
+	sub_nr_running(rq, 1);
 
 	/* balance early to pull high priority tasks */
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
-	if (p && task_delayed) {
+	if (task_delayed) {
+		clear_delayed(se);
+
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
 
@@ -7514,7 +7458,7 @@ static int dequeue_entities(struct rq *r
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return true;
 }
 
 /*
@@ -7533,11 +7477,11 @@ static bool dequeue_task_fair(struct rq
 		util_est_dequeue(&rq->cfs, p);
 
 	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (!__dequeue_task(rq, p, flags))
 		return false;
 
 	/*
-	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
+	 * Must not reference @p after __dequeue_task(DEQUEUE_DELAYED).
 	 */
 	return true;
 }
@@ -9021,19 +8965,6 @@ static void migrate_task_rq_fair(struct
 static void task_dead_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
-
-	if (se->sched_delayed) {
-		struct rq_flags rf;
-		struct rq *rq;
-
-		rq = task_rq_lock(p, &rf);
-		if (se->sched_delayed) {
-			update_rq_clock(rq);
-			dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-		}
-		task_rq_unlock(rq, p, &rf);
-	}
-
 	remove_entity_load_avg(se);
 }
 
@@ -9067,21 +8998,10 @@ static void set_cpus_allowed_fair(struct
 	set_task_max_allowed_capacity(p);
 }
 
-static void set_next_buddy(struct sched_entity *se)
-{
-	for_each_sched_entity(se) {
-		if (WARN_ON_ONCE(!se->on_rq))
-			return;
-		if (se_is_idle(se))
-			return;
-		cfs_rq_of(se)->next = se;
-	}
-}
-
 enum preempt_wakeup_action {
 	PREEMPT_WAKEUP_NONE,	/* No preemption. */
 	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
-	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
+	PREEMPT_WAKEUP_PICK,	/* Let pick_eevdf() decide. */
 	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
 };
 
@@ -9098,7 +9018,7 @@ set_preempt_buddy(struct cfs_rq *cfs_rq,
 	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
 		return false;
 
-	set_next_buddy(pse);
+	set_next_buddy(cfs_rq, pse);
 	return true;
 }
 
@@ -9188,7 +9108,6 @@ static void wakeup_preempt_fair(struct r
 	if (!sched_feat(WAKEUP_PREEMPTION))
 		return;
 
-	find_matching_se(&se, &pse);
 	WARN_ON_ONCE(!pse);
 
 	cse_is_idle = se_is_idle(se);
@@ -9216,8 +9135,7 @@ static void wakeup_preempt_fair(struct r
 	if (unlikely(!normal_policy(p->policy)))
 		return;
 
-	cfs_rq = cfs_rq_of(se);
-	update_curr(cfs_rq);
+	update_curr_fair(rq);
 	/*
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
@@ -9261,18 +9179,15 @@ static void wakeup_preempt_fair(struct r
 	}
 
 pick:
-	nse = pick_next_entity(rq, cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT);
-	/* If @p has become the most eligible task, force preemption */
-	if (nse == pse)
-		goto preempt;
-
-	/*
-	 * Because p is enqueued, nse being null can only mean that we
-	 * dequeued a delayed task. If there are still entities queued in
-	 * cfs, check if the next one will be p.
-	 */
-	if (!nse && cfs_rq->nr_queued)
-		goto pick;
+	if (cfs_rq->h_nr_queued) {
+		nse = pick_next_entity(rq, preempt_action != PREEMPT_WAKEUP_SHORT);
+		if (unlikely(!nse))
+			goto pick;
+
+		/* If @p has become the most eligible task, force preemption */
+		if (nse == pse)
+			goto preempt;
+	}
 
 	if (sched_feat(RUN_TO_PARITY))
 		update_protect_slice(cfs_rq, se);
@@ -9291,34 +9206,25 @@ static void wakeup_preempt_fair(struct r
 struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
 {
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
-	struct cfs_rq *cfs_rq;
 	struct task_struct *p;
-	bool throttled;
 	int new_tasks;
 
 again:
-	cfs_rq = &rq->cfs;
-	if (!cfs_rq->nr_queued)
+	if (!cfs_rq->h_nr_queued)
 		goto idle;
 
-	throttled = false;
-
-	do {
-		/* Might not have done put_prev_entity() */
-		if (cfs_rq->curr && cfs_rq->curr->on_rq)
-			update_curr(cfs_rq);
-
-		throttled |= check_cfs_rq_runtime(cfs_rq);
+	/* Might not have done put_prev_entity() */
+	if (cfs_rq->curr && cfs_rq->curr->on_rq)
+		update_curr(cfs_rq);
 
-		se = pick_next_entity(rq, cfs_rq, true);
-		if (!se)
-			goto again;
-		cfs_rq = group_cfs_rq(se);
-	} while (cfs_rq);
+	se = pick_next_entity(rq, true);
+	if (!se)
+		goto again;
 
 	p = task_of(se);
-	if (unlikely(throttled))
+	if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
 		task_throttle_setup_work(p);
 	return p;
 
@@ -9353,7 +9259,7 @@ void fair_server_init(struct rq *rq)
 static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
 	struct sched_entity *se = &prev->se;
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *nse = NULL;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -9363,7 +9269,7 @@ static void put_prev_task_fair(struct rq
 
 	while (se) {
 		cfs_rq = cfs_rq_of(se);
-		if (!nse || cfs_rq->curr)
+		if (!nse || cfs_rq->h_curr)
 			put_prev_entity(cfs_rq, se);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		if (nse) {
@@ -9382,6 +9288,14 @@ static void put_prev_task_fair(struct rq
 #endif
 		se = parent_entity(se);
 	}
+
+	/* Put 'current' back into the tree. */
+	cfs_rq = &rq->cfs;
+	se = &prev->se;
+	WARN_ON_ONCE(cfs_rq->curr != se);
+	cfs_rq->curr = NULL;
+	if (se->on_rq)
+		__enqueue_entity(cfs_rq, se);
 }
 
 /*
@@ -9390,8 +9304,8 @@ static void put_prev_task_fair(struct rq
 static void yield_task_fair(struct rq *rq)
 {
 	struct task_struct *curr = rq->donor;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
 
 	/*
 	 * Are we the only task in the tree?
@@ -9432,11 +9346,11 @@ static bool yield_to_task_fair(struct rq
 	struct sched_entity *se = &p->se;
 
 	/* !se->on_rq also covers throttled task */
-	if (!se->on_rq)
+	if (!se->on_rq || se->sched_delayed)
 		return false;
 
 	/* Tell the scheduler that we'd really like se to run next. */
-	set_next_buddy(se);
+	set_next_buddy(&task_rq(p)->cfs, se);
 
 	yield_task_fair(rq);
 
@@ -9762,15 +9676,10 @@ static inline long migrate_degrades_loca
  */
 static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
 {
-	struct cfs_rq *dst_cfs_rq;
+	struct cfs_rq *dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	dst_cfs_rq = task_group(p)->cfs_rq[dest_cpu];
-#else
-	dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
-#endif
-	if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
-	    !entity_eligible(task_cfs_rq(p), &p->se))
+	if (sched_feat(PLACE_LAG) && dst_cfs_rq->h_nr_queued &&
+	    !entity_eligible(&task_rq(p)->cfs, &p->se))
 		return 1;
 
 	return 0;
@@ -10240,7 +10149,7 @@ static void update_cfs_rq_h_load(struct
 	while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
 		load = cfs_rq->h_load;
 		load = div64_ul(load * se->avg.load_avg,
-			cfs_rq_load_avg(cfs_rq) + 1);
+				cfs_rq_load_avg(cfs_rq) + 1);
 		cfs_rq = group_cfs_rq(se);
 		cfs_rq->h_load = load;
 		cfs_rq->last_h_load_update = now;
@@ -13459,7 +13368,7 @@ static inline void task_tick_core(struct
 	 * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
 	 * if we need to give up the CPU.
 	 */
-	if (rq->core->core_forceidle_count && rq->cfs.nr_queued == 1 &&
+	if (rq->core->core_forceidle_count && rq->cfs.h_nr_queued == 1 &&
 	    __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
 		resched_curr(rq);
 }
@@ -13668,30 +13577,8 @@ bool cfs_prio_less(const struct task_str
 
 	WARN_ON_ONCE(task_rq(b)->core != rq->core);
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	/*
-	 * Find an se in the hierarchy for tasks a and b, such that the se's
-	 * are immediate siblings.
-	 */
-	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
-		int sea_depth = sea->depth;
-		int seb_depth = seb->depth;
-
-		if (sea_depth >= seb_depth)
-			sea = parent_entity(sea);
-		if (sea_depth <= seb_depth)
-			seb = parent_entity(seb);
-	}
-
-	se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
-	se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
-
-	cfs_rqa = sea->cfs_rq;
-	cfs_rqb = seb->cfs_rq;
-#else /* !CONFIG_FAIR_GROUP_SCHED: */
 	cfs_rqa = &task_rq(a)->cfs;
 	cfs_rqb = &task_rq(b)->cfs;
-#endif /* !CONFIG_FAIR_GROUP_SCHED */
 
 	/*
 	 * Find delta after normalizing se's vruntime with its cfs_rq's
@@ -13729,14 +13616,20 @@ static inline void task_tick_core(struct
  */
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	unsigned long weight = NICE_0_LOAD;
+	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		entity_tick(cfs_rq, se, queued);
+
+		weight = __calc_prop_weight(cfs_rq, se, weight);
 	}
 
+	se = &curr->se;
+	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+
 	if (queued)
 		return;
 
@@ -13772,7 +13665,7 @@ prio_changed_fair(struct rq *rq, struct
 	if (p->prio == oldprio)
 		return;
 
-	if (rq->cfs.nr_queued == 1)
+	if (rq->cfs.h_nr_queued == 1)
 		return;
 
 	/*
@@ -13901,29 +13794,40 @@ static void switched_to_fair(struct rq *
 	}
 }
 
-/*
- * Account for a task changing its policy or group.
- *
- * This routine is mostly called to set cfs_rq->curr field when a task
- * migrates between groups/classes.
- */
 static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = &rq->cfs;
+	unsigned long weight = NICE_0_LOAD;
+	bool on_rq = se->on_rq;
+
+	clear_buddies(cfs_rq, se);
+
+	if (on_rq)
+		__dequeue_entity(cfs_rq, se);
 
 	for_each_sched_entity(se) {
-		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		cfs_rq = cfs_rq_of(se);
 
-		if (IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) &&
-		    first && cfs_rq->curr)
-			break;
+		if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
+		    !first || !cfs_rq->h_curr)
+			set_next_entity(cfs_rq, se);
 
-		set_next_entity(cfs_rq, se, first);
 		/* ensure bandwidth has been allocated on our new cfs_rq */
 		account_cfs_rq_runtime(cfs_rq, 0);
+
+		if (on_rq)
+			weight = __calc_prop_weight(cfs_rq, se, weight);
 	}
 
 	se = &p->se;
+	cfs_rq->curr = se;
+
+	if (on_rq) {
+		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+		if (first)
+			set_protect_slice(cfs_rq, se);
+	}
 
 	if (task_on_rq_queued(p)) {
 		/*
@@ -14054,17 +13958,8 @@ void unregister_fair_sched_group(struct
 		struct sched_entity *se = tg->se[cpu];
 		struct rq *rq = cpu_rq(cpu);
 
-		if (se) {
-			if (se->sched_delayed) {
-				guard(rq_lock_irqsave)(rq);
-				if (se->sched_delayed) {
-					update_rq_clock(rq);
-					dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-				}
-				list_del_leaf_cfs_rq(cfs_rq);
-			}
+		if (se)
 			remove_entity_load_avg(se);
-		}
 
 		/*
 		 * Only empty task groups can be destroyed; so we can speculatively
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -206,7 +206,7 @@ ___update_load_sum(u64 now, struct sched
 	/*
 	 * running is a subset of runnable (weight) so running can't be set if
 	 * runnable is clear. But there are some corner cases where the current
-	 * se has been already dequeued but cfs_rq->curr still points to it.
+	 * se has been already dequeued but cfs_rq->h_curr still points to it.
 	 * This means that weight will be 0 but not running for a sched_entity
 	 * but also for a cfs_rq if the latter becomes idle. As an example,
 	 * this happens during sched_balance_newidle() which calls
@@ -307,7 +307,7 @@ int __update_load_avg_blocked_se(u64 now
 int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
-				cfs_rq->curr == se)) {
+				cfs_rq->h_curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se));
 		cfs_se_util_change(&se->avg);
@@ -323,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, st
 	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
 				cfs_rq->h_nr_runnable,
-				cfs_rq->curr != NULL)) {
+				cfs_rq->h_curr != NULL)) {
 
 		___update_load_avg(&cfs_rq->avg, 1);
 		trace_pelt_cfs_tp(cfs_rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -528,21 +528,8 @@ struct task_group {
 
 };
 
-#ifdef CONFIG_GROUP_SCHED_WEIGHT
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
-/*
- * A weight of 0 or 1 can cause arithmetics problems.
- * A weight of a cfs_rq is the sum of weights of which entities
- * are queued on this cfs_rq, so a weight of a entity should not be
- * too large, so as the shares value of a task group.
- * (The default weight is 1024 - so there's no practical
- *  limitation from this.)
- */
-#define MIN_SHARES		(1UL <<  1)
-#define MAX_SHARES		(1UL << 18)
-#endif
-
 typedef int (*tg_visitor)(struct task_group *, void *);
 
 extern int walk_tg_tree_from(struct task_group *from,
@@ -629,6 +616,17 @@ static inline bool cfs_task_bw_constrain
 
 #endif /* !CONFIG_CGROUP_SCHED */
 
+/*
+ * A weight of 0 or 1 can cause arithmetic problems.
+ * The weight of a cfs_rq is the sum of the weights of the entities
+ * queued on this cfs_rq, so the weight of an entity should not be
+ * too large, and neither should the shares value of a task group.
+ * (The default weight is 1024 - so there's no practical
+ *  limitation from this.)
+ */
+#define MIN_SHARES		(1UL <<  1)
+#define MAX_SHARES		(1UL << 18)
+
 extern void unregister_rt_sched_group(struct task_group *tg);
 extern void free_rt_sched_group(struct task_group *tg);
 extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
@@ -707,6 +705,7 @@ struct cfs_rq {
 	/*
 	 * CFS load tracking
 	 */
+	struct sched_entity	*h_curr;
 	struct sched_avg	avg;
 #ifndef CONFIG_64BIT
 	u64			last_update_time_copy;
@@ -2509,6 +2508,7 @@ extern const u32		sched_prio_to_wmult[40
 #define ENQUEUE_MIGRATED	0x00040000
 #define ENQUEUE_INITIAL		0x00080000
 #define ENQUEUE_RQ_SELECTED	0x00100000
+#define ENQUEUE_QUEUED		0x00200000
 
 #define RETRY_TASK		((void *)-1UL)
 




* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
@ 2026-05-11 16:21   ` K Prateek Nayak
  2026-05-12 11:09     ` Peter Zijlstra
  2026-05-13  4:51   ` John Stultz
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-11 16:21 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

Hello Peter,

On 5/11/2026 5:01 PM, Peter Zijlstra wrote:
> @@ -9291,34 +9206,25 @@ static void wakeup_preempt_fair(struct r
> +	se = pick_next_entity(rq, true);
> +	if (!se)
> +		goto again;
>  
>  	p = task_of(se);
> -	if (unlikely(throttled))
> +	if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
>  		task_throttle_setup_work(p);

I think this bit should also be replicated in set_next_task_fair() after
account_cfs_rq_runtime(), since any part of the hierarchy may get
throttled as a result of failing to grab runtime.

Also, check_cfs_rq_runtime() only checks whether the cfs_rq itself is
throttled, but the task can also fail to run if it is on a
throttled_hierarchy(), so that should be the correct check here.

Something like below (only build tested on queue/sched/flat):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54da4c6c945..950c072244b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9224,7 +9224,19 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 		goto again;
 
 	p = task_of(se);
-	if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
+	/*
+	 * For cases where prev is picked again after
+	 * being throttled, entity_tick() would have
+	 * already marked its hierarchy as throttled.
+	 *
+	 * Add throttle work here since
+	 * put_prev_set_next_task() is skipped on
+	 * same task's selection.
+	 *
+	 * For other cases, set_next_task_fair() will
+	 * handle adding the throttle work.
+	 */
+	if (throttled_hierarchy(cfs_rq_of(se)))
 		task_throttle_setup_work(p);
 	return p;
 
@@ -13819,6 +13831,12 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 		if (on_rq)
 			weight = __calc_prop_weight(cfs_rq, se, weight);
 	}
+	/*
+	 * Add throttle work if the bandwidth allocation above failed
+	 * to grab any runtime and throttled the task's hierarchy.
+	 */
+	if (throttled_hierarchy(task_cfs_rq(p)))
+		task_throttle_setup_work(p);
 
 	se = &p->se;
 	cfs_rq->curr = se;
---


>  	return p;
>  

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (9 preceding siblings ...)
  2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
@ 2026-05-11 19:23 ` Tejun Heo
  2026-05-12  8:10   ` Peter Zijlstra
  2026-05-12  8:42 ` Vincent Guittot
  2026-05-16  3:30 ` Qais Yousef
  12 siblings, 1 reply; 64+ messages in thread
From: Tejun Heo @ 2026-05-11 19:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Hello, Peter.

On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote:
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.
> 
> The problems with weight distribution are related to that infernal global
> fraction:
> 
>              tg->w * grq_i->w
>    ge_i->w = ----------------
>              \Sum_j grq_j->w
> 
> which we've approximated reasonably well by now. However, the immediate
> consequence of this fraction is that the total group weight (tg->w) gets
> fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
> weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine
> with the fact that 256 CPU systems are relatively common these days, this
> becomes painful.
> 
> The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> immediate problem with that is that when all load of a group gets concentrated
> on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> exceeding nice -20.
> 
> Additionally there are numerical limits on the max weight you can have before
> the math starts suffering overflows. As such there is a definite limit on the
> total group weight. Which has annoyed people ;-)
> 
> The first few patches add a knob /debug/sched/cgroup_mode and a few different
> options on how to deal with this. My favourite is 'concur', but obviously that
> is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> update_tg_load_avg() thing more expensive.

Ignoring fixed math accuracy problems, isn't the root problem here that
every thread in the root cgroup competes as if each is its own cgroup? ie.
Isn't the canonical solution here to create an enveloping group, at least
for share calculation purposes, for root threads and then assign them some
weight so that they compete in the same way that other cgroups do? Then, the
different modes go away or rather whatever the user wants can be expressed
via root's weight if that's to be made configurable.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
  2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
@ 2026-05-12  5:37   ` K Prateek Nayak
  2026-05-12  9:45     ` Peter Zijlstra
  2026-05-19 15:13   ` Vincent Guittot
  2026-06-03  9:51   ` Aaron Lu
  2 siblings, 1 reply; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-12  5:37 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

Hello Peter,

On 5/11/2026 5:01 PM, Peter Zijlstra wrote:
> @@ -9245,6 +9247,14 @@ static struct task_struct *pick_task_fai
>  	if (unlikely(throttled))
>  		task_throttle_setup_work(p);
>  	return p;
> +
> +idle:
> +	new_tasks = sched_balance_newidle(rq, rf);
> +	if (new_tasks < 0)
> +		return RETRY_TASK;
> +	if (new_tasks > 0)
> +		goto again;
> +	return NULL;
>  }

With core scheduling, this will now trigger a newidle balance during the
pick when core_cookie is reset to 0. That can cause tasks to migrate only
to find they cannot run on the CPU, since the core-wide selection leads
to a cookie mismatch, and they are left hanging there.

Can we return early if sched_core_enabled() here, or is the additional
newidle balance okay?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-11 19:23 ` [PATCH v2 00/10] sched: Flatten the pick Tejun Heo
@ 2026-05-12  8:10   ` Peter Zijlstra
  2026-05-12 18:45     ` Tejun Heo
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-12  8:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, May 11, 2026 at 09:23:45AM -1000, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, May 11, 2026 at 01:31:04PM +0200, Peter Zijlstra wrote:
> > So cgroup scheduling has always been a pain in the arse. The problems start
> > with weight distribution and end with hierarchical picks and it all sucks.
> > 
> > The problems with weight distribution are related to that infernal global
> > fraction:
> > 
> >              tg->w * grq_i->w
> >    ge_i->w = ----------------
> >              \Sum_j grq_j->w
> > 
> > which we've approximated reasonably well by now. However, the immediate
> > consequence of this fraction is that the total group weight (tg->w) gets
> > fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
> > weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine
> > with the fact that 256 CPU systems are relatively common these days, this
> > becomes painful.
> > 
> > The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> > immediate problem with that is that when all load of a group gets concentrated
> > on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> > exceeding nice -20.
> > 
> > Additionally there are numerical limits on the max weight you can have before
> > the math starts suffering overflows. As such there is a definite limit on the
> > total group weight. Which has annoyed people ;-)
> > 
> > The first few patches add a knob /debug/sched/cgroup_mode and a few different
> > options on how to deal with this. My favourite is 'concur', but obviously that
> > is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> > update_tg_load_avg() thing more expensive.
> 
> Ignoring fixed math accuracy problems, isn't the root problem here that
> every thread in the root cgroup competes as if each is its own cgroup? ie.
> Isn't the canonical solution here to create an enveloping group, at least
> for share calculation purposes, for root threads and then assign them some
> weight so that they compete in the same way that other cgroups do? Then, the
> different modes go away or rather whatever the user wants can be expressed
> via root's weight if that's to be made configurable.

As long as the total group weight is a fraction (and it sorta has to
be), you can run into trouble by stacking that fraction.

Take 256 CPUs and a group weight of 1024. Then each CPU gets a weight of
1024/256, or 4. Even if we increase the internal accuracy to 20 bits (we
do on 64bit), this only becomes 4096; do this for 2 more levels in the
hierarchy and you're down to scraping the barrel again.

So if each level runs at a fraction f of the level above, then level n
runs at f^n. Moving root into a phantom group at level 1, only solves
the problem against other tasks at level 1, but then you have the same
problem again at level 2 and below.

Both the numerical problems and the scale problem of the root group can
be avoided if we can get the average/nominal fraction to be near 1.

The 'normal' way around this is to ensure the group weight is nr_cpus *
1024; then, when everybody is running, the per CPU weight is 1024 (that
is, 1) and the continued fraction is also 1-ish. This is why people like
to increase the max group weight.

Trouble is of course that if not all CPUs are busy, with the extreme
being only a single CPU carrying that weight of nr_cpus*1024, this then
causes trouble because that one CPU gets overloaded.

One of the options is to simply put a max on the single CPU load, which
is the crudest option to just make it 'work'. The one I favour though is
the one where we scale the group weight by: 'min(cpumask, nr_tasks)'.

Anyway, this is why I've been looking at these alternative weight
schemes, to get the nominal fraction near 1 and make these problems go
away. It is both the numerical issues and the disparity between levels
(with root being at level 0 being the most obvious).

Does that make sense?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (10 preceding siblings ...)
  2026-05-11 19:23 ` [PATCH v2 00/10] sched: Flatten the pick Tejun Heo
@ 2026-05-12  8:42 ` Vincent Guittot
  2026-05-12  9:20   ` Peter Zijlstra
  2026-05-13 11:35   ` Peter Zijlstra
  2026-05-16  3:30 ` Qais Yousef
  12 siblings, 2 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-12  8:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi!
>
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.
>
> The problems with weight distribution are related to that infernal global
> fraction:
>
>              tg->w * grq_i->w
>    ge_i->w = ----------------
>              \Sum_j grq_j->w
>
> which we've approximated reasonably well by now. However, the immediate
> consequence of this fraction is that the total group weight (tg->w) gets
> fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
> weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine
> with the fact that 256 CPU systems are relatively common these days, this
> becomes painful.
>
> The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> immediate problem with that is that when all load of a group gets concentrated
> on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> exceeding nice -20.
>
> Additionally there are numerical limits on the max weight you can have before
> the math starts suffering overflows. As such there is a definite limit on the
> total group weight. Which has annoyed people ;-)
>
> The first few patches add a knob /debug/sched/cgroup_mode and a few different
> options on how to deal with this. My favourite is 'concur', but obviously that
> is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> update_tg_load_avg() thing more expensive.
>
> I have some ideas but I figured I ought to share these things before sinking
> more time into it.
>
>
> On to the hierarchical pick; this has been causing trouble for a very long
> time. So once again an attempt at flattening it. The basic idea is to keep the
> full hierarchical load tracking as-is, but keep all the runnable entities in a
> single level. The immediate consequence of all this is of course that we need to
> constantly re-compute the effective weight of each entity as things progress.
>
> Reweight is done on:
>  - enqueue
>  - pick -- or rather set_next_entity(.first=true)
>  - tick
>
> So while the {en,de}queue operations are still O(depth) due to the full
> accounting mess, the pick is now a single level. Removing the intermediate
> levels that obscure runnability etc.
>
>
> For testing, I've done a little experiment: I dug out what is colloquially known
> as a potato, a trusty old Sandybridge 2600k with a RX 580, and ran a game on
> it. From GOG, I had available 'Shadows: Awakening', a fun title that normally
> runs really well on this machine (provided you stick to 1080p).
>
> To make it interesting, I added 8 (one for each logical CPU) copies of: 'nice
> spin.sh'; this results in the game becoming almost unplayable, as in proper
> terrible.
>
> I used MangoHUD to record a few minutes of playtime for statistics, and then
> quit the game and re-started it with a shorter slice set (base/10). This
> results in the game being entirely playable -- not great, but definitely
> playable.
>
>   Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
>   Intel Core i7-2600K
>   AMD Radeon RX 580
>
>   Shadows Awakening (GOG)
>
>               default   slice[*]
>
>   FPS min       3.8      20.6
>       avg      48.0      57.2
>       max      87.4      80.3
>
>   FT  min       9.4       8.4
>       avg      34.5      19.5
>       max     107.4      37.2
>
>   FPS (Frames Per Second)
>   FT  (FrameTime)
>
>   [*] Command prefix: 'chrt -o --sched-runtime 280000 0'
>       effectively setting 'base_slice_ns/10'
>
> I have not compared to a kernel without flat on, just wanted to run non trivial
> workloads and play with slice to make sure everything 'works'.


I haven't reviewed the patches yet, but I ran some tests with it while
testing sched latency related changes for short slice wakeup
preemption. I see some large hackbench regressions with this series
on HMP systems, with and without EAS. Those figures are unexpected
because the benchmarks run on root cfs.

One example with hackbench 8 groups thread pipe
                tip/sched/core  tip/sched/core       +this patchset        +this patchset
slice           2.8ms           16ms                 2.8ms                 16ms

dragonboard rb5 with EAS
                0.748(+/-4.6%)  0.621(+/-3.6%) +17%  1.915(+/-7.9%) -156%  0.689(+/-9.1%) +8%

radxa orion6 HMP without EAS
                0.588(+/-5.8%)  0.677(+/-5.9%) -15%  1.505(+/-10%) -156%   1.071(+/-5.9%) -82%

Increasing the slice partly removes the regressions, but this is
surprising because the bench runs at root cfs and I thought the results
would not change in such a case.

I will review the patchset and try to work out what is going wrong.


>
>
> Can also be had:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
>
>  include/linux/cpuset.h |    6
>  include/linux/sched.h  |    1
>  kernel/cgroup/cpuset.c |   15
>  kernel/sched/core.c    |   47 --
>  kernel/sched/debug.c   |  171 +++++---
>  kernel/sched/fair.c    | 1038 ++++++++++++++++++++++---------------------------
>  kernel/sched/pelt.c    |    6
>  kernel/sched/sched.h   |   44 --
>  8 files changed, 672 insertions(+), 656 deletions(-)
>
> ---
> Change since v1 ( https://patch.msgid.link/20260317095113.387450089@infradead.org ):
>  - various Sashiko thingies
>  - rebase atop current -tip
>
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12  8:42 ` Vincent Guittot
@ 2026-05-12  9:20   ` Peter Zijlstra
  2026-05-12 18:24     ` Peter Zijlstra
  2026-05-13 11:35   ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-12  9:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:

> 
> I haven't reviewed the patches yet but I ran some tests with it while
> testing sched latency related changes for short slice wakeup
> preemption. I have some large hackbench regressions with this series
> on HMP system with and without EAS. those figures are unexpected
> because the benchs run on root cfs
> 
> One example with hackbench 8 groups thread pipe
> tip/sched/core  tip/sched/core          +this patchset          +this patchset
> slice 2.8ms     16ms                    2.8ms                   16ms
> dragonboard rb5 with EAS
> 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> 0,689(+/- 9.1%) +8%
> 
> radxa orion6 HMP without EAS
> 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> 1,071(+/-5.9%) -82%
> 
> Increasing the slice partly removes regressions but tis is surprising
> because the bench runs at root cfs and I thought that results will not
> change in such a case
> 
> I will review the patchset and try to get what is going wrong

Yeah, that is unexpected. Let me go have another look too.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
  2026-05-12  5:37   ` K Prateek Nayak
@ 2026-05-12  9:45     ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-12  9:45 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

On Tue, May 12, 2026 at 11:07:13AM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 5/11/2026 5:01 PM, Peter Zijlstra wrote:
> > @@ -9245,6 +9247,14 @@ static struct task_struct *pick_task_fai
> >  	if (unlikely(throttled))
> >  		task_throttle_setup_work(p);
> >  	return p;
> > +
> > +idle:
> > +	new_tasks = sched_balance_newidle(rq, rf);
> > +	if (new_tasks < 0)
> > +		return RETRY_TASK;
> > +	if (new_tasks > 0)
> > +		goto again;
> > +	return NULL;
> >  }
> 
> For core scheduling will now trigger a newidle balance during the pick
> when core_cookie is reset to 0 which can cause tasks to migrate only
> for them to find they cannot run on the CPU since core-wide selection
> leads to a cookie mismatch and it is kept hanging there.
> 
> Can we return early if sched_core_enabled() here or are the additional
> newidle balance okay?

This basically makes fair behave like every other class, so in that sense
this is probably okay. That said, fair is the most common case, so
perhaps.

Let's see if the people actually using this notice first though ;-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-11 16:21   ` K Prateek Nayak
@ 2026-05-12 11:09     ` Peter Zijlstra
  2026-05-13  7:01       ` K Prateek Nayak
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-12 11:09 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

On Mon, May 11, 2026 at 09:51:57PM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 5/11/2026 5:01 PM, Peter Zijlstra wrote:
> > @@ -9291,34 +9206,25 @@ static void wakeup_preempt_fair(struct r
> > +	se = pick_next_entity(rq, true);
> > +	if (!se)
> > +		goto again;
> >  
> >  	p = task_of(se);
> > -	if (unlikely(throttled))
> > +	if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
> >  		task_throttle_setup_work(p);
> 
> I think this bit should also be replicated in set_next_task() after
> account_cfs_rq_runtime() since any part of the hierarchy may get
> throttled as a result of failing to grab runtime.
> 
> Also check_cfs_rq_runtime() only sees if the cfs_rq is throttled
> but the task can fail to run if it is on a throttled_hierarchy() too
> so that should be the correct check here.
> 
> Something like below (only build tested on queue/sched/flat):
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e54da4c6c945..950c072244b2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9224,7 +9224,19 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>  		goto again;
>  
>  	p = task_of(se);
> -	if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
> +	/*
> +	 * For cases where prev is picked again after
> +	 * being throttled, entity_tick() would have
> +	 * already marked its hierarchy as throttled.
> +	 *
> +	 * Add throttle work here since
> +	 * put_prev_set_next_task() is skipped on
> +	 * same task's selection.
> +	 *
> +	 * For other case, set_next_task_fair() will
> +	 * handle adding the throttle work.
> +	 */
> +	if (throttled_hierarchy(cfs_rq_of(se)))
>  		task_throttle_setup_work(p);

Ah, right, because we've not accumulated runtime, it doesn't make sense
to use check_cfs_rq_runtime() at pick time, all we need to do is check
if the task should be throttled.

However, since set_next_task_fair() will walk the entire hierarchy
anyway, we can remove it here entirely and fully rely on that.

>  	return p;
>  
> @@ -13819,6 +13831,12 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  		if (on_rq)
>  			weight = __calc_prop_weight(cfs_rq, se, weight);
>  	}
> +	/*
> +	 * Add throttle work if the bandwidth allocation above failed
> +	 * to grab any runtime and throttled the task's hierarchy.
> +	 */
> +	if (throttled_hierarchy(task_cfs_rq(p)))
> +		task_throttle_setup_work(p);

We already call into account_cfs_rq_runtime(); which basically does all
we need.

I think the distinction between account_cfs_rq_runtime() and
check_cfs_rq_runtime() no longer makes sense. We can throttle a cfs_rq
at any point now, since we no longer remove the cfs_rq, but rather we
make the tasks suspend themselves until the cfs_rq naturally dequeues
for being empty.

Something like so perhaps?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -488,7 +488,7 @@ static int se_is_idle(struct sched_entit
 #endif /* !CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
-void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
+bool account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -1420,12 +1420,22 @@ static void update_curr(struct cfs_rq *c
 	}
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+static inline void task_throttle_setup_work(struct task_struct *p);
+
 static void update_curr_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->donor->se;
+	bool throttled = false;
 
-	for_each_sched_entity(se)
-		update_curr(cfs_rq_of(se));
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		update_curr(cfs_rq);
+		throttled |= cfs_rq_throttled(cfs_rq);
+	}
+
+	if (throttled)
+		task_throttle_setup_work(rq->donor);
 }
 
 static inline void
@@ -5627,7 +5637,6 @@ place_entity(struct cfs_rq *cfs_rq, stru
 }
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
-static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -5830,8 +5839,6 @@ pick_next_entity(struct rq *rq, bool pro
 	return se;
 }
 
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -5841,9 +5848,6 @@ static void put_prev_entity(struct cfs_r
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
-	/* throttle cfs_rqs exceeding runtime */
-	check_cfs_rq_runtime(cfs_rq);
-
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
@@ -5976,44 +5980,29 @@ static int __assign_cfs_rq_runtime(struc
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/* returns 0 on failure to allocate runtime */
-static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	int ret;
-
-	raw_spin_lock(&cfs_b->lock);
-	ret = __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice());
-	raw_spin_unlock(&cfs_b->lock);
-
-	return ret;
-}
+static bool throttle_cfs_rq(struct cfs_rq *cfs_rq);
 
-static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
+static bool __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
 
 	if (likely(cfs_rq->runtime_remaining > 0))
-		return;
+		return false;
 
 	if (cfs_rq->throttled)
-		return;
-	/*
-	 * if we're unable to extend our runtime we resched so that the active
-	 * hierarchy can be throttled
-	 */
-	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->h_curr))
-		resched_curr(rq_of(cfs_rq));
+		return true;
+
+	return throttle_cfs_rq(cfs_rq);
 }
 
 static __always_inline
-void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
+bool account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
-		return;
+		return false;
 
-	__account_cfs_rq_runtime(cfs_rq, delta_exec);
+	return __account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -6284,9 +6273,9 @@ static bool throttle_cfs_rq(struct cfs_r
 		 * We have raced with bandwidth becoming available, and if we
 		 * actually throttled the timer might not unthrottle us for an
 		 * entire period. We additionally needed to make sure that any
-		 * subsequent check_cfs_rq_runtime calls agree not to throttle
-		 * us, as we may commit to do cfs put_prev+pick_next, so we ask
-		 * for 1ns of runtime rather than just check cfs_b.
+		 * subsequent account_cfs_rq_runtime() calls agree not to
+		 * throttle us, as we may commit to do cfs put_prev+pick_next,
+		 * so we ask for 1ns of runtime rather than just check cfs_b.
 		 */
 		dequeue = 0;
 	} else {
@@ -6711,8 +6700,6 @@ static void check_enqueue_throttle(struc
 
 	/* update runtime allocation */
 	account_cfs_rq_runtime(cfs_rq, 0);
-	if (cfs_rq->runtime_remaining <= 0)
-		throttle_cfs_rq(cfs_rq);
 }
 
 static void sync_throttle(struct task_group *tg, int cpu)
@@ -6742,25 +6729,6 @@ static void sync_throttle(struct task_gr
 		cfs_rq->pelt_clock_throttled = 1;
 }
 
-/* conditionally throttle active cfs_rq's from put_prev_entity() */
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	if (!cfs_bandwidth_used())
-		return false;
-
-	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
-
-	/*
-	 * it's possible for a throttled entity to be forced into a running
-	 * state (e.g. set_curr_task), in this case we're finished.
-	 */
-	if (cfs_rq_throttled(cfs_rq))
-		return true;
-
-	return throttle_cfs_rq(cfs_rq);
-}
-
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
@@ -7015,8 +6983,7 @@ static void sched_fair_update_stop_tick(
 
 #else /* !CONFIG_CFS_BANDWIDTH: */
 
-static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
+static bool account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9208,7 +9175,6 @@ struct task_struct *pick_task_fair(struc
 {
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
-	struct task_struct *p;
 	int new_tasks;
 
 again:
@@ -9223,10 +9189,7 @@ struct task_struct *pick_task_fair(struc
 	if (!se)
 		goto again;
 
-	p = task_of(se);
-	if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
-		task_throttle_setup_work(p);
-	return p;
+	return task_of(se);
 
 idle:
 	new_tasks = sched_balance_newidle(rq, rf);
@@ -13618,6 +13581,7 @@ static void task_tick_fair(struct rq *rq
 {
 	struct sched_entity *se = &curr->se;
 	unsigned long weight = NICE_0_LOAD;
+	bool throttled = false;
 	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
@@ -13625,8 +13589,13 @@ static void task_tick_fair(struct rq *rq
 		entity_tick(cfs_rq, se, queued);
 
 		weight = __calc_prop_weight(cfs_rq, se, weight);
+
+		throttled |= cfs_rq_throttled(cfs_rq);
 	}
 
+	if (throttled)
+		task_throttle_setup_work(curr);
+
 	se = &curr->se;
 	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
 
@@ -13800,6 +13769,7 @@ static void set_next_task_fair(struct rq
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	unsigned long weight = NICE_0_LOAD;
 	bool on_rq = se->on_rq;
+	bool throttled = false;
 
 	clear_buddies(cfs_rq, se);
 
@@ -13814,12 +13784,15 @@ static void set_next_task_fair(struct rq
 			set_next_entity(cfs_rq, se);
 
 		/* ensure bandwidth has been allocated on our new cfs_rq */
-		account_cfs_rq_runtime(cfs_rq, 0);
+		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
 
 		if (on_rq)
 			weight = __calc_prop_weight(cfs_rq, se, weight);
 	}
 
+	if (throttled)
+		task_throttle_setup_work(p);
+
 	se = &p->se;
 	cfs_rq->curr = se;
 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12  9:20   ` Peter Zijlstra
@ 2026-05-12 18:24     ` Peter Zijlstra
  2026-05-12 18:25       ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-12 18:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 11:20:40AM +0200, Peter Zijlstra wrote:
> On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> 
> > 
> > I haven't reviewed the patches yet but I ran some tests with it while
> > testing sched latency related changes for short slice wakeup
> > preemption. I have some large hackbench regressions with this series
> > on HMP system with and without EAS. those figures are unexpected
> > because the benchs run on root cfs
> > 
> > One example with hackbench 8 groups thread pipe
> > tip/sched/core  tip/sched/core          +this patchset          +this patchset
> > slice 2.8ms     16ms                    2.8ms                   16ms
> > dragonboard rb5 with EAS
> > 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> > 0,689(+/- 9.1%) +8%
> > 
> > radxa orion6 HMP without EAS
> > 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> > 1,071(+/-5.9%) -82%
> > 
> > Increasing the slice partly removes regressions but tis is surprising
> > because the bench runs at root cfs and I thought that results will not
> > change in such a case
> > 
> > I will review the patchset and try to get what is going wrong
> 
> Yeah, that is unexpected. Let me go have another look too.

So I can reproduce even without the last patch applied. I suspect it is
in the cgroup mode patches somewhere. My first suspect is that concur
mode thing doing bad things to track the 'global' nr_running thing.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12 18:24     ` Peter Zijlstra
@ 2026-05-12 18:25       ` Peter Zijlstra
  2026-05-12 18:32         ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-12 18:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 08:24:39PM +0200, Peter Zijlstra wrote:
> On Tue, May 12, 2026 at 11:20:40AM +0200, Peter Zijlstra wrote:
> > On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> > 
> > > 
> > > I haven't reviewed the patches yet but I ran some tests with it while
> > > testing sched latency related changes for short slice wakeup
> > > preemption. I have some large hackbench regressions with this series
> > > on HMP system with and without EAS. those figures are unexpected
> > > because the benchs run on root cfs
> > > 
> > > One example with hackbench 8 groups thread pipe
> > > tip/sched/core  tip/sched/core          +this patchset          +this patchset
> > > slice 2.8ms     16ms                    2.8ms                   16ms
> > > dragonboard rb5 with EAS
> > > 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> > > 0,689(+/- 9.1%) +8%
> > > 
> > > radxa orion6 HMP without EAS
> > > 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> > > 1,071(+/-5.9%) -82%
> > > 
> > > Increasing the slice partly removes regressions but tis is surprising
> > > because the bench runs at root cfs and I thought that results will not
> > > change in such a case
> > > 
> > > I will review the patchset and try to get what is going wrong
> > 
> > Yeah, that is unexpected. Let me go have another look too.
> 
> So I can reproduce even without the last patch applied. I suspect it is
> in the cgroup mode patches somewhere. My first suspect is that concur
> mode thing doing bad things to track the 'global' nr_running thing.

Argh, n/m PEBKAC. I'll try this again in the morning :/

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12 18:25       ` Peter Zijlstra
@ 2026-05-12 18:32         ` Vincent Guittot
  2026-05-13  7:25           ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2026-05-12 18:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, 12 May 2026 at 20:25, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 12, 2026 at 08:24:39PM +0200, Peter Zijlstra wrote:
> > On Tue, May 12, 2026 at 11:20:40AM +0200, Peter Zijlstra wrote:
> > > On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> > >
> > > >
> > > > I haven't reviewed the patches yet but I ran some tests with it while
> > > > testing sched latency related changes for short slice wakeup
> > > > preemption. I have some large hackbench regressions with this series
> > > > on HMP system with and without EAS. those figures are unexpected
> > > > because the benchs run on root cfs
> > > >
> > > > One example with hackbench 8 groups thread pipe
> > > > tip/sched/core  tip/sched/core          +this patchset          +this patchset
> > > > slice 2.8ms     16ms                    2.8ms                   16ms
> > > > dragonboard rb5 with EAS
> > > > 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> > > > 0,689(+/- 9.1%) +8%
> > > >
> > > > radxa orion6 HMP without EAS
> > > > 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> > > > 1,071(+/-5.9%) -82%
> > > >
> > > > Increasing the slice partly removes regressions but tis is surprising
> > > > because the bench runs at root cfs and I thought that results will not
> > > > change in such a case
> > > >
> > > > I will review the patchset and try to get what is going wrong
> > >
> > > Yeah, that is unexpected. Let me go have another look too.
> >
> > So I can reproduce even without the last patch applied. I suspect it is
> > in the cgroup mode patches somewhere. My first suspect is that concur
> > mode thing doing bad things to track the 'global' nr_running thing.
>
> Argh, n/m PEBKAC. I'll try this again in the morning :/

Reverting the last patch is enough to recover performance

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12  8:10   ` Peter Zijlstra
@ 2026-05-12 18:45     ` Tejun Heo
  2026-05-18  7:14       ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Tejun Heo @ 2026-05-12 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Hello, Peter.

On Tue, May 12, 2026 at 10:10:00AM +0200, Peter Zijlstra wrote:
...
> Anyway, this is why I've been looking at these alternative weight
> schemes, to get the nominal fraction near 1 and make these problems go
> away. It is both the numerical issues and the disparity between levels
> (with root being at level 0 being the most obvious).

I see. I think what bothers me is that I'm unsure what the weight config
would mean when the shares are scaled by the number of active cpus in that
cgroup. Here's a simple example:

- There are 256 cpus.
- /cgroup-A has weight 100 and 128 active threads. No pinning.
- /cgroup-B has weight 100 and 256 active threads. No pinning.

In the current code, assuming math holds up, cgroup-A and B would get about
the same shares - ~128 CPUs each. However, if we scale the share by active
CPUs in each cgroup, B's tasks would end up with the same weight as A's on
CPUs that they end up competing on, which would lead to ~ 1:3 distribution.
Is that the right reading of the code?
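
The arithmetic behind that ~1:3 figure can be sketched as a small userspace
model. This is only a sketch under stated assumptions, not kernel code: each
cgroup places at most one thread per CPU, cgroup-A's threads land on the
first CPUs, and a CPU shared by both cgroups is split evenly because the
per-CPU weights come out equal. The names `struct split` and
`equal_weight_split()` are hypothetical.

```c
#include <assert.h>

/* Hypothetical result type: CPU-equivalents received by each cgroup. */
struct split {
	double a_cpus;
	double b_cpus;
};

/*
 * Model assumption: a CPU with threads from both cgroups gives each
 * cgroup half of its time; a CPU with threads from only one cgroup
 * gives that cgroup all of it.
 */
static struct split equal_weight_split(int nr_cpus, int a_threads, int b_threads)
{
	struct split s = { 0.0, 0.0 };
	int cpu;

	/* The model places at most one thread per cgroup on each CPU. */
	assert(a_threads <= nr_cpus && b_threads <= nr_cpus);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		int a_here = cpu < a_threads;	/* A occupies the first a_threads CPUs */
		int b_here = cpu < b_threads;	/* B occupies the first b_threads CPUs */

		if (a_here && b_here) {
			s.a_cpus += 0.5;
			s.b_cpus += 0.5;
		} else if (a_here) {
			s.a_cpus += 1.0;
		} else if (b_here) {
			s.b_cpus += 1.0;
		}
	}
	return s;
}
```

With equal_weight_split(256, 128, 256), the 128 contended CPUs are split
evenly (64 each) and B keeps the other 128 to itself, so A ends up with 64
CPUs and B with 192.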

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
  2026-05-11 16:21   ` K Prateek Nayak
@ 2026-05-13  4:51   ` John Stultz
  2026-05-13  5:00     ` John Stultz
  2026-05-19 10:38   ` Vincent Guittot
  2026-05-26  7:53   ` Zhang Qiao
  3 siblings, 1 reply; 64+ messages in thread
From: John Stultz @ 2026-05-13  4:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, kprateek.nayak, qyousef

On Mon, May 11, 2026 at 5:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Change fair/cgroup to a single runqueue.
>
> Infamously fair/cgroup isn't working for a number of people; typically
> the complaint is latencies and/or overhead. The latency issue is due
> to the intermediate entries that represent a combination of tasks and
> thereby obfuscate the runnability of tasks.
>
> The approach here is to leave the cgroup hierarchy as is; including
> the intermediate enqueue/dequeue but move the actual EEVDF runqueue
> outside. This means things like the shares_weight approximation are
> fully preserved.
>
> That is, given a hierarchy like:
>
>         R
>         |
>         se--G1
>             / \
>       G2--se   se--G3
>      / \           |
> T1--se se--T2      se--T3
>
> This is fully maintained for load tracking, however the EEVDF parts of
> cfs_rq/se go unused for the intermediates and are instead connected
> like:
>
>      _R_
>     / | \
>    T1 T2 T3
>
> Since the effective weight of the entities is determined by the
> hierarchy, this gets recomputed on enqueue,set_next_task and tick.
>
> Notably, the effective weight (se->h_load) is computed from the
> hierarchical fraction: se->load / cfs_rq->load.
>
> Since EEVDF is now exclusive operating on rq->cfs, it needs to
> consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
> only tasks can get delayed, simplifying some of the cgroup cleanup.
>
> One place where additional information was required was
> set_next_task() / put_prev_task(), where we need to track 'current'
> both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
> (cfs_rq->curr).
>
> As a result of only having a single level to pick from, much of the
> complications in pick_next_task() and preemption go away.
>
> Since many of the hierarchical operations are still there, this won't
> immediately fix the performance issues, but hopefully it will fix some
> of the latency issues.
>
> TODO: split struct cfs_rq / struct sched_entity
> TODO: try and get rid of h_curr
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I know Vincent was having some perf troubles with this patch, but
booting on a 64 vCPU qemu environment, I'm seeing:

[    5.688490] Oops: divide error: 0000 [#1] SMP NOPTI
[    5.689457] CPU: 47 UID: 0 PID: 0 Comm: swapper/47 Not tainted 7.1.0-rc2-00026-g82a8ec6fb3f9 #38 PREEMPT(full)
[    5.689457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[    5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
[    5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0 14 31 9
[    5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
[    5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
[    5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
[    5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[    5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
[    5.689457] FS:  0000000000000000(0000) GS:ffff888235c2e000(0000) knlGS:0000000000000000
[    5.689457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
[    5.689457] Call Trace:
[    5.689457]  <TASK>
[    5.689457]  wakeup_preempt+0xa8/0xd0
[    5.689457]  attach_one_task+0xec/0x150
[    5.689457]  __schedule+0x1ad8/0x21c0
[    5.689457]  schedule_idle+0x22/0x40
[    5.689457]  cpu_startup_entry+0x29/0x30
[    5.689457]  start_secondary+0xf7/0x100
[    5.689457]  common_startup_64+0x13e/0x148
[    5.689457]  </TASK>
[    5.689457] Dumping ftrace buffer:
[    5.689457]    (ftrace buffer empty)
[    5.689457] ---[ end trace 0000000000000000 ]---
[    5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
[    5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0 14 31 9
[    5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
[    5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
[    5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[    5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
[    5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[    5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
[    5.689457] FS:  0000000000000000(0000) GS:ffff888235c2e000(0000) knlGS:0000000000000000
[    5.689457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
[    5.689457] Kernel panic - not syncing: Fatal exception

Which I bisected down to this last patch in the series.

faddr2line gave me:
__calc_delta at kernel/sched/fair.c:290
(inlined by) calc_delta_fair at kernel/sched/fair.c:300
(inlined by) update_protect_slice at kernel/sched/fair.c:1070
(inlined by) wakeup_preempt_fair at kernel/sched/fair.c:9193

This usually trips as the ww_mutex selftest starts at bootup.

Unfortunately I still see it with the add-on changes you proposed to K
Prateek's feedback here.

I'll try to narrow it down further tomorrow.

thanks
-john

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-13  4:51   ` John Stultz
@ 2026-05-13  5:00     ` John Stultz
  2026-05-14  1:36       ` John Stultz
  0 siblings, 1 reply; 64+ messages in thread
From: John Stultz @ 2026-05-13  5:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 9:51 PM John Stultz <jstultz@google.com> wrote:
>
> On Mon, May 11, 2026 at 5:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Change fair/cgroup to a single runqueue.
> >
...
>
> I know Vincent was having some perf troubles with this patch, but
> booting on a 64 vCPU qemu environment, I'm seeing:
>
> [    5.688490] Oops: divide error: 0000 [#1] SMP NOPTI
> [    5.689457] CPU: 47 UID: 0 PID: 0 Comm: swapper/47 Not tainted
> 7.1.0-rc2-00026-g82a8ec6fb3f9 #38 PREEMPT(full)
> [    5.689457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> [    5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
> [    5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
> 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
> 14 31 9
> [    5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
> [    5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
> [    5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [    5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
> [    5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> [    5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
> [    5.689457] FS:  0000000000000000(0000) GS:ffff888235c2e000(0000)
> knlGS:0000000000000000
> [    5.689457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
> [    5.689457] Call Trace:
> [    5.689457]  <TASK>
> [    5.689457]  wakeup_preempt+0xa8/0xd0
> [    5.689457]  attach_one_task+0xec/0x150
> [    5.689457]  __schedule+0x1ad8/0x21c0
> [    5.689457]  schedule_idle+0x22/0x40
> [    5.689457]  cpu_startup_entry+0x29/0x30
> [    5.689457]  start_secondary+0xf7/0x100
> [    5.689457]  common_startup_64+0x13e/0x148
> [    5.689457]  </TASK>
> [    5.689457] Dumping ftrace buffer:
> [    5.689457]    (ftrace buffer empty)
> [    5.689457] ---[ end trace 0000000000000000 ]---
> [    5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
> [    5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
> 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
> 14 31 9
> [    5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
> [    5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
> [    5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [    5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
> [    5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> [    5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
> [    5.689457] FS:  0000000000000000(0000) GS:ffff888235c2e000(0000)
> knlGS:0000000000000000
> [    5.689457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
> [    5.689457] Kernel panic - not syncing: Fatal exception
>
> Which I bisected down to this last patch in the series.
>
> faddr2line gave me:
> __calc_delta at kernel/sched/fair.c:290
> (inlined by) calc_delta_fair at kernel/sched/fair.c:300
> (inlined by) update_protect_slice at kernel/sched/fair.c:1070
> (inlined by) wakeup_preempt_fair at kernel/sched/fair.c:9193
>
> This usually trips as the ww_mutex selftest starts at bootup.
>
> Unfortunately I still see it with the add-on changes you proposed to K
> Prateek's feedback here.
>
> I'll try to narrow it down further tomorrow.

As karma would have it, this does seem to depend on CONFIG_SCHED_PROXY_EXEC. :)
I'm guessing the switch in calc_delta_fair() to use se->h_load is
uncovering something proxy isn't handling properly with that value.

But I'll have more tomorrow.

thanks
-john

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-12 11:09     ` Peter Zijlstra
@ 2026-05-13  7:01       ` K Prateek Nayak
  2026-05-13  7:25         ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-13  7:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

Hello Peter,

On 5/12/2026 4:39 PM, Peter Zijlstra wrote:
>> @@ -13819,6 +13831,12 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>>  		if (on_rq)
>>  			weight = __calc_prop_weight(cfs_rq, se, weight);
>>  	}
>> +	/*
>> +	 * Add throttle work if the bandwidth allocation above failed
>> +	 * to grab any runtime and throttled the task's hierarchy.
>> +	 */
>> +	if (throttled_hierarchy(task_cfs_rq(p)))
>> +		task_throttle_setup_work(p);
> 
> We already call into account_cfs_rq_runtime(); which basically does all
> we need.
> 
> I think the distinction between account_cfs_rq_runtime() and
> check_cfs_rq_runtime() no longer makes sense. We can throttle a cfs_rq
> at any point now, since we no longer remove the cfs_rq, but rather we
> make the tasks suspend themselves until the cfs_rq naturally dequeues
> for being empty.
> 
> Something like so perhaps?

That makes sense! The task should naturally execute the task work when
exiting out of the kernel / IRQ handler into the userspace so we should
be good.

I'll rebase the below diff on tip, test it a bit, add a commit log, and
send it your way if you don't mind or would you like to keep it with the
flat_cg bits?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12 18:32         ` Vincent Guittot
@ 2026-05-13  7:25           ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-13  7:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 08:32:12PM +0200, Vincent Guittot wrote:
> On Tue, 12 May 2026 at 20:25, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, May 12, 2026 at 08:24:39PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 12, 2026 at 11:20:40AM +0200, Peter Zijlstra wrote:
> > > > On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> > > >
> > > > >
> > > > > I haven't reviewed the patches yet but I ran some tests with it while
> > > > > testing sched latency related changes for short slice wakeup
> > > > > preemption. I have some large hackbench regressions with this series
> > > > > on HMP system with and without EAS. those figures are unexpected
> > > > > because the benchs run on root cfs
> > > > >
> > > > > One example with hackbench 8 groups thread pipe
> > > > > tip/sched/core  tip/sched/core          +this patchset          +this patchset
> > > > > slice 2.8ms     16ms                    2.8ms                   16ms
> > > > > dragonboard rb5 with EAS
> > > > > 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> > > > > 0,689(+/- 9.1%) +8%
> > > > >
> > > > > radxa orion6 HMP without EAS
> > > > > 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> > > > > 1,071(+/-5.9%) -82%
> > > > >
> > > > > Increasing the slice partly removes regressions but tis is surprising
> > > > > because the bench runs at root cfs and I thought that results will not
> > > > > change in such a case
> > > > >
> > > > > I will review the patchset and try to get what is going wrong
> > > >
> > > > Yeah, that is unexpected. Let me go have another look too.
> > >
> > > So I can reproduce even without the last patch applied. I suspect it is
> > > in the cgroup mode patches somewhere. My first suspect is that concur
> > > mode thing doing bad things to track the 'global' nr_running thing.
> >
> > Argh, n/m PEBKAC. I'll try this again in the morning :/
> 
> Reverting the last patch is enough to recover performance

Yeah, I was on a fail-streak yesterday. I forgot to copy the kernel
image before reboot ...

Let's see if today is better :-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-13  7:01       ` K Prateek Nayak
@ 2026-05-13  7:25         ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-13  7:25 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

On Wed, May 13, 2026 at 12:31:05PM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 5/12/2026 4:39 PM, Peter Zijlstra wrote:
> >> @@ -13819,6 +13831,12 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> >>  		if (on_rq)
> >>  			weight = __calc_prop_weight(cfs_rq, se, weight);
> >>  	}
> >> +	/*
> >> +	 * Add throttle work if the bandwidth allocation above failed
> >> +	 * to grab any runtime and throttled the task's hierarchy.
> >> +	 */
> >> +	if (throttled_hierarchy(task_cfs_rq(p)))
> >> +		task_throttle_setup_work(p);
> > 
> > We already call into account_cfs_rq_runtime(); which basically does all
> > we need.
> > 
> > I think the distinction between account_cfs_rq_runtime() and
> > check_cfs_rq_runtime() no longer makes sense. We can throttle a cfs_rq
> > at any point now, since we no longer remove the cfs_rq, but rather we
> > make the tasks suspend themselves until the cfs_rq naturally dequeues
> > for being empty.
> > 
> > Something like so perhaps?
> 
> That makes sense! The task should naturally execute the task work when
> exiting out of the kernel / IRQ handler into the userspace so we should
> be good.
> 
> I'll rebase the below diff on tip, test it a bit, add a commit log, and
> send it your way if you don't mind or would you like to keep it with the
> flat_cg bits?

Nah, this seems like something that can be done independently. And thus
it should be. That flat patch is big enough as is.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12  8:42 ` Vincent Guittot
  2026-05-12  9:20   ` Peter Zijlstra
@ 2026-05-13 11:35   ` Peter Zijlstra
  2026-05-13 12:43     ` Peter Zijlstra
  2026-05-18 13:34     ` Vincent Guittot
  1 sibling, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-13 11:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:

> I haven't reviewed the patches yet but I ran some tests with it while
> testing sched latency related changes for short slice wakeup
> preemption. I have some large hackbench regressions with this series
> on HMP system with and without EAS. those figures are unexpected
> because the benchs run on root cfs
> 
> One example with hackbench 8 groups thread pipe
> tip/sched/core  tip/sched/core          +this patchset          +this patchset
> slice 2.8ms     16ms                    2.8ms                   16ms
> dragonboard rb5 with EAS
> 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> 0,689(+/- 9.1%) +8%
> 
> radxa orion6 HMP without EAS
> 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> 1,071(+/-5.9%) -82%
> 
> Increasing the slice partly removes regressions but tis is surprising
> because the bench runs at root cfs and I thought that results will not
> change in such a case

D'oh :/

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e54da4c6c945..77d0e1937f2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
-	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
+	struct cfs_rq *cfs_rq = &rq->cfs;
 	int cse_is_idle, pse_is_idle;
 
 	/*

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-13 11:35   ` Peter Zijlstra
@ 2026-05-13 12:43     ` Peter Zijlstra
  2026-05-18 13:34     ` Vincent Guittot
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-13 12:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Wed, May 13, 2026 at 01:35:10PM +0200, Peter Zijlstra wrote:
> On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> 
> > I haven't reviewed the patches yet but I ran some tests with it while
> > testing sched latency related changes for short slice wakeup
> > preemption. I have some large hackbench regressions with this series
> > on HMP system with and without EAS. those figures are unexpected
> > because the benchs run on root cfs
> > 
> > One example with hackbench 8 groups thread pipe
> > tip/sched/core  tip/sched/core          +this patchset          +this patchset
> > slice 2.8ms     16ms                    2.8ms                   16ms
> > dragonboard rb5 with EAS
> > 0,748(+/-4,6%)  0,621(+/-3.6%) +17%     1,915(+/-7.9%) -156%
> > 0,689(+/- 9.1%) +8%
> > 
> > radxa orion6 HMP without EAS
> > 0,588(+/-5.8%)  0,677(+/-5.9%) -15%     1,505(+/-10%) -156%
> > 1,071(+/-5.9%) -82%
> > 
> > Increasing the slice partly removes regressions but tis is surprising
> > because the bench runs at root cfs and I thought that results will not
> > change in such a case
> 
> D'oh :/
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e54da4c6c945..77d0e1937f2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
>  	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>  	struct task_struct *donor = rq->donor;
>  	struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> -	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> +	struct cfs_rq *cfs_rq = &rq->cfs;
>  	int cse_is_idle, pse_is_idle;
>  
>  	/*

With that fixed, I now get:

	vanilla	slice(*)

FPS min	  3.0	11.1
    avg  44.7	57.3
    max  88.1	96.2

FT  min   9.1	 8.0
    avg  41.4	21.0
    max 157.2   53.9

FPS (Frames Per Second)
FT  (FrameTime)


Which I suppose shows we now preempt less. It's still significantly
better with reduced slice, but not as good as it was.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-13  5:00     ` John Stultz
@ 2026-05-14  1:36       ` John Stultz
  2026-05-14  2:53         ` K Prateek Nayak
  0 siblings, 1 reply; 64+ messages in thread
From: John Stultz @ 2026-05-14  1:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 10:00 PM John Stultz <jstultz@google.com> wrote:
> On Tue, May 12, 2026 at 9:51 PM John Stultz <jstultz@google.com> wrote:
> >
> > On Mon, May 11, 2026 at 5:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Change fair/cgroup to a single runqueue.
> > >
> ...
> >
> > I know Vincent was having some perf troubles with this patch, but
> > booting on a 64 vCPU qemu environment, I'm seeing:
> >
> > [    5.688490] Oops: divide error: 0000 [#1] SMP NOPTI
> > [    5.689457] CPU: 47 UID: 0 PID: 0 Comm: swapper/47 Not tainted
> > 7.1.0-rc2-00026-g82a8ec6fb3f9 #38 PREEMPT(full)
> > [    5.689457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS 1.17.0-debian-1.17.0-1 04/01/2014
> > [    5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
> > [    5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
> > 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
> > 14 31 9
> > [    5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
> > [    5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
> > [    5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > [    5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
> > [    5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> > [    5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
> > [    5.689457] FS:  0000000000000000(0000) GS:ffff888235c2e000(0000)
> > knlGS:0000000000000000
> > [    5.689457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
> > [    5.689457] Call Trace:
> > [    5.689457]  <TASK>
> > [    5.689457]  wakeup_preempt+0xa8/0xd0
> > [    5.689457]  attach_one_task+0xec/0x150
> > [    5.689457]  __schedule+0x1ad8/0x21c0
> > [    5.689457]  schedule_idle+0x22/0x40
> > [    5.689457]  cpu_startup_entry+0x29/0x30
> > [    5.689457]  start_secondary+0xf7/0x100
> > [    5.689457]  common_startup_64+0x13e/0x148
> > [    5.689457]  </TASK>
> > [    5.689457] Dumping ftrace buffer:
> > [    5.689457]    (ftrace buffer empty)
> > [    5.689457] ---[ end trace 0000000000000000 ]---
> > [    5.689457] RIP: 0010:wakeup_preempt_fair+0x1b7/0x430
> > [    5.689457] Code: 74 0b 48 8b 52 28 48 39 d0 48 0f 47 c2 48 8b b9
> > 90 00 00 00 48 8b b1 08 01 00 00 48 81 ff 00 00 10 00 74 09 48 c1 e0
> > 14 31 9
> > [    5.689457] RSP: 0000:ffffc9000021fd70 EFLAGS: 00010046
> > [    5.689457] RAX: 000002ab98000000 RBX: ffff8881b8e2db40 RCX: ffffffff83022a80
> > [    5.689457] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > [    5.689457] RBP: 0000000000000001 R08: ffff88810cb14380 R09: ffffffff83022b00
> > [    5.689457] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
> > [    5.689457] R13: 0000000000000000 R14: ffff88810cb14300 R15: ffff8881b8e2da00
> > [    5.689457] FS:  0000000000000000(0000) GS:ffff888235c2e000(0000)
> > knlGS:0000000000000000
> > [    5.689457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    5.689457] CR2: 0000000000000000 CR3: 000000000304c001 CR4: 0000000000370ef0
> > [    5.689457] Kernel panic - not syncing: Fatal exception
> >
> > Which I bisected down to this last patch in the series.
> >
> > faddr2line gave me:
> > __calc_delta at kernel/sched/fair.c:290
> > (inlined by) calc_delta_fair at kernel/sched/fair.c:300
> > (inlined by) update_protect_slice at kernel/sched/fair.c:1070
> > (inlined by) wakeup_preempt_fair at kernel/sched/fair.c:9193
> >
> > This usually trips as the ww_mutex selftest starts at bootup.
> >
> > Unfortunately I still see it with the add-on changes you proposed to K
> > Prateek's feedback here.
> >
> > I'll try to narrow it down further tomorrow.
>
> As karma would have it, this does seem to depend on CONFIG_SCHED_PROXY_EXEC. :)
> I'm guessing the switch in calc_delta_fair() to use se->h_load is
> uncovering something proxy isn't handling properly with that value.
>

So looking at the callstack when I see the failure:
proxy_find_task()
  proxy_force_return()
    proxy_resched_idle()  <- sets rq->donor to idle
    attach_one_task()
      wakeup_preempt()
        wakeup_preempt_fair()
          update_protect_slice() <- called with the donor's se
            calc_delta_fair()
              __calc_delta() <- div by zero

Basically we end up in wakeup_preempt_fair() with rq->donor ==
rq->idle because we earlier called proxy_resched_idle().

Without proxy, if we call wakeup_preempt_fair() when rq->donor (and
rq->curr) is rq->idle, we usually end up taking the `if
(test_tsk_need_resched(rq->curr))` early exit and we don't hit this.

But with proxy, rq->curr isn't idle at this point, so we end up
continuing on. The se_is_idle(se) checks (where se is &donor->se)
don't catch this either, because rq->idle (maybe unintuitively)
has a SCHED_NORMAL policy.

So we end up getting down to update_protect_slice() with rq->idle as
the se and the idle h_load.weight is zero.
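
To make the failure concrete, here is a minimal userspace sketch of the
weight scaling. It is an idealized model only: calc_delta_sketch() is a
hypothetical name, and the kernel's __calc_delta() uses fixed-point math
rather than a plain division. It shows that a zero weight, like the idle
task's h_load.weight here, leaves nothing valid to divide by:

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL	/* nominal weight of a nice-0 task (unscaled) */

/*
 * Idealized stand-in for calc_delta_fair(): delta * NICE_0_LOAD / weight.
 * Hypothetical helper, not kernel code.
 * Returns 0 and stores the scaled delta in *out on success, or -1 when
 * weight is zero, which is where a real division would trap.
 */
static int calc_delta_sketch(uint64_t delta, uint64_t weight, uint64_t *out)
{
	if (weight == 0)
		return -1;

	*out = delta * NICE_0_LOAD / weight;
	return 0;
}
```

With a weight of 1024 the delta passes through unchanged; with a weight
of 0 the helper returns -1 instead of dividing.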

Not sure what the best approach might be, but adding:
  if (donor == rq->idle) {
    /* don't give rq->idle slice protection */
    preempt_action = PREEMPT_WAKEUP_SHORT;
    goto preempt;
  }

similar to the `if (cse_is_idle && !pse_is_idle)` check seems to resolve this.

Anyway, if you have thoughts on better approach, I'd be happy to work
up a patch to add on top of this one.

thanks
-john


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-14  1:36       ` John Stultz
@ 2026-05-14  2:53         ` K Prateek Nayak
  2026-05-14  3:14           ` John Stultz
  0 siblings, 1 reply; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-14  2:53 UTC (permalink / raw)
  To: John Stultz, Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, qyousef

Hello John,

On 5/14/2026 7:06 AM, John Stultz wrote:
> So looking at the callstack when I see the failure:
> proxy_find_task()
>   proxy_force_return()
>     proxy_resched_idle()  <- sets rq->donor to idle
>     attach_one_task()
>       wakeup_preempt()
>         wakeup_preempt_fair()

After this point, I would have expected us to call the idle class's
wakeup_preempt() since that is the donor context ...

>           update_protect_slice() <- called with the donor's se
>             calc_delta_fair()
>               __calc_delta() <- div by zero
> 
> Basically we end up in wakeup_preempt_fair() with rq->donor ==
> rq->idle because we earlier called proxy_resched_idle().

Could you check if the following makes things better:

  (Only build tested)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ae5f19c1b7e..77f4ebe8f5c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6653,6 +6653,7 @@ static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
 static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 {
 	put_prev_set_next_task(rq, rq->donor, rq->idle);
+	rq->next_class = &idle_sched_class;
 	rq_set_donor(rq, rq->idle);
 	set_tsk_need_resched(rq->idle);
 	return rq->idle;
---

I'm just getting started for the day so it'll be a while before I
actually get to test this on top of flat cgroup bits which I haven't yet
run with SCHED_PROXY_EXEC enabled.

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-14  2:53         ` K Prateek Nayak
@ 2026-05-14  3:14           ` John Stultz
  0 siblings, 0 replies; 64+ messages in thread
From: John Stultz @ 2026-05-14  3:14 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, mingo, longman, chenridong, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, cgroups, linux-kernel, qyousef

On Wed, May 13, 2026 at 7:53 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> On 5/14/2026 7:06 AM, John Stultz wrote:
> > So looking at the callstack when I see the failure:
> > proxy_find_task()
> >   proxy_force_return()
> >     proxy_resched_idle()  <- sets rq->donor to idle
> >     attach_one_task()
> >       wakeup_preempt()
> >         wakeup_preempt_fair()
>
> After this point, I would have expected we called idle class's
> wakeup_preempt() since that is the donor context ...

Ah, that's a good point! (I was getting muddied by the rq->idle having
SCHED_NORMAL policy value and assuming that was why we were in the
fair code).

> >           update_protect_slice() <- called with the donor's se
> >             calc_delta_fair()
> >               __calc_delta() <- div by zero
> >
> > Basically we end up in wakeup_preempt_fair() with rq->donor ==
> > rq->idle because we earlier called proxy_resched_idle().
>
> Could you check if following makes things better:
>
>   (Only build tested)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3ae5f19c1b7e..77f4ebe8f5c7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6653,6 +6653,7 @@ static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
>  static inline struct task_struct *proxy_resched_idle(struct rq *rq)
>  {
>         put_prev_set_next_task(rq, rq->donor, rq->idle);
> +       rq->next_class = &idle_sched_class;
>         rq_set_donor(rq, rq->idle);
>         set_tsk_need_resched(rq->idle);
>         return rq->idle;

Yeah, that looks to avoid the problem and is a fair bit cleaner. I
missed the introduction of the rq->next_class detail!
Thanks for pointing this out!

I'll do some testing against the full series and get a patch sent out here soon.
-john


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
                   ` (11 preceding siblings ...)
  2026-05-12  8:42 ` Vincent Guittot
@ 2026-05-16  3:30 ` Qais Yousef
  12 siblings, 0 replies; 64+ messages in thread
From: Qais Yousef @ 2026-05-16  3:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak

On 05/11/26 13:31, Peter Zijlstra wrote:
> Hi!
> 
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.

It does..

Not that it is useful info, but we talked briefly about it at OSPM, so I thought
I'd report back. I gave this a go with my test case from the schedqos announcement
[1], running schbench with a kernel build as BACKGROUND noise, but the fairness
imposed at group level is preserved (as expected) even if the pick is
flattened.

I do actually want a totally flat system, ie: disable this whole thing :-)

The problem is that when you want to build a system that introduces smart tagging
based on tasks, group scheduling becomes a big problem. If a task is set as
background or interactive, the setting has to be global to be enforced, otherwise it
loses its meaning. My test case stresses two long-running tasks, one
interactive and the other background, and group scheduling imposes fairness
that breaks the task-level tagging. Managing deadline via runtime doesn't help
here since they are both always-busy tasks, and one must use nice values to
manage bandwidth. But nice values are local when autogroup/cgroups are present.

I need to find a simple way to turn this thing off at runtime and properly
flatten it ;-)

/runs away

[1] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-12 18:45     ` Tejun Heo
@ 2026-05-18  7:14       ` Peter Zijlstra
  2026-05-18 19:11         ` Tejun Heo
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-18  7:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, May 12, 2026 at 08:45:21AM -1000, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, May 12, 2026 at 10:10:00AM +0200, Peter Zijlstra wrote:
> ...
> > Anyway, this is why I've been looking at these alternative weight
> > schemes, to get the nominal fraction near 1 and make these problems go
> > away. It is both the numerical issues and the disparity between levels
> > (with root being at level 0 being the most obvious).
> 
> I see. I think what bothers me is that I'm unsure what the weight config
> would mean when the shares are scaled by the number of active cpus in that
> cgroup. 

Relative weight per active cpu :-), but yes, that is a somewhat more
difficult concept I suppose.

> Here's a simple example:
> 
> - There are 256 cpus.
> - /cgroup-A has weight 100 and 128 active threads. No pinning.
> - /cgroup-B has weight 100 and 256 active threads. No pinning.
> 
> In the current code, assuming math holds up, cgroup-A and B would get about
> the same shares - ~128 CPUs each. However, if we scale the share by active
> CPUs in each cgroup, B's tasks would end up with the same weight as A's on
> CPUs that they end up competing on, which would lead to ~ 1:3 distribution.
> Is that the right reading of the code?

Indeed. So both A and B will get ~1024 weight per (active) CPU, such
that on the CPUs they contend they will get 1:1 and then B will get the
full CPU on the uncontested CPUs, resulting in a total of 1:3
distribution.

This can of course be compensated by increasing the relative
weight of A, if that is so desired. But the alternative view is that for
those 128 CPUs they overlap, A and B will get equal parts, it is just
that B consumes another 128 CPUs and will not have contention there.

So the current scheme will inflate the part of A to be double the weight
(of B), giving them 2 out of 3 parts on the contended CPUs, but then B
will still get complete / uncontested access to those extra 128 CPUs,
resulting in a 2:4 weight distribution.

Which also isn't as straightforward as one might think.

So perhaps 'weight on the CPUs you contest on' isn't as unintuitive as
it seems at first glance, it's just different.

And it has tremendous advantages as outlined before; it is naturally
normalized -- the disparity between nesting levels goes away, and the
edge case of a single CPU active will be sane.

E.g. consider your example, except now A has 1 active thread. Then A
will get the full group weight (1024) on its one CPU, while B will get
(1024/256=4) on each CPU.

So for the one contended CPU A gets 256 out of 257 parts, while B gets
the full CPU for the remaining 255 CPUs, for a:

  256    1        257
  --- : --- + 255*--- = 256:65536 = 1:256
  257   257       257

distribution. While with the new scheme it would be:

 1   1       2
 - : - + 255*- = 1:511
 2   2       2

Which, realistically isn't all that different, except the old scheme has
this really large weight to deal with.

So from where I'm sitting, yes different, but it behaves better.
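
For anyone who wants to replay the arithmetic above, a small userspace
sketch. Everything in it is a simplifying assumption rather than kernel
code: equal group weights, one thread per CPU, every CPU running an A
thread also running a B thread, and exact weight math. The names
cpu_split(), split_current() and split_per_cpu() are made up for this
example.

```c
#include <stdint.h>

/* CPU-time ratio a:b between cgroup A and cgroup B, in lowest terms. */
struct split {
	uint64_t a, b;
};

static uint64_t gcd_u64(uint64_t x, uint64_t y)
{
	while (y) {
		uint64_t t = x % y;

		x = y;
		y = t;
	}
	return x;
}

/*
 * na CPUs run one A thread and one B thread each (contended), nb - na
 * CPUs run only a B thread. wa:wb is the per-CPU weight ratio on a
 * contended CPU. Requires 1 <= na <= nb and wa > 0.
 * Both totals are kept over the common denominator (wa + wb).
 */
static struct split cpu_split(uint64_t na, uint64_t nb, uint64_t wa, uint64_t wb)
{
	uint64_t a = na * wa;
	uint64_t b = na * wb + (nb - na) * (wa + wb);
	uint64_t g = gcd_u64(a, b);

	return (struct split){ a / g, b / g };
}

/* Group weight spread over active CPUs: W/na : W/nb == nb : na. */
static struct split split_current(uint64_t na, uint64_t nb)
{
	return cpu_split(na, nb, nb, na);
}

/* Full group weight on every active CPU: 1 : 1 where they contend. */
static struct split split_per_cpu(uint64_t na, uint64_t nb)
{
	return cpu_split(na, nb, 1, 1);
}
```

split_current(128, 256) gives 1:2 (the 2:4 above) and
split_per_cpu(128, 256) gives 1:3; with a single A thread the results
are 1:256 and 1:511.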


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-13 11:35   ` Peter Zijlstra
  2026-05-13 12:43     ` Peter Zijlstra
@ 2026-05-18 13:34     ` Vincent Guittot
  2026-05-18 21:12       ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2026-05-18 13:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Wed, 13 May 2026 at 13:35, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
>
> > I haven't reviewed the patches yet but I ran some tests with it while
> > testing sched latency related changes for short slice wakeup
> > preemption. I have some large hackbench regressions with this series
> > on HMP systems with and without EAS. Those figures are unexpected
> > because the benchmarks run on the root cfs
> >
> > One example with hackbench 8 groups thread pipe
> >
> >                               tip/sched/core  tip/sched/core       +this patchset        +this patchset
> > slice                         2.8ms           16ms                 2.8ms                 16ms
> > dragonboard rb5 with EAS      0,748(+/-4,6%)  0,621(+/-3.6%) +17%  1,915(+/-7.9%) -156%  0,689(+/-9.1%) +8%
> > radxa orion6 HMP without EAS  0,588(+/-5.8%)  0,677(+/-5.9%) -15%  1,505(+/-10%) -156%   1,071(+/-5.9%) -82%
> >
> > Increasing the slice partly removes regressions but this is surprising
> > because the bench runs at root cfs and I thought that results would not
> > change in such a case
>
> D'oh :/
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e54da4c6c945..77d0e1937f2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
>         enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>         struct task_struct *donor = rq->donor;
>         struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> -       struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> +       struct cfs_rq *cfs_rq = &rq->cfs;

I tested this patch on top of the series but it doesn't fix the perf
regression on rb5

hackbench 8 groups thread pipe is still at 1.907(+/-7.6%) with default
slice duration

>         int cse_is_idle, pse_is_idle;
>
>         /*


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-18  7:14       ` Peter Zijlstra
@ 2026-05-18 19:11         ` Tejun Heo
  2026-05-27  9:41           ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Tejun Heo @ 2026-05-18 19:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

Hello, Peter.

On Mon, May 18, 2026 at 09:14:56AM +0200, Peter Zijlstra wrote:
...
> So the current scheme will inflate the part of A to be double the weight
> (of B), giving them 2 out of 3 parts on the contended CPUs, but then B
> will still get complete / uncontested access to those extra 128 CPUs,
> resulting in a 2:4 weight distribution.
> 
> Which also isn't as straightforward as one might think.

Right, the current behavior isn't quite what people would expect intuitively
either.

...
> So for the one contended CPU A gets 256 out of 257 parts, while B gets
> the full CPU for the remaining 255 CPUs, for a:
> 
>   256    1        257
>   --- : --- + 255*--- = 256:65536 = 1:256
>   257   257       257
> 
> distribution. While with the new scheme it would be:
> 
>  1   1       2
>  - : - + 255*- = 1:511
>  2   2       2
> 
> Which, realistically isn't all that different, except the old scheme has
> this really large weight to deal with.
> 
> So from where I'm sitting, yes different, but it behaves better.

I see. Thread cardinality and affinity problems make weight based
distribution such a pain. I wonder whether this can be better solved by
turning it into a two-layer allocation problem - groups to CPUs and then
timeshare on CPUs as necessary. That comes with a lot of its own problems
but it can, aspirationally at least, approximate global weight distribution
and would have better locality properties.

Thanks.

-- 
tejun


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-18 13:34     ` Vincent Guittot
@ 2026-05-18 21:12       ` Peter Zijlstra
  2026-05-19 10:13         ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-18 21:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, May 18, 2026 at 03:34:51PM +0200, Vincent Guittot wrote:
> On Wed, 13 May 2026 at 13:35, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> >
> > > I haven't reviewed the patches yet but I ran some tests with it while
> > > testing sched latency related changes for short slice wakeup
> > > preemption. I have some large hackbench regressions with this series
> > > on HMP systems with and without EAS. Those figures are unexpected
> > > because the benchmarks run on the root cfs
> > >
> > > One example with hackbench 8 groups thread pipe
> > >
> > >                               tip/sched/core  tip/sched/core       +this patchset        +this patchset
> > > slice                         2.8ms           16ms                 2.8ms                 16ms
> > > dragonboard rb5 with EAS      0,748(+/-4,6%)  0,621(+/-3.6%) +17%  1,915(+/-7.9%) -156%  0,689(+/-9.1%) +8%
> > > radxa orion6 HMP without EAS  0,588(+/-5.8%)  0,677(+/-5.9%) -15%  1,505(+/-10%) -156%   1,071(+/-5.9%) -82%
> > >
> > > Increasing the slice partly removes regressions but this is surprising
> > > because the bench runs at root cfs and I thought that results would not
> > > change in such a case
> >
> > D'oh :/
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e54da4c6c945..77d0e1937f2c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
> >         enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> >         struct task_struct *donor = rq->donor;
> >         struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> > -       struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> 
> I tested this patch on top of the series but it doesn't fix the perf
> regression on rb5
> 
> hackbench 8 groups thread pipe is still at 1.907(+/-7.6%) with default
> slice duration

Weird, I can't reproduce anymore with this fixed :/

I'll try more hackbench variants tomorrow I suppose.


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-18 21:12       ` Peter Zijlstra
@ 2026-05-19 10:13         ` Vincent Guittot
  2026-05-19 16:00           ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2026-05-19 10:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, 18 May 2026 at 23:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, May 18, 2026 at 03:34:51PM +0200, Vincent Guittot wrote:
> > On Wed, 13 May 2026 at 13:35, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> > >
> > > > I haven't reviewed the patches yet but I ran some tests with it while
> > > > testing sched latency related changes for short slice wakeup
> > > > preemption. I have some large hackbench regressions with this series
> > > > on HMP systems with and without EAS. Those figures are unexpected
> > > > because the benchmarks run on the root cfs
> > > >
> > > > One example with hackbench 8 groups thread pipe
> > > >
> > > >                               tip/sched/core  tip/sched/core       +this patchset        +this patchset
> > > > slice                         2.8ms           16ms                 2.8ms                 16ms
> > > > dragonboard rb5 with EAS      0,748(+/-4,6%)  0,621(+/-3.6%) +17%  1,915(+/-7.9%) -156%  0,689(+/-9.1%) +8%
> > > > radxa orion6 HMP without EAS  0,588(+/-5.8%)  0,677(+/-5.9%) -15%  1,505(+/-10%) -156%   1,071(+/-5.9%) -82%
> > > >
> > > > Increasing the slice partly removes regressions but this is surprising
> > > > because the bench runs at root cfs and I thought that results would not
> > > > change in such a case
> > >
> > > D'oh :/
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e54da4c6c945..77d0e1937f2c 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
> > >         enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> > >         struct task_struct *donor = rq->donor;
> > >         struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> > > -       struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> > > +       struct cfs_rq *cfs_rq = &rq->cfs;
> >
> > I tested this patch on top of the series but it doesn't fix the perf
> > regression on rb5
> >
> > hackbench 8 groups thread pipe is still at 1.907(+/-7.6%) with default
> > slice duration
>
> Weird, I can't reproduce anymore with this fixed :/
>
> I'll try more hackbench variants tomorrow I suppose.

I tried several configurations:
- HMP with EAS enabled
- HMP without EAS enabled (perf cpufreq gov)
- SMP (only the 4 little cores)

All of them show large regressions with hackbench which are almost
recovered when increasing the slice from 2.8 to 16ms


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
  2026-05-11 16:21   ` K Prateek Nayak
  2026-05-13  4:51   ` John Stultz
@ 2026-05-19 10:38   ` Vincent Guittot
  2026-05-20 16:32     ` Vincent Guittot
  2026-05-26  7:53   ` Zhang Qiao
  3 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2026-05-19 10:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Change fair/cgroup to a single runqueue.
>
> Infamously fair/cgroup isn't working for a number of people; typically
> the complaint is latencies and/or overhead. The latency issue is due
> to the intermediate entries that represent a combination of tasks and
> thereby obfuscate the runnability of tasks.
>
> The approach here is to leave the cgroup hierarchy as is; including
> the intermediate enqueue/dequeue but move the actual EEVDF runqueue
> outside. This means things like the shares_weight approximation are
> fully preserved.
>
> That is, given a hierarchy like:
>
>         R
>         |
>         se--G1
>             / \
>       G2--se   se--G3
>      / \           |
> T1--se se--T2      se--T3
>
> This is fully maintained for load tracking, however the EEVDF parts of
> cfs_rq/se go unused for the intermediates and are instead connected
> like:
>
>      _R_
>     / | \
>    T1 T2 T3
>
> Since the effective weight of the entities is determined by the
> hierarchy, this gets recomputed on enqueue, set_next_task and tick.
>
> Notably, the effective weight (se->h_load) is computed from the
> hierarchical fraction: se->load / cfs_rq->load.
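
One way to read the hierarchical fraction above, as a userspace sketch.
The struct, the helper name and the exact scaling (each level multiplies
by the parent's weight over the load of the runqueue the entity sits on)
are assumptions made for illustration; the patch's actual computation
may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entity; not the kernel's struct sched_entity. */
struct ent {
	uint64_t weight;		/* this entity's own weight */
	uint64_t rq_load;		/* summed weight of the runqueue it is queued on */
	const struct ent *parent;	/* owning group's entity, NULL at root level */
};

/*
 * Effective (flat) weight of @se: start from its own weight and, for
 * every level that has a parent, scale by
 *   parent weight / load of the runqueue the entity sits on.
 * Root-level entities keep their own weight.
 * Returns 0 if a runqueue load of zero is encountered.
 */
static uint64_t h_load_sketch(const struct ent *se)
{
	uint64_t h = se->weight;

	for (; se->parent; se = se->parent) {
		if (se->rq_load == 0)
			return 0;
		h = h * se->parent->weight / se->rq_load;
	}
	return h;
}
```

For the example hierarchy with every weight at 1024, this gives
T1 = T2 = 256 and T3 = 512, which sum to G1's weight of 1024.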
>
> Since EEVDF is now exclusively operating on rq->cfs, it needs to
> consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
> only tasks can get delayed, simplifying some of the cgroup cleanup.
>
> One place where additional information was required was
> set_next_task() / put_prev_task(), where we need to track 'current'
> both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
> (cfs_rq->curr).
>
> As a result of only having a single level to pick from, many of the
> complications in pick_next_task() and preemption go away.
>
> Since many of the hierarchical operations are still there, this won't
> immediately fix the performance issues, but hopefully it will fix some
> of the latency issues.
>
> TODO: split struct cfs_rq / struct sched_entity
> TODO: try and get rid of h_curr
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h |    1
>  kernel/sched/core.c   |    5
>  kernel/sched/debug.c  |    9
>  kernel/sched/fair.c   |  789 +++++++++++++++++++++-----------------------------
>  kernel/sched/pelt.c   |    6
>  kernel/sched/sched.h  |   26 -
>  6 files changed, 366 insertions(+), 470 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -575,6 +575,7 @@ struct sched_statistics {
>  struct sched_entity {
>         /* For load-balancing: */
>         struct load_weight              load;
> +       struct load_weight              h_load;
>         struct rb_node                  run_node;
>         u64                             deadline;
>         u64                             min_vruntime;
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5539,11 +5539,8 @@ EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
>   */
>  static inline void prefetch_curr_exec_start(struct task_struct *p)
>  {
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -       struct sched_entity *curr = p->se.cfs_rq->curr;
> -#else
>         struct sched_entity *curr = task_rq(p)->cfs.curr;
> -#endif
> +
>         prefetch(curr);
>         prefetch(&curr->exec_start);
>  }
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -911,10 +911,11 @@ print_task(struct seq_file *m, struct rq
>         else
>                 SEQ_printf(m, " %c", task_state_to_char(p));
>
> -       SEQ_printf(m, " %15s %5d %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
> +       SEQ_printf(m, " %15s %5d %10ld %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
>                 p->comm, task_pid_nr(p),
> +               p->se.h_load.weight,
>                 SPLIT_NS(p->se.vruntime),
> -               entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
> +               entity_eligible(&rq->cfs, &p->se) ? 'E' : 'N',
>                 SPLIT_NS(p->se.deadline),
>                 p->se.custom_slice ? 'S' : ' ',
>                 SPLIT_NS(p->se.slice),
> @@ -943,7 +944,7 @@ static void print_rq(struct seq_file *m,
>
>         SEQ_printf(m, "\n");
>         SEQ_printf(m, "runnable tasks:\n");
> -       SEQ_printf(m, " S            task   PID       vruntime   eligible    "
> +       SEQ_printf(m, " S            task   PID     weight       vruntime   eligible    "
>                    "deadline             slice          sum-exec      switches  "
>                    "prio         wait-time        sum-sleep       sum-block"
>  #ifdef CONFIG_NUMA_BALANCING
> @@ -1051,6 +1052,8 @@ void print_cfs_rq(struct seq_file *m, in
>                         cfs_rq->tg_load_avg_contrib);
>         SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
>                         atomic_long_read(&cfs_rq->tg->load_avg));
> +       SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
> +                       cfs_rq->h_load);
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_CFS_BANDWIDTH
>         SEQ_printf(m, "  .%-30s: %d\n", "throttled",
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -296,8 +296,8 @@ static u64 __calc_delta(u64 delta_exec,
>   */
>  static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
>  {
> -       if (unlikely(se->load.weight != NICE_0_LOAD))
> -               delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
> +       if (se->h_load.weight != NICE_0_LOAD)
> +               delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
>
>         return delta;
>  }
> @@ -427,38 +427,6 @@ static inline struct sched_entity *paren
>         return se->parent;
>  }
>
> -static void
> -find_matching_se(struct sched_entity **se, struct sched_entity **pse)
> -{
> -       int se_depth, pse_depth;
> -
> -       /*
> -        * preemption test can be made between sibling entities who are in the
> -        * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> -        * both tasks until we find their ancestors who are siblings of common
> -        * parent.
> -        */
> -
> -       /* First walk up until both entities are at same depth */
> -       se_depth = (*se)->depth;
> -       pse_depth = (*pse)->depth;
> -
> -       while (se_depth > pse_depth) {
> -               se_depth--;
> -               *se = parent_entity(*se);
> -       }
> -
> -       while (pse_depth > se_depth) {
> -               pse_depth--;
> -               *pse = parent_entity(*pse);
> -       }
> -
> -       while (!is_same_group(*se, *pse)) {
> -               *se = parent_entity(*se);
> -               *pse = parent_entity(*pse);
> -       }
> -}
> -
>  static int tg_is_idle(struct task_group *tg)
>  {
>         return tg->idle > 0;
> @@ -502,11 +470,6 @@ static inline struct sched_entity *paren
>         return NULL;
>  }
>
> -static inline void
> -find_matching_se(struct sched_entity **se, struct sched_entity **pse)
> -{
> -}
> -
>  static inline int tg_is_idle(struct task_group *tg)
>  {
>         return 0;
> @@ -685,7 +648,7 @@ static inline unsigned long avg_vruntime
>  static inline void
>  __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> +       unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
>         s64 w_vruntime, key = entity_key(cfs_rq, se);
>
>         w_vruntime = key * weight;
> @@ -702,7 +665,7 @@ sum_w_vruntime_add_paranoid(struct cfs_r
>         s64 key, tmp;
>
>  again:
> -       weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> +       weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
>         key = entity_key(cfs_rq, se);
>
>         if (check_mul_overflow(key, weight, &key))
> @@ -748,7 +711,7 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
>  static void
>  sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> +       unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
>         s64 key = entity_key(cfs_rq, se);
>
>         cfs_rq->sum_w_vruntime -= key * weight;
> @@ -790,7 +753,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>                 s64 runtime = cfs_rq->sum_w_vruntime;
>
>                 if (curr) {
> -                       unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
> +                       unsigned long w = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
>
>                         runtime += entity_key(cfs_rq, curr) * w;
>                         weight += w;
> @@ -861,8 +824,6 @@ bool update_entity_lag(struct cfs_rq *cf
>         u64 avruntime = avg_vruntime(cfs_rq);
>         s64 vlag = entity_lag(cfs_rq, se, avruntime);
>
> -       WARN_ON_ONCE(!se->on_rq);
> -
>         if (se->sched_delayed) {
>                 /* previous vlag < 0 otherwise se would not be delayed */
>                 vlag = max(vlag, se->vlag);
> @@ -898,7 +859,7 @@ static int vruntime_eligible(struct cfs_
>         long load = cfs_rq->sum_weight;
>
>         if (curr && curr->on_rq) {
> -               unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
> +               unsigned long weight = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
>
>                 avg += entity_key(cfs_rq, curr) * weight;
>                 load += weight;
> @@ -1039,6 +1000,9 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
>   */
>  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
> +       WARN_ON_ONCE(!entity_is_task(se));
> +
>         sum_w_vruntime_add(cfs_rq, se);
>         se->min_vruntime = se->vruntime;
>         se->min_slice = se->slice;
> @@ -1048,6 +1012,9 @@ static void __enqueue_entity(struct cfs_
>
>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
> +       WARN_ON_ONCE(!entity_is_task(se));
> +
>         rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
>                                   &min_vruntime_cb);
>         sum_w_vruntime_sub(cfs_rq, se);
> @@ -1144,7 +1111,7 @@ static struct sched_entity *pick_eevdf(s
>          * We can safely skip eligibility check if there is only one entity
>          * in this cfs_rq, saving some cycles.
>          */
> -       if (cfs_rq->nr_queued == 1)
> +       if (cfs_rq->h_nr_queued == 1)
>                 return curr && curr->on_rq ? curr : se;
>
>         /*
> @@ -1391,8 +1358,6 @@ static s64 update_se(struct rq *rq, stru
>         return delta_exec;
>  }
>
> -static void set_next_buddy(struct sched_entity *se);
> -
>  /*
>   * Used by other classes to account runtime.
>   */
> @@ -1412,7 +1377,7 @@ static void update_curr(struct cfs_rq *c
>          * not necessarily be the actual task running
>          * (rq->curr.se). This is easy to confuse!
>          */
> -       struct sched_entity *curr = cfs_rq->curr;
> +       struct sched_entity *curr = cfs_rq->h_curr;
>         struct rq *rq = rq_of(cfs_rq);
>         s64 delta_exec;
>         bool resched;
> @@ -1424,26 +1389,29 @@ static void update_curr(struct cfs_rq *c
>         if (unlikely(delta_exec <= 0))
>                 return;
>
> +       account_cfs_rq_runtime(cfs_rq, delta_exec);
> +
> +       if (!entity_is_task(curr))
> +               return;
> +
> +       cfs_rq = &rq->cfs;
> +
>         curr->vruntime += calc_delta_fair(delta_exec, curr);
>         resched = update_deadline(cfs_rq, curr);
>
> -       if (entity_is_task(curr)) {
> -               /*
> -                * If the fair_server is active, we need to account for the
> -                * fair_server time whether or not the task is running on
> -                * behalf of fair_server or not:
> -                *  - If the task is running on behalf of fair_server, we need
> -                *    to limit its time based on the assigned runtime.
> -                *  - Fair task that runs outside of fair_server should account
> -                *    against fair_server such that it can account for this time
> -                *    and possibly avoid running this period.
> -                */
> -               dl_server_update(&rq->fair_server, delta_exec);
> -       }
> -
> -       account_cfs_rq_runtime(cfs_rq, delta_exec);
> +       /*
> +        * If the fair_server is active, we need to account for the
> +        * fair_server time whether or not the task is running on
> +        * behalf of fair_server or not:
> +        *  - If the task is running on behalf of fair_server, we need
> +        *    to limit its time based on the assigned runtime.
> +        *  - Fair task that runs outside of fair_server should account
> +        *    against fair_server such that it can account for this time
> +        *    and possibly avoid running this period.
> +        */
> +       dl_server_update(&rq->fair_server, delta_exec);
>
> -       if (cfs_rq->nr_queued == 1)
> +       if (cfs_rq->h_nr_queued == 1)
>                 return;
>
>         if (resched || !protect_slice(curr)) {
> @@ -1454,7 +1422,10 @@ static void update_curr(struct cfs_rq *c
>
>  static void update_curr_fair(struct rq *rq)
>  {
> -       update_curr(cfs_rq_of(&rq->donor->se));
> +       struct sched_entity *se = &rq->donor->se;
> +
> +       for_each_sched_entity(se)
> +               update_curr(cfs_rq_of(se));
>  }
>
>  static inline void
> @@ -1530,7 +1501,7 @@ update_stats_enqueue_fair(struct cfs_rq
>          * Are we enqueueing a waiting task? (for current tasks
>          * a dequeue/enqueue event is a NOP)
>          */
> -       if (se != cfs_rq->curr)
> +       if (se != cfs_rq->h_curr)
>                 update_stats_wait_start_fair(cfs_rq, se);
>
>         if (flags & ENQUEUE_WAKEUP)
> @@ -1548,7 +1519,7 @@ update_stats_dequeue_fair(struct cfs_rq
>          * Mark the end of the wait period if dequeueing a
>          * waiting task:
>          */
> -       if (se != cfs_rq->curr)
> +       if (se != cfs_rq->h_curr)
>                 update_stats_wait_end_fair(cfs_rq, se);
>
>         if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
> @@ -3875,6 +3846,7 @@ static inline void update_scan_period(st
>  static void
>  account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
>         update_load_add(&cfs_rq->load, se->load.weight);
>         if (entity_is_task(se)) {
>                 struct rq *rq = rq_of(cfs_rq);
> @@ -3888,6 +3860,7 @@ account_entity_enqueue(struct cfs_rq *cf
>  static void
>  account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
>         update_load_sub(&cfs_rq->load, se->load.weight);
>         if (entity_is_task(se)) {
>                 account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> @@ -3965,7 +3938,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
>  static void
>  rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
>  {
> -       unsigned long old_weight = se->load.weight;
> +       long old_weight = se->h_load.weight;
>
>         /*
>          * VRUNTIME
> @@ -4065,16 +4038,17 @@ rescale_entity(struct sched_entity *se,
>                 se->vprot = div64_long(se->vprot * old_weight, weight);
>  }
>
> -static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> -                           unsigned long weight)
> +static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +                          unsigned long weight, bool on_rq)
>  {
>         bool curr = cfs_rq->curr == se;
>         bool rel_vprot = false;
>         u64 avruntime = 0;
>
> -       if (se->on_rq) {
> -               /* commit outstanding execution time */
> -               update_curr(cfs_rq);
> +       if (se->h_load.weight == weight)
> +               return;
> +
> +       if (on_rq) {
>                 avruntime = avg_vruntime(cfs_rq);
>                 se->vlag = entity_lag(cfs_rq, se, avruntime);
>                 se->deadline -= avruntime;
> @@ -4084,46 +4058,90 @@ static void reweight_entity(struct cfs_r
>                         rel_vprot = true;
>                 }
>
> -               cfs_rq->nr_queued--;
> +               cfs_rq->h_nr_queued--;
>                 if (!curr)
>                         __dequeue_entity(cfs_rq, se);
> -               update_load_sub(&cfs_rq->load, se->load.weight);
>         }
> -       dequeue_load_avg(cfs_rq, se);
>
>         rescale_entity(se, weight, rel_vprot);
>
> -       update_load_set(&se->load, weight);
> +       update_load_set(&se->h_load, weight);
>
> -       do {
> -               u32 divider = get_pelt_divider(&se->avg);
> -               se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> -       } while (0);
> -
> -       enqueue_load_avg(cfs_rq, se);
> -       if (se->on_rq) {
> +       if (on_rq) {
>                 if (rel_vprot)
>                         se->vprot += avruntime;
>                 se->deadline += avruntime;
>                 se->rel_deadline = 0;
>                 se->vruntime = avruntime - se->vlag;
>
> -               update_load_add(&cfs_rq->load, se->load.weight);
>                 if (!curr)
>                         __enqueue_entity(cfs_rq, se);
> -               cfs_rq->nr_queued++;
> +               cfs_rq->h_nr_queued++;
>         }
>  }
>
> +static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +                           unsigned long weight)
> +{
> +       if (se->load.weight == weight)
> +               return;
> +
> +       if (se->on_rq) {
> +               WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
> +               update_load_sub(&cfs_rq->load, se->load.weight);
> +       }
> +       dequeue_load_avg(cfs_rq, se);
> +
> +       update_load_set(&se->load, weight);
> +
> +       do {
> +               u32 divider = get_pelt_divider(&se->avg);
> +               se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> +       } while (0);
> +
> +       enqueue_load_avg(cfs_rq, se);
> +
> +       if (se->on_rq)
> +               update_load_add(&cfs_rq->load, se->load.weight);
> +}
> +
> +/*
> + * weight = NICE_0_LOAD;
> + * for_each_entity_se(se)
> + *   weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
> + */
> +static __always_inline
> +unsigned long __calc_prop_weight(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +                                unsigned long weight)
> +{
> +       weight *= se->load.weight;
> +       if (parent_entity(se))
> +               weight /= cfs_rq->load.weight;
> +       else
> +               weight /= NICE_0_LOAD;
> +
> +       return max(weight, MIN_SHARES);
> +}
> +
>  static void reweight_task_fair(struct rq *rq, struct task_struct *p,
>                                const struct load_weight *lw)
>  {
>         struct sched_entity *se = &p->se;
> -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -       struct load_weight *load = &se->load;
> +       unsigned long weight = NICE_0_LOAD;
> +
> +       if (se->on_rq)
> +               update_curr_fair(rq);
> +
> +       reweight_entity(cfs_rq_of(se), se, lw->weight);
> +       se->load.inv_weight = lw->inv_weight;
> +
> +       if (!se->on_rq)
> +               return;
> +
> +       for_each_sched_entity(se)
> +               weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
>
> -       reweight_entity(cfs_rq, se, lw->weight);
> -       load->inv_weight = lw->inv_weight;
> +       reweight_eevdf(&rq->cfs, &p->se, weight, p->se.on_rq);
>  }
>
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> @@ -4331,7 +4349,6 @@ static long calc_group_shares(struct cfs
>  static void update_cfs_group(struct sched_entity *se)
>  {
>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -       long shares;
>
>         /*
>          * When a group becomes empty, preserve its weight. This matters for
> @@ -4340,9 +4357,7 @@ static void update_cfs_group(struct sche
>         if (!gcfs_rq || !gcfs_rq->load.weight)
>                 return;
>
> -       shares = calc_group_shares(gcfs_rq);
> -       if (unlikely(se->load.weight != shares))
> -               reweight_entity(cfs_rq_of(se), se, shares);
> +       reweight_entity(cfs_rq_of(se), se, calc_group_shares(gcfs_rq));
>  }
>
>  #else /* !CONFIG_FAIR_GROUP_SCHED: */
> @@ -4460,7 +4475,7 @@ static inline bool cfs_rq_is_decayed(str
>   * differential update where we store the last value we propagated. This in
>   * turn allows skipping updates if the differential is 'small'.
>   *
> - * Updating tg's load_avg is necessary before update_cfs_share().
> + * Updating tg's load_avg is necessary before update_cfs_group().
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
> @@ -4926,7 +4941,7 @@ static void migrate_se_pelt_lag(struct s
>   * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
>   * avg. The immediate corollary is that all (fair) tasks must be attached.
>   *
> - * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
> + * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
>   *
>   * Return: true if the load decayed or we removed load.
>   *
> @@ -5475,6 +5490,7 @@ static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>         u64 vslice, vruntime = avg_vruntime(cfs_rq);
> +       unsigned int nr_queued = cfs_rq->h_nr_queued;
>         bool update_zero = false;
>         s64 lag = 0;
>
> @@ -5482,6 +5498,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
>                 se->slice = sysctl_sched_base_slice;
>         vslice = calc_delta_fair(se->slice, se);
>
> +       if (flags & ENQUEUE_QUEUED)
> +               nr_queued -= 1;
> +
>         /*
>          * Due to how V is constructed as the weighted average of entities,
>          * adding tasks with positive lag, or removing tasks with negative lag
> @@ -5490,7 +5509,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
>          *
>          * EEVDF: placement strategy #1 / #2
>          */
> -       if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
> +       if (sched_feat(PLACE_LAG) && nr_queued && se->vlag) {
>                 struct sched_entity *curr = cfs_rq->curr;
>                 long load, weight;
>
> @@ -5550,9 +5569,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
>                  */
>                 load = cfs_rq->sum_weight;
>                 if (curr && curr->on_rq)
> -                       load += avg_vruntime_weight(cfs_rq, curr->load.weight);
> +                       load += avg_vruntime_weight(cfs_rq, curr->h_load.weight);
>
> -               weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> +               weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
>                 lag *= load + weight;
>                 if (WARN_ON_ONCE(!load))
>                         load = 1;
> @@ -5611,22 +5630,8 @@ static void check_enqueue_throttle(struc
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
>
>  static void
> -requeue_delayed_entity(struct sched_entity *se);
> -
> -static void
>  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> -       bool curr = cfs_rq->curr == se;
> -
> -       /*
> -        * If we're the current task, we must renormalise before calling
> -        * update_curr().
> -        */
> -       if (curr)
> -               place_entity(cfs_rq, se, flags);
> -
> -       update_curr(cfs_rq);
> -
>         /*
>          * When enqueuing a sched_entity, we must:
>          *   - Update loads to have both entity and cfs_rq synced with now.
> @@ -5645,13 +5650,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>          */
>         update_cfs_group(se);
>
> -       /*
> -        * XXX now that the entity has been re-weighted, and it's lag adjusted,
> -        * we can place the entity.
> -        */
> -       if (!curr)
> -               place_entity(cfs_rq, se, flags);
> -
>         account_entity_enqueue(cfs_rq, se);
>
>         /* Entity has migrated, no longer consider this task hot */
> @@ -5660,8 +5658,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>
>         check_schedstat_required();
>         update_stats_enqueue_fair(cfs_rq, se, flags);
> -       if (!curr)
> -               __enqueue_entity(cfs_rq, se);
>         se->on_rq = 1;
>
>         if (cfs_rq->nr_queued == 1) {
> @@ -5679,21 +5675,19 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>         }
>  }
>
> -static void __clear_buddies_next(struct sched_entity *se)
> +static void set_next_buddy(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       for_each_sched_entity(se) {
> -               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -               if (cfs_rq->next != se)
> -                       break;
> -
> -               cfs_rq->next = NULL;
> -       }
> +       if (WARN_ON_ONCE(!se->on_rq || se->sched_delayed))
> +               return;
> +       if (se_is_idle(se))
> +               return;
> +       cfs_rq->next = se;
>  }
>
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         if (cfs_rq->next == se)
> -               __clear_buddies_next(se);
> +               cfs_rq->next = NULL;
>  }
>
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -5704,7 +5698,7 @@ static void set_delayed(struct sched_ent
>
>         /*
>          * Delayed se of cfs_rq have no tasks queued on them.
> -        * Do not adjust h_nr_runnable since dequeue_entities()
> +        * Do not adjust h_nr_runnable since __dequeue_task()
>          * will account it for blocked tasks.
>          */
>         if (!entity_is_task(se))
> @@ -5737,37 +5731,11 @@ static void clear_delayed(struct sched_e
>         }
>  }
>
> -static bool
> +static void
>  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
> -       bool sleep = flags & DEQUEUE_SLEEP;
>         int action = UPDATE_TG;
>
> -       update_curr(cfs_rq);
> -       clear_buddies(cfs_rq, se);
> -
> -       if (flags & DEQUEUE_DELAYED) {
> -               WARN_ON_ONCE(!se->sched_delayed);
> -       } else {
> -               bool delay = sleep;
> -               /*
> -                * DELAY_DEQUEUE relies on spurious wakeups, special task
> -                * states must not suffer spurious wakeups, excempt them.
> -                */
> -               if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
> -                       delay = false;
> -
> -               WARN_ON_ONCE(delay && se->sched_delayed);
> -
> -               if (sched_feat(DELAY_DEQUEUE) && delay &&
> -                   !entity_eligible(cfs_rq, se)) {
> -                       update_load_avg(cfs_rq, se, 0);
> -                       update_entity_lag(cfs_rq, se);
> -                       set_delayed(se);
> -                       return false;
> -               }
> -       }
> -
>         if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
>                 action |= DO_DETACH;
>
> @@ -5785,14 +5753,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
>         update_stats_dequeue_fair(cfs_rq, se, flags);
>
> -       update_entity_lag(cfs_rq, se);
> -       if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
> -               se->deadline -= se->vruntime;
> -               se->rel_deadline = 1;
> -       }
> -
> -       if (se != cfs_rq->curr)
> -               __dequeue_entity(cfs_rq, se);
>         se->on_rq = 0;
>         account_entity_dequeue(cfs_rq, se);
>
> @@ -5801,9 +5761,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
>         update_cfs_group(se);
>
> -       if (flags & DEQUEUE_DELAYED)
> -               clear_delayed(se);
> -
>         if (cfs_rq->nr_queued == 0) {
>                 update_idle_cfs_rq_clock_pelt(cfs_rq);
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -5816,15 +5773,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>                 }
>  #endif
>         }
> -
> -       return true;
>  }
>
>  static void
> -set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
> +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       clear_buddies(cfs_rq, se);
> -
>         /* 'current' is not kept within the tree. */
>         if (se->on_rq) {
>                 /*
> @@ -5833,16 +5786,12 @@ set_next_entity(struct cfs_rq *cfs_rq, s
>                  * runqueue.
>                  */
>                 update_stats_wait_end_fair(cfs_rq, se);
> -               __dequeue_entity(cfs_rq, se);
>                 update_load_avg(cfs_rq, se, UPDATE_TG);
> -
> -               if (first)
> -                       set_protect_slice(cfs_rq, se);
>         }
>
>         update_stats_curr_start(cfs_rq, se);
> -       WARN_ON_ONCE(cfs_rq->curr);
> -       cfs_rq->curr = se;
> +       WARN_ON_ONCE(cfs_rq->h_curr);
> +       cfs_rq->h_curr = se;
>
>         /*
>          * Track our maximum slice length, if the CPU's load is at
> @@ -5862,23 +5811,17 @@ set_next_entity(struct cfs_rq *cfs_rq, s
>         se->prev_sum_exec_runtime = se->sum_exec_runtime;
>  }
>
> -static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
> +static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags);
>
> -/*
> - * Pick the next process, keeping these things in mind, in this order:
> - * 1) keep things fair between processes/task groups
> - * 2) pick the "next" process, since someone really wants that to run
> - * 3) pick the "last" process, for cache locality
> - * 4) do not run the "skip" process, if something else is available
> - */
>  static struct sched_entity *
> -pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool protect)
> +pick_next_entity(struct rq *rq, bool protect)
>  {
> +       struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *se;
>
>         se = pick_eevdf(cfs_rq, protect);
>         if (se->sched_delayed) {
> -               dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> +               __dequeue_task(rq, task_of(se), DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>                 /*
>                  * Must not reference @se again, see __block_task().
>                  */
> @@ -5903,13 +5846,11 @@ static void put_prev_entity(struct cfs_r
>
>         if (prev->on_rq) {
>                 update_stats_wait_start_fair(cfs_rq, prev);
> -               /* Put 'current' back into the tree. */
> -               __enqueue_entity(cfs_rq, prev);
>                 /* in !on_rq case, update occurred at dequeue */
>                 update_load_avg(cfs_rq, prev, 0);
>         }
> -       WARN_ON_ONCE(cfs_rq->curr != prev);
> -       cfs_rq->curr = NULL;
> +       WARN_ON_ONCE(cfs_rq->h_curr != prev);
> +       cfs_rq->h_curr = NULL;
>  }
>
>  static void
> @@ -6062,7 +6003,7 @@ static void __account_cfs_rq_runtime(str
>          * if we're unable to extend our runtime we resched so that the active
>          * hierarchy can be throttled
>          */
> -       if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
> +       if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->h_curr))
>                 resched_curr(rq_of(cfs_rq));
>  }
>
> @@ -6420,7 +6361,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
>         assert_list_leaf_cfs_rq(rq);
>
>         /* Determine whether we need to wake up potentially idle CPU: */
> -       if (rq->curr == rq->idle && rq->cfs.nr_queued)
> +       if (rq->curr == rq->idle && rq->cfs.h_nr_queued)
>                 resched_curr(rq);
>  }
>
> @@ -6761,7 +6702,7 @@ static void check_enqueue_throttle(struc
>                 return;
>
>         /* an active group must be handled by the update_curr()->put() path */
> -       if (!cfs_rq->runtime_enabled || cfs_rq->curr)
> +       if (!cfs_rq->runtime_enabled || cfs_rq->h_curr)
>                 return;
>
>         /* ensure the group is not already throttled */
> @@ -7156,7 +7097,7 @@ static void hrtick_start_fair(struct rq
>                         resched_curr(rq);
>                 return;
>         }
> -       delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> +       delta = (se->h_load.weight * vdelta) / NICE_0_LOAD;
>
>         /*
>          * Correct for instantaneous load of other classes.
> @@ -7256,10 +7197,8 @@ static int choose_idle_cpu(int cpu, stru
>  }
>
>  static void
> -requeue_delayed_entity(struct sched_entity *se)
> +requeue_delayed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -
>         /*
>          * se->sched_delayed should imply: se->on_rq == 1.
>          * Because a delayed entity is one that is still on
> @@ -7269,19 +7208,58 @@ requeue_delayed_entity(struct sched_enti
>         WARN_ON_ONCE(!se->on_rq);
>
>         if (update_entity_lag(cfs_rq, se)) {
> -               cfs_rq->nr_queued--;
> +               cfs_rq->h_nr_queued--;
>                 if (se != cfs_rq->curr)
>                         __dequeue_entity(cfs_rq, se);
>                 place_entity(cfs_rq, se, 0);
>                 if (se != cfs_rq->curr)
>                         __enqueue_entity(cfs_rq, se);
> -               cfs_rq->nr_queued++;
> +               cfs_rq->h_nr_queued++;
>         }
>
>         update_load_avg(cfs_rq, se, 0);
>         clear_delayed(se);
>  }
>
> +static unsigned long enqueue_hierarchy(struct task_struct *p, int flags)
> +{
> +       unsigned long weight = NICE_0_LOAD;
> +       int task_new = !(flags & ENQUEUE_WAKEUP);
> +       struct sched_entity *se = &p->se;
> +       int h_nr_idle = task_has_idle_policy(p);
> +       int h_nr_runnable = 1;
> +
> +       if (task_new && se->sched_delayed)
> +               h_nr_runnable = 0;
> +
> +       for_each_sched_entity(se) {
> +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +               update_curr(cfs_rq);
> +
> +               if (!se->on_rq) {
> +                       enqueue_entity(cfs_rq, se, flags);
> +               } else {
> +                       update_load_avg(cfs_rq, se, UPDATE_TG);
> +                       se_update_runnable(se);
> +                       update_cfs_group(se);
> +               }
> +
> +               cfs_rq->h_nr_runnable += h_nr_runnable;
> +               cfs_rq->h_nr_queued++;
> +               cfs_rq->h_nr_idle += h_nr_idle;
> +
> +               if (cfs_rq_is_idle(cfs_rq))
> +                       h_nr_idle = 1;
> +
> +               weight = __calc_prop_weight(cfs_rq, se, weight);
> +
> +               flags = ENQUEUE_WAKEUP;
> +       }
> +
> +       return weight;
> +}
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -7290,13 +7268,12 @@ requeue_delayed_entity(struct sched_enti
>  static void
>  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
> -       struct cfs_rq *cfs_rq;
> -       struct sched_entity *se = &p->se;
> -       int h_nr_idle = task_has_idle_policy(p);
> -       int h_nr_runnable = 1;
> -       int task_new = !(flags & ENQUEUE_WAKEUP);
>         int rq_h_nr_queued = rq->cfs.h_nr_queued;
> -       u64 slice = 0;
> +       int task_new = !(flags & ENQUEUE_WAKEUP);
> +       struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = &rq->cfs;
> +       unsigned long weight;
> +       bool curr;
>
>         if (task_is_throttled(p) && enqueue_throttled_task(p))
>                 return;
> @@ -7308,10 +7285,10 @@ enqueue_task_fair(struct rq *rq, struct
>          * estimated utilization, before we update schedutil.
>          */
>         if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
> -               util_est_enqueue(&rq->cfs, p);
> +               util_est_enqueue(cfs_rq, p);
>
>         if (flags & ENQUEUE_DELAYED) {
> -               requeue_delayed_entity(se);
> +               requeue_delayed_entity(cfs_rq, se);
>                 return;
>         }
>
> @@ -7323,57 +7300,22 @@ enqueue_task_fair(struct rq *rq, struct
>         if (p->in_iowait)
>                 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
>
> -       if (task_new && se->sched_delayed)
> -               h_nr_runnable = 0;
> -
> -       for_each_sched_entity(se) {
> -               if (se->on_rq) {
> -                       if (se->sched_delayed)
> -                               requeue_delayed_entity(se);
> -                       break;
> -               }
> -               cfs_rq = cfs_rq_of(se);
> -
> -               /*
> -                * Basically set the slice of group entries to the min_slice of
> -                * their respective cfs_rq. This ensures the group can service
> -                * its entities in the desired time-frame.
> -                */
> -               if (slice) {
> -                       se->slice = slice;
> -                       se->custom_slice = 1;
> -               }
> -               enqueue_entity(cfs_rq, se, flags);
> -               slice = cfs_rq_min_slice(cfs_rq);
> -
> -               cfs_rq->h_nr_runnable += h_nr_runnable;
> -               cfs_rq->h_nr_queued++;
> -               cfs_rq->h_nr_idle += h_nr_idle;
> -
> -               if (cfs_rq_is_idle(cfs_rq))
> -                       h_nr_idle = 1;
> -
> -               flags = ENQUEUE_WAKEUP;
> -       }
> -
> -       for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> -
> -               update_load_avg(cfs_rq, se, UPDATE_TG);
> -               se_update_runnable(se);
> -               update_cfs_group(se);
> +       /*
> +        * XXX comment on the curr thing
> +        */
> +       curr = (cfs_rq->curr == se);
> +       if (curr)
> +               place_entity(cfs_rq, se, flags);
>
> -               se->slice = slice;
> -               if (se != cfs_rq->curr)
> -                       min_vruntime_cb_propagate(&se->run_node, NULL);
> -               slice = cfs_rq_min_slice(cfs_rq);
> +       if (se->on_rq && se->sched_delayed)
> +               requeue_delayed_entity(cfs_rq, se);
>
> -               cfs_rq->h_nr_runnable += h_nr_runnable;
> -               cfs_rq->h_nr_queued++;
> -               cfs_rq->h_nr_idle += h_nr_idle;
> +       weight = enqueue_hierarchy(p, flags);
>
> -               if (cfs_rq_is_idle(cfs_rq))
> -                       h_nr_idle = 1;
> +       if (!curr) {
> +               reweight_eevdf(cfs_rq, se, weight, false);
> +               place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);
> +               __enqueue_entity(cfs_rq, se);
>         }
>
>         if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
> @@ -7404,105 +7346,107 @@ enqueue_task_fair(struct rq *rq, struct
>         hrtick_update(rq);
>  }
>
> -/*
> - * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> - * failing half-way through and resume the dequeue later.
> - *
> - * Returns:
> - * -1 - dequeue delayed
> - *  0 - dequeue throttled
> - *  1 - dequeue complete
> - */
> -static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> +static void dequeue_hierarchy(struct task_struct *p, int flags)
>  {
> -       bool was_sched_idle = sched_idle_rq(rq);
> +       struct sched_entity *se = &p->se;
>         bool task_sleep = flags & DEQUEUE_SLEEP;
>         bool task_delayed = flags & DEQUEUE_DELAYED;
>         bool task_throttled = flags & DEQUEUE_THROTTLE;
> -       struct task_struct *p = NULL;
> -       int h_nr_idle = 0;
> -       int h_nr_queued = 0;
>         int h_nr_runnable = 0;
> -       struct cfs_rq *cfs_rq;
> -       u64 slice = 0;
> +       int h_nr_idle = task_has_idle_policy(p);
> +       bool dequeue = true;
>
> -       if (entity_is_task(se)) {
> -               p = task_of(se);
> -               h_nr_queued = 1;
> -               h_nr_idle = task_has_idle_policy(p);
> -               if (task_sleep || task_delayed || !se->sched_delayed)
> -                       h_nr_runnable = 1;
> -       }
> +       if (task_sleep || task_delayed || !se->sched_delayed)
> +               h_nr_runnable = 1;
>
>         for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> -               if (!dequeue_entity(cfs_rq, se, flags)) {
> -                       if (p && &p->se == se)
> -                               return -1;
> +               update_curr(cfs_rq);
>
> -                       slice = cfs_rq_min_slice(cfs_rq);
> -                       break;
> +               if (dequeue) {
> +                       dequeue_entity(cfs_rq, se, flags);
> +                       /* Don't dequeue parent if it has other entities besides us */
> +                       if (cfs_rq->load.weight)
> +                               dequeue = false;
> +               } else {
> +                       update_load_avg(cfs_rq, se, UPDATE_TG);
> +                       se_update_runnable(se);
> +                       update_cfs_group(se);
>                 }
>
>                 cfs_rq->h_nr_runnable -= h_nr_runnable;
> -               cfs_rq->h_nr_queued -= h_nr_queued;
> +               cfs_rq->h_nr_queued--;
>                 cfs_rq->h_nr_idle -= h_nr_idle;
>
>                 if (cfs_rq_is_idle(cfs_rq))
> -                       h_nr_idle = h_nr_queued;
> +                       h_nr_idle = 1;
>
>                 if (throttled_hierarchy(cfs_rq) && task_throttled)
>                         record_throttle_clock(cfs_rq);
>
> -               /* Don't dequeue parent if it has other entities besides us */
> -               if (cfs_rq->load.weight) {
> -                       slice = cfs_rq_min_slice(cfs_rq);
> -
> -                       /* Avoid re-evaluating load for this entity: */
> -                       se = parent_entity(se);
> -                       /*
> -                        * Bias pick_next to pick a task from this cfs_rq, as
> -                        * p is sleeping when it is within its sched_slice.
> -                        */
> -                       if (task_sleep && se)
> -                               set_next_buddy(se);
> -                       break;
> -               }
>                 flags |= DEQUEUE_SLEEP;
>                 flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
>         }
> +}
>
> -       for_each_sched_entity(se) {
> -               cfs_rq = cfs_rq_of(se);
> +/*
> + * The part of dequeue_task_fair() that is needed to dequeue delayed tasks.
> + *
> + * Returns:
> + *   true  - dequeued
> + *   false - delayed
> + */
> +static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> +{
> +       struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = &rq->cfs;
> +       bool was_sched_idle = sched_idle_rq(rq);
> +       bool task_sleep = flags & DEQUEUE_SLEEP;
> +       bool task_delayed = flags & DEQUEUE_DELAYED;
>
> -               update_load_avg(cfs_rq, se, UPDATE_TG);
> -               se_update_runnable(se);
> -               update_cfs_group(se);
> +       clear_buddies(cfs_rq, se);
>
> -               se->slice = slice;
> -               if (se != cfs_rq->curr)
> -                       min_vruntime_cb_propagate(&se->run_node, NULL);
> -               slice = cfs_rq_min_slice(cfs_rq);
> +       if (flags & DEQUEUE_DELAYED) {
> +               WARN_ON_ONCE(!se->sched_delayed);
> +       } else {
> +               bool delay = task_sleep;
> +               /*
> +                * DELAY_DEQUEUE relies on spurious wakeups, special task
> +                * states must not suffer spurious wakeups, exempt them.
> +                */
> +               if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
> +                       delay = false;
>
> -               cfs_rq->h_nr_runnable -= h_nr_runnable;
> -               cfs_rq->h_nr_queued -= h_nr_queued;
> -               cfs_rq->h_nr_idle -= h_nr_idle;
> +               WARN_ON_ONCE(delay && se->sched_delayed);
>
> -               if (cfs_rq_is_idle(cfs_rq))
> -                       h_nr_idle = h_nr_queued;
> +               if (sched_feat(DELAY_DEQUEUE) && delay &&
> +                   !entity_eligible(cfs_rq, se)) {
> +                       update_load_avg(cfs_rq_of(se), se, 0);

A call to update_entity_lag(cfs_rq, se) is missing here. Unfortunately,
adding it doesn't fix my regression.
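
For readers following the hunk above, here is a minimal userspace sketch of
the delay decision that the quoted __dequeue_task() code makes before it
reaches set_delayed(). It is only a model under stated assumptions: the
toy_ names and the flag values are hypothetical, the WARN_ON_ONCE() checks
are approximated with assert(), and the lag update discussed above is not
modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration only; the kernel's differ. */
#define TOY_DEQUEUE_SLEEP    0x01
#define TOY_DEQUEUE_SPECIAL  0x02
#define TOY_DEQUEUE_THROTTLE 0x04
#define TOY_DEQUEUE_DELAYED  0x08

/*
 * Returns true when the dequeue would be deferred (the "delayed" outcome),
 * false when the entity would be dequeued right away.
 */
static bool toy_should_delay(int flags, bool sched_delayed,
			     bool feat_delay_dequeue, bool eligible)
{
	bool delay = flags & TOY_DEQUEUE_SLEEP;

	if (flags & TOY_DEQUEUE_DELAYED) {
		/* Finishing an earlier delayed dequeue: entity must be delayed. */
		assert(sched_delayed);
		return false;
	}

	/* Special task states and throttling must not see spurious wakeups. */
	if (flags & (TOY_DEQUEUE_SPECIAL | TOY_DEQUEUE_THROTTLE))
		delay = false;

	/* An already delayed entity must not be asked to delay again. */
	assert(!(delay && sched_delayed));

	/* Only a sleeping, ineligible entity is kept around as delayed. */
	return feat_delay_dequeue && delay && !eligible;
}
```

In this model a plain sleep of an ineligible entity is the only case that
yields "delayed"; every other combination falls through to the full dequeue.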

> +                       set_delayed(se);
> +                       return false;
> +               }
> +       }
>
> -               if (throttled_hierarchy(cfs_rq) && task_throttled)
> -                       record_throttle_clock(cfs_rq);
> +       dequeue_hierarchy(p, flags);
> +
> +       update_entity_lag(cfs_rq, se);
> +       if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> +               se->deadline -= se->vruntime;
> +               se->rel_deadline = 1;
>         }
> +       if (se != cfs_rq->curr)
> +               __dequeue_entity(cfs_rq, se);
>
> -       sub_nr_running(rq, h_nr_queued);
> +       sub_nr_running(rq, 1);
>
>         /* balance early to pull high priority tasks */
>         if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>                 rq->next_balance = jiffies;
>
> -       if (p && task_delayed) {
> +       if (task_delayed) {
> +               clear_delayed(se);
> +
>                 WARN_ON_ONCE(!task_sleep);
>                 WARN_ON_ONCE(p->on_rq != 1);
>
> @@ -7514,7 +7458,7 @@ static int dequeue_entities(struct rq *r
>                 __block_task(rq, p);
>         }
>
> -       return 1;
> +       return true;
>  }
>
>  /*
> @@ -7533,11 +7477,11 @@ static bool dequeue_task_fair(struct rq
>                 util_est_dequeue(&rq->cfs, p);
>
>         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> -       if (dequeue_entities(rq, &p->se, flags) < 0)
> +       if (!__dequeue_task(rq, p, flags))
>                 return false;
>
>         /*
> -        * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
> +        * Must not reference @p after __dequeue_task(DEQUEUE_DELAYED).
>          */
>         return true;
>  }
> @@ -9021,19 +8965,6 @@ static void migrate_task_rq_fair(struct
>  static void task_dead_fair(struct task_struct *p)
>  {
>         struct sched_entity *se = &p->se;
> -
> -       if (se->sched_delayed) {
> -               struct rq_flags rf;
> -               struct rq *rq;
> -
> -               rq = task_rq_lock(p, &rf);
> -               if (se->sched_delayed) {
> -                       update_rq_clock(rq);
> -                       dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> -               }
> -               task_rq_unlock(rq, p, &rf);
> -       }
> -
>         remove_entity_load_avg(se);
>  }
>
> @@ -9067,21 +8998,10 @@ static void set_cpus_allowed_fair(struct
>         set_task_max_allowed_capacity(p);
>  }
>
> -static void set_next_buddy(struct sched_entity *se)
> -{
> -       for_each_sched_entity(se) {
> -               if (WARN_ON_ONCE(!se->on_rq))
> -                       return;
> -               if (se_is_idle(se))
> -                       return;
> -               cfs_rq_of(se)->next = se;
> -       }
> -}
> -
>  enum preempt_wakeup_action {
>         PREEMPT_WAKEUP_NONE,    /* No preemption. */
>         PREEMPT_WAKEUP_SHORT,   /* Ignore slice protection. */
> -       PREEMPT_WAKEUP_PICK,    /* Let __pick_eevdf() decide. */
> +       PREEMPT_WAKEUP_PICK,    /* Let pick_eevdf() decide. */
>         PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
>  };
>
> @@ -9098,7 +9018,7 @@ set_preempt_buddy(struct cfs_rq *cfs_rq,
>         if (cfs_rq->next && entity_before(cfs_rq->next, pse))
>                 return false;
>
> -       set_next_buddy(pse);
> +       set_next_buddy(cfs_rq, pse);
>         return true;
>  }
>
> @@ -9188,7 +9108,6 @@ static void wakeup_preempt_fair(struct r
>         if (!sched_feat(WAKEUP_PREEMPTION))
>                 return;
>
> -       find_matching_se(&se, &pse);
>         WARN_ON_ONCE(!pse);
>
>         cse_is_idle = se_is_idle(se);
> @@ -9216,8 +9135,7 @@ static void wakeup_preempt_fair(struct r
>         if (unlikely(!normal_policy(p->policy)))
>                 return;
>
> -       cfs_rq = cfs_rq_of(se);
> -       update_curr(cfs_rq);
> +       update_curr_fair(rq);
>         /*
>          * If @p has a shorter slice than current and @p is eligible, override
>          * current's slice protection in order to allow preemption.
> @@ -9261,18 +9179,15 @@ static void wakeup_preempt_fair(struct r
>         }
>
>  pick:
> -       nse = pick_next_entity(rq, cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT);
> -       /* If @p has become the most eligible task, force preemption */
> -       if (nse == pse)
> -               goto preempt;
> -
> -       /*
> -        * Because p is enqueued, nse being null can only mean that we
> -        * dequeued a delayed task. If there are still entities queued in
> -        * cfs, check if the next one will be p.
> -        */
> -       if (!nse && cfs_rq->nr_queued)
> -               goto pick;
> +       if (cfs_rq->h_nr_queued) {
> +               nse = pick_next_entity(rq, preempt_action != PREEMPT_WAKEUP_SHORT);
> +               if (unlikely(!nse))
> +                       goto pick;
> +
> +               /* If @p has become the most eligible task, force preemption */
> +               if (nse == pse)
> +                       goto preempt;
> +       }
>
>         if (sched_feat(RUN_TO_PARITY))
>                 update_protect_slice(cfs_rq, se);
> @@ -9291,34 +9206,25 @@ static void wakeup_preempt_fair(struct r
>  struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>         __must_hold(__rq_lockp(rq))
>  {
> +       struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *se;
> -       struct cfs_rq *cfs_rq;
>         struct task_struct *p;
> -       bool throttled;
>         int new_tasks;
>
>  again:
> -       cfs_rq = &rq->cfs;
> -       if (!cfs_rq->nr_queued)
> +       if (!cfs_rq->h_nr_queued)
>                 goto idle;
>
> -       throttled = false;
> -
> -       do {
> -               /* Might not have done put_prev_entity() */
> -               if (cfs_rq->curr && cfs_rq->curr->on_rq)
> -                       update_curr(cfs_rq);
> -
> -               throttled |= check_cfs_rq_runtime(cfs_rq);
> +       /* Might not have done put_prev_entity() */
> +       if (cfs_rq->curr && cfs_rq->curr->on_rq)
> +               update_curr(cfs_rq);
>
> -               se = pick_next_entity(rq, cfs_rq, true);
> -               if (!se)
> -                       goto again;
> -               cfs_rq = group_cfs_rq(se);
> -       } while (cfs_rq);
> +       se = pick_next_entity(rq, true);
> +       if (!se)
> +               goto again;
>
>         p = task_of(se);
> -       if (unlikely(throttled))
> +       if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
>                 task_throttle_setup_work(p);
>         return p;
>
> @@ -9353,7 +9259,7 @@ void fair_server_init(struct rq *rq)
>  static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next)
>  {
>         struct sched_entity *se = &prev->se;
> -       struct cfs_rq *cfs_rq;
> +       struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *nse = NULL;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -9363,7 +9269,7 @@ static void put_prev_task_fair(struct rq
>
>         while (se) {
>                 cfs_rq = cfs_rq_of(se);
> -               if (!nse || cfs_rq->curr)
> +               if (!nse || cfs_rq->h_curr)
>                         put_prev_entity(cfs_rq, se);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>                 if (nse) {
> @@ -9382,6 +9288,14 @@ static void put_prev_task_fair(struct rq
>  #endif
>                 se = parent_entity(se);
>         }
> +
> +       /* Put 'current' back into the tree. */
> +       cfs_rq = &rq->cfs;
> +       se = &prev->se;
> +       WARN_ON_ONCE(cfs_rq->curr != se);
> +       cfs_rq->curr = NULL;
> +       if (se->on_rq)
> +               __enqueue_entity(cfs_rq, se);
>  }
>
>  /*
> @@ -9390,8 +9304,8 @@ static void put_prev_task_fair(struct rq
>  static void yield_task_fair(struct rq *rq)
>  {
>         struct task_struct *curr = rq->donor;
> -       struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>         struct sched_entity *se = &curr->se;
> +       struct cfs_rq *cfs_rq = &rq->cfs;
>
>         /*
>          * Are we the only task in the tree?
> @@ -9432,11 +9346,11 @@ static bool yield_to_task_fair(struct rq
>         struct sched_entity *se = &p->se;
>
>         /* !se->on_rq also covers throttled task */
> -       if (!se->on_rq)
> +       if (!se->on_rq || se->sched_delayed)
>                 return false;
>
>         /* Tell the scheduler that we'd really like se to run next. */
> -       set_next_buddy(se);
> +       set_next_buddy(&task_rq(p)->cfs, se);
>
>         yield_task_fair(rq);
>
> @@ -9762,15 +9676,10 @@ static inline long migrate_degrades_loca
>   */
>  static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
>  {
> -       struct cfs_rq *dst_cfs_rq;
> +       struct cfs_rq *dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
>
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -       dst_cfs_rq = task_group(p)->cfs_rq[dest_cpu];
> -#else
> -       dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
> -#endif
> -       if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
> -           !entity_eligible(task_cfs_rq(p), &p->se))
> +       if (sched_feat(PLACE_LAG) && dst_cfs_rq->h_nr_queued &&
> +           !entity_eligible(&task_rq(p)->cfs, &p->se))
>                 return 1;
>
>         return 0;
> @@ -10240,7 +10149,7 @@ static void update_cfs_rq_h_load(struct
>         while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
>                 load = cfs_rq->h_load;
>                 load = div64_ul(load * se->avg.load_avg,
> -                       cfs_rq_load_avg(cfs_rq) + 1);
> +                               cfs_rq_load_avg(cfs_rq) + 1);
>                 cfs_rq = group_cfs_rq(se);
>                 cfs_rq->h_load = load;
>                 cfs_rq->last_h_load_update = now;
> @@ -13459,7 +13368,7 @@ static inline void task_tick_core(struct
>          * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
>          * if we need to give up the CPU.
>          */
> -       if (rq->core->core_forceidle_count && rq->cfs.nr_queued == 1 &&
> +       if (rq->core->core_forceidle_count && rq->cfs.h_nr_queued == 1 &&
>             __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
>                 resched_curr(rq);
>  }
> @@ -13668,30 +13577,8 @@ bool cfs_prio_less(const struct task_str
>
>         WARN_ON_ONCE(task_rq(b)->core != rq->core);
>
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -       /*
> -        * Find an se in the hierarchy for tasks a and b, such that the se's
> -        * are immediate siblings.
> -        */
> -       while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> -               int sea_depth = sea->depth;
> -               int seb_depth = seb->depth;
> -
> -               if (sea_depth >= seb_depth)
> -                       sea = parent_entity(sea);
> -               if (sea_depth <= seb_depth)
> -                       seb = parent_entity(seb);
> -       }
> -
> -       se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
> -       se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
> -
> -       cfs_rqa = sea->cfs_rq;
> -       cfs_rqb = seb->cfs_rq;
> -#else /* !CONFIG_FAIR_GROUP_SCHED: */
>         cfs_rqa = &task_rq(a)->cfs;
>         cfs_rqb = &task_rq(b)->cfs;
> -#endif /* !CONFIG_FAIR_GROUP_SCHED */
>
>         /*
>          * Find delta after normalizing se's vruntime with its cfs_rq's
> @@ -13729,14 +13616,20 @@ static inline void task_tick_core(struct
>   */
>  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  {
> -       struct cfs_rq *cfs_rq;
>         struct sched_entity *se = &curr->se;
> +       unsigned long weight = NICE_0_LOAD;
> +       struct cfs_rq *cfs_rq;
>
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
>                 entity_tick(cfs_rq, se, queued);
> +
> +               weight = __calc_prop_weight(cfs_rq, se, weight);
>         }
>
> +       se = &curr->se;
> +       reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +
>         if (queued)
>                 return;
>
> @@ -13772,7 +13665,7 @@ prio_changed_fair(struct rq *rq, struct
>         if (p->prio == oldprio)
>                 return;
>
> -       if (rq->cfs.nr_queued == 1)
> +       if (rq->cfs.h_nr_queued == 1)
>                 return;
>
>         /*
> @@ -13901,29 +13794,40 @@ static void switched_to_fair(struct rq *
>         }
>  }
>
> -/*
> - * Account for a task changing its policy or group.
> - *
> - * This routine is mostly called to set cfs_rq->curr field when a task
> - * migrates between groups/classes.
> - */
>  static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  {
>         struct sched_entity *se = &p->se;
> +       struct cfs_rq *cfs_rq = &rq->cfs;
> +       unsigned long weight = NICE_0_LOAD;
> +       bool on_rq = se->on_rq;
> +
> +       clear_buddies(cfs_rq, se);
> +
> +       if (on_rq)
> +               __dequeue_entity(cfs_rq, se);
>
>         for_each_sched_entity(se) {
> -               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +               cfs_rq = cfs_rq_of(se);
>
> -               if (IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) &&
> -                   first && cfs_rq->curr)
> -                       break;
> +               if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
> +                   !first || !cfs_rq->h_curr)
> +                       set_next_entity(cfs_rq, se);
>
> -               set_next_entity(cfs_rq, se, first);
>                 /* ensure bandwidth has been allocated on our new cfs_rq */
>                 account_cfs_rq_runtime(cfs_rq, 0);
> +
> +               if (on_rq)
> +                       weight = __calc_prop_weight(cfs_rq, se, weight);
>         }
>
>         se = &p->se;
> +       cfs_rq->curr = se;
> +
> +       if (on_rq) {
> +               reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +               if (first)
> +                       set_protect_slice(cfs_rq, se);
> +       }
>
>         if (task_on_rq_queued(p)) {
>                 /*
> @@ -14054,17 +13958,8 @@ void unregister_fair_sched_group(struct
>                 struct sched_entity *se = tg->se[cpu];
>                 struct rq *rq = cpu_rq(cpu);
>
> -               if (se) {
> -                       if (se->sched_delayed) {
> -                               guard(rq_lock_irqsave)(rq);
> -                               if (se->sched_delayed) {
> -                                       update_rq_clock(rq);
> -                                       dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> -                               }
> -                               list_del_leaf_cfs_rq(cfs_rq);
> -                       }
> +               if (se)
>                         remove_entity_load_avg(se);
> -               }
>
>                 /*
>                  * Only empty task groups can be destroyed; so we can speculatively
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -206,7 +206,7 @@ ___update_load_sum(u64 now, struct sched
>         /*
>          * running is a subset of runnable (weight) so running can't be set if
>          * runnable is clear. But there are some corner cases where the current
> -        * se has been already dequeued but cfs_rq->curr still points to it.
> +        * se has been already dequeued but cfs_rq->h_curr still points to it.
>          * This means that weight will be 0 but not running for a sched_entity
>          * but also for a cfs_rq if the latter becomes idle. As an example,
>          * this happens during sched_balance_newidle() which calls
> @@ -307,7 +307,7 @@ int __update_load_avg_blocked_se(u64 now
>  int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>         if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> -                               cfs_rq->curr == se)) {
> +                               cfs_rq->h_curr == se)) {
>
>                 ___update_load_avg(&se->avg, se_weight(se));
>                 cfs_se_util_change(&se->avg);
> @@ -323,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, st
>         if (___update_load_sum(now, &cfs_rq->avg,
>                                 scale_load_down(cfs_rq->load.weight),
>                                 cfs_rq->h_nr_runnable,
> -                               cfs_rq->curr != NULL)) {
> +                               cfs_rq->h_curr != NULL)) {
>
>                 ___update_load_avg(&cfs_rq->avg, 1);
>                 trace_pelt_cfs_tp(cfs_rq);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -528,21 +528,8 @@ struct task_group {
>
>  };
>
> -#ifdef CONFIG_GROUP_SCHED_WEIGHT
>  #define ROOT_TASK_GROUP_LOAD   NICE_0_LOAD
>
> -/*
> - * A weight of 0 or 1 can cause arithmetics problems.
> - * A weight of a cfs_rq is the sum of weights of which entities
> - * are queued on this cfs_rq, so a weight of a entity should not be
> - * too large, so as the shares value of a task group.
> - * (The default weight is 1024 - so there's no practical
> - *  limitation from this.)
> - */
> -#define MIN_SHARES             (1UL <<  1)
> -#define MAX_SHARES             (1UL << 18)
> -#endif
> -
>  typedef int (*tg_visitor)(struct task_group *, void *);
>
>  extern int walk_tg_tree_from(struct task_group *from,
> @@ -629,6 +616,17 @@ static inline bool cfs_task_bw_constrain
>
>  #endif /* !CONFIG_CGROUP_SCHED */
>
> +/*
> + * A weight of 0 or 1 can cause arithmetics problems.
> + * A weight of a cfs_rq is the sum of weights of which entities
> + * are queued on this cfs_rq, so a weight of a entity should not be
> + * too large, so as the shares value of a task group.
> + * (The default weight is 1024 - so there's no practical
> + *  limitation from this.)
> + */
> +#define MIN_SHARES             (1UL <<  1)
> +#define MAX_SHARES             (1UL << 18)
> +
>  extern void unregister_rt_sched_group(struct task_group *tg);
>  extern void free_rt_sched_group(struct task_group *tg);
>  extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
> @@ -707,6 +705,7 @@ struct cfs_rq {
>         /*
>          * CFS load tracking
>          */
> +       struct sched_entity     *h_curr;
>         struct sched_avg        avg;
>  #ifndef CONFIG_64BIT
>         u64                     last_update_time_copy;
> @@ -2509,6 +2508,7 @@ extern const u32          sched_prio_to_wmult[40
>  #define ENQUEUE_MIGRATED       0x00040000
>  #define ENQUEUE_INITIAL                0x00080000
>  #define ENQUEUE_RQ_SELECTED    0x00100000
> +#define ENQUEUE_QUEUED         0x00200000
>
>  #define RETRY_TASK             ((void *)-1UL)
>
>
>

* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
  2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
  2026-05-12  5:37   ` K Prateek Nayak
@ 2026-05-19 15:13   ` Vincent Guittot
  2026-06-03  9:51   ` Aaron Lu
  2 siblings, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-19 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> With commit 50653216e4ff ("sched: Add support to pick functions to
> take rf") removing the balance callback, the pick_task() callback is
> in charge of newidle balancing.
>
> This means pick_task_fair() should do so too. This hasn't been a
> problem in practise because pick_next_task_fair() is used. However,
> since we'll be removing that one shortly, make sure pick_next_task()
> is up to scratch.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
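
As an illustration of the control flow described in the changelog, the
following is a minimal userspace sketch of a pick function that falls back to
newidle balancing when nothing is queued. It is a toy model, not the kernel
code: the toy_ types, toy_balance_newidle() and its scripted return value are
all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's types. */
struct toy_task { int id; };

struct toy_rq {
	int nr_queued;            /* tasks currently queued */
	struct toy_task *queued;  /* task returned while nr_queued > 0 */
	int balance_result;       /* what the next balance attempt reports */
	struct toy_task *pulled;  /* task a successful balance pulls in */
};

#define TOY_RETRY_TASK ((struct toy_task *)-1UL)

/* Hypothetical balance model: <0 means retry, 0 nothing found, >0 pulled. */
static int toy_balance_newidle(struct toy_rq *rq)
{
	int ret = rq->balance_result;

	if (ret > 0) {
		rq->queued = rq->pulled;
		rq->nr_queued = ret;
	}
	rq->balance_result = 0;   /* a second attempt finds nothing */
	return ret;
}

static struct toy_task *toy_pick_task(struct toy_rq *rq)
{
	int new_tasks;

	assert(rq != NULL);
again:
	if (!rq->nr_queued)
		goto idle;
	return rq->queued;

idle:
	new_tasks = toy_balance_newidle(rq);
	if (new_tasks < 0)
		return TOY_RETRY_TASK;  /* caller restarts the whole pick */
	if (new_tasks > 0)
		goto again;             /* something was pulled, pick it */
	return NULL;                    /* genuinely idle */
}
```

The three outcomes map onto the three returns in the patch: a task, the retry
marker, or NULL when the balance attempt found nothing.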


> ---
>  kernel/sched/fair.c |   38 +++++++++++++++-----------------------
>  1 file changed, 15 insertions(+), 23 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9215,16 +9215,18 @@ static void wakeup_preempt_fair(struct r
>  }
>
>  static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
> +       __must_hold(__rq_lockp(rq))
>  {
>         struct sched_entity *se;
>         struct cfs_rq *cfs_rq;
>         struct task_struct *p;
>         bool throttled;
> +       int new_tasks;
>
>  again:
>         cfs_rq = &rq->cfs;
>         if (!cfs_rq->nr_queued)
> -               return NULL;
> +               goto idle;
>
>         throttled = false;
>
> @@ -9245,6 +9247,14 @@ static struct task_struct *pick_task_fai
>         if (unlikely(throttled))
>                 task_throttle_setup_work(p);
>         return p;
> +
> +idle:
> +       new_tasks = sched_balance_newidle(rq, rf);
> +       if (new_tasks < 0)
> +               return RETRY_TASK;
> +       if (new_tasks > 0)
> +               goto again;
> +       return NULL;
>  }
>
>  static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
> @@ -9256,12 +9266,12 @@ pick_next_task_fair(struct rq *rq, struc
>  {
>         struct sched_entity *se;
>         struct task_struct *p;
> -       int new_tasks;
>
> -again:
>         p = pick_task_fair(rq, rf);
> +       if (unlikely(p == RETRY_TASK))
> +               return p;
>         if (!p)
> -               goto idle;
> +               return p;
>         se = &p->se;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -9311,29 +9321,11 @@ pick_next_task_fair(struct rq *rq, struc
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>         put_prev_set_next_task(rq, prev, p);
>         return p;
> -
> -idle:
> -       if (rf) {
> -               new_tasks = sched_balance_newidle(rq, rf);
> -
> -               /*
> -                * Because sched_balance_newidle() releases (and re-acquires)
> -                * rq->lock, it is possible for any higher priority task to
> -                * appear. In that case we must re-start the pick_next_entity()
> -                * loop.
> -                */
> -               if (new_tasks < 0)
> -                       return RETRY_TASK;
> -
> -               if (new_tasks > 0)
> -                       goto again;
> -       }
> -
> -       return NULL;
>  }
>
>  static struct task_struct *
>  fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
> +       __must_hold(__rq_lockp(dl_se->rq))
>  {
>         return pick_task_fair(dl_se->rq, rf);
>  }
>
>

* Re: [PATCH v2 09/10] sched: Remove sched_class::pick_next_task()
  2026-05-11 11:31 ` [PATCH v2 09/10] sched: Remove sched_class::pick_next_task() Peter Zijlstra
@ 2026-05-19 15:14   ` Vincent Guittot
  0 siblings, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-19 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The reason for pick_next_task_fair() is the put/set optimization that
> avoids touching the common ancestors. However, it is possible to
> implement this in the put_prev_task() and set_next_task() calls as
> used in put_prev_set_next_task().
>
> Notably, put_prev_set_next_task() is the only site that:
>
>  - calls put_prev_task() with a .next argument;
>  - calls set_next_task() with .first = true.
>
> This means that put_prev_task() can determine the common hierarchy and
> stop there, and then set_next_task() can terminate where put_prev_task
> stopped.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Hackbench results on my Arm64 dev machine stay similar with patches 8
and 9 (unlike patch 10).

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
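
To make the changelog's "stop at the common hierarchy" idea concrete, here is
a minimal userspace sketch of the put/set walk. It is a toy model under
stated assumptions: toy_se, its integer 'group' field (standing in for the
runqueue an entity is queued on) and the toy_curr[] array are hypothetical,
and the real put_prev_entity()/set_next_entity() work is reduced to clearing
and setting a pointer.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_GROUPS 8

/* Toy entity: 'group' names the runqueue it is queued on. */
struct toy_se {
	struct toy_se *parent;
	int depth;
	int group;
};

/* toy_curr[g] != NULL means runqueue g still has a current entity. */
static struct toy_se *toy_curr[TOY_NR_GROUPS];

/* Put 'se' and its ancestors, stopping at the level shared with 'nse'. */
static int toy_put_prev(struct toy_se *se, struct toy_se *nse)
{
	int puts = 0;

	while (se) {
		assert(se->group >= 0 && se->group < TOY_NR_GROUPS);
		if (!nse || toy_curr[se->group]) {
			toy_curr[se->group] = NULL;
			puts++;
		}
		if (nse) {
			int d;

			if (se->group == nse->group)
				break;          /* common level reached */
			d = nse->depth - se->depth;
			if (d >= 0) {
				/* nse has equal or greater depth, ascend */
				nse = nse->parent;
				/* if nse is the deeper, do not ascend se */
				if (d > 0)
					continue;
			}
		}
		se = se->parent;
	}
	return puts;
}

/* Set 'se' and its ancestors, stopping where a current entity remains. */
static int toy_set_next(struct toy_se *se)
{
	int sets = 0;

	for (; se; se = se->parent) {
		assert(se->group >= 0 && se->group < TOY_NR_GROUPS);
		if (toy_curr[se->group])
			break;                  /* put walk left this level alone */
		toy_curr[se->group] = se;
		sets++;
	}
	return sets;
}
```

In this model, switching between two tasks of the same group touches one
level on each side, while switching to a task in a sibling group also
replaces the entry on the shared parent level.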


> ---
>  kernel/sched/core.c  |   27 +++------
>  kernel/sched/fair.c  |  139 +++++++++++++++++----------------------------------
>  kernel/sched/sched.h |   14 -----
>  3 files changed, 57 insertions(+), 123 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5980,16 +5980,15 @@ __pick_next_task(struct rq *rq, struct t
>         if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
>                    rq->nr_running == rq->cfs.h_nr_queued)) {
>
> -               p = pick_next_task_fair(rq, prev, rf);
> +               p = pick_task_fair(rq, rf);
>                 if (unlikely(p == RETRY_TASK))
>                         goto restart;
>
>                 /* Assume the next prioritized class is idle_sched_class */
> -               if (!p) {
> +               if (!p)
>                         p = pick_task_idle(rq, rf);
> -                       put_prev_set_next_task(rq, prev, p);
> -               }
>
> +               put_prev_set_next_task(rq, prev, p);
>                 return p;
>         }
>
> @@ -5997,20 +5996,12 @@ __pick_next_task(struct rq *rq, struct t
>         prev_balance(rq, prev, rf);
>
>         for_each_active_class(class) {
> -               if (class->pick_next_task) {
> -                       p = class->pick_next_task(rq, prev, rf);
> -                       if (unlikely(p == RETRY_TASK))
> -                               goto restart;
> -                       if (p)
> -                               return p;
> -               } else {
> -                       p = class->pick_task(rq, rf);
> -                       if (unlikely(p == RETRY_TASK))
> -                               goto restart;
> -                       if (p) {
> -                               put_prev_set_next_task(rq, prev, p);
> -                               return p;
> -                       }
> +               p = class->pick_task(rq, rf);
> +               if (unlikely(p == RETRY_TASK))
> +                       goto restart;
> +               if (p) {
> +                       put_prev_set_next_task(rq, prev, p);
> +                       return p;
>                 }
>         }
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9214,7 +9214,7 @@ static void wakeup_preempt_fair(struct r
>         resched_curr_lazy(rq);
>  }
>
> -static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
> +struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>         __must_hold(__rq_lockp(rq))
>  {
>         struct sched_entity *se;
> @@ -9257,72 +9257,6 @@ static struct task_struct *pick_task_fai
>         return NULL;
>  }
>
> -static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
> -static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
> -
> -struct task_struct *
> -pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> -       __must_hold(__rq_lockp(rq))
> -{
> -       struct sched_entity *se;
> -       struct task_struct *p;
> -
> -       p = pick_task_fair(rq, rf);
> -       if (unlikely(p == RETRY_TASK))
> -               return p;
> -       if (!p)
> -               return p;
> -       se = &p->se;
> -
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -       if (prev->sched_class != &fair_sched_class)
> -               goto simple;
> -
> -       __put_prev_set_next_dl_server(rq, prev, p);
> -
> -       /*
> -        * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> -        * likely that a next task is from the same cgroup as the current.
> -        *
> -        * Therefore attempt to avoid putting and setting the entire cgroup
> -        * hierarchy, only change the part that actually changes.
> -        *
> -        * Since we haven't yet done put_prev_entity and if the selected task
> -        * is a different task than we started out with, try and touch the
> -        * least amount of cfs_rqs.
> -        */
> -       if (prev != p) {
> -               struct sched_entity *pse = &prev->se;
> -               struct cfs_rq *cfs_rq;
> -
> -               while (!(cfs_rq = is_same_group(se, pse))) {
> -                       int se_depth = se->depth;
> -                       int pse_depth = pse->depth;
> -
> -                       if (se_depth <= pse_depth) {
> -                               put_prev_entity(cfs_rq_of(pse), pse);
> -                               pse = parent_entity(pse);
> -                       }
> -                       if (se_depth >= pse_depth) {
> -                               set_next_entity(cfs_rq_of(se), se, true);
> -                               se = parent_entity(se);
> -                       }
> -               }
> -
> -               put_prev_entity(cfs_rq, pse);
> -               set_next_entity(cfs_rq, se, true);
> -
> -               __set_next_task_fair(rq, p, true);
> -       }
> -
> -       return p;
> -
> -simple:
> -#endif /* CONFIG_FAIR_GROUP_SCHED */
> -       put_prev_set_next_task(rq, prev, p);
> -       return p;
> -}
> -
>  static struct task_struct *
>  fair_server_pick_task(struct sched_dl_entity *dl_se, struct rq_flags *rf)
>         __must_hold(__rq_lockp(dl_se->rq))
> @@ -9346,10 +9280,33 @@ static void put_prev_task_fair(struct rq
>  {
>         struct sched_entity *se = &prev->se;
>         struct cfs_rq *cfs_rq;
> +       struct sched_entity *nse = NULL;
>
> -       for_each_sched_entity(se) {
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +       if (next && next->sched_class == &fair_sched_class)
> +               nse = &next->se;
> +#endif
> +
> +       while (se) {
>                 cfs_rq = cfs_rq_of(se);
> -               put_prev_entity(cfs_rq, se);
> +               if (!nse || cfs_rq->curr)
> +                       put_prev_entity(cfs_rq, se);
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +               if (nse) {
> +                       if (is_same_group(se, nse))
> +                               break;
> +
> +                       int d = nse->depth - se->depth;
> +                       if (d >= 0) {
> +                               /* nse has equal or greater depth, ascend */
> +                               nse = parent_entity(nse);
> +                               /* if nse is the deeper, do not ascend se */
> +                               if (d > 0)
> +                                       continue;
> +                       }
> +               }
> +#endif
> +               se = parent_entity(se);
>         }
>  }
>
> @@ -13896,10 +13853,30 @@ static void switched_to_fair(struct rq *
>         }
>  }
>
> -static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> +/*
> + * Account for a task changing its policy or group.
> + *
> + * This routine is mostly called to set cfs_rq->curr field when a task
> + * migrates between groups/classes.
> + */
> +static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  {
>         struct sched_entity *se = &p->se;
>
> +       for_each_sched_entity(se) {
> +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +               if (IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) &&
> +                   first && cfs_rq->curr)
> +                       break;
> +
> +               set_next_entity(cfs_rq, se, first);
> +               /* ensure bandwidth has been allocated on our new cfs_rq */
> +               account_cfs_rq_runtime(cfs_rq, 0);
> +       }
> +
> +       se = &p->se;
> +
>         if (task_on_rq_queued(p)) {
>                 /*
>                  * Move the next running task to the front of the list, so our
> @@ -13919,27 +13896,6 @@ static void __set_next_task_fair(struct
>         sched_fair_update_stop_tick(rq, p);
>  }
>
> -/*
> - * Account for a task changing its policy or group.
> - *
> - * This routine is mostly called to set cfs_rq->curr field when a task
> - * migrates between groups/classes.
> - */
> -static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> -{
> -       struct sched_entity *se = &p->se;
> -
> -       for_each_sched_entity(se) {
> -               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> -
> -               set_next_entity(cfs_rq, se, first);
> -               /* ensure bandwidth has been allocated on our new cfs_rq */
> -               account_cfs_rq_runtime(cfs_rq, 0);
> -       }
> -
> -       __set_next_task_fair(rq, p, first);
> -}
> -
>  void init_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>         cfs_rq->tasks_timeline = RB_ROOT_CACHED;
> @@ -14251,7 +14207,6 @@ DEFINE_SCHED_CLASS(fair) = {
>         .wakeup_preempt         = wakeup_preempt_fair,
>
>         .pick_task              = pick_task_fair,
> -       .pick_next_task         = pick_next_task_fair,
>         .put_prev_task          = put_prev_task_fair,
>         .set_next_task          = set_next_task_fair,
>
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2555,17 +2555,6 @@ struct sched_class {
>          * schedule/pick_next_task: rq->lock
>          */
>         struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
> -       /*
> -        * Optional! When implemented pick_next_task() should be equivalent to:
> -        *
> -        *   next = pick_task();
> -        *   if (next) {
> -        *       put_prev_task(prev);
> -        *       set_next_task_first(next);
> -        *   }
> -        */
> -       struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev,
> -                                             struct rq_flags *rf);
>
>         /*
>          * sched_change:
> @@ -2789,8 +2778,7 @@ static inline bool sched_fair_runnable(s
>         return rq->cfs.nr_queued > 0;
>  }
>
> -extern struct task_struct *pick_next_task_fair(struct rq *rq, struct task_struct *prev,
> -                                              struct rq_flags *rf);
> +extern struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf);
>  extern struct task_struct *pick_task_idle(struct rq *rq, struct rq_flags *rf);
>
>  #define SCA_CHECK              0x01
>
>


* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-19 10:13         ` Vincent Guittot
@ 2026-05-19 16:00           ` Vincent Guittot
  0 siblings, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-19 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tue, 19 May 2026 at 12:13, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 18 May 2026 at 23:12, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, May 18, 2026 at 03:34:51PM +0200, Vincent Guittot wrote:
> > > On Wed, 13 May 2026 at 13:35, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Tue, May 12, 2026 at 10:42:33AM +0200, Vincent Guittot wrote:
> > > >
> > > > > I haven't reviewed the patches yet but I ran some tests with it while
> > > > > testing sched latency related changes for short slice wakeup
> > > > > preemption. I have some large hackbench regressions with this series
> > > > > on HMP system with and without EAS. those figures are unexpected
> > > > > because the benchs run on root cfs
> > > > >
> > > > > One example with hackbench 8 groups thread pipe
> > > > > >                 tip/sched/core  tip/sched/core       +this patchset        +this patchset
> > > > > > slice           2.8ms           16ms                 2.8ms                 16ms
> > > > > > dragonboard rb5 with EAS
> > > > > >                 0,748(+/-4,6%)  0,621(+/-3.6%) +17%  1,915(+/-7.9%) -156%  0,689(+/- 9.1%) +8%
> > > > > >
> > > > > > radxa orion6 HMP without EAS
> > > > > >                 0,588(+/-5.8%)  0,677(+/-5.9%) -15%  1,505(+/-10%) -156%   1,071(+/-5.9%) -82%
> > > > >
> > > > > Increasing the slice partly removes regressions but tis is surprising
> > > > > because the bench runs at root cfs and I thought that results will not
> > > > > change in such a case
> > > >
> > > > D'oh :/
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index e54da4c6c945..77d0e1937f2c 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -9071,7 +9071,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
> > > >         enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> > > >         struct task_struct *donor = rq->donor;
> > > >         struct sched_entity *nse, *se = &donor->se, *pse = &p->se;
> > > > -       struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> > > > +       struct cfs_rq *cfs_rq = &rq->cfs;
> > >
> > > I tested this patch on top of the series but it doesn't fix the perf
> > > regression on rb5
> > >
> > > hackbench 8 groups thread pipe is still at 1.907(+/-7.6%) with default
> > > slice duration
> >
> > Weird, I can't reproduce anymore with this fixed :/
> >
> > I'll try more hackbench variants tomorrow I suppose.
>
> I tried several conf :
> - HMP with EAS enabled
> - HMP without EAS enabled (perf cpufreq gov)
> - SMP (only the 4 little cores)
>
> All of them show large regressions with hackbench which are almost
> recovered when increasing the slice from 2.8 to 16ms

With patch 10 the vlag value is very often set to the max of 3.8ms (the
clamp value: 2.8ms slice + 1ms tick), whereas it is usually less than
1ms without patch 10.
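
For reference, the clamp arithmetic mentioned above can be sketched in
userspace C. This is only an illustration under stated assumptions: the limit
is taken to be slice + one tick exactly as described in this mail, weight
scaling is ignored, and clamp_vlag() is a hypothetical helper name, not the
kernel's update_entity_lag().

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/*
 * Hypothetical helper: clamp a lag value (in ns) to +/- (slice + tick).
 * The limit formula is an assumption taken from the description in this
 * mail; the real kernel code also scales the limit by entity weight.
 */
static int64_t clamp_vlag(int64_t vlag_ns, int64_t slice_ns, int64_t tick_ns)
{
	int64_t limit = slice_ns + tick_ns;

	if (vlag_ns > limit)
		return limit;
	if (vlag_ns < -limit)
		return -limit;
	return vlag_ns;
}
```

With a 2.8ms slice and a 1ms tick the limit is 3.8ms, so any lag beyond that
in either direction saturates at the limit, while a sub-millisecond lag
passes through unchanged.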


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-19 10:38   ` Vincent Guittot
@ 2026-05-20 16:32     ` Vincent Guittot
  2026-05-21  2:57       ` K Prateek Nayak
  2026-05-21 10:31       ` Peter Zijlstra
  0 siblings, 2 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-20 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Tuesday 19 May 2026 at 12:38:10 (+0200), Vincent Guittot wrote:
> On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Change fair/cgroup to a single runqueue.
> >
> > Infamously fair/cgroup isn't working for a number of people; typically
> > the complaint is latencies and/or overhead. The latency issue is due
> > to the intermediate entries that represent a combination of tasks and
> > thereby obfuscate the runnability of tasks.
> >
> > The approach here is to leave the cgroup hierarchy as is; including
> > the intermediate enqueue/dequeue but move the actual EEVDF runqueue
> > outside. This means things like the shares_weight approximation are
> > fully preserved.
> >
> > That is, given a hierarchy like:
> >
> >         R
> >         |
> >         se--G1
> >             / \
> >       G2--se   se--G3
> >      / \           |
> > T1--se se--T2      se--T3
> >
> > This is fully maintained for load tracking, however the EEVDF parts of
> > cfs_rq/se go unused for the intermediates and are instead connected
> > like:
> >
> >      _R_
> >     / | \
> >    T1 T2 T3
> >
> > Since the effective weight of the entities is determined by the
> > > hierarchy, this gets recomputed on enqueue, set_next_task and tick.
> >
> > Notably, the effective weight (se->h_load) is computed from the
> > hierarchical fraction: se->load / cfs_rq->load.
> >
> > > Since EEVDF is now exclusively operating on rq->cfs, it needs to
> > consider cfs_rq->h_nr_queued rather than cfs_rq->nr_queued. Similarly,
> > only tasks can get delayed, simplifying some of the cgroup cleanup.
> >
> > One place where additional information was required was
> > set_next_task() / put_prev_task(), where we need to track 'current'
> > both in the hierarchical sense (cfs_rq->h_curr) and in the flat sense
> > (cfs_rq->curr).
> >
> > As a result of only having a single level to pick from, much of the
> > complications in pick_next_task() and preemption go away.
> >
> > Since many of the hierarchical operations are still there, this won't
> > immediately fix the performance issues, but hopefully it will fix some
> > of the latency issues.
> >
> > TODO: split struct cfs_rq / struct sched_entity
> > TODO: try and get rid of h_curr
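
The hierarchical fraction described in the quoted commit message can be
sketched as a small userspace model. It mirrors the shape of the
__calc_prop_weight() helper from the patch, but the toy_* names, the struct
layout, and the unscaled NICE_0_LOAD / MIN_SHARES values are assumptions made
for this example only.

```c
#include <stddef.h>

/* Toy constants; the kernel may use scaled values for both. */
#define NICE_0_LOAD 1024UL
#define MIN_SHARES  2UL

/* Hypothetical stand-in for a sched_entity and the cfs_rq it is queued on. */
struct toy_se {
	unsigned long load;        /* models se->load.weight */
	unsigned long cfs_rq_load; /* models cfs_rq_of(se)->load.weight */
	struct toy_se *parent;     /* NULL for an entity queued on the root */
};

/* One level of the walk: scale weight by se->load / cfs_rq->load. */
static unsigned long toy_prop_weight(const struct toy_se *se,
				     unsigned long weight)
{
	weight *= se->load;
	if (se->parent)
		weight /= se->cfs_rq_load;
	else
		weight /= NICE_0_LOAD;

	return weight > MIN_SHARES ? weight : MIN_SHARES;
}

/* Walk from the task up to the root, accumulating the effective weight. */
static unsigned long toy_h_load(const struct toy_se *se)
{
	unsigned long weight = NICE_0_LOAD;

	for (; se; se = se->parent)
		weight = toy_prop_weight(se, weight);

	return weight;
}
```

Taking the R/G1/G2/G3 hierarchy from the commit message with made-up weights
(tasks at 1024, the G2 and G3 entities at 512 each, the G1 entity at 1024),
T1 and T2 each end up with 256 and T3 with 512, which together add up to
G1's 1024. Integer division means the per-level rounding can lose a little
weight in less tidy cases.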

I finally found the root cause of the regression: the entity lag was updated
after the task had been dequeued, which broke update_entity_lag().

update_entity_lag() must be called after updating curr and cfs_rq and before
clearing on_rq.

With the fix below I'm back to the original hackbench figures, and maybe even a
bit better. I haven't checked scheduling latency yet.
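
To illustrate why the ordering matters, here is a toy userspace model. It is
not the kernel code: the toy_* names are hypothetical, and it assumes that
lag is V - v_i with V being the weighted average vruntime over entities still
marked on_rq.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entity: only what the lag computation needs. */
struct toy_entity {
	int64_t vruntime;
	int64_t weight;
	int on_rq;
};

/* Weighted average vruntime over the entities that are still on_rq. */
static int64_t toy_avg_vruntime(const struct toy_entity *e, size_t n)
{
	int64_t sum = 0, load = 0;

	for (size_t i = 0; i < n; i++) {
		if (!e[i].on_rq)
			continue;
		sum += e[i].vruntime * e[i].weight;
		load += e[i].weight;
	}

	return load ? sum / load : 0;
}

/* Lag of entity i relative to the current average: V - v_i. */
static int64_t toy_lag(const struct toy_entity *e, size_t n, size_t i)
{
	return toy_avg_vruntime(e, n) - e[i].vruntime;
}
```

With two equal-weight entities at vruntime 0 and 10, the lag of the second
one is -5 while it is still on_rq. If on_rq is cleared first, the average
drops to 0 and the same computation yields -10. In this model, computing the
lag too late inflates its magnitude, which would be consistent with vlag
frequently hitting the clamp.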

---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77d0e1937f2c..32fe57004f27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5753,6 +5753,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_stats_dequeue_fair(cfs_rq, se, flags);
 
+	if (entity_is_task(se))
+		update_entity_lag(&rq_of(cfs_rq)->cfs, se);
+
 	se->on_rq = 0;
 	account_entity_dequeue(cfs_rq, se);
 
@@ -7423,6 +7426,7 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
 			update_load_avg(cfs_rq_of(se), se, 0);
+			update_entity_lag(cfs_rq, se);
 			set_delayed(se);
 			return false;
 		}
@@ -7430,7 +7434,6 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 	dequeue_hierarchy(p, flags);
 
-	update_entity_lag(cfs_rq, se);
 	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
 		se->deadline -= se->vruntime;
 		se->rel_deadline = 1;
-- 
2.43.0




> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/sched.h |    1
> >  kernel/sched/core.c   |    5
> >  kernel/sched/debug.c  |    9
> >  kernel/sched/fair.c   |  789 +++++++++++++++++++++-----------------------------
> >  kernel/sched/pelt.c   |    6
> >  kernel/sched/sched.h  |   26 -
> >  6 files changed, 366 insertions(+), 470 deletions(-)
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -575,6 +575,7 @@ struct sched_statistics {
> >  struct sched_entity {
> >         /* For load-balancing: */
> >         struct load_weight              load;
> > +       struct load_weight              h_load;
> >         struct rb_node                  run_node;
> >         u64                             deadline;
> >         u64                             min_vruntime;
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5539,11 +5539,8 @@ EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
> >   */
> >  static inline void prefetch_curr_exec_start(struct task_struct *p)
> >  {
> > -#ifdef CONFIG_FAIR_GROUP_SCHED
> > -       struct sched_entity *curr = p->se.cfs_rq->curr;
> > -#else
> >         struct sched_entity *curr = task_rq(p)->cfs.curr;
> > -#endif
> > +
> >         prefetch(curr);
> >         prefetch(&curr->exec_start);
> >  }
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -911,10 +911,11 @@ print_task(struct seq_file *m, struct rq
> >         else
> >                 SEQ_printf(m, " %c", task_state_to_char(p));
> >
> > -       SEQ_printf(m, " %15s %5d %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
> > +       SEQ_printf(m, " %15s %5d %10ld %9Ld.%06ld   %c   %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld   %5d ",
> >                 p->comm, task_pid_nr(p),
> > +               p->se.h_load.weight,
> >                 SPLIT_NS(p->se.vruntime),
> > -               entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
> > +               entity_eligible(&rq->cfs, &p->se) ? 'E' : 'N',
> >                 SPLIT_NS(p->se.deadline),
> >                 p->se.custom_slice ? 'S' : ' ',
> >                 SPLIT_NS(p->se.slice),
> > @@ -943,7 +944,7 @@ static void print_rq(struct seq_file *m,
> >
> >         SEQ_printf(m, "\n");
> >         SEQ_printf(m, "runnable tasks:\n");
> > -       SEQ_printf(m, " S            task   PID       vruntime   eligible    "
> > +       SEQ_printf(m, " S            task   PID     weight       vruntime   eligible    "
> >                    "deadline             slice          sum-exec      switches  "
> >                    "prio         wait-time        sum-sleep       sum-block"
> >  #ifdef CONFIG_NUMA_BALANCING
> > @@ -1051,6 +1052,8 @@ void print_cfs_rq(struct seq_file *m, in
> >                         cfs_rq->tg_load_avg_contrib);
> >         SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
> >                         atomic_long_read(&cfs_rq->tg->load_avg));
> > +       SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
> > +                       cfs_rq->h_load);
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >         SEQ_printf(m, "  .%-30s: %d\n", "throttled",
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -296,8 +296,8 @@ static u64 __calc_delta(u64 delta_exec,
> >   */
> >  static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> >  {
> > -       if (unlikely(se->load.weight != NICE_0_LOAD))
> > -               delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
> > +       if (se->h_load.weight != NICE_0_LOAD)
> > +               delta = __calc_delta(delta, NICE_0_LOAD, &se->h_load);
> >
> >         return delta;
> >  }
> > @@ -427,38 +427,6 @@ static inline struct sched_entity *paren
> >         return se->parent;
> >  }
> >
> > -static void
> > -find_matching_se(struct sched_entity **se, struct sched_entity **pse)
> > -{
> > -       int se_depth, pse_depth;
> > -
> > -       /*
> > -        * preemption test can be made between sibling entities who are in the
> > -        * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> > -        * both tasks until we find their ancestors who are siblings of common
> > -        * parent.
> > -        */
> > -
> > -       /* First walk up until both entities are at same depth */
> > -       se_depth = (*se)->depth;
> > -       pse_depth = (*pse)->depth;
> > -
> > -       while (se_depth > pse_depth) {
> > -               se_depth--;
> > -               *se = parent_entity(*se);
> > -       }
> > -
> > -       while (pse_depth > se_depth) {
> > -               pse_depth--;
> > -               *pse = parent_entity(*pse);
> > -       }
> > -
> > -       while (!is_same_group(*se, *pse)) {
> > -               *se = parent_entity(*se);
> > -               *pse = parent_entity(*pse);
> > -       }
> > -}
> > -
> >  static int tg_is_idle(struct task_group *tg)
> >  {
> >         return tg->idle > 0;
> > @@ -502,11 +470,6 @@ static inline struct sched_entity *paren
> >         return NULL;
> >  }
> >
> > -static inline void
> > -find_matching_se(struct sched_entity **se, struct sched_entity **pse)
> > -{
> > -}
> > -
> >  static inline int tg_is_idle(struct task_group *tg)
> >  {
> >         return 0;
> > @@ -685,7 +648,7 @@ static inline unsigned long avg_vruntime
> >  static inline void
> >  __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> > +       unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
> >         s64 w_vruntime, key = entity_key(cfs_rq, se);
> >
> >         w_vruntime = key * weight;
> > @@ -702,7 +665,7 @@ sum_w_vruntime_add_paranoid(struct cfs_r
> >         s64 key, tmp;
> >
> >  again:
> > -       weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> > +       weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
> >         key = entity_key(cfs_rq, se);
> >
> >         if (check_mul_overflow(key, weight, &key))
> > @@ -748,7 +711,7 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
> >  static void
> >  sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> > +       unsigned long weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
> >         s64 key = entity_key(cfs_rq, se);
> >
> >         cfs_rq->sum_w_vruntime -= key * weight;
> > @@ -790,7 +753,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> >                 s64 runtime = cfs_rq->sum_w_vruntime;
> >
> >                 if (curr) {
> > -                       unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
> > +                       unsigned long w = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
> >
> >                         runtime += entity_key(cfs_rq, curr) * w;
> >                         weight += w;
> > @@ -861,8 +824,6 @@ bool update_entity_lag(struct cfs_rq *cf
> >         u64 avruntime = avg_vruntime(cfs_rq);
> >         s64 vlag = entity_lag(cfs_rq, se, avruntime);
> >
> > -       WARN_ON_ONCE(!se->on_rq);
> > -
> >         if (se->sched_delayed) {
> >                 /* previous vlag < 0 otherwise se would not be delayed */
> >                 vlag = max(vlag, se->vlag);
> > @@ -898,7 +859,7 @@ static int vruntime_eligible(struct cfs_
> >         long load = cfs_rq->sum_weight;
> >
> >         if (curr && curr->on_rq) {
> > -               unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
> > +               unsigned long weight = avg_vruntime_weight(cfs_rq, curr->h_load.weight);
> >
> >                 avg += entity_key(cfs_rq, curr) * weight;
> >                 load += weight;
> > @@ -1039,6 +1000,9 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> >   */
> >  static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +       WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
> > +       WARN_ON_ONCE(!entity_is_task(se));
> > +
> >         sum_w_vruntime_add(cfs_rq, se);
> >         se->min_vruntime = se->vruntime;
> >         se->min_slice = se->slice;
> > @@ -1048,6 +1012,9 @@ static void __enqueue_entity(struct cfs_
> >
> >  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +       WARN_ON_ONCE(&rq_of(cfs_rq)->cfs != cfs_rq);
> > +       WARN_ON_ONCE(!entity_is_task(se));
> > +
> >         rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> >                                   &min_vruntime_cb);
> >         sum_w_vruntime_sub(cfs_rq, se);
> > @@ -1144,7 +1111,7 @@ static struct sched_entity *pick_eevdf(s
> >          * We can safely skip eligibility check if there is only one entity
> >          * in this cfs_rq, saving some cycles.
> >          */
> > -       if (cfs_rq->nr_queued == 1)
> > +       if (cfs_rq->h_nr_queued == 1)
> >                 return curr && curr->on_rq ? curr : se;
> >
> >         /*
> > @@ -1391,8 +1358,6 @@ static s64 update_se(struct rq *rq, stru
> >         return delta_exec;
> >  }
> >
> > -static void set_next_buddy(struct sched_entity *se);
> > -
> >  /*
> >   * Used by other classes to account runtime.
> >   */
> > @@ -1412,7 +1377,7 @@ static void update_curr(struct cfs_rq *c
> >          * not necessarily be the actual task running
> >          * (rq->curr.se). This is easy to confuse!
> >          */
> > -       struct sched_entity *curr = cfs_rq->curr;
> > +       struct sched_entity *curr = cfs_rq->h_curr;
> >         struct rq *rq = rq_of(cfs_rq);
> >         s64 delta_exec;
> >         bool resched;
> > @@ -1424,26 +1389,29 @@ static void update_curr(struct cfs_rq *c
> >         if (unlikely(delta_exec <= 0))
> >                 return;
> >
> > +       account_cfs_rq_runtime(cfs_rq, delta_exec);
> > +
> > +       if (!entity_is_task(curr))
> > +               return;
> > +
> > +       cfs_rq = &rq->cfs;
> > +
> >         curr->vruntime += calc_delta_fair(delta_exec, curr);
> >         resched = update_deadline(cfs_rq, curr);
> >
> > -       if (entity_is_task(curr)) {
> > -               /*
> > -                * If the fair_server is active, we need to account for the
> > -                * fair_server time whether or not the task is running on
> > -                * behalf of fair_server or not:
> > -                *  - If the task is running on behalf of fair_server, we need
> > -                *    to limit its time based on the assigned runtime.
> > -                *  - Fair task that runs outside of fair_server should account
> > -                *    against fair_server such that it can account for this time
> > -                *    and possibly avoid running this period.
> > -                */
> > -               dl_server_update(&rq->fair_server, delta_exec);
> > -       }
> > -
> > -       account_cfs_rq_runtime(cfs_rq, delta_exec);
> > +       /*
> > +        * If the fair_server is active, we need to account for the
> > +        * fair_server time whether or not the task is running on
> > +        * behalf of fair_server or not:
> > +        *  - If the task is running on behalf of fair_server, we need
> > +        *    to limit its time based on the assigned runtime.
> > +        *  - Fair task that runs outside of fair_server should account
> > +        *    against fair_server such that it can account for this time
> > +        *    and possibly avoid running this period.
> > +        */
> > +       dl_server_update(&rq->fair_server, delta_exec);
> >
> > -       if (cfs_rq->nr_queued == 1)
> > +       if (cfs_rq->h_nr_queued == 1)
> >                 return;
> >
> >         if (resched || !protect_slice(curr)) {
> > @@ -1454,7 +1422,10 @@ static void update_curr(struct cfs_rq *c
> >
> >  static void update_curr_fair(struct rq *rq)
> >  {
> > -       update_curr(cfs_rq_of(&rq->donor->se));
> > +       struct sched_entity *se = &rq->donor->se;
> > +
> > +       for_each_sched_entity(se)
> > +               update_curr(cfs_rq_of(se));
> >  }
> >
> >  static inline void
> > @@ -1530,7 +1501,7 @@ update_stats_enqueue_fair(struct cfs_rq
> >          * Are we enqueueing a waiting task? (for current tasks
> >          * a dequeue/enqueue event is a NOP)
> >          */
> > -       if (se != cfs_rq->curr)
> > +       if (se != cfs_rq->h_curr)
> >                 update_stats_wait_start_fair(cfs_rq, se);
> >
> >         if (flags & ENQUEUE_WAKEUP)
> > @@ -1548,7 +1519,7 @@ update_stats_dequeue_fair(struct cfs_rq
> >          * Mark the end of the wait period if dequeueing a
> >          * waiting task:
> >          */
> > -       if (se != cfs_rq->curr)
> > +       if (se != cfs_rq->h_curr)
> >                 update_stats_wait_end_fair(cfs_rq, se);
> >
> >         if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
> > @@ -3875,6 +3846,7 @@ static inline void update_scan_period(st
> >  static void
> >  account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +       WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
> >         update_load_add(&cfs_rq->load, se->load.weight);
> >         if (entity_is_task(se)) {
> >                 struct rq *rq = rq_of(cfs_rq);
> > @@ -3888,6 +3860,7 @@ account_entity_enqueue(struct cfs_rq *cf
> >  static void
> >  account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +       WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
> >         update_load_sub(&cfs_rq->load, se->load.weight);
> >         if (entity_is_task(se)) {
> >                 account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> > @@ -3965,7 +3938,7 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
> >  static void
> >  rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
> >  {
> > -       unsigned long old_weight = se->load.weight;
> > +       long old_weight = se->h_load.weight;
> >
> >         /*
> >          * VRUNTIME
> > @@ -4065,16 +4038,17 @@ rescale_entity(struct sched_entity *se,
> >                 se->vprot = div64_long(se->vprot * old_weight, weight);
> >  }
> >
> > -static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > -                           unsigned long weight)
> > +static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > +                          unsigned long weight, bool on_rq)
> >  {
> >         bool curr = cfs_rq->curr == se;
> >         bool rel_vprot = false;
> >         u64 avruntime = 0;
> >
> > -       if (se->on_rq) {
> > -               /* commit outstanding execution time */
> > -               update_curr(cfs_rq);
> > +       if (se->h_load.weight == weight)
> > +               return;
> > +
> > +       if (on_rq) {
> >                 avruntime = avg_vruntime(cfs_rq);
> >                 se->vlag = entity_lag(cfs_rq, se, avruntime);
> >                 se->deadline -= avruntime;
> > @@ -4084,46 +4058,90 @@ static void reweight_entity(struct cfs_r
> >                         rel_vprot = true;
> >                 }
> >
> > -               cfs_rq->nr_queued--;
> > +               cfs_rq->h_nr_queued--;
> >                 if (!curr)
> >                         __dequeue_entity(cfs_rq, se);
> > -               update_load_sub(&cfs_rq->load, se->load.weight);
> >         }
> > -       dequeue_load_avg(cfs_rq, se);
> >
> >         rescale_entity(se, weight, rel_vprot);
> >
> > -       update_load_set(&se->load, weight);
> > +       update_load_set(&se->h_load, weight);
> >
> > -       do {
> > -               u32 divider = get_pelt_divider(&se->avg);
> > -               se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> > -       } while (0);
> > -
> > -       enqueue_load_avg(cfs_rq, se);
> > -       if (se->on_rq) {
> > +       if (on_rq) {
> >                 if (rel_vprot)
> >                         se->vprot += avruntime;
> >                 se->deadline += avruntime;
> >                 se->rel_deadline = 0;
> >                 se->vruntime = avruntime - se->vlag;
> >
> > -               update_load_add(&cfs_rq->load, se->load.weight);
> >                 if (!curr)
> >                         __enqueue_entity(cfs_rq, se);
> > -               cfs_rq->nr_queued++;
> > +               cfs_rq->h_nr_queued++;
> >         }
> >  }
> >
> > +static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > +                           unsigned long weight)
> > +{
> > +       if (se->load.weight == weight)
> > +               return;
> > +
> > +       if (se->on_rq) {
> > +               WARN_ON_ONCE(cfs_rq != cfs_rq_of(se));
> > +               update_load_sub(&cfs_rq->load, se->load.weight);
> > +       }
> > +       dequeue_load_avg(cfs_rq, se);
> > +
> > +       update_load_set(&se->load, weight);
> > +
> > +       do {
> > +               u32 divider = get_pelt_divider(&se->avg);
> > +               se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> > +       } while (0);
> > +
> > +       enqueue_load_avg(cfs_rq, se);
> > +
> > +       if (se->on_rq)
> > +               update_load_add(&cfs_rq->load, se->load.weight);
> > +}
> > +
> > +/*
> > + * weight = NICE_0_LOAD;
> > > + * for_each_sched_entity(se)
> > + *   weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
> > + */
> > +static __always_inline
> > +unsigned long __calc_prop_weight(struct cfs_rq *cfs_rq, struct sched_entity *se,
> > +                                unsigned long weight)
> > +{
> > +       weight *= se->load.weight;
> > +       if (parent_entity(se))
> > +               weight /= cfs_rq->load.weight;
> > +       else
> > +               weight /= NICE_0_LOAD;
> > +
> > +       return max(weight, MIN_SHARES);
> > +}
> > +
> >  static void reweight_task_fair(struct rq *rq, struct task_struct *p,
> >                                const struct load_weight *lw)
> >  {
> >         struct sched_entity *se = &p->se;
> > -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > -       struct load_weight *load = &se->load;
> > +       unsigned long weight = NICE_0_LOAD;
> > +
> > +       if (se->on_rq)
> > +               update_curr_fair(rq);
> > +
> > +       reweight_entity(cfs_rq_of(se), se, lw->weight);
> > +       se->load.inv_weight = lw->inv_weight;
> > +
> > +       if (!se->on_rq)
> > +               return;
> > +
> > +       for_each_sched_entity(se)
> > +               weight = __calc_prop_weight(cfs_rq_of(se), se, weight);
> >
> > -       reweight_entity(cfs_rq, se, lw->weight);
> > -       load->inv_weight = lw->inv_weight;
> > +       reweight_eevdf(&rq->cfs, &p->se, weight, p->se.on_rq);
> >  }
> >
> >  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> > @@ -4331,7 +4349,6 @@ static long calc_group_shares(struct cfs
> >  static void update_cfs_group(struct sched_entity *se)
> >  {
> >         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> > -       long shares;
> >
> >         /*
> >          * When a group becomes empty, preserve its weight. This matters for
> > @@ -4340,9 +4357,7 @@ static void update_cfs_group(struct sche
> >         if (!gcfs_rq || !gcfs_rq->load.weight)
> >                 return;
> >
> > -       shares = calc_group_shares(gcfs_rq);
> > -       if (unlikely(se->load.weight != shares))
> > -               reweight_entity(cfs_rq_of(se), se, shares);
> > +       reweight_entity(cfs_rq_of(se), se, calc_group_shares(gcfs_rq));
> >  }
> >
> >  #else /* !CONFIG_FAIR_GROUP_SCHED: */
> > @@ -4460,7 +4475,7 @@ static inline bool cfs_rq_is_decayed(str
> >   * differential update where we store the last value we propagated. This in
> >   * turn allows skipping updates if the differential is 'small'.
> >   *
> > - * Updating tg's load_avg is necessary before update_cfs_share().
> > + * Updating tg's load_avg is necessary before update_cfs_group().
> >   */
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> > @@ -4926,7 +4941,7 @@ static void migrate_se_pelt_lag(struct s
> >   * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
> >   * avg. The immediate corollary is that all (fair) tasks must be attached.
> >   *
> > - * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
> > + * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
> >   *
> >   * Return: true if the load decayed or we removed load.
> >   *
> > @@ -5475,6 +5490,7 @@ static void
> >  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> >         u64 vslice, vruntime = avg_vruntime(cfs_rq);
> > +       unsigned int nr_queued = cfs_rq->h_nr_queued;
> >         bool update_zero = false;
> >         s64 lag = 0;
> >
> > @@ -5482,6 +5498,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
> >                 se->slice = sysctl_sched_base_slice;
> >         vslice = calc_delta_fair(se->slice, se);
> >
> > +       if (flags & ENQUEUE_QUEUED)
> > +               nr_queued -= 1;
> > +
> >         /*
> >          * Due to how V is constructed as the weighted average of entities,
> >          * adding tasks with positive lag, or removing tasks with negative lag
> > @@ -5490,7 +5509,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
> >          *
> >          * EEVDF: placement strategy #1 / #2
> >          */
> > -       if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
> > +       if (sched_feat(PLACE_LAG) && nr_queued && se->vlag) {
> >                 struct sched_entity *curr = cfs_rq->curr;
> >                 long load, weight;
> >
> > @@ -5550,9 +5569,9 @@ place_entity(struct cfs_rq *cfs_rq, stru
> >                  */
> >                 load = cfs_rq->sum_weight;
> >                 if (curr && curr->on_rq)
> > -                       load += avg_vruntime_weight(cfs_rq, curr->load.weight);
> > +                       load += avg_vruntime_weight(cfs_rq, curr->h_load.weight);
> >
> > -               weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> > +               weight = avg_vruntime_weight(cfs_rq, se->h_load.weight);
> >                 lag *= load + weight;
> >                 if (WARN_ON_ONCE(!load))
> >                         load = 1;
> > @@ -5611,22 +5630,8 @@ static void check_enqueue_throttle(struc
> >  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
> >
> >  static void
> > -requeue_delayed_entity(struct sched_entity *se);
> > -
> > -static void
> >  enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> > -       bool curr = cfs_rq->curr == se;
> > -
> > -       /*
> > -        * If we're the current task, we must renormalise before calling
> > -        * update_curr().
> > -        */
> > -       if (curr)
> > -               place_entity(cfs_rq, se, flags);
> > -
> > -       update_curr(cfs_rq);
> > -
> >         /*
> >          * When enqueuing a sched_entity, we must:
> >          *   - Update loads to have both entity and cfs_rq synced with now.
> > @@ -5645,13 +5650,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> >          */
> >         update_cfs_group(se);
> >
> > -       /*
> > -        * XXX now that the entity has been re-weighted, and it's lag adjusted,
> > -        * we can place the entity.
> > -        */
> > -       if (!curr)
> > -               place_entity(cfs_rq, se, flags);
> > -
> >         account_entity_enqueue(cfs_rq, se);
> >
> >         /* Entity has migrated, no longer consider this task hot */
> > @@ -5660,8 +5658,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> >
> >         check_schedstat_required();
> >         update_stats_enqueue_fair(cfs_rq, se, flags);
> > -       if (!curr)
> > -               __enqueue_entity(cfs_rq, se);
> >         se->on_rq = 1;
> >
> >         if (cfs_rq->nr_queued == 1) {
> > @@ -5679,21 +5675,19 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> >         }
> >  }
> >
> > -static void __clear_buddies_next(struct sched_entity *se)
> > +static void set_next_buddy(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       for_each_sched_entity(se) {
> > -               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > -               if (cfs_rq->next != se)
> > -                       break;
> > -
> > -               cfs_rq->next = NULL;
> > -       }
> > +       if (WARN_ON_ONCE(!se->on_rq || se->sched_delayed))
> > +               return;
> > +       if (se_is_idle(se))
> > +               return;
> > +       cfs_rq->next = se;
> >  }
> >
> >  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >         if (cfs_rq->next == se)
> > -               __clear_buddies_next(se);
> > +               cfs_rq->next = NULL;
> >  }
> >
> >  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> > @@ -5704,7 +5698,7 @@ static void set_delayed(struct sched_ent
> >
> >         /*
> >          * Delayed se of cfs_rq have no tasks queued on them.
> > -        * Do not adjust h_nr_runnable since dequeue_entities()
> > +        * Do not adjust h_nr_runnable since __dequeue_task()
> >          * will account it for blocked tasks.
> >          */
> >         if (!entity_is_task(se))
> > @@ -5737,37 +5731,11 @@ static void clear_delayed(struct sched_e
> >         }
> >  }
> >
> > -static bool
> > +static void
> >  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> > -       bool sleep = flags & DEQUEUE_SLEEP;
> >         int action = UPDATE_TG;
> >
> > -       update_curr(cfs_rq);
> > -       clear_buddies(cfs_rq, se);
> > -
> > -       if (flags & DEQUEUE_DELAYED) {
> > -               WARN_ON_ONCE(!se->sched_delayed);
> > -       } else {
> > -               bool delay = sleep;
> > -               /*
> > -                * DELAY_DEQUEUE relies on spurious wakeups, special task
> > -                * states must not suffer spurious wakeups, excempt them.
> > -                */
> > -               if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
> > -                       delay = false;
> > -
> > -               WARN_ON_ONCE(delay && se->sched_delayed);
> > -
> > -               if (sched_feat(DELAY_DEQUEUE) && delay &&
> > -                   !entity_eligible(cfs_rq, se)) {
> > -                       update_load_avg(cfs_rq, se, 0);
> > -                       update_entity_lag(cfs_rq, se);
> > -                       set_delayed(se);
> > -                       return false;
> > -               }
> > -       }
> > -
> >         if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> >                 action |= DO_DETACH;
> >
> > @@ -5785,14 +5753,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> >
> >         update_stats_dequeue_fair(cfs_rq, se, flags);
> >
> > -       update_entity_lag(cfs_rq, se);
> > -       if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
> > -               se->deadline -= se->vruntime;
> > -               se->rel_deadline = 1;
> > -       }
> > -
> > -       if (se != cfs_rq->curr)
> > -               __dequeue_entity(cfs_rq, se);
> >         se->on_rq = 0;
> >         account_entity_dequeue(cfs_rq, se);
> >
> > @@ -5801,9 +5761,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> >
> >         update_cfs_group(se);
> >
> > -       if (flags & DEQUEUE_DELAYED)
> > -               clear_delayed(se);
> > -
> >         if (cfs_rq->nr_queued == 0) {
> >                 update_idle_cfs_rq_clock_pelt(cfs_rq);
> >  #ifdef CONFIG_CFS_BANDWIDTH
> > @@ -5816,15 +5773,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> >                 }
> >  #endif
> >         }
> > -
> > -       return true;
> >  }
> >
> >  static void
> > -set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
> > +set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       clear_buddies(cfs_rq, se);
> > -
> >         /* 'current' is not kept within the tree. */
> >         if (se->on_rq) {
> >                 /*
> > @@ -5833,16 +5786,12 @@ set_next_entity(struct cfs_rq *cfs_rq, s
> >                  * runqueue.
> >                  */
> >                 update_stats_wait_end_fair(cfs_rq, se);
> > -               __dequeue_entity(cfs_rq, se);
> >                 update_load_avg(cfs_rq, se, UPDATE_TG);
> > -
> > -               if (first)
> > -                       set_protect_slice(cfs_rq, se);
> >         }
> >
> >         update_stats_curr_start(cfs_rq, se);
> > -       WARN_ON_ONCE(cfs_rq->curr);
> > -       cfs_rq->curr = se;
> > +       WARN_ON_ONCE(cfs_rq->h_curr);
> > +       cfs_rq->h_curr = se;
> >
> >         /*
> >          * Track our maximum slice length, if the CPU's load is at
> > @@ -5862,23 +5811,17 @@ set_next_entity(struct cfs_rq *cfs_rq, s
> >         se->prev_sum_exec_runtime = se->sum_exec_runtime;
> >  }
> >
> > -static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
> > +static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags);
> >
> > -/*
> > - * Pick the next process, keeping these things in mind, in this order:
> > - * 1) keep things fair between processes/task groups
> > - * 2) pick the "next" process, since someone really wants that to run
> > - * 3) pick the "last" process, for cache locality
> > - * 4) do not run the "skip" process, if something else is available
> > - */
> >  static struct sched_entity *
> > -pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool protect)
> > +pick_next_entity(struct rq *rq, bool protect)
> >  {
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> >         struct sched_entity *se;
> >
> >         se = pick_eevdf(cfs_rq, protect);
> >         if (se->sched_delayed) {
> > -               dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> > +               __dequeue_task(rq, task_of(se), DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> >                 /*
> >                  * Must not reference @se again, see __block_task().
> >                  */
> > @@ -5903,13 +5846,11 @@ static void put_prev_entity(struct cfs_r
> >
> >         if (prev->on_rq) {
> >                 update_stats_wait_start_fair(cfs_rq, prev);
> > -               /* Put 'current' back into the tree. */
> > -               __enqueue_entity(cfs_rq, prev);
> >                 /* in !on_rq case, update occurred at dequeue */
> >                 update_load_avg(cfs_rq, prev, 0);
> >         }
> > -       WARN_ON_ONCE(cfs_rq->curr != prev);
> > -       cfs_rq->curr = NULL;
> > +       WARN_ON_ONCE(cfs_rq->h_curr != prev);
> > +       cfs_rq->h_curr = NULL;
> >  }
> >
> >  static void
> > @@ -6062,7 +6003,7 @@ static void __account_cfs_rq_runtime(str
> >          * if we're unable to extend our runtime we resched so that the active
> >          * hierarchy can be throttled
> >          */
> > -       if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
> > +       if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->h_curr))
> >                 resched_curr(rq_of(cfs_rq));
> >  }
> >
> > @@ -6420,7 +6361,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
> >         assert_list_leaf_cfs_rq(rq);
> >
> >         /* Determine whether we need to wake up potentially idle CPU: */
> > -       if (rq->curr == rq->idle && rq->cfs.nr_queued)
> > +       if (rq->curr == rq->idle && rq->cfs.h_nr_queued)
> >                 resched_curr(rq);
> >  }
> >
> > @@ -6761,7 +6702,7 @@ static void check_enqueue_throttle(struc
> >                 return;
> >
> >         /* an active group must be handled by the update_curr()->put() path */
> > -       if (!cfs_rq->runtime_enabled || cfs_rq->curr)
> > +       if (!cfs_rq->runtime_enabled || cfs_rq->h_curr)
> >                 return;
> >
> >         /* ensure the group is not already throttled */
> > @@ -7156,7 +7097,7 @@ static void hrtick_start_fair(struct rq
> >                         resched_curr(rq);
> >                 return;
> >         }
> > -       delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> > +       delta = (se->h_load.weight * vdelta) / NICE_0_LOAD;
> >
> >         /*
> >          * Correct for instantaneous load of other classes.
> > @@ -7256,10 +7197,8 @@ static int choose_idle_cpu(int cpu, stru
> >  }
> >
> >  static void
> > -requeue_delayed_entity(struct sched_entity *se)
> > +requeue_delayed_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -       struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > -
> >         /*
> >          * se->sched_delayed should imply: se->on_rq == 1.
> >          * Because a delayed entity is one that is still on
> > @@ -7269,19 +7208,58 @@ requeue_delayed_entity(struct sched_enti
> >         WARN_ON_ONCE(!se->on_rq);
> >
> >         if (update_entity_lag(cfs_rq, se)) {
> > -               cfs_rq->nr_queued--;
> > +               cfs_rq->h_nr_queued--;
> >                 if (se != cfs_rq->curr)
> >                         __dequeue_entity(cfs_rq, se);
> >                 place_entity(cfs_rq, se, 0);
> >                 if (se != cfs_rq->curr)
> >                         __enqueue_entity(cfs_rq, se);
> > -               cfs_rq->nr_queued++;
> > +               cfs_rq->h_nr_queued++;
> >         }
> >
> >         update_load_avg(cfs_rq, se, 0);
> >         clear_delayed(se);
> >  }
> >
> > +static unsigned long enqueue_hierarchy(struct task_struct *p, int flags)
> > +{
> > +       unsigned long weight = NICE_0_LOAD;
> > +       int task_new = !(flags & ENQUEUE_WAKEUP);
> > +       struct sched_entity *se = &p->se;
> > +       int h_nr_idle = task_has_idle_policy(p);
> > +       int h_nr_runnable = 1;
> > +
> > +       if (task_new && se->sched_delayed)
> > +               h_nr_runnable = 0;
> > +
> > +       for_each_sched_entity(se) {
> > +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +
> > +               update_curr(cfs_rq);
> > +
> > +               if (!se->on_rq) {
> > +                       enqueue_entity(cfs_rq, se, flags);
> > +               } else {
> > +                       update_load_avg(cfs_rq, se, UPDATE_TG);
> > +                       se_update_runnable(se);
> > +                       update_cfs_group(se);
> > +               }
> > +
> > +               cfs_rq->h_nr_runnable += h_nr_runnable;
> > +               cfs_rq->h_nr_queued++;
> > +               cfs_rq->h_nr_idle += h_nr_idle;
> > +
> > +               if (cfs_rq_is_idle(cfs_rq))
> > +                       h_nr_idle = 1;
> > +
> > +               weight = __calc_prop_weight(cfs_rq, se, weight);
> > +
> > +               flags = ENQUEUE_WAKEUP;
> > +       }
> > +
> > +       return weight;
> > +}
> > +
> >  /*
> >   * The enqueue_task method is called before nr_running is
> >   * increased. Here we update the fair scheduling stats and
> > @@ -7290,13 +7268,12 @@ requeue_delayed_entity(struct sched_enti
> >  static void
> >  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > -       struct cfs_rq *cfs_rq;
> > -       struct sched_entity *se = &p->se;
> > -       int h_nr_idle = task_has_idle_policy(p);
> > -       int h_nr_runnable = 1;
> > -       int task_new = !(flags & ENQUEUE_WAKEUP);
> >         int rq_h_nr_queued = rq->cfs.h_nr_queued;
> > -       u64 slice = 0;
> > +       int task_new = !(flags & ENQUEUE_WAKEUP);
> > +       struct sched_entity *se = &p->se;
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> > +       unsigned long weight;
> > +       bool curr;
> >
> >         if (task_is_throttled(p) && enqueue_throttled_task(p))
> >                 return;
> > @@ -7308,10 +7285,10 @@ enqueue_task_fair(struct rq *rq, struct
> >          * estimated utilization, before we update schedutil.
> >          */
> >         if (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))
> > -               util_est_enqueue(&rq->cfs, p);
> > +               util_est_enqueue(cfs_rq, p);
> >
> >         if (flags & ENQUEUE_DELAYED) {
> > -               requeue_delayed_entity(se);
> > +               requeue_delayed_entity(cfs_rq, se);
> >                 return;
> >         }
> >
> > @@ -7323,57 +7300,22 @@ enqueue_task_fair(struct rq *rq, struct
> >         if (p->in_iowait)
> >                 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> >
> > -       if (task_new && se->sched_delayed)
> > -               h_nr_runnable = 0;
> > -
> > -       for_each_sched_entity(se) {
> > -               if (se->on_rq) {
> > -                       if (se->sched_delayed)
> > -                               requeue_delayed_entity(se);
> > -                       break;
> > -               }
> > -               cfs_rq = cfs_rq_of(se);
> > -
> > -               /*
> > -                * Basically set the slice of group entries to the min_slice of
> > -                * their respective cfs_rq. This ensures the group can service
> > -                * its entities in the desired time-frame.
> > -                */
> > -               if (slice) {
> > -                       se->slice = slice;
> > -                       se->custom_slice = 1;
> > -               }
> > -               enqueue_entity(cfs_rq, se, flags);
> > -               slice = cfs_rq_min_slice(cfs_rq);
> > -
> > -               cfs_rq->h_nr_runnable += h_nr_runnable;
> > -               cfs_rq->h_nr_queued++;
> > -               cfs_rq->h_nr_idle += h_nr_idle;
> > -
> > -               if (cfs_rq_is_idle(cfs_rq))
> > -                       h_nr_idle = 1;
> > -
> > -               flags = ENQUEUE_WAKEUP;
> > -       }
> > -
> > -       for_each_sched_entity(se) {
> > -               cfs_rq = cfs_rq_of(se);
> > -
> > -               update_load_avg(cfs_rq, se, UPDATE_TG);
> > -               se_update_runnable(se);
> > -               update_cfs_group(se);
> > +       /*
> > +        * XXX comment on the curr thing
> > +        */
> > +       curr = (cfs_rq->curr == se);
> > +       if (curr)
> > +               place_entity(cfs_rq, se, flags);
> >
> > -               se->slice = slice;
> > -               if (se != cfs_rq->curr)
> > -                       min_vruntime_cb_propagate(&se->run_node, NULL);
> > -               slice = cfs_rq_min_slice(cfs_rq);
> > +       if (se->on_rq && se->sched_delayed)
> > +               requeue_delayed_entity(cfs_rq, se);
> >
> > -               cfs_rq->h_nr_runnable += h_nr_runnable;
> > -               cfs_rq->h_nr_queued++;
> > -               cfs_rq->h_nr_idle += h_nr_idle;
> > +       weight = enqueue_hierarchy(p, flags);
> >
> > -               if (cfs_rq_is_idle(cfs_rq))
> > -                       h_nr_idle = 1;
> > +       if (!curr) {
> > +               reweight_eevdf(cfs_rq, se, weight, false);
> > +               place_entity(cfs_rq, se, flags | ENQUEUE_QUEUED);
> > +               __enqueue_entity(cfs_rq, se);
> >         }
> >
> >         if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
> > @@ -7404,105 +7346,107 @@ enqueue_task_fair(struct rq *rq, struct
> >         hrtick_update(rq);
> >  }
> >
> > -/*
> > - * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > - * failing half-way through and resume the dequeue later.
> > - *
> > - * Returns:
> > - * -1 - dequeue delayed
> > - *  0 - dequeue throttled
> > - *  1 - dequeue complete
> > - */
> > -static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > +static void dequeue_hierarchy(struct task_struct *p, int flags)
> >  {
> > -       bool was_sched_idle = sched_idle_rq(rq);
> > +       struct sched_entity *se = &p->se;
> >         bool task_sleep = flags & DEQUEUE_SLEEP;
> >         bool task_delayed = flags & DEQUEUE_DELAYED;
> >         bool task_throttled = flags & DEQUEUE_THROTTLE;
> > -       struct task_struct *p = NULL;
> > -       int h_nr_idle = 0;
> > -       int h_nr_queued = 0;
> >         int h_nr_runnable = 0;
> > -       struct cfs_rq *cfs_rq;
> > -       u64 slice = 0;
> > +       int h_nr_idle = task_has_idle_policy(p);
> > +       bool dequeue = true;
> >
> > -       if (entity_is_task(se)) {
> > -               p = task_of(se);
> > -               h_nr_queued = 1;
> > -               h_nr_idle = task_has_idle_policy(p);
> > -               if (task_sleep || task_delayed || !se->sched_delayed)
> > -                       h_nr_runnable = 1;
> > -       }
> > +       if (task_sleep || task_delayed || !se->sched_delayed)
> > +               h_nr_runnable = 1;
> >
> >         for_each_sched_entity(se) {
> > -               cfs_rq = cfs_rq_of(se);
> > +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >
> > -               if (!dequeue_entity(cfs_rq, se, flags)) {
> > -                       if (p && &p->se == se)
> > -                               return -1;
> > +               update_curr(cfs_rq);
> >
> > -                       slice = cfs_rq_min_slice(cfs_rq);
> > -                       break;
> > +               if (dequeue) {
> > +                       dequeue_entity(cfs_rq, se, flags);
> > +                       /* Don't dequeue parent if it has other entities besides us */
> > +                       if (cfs_rq->load.weight)
> > +                               dequeue = false;
> > +               } else {
> > +                       update_load_avg(cfs_rq, se, UPDATE_TG);
> > +                       se_update_runnable(se);
> > +                       update_cfs_group(se);
> >                 }
> >
> >                 cfs_rq->h_nr_runnable -= h_nr_runnable;
> > -               cfs_rq->h_nr_queued -= h_nr_queued;
> > +               cfs_rq->h_nr_queued--;
> >                 cfs_rq->h_nr_idle -= h_nr_idle;
> >
> >                 if (cfs_rq_is_idle(cfs_rq))
> > -                       h_nr_idle = h_nr_queued;
> > +                       h_nr_idle = 1;
> >
> >                 if (throttled_hierarchy(cfs_rq) && task_throttled)
> >                         record_throttle_clock(cfs_rq);
> >
> > -               /* Don't dequeue parent if it has other entities besides us */
> > -               if (cfs_rq->load.weight) {
> > -                       slice = cfs_rq_min_slice(cfs_rq);
> > -
> > -                       /* Avoid re-evaluating load for this entity: */
> > -                       se = parent_entity(se);
> > -                       /*
> > -                        * Bias pick_next to pick a task from this cfs_rq, as
> > -                        * p is sleeping when it is within its sched_slice.
> > -                        */
> > -                       if (task_sleep && se)
> > -                               set_next_buddy(se);
> > -                       break;
> > -               }
> >                 flags |= DEQUEUE_SLEEP;
> >                 flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
> >         }
> > +}
> >
> > -       for_each_sched_entity(se) {
> > -               cfs_rq = cfs_rq_of(se);
> > +/*
> > + * The part of dequeue_task_fair() that is needed to dequeue delayed tasks.
> > + *
> > + * Returns:
> > + *   true  - dequeued
> > + *   false - delayed
> > + */
> > +static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > +{
> > +       struct sched_entity *se = &p->se;
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> > +       bool was_sched_idle = sched_idle_rq(rq);
> > +       bool task_sleep = flags & DEQUEUE_SLEEP;
> > +       bool task_delayed = flags & DEQUEUE_DELAYED;
> >
> > -               update_load_avg(cfs_rq, se, UPDATE_TG);
> > -               se_update_runnable(se);
> > -               update_cfs_group(se);
> > +       clear_buddies(cfs_rq, se);
> >
> > -               se->slice = slice;
> > -               if (se != cfs_rq->curr)
> > -                       min_vruntime_cb_propagate(&se->run_node, NULL);
> > -               slice = cfs_rq_min_slice(cfs_rq);
> > +       if (flags & DEQUEUE_DELAYED) {
> > +               WARN_ON_ONCE(!se->sched_delayed);
> > +       } else {
> > +               bool delay = task_sleep;
> > +               /*
> > +                * DELAY_DEQUEUE relies on spurious wakeups, special task
> > +                * states must not suffer spurious wakeups, excempt them.
> > +                */
> > +               if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
> > +                       delay = false;
> >
> > -               cfs_rq->h_nr_runnable -= h_nr_runnable;
> > -               cfs_rq->h_nr_queued -= h_nr_queued;
> > -               cfs_rq->h_nr_idle -= h_nr_idle;
> > +               WARN_ON_ONCE(delay && se->sched_delayed);
> >
> > -               if (cfs_rq_is_idle(cfs_rq))
> > -                       h_nr_idle = h_nr_queued;
> > +               if (sched_feat(DELAY_DEQUEUE) && delay &&
> > +                   !entity_eligible(cfs_rq, se)) {
> > +                       update_load_avg(cfs_rq_of(se), se, 0);
> 
> update_entity_lag(cfs_rq, se); is missing here. Unfortunately, adding
> it doesn't fix my regression.
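
For reference, the pre-patch dequeue_entity() quoted above called
update_entity_lag() between update_load_avg() and set_delayed() in this
branch. A toy user-space sketch of why the ordering might matter follows.
It is not kernel code: the toy_* names and the plain "avg - vruntime" lag
formula are assumptions made for illustration, and it ignores any clamping
or weighting the kernel applies.

```c
/* Toy user-space model, NOT kernel code: every name here is hypothetical. */
#include <stdbool.h>
#include <stdint.h>

struct toy_se {
	int64_t vruntime;	/* virtual runtime of the entity */
	int64_t vlag;		/* last lag snapshot: avg_vruntime - vruntime */
	bool sched_delayed;	/* entity is parked as a delayed dequeue */
};

/* Snapshot the entity's lag against the queue's average vruntime. */
static void toy_update_entity_lag(struct toy_se *se, int64_t avg_vruntime)
{
	se->vlag = avg_vruntime - se->vruntime;
}

/*
 * Model of the delayed-dequeue branch. @snapshot selects whether the lag
 * is refreshed before the entity is marked delayed (the pre-patch order)
 * or left at whatever value it held before (the order in the hunk above).
 */
static void toy_delay_dequeue(struct toy_se *se, int64_t avg_vruntime,
			      bool snapshot)
{
	if (snapshot)
		toy_update_entity_lag(se, avg_vruntime);
	se->sched_delayed = true;
}
```

In this model, skipping the snapshot leaves a stale vlag on the delayed
entity, so anything that later reads se->vlag sees a value from before
the entity ran. Whether that is what happens in the real code path is for
the patch author to confirm.
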
> 
> > +                       set_delayed(se);
> > +                       return false;
> > +               }
> > +       }
> >
> > -               if (throttled_hierarchy(cfs_rq) && task_throttled)
> > -                       record_throttle_clock(cfs_rq);
> > +       dequeue_hierarchy(p, flags);
> > +
> > +       update_entity_lag(cfs_rq, se);
> > +       if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> > +               se->deadline -= se->vruntime;
> > +               se->rel_deadline = 1;
> >         }
> > +       if (se != cfs_rq->curr)
> > +               __dequeue_entity(cfs_rq, se);
> >
> > -       sub_nr_running(rq, h_nr_queued);
> > +       sub_nr_running(rq, 1);
> >
> >         /* balance early to pull high priority tasks */
> >         if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> >                 rq->next_balance = jiffies;
> >
> > -       if (p && task_delayed) {
> > +       if (task_delayed) {
> > +               clear_delayed(se);
> > +
> >                 WARN_ON_ONCE(!task_sleep);
> >                 WARN_ON_ONCE(p->on_rq != 1);
> >
> > @@ -7514,7 +7458,7 @@ static int dequeue_entities(struct rq *r
> >                 __block_task(rq, p);
> >         }
> >
> > -       return 1;
> > +       return true;
> >  }
> >
> >  /*
> > @@ -7533,11 +7477,11 @@ static bool dequeue_task_fair(struct rq
> >                 util_est_dequeue(&rq->cfs, p);
> >
> >         util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > -       if (dequeue_entities(rq, &p->se, flags) < 0)
> > +       if (!__dequeue_task(rq, p, flags))
> >                 return false;
> >
> >         /*
> > -        * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
> > +        * Must not reference @p after __dequeue_task(DEQUEUE_DELAYED).
> >          */
> >         return true;
> >  }
> > @@ -9021,19 +8965,6 @@ static void migrate_task_rq_fair(struct
> >  static void task_dead_fair(struct task_struct *p)
> >  {
> >         struct sched_entity *se = &p->se;
> > -
> > -       if (se->sched_delayed) {
> > -               struct rq_flags rf;
> > -               struct rq *rq;
> > -
> > -               rq = task_rq_lock(p, &rf);
> > -               if (se->sched_delayed) {
> > -                       update_rq_clock(rq);
> > -                       dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> > -               }
> > -               task_rq_unlock(rq, p, &rf);
> > -       }
> > -
> >         remove_entity_load_avg(se);
> >  }
> >
> > @@ -9067,21 +8998,10 @@ static void set_cpus_allowed_fair(struct
> >         set_task_max_allowed_capacity(p);
> >  }
> >
> > -static void set_next_buddy(struct sched_entity *se)
> > -{
> > -       for_each_sched_entity(se) {
> > -               if (WARN_ON_ONCE(!se->on_rq))
> > -                       return;
> > -               if (se_is_idle(se))
> > -                       return;
> > -               cfs_rq_of(se)->next = se;
> > -       }
> > -}
> > -
> >  enum preempt_wakeup_action {
> >         PREEMPT_WAKEUP_NONE,    /* No preemption. */
> >         PREEMPT_WAKEUP_SHORT,   /* Ignore slice protection. */
> > -       PREEMPT_WAKEUP_PICK,    /* Let __pick_eevdf() decide. */
> > +       PREEMPT_WAKEUP_PICK,    /* Let pick_eevdf() decide. */
> >         PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
> >  };
> >
> > @@ -9098,7 +9018,7 @@ set_preempt_buddy(struct cfs_rq *cfs_rq,
> >         if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> >                 return false;
> >
> > -       set_next_buddy(pse);
> > +       set_next_buddy(cfs_rq, pse);
> >         return true;
> >  }
> >
> > @@ -9188,7 +9108,6 @@ static void wakeup_preempt_fair(struct r
> >         if (!sched_feat(WAKEUP_PREEMPTION))
> >                 return;
> >
> > -       find_matching_se(&se, &pse);
> >         WARN_ON_ONCE(!pse);
> >
> >         cse_is_idle = se_is_idle(se);
> > @@ -9216,8 +9135,7 @@ static void wakeup_preempt_fair(struct r
> >         if (unlikely(!normal_policy(p->policy)))
> >                 return;
> >
> > -       cfs_rq = cfs_rq_of(se);
> > -       update_curr(cfs_rq);
> > +       update_curr_fair(rq);
> >         /*
> >          * If @p has a shorter slice than current and @p is eligible, override
> >          * current's slice protection in order to allow preemption.
> > @@ -9261,18 +9179,15 @@ static void wakeup_preempt_fair(struct r
> >         }
> >
> >  pick:
> > -       nse = pick_next_entity(rq, cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT);
> > -       /* If @p has become the most eligible task, force preemption */
> > -       if (nse == pse)
> > -               goto preempt;
> > -
> > -       /*
> > -        * Because p is enqueued, nse being null can only mean that we
> > -        * dequeued a delayed task. If there are still entities queued in
> > -        * cfs, check if the next one will be p.
> > -        */
> > -       if (!nse && cfs_rq->nr_queued)
> > -               goto pick;
> > +       if (cfs_rq->h_nr_queued) {
> > +               nse = pick_next_entity(rq, preempt_action != PREEMPT_WAKEUP_SHORT);
> > +               if (unlikely(!nse))
> > +                       goto pick;
> > +
> > +               /* If @p has become the most eligible task, force preemption */
> > +               if (nse == pse)
> > +                       goto preempt;
> > +       }
> >
> >         if (sched_feat(RUN_TO_PARITY))
> >                 update_protect_slice(cfs_rq, se);
> > @@ -9291,34 +9206,25 @@ static void wakeup_preempt_fair(struct r
> >  struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
> >         __must_hold(__rq_lockp(rq))
> >  {
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> >         struct sched_entity *se;
> > -       struct cfs_rq *cfs_rq;
> >         struct task_struct *p;
> > -       bool throttled;
> >         int new_tasks;
> >
> >  again:
> > -       cfs_rq = &rq->cfs;
> > -       if (!cfs_rq->nr_queued)
> > +       if (!cfs_rq->h_nr_queued)
> >                 goto idle;
> >
> > -       throttled = false;
> > -
> > -       do {
> > -               /* Might not have done put_prev_entity() */
> > -               if (cfs_rq->curr && cfs_rq->curr->on_rq)
> > -                       update_curr(cfs_rq);
> > -
> > -               throttled |= check_cfs_rq_runtime(cfs_rq);
> > +       /* Might not have done put_prev_entity() */
> > +       if (cfs_rq->curr && cfs_rq->curr->on_rq)
> > +               update_curr(cfs_rq);
> >
> > -               se = pick_next_entity(rq, cfs_rq, true);
> > -               if (!se)
> > -                       goto again;
> > -               cfs_rq = group_cfs_rq(se);
> > -       } while (cfs_rq);
> > +       se = pick_next_entity(rq, true);
> > +       if (!se)
> > +               goto again;
> >
> >         p = task_of(se);
> > -       if (unlikely(throttled))
> > +       if (unlikely(check_cfs_rq_runtime(cfs_rq_of(se))))
> >                 task_throttle_setup_work(p);
> >         return p;
> >
> > @@ -9353,7 +9259,7 @@ void fair_server_init(struct rq *rq)
> >  static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next)
> >  {
> >         struct sched_entity *se = &prev->se;
> > -       struct cfs_rq *cfs_rq;
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> >         struct sched_entity *nse = NULL;
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > @@ -9363,7 +9269,7 @@ static void put_prev_task_fair(struct rq
> >
> >         while (se) {
> >                 cfs_rq = cfs_rq_of(se);
> > -               if (!nse || cfs_rq->curr)
> > +               if (!nse || cfs_rq->h_curr)
> >                         put_prev_entity(cfs_rq, se);
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >                 if (nse) {
> > @@ -9382,6 +9288,14 @@ static void put_prev_task_fair(struct rq
> >  #endif
> >                 se = parent_entity(se);
> >         }
> > +
> > +       /* Put 'current' back into the tree. */
> > +       cfs_rq = &rq->cfs;
> > +       se = &prev->se;
> > +       WARN_ON_ONCE(cfs_rq->curr != se);
> > +       cfs_rq->curr = NULL;
> > +       if (se->on_rq)
> > +               __enqueue_entity(cfs_rq, se);
> >  }
> >
> >  /*
> > @@ -9390,8 +9304,8 @@ static void put_prev_task_fair(struct rq
> >  static void yield_task_fair(struct rq *rq)
> >  {
> >         struct task_struct *curr = rq->donor;
> > -       struct cfs_rq *cfs_rq = task_cfs_rq(curr);
> >         struct sched_entity *se = &curr->se;
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> >
> >         /*
> >          * Are we the only task in the tree?
> > @@ -9432,11 +9346,11 @@ static bool yield_to_task_fair(struct rq
> >         struct sched_entity *se = &p->se;
> >
> >         /* !se->on_rq also covers throttled task */
> > -       if (!se->on_rq)
> > +       if (!se->on_rq || se->sched_delayed)
> >                 return false;
> >
> >         /* Tell the scheduler that we'd really like se to run next. */
> > -       set_next_buddy(se);
> > +       set_next_buddy(&task_rq(p)->cfs, se);
> >
> >         yield_task_fair(rq);
> >
> > @@ -9762,15 +9676,10 @@ static inline long migrate_degrades_loca
> >   */
> >  static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_cpu)
> >  {
> > -       struct cfs_rq *dst_cfs_rq;
> > +       struct cfs_rq *dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
> >
> > -#ifdef CONFIG_FAIR_GROUP_SCHED
> > -       dst_cfs_rq = task_group(p)->cfs_rq[dest_cpu];
> > -#else
> > -       dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
> > -#endif
> > -       if (sched_feat(PLACE_LAG) && dst_cfs_rq->nr_queued &&
> > -           !entity_eligible(task_cfs_rq(p), &p->se))
> > +       if (sched_feat(PLACE_LAG) && dst_cfs_rq->h_nr_queued &&
> > +           !entity_eligible(&task_rq(p)->cfs, &p->se))
> >                 return 1;
> >
> >         return 0;
> > @@ -10240,7 +10149,7 @@ static void update_cfs_rq_h_load(struct
> >         while ((se = READ_ONCE(cfs_rq->h_load_next)) != NULL) {
> >                 load = cfs_rq->h_load;
> >                 load = div64_ul(load * se->avg.load_avg,
> > -                       cfs_rq_load_avg(cfs_rq) + 1);
> > +                               cfs_rq_load_avg(cfs_rq) + 1);
> >                 cfs_rq = group_cfs_rq(se);
> >                 cfs_rq->h_load = load;
> >                 cfs_rq->last_h_load_update = now;
> > @@ -13459,7 +13368,7 @@ static inline void task_tick_core(struct
> >          * MIN_NR_TASKS_DURING_FORCEIDLE - 1 tasks and use that to check
> >          * if we need to give up the CPU.
> >          */
> > -       if (rq->core->core_forceidle_count && rq->cfs.nr_queued == 1 &&
> > +       if (rq->core->core_forceidle_count && rq->cfs.h_nr_queued == 1 &&
> >             __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE))
> >                 resched_curr(rq);
> >  }
> > @@ -13668,30 +13577,8 @@ bool cfs_prio_less(const struct task_str
> >
> >         WARN_ON_ONCE(task_rq(b)->core != rq->core);
> >
> > -#ifdef CONFIG_FAIR_GROUP_SCHED
> > -       /*
> > -        * Find an se in the hierarchy for tasks a and b, such that the se's
> > -        * are immediate siblings.
> > -        */
> > -       while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
> > -               int sea_depth = sea->depth;
> > -               int seb_depth = seb->depth;
> > -
> > -               if (sea_depth >= seb_depth)
> > -                       sea = parent_entity(sea);
> > -               if (sea_depth <= seb_depth)
> > -                       seb = parent_entity(seb);
> > -       }
> > -
> > -       se_fi_update(sea, rq->core->core_forceidle_seq, in_fi);
> > -       se_fi_update(seb, rq->core->core_forceidle_seq, in_fi);
> > -
> > -       cfs_rqa = sea->cfs_rq;
> > -       cfs_rqb = seb->cfs_rq;
> > -#else /* !CONFIG_FAIR_GROUP_SCHED: */
> >         cfs_rqa = &task_rq(a)->cfs;
> >         cfs_rqb = &task_rq(b)->cfs;
> > -#endif /* !CONFIG_FAIR_GROUP_SCHED */
> >
> >         /*
> >          * Find delta after normalizing se's vruntime with its cfs_rq's
> > @@ -13729,14 +13616,20 @@ static inline void task_tick_core(struct
> >   */
> >  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >  {
> > -       struct cfs_rq *cfs_rq;
> >         struct sched_entity *se = &curr->se;
> > +       unsigned long weight = NICE_0_LOAD;
> > +       struct cfs_rq *cfs_rq;
> >
> >         for_each_sched_entity(se) {
> >                 cfs_rq = cfs_rq_of(se);
> >                 entity_tick(cfs_rq, se, queued);
> > +
> > +               weight = __calc_prop_weight(cfs_rq, se, weight);
> >         }
> >
> > +       se = &curr->se;
> > +       reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> > +
> >         if (queued)
> >                 return;
> >
> > @@ -13772,7 +13665,7 @@ prio_changed_fair(struct rq *rq, struct
> >         if (p->prio == oldprio)
> >                 return;
> >
> > -       if (rq->cfs.nr_queued == 1)
> > +       if (rq->cfs.h_nr_queued == 1)
> >                 return;
> >
> >         /*
> > @@ -13901,29 +13794,40 @@ static void switched_to_fair(struct rq *
> >         }
> >  }
> >
> > -/*
> > - * Account for a task changing its policy or group.
> > - *
> > - * This routine is mostly called to set cfs_rq->curr field when a task
> > - * migrates between groups/classes.
> > - */
> >  static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
> >  {
> >         struct sched_entity *se = &p->se;
> > +       struct cfs_rq *cfs_rq = &rq->cfs;
> > +       unsigned long weight = NICE_0_LOAD;
> > +       bool on_rq = se->on_rq;
> > +
> > +       clear_buddies(cfs_rq, se);
> > +
> > +       if (on_rq)
> > +               __dequeue_entity(cfs_rq, se);
> >
> >         for_each_sched_entity(se) {
> > -               struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > +               cfs_rq = cfs_rq_of(se);
> >
> > -               if (IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) &&
> > -                   first && cfs_rq->curr)
> > -                       break;
> > +               if (!IS_ENABLED(CONFIG_FAIR_GROUP_SCHED) ||
> > +                   !first || !cfs_rq->h_curr)
> > +                       set_next_entity(cfs_rq, se);
> >
> > -               set_next_entity(cfs_rq, se, first);
> >                 /* ensure bandwidth has been allocated on our new cfs_rq */
> >                 account_cfs_rq_runtime(cfs_rq, 0);
> > +
> > +               if (on_rq)
> > +                       weight = __calc_prop_weight(cfs_rq, se, weight);
> >         }
> >
> >         se = &p->se;
> > +       cfs_rq->curr = se;
> > +
> > +       if (on_rq) {
> > +               reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> > +               if (first)
> > +                       set_protect_slice(cfs_rq, se);
> > +       }
> >
> >         if (task_on_rq_queued(p)) {
> >                 /*
> > @@ -14054,17 +13958,8 @@ void unregister_fair_sched_group(struct
> >                 struct sched_entity *se = tg->se[cpu];
> >                 struct rq *rq = cpu_rq(cpu);
> >
> > -               if (se) {
> > -                       if (se->sched_delayed) {
> > -                               guard(rq_lock_irqsave)(rq);
> > -                               if (se->sched_delayed) {
> > -                                       update_rq_clock(rq);
> > -                                       dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> > -                               }
> > -                               list_del_leaf_cfs_rq(cfs_rq);
> > -                       }
> > +               if (se)
> >                         remove_entity_load_avg(se);
> > -               }
> >
> >                 /*
> >                  * Only empty task groups can be destroyed; so we can speculatively
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -206,7 +206,7 @@ ___update_load_sum(u64 now, struct sched
> >         /*
> >          * running is a subset of runnable (weight) so running can't be set if
> >          * runnable is clear. But there are some corner cases where the current
> > -        * se has been already dequeued but cfs_rq->curr still points to it.
> > +        * se has been already dequeued but cfs_rq->h_curr still points to it.
> >          * This means that weight will be 0 but not running for a sched_entity
> >          * but also for a cfs_rq if the latter becomes idle. As an example,
> >          * this happens during sched_balance_newidle() which calls
> > @@ -307,7 +307,7 @@ int __update_load_avg_blocked_se(u64 now
> >  int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >         if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> > -                               cfs_rq->curr == se)) {
> > +                               cfs_rq->h_curr == se)) {
> >
> >                 ___update_load_avg(&se->avg, se_weight(se));
> >                 cfs_se_util_change(&se->avg);
> > @@ -323,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, st
> >         if (___update_load_sum(now, &cfs_rq->avg,
> >                                 scale_load_down(cfs_rq->load.weight),
> >                                 cfs_rq->h_nr_runnable,
> > -                               cfs_rq->curr != NULL)) {
> > +                               cfs_rq->h_curr != NULL)) {
> >
> >                 ___update_load_avg(&cfs_rq->avg, 1);
> >                 trace_pelt_cfs_tp(cfs_rq);
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -528,21 +528,8 @@ struct task_group {
> >
> >  };
> >
> > -#ifdef CONFIG_GROUP_SCHED_WEIGHT
> >  #define ROOT_TASK_GROUP_LOAD   NICE_0_LOAD
> >
> > -/*
> > - * A weight of 0 or 1 can cause arithmetics problems.
> > - * A weight of a cfs_rq is the sum of weights of which entities
> > - * are queued on this cfs_rq, so a weight of a entity should not be
> > - * too large, so as the shares value of a task group.
> > - * (The default weight is 1024 - so there's no practical
> > - *  limitation from this.)
> > - */
> > -#define MIN_SHARES             (1UL <<  1)
> > -#define MAX_SHARES             (1UL << 18)
> > -#endif
> > -
> >  typedef int (*tg_visitor)(struct task_group *, void *);
> >
> >  extern int walk_tg_tree_from(struct task_group *from,
> > @@ -629,6 +616,17 @@ static inline bool cfs_task_bw_constrain
> >
> >  #endif /* !CONFIG_CGROUP_SCHED */
> >
> > +/*
> > + * A weight of 0 or 1 can cause arithmetics problems.
> > + * A weight of a cfs_rq is the sum of weights of which entities
> > + * are queued on this cfs_rq, so a weight of a entity should not be
> > + * too large, so as the shares value of a task group.
> > + * (The default weight is 1024 - so there's no practical
> > + *  limitation from this.)
> > + */
> > +#define MIN_SHARES             (1UL <<  1)
> > +#define MAX_SHARES             (1UL << 18)
> > +
> >  extern void unregister_rt_sched_group(struct task_group *tg);
> >  extern void free_rt_sched_group(struct task_group *tg);
> >  extern int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent);
> > @@ -707,6 +705,7 @@ struct cfs_rq {
> >         /*
> >          * CFS load tracking
> >          */
> > +       struct sched_entity     *h_curr;
> >         struct sched_avg        avg;
> >  #ifndef CONFIG_64BIT
> >         u64                     last_update_time_copy;
> > @@ -2509,6 +2508,7 @@ extern const u32          sched_prio_to_wmult[40
> >  #define ENQUEUE_MIGRATED       0x00040000
> >  #define ENQUEUE_INITIAL                0x00080000
> >  #define ENQUEUE_RQ_SELECTED    0x00100000
> > +#define ENQUEUE_QUEUED         0x00200000
> >
> >  #define RETRY_TASK             ((void *)-1UL)
> >
> >
> >


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-20 16:32     ` Vincent Guittot
@ 2026-05-21  2:57       ` K Prateek Nayak
  2026-05-21  7:56         ` Vincent Guittot
  2026-05-21 10:31       ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-21  2:57 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, qyousef

Hello Vincent,

On 5/20/2026 10:02 PM, Vincent Guittot wrote:
> I finally found the root cause of regression: the update of entity lag happened
> after the task has been dequeued which screwed update_entity_lag():

Great catch!

> 
> update_entity_lag must be called after updating curr and cfs_rq and before
> clearing on_rq
> 
> With the fix below I'm back to original hackbench figures and maybe even a bit better.
> I haven't checked scheduling latency yet
> 
> ---
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 77d0e1937f2c..32fe57004f27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5753,6 +5753,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	update_stats_dequeue_fair(cfs_rq, se, flags);
>  
> +	if (entity_is_task(se))
> +		update_entity_lag(&rq_of(cfs_rq)->cfs, se);
> +
>  	se->on_rq = 0;

Ah! The curr->on_rq indicator changes here and we'll start ignoring it
for avg_vruntime() calculation afterwards! Makes sense.
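
To illustrate the ordering problem, here is a toy model (not kernel code; the
toy_* names and the plain weighted average are assumptions made for the
sketch). The lag is taken against an average that only counts entities whose
on_rq is still set, so clearing on_rq first changes the result:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduling entity; not the kernel struct. */
struct toy_se {
	int64_t vruntime;
	int64_t weight;
	bool on_rq;
};

/* Weighted average vruntime over the entities that are still on_rq. */
static int64_t toy_avg_vruntime(const struct toy_se *all, size_t nr)
{
	int64_t sum_wv = 0, sum_w = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!all[i].on_rq)
			continue;
		sum_wv += all[i].weight * all[i].vruntime;
		sum_w += all[i].weight;
	}
	return sum_w ? sum_wv / sum_w : 0;
}

/* Lag of one entity: average vruntime minus its own vruntime. */
static int64_t toy_lag(const struct toy_se *all, size_t nr,
		       const struct toy_se *se)
{
	return toy_avg_vruntime(all, nr) - se->vruntime;
}

/*
 * Dequeue entity idx and return the lag that was recorded for it.
 * lag_first selects whether the lag is taken before or after on_rq
 * is cleared.
 */
static int64_t toy_dequeue_lag(struct toy_se *all, size_t nr, size_t idx,
			       bool lag_first)
{
	int64_t lag;

	if (lag_first) {
		lag = toy_lag(all, nr, &all[idx]);
		all[idx].on_rq = false;
	} else {
		all[idx].on_rq = false;
		lag = toy_lag(all, nr, &all[idx]);
	}
	return lag;
}
```

With three equal-weight entities at vruntime 100, 200 and 300, dequeueing the
last one records a lag of -100 when the lag is taken first, and -150 when
on_rq is cleared first, because the entity no longer contributes to the
average.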

>  	account_entity_dequeue(cfs_rq, se);
>  
> @@ -7423,6 +7426,7 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  		if (sched_feat(DELAY_DEQUEUE) && delay &&
>  		    !entity_eligible(cfs_rq, se)) {

Does this need an update_curr() before checking entity_eligible()?

Currently these bits reside in dequeue_entity() and are always done after
an update_curr(cfs_rq), but here we may need a:

    update_curr(task_cfs_rq(p)); /* to catch up h_curr's vruntime */

Just doing it for task_cfs_rq(p) should be fine since we only have to
catch up curr's vruntime - sum_w_vruntime and sum_weight at root cfs_rq
should be stable for all the tasks on rb-tree.

>  			update_load_avg(cfs_rq_of(se), se, 0);
> +			update_entity_lag(cfs_rq, se);
>  			set_delayed(se);
>  			return false;
>  		}
> @@ -7430,7 +7434,6 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  	dequeue_hierarchy(p, flags);
>  
> -	update_entity_lag(cfs_rq, se);

If we decide to do an update_curr(task_cfs_rq(p)) at the beginning of
__dequeue_task(), we can just move this to above dequeue_hierarchy()
before se->on_rq indicators are modified.

Thoughts?

>  	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
>  		se->deadline -= se->vruntime;
>  		se->rel_deadline = 1;

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21  2:57       ` K Prateek Nayak
@ 2026-05-21  7:56         ` Vincent Guittot
  0 siblings, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-21  7:56 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, mingo, longman, chenridong, juri.lelli,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef

On Thu, 21 May 2026 at 04:57, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 5/20/2026 10:02 PM, Vincent Guittot wrote:
> > I finally found the root cause of regression: the update of entity lag happened
> > after the task has been dequeued which screwed update_entity_lag():
>
> Great catch!
>
> >
> > update_entity_lag must be called after updating curr and cfs_rq and before
> > clearing on_rq
> >
> > With the fix below I'm back to original hackbench figures and maybe even a bit better.
> > I haven't checked scheduling latency yet
> >
> > ---
> >  kernel/sched/fair.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 77d0e1937f2c..32fe57004f27 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5753,6 +5753,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >
> >       update_stats_dequeue_fair(cfs_rq, se, flags);
> >
> > +     if (entity_is_task(se))
> > +             update_entity_lag(&rq_of(cfs_rq)->cfs, se);
> > +
> >       se->on_rq = 0;
>
> Ah! The curr->on_rq indicator changes here and we'll start ignoring it
> for avg_vruntime() calculation afterwards! Makes sense.
>
> >       account_entity_dequeue(cfs_rq, se);
> >
> > @@ -7423,6 +7426,7 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >               if (sched_feat(DELAY_DEQUEUE) && delay &&
> >                   !entity_eligible(cfs_rq, se)) {
>
> Does this need a update_curr() before checking entity_eligible()?

Yes, we need to update curr first.

>
> Currently these bits reside in dequeue_entity() and is always done after
> a update_curr(cfs_rq) but here we may need a:
>
>     update_curr(task_cfs_rq(p)); /* to catch up h_curr's vruntime */
>
> Just doing it for task_cfs_rq(p) should be fine since we only have to
> catch up curr's vruntime - sum_w_vruntime and sum_weight at root cfs_rq
> should be stable for all the tasks on rb-tree.
>
> >                       update_load_avg(cfs_rq_of(se), se, 0);
> > +                     update_entity_lag(cfs_rq, se);
> >                       set_delayed(se);
> >                       return false;
> >               }
> > @@ -7430,7 +7434,6 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >
> >       dequeue_hierarchy(p, flags);
> >
> > -     update_entity_lag(cfs_rq, se);
>
> If we decide to do a update_curr(task_cfs_rq(p)) at the beginning of
> __dequeue_task(), we can just move this to above dequeue_hierarchy()
> before se->on_rq indicators are modified.
>
> Thoughts?

Yes, it's doable. We will have a spurious update_curr() in
dequeue_hierarchy(), but that will be a nop because delta_exec will be zero.
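
A minimal sketch of why the second call is harmless, assuming a simplified
update_curr() that only advances exec_start and vruntime (toy_curr and
toy_update_curr are hypothetical names, and weight scaling is left out):

```c
#include <stdint.h>

/* Hypothetical, simplified view of the running entity. */
struct toy_curr {
	uint64_t exec_start;	/* timestamp of the last accounting */
	uint64_t vruntime;	/* unweighted here, for simplicity */
};

/* Charge the time since exec_start; returns the delta that was charged. */
static uint64_t toy_update_curr(struct toy_curr *curr, uint64_t now)
{
	int64_t delta_exec = (int64_t)(now - curr->exec_start);

	/* Nothing ran since the last update: bail out without side effects. */
	if (delta_exec <= 0)
		return 0;

	curr->exec_start = now;
	curr->vruntime += (uint64_t)delta_exec;
	return (uint64_t)delta_exec;
}
```

Calling it twice with the same 'now' charges the runtime once; the second
call sees a zero delta and returns early.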

With the flat hierarchy, vruntime and deadline are no longer linked to the
cfs hierarchy. One possibility would be to move the update of vruntime
and deadline outside of it, but this is more complex because of delta_exec.

The same applies to dl_server.


>
> >       if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> >               se->deadline -= se->vruntime;
> >               se->rel_deadline = 1;
>
> --
> Thanks and Regards,
> Prateek
>


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-20 16:32     ` Vincent Guittot
  2026-05-21  2:57       ` K Prateek Nayak
@ 2026-05-21 10:31       ` Peter Zijlstra
  2026-05-21 12:13         ` Vincent Guittot
                           ` (2 more replies)
  1 sibling, 3 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-21 10:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Wed, May 20, 2026 at 06:32:11PM +0200, Vincent Guittot wrote:

> I finally found the root cause of regression: the update of entity lag happened
> after the task has been dequeued which screwed update_entity_lag():
> 
> update_entity_lag must be called after updating curr and cfs_rq and before
> clearing on_rq
> 
> With the fix below I'm back to original hackbench figures and maybe even a bit better.
> I haven't checked scheduling latency yet
> 
> ---
>  kernel/sched/fair.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 77d0e1937f2c..32fe57004f27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5753,6 +5753,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	update_stats_dequeue_fair(cfs_rq, se, flags);
>  
> +	if (entity_is_task(se))
> +		update_entity_lag(&rq_of(cfs_rq)->cfs, se);
> +
>  	se->on_rq = 0;
>  	account_entity_dequeue(cfs_rq, se);
>  
> @@ -7423,6 +7426,7 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  		if (sched_feat(DELAY_DEQUEUE) && delay &&
>  		    !entity_eligible(cfs_rq, se)) {
>  			update_load_avg(cfs_rq_of(se), se, 0);
> +			update_entity_lag(cfs_rq, se);
>  			set_delayed(se);
>  			return false;
>  		}
> @@ -7430,7 +7434,6 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  	dequeue_hierarchy(p, flags);
>  
> -	update_entity_lag(cfs_rq, se);
>  	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
>  		se->deadline -= se->vruntime;
>  		se->rel_deadline = 1;

Argh!!! Thank you! I've gone blind staring at all this :/

Would it not be simpler to just move the update_entity_lag() call up a
bit, like so?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
 
 	clear_buddies(cfs_rq, se);
 
+	update_curr(cfs_rq);
+	update_entity_lag(cfs_rq, se);
+
 	if (flags & DEQUEUE_DELAYED) {
 		WARN_ON_ONCE(!se->sched_delayed);
 	} else {
@@ -8022,7 +8025,6 @@ static bool __dequeue_task(struct rq *rq
 
 	dequeue_hierarchy(p, flags);
 
-	update_entity_lag(cfs_rq, se);
 	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
 		se->deadline -= se->vruntime;
 		se->rel_deadline = 1;


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 10:31       ` Peter Zijlstra
@ 2026-05-21 12:13         ` Vincent Guittot
  2026-05-21 13:29           ` Peter Zijlstra
  2026-05-21 13:21         ` Peter Zijlstra
  2026-05-21 13:39         ` Peter Zijlstra
  2 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2026-05-21 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, 21 May 2026 at 12:31, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 20, 2026 at 06:32:11PM +0200, Vincent Guittot wrote:
>
> > I finally found the root cause of regression: the update of entity lag happened
> > after the task has been dequeued which screwed update_entity_lag():
> >
> > update_entity_lag must be called after updating curr and cfs_rq and before
> > clearing on_rq
> >
> > With the fix below I'm back to original hackbench figures and maybe even a bit better.
> > I haven't checked scheduling latency yet
> >
> > ---
> >  kernel/sched/fair.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 77d0e1937f2c..32fe57004f27 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5753,6 +5753,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >
> >       update_stats_dequeue_fair(cfs_rq, se, flags);
> >
> > +     if (entity_is_task(se))
> > +             update_entity_lag(&rq_of(cfs_rq)->cfs, se);
> > +
> >       se->on_rq = 0;
> >       account_entity_dequeue(cfs_rq, se);
> >
> > @@ -7423,6 +7426,7 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >               if (sched_feat(DELAY_DEQUEUE) && delay &&
> >                   !entity_eligible(cfs_rq, se)) {
> >                       update_load_avg(cfs_rq_of(se), se, 0);
> > +                     update_entity_lag(cfs_rq, se);
> >                       set_delayed(se);
> >                       return false;
> >               }
> > @@ -7430,7 +7434,6 @@ static bool __dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> >
> >       dequeue_hierarchy(p, flags);
> >
> > -     update_entity_lag(cfs_rq, se);
> >       if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> >               se->deadline -= se->vruntime;
> >               se->rel_deadline = 1;
>
> Argh!!! Thank you! I've gone blind staring at all this :/
>
> Would it not be simpler to just move the update_entity_lag() call up a
> bit, like so?
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
>
>         clear_buddies(cfs_rq, se);
>
> +       update_curr(cfs_rq);

I agree it's simpler, although we will call update_curr() twice for one
level; the 2nd call should be a nop because delta_exec will be zero.

Prateek proposed update_curr(task_cfs_rq(p)). Using task_cfs_rq(p)
ensures that we keep the same ordering as for_each_sched_entity().


> +       update_entity_lag(cfs_rq, se);
> +
>         if (flags & DEQUEUE_DELAYED) {
>                 WARN_ON_ONCE(!se->sched_delayed);
>         } else {
> @@ -8022,7 +8025,6 @@ static bool __dequeue_task(struct rq *rq
>
>         dequeue_hierarchy(p, flags);
>
> -       update_entity_lag(cfs_rq, se);
>         if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
>                 se->deadline -= se->vruntime;
>                 se->rel_deadline = 1;


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 10:31       ` Peter Zijlstra
  2026-05-21 12:13         ` Vincent Guittot
@ 2026-05-21 13:21         ` Peter Zijlstra
  2026-05-21 13:39         ` Peter Zijlstra
  2 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-21 13:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, May 21, 2026 at 12:31:17PM +0200, Peter Zijlstra wrote:
> On Wed, May 20, 2026 at 06:32:11PM +0200, Vincent Guittot wrote:
> 
> > I finally found the root cause of regression: the update of entity lag happened
> > after the task has been dequeued which screwed update_entity_lag():
> > 
> > update_entity_lag must be called after updating curr and cfs_rq and before
> > clearing on_rq
> > 
> > With the fix below I'm back to original hackbench figures and maybe even a bit better.
> > I haven't checked scheduling latency yet

I see a very slight hackbench regression on the high end, but meh. The
latency-slice test seems to have slightly improved max values, but this
isn't the most stable of things.


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 12:13         ` Vincent Guittot
@ 2026-05-21 13:29           ` Peter Zijlstra
  2026-05-21 13:44             ` Vincent Guittot
  2026-05-21 14:01             ` Peter Zijlstra
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-21 13:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, May 21, 2026 at 02:13:48PM +0200, Vincent Guittot wrote:

> > Would it not be simpler to just move the update_entity_lag() call up a
> > bit, like so?
> >
> > ---
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
> >
> >         clear_buddies(cfs_rq, se);
> >
> > +       update_curr(cfs_rq);
> 
> I agree it's simpler although we will call update_curr twice for one
> level, but the 2nd call should be nop because of delta_exec being null
> 
> Prateek proposed update_curr(task_cfs_rq(p)). Using task_cfs_rq(p)
> will ensure that we keep the same ordering as for_each_sched_entity

Given:

    R
    |
    G
    |
    t

Then task_cfs_rq() will be G's cfs_rq, while cfs_rq is R's cfs_rq.

Since all the actual running happens inside R, this is what is required
by update_entity_lag().

Doing update_curr(task_cfs_rq()) here doesn't make sense.

I'm not sure I see a way in which running them out of order hurts
anything.
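
A toy sketch of the R/G/t picture above, with hypothetical toy_* types that
only model the parent links (not the kernel structures): the task's own
cfs_rq is G's, while the queue everything actually runs on is R's.

```c
#include <stddef.h>

/* Hypothetical stand-ins; only the shape of the hierarchy is modelled. */
struct toy_cfs_rq {
	const char *name;
};

struct toy_se {
	struct toy_cfs_rq *cfs_rq;	/* queue this entity is accounted on */
	struct toy_se *parent;		/* group entity one level up, NULL at the top */
};

/* Analogue of task_cfs_rq(): the queue of the task's own group. */
static struct toy_cfs_rq *toy_task_cfs_rq(struct toy_se *task)
{
	return task->cfs_rq;
}

/* Analogue of &rq->cfs: walk the parent links up to the root queue. */
static struct toy_cfs_rq *toy_root_cfs_rq(struct toy_se *se)
{
	while (se->parent)
		se = se->parent;
	return se->cfs_rq;
}
```

For a task t below a group G below the root R, toy_task_cfs_rq(t) yields G's
queue and toy_root_cfs_rq(t) yields R's queue; the two only coincide when the
task sits directly in the root group.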

> > +       update_entity_lag(cfs_rq, se);
> > +
> >         if (flags & DEQUEUE_DELAYED) {
> >                 WARN_ON_ONCE(!se->sched_delayed);
> >         } else {
> > @@ -8022,7 +8025,6 @@ static bool __dequeue_task(struct rq *rq
> >
> >         dequeue_hierarchy(p, flags);
> >
> > -       update_entity_lag(cfs_rq, se);
> >         if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> >                 se->deadline -= se->vruntime;
> >                 se->rel_deadline = 1;


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 10:31       ` Peter Zijlstra
  2026-05-21 12:13         ` Vincent Guittot
  2026-05-21 13:21         ` Peter Zijlstra
@ 2026-05-21 13:39         ` Peter Zijlstra
  2026-05-21 13:56           ` Vincent Guittot
  2 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-21 13:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, May 21, 2026 at 12:31:17PM +0200, Peter Zijlstra wrote:

> Would it not be simpler to just move the update_entity_lag() call up a
> bit, like so?
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
>  
>  	clear_buddies(cfs_rq, se);
>  
> +	update_curr(cfs_rq);
> +	update_entity_lag(cfs_rq, se);
> +
>  	if (flags & DEQUEUE_DELAYED) {
>  		WARN_ON_ONCE(!se->sched_delayed);
>  	} else {
> @@ -8022,7 +8025,6 @@ static bool __dequeue_task(struct rq *rq
>  
>  	dequeue_hierarchy(p, flags);
>  
> -	update_entity_lag(cfs_rq, se);
>  	if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
>  		se->deadline -= se->vruntime;
>  		se->rel_deadline = 1;

FWIW, I pushed out a new queue:sched/flat with this on. I had to rebase
because of: 6d2051403d6c ("sched/fair: Update util_est after updating
util_avg during dequeue"), hopefully I didn't wreck that :/


* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 13:29           ` Peter Zijlstra
@ 2026-05-21 13:44             ` Vincent Guittot
  2026-05-21 14:01             ` Peter Zijlstra
  1 sibling, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-21 13:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, 21 May 2026 at 15:29, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 21, 2026 at 02:13:48PM +0200, Vincent Guittot wrote:
>
> > > Would it not be simpler to just move the update_entity_lag() call up a
> > > bit, like so?
> > >
> > > ---
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
> > >
> > >         clear_buddies(cfs_rq, se);
> > >
> > > +       update_curr(cfs_rq);
> >
> > I agree it's simpler although we will call update_curr twice for one
> > level, but the 2nd call should be nop because of delta_exec being null
> >
> > Prateek proposed update_curr(task_cfs_rq(p)). Using task_cfs_rq(p)
> > will ensure that we keep the same ordering as for_each_sched_entity
>
> Given:
>
>     R
>     |
>     G
>     |
>     t
>
> Then task_cfs_rq() will be G's cfs_rq, while cfs_rq is R's cfs_rq.

Yes, but update_curr() moves to R's cfs_rq anyway before updating
vruntime, deadline and dl_server.

>
> Since all the actual running happens inside R, this is what is required
> by update_entity_lag().

In other places, like task_tick_fair(), we follow the G-then-R order, and
vruntime and deadline are updated while updating G.

>
> Doing update_curr(task_cfs_rq()) here doesn't make sense.
>
> I'm not sure I see a way in which running them out of order hurts
> anything.

I was thinking of use cases that involve throttling, but I haven't
gone deep into the analysis.

>
> > > +       update_entity_lag(cfs_rq, se);
> > > +
> > >         if (flags & DEQUEUE_DELAYED) {
> > >                 WARN_ON_ONCE(!se->sched_delayed);
> > >         } else {
> > > @@ -8022,7 +8025,6 @@ static bool __dequeue_task(struct rq *rq
> > >
> > >         dequeue_hierarchy(p, flags);
> > >
> > > -       update_entity_lag(cfs_rq, se);
> > >         if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> > >                 se->deadline -= se->vruntime;
> > >                 se->rel_deadline = 1;

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 13:39         ` Peter Zijlstra
@ 2026-05-21 13:56           ` Vincent Guittot
  0 siblings, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2026-05-21 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, 21 May 2026 at 15:39, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 21, 2026 at 12:31:17PM +0200, Peter Zijlstra wrote:
>
> > Would it not be simpler to just move the update_entity_lag() call up a
> > bit, like so?
> >
> > ---
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
> >
> >       clear_buddies(cfs_rq, se);
> >
> > +     update_curr(cfs_rq);
> > +     update_entity_lag(cfs_rq, se);
> > +
> >       if (flags & DEQUEUE_DELAYED) {
> >               WARN_ON_ONCE(!se->sched_delayed);
> >       } else {
> > @@ -8022,7 +8025,6 @@ static bool __dequeue_task(struct rq *rq
> >
> >       dequeue_hierarchy(p, flags);
> >
> > -     update_entity_lag(cfs_rq, se);
> >       if (sched_feat(PLACE_REL_DEADLINE) && !task_sleep) {
> >               se->deadline -= se->vruntime;
> >               se->rel_deadline = 1;
>
> FWIW, I pushed out a new queue:sched/flat with this on. I had to rebase
> because of: 6d2051403d6c ("sched/fair: Update util_est after updating
> util_avg during dequeue"), hopefully I didn't wreck that :/

This looks good to me

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-21 13:29           ` Peter Zijlstra
  2026-05-21 13:44             ` Vincent Guittot
@ 2026-05-21 14:01             ` Peter Zijlstra
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-21 14:01 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, longman, chenridong, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, tj, hannes, mkoutny, cgroups,
	linux-kernel, jstultz, kprateek.nayak, qyousef

On Thu, May 21, 2026 at 03:29:01PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2026 at 02:13:48PM +0200, Vincent Guittot wrote:
> 
> > > Would it not be simpler to just move the update_entity_lag() call up a
> > > bit, like so?
> > >
> > > ---
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7999,6 +7999,9 @@ static bool __dequeue_task(struct rq *rq
> > >
> > >         clear_buddies(cfs_rq, se);
> > >
> > > +       update_curr(cfs_rq);
> > 
> > I agree it's simpler although we will call update_curr twice for one
> > level, but the 2nd call should be nop because of delta_exec being null
> > 
> > Prateek proposed update_curr(task_cfs_rq(p)). Using task_cfs_rq(p)
> > will ensure that we keep the same ordering as for_each_sched_entity
> 
> Given:
> 
>     R
>     |
>     G
>     |
>     t
> 
> Then task_cfs_rq() will be G's cfs_rq, while cfs_rq is R's cfs_rq.
> 
> Since all the actual running happens inside R, this is what is required
> by update_entity_lag().
> 
> Doing update_curr(task_cfs_rq()) here doesn't make sense.
> 
> I'm not sure I see a way in which running them out of order hurts
> anything.

Bah, I'm so full of fail. So update_curr() takes ->h_curr, which for R
would be G's se, not t. So yeah, Prateek is right and I should stop
trying to do more than one thing at a time :-(

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
                     ` (2 preceding siblings ...)
  2026-05-19 10:38   ` Vincent Guittot
@ 2026-05-26  7:53   ` Zhang Qiao
  2026-05-26  9:15     ` K Prateek Nayak
  3 siblings, 1 reply; 64+ messages in thread
From: Zhang Qiao @ 2026-05-26  7:53 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef,
	Hui Tang

Hi Peter,

On 2026/5/11 19:31, Peter Zijlstra wrote:

> @@ -13729,14 +13616,20 @@ static inline void task_tick_core(struct
>   */
>  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  {
> -	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &curr->se;
> +	unsigned long weight = NICE_0_LOAD;
> +	struct cfs_rq *cfs_rq;
>  
>  	for_each_sched_entity(se) {
>  		cfs_rq = cfs_rq_of(se);
>  		entity_tick(cfs_rq, se, queued);
> +
> +		weight = __calc_prop_weight(cfs_rq, se, weight);

Testing sched/flat branch on AMD EPYC 9654 (384 CPUs, 8 NUMA nodes)
with a 2-level cgroup hierarchy and cfs_bandwidth quota enabled,
hackbench triggers a divide-by-zero oops:

  [  142.308571] divide error: 0000 [#1] SMP NOPTI
  [  142.308582] RIP: 0010:task_tick_fair+0x19e/0x410
  [  142.308601] Call Trace:
  [  142.308604]  <IRQ>
  [  142.308607]  scheduler_tick+0x6a/0x110
  [  142.308609]  update_process_times+0x6b/0x90
  [  142.308611]  tick_sched_handle+0x2a/0x70
  [  142.308613]  tick_sched_timer+0x57/0xb0

faddr2line confirms:

  task_tick_fair+0x19e/0x410:
  __calc_prop_weight at kernel/sched/fair.c:4085
  (inlined by) task_tick_fair at kernel/sched/fair.c:13576

===========================================================
Reproduction
===========================================================

Kernel: sched/flat branch (54d493980e00 and later)
Hardware: AMD EPYC 9654, 2S 384 logical CPUs

  # 2-level cgroup, quota = 50% of one period
  cgcreate -g cpu:/bw/l1/l2
  cgset -r cpu.cfs_quota_us=50000  /bw/l1/l2
  cgset -r cpu.cfs_period_us=100000 /bw/l1/l2

  # high task count amplifies the throttle→tick race window
  cgexec -g cpu:/bw/l1/l2 hackbench -g 48 -l 1000 -s 512 -T

Typically crashes within 30 seconds on this machine.  A single-CPU
kernel or a very loose quota (e.g. 90%) is unlikely to trigger it
because the race window is narrow.

Thanks,
Zhang Qiao

>  	}
>  
> +	se = &curr->se;
> +	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +
>  	if (queued)
>  		return;
>  
> @@ -13772,7 +13665,7 @@ prio_changed_fair(struct rq *rq, struct
>  	if (p->prio == oldprio)
>  		return;
>  
> -	if (rq->cfs.nr_queued == 1)
> +	if (rq->cfs.h_nr_queued == 1)
>  		return;
>  
>  	/*
> @@ -13901,29 +13794,40 @@ static void switched_to_fair(struct rq *
>  	}
>  }
>  


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-26  7:53   ` Zhang Qiao
@ 2026-05-26  9:15     ` K Prateek Nayak
  2026-05-26  9:36       ` Zhang Qiao
  2026-05-26  9:52       ` Peter Zijlstra
  0 siblings, 2 replies; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-26  9:15 UTC (permalink / raw)
  To: Zhang Qiao, Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef, Hui Tang

Hello Zhang,

On 5/26/2026 1:23 PM, Zhang Qiao wrote:
> Testing sched/flat branch on AMD EPYC 9654 (384 CPUs, 8 NUMA nodes)
> with a 2-level cgroup hierarchy and cfs_bandwidth quota enabled,
> hackbench triggers a divide-by-zero oops:
> 
>   [  142.308571] divide error: 0000 [#1] SMP NOPTI
>   [  142.308582] RIP: 0010:task_tick_fair+0x19e/0x410
>   [  142.308601] Call Trace:
>   [  142.308604]  <IRQ>
>   [  142.308607]  scheduler_tick+0x6a/0x110
>   [  142.308609]  update_process_times+0x6b/0x90
>   [  142.308611]  tick_sched_handle+0x2a/0x70
>   [  142.308613]  tick_sched_timer+0x57/0xb0

More of this trace would have been helpful.

> 
> faddr2line confirms:
> 
>   task_tick_fair+0x19e/0x410:
>   __calc_prop_weight at kernel/sched/fair.c:4085
>   (inlined by) task_tick_fair at kernel/sched/fair.c:13576

Those line numbers don't match on the latest sched/flat but since you
mention this happens with throttling, I believe it is tick hitting
somewhere in between the task being dequeued by throttle_cfs_rq_work()
and the CPU rescheduling and taking the task off the runqueue.

Dequeue from throttle is slightly special since it keeps the task on
runqueue but the sched entity goes off the cfs_rq changing the
hierarchical weights.

Can you check if this helps:

  (Lightly tested with your reproducer)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8bae794f063..d96e5915fb3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14815,18 +14815,21 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct sched_entity *se = &curr->se;
-	unsigned long weight = NICE_0_LOAD;
-	struct cfs_rq *cfs_rq;
 
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-		entity_tick(cfs_rq, se, queued);
+	if (se->on_rq) {
+		unsigned long weight = NICE_0_LOAD;
+		struct cfs_rq *cfs_rq;
 
-		weight = __calc_prop_weight(cfs_rq, se, weight);
-	}
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			entity_tick(cfs_rq, se, queued);
+
+			weight = __calc_prop_weight(cfs_rq, se, weight);
+		}
 
-	se = &curr->se;
-	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+		se = &curr->se;
+		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
+	}
 
 	if (queued)
 		return;
---

I don't think it makes too much sense to reweight an entity that
has been dequeued. The enqueue at unthrottle will do it anyways.

> 
> ===========================================================
> Reproduction
> ===========================================================
> 
> Kernel: sched/flat branch (54d493980e00 and later)
> Hardware: AMD EPYC 9654, 2S 384 logical CPUs
> 
>   # 2-level cgroup, quota = 50% of one period
>   cgcreate -g cpu:/bw/l1/l2
>   cgset -r cpu.cfs_quota_us=50000  /bw/l1/l2
>   cgset -r cpu.cfs_period_us=100000 /bw/l1/l2
> 
>   # high task count amplifies the throttle→tick race window
>   cgexec -g cpu:/bw/l1/l2 hackbench -g 48 -l 1000 -s 512 -T
> 
> Typically crashes within 30 seconds on this machine.  A single-CPU
> kernel or a very loose quota (e.g. 90%) is unlikely to trigger it
> because the race window is narrow.

This was helpful! I see:

[  209.935597] Oops: divide error: 0000 [#1] SMP NOPTI
[  209.941061] CPU: 329 UID: 0 PID: 8247 Comm: sched-messaging Not tainted 7.1.0-rc2-test+ #73 PREEMPT(full)
[  209.951841] Hardware name: AMD Corporation Titanite_4G/Titanite_4G, BIOS RTI100CC 03/28/2024
[  209.961254] RIP: 0010:task_tick_fair+0x10d/0x850
[  209.966420] Code: dc 00 00 00 4c 89 f7 e8 f1 52 ff ff 45 85 e4 0f 85 ba 00 00 00 49 8b 06 4d 8b b6 b8 00 00 00 48 0f af c3 4d 85 f6 74 19 31 d2 <49> f7 37 ba 02 00 00 00 48 89 d3 48 39 d0 48 0f 43 d8 e9 20 ff ff
[  209.987382] RSP: 0018:ff581fd71e1fce58 EFLAGS: 00010046
[  209.993216] RAX: 0000010000000000 RBX: 0000000000100000 RCX: ff295dbfa9ad8080
[  210.001179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff295dbfa9ad8080
[  210.009141] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000063eb
[  210.017104] R10: 0000000000000000 R11: ff581fd71e1fcff8 R12: 0000000000000000
[  210.025061] R13: ff295dbfa9ad8000 R14: ff295dc06c6eac00 R15: ff295dbfd9bc8600
[  210.033027] FS:  00007faef8c8b640(0000) GS:ff295e7c4acca000(0000) knlGS:0000000000000000
[  210.042060] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  210.048474] CR2: 00007f9884292d30 CR3: 000000011aa26001 CR4: 0000000000f71ef0
[  210.056430] PKRU: 55555554
[  210.059448] Call Trace:
[  210.062177]  <IRQ>
[  210.064426]  sched_tick+0x94/0x250
[  210.068229]  update_process_times+0x99/0xc0
[  210.072903]  tick_nohz_handler+0x95/0x1a0
[  210.077380]  ? __pfx_tick_nohz_handler+0x10/0x10
[  210.082534]  __hrtimer_run_queues+0xfe/0x260
[  210.087304]  hrtimer_interrupt+0x122/0x1f0
[  210.091880]  __sysvec_apic_timer_interrupt+0x55/0x130
[  210.097525]  sysvec_apic_timer_interrupt+0x7a/0xb0
[  210.102873]  </IRQ>
[  210.105203]  <TASK>
[  210.107542]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  210.113284] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
[  210.119511] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d ec 20 fd 01 74 05 e9 c0 81 d4 fe e8 00 93 ec fe e9 b6 81
[  210.140469] RSP: 0018:ff581fd74032fe88 EFLAGS: 00000206
[  210.146308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[  210.154271] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ff295dbfa9ad8d64
[  210.162235] RBP: ff295dbfa9ad8000 R08: 0000000000000000 R09: 0000000000000000
[  210.170196] R10: 0000000000000000 R11: 0000000000000000 R12: ff295dbfa9ad8d64
[  210.178159] R13: ff581fd74032ff48 R14: ff295dbfa9ad8000 R15: 00fffffffffff000
[  210.186139]  task_work_run+0x5c/0x90
[  210.190137]  exit_to_user_mode_loop+0x16e/0x550
[  210.195198]  ? srso_alias_return_thunk+0x5/0xfbef5
[  210.200552]  ? ksys_read+0xc5/0xe0
[  210.204352]  do_syscall_64+0x26e/0x750
[  210.208540]  ? do_syscall_64+0xaa/0x750
[  210.212823]  ? srso_alias_return_thunk+0x5/0xfbef5
[  210.218174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
---

So the theory of throttle work causing this checks out.

The suggested diff above solves the crash in my case but your
mileage may vary. Peter can comment if this is the right thing
to do or not :-)

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-26  9:15     ` K Prateek Nayak
@ 2026-05-26  9:36       ` Zhang Qiao
  2026-05-26  9:52       ` Peter Zijlstra
  1 sibling, 0 replies; 64+ messages in thread
From: Zhang Qiao @ 2026-05-26  9:36 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, mingo
  Cc: longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, qyousef, Hui Tang

Hi Prateek,

On 2026/5/26 17:15, K Prateek Nayak wrote:
> Hello Zhang,
> 
> On 5/26/2026 1:23 PM, Zhang Qiao wrote:
>> Testing sched/flat branch on AMD EPYC 9654 (384 CPUs, 8 NUMA nodes)
>> with a 2-level cgroup hierarchy and cfs_bandwidth quota enabled,
>> hackbench triggers a divide-by-zero oops:
>>
>>   [  142.308571] divide error: 0000 [#1] SMP NOPTI
>>   [  142.308582] RIP: 0010:task_tick_fair+0x19e/0x410
>>   [  142.308601] Call Trace:
>>   [  142.308604]  <IRQ>
>>   [  142.308607]  scheduler_tick+0x6a/0x110
>>   [  142.308609]  update_process_times+0x6b/0x90
>>   [  142.308611]  tick_sched_handle+0x2a/0x70
>>   [  142.308613]  tick_sched_timer+0x57/0xb0
> 
> More of this trace would have been helpful.
> 
>>
>> faddr2line confirms:
>>
>>   task_tick_fair+0x19e/0x410:
>>   __calc_prop_weight at kernel/sched/fair.c:4085
>>   (inlined by) task_tick_fair at kernel/sched/fair.c:13576
> 
> Those line numbers don't match on the latest sched/flat but since you
> mention this happens with throttling, I believe it is tick hitting
> somewhere in between the task being dequeued by throttle_cfs_rq_work()
> and the CPU rescheduling and taking the task off the runqueue.
> 

Sorry for the confusion on the line numbers — the mismatch was due
to some local debug code I had added on top of sched/flat,
not a difference in the base tree.

> Dequeue from throttle is slightly special since it keeps the task on
> runqueue but the sched entity goes off the cfs_rq changing the
> hierarchical weights.
> 
> Can you check if this helps:
> 
>   (Lightly tested with your reproducer)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b8bae794f063..d96e5915fb3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -14815,18 +14815,21 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
>  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  {
>  	struct sched_entity *se = &curr->se;
> -	unsigned long weight = NICE_0_LOAD;
> -	struct cfs_rq *cfs_rq;
>  
> -	for_each_sched_entity(se) {
> -		cfs_rq = cfs_rq_of(se);
> -		entity_tick(cfs_rq, se, queued);
> +	if (se->on_rq) {
> +		unsigned long weight = NICE_0_LOAD;
> +		struct cfs_rq *cfs_rq;
>  
> -		weight = __calc_prop_weight(cfs_rq, se, weight);
> -	}
> +		for_each_sched_entity(se) {
> +			cfs_rq = cfs_rq_of(se);
> +			entity_tick(cfs_rq, se, queued);
> +
> +			weight = __calc_prop_weight(cfs_rq, se, weight);
> +		}
>  
> -	se = &curr->se;
> -	reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +		se = &curr->se;
> +		reweight_eevdf(cfs_rq, se, weight, se->on_rq);
> +	}
>  

throttle_cfs_rq_work() sets se->on_rq = 0 while the task is still running as
rq->curr, and the subsequent tick should not attempt to reweight an
already-dequeued entity. The unthrottle enqueue will handle the reweight anyway.

I've tested your suggested diff on my AMD EPYC 9654 (384 CPUs, 8 NUMA
nodes) and it resolves the crash. The reproducer no longer triggers the
divide error after running for several minutes.

Tested-by: Zhang Qiao <zhangqiao22@huawei.com>


Thanks,
Zhang Qiao

>  	if (queued)
>  		return;
> ---
> 
> I don't think it makes too much sense to reweight an entity that
> has been dequeued. The enqueue at unthrottle will do it anyways.
> 
>>
>> ===========================================================
>> Reproduction
>> ===========================================================
>>
>> Kernel: sched/flat branch (54d493980e00 and later)
>> Hardware: AMD EPYC 9654, 2S 384 logical CPUs
>>
>>   # 2-level cgroup, quota = 50% of one period
>>   cgcreate -g cpu:/bw/l1/l2
>>   cgset -r cpu.cfs_quota_us=50000  /bw/l1/l2
>>   cgset -r cpu.cfs_period_us=100000 /bw/l1/l2
>>
>>   # high task count amplifies the throttle→tick race window
>>   cgexec -g cpu:/bw/l1/l2 hackbench -g 48 -l 1000 -s 512 -T
>>
>> Typically crashes within 30 seconds on this machine.  A single-CPU
>> kernel or a very loose quota (e.g. 90%) is unlikely to trigger it
>> because the race window is narrow.
> 
> This was helpful! I see:
> 
> [  209.935597] Oops: divide error: 0000 [#1] SMP NOPTI
> [  209.941061] CPU: 329 UID: 0 PID: 8247 Comm: sched-messaging Not tainted 7.1.0-rc2-test+ #73 PREEMPT(full)
> [  209.951841] Hardware name: AMD Corporation Titanite_4G/Titanite_4G, BIOS RTI100CC 03/28/2024
> [  209.961254] RIP: 0010:task_tick_fair+0x10d/0x850
> [  209.966420] Code: dc 00 00 00 4c 89 f7 e8 f1 52 ff ff 45 85 e4 0f 85 ba 00 00 00 49 8b 06 4d 8b b6 b8 00 00 00 48 0f af c3 4d 85 f6 74 19 31 d2 <49> f7 37 ba 02 00 00 00 48 89 d3 48 39 d0 48 0f 43 d8 e9 20 ff ff
> [  209.987382] RSP: 0018:ff581fd71e1fce58 EFLAGS: 00010046
> [  209.993216] RAX: 0000010000000000 RBX: 0000000000100000 RCX: ff295dbfa9ad8080
> [  210.001179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff295dbfa9ad8080
> [  210.009141] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000063eb
> [  210.017104] R10: 0000000000000000 R11: ff581fd71e1fcff8 R12: 0000000000000000
> [  210.025061] R13: ff295dbfa9ad8000 R14: ff295dc06c6eac00 R15: ff295dbfd9bc8600
> [  210.033027] FS:  00007faef8c8b640(0000) GS:ff295e7c4acca000(0000) knlGS:0000000000000000
> [  210.042060] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  210.048474] CR2: 00007f9884292d30 CR3: 000000011aa26001 CR4: 0000000000f71ef0
> [  210.056430] PKRU: 55555554
> [  210.059448] Call Trace:
> [  210.062177]  <IRQ>
> [  210.064426]  sched_tick+0x94/0x250
> [  210.068229]  update_process_times+0x99/0xc0
> [  210.072903]  tick_nohz_handler+0x95/0x1a0
> [  210.077380]  ? __pfx_tick_nohz_handler+0x10/0x10
> [  210.082534]  __hrtimer_run_queues+0xfe/0x260
> [  210.087304]  hrtimer_interrupt+0x122/0x1f0
> [  210.091880]  __sysvec_apic_timer_interrupt+0x55/0x130
> [  210.097525]  sysvec_apic_timer_interrupt+0x7a/0xb0
> [  210.102873]  </IRQ>
> [  210.105203]  <TASK>
> [  210.107542]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  210.113284] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
> [  210.119511] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d ec 20 fd 01 74 05 e9 c0 81 d4 fe e8 00 93 ec fe e9 b6 81
> [  210.140469] RSP: 0018:ff581fd74032fe88 EFLAGS: 00000206
> [  210.146308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
> [  210.154271] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ff295dbfa9ad8d64
> [  210.162235] RBP: ff295dbfa9ad8000 R08: 0000000000000000 R09: 0000000000000000
> [  210.170196] R10: 0000000000000000 R11: 0000000000000000 R12: ff295dbfa9ad8d64
> [  210.178159] R13: ff581fd74032ff48 R14: ff295dbfa9ad8000 R15: 00fffffffffff000
> [  210.186139]  task_work_run+0x5c/0x90
> [  210.190137]  exit_to_user_mode_loop+0x16e/0x550
> [  210.195198]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  210.200552]  ? ksys_read+0xc5/0xe0
> [  210.204352]  do_syscall_64+0x26e/0x750
> [  210.208540]  ? do_syscall_64+0xaa/0x750
> [  210.212823]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  210.218174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ---
> 
> So the theory of throttle work causing this checks out.
> 


> The suggested diff above solves the crash in my case but your
> mileage may vary. Peter can comment if this is the right thing
> to do or not :-)
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-26  9:15     ` K Prateek Nayak
  2026-05-26  9:36       ` Zhang Qiao
@ 2026-05-26  9:52       ` Peter Zijlstra
  2026-05-26 10:54         ` K Prateek Nayak
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-26  9:52 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhang Qiao, mingo, longman, chenridong, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, cgroups, linux-kernel, jstultz,
	qyousef, Hui Tang

On Tue, May 26, 2026 at 02:45:45PM +0530, K Prateek Nayak wrote:

> The suggested diff above solves the crash in my case but your
> mileage may vary. Peter can comment if this is the right thing
> to do or not :-)

Is this a different issue than the one you raised before? We talked
about throttle, and you were going to make a proper patch of that cleanup
iirc.



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-26  9:52       ` Peter Zijlstra
@ 2026-05-26 10:54         ` K Prateek Nayak
  2026-05-26 11:07           ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: K Prateek Nayak @ 2026-05-26 10:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Zhang Qiao, mingo, longman, chenridong, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, cgroups, linux-kernel, jstultz,
	qyousef, Hui Tang

Hello Peter,

On 5/26/2026 3:22 PM, Peter Zijlstra wrote:
> On Tue, May 26, 2026 at 02:45:45PM +0530, K Prateek Nayak wrote:
> 
>> The suggested diff above solves the crash in my case but your
>> mileage may vary. Peter can comment if this is the right thing
>> to do or not :-)
> 
> Is this a different issue than the one you raised before?

Yes, this is different. Essentially, this is what is happening:

  throttle_cfs_rq_work()
    task_rq_lock()

    dequeue_task_fair(current)    /* Task is dequeued on cfs side */
      __dequeue_task(current)
        dequeue_hierarchy(current);
          current->se.on_rq = 0;
          /* update_load_sub() */
    resched_curr();               /* Initiates a resched */

    task_rq_unlock()
      local_irq_enable();

  =====> sched_tick()
          task_tick_fair()
             __calc_prop_weight()
               /*
                * Oops: update_load_sub() above has
                * 0ed the weight of cfs_rq.
                */
  <====

  preempt_schedule_irq()
    next = ...
    put_prev_set_next_task() /* The runtime context is switched here */


> We talked about throttle, and you were going to make a proper patch of that cleanup
> iirc.

I had rebased your suggestion on tip and fixed a couple of splats but
once it was functional, I noticed hackbench taking twice as long to
complete compared to tip and I was chasing that before I fell sick.

Let me go dig deeper to see where exactly it is all going sideways.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-26 10:54         ` K Prateek Nayak
@ 2026-05-26 11:07           ` Peter Zijlstra
  2026-05-26 12:40             ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-26 11:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhang Qiao, mingo, longman, chenridong, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, cgroups, linux-kernel, jstultz,
	qyousef, Hui Tang

On Tue, May 26, 2026 at 04:24:32PM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 5/26/2026 3:22 PM, Peter Zijlstra wrote:
> > On Tue, May 26, 2026 at 02:45:45PM +0530, K Prateek Nayak wrote:
> > 
> >> The suggested diff above solves the crash in my case but your
> >> mileage may vary. Peter can comment if this is the right thing
> >> to do or not :-)
> > 
> > Is this a different issue than the one you raised before?
> 
> Yes, this is different. Essentially, this is what is happening:
> 
>   throttle_cfs_rq_work()
>     task_rq_lock()
> 
>     dequeue_task_fair(current)    /* Task is dequeued on cfs side */
>       __dequeue_task(current)
>         dequeue_hierarchy(current);
>           current->se.on_rq = 0;
>           /* update_load_sub() */
>     resched_curr();               /* Initiates a resched */
> 
>     task_rq_unlock()
>       local_irq_enable();
> 
>   =====> sched_tick()
>           task_tick_fair()
>              __calc_prop_weight()
>                /*
>                 * Oops: update_load_sub() above has
>                 * 0ed the weight of cfs_rq.
>                 */
>   <====
> 
>   preempt_schedule_irq()
>     next = ...
>     put_prev_set_next_task() /* The runtime context is switched here */
> 

Ah, right. OK, I'll go have a poke once I get these proxy patches I've
been spending too much time on posted.

I think I've found a 'problem' with that PROXY_WAKING ==> '->is_blocked
&& !->blocked_on' scheme :-(

> > We talked about throttle, and you were going to make a proper patch of that cleanup
> > iirc.
> 
> I had rebased your suggestion on tip and fixed a couple of splats but
> once it was functional, I noticed hackbench taking twice as long to
> complete compared to tip and I was chasing that before I fell sick.
> 
> Let me go dig deeper to see where exactly it is all going sideways.

Sure, no worries. This happens; computers just never want to just DTRT
already. I lost a day and then some trying to figure out why my
seemingly 'trivial' proxy changes ended up trying to run a dead task
last week...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/10] sched/eevdf: Move to a single runqueue
  2026-05-26 11:07           ` Peter Zijlstra
@ 2026-05-26 12:40             ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-26 12:40 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Zhang Qiao, mingo, longman, chenridong, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, tj, hannes, mkoutny, cgroups, linux-kernel, jstultz,
	qyousef, Hui Tang

On Tue, May 26, 2026 at 01:07:09PM +0200, Peter Zijlstra wrote:
> On Tue, May 26, 2026 at 04:24:32PM +0530, K Prateek Nayak wrote:
> > Hello Peter,
> > 
> > On 5/26/2026 3:22 PM, Peter Zijlstra wrote:
> > > On Tue, May 26, 2026 at 02:45:45PM +0530, K Prateek Nayak wrote:
> > > 
> > >> The suggested diff above solves the crash in my case but your
> > >> mileage may vary. Peter can comment if this is the right thing
> > >> to do or not :-)
> > > 
> > > Is this a different issue than the one you raised before?
> > 
> > Yes, this is different. Essentially, this is what is happening:
> > 
> >   throttle_cfs_rq_work()
> >     task_rq_lock()
> > 
> >     dequeue_task_fair(current)    /* Task is dequeued on cfs side */
> >       __dequeue_task(current)
> >         dequeue_hierarchy(current);
> >           current->se.on_rq = 0;
> >           /* update_load_sub() */
> >     resched_curr();               /* Initiates a resched */
> > 
> >     task_rq_unlock()
> >       local_irq_enable();
> > 
> >   =====> sched_tick()
> >           task_tick_fair()
> >              __calc_prop_weight()
> >                /*
> >                 * Oops: update_load_sub() above has
> >                 * 0ed the weight of cfs_rq.
> >                 */
> >   <====
> > 
> >   preempt_schedule_irq()
> >     next = ...
> >     put_prev_set_next_task() /* The runtime context is switched here */
> > 
> 
> Ah, right. OK, I'll go have a poke once I get these proxy patches I've
> been spending too much time on posted.

Yes, your solution seems reasonable. I'll fold that and push out a new
version a little later today.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/10] sched: Flatten the pick
  2026-05-18 19:11         ` Tejun Heo
@ 2026-05-27  9:41           ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-05-27  9:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

On Mon, May 18, 2026 at 09:11:03AM -1000, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, May 18, 2026 at 09:14:56AM +0200, Peter Zijlstra wrote:
> ...
> > So the current scheme will inflate the part of A to be double the weight
> > (of B), giving them 2 out of 3 parts on the contended CPUs, but then B
> > will still get complete / uncontested access to those extra 128 CPUs,
> > resulting in a 2:4 weight distribution.
> > 
> > Which also isn't as straight forward as one might think.
> 
> Right, the current behavior isn't quite what people would expect intuitively
> either.
> 
> ...
> > So for the one contended CPU A gets 256 out of 257 parts, while B gets
> > the full CPU for the remaining 255 CPUs, for a:
> > 
> >   256    1        257
> >   --- : --- + 255*--- = 256:65535 ~ 1:256
> >   257   257       257
> > 
> > distribution. While with the new scheme it would be:
> > 
> >  1   1       2
> >  - : - + 255*- = 1:511
> >  2   2       2
> > 
> > Which, realistically isn't all that different, except the old scheme has
> > this really large weight to deal with.
> > 
> > So from where I'm sitting, yes different, but it behaves better.

FWIW if the workload was single threads per CPU; the above is also the
exact behaviour we'd have without cgroups.

> I see. Thread cardinality and affinity problems make weight based
> distribution such a pain. I wonder whether this can be better solved by
> turning it into a two-layer allocation problem - groups to CPUs and then
> timeshare on CPUs as necessary. That comes with a lot of its own problems
> but it can, aspirationally at least, approximate global weight distribution
> and would have better locality properties.

If people want, they can already do this today. I don't see a reason to
mandate something like that. That is, combine cpuset and cpu in a v2
hierarchy and you get this.

The main problem with doing something like that is of course that it
isn't always clear how many CPUs will be needed for a particular 'job'.
So assigning groups to CPUs isn't a straightforward thing.

If I remember, Meta was actually doing some of this. It was dynamically
resizing cpusets based on load predictions and the like in order to
separate various workloads on the same large machine, right?


Anyway, while it is somewhat tedious to change behaviour, I do think it
is worth doing in this case.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
  2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
  2026-05-12  5:37   ` K Prateek Nayak
  2026-05-19 15:13   ` Vincent Guittot
@ 2026-06-03  9:51   ` Aaron Lu
  2026-06-11 11:32     ` Peter Zijlstra
  2 siblings, 1 reply; 64+ messages in thread
From: Aaron Lu @ 2026-06-03  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef

[-- Attachment #1: Type: text/plain, Size: 2467 bytes --]

Hi Peter,

On Mon, May 11, 2026 at 01:31:12PM +0200, Peter Zijlstra wrote:
> With commit 50653216e4ff ("sched: Add support to pick functions to
> take rf") removing the balance callback, the pick_task() callback is
> in charge of newidle balancing.
> 
> This means pick_task_fair() should do so too. This hasn't been a
> problem in practise because pick_next_task_fair() is used. However,
> since we'll be removing that one shortly, make sure pick_next_task()
> is up to scratch.

While testing Prateek's throttle series, I noticed a panic when
coresched is enabled and bisected it to this patch.

I fed the panic log and this patch to an agent and its analysis looks
correct to me (cpu56 and cpu57 are siblings in a VM):

       cpu57 (holds core-wide lock)

     pick_next_task() [core scheduling]
     for_each_cpu_wrap(i, smt_mask, 57):
       i=57: pick_task(rq_57)
             pick_task_fair(rq_57)
             -> picks task A
       rq_57->core_pick = task A
       // task_rq(A) == rq_57

       i=56: pick_task(rq_56)
             pick_task_fair(rq_56)
             cfs_rq->nr_queued == 0
             goto idle
             sched_balance_newidle(rq_56)
             raw_spin_rq_unlock(rq_56)
             // core-wide lock released
             newidle_balance() pulls
               task A: rq_57 -> rq_56
             // task_rq(A) == rq_56 now
             raw_spin_rq_lock(rq_56)
             // core-wide lock re-acquired
             return > 0
             goto again
             pick_task_fair(rq_56)
             -> picks task A
       rq_56->core_pick = task A

     // first loop done
     // rq_57->core_pick is still task A (set before lock release)
     // but task_rq(A) == rq_56 now
     next = rq_57->core_pick  // = task A

     put_prev_set_next_task(rq_57, prev, task A)
     __set_next_task_fair(rq_57, task A)
     hrtick_start_fair(rq_57, task A)
     WARN_ON_ONCE(task_rq(task A) != rq_57)
     // task_rq(A) == rq_56

I applied the diff below and the problem is gone:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f48af700fd44..942a543af3e54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9897,6 +9897,9 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 	return p;
 
 idle:
+	if (sched_core_enabled(rq))
+		return NULL;
+
 	new_tasks = sched_balance_newidle(rq, rf);
 	if (new_tasks < 0)
 		return RETRY_TASK;
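
For readers who want to see the stale-pick sequence in isolation, below is a
toy userspace model of it. This is a sketch under heavy assumptions, not
kernel code: every toy_* name is invented for illustration, each runqueue
holds at most one task, locking is not modelled at all, and the 'guard' flag
merely stands in for the sched_core_enabled() early return in the diff above.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_rq;

/* 'rq' plays the role of task_rq(p). */
struct toy_task {
	struct toy_rq *rq;
};

struct toy_rq {
	struct toy_task *queued;     /* at most one runnable task */
	struct toy_task *core_pick;  /* pick cached during the core-wide loop */
};

/* Stand-in for newidle balance: pull the sibling's only task, if any. */
static bool toy_newidle_pull(struct toy_rq *dst, struct toy_rq *src)
{
	if (!src->queued)
		return false;

	dst->queued = src->queued;
	dst->queued->rq = dst;       /* the task now belongs to dst */
	src->queued = NULL;
	return true;
}

/* Stand-in for the pick: with 'guard' set, an idle rq never balances. */
static struct toy_task *toy_pick(struct toy_rq *rq, struct toy_rq *sibling,
				 bool guard)
{
	if (rq->queued)
		return rq->queued;
	if (guard)
		return NULL;
	if (toy_newidle_pull(rq, sibling))
		return rq->queued;
	return NULL;
}

/* Run the two-sibling loop; report whether rq57's cached pick went stale. */
static bool toy_core_pick_is_stale(bool guard)
{
	struct toy_rq rq57 = { 0 }, rq56 = { 0 };
	struct toy_task a = { .rq = &rq57 };

	rq57.queued = &a;

	rq57.core_pick = toy_pick(&rq57, &rq56, guard);  /* picks task A */
	rq56.core_pick = toy_pick(&rq56, &rq57, guard);  /* may pull task A */

	return rq57.core_pick && rq57.core_pick->rq != &rq57;
}
```

In this model toy_core_pick_is_stale(false) reports a stale pick, because the
second sibling pulls task A after the first one has already cached it, while
toy_core_pick_is_stale(true) does not, since the idle sibling returns early
without balancing.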

Full dmesg of the panic is attached for your reference.

[-- Attachment #2: dmesg_b3a2dfa8b42 --]
[-- Type: application/octet-stream, Size: 78556 bytes --]

SeaBIOS (version 1.16.2-debian-1.16.2-1)


iPXE (http://ipxe.org) 00:02.0 CA00 PCI2.10 PnP PMM+BEFCEE10+BEF0EE10 CA00



Booting from ROM..
[    0.000000] Linux version 7.1.0-rc2-00057-gb3a2dfa8b42e (ziqianlu@n232-168-014) (x86_64-linux-gcc (GCC) 12.5.0, GNU 6
[    0.000000] Command line: root=/dev/vda2 selinux=0 console=ttyS0 initcall_debug
[    0.000000] x86/split lock detection: #DB: warning on user-space bus_locks
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff]  System RAM
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff]  device reserved
[    0.000000] BIOS-e820: [gap 0x00000000000a0000-0x00000000000effff]
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff]  device reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffd6fff]  System RAM
[    0.000000] BIOS-e820: [mem 0x00000000bffd7000-0x00000000bfffffff]  device reserved
[    0.000000] BIOS-e820: [gap 0x00000000c0000000-0x00000000feffbfff]
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff]  device reserved
[    0.000000] BIOS-e820: [gap 0x00000000ff000000-0x00000000fffbffff]
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff]  device reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000103fffffff]  System RAM
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] APIC: Static calls initialized
[    0.000000] DMI: SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[    0.000000] DMI: Memory slots populated: 4/4
[    0.000000] Hypervisor detected: KVM
[    0.000000] last_pfn = 0xbffd7 max_arch_pfn = 0x10000000000
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000003] kvm-clock: using sched offset of 404261872 cycles
[    0.000005] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000011] tsc: Detected 2600.000 MHz processor
[    0.001117] last_pfn = 0x1040000 max_arch_pfn = 0x10000000000
[    0.001170] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
[    0.001173] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
[    0.001237] last_pfn = 0xbffd7 max_arch_pfn = 0x10000000000
[    0.001244] Using GB pages for direct mapping
[    0.001787] ACPI: Early table checksum verification disabled
[    0.001790] ACPI: RSDP 0x00000000000F58F0 000014 (v00 BOCHS )
[    0.001798] ACPI: RSDT 0x00000000BFFE31DE 000034 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.001803] ACPI: FACP 0x00000000BFFE2E9A 000074 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.001807] ACPI: DSDT 0x00000000BFFE0040 002E5A (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.001810] ACPI: FACS 0x00000000BFFE0000 000040
[    0.001812] ACPI: APIC 0x00000000BFFE2F0E 000270 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.001814] ACPI: HPET 0x00000000BFFE317E 000038 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.001816] ACPI: WAET 0x00000000BFFE31B6 000028 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.001818] ACPI: Reserving FACP table memory at [mem 0xbffe2e9a-0xbffe2f0d]
[    0.001819] ACPI: Reserving DSDT table memory at [mem 0xbffe0040-0xbffe2e99]
[    0.001820] ACPI: Reserving FACS table memory at [mem 0xbffe0000-0xbffe003f]
[    0.001821] ACPI: Reserving APIC table memory at [mem 0xbffe2f0e-0xbffe317d]
[    0.001821] ACPI: Reserving HPET table memory at [mem 0xbffe317e-0xbffe31b5]
[    0.001822] ACPI: Reserving WAET table memory at [mem 0xbffe31b6-0xbffe31dd]
[    0.001866] No NUMA configuration found
[    0.001867] Faking a node at [mem 0x0000000000000000-0x000000103fffffff]
[    0.001875] NODE_DATA(0) allocated [mem 0x103ffdb840-0x103fffdfff]
[    0.002595] ACPI: PM-Timer IO Port: 0x608
[    0.002612] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.002663] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
[    0.002666] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.002667] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.002669] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.002669] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.002670] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.002674] ACPI: Using ACPI (MADT) for SMP configuration information
[    0.002675] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.002677] TSC deadline timer available
[    0.002679] CPU topo: Max. logical packages:   1
[    0.002680] CPU topo: Max. logical nodes:      1
[    0.002681] CPU topo: Num. nodes per package:  1
[    0.002682] CPU topo: Max. logical dies:       1
[    0.002683] CPU topo: Max. dies per package:   1
[    0.002685] CPU topo: Max. threads per core:   2
[    0.002686] CPU topo: Num. cores per package:    32
[    0.002686] CPU topo: Num. threads per package:  64
[    0.002687] CPU topo: Allowing 64 present CPUs plus 0 hotplug CPUs
[    0.002725] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
[    0.002739] kvm-guest: KVM setup pv remote TLB flush
[    0.002745] kvm-guest: setup PV sched yield
[    0.002758] [gap 0xc0000000-0xfeffbfff] available for PCI devices
[    0.002759] Booting paravirtualized kernel on KVM
[    0.002760] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.158473] Zone ranges:
[    0.158475]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.158478]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.158480]   Normal   [mem 0x0000000100000000-0x000000103fffffff]
[    0.158481] Movable zone start for each node
[    0.158483] Early memory node ranges
[    0.158483]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.158485]   node   0: [mem 0x0000000000100000-0x00000000bffd6fff]
[    0.158486]   node   0: [mem 0x0000000100000000-0x000000103fffffff]
[    0.158495] Initmem setup node 0 [mem 0x0000000000001000-0x000000103fffffff]
[    0.158512] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.158545] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.277831] On node 0, zone Normal: 41 pages in unavailable ranges
[    0.277836] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:1
[    0.331107] percpu: Embedded 522 pages/cpu s2097176 r8192 d32744 u4194304
[    0.331229] kvm-guest: PV spinlocks enabled
[    0.331232] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
[    0.331242] Kernel command line: root=/dev/vda2 selinux=0 console=ttyS0 initcall_debug
[    0.331274] Unknown kernel command line parameters "selinux=0", will be passed to user space.
[    0.331293] random: crng init done
[    0.331294] printk: log buffer data + meta data: 16777216 + 58720256 = 75497472 bytes
[    0.340831] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, linear)
[    0.345633] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, linear)
[    0.346506] software IO TLB: area num 64.
[    0.363064] Fallback order for Node 0: 0
[    0.363076] Built 1 zonelists, mobility grouping on.  Total pages: 16777077
[    0.363077] Policy zone: Normal
[    0.363081] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.506688] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=64, Nodes=1
[    0.520070] ftrace: allocating 41111 entries in 162 pages
[    0.520071] ftrace: allocated 162 pages with 3 groups
[    0.522760] Dynamic Preempt: lazy
[    0.524044] Running RCU self tests
[    0.524045] Running RCU synchronous self tests
[    0.524046] rcu: Preemptible hierarchical RCU implementation.
[    0.524047] rcu:     RCU lockdep checking is enabled.
[    0.524047] rcu:     RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=64.
[    0.524050]  Trampoline variant of Tasks RCU enabled.
[    0.524050]  Rude variant of Tasks RCU enabled.
[    0.524051]  Tracing variant of Tasks RCU enabled.
[    0.524051] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.524052] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=64
[    0.524221] Running RCU synchronous self tests
[    0.524234] RCU Tasks: Setting shift to 6 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=64.
[    0.524246] RCU Tasks Rude: Setting shift to 6 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=64.
[    0.526613] NR_IRQS: 524544, nr_irqs: 936, preallocated irqs: 16
[    0.527079] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.545088] Console: colour VGA+ 80x25
[    0.545185] printk: legacy console [ttyS0] enabled
[    0.712572] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[    0.714245] ... MAX_LOCKDEP_SUBCLASSES:  8
[    0.715136] ... MAX_LOCK_DEPTH:          48
[    0.716043] ... MAX_LOCKDEP_KEYS:        8192
[    0.716986] ... CLASSHASH_SIZE:          4096
[    0.717930] ... MAX_LOCKDEP_ENTRIES:     32768
[    0.718893] ... MAX_LOCKDEP_CHAINS:      65536
[    0.719855] ... CHAINHASH_SIZE:          32768
[    0.720817]  memory used by lock dependency info: 6941 kB
[    0.721983]  memory used for stack traces: 4224 kB
[    0.723021]  per task-struct memory footprint: 2688 bytes
[    0.724543] ACPI: Core revision 20251212
[    0.725788] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.728011] APIC: Switch to symmetric I/O mode setup
[    0.729186] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
[    0.730759] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
[    0.732723] kvm-guest: setup PV IPIs
[    0.735620] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.736987] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x257a3c3232d, max_idle_ns: 440795236700 ns
[    0.739545] Calibrating delay loop (skipped) preset value.. 5200.00 BogoMIPS (lpj=26000000)
[    0.741586] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.743128] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.744284] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.745586] mitigations: Enabled attack vectors: user_kernel, user_user, SMT mitigations: auto
[    0.747452] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[    0.749544] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
[    0.750818] ITS: Mitigation: Aligned branch/return thunks
[    0.751992] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.754131] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB!
[    0.756702] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
[    0.759546] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.761378] active return thunk: its_return_thunk
[    0.762473] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.764149] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.765540] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.766932] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.768342] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.769550] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.771024] x86/fpu: Supporting XSAVE feature 0x20000: 'AMX Tile config'
[    0.772505] x86/fpu: Supporting XSAVE feature 0x40000: 'AMX Tile data'
[    0.773939] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.775293] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.776653] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.778004] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.779542] x86/fpu: xstate_offset[17]: 2432, xstate_sizes[17]:   64
[    0.780936] x86/fpu: xstate_offset[18]: 2496, xstate_sizes[18]: 8192
[    0.782326] x86/fpu: Enabled xstate features 0x600e7, context size is 10688 bytes, using 'compacted' format.
[    0.812589] pid_max: default: 65536 minimum: 512
[    0.814516] Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.816447] Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.818783] VFS: Finished mounting rootfs on nullfs
[    0.821061] Running RCU synchronous self tests
[    0.822043] Running RCU synchronous self tests
[    0.823733] smpboot: CPU0: Intel INTEL(R) XEON(R) PLATINUM 8582C (family: 0x6, model: 0xcf, stepping: 0x2)
[    0.826912] Performance Events: PEBS fmt0-, Sapphire Rapids events,
[    0.826989] core: The event 0x400 is not supported as a normal event.
[    0.829539] core: The event 0x8000 is not supported as a normal event.
[    0.829539] core: The event 0x8100 is not supported as a normal event.
[    0.829539] core: The event 0x8200 is not supported as a normal event.
[    0.829564] core: The event 0x8300 is not supported as a normal event.
[    0.830989] core: The event 0x8400 is not supported as a normal event.
[    0.832414] core: The event 0x8500 is not supported as a normal event.
[    0.833840] core: The event 0x8600 is not supported as a normal event.
[    0.835264] core: The event 0x8700 is not supported as a normal event.
[    0.836711] full-width counters, Intel PMU driver.
[    0.837799] ... version:                   2
[    0.838740] ... bit width:                 48
[    0.839545] ... generic counters:          8
[    0.840533] ... generic bitmap:            00000000000000ff
[    0.841833] ... fixed-purpose counters:    3
[    0.842778] ... fixed-purpose bitmap:      0000000000000007
[    0.844001] ... value mask:                0000ffffffffffff
[    0.845231] ... max period:                00007fffffffffff
[    0.846457] ... global_ctrl mask:          00000007000000ff
[    0.848730] signal: max sigframe size: 11952
[    0.849795] rcu: Hierarchical SRCU implementation.
[    0.850848] rcu:     Max phase no-delay instances is 1000.
[    0.852229] Timer migration: 2 hierarchy levels; 8 children per group; 2 crossnode level
[    0.862598] smp: Bringing up secondary CPUs ...
[    0.864188] smpboot: x86: Booting SMP configuration:
[    0.865296] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14 #16 #18 #20 #22 #24 #26 #28 #30 #32 #34 #36 #38 #3
[    0.960164] smp: Brought up 1 node, 64 CPUs
[    0.966568] smpboot: Total of 64 processors activated (332800.00 BogoMIPS)
[    0.977612] Memory: 65590812K/67108308K available (15072K kernel code, 59652K rwdata, 14884K rodata, 5692K init, 287)
[    0.989848] devtmpfs: initialized
[    0.990688] x86/mm: Memory block size: 1024MB
[    1.003627] Running RCU synchronous self tests
[    1.003627] Running RCU synchronous self tests
[    1.011501] Running RCU Tasks wait API self tests
[    1.011501] Running RCU Tasks Rude wait API self tests
[    1.011501] Running RCU Tasks Trace wait API self tests
[    1.015337] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    1.015337] posixtimers hash table entries: 32768 (order: 10, 2621440 bytes, linear)
[    1.021312] futex hash table entries: 16384 (2097152 bytes on 1 NUMA nodes, total 2048 KiB, linear).
[    1.024907] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    1.027638] audit: initializing netlink subsys (disabled)
[    1.028959] audit: type=2000 audit(1780456802.589:1): state=initialized audit_enabled=0 res=1
[    1.030231] thermal_sys: Registered thermal governor 'step_wise'
[    1.030234] thermal_sys: Registered thermal governor 'user_space'
[    1.034918] cpuidle: using governor ladder
[    1.034918] cpuidle: using governor menu
[    1.034918] Freeing SMP alternatives memory: 36K
[    1.039668] PCI: Using configuration type 1 for base access
[    1.042036] kprobes: kprobe jump-optimization is enabled. All kprobes are optimized if possible.
[    1.045883] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
[    1.045883] HugeTLB: 16380 KiB vmemmap can be freed for a 1.00 GiB page
[    1.049553] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[    1.051532] Callback from call_rcu_tasks_trace() invoked.
[    1.051043] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
[    1.053828] ACPI: Added _OSI(Module Device)
[    1.053883] ACPI: Added _OSI(Processor Device)
[    1.054868] ACPI: Added _OSI(Processor Aggregator Device)
[    1.064769] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    1.078119] ACPI: \_SB_: platform _OSC: OS support mask [00027cee]
[    1.081088] ACPI: Interpreter enabled
[    1.081938] ACPI: PM: (supports S0 S5)
[    1.082771] ACPI: Using IOAPIC for interrupt routing
[    1.083920] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    1.085902] PCI: Using E820 reservations for host bridge windows
[    1.088003] ACPI: Enabled 2 GPEs in block 00 to 0F
[    1.115871] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    1.117249] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
[    1.118936] acpi PNP0A03:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
[    1.119556] acpi PNP0A03:00: _OSC: platform retains control of PCIe features (AE_ERROR)
[    1.121344] acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended configuration space under this e
[    1.124674] PCI host bridge to bus 0000:00
[    1.125586] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    1.127064] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    1.128543] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    1.129551] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
[    1.131188] pci_bus 0000:00: root bus resource [mem 0x1040000000-0x10bfffffff window]
[    1.132896] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.134218] pci 0000:00:00.0: calling  quirk_mmio_always_on+0x0/0x20 @ 1
[    1.135689] pci 0000:00:00.0: quirk_mmio_always_on+0x0/0x20 took 0 usecs
[    1.137243] pci 0000:00:00.0: [8086:1237] type 00 class 0x060000 conventional PCI endpoint
[    1.141483] pci 0000:00:01.0: [8086:7000] type 00 class 0x060100 conventional PCI endpoint
[    1.145603] pci 0000:00:01.1: [8086:7010] type 00 class 0x010180 conventional PCI endpoint
[    1.150929] pci 0000:00:01.1: BAR 4 [io  0xc1c0-0xc1cf]
[    1.152154] pci 0000:00:01.1: BAR 0 [io  0x01f0-0x01f7]: legacy IDE quirk
[    1.153652] pci 0000:00:01.1: BAR 1 [io  0x03f6]: legacy IDE quirk
[    1.155011] pci 0000:00:01.1: BAR 2 [io  0x0170-0x0177]: legacy IDE quirk
[    1.156520] pci 0000:00:01.1: BAR 3 [io  0x0376]: legacy IDE quirk
[    1.158329] pci 0000:00:01.3: calling  acpi_pm_check_blacklist+0x0/0x50 @ 1
[    1.159550] pci 0000:00:01.3: acpi_pm_check_blacklist+0x0/0x50 took 0 usecs
[    1.161081] pci 0000:00:01.3: [8086:7113] type 00 class 0x068000 conventional PCI endpoint
[    1.163375] pci 0000:00:01.3: calling  quirk_piix4_acpi+0x0/0x180 @ 1
[    1.164802] pci 0000:00:01.3: quirk: [io  0x0600-0x063f] claimed by PIIX4 ACPI
[    1.166487] pci 0000:00:01.3: quirk: [io  0x0700-0x070f] claimed by PIIX4 SMB
[    1.168111] pci 0000:00:01.3: quirk_piix4_acpi+0x0/0x180 took 9765 usecs
[    1.169546] pci 0000:00:01.3: calling  pci_fixup_piix4_acpi+0x0/0x20 @ 1
[    1.171016] pci 0000:00:01.3: pci_fixup_piix4_acpi+0x0/0x20 took 0 usecs
[    1.173137] pci 0000:00:02.0: [1af4:1000] type 00 class 0x020000 conventional PCI endpoint
[    1.179559] pci 0000:00:02.0: BAR 0 [io  0xc180-0xc19f]
[    1.180853] pci 0000:00:02.0: BAR 1 [mem 0xfc052000-0xfc052fff]
[    1.182206] pci 0000:00:02.0: BAR 4 [mem 0xfebf0000-0xfebf3fff 64bit pref]
[    1.183723] pci 0000:00:02.0: ROM [mem 0xfc000000-0xfc03ffff pref]
[    1.187574] pci 0000:00:03.0: [1b36:0100] type 00 class 0x030000 conventional PCI endpoint
[    1.204337] pci 0000:00:03.0: BAR 0 [mem 0xf0000000-0xf7ffffff]
[    1.205831] pci 0000:00:03.0: BAR 1 [mem 0xf8000000-0xfbffffff]
[    1.207180] pci 0000:00:03.0: BAR 2 [mem 0xfc050000-0xfc051fff]
[    1.208580] pci 0000:00:03.0: BAR 3 [io  0xc1a0-0xc1bf]
[    1.209579] pci 0000:00:03.0: ROM [mem 0xfc040000-0xfc04ffff pref]
[    1.211034] pci 0000:00:03.0: calling  screen_info_fixup_lfb+0x0/0x170 @ 1
[    1.212558] pci 0000:00:03.0: screen_info_fixup_lfb+0x0/0x170 took 0 usecs
[    1.214076] pci 0000:00:03.0: calling  pci_fixup_video+0x0/0x110 @ 1
[    1.215512] pci 0000:00:03.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    1.217345] pci 0000:00:03.0: pci_fixup_video+0x0/0x110 took 0 usecs
[    1.220945] pci 0000:00:04.0: [1af4:1001] type 00 class 0x010000 conventional PCI endpoint
[    1.229567] pci 0000:00:04.0: BAR 0 [io  0xc000-0xc07f]
[    1.230839] pci 0000:00:04.0: BAR 1 [mem 0xfc053000-0xfc053fff]
[    1.232304] pci 0000:00:04.0: BAR 4 [mem 0xfebf4000-0xfebf7fff 64bit pref]
[    1.236563] pci 0000:00:05.0: [1af4:1001] type 00 class 0x010000 conventional PCI endpoint
[    1.251930] pci 0000:00:05.0: BAR 0 [io  0xc080-0xc0ff]
[    1.253130] pci 0000:00:05.0: BAR 1 [mem 0xfc054000-0xfc054fff]
[    1.254617] pci 0000:00:05.0: BAR 4 [mem 0xfebf8000-0xfebfbfff 64bit pref]
[    1.258786] pci 0000:00:06.0: [1af4:1001] type 00 class 0x010000 conventional PCI endpoint
[    1.269664] pci 0000:00:06.0: BAR 0 [io  0xc100-0xc17f]
[    1.270890] pci 0000:00:06.0: BAR 1 [mem 0xfc055000-0xfc055fff]
[    1.272368] pci 0000:00:06.0: BAR 4 [mem 0xfebfc000-0xfebfffff 64bit pref]
[    1.279844] ACPI: PCI: Interrupt link LNKA configured for IRQ 10
[    1.281679] ACPI: PCI: Interrupt link LNKB configured for IRQ 10
[    1.283351] ACPI: PCI: Interrupt link LNKC configured for IRQ 11
[    1.285032] ACPI: PCI: Interrupt link LNKD configured for IRQ 11
[    1.286503] ACPI: PCI: Interrupt link LNKS configured for IRQ 9
[    1.311447] PCI: Using ACPI for IRQ routing
[    1.312408] e820: register RAM buffer resource [mem 0x0009fc00-0x0009ffff]
[    1.312408] e820: register RAM buffer resource [mem 0xbffd7000-0xbfffffff]
[    1.316084] pci 0000:00:03.0: vgaarb: setting as boot VGA device
[    1.316084] pci 0000:00:03.0: vgaarb: bridge control possible
[    1.316084] pci 0000:00:03.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    1.316084] vgaarb: loaded
[    1.316084] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    1.319548] hpet0: 3 comparators, 64-bit 100.000000 MHz counter
[    1.329544] clocksource: Switched to clocksource kvm-clock
[    1.332816] pnp: PnP ACPI init
[    1.334738] pnp: PnP ACPI: found 5 devices
[    1.342651] Callback from call_rcu_tasks() invoked.
[    1.348725] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.351007] NET: Registered PF_INET protocol family
[    1.352536] IP idents hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    1.357256] tcp_listen_portaddr_hash hash table entries: 32768 (order: 10, 2621440 bytes, linear)
[    1.360367] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    1.362098] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[    1.364687] TCP bind hash table entries: 65536 (order: 12, 10485760 bytes, vmalloc hugepage)
[    1.370412] TCP: Hash tables configured (established 524288 bind 65536)
[    1.372235] UDP hash table entries: 32768 (order: 12, 9961472 bytes, vmalloc hugepage)
[    1.377574] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    1.378911] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
[    1.380274] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff window]
[    1.381630] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[    1.383155] pci_bus 0000:00: resource 7 [mem 0xc0000000-0xfebfffff window]
[    1.384669] pci_bus 0000:00: resource 8 [mem 0x1040000000-0x10bfffffff window]
[    1.386391] pci 0000:00:00.0: calling  quirk_passive_release+0x0/0xb0 @ 1
[    1.388018] pci 0000:00:01.0: PIIX3: Enabling Passive Release
[    1.389318] pci 0000:00:00.0: quirk_passive_release+0x0/0xb0 took 1284 usecs
[    1.390878] pci 0000:00:00.0: calling  quirk_natoma+0x0/0x40 @ 1
[    1.392209] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    1.393533] pci 0000:00:00.0: quirk_natoma+0x0/0x40 took 1293 usecs
[    1.395026] PCI: CLS 0 bytes, default 64
[    1.396057] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    1.397482] software IO TLB: mapped [mem 0x00000000bbfd7000-0x00000000bffd7000] (64MB)
[    1.399313] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 10737418240 ms ovfl timer
[    1.401150] RAPL PMU: hw unit of domain psys 2^-0 Joules
[    1.402478] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x257a3c3232d, max_idle_ns: 440795236700 ns
[    1.448549] workingset: timestamp_bits=36 (anon: 32) max_order=24 bucket_order=0 (anon: 0)
[    1.451899] fuse: init (API version 7.45)
[    1.453726] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
[    1.455376] io scheduler mq-deadline registered
[    1.456389] io scheduler kyber registered
[    1.635984] ACPI: \_SB_.LNKB: Enabled at IRQ 10
[    1.802865] ACPI: \_SB_.LNKD: Enabled at IRQ 11
[    1.975123] ACPI: \_SB_.LNKA: Enabled at IRQ 10
[    2.153578] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    2.155816] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    2.159812] Non-volatile memory driver v1.3
[    2.160834] ACPI: bus type drm_connector registered
[    2.325622] ACPI: \_SB_.LNKC: Enabled at IRQ 11
[    2.326680] qxl 0000:00:03.0: vgaarb: deactivate vga console
[    2.343804] Console: switching to colour dummy device 80x25
[    2.345156] [drm] Device Version 0.0
[    2.345965] [drm] Compression level 0 log level 0
[    2.347194] [drm] 16382 io pages at offset 0x4000000
[    2.348298] [drm] 67108864 byte draw area at offset 0x0
[    2.349464] [drm] RAM header offset: 0x7ffe000
[    2.351527] [drm] qxl: 64M of VRAM memory size
[    2.352554] [drm] qxl: 127M of IO pages memory ready (VRAM domain)
[    2.353916] [drm] qxl: 64M of Surface memory size
[    2.355753] [drm] slot 0 (main): base 0xf0000000, size 0x07ffe000
[    2.357165] [drm] slot 1 (surfaces): base 0xf8000000, size 0x04000000
[    2.358612] stackdepot: allocating hash table of 1048576 entries via kvcalloc
[    2.364366] stackdepot: allocating space for 8192 stack pools via kvcalloc
[    2.367937] [drm] Initialized qxl 0.1.0 for 0000:00:03.0 on minor 0
[    2.371188] fbcon: qxldrmfb (fb0) is primary device
[    2.411789] Console: switching to colour frame buffer device 128x48
[    2.452239] qxl 0000:00:03.0: [drm] fb0: qxldrmfb frame buffer device
[    2.517669] brd: module loaded
[    2.543812] loop: module loaded
[    2.544852] virtio_blk virtio1: 64/0/0 default/read/poll queues
[    2.575178] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
[    2.606416]  vda: vda1 vda2
[    2.607820] virtio_blk virtio2: 64/0/0 default/read/poll queues
[    2.638462] virtio_blk virtio2: [vdb] 83886080 512-byte logical blocks (42.9 GB/40.0 GiB)
[    2.677292] virtio_blk virtio3: 64/0/0 default/read/poll queues
[    2.707670] virtio_blk virtio3: [vdc] 104857600 512-byte logical blocks (53.7 GB/50.0 GiB)
[    2.752515] zram: Added device: zram0
[    2.753799] tun: Universal TUN/TAP device driver, 1.6
[    2.760780] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[    2.764644] serio: i8042 KBD port at 0x60,0x64 irq 1
[    2.765852] serio: i8042 AUX port at 0x60,0x64 irq 12
[    2.767512] mousedev: PS/2 mouse device common for all mice
[    2.769099] rtc_cmos PNP0B00:00: RTC can wake from S4
[    2.771367] rtc_cmos PNP0B00:00: registered as rtc0
[    2.772690] rtc_cmos PNP0B00:00: setting system clock to 2026-06-03T03:20:04 UTC (1780456804)
[    2.774747] rtc_cmos PNP0B00:00: alarms up to one day, y3k, 242 bytes nvram, hpet irqs
[    2.776625] intel_pstate: CPU model not supported
[    2.777914] NET: Registered PF_PACKET protocol family
[    2.779802] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[    2.798681] IPI shorthand broadcast: enabled
[    2.822167] sched_clock: Marking stable (2620004035, 192443415)->(2821730911, -9283461)
[    2.827078] registered taskstats version 1
[    2.866001] Demotion targets for Node 0: null
[    2.867133] debug_vm_pgtable: [debug_vm_pgtable         ]: Validating architecture page table helpers
[    3.009038] page_owner is disabled
[    3.011181] netconsole: network logging started
[    3.013049] clk: Disabling unused clocks
[    3.454950] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input2
[    3.575954] EXT4-fs (vda2): orphan cleanup on readonly fs
[    3.577562] EXT4-fs (vda2): mounted filesystem d1a47800-9a04-4b0d-8eb5-7864b160615d ro with ordered data mode. Quota.
[    3.579635] VFS: Mounted root (ext4 filesystem) readonly on device 254:2.
[    3.581785] devtmpfs: mounted
[    3.582990] VFS: Pivoted into new rootfs
[    3.595156] Freeing unused kernel image (initmem) memory: 5692K
[    3.596195] Write protecting the kernel read-only data: 32768k
[    3.599499] Freeing unused kernel image (text/rodata gap) memory: 1308K
[    3.602043] Freeing unused kernel image (rodata/data gap) memory: 1500K
[    3.603199] Run /sbin/init as init process
[    4.007294] systemd[1]: Failed to find module 'autofs4'
[    4.084919] systemd[1]: systemd 258.7-1.fc43 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +IPE +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +BTF +XKBCOMMON +UTMP +SYSVINIT +LIBARCHIVE)
[    4.090677] systemd[1]: Detected virtualization kvm.
[    4.091547] systemd[1]: Detected architecture x86-64.

Welcome to Fedora Linux 43 (Forty Three)!

[    4.101846] systemd[1]: Hostname set to <intelvm>.
[    4.207427] systemd[1]: bpf-restrict-fs: BPF LSM hook not enabled in the kernel, BPF LSM not supported.
[    4.455323] systemd[1]: Queued start job for default target multi-user.target.
[    4.508112] systemd[1]: Created slice system-getty.slice - Slice /system/getty.
[  OK  ] Created slice system-getty.slice - Slice /system/getty.
[    4.514578] systemd[1]: Created slice system-modprobe.slice - Slice /system/modprobe.
[  OK  ] Created slice system-modprobe.slice - Slice /system/modprobe.
[    4.521685] systemd[1]: Created slice system-serial\x2dgetty.slice - Slice /system/serial-getty.
[  OK  ] Created slice system-serial\x2dgetty.slice - Slice /system/serial-getty.
[    4.528832] systemd[1]: Created slice system-sshd\x2dkeygen.slice - Slice /system/sshd-keygen.
[  OK  ] Created slice system-sshd\x2dkeygen.slice - Slice /system/sshd-keygen.
[    4.536200] systemd[1]: Created slice system-systemd\x2dfsck.slice - Slice /system/systemd-fsck.
[  OK  ] Created slice system-systemd\x2dfsck.slice - Slice /system/systemd-fsck.
[    4.543365] systemd[1]: Created slice system-systemd\x2dzram\x2dsetup.slice - Slice /system/systemd-zram-setup.
[  OK  ] Created slice system-systemd\x2dzram\x2dsetup.slice - Slice /system/systemd-zram-setup.
[    4.550882] systemd[1]: Created slice user.slice - User and Session Slice.
[  OK  ] Created slice user.slice - User and Session Slice.
[    4.555350] systemd[1]: Started systemd-ask-password-wall.path - Forward Password Requests to Wall Directory Watch.
[  OK  ] Started systemd-ask-password-wall.path - Forward Password Requests to Wall Directory Watch.
[    4.561828] systemd[1]: Starting of proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point unsupported.
[UNSUPP] Starting of proc-sys-fs-binfmt_misc.automount - Arbitr…e File Formats File System Automount Point unsupported.
[    4.569722] systemd[1]: Expecting device dev-disk-by\x2duuid-d44c9a9e\x2d67bc\x2d4f32\x2dad5f\x2ddec10c152dba.device - /dev/disk/by-uuid/d44c9a9e-67bc-4f32-ad5f-dec10c152dba...
         Expecting device dev-disk-by\x2duuid-d44c9a9e\x2d67bc\…ev/disk/by-uuid/d44c9a9e-67bc-4f32-ad5f-dec10c152dba...
[    4.577342] systemd[1]: Expecting device dev-disk-by\x2duuid-f4f6c870\x2d2da2\x2d4b25\x2d89e7\x2db335d590aa87.device - /dev/disk/by-uuid/f4f6c870-2da2-4b25-89e7-b335d590aa87...
         Expecting device dev-disk-by\x2duuid-f4f6c870\x2d2da2\…ev/disk/by-uuid/f4f6c870-2da2-4b25-89e7-b335d590aa87...
[    4.584882] systemd[1]: Expecting device dev-ttyS0.device - /dev/ttyS0...
         Expecting device dev-ttyS0.device - /dev/ttyS0...
[    4.588839] systemd[1]: Expecting device dev-zram0.device - /dev/zram0...
         Expecting device dev-zram0.device - /dev/zram0...
[    4.592707] systemd[1]: Reached target imports.target - Image Downloads.
[  OK  ] Reached target imports.target - Image Downloads.
[    4.596985] systemd[1]: Reached target integritysetup.target - Local Integrity Protected Volumes.
[  OK  ] Reached target integritysetup.target - Local Integrity Protected Volumes.
[    4.602430] systemd[1]: Reached target remote-cryptsetup.target - Remote Encrypted Volumes.
[  OK  ] Reached target remote-cryptsetup.target - Remote Encrypted Volumes.
[    4.607656] systemd[1]: Reached target remote-fs.target - Remote File Systems.
[  OK  ] Reached target remote-fs.target - Remote File Systems.
[    4.612137] systemd[1]: Reached target slices.target - Slice Units.
[  OK  ] Reached target slices.target - Slice Units.
[    4.616180] systemd[1]: Reached target veritysetup.target - Local Verity Protected Volumes.
[  OK  ] Reached target veritysetup.target - Local Verity Protected Volumes.
[    4.622724] systemd[1]: Listening on systemd-ask-password.socket - Query the User Interactively for a Password.
[  OK  ] Listening on systemd-ask-password.socket - Query the User Interactively for a Password.
[    4.630586] systemd[1]: Listening on systemd-coredump.socket - Process Core Dump Socket.
[  OK  ] Listening on systemd-coredump.socket - Process Core Dump Socket.
[    4.636880] systemd[1]: Listening on systemd-creds.socket - Credential Encryption/Decryption.
[  OK  ] Listening on systemd-creds.socket - Credential Encryption/Decryption.
[    4.643661] systemd[1]: Listening on systemd-factory-reset.socket - Factory Reset Management.
[  OK  ] Listening on systemd-factory-reset.socket - Factory Reset Management.
[    4.649540] systemd[1]: Listening on systemd-journald-audit.socket - Journal Audit Socket.
[  OK  ] Listening on systemd-journald-audit.socket - Journal Audit Socket.
[    4.654921] systemd[1]: Listening on systemd-journald-dev-log.socket - Journal Socket (/dev/log).
[  OK  ] Listening on systemd-journald-dev-log.socket - Journal Socket (/dev/log).
[    4.660671] systemd[1]: Listening on systemd-journald.socket - Journal Sockets.
[  OK  ] Listening on systemd-journald.socket - Journal Sockets.
[    4.665605] systemd[1]: Listening on systemd-oomd.socket - Userspace Out-Of-Memory (OOM) Killer Socket.
[  OK  ] Listening on systemd-oomd.socket - Userspace Out-Of-Memory (OOM) Killer Socket.
[    4.671333] systemd[1]: systemd-pcrextend.socket - TPM PCR Measurements skipped, unmet condition check ConditionSecurity=measured-uki
[    4.673386] systemd[1]: systemd-pcrlock.socket - Make TPM PCR Policy skipped, unmet condition check ConditionSecurity=measured-uki
[    4.675628] systemd[1]: Listening on systemd-resolved-monitor.socket - Resolve Monitor Varlink Socket.
[  OK  ] Listening on systemd-resolved-monitor.socket - Resolve Monitor Varlink Socket.
[    4.681438] systemd[1]: Listening on systemd-resolved-varlink.socket - Resolve Service Varlink Socket.
[  OK  ] Listening on systemd-resolved-varlink.socket - Resolve Service Varlink Socket.
[    4.687459] systemd[1]: Listening on systemd-udevd-control.socket - udev Control Socket.
[  OK  ] Listening on systemd-udevd-control.socket - udev Control Socket.
[    4.692590] systemd[1]: Listening on systemd-udevd-kernel.socket - udev Kernel Socket.
[  OK  ] Listening on systemd-udevd-kernel.socket - udev Kernel Socket.
[    4.697767] systemd[1]: Listening on systemd-udevd-varlink.socket - udev Varlink Socket.
[  OK  ] Listening on systemd-udevd-varlink.socket - udev Varlink Socket.
[    4.702921] systemd[1]: Listening on systemd-userdbd.socket - User Database Manager Socket.
[  OK  ] Listening on systemd-userdbd.socket - User Database Manager Socket.
[    4.711462] systemd[1]: Mounting dev-hugepages.mount - Huge Pages File System...
         Mounting dev-hugepages.mount - Huge Pages File System...
[    4.718261] systemd[1]: Mounting dev-mqueue.mount - POSIX Message Queue File System...
         Mounting dev-mqueue.mount - POSIX Message Queue File System...
[    4.774947] systemd[1]: Mounting sys-kernel-debug.mount - Kernel Debug File System...
         Mounting sys-kernel-debug.mount - Kernel Debug File System...
[    4.781941] systemd[1]: Mounting sys-kernel-tracing.mount - Kernel Trace File System...
         Mounting sys-kernel-tracing.mount - Kernel Trace File System...
[    4.786679] systemd[1]: kmod-static-nodes.service - Create List of Static Device Nodes skipped, unmet condition check ConditionFileNotEmpty=/lib/modules/7.1.0-rc2-00057-gb3a2dfa8b42e/modules.devname
[    4.792161] systemd[1]: Starting modprobe@configfs.service - Load Kernel Module configfs...
         Starting modprobe@configfs.service - Load Kernel Module configfs...
[    4.799728] systemd[1]: Starting modprobe@dm_mod.service - Load Kernel Module dm_mod...
         Starting modprobe@dm_mod.service - Load Kernel Module dm_mod...
[    4.804381] systemd[1]: modprobe@drm.service - Load Kernel Module drm skipped, unmet condition check ConditionKernelModuleLoaded=!drm
[    4.809310] systemd[1]: Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore...
         Starting modprobe@efi_pstore.service - Load Kernel Module efi_pstore...
[    4.814378] systemd[1]: modprobe@fuse.service - Load Kernel Module fuse skipped, unmet condition check ConditionKernelModuleLoaded=!fuse
[    4.819352] systemd[1]: Mounting sys-fs-fuse-connections.mount - FUSE Control File System...
         Mounting sys-fs-fuse-connections.mount - FUSE Control File System...
[    4.824258] systemd[1]: modprobe@loop.service - Load Kernel Module loop skipped, unmet condition check ConditionKernelModuleLoaded=!loop
[    4.829042] systemd[1]: Starting systemd-fsck-root.service - File System Check on Root Device...
         Starting systemd-fsck-root.service - File System Check on Root Device...
[    4.834146] systemd[1]: systemd-hibernate-clear.service - Clear Stale Hibernate Storage Info skipped, unmet condition check ConditionPathExists=/sys/firmware/efi/efivars/HibernateLocation-8cf2644b-4b0b-428f-9387-6d876050dc67
[    4.845545] systemd[1]: Starting systemd-journald.service - Journal Service...
         Starting systemd-journald.service - Journal Service...
[    4.853019] systemd[1]: Starting systemd-modules-load.service - Load Kernel Modules...
         Starting systemd-modules-load.service - Load Kernel Modules...
[    4.857095] systemd[1]: Starting systemd-network-generator.service - Generate Network Units from Kernel Command Line...
[    4.860242] systemd[1]: systemd-pcrmachine.service - TPM PCR Machine ID Measurement skipped, unmet condition check ConditionSecurity=measured-uki
         Starting systemd-network-generator.service - Generate Network Units from Kernel Command Line...
[    4.866131] systemd[1]: Starting systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully...
[    4.869325] systemd[1]: systemd-tpm2-setup-early.service - Early TPM SRK Setup skipped, unmet condition check ConditionSecurity=measured-uki
         Starting systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully...
[    4.874603] systemd[1]: Starting systemd-udev-load-credentials.service - Load udev Rules from Credentials...
         Starting systemd-udev-load-credentials.service - Load udev Rules from Credentials...
[    4.884678] systemd[1]: Starting systemd-udev-trigger.service - Coldplug All udev Devices...
         Starting systemd-udev-trigger.service - Coldplug All udev Devices...
[    4.887424] systemd-journald[488]: Collecting audit messages is enabled.
[    4.893492] systemd[1]: Starting systemd-vconsole-setup.service - Virtual Console Setup...
         Starting systemd-vconsole-setup.service - Virtual Console Setup...
[    4.904874] systemd[1]: Mounted dev-hugepages.mount - Huge Pages File System.
[  OK  ] Mounted dev-hugepages.mount - Huge Pages File System.
[    4.909825] systemd[1]: Mounted dev-mqueue.mount - POSIX Message Queue File System.
[  OK  ] Mounted dev-mqueue.mount - POSIX Message Queue File System.
[    4.915101] systemd[1]: Mounted sys-kernel-debug.mount - Kernel Debug File System.
[    4.916777] systemd[1]: Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[  OK  ] Mounted sys-kernel-debug.mount - Kernel Debug File System.
[    4.920580] systemd[1]: modprobe@configfs.service: Deactivated successfully.
[    4.924013] systemd[1]: Finished modprobe@configfs.service - Load Kernel Module configfs.
[  OK  ] Mounted sys-kernel-tracing.mount - Kernel Trace File System.
[ audit: type=1130 audit(1780456806.640:2): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=modprobe@configfs comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    4.928306] systemd[1]: modprobe@dm_mod.service: Deactivated successfully.
[    4.930905] audit: type=1131 audit(1780456806.640:3): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=modprobe@configfs comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    4.933555] systemd[1]: Finished modprobe@dm_mod.service - Load Kernel Module dm_mod.
[  OK  ] Finished modprobe@configfs.service - Load Kernel Module configfs.
[  OK      4.944989] audit: type=1130 audit(1780456806.660:4): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=modprobe@dm_mod comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    4.945000] audit: type=1131 audit(1780456806.660:5): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=modprobe@dm_mod comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
0m] Finished     4.945975] systemd[1]: modprobe@efi_pstore.service: Deactivated successfully.
;1;39mmodprobe@d[    4.947284] systemd[1]: Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
[    4.957158] audit: type=1130 audit(1780456806.670:6): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=modprobe@efi_pstore comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

systemd[1]: Mounted sys-fs-fuse-connections.mount - FUSE Control File System.
[    4.961189] audit: type=1131 audit(1780456806.670:7): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=modprobe@efi_pstore comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished modprobe@efi_pstore.service - Load Kernel Module efi_pstore.
[  OK  ] Mounted sys-fs-fuse-connections.mount - FUSE Control File System.
[    4.973499] systemd[1]: Finished systemd-fsck-root.service - File System Check on Root Device.
[    4.975108] audit: type=1130 audit(1780456806.690:8): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    4.976356] systemd[1]: Started systemd-journald.service - Journal Service.
[  OK  ] Finished systemd-fsck-root.service - File System Check on Root Device.
audit: type=1130 audit(1780456806.690:9): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[    4.986012] audit: type=1130 audit(1780456806.700:10): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-modules-load comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Started systemd-journald.service - Journal Service.
[  OK  ] Finished systemd-modules-load.service - Load Kernel Modules.
[  OK  ] Finished systemd-network-generator.service - Generate Network Units from Kernel Command Line.
[  OK  ] Finished systemd-udev-load-credentials.service - Load udev Rules from Credentials.
[  OK  ] Finished systemd-vconsole-setup.service - Virtual Console Setup.
         Starting systemd-remount-fs.service - Remount Root and Kernel File Systems...
         Starting systemd-sysctl.service - Apply Kernel Variables...
         Starting systemd-userdbd.service - User Database Manager...
[  OK  ] Finished systemd-sysctl.service - Apply Kernel Variables.
[  OK  ] Started systemd-userdbd.service - User Database Manager.
[  OK  ] Finished systemd-tmpfiles-setup-dev-early.service - Create Static Device Nodes in /dev gracefully.
[    5.374873] EXT4-fs (vda2): re-mounted d1a47800-9a04-4b0d-8eb5-7864b160615d r/w.
[  OK  ] Finished systemd-remount-fs.service - Remount Root and Kernel File Systems.
         Starting systemd-journal-flush.service - Flush Journal to Persistent Storage...
         Starting systemd-random-seed.service - Load/Save OS Random Seed...
         Starting systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev...
[    5.521116] systemd-journald[488]: Received client request to flush runtime journal.
[  OK  ] Finished systemd-random-seed.service - Load/Save OS Random Seed.
[  OK  ] Finished systemd-udev-trigger.service - Coldplug All udev Devices.
[  OK  ] Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
[  OK  ] Reached target local-fs-pre.target - Preparation for Local File Systems.
         Starting systemd-udevd.service - Rule-based Manager for Device Events and Files...
[  OK  ] Finished systemd-journal-flush.service - Flush Journal to Persistent Storage.
[  OK  ] Started systemd-udevd.service - Rule-based Manager for Device Events and Files.
         Starting plymouth-start.service - Show Plymouth Boot Screen...
[  OK  ] Started plymouth-start.service - Show Plymouth Boot Screen.
[  OK  ] Started systemd-ask-password-plymouth.path - Forward Password Requests to Plymouth Directory Watch.
[  OK  ] Reached target cryptsetup.target - Local Encrypted Volumes.
[  OK  ] Reached target paths.target - Path Units.
virtio_net virtio0 ens2: renamed from eth0
[  OK  ] Found device dev-zram0.device - /dev/zram0.
[  OK  ] Found device dev-disk-by\x2duuid-f4f6c870\x2d2da2\x2d4…/dev/disk/by-uuid/f4f6c870-2da2-4b25-89e7-b335d590aa87.
[  OK  ] Found device dev-disk-by\x2duuid-d44c9a9e\x2d67bc\x2d4…/dev/disk/by-uuid/d44c9a9e-67bc-4f32-ad5f-dec10c152dba.
[  OK  ] Found device dev-ttyS0.device - /dev/ttyS0.
         Starting systemd-fsck@dev-disk-by\x2duuid-d44c9a9e\x2d…ev/disk/by-uuid/d44c9a9e-67bc-4f32-ad5f-dec10c152dba...
         Starting systemd-fsck@dev-disk-by\x2duuid-f4f6c870\x2d…ev/disk/by-uuid/f4f6c870-2da2-4b25-89e7-b335d590aa87...
[  OK  ] Stopped systemd-vconsole-setup.service - Virtual Console Setup.
         Stopping systemd-vconsole-setup.service - Virtual Console Setup...
kauditd_printk_skb: 18 callbacks suppressed
audit: type=1131 audit(1780456808.080:29): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
         Starting systemd-vconsole-setup.service - Virtual Console Setup...
         Starting systemd-zram-setup@zram0.service - Create swap on /dev/zram0...
[  OK  ] Finished systemd-fsck@dev-disk-by\x2duuid-d44c9a9e\x2d…/dev/disk/by-uuid/d44c9a9e-67bc-4f32-ad5f-dec10c152dba.
audit: type=1130 audit(1780456808.270:30): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-disk-by\x2duuid-d44c9a9e\x2d67bc\x2d4f32\x2dad5f\x2ddec10c152dba comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished systemd-fsck@dev-disk-by\x2duuid-f4f6c870\x2d…/dev/disk/by-uuid/f4f6c870-2da2-4b25-89e7-b335d590aa87.
zram0: detected capacity change from 0 to 16777216
audit: type=1130 audit(1780456808.290:31): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-disk-by\x2duuid-f4f6c870\x2d2da2\x2d4b25\x2d89e7\x2db335d590aa87 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
         Mounting home.mount - /home...
[  OK  ] Finished systemd-zram-setup@zram0.service - Create swap on /dev/zram0.
         Activating swap dev-zram0.swap - Compressed Swap on /dev/zram0...
audit: type=1130 audit(1780456808.310:32): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-zram-setup@zram0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished systemd-vconsole-setup.service - Virtual Console Setup.
audit: type=1130 audit(1780456808.340:33): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
EXT4-fs (vdb): mounted filesystem d44c9a9e-67bc-4f32-ad5f-dec10c152dba r/w with ordered data mode. Quota mode: disabled.
[  OK  ] Mounted home.mount - /home.
         Mounting home-aaron-linux.mount - /home/aaron/linux...
[  OK  ] Mounted home-aaron-linux.mount - /home/aaron/linux.
EXT4-fs (vdc): mounted filesystem f4f6c870-2da2-4b25-89e7-b335d590aa87 r/w with ordered data mode. Quota mode: disabled.
[  OK  ] Activated swap dev-zram0.swap - Compressed Swap on /dev/zram0.
[  OK  ] Reached target swap.target - Swaps.
Adding 8388604k swap on /dev/zram0.  Priority:100 extents:1 across:8388604k SSDsc
         Mounting tmp.mount - Temporary Directory /tmp...
[  OK  ] Mounted tmp.mount - Temporary Directory /tmp.
[  OK  ] Reached target local-fs.target - Local File Systems.
[  OK  ] Listening on systemd-bootctl.socket - Boot Entries Service Socket.
[  OK  ] Listening on systemd-sysext.socket - System Extension Image Management.
         Starting plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data...
         Starting systemd-tmpfiles-setup.service - Create System Files and Directories...
         Starting systemd-userdb-load-credentials.service - Load JSON user/group Records from Credentials...
[  OK  ] Finished plymouth-read-write.service - Tell Plymouth To Write Out Runtime Data.
audit: type=1130 audit(1780456809.070:34): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished systemd-userdb-load-credentials.service - Load JSON user/group Records from Credentials.
audit: type=1130 audit(1780456809.090:35): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-userdb-load-credentials comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Finished systemd-tmpfiles-setup.service - Create System Files and Directories.
         Starting auditd.service - Security Audit Logging Service...
audit: type=1130 audit(1780456809.200:36): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1334 audit(1780456809.200:37): prog-id=22 op=LOAD
audit: type=1334 audit(1780456809.200:38): prog-id=23 op=LOAD
         Starting systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer...
         Starting systemd-resolved.service - Network Name Resolution...
[  OK  ] Started auditd.service - Security Audit Logging Service.
         Starting audit-rules.service - Load Audit Rules...
         Starting systemd-update-utmp.service - Record System Boot/Shutdown in UTMP...
[  OK  ] Finished systemd-update-utmp.service - Record System Boot/Shutdown in UTMP.
[  OK  ] Finished audit-rules.service - Load Audit Rules.
[  OK  ] Started systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer.
[  OK  ] Started systemd-resolved.service - Network Name Resolution.
[  OK  ] Reached target nss-lookup.target - Host and Network Name Lookups.
[  OK  ] Reached target sysinit.target - System Initialization.
[  OK  ] Started fstrim.timer - Discard unused filesystem blocks once a week.
[  OK  ] Started logrotate.timer - Daily rotation of log files.
[  OK  ] Started raid-check.timer - Weekly RAID setup health check.
[  OK  ] Started sysstat-collect.timer - Run system activity accounting tool every 10 minutes.
[  OK  ] Started sysstat-rotate.timer - Rotate daily system activity data file at midnight.
[  OK  ] Started sysstat-summary.timer - Generate summary of yesterday's process accounting.
[  OK  ] Started systemd-tmpfiles-clean.timer - Daily Cleanup of Temporary Directories.
[  OK  ] Started unbound-anchor.timer - daily update of the root trust anchor for DNSSEC.
[  OK  ] Reached target timers.target - Timer Units.
[  OK  ] Listening on avahi-daemon.socket - Avahi mDNS/DNS-SD Stack Activation Socket.
[  OK  ] Listening on dbus.socket - D-Bus System Message Bus Socket.
[  OK  ] Listening on pcscd.socket - PC/SC Smart Card Daemon Activation Socket.
[  OK  ] Listening on sshd-unix-local.socket - OpenSSH Server Socket (systemd-ssh-generator, AF_UNIX Local).
[  OK  ] Listening on systemd-hostnamed.socket - Hostname Service Socket.
[  OK  ] Listening on systemd-logind-varlink.socket - User Login Management Varlink Socket.
[  OK  ] Reached target sockets.target - Socket Units.
         Starting dbus-broker.service - D-Bus System Message Bus...
[  OK  ] Started dbus-broker.service - D-Bus System Message Bus.
[  OK  ] Reached target basic.target - Basic System.
         Starting avahi-daemon.service - Avahi mDNS/DNS-SD Stack...
         Starting chronyd.service - NTP client/server...
         Starting dracut-shutdown.service - Restore /run/initramfs on shutdown...
         Starting firewalld.service - firewalld - dynamic firewall daemon...
[  OK  ] Reached target sshd-keygen.target.
clocksource: Watchdog remote CPU 12 read timed out
         Starting sysstat.service - Resets System Activity Logs...
         Starting systemd-homed.service - Home Area Manager...
         Starting systemd-logind.service - User Login Management...
[  OK  ] Finished dracut-shutdown.service - Restore /run/initramfs on shutdown.
[  OK  ] Finished sysstat.service - Resets System Activity Logs.
[  OK  ] Started avahi-daemon.service - Avahi mDNS/DNS-SD Stack.
[  OK  ] Started systemd-homed.service - Home Area Manager.
[  OK  ] Finished systemd-homed-activate.service - Home Area Activation.
[  OK  ] Started systemd-logind.service - User Login Management.
[FAILED] Failed to start chronyd.service - NTP client/server.
See 'systemctl status chronyd.service' for details.
[FAILED] Failed to start firewalld.service - firewalld - dynamic firewall daemon.
See 'systemctl status firewalld.service' for details.
[  OK  ] Reached target network-pre.target - Preparation for Network.
         Starting NetworkManager.service - Network Manager...
         Starting systemd-hostnamed.service - Hostname Service...
[  OK  ] Started systemd-hostnamed.service - Hostname Service.
         Starting NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service...
[  OK  ] Started NetworkManager.service - Network Manager.
[  OK  ] Reached target network.target - Network.
         Starting kdump.service - Crash recovery kernel arming...
         Starting sshd.service - OpenSSH server daemon...
         Starting systemd-user-sessions.service - Permit User Sessions...
[  OK  ] Started NetworkManager-dispatcher.service - Network Manager Script Dispatcher Service.
[  OK  ] Finished systemd-user-sessions.service - Permit User Sessions.
         Starting plymouth-quit-wait.service - Hold until boot process finishes up...
         Starting plymouth-quit.service - Terminate Plymouth Boot Screen...
[  OK  ] Started sshd.service - OpenSSH server daemon.

Fedora Linux 43 (Forty Three)
Kernel 7.1.0-rc2-00057-gb3a2dfa8b42e on x86_64 (ttyS0)

intelvm login:
[   26.082516] ------------[ cut here ]------------
[   26.082721]
[   26.082722] ======================================================
[   26.082723] WARNING: possible circular locking dependency detected
[   26.082724] 7.1.0-rc2-00057-gb3a2dfa8b42e #44 Not tainted
[   26.082725] ------------------------------------------------------
[   26.082726] kworker/57:1/414 is trying to acquire lock:
[   26.082727] ffffffff83071220 (console_owner){....}-{0:0}, at: console_lock_spinning_enable+0x40/0x70
[   26.082735]
[   26.082735] but task is already holding lock:
[   26.082736] ff11000ffddfdc60 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x53/0xa0
[   26.082740]
[   26.082740] which lock already depends on the new lock.
[   26.082740]
[   26.082741]
[   26.082741] the existing dependency chain (in reverse order) is:
[   26.082741]
[   26.082741] -> #4 (&rq->__lock){-.-.}-{2:2}:
[   26.082743]        __lock_acquire+0x6e5/0xc70
[   26.082745]        lock_acquire+0xc7/0x2d0
[   26.082749]        _raw_spin_lock_nested+0x32/0x80
[   26.082752]        raw_spin_rq_lock_nested+0x26/0xa0
[   26.082754]        _task_rq_lock+0x49/0x110
[   26.082755]        cgroup_move_task+0x35/0x120
[   26.082758]        css_set_move_task+0xe8/0x240
[   26.082762]        cgroup_post_fork+0x96/0x290
[   26.082764]        copy_process+0x180a/0x1bd0
[   26.082766]        kernel_clone+0xae/0x3b0
[   26.082768]        user_mode_thread+0x61/0x90
[   26.082769]        rest_init+0x1e/0x190
[   26.082771]        start_kernel+0x6c1/0x770
[   26.082774]        x86_64_start_reservations+0x18/0x30
[   26.082776]        x86_64_start_kernel+0xd3/0xe0
[   26.082777]        common_startup_64+0x12c/0x138
[   26.082781]
[   26.082781] -> #3 (&p->pi_lock){-.-.}-{2:2}:
[   26.082782]        __lock_acquire+0x6e5/0xc70
[   26.082783]        lock_acquire+0xc7/0x2d0
[   26.082786]        _raw_spin_lock_irqsave+0x43/0x90
[   26.082788]        try_to_wake_up+0x5a/0x710
[   26.082790]        __wake_up_common+0x82/0xc0
[   26.082793]        __wake_up+0x36/0x60
[   26.082795]        tty_port_default_wakeup+0x2d/0x40
[   26.082799]        serial8250_tx_chars+0x35c/0x3a0
[   26.082802]        serial8250_handle_irq+0x47/0x140
[   26.082803]        serial8250_interrupt+0x5a/0xb0
[   26.082805]        __handle_irq_event_percpu+0x90/0x340
[   26.082808]        handle_irq_event+0x38/0x80
[   26.082809]        handle_edge_irq+0xbe/0x1a0
[   26.082811]        __common_interrupt+0x4d/0x130
[   26.082815]        common_interrupt+0x78/0xa0
[   26.082817]        asm_common_interrupt+0x26/0x40
[   26.082819]        pv_native_safe_halt+0xf/0x20
[   26.082821]        default_idle+0x9/0x10
[   26.082822]        default_idle_call+0x83/0x200
[   26.082824]        cpuidle_idle_call+0x16a/0x1a0
[   26.082826]        do_idle+0x93/0xd0
[   26.082827]        cpu_startup_entry+0x29/0x30
[   26.082829]        start_secondary+0xf8/0x100
[   26.082832]        common_startup_64+0x12c/0x138
[   26.082834]
[   26.082834] -> #2 (&tty->write_wait){-.-.}-{3:3}:
[   26.082835]        __lock_acquire+0x6e5/0xc70
[   26.082836]        lock_acquire+0xc7/0x2d0
[   26.082839]        _raw_spin_lock_irqsave+0x43/0x90
[   26.082841]        __wake_up+0x21/0x60
[   26.082842]        tty_port_default_wakeup+0x2d/0x40
[   26.082844]        serial8250_tx_chars+0x35c/0x3a0
[   26.082845]        serial8250_handle_irq+0x47/0x140
[   26.082846]        serial8250_interrupt+0x5a/0xb0
[   26.082848]        __handle_irq_event_percpu+0x90/0x340
[   26.082849]        handle_irq_event+0x38/0x80
[   26.082851]        handle_edge_irq+0xbe/0x1a0
[   26.082852]        __common_interrupt+0x4d/0x130
[   26.082854]        common_interrupt+0x78/0xa0
[   26.082856]        asm_common_interrupt+0x26/0x40
[   26.082857]        pv_native_safe_halt+0xf/0x20
[   26.082859]        default_idle+0x9/0x10
[   26.082860]        default_idle_call+0x83/0x200
[   26.082862]        cpuidle_idle_call+0x16a/0x1a0
[   26.082863]        do_idle+0x93/0xd0
[   26.082864]        cpu_startup_entry+0x29/0x30
[   26.082865]        start_secondary+0xf8/0x100
[   26.082867]        common_startup_64+0x12c/0x138
[   26.082869]
[   26.082869] -> #1 (&port_lock_key){-.-.}-{3:3}:
[   26.082870]        __lock_acquire+0x6e5/0xc70
[   26.082871]        lock_acquire+0xc7/0x2d0
[   26.082874]        _raw_spin_lock_irqsave+0x43/0x90
[   26.082876]        serial8250_console_write+0x8f/0x650
[   26.082877]        console_emit_next_record+0x10d/0x200
[   26.082879]        console_flush_one_record+0x223/0x330
[   26.082880]        console_unlock+0x6d/0x130
[   26.082881]        vprintk_emit+0x187/0x1c0
[   26.082883]        _printk+0x5b/0x80
[   26.082886]        register_console+0x278/0x450
[   26.082888]        univ8250_console_init+0x24/0x40
[   26.082890]        console_init+0x74/0x250
[   26.082892]        start_kernel+0x42d/0x770
[   26.082893]        x86_64_start_reservations+0x18/0x30
[   26.082895]        x86_64_start_kernel+0xd3/0xe0
[   26.082896]        common_startup_64+0x12c/0x138
[   26.082898]
[   26.082898] -> #0 (console_owner){....}-{0:0}:
[   26.082899]        check_prev_add+0xf1/0xc50
[   26.082902]        validate_chain+0x5c5/0x6f0
[   26.082904]        __lock_acquire+0x6e5/0xc70
[   26.082905]        lock_acquire+0xc7/0x2d0
[   26.082907]        console_lock_spinning_enable+0x5c/0x70
[   26.082909]        console_emit_next_record+0xcf/0x200
[   26.082910]        console_flush_one_record+0x223/0x330
[   26.082911]        console_unlock+0x6d/0x130
[   26.082912]        vprintk_emit+0x187/0x1c0
[   26.082914]        _printk+0x5b/0x80
[   26.082916]        __report_bug+0xf3/0x1f0
[   26.082919]        report_bug+0x2c/0x80
[   26.082921]        handle_bug+0x214/0x250
[   26.082922]        exc_invalid_op+0x17/0x70
[   26.082924]        asm_exc_invalid_op+0x1a/0x20
[   26.082925]        hrtick_start_fair+0x88/0xa0
[   26.082926]        __set_next_task_fair+0x1de/0x210
[   26.082928]        pick_next_task+0x6fc/0x9c0
[   26.082930]        __schedule+0x1a8/0x7d0
[   26.082932]        schedule+0x3a/0xe0
[   26.082934]        worker_thread+0xd0/0x360
[   26.082936]        kthread+0xf4/0x130
[   26.082938]        ret_from_fork+0x29f/0x330
[   26.082940]        ret_from_fork_asm+0x1a/0x30
[   26.082942]
[   26.082942] other info that might help us debug this:
[   26.082942]
[   26.082943] Chain exists of:
[   26.082943]   console_owner --> &p->pi_lock --> &rq->__lock
[   26.082943]
[   26.082945]  Possible unsafe locking scenario:
[   26.082945]
[   26.082945]        CPU0                    CPU1
[   26.082945]        ----                    ----
[   26.082946]   lock(&rq->__lock);
[   26.082946]                                lock(&p->pi_lock);
[   26.082947]                                lock(&rq->__lock);
[   26.082948]   lock(console_owner);
[   26.082949]
[   26.082949]  *** DEADLOCK ***
[   26.082949]
[   26.082949] 3 locks held by kworker/57:1/414:
[   26.082950]  #0: ff11000ffddfdc60 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x53/0xa0
[   26.082953]  #1: ffffffff86871460 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x144/0x1c0
[   26.082956]  #2: ffffffff868714b8 (console_srcu){....}-{0:0}, at: console_flush_one_record+0x7c/0x330
[   26.082959]
[   26.082959] stack backtrace:
[   26.082960] CPU: 57 UID: 0 PID: 414 Comm: kworker/57:1 Not tainted 7.1.0-rc2-00057-gb3a2dfa8b42e #44 PREEMPT(lazy)
[   26.082962] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   26.082963] Workqueue:  0x0 (mm_percpu_wq)
[   26.082966] Call Trace:
[   26.082967]  <TASK>
[   26.082969]  dump_stack_lvl+0x78/0xe0
[   26.082973]  print_circular_bug+0xd5/0xf0
[   26.082976]  check_noncircular+0x14a/0x160
[   26.082979]  ? save_trace+0x56/0x170
[   26.082983]  check_prev_add+0xf1/0xc50
[   26.082987]  validate_chain+0x5c5/0x6f0
[   26.082990]  __lock_acquire+0x6e5/0xc70
[   26.082993]  lock_acquire+0xc7/0x2d0
[   26.082996]  ? console_lock_spinning_enable+0x40/0x70
[   26.082997]  ? __lock_release+0x15b/0x2a0
[   26.082999]  ? console_lock_spinning_enable+0x39/0x70
[   26.083001]  console_lock_spinning_enable+0x5c/0x70
[   26.083003]  ? console_lock_spinning_enable+0x40/0x70
[   26.083004]  console_emit_next_record+0xcf/0x200
[   26.083007]  console_flush_one_record+0x223/0x330
[   26.083010]  console_unlock+0x6d/0x130
[   26.083011]  ? vprintk_emit+0x144/0x1c0
[   26.083014]  vprintk_emit+0x187/0x1c0
[   26.083016]  ? hrtick_start_fair+0x88/0xa0
[   26.083018]  _printk+0x5b/0x80
[   26.083022]  __report_bug+0xf3/0x1f0
[   26.083025]  ? lock_acquire+0xc7/0x2d0
[   26.083028]  ? raw_spin_rq_lock_nested+0x53/0xa0
[   26.083029]  ? hrtick_start_fair+0x88/0xa0
[   26.083031]  report_bug+0x2c/0x80
[   26.083034]  handle_bug+0x214/0x250
[   26.083036]  exc_invalid_op+0x17/0x70
[   26.083037]  asm_exc_invalid_op+0x1a/0x20
[   26.083039] RIP: 0010:hrtick_start_fair+0x88/0xa0
[   26.083041] Code: 29 d0 31 d2 48 89 c6 b8 00 00 10 00 48 f7 f6 48 0f af c1 48 c1 e8 0a 48 89 c6 e9 03 57 ff ff 48 3b 77 18 74 09 e9 58 e8 ba 00 <0f> 0b eb 90 e9 bf 58 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[   26.083042] RSP: 0000:ffa00000022dfda0 EFLAGS: 00010006
[   26.083044] RAX: ff11000ffddfdc00 RBX: ff11000103ba0000 RCX: 0000000000000000
[   26.083045] RDX: 0000000000000038 RSI: ff11000103ba0000 RDI: ff11000ffe1fdc00
[   26.083046] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000014
[   26.083047] R10: 0000000000000001 R11: ff11000ffe1fdd48 R12: ff11000ffe1fdc00
[   26.083047] R13: ff11000103ba00c8 R14: ff110001098080c8 R15: ff11000ffe1fec30
[   26.083051]  __set_next_task_fair+0x1de/0x210
[   26.083054]  pick_next_task+0x6fc/0x9c0
[   26.083058]  __schedule+0x1a8/0x7d0
[   26.083061]  schedule+0x3a/0xe0
[   26.083063]  worker_thread+0xd0/0x360
[   26.083065]  ? __pfx_worker_thread+0x10/0x10
[   26.083068]  kthread+0xf4/0x130
[   26.083069]  ? __pfx_kthread+0x10/0x10
[   26.083071]  ret_from_fork+0x29f/0x330
[   26.083073]  ? __pfx_kthread+0x10/0x10
[   26.083074]  ret_from_fork_asm+0x1a/0x30
[   26.083079]  </TASK>
[   26.234402] WARNING: kernel/sched/fair.c:7617 at hrtick_start_fair+0x88/0xa0, CPU#57: kworker/57:1/414
[   26.235937] Modules linked in:
[   26.236462] CPU: 57 UID: 0 PID: 414 Comm: kworker/57:1 Not tainted 7.1.0-rc2-00057-gb3a2dfa8b42e #44 PREEMPT(lazy)
[   26.238175] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   26.239744] Workqueue:  0x0 (mm_percpu_wq)
[   26.240434] RIP: 0010:hrtick_start_fair+0x88/0xa0
[   26.241225] Code: 29 d0 31 d2 48 89 c6 b8 00 00 10 00 48 f7 f6 48 0f af c1 48 c1 e8 0a 48 89 c6 e9 03 57 ff ff 48 3b 77 18 74 09 e9 58 e8 ba 00 <0f> 0b eb 90 e9 bf 58 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[   26.244282] RSP: 0000:ffa00000022dfda0 EFLAGS: 00010006
[   26.245154] RAX: ff11000ffddfdc00 RBX: ff11000103ba0000 RCX: 0000000000000000
[   26.246338] RDX: 0000000000000038 RSI: ff11000103ba0000 RDI: ff11000ffe1fdc00
[   26.247523] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000014
[   26.248706] R10: 0000000000000001 R11: ff11000ffe1fdd48 R12: ff11000ffe1fdc00
[   26.249890] R13: ff11000103ba00c8 R14: ff110001098080c8 R15: ff11000ffe1fec30
[   26.251078] FS:  0000000000000000(0000) GS:ff11001076d9c000(0000) knlGS:0000000000000000
[   26.252413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.253374] CR2: 00007fcf1a265dc0 CR3: 000000017d5d5005 CR4: 0000000000371ef0
[   26.254561] Call Trace:
[   26.254983]  <TASK>
[   26.255350]  __set_next_task_fair+0x1de/0x210
[   26.256088]  pick_next_task+0x6fc/0x9c0
[   26.256737]  __schedule+0x1a8/0x7d0
[   26.257330]  schedule+0x3a/0xe0
[   26.257869]  worker_thread+0xd0/0x360
[   26.258492]  ? __pfx_worker_thread+0x10/0x10
[   26.259214]  kthread+0xf4/0x130
[   26.259751]  ? __pfx_kthread+0x10/0x10
[   26.260386]  ret_from_fork+0x29f/0x330
[   26.261020]  ? __pfx_kthread+0x10/0x10
[   26.261656]  ret_from_fork_asm+0x1a/0x30
[   26.262321]  </TASK>
[   26.262703] irq event stamp: 1022
[   26.263269] hardirqs last  enabled at (1021): [<ffffffff81eb3cf8>] _raw_spin_unlock_irq+0x28/0x50
[   26.264735] hardirqs last disabled at (1022): [<ffffffff81ea7b5e>] __schedule+0x6be/0x7d0
[   26.266087] softirqs last  enabled at (0): [<ffffffff812a1631>] copy_process+0xa91/0x1bd0
[   26.267436] softirqs last disabled at (0): [<0000000000000000>] 0x0
[   26.268478] ---[ end trace 0000000000000000 ]---
[   26.269297] ------------[ cut here ]------------
[   26.270125] WARNING: kernel/sched/fair.c:6316 at set_next_entity+0x22a/0x290, CPU#56: swapper/56/0
[   26.271635] Modules linked in:
[   26.272169] CPU: 56 UID: 0 PID: 0 Comm: swapper/56 Tainted: G        W           7.1.0-rc2-00057-gb3a2dfa8b42e #44 PREEMPT(lazy)
[   26.274099] Tainted: [W]=WARN
[   26.274617] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   26.276211] RIP: 0010:set_next_entity+0x22a/0x290
[   26.277008] Code: 74 71 48 83 bb 40 01 00 00 00 48 8d 93 40 01 00 00 0f 84 ec fe ff ff 31 f6 48 8b bd 78 01 00 00 e8 6b 04 05 00 e9 d9 fe ff ff <0f> 0b e9 57 fe ff ff 48 c1 e0 14 31 d2 48 f7 f7 e9 5a ff ff ff 48
[   26.280106] RSP: 0018:ffa0000000267e00 EFLAGS: 00010082
[   26.280989] RAX: 00000006127857f5 RBX: ff11000103ba0080 RCX: 00000000001aa2dd
[   26.282176] RDX: 0000000000000000 RSI: ff11000103ba0080 RDI: 00000006127857f5
[   26.283368] RBP: ff1100016d2f4000 R08: ff11000ffddfefc0 R09: ff11000101462f80
[   26.284561] R10: 0000000000000001 R11: 0000000000000000 R12: ff11000ffddfdc00
[   26.285751] R13: ff11000103ba0000 R14: ff11000ffddfdc00 R15: 0000000000000000
[   26.286944] FS:  0000000000000000(0000) GS:ff1100107699c000(0000) knlGS:0000000000000000
[   26.288289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.289252] CR2: 000055c32c010520 CR3: 000000000303e003 CR4: 0000000000371ef0
[   26.290443] Call Trace:
[   26.290870]  <TASK>
[   26.291242]  set_next_task_fair+0x39/0x80
[   26.291927]  pick_next_task+0x6fc/0x9c0
[   26.292581]  ? do_raw_spin_lock+0xae/0xc0
[   26.293263]  ? lock_acquired+0x125/0x160
[   26.293932]  __schedule+0x1a8/0x7d0
[   26.294533]  schedule_idle+0x23/0x40
[   26.295147]  cpu_startup_entry+0x29/0x30
[   26.295815]  start_secondary+0xf8/0x100
[   26.296466]  common_startup_64+0x12c/0x138
[   26.297165]  </TASK>
[   26.297550] irq event stamp: 4354
[   26.298119] hardirqs last  enabled at (4353): [<ffffffff813e6475>] tick_nohz_idle_exit+0x75/0x110
[   26.299597] hardirqs last disabled at (4354): [<ffffffff81ea7b5e>] __schedule+0x6be/0x7d0
[   26.300955] softirqs last  enabled at (4312): [<ffffffff812ae596>] handle_softirqs+0x366/0x440
[   26.302385] softirqs last disabled at (4307): [<ffffffff812ae796>] __irq_exit_rcu+0xb6/0x170
[   26.303789] ---[ end trace 0000000000000000 ]---
[   26.304587] psi: inconsistent task state! task=1305:nop cpu=56 psi_flags=14 clear=0 set=10
[   26.304592] BUG: unable to handle page fault for address: 00000000000012a0
[   26.307136] #PF: supervisor read access in kernel mode
[   26.308008] #PF: error_code(0x0000) - not-present page
[   26.308881] PGD 16e83d067 P4D 17295a067 PUD 102449067 PMD 0
[   26.309834] Oops: Oops: 0000 [#1] SMP NOPTI
[   26.310541] CPU: 56 UID: 1000 PID: 1305 Comm: nop Tainted: G        W           7.1.0-rc2-00057-gb3a2dfa8b42e #44 PREEMPT(lazy)
[   26.312449] Tainted: [W]=WARN
[   26.312964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   26.314544] RIP: 0010:__schedule+0xa0/0x7d0
[   26.315252] Code: fa f6 c4 02 0f 85 39 06 00 00 44 89 f7 e8 98 5e 50 ff 8b bb 28 10 00 00 48 89 ee e8 ca d0 45 ff 48 89 df 31 f6 e8 f0 39 45 ff <44> 8b b3 a0 12 00 00 48 8d 7b 48 45 85 f6 74 0b 48 8b 83 88 12 00
[   26.318334] RSP: 0018:ffa0000004963eb8 EFLAGS: 00010046
[   26.319212] RAX: ff11000101462f80 RBX: 0000000000000000 RCX: 0000000000000000
[   26.320404] RDX: 0000000000000000 RSI: ff11000103ba0000 RDI: 0000000000000000
[   26.321595] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[   26.322521] ------------[ cut here ]------------
[   26.322787] R10: ff11000ffdc18500 R11: 0000000000000000 R12: ff11000103ba0000
[   26.322788] R13: ff11000ffe1fdc00 R14: ffffffff812faf83 R15: 0000000000000000
[   26.323604] RCU not watching for tracepoint
[   26.324798] FS:  00007fcf1a107780(0000) GS:ff1100107699c000(0000) knlGS:0000000000000000
[   26.325969] WARNING: include/trace/events/sched.h:886 at __schedule+0x6af/0x7d0, CPU#11: nop/1312
[   26.326679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.328024] Modules linked in:
[   26.329492] CR2: 00000000000012a0 CR3: 000000017d5d5005 CR4: 0000000000371ef0
[   26.330507]
[   26.331037] Call Trace:
[   26.332257] CPU: 11 UID: 1000 PID: 1312 Comm: nop Tainted: G        W           7.1.0-rc2-00057-gb3a2dfa8b42e #44 PREEMPT(lazy)
[   26.332529]  <TASK>
[   26.332960] Tainted: [W]=WARN
[   26.334872]  ? handle_softirqs+0x366/0x440
[   26.335246] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   26.335761]  schedule+0x3a/0xe0
[   26.336459] RIP: 0003:ktime_get_update_offsets_now+0x141/0x200
[   26.338045]  irqentry_exit+0x2e4/0x770
[   26.338589] RSP: cac0:7fffffffffffffff EFLAGS: 00000006
[   26.339569]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   26.340211]  ORIG_RAX: 0000000615a347d5
[   26.341088] RIP: 0033:0x40110c
[   26.341957] RAX: ff11000ff25ecac0 RBX: ff11000ff25ecac0 RCX: ffffffff812fcaf0
[   26.342610] Code: 7a ff ff ff c6 05 17 2f 00 00 01 5d c3 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa eb 8a 55 48 89 e5 f3 90 <eb> fc 00 00 f3 0f 1e fa 48 83 ec 08 48 83 c4 08 c3 00 00 00 00 00
[   26.343140] RDX: ffffffff813cc47a RSI: 0000000000000006 RDI: 0000000300000000
[   26.344337] RSP: 002b:00007ffc7959d830 EFLAGS: 00000246
[   26.347426] RBP: ffa000000499be60 R08: ff11000ff25ecac0 R09: ff11000ff25fed80
[   26.348613]
[   26.349493] R10: ffffffff812fcb29 R11: ff11000ff25fdc00 R12: ff11000ff25fed80
[   26.350688] RAX: 00007fcf1a2fae28 RBX: 0000000000000000 RCX: 0000000000403e40
[   26.350964] R13: ffffffff812faf83 R14: 0000000000000000 R15: ff11000ff25fdc00
[   26.352158] RDX: 00007ffc7959d968 RSI: 00007ffc7959d958 RDI: 0000000000000001
[   26.353355] FS:  00007fb34f06e780 GS:  0000000000000000
[   26.354548] RBP: 00007ffc7959d830 R08: 00007fcf1a2f3680 R09: 00007fcf1a2f4fe0
[   26.355742] irq event stamp: 1201
[   26.356620] R10: 00007ffc7959d570 R11: 0000000000000203 R12: 00007ffc7959d958
[   26.357817] hardirqs last  enabled at (1201): [<ffffffff8100148a>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   26.358389] R13: 0000000000000001 R14: 00007fcf1a345000 R15: 0000000000403e40
[   26.359586] hardirqs last disabled at (1199): [<ffffffff812ae626>] handle_softirqs+0x3f6/0x440
[   26.361223]  </TASK>
[   26.362416] softirqs last  enabled at (1200): [<ffffffff812ae596>] handle_softirqs+0x366/0x440
[   26.363847] Modules linked in:
[   26.364235] softirqs last disabled at (1195): [<ffffffff812ae796>] __irq_exit_rcu+0xb6/0x170
[   26.365667]
[   26.366198] ---[ end trace 0000000000000000 ]---
[   26.367607] CR2: 00000000000012a0
[   26.369237] ---[ end trace 0000000000000000 ]---
[   26.369245] kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
[   26.370018] RIP: 0010:__schedule+0xa0/0x7d0
[   26.371323] BUG: unable to handle page fault for address: ff1100103fdc6a40
[   26.372029] Code: fa f6 c4 02 0f 85 39 06 00 00 44 89 f7 e8 98 5e 50 ff 8b bb 28 10 00 00 48 89 ee e8 ca d0 45 ff 48 89 df 31 f6 e8 f0 39 45 ff <44> 8b b3 a0 12 00 00 48 8d 7b 48 45 85 f6 74 0b 48 8b 83 88 12 00
[   26.373188] #PF: supervisor instruction fetch in kernel mode
[   26.376272] RSP: 0018:ffa0000004963eb8 EFLAGS: 00010046
[   26.377227] #PF: error_code(0x0011) - permissions violation
[   26.378104] RAX: ff11000101462f80 RBX: 0000000000000000 RCX: 0000000000000000
[   26.379043] PGD 908e067
[   26.380235] RDX: 0000000000000000 RSI: ff11000103ba0000 RDI: 0000000000000000
[   26.380236] P4D 908f067 PUD 80000010000001e3
[   26.380677] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
[   26.381873]
[   26.382617] R10: ff11000ffdc18500 R11: 0000000000000000 R12: ff11000103ba0000
[   26.383814] Oops: Oops: 0011 [#2] SMP NOPTI
[   26.384088] R13: ff11000ffe1fdc00 R14: ffffffff812faf83 R15: 0000000000000000
[   26.385283] CPU: 11 UID: 1000 PID: 1312 Comm: nop Tainted: G      D W           7.1.0-rc2-00057-gb3a2dfa8b42e #44 PREEMPT(lazy)
[   26.385993] FS:  00007fcf1a107780(0000) GS:ff1100107699c000(0000) knlGS:0000000000000000
[   26.387187] Tainted: [D]=DIE, [W]=WARN
[   26.389100] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.390448] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   26.391088] CR2: 00000000000012a0 CR3: 000000017d5d5005 CR4: 0000000000371ef0
[   26.392055] RIP: 0010:0xff1100103fdc6a40
[   26.393640] Kernel panic - not syncing: Fatal exception
[   27.473979] Shutting down cpus with NMI
[   27.476542] Kernel Offset: disabled
[   27.477157] Rebooting in 5 seconds..
QEMU: Terminated



* Re: [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair()
  2026-06-03  9:51   ` Aaron Lu
@ 2026-06-11 11:32     ` Peter Zijlstra
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Zijlstra @ 2026-06-11 11:32 UTC (permalink / raw)
  To: Aaron Lu
  Cc: mingo, longman, chenridong, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, hannes,
	mkoutny, cgroups, linux-kernel, jstultz, kprateek.nayak, qyousef,
	svens


Aaron,

Sorry I failed to notice this email earlier.

On Wed, Jun 03, 2026 at 05:51:08PM +0800, Aaron Lu wrote:

> I applied below diff and the problem is gone:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5f48af700fd44..942a543af3e54 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9897,6 +9897,9 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>  	return p;
>  
>  idle:
> +	if (sched_core_enabled(rq))
> +		return NULL;
> +
>  	new_tasks = sched_balance_newidle(rq, rf);
>  	if (new_tasks < 0)
>  		return RETRY_TASK;
> 

Right, this is the safe patch and restores pick_task_fair() to its
previous status (for core-sched).

Since people are hitting this problem, I'm going to merge it as below.
I've presumed your SoB, please let me know if that's a problem.

I think I'm going to try and move newidle into sched_class::balance /
balance_fair(), but I'll do that next cycle.

Thanks!

---
Subject: sched/fair: Fix newidle vs core-sched
From: "Aaron Lu" <ziqianlu@bytedance.com>
Date: Wed, 3 Jun 2026 17:51:08 +0800

From: "Aaron Lu" <ziqianlu@bytedance.com>

While testing Prateek's throttle series, I noticed a panic when core
scheduling is enabled and bisected it to "sched/fair: Add newidle balance
to pick_task_fair()".

I fed the panic log and that patch to an agent and its analysis looks
correct to me (cpu56 and cpu57 are SMT siblings in a VM):

       cpu57 (holds core-wide lock)

     pick_next_task() [core scheduling]
     for_each_cpu_wrap(i, smt_mask, 57):
       i=57: pick_task(rq_57)
             pick_task_fair(rq_57)
             -> picks task A
       rq_57->core_pick = task A
       // task_rq(A) == rq_57

       i=56: pick_task(rq_56)
             pick_task_fair(rq_56)
             cfs_rq->nr_queued == 0
             goto idle
             sched_balance_newidle(rq_56)
             raw_spin_rq_unlock(rq_56)
             // core-wide lock released
             newidle_balance() pulls
               task A: rq_57 -> rq_56
             // task_rq(A) == rq_56 now
             raw_spin_rq_lock(rq_56)
             // core-wide lock re-acquired
             return > 0
             goto again
             pick_task_fair(rq_56)
             -> picks task A
       rq_56->core_pick = task A

     // first loop done
     // rq_57->core_pick is still task A (set before lock release)
     // but task_rq(A) == rq_56 now
     next = rq_57->core_pick  // = task A

     put_prev_set_next_task(rq_57, prev, task A)
     __set_next_task_fair(rq_57, task A)
     hrtick_start_fair(rq_57, task A)
     WARN_ON_ONCE(task_rq(task A) != rq_57)
     // task_rq(A) == rq_56

IOW: by allowing pick_task_fair() to do newidle_balance and not returning
RETRY_TASK, it can end up selecting the same task on two CPUs. Restore the
previous state by never doing newidle when core scheduling is enabled.

Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: "Aaron Lu" <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260603095108.GA1684319@bytedance.com
---
 kernel/sched/fair.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9942,6 +9942,9 @@ struct task_struct *pick_task_fair(struc
 	return p;
 
 idle:
+	if (sched_core_enabled(rq))
+		return NULL;
+
 	new_tasks = sched_balance_newidle(rq, rf);
 	if (new_tasks < 0)
 		return RETRY_TASK;
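
For readers who want to poke at the failure mode outside the kernel, the
sequence in the changelog above can be approximated with a small user-space
model. This is only a sketch under loud assumptions: every toy_* name is
hypothetical, each runqueue holds at most one task, and there is no locking
at all. The lock drop inside sched_balance_newidle() is modelled simply by
letting the second pick migrate the task that the first pick already
recorded. The apply_fix argument stands in for the sched_core_enabled(rq)
early return added by the patch.

```c
/* User-space model of the double pick; NOT kernel code, all names hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_rq;

struct toy_task {
	struct toy_rq *rq;		/* models task_rq(p) */
};

struct toy_rq {
	struct toy_task *queued;	/* at most one runnable task */
	struct toy_task *core_pick;	/* models rq->core_pick */
};

/* Models the newidle pull: migrate the sibling's only task to @dst. */
static bool toy_newidle_pull(struct toy_rq *dst, struct toy_rq *src)
{
	if (!src->queued)
		return false;

	dst->queued = src->queued;
	src->queued = NULL;
	dst->queued->rq = dst;		/* task_rq(p) now points at @dst */
	return true;
}

/* Models pick_task_fair(): pick locally, else optionally balance. */
static struct toy_task *toy_pick_task(struct toy_rq *rq,
				      struct toy_rq *sibling, bool apply_fix)
{
	if (rq->queued)
		return rq->queued;

	/* Stand-in for the sched_core_enabled(rq) early return. */
	if (apply_fix)
		return NULL;

	if (toy_newidle_pull(rq, sibling))
		return rq->queued;

	return NULL;
}

/*
 * Models the core-wide pick loop over two siblings. Returns true when
 * rq57's recorded pick no longer lives on rq57, which is the condition
 * the WARN in hrtick_start_fair() trips over in the log.
 */
static bool toy_core_pick_is_stale(bool apply_fix)
{
	struct toy_rq rq57 = {0}, rq56 = {0};
	struct toy_task a = { .rq = &rq57 };

	rq57.queued = &a;

	rq57.core_pick = toy_pick_task(&rq57, &rq56, apply_fix);
	rq56.core_pick = toy_pick_task(&rq56, &rq57, apply_fix);

	return rq57.core_pick && rq57.core_pick->rq != &rq57;
}

/* Usage sketch: the unfixed model goes stale, the fixed one does not. */
void toy_selftest(void)
{
	assert(toy_core_pick_is_stale(false));
	assert(!toy_core_pick_is_stale(true));
}
```

Calling toy_selftest() from a main() exercises both paths. The model says
nothing about whether moving newidle into sched_class::balance would also
avoid the problem; it only illustrates why skipping the pull keeps
rq_57->core_pick consistent with task_rq().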

Thread overview: 64+ messages
2026-05-11 11:31 [PATCH v2 00/10] sched: Flatten the pick Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 01/10] sched/debug: Use char * instead of char (*)[] Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 02/10] sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 03/10] sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sections Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 04/10] sched/fair: Add cgroup_mode switch Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 05/10] sched/fair: Add cgroup_mode: UP Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 06/10] sched/fair: Add cgroup_mode: MAX Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 07/10] sched/fair: Add cgroup_mode: CONCUR Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 08/10] sched/fair: Add newidle balance to pick_task_fair() Peter Zijlstra
2026-05-12  5:37   ` K Prateek Nayak
2026-05-12  9:45     ` Peter Zijlstra
2026-05-19 15:13   ` Vincent Guittot
2026-06-03  9:51   ` Aaron Lu
2026-06-11 11:32     ` Peter Zijlstra
2026-05-11 11:31 ` [PATCH v2 09/10] sched: Remove sched_class::pick_next_task() Peter Zijlstra
2026-05-19 15:14   ` Vincent Guittot
2026-05-11 11:31 ` [PATCH v2 10/10] sched/eevdf: Move to a single runqueue Peter Zijlstra
2026-05-11 16:21   ` K Prateek Nayak
2026-05-12 11:09     ` Peter Zijlstra
2026-05-13  7:01       ` K Prateek Nayak
2026-05-13  7:25         ` Peter Zijlstra
2026-05-13  4:51   ` John Stultz
2026-05-13  5:00     ` John Stultz
2026-05-14  1:36       ` John Stultz
2026-05-14  2:53         ` K Prateek Nayak
2026-05-14  3:14           ` John Stultz
2026-05-19 10:38   ` Vincent Guittot
2026-05-20 16:32     ` Vincent Guittot
2026-05-21  2:57       ` K Prateek Nayak
2026-05-21  7:56         ` Vincent Guittot
2026-05-21 10:31       ` Peter Zijlstra
2026-05-21 12:13         ` Vincent Guittot
2026-05-21 13:29           ` Peter Zijlstra
2026-05-21 13:44             ` Vincent Guittot
2026-05-21 14:01             ` Peter Zijlstra
2026-05-21 13:21         ` Peter Zijlstra
2026-05-21 13:39         ` Peter Zijlstra
2026-05-21 13:56           ` Vincent Guittot
2026-05-26  7:53   ` Zhang Qiao
2026-05-26  9:15     ` K Prateek Nayak
2026-05-26  9:36       ` Zhang Qiao
2026-05-26  9:52       ` Peter Zijlstra
2026-05-26 10:54         ` K Prateek Nayak
2026-05-26 11:07           ` Peter Zijlstra
2026-05-26 12:40             ` Peter Zijlstra
2026-05-11 19:23 ` [PATCH v2 00/10] sched: Flatten the pick Tejun Heo
2026-05-12  8:10   ` Peter Zijlstra
2026-05-12 18:45     ` Tejun Heo
2026-05-18  7:14       ` Peter Zijlstra
2026-05-18 19:11         ` Tejun Heo
2026-05-27  9:41           ` Peter Zijlstra
2026-05-12  8:42 ` Vincent Guittot
2026-05-12  9:20   ` Peter Zijlstra
2026-05-12 18:24     ` Peter Zijlstra
2026-05-12 18:25       ` Peter Zijlstra
2026-05-12 18:32         ` Vincent Guittot
2026-05-13  7:25           ` Peter Zijlstra
2026-05-13 11:35   ` Peter Zijlstra
2026-05-13 12:43     ` Peter Zijlstra
2026-05-18 13:34     ` Vincent Guittot
2026-05-18 21:12       ` Peter Zijlstra
2026-05-19 10:13         ` Vincent Guittot
2026-05-19 16:00           ` Vincent Guittot
2026-05-16  3:30 ` Qais Yousef
