* [PATCH 00/24] Complete EEVDF
@ 2024-07-27 10:27 Peter Zijlstra
2024-07-27 10:27 ` [PATCH 01/24] sched/eevdf: Add feature comments Peter Zijlstra
` (31 more replies)
0 siblings, 32 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Hi all,
So after much delay this is hopefully the final version of the EEVDF patches.
They've been sitting in my git tree forever it seems, and people have been
testing them and sending fixes.
I've spent the last two days testing and fixing cfs-bandwidth, and as far
as I know that was the very last issue holding it back.
These patches apply on top of queue.git sched/dl-server, which I plan to merge
into tip/sched/core once -rc1 drops.
I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
Aside from a ton of bug fixes -- thanks all! -- new in this version is:
- split up the huge delay-dequeue patch
- tested/fixed cfs-bandwidth
- PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
- SCHED_BATCH is equivalent to RESPECT_SLICE
- propagate min_slice up cgroups
- CLOCK_THREAD_DVFS_ID
* [PATCH 01/24] sched/eevdf: Add feature comments
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 02/24] sched/eevdf: Remove min_vruntime_copy Peter Zijlstra
` (30 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/features.h | 7 +++++++
1 file changed, 7 insertions(+)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,7 +5,14 @@
* sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
*/
SCHED_FEAT(PLACE_LAG, true)
+/*
+ * Give new tasks half a slice to ease into the competition.
+ */
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+/*
+ * Inhibit (wakeup) preemption until the current task has either matched the
+ * 0-lag point or until it has exhausted its slice.
+ */
SCHED_FEAT(RUN_TO_PARITY, true)
/*
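For readers new to these knobs, a minimal editorial sketch of what
PLACE_DEADLINE_INITIAL amounts to when an entity is placed for the first
time; this is simplified from place_entity() (the real code works on a
weight-scaled virtual slice), and RUN_TO_PARITY is enforced in the
wakeup-preemption path, which is not part of this patch:

/* Editorial sketch, not part of the patch: effect of PLACE_DEADLINE_INITIAL. */
static unsigned long long initial_vdeadline_sketch(unsigned long long vruntime,
						   unsigned long long vslice,
						   int initial_enqueue)
{
	if (initial_enqueue)		/* ENQUEUE_INITIAL with the feature on */
		vslice /= 2;		/* only ask for half a slice at first */
	return vruntime + vslice;	/* virtual deadline of this request */
}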
* [PATCH 02/24] sched/eevdf: Remove min_vruntime_copy
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
2024-07-27 10:27 ` [PATCH 01/24] sched/eevdf: Add feature comments Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 03/24] sched/fair: Cleanup pick_task_fair() vs throttle Peter Zijlstra
` (29 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Since commit e8f331bcc270 ("sched/smp: Use lag to simplify
cross-runqueue placement") the min_vruntime_copy is no longer used.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 5 ++---
kernel/sched/sched.h | 4 ----
2 files changed, 2 insertions(+), 7 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,8 +780,7 @@ static void update_min_vruntime(struct c
}
/* ensure we never gain time by being placed backwards. */
- u64_u32_store(cfs_rq->min_vruntime,
- __update_min_vruntime(cfs_rq, vruntime));
+ cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
}
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -12876,7 +12875,7 @@ static void set_next_task_fair(struct rq
void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
- u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
+ cfs_rq->min_vruntime = (u64)(-(1LL << 20));
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -599,10 +599,6 @@ struct cfs_rq {
u64 min_vruntime_fi;
#endif
-#ifndef CONFIG_64BIT
- u64 min_vruntime_copy;
-#endif
-
struct rb_root_cached tasks_timeline;
/*
* [PATCH 03/24] sched/fair: Cleanup pick_task_fair() vs throttle
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
2024-07-27 10:27 ` [PATCH 01/24] sched/eevdf: Add feature comments Peter Zijlstra
2024-07-27 10:27 ` [PATCH 02/24] sched/eevdf: Remove min_vruntime_copy Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 04/24] sched/fair: Cleanup pick_task_fair()s curr Peter Zijlstra
` (28 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Per 54d27365cae8 ("sched/fair: Prevent throttling in early
pick_next_task_fair()") the reason check_cfs_rq_runtime() is under the
'if (curr)' check is to ensure the (downward) traversal does not
result in an empty cfs_rq.
But then the pick_task_fair() 'copy' of all this made it restart the
traversal anyway, so that seems to solve the issue too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8435,11 +8435,11 @@ static struct task_struct *pick_task_fai
update_curr(cfs_rq);
else
curr = NULL;
-
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto again;
}
+ if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+ goto again;
+
se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
* [PATCH 04/24] sched/fair: Cleanup pick_task_fair()s curr
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (2 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 03/24] sched/fair: Cleanup pick_task_fair() vs throttle Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] sched/fair: Cleanup pick_task_fair()'s curr tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 05/24] sched/fair: Unify pick_{,next_}_task_fair() Peter Zijlstra
` (27 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
With 4c456c9ad334 ("sched/fair: Remove unused 'curr' argument from
pick_next_entity()") curr is no longer being used, so no point in
clearing it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8427,15 +8427,9 @@ static struct task_struct *pick_task_fai
return NULL;
do {
- struct sched_entity *curr = cfs_rq->curr;
-
/* When we pick for a remote RQ, we'll not have done put_prev_entity() */
- if (curr) {
- if (curr->on_rq)
- update_curr(cfs_rq);
- else
- curr = NULL;
- }
+ if (cfs_rq->curr && cfs_rq->curr->on_rq)
+ update_curr(cfs_rq);
if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto again;
* [PATCH 05/24] sched/fair: Unify pick_{,next_}_task_fair()
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (3 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 04/24] sched/fair: Cleanup pick_task_fair()s curr Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 06/24] sched: Allow sched_class::dequeue_task() to fail Peter Zijlstra
` (26 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Implement pick_next_task_fair() in terms of pick_task_fair() to
de-duplicate the pick loop.
More importantly, this makes all the pick loops use the
state-invariant form, which is useful to introduce further re-try
conditions in later patches.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 60 ++++++----------------------------------------------
1 file changed, 8 insertions(+), 52 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8415,7 +8415,6 @@ static void check_preempt_wakeup_fair(st
resched_curr(rq);
}
-#ifdef CONFIG_SMP
static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
@@ -8427,7 +8426,7 @@ static struct task_struct *pick_task_fai
return NULL;
do {
- /* When we pick for a remote RQ, we'll not have done put_prev_entity() */
+ /* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);
@@ -8440,19 +8439,19 @@ static struct task_struct *pick_task_fai
return task_of(se);
}
-#endif
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
again:
- if (!sched_fair_runnable(rq))
+ p = pick_task_fair(rq);
+ if (!p)
goto idle;
+ se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED
if (!prev || prev->sched_class != &fair_sched_class)
@@ -8464,52 +8463,14 @@ pick_next_task_fair(struct rq *rq, struc
*
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
- */
-
- do {
- struct sched_entity *curr = cfs_rq->curr;
-
- /*
- * Since we got here without doing put_prev_entity() we also
- * have to consider cfs_rq->curr. If it is still a runnable
- * entity, update_curr() will update its vruntime, otherwise
- * forget we've ever seen it.
- */
- if (curr) {
- if (curr->on_rq)
- update_curr(cfs_rq);
- else
- curr = NULL;
-
- /*
- * This call to check_cfs_rq_runtime() will do the
- * throttle and dequeue its entity in the parent(s).
- * Therefore the nr_running test will indeed
- * be correct.
- */
- if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
- cfs_rq = &rq->cfs;
-
- if (!cfs_rq->nr_running)
- goto idle;
-
- goto simple;
- }
- }
-
- se = pick_next_entity(cfs_rq);
- cfs_rq = group_cfs_rq(se);
- } while (cfs_rq);
-
- p = task_of(se);
-
- /*
+ *
* Since we haven't yet done put_prev_entity and if the selected task
* is a different task than we started out with, try and touch the
* least amount of cfs_rqs.
*/
if (prev != p) {
struct sched_entity *pse = &prev->se;
+ struct cfs_rq *cfs_rq;
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
@@ -8535,13 +8496,8 @@ pick_next_task_fair(struct rq *rq, struc
if (prev)
put_prev_task(rq, prev);
- do {
- se = pick_next_entity(cfs_rq);
- set_next_entity(cfs_rq, se);
- cfs_rq = group_cfs_rq(se);
- } while (cfs_rq);
-
- p = task_of(se);
+ for_each_sched_entity(se)
+ set_next_entity(cfs_rq_of(se), se);
done: __maybe_unused;
#ifdef CONFIG_SMP
* [PATCH 06/24] sched: Allow sched_class::dequeue_task() to fail
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (4 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 05/24] sched/fair: Unify pick_{,next_}_task_fair() Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
` (25 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Change the function signature of sched_class::dequeue_task() to return
a boolean, allowing future patches to 'fail' dequeue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 7 +++++--
kernel/sched/deadline.c | 4 +++-
kernel/sched/fair.c | 4 +++-
kernel/sched/idle.c | 3 ++-
kernel/sched/rt.c | 4 +++-
kernel/sched/sched.h | 4 ++--
kernel/sched/stop_task.c | 3 ++-
7 files changed, 20 insertions(+), 9 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2001,7 +2001,10 @@ void enqueue_task(struct rq *rq, struct
sched_core_enqueue(rq, p);
}
-void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+/*
+ * Must only return false when DEQUEUE_SLEEP.
+ */
+inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (sched_core_enabled(rq))
sched_core_dequeue(rq, p, flags);
@@ -2015,7 +2018,7 @@ void dequeue_task(struct rq *rq, struct
}
uclamp_rq_dec(rq, p);
- p->sched_class->dequeue_task(rq, p, flags);
+ return p->sched_class->dequeue_task(rq, p, flags);
}
void activate_task(struct rq *rq, struct task_struct *p, int flags)
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2162,7 +2162,7 @@ static void enqueue_task_dl(struct rq *r
enqueue_pushable_dl_task(rq, p);
}
-static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
@@ -2172,6 +2172,8 @@ static void dequeue_task_dl(struct rq *r
dequeue_dl_entity(&p->dl, flags);
if (!p->dl.dl_throttled && !dl_server(&p->dl))
dequeue_pushable_dl_task(rq, p);
+
+ return true;
}
/*
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6865,7 +6865,7 @@ static void set_next_buddy(struct sched_
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
@@ -6937,6 +6937,8 @@ static void dequeue_task_fair(struct rq
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+
+ return true;
}
#ifdef CONFIG_SMP
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -482,13 +482,14 @@ struct task_struct *pick_next_task_idle(
* It is not legal to sleep in the idle task - print a warning
* message if some code attempts to do it:
*/
-static void
+static bool
dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
raw_spin_rq_unlock_irq(rq);
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
raw_spin_rq_lock_irq(rq);
+ return true;
}
/*
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1483,7 +1483,7 @@ enqueue_task_rt(struct rq *rq, struct ta
enqueue_pushable_task(rq, p);
}
-static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
@@ -1491,6 +1491,8 @@ static void dequeue_task_rt(struct rq *r
dequeue_rt_entity(rt_se, flags);
dequeue_pushable_task(rq, p);
+
+ return true;
}
/*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2285,7 +2285,7 @@ struct sched_class {
#endif
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
- void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
+ bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
@@ -3606,7 +3606,7 @@ extern int __sched_setaffinity(struct ta
extern void __setscheduler_prio(struct task_struct *p, int prio);
extern void set_load_weight(struct task_struct *p, bool update_load);
extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
-extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags);
+extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
extern void check_class_changed(struct rq *rq, struct task_struct *p,
const struct sched_class *prev_class,
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -57,10 +57,11 @@ enqueue_task_stop(struct rq *rq, struct
add_nr_running(rq, 1);
}
-static void
+static bool
dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
{
sub_nr_running(rq, 1);
+ return true;
}
static void yield_task_stop(struct rq *rq)
* [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair()
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (5 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 06/24] sched: Allow sched_class::dequeue_task() to fail Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-09 16:53 ` Valentin Schneider
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 08/24] sched: Split DEQUEUE_SLEEP from deactivate_task() Peter Zijlstra
` (24 subsequent siblings)
31 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Working towards delaying dequeue, notably also inside the hierarchy,
rework dequeue_task_fair() such that it can 'resume' an interrupted
hierarchy walk.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 40 insertions(+), 21 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6861,34 +6861,43 @@ enqueue_task_fair(struct rq *rq, struct
static void set_next_buddy(struct sched_entity *se);
/*
- * The dequeue_task method is called before nr_running is
- * decreased. We remove the task from the rbtree and
- * update the fair scheduling stats:
+ * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
+ * failing half-way through and resume the dequeue later.
+ *
+ * Returns:
+ * -1 - dequeue delayed
+ * 0 - dequeue throttled
+ * 1 - dequeue complete
*/
-static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
{
- struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
- int task_sleep = flags & DEQUEUE_SLEEP;
- int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
int rq_h_nr_running = rq->cfs.h_nr_running;
+ bool task_sleep = flags & DEQUEUE_SLEEP;
+ struct task_struct *p = NULL;
+ int idle_h_nr_running = 0;
+ int h_nr_running = 0;
+ struct cfs_rq *cfs_rq;
- util_est_dequeue(&rq->cfs, p);
+ if (entity_is_task(se)) {
+ p = task_of(se);
+ h_nr_running = 1;
+ idle_h_nr_running = task_has_idle_policy(p);
+ }
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running--;
+ cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
- goto dequeue_throttle;
+ return 0;
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
@@ -6912,20 +6921,18 @@ static bool dequeue_task_fair(struct rq
se_update_runnable(se);
update_cfs_group(se);
- cfs_rq->h_nr_running--;
+ cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
- goto dequeue_throttle;
-
+ return 0;
}
- /* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, 1);
+ sub_nr_running(rq, h_nr_running);
if (rq_h_nr_running && !rq->cfs.h_nr_running)
dl_server_stop(&rq->fair_server);
@@ -6934,10 +6941,22 @@ static bool dequeue_task_fair(struct rq
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies;
-dequeue_throttle:
- util_est_update(&rq->cfs, p, task_sleep);
- hrtick_update(rq);
+ return 1;
+}
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+ util_est_dequeue(&rq->cfs, p);
+ if (dequeue_entities(rq, &p->se, flags) < 0)
+ return false;
+
+ util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
+ hrtick_update(rq);
return true;
}
* [PATCH 08/24] sched: Split DEQUEUE_SLEEP from deactivate_task()
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (6 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 09/24] sched: Prepare generic code for delayed dequeue Peter Zijlstra
` (23 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
As a preparation for dequeue_task() failing, and a second code-path
needing to take care of the 'success' path, split out the DEQUEUE_SLEEP
path from deactivate_task().
Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING
ordering fail.
Fixed-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 23 +++++++++++++----------
kernel/sched/sched.h | 14 ++++++++++++++
2 files changed, 27 insertions(+), 10 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2036,12 +2036,23 @@ void activate_task(struct rq *rq, struct
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
- WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
+ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+ /*
+ * Code explicitly relies on TASK_ON_RQ_MIGRATING being set *before*
+ * dequeue_task() and cleared *after* enqueue_task().
+ */
+
dequeue_task(rq, p, flags);
}
+static void block_task(struct rq *rq, struct task_struct *p, int flags)
+{
+ if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
+ __block_task(rq, p);
+}
+
/**
* task_curr - is this task currently executing on a CPU?
* @p: the task in question.
@@ -6486,9 +6497,6 @@ static void __sched notrace __schedule(u
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
@@ -6500,12 +6508,7 @@ static void __sched notrace __schedule(u
*
* After this, schedule() must not care about p->state any more.
*/
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
+ block_task(rq, prev, DEQUEUE_NOCLOCK);
}
switch_count = &prev->nvcsw;
}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -68,6 +68,7 @@
#include <linux/wait_api.h>
#include <linux/wait_bit.h>
#include <linux/workqueue_api.h>
+#include <linux/delayacct.h>
#include <trace/events/power.h>
#include <trace/events/sched.h>
@@ -2591,6 +2592,19 @@ static inline void sub_nr_running(struct
sched_update_tick_dependency(rq);
}
+static inline void __block_task(struct rq *rq, struct task_struct *p)
+{
+ WRITE_ONCE(p->on_rq, 0);
+ ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+}
+
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
* [PATCH 09/24] sched: Prepare generic code for delayed dequeue
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (7 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 08/24] sched: Split DEQUEUE_SLEEP from deactivate_task() Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 10/24] sched/uclamg: Handle " Peter Zijlstra
` (22 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
While most of the delayed dequeue code can be done inside the
sched_class itself, there is one location where we do not have an
appropriate hook, namely ttwu_runnable().
Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking
delayed dequeue tasks.
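For orientation, the wakeup path this hook covers looks roughly like the
trace below (editorial illustration, not part of the patch;
requeue_delayed_entity() is what a later patch in this series wires up for
ENQUEUE_DELAYED):

/*
 * Editorial trace of the path the hunk below touches:
 *
 *   try_to_wake_up(p)
 *     ttwu_runnable(p, wake_flags)
 *       if (task_on_rq_queued(p))         // a delayed task never left the rq
 *         update_rq_clock(rq);
 *         if (p->se.sched_delayed)
 *           enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
 *             // later handled by requeue_delayed_entity(), which clears
 *             // sched_delayed while leaving p on the runqueue
 *         if (!task_on_cpu(rq, p))
 *           wakeup_preempt(rq, p, wake_flags);
 */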
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 17 ++++++++++++++++-
kernel/sched/sched.h | 2 ++
3 files changed, 19 insertions(+), 1 deletion(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -542,6 +542,7 @@ struct sched_entity {
struct list_head group_node;
unsigned int on_rq;
+ unsigned int sched_delayed;
u64 exec_start;
u64 sum_exec_runtime;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2036,6 +2036,8 @@ void activate_task(struct rq *rq, struct
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ SCHED_WARN_ON(flags & DEQUEUE_SLEEP);
+
WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
@@ -3677,12 +3679,14 @@ static int ttwu_runnable(struct task_str
rq = __task_rq_lock(p, &rf);
if (task_on_rq_queued(p)) {
+ update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
* it should preempt the task that is current now.
*/
- update_rq_clock(rq);
wakeup_preempt(rq, p, wake_flags);
}
ttwu_do_wakeup(p);
@@ -4062,11 +4069,16 @@ int try_to_wake_up(struct task_struct *p
* case the whole 'p->on_rq && ttwu_runnable()' case below
* without taking any locks.
*
+ * Specifically, given current runs ttwu() we must be before
+ * schedule()'s block_task(), as such this must not observe
+ * sched_delayed.
+ *
* In particular:
* - we rely on Program-Order guarantees for all the ordering,
* - we're serialized against set_special_state() by virtue of
* it disabling IRQs (this allows not taking ->pi_lock).
*/
+ SCHED_WARN_ON(p->se.sched_delayed);
if (!ttwu_state_match(p, state, &success))
goto out;
@@ -4358,6 +4370,9 @@ static void __sched_fork(unsigned long c
p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);
+ /* A delayed task cannot be in clone(). */
+ SCHED_WARN_ON(p->se.sched_delayed);
+
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2258,6 +2258,7 @@ extern const u32 sched_prio_to_wmult[40
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
+#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
@@ -2273,6 +2274,7 @@ extern const u32 sched_prio_to_wmult[40
#endif
#define ENQUEUE_INITIAL 0x80
#define ENQUEUE_MIGRATING 0x100
+#define ENQUEUE_DELAYED 0x200
#define RETRY_TASK ((void *)-1UL)
* [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (8 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 09/24] sched: Prepare generic code for delayed dequeue Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
` (2 more replies)
2024-07-27 10:27 ` [PATCH 11/24] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced Peter Zijlstra
` (21 subsequent siblings)
31 siblings, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault,
Luis Machado, Hongyan Xia
Delayed dequeue has tasks sit around on the runqueue that are not
actually runnable -- specifically, they will be dequeued the moment
they get picked.
One side-effect is that such a task can get migrated, which leads to a
'nested' dequeue_task() scenario that messes up uclamp if we don't
take care.
Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
the runqueue. This however will have removed the task from uclamp --
per uclamp_rq_dec() in dequeue_task(). So far so good.
However, if at that point the task gets migrated -- or nice adjusted
or any of a myriad of other operations that do a dequeue-enqueue cycle --
we'll pass through dequeue_task()/enqueue_task() again. Without
modification this will lead to a double decrement for uclamp, which is
wrong.
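Spelled out, the sequence being guarded against looks like this (editorial
illustration, not part of the patch):

/*
 * Editorial illustration of the double decrement described above:
 *
 *   dequeue_task(rq, p, DEQUEUE_SLEEP)
 *     uclamp_rq_dec(rq, p)              // uclamp buckets decremented once
 *     ->dequeue_task() returns false    // p stays queued, sched_delayed = 1
 *
 *   // later, while p is still sched_delayed (migration, nice change, ...):
 *   dequeue_task(rq, p, flags)
 *     uclamp_rq_dec(rq, p)              // second decrement for the same task
 *
 * With the hunks below, uclamp_rq_{inc,dec}() ignore a sched_delayed task,
 * so the intermediate dequeue/enqueue cycles of a still-delayed task leave
 * the uclamp buckets alone.
 */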
Reported-by: Luis Machado <luis.machado@arm.com>
Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
if (unlikely(!p->sched_class->uclamp_enabled))
return;
+ if (p->se.sched_delayed)
+ return;
+
for_each_clamp_id(clamp_id)
uclamp_rq_inc_id(rq, p, clamp_id);
@@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
if (unlikely(!p->sched_class->uclamp_enabled))
return;
+ if (p->se.sched_delayed)
+ return;
+
for_each_clamp_id(clamp_id)
uclamp_rq_dec_id(rq, p, clamp_id);
}
@@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}
- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
+ /*
+ * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
+ * ->sched_delayed.
+ */
+ uclamp_rq_inc(rq, p);
if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
@@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
+ /*
+ * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
+ * and mark the task ->sched_delayed.
+ */
uclamp_rq_dec(rq, p);
return p->sched_class->dequeue_task(rq, p, flags);
}
* [PATCH 11/24] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (9 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 10/24] sched/uclamg: Handle " Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue Peter Zijlstra
` (20 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Just a little sanity test..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5432,6 +5432,7 @@ set_next_entity(struct cfs_rq *cfs_rq, s
}
update_stats_curr_start(cfs_rq, se);
+ SCHED_WARN_ON(cfs_rq->curr);
cfs_rq->curr = se;
/*
@@ -5493,6 +5494,7 @@ static void put_prev_entity(struct cfs_r
/* in !on_rq case, update occurred at dequeue */
update_load_avg(cfs_rq, prev, 0);
}
+ SCHED_WARN_ON(cfs_rq->curr != prev);
cfs_rq->curr = NULL;
}
* [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (10 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 11/24] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
` (2 more replies)
2024-07-27 10:27 ` [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue Peter Zijlstra
` (19 subsequent siblings)
31 siblings, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
When dequeue_task() is delayed it becomes possible to exit a task (or
cgroup) that is still enqueued. Ensure things are dequeued before
freeing.
NOTE: switched_from_fair() causes spurious wakeups, because it clears
sched_delayed only after the task, which should have been dequeued, has
already been enqueued in another class. This *should* be harmless.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 48 insertions(+), 13 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8318,7 +8318,20 @@ static void migrate_task_rq_fair(struct
static void task_dead_fair(struct task_struct *p)
{
- remove_entity_load_avg(&p->se);
+ struct sched_entity *se = &p->se;
+
+ if (se->sched_delayed) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+ if (se->sched_delayed)
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ task_rq_unlock(rq, p, &rf);
+ }
+
+ remove_entity_load_avg(se);
}
/*
@@ -12817,10 +12830,26 @@ static void attach_task_cfs_rq(struct ta
static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
detach_task_cfs_rq(p);
+ /*
+ * Since this is called after changing class, this isn't quite right.
+ * Specifically, this causes the task to get queued in the target class
+ * and experience a 'spurious' wakeup.
+ *
+ * However, since 'spurious' wakeups are harmless, this shouldn't be a
+ * problem.
+ */
+ p->se.sched_delayed = 0;
+ /*
+ * While here, also clear the vlag, it makes little sense to carry that
+ * over the excursion into the new class.
+ */
+ p->se.vlag = 0;
}
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
+ SCHED_WARN_ON(p->se.sched_delayed);
+
attach_task_cfs_rq(p);
set_task_max_allowed_capacity(p);
@@ -12971,28 +13000,33 @@ void online_fair_sched_group(struct task
void unregister_fair_sched_group(struct task_group *tg)
{
- unsigned long flags;
- struct rq *rq;
int cpu;
destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
for_each_possible_cpu(cpu) {
- if (tg->se[cpu])
- remove_entity_load_avg(tg->se[cpu]);
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+ struct sched_entity *se = tg->se[cpu];
+ struct rq *rq = cpu_rq(cpu);
+
+ if (se) {
+ if (se->sched_delayed) {
+ guard(rq_lock_irqsave)(rq);
+ if (se->sched_delayed)
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
+ remove_entity_load_avg(se);
+ }
/*
* Only empty task groups can be destroyed; so we can speculatively
* check on_list without danger of it being re-added.
*/
- if (!tg->cfs_rq[cpu]->on_list)
- continue;
-
- rq = cpu_rq(cpu);
-
- raw_spin_rq_lock_irqsave(rq, flags);
- list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
- raw_spin_rq_unlock_irqrestore(rq, flags);
+ if (cfs_rq->on_list) {
+ guard(rq_lock_irqsave)(rq);
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
}
}
* [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (11 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-09-10 9:16 ` [PATCH 13/24] " Luis Machado
2024-07-27 10:27 ` [PATCH 14/24] sched/fair: Implement ENQUEUE_DELAYED Peter Zijlstra
` (18 subsequent siblings)
31 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Delayed dequeue's natural end is when it gets picked again. Ensure
pick_next_task() knows what to do with delayed tasks.
Note, this relies on the earlier patch that made pick_next_task()
state invariant -- it will restart the pick on dequeue, because
obviously the just dequeued task is no longer eligible.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5453,6 +5453,8 @@ set_next_entity(struct cfs_rq *cfs_rq, s
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
+static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
+
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
@@ -5461,16 +5463,27 @@ set_next_entity(struct cfs_rq *cfs_rq, s
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
*/
if (sched_feat(NEXT_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ SCHED_WARN_ON(cfs_rq->next->sched_delayed);
return cfs_rq->next;
+ }
+
+ struct sched_entity *se = pick_eevdf(cfs_rq);
+ if (se->sched_delayed) {
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ SCHED_WARN_ON(se->sched_delayed);
+ SCHED_WARN_ON(se->on_rq);
- return pick_eevdf(cfs_rq);
+ return NULL;
+ }
+ return se;
}
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -8478,7 +8491,9 @@ static struct task_struct *pick_task_fai
if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto again;
- se = pick_next_entity(cfs_rq);
+ se = pick_next_entity(rq, cfs_rq);
+ if (!se)
+ goto again;
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
* [PATCH 14/24] sched/fair: Implement ENQUEUE_DELAYED
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (12 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 15/24] sched,freezer: Mark TASK_FROZEN special Peter Zijlstra
` (17 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Doing a wakeup on a delayed dequeue task is about as simple as it
sounds -- remove the delayed mark and enjoy the fact it was actually
still on the runqueue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++--
1 file changed, 35 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,6 +5290,9 @@ static inline int cfs_rq_throttled(struc
static inline bool cfs_bandwidth_used(void);
static void
+requeue_delayed_entity(struct sched_entity *se);
+
+static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool curr = cfs_rq->curr == se;
@@ -5922,8 +5925,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- if (se->on_rq)
+ if (se->on_rq) {
+ SCHED_WARN_ON(se->sched_delayed);
break;
+ }
enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
if (cfs_rq_is_idle(group_cfs_rq(se)))
@@ -6773,6 +6778,22 @@ static int sched_idle_cpu(int cpu)
}
#endif
+static void
+requeue_delayed_entity(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ /*
+ * se->sched_delayed should imply se->on_rq == 1, because a delayed
+ * entity is one that is still on the runqueue, competing until it
+ * becomes eligible again.
+ */
+ SCHED_WARN_ON(!se->sched_delayed);
+ SCHED_WARN_ON(!se->on_rq);
+
+ se->sched_delayed = 0;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -6787,6 +6812,11 @@ enqueue_task_fair(struct rq *rq, struct
int task_new = !(flags & ENQUEUE_WAKEUP);
int rq_h_nr_running = rq->cfs.h_nr_running;
+ if (flags & ENQUEUE_DELAYED) {
+ requeue_delayed_entity(se);
+ return;
+ }
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6804,8 +6834,11 @@ enqueue_task_fair(struct rq *rq, struct
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) {
- if (se->on_rq)
+ if (se->on_rq) {
+ if (se->sched_delayed)
+ requeue_delayed_entity(se);
break;
+ }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
* [PATCH 15/24] sched,freezer: Mark TASK_FROZEN special
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (13 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 14/24] sched/fair: Implement ENQUEUE_DELAYED Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 16/24] sched: Teach dequeue_task() about special task states Peter Zijlstra
` (16 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
The special task states are those that do not suffer spurious wakeups;
TASK_FROZEN is very much one of those. Mark it as such.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 5 +++--
kernel/freezer.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -147,8 +147,9 @@ struct user_event_mm;
* Special states are those that do not use the normal wait-loop pattern. See
* the comment with set_special_state().
*/
-#define is_special_task_state(state) \
- ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))
+#define is_special_task_state(state) \
+ ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
+ TASK_DEAD | TASK_FROZEN))
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -72,7 +72,7 @@ bool __refrigerator(bool check_kthr_stop
bool freeze;
raw_spin_lock_irq(&current->pi_lock);
- set_current_state(TASK_FROZEN);
+ WRITE_ONCE(current->__state, TASK_FROZEN);
/* unstale saved_state so that __thaw_task() will wake us up */
current->saved_state = TASK_RUNNING;
raw_spin_unlock_irq(&current->pi_lock);
* [PATCH 16/24] sched: Teach dequeue_task() about special task states
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (14 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 15/24] sched,freezer: Mark TASK_FROZEN special Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
` (15 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Since special task states must not suffer spurious wakeups, and the
proposed delayed dequeue can cause exactly these (under some boundary
conditions), propagate this knowledge into dequeue_task() such that it
can do the right thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 7 ++++++-
kernel/sched/sched.h | 3 ++-
2 files changed, 8 insertions(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6521,11 +6521,16 @@ static void __sched notrace __schedule(u
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
+ int flags = DEQUEUE_NOCLOCK;
+
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
+ if (unlikely(is_special_task_state(prev_state)))
+ flags |= DEQUEUE_SPECIAL;
+
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
@@ -6537,7 +6542,7 @@ static void __sched notrace __schedule(u
*
* After this, schedule() must not care about p->state any more.
*/
- block_task(rq, prev, DEQUEUE_NOCLOCK);
+ block_task(rq, prev, flags);
}
switch_count = &prev->nvcsw;
}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2254,10 +2254,11 @@ extern const u32 sched_prio_to_wmult[40
*
*/
-#define DEQUEUE_SLEEP 0x01
+#define DEQUEUE_SLEEP 0x01 /* Matches ENQUEUE_WAKEUP */
#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_SPECIAL 0x10
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
* [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (15 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 16/24] sched: Teach dequeue_task() about special task states Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-02 14:39 ` Valentin Schneider
` (4 more replies)
2024-07-27 10:27 ` [PATCH 18/24] sched/fair: Implement DELAY_ZERO Peter Zijlstra
` (14 subsequent siblings)
31 siblings, 5 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
noting that lag is fundamentally a temporal measure. It should not be
carried around indefinitely.
OTOH it should also not be instantly discarded; doing so would allow a
task to game the system by purposefully (micro) sleeping at the end of
its time quantum.
Since lag is intimately tied to the virtual time base, a wall-time based
decay is also insufficient; notably, competition is required for any of
this to make sense.
Instead, delay the dequeue and keep the 'tasks' on the runqueue,
competing until they are eligible.
Strictly speaking, we only care about keeping them until the 0-lag point,
but that is a difficult proposition; instead, carry them around until they
get picked again and dequeue them at that point.
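For reference, the test that ends the delay, stripped of the load-weighted
fixed-point arithmetic that avg_vruntime()/entity_eligible() actually use
(editorial sketch, not part of the patch):

/* Editorial sketch, not part of the patch: simplified eligibility test. */
static int eligible_sketch(unsigned long long avg_vruntime,	/* V   */
			   unsigned long long se_vruntime)	/* v_i */
{
	/*
	 * lag_i = V - v_i in virtual time. A delayed task does not run, so
	 * v_i stands still while V advances; its negative lag 'burns off'
	 * until this test returns true, at which point the next pick will
	 * select and dequeue it.
	 */
	return (long long)(avg_vruntime - se_vruntime) >= 0;
}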
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/deadline.c | 1
kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 9 +++++
3 files changed, 81 insertions(+), 11 deletions(-)
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2428,7 +2428,6 @@ static struct task_struct *__pick_next_t
else
p = dl_se->server_pick_next(dl_se);
if (!p) {
- WARN_ON_ONCE(1);
dl_se->dl_yielded = 1;
update_curr_dl_se(rq, dl_se, 0);
goto again;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5379,20 +5379,44 @@ static void clear_buddies(struct cfs_rq
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-static void
+static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
- int action = UPDATE_TG;
+ if (flags & DEQUEUE_DELAYED) {
+ /*
+ * DEQUEUE_DELAYED is typically called from pick_next_entity()
+ * at which point we've already done update_curr() and do not
+ * want to do so again.
+ */
+ SCHED_WARN_ON(!se->sched_delayed);
+ se->sched_delayed = 0;
+ } else {
+ bool sleep = flags & DEQUEUE_SLEEP;
+
+ /*
+ * DELAY_DEQUEUE relies on spurious wakeups, special task
+ * states must not suffer spurious wakeups, exempt them.
+ */
+ if (flags & DEQUEUE_SPECIAL)
+ sleep = false;
+
+ SCHED_WARN_ON(sleep && se->sched_delayed);
+ update_curr(cfs_rq);
+ if (sched_feat(DELAY_DEQUEUE) && sleep &&
+ !entity_eligible(cfs_rq, se)) {
+ if (cfs_rq->next == se)
+ cfs_rq->next = NULL;
+ se->sched_delayed = 1;
+ return false;
+ }
+ }
+
+ int action = UPDATE_TG;
if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
action |= DO_DETACH;
/*
- * Update run-time statistics of the 'current'.
- */
- update_curr(cfs_rq);
-
- /*
* When dequeuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - For group_entity, update its runnable_weight to reflect the new
@@ -5430,6 +5454,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
if (cfs_rq->nr_running == 0)
update_idle_cfs_rq_clock_pelt(cfs_rq);
+
+ return true;
}
static void
@@ -5828,11 +5854,21 @@ static bool throttle_cfs_rq(struct cfs_r
idle_task_delta = cfs_rq->idle_h_nr_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+ int flags;
+
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
goto done;
- dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+ /*
+ * Abuse SPECIAL to avoid delayed dequeue in this instance.
+ * This avoids teaching dequeue_entities() about throttled
+ * entities and keeps things relatively simple.
+ */
+ flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
+ if (se->sched_delayed)
+ flags |= DEQUEUE_DELAYED;
+ dequeue_entity(qcfs_rq, se, flags);
if (cfs_rq_is_idle(group_cfs_rq(se)))
idle_task_delta = cfs_rq->h_nr_running;
@@ -6918,6 +6954,7 @@ static int dequeue_entities(struct rq *r
bool was_sched_idle = sched_idle_rq(rq);
int rq_h_nr_running = rq->cfs.h_nr_running;
bool task_sleep = flags & DEQUEUE_SLEEP;
+ bool task_delayed = flags & DEQUEUE_DELAYED;
struct task_struct *p = NULL;
int idle_h_nr_running = 0;
int h_nr_running = 0;
@@ -6931,7 +6968,13 @@ static int dequeue_entities(struct rq *r
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- dequeue_entity(cfs_rq, se, flags);
+
+ if (!dequeue_entity(cfs_rq, se, flags)) {
+ if (p && &p->se == se)
+ return -1;
+
+ break;
+ }
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
@@ -6956,6 +6999,7 @@ static int dequeue_entities(struct rq *r
break;
}
flags |= DEQUEUE_SLEEP;
+ flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
}
for_each_sched_entity(se) {
@@ -6985,6 +7029,17 @@ static int dequeue_entities(struct rq *r
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies;
+ if (p && task_delayed) {
+ SCHED_WARN_ON(!task_sleep);
+ SCHED_WARN_ON(p->on_rq != 1);
+
+ /* Fix-up what dequeue_task_fair() skipped */
+ hrtick_update(rq);
+
+ /* Fix-up what block_task() skipped. */
+ __block_task(rq, p);
+ }
+
return 1;
}
/*
@@ -6996,8 +7051,10 @@ static bool dequeue_task_fair(struct rq
{
util_est_dequeue(&rq->cfs, p);
- if (dequeue_entities(rq, &p->se, flags) < 0)
+ if (dequeue_entities(rq, &p->se, flags) < 0) {
+ util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
return false;
+ }
util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
hrtick_update(rq);
@@ -12973,6 +13030,11 @@ static void set_next_task_fair(struct rq
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
+
+ if (!first)
+ return;
+
+ SCHED_WARN_ON(se->sched_delayed);
}
void init_cfs_rq(struct cfs_rq *cfs_rq)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,6 +29,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
SCHED_FEAT(CACHE_HOT_BUDDY, true)
/*
+ * Delay dequeueing tasks until they get selected or woken.
+ *
+ * By delaying the dequeue for non-eligible tasks, they remain in the
+ * competition and can burn off their negative lag. When they get selected
+ * they'll have positive lag by definition.
+ */
+SCHED_FEAT(DELAY_DEQUEUE, true)
+
+/*
* Allow wakeup-time preemption of the current task:
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
* [PATCH 18/24] sched/fair: Implement DELAY_ZERO
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (16 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE Peter Zijlstra
` (13 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeue delayed
tasks at the 0-lag point, truncate their lag (e.g. don't let them earn
positive lag).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 16 ++++++++++++++++
kernel/sched/features.h | 3 +++
2 files changed, 19 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5529,6 +5529,8 @@ pick_next_entity(struct rq *rq, struct c
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
SCHED_WARN_ON(se->sched_delayed);
SCHED_WARN_ON(se->on_rq);
+ if (sched_feat(DELAY_ZERO) && se->vlag > 0)
+ se->vlag = 0;
return NULL;
}
@@ -6827,6 +6829,20 @@ requeue_delayed_entity(struct sched_enti
SCHED_WARN_ON(!se->sched_delayed);
SCHED_WARN_ON(!se->on_rq);
+ if (sched_feat(DELAY_ZERO)) {
+ update_entity_lag(cfs_rq, se);
+ if (se->vlag > 0) {
+ cfs_rq->nr_running--;
+ if (se != cfs_rq->curr)
+ __dequeue_entity(cfs_rq, se);
+ se->vlag = 0;
+ place_entity(cfs_rq, se, 0);
+ if (se != cfs_rq->curr)
+ __enqueue_entity(cfs_rq, se);
+ cfs_rq->nr_running++;
+ }
+ }
+
se->sched_delayed = 0;
}
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -34,8 +34,11 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
* By delaying the dequeue for non-eligible tasks, they remain in the
* competition and can burn off their negative lag. When they get selected
* they'll have positive lag by definition.
+ *
+ * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
*/
SCHED_FEAT(DELAY_DEQUEUE, true)
+SCHED_FEAT(DELAY_ZERO, true)
/*
* Allow wakeup-time preemption of the current task:
* [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (17 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 18/24] sched/fair: Implement DELAY_ZERO Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 20/24] sched/fair: Avoid re-setting virtual deadline on migrations Peter Zijlstra
` (12 subsequent siblings)
31 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Note that tasks that are kept on the runqueue to burn off negative
lag are not in fact runnable anymore; they'll get dequeued the moment
they get picked.
As such, don't count this time towards runnable.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 2 ++
kernel/sched/sched.h | 6 ++++++
2 files changed, 8 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5388,6 +5388,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
if (cfs_rq->next == se)
cfs_rq->next = NULL;
se->sched_delayed = 1;
+ update_load_avg(cfs_rq, se, 0);
return false;
}
}
@@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
}
se->sched_delayed = 0;
+ update_load_avg(cfs_rq, se, 0);
}
/*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -816,6 +816,9 @@ static inline void se_update_runnable(st
static inline long se_runnable(struct sched_entity *se)
{
+ if (se->sched_delayed)
+ return false;
+
if (entity_is_task(se))
return !!se->on_rq;
else
@@ -830,6 +833,9 @@ static inline void se_update_runnable(st
static inline long se_runnable(struct sched_entity *se)
{
+ if (se->sched_delayed)
+ return false;
+
return !!se->on_rq;
}
^ permalink raw reply [flat|nested] 277+ messages in thread
* [PATCH 20/24] sched/fair: Avoid re-setting virtual deadline on migrations
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (18 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] sched/fair: Avoid re-setting virtual deadline on 'migrations' tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
` (11 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
During OSPM24 Youssef noted that migrations are re-setting the virtual
deadline. Notably everything that does a dequeue-enqueue, like setting
nice, changing preferred numa-node, and a myriad of other random crap,
will cause this to happen.
This shouldn't be. Preserve the relative virtual deadline across such
dequeue/enqueue cycles.
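For illustration, a toy sketch of the bookkeeping added below -- not kernel
code, and it ignores reweighting: on a non-sleep dequeue the deadline is
stored relative to vruntime, and on the next placement it is re-anchored to
the new vruntime, so the entity keeps its remaining request instead of
getting a fresh virtual deadline:

#include <stdio.h>

int main(void)
{
        /* toy virtual times, arbitrary units */
        long long vruntime = 50, deadline = 56; /* 6 units of request left */

        /* dequeue (not a sleep): make the deadline relative */
        long long rel_deadline = deadline - vruntime;

        /* re-enqueue after e.g. a nice change; placement picks a new vruntime */
        long long new_vruntime = 200;
        long long new_deadline = new_vruntime + rel_deadline;

        printf("old: v=%lld d=%lld  new: v=%lld d=%lld (still %lld of request left)\n",
               vruntime, deadline, new_vruntime, new_deadline,
               new_deadline - new_vruntime);
        return 0;
}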
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 6 ++++--
kernel/sched/fair.c | 23 ++++++++++++++++++-----
kernel/sched/features.h | 4 ++++
3 files changed, 26 insertions(+), 7 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -542,8 +542,10 @@ struct sched_entity {
u64 min_vruntime;
struct list_head group_node;
- unsigned int on_rq;
- unsigned int sched_delayed;
+ unsigned char on_rq;
+ unsigned char sched_delayed;
+ unsigned char rel_deadline;
+ /* hole */
u64 exec_start;
u64 sum_exec_runtime;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5270,6 +5270,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
se->vruntime = vruntime - lag;
+ if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+ se->deadline += se->vruntime;
+ se->rel_deadline = 0;
+ return;
+ }
+
/*
* When joining the competition; the existing tasks will be,
* on average, halfway through their slice, as such start tasks
@@ -5382,6 +5388,8 @@ static __always_inline void return_cfs_r
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
+ bool sleep = flags & DEQUEUE_SLEEP;
+
if (flags & DEQUEUE_DELAYED) {
/*
* DEQUEUE_DELAYED is typically called from pick_next_entity()
@@ -5391,19 +5399,18 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
SCHED_WARN_ON(!se->sched_delayed);
se->sched_delayed = 0;
} else {
- bool sleep = flags & DEQUEUE_SLEEP;
-
+ bool delay = sleep;
/*
* DELAY_DEQUEUE relies on spurious wakeups, special task
* states must not suffer spurious wakeups, excempt them.
*/
if (flags & DEQUEUE_SPECIAL)
- sleep = false;
+ delay = false;
- SCHED_WARN_ON(sleep && se->sched_delayed);
+ SCHED_WARN_ON(delay && se->sched_delayed);
update_curr(cfs_rq);
- if (sched_feat(DELAY_DEQUEUE) && sleep &&
+ if (sched_feat(DELAY_DEQUEUE) && delay &&
!entity_eligible(cfs_rq, se)) {
if (cfs_rq->next == se)
cfs_rq->next = NULL;
@@ -5434,6 +5441,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
clear_buddies(cfs_rq, se);
update_entity_lag(cfs_rq, se);
+ if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
+ se->deadline -= se->vruntime;
+ se->rel_deadline = 1;
+ }
+
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
@@ -13024,6 +13036,7 @@ static void switched_from_fair(struct rq
* over the excursion into the new class.
*/
p->se.vlag = 0;
+ p->se.rel_deadline = 0;
}
static void switched_to_fair(struct rq *rq, struct task_struct *p)
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -10,6 +10,10 @@ SCHED_FEAT(PLACE_LAG, true)
*/
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
/*
+ * Preserve relative virtual deadline on 'migration'.
+ */
+SCHED_FEAT(PLACE_REL_DEADLINE, true)
+/*
* Inhibit (wakeup) preemption until the current task has either matched the
* 0-lag point or until is has exhausted it's slice.
*/
^ permalink raw reply [flat|nested] 277+ messages in thread
* [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (19 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 20/24] sched/fair: Avoid re-setting virtual deadline on migrations Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-05 12:24 ` Chunxin Zang
` (2 more replies)
2024-07-27 10:27 ` [PATCH 22/24] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Peter Zijlstra
` (10 subsequent siblings)
31 siblings, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault,
Mike Galbraith
Part of the reason to have shorter slices is to improve
responsiveness. Allow shorter slices to preempt longer slices on
wakeup.
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
100ms massive_intr 500us cyclictest PREEMPT_SHORT
1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
As per the numbers, this makes cyclictest's (short slice) max-delay
more consistent, and that consistency drops the sum-delay. The
trade-off is that massive_intr (long slice) gets more context
switches and a slight increase in sum-delay.
[mike: numbers]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 5 +++
2 files changed, 61 insertions(+), 8 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -973,10 +973,10 @@ static void clear_buddies(struct cfs_rq
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -993,10 +993,7 @@ static void update_deadline(struct cfs_r
/*
* The task has consumed its request, reschedule.
*/
- if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
- clear_buddies(cfs_rq, se);
- }
+ return true;
}
#include "pelt.h"
@@ -1134,6 +1131,38 @@ static inline void update_curr_task(stru
dl_server_update(p->dl_server, delta_exec);
}
+static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (curr->vlag == curr->deadline)
+ return false;
+
+ return !entity_eligible(cfs_rq, curr);
+}
+
+static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (pse->slice >= se->slice)
+ return false;
+
+ if (!entity_eligible(cfs_rq, pse))
+ return false;
+
+ if (entity_before(pse, se))
+ return true;
+
+ if (!entity_eligible(cfs_rq, se))
+ return true;
+
+ return false;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -1157,6 +1186,7 @@ static void update_curr(struct cfs_rq *c
struct sched_entity *curr = cfs_rq->curr;
struct rq *rq = rq_of(cfs_rq);
s64 delta_exec;
+ bool resched;
if (unlikely(!curr))
return;
@@ -1166,7 +1196,7 @@ static void update_curr(struct cfs_rq *c
return;
curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
@@ -1184,6 +1214,14 @@ static void update_curr(struct cfs_rq *c
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ if (rq->nr_running == 1)
+ return;
+
+ if (resched || did_preempt_short(cfs_rq, curr)) {
+ resched_curr(rq);
+ clear_buddies(cfs_rq, curr);
+ }
}
static void update_curr_fair(struct rq *rq)
@@ -8611,7 +8649,17 @@ static void check_preempt_wakeup_fair(st
cfs_rq = cfs_rq_of(se);
update_curr(cfs_rq);
/*
- * XXX pick_eevdf(cfs_rq) != se ?
+ * If @p has a shorter slice than current and @p is eligible, override
+ * current's slice protection in order to allow preemption.
+ *
+ * Note that even if @p does not turn out to be the most eligible
+ * task at this moment, current's slice protection will be lost.
+ */
+ if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline)
+ se->vlag = se->deadline + 1;
+
+ /*
+ * If @p has become the most eligible task, force preemption.
*/
if (pick_eevdf(cfs_rq) == pse)
goto preempt;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -18,6 +18,11 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
* 0-lag point or until is has exhausted it's slice.
*/
SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for
+ * current.
+ */
+SCHED_FEAT(PREEMPT_SHORT, true)
/*
* Prefer to schedule the task we woke last (assuming it failed
^ permalink raw reply [flat|nested] 277+ messages in thread
* [PATCH 22/24] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (20 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy Peter Zijlstra
` (9 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.
The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms],
which is 1/10 of the tick length at HZ=1000 and 10 times the tick
length at HZ=100.
Applications should strive to use a high-confidence (95%+) estimate of
their periodic runtime as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.
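For illustration, a minimal userspace sketch (not part of this patch) of how
an application might suggest its slice through this interface. It assumes a
kernel with this series applied; struct sched_attr is copied by hand (the
VER0 layout) since older libc headers don't ship it, and the kernel clamps
the value to the range above:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

struct sched_attr {
        __u32 size;
        __u32 sched_policy;
        __u64 sched_flags;
        __s32 sched_nice;
        __u32 sched_priority;
        __u64 sched_runtime;            /* the slice suggestion, in ns */
        __u64 sched_deadline;
        __u64 sched_period;
};

int main(void)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = 0;                  /* SCHED_OTHER, nice 0 */
        attr.sched_runtime = 2 * 1000 * 1000;   /* ask for a 2ms slice */

        /* pid 0 == current thread, flags == 0 */
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
                perror("sched_setattr");
        return 0;
}

Reading it back with sched_getattr() should return the (possibly clamped)
value per the get_params() hunk below, and the debug output gains an 'S'
marker for tasks with a custom slice.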
For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:
{A,B,C,D}(w=1,r=8):
ABCD...
+---+---+---+---
t=0, V=1.5 t=1, V=3.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+*------+-------+--- ---+--*----+-------+---
t=2, V=5.5 t=3, V=7.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+----*--+-------+--- ---+------*+-------+---
Note: 4 identical tasks in FIFO order
~~~
{A,B}(w=1,r=16) C(w=2,r=16)
AACCBBCC...
+---+---+---+---
t=0, V=1.25 t=2, V=5.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---
t=4, V=8.25 t=6, V=12.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+-------*-------+--- ---+-------+---*---+---
Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
task doesn't go below q.
Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.
Note: the period of the heavy task is half the full period at:
W*(r_i/w_i) = 4*(2q/2) = 4q
~~~
{A,C,D}(w=1,r=16) B(w=1,r=8):
BAACCBDD...
+---+---+---+---
t=0, V=1.5 t=1, V=3.5
A |--------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+*------+-------+--- ---+--*----+-------+---
t=3, V=7.5 t=5, V=11.5
A |---------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+------*+-------+--- ---+-------+--*----+---
t=6, V=13.5
A |---------------<
B |------<
C |--------------<
D |--------------<
---+-------+----*--+---
Note: 1 short task -- again double r so that the deadline of the short task
won't be below q. Made B short because it's not the leftmost task, but is
eligible with the 0,1,2,3 spread.
Note: like with the heavy task, the period of the short task observes:
W*(r_i/w_i) = 4*(1q/1) = 4q
~~~
A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)
BCCAABCC...
+---+---+---+---
t=0, V=1.25 t=1, V=3.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+--*----+-------+---
t=3, V=7.25 t=5, V=11.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---
t=6, V=13.25
A |--------------<
B |------<
C |------<
---+-------+----*--+---
Note: 1 heavy and 1 short task -- combine them all.
Note: both the short and heavy task end up with a period of 4q
~~~
A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)
BBCAABBC...
+---+---+---+---
t=0, V=1 t=2, V=5
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---
t=3, V=7 t=5, V=11
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---
t=7, V=15
A |--------------<
B |------<
C |------<
---+-------+------*+---
Note: as before but permuted
~~~
From all this it can be deduced that, for the steady state (a quick
numeric check follows the list):
- the total period (P) of a schedule is: W*max(r_i/w_i)
- the average period of a task is: W*(r_i/w_i)
- each task obtains the fair share: w_i/W of each full period P
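As a quick numeric check of these formulas -- toy arithmetic only, using the
last example above with q=8:

#include <stdio.h>

int main(void)
{
        /* A(w=1,r=16) B(w=2,r=16) C(w=1,r=8), q=8 */
        int w[] = { 1, 2, 1 };
        int r[] = { 16, 16, 8 };
        int n = 3, q = 8, W = 0, max_rw = 0;

        for (int i = 0; i < n; i++) {
                W += w[i];
                if (r[i] / w[i] > max_rw)
                        max_rw = r[i] / w[i];
        }

        printf("P = W*max(r_i/w_i) = %d = %dq\n", W * max_rw, W * max_rw / q);
        for (int i = 0; i < n; i++)
                printf("%c: period = W*(r_i/w_i) = %2d = %dq, share = %d/%d of P\n",
                       'A' + i, W * (r[i] / w[i]), W * (r[i] / w[i]) / q,
                       w[i], W);
        return 0;
}

This prints P = 64 = 8q, a period of 8q for A and 4q for both B and C, and
shares of 1/4, 2/4 and 1/4 -- matching the notes on the schedules above.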
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 4 +++-
kernel/sched/debug.c | 3 ++-
kernel/sched/fair.c | 6 ++++--
kernel/sched/syscalls.c | 29 +++++++++++++++++++++++------
5 files changed, 33 insertions(+), 10 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,7 @@ struct sched_entity {
unsigned char on_rq;
unsigned char sched_delayed;
unsigned char rel_deadline;
+ unsigned char custom_slice;
/* hole */
u64 exec_start;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4347,7 +4347,6 @@ static void __sched_fork(unsigned long c
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
- p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);
/* A delayed task cannot be in clone(). */
@@ -4600,6 +4599,8 @@ int sched_fork(unsigned long clone_flags
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;
/*
* We don't need the reset flag anymore after the fork. It has
@@ -8328,6 +8329,7 @@ void __init sched_init(void)
}
set_load_weight(&init_task, false);
+ init_task.se.slice = sysctl_sched_base_slice,
/*
* The boot idle thread does lazy MMU switching as well:
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -580,11 +580,12 @@ print_task(struct seq_file *m, struct rq
else
SEQ_printf(m, " %c", task_state_to_char(p));
- SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
+ SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
p->comm, task_pid_nr(p),
SPLIT_NS(p->se.vruntime),
entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
SPLIT_NS(p->se.deadline),
+ p->se.custom_slice ? 'S' : ' ',
SPLIT_NS(p->se.slice),
SPLIT_NS(p->se.sum_exec_runtime),
(long long)(p->nvcsw + p->nivcsw),
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -995,7 +995,8 @@ static void update_deadline(struct cfs_r
* nice) while the request time r_i is determined by
* sysctl_sched_base_slice.
*/
- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;
/*
* EEVDF: vd_i = ve_i + r_i / w_i
@@ -5190,7 +5191,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
u64 vslice, vruntime = avg_vruntime(cfs_rq);
s64 lag = 0;
- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;
vslice = calc_delta_fair(se->slice, se);
/*
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -401,10 +401,20 @@ static void __setscheduler_params(struct
p->policy = policy;
- if (dl_policy(policy))
+ if (dl_policy(policy)) {
__setparam_dl(p, attr);
- else if (fair_policy(policy))
+ } else if (fair_policy(policy)) {
p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+ if (attr->sched_runtime) {
+ p->se.custom_slice = 1;
+ p->se.slice = clamp_t(u64, attr->sched_runtime,
+ NSEC_PER_MSEC/10, /* HZ=1000 * 10 */
+ NSEC_PER_MSEC*100); /* HZ=100 / 10 */
+ } else {
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;
+ }
+ }
/*
* __sched_setscheduler() ensures attr->sched_priority == 0 when
@@ -700,7 +710,9 @@ int __sched_setscheduler(struct task_str
* but store a possible modification of reset_on_fork.
*/
if (unlikely(policy == p->policy)) {
- if (fair_policy(policy) && attr->sched_nice != task_nice(p))
+ if (fair_policy(policy) &&
+ (attr->sched_nice != task_nice(p) ||
+ (attr->sched_runtime != p->se.slice)))
goto change;
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
@@ -846,6 +858,9 @@ static int _sched_setscheduler(struct ta
.sched_nice = PRIO_TO_NICE(p->static_prio),
};
+ if (p->se.custom_slice)
+ attr.sched_runtime = p->se.slice;
+
/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
@@ -1012,12 +1027,14 @@ static int sched_copy_attr(struct sched_
static void get_params(struct task_struct *p, struct sched_attr *attr)
{
- if (task_has_dl_policy(p))
+ if (task_has_dl_policy(p)) {
__getparam_dl(p, attr);
- else if (task_has_rt_policy(p))
+ } else if (task_has_rt_policy(p)) {
attr->sched_priority = p->rt_priority;
- else
+ } else {
attr->sched_nice = task_nice(p);
+ attr->sched_runtime = p->se.slice;
+ }
}
/**
^ permalink raw reply [flat|nested] 277+ messages in thread
* [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (21 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 22/24] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-09-29 2:02 ` [PATCH 23/24] " Tianchen Ding
2024-07-27 10:27 ` [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID Peter Zijlstra
` (8 subsequent siblings)
31 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
In the absence of an explicit cgroup slice configuration, make mixed
slice lengths work with cgroups by propagating the min_slice up the
hierarchy.
This ensures the cgroup entity itself gets timely service, so that it
can in turn service its entities that have this timing constraint set
on them.
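For illustration, a toy sketch of the invariant being maintained -- not
kernel code; the real implementation keeps min_slice up to date incrementally
in the augmented rbtree callbacks and also considers cfs_rq->curr, see
cfs_rq_min_slice() below. Every node's min_slice is the minimum of its own
slice and those of its children, so the root sees the tightest slice in the
subtree and the group entity can be enqueued with it:

#include <stdio.h>

struct node {
        unsigned long long slice;       /* this entity's own slice, ns */
        unsigned long long min_slice;   /* min over the subtree */
        struct node *left, *right;
};

static unsigned long long update_min_slice(struct node *n)
{
        unsigned long long m, l, r;

        if (!n)
                return ~0ULL;

        m = n->slice;
        l = update_min_slice(n->left);
        r = update_min_slice(n->right);

        if (l < m)
                m = l;
        if (r < m)
                m = r;
        n->min_slice = m;
        return m;
}

int main(void)
{
        struct node a = { 3000000, 0, NULL, NULL };     /* 3ms slice */
        struct node b = {  500000, 0, NULL, NULL };     /* 0.5ms custom slice */
        struct node root = { 3000000, 0, &a, &b };

        /* the group entity would be enqueued with this slice */
        printf("root min_slice = %llu ns\n", update_min_slice(&root));
        return 0;
}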
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 1
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 57 insertions(+), 1 deletion(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -542,6 +542,7 @@ struct sched_entity {
struct rb_node run_node;
u64 deadline;
u64 min_vruntime;
+ u64 min_slice;
struct list_head group_node;
unsigned char on_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,6 +782,21 @@ static void update_min_vruntime(struct c
cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
}
+static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *root = __pick_root_entity(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ u64 min_slice = ~0ULL;
+
+ if (curr && curr->on_rq)
+ min_slice = curr->slice;
+
+ if (root)
+ min_slice = min(min_slice, root->min_slice);
+
+ return min_slice;
+}
+
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
{
return entity_before(__node_2_se(a), __node_2_se(b));
@@ -798,19 +813,34 @@ static inline void __min_vruntime_update
}
}
+static inline void __min_slice_update(struct sched_entity *se, struct rb_node *node)
+{
+ if (node) {
+ struct sched_entity *rse = __node_2_se(node);
+ if (rse->min_slice < se->min_slice)
+ se->min_slice = rse->min_slice;
+ }
+}
+
/*
* se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
{
u64 old_min_vruntime = se->min_vruntime;
+ u64 old_min_slice = se->min_slice;
struct rb_node *node = &se->run_node;
se->min_vruntime = se->vruntime;
__min_vruntime_update(se, node->rb_right);
__min_vruntime_update(se, node->rb_left);
- return se->min_vruntime == old_min_vruntime;
+ se->min_slice = se->slice;
+ __min_slice_update(se, node->rb_right);
+ __min_slice_update(se, node->rb_left);
+
+ return se->min_vruntime == old_min_vruntime &&
+ se->min_slice == old_min_slice;
}
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
@@ -823,6 +853,7 @@ static void __enqueue_entity(struct cfs_
{
avg_vruntime_add(cfs_rq, se);
se->min_vruntime = se->vruntime;
+ se->min_slice = se->slice;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
__entity_less, &min_vruntime_cb);
}
@@ -6917,6 +6948,7 @@ enqueue_task_fair(struct rq *rq, struct
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
int rq_h_nr_running = rq->cfs.h_nr_running;
+ u64 slice = 0;
if (flags & ENQUEUE_DELAYED) {
requeue_delayed_entity(se);
@@ -6946,7 +6978,18 @@ enqueue_task_fair(struct rq *rq, struct
break;
}
cfs_rq = cfs_rq_of(se);
+
+ /*
+ * Basically set the slice of group entries to the min_slice of
+ * their respective cfs_rq. This ensures the group can service
+ * its entities in the desired time-frame.
+ */
+ if (slice) {
+ se->slice = slice;
+ se->custom_slice = 1;
+ }
enqueue_entity(cfs_rq, se, flags);
+ slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -6968,6 +7011,9 @@ enqueue_task_fair(struct rq *rq, struct
se_update_runnable(se);
update_cfs_group(se);
+ se->slice = slice;
+ slice = cfs_rq_min_slice(cfs_rq);
+
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7033,11 +7079,15 @@ static int dequeue_entities(struct rq *r
int idle_h_nr_running = 0;
int h_nr_running = 0;
struct cfs_rq *cfs_rq;
+ u64 slice = 0;
if (entity_is_task(se)) {
p = task_of(se);
h_nr_running = 1;
idle_h_nr_running = task_has_idle_policy(p);
+ } else {
+ cfs_rq = group_cfs_rq(se);
+ slice = cfs_rq_min_slice(cfs_rq);
}
for_each_sched_entity(se) {
@@ -7062,6 +7112,8 @@ static int dequeue_entities(struct rq *r
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
+ slice = cfs_rq_min_slice(cfs_rq);
+
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
/*
@@ -7083,6 +7135,9 @@ static int dequeue_entities(struct rq *r
se_update_runnable(se);
update_cfs_group(se);
+ se->slice = slice;
+ slice = cfs_rq_min_slice(cfs_rq);
+
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
^ permalink raw reply [flat|nested] 277+ messages in thread
* [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (22 preceding siblings ...)
2024-07-27 10:27 ` [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy Peter Zijlstra
@ 2024-07-27 10:27 ` Peter Zijlstra
2024-07-28 21:30 ` Thomas Gleixner
` (2 more replies)
2024-08-01 12:08 ` [PATCH 00/24] Complete EEVDF Luis Machado
` (7 subsequent siblings)
31 siblings, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-07-27 10:27 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
In order to measure thread time in a DVFS world, introduce
CLOCK_THREAD_DVFS_ID -- a copy of CLOCK_THREAD_CPUTIME_ID that slows
down with both DVFS scaling and CPU capacity.
The clock does *NOT* support setting timers.
Useful for both SCHED_DEADLINE and the newly introduced
sched_attr::sched_runtime usage for SCHED_NORMAL.
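For illustration, a minimal userspace sketch (not part of this patch)
comparing the new clock with CLOCK_THREAD_CPUTIME_ID. It assumes a kernel
carrying this patch (note tglx's review below: the RFC as posted still needs
his delta for clock_gettime() to succeed); CLOCK_THREAD_DVFS_ID is 12 per
the uapi hunk, and libc should fall back to the syscall for a clockid the
vDSO doesn't know. On a frequency- or capacity-limited CPU the DVFS time is
expected to advance more slowly than the regular thread cputime:

#include <stdio.h>
#include <time.h>

#ifndef CLOCK_THREAD_DVFS_ID
#define CLOCK_THREAD_DVFS_ID    12      /* from the uapi hunk below */
#endif

int main(void)
{
        struct timespec cpu, dvfs;
        volatile unsigned long i;

        /* burn some CPU time on this thread */
        for (i = 0; i < 200000000UL; i++)
                ;

        if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu) ||
            clock_gettime(CLOCK_THREAD_DVFS_ID, &dvfs)) {
                perror("clock_gettime");
                return 1;
        }

        printf("cputime: %ld.%09ld  dvfs time: %ld.%09ld\n",
               (long)cpu.tv_sec, cpu.tv_nsec,
               (long)dvfs.tv_sec, dvfs.tv_nsec);
        return 0;
}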
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/posix-timers_types.h | 5 ++--
include/linux/sched.h | 1
include/linux/sched/cputime.h | 3 ++
include/uapi/linux/time.h | 1
kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 8 +++++--
kernel/time/posix-cpu-timers.c | 16 +++++++++++++-
kernel/time/posix-timers.c | 1
kernel/time/posix-timers.h | 1
9 files changed, 71 insertions(+), 5 deletions(-)
--- a/include/linux/posix-timers_types.h
+++ b/include/linux/posix-timers_types.h
@@ -13,9 +13,9 @@
*
* Bit 2 indicates whether a cpu clock refers to a thread or a process.
*
- * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or FD=3.
+ * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or DVSF=3
*
- * A clockid is invalid if bits 2, 1, and 0 are all set.
+ * (DVFS is PERTHREAD only)
*/
#define CPUCLOCK_PID(clock) ((pid_t) ~((clock) >> 3))
#define CPUCLOCK_PERTHREAD(clock) \
@@ -27,6 +27,7 @@
#define CPUCLOCK_PROF 0
#define CPUCLOCK_VIRT 1
#define CPUCLOCK_SCHED 2
+#define CPUCLOCK_DVFS 3
#define CPUCLOCK_MAX 3
#define CLOCKFD CPUCLOCK_MAX
#define CLOCKFD_MASK (CPUCLOCK_PERTHREAD_MASK|CPUCLOCK_CLOCK_MASK)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,6 +550,7 @@ struct sched_entity {
u64 exec_start;
u64 sum_exec_runtime;
u64 prev_sum_exec_runtime;
+ u64 sum_dvfs_runtime;
u64 vruntime;
s64 vlag;
u64 slice;
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -180,4 +180,7 @@ static inline void prev_cputime_init(str
extern unsigned long long
task_sched_runtime(struct task_struct *task);
+extern unsigned long long
+task_sched_dvfs_runtime(struct task_struct *task);
+
#endif /* _LINUX_SCHED_CPUTIME_H */
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -62,6 +62,7 @@ struct timezone {
*/
#define CLOCK_SGI_CYCLE 10
#define CLOCK_TAI 11
+#define CLOCK_THREAD_DVFS_ID 12
#define MAX_CLOCKS 16
#define CLOCKS_MASK (CLOCK_REALTIME | CLOCK_MONOTONIC)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,6 +4551,7 @@ static void __sched_fork(unsigned long c
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
+ p->se.sum_dvfs_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
@@ -5632,6 +5633,45 @@ unsigned long long task_sched_runtime(st
task_rq_unlock(rq, p, &rf);
return ns;
+}
+
+unsigned long long task_sched_dvfs_runtime(struct task_struct *p)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+ u64 ns;
+
+#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
+ /*
+ * 64-bit doesn't need locks to atomically read a 64-bit value.
+ * So we have a optimization chance when the task's delta_exec is 0.
+ * Reading ->on_cpu is racy, but this is ok.
+ *
+ * If we race with it leaving CPU, we'll take a lock. So we're correct.
+ * If we race with it entering CPU, unaccounted time is 0. This is
+ * indistinguishable from the read occurring a few cycles earlier.
+ * If we see ->on_cpu without ->on_rq, the task is leaving, and has
+ * been accounted, so we're correct here as well.
+ */
+ if (!p->on_cpu || !task_on_rq_queued(p))
+ return p->se.sum_dvfs_runtime;
+#endif
+
+ rq = task_rq_lock(p, &rf);
+ /*
+ * Must be ->curr _and_ ->on_rq. If dequeued, we would
+ * project cycles that may never be accounted to this
+ * thread, breaking clock_gettime().
+ */
+ if (task_current(rq, p) && task_on_rq_queued(p)) {
+ prefetch_curr_exec_start(p);
+ update_rq_clock(rq);
+ p->sched_class->update_curr(rq);
+ }
+ ns = p->se.sum_dvfs_runtime;
+ task_rq_unlock(rq, p, &rf);
+
+ return ns;
}
#ifdef CONFIG_SCHED_DEBUG
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1118,15 +1118,19 @@ static void update_tg_load_avg(struct cf
static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
{
u64 now = rq_clock_task(rq);
- s64 delta_exec;
+ s64 delta_exec, delta_dvfs;
- delta_exec = now - curr->exec_start;
+ delta_dvfs = delta_exec = now - curr->exec_start;
if (unlikely(delta_exec <= 0))
return delta_exec;
curr->exec_start = now;
curr->sum_exec_runtime += delta_exec;
+ delta_dvfs = cap_scale(delta_dvfs, arch_scale_freq_capacity(cpu_of(rq)));
+ delta_dvfs = cap_scale(delta_dvfs, arch_scale_cpu_capacity(cpu_of(rq)));
+ curr->sum_dvfs_runtime += delta_dvfs;
+
if (schedstat_enabled()) {
struct sched_statistics *stats;
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -164,7 +164,7 @@ posix_cpu_clock_getres(const clockid_t w
if (!error) {
tp->tv_sec = 0;
tp->tv_nsec = ((NSEC_PER_SEC + HZ - 1) / HZ);
- if (CPUCLOCK_WHICH(which_clock) == CPUCLOCK_SCHED) {
+ if (CPUCLOCK_WHICH(which_clock) >= CPUCLOCK_SCHED) {
/*
* If sched_clock is using a cycle counter, we
* don't have any idea of its true resolution
@@ -198,6 +198,9 @@ static u64 cpu_clock_sample(const clocki
if (clkid == CPUCLOCK_SCHED)
return task_sched_runtime(p);
+ if (clkid == CPUCLOCK_DVFS)
+ return task_sched_dvfs_runtime(p);
+
task_cputime(p, &utime, &stime);
switch (clkid) {
@@ -1628,6 +1631,7 @@ static long posix_cpu_nsleep_restart(str
#define PROCESS_CLOCK make_process_cpuclock(0, CPUCLOCK_SCHED)
#define THREAD_CLOCK make_thread_cpuclock(0, CPUCLOCK_SCHED)
+#define THREAD_DVFS_CLOCK make_thread_cpuclock(0, CPUCLOCK_DVFS)
static int process_cpu_clock_getres(const clockid_t which_clock,
struct timespec64 *tp)
@@ -1664,6 +1668,11 @@ static int thread_cpu_timer_create(struc
timer->it_clock = THREAD_CLOCK;
return posix_cpu_timer_create(timer);
}
+static int thread_dvfs_cpu_clock_get(const clockid_t which_clock,
+ struct timespec64 *tp)
+{
+ return posix_cpu_clock_get(THREAD_DVFS_CLOCK, tp);
+}
const struct k_clock clock_posix_cpu = {
.clock_getres = posix_cpu_clock_getres,
@@ -1690,3 +1699,8 @@ const struct k_clock clock_thread = {
.clock_get_timespec = thread_cpu_clock_get,
.timer_create = thread_cpu_timer_create,
};
+
+const struct k_clock clock_thread_dvfs = {
+ .clock_getres = thread_cpu_clock_getres,
+ .clock_get_timespec = thread_dvfs_cpu_clock_get,
+};
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1516,6 +1516,7 @@ static const struct k_clock * const posi
[CLOCK_MONOTONIC] = &clock_monotonic,
[CLOCK_PROCESS_CPUTIME_ID] = &clock_process,
[CLOCK_THREAD_CPUTIME_ID] = &clock_thread,
+ [CLOCK_THREAD_DVFS_ID] = &clock_thread_dvfs,
[CLOCK_MONOTONIC_RAW] = &clock_monotonic_raw,
[CLOCK_REALTIME_COARSE] = &clock_realtime_coarse,
[CLOCK_MONOTONIC_COARSE] = &clock_monotonic_coarse,
--- a/kernel/time/posix-timers.h
+++ b/kernel/time/posix-timers.h
@@ -34,6 +34,7 @@ extern const struct k_clock clock_posix_
extern const struct k_clock clock_posix_dynamic;
extern const struct k_clock clock_process;
extern const struct k_clock clock_thread;
+extern const struct k_clock clock_thread_dvfs;
extern const struct k_clock alarm_clock;
int posix_timer_event(struct k_itimer *timr, int si_private);
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID
2024-07-27 10:27 ` [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID Peter Zijlstra
@ 2024-07-28 21:30 ` Thomas Gleixner
2024-07-29 7:53 ` Juri Lelli
2024-08-19 11:11 ` Christian Loehle
2 siblings, 0 replies; 277+ messages in thread
From: Thomas Gleixner @ 2024-07-28 21:30 UTC (permalink / raw)
To: Peter Zijlstra, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, efault
On Sat, Jul 27 2024 at 12:27, Peter Zijlstra wrote:
> In order to measure thread time in a DVFS world, introduce
> CLOCK_THREAD_DVFS_ID -- a copy of CLOCK_THREAD_CPUTIME_ID that slows
> down with both DVFS scaling and CPU capacity.
>
> The clock does *NOT* support setting timers.
That's not the only limitation. See below.
> Useful for both SCHED_DEADLINE and the newly introduced
> sched_attr::sched_runtime usage for SCHED_NORMAL.
Can this please have an explanation about the usage of the previously
reserved value of 0x7 in the lower 3 bits?
> *
> * Bit 2 indicates whether a cpu clock refers to a thread or a process.
> *
> - * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or FD=3.
> + * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or DVSF=3
> *
> - * A clockid is invalid if bits 2, 1, and 0 are all set.
> + * (DVFS is PERTHREAD only)
This drops the information about the FD usage. Something like:
/*
* Bit fields within a clockid:
*
* Bit 31:3 hold either a pid or a file descriptor.
*
* Bit 2 Bit 1 Bit 0
* 0 0 0 Per process CPUCLOCK_PROF
* 0 0 1 Per process CPUCLOCK_VIRT
* 0 1 0 Per process CPUCLOCK_SCHED
* 0 1 1 Posixclock FD CLOCKFD
* 1 0 0 Per thread CPUCLOCK_PROF
* 1 0 1 Per thread CPUCLOCK_VIRT
* 1 1 0 Per thread CPUCLOCK_SCHED
* 1 1 1 Per thread CPUCLOCK_DVSF
*
* CPUCLOCK_DVSF is per thread only and shares the type code in Bit 1:0
* with CLOCKFD. CLOCKFD uses a file descriptor to access dynamically
* registered POSIX clocks (e.g. PTP hardware clocks).
*/
should be clear enough, no?
But, all of this is wishful thinking because the provided implementation
only works for:
sys_clock_getres(CLOCK_THREAD_DVFS_ID, ...)
which falls back to thread_cpu_clock_getres().
The variant which has the TID encoded in bit 31:3 and the type in bit
2:0 fails the test in pid_for_clock():
if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
return NULL;
Worse for sys_clock_gettime(). That fails in both cases for the very
same reason.
See the uncompiled delta patch below for a cure of that and the rest of
my comments.
> #define CPUCLOCK_PROF 0
> #define CPUCLOCK_VIRT 1
> #define CPUCLOCK_SCHED 2
> +#define CPUCLOCK_DVFS 3
> #define CPUCLOCK_MAX 3
> #define CLOCKFD CPUCLOCK_MAX
> #define CLOCKFD_MASK (CPUCLOCK_PERTHREAD_MASK|CPUCLOCK_CLOCK_MASK)
With that DVFS addition CPUCLOCK_MAX is misleading at best. See delta
patch.
> +
> + rq = task_rq_lock(p, &rf);
> + /*
> + * Must be ->curr _and_ ->on_rq. If dequeued, we would
> + * project cycles that may never be accounted to this
> + * thread, breaking clock_gettime().
Must be? For what? I assume you want to say:
Update the runtime if the task is the current task and on the
runqueue. The latter is important because if current is dequeued,
....
> + */
> + if (task_current(rq, p) && task_on_rq_queued(p)) {
> + prefetch_curr_exec_start(p);
> + update_rq_clock(rq);
> + p->sched_class->update_curr(rq);
> + }
> + ns = p->se.sum_dvfs_runtime;
> + task_rq_unlock(rq, p, &rf);
> @@ -1664,6 +1668,11 @@ static int thread_cpu_timer_create(struc
> timer->it_clock = THREAD_CLOCK;
> return posix_cpu_timer_create(timer);
> }
> +static int thread_dvfs_cpu_clock_get(const clockid_t which_clock,
> + struct timespec64 *tp)
Please align the second line properly with the argument in the first line.
Thanks,
tglx
---
--- a/include/linux/posix-timers_types.h
+++ b/include/linux/posix-timers_types.h
@@ -9,27 +9,42 @@
/*
* Bit fields within a clockid:
*
- * The most significant 29 bits hold either a pid or a file descriptor.
+ * Bit 31:3 hold either a PID/TID or a file descriptor.
*
- * Bit 2 indicates whether a cpu clock refers to a thread or a process.
+ * Bit 2 Bit 1 Bit 0
+ * 0 0 0 Per process CPUCLOCK_PROF
+ * 0 0 1 Per process CPUCLOCK_VIRT
+ * 0 1 0 Per process CPUCLOCK_SCHED
+ * 0 1 1 Posixclock FD CLOCKFD
+ * 1 0 0 Per thread CPUCLOCK_PROF
+ * 1 0 1 Per thread CPUCLOCK_VIRT
+ * 1 1 0 Per thread CPUCLOCK_SCHED
+ * 1 1 1 Per thread CPUCLOCK_DVSF
*
- * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or DVSF=3
- *
- * (DVFS is PERTHREAD only)
+ * CPUCLOCK_DVSF is per thread only and shares the type code in Bit 1:0
+ * with CLOCKFD. CLOCKFD uses a file descriptor to access dynamically
+ * registered POSIX clocks (e.g. PTP hardware clocks).
*/
+
#define CPUCLOCK_PID(clock) ((pid_t) ~((clock) >> 3))
-#define CPUCLOCK_PERTHREAD(clock) \
- (((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0)
+#define CPUCLOCK_PERTHREAD(clock) (((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0)
-#define CPUCLOCK_PERTHREAD_MASK 4
-#define CPUCLOCK_WHICH(clock) ((clock) & (clockid_t) CPUCLOCK_CLOCK_MASK)
-#define CPUCLOCK_CLOCK_MASK 3
#define CPUCLOCK_PROF 0
#define CPUCLOCK_VIRT 1
#define CPUCLOCK_SCHED 2
-#define CPUCLOCK_DVFS 3
-#define CPUCLOCK_MAX 3
-#define CLOCKFD CPUCLOCK_MAX
+#define CPUCLOCK_SAMPLE_MAX (CPUCLOCK_SCHED + 1)
+
+#define CPUCLOCK_CLOCK_MASK 3
+#define CPUCLOCK_PERTHREAD_MASK 4
+#define CPUCLOCK_WHICH(clock) ((clock) & (clockid_t) CPUCLOCK_CLOCK_MASK)
+
+/*
+ * CPUCLOCK_DVFS and CLOCKFD share the type code in bit 1:0. CPUCLOCK_DVFS
+ * does not belong to the sampling clocks and does not allow timers to be
+ * armed on it.
+ */
+#define CPUCLOCK_DVFS CPUCLOCK_SAMPLE_MAX
+#define CLOCKFD CPUCLOCK_DVFS
#define CLOCKFD_MASK (CPUCLOCK_PERTHREAD_MASK|CPUCLOCK_CLOCK_MASK)
#ifdef CONFIG_POSIX_TIMERS
@@ -55,7 +70,7 @@ struct posix_cputimer_base {
* Used in task_struct and signal_struct
*/
struct posix_cputimers {
- struct posix_cputimer_base bases[CPUCLOCK_MAX];
+ struct posix_cputimer_base bases[CPUCLOCK_SAMPLE_MAX];
unsigned int timers_active;
unsigned int expiry_active;
};
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5413,9 +5413,10 @@ unsigned long long task_sched_dvfs_runti
rq = task_rq_lock(p, &rf);
/*
- * Must be ->curr _and_ ->on_rq. If dequeued, we would
- * project cycles that may never be accounted to this
- * thread, breaking clock_gettime().
+ * Update the runtime if the task is the current task and on the
+ * runqueue. The latter is important because if current is
+ * dequeued, we would project cycles that may never be accounted to
+ * this thread, breaking clock_gettime().
*/
if (task_current(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p);
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -54,13 +54,13 @@ int update_rlimit_cpu(struct task_struct
/*
* Functions for validating access to tasks.
*/
-static struct pid *pid_for_clock(const clockid_t clock, bool gettime)
+static struct pid *__pid_for_clock(const clockid_t clock, const clockid_t maxclock, bool gettime)
{
const bool thread = !!CPUCLOCK_PERTHREAD(clock);
const pid_t upid = CPUCLOCK_PID(clock);
struct pid *pid;
- if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
+ if (CPUCLOCK_WHICH(clock) > maxclock)
return NULL;
/*
@@ -94,12 +94,17 @@ static struct pid *pid_for_clock(const c
return pid_has_task(pid, PIDTYPE_TGID) ? pid : NULL;
}
+static inline struct pid *pid_for_clock(const clockid_t clock, bool gettime)
+{
+ return __pid_for_clock(clock, CPUCLOCK_SCHED, gettime);
+}
+
static inline int validate_clock_permissions(const clockid_t clock)
{
int ret;
rcu_read_lock();
- ret = pid_for_clock(clock, false) ? 0 : -EINVAL;
+ ret = __pid_for_clock(clock, CPUCLOCK_DVFS, false) ? 0 : -EINVAL;
rcu_read_unlock();
return ret;
@@ -344,7 +349,7 @@ static u64 cpu_clock_sample_group(const
{
struct thread_group_cputimer *cputimer = &p->signal->cputimer;
struct posix_cputimers *pct = &p->signal->posix_cputimers;
- u64 samples[CPUCLOCK_MAX];
+ u64 samples[CPUCLOCK_SAMPLE_MAX];
if (!READ_ONCE(pct->timers_active)) {
if (start)
@@ -365,7 +370,7 @@ static int posix_cpu_clock_get(const clo
u64 t;
rcu_read_lock();
- tsk = pid_task(pid_for_clock(clock, true), clock_pid_type(clock));
+ tsk = pid_task(__pid_for_clock(clock, CPUCLOCK_DVFS, true), clock_pid_type(clock));
if (!tsk) {
rcu_read_unlock();
return -EINVAL;
@@ -864,7 +869,7 @@ static void collect_posix_cputimers(stru
struct posix_cputimer_base *base = pct->bases;
int i;
- for (i = 0; i < CPUCLOCK_MAX; i++, base++) {
+ for (i = 0; i < CPUCLOCK_SAMPLE_MAX; i++, base++) {
base->nextevt = collect_timerqueue(&base->tqhead, firing,
samples[i]);
}
@@ -901,7 +906,7 @@ static void check_thread_timers(struct t
struct list_head *firing)
{
struct posix_cputimers *pct = &tsk->posix_cputimers;
- u64 samples[CPUCLOCK_MAX];
+ u64 samples[CPUCLOCK_SAMPLE_MAX];
unsigned long soft;
if (dl_task(tsk))
@@ -979,7 +984,7 @@ static void check_process_timers(struct
{
struct signal_struct *const sig = tsk->signal;
struct posix_cputimers *pct = &sig->posix_cputimers;
- u64 samples[CPUCLOCK_MAX];
+ u64 samples[CPUCLOCK_SAMPLE_MAX];
unsigned long soft;
/*
@@ -1098,7 +1103,7 @@ task_cputimers_expired(const u64 *sample
{
int i;
- for (i = 0; i < CPUCLOCK_MAX; i++) {
+ for (i = 0; i < CPUCLOCK_SAMPLE_MAX; i++) {
if (samples[i] >= pct->bases[i].nextevt)
return true;
}
@@ -1121,7 +1126,7 @@ static inline bool fastpath_timer_check(
struct signal_struct *sig;
if (!expiry_cache_is_inactive(pct)) {
- u64 samples[CPUCLOCK_MAX];
+ u64 samples[CPUCLOCK_SAMPLE_MAX];
task_sample_cputime(tsk, samples);
if (task_cputimers_expired(samples, pct))
@@ -1146,7 +1151,7 @@ static inline bool fastpath_timer_check(
* delays with signals actually getting sent are expected.
*/
if (READ_ONCE(pct->timers_active) && !READ_ONCE(pct->expiry_active)) {
- u64 samples[CPUCLOCK_MAX];
+ u64 samples[CPUCLOCK_SAMPLE_MAX];
proc_sample_cputime_atomic(&sig->cputimer.cputime_atomic,
samples);
@@ -1669,7 +1674,7 @@ static int thread_cpu_timer_create(struc
return posix_cpu_timer_create(timer);
}
static int thread_dvfs_cpu_clock_get(const clockid_t which_clock,
- struct timespec64 *tp)
+ struct timespec64 *tp)
{
return posix_cpu_clock_get(THREAD_DVFS_CLOCK, tp);
}
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID
2024-07-27 10:27 ` [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID Peter Zijlstra
2024-07-28 21:30 ` Thomas Gleixner
@ 2024-07-29 7:53 ` Juri Lelli
2024-08-02 11:29 ` Peter Zijlstra
2024-08-19 11:11 ` Christian Loehle
2 siblings, 1 reply; 277+ messages in thread
From: Juri Lelli @ 2024-07-29 7:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
Hi Peter,
On 27/07/24 12:27, Peter Zijlstra wrote:
> In order to measure thread time in a DVFS world, introduce
> CLOCK_THREAD_DVFS_ID -- a copy of CLOCK_THREAD_CPUTIME_ID that slows
> down with both DVFS scaling and CPU capacity.
>
> The clock does *NOT* support setting timers.
>
> Useful for both SCHED_DEADLINE and the newly introduced
> sched_attr::sched_runtime usage for SCHED_NORMAL.
Just so I'm sure I understand, this would be useful for estimating the
runtime needs of a (also DEADLINE) task when DVFS is enabled, right?
Thanks,
Juri
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (23 preceding siblings ...)
2024-07-27 10:27 ` [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID Peter Zijlstra
@ 2024-08-01 12:08 ` Luis Machado
2024-08-14 14:34 ` Vincent Guittot
` (6 subsequent siblings)
31 siblings, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-08-01 12:08 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Hi Peter,
On 7/27/24 11:27, Peter Zijlstra wrote:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>
>
>
Thanks for the updated series.
FTR, I ran this through our Pixel 6 and did not see significant differences
between the baseline (6.8 kernel with EEVDF + fixes) and the baseline with
the updated eevdf-complete series applied on top.
The power use and frame metrics seem mostly the same between the two.
I'll run a few more tests just in case.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID
2024-07-29 7:53 ` Juri Lelli
@ 2024-08-02 11:29 ` Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-02 11:29 UTC (permalink / raw)
To: Juri Lelli
Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Mon, Jul 29, 2024 at 09:53:17AM +0200, Juri Lelli wrote:
> Hi Peter,
>
> On 27/07/24 12:27, Peter Zijlstra wrote:
> > In order to measure thread time in a DVFS world, introduce
> > CLOCK_THREAD_DVFS_ID -- a copy of CLOCK_THREAD_CPUTIME_ID that slows
> > down with both DVFS scaling and CPU capacity.
> >
> > The clock does *NOT* support setting timers.
> >
> > Useful for both SCHED_DEADLINE and the newly introduced
> > sched_attr::sched_runtime usage for SCHED_NORMAL.
>
> Just so I'm sure I understand, this would be useful for estimating the
> runtime needs of a (also DEADLINE) task when DVFS is enabled, right?
Correct, DVFS or biggie-smalls CPUs with mixed capacities.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
@ 2024-08-02 14:39 ` Valentin Schneider
2024-08-02 14:59 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 1 reply; 277+ messages in thread
From: Valentin Schneider @ 2024-08-02 14:39 UTC (permalink / raw)
To: Peter Zijlstra, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 27/07/24 12:27, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
Question from a lazy student who just caught up to the current state of
EEVDF...
IIUC this makes it so time spent sleeping increases an entity's lag, rather
than it being frozen & restored via the place_entity() magic.
So entities with negative lag get closer to their 0-lag point, after which
they can get picked & dequeued if still not runnable.
However, don't entities with positive lag get *further* away from their
0-lag point?
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-08-02 14:39 ` Valentin Schneider
@ 2024-08-02 14:59 ` Peter Zijlstra
2024-08-02 16:32 ` Valentin Schneider
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-02 14:59 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Fri, Aug 02, 2024 at 04:39:08PM +0200, Valentin Schneider wrote:
>
> On 27/07/24 12:27, Peter Zijlstra wrote:
> > Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> > noting that lag is fundamentally a temporal measure. It should not be
> > carried around indefinitely.
> >
> > OTOH it should also not be instantly discarded, doing so will allow a
> > task to game the system by purposefully (micro) sleeping at the end of
> > its time quantum.
> >
> > Since lag is intimately tied to the virtual time base, a wall-time
> > based decay is also insufficient, notably competition is required for
> > any of this to make sense.
> >
> > Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> > competing until they are eligible.
> >
> > Strictly speaking, we only care about keeping them until the 0-lag
> > point, but that is a difficult proposition, instead carry them around
> > until they get picked again, and dequeue them at that point.
> >
>
> Question from a lazy student who just caught up to the current state of
> EEVDF...
>
> IIUC this makes it so time spent sleeping increases an entity's lag, rather
> than it being frozen & restored via the place_entity() magic.
>
> So entities with negative lag get closer to their 0-lag point, after which
> they can get picked & dequeued if still not runnable.
Right.
> However, don't entities with positive lag get *further* away from their
> 0-lag point?
Which is why we only delay the dequeue when !eligible, IOW when lag is
negative.
The next patch additionally truncates lag to 0 (for delayed entities),
so they can never earn extra time.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-08-02 14:59 ` Peter Zijlstra
@ 2024-08-02 16:32 ` Valentin Schneider
0 siblings, 0 replies; 277+ messages in thread
From: Valentin Schneider @ 2024-08-02 16:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On 02/08/24 16:59, Peter Zijlstra wrote:
> On Fri, Aug 02, 2024 at 04:39:08PM +0200, Valentin Schneider wrote:
>>
>> On 27/07/24 12:27, Peter Zijlstra wrote:
>> > Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>> > noting that lag is fundamentally a temporal measure. It should not be
>> > carried around indefinitely.
>> >
>> > OTOH it should also not be instantly discarded, doing so will allow a
>> > task to game the system by purposefully (micro) sleeping at the end of
>> > its time quantum.
>> >
>> > Since lag is intimately tied to the virtual time base, a wall-time
>> > based decay is also insufficient, notably competition is required for
>> > any of this to make sense.
>> >
>> > Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>> > competing until they are eligible.
>> >
>> > Strictly speaking, we only care about keeping them until the 0-lag
>> > point, but that is a difficult proposition, instead carry them around
>> > until they get picked again, and dequeue them at that point.
>> >
>>
>> Question from a lazy student who just caught up to the current state of
>> EEVDF...
>>
>> IIUC this makes it so time spent sleeping increases an entity's lag, rather
>> than it being frozen & restored via the place_entity() magic.
>>
>> So entities with negative lag get closer to their 0-lag point, after which
>> they can get picked & dequeued if still not runnable.
>
> Right.
>
>> However, don't entities with positive lag get *further* away from their
>> 0-lag point?
>
> Which is why we only delay de dequeue when !eligible, IOW when lag is
> negative.
>
> The next patch additionally truncates lag to 0 (for delayed entities),
> so they can never earn extra time.
Gotcha, thanks for pointing that out, I think I'm (slowly) getting it :D
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-07-27 10:27 ` [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
@ 2024-08-05 12:24 ` Chunxin Zang
2024-08-07 17:54 ` Peter Zijlstra
2024-08-08 10:15 ` Chen Yu
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2 siblings, 1 reply; 277+ messages in thread
From: Chunxin Zang @ 2024-08-05 12:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, K Prateek Nayak,
wuyun.abel, youssefesmat, tglx, Mike Galbraith, Mike Galbraith,
Chunxin Zang
> On Jul 27, 2024, at 18:27, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Part of the reason to have shorter slices is to improve
> responsiveness. Allow shorter slices to preempt longer slices on
> wakeup.
>
> Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
>
> 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
>
> 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
> 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
> 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
> 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
> 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
> 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
> 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
> 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
> 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
>
> 100ms massive_intr 500us cyclictest PREEMPT_SHORT
>
> 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
> 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
> 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
> 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
> 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
> 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
> 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
> 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
> 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
>
> As per the numbers, this makes cyclictest's (short slice) max-delay
> more consistent, and that consistency drops the sum-delay. The
> trade-off is that the massive_intr (long slice) gets more context
> switches and a slight increase in sum-delay.
>
> [mike: numbers]
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> ---
> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
> kernel/sched/features.h | 5 +++
> 2 files changed, 61 insertions(+), 8 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -973,10 +973,10 @@ static void clear_buddies(struct cfs_rq
> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
> * this is probably good enough.
> */
> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> if ((s64)(se->vruntime - se->deadline) < 0)
> - return;
> + return false;
>
> /*
> * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -993,10 +993,7 @@ static void update_deadline(struct cfs_r
> /*
> * The task has consumed its request, reschedule.
> */
> - if (cfs_rq->nr_running > 1) {
> - resched_curr(rq_of(cfs_rq));
> - clear_buddies(cfs_rq, se);
> - }
> + return true;
> }
>
> #include "pelt.h"
> @@ -1134,6 +1131,38 @@ static inline void update_curr_task(stru
> dl_server_update(p->dl_server, delta_exec);
> }
>
> +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> +{
> + if (!sched_feat(PREEMPT_SHORT))
> + return false;
> +
> + if (curr->vlag == curr->deadline)
> + return false;
> +
> + return !entity_eligible(cfs_rq, curr);
> +}
Hi Peter,
Can this be made more aggressive here? Something like the following: in the
PREEMPT_SHORT + NO_RUN_TO_PARITY combination, it could break the first deadline
of the current task. This can achieve better latency benefits in certain embedded
scenarios, such as high-priority periodic tasks.
+static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (sched_feat(RUN_TO_PARITY) && curr->vlag == curr->deadline)
+ return false;
+
+ return !entity_eligible(cfs_rq, curr);
+}
Additionally, if possible, could you please include my name in this patch? I spent over a
month finding this solution and conducting the tests, and I hope to leave some trace of
my efforts during that time. This is also one of the reasons why I love Linux and am eager
to contribute to open source. I would be extremely grateful.
https://lore.kernel.org/lkml/20240613131437.9555-1-spring.cxz@gmail.com/
thanks
Chunxin
> +
> +static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + if (!sched_feat(PREEMPT_SHORT))
> + return false;
> +
> + if (pse->slice >= se->slice)
> + return false;
> +
> + if (!entity_eligible(cfs_rq, pse))
> + return false;
> +
> + if (entity_before(pse, se))
> + return true;
> +
> + if (!entity_eligible(cfs_rq, se))
> + return true;
> +
> + return false;
> +}
> +
> /*
> * Used by other classes to account runtime.
> */
> @@ -1157,6 +1186,7 @@ static void update_curr(struct cfs_rq *c
> struct sched_entity *curr = cfs_rq->curr;
> struct rq *rq = rq_of(cfs_rq);
> s64 delta_exec;
> + bool resched;
>
> if (unlikely(!curr))
> return;
> @@ -1166,7 +1196,7 @@ static void update_curr(struct cfs_rq *c
> return;
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
> - update_deadline(cfs_rq, curr);
> + resched = update_deadline(cfs_rq, curr);
> update_min_vruntime(cfs_rq);
>
> if (entity_is_task(curr)) {
> @@ -1184,6 +1214,14 @@ static void update_curr(struct cfs_rq *c
> }
>
> account_cfs_rq_runtime(cfs_rq, delta_exec);
> +
> + if (rq->nr_running == 1)
> + return;
> +
> + if (resched || did_preempt_short(cfs_rq, curr)) {
> + resched_curr(rq);
> + clear_buddies(cfs_rq, curr);
> + }
> }
>
> static void update_curr_fair(struct rq *rq)
> @@ -8611,7 +8649,17 @@ static void check_preempt_wakeup_fair(st
> cfs_rq = cfs_rq_of(se);
> update_curr(cfs_rq);
> /*
> - * XXX pick_eevdf(cfs_rq) != se ?
> + * If @p has a shorter slice than current and @p is eligible, override
> + * current's slice protection in order to allow preemption.
> + *
> + * Note that even if @p does not turn out to be the most eligible
> + * task at this moment, current's slice protection will be lost.
> + */
> + if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline)
> + se->vlag = se->deadline + 1;
> +
> + /*
> + * If @p has become the most eligible task, force preemption.
> */
> if (pick_eevdf(cfs_rq) == pse)
> goto preempt;
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -18,6 +18,11 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
> * 0-lag point or until is has exhausted it's slice.
> */
> SCHED_FEAT(RUN_TO_PARITY, true)
> +/*
> + * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for
> + * current.
> + */
> +SCHED_FEAT(PREEMPT_SHORT, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed
>
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-08-05 12:24 ` Chunxin Zang
@ 2024-08-07 17:54 ` Peter Zijlstra
2024-08-13 10:44 ` Chunxin Zang
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-07 17:54 UTC (permalink / raw)
To: Chunxin Zang
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, K Prateek Nayak,
wuyun.abel, youssefesmat, tglx, Mike Galbraith, Mike Galbraith,
Chunxin Zang
On Mon, Aug 05, 2024 at 08:24:24PM +0800, Chunxin Zang wrote:
> > On Jul 27, 2024, at 18:27, Peter Zijlstra <peterz@infradead.org> wrote:
> > +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > +{
> > + if (!sched_feat(PREEMPT_SHORT))
> > + return false;
> > +
> > + if (curr->vlag == curr->deadline)
> > + return false;
> > +
> > + return !entity_eligible(cfs_rq, curr);
> > +}
> Can this be made more aggressive here? Something like , in the PREEMPT_SHORT
> + NO_RUN_TO_PARITY combination, it could break the first deadline of the current
> task. This can achieve better latency benefits in certain embedded scenarios, such as
> high-priority periodic tasks.
You are aware we have SCHED_DEADLINE for those, right?
Why can't you use that? And what can we do to fix that?
> +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> +{
> + if (!sched_feat(PREEMPT_SHORT))
> + return false;
> +
> + if (sched_feat(RUN_TO_PARITY) && curr->vlag == curr->deadline)
> + return false;
> +
> + return !entity_eligible(cfs_rq, curr);
> +}
No, this will destroy the steady state schedule (where no tasks
join/leave) and make it so that all tasks hug the lag=0 state
arbitrarily close -- as allowed by the scheduling quanta.
Yes, it will get you better latency, because nobody gets to actually run
its requested slice.
The goal really is for tasks to get their request -- and yes that means
you get to wait. PREEMPT_SHORT is already an exception to this rule, and
it is very specifically limited to wake-ups so as to retain as much of
the intended behaviour as possible.
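For reference, the wakeup comparison that PREEMPT_SHORT leans on reduces to the
virtual-deadline ordering vd_i = ve_i + r_i/w_i from the hunk above: a smaller
request r_i gives an earlier virtual deadline. A toy user-space illustration, with
simplified units and invented names rather than kernel code:

#include <stdio.h>

/* vd = ve + r/w, cf. the comment above update_deadline() in the hunk above. */
static double vdeadline(double veligible, double slice, double weight)
{
	return veligible + slice / weight;
}

int main(void)
{
	/* Two equally weighted tasks that become eligible at the same time. */
	double long_vd  = vdeadline(0.0, 100e-3, 1.0);	/* 100ms request */
	double short_vd = vdeadline(0.0, 500e-6, 1.0);	/*  500us request */

	printf("long  slice vd = %f\n", long_vd);
	printf("short slice vd = %f\n", short_vd);
	printf("short-slice task wins the pick: %s\n",
	       short_vd < long_vd ? "yes" : "no");
	return 0;
}

With equal weights the 500us request's virtual deadline is 200x closer than the
100ms one's, which is what lets it win the wakeup pick (when it is also eligible).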
> Additionally, if possible, could you please include my name in this patch? I spent over a
> month finding this solution and conducting the tests, and I hope to leave some trace of
> my efforts during that time. This is also one of the reasons why I love Linux and am eager
> to contribute to open source. I would be extremely grateful.
I've made it the below. Does that work for you?
---
Subject: sched/eevdf: Allow shorter slices to wakeup-preempt
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Sep 26 14:32:32 CEST 2023
Part of the reason to have shorter slices is to improve
responsiveness. Allow shorter slices to preempt longer slices on
wakeup.
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
100ms massive_intr 500us cyclictest PREEMPT_SHORT
1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
As per the numbers, this makes cyclictest's (short slice) max-delay
more consistent, and that consistency drops the sum-delay. The
trade-off is that the massive_intr (long slice) gets more context
switches and a slight increase in sum-delay.
Chunxin contributed did_preempt_short() where a task that lost slice
protection from PREEMPT_SHORT gets rescheduled once it becomes
in-eligible.
[mike: numbers]
Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com>
Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
---
kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 5 +++
2 files changed, 61 insertions(+), 8 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -973,10 +973,10 @@ static void clear_buddies(struct cfs_rq
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -993,10 +993,7 @@ static void update_deadline(struct cfs_r
/*
* The task has consumed its request, reschedule.
*/
- if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
- clear_buddies(cfs_rq, se);
- }
+ return true;
}
#include "pelt.h"
@@ -1134,6 +1131,38 @@ static inline void update_curr_task(stru
dl_server_update(p->dl_server, delta_exec);
}
+static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (curr->vlag == curr->deadline)
+ return false;
+
+ return !entity_eligible(cfs_rq, curr);
+}
+
+static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (pse->slice >= se->slice)
+ return false;
+
+ if (!entity_eligible(cfs_rq, pse))
+ return false;
+
+ if (entity_before(pse, se))
+ return true;
+
+ if (!entity_eligible(cfs_rq, se))
+ return true;
+
+ return false;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -1157,6 +1186,7 @@ static void update_curr(struct cfs_rq *c
struct sched_entity *curr = cfs_rq->curr;
struct rq *rq = rq_of(cfs_rq);
s64 delta_exec;
+ bool resched;
if (unlikely(!curr))
return;
@@ -1166,7 +1196,7 @@ static void update_curr(struct cfs_rq *c
return;
curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
@@ -1184,6 +1214,14 @@ static void update_curr(struct cfs_rq *c
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ if (rq->nr_running == 1)
+ return;
+
+ if (resched || did_preempt_short(cfs_rq, curr)) {
+ resched_curr(rq);
+ clear_buddies(cfs_rq, curr);
+ }
}
static void update_curr_fair(struct rq *rq)
@@ -8611,7 +8649,17 @@ static void check_preempt_wakeup_fair(st
cfs_rq = cfs_rq_of(se);
update_curr(cfs_rq);
/*
- * XXX pick_eevdf(cfs_rq) != se ?
+ * If @p has a shorter slice than current and @p is eligible, override
+ * current's slice protection in order to allow preemption.
+ *
+ * Note that even if @p does not turn out to be the most eligible
+ * task at this moment, current's slice protection will be lost.
+ */
+ if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline)
+ se->vlag = se->deadline + 1;
+
+ /*
+ * If @p has become the most eligible task, force preemption.
*/
if (pick_eevdf(cfs_rq) == pse)
goto preempt;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -18,6 +18,11 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
* 0-lag point or until is has exhausted it's slice.
*/
SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for
+ * current.
+ */
+SCHED_FEAT(PREEMPT_SHORT, true)
/*
* Prefer to schedule the task we woke last (assuming it failed
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-07-27 10:27 ` [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
2024-08-05 12:24 ` Chunxin Zang
@ 2024-08-08 10:15 ` Chen Yu
2024-08-08 10:22 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2 siblings, 1 reply; 277+ messages in thread
From: Chen Yu @ 2024-08-08 10:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Mike Galbraith
Hi Peter,
On 2024-07-27 at 12:27:53 +0200, Peter Zijlstra wrote:
> Part of the reason to have shorter slices is to improve
> responsiveness. Allow shorter slices to preempt longer slices on
> wakeup.
>
> Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
>
> 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
>
> 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
> 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
> 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
> 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
> 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
> 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
> 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
> 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
> 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
>
> 100ms massive_intr 500us cyclictest PREEMPT_SHORT
>
> 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
> 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
> 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
> 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
> 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
> 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
> 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
> 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
> 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
>
> As per the numbers, this makes cyclictest's (short slice) max-delay
> more consistent, and that consistency drops the sum-delay. The
> trade-off is that the massive_intr (long slice) gets more context
> switches and a slight increase in sum-delay.
>
> [mike: numbers]
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Besides this short preemption, it seems that an important patch is missing from
this patch set, one that was originally from Prateek and that you refined to fix the
current task's false-negative eligibility:
https://lore.kernel.org/lkml/20240424150721.GQ30852@noisy.programming.kicks-ass.net/
RESPECT_SLICE was introduced to honor the current task's slice during wakeup preemption.
Without it, we have received reports of over-preemption and performance degradation
when running SPECjbb on servers.
echo RESPECT_SLICE > /sys/kernel/debug/sched/features
echo RUN_TO_PARITY > /sys/kernel/debug/sched/features
@task_duration_usecs_before_preempted:
[2, 4) 8732 |@@@ |
[4, 8) 109400 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 95815 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 110647 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 131298 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 132566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[128, 256) 82095 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 33771 |@@@@@@@@@@@@@ |
[512, 1K) 24180 |@@@@@@@@@ |
[1K, 2K) 31056 |@@@@@@@@@@@@ |
[2K, 4K) 117533 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4K, 8K) 4472 |@ |
[8K, 16K) 1149 | |
[16K, 32K) 289 | |
[32K, 64K) 110 | |
[64K, 128K) 20 | |
[128K, 256K) 4 | |
echo NO_RUN_TO_PARITY > /sys/kernel/debug/sched/features
@task_duration_usecs_before_preempted:
[4, 8) 1 | |
[8, 16) 12 | |
[16, 32) 20 | |
[32, 64) 38 | |
[64, 128) 64 | |
[128, 256) 98 | |
[256, 512) 248 | |
[512, 1K) 1196 | |
[1K, 2K) 3456 | |
[2K, 4K) 417269 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 22893 |@@ |
[8K, 16K) 7818 | |
[16K, 32K) 1471 | |
[32K, 64K) 373 | |
[64K, 128K) 96 | |
[128K, 256K) 3 | |
We can see that without the fix, the task will be preempted and cannot reach
its time-slice budget (we enlarged its slice).
May I know if we can put that patch into this series, please?
thanks,
Chenyu
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-08-08 10:15 ` Chen Yu
@ 2024-08-08 10:22 ` Peter Zijlstra
2024-08-08 12:31 ` Chen Yu
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-08 10:22 UTC (permalink / raw)
To: Chen Yu
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Mike Galbraith
On Thu, Aug 08, 2024 at 06:15:50PM +0800, Chen Yu wrote:
> Hi Peter,
>
> On 2024-07-27 at 12:27:53 +0200, Peter Zijlstra wrote:
> > Part of the reason to have shorter slices is to improve
> > responsiveness. Allow shorter slices to preempt longer slices on
> > wakeup.
> >
> > Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
> >
> > 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
> >
> > 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
> > 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
> > 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
> > 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
> > 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
> > 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
> > 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
> > 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
> > 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
> >
> > 100ms massive_intr 500us cyclictest PREEMPT_SHORT
> >
> > 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
> > 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
> > 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
> > 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
> > 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
> > 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
> > 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
> > 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
> > 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
> >
> > As per the numbers, this makes cyclictest's (short slice) max-delay
> > more consistent, and that consistency drops the sum-delay. The
> > trade-off is that the massive_intr (long slice) gets more context
> > switches and a slight increase in sum-delay.
> >
> > [mike: numbers]
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
>
> Besides this short preemption, it seems that an important patch is missing from
> this patch set, that was originally from Prateek and you refined it to fix the
> current task's false negative eligibility:
> https://lore.kernel.org/lkml/20240424150721.GQ30852@noisy.programming.kicks-ass.net/
>
> The RESPECT_SLICE is introduced to honor the current task's slice during wakeup preemption.
> Without it we got reported that over-preemption and performance downgrading are observed
> when running SPECjbb on servers.
So I *think* that running as SCHED_BATCH gets you exactly that
behaviour, no?
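For completeness, opting a workload into SCHED_BATCH needs no new interface; a
minimal user-space example using the standard sched_setscheduler(2) call (the
priority must be 0 for SCHED_BATCH) could look like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	/* SCHED_BATCH: mark the task as batch work that does not wakeup-preempt. */
	if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1) {
		perror("sched_setscheduler(SCHED_BATCH)");
		return 1;
	}

	/* ... run the throughput-oriented work here ... */
	return 0;
}

The same can be done for an existing task with chrt -b -p 0 <pid>.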
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-08-08 10:22 ` Peter Zijlstra
@ 2024-08-08 12:31 ` Chen Yu
2024-08-09 7:35 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Chen Yu @ 2024-08-08 12:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Mike Galbraith
On 2024-08-08 at 12:22:07 +0200, Peter Zijlstra wrote:
> On Thu, Aug 08, 2024 at 06:15:50PM +0800, Chen Yu wrote:
> > Hi Peter,
> >
> > On 2024-07-27 at 12:27:53 +0200, Peter Zijlstra wrote:
> > > Part of the reason to have shorter slices is to improve
> > > responsiveness. Allow shorter slices to preempt longer slices on
> > > wakeup.
> > >
> > > Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
> > >
> > > 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
> > >
> > > 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
> > > 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
> > > 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
> > > 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
> > > 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
> > > 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
> > > 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
> > > 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
> > > 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
> > >
> > > 100ms massive_intr 500us cyclictest PREEMPT_SHORT
> > >
> > > 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
> > > 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
> > > 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
> > > 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
> > > 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
> > > 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
> > > 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
> > > 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
> > > 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
> > >
> > > As per the numbers, this makes cyclictest's (short slice) max-delay
> > > more consistent, and that consistency drops the sum-delay. The
> > > trade-off is that the massive_intr (long slice) gets more context
> > > switches and a slight increase in sum-delay.
> > >
> > > [mike: numbers]
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> >
> > Besides this short preemption, it seems that an important patch is missing from
> > this patch set, that was originally from Prateek and you refined it to fix the
> > current task's false negative eligibility:
> > https://lore.kernel.org/lkml/20240424150721.GQ30852@noisy.programming.kicks-ass.net/
> >
> > The RESPECT_SLICE is introduced to honor the current task's slice during wakeup preemption.
> > Without it we got reported that over-preemption and performance downgrading are observed
> > when running SPECjbb on servers.
>
> So I *think* that running as SCHED_BATCH gets you exactly that
> behaviour, no?
SCHED_BATCH should work, as it avoids wakeup preemption as much as possible.
Except that RESPECT_SLICE considers the cgroup hierarchy when checking whether the
current sched_entity has used up its slice, which seems to be less aggressive.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-08-08 12:31 ` Chen Yu
@ 2024-08-09 7:35 ` Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-09 7:35 UTC (permalink / raw)
To: Chen Yu
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Mike Galbraith
On Thu, Aug 08, 2024 at 08:31:53PM +0800, Chen Yu wrote:
> On 2024-08-08 at 12:22:07 +0200, Peter Zijlstra wrote:
> > On Thu, Aug 08, 2024 at 06:15:50PM +0800, Chen Yu wrote:
> > > Hi Peter,
> > >
> > > On 2024-07-27 at 12:27:53 +0200, Peter Zijlstra wrote:
> > > > Part of the reason to have shorter slices is to improve
> > > > responsiveness. Allow shorter slices to preempt longer slices on
> > > > wakeup.
> > > >
> > > > Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
> > > >
> > > > 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
> > > >
> > > > 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
> > > > 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
> > > > 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
> > > > 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
> > > > 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
> > > > 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
> > > > 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
> > > > 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
> > > > 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
> > > >
> > > > 100ms massive_intr 500us cyclictest PREEMPT_SHORT
> > > >
> > > > 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
> > > > 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
> > > > 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
> > > > 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
> > > > 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
> > > > 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
> > > > 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
> > > > 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
> > > > 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
> > > >
> > > > As per the numbers, this makes cyclictest's (short slice) max-delay
> > > > more consistent, and that consistency drops the sum-delay. The
> > > > trade-off is that the massive_intr (long slice) gets more context
> > > > switches and a slight increase in sum-delay.
> > > >
> > > > [mike: numbers]
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> > >
> > > Besides this short preemption, it seems that an important patch is missing from
> > > this patch set, that was originally from Prateek and you refined it to fix the
> > > current task's false negative eligibility:
> > > https://lore.kernel.org/lkml/20240424150721.GQ30852@noisy.programming.kicks-ass.net/
> > >
> > > The RESPECT_SLICE is introduced to honor the current task's slice during wakeup preemption.
> > > Without it we got reported that over-preemption and performance downgrading are observed
> > > when running SPECjbb on servers.
> >
> > So I *think* that running as SCHED_BATCH gets you exactly that
> > behaviour, no?
>
> SCHED_BATCH should work as it avoids the wakeup preemption as much as possible.
> Except that RESPECT_SLICE considers the cgroup hierarchical to check if the current
> sched_entity has used up its slice, which seems to be less aggressive.
Note that update_deadline() will trigger a resched at the end of a slice
regardless -- this is driven from update_curr() and also invoked from
any preemption.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair()
2024-07-27 10:27 ` [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
@ 2024-08-09 16:53 ` Valentin Schneider
2024-08-10 22:17 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Valentin Schneider @ 2024-08-09 16:53 UTC (permalink / raw)
To: Peter Zijlstra, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 27/07/24 12:27, Peter Zijlstra wrote:
> Working towards delaying dequeue, notably also inside the hierarchy,
> rework dequeue_task_fair() such that it can 'resume' an interrupted
> hierarchy walk.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 40 insertions(+), 21 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6861,34 +6861,43 @@ enqueue_task_fair(struct rq *rq, struct
> static void set_next_buddy(struct sched_entity *se);
>
> /*
> - * The dequeue_task method is called before nr_running is
> - * decreased. We remove the task from the rbtree and
> - * update the fair scheduling stats:
> + * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> + * failing half-way through and resume the dequeue later.
> + *
> + * Returns:
> + * -1 - dequeue delayed
> + * 0 - dequeue throttled
> + * 1 - dequeue complete
> */
> -static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> +static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> {
> - struct cfs_rq *cfs_rq;
> - struct sched_entity *se = &p->se;
> - int task_sleep = flags & DEQUEUE_SLEEP;
> - int idle_h_nr_running = task_has_idle_policy(p);
> bool was_sched_idle = sched_idle_rq(rq);
> int rq_h_nr_running = rq->cfs.h_nr_running;
> + bool task_sleep = flags & DEQUEUE_SLEEP;
> + struct task_struct *p = NULL;
> + int idle_h_nr_running = 0;
> + int h_nr_running = 0;
> + struct cfs_rq *cfs_rq;
>
> - util_est_dequeue(&rq->cfs, p);
> + if (entity_is_task(se)) {
> + p = task_of(se);
> + h_nr_running = 1;
> + idle_h_nr_running = task_has_idle_policy(p);
> + }
>
This leaves the *h_nr_running values at 0 for non-task entities. IIUC this makes
sense for ->sched_delayed entities (they should be empty of tasks), not so
sure for the other case. However, this only ends up being used for non-task
entities in:
- pick_next_entity(), if se->sched_delayed
- unregister_fair_sched_group()
IIRC unregister_fair_sched_group() can only happen after the group has been
drained, so it would then indeed be empty of tasks, but I reckon this could
do with a comment/assert in dequeue_entities(), no? Or did I get too
confused by cgroups again?
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair()
2024-08-09 16:53 ` Valentin Schneider
@ 2024-08-10 22:17 ` Peter Zijlstra
2024-08-12 10:02 ` Valentin Schneider
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-10 22:17 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Fri, Aug 09, 2024 at 06:53:30PM +0200, Valentin Schneider wrote:
> On 27/07/24 12:27, Peter Zijlstra wrote:
> > Working towards delaying dequeue, notably also inside the hierarchy,
> > rework dequeue_task_fair() such that it can 'resume' an interrupted
> > hierarchy walk.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 40 insertions(+), 21 deletions(-)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6861,34 +6861,43 @@ enqueue_task_fair(struct rq *rq, struct
> > static void set_next_buddy(struct sched_entity *se);
> >
> > /*
> > - * The dequeue_task method is called before nr_running is
> > - * decreased. We remove the task from the rbtree and
> > - * update the fair scheduling stats:
> > + * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > + * failing half-way through and resume the dequeue later.
> > + *
> > + * Returns:
> > + * -1 - dequeue delayed
> > + * 0 - dequeue throttled
> > + * 1 - dequeue complete
> > */
> > -static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > +static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > {
> > - struct cfs_rq *cfs_rq;
> > - struct sched_entity *se = &p->se;
> > - int task_sleep = flags & DEQUEUE_SLEEP;
> > - int idle_h_nr_running = task_has_idle_policy(p);
> > bool was_sched_idle = sched_idle_rq(rq);
> > int rq_h_nr_running = rq->cfs.h_nr_running;
> > + bool task_sleep = flags & DEQUEUE_SLEEP;
> > + struct task_struct *p = NULL;
> > + int idle_h_nr_running = 0;
> > + int h_nr_running = 0;
> > + struct cfs_rq *cfs_rq;
> >
> > - util_est_dequeue(&rq->cfs, p);
> > + if (entity_is_task(se)) {
> > + p = task_of(se);
> > + h_nr_running = 1;
> > + idle_h_nr_running = task_has_idle_policy(p);
> > + }
> >
>
> This leaves the *h_nr_running to 0 for non-task entities. IIUC this makes
> sense for ->sched_delayed entities (they should be empty of tasks), not so
> sure for the other case. However, this only ends up being used for non-task
> entities in:
> - pick_next_entity(), if se->sched_delayed
> - unregister_fair_sched_group()
>
> IIRC unregister_fair_sched_group() can only happen after the group has been
> drained, so it would then indeed be empty of tasks, but I reckon this could
> do with a comment/assert in dequeue_entities(), no? Or did I get too
> confused by cgroups again?
>
Yeah, so I did have me a patch that made all this work for cfs bandwidth
control as well. And then we need all this for throttled cgroup entries
as well.
Anyway... I had the patch, it worked, but then I remembered you were
going to rewrite all that anyway and I was making a terrible mess of
things, so I made it go away again.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair()
2024-08-10 22:17 ` Peter Zijlstra
@ 2024-08-12 10:02 ` Valentin Schneider
0 siblings, 0 replies; 277+ messages in thread
From: Valentin Schneider @ 2024-08-12 10:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On 11/08/24 00:17, Peter Zijlstra wrote:
> On Fri, Aug 09, 2024 at 06:53:30PM +0200, Valentin Schneider wrote:
>> On 27/07/24 12:27, Peter Zijlstra wrote:
>> > -static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > +static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>> > {
>> > - struct cfs_rq *cfs_rq;
>> > - struct sched_entity *se = &p->se;
>> > - int task_sleep = flags & DEQUEUE_SLEEP;
>> > - int idle_h_nr_running = task_has_idle_policy(p);
>> > bool was_sched_idle = sched_idle_rq(rq);
>> > int rq_h_nr_running = rq->cfs.h_nr_running;
>> > + bool task_sleep = flags & DEQUEUE_SLEEP;
>> > + struct task_struct *p = NULL;
>> > + int idle_h_nr_running = 0;
>> > + int h_nr_running = 0;
>> > + struct cfs_rq *cfs_rq;
>> >
>> > - util_est_dequeue(&rq->cfs, p);
>> > + if (entity_is_task(se)) {
>> > + p = task_of(se);
>> > + h_nr_running = 1;
>> > + idle_h_nr_running = task_has_idle_policy(p);
>> > + }
>> >
>>
>> This leaves the *h_nr_running to 0 for non-task entities. IIUC this makes
>> sense for ->sched_delayed entities (they should be empty of tasks), not so
>> sure for the other case. However, this only ends up being used for non-task
>> entities in:
>> - pick_next_entity(), if se->sched_delayed
>> - unregister_fair_sched_group()
>>
>> IIRC unregister_fair_sched_group() can only happen after the group has been
>> drained, so it would then indeed be empty of tasks, but I reckon this could
>> do with a comment/assert in dequeue_entities(), no? Or did I get too
>> confused by cgroups again?
>>
>
> Yeah, so I did have me a patch that made all this work for cfs bandwidth
> control as well. And then we need all this for throttled cgroup entries
> as well.
>
> Anyway... I had the patch, it worked, but then I remembered you were
> going to rewrite all that anyway and I was making a terrible mess of
> things, so I made it go away again.
Heh, sounds like someone needs to get back to it then :-)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-08-07 17:54 ` Peter Zijlstra
@ 2024-08-13 10:44 ` Chunxin Zang
0 siblings, 0 replies; 277+ messages in thread
From: Chunxin Zang @ 2024-08-13 10:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, K Prateek Nayak,
wuyun.abel, youssefesmat, tglx, Mike Galbraith, Mike Galbraith,
Chunxin Zang
> On Aug 8, 2024, at 01:54, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 05, 2024 at 08:24:24PM +0800, Chunxin Zang wrote:
>>> On Jul 27, 2024, at 18:27, Peter Zijlstra <peterz@infradead.org> wrote:
>
>>> +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>> +{
>>> + if (!sched_feat(PREEMPT_SHORT))
>>> + return false;
>>> +
>>> + if (curr->vlag == curr->deadline)
>>> + return false;
>>> +
>>> + return !entity_eligible(cfs_rq, curr);
>>> +}
>
>> Can this be made more aggressive here? Something like , in the PREEMPT_SHORT
>> + NO_RUN_TO_PARITY combination, it could break the first deadline of the current
>> task. This can achieve better latency benefits in certain embedded scenarios, such as
>> high-priority periodic tasks.
>
> You are aware we have SCHED_DEADLINE for those, right?
>
> Why can't you use that? and what can we do to fix that.
>
>> +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>> +{
>> + if (!sched_feat(PREEMPT_SHORT))
>> + return false;
>> +
>> + if (sched_feat(RUN_TO_PARITY) && curr->vlag == curr->deadline)
>> + return false;
>> +
>> + return !entity_eligible(cfs_rq, curr);
>> +}
>
> No, this will destroy the steady state schedule (where no tasks
> join/leave) and make it so that all tasks hug the lag=0 state
> arbitrarily close -- as allowed by the scheduling quanta.
>
> Yes, it will get you better latency, because nobody gets to actually run
> it's requested slice.
>
> The goal really is for tasks to get their request -- and yes that means
> you get to wait. PREEMPT_SHORT is already an exception to this rule, and
> it is very specifically limited to wake-ups so as to retain as much of
> the intended behaviour as possible.
I think I understand your point now. That approach does indeed seem more reasonable.
Next, I will try using 'request slice' to conduct more tests in my scenario.
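(Assuming the 'request slice' mentioned here is the per-task slice suggestion the
series exposes through sched_attr::sched_runtime, a rough user-space sketch of
asking for a shorter slice follows. sched_setattr(2) has no glibc wrapper, so the
attribute layout is spelled out per its man page under a local name; the 500us
value, and the availability of the interface itself, are assumptions rather than
something stated in this thread.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout matches struct sched_attr from sched_setattr(2); a local name is
 * used in case libc already provides the struct. */
struct slice_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* request / slice, in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct slice_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;	/* stay in the fair class */
	attr.sched_runtime = 500 * 1000;	/* suggest a ~500us slice */

	if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}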
>
>> Additionally, if possible, could you please include my name in this patch? I spent over a
>> month finding this solution and conducting the tests, and I hope to leave some trace of
>> my efforts during that time. This is also one of the reasons why I love Linux and am eager
>> to contribute to open source. I would be extremely grateful.
>
> I've made it the below. Does that work for you?
Heartfelt thanks :)
Chunxin
>
> ---
> Subject: sched/eevdf: Allow shorter slices to wakeup-preempt
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Sep 26 14:32:32 CEST 2023
>
> Part of the reason to have shorter slices is to improve
> responsiveness. Allow shorter slices to preempt longer slices on
> wakeup.
>
> Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
>
> 100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
>
> 1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
> 2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
> 3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
> 1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
> 2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
> 3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
> 1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
> 2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
> 3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
>
> 100ms massive_intr 500us cyclictest PREEMPT_SHORT
>
> 1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
> 2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
> 3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
> 1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
> 2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
> 3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
> 1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
> 2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
> 3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
>
> As per the numbers, this makes cyclictest's (short slice) max-delay
> more consistent, and that consistency drops the sum-delay. The
> trade-off is that the massive_intr (long slice) gets more context
> switches and a slight increase in sum-delay.
>
> Chunxin contributed did_preempt_short() where a task that lost slice
> protection from PREEMPT_SHORT gets rescheduled once it becomes
> in-eligible.
>
> [mike: numbers]
> Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com>
> Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
> ---
> kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
> kernel/sched/features.h | 5 +++
> 2 files changed, 61 insertions(+), 8 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -973,10 +973,10 @@ static void clear_buddies(struct cfs_rq
> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
> * this is probably good enough.
> */
> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> if ((s64)(se->vruntime - se->deadline) < 0)
> - return;
> + return false;
>
> /*
> * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -993,10 +993,7 @@ static void update_deadline(struct cfs_r
> /*
> * The task has consumed its request, reschedule.
> */
> - if (cfs_rq->nr_running > 1) {
> - resched_curr(rq_of(cfs_rq));
> - clear_buddies(cfs_rq, se);
> - }
> + return true;
> }
>
> #include "pelt.h"
> @@ -1134,6 +1131,38 @@ static inline void update_curr_task(stru
> dl_server_update(p->dl_server, delta_exec);
> }
>
> +static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> +{
> + if (!sched_feat(PREEMPT_SHORT))
> + return false;
> +
> + if (curr->vlag == curr->deadline)
> + return false;
> +
> + return !entity_eligible(cfs_rq, curr);
> +}
> +
> +static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + if (!sched_feat(PREEMPT_SHORT))
> + return false;
> +
> + if (pse->slice >= se->slice)
> + return false;
> +
> + if (!entity_eligible(cfs_rq, pse))
> + return false;
> +
> + if (entity_before(pse, se))
> + return true;
> +
> + if (!entity_eligible(cfs_rq, se))
> + return true;
> +
> + return false;
> +}
> +
> /*
> * Used by other classes to account runtime.
> */
> @@ -1157,6 +1186,7 @@ static void update_curr(struct cfs_rq *c
> struct sched_entity *curr = cfs_rq->curr;
> struct rq *rq = rq_of(cfs_rq);
> s64 delta_exec;
> + bool resched;
>
> if (unlikely(!curr))
> return;
> @@ -1166,7 +1196,7 @@ static void update_curr(struct cfs_rq *c
> return;
>
> curr->vruntime += calc_delta_fair(delta_exec, curr);
> - update_deadline(cfs_rq, curr);
> + resched = update_deadline(cfs_rq, curr);
> update_min_vruntime(cfs_rq);
>
> if (entity_is_task(curr)) {
> @@ -1184,6 +1214,14 @@ static void update_curr(struct cfs_rq *c
> }
>
> account_cfs_rq_runtime(cfs_rq, delta_exec);
> +
> + if (rq->nr_running == 1)
> + return;
> +
> + if (resched || did_preempt_short(cfs_rq, curr)) {
> + resched_curr(rq);
> + clear_buddies(cfs_rq, curr);
> + }
> }
>
> static void update_curr_fair(struct rq *rq)
> @@ -8611,7 +8649,17 @@ static void check_preempt_wakeup_fair(st
> cfs_rq = cfs_rq_of(se);
> update_curr(cfs_rq);
> /*
> - * XXX pick_eevdf(cfs_rq) != se ?
> + * If @p has a shorter slice than current and @p is eligible, override
> + * current's slice protection in order to allow preemption.
> + *
> + * Note that even if @p does not turn out to be the most eligible
> + * task at this moment, current's slice protection will be lost.
> + */
> + if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline)
> + se->vlag = se->deadline + 1;
> +
> + /*
> + * If @p has become the most eligible task, force preemption.
> */
> if (pick_eevdf(cfs_rq) == pse)
> goto preempt;
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -18,6 +18,11 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
> * 0-lag point or until is has exhausted it's slice.
> */
> SCHED_FEAT(RUN_TO_PARITY, true)
> +/*
> + * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for
> + * current.
> + */
> +SCHED_FEAT(PREEMPT_SHORT, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-07-27 10:27 ` [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue Peter Zijlstra
@ 2024-08-13 12:43 ` Valentin Schneider
2024-08-13 21:54 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-08-27 9:17 ` [PATCH 12/24] " Chen Yu
2 siblings, 1 reply; 277+ messages in thread
From: Valentin Schneider @ 2024-08-13 12:43 UTC (permalink / raw)
To: Peter Zijlstra, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 27/07/24 12:27, Peter Zijlstra wrote:
> @@ -12817,10 +12830,26 @@ static void attach_task_cfs_rq(struct ta
> static void switched_from_fair(struct rq *rq, struct task_struct *p)
> {
> detach_task_cfs_rq(p);
> + /*
> + * Since this is called after changing class, this isn't quite right.
> + * Specifically, this causes the task to get queued in the target class
> + * and experience a 'spurious' wakeup.
> + *
> + * However, since 'spurious' wakeups are harmless, this shouldn't be a
> + * problem.
> + */
> + p->se.sched_delayed = 0;
> + /*
> + * While here, also clear the vlag, it makes little sense to carry that
> + * over the excursion into the new class.
> + */
> + p->se.vlag = 0;
RQ lock is held, the task can't be current if it's ->sched_delayed; is a
dequeue_task() not possible at this point? Or just not worth it?
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-07-27 10:27 ` [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE Peter Zijlstra
@ 2024-08-13 12:43 ` Valentin Schneider
2024-08-13 22:18 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Valentin Schneider @ 2024-08-13 12:43 UTC (permalink / raw)
To: Peter Zijlstra, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 27/07/24 12:27, Peter Zijlstra wrote:
> Note that tasks that are kept on the runqueue to burn off negative
> lag, are not in fact runnable anymore, they'll get dequeued the moment
> they get picked.
>
> As such, don't count this time towards runnable.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/fair.c | 2 ++
> kernel/sched/sched.h | 6 ++++++
> 2 files changed, 8 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5388,6 +5388,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> if (cfs_rq->next == se)
> cfs_rq->next = NULL;
> se->sched_delayed = 1;
> + update_load_avg(cfs_rq, se, 0);
Shouldn't this be before setting ->sched_delayed? accumulate_sum() should
see the time delta as spent being runnable.
> return false;
> }
> }
> @@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
> }
>
> se->sched_delayed = 0;
> + update_load_avg(cfs_rq, se, 0);
Ditto on the ordering
> }
>
> /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -816,6 +816,9 @@ static inline void se_update_runnable(st
>
> static inline long se_runnable(struct sched_entity *se)
> {
> + if (se->sched_delayed)
> + return false;
> +
Per __update_load_avg_se(), delayed-dequeue entities are still ->on_rq, so
their load signal will increase. Do we want a similar helper for the @load
input of ___update_load_sum()?
> if (entity_is_task(se))
> return !!se->on_rq;
> else
> @@ -830,6 +833,9 @@ static inline void se_update_runnable(st
>
> static inline long se_runnable(struct sched_entity *se)
> {
> + if (se->sched_delayed)
> + return false;
> +
> return !!se->on_rq;
> }
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-13 12:43 ` Valentin Schneider
@ 2024-08-13 21:54 ` Peter Zijlstra
2024-08-13 22:07 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-13 21:54 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Tue, Aug 13, 2024 at 02:43:47PM +0200, Valentin Schneider wrote:
> On 27/07/24 12:27, Peter Zijlstra wrote:
> > @@ -12817,10 +12830,26 @@ static void attach_task_cfs_rq(struct ta
> > static void switched_from_fair(struct rq *rq, struct task_struct *p)
> > {
> > detach_task_cfs_rq(p);
> > + /*
> > + * Since this is called after changing class, this isn't quite right.
> > + * Specifically, this causes the task to get queued in the target class
> > + * and experience a 'spurious' wakeup.
> > + *
> > + * However, since 'spurious' wakeups are harmless, this shouldn't be a
> > + * problem.
> > + */
> > + p->se.sched_delayed = 0;
> > + /*
> > + * While here, also clear the vlag, it makes little sense to carry that
> > + * over the excursion into the new class.
> > + */
> > + p->se.vlag = 0;
>
> RQ lock is held, the task can't be current if it's ->sched_delayed; is a
> dequeue_task() not possible at this point? Or just not worth it?
Hurmph, I really can't remember why I did it like this :-(
Also, I remember thinking this vlag reset might not be ideal; PI-induced
class excursions might be very short and would benefit from retaining
the vlag.
Let me make this something like:
if (se->sched_delayed)
dequeue_entities(rq, &p->se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-13 21:54 ` Peter Zijlstra
@ 2024-08-13 22:07 ` Peter Zijlstra
2024-08-14 5:53 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-13 22:07 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Tue, Aug 13, 2024 at 11:54:21PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 13, 2024 at 02:43:47PM +0200, Valentin Schneider wrote:
> > On 27/07/24 12:27, Peter Zijlstra wrote:
> > > @@ -12817,10 +12830,26 @@ static void attach_task_cfs_rq(struct ta
> > > static void switched_from_fair(struct rq *rq, struct task_struct *p)
> > > {
> > > detach_task_cfs_rq(p);
> > > + /*
> > > + * Since this is called after changing class, this isn't quite right.
> > > + * Specifically, this causes the task to get queued in the target class
> > > + * and experience a 'spurious' wakeup.
> > > + *
> > > + * However, since 'spurious' wakeups are harmless, this shouldn't be a
> > > + * problem.
> > > + */
> > > + p->se.sched_delayed = 0;
> > > + /*
> > > + * While here, also clear the vlag, it makes little sense to carry that
> > > + * over the excursion into the new class.
> > > + */
> > > + p->se.vlag = 0;
> >
> > RQ lock is held, the task can't be current if it's ->sched_delayed; is a
> > dequeue_task() not possible at this point? Or just not worth it?
>
> Hurmph, I really can't remember why I did it like this :-(
Obviously I remember it right after hitting send...
We've just done:
dequeue_task();
p->sched_class = some_other_class;
enqueue_task();
IOW, we're enqueued as some other class at this point. There is no way
we can fix it up at this point.
Perhaps I can use the sched_class::switching_to thing sched_ext will
bring.
For now, let's keep things as is. I'll look at it later.
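For reference, the ordering in question is roughly the following (a
from-memory sketch of __sched_setscheduler() / check_class_changed(), not the
exact code; flags simplified):

	queued = task_on_rq_queued(p);
	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE);

	prev_class = p->sched_class;
	p->sched_class = new_class;			/* the class switch */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE);	/* already the new class here */

	/* only now do the per-class fixups run: */
	if (prev_class != p->sched_class) {
		if (prev_class->switched_from)
			prev_class->switched_from(rq, p);	/* e.g. switched_from_fair() */
		p->sched_class->switched_to(rq, p);
	}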
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-13 12:43 ` Valentin Schneider
@ 2024-08-13 22:18 ` Peter Zijlstra
2024-08-14 7:25 ` Peter Zijlstra
2024-08-14 12:59 ` Vincent Guittot
0 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-13 22:18 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Tue, Aug 13, 2024 at 02:43:56PM +0200, Valentin Schneider wrote:
> On 27/07/24 12:27, Peter Zijlstra wrote:
> > Note that tasks that are kept on the runqueue to burn off negative
> > lag, are not in fact runnable anymore, they'll get dequeued the moment
> > they get picked.
> >
> > As such, don't count this time towards runnable.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/fair.c | 2 ++
> > kernel/sched/sched.h | 6 ++++++
> > 2 files changed, 8 insertions(+)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5388,6 +5388,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > if (cfs_rq->next == se)
> > cfs_rq->next = NULL;
> > se->sched_delayed = 1;
> > + update_load_avg(cfs_rq, se, 0);
>
> Shouldn't this be before setting ->sched_delayed? accumulate_sum() should
> see the time delta as spent being runnable.
>
> > return false;
> > }
> > }
> > @@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
> > }
> >
> > se->sched_delayed = 0;
> > + update_load_avg(cfs_rq, se, 0);
>
> Ditto on the ordering
Bah, so I remember thinking about it and then I obviously go and do it
the exact wrong way around eh? Let me double check this tomorrow morning
with the brain slightly more awake :/
> > }
> >
> > /*
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -816,6 +816,9 @@ static inline void se_update_runnable(st
> >
> > static inline long se_runnable(struct sched_entity *se)
> > {
> > + if (se->sched_delayed)
> > + return false;
> > +
>
> Per __update_load_avg_se(), delayed-dequeue entities are still ->on_rq, so
> their load signal will increase. Do we want a similar helper for the @load
> input of ___update_load_sum()?
So the whole reason to keep them enqueued is so that they can continue
to compete for vruntime, and vruntime is load based. So it would be very
weird to remove them from load.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-13 22:07 ` Peter Zijlstra
@ 2024-08-14 5:53 ` Peter Zijlstra
2024-08-27 9:35 ` Chen Yu
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-14 5:53 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Wed, Aug 14, 2024 at 12:07:57AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 13, 2024 at 11:54:21PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 13, 2024 at 02:43:47PM +0200, Valentin Schneider wrote:
> > > On 27/07/24 12:27, Peter Zijlstra wrote:
> > > > @@ -12817,10 +12830,26 @@ static void attach_task_cfs_rq(struct ta
> > > > static void switched_from_fair(struct rq *rq, struct task_struct *p)
> > > > {
> > > > detach_task_cfs_rq(p);
> > > > + /*
> > > > + * Since this is called after changing class, this isn't quite right.
> > > > + * Specifically, this causes the task to get queued in the target class
> > > > + * and experience a 'spurious' wakeup.
> > > > + *
> > > > + * However, since 'spurious' wakeups are harmless, this shouldn't be a
> > > > + * problem.
> > > > + */
> > > > + p->se.sched_delayed = 0;
> > > > + /*
> > > > + * While here, also clear the vlag, it makes little sense to carry that
> > > > + * over the excursion into the new class.
> > > > + */
> > > > + p->se.vlag = 0;
> > >
> > > RQ lock is held, the task can't be current if it's ->sched_delayed; is a
> > > dequeue_task() not possible at this point? Or just not worth it?
> >
> > Hurmph, I really can't remember why I did it like this :-(
>
> Obviously I remember it right after hitting send...
>
> We've just done:
>
> dequeue_task();
> p->sched_class = some_other_class;
> enqueue_task();
>
> IOW, we're enqueued as some other class at this point. There is no way
> we can fix it up at this point.
With just a little more sleep than last night, perhaps you're right
after all. Yes we're on a different class, but we can *still* dequeue it
again.
That is, something like the below ... I'll stick it on and see if
anything falls over.
---
kernel/sched/fair.c | 22 +++++++++-------------
1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 714826d97ef2..53c8f3ccfd0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13105,20 +13105,16 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
detach_task_cfs_rq(p);
/*
- * Since this is called after changing class, this isn't quite right.
- * Specifically, this causes the task to get queued in the target class
- * and experience a 'spurious' wakeup.
- *
- * However, since 'spurious' wakeups are harmless, this shouldn't be a
- * problem.
- */
- p->se.sched_delayed = 0;
- /*
- * While here, also clear the vlag, it makes little sense to carry that
- * over the excursion into the new class.
+ * Since this is called after changing class, this is a little weird
+ * and we cannot use DEQUEUE_DELAYED.
*/
- p->se.vlag = 0;
- p->se.rel_deadline = 0;
+ if (p->se.sched_delayed) {
+ dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP);
+ p->se.sched_delayed = 0;
+ p->se.rel_deadline = 0;
+ if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
+ p->se.vlag = 0;
+ }
}
static void switched_to_fair(struct rq *rq, struct task_struct *p)
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-13 22:18 ` Peter Zijlstra
@ 2024-08-14 7:25 ` Peter Zijlstra
2024-08-14 7:28 ` Peter Zijlstra
2024-08-14 10:23 ` Valentin Schneider
2024-08-14 12:59 ` Vincent Guittot
1 sibling, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-14 7:25 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Wed, Aug 14, 2024 at 12:18:07AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 13, 2024 at 02:43:56PM +0200, Valentin Schneider wrote:
> > On 27/07/24 12:27, Peter Zijlstra wrote:
> > > Note that tasks that are kept on the runqueue to burn off negative
> > > lag, are not in fact runnable anymore, they'll get dequeued the moment
> > > they get picked.
> > >
> > > As such, don't count this time towards runnable.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > kernel/sched/fair.c | 2 ++
> > > kernel/sched/sched.h | 6 ++++++
> > > 2 files changed, 8 insertions(+)
> > >
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5388,6 +5388,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > > if (cfs_rq->next == se)
> > > cfs_rq->next = NULL;
> > > se->sched_delayed = 1;
> > > + update_load_avg(cfs_rq, se, 0);
> >
> > Shouldn't this be before setting ->sched_delayed? accumulate_sum() should
> > see the time delta as spent being runnable.
> >
> > > return false;
> > > }
> > > }
> > > @@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
> > > }
> > >
> > > se->sched_delayed = 0;
> > > + update_load_avg(cfs_rq, se, 0);
> >
> > Ditto on the ordering
>
> Bah, so I remember thinking about it and then I obviously go and do it
> the exact wrong way around eh? Let me double check this tomorrow morning
> with the brain slightly more awake :/
OK, so I went over it again and I ended up with the below diff -- which
assuming I didn't make a giant mess of things *again*, I should go fold
back into various other patches ...
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b15dbfb1ce5..fa8907f2c716 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5461,14 +5461,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool sleep = flags & DEQUEUE_SLEEP;
+ update_curr(cfs_rq);
+
if (flags & DEQUEUE_DELAYED) {
- /*
- * DEQUEUE_DELAYED is typically called from pick_next_entity()
- * at which point we've already done update_curr() and do not
- * want to do so again.
- */
SCHED_WARN_ON(!se->sched_delayed);
- se->sched_delayed = 0;
} else {
bool delay = sleep;
/*
@@ -5479,14 +5475,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
delay = false;
SCHED_WARN_ON(delay && se->sched_delayed);
- update_curr(cfs_rq);
if (sched_feat(DELAY_DEQUEUE) && delay &&
!entity_eligible(cfs_rq, se)) {
if (cfs_rq->next == se)
cfs_rq->next = NULL;
- se->sched_delayed = 1;
update_load_avg(cfs_rq, se, 0);
+ se->sched_delayed = 1;
return false;
}
}
@@ -5536,6 +5531,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
+ if (flags & DEQUEUE_DELAYED) {
+ se->sched_delayed = 0;
+ if (sched_feat(DELAY_ZERO) && se->vlag > 0)
+ se->vlag = 0;
+ }
+
if (cfs_rq->nr_running == 0)
update_idle_cfs_rq_clock_pelt(cfs_rq);
@@ -5611,11 +5612,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
struct sched_entity *se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
- SCHED_WARN_ON(se->sched_delayed);
- SCHED_WARN_ON(se->on_rq);
- if (sched_feat(DELAY_ZERO) && se->vlag > 0)
- se->vlag = 0;
-
return NULL;
}
return se;
@@ -6906,7 +6902,7 @@ requeue_delayed_entity(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
/*
- * se->sched_delayed should imply both: se->on_rq == 1.
+ * se->sched_delayed should imply: se->on_rq == 1.
* Because a delayed entity is one that is still on
* the runqueue competing until elegibility.
*/
@@ -6927,8 +6923,8 @@ requeue_delayed_entity(struct sched_entity *se)
}
}
- se->sched_delayed = 0;
update_load_avg(cfs_rq, se, 0);
+ se->sched_delayed = 0;
}
/*
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-14 7:25 ` Peter Zijlstra
@ 2024-08-14 7:28 ` Peter Zijlstra
2024-08-14 10:23 ` Valentin Schneider
1 sibling, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-14 7:28 UTC (permalink / raw)
To: Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Wed, Aug 14, 2024 at 09:25:48AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 14, 2024 at 12:18:07AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 13, 2024 at 02:43:56PM +0200, Valentin Schneider wrote:
> > > On 27/07/24 12:27, Peter Zijlstra wrote:
> > > > Note that tasks that are kept on the runqueue to burn off negative
> > > > lag, are not in fact runnable anymore, they'll get dequeued the moment
> > > > they get picked.
> > > >
> > > > As such, don't count this time towards runnable.
> > > >
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > ---
> > > > kernel/sched/fair.c | 2 ++
> > > > kernel/sched/sched.h | 6 ++++++
> > > > 2 files changed, 8 insertions(+)
> > > >
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -5388,6 +5388,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > > > if (cfs_rq->next == se)
> > > > cfs_rq->next = NULL;
> > > > se->sched_delayed = 1;
> > > > + update_load_avg(cfs_rq, se, 0);
> > >
> > > Shouldn't this be before setting ->sched_delayed? accumulate_sum() should
> > > see the time delta as spent being runnable.
> > >
> > > > return false;
> > > > }
> > > > }
> > > > @@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
> > > > }
> > > >
> > > > se->sched_delayed = 0;
> > > > + update_load_avg(cfs_rq, se, 0);
> > >
> > > Ditto on the ordering
> >
> > Bah, so I remember thinking about it and then I obviously go and do it
> > the exact wrong way around eh? Let me double check this tomorrow morning
> > with the brain slightly more awake :/
>
> OK, so I went over it again and I ended up with the below diff -- which
> assuming I didn't make a giant mess of things *again*, I should go fold
> back into various other patches ...
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1b15dbfb1ce5..fa8907f2c716 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5461,14 +5461,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> bool sleep = flags & DEQUEUE_SLEEP;
>
> + update_curr(cfs_rq);
> +
> if (flags & DEQUEUE_DELAYED) {
> - /*
> - * DEQUEUE_DELAYED is typically called from pick_next_entity()
> - * at which point we've already done update_curr() and do not
> - * want to do so again.
> - */
> SCHED_WARN_ON(!se->sched_delayed);
> - se->sched_delayed = 0;
> } else {
> bool delay = sleep;
> /*
Because a repeated update_curr() is harmless (I think I was thinking it
would move the clock, but it doesn't do that), while a missed
update_curr() makes a mess.
> @@ -5479,14 +5475,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> delay = false;
>
> SCHED_WARN_ON(delay && se->sched_delayed);
> - update_curr(cfs_rq);
>
> if (sched_feat(DELAY_DEQUEUE) && delay &&
> !entity_eligible(cfs_rq, se)) {
> if (cfs_rq->next == se)
> cfs_rq->next = NULL;
> - se->sched_delayed = 1;
> update_load_avg(cfs_rq, se, 0);
> + se->sched_delayed = 1;
> return false;
> }
> }
As you said, update to now, then mark delayed.
> @@ -5536,6 +5531,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
> update_min_vruntime(cfs_rq);
>
> + if (flags & DEQUEUE_DELAYED) {
> + se->sched_delayed = 0;
> + if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> + se->vlag = 0;
> + }
> +
> if (cfs_rq->nr_running == 0)
> update_idle_cfs_rq_clock_pelt(cfs_rq);
>
> @@ -5611,11 +5612,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> struct sched_entity *se = pick_eevdf(cfs_rq);
> if (se->sched_delayed) {
> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> - SCHED_WARN_ON(se->sched_delayed);
> - SCHED_WARN_ON(se->on_rq);
> - if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> - se->vlag = 0;
> -
> return NULL;
> }
> return se;
Because if we have to clear delayed at the end of dequeue, we might as
well do all of the fixups at that point.
> @@ -6906,7 +6902,7 @@ requeue_delayed_entity(struct sched_entity *se)
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> /*
> - * se->sched_delayed should imply both: se->on_rq == 1.
> + * se->sched_delayed should imply: se->on_rq == 1.
> * Because a delayed entity is one that is still on
> * the runqueue competing until elegibility.
> */
edit fail, somewhere along history.
> @@ -6927,8 +6923,8 @@ requeue_delayed_entity(struct sched_entity *se)
> }
> }
>
> - se->sched_delayed = 0;
> update_load_avg(cfs_rq, se, 0);
> + se->sched_delayed = 0;
> }
What you said..
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-14 7:25 ` Peter Zijlstra
2024-08-14 7:28 ` Peter Zijlstra
@ 2024-08-14 10:23 ` Valentin Schneider
1 sibling, 0 replies; 277+ messages in thread
From: Valentin Schneider @ 2024-08-14 10:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On 14/08/24 09:25, Peter Zijlstra wrote:
> On Wed, Aug 14, 2024 at 12:18:07AM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 13, 2024 at 02:43:56PM +0200, Valentin Schneider wrote:
>> > On 27/07/24 12:27, Peter Zijlstra wrote:
>> > > @@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
>> > > }
>> > >
>> > > se->sched_delayed = 0;
>> > > + update_load_avg(cfs_rq, se, 0);
>> >
>> > Ditto on the ordering
>>
>> Bah, so I remember thinking about it and then I obviously go and do it
>> the exact wrong way around eh? Let me double check this tomorrow morning
>> with the brain slightly more awake :/
>
> OK, so I went over it again and I ended up with the below diff -- which
> assuming I didn't make a giant mess of things *again*, I should go fold
> back into various other patches ...
>
Looks right to me, thanks! I'll go test the newer stack of patches.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-13 22:18 ` Peter Zijlstra
2024-08-14 7:25 ` Peter Zijlstra
@ 2024-08-14 12:59 ` Vincent Guittot
2024-08-17 23:06 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Vincent Guittot @ 2024-08-14 12:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Valentin Schneider, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Wed, 14 Aug 2024 at 00:18, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 13, 2024 at 02:43:56PM +0200, Valentin Schneider wrote:
> > On 27/07/24 12:27, Peter Zijlstra wrote:
> > > Note that tasks that are kept on the runqueue to burn off negative
> > > lag, are not in fact runnable anymore, they'll get dequeued the moment
> > > they get picked.
> > >
> > > As such, don't count this time towards runnable.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > kernel/sched/fair.c | 2 ++
> > > kernel/sched/sched.h | 6 ++++++
> > > 2 files changed, 8 insertions(+)
> > >
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5388,6 +5388,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> > > if (cfs_rq->next == se)
> > > cfs_rq->next = NULL;
> > > se->sched_delayed = 1;
> > > + update_load_avg(cfs_rq, se, 0);
> >
> > Shouldn't this be before setting ->sched_delayed? accumulate_sum() should
> > see the time delta as spent being runnable.
> >
> > > return false;
> > > }
> > > }
> > > @@ -6814,6 +6815,7 @@ requeue_delayed_entity(struct sched_enti
> > > }
> > >
> > > se->sched_delayed = 0;
> > > + update_load_avg(cfs_rq, se, 0);
> >
> > Ditto on the ordering
>
> Bah, so I remember thinking about it and then I obviously go and do it
> the exact wrong way around eh? Let me double check this tomorrow morning
> with the brain slightly more awake :/
>
> > > }
> > >
> > > /*
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -816,6 +816,9 @@ static inline void se_update_runnable(st
> > >
> > > static inline long se_runnable(struct sched_entity *se)
> > > {
> > > + if (se->sched_delayed)
> > > + return false;
> > > +
> >
> > Per __update_load_avg_se(), delayed-dequeue entities are still ->on_rq, so
> > their load signal will increase. Do we want a similar helper for the @load
> > input of ___update_load_sum()?
>
> So the whole reason to keep them enqueued is so that they can continue
> to compete for vruntime, and vruntime is load based. So it would be very
> weird to remove them from load.
We only use the weight to update vruntime, not the load. The load is
used to balance tasks between CPUs, and if we keep a "delayed" dequeued
task in the load, we will artificially inflate the load_avg on this rq.
Shouldn't we track the sum of the weights of delayed-dequeue tasks
separately and apply it only on vruntime updates?
>
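Something along these lines, maybe (entirely hypothetical, not part of the
series; the delayed_weight field below does not exist), so the balancer could
look at load minus the delayed portion while vruntime keeps seeing the full
weight:

/* hypothetical extra accounting for delayed-dequeue entities */
static void delayed_weight_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->delayed_weight += se->load.weight;	/* made-up field */
}

static void delayed_weight_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->delayed_weight -= se->load.weight;
}

/*
 * called next to the places that set/clear se->sched_delayed, e.g.:
 *	se->sched_delayed = 1;  ->  delayed_weight_add(cfs_rq, se);
 *	se->sched_delayed = 0;  ->  delayed_weight_sub(cfs_rq, se);
 */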
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (24 preceding siblings ...)
2024-08-01 12:08 ` [PATCH 00/24] Complete EEVDF Luis Machado
@ 2024-08-14 14:34 ` Vincent Guittot
2024-08-14 16:45 ` Mike Galbraith
2024-08-16 15:22 ` Valentin Schneider
` (5 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Vincent Guittot @ 2024-08-14 14:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx, efault
On Sat, 27 Jul 2024 at 13:02, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
While trying to test what would be the impact of delayed dequeue on
load_avg, I noticed something strange with the running slice. I have a
simple test with 2 always running threads on 1 CPU and each thread
runs around 100ms continuously before switching to the other one,
whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
between 2 context switches.
I'm using your sched/core branch. Is it the correct one?
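A minimal reproduction of such a test could look like this (a sketch, not the
actual tool used): two always-running threads pinned to a single CPU.

/* build with: gcc -O2 -pthread busy.c -o busy */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *busy(void *arg)
{
	for (;;)
		;		/* burn CPU forever */
	return NULL;
}

int main(void)
{
	cpu_set_t set;
	pthread_t t1, t2;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	/* pin the main thread to CPU 0; threads created later inherit the mask */
	sched_setaffinity(0, sizeof(set), &set);

	pthread_create(&t1, NULL, busy, NULL);
	pthread_create(&t2, NULL, busy, NULL);

	pthread_join(t1, NULL);		/* never returns */
	pthread_join(t2, NULL);
	return 0;
}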
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-14 14:34 ` Vincent Guittot
@ 2024-08-14 16:45 ` Mike Galbraith
2024-08-14 16:59 ` Vincent Guittot
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-08-14 16:45 UTC (permalink / raw)
To: Vincent Guittot, Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx
On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
>
> While trying to test what would be the impact of delayed dequeue on
> load_avg, I noticed something strange with the running slice. I have a
> simple test with 2 always running threads on 1 CPU and each thread
> runs around 100ms continuously before switching to the other one,
> whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> between 2 context switches.
>
> I'm using your sched/core branch. Is it the correct one?
Hm, building that branch, I see the expected tick granularity (4ms).
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-14 16:45 ` Mike Galbraith
@ 2024-08-14 16:59 ` Vincent Guittot
2024-08-14 17:18 ` Mike Galbraith
2024-08-14 17:35 ` K Prateek Nayak
0 siblings, 2 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-14 16:59 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
>
> On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
> >
> > While trying to test what would be the impact of delayed dequeue on
> > load_avg, I noticed something strange with the running slice. I have a
> > simple test with 2 always running threads on 1 CPU and each thread
> > runs around 100ms continuously before switching to the other one,
> > whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> > between 2 context switches.
> >
> > I'm using your sched/core branch. Is it the correct one?
>
> Hm, building that branch, I see the expected tick granularity (4ms).
On my side tip/sched/core switches every 4ms, but Peter's sched/core,
which has the delayed-dequeue work queued on top of tip/sched/core if
I'm not mistaken, switches every 100ms.
>
> -Mike
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-14 16:59 ` Vincent Guittot
@ 2024-08-14 17:18 ` Mike Galbraith
2024-08-14 17:25 ` Vincent Guittot
2024-08-14 17:35 ` K Prateek Nayak
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-08-14 17:18 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 2024-08-14 at 18:59 +0200, Vincent Guittot wrote:
> On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
> >
> > On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
> > >
> > > While trying to test what would be the impact of delayed dequeue on
> > > load_avg, I noticed something strange with the running slice. I have a
> > > simple test with 2 always running threads on 1 CPU and each thread
> > > runs around 100ms continuously before switching to the other one,
> > > whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> > > between 2 context switches.
> > >
> > > I'm using your sched/core branch. Is it the correct one?
> >
> > Hm, building that branch, I see the expected tick granularity (4ms).
>
> On my side tip/sched/core switches every 4ms, but Peter's sched/core,
> which has the delayed-dequeue work queued on top of tip/sched/core if
> I'm not mistaken, switches every 100ms.
FWIW, I checked my local master-rt tree as well, which has Peter's
latest eevdf series wedged in (plus 4cc290c20a98 now).. it also worked
as expected.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-14 17:18 ` Mike Galbraith
@ 2024-08-14 17:25 ` Vincent Guittot
0 siblings, 0 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-14 17:25 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 14 Aug 2024 at 19:18, Mike Galbraith <efault@gmx.de> wrote:
>
> On Wed, 2024-08-14 at 18:59 +0200, Vincent Guittot wrote:
> > On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
> > >
> > > On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
> > > >
> > > > While trying to test what would be the impact of delayed dequeue on
> > > > load_avg, I noticed something strange with the running slice. I have a
> > > > simple test with 2 always running threads on 1 CPU and each thread
> > > > runs around 100ms continuously before switching to the other one,
> > > > whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
> > > > between 2 context switches.
> > > >
> > > > I'm using your sched/core branch. Is it the correct one?
> > >
> > > Hm, building that branch, I see the expected tick granularity (4ms).
> >
> > On my side tip/sched/core switches every 4ms, but Peter's sched/core,
> > which has the delayed-dequeue work queued on top of tip/sched/core if
> > I'm not mistaken, switches every 100ms.
>
> FWIW, I checked my local master-rt tree as well, which has Peter's
> latest eevdf series wedged in (plus 4cc290c20a98 now).. it also worked
> as expected.
After looking at the deadline and slice, the issue is that my tool was
trying to change the slice (an old trial from a previous version),
which got clamped to 100ms.
We can forget this, sorry for the noise.
>
> -Mike
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-14 16:59 ` Vincent Guittot
2024-08-14 17:18 ` Mike Galbraith
@ 2024-08-14 17:35 ` K Prateek Nayak
1 sibling, 0 replies; 277+ messages in thread
From: K Prateek Nayak @ 2024-08-14 17:35 UTC (permalink / raw)
To: Vincent Guittot, Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wuyun.abel,
youssefesmat, tglx
Hello Vincent, Mike,
On 8/14/2024 10:29 PM, Vincent Guittot wrote:
> On Wed, 14 Aug 2024 at 18:46, Mike Galbraith <efault@gmx.de> wrote:
>>
>> On Wed, 2024-08-14 at 16:34 +0200, Vincent Guittot wrote:
>>>
>>> While trying to test what would be the impact of delayed dequeue on
>>> load_avg, I noticed something strange with the running slice. I have a
>>> simple test with 2 always running threads on 1 CPU and each thread
>>> runs around 100ms continuously before switching to the other one,
>>> whereas I was expecting 3ms (the sysctl_sched_base_slice on my system)
>>> between 2 context switches.
>>>
>>> I'm using your sched/core branch. Is it the correct one?
>>
>> Hm, building that branch, I see the expected tick granularity (4ms).
>
> On my side tip/sched/core switches every 4ms, but Peter's sched/core,
> which has the delayed-dequeue work queued on top of tip/sched/core if
> I'm not mistaken, switches every 100ms.
I could not observe this behavior when running two busy loops pinned to
one CPU on my end. I'm running with base_slice_ns of 3ms and the
sched_feats related to the Complete EEVDF series look as follows:
PLACE_LAG
PLACE_DEADLINE_INITIAL
PLACE_REL_DEADLINE
RUN_TO_PARITY
PREEMPT_SHORT
NO_NEXT_BUDDY
CACHE_HOT_BUDDY
DELAY_DEQUEUE
DELAY_ZERO
WAKEUP_PREEMPTION
...
Also I'm running with CONFIG_HZ=250 (4ms tick granularity)
CONFIG_HZ_250=y
CONFIG_HZ=250
Enabling the sched_switch tracepoint, I see the following:
...
loop-4061 109.710379: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.714377: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.718375: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.722374: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.726379: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.730377: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.734367: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.738365: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.742364: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.746361: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.750359: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.754357: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.758355: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.762353: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.766351: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.770349: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.774347: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.778345: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.782343: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.786346: sched_switch: ... prev_pid=4060 ... prev_state=R ==> next_comm=kworker/1:1 next_pid=1616 next_prio=120
kworker/1:1-1616 109.786412: sched_switch: prev_comm=kworker/1:1 prev_pid=1616 ... prev_state=I ==> ... next_pid=4061 ...
loop-4061 109.794337: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.798335: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.802335: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.806331: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.810329: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.814327: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.818325: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.822323: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.826321: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
loop-4060 109.830321: sched_switch: ... prev_pid=4060 ... prev_state=R ==> ... next_pid=4061 ...
loop-4061 109.834317: sched_switch: ... prev_pid=4061 ... prev_state=R ==> ... next_pid=4060 ...
...
(Trimmed traces are for busy loops with pids 4060 and 4061)
I see the expected tick granularity similar to Mike. Since Peter's tree
is prone to force-updates, I'm on
git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
at commit 4cc290c20a98 "sched/eevdf: Dequeue in switched_from_fair()"
which was committed at "2024-08-14 08:15:39 +0200".
>
>>
>> -Mike
>>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (25 preceding siblings ...)
2024-08-14 14:34 ` Vincent Guittot
@ 2024-08-16 15:22 ` Valentin Schneider
2024-08-20 16:43 ` Hongyan Xia
` (4 subsequent siblings)
31 siblings, 0 replies; 277+ messages in thread
From: Valentin Schneider @ 2024-08-16 15:22 UTC (permalink / raw)
To: Peter Zijlstra, mingo, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 27/07/24 12:27, Peter Zijlstra wrote:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
So what I've been testing is your queue/sched/core, HEAD at:
4cc290c20a98 ("sched/eevdf: Dequeue in switched_from_fair()")
It survives my (simplistic) CFS bandwidth testing, so FWIW:
Tested-by: Valentin Schneider <vschneid@redhat.com>
And for patches 01-20:
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
with one caveat that I agree with Vincent on maybe doing something about
the load PELT signal of delayed dequeued entities.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-14 12:59 ` Vincent Guittot
@ 2024-08-17 23:06 ` Peter Zijlstra
2024-08-19 12:50 ` Vincent Guittot
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-17 23:06 UTC (permalink / raw)
To: Vincent Guittot
Cc: Valentin Schneider, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Wed, Aug 14, 2024 at 02:59:00PM +0200, Vincent Guittot wrote:
> > So the whole reason to keep them enqueued is so that they can continue
> > to compete for vruntime, and vruntime is load based. So it would be very
> > weird to remove them from load.
>
> We only use the weight to update vruntime, not the load. The load is
> used to balance tasks between cpus and if we keep a "delayed" dequeued
> task in the load, we will artificially inflate the load_avg on this rq
So far load has been a direct sum of all weight. Additionally, we delay
until a task gets picked again; migrating tasks to other CPUs will
expedite this condition.
Anyway, at the moment I don't have strong evidence either way, and the
above argument seems to suggest not changing things for now.
We can always re-evaluate.
^ permalink raw reply [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: Propagate min_slice up the cgroup hierarchy
2024-07-27 10:27 ` [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
2024-09-29 2:02 ` [PATCH 23/24] " Tianchen Ding
1 sibling, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: aef6987d89544d63a47753cf3741cabff0b5574c
Gitweb: https://git.kernel.org/tip/aef6987d89544d63a47753cf3741cabff0b5574c
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 20 Jun 2024 13:16:49 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:46 +02:00
sched/eevdf: Propagate min_slice up the cgroup hierarchy
In the absence of an explicit cgroup slice configuration, make mixed
slice lengths work with cgroups by propagating the min_slice up the
hierarchy.
This ensures the cgroup entity gets timely service to service its
entities that have this timing constraint set on them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org
---
include/linux/sched.h | 1 +-
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 89a3d8d..3709ded 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -542,6 +542,7 @@ struct sched_entity {
struct rb_node run_node;
u64 deadline;
u64 min_vruntime;
+ u64 min_slice;
struct list_head group_node;
unsigned char on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3284d3c..fea057b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,6 +782,21 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
}
+static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *root = __pick_root_entity(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ u64 min_slice = ~0ULL;
+
+ if (curr && curr->on_rq)
+ min_slice = curr->slice;
+
+ if (root)
+ min_slice = min(min_slice, root->min_slice);
+
+ return min_slice;
+}
+
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
{
return entity_before(__node_2_se(a), __node_2_se(b));
@@ -798,19 +813,34 @@ static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node
}
}
+static inline void __min_slice_update(struct sched_entity *se, struct rb_node *node)
+{
+ if (node) {
+ struct sched_entity *rse = __node_2_se(node);
+ if (rse->min_slice < se->min_slice)
+ se->min_slice = rse->min_slice;
+ }
+}
+
/*
* se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
{
u64 old_min_vruntime = se->min_vruntime;
+ u64 old_min_slice = se->min_slice;
struct rb_node *node = &se->run_node;
se->min_vruntime = se->vruntime;
__min_vruntime_update(se, node->rb_right);
__min_vruntime_update(se, node->rb_left);
- return se->min_vruntime == old_min_vruntime;
+ se->min_slice = se->slice;
+ __min_slice_update(se, node->rb_right);
+ __min_slice_update(se, node->rb_left);
+
+ return se->min_vruntime == old_min_vruntime &&
+ se->min_slice == old_min_slice;
}
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
@@ -823,6 +853,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
se->min_vruntime = se->vruntime;
+ se->min_slice = se->slice;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
__entity_less, &min_vruntime_cb);
}
@@ -6911,6 +6942,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int idle_h_nr_running = task_has_idle_policy(p);
int task_new = !(flags & ENQUEUE_WAKEUP);
int rq_h_nr_running = rq->cfs.h_nr_running;
+ u64 slice = 0;
if (flags & ENQUEUE_DELAYED) {
requeue_delayed_entity(se);
@@ -6940,7 +6972,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
break;
}
cfs_rq = cfs_rq_of(se);
+
+ /*
+ * Basically set the slice of group entries to the min_slice of
+ * their respective cfs_rq. This ensures the group can service
+ * its entities in the desired time-frame.
+ */
+ if (slice) {
+ se->slice = slice;
+ se->custom_slice = 1;
+ }
enqueue_entity(cfs_rq, se, flags);
+ slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -6962,6 +7005,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);
+ se->slice = slice;
+ slice = cfs_rq_min_slice(cfs_rq);
+
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@ -7027,11 +7073,15 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
int idle_h_nr_running = 0;
int h_nr_running = 0;
struct cfs_rq *cfs_rq;
+ u64 slice = 0;
if (entity_is_task(se)) {
p = task_of(se);
h_nr_running = 1;
idle_h_nr_running = task_has_idle_policy(p);
+ } else {
+ cfs_rq = group_cfs_rq(se);
+ slice = cfs_rq_min_slice(cfs_rq);
}
for_each_sched_entity(se) {
@@ -7056,6 +7106,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
+ slice = cfs_rq_min_slice(cfs_rq);
+
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
/*
@@ -7077,6 +7129,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
se_update_runnable(se);
update_cfs_group(se);
+ se->slice = slice;
+ slice = cfs_rq_min_slice(cfs_rq);
+
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Avoid re-setting virtual deadline on 'migrations'
2024-07-27 10:27 ` [PATCH 20/24] sched/fair: Avoid re-setting virtual deadline on migrations Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 82e9d0456e06cebe2c89f3c73cdbc9e3805e9437
Gitweb: https://git.kernel.org/tip/82e9d0456e06cebe2c89f3c73cdbc9e3805e9437
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 31 May 2024 15:49:40 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:45 +02:00
sched/fair: Avoid re-setting virtual deadline on 'migrations'
During OSPM24 Youssef noted that migrations are re-setting the virtual
deadline. Notably everything that does a dequeue-enqueue, like setting
nice, changing preferred numa-node, and a myriad of other random crap,
will cause this to happen.
This shouldn't be. Preserve the relative virtual deadline across such
dequeue/enqueue cycles.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org
---
include/linux/sched.h | 6 ++++--
kernel/sched/fair.c | 23 ++++++++++++++++++-----
kernel/sched/features.h | 4 ++++
3 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8a3a389..d25e1cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,8 +544,10 @@ struct sched_entity {
u64 min_vruntime;
struct list_head group_node;
- unsigned int on_rq;
- unsigned int sched_delayed;
+ unsigned char on_rq;
+ unsigned char sched_delayed;
+ unsigned char rel_deadline;
+ /* hole */
u64 exec_start;
u64 sum_exec_runtime;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0eb1bbf..fef0e1f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5270,6 +5270,12 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->vruntime = vruntime - lag;
+ if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+ se->deadline += se->vruntime;
+ se->rel_deadline = 0;
+ return;
+ }
+
/*
* When joining the competition; the existing tasks will be,
* on average, halfway through their slice, as such start tasks
@@ -5382,23 +5388,24 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
+ bool sleep = flags & DEQUEUE_SLEEP;
+
update_curr(cfs_rq);
if (flags & DEQUEUE_DELAYED) {
SCHED_WARN_ON(!se->sched_delayed);
} else {
- bool sleep = flags & DEQUEUE_SLEEP;
-
+ bool delay = sleep;
/*
* DELAY_DEQUEUE relies on spurious wakeups, special task
* states must not suffer spurious wakeups, excempt them.
*/
if (flags & DEQUEUE_SPECIAL)
- sleep = false;
+ delay = false;
- SCHED_WARN_ON(sleep && se->sched_delayed);
+ SCHED_WARN_ON(delay && se->sched_delayed);
- if (sched_feat(DELAY_DEQUEUE) && sleep &&
+ if (sched_feat(DELAY_DEQUEUE) && delay &&
!entity_eligible(cfs_rq, se)) {
if (cfs_rq->next == se)
cfs_rq->next = NULL;
@@ -5429,6 +5436,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
clear_buddies(cfs_rq, se);
update_entity_lag(cfs_rq, se);
+ if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
+ se->deadline -= se->vruntime;
+ se->rel_deadline = 1;
+ }
+
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
@@ -12992,6 +13004,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
if (p->se.sched_delayed) {
dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP);
p->se.sched_delayed = 0;
+ p->se.rel_deadline = 0;
if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
p->se.vlag = 0;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7fdeb55..caa4d72 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -10,6 +10,10 @@ SCHED_FEAT(PLACE_LAG, true)
*/
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
/*
+ * Preserve relative virtual deadline on 'migration'.
+ */
+SCHED_FEAT(PLACE_REL_DEADLINE, true)
+/*
* Inhibit (wakeup) preemption until the current task has either matched the
* 0-lag point or until is has exhausted it's slice.
*/
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: Allow shorter slices to wakeup-preempt
2024-07-27 10:27 ` [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
2024-08-05 12:24 ` Chunxin Zang
2024-08-08 10:15 ` Chen Yu
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Chunxin Zang, Peter Zijlstra (Intel), Valentin Schneider,
Mike Galbraith, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 85e511df3cec46021024176672a748008ed135bf
Gitweb: https://git.kernel.org/tip/85e511df3cec46021024176672a748008ed135bf
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 26 Sep 2023 14:32:32 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:45 +02:00
sched/eevdf: Allow shorter slices to wakeup-preempt
Part of the reason to have shorter slices is to improve
responsiveness. Allow shorter slices to preempt longer slices on
wakeup.
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT
1 massive_intr:(5) | 846018.956 ms | 779188 | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms |
2 massive_intr:(5) | 853450.693 ms | 792269 | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms |
3 massive_intr:(5) | 843888.920 ms | 771456 | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms |
1 chromium-browse:(8) | 53015.889 ms | 131766 | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms |
2 chromium-browse:(8) | 53864.088 ms | 136962 | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms |
3 chromium-browse:(9) | 53637.904 ms | 132637 | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms |
1 cyclictest:(5) | 12615.604 ms | 639689 | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms |
2 cyclictest:(5) | 12511.583 ms | 642578 | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms |
3 cyclictest:(5) | 12545.867 ms | 635953 | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms |
100ms massive_intr 500us cyclictest PREEMPT_SHORT
1 massive_intr:(5) | 839843.919 ms | 837384 | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms |
2 massive_intr:(5) | 852449.913 ms | 845086 | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms |
3 massive_intr:(5) | 839180.725 ms | 836883 | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms |
1 chromium-browse:(11) | 54591.481 ms | 138388 | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms |
2 chromium-browse:(8) | 52034.541 ms | 132276 | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms |
3 chromium-browse:(8) | 55231.771 ms | 141892 | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms |
1 cyclictest:(5) | 13156.391 ms | 667412 | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms |
2 cyclictest:(5) | 12688.939 ms | 665144 | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms |
3 cyclictest:(5) | 13475.623 ms | 669110 | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms |
As per the numbers, this makes cyclictest's (short slice) max-delay
more consistent, and that consistency drops the sum-delay. The
trade-off is that massive_intr (long slice) gets more context
switches and a slight increase in sum-delay.
Chunxin contributed did_preempt_short(), where a task that lost slice
protection from PREEMPT_SHORT gets rescheduled once it becomes
ineligible.
[mike: numbers]
Co-Developed-by: Chunxin Zang <zangchunxin@lixiang.com>
Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
---
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++-----
kernel/sched/features.h | 5 +++-
2 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fef0e1f..cc30ea3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -973,10 +973,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -993,10 +993,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* The task has consumed its request, reschedule.
*/
- if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
- clear_buddies(cfs_rq, se);
- }
+ return true;
}
#include "pelt.h"
@@ -1134,6 +1131,38 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
dl_server_update(p->dl_server, delta_exec);
}
+static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (curr->vlag == curr->deadline)
+ return false;
+
+ return !entity_eligible(cfs_rq, curr);
+}
+
+static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ if (!sched_feat(PREEMPT_SHORT))
+ return false;
+
+ if (pse->slice >= se->slice)
+ return false;
+
+ if (!entity_eligible(cfs_rq, pse))
+ return false;
+
+ if (entity_before(pse, se))
+ return true;
+
+ if (!entity_eligible(cfs_rq, se))
+ return true;
+
+ return false;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -1157,6 +1186,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
struct sched_entity *curr = cfs_rq->curr;
struct rq *rq = rq_of(cfs_rq);
s64 delta_exec;
+ bool resched;
if (unlikely(!curr))
return;
@@ -1166,7 +1196,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
return;
curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
@@ -1184,6 +1214,14 @@ static void update_curr(struct cfs_rq *cfs_rq)
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ if (rq->nr_running == 1)
+ return;
+
+ if (resched || did_preempt_short(cfs_rq, curr)) {
+ resched_curr(rq);
+ clear_buddies(cfs_rq, curr);
+ }
}
static void update_curr_fair(struct rq *rq)
@@ -8605,7 +8643,17 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
cfs_rq = cfs_rq_of(se);
update_curr(cfs_rq);
/*
- * XXX pick_eevdf(cfs_rq) != se ?
+ * If @p has a shorter slice than current and @p is eligible, override
+ * current's slice protection in order to allow preemption.
+ *
+ * Note that even if @p does not turn out to be the most eligible
+ * task at this moment, current's slice protection will be lost.
+ */
+ if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline)
+ se->vlag = se->deadline + 1;
+
+ /*
+ * If @p has become the most eligible task, force preemption.
*/
if (pick_eevdf(cfs_rq) == pse)
goto preempt;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index caa4d72..2908740 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -18,6 +18,11 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
* 0-lag point or until is has exhausted it's slice.
*/
SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for
+ * current.
+ */
+SCHED_FEAT(PREEMPT_SHORT, true)
/*
* Prefer to schedule the task we woke last (assuming it failed
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
2024-07-27 10:27 ` [PATCH 22/24] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 857b158dc5e81c6de795ef6be006eed146098fc6
Gitweb: https://git.kernel.org/tip/857b158dc5e81c6de795ef6be006eed146098fc6
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 22 May 2023 13:46:30 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:45 +02:00
sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.
The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
which is one tenth of a tick at HZ=1000 and ten ticks at HZ=100.
Applications should strive to use a high-confidence (95%+) estimate of
their periodic runtime as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.
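As an illustration only (not part of this patch), a task could suggest a
~3ms slice for itself along these lines; glibc traditionally ships no
sched_setattr() wrapper, so the sketch declares struct sched_attr itself
and issues the raw syscall -- adjust the includes and definitions as
needed for your toolchain:
  #define _GNU_SOURCE
  #include <sched.h>              /* SCHED_OTHER */
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  struct sched_attr {             /* local copy, no glibc wrapper assumed */
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime; /* SCHED_OTHER: suggested slice [ns] */
          uint64_t sched_deadline;
          uint64_t sched_period;
  };
  int main(void)
  {
          struct sched_attr attr = {
                  .size          = sizeof(attr),
                  .sched_policy  = SCHED_OTHER,
                  .sched_runtime = 3ULL * 1000 * 1000,    /* ~3ms request */
          };
          /* The kernel clamps the request to [0.1ms, 100ms]. */
          if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0))
                  perror("sched_setattr");
          return 0;
  }
Passing sched_runtime == 0 (or clearing it again later) falls back to
sysctl_sched_base_slice, per the __setscheduler_params() hunk below.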
For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:
{A,B,C,D}(w=1,r=8):
ABCD...
+---+---+---+---
t=0, V=1.5 t=1, V=3.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+*------+-------+--- ---+--*----+-------+---
t=2, V=5.5 t=3, V=7.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+----*--+-------+--- ---+------*+-------+---
Note: 4 identical tasks in FIFO order
~~~
{A,B}(w=1,r=16) C(w=2,r=16)
AACCBBCC...
+---+---+---+---
t=0, V=1.25 t=2, V=5.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---
t=4, V=8.25 t=6, V=12.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+-------*-------+--- ---+-------+---*---+---
Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
task doesn't go below q.
Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.
Note: the period of the heavy task is half the full period at:
W*(r_i/w_i) = 4*(2q/2) = 4q
~~~
{A,C,D}(w=1,r=16) B(w=1,r=8):
BAACCBDD...
+---+---+---+---
t=0, V=1.5 t=1, V=3.5
A |--------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+*------+-------+--- ---+--*----+-------+---
t=3, V=7.5 t=5, V=11.5
A |---------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+------*+-------+--- ---+-------+--*----+---
t=6, V=13.5
A |---------------<
B |------<
C |--------------<
D |--------------<
---+-------+----*--+---
Note: 1 short task -- again double r so that the deadline of the short task
won't be below q. Made B short because it's not the leftmost task, but is
eligible with the 0,1,2,3 spread.
Note: like with the heavy task, the period of the short task observes:
W*(r_i/w_i) = 4*(1q/1) = 4q
~~~
A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)
BCCAABCC...
+---+---+---+---
t=0, V=1.25 t=1, V=3.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+--*----+-------+---
t=3, V=7.25 t=5, V=11.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---
t=6, V=13.25
A |--------------<
B |------<
C |------<
---+-------+----*--+---
Note: 1 heavy and 1 short task -- combine them all.
Note: both the short and heavy task end up with a period of 4q
~~~
A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)
BBCAABBC...
+---+---+---+---
t=0, V=1 t=2, V=5
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---
t=3, V=7 t=5, V=11
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---
t=7, V=15
A |--------------<
B |------<
C |------<
---+-------+------*+---
Note: as before but permuted
~~~
From all this it can be deduced that, for the steady state:
- the total period (P) of a schedule is: W*max(r_i/w_i)
- the average period of a task is: W*(r_i/w_i)
- each task obtains the fair share: w_i/W of each full period P
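For instance, plugging the last (permuted) example back in (writing T_i
for the average period of task i):
  P   = W*max(r_i/w_i) = 4*max(16/1, 16/2, 8/1) = 4*16 = 8q
  T_B = W*(r_B/w_B)    = 4*(16/2)               = 4q
  T_C = W*(r_C/w_C)    = 4*(8/1)                = 4q
  shares per P:  A = 1/4 (2q),  B = 2/4 (4q),  C = 1/4 (2q)
which matches BBCAABBC: B occupies 4 of the 8 slots, A and C 2 each, and
B and C recur every 4q on average.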
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 4 +++-
kernel/sched/debug.c | 3 ++-
kernel/sched/fair.c | 6 ++++--
kernel/sched/syscalls.c | 29 +++++++++++++++++++++++------
5 files changed, 33 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d25e1cf..89a3d8d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -547,6 +547,7 @@ struct sched_entity {
unsigned char on_rq;
unsigned char sched_delayed;
unsigned char rel_deadline;
+ unsigned char custom_slice;
/* hole */
u64 exec_start;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 868b71b..0165811 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4390,7 +4390,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
- p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);
/* A delayed task cannot be in clone(). */
@@ -4643,6 +4642,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;
/*
* We don't need the reset flag anymore after the fork. It has
@@ -8412,6 +8413,7 @@ void __init sched_init(void)
}
set_load_weight(&init_task, false);
+ init_task.se.slice = sysctl_sched_base_slice,
/*
* The boot idle thread does lazy MMU switching as well:
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 831a77a..01ce9a7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -739,11 +739,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
else
SEQ_printf(m, " %c", task_state_to_char(p));
- SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
+ SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
p->comm, task_pid_nr(p),
SPLIT_NS(p->se.vruntime),
entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
SPLIT_NS(p->se.deadline),
+ p->se.custom_slice ? 'S' : ' ',
SPLIT_NS(p->se.slice),
SPLIT_NS(p->se.sum_exec_runtime),
(long long)(p->nvcsw + p->nivcsw),
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc30ea3..3284d3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -983,7 +983,8 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* nice) while the request time r_i is determined by
* sysctl_sched_base_slice.
*/
- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;
/*
* EEVDF: vd_i = ve_i + r_i / w_i
@@ -5227,7 +5228,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
u64 vslice, vruntime = avg_vruntime(cfs_rq);
s64 lag = 0;
- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;
vslice = calc_delta_fair(se->slice, se);
/*
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 60e70c8..4fae3cf 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -401,10 +401,20 @@ static void __setscheduler_params(struct task_struct *p,
p->policy = policy;
- if (dl_policy(policy))
+ if (dl_policy(policy)) {
__setparam_dl(p, attr);
- else if (fair_policy(policy))
+ } else if (fair_policy(policy)) {
p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+ if (attr->sched_runtime) {
+ p->se.custom_slice = 1;
+ p->se.slice = clamp_t(u64, attr->sched_runtime,
+ NSEC_PER_MSEC/10, /* HZ=1000 * 10 */
+ NSEC_PER_MSEC*100); /* HZ=100 / 10 */
+ } else {
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;
+ }
+ }
/*
* __sched_setscheduler() ensures attr->sched_priority == 0 when
@@ -700,7 +710,9 @@ recheck:
* but store a possible modification of reset_on_fork.
*/
if (unlikely(policy == p->policy)) {
- if (fair_policy(policy) && attr->sched_nice != task_nice(p))
+ if (fair_policy(policy) &&
+ (attr->sched_nice != task_nice(p) ||
+ (attr->sched_runtime != p->se.slice)))
goto change;
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
@@ -846,6 +858,9 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
.sched_nice = PRIO_TO_NICE(p->static_prio),
};
+ if (p->se.custom_slice)
+ attr.sched_runtime = p->se.slice;
+
/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
@@ -1012,12 +1027,14 @@ err_size:
static void get_params(struct task_struct *p, struct sched_attr *attr)
{
- if (task_has_dl_policy(p))
+ if (task_has_dl_policy(p)) {
__getparam_dl(p, attr);
- else if (task_has_rt_policy(p))
+ } else if (task_has_rt_policy(p)) {
attr->sched_priority = p->rt_priority;
- else
+ } else {
attr->sched_nice = task_nice(p);
+ attr->sched_runtime = p->se.slice;
+ }
}
/**
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-07-27 10:27 ` [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: fc1892becd5672f52329a75c73117b60ac7841b7
Gitweb: https://git.kernel.org/tip/fc1892becd5672f52329a75c73117b60ac7841b7
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 26 Apr 2024 13:00:50 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:45 +02:00
sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
Note that tasks that are kept on the runqueue to burn off negative
lag, are not in fact runnable anymore, they'll get dequeued the moment
they get picked.
As such, don't count this time towards runnable.
Thanks to Valentin for spotting I had this backwards initially.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org
---
kernel/sched/fair.c | 2 ++
kernel/sched/sched.h | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1a59339..0eb1bbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5402,6 +5402,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
!entity_eligible(cfs_rq, se)) {
if (cfs_rq->next == se)
cfs_rq->next = NULL;
+ update_load_avg(cfs_rq, se, 0);
se->sched_delayed = 1;
return false;
}
@@ -6841,6 +6842,7 @@ requeue_delayed_entity(struct sched_entity *se)
}
}
+ update_load_avg(cfs_rq, se, 0);
se->sched_delayed = 0;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 263b4de..2f5d658 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -820,6 +820,9 @@ static inline void se_update_runnable(struct sched_entity *se)
static inline long se_runnable(struct sched_entity *se)
{
+ if (se->sched_delayed)
+ return false;
+
if (entity_is_task(se))
return !!se->on_rq;
else
@@ -834,6 +837,9 @@ static inline void se_update_runnable(struct sched_entity *se) { }
static inline long se_runnable(struct sched_entity *se)
{
+ if (se->sched_delayed)
+ return false;
+
return !!se->on_rq;
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Implement DELAY_ZERO
2024-07-27 10:27 ` [PATCH 18/24] sched/fair: Implement DELAY_ZERO Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 54a58a78779169f9c92a51facf6de7ce94962328
Gitweb: https://git.kernel.org/tip/54a58a78779169f9c92a51facf6de7ce94962328
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 12:26:06 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:44 +02:00
sched/fair: Implement DELAY_ZERO
'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeue them
at the 0-lag point, truncate lag (eg. don't let them earn positive
lag).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.403750550@infradead.org
---
kernel/sched/fair.c | 20 ++++++++++++++++++--
kernel/sched/features.h | 3 +++
2 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da5065a..1a59339 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5447,8 +5447,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
- if (flags & DEQUEUE_DELAYED)
+ if (flags & DEQUEUE_DELAYED) {
se->sched_delayed = 0;
+ if (sched_feat(DELAY_ZERO) && se->vlag > 0)
+ se->vlag = 0;
+ }
if (cfs_rq->nr_running == 0)
update_idle_cfs_rq_clock_pelt(cfs_rq);
@@ -5527,7 +5530,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
SCHED_WARN_ON(se->sched_delayed);
SCHED_WARN_ON(se->on_rq);
-
return NULL;
}
return se;
@@ -6825,6 +6827,20 @@ requeue_delayed_entity(struct sched_entity *se)
SCHED_WARN_ON(!se->sched_delayed);
SCHED_WARN_ON(!se->on_rq);
+ if (sched_feat(DELAY_ZERO)) {
+ update_entity_lag(cfs_rq, se);
+ if (se->vlag > 0) {
+ cfs_rq->nr_running--;
+ if (se != cfs_rq->curr)
+ __dequeue_entity(cfs_rq, se);
+ se->vlag = 0;
+ place_entity(cfs_rq, se, 0);
+ if (se != cfs_rq->curr)
+ __enqueue_entity(cfs_rq, se);
+ cfs_rq->nr_running++;
+ }
+ }
+
se->sched_delayed = 0;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 1feaa7b..7fdeb55 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -34,8 +34,11 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
* By delaying the dequeue for non-eligible tasks, they remain in the
* competition and can burn off their negative lag. When they get selected
* they'll have positive lag by definition.
+ *
+ * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
*/
SCHED_FEAT(DELAY_DEQUEUE, true)
+SCHED_FEAT(DELAY_ZERO, true)
/*
* Allow wakeup-time preemption of the current task:
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Implement delayed dequeue
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
2024-08-02 14:39 ` Valentin Schneider
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
2024-08-19 10:01 ` [PATCH 17/24] " Luis Machado
` (2 subsequent siblings)
4 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 152e11f6df293e816a6a37c69757033cdc72667d
Gitweb: https://git.kernel.org/tip/152e11f6df293e816a6a37c69757033cdc72667d
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 12:25:32 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:44 +02:00
sched/fair: Implement delayed dequeue
Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
noting that lag is fundamentally a temporal measure. It should not be
carried around indefinitely.
OTOH it should also not be instantly discarded, doing so will allow a
task to game the system by purposefully (micro) sleeping at the end of
its time quantum.
Since lag is intimately tied to the virtual time base, a wall-time
based decay is also insufficient, notably competition is required for
any of this to make sense.
Instead, delay the dequeue and keep the 'tasks' on the runqueue,
competing until they are eligible.
Strictly speaking, we only care about keeping them until the 0-lag
point, but that is a difficult proposition, instead carry them around
until they get picked again, and dequeue them at that point.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org
---
kernel/sched/deadline.c | 1 +-
kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++-----
kernel/sched/features.h | 9 +++++-
3 files changed, 79 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index bbaeace..0f2df67 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2428,7 +2428,6 @@ again:
else
p = dl_se->server_pick_next(dl_se);
if (!p) {
- WARN_ON_ONCE(1);
dl_se->dl_yielded = 1;
update_curr_dl_se(rq, dl_se, 0);
goto again;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25b14df..da5065a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5379,20 +5379,39 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-static void
+static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
- int action = UPDATE_TG;
+ update_curr(cfs_rq);
+
+ if (flags & DEQUEUE_DELAYED) {
+ SCHED_WARN_ON(!se->sched_delayed);
+ } else {
+ bool sleep = flags & DEQUEUE_SLEEP;
+ /*
+ * DELAY_DEQUEUE relies on spurious wakeups, special task
+ * states must not suffer spurious wakeups, exempt them.
+ */
+ if (flags & DEQUEUE_SPECIAL)
+ sleep = false;
+
+ SCHED_WARN_ON(sleep && se->sched_delayed);
+
+ if (sched_feat(DELAY_DEQUEUE) && sleep &&
+ !entity_eligible(cfs_rq, se)) {
+ if (cfs_rq->next == se)
+ cfs_rq->next = NULL;
+ se->sched_delayed = 1;
+ return false;
+ }
+ }
+
+ int action = UPDATE_TG;
if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
action |= DO_DETACH;
/*
- * Update run-time statistics of the 'current'.
- */
- update_curr(cfs_rq);
-
- /*
* When dequeuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - For group_entity, update its runnable_weight to reflect the new
@@ -5428,8 +5447,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
+ if (flags & DEQUEUE_DELAYED)
+ se->sched_delayed = 0;
+
if (cfs_rq->nr_running == 0)
update_idle_cfs_rq_clock_pelt(cfs_rq);
+
+ return true;
}
static void
@@ -5828,11 +5852,21 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
idle_task_delta = cfs_rq->idle_h_nr_running;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
+ int flags;
+
/* throttled entity or throttle-on-deactivate */
if (!se->on_rq)
goto done;
- dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
+ /*
+ * Abuse SPECIAL to avoid delayed dequeue in this instance.
+ * This avoids teaching dequeue_entities() about throttled
+ * entities and keeps things relatively simple.
+ */
+ flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
+ if (se->sched_delayed)
+ flags |= DEQUEUE_DELAYED;
+ dequeue_entity(qcfs_rq, se, flags);
if (cfs_rq_is_idle(group_cfs_rq(se)))
idle_task_delta = cfs_rq->h_nr_running;
@@ -6918,6 +6952,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
bool was_sched_idle = sched_idle_rq(rq);
int rq_h_nr_running = rq->cfs.h_nr_running;
bool task_sleep = flags & DEQUEUE_SLEEP;
+ bool task_delayed = flags & DEQUEUE_DELAYED;
struct task_struct *p = NULL;
int idle_h_nr_running = 0;
int h_nr_running = 0;
@@ -6931,7 +6966,13 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- dequeue_entity(cfs_rq, se, flags);
+
+ if (!dequeue_entity(cfs_rq, se, flags)) {
+ if (p && &p->se == se)
+ return -1;
+
+ break;
+ }
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
@@ -6956,6 +6997,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
break;
}
flags |= DEQUEUE_SLEEP;
+ flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
}
for_each_sched_entity(se) {
@@ -6985,6 +7027,17 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies;
+ if (p && task_delayed) {
+ SCHED_WARN_ON(!task_sleep);
+ SCHED_WARN_ON(p->on_rq != 1);
+
+ /* Fix-up what dequeue_task_fair() skipped */
+ hrtick_update(rq);
+
+ /* Fix-up what block_task() skipped. */
+ __block_task(rq, p);
+ }
+
return 1;
}
@@ -6997,8 +7050,10 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
util_est_dequeue(&rq->cfs, p);
- if (dequeue_entities(rq, &p->se, flags) < 0)
+ if (dequeue_entities(rq, &p->se, flags) < 0) {
+ util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
return false;
+ }
util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
hrtick_update(rq);
@@ -12971,6 +13026,11 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
+
+ if (!first)
+ return;
+
+ SCHED_WARN_ON(se->sched_delayed);
}
void init_cfs_rq(struct cfs_rq *cfs_rq)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 97fb2d4..1feaa7b 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,6 +29,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
SCHED_FEAT(CACHE_HOT_BUDDY, true)
/*
+ * Delay dequeueing tasks until they get selected or woken.
+ *
+ * By delaying the dequeue for non-eligible tasks, they remain in the
+ * competition and can burn off their negative lag. When they get selected
+ * they'll have positive lag by definition.
+ */
+SCHED_FEAT(DELAY_DEQUEUE, true)
+
+/*
* Allow wakeup-time preemption of the current task:
*/
SCHED_FEAT(WAKEUP_PREEMPTION, true)
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched,freezer: Mark TASK_FROZEN special
2024-07-27 10:27 ` [PATCH 15/24] sched,freezer: Mark TASK_FROZEN special Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: a1c446611e31ca5363d4db51e398271da1dce0af
Gitweb: https://git.kernel.org/tip/a1c446611e31ca5363d4db51e398271da1dce0af
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 01 Jul 2024 21:30:09 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:44 +02:00
sched,freezer: Mark TASK_FROZEN special
The special task states are those that do not suffer spurious wakeups;
TASK_FROZEN is very much one of those, so mark it as such.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.998329901@infradead.org
---
include/linux/sched.h | 5 +++--
kernel/freezer.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f4a648e..8a3a389 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -149,8 +149,9 @@ struct user_event_mm;
* Special states are those that do not use the normal wait-loop pattern. See
* the comment with set_special_state().
*/
-#define is_special_task_state(state) \
- ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | TASK_DEAD))
+#define is_special_task_state(state) \
+ ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
+ TASK_DEAD | TASK_FROZEN))
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
diff --git a/kernel/freezer.c b/kernel/freezer.c
index f57aaf9..44bbd7d 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -72,7 +72,7 @@ bool __refrigerator(bool check_kthr_stop)
bool freeze;
raw_spin_lock_irq(&current->pi_lock);
- set_current_state(TASK_FROZEN);
+ WRITE_ONCE(current->__state, TASK_FROZEN);
/* unstale saved_state so that __thaw_task() will wake us up */
current->saved_state = TASK_RUNNING;
raw_spin_unlock_irq(&current->pi_lock);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched: Teach dequeue_task() about special task states
2024-07-27 10:27 ` [PATCH 16/24] sched: Teach dequeue_task() about special task states Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e1459a50ba31831efdfc35278023d959e4ba775b
Gitweb: https://git.kernel.org/tip/e1459a50ba31831efdfc35278023d959e4ba775b
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 01 Jul 2024 21:38:11 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:44 +02:00
sched: Teach dequeue_task() about special task states
Since special task states must not suffer spurious wakeups, and the
proposed delayed dequeue can cause exactly these (under some boundary
conditions), propagate this knowledge into dequeue_task() such that it
can do the right thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
---
kernel/sched/core.c | 7 ++++++-
kernel/sched/sched.h | 3 ++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 80e639e..868b71b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6530,11 +6530,16 @@ static void __sched notrace __schedule(unsigned int sched_mode)
if (signal_pending_state(prev_state, prev)) {
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
+ int flags = DEQUEUE_NOCLOCK;
+
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
+ if (unlikely(is_special_task_state(prev_state)))
+ flags |= DEQUEUE_SPECIAL;
+
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
@@ -6546,7 +6551,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*
* After this, schedule() must not care about p->state any more.
*/
- block_task(rq, prev, DEQUEUE_NOCLOCK);
+ block_task(rq, prev, flags);
}
switch_count = &prev->nvcsw;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffca977..263b4de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2248,10 +2248,11 @@ extern const u32 sched_prio_to_wmult[40];
*
*/
-#define DEQUEUE_SLEEP 0x01
+#define DEQUEUE_SLEEP 0x01 /* Matches ENQUEUE_WAKEUP */
#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
+#define DEQUEUE_SPECIAL 0x10
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Prepare pick_next_task() for delayed dequeue
2024-07-27 10:27 ` [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
2024-09-10 9:16 ` [PATCH 13/24] " Luis Machado
1 sibling, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f12e148892ede8d9ee82bcd3e469e6d01fc077ac
Gitweb: https://git.kernel.org/tip/f12e148892ede8d9ee82bcd3e469e6d01fc077ac
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 11:26:25 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:43 +02:00
sched/fair: Prepare pick_next_task() for delayed dequeue
Delayed dequeue's natural end is when it gets picked again. Ensure
pick_next_task() knows what to do with delayed tasks.
Note, this relies on the earlier patch that made pick_next_task()
state invariant -- it will restart the pick on dequeue, because
obviously the just dequeued task is no longer eligible.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.747330118@infradead.org
---
kernel/sched/fair.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a84903..a4f1f79 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,6 +5473,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
+static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
+
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
@@ -5481,16 +5483,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq)
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
/*
* Enabling NEXT_BUDDY will affect latency but not fairness.
*/
if (sched_feat(NEXT_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ SCHED_WARN_ON(cfs_rq->next->sched_delayed);
return cfs_rq->next;
+ }
+
+ struct sched_entity *se = pick_eevdf(cfs_rq);
+ if (se->sched_delayed) {
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ SCHED_WARN_ON(se->sched_delayed);
+ SCHED_WARN_ON(se->on_rq);
- return pick_eevdf(cfs_rq);
+ return NULL;
+ }
+ return se;
}
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -8507,7 +8520,9 @@ again:
if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto again;
- se = pick_next_entity(cfs_rq);
+ se = pick_next_entity(rq, cfs_rq);
+ if (!se)
+ goto again;
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Implement ENQUEUE_DELAYED
2024-07-27 10:27 ` [PATCH 14/24] sched/fair: Implement ENQUEUE_DELAYED Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 781773e3b68031bd001c0c18aa72e8470c225ebd
Gitweb: https://git.kernel.org/tip/781773e3b68031bd001c0c18aa72e8470c225ebd
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 11:57:43 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:43 +02:00
sched/fair: Implement ENQUEUE_DELAYED
Doing a wakeup on a delayed dequeue task is about as simple as it
sounds -- remove the delayed mark and enjoy the fact it was actually
still on the runqueue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org
---
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4f1f79..25b14df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,6 +5290,9 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
static inline bool cfs_bandwidth_used(void);
static void
+requeue_delayed_entity(struct sched_entity *se);
+
+static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool curr = cfs_rq->curr == se;
@@ -5922,8 +5925,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- if (se->on_rq)
+ if (se->on_rq) {
+ SCHED_WARN_ON(se->sched_delayed);
break;
+ }
enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
if (cfs_rq_is_idle(group_cfs_rq(se)))
@@ -6773,6 +6778,22 @@ static int sched_idle_cpu(int cpu)
}
#endif
+static void
+requeue_delayed_entity(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ /*
+ * se->sched_delayed should imply: se->on_rq == 1.
+ * Because a delayed entity is one that is still on
+ * the runqueue competing until eligibility.
+ */
+ SCHED_WARN_ON(!se->sched_delayed);
+ SCHED_WARN_ON(!se->on_rq);
+
+ se->sched_delayed = 0;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -6787,6 +6808,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int task_new = !(flags & ENQUEUE_WAKEUP);
int rq_h_nr_running = rq->cfs.h_nr_running;
+ if (flags & ENQUEUE_DELAYED) {
+ requeue_delayed_entity(se);
+ return;
+ }
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6804,8 +6830,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) {
- if (se->on_rq)
+ if (se->on_rq) {
+ if (se->sched_delayed)
+ requeue_delayed_entity(se);
break;
+ }
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
2024-07-27 10:27 ` [PATCH 11/24] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
Gitweb: https://git.kernel.org/tip/e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 11:00:10 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:42 +02:00
sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
Just a little sanity test..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.486423066@infradead.org
---
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59b00d7..37acd53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5452,6 +5452,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
update_stats_curr_start(cfs_rq, se);
+ SCHED_WARN_ON(cfs_rq->curr);
cfs_rq->curr = se;
/*
@@ -5513,6 +5514,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
/* in !on_rq case, update occurred at dequeue */
update_load_avg(cfs_rq, prev, 0);
}
+ SCHED_WARN_ON(cfs_rq->curr != prev);
cfs_rq->curr = NULL;
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/uclamg: Handle delayed dequeue
2024-07-27 10:27 ` [PATCH 10/24] sched/uclamg: Handle " Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
2024-08-19 9:14 ` Christian Loehle
2024-08-20 16:23 ` [PATCH 10/24] " Hongyan Xia
2024-08-21 13:34 ` Hongyan Xia
2 siblings, 1 reply; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Luis Machado, Hongyan Xia, Peter Zijlstra (Intel),
Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: dfa0a574cbc47bfd5f8985f74c8ea003a37fa078
Gitweb: https://git.kernel.org/tip/dfa0a574cbc47bfd5f8985f74c8ea003a37fa078
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 05 Jun 2024 12:09:11 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:42 +02:00
sched/uclamg: Handle delayed dequeue
Delayed dequeue has tasks sit around on the runqueue that are not
actually runnable -- specifically, they will be dequeued the moment
they get picked.
One side-effect is that such a task can get migrated, which leads to a
'nested' dequeue_task() scenario that messes up uclamp if we don't
take care.
Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
the runqueue. This however will have removed the task from uclamp --
per uclamp_rq_dec() in dequeue_task(). So far so good.
However, if at that point the task gets migrated -- or nice adjusted
or any of a myriad of operations that do a dequeue-enqueue cycle --
we'll pass through dequeue_task()/enqueue_task() again. Without
modification this will lead to a double decrement for uclamp, which is
wrong.
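Purely as an illustration (a user-space toy model, not kernel code -- all
names below are invented for the sketch), the sched_delayed checks and the
dec-before-dequeue / inc-after-enqueue ordering keep a per-rq count
balanced across such a nested dequeue/enqueue cycle:
  /*
   * Toy model: a per-rq "uclamp count" must stay balanced even when the
   * class-level dequeue 'fails' and leaves the task marked sched_delayed.
   */
  #include <assert.h>
  #include <stdbool.h>
  #include <stdio.h>
  struct task { bool sched_delayed; bool on_rq; };
  static int uclamp_count;        /* stand-in for the uclamp buckets */
  static void uclamp_inc(struct task *p) { if (!p->sched_delayed) uclamp_count++; }
  static void uclamp_dec(struct task *p) { if (!p->sched_delayed) uclamp_count--; }
  /* Class-level dequeue may 'fail' and mark the task delayed instead. */
  static bool class_dequeue(struct task *p, bool sleep)
  {
          if (sleep) {            /* pretend the task is not eligible */
                  p->sched_delayed = true;
                  return false;   /* kept on the runqueue */
          }
          p->on_rq = false;
          return true;
  }
  static void class_enqueue(struct task *p, bool delayed_wakeup)
  {
          if (delayed_wakeup)
                  p->sched_delayed = false;       /* like ENQUEUE_DELAYED */
          p->on_rq = true;
  }
  static bool dequeue_task(struct task *p, bool sleep)
  {
          uclamp_dec(p);          /* before the class hook: it may set sched_delayed */
          return class_dequeue(p, sleep);
  }
  static void enqueue_task(struct task *p, bool delayed_wakeup)
  {
          class_enqueue(p, delayed_wakeup);       /* may clear sched_delayed */
          uclamp_inc(p);          /* hence after the class hook */
  }
  int main(void)
  {
          struct task p = { .on_rq = true };
          uclamp_count = 1;       /* p enqueued and counted */
          dequeue_task(&p, true);  /* goes to sleep: delayed, count 1 -> 0 */
          dequeue_task(&p, false); /* e.g. migration: dec skipped, still 0 */
          enqueue_task(&p, false); /* enqueue on new rq: inc skipped, still 0 */
          enqueue_task(&p, true);  /* wakeup clears the delay: back to 1 */
          assert(uclamp_count == 1);
          printf("uclamp_count = %d\n", uclamp_count);
          return 0;
  }
The assert only holds because both helpers skip delayed tasks; drop either
check and the migration cycle above produces exactly the double decrement
described.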
Reported-by: Luis Machado <luis.machado@arm.com>
Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
---
kernel/sched/core.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7356464..80e639e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1691,6 +1691,9 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
if (unlikely(!p->sched_class->uclamp_enabled))
return;
+ if (p->se.sched_delayed)
+ return;
+
for_each_clamp_id(clamp_id)
uclamp_rq_inc_id(rq, p, clamp_id);
@@ -1715,6 +1718,9 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
if (unlikely(!p->sched_class->uclamp_enabled))
return;
+ if (p->se.sched_delayed)
+ return;
+
for_each_clamp_id(clamp_id)
uclamp_rq_dec_id(rq, p, clamp_id);
}
@@ -1994,8 +2000,12 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}
- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
+ /*
+ * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
+ * ->sched_delayed.
+ */
+ uclamp_rq_inc(rq, p);
if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
@@ -2017,6 +2027,10 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
+ /*
+ * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
+ * and mark the task ->sched_delayed.
+ */
uclamp_rq_dec(rq, p);
return p->sched_class->dequeue_task(rq, p, flags);
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-07-27 10:27 ` [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
2024-08-27 9:17 ` [PATCH 12/24] " Chen Yu
2 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2e0199df252a536a03f4cb0810324dff523d1e79
Gitweb: https://git.kernel.org/tip/2e0199df252a536a03f4cb0810324dff523d1e79
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 11:03:42 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:43 +02:00
sched/fair: Prepare exit/cleanup paths for delayed_dequeue
When dequeue_task() is delayed it becomes possible to exit a task (or
cgroup) that is still enqueued. Ensure things are dequeued before
freeing.
Thanks to Valentin for asking the obvious questions and making
switched_from_fair() less weird.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.631948434@infradead.org
---
kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++----------
1 file changed, 46 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 37acd53..9a84903 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8342,7 +8342,21 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
static void task_dead_fair(struct task_struct *p)
{
- remove_entity_load_avg(&p->se);
+ struct sched_entity *se = &p->se;
+
+ if (se->sched_delayed) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ if (se->sched_delayed) {
+ update_rq_clock(rq);
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ }
+ task_rq_unlock(rq, p, &rf);
+ }
+
+ remove_entity_load_avg(se);
}
/*
@@ -12854,10 +12868,22 @@ static void attach_task_cfs_rq(struct task_struct *p)
static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
detach_task_cfs_rq(p);
+ /*
+ * Since this is called after changing class, this is a little weird
+ * and we cannot use DEQUEUE_DELAYED.
+ */
+ if (p->se.sched_delayed) {
+ dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP);
+ p->se.sched_delayed = 0;
+ if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
+ p->se.vlag = 0;
+ }
}
static void switched_to_fair(struct rq *rq, struct task_struct *p)
{
+ SCHED_WARN_ON(p->se.sched_delayed);
+
attach_task_cfs_rq(p);
set_task_max_allowed_capacity(p);
@@ -13008,28 +13034,35 @@ void online_fair_sched_group(struct task_group *tg)
void unregister_fair_sched_group(struct task_group *tg)
{
- unsigned long flags;
- struct rq *rq;
int cpu;
destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
for_each_possible_cpu(cpu) {
- if (tg->se[cpu])
- remove_entity_load_avg(tg->se[cpu]);
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+ struct sched_entity *se = tg->se[cpu];
+ struct rq *rq = cpu_rq(cpu);
+
+ if (se) {
+ if (se->sched_delayed) {
+ guard(rq_lock_irqsave)(rq);
+ if (se->sched_delayed) {
+ update_rq_clock(rq);
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ }
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
+ remove_entity_load_avg(se);
+ }
/*
* Only empty task groups can be destroyed; so we can speculatively
* check on_list without danger of it being re-added.
*/
- if (!tg->cfs_rq[cpu]->on_list)
- continue;
-
- rq = cpu_rq(cpu);
-
- raw_spin_rq_lock_irqsave(rq, flags);
- list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
- raw_spin_rq_unlock_irqrestore(rq, flags);
+ if (cfs_rq->on_list) {
+ guard(rq_lock_irqsave)(rq);
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
}
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Re-organize dequeue_task_fair()
2024-07-27 10:27 ` [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
2024-08-09 16:53 ` Valentin Schneider
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: fab4a808ba9fb59b691d7096eed9b1494812ffd6
Gitweb: https://git.kernel.org/tip/fab4a808ba9fb59b691d7096eed9b1494812ffd6
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 03 Apr 2024 09:50:41 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:41 +02:00
sched/fair: Re-organize dequeue_task_fair()
Working towards delaying dequeue, notably also inside the hierarchy,
rework dequeue_task_fair() such that it can 'resume' an interrupted
hierarchy walk.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.977256873@infradead.org
---
kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++---------------
1 file changed, 41 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03f76b3..59b00d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6861,34 +6861,43 @@ enqueue_throttle:
static void set_next_buddy(struct sched_entity *se);
/*
- * The dequeue_task method is called before nr_running is
- * decreased. We remove the task from the rbtree and
- * update the fair scheduling stats:
+ * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
+ * failing half-way through and resume the dequeue later.
+ *
+ * Returns:
+ * -1 - dequeue delayed
+ * 0 - dequeue throttled
+ * 1 - dequeue complete
*/
-static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
{
- struct cfs_rq *cfs_rq;
- struct sched_entity *se = &p->se;
- int task_sleep = flags & DEQUEUE_SLEEP;
- int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
int rq_h_nr_running = rq->cfs.h_nr_running;
+ bool task_sleep = flags & DEQUEUE_SLEEP;
+ struct task_struct *p = NULL;
+ int idle_h_nr_running = 0;
+ int h_nr_running = 0;
+ struct cfs_rq *cfs_rq;
- util_est_dequeue(&rq->cfs, p);
+ if (entity_is_task(se)) {
+ p = task_of(se);
+ h_nr_running = 1;
+ idle_h_nr_running = task_has_idle_policy(p);
+ }
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running--;
+ cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
- goto dequeue_throttle;
+ return 0;
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
@@ -6912,20 +6921,18 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
se_update_runnable(se);
update_cfs_group(se);
- cfs_rq->h_nr_running--;
+ cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq))
- idle_h_nr_running = 1;
+ idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
- goto dequeue_throttle;
-
+ return 0;
}
- /* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, 1);
+ sub_nr_running(rq, h_nr_running);
if (rq_h_nr_running && !rq->cfs.h_nr_running)
dl_server_stop(&rq->fair_server);
@@ -6934,10 +6941,23 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
rq->next_balance = jiffies;
-dequeue_throttle:
- util_est_update(&rq->cfs, p, task_sleep);
- hrtick_update(rq);
+ return 1;
+}
+
+/*
+ * The dequeue_task method is called before nr_running is
+ * decreased. We remove the task from the rbtree and
+ * update the fair scheduling stats:
+ */
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+{
+ util_est_dequeue(&rq->cfs, p);
+
+ if (dequeue_entities(rq, &p->se, flags) < 0)
+ return false;
+ util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
+ hrtick_update(rq);
return true;
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched: Prepare generic code for delayed dequeue
2024-07-27 10:27 ` [PATCH 09/24] sched: Prepare generic code for delayed dequeue Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: abc158c82ae555078aa5dd2d8407c3df0f868904
Gitweb: https://git.kernel.org/tip/abc158c82ae555078aa5dd2d8407c3df0f868904
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 10:55:59 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:42 +02:00
sched: Prepare generic code for delayed dequeue
While most of the delayed dequeue code can be done inside the
sched_class itself, there is one location where we do not have an
appropriate hook, namely ttwu_runnable().
Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking
delayed dequeue tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 14 +++++++++++++-
kernel/sched/sched.h | 2 ++
3 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c1b4ee..f4a648e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,7 @@ struct sched_entity {
struct list_head group_node;
unsigned int on_rq;
+ unsigned int sched_delayed;
u64 exec_start;
u64 sum_exec_runtime;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6c59548..7356464 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2036,6 +2036,8 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ SCHED_WARN_ON(flags & DEQUEUE_SLEEP);
+
WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
@@ -3689,12 +3691,14 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
rq = __task_rq_lock(p, &rf);
if (task_on_rq_queued(p)) {
+ update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
* it should preempt the task that is current now.
*/
- update_rq_clock(rq);
wakeup_preempt(rq, p, wake_flags);
}
ttwu_do_wakeup(p);
@@ -4074,11 +4078,16 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* case the whole 'p->on_rq && ttwu_runnable()' case below
* without taking any locks.
*
+ * Specifically, given current runs ttwu() we must be before
+ * schedule()'s block_task(), as such this must not observe
+ * sched_delayed.
+ *
* In particular:
* - we rely on Program-Order guarantees for all the ordering,
* - we're serialized against set_special_state() by virtue of
* it disabling IRQs (this allows not taking ->pi_lock).
*/
+ SCHED_WARN_ON(p->se.sched_delayed);
if (!ttwu_state_match(p, state, &success))
goto out;
@@ -4370,6 +4379,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);
+ /* A delayed task cannot be in clone(). */
+ SCHED_WARN_ON(p->se.sched_delayed);
+
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69ab3b0..ffca977 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2253,6 +2253,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
+#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
@@ -2268,6 +2269,7 @@ extern const u32 sched_prio_to_wmult[40];
#endif
#define ENQUEUE_INITIAL 0x80
#define ENQUEUE_MIGRATING 0x100
+#define ENQUEUE_DELAYED 0x200
#define RETRY_TASK ((void *)-1UL)
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched: Split DEQUEUE_SLEEP from deactivate_task()
2024-07-27 10:27 ` [PATCH 08/24] sched: Split DEQUEUE_SLEEP from deactivate_task() Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e8901061ca0cd9acbd3d29d41d16c69c2bfff9f0
Gitweb: https://git.kernel.org/tip/e8901061ca0cd9acbd3d29d41d16c69c2bfff9f0
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 23 May 2024 10:48:09 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:42 +02:00
sched: Split DEQUEUE_SLEEP from deactivate_task()
As a preparation for dequeue_task() failing, and a second code-path
needing to take care of the 'success' path, split out the DEQUEUE_SLEEP
path from deactivate_task().
Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING
ordering fail.
Fixed-by: Libo Chen <libo.chen@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
---
kernel/sched/core.c | 23 +++++++++++++----------
kernel/sched/sched.h | 14 ++++++++++++++
2 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f7a4e9..6c59548 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2036,12 +2036,23 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
- WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
+ WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+ /*
+ * Code explicitly relies on TASK_ON_RQ_MIGRATING being set *before*
+ * dequeue_task() and cleared *after* enqueue_task().
+ */
+
dequeue_task(rq, p, flags);
}
+static void block_task(struct rq *rq, struct task_struct *p, int flags)
+{
+ if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
+ __block_task(rq, p);
+}
+
/**
* task_curr - is this task currently executing on a CPU?
* @p: the task in question.
@@ -6498,9 +6509,6 @@ static void __sched notrace __schedule(unsigned int sched_mode)
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
- if (prev->sched_contributes_to_load)
- rq->nr_uninterruptible++;
-
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
@@ -6512,12 +6520,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
*
* After this, schedule() must not care about p->state any more.
*/
- deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
- if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
- delayacct_blkio_start();
- }
+ block_task(rq, prev, DEQUEUE_NOCLOCK);
}
switch_count = &prev->nvcsw;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6196f90..69ab3b0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -68,6 +68,7 @@
#include <linux/wait_api.h>
#include <linux/wait_bit.h>
#include <linux/workqueue_api.h>
+#include <linux/delayacct.h>
#include <trace/events/power.h>
#include <trace/events/sched.h>
@@ -2585,6 +2586,19 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
sched_update_tick_dependency(rq);
}
+static inline void __block_task(struct rq *rq, struct task_struct *p)
+{
+ WRITE_ONCE(p->on_rq, 0);
+ ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+ if (p->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ if (p->in_iowait) {
+ atomic_inc(&rq->nr_iowait);
+ delayacct_blkio_start();
+ }
+}
+
extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Cleanup pick_task_fair()'s curr
2024-07-27 10:27 ` [PATCH 04/24] sched/fair: Cleanup pick_task_fair()s curr Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c97f54fe6d014419e557200ed075cf53b47c5420
Gitweb: https://git.kernel.org/tip/c97f54fe6d014419e557200ed075cf53b47c5420
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 03 Apr 2024 09:50:12 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:41 +02:00
sched/fair: Cleanup pick_task_fair()'s curr
With 4c456c9ad334 ("sched/fair: Remove unused 'curr' argument from
pick_next_entity()") curr is no longer being used, so no point in
clearing it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.614707623@infradead.org
---
kernel/sched/fair.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ba1ca5..175ccec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8463,15 +8463,9 @@ again:
return NULL;
do {
- struct sched_entity *curr = cfs_rq->curr;
-
/* When we pick for a remote RQ, we'll not have done put_prev_entity() */
- if (curr) {
- if (curr->on_rq)
- update_curr(cfs_rq);
- else
- curr = NULL;
- }
+ if (cfs_rq->curr && cfs_rq->curr->on_rq)
+ update_curr(cfs_rq);
if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto again;
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Cleanup pick_task_fair() vs throttle
2024-07-27 10:27 ` [PATCH 03/24] sched/fair: Cleanup pick_task_fair() vs throttle Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ben Segall, Valentin Schneider, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8e2e13ac6122915bd98315237b0317495e391be0
Gitweb: https://git.kernel.org/tip/8e2e13ac6122915bd98315237b0317495e391be0
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 03 Apr 2024 09:50:07 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:40 +02:00
sched/fair: Cleanup pick_task_fair() vs throttle
Per 54d27365cae8 ("sched/fair: Prevent throttling in early
pick_next_task_fair()") the reason check_cfs_rq_runtime() is under the
'if (curr)' check is to ensure the (downward) traversal does not
result in an empty cfs_rq.
But then the pick_task_fair() 'copy' of all this made it restart the
traversal anyway, so that seems to solve the issue too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.501679876@infradead.org
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8201f0f..7ba1ca5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8471,11 +8471,11 @@ again:
update_curr(cfs_rq);
else
curr = NULL;
-
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto again;
}
+ if (unlikely(check_cfs_rq_runtime(cfs_rq)))
+ goto again;
+
se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/fair: Unify pick_{,next_}_task_fair()
2024-07-27 10:27 ` [PATCH 05/24] sched/fair: Unify pick_{,next_}_task_fair() Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 3b3dd89b8bb0f03657859c22c86c19224f778638
Gitweb: https://git.kernel.org/tip/3b3dd89b8bb0f03657859c22c86c19224f778638
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 03 Apr 2024 09:50:16 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:41 +02:00
sched/fair: Unify pick_{,next_}_task_fair()
Implement pick_next_task_fair() in terms of pick_task_fair() to
de-duplicate the pick loop.
More importantly, this makes all the pick loops use the
state-invariant form, which is useful to introduce further re-try
conditions in later patches.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.725062368@infradead.org
---
kernel/sched/fair.c | 60 +++++---------------------------------------
1 file changed, 8 insertions(+), 52 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 175ccec..1452c53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8451,7 +8451,6 @@ preempt:
resched_curr(rq);
}
-#ifdef CONFIG_SMP
static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
@@ -8463,7 +8462,7 @@ again:
return NULL;
do {
- /* When we pick for a remote RQ, we'll not have done put_prev_entity() */
+ /* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);
@@ -8484,19 +8483,19 @@ again:
return task_of(se);
}
-#endif
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
- struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
again:
- if (!sched_fair_runnable(rq))
+ p = pick_task_fair(rq);
+ if (!p)
goto idle;
+ se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED
if (!prev || prev->sched_class != &fair_sched_class)
@@ -8508,52 +8507,14 @@ again:
*
* Therefore attempt to avoid putting and setting the entire cgroup
* hierarchy, only change the part that actually changes.
- */
-
- do {
- struct sched_entity *curr = cfs_rq->curr;
-
- /*
- * Since we got here without doing put_prev_entity() we also
- * have to consider cfs_rq->curr. If it is still a runnable
- * entity, update_curr() will update its vruntime, otherwise
- * forget we've ever seen it.
- */
- if (curr) {
- if (curr->on_rq)
- update_curr(cfs_rq);
- else
- curr = NULL;
-
- /*
- * This call to check_cfs_rq_runtime() will do the
- * throttle and dequeue its entity in the parent(s).
- * Therefore the nr_running test will indeed
- * be correct.
- */
- if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
- cfs_rq = &rq->cfs;
-
- if (!cfs_rq->nr_running)
- goto idle;
-
- goto simple;
- }
- }
-
- se = pick_next_entity(cfs_rq);
- cfs_rq = group_cfs_rq(se);
- } while (cfs_rq);
-
- p = task_of(se);
-
- /*
+ *
* Since we haven't yet done put_prev_entity and if the selected task
* is a different task than we started out with, try and touch the
* least amount of cfs_rqs.
*/
if (prev != p) {
struct sched_entity *pse = &prev->se;
+ struct cfs_rq *cfs_rq;
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
@@ -8579,13 +8540,8 @@ simple:
if (prev)
put_prev_task(rq, prev);
- do {
- se = pick_next_entity(cfs_rq);
- set_next_entity(cfs_rq, se);
- cfs_rq = group_cfs_rq(se);
- } while (cfs_rq);
-
- p = task_of(se);
+ for_each_sched_entity(se)
+ set_next_entity(cfs_rq_of(se), se);
done: __maybe_unused;
#ifdef CONFIG_SMP
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched: Allow sched_class::dequeue_task() to fail
2024-07-27 10:27 ` [PATCH 06/24] sched: Allow sched_class::dequeue_task() to fail Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 863ccdbb918a77e3f011571f943020bf7f0b114b
Gitweb: https://git.kernel.org/tip/863ccdbb918a77e3f011571f943020bf7f0b114b
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 03 Apr 2024 09:50:20 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:41 +02:00
sched: Allow sched_class::dequeue_task() to fail
Change the function signature of sched_class::dequeue_task() to return
a boolean, allowing future patches to 'fail' dequeue.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
---
kernel/sched/core.c | 7 +++++--
kernel/sched/deadline.c | 4 +++-
kernel/sched/fair.c | 4 +++-
kernel/sched/idle.c | 3 ++-
kernel/sched/rt.c | 4 +++-
kernel/sched/sched.h | 4 ++--
kernel/sched/stop_task.c | 3 ++-
7 files changed, 20 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab50100..4f7a4e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2001,7 +2001,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
sched_core_enqueue(rq, p);
}
-void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
+/*
+ * Must only return false when DEQUEUE_SLEEP.
+ */
+inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (sched_core_enabled(rq))
sched_core_dequeue(rq, p, flags);
@@ -2015,7 +2018,7 @@ void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
}
uclamp_rq_dec(rq, p);
- p->sched_class->dequeue_task(rq, p, flags);
+ return p->sched_class->dequeue_task(rq, p, flags);
}
void activate_task(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c5f1cc7..bbaeace 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2162,7 +2162,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
enqueue_pushable_dl_task(rq, p);
}
-static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_curr_dl(rq);
@@ -2172,6 +2172,8 @@ static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
dequeue_dl_entity(&p->dl, flags);
if (!p->dl.dl_throttled && !dl_server(&p->dl))
dequeue_pushable_dl_task(rq, p);
+
+ return true;
}
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1452c53..03f76b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6865,7 +6865,7 @@ static void set_next_buddy(struct sched_entity *se);
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
-static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
@@ -6937,6 +6937,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
+
+ return true;
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index d560f7f..1607420 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -482,13 +482,14 @@ struct task_struct *pick_next_task_idle(struct rq *rq)
* It is not legal to sleep in the idle task - print a warning
* message if some code attempts to do it:
*/
-static void
+static bool
dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
{
raw_spin_rq_unlock_irq(rq);
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
raw_spin_rq_lock_irq(rq);
+ return true;
}
/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a8731da..fdc8e05 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1483,7 +1483,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
enqueue_pushable_task(rq, p);
}
-static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
+static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
@@ -1491,6 +1491,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
dequeue_rt_entity(rt_se, flags);
dequeue_pushable_task(rq, p);
+
+ return true;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a6d6b6f..6196f90 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2285,7 +2285,7 @@ struct sched_class {
#endif
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
- void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
+ bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
@@ -3606,7 +3606,7 @@ extern int __sched_setaffinity(struct task_struct *p, struct affinity_context *c
extern void __setscheduler_prio(struct task_struct *p, int prio);
extern void set_load_weight(struct task_struct *p, bool update_load);
extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
-extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags);
+extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
extern void check_class_changed(struct rq *rq, struct task_struct *p,
const struct sched_class *prev_class,
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index b1b8fe6..4cf0207 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -57,10 +57,11 @@ enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
add_nr_running(rq, 1);
}
-static void
+static bool
dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
{
sub_nr_running(rq, 1);
+ return true;
}
static void yield_task_stop(struct rq *rq)
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: Add feature comments
2024-07-27 10:27 ` [PATCH 01/24] sched/eevdf: Add feature comments Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f25b7b32b0db6d71b07b06fe8de45b0408541c2a
Gitweb: https://git.kernel.org/tip/f25b7b32b0db6d71b07b06fe8de45b0408541c2a
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Sat, 14 Oct 2023 23:12:20 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:40 +02:00
sched/eevdf: Add feature comments
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.287790895@infradead.org
---
kernel/sched/features.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 929021f..97fb2d4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,7 +5,14 @@
* sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
*/
SCHED_FEAT(PLACE_LAG, true)
+/*
+ * Give new tasks half a slice to ease into the competition.
+ */
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+/*
+ * Inhibit (wakeup) preemption until the current task has either matched the
+ * 0-lag point or until it has exhausted its slice.
+ */
SCHED_FEAT(RUN_TO_PARITY, true)
/*
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: Remove min_vruntime_copy
2024-07-27 10:27 ` [PATCH 02/24] sched/eevdf: Remove min_vruntime_copy Peter Zijlstra
@ 2024-08-18 6:23 ` tip-bot2 for Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-08-18 6:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Valentin Schneider, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 949090eaf0a3e39aa0f4a675407e16d0e975da11
Gitweb: https://git.kernel.org/tip/949090eaf0a3e39aa0f4a675407e16d0e975da11
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 04 Oct 2023 12:43:53 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 17 Aug 2024 11:06:40 +02:00
sched/eevdf: Remove min_vruntime_copy
Since commit e8f331bcc270 ("sched/smp: Use lag to simplify
cross-runqueue placement") the min_vruntime_copy is no longer used.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20240727105028.395297941@infradead.org
---
kernel/sched/fair.c | 5 ++---
kernel/sched/sched.h | 4 ----
2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d39a82..8201f0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,8 +779,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}
/* ensure we never gain time by being placed backwards. */
- u64_u32_store(cfs_rq->min_vruntime,
- __update_min_vruntime(cfs_rq, vruntime));
+ cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
}
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -12933,7 +12932,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
void init_cfs_rq(struct cfs_rq *cfs_rq)
{
cfs_rq->tasks_timeline = RB_ROOT_CACHED;
- u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
+ cfs_rq->min_vruntime = (u64)(-(1LL << 20));
#ifdef CONFIG_SMP
raw_spin_lock_init(&cfs_rq->removed.lock);
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1e1d1b4..a6d6b6f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -613,10 +613,6 @@ struct cfs_rq {
u64 min_vruntime_fi;
#endif
-#ifndef CONFIG_64BIT
- u64 min_vruntime_copy;
-#endif
-
struct rb_root_cached tasks_timeline;
/*
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [tip: sched/core] sched/uclamg: Handle delayed dequeue
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-08-19 9:14 ` Christian Loehle
0 siblings, 0 replies; 277+ messages in thread
From: Christian Loehle @ 2024-08-19 9:14 UTC (permalink / raw)
To: linux-kernel, linux-tip-commits
Cc: Luis Machado, Hongyan Xia, Peter Zijlstra (Intel),
Valentin Schneider, x86
On 8/18/24 07:23, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: dfa0a574cbc47bfd5f8985f74c8ea003a37fa078
> Gitweb: https://git.kernel.org/tip/dfa0a574cbc47bfd5f8985f74c8ea003a37fa078
> Author: Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Wed, 05 Jun 2024 12:09:11 +02:00
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Sat, 17 Aug 2024 11:06:42 +02:00
>
> sched/uclamg: Handle delayed dequeue
Nit, but I hadn't noticed the typo ("uclamg" for "uclamp") until now.
>
> Delayed dequeue has tasks sit around on the runqueue that are not
> actually runnable -- specifically, they will be dequeued the moment
> they get picked.
>
> One side-effect is that such a task can get migrated, which leads to a
> 'nested' dequeue_task() scenario that messes up uclamp if we don't
> take care.
>
> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> the runqueue. This however will have removed the task from uclamp --
> per uclamp_rq_dec() in dequeue_task(). So far so good.
>
> However, if at that point the task gets migrated -- or nice adjusted
> or any of a myriad of operations that does a dequeue-enqueue cycle --
> we'll pass through dequeue_task()/enqueue_task() again. Without
> modification this will lead to a double decrement for uclamp, which is
> wrong.
>
> Reported-by: Luis Machado <luis.machado@arm.com>
> Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Tested-by: Valentin Schneider <vschneid@redhat.com>
> Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
> ---
> kernel/sched/core.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7356464..80e639e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1691,6 +1691,9 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> + if (p->se.sched_delayed)
> + return;
> +
> for_each_clamp_id(clamp_id)
> uclamp_rq_inc_id(rq, p, clamp_id);
>
> @@ -1715,6 +1718,9 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> + if (p->se.sched_delayed)
> + return;
> +
> for_each_clamp_id(clamp_id)
> uclamp_rq_dec_id(rq, p, clamp_id);
> }
> @@ -1994,8 +2000,12 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> }
>
> - uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> + /*
> + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> + * ->sched_delayed.
> + */
> + uclamp_rq_inc(rq, p);
>
> if (sched_core_enabled(rq))
> sched_core_enqueue(rq, p);
> @@ -2017,6 +2027,10 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
>
> + /*
> + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> + * and mark the task ->sched_delayed.
> + */
> uclamp_rq_dec(rq, p);
> return p->sched_class->dequeue_task(rq, p, flags);
> }
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
2024-08-02 14:39 ` Valentin Schneider
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-08-19 10:01 ` Luis Machado
[not found] ` <CGME20240828223802eucas1p16755f4531ed0611dc4871649746ea774@eucas1p1.samsung.com>
2024-11-01 12:47 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Phil Auld
4 siblings, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-08-19 10:01 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Hi Peter,
On 7/27/24 11:27, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/deadline.c | 1
> kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
> kernel/sched/features.h | 9 +++++
> 3 files changed, 81 insertions(+), 11 deletions(-)
>
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2428,7 +2428,6 @@ static struct task_struct *__pick_next_t
> else
> p = dl_se->server_pick_next(dl_se);
> if (!p) {
> - WARN_ON_ONCE(1);
> dl_se->dl_yielded = 1;
> update_curr_dl_se(rq, dl_se, 0);
> goto again;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5379,20 +5379,44 @@ static void clear_buddies(struct cfs_rq
>
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>
> -static void
> +static bool
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> - int action = UPDATE_TG;
> + if (flags & DEQUEUE_DELAYED) {
> + /*
> + * DEQUEUE_DELAYED is typically called from pick_next_entity()
> + * at which point we've already done update_curr() and do not
> + * want to do so again.
> + */
> + SCHED_WARN_ON(!se->sched_delayed);
> + se->sched_delayed = 0;
> + } else {
> + bool sleep = flags & DEQUEUE_SLEEP;
> +
> + /*
> + * DELAY_DEQUEUE relies on spurious wakeups, special task
> + * states must not suffer spurious wakeups, exempt them.
> + */
> + if (flags & DEQUEUE_SPECIAL)
> + sleep = false;
> +
> + SCHED_WARN_ON(sleep && se->sched_delayed);
> + update_curr(cfs_rq);
>
> + if (sched_feat(DELAY_DEQUEUE) && sleep &&
> + !entity_eligible(cfs_rq, se)) {
> + if (cfs_rq->next == se)
> + cfs_rq->next = NULL;
> + se->sched_delayed = 1;
> + return false;
> + }
> + }
> +
> + int action = UPDATE_TG;
> if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> action |= DO_DETACH;
>
> /*
> - * Update run-time statistics of the 'current'.
> - */
> - update_curr(cfs_rq);
> -
> - /*
> * When dequeuing a sched_entity, we must:
> * - Update loads to have both entity and cfs_rq synced with now.
> * - For group_entity, update its runnable_weight to reflect the new
> @@ -5430,6 +5454,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
> if (cfs_rq->nr_running == 0)
> update_idle_cfs_rq_clock_pelt(cfs_rq);
> +
> + return true;
> }
>
> static void
> @@ -5828,11 +5854,21 @@ static bool throttle_cfs_rq(struct cfs_r
> idle_task_delta = cfs_rq->idle_h_nr_running;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> + int flags;
> +
> /* throttled entity or throttle-on-deactivate */
> if (!se->on_rq)
> goto done;
>
> - dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
> + /*
> + * Abuse SPECIAL to avoid delayed dequeue in this instance.
> + * This avoids teaching dequeue_entities() about throttled
> + * entities and keeps things relatively simple.
> + */
> + flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
> + if (se->sched_delayed)
> + flags |= DEQUEUE_DELAYED;
> + dequeue_entity(qcfs_rq, se, flags);
>
> if (cfs_rq_is_idle(group_cfs_rq(se)))
> idle_task_delta = cfs_rq->h_nr_running;
> @@ -6918,6 +6954,7 @@ static int dequeue_entities(struct rq *r
> bool was_sched_idle = sched_idle_rq(rq);
> int rq_h_nr_running = rq->cfs.h_nr_running;
> bool task_sleep = flags & DEQUEUE_SLEEP;
> + bool task_delayed = flags & DEQUEUE_DELAYED;
> struct task_struct *p = NULL;
> int idle_h_nr_running = 0;
> int h_nr_running = 0;
> @@ -6931,7 +6968,13 @@ static int dequeue_entities(struct rq *r
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> - dequeue_entity(cfs_rq, se, flags);
> +
> + if (!dequeue_entity(cfs_rq, se, flags)) {
> + if (p && &p->se == se)
> + return -1;
> +
> + break;
> + }
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> @@ -6956,6 +6999,7 @@ static int dequeue_entities(struct rq *r
> break;
> }
> flags |= DEQUEUE_SLEEP;
> + flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
> }
>
> for_each_sched_entity(se) {
> @@ -6985,6 +7029,17 @@ static int dequeue_entities(struct rq *r
> if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> rq->next_balance = jiffies;
>
> + if (p && task_delayed) {
> + SCHED_WARN_ON(!task_sleep);
> + SCHED_WARN_ON(p->on_rq != 1);
> +
> + /* Fix-up what dequeue_task_fair() skipped */
> + hrtick_update(rq);
> +
> + /* Fix-up what block_task() skipped. */
> + __block_task(rq, p);
> + }
> +
> return 1;
> }
> /*
> @@ -6996,8 +7051,10 @@ static bool dequeue_task_fair(struct rq
> {
> util_est_dequeue(&rq->cfs, p);
>
> - if (dequeue_entities(rq, &p->se, flags) < 0)
> + if (dequeue_entities(rq, &p->se, flags) < 0) {
> + util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
> return false;
> + }
>
> util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> hrtick_update(rq);
> @@ -12973,6 +13030,11 @@ static void set_next_task_fair(struct rq
> /* ensure bandwidth has been allocated on our new cfs_rq */
> account_cfs_rq_runtime(cfs_rq, 0);
> }
> +
> + if (!first)
> + return;
> +
> + SCHED_WARN_ON(se->sched_delayed);
> }
>
> void init_cfs_rq(struct cfs_rq *cfs_rq)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -29,6 +29,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
> SCHED_FEAT(CACHE_HOT_BUDDY, true)
>
> /*
> + * Delay dequeueing tasks until they get selected or woken.
> + *
> + * By delaying the dequeue for non-eligible tasks, they remain in the
> + * competition and can burn off their negative lag. When they get selected
> + * they'll have positive lag by definition.
> + */
> +SCHED_FEAT(DELAY_DEQUEUE, true)
> +
> +/*
> * Allow wakeup-time preemption of the current task:
> */
> SCHED_FEAT(WAKEUP_PREEMPTION, true)
>
>
>
Just a heads-up: I'm chasing some odd behavior on the big.LITTLE/Pixel 6 platform, where
I sometimes see runs with spikes of higher frequencies for extended periods (multiple
seconds), in particular on the little cores, which leads to higher energy use.
I'm still trying to understand why that happens, but it looks like the utilization values
are sometimes stuck at high values. I just want to make sure the delayed-dequeue changes
aren't interfering with the util calculations.
Unfortunately the benchmark is Android-specific, so hard to provide a reasonable
reproducer for Linux.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID
2024-07-27 10:27 ` [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID Peter Zijlstra
2024-07-28 21:30 ` Thomas Gleixner
2024-07-29 7:53 ` Juri Lelli
@ 2024-08-19 11:11 ` Christian Loehle
2 siblings, 0 replies; 277+ messages in thread
From: Christian Loehle @ 2024-08-19 11:11 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 7/27/24 11:27, Peter Zijlstra wrote:
> In order to measure thread time in a DVFS world, introduce
> CLOCK_THREAD_DVFS_ID -- a copy of CLOCK_THREAD_CPUTIME_ID that slows
> down with both DVFS scaling and CPU capacity.
>
> The clock does *NOT* support setting timers.
>
> Useful for both SCHED_DEADLINE and the newly introduced
> sched_attr::sched_runtime usage for SCHED_NORMAL.
>
How will this look in practice, then?
Is it up to userspace to adjust sched_runtime for capacity/DVFS
every time it changes? I guess not.
Will sched_attr::sched_runtime be for CPUCLOCK_DVFS by default?
I assume that would be a uapi change?
Do we need an additional flag in sched_attr to specify the clock
to be measured against?
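(For illustration only: a minimal sketch of how a thread could sample such a clock,
assuming the proposed CLOCK_THREAD_DVFS_ID can be passed to clock_gettime() for the
calling thread the same way CLOCK_THREAD_CPUTIME_ID can. The constant only exists with
the uapi change in this RFC, so the snippet does not build against stock headers.)

#include <stdio.h>
#include <time.h>

int main(void)
{
        struct timespec ts;

        /* CLOCK_THREAD_DVFS_ID: hypothetical id, from the RFC's uapi patch. */
        if (clock_gettime(CLOCK_THREAD_DVFS_ID, &ts)) {
                perror("clock_gettime");
                return 1;
        }

        /* Thread runtime, scaled down by DVFS and CPU capacity per the RFC. */
        printf("dvfs time: %lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
        return 0;
}

Comparing that against CLOCK_THREAD_CPUTIME_ID over the same interval would show how
much of the thread's runtime was spent at reduced frequency or on a smaller core.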
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/posix-timers_types.h | 5 ++--
> include/linux/sched.h | 1
> include/linux/sched/cputime.h | 3 ++
> include/uapi/linux/time.h | 1
> kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 8 +++++--
> kernel/time/posix-cpu-timers.c | 16 +++++++++++++-
> kernel/time/posix-timers.c | 1
> kernel/time/posix-timers.h | 1
> 9 files changed, 71 insertions(+), 5 deletions(-)
>
> --- a/include/linux/posix-timers_types.h
> +++ b/include/linux/posix-timers_types.h
> @@ -13,9 +13,9 @@
> *
> * Bit 2 indicates whether a cpu clock refers to a thread or a process.
> *
> - * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or FD=3.
> + * Bits 1 and 0 give the type: PROF=0, VIRT=1, SCHED=2, or DVSF=3
s/DVSF/DVFS
> [snip]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
2024-08-17 23:06 ` Peter Zijlstra
@ 2024-08-19 12:50 ` Vincent Guittot
0 siblings, 0 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-19 12:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Valentin Schneider, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On Sun, 18 Aug 2024 at 01:06, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Aug 14, 2024 at 02:59:00PM +0200, Vincent Guittot wrote:
>
> > > So the whole reason to keep then enqueued is so that they can continue
> > > to compete for vruntime, and vruntime is load based. So it would be very
> > > weird to remove them from load.
> >
> > We only use the weight to update vruntime, not the load. The load is
> > used to balance tasks between cpus and if we keep a "delayed" dequeued
> > task in the load, we will artificially inflate the load_avg on this rq
>
> So far load has been a direct sum of all weight. Additionally, we delay
It has been the sum of all runnable tasks, but delayed tasks are not
runnable anymore. The task stays "enqueued" only to help clear its
lag.
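(Aside, to pin down which "load" is which in this exchange: cfs_rq->load is the direct
sum of the on-rq entities' weights, while the PELT load_avg/runnable_avg signals that
load balancing consumes are tracked separately. An abridged paraphrase of the weight-sum
side -- not the exact fair.c source:)

static void account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        /* cfs_rq->load is literally the sum of the on-rq entities' weights. */
        update_load_add(&cfs_rq->load, se->load.weight);
        cfs_rq->nr_running++;
}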
> until a task gets picked again, migrating tasks to other CPUs will
> expedite this condition.
>
> Anyway, at the moment I don't have strong evidence either which way, and
> the above argument seem to suggest not changing things for now.
>
> We can always re-evaluate.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-07-27 10:27 ` [PATCH 10/24] sched/uclamg: Handle " Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-08-20 16:23 ` Hongyan Xia
2024-08-21 13:34 ` Hongyan Xia
2 siblings, 0 replies; 277+ messages in thread
From: Hongyan Xia @ 2024-08-20 16:23 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault,
Luis Machado
On 27/07/2024 11:27, Peter Zijlstra wrote:
> Delayed dequeue has tasks sit around on the runqueue that are not
> actually runnable -- specifically, they will be dequeued the moment
> they get picked.
>
> One side-effect is that such a task can get migrated, which leads to a
> 'nested' dequeue_task() scenario that messes up uclamp if we don't
> take care.
>
> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> the runqueue. This however will have removed the task from uclamp --
> per uclamp_rq_dec() in dequeue_task(). So far so good.
>
> However, if at that point the task gets migrated -- or nice adjusted
> or any of a myriad of operations that does a dequeue-enqueue cycle --
> we'll pass through dequeue_task()/enqueue_task() again. Without
> modification this will lead to a double decrement for uclamp, which is
> wrong.
>
> Reported-by: Luis Machado <luis.machado@arm.com>
> Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> + if (p->se.sched_delayed)
> + return;
> +
> for_each_clamp_id(clamp_id)
> uclamp_rq_inc_id(rq, p, clamp_id);
>
> @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> + if (p->se.sched_delayed)
> + return;
> +
> for_each_clamp_id(clamp_id)
> uclamp_rq_dec_id(rq, p, clamp_id);
> }
> @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> }
>
> - uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> + /*
> + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> + * ->sched_delayed.
> + */
> + uclamp_rq_inc(rq, p);
Apart from the typo in the title, this is a notable functional change.
Both classes that support uclamp update the CPU frequency in
enqueue_task(). Before, a task that has uclamp_min would immediately
drive up the frequency the moment it was enqueued. Now, driving up the
frequency is delayed until the next util update.
I do not yet have evidence suggesting this is quantitatively bad, like
first frame drops, but we might want to keep an eye on this, and switch
back to the old way if possible.
>
> if (sched_core_enabled(rq))
> sched_core_enqueue(rq, p);
> @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
>
> + /*
> + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> + * and mark the task ->sched_delayed.
> + */
> uclamp_rq_dec(rq, p);
> return p->sched_class->dequeue_task(rq, p, flags);
> }
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (26 preceding siblings ...)
2024-08-16 15:22 ` Valentin Schneider
@ 2024-08-20 16:43 ` Hongyan Xia
2024-08-21 9:46 ` Hongyan Xia
2024-08-29 17:02 ` Aleksandr Nogikh
` (3 subsequent siblings)
31 siblings, 1 reply; 277+ messages in thread
From: Hongyan Xia @ 2024-08-20 16:43 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Hi Peter,
On 27/07/2024 11:27, Peter Zijlstra wrote:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>
The latest tip/sched/core at commit
aef6987d89544d63a47753cf3741cabff0b5574c
crashes very early on my Juno r2 board (arm64). The trace is here:
[ 0.049599] ------------[ cut here ]------------
[ 0.054279] kernel BUG at kernel/sched/deadline.c:63!
[ 0.059401] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ 0.066285] Modules linked in:
[ 0.069382] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc1-g55404cef33db #1070
[ 0.077855] Hardware name: ARM Juno development board (r2) (DT)
[ 0.083856] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.090919] pc : enqueue_dl_entity+0x53c/0x540
[ 0.095434] lr : dl_server_start+0xb8/0x10c
[ 0.099679] sp : ffffffc081ca3c30
[ 0.103034] x29: ffffffc081ca3c40 x28: 0000000000000001 x27: 0000000000000002
[ 0.110281] x26: 00000000000b71b0 x25: 0000000000000000 x24: 0000000000000001
[ 0.117525] x23: ffffff897ef21140 x22: 0000000000000000 x21: 0000000000000000
[ 0.124770] x20: ffffff897ef21040 x19: ffffff897ef219a8 x18: ffffffc080d0ad00
[ 0.132015] x17: 000000000000002f x16: 0000000000000000 x15: ffffffc081ca8000
[ 0.139260] x14: 00000000016ef200 x13: 00000000000e6667 x12: 0000000000000001
[ 0.146505] x11: 000000003b9aca00 x10: 0000000002faf080 x9 : 0000000000000030
[ 0.153749] x8 : 0000000000000071 x7 : 000000002cf93d25 x6 : 000000002cf93d25
[ 0.160994] x5 : ffffffc081e04938 x4 : ffffffc081ca3d40 x3 : 0000000000000001
[ 0.168238] x2 : 000000003b9aca00 x1 : 0000000000000001 x0 : ffffff897ef21040
[ 0.175483] Call trace:
[ 0.177958] enqueue_dl_entity+0x53c/0x540
[ 0.182117] dl_server_start+0xb8/0x10c
[ 0.186010] enqueue_task_fair+0x5c8/0x6ac
[ 0.190165] enqueue_task+0x54/0x1e8
[ 0.193793] wake_up_new_task+0x250/0x39c
[ 0.197862] kernel_clone+0x140/0x2f0
[ 0.201578] user_mode_thread+0x4c/0x58
[ 0.205468] rest_init+0x24/0xd8
[ 0.208743] start_kernel+0x2bc/0x2fc
[ 0.212460] __primary_switched+0x80/0x88
[ 0.216535] Code: b85fc3a8 7100051f 54fff8e9 17ffffce (d4210000)
[ 0.222711] ---[ end trace 0000000000000000 ]---
[ 0.227391] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.234187] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
I'm not an expert in the DL server, so I have no idea where the problem
could be. If you know where to look off the top of your head, so much the
better. If not, I'll do some bisection later.
Hongyan
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-20 16:43 ` Hongyan Xia
@ 2024-08-21 9:46 ` Hongyan Xia
2024-08-21 16:25 ` Mike Galbraith
2024-08-22 15:55 ` Peter Zijlstra
0 siblings, 2 replies; 277+ messages in thread
From: Hongyan Xia @ 2024-08-21 9:46 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 20/08/2024 17:43, Hongyan Xia wrote:
> Hi Peter,
>
> On 27/07/2024 11:27, Peter Zijlstra wrote:
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF
>> patches.
>> They've been sitting in my git tree for ever it seems, and people have
>> been
>> testing it and sending fixes.
>>
>> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan
>> on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right
>> before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>> - split up the huge delay-dequeue patch
>> - tested/fixed cfs-bandwidth
>> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> - propagate min_slice up cgroups
>> - CLOCK_THREAD_DVFS_ID
>>
>
> The latest tip/sched/core at commit
>
> aef6987d89544d63a47753cf3741cabff0b5574c
>
> crashes very early on on my Juno r2 board (arm64). The trace is here:
>
> [ 0.049599] ------------[ cut here ]------------
> [ 0.054279] kernel BUG at kernel/sched/deadline.c:63!
> [ 0.059401] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT
> SMP
> [ 0.066285] Modules linked in:
> [ 0.069382] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted
> 6.11.0-rc1-g55404cef33db #1070
> [ 0.077855] Hardware name: ARM Juno development board (r2) (DT)
> [ 0.083856] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [ 0.090919] pc : enqueue_dl_entity+0x53c/0x540
> [ 0.095434] lr : dl_server_start+0xb8/0x10c
> [ 0.099679] sp : ffffffc081ca3c30
> [ 0.103034] x29: ffffffc081ca3c40 x28: 0000000000000001 x27:
> 0000000000000002
> [ 0.110281] x26: 00000000000b71b0 x25: 0000000000000000 x24:
> 0000000000000001
> [ 0.117525] x23: ffffff897ef21140 x22: 0000000000000000 x21:
> 0000000000000000
> [ 0.124770] x20: ffffff897ef21040 x19: ffffff897ef219a8 x18:
> ffffffc080d0ad00
> [ 0.132015] x17: 000000000000002f x16: 0000000000000000 x15:
> ffffffc081ca8000
> [ 0.139260] x14: 00000000016ef200 x13: 00000000000e6667 x12:
> 0000000000000001
> [ 0.146505] x11: 000000003b9aca00 x10: 0000000002faf080 x9 :
> 0000000000000030
> [ 0.153749] x8 : 0000000000000071 x7 : 000000002cf93d25 x6 :
> 000000002cf93d25
> [ 0.160994] x5 : ffffffc081e04938 x4 : ffffffc081ca3d40 x3 :
> 0000000000000001
> [ 0.168238] x2 : 000000003b9aca00 x1 : 0000000000000001 x0 :
> ffffff897ef21040
> [ 0.175483] Call trace:
> [ 0.177958] enqueue_dl_entity+0x53c/0x540
> [ 0.182117] dl_server_start+0xb8/0x10c
> [ 0.186010] enqueue_task_fair+0x5c8/0x6ac
> [ 0.190165] enqueue_task+0x54/0x1e8
> [ 0.193793] wake_up_new_task+0x250/0x39c
> [ 0.197862] kernel_clone+0x140/0x2f0
> [ 0.201578] user_mode_thread+0x4c/0x58
> [ 0.205468] rest_init+0x24/0xd8
> [ 0.208743] start_kernel+0x2bc/0x2fc
> [ 0.212460] __primary_switched+0x80/0x88
> [ 0.216535] Code: b85fc3a8 7100051f 54fff8e9 17ffffce (d4210000)
> [ 0.222711] ---[ end trace 0000000000000000 ]---
> [ 0.227391] Kernel panic - not syncing: Attempted to kill the idle task!
> [ 0.234187] ---[ end Kernel panic - not syncing: Attempted to kill
> the idle task! ]---
>
> I'm not an expert in DL server so I have no idea where the problem could
> be. If you know where to look off the top of your head then much better.
> If not, I'll do some bi-section later.
>
Okay, in case the trace I provided isn't clear enough, I traced the
crash to a call chain like this:
dl_server_start()
  enqueue_dl_entity()
    update_stats_enqueue_dl()
      update_stats_enqueue_sleeper_dl()
        __schedstats_from_dl_se()
          dl_task_of() <---------- crash
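For context, the BUG at kernel/sched/deadline.c:63 is the guard at the top of
dl_task_of(); a rough paraphrase (not the exact source):

/*
 * A dl_server's sched_dl_entity is embedded in the rq, not in a
 * task_struct, so the container_of() below would be bogus for it;
 * the helper BUG()s when handed a server entity -- which is what the
 * schedstats path above ends up doing.
 */
static inline struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
        BUG_ON(dl_se->dl_server);
        return container_of(dl_se, struct task_struct, dl);
}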
If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if
this is the reason why other people are not seeing this. This is
probably related to the DL refactoring rather than to EEVDF itself.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-07-27 10:27 ` [PATCH 10/24] sched/uclamg: Handle " Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-08-20 16:23 ` [PATCH 10/24] " Hongyan Xia
@ 2024-08-21 13:34 ` Hongyan Xia
2024-08-22 8:19 ` Vincent Guittot
2 siblings, 1 reply; 277+ messages in thread
From: Hongyan Xia @ 2024-08-21 13:34 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault,
Luis Machado
Hi Peter,
Sorry for bombarding this thread in the last couple of days. I'm seeing
several issues in the latest tip/sched/core after these patches landed.
What I'm now seeing seems to be unbalanced util_est accounting. First, I applied
the following diff to warn when util_est != 0 while no tasks are on
the queue:
https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
Then, I'm reliably seeing warnings on my Juno board during boot in
latest tip/sched/core.
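(The linked diff isn't reproduced here; a rough illustration of that kind of check --
not the actual patch -- assuming the current layout where util_est is a plain
unsigned int in struct sched_avg:)

static inline void warn_on_stale_util_est(struct rq *rq)
{
        /* No queued fair tasks should mean no remaining util_est. */
        SCHED_WARN_ON(!rq->cfs.h_nr_running &&
                      READ_ONCE(rq->cfs.avg.util_est));
}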
If I do the same thing to util_est just like what you did in this uclamp
patch, like this:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 574ef19df64b..58aac42c99e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (flags & ENQUEUE_DELAYED) {
requeue_delayed_entity(se);
- return;
+ goto util_est;
}
/*
@@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* Let's add the task's estimated utilization to the cfs_rq's
* estimated utilization, before we update schedutil.
*/
- util_est_enqueue(&rq->cfs, p);
/*
* If in_iowait is set, the code below may not trigger any cpufreq
@@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
+util_est:
+ if (!p->se.sched_delayed)
+ util_est_enqueue(&rq->cfs, p);
}
static void set_next_buddy(struct sched_entity *se);
@@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
- util_est_dequeue(&rq->cfs, p);
+ if (!p->se.sched_delayed)
+ util_est_dequeue(&rq->cfs, p);
if (dequeue_entities(rq, &p->se, flags) < 0) {
if (!rq->cfs.h_nr_running)
which basically enqueues util_est at the end of enqueue_task_fair(),
dequeues util_est at the start of dequeue_task_fair() and double-checks
p->se.sched_delayed; with that, the unbalanced issue seems to go away.
Hopefully this helps you in finding where the problem could be.
Hongyan
On 27/07/2024 11:27, Peter Zijlstra wrote:
> Delayed dequeue has tasks sit around on the runqueue that are not
> actually runnable -- specifically, they will be dequeued the moment
> they get picked.
>
> One side-effect is that such a task can get migrated, which leads to a
> 'nested' dequeue_task() scenario that messes up uclamp if we don't
> take care.
>
> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> the runqueue. This however will have removed the task from uclamp --
> per uclamp_rq_dec() in dequeue_task(). So far so good.
>
> However, if at that point the task gets migrated -- or nice adjusted
> or any of a myriad of operations that does a dequeue-enqueue cycle --
> we'll pass through dequeue_task()/enqueue_task() again. Without
> modification this will lead to a double decrement for uclamp, which is
> wrong.
>
> Reported-by: Luis Machado <luis.machado@arm.com>
> Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> + if (p->se.sched_delayed)
> + return;
> +
> for_each_clamp_id(clamp_id)
> uclamp_rq_inc_id(rq, p, clamp_id);
>
> @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> + if (p->se.sched_delayed)
> + return;
> +
> for_each_clamp_id(clamp_id)
> uclamp_rq_dec_id(rq, p, clamp_id);
> }
> @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> }
>
> - uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> + /*
> + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> + * ->sched_delayed.
> + */
> + uclamp_rq_inc(rq, p);
>
> if (sched_core_enabled(rq))
> sched_core_enqueue(rq, p);
> @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
>
> + /*
> + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> + * and mark the task ->sched_delayed.
> + */
> uclamp_rq_dec(rq, p);
> return p->sched_class->dequeue_task(rq, p, flags);
> }
>
>
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-21 9:46 ` Hongyan Xia
@ 2024-08-21 16:25 ` Mike Galbraith
2024-08-22 15:55 ` Peter Zijlstra
1 sibling, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-08-21 16:25 UTC (permalink / raw)
To: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Wed, 2024-08-21 at 10:46 +0100, Hongyan Xia wrote:
>
> If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if
> this is the reason why other people are not seeing this. This is
> probably not EEVDF but DL refactoring related.
FWIW, tip.today boots/runs gripe free on dinky rpi4b (SCHEDSTATS=y).
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-21 13:34 ` Hongyan Xia
@ 2024-08-22 8:19 ` Vincent Guittot
2024-08-22 8:21 ` Vincent Guittot
2024-08-22 9:21 ` Luis Machado
0 siblings, 2 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-22 8:19 UTC (permalink / raw)
To: Hongyan Xia
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Luis Machado
Hi,
On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> Hi Peter,
>
> Sorry for bombarding this thread in the last couple of days. I'm seeing
> several issues in the latest tip/sched/core after these patches landed.
>
> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
I also see a leftover util_est on an idle rq because of an unbalanced
util_est_enqueue()/util_est_dequeue() pair.
> the following diff to warn against util_est != 0 when no tasks are on
> the queue:
>
> https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
>
> Then, I'm reliably seeing warnings on my Juno board during boot in
> latest tip/sched/core.
>
> If I do the same thing to util_est just like what you did in this uclamp
> patch, like this:
I think the solution is simpler than your proposal; we just
need to always call util_est_enqueue() before
requeue_delayed_entity():
@@ -6970,11 +6970,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_running = rq->cfs.h_nr_running;
u64 slice = 0;
- if (flags & ENQUEUE_DELAYED) {
- requeue_delayed_entity(se);
- return;
- }
-
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6983,6 +6978,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
*/
util_est_enqueue(&rq->cfs, p);
+ if (flags & ENQUEUE_DELAYED) {
+ requeue_delayed_entity(se);
+ return;
+ }
+
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 574ef19df64b..58aac42c99e5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
>
> if (flags & ENQUEUE_DELAYED) {
> requeue_delayed_entity(se);
> - return;
> + goto util_est;
> }
>
> /*
> @@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> * Let's add the task's estimated utilization to the cfs_rq's
> * estimated utilization, before we update schedutil.
> */
> - util_est_enqueue(&rq->cfs, p);
>
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> @@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> assert_list_leaf_cfs_rq(rq);
>
> hrtick_update(rq);
> +util_est:
> + if (!p->se.sched_delayed)
> + util_est_enqueue(&rq->cfs, p);
> }
>
> static void set_next_buddy(struct sched_entity *se);
> @@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct
> sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p,
> int flags)
> {
> - util_est_dequeue(&rq->cfs, p);
> + if (!p->se.sched_delayed)
> + util_est_dequeue(&rq->cfs, p);
>
> if (dequeue_entities(rq, &p->se, flags) < 0) {
> if (!rq->cfs.h_nr_running)
>
> which is basically enqueuing util_est after enqueue_task_fair(),
> dequeuing util_est before dequeue_task_fair() and double check
> p->se.delayed_dequeue, then the unbalanced issue seems to go away.
>
> Hopefully this helps you in finding where the problem could be.
>
> Hongyan
>
> On 27/07/2024 11:27, Peter Zijlstra wrote:
> > Delayed dequeue has tasks sit around on the runqueue that are not
> > actually runnable -- specifically, they will be dequeued the moment
> > they get picked.
> >
> > One side-effect is that such a task can get migrated, which leads to a
> > 'nested' dequeue_task() scenario that messes up uclamp if we don't
> > take care.
> >
> > Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> > the runqueue. This however will have removed the task from uclamp --
> > per uclamp_rq_dec() in dequeue_task(). So far so good.
> >
> > However, if at that point the task gets migrated -- or nice adjusted
> > or any of a myriad of operations that does a dequeue-enqueue cycle --
> > we'll pass through dequeue_task()/enqueue_task() again. Without
> > modification this will lead to a double decrement for uclamp, which is
> > wrong.
> >
> > Reported-by: Luis Machado <luis.machado@arm.com>
> > Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/core.c | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
> > if (unlikely(!p->sched_class->uclamp_enabled))
> > return;
> >
> > + if (p->se.sched_delayed)
> > + return;
> > +
> > for_each_clamp_id(clamp_id)
> > uclamp_rq_inc_id(rq, p, clamp_id);
> >
> > @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
> > if (unlikely(!p->sched_class->uclamp_enabled))
> > return;
> >
> > + if (p->se.sched_delayed)
> > + return;
> > +
> > for_each_clamp_id(clamp_id)
> > uclamp_rq_dec_id(rq, p, clamp_id);
> > }
> > @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
> > psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> > }
> >
> > - uclamp_rq_inc(rq, p);
> > p->sched_class->enqueue_task(rq, p, flags);
> > + /*
> > + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> > + * ->sched_delayed.
> > + */
> > + uclamp_rq_inc(rq, p);
> >
> > if (sched_core_enabled(rq))
> > sched_core_enqueue(rq, p);
> > @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
> > psi_dequeue(p, flags & DEQUEUE_SLEEP);
> > }
> >
> > + /*
> > + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> > + * and mark the task ->sched_delayed.
> > + */
> > uclamp_rq_dec(rq, p);
> > return p->sched_class->dequeue_task(rq, p, flags);
> > }
> >
> >
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 8:19 ` Vincent Guittot
@ 2024-08-22 8:21 ` Vincent Guittot
2024-08-22 9:21 ` Luis Machado
1 sibling, 0 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-22 8:21 UTC (permalink / raw)
To: Hongyan Xia
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Luis Machado
On Thu, 22 Aug 2024 at 10:19, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi,
>
> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> >
> > Hi Peter,
> >
> > Sorry for bombarding this thread in the last couple of days. I'm seeing
> > several issues in the latest tip/sched/core after these patches landed.
> >
> > What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
>
> I also see a remaining util_est for idle rq because of an unbalance
> call of util_est_enqueue|dequeue
>
> > the following diff to warn against util_est != 0 when no tasks are on
> > the queue:
> >
> > https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
> >
> > Then, I'm reliably seeing warnings on my Juno board during boot in
> > latest tip/sched/core.
> >
> > If I do the same thing to util_est just like what you did in this uclamp
> > patch, like this:
>
> I think that the solution is simpler than your proposal and we just
> need to always call util_est_enqueue() before the
> requeue_delayed_entity
I have been too quick and the below doesn't fix the problem
>
> @@ -6970,11 +6970,6 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
>
> - if (flags & ENQUEUE_DELAYED) {
> - requeue_delayed_entity(se);
> - return;
> - }
> -
> /*
> * The code below (indirectly) updates schedutil which looks at
> * the cfs_rq utilization to select a frequency.
> @@ -6983,6 +6978,11 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> + if (flags & ENQUEUE_DELAYED) {
> + requeue_delayed_entity(se);
> + return;
> + }
> +
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> * utilization updates, so do it here explicitly with the IOWAIT flag
>
>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 574ef19df64b..58aac42c99e5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct
> > task_struct *p, int flags)
> >
> > if (flags & ENQUEUE_DELAYED) {
> > requeue_delayed_entity(se);
> > - return;
> > + goto util_est;
> > }
> >
> > /*
> > @@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct
> > task_struct *p, int flags)
> > * Let's add the task's estimated utilization to the cfs_rq's
> > * estimated utilization, before we update schedutil.
> > */
> > - util_est_enqueue(&rq->cfs, p);
> >
> > /*
> > * If in_iowait is set, the code below may not trigger any cpufreq
> > @@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct
> > task_struct *p, int flags)
> > assert_list_leaf_cfs_rq(rq);
> >
> > hrtick_update(rq);
> > +util_est:
> > + if (!p->se.sched_delayed)
> > + util_est_enqueue(&rq->cfs, p);
> > }
> >
> > static void set_next_buddy(struct sched_entity *se);
> > @@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct
> > sched_entity *se, int flags)
> > */
> > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p,
> > int flags)
> > {
> > - util_est_dequeue(&rq->cfs, p);
> > + if (!p->se.sched_delayed)
> > + util_est_dequeue(&rq->cfs, p);
> >
> > if (dequeue_entities(rq, &p->se, flags) < 0) {
> > if (!rq->cfs.h_nr_running)
> >
> > which is basically enqueuing util_est after enqueue_task_fair(),
> > dequeuing util_est before dequeue_task_fair() and double check
> > p->se.delayed_dequeue, then the unbalanced issue seems to go away.
> >
> > Hopefully this helps you in finding where the problem could be.
> >
> > Hongyan
> >
> > On 27/07/2024 11:27, Peter Zijlstra wrote:
> > > Delayed dequeue has tasks sit around on the runqueue that are not
> > > actually runnable -- specifically, they will be dequeued the moment
> > > they get picked.
> > >
> > > One side-effect is that such a task can get migrated, which leads to a
> > > 'nested' dequeue_task() scenario that messes up uclamp if we don't
> > > take care.
> > >
> > > Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> > > the runqueue. This however will have removed the task from uclamp --
> > > per uclamp_rq_dec() in dequeue_task(). So far so good.
> > >
> > > However, if at that point the task gets migrated -- or nice adjusted
> > > or any of a myriad of operations that does a dequeue-enqueue cycle --
> > > we'll pass through dequeue_task()/enqueue_task() again. Without
> > > modification this will lead to a double decrement for uclamp, which is
> > > wrong.
> > >
> > > Reported-by: Luis Machado <luis.machado@arm.com>
> > > Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > kernel/sched/core.c | 16 +++++++++++++++-
> > > 1 file changed, 15 insertions(+), 1 deletion(-)
> > >
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
> > > if (unlikely(!p->sched_class->uclamp_enabled))
> > > return;
> > >
> > > + if (p->se.sched_delayed)
> > > + return;
> > > +
> > > for_each_clamp_id(clamp_id)
> > > uclamp_rq_inc_id(rq, p, clamp_id);
> > >
> > > @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
> > > if (unlikely(!p->sched_class->uclamp_enabled))
> > > return;
> > >
> > > + if (p->se.sched_delayed)
> > > + return;
> > > +
> > > for_each_clamp_id(clamp_id)
> > > uclamp_rq_dec_id(rq, p, clamp_id);
> > > }
> > > @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
> > > psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> > > }
> > >
> > > - uclamp_rq_inc(rq, p);
> > > p->sched_class->enqueue_task(rq, p, flags);
> > > + /*
> > > + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> > > + * ->sched_delayed.
> > > + */
> > > + uclamp_rq_inc(rq, p);
> > >
> > > if (sched_core_enabled(rq))
> > > sched_core_enqueue(rq, p);
> > > @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
> > > psi_dequeue(p, flags & DEQUEUE_SLEEP);
> > > }
> > >
> > > + /*
> > > + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> > > + * and mark the task ->sched_delayed.
> > > + */
> > > uclamp_rq_dec(rq, p);
> > > return p->sched_class->dequeue_task(rq, p, flags);
> > > }
> > >
> > >
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 8:19 ` Vincent Guittot
2024-08-22 8:21 ` Vincent Guittot
@ 2024-08-22 9:21 ` Luis Machado
2024-08-22 9:53 ` Vincent Guittot
1 sibling, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-08-22 9:21 UTC (permalink / raw)
To: Vincent Guittot, Hongyan Xia
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 8/22/24 09:19, Vincent Guittot wrote:
> Hi,
>
> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>
>> Hi Peter,
>>
>> Sorry for bombarding this thread in the last couple of days. I'm seeing
>> several issues in the latest tip/sched/core after these patches landed.
>>
>> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
>
> I also see a remaining util_est for idle rq because of an unbalance
> call of util_est_enqueue|dequeue
>
I can confirm issues with the utilization values and frequencies being driven
seemingly incorrectly, in particular for little cores.
What I'm seeing with the stock series is high utilization values for some tasks
and little cores having their frequencies maxed out for extended periods of
time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
idle. But whenever certain tasks get scheduled there, they have a very high util
level and so the frequency is kept at max.
As a consequence this drives up power usage.
I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
the util numbers and frequencies being used for the little cores. With his fix,
I can also see lower energy use for my specific benchmark.
>> the following diff to warn against util_est != 0 when no tasks are on
>> the queue:
>>
>> https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
>>
>> Then, I'm reliably seeing warnings on my Juno board during boot in
>> latest tip/sched/core.
>>
>> If I do the same thing to util_est just like what you did in this uclamp
>> patch, like this:
>
> I think that the solution is simpler than your proposal and we just
> need to always call util_est_enqueue() before the
> requeue_delayed_entity
>
> @@ -6970,11 +6970,6 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
>
> - if (flags & ENQUEUE_DELAYED) {
> - requeue_delayed_entity(se);
> - return;
> - }
> -
> /*
> * The code below (indirectly) updates schedutil which looks at
> * the cfs_rq utilization to select a frequency.
> @@ -6983,6 +6978,11 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> + if (flags & ENQUEUE_DELAYED) {
> + requeue_delayed_entity(se);
> + return;
> + }
> +
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> * utilization updates, so do it here explicitly with the IOWAIT flag
>
>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 574ef19df64b..58aac42c99e5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct
>> task_struct *p, int flags)
>>
>> if (flags & ENQUEUE_DELAYED) {
>> requeue_delayed_entity(se);
>> - return;
>> + goto util_est;
>> }
>>
>> /*
>> @@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct
>> task_struct *p, int flags)
>> * Let's add the task's estimated utilization to the cfs_rq's
>> * estimated utilization, before we update schedutil.
>> */
>> - util_est_enqueue(&rq->cfs, p);
>>
>> /*
>> * If in_iowait is set, the code below may not trigger any cpufreq
>> @@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct
>> task_struct *p, int flags)
>> assert_list_leaf_cfs_rq(rq);
>>
>> hrtick_update(rq);
>> +util_est:
>> + if (!p->se.sched_delayed)
>> + util_est_enqueue(&rq->cfs, p);
>> }
>>
>> static void set_next_buddy(struct sched_entity *se);
>> @@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct
>> sched_entity *se, int flags)
>> */
>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p,
>> int flags)
>> {
>> - util_est_dequeue(&rq->cfs, p);
>> + if (!p->se.sched_delayed)
>> + util_est_dequeue(&rq->cfs, p);
>>
>> if (dequeue_entities(rq, &p->se, flags) < 0) {
>> if (!rq->cfs.h_nr_running)
>>
>> which is basically enqueuing util_est after enqueue_task_fair(),
>> dequeuing util_est before dequeue_task_fair() and double check
>> p->se.delayed_dequeue, then the unbalanced issue seems to go away.
>>
>> Hopefully this helps you in finding where the problem could be.
>>
>> Hongyan
>>
>> On 27/07/2024 11:27, Peter Zijlstra wrote:
>>> Delayed dequeue has tasks sit around on the runqueue that are not
>>> actually runnable -- specifically, they will be dequeued the moment
>>> they get picked.
>>>
>>> One side-effect is that such a task can get migrated, which leads to a
>>> 'nested' dequeue_task() scenario that messes up uclamp if we don't
>>> take care.
>>>
>>> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
>>> the runqueue. This however will have removed the task from uclamp --
>>> per uclamp_rq_dec() in dequeue_task(). So far so good.
>>>
>>> However, if at that point the task gets migrated -- or nice adjusted
>>> or any of a myriad of operations that does a dequeue-enqueue cycle --
>>> we'll pass through dequeue_task()/enqueue_task() again. Without
>>> modification this will lead to a double decrement for uclamp, which is
>>> wrong.
>>>
>>> Reported-by: Luis Machado <luis.machado@arm.com>
>>> Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> ---
>>> kernel/sched/core.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
>>> if (unlikely(!p->sched_class->uclamp_enabled))
>>> return;
>>>
>>> + if (p->se.sched_delayed)
>>> + return;
>>> +
>>> for_each_clamp_id(clamp_id)
>>> uclamp_rq_inc_id(rq, p, clamp_id);
>>>
>>> @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
>>> if (unlikely(!p->sched_class->uclamp_enabled))
>>> return;
>>>
>>> + if (p->se.sched_delayed)
>>> + return;
>>> +
>>> for_each_clamp_id(clamp_id)
>>> uclamp_rq_dec_id(rq, p, clamp_id);
>>> }
>>> @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
>>> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
>>> }
>>>
>>> - uclamp_rq_inc(rq, p);
>>> p->sched_class->enqueue_task(rq, p, flags);
>>> + /*
>>> + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
>>> + * ->sched_delayed.
>>> + */
>>> + uclamp_rq_inc(rq, p);
>>>
>>> if (sched_core_enabled(rq))
>>> sched_core_enqueue(rq, p);
>>> @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
>>> psi_dequeue(p, flags & DEQUEUE_SLEEP);
>>> }
>>>
>>> + /*
>>> + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
>>> + * and mark the task ->sched_delayed.
>>> + */
>>> uclamp_rq_dec(rq, p);
>>> return p->sched_class->dequeue_task(rq, p, flags);
>>> }
>>>
>>>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 9:21 ` Luis Machado
@ 2024-08-22 9:53 ` Vincent Guittot
2024-08-22 10:20 ` Vincent Guittot
2024-08-22 10:28 ` Luis Machado
0 siblings, 2 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-22 9:53 UTC (permalink / raw)
To: Luis Machado
Cc: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
>
> On 8/22/24 09:19, Vincent Guittot wrote:
> > Hi,
> >
> > On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> >>
> >> Hi Peter,
> >>
> >> Sorry for bombarding this thread in the last couple of days. I'm seeing
> >> several issues in the latest tip/sched/core after these patches landed.
> >>
> >> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
> >
> > I also see a remaining util_est for idle rq because of an unbalance
> > call of util_est_enqueue|dequeue
> >
>
> I can confirm issues with the utilization values and frequencies being driven
> seemingly incorrectly, in particular for little cores.
>
> What I'm seeing with the stock series is high utilization values for some tasks
> and little cores having their frequencies maxed out for extended periods of
> time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
> idle. But whenever certain tasks get scheduled there, they have a very high util
> level and so the frequency is kept at max.
>
> As a consequence this drives up power usage.
>
> I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
> the util numbers and frequencies being used for the little cores. With his fix,
> I can also see lower energy use for my specific benchmark.
The main problem is that the util_est of a delayed-dequeue task
remains on the rq and keeps the rq utilization high and, as a result,
the frequency higher than needed.
The below seems to work for me and keeps the enqueue/dequeue of
util_est in sync with the enqueue/dequeue of the task, as if the
dequeue were not delayed.
Another benefit is that we will not try to migrate a delayed-dequeue
sleeping task, which doesn't actually impact the current load of the cpu
and as a result will not help the load balance. I haven't yet fully
checked what would happen with hotplug.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea057b311f6..0970bcdc889a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6944,11 +6944,6 @@ enqueue_task_fair(struct rq *rq, struct
task_struct *p, int flags)
int rq_h_nr_running = rq->cfs.h_nr_running;
u64 slice = 0;
- if (flags & ENQUEUE_DELAYED) {
- requeue_delayed_entity(se);
- return;
- }
-
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6957,6 +6952,11 @@ enqueue_task_fair(struct rq *rq, struct
task_struct *p, int flags)
*/
util_est_enqueue(&rq->cfs, p);
+ if (flags & ENQUEUE_DELAYED) {
+ requeue_delayed_entity(se);
+ return;
+ }
+
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
@@ -9276,6 +9276,8 @@ int can_migrate_task(struct task_struct *p,
struct lb_env *env)
lockdep_assert_rq_held(env->src_rq);
+ if (p->se.sched_delayed)
+ return 0;
/*
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
>
>
> >> the following diff to warn against util_est != 0 when no tasks are on
> >> the queue:
> >>
> >> https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
> >>
> >> Then, I'm reliably seeing warnings on my Juno board during boot in
> >> latest tip/sched/core.
> >>
> >> If I do the same thing to util_est just like what you did in this uclamp
> >> patch, like this:
> >
> > I think that the solution is simpler than your proposal and we just
> > need to always call util_est_enqueue() before the
> > requeue_delayed_entity
> >
> > @@ -6970,11 +6970,6 @@ enqueue_task_fair(struct rq *rq, struct
> > task_struct *p, int flags)
> > int rq_h_nr_running = rq->cfs.h_nr_running;
> > u64 slice = 0;
> >
> > - if (flags & ENQUEUE_DELAYED) {
> > - requeue_delayed_entity(se);
> > - return;
> > - }
> > -
> > /*
> > * The code below (indirectly) updates schedutil which looks at
> > * the cfs_rq utilization to select a frequency.
> > @@ -6983,6 +6978,11 @@ enqueue_task_fair(struct rq *rq, struct
> > task_struct *p, int flags)
> > */
> > util_est_enqueue(&rq->cfs, p);
> >
> > + if (flags & ENQUEUE_DELAYED) {
> > + requeue_delayed_entity(se);
> > + return;
> > + }
> > +
> > /*
> > * If in_iowait is set, the code below may not trigger any cpufreq
> > * utilization updates, so do it here explicitly with the IOWAIT flag
> >
> >
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 574ef19df64b..58aac42c99e5 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct
> >> task_struct *p, int flags)
> >>
> >> if (flags & ENQUEUE_DELAYED) {
> >> requeue_delayed_entity(se);
> >> - return;
> >> + goto util_est;
> >> }
> >>
> >> /*
> >> @@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct
> >> task_struct *p, int flags)
> >> * Let's add the task's estimated utilization to the cfs_rq's
> >> * estimated utilization, before we update schedutil.
> >> */
> >> - util_est_enqueue(&rq->cfs, p);
> >>
> >> /*
> >> * If in_iowait is set, the code below may not trigger any cpufreq
> >> @@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct
> >> task_struct *p, int flags)
> >> assert_list_leaf_cfs_rq(rq);
> >>
> >> hrtick_update(rq);
> >> +util_est:
> >> + if (!p->se.sched_delayed)
> >> + util_est_enqueue(&rq->cfs, p);
> >> }
> >>
> >> static void set_next_buddy(struct sched_entity *se);
> >> @@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct
> >> sched_entity *se, int flags)
> >> */
> >> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p,
> >> int flags)
> >> {
> >> - util_est_dequeue(&rq->cfs, p);
> >> + if (!p->se.sched_delayed)
> >> + util_est_dequeue(&rq->cfs, p);
> >>
> >> if (dequeue_entities(rq, &p->se, flags) < 0) {
> >> if (!rq->cfs.h_nr_running)
> >>
> >> which is basically enqueuing util_est after enqueue_task_fair(),
> >> dequeuing util_est before dequeue_task_fair() and double check
> >> p->se.delayed_dequeue, then the unbalanced issue seems to go away.
> >>
> >> Hopefully this helps you in finding where the problem could be.
> >>
> >> Hongyan
> >>
> >> On 27/07/2024 11:27, Peter Zijlstra wrote:
> >>> Delayed dequeue has tasks sit around on the runqueue that are not
> >>> actually runnable -- specifically, they will be dequeued the moment
> >>> they get picked.
> >>>
> >>> One side-effect is that such a task can get migrated, which leads to a
> >>> 'nested' dequeue_task() scenario that messes up uclamp if we don't
> >>> take care.
> >>>
> >>> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> >>> the runqueue. This however will have removed the task from uclamp --
> >>> per uclamp_rq_dec() in dequeue_task(). So far so good.
> >>>
> >>> However, if at that point the task gets migrated -- or nice adjusted
> >>> or any of a myriad of operations that does a dequeue-enqueue cycle --
> >>> we'll pass through dequeue_task()/enqueue_task() again. Without
> >>> modification this will lead to a double decrement for uclamp, which is
> >>> wrong.
> >>>
> >>> Reported-by: Luis Machado <luis.machado@arm.com>
> >>> Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> >>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>> ---
> >>> kernel/sched/core.c | 16 +++++++++++++++-
> >>> 1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
> >>> if (unlikely(!p->sched_class->uclamp_enabled))
> >>> return;
> >>>
> >>> + if (p->se.sched_delayed)
> >>> + return;
> >>> +
> >>> for_each_clamp_id(clamp_id)
> >>> uclamp_rq_inc_id(rq, p, clamp_id);
> >>>
> >>> @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
> >>> if (unlikely(!p->sched_class->uclamp_enabled))
> >>> return;
> >>>
> >>> + if (p->se.sched_delayed)
> >>> + return;
> >>> +
> >>> for_each_clamp_id(clamp_id)
> >>> uclamp_rq_dec_id(rq, p, clamp_id);
> >>> }
> >>> @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
> >>> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> >>> }
> >>>
> >>> - uclamp_rq_inc(rq, p);
> >>> p->sched_class->enqueue_task(rq, p, flags);
> >>> + /*
> >>> + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> >>> + * ->sched_delayed.
> >>> + */
> >>> + uclamp_rq_inc(rq, p);
> >>>
> >>> if (sched_core_enabled(rq))
> >>> sched_core_enqueue(rq, p);
> >>> @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
> >>> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> >>> }
> >>>
> >>> + /*
> >>> + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> >>> + * and mark the task ->sched_delayed.
> >>> + */
> >>> uclamp_rq_dec(rq, p);
> >>> return p->sched_class->dequeue_task(rq, p, flags);
> >>> }
> >>>
> >>>
>
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 9:53 ` Vincent Guittot
@ 2024-08-22 10:20 ` Vincent Guittot
2024-08-22 10:28 ` Luis Machado
1 sibling, 0 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-08-22 10:20 UTC (permalink / raw)
To: Luis Machado
Cc: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 22 Aug 2024 at 11:53, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
> >
> > On 8/22/24 09:19, Vincent Guittot wrote:
> > > Hi,
> > >
> > > On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> > >>
> > >> Hi Peter,
> > >>
> > >> Sorry for bombarding this thread in the last couple of days. I'm seeing
> > >> several issues in the latest tip/sched/core after these patches landed.
> > >>
> > >> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
> > >
> > > I also see a remaining util_est for idle rq because of an unbalance
> > > call of util_est_enqueue|dequeue
> > >
> >
> > I can confirm issues with the utilization values and frequencies being driven
> > seemingly incorrectly, in particular for little cores.
> >
> > What I'm seeing with the stock series is high utilization values for some tasks
> > and little cores having their frequencies maxed out for extended periods of
> > time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
> > idle. But whenever certain tasks get scheduled there, they have a very high util
> > level and so the frequency is kept at max.
> >
> > As a consequence this drives up power usage.
> >
> > I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
> > the util numbers and frequencies being used for the little cores. With his fix,
> > I can also see lower energy use for my specific benchmark.
>
> The main problem is that the util_est of a delayed dequeued task
> remains on the rq and keeps the rq utilization high and as a result
> the frequency higher than needed.
>
> The below seems to works for me and keep sync the enqueue/dequeue of
> uti_test with the enqueue/dequeue of the task as if de dequeue was not
> delayed
>
> Another interest is that we will not try to migrate a delayed dequeue
> sleeping task that doesn't actually impact the current load of the cpu
> and as a result will not help in the load balance. I haven't yet fully
> checked what would happen with hotplug
And there is the case of a delayed dequeue task that gets its affinity changed
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea057b311f6..0970bcdc889a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6944,11 +6944,6 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
>
> - if (flags & ENQUEUE_DELAYED) {
> - requeue_delayed_entity(se);
> - return;
> - }
> -
> /*
> * The code below (indirectly) updates schedutil which looks at
> * the cfs_rq utilization to select a frequency.
> @@ -6957,6 +6952,11 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> */
> util_est_enqueue(&rq->cfs, p);
>
> + if (flags & ENQUEUE_DELAYED) {
> + requeue_delayed_entity(se);
> + return;
> + }
> +
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> * utilization updates, so do it here explicitly with the IOWAIT flag
> @@ -9276,6 +9276,8 @@ int can_migrate_task(struct task_struct *p,
> struct lb_env *env)
>
> lockdep_assert_rq_held(env->src_rq);
>
> + if (p->se.sched_delayed)
> + return 0;
> /*
> * We do not migrate tasks that are:
> * 1) throttled_lb_pair, or
>
> >
> >
> > >> the following diff to warn against util_est != 0 when no tasks are on
> > >> the queue:
> > >>
> > >> https://lore.kernel.org/all/752ae417c02b9277ca3ec18893747c54dd5f277f.1724245193.git.hongyan.xia2@arm.com/
> > >>
> > >> Then, I'm reliably seeing warnings on my Juno board during boot in
> > >> latest tip/sched/core.
> > >>
> > >> If I do the same thing to util_est just like what you did in this uclamp
> > >> patch, like this:
> > >
> > > I think that the solution is simpler than your proposal and we just
> > > need to always call util_est_enqueue() before the
> > > requeue_delayed_entity
> > >
> > > @@ -6970,11 +6970,6 @@ enqueue_task_fair(struct rq *rq, struct
> > > task_struct *p, int flags)
> > > int rq_h_nr_running = rq->cfs.h_nr_running;
> > > u64 slice = 0;
> > >
> > > - if (flags & ENQUEUE_DELAYED) {
> > > - requeue_delayed_entity(se);
> > > - return;
> > > - }
> > > -
> > > /*
> > > * The code below (indirectly) updates schedutil which looks at
> > > * the cfs_rq utilization to select a frequency.
> > > @@ -6983,6 +6978,11 @@ enqueue_task_fair(struct rq *rq, struct
> > > task_struct *p, int flags)
> > > */
> > > util_est_enqueue(&rq->cfs, p);
> > >
> > > + if (flags & ENQUEUE_DELAYED) {
> > > + requeue_delayed_entity(se);
> > > + return;
> > > + }
> > > +
> > > /*
> > > * If in_iowait is set, the code below may not trigger any cpufreq
> > > * utilization updates, so do it here explicitly with the IOWAIT flag
> > >
> > >
> > >>
> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >> index 574ef19df64b..58aac42c99e5 100644
> > >> --- a/kernel/sched/fair.c
> > >> +++ b/kernel/sched/fair.c
> > >> @@ -6946,7 +6946,7 @@ enqueue_task_fair(struct rq *rq, struct
> > >> task_struct *p, int flags)
> > >>
> > >> if (flags & ENQUEUE_DELAYED) {
> > >> requeue_delayed_entity(se);
> > >> - return;
> > >> + goto util_est;
> > >> }
> > >>
> > >> /*
> > >> @@ -6955,7 +6955,6 @@ enqueue_task_fair(struct rq *rq, struct
> > >> task_struct *p, int flags)
> > >> * Let's add the task's estimated utilization to the cfs_rq's
> > >> * estimated utilization, before we update schedutil.
> > >> */
> > >> - util_est_enqueue(&rq->cfs, p);
> > >>
> > >> /*
> > >> * If in_iowait is set, the code below may not trigger any cpufreq
> > >> @@ -7050,6 +7049,9 @@ enqueue_task_fair(struct rq *rq, struct
> > >> task_struct *p, int flags)
> > >> assert_list_leaf_cfs_rq(rq);
> > >>
> > >> hrtick_update(rq);
> > >> +util_est:
> > >> + if (!p->se.sched_delayed)
> > >> + util_est_enqueue(&rq->cfs, p);
> > >> }
> > >>
> > >> static void set_next_buddy(struct sched_entity *se);
> > >> @@ -7173,7 +7175,8 @@ static int dequeue_entities(struct rq *rq, struct
> > >> sched_entity *se, int flags)
> > >> */
> > >> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p,
> > >> int flags)
> > >> {
> > >> - util_est_dequeue(&rq->cfs, p);
> > >> + if (!p->se.sched_delayed)
> > >> + util_est_dequeue(&rq->cfs, p);
> > >>
> > >> if (dequeue_entities(rq, &p->se, flags) < 0) {
> > >> if (!rq->cfs.h_nr_running)
> > >>
> > >> which is basically enqueuing util_est after enqueue_task_fair(),
> > >> dequeuing util_est before dequeue_task_fair() and double check
> > >> p->se.delayed_dequeue, then the unbalanced issue seems to go away.
> > >>
> > >> Hopefully this helps you in finding where the problem could be.
> > >>
> > >> Hongyan
> > >>
> > >> On 27/07/2024 11:27, Peter Zijlstra wrote:
> > >>> Delayed dequeue has tasks sit around on the runqueue that are not
> > >>> actually runnable -- specifically, they will be dequeued the moment
> > >>> they get picked.
> > >>>
> > >>> One side-effect is that such a task can get migrated, which leads to a
> > >>> 'nested' dequeue_task() scenario that messes up uclamp if we don't
> > >>> take care.
> > >>>
> > >>> Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
> > >>> the runqueue. This however will have removed the task from uclamp --
> > >>> per uclamp_rq_dec() in dequeue_task(). So far so good.
> > >>>
> > >>> However, if at that point the task gets migrated -- or nice adjusted
> > >>> or any of a myriad of operations that does a dequeue-enqueue cycle --
> > >>> we'll pass through dequeue_task()/enqueue_task() again. Without
> > >>> modification this will lead to a double decrement for uclamp, which is
> > >>> wrong.
> > >>>
> > >>> Reported-by: Luis Machado <luis.machado@arm.com>
> > >>> Reported-by: Hongyan Xia <hongyan.xia2@arm.com>
> > >>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > >>> ---
> > >>> kernel/sched/core.c | 16 +++++++++++++++-
> > >>> 1 file changed, 15 insertions(+), 1 deletion(-)
> > >>>
> > >>> --- a/kernel/sched/core.c
> > >>> +++ b/kernel/sched/core.c
> > >>> @@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
> > >>> if (unlikely(!p->sched_class->uclamp_enabled))
> > >>> return;
> > >>>
> > >>> + if (p->se.sched_delayed)
> > >>> + return;
> > >>> +
> > >>> for_each_clamp_id(clamp_id)
> > >>> uclamp_rq_inc_id(rq, p, clamp_id);
> > >>>
> > >>> @@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
> > >>> if (unlikely(!p->sched_class->uclamp_enabled))
> > >>> return;
> > >>>
> > >>> + if (p->se.sched_delayed)
> > >>> + return;
> > >>> +
> > >>> for_each_clamp_id(clamp_id)
> > >>> uclamp_rq_dec_id(rq, p, clamp_id);
> > >>> }
> > >>> @@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
> > >>> psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
> > >>> }
> > >>>
> > >>> - uclamp_rq_inc(rq, p);
> > >>> p->sched_class->enqueue_task(rq, p, flags);
> > >>> + /*
> > >>> + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> > >>> + * ->sched_delayed.
> > >>> + */
> > >>> + uclamp_rq_inc(rq, p);
> > >>>
> > >>> if (sched_core_enabled(rq))
> > >>> sched_core_enqueue(rq, p);
> > >>> @@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
> > >>> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> > >>> }
> > >>>
> > >>> + /*
> > >>> + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
> > >>> + * and mark the task ->sched_delayed.
> > >>> + */
> > >>> uclamp_rq_dec(rq, p);
> > >>> return p->sched_class->dequeue_task(rq, p, flags);
> > >>> }
> > >>>
> > >>>
> >
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 9:53 ` Vincent Guittot
2024-08-22 10:20 ` Vincent Guittot
@ 2024-08-22 10:28 ` Luis Machado
2024-08-22 12:07 ` Luis Machado
1 sibling, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-08-22 10:28 UTC (permalink / raw)
To: Vincent Guittot
Cc: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 8/22/24 10:53, Vincent Guittot wrote:
> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
>>
>> On 8/22/24 09:19, Vincent Guittot wrote:
>>> Hi,
>>>
>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> Sorry for bombarding this thread in the last couple of days. I'm seeing
>>>> several issues in the latest tip/sched/core after these patches landed.
>>>>
>>>> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
>>>
>>> I also see a remaining util_est for idle rq because of an unbalance
>>> call of util_est_enqueue|dequeue
>>>
>>
>> I can confirm issues with the utilization values and frequencies being driven
>> seemingly incorrectly, in particular for little cores.
>>
>> What I'm seeing with the stock series is high utilization values for some tasks
>> and little cores having their frequencies maxed out for extended periods of
>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
>> idle. But whenever certain tasks get scheduled there, they have a very high util
>> level and so the frequency is kept at max.
>>
>> As a consequence this drives up power usage.
>>
>> I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
>> the util numbers and frequencies being used for the little cores. With his fix,
>> I can also see lower energy use for my specific benchmark.
>
> The main problem is that the util_est of a delayed dequeued task
> remains on the rq and keeps the rq utilization high and as a result
> the frequency higher than needed.
>
> The below seems to works for me and keep sync the enqueue/dequeue of
> uti_test with the enqueue/dequeue of the task as if de dequeue was not
> delayed
>
> Another interest is that we will not try to migrate a delayed dequeue
> sleeping task that doesn't actually impact the current load of the cpu
> and as a result will not help in the load balance. I haven't yet fully
> checked what would happen with hotplug
Thanks. Those are good points. Let me go and try your patch.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 10:28 ` Luis Machado
@ 2024-08-22 12:07 ` Luis Machado
2024-08-22 12:10 ` Vincent Guittot
0 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-08-22 12:07 UTC (permalink / raw)
To: Vincent Guittot
Cc: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Vincent,
On 8/22/24 11:28, Luis Machado wrote:
> On 8/22/24 10:53, Vincent Guittot wrote:
>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
>>>
>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>> Hi,
>>>>
>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> Sorry for bombarding this thread in the last couple of days. I'm seeing
>>>>> several issues in the latest tip/sched/core after these patches landed.
>>>>>
>>>>> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
>>>>
>>>> I also see a remaining util_est for idle rq because of an unbalance
>>>> call of util_est_enqueue|dequeue
>>>>
>>>
>>> I can confirm issues with the utilization values and frequencies being driven
>>> seemingly incorrectly, in particular for little cores.
>>>
>>> What I'm seeing with the stock series is high utilization values for some tasks
>>> and little cores having their frequencies maxed out for extended periods of
>>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
>>> idle. But whenever certain tasks get scheduled there, they have a very high util
>>> level and so the frequency is kept at max.
>>>
>>> As a consequence this drives up power usage.
>>>
>>> I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
>>> the util numbers and frequencies being used for the little cores. With his fix,
>>> I can also see lower energy use for my specific benchmark.
>>
>> The main problem is that the util_est of a delayed dequeued task
>> remains on the rq and keeps the rq utilization high and as a result
>> the frequency higher than needed.
>>
>> The below seems to works for me and keep sync the enqueue/dequeue of
>> uti_test with the enqueue/dequeue of the task as if de dequeue was not
>> delayed
>>
>> Another interest is that we will not try to migrate a delayed dequeue
>> sleeping task that doesn't actually impact the current load of the cpu
>> and as a result will not help in the load balance. I haven't yet fully
>> checked what would happen with hotplug
>
> Thanks. Those are good points. Let me go and try your patch.
I gave your fix a try, but it seems to make things worse. It is comparable
to the behavior we had before Peter added the uclamp imbalance fix, so I
believe there is something incorrect there.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 12:07 ` Luis Machado
@ 2024-08-22 12:10 ` Vincent Guittot
2024-08-22 14:58 ` Vincent Guittot
0 siblings, 1 reply; 277+ messages in thread
From: Vincent Guittot @ 2024-08-22 12:10 UTC (permalink / raw)
To: Luis Machado
Cc: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
>
> Vincent,
>
> On 8/22/24 11:28, Luis Machado wrote:
> > On 8/22/24 10:53, Vincent Guittot wrote:
> >> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
> >>>
> >>> On 8/22/24 09:19, Vincent Guittot wrote:
> >>>> Hi,
> >>>>
> >>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> >>>>>
> >>>>> Hi Peter,
> >>>>>
> >>>>> Sorry for bombarding this thread in the last couple of days. I'm seeing
> >>>>> several issues in the latest tip/sched/core after these patches landed.
> >>>>>
> >>>>> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
> >>>>
> >>>> I also see a remaining util_est for idle rq because of an unbalance
> >>>> call of util_est_enqueue|dequeue
> >>>>
> >>>
> >>> I can confirm issues with the utilization values and frequencies being driven
> >>> seemingly incorrectly, in particular for little cores.
> >>>
> >>> What I'm seeing with the stock series is high utilization values for some tasks
> >>> and little cores having their frequencies maxed out for extended periods of
> >>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
> >>> idle. But whenever certain tasks get scheduled there, they have a very high util
> >>> level and so the frequency is kept at max.
> >>>
> >>> As a consequence this drives up power usage.
> >>>
> >>> I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
> >>> the util numbers and frequencies being used for the little cores. With his fix,
> >>> I can also see lower energy use for my specific benchmark.
> >>
> >> The main problem is that the util_est of a delayed dequeued task
> >> remains on the rq and keeps the rq utilization high and as a result
> >> the frequency higher than needed.
> >>
> >> The below seems to works for me and keep sync the enqueue/dequeue of
> >> uti_test with the enqueue/dequeue of the task as if de dequeue was not
> >> delayed
> >>
> >> Another interest is that we will not try to migrate a delayed dequeue
> >> sleeping task that doesn't actually impact the current load of the cpu
> >> and as a result will not help in the load balance. I haven't yet fully
> >> checked what would happen with hotplug
> >
> > Thanks. Those are good points. Let me go and try your patch.
>
> I gave your fix a try, but it seems to make things worse. It is comparable
> to the behavior we had before Peter added the uclamp imbalance fix, so I
> believe there is something incorrect there.
we need to filter cases where tasks are enqueued/dequeued several
consecutive times. That's what I'm looking at now.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 12:10 ` Vincent Guittot
@ 2024-08-22 14:58 ` Vincent Guittot
2024-08-29 15:42 ` Hongyan Xia
0 siblings, 1 reply; 277+ messages in thread
From: Vincent Guittot @ 2024-08-22 14:58 UTC (permalink / raw)
To: Luis Machado
Cc: Hongyan Xia, Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
> >
> > Vincent,
> >
> > On 8/22/24 11:28, Luis Machado wrote:
> > > On 8/22/24 10:53, Vincent Guittot wrote:
> > >> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
> > >>>
> > >>> On 8/22/24 09:19, Vincent Guittot wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
> > >>>>>
> > >>>>> Hi Peter,
> > >>>>>
> > >>>>> Sorry for bombarding this thread in the last couple of days. I'm seeing
> > >>>>> several issues in the latest tip/sched/core after these patches landed.
> > >>>>>
> > >>>>> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
> > >>>>
> > >>>> I also see a remaining util_est for idle rq because of an unbalance
> > >>>> call of util_est_enqueue|dequeue
> > >>>>
> > >>>
> > >>> I can confirm issues with the utilization values and frequencies being driven
> > >>> seemingly incorrectly, in particular for little cores.
> > >>>
> > >>> What I'm seeing with the stock series is high utilization values for some tasks
> > >>> and little cores having their frequencies maxed out for extended periods of
> > >>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
> > >>> idle. But whenever certain tasks get scheduled there, they have a very high util
> > >>> level and so the frequency is kept at max.
> > >>>
> > >>> As a consequence this drives up power usage.
> > >>>
> > >>> I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
> > >>> the util numbers and frequencies being used for the little cores. With his fix,
> > >>> I can also see lower energy use for my specific benchmark.
> > >>
> > >> The main problem is that the util_est of a delayed dequeued task
> > >> remains on the rq and keeps the rq utilization high and as a result
> > >> the frequency higher than needed.
> > >>
> > >> The below seems to works for me and keep sync the enqueue/dequeue of
> > >> uti_test with the enqueue/dequeue of the task as if de dequeue was not
> > >> delayed
> > >>
> > >> Another interest is that we will not try to migrate a delayed dequeue
> > >> sleeping task that doesn't actually impact the current load of the cpu
> > >> and as a result will not help in the load balance. I haven't yet fully
> > >> checked what would happen with hotplug
> > >
> > > Thanks. Those are good points. Let me go and try your patch.
> >
> > I gave your fix a try, but it seems to make things worse. It is comparable
> > to the behavior we had before Peter added the uclamp imbalance fix, so I
> > believe there is something incorrect there.
>
> we need to filter case where task are enqueued/dequeued several
> consecutive times. That's what I'm look now
I just realized that it's not only util_est: h_nr_running also keeps
counting delayed tasks, so all the rq stats are biased: h_nr_running,
util_est, runnable_avg and load_avg.
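A minimal sketch of the direction this points at, purely illustrative and not
part of the posted series (the 'h_nr_delayed' counter and the helper below are
hypothetical): if delayed-dequeue tasks were tracked separately, the rq-wide
statistics could count only genuinely runnable tasks.

/*
 * Illustrative sketch only: 'h_nr_delayed' is an assumed per-cfs_rq counter
 * of enqueued-but-delayed-dequeue tasks; it does not exist in this series.
 */
static inline unsigned int cfs_h_nr_runnable(struct cfs_rq *cfs_rq)
{
	/* exclude delayed-dequeue tasks, which are enqueued but not runnable */
	return cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
}

Consumers such as load balancing and the util_est/schedutil paths would then
look at the corrected count instead of the raw h_nr_running.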
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-21 9:46 ` Hongyan Xia
2024-08-21 16:25 ` Mike Galbraith
@ 2024-08-22 15:55 ` Peter Zijlstra
2024-08-27 9:43 ` Hongyan Xia
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-08-22 15:55 UTC (permalink / raw)
To: Hongyan Xia
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Wed, Aug 21, 2024 at 10:46:07AM +0100, Hongyan Xia wrote:
> Okay, in case the trace I provided isn't clear enough, I traced the crash to
> a call chain like this:
>
> dl_server_start()
> enqueue_dl_entity()
> update_stats_enqueue_dl()
> update_stats_enqueue_sleeper_dl()
> __schedstats_from_dl_se()
> dl_task_of() <---------- crash
>
> If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if this is
> the reason why other people are not seeing this. This is probably not EEVDF
> but DL refactoring related.
Thanks for the report -- I'll see if I can spot something. Since you
initially fingered these eevdf patches, could you confirm or deny that
changing:
kernel/sched/features.h:SCHED_FEAT(DELAY_DEQUEUE, true)
to false, makes any difference in the previously failing case?
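(If it helps, and assuming CONFIG_SCHED_DEBUG is enabled on your build, the
same feature can also be toggled at runtime through the debugfs scheduler
interface, by writing NO_DELAY_DEQUEUE to /sys/kernel/debug/sched/features to
disable it and DELAY_DEQUEUE to re-enable it, so a quick check should not need
a rebuild.)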
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-07-27 10:27 ` [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-08-27 9:17 ` Chen Yu
2024-08-28 3:06 ` Chen Yu
2 siblings, 1 reply; 277+ messages in thread
From: Chen Yu @ 2024-08-27 9:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, tim.c.chen
On 2024-07-27 at 12:27:44 +0200, Peter Zijlstra wrote:
> When dequeue_task() is delayed it becomes possible to exit a task (or
> cgroup) that is still enqueued. Ensure things are dequeued before
> freeing.
>
> NOTE: switched_from_fair() causes spurious wakeups due to clearing
> sched_delayed after enqueueing a task in another class that should've
> been dequeued. This *should* be harmless.
>
It might bring some unexpected behavior in some corner cases, as reported here:
https://lore.kernel.org/lkml/202408161619.9ed8b83e-lkp@intel.com/
The blocked task might return from schedule() with TASK_INTERRUPTIBLE.
We cooked a patch to work around it (as below).
thanks,
Chenyu
From 9251b25073d43aeac04a6ee69b590fbfa1b8e1a5 Mon Sep 17 00:00:00 2001
From: Chen Yu <yu.c.chen@intel.com>
Date: Mon, 26 Aug 2024 22:16:38 +0800
Subject: [PATCH] sched/eevdf: Dequeue the delayed task when changing its
schedule policy
[Problem Statement]
The following warning was reported:
do not call blocking ops when !TASK_RUNNING; state=1 set at kthread_worker_fn (kernel/kthread.c:?)
WARNING: CPU: 1 PID: 674 at kernel/sched/core.c:8469 __might_sleep
handle_bug
exc_invalid_op
asm_exc_invalid_op
__might_sleep
__might_sleep
kthread_worker_fn
kthread_worker_fn
kthread
__cfi_kthread_worker_fn
ret_from_fork
__cfi_kthread
ret_from_fork_asm
[Symptom]
kthread_worker_fn()
...
repeat:
set_current_state(TASK_INTERRUPTIBLE);
...
if (work) { // false
__set_current_state(TASK_RUNNING);
...
} else if (!freezing(current)) {
schedule();
// after schedule, the state is still *TASK_INTERRUPTIBLE*
}
try_to_freeze()
might_sleep() <--- trigger the warning
[Analysis]
The question is why, after schedule(), the state remains TASK_INTERRUPTIBLE
rather than TASK_RUNNING. The short answer is that someone has incorrectly
picked the TASK_INTERRUPTIBLE task from the tree. The scenario is described
below, and all steps happen on one CPU:
time
|
|
|
v
kthread_worker_fn() <--- t1
set_current_state(TASK_INTERRUPTIBLE)
schedule()
block_task(t1)
dequeue_entity(t1)
t1->sched_delayed = 1
t2 = pick_next_task()
put_prev_task(t1)
enqueue_entity(t1) <--- TASK_INTERRUPTIBLE in the tree
t1 switches to t2
erofs_init_percpu_worker() <--- t2
sched_set_fifo_low(t1)
sched_setscheduler_nocheck(t1)
__sched_setscheduler(t1)
t1->sched_class = &rt_sched_class
check_class_changed(t1)
switched_from_fair(t1)
t1->sched_delayed = 0 <--- gotcha
** from now on, t1 in the tree is TASK_INTERRUPTIBLE **
** and sched_delayed = 0 **
preempt_enable()
preempt_schedule()
t1 = pick_next_task() <--- because sched_delayed = 0, eligible
t2 switches back to t1, now t1 is TASK_INTERRUPTIBLE.
The cause is that switched_from_fair() incorrectly clears the sched_delayed
flag, which confuses pick_next_task() into treating a delayed task as an
eligible task (without dequeuing it).
[Proposal]
In __sched_setscheduler(), when changing the policy of a delayed task, do
not re-enqueue it, so that it cannot be picked again. The side effect is
that the delayed task no longer waits for its 0-vlag time before being
dequeued, but the impact of that should be negligible.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202408161619.9ed8b83e-lkp@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/syscalls.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 4fae3cf25a3a..10859536e509 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -818,7 +818,8 @@ int __sched_setscheduler(struct task_struct *p,
if (oldprio < p->prio)
queue_flags |= ENQUEUE_HEAD;
- enqueue_task(rq, p, queue_flags);
+ if (!p->se.sched_delayed)
+ enqueue_task(rq, p, queue_flags);
}
if (running)
set_next_task(rq, p);
--
2.25.1
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-14 5:53 ` Peter Zijlstra
@ 2024-08-27 9:35 ` Chen Yu
2024-08-27 20:29 ` Valentin Schneider
0 siblings, 1 reply; 277+ messages in thread
From: Chen Yu @ 2024-08-27 9:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Valentin Schneider, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel,
kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 2024-08-14 at 07:53:30 +0200, Peter Zijlstra wrote:
> On Wed, Aug 14, 2024 at 12:07:57AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 13, 2024 at 11:54:21PM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 13, 2024 at 02:43:47PM +0200, Valentin Schneider wrote:
> > > > On 27/07/24 12:27, Peter Zijlstra wrote:
> > > > > @@ -12817,10 +12830,26 @@ static void attach_task_cfs_rq(struct ta
> > > > > static void switched_from_fair(struct rq *rq, struct task_struct *p)
> > > > > {
> > > > > detach_task_cfs_rq(p);
> > > > > + /*
> > > > > + * Since this is called after changing class, this isn't quite right.
> > > > > + * Specifically, this causes the task to get queued in the target class
> > > > > + * and experience a 'spurious' wakeup.
> > > > > + *
> > > > > + * However, since 'spurious' wakeups are harmless, this shouldn't be a
> > > > > + * problem.
> > > > > + */
> > > > > + p->se.sched_delayed = 0;
> > > > > + /*
> > > > > + * While here, also clear the vlag, it makes little sense to carry that
> > > > > + * over the excursion into the new class.
> > > > > + */
> > > > > + p->se.vlag = 0;
> > > >
> > > > RQ lock is held, the task can't be current if it's ->sched_delayed; is a
> > > > dequeue_task() not possible at this point? Or just not worth it?
> > >
> > > Hurmph, I really can't remember why I did it like this :-(
> >
> > Obviously I remember it right after hitting send...
> >
> > We've just done:
> >
> > dequeue_task();
> > p->sched_class = some_other_class;
> > enqueue_task();
> >
> > IOW, we're enqueued as some other class at this point. There is no way
> > we can fix it up at this point.
>
> With just a little more sleep than last night, perhaps you're right
> after all. Yes we're on a different class, but we can *still* dequeue it
> again.
I don't quite get this. If the old class is cfs, the task is in an rb-tree,
and if the new class is rt, the task is in the prio list. I just wonder:
would the rt dequeue corrupt the rb-tree data?
thanks,
Chenyu
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-08-22 15:55 ` Peter Zijlstra
@ 2024-08-27 9:43 ` Hongyan Xia
0 siblings, 0 replies; 277+ messages in thread
From: Hongyan Xia @ 2024-08-27 9:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 22/08/2024 16:55, Peter Zijlstra wrote:
> On Wed, Aug 21, 2024 at 10:46:07AM +0100, Hongyan Xia wrote:
>> Okay, in case the trace I provided isn't clear enough, I traced the crash to
>> a call chain like this:
>>
>> dl_server_start()
>> enqueue_dl_entity()
>> update_stats_enqueue_dl()
>> update_stats_enqueue_sleeper_dl()
>> __schedstats_from_dl_se()
>> dl_task_of() <---------- crash
>>
>> If I undefine CONFIG_SCHEDSTATS, then it boots fine, and I wonder if this is
>> the reason why other people are not seeing this. This is probably not EEVDF
>> but DL refactoring related.
>
> Thanks for the report -- I'll see if I can spot something. Since you
> initially fingered these eevdf patches, could you confirm or deny that
> changing:
>
> kernel/sched/features.h:SCHED_FEAT(DELAY_DEQUEUE, true)
>
> to false, makes any difference in the previously failing case?
Sadly the issue persists. I'm seeing exactly the same backtrace on my
Juno board.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-27 9:35 ` Chen Yu
@ 2024-08-27 20:29 ` Valentin Schneider
2024-08-28 2:55 ` Chen Yu
0 siblings, 1 reply; 277+ messages in thread
From: Valentin Schneider @ 2024-08-27 20:29 UTC (permalink / raw)
To: Chen Yu, Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
On 27/08/24 17:35, Chen Yu wrote:
> On 2024-08-14 at 07:53:30 +0200, Peter Zijlstra wrote:
>> On Wed, Aug 14, 2024 at 12:07:57AM +0200, Peter Zijlstra wrote:
>> > On Tue, Aug 13, 2024 at 11:54:21PM +0200, Peter Zijlstra wrote:
>> >
>> > Obviously I remember it right after hitting send...
>> >
>> > We've just done:
>> >
>> > dequeue_task();
>> > p->sched_class = some_other_class;
>> > enqueue_task();
>> >
>> > IOW, we're enqueued as some other class at this point. There is no way
>> > we can fix it up at this point.
>>
>> With just a little more sleep than last night, perhaps you're right
>> after all. Yes we're on a different class, but we can *still* dequeue it
>> again.
>
> Not quite get this. If the old class is cfs, the task is in a rb-tree. And
> if the new class is rt then the task is in the prio list. Just wonder
> would the rt.dequeue break the data of rb-tree?
>
On a class change, e.g. CFS to RT, __sched_setscheduler() would
dequeue_task() (take it out of the RB tree), change the class, and
enqueue_task() (put it in the RT priolist).
Then check_class_changed()->switched_from_fair() happens and calls
dequeue_task(), which takes it out of the RT priolist again. At least
that's the theory, since that currently explodes...
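To make the ordering concrete, roughly (just a sketch, not the actual
patch; flag names and exact call sites are paraphrased):

	__sched_setscheduler(p)				/* rq lock held */
		dequeue_task(rq, p, DEQUEUE_SAVE);	/* out of the cfs rb-tree,
							   p->se.sched_delayed stays set */
		p->sched_class = &rt_sched_class;
		enqueue_task(rq, p, ENQUEUE_RESTORE);	/* into the rt priolist */
		check_class_changed(rq, p, prev_class, oldprio);
			switched_from_fair(rq, p);
				if (p->se.sched_delayed)	/* proposed extra dequeue */
					dequeue_task(rq, p, DEQUEUE_SLEEP);

i.e. the extra dequeue in switched_from_fair() undoes the enqueue that
__sched_setscheduler() just did in the new class, so a blocked task can
no longer be picked from there.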
> thanks,
> Chenyu
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-27 20:29 ` Valentin Schneider
@ 2024-08-28 2:55 ` Chen Yu
0 siblings, 0 replies; 277+ messages in thread
From: Chen Yu @ 2024-08-28 2:55 UTC (permalink / raw)
To: Valentin Schneider
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, linux-kernel,
kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 2024-08-27 at 22:29:50 +0200, Valentin Schneider wrote:
> On 27/08/24 17:35, Chen Yu wrote:
> > On 2024-08-14 at 07:53:30 +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 14, 2024 at 12:07:57AM +0200, Peter Zijlstra wrote:
> >> > On Tue, Aug 13, 2024 at 11:54:21PM +0200, Peter Zijlstra wrote:
> >> >
> >> > Obviously I remember it right after hitting send...
> >> >
> >> > We've just done:
> >> >
> >> > dequeue_task();
> >> > p->sched_class = some_other_class;
> >> > enqueue_task();
> >> >
> >> > IOW, we're enqueued as some other class at this point. There is no way
> >> > we can fix it up at this point.
> >>
> >> With just a little more sleep than last night, perhaps you're right
> >> after all. Yes we're on a different class, but we can *still* dequeue it
> >> again.
> >
> > Not quite get this. If the old class is cfs, the task is in a rb-tree. And
> > if the new class is rt then the task is in the prio list. Just wonder
> > would the rt.dequeue break the data of rb-tree?
> >
>
> On a class change e.g. CFS to RT, __sched_setscheduler() would
> dequeue_task() (take it out of the RB tree), change the class,
> enqueue_task() (put it in the RT priolist).
>
> Then check_class_changed()->switched_from_fair() happens, dequeue_task(),
> and that takes it out of the RT priolist. At least that's the theory, since
> that currently explodes...
>
I see, thanks for this explanation! I overlooked that the task is already on
the rt priolist. I applied Peter's dequeue patch with a minor fix and did not
see the warning[1] after several test cycles (previously it was 100%
reproducible).
[1] https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/
thanks,
Chenyu
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
2024-08-27 9:17 ` [PATCH 12/24] " Chen Yu
@ 2024-08-28 3:06 ` Chen Yu
0 siblings, 0 replies; 277+ messages in thread
From: Chen Yu @ 2024-08-28 3:06 UTC (permalink / raw)
To: Peter Zijlstra, Valentin Schneider
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, tim.c.chen
On 2024-08-27 at 17:17:20 +0800, Chen Yu wrote:
> On 2024-07-27 at 12:27:44 +0200, Peter Zijlstra wrote:
> > When dequeue_task() is delayed it becomes possible to exit a task (or
> > cgroup) that is still enqueued. Ensure things are dequeued before
> > freeing.
> >
> > NOTE: switched_from_fair() causes spurious wakeups due to clearing
> > sched_delayed after enqueueing a task in another class that should've
> > been dequeued. This *should* be harmless.
> >
>
> It might bring some expected behavior in some corner cases reported here:
> https://lore.kernel.org/lkml/202408161619.9ed8b83e-lkp@intel.com/
> As the block task might return from schedule() with TASK_INTERRUPTIBLE.
>
> We cooked a patch to workaround it(as below).
>
> thanks,
> Chenyu
>
> >From 9251b25073d43aeac04a6ee69b590fbfa1b8e1a5 Mon Sep 17 00:00:00 2001
> From: Chen Yu <yu.c.chen@intel.com>
> Date: Mon, 26 Aug 2024 22:16:38 +0800
> Subject: [PATCH] sched/eevdf: Dequeue the delayed task when changing its
> schedule policy
>
> [Problem Statement]
> The following warning was reported:
>
> do not call blocking ops when !TASK_RUNNING; state=1 set at kthread_worker_fn (kernel/kthread.c:?)
> WARNING: CPU: 1 PID: 674 at kernel/sched/core.c:8469 __might_sleep
>
> handle_bug
> exc_invalid_op
> asm_exc_invalid_op
> __might_sleep
> __might_sleep
> kthread_worker_fn
> kthread_worker_fn
> kthread
> __cfi_kthread_worker_fn
> ret_from_fork
> __cfi_kthread
> ret_from_fork_asm
>
> [Symptom]
> kthread_worker_fn()
> ...
> repeat:
> set_current_state(TASK_INTERRUPTIBLE);
> ...
> if (work) { // false
> __set_current_state(TASK_RUNNING);
> ...
> } else if (!freezing(current)) {
> schedule();
> // after schedule, the state is still *TASK_INTERRUPTIBLE*
> }
>
> try_to_freeze()
> might_sleep() <--- trigger the warning
>
> [Analysis]
> The question is after schedule(), the state remains TASK_INTERRUPTIBLE
> rather than TASK_RUNNING. The short answer is, someone has incorrectly
> picked the TASK_INTERRUPTIBLE task from the tree. The scenario is described
> below, and all steps happen on 1 CPU:
>
> time
> |
> |
> |
> v
>
> kthread_worker_fn() <--- t1
> set_current_state(TASK_INTERRUPTIBLE)
> schedule()
> block_task(t1)
> dequeue_entity(t1)
> t1->sched_delayed = 1
>
> t2 = pick_next_task()
> put_prev_task(t1)
> enqueue_entity(t1) <--- TASK_INTERRUPTIBLE in the tree
>
> t1 switches to t2
>
> erofs_init_percpu_worker() <--- t2
> sched_set_fifo_low(t1)
> sched_setscheduler_nocheck(t1)
>
> __sched_setscheduler(t1)
> t1->sched_class = &rt_sched_class
>
> check_class_changed(t1)
> switched_from_fair(t1)
> t1->sched_delayed = 0 <--- gotcha
>
> ** from now on, t1 in the tree is TASK_INTERRUPTIBLE **
> ** and sched_delayed = 0 **
>
> preempt_enable()
> preempt_schedule()
> t1 = pick_next_task() <--- because sched_delayed = 0, eligible
>
> t2 switches back to t1, now t1 is TASK_INTERRUPTIBLE.
>
> The cause is, switched_from_fair() incorrectly clear the sched_delayed
> flag and confuse the pick_next_task() that it thinks a delayed task is a
> eligible task(without dequeue it).
>
Valentin pointed out that after the requeue, t1 is in the new RT priolist,
so the value of sched_delayed does not matter much. The problem is that the
rt priolist holds a TASK_INTERRUPTIBLE task that the next schedule() can
pick. There is a fix from Peter to dequeue this task in switched_from_fair(),
which addresses the problem. But I think the current proposal saves one extra
enqueue/dequeue operation, no?
thanks,
Chenyu
> [Proposal]
> In the __sched_setscheduler() when trying to change the policy of that
> delayed task, do not re-enqueue the delayed task thus to avoid being
> picked again. The side effect that, the delayed task can not wait for
> its 0-vlag time to be dequeued, but its effect should be neglect.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202408161619.9ed8b83e-lkp@intel.com
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> kernel/sched/syscalls.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index 4fae3cf25a3a..10859536e509 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -818,7 +818,8 @@ int __sched_setscheduler(struct task_struct *p,
> if (oldprio < p->prio)
> queue_flags |= ENQUEUE_HEAD;
>
> - enqueue_task(rq, p, queue_flags);
> + if (!p->se.sched_delayed)
> + enqueue_task(rq, p, queue_flags);
> }
> if (running)
> set_next_task(rq, p);
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
[not found] ` <CGME20240828223802eucas1p16755f4531ed0611dc4871649746ea774@eucas1p1.samsung.com>
@ 2024-08-28 22:38 ` Marek Szyprowski
2024-10-10 2:49 ` Sean Christopherson
0 siblings, 1 reply; 277+ messages in thread
From: Marek Szyprowski @ 2024-08-28 22:38 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 27.07.2024 12:27, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
This patch landed recently in linux-next as commit 152e11f6df29
("sched/fair: Implement delayed dequeue"). In my tests on some of the
ARM 32-bit boards it causes a regression in rtcwake tool behavior - from
time to time this simple call never ends:
# time rtcwake -s 10 -m on
Reverting this commit (together with its compile dependencies) on top of
linux-next fixes this issue. Let me know how I can help debug this issue.
> ---
> kernel/sched/deadline.c | 1
> kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
> kernel/sched/features.h | 9 +++++
> 3 files changed, 81 insertions(+), 11 deletions(-)
>
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2428,7 +2428,6 @@ static struct task_struct *__pick_next_t
> else
> p = dl_se->server_pick_next(dl_se);
> if (!p) {
> - WARN_ON_ONCE(1);
> dl_se->dl_yielded = 1;
> update_curr_dl_se(rq, dl_se, 0);
> goto again;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5379,20 +5379,44 @@ static void clear_buddies(struct cfs_rq
>
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>
> -static void
> +static bool
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> - int action = UPDATE_TG;
> + if (flags & DEQUEUE_DELAYED) {
> + /*
> + * DEQUEUE_DELAYED is typically called from pick_next_entity()
> + * at which point we've already done update_curr() and do not
> + * want to do so again.
> + */
> + SCHED_WARN_ON(!se->sched_delayed);
> + se->sched_delayed = 0;
> + } else {
> + bool sleep = flags & DEQUEUE_SLEEP;
> +
> + /*
> + * DELAY_DEQUEUE relies on spurious wakeups, special task
> + * states must not suffer spurious wakeups, excempt them.
> + */
> + if (flags & DEQUEUE_SPECIAL)
> + sleep = false;
> +
> + SCHED_WARN_ON(sleep && se->sched_delayed);
> + update_curr(cfs_rq);
>
> + if (sched_feat(DELAY_DEQUEUE) && sleep &&
> + !entity_eligible(cfs_rq, se)) {
> + if (cfs_rq->next == se)
> + cfs_rq->next = NULL;
> + se->sched_delayed = 1;
> + return false;
> + }
> + }
> +
> + int action = UPDATE_TG;
> if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> action |= DO_DETACH;
>
> /*
> - * Update run-time statistics of the 'current'.
> - */
> - update_curr(cfs_rq);
> -
> - /*
> * When dequeuing a sched_entity, we must:
> * - Update loads to have both entity and cfs_rq synced with now.
> * - For group_entity, update its runnable_weight to reflect the new
> @@ -5430,6 +5454,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
> if (cfs_rq->nr_running == 0)
> update_idle_cfs_rq_clock_pelt(cfs_rq);
> +
> + return true;
> }
>
> static void
> @@ -5828,11 +5854,21 @@ static bool throttle_cfs_rq(struct cfs_r
> idle_task_delta = cfs_rq->idle_h_nr_running;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> + int flags;
> +
> /* throttled entity or throttle-on-deactivate */
> if (!se->on_rq)
> goto done;
>
> - dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
> + /*
> + * Abuse SPECIAL to avoid delayed dequeue in this instance.
> + * This avoids teaching dequeue_entities() about throttled
> + * entities and keeps things relatively simple.
> + */
> + flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
> + if (se->sched_delayed)
> + flags |= DEQUEUE_DELAYED;
> + dequeue_entity(qcfs_rq, se, flags);
>
> if (cfs_rq_is_idle(group_cfs_rq(se)))
> idle_task_delta = cfs_rq->h_nr_running;
> @@ -6918,6 +6954,7 @@ static int dequeue_entities(struct rq *r
> bool was_sched_idle = sched_idle_rq(rq);
> int rq_h_nr_running = rq->cfs.h_nr_running;
> bool task_sleep = flags & DEQUEUE_SLEEP;
> + bool task_delayed = flags & DEQUEUE_DELAYED;
> struct task_struct *p = NULL;
> int idle_h_nr_running = 0;
> int h_nr_running = 0;
> @@ -6931,7 +6968,13 @@ static int dequeue_entities(struct rq *r
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> - dequeue_entity(cfs_rq, se, flags);
> +
> + if (!dequeue_entity(cfs_rq, se, flags)) {
> + if (p && &p->se == se)
> + return -1;
> +
> + break;
> + }
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> @@ -6956,6 +6999,7 @@ static int dequeue_entities(struct rq *r
> break;
> }
> flags |= DEQUEUE_SLEEP;
> + flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
> }
>
> for_each_sched_entity(se) {
> @@ -6985,6 +7029,17 @@ static int dequeue_entities(struct rq *r
> if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> rq->next_balance = jiffies;
>
> + if (p && task_delayed) {
> + SCHED_WARN_ON(!task_sleep);
> + SCHED_WARN_ON(p->on_rq != 1);
> +
> + /* Fix-up what dequeue_task_fair() skipped */
> + hrtick_update(rq);
> +
> + /* Fix-up what block_task() skipped. */
> + __block_task(rq, p);
> + }
> +
> return 1;
> }
> /*
> @@ -6996,8 +7051,10 @@ static bool dequeue_task_fair(struct rq
> {
> util_est_dequeue(&rq->cfs, p);
>
> - if (dequeue_entities(rq, &p->se, flags) < 0)
> + if (dequeue_entities(rq, &p->se, flags) < 0) {
> + util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
> return false;
> + }
>
> util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> hrtick_update(rq);
> @@ -12973,6 +13030,11 @@ static void set_next_task_fair(struct rq
> /* ensure bandwidth has been allocated on our new cfs_rq */
> account_cfs_rq_runtime(cfs_rq, 0);
> }
> +
> + if (!first)
> + return;
> +
> + SCHED_WARN_ON(se->sched_delayed);
> }
>
> void init_cfs_rq(struct cfs_rq *cfs_rq)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -29,6 +29,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
> SCHED_FEAT(CACHE_HOT_BUDDY, true)
>
> /*
> + * Delay dequeueing tasks until they get selected or woken.
> + *
> + * By delaying the dequeue for non-eligible tasks, they remain in the
> + * competition and can burn off their negative lag. When they get selected
> + * they'll have positive lag by definition.
> + */
> +SCHED_FEAT(DELAY_DEQUEUE, true)
> +
> +/*
> * Allow wakeup-time preemption of the current task:
> */
> SCHED_FEAT(WAKEUP_PREEMPTION, true)
>
>
>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-22 14:58 ` Vincent Guittot
@ 2024-08-29 15:42 ` Hongyan Xia
2024-09-05 13:02 ` Dietmar Eggemann
0 siblings, 1 reply; 277+ messages in thread
From: Hongyan Xia @ 2024-08-29 15:42 UTC (permalink / raw)
To: Vincent Guittot, Luis Machado
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 22/08/2024 15:58, Vincent Guittot wrote:
> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>>
>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
>>>
>>> Vincent,
>>>
>>> On 8/22/24 11:28, Luis Machado wrote:
>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com> wrote:
>>>>>>
>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> Sorry for bombarding this thread in the last couple of days. I'm seeing
>>>>>>>> several issues in the latest tip/sched/core after these patches landed.
>>>>>>>>
>>>>>>>> What I'm now seeing seems to be an unbalanced call of util_est. First, I applied
>>>>>>>
>>>>>>> I also see a remaining util_est for idle rq because of an unbalance
>>>>>>> call of util_est_enqueue|dequeue
>>>>>>>
>>>>>>
>>>>>> I can confirm issues with the utilization values and frequencies being driven
>>>>>> seemingly incorrectly, in particular for little cores.
>>>>>>
>>>>>> What I'm seeing with the stock series is high utilization values for some tasks
>>>>>> and little cores having their frequencies maxed out for extended periods of
>>>>>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the cores are mostly
>>>>>> idle. But whenever certain tasks get scheduled there, they have a very high util
>>>>>> level and so the frequency is kept at max.
>>>>>>
>>>>>> As a consequence this drives up power usage.
>>>>>>
>>>>>> I gave Hongyan's draft fix a try and observed a much more reasonable behavior for
>>>>>> the util numbers and frequencies being used for the little cores. With his fix,
>>>>>> I can also see lower energy use for my specific benchmark.
>>>>>
>>>>> The main problem is that the util_est of a delayed dequeued task
>>>>> remains on the rq and keeps the rq utilization high and as a result
>>>>> the frequency higher than needed.
>>>>>
>>>>> The below seems to works for me and keep sync the enqueue/dequeue of
>>>>> uti_test with the enqueue/dequeue of the task as if de dequeue was not
>>>>> delayed
>>>>>
>>>>> Another interest is that we will not try to migrate a delayed dequeue
>>>>> sleeping task that doesn't actually impact the current load of the cpu
>>>>> and as a result will not help in the load balance. I haven't yet fully
>>>>> checked what would happen with hotplug
>>>>
>>>> Thanks. Those are good points. Let me go and try your patch.
>>>
>>> I gave your fix a try, but it seems to make things worse. It is comparable
>>> to the behavior we had before Peter added the uclamp imbalance fix, so I
>>> believe there is something incorrect there.
>>
>> we need to filter case where task are enqueued/dequeued several
>> consecutive times. That's what I'm look now
>
> I just realize before that It's not only util_est but the h_nr_running
> that keeps delayed tasks as well so all stats of the rq are biased:
> h_nr_running, util_est, runnable avg and load_avg.
After staring at the code even more, I think the situation is worse.
First, uclamp might also want to join these stats (h_nr_running,
util_est, runnable_avg, load_avg) that do not follow delayed dequeue
and therefore need the same special handling. The current way of
handling uclamp in core.c misses the frequency update, like I commented
before.
Second, there is also an out-of-sync issue in update_load_avg(). We only
update the task-level se in delayed dequeue and requeue, but we return
early and the upper levels are completely skipped, as if the delayed
task is still on rq. This de-sync is wrong.
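To illustrate what I mean (a sketch, not code from the series): a normal
enqueue/dequeue walks the whole hierarchy, something like

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		update_load_avg(cfs_rq, se, UPDATE_TG);
		update_cfs_group(se);
	}

whereas, as I read it, the delayed dequeue/requeue path only updates the
task-level se and returns early, so the parent cfs_rq levels never see
that update.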
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (27 preceding siblings ...)
2024-08-20 16:43 ` Hongyan Xia
@ 2024-08-29 17:02 ` Aleksandr Nogikh
2024-09-10 11:45 ` Sven Schnelle
` (2 subsequent siblings)
31 siblings, 0 replies; 277+ messages in thread
From: Aleksandr Nogikh @ 2024-08-29 17:02 UTC (permalink / raw)
To: peterz
Cc: bsegall, dietmar.eggemann, efault, juri.lelli, kprateek.nayak,
linux-kernel, mgorman, mingo, rostedt, tglx, vincent.guittot,
vschneid, wuyun.abel, youssefesmat, syzkaller-bugs, dvyukov,
syzkaller
This series has caused an explosion of different kernel crashes on our
syzbot instance that fuzzes linux-next. I guess such kernel behavior
indicates some massive underlying memory corruption (?)
Some of the crash titles we've seen (we didn't release them -- there
were too many, 70+):
KASAN: stack-out-of-bounds Write in insn_decode
kernel panic: stack is corrupted in vprintk_store
kernel panic: stack is corrupted in _printk
BUG: spinlock recursion in __schedule
WARNING in __put_task_struct
BUG: unable to handle kernel NULL pointer dereference in asm_exc_page_fault
WARNING in rng_dev_read
BUG: scheduling while atomic in prb_final_commit
kernel BUG in dequeue_rt_stack
BUG: scheduling while atomic in rcu_is_watching
BUG: spinlock recursion in copy_process
KASAN: slab-use-after-free Read in sched_core_enqueue
kernel panic: stack is corrupted in refill_stock
kernel panic: stack is corrupted in prb_reserve
WARNING: bad unlock balance in timekeeping_get_ns
KASAN: slab-use-after-free Read in set_next_task_fair
I wonder if the actual problem is already known, and whether there are
perhaps even some fix patches?
If not, and if it would be of any help, we can try to come up with some
contained instructions to reproduce these issues with syzkaller.
--
Aleksandr
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-08-29 15:42 ` Hongyan Xia
@ 2024-09-05 13:02 ` Dietmar Eggemann
2024-09-05 13:33 ` Vincent Guittot
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-09-05 13:02 UTC (permalink / raw)
To: Hongyan Xia, Vincent Guittot, Luis Machado
Cc: Peter Zijlstra, mingo, juri.lelli, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx, efault
On 29/08/2024 17:42, Hongyan Xia wrote:
> On 22/08/2024 15:58, Vincent Guittot wrote:
>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>> <vincent.guittot@linaro.org> wrote:
>>>
>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
>>>>
>>>> Vincent,
>>>>
>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> Sorry for bombarding this thread in the last couple of days.
>>>>>>>>> I'm seeing
>>>>>>>>> several issues in the latest tip/sched/core after these patches
>>>>>>>>> landed.
>>>>>>>>>
>>>>>>>>> What I'm now seeing seems to be an unbalanced call of util_est.
>>>>>>>>> First, I applied
>>>>>>>>
>>>>>>>> I also see a remaining util_est for idle rq because of an unbalance
>>>>>>>> call of util_est_enqueue|dequeue
>>>>>>>>
>>>>>>>
>>>>>>> I can confirm issues with the utilization values and frequencies
>>>>>>> being driven
>>>>>>> seemingly incorrectly, in particular for little cores.
>>>>>>>
>>>>>>> What I'm seeing with the stock series is high utilization values
>>>>>>> for some tasks
>>>>>>> and little cores having their frequencies maxed out for extended
>>>>>>> periods of
>>>>>>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the
>>>>>>> cores are mostly
>>>>>>> idle. But whenever certain tasks get scheduled there, they have a
>>>>>>> very high util
>>>>>>> level and so the frequency is kept at max.
>>>>>>>
>>>>>>> As a consequence this drives up power usage.
>>>>>>>
>>>>>>> I gave Hongyan's draft fix a try and observed a much more
>>>>>>> reasonable behavior for
>>>>>>> the util numbers and frequencies being used for the little cores.
>>>>>>> With his fix,
>>>>>>> I can also see lower energy use for my specific benchmark.
>>>>>>
>>>>>> The main problem is that the util_est of a delayed dequeued task
>>>>>> remains on the rq and keeps the rq utilization high and as a result
>>>>>> the frequency higher than needed.
>>>>>>
>>>>>> The below seems to works for me and keep sync the enqueue/dequeue of
>>>>>> uti_test with the enqueue/dequeue of the task as if de dequeue was
>>>>>> not
>>>>>> delayed
>>>>>>
>>>>>> Another interest is that we will not try to migrate a delayed dequeue
>>>>>> sleeping task that doesn't actually impact the current load of the
>>>>>> cpu
>>>>>> and as a result will not help in the load balance. I haven't yet
>>>>>> fully
>>>>>> checked what would happen with hotplug
>>>>>
>>>>> Thanks. Those are good points. Let me go and try your patch.
>>>>
>>>> I gave your fix a try, but it seems to make things worse. It is
>>>> comparable
>>>> to the behavior we had before Peter added the uclamp imbalance fix,
>>>> so I
>>>> believe there is something incorrect there.
>>>
>>> we need to filter case where task are enqueued/dequeued several
>>> consecutive times. That's what I'm look now
>>
>> I just realize before that It's not only util_est but the h_nr_running
>> that keeps delayed tasks as well so all stats of the rq are biased:
>> h_nr_running, util_est, runnable avg and load_avg.
>
> After staring at the code even more, I think the situation is worse.
>
> First thing is that uclamp might also want to be part of these stats
> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
> delayed dequeue which needs to be specially handled in the same way. The
> current way of handling uclamp in core.c misses the frequency update,
> like I commented before.
>
> Second, there is also an out-of-sync issue in update_load_avg(). We only
> update the task-level se in delayed dequeue and requeue, but we return
> early and the upper levels are completely skipped, as if the delayed
> task is still on rq. This de-sync is wrong.
I had a look at the util_est issue.
This keeps rq->cfs.avg.util_avg sane for me with
SCHED_FEAT(DELAY_DEQUEUE, true):
-->8--
From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Thu, 5 Sep 2024 00:05:23 +0200
Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
Remove delayed tasks from util_est even though they are runnable.
Exclude delayed tasks which are (a) migrating between rqs or (b) in a
SAVE/RESTORE dequeue/enqueue.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
kernel/sched/fair.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e693ca8ebd6..5c32cc26d6c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int rq_h_nr_running = rq->cfs.h_nr_running;
 	u64 slice = 0;
 
-	if (flags & ENQUEUE_DELAYED) {
-		requeue_delayed_entity(se);
-		return;
-	}
-
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
 	 * Let's add the task's estimated utilization to the cfs_rq's
 	 * estimated utilization, before we update schedutil.
 	 */
-	util_est_enqueue(&rq->cfs, p);
+	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
+		util_est_enqueue(&rq->cfs, p);
+
+	if (flags & ENQUEUE_DELAYED) {
+		requeue_delayed_entity(se);
+		return;
+	}
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
@@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	util_est_dequeue(&rq->cfs, p);
+	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
+		util_est_dequeue(&rq->cfs, p);
 
 	if (dequeue_entities(rq, &p->se, flags) < 0) {
 		if (!rq->cfs.h_nr_running)
--
2.34.1
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 13:02 ` Dietmar Eggemann
@ 2024-09-05 13:33 ` Vincent Guittot
2024-09-05 14:07 ` Dietmar Eggemann
2024-09-05 14:18 ` Peter Zijlstra
2024-09-10 8:09 ` [tip: sched/core] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE tip-bot2 for Dietmar Eggemann
2 siblings, 1 reply; 277+ messages in thread
From: Vincent Guittot @ 2024-09-05 13:33 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Hongyan Xia, Luis Machado, Peter Zijlstra, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 29/08/2024 17:42, Hongyan Xia wrote:
> > On 22/08/2024 15:58, Vincent Guittot wrote:
> >> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
> >> <vincent.guittot@linaro.org> wrote:
> >>>
> >>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
> >>>>
> >>>> Vincent,
> >>>>
> >>>> On 8/22/24 11:28, Luis Machado wrote:
> >>>>> On 8/22/24 10:53, Vincent Guittot wrote:
> >>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Peter,
> >>>>>>>>>
> >>>>>>>>> Sorry for bombarding this thread in the last couple of days.
> >>>>>>>>> I'm seeing
> >>>>>>>>> several issues in the latest tip/sched/core after these patches
> >>>>>>>>> landed.
> >>>>>>>>>
> >>>>>>>>> What I'm now seeing seems to be an unbalanced call of util_est.
> >>>>>>>>> First, I applied
> >>>>>>>>
> >>>>>>>> I also see a remaining util_est for idle rq because of an unbalance
> >>>>>>>> call of util_est_enqueue|dequeue
> >>>>>>>>
> >>>>>>>
> >>>>>>> I can confirm issues with the utilization values and frequencies
> >>>>>>> being driven
> >>>>>>> seemingly incorrectly, in particular for little cores.
> >>>>>>>
> >>>>>>> What I'm seeing with the stock series is high utilization values
> >>>>>>> for some tasks
> >>>>>>> and little cores having their frequencies maxed out for extended
> >>>>>>> periods of
> >>>>>>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the
> >>>>>>> cores are mostly
> >>>>>>> idle. But whenever certain tasks get scheduled there, they have a
> >>>>>>> very high util
> >>>>>>> level and so the frequency is kept at max.
> >>>>>>>
> >>>>>>> As a consequence this drives up power usage.
> >>>>>>>
> >>>>>>> I gave Hongyan's draft fix a try and observed a much more
> >>>>>>> reasonable behavior for
> >>>>>>> the util numbers and frequencies being used for the little cores.
> >>>>>>> With his fix,
> >>>>>>> I can also see lower energy use for my specific benchmark.
> >>>>>>
> >>>>>> The main problem is that the util_est of a delayed dequeued task
> >>>>>> remains on the rq and keeps the rq utilization high and as a result
> >>>>>> the frequency higher than needed.
> >>>>>>
> >>>>>> The below seems to works for me and keep sync the enqueue/dequeue of
> >>>>>> uti_test with the enqueue/dequeue of the task as if de dequeue was
> >>>>>> not
> >>>>>> delayed
> >>>>>>
> >>>>>> Another interest is that we will not try to migrate a delayed dequeue
> >>>>>> sleeping task that doesn't actually impact the current load of the
> >>>>>> cpu
> >>>>>> and as a result will not help in the load balance. I haven't yet
> >>>>>> fully
> >>>>>> checked what would happen with hotplug
> >>>>>
> >>>>> Thanks. Those are good points. Let me go and try your patch.
> >>>>
> >>>> I gave your fix a try, but it seems to make things worse. It is
> >>>> comparable
> >>>> to the behavior we had before Peter added the uclamp imbalance fix,
> >>>> so I
> >>>> believe there is something incorrect there.
> >>>
> >>> we need to filter case where task are enqueued/dequeued several
> >>> consecutive times. That's what I'm look now
> >>
> >> I just realize before that It's not only util_est but the h_nr_running
> >> that keeps delayed tasks as well so all stats of the rq are biased:
> >> h_nr_running, util_est, runnable avg and load_avg.
> >
> > After staring at the code even more, I think the situation is worse.
> >
> > First thing is that uclamp might also want to be part of these stats
> > (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
> > delayed dequeue which needs to be specially handled in the same way. The
> > current way of handling uclamp in core.c misses the frequency update,
> > like I commented before.
> >
> > Second, there is also an out-of-sync issue in update_load_avg(). We only
> > update the task-level se in delayed dequeue and requeue, but we return
> > early and the upper levels are completely skipped, as if the delayed
> > task is still on rq. This de-sync is wrong.
>
> I had a look at the util_est issue.
>
> This keeps rq->cfs.avg.util_avg sane for me with
> SCHED_FEAT(DELAY_DEQUEUE, true):
>
> -->8--
>
> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Thu, 5 Sep 2024 00:05:23 +0200
> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>
> Remove delayed tasks from util_est even they are runnable.
Unfortunately, this is not only about util_est.
cfs_rq's runnable_avg is also wrong, because we normally have:
  cfs_rq's runnable_avg == \Sum se's runnable_avg
but cfs_rq's runnable_avg uses cfs_rq's h_nr_running, and delayed
entities are still accounted in h_nr_running.
That also means that cfs_rq's h_nr_running is no longer accurate,
because it includes delayed-dequeue tasks, and cfs_rq's load_avg is
kept artificially high, which biases load balancing and cgroup shares.
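For reference, the cfs_rq side of the PELT update feeds h_nr_running in
as the runnable contribution (kernel/sched/pelt.c, quoted roughly from
memory), so anything left in h_nr_running inflates runnable_avg:

	int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
	{
		if (___update_load_sum(now, &cfs_rq->avg,
					scale_load_down(cfs_rq->load.weight),
					cfs_rq->h_nr_running,	/* <- delayed tasks still counted */
					cfs_rq->curr != NULL)) {
			___update_load_avg(&cfs_rq->avg, 1);
			trace_pelt_cfs_tp(cfs_rq);
			return 1;
		}
		return 0;
	}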
>
> Exclude delayed task which are (a) migrating between rq's or (b) in a
> SAVE/RESTORE dequeue/enqueue.
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
> kernel/sched/fair.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e693ca8ebd6..5c32cc26d6c2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
>
> - if (flags & ENQUEUE_DELAYED) {
> - requeue_delayed_entity(se);
> - return;
> - }
> -
> /*
> * The code below (indirectly) updates schedutil which looks at
> * the cfs_rq utilization to select a frequency.
> * Let's add the task's estimated utilization to the cfs_rq's
> * estimated utilization, before we update schedutil.
> */
> - util_est_enqueue(&rq->cfs, p);
> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
> + util_est_enqueue(&rq->cfs, p);
> +
> + if (flags & ENQUEUE_DELAYED) {
> + requeue_delayed_entity(se);
> + return;
> + }
>
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> - util_est_dequeue(&rq->cfs, p);
> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> + util_est_dequeue(&rq->cfs, p);
>
> if (dequeue_entities(rq, &p->se, flags) < 0) {
> if (!rq->cfs.h_nr_running)
> --
> 2.34.1
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 13:33 ` Vincent Guittot
@ 2024-09-05 14:07 ` Dietmar Eggemann
2024-09-05 14:29 ` Vincent Guittot
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-09-05 14:07 UTC (permalink / raw)
To: Vincent Guittot
Cc: Hongyan Xia, Luis Machado, Peter Zijlstra, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 05/09/2024 15:33, Vincent Guittot wrote:
> On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 29/08/2024 17:42, Hongyan Xia wrote:
>>> On 22/08/2024 15:58, Vincent Guittot wrote:
>>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>>>> <vincent.guittot@linaro.org> wrote:
>>>>>
>>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
>>>>>>
>>>>>> Vincent,
>>>>>>
>>>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com>
[...]
>>> After staring at the code even more, I think the situation is worse.
>>>
>>> First thing is that uclamp might also want to be part of these stats
>>> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
>>> delayed dequeue which needs to be specially handled in the same way. The
>>> current way of handling uclamp in core.c misses the frequency update,
>>> like I commented before.
>>>
>>> Second, there is also an out-of-sync issue in update_load_avg(). We only
>>> update the task-level se in delayed dequeue and requeue, but we return
>>> early and the upper levels are completely skipped, as if the delayed
>>> task is still on rq. This de-sync is wrong.
>>
>> I had a look at the util_est issue.
>>
>> This keeps rq->cfs.avg.util_avg sane for me with
>> SCHED_FEAT(DELAY_DEQUEUE, true):
>>
>> -->8--
>>
>> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Date: Thu, 5 Sep 2024 00:05:23 +0200
>> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>>
>> Remove delayed tasks from util_est even they are runnable.
>
> Unfortunately, this is not only about util_est
>
> cfs_rq's runnable_avg is also wrong because we normally have :
> cfs_rq's runnable_avg == /Sum se's runnable_avg
> but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> entities are still accounted in h_nr_running
Yes, I agree.
se's runnable_avg should be fine already since:
se_runnable()
    if (se->sched_delayed)
        return false
But then, like you said, __update_load_avg_cfs_rq() needs correct
cfs_rq->h_nr_running.
And I guess we need something like:
se_on_rq()
    if (se->sched_delayed)
        return false
for
__update_load_avg_se()
-    if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
+    if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
My hope was that we could fix util_est independently, since it drives
CPU frequency, whereas PELT load_avg and runnable_avg are "only" used
for load balancing. But I agree, it has to be fixed as well.
> That also means that cfs_rq's h_nr_running is not accurate anymore
> because it includes delayed dequeue
+1
> and cfs_rq load_avg is kept artificially high which biases
> load_balance and cgroup's shares
+1
>> Exclude delayed task which are (a) migrating between rq's or (b) in a
>> SAVE/RESTORE dequeue/enqueue.
I just realized that this fixes the uneven util_est_dequeue/enqueue
calls, so we no longer see the underflow Hongyan described, nor a
massive rq->cfs util_est due to missing util_est_dequeue() calls.
But delayed tasks are still part of rq->cfs util_est, not excluded. Let
me fix that.
>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> ---
>> kernel/sched/fair.c | 16 +++++++++-------
>> 1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1e693ca8ebd6..5c32cc26d6c2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> int rq_h_nr_running = rq->cfs.h_nr_running;
>> u64 slice = 0;
>>
>> - if (flags & ENQUEUE_DELAYED) {
>> - requeue_delayed_entity(se);
>> - return;
>> - }
>> -
>> /*
>> * The code below (indirectly) updates schedutil which looks at
>> * the cfs_rq utilization to select a frequency.
>> * Let's add the task's estimated utilization to the cfs_rq's
>> * estimated utilization, before we update schedutil.
>> */
>> - util_est_enqueue(&rq->cfs, p);
>> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
>> + util_est_enqueue(&rq->cfs, p);
>> +
>> + if (flags & ENQUEUE_DELAYED) {
>> + requeue_delayed_entity(se);
>> + return;
>> + }
>>
>> /*
>> * If in_iowait is set, the code below may not trigger any cpufreq
>> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>> */
>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> {
>> - util_est_dequeue(&rq->cfs, p);
>> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>> + util_est_dequeue(&rq->cfs, p);
>>
>> if (dequeue_entities(rq, &p->se, flags) < 0) {
>> if (!rq->cfs.h_nr_running)
>> --
>> 2.34.1
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 13:02 ` Dietmar Eggemann
2024-09-05 13:33 ` Vincent Guittot
@ 2024-09-05 14:18 ` Peter Zijlstra
2024-09-10 8:09 ` [tip: sched/core] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE tip-bot2 for Dietmar Eggemann
2 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-05 14:18 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Hongyan Xia, Vincent Guittot, Luis Machado, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, Sep 05, 2024 at 03:02:44PM +0200, Dietmar Eggemann wrote:
> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Thu, 5 Sep 2024 00:05:23 +0200
> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>
> Remove delayed tasks from util_est even they are runnable.
>
> Exclude delayed task which are (a) migrating between rq's or (b) in a
> SAVE/RESTORE dequeue/enqueue.
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
> kernel/sched/fair.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1e693ca8ebd6..5c32cc26d6c2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
>
> - if (flags & ENQUEUE_DELAYED) {
> - requeue_delayed_entity(se);
> - return;
> - }
> -
> /*
> * The code below (indirectly) updates schedutil which looks at
> * the cfs_rq utilization to select a frequency.
> * Let's add the task's estimated utilization to the cfs_rq's
> * estimated utilization, before we update schedutil.
> */
> - util_est_enqueue(&rq->cfs, p);
> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
> + util_est_enqueue(&rq->cfs, p);
> +
> + if (flags & ENQUEUE_DELAYED) {
> + requeue_delayed_entity(se);
> + return;
> + }
>
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> - util_est_dequeue(&rq->cfs, p);
> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> + util_est_dequeue(&rq->cfs, p);
>
> if (dequeue_entities(rq, &p->se, flags) < 0) {
> if (!rq->cfs.h_nr_running)
Thanks!
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 14:07 ` Dietmar Eggemann
@ 2024-09-05 14:29 ` Vincent Guittot
2024-09-05 14:50 ` Dietmar Eggemann
2024-09-05 14:53 ` Peter Zijlstra
2024-09-06 9:55 ` Dietmar Eggemann
2 siblings, 1 reply; 277+ messages in thread
From: Vincent Guittot @ 2024-09-05 14:29 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Hongyan Xia, Luis Machado, Peter Zijlstra, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 5 Sept 2024 at 16:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 05/09/2024 15:33, Vincent Guittot wrote:
> > On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 29/08/2024 17:42, Hongyan Xia wrote:
> >>> On 22/08/2024 15:58, Vincent Guittot wrote:
> >>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
> >>>> <vincent.guittot@linaro.org> wrote:
> >>>>>
> >>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
> >>>>>>
> >>>>>> Vincent,
> >>>>>>
> >>>>>> On 8/22/24 11:28, Luis Machado wrote:
> >>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
> >>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com>
>
> [...]
>
> >>> After staring at the code even more, I think the situation is worse.
> >>>
> >>> First thing is that uclamp might also want to be part of these stats
> >>> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
> >>> delayed dequeue which needs to be specially handled in the same way. The
> >>> current way of handling uclamp in core.c misses the frequency update,
> >>> like I commented before.
> >>>
> >>> Second, there is also an out-of-sync issue in update_load_avg(). We only
> >>> update the task-level se in delayed dequeue and requeue, but we return
> >>> early and the upper levels are completely skipped, as if the delayed
> >>> task is still on rq. This de-sync is wrong.
> >>
> >> I had a look at the util_est issue.
> >>
> >> This keeps rq->cfs.avg.util_avg sane for me with
> >> SCHED_FEAT(DELAY_DEQUEUE, true):
> >>
> >> -->8--
> >>
> >> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
> >> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> Date: Thu, 5 Sep 2024 00:05:23 +0200
> >> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
> >>
> >> Remove delayed tasks from util_est even they are runnable.
> >
> > Unfortunately, this is not only about util_est
> >
> > cfs_rq's runnable_avg is also wrong because we normally have :
> > cfs_rq's runnable_avg == /Sum se's runnable_avg
> > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> > entities are still accounted in h_nr_running
>
> Yes, I agree.
>
> se's runnable_avg should be fine already since:
>
> se_runnable()
>
> if (se->sched_delayed)
> return false
>
> But then, like you said, __update_load_avg_cfs_rq() needs correct
> cfs_rq->h_nr_running.
>
> And I guess we need something like:
>
> se_on_rq()
>
> if (se->sched_delayed)
> return false
>
> for
>
> __update_load_avg_se()
>
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
>
>
> My hope was we can fix util_est independently since it drives CPU
> frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> load balancing. But I agree, it has to be fixed as well.
runnable_avg is also used for frequency selection
>
> > That also means that cfs_rq's h_nr_running is not accurate anymore
> > because it includes delayed dequeue
>
> +1
>
> > and cfs_rq load_avg is kept artificially high which biases
> > load_balance and cgroup's shares
>
> +1
>
> >> Exclude delayed task which are (a) migrating between rq's or (b) in a
> >> SAVE/RESTORE dequeue/enqueue.
>
> I just realized that this fixes the uneven util_est_dequeue/enqueue
> calls so we don't see the underflow depicted by Hongyan and no massive
> rq->cfs util_est due to missing ue_dequeues.
> But delayed tasks are part of rq->cfs util_est, not excluded. Let me fix
> that.
>
> >> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> ---
> >> kernel/sched/fair.c | 16 +++++++++-------
> >> 1 file changed, 9 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 1e693ca8ebd6..5c32cc26d6c2 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> int rq_h_nr_running = rq->cfs.h_nr_running;
> >> u64 slice = 0;
> >>
> >> - if (flags & ENQUEUE_DELAYED) {
> >> - requeue_delayed_entity(se);
> >> - return;
> >> - }
> >> -
> >> /*
> >> * The code below (indirectly) updates schedutil which looks at
> >> * the cfs_rq utilization to select a frequency.
> >> * Let's add the task's estimated utilization to the cfs_rq's
> >> * estimated utilization, before we update schedutil.
> >> */
> >> - util_est_enqueue(&rq->cfs, p);
> >> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
> >> + util_est_enqueue(&rq->cfs, p);
> >> +
> >> + if (flags & ENQUEUE_DELAYED) {
> >> + requeue_delayed_entity(se);
> >> + return;
> >> + }
> >>
> >> /*
> >> * If in_iowait is set, the code below may not trigger any cpufreq
> >> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >> */
> >> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> {
> >> - util_est_dequeue(&rq->cfs, p);
> >> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> >> + util_est_dequeue(&rq->cfs, p);
> >>
> >> if (dequeue_entities(rq, &p->se, flags) < 0) {
> >> if (!rq->cfs.h_nr_running)
> >> --
> >> 2.34.1
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 14:29 ` Vincent Guittot
@ 2024-09-05 14:50 ` Dietmar Eggemann
0 siblings, 0 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-09-05 14:50 UTC (permalink / raw)
To: Vincent Guittot
Cc: Hongyan Xia, Luis Machado, Peter Zijlstra, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 05/09/2024 16:29, Vincent Guittot wrote:
> On Thu, 5 Sept 2024 at 16:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 05/09/2024 15:33, Vincent Guittot wrote:
>>> On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 29/08/2024 17:42, Hongyan Xia wrote:
>>>>> On 22/08/2024 15:58, Vincent Guittot wrote:
>>>>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>>>>>> <vincent.guittot@linaro.org> wrote:
>>>>>>>
>>>>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
>>>>>>>>
>>>>>>>> Vincent,
>>>>>>>>
>>>>>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com>
[...]
>> My hope was we can fix util_est independently since it drives CPU
>> frequency. Whereas PELT load_avg and runnable_avg are "only" used for
>> load balancing. But I agree, it has to be fixed as well.
>
> runnable_avg is also used for frequency selection
Ah, yes. So we would need proper cfs_rq->h_nr_running accounting as well.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 14:07 ` Dietmar Eggemann
2024-09-05 14:29 ` Vincent Guittot
@ 2024-09-05 14:53 ` Peter Zijlstra
2024-09-06 6:14 ` Vincent Guittot
2024-09-06 10:45 ` Peter Zijlstra
2024-09-06 9:55 ` Dietmar Eggemann
2 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-05 14:53 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Vincent Guittot, Hongyan Xia, Luis Machado, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, Sep 05, 2024 at 04:07:01PM +0200, Dietmar Eggemann wrote:
> > Unfortunately, this is not only about util_est
> >
> > cfs_rq's runnable_avg is also wrong because we normally have :
> > cfs_rq's runnable_avg == /Sum se's runnable_avg
> > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> > entities are still accounted in h_nr_running
>
> Yes, I agree.
>
> se's runnable_avg should be fine already since:
>
> se_runnable()
>
> if (se->sched_delayed)
> return false
>
> But then, like you said, __update_load_avg_cfs_rq() needs correct
> cfs_rq->h_nr_running.
Uff. So yes __update_load_avg_cfs_rq() needs a different number, but
I'll contest that h_nr_running is in fact correct, albeit no longer
suitable for this purpose.
We can track h_nr_delayed I suppose, and subtract that.
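Something like this, as a sketch only (h_nr_delayed doesn't exist at
this point; the hook points are hand-waved and don't show the cgroup
hierarchy propagation):

	/* struct cfs_rq */
	unsigned int		h_nr_running;
+	unsigned int		h_nr_delayed;	/* subset of h_nr_running */

	/* bump/drop wherever sched_delayed flips */
	se->sched_delayed = 1;
+	cfs_rq->h_nr_delayed++;
	...
	se->sched_delayed = 0;
+	cfs_rq->h_nr_delayed--;

	/* and feed PELT the difference instead of raw h_nr_running */
	___update_load_sum(now, &cfs_rq->avg,
			   scale_load_down(cfs_rq->load.weight),
-			   cfs_rq->h_nr_running,
+			   cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
			   cfs_rq->curr != NULL);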
> And I guess we need something like:
>
> se_on_rq()
>
> if (se->sched_delayed)
> return false
>
> for
>
> __update_load_avg_se()
>
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
>
>
> My hope was we can fix util_est independently since it drives CPU
> frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> load balancing. But I agree, it has to be fixed as well.
>
> > That also means that cfs_rq's h_nr_running is not accurate anymore
> > because it includes delayed dequeue
>
> +1
>
> > and cfs_rq load_avg is kept artificially high which biases
> > load_balance and cgroup's shares
>
> +1
Again, fundamentally the delayed tasks are delayed because they need to
remain part of the competition in order to 'earn' time. It really is
fully on_rq, and should be for the purpose of load and load-balancing.
It is only special in that it will never run again (until it gets
woken).
Consider (2 CPUs, 4 tasks):
CPU1 CPU2
A D
B (delayed)
C
Then migrating any one of the tasks on CPU1 to CPU2 will make them all
earn time at 1/2 instead of 1/3 vs 1/1. More fair etc.
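(Spelling out the arithmetic: counting B, CPU1 has 3 competing entities and
CPU2 has 1, so the per-task share is 1/3 vs 1/1; once the balancer moves one
of them over it is 2 vs 2 and every task gets 1/2 -- exactly the ideal of 4
tasks on 2 CPUs.)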
Yes, I realize this might seem weird, but we're going to be getting a
ton more of this weirdness once proxy execution lands, then we'll be
having the entire block chain still on the runqueue (and actually
consuming time).
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 14:53 ` Peter Zijlstra
@ 2024-09-06 6:14 ` Vincent Guittot
2024-09-06 10:45 ` Peter Zijlstra
1 sibling, 0 replies; 277+ messages in thread
From: Vincent Guittot @ 2024-09-06 6:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dietmar Eggemann, Hongyan Xia, Luis Machado, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 5 Sept 2024 at 16:54, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 05, 2024 at 04:07:01PM +0200, Dietmar Eggemann wrote:
>
> > > Unfortunately, this is not only about util_est
> > >
> > > cfs_rq's runnable_avg is also wrong because we normally have :
> > > cfs_rq's runnable_avg == /Sum se's runnable_avg
> > > but cfs_rq's runnable_avg uses cfs_rq's h_nr_running but delayed
> > > entities are still accounted in h_nr_running
> >
> > Yes, I agree.
> >
> > se's runnable_avg should be fine already since:
> >
> > se_runnable()
> >
> > if (se->sched_delayed)
> > return false
> >
> > But then, like you said, __update_load_avg_cfs_rq() needs correct
> > cfs_rq->h_nr_running.
>
> Uff. So yes __update_load_avg_cfs_rq() needs a different number, but
> I'll contest that h_nr_running is in fact correct, albeit no longer
> suitable for this purpose.
AFAICT, delayed-dequeue tasks are there only to consume their negative
lag and don't want to run in any case. So I keep thinking that they
should not be counted in h_nr_running nor in runnable or load. They
only need to stay in the cfs_rq's rb tree to consume that negative lag,
and to keep their weight in cfs_rq->avg_load -- which has nothing to do
with the PELT load -- so that the vruntime slope stays fair.
>
> We can track h_nr_delayed I suppose, and subtract that.
>
> > And I guess we need something like:
> >
> > se_on_rq()
> >
> > if (se->sched_delayed)
> > return false
> >
> > for
> >
> > __update_load_avg_se()
> >
> > - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> > + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
> >
> >
> > My hope was we can fix util_est independently since it drives CPU
> > frequency. Whereas PELT load_avg and runnable_avg are "only" used for
> > load balancing. But I agree, it has to be fixed as well.
> >
> > > That also means that cfs_rq's h_nr_running is not accurate anymore
> > > because it includes delayed dequeue
> >
> > +1
> >
> > > and cfs_rq load_avg is kept artificially high which biases
> > > load_balance and cgroup's shares
> >
> > +1
>
> Again, fundamentally the delayed tasks are delayed because they need to
> remain part of the competition in order to 'earn' time. It really is
> fully on_rq, and should be for the purpose of load and load-balancing.
They don't compete with the others; they wait for their lag to become
positive, which is completely different and biases the whole system
>
> It is only special in that it will never run again (until it gets
> woken).
>
> Consider (2 CPUs, 4 tasks):
>
> CPU1 CPU2
> A D
> B (delayed)
> C
>
> Then migrating any one of the tasks on CPU1 to CPU2 will make them all
> earn time at 1/2 instead of 1/3 vs 1/1. More fair etc.
But the one that ends up "enqueued" next to the delayed task will get
twice as much time, and balancing the delayed task doesn't help to
balance the system because it doesn't run.
Also, a delayed task can make a CPU look overloaded when it is not. All
this is unfair
>
> Yes, I realize this might seem weird, but we're going to be getting a
> ton more of this weirdness once proxy execution lands, then we'll be
> having the entire block chain still on the runqueue (and actually
> consuming time).
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 14:07 ` Dietmar Eggemann
2024-09-05 14:29 ` Vincent Guittot
2024-09-05 14:53 ` Peter Zijlstra
@ 2024-09-06 9:55 ` Dietmar Eggemann
2 siblings, 0 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-09-06 9:55 UTC (permalink / raw)
To: Vincent Guittot
Cc: Hongyan Xia, Luis Machado, Peter Zijlstra, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 05/09/2024 16:07, Dietmar Eggemann wrote:
> On 05/09/2024 15:33, Vincent Guittot wrote:
>> On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 29/08/2024 17:42, Hongyan Xia wrote:
>>>> On 22/08/2024 15:58, Vincent Guittot wrote:
>>>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>>>>> <vincent.guittot@linaro.org> wrote:
>>>>>>
>>>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@arm.com> wrote:
>>>>>>>
>>>>>>> Vincent,
>>>>>>>
>>>>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@arm.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@arm.com>
[...]
> I just realized that this fixes the uneven util_est_dequeue/enqueue
> calls so we don't see the underflow depicted by Hongyan and no massive
> rq->cfs util_est due to missing ue_dequeues.
> But delayed tasks are part of rq->cfs util_est, not excluded. Let me fix
> that.
Looks like I got confused ... After checking again, it seems to be OK:
dequeue_task_fair()
  if !(p is delayed && (migrating || DEQUEUE_SAVE))
    util_est_dequeue()
  if !entity_eligible(&p->se)
    se->sched_delayed = 1  -> p not contributing to rq->cfs.avg.util_est

enqueue_task_fair()
  if !(p is delayed && (migrating || ENQUEUE_RESTORE))
    util_est_enqueue()
  if ENQUEUE_DELAYED
    requeue_delayed_entity()
      se->sched_delayed = 0  -> p contributing to rq->cfs.avg.util_est
Luis M. did test this for power/perf with jetnews on Pix6 mainline 6.8
and the regression went away.
There are still occasional slight CPU frequency spikes on the little CPUs.
It could be the influence of delayed tasks on runnable_avg, but we're not
sure yet.
[...]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-05 14:53 ` Peter Zijlstra
2024-09-06 6:14 ` Vincent Guittot
@ 2024-09-06 10:45 ` Peter Zijlstra
2024-09-08 7:43 ` Mike Galbraith
` (2 more replies)
1 sibling, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-06 10:45 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Vincent Guittot, Hongyan Xia, Luis Machado, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, Sep 05, 2024 at 04:53:54PM +0200, Peter Zijlstra wrote:
> > But then, like you said, __update_load_avg_cfs_rq() needs correct
> > cfs_rq->h_nr_running.
>
> Uff. So yes __update_load_avg_cfs_rq() needs a different number, but
> I'll contest that h_nr_running is in fact correct, albeit no longer
> suitable for this purpose.
>
> We can track h_nr_delayed I suppose, and subtract that.
Something like so?
---
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/pelt.c | 2 +-
kernel/sched/sched.h | 7 +++++--
4 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 01ce9a76164c..3d3c5be78075 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -829,6 +829,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
+ SEQ_printf(m, " .%-30s: %d\n", "h_nr_delayed", cfs_rq->h_nr_delayed);
SEQ_printf(m, " .%-30s: %d\n", "idle_nr_running",
cfs_rq->idle_nr_running);
SEQ_printf(m, " .%-30s: %d\n", "idle_h_nr_running",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11e890486c1b..629b46308960 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5456,9 +5456,31 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
+static void set_delayed(struct sched_entity *se)
+{
+ se->sched_delayed = 1;
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_nr_delayed++;
+ if (cfs_rq_throttled(cfs_rq))
+ break;
+ }
+}
+
+static void clear_delayed(struct sched_entity *se)
{
se->sched_delayed = 0;
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_nr_delayed--;
+ if (cfs_rq_throttled(cfs_rq))
+ break;
+ }
+}
+
+static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
+{
+ clear_delayed(se);
if (sched_feat(DELAY_ZERO) && se->vlag > 0)
se->vlag = 0;
}
@@ -5488,7 +5510,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->next == se)
cfs_rq->next = NULL;
update_load_avg(cfs_rq, se, 0);
- se->sched_delayed = 1;
+ set_delayed(se);
return false;
}
}
@@ -5907,7 +5929,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ long task_delta, idle_task_delta, delayed_delta, dequeue = 1;
long rq_h_nr_running = rq->cfs.h_nr_running;
raw_spin_lock(&cfs_b->lock);
@@ -5940,6 +5962,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ delayed_delta = cfs_rq->h_nr_delayed;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
int flags;
@@ -5963,6 +5986,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ qcfs_rq->h_nr_delayed -= delayed_delta;
if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5985,6 +6009,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ qcfs_rq->h_nr_delayed -= delayed_delta;
}
/* At this point se is NULL and we are at root level*/
@@ -6010,7 +6035,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta;
+ long task_delta, idle_task_delta, delayed_delta;
long rq_h_nr_running = rq->cfs.h_nr_running;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -6046,6 +6071,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ delayed_delta = cfs_rq->h_nr_delayed;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -6060,6 +6086,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ qcfs_rq->h_nr_delayed += delayed_delta;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6077,6 +6104,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ qcfs_rq->h_nr_delayed += delayed_delta;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6930,7 +6958,7 @@ requeue_delayed_entity(struct sched_entity *se)
}
update_load_avg(cfs_rq, se, 0);
- se->sched_delayed = 0;
+ clear_delayed(se);
}
/*
@@ -6944,6 +6972,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
+ int h_nr_delayed = 0;
int task_new = !(flags & ENQUEUE_WAKEUP);
int rq_h_nr_running = rq->cfs.h_nr_running;
u64 slice = 0;
@@ -6953,6 +6982,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
return;
}
+ if (task_new)
+ h_nr_delayed = !!se->sched_delayed;
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6991,6 +7023,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ cfs_rq->h_nr_delayed += h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
@@ -7014,6 +7047,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ cfs_rq->h_nr_delayed += h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
@@ -7076,6 +7110,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
struct task_struct *p = NULL;
int idle_h_nr_running = 0;
int h_nr_running = 0;
+ int h_nr_delayed = 0;
struct cfs_rq *cfs_rq;
u64 slice = 0;
@@ -7083,6 +7118,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
p = task_of(se);
h_nr_running = 1;
idle_h_nr_running = task_has_idle_policy(p);
+ if (!task_sleep && !task_delayed)
+ h_nr_delayed = !!se->sched_delayed;
} else {
cfs_rq = group_cfs_rq(se);
slice = cfs_rq_min_slice(cfs_rq);
@@ -7100,6 +7137,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ cfs_rq->h_nr_delayed -= h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = h_nr_running;
@@ -7138,6 +7176,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ cfs_rq->h_nr_delayed -= h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = h_nr_running;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index fa52906a4478..21e3ff5eb77a 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -321,7 +321,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
- cfs_rq->h_nr_running,
+ cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
cfs_rq->curr != NULL)) {
___update_load_avg(&cfs_rq->avg, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3744f16a1293..d91360b0cca1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -603,6 +603,7 @@ struct cfs_rq {
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_nr_running; /* SCHED_IDLE */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
+ unsigned int h_nr_delayed;
s64 avg_vruntime;
u64 avg_load;
@@ -813,8 +814,10 @@ struct dl_rq {
static inline void se_update_runnable(struct sched_entity *se)
{
- if (!entity_is_task(se))
- se->runnable_weight = se->my_q->h_nr_running;
+ if (!entity_is_task(se)) {
+ struct cfs_rq *cfs_rq = se->my_q;
+ se->runnable_weight = cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
+ }
}
static inline long se_runnable(struct sched_entity *se)
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-06 10:45 ` Peter Zijlstra
@ 2024-09-08 7:43 ` Mike Galbraith
2024-09-10 8:09 ` [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE tip-bot2 for Peter Zijlstra
2024-09-10 11:04 ` [PATCH 10/24] sched/uclamg: Handle delayed dequeue Luis Machado
2 siblings, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-09-08 7:43 UTC (permalink / raw)
To: Peter Zijlstra, Dietmar Eggemann
Cc: Vincent Guittot, Hongyan Xia, Luis Machado, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Fri, 2024-09-06 at 12:45 +0200, Peter Zijlstra wrote:
> On Thu, Sep 05, 2024 at 04:53:54PM +0200, Peter Zijlstra wrote:
>
> > > But then, like you said, __update_load_avg_cfs_rq() needs correct
> > > cfs_rq->h_nr_running.
> >
> > Uff. So yes __update_load_avg_cfs_rq() needs a different number,
> > but
> > I'll contest that h_nr_running is in fact correct, albeit no longer
> > suitable for this purpose.
> >
> > We can track h_nr_delayed I suppose, and subtract that.
>
> Something like so?
With these two added to the series plus your prototype below, watching
sched_debug as box builds kernels and whatnot.. is about as stimulating
as watching paint peel <thumbs up emoji>
sched-fair-Properly-deactivate-sched_delayed-task-upon-class-change.patch
sched-fair-Fix-util_est-accounting-for-DELAY_DEQUEUE.patch
>
> ---
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 49
> ++++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/pelt.c | 2 +-
> kernel/sched/sched.h | 7 +++++--
> 4 files changed, 51 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 01ce9a76164c..3d3c5be78075 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -829,6 +829,7 @@ void print_cfs_rq(struct seq_file *m, int cpu,
> struct cfs_rq *cfs_rq)
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread",
> SPLIT_NS(spread));
> SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq-
> >nr_running);
> SEQ_printf(m, " .%-30s: %d\n", "h_nr_running", cfs_rq-
> >h_nr_running);
> + SEQ_printf(m, " .%-30s: %d\n", "h_nr_delayed", cfs_rq-
> >h_nr_delayed);
> SEQ_printf(m, " .%-30s: %d\n", "idle_nr_running",
> cfs_rq->idle_nr_running);
> SEQ_printf(m, " .%-30s: %d\n", "idle_h_nr_running",
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 11e890486c1b..629b46308960 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5456,9 +5456,31 @@ static void clear_buddies(struct cfs_rq
> *cfs_rq, struct sched_entity *se)
>
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq
> *cfs_rq);
>
> -static inline void finish_delayed_dequeue_entity(struct sched_entity
> *se)
> +static void set_delayed(struct sched_entity *se)
> +{
> + se->sched_delayed = 1;
> + for_each_sched_entity(se) {
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + cfs_rq->h_nr_delayed++;
> + if (cfs_rq_throttled(cfs_rq))
> + break;
> + }
> +}
> +
> +static void clear_delayed(struct sched_entity *se)
> {
> se->sched_delayed = 0;
> + for_each_sched_entity(se) {
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + cfs_rq->h_nr_delayed--;
> + if (cfs_rq_throttled(cfs_rq))
> + break;
> + }
> +}
> +
> +static inline void finish_delayed_dequeue_entity(struct sched_entity
> *se)
> +{
> + clear_delayed(se);
> if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> se->vlag = 0;
> }
> @@ -5488,7 +5510,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct
> sched_entity *se, int flags)
> if (cfs_rq->next == se)
> cfs_rq->next = NULL;
> update_load_avg(cfs_rq, se, 0);
> - se->sched_delayed = 1;
> + set_delayed(se);
> return false;
> }
> }
> @@ -5907,7 +5929,7 @@ static bool throttle_cfs_rq(struct cfs_rq
> *cfs_rq)
> struct rq *rq = rq_of(cfs_rq);
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> - long task_delta, idle_task_delta, dequeue = 1;
> + long task_delta, idle_task_delta, delayed_delta, dequeue = 1;
> long rq_h_nr_running = rq->cfs.h_nr_running;
>
> raw_spin_lock(&cfs_b->lock);
> @@ -5940,6 +5962,7 @@ static bool throttle_cfs_rq(struct cfs_rq
> *cfs_rq)
>
> task_delta = cfs_rq->h_nr_running;
> idle_task_delta = cfs_rq->idle_h_nr_running;
> + delayed_delta = cfs_rq->h_nr_delayed;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> int flags;
> @@ -5963,6 +5986,7 @@ static bool throttle_cfs_rq(struct cfs_rq
> *cfs_rq)
>
> qcfs_rq->h_nr_running -= task_delta;
> qcfs_rq->idle_h_nr_running -= idle_task_delta;
> + qcfs_rq->h_nr_delayed -= delayed_delta;
>
> if (qcfs_rq->load.weight) {
> /* Avoid re-evaluating load for this entity:
> */
> @@ -5985,6 +6009,7 @@ static bool throttle_cfs_rq(struct cfs_rq
> *cfs_rq)
>
> qcfs_rq->h_nr_running -= task_delta;
> qcfs_rq->idle_h_nr_running -= idle_task_delta;
> + qcfs_rq->h_nr_delayed -= delayed_delta;
> }
>
> /* At this point se is NULL and we are at root level*/
> @@ -6010,7 +6035,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> struct rq *rq = rq_of(cfs_rq);
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> - long task_delta, idle_task_delta;
> + long task_delta, idle_task_delta, delayed_delta;
> long rq_h_nr_running = rq->cfs.h_nr_running;
>
> se = cfs_rq->tg->se[cpu_of(rq)];
> @@ -6046,6 +6071,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> task_delta = cfs_rq->h_nr_running;
> idle_task_delta = cfs_rq->idle_h_nr_running;
> + delayed_delta = cfs_rq->h_nr_delayed;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> @@ -6060,6 +6086,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> qcfs_rq->h_nr_running += task_delta;
> qcfs_rq->idle_h_nr_running += idle_task_delta;
> + qcfs_rq->h_nr_delayed += delayed_delta;
>
> /* end evaluation on encountering a throttled cfs_rq
> */
> if (cfs_rq_throttled(qcfs_rq))
> @@ -6077,6 +6104,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> qcfs_rq->h_nr_running += task_delta;
> qcfs_rq->idle_h_nr_running += idle_task_delta;
> + qcfs_rq->h_nr_delayed += delayed_delta;
>
> /* end evaluation on encountering a throttled cfs_rq
> */
> if (cfs_rq_throttled(qcfs_rq))
> @@ -6930,7 +6958,7 @@ requeue_delayed_entity(struct sched_entity *se)
> }
>
> update_load_avg(cfs_rq, se, 0);
> - se->sched_delayed = 0;
> + clear_delayed(se);
> }
>
> /*
> @@ -6944,6 +6972,7 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> int idle_h_nr_running = task_has_idle_policy(p);
> + int h_nr_delayed = 0;
> int task_new = !(flags & ENQUEUE_WAKEUP);
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
> @@ -6953,6 +6982,9 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
> return;
> }
>
> + if (task_new)
> + h_nr_delayed = !!se->sched_delayed;
> +
> /*
> * The code below (indirectly) updates schedutil which looks
> at
> * the cfs_rq utilization to select a frequency.
> @@ -6991,6 +7023,7 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> + cfs_rq->h_nr_delayed += h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = 1;
> @@ -7014,6 +7047,7 @@ enqueue_task_fair(struct rq *rq, struct
> task_struct *p, int flags)
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> + cfs_rq->h_nr_delayed += h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = 1;
> @@ -7076,6 +7110,7 @@ static int dequeue_entities(struct rq *rq,
> struct sched_entity *se, int flags)
> struct task_struct *p = NULL;
> int idle_h_nr_running = 0;
> int h_nr_running = 0;
> + int h_nr_delayed = 0;
> struct cfs_rq *cfs_rq;
> u64 slice = 0;
>
> @@ -7083,6 +7118,8 @@ static int dequeue_entities(struct rq *rq,
> struct sched_entity *se, int flags)
> p = task_of(se);
> h_nr_running = 1;
> idle_h_nr_running = task_has_idle_policy(p);
> + if (!task_sleep && !task_delayed)
> + h_nr_delayed = !!se->sched_delayed;
> } else {
> cfs_rq = group_cfs_rq(se);
> slice = cfs_rq_min_slice(cfs_rq);
> @@ -7100,6 +7137,7 @@ static int dequeue_entities(struct rq *rq,
> struct sched_entity *se, int flags)
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> + cfs_rq->h_nr_delayed -= h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = h_nr_running;
> @@ -7138,6 +7176,7 @@ static int dequeue_entities(struct rq *rq,
> struct sched_entity *se, int flags)
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> + cfs_rq->h_nr_delayed -= h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = h_nr_running;
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index fa52906a4478..21e3ff5eb77a 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -321,7 +321,7 @@ int __update_load_avg_cfs_rq(u64 now, struct
> cfs_rq *cfs_rq)
> {
> if (___update_load_sum(now, &cfs_rq->avg,
> scale_load_down(cfs_rq->load.weight),
> - cfs_rq->h_nr_running,
> + cfs_rq->h_nr_running - cfs_rq-
> >h_nr_delayed,
> cfs_rq->curr != NULL)) {
>
> ___update_load_avg(&cfs_rq->avg, 1);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3744f16a1293..d91360b0cca1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -603,6 +603,7 @@ struct cfs_rq {
> unsigned int h_nr_running; /*
> SCHED_{NORMAL,BATCH,IDLE} */
> unsigned int idle_nr_running; /* SCHED_IDLE */
> unsigned int idle_h_nr_running; /* SCHED_IDLE */
> + unsigned int h_nr_delayed;
>
> s64 avg_vruntime;
> u64 avg_load;
> @@ -813,8 +814,10 @@ struct dl_rq {
>
> static inline void se_update_runnable(struct sched_entity *se)
> {
> - if (!entity_is_task(se))
> - se->runnable_weight = se->my_q->h_nr_running;
> + if (!entity_is_task(se)) {
> + struct cfs_rq *cfs_rq = se->my_q;
> + se->runnable_weight = cfs_rq->h_nr_running - cfs_rq-
> >h_nr_delayed;
> + }
> }
>
> static inline long se_runnable(struct sched_entity *se)
^ permalink raw reply [flat|nested] 277+ messages in thread
* [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE
2024-09-06 10:45 ` Peter Zijlstra
2024-09-08 7:43 ` Mike Galbraith
@ 2024-09-10 8:09 ` tip-bot2 for Peter Zijlstra
2024-11-27 4:17 ` K Prateek Nayak
2024-09-10 11:04 ` [PATCH 10/24] sched/uclamg: Handle delayed dequeue Luis Machado
2 siblings, 1 reply; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-09-10 8:09 UTC (permalink / raw)
To: linux-tip-commits
Cc: Dietmar Eggemann, Vincent Guittot, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
Gitweb: https://git.kernel.org/tip/2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 06 Sep 2024 12:45:25 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 10 Sep 2024 09:51:15 +02:00
sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while commit fc1892becd56 fixes the
entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.
Track h_nr_delayed such that we can discount those and adjust the
signal.
Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240906104525.GG4928@noisy.programming.kicks-ass.net
---
kernel/sched/debug.c | 1 +-
kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++++++-----
kernel/sched/pelt.c | 2 +-
kernel/sched/sched.h | 7 ++++--
4 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index de1dc52..35974ac 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -844,6 +844,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
+ SEQ_printf(m, " .%-30s: %d\n", "h_nr_delayed", cfs_rq->h_nr_delayed);
SEQ_printf(m, " .%-30s: %d\n", "idle_nr_running",
cfs_rq->idle_nr_running);
SEQ_printf(m, " .%-30s: %d\n", "idle_h_nr_running",
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 922d690..0bc5e62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5456,9 +5456,31 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
+static void set_delayed(struct sched_entity *se)
+{
+ se->sched_delayed = 1;
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_nr_delayed++;
+ if (cfs_rq_throttled(cfs_rq))
+ break;
+ }
+}
+
+static void clear_delayed(struct sched_entity *se)
{
se->sched_delayed = 0;
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_nr_delayed--;
+ if (cfs_rq_throttled(cfs_rq))
+ break;
+ }
+}
+
+static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
+{
+ clear_delayed(se);
if (sched_feat(DELAY_ZERO) && se->vlag > 0)
se->vlag = 0;
}
@@ -5488,7 +5510,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->next == se)
cfs_rq->next = NULL;
update_load_avg(cfs_rq, se, 0);
- se->sched_delayed = 1;
+ set_delayed(se);
return false;
}
}
@@ -5907,7 +5929,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta, dequeue = 1;
+ long task_delta, idle_task_delta, delayed_delta, dequeue = 1;
long rq_h_nr_running = rq->cfs.h_nr_running;
raw_spin_lock(&cfs_b->lock);
@@ -5940,6 +5962,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ delayed_delta = cfs_rq->h_nr_delayed;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
int flags;
@@ -5963,6 +5986,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ qcfs_rq->h_nr_delayed -= delayed_delta;
if (qcfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
@@ -5985,6 +6009,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running -= task_delta;
qcfs_rq->idle_h_nr_running -= idle_task_delta;
+ qcfs_rq->h_nr_delayed -= delayed_delta;
}
/* At this point se is NULL and we are at root level*/
@@ -6010,7 +6035,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
- long task_delta, idle_task_delta;
+ long task_delta, idle_task_delta, delayed_delta;
long rq_h_nr_running = rq->cfs.h_nr_running;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -6046,6 +6071,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
task_delta = cfs_rq->h_nr_running;
idle_task_delta = cfs_rq->idle_h_nr_running;
+ delayed_delta = cfs_rq->h_nr_delayed;
for_each_sched_entity(se) {
struct cfs_rq *qcfs_rq = cfs_rq_of(se);
@@ -6060,6 +6086,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ qcfs_rq->h_nr_delayed += delayed_delta;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6077,6 +6104,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_running += task_delta;
qcfs_rq->idle_h_nr_running += idle_task_delta;
+ qcfs_rq->h_nr_delayed += delayed_delta;
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(qcfs_rq))
@@ -6930,7 +6958,7 @@ requeue_delayed_entity(struct sched_entity *se)
}
update_load_avg(cfs_rq, se, 0);
- se->sched_delayed = 0;
+ clear_delayed(se);
}
/*
@@ -6944,6 +6972,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
+ int h_nr_delayed = 0;
int task_new = !(flags & ENQUEUE_WAKEUP);
int rq_h_nr_running = rq->cfs.h_nr_running;
u64 slice = 0;
@@ -6970,6 +6999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (p->in_iowait)
cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
+ if (task_new)
+ h_nr_delayed = !!se->sched_delayed;
+
for_each_sched_entity(se) {
if (se->on_rq) {
if (se->sched_delayed)
@@ -6992,6 +7024,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ cfs_rq->h_nr_delayed += h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
@@ -7015,6 +7048,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
+ cfs_rq->h_nr_delayed += h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = 1;
@@ -7077,6 +7111,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
struct task_struct *p = NULL;
int idle_h_nr_running = 0;
int h_nr_running = 0;
+ int h_nr_delayed = 0;
struct cfs_rq *cfs_rq;
u64 slice = 0;
@@ -7084,6 +7119,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
p = task_of(se);
h_nr_running = 1;
idle_h_nr_running = task_has_idle_policy(p);
+ if (!task_sleep && !task_delayed)
+ h_nr_delayed = !!se->sched_delayed;
} else {
cfs_rq = group_cfs_rq(se);
slice = cfs_rq_min_slice(cfs_rq);
@@ -7101,6 +7138,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ cfs_rq->h_nr_delayed -= h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = h_nr_running;
@@ -7139,6 +7177,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_running -= h_nr_running;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
+ cfs_rq->h_nr_delayed -= h_nr_delayed;
if (cfs_rq_is_idle(cfs_rq))
idle_h_nr_running = h_nr_running;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index fa52906..21e3ff5 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -321,7 +321,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
- cfs_rq->h_nr_running,
+ cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
cfs_rq->curr != NULL)) {
___update_load_avg(&cfs_rq->avg, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3744f16..d91360b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -603,6 +603,7 @@ struct cfs_rq {
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_nr_running; /* SCHED_IDLE */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
+ unsigned int h_nr_delayed;
s64 avg_vruntime;
u64 avg_load;
@@ -813,8 +814,10 @@ struct dl_rq {
static inline void se_update_runnable(struct sched_entity *se)
{
- if (!entity_is_task(se))
- se->runnable_weight = se->my_q->h_nr_running;
+ if (!entity_is_task(se)) {
+ struct cfs_rq *cfs_rq = se->my_q;
+ se->runnable_weight = cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
+ }
}
static inline long se_runnable(struct sched_entity *se)
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/core] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
2024-09-05 13:02 ` Dietmar Eggemann
2024-09-05 13:33 ` Vincent Guittot
2024-09-05 14:18 ` Peter Zijlstra
@ 2024-09-10 8:09 ` tip-bot2 for Dietmar Eggemann
2 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Dietmar Eggemann @ 2024-09-10 8:09 UTC (permalink / raw)
To: linux-tip-commits
Cc: Dietmar Eggemann, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 729288bc68560b4d5b094cb7a6f794c752ef22a2
Gitweb: https://git.kernel.org/tip/729288bc68560b4d5b094cb7a6f794c752ef22a2
Author: Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Thu, 05 Sep 2024 00:05:23 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 10 Sep 2024 09:51:15 +02:00
kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
Remove delayed tasks from util_est even though they are runnable.
Exclude delayed tasks which are (a) migrating between rq's or (b) in a
SAVE/RESTORE dequeue/enqueue.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com
---
kernel/sched/fair.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e946ca0..922d690 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_running = rq->cfs.h_nr_running;
u64 slice = 0;
- if (flags & ENQUEUE_DELAYED) {
- requeue_delayed_entity(se);
- return;
- }
-
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
* Let's add the task's estimated utilization to the cfs_rq's
* estimated utilization, before we update schedutil.
*/
- util_est_enqueue(&rq->cfs, p);
+ if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
+ util_est_enqueue(&rq->cfs, p);
+
+ if (flags & ENQUEUE_DELAYED) {
+ requeue_delayed_entity(se);
+ return;
+ }
/*
* If in_iowait is set, the code below may not trigger any cpufreq
@@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
- util_est_dequeue(&rq->cfs, p);
+ if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
+ util_est_dequeue(&rq->cfs, p);
if (dequeue_entities(rq, &p->se, flags) < 0) {
util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue
2024-07-27 10:27 ` [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-09-10 9:16 ` Luis Machado
1 sibling, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-09-10 9:16 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Hi Peter,
On 7/27/24 11:27, Peter Zijlstra wrote:
> Delayed dequeue's natural end is when it gets picked again. Ensure
> pick_next_task() knows what to do with delayed tasks.
>
> Note, this relies on the earlier patch that made pick_next_task()
> state invariant -- it will restart the pick on dequeue, because
> obviously the just dequeued task is no longer eligible.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/fair.c | 23 +++++++++++++++++++----
> 1 file changed, 19 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5453,6 +5453,8 @@ set_next_entity(struct cfs_rq *cfs_rq, s
> se->prev_sum_exec_runtime = se->sum_exec_runtime;
> }
>
> +static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
> +
> /*
> * Pick the next process, keeping these things in mind, in this order:
> * 1) keep things fair between processes/task groups
> @@ -5461,16 +5463,27 @@ set_next_entity(struct cfs_rq *cfs_rq, s
> * 4) do not run the "skip" process, if something else is available
> */
> static struct sched_entity *
> -pick_next_entity(struct cfs_rq *cfs_rq)
> +pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> {
> /*
> * Enabling NEXT_BUDDY will affect latency but not fairness.
> */
> if (sched_feat(NEXT_BUDDY) &&
> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
> + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
> + /* ->next will never be delayed */
> + SCHED_WARN_ON(cfs_rq->next->sched_delayed);
> return cfs_rq->next;
> + }
> +
> + struct sched_entity *se = pick_eevdf(cfs_rq);
> + if (se->sched_delayed) {
> + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> + SCHED_WARN_ON(se->sched_delayed);
> + SCHED_WARN_ON(se->on_rq);
While exercising the h_nr_delayed changes on Android/Pixel 6 (6.8-based), I ran into
a situation where pick_eevdf seems to be returning NULL, and then we proceed to try to
dereference it and crash during boot.
I can fix it by guarding against a NULL se after the call to pick_eevdf, and then the code
runs OK from there as pick_task_fair will have another go at trying to pick the next entity.
I haven't checked exactly why we return NULL from pick_eevdf, but I recall seeing similar
reports of pick_eevdf sometimes failing to pick any task. Anyway, I thought I'd point this
out in case others see a similar situation.
Back to testing the h_nr_delayed changes.
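For reference, the local workaround is roughly this, right after the
pick_eevdf() call in the hunk quoted above (a sketch, not a root-cause fix):

	struct sched_entity *se = pick_eevdf(cfs_rq);

	/* workaround: pick_eevdf() can apparently come back empty here */
	if (unlikely(!se))
		return NULL;	/* pick_task_fair() will just try the pick again */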
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-06 10:45 ` Peter Zijlstra
2024-09-08 7:43 ` Mike Galbraith
2024-09-10 8:09 ` [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE tip-bot2 for Peter Zijlstra
@ 2024-09-10 11:04 ` Luis Machado
2024-09-10 14:05 ` Peter Zijlstra
2 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-09-10 11:04 UTC (permalink / raw)
To: Peter Zijlstra, Dietmar Eggemann
Cc: Vincent Guittot, Hongyan Xia, mingo, juri.lelli, rostedt, bsegall,
mgorman, vschneid, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx, efault
Peter,
On 9/6/24 11:45, Peter Zijlstra wrote:
> On Thu, Sep 05, 2024 at 04:53:54PM +0200, Peter Zijlstra wrote:
>
>>> But then, like you said, __update_load_avg_cfs_rq() needs correct
>>> cfs_rq->h_nr_running.
>>
>> Uff. So yes __update_load_avg_cfs_rq() needs a different number, but
>> I'll contest that h_nr_running is in fact correct, albeit no longer
>> suitable for this purpose.
>>
>> We can track h_nr_delayed I suppose, and subtract that.
>
> Something like so?
>
> ---
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/pelt.c | 2 +-
> kernel/sched/sched.h | 7 +++++--
> 4 files changed, 51 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 01ce9a76164c..3d3c5be78075 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -829,6 +829,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
> SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
> SEQ_printf(m, " .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
> + SEQ_printf(m, " .%-30s: %d\n", "h_nr_delayed", cfs_rq->h_nr_delayed);
> SEQ_printf(m, " .%-30s: %d\n", "idle_nr_running",
> cfs_rq->idle_nr_running);
> SEQ_printf(m, " .%-30s: %d\n", "idle_h_nr_running",
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 11e890486c1b..629b46308960 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5456,9 +5456,31 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>
> -static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
> +static void set_delayed(struct sched_entity *se)
> +{
> + se->sched_delayed = 1;
> + for_each_sched_entity(se) {
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + cfs_rq->h_nr_delayed++;
> + if (cfs_rq_throttled(cfs_rq))
> + break;
> + }
> +}
> +
> +static void clear_delayed(struct sched_entity *se)
> {
> se->sched_delayed = 0;
> + for_each_sched_entity(se) {
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + cfs_rq->h_nr_delayed--;
> + if (cfs_rq_throttled(cfs_rq))
> + break;
> + }
> +}
> +
> +static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
> +{
> + clear_delayed(se);
> if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> se->vlag = 0;
> }
> @@ -5488,7 +5510,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> if (cfs_rq->next == se)
> cfs_rq->next = NULL;
> update_load_avg(cfs_rq, se, 0);
> - se->sched_delayed = 1;
> + set_delayed(se);
> return false;
> }
> }
> @@ -5907,7 +5929,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> struct rq *rq = rq_of(cfs_rq);
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> - long task_delta, idle_task_delta, dequeue = 1;
> + long task_delta, idle_task_delta, delayed_delta, dequeue = 1;
> long rq_h_nr_running = rq->cfs.h_nr_running;
>
> raw_spin_lock(&cfs_b->lock);
> @@ -5940,6 +5962,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
> task_delta = cfs_rq->h_nr_running;
> idle_task_delta = cfs_rq->idle_h_nr_running;
> + delayed_delta = cfs_rq->h_nr_delayed;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> int flags;
> @@ -5963,6 +5986,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
> qcfs_rq->h_nr_running -= task_delta;
> qcfs_rq->idle_h_nr_running -= idle_task_delta;
> + qcfs_rq->h_nr_delayed -= delayed_delta;
>
> if (qcfs_rq->load.weight) {
> /* Avoid re-evaluating load for this entity: */
> @@ -5985,6 +6009,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>
> qcfs_rq->h_nr_running -= task_delta;
> qcfs_rq->idle_h_nr_running -= idle_task_delta;
> + qcfs_rq->h_nr_delayed -= delayed_delta;
> }
>
> /* At this point se is NULL and we are at root level*/
> @@ -6010,7 +6035,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> struct rq *rq = rq_of(cfs_rq);
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> - long task_delta, idle_task_delta;
> + long task_delta, idle_task_delta, delayed_delta;
> long rq_h_nr_running = rq->cfs.h_nr_running;
>
> se = cfs_rq->tg->se[cpu_of(rq)];
> @@ -6046,6 +6071,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> task_delta = cfs_rq->h_nr_running;
> idle_task_delta = cfs_rq->idle_h_nr_running;
> + delayed_delta = cfs_rq->h_nr_delayed;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> @@ -6060,6 +6086,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> qcfs_rq->h_nr_running += task_delta;
> qcfs_rq->idle_h_nr_running += idle_task_delta;
> + qcfs_rq->h_nr_delayed += delayed_delta;
>
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(qcfs_rq))
> @@ -6077,6 +6104,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>
> qcfs_rq->h_nr_running += task_delta;
> qcfs_rq->idle_h_nr_running += idle_task_delta;
> + qcfs_rq->h_nr_delayed += delayed_delta;
>
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(qcfs_rq))
> @@ -6930,7 +6958,7 @@ requeue_delayed_entity(struct sched_entity *se)
> }
>
> update_load_avg(cfs_rq, se, 0);
> - se->sched_delayed = 0;
> + clear_delayed(se);
> }
>
> /*
> @@ -6944,6 +6972,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> int idle_h_nr_running = task_has_idle_policy(p);
> + int h_nr_delayed = 0;
> int task_new = !(flags & ENQUEUE_WAKEUP);
> int rq_h_nr_running = rq->cfs.h_nr_running;
> u64 slice = 0;
> @@ -6953,6 +6982,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> return;
> }
>
> + if (task_new)
> + h_nr_delayed = !!se->sched_delayed;
> +
> /*
> * The code below (indirectly) updates schedutil which looks at
> * the cfs_rq utilization to select a frequency.
> @@ -6991,6 +7023,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> + cfs_rq->h_nr_delayed += h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = 1;
> @@ -7014,6 +7047,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> cfs_rq->h_nr_running++;
> cfs_rq->idle_h_nr_running += idle_h_nr_running;
> + cfs_rq->h_nr_delayed += h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = 1;
> @@ -7076,6 +7110,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> struct task_struct *p = NULL;
> int idle_h_nr_running = 0;
> int h_nr_running = 0;
> + int h_nr_delayed = 0;
> struct cfs_rq *cfs_rq;
> u64 slice = 0;
>
> @@ -7083,6 +7118,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> p = task_of(se);
> h_nr_running = 1;
> idle_h_nr_running = task_has_idle_policy(p);
> + if (!task_sleep && !task_delayed)
> + h_nr_delayed = !!se->sched_delayed;
> } else {
> cfs_rq = group_cfs_rq(se);
> slice = cfs_rq_min_slice(cfs_rq);
> @@ -7100,6 +7137,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> + cfs_rq->h_nr_delayed -= h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = h_nr_running;
> @@ -7138,6 +7176,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> + cfs_rq->h_nr_delayed -= h_nr_delayed;
>
> if (cfs_rq_is_idle(cfs_rq))
> idle_h_nr_running = h_nr_running;
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index fa52906a4478..21e3ff5eb77a 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -321,7 +321,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
> {
> if (___update_load_sum(now, &cfs_rq->avg,
> scale_load_down(cfs_rq->load.weight),
> - cfs_rq->h_nr_running,
> + cfs_rq->h_nr_running - cfs_rq->h_nr_delayed,
> cfs_rq->curr != NULL)) {
>
> ___update_load_avg(&cfs_rq->avg, 1);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3744f16a1293..d91360b0cca1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -603,6 +603,7 @@ struct cfs_rq {
> unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
> unsigned int idle_nr_running; /* SCHED_IDLE */
> unsigned int idle_h_nr_running; /* SCHED_IDLE */
> + unsigned int h_nr_delayed;
>
> s64 avg_vruntime;
> u64 avg_load;
> @@ -813,8 +814,10 @@ struct dl_rq {
>
> static inline void se_update_runnable(struct sched_entity *se)
> {
> - if (!entity_is_task(se))
> - se->runnable_weight = se->my_q->h_nr_running;
> + if (!entity_is_task(se)) {
> + struct cfs_rq *cfs_rq = se->my_q;
> + se->runnable_weight = cfs_rq->h_nr_running - cfs_rq->h_nr_delayed;
> + }
> }
>
> static inline long se_runnable(struct sched_entity *se)
I gave the above patch a try on our Android workload running on the Pixel 6 with a 6.8-based kernel.
First I'd like to confirm that Dietmar's fix that was pushed to tip:sched/core (Fix util_est
accounting for DELAY_DEQUEUE) helps bring the frequencies and power use down to more sensible levels.
As for the above changes, unfortunately I'm seeing high frequencies and high power usage again. The
pattern looks similar to what we observed with the uclamp inc/dec imbalance.
I haven't investigated this in depth yet, but I'll go stare at some traces and the code, and hopefully
something will ring bells.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (28 preceding siblings ...)
2024-08-29 17:02 ` Aleksandr Nogikh
@ 2024-09-10 11:45 ` Sven Schnelle
2024-09-10 12:21 ` Sven Schnelle
2024-11-06 1:07 ` Saravana Kannan
2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
31 siblings, 1 reply; 277+ messages in thread
From: Sven Schnelle @ 2024-09-10 11:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Peter Zijlstra <peterz@infradead.org> writes:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
Sometimes the system doesn't manage to print an oops; this one is the best I got:
[ 596.146142] ------------[ cut here ]------------
[ 596.146161] se->sched_delayed
[ 596.146166] WARNING: CPU: 1 PID: 0 at kernel/sched/fair.c:13131 __set_next_task_fair.part.0+0x350/0x400
[ 596.146179] Modules linked in: [..]
[ 596.146288] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.11.0-rc7-next-20240909 #18
[ 596.146294] Hardware name: IBM 3931 A01 704 (LPAR)
[ 596.146298] Krnl PSW : 0404e00180000000 001a9c2b5eea4ea4 (__set_next_task_fair.part.0+0x354/0x400)
[ 596.146307] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 596.146314] Krnl GPRS: 001c000300000027 001c000300000023 0000000000000011 0000000000000004
[ 596.146319] 0000000000000001 001a9c2b5f1fb118 000000036ef94dd5 0000001b77ca6ea8
[ 596.146323] 001c000000000000 001a9c2b5eec6fc0 0000001b77ca6000 00000000b7334800
[ 596.146328] 0000000000000000 001a9c2b5eefad70 001a9c2b5eea4ea0 001a9bab5ee8f9f8
[ 596.146340] Krnl Code: 001a9c2b5eea4e94: c0200121bbe6 larl %r2,001a9c2b612dc660
[ 596.146340] 001a9c2b5eea4e9a: c0e5fff9e9d3 brasl %r14,001a9c2b5ede2240
[ 596.146340] #001a9c2b5eea4ea0: af000000 mc 0,0
[ 596.146340] >001a9c2b5eea4ea4: a7f4fe83 brc 15,001a9c2b5eea4baa
[ 596.146340] 001a9c2b5eea4ea8: c0e50038ba2c brasl %r14,001a9c2b5f5bc300
[ 596.146558] CPU: 1 UID: 0 PID: 18582 Comm: prctl-sched-cor Tainted: G W 6.11.0-rc7-next-20240909 #18
[ 596.146564] Tainted: [W]=WARN
[ 596.146567] Hardware name: IBM 3931 A01 704 (LPAR)
[ 596.146570] Krnl PSW : 0404e00180000000 001a9c2b5eec2de4 (dequeue_entity+0xe64/0x11f0)
[ 596.146578] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 596.146584] Krnl GPRS: 001c000300000027 001c000300000023 000000000000001a 0000000000000004
[ 596.146589] 0000000000000001 001a9c2b5f1fb118 001a9c2b61be7144 0000000016e6692a
[ 596.146593] 0000000000000001 00000000b7334951 0000000158494800 00000000b7334900
[ 596.146597] 000000000000489e 0000000000000009 001a9c2b5eec2de0 001a9bab75dff760
[ 596.146607] Krnl Code: 001a9c2b5eec2dd4: c0200120cdf6 larl %r2,001a9c2b612dc9c0
[ 596.146607] 001a9c2b5eec2dda: c0e5fff8fa33 brasl %r14,001a9c2b5ede2240
[ 596.146607] #001a9c2b5eec2de0: af000000 mc 0,0
[ 596.146607] >001a9c2b5eec2de4: c004fffff90a brcl 0,001a9c2b5eec1ff8
[ 596.146607] 001a9c2b5eec2dea: a7f4fbbe brc 15,001a9c2b5eec2566
[ 596.146607] 001a9c2b5eec2dee: a7d10001 tmll %r13,1
[ 596.146607] 001a9c2b5eec2df2: a774fb1c brc 7,001a9c2b5eec242a
[ 596.146607] 001a9c2b5eec2df6: a7f4f95f brc 15,001a9c2b5eec20b4
[ 596.146637] Call Trace:
[ 596.146640] [<001a9c2b5eec2de4>] dequeue_entity+0xe64/0x11f0
[ 596.146645] ([<001a9c2b5eec2de0>] dequeue_entity+0xe60/0x11f0)
[ 596.146650] [<001a9c2b5eec34b0>] dequeue_entities+0x340/0xe10
[ 596.146655] [<001a9c2b5eec4208>] dequeue_task_fair+0xb8/0x5a0
[ 596.146660] [<001a9c2b6115ab68>] __schedule+0xb58/0x14f0
[ 596.146666] [<001a9c2b6115b59c>] schedule+0x9c/0x240
[ 596.146670] [<001a9c2b5edf5190>] do_wait+0x160/0x440
[ 596.146676] [<001a9c2b5edf5936>] kernel_waitid+0xd6/0x110
[ 596.146680] [<001a9c2b5edf5b4e>] __do_sys_waitid+0x1de/0x1f0
[ 596.146685] [<001a9c2b5edf5c36>] __s390x_sys_waitid+0xd6/0x120
[ 596.146690] [<001a9c2b5ed0cbd6>] do_syscall+0x2f6/0x430
[ 596.146695] [<001a9c2b611543a4>] __do_syscall+0xa4/0x170
[ 596.146700] [<001a9c2b6117046c>] system_call+0x74/0x98
[ 596.146705] Last Breaking-Event-Address:
[ 596.146707] [<001a9c2b5ede2418>] __warn_printk+0x1d8/0x1e0
This happens when running the strace test suite. The system normally has
128 CPUs. With that configuration the crash doesn't happen, but when
disabling all but four CPUs and running 'make check -j16' in the strace
test suite the crash is almost always reproducible.
Regards
Sven
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-09-10 11:45 ` Sven Schnelle
@ 2024-09-10 12:21 ` Sven Schnelle
2024-09-10 14:07 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Sven Schnelle @ 2024-09-10 12:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Sven Schnelle <svens@linux.ibm.com> writes:
> Peter Zijlstra <peterz@infradead.org> writes:
>
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree for ever it seems, and people have been
>> testing it and sending fixes.
>>
>> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>> - split up the huge delay-dequeue patch
>> - tested/fixed cfs-bandwidth
>> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> - propagate min_slice up cgroups
>> - CLOCK_THREAD_DVFS_ID
>
> I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
>
> Sometimes the system doesn't manage to print a oops, this one is the best i got:
>
> [..]
> This happens when running the strace test suite. The system normaly has
> 128 CPUs. With this configuration the crash doesn't happen, but when
> disabling all but four CPUs and running 'make check -j16' in the strace
> test suite the crash is almost always reproducable.
I forgot to add the git bisect log. Unfortunately I had to skip
some commits because the kernel didn't compile:
git bisect start
# status: waiting for both good and bad commits
# bad: [100cc857359b5d731407d1038f7e76cd0e871d94] Add linux-next specific files for 20240909
git bisect bad 100cc857359b5d731407d1038f7e76cd0e871d94
# status: waiting for good commit(s), bad commit known
# good: [da3ea35007d0af457a0afc87e84fddaebc4e0b63] Linux 6.11-rc7
git bisect good da3ea35007d0af457a0afc87e84fddaebc4e0b63
# good: [df20078b9706977cc3308740b56993cf27665f90] Merge branch 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
git bisect good df20078b9706977cc3308740b56993cf27665f90
# good: [609f9e1b6242e7158ce96f9124372601997ce56c] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc.git
git bisect good 609f9e1b6242e7158ce96f9124372601997ce56c
# skip: [664c3413e9a6c345a6c926841358314be9da8309] Merge branch 'usb-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
git bisect skip 664c3413e9a6c345a6c926841358314be9da8309
# good: [16531118ba63dd9bcd65203d04a9c9d6f6800547] iio: bmi323: peripheral in lowest power state on suspend
git bisect good 16531118ba63dd9bcd65203d04a9c9d6f6800547
# bad: [d9c7ac7f8bfb16f431daa7c77bdfe2b163361ead] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86.git
git bisect bad d9c7ac7f8bfb16f431daa7c77bdfe2b163361ead
# bad: [05536babd768b38d84ad168450f48634a013603d] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect bad 05536babd768b38d84ad168450f48634a013603d
# good: [dabef94a179957db117db344b924e5d5c4074e5f] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
git bisect good dabef94a179957db117db344b924e5d5c4074e5f
# bad: [d4886a325947ecae6867fc858657062211aae3b9] Merge branch into tip/master: 'locking/core'
git bisect bad d4886a325947ecae6867fc858657062211aae3b9
# bad: [51c095bee5c77590d43519f03179342e910d333c] Merge branch into tip/master: 'core/core'
git bisect bad 51c095bee5c77590d43519f03179342e910d333c
# bad: [fc1892becd5672f52329a75c73117b60ac7841b7] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
git bisect bad fc1892becd5672f52329a75c73117b60ac7841b7
# good: [ae04f69de0bef93c7086cf2983dbc8e8fd624ebe] sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()
git bisect good ae04f69de0bef93c7086cf2983dbc8e8fd624ebe
# good: [abc158c82ae555078aa5dd2d8407c3df0f868904] sched: Prepare generic code for delayed dequeue
git bisect good abc158c82ae555078aa5dd2d8407c3df0f868904
# skip: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
git bisect skip 781773e3b68031bd001c0c18aa72e8470c225ebd
# skip: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
git bisect skip e1459a50ba31831efdfc35278023d959e4ba775b
# skip: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
git bisect skip a1c446611e31ca5363d4db51e398271da1dce0af
# good: [e28b5f8bda01720b5ce8456b48cf4b963f9a80a1] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
git bisect good e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
# skip: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
git bisect skip f12e148892ede8d9ee82bcd3e469e6d01fc077ac
# skip: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
git bisect skip 152e11f6df293e816a6a37c69757033cdc72667d
# skip: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
git bisect skip 2e0199df252a536a03f4cb0810324dff523d1e79
# bad: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
git bisect bad 54a58a78779169f9c92a51facf6de7ce94962328
# only skipped commits left to test
# possible first bad commit: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
# possible first bad commit: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
# possible first bad commit: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
# possible first bad commit: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
# possible first bad commit: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
# possible first bad commit: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
# possible first bad commit: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-10 11:04 ` [PATCH 10/24] sched/uclamg: Handle delayed dequeue Luis Machado
@ 2024-09-10 14:05 ` Peter Zijlstra
2024-09-11 8:35 ` Luis Machado
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-10 14:05 UTC (permalink / raw)
To: Luis Machado
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Tue, Sep 10, 2024 at 12:04:11PM +0100, Luis Machado wrote:
> I gave the above patch a try on our Android workload running on the Pixel 6 with a 6.8-based kernel.
>
> First I'd like to confirm that Dietmar's fix that was pushed to tip:sched/core (Fix util_est
> accounting for DELAY_DEQUEUE) helps bring the frequencies and power use down to more sensible levels.
>
> As for the above changes, unfortunately I'm seeing high frequencies and high power usage again. The
> pattern looks similar to what we observed with the uclamp inc/dec imbalance.
:-(
> I haven't investigated this in depth yet, but I'll go stare at some traces and the code, and hopefully
> something will ring bells.
So the first thing to do is trace h_nr_delayed, I suppose; in my own
(limited) testing it was mostly [0,1], correctly correlating with there
being a delayed task on the runqueue.
I'm assuming that removing the usage sites restores function?
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-09-10 12:21 ` Sven Schnelle
@ 2024-09-10 14:07 ` Peter Zijlstra
2024-09-10 14:52 ` Sven Schnelle
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-10 14:07 UTC (permalink / raw)
To: Sven Schnelle
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Tue, Sep 10, 2024 at 02:21:05PM +0200, Sven Schnelle wrote:
> Sven Schnelle <svens@linux.ibm.com> writes:
>
> > Peter Zijlstra <peterz@infradead.org> writes:
> >
> >> Hi all,
> >>
> >> So after much delay this is hopefully the final version of the EEVDF patches.
> >> They've been sitting in my git tree for ever it seems, and people have been
> >> testing it and sending fixes.
> >>
> >> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> >> as I know that was the very last issue holding it back.
> >>
> >> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> >> in tip/sched/core once -rc1 drops.
> >>
> >> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
> >>
> >>
> >> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
> >>
> >> - split up the huge delay-dequeue patch
> >> - tested/fixed cfs-bandwidth
> >> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> >> - SCHED_BATCH is equivalent to RESPECT_SLICE
> >> - propagate min_slice up cgroups
> >> - CLOCK_THREAD_DVFS_ID
> >
> > I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
> >
> > Sometimes the system doesn't manage to print a oops, this one is the best i got:
> >
> > [..]
> > This happens when running the strace test suite. The system normaly has
> > 128 CPUs. With this configuration the crash doesn't happen, but when
> > disabling all but four CPUs and running 'make check -j16' in the strace
> > test suite the crash is almost always reproducable.
I noted: Comm: prctl-sched-cor, which is testing core scheduling, right?
Only today I've merged a fix for that:
c662e2b1e8cf ("sched: Fix sched_delayed vs sched_core")
Could you double check if merging tip/sched/core into your next tree
helps anything at all?
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-09-10 14:07 ` Peter Zijlstra
@ 2024-09-10 14:52 ` Sven Schnelle
0 siblings, 0 replies; 277+ messages in thread
From: Sven Schnelle @ 2024-09-10 14:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Peter Zijlstra <peterz@infradead.org> writes:
> On Tue, Sep 10, 2024 at 02:21:05PM +0200, Sven Schnelle wrote:
>> Sven Schnelle <svens@linux.ibm.com> writes:
>>
>> > Peter Zijlstra <peterz@infradead.org> writes:
>> >
>> >> Hi all,
>> >>
>> >> So after much delay this is hopefully the final version of the EEVDF patches.
>> >> They've been sitting in my git tree for ever it seems, and people have been
>> >> testing it and sending fixes.
>> >>
>> >> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> >> as I know that was the very last issue holding it back.
>> >>
>> >> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> >> in tip/sched/core once -rc1 drops.
>> >>
>> >> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>> >>
>> >>
>> >> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>> >>
>> >> - split up the huge delay-dequeue patch
>> >> - tested/fixed cfs-bandwidth
>> >> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> >> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> >> - propagate min_slice up cgroups
>> >> - CLOCK_THREAD_DVFS_ID
>> >
>> > I'm seeing crashes/warnings like the following on s390 with linux-next 20240909:
>> >
>> > Sometimes the system doesn't manage to print a oops, this one is the best i got:
>> >
>> > [..]
>> > This happens when running the strace test suite. The system normaly has
>> > 128 CPUs. With this configuration the crash doesn't happen, but when
>> > disabling all but four CPUs and running 'make check -j16' in the strace
>> > test suite the crash is almost always reproducable.
>
> I noted: Comm: prctl-sched-cor, which is testing core scheduling, right?
>
> Only today I;ve merged a fix for that:
>
> c662e2b1e8cf ("sched: Fix sched_delayed vs sched_core")
>
> Could you double check if merging tip/sched/core into your next tree
> helps anything at all?
Yes, that fixes the issue. Thanks!
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-10 14:05 ` Peter Zijlstra
@ 2024-09-11 8:35 ` Luis Machado
2024-09-11 8:45 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-09-11 8:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 9/10/24 15:05, Peter Zijlstra wrote:
> On Tue, Sep 10, 2024 at 12:04:11PM +0100, Luis Machado wrote:
>> I gave the above patch a try on our Android workload running on the Pixel 6 with a 6.8-based kernel.
>>
>> First I'd like to confirm that Dietmar's fix that was pushed to tip:sched/core (Fix util_est
>> accounting for DELAY_DEQUEUE) helps bring the frequencies and power use down to more sensible levels.
>>
>> As for the above changes, unfortunately I'm seeing high frequencies and high power usage again. The
>> pattern looks similar to what we observed with the uclamp inc/dec imbalance.
>
> :-(
>
>> I haven't investigated this in depth yet, but I'll go stare at some traces and the code, and hopefully
>> something will ring bells.
>
> So first thing to do is trace h_nr_delayed I suppose, in my own
> (limited) testing that was mostly [0,1] correctly correlating to there
> being a delayed task on the runqueue.
>
> I'm assuming that removing the usage sites restores function?
It does restore function if we remove the usage.
From an initial look:
cat /sys/kernel/debug/sched/debug | grep -i delay
.h_nr_delayed : -4
.h_nr_delayed : -6
.h_nr_delayed : -1
.h_nr_delayed : -6
.h_nr_delayed : -1
.h_nr_delayed : -1
.h_nr_delayed : -5
.h_nr_delayed : -6
So probably an unexpected decrement or lack of an increment somewhere.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 8:35 ` Luis Machado
@ 2024-09-11 8:45 ` Peter Zijlstra
2024-09-11 8:55 ` Luis Machado
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-11 8:45 UTC (permalink / raw)
To: Luis Machado
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
> On 9/10/24 15:05, Peter Zijlstra wrote:
> > On Tue, Sep 10, 2024 at 12:04:11PM +0100, Luis Machado wrote:
> >> I gave the above patch a try on our Android workload running on the Pixel 6 with a 6.8-based kernel.
> >>
> >> First I'd like to confirm that Dietmar's fix that was pushed to tip:sched/core (Fix util_est
> >> accounting for DELAY_DEQUEUE) helps bring the frequencies and power use down to more sensible levels.
> >>
> >> As for the above changes, unfortunately I'm seeing high frequencies and high power usage again. The
> >> pattern looks similar to what we observed with the uclamp inc/dec imbalance.
> >
> > :-(
> >
> >> I haven't investigated this in depth yet, but I'll go stare at some traces and the code, and hopefully
> >> something will ring bells.
> >
> > So first thing to do is trace h_nr_delayed I suppose, in my own
> > (limited) testing that was mostly [0,1] correctly correlating to there
> > being a delayed task on the runqueue.
> >
> > I'm assuming that removing the usage sites restores function?
>
> It does restore function if we remove the usage.
>
> From an initial look:
>
> cat /sys/kernel/debug/sched/debug | grep -i delay
> .h_nr_delayed : -4
> .h_nr_delayed : -6
> .h_nr_delayed : -1
> .h_nr_delayed : -6
> .h_nr_delayed : -1
> .h_nr_delayed : -1
> .h_nr_delayed : -5
> .h_nr_delayed : -6
>
> So probably an unexpected decrement or lack of an increment somewhere.
Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
out. I'll see if I can reproduce that.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 8:45 ` Peter Zijlstra
@ 2024-09-11 8:55 ` Luis Machado
2024-09-11 9:10 ` Mike Galbraith
2024-09-11 10:46 ` Luis Machado
2 siblings, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-09-11 8:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 9/11/24 09:45, Peter Zijlstra wrote:
> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
>> On 9/10/24 15:05, Peter Zijlstra wrote:
>>> On Tue, Sep 10, 2024 at 12:04:11PM +0100, Luis Machado wrote:
>>>> I gave the above patch a try on our Android workload running on the Pixel 6 with a 6.8-based kernel.
>>>>
>>>> First I'd like to confirm that Dietmar's fix that was pushed to tip:sched/core (Fix util_est
>>>> accounting for DELAY_DEQUEUE) helps bring the frequencies and power use down to more sensible levels.
>>>>
>>>> As for the above changes, unfortunately I'm seeing high frequencies and high power usage again. The
>>>> pattern looks similar to what we observed with the uclamp inc/dec imbalance.
>>>
>>> :-(
>>>
>>>> I haven't investigated this in depth yet, but I'll go stare at some traces and the code, and hopefully
>>>> something will ring bells.
>>>
>>> So first thing to do is trace h_nr_delayed I suppose, in my own
>>> (limited) testing that was mostly [0,1] correctly correlating to there
>>> being a delayed task on the runqueue.
>>>
>>> I'm assuming that removing the usage sites restores function?
>>
>> It does restore function if we remove the usage.
>>
>> From an initial look:
>>
>> cat /sys/kernel/debug/sched/debug | grep -i delay
>> .h_nr_delayed : -4
>> .h_nr_delayed : -6
>> .h_nr_delayed : -1
>> .h_nr_delayed : -6
>> .h_nr_delayed : -1
>> .h_nr_delayed : -1
>> .h_nr_delayed : -5
>> .h_nr_delayed : -6
>>
>> So probably an unexpected decrement or lack of an increment somewhere.
>
> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
> out. I'll see if I can reproduce that.
I'll keep looking on my end as well. I'm trying to capture the first time it goes bad. For some
reason my SCHED_WARN_ON didn't trigger when it should've.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 8:45 ` Peter Zijlstra
2024-09-11 8:55 ` Luis Machado
@ 2024-09-11 9:10 ` Mike Galbraith
2024-09-11 9:13 ` Peter Zijlstra
` (2 more replies)
2024-09-11 10:46 ` Luis Machado
2 siblings, 3 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-09-11 9:10 UTC (permalink / raw)
To: Peter Zijlstra, Luis Machado
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
> > >
> > > I'm assuming that removing the usage sites restores function?
> >
> > It does restore function if we remove the usage.
> >
> > From an initial look:
> >
> > cat /sys/kernel/debug/sched/debug | grep -i delay
> > .h_nr_delayed : -4
> > .h_nr_delayed : -6
> > .h_nr_delayed : -1
> > .h_nr_delayed : -6
> > .h_nr_delayed : -1
> > .h_nr_delayed : -1
> > .h_nr_delayed : -5
> > .h_nr_delayed : -6
> >
> > So probably an unexpected decrement or lack of an increment somewhere.
>
> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
> out. I'll see if I can reproduce that.
Hm, would be interesting to know how the heck he's triggering that.
My x86_64 box refuses to produce any such artifacts with anything I've
tossed at it, including full LTP with enterprise RT and !RT configs,
both in master and my local SLE15-SP7 branch. Hohum.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 9:10 ` Mike Galbraith
@ 2024-09-11 9:13 ` Peter Zijlstra
2024-09-11 9:27 ` Mike Galbraith
2024-09-11 11:49 ` Dietmar Eggemann
2024-09-11 9:38 ` Luis Machado
2024-09-12 12:58 ` Luis Machado
2 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-09-11 9:13 UTC (permalink / raw)
To: Mike Galbraith
Cc: Luis Machado, Dietmar Eggemann, Vincent Guittot, Hongyan Xia,
mingo, juri.lelli, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Wed, Sep 11, 2024 at 11:10:26AM +0200, Mike Galbraith wrote:
> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
> > On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
> > > >
> > > > I'm assuming that removing the usage sites restores function?
> > >
> > > It does restore function if we remove the usage.
> > >
> > > From an initial look:
> > >
> > > cat /sys/kernel/debug/sched/debug | grep -i delay
> > > .h_nr_delayed : -4
> > > .h_nr_delayed : -6
> > > .h_nr_delayed : -1
> > > .h_nr_delayed : -6
> > > .h_nr_delayed : -1
> > > .h_nr_delayed : -1
> > > .h_nr_delayed : -5
> > > .h_nr_delayed : -6
> > >
> > > So probably an unexpected decrement or lack of an increment somewhere.
> >
> > Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
> > out. I'll see if I can reproduce that.
>
> Hm, would be interesting to know how the heck he's triggering that.
>
> My x86_64 box refuses to produce any such artifacts with anything I've
> tossed at it, including full LTP with enterprise RT and !RT configs,
> both in master and my local SLE15-SP7 branch. Hohum.
Yeah, my hackbench runs also didn't show that. Perhaps something funny
with cgroups. I didn't test cgroup bandwidth, for example.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 9:13 ` Peter Zijlstra
@ 2024-09-11 9:27 ` Mike Galbraith
2024-09-12 14:00 ` Mike Galbraith
2024-09-11 11:49 ` Dietmar Eggemann
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-09-11 9:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luis Machado, Dietmar Eggemann, Vincent Guittot, Hongyan Xia,
mingo, juri.lelli, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Wed, 2024-09-11 at 11:13 +0200, Peter Zijlstra wrote:
> On Wed, Sep 11, 2024 at 11:10:26AM +0200, Mike Galbraith wrote:
> >
> > Hm, would be interesting to know how the heck he's triggering that.
> >
> > My x86_64 box refuses to produce any such artifacts with anything I've
> > tossed at it, including full LTP with enterprise RT and !RT configs,
> > both in master and my local SLE15-SP7 branch. Hohum.
>
> Yeah, my hackbench runs also didn't show that. Perhaps something funny
> with cgroups. I didn't test cgroup bandwidth for exanple.
That's all enabled in the enterprise configs tested with LTP, so it
hypothetically got some testing. I also turned on AUTOGROUP in !RT configs
so cgroups would get some exercise no matter what I'm mucking about with.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 9:10 ` Mike Galbraith
2024-09-11 9:13 ` Peter Zijlstra
@ 2024-09-11 9:38 ` Luis Machado
2024-09-12 12:58 ` Luis Machado
2 siblings, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-09-11 9:38 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On 9/11/24 10:10, Mike Galbraith wrote:
> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
>> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
>>>>
>>>> I'm assuming that removing the usage sites restores function?
>>>
>>> It does restore function if we remove the usage.
>>>
>>> From an initial look:
>>>
>>> cat /sys/kernel/debug/sched/debug | grep -i delay
>>> .h_nr_delayed : -4
>>> .h_nr_delayed : -6
>>> .h_nr_delayed : -1
>>> .h_nr_delayed : -6
>>> .h_nr_delayed : -1
>>> .h_nr_delayed : -1
>>> .h_nr_delayed : -5
>>> .h_nr_delayed : -6
>>>
>>> So probably an unexpected decrement or lack of an increment somewhere.
>>
>> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
>> out. I'll see if I can reproduce that.
>
> Hm, would be interesting to know how the heck he's triggering that.
>
> My x86_64 box refuses to produce any such artifacts with anything I've
> tossed at it, including full LTP with enterprise RT and !RT configs,
> both in master and my local SLE15-SP7 branch. Hohum.
>
> -Mike
From what I can tell, the decrement that makes h_nr_delayed go negative is in
the dequeue_entities path.
First:
	if (!task_sleep && !task_delayed)
		h_nr_delayed = !!se->sched_delayed;
h_nr_delayed is 1 here.
Then we decrement cfs_rq->h_nr_delayed below:
	cfs_rq->h_nr_running -= h_nr_running;
	cfs_rq->idle_h_nr_running -= idle_h_nr_running;
	cfs_rq->h_nr_delayed -= h_nr_delayed;
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 8:45 ` Peter Zijlstra
2024-09-11 8:55 ` Luis Machado
2024-09-11 9:10 ` Mike Galbraith
@ 2024-09-11 10:46 ` Luis Machado
2 siblings, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-09-11 10:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 9/11/24 09:45, Peter Zijlstra wrote:
> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
>> On 9/10/24 15:05, Peter Zijlstra wrote:
>>> On Tue, Sep 10, 2024 at 12:04:11PM +0100, Luis Machado wrote:
>>>> I gave the above patch a try on our Android workload running on the Pixel 6 with a 6.8-based kernel.
>>>>
>>>> First I'd like to confirm that Dietmar's fix that was pushed to tip:sched/core (Fix util_est
>>>> accounting for DELAY_DEQUEUE) helps bring the frequencies and power use down to more sensible levels.
>>>>
>>>> As for the above changes, unfortunately I'm seeing high frequencies and high power usage again. The
>>>> pattern looks similar to what we observed with the uclamp inc/dec imbalance.
>>>
>>> :-(
>>>
>>>> I haven't investigated this in depth yet, but I'll go stare at some traces and the code, and hopefully
>>>> something will ring bells.
>>>
>>> So first thing to do is trace h_nr_delayed I suppose, in my own
>>> (limited) testing that was mostly [0,1] correctly correlating to there
>>> being a delayed task on the runqueue.
>>>
>>> I'm assuming that removing the usage sites restores function?
>>
>> It does restore function if we remove the usage.
>>
>> From an initial look:
>>
>> cat /sys/kernel/debug/sched/debug | grep -i delay
>> .h_nr_delayed : -4
>> .h_nr_delayed : -6
>> .h_nr_delayed : -1
>> .h_nr_delayed : -6
>> .h_nr_delayed : -1
>> .h_nr_delayed : -1
>> .h_nr_delayed : -5
>> .h_nr_delayed : -6
>>
>> So probably an unexpected decrement or lack of an increment somewhere.
>
> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
> out. I'll see if I can reproduce that.
Before reverting it, let me run a few more checks first. Dietmar tells me
he sees sane values for h_nr_delayed on a Juno system.
I just want to make sure I'm not hitting some oddness with the 6.8 kernel
on the Pixel 6.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 9:13 ` Peter Zijlstra
2024-09-11 9:27 ` Mike Galbraith
@ 2024-09-11 11:49 ` Dietmar Eggemann
1 sibling, 0 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-09-11 11:49 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith
Cc: Luis Machado, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On 11/09/2024 11:13, Peter Zijlstra wrote:
> On Wed, Sep 11, 2024 at 11:10:26AM +0200, Mike Galbraith wrote:
>> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
>>> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
[...]
>>>> So probably an unexpected decrement or lack of an increment somewhere.
>>>
>>> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
>>> out. I'll see if I can reproduce that.
>>
>> Hm, would be interesting to know how the heck he's triggering that.
>>
>> My x86_64 box refuses to produce any such artifacts with anything I've
>> tossed at it, including full LTP with enterprise RT and !RT configs,
>> both in master and my local SLE15-SP7 branch. Hohum.
>
> Yeah, my hackbench runs also didn't show that. Perhaps something funny
> with cgroups. I didn't test cgroup bandwidth for exanple.
Don't see it either on my Arm64 Juno-r0 (6 CPUs) with:
cgexec -g cpu:/A/B/C hackbench -l 1000
We are checking the Pixel6 now.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 9:10 ` Mike Galbraith
2024-09-11 9:13 ` Peter Zijlstra
2024-09-11 9:38 ` Luis Machado
@ 2024-09-12 12:58 ` Luis Machado
2024-09-12 20:44 ` Dietmar Eggemann
2 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-09-12 12:58 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On 9/11/24 10:10, Mike Galbraith wrote:
> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
>> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
>>>>
>>>> I'm assuming that removing the usage sites restores function?
>>>
>>> It does restore function if we remove the usage.
>>>
>>> From an initial look:
>>>
>>> cat /sys/kernel/debug/sched/debug | grep -i delay
>>> .h_nr_delayed : -4
>>> .h_nr_delayed : -6
>>> .h_nr_delayed : -1
>>> .h_nr_delayed : -6
>>> .h_nr_delayed : -1
>>> .h_nr_delayed : -1
>>> .h_nr_delayed : -5
>>> .h_nr_delayed : -6
>>>
>>> So probably an unexpected decrement or lack of an increment somewhere.
>>
>> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
>> out. I'll see if I can reproduce that.
>
> Hm, would be interesting to know how the heck he's triggering that.
>
> My x86_64 box refuses to produce any such artifacts with anything I've
> tossed at it, including full LTP with enterprise RT and !RT configs,
> both in master and my local SLE15-SP7 branch. Hohum.
>
> -Mike
Ok, I seem to have narrowed this down to scheduler class switching. In particular
switched_from_fair.
Valentin's patch (75b6499024a6c1a4ef0288f280534a5c54269076
sched/fair: Properly deactivate sched_delayed task upon class change) introduced
finish_delayed_dequeue_entity, which takes care of cleaning up the state of delayed-dequeue
tasks during class change. Things work fine (minus delayed task accounting) up to this point.
When Peter introduced his patch to do h_nr_delayed accounting, we modified
finish_delayed_dequeue_entity to also call clear_delayed, instead of simply
zeroing se->sched_delayed.
The call to clear_delayed decrements the rq's h_nr_delayed, and it is also used elsewhere
to clean up the state of delayed-dequeue tasks, in order to share some common code.
With that said, my testing on Android shows that when we hit switched_from_fair while
switching sched classes (due to some RT task), we're in a state where...
1 - We already called into dequeue_entities for this delayed task.
2 - We tested true for the !task_sleep && !task_delayed condition.
3 - se->sched_delayed is true, so h_nr_delayed == 1.
4 - We carry on executing the rest of dequeue_entities and decrement the rq's h_nr_running by 1.
In switched_from_fair, after the above events, we call into finish_delayed_dequeue_entity -> clear_delayed
and do yet another decrement to the rq's h_nr_delayed, now potentially making it negative. As
a consequence, we probably misuse the negative value and adjust the frequencies incorrectly. I
think this is the issue I'm seeing.
It is worth pointing out that even with the Android setup, things only go bad when there is enough
competition and switching of classes (lots of screen events etc).
My suggestion of a fix (below), still under testing, is to inline the delayed-dequeue and the lag zeroing
cleanup within switched_from_fair instead of calling finish_delayed_dequeue_entity. Or maybe
drop finish_delayed_dequeue_entity and inline its contents into its callers.
The rest of Peter's patch introducing h_nr_delayed seems OK as far as I could test.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993ac282a83..f8df2f8d2e11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13168,7 +13168,9 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	 * related to sched_delayed being true and that wasn't done
 	 * due to the generic dequeue not using DEQUEUE_DELAYED.
 	 */
-	finish_delayed_dequeue_entity(&p->se);
+	p->se.sched_delayed = 0;
+	if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
+		p->se.vlag = 0;
 	p->se.rel_deadline = 0;
 	__block_task(rq, p);
 }
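To make the sequence above concrete, here is a tiny self-contained userspace
model (not kernel code; the two-level hierarchy, the toy_* names and the
helper functions are only assumed to mirror the snippets quoted in this
thread). It shows how a second clear_delayed()-style decrement, after
dequeue_entities() has already accounted the delayed task, drives the
counter negative:
/* Toy model, not kernel code: all names and the two-level hierarchy are
 * made up to mirror the snippets quoted in this thread. */
#include <stdio.h>
#include <stdbool.h>
struct toy_cfs_rq { long h_nr_delayed; };
/* two "hierarchy levels": the task's cfs_rq and the root cfs_rq */
static struct toy_cfs_rq level[2];
static void enqueue_delayed(void)
{
	/* a delayed task is counted once per level */
	for (int i = 0; i < 2; i++)
		level[i].h_nr_delayed++;
}
/* dequeue_entities() path: !task_sleep && !task_delayed, se->sched_delayed */
static void dequeue_entities_path(bool sched_delayed)
{
	long h_nr_delayed = !!sched_delayed;
	for (int i = 0; i < 2; i++)
		level[i].h_nr_delayed -= h_nr_delayed;
}
/* switched_from_fair() -> finish_delayed_dequeue_entity() -> clear_delayed() */
static void clear_delayed_path(void)
{
	for (int i = 0; i < 2; i++)
		level[i].h_nr_delayed--;
}
int main(void)
{
	enqueue_delayed();           /* delayed task on the rq:   1,  1 */
	dequeue_entities_path(true); /* class switch dequeues it: 0,  0 */
	clear_delayed_path();        /* second decrement:        -1, -1 */
	for (int i = 0; i < 2; i++)
		printf("level %d h_nr_delayed = %ld\n", i, level[i].h_nr_delayed);
	return 0;
}
Compiled with a plain cc, it prints -1 for both levels, matching the negative
h_nr_delayed values seen in the debugfs output earlier in the thread.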
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-11 9:27 ` Mike Galbraith
@ 2024-09-12 14:00 ` Mike Galbraith
2024-09-13 16:39 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-09-12 14:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luis Machado, Dietmar Eggemann, Vincent Guittot, Hongyan Xia,
mingo, juri.lelli, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Wed, 2024-09-11 at 11:27 +0200, Mike Galbraith wrote:
> On Wed, 2024-09-11 at 11:13 +0200, Peter Zijlstra wrote:
> > On Wed, Sep 11, 2024 at 11:10:26AM +0200, Mike Galbraith wrote:
> > >
> > > Hm, would be interesting to know how the heck he's triggering that.
> > >
> > > My x86_64 box refuses to produce any such artifacts with anything I've
> > > tossed at it, including full LTP with enterprise RT and !RT configs,
> > > both in master and my local SLE15-SP7 branch. Hohum.
> >
> > Yeah, my hackbench runs also didn't show that. Perhaps something funny
> > with cgroups. I didn't test cgroup bandwidth for exanple.
>
> That's all on in enterprise configs tested with LTP, so hypothetically
> got some testing. I also turned on AUTOGROUP in !RT configs so cgroups
> would get some exercise no matter what I'm mucking about with.
Oho, I just hit a bug in tip where pick_eevdf() returns NULL in
pick_next_entity() and we deref it, one I recall having seen someone else
mention hitting. LTP was chugging away doing lord knows what when
evolution apparently decided to check accounts, which didn't go well.
state=TASK_WAKING(?), on_rq=0, on_cpu=1, cfs_rq.nr_running=0
crash> bt -sx
PID: 29024 TASK: ffff9118b7583300 CPU: 1 COMMAND: "pool-evolution"
#0 [ffffa939dfd0f930] machine_kexec+0x1a0 at ffffffffab886cc0
#1 [ffffa939dfd0f990] __crash_kexec+0x6a at ffffffffab99496a
#2 [ffffa939dfd0fa50] crash_kexec+0x23 at ffffffffab994e33
#3 [ffffa939dfd0fa60] oops_end+0xbe at ffffffffab844b4e
#4 [ffffa939dfd0fa80] page_fault_oops+0x151 at ffffffffab898fc1
#5 [ffffa939dfd0fb08] exc_page_fault+0x6b at ffffffffac3a410b
#6 [ffffa939dfd0fb30] asm_exc_page_fault+0x22 at ffffffffac400ac2
[exception RIP: pick_task_fair+113]
RIP: ffffffffab8fb471 RSP: ffffa939dfd0fbe0 RFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff91180735ee00 RCX: 000b709eab0437d5
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff91180735ee00
RBP: ffff91180735f400 R8: 00000000000001d9 R9: 0000000000000000
R10: ffff911a8ecb9380 R11: 0000000000000000 R12: ffff911a8eab89c0
R13: ffff911a8eab8a40 R14: ffffffffacafc373 R15: ffff9118b7583300
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffa939dfd0fc08] pick_next_task_fair+0x48 at ffffffffab9013b8
#8 [ffffa939dfd0fc48] __schedule+0x1d9 at ffffffffac3aab39
#9 [ffffa939dfd0fcf8] schedule+0x24 at ffffffffac3ac084
#10 [ffffa939dfd0fd10] futex_wait_queue+0x63 at ffffffffab98e353
#11 [ffffa939dfd0fd38] __futex_wait+0x139 at ffffffffab98e989
#12 [ffffa939dfd0fdf0] futex_wait+0x6a at ffffffffab98ea5a
#13 [ffffa939dfd0fe80] do_futex+0x88 at ffffffffab98a9f8
#14 [ffffa939dfd0fe90] __x64_sys_futex+0x5e at ffffffffab98ab0e
#15 [ffffa939dfd0ff00] do_syscall_64+0x74 at ffffffffac39ce44
#16 [ffffa939dfd0ff40] entry_SYSCALL_64_after_hwframe+0x4b at ffffffffac4000ac
RIP: 00007fd6b991a849 RSP: 00007fd6813ff6e8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000a6c RCX: 00007fd6b991a849
RDX: 0000000000000a6c RSI: 0000000000000080 RDI: 00005631abf620c0
RBP: 00005631abf620b8 R8: 00007fd6bad0a080 R9: 00000000000015fe
R10: 00007fd6813ff700 R11: 0000000000000246 R12: 00005631abf620b0
R13: 00005631abf620b0 R14: 00005631abf620b8 R15: 0000000000000000
ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
crash> dis pick_task_fair+113
0xffffffffab8fb471 <pick_task_fair+113>: cmpb $0x0,0x51(%rax)
crash> gdb list *pick_task_fair+113
0xffffffffab8fb471 is in pick_task_fair (kernel/sched/fair.c:5639).
5634 SCHED_WARN_ON(cfs_rq->next->sched_delayed);
5635 return cfs_rq->next;
5636 }
5637
5638 struct sched_entity *se = pick_eevdf(cfs_rq);
5639 if (se->sched_delayed) {
5640 dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
5641 SCHED_WARN_ON(se->sched_delayed);
5642 SCHED_WARN_ON(se->on_rq);
5643 return NULL;
crash> task_struct -x 0xffff9118b7583300 | grep "__state ="
__state = 0x200,
crash> task_struct -x 0xffff9118b7583300 | grep rq
on_rq = 0x0,
on_rq = 0x0,
cfs_rq = 0xffff9117e81a3e00,
on_rq = 0x0,
rq = 0x0,
crash> task_struct -xo | grep sched_entity
[0x80] struct sched_entity se
crash> sched_entity 0xffff9118b7583380
struct sched_entity {
load = {
weight = 1048576,
inv_weight = 4194304
},
run_node = {
__rb_parent_color = 1,
rb_right = 0x0,
rb_left = 0x0
},
deadline = 5788784166,
min_vruntime = 5785784166,
min_slice = 3000000,
group_node = {
next = 0xffff9118b75833c0,
prev = 0xffff9118b75833c0
},
on_rq = 0 '\000',
sched_delayed = 0 '\000',
rel_deadline = 0 '\000',
custom_slice = 0 '\000',
exec_start = 5630407844294,
sum_exec_runtime = 5031478,
prev_sum_exec_runtime = 5004139,
vruntime = 5785811505,
vlag = 0,
slice = 3000000,
nr_migrations = 0,
depth = 1,
parent = 0xffff9117e81a0600,
cfs_rq = 0xffff9117e81a3e00,
my_q = 0x0,
runnable_weight = 0,
avg = {
last_update_time = 5630386353152,
load_sum = 2555,
runnable_sum = 2617274,
util_sum = 83342,
period_contrib = 877,
load_avg = 39,
runnable_avg = 39,
util_avg = 1,
util_est = 2147483760
}
}
crash> cfs_rq 0xffff9117e81a3e00
struct cfs_rq {
load = {
weight = 0,
inv_weight = 0
},
nr_running = 0,
h_nr_running = 0,
idle_nr_running = 0,
idle_h_nr_running = 0,
h_nr_delayed = 0,
avg_vruntime = 0,
avg_load = 0,
min_vruntime = 5785811505,
forceidle_seq = 0,
min_vruntime_fi = 0,
tasks_timeline = {
rb_root = {
rb_node = 0x0
},
rb_leftmost = 0x0
},
curr = 0xffff9118b7583380,
next = 0x0,
avg = {
last_update_time = 5630386353152,
load_sum = 2617381,
runnable_sum = 2617379,
util_sum = 83417,
period_contrib = 877,
load_avg = 39,
runnable_avg = 39,
util_avg = 1,
util_est = 0
},
removed = {
lock = {
raw_lock = {
{
val = {
counter = 0
},
{
locked = 0 '\000',
pending = 0 '\000'
},
{
locked_pending = 0,
tail = 0
}
}
}
},
nr = 0,
load_avg = 0,
util_avg = 0,
runnable_avg = 0
},
last_update_tg_load_avg = 5630407057919,
tg_load_avg_contrib = 39,
propagate = 0,
prop_runnable_sum = 0,
h_load = 0,
last_h_load_update = 4296299815,
h_load_next = 0x0,
rq = 0xffff911a8eab89c0,
on_list = 1,
leaf_cfs_rq_list = {
next = 0xffff911794a2d348,
prev = 0xffff9119ebe62148
},
tg = 0xffff91178434a080,
idle = 0,
runtime_enabled = 0,
runtime_remaining = 0,
throttled_pelt_idle = 0,
throttled_clock = 0,
throttled_clock_pelt = 0,
throttled_clock_pelt_time = 0,
throttled_clock_self = 0,
throttled_clock_self_time = 0,
throttled = 0,
throttle_count = 0,
throttled_list = {
next = 0xffff9117e81a3fa8,
prev = 0xffff9117e81a3fa8
},
throttled_csd_list = {
next = 0xffff9117e81a3fb8,
prev = 0xffff9117e81a3fb8
}
}
crash>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-12 12:58 ` Luis Machado
@ 2024-09-12 20:44 ` Dietmar Eggemann
0 siblings, 0 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-09-12 20:44 UTC (permalink / raw)
To: Luis Machado, Mike Galbraith, Peter Zijlstra
Cc: Vincent Guittot, Hongyan Xia, mingo, juri.lelli, rostedt, bsegall,
mgorman, vschneid, linux-kernel, kprateek.nayak, wuyun.abel,
youssefesmat, tglx
On 12/09/2024 14:58, Luis Machado wrote:
> On 9/11/24 10:10, Mike Galbraith wrote:
>> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
>>> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
[...]
> Ok, I seem to have narrowed this down to scheduler class switching. In particular
> switched_from_fair.
>
> Valentin's patch (75b6499024a6c1a4ef0288f280534a5c54269076
> sched/fair: Properly deactivate sched_delayed task upon class change) introduced
> finish_delayed_dequeue_entity, which takes care of cleaning up the state of delayed-dequeue
> tasks during class change. Things work fine (minus delayed task accounting) up to this point.
>
> When Peter introduced his patch to do h_nr_delayed accounting, we modified
> finish_delayed_dequeue_entity to also call clear_delayed, instead of simply
> zeroing se->sched_delayed.
>
> The call to clear_delayed decrements the rq's h_nr_delayed, and it gets used elsewhere
> to cleanup the state of delayed-dequeue tasks, in order to share some common code.
>
> With that said, my testing on Android shows that when we hit switched_from_fair during
> switching sched classes (due to some RT task), we're in a state where...
>
> 1 - We already called into dequeue_entities for this delayed task.
> 2 - We tested true for the !task_sleep && !task_delayed condition.
> 3 - se->sched_delayed is true, so h_nr_delayed == 1.
> 4 - We carry on executing the rest of dequeue_entities and decrement the rq's h_nr_running by 1.
>
> In switched_from_fair, after the above events, we call into finish_delayed_dequeue_entity -> clear_delayed
> and do yet another decrement to the rq's h_nr_delayed, now potentially making it negative. As
> a consequence, we probably misuse the negative value and adjust the frequencies incorrectly. I
> think this is the issue I'm seeing.
>
> It is worth pointing out that even with the Android setup, things only go bad when there is enough
> competition and switching of classes (lots of screen events etc).
>
> My suggestion of a fix (below), still under testing, is to inline the delayed-dequeue and the lag zeroing
> cleanup within switched_from_fair instead of calling finish_delayed_dequeue_entity. Or maybe
> drop finish_delayed_dequeue_entity and inline its contents into its callers.
>
> The rest of Peter's patch introducing h_nr_delayed seems OK as far as I could test.
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993ac282a83..f8df2f8d2e11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13168,7 +13168,9 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
> * related to sched_delayed being true and that wasn't done
> * due to the generic dequeue not using DEQUEUE_DELAYED.
> */
> - finish_delayed_dequeue_entity(&p->se);
> + p->se.sched_delayed = 0;
> + if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
> + p->se.vlag = 0;
> p->se.rel_deadline = 0;
> __block_task(rq, p);
> }
I could recreate this on QEMU with:
@@ -5473,6 +5473,7 @@ static void clear_delayed(struct sched_entity *se)
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 		cfs_rq->h_nr_delayed--;
+		BUG_ON((int)cfs_rq->h_nr_delayed < 0);
 		if (cfs_rq_throttled(cfs_rq))
 			break;
 	}
running:
# while(true); do chrt -rr -p 50 $$; chrt -o -p 0 $$; done
in one shell and:
# hackbench
in another.
[ 318.490522] ------------[ cut here ]------------
[ 318.490969] kernel BUG at kernel/sched/fair.c:5476!
[ 318.491411] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 318.491964] CPU: 3 UID: 0 PID: 68053 Comm: chrt Not tainted 6.11.0-rc1-00066-g2e05f6c71d36-dirty #23
[ 318.492604] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.4
[ 318.494192] RIP: 0010:switched_from_fair+0x67/0xe0
[ 318.494899] Code: ff ff c6 85 d1 00 00 00 00 48 85 db 75 0e eb 1c 48 8b 9b 98 00 00 00 48 85 db 74 10 48 8b 0
[ 318.496491] RSP: 0018:ffffb1154bc63e20 EFLAGS: 00010097
[ 318.496991] RAX: 0000000000000001 RBX: ffff92668844e200 RCX: ffff9266fddadea8
[ 318.497681] RDX: ffff926684213e00 RSI: ffff9266fddadea8 RDI: ffff92668844e608
[ 318.498339] RBP: ffff92668844e180 R08: ffff92668106c44c R09: ffff92668119c0b8
[ 318.498940] R10: ffff926681f42000 R11: ffffffffb1e155d0 R12: ffff9266fddad640
[ 318.499525] R13: ffffb1154bc63ed8 R14: 0000000000000078 R15: ffff9266fddad640
[ 318.500234] FS: 00007f68f52bf740(0000) GS:ffff9266fdd80000(0000) knlGS:0000000000000000
[ 318.500837] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 318.501261] CR2: 00007f68f53afdd0 CR3: 0000000006a58002 CR4: 0000000000370ef0
[ 318.501798] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 318.502385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 318.502919] Call Trace:
[ 318.503118] <TASK>
[ 318.503284] ? die+0x32/0x90
[ 318.503508] ? do_trap+0xd8/0x100
[ 318.503770] ? switched_from_fair+0x67/0xe0
[ 318.504085] ? do_error_trap+0x60/0x80
[ 318.504374] ? switched_from_fair+0x67/0xe0
[ 318.504652] ? exc_invalid_op+0x53/0x70
[ 318.504995] ? switched_from_fair+0x67/0xe0
[ 318.505270] ? asm_exc_invalid_op+0x1a/0x20
[ 318.505588] ? switched_from_fair+0x67/0xe0
[ 318.505929] check_class_changed+0x2a/0x80
[ 318.506236] __sched_setscheduler+0x1f3/0x920
[ 318.506526] do_sched_setscheduler+0xfd/0x1c0
[ 318.506867] ? do_sys_openat2+0x7c/0xc0
[ 318.507141] __x64_sys_sched_setscheduler+0x1a/0x30
[ 318.507462] do_syscall_64+0x9e/0x1a0
[ 318.507722] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 318.508127] RIP: 0033:0x7f68f53c3719
[ 318.508369] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 8
[ 318.509941] RSP: 002b:00007fff8d920578 EFLAGS: 00000246 ORIG_RAX: 0000000000000090
[ 318.511178] RAX: ffffffffffffffda RBX: 00007fff8d920610 RCX: 00007f68f53c3719
[ 318.512105] RDX: 00007fff8d92058c RSI: 0000000000000002 RDI: 0000000000000103
[ 318.512676] RBP: 0000000000000103 R08: 0000000000000000 R09: 0000000000000000
[ 318.513296] R10: 1999999999999999 R11: 0000000000000246 R12: 0000000000000002
[ 318.513917] R13: 0000000000000002 R14: 0000000000000032 R15: 0000000000000103
[ 318.514617] </TASK>
[ 318.514861] Modules linked in:
[ 318.515132] ---[ end trace 0000000000000000 ]---
[ 318.515466] RIP: 0010:switched_from_fair+0x67/0xe0
[ 318.515942] Code: ff ff c6 85 d1 00 00 00 00 48 85 db 75 0e eb 1c 48 8b 9b 98 00 00 00 48 85 db 74 10 48 8b 0
[ 318.517411] RSP: 0018:ffffb1154bc63e20 EFLAGS: 00010097
[ 318.517819] RAX: 0000000000000001 RBX: ffff92668844e200 RCX: ffff9266fddadea8
[ 318.518354] RDX: ffff926684213e00 RSI: ffff9266fddadea8 RDI: ffff92668844e608
[ 318.518946] RBP: ffff92668844e180 R08: ffff92668106c44c R09: ffff92668119c0b8
[ 318.519527] R10: ffff926681f42000 R11: ffffffffb1e155d0 R12: ffff9266fddad640
[ 318.520138] R13: ffffb1154bc63ed8 R14: 0000000000000078 R15: ffff9266fddad640
[ 318.520684] FS: 00007f68f52bf740(0000) GS:ffff9266fdd80000(0000) knlGS:0000000000000000
[ 318.521350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 318.521745] CR2: 00007f68f53afdd0 CR3: 0000000006a58002 CR4: 0000000000370ef0
[ 318.522306] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 318.522896] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 318.523388] note: chrt[68053] exited with irqs disabled
With your proposed fix the issue goes away.
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-12 14:00 ` Mike Galbraith
@ 2024-09-13 16:39 ` Mike Galbraith
2024-09-14 3:40 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-09-13 16:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luis Machado, Dietmar Eggemann, Vincent Guittot, Hongyan Xia,
mingo, juri.lelli, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Thu, 2024-09-12 at 16:00 +0200, Mike Galbraith wrote:
>
> Oho, I just hit a pick_eevdf() returns NULL in pick_next_entity() and
> we deref it bug in tip that I recall having seen someone else mention
> them having hit. LTP was chugging away doing lord knows what when
> evolution apparently decided to check accounts, which didn't go well.
BTW, what LTP was up to was cfs_bandwidth01. I reproduced the crash
using it on master with the full patch set pretty quickly, but rebased
tip 4c293d0fa315 (sans sched/eevdf: More PELT vs DELAYED_DEQUEUE) seems
to be stable.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-13 16:39 ` Mike Galbraith
@ 2024-09-14 3:40 ` Mike Galbraith
2024-09-24 15:16 ` Luis Machado
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-09-14 3:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Luis Machado, Dietmar Eggemann, Vincent Guittot, Hongyan Xia,
mingo, juri.lelli, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Fri, 2024-09-13 at 18:39 +0200, Mike Galbraith wrote:
> tip 4c293d0fa315 (sans sched/eevdf: More PELT vs DELAYED_DEQUEUE) seems
> to be stable.
Belay that, it went boom immediately this morning while trying to
trigger a warning I met elsewhere.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-14 3:40 ` Mike Galbraith
@ 2024-09-24 15:16 ` Luis Machado
2024-09-24 17:35 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-09-24 15:16 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On 9/14/24 04:40, Mike Galbraith wrote:
> On Fri, 2024-09-13 at 18:39 +0200, Mike Galbraith wrote:
>> tip 4c293d0fa315 (sans sched/eevdf: More PELT vs DELAYED_DEQUEUE) seems
>> to be stable.
>
> Belay that, it went boom immediately this morning while trying to
> trigger a warning I met elsewhere.
>
> -Mike
Are you still hitting this one?
I was trying to reproduce this one on the Pixel 6 (manifests as
a boot crash) but couldn't so far. Something might've changed a bit.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-24 15:16 ` Luis Machado
@ 2024-09-24 17:35 ` Mike Galbraith
2024-09-25 5:14 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-09-24 17:35 UTC (permalink / raw)
To: Luis Machado, Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Tue, 2024-09-24 at 16:16 +0100, Luis Machado wrote:
> On 9/14/24 04:40, Mike Galbraith wrote:
> > On Fri, 2024-09-13 at 18:39 +0200, Mike Galbraith wrote:
> > > tip 4c293d0fa315 (sans sched/eevdf: More PELT vs DELAYED_DEQUEUE)
> > > seems
> > > to be stable.
> >
> > Belay that, it went boom immediately this morning while trying to
> > trigger a warning I met elsewhere.
> >
> > -Mike
>
> Are you still hitting this one?
>
> I was trying to reproduce this one on the Pixel 6 (manifests as
> a boot crash) but couldn't so far. Something might've changed a bit.
I'm also having no luck triggering it.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue
2024-09-24 17:35 ` Mike Galbraith
@ 2024-09-25 5:14 ` Mike Galbraith
0 siblings, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-09-25 5:14 UTC (permalink / raw)
To: Luis Machado, Peter Zijlstra
Cc: Dietmar Eggemann, Vincent Guittot, Hongyan Xia, mingo, juri.lelli,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Tue, 2024-09-24 at 19:35 +0200, Mike Galbraith wrote:
> On Tue, 2024-09-24 at 16:16 +0100, Luis Machado wrote:
> > On 9/14/24 04:40, Mike Galbraith wrote:
> > > On Fri, 2024-09-13 at 18:39 +0200, Mike Galbraith wrote:
> > > > tip 4c293d0fa315 (sans sched/eevdf: More PELT vs DELAYED_DEQUEUE)
> > > > seems
> > > > to be stable.
> > >
> > > Belay that, it went boom immediately this morning while trying to
> > > trigger a warning I met elsewhere.
> > >
> > > -Mike
> >
> > Are you still hitting this one?
> >
> > I was trying to reproduce this one on the Pixel 6 (manifests as
> > a boot crash) but couldn't so far. Something might've changed a bit.
>
> I'm also having no luck triggering it.
BTW, that includes hefty RT/!RT config beating sessions with and without..
sched/eevdf: More PELT vs DELAYED_DEQUEUE
..and with and without your mod to same, plus a dequeue-on-unthrottle
patchlet to keep the sched_delayed encounter there from interrupting the hunt.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy
2024-07-27 10:27 ` [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2024-09-29 2:02 ` Tianchen Ding
1 sibling, 0 replies; 277+ messages in thread
From: Tianchen Ding @ 2024-09-29 2:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault, mingo,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel
Hi Peter,
On 2024/7/27 18:27, Peter Zijlstra wrote:
> In the absence of an explicit cgroup slice configureation, make mixed
> slice length work with cgroups by propagating the min_slice up the
> hierarchy.
Would it be acceptable to introduce a cgroup interface (e.g., something like
cpu.fair_runtime or cpu.fair_slice) to override the calculated min_slice?
This could be useful in container scenarios.
>
> This ensures the cgroup entity gets timely service to service its
> entities that have this timing constraint set on them.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/sched.h | 1
> kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 57 insertions(+), 1 deletion(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -542,6 +542,7 @@ struct sched_entity {
> struct rb_node run_node;
> u64 deadline;
> u64 min_vruntime;
> + u64 min_slice;
>
> struct list_head group_node;
> unsigned char on_rq;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -782,6 +782,21 @@ static void update_min_vruntime(struct c
> cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime);
> }
>
> +static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> +{
> + struct sched_entity *root = __pick_root_entity(cfs_rq);
> + struct sched_entity *curr = cfs_rq->curr;
> + u64 min_slice = ~0ULL;
> +
> + if (curr && curr->on_rq)
> + min_slice = curr->slice;
> +
> + if (root)
> + min_slice = min(min_slice, root->min_slice);
If a sched_delayed se keeps its min_slice, then the parent se will receive a
shorter slice (from the sched_delayed se) than it should. Is that a problem?
Thanks.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-08-28 22:38 ` Marek Szyprowski
@ 2024-10-10 2:49 ` Sean Christopherson
2024-10-10 7:57 ` Mike Galbraith
2024-10-10 8:19 ` Peter Zijlstra
0 siblings, 2 replies; 277+ messages in thread
From: Sean Christopherson @ 2024-10-10 2:49 UTC (permalink / raw)
To: Marek Szyprowski
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
efault, kvm
+KVM
On Thu, Aug 29, 2024, Marek Szyprowski wrote:
> On 27.07.2024 12:27, Peter Zijlstra wrote:
> > Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> > noting that lag is fundamentally a temporal measure. It should not be
> > carried around indefinitely.
> >
> > OTOH it should also not be instantly discarded, doing so will allow a
> > task to game the system by purposefully (micro) sleeping at the end of
> > its time quantum.
> >
> > Since lag is intimately tied to the virtual time base, a wall-time
> > based decay is also insufficient, notably competition is required for
> > any of this to make sense.
> >
> > Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> > competing until they are eligible.
> >
> > Strictly speaking, we only care about keeping them until the 0-lag
> > point, but that is a difficult proposition, instead carry them around
> > until they get picked again, and dequeue them at that point.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> This patch landed recently in linux-next as commit 152e11f6df29
> ("sched/fair: Implement delayed dequeue"). In my tests on some of the
> ARM 32bit boards it causes a regression in rtcwake tool behavior - from
> time to time this simple call never ends:
>
> # time rtcwake -s 10 -m on
>
> Reverting this commit (together with its compile dependencies) on top of
> linux-next fixes this issue. Let me know how can I help debugging this
> issue.
This commit broke KVM's posted interrupt handling (and other things), and the root
cause may be the same underlying issue.
TL;DR: Code that checks task_struct.on_rq may be broken by this commit.
KVM's breakage boils down to the preempt notifiers, i.e. kvm_sched_out(), being
invoked with current->on_rq "true" after KVM has explicitly called schedule().
kvm_sched_out() uses current->on_rq to determine if the vCPU is being preempted
(voluntarily or not, doesn't matter), and so waiting until some later point in
time to call __block_task() causes KVM to think the task was preempted, when in
reality it was not.
static void kvm_sched_out(struct preempt_notifier *pn,
struct task_struct *next)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
WRITE_ONCE(vcpu->scheduled_out, true);
if (current->on_rq && vcpu->wants_to_run) { <================
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
kvm_arch_vcpu_put(vcpu);
__this_cpu_write(kvm_running_vcpu, NULL);
}
KVM uses vcpu->preempted for a variety of things, but the most visibly problematic
is waking a vCPU from (virtual) HLT via posted interrupt wakeup. When a vCPU
HLTs, KVM ultimately calls schedule() to schedule out the vCPU until it receives
a wake event.
When a device or another vCPU can post an interrupt as a wake event, KVM mucks
with the blocking vCPU's posted interrupt descriptor so that posted interrupts
that should be wake events get delivered on a dedicated host IRQ vector, allowing
KVM to kick and wake the target vCPU.
But when vcpu->preempted is true, KVM suppresses posted interrupt notifications,
knowing that the vCPU will be scheduled back in. Because a vCPU (task) can be
preempted while KVM is emulating HLT, KVM keys off vcpu->preempted to set PID.SN,
and doesn't exempt the blocking case. In short, KVM uses vcpu->preempted, i.e.
current->on_rq, to differentiate between the vCPU getting preempted and KVM
executing schedule().
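Roughly, and hand-waving heavily (the helper names below are placeholders for
the real posted-interrupt plumbing, not actual functions), the schedule-out
path looks like:

	if (kvm_vcpu_is_blocking(vcpu))
		/* reroute wake-event posted interrupts to the host wakeup vector */
		arm_pi_wakeup_handler(vcpu);		/* placeholder name */

	if (vcpu->preempted)
		/* set PID.SN: suppress notifications, we'll be scheduled back in */
		suppress_pi_notifications(vcpu);	/* placeholder name */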
As a result, the false positive for vcpu->preempted causes KVM to suppress posted
interrupt notifications and the target vCPU never gets its wake event.
Peter,
Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
posted interrupts, so KVM needs equivalent functionality to current->on_rq as it
was before this commit.
@@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
WRITE_ONCE(vcpu->scheduled_out, true);
- if (current->on_rq && vcpu->wants_to_run) {
+ if (se_runnable(¤t->se) && vcpu->wants_to_run) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-10-10 2:49 ` Sean Christopherson
@ 2024-10-10 7:57 ` Mike Galbraith
2024-10-10 16:18 ` Sean Christopherson
2024-10-10 8:19 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-10-10 7:57 UTC (permalink / raw)
To: Sean Christopherson, Marek Szyprowski
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx, kvm
On Wed, 2024-10-09 at 19:49 -0700, Sean Christopherson wrote:
>
> Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
> but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
> posted interrupts, so KVM needs equivalent functionality to current->on-rq as it
> was before this commit.
>
> @@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>
> WRITE_ONCE(vcpu->scheduled_out, true);
>
> - if (current->on_rq && vcpu->wants_to_run) {
> + if (se_runnable(¤t->se) && vcpu->wants_to_run) {
> WRITE_ONCE(vcpu->preempted, true);
> WRITE_ONCE(vcpu->ready, true);
> }
Why is that deemed "obviously not appropriate"? ->on_rq in and of
itself meaning only "on rq" doesn't seem like a bad thing.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-10-10 2:49 ` Sean Christopherson
2024-10-10 7:57 ` Mike Galbraith
@ 2024-10-10 8:19 ` Peter Zijlstra
2024-10-10 9:18 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-10-10 8:19 UTC (permalink / raw)
To: Sean Christopherson
Cc: Marek Szyprowski, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
efault, kvm
On Wed, Oct 09, 2024 at 07:49:54PM -0700, Sean Christopherson wrote:
> TL;DR: Code that checks task_struct.on_rq may be broken by this commit.
Correct, and while I did look at quite a few, I did miss KVM used it,
damn.
> Peter,
>
> Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
> but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
> posted interrupts, so KVM needs equivalent functionality to current->on-rq as it
> was before this commit.
>
> @@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
>
> WRITE_ONCE(vcpu->scheduled_out, true);
>
> - if (current->on_rq && vcpu->wants_to_run) {
> + if (se_runnable(¤t->se) && vcpu->wants_to_run) {
> WRITE_ONCE(vcpu->preempted, true);
> WRITE_ONCE(vcpu->ready, true);
> }
se_runnable() isn't quite right, but yes, a helper along those lines is
probably best. Let me try and grep more to see if there's others I
missed as well :/
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-10-10 8:19 ` Peter Zijlstra
@ 2024-10-10 9:18 ` Peter Zijlstra
2024-10-10 18:23 ` Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-10-10 9:18 UTC (permalink / raw)
To: Sean Christopherson
Cc: Marek Szyprowski, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
efault, kvm
On Thu, Oct 10, 2024 at 10:19:40AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 09, 2024 at 07:49:54PM -0700, Sean Christopherson wrote:
>
> > TL;DR: Code that checks task_struct.on_rq may be broken by this commit.
>
> Correct, and while I did look at quite a few, I did miss KVM used it,
> damn.
>
> > Peter,
> >
> > Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
> > but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
> > posted interrupts, so KVM needs equivalent functionality to current->on-rq as it
> > was before this commit.
> >
> > @@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
> >
> > WRITE_ONCE(vcpu->scheduled_out, true);
> >
> > - if (current->on_rq && vcpu->wants_to_run) {
> > + if (se_runnable(¤t->se) && vcpu->wants_to_run) {
> > WRITE_ONCE(vcpu->preempted, true);
> > WRITE_ONCE(vcpu->ready, true);
> > }
>
> se_runnable() isn't quite right, but yes, a helper along those lines is
> probably best. Let me try and grep more to see if there's others I
> missed as well :/
How's the below? I remember looking at the freezer thing before and
deciding it isn't a correctness thing, but given I added the helper, I
changed it anyway. I've added a bunch of comments, and the perf thing is
similar to KVM: it wants to know about preemptions, so that had to change
too.
---
include/linux/sched.h | 5 +++++
kernel/events/core.c | 2 +-
kernel/freezer.c | 7 ++++++-
kernel/rcu/tasks.h | 9 +++++++++
kernel/sched/core.c | 12 +++++++++---
kernel/time/tick-sched.c | 5 +++++
kernel/trace/trace_selftest.c | 2 +-
virt/kvm/kvm_main.c | 2 +-
8 files changed, 37 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0053f0664847..2b1f454e4575 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2134,6 +2134,11 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif /* CONFIG_SMP */
+static inline bool task_is_runnable(struct task_struct *p)
+{
+ return p->on_rq && !p->se.sched_delayed;
+}
+
extern bool sched_task_on_rq(struct task_struct *p);
extern unsigned long get_wchan(struct task_struct *p);
extern struct task_struct *cpu_curr_snapshot(int cpu);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e3589c4287cb..cdd09769e6c5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9251,7 +9251,7 @@ static void perf_event_switch(struct task_struct *task,
},
};
- if (!sched_in && task->on_rq) {
+ if (!sched_in && task_is_runnable(task)) {
switch_event.event_id.header.misc |=
PERF_RECORD_MISC_SWITCH_OUT_PREEMPT;
}
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 44bbd7dbd2c8..8d530d0949ff 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -109,7 +109,12 @@ static int __set_task_frozen(struct task_struct *p, void *arg)
{
unsigned int state = READ_ONCE(p->__state);
- if (p->on_rq)
+ /*
+ * Allow freezing the sched_delayed tasks; they will not execute until
+ * ttwu() fixes them up, so it is safe to swap their state now, instead
+ * of waiting for them to get fully dequeued.
+ */
+ if (task_is_runnable(p))
return 0;
if (p != current && task_curr(p))
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 6333f4ccf024..4d7ee95df06e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -985,6 +985,15 @@ static bool rcu_tasks_is_holdout(struct task_struct *t)
if (!READ_ONCE(t->on_rq))
return false;
+ /*
+ * t->on_rq && !t->se.sched_delayed *could* be considered sleeping but
+ * since it is a spurious state (it will transition into the
+ * traditional blocked state or get woken up without outside
+ * dependencies), not considering it such should only affect timing.
+ *
+ * Be conservative for now and not include it.
+ */
+
/*
* Idle tasks (or idle injection) within the idle loop are RCU-tasks
* quiescent states. But CPU boot code performed by the idle task
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bacc5cd3693..be5c04eb5ba0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -548,6 +548,11 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
* ON_RQ_MIGRATING state is used for migration without holding both
* rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
*
+ * Additionally it is possible to be ->on_rq but still be considered not
+ * runnable when p->se.sched_delayed is true. These tasks are on the runqueue
+ * but will be dequeued as soon as they get picked again. See the
+ * task_is_runnable() helper.
+ *
* p->on_cpu <- { 0, 1 }:
*
* is set by prepare_task() and cleared by finish_task() such that it will be
@@ -4358,9 +4363,10 @@ static bool __task_needs_rq_lock(struct task_struct *p)
* @arg: Argument to function.
*
* Fix the task in it's current state by avoiding wakeups and or rq operations
- * and call @func(@arg) on it. This function can use ->on_rq and task_curr()
- * to work out what the state is, if required. Given that @func can be invoked
- * with a runqueue lock held, it had better be quite lightweight.
+ * and call @func(@arg) on it. This function can use task_is_runnable() and
+ * task_curr() to work out what the state is, if required. Given that @func
+ * can be invoked with a runqueue lock held, it had better be quite
+ * lightweight.
*
* Returns:
* Whatever @func returns
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184c7090..59efa14ce185 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -435,6 +435,11 @@ static void tick_nohz_kick_task(struct task_struct *tsk)
* tick_nohz_task_switch()
* LOAD p->tick_dep_mask
*/
+ // XXX given a task picks up the dependency on schedule(), should we
+ // only care about tasks that are currently on the CPU instead of all
+ // that are on the runqueue?
+ //
+ // That is, does this want to be: task_on_cpu() / task_curr()?
if (!sched_task_on_rq(tsk))
return;
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index c4ad7cd7e778..1469dd8075fa 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1485,7 +1485,7 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
/* reset the max latency */
tr->max_latency = 0;
- while (p->on_rq) {
+ while (task_is_runnable(p)) {
/*
* Sleep to make sure the -deadline thread is asleep too.
* On virtual machines we can't rely on timings,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05cbb2548d99..0c666f1870af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6387,7 +6387,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
WRITE_ONCE(vcpu->scheduled_out, true);
- if (current->on_rq && vcpu->wants_to_run) {
+ if (task_is_runnable(current) && vcpu->wants_to_run) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-10-10 7:57 ` Mike Galbraith
@ 2024-10-10 16:18 ` Sean Christopherson
2024-10-10 17:12 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Sean Christopherson @ 2024-10-10 16:18 UTC (permalink / raw)
To: Mike Galbraith
Cc: Marek Szyprowski, Peter Zijlstra, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx, kvm
On Thu, Oct 10, 2024, Mike Galbraith wrote:
> On Wed, 2024-10-09 at 19:49 -0700, Sean Christopherson wrote:
> >
> > Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
> > but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
> > posted interrupts, so KVM needs equivalent functionality to current->on-rq as it
> > was before this commit.
> >
> > @@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
> >
> > WRITE_ONCE(vcpu->scheduled_out, true);
> >
> > - if (current->on_rq && vcpu->wants_to_run) {
> > + if (se_runnable(¤t->se) && vcpu->wants_to_run) {
> > WRITE_ONCE(vcpu->preempted, true);
> > WRITE_ONCE(vcpu->ready, true);
> > }
>
> Why is that deemed "obviously not appropriate"? ->on_rq in and of
> itself meaning only "on rq" doesn't seem like a bad thing.
Doh, my wording was unclear. I didn't mean the logic was inappropriate, I meant
that KVM shouldn't be poking into an internal sched/ helper.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-10-10 16:18 ` Sean Christopherson
@ 2024-10-10 17:12 ` Mike Galbraith
0 siblings, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-10-10 17:12 UTC (permalink / raw)
To: Sean Christopherson
Cc: Marek Szyprowski, Peter Zijlstra, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx, kvm
On Thu, 2024-10-10 at 09:18 -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Mike Galbraith wrote:
> > On Wed, 2024-10-09 at 19:49 -0700, Sean Christopherson wrote:
> > >
> > > Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
> > > but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
> > > posted interrupts, so KVM needs equivalent functionality to current->on-rq as it
> > > was before this commit.
> > >
> > > @@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
> > >
> > > WRITE_ONCE(vcpu->scheduled_out, true);
> > >
> > > - if (current->on_rq && vcpu->wants_to_run) {
> > > + if (se_runnable(¤t->se) && vcpu->wants_to_run) {
> > > WRITE_ONCE(vcpu->preempted, true);
> > > WRITE_ONCE(vcpu->ready, true);
> > > }
> >
> > Why is that deemed "obviously not appropriate"? ->on_rq in and of
> > itself meaning only "on rq" doesn't seem like a bad thing.
>
> Doh, my wording was unclear. I didn't mean the logic was inappropriate, I meant
> that KVM shouldn't be poking into an internal sched/ helper.
Ah, confusion all better. (yeah, swiping others' toys is naughty)
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-10-10 9:18 ` Peter Zijlstra
@ 2024-10-10 18:23 ` Sean Christopherson
2024-10-12 14:15 ` [tip: sched/urgent] sched: Fix external p->on_rq users tip-bot2 for Peter Zijlstra
2024-10-14 7:28 ` [tip: sched/urgent] sched/fair: " tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 277+ messages in thread
From: Sean Christopherson @ 2024-10-10 18:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Marek Szyprowski, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
efault, kvm
On Thu, Oct 10, 2024, Peter Zijlstra wrote:
> On Thu, Oct 10, 2024 at 10:19:40AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 09, 2024 at 07:49:54PM -0700, Sean Christopherson wrote:
> >
> > > TL;DR: Code that checks task_struct.on_rq may be broken by this commit.
> >
> > Correct, and while I did look at quite a few, I did miss KVM used it,
> > damn.
> >
> > > Peter,
> > >
> > > Any thoughts on how best to handle this? The below hack-a-fix resolves the issue,
> > > but it's obviously not appropriate. KVM uses vcpu->preempted for more than just
> > > posted interrupts, so KVM needs equivalent functionality to current->on-rq as it
> > > was before this commit.
> > >
> > > @@ -6387,7 +6390,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
> > >
> > > WRITE_ONCE(vcpu->scheduled_out, true);
> > >
> > > - if (current->on_rq && vcpu->wants_to_run) {
> > > + if (se_runnable(¤t->se) && vcpu->wants_to_run) {
> > > WRITE_ONCE(vcpu->preempted, true);
> > > WRITE_ONCE(vcpu->ready, true);
> > > }
> >
> > se_runnable() isn't quite right, but yes, a helper along those lines is
> > probably best. Let me try and grep more to see if there's others I
> > missed as well :/
>
> How's the below? I remember looking at the freezer thing before and
> deciding it isn't a correctness thing, but given I added the helper, I
> changed it anyway. I've added a bunch of comments and the perf thing is
> similar to KVM, it wants to know about preemptions so that had to change
> too.
Fixes KVM's woes! Thanks!
^ permalink raw reply [flat|nested] 277+ messages in thread
* [tip: sched/urgent] sched: Fix external p->on_rq users
2024-10-10 9:18 ` Peter Zijlstra
2024-10-10 18:23 ` Sean Christopherson
@ 2024-10-12 14:15 ` tip-bot2 for Peter Zijlstra
2024-10-14 7:28 ` [tip: sched/urgent] sched/fair: " tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-10-12 14:15 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sean Christopherson, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 1cc2f68c016ad3ac8b3a0495797dd61e19a10025
Gitweb: https://git.kernel.org/tip/1cc2f68c016ad3ac8b3a0495797dd61e19a10025
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 10 Oct 2024 11:38:10 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 11 Oct 2024 10:49:33 +02:00
sched: Fix external p->on_rq users
Sean noted that ever since commit 152e11f6df29 ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.
Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even though they will not run again, and
should be considered blocked for many cases.
Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
---
include/linux/sched.h | 5 +++++
kernel/events/core.c | 2 +-
kernel/freezer.c | 7 ++++++-
kernel/rcu/tasks.h | 9 +++++++++
kernel/sched/core.c | 12 +++++++++---
kernel/time/tick-sched.c | 5 +++++
kernel/trace/trace_selftest.c | 2 +-
virt/kvm/kvm_main.c | 2 +-
8 files changed, 37 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e6ee425..8a9517e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2133,6 +2133,11 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif /* CONFIG_SMP */
+static inline bool task_is_runnable(struct task_struct *p)
+{
+ return p->on_rq && !p->se.sched_delayed;
+}
+
extern bool sched_task_on_rq(struct task_struct *p);
extern unsigned long get_wchan(struct task_struct *p);
extern struct task_struct *cpu_curr_snapshot(int cpu);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e3589c4..cdd0976 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9251,7 +9251,7 @@ static void perf_event_switch(struct task_struct *task,
},
};
- if (!sched_in && task->on_rq) {
+ if (!sched_in && task_is_runnable(task)) {
switch_event.event_id.header.misc |=
PERF_RECORD_MISC_SWITCH_OUT_PREEMPT;
}
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 44bbd7d..8d530d0 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -109,7 +109,12 @@ static int __set_task_frozen(struct task_struct *p, void *arg)
{
unsigned int state = READ_ONCE(p->__state);
- if (p->on_rq)
+ /*
+ * Allow freezing the sched_delayed tasks; they will not execute until
+ * ttwu() fixes them up, so it is safe to swap their state now, instead
+ * of waiting for them to get fully dequeued.
+ */
+ if (task_is_runnable(p))
return 0;
if (p != current && task_curr(p))
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 6333f4c..4d7ee95 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -986,6 +986,15 @@ static bool rcu_tasks_is_holdout(struct task_struct *t)
return false;
/*
+ * t->on_rq && !t->se.sched_delayed *could* be considered sleeping but
+ * since it is a spurious state (it will transition into the
+ * traditional blocked state or get woken up without outside
+ * dependencies), not considering it such should only affect timing.
+ *
+ * Be conservative for now and not include it.
+ */
+
+ /*
* Idle tasks (or idle injection) within the idle loop are RCU-tasks
* quiescent states. But CPU boot code performed by the idle task
* isn't a quiescent state.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71232f8..7db711b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -548,6 +548,11 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
* ON_RQ_MIGRATING state is used for migration without holding both
* rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
*
+ * Additionally it is possible to be ->on_rq but still be considered not
+ * runnable when p->se.sched_delayed is true. These tasks are on the runqueue
+ * but will be dequeued as soon as they get picked again. See the
+ * task_is_runnable() helper.
+ *
* p->on_cpu <- { 0, 1 }:
*
* is set by prepare_task() and cleared by finish_task() such that it will be
@@ -4317,9 +4322,10 @@ static bool __task_needs_rq_lock(struct task_struct *p)
* @arg: Argument to function.
*
* Fix the task in it's current state by avoiding wakeups and or rq operations
- * and call @func(@arg) on it. This function can use ->on_rq and task_curr()
- * to work out what the state is, if required. Given that @func can be invoked
- * with a runqueue lock held, it had better be quite lightweight.
+ * and call @func(@arg) on it. This function can use task_is_runnable() and
+ * task_curr() to work out what the state is, if required. Given that @func
+ * can be invoked with a runqueue lock held, it had better be quite
+ * lightweight.
*
* Returns:
* Whatever @func returns
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184..59efa14 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -435,6 +435,11 @@ static void tick_nohz_kick_task(struct task_struct *tsk)
* tick_nohz_task_switch()
* LOAD p->tick_dep_mask
*/
+ // XXX given a task picks up the dependency on schedule(), should we
+ // only care about tasks that are currently on the CPU instead of all
+ // that are on the runqueue?
+ //
+ // That is, does this want to be: task_on_cpu() / task_curr()?
if (!sched_task_on_rq(tsk))
return;
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index c4ad7cd..1469dd8 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1485,7 +1485,7 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
/* reset the max latency */
tr->max_latency = 0;
- while (p->on_rq) {
+ while (task_is_runnable(p)) {
/*
* Sleep to make sure the -deadline thread is asleep too.
* On virtual machines we can't rely on timings,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05cbb25..0c666f1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6387,7 +6387,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
WRITE_ONCE(vcpu->scheduled_out, true);
- if (current->on_rq && vcpu->wants_to_run) {
+ if (task_is_runnable(current) && vcpu->wants_to_run) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* [tip: sched/urgent] sched/fair: Fix external p->on_rq users
2024-10-10 9:18 ` Peter Zijlstra
2024-10-10 18:23 ` Sean Christopherson
2024-10-12 14:15 ` [tip: sched/urgent] sched: Fix external p->on_rq users tip-bot2 for Peter Zijlstra
@ 2024-10-14 7:28 ` tip-bot2 for Peter Zijlstra
2 siblings, 0 replies; 277+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2024-10-14 7:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sean Christopherson, Peter Zijlstra (Intel), Ingo Molnar, x86,
linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: cd9626e9ebc77edec33023fe95dab4b04ffc819d
Gitweb: https://git.kernel.org/tip/cd9626e9ebc77edec33023fe95dab4b04ffc819d
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 10 Oct 2024 11:38:10 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 14 Oct 2024 09:14:35 +02:00
sched/fair: Fix external p->on_rq users
Sean noted that ever since commit 152e11f6df29 ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.
Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even though they will not run again, and
should be considered blocked for many cases.
Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
---
include/linux/sched.h | 5 +++++
kernel/events/core.c | 2 +-
kernel/freezer.c | 7 ++++++-
kernel/rcu/tasks.h | 9 +++++++++
kernel/sched/core.c | 12 +++++++++---
kernel/time/tick-sched.c | 6 ++++++
kernel/trace/trace_selftest.c | 2 +-
virt/kvm/kvm_main.c | 2 +-
8 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e6ee425..8a9517e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2133,6 +2133,11 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif /* CONFIG_SMP */
+static inline bool task_is_runnable(struct task_struct *p)
+{
+ return p->on_rq && !p->se.sched_delayed;
+}
+
extern bool sched_task_on_rq(struct task_struct *p);
extern unsigned long get_wchan(struct task_struct *p);
extern struct task_struct *cpu_curr_snapshot(int cpu);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e3589c4..cdd0976 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9251,7 +9251,7 @@ static void perf_event_switch(struct task_struct *task,
},
};
- if (!sched_in && task->on_rq) {
+ if (!sched_in && task_is_runnable(task)) {
switch_event.event_id.header.misc |=
PERF_RECORD_MISC_SWITCH_OUT_PREEMPT;
}
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 44bbd7d..8d530d0 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -109,7 +109,12 @@ static int __set_task_frozen(struct task_struct *p, void *arg)
{
unsigned int state = READ_ONCE(p->__state);
- if (p->on_rq)
+ /*
+ * Allow freezing the sched_delayed tasks; they will not execute until
+ * ttwu() fixes them up, so it is safe to swap their state now, instead
+ * of waiting for them to get fully dequeued.
+ */
+ if (task_is_runnable(p))
return 0;
if (p != current && task_curr(p))
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 6333f4c..4d7ee95 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -986,6 +986,15 @@ static bool rcu_tasks_is_holdout(struct task_struct *t)
return false;
/*
+ * t->on_rq && !t->se.sched_delayed *could* be considered sleeping but
+ * since it is a spurious state (it will transition into the
+ * traditional blocked state or get woken up without outside
+ * dependencies), not considering it such should only affect timing.
+ *
+ * Be conservative for now and not include it.
+ */
+
+ /*
* Idle tasks (or idle injection) within the idle loop are RCU-tasks
* quiescent states. But CPU boot code performed by the idle task
* isn't a quiescent state.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71232f8..7db711b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -548,6 +548,11 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
* ON_RQ_MIGRATING state is used for migration without holding both
* rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
*
+ * Additionally it is possible to be ->on_rq but still be considered not
+ * runnable when p->se.sched_delayed is true. These tasks are on the runqueue
+ * but will be dequeued as soon as they get picked again. See the
+ * task_is_runnable() helper.
+ *
* p->on_cpu <- { 0, 1 }:
*
* is set by prepare_task() and cleared by finish_task() such that it will be
@@ -4317,9 +4322,10 @@ static bool __task_needs_rq_lock(struct task_struct *p)
* @arg: Argument to function.
*
* Fix the task in it's current state by avoiding wakeups and or rq operations
- * and call @func(@arg) on it. This function can use ->on_rq and task_curr()
- * to work out what the state is, if required. Given that @func can be invoked
- * with a runqueue lock held, it had better be quite lightweight.
+ * and call @func(@arg) on it. This function can use task_is_runnable() and
+ * task_curr() to work out what the state is, if required. Given that @func
+ * can be invoked with a runqueue lock held, it had better be quite
+ * lightweight.
*
* Returns:
* Whatever @func returns
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 753a184..f203f00 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -434,6 +434,12 @@ static void tick_nohz_kick_task(struct task_struct *tsk)
* smp_mb__after_spin_lock()
* tick_nohz_task_switch()
* LOAD p->tick_dep_mask
+ *
+ * XXX given a task picks up the dependency on schedule(), should we
+ * only care about tasks that are currently on the CPU instead of all
+ * that are on the runqueue?
+ *
+ * That is, does this want to be: task_on_cpu() / task_curr()?
*/
if (!sched_task_on_rq(tsk))
return;
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index c4ad7cd..1469dd8 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -1485,7 +1485,7 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
/* reset the max latency */
tr->max_latency = 0;
- while (p->on_rq) {
+ while (task_is_runnable(p)) {
/*
* Sleep to make sure the -deadline thread is asleep too.
* On virtual machines we can't rely on timings,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 05cbb25..0c666f1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6387,7 +6387,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
WRITE_ONCE(vcpu->scheduled_out, true);
- if (current->on_rq && vcpu->wants_to_run) {
+ if (task_is_runnable(current) && vcpu->wants_to_run) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
` (3 preceding siblings ...)
[not found] ` <CGME20240828223802eucas1p16755f4531ed0611dc4871649746ea774@eucas1p1.samsung.com>
@ 2024-11-01 12:47 ` Phil Auld
2024-11-01 12:56 ` Peter Zijlstra
2024-11-04 9:28 ` Dietmar Eggemann
4 siblings, 2 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-01 12:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Hi Peter,
On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
This one is causing a 10-20% performance hit on our filesystem tests.
On 6.12-rc5 (so with the latest follow-ons) we get:
with DELAY_DEQUEUE the bandwidth is 510 MB/s
with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
The test is fio, something like this:
taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
In this case it's ext4, but I'm not sure it will be FS specific.
I should have the machine and setup next week to poke further, but I wanted
to mention it now just in case anyone has an "aha" moment.
It seems to only affect these FS loads. Other perf tests are not showing any
issues that I am aware of.
Thanks,
Phil
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/deadline.c | 1
> kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++++++++++++++------
> kernel/sched/features.h | 9 +++++
> 3 files changed, 81 insertions(+), 11 deletions(-)
>
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2428,7 +2428,6 @@ static struct task_struct *__pick_next_t
> else
> p = dl_se->server_pick_next(dl_se);
> if (!p) {
> - WARN_ON_ONCE(1);
> dl_se->dl_yielded = 1;
> update_curr_dl_se(rq, dl_se, 0);
> goto again;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5379,20 +5379,44 @@ static void clear_buddies(struct cfs_rq
>
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
>
> -static void
> +static bool
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> - int action = UPDATE_TG;
> + if (flags & DEQUEUE_DELAYED) {
> + /*
> + * DEQUEUE_DELAYED is typically called from pick_next_entity()
> + * at which point we've already done update_curr() and do not
> + * want to do so again.
> + */
> + SCHED_WARN_ON(!se->sched_delayed);
> + se->sched_delayed = 0;
> + } else {
> + bool sleep = flags & DEQUEUE_SLEEP;
> +
> + /*
> + * DELAY_DEQUEUE relies on spurious wakeups, special task
> + * states must not suffer spurious wakeups, excempt them.
> + */
> + if (flags & DEQUEUE_SPECIAL)
> + sleep = false;
> +
> + SCHED_WARN_ON(sleep && se->sched_delayed);
> + update_curr(cfs_rq);
>
> + if (sched_feat(DELAY_DEQUEUE) && sleep &&
> + !entity_eligible(cfs_rq, se)) {
> + if (cfs_rq->next == se)
> + cfs_rq->next = NULL;
> + se->sched_delayed = 1;
> + return false;
> + }
> + }
> +
> + int action = UPDATE_TG;
> if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> action |= DO_DETACH;
>
> /*
> - * Update run-time statistics of the 'current'.
> - */
> - update_curr(cfs_rq);
> -
> - /*
> * When dequeuing a sched_entity, we must:
> * - Update loads to have both entity and cfs_rq synced with now.
> * - For group_entity, update its runnable_weight to reflect the new
> @@ -5430,6 +5454,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
> if (cfs_rq->nr_running == 0)
> update_idle_cfs_rq_clock_pelt(cfs_rq);
> +
> + return true;
> }
>
> static void
> @@ -5828,11 +5854,21 @@ static bool throttle_cfs_rq(struct cfs_r
> idle_task_delta = cfs_rq->idle_h_nr_running;
> for_each_sched_entity(se) {
> struct cfs_rq *qcfs_rq = cfs_rq_of(se);
> + int flags;
> +
> /* throttled entity or throttle-on-deactivate */
> if (!se->on_rq)
> goto done;
>
> - dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
> + /*
> + * Abuse SPECIAL to avoid delayed dequeue in this instance.
> + * This avoids teaching dequeue_entities() about throttled
> + * entities and keeps things relatively simple.
> + */
> + flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
> + if (se->sched_delayed)
> + flags |= DEQUEUE_DELAYED;
> + dequeue_entity(qcfs_rq, se, flags);
>
> if (cfs_rq_is_idle(group_cfs_rq(se)))
> idle_task_delta = cfs_rq->h_nr_running;
> @@ -6918,6 +6954,7 @@ static int dequeue_entities(struct rq *r
> bool was_sched_idle = sched_idle_rq(rq);
> int rq_h_nr_running = rq->cfs.h_nr_running;
> bool task_sleep = flags & DEQUEUE_SLEEP;
> + bool task_delayed = flags & DEQUEUE_DELAYED;
> struct task_struct *p = NULL;
> int idle_h_nr_running = 0;
> int h_nr_running = 0;
> @@ -6931,7 +6968,13 @@ static int dequeue_entities(struct rq *r
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> - dequeue_entity(cfs_rq, se, flags);
> +
> + if (!dequeue_entity(cfs_rq, se, flags)) {
> + if (p && &p->se == se)
> + return -1;
> +
> + break;
> + }
>
> cfs_rq->h_nr_running -= h_nr_running;
> cfs_rq->idle_h_nr_running -= idle_h_nr_running;
> @@ -6956,6 +6999,7 @@ static int dequeue_entities(struct rq *r
> break;
> }
> flags |= DEQUEUE_SLEEP;
> + flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL);
> }
>
> for_each_sched_entity(se) {
> @@ -6985,6 +7029,17 @@ static int dequeue_entities(struct rq *r
> if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> rq->next_balance = jiffies;
>
> + if (p && task_delayed) {
> + SCHED_WARN_ON(!task_sleep);
> + SCHED_WARN_ON(p->on_rq != 1);
> +
> + /* Fix-up what dequeue_task_fair() skipped */
> + hrtick_update(rq);
> +
> + /* Fix-up what block_task() skipped. */
> + __block_task(rq, p);
> + }
> +
> return 1;
> }
> /*
> @@ -6996,8 +7051,10 @@ static bool dequeue_task_fair(struct rq
> {
> util_est_dequeue(&rq->cfs, p);
>
> - if (dequeue_entities(rq, &p->se, flags) < 0)
> + if (dequeue_entities(rq, &p->se, flags) < 0) {
> + util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
> return false;
> + }
>
> util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> hrtick_update(rq);
> @@ -12973,6 +13030,11 @@ static void set_next_task_fair(struct rq
> /* ensure bandwidth has been allocated on our new cfs_rq */
> account_cfs_rq_runtime(cfs_rq, 0);
> }
> +
> + if (!first)
> + return;
> +
> + SCHED_WARN_ON(se->sched_delayed);
> }
>
> void init_cfs_rq(struct cfs_rq *cfs_rq)
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -29,6 +29,15 @@ SCHED_FEAT(NEXT_BUDDY, false)
> SCHED_FEAT(CACHE_HOT_BUDDY, true)
>
> /*
> + * Delay dequeueing tasks until they get selected or woken.
> + *
> + * By delaying the dequeue for non-eligible tasks, they remain in the
> + * competition and can burn off their negative lag. When they get selected
> + * they'll have positive lag by definition.
> + */
> +SCHED_FEAT(DELAY_DEQUEUE, true)
> +
> +/*
> * Allow wakeup-time preemption of the current task:
> */
> SCHED_FEAT(WAKEUP_PREEMPTION, true)
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 12:47 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Phil Auld
@ 2024-11-01 12:56 ` Peter Zijlstra
2024-11-01 13:38 ` Phil Auld
2024-11-04 9:28 ` Dietmar Eggemann
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-01 12:56 UTC (permalink / raw)
To: Phil Auld
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Fri, Nov 01, 2024 at 08:47:15AM -0400, Phil Auld wrote:
> This one is causing a 10-20% performance hit on our filesystem tests.
>
> On 6.12-rc5 (so with the latest follow ons) we get:
>
> with DELAY_DEQUEUE the bandwidth is 510 MB/s
> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>
> The test is fio, something like this:
>
> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>
> In this case it's ext4, but I'm not sure it will be FS specific.
>
> I should have the machine and setup next week to poke further but I wanted
> to mention it now just in case any one has an "aha" moment.
>
> It seems to only effect these FS loads. Other perf tests are not showing any
> issues that I am aware of.
There's a number of reports -- mostly it seems to be a case of excessive
preemption hurting things.
What happens if you use:
schedtool -B -a 1-8 -e fio ....
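(-B runs it as SCHED_BATCH; if schedtool isn't handy, something like the
util-linux equivalent below should do the same thing:

	chrt -b 0 taskset -c 1-8 fio ....
)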
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 12:56 ` Peter Zijlstra
@ 2024-11-01 13:38 ` Phil Auld
2024-11-01 14:26 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-01 13:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Fri, Nov 01, 2024 at 01:56:59PM +0100 Peter Zijlstra wrote:
> On Fri, Nov 01, 2024 at 08:47:15AM -0400, Phil Auld wrote:
>
> > This one is causing a 10-20% performance hit on our filesystem tests.
> >
> > On 6.12-rc5 (so with the latest follow ons) we get:
> >
> > with DELAY_DEQUEUE the bandwidth is 510 MB/s
> > with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
> >
> > The test is fio, something like this:
> >
> > taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
> >
> > In this case it's ext4, but I'm not sure it will be FS specific.
> >
> > I should have the machine and setup next week to poke further but I wanted
> > to mention it now just in case any one has an "aha" moment.
> >
> > It seems to only effect these FS loads. Other perf tests are not showing any
> > issues that I am aware of.
>
> There's a number of reports -- mostly it seems to be a case of excessive
> preemption hurting things.
>
> What happens if you use:
>
> schedtool -B -a 1-8 -e fio ....
>
>
Thanks for taking a look.
That makes the overall performance way worse:
DELAY_DEQUEUE - 146 MB/s
NO_DELAY_DEQUEUE - 156 MB/s
I guess that does cut the difference between delay and nodelay
roughly in half.
How is delay dequeue causing more preemption? Or is that more
for eevdf in general? We aren't seeing any issues there except
for the delay dequeue thing.
Cheers,
Phil
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 13:38 ` Phil Auld
@ 2024-11-01 14:26 ` Peter Zijlstra
2024-11-01 14:42 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-01 14:26 UTC (permalink / raw)
To: Phil Auld
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
> How is delay dequeue causing more preemption?
The thing delay dequeue does is it keeps !eligible tasks on the runqueue
until they're picked again. Them getting picked means they're eligible.
If at that point they're still not runnable, they're dequeued.
By keeping them around like this, they can earn back their lag.
The result is that the moment they get woken up again, they're going to
be eligible and are considered for preemption.
The whole thinking behind this is that while 'lag' measures the
amount of service difference from the ideal (positive lag will have had less
service, while negative lag will have had too much service), this is
only true for the (constantly) competing task.
The moment a task leaves, will it still have had too much service? And
after a few seconds of inactivity?
So by keeping the deactivated tasks (artificially) in the competition
until they're at least at the equal service point, lets them burn off
some of that debt.
It is not dissimilar to how CFS had sleeper bonus, except that was
walltime based, while this is competition based.
Notably, this makes a significant difference for interactive tasks that
only run periodically. If they're not eligible at the point of wakeup,
they'll incur undue latency.
Now, I imagine FIO to have tasks blocking on IO, and while they're
blocked, they'll be earning their eligibility, such that when they're
woken they're good to go, preempting whatever.
Whatever doesn't seem to enjoy this.
Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
whole does like preemption -- so now it's a question of figuring out
what exactly it does and doesn't like. Which is never trivial :/
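
In (very) rough pseudo-code -- simplified, the real thing is load-weighted and
lives in vruntime units -- the bookkeeping above is:

	/*
	 * Simplified sketch, not the actual code:
	 *
	 *   lag_i = V - v_i        (V: the queue's average virtual time)
	 *
	 *   lag_i >= 0  ->  i received at most its ideal service: eligible
	 *   lag_i <  0  ->  i received more than its ideal service: not eligible
	 *
	 * DELAY_DEQUEUE keeps a sleeping task with negative lag enqueued (it
	 * won't run, it merely keeps competing) until it gets picked, that is,
	 * until it is eligible again; so its next wakeup starts at >= 0 lag
	 * and is immediately a preemption candidate.
	 */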
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 14:26 ` Peter Zijlstra
@ 2024-11-01 14:42 ` Phil Auld
2024-11-01 18:08 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-01 14:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Fri, Nov 01, 2024 at 03:26:49PM +0100 Peter Zijlstra wrote:
> On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
>
> > How is delay dequeue causing more preemption?
>
> The thing delay dequeue does is it keeps !eligible tasks on the runqueue
> until they're picked again. Them getting picked means they're eligible.
> If at that point they're still not runnable, they're dequeued.
>
> By keeping them around like this, they can earn back their lag.
>
> The result is that the moment they get woken up again, they're going to
> be eligible and are considered for preemption.
>
>
> The whole thinking behind this is that while 'lag' measures the
> mount of service difference from the ideal (positive lag will have less
> service, while negative lag will have had too much service), this is
> only true for the (constantly) competing task.
>
> The moment a task leaves, will it still have had too much service? And
> after a few seconds of inactivity?
>
> So by keeping the deactivated tasks (artificially) in the competition
> until they're at least at the equal service point, lets them burn off
> some of that debt.
>
> It is not dissimilar to how CFS had sleeper bonus, except that was
> walltime based, while this is competition based.
>
>
> Notably, this makes a significant difference for interactive tasks that
> only run periodically. If they're not eligible at the point of wakeup,
> they'll incur undue latency.
>
>
> Now, I imagine FIO to have tasks blocking on IO, and while they're
> blocked, they'll be earning their eligibility, such that when they're
> woken they're good to go, preempting whatever.
>
> Whatever doesn't seem to enjoy this.
>
>
> Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
> whole does like preemption -- so now it's a question of figuring out
> what exactly it does and doesn't like. Which is never trivial :/
>
Thanks for that detailed explanation.
I can confirm that FIO does like the preemption
NO_WAKEUP_P and DELAY - 427 MB/s
NO_WAKEUP_P and NO_DELAY - 498 MB/s
WAKEUP_P and DELAY - 519 MB/s
WAKEUP_P and NO_DELAY - 590 MB/s
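(Those are just the scheduler features flipped at runtime, something like the
below, assuming debugfs is mounted in the usual place:

	echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features
	echo DELAY_DEQUEUE        > /sys/kernel/debug/sched/features

and the WAKEUP_PREEMPTION / NO_DELAY_DEQUEUE variants for the other rows.)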
Something in the delay itself seems to be the issue
(extra tasks in the queue? not migrating the delayed task? ...)
I'll start looking at tracing next week.
Thanks,
Phil
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 14:42 ` Phil Auld
@ 2024-11-01 18:08 ` Mike Galbraith
2024-11-01 20:07 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-01 18:08 UTC (permalink / raw)
To: Phil Auld, Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Fri, 2024-11-01 at 10:42 -0400, Phil Auld wrote:
> On Fri, Nov 01, 2024 at 03:26:49PM +0100 Peter Zijlstra wrote:
> > On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
> >
> > > How is delay dequeue causing more preemption?
> >
> > The thing delay dequeue does is it keeps !eligible tasks on the runqueue
> > until they're picked again. Them getting picked means they're eligible.
> > If at that point they're still not runnable, they're dequeued.
> >
> > By keeping them around like this, they can earn back their lag.
> >
> > The result is that the moment they get woken up again, they're going to
> > be eligible and are considered for preemption.
> >
> >
> > The whole thinking behind this is that while 'lag' measures the
> > mount of service difference from the ideal (positive lag will have less
> > service, while negative lag will have had too much service), this is
> > only true for the (constantly) competing task.
> >
> > The moment a task leaves, will it still have had too much service? And
> > after a few seconds of inactivity?
> >
> > So by keeping the deactivated tasks (artificially) in the competition
> > until they're at least at the equal service point, lets them burn off
> > some of that debt.
> >
> > It is not dissimilar to how CFS had sleeper bonus, except that was
> > walltime based, while this is competition based.
> >
> >
> > Notably, this makes a significant difference for interactive tasks that
> > only run periodically. If they're not eligible at the point of wakeup,
> > they'll incur undue latency.
> >
> >
> > Now, I imagine FIO to have tasks blocking on IO, and while they're
> > blocked, they'll be earning their eligibility, such that when they're
> > woken they're good to go, preempting whatever.
> >
> > Whatever doesn't seem to enjoy this.
> >
> >
> > Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
> > whole does like preemption -- so now it's a question of figuring out
> > what exactly it does and doesn't like. Which is never trivial :/
> >
>
> Thanks for that detailed explanation.
>
> I can confirm that FIO does like the preemption
>
> NO_WAKEUP_P and DELAY - 427 MB/s
> NO_WAKEUP_P and NO_DELAY - 498 MB/s
> WAKEUP_P and DELAY - 519 MB/s
> WAKEUP_P and NO_DELAY - 590 MB/s
>
> Something in the delay itself
> (extra tasks in the queue? not migrating the delayed task? ...)
I think it's all about short term fairness and asymmetric buddies.
tbench comparison eevdf vs cfs, 100% apple to apple.
1 tbench buddy pair scheduled cross core.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
13770 root 20 0 21424 1920 1792 S 60.13 0.012 0:33.81 3 tbench
13771 root 20 0 4720 896 768 S 46.84 0.006 0:26.10 2 tbench_srv
Note the 60/47 utilization. Now the same pair pinned/stacked on one core:
6.1.114-cfs
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
4407 root 20 0 21424 1980 1772 R 50.00 0.012 0:29.20 3 tbench
4408 root 20 0 4720 124 0 R 50.00 0.001 0:28.76 3 tbench_srv
Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
it can utilize a full 50%, but it must first preempt its wide bottom buddy.
Now eevdf. (zero source deltas other than eevdf)
6.1.114-eevdf -delay_dequeue
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
4988 root 20 0 21424 1948 1736 R 56.44 0.012 0:32.92 3 tbench
4989 root 20 0 4720 128 0 R 44.55 0.001 0:25.49 3 tbench_srv
6.1.114-eevdf +delay_dequeue
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
4934 root 20 0 21424 1952 1736 R 52.00 0.012 0:30.09 3 tbench
4935 root 20 0 4720 124 0 R 49.00 0.001 0:28.15 3 tbench_srv
As Peter noted, delay_dequeue levels the sleeper playing field. Both
of these guys are 1:1 sleepers, but they're asymmetric in width.
Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
tbench 8
6.1.114-cfs 3674.37 MB/sec
6.1.114-eevdf 3505.25 MB/sec -delay_dequeue
3701.66 MB/sec +delay_dequeue
For tbench, preemption = shorter turnaround = higher throughput.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 18:08 ` Mike Galbraith
@ 2024-11-01 20:07 ` Phil Auld
2024-11-02 4:32 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-01 20:07 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
Hi Mike,
On Fri, Nov 01, 2024 at 07:08:31PM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-01 at 10:42 -0400, Phil Auld wrote:
> > On Fri, Nov 01, 2024 at 03:26:49PM +0100 Peter Zijlstra wrote:
> > > On Fri, Nov 01, 2024 at 09:38:22AM -0400, Phil Auld wrote:
> > >
> > > > How is delay dequeue causing more preemption?
> > >
> > > The thing delay dequeue does is it keeps !eligible tasks on the runqueue
> > > until they're picked again. Them getting picked means they're eligible.
> > > If at that point they're still not runnable, they're dequeued.
> > >
> > > By keeping them around like this, they can earn back their lag.
> > >
> > > The result is that the moment they get woken up again, they're going to
> > > be eligible and are considered for preemption.
> > >
> > >
> > > The whole thinking behind this is that while 'lag' measures the
> > > amount of service difference from the ideal (positive lag will have less
> > > service, while negative lag will have had too much service), this is
> > > only true for the (constantly) competing task.
> > >
> > > The moment a task leaves, will it still have had too much service? And
> > > after a few seconds of inactivity?
> > >
> > > So keeping the deactivated tasks (artificially) in the competition
> > > until they're at least at the equal service point lets them burn off
> > > some of that debt.
> > >
> > > It is not dissimilar to how CFS had sleeper bonus, except that was
> > > walltime based, while this is competition based.
> > >
> > >
> > > Notably, this makes a significant difference for interactive tasks that
> > > only run periodically. If they're not eligible at the point of wakeup,
> > > they'll incur undue latency.
> > >
> > >
> > > Now, I imagine FIO to have tasks blocking on IO, and while they're
> > > blocked, they'll be earning their eligibility, such that when they're
> > > woken they're good to go, preempting whatever.
> > >
> > > Whatever doesn't seem to enjoy this.
> > >
> > >
> > > Given BATCH makes such a terrible mess of things, I'm thinking FIO as a
> > > whole does like preemption -- so now it's a question of figuring out
> > > what exactly it does and doesn't like. Which is never trivial :/
> > >
> >
> > Thanks for that detailed explanation.
> >
> > I can confirm that FIO does like the preemption
> >
> > NO_WAKEUP_P and DELAY - 427 MB/s
> > NO_WAKEUP_P and NO_DELAY - 498 MB/s
> > WAKEUP_P and DELAY - 519 MB/s
> > WAKEUP_P and NO_DELAY - 590 MB/s
> >
> > Something in the delay itself
> > (extra tasks in the queue? not migrating the delayed task? ...)
>
> I think it's all about short term fairness and asymmetric buddies.
Thanks for jumping in. My jargon decoder ring seems to be failing me
so I'm not completely sure what you are saying below :)
"buddies" you mean tasks that waking each other up and sleeping.
And one runs for longer than the other, right?
>
> tbench comparison eevdf vs cfs, 100% apple to apple.
>
> 1 tbench buddy pair scheduled cross core.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 13770 root 20 0 21424 1920 1792 S 60.13 0.012 0:33.81 3 tbench
> 13771 root 20 0 4720 896 768 S 46.84 0.006 0:26.10 2 tbench_srv
> Note 60/47 utilization, now pinned/stacked.
>
> 6.1.114-cfs
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 4407 root 20 0 21424 1980 1772 R 50.00 0.012 0:29.20 3 tbench
> 4408 root 20 0 4720 124 0 R 50.00 0.001 0:28.76 3 tbench_srv
What is the difference between these first two? The first is on
separate cores so they don't interfere with each other? And the second is
pinned to the same core?
>
> Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> it can utilize a full 50%, but it must first preempt wide bottom buddy.
>
We've got "light" and "wide" here which is a bit mixed metaphorically :)
So here CFS is letting the wakee preempt the waker and providing pretty
equal fairness. And hot L2 caching is masking the asymmetry.
> Now eevdf. (zero source deltas other than eevdf)
> 6.1.114-eevdf -delay_dequeue
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 4988 root 20 0 21424 1948 1736 R 56.44 0.012 0:32.92 3 tbench
> 4989 root 20 0 4720 128 0 R 44.55 0.001 0:25.49 3 tbench_srv
> 6.1.114-eevdf +delay_dequeue
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 4934 root 20 0 21424 1952 1736 R 52.00 0.012 0:30.09 3 tbench
> 4935 root 20 0 4720 124 0 R 49.00 0.001 0:28.15 3 tbench_srv
>
> As Peter noted, delay_dequeue levels the sleeper playing field. Both
> of these guys are 1:1 sleepers, but they're asymmetric in width.
With wakeup preemption off it doesn't help in my case. I was thinking
maybe the preemption was preventing some batching of IO completions or
initiations. But that was wrong it seems.
Does it also possibly make wakeup migration less likely and thus increase
stacking?
> Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
>
> tbench 8
> 6.1.114-cfs 3674.37 MB/sec
> 6.1.114-eevdf 3505.25 MB/sec -delay_dequeue
> 3701.66 MB/sec +delay_dequeue
>
> For tbench, preemption = shorter turnaround = higher throughput.
So here you have a benchmark that gets a ~5% boost from delayed_dequeue.
But I've got one that gets a 20% penalty so I'm not exactly sure what
to make of that. Clearly FIO does not have the same pattern as tbench.
It's not a special case though, this is one that our perf team runs
regularly to look for regressions.
I'll be able to poke at it more next week so hopefully I can see what it's
doing.
Cheers,
Phil
>
> -Mike
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 20:07 ` Phil Auld
@ 2024-11-02 4:32 ` Mike Galbraith
2024-11-04 13:05 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-02 4:32 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Fri, 2024-11-01 at 16:07 -0400, Phil Auld wrote:
> Thanks for jumping in. My jargon decoder ring seems to be failing me
> so I'm not completely sure what you are saying below :)
>
> "buddies" you mean tasks that waking each other up and sleeping.
> And one runs for longer than the other, right?
Yeah, buddies are related waker/wakee 1:1 1:N or M:N, excluding tasks
happening to be sitting on a CPU where, say a timer fires, an IRQ leads
to a wakeup of lord knows what, lock wakeups etc etc etc. I think Peter
coined the term buddy to mean that (less typing), and it stuck.
> > 1 tbench buddy pair scheduled cross core.
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> > 13770 root 20 0 21424 1920 1792 S 60.13 0.012 0:33.81 3 tbench
> > 13771 root 20 0 4720 896 768 S 46.84 0.006 0:26.10 2 tbench_srv
>
> > Note 60/47 utilization, now pinned/stacked.
> >
> > 6.1.114-cfs
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> > 4407 root 20 0 21424 1980 1772 R 50.00 0.012 0:29.20 3 tbench
> > 4408 root 20 0 4720 124 0 R 50.00 0.001 0:28.76 3 tbench_srv
>
> What is the difference between these first two? The first is on
> separate cores so they don't interfere with each other? And the second is
> pinned to the same core?
Yeah, see 'P'. Given CPU headroom, a tbench pair can consume ~107%.
They're not fully synchronous.. wouldn't be relevant here/now if they
were :)
> > Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> > it can utilize a full 50%, but it must first preempt wide bottom buddy.
> >
>
> We've got "light" and "wide" here which is a bit mixed metaphorically
> :)
Wide, skinny, feather-weight or lard-ball, they all work for me.
> So here CFS is letting the wakee preempt the waker and providing pretty
> equal fairness. And hot L2 caching is masking the asymmetry.
No, it's way simpler: preemption slices through the only thing it can
slice through, the post wakeup concurrent bits.. that otherwise sits
directly in the communication stream as a lump of latency in a latency
bound operation.
>
> With wakeup preemption off it doesn't help in my case. I was thinking
> maybe the preemption was preventing some batching of IO completions
> or
> initiations. But that was wrong it seems.
Dunno.
> Does it also possibly make wakeup migration less likely and thus increase
> stacking?
The buddy being preempted certainly won't be wakeup migrated, because
it won't sleep. Two very sleepy tasks when bw constrained becomes one
100% hog and one 99.99% hog when CPU constrained.
> > Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
> >
> > tbench 8
> > 6.1.114-cfs 3674.37 MB/sec
> > 6.1.114-eevdf 3505.25 MB/sec -delay_dequeue
> > 3701.66 MB/sec +delay_dequeue
> >
> > For tbench, preemption = shorter turnaround = higher throughput.
>
> So here you have a benchmark that gets a ~5% boost from
> delayed_dequeue.
>
> > But I've got one that gets a 20% penalty so I'm not exactly sure what
> to make of that. Clearly FIO does not have the same pattern as tbench.
There are basically two options in sched-land, shave fastpath cycles,
or some variant of Rob Peter to pay Paul ;-)
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-01 12:47 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Phil Auld
2024-11-01 12:56 ` Peter Zijlstra
@ 2024-11-04 9:28 ` Dietmar Eggemann
2024-11-04 11:55 ` Dietmar Eggemann
2024-11-04 12:50 ` Phil Auld
1 sibling, 2 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-11-04 9:28 UTC (permalink / raw)
To: Phil Auld, Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx, efault
Hi Phil,
On 01/11/2024 13:47, Phil Auld wrote:
>
> Hi Peter,
>
> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>> noting that lag is fundamentally a temporal measure. It should not be
>> carried around indefinitely.
>>
>> OTOH it should also not be instantly discarded, doing so will allow a
>> task to game the system by purposefully (micro) sleeping at the end of
>> its time quantum.
>>
>> Since lag is intimately tied to the virtual time base, a wall-time
>> based decay is also insufficient, notably competition is required for
>> any of this to make sense.
>>
>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>> competing until they are eligible.
>>
>> Strictly speaking, we only care about keeping them until the 0-lag
>> point, but that is a difficult proposition, instead carry them around
>> until they get picked again, and dequeue them at that point.
>
> This one is causing a 10-20% performance hit on our filesystem tests.
>
> On 6.12-rc5 (so with the latest follow ons) we get:
>
> with DELAY_DEQUEUE the bandwidth is 510 MB/s
> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>
> The test is fio, something like this:
>
> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
- sched: psi: pass enqueue/dequeue flags to psi callbacks directly
(2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
# sudo lshw -class disk -class storage
*-nvme
description: NVMe device
product: GIGABYTE GP-ASM2NE6500GTTD
vendor: Phison Electronics Corporation
physical id: 0
bus info: pci@0000:01:00.0
logical name: /dev/nvme0
version: EGFM13.2
...
capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
resources: irq:16 memory:70800000-70803fff
# mount | grep ^/dev/nvme0
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
Which disk device you're using?
>
> In this case it's ext4, but I'm not sure it will be FS specific.
>
> I should have the machine and setup next week to poke further but I wanted
> to mention it now just in case any one has an "aha" moment.
>
> It seems to only affect these FS loads. Other perf tests are not showing any
> issues that I am aware of.
[...]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-04 9:28 ` Dietmar Eggemann
@ 2024-11-04 11:55 ` Dietmar Eggemann
2024-11-04 12:50 ` Phil Auld
1 sibling, 0 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-11-04 11:55 UTC (permalink / raw)
To: Phil Auld, Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat,
tglx, efault, Christian Loehle
+cc Christian Loehle <christian.loehle@arm.com>
On 04/11/2024 10:28, Dietmar Eggemann wrote:
> Hi Phil,
>
> On 01/11/2024 13:47, Phil Auld wrote:
>>
>> Hi Peter,
>>
>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>>> noting that lag is fundamentally a temporal measure. It should not be
>>> carried around indefinitely.
>>>
>>> OTOH it should also not be instantly discarded, doing so will allow a
>>> task to game the system by purposefully (micro) sleeping at the end of
>>> its time quantum.
>>>
>>> Since lag is intimately tied to the virtual time base, a wall-time
>>> based decay is also insufficient, notably competition is required for
>>> any of this to make sense.
>>>
>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>>> competing until they are eligible.
>>>
>>> Strictly speaking, we only care about keeping them until the 0-lag
>>> point, but that is a difficult proposition, instead carry them around
>>> until they get picked again, and dequeue them at that point.
>>
>> This one is causing a 10-20% performance hit on our filesystem tests.
>>
>> On 6.12-rc5 (so with the latest follow ons) we get:
>>
>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>>
>> The test is fio, something like this:
>>
>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>
> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>
> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
^^^^^^^
>
> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
Christian Loehle just told me that my cpumask looks odd. Should be
0xaaaa instead.
Retested:
vanilla features: 954MB/s (mean out of 5 runs, σ: 30.83)
NO_DELAY_DEQUEUE: 932MB/s (mean out of 5 runs, σ: 28.10)
Now there are only 8 CPUs (instead of 10) for the 8 (+2) fio tasks. σ
went up probably because of more wakeup/preemption latency.
>
> # sudo lshw -class disk -class storage
> *-nvme
> description: NVMe device
> product: GIGABYTE GP-ASM2NE6500GTTD
> vendor: Phison Electronics Corporation
> physical id: 0
> bus info: pci@0000:01:00.0
> logical name: /dev/nvme0
> version: EGFM13.2
> ...
> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> resources: irq:16 memory:70800000-70803fff
>
> # mount | grep ^/dev/nvme0
> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>
> Which disk device you're using?
>
>>
>> In this case it's ext4, but I'm not sure it will be FS specific.
>>
>> I should have the machine and setup next week to poke further but I wanted
>> to mention it now just in case any one has an "aha" moment.
>>
>> It seems to only affect these FS loads. Other perf tests are not showing any
>> issues that I am aware of.
>
> [...]
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-04 9:28 ` Dietmar Eggemann
2024-11-04 11:55 ` Dietmar Eggemann
@ 2024-11-04 12:50 ` Phil Auld
2024-11-05 9:53 ` Christian Loehle
2024-11-08 14:53 ` Dietmar Eggemann
1 sibling, 2 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-04 12:50 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Hi Dietmar,
On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> Hi Phil,
>
> On 01/11/2024 13:47, Phil Auld wrote:
> >
> > Hi Peter,
> >
> > On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
> >> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> >> noting that lag is fundamentally a temporal measure. It should not be
> >> carried around indefinitely.
> >>
> >> OTOH it should also not be instantly discarded, doing so will allow a
> >> task to game the system by purposefully (micro) sleeping at the end of
> >> its time quantum.
> >>
> >> Since lag is intimately tied to the virtual time base, a wall-time
> >> based decay is also insufficient, notably competition is required for
> >> any of this to make sense.
> >>
> >> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> >> competing until they are eligible.
> >>
> >> Strictly speaking, we only care about keeping them until the 0-lag
> >> point, but that is a difficult proposition, instead carry them around
> >> until they get picked again, and dequeue them at that point.
> >
> > This one is causing a 10-20% performance hit on our filesystem tests.
> >
> > On 6.12-rc5 (so with the latest follow ons) we get:
> >
> > with DELAY_DEQUEUE the bandwidth is 510 MB/s
> > with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
> >
> > The test is fio, something like this:
> >
> > taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>
> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>
> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>
> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>
> # sudo lshw -class disk -class storage
> *-nvme
> description: NVMe device
> product: GIGABYTE GP-ASM2NE6500GTTD
> vendor: Phison Electronics Corporation
> physical id: 0
> bus info: pci@0000:01:00.0
> logical name: /dev/nvme0
> version: EGFM13.2
> ...
> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> resources: irq:16 memory:70800000-70803fff
>
> # mount | grep ^/dev/nvme0
> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>
> Which disk device you're using?
Most of the reports are on various NVME drives (samsung mostly I think).
One thing I should add is that it's all on LVM:
vgcreate vg /dev/nvme0n1 -y
lvcreate -n thinMeta -L 3GB vg -y
lvcreate -n thinPool -l 99%FREE vg -y
lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
lvcreate -n testLV -V 1300G --thinpool thinPool vg
wipefs -a /dev/mapper/vg-testLV
mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
mount /dev/mapper/vg-testLV /testfs
With VDO or thinpool (as above) it shows on both ext4 and xfs. With fs on
drive directly it's a little more variable. Some it shows on xfs, some it shows
on ext4 and not vice-versa, seems to depend on the drive or hw raid. But when
it shows it's 100% reproducible on that setup.
It's always the randwrite numbers. The rest look fine.
Also, as yet I'm not personally doing this testing, just looking into it and
passing on the information I have.
Thanks for taking a look.
Cheers,
Phil
>
> >
> > In this case it's ext4, but I'm not sure it will be FS specific.
> >
> > I should have the machine and setup next week to poke further but I wanted
> > to mention it now just in case any one has an "aha" moment.
> >
> > It seems to only affect these FS loads. Other perf tests are not showing any
> > issues that I am aware of.
>
> [...]
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-02 4:32 ` Mike Galbraith
@ 2024-11-04 13:05 ` Phil Auld
2024-11-05 4:05 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-04 13:05 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Sat, Nov 02, 2024 at 05:32:14AM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-01 at 16:07 -0400, Phil Auld wrote:
>
>
> > Thanks for jumping in. My jargon decoder ring seems to be failing me
> > so I'm not completely sure what you are saying below :)
> >
> > "buddies" you mean tasks that waking each other up and sleeping.
> > And one runs for longer than the other, right?
>
> Yeah, buddies are related waker/wakee 1:1 1:N or M:N, excluding tasks
> happening to be sitting on a CPU where, say a timer fires, an IRQ leads
> to a wakeup of lord knows what, lock wakeups etc etc etc. I think Peter
> coined the term buddy to mean that (less typing), and it stuck.
>
Thanks!
> > > 1 tbench buddy pair scheduled cross core.
> > >
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> > > 13770 root 20 0 21424 1920 1792 S 60.13 0.012 0:33.81 3 tbench
> > > 13771 root 20 0 4720 896 768 S 46.84 0.006 0:26.10 2 tbench_srv
> >
> > > Note 60/47 utilization, now pinned/stacked.
> > >
> > > 6.1.114-cfs
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> > > 4407 root 20 0 21424 1980 1772 R 50.00 0.012 0:29.20 3 tbench
> > > 4408 root 20 0 4720 124 0 R 50.00 0.001 0:28.76 3 tbench_srv
> >
> > What is the difference between these first two? The first is on
> > separate cores so they don't interfere with each other? And the second is
> > pinned to the same core?
>
> Yeah, see 'P'. Given CPU headroom, a tbench pair can consume ~107%.
> They're not fully synchronous.. wouldn't be relevant here/now if they
> were :)
>
> > > Note what happens to the lighter tbench_srv. Consuming red hot L2 data,
> > > it can utilize a full 50%, but it must first preempt wide bottom buddy.
> > >
> >
> > We've got "light" and "wide" here which is a bit mixed metaphorically
> > :)
>
> Wide, skinny, feather-weight or lard-ball, they all work for me.
>
> > So here CFS is letting the wakee preempt the waker and providing pretty
> > equal fairness. And hot L2 caching is masking the asymmetry.
>
> No, it's way simpler: preemption slices through the only thing it can
> slice through, the post wakeup concurrent bits.. that otherwise sits
> directly in the communication stream as a lump of latency in a latency
> bound operation.
>
> >
> > With wakeup preemption off it doesn't help in my case. I was thinking
> > maybe the preemption was preventing some batching of IO completions
> > or
> > initiations. But that was wrong it seems.
>
> Dunno.
>
> > Does it also possibly make wakeup migration less likely and thus increase
> > stacking?
>
> The buddy being preempted certainly won't be wakeup migrated, because
> it won't sleep. Two very sleepy tasks when bw constrained becomes one
> 100% hog and one 99.99% hog when CPU constrained.
>
Not the waker who gets preempted but the wakee may be a bit more
sticky on his current cpu and thus stack more since he's still
in that runqueue. But that's just a mental exercise trying to
find things that are directly related to delay dequeue. No observation
other than the overall perf hit.
> > > Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
> > >
> > > tbench 8
> > > 6.1.114-cfs 3674.37 MB/sec
> > > 6.1.114-eevdf 3505.25 MB/sec -delay_dequeue
> > > 3701.66 MB/sec +delay_dequeue
> > >
> > > For tbench, preemption = shorter turnaround = higher throughput.
> >
> > So here you have a benchmark that gets a ~5% boost from
> > delayed_dequeue.
> >
> > But I've got one that gets a 20% penalty so I'm not exactly sure what
> > to make of that. Clearly FIO does not have the same pattern as tbench.
>
> There are basically two options in sched-land, shave fastpath cycles,
> or some variant of Rob Peter to pay Paul ;-)
>
That Peter is cranky :)
Cheers,
Phil
> -Mike
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-04 13:05 ` Phil Auld
@ 2024-11-05 4:05 ` Mike Galbraith
2024-11-05 4:22 ` K Prateek Nayak
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-05 4:05 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Mon, 2024-11-04 at 08:05 -0500, Phil Auld wrote:
> On Sat, Nov 02, 2024 at 05:32:14AM +0100 Mike Galbraith wrote:
>
> >
> > The buddy being preempted certainly won't be wakeup migrated...
>
> Not the waker who gets preempted but the wakee may be a bit more
> sticky on his current cpu and thus stack more since he's still
> in that runqueue.
Ah, indeed, if wakees don't get scraped off before being awakened, they
can and do miss chances at an idle CPU according to trace_printk().
I'm undecided if overall it's boon, bane or even matters, as there is
still an ample supply of wakeup migration, but seems it can indeed
inject wakeup latency needlessly, so <sharpens stick>...
My box booted and neither become exceptionally noisy nor inexplicably
silent in.. oh, minutes now, so surely yours will be perfectly fine.
After one minute of lightly loaded box browsing, trace_printk() said:
645 - racy peek says there is a room available
11 - cool, reserved room is free
206 - no vacancy or wakee pinned
38807 - SIS accommodates room seeker
The below should improve the odds, but high return seems unlikely.
---
kernel/sched/core.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3790,7 +3790,13 @@ static int ttwu_runnable(struct task_str
rq = __task_rq_lock(p, &rf);
if (task_on_rq_queued(p)) {
update_rq_clock(rq);
- if (p->se.sched_delayed)
+ /*
+ * If wakee is mobile and the room it reserved is occupied, let it try to migrate.
+ */
+ if (p->se.sched_delayed && rq->nr_running > 1 && cpumask_weight(p->cpus_ptr) > 1) {
+ dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
+ goto out_unlock;
+ } else if (p->se.sched_delayed)
enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (!task_on_cpu(rq, p)) {
/*
@@ -3802,6 +3808,7 @@ static int ttwu_runnable(struct task_str
ttwu_do_wakeup(p);
ret = 1;
}
+out_unlock:
__task_rq_unlock(rq, &rf);
return ret;
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 4:05 ` Mike Galbraith
@ 2024-11-05 4:22 ` K Prateek Nayak
2024-11-05 6:46 ` Mike Galbraith
2024-11-05 15:20 ` Phil Auld
2024-11-06 13:53 ` Peter Zijlstra
2 siblings, 1 reply; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-05 4:22 UTC (permalink / raw)
To: Mike Galbraith, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
Hello Mike,
On 11/5/2024 9:35 AM, Mike Galbraith wrote:
> On Mon, 2024-11-04 at 08:05 -0500, Phil Auld wrote:
>> On Sat, Nov 02, 2024 at 05:32:14AM +0100 Mike Galbraith wrote:
>>
>>>
>>> The buddy being preempted certainly won't be wakeup migrated...
>>
>> Not the waker who gets preempted but the wakee may be a bit more
>> sticky on his current cpu and thus stack more since he's still
>> in that runqueue.
>
> Ah, indeed, if wakees don't get scraped off before being awakened, they
> can and do miss chances at an idle CPU according to trace_printk().
>
> I'm undecided if overall it's boon, bane or even matters, as there is
> still an ample supply of wakeup migration, but seems it can indeed
> inject wakeup latency needlessly, so <sharpens stick>...
I had tried this out a while back but I was indiscriminately doing a
DEQUEUE_DELAYED and letting delayed tasks go through a full ttwu cycle
which did not yield any improvements on hackbench. Your approach to
selectively do it might indeed be better (more thoughts below)
>
> My box booted and neither become exceptionally noisy nor inexplicably
> silent in.. oh, minutes now, so surely yours will be perfectly fine.
>
> After one minute of lightly loaded box browsing, trace_printk() said:
>
> 645 - racy peek says there is a room available
> 11 - cool, reserved room is free
> 206 - no vacancy or wakee pinned
> 38807 - SIS accommodates room seeker
>
> The below should improve the odds, but high return seems unlikely.
>
> ---
> kernel/sched/core.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3790,7 +3790,13 @@ static int ttwu_runnable(struct task_str
> rq = __task_rq_lock(p, &rf);
> if (task_on_rq_queued(p)) {
> update_rq_clock(rq);
> - if (p->se.sched_delayed)
> + /*
> + * If wakee is mobile and the room it reserved is occupied, let it try to migrate.
> + */
> + if (p->se.sched_delayed && rq->nr_running > 1 && cpumask_weight(p->cpus_ptr) > 1) {
Would checking "p->nr_cpus_allowed > 1" be enough instead of doing a
"cpumask_weight(p->cpus_ptr) > 1"?
I was thinking, since the task is indeed delayed, there has to be more
than one task on the runqueue right since a single task by itself cannot
be ineligible and be marked for delayed dequeue? The only time we
encounter a delayed task with "rq->nr_running == 1" is if the other
tasks have been fully dequeued and pick_next_task() is in the process of
picking off all the delayed tasks, but since that is done with the rq
lock held in schedule(), is it even possible for the
"rq->nr_running > 1" to be false here?
> + dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
> + goto out_unlock;
> + } else if (p->se.sched_delayed)
> enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> if (!task_on_cpu(rq, p)) {
> /*
> @@ -3802,6 +3808,7 @@ static int ttwu_runnable(struct task_str
> ttwu_do_wakeup(p);
> ret = 1;
> }
> +out_unlock:
> __task_rq_unlock(rq, &rf);
>
> return ret;
>
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 4:22 ` K Prateek Nayak
@ 2024-11-05 6:46 ` Mike Galbraith
2024-11-06 3:02 ` K Prateek Nayak
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-05 6:46 UTC (permalink / raw)
To: K Prateek Nayak, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-05 at 09:52 +0530, K Prateek Nayak wrote:
> Hello Mike,
Greetings,
> Would checking "p->nr_cpus_allowed > 1" be enough instead of doing a
> "cpumask_weight(p->cpus_ptr) > 1"?
Yeah (thwap).
> I was thinking, since the task is indeed delayed, there has to be more
> than one task on the runqueue right since a single task by itself cannot
> be ineligible and be marked for delayed dequeue?
But they migrate via LB, and idle balance unlocks the rq.
trace_printk() just verified that they do still both land with
sched_delayed intact and with nr_running = 1.
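(A sketch of the kind of debug line that shows this -- not the exact one used:)

	if (p->se.sched_delayed && rq->nr_running == 1)
		trace_printk("%s/%d still sched_delayed with nr_running=%d\n",
			     p->comm, p->pid, rq->nr_running);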
> The only time we
> encounter a delayed task with "rq->nr_running == 1" is if the other
> tasks have been fully dequeued and pick_next_task() is in the process of
> picking off all the delayed tasks, but since that is done with the rq
> lock held in schedule(), is it even possible for the
> "rq->nr_running > 1" to be false here?
I don't see how, the rq being looked at is locked.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-04 12:50 ` Phil Auld
@ 2024-11-05 9:53 ` Christian Loehle
2024-11-05 15:55 ` Phil Auld
2024-11-08 14:53 ` Dietmar Eggemann
1 sibling, 1 reply; 277+ messages in thread
From: Christian Loehle @ 2024-11-05 9:53 UTC (permalink / raw)
To: Phil Auld, Dietmar Eggemann
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 11/4/24 12:50, Phil Auld wrote:
>
> Hi Dietmar,
>
> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>> Hi Phil,
>>
>> On 01/11/2024 13:47, Phil Auld wrote:
>>>
>>> Hi Peter,
>>>
>>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
>>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
>>>> noting that lag is fundamentally a temporal measure. It should not be
>>>> carried around indefinitely.
>>>>
>>>> OTOH it should also not be instantly discarded, doing so will allow a
>>>> task to game the system by purposefully (micro) sleeping at the end of
>>>> its time quantum.
>>>>
>>>> Since lag is intimately tied to the virtual time base, a wall-time
>>>> based decay is also insufficient, notably competition is required for
>>>> any of this to make sense.
>>>>
>>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
>>>> competing until they are eligible.
>>>>
>>>> Strictly speaking, we only care about keeping them until the 0-lag
>>>> point, but that is a difficult proposition, instead carry them around
>>>> until they get picked again, and dequeue them at that point.
>>>
>>> This one is causing a 10-20% performance hit on our filesystem tests.
>>>
>>> On 6.12-rc5 (so with the latest follow ons) we get:
>>>
>>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
>>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
>>>
>>> The test is fio, something like this:
>>>
>>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>>
>> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
>> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
>> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>>
>> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>>
>> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
>> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>>
>> # sudo lshw -class disk -class storage
>> *-nvme
>> description: NVMe device
>> product: GIGABYTE GP-ASM2NE6500GTTD
>> vendor: Phison Electronics Corporation
>> physical id: 0
>> bus info: pci@0000:01:00.0
>> logical name: /dev/nvme0
>> version: EGFM13.2
>> ...
>> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>> resources: irq:16 memory:70800000-70803fff
>>
>> # mount | grep ^/dev/nvme0
>> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>>
>> Which disk device you're using?
>
> Most of the reports are on various NVME drives (samsung mostly I think).
>
>
> One thing I should add is that it's all on LVM:
>
>
> vgcreate vg /dev/nvme0n1 -y
> lvcreate -n thinMeta -L 3GB vg -y
> lvcreate -n thinPool -l 99%FREE vg -y
> lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> lvcreate -n testLV -V 1300G --thinpool thinPool vg
> wipefs -a /dev/mapper/vg-testLV
> mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> mount /dev/mapper/vg-testLV /testfs
>
>
> With VDO or thinpool (as above) it shows on both ext4 and xfs. With fs on
> drive directly it's a little more variable. Some it shows on xfs, some it shows
> on ext4 and not vice-versa, seems to depend on the drive or hw raid. But when
> it shows it's 100% reproducible on that setup.
>
> It's always the randwrite numbers. The rest look fine.
Hi Phil,
Thanks for the detailed instructions. Unfortunately even with your LVM setup on
the platforms I've tried I don't see a regression so far, all the numbers are
about equal for DELAY_DEQUEUE and NO_DELAY_DEQUEUE.
Anyway I have some follow-ups, first let me trim the fio command for readability:
fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
dropping defaults nr_files, loops, fsync, randrepeat
fio --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs
Adding the CPU affinities directly:
fio --cpus_allowed 1-8 --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs
Now I was wondering about the following:
Is it actually the kworker (not another fio) being preempted? (I'm pretty sure it is)
To test: --cpus_allowed_policy split (each fio process gets its own CPU).
You wrote:
>I was thinking maybe the preemption was preventing some batching of IO completions or
>initiations. But that was wrong it seems.
So while it doesn't reproduce for me, the only thing being preempted regularly is
the kworker (running iomap_dio_complete_work). I don't quite follow the "that was
wrong it seems" part then. Could you elaborate?
Could you also post the other benchmark numbers? Does any of them score higher in IOPS?
Is --rw write the same issue if you set --bs 4k (assuming you set a larger bs for seqwrite).
Can you set the kworkers handling completions to SCHED_BATCH too? Just to confirm.
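(In case it helps: chrt -b -p 0 <pid> does that from the command line. A
minimal C helper for the same thing, taking the kworker PID as an argument,
could look like the sketch below -- nothing in it is specific to the series.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* switch an existing thread (e.g. a kworker found via ps) to SCHED_BATCH */
int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 0 };	/* must be 0 for SCHED_BATCH */
	pid_t pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);

	if (sched_setscheduler(pid, SCHED_BATCH, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	return 0;
}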
Regards,
Christian
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 4:05 ` Mike Galbraith
2024-11-05 4:22 ` K Prateek Nayak
@ 2024-11-05 15:20 ` Phil Auld
2024-11-05 19:05 ` Phil Auld
2024-11-06 13:53 ` Peter Zijlstra
2 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-05 15:20 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, Nov 05, 2024 at 05:05:12AM +0100 Mike Galbraith wrote:
> On Mon, 2024-11-04 at 08:05 -0500, Phil Auld wrote:
> > On Sat, Nov 02, 2024 at 05:32:14AM +0100 Mike Galbraith wrote:
> >
> > >
> > > The buddy being preempted certainly won't be wakeup migrated...
> >
> > Not the waker who gets preempted but the wakee may be a bit more
> > sticky on his current cpu and thus stack more since he's still
> > in that runqueue.
>
> Ah, indeed, if wakees don't get scraped off before being awakened, they
> can and do miss chances at an idle CPU according to trace_printk().
>
> I'm undecided if overall it's boon, bane or even matters, as there is
> still an ample supply of wakeup migration, but seems it can indeed
> inject wakeup latency needlessly, so <sharpens stick>...
>
> My box booted and neither become exceptionally noisy nor inexplicably
> silent in.. oh, minutes now, so surely yours will be perfectly fine.
>
> After one minute of lightly loaded box browsing, trace_printk() said:
>
> 645 - racy peek says there is a room available
> 11 - cool, reserved room is free
> 206 - no vacancy or wakee pinned
> 38807 - SIS accommodates room seeker
>
> The below should improve the odds, but high return seems unlikely.
>
Thanks, I'll give it a spin with the nr_cpus_allowed bit.
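(For reference, the check with that change folded in would presumably read as
below -- untested sketch:)

	if (p->se.sched_delayed && rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
		dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
		goto out_unlock;
	} else if (p->se.sched_delayed)
		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);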
Cheers,
Phil
> ---
> kernel/sched/core.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3790,7 +3790,13 @@ static int ttwu_runnable(struct task_str
> rq = __task_rq_lock(p, &rf);
> if (task_on_rq_queued(p)) {
> update_rq_clock(rq);
> - if (p->se.sched_delayed)
> + /*
> + * If wakee is mobile and the room it reserved is occupied, let it try to migrate.
> + */
> + if (p->se.sched_delayed && rq->nr_running > 1 && cpumask_weight(p->cpus_ptr) > 1) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
> + goto out_unlock;
> + } else if (p->se.sched_delayed)
> enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> if (!task_on_cpu(rq, p)) {
> /*
> @@ -3802,6 +3808,7 @@ static int ttwu_runnable(struct task_str
> ttwu_do_wakeup(p);
> ret = 1;
> }
> +out_unlock:
> __task_rq_unlock(rq, &rf);
>
> return ret;
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 9:53 ` Christian Loehle
@ 2024-11-05 15:55 ` Phil Auld
0 siblings, 0 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-05 15:55 UTC (permalink / raw)
To: Christian Loehle
Cc: Dietmar Eggemann, Peter Zijlstra, mingo, juri.lelli,
vincent.guittot, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
efault
Hi Christian,
On Tue, Nov 05, 2024 at 09:53:49AM +0000 Christian Loehle wrote:
> On 11/4/24 12:50, Phil Auld wrote:
> >
> > Hi Dietmar,
> >
> > On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> >> Hi Phil,
> >>
> >> On 01/11/2024 13:47, Phil Auld wrote:
> >>>
> >>> Hi Peter,
> >>>
> >>> On Sat, Jul 27, 2024 at 12:27:49PM +0200 Peter Zijlstra wrote:
> >>>> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> >>>> noting that lag is fundamentally a temporal measure. It should not be
> >>>> carried around indefinitely.
> >>>>
> >>>> OTOH it should also not be instantly discarded, doing so will allow a
> >>>> task to game the system by purposefully (micro) sleeping at the end of
> >>>> its time quantum.
> >>>>
> >>>> Since lag is intimately tied to the virtual time base, a wall-time
> >>>> based decay is also insufficient, notably competition is required for
> >>>> any of this to make sense.
> >>>>
> >>>> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> >>>> competing until they are eligible.
> >>>>
> >>>> Strictly speaking, we only care about keeping them until the 0-lag
> >>>> point, but that is a difficult proposition, instead carry them around
> >>>> until they get picked again, and dequeue them at that point.
> >>>
> >>> This one is causing a 10-20% performance hit on our filesystem tests.
> >>>
> >>> On 6.12-rc5 (so with the latest follow ons) we get:
> >>>
> >>> with DELAY_DEQUEUE the bandwidth is 510 MB/s
> >>> with NO_DELAY_DEQUEUE the bandwidth is 590 MB/s
> >>>
> >>> The test is fio, something like this:
> >>>
> >>> taskset -c 1,2,3,4,5,6,7,8 fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
> >>
> >> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> >> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> >> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
> >>
> >> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
> >>
> >> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> >> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
> >>
> >> # sudo lshw -class disk -class storage
> >> *-nvme
> >> description: NVMe device
> >> product: GIGABYTE GP-ASM2NE6500GTTD
> >> vendor: Phison Electronics Corporation
> >> physical id: 0
> >> bus info: pci@0000:01:00.0
> >> logical name: /dev/nvme0
> >> version: EGFM13.2
> >> ...
> >> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> >> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> >> resources: irq:16 memory:70800000-70803fff
> >>
> >> # mount | grep ^/dev/nvme0
> >> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
> >>
> >> Which disk device you're using?
> >
> > Most of the reports are on various NVME drives (samsung mostly I think).
> >
> >
> > One thing I should add is that it's all on LVM:
> >
> >
> > vgcreate vg /dev/nvme0n1 -y
> > lvcreate -n thinMeta -L 3GB vg -y
> > lvcreate -n thinPool -l 99%FREE vg -y
> > lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> > lvcreate -n testLV -V 1300G --thinpool thinPool vg
> > wipefs -a /dev/mapper/vg-testLV
> > mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> > mount /dev/mapper/vg-testLV /testfs
> >
> >
> > With VDO or thinpool (as above) it shows on both ext4 and xfs. With fs on
> > drive directly it's a little more variable. Some it shows on xfs, some it shows
> > on ext4 and not vice-versa, seems to depend on the drive or hw raid. But when
> > it shows it's 100% reproducible on that setup.
> >
> > It's always the randwrite numbers. The rest look fine.
>
> Hi Phil,
>
> Thanks for the detailed instructions. Unfortunately even with your LVM setup on
> the platforms I've tried I don't see a regression so far, all the numbers are
> about equal for DELAY_DEQUEUE and NO_DELAY_DEQUEUE.
>
Yeah, that's odd.
Fwiw:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD EPYC 7313P 16-Core Processor
BIOS Model name: AMD EPYC 7313P 16-Core Processor Unknown CPU @ 3.0GHz
BIOS CPU family: 107
CPU family: 25
...
16 SMT2 cores (siblings are 16-31)
#lsblk -N
NAME TYPE MODEL SERIAL REV TRAN RQ-SIZE MQ
nvme3n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605178 GDC5602Q nvme 1023 32
nvme2n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605128 GDC5602Q nvme 1023 32
nvme0n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605125 GDC5602Q nvme 1023 32
nvme1n1 disk SAMSUNG MZQL21T9HCJR-00A07 S64GNJ0T605127 GDC5602Q nvme 1023 32
Where nvme0n1 is the one I'm actually using.
I'm on 6.12.0-0.rc5.44.eln143.x86_64 which is v6.12-rc5 with RHEL .config. This
should have little to no franken-kernel bits but now that I have the machine
I'll build from upstream (with the RHEL .config still) to make sure.
We did see it on all the RCs so far.
> Anyway I have some follow-ups, first let me trim the fio command for readability:
> fio --rw randwrite --bs 4k --runtime 1m --fsync 0 --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --nrfiles 1 --loops 1 --name default --randrepeat 1 --time_based --group_reporting --directory /testfs
>
> dropping defaults nr_files, loops, fsync, randrepeat
> fio --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs
>
> Adding the CPU affinities directly:
> fio --cpus_allowed 1-8 --rw randwrite --bs 4k --runtime 1m --iodepth 32 --direct 1 --ioengine libaio --numjobs 8 --size 30g --name default --time_based --group_reporting --directory /testfs
>
Fair enough. It should work the same with taskset I suppose except the below bit. I've
been given this from our performance team. They have a framework that produces nice html
pages with red and green results and graphs and whatnot. Right now it's in the form of
a script that pulls the KB/s number out of the json output which is nice and keeps me
from going crosseyed looking at the full fio run output.
> Now I was wondering about the following:
> Is it actually the kworker (not another fio) being preempted? (I'm pretty sure it is)
> To test: --cpus_allowed_policy split (each fio process gets its own CPU).
with --cpus_allowed and --cpus_allowed_policy split the results with DELAY_DEQUEUE are
better (540MB/s) but with NO_DELAY_DEQUEUE they are also better (640 MB/s). It was
510MB/s and 590MB/s before.
>
> You wrote:
> >I was thinking maybe the preemption was preventing some batching of IO completions or
> >initiations. But that was wrong it seems.
>
> So while it doesn't reproduce for me, the only thing being preempted regularly is
> the kworker (running iomap_dio_complete_work). I don't quite follow the "that was
> wrong it seems" part then. Could you elaborate?
>
I was thinking that the fio batch test, along with disabling WAKEUP_PREEMPTION,
was telling me that it wasn't an over-preemption issue, but I could also be
wrong about that...
> Could you also post the other benchmark numbers? Does any of them score higher in IOPS?
> Is --rw write the same issue if you set --bs 4k (assuming you set a larger bs for seqwrite).
>
I don't have numbers for all of the other flavors but I ran --rw write --bs 4k:
DELAY_DEQUEUE ~590MB/s
NO_DELAY_DEQUEUE ~840MB/s
Those results are not good for DELAY_DEQUEUE either.
> Can you set the kworkers handling completions to SCHED_BATCH too? Just to confirm.
I think I did the wrong kworkers the first time. So I'll try again to figure out which
kworkers to twiddle (or I'll just do all 227 of them...).
Thanks,
Phil
>
> Regards,
> Christian
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 15:20 ` Phil Auld
@ 2024-11-05 19:05 ` Phil Auld
2024-11-06 2:45 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-05 19:05 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, Nov 05, 2024 at 10:20:10AM -0500 Phil Auld wrote:
> On Tue, Nov 05, 2024 at 05:05:12AM +0100 Mike Galbraith wrote:
> > On Mon, 2024-11-04 at 08:05 -0500, Phil Auld wrote:
> > > On Sat, Nov 02, 2024 at 05:32:14AM +0100 Mike Galbraith wrote:
> > >
> > > >
> > > > The buddy being preempted certainly won't be wakeup migrated...
> > >
> > > Not the waker who gets preempted but the wakee may be a bit more
> > > sticky on his current cpu and thus stack more since he's still
> > > in that runqueue.
> >
> > Ah, indeed, if wakees don't get scraped off before being awakened, they
> > can and do miss chances at an idle CPU according to trace_printk().
> >
> > I'm undecided if overall it's boon, bane or even matters, as there is
> > still an ample supply of wakeup migration, but seems it can indeed
> > inject wakeup latency needlessly, so <sharpens stick>...
> >
> > My box booted and neither become exceptionally noisy nor inexplicably
> > silent in.. oh, minutes now, so surely yours will be perfectly fine.
> >
> > After one minute of lightly loaded box browsing, trace_printk() said:
> >
> > 645 - racy peek says there is a room available
> > 11 - cool, reserved room is free
> > 206 - no vacancy or wakee pinned
> > 38807 - SIS accommodates room seeker
> >
> > The below should improve the odds, but high return seems unlikely.
> >
>
> Thanks, I'll give it a spin with the nr_cpus_allowed bit.
>
Well that worked pretty well. It actually makes DELAY_DEQUEUE a little better
than NO_DELAY_DEQUEUE
DELAY_DEQUEUE ~595MB/s
NO_DELAY_DEQUEUE ~581MB/s
I left the cpumask_weight because vim isn't happy with my terminal to that machine
for some reason I have not found yet. So I couldn't actually edit the darn thing.
This is not my normal build setup. But I'll spin up a real build with this patch
and throw it over the wall to the perf team to have them do their full battery
of tests on it.
Probably "Paul" will be cranky now.
Thanks,
Phil
>
> Cheers,
> Phil
>
>
>
> > ---
> > kernel/sched/core.c | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3790,7 +3790,13 @@ static int ttwu_runnable(struct task_str
> > rq = __task_rq_lock(p, &rf);
> > if (task_on_rq_queued(p)) {
> > update_rq_clock(rq);
> > - if (p->se.sched_delayed)
> > + /*
> > + * If wakee is mobile and the room it reserved is occupied, let it try to migrate.
> > + */
> > + if (p->se.sched_delayed && rq->nr_running > 1 && cpumask_weight(p->cpus_ptr) > 1) {
> > + dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
> > + goto out_unlock;
> > + } else if (p->se.sched_delayed)
> > enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > if (!task_on_cpu(rq, p)) {
> > /*
> > @@ -3802,6 +3808,7 @@ static int ttwu_runnable(struct task_str
> > ttwu_do_wakeup(p);
> > ret = 1;
> > }
> > +out_unlock:
> > __task_rq_unlock(rq, &rf);
> >
> > return ret;
> >
> >
>
> --
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (29 preceding siblings ...)
2024-09-10 11:45 ` Sven Schnelle
@ 2024-11-06 1:07 ` Saravana Kannan
2024-11-06 6:19 ` K Prateek Nayak
2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
31 siblings, 1 reply; 277+ messages in thread
From: Saravana Kannan @ 2024-11-06 1:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault, Android Kernel Team,
Qais Yousef, Vincent Palomares, Samuel Wu, David Dai, John Stultz
On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
>
Hi Peter,
TL;DR:
We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
change we accept. Some of these changes are merges from Linus's tree.
We can see a very clear change in behavior with this patch series.
Based on what we are seeing, we'd expect this change in behavior to
cause pretty serious power regression (7-25%) depending on what the
actual bug is and the use case.
Intro:
We run these tests 20 times for every build (a bunch of changes). All
the data below is from the 20+ builds before this series and 20 builds
after this series (inclusive). So, all the "before numbers" are from
(20 x 20) 400+ runs and all the after numbers are from another 400+
runs.
Test:
We create a synthetic "tiny" thread that runs for 3ms and sleeps for
10ms at Fmin. We let it run like this for several seconds to make sure
the util is low and all the "new thread" boost stuff isn't kicking in.
So, at this point, the CPU frequency is at Fmin. Then we let this
thread run continuously without sleeping and measure (using ftrace)
the time it takes for the CPU to get to Fmax.
We do this separately (fresh run) on the Pixel 6 with the cpu affinity
set to each cluster and once without any cpu affinity (thread starts
at little).
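(A rough sketch of that "tiny" thread, for concreteness -- the 3ms/10ms
numbers are the ones above; everything else is invented, and the real harness
plus the ftrace measurement are not shown here:)

#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void burn_ns(uint64_t ns)
{
	uint64_t start = now_ns();

	while (now_ns() - start < ns)
		;	/* busy loop */
}

int main(void)
{
	/* phase 1: 3ms on / 10ms off, long enough for util to settle low */
	for (int i = 0; i < 1000; i++) {
		burn_ns(3 * 1000 * 1000);
		usleep(10 * 1000);
	}

	/* phase 2: run flat out; time to reach Fmax is measured externally */
	for (;;)
		burn_ns(1000 * 1000);

	return 0;
}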
Data:
All the values below are in milliseconds.
When the thread is not affined to any CPU: So thread starts on little,
ramps up to fmax, migrates to middle, ramps up to fmax, migrates to
big, ramps up to fmax.
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  |    169 |   151 |
+-----------------+--------+-------+
| Median          |    221 |   177 |
+-----------------+--------+-------+
| Mean            |    221 |   177 |
+-----------------+--------+-------+
| 95th percentile |    249 |   200 |
+-----------------+--------+-------+
When thread is affined to the little cluster:
The average time to reach Fmax is 104 ms without this series and 66 ms
after this series. We didn't collect the individual per run data. We
can if you really need it. We also noticed that the little cluster
wouldn't go to Fmin (300 MHz) after this series even when the CPUs are
mostly idle. It was instead hovering at 738 MHz (the Fmax is ~1800
MHz).
When thread is affined to the middle cluster:
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  |     99 |    84 |
| Median          |    111 |   104 |
| Mean            |    111 |   104 |
| 95th percentile |    120 |   119 |
+-----------------+--------+-------+
When thread is affined to the big cluster:
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  |    138 |    96 |
| Median          |    147 |   144 |
| Mean            |    145 |   134 |
| 95th percentile |    151 |   150 |
+-----------------+--------+-------+
As you can see, the ramp-up time has decreased noticeably. Also, as
you can tell from the 5th percentile numbers, the standard deviation
has increased as well, causing a wider spread of the ramp-up time
(more noticeable on the middle and big clusters). So in general this
looks like it's going to increase the usage of the middle and big CPU
clusters and also shift the CPU frequency residency to frequencies
that are 5 to 25% higher.
We already checked the rate_limit_us value and it is the same for both
the before/after cases and it's set to 7.5 ms (jiffies is 4ms in our
case). Also, based on my limited understanding the DELAYED_DEQUEUE
stuff is only relevant if there are multiple contending threads in a
CPU. In this case it's just 1 continuously running thread with a
kworker that runs sporadically less than 1% of the time.
So, without a deeper understanding of this patch series, it's behaving
as if the PELT signal is accumulating faster than expected. Which is a
bit surprising to me because AFAIK (which is not much) the EEVDF
series isn't supposed to change the PELT behavior.
If you want to get a visual idea of what the system is doing, here are
some perfetto links that visualize the traces. Hopefully you have
access permissions to these. You can use the W, S, A, D keys to pan
and zoom around the timeline.
Big Before:
https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
Big After:
https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9
Thanks,
Saravana, Sam and David
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 19:05 ` Phil Auld
@ 2024-11-06 2:45 ` Mike Galbraith
0 siblings, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-06 2:45 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-05 at 14:05 -0500, Phil Auld wrote:
>
> Well that worked pretty well. It actually makes DELAY_DEQUEUE a litte better
> than NO_DELAY_DEQUEUE
>
> DELAY_DEQUEUE ~595MB/s
> NO_DELAY_DEQUEUE ~581MB/s
Hrmph, not the expected result, but sharp stick's mission was to
confirm/deny that delta's relevance, so job well done.. kindling.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 6:46 ` Mike Galbraith
@ 2024-11-06 3:02 ` K Prateek Nayak
0 siblings, 0 replies; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-06 3:02 UTC (permalink / raw)
To: Mike Galbraith, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
Hello Mike,
On 11/5/2024 12:16 PM, Mike Galbraith wrote:
> On Tue, 2024-11-05 at 09:52 +0530, K Prateek Nayak wrote:
>> Hello Mike,
>
> Greetings,
>
>> Would checking "p->nr_cpus_allowed > 1" be enough instead of doing a
>> "cpumask_weight(p->cpus_ptr) > 1"?
>
> Yeah (thwap).
>
>> I was thinking, since the task is indeed delayed, there has to be more
>> than one task on the runqueue right since a single task by itself cannot
>> be ineligible and be marked for delayed dequeue?
>
> But they migrate via LB, and idle balance unlocks the rq.
> trace_printk() just verified that they do still both land with
> sched_delayed intact and with nr_running = 1.
Ah! You are right! Thank you for clarifying. Since the sharp stick seems
to be working, let me go throw a bunch of workloads at it and report
back :)
--
Thanks and Regards,
Prateek
>
>> The only time we
>> encounter a delayed task with "rq->nr_running == 1" is if the other
>> tasks have been fully dequeued and pick_next_task() is in the process of
>> picking off all the delayed task, but since that is done with the rq
>> lock held in schedule(), it is even possible for the
>> "rq->nr_running > 1" to be false here?
>
> I don't see how, the rq being looked at is locked.
>
> -Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-06 1:07 ` Saravana Kannan
@ 2024-11-06 6:19 ` K Prateek Nayak
2024-11-06 11:09 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-06 6:19 UTC (permalink / raw)
To: Saravana Kannan, Samuel Wu, David Dai, Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wuyun.abel,
youssefesmat, tglx, efault, Android Kernel Team, Qais Yousef,
Vincent Palomares, John Stultz, Mike Galbraith, Luis Machado
(+ Mike, Luis)
Hello Saravana, Sam, David,
On 11/6/2024 6:37 AM, Saravana Kannan wrote:
> On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree for ever it seems, and people have been
>> testing it and sending fixes.
>>
>> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>> - split up the huge delay-dequeue patch
>> - tested/fixed cfs-bandwidth
>> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> - propagate min_slice up cgroups
>> - CLOCK_THREAD_DVFS_ID
>>
>
> Hi Peter,
>
> TL;DR:
> We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
> change we accept. Some of these changes are merges from Linus's tree.
> We can see a very clear change in behavior with this patch series.
> Based on what we are seeing, we'd expect this change in behavior to
> cause pretty serious power regression (7-25%) depending on what the
> actual bug is and the use case.
Do the regressions persist with NO_DELAY_DEQUEUE? You can disable the
DELAY_DEQUEUE feature added in EEVDF Complete via debugfs by doing a:
# echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features
Since delayed entities are still on the runqueue, they can affect PELT
calculation. Vincent and Dietmar have both noted this and Peter posted
https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
in response but it was pulled out since Luis reported observing -ve
values for h_nr_delayed on his setup. A lot has been fixed around
delayed dequeue since and I wonder if now would be the right time to
re-attempt h_nr_delayed tracking.
There is also the fact that delayed entities don't update the tg
load_avg, since the delayed path calls update_load_avg() without the
UPDATE_TG flag set. This can again cause some changes in the PELT
calculation, since the averages are used to estimate the entity
shares when running with cgroups.
>
> Intro:
> We run these tests 20 times for every build (a bunch of changes). All
> the data below is from the 20+ builds before this series and 20 builds
> after this series (inclusive). So, all the "before numbers" are from
> (20 x 20) 400+ runs and all the after numbers are from another 400+
> runs.
>
> Test:
> We create a synthetic "tiny" thread that runs for 3ms and sleeps for
> 10ms at Fmin. We let it run like this for several seconds to make sure
> the util is low and all the "new thread" boost stuff isn't kicking in.
> So, at this point, the CPU frequency is at Fmin. Then we let this
> thread run continuously without sleeping and measure (using ftrace)
> the time it takes for the CPU to get to Fmax.
>
> We do this separately (fresh run) on the Pixel 6 with the cpu affinity
> set to each cluster and once without any cpu affinity (thread starts
> at little).
>
> Data:
> All the values below are in milliseconds.
>
> When the thread is not affined to any CPU: So thread starts on little,
> ramps up to fmax, migrates to middle, ramps up to fmax, migrates to
> big, ramps up to fmax.
> +----------------------------------+
> | Data | Before | After |
> |-----------------------+----------|
> | 5th percentile | 169 | 151 |
> |-----------------------+----------|
> | Median | 221 | 177 |
> |-----------------------+----------|
> | Mean | 221 | 177 |
> |-----------------------+----------|
> | 95th percentile | 249 | 200 |
> +----------------------------------+
>
> When thread is affined to the little cluster:
> The average time to reach Fmax is 104 ms without this series and 66 ms
> after this series. We didn't collect the individual per run data. We
> can if you really need it. We also noticed that the little cluster
> wouldn't go to Fmin (300 MHz) after this series even when the CPUs are
> mostly idle. It was instead hovering at 738 MHz (the Fmax is ~1800
> MHz).
>
> When thread is affined to the middle cluster:
> +----------------------------------+
> | Data | Before | After |
> |-----------------------+----------|
> | 5th percentile | 99 | 84 |
> |-----------------------+----------|
> | Median | 111 | 104 |
> |-----------------------+----------|
> | Mean | 111 | 104 |
> |-----------------------+----------|
> | 95th percentile | 120 | 119 |
> +----------------------------------+
>
> When thread is affined to the big cluster:
> +----------------------------------+
> | Data | Before | After |
> |-----------------------+----------|
> | 5th percentile | 138 | 96 |
> |-----------------------+----------|
> | Median | 147 | 144 |
> |-----------------------+----------|
> | Mean | 145 | 134 |
> |-----------------------+----------|
> | 95th percentile | 151 | 150 |
> +----------------------------------+
>
> As you can see, the ramp up time has decreased noticeably. Also, as
> you can tell from the 5th percentile numbers, the standard deviation
> has also increased a lot too, causing a wider spread of the ramp up
> time (more noticeable in the middle and big clusters). So in general
> this looks like it's going to increase the usage of the middle and big
> CPU clusters and also going to shift the CPU frequency residency to
> frequencies that are 5 to 25% higher.
>
> We already checked the rate_limit_us value and it is the same for both
> the before/after cases and it's set to 7.5 ms (jiffies is 4ms in our
> case). Also, based on my limited understanding the DELAYED_DEQUEUE
> stuff is only relevant if there are multiple contending threads in a
> CPU. In this case it's just 1 continuously running thread with a
> kworker that runs sporadically less than 1% of the time.
There is an ongoing investigation on delayed entities possibly not
migrating if they are woken up before they are fully dequeued. Since you
mention there is only one task, this should not matter but could you
also try out Mike's suggestion from
https://lore.kernel.org/lkml/1bffa5f2ca0fec8a00f84ffab86dc6e8408af31c.camel@gmx.de/
and see if it makes a difference on your test suite?
--
Thanks and Regards,
Prateek
>
> So, without a deeper understanding of this patch series, it's behaving
> as if the PELT signal is accumulating faster than expected. Which is a
> bit surprising to me because AFAIK (which is not much) the EEVDF
> series isn't supposed to change the PELT behavior.
>
> If you want to get a visual idea of what the system is doing, here are
> some perfetto links that visualize the traces. Hopefully you have
> access permissions to these. You can use the W, S, A, D keys to pan
> and zoom around the timeline.
>
> Big Before:
> https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
> Big After:
> https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9
P.S. I only gave it a quick glance, but I do see the frequency ramping
up with larger deltas and reaching Fmax much more quickly in the case of
"Big After"
>
> Thanks,
> Saravana, Sam and David
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-06 6:19 ` K Prateek Nayak
@ 2024-11-06 11:09 ` Peter Zijlstra
2024-11-06 12:06 ` Luis Machado
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-06 11:09 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Saravana Kannan, Samuel Wu, David Dai, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wuyun.abel, youssefesmat, tglx, efault,
Android Kernel Team, Qais Yousef, Vincent Palomares, John Stultz,
Luis Machado
On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> Since delayed entities are still on the runqueue, they can affect PELT
> calculation. Vincent and Dietmar have both noted this and Peter posted
> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> in response but it was pulled out since Luis reported observing -ve
> values for h_nr_delayed on his setup. A lot has been fixed around
> delayed dequeue since and I wonder if now would be the right time to
> re-attempt h_nr_delayed tracking.
Yeah, it's something I meant to get back to. I think the patch as posted
was actually right and it didn't work for Luis because of some other,
since fixed issue.
But I might be misremembering things. I'll get to it eventually :/
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-06 11:09 ` Peter Zijlstra
@ 2024-11-06 12:06 ` Luis Machado
2024-11-08 7:07 ` Saravana Kannan
0 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-11-06 12:06 UTC (permalink / raw)
To: Peter Zijlstra, K Prateek Nayak
Cc: Saravana Kannan, Samuel Wu, David Dai, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wuyun.abel, youssefesmat, tglx, efault,
Android Kernel Team, Qais Yousef, Vincent Palomares, John Stultz
Hi,
On 11/6/24 11:09, Peter Zijlstra wrote:
> On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
>
>> Since delayed entities are still on the runqueue, they can affect PELT
>> calculation. Vincent and Dietmar have both noted this and Peter posted
>> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
>> in response but it was pulled out since Luis reported observing -ve
>> values for h_nr_delayed on his setup. A lot has been fixed around
>> delayed dequeue since and I wonder if now would be the right time to
>> re-attempt h_nr_delayed tracking.
>
> Yeah, it's something I meant to get back to. I think the patch as posted
> was actually right and it didn't work for Luis because of some other,
> since fixed issue.
>
> But I might be misremembering things. I'll get to it eventually :/
Sorry for the late reply, I got sidetracked on something else.
There have been a few power regressions (based on our Pixel6-based testing) due
to the delayed-dequeue series.
The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
So power use was brought down by the above fix, but some issues still remained, like the
accounting issues with h_nr_running and not taking sched_delayed tasks into account.
Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
accounting that slowly drove frequencies up again.
If I recall correctly, Peter has pulled that particular patch from the tree, but we should
definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
[1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
[2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
[3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
[4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
[5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-05 4:05 ` Mike Galbraith
2024-11-05 4:22 ` K Prateek Nayak
2024-11-05 15:20 ` Phil Auld
@ 2024-11-06 13:53 ` Peter Zijlstra
2024-11-06 14:14 ` Peter Zijlstra
2024-11-06 14:14 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Mike Galbraith
2 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-06 13:53 UTC (permalink / raw)
To: Mike Galbraith
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Tue, Nov 05, 2024 at 05:05:12AM +0100, Mike Galbraith wrote:
> After one minute of lightly loaded box browsing, trace_printk() said:
>
> 645 - racy peek says there is a room available
> 11 - cool, reserved room is free
> 206 - no vacancy or wakee pinned
> 38807 - SIS accommodates room seeker
>
> The below should improve the odds, but high return seems unlikely.
>
> ---
> kernel/sched/core.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3790,7 +3790,13 @@ static int ttwu_runnable(struct task_str
> rq = __task_rq_lock(p, &rf);
> if (task_on_rq_queued(p)) {
> update_rq_clock(rq);
> - if (p->se.sched_delayed)
> + /*
> + * If wakee is mobile and the room it reserved is occupied, let it try to migrate.
> + */
> + if (p->se.sched_delayed && rq->nr_running > 1 && cpumask_weight(p->cpus_ptr) > 1) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
> + goto out_unlock;
> + } else if (p->se.sched_delayed)
> enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> if (!task_on_cpu(rq, p)) {
> /*
> @@ -3802,6 +3808,7 @@ static int ttwu_runnable(struct task_str
> ttwu_do_wakeup(p);
> ret = 1;
> }
> +out_unlock:
> __task_rq_unlock(rq, &rf);
>
> return ret;
So... I was trying to make that prettier and ended up with something
like this:
---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++++------------------
kernel/sched/sched.h | 5 +++++
2 files changed, 33 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54d82c21fc8e..b083c6385e88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3774,28 +3774,38 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
+ CLASS(__task_rq_lock, rq_guard)(p);
+ struct rq *rq = rq_guard.rq;
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
- /*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
- */
- wakeup_preempt(rq, p, wake_flags);
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ update_rq_clock(rq);
+ if (p->se.sched_delayed) {
+ int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+
+ /*
+ * Since sched_delayed means we cannot be current anywhere,
+ * dequeue it here and have it fall through to the
+ * select_task_rq() case further along the ttwu() path.
+ */
+ if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
+ dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
+
+ enqueue_task(rq, p, queue_flags);
}
- __task_rq_unlock(rq, &rf);
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
+ }
+ ttwu_do_wakeup(p);
- return ret;
+ return 1;
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 21b1780c6695..1714ac38500f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1787,6 +1787,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+ _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+ __task_rq_unlock(_T->rq, &_T->rf),
+ struct rq *rq; struct rq_flags rf)
+
DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
_T->rq = task_rq_lock(_T->lock, &_T->rf),
task_rq_unlock(_T->rq, _T->lock, &_T->rf),
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-06 13:53 ` Peter Zijlstra
@ 2024-11-06 14:14 ` Peter Zijlstra
2024-11-06 14:38 ` Peter Zijlstra
2024-11-06 15:22 ` Mike Galbraith
2024-11-06 14:14 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Mike Galbraith
1 sibling, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-06 14:14 UTC (permalink / raw)
To: Mike Galbraith
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, Nov 06, 2024 at 02:53:46PM +0100, Peter Zijlstra wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 54d82c21fc8e..b083c6385e88 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3774,28 +3774,38 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
>
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along the ttwu() path.
> + */
> + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> +
> + enqueue_task(rq, p, queue_flags);
And then I wondered... this means that !task_on_cpu() is true for
sched_delayed, and thus we can move this in the below branch.
But also, we can probably dequeue every such task, not only
sched_delayed ones.
> }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> + }
> + ttwu_do_wakeup(p);
>
> + return 1;
> }
Yielding something like this on top... which boots. But since I forgot
to make it a feature, I can't actually tell at this point.. *sigh*
Anyway, more toys to poke at I suppose.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b083c6385e88..69b19ba77598 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3781,28 +3781,32 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
return 0;
update_rq_clock(rq);
- if (p->se.sched_delayed) {
- int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+ if (!task_on_cpu(rq, p)) {
+ int queue_flags = DEQUEUE_NOCLOCK;
+
+ if (p->se.sched_delayed)
+ queue_flags |= DEQUEUE_DELAYED;
/*
- * Since sched_delayed means we cannot be current anywhere,
- * dequeue it here and have it fall through to the
- * select_task_rq() case further along the ttwu() path.
+ * Since we're not current anywhere *AND* hold pi_lock, dequeue
+ * it here and have it fall through to the select_task_rq()
+ * case further along the ttwu() path.
*/
if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
return 0;
}
- enqueue_task(rq, p, queue_flags);
- }
- if (!task_on_cpu(rq, p)) {
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, queue_flags);
+
/*
* When on_rq && !on_cpu the task is preempted, see if
* it should preempt the task that is current now.
*/
wakeup_preempt(rq, p, wake_flags);
}
+ SCHED_WARN_ON(p->se.sched_delayed);
ttwu_do_wakeup(p);
return 1;
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-06 13:53 ` Peter Zijlstra
2024-11-06 14:14 ` Peter Zijlstra
@ 2024-11-06 14:14 ` Mike Galbraith
2024-11-06 14:33 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-06 14:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 2024-11-06 at 14:53 +0100, Peter Zijlstra wrote:
>
> So... I was trying to make that prettier and ended up with something
> like this:
Passing ENQUEUE_DELAYED to dequeue_task() looks funky until you check
the value, but otherwise yeah, when applied that looks better to me.
>
> ---
> kernel/sched/core.c | 46 ++++++++++++++++++++++++++++------------------
> kernel/sched/sched.h | 5 +++++
> 2 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 54d82c21fc8e..b083c6385e88 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3774,28 +3774,38 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
>
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along the ttwu() path.
> + */
> + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> + }
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 21b1780c6695..1714ac38500f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1787,6 +1787,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> + __task_rq_unlock(_T->rq, &_T->rf),
> + struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> _T->rq = task_rq_lock(_T->lock, &_T->rf),
> task_rq_unlock(_T->rq, _T->lock, &_T->rf),
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-06 14:14 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Mike Galbraith
@ 2024-11-06 14:33 ` Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-06 14:33 UTC (permalink / raw)
To: Mike Galbraith
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, Nov 06, 2024 at 03:14:58PM +0100, Mike Galbraith wrote:
> On Wed, 2024-11-06 at 14:53 +0100, Peter Zijlstra wrote:
> >
> > So... I was trying to make that prettier and ended up with something
> > like this:
>
> Passing ENQUEUE_DELAYED to dequeue_task() looks funky until you check
> the value, but otherwise yeah, when applied that looks better to me.
Yeah, it does look funneh, but we've been doing that for a long long
while.
Still, perhaps I should rename the shared ones to QUEUE_foo and only
have the specific ones be {EN,DE}QUEUE_foo.
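For anyone reading along, a small stand-alone illustration of why that
works: the flag pairs shared by the enqueue and dequeue paths are defined
to the same numeric values, so the "wrong" prefix still sets the right
bit. The values below are assumptions for illustration only, not the
actual definitions in kernel/sched/sched.h:

/*
 * Illustrative values only -- not the kernel's real definitions.
 * The ENQUEUE_* and DEQUEUE_* flags shared by both paths are kept
 * numerically identical on purpose.
 */
#include <assert.h>

#define ENQUEUE_NOCLOCK  0x08
#define DEQUEUE_NOCLOCK  0x08	/* matches ENQUEUE_NOCLOCK */
#define ENQUEUE_DELAYED  0x200
#define DEQUEUE_DELAYED  0x200	/* matches ENQUEUE_DELAYED */

int main(void)
{
	/*
	 * So dequeue_task(rq, p, DEQUEUE_SLEEP | ENQUEUE_DELAYED | ENQUEUE_NOCLOCK)
	 * sets exactly the same bits as the DEQUEUE_* spellings would.
	 */
	assert((ENQUEUE_DELAYED | ENQUEUE_NOCLOCK) ==
	       (DEQUEUE_DELAYED | DEQUEUE_NOCLOCK));
	return 0;
}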
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-06 14:14 ` Peter Zijlstra
@ 2024-11-06 14:38 ` Peter Zijlstra
2024-11-06 15:22 ` Mike Galbraith
1 sibling, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-06 14:38 UTC (permalink / raw)
To: Mike Galbraith
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, Nov 06, 2024 at 03:14:20PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2024 at 02:53:46PM +0100, Peter Zijlstra wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 54d82c21fc8e..b083c6385e88 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3774,28 +3774,38 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > + CLASS(__task_rq_lock, rq_guard)(p);
> > + struct rq *rq = rq_guard.rq;
> >
> > + if (!task_on_rq_queued(p))
> > + return 0;
> > +
> > + update_rq_clock(rq);
> > + if (p->se.sched_delayed) {
> > + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > + /*
> > + * Since sched_delayed means we cannot be current anywhere,
> > + * dequeue it here and have it fall through to the
> > + * select_task_rq() case further along the ttwu() path.
> > + */
> > + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> > + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> > + return 0;
> > }
> > +
> > + enqueue_task(rq, p, queue_flags);
>
> And then I wondered... this means that !task_on_cpu() is true for
> sched_delayed, and thus we can move this in the below branch.
>
> But also, we can probably dequeue every such task, not only
> sched_delayed ones.
>
> > }
> > + if (!task_on_cpu(rq, p)) {
> > + /*
> > + * When on_rq && !on_cpu the task is preempted, see if
> > + * it should preempt the task that is current now.
> > + */
> > + wakeup_preempt(rq, p, wake_flags);
> > + }
> > + ttwu_do_wakeup(p);
> >
> > + return 1;
> > }
>
>
> Yielding something like this on top... which boots. But since I forgot
> to make it a feature, I can't actually tell at this point.. *sigh*
It dies real fast, so clearly I'm missing something. Oh well.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-06 14:14 ` Peter Zijlstra
2024-11-06 14:38 ` Peter Zijlstra
@ 2024-11-06 15:22 ` Mike Galbraith
2024-11-07 4:03 ` Mike Galbraith
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-06 15:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 2024-11-06 at 15:14 +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2024 at 02:53:46PM +0100, Peter Zijlstra wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 54d82c21fc8e..b083c6385e88 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3774,28 +3774,38 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > + CLASS(__task_rq_lock, rq_guard)(p);
> > + struct rq *rq = rq_guard.rq;
> >
> > + if (!task_on_rq_queued(p))
> > + return 0;
> > +
> > + update_rq_clock(rq);
> > + if (p->se.sched_delayed) {
> > + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > + /*
> > + * Since sched_delayed means we cannot be current anywhere,
> > + * dequeue it here and have it fall through to the
> > + * select_task_rq() case further along the ttwu() path.
> > + */
> > + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> > + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> > + return 0;
> > }
> > +
> > + enqueue_task(rq, p, queue_flags);
>
> And then I wondered... this means that !task_on_cpu() is true for
> sched_delayed, and thus we can move this in the below branch.
>
> But also, we can probably dequeue every such task, not only
> sched_delayed ones.
>
> > }
> > + if (!task_on_cpu(rq, p)) {
> > + /*
> > + * When on_rq && !on_cpu the task is preempted, see if
> > + * it should preempt the task that is current now.
> > + */
> > + wakeup_preempt(rq, p, wake_flags);
> > + }
> > + ttwu_do_wakeup(p);
> >
> > + return 1;
> > }
>
>
> Yielding something like this on top... which boots. But since I forgot
> to make it a feature, I can't actually tell at this point.. *sigh*
>
> Anyway, more toys to poke at I suppose.
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b083c6385e88..69b19ba77598 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3781,28 +3781,32 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
> return 0;
>
> update_rq_clock(rq);
> - if (p->se.sched_delayed) {
> - int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> + if (!task_on_cpu(rq, p)) {
> + int queue_flags = DEQUEUE_NOCLOCK;
> +
> + if (p->se.sched_delayed)
> + queue_flags |= DEQUEUE_DELAYED;
>
> /*
> - * Since sched_delayed means we cannot be current anywhere,
> - * dequeue it here and have it fall through to the
> - * select_task_rq() case further along the ttwu() path.
> + * Since we're not current anywhere *AND* hold pi_lock, dequeue
> + * it here and have it fall through to the select_task_rq()
> + * case further along the ttwu() path.
> */
> if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> return 0;
> }
Hm, if we try to bounce a preempted task and fail, the wakeup_preempt()
call won't happen.
Bouncing preempted tasks is a double-edged sword.. on the one hand, it's
a huge win if the bounce works for communicating tasks that will otherwise
be talking around the not-my-buddy man-in-the-middle who did the
preempting, but on the other, when PELT has its white hat on (it also has
a black one) and has buddies pairing up nicely in an approaching
saturation scenario, bounces disturb it and add chaos. Dunno.
>
> - enqueue_task(rq, p, queue_flags);
> - }
> - if (!task_on_cpu(rq, p)) {
> + if (p->se.sched_delayed)
> + enqueue_task(rq, p, queue_flags);
> +
> /*
> * When on_rq && !on_cpu the task is preempted, see if
> * it should preempt the task that is current now.
> */
> wakeup_preempt(rq, p, wake_flags);
> }
> + SCHED_WARN_ON(p->se.sched_delayed);
> ttwu_do_wakeup(p);
>
> return 1;
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-06 15:22 ` Mike Galbraith
@ 2024-11-07 4:03 ` Mike Galbraith
2024-11-07 9:46 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-07 4:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Wed, 2024-11-06 at 16:22 +0100, Mike Galbraith wrote:
>
> Hm, if we try to bounce a preempted task and fail, the wakeup_preempt()
> call won't happen.
Zzzt, wrong, falling through still leads to the bottom of a wakeup with
its preempt check...
> Bouncing preempted tasks is double edged sword..
..but that bit is pretty intriguing. From the service latency and
utilization perspective only at decision time (prime mission), it's an
obvious win to migrate to an idle CPU.
It's also a clear win for communication latency when buddies are NOT
popular but misused end to end latency measurement tools ala TCP_RR
with only microscopic concurrency. For the other netperf modes of
operation, there's no shortage of concurrency to salvage *and get out
of the communication stream*, and I think that applies to wide swaths
of the real world. What makes it intriguing is the cross-over point
where "stacking is the stupidest idea ever" becomes "stacking may put
my and my buddy's wide butts directly in our own communication stream,
but that's less pain than what unrelated wide butts inflict on top of
higher LLC vs L2 latency".
For UDP_STREAM (async to the bone), there is no such point; it would
seemingly prefer its buddy call from orbit, but for its more reasonable
TCP brother and ilk, there is.
Sample numbers (talk), interference is 8 unbound 88% compute instances,
box is crusty ole 8 rq i7-4790.
UDP_STREAM-1 unbound Avg: 47135 Sum: 47135
UDP_STREAM-1 stacked Avg: 39602 Sum: 39602
UDP_STREAM-1 cross-smt Avg: 61599 Sum: 61599
UDP_STREAM-1 cross-core Avg: 67680 Sum: 67680
(distance very good!)
TCP_STREAM-1 unbound Avg: 26299 Sum: 26299
TCP_STREAM-1 stacked Avg: 27893 Sum: 27893
TCP_STREAM-1 cross-smt Avg: 16728 Sum: 16728
TCP_STREAM-1 cross-core Avg: 13877 Sum: 13877
(idiot, distance NO good, bouncy castle very good!)
Service latency dominates.. not quite always, and bouncing tasks about
is simultaneously the only sane thing to do and pure evil... like
everything else in sched land, making it a hard game to win :)
I built that patch out of curiosity, and yeah, set_next_task_fair()
finding a cfs_rq->curr ends play time pretty quickly. Too bad my
service latency is a bit dinged up, bouncing preempted wakees about
promises to be interesting.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-07 4:03 ` Mike Galbraith
@ 2024-11-07 9:46 ` Mike Galbraith
2024-11-07 14:02 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-07 9:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
>
> I built that patch out of curiosity, and yeah, set_next_task_fair()
> finding a cfs_rq->curr ends play time pretty quickly.
The below improved uptime, and trace_printk() says it's doing the
intended, so I suppose I'll add a feature and see what falls out.
---
kernel/sched/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3794,7 +3794,7 @@ static int ttwu_runnable(struct task_str
int queue_flags = DEQUEUE_NOCLOCK;
if (p->se.sched_delayed)
- queue_flags |= DEQUEUE_DELAYED;
+ queue_flags |= (DEQUEUE_DELAYED | DEQUEUE_SLEEP);
/*
* Since we're not current anywhere *AND* hold pi_lock, dequeue
@@ -3802,7 +3802,7 @@ static int ttwu_runnable(struct task_str
* case further along the ttwu() path.
*/
if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
- dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+ dequeue_task(rq, p, queue_flags);
return 0;
}
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-07 9:46 ` Mike Galbraith
@ 2024-11-07 14:02 ` Mike Galbraith
2024-11-07 14:09 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-07 14:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> >
> > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > finding a cfs_rq->curr ends play time pretty quickly.
>
> The below improved uptime, and trace_printk() says it's doing the
> intended, so I suppose I'll add a feature and see what falls out.
From netperf, I got.. number tabulation practice. Three runs of each
test with and without produced nothing but variance/noise.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-07 14:02 ` Mike Galbraith
@ 2024-11-07 14:09 ` Peter Zijlstra
2024-11-08 0:24 ` [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-07 14:09 UTC (permalink / raw)
To: Mike Galbraith
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > >
> > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > finding a cfs_rq->curr ends play time pretty quickly.
> >
> > The below improved uptime, and trace_printk() says it's doing the
> > intended, so I suppose I'll add a feature and see what falls out.
>
> From netperf, I got.. number tabulation practice. Three runs of each
> test with and without produced nothing but variance/noise.
Make it go away then.
If you could write a Changelog for your inspired bit and stick my cleaned
up version under it, I'd be much obliged.
^ permalink raw reply [flat|nested] 277+ messages in thread
* [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-07 14:09 ` Peter Zijlstra
@ 2024-11-08 0:24 ` Mike Galbraith
2024-11-08 13:34 ` Phil Auld
` (2 more replies)
0 siblings, 3 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-08 0:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > >
> > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > finding a cfs_rq->curr ends play time pretty quickly.
> > >
> > > The below improved uptime, and trace_printk() says it's doing the
> > > intended, so I suppose I'll add a feature and see what falls out.
> >
> > From netperf, I got.. number tabulation practice. Three runs of each
> > test with and without produced nothing but variance/noise.
>
> Make it go away then.
>
> If you could write a Changelog for you inspired bit and stick my cleaned
> up version under it, I'd be much obliged.
Salut, much obliged for eyeball relief.
---snip---
Phil Auld (Redhat) reported an fio benchmark regression that was found to
have been caused by the addition of the DELAY_DEQUEUE feature, suggested it
may be related to wakees losing the ability to migrate, and confirmed that
restoring that ability did indeed restore the previous performance.
(de-uglified-a-lot-by)
Reported-by: Phil Auld <pauld@redhat.com>
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
Signed-off-by: Mike Galbraith <efault@gmx.de>
---
kernel/sched/core.c | 48 +++++++++++++++++++++++++++++-------------------
kernel/sched/sched.h | 5 +++++
2 files changed, 34 insertions(+), 19 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
-
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
- /*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
- */
- wakeup_preempt(rq, p, wake_flags);
+ CLASS(__task_rq_lock, rq_guard)(p);
+ struct rq *rq = rq_guard.rq;
+
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ update_rq_clock(rq);
+ if (p->se.sched_delayed) {
+ int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+
+ /*
+ * Since sched_delayed means we cannot be current anywhere,
+ * dequeue it here and have it fall through to the
+ * select_task_rq() case further along the ttwu() path.
+ */
+ if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
+ dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
+
+ enqueue_task(rq, p, queue_flags);
+ }
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
}
- __task_rq_unlock(rq, &rf);
+ ttwu_do_wakeup(p);
- return ret;
+ return 1;
}
#ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+ _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+ __task_rq_unlock(_T->rq, &_T->rf),
+ struct rq *rq; struct rq_flags rf)
+
DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
_T->rq = task_rq_lock(_T->lock, &_T->rf),
task_rq_unlock(_T->rq, _T->lock, &_T->rf),
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-06 12:06 ` Luis Machado
@ 2024-11-08 7:07 ` Saravana Kannan
2024-11-08 23:17 ` Samuel Wu
0 siblings, 1 reply; 277+ messages in thread
From: Saravana Kannan @ 2024-11-08 7:07 UTC (permalink / raw)
To: Luis Machado
Cc: Peter Zijlstra, K Prateek Nayak, Samuel Wu, David Dai, mingo,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, wuyun.abel, youssefesmat, tglx,
efault, Android Kernel Team, Qais Yousef, Vincent Palomares,
John Stultz
On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
>
> Hi,
>
> On 11/6/24 11:09, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> >
> >> Since delayed entities are still on the runqueue, they can affect PELT
> >> calculation. Vincent and Dietmar have both noted this and Peter posted
> >> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> >> in response but it was pulled out since Luis reported observing -ve
> >> values for h_nr_delayed on his setup. A lot has been fixed around
> >> delayed dequeue since and I wonder if now would be the right time to
> >> re-attempt h_nr_delayed tracking.
> >
> > Yeah, it's something I meant to get back to. I think the patch as posted
> > was actually right and it didn't work for Luis because of some other,
> > since fixed issue.
> >
> > But I might be misremembering things. I'll get to it eventually :/
>
> Sorry for the late reply, I got sidetracked on something else.
>
> There have been a few power regressions (based on our Pixel6-based testing) due
> to the delayed-dequeue series.
>
> The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
> handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
>
> The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
> imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
>
> So power use was brought down by the above fix, but some issues still remained, like the
> accounting issues with h_nr_running and not taking sched_delayed tasks into account.
>
> Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
>
> Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
> mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
> accounting that slowly drove frequencies up again.
>
> If I recall correctly, Peter has pulled that particular patch from the tree, but we should
> definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
>
> [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
> [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
> [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
recollect the data to see if that's the cause. We'll get back to this
thread once we have some data.
-Saravana
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-08 0:24 ` [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU Mike Galbraith
@ 2024-11-08 13:34 ` Phil Auld
2024-11-11 2:46 ` Xuewen Yan
2024-11-12 7:05 ` Mike Galbraith
2 siblings, 0 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-08 13:34 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Fri, Nov 08, 2024 at 01:24:35AM +0100 Mike Galbraith wrote:
> On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > >
> > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > >
> > > > The below improved uptime, and trace_printk() says it's doing the
> > > > intended, so I suppose I'll add a feature and see what falls out.
> > >
> > > From netperf, I got.. number tabulation practice. Three runs of each
> > > test with and without produced nothing but variance/noise.
> >
> > Make it go away then.
> >
> > If you could write a Changelog for you inspired bit and stick my cleaned
> > up version under it, I'd be much obliged.
>
> Salut, much obliged for eyeball relief.
>
Thanks Mike (and Peter). We have our full perf tests running on Mike's
original version of this patch. Results probably Monday (there's a long
queue). We'll see if this blows up anything else then. I'll queue up a
build with this cleaned-up version as well, but the results will be late
next week, probably.
At that point maybe some or all of these:
Suggested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Cheers,
Phil
> ---snip---
>
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
>
> (de-uglified-a-lot-by)
>
> Reported-by: Phil Auld <pauld@redhat.com>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
> kernel/sched/core.c | 48 +++++++++++++++++++++++++++++-------------------
> kernel/sched/sched.h | 5 +++++
> 2 files changed, 34 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> -
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
> +
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along the ttwu() path.
> + */
> + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> + }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
> raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> + __task_rq_unlock(_T->rq, &_T->rf),
> + struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> _T->rq = task_rq_lock(_T->lock, &_T->rf),
> task_rq_unlock(_T->rq, _T->lock, &_T->rf),
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-04 12:50 ` Phil Auld
2024-11-05 9:53 ` Christian Loehle
@ 2024-11-08 14:53 ` Dietmar Eggemann
2024-11-08 18:16 ` Phil Auld
1 sibling, 1 reply; 277+ messages in thread
From: Dietmar Eggemann @ 2024-11-08 14:53 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 04/11/2024 13:50, Phil Auld wrote:
>
> Hi Dietmar,
>
> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>> Hi Phil,
>>
>> On 01/11/2024 13:47, Phil Auld wrote:
>>>
>>> Hi Peterm
[...]
>> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
>> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
>> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
>>
>> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
>>
>> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
>> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
>>
>> # sudo lshw -class disk -class storage
>> *-nvme
>> description: NVMe device
>> product: GIGABYTE GP-ASM2NE6500GTTD
>> vendor: Phison Electronics Corporation
>> physical id: 0
>> bus info: pci@0000:01:00.0
>> logical name: /dev/nvme0
>> version: EGFM13.2
>> ...
>> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
>> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
>> resources: irq:16 memory:70800000-70803fff
>>
>> # mount | grep ^/dev/nvme0
>> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
>>
>> Which disk device you're using?
>
> Most of the reports are on various NVME drives (samsung mostly I think).
>
>
> One thing I should add is that it's all on LVM:
>
>
> vgcreate vg /dev/nvme0n1 -y
> lvcreate -n thinMeta -L 3GB vg -y
> lvcreate -n thinPool -l 99%FREE vg -y
> lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> lvcreate -n testLV -V 1300G --thinpool thinPool vg
> wipefs -a /dev/mapper/vg-testLV
> mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> mount /dev/mapper/vg-testLV /testfs
>
>
> With VDO or thinpool (as above) it shows on both ext4 and xfs. With fs on
> drive directly it's a little more variable. Some it shows on xfs, some it show
> on ext4 and not vice-versa, seems to depend on the drive or hw raid. But when
> it shows it's 100% reproducible on that setup.
>
> It's always the randwrite numbers. The rest look fine.
>
> Also, as yet I'm not personally doing this testing, just looking into it and
> passing on the information I have.
One reason I don't see the difference between DELAY_DEQUEUE and
NO_DELAY_DEQUEUE could be because of the affinity of the related
nvme interrupts:
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 ...
132: 0 0 1523653 0 0 0 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
133: 0 0 0 0 0 1338451 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
134: 0 0 0 0 0 0 0 0 2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
$ cat /proc/irq/132/smp_affinity_list
0-2
cat /proc/irq/133/smp_affinity_list
3-5
cat /proc/irq/134/smp_affinity_list
6-8
So the 8 fio tasks from:
# fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k
--runtime 8s --iodepth 32 --direct 1 --ioengine libaio
--numjobs 8 --size 30g --name default --time_based
--group_reporting --cpus_allowed_policy shared
--directory /testfs
don't have to fight with per-CPU kworkers on each CPU.
e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 ->
run iomap_dio_complete_work() in kworker/8:x'
When I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
rq->nr_running > 1' condition in ttwu_runnable(), I only see
the per-CPU kworker in there, so p->nr_cpus_allowed == 1.
So the patch shouldn't make a difference for this scenario?
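(A minimal sketch of the kind of probe meant here -- illustrative only,
not the exact debug code used, dropped into ttwu_runnable():)
/* Illustrative only: trace who hits the condition discussed above. */
if (task_on_rq_queued(p) && p->se.sched_delayed && rq->nr_running > 1)
	trace_printk("delayed wakee %s/%d nr_cpus_allowed=%d\n",
		     p->comm, p->pid, p->nr_cpus_allowed);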
But maybe your VDO or thinpool setup creates waker/wakee pairs with
wakee->nr_cpus_allowed > 1?
Does your machine have single-CPU smp_affinity masks for these nvme
interrupts?
[...]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-08 14:53 ` Dietmar Eggemann
@ 2024-11-08 18:16 ` Phil Auld
2024-11-11 11:29 ` Dietmar Eggemann
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-08 18:16 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Fri, Nov 08, 2024 at 03:53:26PM +0100 Dietmar Eggemann wrote:
> On 04/11/2024 13:50, Phil Auld wrote:
> >
> > Hi Dietmar,
> >
> > On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
> >> Hi Phil,
> >>
> >> On 01/11/2024 13:47, Phil Auld wrote:
> >>>
> >>> Hi Peterm
>
> [...]
>
> >> I'm not seeing this on my i7-13700K running tip sched/core (1a6151017ee5
> >> - sched: psi: pass enqueue/dequeue flags to psi callbacks directly
> >> (2024-10-26 Johannes Weiner)) (6.12.0-rc4 - based)
> >>
> >> Using 'taskset 0xaaaaa' avoiding SMT and running only on P-cores.
> >>
> >> vanilla features: 990MB/s (mean out of 5 runs, σ: 9.38)
> >> NO_DELAY_DEQUEUE: 992MB/s (mean out of 5 runs, σ: 10.61)
> >>
> >> # sudo lshw -class disk -class storage
> >> *-nvme
> >> description: NVMe device
> >> product: GIGABYTE GP-ASM2NE6500GTTD
> >> vendor: Phison Electronics Corporation
> >> physical id: 0
> >> bus info: pci@0000:01:00.0
> >> logical name: /dev/nvme0
> >> version: EGFM13.2
> >> ...
> >> capabilities: nvme pciexpress msix msi pm nvm_express bus_master cap_list
> >> configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:19871987SN215108954872 GIGABYTE GP-ASM2NE6500GTTD state=live
> >> resources: irq:16 memory:70800000-70803fff
> >>
> >> # mount | grep ^/dev/nvme0
> >> /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
> >>
> >> Which disk device you're using?
> >
> > Most of the reports are on various NVME drives (samsung mostly I think).
> >
> >
> > One thing I should add is that it's all on LVM:
> >
> >
> > vgcreate vg /dev/nvme0n1 -y
> > lvcreate -n thinMeta -L 3GB vg -y
> > lvcreate -n thinPool -l 99%FREE vg -y
> > lvconvert --thinpool /dev/mapper/vg-thinPool --poolmetadata /dev/mapper/vg-thinMeta -Zn -y
> > lvcreate -n testLV -V 1300G --thinpool thinPool vg
> > wipefs -a /dev/mapper/vg-testLV
> > mkfs.ext4 /dev/mapper/vg-testLV -E lazy_itable_init=0,lazy_journal_init=0 -F
> > mount /dev/mapper/vg-testLV /testfs
> >
> >
> > With VDO or thinpool (as above) it shows on both ext4 and xfs. With fs on
> > drive directly it's a little more variable. Some it shows on xfs, some it show
> > on ext4 and not vice-versa, seems to depend on the drive or hw raid. But when
> > it shows it's 100% reproducible on that setup.
> >
> > It's always the randwrite numbers. The rest look fine.
> >
> > Also, as yet I'm not personally doing this testing, just looking into it and
> > passing on the information I have.
>
> One reason I don't see the difference between DELAY_DEQUEUE and
> NO_DELAY_DEQUEUE could be because of the affinity of the related
> nvme interrupts:
>
> $ cat /proc/interrupts
>
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 ...
> 132: 0 0 1523653 0 0 0 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
> 133: 0 0 0 0 0 1338451 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
> 134: 0 0 0 0 0 0 0 0 2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
>
> $ cat /proc/irq/132/smp_affinity_list
> 0-2
> cat /proc/irq/133/smp_affinity_list
> 3-5
> cat /proc/irq/134/smp_affinity_list
> 6-8
>
> So the 8 fio tasks from:
>
> # fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k
> --runtime 8s --iodepth 32 --direct 1 --ioengine libaio
> --numjobs 8 --size 30g --name default --time_based
> --group_reporting --cpus_allowed_policy shared
> --directory /testfs
>
> don't have to fight with per-CPU kworkers on each CPU.
>
> e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 ->
> run iomap_dio_complete_work() in kworker/8:x'
>
> In case I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
> rq->nr_running > 1) condition in ttwu_runnable() condition i only see
> the per-CPU kworker in there, so p->nr_cpus_allowed == 1.
>
> So the patch shouldn't make a difference for this scenario?
>
If the kworker is waking up an fio task it could. I don't think
they are bound to a single cpu.
But yes if your trace is only showing the kworker there then it would
not help. Are you actually able to reproduce the difference?
> But maybe your VDO or thinpool setup creates waker/wakee pairs with
> wakee->nr_cpus_allowed > 1?
>
That's certainly possible but I don't know for sure. There are well more
dio kworkers on the box than cpus though, if I recall. I don't know
if they all have single cpu affinities.
> Does your machine has single CPU smp_affinity masks for these nvme
> interrupts?
>
I don't know. I had to give the machine back.
Cheers,
Phil
> [...]
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-08 7:07 ` Saravana Kannan
@ 2024-11-08 23:17 ` Samuel Wu
2024-11-11 4:07 ` K Prateek Nayak
0 siblings, 1 reply; 277+ messages in thread
From: Samuel Wu @ 2024-11-08 23:17 UTC (permalink / raw)
To: Saravana Kannan
Cc: Luis Machado, Peter Zijlstra, K Prateek Nayak, David Dai, mingo,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, wuyun.abel, youssefesmat, tglx,
efault, Android Kernel Team, Qais Yousef, Vincent Palomares,
John Stultz
On Thu, Nov 7, 2024 at 11:08 PM Saravana Kannan <saravanak@google.com> wrote:
>
> On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
> >
> > Hi,
> >
> > On 11/6/24 11:09, Peter Zijlstra wrote:
> > > On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> > >
> > >> Since delayed entities are still on the runqueue, they can affect PELT
> > >> calculation. Vincent and Dietmar have both noted this and Peter posted
> > >> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> > >> in response but it was pulled out since Luis reported observing -ve
> > >> values for h_nr_delayed on his setup. A lot has been fixed around
> > >> delayed dequeue since and I wonder if now would be the right time to
> > >> re-attempt h_nr_delayed tracking.
> > >
> > > Yeah, it's something I meant to get back to. I think the patch as posted
> > > was actually right and it didn't work for Luis because of some other,
> > > since fixed issue.
> > >
> > > But I might be misremembering things. I'll get to it eventually :/
> >
> > Sorry for the late reply, I got sidetracked on something else.
> >
> > There have been a few power regressions (based on our Pixel6-based testing) due
> > to the delayed-dequeue series.
> >
> > The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
> > handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
> >
> > The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
> > imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
> >
> > So power use was brought down by the above fix, but some issues still remained, like the
> > accounting issues with h_nr_running and not taking sched_delayed tasks into account.
> >
> > Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
> >
> > Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
> > mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
> > accounting that slowly drove frequencies up again.
> >
> > If I recall correctly, Peter has pulled that particular patch from the tree, but we should
> > definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
> >
> > [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
> > [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
> > [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> > [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> > [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
>
> Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
> recollect the data to see if that's the cause. We'll get back to this
> thread once we have some data.
>
> -Saravana
The test data is back to pre-EEVDF state with DELAY_DEQUEUE disabled.
Same test example from before, when thread is affined to the big cluster:
+-----------------+---------+----------+
| Data            | Enabled | Disabled |
+-----------------+---------+----------+
| 5th percentile  |      96 |      143 |
| Median          |     144 |      147 |
| Mean            |     134 |      147 |
| 95th percentile |     150 |      150 |
+-----------------+---------+----------+
What are the next steps to bring this behavior back? Will DELAY_DEQUEUE always
be enabled by default and/or is there a fix coming for 6.12?
Thanks,
Sam
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-08 0:24 ` [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU Mike Galbraith
2024-11-08 13:34 ` Phil Auld
@ 2024-11-11 2:46 ` Xuewen Yan
2024-11-11 3:53 ` Mike Galbraith
2024-11-12 7:05 ` Mike Galbraith
2 siblings, 1 reply; 277+ messages in thread
From: Xuewen Yan @ 2024-11-11 2:46 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Phil Auld, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
Xuewen Yan, Qais Yousef
Hi Mike and Peter,
On Fri, Nov 8, 2024 at 8:28 AM Mike Galbraith <efault@gmx.de> wrote:
>
> On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > >
> > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > >
> > > > The below improved uptime, and trace_printk() says it's doing the
> > > > intended, so I suppose I'll add a feature and see what falls out.
> > >
> > > From netperf, I got.. number tabulation practice. Three runs of each
> > > test with and without produced nothing but variance/noise.
> >
> > Make it go away then.
> >
> > If you could write a Changelog for you inspired bit and stick my cleaned
> > up version under it, I'd be much obliged.
>
> Salut, much obliged for eyeball relief.
>
> ---snip---
>
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
>
> (de-uglified-a-lot-by)
>
> Reported-by: Phil Auld <pauld@redhat.com>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
> kernel/sched/core.c | 48 +++++++++++++++++++++++++++++-------------------
> kernel/sched/sched.h | 5 +++++
> 2 files changed, 34 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> -
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
> +
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along the ttwu() path.
> + */
> + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
For sched_asym_cpucapacity systems, do we need to consider
task_fits_cpu_capacity there?
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> + }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
> raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> + __task_rq_unlock(_T->rq, &_T->rf),
> + struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> _T->rq = task_rq_lock(_T->lock, &_T->rf),
> task_rq_unlock(_T->rq, _T->lock, &_T->rf),
>
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-11 2:46 ` Xuewen Yan
@ 2024-11-11 3:53 ` Mike Galbraith
0 siblings, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-11 3:53 UTC (permalink / raw)
To: Xuewen Yan
Cc: Peter Zijlstra, Phil Auld, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx,
Xuewen Yan, Qais Yousef
On Mon, 2024-11-11 at 10:46 +0800, Xuewen Yan wrote:
> >
> > + if (p->se.sched_delayed) {
> > + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > + /*
> > + * Since sched_delayed means we cannot be current anywhere,
> > + * dequeue it here and have it fall through to the
> > + * select_task_rq() case further along the ttwu() path.
> > + */
> > + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
>
> For sched_asym_cpucapacity system, need we consider the
> task_fits_cpu_capacity there?
I don't think so. Wakeup placement logic is what we're deflecting the
wakee toward; this is not the right spot to add any complexity.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-08 23:17 ` Samuel Wu
@ 2024-11-11 4:07 ` K Prateek Nayak
2024-11-26 23:32 ` Saravana Kannan
0 siblings, 1 reply; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-11 4:07 UTC (permalink / raw)
To: Samuel Wu, Saravana Kannan, Peter Zijlstra
Cc: Luis Machado, David Dai, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx, efault,
Android Kernel Team, Qais Yousef, Vincent Palomares, John Stultz
Hello Sam,
On 11/9/2024 4:47 AM, Samuel Wu wrote:
> On Thu, Nov 7, 2024 at 11:08 PM Saravana Kannan <saravanak@google.com> wrote:
>>
>> On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> On 11/6/24 11:09, Peter Zijlstra wrote:
>>>> On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
>>>>
>>>>> Since delayed entities are still on the runqueue, they can affect PELT
>>>>> calculation. Vincent and Dietmar have both noted this and Peter posted
>>>>> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
>>>>> in response but it was pulled out since Luis reported observing -ve
>>>>> values for h_nr_delayed on his setup. A lot has been fixed around
>>>>> delayed dequeue since and I wonder if now would be the right time to
>>>>> re-attempt h_nr_delayed tracking.
>>>>
>>>> Yeah, it's something I meant to get back to. I think the patch as posted
>>>> was actually right and it didn't work for Luis because of some other,
>>>> since fixed issue.
>>>>
>>>> But I might be misremembering things. I'll get to it eventually :/
>>>
>>> Sorry for the late reply, I got sidetracked on something else.
>>>
>>> There have been a few power regressions (based on our Pixel6-based testing) due
>>> to the delayed-dequeue series.
>>>
>>> The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
>>> handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
>>>
>>> The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
>>> imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
>>>
>>> So power use was brought down by the above fix, but some issues still remained, like the
>>> accounting issues with h_nr_running and not taking sched_delayed tasks into account.
>>>
>>> Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
>>>
>>> Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
>>> mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
>>> accounting that slowly drove frequencies up again.
>>>
>>> If I recall correctly, Peter has pulled that particular patch from the tree, but we should
>>> definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
>>>
>>> [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
>>> [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
>>> [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
>>> [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
>>> [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
>>
>> Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
>> recollect the data to see if that's the cause. We'll get back to this
>> thread once we have some data.
>>
>> -Saravana
>
> The test data is back to pre-EEVDF state with DELAY_DEQUEUE disabled.
>
> Same test example from before, when thread is affined to the big cluster:
> +-----------------+---------+----------+
> | Data            | Enabled | Disabled |
> +-----------------+---------+----------+
> | 5th percentile  |      96 |      143 |
> | Median          |     144 |      147 |
> | Mean            |     134 |      147 |
> | 95th percentile |     150 |      150 |
> +-----------------+---------+----------+
>
> What are the next steps to bring this behavior back? Will DELAY_DEQUEUE always
> be enabled by default and/or is there a fix coming for 6.12?
DELAY_DEQUEUE should be enabled by default from v6.12, but there are a
few fixes for it in flight. Could you try running with the changes
from [1] and [2] and see if you can reproduce the behavior and, if
you can, whether it is equally bad?
Both changes apply cleanly for me on top of current
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit fe9beaaa802d ("sched: No PREEMPT_RT=y for all{yes,mod}config")
when applied in that order.
[1] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
[2] https://lore.kernel.org/lkml/750542452c4f852831e601e1b8de40df4b108d9a.camel@gmx.de/
>
> Thanks,
> Sam
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 17/24] sched/fair: Implement delayed dequeue
2024-11-08 18:16 ` Phil Auld
@ 2024-11-11 11:29 ` Dietmar Eggemann
0 siblings, 0 replies; 277+ messages in thread
From: Dietmar Eggemann @ 2024-11-11 11:29 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On 08/11/2024 19:16, Phil Auld wrote:
> On Fri, Nov 08, 2024 at 03:53:26PM +0100 Dietmar Eggemann wrote:
>> On 04/11/2024 13:50, Phil Auld wrote:
>>>
>>> Hi Dietmar,
>>>
>>> On Mon, Nov 04, 2024 at 10:28:37AM +0100 Dietmar Eggemann wrote:
>>>> Hi Phil,
>>>>
>>>> On 01/11/2024 13:47, Phil Auld wrote:
[...]
>> One reason I don't see the difference between DELAY_DEQUEUE and
>> NO_DELAY_DEQUEUE could be because of the affinity of the related
>> nvme interrupts:
>>
>> $ cat /proc/interrupts
>>
>> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 ...
>> 132: 0 0 1523653 0 0 0 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 1-edge nvme0q1
>> 133: 0 0 0 0 0 1338451 0 0 0 ... IR-PCI-MSIX-0000:01:00.0 2-edge nvme0q2
>> 134: 0 0 0 0 0 0 0 0 2252297 ... IR-PCI-MSIX-0000:01:00.0 3-edge nvme0q3
>>
>> $ cat /proc/irq/132/smp_affinity_list
>> 0-2
>> cat /proc/irq/133/smp_affinity_list
>> 3-5
>> cat /proc/irq/134/smp_affinity_list
>> 6-8
>>
>> So the 8 fio tasks from:
>>
>> # fio --cpus_allowed 1,2,3,4,5,6,7,8 --rw randwrite --bs 4k
>> --runtime 8s --iodepth 32 --direct 1 --ioengine libaio
>> --numjobs 8 --size 30g --name default --time_based
>> --group_reporting --cpus_allowed_policy shared
>> --directory /testfs
>>
>> don't have to fight with per-CPU kworkers on each CPU.
>>
>> e.g. 'nvme0q3 interrupt -> queue on workqueue dio/nvme0n1p2 ->
>> run iomap_dio_complete_work() in kworker/8:x'
>>
>> In case I trace the 'task_on_rq_queued(p) && p->se.sched_delayed &&
>> rq->nr_running > 1) condition in ttwu_runnable() condition i only see
>> the per-CPU kworker in there, so p->nr_cpus_allowed == 1.
>>
>> So the patch shouldn't make a difference for this scenario?
>>
>
> If the kworker is waking up an fio task it could. I don't think
> they are bound to a single cpu.
>
> But yes if your trace is only showing the kworker there then it would
> not help. Are you actually able to reproduce the difference?
No, with my setup I don't see any difference running your fio test. But
the traces also show me that there are no scenarios in which this patch
can make a difference in the scores.
>> But maybe your VDO or thinpool setup creates waker/wakee pairs with
>> wakee->nr_cpus_allowed > 1?
>>
>
> That's certainly possible but I don't know for sure. There are well more
> dio kworkers on the box than cpus though if I recall. I don't know
> if they all have singel cpu affinities.
Yeah there must be more tasks (inc. kworkers) w/ 'p->nr_cpus_allowed >
1' involved.
>> Does your machine has single CPU smp_affinity masks for these nvme
>> interrupts?
>>
>
> I don't know. I had to give the machine back.
Ah, too late then ;-)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-08 0:24 ` [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU Mike Galbraith
2024-11-08 13:34 ` Phil Auld
2024-11-11 2:46 ` Xuewen Yan
@ 2024-11-12 7:05 ` Mike Galbraith
2024-11-12 12:41 ` Phil Auld
2 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-12 7:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx
On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > >
> > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > >
> > > > The below improved uptime, and trace_printk() says it's doing the
> > > > intended, so I suppose I'll add a feature and see what falls out.
> > >
> > > From netperf, I got.. number tabulation practice. Three runs of each
> > > test with and without produced nothing but variance/noise.
> >
> > Make it go away then.
> >
> > If you could write a Changelog for you inspired bit and stick my cleaned
> > up version under it, I'd be much obliged.
>
> Salut, much obliged for eyeball relief.
Unfortunate changelog placeholder below aside, I think this patch may
need to be yanked, as it trades one not readily repeatable regression for
at least one that definitely is, and likely multiple others.
(adds knob)
tbench 8
NO_MIGRATE_DELAYED 3613.49 MB/sec
MIGRATE_DELAYED 3145.59 MB/sec
NO_DELAY_DEQUEUE 3355.42 MB/sec
First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
I've mentioned it doing, but $subject promptly did away with that and
then some.
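(The knob is nothing fancy, roughly the sketch below -- a feature bit
gating the wakeup-time dequeue; details may differ a bit from the exact
bits that ran here.)
/* kernel/sched/features.h */
SCHED_FEAT(MIGRATE_DELAYED, true)

/* ttwu_runnable(): gate the wakeup-time dequeue on the feature */
if (sched_feat(MIGRATE_DELAYED) &&
    rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
	dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
	return 0;
}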
I thought I might be able to do away with the reservation-like side
effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
sched/eevdf: More PELT vs DELAYED_DEQUEUE
...for a cgroup-free test config, but a Q/D poke at idle_cpu() helped not
at all.
> ---snip---
>
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
>
> (de-uglified-a-lot-by)
>
> Reported-by: Phil Auld <pauld@redhat.com>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
> kernel/sched/core.c | 48 +++++++++++++++++++++++++++++-------------------
> kernel/sched/sched.h | 5 +++++
> 2 files changed, 34 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> -
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
> +
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along the ttwu() path.
> + */
> + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> + }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
> raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> + __task_rq_unlock(_T->rq, &_T->rf),
> + struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> _T->rq = task_rq_lock(_T->lock, &_T->rf),
> task_rq_unlock(_T->rq, _T->lock, &_T->rf),
>
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-12 7:05 ` Mike Galbraith
@ 2024-11-12 12:41 ` Phil Auld
2024-11-12 14:23 ` Peter Zijlstra
2024-11-12 14:23 ` Mike Galbraith
0 siblings, 2 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-12 12:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> > On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > > >
> > > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > > >
> > > > > The below improved uptime, and trace_printk() says it's doing the
> > > > > intended, so I suppose I'll add a feature and see what falls out.
> > > >
> > > > From netperf, I got.. number tabulation practice. Three runs of each
> > > > test with and without produced nothing but variance/noise.
> > >
> > > Make it go away then.
> > >
> > > If you could write a Changelog for you inspired bit and stick my cleaned
> > > up version under it, I'd be much obliged.
> >
> > Salut, much obliged for eyeball relief.
>
> Unfortunate change log place holder below aside, I think this patch may
> need to be yanked as trading one not readily repeatable regression for
> at least one that definitely is, and likely multiple others.
>
> (adds knob)
>
Yes, I was just coming here to reply. I have the results from the first
version of the patch (I don't think the later one fundamentally changed
enough that it will matter, but those results are still pending).
Not entirely surprisingly, we've traded a ~10% rand write regression for
a 5-10% rand read regression. This makes sense to me since the reads are
more likely to be synchronous and thus be more buddy-like and benefit
from flipping back and forth on the same cpu.
I'd probably have to take the reads over the writes in such a trade off :)
> tbench 8
>
> NO_MIGRATE_DELAYED 3613.49 MB/sec
> MIGRATE_DELAYED 3145.59 MB/sec
> NO_DELAY_DEQUEUE 3355.42 MB/sec
>
> First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> I've mentioned it doing, but $subject promptly did away with that and
> then some.
>
Yep, that's not pretty.
> I thought I might be able to do away with the reservation like side
> effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
>
> sched/eevdf: More PELT vs DELAYED_DEQUEUE
>
> ...for cgroups free test config, but Q/D poke at idle_cpu() helped not
> at all.
>
I wonder if the last_wakee stuff could be leveraged here (an idle thought,
so to speak). Haven't looked closely enough.
Cheers,
Phil
> > ---snip---
> >
> > Phil Auld (Redhat) reported an fio benchmark regression having been found
> > to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> > may be related to wakees losing the ability to migrate, and confirmed that
> > restoration of same indeed did restore previous performance.
> >
> > (de-uglified-a-lot-by)
> >
> > Reported-by: Phil Auld <pauld@redhat.com>
> > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> > Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> > ---
> > kernel/sched/core.c | 48 +++++++++++++++++++++++++++++-------------------
> > kernel/sched/sched.h | 5 +++++
> > 2 files changed, 34 insertions(+), 19 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > - struct rq_flags rf;
> > - struct rq *rq;
> > - int ret = 0;
> > -
> > - rq = __task_rq_lock(p, &rf);
> > - if (task_on_rq_queued(p)) {
> > - update_rq_clock(rq);
> > - if (p->se.sched_delayed)
> > - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > - if (!task_on_cpu(rq, p)) {
> > - /*
> > - * When on_rq && !on_cpu the task is preempted, see if
> > - * it should preempt the task that is current now.
> > - */
> > - wakeup_preempt(rq, p, wake_flags);
> > + CLASS(__task_rq_lock, rq_guard)(p);
> > + struct rq *rq = rq_guard.rq;
> > +
> > + if (!task_on_rq_queued(p))
> > + return 0;
> > +
> > + update_rq_clock(rq);
> > + if (p->se.sched_delayed) {
> > + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > + /*
> > + * Since sched_delayed means we cannot be current anywhere,
> > + * dequeue it here and have it fall through to the
> > + * select_task_rq() case further along the ttwu() path.
> > + */
> > + if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> > + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> > + return 0;
> > }
> > - ttwu_do_wakeup(p);
> > - ret = 1;
> > +
> > + enqueue_task(rq, p, queue_flags);
> > + }
> > + if (!task_on_cpu(rq, p)) {
> > + /*
> > + * When on_rq && !on_cpu the task is preempted, see if
> > + * it should preempt the task that is current now.
> > + */
> > + wakeup_preempt(rq, p, wake_flags);
> > }
> > - __task_rq_unlock(rq, &rf);
> > + ttwu_do_wakeup(p);
> >
> > - return ret;
> > + return 1;
> > }
> >
> > #ifdef CONFIG_SMP
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
> > raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> > }
> >
> > +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> > + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> > + __task_rq_unlock(_T->rq, &_T->rf),
> > + struct rq *rq; struct rq_flags rf)
> > +
> > DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> > _T->rq = task_rq_lock(_T->lock, &_T->rf),
> > task_rq_unlock(_T->rq, _T->lock, &_T->rf),
> >
> >
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-12 12:41 ` Phil Auld
@ 2024-11-12 14:23 ` Peter Zijlstra
2024-11-12 14:23 ` Mike Galbraith
1 sibling, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-12 14:23 UTC (permalink / raw)
To: Phil Auld
Cc: Mike Galbraith, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, Nov 12, 2024 at 07:41:17AM -0500, Phil Auld wrote:
> On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> > On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > > > >
> > > > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > > > >
> > > > > > The below improved uptime, and trace_printk() says it's doing the
> > > > > > intended, so I suppose I'll add a feature and see what falls out.
> > > > >
> > > > > From netperf, I got.. number tabulation practice. Three runs of each
> > > > > test with and without produced nothing but variance/noise.
> > > >
> > > > Make it go away then.
> > > >
> > > > If you could write a Changelog for you inspired bit and stick my cleaned
> > > > up version under it, I'd be much obliged.
> > >
> > > Salut, much obliged for eyeball relief.
> >
> > Unfortunate change log place holder below aside, I think this patch may
> > need to be yanked as trading one not readily repeatable regression for
> > at least one that definitely is, and likely multiple others.
> >
> > (adds knob)
> >
>
> Yes, I ws just coming here to reply. I have the results from the first
> version of the patch (I don't think the later one fundemtally changed
> enough that it will matter but those results are still pending).
>
> Not entirely surprisingly we've traded a ~10% rand write regression for
> 5-10% rand read regression. This makes sense to me since the reads are
> more likely to be synchronous and thus be more buddy-like and benefit
> from flipping back and forth on the same cpu.
OK, so I'm going to make this commit disappear.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-12 12:41 ` Phil Auld
2024-11-12 14:23 ` Peter Zijlstra
@ 2024-11-12 14:23 ` Mike Galbraith
2024-11-12 15:41 ` Phil Auld
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-12 14:23 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-12 at 07:41 -0500, Phil Auld wrote:
> On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> >
> > Unfortunate change log place holder below aside, I think this patch may
> > need to be yanked as trading one not readily repeatable regression for
> > at least one that definitely is, and likely multiple others.
> >
> > (adds knob)
> >
>
> Yes, I ws just coming here to reply. I have the results from the first
> version of the patch (I don't think the later one fundemtally changed
> enough that it will matter but those results are still pending).
>
> Not entirely surprisingly we've traded a ~10% rand write regression for
> 5-10% rand read regression. This makes sense to me since the reads are
> more likely to be synchronous and thus be more buddy-like and benefit
> from flipping back and forth on the same cpu.
Ok, that would seem to second "shoot it".
> I'd probably have to take the reads over the writes in such a trade off :)
>
> > tbench 8
> >
> > NO_MIGRATE_DELAYED 3613.49 MB/sec
> > MIGRATE_DELAYED 3145.59 MB/sec
> > NO_DELAY_DEQUEUE 3355.42 MB/sec
> >
> > First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> > I've mentioned it doing, but $subject promptly did away with that and
> > then some.
> >
>
> Yep, that's not pretty.
Yeah, not to mention annoying.
I get the "adds bounce cache pain" aspect, but not why pre-EEVDF
wouldn't be just as heavily affected, it having nothing blocking high
frequency migration (the eternal scheduler boogieman:).
Bottom line would appear to be that these survivors should be left
where they ended up, either due to LB or more likely bog standard
prev_cpu locality, for they are part and parcel of a progression.
> > I thought I might be able to do away with the reservation like side
> > effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
> >
> > sched/eevdf: More PELT vs DELAYED_DEQUEUE
> >
> > ...for cgroups free test config, but Q/D poke at idle_cpu() helped not
> > at all.
We don't have to let sched_delayed block SIS though. Rendering
them transparent in idle_cpu() did NOT wreck the progression, so
maaaybe it could help your regression.
> I wonder if the last_wakee stuff could be leveraged here (an idle thought,
> so to speak). Haven't looked closely enough.
If you mean heuristics, the less of those we have, the better off we
are.. they _always_ find a way to embed their teeth in your backside.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-12 14:23 ` Mike Galbraith
@ 2024-11-12 15:41 ` Phil Auld
2024-11-12 16:15 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-12 15:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-12 at 07:41 -0500, Phil Auld wrote:
> > On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> > >
> > > Unfortunate change log place holder below aside, I think this patch may
> > > need to be yanked as trading one not readily repeatable regression for
> > > at least one that definitely is, and likely multiple others.
> > >
> > > (adds knob)
> > >
> >
> > Yes, I ws just coming here to reply. I have the results from the first
> > version of the patch (I don't think the later one fundemtally changed
> > enough that it will matter but those results are still pending).
> >
> > Not entirely surprisingly we've traded a ~10% rand write regression for
> > 5-10% rand read regression. This makes sense to me since the reads are
> > more likely to be synchronous and thus be more buddy-like and benefit
> > from flipping back and forth on the same cpu.
>
> Ok, that would seem to second "shoot it".
>
Yes, drop it please, I think. Thanks!
> > I'd probably have to take the reads over the writes in such a trade off :)
> >
> > > tbench 8
> > >
> > > NO_MIGRATE_DELAYED 3613.49 MB/sec
> > > MIGRATE_DELAYED 3145.59 MB/sec
> > > NO_DELAY_DEQUEUE 3355.42 MB/sec
> > >
> > > First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> > > I've mentioned it doing, but $subject promptly did away with that and
> > > then some.
> > >
> >
> > Yep, that's not pretty.
>
> Yeah, not to mention annoying.
>
> I get the "adds bounce cache pain" aspect, but not why pre-EEVDF
> wouldn't be just as heavily affected, it having nothing blocking high
> frequency migration (the eternal scheduler boogieman:).
>
> Bottom line would appear to be that these survivors should be left
> where they ended up, either due to LB or more likely bog standard
> prev_cpu locality, for they are part and parcel of a progression.
>
> > > I thought I might be able to do away with the reservation like side
> > > effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
> > >
> > > sched/eevdf: More PELT vs DELAYED_DEQUEUE
> > >
> > > ...for cgroups free test config, but Q/D poke at idle_cpu() helped not
> > > at all.
>
> We don't however have to let sched_delayed block SIS though. Rendering
> them transparent in idle_cpu() did NOT wreck the progression, so
> maaaybe could help your regression.
>
You mean something like:
if (rq->nr_running > rq->h_nr_delayed)
return 0;
in idle_cpu() instead of the straight rq->nr_running check? I don't
have the h_nr_delayed stuff yet but can look for it. I'm not sure
that will help with the delayees being sticky. But I can try
that if I'm understanding you right.
I'll try to dig into it some more regardless.
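(To spell out what I mean, a rough sketch of that idle_cpu() variant,
assuming the rq->h_nr_delayed counter from the in-flight "sched/eevdf:
More PELT vs DELAYED_DEQUEUE" patch -- illustrative only:)
int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	/* treat sched_delayed leftovers as transparent */
	if (rq->nr_running > rq->h_nr_delayed)
		return 0;

#ifdef CONFIG_SMP
	if (rq->ttwu_pending)
		return 0;
#endif

	return 1;
}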
> > I wonder if the last_wakee stuff could be leveraged here (an idle thought,
> > so to speak). Haven't looked closely enough.
>
> If you mean heuristics, the less of those we have, the better off we
> are.. they _always_ find a way to embed their teeth in your backside.
>
Sure, I get that. But when you have a trade-off like this, being "smarter"
about when to do the dequeue might help. But yes, that could go wrong.
I'm not a fan of knobs either, but we could do your patch with the feature
and default it off.
Cheers,
Phil
> -Mike
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-12 15:41 ` Phil Auld
@ 2024-11-12 16:15 ` Mike Galbraith
2024-11-14 11:07 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-12 16:15 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
>
> >
> > We don't however have to let sched_delayed block SIS though. Rendering
> > them transparent in idle_cpu() did NOT wreck the progression, so
> > maaaybe could help your regression.
> >
>
> You mean something like:
>
> if (rq->nr_running > rq->h_nr_delayed)
> return 0;
>
> in idle_cpu() instead of the straight rq->nr_running check?
Yeah, close enough.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-12 16:15 ` Mike Galbraith
@ 2024-11-14 11:07 ` Mike Galbraith
2024-11-14 11:28 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-14 11:07 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-12 at 17:15 +0100, Mike Galbraith wrote:
> On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> > On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> >
> > >
> > > We don't however have to let sched_delayed block SIS though. Rendering
> > > them transparent in idle_cpu() did NOT wreck the progression, so
> > > maaaybe could help your regression.
> > >
> >
> > You mean something like:
> >
> > if (rq->nr_running > rq->h_nr_delayed)
> > return 0;
> >
> > in idle_cpu() instead of the straight rq->nr_running check?
>
> Yeah, close enough.
The below is all you need.
Watching the blockage rate during part of a netperf scaling run: without
the patch, a bit over 2/sec was the highest it got, but with it, that drops
to the same zero as turning off the feature, so... relevance highly
unlikely but not quite impossible?
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9454,11 +9454,15 @@ int can_migrate_task(struct task_struct
/*
* We do not migrate tasks that are:
+ * 0) not runnable (not useful here/now, but are annoying), or
* 1) throttled_lb_pair, or
* 2) cannot be migrated to this CPU due to cpus_ptr, or
* 3) running (obviously), or
* 4) are cache-hot on their current CPU.
*/
+ if (p->se.sched_delayed)
+ return 0;
+
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
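
(For reference, the blockage rate above can be watched with something as
simple as a trace_printk() on that new branch -- illustrative sketch, not
necessarily what was used here:)
/* Illustrative only: see how often the new branch fires. */
if (p->se.sched_delayed) {
	trace_printk("lb: not migrating delayed task %s/%d\n",
		     p->comm, p->pid);
	return 0;
}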
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-14 11:07 ` Mike Galbraith
@ 2024-11-14 11:28 ` Phil Auld
2024-11-19 11:30 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-14 11:28 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Thu, Nov 14, 2024 at 12:07:03PM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-12 at 17:15 +0100, Mike Galbraith wrote:
> > On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> > > On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> > >
> > > >
> > > > We don't however have to let sched_delayed block SIS though. Rendering
> > > > them transparent in idle_cpu() did NOT wreck the progression, so
> > > > maaaybe could help your regression.
> > > >
> > >
> > > You mean something like:
> > >
> > > if (rq->nr_running > rq->h_nr_delayed)
> > > return 0;
> > >
> > > in idle_cpu() instead of the straight rq->nr_running check?
> >
> > Yeah, close enough.
>
> The below is all you need.
>
> Watching blockage rate during part of a netperf scaling run without, a
> bit over 2/sec was the highest it got, but with, that drops to the same
> zero as turning off the feature, so... relevance highly unlikely but
> not quite impossible?
>
I'll give this a try on my issue. This'll be simpler than the other way.
Thanks!
Cheers,
Phil
> ---
> kernel/sched/fair.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9454,11 +9454,15 @@ int can_migrate_task(struct task_struct
>
> /*
> * We do not migrate tasks that are:
> + * 0) not runnable (not useful here/now, but are annoying), or
> * 1) throttled_lb_pair, or
> * 2) cannot be migrated to this CPU due to cpus_ptr, or
> * 3) running (obviously), or
> * 4) are cache-hot on their current CPU.
> */
> + if (p->se.sched_delayed)
> + return 0;
> +
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-14 11:28 ` Phil Auld
@ 2024-11-19 11:30 ` Phil Auld
2024-11-19 11:51 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-19 11:30 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Thu, Nov 14, 2024 at 06:28:54AM -0500 Phil Auld wrote:
> On Thu, Nov 14, 2024 at 12:07:03PM +0100 Mike Galbraith wrote:
> > On Tue, 2024-11-12 at 17:15 +0100, Mike Galbraith wrote:
> > > On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> > > > On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> > > >
> > > > >
> > > > > We don't however have to let sched_delayed block SIS though. Rendering
> > > > > them transparent in idle_cpu() did NOT wreck the progression, so
> > > > > maaaybe could help your regression.
> > > > >
> > > >
> > > > You mean something like:
> > > >
> > > > if (rq->nr_running > rq->h_nr_delayed)
> > > > return 0;
> > > >
> > > > in idle_cpu() instead of the straight rq->nr_running check?
> > >
> > > Yeah, close enough.
> >
> > The below is all you need.
> >
> > Watching blockage rate during part of a netperf scaling run without, a
> > bit over 2/sec was the highest it got, but with, that drops to the same
> > zero as turning off the feature, so... relevance highly unlikely but
> > not quite impossible?
> >
>
> I'll give this a try on my issue. This'll be simpler than the other way.
>
This, below, by itself, did not help and caused a small slowdown on some
other tests. Did this need to be on top of the wakeup change?
Cheers,
Phil
> Thanks!
>
>
>
> Cheers,
> Phil
>
>
> > ---
> > kernel/sched/fair.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9454,11 +9454,15 @@ int can_migrate_task(struct task_struct
> >
> > /*
> > * We do not migrate tasks that are:
> > + * 0) not runnable (not useful here/now, but are annoying), or
> > * 1) throttled_lb_pair, or
> > * 2) cannot be migrated to this CPU due to cpus_ptr, or
> > * 3) running (obviously), or
> > * 4) are cache-hot on their current CPU.
> > */
> > + if (p->se.sched_delayed)
> > + return 0;
> > +
> > if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> > return 0;
> >
> >
>
> --
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-19 11:30 ` Phil Auld
@ 2024-11-19 11:51 ` Mike Galbraith
2024-11-20 18:37 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-19 11:51 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
>
> This, below, by itself, did not do help and caused a small slowdown on some
> other tests. Did this need to be on top of the wakeup change?
No, that made a mess. The numbers said it was quite a reach, no
surprise it fell flat.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-19 11:51 ` Mike Galbraith
@ 2024-11-20 18:37 ` Mike Galbraith
2024-11-21 11:56 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-20 18:37 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> >
> > This, below, by itself, did not do help and caused a small slowdown on some
> > other tests. Did this need to be on top of the wakeup change?
>
> No, that made a mess.
Rashly speculating that turning the mobile kthread component loose is what
helped your write regression...
You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
only turn hard working kthreads loose to try to dodge service latency.
Seems unlikely wakeup frequency * instances would combine to shred fio
the way turning tbench loose did.
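Something along these lines, i.e. gate the wakeup-time dequeue on
PF_KTHREAD (sketch only, on top of the earlier ttwu_runnable() hunk):
if (p->se.sched_delayed) {
	int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;

	/* only turn hard working kthreads loose toward select_task_rq() */
	if ((p->flags & PF_KTHREAD) && rq->nr_running > 1 &&
	    p->nr_cpus_allowed > 1) {
		dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
		return 0;
	}

	enqueue_task(rq, p, queue_flags);
}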
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-20 18:37 ` Mike Galbraith
@ 2024-11-21 11:56 ` Phil Auld
2024-11-21 12:07 ` Phil Auld
2024-11-23 8:44 ` [PATCH V2] " Mike Galbraith
0 siblings, 2 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-21 11:56 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > >
> > > This, below, by itself, did not do help and caused a small slowdown on some
> > > other tests. Did this need to be on top of the wakeup change?
> >
> > No, that made a mess.
>
> Rashly speculating that turning mobile kthread component loose is what
> helped your write regression...
>
> You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> only turn hard working kthreads loose to try to dodge service latency.
> Seems unlikely wakeup frequency * instances would combine to shred fio
> the way turning tbench loose did.
>
Thanks, I'll try that.
Cheers,
Phil
> -Mike
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-21 11:56 ` Phil Auld
@ 2024-11-21 12:07 ` Phil Auld
2024-11-21 21:21 ` Phil Auld
2024-11-23 8:44 ` [PATCH V2] " Mike Galbraith
1 sibling, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-11-21 12:07 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Thu, Nov 21, 2024 at 06:56:28AM -0500 Phil Auld wrote:
> On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > >
> > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > other tests. Did this need to be on top of the wakeup change?
> > >
> > > No, that made a mess.
> >
> > Rashly speculating that turning mobile kthread component loose is what
> > helped your write regression...
> >
> > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > only turn hard working kthreads loose to try to dodge service latency.
> > Seems unlikely wakeup frequency * instances would combine to shred fio
> > the way turning tbench loose did.
> >
>
> Thanks, I'll try that.
>
Also, fwiw, I think there is another report here
https://lore.kernel.org/lkml/392209D9-5AC6-4FDE-8D84-FB8A82AD9AEF@oracle.com/
which seems to be the same thing but mis-bisected. I've asked them to try
with NO_DELAY_DEQUEUE just to be sure. But it looks like a duck.
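(For reference, toggling that at runtime should just be a matter of

  echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features

assuming debugfs is mounted in the usual place; echoing DELAY_DEQUEUE back
in re-enables it.)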
Cheers,
Phil
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-21 12:07 ` Phil Auld
@ 2024-11-21 21:21 ` Phil Auld
0 siblings, 0 replies; 277+ messages in thread
From: Phil Auld @ 2024-11-21 21:21 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Thu, Nov 21, 2024 at 07:07:04AM -0500 Phil Auld wrote:
> On Thu, Nov 21, 2024 at 06:56:28AM -0500 Phil Auld wrote:
> > On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > > >
> > > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > > other tests. Did this need to be on top of the wakeup change?
> > > >
> > > > No, that made a mess.
> > >
> > > Rashly speculating that turning mobile kthread component loose is what
> > > helped your write regression...
> > >
> > > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > > only turn hard working kthreads loose to try to dodge service latency.
> > > Seems unlikely wakeup frequency * instances would combine to shred fio
> > > the way turning tbench loose did.
> > >
> >
> > Thanks, I'll try that.
> >
>
> Also, fwiw, I think there is another report here
>
> https://lore.kernel.org/lkml/392209D9-5AC6-4FDE-8D84-FB8A82AD9AEF@oracle.com/
>
> which seems to be the same thing but mis-bisected. I've asked them to try
> with NO_DELAY_DEQUEUE just to be sure. But it looks like a duck.
>
But it does not quack like one. Their issue did not go away with
NO_DELAY_DEQUEUE so something different is causing that one.
>
> Cheers,
> Phil
>
> --
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-21 11:56 ` Phil Auld
2024-11-21 12:07 ` Phil Auld
@ 2024-11-23 8:44 ` Mike Galbraith
2024-11-26 5:32 ` K Prateek Nayak
2024-12-02 16:24 ` Phil Auld
1 sibling, 2 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-23 8:44 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Thu, 2024-11-21 at 06:56 -0500, Phil Auld wrote:
> On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > >
> > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > other tests. Did this need to be on top of the wakeup change?
> > >
> > > No, that made a mess.
> >
> > Rashly speculating that turning mobile kthread component loose is what
> > helped your write regression...
> >
> > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > only turn hard working kthreads loose to try to dodge service latency.
> > Seems unlikely wakeup frequency * instances would combine to shred fio
> > the way turning tbench loose did.
> >
>
> Thanks, I'll try that.
You may still want to try that, but my box says probably not. Playing
with your write command line, the players I see are pinned kworkers and
mobile fio instances.
Maybe try the below instead. The changelog is obsolete BS unless you
say otherwise, but while twiddled V2 will still migrate tbench a bit,
and per trace_printk() does still let all kinds of stuff wander off to
roll the SIS dice, it does not even scratch the paint of the formerly
obliterated tbench progression.
Question: did wiping off the evil leave any meaningful goodness behind?
---
sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
Phil Auld (Redhat) reported an fio benchmark regression having been found
to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
may be related to wakees losing the ability to migrate, and confirmed that
restoration of same indeed did restore previous performance.
V2: do not rip buddies apart, convenient on/off switch
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Mike Galbraith <efault@gmx.de>
---
kernel/sched/core.c | 51 ++++++++++++++++++++++++++++++------------------
kernel/sched/features.h | 5 ++++
kernel/sched/sched.h | 5 ++++
3 files changed, 42 insertions(+), 19 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
-
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
- /*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
- */
- wakeup_preempt(rq, p, wake_flags);
+ CLASS(__task_rq_lock, rq_guard)(p);
+ struct rq *rq = rq_guard.rq;
+
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ update_rq_clock(rq);
+ if (p->se.sched_delayed) {
+ int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+ int dequeue = sched_feat(DEQUEUE_DELAYED);
+
+ /*
+ * Since sched_delayed means we cannot be current anywhere,
+ * dequeue it here and have it fall through to the
+ * select_task_rq() case further along in ttwu() path.
+ * Note: Do not rip buddies apart else chaos follows.
+ */
+ if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
+ !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
+ dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
+
+ enqueue_task(rq, p, queue_flags);
+ }
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
}
- __task_rq_unlock(rq, &rf);
+ ttwu_do_wakeup(p);
- return ret;
+ return 1;
}
#ifdef CONFIG_SMP
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -47,6 +47,11 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
* DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
*/
SCHED_FEAT(DELAY_DEQUEUE, true)
+/*
+ * Free ONLY non-buddy delayed tasks to wakeup-migrate to avoid taking
+ * an unnecessary latency hit. Rending buddies asunder inflicts harm.
+ */
+SCHED_FEAT(DEQUEUE_DELAYED, true)
SCHED_FEAT(DELAY_ZERO, true)
/*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1783,6 +1783,11 @@ task_rq_unlock(struct rq *rq, struct tas
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+ _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+ __task_rq_unlock(_T->rq, &_T->rf),
+ struct rq *rq; struct rq_flags rf)
+
DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
_T->rq = task_rq_lock(_T->lock, &_T->rf),
task_rq_unlock(_T->rq, _T->lock, &_T->rf),
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-23 8:44 ` [PATCH V2] " Mike Galbraith
@ 2024-11-26 5:32 ` K Prateek Nayak
2024-11-26 6:30 ` Mike Galbraith
2024-12-02 16:24 ` Phil Auld
1 sibling, 1 reply; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-26 5:32 UTC (permalink / raw)
To: Mike Galbraith, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
Hello Mike,
On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> [..snip..]
>
> Question: did wiping off the evil leave any meaningful goodness behind?
>
> ---
>
> sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
>
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
>
> V2: do not rip buddies apart, convenient on/off switch
>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
> kernel/sched/core.c | 51 ++++++++++++++++++++++++++++++------------------
> kernel/sched/features.h | 5 ++++
> kernel/sched/sched.h | 5 ++++
> 3 files changed, 42 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> -
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
> +
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> + int dequeue = sched_feat(DEQUEUE_DELAYED);
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along in ttwu() path.
> + * Note: Do not rip buddies apart else chaos follows.
> + */
> + if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
encounter a delayed task?
> + !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
Technically, we are still looking at the last wakeup here since
record_wakee() is only called later. If we care about 1:1 buddies,
should we just see "current == p->last_wakee", otherwise, there is a
good chance "p" has a m:n waker-wakee relationship in which case
perhaps a "want_affine" like heuristic can help?
For science, I was wondering if the decision to dequeue + migrate or
requeue the delayed task can be put off until after the whole
select_task_rq() target selection (note: without the h_nr_delayed
stuff, some of that wake_affine_idle() logic falls apart). Hackbench
(which saw some regression with EEVDF Complete) seems to like it
somewhat, but it still falls behind NO_DELAY_DEQUEUE.
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
NO_DELAY_DEQUEUE    Mike's v2        Full ttwu + requeue/migrate
            5.76    5.72 (  1% )     5.82 ( -1% )
            6.53    6.56 (  0% )     6.65 ( -2% )
            6.79    7.04 ( -4% )     7.02 ( -3% )
            6.91    7.04 ( -2% )     7.03 ( -2% )
            7.63    8.05 ( -6% )     7.88 ( -3% )
Only subtle changes in IBS profiles; there aren't any obvious shifts
in hotspots with hackbench at least. Not sure if it is just the act of
needing to do a dequeue + enqueue from the wakeup context that adds to
the overall regression.
--
Thanks and Regards,
Prateek
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> + }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> [..snip..]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-26 5:32 ` K Prateek Nayak
@ 2024-11-26 6:30 ` Mike Galbraith
2024-11-26 9:42 ` Mike Galbraith
2024-11-27 14:13 ` Mike Galbraith
0 siblings, 2 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-26 6:30 UTC (permalink / raw)
To: K Prateek Nayak, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-26 at 11:02 +0530, K Prateek Nayak wrote:
> Hello Mike,
>
> On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> > [..snip..]
> >
> > Question: did wiping off the evil leave any meaningful goodness behind?
> >
> > ---
> >
> > sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
> >
> > Phil Auld (Redhat) reported an fio benchmark regression having been found
> > to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> > may be related to wakees losing the ability to migrate, and confirmed that
> > restoration of same indeed did restore previous performance.
> >
> > V2: do not rip buddies apart, convenient on/off switch
> >
> > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> > ---
> > kernel/sched/core.c | 51 ++++++++++++++++++++++++++++++------------------
> > kernel/sched/features.h | 5 ++++
> > kernel/sched/sched.h | 5 ++++
> > 3 files changed, 42 insertions(+), 19 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > - struct rq_flags rf;
> > - struct rq *rq;
> > - int ret = 0;
> > -
> > - rq = __task_rq_lock(p, &rf);
> > - if (task_on_rq_queued(p)) {
> > - update_rq_clock(rq);
> > - if (p->se.sched_delayed)
> > - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > - if (!task_on_cpu(rq, p)) {
> > - /*
> > - * When on_rq && !on_cpu the task is preempted, see if
> > - * it should preempt the task that is current now.
> > - */
> > - wakeup_preempt(rq, p, wake_flags);
> > + CLASS(__task_rq_lock, rq_guard)(p);
> > + struct rq *rq = rq_guard.rq;
> > +
> > + if (!task_on_rq_queued(p))
> > + return 0;
> > +
> > + update_rq_clock(rq);
> > + if (p->se.sched_delayed) {
> > + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > + int dequeue = sched_feat(DEQUEUE_DELAYED);
> > +
> > + /*
> > + * Since sched_delayed means we cannot be current anywhere,
> > + * dequeue it here and have it fall through to the
> > + * select_task_rq() case further along in ttwu() path.
> > + * Note: Do not rip buddies apart else chaos follows.
> > + */
> > + if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
>
> Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
> encounter a delayed task?
The switch is for test convenience.
> > + !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
>
> Technically, we are still looking at the last wakeup here since
> record_wakee() is only called later. If we care about 1:1 buddies,
> should we just see "current == p->last_wakee", otherwise, there is a
> good chance "p" has a m:n waker-wakee relationship in which case
> perhaps a "want_affine" like heuristic can help?
The intent is to blunt the instrument a bit. Paul should have a highly
active interrupt source, which will give wakeup credit to whatever is
sitting on that CPU, breaking 1:1 connections.. a little bit. That's
why it still migrates tbench buddies, but NOT at a rate that turns a
tbench progression into a new low regression. The hope is that the
load shift caused by that active interrupt source is enough to give
Paul's regression some of the help it demonstrated wanting, without the
collateral damage. It might now be so weak as to not meet the
"meaningful" in my question, in which case it lands on the ginormous
pile of meh, sorta works, but why would anyone care.
> For science, I was wondering if the decision to dequeue + migrate or
> requeue the delayed task can be put off until after the whole
> select_task_rq() target selection (note: without the h_nr_delayed
> stuff, some of that wake_affine_idle() logic falls apart). Hackbench
> (which saw some regression with EEVDF Complete) seem to like it
> somewhat, but it still falls behind NO_DELAY_DEQUEUE.
You can, with a few more fast path cycles and some duplication, none of
which looks very desirable.
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> NO_DELAY_DEQUEUE Mike's v2 Full ttwu + requeue/migrate
> 5.76 5.72 ( 1% ) 5.82 ( -1% )
> 6.53 6.56 ( 0% ) 6.65 ( -2% )
> 6.79 7.04 ( -4% ) 7.02 ( -3% )
> 6.91 7.04 ( -2% ) 7.03 ( -2% )
> 7.63 8.05 ( -6% ) 7.88 ( -3% )
>
> Only subtle changes in IBS profiles; there aren't any obvious shift
> in hotspots with hackbench at least. Not sure if it is just the act of
> needing to do a dequeue + enqueue from the wakeup context that adds to
> the overall regression.
Those numbers say to me that hackbench doesn't care deeply. That
works for me, because I don't care deeply about nutty fork bombs ;-)
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-26 6:30 ` Mike Galbraith
@ 2024-11-26 9:42 ` Mike Galbraith
2024-12-02 19:15 ` Phil Auld
2024-11-27 14:13 ` Mike Galbraith
1 sibling, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-11-26 9:42 UTC (permalink / raw)
To: K Prateek Nayak, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-26 at 07:30 +0100, Mike Galbraith wrote:
>
> The intent is to blunt the instrument a bit. Paul should have
Yeah I did... ahem, I meant of course Phil.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH 00/24] Complete EEVDF
2024-11-11 4:07 ` K Prateek Nayak
@ 2024-11-26 23:32 ` Saravana Kannan
0 siblings, 0 replies; 277+ messages in thread
From: Saravana Kannan @ 2024-11-26 23:32 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Samuel Wu, Peter Zijlstra, Luis Machado, David Dai, mingo,
juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel, wuyun.abel, youssefesmat, tglx,
efault, Android Kernel Team, Qais Yousef, Vincent Palomares,
John Stultz
On Sun, Nov 10, 2024 at 8:08 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Sam,
>
> On 11/9/2024 4:47 AM, Samuel Wu wrote:
> > On Thu, Nov 7, 2024 at 11:08 PM Saravana Kannan <saravanak@google.com> wrote:
> >>
> >> On Wed, Nov 6, 2024 at 4:07 AM Luis Machado <luis.machado@arm.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 11/6/24 11:09, Peter Zijlstra wrote:
> >>>> On Wed, Nov 06, 2024 at 11:49:00AM +0530, K Prateek Nayak wrote:
> >>>>
> >>>>> Since delayed entities are still on the runqueue, they can affect PELT
> >>>>> calculation. Vincent and Dietmar have both noted this and Peter posted
> >>>>> https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> >>>>> in response but it was pulled out since Luis reported observing -ve
> >>>>> values for h_nr_delayed on his setup. A lot has been fixed around
> >>>>> delayed dequeue since and I wonder if now would be the right time to
> >>>>> re-attempt h_nr_delayed tracking.
> >>>>
> >>>> Yeah, it's something I meant to get back to. I think the patch as posted
> >>>> was actually right and it didn't work for Luis because of some other,
> >>>> since fixed issue.
> >>>>
> >>>> But I might be misremembering things. I'll get to it eventually :/
> >>>
> >>> Sorry for the late reply, I got sidetracked on something else.
> >>>
> >>> There have been a few power regressions (based on our Pixel6-based testing) due
> >>> to the delayed-dequeue series.
> >>>
> >>> The main one drove the frequencies up due to an imbalance in the uclamp inc/dec
> >>> handling. That has since been fixed by "[PATCH 10/24] sched/uclamg: Handle delayed dequeue". [1]
> >>>
> >>> The bug also made it so disabling DELAY_DEQUEUE at runtime didn't fix things, because the
> >>> imbalance/stale state would be perpetuated. Disabling DELAY_DEQUEUE before boot did fix things.
> >>>
> >>> So power use was brought down by the above fix, but some issues still remained, like the
> >>> accounting issues with h_nr_running and not taking sched_delayed tasks into account.
> >>>
> >>> Dietmar addressed some of it with "kernel/sched: Fix util_est accounting for DELAY_DEQUEUE". [2]
> >>>
> >>> Peter sent another patch to add accounting for sched_delayed tasks [3]. Though the patch was
> >>> mostly correct, under some circumstances [4] we spotted imbalances in the sched_delayed
> >>> accounting that slowly drove frequencies up again.
> >>>
> >>> If I recall correctly, Peter has pulled that particular patch from the tree, but we should
> >>> definitely revisit it with a proper fix for the imbalance. Suggestion in [5].
> >>>
> >>> [1] https://lore.kernel.org/lkml/20240727105029.315205425@infradead.org/
> >>> [2] https://lore.kernel.org/lkml/c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com/
> >>> [3] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> >>> [4] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> >>> [5] https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
> >>
> >> Thanks for the replies. We are trying to disable DELAY_DEQUEUE and
> >> recollect the data to see if that's the cause. We'll get back to this
> >> thread once we have some data.
> >>
> >> -Saravana
> >
> > The test data is back to pre-EEVDF state with DELAY_DEQUEUE disabled.
> >
> > Same test example from before, when thread is affined to the big cluster:
> > +-----------------+---------+----------+
> > | Data            | Enabled | Disabled |
> > |-----------------+---------+----------|
> > | 5th percentile  |      96 |      143 |
> > |-----------------+---------+----------|
> > | Median          |     144 |      147 |
> > |-----------------+---------+----------|
> > | Mean            |     134 |      147 |
> > |-----------------+---------+----------|
> > | 95th percentile |     150 |      150 |
> > +-----------------+---------+----------+
> >
> > What are the next steps to bring this behavior back? Will DELAY_DEQUEUE always
> > be enabled by default and/or is there a fix coming for 6.12?
>
> DELAY_DEQUEUE should be enabled by default from v6.12 but there are a
> few fixes for the same in-flight. Could try running with the changes
> from [1] and [2] and see if you could reproduce the behavior and if
> you can, is it equally bad?
>
> Both changes apply cleanly for me on top of current
>
> git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
>
> at commit fe9beaaa802d ("sched: No PREEMPT_RT=y for all{yes,mod}config")
> when applied in that order.
>
> [1] https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
> [2] https://lore.kernel.org/lkml/750542452c4f852831e601e1b8de40df4b108d9a.camel@gmx.de/
Have these changes landed in 6.12? Or will they land in 6.13?
We tested 6.12 and the issue we reported is still present. What should
we do for any products we want to ship on 6.12? Disable Delayed
Dequeue or backport any fixes to 6.12 LTS?
Peter/Vincent, do you plan on backporting the future fixes to the 6.12
LTS kernel? Anything else we can do to help with making sure this is
fixed on the LTS kernel?
-Saravana
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE
2024-09-10 8:09 ` [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE tip-bot2 for Peter Zijlstra
@ 2024-11-27 4:17 ` K Prateek Nayak
2024-11-27 9:34 ` Luis Machado
0 siblings, 1 reply; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-27 4:17 UTC (permalink / raw)
To: Peter Zijlstra (Intel)
Cc: Dietmar Eggemann, Vincent Guittot, x86, linux-kernel,
linux-tip-commits
Hello Peter,
On 9/10/2024 1:39 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
> Gitweb: https://git.kernel.org/tip/2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
> Author: Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Fri, 06 Sep 2024 12:45:25 +02:00
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Tue, 10 Sep 2024 09:51:15 +02:00
>
> sched/eevdf: More PELT vs DELAYED_DEQUEUE
>
> Vincent and Dietmar noted that while commit fc1892becd56 fixes the
> entity runnable stats, it does not adjust the cfs_rq runnable stats,
> which are based off of h_nr_running.
>
> Track h_nr_delayed such that we can discount those and adjust the
> signal.
>
> Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20240906104525.GG4928@noisy.programming.kicks-ass.net
I've been testing this fix for a while now to see if it helps the
regressions reported around EEVDF complete. The issue with negative
"h_nr_delayed" reported by Luis previously seem to have been fixed as a
result of commit 75b6499024a6 ("sched/fair: Properly deactivate
sched_delayed task upon class change")
I've been running stress-ng for a while and haven't seen any cases of
negative "h_nr_delayed". I'd also added the following WARN_ON() to see
if there are any delayed tasks on the cfs_rq before switching to idle in
some of my previous experiments and I did not see any splat during my
benchmark runs.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 621696269584..c19a31fa46c9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -457,6 +457,9 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct t
static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
{
+ /* All delayed tasks must be picked off before switching to idle */
+ SCHED_WARN_ON(rq->cfs.h_nr_delayed);
+
update_idle_core(rq);
scx_update_idle(rq, true);
schedstat_inc(rq->sched_goidle);
--
If you are including this back, feel free to add:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> [..snip..]
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE
2024-11-27 4:17 ` K Prateek Nayak
@ 2024-11-27 9:34 ` Luis Machado
2024-11-28 6:35 ` K Prateek Nayak
0 siblings, 1 reply; 277+ messages in thread
From: Luis Machado @ 2024-11-27 9:34 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra (Intel)
Cc: Dietmar Eggemann, Vincent Guittot, x86, linux-kernel,
linux-tip-commits
Hi,
On 11/27/24 04:17, K Prateek Nayak wrote:
> Hello Peter,
>
> On 9/10/2024 1:39 PM, tip-bot2 for Peter Zijlstra wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID: 2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
>> Gitweb: https://git.kernel.org/tip/2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
>> Author: Peter Zijlstra <peterz@infradead.org>
>> AuthorDate: Fri, 06 Sep 2024 12:45:25 +02:00
>> Committer: Peter Zijlstra <peterz@infradead.org>
>> CommitterDate: Tue, 10 Sep 2024 09:51:15 +02:00
>>
>> sched/eevdf: More PELT vs DELAYED_DEQUEUE
>>
>> Vincent and Dietmar noted that while commit fc1892becd56 fixes the
>> entity runnable stats, it does not adjust the cfs_rq runnable stats,
>> which are based off of h_nr_running.
>>
>> Track h_nr_delayed such that we can discount those and adjust the
>> signal.
>>
>> Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
>> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: https://lkml.kernel.org/r/20240906104525.GG4928@noisy.programming.kicks-ass.net
>
> I've been testing this fix for a while now to see if it helps the
> regressions reported around EEVDF complete. The issue with negative
> "h_nr_delayed" reported by Luis previously seem to have been fixed as a
> result of commit 75b6499024a6 ("sched/fair: Properly deactivate
> sched_delayed task upon class change")
I recall having 75b6499024a6 in my testing tree and somehow still noticing
unbalanced accounting for h_nr_delayed, where it would be decremented
twice eventually, leading to negative numbers.
I might have to give it another go if we're considering including the change
as-is, just to make sure. Since our setups are slightly different, we could
be exercising some slightly different paths.
Did this patch help with the regressions you noticed though?
>
> I've been running stress-ng for a while and haven't seen any cases of
> negative "h_nr_delayed". I'd also added the following WARN_ON() to see
> if there are any delayed tasks on the cfs_rq before switching to idle in
> some of my previous experiments and I did not see any splat during my
> benchmark runs.
>
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 621696269584..c19a31fa46c9 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -457,6 +457,9 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct t
>
> static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> {
> + /* All delayed tasks must be picked off before switching to idle */
> + SCHED_WARN_ON(rq->cfs.h_nr_delayed);
> +
> update_idle_core(rq);
> scx_update_idle(rq, true);
> schedstat_inc(rq->sched_goidle);
> --
>
> If you are including this back, feel free to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
>> [..snip..]
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-26 6:30 ` Mike Galbraith
2024-11-26 9:42 ` Mike Galbraith
@ 2024-11-27 14:13 ` Mike Galbraith
1 sibling, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-11-27 14:13 UTC (permalink / raw)
To: K Prateek Nayak, Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wuyun.abel, youssefesmat, tglx
On Tue, 2024-11-26 at 07:30 +0100, Mike Galbraith wrote:
>
> The intent is to blunt the instrument a bit. Paul should have a highly
> active interrupt source, which will give wakeup credit to whatever is
> sitting on that CPU, breaking 1:1 connections.. a little bit. That's
> why it still migrates tbench buddies, but NOT at a rate that turns a
> tbench progression into a new low regression.
BTW, the reason for the tbench wreckage being so bad is that when the
box is near saturation, not only are a portion of the surviving
sched_delayed tasks affine wakeups (always the optimal configuration
for this fast mover client/server pair in an L3 equipped box), they are
exclusively affine wakeups. That is most definitely gonna hurt.
When saturating that becomes the best option for a lot of client/server
pairs, even those with a lot of concurrency. Turning them loose to
migrate at that time is far more likely than not to hurt a LOT, so V1
was doomed.
> The hope is that the
> load shift caused by that active interrupt source is enough to give
> Paul's regression some of the help it demonstrated wanting, without the
> collateral damage. It might now be so weak as to not meet the
> "meaningful" in my question, in which case it lands on the ginormous
> pile of meh, sorta works, but why would anyone care.
In my IO challenged box, patch is useless to fio, nothing can help a
load where all of the IO action, and wimpy action at that, is nailed to
one CPU. I can see it helping other latency sensitive stuff, like say
1:N mother of all work and/or control threads (and ilk), but if Phil's
problematic box looks anything like this box.. nah, it's a long reach.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE
2024-11-27 9:34 ` Luis Machado
@ 2024-11-28 6:35 ` K Prateek Nayak
0 siblings, 0 replies; 277+ messages in thread
From: K Prateek Nayak @ 2024-11-28 6:35 UTC (permalink / raw)
To: Luis Machado, Peter Zijlstra (Intel)
Cc: Dietmar Eggemann, Vincent Guittot, x86, linux-kernel,
linux-tip-commits, Saravana Kannan, Samuel Wu,
Android Kernel Team
(+ Saravana, Samuel)
Hello Luis,
On 11/27/2024 3:04 PM, Luis Machado wrote:
> Hi,
>
> On 11/27/24 04:17, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 9/10/2024 1:39 PM, tip-bot2 for Peter Zijlstra wrote:
>>> The following commit has been merged into the sched/core branch of tip:
>>>
>>> Commit-ID: 2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
>>> Gitweb: https://git.kernel.org/tip/2e05f6c71d36f8ae1410a1cf3f12848cc17916e9
>>> Author: Peter Zijlstra <peterz@infradead.org>
>>> AuthorDate: Fri, 06 Sep 2024 12:45:25 +02:00
>>> Committer: Peter Zijlstra <peterz@infradead.org>
>>> CommitterDate: Tue, 10 Sep 2024 09:51:15 +02:00
>>>
>>> sched/eevdf: More PELT vs DELAYED_DEQUEUE
>>>
>>> Vincent and Dietmar noted that while commit fc1892becd56 fixes the
>>> entity runnable stats, it does not adjust the cfs_rq runnable stats,
>>> which are based off of h_nr_running.
>>>
>>> Track h_nr_delayed such that we can discount those and adjust the
>>> signal.
>>>
>>> Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
>>> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Link: https://lkml.kernel.org/r/20240906104525.GG4928@noisy.programming.kicks-ass.net
>>
>> I've been testing this fix for a while now to see if it helps the
>> regressions reported around EEVDF complete. The issue with negative
>> "h_nr_delayed" reported by Luis previously seem to have been fixed as a
>> result of commit 75b6499024a6 ("sched/fair: Properly deactivate
>> sched_delayed task upon class change")
>
> I recall having 75b6499024a6 in my testing tree and somehow still noticing
> unbalanced accounting for h_nr_delayed, where it would be decremented
> twice eventually, leading to negative numbers.
>
> I might have to give it another go if we're considering including the change
> as-is, just to make sure. Since our setups are slightly different, we could
> be exercising some slightly different paths.
That would be great! Thank you :)
Now that I see, you did have Valentin's patches in your tree during
testing
https://lore.kernel.org/lkml/6df12fde-1e0d-445f-8f8a-736d11f9ee41@arm.com/
Perhaps it could be the fixup commit 98442f0ccd82 ("sched: Fix
delayed_dequeue vs switched_from_fair()") or the fact that my benchmark
didn't stress this path enough to hit the breakage you mentioned. I would
have still expected it to hit that SCHED_WARN_ON() I had added in
set_next_task_idle() if something went sideways.
>
> Did this patch help with the regressions you noticed though?
I believe it was Saravana who was seeing anomalies in PELT ramp-up with
DELAY_DEQUEUE. My test setup is currently not equipped to catch it but
Saravana was interested in these fixes being backported to v6.12 LTS in
https://lore.kernel.org/lkml/CAGETcx_1pZCtWiBbDmUcxEw3abF5dr=XdFCkH9zXWK75g7457w@mail.gmail.com/
I believe tracking h_nr_delayed and disregarding delayed tasks in
certain places is a necessary fix. None of the benchmarks in my test
setup seem to mind running without it but I'm also doing most of my
testing with the performance governor, and the PELT anomalies seem to matter
more from a PM perspective than from a load balancing perspective.
>
>>
>> I've been running stress-ng for a while and haven't seen any cases of
>> negative "h_nr_delayed". I'd also added the following WARN_ON() to see
>> if there are any delayed tasks on the cfs_rq before switching to idle in
>> some of my previous experiments and I did not see any splat during my
>> benchmark runs.
>>
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 621696269584..c19a31fa46c9 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -457,6 +457,9 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct t
>>
>> static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
>> {
>> + /* All delayed tasks must be picked off before switching to idle */
>> + SCHED_WARN_ON(rq->cfs.h_nr_delayed);
>> +
>> update_idle_core(rq);
>> scx_update_idle(rq, true);
>> schedstat_inc(rq->sched_goidle);
>> --
>>
>> If you are including this back, feel free to add:
>>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>
>>> [..snip..]
>>
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 277+ messages in thread
* [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
` (30 preceding siblings ...)
2024-11-06 1:07 ` Saravana Kannan
@ 2024-11-28 10:32 ` Marcel Ziswiler
2024-11-28 10:58 ` Peter Zijlstra
2024-12-10 8:45 ` Luis Machado
31 siblings, 2 replies; 277+ messages in thread
From: Marcel Ziswiler @ 2024-11-28 10:32 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
Hi all,
On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote:
> Hi all,
>
> So after much delay this is hopefully the final version of the EEVDF patches.
> They've been sitting in my git tree for ever it seems, and people have been
> testing it and sending fixes.
>
> I've spend the last two days testing and fixing cfs-bandwidth, and as far
> as I know that was the very last issue holding it back.
>
> These patches apply on top of queue.git sched/dl-server, which I plan on merging
> in tip/sched/core once -rc1 drops.
>
> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>
>
> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>
> - split up the huge delay-dequeue patch
> - tested/fixed cfs-bandwidth
> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
> - SCHED_BATCH is equivalent to RESPECT_SLICE
> - propagate min_slice up cgroups
> - CLOCK_THREAD_DVFS_ID
We found the following 7 commits from this patch set to trigger a crash in enqueue_dl_entity():
54a58a787791 sched/fair: Implement DELAY_ZERO
152e11f6df29 sched/fair: Implement delayed dequeue
e1459a50ba31 sched: Teach dequeue_task() about special task states
a1c446611e31 sched,freezer: Mark TASK_FROZEN special
781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
Resulting in the following crash dump (this is running v6.12.1):
[ 14.652856] sched: DL replenish lagged too much
[ 16.572706] ------------[ cut here ]------------
[ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c
[ 16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_triggered_buffer kfifo_buf rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm hantro_vpu v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2mem rockchipdrm drm_gpuvm drm_exec drm_shmem_helper analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec drm_display_helper snd_soc_audio_graph_card snd_soc_simple_card_utils drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm backlight dm_mod dax
[ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
[ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
[ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c
[ 16.580661] lr : dl_server_start+0x44/0x12c
[ 16.581028] sp : ffff80008002bc00
[ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
[ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
[ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
[ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
[ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
[ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
[ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
[ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
[ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
[ 16.587540] Call trace:
[ 16.587754] enqueue_dl_entity+0x46c/0x55c
[ 16.588113] dl_server_start+0x44/0x12c
[ 16.588449] enqueue_task_fair+0x124/0x49c
[ 16.588807] enqueue_task+0x3c/0xe0
[ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208
[ 16.589511] try_to_wake_up+0x1d0/0x61c
[ 16.589847] wake_up_process+0x18/0x24
[ 16.590175] kick_pool+0x84/0x150
[ 16.590467] __queue_work+0x2f4/0x544
[ 16.590788] delayed_work_timer_fn+0x1c/0x28
[ 16.591161] call_timer_fn+0x34/0x1ac
[ 16.591481] __run_timer_base+0x20c/0x314
[ 16.591832] run_timer_softirq+0x3c/0x78
[ 16.592176] handle_softirqs+0x124/0x35c
[ 16.592520] __do_softirq+0x14/0x20
[ 16.592827] ____do_softirq+0x10/0x1c
[ 16.593148] call_on_irq_stack+0x24/0x4c
[ 16.593490] do_softirq_own_stack+0x1c/0x2c
[ 16.593857] irq_exit_rcu+0x8c/0xc0
[ 16.594163] el0_interrupt+0x48/0xbc
[ 16.594477] __el0_irq_handler_common+0x18/0x24
[ 16.594874] el0t_64_irq_handler+0x10/0x1c
[ 16.595232] el0t_64_irq+0x190/0x194
[ 16.595545] ---[ end trace 0000000000000000 ]---
[ 16.595950] ------------[ cut here ]------------
It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1].
We are running the exact same scheduler stress test both on Intel NUCs as well as RADXA ROCK 5B board farms.
While so far we have not seen this on amd64, it crashes consistently/reproducibly on aarch64.
We haven't had time to put together a non-proprietary reproduction case as of yet, but I wanted to report our current
findings and ask for any feedback/suggestions from the community.
Thanks!
Cheers
Marcel
[1] https://hastebin.skyra.pw/hoqesigaye.yaml
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
@ 2024-11-28 10:58 ` Peter Zijlstra
2024-11-28 11:37 ` Marcel Ziswiler
2024-12-10 8:45 ` Luis Machado
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-28 10:58 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote:
> Resulting in the following crash dump (this is running v6.12.1):
>
> [ 14.652856] sched: DL replenish lagged too much
> [ 16.572706] ------------[ cut here ]------------
> [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c
> [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
> [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
> [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c
> [ 16.580661] lr : dl_server_start+0x44/0x12c
> [ 16.581028] sp : ffff80008002bc00
> [ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
> [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
> [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
> [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
> [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
> [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
> [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
> [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
> [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
> [ 16.587540] Call trace:
> [ 16.587754] enqueue_dl_entity+0x46c/0x55c
> [ 16.588113] dl_server_start+0x44/0x12c
> [ 16.588449] enqueue_task_fair+0x124/0x49c
> [ 16.588807] enqueue_task+0x3c/0xe0
> [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208
> [ 16.589511] try_to_wake_up+0x1d0/0x61c
> [ 16.589847] wake_up_process+0x18/0x24
> [ 16.590175] kick_pool+0x84/0x150
> [ 16.590467] __queue_work+0x2f4/0x544
> [ 16.590788] delayed_work_timer_fn+0x1c/0x28
> [ 16.591161] call_timer_fn+0x34/0x1ac
> [ 16.591481] __run_timer_base+0x20c/0x314
> [ 16.591832] run_timer_softirq+0x3c/0x78
> [ 16.592176] handle_softirqs+0x124/0x35c
> [ 16.592520] __do_softirq+0x14/0x20
> [ 16.592827] ____do_softirq+0x10/0x1c
> [ 16.593148] call_on_irq_stack+0x24/0x4c
> [ 16.593490] do_softirq_own_stack+0x1c/0x2c
> [ 16.593857] irq_exit_rcu+0x8c/0xc0
> [ 16.594163] el0_interrupt+0x48/0xbc
> [ 16.594477] __el0_irq_handler_common+0x18/0x24
> [ 16.594874] el0t_64_irq_handler+0x10/0x1c
> [ 16.595232] el0t_64_irq+0x190/0x194
> [ 16.595545] ---[ end trace 0000000000000000 ]---
> [ 16.595950] ------------[ cut here ]------------
>
> It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1].
Right, I've had a number of these reports, but so far we've not yet
managed to figure out how it's all happening.
> We are running the exact same scheduler stress test both on Intel NUCs
> as well as RADXA ROCK 5B board farms. While so far we have not seen
> this on amd64 it crashes consistently/reproducible on aarch64.
Oooh, that's something. So far the few reports have not been (easily)
reproducible. If this is readily reproducible on arm64 that would
help a lot. Juri, do you have access to an arm64 test box?
A very long shot:
https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
doesn't help, does it?
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-11-28 10:58 ` Peter Zijlstra
@ 2024-11-28 11:37 ` Marcel Ziswiler
2024-11-29 9:08 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Marcel Ziswiler @ 2024-11-28 11:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, 2024-11-28 at 11:58 +0100, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 11:32:10AM +0100, Marcel Ziswiler wrote:
>
> > Resulting in the following crash dump (this is running v6.12.1):
[snip]
> > It looks like it is trying to enqueue an already queued deadline task. Full serial console log available
> > [1].
>
> Right, I've had a number of these reports, but so far we've not yet
> managed to figure out how it's all happening.
>
> > We are running the exact same scheduler stress test both on Intel NUCs
> > as well as RADXA ROCK 5B board farms. While so far we have not seen
> > this on amd64 it crashes consistently/reproducible on aarch64.
>
> Oooh, that's something. So far the few reports have not been (easily)
> reproducible. If this is readily reproducible on arm64 that would
> help a lot. Juri, do you have access to an arm64 test box?
As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share
anything which helps in resolving this.
> A very long shot:
>
> https://lkml.kernel.org/r/20241127063740.8278-1-juri.lelli@redhat.com
>
> doesn't help, does it?
No, still the same with this on top of today's -next.
Thanks!
Cheers
Marcel
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-11-28 11:37 ` Marcel Ziswiler
@ 2024-11-29 9:08 ` Peter Zijlstra
2024-12-02 18:46 ` Marcel Ziswiler
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-11-29 9:08 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
> > Oooh, that's something. So far the few reports have not been (easily)
> > reproducible. If this is readily reproducible on arm64 that would
> > help a lot. Juri, do you have access to an arm64 test box?
>
> As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share
> anything which helps in resolving this.
I was hoping you could perhaps share a binary with Juri privately or
with RHT (same difference etc), such that he can poke at it too.
Anyway, if you don't mind a bit of back and forth, would you mind adding
the below patch to your kernel and doing:
(all assuming your kernel has ftrace enabled)
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
echo 1 > /proc/sys/kernel/traceoff_on_warning
running your test to failure and then dumping the trace into a file
like:
cat /sys/kernel/debug/tracing/trace > ~/trace
Then compress the file (bzip2 or whatever is popular these days) and
send it my way along with a dmesg dump (private is fine -- these things
tend to be large-ish).
Hopefully, this will give us a little clue as to where the double
enqueue happens.
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..b9cd9b40a19f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
scoped_guard (rq_lock, rq) {
struct rq_flags *rf = &scope.rf;
+ if (dl_se == &rq->fair_server) {
+ trace_printk("timer fair server %d throttled %d\n",
+ cpu_of(rq), dl_se->dl_throttled);
+ }
+
if (!dl_se->dl_throttled || !dl_se->dl_runtime)
return HRTIMER_NORESTART;
@@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
rq_lock(rq, &rf);
}
+ if (dl_se == &rq->fair_server)
+ trace_printk("inactive fair server %d\n", cpu_of(rq));
+
sched_clock_tick();
update_rq_clock(rq);
@@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("enqueue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
@@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("dequeue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
if (RB_EMPTY_NODE(&dl_se->rb_node))
return;
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-23 8:44 ` [PATCH V2] " Mike Galbraith
2024-11-26 5:32 ` K Prateek Nayak
@ 2024-12-02 16:24 ` Phil Auld
2024-12-02 16:55 ` Mike Galbraith
1 sibling, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-12-02 16:24 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> On Thu, 2024-11-21 at 06:56 -0500, Phil Auld wrote:
> > On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > > >
> > > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > > other tests. Did this need to be on top of the wakeup change?
> > > >
> > > > No, that made a mess.
> > >
> > > Rashly speculating that turning mobile kthread component loose is what
> > > helped your write regression...
> > >
> > > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > > only turn hard working kthreads loose to try to dodge service latency.
> > > Seems unlikely wakeup frequency * instances would combine to shred fio
> > > the way turning tbench loose did.
> > >
> >
> > Thanks, I'll try that.
>
> You may still want to try that, but my box says probably not. Playing
> with your write command line, the players I see are pinned kworkers and
> mobile fio instances.
>
Yep. The PF_KTHREAD thing did not help.
> Maybe try the below instead. The changelog is obsolete BS unless you
> say otherwise, but while twiddled V2 will still migrate tbench a bit,
> and per trace_printk() does still let all kinds of stuff wander off to
> roll the SIS dice, it does not even scratch the paint of the formerly
> obliterated tbench progression.
>
Will give this one a try when I get caught up from being off all week for
US turkey day.
Thanks!
> Question: did wiping off the evil leave any meaningful goodness behind?
Is that for this patch?
If you mean for the original patch (which subsequently broke the reads) then
no. It was more or less even for all the other tests. It fixed the randwrite
issue by moving it to randread. Everything else we run regularly was about
the same. So no extra goodness to help decide :)
Cheers,
Phil
>
> ---
>
> sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
>
> Phil Auld (Redhat) reported an fio benchmark regression found to have been
> caused by the addition of the DELAY_DEQUEUE feature, suggested it may be
> related to wakees losing the ability to migrate, and confirmed that restoring
> that ability did indeed restore the previous performance.
>
> V2: do not rip buddies apart, convenient on/off switch
>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
> kernel/sched/core.c | 51 ++++++++++++++++++++++++++++++------------------
> kernel/sched/features.h | 5 ++++
> kernel/sched/sched.h | 5 ++++
> 3 files changed, 42 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> -
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
> +
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> + int dequeue = sched_feat(DEQUEUE_DELAYED);
> +
> + /*
> + * Since sched_delayed means we cannot be current anywhere,
> + * dequeue it here and have it fall through to the
> + * select_task_rq() case further along in ttwu() path.
> + * Note: Do not rip buddies apart else chaos follows.
> + */
> + if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
> + !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> +
> + enqueue_task(rq, p, queue_flags);
> + }
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> }
> - __task_rq_unlock(rq, &rf);
> + ttwu_do_wakeup(p);
>
> - return ret;
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -47,6 +47,11 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
> * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
> */
> SCHED_FEAT(DELAY_DEQUEUE, true)
> +/*
> + * Free ONLY non-buddy delayed tasks to wakeup-migrate to avoid taking
> + * an unnecessary latency hit. Rending buddies asunder inflicts harm.
> + */
> +SCHED_FEAT(DEQUEUE_DELAYED, true)
> SCHED_FEAT(DELAY_ZERO, true)
>
> /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1783,6 +1783,11 @@ task_rq_unlock(struct rq *rq, struct tas
> raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> + __task_rq_unlock(_T->rq, &_T->rf),
> + struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> _T->rq = task_rq_lock(_T->lock, &_T->rf),
> task_rq_unlock(_T->rq, _T->lock, &_T->rf),
>
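(For reference, the DEQUEUE_DELAYED switch added above can be flipped at runtime
through the usual sched features debugfs file; a minimal sketch, assuming
CONFIG_SCHED_DEBUG and debugfs mounted at /sys/kernel/debug:)

cat /sys/kernel/debug/sched/features                          # list current features
echo NO_DEQUEUE_DELAYED > /sys/kernel/debug/sched/features    # turn the new dequeue path off
echo DEQUEUE_DELAYED > /sys/kernel/debug/sched/features       # turn it back on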
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-12-02 16:24 ` Phil Auld
@ 2024-12-02 16:55 ` Mike Galbraith
2024-12-02 19:12 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Mike Galbraith @ 2024-12-02 16:55 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
>
>
> > Question: did wiping off the evil leave any meaningful goodness behind?
>
> Is that for this patch?
Yeah. Trying it on my box with your write command line didn't improve
the confidence level either. My box has one CPU handling IRQs and
waking pinned workers to service 8 fio instances. Patch was useless
for that.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-11-29 9:08 ` Peter Zijlstra
@ 2024-12-02 18:46 ` Marcel Ziswiler
2024-12-09 9:49 ` Peter Zijlstra
2024-12-10 16:13 ` Steven Rostedt
0 siblings, 2 replies; 277+ messages in thread
From: Marcel Ziswiler @ 2024-12-02 18:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Sorry for my late reply, I was traveling back from Manchester to Switzerland but I am all settled down again.
On Fri, 2024-11-29 at 10:08 +0100, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:37:14PM +0100, Marcel Ziswiler wrote:
>
> > > Oooh, that's something. So far the few reports have not been (easily)
> > > reproducible. If this is readily reproducible on arm64 that would
> > > help a lot. Juri, do you have access to an arm64 test box?
> >
> > As mentioned above, so far our scheduler stress test is not yet open source but Codethink is eager to share
> > anything which helps in resolving this.
>
> I was hoping you could perhaps share a binary with Juri privately or
> with RHT (same difference etc), such that he can poke at it too.
Sure, there is nothing secret about it, it is just that we have not gotten around to open sourcing all parts of
it just yet.
The UEFI aarch64 embedded Linux image I am using may be found here [1], plus the matching bmap file should you
fancy using that [2]. The SSH key may help when interacting with the system (e.g. that is how I trigger the
failure, as the console is quite busy with tracing) [3]. However, that image was built by CI and does not yet
contain a kernel with the below patch applied. I manually dumped the kernel config, compiled v6.12.1 with your
patch applied and deployed it (to /lib/modules, /usr/lib/kernel et al.) in the below case where I provide the dump.
> Anyway, if you don't mind a bit of back and forth,
Sure.
> would you mind adding
> the below patch to your kernel and doing:
>
> (all assuming your kernel has ftrace enabled)
>
> echo 1 > /sys/kernel/debug/tracing/options/stacktrace
> echo 1 > /proc/sys/kernel/traceoff_on_warning
>
> running your test to failure and then dumping the trace into a file
> like:
>
> cat /sys/kernel/debug/tracing/trace > ~/trace
Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
buffer any longer. So I did the following instead on the serial console terminal:
tail -f /sys/kernel/debug/tracing/trace
Not sure whether there is any better way to go about this. Plus, even though we run the serial console at 1.5
megabaud, I am not fully sure whether it was able to keep up with logging what you are looking for.
> Then compress the file (bzip2 or whatever is popular these days)
xz or zstd (;-p)
> and
> send it my way along with a dmesg dump (private is fine -- these things
> tend to be large-ish).
As mentioned before, there is nothing secret about it. Please find it here [4].
> Hopefully, this will give us a little clue as to where the double
> enqueue happens.
Yes, and do not hesitate to ask for any additional information etc.; we are happy to help. Thanks!
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d9d5a702f1a6..b9cd9b40a19f 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1203,6 +1203,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> scoped_guard (rq_lock, rq) {
> struct rq_flags *rf = &scope.rf;
>
> + if (dl_se == &rq->fair_server) {
> + trace_printk("timer fair server %d throttled %d\n",
> + cpu_of(rq), dl_se->dl_throttled);
> + }
> +
> if (!dl_se->dl_throttled || !dl_se->dl_runtime)
> return HRTIMER_NORESTART;
>
> @@ -1772,6 +1777,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> rq_lock(rq, &rf);
> }
>
> + if (dl_se == &rq->fair_server)
> + trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
> sched_clock_tick();
> update_rq_clock(rq);
>
> @@ -1967,6 +1975,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("enqueue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> @@ -1978,6 +1992,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("dequeue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> if (RB_EMPTY_NODE(&dl_se->rb_node))
> return;
[1] https://drive.codethink.co.uk/s/N8CQipaNNN45gYM
[2] https://drive.codethink.co.uk/s/mpcPawXpCjPL8D3
[3] https://drive.codethink.co.uk/s/8RjHNTQQRpYgaLc
[4] https://drive.codethink.co.uk/s/MWtzWjLDtdD3E5i
Cheers
Marcel
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-12-02 16:55 ` Mike Galbraith
@ 2024-12-02 19:12 ` Phil Auld
2024-12-09 13:11 ` Phil Auld
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-12-02 19:12 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Mon, Dec 02, 2024 at 05:55:28PM +0100 Mike Galbraith wrote:
> On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> > On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> >
> >
> > > Question: did wiping off the evil leave any meaningful goodness behind?
> >
> > Is that for this patch?
>
> Yeah. Trying it on my box with your write command line didn't improve
> the confidence level either. My box has one CPU handling IRQs and
> waking pinned workers to service 8 fio instances. Patch was useless
> for that.
>
I'll give it a try. Our "box" is multiple different boxes but the results
vary somewhat. The one I sent info about earlier in this thread is just
one of the more egregious and is the one the perf team lent me for a while.
Cheers,
Phil
> -Mike
>
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-11-26 9:42 ` Mike Galbraith
@ 2024-12-02 19:15 ` Phil Auld
0 siblings, 0 replies; 277+ messages in thread
From: Phil Auld @ 2024-12-02 19:15 UTC (permalink / raw)
To: Mike Galbraith
Cc: K Prateek Nayak, Peter Zijlstra, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wuyun.abel, youssefesmat, tglx
On Tue, Nov 26, 2024 at 10:42:37AM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-26 at 07:30 +0100, Mike Galbraith wrote:
> >
> > The intent is to blunt the instrument a bit. Paul should have
>
> Yeah I did... ahem, I meant of course Phil.
>
Heh, you are not alone, Mike :)
> -Mike
>
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-12-02 18:46 ` Marcel Ziswiler
@ 2024-12-09 9:49 ` Peter Zijlstra
2024-12-10 16:05 ` Marcel Ziswiler
2024-12-10 16:13 ` Steven Rostedt
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2024-12-09 9:49 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
Sorry for the delay, I got laid low by snot monsters :/
On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:
> Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
> buffer any longer. So I did the following instead on the serial console terminal:
>
> tail -f /sys/kernel/debug/tracing/trace
>
> Not sure whether there is any better way to go about this. Plus even so we run the serial console at 1.5
> megabaud I am not fully sure whether it was able to keep up logging what you are looking for.
Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
might be of help. And yes, dumping trace buffers over 1m5 serial lines
is tedious -- been there done that, got a t-shirt and all that.
Still, let me see if perhaps making that WARN in enqueue_dl_entity()
return makes the whole thing less fatal.
I've included the traceoff_on_warning and ftrace_dump in the code, so
all you really need to still do is enable the stacktrace option.
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
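Putting the capture steps together (a minimal sketch; paths assume debugfs mounted
at /sys/kernel/debug, and the ftrace_dump_on_oops sysctl is optional since the
patch below already stops tracing and dumps the buffer on the WARN):

echo 1 > /sys/kernel/debug/tracing/options/stacktrace
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops    # optional: also dump on any oops
# run the stress test to failure, then save whatever survived:
cat /sys/kernel/debug/tracing/trace > ~/trace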
> Yes, and do not hesitate to ask for any additional information et. al. we are happy to help. Thanks!
Could I bother you to try again with the below patch?
There are two new hunks vs the previous one. The hunk in
enqueue_dl_entity() (the very last bit) will stop tracing and dump the
buffers when that condition is hit, in addition to then aborting the
double enqueue, hopefully leaving the system in a slightly better state.
The other new hunk is the one for dl_server_stop() (second hunk); while
going over the code last week, I found that this might be a possible
hole leading to the observed double enqueue, so fingers crossed.
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 33b4646f8b24..bd1df7612482 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
scoped_guard (rq_lock, rq) {
struct rq_flags *rf = &scope.rf;
+ if (dl_se == &rq->fair_server) {
+ trace_printk("timer fair server %d throttled %d\n",
+ cpu_of(rq), dl_se->dl_throttled);
+ }
+
if (!dl_se->dl_throttled || !dl_se->dl_runtime)
return HRTIMER_NORESTART;
@@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
void dl_server_stop(struct sched_dl_entity *dl_se)
{
+ if (current->dl_server == dl_se) {
+ struct rq *rq = rq_of_dl_se(dl_se);
+ trace_printk("stop fair server %d\n", cpu_of(rq));
+ current->dl_server = NULL;
+ }
+
if (!dl_se->dl_runtime)
return;
@@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
rq_lock(rq, &rf);
}
+ if (dl_se == &rq->fair_server)
+ trace_printk("inactive fair server %d\n", cpu_of(rq));
+
sched_clock_tick();
update_rq_clock(rq);
@@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("enqueue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
@@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
+ struct rq *rq = rq_of_dl_se(dl_se);
+
+ if (dl_se == &rq->fair_server) {
+ trace_printk("dequeue fair server %d h_nr_running %d\n",
+ cpu_of(rq), rq->cfs.h_nr_running);
+ }
if (RB_EMPTY_NODE(&dl_se->rb_node))
return;
@@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
static void
enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
- WARN_ON_ONCE(on_dl_rq(dl_se));
+ if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
+ tracing_off();
+ ftrace_dump(DUMP_ALL);
+ return;
+ }
update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-12-02 19:12 ` Phil Auld
@ 2024-12-09 13:11 ` Phil Auld
2024-12-09 15:06 ` Mike Galbraith
0 siblings, 1 reply; 277+ messages in thread
From: Phil Auld @ 2024-12-09 13:11 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
Hi Mike et al.,
On Mon, Dec 02, 2024 at 02:12:52PM -0500 Phil Auld wrote:
> On Mon, Dec 02, 2024 at 05:55:28PM +0100 Mike Galbraith wrote:
> > On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> > > On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> > >
> > >
> > > > Question: did wiping off the evil leave any meaningful goodness behind?
> > >
> > > Is that for this patch?
> >
> > Yeah. Trying it on my box with your write command line didn't improve
> > the confidence level either. My box has one CPU handling IRQs and
> > waking pinned workers to service 8 fio instances. Patch was useless
> > for that.
> >
>
> I'll give it a try. Our "box" is multiple different boxes but the results
> vary somewhat. The one I sent info about earlier in this thread is just
> one of the more egregious and is the one the perf team lent me for a while.
>
In our testing this has the same effect as the original dequeue-when-delayed
fix. It solves the randwrite issue and introduces the ~10-15% randread
regression.
Seems to be a real trade-off here. The same guys who benefit from spreading
in one case benefit from staying put in the other...
Cheers,
Phil
--
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
2024-12-09 13:11 ` Phil Auld
@ 2024-12-09 15:06 ` Mike Galbraith
0 siblings, 0 replies; 277+ messages in thread
From: Mike Galbraith @ 2024-12-09 15:06 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, kprateek.nayak, wuyun.abel, youssefesmat, tglx
On Mon, 2024-12-09 at 08:11 -0500, Phil Auld wrote:
>
> Hi Mike et al.,
>
> On Mon, Dec 02, 2024 at 02:12:52PM -0500 Phil Auld wrote:
> > On Mon, Dec 02, 2024 at 05:55:28PM +0100 Mike Galbraith wrote:
> > > On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> > > > On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> > > >
> > > >
> > > > > Question: did wiping off the evil leave any meaningful goodness behind?
> > > >
> > > > Is that for this patch?
> > >
> > > Yeah. Trying it on my box with your write command line didn't improve
> > > the confidence level either. My box has one CPU handling IRQs and
> > > waking pinned workers to service 8 fio instances. Patch was useless
> > > for that.
> > >
> >
> > I'll give it a try. Our "box" is multiple different boxes but the results
> > vary somewhat. The one I sent info about earlier in this thread is just
> > one of the more egregious and is the one the perf team lent me for a while.
> >
>
> In our testing this has the same effect as the original dequeue-when-delayed
> fix. It solves the randwrite issue and introduces the ~10-15% randread
> regression.
>
> Seems to be a real trade-off here. The same guys who benefit from spreading
> in one case benefit from staying put in the other...
Does as much harm as it does good isn't the mark of a keeper. Oh well.
-Mike
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
2024-11-28 10:58 ` Peter Zijlstra
@ 2024-12-10 8:45 ` Luis Machado
1 sibling, 0 replies; 277+ messages in thread
From: Luis Machado @ 2024-12-10 8:45 UTC (permalink / raw)
To: Marcel Ziswiler, Peter Zijlstra, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel
Cc: kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On 11/28/24 10:32, Marcel Ziswiler wrote:
> Hi all,
>
> On Sat, 2024-07-27 at 12:27 +0200, Peter Zijlstra wrote:
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF patches.
>> They've been sitting in my git tree for ever it seems, and people have been
>> testing it and sending fixes.
>>
>> I've spend the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on merging
>> in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>> - split up the huge delay-dequeue patch
>> - tested/fixed cfs-bandwidth
>> - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>> - SCHED_BATCH is equivalent to RESPECT_SLICE
>> - propagate min_slice up cgroups
>> - CLOCK_THREAD_DVFS_ID
>
> We found the following 7 commits from this patch set to crash in enqueue_dl_entity():
>
> 54a58a787791 sched/fair: Implement DELAY_ZERO
> 152e11f6df29 sched/fair: Implement delayed dequeue
> e1459a50ba31 sched: Teach dequeue_task() about special task states
> a1c446611e31 sched,freezer: Mark TASK_FROZEN special
> 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
> f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
> 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
>
> Resulting in the following crash dump (this is running v6.12.1):
>
> [ 14.652856] sched: DL replenish lagged too much
> [ 16.572706] ------------[ cut here ]------------
> [ 16.573115] WARNING: CPU: 5 PID: 912 at kernel/sched/deadline.c:1995 enqueue_dl_entity+0x46c/0x55c
> [ 16.573900] Modules linked in: overlay crct10dif_ce rk805_pwrkey snd_soc_es8316 pwm_fan
> phy_rockchip_naneng_combphy rockchip_saradc rtc_hym8563 industrialio_triggered_buffer kfifo_buf
> rockchip_thermal phy_rockchip_usbdp typec spi_rockchip_sfc snd_soc_rockchip_i2s_tdm hantro_vpu
> v4l2_vp9 v4l2_h264 v4l2_jpeg panthor v4l2_mem2mem rockchipdrm drm_gpuvm drm_exec drm_shmem_helper
> analogix_dp gpu_sched dw_mipi_dsi dw_hdmi cec drm_display_helper snd_soc_audio_graph_card
> snd_soc_simple_card_utils drm_dma_helper drm_kms_helper cfg80211 rfkill pci_endpoint_test drm
> backlight dm_mod dax
> [ 16.578350] CPU: 5 UID: 0 PID: 912 Comm: job10 Not tainted 6.12.1-dirty #15
> [ 16.578956] Hardware name: radxa Radxa ROCK 5B/Radxa ROCK 5B, BIOS 2024.10-rc3 10/01/2024
> [ 16.579667] pstate: 204000c9 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 16.580273] pc : enqueue_dl_entity+0x46c/0x55c
> [ 16.580661] lr : dl_server_start+0x44/0x12c
> [ 16.581028] sp : ffff80008002bc00
> [ 16.581318] x29: ffff80008002bc00 x28: dead000000000122 x27: 0000000000000000
> [ 16.581941] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000009
> [ 16.582563] x23: ffff33c976db0e40 x22: 0000000000000001 x21: 00000000002dc6c0
> [ 16.583186] x20: 0000000000000001 x19: ffff33c976db17a8 x18: 0000000000000000
> [ 16.583808] x17: ffff5dd9779ac000 x16: ffff800080028000 x15: 11c3485b851e0698
> [ 16.584430] x14: 11b4b257e4156000 x13: 0000000000000255 x12: 0000000000000000
> [ 16.585053] x11: ffff33c976db0ec0 x10: 0000000000000000 x9 : 0000000000000009
> [ 16.585674] x8 : 0000000000000005 x7 : ffff33c976db19a0 x6 : ffff33c78258b440
> [ 16.586296] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 16.586918] x2 : 0000000000000001 x1 : 0000000000000001 x0 : ffff33c798e112f0
> [ 16.587540] Call trace:
> [ 16.587754] enqueue_dl_entity+0x46c/0x55c
> [ 16.588113] dl_server_start+0x44/0x12c
> [ 16.588449] enqueue_task_fair+0x124/0x49c
> [ 16.588807] enqueue_task+0x3c/0xe0
> [ 16.589113] ttwu_do_activate.isra.0+0x6c/0x208
> [ 16.589511] try_to_wake_up+0x1d0/0x61c
> [ 16.589847] wake_up_process+0x18/0x24
> [ 16.590175] kick_pool+0x84/0x150
> [ 16.590467] __queue_work+0x2f4/0x544
> [ 16.590788] delayed_work_timer_fn+0x1c/0x28
> [ 16.591161] call_timer_fn+0x34/0x1ac
> [ 16.591481] __run_timer_base+0x20c/0x314
> [ 16.591832] run_timer_softirq+0x3c/0x78
> [ 16.592176] handle_softirqs+0x124/0x35c
> [ 16.592520] __do_softirq+0x14/0x20
> [ 16.592827] ____do_softirq+0x10/0x1c
> [ 16.593148] call_on_irq_stack+0x24/0x4c
> [ 16.593490] do_softirq_own_stack+0x1c/0x2c
> [ 16.593857] irq_exit_rcu+0x8c/0xc0
> [ 16.594163] el0_interrupt+0x48/0xbc
> [ 16.594477] __el0_irq_handler_common+0x18/0x24
> [ 16.594874] el0t_64_irq_handler+0x10/0x1c
> [ 16.595232] el0t_64_irq+0x190/0x194
> [ 16.595545] ---[ end trace 0000000000000000 ]---
> [ 16.595950] ------------[ cut here ]------------
Random piece of data, but I also had some difficulty making things boot on Android when trying
Vincent's nr_running accounting series due to a very similar crash/stack trace. Though what I
saw went a bit further and actually crashed within task_contending, called from
enqueue_dl_entity. Possibly crashed in one of the inlined functions.
Even though the kernel was 6.8 and these were backports, it seems awfully similar to the above.
>
> It looks like it is trying to enqueue an already queued deadline task. Full serial console log available [1].
>
> We are running the exact same scheduler stress test both on Intel NUCs as well as on RADXA ROCK 5B board farms.
> While so far we have not seen this on amd64, it crashes consistently/reproducibly on aarch64.
>
> We haven't had time to do a non-proprietary reproduction case as of yet, but I wanted to report our current
> findings and ask for any feedback/suggestions from the community.
>
> Thanks!
>
> Cheers
>
> Marcel
>
> [1] https://hastebin.skyra.pw/hoqesigaye.yaml
>
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-12-09 9:49 ` Peter Zijlstra
@ 2024-12-10 16:05 ` Marcel Ziswiler
0 siblings, 0 replies; 277+ messages in thread
From: Marcel Ziswiler @ 2024-12-10 16:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, kprateek.nayak,
wuyun.abel, youssefesmat, tglx, efault
On Mon, 2024-12-09 at 10:49 +0100, Peter Zijlstra wrote:
>
> Sorry for the delay, I got laid low by snot monsters :/
>
> On Mon, Dec 02, 2024 at 07:46:21PM +0100, Marcel Ziswiler wrote:
>
> > Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the
> > trace
> > buffer any longer. So I did the following instead on the serial console terminal:
> >
> > tail -f /sys/kernel/debug/tracing/trace
> >
> > Not sure whether there is any better way to go about this. Plus even so we run the serial console at 1.5
> > megabaud I am not fully sure whether it was able to keep up logging what you are looking for.
>
> Ah, that is unfortunate. There is a ftrace_dump_on_oops option that
> might be of help. And yes, dumping trace buffers over 1m5 serial lines
> is tedious -- been there done that, got a t-shirt and all that.
>
> Still, let me see if perhaps making that WARN in enqueue_dl_entity()
> return makes the whole thing less fatal.
>
> I've included the traceoff_on_warning and ftrace_dump in the code, so
> all you really need to still do is enable the stacktrace option.
>
> echo 1 > /sys/kernel/debug/tracing/options/stacktrace
>
> > Yes, and do not hesitate to ask for any additional information et. al. we are happy to help. Thanks!
>
> Could I bother you to try again with the below patch?
Sure, here you go.
https://drive.codethink.co.uk/s/HniZCtccDBMHpAK
> There are two new hunks vs the previous one, the hunk in
> enqueue_dl_entity() (the very last bit) will stop tracing and dump the
> buffers when that condition is hit in addition to then aborting the
> double enqueue, hopefully leaving the system is a slightly better state.
>
> The other new hunk is the one for dl_server_stop() (second hunk), while
> going over the code last week, I found that this might be a possible
> hole leading to the observed double enqueue, so fingers crossed.
>
> ---
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 33b4646f8b24..bd1df7612482 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1223,6 +1223,11 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> scoped_guard (rq_lock, rq) {
> struct rq_flags *rf = &scope.rf;
>
> + if (dl_se == &rq->fair_server) {
> + trace_printk("timer fair server %d throttled %d\n",
> + cpu_of(rq), dl_se->dl_throttled);
> + }
> +
> if (!dl_se->dl_throttled || !dl_se->dl_runtime)
> return HRTIMER_NORESTART;
>
> @@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>
> void dl_server_stop(struct sched_dl_entity *dl_se)
> {
> + if (current->dl_server == dl_se) {
> + struct rq *rq = rq_of_dl_se(dl_se);
> + trace_printk("stop fair server %d\n", cpu_of(rq));
> + current->dl_server = NULL;
> + }
> +
> if (!dl_se->dl_runtime)
> return;
>
> @@ -1792,6 +1803,9 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
> rq_lock(rq, &rf);
> }
>
> + if (dl_se == &rq->fair_server)
> + trace_printk("inactive fair server %d\n", cpu_of(rq));
> +
> sched_clock_tick();
> update_rq_clock(rq);
>
> @@ -1987,6 +2001,12 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
> static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("enqueue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> @@ -1998,6 +2018,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> + struct rq *rq = rq_of_dl_se(dl_se);
> +
> + if (dl_se == &rq->fair_server) {
> + trace_printk("dequeue fair server %d h_nr_running %d\n",
> + cpu_of(rq), rq->cfs.h_nr_running);
> + }
>
> if (RB_EMPTY_NODE(&dl_se->rb_node))
> return;
> @@ -2012,7 +2038,11 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> static void
> enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
> {
> - WARN_ON_ONCE(on_dl_rq(dl_se));
> + if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
> + tracing_off();
> + ftrace_dump(DUMP_ALL);
> + return;
> + }
>
> update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-12-02 18:46 ` Marcel Ziswiler
2024-12-09 9:49 ` Peter Zijlstra
@ 2024-12-10 16:13 ` Steven Rostedt
1 sibling, 0 replies; 277+ messages in thread
From: Steven Rostedt @ 2024-12-10 16:13 UTC (permalink / raw)
To: Marcel Ziswiler
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel,
kprateek.nayak, wuyun.abel, youssefesmat, tglx, efault
On Mon, 02 Dec 2024 19:46:21 +0100
Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> wrote:
> Unfortunately, once I trigger the failure the system is completely dead and won't allow me to dump the trace
> buffer any longer. So I did the following instead on the serial console terminal:
>
> tail -f /sys/kernel/debug/tracing/trace
>
> Not sure whether there is any better way to go about this. Plus even so we run the serial console at 1.5
> megabaud I am not fully sure whether it was able to keep up logging what you are looking for.
If the memory of the machine is persistent (it is on several of my
machines) you can use the persistent ring buffer.
Add to the kernel command line (or enable bootconfig that attaches a
command line to the kernel if you can't change the parameters):
reserve_mem=20M:12M:trace trace_instance=boot_map^traceoff^traceprintk@trace
The above will create a "boot_map" instance with tracing off on boot
and trace_printk() going to it. Start tracing:
trace-cmd start -B boot_map -p nop
Or replace "-p nop" with any events or tracers you want, including
function tracing. Then, after a crash:
trace-cmd show -B boot_map
If the memory is persistent and you don't use KASLR (you may want to also
add nokaslr if arm64 supports KASLR and you use it), you should get
the trace right up to the crash.
See Documentation/trace/debugging.rst for more details.
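Put together, the flow sketched above looks roughly like this (a minimal sketch;
the reserve_mem size/alignment values are just the ones from the example command
line above):

# kernel command line (or attached via bootconfig):
#   reserve_mem=20M:12M:trace trace_instance=boot_map^traceoff^traceprintk@trace
trace-cmd start -B boot_map -p nop    # arm the boot_map instance (or pick events/tracers)
# ... reproduce the crash, reboot ...
trace-cmd show -B boot_map            # read back the trace that survived the crash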
-- Steve
^ permalink raw reply [flat|nested] 277+ messages in thread
* [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
@ 2024-12-29 22:51 Doug Smythies
2025-01-06 11:57 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2024-12-29 22:51 UTC (permalink / raw)
To: peterz; +Cc: Doug Smythies, linux-kernel, vincent.guittot
[-- Attachment #1: Type: text/plain, Size: 2976 bytes --]
Hi Peter,
I have been having trouble with turbostat reporting processor package power levels that cannot possibly be true.
After eliminating the turbostat program itself as the source of the issue, I bisected the kernel.
An edited summary (actual log attached):
82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
b10 bad fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
b13 bad 54a58a787791 sched/fair: Implement DELAY_ZERO
skip 152e11f6df29 sched/fair: Implement delayed dequeue
skip e1459a50ba31 sched: Teach dequeue_task() about special task states
skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
Where "bN" is just my assigned kernel name for each bisection step.
In the linux-kernel email archives I found a thread that isolated these same commits.
It was from late November / early December:
https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
An example of the turbostat manifestation of the issue:
doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show
Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
[sudo] password for doug:
Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
99.76 4800 4104 12304 73 80.08
99.76 4800 4104 12047 73 80.23
99.76 4800 879 12157 73 11.40
99.76 4800 26667 84214 72 557.23
99.76 4800 4104 12036 72 79.39
Where TSC_MHz was reported as 879, there was a big gap in time.
Like 4.7 seconds instead of 1.
Where TSC_MHz was reported as 26667, there was not a big gap in time.
It happens for about 5% of the samples + or - a lot.
It only happens when the workload is almost exactly 100%.
More load, it doesn't occur.
Less load, it doesn't occur. Although, I did get this once:
Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
91.46 4800 4104 11348 73 103.98
91.46 4800 4104 11353 73 103.89
91.50 4800 3903 11339 73 98.16
91.43 4800 4271 12001 73 108.52
91.45 4800 4148 11481 73 105.13
91.46 4800 4104 11341 73 103.96
91.46 4800 4104 11348 73 103.99
So, it might just be much less probable and less severe.
It happens over many different types of workload that I have tried.
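A crude way to pick those samples out automatically (a minimal sketch, not part of
turbostat; it assumes the column order shown above, with TSC_MHz in column 3, and
the ~4104 MHz nominal value from this machine):

sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1 2>&1 |
awk -v nom=4104 '$3 + 0 > 0 && ($3 < 0.9 * nom || $3 > 1.1 * nom)'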
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
6 cores. 2 thread per core, 12 CPUs.
OS: Ubuntu 24.04.1 LTS (server, no GUI)
... Doug
[-- Attachment #2: bisect-log.txt --]
[-- Type: text/plain, Size: 5985 bytes --]
doug@s19:~/kernel/linux$ git bisect bad
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
781773e3b68031bd001c0c18aa72e8470c225ebd
a1c446611e31ca5363d4db51e398271da1dce0af
e1459a50ba31831efdfc35278023d959e4ba775b
f12e148892ede8d9ee82bcd3e469e6d01fc077ac
152e11f6df293e816a6a37c69757033cdc72667d
2e0199df252a536a03f4cb0810324dff523d1e79
54a58a78779169f9c92a51facf6de7ce94962328
We cannot bisect more!
doug@s19:~/kernel/linux$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [98f7e32f20d28ec452afb208f9cffc08448a2652] Linux 6.11
git bisect good 98f7e32f20d28ec452afb208f9cffc08448a2652
# status: waiting for bad commit, 1 good commit known
# bad: [9852d85ec9d492ebef56dc5f229416c925758edc] Linux 6.12-rc1
git bisect bad 9852d85ec9d492ebef56dc5f229416c925758edc
# good: [176000734ee2978121fde22a954eb1eabb204329] Merge tag 'ata-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
git bisect good 176000734ee2978121fde22a954eb1eabb204329
# bad: [d0359e4ca0f26aaf3118124dfb562e3b3dca1c06] Merge tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad d0359e4ca0f26aaf3118124dfb562e3b3dca1c06
# bad: [171754c3808214d4fd8843eab584599a429deb52] Merge tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
git bisect bad 171754c3808214d4fd8843eab584599a429deb52
# good: [e55ef65510a401862b902dc979441ea10ae25c61] Merge tag 'amd-drm-next-6.12-2024-08-26' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good e55ef65510a401862b902dc979441ea10ae25c61
# good: [32bd3eb5fbab954e68adba8c0b6a43cf03605c93] Merge tag 'drm-intel-gt-next-2024-09-06' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
git bisect good 32bd3eb5fbab954e68adba8c0b6a43cf03605c93
# good: [726e2d0cf2bbc14e3bf38491cddda1a56fe18663] Merge tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping
git bisect good 726e2d0cf2bbc14e3bf38491cddda1a56fe18663
# good: [839c4f596f898edc424070dc8b517381572f8502] Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect good 839c4f596f898edc424070dc8b517381572f8502
# bad: [bd9bbc96e8356886971317f57994247ca491dbf1] sched: Rework dl_server
git bisect bad bd9bbc96e8356886971317f57994247ca491dbf1
# good: [863ccdbb918a77e3f011571f943020bf7f0b114b] sched: Allow sched_class::dequeue_task() to fail
git bisect good 863ccdbb918a77e3f011571f943020bf7f0b114b
# bad: [fc1892becd5672f52329a75c73117b60ac7841b7] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
git bisect bad fc1892becd5672f52329a75c73117b60ac7841b7
# skip: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
git bisect skip 2e0199df252a536a03f4cb0810324dff523d1e79
# skip: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
git bisect skip f12e148892ede8d9ee82bcd3e469e6d01fc077ac
# skip: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
git bisect skip e1459a50ba31831efdfc35278023d959e4ba775b
# skip: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
git bisect skip 781773e3b68031bd001c0c18aa72e8470c225ebd
# good: [abc158c82ae555078aa5dd2d8407c3df0f868904] sched: Prepare generic code for delayed dequeue
git bisect good abc158c82ae555078aa5dd2d8407c3df0f868904
# skip: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
git bisect skip a1c446611e31ca5363d4db51e398271da1dce0af
# good: [e28b5f8bda01720b5ce8456b48cf4b963f9a80a1] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
git bisect good e28b5f8bda01720b5ce8456b48cf4b963f9a80a1
# skip: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
git bisect skip 152e11f6df293e816a6a37c69757033cdc72667d
# bad: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
git bisect bad 54a58a78779169f9c92a51facf6de7ce94962328
# only skipped commits left to test
# possible first bad commit: [54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
# possible first bad commit: [152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
# possible first bad commit: [e1459a50ba31831efdfc35278023d959e4ba775b] sched: Teach dequeue_task() about special task states
# possible first bad commit: [a1c446611e31ca5363d4db51e398271da1dce0af] sched,freezer: Mark TASK_FROZEN special
# possible first bad commit: [781773e3b68031bd001c0c18aa72e8470c225ebd] sched/fair: Implement ENQUEUE_DELAYED
# possible first bad commit: [f12e148892ede8d9ee82bcd3e469e6d01fc077ac] sched/fair: Prepare pick_next_task() for delayed dequeue
# possible first bad commit: [2e0199df252a536a03f4cb0810324dff523d1e79] sched/fair: Prepare exit/cleanup paths for delayed_dequeue
doug@s19:~/kernel/linux$ git log --oneline | grep -B 2 -A 10 54a58a78779
82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
b10 bad fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
b13 bad 54a58a787791 sched/fair: Implement DELAY_ZERO
skip 152e11f6df29 sched/fair: Implement delayed dequeue
skip e1459a50ba31 sched: Teach dequeue_task() about special task states
skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2024-12-29 22:51 Doug Smythies
@ 2025-01-06 11:57 ` Peter Zijlstra
2025-01-06 15:01 ` Doug Smythies
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-06 11:57 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
> An edited summary (actual log attached):
>
> 82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
> b10 bad fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
> b13 bad 54a58a787791 sched/fair: Implement DELAY_ZERO
> skip 152e11f6df29 sched/fair: Implement delayed dequeue
> skip e1459a50ba31 sched: Teach dequeue_task() about special task states
> skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
> skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
> skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
> skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
> b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
> dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
> b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
> e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
>
> Where "bN" is just my assigned kernel name for each bisection step.
>
> In the linux-kernel email archives I found a thread that isolated these same commits.
> It was from late Novermebr / early December:
>
> https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
>
> An example of the turbostat manifestation of the issue:
>
> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show
> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> [sudo] password for doug:
> Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
> 99.76 4800 4104 12304 73 80.08
> 99.76 4800 4104 12047 73 80.23
> 99.76 4800 879 12157 73 11.40
> 99.76 4800 26667 84214 72 557.23
> 99.76 4800 4104 12036 72 79.39
>
> Where TSC_MHz was reported as 879, there was a big gap in time.
> Like 4.7 seconds instead of 1.
> Where TSC_MHz was reported as 26667, there was not a big gap in time.
>
> It happens for about 5% of the samples + or - a lot.
> It only happens when the workload is almost exactly 100%.
> More load, it doesn't occur.
> Less load, it doesn't occur. Although, I did get this once:
>
> Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
> 91.46 4800 4104 11348 73 103.98
> 91.46 4800 4104 11353 73 103.89
> 91.50 4800 3903 11339 73 98.16
> 91.43 4800 4271 12001 73 108.52
> 91.45 4800 4148 11481 73 105.13
> 91.46 4800 4104 11341 73 103.96
> 91.46 4800 4104 11348 73 103.99
>
> So, it might just be much less probable and less severe.
>
> It happens over many different types of workload that I have tried.
In private email you've communicated it happens due to
sched_setaffinity() sometimes taking multiple seconds.
I'm trying to reproduce by starting a bash 'while :; do :; done' spinner
for each CPU, but so far am not able to reproduce.
What is the easiest 100% load you're seeing this with?
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 11:57 ` Peter Zijlstra
@ 2025-01-06 15:01 ` Doug Smythies
2025-01-06 16:59 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-06 15:01 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
On 2025.01.06 03:58 Peter Zijlstra wrote:
>On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
>> Hi Peter,
>>
>> I have been having trouble with turbostat reporting processor package power levels that can not possibly be true.
>> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
>> An edited summary (actual log attached):
>>
>> 82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
>> b10 bad fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
>> b13 bad 54a58a787791 sched/fair: Implement DELAY_ZERO
>> skip 152e11f6df29 sched/fair: Implement delayed dequeue
>> skip e1459a50ba31 sched: Teach dequeue_task() about special task states
>> skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
>> skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
>> skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
>> skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
>> b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
>> dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
>> b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
>> e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
>>
>> Where "bN" is just my assigned kernel name for each bisection step.
>>
>> In the linux-kernel email archives I found a thread that isolated these same commits.
>> It was from late Novermebr / early December:
>>
>> https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m9aeb4d897e029cf7546513bb09499c320457c174
>>
>> An example of the turbostat manifestation of the issue:
>>
>> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show
>> Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>> [sudo] password for doug:
>> Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
>> 99.76 4800 4104 12304 73 80.08
>> 99.76 4800 4104 12047 73 80.23
>> 99.76 4800 879 12157 73 11.40
>> 99.76 4800 26667 84214 72 557.23
>> 99.76 4800 4104 12036 72 79.39
>>
>> Where TSC_MHz was reported as 879, there was a big gap in time.
>> Like 4.7 seconds instead of 1.
>> Where TSC_MHz was reported as 26667, there was not a big gap in time.
>>
>> It happens for about 5% of the samples + or - a lot.
>> It only happens when the workload is almost exactly 100%.
>> More load, it doesn't occur.
>> Less load, it doesn't occur. Although, I did get this once:
>>
>> Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
>> 91.46 4800 4104 11348 73 103.98
>> 91.46 4800 4104 11353 73 103.89
>> 91.50 4800 3903 11339 73 98.16
>> 91.43 4800 4271 12001 73 108.52
>> 91.45 4800 4148 11481 73 105.13
>> 91.46 4800 4104 11341 73 103.96
>> 91.46 4800 4104 11348 73 103.99
>>
>> So, it might just be much less probable and less severe.
>>
>> It happens over many different types of workload that I have tried.
>
> In private email you've communicated it happens due to
> sched_setaffinity() sometimes taking multiple seconds.
>
> I'm trying to reproduce by starting a bash 'while ;: do :; done' spinner
> for each CPU, but so far am not able to reproduce.
I have also been trying to reproduce the issue without using turbostat.
No success.
The other thing to note is that my test computer is otherwise very
very idle with no GUI and few services.
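One shell-level way to exercise the same sched_setaffinity() path turbostat uses,
without turbostat itself (a minimal, hypothetical sketch; taskset issues
sched_setaffinity() on the target pid, and the 0-11 range matches the 12 CPUs here):

for cpu in $(seq 0 11); do
        start=$(date +%s%N)
        taskset -pc $cpu $$ > /dev/null
        end=$(date +%s%N)
        echo "migrate to CPU $cpu took $(( (end - start) / 1000000 )) ms"
done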
>
> What is the easiest 100% load you're seeing this with?
Lately, and specifically to be able to tell others, I have been using:
yes > /dev/null &
On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
I run 12 of those work loads.
... Doug
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 15:01 ` Doug Smythies
@ 2025-01-06 16:59 ` Peter Zijlstra
2025-01-06 17:04 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-06 16:59 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
> > What is the easiest 100% load you're seeing this with?
>
> Lately, and specifically to be able to tell others, I have been using:
>
> yes > /dev/null &
>
> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> I run 12 of those work loads.
On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
do:
for ((i=0; i<40; i++)) ; do yes > /dev/null & done
tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
But no so far, nada :-( I've tried with full preemption and voluntary,
HZ=1000.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 16:59 ` Peter Zijlstra
@ 2025-01-06 17:04 ` Peter Zijlstra
2025-01-06 17:14 ` Peter Zijlstra
2025-01-06 22:28 ` Doug Smythies
0 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-06 17:04 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>
> > > What is the easiest 100% load you're seeing this with?
> >
> > Lately, and specifically to be able to tell others, I have been using:
> >
> > yes > /dev/null &
> >
> > On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> > I run 12 of those work loads.
>
> On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
> do:
>
> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>
> But no so far, nada :-( I've tried with full preemption and voluntary,
> HZ=1000.
>
And just as I send this, I see these happen:
100.00 3100 2793 40302 71 195.22
100.00 3100 2618 40459 72 183.58
100.00 3100 2993 46215 71 209.21
100.00 3100 2789 40467 71 195.19
99.92 3100 2798 40589 71 195.76
100.00 3100 2793 40397 72 195.46
...
100.00 3100 2844 41906 71 199.43
100.00 3100 2779 40468 71 194.51
99.96 3100 2320 40933 71 163.23
100.00 3100 3529 61823 72 245.70
100.00 3100 2793 40493 72 195.45
100.00 3100 2793 40462 72 195.56
They look like funny little blips. Nowhere near as bad as you had
though.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 17:04 ` Peter Zijlstra
@ 2025-01-06 17:14 ` Peter Zijlstra
2025-01-07 1:24 ` Doug Smythies
2025-01-06 22:28 ` Doug Smythies
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-06 17:14 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Mon, Jan 06, 2025 at 06:04:55PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
> >
> > > > What is the easiest 100% load you're seeing this with?
> > >
> > > Lately, and specifically to be able to tell others, I have been using:
> > >
> > > yes > /dev/null &
> > >
> > > On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
> > > I run 12 of those work loads.
> >
> > On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
> > do:
> >
> > for ((i=0; i<40; i++)) ; do yes > /dev/null & done
> > tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> >
> > But no so far, nada :-( I've tried with full preemption and voluntary,
> > HZ=1000.
> >
>
> And just as I send this, I see these happen:
>
> 100.00 3100 2793 40302 71 195.22
> 100.00 3100 2618 40459 72 183.58
> 100.00 3100 2993 46215 71 209.21
> 100.00 3100 2789 40467 71 195.19
> 99.92 3100 2798 40589 71 195.76
> 100.00 3100 2793 40397 72 195.46
> ...
> 100.00 3100 2844 41906 71 199.43
> 100.00 3100 2779 40468 71 194.51
> 99.96 3100 2320 40933 71 163.23
> 100.00 3100 3529 61823 72 245.70
> 100.00 3100 2793 40493 72 195.45
> 100.00 3100 2793 40462 72 195.56
>
> They look like funny little blips. Nowhere near as bad as you had
> though.
Anyway, given you've confirmed disabling DELAY_DEQUEUE fixes things,
could you perhaps try the below hackery for me? It's a bit of a wild
guess, but throw stuff at the wall, see what sticks, etc.
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..fa4b9891f93a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3019,7 +3019,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
} else {
if (!is_migration_disabled(p)) {
- if (task_on_rq_queued(p))
+ if (task_on_rq_queued(p) && !p->se.sched_delayed)
rq = move_queued_task(rq, rf, p, dest_cpu);
if (!pending->stop_pending) {
@@ -3776,28 +3776,30 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
+ CLASS(__task_rq_lock, rq_guard)(p);
+ struct rq *rq = rq_guard.rq;
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
- /*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
- */
- wakeup_preempt(rq, p, wake_flags);
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ update_rq_clock(rq);
+ if (p->se.sched_delayed) {
+ int queue_flags = ENQUEUE_NOCLOCK | ENQUEUE_DELAYED;
+ if (!is_cpu_allowed(p, cpu_of(rq))) {
+ dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
+ enqueue_task(rq, p, queue_flags);
}
- __task_rq_unlock(rq, &rf);
-
- return ret;
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
+ }
+ ttwu_do_wakeup(p);
+ return 1;
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65fa64845d9f..b4c1f6c06c18 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1793,6 +1793,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+ _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+ __task_rq_unlock(_T->rq, &_T->rf),
+ struct rq *rq; struct rq_flags rf)
+
DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
_T->rq = task_rq_lock(_T->lock, &_T->rf),
task_rq_unlock(_T->rq, _T->lock, &_T->rf),
^ permalink raw reply related [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 17:04 ` Peter Zijlstra
2025-01-06 17:14 ` Peter Zijlstra
@ 2025-01-06 22:28 ` Doug Smythies
2025-01-07 11:26 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-06 22:28 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
[-- Attachment #1: Type: text/plain, Size: 4909 bytes --]
On 2025.01.06 09:05 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>>
>>>> What is the easiest 100% load you're seeing this with?
>>>
>>> Lately, and specifically to be able to tell others, I have been using:
>>>
>>> yes > /dev/null &
>>>
>>> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
>>> I run 12 of those work loads.
>>
>> On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
>> do:
>>
>> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
>> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>>
>> But not so far, nada :-( I've tried with full preemption and voluntary,
>> HZ=1000.
My HZ=1000 also. And: CONFIG_NO_HZ_FULL=y
>
> And just as I send this, I see these happen:
>
> 100.00 3100 2793 40302 71 195.22
> 100.00 3100 2618 40459 72 183.58
> 100.00 3100 2993 46215 71 209.21
> 100.00 3100 2789 40467 71 195.19
> 99.92 3100 2798 40589 71 195.76
> 100.00 3100 2793 40397 72 195.46
> ...
> 100.00 3100 2844 41906 71 199.43
> 100.00 3100 2779 40468 71 194.51
> 99.96 3100 2320 40933 71 163.23
> 100.00 3100 3529 61823 72 245.70
> 100.00 3100 2793 40493 72 195.45
> 100.00 3100 2793 40462 72 195.56
>
> They look like funny little blips. Nowhere near as bad as you had
> though.
Yes, I get a lot of the lesser magnitude ones.
The large magnitude ones are very much a function of what else is running.
If I just add a 0.5% load at a 73 hertz work/sleep frequency, then over a 2 hour and
31 minute test I got a maximum interval time of 1.68 seconds.
Without that small perturbation I got tons of interval times of 7 seconds,
meaning the regular 1 second interval plus 6 seconds for the CPU migration.
Since I can not seem to function without making a graph, some example graphs
are attached.
By the way, and to make it easier to go away while tests run, I am now using this
turbostat command:
doug@s19:~/kernel/linux/tools/power/x86/turbostat$ sudo ./turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval
1 | grep "^[1-9]"
6005701 1736201357.014741 - 99.76 12034
177731 1736201386.221771 - 99.76 12034
6003699 1736201393.226740 - 99.76 14167
6003704 1736201422.253743 - 99.76 12040
6005700 1736201447.278740 - 99.76 12030
311699 1736201475.816740 - 99.76 12033
Which will show when a CPU migration took over 10 milliseconds: the usec
column is printed with a "%5ld" format, so only values of 10000 microseconds
(10 ms) or more fill the field and start with a digit that grep "^[1-9]" matches.
If you want to go further, for example to only display ones that took
over a second and to include the target CPU, then patch turbostat:
doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 58a487c225a7..f8a73cc8fbfc 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
struct timeval tv;
timersub(&t->tv_end, &t->tv_begin, &tv);
- outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
+ outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
}
/* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
@@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
int i;
int status;
+ gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
+
if (cpu_migrate(cpu)) {
fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
return -1;
}
- gettimeofday(&t->tv_begin, (struct timezone *)NULL);
+// gettimeofday(&t->tv_begin, (struct timezone *)NULL);
if (first_counter_read)
get_apic_id(t);
Example output:
sudo ./turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1 | grep "^[1-9]"
1330709 1736202049.987740 - 99.76 12040
1330603 1736202049.987740 11 99.76 1003
6008710 1736202068.008741 - 99.76 12030
6008601 1736202068.008741 11 99.76 1003
2003709 1736202120.936740 - 99.76 12028
2003603 1736202120.936740 11 99.76 1002
6005710 1736202140.956741 - 99.76 12028
6005604 1736202140.956741 11 99.76 1002
In this short example all captures were for the CPU 5 to 11 migration.
2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.
I'll try, and report on, your test patch from the other email later.
[-- Attachment #2: turbostat-sampling-issue-seconds.png --]
[-- Type: image/png, Size: 32233 bytes --]
[-- Attachment #3: turbostat-sampling-issue-seconds-detail-a.png --]
[-- Type: image/png, Size: 42780 bytes --]
[-- Attachment #4: turbostat-sampling-issue-seconds-detail-b.png --]
[-- Type: image/png, Size: 87699 bytes --]
^ permalink raw reply related [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 17:14 ` Peter Zijlstra
@ 2025-01-07 1:24 ` Doug Smythies
2025-01-07 10:49 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-07 1:24 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
On 2025.01.06 09:14 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 06:04:55PM +0100, Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 05:59:32PM +0100, Peter Zijlstra wrote:
>>> On Mon, Jan 06, 2025 at 07:01:34AM -0800, Doug Smythies wrote:
>>>
>>>>> What is the easiest 100% load you're seeing this with?
>>>>
>>>> Lately, and specifically to be able to tell others, I have been using:
>>>>
>>>> yes > /dev/null &
>>>>
>>>> On my Intel i5-10600K, with 6 cores and 2 threads per core, 12 CPUs,
>>>> I run 12 of those work loads.
>>>
>>> On my headless ivb-ep 2 sockets, 10 cores each and 2 threads per core, I
>>> do:
>>>
>>> for ((i=0; i<40; i++)) ; do yes > /dev/null & done
>>> tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
>>>
>>> But not so far, nada :-( I've tried with full preemption and voluntary,
>>> HZ=1000.
>>>
>>
>> And just as I send this, I see these happen:
>>
>> 100.00 3100 2793 40302 71 195.22
>> 100.00 3100 2618 40459 72 183.58
>> 100.00 3100 2993 46215 71 209.21
>> 100.00 3100 2789 40467 71 195.19
>> 99.92 3100 2798 40589 71 195.76
>> 100.00 3100 2793 40397 72 195.46
>> ...
>> 100.00 3100 2844 41906 71 199.43
>> 100.00 3100 2779 40468 71 194.51
>> 99.96 3100 2320 40933 71 163.23
>> 100.00 3100 3529 61823 72 245.70
>> 100.00 3100 2793 40493 72 195.45
>> 100.00 3100 2793 40462 72 195.56
>>
>> They look like funny little blips. Nowhere near as bad as you had
>> though.
>
> Anyway, given you've confirmed disabling DELAY_DEQUEUE fixes things,
> could you perhaps try the below hackery for me? It's a bit of a wild
> guess, but throw stuff at the wall, see what sticks etc..
>
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 84902936a620..fa4b9891f93a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3019,7 +3019,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
> } else {
>
> if (!is_migration_disabled(p)) {
> - if (task_on_rq_queued(p))
> + if (task_on_rq_queued(p) && !p->se.sched_delayed)
> rq = move_queued_task(rq, rf, p, dest_cpu);
>
> if (!pending->stop_pending) {
> @@ -3776,28 +3776,30 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> + CLASS(__task_rq_lock, rq_guard)(p);
> + struct rq *rq = rq_guard.rq;
>
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> - /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> - */
> - wakeup_preempt(rq, p, wake_flags);
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + update_rq_clock(rq);
> + if (p->se.sched_delayed) {
> + int queue_flags = ENQUEUE_NOCLOCK | ENQUEUE_DELAYED;
> + if (!is_cpu_allowed(p, cpu_of(rq))) {
> + dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> + enqueue_task(rq, p, queue_flags);
> }
> - __task_rq_unlock(rq, &rf);
> -
> - return ret;
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> + }
> + ttwu_do_wakeup(p);
> + return 1;
> }
>
> #ifdef CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 65fa64845d9f..b4c1f6c06c18 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1793,6 +1793,11 @@ task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
> raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> + _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> + __task_rq_unlock(_T->rq, &_T->rf),
> + struct rq *rq; struct rq_flags rf)
> +
> DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> _T->rq = task_rq_lock(_T->lock, &_T->rf),
> task_rq_unlock(_T->rq, _T->lock, &_T->rf),
I tried the patch on top of kernel 6.13-rc6.
It did not fix the issue.
I used my patched version of turbostat as per the previous email,
so that I could see which CPU and the CPU migration time.
CPU migration times >= 10 milliseconds are listed.
Results:
doug@s19:~$ date
Mon Jan 6 04:37:58 PM PST 2025
doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval
1 | grep -v \- | grep -e "^[1-9]" -e "^ [1-9]" -e "^ [1-9]"
usec Time_Of_Day_Seconds CPU Busy% IRQ
16599 1736210307.324843 11 99.76 1004
6003601 1736210314.329844 11 99.76 1018
1164604 1736210330.509843 11 99.76 1003
6003604 1736210347.524844 11 99.76 1005
23602 1736210369.570843 11 99.76 1003
161680 1736210384.748843 7 99.76 1002
5750600 1736210398.507843 11 99.76 1005
6003607 1736210478.587844 11 99.76 1002
210645 1736210479.799843 3 99.76 7017
22602 1736210495.838843 11 99.76 1002
6003390 1736210520.861844 11 99.76 1002
108627 1736210534.984843 10 99.76 1002
23604 1736210570.047843 11 99.76 1003
6004604 1736210600.076843 11 99.76 1003
1895606 1736210606.977843 11 99.76 1002
3110603 1736210745.226843 11 99.76 1003
6003606 1736210765.244844 11 99.76 1002
6003605 1736210785.262843 11 99.76 1002
401642 1736210847.732843 9 99.76 1002
6003604 1736210891.781843 11 99.76 1003
6003607 1736210914.802844 11 99.76 1002
6003605 1736210945.831843 11 99.76 1002
5579609 1736210968.428848 11 99.76 1002
6003600 1736210975.433844 11 99.76 6585
93623 1736210985.537843 10 99.76 1003
5005605 1736210994.547843 11 99.76 1003
2654601 1736211029.244843 11 99.76 1004
17604 1736211057.290843 11 99.76 1003
23598 1736211077.334843 11 99.76 1006
114671 1736211079.451843 2 99.76 1003
6003603 1736211105.475843 11 99.76 1002
^Cdoug@s19:~$ date
Mon Jan 6 04:52:18 PM PST 2025
doug@s19:~$ uname -a
Linux s19 6.13.0-rc6-peterz #1320 SMP PREEMPT_DYNAMIC Mon Jan 6 16:25:39 PST 2025 x86_64 x86_64 x86_64 GNU/Linux
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-07 1:24 ` Doug Smythies
@ 2025-01-07 10:49 ` Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-07 10:49 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Mon, Jan 06, 2025 at 05:24:44PM -0800, Doug Smythies wrote:
> I tried the patch on top of kernel 6.13-rc6.
> It did not fix the issue.
Oh well, it was a long shot anyway.
I'll try and make it reproduce again.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-06 22:28 ` Doug Smythies
@ 2025-01-07 11:26 ` Peter Zijlstra
2025-01-07 15:04 ` Doug Smythies
2025-01-07 19:23 ` Peter Zijlstra
0 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-07 11:26 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote:
> Which will show when a CPU migration took over 10 milliseconds.
> If you want to go further, for example to only display ones that took
> over a second and to include the target CPU, then patch turbostat:
>
> doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff
> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> index 58a487c225a7..f8a73cc8fbfc 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
> struct timeval tv;
>
> timersub(&t->tv_end, &t->tv_begin, &tv);
> - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
> + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
> }
>
> /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
> @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
> int i;
> int status;
>
> + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
> +
> if (cpu_migrate(cpu)) {
> fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
> return -1;
> }
>
> - gettimeofday(&t->tv_begin, (struct timezone *)NULL);
> +// gettimeofday(&t->tv_begin, (struct timezone *)NULL);
>
> if (first_counter_read)
> get_apic_id(t);
>
>
So I've taken the second node offline, running with 10 cores (20
threads) now.
usec Time_Of_Day_Seconds CPU Busy% IRQ
106783 1736248404.951438 - 100.00 20119
46 1736248404.844701 0 100.00 1005
41 1736248404.844742 20 100.00 1007
42 1736248404.844784 1 100.00 1005
40 1736248404.844824 21 100.00 1006
41 1736248404.844865 2 100.00 1005
40 1736248404.844905 22 100.00 1006
41 1736248404.844946 3 100.00 1006
40 1736248404.844986 23 100.00 1005
41 1736248404.845027 4 100.00 1005
40 1736248404.845067 24 100.00 1006
41 1736248404.845108 5 100.00 1011
40 1736248404.845149 25 100.00 1005
41 1736248404.845190 6 100.00 1005
40 1736248404.845230 26 100.00 1005
42 1736248404.845272 7 100.00 1007
41 1736248404.845313 27 100.00 1005
41 1736248404.845355 8 100.00 1005
42 1736248404.845397 28 100.00 1006
46 1736248404.845443 9 100.00 1009
105995 1736248404.951438 29 100.00 1005
Is by far the worst I've had in the past few minutes playing with this.
If I get a blip (>10000) then it is always on the last CPU, are you
seeing the same thing?
> In this short example all captures were for the CPU 5 to 11 migration.
> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.
This seems to suggest you are, always on CPU 11.
Weird!
Anyway, let me see if I can capture a trace of this..
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-07 11:26 ` Peter Zijlstra
@ 2025-01-07 15:04 ` Doug Smythies
2025-01-07 16:25 ` Doug Smythies
2025-01-07 19:23 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-07 15:04 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
On 2025.01.07 03:26 Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote:
>
>> Which will show when a CPU migration took over 10 milliseconds.
>> If you want to go further, for example to only display ones that took
>> over a second and to include the target CPU, then patch turbostat:
>>
>> doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff
>> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
>> index 58a487c225a7..f8a73cc8fbfc 100644
>> --- a/tools/power/x86/turbostat/turbostat.c
>> +++ b/tools/power/x86/turbostat/turbostat.c
>> @@ -2704,7 +2704,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
>> struct timeval tv;
>>
>> timersub(&t->tv_end, &t->tv_begin, &tv);
>> - outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
>> + outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
>> }
>>
>> /* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
>> @@ -4570,12 +4570,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
>> int i;
>> int status;
>>
>> + gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
>> +
>> if (cpu_migrate(cpu)) {
>> fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
>> return -1;
>> }
>>
>> - gettimeofday(&t->tv_begin, (struct timezone *)NULL);
>> +// gettimeofday(&t->tv_begin, (struct timezone *)NULL);
>>
>> if (first_counter_read)
>> get_apic_id(t);
>>
>>
>
> So I've taken the second node offline, running with 10 cores (20
> threads) now.
>
> usec Time_Of_Day_Seconds CPU Busy% IRQ
> 106783 1736248404.951438 - 100.00 20119
> 46 1736248404.844701 0 100.00 1005
> 41 1736248404.844742 20 100.00 1007
> 42 1736248404.844784 1 100.00 1005
> 40 1736248404.844824 21 100.00 1006
> 41 1736248404.844865 2 100.00 1005
> 40 1736248404.844905 22 100.00 1006
> 41 1736248404.844946 3 100.00 1006
> 40 1736248404.844986 23 100.00 1005
> 41 1736248404.845027 4 100.00 1005
> 40 1736248404.845067 24 100.00 1006
> 41 1736248404.845108 5 100.00 1011
> 40 1736248404.845149 25 100.00 1005
> 41 1736248404.845190 6 100.00 1005
> 40 1736248404.845230 26 100.00 1005
> 42 1736248404.845272 7 100.00 1007
> 41 1736248404.845313 27 100.00 1005
> 41 1736248404.845355 8 100.00 1005
> 42 1736248404.845397 28 100.00 1006
> 46 1736248404.845443 9 100.00 1009
> 105995 1736248404.951438 29 100.00 1005
>
> Is by far the worst I've had in the past few minutes playing with this.
>
> If I get a blip (>10000) then it is always on the last CPU, are you
> seeing the same thing?
More or less, yes. The very long migrations are dominated by the
CPU 5 to CPU 11 migration.
Here is data from yesterday for other CPUs:
usec Time_Of_Day_Seconds CPU Busy% IRQ
605706 1736224605.542844 0 99.76 1922
10001 1736224605.561844 1 99.76 1922
10999 1736224605.572843 7 99.76 1923
11001 1736224605.583844 2 99.76 1925
11000 1736224605.606844 4 99.76 1924
10999 1736224605.617843 10 99.76 1923
105001 1736224605.722844 5 99.76 1922
465657 1736224608.190843 8 99.76 1002
494000 1736224608.684843 3 99.76 1003
395674 1736224610.081843 7 99.76 1964
19679 1736224617.108843 7 99.76 1003
37709 1736224636.633845 0 99.76 1003
65641 1736224689.796843 9 99.76 1003
406631 1736224693.206843 4 99.76 1002
105622 1736225026.238843 10 99.76 1003
409622 1736225053.673843 10 99.76 1003
16706 1736225302.149847 0 99.76 1820
10000 1736225302.185846 4 99.76 1825
19663 1736225317.249844 7 99.76 1012
>
>> In this short example all captures were for the CPU 5 to 11 migration.
>> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.
>
> This seems to suggest you are, always on CPU 11.
>
> Weird!
Yes, weird. I think, but am not certain, the CPU sequence in turbostat
per interval loop is:
Wake on highest numbered CPU (11 in my case)
Do a bunch of work that can be done without MSR reads.
For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case)
Do the CPU specific work
Finish the intervals work and printing and such on CPU 11.
Sleep for the interval time (we have been using 1 second)
Without any proof, I was thinking the CPU 11 dominance
for the long migration issue was due to the other bits of
work done on that CPU.
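For reference, a minimal stand-alone sketch of the access pattern described
above: pin to each CPU in topological order, do the per-CPU work, then finish
the interval and sleep on the last CPU. Illustrative only -- this is not
turbostat code, and the CPU order is hard-coded for the 6-core/12-thread
example:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}

int main(void)
{
	/* topological order for the 6c/12t box described above */
	int order[] = { 0, 6, 1, 7, 2, 8, 3, 9, 4, 10, 5, 11 };
	int n = sizeof(order) / sizeof(order[0]);

	for (int iter = 0; iter < 5; iter++) {
		for (int i = 0; i < n; i++) {
			pin_to(order[i]);
			/* per-CPU work (MSR reads in the real tool) goes here */
		}
		/* interval bookkeeping and the 1 second sleep land on the last CPU */
		printf("interval %d finished on CPU %d\n", iter, sched_getcpu());
		sleep(1);
	}
	return 0;
}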
> Anyway, let me see if I can capture a trace of this..
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-07 15:04 ` Doug Smythies
@ 2025-01-07 16:25 ` Doug Smythies
0 siblings, 0 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-07 16:25 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
On 2025.01.07 07:04 Doug Smythies wrote:
> On 2025.01.07 03:26 Peter Zijlstra wrote:
>> On Mon, Jan 06, 2025 at 02:28:40PM -0800, Doug Smythies wrote:
>> If I get a blip (>10000) then it is always on the last CPU, are you
>> seeing the same thing?
>
> More or less, yes. The very long migrations are dominated by the
> CPU 5 to CPU 11 migration.
>>
>>> In this short example all captures were for the CPU 5 to 11 migration.
>>> 2 at 6 seconds, 1 at 1.33 seconds and 1 at 2 seconds.
>>
>> This seems to suggest you are, always on CPU 11.
>>
>> Weird!
>
> Yes, weird. I think, but am not certain, the CPU sequence in turbostat
> per interval loop is:
>
> Wake on highest numbered CPU (11 in my case)
> Do a bunch of work that can be done without MSR reads.
> For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case)
> Do the CPU specific work
> Finish the intervals work and printing and such on CPU 11.
> Sleep for the interval time (we have been using 1 second)
>
> Without any proof, I was thinking the CPU 11 dominance
> for the long migration issue was due to the other bits of
> work done on that CPU.
To test this theory I hacked turbostat to migrate to CPU 3
after the CPU-specific work loop.
So now the per-interval workflow is:
Wake on CPU 3
Do a bunch of work that can be done without MSR reads.
For each CPU in topological order (0,6,1,7,2,8,3,9,4,10,5,11 in my case)
Do the CPU specific work
Migrate to CPU 3
Finish the intervals work and printing and such on CPU 3.
Sleep for the interval time
And now I get:
usec Time_Of_Day_Seconds CPU Busy% IRQ
12646 1736266361.533240 3 99.76 1005
6004653 1736266384.555240 3 99.76 1006
6004653 1736266393.563240 3 99.76 1004
6005648 1736266400.570240 3 99.76 7019
6005653 1736266432.602240 3 99.76 1005
6003656 1736266479.652242 3 99.76 1004
15636 1736266501.690240 3 99.76 1005
4948651 1736266528.661240 3 99.76 1004
521672 1736266534.192240 2 99.76 1002
1117651 1736266585.360239 3 99.76 1004
6003652 1736266592.365240 3 99.76 2123
3526648 1736266612.909240 3 99.76 1004
6003650 1736266632.927240 3 99.76 1005
396623 1736266636.327239 10 99.76 1002
6003654 1736266660.349240 3 99.76 1005
6003653 1736266682.369239 3 99.76 1006
6003653 1736266703.388240 3 99.76 1004
514673 1736266718.918240 2 99.76 1003
14652 1736266725.940240 3 99.76 1004
6003653 1736266745.958240 3 99.76 1004
6003653 1736266767.978240 3 99.76 1006
6003652 1736266794.002240 3 99.76 1006
6003653 1736266815.021240 3 99.76 1004
2496651 1736266841.542239 3 99.76 1007
6003647 1736266848.547240 3 99.76 3504 <<< 8 minutes 7 seconds elapsed
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-07 11:26 ` Peter Zijlstra
2025-01-07 15:04 ` Doug Smythies
@ 2025-01-07 19:23 ` Peter Zijlstra
2025-01-08 5:15 ` Doug Smythies
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-07 19:23 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
On Tue, Jan 07, 2025 at 12:26:07PM +0100, Peter Zijlstra wrote:
> So I've taken the second node offline, running with 10 cores (20
> threads) now.
>
> usec Time_Of_Day_Seconds CPU Busy% IRQ
> 106783 1736248404.951438 - 100.00 20119
> 46 1736248404.844701 0 100.00 1005
> 41 1736248404.844742 20 100.00 1007
> 42 1736248404.844784 1 100.00 1005
> 40 1736248404.844824 21 100.00 1006
> 41 1736248404.844865 2 100.00 1005
> 40 1736248404.844905 22 100.00 1006
> 41 1736248404.844946 3 100.00 1006
> 40 1736248404.844986 23 100.00 1005
> 41 1736248404.845027 4 100.00 1005
> 40 1736248404.845067 24 100.00 1006
> 41 1736248404.845108 5 100.00 1011
> 40 1736248404.845149 25 100.00 1005
> 41 1736248404.845190 6 100.00 1005
> 40 1736248404.845230 26 100.00 1005
> 42 1736248404.845272 7 100.00 1007
> 41 1736248404.845313 27 100.00 1005
> 41 1736248404.845355 8 100.00 1005
> 42 1736248404.845397 28 100.00 1006
> 46 1736248404.845443 9 100.00 1009
> 105995 1736248404.951438 29 100.00 1005
>
> Is by far the worst I've had in the past few minutes playing with this.
Much also depends on how much (or if at all) cgroups are used.
What exact cgroup config are you having? /sys/kernel/debug/sched/debug
should be able to tell you.
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-07 19:23 ` Peter Zijlstra
@ 2025-01-08 5:15 ` Doug Smythies
2025-01-08 13:12 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-08 5:15 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
On 2025.01.07 11:24 Peter Zijlstra wrote:
> On Tue, Jan 07, 2025 at 12:26:07PM +0100, Peter Zijlstra wrote:
>> So I've taken the second node offline, running with 10 cores (20
>> threads) now.
>>
>> usec Time_Of_Day_Seconds CPU Busy% IRQ
>> 106783 1736248404.951438 - 100.00 20119
>> 46 1736248404.844701 0 100.00 1005
>> 41 1736248404.844742 20 100.00 1007
>> 42 1736248404.844784 1 100.00 1005
>> 40 1736248404.844824 21 100.00 1006
>> 41 1736248404.844865 2 100.00 1005
>> 40 1736248404.844905 22 100.00 1006
>> 41 1736248404.844946 3 100.00 1006
>> 40 1736248404.844986 23 100.00 1005
>> 41 1736248404.845027 4 100.00 1005
>> 40 1736248404.845067 24 100.00 1006
>> 41 1736248404.845108 5 100.00 1011
>> 40 1736248404.845149 25 100.00 1005
>> 41 1736248404.845190 6 100.00 1005
>> 40 1736248404.845230 26 100.00 1005
>> 42 1736248404.845272 7 100.00 1007
>> 41 1736248404.845313 27 100.00 1005
>> 41 1736248404.845355 8 100.00 1005
>> 42 1736248404.845397 28 100.00 1006
>> 46 1736248404.845443 9 100.00 1009
>> 105995 1736248404.951438 29 100.00 1005
>>
>> Is by far the worst I've had in the past few minutes playing with this.
>
> Much also depends on how much (or if at all) cgroups are used.
>
> What exact cgroup config are you having? /sys/kernel/debug/sched/debug
> should be able to tell you.
I do not know.
I'll capture the above output, compress it, and send it to you.
I did also boot with systemd.unified_cgroup_hierarchy=0
and it made no difference.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-08 5:15 ` Doug Smythies
@ 2025-01-08 13:12 ` Peter Zijlstra
2025-01-08 15:48 ` Doug Smythies
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-08 13:12 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot
I failed to realize the follow up email was private, so duplicating that
here again, but also new content :-)
On Tue, Jan 07, 2025 at 09:15:59PM -0800, Doug Smythies wrote:
> On 2025.01.07 11:24 Peter Zijlstra wrote:
> > What exact cgroup config are you having? /sys/kernel/debug/sched/debug
> > should be able to tell you.
>
> I do not know.
> I'll capture the above output, compress it, and send it to you.
>
> I did also boot with systemd.unified_cgroup_hierarchy=0
> and it made no difference.
I think you need: "cgroup_disable=cpu noautogroup" to fully disable all
the cpu-cgroup muck. Anyway:
$ zcat cgroup2.txt.gz | grep -e yes -e turbo | awk '{print $2 "\t" $16}'
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
turbostat /autogroup-286
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
yes /user.slice/user-1000.slice/session-1.scope
turbostat /autogroup-286
That matches the scenario where I could reproduce, two competing groups.
I'm seeing wild vruntime divergence when this happens -- this is
definitely wonky. Basically the turbostat group gets starved for a
while while the yes group catches up.
It looks like reweight_entity() is shooting out the cgroup entity to the
right.
So it builds up some negative lag (received surplus service) and then
because turbostat goes sleep for a second, it's cgroup's share gets
truncated to 2 and it shoots the cgroup entity out waaaaaaaay far.
Thing is, waking up *should* fix that up again, but that doesn't appear
to happen, leaving us up a creek.
/me noodles a bit....
Does this help?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0e58e51801f..daa62cfa3092 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7000,6 +7063,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (flags & ENQUEUE_DELAYED) {
requeue_delayed_entity(se);
+ se = se->parent;
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ update_load_avg(cfs_rq, se, UPDATE_TG);
+ se_update_runnable(se);
+ update_cfs_group(se);
+ }
return;
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-08 13:12 ` Peter Zijlstra
@ 2025-01-08 15:48 ` Doug Smythies
2025-01-09 10:59 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-08 15:48 UTC (permalink / raw)
To: 'Peter Zijlstra'; +Cc: linux-kernel, vincent.guittot, Doug Smythies
On 2025.01.08 05:12 Peter Zijlstra wrote:
> On Tue, Jan 07, 2025 at 09:15:59PM -0800, Doug Smythies wrote:
>> On 2025.07.11:24 Peter Zijlstra wrote:
>
>>> What exact cgroup config are you having? /sys/kernel/debug/sched/debug
>>> should be able to tell you.
>>
>> I do not know.
>> I'll capture the above output, compress it, and send it to you.
>>
>> I did also boot with systemd.unified_cgroup_hierarchy=0
>> and it made no difference.
>
> I think you need: "cgroup_disable=cpu noautogroup" to fully disable all
> the cpu-cgroup muck. Anyway:
>
> $ zcat cgroup2.txt.gz | grep -e yes -e turbo | awk '{print $2 "\t" $16}'
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> turbostat /autogroup-286
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> yes /user.slice/user-1000.slice/session-1.scope
> turbostat /autogroup-286
>
> That matches the scenario where I could reproduce, two competing groups.
>
> I'm seeing wild vruntime divergence when this happens -- this is
> definitely wonky. Basically the turbostat groups gets starved for a
> while while the yes group catches up.
>
> It looks like reweight_entity() is shooting out the cgroup entity to the
> right.
>
> So it builds up some negative lag (received surplus service) and then
> because turbostat goes sleep for a second, it's cgroup's share gets
> truncated to 2 and it shoots the cgroup entity out waaaaaaaay far.
>
> Thing is, waking up *should* fix that up again, but that doesn't appear
> to happen, leaving us up a creek.
>
> /me noodles a bit....
>
> Does this help?
Sorry, but no it did not help.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c0e58e51801f..daa62cfa3092 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7000,6 +7063,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> if (flags & ENQUEUE_DELAYED) {
> requeue_delayed_entity(se);
> + se = se->parent;
> + for_each_sched_entity(se) {
> + cfs_rq = cfs_rq_of(se);
> + update_load_avg(cfs_rq, se, UPDATE_TG);
> + se_update_runnable(se);
> + update_cfs_group(se);
> + }
> return;
> }
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-08 15:48 ` Doug Smythies
@ 2025-01-09 10:59 ` Peter Zijlstra
2025-01-10 5:09 ` Doug Smythies
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-09 10:59 UTC (permalink / raw)
To: Doug Smythies; +Cc: linux-kernel, vincent.guittot, Ingo Molnar, wuyun.abel
On Wed, Jan 08, 2025 at 07:48:29AM -0800, Doug Smythies wrote:
> > Does this help?
>
> Sorry, but no it did not help.
Mooo :-(
OK, new day, new chances though.
I noticed this in my traces today:
turbostat-1222 [006] d..2. 311.935649: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 1048576 avg_vruntime: 3184159639071 vruntime: 3184159640194 (-1123) deadline: 3184162621107 } ->
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 }
turbostat-1222 [006] d..2. 311.935651: reweight_entity: (ffff888108f13e00-ffff88885ef38440-6)
{ weight: 2 avg_vruntime: 3184177463330 vruntime: 3184748414495 (-570951165) deadline: 4747605329439 } ->
{ weight: 1048576 avg_vruntime: 3184176414812 vruntime: 3184177464419 (-1049607) deadline: 3184180445332 }
Which is a weight transition: 1048576 -> 2 -> 1048576.
One would expect the lag to shoot out *AND* come back, notably:
-1123*1048576/2 = -588775424
-588775424*2/1048576 = -1123
Except the trace shows it is all off. Worse, subsequent cycles shoot it
out further and further.
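For reference, the expected round trip spelled out as a tiny stand-alone check.
It just mirrors the vlag rescale (se->vlag * old_weight / new_weight, done with
div_s64() in the kernel); illustration only:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t lag = -1123;			/* lag at weight 1048576 */
	int64_t w_big = 1048576, w_small = 2;

	int64_t out  = lag * w_big / w_small;	/* reweight down:    -588775424 */
	int64_t back = out * w_small / w_big;	/* reweight back up: -1123      */

	printf("%lld %lld\n", (long long)out, (long long)back);
	return 0;
}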
This made me have a very hard look at reweight_entity(), and
specifically the ->on_rq case, which is more prominent with
DELAY_DEQUEUE.
And indeed, it is all sorts of broken. While the computation of the new
lag is correct, the computation for the new vruntime, using the new lag
is broken for it does not consider the logic set out in place_entity().
With the below patch, I now see things like:
migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
{ weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
{ weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
{ weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
{ weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
Which isn't perfect yet, but much closer.
Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0e58e51801f..b9575db5ecfe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -689,21 +689,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
*
* XXX could add max_slice to the augmented data to track this.
*/
-static s64 entity_lag(u64 avruntime, struct sched_entity *se)
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
s64 vlag, limit;
- vlag = avruntime - se->vruntime;
- limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
-
- return clamp(vlag, -limit, limit);
-}
-
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
SCHED_WARN_ON(!se->on_rq);
- se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
+ vlag = avg_vruntime(cfs_rq) - se->vruntime;
+ limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+
+ se->vlag = clamp(vlag, -limit, limit);
}
/*
@@ -3770,137 +3765,32 @@ static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
#endif
-static void reweight_eevdf(struct sched_entity *se, u64 avruntime,
- unsigned long weight)
-{
- unsigned long old_weight = se->load.weight;
- s64 vlag, vslice;
-
- /*
- * VRUNTIME
- * --------
- *
- * COROLLARY #1: The virtual runtime of the entity needs to be
- * adjusted if re-weight at !0-lag point.
- *
- * Proof: For contradiction assume this is not true, so we can
- * re-weight without changing vruntime at !0-lag point.
- *
- * Weight VRuntime Avg-VRuntime
- * before w v V
- * after w' v' V'
- *
- * Since lag needs to be preserved through re-weight:
- *
- * lag = (V - v)*w = (V'- v')*w', where v = v'
- * ==> V' = (V - v)*w/w' + v (1)
- *
- * Let W be the total weight of the entities before reweight,
- * since V' is the new weighted average of entities:
- *
- * V' = (WV + w'v - wv) / (W + w' - w) (2)
- *
- * by using (1) & (2) we obtain:
- *
- * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
- * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
- * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
- * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
- *
- * Since we are doing at !0-lag point which means V != v, we
- * can simplify (3):
- *
- * ==> W / (W + w' - w) = w / w'
- * ==> Ww' = Ww + ww' - ww
- * ==> W * (w' - w) = w * (w' - w)
- * ==> W = w (re-weight indicates w' != w)
- *
- * So the cfs_rq contains only one entity, hence vruntime of
- * the entity @v should always equal to the cfs_rq's weighted
- * average vruntime @V, which means we will always re-weight
- * at 0-lag point, thus breach assumption. Proof completed.
- *
- *
- * COROLLARY #2: Re-weight does NOT affect weighted average
- * vruntime of all the entities.
- *
- * Proof: According to corollary #1, Eq. (1) should be:
- *
- * (V - v)*w = (V' - v')*w'
- * ==> v' = V' - (V - v)*w/w' (4)
- *
- * According to the weighted average formula, we have:
- *
- * V' = (WV - wv + w'v') / (W - w + w')
- * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
- * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
- * = (WV + w'V' - Vw) / (W - w + w')
- *
- * ==> V'*(W - w + w') = WV + w'V' - Vw
- * ==> V' * (W - w) = (W - w) * V (5)
- *
- * If the entity is the only one in the cfs_rq, then reweight
- * always occurs at 0-lag point, so V won't change. Or else
- * there are other entities, hence W != w, then Eq. (5) turns
- * into V' = V. So V won't change in either case, proof done.
- *
- *
- * So according to corollary #1 & #2, the effect of re-weight
- * on vruntime should be:
- *
- * v' = V' - (V - v) * w / w' (4)
- * = V - (V - v) * w / w'
- * = V - vl * w / w'
- * = V - vl'
- */
- if (avruntime != se->vruntime) {
- vlag = entity_lag(avruntime, se);
- vlag = div_s64(vlag * old_weight, weight);
- se->vruntime = avruntime - vlag;
- }
-
- /*
- * DEADLINE
- * --------
- *
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- *
- * d' = v' + (d - v)*w/w'
- * = V' - (V - v)*w/w' + (d - v)*w/w'
- * = V - (V - v)*w/w' + (d - v)*w/w'
- * = V + (d - V)*w/w'
- */
- vslice = (s64)(se->deadline - avruntime);
- vslice = div_s64(vslice * old_weight, weight);
- se->deadline = avruntime + vslice;
-}
+static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
bool curr = cfs_rq->curr == se;
- u64 avruntime;
if (se->on_rq) {
/* commit outstanding execution time */
update_curr(cfs_rq);
- avruntime = avg_vruntime(cfs_rq);
+ update_entity_lag(cfs_rq, se);
+ se->deadline -= se->vruntime;
+ se->rel_deadline = 1;
if (!curr)
__dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);
- if (se->on_rq) {
- reweight_eevdf(se, avruntime, weight);
- } else {
- /*
- * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
- * we need to scale se->vlag when w_i changes.
- */
- se->vlag = div_s64(se->vlag * se->load.weight, weight);
- }
+ /*
+ * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
+ * we need to scale se->vlag when w_i changes.
+ */
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
+ if (se->rel_deadline)
+ se->deadline = div_s64(se->deadline * se->load.weight, weight);
update_load_set(&se->load, weight);
@@ -3915,6 +3805,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
+ place_entity(cfs_rq, se, 0);
if (!curr)
__enqueue_entity(cfs_rq, se);
@@ -5355,7 +5246,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->vruntime = vruntime - lag;
- if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
+ if (se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
return;
^ permalink raw reply related [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-09 10:59 ` Peter Zijlstra
@ 2025-01-10 5:09 ` Doug Smythies
2025-01-10 11:57 ` Peter Zijlstra
2025-01-12 19:59 ` Doug Smythies
0 siblings, 2 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-10 5:09 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
[-- Attachment #1: Type: text/plain, Size: 3138 bytes --]
Hi Peter,
Thanks for all your hard work on this.
On 2025.01.09 03:00 Peter Zijlstra wrote:
...
> This made me have a very hard look at reweight_entity(), and
> specifically the ->on_rq case, which is more prominent with
> DELAY_DEQUEUE.
>
> And indeed, it is all sorts of broken. While the computation of the new
> lag is correct, the computation for the new vruntime, using the new lag
> is broken for it does not consider the logic set out in place_entity().
>
> With the below patch, I now see things like:
>
> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475
} ->
> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203
}
> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
6316614641111 } ->
> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650
}
>
> Which isn't perfect yet, but much closer.
Agreed.
I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
It still compares to the "b12" kernel (the last good one in the kernel bisection).
It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
versus 6 seconds without the patch.
I left things running for many hours and will let it continue overnight.
There seems to have been an issue at one spot in time:
usec Time_Of_Day_Seconds CPU Busy% IRQ
488994 1736476550.732222 - 99.76 12889
488520 1736476550.732222 11 99.76 1012
960999 1736476552.694222 - 99.76 17922
960587 1736476552.694222 11 99.76 1493
914999 1736476554.610222 - 99.76 23579
914597 1736476554.610222 11 99.76 1962
809999 1736476556.421222 - 99.76 23134
809598 1736476556.421222 11 99.76 1917
770998 1736476558.193221 - 99.76 21757
770603 1736476558.193221 11 99.76 1811
726999 1736476559.921222 - 99.76 21294
726600 1736476559.921222 11 99.76 1772
686998 1736476561.609221 - 99.76 20801
686600 1736476561.609221 11 99.76 1731
650998 1736476563.261221 - 99.76 20280
650601 1736476563.261221 11 99.76 1688
610998 1736476564.873221 - 99.76 19857
610606 1736476564.873221 11 99.76 1653
I had one of these the other day also, but they were all 6 seconds.
It's like a burst of problematic data. I have the data somewhere,
and can try to find it tomorrow.
>
> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
...
[-- Attachment #2: turbostat-sampling-issue-fixed-seconds.png --]
[-- Type: image/png, Size: 62449 bytes --]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-10 5:09 ` Doug Smythies
@ 2025-01-10 11:57 ` Peter Zijlstra
2025-01-12 23:14 ` Doug Smythies
2025-01-12 19:59 ` Doug Smythies
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-10 11:57 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> Thanks for all your hard work on this.
>
> On 2025.01.09 03:00 Peter Zijlstra wrote:
>
> ...
>
> > This made me have a very hard look at reweight_entity(), and
> > specifically the ->on_rq case, which is more prominent with
> > DELAY_DEQUEUE.
> >
> > And indeed, it is all sorts of broken. While the computation of the new
> > lag is correct, the computation for the new vruntime, using the new lag
> > is broken for it does not consider the logic set out in place_entity().
> >
> > With the below patch, I now see things like:
> >
> > migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> > { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475
> } ->
> > { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203
> }
> > migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> > { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
> 6316614641111 } ->
> > { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650
> }
> >
> > Which isn't perfect yet, but much closer.
>
> Agreed.
> I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
> It still compares to the "b12" kernel (the last good one in the kernel bisection).
> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
> verses 6 seconds without the patch.
Progress!
> I left things running for many hours and will let it continue overnight.
> There seems to have been an issue at one spot in time:
Right, so by happy accident I also left mine running overnight and I
think I caught one of those weird spikes. It took a bit of staring to
figure out what went wrong this time, but what I now think is the thing
that sets off the chain of fail is a combination of DELAY_DEQUEUE and
the cgroup reweight stuff.
Specifically, when a cgroup's CPU queue becomes empty,
calc_group_shares() will drop its weight to the floor.
Now, normally, when a queue goes empty, it gets dequeued from its
parent and its weight is immaterial. However, with DELAY_DEQUEUE the
thing will stick around for a while -- at a very low weight.
What makes this esp. troublesome is that even for an active cgroup (like
the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus
like). Worse, the avg_vruntime thing uses load_scale_down() because u64
just isn't all that big :/
(if only all 64bit machines could do 128bit divisions as cheaply as x86_64)
This means that on a 16 CPU machine, the weight of a 'normal' all busy
queue will be 64, and the weight of an empty queue will be 2, which means
the effect of the ginormous lag on the avg_vruntime is fairly
significant, pushing it wildly off balance and affecting placement of
new tasks.
Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue
competition as the entity was.
As such, we should not adjust the weight of an empty queue.
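As a back-of-the-envelope illustration of the skew (made-up but ballpark
numbers, not the kernel's actual avg_vruntime code): fifteen busy group
entities at scaled-down weight 64 sitting near the average, plus one
empty-but-delayed entity at weight 2 that has been shot ~1.5 seconds of
vruntime to the right:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t n_busy  = 15, w_busy = 64;	/* ~tg->shares/nr_cpus, scaled down */
	int64_t w_empty = 2;			/* floor weight of the empty queue  */
	int64_t d_empty = 1500000000000LL;	/* ~1.5 s of vruntime offset        */

	/* weighted-average offset contributed by the single outlier */
	int64_t skew = w_empty * d_empty / (n_busy * w_busy + w_empty);

	printf("avg_vruntime skew: ~%lld ns\n", (long long)skew);	/* ~3.1e9 */
	return 0;
}

A single such entity moves the average by seconds of vruntime, which is the
same ballpark as the multi-second turbostat stalls above.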
I've started a new run, and some 15 minutes of runtime show nothing
interesting, I'll have to let it run for a while again.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..93644b3983d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
long shares;
- if (!gcfs_rq)
+ /*
+ * When a group becomes empty, preserve its weight. This matters for
+ * DELAY_DEQUEUE.
+ */
+ if (!gcfs_rq || !gcfs_rq->load.weight)
return;
if (throttled_hierarchy(gcfs_rq))
^ permalink raw reply related [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-10 5:09 ` Doug Smythies
2025-01-10 11:57 ` Peter Zijlstra
@ 2025-01-12 19:59 ` Doug Smythies
1 sibling, 0 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-12 19:59 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
Hi Peter,
While we have moved on from this branch of this email thread,
just for completeness, I'll give the additional data from the overnight
test. There is also an observation that will be made and continued
in the next email.
On 2025.01.09 21:09 Doug Smythies wrote:
> On 2025.01.09 03:00 Peter Zijlstra wrote:
>
> ...
>
>> This made me have a very hard look at reweight_entity(), and
>> specifically the ->on_rq case, which is more prominent with
>> DELAY_DEQUEUE.
>>
>> And indeed, it is all sorts of broken. While the computation of the new
>> lag is correct, the computation for the new vruntime, using the new lag
>> is broken for it does not consider the logic set out in place_entity().
>>
>> With the below patch, I now see things like:
>>
>> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
>> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475
> } ->
>> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline:
6427157349203
> }
>> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
>> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
> 6316614641111 } ->
>> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline:
4874220535650
> }
>>
>> Which isn't perfect yet, but much closer.
>
> Agreed.
> I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
> It still compares to the "b12" kernel (the last good one in the kernel bisection).
> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
> verses 6 seconds without the patch.
>
> I left things running for many hours and will let it continue overnight.
> There seems to have been an issue at one spot in time:
>
> usec Time_Of_Day_Seconds CPU Busy% IRQ
> 488994 1736476550.732222 - 99.76 12889
> 488520 1736476550.732222 11 99.76 1012
> 960999 1736476552.694222 - 99.76 17922
> 960587 1736476552.694222 11 99.76 1493
> 914999 1736476554.610222 - 99.76 23579
> 914597 1736476554.610222 11 99.76 1962
> 809999 1736476556.421222 - 99.76 23134
> 809598 1736476556.421222 11 99.76 1917
> 770998 1736476558.193221 - 99.76 21757
> 770603 1736476558.193221 11 99.76 1811
> 726999 1736476559.921222 - 99.76 21294
> 726600 1736476559.921222 11 99.76 1772
> 686998 1736476561.609221 - 99.76 20801
> 686600 1736476561.609221 11 99.76 1731
> 650998 1736476563.261221 - 99.76 20280
> 650601 1736476563.261221 11 99.76 1688
> 610998 1736476564.873221 - 99.76 19857
> 610606 1736476564.873221 11 99.76 1653
The test was continued overnight yielding this additional information:
868008 1736496040.956236 - 99.76 12668
867542 1736496040.956222 5 99.76 1046
5950010 1736496047.907233 - 99.76 22459
5949592 1736496047.907222 5 99.76 1871
5791008 1736496054.699232 - 99.76 83625
5790605 1736496054.699222 5 99.76 6957
1962999 1736502192.036227 - 99.76 12896
1962528 1736502192.036227 11 99.76 1030
434858 1736502193.472086 - 99.76 35824
434387 1736502193.472086 11 99.76 2965
Or 2 more continuous bursts, and a 5.9 second sample.
Observation: there isn't any tens-of-milliseconds data.
Based on the graph, which is basically the same test
done in an ever so slightly different way, there should be
a lot of such data.
Rather than re-attach the same graph, I'll present the
same data as histograms:
First the b12 kernel (the last good one in the kernel bisection):
Time Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1
Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)
Second Kernel 6.13-rc6+this one patch
Time Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1
Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent)
Where, and for example, this line:
1.005000, 593
means that there were 593 occurrences of turbostat interval times
between 1.005 and 1.005999 seconds.
So, I would expect to see that reflected in the overnight test, but I don't.
It would appear that the slightly different way of doing the test
affects the probability of occurrence significantly.
I'll continue in a reply to your patch 2 email from Friday (Jan 10th).
>
> I had one of these the other day also, but they were all 6 seconds.
> Its like a burst of problematic data. I have the data somewhere,
> and can try to find it tomorrow.
>>
>> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> ...
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-10 11:57 ` Peter Zijlstra
@ 2025-01-12 23:14 ` Doug Smythies
2025-01-13 11:03 ` Peter Zijlstra
2025-01-13 11:05 ` Peter Zijlstra
0 siblings, 2 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-12 23:14 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
[-- Attachment #1: Type: text/plain, Size: 7364 bytes --]
On 2025.01.10 03:57 Peter Zijlstra wrote:
> On Thu, Jan 09, 2025 at 09:09:26PM -0800, Doug Smythies wrote:
>> On 2025.01.09 03:00 Peter Zijlstra wrote:
>>
>> ...
>>
>>> This made me have a very hard look at reweight_entity(), and
>>> specifically the ->on_rq case, which is more prominent with
>>> DELAY_DEQUEUE.
>>>
>>> And indeed, it is all sorts of broken. While the computation of the new
>>> lag is correct, the computation for the new vruntime, using the new lag
>>> is broken for it does not consider the logic set out in place_entity().
>>>
>>> With the below patch, I now see things like:
>>>
>>> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
>>> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline:
4860516552475
>> } ->
>>> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline:
6427157349203
>> }
>>> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
>>> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline:
>> 6316614641111 } ->
>>> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline:
4874220535650
>> }
>>>
>>> Which isn't perfect yet, but much closer.
>>
>> Agreed.
>> I tested the patch. Attached is a repeat of a graph I had sent before, with different y axis scale and old data deleted.
>> It still compares to the "b12" kernel (the last good one in the kernel bisection).
>> It was a 2 hour and 31 minute duration test, and the maximum CPU migration time was 24 milliseconds,
>> verses 6 seconds without the patch.
>
> Progress!
>
>> I left things running for many hours and will let it continue overnight.
>> There seems to have been an issue at one spot in time:
>
> Right, so by happy accident I also left mine running overnight and I
> think I caught one of those weird spikes. It took a bit of staring to
> figure out what went wrong this time, but what I now think is the thing
> that sets off the chain of fail is a combination of DELAY_DEQUEUE and
> the cgroup reweight stuff.
>
> Specifically, when a cgroup's CPU queue becomes empty,
> calc_group_shares() will drop its weight to the floor.
>
> Now, normally, when a queue goes empty, it gets dequeued from its
> parent and its weight is immaterial. However, with DELAY_DEQUEUE the
> thing will stick around for a while -- at a very low weight.
>
> What makes this esp. troublesome is that even for an active cgroup (like
> the 'yes' group) the per-cpu weight will be relatively small (~1/nr_cpus
> like). Worse, the avg_vruntime thing uses load_scale_down() because u64
> just isn't all that big :/
>
> (if only all 64bit machines could do 128bit divisions as cheaply as x86_64)
>
> This means that on a 16 CPU machine, the weight of a 'normal' all busy
> queue will be 64, and the weight of an empty queue will be 2, which means
> the effect of the ginormous lag on the avg_vruntime is fairly
> significant, pushing it wildly off balance and affecting placement of
> new tasks.
>
> Also, this violates the spirit of DELAY_DEQUEUE, that wants to continue
> competition as the entity was.
>
> As such, we should not adjust the weight of an empty queue.
>
> I've started a new run, and some 15 minutes of runtime show nothing
> interesting, I'll have to let it run for a while again.
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e9ca38512de..93644b3983d4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4065,7 +4019,11 @@ static void update_cfs_group(struct sched_entity *se)
> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> long shares;
>
> - if (!gcfs_rq)
> + /*
> + * When a group becomes empty, preserve its weight. This matters for
> + * DELAY_DEQUEUE.
> + */
> + if (!gcfs_rq || !gcfs_rq->load.weight)
> return;
>
> if (throttled_hierarchy(gcfs_rq))
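For a back-of-the-envelope feel for the avg_vruntime skew described above, here is a simplified two-entity model (plain C arithmetic, not kernel code; the weights 64 and 2 come from the 16-CPU example and the lag value from the reweight_entity trace quoted earlier in this message, and the real kernel computation sums over all entities against a reference vruntime):

/* Illustration only: load-weighted average of two vruntime keys. */
#include <stdio.h>

int main(void)
{
	/* busy per-CPU group entity on a 16-CPU box: ~1024/16 after load_scale_down() */
	long long w_busy = 64, key_busy = 0;			/* roughly at the average */
	/* empty, delay-dequeued group entity: weight floored to 2, huge negative lag */
	long long w_empty = 2, key_empty = -65360836540LL;	/* ns, from the trace */

	double shift = (double)(w_busy * key_busy + w_empty * key_empty) /
		       (double)(w_busy + w_empty);

	printf("avg_vruntime shift: %.0f ns (~%.2f s)\n", shift, shift / 1e9);
	return 0;
}

Even at weight 2, an entity dragging roughly 65 seconds of lag pulls the weighted average back by about 2 seconds, which is more than enough to disturb the placement of newly enqueued tasks.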
I tested the above patch on top of the previous patch.
Multiple tests and multiple methods over many hours and
I never got any hit at all for a detected CPU migration greater than or
equal to 10 milliseconds.
Which is good news.
The test I have been running to create some of the graphs I have been
attaching is a little different, using turbostat with different options:
turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
And with this test I get intervals over 1 second by over 10 milliseconds.
(I referred to this observation in the previous email.)
I attach a new version of a previous graph, with the data from this "patch 2" added.
Also, below are the same results presented as histograms
(some repeated from my previous email):
First the b12 kernel (the last good one in the kernel bisection):
Time Occurrences
1.000000, 3282
1.001000, 1826
1.002000, 227
1.003000, 1852
1.004000, 1036
1.005000, 731
1.006000, 75
1.007000, 30
1.008000, 9
1.009000, 2
1.010000, 1
1.011000, 1
Total: 9072 : Total >= 10 mSec: 2 ( 0.02 percent)
Second: Kernel 6.13-rc6+the first patch:
Time Occurrences
1.000000, 1274
1.001000, 1474
1.002000, 512
1.003000, 3201
1.004000, 849
1.005000, 593
1.006000, 246
1.007000, 104
1.008000, 36
1.009000, 15
1.010000, 19
1.011000, 16
1.012000, 11
1.013000, 27
1.014000, 26
1.015000, 35
1.016000, 105
1.017000, 85
1.018000, 135
1.019000, 283
1.020000, 17
1.021000, 4
1.022000, 3
1.023000, 1
Total: 9071 : Total >= 10 mSec: 767 ( 8.46 percent)
Third: Kernel 6.13-rc6+the first patch+the above patch:
1.000000, 2034
1.001000, 2108
1.002000, 2030
1.003000, 2492
1.004000, 216
1.005000, 109
1.006000, 23
1.007000, 8
1.008000, 3
1.009000, 9
1.010000, 1
1.011000, 2
1.012000, 2
1.014000, 3
1.015000, 10
1.016000, 19
1.017000, 1
1.018000, 1
Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)
Where, and for example, this line:
1.016000, 19
means that there were 19 occurrences of turbostat interval times
between 1.016 and 1.016999 seconds.
As mentioned earlier, I haven't been able to obtain any detailed
information as to where any extra delay is occurring, in particular
if it is related to CPU migration or not.
I am assuming the system under test is perturbed just enough
by the slight difference in the test that the longer times do
not occur.
As a side note, and just for informational purposes:
The changes do have an effect on performance.
Using 40 pairs of ping-pong token passing rings running 30 million
loops I get:
Where:
nodelay = NO_DELAY_DEQUEUE
delay = DELAY_DEQUEUE
noauto = "cgroup_disable=cpu noautogroup" on grub command line
auto = nothing cgroup related on the grub command line.
All tests done with the CPU frequency scaling governor = performance.
(lesser numbers are better)
noauto nodelay test 1 16.24 uSec/loop reference test
noauto nodelay test 2 16.22 uSec/loop -0.15%
noauto delay test 1 15.41 uSec/loop -5.16%
auto nodelay test 1 24.31 uSec/loop +49.88%
auto nodelay test 2 24.31 uSec/loop +49.24%
auto delay test 1 21.42 uSec/loop +21.42%
auto delay test 2 21.71 uSec/loop +21.71%
A graph is also attached showing the same results, but for each
of the 40 pairs. It reveals differences in relative execution times
from the first to the last ping-pong pair.
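The ping-pong benchmark itself is not included in the thread, so the following is only a hypothetical sketch of what a single pair could look like: two processes bounce one byte back and forth over a pair of pipes and report the average round-trip ("loop") time. The loop count below is made up for illustration; the pair count, cgroup setup and any pinning used in the real test are not shown here.

/* Sketch of one ping-pong token-passing pair (illustrative, not the actual test). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

#define LOOPS 1000000L		/* made-up loop count for illustration */

static void bounce(int rd, int wr, long loops, int initiator)
{
	char token = 'x';

	for (long i = 0; i < loops; i++) {
		if (initiator && write(wr, &token, 1) != 1)
			exit(1);
		if (read(rd, &token, 1) != 1)
			exit(1);
		if (!initiator && write(wr, &token, 1) != 1)
			exit(1);
	}
}

int main(void)
{
	int a[2], b[2];
	struct timeval t0, t1;
	pid_t pid;

	if (pipe(a) || pipe(b))
		return 1;

	gettimeofday(&t0, NULL);
	pid = fork();
	if (pid < 0)
		return 1;
	if (pid == 0) {			/* child: echo the token back */
		bounce(a[0], b[1], LOOPS, 0);
		_exit(0);
	}
	bounce(b[0], a[1], LOOPS, 1);	/* parent: initiates each round trip */
	wait(NULL);
	gettimeofday(&t1, NULL);

	double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
	printf("%.2f uSec/loop\n", us / LOOPS);
	return 0;
}

Each loop is one full round trip, so the two tasks constantly wake each other, which makes such a test sensitive to wakeup and placement behaviour, consistent with the differences in the table above.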
[-- Attachment #2: turbostat-sampling-issue-fixed--2-seconds.png --]
[-- Type: image/png, Size: 78967 bytes --]
[-- Attachment #3: 40-ping-pairs-compare.png --]
[-- Type: image/png, Size: 54905 bytes --]
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-12 23:14 ` Doug Smythies
@ 2025-01-13 11:03 ` Peter Zijlstra
2025-01-14 10:58 ` Peter Zijlstra
2025-01-13 11:05 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-13 11:03 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
> I tested the above patch on top of the previous patch.
That was indeed the intention.
> Multiple tests and multiple methods over many hours and
> I never got any hit at all for a detected CPU migration greater than or
> equal to 10 milliseconds.
> Which is good news.
Right, my current trace threshold is set at 100ms, and I've let it run
with both patches on over the entire weekend and, so far, nothing.
So definitely progress.
> The test I have been running to create some of the graphs I have been
> attaching is a little different, using turbostat with different options:
>
> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
>
> And with this test I get intervals over 1 second by over 10 milliseconds.
> (I referred to this observation in the previous email.)
OK, almost but not quite there it seems.
> Third: Kernel 6.13-rc6+the first patch+the above patch:
>
> 1.000000, 2034
> 1.001000, 2108
> 1.002000, 2030
> 1.003000, 2492
> 1.004000, 216
> 1.005000, 109
> 1.006000, 23
> 1.007000, 8
> 1.008000, 3
> 1.009000, 9
> 1.010000, 1
> 1.011000, 2
> 1.012000, 2
> 1.014000, 3
> 1.015000, 10
> 1.016000, 19
> 1.017000, 1
> 1.018000, 1
>
> Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)
>
> Where, and for example, this line:
>
> 1.016000, 19
>
> means that there were 19 occurrences of turbostat interval times
> between 1.016 and 1.016999 seconds.
OK, let me lower my threshold to 10ms and change the turbostat
invocation -- see if I can catch me some wabbits :-)
FWIW, I'm using the below hackery to catch them wabbits.
---
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 1b69caa87480..61ff330e068b 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -149,6 +149,12 @@ SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv,
return -EFAULT;
}
if (unlikely(tz != NULL)) {
+ if (tz == (void*)1) {
+ trace_printk("WHOOPSIE!\n");
+ tracing_off();
+ return 0;
+ }
+
if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
return -EFAULT;
}
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 58a487c225a7..baeac7388be2 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -67,6 +67,7 @@
#include <stdbool.h>
#include <assert.h>
#include <linux/kernel.h>
+#include <sys/syscall.h>
#define UNUSED(x) (void)(x)
@@ -2704,7 +2705,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
struct timeval tv;
timersub(&t->tv_end, &t->tv_begin, &tv);
- outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
+ outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
}
/* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
@@ -4570,12 +4571,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
int i;
int status;
+ gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
+
if (cpu_migrate(cpu)) {
fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
return -1;
}
- gettimeofday(&t->tv_begin, (struct timezone *)NULL);
+// gettimeofday(&t->tv_begin, (struct timezone *)NULL);
if (first_counter_read)
get_apic_id(t);
@@ -4730,6 +4733,15 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
done:
gettimeofday(&t->tv_end, (struct timezone *)NULL);
+ {
+ struct timeval tv;
+ u64 delta;
+ timersub(&t->tv_end, &t->tv_begin, &tv);
+ delta = tv.tv_sec * 1000000 + tv.tv_usec;
+ if (delta > 100000)
+ syscall(__NR_gettimeofday, &tv, (void*)1);
+ }
+
return 0;
}
^ permalink raw reply related [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-12 23:14 ` Doug Smythies
2025-01-13 11:03 ` Peter Zijlstra
@ 2025-01-13 11:05 ` Peter Zijlstra
2025-01-13 16:01 ` Doug Smythies
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-13 11:05 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
> The test I have been running to create some of the graphs I have been
> attaching is a little different, using turbostat with different options:
>
> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
>
> And with this test I get intervals over 1 second by over 10 milliseconds.
> First the b12 kernel (the last good one in the kernel bisection):
>
> Time Occurrences
> 1.000000, 3282
> 1.001000, 1826
> 1.002000, 227
> 1.003000, 1852
> 1.004000, 1036
> 1.005000, 731
> 1.006000, 75
> 1.007000, 30
> 1.008000, 9
> 1.009000, 2
> 1.010000, 1
> 1.011000, 1
You're creating these Time values from the consecutive
Time_Of_Day_Seconds data using a script? Let me go check the turbostat
code to see if my hackery is still invoked, even if not displayed.
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-13 11:05 ` Peter Zijlstra
@ 2025-01-13 16:01 ` Doug Smythies
0 siblings, 0 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-13 16:01 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
On 2025.01.13 03:06 Peter Zijlstra wrote:
> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>
>> The test I have been running to create some of the graphs I have been
>> attaching is a little different, using turbostat with different options:
>>
>> turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds --interval 1
>>
>> And with this test I get intervals over 1 second by over 10 milliseconds.
>
>> First the b12 kernel (the last good one in the kernel bisection):
>>
>> Time Occurrences
>> 1.000000, 3282
>> 1.001000, 1826
>> 1.002000, 227
>> 1.003000, 1852
>> 1.004000, 1036
>> 1.005000, 731
>> 1.006000, 75
>> 1.007000, 30
>> 1.008000, 9
>> 1.009000, 2
>> 1.010000, 1
>> 1.011000, 1
>
> You're creating these Time values from the consecutive
> Time_Of_Day_Seconds data using a script? Let me go check the turbostat
> code to see if my hackery is still invoked, even if not displayed.
Yes, sort of.
I put the output into a spreadsheet and add a column calculating
the time difference between samples.
The histogram is created by a simple c program run against that extracted column.
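The actual program is not shown in the thread; purely as an illustration, a histogram tool along these lines could look like the sketch below, assuming it reads one interval time per line (seconds, the extracted delta column) on stdin, uses 1 millisecond buckets, and counts anything at least 10 ms over the nominal 1 second interval:

/* Hypothetical sketch of the histogram post-processing step (not the program used here). */
#include <stdio.h>

#define NBINS     4000		/* covers intervals up to 4 seconds */
#define BIN_WIDTH 0.001		/* 1 millisecond per bucket */

int main(void)
{
	static long bins[NBINS];
	long total = 0, over = 0;
	double t;

	while (scanf("%lf", &t) == 1) {
		long i = (long)(t / BIN_WIDTH);

		if (i < 0)
			i = 0;
		if (i >= NBINS)
			i = NBINS - 1;
		bins[i]++;
		total++;
		if (t >= 1.010)		/* assumed: 1 s nominal interval, 10 ms over */
			over++;
	}

	printf("Time Occurrences\n");
	for (long i = 0; i < NBINS; i++)
		if (bins[i])
			printf("%.6f, %ld\n", i * BIN_WIDTH, bins[i]);
	printf("Total: %ld : Total >= 10 mSec: %ld ( %.2f percent)\n",
	       total, over, total ? 100.0 * over / total : 0.0);
	return 0;
}

Feeding it the extracted delta column produces lines in the same "1.016000, 19" style, plus a summary line, as quoted above.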
Anyway, I finally did get some useful information. Examples:
Samp uSec Time of day Delta T Freq TSC IRQ TMP PWR
4086 4548 1736734595.017487 1.007149935 4800 4104 12128 73 107.52
6222 4059 1736736736.520998 1.009660006 4800 4098 12124 74 106.73
6263 400 1736736777.699340 1.023000002 4800 4104 12345 73 106.51
The summary histogram line for that capture is:
Total: 9079 : Total >= 10 mSec: 128 ( 1.41 percent)
The maximum uSec was 4548, and there were only about 20 (counted manually) greater
than 1 millisecond (i.e. all good).
The command used was:
turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1
Anyway, the long interval time never occurs within the turbostat per-interval execution itself.
Any extra time seems to be spent outside of the main loop.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-13 11:03 ` Peter Zijlstra
@ 2025-01-14 10:58 ` Peter Zijlstra
2025-01-14 15:15 ` Doug Smythies
2025-01-19 0:09 ` Doug Smythies
0 siblings, 2 replies; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-14 10:58 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
> > means that there were 19 occurrences of turbostat interval times
> > between 1.016 and 1.016999 seconds.
>
> OK, let me lower my threshold to 10ms and change the turbostat
> invocation -- see if I can catch me some wabbits :-)
I've had it run overnight and have not caught a single >10ms event :-(
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-14 10:58 ` Peter Zijlstra
@ 2025-01-14 15:15 ` Doug Smythies
2025-01-15 2:08 ` Len Brown
2025-01-19 0:09 ` Doug Smythies
1 sibling, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-14 15:15 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
On 2025.01.14 02:59 Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>>> means that there were 19 occurrences of turbostat interval times
>>> between 1.016 and 1.016999 seconds.
>>
>> OK, let me lower my threshold to 10ms and change the turbostat
>> invocation -- see if I can catch me some wabbits :-)
>
> I've had it run overnight and have not caught a single >10ms event :-(
O.K. thanks for trying.
For me, the probability of occurrence does vary:
After moving gettimeofday calls back to their original places
in the code I recompiled turbostat, and got:
Total: 9082 : Total >= 10 mSec: 24 ( 0.26 percent)
And 2 previous tests had significant differences in probability of occurrence:
Total: 9071 : Total >= 10 mSec: 39 ( 0.43 percent)
Total: 9079 : Total >= 10 mSec: 128 ( 1.41 percent)
Whenever I try to obtain more information by eliminating
the --Summary directive in turbostat, I never get
a >= 10 mSec hit.
I tried looking into the sleep lengths by themselves without
using turbostat, and never saw a 1 second requested sleep
be longer than requested by >= 1 mSec.
Regardless, the 2 patches seem to have solved the issue of up to
6 seconds of extra time between samples. The most
I have seen with all this testing has been 23 milliseconds.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-14 15:15 ` Doug Smythies
@ 2025-01-15 2:08 ` Len Brown
2025-01-15 16:47 ` Doug Smythies
0 siblings, 1 reply; 277+ messages in thread
From: Len Brown @ 2025-01-15 2:08 UTC (permalink / raw)
To: Doug Smythies
Cc: Peter Zijlstra, linux-kernel, vincent.guittot, Ingo Molnar,
wuyun.abel
Doug,
Your attention to detail and persistence has once again found a tricky
underlying bug -- kudos!
Re: turbostat behaviour
Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is
printed as a sanity check. If there are any irregularities in it, as
you noticed, then something very strange in the hardware or software
is going wrong (and the actual turbostat results will likely not be
reliable).
Yes, the "usec" column measures how long it takes to migrate to a CPU
and collect stats there. So if you are hunting down a glitch in
migration all you need is this column to see it. "usec" on the
summary row is the difference between the 1st migration and after the
last -- excluding the sysfs/procfs time that is consumed on the last
CPU. So migration delays will also be reflected there.
Note: we have a patch queued which changes the "usec" on the Summary
row to *include* the sysfs/procfs time on the last CPU. (The per-cpu
"usec" values are unchanged.) This is because we've noticed some
really weird delays in doing things like reading /proc/interrupts and
we want to be able to easily do A/B comparisons by simply including or
excluding counters.
Also FYI, The scheme of migrating to each CPU so that collecting stats
there will be "local" isn't scaling so well on very large systems, and
I'm about to take a close look at it. In yogini we used a different
scheme, where a thread is bound to each CPU, so they can collect in
parallel; and we may be moving to something like that.
cheers,
Len Brown, Intel Open Source Technology Center
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-15 2:08 ` Len Brown
@ 2025-01-15 16:47 ` Doug Smythies
0 siblings, 0 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-15 16:47 UTC (permalink / raw)
To: 'Len Brown'
Cc: 'Peter Zijlstra', linux-kernel, vincent.guittot,
'Ingo Molnar', wuyun.abel, Doug Smythies
Hi Len,
Thank you for chiming in on this thread.
On 2025.01.14 18:09 Len Brown wrote:
> Doug,
> Your attention to detail and persistence has once again found a tricky
> underlying bug -- kudos!
>
> Re: turbostat behaviour
>
> Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is
> printed as a sanity check. If there are any irregularities in it, as
> you noticed, then something very strange in the hardware or software
> is going wrong (and the actual turbostat results will likely not be
> reliable).
While I use turbostat almost every day, I am embarrassed to admit
that until this investigation I did not know about the ability to
add the "Time_Of_Day_Seconds" and "usec" columns.
They have been incredibly useful.
Early on, and until I discovered those two "show" options, I was
using the sanity check calculation of TSC_MHz to reveal the anomaly.
> Yes, the "usec" column measures how long it takes to migrate to a CPU
> and collect stats there. So if you are hunting down a glitch in
> migration all you need is this column to see it. "usec" on the
> summary row is the difference between the 1st migration and after the
> last -- excluding the sysfs/procfs time that is consumed on the last
> CPU. So migration delays will also be reflected there.
On a per-CPU basis, it excludes the actual CPU migration step.
Peter and I made a modification to turbostat to have the per-CPU
"usec" column focus just on the CPU migration time. [1]
> Note: we have a patch queued which changes the "usec" on the Summary
> row to *include* the sysfs/procfs time on the last CPU.
I did not realise that it was just for the last "sysfs/procfs" time.
I'll take a closer look, and wonder if that can explain why I have been
unable to catch the lingering >= 10 mSec stuff.
> (The per-cpu
> "usec" values are unchanged.) This is because we've noticed some
> really weird delays in doing things like reading /proc/interrupts and
> we want to be able to easily do A/B comparisons by simply including or
> excluding counters.
Yes, I saw the patch email on the linux-pm email list and have included it
in my local turbostat for about a week now.
> Also FYI, The scheme of migrating to each CPU so that collecting stats
> there will be "local" isn't scaling so well on very large systems, and
> I'm about to take a close look at it. In yogini we used a different
> scheme, where a thread is bound to each CPU, so they can collect in
> parallel; and we may be moving to something like that.
>
> cheers,
> Len Brown, Intel Open Source Technology Center
[1] https://lore.kernel.org/lkml/001b01db608a$56d3dc40$047b94c0$@telus.net/
... Doug
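For illustration of the parallel collection scheme Len describes (one thread bound to each CPU, collecting locally instead of one thread migrating across all CPUs), a minimal sketch could look like the following; this is not yogini or turbostat code, and the actual counter reads are just a placeholder:

/* Sketch: one collector thread pinned per CPU (illustrative only). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *collect(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		return NULL;

	/* placeholder for the per-CPU counter reads (MSRs, sysfs, ...) */
	printf("cpu %ld: collected\n", cpu);
	return NULL;
}

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t *tid = calloc(ncpus, sizeof(*tid));

	if (!tid)
		return 1;
	for (long cpu = 0; cpu < ncpus; cpu++)
		pthread_create(&tid[cpu], NULL, collect, (void *)cpu);
	for (long cpu = 0; cpu < ncpus; cpu++)
		pthread_join(tid[cpu], NULL);
	free(tid);
	return 0;
}

Pinned per-CPU threads avoid the serialized cpu_migrate() round entirely, at the cost of coordinating the start and stop of collection across threads.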
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-14 10:58 ` Peter Zijlstra
2025-01-14 15:15 ` Doug Smythies
@ 2025-01-19 0:09 ` Doug Smythies
2025-01-20 3:55 ` Doug Smythies
2025-01-21 8:49 ` Peter Zijlstra
1 sibling, 2 replies; 277+ messages in thread
From: Doug Smythies @ 2025-01-19 0:09 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
Hi Peter,
An update.
On 2025.01.14 02:59 Peter Zijlstra wrote:
> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>>> means that there were 19 occurrences of turbostat interval times
>>> between 1.016 and 1.016999 seconds.
>>
>> OK, let me lower my threshold to 10ms and change the turbostat
>> invocation -- see if I can catch me some wabbits :-)
>
> I've had it run overnight and have not caught a single >10ms event :-(
Okay, so both you and I have many many hours of testing and
never see >= 10ms in that area of the turbostat code anymore.
The lingering >= 10ms (but I have never seen more than 25 ms)
is outside of that timing. As previously reported, I thought it might
be in the sampling interval sleep step, but I did a bunch of testing
and it doesn't appear to be there. That leaves:
delta_platform(&platform_counters_even, &platform_counters_odd);
compute_average(ODD_COUNTERS);
format_all_counters(ODD_COUNTERS);
flush_output_stdout();
I modified your tracing trigger thing in turbostat to this:
doug@s19:~/kernel/linux/tools/power/x86/turbostat$ git diff turbostat.c
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 58a487c225a7..777efb64a754 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -67,6 +67,7 @@
#include <stdbool.h>
#include <assert.h>
#include <linux/kernel.h>
+#include <sys/syscall.h>
#define UNUSED(x) (void)(x)
@@ -2704,7 +2705,7 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
struct timeval tv;
timersub(&t->tv_end, &t->tv_begin, &tv);
- outp += sprintf(outp, "%5ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
+ outp += sprintf(outp, "%7ld\t", tv.tv_sec * 1000000 + tv.tv_usec);
}
/* Time_Of_Day_Seconds: on each row, print sec.usec last timestamp taken */
@@ -2713,6 +2714,11 @@ int format_counters(struct thread_data *t, struct core_data *c, struct pkg_data
interval_float = t->tv_delta.tv_sec + t->tv_delta.tv_usec / 1000000.0;
+ double requested_interval = (double) interval_tv.tv_sec + (double) interval_tv.tv_usec / 1000000.0;
+
+ if(interval_float >= (requested_interval + 0.01)) /* was the last interval over by more than 10 mSec? */
+ syscall(__NR_gettimeofday, &tv_delta, (void*)1);
+
tsc = t->tsc * tsc_tweak;
/* topo columns, print blanks on 1st (average) line */
@@ -4570,12 +4576,14 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
int i;
int status;
+ gettimeofday(&t->tv_begin, (struct timezone *)NULL); /* doug test */
+
if (cpu_migrate(cpu)) {
fprintf(outf, "%s: Could not migrate to CPU %d\n", __func__, cpu);
return -1;
}
- gettimeofday(&t->tv_begin, (struct timezone *)NULL);
+// gettimeofday(&t->tv_begin, (struct timezone *)NULL);
if (first_counter_read)
get_apic_id(t);
And so that I could prove a correlation between the trace times
and my graph times, I also did not turn off tracing upon a hit:
doug@s19:~/kernel/linux$ git diff kernel/time/time.c
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 1b69caa87480..fb84915159cc 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -149,6 +149,12 @@ SYSCALL_DEFINE2(gettimeofday, struct __kernel_old_timeval __user *, tv,
return -EFAULT;
}
if (unlikely(tz != NULL)) {
+ if (tz == (void*)1) {
+ trace_printk("WHOOPSIE!\n");
+// tracing_off();
+ return 0;
+ }
+
if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
return -EFAULT;
}
I ran a test for about 1 hour and 28 minutes.
The data in the trace correlates, line by line, with the turbostat
time-of-day (TOD) differentials. The trace got:
turbostat-1370 [011] ..... 751.738151: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 760.763184: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1362.788298: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1365.815332: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1366.836340: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1367.856355: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1368.867365: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1373.893423: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1374.910439: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1377.928469: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1378.941483: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1379.959490: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1382.982525: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1385.005548: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1386.019561: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1387.030572: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1398.097683: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1620.752963: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1621.772969: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1622.788972: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1697.022098: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1703.071104: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1704.088103: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1705.105107: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1706.116106: __x64_sys_gettimeofday: WHOOPSIE!
turbostat-1370 [011] ..... 1707.126107: __x64_sys_gettimeofday: WHOOPSIE!
Going back to some old test data from when the CPU migration in turbostat
often took up to 6 seconds: if I subtract that migration time from the measured
interval time, I get a lot of samples between 10 and 23 ms.
I am saying there were 2 different issues. The 2nd was hidden by the 1st
because its magnitude was about 260 times smaller.
I do not know if my trace is any use. I'll compress it and send it to you only, off list.
My trace is as per this older email:
https://lore.kernel.org/all/20240727105030.226163742@infradead.org/T/#m453062b267551ff4786d33a2eb5f326f92241e96
^ permalink raw reply related [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-19 0:09 ` Doug Smythies
@ 2025-01-20 3:55 ` Doug Smythies
2025-01-21 11:06 ` Peter Zijlstra
2025-01-21 8:49 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-20 3:55 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
Hi Peter,
I now know that the 2nd issue I mentioned yesterday
is a completely separate issue. I would have to do
a new kernel bisection to isolate it and then start
a new thread with whomever about it.
On 2025.01.18 16:09 Doug Smythies wrote:
> Hi Peter,
>
> An update.
>
> On 2025.01.14 02:59 Peter Zijlstra wrote:
>> On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
>>> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>
>>>> means that there were 19 occurrences of turbostat interval times
>>>> between 1.016 and 1.016999 seconds.
>>>
>>> OK, let me lower my threshold to 10ms and change the turbostat
>>> invocation -- see if I can catch me some wabbits :-)
>>
>> I've had it run overnight and have not caught a single >10ms event :-(
>
> Okay, so both you and I have many many hours of testing and
> never see >= 10ms in that area of the turbostat code anymore.
>
> The lingering >= 10ms (but I have never seen more than 25 ms)
> is outside of that timing.
... snip ...
> I am saying there were 2 different issues. The 2nd was hidden by the 1st
> because its magnitude was about 260 times less.
The first issue was solved by your two commits from this thread, which are
now in kernel 6.13:
66951e4860d3 sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
6d71a9c61604 sched/fair: Fix EEVDF entity placement bug causing scheduling lag
The second issue is not present in my original kernel bisection for the first
bad kernel and must have been introduced later on.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-19 0:09 ` Doug Smythies
2025-01-20 3:55 ` Doug Smythies
@ 2025-01-21 8:49 ` Peter Zijlstra
2025-01-21 11:21 ` Peter Zijlstra
1 sibling, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-21 8:49 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Sat, Jan 18, 2025 at 04:09:02PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> An update.
>
> On 2025.01.14 02:59 Peter Zijlstra wrote:
> > On Mon, Jan 13, 2025 at 12:03:12PM +0100, Peter Zijlstra wrote:
> >> On Sun, Jan 12, 2025 at 03:14:17PM -0800, Doug Smythies wrote:
>
> >>> means that there were 19 occurrences of turbostat interval times
> >>> between 1.016 and 1.016999 seconds.
> >>
> >> OK, let me lower my threshold to 10ms and change the turbostat
> >> invocation -- see if I can catch me some wabbits :-)
> >
> > I've had it run overnight and have not caught a single >10ms event :-(
>
> Okay, so both you and I have many many hours of testing and
> never see >= 10ms in that area of the turbostat code anymore.
Hehe, yeah, I actually had it run for 3 solid days in the end.
> The lingering >= 10ms (but I have never seen more than 25 ms)
> is outside of that timing. As previously reported, I thought it might
> be in the sampling interval sleep step, but I did a bunch of testing
> and it doesn't appear to be there. That leaves:
>
> delta_platform(&platform_counters_even, &platform_counters_odd);
> compute_average(ODD_COUNTERS);
> format_all_counters(ODD_COUNTERS);
> flush_output_stdout();
>
> I modified your tracing trigger thing in turbostat to this:
Shiny!
What turbostat invocation do I use? I think the last I had was:
tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
I've started a new run of yes-vs-turbostat with the modified trigger
condition. Let's see what pops out.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-20 3:55 ` Doug Smythies
@ 2025-01-21 11:06 ` Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-21 11:06 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Sun, Jan 19, 2025 at 07:55:56PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> I now know that the 2nd issue I mentioned yesterday
> is a completely separate issue. I would have to do
> a new kernel bisection to isolate it and then start
> a new thread with whomever about it.
Yeah, let's not make things complicated and just carry on with the fun
and games as is :-)
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-21 8:49 ` Peter Zijlstra
@ 2025-01-21 11:21 ` Peter Zijlstra
2025-01-21 15:58 ` Doug Smythies
0 siblings, 1 reply; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-21 11:21 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote:
> > I modified your tracing trigger thing in turbostat to this:
>
> Shiny!
>
> What turbostat invocation do I use? I think the last I had was:
>
> tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
>
> I've started a new run of yes-vs-turbostat with the modified trigger
> condition. Let's see what pops out.
Ok, I have a trace.
So I see turbostat wake up on CPU 15, do its migration round 0-15 and
when its back at 15 it prints the WHOOPSIE.
(trimmed trace):
yes-1169 [015] dNh4. 4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015
yes-1169 [015] d..2. 4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100
migration/15-158 [015] d..3. 4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0
migration/0-20 [000] d..3. 4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1
migration/1-116 [001] d..3. 4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2
migration/2-25 [002] d..3. 4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3
migration/3-122 [003] d..3. 4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4
migration/4-31 [004] d..3. 4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5
migration/5-128 [005] d..3. 4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6
migration/6-37 [006] d..3. 4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7
migration/7-134 [007] d..3. 4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8
migration/8-43 [008] d..3. 4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9
migration/9-140 [009] d..3. 4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10
migration/10-49 [010] d..3. 4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11
migration/11-146 [011] d..3. 4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12
migration/12-55 [012] d..3. 4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13
migration/13-152 [013] d..3. 4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14
migration/14-62 [014] d..3. 4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15
yes-1169 [015] d..2. 4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100
turbostat-1185 [015] ..... 4238.262189: __x64_sys_gettimeofday: WHOOPSIE!
The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430
or 430us, which doesn't seem excessive to me.
Let me go read this turbostat code to figure out what exactly the
trigger condition signifies. Because I'm not seeing nothing weird here.
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-21 11:21 ` Peter Zijlstra
@ 2025-01-21 15:58 ` Doug Smythies
2025-01-24 4:34 ` Doug Smythies
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-21 15:58 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
On 2025.01.21 03:22 Peter Zijlstra wrote:
>On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote:
>
>>> I modified your tracing trigger thing in turbostat to this:
>>
>> Shiny!
>>
>> What turbostat invocation do I use? I think the last I had was:
>>
>> tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
>>
>> I've started a new run of yes-vs-turbostat with the modified trigger
>> condition. Let's see what pops out.
>
> Ok, I have a trace.
>
> So I see turbostat wake up on CPU 15, do its migration round 0-15 and
> when it's back at 15 it prints the WHOOPSIE.
>
> (trimmed trace):
>
> yes-1169 [015] dNh4. 4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015
> yes-1169 [015] d..2. 4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100
> migration/15-158 [015] d..3. 4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0
> migration/0-20 [000] d..3. 4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1
> migration/1-116 [001] d..3. 4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2
> migration/2-25 [002] d..3. 4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3
> migration/3-122 [003] d..3. 4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4
> migration/4-31 [004] d..3. 4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5
> migration/5-128 [005] d..3. 4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6
> migration/6-37 [006] d..3. 4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7
> migration/7-134 [007] d..3. 4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8
> migration/8-43 [008] d..3. 4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9
> migration/9-140 [009] d..3. 4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10
> migration/10-49 [010] d..3. 4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11
> migration/11-146 [011] d..3. 4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12
> migration/12-55 [012] d..3. 4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13
> migration/13-152 [013] d..3. 4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14
> migration/14-62 [014] d..3. 4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15
> yes-1169 [015] d..2. 4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100
> turbostat-1185 [015] ..... 4238.262189: __x64_sys_gettimeofday: WHOOPSIE!
>
> The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430
> or 430us, which doesn't seem excessive to me.
>
> Let me go read this turbostat code to figure out what exactly the
> trigger condition signifies. Because I'm not seeing nothing weird here.
I think the anomaly would have been about 1 second ago, on CPU 15,
and before entering sleep.
But after the previous call to the time of day stuff.
Somewhere in this code:
delta_platform(&platform_counters_even, &platform_counters_odd);
compute_average(ODD_COUNTERS);
format_all_counters(ODD_COUNTERS);
flush_output_stdout();
Please know that I ran a couple of tests yesterday for a total of about 8 hours
and never got a measured interval time >= 10 mSec.
I was using kernel 6.13, which includes your 2 patches, and I tried a slight
modification to the turbostat command:
sudo ./turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1 --out /dev/shm/turbo.log
That allowed me to acquire more than my ssh session history limit of about 9000 lines (seconds) and also eliminated ssh communications.
It was on purpose that I used RAM to write the log file to.
^ permalink raw reply [flat|nested] 277+ messages in thread
* RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-21 15:58 ` Doug Smythies
@ 2025-01-24 4:34 ` Doug Smythies
2025-01-24 11:04 ` Peter Zijlstra
0 siblings, 1 reply; 277+ messages in thread
From: Doug Smythies @ 2025-01-24 4:34 UTC (permalink / raw)
To: 'Peter Zijlstra'
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel,
Doug Smythies
On 2025.01.21 Doug Smythies wrote:
> On 2025.01.21 03:22 Peter Zijlstra wrote:
>> On Tue, Jan 21, 2025 at 09:49:08AM +0100, Peter Zijlstra wrote:
>>
>>>> I modified your tracing trigger thing in turbostat to this:
>>>
>>> Shiny!
>>>
>>> What turbostat invocation do I use? I think the last I had was:
>>>
>>> tools/power/x86/turbostat/turbostat --quiet --show Busy%,IRQ,Time_Of_Day_Seconds,CPU,usec --interval 1
>>>
>>> I've started a new run of yes-vs-turbostat with the modified trigger
>>> condition. Let's see what pops out.
>>
>> Ok, I have a trace.
>>
>> So I see turbostat wake up on CPU 15, do its migration round 0-15 and
>> when it's back at 15 it prints the WHOOPSIE.
>>
>> (trimmed trace):
>>
>> yes-1169 [015] dNh4. 4238.261759: sched_wakeup: comm=turbostat pid=1185 prio=100 target_cpu=015
>> yes-1169 [015] d..2. 4238.261761: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R ==> next_comm=turbostat next_pid=1185 next_prio=100
>> migration/15-158 [015] d..3. 4238.261977: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=15 dest_cpu=0
>> migration/0-20 [000] d..3. 4238.261991: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=0 dest_cpu=1
>> migration/1-116 [001] d..3. 4238.262003: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=1 dest_cpu=2
>> migration/2-25 [002] d..3. 4238.262018: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=2 dest_cpu=3
>> migration/3-122 [003] d..3. 4238.262031: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=3 dest_cpu=4
>> migration/4-31 [004] d..3. 4238.262044: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=4 dest_cpu=5
>> migration/5-128 [005] d..3. 4238.262057: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=5 dest_cpu=6
>> migration/6-37 [006] d..3. 4238.262071: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=6 dest_cpu=7
>> migration/7-134 [007] d..3. 4238.262084: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=7 dest_cpu=8
>> migration/8-43 [008] d..3. 4238.262097: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=8 dest_cpu=9
>> migration/9-140 [009] d..3. 4238.262109: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=9 dest_cpu=10
>> migration/10-49 [010] d..3. 4238.262123: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=10 dest_cpu=11
>> migration/11-146 [011] d..3. 4238.262136: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=11 dest_cpu=12
>> migration/12-55 [012] d..3. 4238.262150: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=12 dest_cpu=13
>> migration/13-152 [013] d..3. 4238.262164: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=13 dest_cpu=14
>> migration/14-62 [014] d..3. 4238.262177: sched_migrate_task: comm=turbostat pid=1185 prio=100 orig_cpu=14 dest_cpu=15
>> yes-1169 [015] d..2. 4238.262182: sched_switch: prev_comm=yes prev_pid=1169 prev_prio=120 prev_state=R+ ==> next_comm=turbostat next_pid=1185 next_prio=100
>> turbostat-1185 [015] ..... 4238.262189: __x64_sys_gettimeofday: WHOOPSIE!
>>
>> The time between wakeup and whoopsie 4238.262189-4238.261759 = .000430
>> or 430us, which doesn't seem excessive to me.
>>
>> Let me go read this turbostat code to figure out what exactly the
>> trigger condition signifies. Because I'm not seeing nothing weird here.
>
> I think the anomaly would have been about 1 second ago, on CPU 15,
> and before entering sleep.
> But after the previous call to the time of day stuff.
>
> Somewhere in this code:
>
> delta_platform(&platform_counters_even, &platform_counters_odd);
> compute_average(ODD_COUNTERS);
> format_all_counters(ODD_COUNTERS);
> flush_output_stdout();
>
> Please know that I ran a couple of tests yesterday for a total of about 8 hours
> and never got a measured interval time >= 10 mSec.
> I was using kernel 6.13, which includes your 2 patches, and I tried a slight
> modification to the turbostat command:
>
> sudo ./turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz,Time_Of_Day_Seconds,usec --interval 1 --out /dev/shm/turbo.log
>
> That allowed me to acquire more than my ssh session history limit of about 9000 lines (seconds) and also eliminated ssh communications.
> It was on purpose that I used RAM to write the log file to.
I have run more tests over the last couple of days, totalling over 30 hours.
I simply do not get a measured interval time >= 10mSec using kernel 6.13.
The previous work was kernel 6.13-rc6 + the 2 patches + the tracing stuff.
I never tried kernel 6.13-rc7.
^ permalink raw reply [flat|nested] 277+ messages in thread
* Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
2025-01-24 4:34 ` Doug Smythies
@ 2025-01-24 11:04 ` Peter Zijlstra
0 siblings, 0 replies; 277+ messages in thread
From: Peter Zijlstra @ 2025-01-24 11:04 UTC (permalink / raw)
To: Doug Smythies
Cc: linux-kernel, vincent.guittot, 'Ingo Molnar', wuyun.abel
On Thu, Jan 23, 2025 at 08:34:57PM -0800, Doug Smythies wrote:
> I have run more tests over the last couple of days, totalling over 30 hours.
> I simply do not get a measured interval time >= 10mSec using kernel 6.13.
> The previous work was kernel 6.13-rc6 + the 2 patches + the tracing stuff.
> I never tried kernel 6.13-rc7.
OK, let's close this for now then. Feel free to contact me again if you
find anything else.
Thanks for all the useful input!
^ permalink raw reply [flat|nested] 277+ messages in thread
end of thread, other threads:[~2025-01-24 11:04 UTC | newest]
Thread overview: 277+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-27 10:27 [PATCH 00/24] Complete EEVDF Peter Zijlstra
2024-07-27 10:27 ` [PATCH 01/24] sched/eevdf: Add feature comments Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 02/24] sched/eevdf: Remove min_vruntime_copy Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 03/24] sched/fair: Cleanup pick_task_fair() vs throttle Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 04/24] sched/fair: Cleanup pick_task_fair()s curr Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] sched/fair: Cleanup pick_task_fair()'s curr tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 05/24] sched/fair: Unify pick_{,next_}_task_fair() Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 06/24] sched: Allow sched_class::dequeue_task() to fail Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 07/24] sched/fair: Re-organize dequeue_task_fair() Peter Zijlstra
2024-08-09 16:53 ` Valentin Schneider
2024-08-10 22:17 ` Peter Zijlstra
2024-08-12 10:02 ` Valentin Schneider
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 08/24] sched: Split DEQUEUE_SLEEP from deactivate_task() Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 09/24] sched: Prepare generic code for delayed dequeue Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 10/24] sched/uclamg: Handle " Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-08-19 9:14 ` Christian Loehle
2024-08-20 16:23 ` [PATCH 10/24] " Hongyan Xia
2024-08-21 13:34 ` Hongyan Xia
2024-08-22 8:19 ` Vincent Guittot
2024-08-22 8:21 ` Vincent Guittot
2024-08-22 9:21 ` Luis Machado
2024-08-22 9:53 ` Vincent Guittot
2024-08-22 10:20 ` Vincent Guittot
2024-08-22 10:28 ` Luis Machado
2024-08-22 12:07 ` Luis Machado
2024-08-22 12:10 ` Vincent Guittot
2024-08-22 14:58 ` Vincent Guittot
2024-08-29 15:42 ` Hongyan Xia
2024-09-05 13:02 ` Dietmar Eggemann
2024-09-05 13:33 ` Vincent Guittot
2024-09-05 14:07 ` Dietmar Eggemann
2024-09-05 14:29 ` Vincent Guittot
2024-09-05 14:50 ` Dietmar Eggemann
2024-09-05 14:53 ` Peter Zijlstra
2024-09-06 6:14 ` Vincent Guittot
2024-09-06 10:45 ` Peter Zijlstra
2024-09-08 7:43 ` Mike Galbraith
2024-09-10 8:09 ` [tip: sched/core] sched/eevdf: More PELT vs DELAYED_DEQUEUE tip-bot2 for Peter Zijlstra
2024-11-27 4:17 ` K Prateek Nayak
2024-11-27 9:34 ` Luis Machado
2024-11-28 6:35 ` K Prateek Nayak
2024-09-10 11:04 ` [PATCH 10/24] sched/uclamg: Handle delayed dequeue Luis Machado
2024-09-10 14:05 ` Peter Zijlstra
2024-09-11 8:35 ` Luis Machado
2024-09-11 8:45 ` Peter Zijlstra
2024-09-11 8:55 ` Luis Machado
2024-09-11 9:10 ` Mike Galbraith
2024-09-11 9:13 ` Peter Zijlstra
2024-09-11 9:27 ` Mike Galbraith
2024-09-12 14:00 ` Mike Galbraith
2024-09-13 16:39 ` Mike Galbraith
2024-09-14 3:40 ` Mike Galbraith
2024-09-24 15:16 ` Luis Machado
2024-09-24 17:35 ` Mike Galbraith
2024-09-25 5:14 ` Mike Galbraith
2024-09-11 11:49 ` Dietmar Eggemann
2024-09-11 9:38 ` Luis Machado
2024-09-12 12:58 ` Luis Machado
2024-09-12 20:44 ` Dietmar Eggemann
2024-09-11 10:46 ` Luis Machado
2024-09-06 9:55 ` Dietmar Eggemann
2024-09-05 14:18 ` Peter Zijlstra
2024-09-10 8:09 ` [tip: sched/core] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE tip-bot2 for Dietmar Eggemann
2024-07-27 10:27 ` [PATCH 11/24] sched/fair: Assert {set_next,put_prev}_entity() are properly balanced Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 12/24] sched/fair: Prepare exit/cleanup paths for delayed_dequeue Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
2024-08-13 21:54 ` Peter Zijlstra
2024-08-13 22:07 ` Peter Zijlstra
2024-08-14 5:53 ` Peter Zijlstra
2024-08-27 9:35 ` Chen Yu
2024-08-27 20:29 ` Valentin Schneider
2024-08-28 2:55 ` Chen Yu
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-08-27 9:17 ` [PATCH 12/24] " Chen Yu
2024-08-28 3:06 ` Chen Yu
2024-07-27 10:27 ` [PATCH 13/24] sched/fair: Prepare pick_next_task() for delayed dequeue Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-09-10 9:16 ` [PATCH 13/24] " Luis Machado
2024-07-27 10:27 ` [PATCH 14/24] sched/fair: Implement ENQUEUE_DELAYED Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 15/24] sched,freezer: Mark TASK_FROZEN special Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 16/24] sched: Teach dequeue_task() about special task states Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Peter Zijlstra
2024-08-02 14:39 ` Valentin Schneider
2024-08-02 14:59 ` Peter Zijlstra
2024-08-02 16:32 ` Valentin Schneider
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-08-19 10:01 ` [PATCH 17/24] " Luis Machado
[not found] ` <CGME20240828223802eucas1p16755f4531ed0611dc4871649746ea774@eucas1p1.samsung.com>
2024-08-28 22:38 ` Marek Szyprowski
2024-10-10 2:49 ` Sean Christopherson
2024-10-10 7:57 ` Mike Galbraith
2024-10-10 16:18 ` Sean Christopherson
2024-10-10 17:12 ` Mike Galbraith
2024-10-10 8:19 ` Peter Zijlstra
2024-10-10 9:18 ` Peter Zijlstra
2024-10-10 18:23 ` Sean Christopherson
2024-10-12 14:15 ` [tip: sched/urgent] sched: Fix external p->on_rq users tip-bot2 for Peter Zijlstra
2024-10-14 7:28 ` [tip: sched/urgent] sched/fair: " tip-bot2 for Peter Zijlstra
2024-11-01 12:47 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Phil Auld
2024-11-01 12:56 ` Peter Zijlstra
2024-11-01 13:38 ` Phil Auld
2024-11-01 14:26 ` Peter Zijlstra
2024-11-01 14:42 ` Phil Auld
2024-11-01 18:08 ` Mike Galbraith
2024-11-01 20:07 ` Phil Auld
2024-11-02 4:32 ` Mike Galbraith
2024-11-04 13:05 ` Phil Auld
2024-11-05 4:05 ` Mike Galbraith
2024-11-05 4:22 ` K Prateek Nayak
2024-11-05 6:46 ` Mike Galbraith
2024-11-06 3:02 ` K Prateek Nayak
2024-11-05 15:20 ` Phil Auld
2024-11-05 19:05 ` Phil Auld
2024-11-06 2:45 ` Mike Galbraith
2024-11-06 13:53 ` Peter Zijlstra
2024-11-06 14:14 ` Peter Zijlstra
2024-11-06 14:38 ` Peter Zijlstra
2024-11-06 15:22 ` Mike Galbraith
2024-11-07 4:03 ` Mike Galbraith
2024-11-07 9:46 ` Mike Galbraith
2024-11-07 14:02 ` Mike Galbraith
2024-11-07 14:09 ` Peter Zijlstra
2024-11-08 0:24 ` [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU Mike Galbraith
2024-11-08 13:34 ` Phil Auld
2024-11-11 2:46 ` Xuewen Yan
2024-11-11 3:53 ` Mike Galbraith
2024-11-12 7:05 ` Mike Galbraith
2024-11-12 12:41 ` Phil Auld
2024-11-12 14:23 ` Peter Zijlstra
2024-11-12 14:23 ` Mike Galbraith
2024-11-12 15:41 ` Phil Auld
2024-11-12 16:15 ` Mike Galbraith
2024-11-14 11:07 ` Mike Galbraith
2024-11-14 11:28 ` Phil Auld
2024-11-19 11:30 ` Phil Auld
2024-11-19 11:51 ` Mike Galbraith
2024-11-20 18:37 ` Mike Galbraith
2024-11-21 11:56 ` Phil Auld
2024-11-21 12:07 ` Phil Auld
2024-11-21 21:21 ` Phil Auld
2024-11-23 8:44 ` [PATCH V2] " Mike Galbraith
2024-11-26 5:32 ` K Prateek Nayak
2024-11-26 6:30 ` Mike Galbraith
2024-11-26 9:42 ` Mike Galbraith
2024-12-02 19:15 ` Phil Auld
2024-11-27 14:13 ` Mike Galbraith
2024-12-02 16:24 ` Phil Auld
2024-12-02 16:55 ` Mike Galbraith
2024-12-02 19:12 ` Phil Auld
2024-12-09 13:11 ` Phil Auld
2024-12-09 15:06 ` Mike Galbraith
2024-11-06 14:14 ` [PATCH 17/24] sched/fair: Implement delayed dequeue Mike Galbraith
2024-11-06 14:33 ` Peter Zijlstra
2024-11-04 9:28 ` Dietmar Eggemann
2024-11-04 11:55 ` Dietmar Eggemann
2024-11-04 12:50 ` Phil Auld
2024-11-05 9:53 ` Christian Loehle
2024-11-05 15:55 ` Phil Auld
2024-11-08 14:53 ` Dietmar Eggemann
2024-11-08 18:16 ` Phil Auld
2024-11-11 11:29 ` Dietmar Eggemann
2024-07-27 10:27 ` [PATCH 18/24] sched/fair: Implement DELAY_ZERO Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 19/24] sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE Peter Zijlstra
2024-08-13 12:43 ` Valentin Schneider
2024-08-13 22:18 ` Peter Zijlstra
2024-08-14 7:25 ` Peter Zijlstra
2024-08-14 7:28 ` Peter Zijlstra
2024-08-14 10:23 ` Valentin Schneider
2024-08-14 12:59 ` Vincent Guittot
2024-08-17 23:06 ` Peter Zijlstra
2024-08-19 12:50 ` Vincent Guittot
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 20/24] sched/fair: Avoid re-setting virtual deadline on migrations Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] sched/fair: Avoid re-setting virtual deadline on 'migrations' tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 21/24] sched/eevdf: Allow shorter slices to wakeup-preempt Peter Zijlstra
2024-08-05 12:24 ` Chunxin Zang
2024-08-07 17:54 ` Peter Zijlstra
2024-08-13 10:44 ` Chunxin Zang
2024-08-08 10:15 ` Chen Yu
2024-08-08 10:22 ` Peter Zijlstra
2024-08-08 12:31 ` Chen Yu
2024-08-09 7:35 ` Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 22/24] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-07-27 10:27 ` [PATCH 23/24] sched/eevdf: Propagate min_slice up the cgroup hierarchy Peter Zijlstra
2024-08-18 6:23 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2024-09-29 2:02 ` [PATCH 23/24] " Tianchen Ding
2024-07-27 10:27 ` [RFC PATCH 24/24] sched/time: Introduce CLOCK_THREAD_DVFS_ID Peter Zijlstra
2024-07-28 21:30 ` Thomas Gleixner
2024-07-29 7:53 ` Juri Lelli
2024-08-02 11:29 ` Peter Zijlstra
2024-08-19 11:11 ` Christian Loehle
2024-08-01 12:08 ` [PATCH 00/24] Complete EEVDF Luis Machado
2024-08-14 14:34 ` Vincent Guittot
2024-08-14 16:45 ` Mike Galbraith
2024-08-14 16:59 ` Vincent Guittot
2024-08-14 17:18 ` Mike Galbraith
2024-08-14 17:25 ` Vincent Guittot
2024-08-14 17:35 ` K Prateek Nayak
2024-08-16 15:22 ` Valentin Schneider
2024-08-20 16:43 ` Hongyan Xia
2024-08-21 9:46 ` Hongyan Xia
2024-08-21 16:25 ` Mike Galbraith
2024-08-22 15:55 ` Peter Zijlstra
2024-08-27 9:43 ` Hongyan Xia
2024-08-29 17:02 ` Aleksandr Nogikh
2024-09-10 11:45 ` Sven Schnelle
2024-09-10 12:21 ` Sven Schnelle
2024-09-10 14:07 ` Peter Zijlstra
2024-09-10 14:52 ` Sven Schnelle
2024-11-06 1:07 ` Saravana Kannan
2024-11-06 6:19 ` K Prateek Nayak
2024-11-06 11:09 ` Peter Zijlstra
2024-11-06 12:06 ` Luis Machado
2024-11-08 7:07 ` Saravana Kannan
2024-11-08 23:17 ` Samuel Wu
2024-11-11 4:07 ` K Prateek Nayak
2024-11-26 23:32 ` Saravana Kannan
2024-11-28 10:32 ` [REGRESSION] " Marcel Ziswiler
2024-11-28 10:58 ` Peter Zijlstra
2024-11-28 11:37 ` Marcel Ziswiler
2024-11-29 9:08 ` Peter Zijlstra
2024-12-02 18:46 ` Marcel Ziswiler
2024-12-09 9:49 ` Peter Zijlstra
2024-12-10 16:05 ` Marcel Ziswiler
2024-12-10 16:13 ` Steven Rostedt
2024-12-10 8:45 ` Luis Machado
-- strict thread matches above, loose matches on Subject: below --
2024-12-29 22:51 Doug Smythies
2025-01-06 11:57 ` Peter Zijlstra
2025-01-06 15:01 ` Doug Smythies
2025-01-06 16:59 ` Peter Zijlstra
2025-01-06 17:04 ` Peter Zijlstra
2025-01-06 17:14 ` Peter Zijlstra
2025-01-07 1:24 ` Doug Smythies
2025-01-07 10:49 ` Peter Zijlstra
2025-01-06 22:28 ` Doug Smythies
2025-01-07 11:26 ` Peter Zijlstra
2025-01-07 15:04 ` Doug Smythies
2025-01-07 16:25 ` Doug Smythies
2025-01-07 19:23 ` Peter Zijlstra
2025-01-08 5:15 ` Doug Smythies
2025-01-08 13:12 ` Peter Zijlstra
2025-01-08 15:48 ` Doug Smythies
2025-01-09 10:59 ` Peter Zijlstra
2025-01-10 5:09 ` Doug Smythies
2025-01-10 11:57 ` Peter Zijlstra
2025-01-12 23:14 ` Doug Smythies
2025-01-13 11:03 ` Peter Zijlstra
2025-01-14 10:58 ` Peter Zijlstra
2025-01-14 15:15 ` Doug Smythies
2025-01-15 2:08 ` Len Brown
2025-01-15 16:47 ` Doug Smythies
2025-01-19 0:09 ` Doug Smythies
2025-01-20 3:55 ` Doug Smythies
2025-01-21 11:06 ` Peter Zijlstra
2025-01-21 8:49 ` Peter Zijlstra
2025-01-21 11:21 ` Peter Zijlstra
2025-01-21 15:58 ` Doug Smythies
2025-01-24 4:34 ` Doug Smythies
2025-01-24 11:04 ` Peter Zijlstra
2025-01-13 11:05 ` Peter Zijlstra
2025-01-13 16:01 ` Doug Smythies
2025-01-12 19:59 ` Doug Smythies