* [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
@ 2025-05-20 9:45 Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling Peter Zijlstra
` (6 more replies)
0 siblings, 7 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-20 9:45 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel, peterz
Hi!
So Chris poked me about how they're having a wee performance drop after around
6.11. He's extended his schbench tool to mimic the workload in question.
Specifically, the command line given is:
schbench -L -m 4 -M auto -t 128 -n 0 -r 60
This benchmark wants to stay on a single (large) LLC (Chris, perhaps add an
option to start the CPU mask with
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list or something). Both
the machine Chris has (SKL, 20+ cores per LLC) and the machines I ran this on
(SKL, SPR; 20+ cores) are Intel; AMD has smaller LLCs and the problem wasn't as
pronounced there.
Use the performance CPU governor (as always when benchmarking). Also, if the test
results are unstable as all heck, disable turbo.
After a fair amount of tinkering I managed to reproduce the issue on my SPR and Thomas'
SKL. The SKL would only give usable numbers with the second socket offline and
turbo disabled -- YMMV.
Chris further provided a bisect into the DELAY_DEQUEUE patches and a bisect
leading to commit 5f6bd380c7bd ("sched/rt: Remove default bandwidth control")
-- which enables the dl_server by default.
SKL (performance, no_turbo):
schbench-6.9.0-1.txt:average rps: 2040360.55
schbench-6.9.0-2.txt:average rps: 2038846.78
schbench-6.9.0-3.txt:average rps: 2037892.28
schbench-6.15.0-rc6+-1.txt:average rps: 1907718.18
schbench-6.15.0-rc6+-2.txt:average rps: 1906931.07
schbench-6.15.0-rc6+-3.txt:average rps: 1903190.38
schbench-6.15.0-rc6+-dirty-1.txt:average rps: 2002224.78
schbench-6.15.0-rc6+-dirty-2.txt:average rps: 2007116.80
schbench-6.15.0-rc6+-dirty-3.txt:average rps: 2005294.57
schbench-6.15.0-rc6+-dirty-delayed-1.txt:average rps: 2011282.15
schbench-6.15.0-rc6+-dirty-delayed-2.txt:average rps: 2016347.10
schbench-6.15.0-rc6+-dirty-delayed-3.txt:average rps: 2014515.47
schbench-6.15.0-rc6+-dirty-delayed-default-1.txt:average rps: 2042169.00
schbench-6.15.0-rc6+-dirty-delayed-default-2.txt:average rps: 2032789.77
schbench-6.15.0-rc6+-dirty-delayed-default-3.txt:average rps: 2040313.95
SPR (performance):
schbench-6.9.0-1.txt:average rps: 2975450.75
schbench-6.9.0-2.txt:average rps: 2975464.38
schbench-6.9.0-3.txt:average rps: 2974881.02
schbench-6.15.0-rc6+-1.txt:average rps: 2882537.37
schbench-6.15.0-rc6+-2.txt:average rps: 2881658.70
schbench-6.15.0-rc6+-3.txt:average rps: 2884293.37
schbench-6.15.0-rc6+-dl_server-1.txt:average rps: 2924423.18
schbench-6.15.0-rc6+-dl_server-2.txt:average rps: 2920422.63
schbench-6.15.0-rc6+-dirty-1.txt:average rps: 3011540.97
schbench-6.15.0-rc6+-dirty-2.txt:average rps: 3010124.10
schbench-6.15.0-rc6+-dirty-delayed-1.txt:average rps: 3030883.15
schbench-6.15.0-rc6+-dirty-delayed-2.txt:average rps: 3031627.05
schbench-6.15.0-rc6+-dirty-delayed-default-1.txt:average rps: 3053005.98
schbench-6.15.0-rc6+-dirty-delayed-default-2.txt:average rps: 3052972.80
As can be seen, the SPR is much easier to please than the SKL for whatever
reason. I'm thinking we can make TTWU_QUEUE_DELAYED default on, but I suspect
TTWU_QUEUE_DEFAULT might be a harder sell -- we'd need to run more than this
one benchmark.
Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
somewhat rough state. At the very least the last patch is missing ttwu_stat(),
still need to figure out how to account it ;-)
Chris, I'm hoping your machine will agree with these numbers; it hasn't been
straight sailing in that regard.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
@ 2025-05-20 9:45 ` Peter Zijlstra
2025-06-03 16:03 ` Juri Lelli
2025-05-20 9:45 ` [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
` (5 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-20 9:45 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel, peterz
Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.
His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hindsight and all that.
Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.
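In effect the teardown becomes lazy; paraphrasing the hunks below, the
decision point is the new dl_server_stopped() helper (shown here with a
bit of extra commentary):

static bool dl_server_stopped(struct sched_dl_entity *dl_se)
{
	if (!dl_se->dl_server_active)
		return false;

	if (dl_se->dl_server_idle) {
		/*
		 * The server was already flagged idle and no fair task has
		 * shown up since -- a whole period without work -- so stop
		 * it for real.
		 */
		__dl_server_stop(dl_se);
		return true;
	}

	/*
	 * First time the server is found without fair tasks: only flag it.
	 * dl_server_update() clears the flag the moment a fair task runs
	 * again, so the 0->1, 1->0 hammering never reaches __dl_server_stop().
	 */
	dl_se->dl_server_idle = 1;
	return false;
}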
Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 1 +
kernel/sched/deadline.c | 31 +++++++++++++++++++++++++++----
2 files changed, 28 insertions(+), 4 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -702,6 +702,7 @@ struct sched_dl_entity {
unsigned int dl_defer : 1;
unsigned int dl_defer_armed : 1;
unsigned int dl_defer_running : 1;
+ unsigned int dl_server_idle : 1;
/*
* Bandwidth enforcement timer. Each -deadline task has its
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1215,6 +1215,8 @@ static void __push_dl_task(struct rq *rq
/* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
+static bool dl_server_stopped(struct sched_dl_entity *dl_se);
+
static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
{
struct rq *rq = rq_of_dl_se(dl_se);
@@ -1234,6 +1236,7 @@ static enum hrtimer_restart dl_server_ti
if (!dl_se->server_has_tasks(dl_se)) {
replenish_dl_entity(dl_se);
+ dl_server_stopped(dl_se);
return HRTIMER_NORESTART;
}
@@ -1639,8 +1642,10 @@ void dl_server_update_idle_time(struct r
void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
{
/* 0 runtime = fair server disabled */
- if (dl_se->dl_runtime)
+ if (dl_se->dl_runtime) {
+ dl_se->dl_server_idle = 0;
update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+ }
}
void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1663,7 +1668,7 @@ void dl_server_start(struct sched_dl_ent
setup_new_dl_entity(dl_se);
}
- if (!dl_se->dl_runtime)
+ if (!dl_se->dl_runtime || dl_se->dl_server_active)
return;
dl_se->dl_server_active = 1;
@@ -1672,7 +1677,7 @@ void dl_server_start(struct sched_dl_ent
resched_curr(dl_se->rq);
}
-void dl_server_stop(struct sched_dl_entity *dl_se)
+static void __dl_server_stop(struct sched_dl_entity *dl_se)
{
if (!dl_se->dl_runtime)
return;
@@ -1684,6 +1689,24 @@ void dl_server_stop(struct sched_dl_enti
dl_se->dl_server_active = 0;
}
+static bool dl_server_stopped(struct sched_dl_entity *dl_se)
+{
+ if (!dl_se->dl_server_active)
+ return false;
+
+ if (dl_se->dl_server_idle) {
+ __dl_server_stop(dl_se);
+ return true;
+ }
+
+ dl_se->dl_server_idle = 1;
+ return false;
+}
+
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+}
+
void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick_task)
@@ -2435,7 +2458,7 @@ static struct task_struct *__pick_task_d
if (dl_server(dl_se)) {
p = dl_se->server_pick_task(dl_se);
if (!p) {
- if (dl_server_active(dl_se)) {
+ if (!dl_server_stopped(dl_se)) {
dl_se->dl_yielded = 1;
update_curr_dl_se(rq, dl_se, 0);
}
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq()
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling Peter Zijlstra
@ 2025-05-20 9:45 ` Peter Zijlstra
2025-06-09 5:01 ` Mike Galbraith
2025-05-20 9:45 ` [RFC][PATCH 3/5] sched: Split up ttwu_runnable() Peter Zijlstra
` (4 subsequent siblings)
6 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-20 9:45 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel, peterz
Optimize ttwu() by pushing select_idle_siblings() up above waiting for
on_cpu(). This allows the cycles otherwise spent waiting to be used to
search for an idle CPU.
One little detail is that since the task we're looking for an idle CPU
for might still be on the CPU, that CPU won't report as running the
idle task, and thus the search won't find the task's own CPU idle, even
when it is.
To compensate, remove the 'rq->curr == rq->idle' condition from
idle_cpu() -- it doesn't really make sense anyway.
Additionally, Chris found (concurrently) that perf-c2c reported that
test as being a cache-miss monster.
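The resulting order in try_to_wake_up() then roughly becomes (a
paraphrased excerpt of the hunk below):

	/* Pick a target CPU first ... */
	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);

	/*
	 * ... so the idle-CPU search overlaps with the wait for the remote
	 * CPU to finish switching away from @p, rather than only starting
	 * once the spin on p->on_cpu has completed.
	 */
	smp_cond_load_acquire(&p->on_cpu, !VAL);

	if (task_cpu(p) != cpu) {
		/* migrate to the selected CPU, as before */
	}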
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 3 ++-
kernel/sched/syscalls.c | 3 ---
2 files changed, 2 insertions(+), 4 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4305,6 +4305,8 @@ int try_to_wake_up(struct task_struct *p
ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
break;
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, wait until it's done referencing the task.
@@ -4316,7 +4318,6 @@ int try_to_wake_up(struct task_struct *p
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
delayacct_blkio_end(p);
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -203,9 +203,6 @@ int idle_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);
- if (rq->curr != rq->idle)
- return 0;
-
if (rq->nr_running)
return 0;
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 3/5] sched: Split up ttwu_runnable()
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
@ 2025-05-20 9:45 ` Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 4/5] sched: Add ttwu_queue controls Peter Zijlstra
` (3 subsequent siblings)
6 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-20 9:45 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel, peterz
Split up ttwu_runnable() in preparation for more changes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 43 +++++++++++++++++++++----------------------
kernel/sched/sched.h | 5 +++++
2 files changed, 26 insertions(+), 22 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3765,6 +3765,25 @@ ttwu_do_activate(struct rq *rq, struct t
#endif
}
+static int __ttwu_runnable(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
+ }
+ ttwu_do_wakeup(p);
+ return 1;
+}
+
/*
* Consider @p being inside a wait loop:
*
@@ -3792,28 +3811,8 @@ ttwu_do_activate(struct rq *rq, struct t
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
-
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
- /*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
- */
- wakeup_preempt(rq, p, wake_flags);
- }
- ttwu_do_wakeup(p);
- ret = 1;
- }
- __task_rq_unlock(rq, &rf);
-
- return ret;
+ CLASS(__task_rq_lock, guard)(p);
+ return __ttwu_runnable(guard.rq, p, wake_flags);
}
#ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1831,6 +1831,11 @@ task_rq_unlock(struct rq *rq, struct tas
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+ _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+ __task_rq_unlock(_T->rq, &_T->rf),
+ struct rq *rq; struct rq_flags rf)
+
DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
_T->rq = task_rq_lock(_T->lock, &_T->rf),
task_rq_unlock(_T->rq, _T->lock, &_T->rf),
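For reference, DEFINE_LOCK_GUARD_1() (from <linux/cleanup.h>) generates a
small 'class' whose constructor runs the lock expression and whose
destructor, attached via __attribute__((cleanup)), runs the unlock
expression at scope exit. Very roughly -- a simplified sketch, not the
literal macro output -- the guard added above amounts to:

typedef struct {
	struct task_struct *lock;	/* what CLASS(__task_rq_lock, guard)(p) gets passed */
	struct rq *rq;			/* the extra members declared above */
	struct rq_flags rf;
} class___task_rq_lock_t;

/* constructor: _T->rq = __task_rq_lock(_T->lock, &_T->rf); */
/* destructor:  __task_rq_unlock(_T->rq, &_T->rf);          */

so ttwu_runnable() ends up equivalent to the open-coded
__task_rq_lock()/__task_rq_unlock() pair it previously carried.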
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 4/5] sched: Add ttwu_queue controls
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
` (2 preceding siblings ...)
2025-05-20 9:45 ` [RFC][PATCH 3/5] sched: Split up ttwu_runnable() Peter Zijlstra
@ 2025-05-20 9:45 ` Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
` (2 subsequent siblings)
6 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-20 9:45 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel, peterz
There are two (soon three) callers of ttwu_queue_wakelist(),
distinguish them with their own WF_ flags and add some knobs on top.
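In short, the knobs end up meaning (paraphrasing the hunks below):

/*
 * TTWU_QUEUE         - gates the wakelist path in ttwu_queue() entirely
 * TTWU_QUEUE_ON_CPU  - gates the shortcut taken while the task is still
 *                      on_cpu in try_to_wake_up() (now tagged WF_ON_CPU)
 * TTWU_QUEUE_DEFAULT - what ttwu_queue_cond() falls back to when none of
 *                      its special cases apply (previously always false)
 */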
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 22 ++++++++++++----------
kernel/sched/features.h | 2 ++
kernel/sched/sched.h | 2 ++
3 files changed, 16 insertions(+), 10 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3928,7 +3928,7 @@ bool cpus_share_resources(int this_cpu,
return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
}
-static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, bool def)
{
/* See SCX_OPS_ALLOW_QUEUED_WAKEUP. */
if (!scx_allow_ttwu_queue(p))
@@ -3969,18 +3969,19 @@ static inline bool ttwu_queue_cond(struc
if (!cpu_rq(cpu)->nr_running)
return true;
- return false;
+ return def;
}
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
- if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu)) {
- sched_clock_cpu(cpu); /* Sync clocks across CPUs */
- __ttwu_queue_wakelist(p, cpu, wake_flags);
- return true;
- }
+ bool def = sched_feat(TTWU_QUEUE_DEFAULT);
- return false;
+ if (!ttwu_queue_cond(p, cpu, def))
+ return false;
+
+ sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+ __ttwu_queue_wakelist(p, cpu, wake_flags);
+ return true;
}
#else /* !CONFIG_SMP */
@@ -3997,7 +3998,7 @@ static void ttwu_queue(struct task_struc
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;
- if (ttwu_queue_wakelist(p, cpu, wake_flags))
+ if (sched_feat(TTWU_QUEUE) && ttwu_queue_wakelist(p, cpu, wake_flags))
return;
rq_lock(rq, &rf);
@@ -4301,7 +4302,8 @@ int try_to_wake_up(struct task_struct *p
* scheduling.
*/
if (smp_load_acquire(&p->on_cpu) &&
- ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
+ sched_feat(TTWU_QUEUE_ON_CPU) &&
+ ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
break;
cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,6 +81,8 @@ SCHED_FEAT(TTWU_QUEUE, false)
*/
SCHED_FEAT(TTWU_QUEUE, true)
#endif
+SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
+SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2312,6 +2312,8 @@ static inline int task_on_rq_migrating(s
#define WF_CURRENT_CPU 0x40 /* Prefer to move the wakee to the current CPU. */
#define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
+#define WF_ON_CPU 0x0100
+
#ifdef CONFIG_SMP
static_assert(WF_EXEC == SD_BALANCE_EXEC);
static_assert(WF_FORK == SD_BALANCE_FORK);
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
` (3 preceding siblings ...)
2025-05-20 9:45 ` [RFC][PATCH 4/5] sched: Add ttwu_queue controls Peter Zijlstra
@ 2025-05-20 9:45 ` Peter Zijlstra
2025-06-06 15:03 ` Vincent Guittot
2025-06-13 7:34 ` Dietmar Eggemann
2025-05-28 19:59 ` [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-06-02 4:44 ` K Prateek Nayak
6 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-20 9:45 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel, peterz
One of the things lost with introduction of DELAY_DEQUEUE is the
ability of TTWU to move those tasks around on wakeup, since they're
on_rq, and as such, need to be woken in-place.
Doing the in-place thing adds quite a bit of cross-cpu latency, add a
little something that gets remote CPUs to do their own in-place
wakeups, significantly reducing the rq->lock contention.
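In short, the wakeup of a sched_delayed task then becomes (a paraphrased
excerpt of the ttwu_runnable() hunk below):

	if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
		/*
		 * sched_delayed set means @p passed try_to_block_task(),
		 * so p->state is ours to modify.
		 */
		smp_acquire__after_ctrl_dep();
		WRITE_ONCE(p->__state, TASK_WAKING);

		/*
		 * Queue @p on its own CPU's wake list; that CPU does the
		 * in-place (re)enqueue from sched_ttwu_pending(), so the
		 * waker never touches the remote rq->lock.
		 */
		if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
			return 1;
	}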
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/fair.c | 5 ++-
kernel/sched/features.h | 1
kernel/sched/sched.h | 1
4 files changed, 72 insertions(+), 9 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
return 1;
}
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
/*
* Consider @p being inside a wait loop:
*
@@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
+#ifdef CONFIG_SMP
+ if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
+ /*
+ * Similar to try_to_block_task():
+ *
+ * __schedule() ttwu()
+ * prev_state = prev->state if (p->sched_delayed)
+ * if (prev_state) smp_acquire__after_ctrl_dep()
+ * try_to_block_task() p->state = TASK_WAKING
+ * ... set_delayed()
+ * RELEASE p->sched_delayed = 1
+ *
+ * __schedule() and ttwu() have matching control dependencies.
+ *
+ * Notably, once we observe sched_delayed we know the task has
+ * passed try_to_block_task() and p->state is ours to modify.
+ *
+ * TASK_WAKING controls ttwu() concurrency.
+ */
+ smp_acquire__after_ctrl_dep();
+ WRITE_ONCE(p->__state, TASK_WAKING);
+
+ if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
+ return 1;
+ }
+#endif
+
CLASS(__task_rq_lock, guard)(p);
return __ttwu_runnable(guard.rq, p, wake_flags);
}
@@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
update_rq_clock(rq);
llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
+ struct rq *p_rq = task_rq(p);
+ int ret;
+
+ /*
+ * This is the ttwu_runnable() case. Notably it is possible for
+ * on-rq entities to get migrated -- even sched_delayed ones.
+ */
+ if (unlikely(p_rq != rq)) {
+ rq_unlock(rq, &rf);
+ p_rq = __task_rq_lock(p, &rf);
+ }
+
+ ret = __ttwu_runnable(p_rq, p, WF_TTWU);
+
+ if (unlikely(p_rq != rq)) {
+ if (!ret)
+ set_task_cpu(p, cpu_of(rq));
+
+ __task_rq_unlock(p_rq, &rf);
+ rq_lock(rq, &rf);
+ update_rq_clock(rq);
+ }
+
+ if (ret) {
+ // XXX ttwu_stat()
+ continue;
+ }
+
+ /*
+ * This is the 'normal' case where the task is blocked.
+ */
+
if (WARN_ON_ONCE(p->on_cpu))
smp_cond_load_acquire(&p->on_cpu, !VAL);
- if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
- set_task_cpu(p, cpu_of(rq));
-
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
}
@@ -3974,7 +4032,7 @@ static inline bool ttwu_queue_cond(struc
static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
- bool def = sched_feat(TTWU_QUEUE_DEFAULT);
+ bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
if (!ttwu_queue_cond(p, cpu, def))
return false;
@@ -4269,8 +4327,8 @@ int try_to_wake_up(struct task_struct *p
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* Form a control-dep-acquire with p->on_rq == 0 above, to ensure
- * schedule()'s deactivate_task() has 'happened' and p will no longer
- * care about it's own p->state. See the comment in __schedule().
+ * schedule()'s try_to_block_task() has 'happened' and p will no longer
+ * care about it's own p->state. See the comment in try_to_block_task().
*/
smp_acquire__after_ctrl_dep();
@@ -6712,8 +6770,8 @@ static void __sched notrace __schedule(i
preempt = sched_mode == SM_PREEMPT;
/*
- * We must load prev->state once (task_struct::state is volatile), such
- * that we form a control dependency vs deactivate_task() below.
+ * We must load prev->state once, such that we form a control
+ * dependency vs try_to_block_task() below.
*/
prev_state = READ_ONCE(prev->__state);
if (sched_mode == SM_IDLE) {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5395,7 +5395,10 @@ static __always_inline void return_cfs_r
static void set_delayed(struct sched_entity *se)
{
- se->sched_delayed = 1;
+ /*
+ * See TTWU_QUEUE_DELAYED in ttwu_runnable().
+ */
+ smp_store_release(&se->sched_delayed, 1);
/*
* Delayed se of cfs_rq have no tasks queued on them.
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
SCHED_FEAT(TTWU_QUEUE, true)
#endif
SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
+SCHED_FEAT(TTWU_QUEUE_DELAYED, false)
SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
/*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
#define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
#define WF_ON_CPU 0x0100
+#define WF_DELAYED 0x0200
#ifdef CONFIG_SMP
static_assert(WF_EXEC == SD_BALANCE_EXEC);
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
` (4 preceding siblings ...)
2025-05-20 9:45 ` [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
@ 2025-05-28 19:59 ` Peter Zijlstra
2025-05-29 1:41 ` Chris Mason
` (2 more replies)
2025-06-02 4:44 ` K Prateek Nayak
6 siblings, 3 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-28 19:59 UTC (permalink / raw)
To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel
On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> still need to figure out how to account it ;-)
>
> Chris, I'm hoping your machine will agree with these numbers; it hasn't been
> straight sailing in that regard.
Anybody? -- If no comments I'll just stick them in sched/core or so.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-28 19:59 ` [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
@ 2025-05-29 1:41 ` Chris Mason
2025-06-14 10:04 ` Peter Zijlstra
2025-05-29 10:18 ` Beata Michalska
2025-05-30 10:04 ` Chris Mason
2 siblings, 1 reply; 33+ messages in thread
From: Chris Mason @ 2025-05-29 1:41 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
Cc: linux-kernel
On 5/28/25 3:59 PM, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
>
>> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
>> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
>> still need to figure out how to account it ;-)
>>
>> Chris, I'm hoping your machine will agree with these numbers; it hasn't been
>> straight sailing in that regard.
>
> Anybody? -- If no comments I'll just stick them in sched/core or so.
Hi Peter,
I'll get all of these run on the big Turin machine; I should have some
numbers tomorrow.
-chris
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-28 19:59 ` [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-05-29 1:41 ` Chris Mason
@ 2025-05-29 10:18 ` Beata Michalska
2025-05-30 9:00 ` Peter Zijlstra
2025-05-30 10:04 ` Chris Mason
2 siblings, 1 reply; 33+ messages in thread
From: Beata Michalska @ 2025-05-29 10:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm, linux-kernel
On Wed, May 28, 2025 at 09:59:44PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
>
> > Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> > somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> > still need to figure out how to account it ;-)
> >
> > Chris, I'm hoping your machine will agree with these numbers; it hasn't been
> > straight sailing in that regard.
>
> Anybody? -- If no comments I'll just stick them in sched/core or so.
>
Hi Peter,
I've tried out your series on top of 6.15 on an Ampere Altra Mt Jade
dual-socket (160-core) system, which enables SCHED_CLUSTER (2-core MC domains).
Sharing preliminary test results of 50 runs per setup as, so far, the data
show quite a bit of run-to-run variability - not sure how useful those will be.
At this point without any deep dive, which is probably needed and hopefully
will come later on.
Results for average rps (60s) sorted based on P90
CFG | min | max | stdev | 90th
----+------------+------------+------------+-----------
1 | 704577.50 | 942665.67 | 46439.49 | 891272.09
4 | 656313.48 | 877223.85 | 47871.43 | 837693.28
3 | 658665.75 | 859520.32 | 50257.35 | 832174.80
5 | 630419.62 | 842170.47 | 47267.52 | 815911.81
2 | 647163.57 | 815392.65 | 35559.98 | 783884.00
Legend:
#1 : kernel 6.9
#2 : kernel 6.15
#3 : kernel 6.15 patched def (TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT)
#4 : kernel 6.15 patched + TTWU_QUEUE_ON_CPU + TTWU_QUEUE_DEFAULT
#5 : kernel 6.15 patched + NO_TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT
---
BR
Beata
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-29 10:18 ` Beata Michalska
@ 2025-05-30 9:00 ` Peter Zijlstra
0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-05-30 9:00 UTC (permalink / raw)
To: Beata Michalska
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm, linux-kernel
On Thu, May 29, 2025 at 12:18:54PM +0200, Beata Michalska wrote:
> On Wed, May 28, 2025 at 09:59:44PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
> >
> > > Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> > > somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> > > still need to figure out how to account it ;-)
> > >
> > > Chris, I'm hoping your machine will agree with these numbers; it hasn't been
> > > straight sailing in that regard.
> >
> > Anybody? -- If no comments I'll just stick them in sched/core or so.
> >
> Hi Peter,
>
> I've tried out your series on top of 6.15 on an Ampere Altra Mt Jade
> dual-socket (160-core) system, which enables SCHED_CLUSTER (2-core MC domains).
Ah, that's a radically different system than what we set out with. Good
to get some feedback on that indeed.
> Sharing preliminary test results of 50 runs per setup as, so far, the data
> show quite a bit of run-to-run variability - not sure how useful those will be.
Yeah, I had some of that on the Skylake system; I had to disable turbo
for the numbers to become stable enough to say anything much.
> At this point without any deep dive, which is probably needed and hopefully
> will come later on.
>
>
> Results for average rps (60s) sorted based on P90
>
> CFG | min | max | stdev | 90th
> ----+------------+------------+------------+-----------
> 1 | 704577.50 | 942665.67 | 46439.49 | 891272.09
> 2 | 647163.57 | 815392.65 | 35559.98 | 783884.00
> 3 | 658665.75 | 859520.32 | 50257.35 | 832174.80
> 4 | 656313.48 | 877223.85 | 47871.43 | 837693.28
> 5 | 630419.62 | 842170.47 | 47267.52 | 815911.81
>
> Legend:
> #1 : kernel 6.9
> #2 : kernel 6.15
> #3 : kernel 6.15 patched def (TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT)
> #4 : kernel 6.15 patched + TTWU_QUEUE_ON_CPU + TTWU_QUEUE_DEFAULT
> #5 : kernel 6.15 patched + NO_TTWU_QUEUE_ON_CPU + NO_TTWU_QUEUE_DEFAULT
Right, minor improvement. At least it's not making it worse :-)
The new toy is TTWU_QUEUE_DELAYED, and yeah, I did notice that disabling
TTWU_QUEUE_ON_CPU was a bad idea.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-28 19:59 ` [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-05-29 1:41 ` Chris Mason
2025-05-29 10:18 ` Beata Michalska
@ 2025-05-30 10:04 ` Chris Mason
2 siblings, 0 replies; 33+ messages in thread
From: Chris Mason @ 2025-05-30 10:04 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
Cc: linux-kernel
On 5/28/25 3:59 PM, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 11:45:38AM +0200, Peter Zijlstra wrote:
>
>> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
>> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
>> still need to figure out how to account it ;-)
>>
>> Chris, I'm hoping your machine will agree with these numbers; it hasn't been
>> straight sailing in that regard.
>
> Anybody? -- If no comments I'll just stick them in sched/core or so.
My initial numbers were quite bad, roughly 50% fewer RPS than the old
6.9 kernel on the big Turin machine. I need to redo things and make
sure the numbers are all valid; I'll try and do that today.
-chris
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
` (5 preceding siblings ...)
2025-05-28 19:59 ` [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
@ 2025-06-02 4:44 ` K Prateek Nayak
2025-06-13 3:28 ` K Prateek Nayak
6 siblings, 1 reply; 33+ messages in thread
From: K Prateek Nayak @ 2025-06-02 4:44 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot
Cc: linux-kernel, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm
Hello Peter,
On 5/20/2025 3:15 PM, Peter Zijlstra wrote:
> As can be seen, the SPR is much easier to please than the SKL for whatever
> reason. I'm thinking we can make TTWU_QUEUE_DELAYED default on, but I suspect
> TTWU_QUEUE_DEFAULT might be a harder sell -- we'd need to run more than this
> one benchmark.
I haven't tried toggling any of the newly added SCHED_FEAT()s yet.
Following are the numbers for the out-of-the-box variant:
tl;dr Minor improvements across the board; no noticeable regressions
except for a few schbench data points, but those also have a high
run-to-run variance, so we should be good.
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
o Kernel details
tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
ttwu_opt: tip + this series as is
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1-groups 1.00 [ -0.00](13.74) 0.92 [ 7.68]( 6.04)
2-groups 1.00 [ -0.00]( 9.58) 1.04 [ -3.56]( 4.96)
4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -1.30]( 2.27)
8-groups 1.00 [ -0.00]( 1.51) 0.99 [ 1.26]( 1.70)
16-groups 1.00 [ -0.00]( 1.10) 0.97 [ 3.01]( 1.62)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1 1.00 [ 0.00]( 0.82) 1.04 [ 4.33]( 1.84)
2 1.00 [ 0.00]( 1.13) 1.06 [ 5.52]( 1.04)
4 1.00 [ 0.00]( 1.12) 1.05 [ 5.41]( 0.53)
8 1.00 [ 0.00]( 0.93) 1.06 [ 5.72]( 0.47)
16 1.00 [ 0.00]( 0.38) 1.07 [ 6.99]( 0.50)
32 1.00 [ 0.00]( 0.66) 1.05 [ 4.68]( 1.79)
64 1.00 [ 0.00]( 1.18) 1.06 [ 5.53]( 0.37)
128 1.00 [ 0.00]( 1.12) 1.06 [ 5.52]( 0.13)
256 1.00 [ 0.00]( 0.42) 0.99 [ -0.83]( 1.01)
512 1.00 [ 0.00]( 0.14) 1.01 [ 1.06]( 0.13)
1024 1.00 [ 0.00]( 0.26) 1.02 [ 1.82]( 0.41)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) ttwu_opt[pct imp](CV)
Copy 1.00 [ 0.00]( 8.37) 0.97 [ -2.79]( 9.17)
Scale 1.00 [ 0.00]( 2.85) 1.00 [ 0.12]( 2.91)
Add 1.00 [ 0.00]( 3.39) 0.98 [ -2.36]( 4.85)
Triad 1.00 [ 0.00]( 6.39) 1.01 [ 1.45]( 8.42)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) ttwu_opt[pct imp](CV)
Copy 1.00 [ 0.00]( 3.91) 0.98 [ -1.84]( 2.07)
Scale 1.00 [ 0.00]( 4.34) 0.96 [ -3.80]( 6.38)
Add 1.00 [ 0.00]( 4.14) 0.97 [ -3.04]( 6.31)
Triad 1.00 [ 0.00]( 1.00) 0.98 [ -2.36]( 2.60)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.41) 1.06 [ 5.63]( 1.17)
2-clients 1.00 [ 0.00]( 0.58) 1.06 [ 6.25]( 0.85)
4-clients 1.00 [ 0.00]( 0.35) 1.06 [ 5.59]( 0.49)
8-clients 1.00 [ 0.00]( 0.48) 1.06 [ 5.76]( 0.81)
16-clients 1.00 [ 0.00]( 0.66) 1.06 [ 5.95]( 0.69)
32-clients 1.00 [ 0.00]( 1.15) 1.06 [ 5.84]( 1.34)
64-clients 1.00 [ 0.00]( 1.38) 1.05 [ 5.20]( 1.50)
128-clients 1.00 [ 0.00]( 0.87) 1.04 [ 4.39]( 1.03)
256-clients 1.00 [ 0.00]( 5.36) 1.00 [ 0.10]( 3.48)
512-clients 1.00 [ 0.00](54.39) 0.98 [ -1.93](52.45)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1 1.00 [ -0.00]( 8.54) 0.89 [ 10.87](35.39)
2 1.00 [ -0.00]( 1.15) 0.88 [ 12.00]( 4.55)
4 1.00 [ -0.00](13.46) 0.96 [ 4.17](10.60)
8 1.00 [ -0.00]( 7.14) 0.84 [ 15.79]( 8.44)
16 1.00 [ -0.00]( 3.49) 1.08 [ -8.47]( 4.69)
32 1.00 [ -0.00]( 1.06) 1.10 [ -9.57]( 2.91)
64 1.00 [ -0.00]( 5.48) 1.25 [-25.00]( 5.36)
128 1.00 [ -0.00](10.45) 1.18 [-17.99](12.54)
256 1.00 [ -0.00](31.14) 1.28 [-27.79](17.66)
512 1.00 [ -0.00]( 1.52) 1.01 [ -0.51]( 2.78)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1 1.00 [ 0.00]( 1.07) 1.00 [ 0.29]( 0.00)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
4 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
8 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.15)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.41) 0.99 [ -0.95]( 2.06)
64 1.00 [ 0.00]( 1.05) 0.92 [ -7.58]( 9.01)
128 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
256 1.00 [ 0.00]( 0.72) 1.00 [ -0.31]( 0.42)
512 1.00 [ 0.00]( 0.57) 1.00 [ 0.00]( 0.45)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1 1.00 [ -0.00]( 9.11) 0.75 [ 25.00](11.08)
2 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 3.78)
4 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 3.87)
8 1.00 [ -0.00]( 0.00) 1.08 [ -8.33](12.91)
16 1.00 [ -0.00]( 7.56) 0.92 [ 7.69](11.71)
32 1.00 [ -0.00](15.11) 1.07 [ -6.67]( 3.30)
64 1.00 [ -0.00]( 9.63) 1.00 [ -0.00]( 8.15)
128 1.00 [ -0.00]( 4.86) 0.89 [ 11.06]( 7.83)
256 1.00 [ -0.00]( 2.34) 1.00 [ 0.20]( 0.10)
512 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.20)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) ttwu_opt[pct imp](CV)
1 1.00 [ -0.00]( 2.73) 0.98 [ 2.08]( 1.04)
2 1.00 [ -0.00]( 0.87) 1.05 [ -5.40]( 3.10)
4 1.00 [ -0.00]( 1.21) 0.99 [ 0.54]( 1.27)
8 1.00 [ -0.00]( 0.27) 0.99 [ 0.79]( 2.14)
16 1.00 [ -0.00]( 4.04) 1.01 [ -0.53]( 0.55)
32 1.00 [ -0.00]( 7.35) 1.10 [ -9.97](21.10)
64 1.00 [ -0.00]( 3.54) 1.03 [ -2.89]( 1.55)
128 1.00 [ -0.00]( 0.37) 0.99 [ 0.62]( 0.00)
256 1.00 [ -0.00]( 9.57) 0.92 [ 8.36]( 2.22)
512 1.00 [ -0.00]( 1.82) 1.01 [ -1.23]( 0.94)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -0.05%
ycsb-mongodb -0.80%
deathstarbench-1x 2.44%
deathstarbench-2x 5.47%
deathstarbench-3x 0.36%
deathstarbench-6x 1.14%
hammerdb+mysql 16VU 1.08%
hammerdb+mysql 64VU -0.43%
>
> Anyway, the patches are stable (finally!, I hope, knock on wood) but in a
> somewhat rough state. At the very least the last patch is missing ttwu_stat(),
> still need to figure out how to account it ;-)
>
Since TTWU_QUEUE_DELAYED is off by default, feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
if you are planning on retaining the current defaults for the
SCHED_FEATs. I'll get back with numbers for TTWU_QUEUE_DELAYED and
TTWU_QUEUE_DEFAULT soon.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling
2025-05-20 9:45 ` [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling Peter Zijlstra
@ 2025-06-03 16:03 ` Juri Lelli
2025-06-13 9:43 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Juri Lelli @ 2025-06-03 16:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, clm, linux-kernel
Hi,
On 20/05/25 11:45, Peter Zijlstra wrote:
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
>
> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hindsight and all that.
>
> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
>
> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/deadline.c | 31 +++++++++++++++++++++++++++----
> 2 files changed, 28 insertions(+), 4 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -702,6 +702,7 @@ struct sched_dl_entity {
> unsigned int dl_defer : 1;
> unsigned int dl_defer_armed : 1;
> unsigned int dl_defer_running : 1;
> + unsigned int dl_server_idle : 1;
>
> /*
> * Bandwidth enforcement timer. Each -deadline task has its
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1215,6 +1215,8 @@ static void __push_dl_task(struct rq *rq
> /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
> static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
>
> +static bool dl_server_stopped(struct sched_dl_entity *dl_se);
> +
> static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
> {
> struct rq *rq = rq_of_dl_se(dl_se);
> @@ -1234,6 +1236,7 @@ static enum hrtimer_restart dl_server_ti
>
> if (!dl_se->server_has_tasks(dl_se)) {
> replenish_dl_entity(dl_se);
> + dl_server_stopped(dl_se);
> return HRTIMER_NORESTART;
> }
>
> @@ -1639,8 +1642,10 @@ void dl_server_update_idle_time(struct r
> void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
> {
> /* 0 runtime = fair server disabled */
> - if (dl_se->dl_runtime)
> + if (dl_se->dl_runtime) {
> + dl_se->dl_server_idle = 0;
> update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
> + }
> }
>
> void dl_server_start(struct sched_dl_entity *dl_se)
> @@ -1663,7 +1668,7 @@ void dl_server_start(struct sched_dl_ent
> setup_new_dl_entity(dl_se);
> }
>
> - if (!dl_se->dl_runtime)
> + if (!dl_se->dl_runtime || dl_se->dl_server_active)
> return;
>
> dl_se->dl_server_active = 1;
> @@ -1672,7 +1677,7 @@ void dl_server_start(struct sched_dl_ent
> resched_curr(dl_se->rq);
> }
>
> -void dl_server_stop(struct sched_dl_entity *dl_se)
> +static void __dl_server_stop(struct sched_dl_entity *dl_se)
> {
> if (!dl_se->dl_runtime)
> return;
> @@ -1684,6 +1689,24 @@ void dl_server_stop(struct sched_dl_enti
> dl_se->dl_server_active = 0;
> }
>
> +static bool dl_server_stopped(struct sched_dl_entity *dl_se)
> +{
> + if (!dl_se->dl_server_active)
> + return false;
> +
> + if (dl_se->dl_server_idle) {
> + __dl_server_stop(dl_se);
> + return true;
> + }
> +
> + dl_se->dl_server_idle = 1;
> + return false;
> +}
> +
> +void dl_server_stop(struct sched_dl_entity *dl_se)
> +{
> +}
What if we explicitly set the server to idle (instead of ignoring the
stop) where this gets called in dequeue_entities()? Also, don't we need
to actually stop the server if we are changing its parameters from
sched_fair_server_write()?
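A minimal sketch of that first alternative -- purely illustrative, not
part of the patch, and assuming dl_server_stop() keeps its current call
site in dequeue_entities() -- could be:

void dl_server_stop(struct sched_dl_entity *dl_se)
{
	if (!dl_se->dl_server_active)
		return;

	/* Flag idle now; the next dl_server_stopped() check tears it down. */
	dl_se->dl_server_idle = 1;
}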
Thanks,
Juri
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-05-20 9:45 ` [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
@ 2025-06-06 15:03 ` Vincent Guittot
2025-06-06 15:38 ` Peter Zijlstra
` (2 more replies)
2025-06-13 7:34 ` Dietmar Eggemann
1 sibling, 3 replies; 33+ messages in thread
From: Vincent Guittot @ 2025-06-06 15:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
>
> One of the things lost with introduction of DELAY_DEQUEUE is the
> ability of TTWU to move those tasks around on wakeup, since they're
> on_rq, and as such, need to be woken in-place.
I was thinking that you would call select_task_rq() somewhere in the
wakeup path of a delayed entity to get a chance to migrate it, which was
one reason for the perf regression (and which would also have been
useful for the EAS case). But IIUC, the task is still enqueued on the
same CPU; the target CPU just does the enqueue itself instead of the
local CPU doing it. Or am I missing something?
>
> Doing the in-place thing adds quite a bit of cross-cpu latency, add a
> little something that gets remote CPUs to do their own in-place
> wakeups, significantly reducing the rq->lock contention.
>
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
> kernel/sched/fair.c | 5 ++-
> kernel/sched/features.h | 1
> kernel/sched/sched.h | 1
> 4 files changed, 72 insertions(+), 9 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> return 1;
> }
>
> +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> +
> /*
> * Consider @p being inside a wait loop:
> *
> @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> +#ifdef CONFIG_SMP
> + if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> + /*
> + * Similar to try_to_block_task():
> + *
> + * __schedule() ttwu()
> + * prev_state = prev->state if (p->sched_delayed)
> + * if (prev_state) smp_acquire__after_ctrl_dep()
> + * try_to_block_task() p->state = TASK_WAKING
> + * ... set_delayed()
> + * RELEASE p->sched_delayed = 1
> + *
> + * __schedule() and ttwu() have matching control dependencies.
> + *
> + * Notably, once we observe sched_delayed we know the task has
> + * passed try_to_block_task() and p->state is ours to modify.
> + *
> + * TASK_WAKING controls ttwu() concurrency.
> + */
> + smp_acquire__after_ctrl_dep();
> + WRITE_ONCE(p->__state, TASK_WAKING);
> +
> + if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> + return 1;
> + }
> +#endif
> +
> CLASS(__task_rq_lock, guard)(p);
> return __ttwu_runnable(guard.rq, p, wake_flags);
> }
> @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> update_rq_clock(rq);
>
> llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> + struct rq *p_rq = task_rq(p);
> + int ret;
> +
> + /*
> + * This is the ttwu_runnable() case. Notably it is possible for
> + * on-rq entities to get migrated -- even sched_delayed ones.
I haven't found where the sched_delayed task could migrate to another CPU.
> + */
> + if (unlikely(p_rq != rq)) {
> + rq_unlock(rq, &rf);
> + p_rq = __task_rq_lock(p, &rf);
> + }
> +
> + ret = __ttwu_runnable(p_rq, p, WF_TTWU);
> +
> + if (unlikely(p_rq != rq)) {
> + if (!ret)
> + set_task_cpu(p, cpu_of(rq));
> +
> + __task_rq_unlock(p_rq, &rf);
> + rq_lock(rq, &rf);
> + update_rq_clock(rq);
> + }
> +
> + if (ret) {
> + // XXX ttwu_stat()
> + continue;
> + }
> +
> + /*
> + * This is the 'normal' case where the task is blocked.
> + */
> +
> if (WARN_ON_ONCE(p->on_cpu))
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> - if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> - set_task_cpu(p, cpu_of(rq));
> -
> ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> }
>
> @@ -3974,7 +4032,7 @@ static inline bool ttwu_queue_cond(struc
>
> static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> {
> - bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> + bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
>
> if (!ttwu_queue_cond(p, cpu, def))
> return false;
> @@ -4269,8 +4327,8 @@ int try_to_wake_up(struct task_struct *p
> * __schedule(). See the comment for smp_mb__after_spinlock().
> *
> * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> - * schedule()'s deactivate_task() has 'happened' and p will no longer
> - * care about it's own p->state. See the comment in __schedule().
> + * schedule()'s try_to_block_task() has 'happened' and p will no longer
> + * care about it's own p->state. See the comment in try_to_block_task().
> */
> smp_acquire__after_ctrl_dep();
>
> @@ -6712,8 +6770,8 @@ static void __sched notrace __schedule(i
> preempt = sched_mode == SM_PREEMPT;
>
> /*
> - * We must load prev->state once (task_struct::state is volatile), such
> - * that we form a control dependency vs deactivate_task() below.
> + * We must load prev->state once, such that we form a control
> + * dependency vs try_to_block_task() below.
> */
> prev_state = READ_ONCE(prev->__state);
> if (sched_mode == SM_IDLE) {
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5395,7 +5395,10 @@ static __always_inline void return_cfs_r
>
> static void set_delayed(struct sched_entity *se)
> {
> - se->sched_delayed = 1;
> + /*
> + * See TTWU_QUEUE_DELAYED in ttwu_runnable().
> + */
> + smp_store_release(&se->sched_delayed, 1);
>
> /*
> * Delayed se of cfs_rq have no tasks queued on them.
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
> SCHED_FEAT(TTWU_QUEUE, true)
> #endif
> SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> +SCHED_FEAT(TTWU_QUEUE_DELAYED, false)
I'm not sure that the feature will be tested, as people mainly test the
default config.
> SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
>
> /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
> #define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
>
> #define WF_ON_CPU 0x0100
> +#define WF_DELAYED 0x0200
>
> #ifdef CONFIG_SMP
> static_assert(WF_EXEC == SD_BALANCE_EXEC);
>
>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-06 15:03 ` Vincent Guittot
@ 2025-06-06 15:38 ` Peter Zijlstra
2025-06-06 16:55 ` Vincent Guittot
2025-06-06 16:18 ` Phil Auld
2025-06-16 12:01 ` Peter Zijlstra
2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-06 15:38 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Fri, Jun 06, 2025 at 05:03:36PM +0200, Vincent Guittot wrote:
> On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > One of the things lost with introduction of DELAY_DEQUEUE is the
> > ability of TTWU to move those tasks around on wakeup, since they're
> > on_rq, and as such, need to be woken in-place.
>
> I was thinking that you would call select_task_rq() somewhere in the
> wakeup path of a delayed entity to get a chance to migrate it, which was
> one reason for the perf regression (and which would also have been
> useful for the EAS case). But IIUC, the task is still enqueued on the
> same CPU; the target CPU just does the enqueue itself instead of the
> local CPU doing it. Or am I missing something?
Correct. I tried to add that migration into the mix, but then things get
tricky real fast.
Just getting rid of the remote rq lock also helped; these dispatch
threads just need to get on with waking up tasks; any delay hurts.
> >
> > Doing the in-place thing adds quite a bit of cross-cpu latency, add a
> > little something that gets remote CPUs to do their own in-place
> > wakeups, significantly reducing the rq->lock contention.
> >
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
> > kernel/sched/fair.c | 5 ++-
> > kernel/sched/features.h | 1
> > kernel/sched/sched.h | 1
> > 4 files changed, 72 insertions(+), 9 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> > return 1;
> > }
> >
> > +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> > +
> > /*
> > * Consider @p being inside a wait loop:
> > *
> > @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > +#ifdef CONFIG_SMP
> > + if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> > + /*
> > + * Similar to try_to_block_task():
> > + *
> > + * __schedule() ttwu()
> > + * prev_state = prev->state if (p->sched_delayed)
> > + * if (prev_state) smp_acquire__after_ctrl_dep()
> > + * try_to_block_task() p->state = TASK_WAKING
> > + * ... set_delayed()
> > + * RELEASE p->sched_delayed = 1
> > + *
> > + * __schedule() and ttwu() have matching control dependencies.
> > + *
> > + * Notably, once we observe sched_delayed we know the task has
> > + * passed try_to_block_task() and p->state is ours to modify.
> > + *
> > + * TASK_WAKING controls ttwu() concurrency.
> > + */
> > + smp_acquire__after_ctrl_dep();
> > + WRITE_ONCE(p->__state, TASK_WAKING);
> > +
> > + if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> > + return 1;
> > + }
> > +#endif
> > +
> > CLASS(__task_rq_lock, guard)(p);
> > return __ttwu_runnable(guard.rq, p, wake_flags);
> > }
> > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > update_rq_clock(rq);
> >
> > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > + struct rq *p_rq = task_rq(p);
> > + int ret;
> > +
> > + /*
> > + * This is the ttwu_runnable() case. Notably it is possible for
> > + * on-rq entities to get migrated -- even sched_delayed ones.
>
> I haven't found where the sched_delayed task could migrate to another CPU.
Doesn't happen often, but it can happen. Nothing really stops it from
happening. E.g. weight-based balancing can do it, as can NUMA balancing
and affinity changes.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-06 15:03 ` Vincent Guittot
2025-06-06 15:38 ` Peter Zijlstra
@ 2025-06-06 16:18 ` Phil Auld
2025-06-16 12:01 ` Peter Zijlstra
2 siblings, 0 replies; 33+ messages in thread
From: Phil Auld @ 2025-06-06 16:18 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm, linux-kernel
Hi Peter,
On Fri, Jun 06, 2025 at 05:03:36PM +0200 Vincent Guittot wrote:
> On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > One of the things lost with introduction of DELAY_DEQUEUE is the
> > ability of TTWU to move those tasks around on wakeup, since they're
> > on_rq, and as such, need to be woken in-place.
>
> I was thinking that you would call select_task_rq() somewhere in the
> wakeup path of a delayed entity to get a chance to migrate it, which was
> one reason for the perf regression (and which would also have been
> useful for the EAS case). But IIUC, the task is still enqueued on the
> same CPU; the target CPU just does the enqueue itself instead of the
> local CPU doing it. Or am I missing something?
Yeah, this one still bites us. We ran these patches on our perf
tests (without twiddling any FEATs) and it was basically a wash.
The fs regression we saw due to always waking up on the same CPU
was still present, as expected based on this patch, I suppose.
Thanks,
Phil
>
> >
> > Doing the in-place thing adds quite a bit of cross-cpu latency, add a
> > little something that gets remote CPUs to do their own in-place
> > wakeups, significantly reducing the rq->lock contention.
> >
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
> > kernel/sched/fair.c | 5 ++-
> > kernel/sched/features.h | 1
> > kernel/sched/sched.h | 1
> > 4 files changed, 72 insertions(+), 9 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> > return 1;
> > }
> >
> > +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> > +
> > /*
> > * Consider @p being inside a wait loop:
> > *
> > @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > +#ifdef CONFIG_SMP
> > + if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> > + /*
> > + * Similar to try_to_block_task():
> > + *
> > + * __schedule() ttwu()
> > + * prev_state = prev->state if (p->sched_delayed)
> > + * if (prev_state) smp_acquire__after_ctrl_dep()
> > + * try_to_block_task() p->state = TASK_WAKING
> > + * ... set_delayed()
> > + * RELEASE p->sched_delayed = 1
> > + *
> > + * __schedule() and ttwu() have matching control dependencies.
> > + *
> > + * Notably, once we observe sched_delayed we know the task has
> > + * passed try_to_block_task() and p->state is ours to modify.
> > + *
> > + * TASK_WAKING controls ttwu() concurrency.
> > + */
> > + smp_acquire__after_ctrl_dep();
> > + WRITE_ONCE(p->__state, TASK_WAKING);
> > +
> > + if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> > + return 1;
> > + }
> > +#endif
> > +
> > CLASS(__task_rq_lock, guard)(p);
> > return __ttwu_runnable(guard.rq, p, wake_flags);
> > }
> > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > update_rq_clock(rq);
> >
> > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > + struct rq *p_rq = task_rq(p);
> > + int ret;
> > +
> > + /*
> > + * This is the ttwu_runnable() case. Notably it is possible for
> > + * on-rq entities to get migrated -- even sched_delayed ones.
>
> I haven't found where the sched_delayed task could migrate on another cpu.
>
> > + */
> > + if (unlikely(p_rq != rq)) {
> > + rq_unlock(rq, &rf);
> > + p_rq = __task_rq_lock(p, &rf);
> > + }
> > +
> > + ret = __ttwu_runnable(p_rq, p, WF_TTWU);
> > +
> > + if (unlikely(p_rq != rq)) {
> > + if (!ret)
> > + set_task_cpu(p, cpu_of(rq));
> > +
> > + __task_rq_unlock(p_rq, &rf);
> > + rq_lock(rq, &rf);
> > + update_rq_clock(rq);
> > + }
> > +
> > + if (ret) {
> > + // XXX ttwu_stat()
> > + continue;
> > + }
> > +
> > + /*
> > + * This is the 'normal' case where the task is blocked.
> > + */
> > +
> > if (WARN_ON_ONCE(p->on_cpu))
> > smp_cond_load_acquire(&p->on_cpu, !VAL);
> >
> > - if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> > - set_task_cpu(p, cpu_of(rq));
> > -
> > ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> > }
> >
> > @@ -3974,7 +4032,7 @@ static inline bool ttwu_queue_cond(struc
> >
> > static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> > {
> > - bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> > + bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
> >
> > if (!ttwu_queue_cond(p, cpu, def))
> > return false;
> > @@ -4269,8 +4327,8 @@ int try_to_wake_up(struct task_struct *p
> > * __schedule(). See the comment for smp_mb__after_spinlock().
> > *
> > * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> > - * schedule()'s deactivate_task() has 'happened' and p will no longer
> > - * care about it's own p->state. See the comment in __schedule().
> > + * schedule()'s try_to_block_task() has 'happened' and p will no longer
> > + * care about it's own p->state. See the comment in try_to_block_task().
> > */
> > smp_acquire__after_ctrl_dep();
> >
> > @@ -6712,8 +6770,8 @@ static void __sched notrace __schedule(i
> > preempt = sched_mode == SM_PREEMPT;
> >
> > /*
> > - * We must load prev->state once (task_struct::state is volatile), such
> > - * that we form a control dependency vs deactivate_task() below.
> > + * We must load prev->state once, such that we form a control
> > + * dependency vs try_to_block_task() below.
> > */
> > prev_state = READ_ONCE(prev->__state);
> > if (sched_mode == SM_IDLE) {
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5395,7 +5395,10 @@ static __always_inline void return_cfs_r
> >
> > static void set_delayed(struct sched_entity *se)
> > {
> > - se->sched_delayed = 1;
> > + /*
> > + * See TTWU_QUEUE_DELAYED in ttwu_runnable().
> > + */
> > + smp_store_release(&se->sched_delayed, 1);
> >
> > /*
> > * Delayed se of cfs_rq have no tasks queued on them.
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
> > SCHED_FEAT(TTWU_QUEUE, true)
> > #endif
> > SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> > +SCHED_FEAT(TTWU_QUEUE_DELAYED, false)
>
> I'm not sure that the feature will be tested as people mainly test
> default config
>
> > SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
> >
> > /*
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
> > #define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
> >
> > #define WF_ON_CPU 0x0100
> > +#define WF_DELAYED 0x0200
> >
> > #ifdef CONFIG_SMP
> > static_assert(WF_EXEC == SD_BALANCE_EXEC);
> >
> >
>
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-06 15:38 ` Peter Zijlstra
@ 2025-06-06 16:55 ` Vincent Guittot
2025-06-11 9:39 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Vincent Guittot @ 2025-06-06 16:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Fri, 6 Jun 2025 at 17:38, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 06, 2025 at 05:03:36PM +0200, Vincent Guittot wrote:
> > On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > One of the things lost with introduction of DELAY_DEQUEUE is the
> > > ability of TTWU to move those tasks around on wakeup, since they're
> > > on_rq, and as such, need to be woken in-place.
> >
> > I was thinking that you would call select_task_rq() somewhere in the
> > wake up path of delayed entity to get a chance to migrate it which was
> > one reason for the perf regression (and which would have also been
> > useful for EAS case) but IIUC, the task is still enqueued on the same
> > CPU but the target cpu will do the enqueue itself instead on the local
> > CPU. Or am I missing something ?
>
> Correct. I tried to add that migration into the mix, but then things get
> tricky real fast.
Yeah, I can imagine
>
> Just getting rid of the remote rq lock also helped; these dispatch
> threads just need to get on with waking up tasks, any delay hurts.
>
> > >
> > > Doing the in-place thing adds quite a bit of cross-cpu latency, add a
> > > little something that gets remote CPUs to do their own in-place
> > > wakeups, significantly reducing the rq->lock contention.
> > >
> > > Reported-by: Chris Mason <clm@meta.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > > kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
> > > kernel/sched/fair.c | 5 ++-
> > > kernel/sched/features.h | 1
> > > kernel/sched/sched.h | 1
> > > 4 files changed, 72 insertions(+), 9 deletions(-)
> > >
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> > > return 1;
> > > }
> > >
> > > +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> > > +
> > > /*
> > > * Consider @p being inside a wait loop:
> > > *
> > > @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> > > */
> > > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > > {
> > > +#ifdef CONFIG_SMP
> > > + if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> > > + /*
> > > + * Similar to try_to_block_task():
> > > + *
> > > + * __schedule() ttwu()
> > > + * prev_state = prev->state if (p->sched_delayed)
> > > + * if (prev_state) smp_acquire__after_ctrl_dep()
> > > + * try_to_block_task() p->state = TASK_WAKING
> > > + * ... set_delayed()
> > > + * RELEASE p->sched_delayed = 1
> > > + *
> > > + * __schedule() and ttwu() have matching control dependencies.
> > > + *
> > > + * Notably, once we observe sched_delayed we know the task has
> > > + * passed try_to_block_task() and p->state is ours to modify.
> > > + *
> > > + * TASK_WAKING controls ttwu() concurrency.
> > > + */
> > > + smp_acquire__after_ctrl_dep();
> > > + WRITE_ONCE(p->__state, TASK_WAKING);
> > > +
> > > + if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> > > + return 1;
> > > + }
> > > +#endif
> > > +
> > > CLASS(__task_rq_lock, guard)(p);
> > > return __ttwu_runnable(guard.rq, p, wake_flags);
> > > }
> > > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > > update_rq_clock(rq);
> > >
> > > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > > + struct rq *p_rq = task_rq(p);
> > > + int ret;
> > > +
> > > + /*
> > > + * This is the ttwu_runnable() case. Notably it is possible for
> > > + * on-rq entities to get migrated -- even sched_delayed ones.
> >
> > I haven't found where the sched_delayed task could migrate on another cpu.
>
> Doesn't happen often, but it can happen. Nothing really stops it from
> happening. Eg weight based balancing can do it. As can numa balancing
> and affinity changes.
Yes, I agree that delayed tasks can migrate because of load balancing
but not at wake up.
* Re: [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq()
2025-05-20 9:45 ` [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
@ 2025-06-09 5:01 ` Mike Galbraith
2025-06-13 9:40 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2025-06-09 5:01 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm
Cc: linux-kernel
Greetings,
This patch gives RT builds terminal heartburn. This particular boot
survived long/well enough to trigger it with LTP sched tests and still
be able to crash dump the hung box.
(the is_migration_disabled() confirmation thingy is below the gripe)
[ 44.379563] WARNING: CPU: 6 PID: 4468 at kernel/sched/core.c:3354 set_task_cpu+0x1c1/0x1d0
[ 44.379569] Modules linked in: af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security bridge stp llc iscsi_ibft iscsi_boot_sysfs rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter binfmt_misc intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek nls_iso8859_1 nls_cp437 snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_scodec_component snd_hda_intel at24 iTCO_wdt snd_intel_dspcfg r8169 regmap_i2c snd_intel_sdw_acpi intel_pmc_bxt kvm mei_pxp mei_hdcp iTCO_vendor_support realtek snd_hda_codec i2c_i801 mdio_devres snd_hda_core ums_realtek libphy i2c_mux irqbypass pcspkr i2c_smbus snd_hwdep snd_pcm usblp mdio_bus mei_me lpc_ich mfd_core mei snd_timer
[ 44.379621] snd soundcore thermal fan joydev intel_smartconnect tiny_power_button nfsd auth_rpcgss nfs_acl lockd sch_fq_codel grace sunrpc fuse configfs dmi_sysfs ip_tables x_tables uas usb_storage hid_logitech_hidpp hid_logitech_dj hid_generic usbhid nouveau drm_ttm_helper ttm gpu_sched xhci_pci i2c_algo_bit xhci_hcd ahci drm_gpuvm ehci_pci ehci_hcd libahci drm_exec mxm_wmi libata polyval_clmulni usbcore ghash_clmulni_intel drm_display_helper sha512_ssse3 sha1_ssse3 cec rc_core video wmi button sd_mod scsi_dh_emc scsi_dh_rdac scsi_dh_alua sg scsi_mod scsi_common vfat fat ext4 crc16 mbcache jbd2 loop msr efivarfs aesni_intel
[ 44.379663] CPU: 6 UID: 0 PID: 4468 Comm: sandbox_ipc_thr Kdump: loaded Not tainted 6.15.0.ge271ed52-master-rt #19 PREEMPT_{RT,(lazy)} e4f2516a9b85ac19222adb94a538ef0c57343c1c
[ 44.379666] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 44.379668] RIP: 0010:set_task_cpu+0x1c1/0x1d0
[ 44.379670] Code: 0f 0b e9 8f fe ff ff 80 8b 8c 05 00 00 04 e9 f5 fe ff ff 0f 0b e9 7c fe ff ff 0f 0b 66 83 bb 40 04 00 00 00 0f 84 8b fe ff ff <0f> 0b e9 84 fe ff ff 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
[ 44.379672] RSP: 0018:ffffcd844ef77668 EFLAGS: 00010002
[ 44.379673] RAX: 0000000000000200 RBX: ffff896da1e8c700 RCX: 0000000000000000
[ 44.379675] RDX: ffff896da1e8cb30 RSI: 0000000000000000 RDI: ffff896da1e8c700
[ 44.379676] RBP: 0000000000000000 R08: 0000000000000206 R09: 000000000002361d
[ 44.379677] R10: fbfffffffffff79d R11: 0000000000000004 R12: 0000000000000000
[ 44.379678] R13: 0000000000000000 R14: 0000000000000028 R15: ffff896da1e8d030
[ 44.379679] FS: 0000000000000000(0000) GS:ffff8970f15d8000(0000) knlGS:0000000000000000
[ 44.379681] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 44.379682] CR2: 00007f4c297c4688 CR3: 000000011b9de002 CR4: 00000000001726f0
[ 44.379683] Call Trace:
[ 44.379686] <TASK>
[ 44.379688] try_to_wake_up+0x245/0x810
[ 44.379692] rt_mutex_slowunlock+0x1d2/0x2d0
[ 44.379696] ? __pfx_lru_activate+0x10/0x10
[ 44.379700] folio_batch_move_lru+0xc7/0x100
[ 44.379704] ? __pfx_lru_activate+0x10/0x10
[ 44.379706] __folio_batch_add_and_move+0xf2/0x110
[ 44.379710] folio_mark_accessed+0x80/0x1b0
[ 44.379711] unmap_page_range+0x176b/0x1a60
[ 44.379717] unmap_vmas+0xae/0x1a0
[ 44.379720] exit_mmap+0xe5/0x3c0
[ 44.379725] mmput+0x6e/0x150
[ 44.379729] do_exit+0x23c/0xa20
[ 44.379732] do_group_exit+0x33/0x90
[ 44.379735] get_signal+0x85d/0x8b0
[ 44.379738] arch_do_signal_or_restart+0x2d/0x240
[ 44.379743] ? place_entity+0x1b/0x130
[ 44.379745] ? __x64_sys_poll+0x47/0x1a0
[ 44.379749] exit_to_user_mode_loop+0x86/0x150
[ 44.379753] do_syscall_64+0x1ba/0x8e0
[ 44.379756] ? wakeup_preempt+0x40/0x70
[ 44.379758] ? ttwu_do_activate+0x84/0x210
[ 44.379760] ? _raw_spin_unlock_irqrestore+0x22/0x40
[ 44.379763] ? try_to_wake_up+0xab/0x810
[ 44.379765] ? preempt_count_add+0x4b/0xa0
[ 44.379768] ? futex_hash_put+0x43/0x90
[ 44.379772] ? futex_wake+0xb2/0x1c0
[ 44.379775] ? do_futex+0x125/0x190
[ 44.379776] ? __x64_sys_futex+0x10b/0x1c0
[ 44.379779] ? do_syscall_64+0x7f/0x8e0
[ 44.379781] ? __do_sys_prctl+0xbe/0xee0
[ 44.379783] ? do_syscall_64+0x7f/0x8e0
[ 44.379786] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 44.379789] RIP: 0033:0x7f4c3571fdef
[ 44.379803] Code: Unable to access opcode bytes at 0x7f4c3571fdc5.
[ 44.379803] RSP: 002b:00007f4c0c9fe700 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
[ 44.379805] RAX: fffffffffffffdfc RBX: 00007f4c0c9fe730 RCX: 00007f4c3571fdef
[ 44.379806] RDX: 00000000ffffffff RSI: 0000000000000002 RDI: 00007f4c0c9fe730
[ 44.379807] RBP: 00007f4c0c9fe920 R08: 0000000000000000 R09: 0000000000000007
[ 44.379808] R10: 00005587485bf1d0 R11: 0000000000000293 R12: 00005587485a7fc0
[ 44.379809] R13: 0000000000000000 R14: 0000000000001174 R15: 00007f4c0c1ff000
[ 44.379812] </TASK>
---
kernel/sched/core.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4313,7 +4313,10 @@ int try_to_wake_up(struct task_struct *p
ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
break;
- cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+ if (is_migration_disabled(p))
+ cpu = -1;
+ else
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
/*
* If the owning (remote) CPU is still in the middle of schedule() with
@@ -4326,6 +4329,9 @@ int try_to_wake_up(struct task_struct *p
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
+ if (cpu == -1)
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
delayacct_blkio_end(p);
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-06 16:55 ` Vincent Guittot
@ 2025-06-11 9:39 ` Peter Zijlstra
2025-06-16 12:39 ` Vincent Guittot
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-11 9:39 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Fri, Jun 06, 2025 at 06:55:37PM +0200, Vincent Guittot wrote:
> > > > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > > > update_rq_clock(rq);
> > > >
> > > > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > > > + struct rq *p_rq = task_rq(p);
> > > > + int ret;
> > > > +
> > > > + /*
> > > > + * This is the ttwu_runnable() case. Notably it is possible for
> > > > + * on-rq entities to get migrated -- even sched_delayed ones.
> > >
> > > I haven't found where the sched_delayed task could migrate on another cpu.
> >
> > Doesn't happen often, but it can happen. Nothing really stops it from
> > happening. Eg weight based balancing can do it. As can numa balancing
> > and affinity changes.
>
> Yes, I agree that delayed tasks can migrate because of load balancing
> but not at wake up.
Right, but this here is the case where wakeup races with load-balancing.
Specifically, due to the wake_list, the wakeup can happen while the task
is on CPU N, and by the time the IPI gets processed the task has moved
to CPU M.
It doesn't happen often, but it was 'fun' chasing that fail around for a
day :/
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-06-02 4:44 ` K Prateek Nayak
@ 2025-06-13 3:28 ` K Prateek Nayak
2025-06-14 10:15 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: K Prateek Nayak @ 2025-06-13 3:28 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot
Cc: linux-kernel, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm
Hello Peter,
On 6/2/2025 10:14 AM, K Prateek Nayak wrote:
> Hello Peter,
>
> On 5/20/2025 3:15 PM, Peter Zijlstra wrote:
>> As can be seen, the SPR is much easier to please than the SKL for whatever
>> reason. I'm thinking we can make TTWU_QUEUE_DELAYED default on, but I suspect
>> TTWU_QUEUE_DEFAULT might be a harder sell -- we'd need to run more than this
>> one benchmark.
>
> I haven't tried toggling any of the newly added SCHED_FEAT() yet.
Here are the full results:
tldr;
- schbench (old) has a consistent regression for 16, 32, 64,
128, 256 workers (> CCX size, < Overloaded), except for the
256 workers case with TTWU_QUEUE_DEFAULT, which shows an
improvement.
- new schbench has a few regressions around 32, 64, and 128
workers for wakeup and request latency.
- Most other benchmarks show minor improvements /
regressions but nothing serious.
o Variants
"DELAYED" enables "TTWU_QUEUE_DELAYED" alone, "DEFAULT" enables
"TTWU_QUEUE_DEFAULT" alone, and "BOTH" variant enables both.
The vanilla numbers were shared previously; they are the same as
out-of-the-box, with no changes made to the sched features.
o Benchmark numbers
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1-groups 1.00 [ -0.00](13.74) 0.92 [ 7.68]( 6.04) 0.95 [ 5.12](10.12) 1.02 [ -1.92]( 6.70) 0.95 [ 4.90]( 5.28)
2-groups 1.00 [ -0.00]( 9.58) 1.04 [ -3.56]( 4.96) 1.03 [ -3.12]( 5.12) 0.98 [ 1.56]( 4.30) 1.01 [ -1.11]( 5.78)
4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -1.30]( 2.27) 1.01 [ -1.09]( 2.68) 1.00 [ -0.43]( 2.58) 1.01 [ -0.65]( 1.38)
8-groups 1.00 [ -0.00]( 1.51) 0.99 [ 1.26]( 1.70) 0.99 [ 0.95]( 4.92) 0.97 [ 3.15]( 1.60) 1.00 [ -0.00]( 3.67)
16-groups 1.00 [ -0.00]( 1.10) 0.97 [ 3.01]( 1.62) 0.96 [ 3.77]( 1.42) 0.95 [ 4.60]( 0.67) 0.96 [ 4.44]( 1.10)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1 1.00 [ 0.00]( 0.82) 1.04 [ 4.33]( 1.84) 1.06 [ 5.97]( 0.42) 1.06 [ 6.12]( 1.02) 1.06 [ 5.54]( 0.73)
2 1.00 [ 0.00]( 1.13) 1.06 [ 5.52]( 1.04) 1.07 [ 7.17]( 0.42) 1.07 [ 6.81]( 0.30) 1.08 [ 7.96]( 0.39)
4 1.00 [ 0.00]( 1.12) 1.05 [ 5.41]( 0.53) 1.07 [ 7.39]( 0.67) 1.06 [ 6.45]( 0.91) 1.07 [ 7.36]( 0.63)
8 1.00 [ 0.00]( 0.93) 1.06 [ 5.72]( 0.47) 1.07 [ 6.90]( 0.24) 1.07 [ 7.09]( 1.45) 1.07 [ 6.94]( 0.45)
16 1.00 [ 0.00]( 0.38) 1.07 [ 6.99]( 0.50) 1.05 [ 4.95]( 0.98) 1.05 [ 5.39]( 0.71) 1.05 [ 5.43]( 1.05)
32 1.00 [ 0.00]( 0.66) 1.05 [ 4.68]( 1.79) 1.06 [ 5.70]( 0.54) 1.07 [ 6.93]( 2.39) 1.03 [ 3.17]( 1.06)
64 1.00 [ 0.00]( 1.18) 1.06 [ 5.53]( 0.37) 1.04 [ 4.05]( 0.84) 1.07 [ 7.35]( 1.57) 1.06 [ 5.62]( 1.13)
128 1.00 [ 0.00]( 1.12) 1.06 [ 5.52]( 0.13) 1.05 [ 4.94]( 0.75) 1.08 [ 7.56]( 0.81) 1.05 [ 4.80]( 0.55)
256 1.00 [ 0.00]( 0.42) 0.99 [ -0.83]( 1.01) 0.99 [ -0.58]( 0.57) 1.00 [ 0.06]( 0.68) 1.00 [ 0.03]( 1.47)
512 1.00 [ 0.00]( 0.14) 1.01 [ 1.06]( 0.13) 1.02 [ 1.67]( 0.18) 1.03 [ 2.62]( 0.28) 1.02 [ 2.17]( 0.33)
1024 1.00 [ 0.00]( 0.26) 1.02 [ 1.82]( 0.41) 1.02 [ 2.48]( 0.27) 1.03 [ 3.38]( 0.37) 1.01 [ 1.39]( 0.03)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
Copy 1.00 [ 0.00]( 8.37) 0.97 [ -2.79]( 9.17) 0.99 [ -1.29]( 4.68) 1.01 [ 1.25]( 4.86) 0.99 [ -0.66]( 9.29)
Scale 1.00 [ 0.00]( 2.85) 1.00 [ 0.12]( 2.91) 0.99 [ -1.34]( 5.55) 1.00 [ -0.20]( 3.38) 0.98 [ -2.09]( 5.33)
Add 1.00 [ 0.00]( 3.39) 0.98 [ -2.36]( 4.85) 0.98 [ -2.32]( 5.23) 1.00 [ 0.10]( 3.17) 0.98 [ -1.99]( 4.73)
Triad 1.00 [ 0.00]( 6.39) 1.01 [ 1.45]( 8.42) 1.00 [ -0.38]( 8.28) 1.05 [ 4.69]( 5.66) 1.06 [ 6.02]( 4.53)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
Copy 1.00 [ 0.00]( 3.91) 0.98 [ -1.84]( 2.07) 0.98 [ -2.06]( 6.75) 1.01 [ 1.31]( 2.86) 1.02 [ 2.12]( 3.30)
Scale 1.00 [ 0.00]( 4.34) 0.96 [ -3.80]( 6.38) 0.97 [ -2.88]( 6.99) 0.97 [ -2.62]( 5.70) 1.00 [ -0.37]( 3.94)
Add 1.00 [ 0.00]( 4.14) 0.97 [ -3.04]( 6.31) 0.97 [ -3.14]( 6.91) 0.99 [ -0.79]( 4.24) 1.00 [ -0.35]( 4.06)
Triad 1.00 [ 0.00]( 1.00) 0.98 [ -2.36]( 2.60) 0.96 [ -3.80]( 6.15) 0.99 [ -0.61]( 1.33) 0.97 [ -3.05]( 5.48)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.41) 1.06 [ 5.63]( 1.17) 1.06 [ 6.03]( 0.53) 1.09 [ 8.63]( 0.79) 1.06 [ 6.36]( 0.09)
2-clients 1.00 [ 0.00]( 0.58) 1.06 [ 6.25]( 0.85) 1.05 [ 5.47]( 0.83) 1.08 [ 8.24]( 1.29) 1.05 [ 5.15]( 0.57)
4-clients 1.00 [ 0.00]( 0.35) 1.06 [ 5.59]( 0.49) 1.05 [ 5.06]( 0.65) 1.08 [ 8.15]( 0.82) 1.05 [ 5.46]( 0.62)
8-clients 1.00 [ 0.00]( 0.48) 1.06 [ 5.76]( 0.81) 1.05 [ 5.26]( 0.71) 1.08 [ 8.19]( 0.60) 1.05 [ 5.34]( 0.80)
16-clients 1.00 [ 0.00]( 0.66) 1.06 [ 5.95]( 0.69) 1.06 [ 5.52]( 0.78) 1.08 [ 8.31]( 0.86) 1.06 [ 5.76]( 0.48)
32-clients 1.00 [ 0.00]( 1.15) 1.06 [ 5.84]( 1.34) 1.06 [ 5.57]( 0.96) 1.08 [ 8.30]( 0.90) 1.06 [ 5.66]( 1.45)
64-clients 1.00 [ 0.00]( 1.38) 1.05 [ 5.20]( 1.50) 1.05 [ 4.67]( 1.39) 1.07 [ 7.43]( 1.47) 1.05 [ 5.18]( 1.48)
128-clients 1.00 [ 0.00]( 0.87) 1.04 [ 4.39]( 1.03) 1.04 [ 4.43]( 0.98) 1.06 [ 5.98]( 1.01) 1.05 [ 4.60]( 1.06)
256-clients 1.00 [ 0.00]( 5.36) 1.00 [ 0.10]( 3.48) 1.00 [ 0.09]( 4.22) 1.01 [ 0.71]( 3.18) 1.01 [ 1.25]( 3.69)
512-clients 1.00 [ 0.00](54.39) 0.98 [ -1.93](52.45) 1.00 [ -0.35](53.30) 1.02 [ 1.75](54.93) 1.02 [ 1.76](55.71)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1 1.00 [ -0.00]( 8.54) 0.89 [ 10.87](35.39) 0.78 [ 21.74](34.41) 0.91 [ 8.70](12.44) 0.72 [ 28.26](26.70)
2 1.00 [ -0.00]( 1.15) 0.88 [ 12.00]( 4.55) 0.78 [ 22.00]( 6.61) 0.90 [ 10.00]( 5.75) 0.82 [ 18.00](17.98)
4 1.00 [ -0.00](13.46) 0.96 [ 4.17](10.60) 1.00 [ -0.00]( 8.54) 0.96 [ 4.17]( 3.30) 0.98 [ 2.08]( 8.19)
8 1.00 [ -0.00]( 7.14) 0.84 [ 15.79]( 8.44) 0.98 [ 1.75]( 3.67) 0.95 [ 5.26]( 4.99) 0.91 [ 8.77]( 2.92)
16 1.00 [ -0.00]( 3.49) 1.08 [ -8.47]( 4.69) 1.07 [ -6.78]( 0.92) 1.07 [ -6.78]( 0.91) 1.07 [ -6.78]( 3.27)
32 1.00 [ -0.00]( 1.06) 1.10 [ -9.57]( 2.91) 1.07 [ -7.45]( 2.97) 1.07 [ -7.45]( 4.23) 1.05 [ -5.32]( 7.80)
64 1.00 [ -0.00]( 5.48) 1.25 [-25.00]( 5.36) 1.17 [-17.44]( 1.44) 1.23 [-23.26]( 2.79) 1.20 [-19.77]( 2.19)
128 1.00 [ -0.00](10.45) 1.18 [-17.99](12.54) 1.16 [-16.36](21.21) 1.13 [-12.85](12.71) 1.09 [ -8.64]( 3.05)
256 1.00 [ -0.00](31.14) 1.28 [-27.79](17.66) 0.84 [ 16.21](32.14) 1.19 [-19.21]( 1.68) 1.07 [ -6.86]( 7.48)
512 1.00 [ -0.00]( 1.52) 1.01 [ -0.51]( 2.78) 0.97 [ 3.03]( 2.91) 0.98 [ 1.77]( 1.07) 1.01 [ -0.51]( 1.01)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1 1.00 [ 0.00]( 1.07) 1.00 [ 0.29]( 0.00) 1.00 [ 0.29]( 0.15) 0.99 [ -0.59]( 0.46) 1.00 [ 0.29]( 0.30)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
4 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15) 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15) 1.00 [ 0.00]( 0.00)
8 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.15) 1.00 [ 0.29]( 0.00) 1.00 [ 0.00]( 0.40) 1.00 [ 0.29]( 0.15)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.41) 0.99 [ -0.95]( 2.06) 0.98 [ -2.23]( 3.41) 0.98 [ -2.23]( 3.31) 1.03 [ 2.54]( 0.32)
64 1.00 [ 0.00]( 1.05) 0.92 [ -7.58]( 9.01) 0.86 [-13.92](11.30) 1.00 [ 0.00]( 4.74) 1.00 [ -0.38]( 9.98)
128 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.38]( 0.00)
256 1.00 [ 0.00]( 0.72) 1.00 [ -0.31]( 0.42) 1.01 [ 1.23]( 1.33) 1.01 [ 0.61]( 0.83) 1.01 [ 0.92]( 1.36)
512 1.00 [ 0.00]( 0.57) 1.00 [ 0.00]( 0.45) 0.99 [ -0.72]( 1.18) 1.00 [ 0.48]( 0.33) 1.01 [ 1.44]( 0.49)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1 1.00 [ -0.00]( 9.11) 0.75 [ 25.00](11.08) 0.69 [ 31.25]( 8.13) 0.75 [ 25.00](11.08) 0.62 [ 37.50]( 8.94)
2 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 3.78) 0.86 [ 14.29]( 7.45) 0.93 [ 7.14]( 3.87) 0.79 [ 21.43]( 4.84)
4 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 3.87) 0.79 [ 21.43]( 4.56) 0.93 [ 7.14]( 0.00) 0.79 [ 21.43]( 8.85)
8 1.00 [ -0.00]( 0.00) 1.08 [ -8.33](12.91) 0.92 [ 8.33]( 0.00) 0.83 [ 16.67](18.23) 1.08 [ -8.33](12.91)
16 1.00 [ -0.00]( 7.56) 0.92 [ 7.69](11.71) 0.85 [ 15.38](12.06) 1.08 [ -7.69](11.92) 0.85 [ 15.38](12.91)
32 1.00 [ -0.00](15.11) 1.07 [ -6.67]( 3.30) 1.00 [ -0.00](19.06) 1.00 [ -0.00](15.11) 0.80 [ 20.00]( 4.43)
64 1.00 [ -0.00]( 9.63) 1.00 [ -0.00]( 8.15) 1.00 [ -0.00]( 5.34) 1.05 [ -5.00]( 7.75) 0.90 [ 10.00]( 9.94)
128 1.00 [ -0.00]( 4.86) 0.89 [ 11.06]( 7.83) 0.91 [ 8.54]( 7.87) 0.88 [ 12.06]( 8.73) 0.86 [ 14.07]( 5.01)
256 1.00 [ -0.00]( 2.34) 1.00 [ 0.20]( 0.10) 1.04 [ -4.50]( 4.59) 1.03 [ -2.90]( 1.95) 1.04 [ -3.70]( 4.13)
512 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.20) 1.00 [ 0.38]( 0.20) 0.99 [ 0.77]( 0.20) 1.00 [ -0.00]( 0.40)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) vanilla[pct imp](CV) DELAYED[pct imp](CV) DEFAULT[pct imp](CV) BOTH[pct imp](CV)
1 1.00 [ -0.00]( 2.73) 0.98 [ 2.08]( 1.04) 0.99 [ 1.30]( 1.07) 1.02 [ -1.82]( 0.00) 1.01 [ -1.30]( 3.10)
2 1.00 [ -0.00]( 0.87) 1.05 [ -5.40]( 3.10) 1.02 [ -1.89]( 1.58) 1.01 [ -1.08]( 2.76) 1.02 [ -1.62]( 1.45)
4 1.00 [ -0.00]( 1.21) 0.99 [ 0.54]( 1.27) 0.99 [ 1.08]( 1.67) 1.01 [ -1.21]( 1.21) 1.01 [ -1.35]( 1.91)
8 1.00 [ -0.00]( 0.27) 0.99 [ 0.79]( 2.14) 0.98 [ 2.37]( 0.72) 0.99 [ 1.05]( 2.53) 0.99 [ 0.79]( 1.12)
16 1.00 [ -0.00]( 4.04) 1.01 [ -0.53]( 0.55) 1.01 [ -0.80]( 1.08) 1.00 [ -0.27]( 0.36) 0.99 [ 0.53]( 0.50)
32 1.00 [ -0.00]( 7.35) 1.10 [ -9.97](21.10) 1.01 [ -0.66](10.27) 1.25 [-25.36](21.41) 0.90 [ 9.52]( 2.08)
64 1.00 [ -0.00]( 3.54) 1.03 [ -2.89]( 1.55) 1.02 [ -2.00]( 0.98) 1.01 [ -0.67]( 3.62) 1.01 [ -0.89]( 4.98)
128 1.00 [ -0.00]( 0.37) 0.99 [ 0.62]( 0.00) 0.99 [ 0.72]( 0.11) 0.99 [ 0.62]( 0.11) 0.99 [ 0.83]( 0.11)
256 1.00 [ -0.00]( 9.57) 0.92 [ 8.36]( 2.22) 1.03 [ -3.11](12.58) 1.05 [ -5.02]( 8.36) 1.00 [ -0.00](11.71)
512 1.00 [ -0.00]( 1.82) 1.01 [ -1.23]( 0.94) 1.02 [ -2.45]( 1.53) 1.00 [ 0.35]( 0.83) 1.02 [ -1.93]( 1.40)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: vanilla DELAYED DEFAULT BOTH
ycsb-cassandra -0.05% 0.65% -0.49% -0.48%
ycsb-mongodb -0.80% -0.85% -1.00% -0.98%
deathstarbench-1x 2.44% 1.54% 1.65% 0.18%
deathstarbench-2x 5.47% 4.88% 7.92% 6.75%
deathstarbench-3x 0.36% 1.74% -1.75% 0.31%
deathstarbench-6x 1.14% 1.94% 2.24% 1.58%
hammerdb+mysql 16VU 1.08% 5.21% 2.69% 3.80%
hammerdb+mysql 64VU -0.43% -0.31% 2.12% -0.25%
--
Thanks and Regards,
Prateek
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-05-20 9:45 ` [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
2025-06-06 15:03 ` Vincent Guittot
@ 2025-06-13 7:34 ` Dietmar Eggemann
2025-06-13 9:51 ` Peter Zijlstra
1 sibling, 1 reply; 33+ messages in thread
From: Dietmar Eggemann @ 2025-06-13 7:34 UTC (permalink / raw)
To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
bsegall, mgorman, vschneid, clm
Cc: linux-kernel
On 20/05/2025 11:45, Peter Zijlstra wrote:
[...]
> @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> update_rq_clock(rq);
>
> llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> + struct rq *p_rq = task_rq(p);
> + int ret;
> +
> + /*
> + * This is the ttwu_runnable() case. Notably it is possible for
> + * on-rq entities to get migrated -- even sched_delayed ones.
> + */
> + if (unlikely(p_rq != rq)) {
> + rq_unlock(rq, &rf);
> + p_rq = __task_rq_lock(p, &rf);
I always get this fairly early with TTWU_QUEUE_DELAYED enabled, related
to p->pi_lock not held in wakeup from interrupt.
[ 36.175285] WARNING: CPU: 0 PID: 162 at kernel/sched/core.c:679 __task_rq_lock+0xf8/0x128
[ 36.176021] Modules linked in:
[ 36.176187] CPU: 0 UID: 0 PID: 162 Comm: (udev-worker) Tainted: G W 6.15.0-00005-gcacccfab15bd-dirty #59 PREEMPT
[ 36.176587] Tainted: [W]=WARN
[ 36.176727] Hardware name: linux,dummy-virt (DT)
[ 36.176964] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 36.177301] pc : __task_rq_lock+0xf8/0x128
[ 36.177576] lr : __task_rq_lock+0xf4/0x128
...
[ 36.181314] Call trace:
[ 36.181510] __task_rq_lock+0xf8/0x128 (P)
[ 36.181824] sched_ttwu_pending+0x2d8/0x378
[ 36.182020] __flush_smp_call_function_queue+0x138/0x37c
[ 36.182222] generic_smp_call_function_single_interrupt+0x14/0x20
[ 36.182440] ipi_handler+0x254/0x2bc
[ 36.182585] handle_percpu_devid_irq+0xa8/0x2d4
[ 36.182780] handle_irq_desc+0x34/0x58
[ 36.182942] generic_handle_domain_irq+0x1c/0x28
[ 36.183109] gic_handle_irq+0x40/0xe0
[ 36.183289] call_on_irq_stack+0x24/0x64
[ 36.183441] do_interrupt_handler+0x80/0x84
[ 36.183647] el1_interrupt+0x34/0x70
[ 36.183795] el1h_64_irq_handler+0x18/0x24
[ 36.184002] el1h_64_irq+0x6c/0x70
[...]
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
> #define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
>
> #define WF_ON_CPU 0x0100
Looks like there is no specific handling for WF_ON_CPU yet?
> +#define WF_DELAYED 0x0200
[...]
* Re: [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq()
2025-06-09 5:01 ` Mike Galbraith
@ 2025-06-13 9:40 ` Peter Zijlstra
2025-06-13 10:20 ` Mike Galbraith
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-13 9:40 UTC (permalink / raw)
To: Mike Galbraith
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm, linux-kernel
On Mon, Jun 09, 2025 at 07:01:47AM +0200, Mike Galbraith wrote:
Right; so the problem is that we can race with
migrate_disable_switch().
> kernel/sched/core.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4313,7 +4313,10 @@ int try_to_wake_up(struct task_struct *p
> ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
> break;
>
> - cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> + if (is_migration_disabled(p))
> + cpu = -1;
> + else
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>
> /*
> * If the owning (remote) CPU is still in the middle of schedule() with
> @@ -4326,6 +4329,9 @@ int try_to_wake_up(struct task_struct *p
> */
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> + if (cpu == -1)
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> +
> if (task_cpu(p) != cpu) {
> if (p->in_iowait) {
> delayacct_blkio_end(p);
>
So select_task_rq() already checks is_migration_disabled(); just not
well enough. Also, I'm thinking that if we see migration_disabled, we
don't need to call it a second time, just let it be where it was.
Does something like this help? Specifically, when nr_cpus_allowed == 1
|| is_migration_disabled(), don't change @cpu at all.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3593,7 +3593,7 @@ int select_task_rq(struct task_struct *p
cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
*wake_flags |= WF_RQ_SELECTED;
} else {
- cpu = cpumask_any(p->cpus_ptr);
+ cpu = task_cpu(p);
}
/*
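With that applied, the fallback path of select_task_rq() ends up looking
roughly like the below -- a simplified sketch (the is_cpu_allowed() /
select_fallback_rq() handling further down is omitted), just to spell out
what the one-liner does:

/*
 * Sketch only, not the literal function: a pinned or migration-disabled
 * task no longer gets cpumask_any(p->cpus_ptr) picked for it, it simply
 * stays on whatever CPU it is currently on.
 */
static inline int select_task_rq_sketch(struct task_struct *p, int cpu, int *wake_flags)
{
	if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
		cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
		*wake_flags |= WF_RQ_SELECTED;
	} else {
		/* pinned or migration disabled: keep @cpu == task_cpu(p) */
		cpu = task_cpu(p);
	}

	return cpu;
}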
* Re: [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling
2025-06-03 16:03 ` Juri Lelli
@ 2025-06-13 9:43 ` Peter Zijlstra
0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-13 9:43 UTC (permalink / raw)
To: Juri Lelli
Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, clm, linux-kernel
On Tue, Jun 03, 2025 at 06:03:12PM +0200, Juri Lelli wrote:
> Hi,
>
> > @@ -1684,6 +1689,24 @@ void dl_server_stop(struct sched_dl_enti
> > dl_se->dl_server_active = 0;
> > }
> >
> > +static bool dl_server_stopped(struct sched_dl_entity *dl_se)
> > +{
> > + if (!dl_se->dl_server_active)
> > + return false;
> > +
> > + if (dl_se->dl_server_idle) {
> > + __dl_server_stop(dl_se);
> > + return true;
> > + }
> > +
> > + dl_se->dl_server_idle = 1;
> > + return false;
> > +}
> > +
> > +void dl_server_stop(struct sched_dl_entity *dl_se)
> > +{
> > +}
>
> What if we explicitly set the server to idle (instead of ignoring the
> stop) where this gets called in dequeue_entities()?
That would break things; we want to detect if it was ever !idle in the
period.
> Also, don't we need to actually stop the server if we are changing its
> parameters from sched_fair_server_write()?
Quite - let me just remove the offending callsites then.
Would this explain this massive regression 0day reported here? Seems
weird.
Anyway, let me go update the patch.
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-13 7:34 ` Dietmar Eggemann
@ 2025-06-13 9:51 ` Peter Zijlstra
2025-06-13 10:46 ` Peter Zijlstra
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-13 9:51 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Fri, Jun 13, 2025 at 09:34:22AM +0200, Dietmar Eggemann wrote:
> On 20/05/2025 11:45, Peter Zijlstra wrote:
>
> [...]
>
> > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > update_rq_clock(rq);
> >
> > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > + struct rq *p_rq = task_rq(p);
> > + int ret;
> > +
> > + /*
> > + * This is the ttwu_runnable() case. Notably it is possible for
> > + * on-rq entities to get migrated -- even sched_delayed ones.
> > + */
> > + if (unlikely(p_rq != rq)) {
> > + rq_unlock(rq, &rf);
> > + p_rq = __task_rq_lock(p, &rf);
>
> I always get this fairly early with TTWU_QUEUE_DELAYED enabled, related
> to p->pi_lock not held in wakeup from interrupt.
>
> [ 36.175285] WARNING: CPU: 0 PID: 162 at kernel/sched/core.c:679 __task_rq_lock+0xf8/0x128
Thanks, let me go have a look.
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
> > #define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
> >
> > #define WF_ON_CPU 0x0100
>
> Looks like there is no specific handling for WF_ON_CPU yet?
Oh, indeed. That didn't survive the tinkering and then I forgot to clean
it up here. Let me go find a broom and sweep these few bits under the
carpet then :-)
* Re: [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq()
2025-06-13 9:40 ` Peter Zijlstra
@ 2025-06-13 10:20 ` Mike Galbraith
0 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2025-06-13 10:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, clm, linux-kernel
On Fri, 2025-06-13 at 11:40 +0200, Peter Zijlstra wrote:
> On Mon, Jun 09, 2025 at 07:01:47AM +0200, Mike Galbraith wrote:
>
> Right; so the problem being that we can race with
> migrate_disable_switch().
Yeah. Most of the time the fallback saves us, but we can and do
zip past it, and that turns the box various shades of sad.
>
> Does something like this help?
It surely will, but I'll testdrive it. No news is good news.
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3593,7 +3593,7 @@ int select_task_rq(struct task_struct *p
> 		cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
> *wake_flags |= WF_RQ_SELECTED;
> } else {
> - cpu = cpumask_any(p->cpus_ptr);
> + cpu = task_cpu(p);
> }
>
> /*
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-13 9:51 ` Peter Zijlstra
@ 2025-06-13 10:46 ` Peter Zijlstra
2025-06-16 8:16 ` Dietmar Eggemann
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-13 10:46 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Fri, Jun 13, 2025 at 11:51:19AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 13, 2025 at 09:34:22AM +0200, Dietmar Eggemann wrote:
> > On 20/05/2025 11:45, Peter Zijlstra wrote:
> >
> > [...]
> >
> > > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > > update_rq_clock(rq);
> > >
> > > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > > + struct rq *p_rq = task_rq(p);
> > > + int ret;
> > > +
> > > + /*
> > > + * This is the ttwu_runnable() case. Notably it is possible for
> > > + * on-rq entities to get migrated -- even sched_delayed ones.
> > > + */
> > > + if (unlikely(p_rq != rq)) {
> > > + rq_unlock(rq, &rf);
> > > + p_rq = __task_rq_lock(p, &rf);
> >
> > I always get this fairly early with TTWU_QUEUE_DELAYED enabled, related
> > to p->pi_lock not held in wakeup from interrupt.
> >
> > [ 36.175285] WARNING: CPU: 0 PID: 162 at kernel/sched/core.c:679 __task_rq_lock+0xf8/0x128
>
> Thanks, let me go have a look.
I'm thinking this should cure things.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -677,7 +677,12 @@ struct rq *__task_rq_lock(struct task_st
{
struct rq *rq;
- lockdep_assert_held(&p->pi_lock);
+ /*
+ * TASK_WAKING is used to serialize the remote end of wakeup, rather
+ * than p->pi_lock.
+ */
+ lockdep_assert(p->__state == TASK_WAKING ||
+ lockdep_is_held(&p->pi_lock) != LOCK_STATE_NOT_HELD);
for (;;) {
rq = task_rq(p);
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-05-29 1:41 ` Chris Mason
@ 2025-06-14 10:04 ` Peter Zijlstra
2025-06-16 0:35 ` Chris Mason
0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-14 10:04 UTC (permalink / raw)
To: Chris Mason
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel
On Wed, May 28, 2025 at 09:41:33PM -0400, Chris Mason wrote:
> I'll get all of these run on the big turin machine, should have some
> numbers tomorrow.
Right... so Turin. I had a quick look through our IRC logs but I
couldn't find exactly which model you had, and unfortunately AMD uses
the Turin name for both Zen 5c and Zen 5 Epyc :-(
Anyway, the big and obvious difference between the whole Intel and AMD
machines is the L3. So far we've been looking at SKL/SPR single L3
performance, but Turin (whichever that might be) will be having many L3.
With Zen5 having 8 cores per L3 and Zen5c having 16.
Additionally, your schbench -M auto thing is doing exactly the wrong
thing for them. What you want is for those message threads to be spread
out across the L3s, not all stuck to the first (which is what -M auto
would end up doing). And then the associated worker threads would
ideally stick to their respective L3s and not scatter all over the
machine.
Anyway, most of the data we shared was about single socket SKL, might be
we missed some obvious things for the multi-L3 case.
I'll go poke at some of the things I've so far neglected because of the
single L3 focus.
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-06-13 3:28 ` K Prateek Nayak
@ 2025-06-14 10:15 ` Peter Zijlstra
0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-14 10:15 UTC (permalink / raw)
To: K Prateek Nayak
Cc: mingo, juri.lelli, vincent.guittot, linux-kernel,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm
On Fri, Jun 13, 2025 at 08:58:56AM +0530, K Prateek Nayak wrote:
> - schbench (old) has a consistent regression for 16, 32, 64,
> 128, 256 workers (> CCX size, < Overloaded) except for with
> 256 workers case with TTWU_QUEUE_DEFAULT which shows an
> improvement.
>
> - new schebench has few regressions around 32, 64, and 128
> workers for wakeup and request latency.
Right, so I actually made Chris' favourite workloads worse with these
patches :/
Let me go try this again..
* Re: [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions
2025-06-14 10:04 ` Peter Zijlstra
@ 2025-06-16 0:35 ` Chris Mason
0 siblings, 0 replies; 33+ messages in thread
From: Chris Mason @ 2025-06-16 0:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel
On 6/14/25 6:04 AM, Peter Zijlstra wrote:
> On Wed, May 28, 2025 at 09:41:33PM -0400, Chris Mason wrote:
>
>> I'll get all of these run on the big turin machine, should have some
>> numbers tomorrow.
>
> Right... so Turin. I had a quick look through our IRC logs but I
> couldn't find exactly which model you had, and unfortunately AMD uses
> the Turin name for both Zen 5c and Zen 5 Epyc :-(
Looks like the one I've been testing on is the epyc variant. But,
stepping back for a bit, I bisected a few regressions between 6.9 and 6.13:
- DL server
- DELAY_{DEQUEUE,ZERO}
- PSI fun (not really me, but relevant)
I think these are all important and relevant, but it was strange that
none of these patches seemed to move the needle much on the turin
machines (+/- futexes), so I went back to the drawing board.
Our internal 6.9 kernel was the "fast" one, and comparing it with
vanilla 6.9, it turns out we'd carried some patches that significantly
improved our web workload on top of vanilla 6.9.
In other words, I've been trying to find a regression that Vernet
actually fixed in 6.9 already. Bisecting pointed to:
Author: David Vernet <void@manifault.com>
Date: Tue May 7 08:15:32 2024 -0700
Revert "sched/fair: Remove sysctl_sched_migration_cost condition"
This reverts commit c5b0a7eefc70150caf23e37bc9d639c68c87a097.
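For context, the condition that the revert puts back into the newidle
balance path looks roughly like this (from memory, so treat it as a sketch
rather than the exact hunk):

	/* in newidle_balance() / sched_balance_newidle() */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !READ_ONCE(this_rq->rd->overload)) {
		/*
		 * Expected idle time is too short to amortize a newly-idle
		 * balance; update next_balance and bail out instead of
		 * scanning the domains.
		 */
		goto out;
	}

With that check gone, every newly-idle CPU goes hunting for work, which
matches the lb_balance_newly_idle / lb_imbalance_newly_idle jump in the
schedstat output below.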
Comparing schedstat.py output from the fast and slow kernels... this is 6.9
vs 6.13, but I'll get a comparison tomorrow where the schedstat versions
actually match.
# grep balance slow.stat.out
lb_balance_not_idle: 687
lb_imbalance_not_idle: 0
lb_balance_idle: 659054
lb_imbalance_idle: 2635
lb_balance_newly_idle: 2051682
lb_imbalance_newly_idle: 500328
sbe_balanced: 0
sbf_balanced: 0
ttwu_move_balance: 0
# grep balance fast.stat.out
lb_balance_idle: 606600
lb_imbalance_idle: 1911
lb_balance_not_idle: 680
lb_imbalance_not_idle: 0
lb_balance_newly_idle: 11697
lb_imbalance_newly_idle: 22868
sbe_balanced: 0
sbf_balanced: 0
ttwu_move_balance: 0
Reverting that commit above on vanilla 6.9 makes us fast. Disabling new
idle balance completely is fast on our 6.13 kernel, but reverting that
one commit doesn't change much. I'll switch back to upstream and
compare newidle balance behavior.
>
> Anyway, the big and obvious difference between the whole Intel and AMD
> machines is the L3. So far we've been looking at SKL/SPR single L3
> performance, but Turin (whichever that might be) will be having many L3.
> With Zen5 having 8 cores per L3 and Zen5c having 16.
>
> Additionally, your schbench -M auto thing is doing exactly the wrong
> thing for them. What you want is for those message threads to be spread
> out across the L3s, not all stuck to the first (which is what -M auto
> would end up doing). And then the associated worker threads would
> ideally stick to their respective L3s and not scatter all over the
> machine.
>
> Anyway, most of the data we shared was about single socket SKL, might be
> we missed some obvious things for the multi-L3 case.
>
> I'll go poke at some of the things I've so far neglected because of the
> single L3 focus.
You're 100% right about all of this, and I really do want to add some
better smarts to the pinning for both numa and chiplets.
-chris
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-13 10:46 ` Peter Zijlstra
@ 2025-06-16 8:16 ` Dietmar Eggemann
0 siblings, 0 replies; 33+ messages in thread
From: Dietmar Eggemann @ 2025-06-16 8:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On 13/06/2025 12:46, Peter Zijlstra wrote:
> On Fri, Jun 13, 2025 at 11:51:19AM +0200, Peter Zijlstra wrote:
>> On Fri, Jun 13, 2025 at 09:34:22AM +0200, Dietmar Eggemann wrote:
>>> On 20/05/2025 11:45, Peter Zijlstra wrote:
[...]
>>> I always get this fairly early with TTWU_QUEUE_DELAYED enabled, related
>>> to p->pi_lock not held in wakeup from interrupt.
>>>
>>> [ 36.175285] WARNING: CPU: 0 PID: 162 at kernel/sched/core.c:679 __task_rq_lock+0xf8/0x128
>>
>> Thanks, let me go have a look.
>
> I'm thinking this should cure things.
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -677,7 +677,12 @@ struct rq *__task_rq_lock(struct task_st
> {
> struct rq *rq;
>
> - lockdep_assert_held(&p->pi_lock);
> + /*
> + * TASK_WAKING is used to serialize the remote end of wakeup, rather
> + * than p->pi_lock.
> + */
> + lockdep_assert(p->__state == TASK_WAKING ||
> + lockdep_is_held(&p->pi_lock) != LOCK_STATE_NOT_HELD);
>
> for (;;) {
> rq = task_rq(p);
Yes, it does. I assume we can only end up in sched_ttwu_pending()'s 'if
(unlikely(p_rq != rq))' when ttwu_queue_wakelist() is called from
ttwu_runnable(), i.e. for sched_delayed tasks.
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-06 15:03 ` Vincent Guittot
2025-06-06 15:38 ` Peter Zijlstra
2025-06-06 16:18 ` Phil Auld
@ 2025-06-16 12:01 ` Peter Zijlstra
2025-06-16 16:37 ` Peter Zijlstra
2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-16 12:01 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Fri, Jun 06, 2025 at 05:03:36PM +0200, Vincent Guittot wrote:
> On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > One of the things lost with introduction of DELAY_DEQUEUE is the
> > ability of TTWU to move those tasks around on wakeup, since they're
> > on_rq, and as such, need to be woken in-place.
>
> I was thinking that you would call select_task_rq() somewhere in the
> wake up path of delayed entity to get a chance to migrate it which was
> one reason for the perf regression (and which would have also been
> useful for EAS case) but IIUC,
FWIW, the trivial form of all this is something like the below. The
problem is that performance sucks :/ For me it is worse than not doing
it. But perhaps it is the right thing for the more complicated cases?
On my SPR:
schbench-6.9.0-1.txt:average rps: 2975450.75
schbench-6.9.0-2.txt:average rps: 2975464.38
schbench-6.9.0-3.txt:average rps: 2974881.02
(these patches)
schbench-6.15.0-dirty-1.txt:average rps: 3029984.58
schbench-6.15.0-dirty-2.txt:average rps: 3034723.10
schbench-6.15.0-dirty-3.txt:average rps: 3033893.33
TTWU_QUEUE_DELAYED
schbench-6.15.0-dirty-delayed-1.txt:average rps: 3048778.58
schbench-6.15.0-dirty-delayed-2.txt:average rps: 3049587.90
schbench-6.15.0-dirty-delayed-3.txt:average rps: 3045826.95
NO_DELAY_DEQUEUE
schbench-6.15.0-dirty-no_delay-1.txt:average rps: 3043629.03
schbench-6.15.0-dirty-no_delay-2.txt:average rps: 3046054.47
schbench-6.15.0-dirty-no_delay-3.txt:average rps: 3044736.37
TTWU_DEQUEUE
schbench-6.15.0-dirty-dequeue-1.txt:average rps: 3008790.80
schbench-6.15.0-dirty-dequeue-2.txt:average rps: 3017497.33
schbench-6.15.0-dirty-dequeue-3.txt:average rps: 3005858.57
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -3770,8 +3770,13 @@ static int __ttwu_runnable(struct rq *rq
return 0;
update_rq_clock(rq);
- if (p->se.sched_delayed)
+ if (p->se.sched_delayed) {
+ if (sched_feat(TTWU_DEQUEUE)) {
+ dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_DELAYED | DEQUEUE_SLEEP);
+ return 0;
+ }
enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ }
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -84,6 +84,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
SCHED_FEAT(TTWU_QUEUE_DELAYED, false)
SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
+SCHED_FEAT(TTWU_DEQUEUE, false)
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-11 9:39 ` Peter Zijlstra
@ 2025-06-16 12:39 ` Vincent Guittot
0 siblings, 0 replies; 33+ messages in thread
From: Vincent Guittot @ 2025-06-16 12:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Wed, 11 Jun 2025 at 11:39, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 06, 2025 at 06:55:37PM +0200, Vincent Guittot wrote:
> > > > > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > > > > update_rq_clock(rq);
> > > > >
> > > > > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > > > > + struct rq *p_rq = task_rq(p);
> > > > > + int ret;
> > > > > +
> > > > > + /*
> > > > > + * This is the ttwu_runnable() case. Notably it is possible for
> > > > > + * on-rq entities to get migrated -- even sched_delayed ones.
> > > >
> > > > I haven't found where the sched_delayed task could migrate on another cpu.
> > >
> > > Doesn't happen often, but it can happen. Nothing really stops it from
> > > happening. Eg weight based balancing can do it. As can numa balancing
> > > and affinity changes.
> >
> > Yes, I agree that delayed tasks can migrate because of load balancing
> > but not at wake up.
>
> Right, but this here is the case where wakeup races with load-balancing.
> Specifically, due to the wake_list, the wakeup can happen while the task
> is on CPU N, and by the time the IPI gets processed the task has moved
> to CPU M.
>
> It doesn't happen often, but it was 'fun' chasing that fail around for a
> day :/
Ok, it makes sense now.
* Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
2025-06-16 12:01 ` Peter Zijlstra
@ 2025-06-16 16:37 ` Peter Zijlstra
0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-16 16:37 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, clm, linux-kernel
On Mon, Jun 16, 2025 at 02:01:25PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 06, 2025 at 05:03:36PM +0200, Vincent Guittot wrote:
> > On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > One of the things lost with introduction of DELAY_DEQUEUE is the
> > > ability of TTWU to move those tasks around on wakeup, since they're
> > > on_rq, and as such, need to be woken in-place.
> >
> > I was thinking that you would call select_task_rq() somewhere in the
> > wake up path of delayed entity to get a chance to migrate it which was
> > one reason for the perf regression (and which would have also been
> > useful for EAS case) but IIUC,
>
> FWIW, the trivial form of all this is something like the below. The
> problem is that performance sucks :/ For me it is worse than not doing
> it.
And because I was poking at the thing, I had to try the complicated
version again... This seems to survive long enough for a few benchmark
runs, and it's not bad.
It very much burns after a while though :-( So I'll have to poke more at
this. Clearly I'm missing something (again!).
---
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -994,6 +994,7 @@ struct task_struct {
* ->sched_remote_wakeup gets used, so it can be in this word.
*/
unsigned sched_remote_wakeup:1;
+ unsigned sched_remote_delayed:1;
#ifdef CONFIG_RT_MUTEXES
unsigned sched_rt_mutex:1;
#endif
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -3844,6 +3849,50 @@ static int ttwu_runnable(struct task_str
}
#ifdef CONFIG_SMP
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
+static inline bool ttwu_do_migrate(struct task_struct *p, int cpu)
+{
+ if (task_cpu(p) == cpu)
+ return false;
+
+ if (p->in_iowait) {
+ delayacct_blkio_end(p);
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
+ psi_ttwu_dequeue(p);
+ set_task_cpu(p, cpu);
+ return true;
+}
+
+static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+ int cpu = task_cpu(p);
+
+ /*
+ * Notably it is possible for on-rq entities to get migrated -- even
+ * sched_delayed ones.
+ */
+ if (unlikely(cpu_of(rq) != cpu)) {
+ /* chase after it */
+ __ttwu_queue_wakelist(p, cpu, wake_flags | WF_DELAYED);
+ return 1;
+ }
+
+ if (task_on_rq_queued(p))
+ dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+ if (!ttwu_do_migrate(p, cpu))
+ return 0;
+
+ wake_flags |= WF_MIGRATED;
+ /* shoot it to the other CPU */
+ __ttwu_queue_wakelist(p, cpu, wake_flags);
+ return 1;
+}
+
void sched_ttwu_pending(void *arg)
{
struct llist_node *llist = arg;
@@ -3857,39 +3906,12 @@ void sched_ttwu_pending(void *arg)
update_rq_clock(rq);
llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
- struct rq *p_rq = task_rq(p);
- int ret;
-
- /*
- * This is the ttwu_runnable() case. Notably it is possible for
- * on-rq entities to get migrated -- even sched_delayed ones.
- */
- if (unlikely(p_rq != rq)) {
- rq_unlock(rq, &guard.rf);
- p_rq = __task_rq_lock(p, &guard.rf);
- }
-
- ret = __ttwu_runnable(p_rq, p, WF_TTWU);
-
- if (unlikely(p_rq != rq)) {
- if (!ret)
- set_task_cpu(p, cpu_of(rq));
-
- __task_rq_unlock(p_rq, &guard.rf);
- rq_lock(rq, &guard.rf);
- update_rq_clock(rq);
- }
-
- if (ret)
- continue;
-
- /*
- * This is the 'normal' case where the task is blocked.
- */
-
if (WARN_ON_ONCE(p->on_cpu))
smp_cond_load_acquire(&p->on_cpu, !VAL);
+ if (p->sched_remote_delayed && ttwu_delayed(rq, p, WF_TTWU))
+ continue;
+
ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &guard.rf);
}
@@ -3933,6 +3955,7 @@ static void __ttwu_queue_wakelist(struct
struct rq *rq = cpu_rq(cpu);
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+ p->sched_remote_delayed = !!(wake_flags & WF_DELAYED);
WRITE_ONCE(rq->ttwu_pending, 1);
__smp_call_single_queue(cpu, &p->wake_entry.llist);
@@ -4371,17 +4394,8 @@ int try_to_wake_up(struct task_struct *p
* their previous state and preserve Program Order.
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
-
- if (task_cpu(p) != cpu) {
- if (p->in_iowait) {
- delayacct_blkio_end(p);
- atomic_dec(&task_rq(p)->nr_iowait);
- }
-
+ if (ttwu_do_migrate(p, cpu))
wake_flags |= WF_MIGRATED;
- psi_ttwu_dequeue(p);
- set_task_cpu(p, cpu);
- }
#else
cpu = task_cpu(p);
#endif /* CONFIG_SMP */
Thread overview: 33+ messages
2025-05-20 9:45 [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 1/5] sched/deadline: Less agressive dl_server handling Peter Zijlstra
2025-06-03 16:03 ` Juri Lelli
2025-06-13 9:43 ` Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 2/5] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
2025-06-09 5:01 ` Mike Galbraith
2025-06-13 9:40 ` Peter Zijlstra
2025-06-13 10:20 ` Mike Galbraith
2025-05-20 9:45 ` [RFC][PATCH 3/5] sched: Split up ttwu_runnable() Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 4/5] sched: Add ttwu_queue controls Peter Zijlstra
2025-05-20 9:45 ` [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
2025-06-06 15:03 ` Vincent Guittot
2025-06-06 15:38 ` Peter Zijlstra
2025-06-06 16:55 ` Vincent Guittot
2025-06-11 9:39 ` Peter Zijlstra
2025-06-16 12:39 ` Vincent Guittot
2025-06-06 16:18 ` Phil Auld
2025-06-16 12:01 ` Peter Zijlstra
2025-06-16 16:37 ` Peter Zijlstra
2025-06-13 7:34 ` Dietmar Eggemann
2025-06-13 9:51 ` Peter Zijlstra
2025-06-13 10:46 ` Peter Zijlstra
2025-06-16 8:16 ` Dietmar Eggemann
2025-05-28 19:59 ` [RFC][PATCH 0/5] sched: Try and address some recent-ish regressions Peter Zijlstra
2025-05-29 1:41 ` Chris Mason
2025-06-14 10:04 ` Peter Zijlstra
2025-06-16 0:35 ` Chris Mason
2025-05-29 10:18 ` Beata Michalska
2025-05-30 9:00 ` Peter Zijlstra
2025-05-30 10:04 ` Chris Mason
2025-06-02 4:44 ` K Prateek Nayak
2025-06-13 3:28 ` K Prateek Nayak
2025-06-14 10:15 ` Peter Zijlstra