[PATCH v2 00/12] sched: Address schbench regression

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 00/12] sched: Address schbench regression
@ 2025-07-02 11:49 Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Hi!

Previous version:

  https://lkml.kernel.org/r/20250520094538.086709102@infradead.org


Changes:
 - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
 - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
 - fixed lockdep splat (dietmar)
 - added a few preperatory patches


Patches apply on top of tip/master (which includes the disabling of private futex)
and clm's newidle balance patch (which I'm awaiting vingu's ack on).

Performance is similar to the last version; as tested on my SPR on v6.15 base:

v6.15:
schbench-6.15.0-1.txt:average rps: 2891403.72
schbench-6.15.0-2.txt:average rps: 2889997.02
schbench-6.15.0-3.txt:average rps: 2894745.17

v6.15 + patches 1-10:
schbench-6.15.0-dirty-4.txt:average rps: 3038265.95
schbench-6.15.0-dirty-5.txt:average rps: 3037327.50
schbench-6.15.0-dirty-6.txt:average rps: 3038160.15

v6.15 + all patches:
schbench-6.15.0-dirty-deferred-1.txt:average rps: 3043404.30
schbench-6.15.0-dirty-deferred-2.txt:average rps: 3046124.17
schbench-6.15.0-dirty-deferred-3.txt:average rps: 3043627.10


Patches can also be had here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core


I'm hoping we can get this merged for next cycle so we can all move on from this.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-15 19:11   ` Chris Mason
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz, Johannes Weiner

Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on
a high context switch rate benchmark (schbench) due to the now
repeating cpu_clock() calls.

In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads
with inconsistent results. It is effectively back-dating an update.

(note that this all hard-relies on the clock being synchronized across
CPUs -- if this is not the case, all bets are off).

Combine this problem with the fact that there are per-group-per-cpu
seqcounts, the commit in question pushed the clock read into the group
iteration, causing tree-depth cpu_clock() calls. On architectures
where cpu_clock() has appreciable overhead, this hurts.

Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update
side (proportional to the tree depth) which can cause the read side to
retry more often.

Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
---
 include/linux/psi_types.h |    6 --
 kernel/sched/psi.c        |  121 +++++++++++++++++++++++++---------------------
 2 files changed, 68 insertions(+), 59 deletions(-)

--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -84,11 +84,9 @@ enum psi_aggregators {
 struct psi_group_cpu {
 	/* 1st cacheline updated by the scheduler */
 
-	/* Aggregator needs to know of concurrent changes */
-	seqcount_t seq ____cacheline_aligned_in_smp;
-
 	/* States of the tasks belonging to this group */
-	unsigned int tasks[NR_PSI_TASK_COUNTS];
+	unsigned int tasks[NR_PSI_TASK_COUNTS]
+			____cacheline_aligned_in_smp;
 
 	/* Aggregate pressure state derived from the tasks */
 	u32 state_mask;
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -176,6 +176,28 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static DEFINE_PER_CPU(seqcount_t, psi_seq);
+
+static inline void psi_write_begin(int cpu)
+{
+	write_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline void psi_write_end(int cpu)
+{
+	write_seqcount_end(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline u32 psi_read_begin(int cpu)
+{
+	return read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline bool psi_read_retry(int cpu, u32 seq)
+{
+	return read_seqcount_retry(per_cpu_ptr(&psi_seq, cpu), seq);
+}
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void poll_timer_fn(struct timer_list *t);
@@ -186,7 +208,7 @@ static void group_init(struct psi_group
 
 	group->enabled = true;
 	for_each_possible_cpu(cpu)
-		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
+		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
 	mutex_init(&group->avgs_lock);
@@ -266,14 +288,14 @@ static void get_recent_times(struct psi_
 
 	/* Snapshot a coherent view of the CPU state */
 	do {
-		seq = read_seqcount_begin(&groupc->seq);
+		seq = psi_read_begin(cpu);
 		now = cpu_clock(cpu);
 		memcpy(times, groupc->times, sizeof(groupc->times));
 		state_mask = groupc->state_mask;
 		state_start = groupc->state_start;
 		if (cpu == current_cpu)
 			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
-	} while (read_seqcount_retry(&groupc->seq, seq));
+	} while (psi_read_retry(cpu, seq));
 
 	/* Calculate state time deltas against the previous snapshot */
 	for (s = 0; s < NR_PSI_STATES; s++) {
@@ -772,31 +794,21 @@ static void record_times(struct psi_grou
 		groupc->times[PSI_NONIDLE] += delta;
 }
 
+#define for_each_group(iter, group) \
+	for (typeof(group) iter = group; iter; iter = iter->parent)
+
 static void psi_group_change(struct psi_group *group, int cpu,
 			     unsigned int clear, unsigned int set,
-			     bool wake_clock)
+			     u64 now, bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
-	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we update the task counts according to the state
-	 * change requested through the @clear and @set bits.
-	 *
-	 * Then if the cgroup PSI stats accounting enabled, we
-	 * assess the aggregate resource states this CPU's tasks
-	 * have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 */
-	write_seqcount_begin(&groupc->seq);
-	now = cpu_clock(cpu);
-
-	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
 	 * the state mask. Clear, set, or carry the current state if
@@ -847,7 +859,6 @@ static void psi_group_change(struct psi_
 
 		groupc->state_mask = state_mask;
 
-		write_seqcount_end(&groupc->seq);
 		return;
 	}
 
@@ -868,8 +879,6 @@ static void psi_group_change(struct psi_
 
 	groupc->state_mask = state_mask;
 
-	write_seqcount_end(&groupc->seq);
-
 	if (state_mask & group->rtpoll_states)
 		psi_schedule_rtpoll_work(group, 1, false);
 
@@ -904,24 +913,29 @@ static void psi_flags_change(struct task
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
-	struct psi_group *group;
+	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	group = task_psi_group(task);
-	do {
-		psi_group_change(group, cpu, clear, set, true);
-	} while ((group = group->parent));
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
+	for_each_group(group, task_psi_group(task))
+		psi_group_change(group, cpu, clear, set, now, true);
+	psi_write_end(cpu);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
-	struct psi_group *group, *common = NULL;
+	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
+	u64 now;
+
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -930,16 +944,15 @@ void psi_task_switch(struct task_struct
 		 * ancestors with @prev, those will already have @prev's
 		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		group = task_psi_group(next);
-		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+		for_each_group(group, task_psi_group(next)) {
+			struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+
+			if (groupc->state_mask & PSI_ONCPU) {
 				common = group;
 				break;
 			}
-
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		}
 	}
 
 	if (prev->pid) {
@@ -972,12 +985,11 @@ void psi_task_switch(struct task_struct
 
 		psi_flags_change(prev, clear, set);
 
-		group = task_psi_group(prev);
-		do {
+		for_each_group(group, task_psi_group(prev)) {
 			if (group == common)
 				break;
-			psi_group_change(group, cpu, clear, set, wake_clock);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If there are
@@ -987,20 +999,21 @@ void psi_task_switch(struct task_struct
 		 */
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, wake_clock);
+			for_each_group(group, common)
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
+	psi_write_end(cpu);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
 {
 	int cpu = task_cpu(curr);
-	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	s64 delta;
 	u64 irq;
+	u64 now;
 
 	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
@@ -1009,8 +1022,7 @@ void psi_account_irqtime(struct rq *rq,
 		return;
 
 	lockdep_assert_rq_held(rq);
-	group = task_psi_group(curr);
-	if (prev && task_psi_group(prev) == group)
+	if (prev && task_psi_group(prev) == task_psi_group(curr))
 		return;
 
 	irq = irq_time_read(cpu);
@@ -1019,25 +1031,22 @@ void psi_account_irqtime(struct rq *rq,
 		return;
 	rq->psi_irq_time = irq;
 
-	do {
-		u64 now;
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
+	for_each_group(group, task_psi_group(curr)) {
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
-		write_seqcount_begin(&groupc->seq);
-		now = cpu_clock(cpu);
-
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
 
-		write_seqcount_end(&groupc->seq);
-
 		if (group->rtpoll_states & (1 << PSI_IRQ_FULL))
 			psi_schedule_rtpoll_work(group, 1, false);
-	} while ((group = group->parent));
+	}
+	psi_write_end(cpu);
 }
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
@@ -1225,12 +1234,14 @@ void psi_cgroup_restart(struct psi_group
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		struct rq_flags rf;
+		u64 now;
 
-		rq_lock_irq(rq, &rf);
-		psi_group_change(group, cpu, 0, 0, true);
-		rq_unlock_irq(rq, &rf);
+		guard(rq_lock_irq)(cpu_rq(cpu));
+
+		psi_write_begin(cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_write_end(cpu);
 	}
 }
 #endif /* CONFIG_CGROUPS */



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-02 11:49 ` [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage Peter Zijlstra
@ 2025-07-15 19:11   ` Chris Mason
  2025-07-16  6:06     ` K Prateek Nayak
  2025-07-16  6:53     ` Beata Michalska
  0 siblings, 2 replies; 101+ messages in thread
From: Chris Mason @ 2025-07-15 19:11 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel, Johannes Weiner

On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
> pressure spikes from aggregation race") caused a regression for him on
> a high context switch rate benchmark (schbench) due to the now
> repeating cpu_clock() calls.
> 
> In particular the problem is that get_recent_times() will extrapolate
> the current state to 'now'. But if an update uses a timestamp from
> before the start of the update, it is possible to get two reads
> with inconsistent results. It is effectively back-dating an update.
> 
> (note that this all hard-relies on the clock being synchronized across
> CPUs -- if this is not the case, all bets are off).
> 
> Combine this problem with the fact that there are per-group-per-cpu
> seqcounts, the commit in question pushed the clock read into the group
> iteration, causing tree-depth cpu_clock() calls. On architectures
> where cpu_clock() has appreciable overhead, this hurts.
> 
> Instead move to a per-cpu seqcount, which allows us to have a single
> clock read for all group updates, increasing internal consistency and
> lowering update overhead. This comes at the cost of a longer update
> side (proportional to the tree depth) which can cause the read side to
> retry more often.
> 
> Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>,
> Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net 
> ---
>  include/linux/psi_types.h |    6 --
>  kernel/sched/psi.c        |  121 +++++++++++++++++++++++++---------------------
>  2 files changed, 68 insertions(+), 59 deletions(-)
> 
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -84,11 +84,9 @@ enum psi_aggregators {
>  struct psi_group_cpu {
>  	/* 1st cacheline updated by the scheduler */
>  
> -	/* Aggregator needs to know of concurrent changes */
> -	seqcount_t seq ____cacheline_aligned_in_smp;
> -
>  	/* States of the tasks belonging to this group */
> -	unsigned int tasks[NR_PSI_TASK_COUNTS];
> +	unsigned int tasks[NR_PSI_TASK_COUNTS]
> +			____cacheline_aligned_in_smp;
>  
>  	/* Aggregate pressure state derived from the tasks */
>  	u32 state_mask;
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -176,6 +176,28 @@ struct psi_group psi_system = {
>  	.pcpu = &system_group_pcpu,
>  };
>  
> +static DEFINE_PER_CPU(seqcount_t, psi_seq);

[ ... ]

> @@ -186,7 +208,7 @@ static void group_init(struct psi_group
>  
>  	group->enabled = true;
>  	for_each_possible_cpu(cpu)
> -		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> +		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>  	group->avg_last_update = sched_clock();
>  	group->avg_next_update = group->avg_last_update + psi_period;
>  	mutex_init(&group->avgs_lock);

I'm not sure if someone mentioned this already, but testing the
series I got a bunch of softlockups in get_recent_times()
that randomly jumped from CPU to CPU.

This fixed it for me, but reading it now I'm wondering
if we want to seqcount_init() unconditionally even when PSI
is off.  

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2024c1d36402d..979a447bc281f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -207,8 +207,6 @@ static void group_init(struct psi_group *group)
        int cpu;

        group->enabled = true;
-       for_each_possible_cpu(cpu)
-               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
        group->avg_last_update = sched_clock();
        group->avg_next_update = group->avg_last_update + psi_period;
        mutex_init(&group->avgs_lock);
@@ -231,6 +229,7 @@ static void group_init(struct psi_group *group)

 void __init psi_init(void)
 {
+       int cpu;
        if (!psi_enable) {
                static_branch_enable(&psi_disabled);
                static_branch_disable(&psi_cgroups_enabled);
@@ -241,6 +240,8 @@ void __init psi_init(void)
                static_branch_disable(&psi_cgroups_enabled);

        psi_period = jiffies_to_nsecs(PSI_FREQ);
+       for_each_possible_cpu(cpu)
+               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
        group_init(&psi_system);
 }


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-15 19:11   ` Chris Mason
@ 2025-07-16  6:06     ` K Prateek Nayak
  2025-07-16  6:53     ` Beata Michalska
  1 sibling, 0 replies; 101+ messages in thread
From: K Prateek Nayak @ 2025-07-16  6:06 UTC (permalink / raw)
  To: Chris Mason, Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel, Johannes Weiner, Ayush.jain3, Srikanth.Aithal

(+ Ayush, Srikanth)

Hello Chris,

On 7/16/2025 12:41 AM, Chris Mason wrote:
>> @@ -186,7 +208,7 @@ static void group_init(struct psi_group
>>  
>>  	group->enabled = true;
>>  	for_each_possible_cpu(cpu)
>> -		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>> +		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>>  	group->avg_last_update = sched_clock();
>>  	group->avg_next_update = group->avg_last_update + psi_period;
>>  	mutex_init(&group->avgs_lock);
> 
> I'm not sure if someone mentioned this already, but testing the
> series I got a bunch of softlockups in get_recent_times()
> that randomly jumped from CPU to CPU.

Ayush, Srikanth, and I ran into this yesterday when testing different
trees (next, queue:sched/core) with similar signatures:

    watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [kworker/0:3:2140]
    Modules linked in: ...
    CPU: 0 UID: 0 PID: 2140 Comm: kworker/0:3 Not tainted 6.16.0-rc1-test+ #20 PREEMPT(voluntary)
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    Workqueue: events psi_avgs_work
    RIP: 0010:collect_percpu_times+0x3a0/0x670
    Code: 65 48 2b 05 4a 79 d2 02 0f 85 dc 02 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d e9 34 ba d2 ff f3 90 49 81 ff ff 1f 00 00 <0f> 86 73 fd ff ff 4c 89 fe 48 c7 c7 80 9d 29 bb e8 cb 92 73 00 e9
    RSP: 0018:ffffcda753383d10 EFLAGS: 00000297
    RAX: ffff8be86fadcd40 RBX: ffffeda7308d4580 RCX: 000000000000006b
    RDX: 000000000000002b RSI: 0000000000000100 RDI: ffffffffbab3f400
    RBP: ffffcda753383e30 R08: 000000000000006b R09: 0000000000000000
    R10: 0000008cca6be372 R11: 0000000000000006 R12: 000000000000006b
    R13: ffffeda7308d4594 R14: 00000000000037e5 R15: 000000000000006b
    FS:  0000000000000000(0000) GS:ffff8ba8c1118000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b3cf990c3c CR3: 000000807dc40006 CR4: 0000000000f70ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? psi_group_change+0x1ff/0x460
     ? add_timer_on+0x10a/0x160
     psi_avgs_work+0x4c/0xd0
     ? queue_delayed_work_on+0x6d/0x80
     process_one_work+0x193/0x3c0
     worker_thread+0x29d/0x3c0
     ? __pfx_worker_thread+0x10/0x10
     kthread+0xff/0x210
     ? __pfx_kthread+0x10/0x10
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x171/0x1a0
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>

I was able to reproduce this reliably running 100 copies of an infinite
loop doing - cgroup create, move a task, move task back to root, remove
cgroup - alongside hackbench running in a seperate cgroup and I hit this
in ~5-10min.

I have been running the same test with your fix and haven't run into
this for over 30min now. Feel free to add:

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> 
> This fixed it for me, but reading it now I'm wondering
> if we want to seqcount_init() unconditionally even when PSI
> is off.

Looking at "psi_enable", it can only be toggled via the kernel
parameter "psi=" and I don't see anything that does a
"static_branch_disable(&psi_disabled)" at runtime so I think
your fix should be good.

> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 2024c1d36402d..979a447bc281f 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -207,8 +207,6 @@ static void group_init(struct psi_group *group)
>         int cpu;

"cpu" variable can be removed too from group_init() now.

> 
>         group->enabled = true;
> -       for_each_possible_cpu(cpu)
> -               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>         group->avg_last_update = sched_clock();
>         group->avg_next_update = group->avg_last_update + psi_period;
>         mutex_init(&group->avgs_lock);
> @@ -231,6 +229,7 @@ static void group_init(struct psi_group *group)
> 
>  void __init psi_init(void)
>  {
> +       int cpu;

nit. newline after declaration.

>         if (!psi_enable) {
>                 static_branch_enable(&psi_disabled);
>                 static_branch_disable(&psi_cgroups_enabled);
> @@ -241,6 +240,8 @@ void __init psi_init(void)
>                 static_branch_disable(&psi_cgroups_enabled);
> 
>         psi_period = jiffies_to_nsecs(PSI_FREQ);
> +       for_each_possible_cpu(cpu)
> +               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>         group_init(&psi_system);
>  }
> 
> 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-15 19:11   ` Chris Mason
  2025-07-16  6:06     ` K Prateek Nayak
@ 2025-07-16  6:53     ` Beata Michalska
  2025-07-16 10:40       ` Peter Zijlstra
  1 sibling, 1 reply; 101+ messages in thread
From: Beata Michalska @ 2025-07-16  6:53 UTC (permalink / raw)
  To: Chris Mason
  Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel, Johannes Weiner

On Tue, Jul 15, 2025 at 03:11:14PM -0400, Chris Mason wrote:
> On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> > Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
> > pressure spikes from aggregation race") caused a regression for him on
> > a high context switch rate benchmark (schbench) due to the now
> > repeating cpu_clock() calls.
> > 
> > In particular the problem is that get_recent_times() will extrapolate
> > the current state to 'now'. But if an update uses a timestamp from
> > before the start of the update, it is possible to get two reads
> > with inconsistent results. It is effectively back-dating an update.
> > 
> > (note that this all hard-relies on the clock being synchronized across
> > CPUs -- if this is not the case, all bets are off).
> > 
> > Combine this problem with the fact that there are per-group-per-cpu
> > seqcounts, the commit in question pushed the clock read into the group
> > iteration, causing tree-depth cpu_clock() calls. On architectures
> > where cpu_clock() has appreciable overhead, this hurts.
> > 
> > Instead move to a per-cpu seqcount, which allows us to have a single
> > clock read for all group updates, increasing internal consistency and
> > lowering update overhead. This comes at the cost of a longer update
> > side (proportional to the tree depth) which can cause the read side to
> > retry more often.
> > 
> > Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
> > Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>,
> > Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net 
> > ---
> >  include/linux/psi_types.h |    6 --
> >  kernel/sched/psi.c        |  121 +++++++++++++++++++++++++---------------------
> >  2 files changed, 68 insertions(+), 59 deletions(-)
> > 
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -84,11 +84,9 @@ enum psi_aggregators {
> >  struct psi_group_cpu {
> >  	/* 1st cacheline updated by the scheduler */
> >  
> > -	/* Aggregator needs to know of concurrent changes */
> > -	seqcount_t seq ____cacheline_aligned_in_smp;
> > -
> >  	/* States of the tasks belonging to this group */
> > -	unsigned int tasks[NR_PSI_TASK_COUNTS];
> > +	unsigned int tasks[NR_PSI_TASK_COUNTS]
> > +			____cacheline_aligned_in_smp;
> >  
> >  	/* Aggregate pressure state derived from the tasks */
> >  	u32 state_mask;
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -176,6 +176,28 @@ struct psi_group psi_system = {
> >  	.pcpu = &system_group_pcpu,
> >  };
> >  
> > +static DEFINE_PER_CPU(seqcount_t, psi_seq);
> 
> [ ... ]
> 
> > @@ -186,7 +208,7 @@ static void group_init(struct psi_group
> >  
> >  	group->enabled = true;
> >  	for_each_possible_cpu(cpu)
> > -		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > +		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
> >  	group->avg_last_update = sched_clock();
> >  	group->avg_next_update = group->avg_last_update + psi_period;
> >  	mutex_init(&group->avgs_lock);
> 
> I'm not sure if someone mentioned this already, but testing the
> series I got a bunch of softlockups in get_recent_times()
> that randomly jumped from CPU to CPU.

... was beaten to it. I can observe the same.
> 
> This fixed it for me, but reading it now I'm wondering
> if we want to seqcount_init() unconditionally even when PSI
> is off.  
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 2024c1d36402d..979a447bc281f 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -207,8 +207,6 @@ static void group_init(struct psi_group *group)
>         int cpu;
> 
>         group->enabled = true;
> -       for_each_possible_cpu(cpu)
> -               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>         group->avg_last_update = sched_clock();
>         group->avg_next_update = group->avg_last_update + psi_period;
>         mutex_init(&group->avgs_lock);
> @@ -231,6 +229,7 @@ static void group_init(struct psi_group *group)
> 
>  void __init psi_init(void)
>  {
> +       int cpu;
>         if (!psi_enable) {
>                 static_branch_enable(&psi_disabled);
>                 static_branch_disable(&psi_cgroups_enabled);
> @@ -241,6 +240,8 @@ void __init psi_init(void)
>                 static_branch_disable(&psi_cgroups_enabled);
> 
>         psi_period = jiffies_to_nsecs(PSI_FREQ);
> +       for_each_possible_cpu(cpu)
> +               seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>         group_init(&psi_system);
>  }
> 
> 
Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.

---
BR
Beata

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-16  6:53     ` Beata Michalska
@ 2025-07-16 10:40       ` Peter Zijlstra
  2025-07-16 14:54         ` Johannes Weiner
                           ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-16 10:40 UTC (permalink / raw)
  To: Beata Michalska
  Cc: Chris Mason, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel,
	Johannes Weiner

On Wed, Jul 16, 2025 at 08:53:01AM +0200, Beata Michalska wrote:
> Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.

Yeah, I suppose that should work. The below builds, but I've not yet
observed the issue myself.

---
Subject: sched/psi: Fix psi_seq initialization
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 15 Jul 2025 15:11:14 -0400

With the seqcount moved out of the group into a global psi_seq,
re-initializing the seqcount on group creation is causing seqcount
corruption.

Fixes: 570c8efd5eb7 ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
Reported-by: Chris Mason <clm@meta.com>
Suggested-by: Beata Michalska <beata.michalska@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/psi.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -176,7 +176,7 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
-static DEFINE_PER_CPU(seqcount_t, psi_seq);
+static DEFINE_PER_CPU(seqcount_t, psi_seq) = SEQCNT_ZERO(psi_seq);
 
 static inline void psi_write_begin(int cpu)
 {
@@ -204,11 +204,7 @@ static void poll_timer_fn(struct timer_l
 
 static void group_init(struct psi_group *group)
 {
-	int cpu;
-
 	group->enabled = true;
-	for_each_possible_cpu(cpu)
-		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
 	mutex_init(&group->avgs_lock);

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-16 10:40       ` Peter Zijlstra
@ 2025-07-16 14:54         ` Johannes Weiner
  2025-07-16 16:27         ` Chris Mason
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 101+ messages in thread
From: Johannes Weiner @ 2025-07-16 14:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Beata Michalska, Chris Mason, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

On Wed, Jul 16, 2025 at 12:40:50PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 16, 2025 at 08:53:01AM +0200, Beata Michalska wrote:
> > Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.
> 
> Yeah, I suppose that should work. The below builds, but I've not yet
> observed the issue myself.
> 
> ---
> Subject: sched/psi: Fix psi_seq initialization
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, 15 Jul 2025 15:11:14 -0400
> 
> With the seqcount moved out of the group into a global psi_seq,
> re-initializing the seqcount on group creation is causing seqcount
> corruption.
> 
> Fixes: 570c8efd5eb7 ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
> Reported-by: Chris Mason <clm@meta.com>
> Suggested-by: Beata Michalska <beata.michalska@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Argh, missed that during the review as well.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-16 10:40       ` Peter Zijlstra
  2025-07-16 14:54         ` Johannes Weiner
@ 2025-07-16 16:27         ` Chris Mason
  2025-07-23  4:16         ` Aithal, Srikanth
  2025-07-25  5:13         ` K Prateek Nayak
  3 siblings, 0 replies; 101+ messages in thread
From: Chris Mason @ 2025-07-16 16:27 UTC (permalink / raw)
  To: Peter Zijlstra, Beata Michalska
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel, Johannes Weiner

On 7/16/25 6:40 AM, Peter Zijlstra wrote:
> On Wed, Jul 16, 2025 at 08:53:01AM +0200, Beata Michalska wrote:
>> Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.
> 
> Yeah, I suppose that should work. The below builds, but I've not yet
> observed the issue myself.

[ ... ]

>  
> -static DEFINE_PER_CPU(seqcount_t, psi_seq);
> +static DEFINE_PER_CPU(seqcount_t, psi_seq) = SEQCNT_ZERO(psi_seq);

As expected, this also works.  Thanks everyone.

-chris

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-16 10:40       ` Peter Zijlstra
  2025-07-16 14:54         ` Johannes Weiner
  2025-07-16 16:27         ` Chris Mason
@ 2025-07-23  4:16         ` Aithal, Srikanth
  2025-07-25  5:13         ` K Prateek Nayak
  3 siblings, 0 replies; 101+ messages in thread
From: Aithal, Srikanth @ 2025-07-23  4:16 UTC (permalink / raw)
  To: Peter Zijlstra, Beata Michalska
  Cc: Chris Mason, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel,
	Johannes Weiner

On 7/16/2025 4:10 PM, Peter Zijlstra wrote:
> On Wed, Jul 16, 2025 at 08:53:01AM +0200, Beata Michalska wrote:
>> Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.
> 
> Yeah, I suppose that should work. The below builds, but I've not yet
> observed the issue myself.
> 
> ---
> Subject: sched/psi: Fix psi_seq initialization
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, 15 Jul 2025 15:11:14 -0400
> 
> With the seqcount moved out of the group into a global psi_seq,
> re-initializing the seqcount on group creation is causing seqcount
> corruption.
> 
> Fixes: 570c8efd5eb7 ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
> Reported-by: Chris Mason <clm@meta.com>
> Suggested-by: Beata Michalska <beata.michalska@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   kernel/sched/psi.c |    6 +-----
>   1 file changed, 1 insertion(+), 5 deletions(-)
> 
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -176,7 +176,7 @@ struct psi_group psi_system = {
>   	.pcpu = &system_group_pcpu,
>   };
>   
> -static DEFINE_PER_CPU(seqcount_t, psi_seq);
> +static DEFINE_PER_CPU(seqcount_t, psi_seq) = SEQCNT_ZERO(psi_seq);
>   
>   static inline void psi_write_begin(int cpu)
>   {
> @@ -204,11 +204,7 @@ static void poll_timer_fn(struct timer_l
>   
>   static void group_init(struct psi_group *group)
>   {
> -	int cpu;
> -
>   	group->enabled = true;
> -	for_each_possible_cpu(cpu)
> -		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
>   	group->avg_last_update = sched_clock();
>   	group->avg_next_update = group->avg_last_update + psi_period;
>   	mutex_init(&group->avgs_lock);

I've tested the above patch, and it resolves the issue. Could we include 
this patch in the linux-next builds? We have been encountering the 
reported issue regularly in our daily CI.


Tested-by: Srikanth Aithal <sraithal@amd.com>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-07-16 10:40       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2025-07-23  4:16         ` Aithal, Srikanth
@ 2025-07-25  5:13         ` K Prateek Nayak
  3 siblings, 0 replies; 101+ messages in thread
From: K Prateek Nayak @ 2025-07-25  5:13 UTC (permalink / raw)
  To: mingo, Peter Zijlstra
  Cc: Chris Mason, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, linux-kernel,
	Johannes Weiner, Beata Michalska

Hello Ingo, Peter,

On 7/16/2025 4:10 PM, Peter Zijlstra wrote:
> On Wed, Jul 16, 2025 at 08:53:01AM +0200, Beata Michalska wrote:
>> Wouldn't it be enough to use SEQCNT_ZERO? Those are static per-cpu ones.
> 
> Yeah, I suppose that should work. The below builds, but I've not yet
> observed the issue myself.
> 
> ---
> Subject: sched/psi: Fix psi_seq initialization
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, 15 Jul 2025 15:11:14 -0400
> 
> With the seqcount moved out of the group into a global psi_seq,
> re-initializing the seqcount on group creation is causing seqcount
> corruption.
> 
> Fixes: 570c8efd5eb7 ("sched/psi: Optimize psi_group_change() cpu_clock() usage")
> Reported-by: Chris Mason <clm@meta.com>
> Suggested-by: Beata Michalska <beata.michalska@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I've been running with this fix for a bunch of my testing and when I forget
about it (as was the case when testing John's Proxy Exec branch), I usually
run into the softlockup in psi_avgs_work().

Is it too late to include this in tip:sched/core for v6.17?

Also feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-02 16:12   ` Juri Lelli
                     ` (5 more replies)
  2025-07-02 11:49 ` [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 6 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.

His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hind sight and all that.

Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.

Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
---
 include/linux/sched.h   |    1 +
 kernel/sched/deadline.c |   25 ++++++++++++++++++++++---
 kernel/sched/fair.c     |    9 ---------
 3 files changed, 23 insertions(+), 12 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -701,6 +701,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
+	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1215,6 +1215,8 @@ static void __push_dl_task(struct rq *rq
 /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
 static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
 
+static bool dl_server_stopped(struct sched_dl_entity *dl_se);
+
 static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
@@ -1234,6 +1236,7 @@ static enum hrtimer_restart dl_server_ti
 
 		if (!dl_se->server_has_tasks(dl_se)) {
 			replenish_dl_entity(dl_se);
+			dl_server_stopped(dl_se);
 			return HRTIMER_NORESTART;
 		}
 
@@ -1639,8 +1642,10 @@ void dl_server_update_idle_time(struct r
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime)
+	if (dl_se->dl_runtime) {
+		dl_se->dl_server_idle = 0;
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1663,7 +1668,7 @@ void dl_server_start(struct sched_dl_ent
 		setup_new_dl_entity(dl_se);
 	}
 
-	if (!dl_se->dl_runtime)
+	if (!dl_se->dl_runtime || dl_se->dl_server_active)
 		return;
 
 	dl_se->dl_server_active = 1;
@@ -1684,6 +1689,20 @@ void dl_server_stop(struct sched_dl_enti
 	dl_se->dl_server_active = 0;
 }
 
+static bool dl_server_stopped(struct sched_dl_entity *dl_se)
+{
+	if (!dl_se->dl_server_active)
+		return false;
+
+	if (dl_se->dl_server_idle) {
+		dl_server_stop(dl_se);
+		return true;
+	}
+
+	dl_se->dl_server_idle = 1;
+	return false;
+}
+
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
@@ -2435,7 +2454,7 @@ static struct task_struct *__pick_task_d
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
-			if (dl_server_active(dl_se)) {
+			if (!dl_server_stopped(dl_se)) {
 				dl_se->dl_yielded = 1;
 				update_curr_dl_se(rq, dl_se, 0);
 			}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5879,7 +5879,6 @@ static bool throttle_cfs_rq(struct cfs_r
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5963,10 +5962,6 @@ static bool throttle_cfs_rq(struct cfs_r
 
 	/* At this point se is NULL and we are at root level*/
 	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
 done:
 	/*
 	 * Note: distribution will already see us throttled via the
@@ -7060,7 +7055,6 @@ static void set_next_buddy(struct sched_
 static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 {
 	bool was_sched_idle = sched_idle_rq(rq);
-	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
 	struct task_struct *p = NULL;
@@ -7144,9 +7138,6 @@ static int dequeue_entities(struct rq *r
 
 	sub_nr_running(rq, h_nr_queued);
 
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-
 	/* balance early to pull high priority tasks */
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
@ 2025-07-02 16:12   ` Juri Lelli
  2025-07-10 12:46   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 101+ messages in thread
From: Juri Lelli @ 2025-07-02 16:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, clm, linux-kernel

Hi Peter,

On 02/07/25 13:49, Peter Zijlstra wrote:
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
> 
> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hind sight and all that.
> 
> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
> 
> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
> ---

This looks good to me.

Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>

Thanks!
Juri


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [tip: sched/core] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
  2025-07-02 16:12   ` Juri Lelli
@ 2025-07-10 12:46   ` tip-bot2 for Peter Zijlstra
  2025-07-14 22:56   ` [PATCH v2 02/12] " Mel Gorman
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 101+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-07-10 12:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Chris Mason, Peter Zijlstra (Intel), Juri Lelli, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     cccb45d7c4295bbfeba616582d0249f2d21e6df5
Gitweb:        https://git.kernel.org/tip/cccb45d7c4295bbfeba616582d0249f2d21e6df5
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 20 May 2025 11:19:30 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 09 Jul 2025 13:40:21 +02:00

sched/deadline: Less agressive dl_server handling

Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.

His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hind sight and all that.

Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.

Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
---
 include/linux/sched.h   |  1 +
 kernel/sched/deadline.c | 25 ++++++++++++++++++++++---
 kernel/sched/fair.c     |  9 ---------
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eec6b22..4802fcf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -698,6 +698,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
+	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0f30697..23668fc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1150,6 +1150,8 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
 /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
 static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
 
+static bool dl_server_stopped(struct sched_dl_entity *dl_se);
+
 static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
@@ -1169,6 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 
 		if (!dl_se->server_has_tasks(dl_se)) {
 			replenish_dl_entity(dl_se);
+			dl_server_stopped(dl_se);
 			return HRTIMER_NORESTART;
 		}
 
@@ -1572,8 +1575,10 @@ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime)
+	if (dl_se->dl_runtime) {
+		dl_se->dl_server_idle = 0;
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1596,7 +1601,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 		setup_new_dl_entity(dl_se);
 	}
 
-	if (!dl_se->dl_runtime)
+	if (!dl_se->dl_runtime || dl_se->dl_server_active)
 		return;
 
 	dl_se->dl_server_active = 1;
@@ -1617,6 +1622,20 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	dl_se->dl_server_active = 0;
 }
 
+static bool dl_server_stopped(struct sched_dl_entity *dl_se)
+{
+	if (!dl_se->dl_server_active)
+		return false;
+
+	if (dl_se->dl_server_idle) {
+		dl_server_stop(dl_se);
+		return true;
+	}
+
+	dl_se->dl_server_idle = 1;
+	return false;
+}
+
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
@@ -2354,7 +2373,7 @@ again:
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
-			if (dl_server_active(dl_se)) {
+			if (!dl_server_stopped(dl_se)) {
 				dl_se->dl_yielded = 1;
 				update_curr_dl_se(rq, dl_se, 0);
 			}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab0822c..a1350c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5802,7 +5802,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -5886,10 +5885,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	/* At this point se is NULL and we are at root level*/
 	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
 done:
 	/*
 	 * Note: distribution will already see us throttled via the
@@ -6966,7 +6961,6 @@ static void set_next_buddy(struct sched_entity *se);
 static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 {
 	bool was_sched_idle = sched_idle_rq(rq);
-	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
 	struct task_struct *p = NULL;
@@ -7050,9 +7044,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 	sub_nr_running(rq, h_nr_queued);
 
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-
 	/* balance early to pull high priority tasks */
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
  2025-07-02 16:12   ` Juri Lelli
  2025-07-10 12:46   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2025-07-14 22:56   ` Mel Gorman
  2025-07-15 14:55     ` Chris Mason
  2025-07-30  9:34   ` Geert Uytterhoeven
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 101+ messages in thread
From: Mel Gorman @ 2025-07-14 22:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Wed, Jul 02, 2025 at 01:49:26PM +0200, Peter Zijlstra wrote:
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
> 

Unrelated to the patch but I've been doing a bit of arcology recently
finding the motivation for various decisions and paragraphs like this
have been painful (most recent was figuring out why a decision was made
for 2.6.32). If the load was described, can you add a Link: tag?  If the
workload is proprietary, cannot be described or would be impractical to
independently created than can that be stated here instead?

> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hind sight and all that.
> 

Obvious in hindsight but even then a brief explanation as to why it
triggered that particular corner case would be helpful. i.e. was throttling
the trigger or dequeue for a workload with very few (1?) tasks?

> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
> 

I didn't dig into this too much but is that 1s fixed because it's related to
the dl server itself rather than any task using the deadline scheduler? The
use of "default" indicates it's tunable but at a glance, it's not clear
if sched_setattr can be used to reconfigure the dl_server or not. Even if
that is the expected case, it's not obvious (to me).

> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org

Patch looks ok but I'm not particularly familiar with the deadline
scheduler. Even so;

Acked-by: Mel Gorman <mgorman@techsingularity.net>

One nit below

> ---
>  include/linux/sched.h   |    1 +
>  kernel/sched/deadline.c |   25 ++++++++++++++++++++++---
>  kernel/sched/fair.c     |    9 ---------
>  3 files changed, 23 insertions(+), 12 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -701,6 +701,7 @@ struct sched_dl_entity {
>  	unsigned int			dl_defer	  : 1;
>  	unsigned int			dl_defer_armed	  : 1;
>  	unsigned int			dl_defer_running  : 1;
> +	unsigned int			dl_server_idle    : 1;
>  
>  	/*
>  	 * Bandwidth enforcement timer. Each -deadline task has its
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1215,6 +1215,8 @@ static void __push_dl_task(struct rq *rq
>  /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
>  static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
>  
> +static bool dl_server_stopped(struct sched_dl_entity *dl_se);
> +
>  static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
>  {
>  	struct rq *rq = rq_of_dl_se(dl_se);
> @@ -1234,6 +1236,7 @@ static enum hrtimer_restart dl_server_ti
>  
>  		if (!dl_se->server_has_tasks(dl_se)) {
>  			replenish_dl_entity(dl_se);
> +			dl_server_stopped(dl_se);
>  			return HRTIMER_NORESTART;
>  		}
>  
> @@ -1639,8 +1642,10 @@ void dl_server_update_idle_time(struct r
>  void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
>  {
>  	/* 0 runtime = fair server disabled */
> -	if (dl_se->dl_runtime)
> +	if (dl_se->dl_runtime) {
> +		dl_se->dl_server_idle = 0;
>  		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
> +	}
>  }
>  
>  void dl_server_start(struct sched_dl_entity *dl_se)
> @@ -1663,7 +1668,7 @@ void dl_server_start(struct sched_dl_ent
>  		setup_new_dl_entity(dl_se);
>  	}
>  
> -	if (!dl_se->dl_runtime)
> +	if (!dl_se->dl_runtime || dl_se->dl_server_active)
>  		return;
>  
>  	dl_se->dl_server_active = 1;
> @@ -1684,6 +1689,20 @@ void dl_server_stop(struct sched_dl_enti
>  	dl_se->dl_server_active = 0;
>  }
>  
> +static bool dl_server_stopped(struct sched_dl_entity *dl_se)
> +{
> +	if (!dl_se->dl_server_active)
> +		return false;
> +
> +	if (dl_se->dl_server_idle) {
> +		dl_server_stop(dl_se);
> +		return true;
> +	}
> +
> +	dl_se->dl_server_idle = 1;
> +	return false;
> +}
> +

The function name does not suggest there are side-effects. If I'm reading it
correctly, it's basically a 2-pass filter related to the sched_period. It
could do with a comment with bonus points mentioning the duration of the
2-pass filter depends on the sched_period of the dl_server.

>  void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>  		    dl_server_has_tasks_f has_tasks,
>  		    dl_server_pick_f pick_task)
> @@ -2435,7 +2454,7 @@ static struct task_struct *__pick_task_d
>  	if (dl_server(dl_se)) {
>  		p = dl_se->server_pick_task(dl_se);
>  		if (!p) {
> -			if (dl_server_active(dl_se)) {
> +			if (!dl_server_stopped(dl_se)) {
>  				dl_se->dl_yielded = 1;
>  				update_curr_dl_se(rq, dl_se, 0);
>  			}
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5879,7 +5879,6 @@ static bool throttle_cfs_rq(struct cfs_r
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>  	struct sched_entity *se;
>  	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
> -	long rq_h_nr_queued = rq->cfs.h_nr_queued;
>  
>  	raw_spin_lock(&cfs_b->lock);
>  	/* This will start the period timer if necessary */
> @@ -5963,10 +5962,6 @@ static bool throttle_cfs_rq(struct cfs_r
>  
>  	/* At this point se is NULL and we are at root level*/
>  	sub_nr_running(rq, queued_delta);
> -
> -	/* Stop the fair server if throttling resulted in no runnable tasks */
> -	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
> -		dl_server_stop(&rq->fair_server);
>  done:
>  	/*
>  	 * Note: distribution will already see us throttled via the
> @@ -7060,7 +7055,6 @@ static void set_next_buddy(struct sched_
>  static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  {
>  	bool was_sched_idle = sched_idle_rq(rq);
> -	int rq_h_nr_queued = rq->cfs.h_nr_queued;
>  	bool task_sleep = flags & DEQUEUE_SLEEP;
>  	bool task_delayed = flags & DEQUEUE_DELAYED;
>  	struct task_struct *p = NULL;
> @@ -7144,9 +7138,6 @@ static int dequeue_entities(struct rq *r
>  
>  	sub_nr_running(rq, h_nr_queued);
>  
> -	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
> -		dl_server_stop(&rq->fair_server);
> -
>  	/* balance early to pull high priority tasks */
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
> 
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-14 22:56   ` [PATCH v2 02/12] " Mel Gorman
@ 2025-07-15 14:55     ` Chris Mason
  2025-07-16 18:19       ` Mel Gorman
  0 siblings, 1 reply; 101+ messages in thread
From: Chris Mason @ 2025-07-15 14:55 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel

On 7/14/25 6:56 PM, Mel Gorman wrote:
> On Wed, Jul 02, 2025 at 01:49:26PM +0200, Peter Zijlstra wrote:
>> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
>> bandwidth control") caused a significant dip in his favourite
>> benchmark of the day. Simply disabling dl_server cured things.
>>
> 
> Unrelated to the patch but I've been doing a bit of arcology recently
> finding the motivation for various decisions and paragraphs like this
> have been painful (most recent was figuring out why a decision was made
> for 2.6.32). If the load was described, can you add a Link: tag?  If the
> workload is proprietary, cannot be described or would be impractical to
> independently created than can that be stated here instead?
> 

Hi Mel,

"benchmark of the day" is pretty accurate, since I usually just bash on
schbench until I see roughly the same problem that I'm debugging from
production.  This time, it was actually a networking benchmark (uperf),
but setup for that is more involved.

This other thread describes the load, with links to schbench and command
line:

https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/

The short version:

https://github.com/masoncl/schbench.git
schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0

- 4 CPUs waking up all the other CPUs constantly
  - (pretending to be network irqs)
- 1024 total worker threads spread over the other CPUs
- all the workers immediately going idle after waking
- single socket machine with ~250 cores and HT.

The basic recipe for the regression is as many CPUs as possible going in
and out of idle.

(I know you're really asking for these details in the commit or in the
comments, but hopefully this is useful for Link:'ing)

-chris

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-15 14:55     ` Chris Mason
@ 2025-07-16 18:19       ` Mel Gorman
  0 siblings, 0 replies; 101+ messages in thread
From: Mel Gorman @ 2025-07-16 18:19 UTC (permalink / raw)
  To: Chris Mason
  Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	linux-kernel

On Tue, Jul 15, 2025 at 10:55:03AM -0400, Chris Mason wrote:
> On 7/14/25 6:56 PM, Mel Gorman wrote:
> > On Wed, Jul 02, 2025 at 01:49:26PM +0200, Peter Zijlstra wrote:
> >> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> >> bandwidth control") caused a significant dip in his favourite
> >> benchmark of the day. Simply disabling dl_server cured things.
> >>
> > 
> > Unrelated to the patch but I've been doing a bit of arcology recently
> > finding the motivation for various decisions and paragraphs like this
> > have been painful (most recent was figuring out why a decision was made
> > for 2.6.32). If the load was described, can you add a Link: tag?  If the
> > workload is proprietary, cannot be described or would be impractical to
> > independently created than can that be stated here instead?
> > 
> 
> Hi Mel,
> 
> "benchmark of the day" is pretty accurate, since I usually just bash on
> schbench until I see roughly the same problem that I'm debugging from
> production.  This time, it was actually a networking benchmark (uperf),
> but setup for that is more involved.
> 
> This other thread describes the load, with links to schbench and command
> line:
> 
> https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/
> 
> The short version:
> 
> https://github.com/masoncl/schbench.git
> schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> 
> - 4 CPUs waking up all the other CPUs constantly
>   - (pretending to be network irqs)

Ok, so the 4 CPUs are a simulation of network traffic arriving that can be
delivered to any CPU. Sounds similar to MSIX where interrupts can arrive
on any CPU and I'm guessing you're not doing any packet steering in the
"real" workload. I'm also guessing there is nothing special about "4"
other than it was enough threads to keep the SUT active even if the worker
tasks did no work.

> - 1024 total worker threads spread over the other CPUs

Ok.

> - all the workers immediately going idle after waking

So 0 think time to stress a corner case.

> - single socket machine with ~250 cores and HT.
> 

To be 100% sure, 250 cores + HT is 500 logical CPUs correct? Using 1024 would
appear to be an attempt to simulate strict deadlines for minimal processing
of data received from the network while processors are saturated. IIUC,
the workload would stress wakeup preemption, LB and finding an idle CPU
decisions while ensuring EEVDF rules are adhered to.

> The basic recipe for the regression is as many CPUs as possible going in
> and out of idle.
> 
> (I know you're really asking for these details in the commit or in the
> comments, but hopefully this is useful for Link:'ing)
> 

Yes it is. Because even adding this will capture the specific benchmark
for future reference -- at least as long as lore lives.

Link: https://lore.kernel.org/r/3c67ae44-5244-4341-9edd-04a93b1cb290@meta.com

Do you mind adding this or ensure it makes it to the final changelog?
It's not a big deal, just a preference. Historically there was no push
for something like this but most recent history was dominated by CFS.
There were a lot of subtle heuristics there that are hard to replicate in
EEVDF without violating the intent of EEVDF.

I had seen that schbench invocation and I was 99% certain it was the
"favourite benchmark of the day".  The pattern seems reasonable as a
microbench favouring latency over throughput for fast dispatching of work
from network ingress to backend processing. Thats enough to name a mmtests
configuration based on the existing schbench implementation. Maybe something
like schbench-fakenet-fastdispatch.  This sort of pattern is not even that
unique as such as IO-intensive workloads may also exhibit a similar pattern,
particularly if XFS is the filesystem. That is a reasonable scenario
whether DL is involved or not.

Thanks Chris.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
                     ` (2 preceding siblings ...)
  2025-07-14 22:56   ` [PATCH v2 02/12] " Mel Gorman
@ 2025-07-30  9:34   ` Geert Uytterhoeven
  2025-07-30  9:46     ` Juri Lelli
  2025-08-05 22:03   ` Chris Bainbridge
  2025-09-15 22:29   ` John Stultz
  5 siblings, 1 reply; 101+ messages in thread
From: Geert Uytterhoeven @ 2025-07-30  9:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel, linux-m68k

Hi Peter,

On Wed, 2 Jul 2025 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
>
> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hind sight and all that.
>
> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
>
> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org

Thanks for your patch, which is now commit cccb45d7c4295bbf
("sched/deadline: Less agressive dl_server handling") upstream.

This commit causes

    sched: DL replenish lagged too much

to be printed after full user-space (Debian) start-up on m68k
(atari_defconfig running on ARAnyM).  Reverting this commit and fixing
the small conflict gets rid of the message.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -701,6 +701,7 @@ struct sched_dl_entity {
>         unsigned int                    dl_defer          : 1;
>         unsigned int                    dl_defer_armed    : 1;
>         unsigned int                    dl_defer_running  : 1;
> +       unsigned int                    dl_server_idle    : 1;
>
>         /*
>          * Bandwidth enforcement timer. Each -deadline task has its
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1215,6 +1215,8 @@ static void __push_dl_task(struct rq *rq
>  /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
>  static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
>
> +static bool dl_server_stopped(struct sched_dl_entity *dl_se);
> +
>  static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
>  {
>         struct rq *rq = rq_of_dl_se(dl_se);
> @@ -1234,6 +1236,7 @@ static enum hrtimer_restart dl_server_ti
>
>                 if (!dl_se->server_has_tasks(dl_se)) {
>                         replenish_dl_entity(dl_se);
> +                       dl_server_stopped(dl_se);
>                         return HRTIMER_NORESTART;
>                 }
>
> @@ -1639,8 +1642,10 @@ void dl_server_update_idle_time(struct r
>  void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
>  {
>         /* 0 runtime = fair server disabled */
> -       if (dl_se->dl_runtime)
> +       if (dl_se->dl_runtime) {
> +               dl_se->dl_server_idle = 0;
>                 update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
> +       }
>  }
>
>  void dl_server_start(struct sched_dl_entity *dl_se)
> @@ -1663,7 +1668,7 @@ void dl_server_start(struct sched_dl_ent
>                 setup_new_dl_entity(dl_se);
>         }
>
> -       if (!dl_se->dl_runtime)
> +       if (!dl_se->dl_runtime || dl_se->dl_server_active)
>                 return;
>
>         dl_se->dl_server_active = 1;
> @@ -1684,6 +1689,20 @@ void dl_server_stop(struct sched_dl_enti
>         dl_se->dl_server_active = 0;
>  }
>
> +static bool dl_server_stopped(struct sched_dl_entity *dl_se)
> +{
> +       if (!dl_se->dl_server_active)
> +               return false;
> +
> +       if (dl_se->dl_server_idle) {
> +               dl_server_stop(dl_se);
> +               return true;
> +       }
> +
> +       dl_se->dl_server_idle = 1;
> +       return false;
> +}
> +
>  void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
>                     dl_server_has_tasks_f has_tasks,
>                     dl_server_pick_f pick_task)
> @@ -2435,7 +2454,7 @@ static struct task_struct *__pick_task_d
>         if (dl_server(dl_se)) {
>                 p = dl_se->server_pick_task(dl_se);
>                 if (!p) {
> -                       if (dl_server_active(dl_se)) {
> +                       if (!dl_server_stopped(dl_se)) {
>                                 dl_se->dl_yielded = 1;
>                                 update_curr_dl_se(rq, dl_se, 0);
>                         }
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5879,7 +5879,6 @@ static bool throttle_cfs_rq(struct cfs_r
>         struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>         struct sched_entity *se;
>         long queued_delta, runnable_delta, idle_delta, dequeue = 1;
> -       long rq_h_nr_queued = rq->cfs.h_nr_queued;
>
>         raw_spin_lock(&cfs_b->lock);
>         /* This will start the period timer if necessary */
> @@ -5963,10 +5962,6 @@ static bool throttle_cfs_rq(struct cfs_r
>
>         /* At this point se is NULL and we are at root level*/
>         sub_nr_running(rq, queued_delta);
> -
> -       /* Stop the fair server if throttling resulted in no runnable tasks */
> -       if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
> -               dl_server_stop(&rq->fair_server);
>  done:
>         /*
>          * Note: distribution will already see us throttled via the
> @@ -7060,7 +7055,6 @@ static void set_next_buddy(struct sched_
>  static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  {
>         bool was_sched_idle = sched_idle_rq(rq);
> -       int rq_h_nr_queued = rq->cfs.h_nr_queued;
>         bool task_sleep = flags & DEQUEUE_SLEEP;
>         bool task_delayed = flags & DEQUEUE_DELAYED;
>         struct task_struct *p = NULL;
> @@ -7144,9 +7138,6 @@ static int dequeue_entities(struct rq *r
>
>         sub_nr_running(rq, h_nr_queued);
>
> -       if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
> -               dl_server_stop(&rq->fair_server);
> -
>         /* balance early to pull high priority tasks */
>         if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>                 rq->next_balance = jiffies;

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-30  9:34   ` Geert Uytterhoeven
@ 2025-07-30  9:46     ` Juri Lelli
  2025-07-30 10:05       ` Geert Uytterhoeven
  0 siblings, 1 reply; 101+ messages in thread
From: Juri Lelli @ 2025-07-30  9:46 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Peter Zijlstra, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel, linux-m68k

Hello,

On 30/07/25 11:34, Geert Uytterhoeven wrote:
> Hi Peter,

Apologies for interjecting.

> On Wed, 2 Jul 2025 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> > Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> > bandwidth control") caused a significant dip in his favourite
> > benchmark of the day. Simply disabling dl_server cured things.
> >
> > His workload hammers the 0->1, 1->0 transitions, and the
> > dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> > idea in hind sight and all that.
> >
> > Change things around to only disable the dl_server when there has not
> > been a fair task around for a whole period. Since the default period
> > is 1 second, this ensures the benchmark never trips this, overhead
> > gone.
> >
> > Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
> 
> Thanks for your patch, which is now commit cccb45d7c4295bbf
> ("sched/deadline: Less agressive dl_server handling") upstream.
> 
> This commit causes
> 
>     sched: DL replenish lagged too much
> 
> to be printed after full user-space (Debian) start-up on m68k
> (atari_defconfig running on ARAnyM).  Reverting this commit and fixing
> the small conflict gets rid of the message.

Does
https://lore.kernel.org/lkml/20250615131129.954975-1-kuyo.chang@mediatek.com/
help already (w/o the revert)?

Best,
Juri


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-30  9:46     ` Juri Lelli
@ 2025-07-30 10:05       ` Geert Uytterhoeven
  0 siblings, 0 replies; 101+ messages in thread
From: Geert Uytterhoeven @ 2025-07-30 10:05 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Peter Zijlstra, mingo, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel, linux-m68k

Hi Juri,

On Wed, 30 Jul 2025 at 11:46, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 30/07/25 11:34, Geert Uytterhoeven wrote:
> Apologies for interjecting.

No apologies needed, much appreciated!

> > On Wed, 2 Jul 2025 at 14:19, Peter Zijlstra <peterz@infradead.org> wrote:
> > > Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> > > bandwidth control") caused a significant dip in his favourite
> > > benchmark of the day. Simply disabling dl_server cured things.
> > >
> > > His workload hammers the 0->1, 1->0 transitions, and the
> > > dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> > > idea in hind sight and all that.
> > >
> > > Change things around to only disable the dl_server when there has not
> > > been a fair task around for a whole period. Since the default period
> > > is 1 second, this ensures the benchmark never trips this, overhead
> > > gone.
> > >
> > > Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> > > Reported-by: Chris Mason <clm@meta.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
> >
> > Thanks for your patch, which is now commit cccb45d7c4295bbf
> > ("sched/deadline: Less agressive dl_server handling") upstream.
> >
> > This commit causes
> >
> >     sched: DL replenish lagged too much
> >
> > to be printed after full user-space (Debian) start-up on m68k
> > (atari_defconfig running on ARAnyM).  Reverting this commit and fixing
> > the small conflict gets rid of the message.
>
> Does
> https://lore.kernel.org/lkml/20250615131129.954975-1-kuyo.chang@mediatek.com/
> help already (w/o the revert)?

Thanks, it does!

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
                     ` (3 preceding siblings ...)
  2025-07-30  9:34   ` Geert Uytterhoeven
@ 2025-08-05 22:03   ` Chris Bainbridge
  2025-08-05 23:04     ` Chris Bainbridge
  2025-09-15 22:29   ` John Stultz
  5 siblings, 1 reply; 101+ messages in thread
From: Chris Bainbridge @ 2025-08-05 22:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Wed, Jul 02, 2025 at 01:49:26PM +0200, Peter Zijlstra wrote:
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
> 
> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hind sight and all that.
> 
> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
> 
> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org

This commit causes almost every boot of my laptop (which is booted from
USB flash/SSD drive) to log "sched: DL replenish lagged too much" around
7 seconds in to the boot. Is this expected? Just asking as this is a
change in behaviour - I haven't seen this warning before in several
years of using this laptop.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-08-05 22:03   ` Chris Bainbridge
@ 2025-08-05 23:04     ` Chris Bainbridge
  0 siblings, 0 replies; 101+ messages in thread
From: Chris Bainbridge @ 2025-08-05 23:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Tue, 5 Aug 2025 at 23:03, Chris Bainbridge <chris.bainbridge@gmail.com> wrote:
>
> On Wed, Jul 02, 2025 at 01:49:26PM +0200, Peter Zijlstra wrote:
> > Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> > bandwidth control") caused a significant dip in his favourite
> > benchmark of the day. Simply disabling dl_server cured things.
> >
> > His workload hammers the 0->1, 1->0 transitions, and the
> > dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> > idea in hind sight and all that.
> >
> > Change things around to only disable the dl_server when there has not
> > been a fair task around for a whole period. Since the default period
> > is 1 second, this ensures the benchmark never trips this, overhead
> > gone.
> >
> > Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org
>
> This commit causes almost every boot of my laptop (which is booted from
> USB flash/SSD drive) to log "sched: DL replenish lagged too much" around
> 7 seconds in to the boot. Is this expected? Just asking as this is a
> change in behaviour - I haven't seen this warning before in several
> years of using this laptop.

Nevermind, I see this has already been reported:

https://lore.kernel.org/lkml/CAMuHMdXn4z1pioTtBGMfQM0jsLviqS2jwysaWXpoLxWYoGa82w@mail.gmail.com/

Fix:

https://lore.kernel.org/lkml/20250615131129.954975-1-kuyo.chang@mediatek.com/

Works for me.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
                     ` (4 preceding siblings ...)
  2025-08-05 22:03   ` Chris Bainbridge
@ 2025-09-15 22:29   ` John Stultz
  2025-09-16  4:18     ` John Stultz
  5 siblings, 1 reply; 101+ messages in thread
From: John Stultz @ 2025-09-15 22:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Wed, Jul 2, 2025 at 4:49 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
> bandwidth control") caused a significant dip in his favourite
> benchmark of the day. Simply disabling dl_server cured things.
>
> His workload hammers the 0->1, 1->0 transitions, and the
> dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
> idea in hind sight and all that.
>
> Change things around to only disable the dl_server when there has not
> been a fair task around for a whole period. Since the default period
> is 1 second, this ensures the benchmark never trips this, overhead
> gone.
>
> Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.507378961@infradead.org

So I know this patch has already had a couple of issues reported against it:
  [1] "sched: DL replenish lagged too much" warnings
  [2]  changing server parameters break per-runqueue running_bw tracking
  [3] dl_server_stopped() should return true if dl_se->dl_server_active is 0.

In fact, I reported [4] some trouble with one stress test
(ksched_football) I've been developing with the proxy-exec series
getting stuck because of the behavior change this brought. And while
I'm open to my test being problematic (it tries to generate 5*NR_CPU
spinning RT tasks, which starves kthreadd and with this change causes
the dl_server to only let one thread be spawned per second) it still
seemed concerning (as before this change the dl_server and
rt_throttling would let us generate the RT tasks faster - though that
may have been unintentional, I've not gotten my head totally around
the previous behavior, and unfortunately got distracted with other
work).

But separately testing out Peter's new cleanups for sched_ext (without
my problematic test), I started tripping over workqueue lockup BUGs in
certain situations, and I've found I can reproduce these pretty easily
with vanilla v6.17-rc6 alone (which includes the fixes for the
reported issues above).  I don't have CONFIG_SCHED_PROXY_EXEC enabled
for this.

If I run a 2 cpu qemu instance with locktorture enabled, booting with
the cmd arg:
"torture.random_shuffle=1 locktorture.writer_fifo=1
locktorture.torture_type=mutex_lock locktorture.nested_locks=8
locktorture.rt_boost=1 locktorture.rt_boost_factor=50
locktorture.stutter=0 "

Within ~7 minutes, I'll usually see something like:
[   92.301253] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0
nice=0 stuck for 42s!
[   92.305170] Showing busy workqueues and worker pools:
[   92.307434] workqueue events_power_efficient: flags=0x80
[   92.309796]   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
[   92.309834]     pending: neigh_managed_work
[   92.314565]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=4 refcnt=5
[   92.314604]     pending: crda_timeout_work, neigh_managed_work,
neigh_periodic_work, gc_worker
[   92.321151] workqueue mm_percpu_wq: flags=0x8
[   92.323124]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[   92.323161]     pending: vmstat_update
[   92.327638] workqueue kblockd: flags=0x18
[   92.329429]   pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
[   92.329467]     pending: blk_mq_timeout_work
[   92.334259] Showing backtraces of running workers in stalled
CPU-bound worker pools:

And this will continue every 30 secs with the stuck time increasing:
[ 1411.472533] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0
nice=0 stuck for 1361s!
[ 1411.476404] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0
nice=-20 stuck for 1345s!
[ 1411.480214] Showing busy workqueues and worker pools:
[ 1411.482939] workqueue events_power_efficient: flags=0x80
[ 1411.486171]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=6 refcnt=7
[ 1411.486220]     pending: crda_timeout_work, neigh_managed_work,
neigh_periodic_work, gc_worker, reg_check_chans_work, check_lifetime
[ 1411.497091] workqueue mm_percpu_wq: flags=0x8
[ 1411.499829]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 1411.499878]     pending: vmstat_update
[ 1411.506010] workqueue kblockd: flags=0x18
[ 1411.508689]   pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
[ 1411.508738]     pending: blk_mq_timeout_work
[ 1411.515311] workqueue mld: flags=0x40008
[ 1411.517764]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 1411.517813]     pending: mld_ifc_work
[ 1411.523923] Showing backtraces of running workers in stalled
CPU-bound worker pools:

I bisected it down to this commit cccb45d7c4295 ("sched/deadline: Less
agressive dl_server handling"), and I found that reverting just the
dequeue_entities() changes (last two chunks of this patch), resolves
the lockup warnings.

That revert *also* resolves my ksched_football test problem, but I
suspect it's also because that change is the main point of this patch
(getting the costly dl_server_stop() out of the frequently used
dequeue_entities() path).

Now while locktorture will keep the cpu busy, it usually bounces
around a fair bit, but when I see the lockup warnings, it
lockturture_writer usually is pinning one cpu and stays that way,
where as with the revert I don't see things getting stuck running on
one cpu.  Adding a trace_printk() to __pick_task_dl() when we pick
from a dl_server, shows the dl_server running and picking tasks for
awhile, but then abruptly stopping shortly before we see the lockup
warnings, suggesting the dl_server somehow stopped and didn't get
started again.

From my initial debugging, it seems like the dl_se->dl_server_active
is set, but the dl_se isn't enqueued, and because active is set we're
short cutting out of dl_server_start(). I suspect the revert re-adding
dl_server_stop() helps because it forces dl_server_active to be
cleared and so we keep the dl_server_active and enqueued state in
sync.   Maybe we are dequeueing the dl_server from update_curr_dl_se()
due to dl_runtime_exceeded() but somehow not re-enqueueing it via
dl_server_timer()?

So I am still digging into it, but I do think there's something still
amiss with this change.  If there's anything I should try to further
debug this, do let me know.

thanks
-john

[1] https://lore.kernel.org/lkml/CAMuHMdVPR7Q7pvn+QqGcq2pJ00apDgUcaCmAgq6nnM1uHySMcw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20250721-upstream-fix-dlserver-lessaggressive-b4-v1-1-4ebc10c87e40@redhat.com/
[3] https://lore.kernel.org/lkml/20250809130419.1980742-1-chenhuacai@loongson.cn/
[4] https://lore.kernel.org/lkml/CANDhNCo+G4_t8jYU-QNPz42uZsKdMgEmTnr8pYSKbgm26NJUCg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling
  2025-09-15 22:29   ` John Stultz
@ 2025-09-16  4:18     ` John Stultz
  2025-09-16  5:28       ` [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation John Stultz
  0 siblings, 1 reply; 101+ messages in thread
From: John Stultz @ 2025-09-16  4:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Mon, Sep 15, 2025 at 3:29 PM John Stultz <jstultz@google.com> wrote:
> From my initial debugging, it seems like the dl_se->dl_server_active
> is set, but the dl_se isn't enqueued, and because active is set we're
> short cutting out of dl_server_start(). I suspect the revert re-adding
> dl_server_stop() helps because it forces dl_server_active to be
> cleared and so we keep the dl_server_active and enqueued state in
> sync.   Maybe we are dequeueing the dl_server from update_curr_dl_se()
> due to dl_runtime_exceeded() but somehow not re-enqueueing it via
> dl_server_timer()?
>

So further digging, it looks like when the problem occurs, we call
start_dl_timer() to set the hrtimer, then when the hrtimer later
fires, we call dl_server_timer(), which catches on the `if
(!dl_se->server_has_tasks(dl_se))` case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally
HRTIMER_NORESTART.  With dl_server_stopped() having set
dl_se->dl_server_idle before returning false (and notably not calling
dl_server_stop() which would clear dl_server_active).

However, dl_server_idle doesn't persist for very long, as we
dl_server_update is shortly called setting it back to 0 before calling
update_curr_dl_se() which I suspect is dequeueing the dl_server.

From this point it seems we get into a situation where the timer
doesn't fire again. And nothing enqueues the dl_server entity back
onto the runqueue, so it never picks from the fair sched and we see
the starvation on that core.

After we get into the problematic state, the rq->fair_server looks like:
$3 = {rb_node = {__rb_parent_color = 18446612689482602296, rb_right =
0x0, rb_left = 0x0}, dl_runtime = 50000000, dl_deadline = 1000000000,
  dl_period = 1000000000, dl_bw = 52428, dl_density = 52428, runtime =
49615739, deadline = 217139555275, flags = 0, dl_throttled = 0,
  dl_yielded = 0, dl_non_contending = 0, dl_overrun = 0, dl_server =
1, dl_server_active = 1, dl_defer = 1, dl_defer_armed = 0,
  dl_defer_running = 0, dl_server_idle = 0, dl_timer = {node = {node =
{__rb_parent_color = 18446612689482602384,
        rb_right = 0xffff8881b9d1cbf8, rb_left = 0x0}, expires =
215349181777}, _softexpires = 215349181777,
    function = 0xffffffff813849d0 <dl_task_timer>, base =
0xffff8881b9d1c300, state = 0 '\000', is_rel = 0 '\000', is_soft = 0
'\000',
    is_hard = 1 '\001'}, inactive_timer = {node = {node =
{__rb_parent_color = 18446612689482602448, rb_right = 0x0, rb_left =
0x0},
      expires = 0}, _softexpires = 0, function = 0xffffffff8137faa0
<inactive_task_timer>, base = 0xffff8881b9c1c300, state = 0 '\000',
    is_rel = 0 '\000', is_soft = 0 '\000', is_hard = 1 '\001'}, rq =
0xffff8881b9d2cc00,
  server_has_tasks = 0xffffffff81363620 <fair_server_has_tasks>,
server_pick_task = 0xffffffff8136c2d0 <fair_server_pick_task>,
  pi_se = 0xffff8881b9d2d738}

I'm still getting a bit lost in the dl_server state machine, but it's
clear we're falling through a crack.

My current theory is that in dl_server_timer() we probably should be
calling dl_server_stop() instead of dl_server_stopped(), as that seems
to avoid the problem for me, but I'm also worried it might cause
problems with not setting dl_server_idle, though again my sense of the
state machine there is foggy.

thanks
-john

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16  4:18     ` John Stultz
@ 2025-09-16  5:28       ` John Stultz
  2025-09-16  8:51         ` Juri Lelli
  0 siblings, 1 reply; 101+ messages in thread
From: John Stultz @ 2025-09-16  5:28 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Mel Gorman, Xuewen Yan,
	K Prateek Nayak, Suleiman Souhlal, Qais Yousef, Joel Fernandes,
	kuyo chang, hupu, kernel-team

With 6.17-rc6, I found when running with locktorture enabled, on
a two core qemu VM, I could easily hit some lockup warnings:

[   92.301253] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 42s!
[   92.305170] Showing busy workqueues and worker pools:
[   92.307434] workqueue events_power_efficient: flags=0x80
[   92.309796]   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
[   92.309834]     pending: neigh_managed_work
[   92.314565]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=4 refcnt=5
[   92.314604]     pending: crda_timeout_work, neigh_managed_work, neigh_periodic_work, gc_worker
[   92.321151] workqueue mm_percpu_wq: flags=0x8
[   92.323124]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
[   92.323161]     pending: vmstat_update
[   92.327638] workqueue kblockd: flags=0x18
[   92.329429]   pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
[   92.329467]     pending: blk_mq_timeout_work
[   92.334259] Showing backtraces of running workers in stalled CPU-bound worker pools:

I bisected it down to commit cccb45d7c429 ("sched/deadline: Less
agressive dl_server handling"), and in debugging it seems there
is a chance where we end up with the dl_server dequeued, with
dl_se->dl_server_active. This causes dl_server_start() to
return without enqueueing the dl_server, thus it fails to run
when RT tasks starve the cpu.

I found when this happens, the dl_timer hrtimer is set and calls
 dl_server_timer(), which catches on the
  `if (!dl_se->server_has_tasks(dl_se))`
case, which then calls replenish_dl_entity() and
dl_server_stopped() and finally returns HRTIMER_NORESTART.

The problem being, dl_server_stopped() will set
dl_se->dl_server_idle before returning false (and notably not
calling dl_server_stop() which would clear dl_server_active).

After this, we end up in a situation where the timer doesn't
fire again. And nothing enqueues the dl_server entity back onto
the runqueue, so it never picks from the fair sched and we see
the starvation on that core.

So in dl_server_timer() call dl_server_stop() instead of
dl_server_stopped(), as that will ensure dl_server_active
gets cleared when we are dequeued.

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Signed-off-by: John Stultz <jstultz@google.com>
---
NOTE: I'm not confident this is the right fix, but I wanted
to share for feedback and testing.

Also, this resolves the lockup warnings and problematic behavior
I see with locktorture, but does *not* resolve the behavior
change I hit with my ksched_football test (which intentionally
causes RT starvation) that I bisected down to the same
problematic change and mentioned here:
  https://lore.kernel.org/lkml/20250722070600.3267819-1-jstultz@google.com/
This may be just a problem with my test, but I'm still a bit
wary that this behavior change may bite folks.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/deadline.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f25301267e471..215c3e2cee370 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1152,8 +1152,6 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
 /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
 static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;

-static bool dl_server_stopped(struct sched_dl_entity *dl_se);
-
 static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
@@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_

 		if (!dl_se->server_has_tasks(dl_se)) {
 			replenish_dl_entity(dl_se);
-			dl_server_stopped(dl_se);
+			dl_server_stop(dl_se);
 			return HRTIMER_NORESTART;
 		}

-- 
2.51.0.384.g4c02a37b29-goog

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16  5:28       ` [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation John Stultz
@ 2025-09-16  8:51         ` Juri Lelli
  2025-09-16 11:01           ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Juri Lelli @ 2025-09-16  8:51 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Mel Gorman, Xuewen Yan, K Prateek Nayak, Suleiman Souhlal,
	Qais Yousef, Joel Fernandes, kuyo chang, hupu, kernel-team

Hi John,

Thanks a lot for looking into this!

On 16/09/25 05:28, John Stultz wrote:
> With 6.17-rc6, I found when running with locktorture enabled, on
> a two core qemu VM, I could easily hit some lockup warnings:
> 
> [   92.301253] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 42s!
> [   92.305170] Showing busy workqueues and worker pools:
> [   92.307434] workqueue events_power_efficient: flags=0x80
> [   92.309796]   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [   92.309834]     pending: neigh_managed_work
> [   92.314565]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=4 refcnt=5
> [   92.314604]     pending: crda_timeout_work, neigh_managed_work, neigh_periodic_work, gc_worker
> [   92.321151] workqueue mm_percpu_wq: flags=0x8
> [   92.323124]   pwq 6: cpus=1 node=0 flags=0x0 nice=0 active=1 refcnt=2
> [   92.323161]     pending: vmstat_update
> [   92.327638] workqueue kblockd: flags=0x18
> [   92.329429]   pwq 7: cpus=1 node=0 flags=0x0 nice=-20 active=1 refcnt=2
> [   92.329467]     pending: blk_mq_timeout_work
> [   92.334259] Showing backtraces of running workers in stalled CPU-bound worker pools:
> 
> I bisected it down to commit cccb45d7c429 ("sched/deadline: Less
> agressive dl_server handling"), and in debugging it seems there
> is a chance where we end up with the dl_server dequeued, with
> dl_se->dl_server_active. This causes dl_server_start() to
> return without enqueueing the dl_server, thus it fails to run
> when RT tasks starve the cpu.
> 
> I found when this happens, the dl_timer hrtimer is set and calls
>  dl_server_timer(), which catches on the
>   `if (!dl_se->server_has_tasks(dl_se))`
> case, which then calls replenish_dl_entity() and
> dl_server_stopped() and finally returns HRTIMER_NORESTART.
> 
> The problem being, dl_server_stopped() will set
> dl_se->dl_server_idle before returning false (and notably not
> calling dl_server_stop() which would clear dl_server_active).
> 
> After this, we end up in a situation where the timer doesn't
> fire again. And nothing enqueues the dl_server entity back onto
> the runqueue, so it never picks from the fair sched and we see
> the starvation on that core.

Because dl_server_start() returns early if dl_server_active is set.
 
> So in dl_server_timer() call dl_server_stop() instead of
> dl_server_stopped(), as that will ensure dl_server_active
> gets cleared when we are dequeued.
> 
> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
> NOTE: I'm not confident this is the right fix, but I wanted
> to share for feedback and testing.
> 
> Also, this resolves the lockup warnings and problematic behavior
> I see with locktorture, but does *not* resolve the behavior
> change I hit with my ksched_football test (which intentionally
> causes RT starvation) that I bisected down to the same
> problematic change and mentioned here:
>   https://lore.kernel.org/lkml/20250722070600.3267819-1-jstultz@google.com/
> This may be just a problem with my test, but I'm still a bit
> wary that this behavior change may bite folks.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: kuyo chang <kuyo.chang@mediatek.com>
> Cc: hupu <hupu.gm@gmail.com>
> Cc: kernel-team@android.com
> ---
>  kernel/sched/deadline.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f25301267e471..215c3e2cee370 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1152,8 +1152,6 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
>  /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
>  static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
>  
> -static bool dl_server_stopped(struct sched_dl_entity *dl_se);
> -
>  static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
>  {
>  	struct rq *rq = rq_of_dl_se(dl_se);
> @@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  
>  		if (!dl_se->server_has_tasks(dl_se)) {
>  			replenish_dl_entity(dl_se);
> -			dl_server_stopped(dl_se);
> +			dl_server_stop(dl_se);
>  			return HRTIMER_NORESTART;
>  		}

It looks OK for a quick testing I've done. Also, it seems to make sense
to me. The defer timer has fired (we are executing the callback). If the
server hasn't got tasks to serve we can just stop it (clearing the
flags) and wait for the next enqueue of fair to start it again still in
defer mode. hrtimer_try_to_cancel() is redundant (but harmless),
dequeue_dl_entity() I believe we need to call to deal with
task_non_contending().

Peter, what do you think?

Thanks,
Juri


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16  8:51         ` Juri Lelli
@ 2025-09-16 11:01           ` Peter Zijlstra
  2025-09-16 12:52             ` Juri Lelli
                               ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-16 11:01 UTC (permalink / raw)
  To: Juri Lelli
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Tue, Sep 16, 2025 at 10:51:34AM +0200, Juri Lelli wrote:

> > @@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> >  
> >  		if (!dl_se->server_has_tasks(dl_se)) {
> >  			replenish_dl_entity(dl_se);
> > -			dl_server_stopped(dl_se);
> > +			dl_server_stop(dl_se);
> >  			return HRTIMER_NORESTART;
> >  		}
> 
> It looks OK for a quick testing I've done. Also, it seems to make sense
> to me. The defer timer has fired (we are executing the callback). If the
> server hasn't got tasks to serve we can just stop it (clearing the
> flags) and wait for the next enqueue of fair to start it again still in
> defer mode. hrtimer_try_to_cancel() is redundant (but harmless),
> dequeue_dl_entity() I believe we need to call to deal with
> task_non_contending().
> 
> Peter, what do you think?

Well, the problem was that we were starting/stopping the thing too
often, and the general idea of that commit:

  cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")

was to not stop the server, unless it's not seen fair tasks for a whole
period.

Now, the case John trips seems to be that there were tasks, we ran tasks
until budget exhausted, dequeued the server and did start_dl_timer().

Then the bandwidth timer fires at a point where there are no more fair
tasks, replenish_dl_entity() gets called, which *should* set the
0-laxity timer, but doesn't -- because !server_has_tasks() -- and then
nothing.

So perhaps we should do something like the below. Simply continue
as normal, until we do a whole cycle without having seen a task.

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5b64bc621993..269ca2eb5ba9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 */
 	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
 	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
-		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
+		if (!is_dl_boosted(dl_se)) {
 
 			/*
 			 * Set dl_se->dl_defer_armed and dl_throttled variables to
@@ -1171,12 +1171,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 		if (!dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
-		if (!dl_se->server_has_tasks(dl_se)) {
-			replenish_dl_entity(dl_se);
-			dl_server_stopped(dl_se);
-			return HRTIMER_NORESTART;
-		}
-
 		if (dl_se->dl_defer_armed) {
 			/*
 			 * First check if the server could consume runtime in background.


Notably, this removes all ->server_has_tasks() users, so if this works
and is correct, we can completely remove that callback and simplify
more.

Hmm?

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16 11:01           ` Peter Zijlstra
@ 2025-09-16 12:52             ` Juri Lelli
  2025-09-16 14:30               ` Peter Zijlstra
  2025-09-16 17:35             ` John Stultz
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 101+ messages in thread
From: Juri Lelli @ 2025-09-16 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On 16/09/25 13:01, Peter Zijlstra wrote:
> On Tue, Sep 16, 2025 at 10:51:34AM +0200, Juri Lelli wrote:
> 
> > > @@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> > >  
> > >  		if (!dl_se->server_has_tasks(dl_se)) {
> > >  			replenish_dl_entity(dl_se);
> > > -			dl_server_stopped(dl_se);
> > > +			dl_server_stop(dl_se);
> > >  			return HRTIMER_NORESTART;
> > >  		}
> > 
> > It looks OK for a quick testing I've done. Also, it seems to make sense
> > to me. The defer timer has fired (we are executing the callback). If the
> > server hasn't got tasks to serve we can just stop it (clearing the
> > flags) and wait for the next enqueue of fair to start it again still in
> > defer mode. hrtimer_try_to_cancel() is redundant (but harmless),
> > dequeue_dl_entity() I believe we need to call to deal with
> > task_non_contending().
> > 
> > Peter, what do you think?
> 
> Well, the problem was that we were starting/stopping the thing too
> often, and the general idea of that commit:
> 
>   cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> 
> was to not stop the server, unless it's not seen fair tasks for a whole
> period.
> 
> Now, the case John trips seems to be that there were tasks, we ran tasks
> until budget exhausted, dequeued the server and did start_dl_timer().
> 
> Then the bandwidth timer fires at a point where there are no more fair
> tasks, replenish_dl_entity() gets called, which *should* set the
> 0-laxity timer, but doesn't -- because !server_has_tasks() -- and then
> nothing.
> 
> So perhaps we should do something like the below. Simply continue
> as normal, until we do a whole cycle without having seen a task.
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 5b64bc621993..269ca2eb5ba9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>  	 */
>  	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
>  	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
> -		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
> +		if (!is_dl_boosted(dl_se)) {
>  
>  			/*
>  			 * Set dl_se->dl_defer_armed and dl_throttled variables to
> @@ -1171,12 +1171,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>  		if (!dl_se->dl_runtime)
>  			return HRTIMER_NORESTART;
>  
> -		if (!dl_se->server_has_tasks(dl_se)) {
> -			replenish_dl_entity(dl_se);
> -			dl_server_stopped(dl_se);
> -			return HRTIMER_NORESTART;
> -		}
> -
>  		if (dl_se->dl_defer_armed) {
>  			/*
>  			 * First check if the server could consume runtime in background.
> 
> 
> Notably, this removes all ->server_has_tasks() users, so if this works
> and is correct, we can completely remove that callback and simplify
> more.
> 
> Hmm?

But then what stops the server when the 0-laxity (defer) timer fires
again a period down the line?


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16 12:52             ` Juri Lelli
@ 2025-09-16 14:30               ` Peter Zijlstra
  0 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-16 14:30 UTC (permalink / raw)
  To: Juri Lelli
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Tue, Sep 16, 2025 at 02:52:44PM +0200, Juri Lelli wrote:
> On 16/09/25 13:01, Peter Zijlstra wrote:
> > On Tue, Sep 16, 2025 at 10:51:34AM +0200, Juri Lelli wrote:
> > 
> > > > @@ -1173,7 +1171,7 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> > > >  
> > > >  		if (!dl_se->server_has_tasks(dl_se)) {
> > > >  			replenish_dl_entity(dl_se);
> > > > -			dl_server_stopped(dl_se);
> > > > +			dl_server_stop(dl_se);
> > > >  			return HRTIMER_NORESTART;
> > > >  		}
> > > 
> > > It looks OK for a quick testing I've done. Also, it seems to make sense
> > > to me. The defer timer has fired (we are executing the callback). If the
> > > server hasn't got tasks to serve we can just stop it (clearing the
> > > flags) and wait for the next enqueue of fair to start it again still in
> > > defer mode. hrtimer_try_to_cancel() is redundant (but harmless),
> > > dequeue_dl_entity() I believe we need to call to deal with
> > > task_non_contending().
> > > 
> > > Peter, what do you think?
> > 
> > Well, the problem was that we were starting/stopping the thing too
> > often, and the general idea of that commit:
> > 
> >   cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> > 
> > was to not stop the server, unless it's not seen fair tasks for a whole
> > period.
> > 
> > Now, the case John trips seems to be that there were tasks, we ran tasks
> > until budget exhausted, dequeued the server and did start_dl_timer().
> > 
> > Then the bandwidth timer fires at a point where there are no more fair
> > tasks, replenish_dl_entity() gets called, which *should* set the
> > 0-laxity timer, but doesn't -- because !server_has_tasks() -- and then
> > nothing.
> > 
> > So perhaps we should do something like the below. Simply continue
> > as normal, until we do a whole cycle without having seen a task.
> > 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 5b64bc621993..269ca2eb5ba9 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
> >  	 */
> >  	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
> >  	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
> > -		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
> > +		if (!is_dl_boosted(dl_se)) {
> >  
> >  			/*
> >  			 * Set dl_se->dl_defer_armed and dl_throttled variables to
> > @@ -1171,12 +1171,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
> >  		if (!dl_se->dl_runtime)
> >  			return HRTIMER_NORESTART;
> >  
> > -		if (!dl_se->server_has_tasks(dl_se)) {
> > -			replenish_dl_entity(dl_se);
> > -			dl_server_stopped(dl_se);
> > -			return HRTIMER_NORESTART;
> > -		}
> > -
> >  		if (dl_se->dl_defer_armed) {
> >  			/*
> >  			 * First check if the server could consume runtime in background.
> > 
> > 
> > Notably, this removes all ->server_has_tasks() users, so if this works
> > and is correct, we can completely remove that callback and simplify
> > more.
> > 
> > Hmm?
> 
> But then what stops the server when the 0-laxity (defer) timer fires
> again a period down the line?

At that point we'll actually run the server, right? And then
__pick_task_dl() will hit the !p case and call dl_server_stopped().

If idle==1 it will actually stop the server, otherwise it will set
idle=1 and around we go.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16 11:01           ` Peter Zijlstra
  2025-09-16 12:52             ` Juri Lelli
@ 2025-09-16 17:35             ` John Stultz
  2025-09-16 21:30               ` Peter Zijlstra
  2025-09-18  6:56             ` [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck tip-bot2 for Peter Zijlstra
  2025-09-25  7:55             ` tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 101+ messages in thread
From: John Stultz @ 2025-09-16 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Tue, Sep 16, 2025 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> Now, the case John trips seems to be that there were tasks, we ran tasks
> until budget exhausted, dequeued the server and did start_dl_timer().
>
> Then the bandwidth timer fires at a point where there are no more fair
> tasks, replenish_dl_entity() gets called, which *should* set the
> 0-laxity timer, but doesn't -- because !server_has_tasks() -- and then
> nothing.
>
> So perhaps we should do something like the below. Simply continue
> as normal, until we do a whole cycle without having seen a task.
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 5b64bc621993..269ca2eb5ba9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>          */
>         if (dl_se->dl_defer && !dl_se->dl_defer_running &&
>             dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
> -               if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
> +               if (!is_dl_boosted(dl_se)) {
>
>                         /*
>                          * Set dl_se->dl_defer_armed and dl_throttled variables to
> @@ -1171,12 +1171,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>                 if (!dl_se->dl_runtime)
>                         return HRTIMER_NORESTART;
>
> -               if (!dl_se->server_has_tasks(dl_se)) {
> -                       replenish_dl_entity(dl_se);
> -                       dl_server_stopped(dl_se);
> -                       return HRTIMER_NORESTART;
> -               }
> -
>                 if (dl_se->dl_defer_armed) {
>                         /*
>                          * First check if the server could consume runtime in background.
>
>
> Notably, this removes all ->server_has_tasks() users, so if this works
> and is correct, we can completely remove that callback and simplify
> more.

So this does seem to avoid this lockup warning issue I've been seeing
in my initial testing. I've not done much other testing with it
though.

I of course still see the thread spawning issues with my
ksched_football test that come from keeping the dl_server running for
the whole period, but that's a separate thing I'm trying to isolate.

Tested-by: John Stultz <jstultz@google.com>

thanks
-john

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16 17:35             ` John Stultz
@ 2025-09-16 21:30               ` Peter Zijlstra
  2025-09-17  3:29                 ` John Stultz
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-16 21:30 UTC (permalink / raw)
  To: John Stultz
  Cc: Juri Lelli, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Tue, Sep 16, 2025 at 10:35:46AM -0700, John Stultz wrote:
> I of course still see the thread spawning issues with my
> ksched_football test that come from keeping the dl_server running for
> the whole period, but that's a separate thing I'm trying to isolate.

So what happens with that football thing -- you're saying thread
spawning issues, so kthread_create() is not making progress?

You're also saying 'keeping the dl_server running for the whole period',
are you seeing it not being limited to its allocated bandwidth?

Per the bug today, the dl_server does get throttled, not sufficiently?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-16 21:30               ` Peter Zijlstra
@ 2025-09-17  3:29                 ` John Stultz
  2025-09-17  9:34                   ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: John Stultz @ 2025-09-17  3:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Tue, Sep 16, 2025 at 2:30 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 16, 2025 at 10:35:46AM -0700, John Stultz wrote:
> > I of course still see the thread spawning issues with my
> > ksched_football test that come from keeping the dl_server running for
> > the whole period, but that's a separate thing I'm trying to isolate.
>
> So what happens with that football thing -- you're saying thread
> spawning issues, so kthread_create() is not making progress?
>
> You're also saying 'keeping the dl_server running for the whole period',
> are you seeing it not being limited to its allocated bandwidth?
>
> Per the bug today, the dl_server does get throttled, not sufficiently?

So yeah, I think this is a totally different issue than the lockup
warning problem, and again, I suspect my test is at least partially
culpable.

You can find my test here:
https://github.com/johnstultz-work/linux-dev/commit/d4c36a62444241558947e22af5a972a0859e031a

The idea is we first start a "Referee" RT prio 20 task, and from that
task we spawn NR_CPU  RT prio 5 "Offense players" which just spin
incrementing a value ("the ball"), then NR_CPU RT prio 10 "Defense
players" that just spin (conceptually blocking the Offense players
from running on the cpu). Once everyone is running, the ref zeros the
ball and sleeps for a few seconds. When it wakes it checks the ball is
still zero and shuts the test down.

So the point of this test is very much for the defense threads to
starve the offense from the cpu.  In doing so, it is helpful for
validating the RT scheduling invariant that we always (heh, well...
ignoring RT_PUSH_IPI, but that's yet a different issue) run the top
NR_CPU RT priority tasks across the cpus. So it inherently stresses
rt-throttling as well.

Note: In the test there are also RT prio 15 "crazy-fans" that sleep,
but occasionally wake up and "streak across the field" trying to
disrupt the game, as well as low priority defense players that take
either proxy-mutex or rt_mutexes that the high priority defense
threads try to aquire to ensure proxy-exec & priority inheritance
boosting is working - but these can be mostly ignored for this
problem.

Each time the ref spawns a player type, we go from lowest to highest
priority, calling kthread_create(), sched_setattr_nocheck() and then
wake_up_process(), for NR_CPUs, and then the Ref repeatedly sleeps
waiting for NR_CPU tasks to check-in (incrementing a player counter).

The problem is that previously the tasks were spawned without much
delay, but after your change, we see each RT tasks after the first
NR_CPU take ~1 second to spawn. On systems with large cpu counts
taking NR_CPU*player-types seconds to start can be long enough (easily
minutes) that we hit hung task errors.

As I've been digging, part of the issue seems to be kthreadd is not an
RT task, thus after we create NR_CPU spinner threads, the following
threads don't actually start right away. They have to wait until
rt-throttling or the dl_server runs, allowing kthreadd to get cpu time
and start the thread.  But then we have to wait until throttling stops
so that the RT prio Ref thread can run to set the new thread as RT
prio and then spawn the next, which then won't start until we do
throttling again, etc.

So it makes sense if the dl_server is "refreshed" once a second, we
might see this one thread per second interlock happening.  So this
seems likely just an issue with my test.

However, prior to your change, it seemed like the the RT tasks would
be throttled, kthreadd would run and the thread would start, then with
no SCHED_NORMAL tasks to run, the dl_server would stop and we'd run
the Ref, which would immediately eneueued kthreadd, which started the
dl_server and since the dl_server hadn't done much, I'm guessing it
still had some runtime/bandwidth and would run kthreadd then stop
right after.

So now my guess (and my apologies, as I really am not familiar enough
the the DL details) is that since we're not stopping the dl_server, it
seems like the dl_server runtime gets used up each cycle, even though
it didn't do much boosting. So we have to wait the full cycle for a
refresh.

I've used a slightly modified version of my test (simplifying it a bit
and adding extra trace annotations) to capture the following traces:

New behavior: https://ui.perfetto.dev/#!/?s=663e6a8f4f0ecdbc2bbde3efdc601dc980f7b333
This shows the clear 1 second delay interlock with kthreadd for each
thread after the first two, creating a clear stair-step pattern.

Old behavior: https://ui.perfetto.dev/#!/?s=3d6a70716c6a66eb8d7056ef2c640838404186c2
This shows how there's a 1 second delay after the first two threads
spawn, but then the rest of the created threads quickly interleave
with kthreadd, allowing them to start off quickly

I still need to add some temporary trace events around the dl_server
so I can better understand how that interleaving used to work. As it
may have been just luck that my test worked back then.

However, interestingly if we go back to 6.11, before the dl_server
landed, it's curious because I do see the 1 second stair steps, but
with larger cpu counts there seems to be more of them working in
parallel, so each second, the ref will manage to interlock with
kthreadd 4 to 8 times, and that many of threads will start up.  So its
behavior isn't exactly like either the new or old behavior with
dl_server, but it escaped hitting the hung task errors.

But as the new dl_server logic is much slower for spawning rt threads,
I probably need to find a new way to start my test. But I just want to
raise the issue in case others hit similar problems with the change.

thanks
-john

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-17  3:29                 ` John Stultz
@ 2025-09-17  9:34                   ` Peter Zijlstra
  2025-09-17 12:26                     ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-17  9:34 UTC (permalink / raw)
  To: John Stultz
  Cc: Juri Lelli, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Tue, Sep 16, 2025 at 08:29:01PM -0700, John Stultz wrote:

> The problem is that previously the tasks were spawned without much
> delay, but after your change, we see each RT tasks after the first
> NR_CPU take ~1 second to spawn. On systems with large cpu counts
> taking NR_CPU*player-types seconds to start can be long enough (easily
> minutes) that we hit hung task errors.
> 
> As I've been digging, part of the issue seems to be kthreadd is not an
> RT task, thus after we create NR_CPU spinner threads, the following
> threads don't actually start right away. They have to wait until
> rt-throttling or the dl_server runs, allowing kthreadd to get cpu time
> and start the thread.  But then we have to wait until throttling stops
> so that the RT prio Ref thread can run to set the new thread as RT
> prio and then spawn the next, which then won't start until we do
> throttling again, etc.
> 
> So it makes sense if the dl_server is "refreshed" once a second, we
> might see this one thread per second interlock happening.  So this
> seems likely just an issue with my test.
> 
> However, prior to your change, it seemed like the the RT tasks would
> be throttled, kthreadd would run and the thread would start, then with
> no SCHED_NORMAL tasks to run, the dl_server would stop and we'd run
> the Ref, which would immediately eneueued kthreadd, which started the
> dl_server and since the dl_server hadn't done much, I'm guessing it
> still had some runtime/bandwidth and would run kthreadd then stop
> right after.
> 
> So now my guess (and my apologies, as I really am not familiar enough
> the the DL details) is that since we're not stopping the dl_server, it
> seems like the dl_server runtime gets used up each cycle, even though
> it didn't do much boosting. So we have to wait the full cycle for a
> refresh.

Yes. This makes sense.

The old code would disable the dl_server when fair tasks drops to 0
so even though we had that yield in __pick_task_dl(), we'd never hit it.
So the moment another fair task shows up (0->1) we re-enqueue the
dl_server (using update_dl_entity() / CBS wakeup rules) and continue
consuming bandwidth.

However, since we're now not stopping the thing, we hit that yield,
getting this pretty terrible behaviour where we will only run fair tasks
until there are none and then yield our entire period, forcing another
task to wait until the next cycle.

Let me go have a play, surely we can do better.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-17  9:34                   ` Peter Zijlstra
@ 2025-09-17 12:26                     ` Peter Zijlstra
  2025-09-17 13:56                       ` Juri Lelli
                                         ` (3 more replies)
  0 siblings, 4 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-17 12:26 UTC (permalink / raw)
  To: John Stultz
  Cc: Juri Lelli, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Wed, Sep 17, 2025 at 11:34:42AM +0200, Peter Zijlstra wrote:

> Yes. This makes sense.
> 
> The old code would disable the dl_server when fair tasks drops to 0
> so even though we had that yield in __pick_task_dl(), we'd never hit it.
> So the moment another fair task shows up (0->1) we re-enqueue the
> dl_server (using update_dl_entity() / CBS wakeup rules) and continue
> consuming bandwidth.
> 
> However, since we're now not stopping the thing, we hit that yield,
> getting this pretty terrible behaviour where we will only run fair tasks
> until there are none and then yield our entire period, forcing another
> task to wait until the next cycle.
> 
> Let me go have a play, surely we can do better.

Can you please try:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent

That's yesterdays patch and the below. Its compile tested only, but
with a bit of luck it'll actually work ;-)

---
Subject: sched/deadline: Fix dl_server behaviour
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Sep 17 12:03:20 CEST 2025

John reported undesirable behaviour with the dl_server since commit:
cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling").

When starving fair tasks on purpose (starting spinning FIFO tasks),
his fair workload, which often goes (briefly) idle, would delay fair
invocations for a second, running one invocation per second was both
unexpected and terribly slow.

The reason this happens is that when dl_se->server_pick_task() returns
NULL, indicating no runnable tasks, it would yield, pushing any later
jobs out a whole period (1 second).

Instead simply stop the server. This should restore behaviour in that
a later wakeup (which restarts the server) will be able to continue
running (subject to the CBS wakeup rules).

Notably, this does not re-introduce the behaviour cccb45d7c4295 set
out to solve, any start/stop cycle is naturally throttled by the timer
period (no active cancel).

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h   |    1 -
 kernel/sched/deadline.c |   23 ++---------------------
 kernel/sched/sched.h    |   33 +++++++++++++++++++++++++++++++--
 3 files changed, 33 insertions(+), 24 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -706,7 +706,6 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
-	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1571,10 +1571,8 @@ void dl_server_update_idle_time(struct r
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime) {
-		dl_se->dl_server_idle = 0;
+	if (dl_se->dl_runtime)
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
-	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1602,20 +1600,6 @@ void dl_server_stop(struct sched_dl_enti
 	dl_se->dl_server_active = 0;
 }
 
-static bool dl_server_stopped(struct sched_dl_entity *dl_se)
-{
-	if (!dl_se->dl_server_active)
-		return true;
-
-	if (dl_se->dl_server_idle) {
-		dl_server_stop(dl_se);
-		return true;
-	}
-
-	dl_se->dl_server_idle = 1;
-	return false;
-}
-
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_pick_f pick_task)
 {
@@ -2384,10 +2368,7 @@ static struct task_struct *__pick_task_d
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
-			if (!dl_server_stopped(dl_se)) {
-				dl_se->dl_yielded = 1;
-				update_curr_dl_se(rq, dl_se, 0);
-			}
+			dl_server_stop(dl_se);
 			goto again;
 		}
 		rq->dl_server = dl_se;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -371,10 +371,39 @@ extern s64 dl_scaled_delta_exec(struct r
  *   dl_server_update() -- called from update_curr_common(), propagates runtime
  *                         to the server.
  *
- *   dl_server_start()
- *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
+ *   dl_server_start() -- start the server when it has tasks; it will stop
+ *			  automatically when there are no more tasks, per
+ *			  dl_se::server_pick() returning NULL.
+ *
+ *   dl_server_stop() -- (force) stop the server; use when updating
+ *                       parameters.
  *
  *   dl_server_init() -- initializes the server.
+ *
+ * When started the dl_server will (per dl_defer) schedule a timer for its
+ * zero-laxity point -- that is, unlike regular EDF tasks which run ASAP, a
+ * server will run at the very end of its period.
+ *
+ * This is done such that any runtime from the target class can be accounted
+ * against the server -- through dl_server_update() above -- such that when it
+ * becomes time to run, it might already be out of runtime and get deferred
+ * until the next period. In this case dl_server_timer() will alternate
+ * between defer and replenish but never actually enqueue the server.
+ *
+ * Only when the target class does not manage to exhaust the server's runtime
+ * (there's actualy starvation in the given period), will the dl_server get on
+ * the runqueue. Once queued it will pick tasks from the target class and run
+ * them until either its runtime is exhaused, at which point its back to
+ * dl_server_timer, or until there are no more tasks to run, at which point
+ * the dl_server stops itself.
+ *
+ * By stopping at this point the dl_server retains bandwidth, which, if a new
+ * task wakes up imminently (starting the server again), can be used --
+ * subject to CBS wakeup rules -- without having to wait for the next period.
+ *
+ * Additionally, because of the dl_defer behaviour the start/stop behaviour is
+ * naturally thottled to once per period, avoiding high context switch
+ * workloads from spamming the hrtimer program/cancel paths.
  */
 extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
 extern void dl_server_start(struct sched_dl_entity *dl_se);

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-17 12:26                     ` Peter Zijlstra
@ 2025-09-17 13:56                       ` Juri Lelli
  2025-09-17 17:30                         ` Peter Zijlstra
  2025-09-17 19:29                       ` John Stultz
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 101+ messages in thread
From: Juri Lelli @ 2025-09-17 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On 17/09/25 14:26, Peter Zijlstra wrote:
> On Wed, Sep 17, 2025 at 11:34:42AM +0200, Peter Zijlstra wrote:
> 
> > Yes. This makes sense.
> > 
> > The old code would disable the dl_server when fair tasks drops to 0
> > so even though we had that yield in __pick_task_dl(), we'd never hit it.
> > So the moment another fair task shows up (0->1) we re-enqueue the
> > dl_server (using update_dl_entity() / CBS wakeup rules) and continue
> > consuming bandwidth.
> > 
> > However, since we're now not stopping the thing, we hit that yield,
> > getting this pretty terrible behaviour where we will only run fair tasks
> > until there are none and then yield our entire period, forcing another
> > task to wait until the next cycle.
> > 
> > Let me go have a play, surely we can do better.
> 
> Can you please try:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
> 
> That's yesterdays patch and the below. Its compile tested only, but
> with a bit of luck it'll actually work ;-)
> 
> ---
> Subject: sched/deadline: Fix dl_server behaviour
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Sep 17 12:03:20 CEST 2025
> 
> John reported undesirable behaviour with the dl_server since commit:
> cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling").
> 
> When starving fair tasks on purpose (starting spinning FIFO tasks),
> his fair workload, which often goes (briefly) idle, would delay fair
> invocations for a second, running one invocation per second was both
> unexpected and terribly slow.
> 
> The reason this happens is that when dl_se->server_pick_task() returns
> NULL, indicating no runnable tasks, it would yield, pushing any later
> jobs out a whole period (1 second).
> 
> Instead simply stop the server. This should restore behaviour in that
> a later wakeup (which restarts the server) will be able to continue
> running (subject to the CBS wakeup rules).
> 
> Notably, this does not re-introduce the behaviour cccb45d7c4295 set
> out to solve, any start/stop cycle is naturally throttled by the timer
> period (no active cancel).

Neat!

> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> Reported-by: John Stultz <jstultz@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---

...

> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -371,10 +371,39 @@ extern s64 dl_scaled_delta_exec(struct r
>   *   dl_server_update() -- called from update_curr_common(), propagates runtime
>   *                         to the server.
>   *
> - *   dl_server_start()
> - *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
> + *   dl_server_start() -- start the server when it has tasks; it will stop
> + *			  automatically when there are no more tasks, per
> + *			  dl_se::server_pick() returning NULL.
> + *
> + *   dl_server_stop() -- (force) stop the server; use when updating
> + *                       parameters.
>   *
>   *   dl_server_init() -- initializes the server.
> + *
> + * When started the dl_server will (per dl_defer) schedule a timer for its
> + * zero-laxity point -- that is, unlike regular EDF tasks which run ASAP, a
> + * server will run at the very end of its period.
> + *
> + * This is done such that any runtime from the target class can be accounted
> + * against the server -- through dl_server_update() above -- such that when it
> + * becomes time to run, it might already be out of runtime and get deferred
> + * until the next period. In this case dl_server_timer() will alternate
> + * between defer and replenish but never actually enqueue the server.
> + *
> + * Only when the target class does not manage to exhaust the server's runtime
> + * (there's actualy starvation in the given period), will the dl_server get on
> + * the runqueue. Once queued it will pick tasks from the target class and run
> + * them until either its runtime is exhaused, at which point its back to
> + * dl_server_timer, or until there are no more tasks to run, at which point
> + * the dl_server stops itself.
> + *
> + * By stopping at this point the dl_server retains bandwidth, which, if a new
> + * task wakes up imminently (starting the server again), can be used --
> + * subject to CBS wakeup rules -- without having to wait for the next period.

In both cases we still defer until either the new period or the current
0-laxity, right?

The stop cleans all the flags, so subsequent start calls
enqueue(ENQUEUE_WAKEUP) -> update_dl_entity() which sets dl_throttled
and dl_defer_armed in both cases and then we start_dl_timer (defer
timer) after it (without enqueueing right away).

Or maybe I am still a bit lost. :)

> + * Additionally, because of the dl_defer behaviour the start/stop behaviour is
> + * naturally thottled to once per period, avoiding high context switch
> + * workloads from spamming the hrtimer program/cancel paths.

Right. Also nice cleanup of a flag and a method.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-17 13:56                       ` Juri Lelli
@ 2025-09-17 17:30                         ` Peter Zijlstra
  2025-09-18  8:37                           ` Juri Lelli
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-17 17:30 UTC (permalink / raw)
  To: Juri Lelli
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Wed, Sep 17, 2025 at 03:56:20PM +0200, Juri Lelli wrote:

> > + * By stopping at this point the dl_server retains bandwidth, which, if a new
> > + * task wakes up imminently (starting the server again), can be used --
> > + * subject to CBS wakeup rules -- without having to wait for the next period.
> 
> In both cases we still defer until either the new period or the current
> 0-laxity, right?
> 
> The stop cleans all the flags, so subsequent start calls
> enqueue(ENQUEUE_WAKEUP) -> update_dl_entity() which sets dl_throttled
> and dl_defer_armed in both cases and then we start_dl_timer (defer
> timer) after it (without enqueueing right away).
> 
> Or maybe I am still a bit lost. :)

The way I read it earlier today:

  dl_server_start()
    enqueue_dl_entity(WAKEUP)
      if (WAKEUP)
	task_contending();
	update_dl_entity()
	  dl_entity_overflows() := true
	  update_dl_revised_wakeup();

In that case, it is possible to continue running with a slight
adjustment to the runtime (it gets scaled back to account for 'lost'
time or somesuch IIRC).


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-17 17:30                         ` Peter Zijlstra
@ 2025-09-18  8:37                           ` Juri Lelli
  2025-09-18  9:04                             ` Peter Zijlstra
  0 siblings, 1 reply; 101+ messages in thread
From: Juri Lelli @ 2025-09-18  8:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On 17/09/25 19:30, Peter Zijlstra wrote:
> On Wed, Sep 17, 2025 at 03:56:20PM +0200, Juri Lelli wrote:
> 
> > > + * By stopping at this point the dl_server retains bandwidth, which, if a new
> > > + * task wakes up imminently (starting the server again), can be used --
> > > + * subject to CBS wakeup rules -- without having to wait for the next period.
> > 
> > In both cases we still defer until either the new period or the current
> > 0-laxity, right?
> > 
> > The stop cleans all the flags, so subsequent start calls
> > enqueue(ENQUEUE_WAKEUP) -> update_dl_entity() which sets dl_throttled
> > and dl_defer_armed in both cases and then we start_dl_timer (defer
> > timer) after it (without enqueueing right away).
> > 
> > Or maybe I am still a bit lost. :)
> 
> The way I read it earlier today:
> 
>   dl_server_start()
>     enqueue_dl_entity(WAKEUP)
>       if (WAKEUP)
> 	task_contending();
> 	update_dl_entity()
> 	  dl_entity_overflows() := true
> 	  update_dl_revised_wakeup();
> 
> In that case, it is possible to continue running with a slight
> adjustment to the runtime (it gets scaled back to account for 'lost'
> time or somesuch IIRC).
> 

Hummm, but this is for !implicit (dl_deadline != dl_period) tasks, is
it? And dl-servers are implicit.


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-18  8:37                           ` Juri Lelli
@ 2025-09-18  9:04                             ` Peter Zijlstra
  2025-09-18  9:42                               ` Juri Lelli
  0 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-18  9:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Thu, Sep 18, 2025 at 10:37:04AM +0200, Juri Lelli wrote:
> On 17/09/25 19:30, Peter Zijlstra wrote:
> > On Wed, Sep 17, 2025 at 03:56:20PM +0200, Juri Lelli wrote:
> > 
> > > > + * By stopping at this point the dl_server retains bandwidth, which, if a new
> > > > + * task wakes up imminently (starting the server again), can be used --
> > > > + * subject to CBS wakeup rules -- without having to wait for the next period.
> > > 
> > > In both cases we still defer until either the new period or the current
> > > 0-laxity, right?
> > > 
> > > The stop cleans all the flags, so subsequent start calls
> > > enqueue(ENQUEUE_WAKEUP) -> update_dl_entity() which sets dl_throttled
> > > and dl_defer_armed in both cases and then we start_dl_timer (defer
> > > timer) after it (without enqueueing right away).
> > > 
> > > Or maybe I am still a bit lost. :)
> > 
> > The way I read it earlier today:
> > 
> >   dl_server_start()
> >     enqueue_dl_entity(WAKEUP)
> >       if (WAKEUP)
> > 	task_contending();
> > 	update_dl_entity()
> > 	  dl_entity_overflows() := true
> > 	  update_dl_revised_wakeup();
> > 
> > In that case, it is possible to continue running with a slight
> > adjustment to the runtime (it gets scaled back to account for 'lost'
> > time or somesuch IIRC).
> > 
> 
> Hummm, but this is for !implicit (dl_deadline != dl_period) tasks, is
> it? And dl-servers are implicit.

Bah. You're right.

So how about this:

  dl_server_timer()
    if (dl_se->dl_defer_armed)
      dl_se->dl_defer_running = 1;

    enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH)

 __pick_task_dl()
   p = dl_se->server_pick_task(dl_se);
   if (!p)
     dl_server_stop()
       dl_se->dl_defer_armed = 0;
       dl_se->dl_throttled = 0;
       dl_se->dl_server_active = 0;
       /* notably it leaves dl_defer_running == 1 */

 dl_server_start()
   dl_se->dl_server_active = 1;
   enqueue_dl_entity(WAKEUP)
     if (WAKEUP)
       task_contending();
       update_dl_entity()
         if (dl_server() && dl_se->dl_defer)
	   if (!dl_se->dl_defer_running) /* !true := false */
	     /* do not set dl_defer_armed / dl_throttled */

Note: update_curr_dl_se() will eventually clear dl_defer_running when it
gets throttled.

And so it continues with the previous reservation. And I suppose the
question is, should it do update_dl_revised_wakeup() in this case?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-18  9:04                             ` Peter Zijlstra
@ 2025-09-18  9:42                               ` Juri Lelli
  0 siblings, 0 replies; 101+ messages in thread
From: Juri Lelli @ 2025-09-18  9:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On 18/09/25 11:04, Peter Zijlstra wrote:
> On Thu, Sep 18, 2025 at 10:37:04AM +0200, Juri Lelli wrote:
> > On 17/09/25 19:30, Peter Zijlstra wrote:
> > > On Wed, Sep 17, 2025 at 03:56:20PM +0200, Juri Lelli wrote:
> > > 
> > > > > + * By stopping at this point the dl_server retains bandwidth, which, if a new
> > > > > + * task wakes up imminently (starting the server again), can be used --
> > > > > + * subject to CBS wakeup rules -- without having to wait for the next period.
> > > > 
> > > > In both cases we still defer until either the new period or the current
> > > > 0-laxity, right?
> > > > 
> > > > The stop cleans all the flags, so subsequent start calls
> > > > enqueue(ENQUEUE_WAKEUP) -> update_dl_entity() which sets dl_throttled
> > > > and dl_defer_armed in both cases and then we start_dl_timer (defer
> > > > timer) after it (without enqueueing right away).
> > > > 
> > > > Or maybe I am still a bit lost. :)
> > > 
> > > The way I read it earlier today:
> > > 
> > >   dl_server_start()
> > >     enqueue_dl_entity(WAKEUP)
> > >       if (WAKEUP)
> > > 	task_contending();
> > > 	update_dl_entity()
> > > 	  dl_entity_overflows() := true
> > > 	  update_dl_revised_wakeup();
> > > 
> > > In that case, it is possible to continue running with a slight
> > > adjustment to the runtime (it gets scaled back to account for 'lost'
> > > time or somesuch IIRC).
> > > 
> > 
> > Hummm, but this is for !implicit (dl_deadline != dl_period) tasks, is
> > it? And dl-servers are implicit.
> 
> Bah. You're right.
> 
> So how about this:
> 
>   dl_server_timer()
>     if (dl_se->dl_defer_armed)
>       dl_se->dl_defer_running = 1;
> 
>     enqueue_dl_entity(dl_se, ENQUEUE_REPLENISH)
> 
>  __pick_task_dl()
>    p = dl_se->server_pick_task(dl_se);
>    if (!p)
>      dl_server_stop()
>        dl_se->dl_defer_armed = 0;
>        dl_se->dl_throttled = 0;
>        dl_se->dl_server_active = 0;
>        /* notably it leaves dl_defer_running == 1 */
> 
>  dl_server_start()
>    dl_se->dl_server_active = 1;
>    enqueue_dl_entity(WAKEUP)
>      if (WAKEUP)
>        task_contending();
>        update_dl_entity()
>          if (dl_server() && dl_se->dl_defer)
> 	   if (!dl_se->dl_defer_running) /* !true := false */
> 	     /* do not set dl_defer_armed / dl_throttled */
> 
> Note: update_curr_dl_se() will eventually clear dl_defer_running when it
> gets throttled.

Ah right! I indeed missed the dl_defer_running condition check.

> And so it continues with the previous reservation. And I suppose the
> question is, should it do update_dl_revised_wakeup() in this case?

No, it can use its current reservation, so no need to possibly shrink
it. Also the revised rule is for constrained tasks anyway.

I think we are good! Thanks a lot for fixing this and for bearing with
me. :)

Feel free to add my Reviewed/Tested/Acked-by as you see fit.

Juri


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation
  2025-09-17 12:26                     ` Peter Zijlstra
  2025-09-17 13:56                       ` Juri Lelli
@ 2025-09-17 19:29                       ` John Stultz
  2025-09-18  6:56                       ` [tip: sched/urgent] sched/deadline: Fix dl_server behaviour tip-bot2 for Peter Zijlstra
  2025-09-25  7:55                       ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 101+ messages in thread
From: John Stultz @ 2025-09-17 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, LKML, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
	Xuewen Yan, K Prateek Nayak, Suleiman Souhlal, Qais Yousef,
	Joel Fernandes, kuyo chang, hupu, kernel-team

On Wed, Sep 17, 2025 at 5:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Sep 17, 2025 at 11:34:42AM +0200, Peter Zijlstra wrote:
>
> > Yes. This makes sense.
> >
> > The old code would disable the dl_server when fair tasks drops to 0
> > so even though we had that yield in __pick_task_dl(), we'd never hit it.
> > So the moment another fair task shows up (0->1) we re-enqueue the
> > dl_server (using update_dl_entity() / CBS wakeup rules) and continue
> > consuming bandwidth.
> >
> > However, since we're now not stopping the thing, we hit that yield,
> > getting this pretty terrible behaviour where we will only run fair tasks
> > until there are none and then yield our entire period, forcing another
> > task to wait until the next cycle.
> >
> > Let me go have a play, surely we can do better.
>
> Can you please try:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
>
> That's yesterdays patch and the below. Its compile tested only, but
> with a bit of luck it'll actually work ;-)

Yep. It works! See the same initial 1sec delay for dl_server to kick
in and boost kthreadd (that we always had), but then we see it bounce
between the RT Referee task and non-RT kthreadd to spawn off the other
threads right away.  Even on the large cpu systems, the dl_server
manages to alternate with the ref task and kicks off all 320 threads
over just two periods so there's only a 2-3sec delay, rather than the
5min+ we had without this change.

So for my issue:
Tested-by: John Stultz <jstultz@google.com>

I'll do some more testing with other things just to make sure nothing
else comes up.

Thanks so much for the quick fix!
-john

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [tip: sched/urgent] sched/deadline: Fix dl_server behaviour
  2025-09-17 12:26                     ` Peter Zijlstra
  2025-09-17 13:56                       ` Juri Lelli
  2025-09-17 19:29                       ` John Stultz
@ 2025-09-18  6:56                       ` tip-bot2 for Peter Zijlstra
  2025-09-25  7:55                       ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 101+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-09-18  6:56 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: John Stultz, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     2dcbcce9bfac6ddc2e2f9243fa846a875371de79
Gitweb:        https://git.kernel.org/tip/2dcbcce9bfac6ddc2e2f9243fa846a875371de79
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 17 Sep 2025 12:03:20 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00

sched/deadline: Fix dl_server behaviour

John reported undesirable behaviour with the dl_server since commit:
cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling").

When starving fair tasks on purpose (starting spinning FIFO tasks),
his fair workload, which often goes (briefly) idle, would delay fair
invocations for a second, running one invocation per second was both
unexpected and terribly slow.

The reason this happens is that when dl_se->server_pick_task() returns
NULL, indicating no runnable tasks, it would yield, pushing any later
jobs out a whole period (1 second).

Instead simply stop the server. This should restore behaviour in that
a later wakeup (which restarts the server) will be able to continue
running (subject to the CBS wakeup rules).

Notably, this does not re-introduce the behaviour cccb45d7c4295 set
out to solve, any start/stop cycle is naturally throttled by the timer
period (no active cancel).

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
---
 include/linux/sched.h   |  1 -
 kernel/sched/deadline.c | 23 ++---------------------
 kernel/sched/sched.h    | 33 +++++++++++++++++++++++++++++++--
 3 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f89313b..e4ce0a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -706,7 +706,6 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
-	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5a5080b..72c1f72 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1571,10 +1571,8 @@ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime) {
-		dl_se->dl_server_idle = 0;
+	if (dl_se->dl_runtime)
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
-	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1602,20 +1600,6 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	dl_se->dl_server_active = 0;
 }
 
-static bool dl_server_stopped(struct sched_dl_entity *dl_se)
-{
-	if (!dl_se->dl_server_active)
-		return true;
-
-	if (dl_se->dl_server_idle) {
-		dl_server_stop(dl_se);
-		return true;
-	}
-
-	dl_se->dl_server_idle = 1;
-	return false;
-}
-
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_pick_f pick_task)
 {
@@ -2384,10 +2368,7 @@ again:
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
-			if (!dl_server_stopped(dl_se)) {
-				dl_se->dl_yielded = 1;
-				update_curr_dl_se(rq, dl_se, 0);
-			}
+			dl_server_stop(dl_se);
 			goto again;
 		}
 		rq->dl_server = dl_se;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f10d627..cf2109b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -371,10 +371,39 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
  *   dl_server_update() -- called from update_curr_common(), propagates runtime
  *                         to the server.
  *
- *   dl_server_start()
- *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
+ *   dl_server_start() -- start the server when it has tasks; it will stop
+ *			  automatically when there are no more tasks, per
+ *			  dl_se::server_pick() returning NULL.
+ *
+ *   dl_server_stop() -- (force) stop the server; use when updating
+ *                       parameters.
  *
  *   dl_server_init() -- initializes the server.
+ *
+ * When started the dl_server will (per dl_defer) schedule a timer for its
+ * zero-laxity point -- that is, unlike regular EDF tasks which run ASAP, a
+ * server will run at the very end of its period.
+ *
+ * This is done such that any runtime from the target class can be accounted
+ * against the server -- through dl_server_update() above -- such that when it
+ * becomes time to run, it might already be out of runtime and get deferred
+ * until the next period. In this case dl_server_timer() will alternate
+ * between defer and replenish but never actually enqueue the server.
+ *
+ * Only when the target class does not manage to exhaust the server's runtime
+ * (there's actualy starvation in the given period), will the dl_server get on
+ * the runqueue. Once queued it will pick tasks from the target class and run
+ * them until either its runtime is exhaused, at which point its back to
+ * dl_server_timer, or until there are no more tasks to run, at which point
+ * the dl_server stops itself.
+ *
+ * By stopping at this point the dl_server retains bandwidth, which, if a new
+ * task wakes up imminently (starting the server again), can be used --
+ * subject to CBS wakeup rules -- without having to wait for the next period.
+ *
+ * Additionally, because of the dl_defer behaviour the start/stop behaviour is
+ * naturally thottled to once per period, avoiding high context switch
+ * workloads from spamming the hrtimer program/cancel paths.
  */
 extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
 extern void dl_server_start(struct sched_dl_entity *dl_se);

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [tip: sched/urgent] sched/deadline: Fix dl_server behaviour
  2025-09-17 12:26                     ` Peter Zijlstra
                                         ` (2 preceding siblings ...)
  2025-09-18  6:56                       ` [tip: sched/urgent] sched/deadline: Fix dl_server behaviour tip-bot2 for Peter Zijlstra
@ 2025-09-25  7:55                       ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 101+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-09-25  7:55 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: John Stultz, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     a3a70caf7906708bf9bbc80018752a6b36543808
Gitweb:        https://git.kernel.org/tip/a3a70caf7906708bf9bbc80018752a6b36543808
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 17 Sep 2025 12:03:20 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 25 Sep 2025 09:51:50 +02:00

sched/deadline: Fix dl_server behaviour

John reported undesirable behaviour with the dl_server since commit:
cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling").

When starving fair tasks on purpose (starting spinning FIFO tasks),
his fair workload, which often goes (briefly) idle, would delay fair
invocations for a second, running one invocation per second was both
unexpected and terribly slow.

The reason this happens is that when dl_se->server_pick_task() returns
NULL, indicating no runnable tasks, it would yield, pushing any later
jobs out a whole period (1 second).

Instead simply stop the server. This should restore behaviour in that
a later wakeup (which restarts the server) will be able to continue
running (subject to the CBS wakeup rules).

Notably, this does not re-introduce the behaviour cccb45d7c4295 set
out to solve, any start/stop cycle is naturally throttled by the timer
period (no active cancel).

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
---
 include/linux/sched.h   |  1 -
 kernel/sched/deadline.c | 23 ++---------------------
 kernel/sched/sched.h    | 33 +++++++++++++++++++++++++++++++--
 3 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f89313b..e4ce0a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -706,7 +706,6 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
-	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5a5080b..72c1f72 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1571,10 +1571,8 @@ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime) {
-		dl_se->dl_server_idle = 0;
+	if (dl_se->dl_runtime)
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
-	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1602,20 +1600,6 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	dl_se->dl_server_active = 0;
 }
 
-static bool dl_server_stopped(struct sched_dl_entity *dl_se)
-{
-	if (!dl_se->dl_server_active)
-		return true;
-
-	if (dl_se->dl_server_idle) {
-		dl_server_stop(dl_se);
-		return true;
-	}
-
-	dl_se->dl_server_idle = 1;
-	return false;
-}
-
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_pick_f pick_task)
 {
@@ -2384,10 +2368,7 @@ again:
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
-			if (!dl_server_stopped(dl_se)) {
-				dl_se->dl_yielded = 1;
-				update_curr_dl_se(rq, dl_se, 0);
-			}
+			dl_server_stop(dl_se);
 			goto again;
 		}
 		rq->dl_server = dl_se;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f10d627..cf2109b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -371,10 +371,39 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
  *   dl_server_update() -- called from update_curr_common(), propagates runtime
  *                         to the server.
  *
- *   dl_server_start()
- *   dl_server_stop()  -- start/stop the server when it has (no) tasks.
+ *   dl_server_start() -- start the server when it has tasks; it will stop
+ *			  automatically when there are no more tasks, per
+ *			  dl_se::server_pick() returning NULL.
+ *
+ *   dl_server_stop() -- (force) stop the server; use when updating
+ *                       parameters.
  *
  *   dl_server_init() -- initializes the server.
+ *
+ * When started the dl_server will (per dl_defer) schedule a timer for its
+ * zero-laxity point -- that is, unlike regular EDF tasks which run ASAP, a
+ * server will run at the very end of its period.
+ *
+ * This is done such that any runtime from the target class can be accounted
+ * against the server -- through dl_server_update() above -- such that when it
+ * becomes time to run, it might already be out of runtime and get deferred
+ * until the next period. In this case dl_server_timer() will alternate
+ * between defer and replenish but never actually enqueue the server.
+ *
+ * Only when the target class does not manage to exhaust the server's runtime
+ * (there's actualy starvation in the given period), will the dl_server get on
+ * the runqueue. Once queued it will pick tasks from the target class and run
+ * them until either its runtime is exhaused, at which point its back to
+ * dl_server_timer, or until there are no more tasks to run, at which point
+ * the dl_server stops itself.
+ *
+ * By stopping at this point the dl_server retains bandwidth, which, if a new
+ * task wakes up imminently (starting the server again), can be used --
+ * subject to CBS wakeup rules -- without having to wait for the next period.
+ *
+ * Additionally, because of the dl_defer behaviour the start/stop behaviour is
+ * naturally thottled to once per period, avoiding high context switch
+ * workloads from spamming the hrtimer program/cancel paths.
  */
 extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
 extern void dl_server_start(struct sched_dl_entity *dl_se);

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-16 11:01           ` Peter Zijlstra
  2025-09-16 12:52             ` Juri Lelli
  2025-09-16 17:35             ` John Stultz
@ 2025-09-18  6:56             ` tip-bot2 for Peter Zijlstra
  2025-09-18 14:46               ` Dietmar Eggemann
  2025-09-22 21:57               ` Marek Szyprowski
  2025-09-25  7:55             ` tip-bot2 for Peter Zijlstra
  3 siblings, 2 replies; 101+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-09-18  6:56 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: John Stultz, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00

sched/deadline: Fix dl_server getting stuck

John found it was easy to hit lockup warnings when running locktorture
on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
("sched/deadline: Less agressive dl_server handling").

While debugging it seems there is a chance where we end up with the
dl_server dequeued, with dl_se->dl_server_active. This causes
dl_server_start() to return without enqueueing the dl_server, thus it
fails to run when RT tasks starve the cpu.

When this happens, dl_server_timer() catches the
'!dl_se->server_has_tasks(dl_se)' case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally return
HRTIMER_NO_RESTART.

This ends in no new timer and also no enqueue, leaving the dl_server
'dead', allowing starvation.

What should have happened is for the bandwidth timer to start the
zero-laxity timer, which in turn would enqueue the dl_server and cause
dl_se->server_pick_task() to be called -- which will stop the
dl_server if no fair tasks are observed for a whole period.

IOW, it is totally irrelevant if there are fair tasks at the moment of
bandwidth refresh.

This removes all dl_se->server_has_tasks() users, so remove the whole
thing.

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
---
 include/linux/sched.h   |  1 -
 kernel/sched/deadline.c | 12 +-----------
 kernel/sched/fair.c     |  7 +------
 kernel/sched/sched.h    |  4 ----
 4 files changed, 2 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8188b8..f89313b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -733,7 +733,6 @@ struct sched_dl_entity {
 	 * runnable task.
 	 */
 	struct rq			*rq;
-	dl_server_has_tasks_f		server_has_tasks;
 	dl_server_pick_f		server_pick_task;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f253012..5a5080b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 */
 	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
 	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
-		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
+		if (!is_dl_boosted(dl_se)) {
 
 			/*
 			 * Set dl_se->dl_defer_armed and dl_throttled variables to
@@ -1152,8 +1152,6 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
 /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
 static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
 
-static bool dl_server_stopped(struct sched_dl_entity *dl_se);
-
 static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
@@ -1171,12 +1169,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 		if (!dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
-		if (!dl_se->server_has_tasks(dl_se)) {
-			replenish_dl_entity(dl_se);
-			dl_server_stopped(dl_se);
-			return HRTIMER_NORESTART;
-		}
-
 		if (dl_se->dl_defer_armed) {
 			/*
 			 * First check if the server could consume runtime in background.
@@ -1625,11 +1617,9 @@ static bool dl_server_stopped(struct sched_dl_entity *dl_se)
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
-		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
 {
 	dl_se->rq = rq;
-	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick_task = pick_task;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4d91e8..59d7dc9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8859,11 +8859,6 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_stru
 	return pick_next_task_fair(rq, prev, NULL);
 }
 
-static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
-{
-	return !!dl_se->rq->cfs.nr_queued;
-}
-
 static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
 {
 	return pick_task_fair(dl_se->rq);
@@ -8875,7 +8870,7 @@ void fair_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
+	dl_server_init(dl_se, rq, fair_server_pick_task);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d..f10d627 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -365,9 +365,6 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
  *
  *   dl_se::rq -- runqueue we belong to.
  *
- *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
- *                                server when it runs out of tasks to run.
- *
  *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
  *                           returns NULL.
  *
@@ -383,7 +380,6 @@ extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
 extern void dl_server_start(struct sched_dl_entity *dl_se);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
-		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
 

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-18  6:56             ` [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck tip-bot2 for Peter Zijlstra
@ 2025-09-18 14:46               ` Dietmar Eggemann
  2025-09-22 21:57               ` Marek Szyprowski
  1 sibling, 0 replies; 101+ messages in thread
From: Dietmar Eggemann @ 2025-09-18 14:46 UTC (permalink / raw)
  To: linux-kernel, linux-tip-commits; +Cc: John Stultz, Peter Zijlstra (Intel), x86



On 18.09.25 08:56, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/urgent branch of tip:
> 
> Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
> 
> sched/deadline: Fix dl_server getting stuck
> 
> John found it was easy to hit lockup warnings when running locktorture
> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
> ("sched/deadline: Less agressive dl_server handling").
> 
> While debugging it seems there is a chance where we end up with the
> dl_server dequeued, with dl_se->dl_server_active. This causes
> dl_server_start() to return without enqueueing the dl_server, thus it
> fails to run when RT tasks starve the cpu.
> 
> When this happens, dl_server_timer() catches the
> '!dl_se->server_has_tasks(dl_se)' case, which then calls
> replenish_dl_entity() and dl_server_stopped() and finally return
> HRTIMER_NO_RESTART.
> 
> This ends in no new timer and also no enqueue, leaving the dl_server
> 'dead', allowing starvation.
> 
> What should have happened is for the bandwidth timer to start the
> zero-laxity timer, which in turn would enqueue the dl_server and cause
> dl_se->server_pick_task() to be called -- which will stop the
> dl_server if no fair tasks are observed for a whole period.
> 
> IOW, it is totally irrelevant if there are fair tasks at the moment of
> bandwidth refresh.
> 
> This removes all dl_se->server_has_tasks() users, so remove the whole
> thing.

I see the same results like John running his locktorture test, the
'BUG: workqueue lockup' is gone now.

Just got confused because of these two remaining dl_server_has_tasks references:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73c7de26fa60..73d750292446 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -634,7 +634,6 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
-typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
 typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
 
 struct sched_dl_entity {
@@ -728,9 +727,6 @@ struct sched_dl_entity {
         * dl_server_update().
         *
         * @rq the runqueue this server is for
-        *
-        * @server_has_tasks() returns true if @server_pick return a
-        * runnable task.
         */
        struct rq                       *rq;
        dl_server_pick_f                server_pick_task;

Can you still tweak the patch to get rif of them with the patch?

[...]

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-18  6:56             ` [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck tip-bot2 for Peter Zijlstra
  2025-09-18 14:46               ` Dietmar Eggemann
@ 2025-09-22 21:57               ` Marek Szyprowski
  2025-09-22 23:46                 ` John Stultz
                                   ` (2 more replies)
  1 sibling, 3 replies; 101+ messages in thread
From: Marek Szyprowski @ 2025-09-22 21:57 UTC (permalink / raw)
  To: linux-kernel, linux-tip-commits
  Cc: John Stultz, Peter Zijlstra (Intel), x86,
	'Linux Samsung SOC'

On 18.09.2025 08:56, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/urgent branch of tip:
>
> Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
>
> sched/deadline: Fix dl_server getting stuck
>
> John found it was easy to hit lockup warnings when running locktorture
> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
> ("sched/deadline: Less agressive dl_server handling").
>
> While debugging it seems there is a chance where we end up with the
> dl_server dequeued, with dl_se->dl_server_active. This causes
> dl_server_start() to return without enqueueing the dl_server, thus it
> fails to run when RT tasks starve the cpu.
>
> When this happens, dl_server_timer() catches the
> '!dl_se->server_has_tasks(dl_se)' case, which then calls
> replenish_dl_entity() and dl_server_stopped() and finally return
> HRTIMER_NO_RESTART.
>
> This ends in no new timer and also no enqueue, leaving the dl_server
> 'dead', allowing starvation.
>
> What should have happened is for the bandwidth timer to start the
> zero-laxity timer, which in turn would enqueue the dl_server and cause
> dl_se->server_pick_task() to be called -- which will stop the
> dl_server if no fair tasks are observed for a whole period.
>
> IOW, it is totally irrelevant if there are fair tasks at the moment of
> bandwidth refresh.
>
> This removes all dl_se->server_has_tasks() users, so remove the whole
> thing.
>
> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> Reported-by: John Stultz <jstultz@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: John Stultz <jstultz@google.com>
> ---

This patch landed in today's linux-next as commit 077e1e2e0015 
("sched/deadline: Fix dl_server getting stuck"). In my tests I found 
that it breaks CPU hotplug on some of my systems. On 64bit 
Exynos5433-based TM2e board I've captured the following lock dep warning 
(which unfortunately doesn't look like really related to CPU hotplug):

# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
Detected VIPT I-cache on CPU7
CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
------------[ cut here ]------------
WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329 
rcutree_report_cpu_starting+0x1e8/0x348
Modules linked in: brcmfmac_wcc cpufreq_powersave cpufreq_conservative 
brcmfmac brcmutil sha256 snd_soc_wm5110 cfg80211 snd_soc_wm_adsp cs_dsp 
snd_soc_tm2_wm5110 snd_soc_arizona arizona_micsupp phy_exynos5_usbdrd 
s5p_mfc typec arizona_ldo1 hci_uart btqca s5p_jpeg max77693_haptic btbcm 
s3fwrn5_i2c exynos_gsc bluetooth s3fwrn5 nci v4l2_mem2mem nfc 
snd_soc_i2s snd_soc_idma snd_soc_hdmi_codec snd_soc_max98504 
snd_soc_s3c_dma videobuf2_dma_contig videobuf2_memops ecdh_generic 
snd_soc_core ir_spi videobuf2_v4l2 ecc snd_compress ntc_thermistor 
panfrost videodev snd_pcm_dmaengine snd_pcm rfkill drm_shmem_helper 
panel_samsung_s6e3ha2 videobuf2_common backlight pwrseq_core gpu_sched 
mc snd_timer snd soundcore ipv6
CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16012 PREEMPT
Hardware name: Samsung TM2E board (DT)
Hardware name: Samsung TM2E board (DT)
Detected VIPT I-cache on CPU7

======================================================
WARNING: possible circular locking dependency detected
6.17.0-rc6+ #16012 Not tainted
------------------------------------------------------
swapper/7/0 is trying to acquire lock:
ffff000024021cc8 (&irq_desc_lock_class){-.-.}-{2:2}, at: 
__irq_get_desc_lock+0x5c/0x9c

but task is already holding lock:
ffff800083e479c0 (&port_lock_key){-.-.}-{3:3}, at: 
s3c24xx_serial_console_write+0x80/0x268

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&port_lock_key){-.-.}-{3:3}:
        _raw_spin_lock_irqsave+0x60/0x88
        s3c24xx_serial_console_write+0x80/0x268
        console_flush_all+0x304/0x49c
        console_unlock+0x70/0x110
        vprintk_emit+0x254/0x39c
        vprintk_default+0x38/0x44
        vprintk+0x28/0x34
        _printk+0x5c/0x84
        register_console+0x3ac/0x4f8
        serial_core_register_port+0x6c4/0x7a4
        serial_ctrl_register_port+0x10/0x1c
        uart_add_one_port+0x10/0x1c
        s3c24xx_serial_probe+0x34c/0x6d8
        platform_probe+0x5c/0xac
        really_probe+0xbc/0x298
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0xdc/0x164
        __device_attach_driver+0xb8/0x138
        bus_for_each_drv+0x80/0xdc
        __device_attach+0xa8/0x1b0
        device_initial_probe+0x14/0x20
        bus_probe_device+0xb0/0xb4
        deferred_probe_work_func+0x8c/0xc8
        process_one_work+0x208/0x60c
        worker_thread+0x244/0x388
        kthread+0x150/0x228
        ret_from_fork+0x10/0x20

-> #1 (console_owner){..-.}-{0:0}:
        console_lock_spinning_enable+0x6c/0x7c
        console_flush_all+0x2c8/0x49c
        console_unlock+0x70/0x110
        vprintk_emit+0x254/0x39c
        vprintk_default+0x38/0x44
        vprintk+0x28/0x34
        _printk+0x5c/0x84
        exynos_wkup_irq_set_wake+0x80/0xa4
        irq_set_irq_wake+0x164/0x1e0
        arizona_irq_set_wake+0x18/0x24
        irq_set_irq_wake+0x164/0x1e0
        regmap_irq_sync_unlock+0x328/0x530
        __irq_put_desc_unlock+0x48/0x4c
        irq_set_irq_wake+0x84/0x1e0
        arizona_set_irq_wake+0x5c/0x70
        wm5110_probe+0x220/0x354 [snd_soc_wm5110]
        platform_probe+0x5c/0xac
        really_probe+0xbc/0x298
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0xdc/0x164
        __driver_attach+0x9c/0x1ac
        bus_for_each_dev+0x74/0xd0
        driver_attach+0x24/0x30
        bus_add_driver+0xe4/0x208
        driver_register+0x60/0x128
        __platform_driver_register+0x24/0x30
        cs_exit+0xc/0x20 [cpufreq_conservative]
        do_one_initcall+0x64/0x308
        do_init_module+0x58/0x23c
        load_module+0x1b48/0x1dc4
        init_module_from_file+0x84/0xc4
        idempotent_init_module+0x188/0x280
        __arm64_sys_finit_module+0x68/0xac
        invoke_syscall+0x48/0x110
        el0_svc_.common.c

(system is frozen at this point).

Let me know if I can help somehow debugging this issue. Reverting 
$subject on top of linux-next fixes this issue.


>   include/linux/sched.h   |  1 -
>   kernel/sched/deadline.c | 12 +-----------
>   kernel/sched/fair.c     |  7 +------
>   kernel/sched/sched.h    |  4 ----
>   4 files changed, 2 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f8188b8..f89313b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -733,7 +733,6 @@ struct sched_dl_entity {
>   	 * runnable task.
>   	 */
>   	struct rq			*rq;
> -	dl_server_has_tasks_f		server_has_tasks;
>   	dl_server_pick_f		server_pick_task;
>   
>   #ifdef CONFIG_RT_MUTEXES
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f253012..5a5080b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
>   	 */
>   	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
>   	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
> -		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
> +		if (!is_dl_boosted(dl_se)) {
>   
>   			/*
>   			 * Set dl_se->dl_defer_armed and dl_throttled variables to
> @@ -1152,8 +1152,6 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
>   /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
>   static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
>   
> -static bool dl_server_stopped(struct sched_dl_entity *dl_se);
> -
>   static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
>   {
>   	struct rq *rq = rq_of_dl_se(dl_se);
> @@ -1171,12 +1169,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
>   		if (!dl_se->dl_runtime)
>   			return HRTIMER_NORESTART;
>   
> -		if (!dl_se->server_has_tasks(dl_se)) {
> -			replenish_dl_entity(dl_se);
> -			dl_server_stopped(dl_se);
> -			return HRTIMER_NORESTART;
> -		}
> -
>   		if (dl_se->dl_defer_armed) {
>   			/*
>   			 * First check if the server could consume runtime in background.
> @@ -1625,11 +1617,9 @@ static bool dl_server_stopped(struct sched_dl_entity *dl_se)
>   }
>   
>   void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> -		    dl_server_has_tasks_f has_tasks,
>   		    dl_server_pick_f pick_task)
>   {
>   	dl_se->rq = rq;
> -	dl_se->server_has_tasks = has_tasks;
>   	dl_se->server_pick_task = pick_task;
>   }
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c4d91e8..59d7dc9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8859,11 +8859,6 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_stru
>   	return pick_next_task_fair(rq, prev, NULL);
>   }
>   
> -static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
> -{
> -	return !!dl_se->rq->cfs.nr_queued;
> -}
> -
>   static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
>   {
>   	return pick_task_fair(dl_se->rq);
> @@ -8875,7 +8870,7 @@ void fair_server_init(struct rq *rq)
>   
>   	init_dl_entity(dl_se);
>   
> -	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
> +	dl_server_init(dl_se, rq, fair_server_pick_task);
>   }
>   
>   /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d..f10d627 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -365,9 +365,6 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
>    *
>    *   dl_se::rq -- runqueue we belong to.
>    *
> - *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
> - *                                server when it runs out of tasks to run.
> - *
>    *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
>    *                           returns NULL.
>    *
> @@ -383,7 +380,6 @@ extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
>   extern void dl_server_start(struct sched_dl_entity *dl_se);
>   extern void dl_server_stop(struct sched_dl_entity *dl_se);
>   extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
> -		    dl_server_has_tasks_f has_tasks,
>   		    dl_server_pick_f pick_task);
>   extern void sched_init_dl_servers(void);
>   
>
Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-22 21:57               ` Marek Szyprowski
@ 2025-09-22 23:46                 ` John Stultz
  2025-09-23  6:31                   ` Marek Szyprowski
  2025-09-23  7:25                 ` Peter Zijlstra
  2025-09-23 22:02                 ` Peter Zijlstra
  2 siblings, 1 reply; 101+ messages in thread
From: John Stultz @ 2025-09-22 23:46 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel), x86,
	Linux Samsung SOC

On Mon, Sep 22, 2025 at 2:57 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> This patch landed in today's linux-next as commit 077e1e2e0015
> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found
> that it breaks CPU hotplug on some of my systems. On 64bit
> Exynos5433-based TM2e board I've captured the following lock dep warning
> (which unfortunately doesn't look like really related to CPU hotplug):
>

Huh. Nor does it really look related to the dl_server change. Interesting...


> # for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
> Detected VIPT I-cache on CPU7
> CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
> ------------[ cut here ]------------
> WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329
> rcutree_report_cpu_starting+0x1e8/0x348
> Modules linked in: brcmfmac_wcc cpufreq_powersave cpufreq_conservative
> brcmfmac brcmutil sha256 snd_soc_wm5110 cfg80211 snd_soc_wm_adsp cs_dsp
> snd_soc_tm2_wm5110 snd_soc_arizona arizona_micsupp phy_exynos5_usbdrd
> s5p_mfc typec arizona_ldo1 hci_uart btqca s5p_jpeg max77693_haptic btbcm
> s3fwrn5_i2c exynos_gsc bluetooth s3fwrn5 nci v4l2_mem2mem nfc
> snd_soc_i2s snd_soc_idma snd_soc_hdmi_codec snd_soc_max98504
> snd_soc_s3c_dma videobuf2_dma_contig videobuf2_memops ecdh_generic
> snd_soc_core ir_spi videobuf2_v4l2 ecc snd_compress ntc_thermistor
> panfrost videodev snd_pcm_dmaengine snd_pcm rfkill drm_shmem_helper
> panel_samsung_s6e3ha2 videobuf2_common backlight pwrseq_core gpu_sched
> mc snd_timer snd soundcore ipv6
> CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16012 PREEMPT
> Hardware name: Samsung TM2E board (DT)
> Hardware name: Samsung TM2E board (DT)
> Detected VIPT I-cache on CPU7
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.17.0-rc6+ #16012 Not tainted
> ------------------------------------------------------
> swapper/7/0 is trying to acquire lock:
> ffff000024021cc8 (&irq_desc_lock_class){-.-.}-{2:2}, at:
> __irq_get_desc_lock+0x5c/0x9c
>
> but task is already holding lock:
> ffff800083e479c0 (&port_lock_key){-.-.}-{3:3}, at:
> s3c24xx_serial_console_write+0x80/0x268
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&port_lock_key){-.-.}-{3:3}:
>         _raw_spin_lock_irqsave+0x60/0x88
>         s3c24xx_serial_console_write+0x80/0x268
>         console_flush_all+0x304/0x49c
>         console_unlock+0x70/0x110
>         vprintk_emit+0x254/0x39c
>         vprintk_default+0x38/0x44
>         vprintk+0x28/0x34
>         _printk+0x5c/0x84
>         register_console+0x3ac/0x4f8
>         serial_core_register_port+0x6c4/0x7a4
>         serial_ctrl_register_port+0x10/0x1c
>         uart_add_one_port+0x10/0x1c
>         s3c24xx_serial_probe+0x34c/0x6d8
>         platform_probe+0x5c/0xac
>         really_probe+0xbc/0x298
>         __driver_probe_device+0x78/0x12c
>         driver_probe_device+0xdc/0x164
>         __device_attach_driver+0xb8/0x138
>         bus_for_each_drv+0x80/0xdc
>         __device_attach+0xa8/0x1b0
>         device_initial_probe+0x14/0x20
>         bus_probe_device+0xb0/0xb4
>         deferred_probe_work_func+0x8c/0xc8
>         process_one_work+0x208/0x60c
>         worker_thread+0x244/0x388
>         kthread+0x150/0x228
>         ret_from_fork+0x10/0x20
>
> -> #1 (console_owner){..-.}-{0:0}:
>         console_lock_spinning_enable+0x6c/0x7c
>         console_flush_all+0x2c8/0x49c
>         console_unlock+0x70/0x110
>         vprintk_emit+0x254/0x39c
>         vprintk_default+0x38/0x44
>         vprintk+0x28/0x34
>         _printk+0x5c/0x84
>         exynos_wkup_irq_set_wake+0x80/0xa4
>         irq_set_irq_wake+0x164/0x1e0
>         arizona_irq_set_wake+0x18/0x24
>         irq_set_irq_wake+0x164/0x1e0
>         regmap_irq_sync_unlock+0x328/0x530
>         __irq_put_desc_unlock+0x48/0x4c
>         irq_set_irq_wake+0x84/0x1e0
>         arizona_set_irq_wake+0x5c/0x70
>         wm5110_probe+0x220/0x354 [snd_soc_wm5110]
>         platform_probe+0x5c/0xac
>         really_probe+0xbc/0x298
>         __driver_probe_device+0x78/0x12c
>         driver_probe_device+0xdc/0x164
>         __driver_attach+0x9c/0x1ac
>         bus_for_each_dev+0x74/0xd0
>         driver_attach+0x24/0x30
>         bus_add_driver+0xe4/0x208
>         driver_register+0x60/0x128
>         __platform_driver_register+0x24/0x30
>         cs_exit+0xc/0x20 [cpufreq_conservative]
>         do_one_initcall+0x64/0x308
>         do_init_module+0x58/0x23c
>         load_module+0x1b48/0x1dc4
>         init_module_from_file+0x84/0xc4
>         idempotent_init_module+0x188/0x280
>         __arm64_sys_finit_module+0x68/0xac
>         invoke_syscall+0x48/0x110
>         el0_svc_.common.c
>
> (system is frozen at this point).

So I've seen issues like this when testing scheduler changes,
particularly when I've added debug printks or WARN_ONs that trip while
we're deep in the scheduler core and hold various locks. I reported
something similar here:
https://lore.kernel.org/lkml/CANDhNCo8NRm4meR7vHqvP8vVZ-_GXVPuUKSO1wUQkKdfjvy20w@mail.gmail.com/

Now, usually I'll see the lockdep warning, and the hang is much more rare.

But I don't see right off how the dl_server change would affect this,
other than just changing the timing of execution such that you manage
to trip over the existing issue.

So far I don't see anything similar testing hotplug on x86 qemu.  Do
you get any other console messages or warnings prior?

Looking at the backtrace, I wonder if changing the pr_info() in
exynos_wkup_irq_set_wake() to printk_deferred() might avoid this?

thanks
-john

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-22 23:46                 ` John Stultz
@ 2025-09-23  6:31                   ` Marek Szyprowski
  0 siblings, 0 replies; 101+ messages in thread
From: Marek Szyprowski @ 2025-09-23  6:31 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, linux-tip-commits, Peter Zijlstra (Intel), x86,
	Linux Samsung SOC, Krzysztof Kozlowski

On 23.09.2025 01:46, John Stultz wrote:
> On Mon, Sep 22, 2025 at 2:57 PM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> This patch landed in today's linux-next as commit 077e1e2e0015
>> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found
>> that it breaks CPU hotplug on some of my systems. On 64bit
>> Exynos5433-based TM2e board I've captured the following lock dep warning
>> (which unfortunately doesn't look like really related to CPU hotplug):
>>
> Huh. Nor does it really look related to the dl_server change. Interesting...
>
>
>> # for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
>> Detected VIPT I-cache on CPU7
>> CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
>> ------------[ cut here ]------------
>> WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329
>> rcutree_report_cpu_starting+0x1e8/0x348
>> Modules linked in: brcmfmac_wcc cpufreq_powersave cpufreq_conservative
>> brcmfmac brcmutil sha256 snd_soc_wm5110 cfg80211 snd_soc_wm_adsp cs_dsp
>> snd_soc_tm2_wm5110 snd_soc_arizona arizona_micsupp phy_exynos5_usbdrd
>> s5p_mfc typec arizona_ldo1 hci_uart btqca s5p_jpeg max77693_haptic btbcm
>> s3fwrn5_i2c exynos_gsc bluetooth s3fwrn5 nci v4l2_mem2mem nfc
>> snd_soc_i2s snd_soc_idma snd_soc_hdmi_codec snd_soc_max98504
>> snd_soc_s3c_dma videobuf2_dma_contig videobuf2_memops ecdh_generic
>> snd_soc_core ir_spi videobuf2_v4l2 ecc snd_compress ntc_thermistor
>> panfrost videodev snd_pcm_dmaengine snd_pcm rfkill drm_shmem_helper
>> panel_samsung_s6e3ha2 videobuf2_common backlight pwrseq_core gpu_sched
>> mc snd_timer snd soundcore ipv6
>> CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16012 PREEMPT
>> Hardware name: Samsung TM2E board (DT)
>> Hardware name: Samsung TM2E board (DT)
>> Detected VIPT I-cache on CPU7
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.17.0-rc6+ #16012 Not tainted
>> ------------------------------------------------------
>> swapper/7/0 is trying to acquire lock:
>> ffff000024021cc8 (&irq_desc_lock_class){-.-.}-{2:2}, at:
>> __irq_get_desc_lock+0x5c/0x9c
>>
>> but task is already holding lock:
>> ffff800083e479c0 (&port_lock_key){-.-.}-{3:3}, at:
>> s3c24xx_serial_console_write+0x80/0x268
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #2 (&port_lock_key){-.-.}-{3:3}:
>>          _raw_spin_lock_irqsave+0x60/0x88
>>          s3c24xx_serial_console_write+0x80/0x268
>>          console_flush_all+0x304/0x49c
>>          console_unlock+0x70/0x110
>>          vprintk_emit+0x254/0x39c
>>          vprintk_default+0x38/0x44
>>          vprintk+0x28/0x34
>>          _printk+0x5c/0x84
>>          register_console+0x3ac/0x4f8
>>          serial_core_register_port+0x6c4/0x7a4
>>          serial_ctrl_register_port+0x10/0x1c
>>          uart_add_one_port+0x10/0x1c
>>          s3c24xx_serial_probe+0x34c/0x6d8
>>          platform_probe+0x5c/0xac
>>          really_probe+0xbc/0x298
>>          __driver_probe_device+0x78/0x12c
>>          driver_probe_device+0xdc/0x164
>>          __device_attach_driver+0xb8/0x138
>>          bus_for_each_drv+0x80/0xdc
>>          __device_attach+0xa8/0x1b0
>>          device_initial_probe+0x14/0x20
>>          bus_probe_device+0xb0/0xb4
>>          deferred_probe_work_func+0x8c/0xc8
>>          process_one_work+0x208/0x60c
>>          worker_thread+0x244/0x388
>>          kthread+0x150/0x228
>>          ret_from_fork+0x10/0x20
>>
>> -> #1 (console_owner){..-.}-{0:0}:
>>          console_lock_spinning_enable+0x6c/0x7c
>>          console_flush_all+0x2c8/0x49c
>>          console_unlock+0x70/0x110
>>          vprintk_emit+0x254/0x39c
>>          vprintk_default+0x38/0x44
>>          vprintk+0x28/0x34
>>          _printk+0x5c/0x84
>>          exynos_wkup_irq_set_wake+0x80/0xa4
>>          irq_set_irq_wake+0x164/0x1e0
>>          arizona_irq_set_wake+0x18/0x24
>>          irq_set_irq_wake+0x164/0x1e0
>>          regmap_irq_sync_unlock+0x328/0x530
>>          __irq_put_desc_unlock+0x48/0x4c
>>          irq_set_irq_wake+0x84/0x1e0
>>          arizona_set_irq_wake+0x5c/0x70
>>          wm5110_probe+0x220/0x354 [snd_soc_wm5110]
>>          platform_probe+0x5c/0xac
>>          really_probe+0xbc/0x298
>>          __driver_probe_device+0x78/0x12c
>>          driver_probe_device+0xdc/0x164
>>          __driver_attach+0x9c/0x1ac
>>          bus_for_each_dev+0x74/0xd0
>>          driver_attach+0x24/0x30
>>          bus_add_driver+0xe4/0x208
>>          driver_register+0x60/0x128
>>          __platform_driver_register+0x24/0x30
>>          cs_exit+0xc/0x20 [cpufreq_conservative]
>>          do_one_initcall+0x64/0x308
>>          do_init_module+0x58/0x23c
>>          load_module+0x1b48/0x1dc4
>>          init_module_from_file+0x84/0xc4
>>          idempotent_init_module+0x188/0x280
>>          __arm64_sys_finit_module+0x68/0xac
>>          invoke_syscall+0x48/0x110
>>          el0_svc_.common.c
>>
>> (system is frozen at this point).
> So I've seen issues like this when testing scheduler changes,
> particularly when I've added debug printks or WARN_ONs that trip while
> we're deep in the scheduler core and hold various locks. I reported
> something similar here:
> https://lore.kernel.org/lkml/CANDhNCo8NRm4meR7vHqvP8vVZ-_GXVPuUKSO1wUQkKdfjvy20w@mail.gmail.com/
>
> Now, usually I'll see the lockdep warning, and the hang is much more rare.
>
> But I don't see right off how the dl_server change would affect this,
> other than just changing the timing of execution such that you manage
> to trip over the existing issue.
>
> So far I don't see anything similar testing hotplug on x86 qemu.  Do
> you get any other console messages or warnings prior?

Nope. But the most suspicious message that is there is the 'CPU7: Booted 
secondary processor 0x0000000101' line, which I got while off-lining all 
non-zero CPUs.


> Looking at the backtrace, I wonder if changing the pr_info() in
> exynos_wkup_irq_set_wake() to printk_deferred() might avoid this?


I've removed that pr_info() from exynos_wkup_irq_set_wake() completely 
and now I get the following warning:

# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
# Detected VIPT I-cache on CPU7
  CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329 
rcutree_report_cpu_starting+0x1e8/0x348
  Modules linked in: brcmfmac_wcc brcmfmac brcmutil sha256 
cpufreq_powersave cpufreq_conservative cfg80211 snd_soc_tm2_wm5110 
hci_uart btqca btbcm s3fwrn5_i2c snd_soc_wm5110 bluetooth 
arizona_micsupp phy_exynos5_usbdrd s3fwrn5 s5p_mfc nci typec 
snd_soc_wm_adsp s5p_jpeg cs_dsp nfc ecdh_generic max77693_haptic 
snd_soc_arizona arizona_ldo1 ecc rfkill snd_soc_i2s snd_soc_idma 
snd_soc_max98504 snd_soc_hdmi_codec snd_soc_s3c_dma pwrseq_core 
snd_soc_core exynos_gsc ir_spi v4l2_mem2mem videobuf2_dma_contig 
videobuf2_memops snd_compress snd_pcm_dmaengine videobuf2_v4l2 videodev 
ntc_thermistor snd_pcm panfrost videobuf2_common drm_shmem_helper 
gpu_sched snd_timer mc panel_samsung_s6e3ha2 backlight snd soundcore ipv6
  CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16014 
PREEMPT
  Hardware name: Samsung TM2E board (DT)
  Hardware name: Samsung TM2E board (DT)
  Detected VIPT I-cache on CPU7
  CPU7: Booted secondary processor 0x0000000103 [0x410fd031]

  ================================
  WARNING: inconsistent lock state
  6.17.0-rc6+ #16014 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  swapper/7/0 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff800083e479c0 (&port_lock_key){?.-.}-{3:3}, at: 
s3c24xx_serial_console_write+0x80/0x268
  {IN-HARDIRQ-W} state was registered at:
    lock_acquire+0x1c8/0x354
    _raw_spin_lock+0x48/0x60
    s3c64xx_serial_handle_irq+0x6c/0x164
    __handle_irq_event_percpu+0x9c/0x2d8
    handle_irq_event+0x4c/0xac
    handle_fasteoi_irq+0x108/0x198
    handle_irq_desc+0x40/0x58
    generic_handle_domain_irq+0x1c/0x28
    gic_handle_irq+0x40/0xc8
    call_on_irq_stack+0x30/0x48
    do_interrupt_handler+0x80/0x84
    el1_interrupt+0x34/0x64
    el1h_64_irq_handler+0x18/0x24
    el1h_64_irq+0x6c/0x70
    default_idle_call+0xac/0x26c
    do_idle+0x220/0x284
    cpu_startup_entry+0x38/0x3c
    rest_init+0xf4/0x184
    start_kernel+0x70c/0x7d4
    __primary_switched+0x88/0x90
  irq event stamp: 63878
  hardirqs last  enabled at (63877): [<ffff800080121d2c>] 
do_idle+0x220/0x284
  hardirqs last disabled at (63878): [<ffff80008132f3a4>] 
el1_brk64+0x1c/0x54
  softirqs last  enabled at (63812): [<ffff8000800c1164>] 
handle_softirqs+0x4c4/0x4dc
  softirqs last disabled at (63807): [<ffff800080010690>] 
__do_softirq+0x14/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&port_lock_key);
    <Interrupt>
      lock(&port_lock_key);

   *** DEADLOCK ***

  5 locks held by swapper/7/0:
   #0: ffff800082d0aa98 (console_lock){+.+.}-{0:0}, at: 
vprintk_emit+0x150/0x39c
   #1: ffff800082d0aaf0 (console_srcu){....}-{0:0}, at: 
console_flush_all+0x78/0x49c
   #2: ffff800082d0acb0 (console_owner){+.-.}-{0:0}, at: 
console_lock_spinning_enable+0x48/0x7c
   #3: ffff800082d0acd8 
(printk_legacy_map-wait-type-override){+...}-{4:4}, at: 
console_flush_all+0x2b0/0x49c
   #4: ffff800083e479c0 (&port_lock_key){?.-.}-{3:3}, at: 
s3c24xx_serial_console_write+0x80/0x268

  stack backtrace:
  CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16014 
PREEMPT
  Hardware name: Samsung TM2E board (DT)
  Call trace:
   show_stack+0x18/0x24 (C)
   dump_stack_lvl+0x90/0xd0
   dump_stack+0x18/0x24
   print_usage_bug.part.0+0x29c/0x358
   mark_lock+0x7bc/0x960
   mark_held_locks+0x58/0x90
   lockdep_hardirqs_on_prepare+0x104/0x214
   trace_hardirqs_on+0x58/0x1d8
   secondary_start_kernel+0x134/0x160
   __secondary_switched+0xc0/0xc4
  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 0 at kernel/context_tracking.c:127 
ct_kernel_exit.constprop.0+0x120/0x184
  Modules linked in: brcmfmac_wcc brcmfmac brcmutil sha256 
cpufreq_powersave cpufreq_conservative cfg80211 snd_soc_tm2_wm5110 
hci_uart btqca btbcm s3fwrn5_i2c snd_soc_wm5110 bluetooth 
arizona_micsupp phy_exynos5_usbdrd s3fwrn5 s5p_mfc nci typec 
snd_soc_wm_adsp s5p_jpeg cs_dsp nfc ecdh_generic max77693_haptic 
snd_soc_arizona arizona_ldo1 ecc rfkill snd_soc_i2s snd_soc_idma 
snd_soc_max98504 snd_soc_hdmi_c

(no more messages, system frozen)

It looks that offlining CPUs 1-7 was successful (there is a prompt char 
in the second line), but then CPU7 got somehow onlined again, what 
causes this freeze.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-22 21:57               ` Marek Szyprowski
  2025-09-22 23:46                 ` John Stultz
@ 2025-09-23  7:25                 ` Peter Zijlstra
  2025-09-23  7:52                   ` Marek Szyprowski
  2025-09-23 22:02                 ` Peter Zijlstra
  2 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-23  7:25 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-tip-commits, John Stultz, x86,
	'Linux Samsung SOC'

On Mon, Sep 22, 2025 at 11:57:02PM +0200, Marek Szyprowski wrote:
> On 18.09.2025 08:56, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/urgent branch of tip:
> >
> > Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> > Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> > Author:        Peter Zijlstra <peterz@infradead.org>
> > AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
> >
> > sched/deadline: Fix dl_server getting stuck
> >
> > John found it was easy to hit lockup warnings when running locktorture
> > on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
> > ("sched/deadline: Less agressive dl_server handling").
> >
> > While debugging it seems there is a chance where we end up with the
> > dl_server dequeued, with dl_se->dl_server_active. This causes
> > dl_server_start() to return without enqueueing the dl_server, thus it
> > fails to run when RT tasks starve the cpu.
> >
> > When this happens, dl_server_timer() catches the
> > '!dl_se->server_has_tasks(dl_se)' case, which then calls
> > replenish_dl_entity() and dl_server_stopped() and finally return
> > HRTIMER_NO_RESTART.
> >
> > This ends in no new timer and also no enqueue, leaving the dl_server
> > 'dead', allowing starvation.
> >
> > What should have happened is for the bandwidth timer to start the
> > zero-laxity timer, which in turn would enqueue the dl_server and cause
> > dl_se->server_pick_task() to be called -- which will stop the
> > dl_server if no fair tasks are observed for a whole period.
> >
> > IOW, it is totally irrelevant if there are fair tasks at the moment of
> > bandwidth refresh.
> >
> > This removes all dl_se->server_has_tasks() users, so remove the whole
> > thing.
> >
> > Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> > Reported-by: John Stultz <jstultz@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Tested-by: John Stultz <jstultz@google.com>
> > ---
> 
> This patch landed in today's linux-next as commit 077e1e2e0015 
> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found 
> that it breaks CPU hotplug on some of my systems. On 64bit 
> Exynos5433-based TM2e board I've captured the following lock dep warning 
> (which unfortunately doesn't look like really related to CPU hotplug):

Absolutely wild guess; does something like this help?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..bf78c46620a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12972,6 +12972,8 @@ static void rq_offline_fair(struct rq *rq)
 
 	/* Ensure that we remove rq contribution to group share: */
 	clear_tg_offline_cfs_rqs(rq);
+
+	dl_server_stop(&rq->fair_server);
 }
 
 #ifdef CONFIG_SCHED_CORE

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-23  7:25                 ` Peter Zijlstra
@ 2025-09-23  7:52                   ` Marek Szyprowski
  0 siblings, 0 replies; 101+ messages in thread
From: Marek Szyprowski @ 2025-09-23  7:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-tip-commits, John Stultz, x86,
	'Linux Samsung SOC', Krzysztof Kozlowski

On 23.09.2025 09:25, Peter Zijlstra wrote:
> On Mon, Sep 22, 2025 at 11:57:02PM +0200, Marek Szyprowski wrote:
>> On 18.09.2025 08:56, tip-bot2 for Peter Zijlstra wrote:
>>> The following commit has been merged into the sched/urgent branch of tip:
>>>
>>> Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
>>> Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
>>> Author:        Peter Zijlstra <peterz@infradead.org>
>>> AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
>>>
>>> sched/deadline: Fix dl_server getting stuck
>>>
>>> John found it was easy to hit lockup warnings when running locktorture
>>> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
>>> ("sched/deadline: Less agressive dl_server handling").
>>>
>>> While debugging it seems there is a chance where we end up with the
>>> dl_server dequeued, with dl_se->dl_server_active. This causes
>>> dl_server_start() to return without enqueueing the dl_server, thus it
>>> fails to run when RT tasks starve the cpu.
>>>
>>> When this happens, dl_server_timer() catches the
>>> '!dl_se->server_has_tasks(dl_se)' case, which then calls
>>> replenish_dl_entity() and dl_server_stopped() and finally return
>>> HRTIMER_NO_RESTART.
>>>
>>> This ends in no new timer and also no enqueue, leaving the dl_server
>>> 'dead', allowing starvation.
>>>
>>> What should have happened is for the bandwidth timer to start the
>>> zero-laxity timer, which in turn would enqueue the dl_server and cause
>>> dl_se->server_pick_task() to be called -- which will stop the
>>> dl_server if no fair tasks are observed for a whole period.
>>>
>>> IOW, it is totally irrelevant if there are fair tasks at the moment of
>>> bandwidth refresh.
>>>
>>> This removes all dl_se->server_has_tasks() users, so remove the whole
>>> thing.
>>>
>>> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
>>> Reported-by: John Stultz <jstultz@google.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Tested-by: John Stultz <jstultz@google.com>
>>> ---
>> This patch landed in today's linux-next as commit 077e1e2e0015
>> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found
>> that it breaks CPU hotplug on some of my systems. On 64bit
>> Exynos5433-based TM2e board I've captured the following lock dep warning
>> (which unfortunately doesn't look like really related to CPU hotplug):
> Absolutely wild guess; does something like this help?
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18a30ae35441..bf78c46620a5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12972,6 +12972,8 @@ static void rq_offline_fair(struct rq *rq)
>   
>   	/* Ensure that we remove rq contribution to group share: */
>   	clear_tg_offline_cfs_rqs(rq);
> +
> +	dl_server_stop(&rq->fair_server);
>   }
>   
>   #ifdef CONFIG_SCHED_CORE

Unfortunately not, same log.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-22 21:57               ` Marek Szyprowski
  2025-09-22 23:46                 ` John Stultz
  2025-09-23  7:25                 ` Peter Zijlstra
@ 2025-09-23 22:02                 ` Peter Zijlstra
  2025-09-29 15:19                   ` Marek Szyprowski
       [not found]                   ` <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>
  2 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-09-23 22:02 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: linux-kernel, linux-tip-commits, John Stultz, x86,
	'Linux Samsung SOC', Mark Rutland

On Mon, Sep 22, 2025 at 11:57:02PM +0200, Marek Szyprowski wrote:
> On 18.09.2025 08:56, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the sched/urgent branch of tip:
> >
> > Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> > Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> > Author:        Peter Zijlstra <peterz@infradead.org>
> > AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
> >
> > sched/deadline: Fix dl_server getting stuck
> >
> > John found it was easy to hit lockup warnings when running locktorture
> > on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
> > ("sched/deadline: Less agressive dl_server handling").
> >
> > While debugging it seems there is a chance where we end up with the
> > dl_server dequeued, with dl_se->dl_server_active. This causes
> > dl_server_start() to return without enqueueing the dl_server, thus it
> > fails to run when RT tasks starve the cpu.
> >
> > When this happens, dl_server_timer() catches the
> > '!dl_se->server_has_tasks(dl_se)' case, which then calls
> > replenish_dl_entity() and dl_server_stopped() and finally return
> > HRTIMER_NO_RESTART.
> >
> > This ends in no new timer and also no enqueue, leaving the dl_server
> > 'dead', allowing starvation.
> >
> > What should have happened is for the bandwidth timer to start the
> > zero-laxity timer, which in turn would enqueue the dl_server and cause
> > dl_se->server_pick_task() to be called -- which will stop the
> > dl_server if no fair tasks are observed for a whole period.
> >
> > IOW, it is totally irrelevant if there are fair tasks at the moment of
> > bandwidth refresh.
> >
> > This removes all dl_se->server_has_tasks() users, so remove the whole
> > thing.
> >
> > Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
> > Reported-by: John Stultz <jstultz@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Tested-by: John Stultz <jstultz@google.com>
> > ---
> 
> This patch landed in today's linux-next as commit 077e1e2e0015 
> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found 
> that it breaks CPU hotplug on some of my systems. On 64bit 
> Exynos5433-based TM2e board I've captured the following lock dep warning 
> (which unfortunately doesn't look like really related to CPU hotplug):

Right -- I've looked at this patch a few times over the day, and the
only thing I can think of is that we keep the dl_server timer running.
But I already gave you a patch that *should've* stopped it.

There were a few issues with it -- notably if you've booted with
something like isolcpus / nohz_full it might not have worked because the
site I put the dl_server_stop() would only get ran if there was a root
domain attached to the CPU.

Put it in a different spot, just to make sure.

There is also the fact that dl_server_stop() uses
hrtimer_try_to_cancel(), which can 'fail' when the timer is actively
running. But if that is the case, it must be spin-waiting on rq->lock
-- since the caller of dl_server_stop() will be holding that. Once
dl_server_stop() completes and the rq->lock is released, the timer will
see !dl_se->dl_throttled and immediately stop without restarting.

So that *should* not be a problem.

Anyway, clutching at staws here etc.

> # for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
> Detected VIPT I-cache on CPU7
> CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
> ------------[ cut here ]------------
> WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329 
> rcutree_report_cpu_starting+0x1e8/0x348

This is really weird; this does indeed look like CPU7 decides to boot
again. AFAICT it is not hotplug failing and bringing the CPU back again,
but it is really starting again.

I'm not well versed enough in ARM64 foo to know what would cause a CPU
to boot -- but on x86_64 this isn't something that would easily happen
by accident.

Not stopping a timer would certainly not be sufficient -- notably
hrtimers_cpu_dying() would have migrated the thing.

> (system is frozen at this point).

The whole lockdep and freezing thing is typically printk choking on
itself.

My personal way around this are these here patches:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git debug/experimental

They don't apply cleanly anymore, but the conflict isn't hard, so I've
not taken the bother to rebase them yet. It relies on the platform
having earlyprintk configured, then add force_early_printk to your
kernel cmdline to have earlyprintk completely take over.

Typical early serial drivers are lock-free and don't suffer from
lockups.

If you get it to work, you might get more data out of it.


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3eb6faa91d06..c0b1dc360e68 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8363,6 +8363,7 @@ static inline void sched_set_rq_offline(struct rq *rq, int cpu)
 		BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
 		set_rq_offline(rq);
 	}
+	dl_server_stop(&rq->fair_server);
 	rq_unlock_irqrestore(rq, &rf);
 }
 

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-23 22:02                 ` Peter Zijlstra
@ 2025-09-29 15:19                   ` Marek Szyprowski
       [not found]                   ` <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>
  1 sibling, 0 replies; 101+ messages in thread
From: Marek Szyprowski @ 2025-09-29 15:19 UTC (permalink / raw)
  To: Peter Zijlstra, Krzysztof Kozlowski
  Cc: linux-kernel, linux-tip-commits, John Stultz, x86,
	'Linux Samsung SOC', Mark Rutland

On 24.09.2025 00:02, Peter Zijlstra wrote:
> On Mon, Sep 22, 2025 at 11:57:02PM +0200, Marek Szyprowski wrote:
>> On 18.09.2025 08:56, tip-bot2 for Peter Zijlstra wrote:
>>> The following commit has been merged into the sched/urgent branch of tip:
>>>
>>> Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
>>> Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
>>> Author:        Peter Zijlstra <peterz@infradead.org>
>>> AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
>>> Committer:     Peter Zijlstra <peterz@infradead.org>
>>> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
>>>
>>> sched/deadline: Fix dl_server getting stuck
>>>
>>> John found it was easy to hit lockup warnings when running locktorture
>>> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
>>> ("sched/deadline: Less agressive dl_server handling").
>>>
>>> While debugging it seems there is a chance where we end up with the
>>> dl_server dequeued, with dl_se->dl_server_active. This causes
>>> dl_server_start() to return without enqueueing the dl_server, thus it
>>> fails to run when RT tasks starve the cpu.
>>>
>>> When this happens, dl_server_timer() catches the
>>> '!dl_se->server_has_tasks(dl_se)' case, which then calls
>>> replenish_dl_entity() and dl_server_stopped() and finally return
>>> HRTIMER_NO_RESTART.
>>>
>>> This ends in no new timer and also no enqueue, leaving the dl_server
>>> 'dead', allowing starvation.
>>>
>>> What should have happened is for the bandwidth timer to start the
>>> zero-laxity timer, which in turn would enqueue the dl_server and cause
>>> dl_se->server_pick_task() to be called -- which will stop the
>>> dl_server if no fair tasks are observed for a whole period.
>>>
>>> IOW, it is totally irrelevant if there are fair tasks at the moment of
>>> bandwidth refresh.
>>>
>>> This removes all dl_se->server_has_tasks() users, so remove the whole
>>> thing.
>>>
>>> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
>>> Reported-by: John Stultz <jstultz@google.com>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Tested-by: John Stultz <jstultz@google.com>
>>> ---
>> This patch landed in today's linux-next as commit 077e1e2e0015
>> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found
>> that it breaks CPU hotplug on some of my systems. On 64bit
>> Exynos5433-based TM2e board I've captured the following lock dep warning
>> (which unfortunately doesn't look like really related to CPU hotplug):
> Right -- I've looked at this patch a few times over the day, and the
> only thing I can think of is that we keep the dl_server timer running.
> But I already gave you a patch that *should've* stopped it.
>
> There were a few issues with it -- notably if you've booted with
> something like isolcpus / nohz_full it might not have worked because the
> site I put the dl_server_stop() would only get ran if there was a root
> domain attached to the CPU.
>
> Put it in a different spot, just to make sure.
>
> There is also the fact that dl_server_stop() uses
> hrtimer_try_to_cancel(), which can 'fail' when the timer is actively
> running. But if that is the case, it must be spin-waiting on rq->lock
> -- since the caller of dl_server_stop() will be holding that. Once
> dl_server_stop() completes and the rq->lock is released, the timer will
> see !dl_se->dl_throttled and immediately stop without restarting.
>
> So that *should* not be a problem.
>
> Anyway, clutching at staws here etc.
>
>> # for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
>> Detected VIPT I-cache on CPU7
>> CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
>> ------------[ cut here ]------------
>> WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329
>> rcutree_report_cpu_starting+0x1e8/0x348
> This is really weird; this does indeed look like CPU7 decides to boot
> again. AFAICT it is not hotplug failing and bringing the CPU back again,
> but it is really starting again.
>
> I'm not well versed enough in ARM64 foo to know what would cause a CPU
> to boot -- but on x86_64 this isn't something that would easily happen
> by accident.
>
> Not stopping a timer would certainly not be sufficient -- notably
> hrtimers_cpu_dying() would have migrated the thing.
>
>> (system is frozen at this point).
> The whole lockdep and freezing thing is typically printk choking on
> itself.
>
> My personal way around this are these here patches:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git debug/experimental
>
> They don't apply cleanly anymore, but the conflict isn't hard, so I've
> not taken the bother to rebase them yet. It relies on the platform
> having earlyprintk configured, then add force_early_printk to your
> kernel cmdline to have earlyprintk completely take over.
>
> Typical early serial drivers are lock-free and don't suffer from
> lockups.
>
> If you get it to work, you might get more data out of it.

Thanks for some hints, but unfortunately ARM64 doesn't support 
earlyprintk, so I was not able to use this method.

However I've played a bit with this code and found that this strange 
wake-up of the CPU7 seems to be caused by the timer. If I restore

   if (!dl_se->server_has_tasks(dl_se))
           return HRTIMER_NORESTART;

part in the dl_server_timer, the everything works again as before this 
patch.

This issue is however not Exynos5433 ARM 64bit specific. Similar lockup 
happens on Exynos5422 ARM 32bit boards, although there is no message in 
that case. Does it mean that handling of the hrtimers on Exynos boards 
is a bit broken in the context of CPU hotplug? I've never analyzed that 
part of Exynos SoC support. Krzysztof, any chance You remember how it works?

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 101+ messages in thread

[parent not found: <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>]

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
       [not found]                   ` <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>
@ 2025-10-09  8:35                     ` Krzysztof Kozlowski
  2025-10-09  9:26                     ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Krzysztof Kozlowski @ 2025-10-09  8:35 UTC (permalink / raw)
  To: Marek Szyprowski, Peter Zijlstra
  Cc: linux-kernel, linux-tip-commits, John Stultz, x86,
	'Linux Samsung SOC', Mark Rutland

On 30/09/2025 00:19, Marek Szyprowski wrote:
> On 24.09.2025 00:02, Peter Zijlstra wrote:
>> On Mon, Sep 22, 2025 at 11:57:02PM +0200, Marek Szyprowski wrote:
>>> On 18.09.2025 08:56, tip-bot2 for Peter Zijlstra wrote:
>>>> The following commit has been merged into the sched/urgent branch of tip:
>>>>
>>>> Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
>>>> Gitweb:https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
>>>> Author:        Peter Zijlstra<peterz@infradead.org>
>>>> AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
>>>> Committer:     Peter Zijlstra<peterz@infradead.org>
>>>> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
>>>>
>>>> sched/deadline: Fix dl_server getting stuck
>>>>
>>>> John found it was easy to hit lockup warnings when running locktorture
>>>> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
>>>> ("sched/deadline: Less agressive dl_server handling").
>>>>
>>>> While debugging it seems there is a chance where we end up with the
>>>> dl_server dequeued, with dl_se->dl_server_active. This causes
>>>> dl_server_start() to return without enqueueing the dl_server, thus it
>>>> fails to run when RT tasks starve the cpu.
>>>>
>>>> When this happens, dl_server_timer() catches the
>>>> '!dl_se->server_has_tasks(dl_se)' case, which then calls
>>>> replenish_dl_entity() and dl_server_stopped() and finally return
>>>> HRTIMER_NO_RESTART.
>>>>
>>>> This ends in no new timer and also no enqueue, leaving the dl_server
>>>> 'dead', allowing starvation.
>>>>
>>>> What should have happened is for the bandwidth timer to start the
>>>> zero-laxity timer, which in turn would enqueue the dl_server and cause
>>>> dl_se->server_pick_task() to be called -- which will stop the
>>>> dl_server if no fair tasks are observed for a whole period.
>>>>
>>>> IOW, it is totally irrelevant if there are fair tasks at the moment of
>>>> bandwidth refresh.
>>>>
>>>> This removes all dl_se->server_has_tasks() users, so remove the whole
>>>> thing.
>>>>
>>>> Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
>>>> Reported-by: John Stultz<jstultz@google.com>
>>>> Signed-off-by: Peter Zijlstra (Intel)<peterz@infradead.org>
>>>> Tested-by: John Stultz<jstultz@google.com>
>>>> ---
>>> This patch landed in today's linux-next as commit 077e1e2e0015
>>> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found
>>> that it breaks CPU hotplug on some of my systems. On 64bit
>>> Exynos5433-based TM2e board I've captured the following lock dep warning
>>> (which unfortunately doesn't look like really related to CPU hotplug):
>> Right -- I've looked at this patch a few times over the day, and the
>> only thing I can think of is that we keep the dl_server timer running.
>> But I already gave you a patch that *should've* stopped it.
>>
>> There were a few issues with it -- notably if you've booted with
>> something like isolcpus / nohz_full it might not have worked because the
>> site I put the dl_server_stop() would only get ran if there was a root
>> domain attached to the CPU.
>>
>> Put it in a different spot, just to make sure.
>>
>> There is also the fact that dl_server_stop() uses
>> hrtimer_try_to_cancel(), which can 'fail' when the timer is actively
>> running. But if that is the case, it must be spin-waiting on rq->lock
>> -- since the caller of dl_server_stop() will be holding that. Once
>> dl_server_stop() completes and the rq->lock is released, the timer will
>> see !dl_se->dl_throttled and immediately stop without restarting.
>>
>> So that *should* not be a problem.
>>
>> Anyway, clutching at staws here etc.
>>
>>> # for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
>>> Detected VIPT I-cache on CPU7
>>> CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329
>>> rcutree_report_cpu_starting+0x1e8/0x348
>> This is really weird; this does indeed look like CPU7 decides to boot
>> again. AFAICT it is not hotplug failing and bringing the CPU back again,
>> but it is really starting again.
>>
>> I'm not well versed enough in ARM64 foo to know what would cause a CPU
>> to boot -- but on x86_64 this isn't something that would easily happen
>> by accident.
>>
>> Not stopping a timer would certainly not be sufficient -- notably
>> hrtimers_cpu_dying() would have migrated the thing.
>>
>>> (system is frozen at this point).
>> The whole lockdep and freezing thing is typically printk choking on
>> itself.
>>
>> My personal way around this are these here patches:
>>
>>    git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git debug/experimental
>>
>> They don't apply cleanly anymore, but the conflict isn't hard, so I've
>> not taken the bother to rebase them yet. It relies on the platform
>> having earlyprintk configured, then add force_early_printk to your
>> kernel cmdline to have earlyprintk completely take over.
>>
>> Typical early serial drivers are lock-free and don't suffer from
>> lockups.
>>
>> If you get it to work, you might get more data out of it.
> 
> Thanks for some hints, but unfortunately ARM64 doesn't support 
> earlyprintk, so I was not able to use this method.
> 
> However I've played a bit with this code and found that this strange 
> wake-up of the CPU7 seems to be caused by the timer. If I restore
> 
>    if (!dl_se->server_has_tasks(dl_se))
>            return HRTIMER_NORESTART;
> 
> part in the dl_server_timer, the everything works again as before this 
> patch.
> 
> This issue is however not Exynos5433 ARM 64bit specific. Similar lockup 
> happens on Exynos5422 ARM 32bit boards, although there is no message in 
> that case. Does it mean that handling of the hrtimers on Exynos boards 
> is a bit broken in the context of CPU hotplug? I've never analyzed that 
> part of Exynos SoC support. Krzysztof, any chance You remember how it works?


I don't recall anything around this, but I also don't remember that much
of details anymore.

I believe long time ago - around 2014-2015 - were testing Exynos MCT
against hotplug as well, so I think it was working. Many things could
happen in between...

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
       [not found]                   ` <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>
  2025-10-09  8:35                     ` Krzysztof Kozlowski
@ 2025-10-09  9:26                     ` Peter Zijlstra
  2025-10-09 11:42                       ` Marek Szyprowski
  1 sibling, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-10-09  9:26 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Krzysztof Kozlowski, linux-kernel, linux-tip-commits, John Stultz,
	x86, 'Linux Samsung SOC', Mark Rutland

On Mon, Sep 29, 2025 at 05:19:27PM +0200, Marek Szyprowski wrote:

> Thanks for some hints, but unfortunately ARM64 doesn't support 
> earlyprintk, so I was not able to use this method.

I've been told it is called earlycon= on arm; still sets up
early_console and make early_printk() work.

My patches just re-route everything printk() into early_printk().

> However I've played a bit with this code and found that this strange 
> wake-up of the CPU7 seems to be caused by the timer. If I restore
> 
>    if (!dl_se->server_has_tasks(dl_se))
>            return HRTIMER_NORESTART;
> 
> part in the dl_server_timer, the everything works again as before this 
> patch.

Right, so something on your platform is not working right. Stray timers
should not wake up CPUs.

> This issue is however not Exynos5433 ARM 64bit specific. Similar lockup 
> happens on Exynos5422 ARM 32bit boards, although there is no message in 
> that case. Does it mean that handling of the hrtimers on Exynos boards 
> is a bit broken in the context of CPU hotplug? 

There is this thread:

  https://lkml.kernel.org/r/20251009080007.GH3245006@noisy.programming.kicks-ass.net

I suspect this might be the same issue and we've made a little progress
there.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-10-09  9:26                     ` Peter Zijlstra
@ 2025-10-09 11:42                       ` Marek Szyprowski
  0 siblings, 0 replies; 101+ messages in thread
From: Marek Szyprowski @ 2025-10-09 11:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Krzysztof Kozlowski, linux-kernel, linux-tip-commits, John Stultz,
	x86, 'Linux Samsung SOC', Mark Rutland,
	Krzysztof Kozlowski

On 09.10.2025 11:26, Peter Zijlstra wrote:
> On Mon, Sep 29, 2025 at 05:19:27PM +0200, Marek Szyprowski wrote:
>
>> Thanks for some hints, but unfortunately ARM64 doesn't support
>> earlyprintk, so I was not able to use this method.
> I've been told it is called earlycon= on arm; still sets up
> early_console and make early_printk() work.

ARM and ARM64 are almost completely separate architectures. 
CONFIG_EARLY_PRINTK is not available on ARM64.


> My patches just re-route everything printk() into early_printk().

>> However I've played a bit with this code and found that this strange
>> wake-up of the CPU7 seems to be caused by the timer. If I restore
>>
>>     if (!dl_se->server_has_tasks(dl_se))
>>             return HRTIMER_NORESTART;
>>
>> part in the dl_server_timer, the everything works again as before this
>> patch.
> Right, so something on your platform is not working right. Stray timers
> should not wake up CPUs.
>
>> This issue is however not Exynos5433 ARM 64bit specific. Similar lockup
>> happens on Exynos5422 ARM 32bit boards, although there is no message in
>> that case. Does it mean that handling of the hrtimers on Exynos boards
>> is a bit broken in the context of CPU hotplug?
> There is this thread:
>
>    https://lkml.kernel.org/r/20251009080007.GH3245006@noisy.programming.kicks-ass.net
>
> I suspect this might be the same issue and we've made a little progress
> there.

Indeed it looks like a similar issue and the proposed patch fixes the 
problem on my Exynos based boards! Thanks! I will post my comment and 
Tested-by tag there.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
  2025-09-16 11:01           ` Peter Zijlstra
                               ` (2 preceding siblings ...)
  2025-09-18  6:56             ` [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck tip-bot2 for Peter Zijlstra
@ 2025-09-25  7:55             ` tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 101+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-09-25  7:55 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: John Stultz, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c
Gitweb:        https://git.kernel.org/tip/4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 25 Sep 2025 09:51:50 +02:00

sched/deadline: Fix dl_server getting stuck

John found it was easy to hit lockup warnings when running locktorture
on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
("sched/deadline: Less agressive dl_server handling").

While debugging it seems there is a chance where we end up with the
dl_server dequeued, with dl_se->dl_server_active. This causes
dl_server_start() to return without enqueueing the dl_server, thus it
fails to run when RT tasks starve the cpu.

When this happens, dl_server_timer() catches the
'!dl_se->server_has_tasks(dl_se)' case, which then calls
replenish_dl_entity() and dl_server_stopped() and finally return
HRTIMER_NO_RESTART.

This ends in no new timer and also no enqueue, leaving the dl_server
'dead', allowing starvation.

What should have happened is for the bandwidth timer to start the
zero-laxity timer, which in turn would enqueue the dl_server and cause
dl_se->server_pick_task() to be called -- which will stop the
dl_server if no fair tasks are observed for a whole period.

IOW, it is totally irrelevant if there are fair tasks at the moment of
bandwidth refresh.

This removes all dl_se->server_has_tasks() users, so remove the whole
thing.

Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: John Stultz <jstultz@google.com>
---
 include/linux/sched.h   |  1 -
 kernel/sched/deadline.c | 12 +-----------
 kernel/sched/fair.c     |  7 +------
 kernel/sched/sched.h    |  4 ----
 4 files changed, 2 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8188b8..f89313b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -733,7 +733,6 @@ struct sched_dl_entity {
 	 * runnable task.
 	 */
 	struct rq			*rq;
-	dl_server_has_tasks_f		server_has_tasks;
 	dl_server_pick_f		server_pick_task;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f253012..5a5080b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -875,7 +875,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	 */
 	if (dl_se->dl_defer && !dl_se->dl_defer_running &&
 	    dl_time_before(rq_clock(dl_se->rq), dl_se->deadline - dl_se->runtime)) {
-		if (!is_dl_boosted(dl_se) && dl_se->server_has_tasks(dl_se)) {
+		if (!is_dl_boosted(dl_se)) {
 
 			/*
 			 * Set dl_se->dl_defer_armed and dl_throttled variables to
@@ -1152,8 +1152,6 @@ static void __push_dl_task(struct rq *rq, struct rq_flags *rf)
 /* a defer timer will not be reset if the runtime consumed was < dl_server_min_res */
 static const u64 dl_server_min_res = 1 * NSEC_PER_MSEC;
 
-static bool dl_server_stopped(struct sched_dl_entity *dl_se);
-
 static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_dl_entity *dl_se)
 {
 	struct rq *rq = rq_of_dl_se(dl_se);
@@ -1171,12 +1169,6 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 		if (!dl_se->dl_runtime)
 			return HRTIMER_NORESTART;
 
-		if (!dl_se->server_has_tasks(dl_se)) {
-			replenish_dl_entity(dl_se);
-			dl_server_stopped(dl_se);
-			return HRTIMER_NORESTART;
-		}
-
 		if (dl_se->dl_defer_armed) {
 			/*
 			 * First check if the server could consume runtime in background.
@@ -1625,11 +1617,9 @@ static bool dl_server_stopped(struct sched_dl_entity *dl_se)
 }
 
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
-		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
 {
 	dl_se->rq = rq;
-	dl_se->server_has_tasks = has_tasks;
 	dl_se->server_pick_task = pick_task;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a05..8ce56a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8859,11 +8859,6 @@ static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_stru
 	return pick_next_task_fair(rq, prev, NULL);
 }
 
-static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
-{
-	return !!dl_se->rq->cfs.nr_queued;
-}
-
 static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se)
 {
 	return pick_task_fair(dl_se->rq);
@@ -8875,7 +8870,7 @@ void fair_server_init(struct rq *rq)
 
 	init_dl_entity(dl_se);
 
-	dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task);
+	dl_server_init(dl_se, rq, fair_server_pick_task);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d..f10d627 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -365,9 +365,6 @@ extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s6
  *
  *   dl_se::rq -- runqueue we belong to.
  *
- *   dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
- *                                server when it runs out of tasks to run.
- *
  *   dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
  *                           returns NULL.
  *
@@ -383,7 +380,6 @@ extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
 extern void dl_server_start(struct sched_dl_entity *dl_se);
 extern void dl_server_stop(struct sched_dl_entity *dl_se);
 extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
-		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
 

^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq()
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-10 16:47   ` Vincent Guittot
  2025-07-14 22:59   ` Mel Gorman
  2025-07-02 11:49 ` [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable() Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Optimize ttwu() by pushing select_idle_siblings() up above waiting for
on_cpu(). This allows making use of the cycles otherwise spend waiting
to search for an idle CPU.

One little detail is that since the task we're looking for an idle CPU
for might still be on the CPU, that CPU won't report as running the
idle task, and thus won't find his own CPU idle, even when it is.

To compensate, remove the 'rq->curr == rq->idle' condition from
idle_cpu() -- it doesn't really make sense anyway.

Additionally, Chris found (concurrently) that perf-c2c reported that
test as being a cache-miss monster.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250520101727.620602459@infradead.org
---
 kernel/sched/core.c     |    5 +++--
 kernel/sched/syscalls.c |    3 ---
 2 files changed, 3 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3593,7 +3593,7 @@ int select_task_rq(struct task_struct *p
 		cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
 		*wake_flags |= WF_RQ_SELECTED;
 	} else {
-		cpu = cpumask_any(p->cpus_ptr);
+		cpu = task_cpu(p);
 	}
 
 	/*
@@ -4309,6 +4309,8 @@ int try_to_wake_up(struct task_struct *p
 		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
 			break;
 
+		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+
 		/*
 		 * If the owning (remote) CPU is still in the middle of schedule() with
 		 * this task as prev, wait until it's done referencing the task.
@@ -4320,7 +4322,6 @@ int try_to_wake_up(struct task_struct *p
 		 */
 		smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
 		if (task_cpu(p) != cpu) {
 			if (p->in_iowait) {
 				delayacct_blkio_end(p);
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -203,9 +203,6 @@ int idle_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
-	if (rq->curr != rq->idle)
-		return 0;
-
 	if (rq->nr_running)
 		return 0;
 



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq()
  2025-07-02 11:49 ` [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
@ 2025-07-10 16:47   ` Vincent Guittot
  2025-07-14 22:59   ` Mel Gorman
  1 sibling, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2025-07-10 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Wed, 2 Jul 2025 at 14:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Optimize ttwu() by pushing select_idle_siblings() up above waiting for
> on_cpu(). This allows making use of the cycles otherwise spend waiting
> to search for an idle CPU.
>
> One little detail is that since the task we're looking for an idle CPU
> for might still be on the CPU, that CPU won't report as running the
> idle task, and thus won't find his own CPU idle, even when it is.
>
> To compensate, remove the 'rq->curr == rq->idle' condition from
> idle_cpu() -- it doesn't really make sense anyway.
>
> Additionally, Chris found (concurrently) that perf-c2c reported that
> test as being a cache-miss monster.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.620602459@infradead.org

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  kernel/sched/core.c     |    5 +++--
>  kernel/sched/syscalls.c |    3 ---
>  2 files changed, 3 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3593,7 +3593,7 @@ int select_task_rq(struct task_struct *p
>                 cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
>                 *wake_flags |= WF_RQ_SELECTED;
>         } else {
> -               cpu = cpumask_any(p->cpus_ptr);
> +               cpu = task_cpu(p);
>         }
>
>         /*
> @@ -4309,6 +4309,8 @@ int try_to_wake_up(struct task_struct *p
>                     ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
>                         break;
>
> +               cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> +
>                 /*
>                  * If the owning (remote) CPU is still in the middle of schedule() with
>                  * this task as prev, wait until it's done referencing the task.
> @@ -4320,7 +4322,6 @@ int try_to_wake_up(struct task_struct *p
>                  */
>                 smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> -               cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>                 if (task_cpu(p) != cpu) {
>                         if (p->in_iowait) {
>                                 delayacct_blkio_end(p);
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -203,9 +203,6 @@ int idle_cpu(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
>
> -       if (rq->curr != rq->idle)
> -               return 0;
> -
>         if (rq->nr_running)
>                 return 0;
>
>
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq()
  2025-07-02 11:49 ` [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
  2025-07-10 16:47   ` Vincent Guittot
@ 2025-07-14 22:59   ` Mel Gorman
  1 sibling, 0 replies; 101+ messages in thread
From: Mel Gorman @ 2025-07-14 22:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Wed, Jul 02, 2025 at 01:49:27PM +0200, Peter Zijlstra wrote:
> Optimize ttwu() by pushing select_idle_siblings() up above waiting for
> on_cpu(). This allows making use of the cycles otherwise spend waiting
> to search for an idle CPU.
> 
> One little detail is that since the task we're looking for an idle CPU
> for might still be on the CPU, that CPU won't report as running the
> idle task, and thus won't find his own CPU idle, even when it is.
> 
> To compensate, remove the 'rq->curr == rq->idle' condition from
> idle_cpu() -- it doesn't really make sense anyway.
> 
> Additionally, Chris found (concurrently) that perf-c2c reported that
> test as being a cache-miss monster.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.620602459@infradead.org

*facepalm*

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable()
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (2 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-10 16:48   ` Vincent Guittot
  2025-07-14 23:00   ` Mel Gorman
  2025-07-02 11:49 ` [PATCH v2 05/12] sched: Add ttwu_queue controls Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Reflow and get rid of 'ret' variable.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250520101727.732703833@infradead.org
---
 kernel/sched/core.c  |   36 ++++++++++++++++--------------------
 kernel/sched/sched.h |    5 +++++
 2 files changed, 21 insertions(+), 20 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3754,28 +3754,24 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
+	CLASS(__task_rq_lock, guard)(p);
+	struct rq *rq = guard.rq;
 
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			wakeup_preempt(rq, p, wake_flags);
-		}
-		ttwu_do_wakeup(p);
-		ret = 1;
-	}
-	__task_rq_unlock(rq, &rf);
+	if (!task_on_rq_queued(p))
+		return 0;
 
-	return ret;
+	update_rq_clock(rq);
+	if (p->se.sched_delayed)
+		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
+	}
+	ttwu_do_wakeup(p);
+	return 1;
 }
 
 void sched_ttwu_pending(void *arg)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1806,6 +1806,11 @@ task_rq_unlock(struct rq *rq, struct tas
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }
 
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+		    __task_rq_unlock(_T->rq, &_T->rf),
+		    struct rq *rq; struct rq_flags rf)
+
 DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
 		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
 		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable()
  2025-07-02 11:49 ` [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable() Peter Zijlstra
@ 2025-07-10 16:48   ` Vincent Guittot
  2025-07-14 23:00   ` Mel Gorman
  1 sibling, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2025-07-10 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Wed, 2 Jul 2025 at 14:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Reflow and get rid of 'ret' variable.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.732703833@infradead.org

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  kernel/sched/core.c  |   36 ++++++++++++++++--------------------
>  kernel/sched/sched.h |    5 +++++
>  2 files changed, 21 insertions(+), 20 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3754,28 +3754,24 @@ ttwu_do_activate(struct rq *rq, struct t
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -       struct rq_flags rf;
> -       struct rq *rq;
> -       int ret = 0;
> +       CLASS(__task_rq_lock, guard)(p);
> +       struct rq *rq = guard.rq;
>
> -       rq = __task_rq_lock(p, &rf);
> -       if (task_on_rq_queued(p)) {
> -               update_rq_clock(rq);
> -               if (p->se.sched_delayed)
> -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -               if (!task_on_cpu(rq, p)) {
> -                       /*
> -                        * When on_rq && !on_cpu the task is preempted, see if
> -                        * it should preempt the task that is current now.
> -                        */
> -                       wakeup_preempt(rq, p, wake_flags);
> -               }
> -               ttwu_do_wakeup(p);
> -               ret = 1;
> -       }
> -       __task_rq_unlock(rq, &rf);
> +       if (!task_on_rq_queued(p))
> +               return 0;
>
> -       return ret;
> +       update_rq_clock(rq);
> +       if (p->se.sched_delayed)
> +               enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +       if (!task_on_cpu(rq, p)) {
> +               /*
> +                * When on_rq && !on_cpu the task is preempted, see if
> +                * it should preempt the task that is current now.
> +                */
> +               wakeup_preempt(rq, p, wake_flags);
> +       }
> +       ttwu_do_wakeup(p);
> +       return 1;
>  }
>
>  void sched_ttwu_pending(void *arg)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1806,6 +1806,11 @@ task_rq_unlock(struct rq *rq, struct tas
>         raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +                   _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +                   __task_rq_unlock(_T->rq, &_T->rf),
> +                   struct rq *rq; struct rq_flags rf)
> +
>  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
>                     _T->rq = task_rq_lock(_T->lock, &_T->rf),
>                     task_rq_unlock(_T->rq, _T->lock, &_T->rf),
>
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable()
  2025-07-02 11:49 ` [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable() Peter Zijlstra
  2025-07-10 16:48   ` Vincent Guittot
@ 2025-07-14 23:00   ` Mel Gorman
  1 sibling, 0 replies; 101+ messages in thread
From: Mel Gorman @ 2025-07-14 23:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Wed, Jul 02, 2025 at 01:49:28PM +0200, Peter Zijlstra wrote:
> Reflow and get rid of 'ret' variable.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.732703833@infradead.org

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 05/12] sched: Add ttwu_queue controls
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (3 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable() Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-10 16:51   ` Vincent Guittot
  2025-07-14 23:14   ` Mel Gorman
  2025-07-02 11:49 ` [PATCH v2 06/12] sched: Introduce ttwu_do_migrate() Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

There are two (soon three) callers of ttwu_queue_wakelist(),
distinguish them with their own WF_ and add some knobs on.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250520101727.874587738@infradead.org
---
 kernel/sched/core.c     |   22 ++++++++++++----------
 kernel/sched/features.h |    2 ++
 kernel/sched/sched.h    |    2 ++
 3 files changed, 16 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3888,7 +3888,7 @@ bool cpus_share_resources(int this_cpu,
 	return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
 }
 
-static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
+static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, bool def)
 {
 	/* See SCX_OPS_ALLOW_QUEUED_WAKEUP. */
 	if (!scx_allow_ttwu_queue(p))
@@ -3929,18 +3929,19 @@ static inline bool ttwu_queue_cond(struc
 	if (!cpu_rq(cpu)->nr_running)
 		return true;
 
-	return false;
+	return def;
 }
 
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu)) {
-		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-		__ttwu_queue_wakelist(p, cpu, wake_flags);
-		return true;
-	}
+	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
+
+	if (!ttwu_queue_cond(p, cpu, def))
+		return false;
 
-	return false;
+	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+	__ttwu_queue_wakelist(p, cpu, wake_flags);
+	return true;
 }
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -3948,7 +3949,7 @@ static void ttwu_queue(struct task_struc
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
-	if (ttwu_queue_wakelist(p, cpu, wake_flags))
+	if (sched_feat(TTWU_QUEUE) && ttwu_queue_wakelist(p, cpu, wake_flags))
 		return;
 
 	rq_lock(rq, &rf);
@@ -4251,7 +4252,8 @@ int try_to_wake_up(struct task_struct *p
 		 * scheduling.
 		 */
 		if (smp_load_acquire(&p->on_cpu) &&
-		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
+		    sched_feat(TTWU_QUEUE_ON_CPU) &&
+		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
 			break;
 
 		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,6 +81,8 @@ SCHED_FEAT(TTWU_QUEUE, false)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 #endif
+SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
+SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2279,6 +2279,8 @@ static inline int task_on_rq_migrating(s
 #define WF_CURRENT_CPU		0x40 /* Prefer to move the wakee to the current CPU. */
 #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
 
+#define WF_ON_CPU		0x0100
+
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
 static_assert(WF_FORK == SD_BALANCE_FORK);
 static_assert(WF_TTWU == SD_BALANCE_WAKE);



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 05/12] sched: Add ttwu_queue controls
  2025-07-02 11:49 ` [PATCH v2 05/12] sched: Add ttwu_queue controls Peter Zijlstra
@ 2025-07-10 16:51   ` Vincent Guittot
  2025-07-14 23:14   ` Mel Gorman
  1 sibling, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2025-07-10 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Wed, 2 Jul 2025 at 14:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> There are two (soon three) callers of ttwu_queue_wakelist(),
> distinguish them with their own WF_ and add some knobs on.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.874587738@infradead.org
> ---
>  kernel/sched/core.c     |   22 ++++++++++++----------
>  kernel/sched/features.h |    2 ++
>  kernel/sched/sched.h    |    2 ++
>  3 files changed, 16 insertions(+), 10 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3888,7 +3888,7 @@ bool cpus_share_resources(int this_cpu,
>         return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
>  }
>
> -static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
> +static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, bool def)
>  {
>         /* See SCX_OPS_ALLOW_QUEUED_WAKEUP. */
>         if (!scx_allow_ttwu_queue(p))
> @@ -3929,18 +3929,19 @@ static inline bool ttwu_queue_cond(struc
>         if (!cpu_rq(cpu)->nr_running)
>                 return true;
>
> -       return false;
> +       return def;
>  }
>
>  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>  {
> -       if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu)) {
> -               sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> -               __ttwu_queue_wakelist(p, cpu, wake_flags);
> -               return true;
> -       }
> +       bool def = sched_feat(TTWU_QUEUE_DEFAULT);

I'm always confused by this sched feature name because
sched_feat(TTWU_QUEUE_DEFAULT) must be false in order to have the
current (default ?) behaviour
Or you mean queue by default in the wakelist which is disable to keep
current behavior

> +
> +       if (!ttwu_queue_cond(p, cpu, def))
> +               return false;
>
> -       return false;
> +       sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> +       __ttwu_queue_wakelist(p, cpu, wake_flags);
> +       return true;
>  }
>
>  static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> @@ -3948,7 +3949,7 @@ static void ttwu_queue(struct task_struc
>         struct rq *rq = cpu_rq(cpu);
>         struct rq_flags rf;
>
> -       if (ttwu_queue_wakelist(p, cpu, wake_flags))
> +       if (sched_feat(TTWU_QUEUE) && ttwu_queue_wakelist(p, cpu, wake_flags))
>                 return;
>
>         rq_lock(rq, &rf);
> @@ -4251,7 +4252,8 @@ int try_to_wake_up(struct task_struct *p
>                  * scheduling.
>                  */
>                 if (smp_load_acquire(&p->on_cpu) &&
> -                   ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
> +                   sched_feat(TTWU_QUEUE_ON_CPU) &&
> +                   ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
>                         break;
>
>                 cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -81,6 +81,8 @@ SCHED_FEAT(TTWU_QUEUE, false)
>   */
>  SCHED_FEAT(TTWU_QUEUE, true)
>  #endif
> +SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> +SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
>
>  /*
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2279,6 +2279,8 @@ static inline int task_on_rq_migrating(s
>  #define WF_CURRENT_CPU         0x40 /* Prefer to move the wakee to the current CPU. */
>  #define WF_RQ_SELECTED         0x80 /* ->select_task_rq() was called */
>
> +#define WF_ON_CPU              0x0100
> +
>  static_assert(WF_EXEC == SD_BALANCE_EXEC);
>  static_assert(WF_FORK == SD_BALANCE_FORK);
>  static_assert(WF_TTWU == SD_BALANCE_WAKE);
>
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 05/12] sched: Add ttwu_queue controls
  2025-07-02 11:49 ` [PATCH v2 05/12] sched: Add ttwu_queue controls Peter Zijlstra
  2025-07-10 16:51   ` Vincent Guittot
@ 2025-07-14 23:14   ` Mel Gorman
  1 sibling, 0 replies; 101+ messages in thread
From: Mel Gorman @ 2025-07-14 23:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Wed, Jul 02, 2025 at 01:49:29PM +0200, Peter Zijlstra wrote:
> There are two (soon three) callers of ttwu_queue_wakelist(),
> distinguish them with their own WF_ and add some knobs on.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.874587738@infradead.org
> ---
>  kernel/sched/core.c     |   22 ++++++++++++----------
>  kernel/sched/features.h |    2 ++
>  kernel/sched/sched.h    |    2 ++
>  3 files changed, 16 insertions(+), 10 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3888,7 +3888,7 @@ bool cpus_share_resources(int this_cpu,
>  	return per_cpu(sd_share_id, this_cpu) == per_cpu(sd_share_id, that_cpu);
>  }
>  
> -static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
> +static inline bool ttwu_queue_cond(struct task_struct *p, int cpu, bool def)
>  {
>  	/* See SCX_OPS_ALLOW_QUEUED_WAKEUP. */
>  	if (!scx_allow_ttwu_queue(p))
> @@ -3929,18 +3929,19 @@ static inline bool ttwu_queue_cond(struc
>  	if (!cpu_rq(cpu)->nr_running)
>  		return true;
>  
> -	return false;
> +	return def;
>  }
>  
>  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>  {
> -	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(p, cpu)) {
> -		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> -		__ttwu_queue_wakelist(p, cpu, wake_flags);
> -		return true;
> -	}
> +	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> +
> +	if (!ttwu_queue_cond(p, cpu, def))
> +		return false;
>  
> -	return false;
> +	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
> +	__ttwu_queue_wakelist(p, cpu, wake_flags);
> +	return true;
>  }
>  

I get that you're moving the feature checks into the callers but it's
unclear what the intent behind TTWU_QUEUE_DEFAULT is. It's somewhat
preserving existing behaviour and is probably preparation for a later
patch but it's less clear why it's necessary or what changing it would
reveal while debugging.

>  static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
> @@ -3948,7 +3949,7 @@ static void ttwu_queue(struct task_struc
>  	struct rq *rq = cpu_rq(cpu);
>  	struct rq_flags rf;
>  
> -	if (ttwu_queue_wakelist(p, cpu, wake_flags))
> +	if (sched_feat(TTWU_QUEUE) && ttwu_queue_wakelist(p, cpu, wake_flags))
>  		return;
>  
>  	rq_lock(rq, &rf);
> @@ -4251,7 +4252,8 @@ int try_to_wake_up(struct task_struct *p
>  		 * scheduling.
>  		 */
>  		if (smp_load_acquire(&p->on_cpu) &&
> -		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
> +		    sched_feat(TTWU_QUEUE_ON_CPU) &&
> +		    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
>  			break;
>  
>  		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -81,6 +81,8 @@ SCHED_FEAT(TTWU_QUEUE, false)
>   */
>  SCHED_FEAT(TTWU_QUEUE, true)
>  #endif
> +SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> +SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
>  
>  /*
>   * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2279,6 +2279,8 @@ static inline int task_on_rq_migrating(s
>  #define WF_CURRENT_CPU		0x40 /* Prefer to move the wakee to the current CPU. */
>  #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
>  
> +#define WF_ON_CPU		0x0100
> +
>  static_assert(WF_EXEC == SD_BALANCE_EXEC);
>  static_assert(WF_FORK == SD_BALANCE_FORK);
>  static_assert(WF_TTWU == SD_BALANCE_WAKE);
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 06/12] sched: Introduce ttwu_do_migrate()
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (4 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 05/12] sched: Add ttwu_queue controls Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-10 16:51   ` Vincent Guittot
  2025-07-02 11:49 ` [PATCH v2 07/12] psi: Split psi_ttwu_dequeue() Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Split out the migration related bits into their own function for later
re-use.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3774,6 +3774,21 @@ static int ttwu_runnable(struct task_str
 	return 1;
 }
 
+static inline bool ttwu_do_migrate(struct task_struct *p, int cpu)
+{
+	if (task_cpu(p) == cpu)
+		return false;
+
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
+
+	psi_ttwu_dequeue(p);
+	set_task_cpu(p, cpu);
+	return true;
+}
+
 void sched_ttwu_pending(void *arg)
 {
 	struct llist_node *llist = arg;
@@ -4268,17 +4283,8 @@ int try_to_wake_up(struct task_struct *p
 		 * their previous state and preserve Program Order.
 		 */
 		smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-		if (task_cpu(p) != cpu) {
-			if (p->in_iowait) {
-				delayacct_blkio_end(p);
-				atomic_dec(&task_rq(p)->nr_iowait);
-			}
-
+		if (ttwu_do_migrate(p, cpu))
 			wake_flags |= WF_MIGRATED;
-			psi_ttwu_dequeue(p);
-			set_task_cpu(p, cpu);
-		}
 
 		ttwu_queue(p, cpu, wake_flags);
 	}



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 06/12] sched: Introduce ttwu_do_migrate()
  2025-07-02 11:49 ` [PATCH v2 06/12] sched: Introduce ttwu_do_migrate() Peter Zijlstra
@ 2025-07-10 16:51   ` Vincent Guittot
  0 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2025-07-10 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Wed, 2 Jul 2025 at 14:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Split out the migration related bits into their own function for later
> re-use.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  kernel/sched/core.c |   26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3774,6 +3774,21 @@ static int ttwu_runnable(struct task_str
>         return 1;
>  }
>
> +static inline bool ttwu_do_migrate(struct task_struct *p, int cpu)
> +{
> +       if (task_cpu(p) == cpu)
> +               return false;
> +
> +       if (p->in_iowait) {
> +               delayacct_blkio_end(p);
> +               atomic_dec(&task_rq(p)->nr_iowait);
> +       }
> +
> +       psi_ttwu_dequeue(p);
> +       set_task_cpu(p, cpu);
> +       return true;
> +}
> +
>  void sched_ttwu_pending(void *arg)
>  {
>         struct llist_node *llist = arg;
> @@ -4268,17 +4283,8 @@ int try_to_wake_up(struct task_struct *p
>                  * their previous state and preserve Program Order.
>                  */
>                 smp_cond_load_acquire(&p->on_cpu, !VAL);
> -
> -               if (task_cpu(p) != cpu) {
> -                       if (p->in_iowait) {
> -                               delayacct_blkio_end(p);
> -                               atomic_dec(&task_rq(p)->nr_iowait);
> -                       }
> -
> +               if (ttwu_do_migrate(p, cpu))
>                         wake_flags |= WF_MIGRATED;
> -                       psi_ttwu_dequeue(p);
> -                       set_task_cpu(p, cpu);
> -               }
>
>                 ttwu_queue(p, cpu, wake_flags);
>         }
>
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 07/12] psi: Split psi_ttwu_dequeue()
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (5 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 06/12] sched: Introduce ttwu_do_migrate() Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-17 23:59   ` Chris Mason
  2025-07-02 11:49 ` [PATCH v2 08/12] sched: Re-arrange __ttwu_queue_wakelist() Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Currently psi_ttwu_dequeue() is called while holding p->pi_lock and
takes rq->lock. Split the function in preparation for calling
ttwu_do_migration() while already holding rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   18 ++++++++++++++----
 kernel/sched/stats.h |   24 +++++++++++++-----------
 2 files changed, 27 insertions(+), 15 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3774,17 +3774,27 @@ static int ttwu_runnable(struct task_str
 	return 1;
 }
 
-static inline bool ttwu_do_migrate(struct task_struct *p, int cpu)
+static inline bool ttwu_do_migrate(struct rq *rq, struct task_struct *p, int cpu)
 {
+	struct rq *p_rq = rq ? : task_rq(p);
+
 	if (task_cpu(p) == cpu)
 		return false;
 
 	if (p->in_iowait) {
 		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
+		atomic_dec(&p_rq->nr_iowait);
 	}
 
-	psi_ttwu_dequeue(p);
+	if (psi_ttwu_need_dequeue(p)) {
+		if (rq) {
+			lockdep_assert(task_rq(p) == rq);
+			__psi_ttwu_dequeue(p);
+		} else {
+			guard(__task_rq_lock)(p);
+			__psi_ttwu_dequeue(p);
+		}
+	}
 	set_task_cpu(p, cpu);
 	return true;
 }
@@ -4283,7 +4293,7 @@ int try_to_wake_up(struct task_struct *p
 		 * their previous state and preserve Program Order.
 		 */
 		smp_cond_load_acquire(&p->on_cpu, !VAL);
-		if (ttwu_do_migrate(p, cpu))
+		if (ttwu_do_migrate(NULL, p, cpu))
 			wake_flags |= WF_MIGRATED;
 
 		ttwu_queue(p, cpu, wake_flags);
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -191,23 +191,24 @@ static inline void psi_dequeue(struct ta
 	psi_task_change(p, p->psi_flags, 0);
 }
 
-static inline void psi_ttwu_dequeue(struct task_struct *p)
+static inline bool psi_ttwu_need_dequeue(struct task_struct *p)
 {
 	if (static_branch_likely(&psi_disabled))
-		return;
+		return false;
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
 	 * queue, and let psi_enqueue() know it has to requeue.
 	 */
-	if (unlikely(p->psi_flags)) {
-		struct rq_flags rf;
-		struct rq *rq;
-
-		rq = __task_rq_lock(p, &rf);
-		psi_task_change(p, p->psi_flags, 0);
-		__task_rq_unlock(rq, &rf);
-	}
+	if (!likely(!p->psi_flags))
+		return false;
+
+	return true;
+}
+
+static inline void __psi_ttwu_dequeue(struct task_struct *p)
+{
+	psi_task_change(p, p->psi_flags, 0);
 }
 
 static inline void psi_sched_switch(struct task_struct *prev,
@@ -223,7 +224,8 @@ static inline void psi_sched_switch(stru
 #else /* !CONFIG_PSI: */
 static inline void psi_enqueue(struct task_struct *p, bool migrate) {}
 static inline void psi_dequeue(struct task_struct *p, bool migrate) {}
-static inline void psi_ttwu_dequeue(struct task_struct *p) {}
+static inline bool psi_ttwu_need_dequeue(struct task_struct *p) { return false; }
+static inline void __psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 07/12] psi: Split psi_ttwu_dequeue()
  2025-07-02 11:49 ` [PATCH v2 07/12] psi: Split psi_ttwu_dequeue() Peter Zijlstra
@ 2025-07-17 23:59   ` Chris Mason
  2025-07-18 18:02     ` Steven Rostedt
  0 siblings, 1 reply; 101+ messages in thread
From: Chris Mason @ 2025-07-17 23:59 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel

On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> Currently psi_ttwu_dequeue() is called while holding p->pi_lock and
> takes rq->lock. Split the function in preparation for calling
> ttwu_do_migration() while already holding rq->lock.
> 

[ ... ]


This patch regresses schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0 from
5.2M RPS to 4.5M RPS, and profiles show that CPU 0-3 are spending more
time in __task_rq_lock()

> -static inline void psi_ttwu_dequeue(struct task_struct *p)
> +static inline bool psi_ttwu_need_dequeue(struct task_struct *p)
>  {
>  	if (static_branch_likely(&psi_disabled))
> -		return;
> +		return false;
>  	/*
>  	 * Is the task being migrated during a wakeup? Make sure to
>  	 * deregister its sleep-persistent psi states from the old
>  	 * queue, and let psi_enqueue() know it has to requeue.
>  	 */
> -	if (unlikely(p->psi_flags)) {
> -		struct rq_flags rf;
> -		struct rq *rq;
> -
> -		rq = __task_rq_lock(p, &rf);
> -		psi_task_change(p, p->psi_flags, 0);
> -		__task_rq_unlock(rq, &rf);
> -	}
> +	if (!likely(!p->psi_flags))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^

I think we need roughly one less bang?

-chris

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 07/12] psi: Split psi_ttwu_dequeue()
  2025-07-17 23:59   ` Chris Mason
@ 2025-07-18 18:02     ` Steven Rostedt
  0 siblings, 0 replies; 101+ messages in thread
From: Steven Rostedt @ 2025-07-18 18:02 UTC (permalink / raw)
  To: Chris Mason
  Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel

On Thu, 17 Jul 2025 19:59:52 -0400
Chris Mason <clm@meta.com> wrote:

> > -	}
> > +	if (!likely(!p->psi_flags))  
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> I think we need roughly one less bang?

I think it needs two fewer bangs!

	if (unlikely(p->psi_flags))

  ;-)

-- Steve

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 08/12] sched: Re-arrange __ttwu_queue_wakelist()
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (6 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 07/12] psi: Split psi_ttwu_dequeue() Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 09/12] sched: Clean up ttwu comments Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

The relation between ttwu_queue_wakelist() and __ttwu_queue_wakelist()
is ill defined -- probably because the former is the only caller of
the latter and it grew into an arbitrary subfunction.

Clean things up a little such that __ttwu_queue_wakelist() no longer
takes the wake_flags argument, making for a more sensible separation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3848,11 +3848,11 @@ bool call_function_single_prep_ipi(int c
  * via sched_ttwu_wakeup() for activation so the wakee incurs the cost
  * of the wakeup instead of the waker.
  */
-static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
-	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
 #ifdef CONFIG_SMP
@@ -3954,8 +3954,9 @@ static bool ttwu_queue_wakelist(struct t
 	if (!ttwu_queue_cond(p, cpu, def))
 		return false;
 
-	sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-	__ttwu_queue_wakelist(p, cpu, wake_flags);
+	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+
+	__ttwu_queue_wakelist(p, cpu);
 	return true;
 }
 



^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 09/12] sched: Clean up ttwu comments
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (7 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 08/12] sched: Re-arrange __ttwu_queue_wakelist() Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending() Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Various changes have rendered these comments slightly out-of-date.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4276,8 +4276,8 @@ int try_to_wake_up(struct task_struct *p
 		 * __schedule().  See the comment for smp_mb__after_spinlock().
 		 *
 		 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
-		 * schedule()'s deactivate_task() has 'happened' and p will no longer
-		 * care about it's own p->state. See the comment in __schedule().
+		 * schedule()'s try_to_block_task() has 'happened' and p will no longer
+		 * care about its own p->state. See the comment in try_to_block_task().
 		 */
 		smp_acquire__after_ctrl_dep();
 
@@ -6708,8 +6708,8 @@ static void __sched notrace __schedule(i
 	preempt = sched_mode == SM_PREEMPT;
 
 	/*
-	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that we form a control dependency vs deactivate_task() below.
+	 * We must load prev->state once, such that we form a control
+	 * dependency vs try_to_block_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (sched_mode == SM_IDLE) {



^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending()
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (8 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 09/12] sched: Clean up ttwu comments Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-10 16:51   ` Vincent Guittot
  2025-07-02 11:49 ` [PATCH v2 11/12] sched: Change ttwu_runnable() vs sched_delayed Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3827,22 +3827,26 @@ void sched_ttwu_pending(void *arg)
 	struct llist_node *llist = arg;
 	struct rq *rq = this_rq();
 	struct task_struct *p, *t;
-	struct rq_flags rf;
 
 	if (!llist)
 		return;
 
-	rq_lock_irqsave(rq, &rf);
+	CLASS(rq_lock_irqsave, guard)(rq);
 	update_rq_clock(rq);
 
 	llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
+		int wake_flags = WF_TTWU;
+
 		if (WARN_ON_ONCE(p->on_cpu))
 			smp_cond_load_acquire(&p->on_cpu, !VAL);
 
 		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
 			set_task_cpu(p, cpu_of(rq));
 
-		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
+		if (p->sched_remote_wakeup)
+			wake_flags |= WF_MIGRATED;
+
+		ttwu_do_activate(rq, p, wake_flags, &guard.rf);
 	}
 
 	/*
@@ -3856,7 +3860,6 @@ void sched_ttwu_pending(void *arg)
 	 * Since now nr_running > 0, idle_cpu() will always get correct result.
 	 */
 	WRITE_ONCE(rq->ttwu_pending, 0);
-	rq_unlock_irqrestore(rq, &rf);
 }
 
 /*



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending()
  2025-07-02 11:49 ` [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending() Peter Zijlstra
@ 2025-07-10 16:51   ` Vincent Guittot
  0 siblings, 0 replies; 101+ messages in thread
From: Vincent Guittot @ 2025-07-10 16:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Wed, 2 Jul 2025 at 14:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c |   11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3827,22 +3827,26 @@ void sched_ttwu_pending(void *arg)
>         struct llist_node *llist = arg;
>         struct rq *rq = this_rq();
>         struct task_struct *p, *t;
> -       struct rq_flags rf;
>
>         if (!llist)
>                 return;
>
> -       rq_lock_irqsave(rq, &rf);
> +       CLASS(rq_lock_irqsave, guard)(rq);
>         update_rq_clock(rq);
>
>         llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> +               int wake_flags = WF_TTWU;

Adding WF_TTWU is not part of using lock guard but for patch 12 with
ttwu_delayed -> select_task_rq()


> +
>                 if (WARN_ON_ONCE(p->on_cpu))
>                         smp_cond_load_acquire(&p->on_cpu, !VAL);
>
>                 if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
>                         set_task_cpu(p, cpu_of(rq));
>
> -               ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> +               if (p->sched_remote_wakeup)
> +                       wake_flags |= WF_MIGRATED;
> +
> +               ttwu_do_activate(rq, p, wake_flags, &guard.rf);
>         }
>
>         /*
> @@ -3856,7 +3860,6 @@ void sched_ttwu_pending(void *arg)
>          * Since now nr_running > 0, idle_cpu() will always get correct result.
>          */
>         WRITE_ONCE(rq->ttwu_pending, 0);
> -       rq_unlock_irqrestore(rq, &rf);
>  }
>
>  /*
>
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 11/12] sched: Change ttwu_runnable() vs sched_delayed
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (9 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending() Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

Change how TTWU handles sched_delayed tasks.

Currently sched_delayed tasks are seen as on_rq and will hit
ttwu_runnable(), which treats sched_delayed tasks the same as other
on_rq tasks, it makes them runnable on the runqueue they're on.

However, tasks that were dequeued (and not delayed) will get a
different wake-up path, notably they will pass through wakeup
balancing.

Change ttwu_runnable() to dequeue delayed tasks and report it isn't
on_rq after all, ensuring the task continues down the regular wakeup
path.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3793,8 +3793,10 @@ static int ttwu_runnable(struct task_str
 		return 0;
 
 	update_rq_clock(rq);
-	if (p->se.sched_delayed)
-		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+	if (p->se.sched_delayed) {
+		dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_DELAYED | DEQUEUE_SLEEP);
+		return 0;
+	}
 	if (!task_on_cpu(rq, p)) {
 		/*
 		 * When on_rq && !on_cpu the task is preempted, see if



^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (10 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 11/12] sched: Change ttwu_runnable() vs sched_delayed Peter Zijlstra
@ 2025-07-02 11:49 ` Peter Zijlstra
  2025-07-03 16:00   ` Phil Auld
                     ` (2 more replies)
  2025-07-02 15:27 ` [PATCH v2 00/12] sched: Address schbench regression Chris Mason
                   ` (2 subsequent siblings)
  14 siblings, 3 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-02 11:49 UTC (permalink / raw)
  To: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel, peterz

One of the more expensive things to do is take a remote runqueue lock;
which is exactly what ttwu_runnable() ends up doing. However, in the
case of sched_delayed tasks it is possible to queue up an IPI instead.

Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250520101727.984171377@infradead.org
---
 include/linux/sched.h   |    1 
 kernel/sched/core.c     |   96 +++++++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/fair.c     |   17 ++++++++
 kernel/sched/features.h |    1 
 kernel/sched/sched.h    |    1 
 5 files changed, 110 insertions(+), 6 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -984,6 +984,7 @@ struct task_struct {
 	 * ->sched_remote_wakeup gets used, so it can be in this word.
 	 */
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_remote_delayed:1;
 #ifdef CONFIG_RT_MUTEXES
 	unsigned			sched_rt_mutex:1;
 #endif
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -675,7 +675,12 @@ struct rq *__task_rq_lock(struct task_st
 {
 	struct rq *rq;
 
-	lockdep_assert_held(&p->pi_lock);
+	/*
+	 * TASK_WAKING is used to serialize the remote end of wakeup, rather
+	 * than p->pi_lock.
+	 */
+	lockdep_assert(p->__state == TASK_WAKING ||
+		       lockdep_is_held(&p->pi_lock) != LOCK_STATE_NOT_HELD);
 
 	for (;;) {
 		rq = task_rq(p);
@@ -3727,6 +3732,8 @@ ttwu_do_activate(struct rq *rq, struct t
 	}
 }
 
+static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
+
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3754,6 +3761,35 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
+	if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
+		/*
+		 * Similar to try_to_block_task():
+		 *
+		 * __schedule()				ttwu()
+		 *   prev_state = prev->state		  if (p->sched_delayed)
+		 *   if (prev_state)			     smp_acquire__after_ctrl_dep()
+		 *     try_to_block_task()		     p->state = TASK_WAKING
+		 *       ... set_delayed()
+		 *         RELEASE p->sched_delayed = 1
+		 *
+		 * __schedule() and ttwu() have matching control dependencies.
+		 *
+		 * Notably, once we observe sched_delayed we know the task has
+		 * passed try_to_block_task() and p->state is ours to modify.
+		 *
+		 * TASK_WAKING controls ttwu() concurrency.
+		 */
+		smp_acquire__after_ctrl_dep();
+		WRITE_ONCE(p->__state, TASK_WAKING);
+		/*
+		 * Bit of a hack, see select_task_rq_fair()'s WF_DELAYED case.
+		 */
+		p->wake_cpu = smp_processor_id();
+
+		if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
+			return 1;
+	}
+
 	CLASS(__task_rq_lock, guard)(p);
 	struct rq *rq = guard.rq;
 
@@ -3776,6 +3812,8 @@ static int ttwu_runnable(struct task_str
 	return 1;
 }
 
+static void __ttwu_queue_wakelist(struct task_struct *p, int cpu);
+
 static inline bool ttwu_do_migrate(struct rq *rq, struct task_struct *p, int cpu)
 {
 	struct rq *p_rq = rq ? : task_rq(p);
@@ -3801,6 +3839,52 @@ static inline bool ttwu_do_migrate(struc
 	return true;
 }
 
+static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
+			struct rq_flags *rf)
+{
+	struct rq *p_rq = task_rq(p);
+	int cpu;
+
+	/*
+	 * Notably it is possible for on-rq entities to get migrated -- even
+	 * sched_delayed ones. This should be rare though, so flip the locks
+	 * rather than IPI chase after it.
+	 */
+	if (unlikely(rq != p_rq)) {
+		rq_unlock(rq, rf);
+		p_rq = __task_rq_lock(p, rf);
+		update_rq_clock(p_rq);
+	}
+
+	if (task_on_rq_queued(p))
+		dequeue_task(p_rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+
+	/*
+	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
+	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
+	 * rather than p->pi_lock.
+	 */
+	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+	if (ttwu_do_migrate(rq, p, cpu))
+		wake_flags |= WF_MIGRATED;
+
+	if (unlikely(rq != p_rq)) {
+		__task_rq_unlock(p_rq, rf);
+		rq_lock(rq, rf);
+	}
+
+	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	p->sched_remote_delayed = 0;
+
+	/* it wants to run here */
+	if (cpu_of(rq) == cpu)
+		return 0;
+
+	/* shoot it to the CPU it wants to run on */
+	__ttwu_queue_wakelist(p, cpu);
+	return 1;
+}
+
 void sched_ttwu_pending(void *arg)
 {
 	struct llist_node *llist = arg;
@@ -3819,12 +3903,13 @@ void sched_ttwu_pending(void *arg)
 		if (WARN_ON_ONCE(p->on_cpu))
 			smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
-			set_task_cpu(p, cpu_of(rq));
-
 		if (p->sched_remote_wakeup)
 			wake_flags |= WF_MIGRATED;
 
+		if (p->sched_remote_delayed &&
+		    ttwu_delayed(rq, p, wake_flags | WF_DELAYED, &guard.rf))
+			continue;
+
 		ttwu_do_activate(rq, p, wake_flags, &guard.rf);
 	}
 
@@ -3964,12 +4049,13 @@ static inline bool ttwu_queue_cond(struc
 
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
 {
-	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
+	bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
 
 	if (!ttwu_queue_cond(p, cpu, def))
 		return false;
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
+	p->sched_remote_delayed = !!(wake_flags & WF_DELAYED);
 
 	__ttwu_queue_wakelist(p, cpu);
 	return true;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5327,7 +5327,10 @@ static __always_inline void return_cfs_r
 
 static void set_delayed(struct sched_entity *se)
 {
-	se->sched_delayed = 1;
+	/*
+	 * See TTWU_QUEUE_DELAYED in ttwu_runnable().
+	 */
+	smp_store_release(&se->sched_delayed, 1);
 
 	/*
 	 * Delayed se of cfs_rq have no tasks queued on them.
@@ -8481,6 +8484,18 @@ select_task_rq_fair(struct task_struct *
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	if (wake_flags & WF_DELAYED) {
+		/*
+		 * This is the ttwu_delayed() case; where prev_cpu is in fact
+		 * the CPU that did the wakeup, while @p is running on the
+		 * current CPU.
+		 *
+		 * Make sure to flip them the right way around, otherwise
+		 * wake-affine is going to do the wrong thing.
+		 */
+		swap(cpu, new_cpu);
+	}
+
 	/*
 	 * required for stable ->cpus_allowed
 	 */
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
 SCHED_FEAT(TTWU_QUEUE, true)
 #endif
 SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
+SCHED_FEAT(TTWU_QUEUE_DELAYED, true)
 SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
 
 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2280,6 +2280,7 @@ static inline int task_on_rq_migrating(s
 #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
 
 #define WF_ON_CPU		0x0100
+#define WF_DELAYED		0x0200
 
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
 static_assert(WF_FORK == SD_BALANCE_FORK);



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
@ 2025-07-03 16:00   ` Phil Auld
  2025-07-03 16:47     ` Peter Zijlstra
  2025-07-08 12:44   ` Dietmar Eggemann
  2025-07-23  5:42   ` Shrikanth Hegde
  2 siblings, 1 reply; 101+ messages in thread
From: Phil Auld @ 2025-07-03 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

Hi Peter,

On Wed, Jul 02, 2025 at 01:49:36PM +0200 Peter Zijlstra wrote:
> One of the more expensive things to do is take a remote runqueue lock;
> which is exactly what ttwu_runnable() ends up doing. However, in the
> case of sched_delayed tasks it is possible to queue up an IPI instead.
> 
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.984171377@infradead.org
> ---
>  include/linux/sched.h   |    1 
>  kernel/sched/core.c     |   96 +++++++++++++++++++++++++++++++++++++++++++++---
>  kernel/sched/fair.c     |   17 ++++++++
>  kernel/sched/features.h |    1 
>  kernel/sched/sched.h    |    1 
>  5 files changed, 110 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -984,6 +984,7 @@ struct task_struct {
>  	 * ->sched_remote_wakeup gets used, so it can be in this word.
>  	 */
>  	unsigned			sched_remote_wakeup:1;
> +	unsigned			sched_remote_delayed:1;
>  #ifdef CONFIG_RT_MUTEXES
>  	unsigned			sched_rt_mutex:1;
>  #endif
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -675,7 +675,12 @@ struct rq *__task_rq_lock(struct task_st
>  {
>  	struct rq *rq;
>  
> -	lockdep_assert_held(&p->pi_lock);
> +	/*
> +	 * TASK_WAKING is used to serialize the remote end of wakeup, rather
> +	 * than p->pi_lock.
> +	 */
> +	lockdep_assert(p->__state == TASK_WAKING ||
> +		       lockdep_is_held(&p->pi_lock) != LOCK_STATE_NOT_HELD);
>  
>  	for (;;) {
>  		rq = task_rq(p);
> @@ -3727,6 +3732,8 @@ ttwu_do_activate(struct rq *rq, struct t
>  	}
>  }
>  
> +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> +
>  /*
>   * Consider @p being inside a wait loop:
>   *
> @@ -3754,6 +3761,35 @@ ttwu_do_activate(struct rq *rq, struct t
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> +	if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> +		/*
> +		 * Similar to try_to_block_task():
> +		 *
> +		 * __schedule()				ttwu()
> +		 *   prev_state = prev->state		  if (p->sched_delayed)
> +		 *   if (prev_state)			     smp_acquire__after_ctrl_dep()
> +		 *     try_to_block_task()		     p->state = TASK_WAKING
> +		 *       ... set_delayed()
> +		 *         RELEASE p->sched_delayed = 1
> +		 *
> +		 * __schedule() and ttwu() have matching control dependencies.
> +		 *
> +		 * Notably, once we observe sched_delayed we know the task has
> +		 * passed try_to_block_task() and p->state is ours to modify.
> +		 *
> +		 * TASK_WAKING controls ttwu() concurrency.
> +		 */
> +		smp_acquire__after_ctrl_dep();
> +		WRITE_ONCE(p->__state, TASK_WAKING);
> +		/*
> +		 * Bit of a hack, see select_task_rq_fair()'s WF_DELAYED case.
> +		 */
> +		p->wake_cpu = smp_processor_id();
> +
> +		if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> +			return 1;
> +	}
> +
>  	CLASS(__task_rq_lock, guard)(p);
>  	struct rq *rq = guard.rq;
>  
> @@ -3776,6 +3812,8 @@ static int ttwu_runnable(struct task_str
>  	return 1;
>  }
>  
> +static void __ttwu_queue_wakelist(struct task_struct *p, int cpu);
> +
>  static inline bool ttwu_do_migrate(struct rq *rq, struct task_struct *p, int cpu)
>  {
>  	struct rq *p_rq = rq ? : task_rq(p);
> @@ -3801,6 +3839,52 @@ static inline bool ttwu_do_migrate(struc
>  	return true;
>  }
>  
> +static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
> +			struct rq_flags *rf)
> +{
> +	struct rq *p_rq = task_rq(p);
> +	int cpu;
> +
> +	/*
> +	 * Notably it is possible for on-rq entities to get migrated -- even
> +	 * sched_delayed ones. This should be rare though, so flip the locks
> +	 * rather than IPI chase after it.
> +	 */
> +	if (unlikely(rq != p_rq)) {
> +		rq_unlock(rq, rf);
> +		p_rq = __task_rq_lock(p, rf);
> +		update_rq_clock(p_rq);
> +	}
> +
> +	if (task_on_rq_queued(p))
> +		dequeue_task(p_rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> +
> +	/*
> +	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
> +	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
> +	 * rather than p->pi_lock.
> +	 */
> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> +	if (ttwu_do_migrate(rq, p, cpu))
> +

This doesn't compile because ttwu_do_migrate() doesn't take a *rq.

It's easy enough to fix up and I'll try to have our perf team try these
out. 

Thanks,
Phil



		wake_flags |= WF_MIGRATED;
> +
> +	if (unlikely(rq != p_rq)) {
> +		__task_rq_unlock(p_rq, rf);
> +		rq_lock(rq, rf);
> +	}
> +
> +	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> +	p->sched_remote_delayed = 0;
> +
> +	/* it wants to run here */
> +	if (cpu_of(rq) == cpu)
> +		return 0;
> +
> +	/* shoot it to the CPU it wants to run on */
> +	__ttwu_queue_wakelist(p, cpu);
> +	return 1;
> +}
> +
>  void sched_ttwu_pending(void *arg)
>  {
>  	struct llist_node *llist = arg;
> @@ -3819,12 +3903,13 @@ void sched_ttwu_pending(void *arg)
>  		if (WARN_ON_ONCE(p->on_cpu))
>  			smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
> -		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> -			set_task_cpu(p, cpu_of(rq));
> -
>  		if (p->sched_remote_wakeup)
>  			wake_flags |= WF_MIGRATED;
>  
> +		if (p->sched_remote_delayed &&
> +		    ttwu_delayed(rq, p, wake_flags | WF_DELAYED, &guard.rf))
> +			continue;
> +
>  		ttwu_do_activate(rq, p, wake_flags, &guard.rf);
>  	}
>  
> @@ -3964,12 +4049,13 @@ static inline bool ttwu_queue_cond(struc
>  
>  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>  {
> -	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> +	bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
>  
>  	if (!ttwu_queue_cond(p, cpu, def))
>  		return false;
>  
>  	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> +	p->sched_remote_delayed = !!(wake_flags & WF_DELAYED);
>  
>  	__ttwu_queue_wakelist(p, cpu);
>  	return true;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5327,7 +5327,10 @@ static __always_inline void return_cfs_r
>  
>  static void set_delayed(struct sched_entity *se)
>  {
> -	se->sched_delayed = 1;
> +	/*
> +	 * See TTWU_QUEUE_DELAYED in ttwu_runnable().
> +	 */
> +	smp_store_release(&se->sched_delayed, 1);
>  
>  	/*
>  	 * Delayed se of cfs_rq have no tasks queued on them.
> @@ -8481,6 +8484,18 @@ select_task_rq_fair(struct task_struct *
>  	/* SD_flags and WF_flags share the first nibble */
>  	int sd_flag = wake_flags & 0xF;
>  
> +	if (wake_flags & WF_DELAYED) {
> +		/*
> +		 * This is the ttwu_delayed() case; where prev_cpu is in fact
> +		 * the CPU that did the wakeup, while @p is running on the
> +		 * current CPU.
> +		 *
> +		 * Make sure to flip them the right way around, otherwise
> +		 * wake-affine is going to do the wrong thing.
> +		 */
> +		swap(cpu, new_cpu);
> +	}
> +
>  	/*
>  	 * required for stable ->cpus_allowed
>  	 */
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
>  SCHED_FEAT(TTWU_QUEUE, true)
>  #endif
>  SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> +SCHED_FEAT(TTWU_QUEUE_DELAYED, true)
>  SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
>  
>  /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2280,6 +2280,7 @@ static inline int task_on_rq_migrating(s
>  #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
>  
>  #define WF_ON_CPU		0x0100
> +#define WF_DELAYED		0x0200
>  
>  static_assert(WF_EXEC == SD_BALANCE_EXEC);
>  static_assert(WF_FORK == SD_BALANCE_FORK);
> 
> 
> 

-- 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-03 16:00   ` Phil Auld
@ 2025-07-03 16:47     ` Peter Zijlstra
  2025-07-03 17:11       ` Phil Auld
  2025-07-04  6:13       ` K Prateek Nayak
  0 siblings, 2 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-03 16:47 UTC (permalink / raw)
  To: Phil Auld
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Thu, Jul 03, 2025 at 12:00:27PM -0400, Phil Auld wrote:

> > +	if (ttwu_do_migrate(rq, p, cpu))
> > +
> 
> This doesn't compile because ttwu_do_migrate() doesn't take a *rq.
> 
> It's easy enough to fix up and I'll try to have our perf team try these
> out. 

I'm confused, isn't that what patch 7 does?

Also, I updated the git tree today, fixing a silly mistake. But I don't
remember build failures here.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-03 16:47     ` Peter Zijlstra
@ 2025-07-03 17:11       ` Phil Auld
  2025-07-14 13:57         ` Phil Auld
  2025-07-04  6:13       ` K Prateek Nayak
  1 sibling, 1 reply; 101+ messages in thread
From: Phil Auld @ 2025-07-03 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Thu, Jul 03, 2025 at 06:47:08PM +0200 Peter Zijlstra wrote:
> On Thu, Jul 03, 2025 at 12:00:27PM -0400, Phil Auld wrote:
> 
> > > +	if (ttwu_do_migrate(rq, p, cpu))
> > > +
> > 
> > This doesn't compile because ttwu_do_migrate() doesn't take a *rq.
> > 
> > It's easy enough to fix up and I'll try to have our perf team try these
> > out. 
> 
> I'm confused, isn't that what patch 7 does?
>

Heh,  I seem to have not had patch 7 (psi did not make it through my
stupid gmail filters).  I did look through all the ones I had but did
not look at the numbers and see I was missing one... 

Nevermind.


Cheers,
Phil

> Also, I updated the git tree today, fixing a silly mistake. But I don't
> remember build failures here.
> 

-- 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-03 17:11       ` Phil Auld
@ 2025-07-14 13:57         ` Phil Auld
  0 siblings, 0 replies; 101+ messages in thread
From: Phil Auld @ 2025-07-14 13:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

Hi Peter,

On Thu, Jul 03, 2025 at 01:11:27PM -0400 Phil Auld wrote:
> On Thu, Jul 03, 2025 at 06:47:08PM +0200 Peter Zijlstra wrote:
> > On Thu, Jul 03, 2025 at 12:00:27PM -0400, Phil Auld wrote:
> > 
> > > > +	if (ttwu_do_migrate(rq, p, cpu))
> > > > +
> > > 
> > > This doesn't compile because ttwu_do_migrate() doesn't take a *rq.
> > > 
> > > It's easy enough to fix up and I'll try to have our perf team try these
> > > out. 
> > 

For the second part if this email...

Our perf team reports that this series (as originally posted without the minor
fixups) addresses the randwrite issue we were seeing with delayed dequeue
and shows a bit better performance on some other fs related tests.

It's neutral on other tests so modulo the issues others reported I think
we'd be happy to have these in.  We did not see any new regressions.

Thanks!

Cheers,
Phil
-- 

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-03 16:47     ` Peter Zijlstra
  2025-07-03 17:11       ` Phil Auld
@ 2025-07-04  6:13       ` K Prateek Nayak
  2025-07-04  7:59         ` Peter Zijlstra
  1 sibling, 1 reply; 101+ messages in thread
From: K Prateek Nayak @ 2025-07-04  6:13 UTC (permalink / raw)
  To: Peter Zijlstra, Phil Auld
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

Hello Peter,

On 7/3/2025 10:17 PM, Peter Zijlstra wrote:
> Also, I updated the git tree today, fixing a silly mistake. But I don't
> remember build failures here.

Running HammerDB + MySQL on baremetal results in a splats from
assert_clock_updated() ~10min into the run on peterz:sched/core at
commit 098ac7dd8a57 ("sched: Add ttwu_queue support for delayed
tasks") which I hope is the updated one.

I'm running with the following diff and haven't seen a splat yet
(slightly longer than 10min into HammerDB; still testing):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9855121c2440..71ac0e7effeb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3871,7 +3871,7 @@ static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
  	if (unlikely(rq != p_rq)) {
  		__task_rq_unlock(p_rq, rf);
  		rq_lock(rq, rf);
-		update_rq_clock(p_rq);
+		update_rq_clock(rq);
  	}
  
  	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
---

P.S. The full set of splats are as follows:

     ------------[ cut here ]------------
     WARNING: CPU: 141 PID: 8164 at kernel/sched/sched.h:1643 update_curr_se+0x5c/0x60
     Modules linked in: ...
     CPU: 141 UID: 1000 PID: 8164 Comm: mysqld Tainted: G     U              6.16.0-rc1-peterz-queue-098ac7+ #869 PREEMPT(voluntary)
     Tainted: [U]=USER
     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
     RIP: 0010:update_curr_se+0x5c/0x60
     Code: be a8 00 00 00 00 48 8d 96 80 02 00 00 74 07 48 8d 96 00 01 00 00 48 8b 4a 60 48 39 c1 48 0f 4c c8 48 89 4a 60 e9 1f af d5 ff <0f> 0b eb ae 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
     RSP: 0000:ffffd162151ffd30 EFLAGS: 00010097
     RAX: 0000000000000001 RBX: ffff8bb25c903500 RCX: ffffffff8afb3ca0
     RDX: 0000000000000009 RSI: ffff8bb25c903500 RDI: ffff8bf07e571b00
     RBP: ffff8bb25ea94400 R08: 0000014f849737b3 R09: 0000000000000009
     R10: 0000000000000001 R11: 0000000000000000 R12: ffff8bf07e571b00
     R13: ffff8bf07e571b00 R14: ffff8bb25c903500 R15: ffff8bb25ea94400
     FS:  00007fcd9e01f640(0000) GS:ffff8bf0f0858000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007fcf3990c004 CR3: 0000000147f9900b CR4: 0000000000f70ef0
     PKRU: 55555554
     Call Trace:
      <TASK>
      update_curr+0x31/0x240
      enqueue_entity+0x2e/0x470
      enqueue_task_fair+0x122/0x850
      enqueue_task+0x88/0x1c0
      ttwu_do_activate+0x75/0x230
      sched_ttwu_pending+0x2b9/0x430
      __flush_smp_call_function_queue+0x140/0x420
      __sysvec_call_function_single+0x1c/0xb0
      sysvec_call_function_single+0x43/0xb0
      asm_sysvec_call_function_single+0x1a/0x20
     RIP: 0033:0x202a6de
     Code: 89 f1 4d 89 c4 4d 8d 44 24 01 41 0f b6 50 ff 49 8d 71 01 44 0f b6 5e ff 89 d0 44 29 d8 0f 85 01 02 00 00 48 89 ca 48 83 ea 01 <0f> 84 0c 02 00 00 4c 39 c7 75 c7 48 89 4d 80 4c 89 55 88 4c 89 4d
     RSP: 002b:00007fcd9e01a1f0 EFLAGS: 00000202
     RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
     RDX: 0000000000000002 RSI: 00007fc2684926c2 RDI: 00007fb96fe862b7
     RBP: 00007fcd9e01a270 R08: 00007fb96fe862b5 R09: 00007fc2684926c1
     R10: 0000000000000000 R11: 0000000000000000 R12: 00007fb96fe862b4
     R13: 0000000000000000 R14: 00000000ffffffff R15: 00007fcd9e01a310
      </TASK>
     ---[ end trace 0000000000000000 ]---

     ------------[ cut here ]------------
     WARNING: CPU: 141 PID: 8164 at kernel/sched/sched.h:1643 update_load_avg+0x6f7/0x780
     Modules linked in: ...
     CPU: 141 UID: 1000 PID: 8164 Comm: mysqld Tainted: G     U  W           6.16.0-rc1-peterz-queue-098ac7+ #869 PREEMPT(voluntary)
     Tainted: [U]=USER, [W]=WARN
     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
     RIP: 0010:update_load_avg+0x6f7/0x780
     Code: 42 f8 e9 c8 fa ff ff 39 c1 0f 42 c8 e9 4a fb ff ff 31 c0 45 31 c9 e9 cf fb ff ff 4c 8b a7 80 01 00 00 49 29 d4 e9 5a f9 ff ff <0f> 0b e9 42 f9 ff ff 69 d0 7e b6 00 00 e9 f4 fb ff ff 39 c2 0f 42
     RSP: 0000:ffffd162151ffd20 EFLAGS: 00010097
     RAX: ffff8bf07e571b00 RBX: ffff8bb25ea94400 RCX: 0000000000000041
     RDX: 0000000000000000 RSI: ffff8bb237533500 RDI: ffff8bb25ea94400
     RBP: ffff8bb237533500 R08: 000000000000132d R09: 0000000000000009
     R10: 0000000000000001 R11: 0000000000000000 R12: ffff8bf07e571b00
     R13: 0000000000000005 R14: ffff8bb25c903500 R15: ffff8bb25ea94400
     FS:  00007fcd9e01f640(0000) GS:ffff8bf0f0858000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007fcf3990c004 CR3: 0000000147f9900b CR4: 0000000000f70ef0
     PKRU: 55555554
     Call Trace:
      <TASK>
      ? srso_alias_return_thunk+0x5/0xfbef5
      ? update_curr+0x1bd/0x240
      enqueue_entity+0x3e/0x470
      enqueue_task_fair+0x122/0x850
      enqueue_task+0x88/0x1c0
      ttwu_do_activate+0x75/0x230
      sched_ttwu_pending+0x2b9/0x430
      __flush_smp_call_function_queue+0x140/0x420
      __sysvec_call_function_single+0x1c/0xb0
      sysvec_call_function_single+0x43/0xb0
      asm_sysvec_call_function_single+0x1a/0x20
     RIP: 0033:0x202a6de
     Code: 89 f1 4d 89 c4 4d 8d 44 24 01 41 0f b6 50 ff 49 8d 71 01 44 0f b6 5e ff 89 d0 44 29 d8 0f 85 01 02 00 00 48 89 ca 48 83 ea 01 <0f> 84 0c 02 00 00 4c 39 c7 75 c7 48 89 4d 80 4c 89 55 88 4c 89 4d
     RSP: 002b:00007fcd9e01a1f0 EFLAGS: 00000202
     RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
     RDX: 0000000000000002 RSI: 00007fc2684926c2 RDI: 00007fb96fe862b7
     RBP: 00007fcd9e01a270 R08: 00007fb96fe862b5 R09: 00007fc2684926c1
     R10: 0000000000000000 R11: 0000000000000000 R12: 00007fb96fe862b4
     R13: 0000000000000000 R14: 00000000ffffffff R15: 00007fcd9e01a310
      </TASK>
     ---[ end trace 0000000000000000 ]---

     ------------[ cut here ]------------
     WARNING: CPU: 141 PID: 8164 at kernel/sched/sched.h:1643 enqueue_task+0x168/0x1c0
     Modules linked in: ...
     CPU: 141 UID: 1000 PID: 8164 Comm: mysqld Tainted: G     U  W           6.16.0-rc1-peterz-queue-098ac7+ #869 PREEMPT(voluntary)
     Tainted: [U]=USER, [W]=WARN
     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
     RIP: 0010:enqueue_task+0x168/0x1c0
     Code: ee 4c 89 e7 5d 41 5c 41 5d e9 24 d0 ff ff 41 f7 c5 00 02 00 00 0f 85 e4 fe ff ff e9 18 ff ff ff e8 3d f0 ff ff e9 b9 fe ff ff <0f> 0b eb ab 85 c0 74 10 80 fa 01 19 d2 83 e2 f6 83 c2 0e e9 7a ff
     RSP: 0000:ffffd162151ffe00 EFLAGS: 00010097
     RAX: 000000003cc00001 RBX: ffff8bf07e571b00 RCX: 0000000000000000
     RDX: 000000002721340a RSI: 0000014fabb84509 RDI: ffff8bf07e55cd80
     RBP: ffff8bb237533480 R08: 0000014fabb84509 R09: 0000000000000001
     R10: 0000000000000001 R11: 0000000000000000 R12: ffff8bf07e571b00
     R13: 0000000000000009 R14: ffffd162151ffe90 R15: 0000000000000008
     FS:  00007fcd9e01f640(0000) GS:ffff8bf0f0858000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00007fcf3990c004 CR3: 0000000147f9900b CR4: 0000000000f70ef0
     PKRU: 55555554
     Call Trace:
      <TASK>
      ttwu_do_activate+0x75/0x230
      sched_ttwu_pending+0x2b9/0x430
      __flush_smp_call_function_queue+0x140/0x420
      __sysvec_call_function_single+0x1c/0xb0
      sysvec_call_function_single+0x43/0xb0
      asm_sysvec_call_function_single+0x1a/0x20
     RIP: 0033:0x202a6de
     Code: 89 f1 4d 89 c4 4d 8d 44 24 01 41 0f b6 50 ff 49 8d 71 01 44 0f b6 5e ff 89 d0 44 29 d8 0f 85 01 02 00 00 48 89 ca 48 83 ea 01 <0f> 84 0c 02 00 00 4c 39 c7 75 c7 48 89 4d 80 4c 89 55 88 4c 89 4d
     RSP: 002b:00007fcd9e01a1f0 EFLAGS: 00000202
     RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
     RDX: 0000000000000002 RSI: 00007fc2684926c2 RDI: 00007fb96fe862b7
     RBP: 00007fcd9e01a270 R08: 00007fb96fe862b5 R09: 00007fc2684926c1
     R10: 0000000000000000 R11: 0000000000000000 R12: 00007fb96fe862b4
     R13: 0000000000000000 R14: 00000000ffffffff R15: 00007fcd9e01a310
      </TASK>
     ---[ end trace 0000000000000000 ]---

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-04  6:13       ` K Prateek Nayak
@ 2025-07-04  7:59         ` Peter Zijlstra
  0 siblings, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-04  7:59 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Phil Auld, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, clm, linux-kernel

On Fri, Jul 04, 2025 at 11:43:43AM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 7/3/2025 10:17 PM, Peter Zijlstra wrote:
> > Also, I updated the git tree today, fixing a silly mistake. But I don't
> > remember build failures here.
> 
> Running HammerDB + MySQL on baremetal results in a splats from
> assert_clock_updated() ~10min into the run on peterz:sched/core at
> commit 098ac7dd8a57 ("sched: Add ttwu_queue support for delayed
> tasks") which I hope is the updated one.
> 
> I'm running with the following diff and haven't seen a splat yet
> (slightly longer than 10min into HammerDB; still testing):
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9855121c2440..71ac0e7effeb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3871,7 +3871,7 @@ static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
>  	if (unlikely(rq != p_rq)) {
>  		__task_rq_unlock(p_rq, rf);
>  		rq_lock(rq, rf);
> -		update_rq_clock(p_rq);
> +		update_rq_clock(rq);
>  	}
>  	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

Damn,... I did the edit right on the test box and then messed it up when
editing the patch :-(

I'll go fix.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
  2025-07-03 16:00   ` Phil Auld
@ 2025-07-08 12:44   ` Dietmar Eggemann
  2025-07-08 18:57     ` Peter Zijlstra
  2025-07-08 21:02     ` Peter Zijlstra
  2025-07-23  5:42   ` Shrikanth Hegde
  2 siblings, 2 replies; 101+ messages in thread
From: Dietmar Eggemann @ 2025-07-08 12:44 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot, rostedt,
	bsegall, mgorman, vschneid, clm
  Cc: linux-kernel

On 02/07/2025 13:49, Peter Zijlstra wrote:

[...]

> @@ -3801,6 +3839,52 @@ static inline bool ttwu_do_migrate(struc
>  	return true;
>  }
>  
> +static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
> +			struct rq_flags *rf)
> +{
> +	struct rq *p_rq = task_rq(p);
> +	int cpu;
> +
> +	/*
> +	 * Notably it is possible for on-rq entities to get migrated -- even
> +	 * sched_delayed ones. This should be rare though, so flip the locks
> +	 * rather than IPI chase after it.
> +	 */
> +	if (unlikely(rq != p_rq)) {
> +		rq_unlock(rq, rf);
> +		p_rq = __task_rq_lock(p, rf);
> +		update_rq_clock(p_rq);
> +	}
> +
> +	if (task_on_rq_queued(p))
> +		dequeue_task(p_rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> +
> +	/*
> +	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
> +	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
> +	 * rather than p->pi_lock.
> +	 */
> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);

There are 'lockdep_assert_held(&p->pi_lock)'s in select_task_rq() and
select_task_rq_fair() which should trigger IMHO? Can they be changed the
same way like  __task_rq_lock()?

> +	if (ttwu_do_migrate(rq, p, cpu))
> +		wake_flags |= WF_MIGRATED;

[...]

>  /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2280,6 +2280,7 @@ static inline int task_on_rq_migrating(s
>  #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
>  
>  #define WF_ON_CPU		0x0100

Looks like this is still not used. Not sure whether it can be removed or
you wanted to add a condition for this as well?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-08 12:44   ` Dietmar Eggemann
@ 2025-07-08 18:57     ` Peter Zijlstra
  2025-07-08 21:02     ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-08 18:57 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Tue, Jul 08, 2025 at 02:44:56PM +0200, Dietmar Eggemann wrote:
> On 02/07/2025 13:49, Peter Zijlstra wrote:
> 
> [...]
> 
> > @@ -3801,6 +3839,52 @@ static inline bool ttwu_do_migrate(struc
> >  	return true;
> >  }
> >  
> > +static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
> > +			struct rq_flags *rf)
> > +{
> > +	struct rq *p_rq = task_rq(p);
> > +	int cpu;
> > +
> > +	/*
> > +	 * Notably it is possible for on-rq entities to get migrated -- even
> > +	 * sched_delayed ones. This should be rare though, so flip the locks
> > +	 * rather than IPI chase after it.
> > +	 */
> > +	if (unlikely(rq != p_rq)) {
> > +		rq_unlock(rq, rf);
> > +		p_rq = __task_rq_lock(p, rf);
> > +		update_rq_clock(p_rq);
> > +	}
> > +
> > +	if (task_on_rq_queued(p))
> > +		dequeue_task(p_rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> > +
> > +	/*
> > +	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
> > +	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
> > +	 * rather than p->pi_lock.
> > +	 */
> > +	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> 
> There are 'lockdep_assert_held(&p->pi_lock)'s in select_task_rq() and
> select_task_rq_fair() which should trigger IMHO? Can they be changed the
> same way like  __task_rq_lock()?

And not a single robot has yet reported this :-(.. Yeah, let me go look.
Seeing how this was performance stuff, I clearly did not run enough
lockdep builds :/

> > +	if (ttwu_do_migrate(rq, p, cpu))
> > +		wake_flags |= WF_MIGRATED;
> 
> [...]
> 
> >  /*
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2280,6 +2280,7 @@ static inline int task_on_rq_migrating(s
> >  #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
> >  
> >  #define WF_ON_CPU		0x0100
> 
> Looks like this is still not used. Not sure whether it can be removed or
> you wanted to add a condition for this as well?

Bah, I'm sure I deleted that at some point. Let me try killing it again
:-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-08 12:44   ` Dietmar Eggemann
  2025-07-08 18:57     ` Peter Zijlstra
@ 2025-07-08 21:02     ` Peter Zijlstra
  1 sibling, 0 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-08 21:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
	vschneid, clm, linux-kernel

On Tue, Jul 08, 2025 at 02:44:56PM +0200, Dietmar Eggemann wrote:

> > +	/*
> > +	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
> > +	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
> > +	 * rather than p->pi_lock.
> > +	 */
> > +	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> 
> There are 'lockdep_assert_held(&p->pi_lock)'s in select_task_rq() and
> select_task_rq_fair() which should trigger IMHO? Can they be changed the
> same way like  __task_rq_lock()?

It needs a slightly different fix; notably the reason for these is the
stability of the cpumasks. For that holding either p->pi_lock or
rq->lock is sufficient.

Something a little like so...

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3557,13 +3557,13 @@ static int select_fallback_rq(int cpu, s
 	return dest_cpu;
 }
 
-/*
- * The caller (fork, wakeup) owns p->pi_lock, ->cpus_ptr is stable.
- */
 static inline
 int select_task_rq(struct task_struct *p, int cpu, int *wake_flags)
 {
-	lockdep_assert_held(&p->pi_lock);
+	/*
+	 * Ensure the sched_setaffinity() state is stable.
+	 */
+	lockdep_assert_sched_held(p);
 
 	if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
 		cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8499,7 +8499,7 @@ select_task_rq_fair(struct task_struct *
 	/*
 	 * required for stable ->cpus_allowed
 	 */
-	lockdep_assert_held(&p->pi_lock);
+	lockdep_assert_sched_held(p);
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1500,6 +1500,12 @@ static inline void lockdep_assert_rq_hel
 	lockdep_assert_held(__rq_lockp(rq));
 }
 
+static inline void lockdep_assert_sched_held(struct task_struct *p)
+{
+	lockdep_assert(lockdep_is_held(&p->pi_lock) ||
+		       lockdep_is_held(__rq_lockp(task_rq(p))));
+}
+
 extern void raw_spin_rq_lock_nested(struct rq *rq, int subclass);
 extern bool raw_spin_rq_trylock(struct rq *rq);
 extern void raw_spin_rq_unlock(struct rq *rq);

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks
  2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
  2025-07-03 16:00   ` Phil Auld
  2025-07-08 12:44   ` Dietmar Eggemann
@ 2025-07-23  5:42   ` Shrikanth Hegde
  2 siblings, 0 replies; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-23  5:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm



On 7/2/25 17:19, Peter Zijlstra wrote:
> One of the more expensive things to do is take a remote runqueue lock;
> which is exactly what ttwu_runnable() ends up doing. However, in the
> case of sched_delayed tasks it is possible to queue up an IPI instead.
> 
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20250520101727.984171377@infradead.org
> ---
>   include/linux/sched.h   |    1
>   kernel/sched/core.c     |   96 +++++++++++++++++++++++++++++++++++++++++++++---
>   kernel/sched/fair.c     |   17 ++++++++
>   kernel/sched/features.h |    1
>   kernel/sched/sched.h    |    1
>   5 files changed, 110 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -984,6 +984,7 @@ struct task_struct {
>   	 * ->sched_remote_wakeup gets used, so it can be in this word.
>   	 */
>   	unsigned			sched_remote_wakeup:1;
> +	unsigned			sched_remote_delayed:1;
>   #ifdef CONFIG_RT_MUTEXES
>   	unsigned			sched_rt_mutex:1;
>   #endif
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -675,7 +675,12 @@ struct rq *__task_rq_lock(struct task_st
>   {
>   	struct rq *rq;
>   
> -	lockdep_assert_held(&p->pi_lock);
> +	/*
> +	 * TASK_WAKING is used to serialize the remote end of wakeup, rather
> +	 * than p->pi_lock.
> +	 */
> +	lockdep_assert(p->__state == TASK_WAKING ||
> +		       lockdep_is_held(&p->pi_lock) != LOCK_STATE_NOT_HELD);
>   
>   	for (;;) {
>   		rq = task_rq(p);
> @@ -3727,6 +3732,8 @@ ttwu_do_activate(struct rq *rq, struct t
>   	}
>   }
>   
> +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> +
>   /*
>    * Consider @p being inside a wait loop:
>    *
> @@ -3754,6 +3761,35 @@ ttwu_do_activate(struct rq *rq, struct t
>    */
>   static int ttwu_runnable(struct task_struct *p, int wake_flags)
>   {
> +	if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> +		/*
> +		 * Similar to try_to_block_task():
> +		 *
> +		 * __schedule()				ttwu()
> +		 *   prev_state = prev->state		  if (p->sched_delayed)
> +		 *   if (prev_state)			     smp_acquire__after_ctrl_dep()
> +		 *     try_to_block_task()		     p->state = TASK_WAKING
> +		 *       ... set_delayed()
> +		 *         RELEASE p->sched_delayed = 1
> +		 *
> +		 * __schedule() and ttwu() have matching control dependencies.
> +		 *
> +		 * Notably, once we observe sched_delayed we know the task has
> +		 * passed try_to_block_task() and p->state is ours to modify.
> +		 *
> +		 * TASK_WAKING controls ttwu() concurrency.
> +		 */
> +		smp_acquire__after_ctrl_dep();
> +		WRITE_ONCE(p->__state, TASK_WAKING);
> +		/*
> +		 * Bit of a hack, see select_task_rq_fair()'s WF_DELAYED case.
> +		 */
> +		p->wake_cpu = smp_processor_id();
> +
> +		if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> +			return 1;
> +	}
> +
>   	CLASS(__task_rq_lock, guard)(p);
>   	struct rq *rq = guard.rq;
>   
> @@ -3776,6 +3812,8 @@ static int ttwu_runnable(struct task_str
>   	return 1;
>   }
>   
> +static void __ttwu_queue_wakelist(struct task_struct *p, int cpu);
> +
>   static inline bool ttwu_do_migrate(struct rq *rq, struct task_struct *p, int cpu)
>   {
>   	struct rq *p_rq = rq ? : task_rq(p);
> @@ -3801,6 +3839,52 @@ static inline bool ttwu_do_migrate(struc
>   	return true;
>   }
>   
> +static int ttwu_delayed(struct rq *rq, struct task_struct *p, int wake_flags,
> +			struct rq_flags *rf)
> +{
> +	struct rq *p_rq = task_rq(p);
> +	int cpu;
> +
> +	/*
> +	 * Notably it is possible for on-rq entities to get migrated -- even
> +	 * sched_delayed ones. This should be rare though, so flip the locks
> +	 * rather than IPI chase after it.
> +	 */
> +	if (unlikely(rq != p_rq)) {
> +		rq_unlock(rq, rf);
> +		p_rq = __task_rq_lock(p, rf);
> +		update_rq_clock(p_rq);
> +	}
> +
> +	if (task_on_rq_queued(p))
> +		dequeue_task(p_rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> +
> +	/*
> +	 * NOTE: unlike the regular try_to_wake_up() path, this runs both
> +	 * select_task_rq() and ttwu_do_migrate() while holding rq->lock
> +	 * rather than p->pi_lock.
> +	 */

When it comes here, it was because p->wake_cpu was a remote cpu and taking remote rq lock
would be costly.

So, when p->wake_cpu is passed, eventually we could end up fetching that rq again, such as in idle_cpu
checks, which also could be costly no?

Why there is need for select_task_rq here again? Why can't ttwu_do_activate be
done here instead?

> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> +	if (ttwu_do_migrate(rq, p, cpu))
> +		wake_flags |= WF_MIGRATED;
> +
> +	if (unlikely(rq != p_rq)) {
> +		__task_rq_unlock(p_rq, rf);
> +		rq_lock(rq, rf);
> +	}
> +
> +	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> +	p->sched_remote_delayed = 0;
> +
> +	/* it wants to run here */
> +	if (cpu_of(rq) == cpu)
> +		return 0;
> +
> +	/* shoot it to the CPU it wants to run on */
> +	__ttwu_queue_wakelist(p, cpu);
> +	return 1;
> +}
> +
>   void sched_ttwu_pending(void *arg)
>   {
>   	struct llist_node *llist = arg;
> @@ -3819,12 +3903,13 @@ void sched_ttwu_pending(void *arg)
>   		if (WARN_ON_ONCE(p->on_cpu))
>   			smp_cond_load_acquire(&p->on_cpu, !VAL);
>   
> -		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> -			set_task_cpu(p, cpu_of(rq));
> -
>   		if (p->sched_remote_wakeup)
>   			wake_flags |= WF_MIGRATED;
>   
> +		if (p->sched_remote_delayed &&
> +		    ttwu_delayed(rq, p, wake_flags | WF_DELAYED, &guard.rf))
> +			continue;
> +
>   		ttwu_do_activate(rq, p, wake_flags, &guard.rf);
>   	}
>   
> @@ -3964,12 +4049,13 @@ static inline bool ttwu_queue_cond(struc
>   
>   static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
>   {
> -	bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> +	bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
>   
>   	if (!ttwu_queue_cond(p, cpu, def))
>   		return false;
>   
>   	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> +	p->sched_remote_delayed = !!(wake_flags & WF_DELAYED);
>   
>   	__ttwu_queue_wakelist(p, cpu);
>   	return true;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5327,7 +5327,10 @@ static __always_inline void return_cfs_r
>   
>   static void set_delayed(struct sched_entity *se)
>   {
> -	se->sched_delayed = 1;
> +	/*
> +	 * See TTWU_QUEUE_DELAYED in ttwu_runnable().
> +	 */
> +	smp_store_release(&se->sched_delayed, 1);
>   
>   	/*
>   	 * Delayed se of cfs_rq have no tasks queued on them.
> @@ -8481,6 +8484,18 @@ select_task_rq_fair(struct task_struct *
>   	/* SD_flags and WF_flags share the first nibble */
>   	int sd_flag = wake_flags & 0xF;
>   
> +	if (wake_flags & WF_DELAYED) {
> +		/*
> +		 * This is the ttwu_delayed() case; where prev_cpu is in fact
> +		 * the CPU that did the wakeup, while @p is running on the
> +		 * current CPU.
> +		 *
> +		 * Make sure to flip them the right way around, otherwise
> +		 * wake-affine is going to do the wrong thing.
> +		 */
> +		swap(cpu, new_cpu);
> +	}
> +
>   	/*
>   	 * required for stable ->cpus_allowed
>   	 */
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
>   SCHED_FEAT(TTWU_QUEUE, true)
>   #endif
>   SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> +SCHED_FEAT(TTWU_QUEUE_DELAYED, true)
>   SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
>   
>   /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2280,6 +2280,7 @@ static inline int task_on_rq_migrating(s
>   #define WF_RQ_SELECTED		0x80 /* ->select_task_rq() was called */
>   
>   #define WF_ON_CPU		0x0100
> +#define WF_DELAYED		0x0200
>   
>   static_assert(WF_EXEC == SD_BALANCE_EXEC);
>   static_assert(WF_FORK == SD_BALANCE_FORK);
> 
> 
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (11 preceding siblings ...)
  2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
@ 2025-07-02 15:27 ` Chris Mason
  2025-07-07  9:05 ` Shrikanth Hegde
  2025-07-17 13:04 ` Beata Michalska
  14 siblings, 0 replies; 101+ messages in thread
From: Chris Mason @ 2025-07-02 15:27 UTC (permalink / raw)
  To: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel

On 7/2/25 7:49 AM, Peter Zijlstra wrote:
> Hi!
> 
> Previous version:
> 
>   https://lkml.kernel.org/r/20250520094538.086709102@infradead.org 
> 
> 
> Changes:
>  - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>  - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>  - fixed lockdep splat (dietmar)
>  - added a few preperatory patches
> 
> 
> Patches apply on top of tip/master (which includes the disabling of private futex)
> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> 
> Performance is similar to the last version; as tested on my SPR on v6.15 base:

Thanks for working on these! I'm on vacation until July 14th, but I'll
give them a shot when I'm back in the office.

-chris


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (12 preceding siblings ...)
  2025-07-02 15:27 ` [PATCH v2 00/12] sched: Address schbench regression Chris Mason
@ 2025-07-07  9:05 ` Shrikanth Hegde
  2025-07-07  9:11   ` Peter Zijlstra
                     ` (2 more replies)
  2025-07-17 13:04 ` Beata Michalska
  14 siblings, 3 replies; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-07  9:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm



On 7/2/25 17:19, Peter Zijlstra wrote:
> Hi!
> 
> Previous version:
> 
>    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> 
> 
> Changes:
>   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>   - fixed lockdep splat (dietmar)
>   - added a few preperatory patches
> 
> 
> Patches apply on top of tip/master (which includes the disabling of private futex)
> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> 
> Performance is similar to the last version; as tested on my SPR on v6.15 base:
>


Hi Peter,
Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.

I see significant regression in schbench. let me know if i have to test different
number of threads based on the system size.
Will go through the series and will try a bisect meanwhile.


schbench command used and varied 16,32,64,128 as thread groups.
./schbench -L -m 4 -M auto -n 0 -r 60 -t <thread_groups>


base: commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56 (origin/master, origin/HEAD)
Merge: 11119b0b378a 94b59d5f567a
Author: Borislav Petkov (AMD) <bp@alien8.de>
Date:   Sat Jul 5 19:24:35 2025 +0200

     Merge irq/drivers into tip/master


====================================
16 threads   base       base+series
====================================
                              
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:       7.20,      12.40(-72.22)
90.0th:      14.00,      32.60(-132.86)
99.0th:      23.80,      56.00(-135.29)
99.9th:      33.80,      74.80(-121.30)

RPS percentiles (requests) runtime 30 (s)
20.0th:  381235.20,  350720.00(-8.00)
50.0th:  382054.40,  353996.80(-7.34)
90.0th:  382464.00,  356044.80(-6.91)

====================================
32 threads   base       base+series
====================================
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:       9.00,      47.60(-428.89)
90.0th:      19.00,     104.00(-447.37)
99.0th:      32.00,     144.20(-350.62)
99.9th:      46.00,     178.20(-287.39)

RPS percentiles (requests) runtime 30 (s)
20.0th:  763699.20,  515379.20(-32.52)
50.0th:  764928.00,  519168.00(-32.13)
90.0th:  766156.80,  530227.20(-30.79)


====================================
64 threads   base       base+series
====================================
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:      13.40,     112.80(-741.79)
90.0th:      25.00,     216.00(-764.00)
99.0th:      38.40,     282.00(-634.38)
99.9th:      60.00,     331.40(-452.33)

RPS percentiles (requests) runtime 30 (s)
20.0th: 1500364.80,  689152.00(-54.07)
50.0th: 1501184.00,  693248.00(-53.82)
90.0th: 1502822.40,  695296.00(-53.73)


====================================
128 threads   base       base+series
====================================
Wakeup Latencies percentiles (usec) runtime 30 (s)
50.0th:      22.00,     168.80(-667.27)
90.0th:      43.60,     320.60(-635.32)
99.0th:      71.40,     395.60(-454.06)
99.9th:     100.00,     445.40(-345.40)

RPS percentiles (requests) runtime 30 (s)
20.0th: 2686156.80, 1034854.40(-61.47)
50.0th: 2730393.60, 1057587.20(-61.27)
90.0th: 2763161.60, 1084006.40(-60.77)


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-07  9:05 ` Shrikanth Hegde
@ 2025-07-07  9:11   ` Peter Zijlstra
  2025-07-07  9:38     ` Shrikanth Hegde
  2025-07-07 18:19   ` Shrikanth Hegde
  2025-07-08 15:09   ` Chris Mason
  2 siblings, 1 reply; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-07  9:11 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm

On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 7/2/25 17:19, Peter Zijlstra wrote:
> > Hi!
> > 
> > Previous version:
> > 
> >    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > 
> > 
> > Changes:
> >   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> >   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> >   - fixed lockdep splat (dietmar)
> >   - added a few preperatory patches
> > 
> > 
> > Patches apply on top of tip/master (which includes the disabling of private futex)
> > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > 
> > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > 
> 
> 
> Hi Peter,
> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> 
> I see significant regression in schbench. let me know if i have to test different
> number of threads based on the system size.
> Will go through the series and will try a bisect meanwhile.

Urgh, those are terrible numbers :/

What do the caches look like on that setup? Obviously all the 8 SMT
(is this the supercore that glues two SMT4 things together for backwards
compat?) share some cache, but is there some shared cache between the
cores?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-07  9:11   ` Peter Zijlstra
@ 2025-07-07  9:38     ` Shrikanth Hegde
  2025-07-16 13:46       ` Phil Auld
  0 siblings, 1 reply; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-07  9:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm



On 7/7/25 14:41, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 7/2/25 17:19, Peter Zijlstra wrote:
>>> Hi!
>>>
>>> Previous version:
>>>
>>>     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>>
>>>
>>> Changes:
>>>    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>>    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>>    - fixed lockdep splat (dietmar)
>>>    - added a few preperatory patches
>>>
>>>
>>> Patches apply on top of tip/master (which includes the disabling of private futex)
>>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>>
>>> Performance is similar to the last version; as tested on my SPR on v6.15 base:
>>>
>>
>>
>> Hi Peter,
>> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
>>
>> I see significant regression in schbench. let me know if i have to test different
>> number of threads based on the system size.
>> Will go through the series and will try a bisect meanwhile.
> 
> Urgh, those are terrible numbers :/
> 
> What do the caches look like on that setup? Obviously all the 8 SMT
> (is this the supercore that glues two SMT4 things together for backwards
> compat?) share some cache, but is there some shared cache between the
> cores?

It is a supercore(we call it as bigcore) which glues two SMT4 cores. LLC 
is per SMT4 core. So from scheduler perspective system is 10 cores (SMT4)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-07  9:38     ` Shrikanth Hegde
@ 2025-07-16 13:46       ` Phil Auld
  2025-07-17 17:25         ` Phil Auld
  0 siblings, 1 reply; 101+ messages in thread
From: Phil Auld @ 2025-07-16 13:46 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Peter Zijlstra, linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm


Hi Peter,

On Mon, Jul 07, 2025 at 03:08:08PM +0530 Shrikanth Hegde wrote:
> 
> 
> On 7/7/25 14:41, Peter Zijlstra wrote:
> > On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
> > > 
> > > 
> > > On 7/2/25 17:19, Peter Zijlstra wrote:
> > > > Hi!
> > > > 
> > > > Previous version:
> > > > 
> > > >     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > > > 
> > > > 
> > > > Changes:
> > > >    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> > > >    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> > > >    - fixed lockdep splat (dietmar)
> > > >    - added a few preperatory patches
> > > > 
> > > > 
> > > > Patches apply on top of tip/master (which includes the disabling of private futex)
> > > > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > > > 
> > > > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > > > 
> > > 
> > > 
> > > Hi Peter,
> > > Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> > > 
> > > I see significant regression in schbench. let me know if i have to test different
> > > number of threads based on the system size.
> > > Will go through the series and will try a bisect meanwhile.
> > 
> > Urgh, those are terrible numbers :/
> > 
> > What do the caches look like on that setup? Obviously all the 8 SMT
> > (is this the supercore that glues two SMT4 things together for backwards
> > compat?) share some cache, but is there some shared cache between the
> > cores?
> 
> It is a supercore(we call it as bigcore) which glues two SMT4 cores. LLC is
> per SMT4 core. So from scheduler perspective system is 10 cores (SMT4)
> 

We've confirmed the issue with schbench on EPYC hardware. It's not limited
to PPC systems, although this system may also have interesting caching. 
We don't see issues with our other tests.

---------------

Here are the latency reports from schbench on a single-socket AMD EPYC
9655P server with 96 cores and 192 CPUs.

Results for this test:
./schbench/schbench -L -m 4 -t 192 -i 30 -r 30

6.15.0-rc6  baseline
threads  wakeup_99_usec  request_99_usec
1        5               3180
16       5               3996
64       3452            14256
128      7112            32960
192      11536           46016

6.15.0-rc6.pz_fixes2 (with 12 part series))
threads  wakeup_99_usec  request_99_usec
1        5               3172
16       5               3844
64       3348            17376
128      21024           100480
192      44224           176384

For 128 and 192 threads, Wakeup and Request latencies increased by a factor of
3x.

We're testing now with NO_TTWU_QUEUE_DELAYED and I'll try to report on
that when we have results. 

Cheers,
Phil
-- 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-16 13:46       ` Phil Auld
@ 2025-07-17 17:25         ` Phil Auld
  0 siblings, 0 replies; 101+ messages in thread
From: Phil Auld @ 2025-07-17 17:25 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Peter Zijlstra, linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm

On Wed, Jul 16, 2025 at 09:46:40AM -0400 Phil Auld wrote:
> 
> Hi Peter,
> 
> On Mon, Jul 07, 2025 at 03:08:08PM +0530 Shrikanth Hegde wrote:
> > 
> > 
> > On 7/7/25 14:41, Peter Zijlstra wrote:
> > > On Mon, Jul 07, 2025 at 02:35:38PM +0530, Shrikanth Hegde wrote:
> > > > 
> > > > 
> > > > On 7/2/25 17:19, Peter Zijlstra wrote:
> > > > > Hi!
> > > > > 
> > > > > Previous version:
> > > > > 
> > > > >     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > > > > 
> > > > > 
> > > > > Changes:
> > > > >    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> > > > >    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> > > > >    - fixed lockdep splat (dietmar)
> > > > >    - added a few preperatory patches
> > > > > 
> > > > > 
> > > > > Patches apply on top of tip/master (which includes the disabling of private futex)
> > > > > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > > > > 
> > > > > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > > > > 
> > > > 
> > > > 
> > > > Hi Peter,
> > > > Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> > > > 
> > > > I see significant regression in schbench. let me know if i have to test different
> > > > number of threads based on the system size.
> > > > Will go through the series and will try a bisect meanwhile.
> > > 
> > > Urgh, those are terrible numbers :/
> > > 
> > > What do the caches look like on that setup? Obviously all the 8 SMT
> > > (is this the supercore that glues two SMT4 things together for backwards
> > > compat?) share some cache, but is there some shared cache between the
> > > cores?
> > 
> > It is a supercore(we call it as bigcore) which glues two SMT4 cores. LLC is
> > per SMT4 core. So from scheduler perspective system is 10 cores (SMT4)
> > 
> 
> We've confirmed the issue with schbench on EPYC hardware. It's not limited
> to PPC systems, although this system may also have interesting caching. 
> We don't see issues with our other tests.
> 
> ---------------
> 
> Here are the latency reports from schbench on a single-socket AMD EPYC
> 9655P server with 96 cores and 192 CPUs.
> 
> Results for this test:
> ./schbench/schbench -L -m 4 -t 192 -i 30 -r 30
> 
> 6.15.0-rc6  baseline
> threads  wakeup_99_usec  request_99_usec
> 1        5               3180
> 16       5               3996
> 64       3452            14256
> 128      7112            32960
> 192      11536           46016
> 
> 6.15.0-rc6.pz_fixes2 (with 12 part series))
> threads  wakeup_99_usec  request_99_usec
> 1        5               3172
> 16       5               3844
> 64       3348            17376
> 128      21024           100480
> 192      44224           176384
> 
> For 128 and 192 threads, Wakeup and Request latencies increased by a factor of
> 3x.
> 
> We're testing now with NO_TTWU_QUEUE_DELAYED and I'll try to report on
> that when we have results. 
>

To follow up on this: With NO_TTWU_QUEUE_DELAYED the above latency issues
with schbench go away.

In addition, the randwrite regression we were having with delayed tasks
remains resolved.  And the assorted small gains here and there are still
present. 

Overall with NO_TTWU_QUEUE_DELAYED this series is helpful. We'd probably
make that the default if it got merged as is.  But maybe there is no
need for that part of the code.  


Thanks,
Phil

> Cheers,
> Phil
> -- 
> 
> 

-- 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-07  9:05 ` Shrikanth Hegde
  2025-07-07  9:11   ` Peter Zijlstra
@ 2025-07-07 18:19   ` Shrikanth Hegde
  2025-07-08 19:02     ` Peter Zijlstra
  2025-07-08 15:09   ` Chris Mason
  2 siblings, 1 reply; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-07 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm



On 7/7/25 14:35, Shrikanth Hegde wrote:
> 
> 
> On 7/2/25 17:19, Peter Zijlstra wrote:
>> Hi!
>>
>> Previous version:
>>
>>    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>
>>
>> Changes:
>>   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>   - fixed lockdep splat (dietmar)
>>   - added a few preperatory patches
>>
>>
>> Patches apply on top of tip/master (which includes the disabling of 
>> private futex)
>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>
>> Performance is similar to the last version; as tested on my SPR on 
>> v6.15 base:
>>
> 
> 
> Hi Peter,
> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> 
> I see significant regression in schbench. let me know if i have to test 
> different
> number of threads based on the system size.
> Will go through the series and will try a bisect meanwhile.
> 
> 

Used "./schbench -L -m 4 -M auto -t 64 -n 0 -t 60 -i 60" for git bisect.
Also kept HZ=1000


Git bisect points to
# first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks

Note:
at commit: "sched: Change ttwu_runnable() vs sched_delayed" there is a small regression.

-------------------------------------
Numbers at different commits:
-------------------------------------
commit 8784fb5fa2e0042fe3b1632d4876e1037b695f56      <<<< baseline
Merge: 11119b0b378a 94b59d5f567a
Author: Borislav Petkov (AMD) <bp@alien8.de>
Date:   Sat Jul 5 19:24:35 2025 +0200

     Merge irq/drivers into tip/master

Wakeup Latencies percentiles (usec) runtime 30 (s) (39778894 total samples)
           50.0th: 14         (11798914 samples)
           90.0th: 27         (15931329 samples)
         * 99.0th: 42         (3032865 samples)
           99.9th: 64         (346598 samples)

RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1394688    (18 samples)
         * 50.0th: 1394688    (0 samples)
           90.0th: 1398784    (11 samples)

--------------------------------------

commit 88ca74dd6fe5d5b03647afb4698238e4bec3da39 (HEAD)       <<< Still good commit
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 2 13:49:34 2025 +0200

     sched: Use lock guard in sched_ttwu_pending()

Wakeup Latencies percentiles (usec) runtime 30 (s) (40132792 total samples)
           50.0th: 14         (11986044 samples)
           90.0th: 27         (15143836 samples)
         * 99.0th: 46         (3267133 samples)
           99.9th: 72         (333940 samples)
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1402880    (23 samples)
         * 50.0th: 1402880    (0 samples)
           90.0th: 1406976    (8 samples)

-----------------------------------------------------------------

commit 755d11feca4544b4bc6933dcdef29c41585fa747        <<< There is a small regression.
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 2 13:49:35 2025 +0200

     sched: Change ttwu_runnable() vs sched_delayed

Wakeup Latencies percentiles (usec) runtime 30 (s) (39308991 total samples)
           50.0th: 18         (12991812 samples)
           90.0th: 34         (14381736 samples)
         * 99.0th: 56         (3399332 samples)
           99.9th: 84         (342508 samples)

RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1353728    (21 samples)
         * 50.0th: 1353728    (0 samples)
           90.0th: 1357824    (10 samples)

-----------------------------------------------------------

commit dc968ba0544889883d0912360dd72d90f674c140              <<<< Major regression
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Jul 2 13:49:36 2025 +0200

     sched: Add ttwu_queue support for delayed tasks

Wakeup Latencies percentiles (usec) runtime 30 (s) (19818598 total samples)
           50.0th: 111        (5891601 samples)
           90.0th: 214        (7947099 samples)
         * 99.0th: 283        (1749294 samples)
           99.9th: 329        (177336 samples)

RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 654336     (7 samples)
         * 50.0th: 660480     (11 samples)
           90.0th: 666624     (11 samples)
  



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-07 18:19   ` Shrikanth Hegde
@ 2025-07-08 19:02     ` Peter Zijlstra
  2025-07-09 16:46       ` Shrikanth Hegde
                         ` (2 more replies)
  0 siblings, 3 replies; 101+ messages in thread
From: Peter Zijlstra @ 2025-07-08 19:02 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm

On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:

> Git bisect points to
> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks

Moo.. Are IPIs particularly expensive on your platform?

The 5 cores makes me think this is a partition of sorts, but IIRC the
power LPAR stuff was fixed physical, so routing interrupts shouldn't be
much more expensive vs native hardware.

> Note:
> at commit: "sched: Change ttwu_runnable() vs sched_delayed" there is a small regression.

Yes, that was more or less expected. I also see a dip because of that
patch, but its small compared to the gains gotten by the previous
patches -- so I was hoping I'd get away with it :-)

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-08 19:02     ` Peter Zijlstra
@ 2025-07-09 16:46       ` Shrikanth Hegde
  2025-07-14 17:54       ` Shrikanth Hegde
  2025-07-21 19:37       ` Shrikanth Hegde
  2 siblings, 0 replies; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-09 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm,
	Madhavan Srinivasan



On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
> 
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
> 
> Moo.. Are IPIs particularly expensive on your platform?
> 
> The 5 cores makes me think this is a partition of sorts, but IIRC the
> power LPAR stuff was fixed physical, so routing interrupts shouldn't be
> much more expensive vs native hardware.
> 

Yes, we call it as dedicated LPAR. (Hypervisor optimises such that overhead is minimal,
i think that i true for interrupts too).


Some more variations of testing and numbers:

The system had some configs which i had messed up such as CONFIG_SCHED_SMT=n. I copied the default
distro config back and ran the benchmark again. Slightly better numbers compared to earlier.
Still a major regression. Collected mpstat numbers. It shows much less percentage compared to
earlier.

--------------------------------------------------------------------------
base: 8784fb5fa2e0 (tip/master)

Wakeup Latencies percentiles (usec) runtime 30 (s) (41567569 total samples)
           50.0th: 11         (10767158 samples)
           90.0th: 22         (16782627 samples)
         * 99.0th: 36         (3347363 samples)
           99.9th: 52         (344977 samples)
           min=1, max=731
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1443840    (31 samples)
         * 50.0th: 1443840    (0 samples)
           90.0th: 1443840    (0 samples)
           min=1433480, max=1444037
average rps: 1442889.23

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    3.24    0.00   11.39    0.00   37.30    0.00    0.00    0.00    0.00   48.07
all    2.59    0.00   11.56    0.00   37.62    0.00    0.00    0.00    0.00   48.23



base + clm's patch + series:
Wakeup Latencies percentiles (usec) runtime 30 (s) (27166787 total samples)
           50.0th: 57         (8242048 samples)
           90.0th: 120        (10677365 samples)
         * 99.0th: 182        (2435082 samples)
           99.9th: 262        (241664 samples)
           min=1, max=89984
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 896000     (8 samples)
         * 50.0th: 902144     (10 samples)
           90.0th: 928768     (10 samples)
           min=881548, max=971101
average rps: 907530.10                                               <<< close to 40% drop in RPS.

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
all    1.95    0.00    7.67    0.00   14.84    0.00    0.00    0.00    0.00   75.55
all    1.61    0.00    7.91    0.00   13.53    0.05    0.00    0.00    0.00   76.90

-----------------------------------------------------------------------------

- To be sure, I tried on another system. That system had 30 cores.

base:
Wakeup Latencies percentiles (usec) runtime 30 (s) (40339785 total samples)
           50.0th: 12         (12585268 samples)
           90.0th: 24         (15194626 samples)
         * 99.0th: 44         (3206872 samples)
           99.9th: 59         (320508 samples)
           min=1, max=1049
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1320960    (14 samples)
         * 50.0th: 1333248    (2 samples)
           90.0th: 1386496    (12 samples)
           min=1309615, max=1414281

base + clm's patch + series:
Wakeup Latencies percentiles (usec) runtime 30 (s) (34318584 total samples)
           50.0th: 23         (10486283 samples)
           90.0th: 64         (13436248 samples)
         * 99.0th: 122        (3039318 samples)
           99.9th: 166        (306231 samples)
           min=1, max=7255
RPS percentiles (requests) runtime 30 (s) (31 total samples)
           20.0th: 1006592    (8 samples)
         * 50.0th: 1239040    (9 samples)
           90.0th: 1259520    (11 samples)
           min=852462, max=1268841
average rps: 1144229.23                                             << close 10-15% drop in RPS


- Then I resized that 30 core LPAR into a 5 core LPAR to see if the issue pops up in a smaller
config. It did. I see similar regression of 40-50% drop in RPS.

- Then I made it as 6 core system. To see if this is due to any ping pong because of odd numbers.
Numbers are similar to 5 core case.

- Maybe regressions is higher in smaller configurations.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-08 19:02     ` Peter Zijlstra
  2025-07-09 16:46       ` Shrikanth Hegde
@ 2025-07-14 17:54       ` Shrikanth Hegde
  2025-07-21 19:37       ` Shrikanth Hegde
  2 siblings, 0 replies; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-14 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, clm



On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
> 
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
> 
> Moo.. Are IPIs particularly expensive on your platform?
> 
> The 5 cores makes me think this is a partition of sorts, but IIRC the
> power LPAR stuff was fixed physical, so routing interrupts shouldn't be
> much more expensive vs native hardware.
> 
Some more data from the regression. I am looking at rps numbers
while running ./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5.
All the data is from an LPAR(VM) with 5 cores.


echo TTWU_QUEUE_DELAYED > features
average rps: 970491.00

echo NO_TTWU_QUEUE_DELAYED > features
current rps: 1555456.78

So below data points are with feature enabled or disabled with series applied + clm's patch.
-------------------------------------------------------
./hardirqs

TTWU_QUEUE_DELAYED
HARDIRQ                    TOTAL_usecs
env2                               816
IPI-2                          1421603       << IPI are less compared to with feature.


NO_TTWU_QUEUE_DELAYED
HARDIRQ                    TOTAL_usecs
ibmvscsi                             8
env2                               266
IPI-2                          6489980

-------------------------------------------------------

Disabled all the idle states. Regression still exits.

-------------------------------------------------------

See this warning everytime i run schbench:  This happens with PATCH 12/12 only.

It is triggering this warning. Some clock update is getting messed up?

1637 static inline void assert_clock_updated(struct rq *rq)
1638 {
1639         /*
1640          * The only reason for not seeing a clock update since the
1641          * last rq_pin_lock() is if we're currently skipping updates.
1642          */
1643         WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP);
1644 }
  

WARNING: kernel/sched/sched.h:1643 at update_load_avg+0x424/0x48c, CPU#6: swapper/6/0
CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Kdump: loaded Not tainted 6.16.0-rc4+ #276 PREEMPT(voluntary)
NIP:  c0000000001cea60 LR: c0000000001d7254 CTR: c0000000001d77b0
REGS: c000000003a674c0 TRAP: 0700   Not tainted  (6.16.0-rc4+)
MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 28008208  XER: 20040000
CFAR: c0000000001ce68c IRQMASK: 3
GPR00: c0000000001d7254 c000000003a67760 c000000001bc8100 c000000061915400
GPR04: c00000008c80f480 0000000000000005 c000000003a679b0 0000000000000000
GPR08: 0000000000000001 0000000000000000 c0000003ff14d480 0000000000004000
GPR12: c0000000001d77b0 c0000003ffff7880 0000000000000000 000000002eef18c0
GPR16: 0000000000000006 0000000000000006 0000000000000008 c000000002ca2468
GPR20: 0000000000000000 0000000000000004 0000000000000009 0000000000000001
GPR24: 0000000000000000 0000000000000001 0000000000000001 c0000003ff14d480
GPR28: 0000000000000001 0000000000000005 c00000008c80f480 c000000061915400
NIP [c0000000001cea60] update_load_avg+0x424/0x48c
LR [c0000000001d7254] enqueue_entity+0x5c/0x5b8
Call Trace:
[c000000003a67760] [c000000003a677d0] 0xc000000003a677d0 (unreliable)
[c000000003a677d0] [c0000000001d7254] enqueue_entity+0x5c/0x5b8
[c000000003a67880] [c0000000001d7918] enqueue_task_fair+0x168/0x7d8
[c000000003a678f0] [c0000000001b9554] enqueue_task+0x5c/0x1c8
[c000000003a67930] [c0000000001c3f40] ttwu_do_activate+0x98/0x2fc
[c000000003a67980] [c0000000001c4460] sched_ttwu_pending+0x2bc/0x72c
[c000000003a67a60] [c0000000002c16ac] __flush_smp_call_function_queue+0x1a0/0x750
[c000000003a67b10] [c00000000005e1c4] smp_ipi_demux_relaxed+0xec/0xf4
[c000000003a67b50] [c000000000057dd4] doorbell_exception+0xe0/0x25c
[c000000003a67b90] [c0000000000383d0] __replay_soft_interrupts+0xf0/0x154
[c000000003a67d40] [c000000000038684] arch_local_irq_restore.part.0+0x1cc/0x214
[c000000003a67d90] [c0000000001b6ec8] finish_task_switch.isra.0+0xb4/0x2f8
[c000000003a67e30] [c00000000110fb9c] __schedule+0x294/0x83c
[c000000003a67ee0] [c0000000011105f0] schedule_idle+0x3c/0x64
[c000000003a67f10] [c0000000001f27f0] do_idle+0x15c/0x1ac
[c000000003a67f60] [c0000000001f2b08] cpu_startup_entry+0x4c/0x50
[c000000003a67f90] [c00000000005ede0] start_secondary+0x284/0x288
[c000000003a67fe0] [c00000000000e058] start_secondary_prolog+0x10/0x14

----------------------------------------------------------------

perf stat -a:  ( idle states enabled)

TTWU_QUEUE_DELAYED:

         13,612,930      context-switches                 #    0.000 /sec
            912,737      cpu-migrations                   #    0.000 /sec
              1,245      page-faults                      #    0.000 /sec
    449,817,741,085      cycles
    137,051,199,092      instructions                     #    0.30  insn per cycle
     25,789,965,217      branches                         #    0.000 /sec
        286,202,628      branch-misses                    #    1.11% of all branches

NO_TTWU_QUEUE_DELAYED:

         24,782,786      context-switches                 #    0.000 /sec
          4,697,384      cpu-migrations                   #    0.000 /sec
              1,250      page-faults                      #    0.000 /sec
    701,934,506,023      cycles
    220,728,025,829      instructions                     #    0.31  insn per cycle
     40,271,327,989      branches                         #    0.000 /sec
        474,496,395      branch-misses                    #    1.18% of all branches

both cycles and instructions are low.

-------------------------------------------------------------------

perf stat -a:  ( idle states disabled)

TTWU_QUEUE_DELAYED:
            
         15,402,193      context-switches                 #    0.000 /sec
          1,237,128      cpu-migrations                   #    0.000 /sec
              1,245      page-faults                      #    0.000 /sec
    781,215,992,865      cycles
    149,112,303,840      instructions                     #    0.19  insn per cycle
     28,240,010,182      branches                         #    0.000 /sec
        294,485,795      branch-misses                    #    1.04% of all branches

NO_TTWU_QUEUE_DELAYED:

         25,332,898      context-switches                 #    0.000 /sec
          4,756,682      cpu-migrations                   #    0.000 /sec
              1,256      page-faults                      #    0.000 /sec
    781,318,730,494      cycles
    220,536,732,094      instructions                     #    0.28  insn per cycle
     40,424,495,545      branches                         #    0.000 /sec
        446,724,952      branch-misses                    #    1.11% of all branches

Since idle states are disabled, cycles are always spent on CPU. so cycles are more or less, while instruction
differs. Does it mean with feature enabled, is there a lock(maybe rq) for too long?

--------------------------------------------------------------------

Will try to gather more into why is this happening.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-08 19:02     ` Peter Zijlstra
  2025-07-09 16:46       ` Shrikanth Hegde
  2025-07-14 17:54       ` Shrikanth Hegde
@ 2025-07-21 19:37       ` Shrikanth Hegde
  2025-07-22 20:20         ` Chris Mason
  2 siblings, 1 reply; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-21 19:37 UTC (permalink / raw)
  To: Peter Zijlstra, clm
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid

On 7/9/25 00:32, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 11:49:17PM +0530, Shrikanth Hegde wrote:
> 
>> Git bisect points to
>> # first bad commit: [dc968ba0544889883d0912360dd72d90f674c140] sched: Add ttwu_queue support for delayed tasks
> 
> Moo.. Are IPIs particularly expensive on your platform?
> 
>
It seems like the cost of IPIs is likely hurting here.

IPI latency really depends on whether CPU was busy, shallow idle state or deep idle state.
When it is in deep idle state numbers show close to 5-8us on average on this small system.
When system is busy, (could be doing another schbench thread) is around 1-2us.

Measured the time it took for taking the remote rq lock in baseline, that is around 1-1.5us only.
Also, here LLC is small core.(SMT4 core). So quite often the series would choose to send IPI.

Did one more experiment, pin worker and message thread such that it always sends IPI.

NO_TTWU_QUEUE_DELAYED

./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5
average rps: 1549224.72
./schbench -L -m 4 -M 0-3 -W 4-39 -t 64 -n 0 -r 5 -i 5
average rps: 1560839.00

TTWU_QUEUE_DELAYED

./schbench -L -m 4 -M auto -t 64 -n 0 -r 5 -i 5             << IPI could be sent quite often ***
average rps: 959522.31
./schbench -L -m 4 -M 0-3 -W 4-39 -t 64 -n 0 -r 5 -i 5      << IPI are always sent. (M,W) don't share cache.
average rps: 470865.00                                      << rps goes even lower

=================================

*** issues/observations in schbench.

Chris,

When one does -W auto or -M auto i think code is meant to run, n message threads on first n CPUs and worker threads
on remaining CPUs?
I don't see that happening.  above behavior can be achieved only with -M <cpus> -W <cpus>

         int i = 0;
         CPU_ZERO(m_cpus);
         for (int i = 0; i < m_threads; ++i) {
                 CPU_SET(i, m_cpus);
                 CPU_CLR(i, w_cpus);
         }
         for (; i < CPU_SETSIZE; i++) {             << here i refers to the one in scope. which is 0. Hence w_cpus is set for all cpus.
                                                       And hence workers end up running on all CPUs even with -W auto
                 CPU_SET(i, w_cpus);
         }

Another issue, is that if CPU0 if offline, then auto pinning fails. Maybe no one cares about that case?

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-21 19:37       ` Shrikanth Hegde
@ 2025-07-22 20:20         ` Chris Mason
  2025-07-24 18:23           ` Chris Mason
  0 siblings, 1 reply; 101+ messages in thread
From: Chris Mason @ 2025-07-22 20:20 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid

On 7/21/25 12:37 PM, Shrikanth Hegde wrote:

> *** issues/observations in schbench.
> 
> Chris,
> 
> When one does -W auto or -M auto i think code is meant to run, n message
> threads on first n CPUs and worker threads
> on remaining CPUs?
> I don't see that happening.  above behavior can be achieved only with -M
> <cpus> -W <cpus>
> 
>         int i = 0;
>         CPU_ZERO(m_cpus);
>         for (int i = 0; i < m_threads; ++i) {
>                 CPU_SET(i, m_cpus);
>                 CPU_CLR(i, w_cpus);
>         }
>         for (; i < CPU_SETSIZE; i++) {             << here i refers to
> the one in scope. which is 0. Hence w_cpus is set for all cpus.
>                                                       And hence workers
> end up running on all CPUs even with -W auto
>                 CPU_SET(i, w_cpus);
>         }

Oh, you're exactly right.  Fixing this up, thanks.  I'll do some runs to
see if this changes things on my test boxes as well.

> 
> 
> Another issue, is that if CPU0 if offline, then auto pinning fails.
> Maybe no one cares about that case?

The auto pinning is pretty simple right now, I'm planning on making it
numa/ccx aware.  Are CPUs offline enough on test systems that we want to
worry about that?

-chris


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-22 20:20         ` Chris Mason
@ 2025-07-24 18:23           ` Chris Mason
  0 siblings, 0 replies; 101+ messages in thread
From: Chris Mason @ 2025-07-24 18:23 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid

On 7/22/25 1:20 PM, Chris Mason wrote:
> On 7/21/25 12:37 PM, Shrikanth Hegde wrote:
> 
>> *** issues/observations in schbench.
>>
>> Chris,
>>
>> When one does -W auto or -M auto i think code is meant to run, n message
>> threads on first n CPUs and worker threads
>> on remaining CPUs?
>> I don't see that happening.  above behavior can be achieved only with -M
>> <cpus> -W <cpus>
>>
>>         int i = 0;
>>         CPU_ZERO(m_cpus);
>>         for (int i = 0; i < m_threads; ++i) {
>>                 CPU_SET(i, m_cpus);
>>                 CPU_CLR(i, w_cpus);
>>         }
>>         for (; i < CPU_SETSIZE; i++) {             << here i refers to
>> the one in scope. which is 0. Hence w_cpus is set for all cpus.
>>                                                       And hence workers
>> end up running on all CPUs even with -W auto
>>                 CPU_SET(i, w_cpus);
>>         }
> 
> Oh, you're exactly right.  Fixing this up, thanks.  I'll do some runs to
> see if this changes things on my test boxes as well.

Fixing this makes it substantially slower (5.2M RPS -> 3.8M RPS), with
more time spent in select_task_rq().  I need to trace a bit to
understand if the message thread CPUs are actually getting used that
often for workers, or if the exclusion makes our idle CPU hunt slower
somehow.

-chris


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-07  9:05 ` Shrikanth Hegde
  2025-07-07  9:11   ` Peter Zijlstra
  2025-07-07 18:19   ` Shrikanth Hegde
@ 2025-07-08 15:09   ` Chris Mason
  2025-07-08 17:29     ` Shrikanth Hegde
  2 siblings, 1 reply; 101+ messages in thread
From: Chris Mason @ 2025-07-08 15:09 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid

On 7/7/25 5:05 AM, Shrikanth Hegde wrote:
> 
> 
> On 7/2/25 17:19, Peter Zijlstra wrote:
>> Hi!
>>
>> Previous version:
>>
>>    https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>
>> Changes:
>>   - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>   - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>   - fixed lockdep splat (dietmar)
>>   - added a few preperatory patches
>>
>>
>> Patches apply on top of tip/master (which includes the disabling of
>> private futex)
>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>
>> Performance is similar to the last version; as tested on my SPR on
>> v6.15 base:
>>
> 
> 
> Hi Peter,
> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
> 
> I see significant regression in schbench. let me know if i have to test
> different
> number of threads based on the system size.
> Will go through the series and will try a bisect meanwhile.

Not questioning the git bisect results you had later in this thread, but
double checking that you had the newidle balance patch in place that
Peter mentioned?

https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/

The newidle balance frequency changes the cost of everything else, so I
wanted to make sure we were measuring the same things.

Thanks!

-chris


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-08 15:09   ` Chris Mason
@ 2025-07-08 17:29     ` Shrikanth Hegde
  0 siblings, 0 replies; 101+ messages in thread
From: Shrikanth Hegde @ 2025-07-08 17:29 UTC (permalink / raw)
  To: Chris Mason, Peter Zijlstra
  Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid



On 7/8/25 20:39, Chris Mason wrote:
> On 7/7/25 5:05 AM, Shrikanth Hegde wrote:
>>
>>
>> On 7/2/25 17:19, Peter Zijlstra wrote:
>>> Hi!
>>>
>>> Previous version:
>>>
>>>     https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
>>>
>>> Changes:
>>>    - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>>>    - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>>>    - fixed lockdep splat (dietmar)
>>>    - added a few preperatory patches
>>>
>>>
>>> Patches apply on top of tip/master (which includes the disabling of
>>> private futex)
>>> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
>>>
>>> Performance is similar to the last version; as tested on my SPR on
>>> v6.15 base:
>>>
>>
>>
>> Hi Peter,
>> Gave this a spin on a machine with 5 cores (SMT8) PowerPC system.
>>
>> I see significant regression in schbench. let me know if i have to test
>> different
>> number of threads based on the system size.
>> Will go through the series and will try a bisect meanwhile.
> 
> Not questioning the git bisect results you had later in this thread, but
> double checking that you had the newidle balance patch in place that
> Peter mentioned?
> 
> https://lore.kernel.org/lkml/20250626144017.1510594-2-clm@fb.com/
> 
> The newidle balance frequency changes the cost of everything else, so I
> wanted to make sure we were measuring the same things.
> 

Hi Chris.

It was base + series only. and base was 8784fb5fa2e0("Merge irq/drivers into tip/master").
So it didn't have your changes.

I tested again with your changes and i still see a major regression.


./schbench -L -m 4 -M auto -t 64 -n 0 -t 60 -i 60
Wakeup Latencies percentiles (usec) runtime 30 (s) (18848611 total samples)
	  50.0th: 115        (5721408 samples)
	  90.0th: 238        (7500535 samples)
	* 99.0th: 316        (1670597 samples)
	  99.9th: 360        (162283 samples)
	  min=1, max=1487

RPS percentiles (requests) runtime 30 (s) (31 total samples)
	  20.0th: 623616     (7 samples)
	* 50.0th: 629760     (15 samples)
	  90.0th: 631808     (7 samples)
	  min=617820, max=635475
average rps: 628514.30


git log --oneline
7aaf5ef0841b (HEAD) sched: Add ttwu_queue support for delayed tasks
f77b53b6766a sched: Change ttwu_runnable() vs sched_delayed
986ced69ba7b sched: Use lock guard in sched_ttwu_pending()
2c0eb5c88134 sched: Clean up ttwu comments
e1374ac7f74a sched: Re-arrange __ttwu_queue_wakelist()
7e673db9e90f psi: Split psi_ttwu_dequeue()
e2225f1c24a9 sched: Introduce ttwu_do_migrate()
80765734f127 sched: Add ttwu_queue controls
745406820d30 sched: Use lock guard in ttwu_runnable()
d320cebe6e28 sched: Optimize ttwu() / select_task_rq()
329fc7eaad76 sched/deadline: Less agressive dl_server handling
708281193493 sched/psi: Optimize psi_group_change() cpu_clock() usage
c28590ad7b91 sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
8784fb5fa2e0 (origin/master, origin/HEAD) Merge irq/drivers into tip/master


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
                   ` (13 preceding siblings ...)
  2025-07-07  9:05 ` Shrikanth Hegde
@ 2025-07-17 13:04 ` Beata Michalska
  2025-07-17 16:57   ` Beata Michalska
  14 siblings, 1 reply; 101+ messages in thread
From: Beata Michalska @ 2025-07-17 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

Hi Peter,

Below are the results of running the schbench on Altra
(as a reminder 2-core MC, 2 Numa Nodes, 160 cores)

`Legend:
- 'Flags=none' means neither TTWU_QUEUE_DEFAULT nor
  TTWU_QUEUE_DELAYED is set (or available).
- '*…*' marks Top-3 Min & Max, Bottom-3 Std dev, and
  Top-3 90th-percentile values.

Base 6.16-rc5
  Flags=none
  Min=681870.77 | Max=913649.50 | Std=53802.90       | 90th=890201.05

sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
  Flags=none
  Min=770952.12 | Max=888047.45 | Std=34430.24       | 90th=877347.24

sched/psi: Optimize psi_group_change() cpu_clock() usage
  Flags=none
  Min=748137.65 | Max=936312.33 | Std=56818.23       | 90th=*921497.27*

sched/deadline: Less agressive dl_server handling
  Flags=none
  Min=783621.95 | Max=*944604.67* | Std=43538.64     | 90th=*909961.16*

sched: Optimize ttwu() / select_task_rq()
  Flags=none
  Min=*826038.87* | Max=*1003496.73* | Std=49875.43  | 90th=*971944.88*

sched: Use lock guard in ttwu_runnable()
  Flags=none
  Min=780172.75 | Max=914170.20 | Std=35998.33       | 90th=866095.80

sched: Add ttwu_queue controls
  Flags=TTWU_QUEUE_DEFAULT
  Min=*792430.45* | Max=903422.78 | Std=33582.71     | 90th=887256.68

  Flags=none
  Min=*803532.80* | Max=894772.48 | Std=29359.35     | 90th=877920.34

sched: Introduce ttwu_do_migrate()
  Flags=TTWU_QUEUE_DEFAULT
  Min=749824.30 | Max=*965139.77* | Std=57022.47     | 90th=903659.07
 
  Flags=none
  Min=787464.65 | Max=885349.20 | Std=27030.82       | 90th=875750.44

psi: Split psi_ttwu_dequeue()
  Flags=TTWU_QUEUE_DEFAULT
  Min=762960.98 | Max=916538.12 | Std=42002.19       | 90th=876425.84
 
  Flags=none
  Min=773608.48 | Max=920812.87 | Std=42189.17       | 90th=871760.47

sched: Re-arrange __ttwu_queue_wakelist()
  Flags=TTWU_QUEUE_DEFAULT
  Min=702870.58 | Max=835243.42 | Std=44224.02       | 90th=825311.12

  Flags=none
  Min=712499.38 | Max=838492.03 | Std=38351.20       | 90th=817135.94

sched: Use lock guard in sched_ttwu_pending()
  Flags=TTWU_QUEUE_DEFAULT
  Min=729080.55 | Max=853609.62 | Std=43440.63       | 90th=838684.48

  Flags=none
  Min=708123.47 | Max=850804.48 | Std=40642.28       | 90th=830295.08

sched: Change ttwu_runnable() vs sched_delayed
  Flags=TTWU_QUEUE_DEFAULT
  Min=580218.87 | Max=838684.07 | Std=57078.24       | 90th=792973.33

  Flags=none
  Min=721274.90 | Max=784897.92 | Std=*19017.78*     | 90th=774792.30

sched: Add ttwu_queue support for delayed tasks
  Flags=none
  Min=712979.48 | Max=830192.10 | Std=33173.90       | 90th=798599.66

  Flags=TTWU_QUEUE_DEFAULT
  Min=698094.12 | Max=857627.93 | Std=38294.94       | 90th=789981.59
 
  Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
  Min=683348.77 | Max=782179.15 | Std=25086.71       | 90th=750947.00

  Flags=TTWU_QUEUE_DELAYED
  Min=669822.23 | Max=807768.85 | Std=38766.41       | 90th=794052.05

sched: fix ttwu_delayed
  Flags=none
  Min=671844.35 | Max=798737.67 | Std=33438.64       | 90th=788584.62

  Flags=TTWU_QUEUE_DEFAULT
  Min=688607.40 | Max=828679.53 | Std=33184.78       | 90th=782490.23

  Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
  Min=579171.13 | Max=643929.18 | Std=*14644.92*     | 90th=639764.16

  Flags=TTWU_QUEUE_DELAYED
  Min=614265.22 | Max=675172.05 | Std=*13309.92*     | 90th=647181.10


Best overall performer:
sched: Optimize ttwu() / select_task_rq()
  Flags=none
  Min=*826038.87* | Max=*1003496.73* | Std=49875.43 | 90th=*971944.88*

Hope this will he somehwat helpful.

---
BR
Beata

On Wed, Jul 02, 2025 at 01:49:24PM +0200, Peter Zijlstra wrote:
> Hi!
> 
> Previous version:
> 
>   https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> 
> 
> Changes:
>  - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
>  - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
>  - fixed lockdep splat (dietmar)
>  - added a few preperatory patches
> 
> 
> Patches apply on top of tip/master (which includes the disabling of private futex)
> and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> 
> Performance is similar to the last version; as tested on my SPR on v6.15 base:
> 
> v6.15:
> schbench-6.15.0-1.txt:average rps: 2891403.72
> schbench-6.15.0-2.txt:average rps: 2889997.02
> schbench-6.15.0-3.txt:average rps: 2894745.17
> 
> v6.15 + patches 1-10:
> schbench-6.15.0-dirty-4.txt:average rps: 3038265.95
> schbench-6.15.0-dirty-5.txt:average rps: 3037327.50
> schbench-6.15.0-dirty-6.txt:average rps: 3038160.15
> 
> v6.15 + all patches:
> schbench-6.15.0-dirty-deferred-1.txt:average rps: 3043404.30
> schbench-6.15.0-dirty-deferred-2.txt:average rps: 3046124.17
> schbench-6.15.0-dirty-deferred-3.txt:average rps: 3043627.10
> 
> 
> Patches can also be had here:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> 
> 
> I'm hoping we can get this merged for next cycle so we can all move on from this.
> 
> 

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v2 00/12] sched: Address schbench regression
  2025-07-17 13:04 ` Beata Michalska
@ 2025-07-17 16:57   ` Beata Michalska
  0 siblings, 0 replies; 101+ messages in thread
From: Beata Michalska @ 2025-07-17 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, clm, linux-kernel

On Thu, Jul 17, 2025 at 03:04:55PM +0200, Beata Michalska wrote:
> Hi Peter,
> 
> Below are the results of running the schbench on Altra
> (as a reminder 2-core MC, 2 Numa Nodes, 160 cores)
> 
> `Legend:
> - 'Flags=none' means neither TTWU_QUEUE_DEFAULT nor
>   TTWU_QUEUE_DELAYED is set (or available).
> - '*…*' marks Top-3 Min & Max, Bottom-3 Std dev, and
>   Top-3 90th-percentile values.
> 
> Base 6.16-rc5
>   Flags=none
>   Min=681870.77 | Max=913649.50 | Std=53802.90       | 90th=890201.05
> 
> sched/fair: bump sd->max_newidle_lb_cost when newidle balance fails
>   Flags=none
>   Min=770952.12 | Max=888047.45 | Std=34430.24       | 90th=877347.24
> 
> sched/psi: Optimize psi_group_change() cpu_clock() usage
>   Flags=none
>   Min=748137.65 | Max=936312.33 | Std=56818.23       | 90th=*921497.27*
> 
> sched/deadline: Less agressive dl_server handling
>   Flags=none
>   Min=783621.95 | Max=*944604.67* | Std=43538.64     | 90th=*909961.16*
> 
> sched: Optimize ttwu() / select_task_rq()
>   Flags=none
>   Min=*826038.87* | Max=*1003496.73* | Std=49875.43  | 90th=*971944.88*
> 
> sched: Use lock guard in ttwu_runnable()
>   Flags=none
>   Min=780172.75 | Max=914170.20 | Std=35998.33       | 90th=866095.80
> 
> sched: Add ttwu_queue controls
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=*792430.45* | Max=903422.78 | Std=33582.71     | 90th=887256.68
> 
>   Flags=none
>   Min=*803532.80* | Max=894772.48 | Std=29359.35     | 90th=877920.34
> 
> sched: Introduce ttwu_do_migrate()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=749824.30 | Max=*965139.77* | Std=57022.47     | 90th=903659.07
>  
>   Flags=none
>   Min=787464.65 | Max=885349.20 | Std=27030.82       | 90th=875750.44
> 
> psi: Split psi_ttwu_dequeue()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=762960.98 | Max=916538.12 | Std=42002.19       | 90th=876425.84
>  
>   Flags=none
>   Min=773608.48 | Max=920812.87 | Std=42189.17       | 90th=871760.47
> 
> sched: Re-arrange __ttwu_queue_wakelist()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=702870.58 | Max=835243.42 | Std=44224.02       | 90th=825311.12
> 
>   Flags=none
>   Min=712499.38 | Max=838492.03 | Std=38351.20       | 90th=817135.94
> 
> sched: Use lock guard in sched_ttwu_pending()
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=729080.55 | Max=853609.62 | Std=43440.63       | 90th=838684.48
> 
>   Flags=none
>   Min=708123.47 | Max=850804.48 | Std=40642.28       | 90th=830295.08
> 
> sched: Change ttwu_runnable() vs sched_delayed
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=580218.87 | Max=838684.07 | Std=57078.24       | 90th=792973.33
> 
>   Flags=none
>   Min=721274.90 | Max=784897.92 | Std=*19017.78*     | 90th=774792.30
> 
> sched: Add ttwu_queue support for delayed tasks
>   Flags=none
>   Min=712979.48 | Max=830192.10 | Std=33173.90       | 90th=798599.66
> 
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=698094.12 | Max=857627.93 | Std=38294.94       | 90th=789981.59
>  
>   Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
>   Min=683348.77 | Max=782179.15 | Std=25086.71       | 90th=750947.00
> 
>   Flags=TTWU_QUEUE_DELAYED
>   Min=669822.23 | Max=807768.85 | Std=38766.41       | 90th=794052.05
> 
> sched: fix ttwu_delayed
This one is actually:
sched: Add ttwu_queue support for delayed tasks
+
https://lore.kernel.org/all/0672c7df-543c-4f3e-829a-46969fad6b34@amd.com/

Apologies for that.

---
BR
Beata
>   Flags=none
>   Min=671844.35 | Max=798737.67 | Std=33438.64       | 90th=788584.62
> 
>   Flags=TTWU_QUEUE_DEFAULT
>   Min=688607.40 | Max=828679.53 | Std=33184.78       | 90th=782490.23
> 
>   Flags=TTWU_QUEUE_DEFAULT/TTWU_QUEUE_DELAYED
>   Min=579171.13 | Max=643929.18 | Std=*14644.92*     | 90th=639764.16
> 
>   Flags=TTWU_QUEUE_DELAYED
>   Min=614265.22 | Max=675172.05 | Std=*13309.92*     | 90th=647181.10
> 
> 
> Best overall performer:
> sched: Optimize ttwu() / select_task_rq()
>   Flags=none
>   Min=*826038.87* | Max=*1003496.73* | Std=49875.43 | 90th=*971944.88*
> 
> Hope this will he somehwat helpful.
> 
> ---
> BR
> Beata
> 
> On Wed, Jul 02, 2025 at 01:49:24PM +0200, Peter Zijlstra wrote:
> > Hi!
> > 
> > Previous version:
> > 
> >   https://lkml.kernel.org/r/20250520094538.086709102@infradead.org
> > 
> > 
> > Changes:
> >  - keep dl_server_stop(), just remove the 'normal' usage of it (juril)
> >  - have the sched_delayed wake list IPIs do select_task_rq() (vingu)
> >  - fixed lockdep splat (dietmar)
> >  - added a few preperatory patches
> > 
> > 
> > Patches apply on top of tip/master (which includes the disabling of private futex)
> > and clm's newidle balance patch (which I'm awaiting vingu's ack on).
> > 
> > Performance is similar to the last version; as tested on my SPR on v6.15 base:
> > 
> > v6.15:
> > schbench-6.15.0-1.txt:average rps: 2891403.72
> > schbench-6.15.0-2.txt:average rps: 2889997.02
> > schbench-6.15.0-3.txt:average rps: 2894745.17
> > 
> > v6.15 + patches 1-10:
> > schbench-6.15.0-dirty-4.txt:average rps: 3038265.95
> > schbench-6.15.0-dirty-5.txt:average rps: 3037327.50
> > schbench-6.15.0-dirty-6.txt:average rps: 3038160.15
> > 
> > v6.15 + all patches:
> > schbench-6.15.0-dirty-deferred-1.txt:average rps: 3043404.30
> > schbench-6.15.0-dirty-deferred-2.txt:average rps: 3046124.17
> > schbench-6.15.0-dirty-deferred-3.txt:average rps: 3043627.10
> > 
> > 
> > Patches can also be had here:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
> > 
> > 
> > I'm hoping we can get this merged for next cycle so we can all move on from this.
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 101+ messages in thread

end of thread, other threads:[~2025-10-09 11:43 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-02 11:49 [PATCH v2 00/12] sched: Address schbench regression Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 01/12] sched/psi: Optimize psi_group_change() cpu_clock() usage Peter Zijlstra
2025-07-15 19:11   ` Chris Mason
2025-07-16  6:06     ` K Prateek Nayak
2025-07-16  6:53     ` Beata Michalska
2025-07-16 10:40       ` Peter Zijlstra
2025-07-16 14:54         ` Johannes Weiner
2025-07-16 16:27         ` Chris Mason
2025-07-23  4:16         ` Aithal, Srikanth
2025-07-25  5:13         ` K Prateek Nayak
2025-07-02 11:49 ` [PATCH v2 02/12] sched/deadline: Less agressive dl_server handling Peter Zijlstra
2025-07-02 16:12   ` Juri Lelli
2025-07-10 12:46   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-07-14 22:56   ` [PATCH v2 02/12] " Mel Gorman
2025-07-15 14:55     ` Chris Mason
2025-07-16 18:19       ` Mel Gorman
2025-07-30  9:34   ` Geert Uytterhoeven
2025-07-30  9:46     ` Juri Lelli
2025-07-30 10:05       ` Geert Uytterhoeven
2025-08-05 22:03   ` Chris Bainbridge
2025-08-05 23:04     ` Chris Bainbridge
2025-09-15 22:29   ` John Stultz
2025-09-16  4:18     ` John Stultz
2025-09-16  5:28       ` [RFC][PATCH] sched/deadline: Fix dl_server getting stuck, allowing cpu starvation John Stultz
2025-09-16  8:51         ` Juri Lelli
2025-09-16 11:01           ` Peter Zijlstra
2025-09-16 12:52             ` Juri Lelli
2025-09-16 14:30               ` Peter Zijlstra
2025-09-16 17:35             ` John Stultz
2025-09-16 21:30               ` Peter Zijlstra
2025-09-17  3:29                 ` John Stultz
2025-09-17  9:34                   ` Peter Zijlstra
2025-09-17 12:26                     ` Peter Zijlstra
2025-09-17 13:56                       ` Juri Lelli
2025-09-17 17:30                         ` Peter Zijlstra
2025-09-18  8:37                           ` Juri Lelli
2025-09-18  9:04                             ` Peter Zijlstra
2025-09-18  9:42                               ` Juri Lelli
2025-09-17 19:29                       ` John Stultz
2025-09-18  6:56                       ` [tip: sched/urgent] sched/deadline: Fix dl_server behaviour tip-bot2 for Peter Zijlstra
2025-09-25  7:55                       ` tip-bot2 for Peter Zijlstra
2025-09-18  6:56             ` [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck tip-bot2 for Peter Zijlstra
2025-09-18 14:46               ` Dietmar Eggemann
2025-09-22 21:57               ` Marek Szyprowski
2025-09-22 23:46                 ` John Stultz
2025-09-23  6:31                   ` Marek Szyprowski
2025-09-23  7:25                 ` Peter Zijlstra
2025-09-23  7:52                   ` Marek Szyprowski
2025-09-23 22:02                 ` Peter Zijlstra
2025-09-29 15:19                   ` Marek Szyprowski
     [not found]                   ` <eae77bd0-d874-4ddf-88d7-c1ab75358f91@samsung.com>
2025-10-09  8:35                     ` Krzysztof Kozlowski
2025-10-09  9:26                     ` Peter Zijlstra
2025-10-09 11:42                       ` Marek Szyprowski
2025-09-25  7:55             ` tip-bot2 for Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 03/12] sched: Optimize ttwu() / select_task_rq() Peter Zijlstra
2025-07-10 16:47   ` Vincent Guittot
2025-07-14 22:59   ` Mel Gorman
2025-07-02 11:49 ` [PATCH v2 04/12] sched: Use lock guard in ttwu_runnable() Peter Zijlstra
2025-07-10 16:48   ` Vincent Guittot
2025-07-14 23:00   ` Mel Gorman
2025-07-02 11:49 ` [PATCH v2 05/12] sched: Add ttwu_queue controls Peter Zijlstra
2025-07-10 16:51   ` Vincent Guittot
2025-07-14 23:14   ` Mel Gorman
2025-07-02 11:49 ` [PATCH v2 06/12] sched: Introduce ttwu_do_migrate() Peter Zijlstra
2025-07-10 16:51   ` Vincent Guittot
2025-07-02 11:49 ` [PATCH v2 07/12] psi: Split psi_ttwu_dequeue() Peter Zijlstra
2025-07-17 23:59   ` Chris Mason
2025-07-18 18:02     ` Steven Rostedt
2025-07-02 11:49 ` [PATCH v2 08/12] sched: Re-arrange __ttwu_queue_wakelist() Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 09/12] sched: Clean up ttwu comments Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 10/12] sched: Use lock guard in sched_ttwu_pending() Peter Zijlstra
2025-07-10 16:51   ` Vincent Guittot
2025-07-02 11:49 ` [PATCH v2 11/12] sched: Change ttwu_runnable() vs sched_delayed Peter Zijlstra
2025-07-02 11:49 ` [PATCH v2 12/12] sched: Add ttwu_queue support for delayed tasks Peter Zijlstra
2025-07-03 16:00   ` Phil Auld
2025-07-03 16:47     ` Peter Zijlstra
2025-07-03 17:11       ` Phil Auld
2025-07-14 13:57         ` Phil Auld
2025-07-04  6:13       ` K Prateek Nayak
2025-07-04  7:59         ` Peter Zijlstra
2025-07-08 12:44   ` Dietmar Eggemann
2025-07-08 18:57     ` Peter Zijlstra
2025-07-08 21:02     ` Peter Zijlstra
2025-07-23  5:42   ` Shrikanth Hegde
2025-07-02 15:27 ` [PATCH v2 00/12] sched: Address schbench regression Chris Mason
2025-07-07  9:05 ` Shrikanth Hegde
2025-07-07  9:11   ` Peter Zijlstra
2025-07-07  9:38     ` Shrikanth Hegde
2025-07-16 13:46       ` Phil Auld
2025-07-17 17:25         ` Phil Auld
2025-07-07 18:19   ` Shrikanth Hegde
2025-07-08 19:02     ` Peter Zijlstra
2025-07-09 16:46       ` Shrikanth Hegde
2025-07-14 17:54       ` Shrikanth Hegde
2025-07-21 19:37       ` Shrikanth Hegde
2025-07-22 20:20         ` Chris Mason
2025-07-24 18:23           ` Chris Mason
2025-07-08 15:09   ` Chris Mason
2025-07-08 17:29     ` Shrikanth Hegde
2025-07-17 13:04 ` Beata Michalska
2025-07-17 16:57   ` Beata Michalska

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.