From: Peter Zijlstra <peterz@infradead.org>
To: Chris Mason <clm@meta.com>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@kernel.org>,
	dietmar.eggemann@arm.com, vschneid@redhat.com,
	Juri Lelli <juri.lelli@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: scheduler performance regression since v6.11
Date: Fri, 16 May 2025 12:18:22 +0200
Message-ID: <20250516101822.GC16434@noisy.programming.kicks-ass.net>
In-Reply-To: <d3c8527f-ffaf-4463-a305-17ca21a06ce8@meta.com>
On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:
Right, so I can reproduce on Thomas' SKL and maybe see some of it on my
SPR.
I've managed to discover a whole bunch of ways that ttwu() can explode
again :-) But as you surmised, your workload *LOVES* TTWU_QUEUE, and
DELAYED_DEQUEUE takes some of that away, because those delayed things
remain on-rq and ttwu() can't deal with that other than by doing the
wakeup in-line and that's exactly the thing this workload hates most.
(I'll keep poking at ttwu() to see if I can get a combination of
TTWU_QUEUE and DELAYED_DEQUEUE that does not explode in 'fun' ways)
However, I've found that flipping the default in ttwu_queue_cond() seems
to make up for quite a bit -- for your workload.
(basically, all the work we can get away from those pinned message CPUs
is a win)
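To make the interaction above concrete, here is a minimal userspace sketch of the wakeup-path choice. It is a toy model, not kernel code: every `toy_` name is hypothetical, and only the last two returns of `toy_ttwu_queue_cond()` follow the tail of ttwu_queue_cond() visible in the patch below. The other checks the real function performs are left out.

```c
#include <stdbool.h>

/* Hypothetical toy types, not the kernel's task_struct or rq. */
enum toy_wake_path { TOY_WAKE_INLINE, TOY_WAKE_QUEUED };

struct toy_task {
	bool on_rq;		/* still on the runqueue, e.g. a delayed dequeue */
};

struct toy_rq {
	unsigned int nr_running;
};

/* Models only the tail of ttwu_queue_cond() shown in the diff. */
static bool toy_ttwu_queue_cond(const struct toy_rq *rq, bool queue_default)
{
	if (!rq->nr_running)
		return true;

	/* Was a hard 'false'; the patch makes this a feature flag. */
	return queue_default;
}

static enum toy_wake_path toy_wake(const struct toy_task *p,
				   const struct toy_rq *rq,
				   bool ttwu_queue, bool queue_default)
{
	/* A task that is still on-rq cannot be queued: wake it in-line. */
	if (p->on_rq)
		return TOY_WAKE_INLINE;

	if (ttwu_queue && toy_ttwu_queue_cond(rq, queue_default))
		return TOY_WAKE_QUEUED;

	return TOY_WAKE_INLINE;
}
```

In this model, a busy target CPU only gets a queued wakeup when `queue_default` is true, and a delayed (still on-rq) task always takes the in-line path regardless of either flag.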
Meanwhile, you also discovered that the other part of your performance
woes was due to dl_server; specifically, disabling it gave you back a
healthy chunk of your performance.
The problem is indeed that we toggle the dl_server on every nr_running
transition from 0 and to 0, and your workload has a shit-ton of those,
so every time we pay the overhead of starting and stopping this thing.
In hindsight, that's a fairly stupid setup, and the below patch changes
this to keep the dl_server around until it's not seen fair activity for
a whole period. This appears to fully recover this dip.
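As a rough illustration of why this matters, the sketch below counts start/stop operations for the two policies over the same event stream. It is a userspace toy under stated assumptions: the `toy_` names and the `TOY_EV_PERIOD_END` event are hypothetical, and "a whole period without fair activity" is approximated as a period boundary with no enqueue since the previous one.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_event { TOY_EV_ENQUEUE, TOY_EV_DEQUEUE, TOY_EV_PERIOD_END };

struct toy_counts {
	unsigned int starts;
	unsigned int stops;
};

/* Eager policy: toggle the server on every nr_running 0 <-> 1 transition. */
static struct toy_counts toy_eager(const enum toy_event *ev, size_t n)
{
	struct toy_counts c = { 0, 0 };
	unsigned int nr_running = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i] == TOY_EV_ENQUEUE) {
			if (nr_running++ == 0)
				c.starts++;
		} else if (ev[i] == TOY_EV_DEQUEUE && nr_running > 0) {
			if (--nr_running == 0)
				c.stops++;
		}
	}
	return c;
}

/* Lazy policy: stop only after a full period with no fair activity. */
static struct toy_counts toy_lazy(const enum toy_event *ev, size_t n)
{
	struct toy_counts c = { 0, 0 };
	unsigned int nr_running = 0;
	bool active = false, saw_activity = false;

	for (size_t i = 0; i < n; i++) {
		switch (ev[i]) {
		case TOY_EV_ENQUEUE:
			nr_running++;
			saw_activity = true;
			if (!active) {
				active = true;
				c.starts++;
			}
			break;
		case TOY_EV_DEQUEUE:
			if (nr_running > 0)
				nr_running--;
			break;
		case TOY_EV_PERIOD_END:
			if (active && nr_running == 0 && !saw_activity) {
				active = false;
				c.stops++;
			}
			saw_activity = false;
			break;
		}
	}
	return c;
}
```

For a stream of three enqueue/dequeue pairs followed by two period boundaries, the eager policy pays for three starts and three stops, while the lazy one pays for one of each.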
Trouble seems to be that dl_server_update() always gets tickled by
random garbage, so in the end the dl_server never stops... oh well.
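The failure mode can be sketched with a two-strike idle flag in plain C. This is only a model of the flag handling in the patch below: the `toy_` names are hypothetical, `toy_server_update()` stands in for dl_server_update() and `toy_pick_empty()` for the "server picked but has no task" branch of __pick_task_dl().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few sched_dl_entity bits that matter here. */
struct toy_server {
	bool active;
	bool idle;
	unsigned int stops;
};

/* Any runtime accounting clears the idle mark, as in the patch. */
static void toy_server_update(struct toy_server *s)
{
	s->idle = false;
}

/* The server was picked but had nothing to run. */
static void toy_pick_empty(struct toy_server *s)
{
	if (!s->active)
		return;

	/* Second empty pick in a row with no update in between: stop. */
	if (s->idle) {
		s->active = false;
		s->stops++;
	}
	s->idle = true;
}
```

Two consecutive empty picks stop the server, but if an update lands between every pair of empty picks, the flag is always cleared first and the server never stops, which matches the behaviour described above.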
Juri, could you have a look at this? Perhaps I messed up something
trivial -- it's been like that this week :/
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..1f92572b20c0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -702,6 +702,7 @@ struct sched_dl_entity {
 	unsigned int dl_defer : 1;
 	unsigned int dl_defer_armed : 1;
 	unsigned int dl_defer_running : 1;
+	unsigned int dl_server_idle : 1;

 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba0..010537a2f368 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3964,7 +3964,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 	if (!cpu_rq(cpu)->nr_running)
 		return true;

-	return false;
+	return sched_feat(TTWU_QUEUE_DEFAULT);
 }

 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245..dce3a95cb8bc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1639,8 +1639,10 @@ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime)
+	if (dl_se->dl_runtime) {
+		dl_se->dl_server_idle = 0;
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	}
 }

 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1663,20 +1665,24 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 		setup_new_dl_entity(dl_se);
 	}

-	if (!dl_se->dl_runtime)
+	if (!dl_se->dl_runtime || dl_se->dl_server_active)
 		return;

+	trace_printk("dl_server starting\n");
+
 	dl_se->dl_server_active = 1;
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
 		resched_curr(dl_se->rq);
 }

-void dl_server_stop(struct sched_dl_entity *dl_se)
+static void __dl_server_stop(struct sched_dl_entity *dl_se)
 {
 	if (!dl_se->dl_runtime)
 		return;

+	trace_printk("dl_server stopping\n");
+
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
@@ -1684,6 +1690,10 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	dl_se->dl_server_active = 0;
 }

+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+}
+
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
@@ -2436,6 +2446,9 @@ static struct task_struct *__pick_task_dl(struct rq *rq)
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
 			if (dl_server_active(dl_se)) {
+				if (dl_se->dl_server_idle)
+					__dl_server_stop(dl_se);
+				dl_se->dl_server_idle = 1;
 				dl_se->dl_yielded = 1;
 				update_curr_dl_se(rq, dl_se, 0);
 			}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..75aa7fdc4c98 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,6 +81,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 #endif
+SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)

 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.
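For anyone wanting to flip the new feature at runtime: scheduler features are toggled by writing the feature name, or the name prefixed with `NO_`, to the debugfs features file (typically /sys/kernel/debug/sched/features on recent kernels, assuming debugfs is mounted and the patch above is applied). A minimal sketch of a helper, where `toy_set_sched_feat()` is a hypothetical name and the path is passed in by the caller:

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Write "NAME" or "NO_NAME" to a sched features file.
 * Returns 0 on success, -1 on failure. The path is supplied by the
 * caller; nothing here checks that the feature actually exists.
 */
static int toy_set_sched_feat(const char *path, const char *name, bool enable)
{
	FILE *f = fopen(path, "w");
	int rc;

	if (!f)
		return -1;

	rc = fprintf(f, "%s%s\n", enable ? "" : "NO_", name) < 0 ? -1 : 0;
	if (fclose(f) != 0)
		rc = -1;

	return rc;
}
```

A call such as `toy_set_sched_feat("/sys/kernel/debug/sched/features", "TTWU_QUEUE_DEFAULT", true)` would need root and only makes sense on a kernel carrying this patch.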