From: Phil Auld <pauld@redhat.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
mingo@redhat.com, juri.lelli@redhat.com,
dietmar.eggemann@arm.com, rostedt@goodmis.org,
bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
clm@meta.com, linux-kernel@vger.kernel.org
Subject: Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
Date: Fri, 6 Jun 2025 12:18:20 -0400
Message-ID: <20250606161820.GA224542@pauld.westford.csb>
In-Reply-To: <CAKfTPtDOQVEMRWaK9xEVqSDKcvUfai4CUck6G=oOdaeRBhZQUw@mail.gmail.com>

Hi Peter,

On Fri, Jun 06, 2025 at 05:03:36PM +0200 Vincent Guittot wrote:
> On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > One of the things lost with introduction of DELAY_DEQUEUE is the
> > ability of TTWU to move those tasks around on wakeup, since they're
> > on_rq, and as such, need to be woken in-place.
>
> I was thinking that you would call select_task_rq() somewhere in the
> wake-up path of a delayed entity to get a chance to migrate it, which was
> one reason for the perf regression (and which would also have been
> useful for the EAS case). But IIUC the task is still enqueued on the same
> CPU; the target CPU just does the enqueue itself instead of the local
> CPU. Or am I missing something?

Yeah, this one still bites us. We ran these patches through our perf
tests (without twiddling any FEATs) and it was basically a wash.
The fs regression we saw due to always waking up on the same CPU
was still present, as expected given this patch, I suppose.
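
If we do another run with the non-default path enabled, we'd just flip it
at runtime via the sched features debugfs knob. A rough sketch, assuming
CONFIG_SCHED_DEBUG and debugfs mounted at /sys/kernel/debug:

  # enable the delayed-task wakelist path added by this patch
  echo TTWU_QUEUE_DELAYED > /sys/kernel/debug/sched/features

  # ... rerun the perf tests, then restore the default ...
  echo NO_TTWU_QUEUE_DELAYED > /sys/kernel/debug/sched/features
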
Thanks,
Phil
>
> >
> > Doing the in-place thing adds quite a bit of cross-cpu latency, add a
> > little something that gets remote CPUs to do their own in-place
> > wakeups, significantly reducing the rq->lock contention.
> >
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> > kernel/sched/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++------
> > kernel/sched/fair.c | 5 ++-
> > kernel/sched/features.h | 1
> > kernel/sched/sched.h | 1
> > 4 files changed, 72 insertions(+), 9 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> > return 1;
> > }
> >
> > +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> > +
> > /*
> > * Consider @p being inside a wait loop:
> > *
> > @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> > */
> > static int ttwu_runnable(struct task_struct *p, int wake_flags)
> > {
> > +#ifdef CONFIG_SMP
> > + if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> > + /*
> > + * Similar to try_to_block_task():
> > + *
> > + * __schedule() ttwu()
> > + * prev_state = prev->state if (p->sched_delayed)
> > + * if (prev_state) smp_acquire__after_ctrl_dep()
> > + * try_to_block_task() p->state = TASK_WAKING
> > + * ... set_delayed()
> > + * RELEASE p->sched_delayed = 1
> > + *
> > + * __schedule() and ttwu() have matching control dependencies.
> > + *
> > + * Notably, once we observe sched_delayed we know the task has
> > + * passed try_to_block_task() and p->state is ours to modify.
> > + *
> > + * TASK_WAKING controls ttwu() concurrency.
> > + */
> > + smp_acquire__after_ctrl_dep();
> > + WRITE_ONCE(p->__state, TASK_WAKING);
> > +
> > + if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> > + return 1;
> > + }
> > +#endif
> > +
> > CLASS(__task_rq_lock, guard)(p);
> > return __ttwu_runnable(guard.rq, p, wake_flags);
> > }
> > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> > update_rq_clock(rq);
> >
> > llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > + struct rq *p_rq = task_rq(p);
> > + int ret;
> > +
> > + /*
> > + * This is the ttwu_runnable() case. Notably it is possible for
> > + * on-rq entities to get migrated -- even sched_delayed ones.
>
> I haven't found where the sched_delayed task could migrate to another CPU.
>
> > + */
> > + if (unlikely(p_rq != rq)) {
> > + rq_unlock(rq, &rf);
> > + p_rq = __task_rq_lock(p, &rf);
> > + }
> > +
> > + ret = __ttwu_runnable(p_rq, p, WF_TTWU);
> > +
> > + if (unlikely(p_rq != rq)) {
> > + if (!ret)
> > + set_task_cpu(p, cpu_of(rq));
> > +
> > + __task_rq_unlock(p_rq, &rf);
> > + rq_lock(rq, &rf);
> > + update_rq_clock(rq);
> > + }
> > +
> > + if (ret) {
> > + // XXX ttwu_stat()
> > + continue;
> > + }
> > +
> > + /*
> > + * This is the 'normal' case where the task is blocked.
> > + */
> > +
> > if (WARN_ON_ONCE(p->on_cpu))
> > smp_cond_load_acquire(&p->on_cpu, !VAL);
> >
> > - if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> > - set_task_cpu(p, cpu_of(rq));
> > -
> > ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> > }
> >
> > @@ -3974,7 +4032,7 @@ static inline bool ttwu_queue_cond(struc
> >
> > static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> > {
> > - bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> > + bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
> >
> > if (!ttwu_queue_cond(p, cpu, def))
> > return false;
> > @@ -4269,8 +4327,8 @@ int try_to_wake_up(struct task_struct *p
> > * __schedule(). See the comment for smp_mb__after_spinlock().
> > *
> > * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> > - * schedule()'s deactivate_task() has 'happened' and p will no longer
> > - * care about it's own p->state. See the comment in __schedule().
> > + * schedule()'s try_to_block_task() has 'happened' and p will no longer
> > + * care about it's own p->state. See the comment in try_to_block_task().
> > */
> > smp_acquire__after_ctrl_dep();
> >
> > @@ -6712,8 +6770,8 @@ static void __sched notrace __schedule(i
> > preempt = sched_mode == SM_PREEMPT;
> >
> > /*
> > - * We must load prev->state once (task_struct::state is volatile), such
> > - * that we form a control dependency vs deactivate_task() below.
> > + * We must load prev->state once, such that we form a control
> > + * dependency vs try_to_block_task() below.
> > */
> > prev_state = READ_ONCE(prev->__state);
> > if (sched_mode == SM_IDLE) {
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5395,7 +5395,10 @@ static __always_inline void return_cfs_r
> >
> > static void set_delayed(struct sched_entity *se)
> > {
> > - se->sched_delayed = 1;
> > + /*
> > + * See TTWU_QUEUE_DELAYED in ttwu_runnable().
> > + */
> > + smp_store_release(&se->sched_delayed, 1);
> >
> > /*
> > * Delayed se of cfs_rq have no tasks queued on them.
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
> > SCHED_FEAT(TTWU_QUEUE, true)
> > #endif
> > SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> > +SCHED_FEAT(TTWU_QUEUE_DELAYED, false)
>
> I'm not sure that the feature will be tested, as people mainly test the
> default config.
>
> > SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
> >
> > /*
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
> > #define WF_RQ_SELECTED 0x80 /* ->select_task_rq() was called */
> >
> > #define WF_ON_CPU 0x0100
> > +#define WF_DELAYED 0x0200
> >
> > #ifdef CONFIG_SMP
> > static_assert(WF_EXEC == SD_BALANCE_EXEC);
> >
> >
>
--