From: Phil Auld <pauld@redhat.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	mingo@redhat.com, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	clm@meta.com, linux-kernel@vger.kernel.org
Subject: Re: [RFC][PATCH 5/5] sched: Add ttwu_queue support for delayed tasks
Date: Fri, 6 Jun 2025 12:18:20 -0400
Message-ID: <20250606161820.GA224542@pauld.westford.csb>
In-Reply-To: <CAKfTPtDOQVEMRWaK9xEVqSDKcvUfai4CUck6G=oOdaeRBhZQUw@mail.gmail.com>


Hi Peter,

On Fri, Jun 06, 2025 at 05:03:36PM +0200 Vincent Guittot wrote:
> On Tue, 20 May 2025 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > One of the things lost with introduction of DELAY_DEQUEUE is the
> > ability of TTWU to move those tasks around on wakeup, since they're
> > on_rq, and as such, need to be woken in-place.
> 
> I was thinking that you would call select_task_rq() somewhere in the
> wake-up path of a delayed entity to get a chance to migrate it, which
> was one reason for the perf regression (and which would also have been
> useful for the EAS case). But IIUC, the task is still enqueued on the
> same CPU; the target CPU just does the enqueue itself instead of the
> local CPU doing it. Or am I missing something?

Yeah, this one still bites us.  We ran these patches through our perf
tests (without twiddling any FEATs) and the results were basically a wash.

The FS regression we saw from tasks always waking up on the same CPU
was still present, which I suppose is expected given this patch.
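
Since the feature goes in default-off, a re-run that actually exercises
the new path needs TTWU_QUEUE_DELAYED flipped at run time. A minimal
sketch of doing that, assuming the usual debugfs interface (the path
below is an assumption based on recent kernels; older ones expose
/sys/kernel/debug/sched_features instead, and it needs root plus a
mounted debugfs):

/*
 * Illustration only, not part of the patch: enable the default-off
 * TTWU_QUEUE_DELAYED scheduler feature before re-running a benchmark.
 * Writing the NO_-prefixed name ("NO_TTWU_QUEUE_DELAYED") restores the
 * default.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *feat = (argc > 1) ? argv[1] : "TTWU_QUEUE_DELAYED";
	int fd = open("/sys/kernel/debug/sched/features", O_WRONLY);

	if (fd < 0) {
		perror("open");		/* needs root and debugfs mounted */
		return 1;
	}
	if (write(fd, feat, strlen(feat)) != (ssize_t)strlen(feat))
		perror("write");
	close(fd);
	return 0;
}

A plain echo of the feature name into that file does the same thing, of
course; the point is only that our runs above used the defaults, so this
path was never enabled.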

Thanks,
Phil

> 
> >
> > Doing the in-place thing adds quite a bit of cross-cpu latency, add a
> > little something that gets remote CPUs to do their own in-place
> > wakeups, significantly reducing the rq->lock contention.
> >
> > Reported-by: Chris Mason <clm@meta.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/sched/core.c     |   74 ++++++++++++++++++++++++++++++++++++++++++------
> >  kernel/sched/fair.c     |    5 ++-
> >  kernel/sched/features.h |    1 +
> >  kernel/sched/sched.h    |    1 +
> >  4 files changed, 72 insertions(+), 9 deletions(-)
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3784,6 +3784,8 @@ static int __ttwu_runnable(struct rq *rq
> >         return 1;
> >  }
> >
> > +static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags);
> > +
> >  /*
> >   * Consider @p being inside a wait loop:
> >   *
> > @@ -3811,6 +3813,33 @@ static int __ttwu_runnable(struct rq *rq
> >   */
> >  static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >  {
> > +#ifdef CONFIG_SMP
> > +       if (sched_feat(TTWU_QUEUE_DELAYED) && READ_ONCE(p->se.sched_delayed)) {
> > +               /*
> > +                * Similar to try_to_block_task():
> > +                *
> > +                * __schedule()                         ttwu()
> > +                *   prev_state = prev->state             if (p->sched_delayed)
> > +                *   if (prev_state)                         smp_acquire__after_ctrl_dep()
> > +                *     try_to_block_task()                   p->state = TASK_WAKING
> > +                *       ... set_delayed()
> > +                *         RELEASE p->sched_delayed = 1
> > +                *
> > +                * __schedule() and ttwu() have matching control dependencies.
> > +                *
> > +                * Notably, once we observe sched_delayed we know the task has
> > +                * passed try_to_block_task() and p->state is ours to modify.
> > +                *
> > +                * TASK_WAKING controls ttwu() concurrency.
> > +                */
> > +               smp_acquire__after_ctrl_dep();
> > +               WRITE_ONCE(p->__state, TASK_WAKING);
> > +
> > +               if (ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_DELAYED))
> > +                       return 1;
> > +       }
> > +#endif
> > +
> >         CLASS(__task_rq_lock, guard)(p);
> >         return __ttwu_runnable(guard.rq, p, wake_flags);
> >  }
> > @@ -3830,12 +3859,41 @@ void sched_ttwu_pending(void *arg)
> >         update_rq_clock(rq);
> >
> >         llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
> > +               struct rq *p_rq = task_rq(p);
> > +               int ret;
> > +
> > +               /*
> > +                * This is the ttwu_runnable() case. Notably it is possible for
> > +                * on-rq entities to get migrated -- even sched_delayed ones.
> 
> I haven't found where a sched_delayed task could migrate to another CPU.
> 
> > +                */
> > +               if (unlikely(p_rq != rq)) {
> > +                       rq_unlock(rq, &rf);
> > +                       p_rq = __task_rq_lock(p, &rf);
> > +               }
> > +
> > +               ret = __ttwu_runnable(p_rq, p, WF_TTWU);
> > +
> > +               if (unlikely(p_rq != rq)) {
> > +                       if (!ret)
> > +                               set_task_cpu(p, cpu_of(rq));
> > +
> > +                       __task_rq_unlock(p_rq, &rf);
> > +                       rq_lock(rq, &rf);
> > +                       update_rq_clock(rq);
> > +               }
> > +
> > +               if (ret) {
> > +                       // XXX ttwu_stat()
> > +                       continue;
> > +               }
> > +
> > +               /*
> > +                * This is the 'normal' case where the task is blocked.
> > +                */
> > +
> >                 if (WARN_ON_ONCE(p->on_cpu))
> >                         smp_cond_load_acquire(&p->on_cpu, !VAL);
> >
> > -               if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
> > -                       set_task_cpu(p, cpu_of(rq));
> > -
> >                 ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
> >         }
> >
> > @@ -3974,7 +4032,7 @@ static inline bool ttwu_queue_cond(struc
> >
> >  static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
> >  {
> > -       bool def = sched_feat(TTWU_QUEUE_DEFAULT);
> > +       bool def = sched_feat(TTWU_QUEUE_DEFAULT) || (wake_flags & WF_DELAYED);
> >
> >         if (!ttwu_queue_cond(p, cpu, def))
> >                 return false;
> > @@ -4269,8 +4327,8 @@ int try_to_wake_up(struct task_struct *p
> >                  * __schedule().  See the comment for smp_mb__after_spinlock().
> >                  *
> >                  * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> > -                * schedule()'s deactivate_task() has 'happened' and p will no longer
> > -                * care about it's own p->state. See the comment in __schedule().
> > +                * schedule()'s try_to_block_task() has 'happened' and p will no longer
> > +                * care about it's own p->state. See the comment in try_to_block_task().
> >                  */
> >                 smp_acquire__after_ctrl_dep();
> >
> > @@ -6712,8 +6770,8 @@ static void __sched notrace __schedule(i
> >         preempt = sched_mode == SM_PREEMPT;
> >
> >         /*
> > -        * We must load prev->state once (task_struct::state is volatile), such
> > -        * that we form a control dependency vs deactivate_task() below.
> > +        * We must load prev->state once, such that we form a control
> > +        * dependency vs try_to_block_task() below.
> >          */
> >         prev_state = READ_ONCE(prev->__state);
> >         if (sched_mode == SM_IDLE) {
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5395,7 +5395,10 @@ static __always_inline void return_cfs_r
> >
> >  static void set_delayed(struct sched_entity *se)
> >  {
> > -       se->sched_delayed = 1;
> > +       /*
> > +        * See TTWU_QUEUE_DELAYED in ttwu_runnable().
> > +        */
> > +       smp_store_release(&se->sched_delayed, 1);
> >
> >         /*
> >          * Delayed se of cfs_rq have no tasks queued on them.
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -82,6 +82,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
> >  SCHED_FEAT(TTWU_QUEUE, true)
> >  #endif
> >  SCHED_FEAT(TTWU_QUEUE_ON_CPU, true)
> > +SCHED_FEAT(TTWU_QUEUE_DELAYED, false)
> 
> I'm not sure the feature will get tested, as people mainly test the
> default config.
> 
> >  SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
> >
> >  /*
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2313,6 +2313,7 @@ static inline int task_on_rq_migrating(s
> >  #define WF_RQ_SELECTED         0x80 /* ->select_task_rq() was called */
> >
> >  #define WF_ON_CPU              0x0100
> > +#define WF_DELAYED             0x0200
> >
> >  #ifdef CONFIG_SMP
> >  static_assert(WF_EXEC == SD_BALANCE_EXEC);
> >
> >
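
As an aside, the ordering comment added to ttwu_runnable() above is
fairly dense, so below is a userspace analogue of the pairing it
describes, purely for illustration (C11 atomics, built with -pthread,
not kernel code): set_delayed() now publishes sched_delayed with
smp_store_release(), and the waker, having observed the flag inside a
branch, upgrades that control dependency with
smp_acquire__after_ctrl_dep() before claiming the task by writing
TASK_WAKING. In C11 terms that is a release store paired with a relaxed
load plus an acquire fence:

/* Userspace sketch of the release / control-dependency-acquire pairing
 * described in the ttwu_runnable() comment above.  Illustration only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int task_state;			/* stand-in for p->__state          */
static atomic_int sched_delayed;	/* stand-in for p->se.sched_delayed */

static void *schedule_side(void *arg)
{
	(void)arg;
	/* Everything before set_delayed(): after this point the task no
	 * longer cares about its own state (try_to_block_task()). */
	task_state = 1;
	/* The smp_store_release() the patch adds to set_delayed(). */
	atomic_store_explicit(&sched_delayed, 1, memory_order_release);
	return NULL;
}

static void *ttwu_side(void *arg)
{
	(void)arg;
	/* The kernel tests the flag once; the spin is only so this demo
	 * terminates deterministically. */
	while (!atomic_load_explicit(&sched_delayed, memory_order_relaxed))
		;
	/* smp_acquire__after_ctrl_dep(): the branch on the flag plus this
	 * fence order us after everything schedule_side() did before its
	 * release store, so the state is now ours to overwrite ... */
	atomic_thread_fence(memory_order_acquire);
	/* ... like WRITE_ONCE(p->__state, TASK_WAKING) in the patch. */
	task_state = 2;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, schedule_side, NULL);
	pthread_create(&b, NULL, ttwu_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("final state: %d\n", task_state);	/* always 2 */
	return 0;
}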
> 
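
And a similar sketch of the wake-list idea the changelog leans on ("gets
remote CPUs to do their own in-place wakeups"): the waker never takes the
remote rq->lock, it just pushes the task onto the target CPU's lock-free
list and kicks it with an IPI, and the target drains the whole list
locally in sched_ttwu_pending(). Again a userspace analogue, not the
kernel's llist implementation:

/* Userspace sketch of the wake-list handoff: lock-free push by the waker
 * (roughly llist_add()), single-exchange drain by the target CPU
 * (roughly llist_del_all()).  Illustration only. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct wake_entry {
	struct wake_entry *next;
	int task_id;
};

/* One of these per CPU in the real code. */
static _Atomic(struct wake_entry *) wake_list;

/* Waker side: push without touching the target's runqueue lock. */
static void queue_wakeup(struct wake_entry *e)
{
	struct wake_entry *old = atomic_load_explicit(&wake_list,
						      memory_order_relaxed);
	do {
		e->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&wake_list, &old, e,
							memory_order_release,
							memory_order_relaxed));
	/* The real code now sends an IPI so the target runs
	 * sched_ttwu_pending(). */
}

/* Target side: grab the whole batch with one exchange, then wake each
 * task locally. */
static void drain_wakeups(void)
{
	struct wake_entry *e = atomic_exchange_explicit(&wake_list, NULL,
							memory_order_acquire);
	for (; e; e = e->next)
		printf("waking task %d locally\n", e->task_id);
}

int main(void)
{
	struct wake_entry a = { .task_id = 1 }, b = { .task_id = 2 };

	queue_wakeup(&a);
	queue_wakeup(&b);
	drain_wakeups();	/* drains LIFO: task 2, then task 1 */
	return 0;
}

Grabbing the whole batch with a single exchange is what the changelog is
after: the pending wakeups all get handled under the target's own
rq->lock instead of each remote waker bouncing that lock across CPUs.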

-- 


