From: Andrea Righi <arighi@nvidia.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Koba Ko <kobak@nvidia.com>,
	Felix Abecassis <fabecassis@nvidia.com>,
	Balbir Singh <balbirs@nvidia.com>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Date: Wed, 6 May 2026 20:15:01 +0200
Message-ID: <afuFJUx_r2lm3y_q@gpd4>
In-Reply-To: <CAKfTPtDLDfACg+71-CCG_yqWqJUFjUmh6CRurkUzeBYh-KeEmg@mail.gmail.com>

Hi Vincent,

On Wed, May 06, 2026 at 12:29:10PM +0200, Vincent Guittot wrote:
> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active, always prefer fully-idle SMT cores over partially-idle
> > ones.
> >
> > Prioritizing fully-idle SMT cores yields better task placement: the
> > effective capacity of partially-idle SMT cores is reduced, so always
> > preferring fully-idle cores when available leads to more accurate
> > capacity usage on task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads running a number of tasks equal to the number of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 65 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bbdf537f61154..6a7e4943804b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >         return idle_cpu;
> >  }
> >
> > +/*
> > + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> > + * are more preferred (see select_idle_capacity()).
> > + */
> > +enum asym_fits_state {
> > +       /* In descending order of preference */
> > +       ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> > +       ASYM_IDLE_CORE_COMPLETE_MISFIT,
> > +       ASYM_IDLE_THREAD_FITS,
> > +       ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > +       ASYM_IDLE_COMPLETE_MISFIT,
> > +
> > +       /* util_fits_cpu() bias for an idle core. */
> > +       ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> > +       bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> >         unsigned long task_util, util_min, util_max, best_cap = 0;
> > -       int fits, best_fits = 0;
> > +       int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> >         int cpu, best_cpu = -1;
> >         struct cpumask *cpus;
> >
> > @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> >         for_each_cpu_wrap(cpu, cpus, target) {
> > +               bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> >                 unsigned long cpu_cap = capacity_of(cpu);
> >
> >                 if (!choose_idle_cpu(cpu, p))
> > @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> >                 /* This CPU fits with all requirements */
> > -               if (fits > 0)
> > +               if (fits > 0 && preferred_core)
> >                         return cpu;
> >                 /*
> >                  * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                  */
> >                 else if (fits < 0)
> >                         cpu_cap = get_actual_cpu_capacity(cpu);
> > +               /*
> > +                * fits > 0 implies we are not on a preferred core
> > +                * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> > +                * so the effective range becomes
> > +                * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> > +                *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
> > +                *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> > +                *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> > +                */
> > +               else if (fits > 0)
> > +                       fits = ASYM_IDLE_THREAD_FITS;
> > +
> > +               /*
> > +                * If we are on a preferred core, translate the range of fits
> > +                * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> > +                * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> > +                * This ensures that an idle core is always given priority over
> > +                * a (partially) busy core.
> > +                *
> > +                * A fully fitting idle core would have returned early and hence
> > +                * fits > 0 for preferred_core need not be dealt with.
> > +                */
> > +               if (preferred_core)
> > +                       fits += ASYM_IDLE_CORE_BIAS;
> 
> It might be good to add a comment stating that if the system doesn't
> have SMT, prefers_idle_core is always false and preferred_core is
> therefore always true for every scanned CPU.
> 
> This is okay because CPU == core in this case, but the resulting value
> differs from the default 0 or -1 of util_fits_cpu().

Ack.
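
Something along these lines, maybe (just a sketch of the comment
wording, to be adjusted):

```c
	/*
	 * Without SMT, prefers_idle_core is always false, so preferred_core
	 * evaluates true for every scanned CPU: CPU == core in that case.
	 * The biased fits values then differ from the plain 0 / -1 returned
	 * by util_fits_cpu(), but the bias is applied uniformly, so the
	 * relative ordering of the candidates is unchanged.
	 */
	bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
```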

> 
> >
> >                 /*
> > -                * First, select CPU which fits better (-1 being better than 0).
> > +                * First, select CPU which fits better (lower is more preferred).
> >                  * Then, select the one with best capacity at same level.
> >                  */
> >                 if ((fits < best_fits) ||
> > @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 }
> >         }
> >
> > +       /*
> > +        * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
> 
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
> 
> ASYM_IDLE_CORE_BIAS is an offset to move an idle core that doesn't
> fully fit in the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
> ASYM_IDLE_CORE_COMPLETE_MISFIT]
> 
> Keeping in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT

Ah yes, using ASYM_IDLE_CORE_BIAS is just confusing, we should definitely use
[ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]. Will fix this.

> 
> > +        * range means the chosen CPU is in a fully idle SMT core. Values above
> > +        * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
> 
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

Ack.

> 
> > +        *
> > +        * The asym-capacity wakeup path returns from select_idle_sibling()
> > +        * after this function and never runs select_idle_cpu(), so the usual
> > +        * select_idle_cpu() tail that clears idle cores must live here when the
> > +        * idle-core preference did not win.
> > +        */
> > +       if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
> 
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

Ack.

> 
> > +               set_idle_cores(target, false);
> > +
> >         return best_cpu;
> >  }
> >
> > @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> >                                  unsigned long util_max,
> >                                  int cpu)
> >  {
> > -       if (sched_asym_cpucap_active())
> > +       if (sched_asym_cpucap_active()) {
> >                 /*
> >                  * Return true only if the cpu fully fits the task requirements
> >                  * which include the utilization and the performance hints.
> > +                *
> > +                * When SMT is active, also require that the core has no busy
> > +                * siblings.
> >                  */
> > -               return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +               return (!sched_smt_active() || is_core_idle(cpu)) &&
> > +                      (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +       }
> >
> >         return true;
> >  }
> > --
> > 2.54.0
> >

Thanks,
-Andrea
