From: Andrea Righi <arighi@nvidia.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Christian Loehle <christian.loehle@arm.com>,
Koba Ko <kobak@nvidia.com>,
Felix Abecassis <fabecassis@nvidia.com>,
Balbir Singh <balbirs@nvidia.com>,
Joel Fernandes <joelagnelf@nvidia.com>,
Shrikanth Hegde <sshegde@linux.ibm.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Date: Wed, 6 May 2026 20:15:01 +0200
Message-ID: <afuFJUx_r2lm3y_q@gpd4>
In-Reply-To: <CAKfTPtDLDfACg+71-CCG_yqWqJUFjUmh6CRurkUzeBYh-KeEmg@mail.gmail.com>
Hi Vincent,
On Wed, May 06, 2026 at 12:29:10PM +0200, Vincent Guittot wrote:
> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active, always prefer fully-idle SMT cores over partially-idle
> > ones.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring them when available leads to more accurate capacity usage on
> > task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads running a number of tasks equal to the number of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 65 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bbdf537f61154..6a7e4943804b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > return idle_cpu;
> > }
> >
> > +/*
> > + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> > + * are more preferred (see select_idle_capacity()).
> > + */
> > +enum asym_fits_state {
> > + /* In descending order of preference */
> > + ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> > + ASYM_IDLE_CORE_COMPLETE_MISFIT,
> > + ASYM_IDLE_THREAD_FITS,
> > + ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > + ASYM_IDLE_COMPLETE_MISFIT,
> > +
> > + /* util_fits_cpu() bias for an idle core. */
> > + ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> > /*
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> > unsigned long task_util, util_min, util_max, best_cap = 0;
> > - int fits, best_fits = 0;
> > + int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> > int cpu, best_cpu = -1;
> > struct cpumask *cpus;
> >
> > @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > unsigned long cpu_cap = capacity_of(cpu);
> >
> > if (!choose_idle_cpu(cpu, p))
> > @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > + if (fits > 0 && preferred_core)
> > return cpu;
> > /*
> > * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > */
> > else if (fits < 0)
> > cpu_cap = get_actual_cpu_capacity(cpu);
> > + /*
> > + * fits > 0 implies we are not on a preferred core
> > + * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> > + * so the effective range becomes
> > + * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> > + * ASYM_IDLE_COMPLETE_MISFIT - does not fit
> > + * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> > + * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> > + */
> > + else if (fits > 0)
> > + fits = ASYM_IDLE_THREAD_FITS;
> > +
> > + /*
> > + * If we are on a preferred core, translate the range of fits
> > + * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> > + * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> > + * This ensures that an idle core is always given priority over
> > + * a (partially) busy core.
> > + *
> > + * A fully fitting idle core would have returned early and hence
> > + * fits > 0 for preferred_core need not be dealt with.
> > + */
> > + if (preferred_core)
> > + fits += ASYM_IDLE_CORE_BIAS;
>
> It might be good to add a comment stating that if the system doesn't
> have SMT, prefers_idle_core and preferred_core are always true.
>
> This is okay because CPU == Core in this case but the value differs
> from the default 0 or -1 of util_fits_cpu
Ack.
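For reference, the ranking arithmetic can be sketched as stand-alone
user-space C. The enum values are copied from the patch; rank_fit() is a
hypothetical stand-in for the transformation in select_idle_capacity(),
taking the raw util_fits_cpu() result directly as an int:

```c
#include <assert.h>

/* Values copied from the patch; lower means more preferred. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT = -3,
	ASYM_IDLE_THREAD_FITS = -2,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT = -1,
	ASYM_IDLE_COMPLETE_MISFIT = 0,

	/* Offset applied when the CPU sits on a fully idle SMT core. */
	ASYM_IDLE_CORE_BIAS = -3,
};

/*
 * Hypothetical stand-in for the ranking in select_idle_capacity():
 * @fits is the raw util_fits_cpu() result (1 fits fully, 0 complete
 * misfit, -1 uclamp_min misfit), @preferred_core says whether the CPU
 * is on a fully idle SMT core. The fully fitting idle-core case returns
 * the CPU early in the kernel, so it never goes through this ranking.
 */
static int rank_fit(int fits, int preferred_core)
{
	if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;
	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;
	return fits;
}
```

Note how even a completely misfitting idle core (-3) still ranks ahead of
a fully fitting CPU whose SMT sibling is busy (-2), which is exactly the
"always prefer fully-idle cores" policy of the patch.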
>
> >
> > /*
> > - * First, select CPU which fits better (-1 being better than 0).
> > + * First, select CPU which fits better (lower is more preferred).
> > * Then, select the one with best capacity at same level.
> > */
> > if ((fits < best_fits) ||
> > @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > }
> > }
> >
> > + /*
> > + * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
>
> ASYM_IDLE_CORE_BIAS is an offset to move an idle core that doesn't
> fully fit in the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
> ASYM_IDLE_CORE_COMPLETE_MISFIT]
>
> Keeping in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT
Ah yes, using ASYM_IDLE_CORE_BIAS is just confusing, we should definitely use
[ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]. Will fix this.
>
> > + * range means the chosen CPU is in a fully idle SMT core. Values above
> > + * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
Ack.
>
> > + *
> > + * The asym-capacity wakeup path returns from select_idle_sibling()
> > + * after this function and never runs select_idle_cpu(), so the usual
> > + * select_idle_cpu() tail that clears idle cores must live here when the
> > + * idle-core preference did not win.
> > + */
> > + if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
Ack.
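To make the clearing condition concrete, here is a hypothetical user-space
predicate (enum values copied from the patch) showing that, with the
suggested s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/ applied,
best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT is exactly "the best-ranked CPU
was not on a fully idle core", since idle-core winners can only land on
-4 or -3:

```c
#include <assert.h>

/* Values copied from the patch; lower means more preferred. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT = -3,
	ASYM_IDLE_THREAD_FITS = -2,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT = -1,
	ASYM_IDLE_COMPLETE_MISFIT = 0,
};

/*
 * Sketch of the select_idle_capacity() tail: the idle-cores hint is
 * cleared only when we were looking for fully idle cores and the best
 * ranked CPU fell outside the idle-core range [-4, -3]. (The fully
 * fitting idle-core case returns early and never reaches this check.)
 */
static int should_clear_idle_cores(int prefers_idle_core, int best_fits)
{
	return prefers_idle_core &&
	       best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT;
}
```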
>
> > + set_idle_cores(target, false);
> > +
> > return best_cpu;
> > }
> >
> > @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> > unsigned long util_max,
> > int cpu)
> > {
> > - if (sched_asym_cpucap_active())
> > + if (sched_asym_cpucap_active()) {
> > /*
> > * Return true only if the cpu fully fits the task requirements
> > * which include the utilization and the performance hints.
> > + *
> > + * When SMT is active, also require that the core has no busy
> > + * siblings.
> > */
> > - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + return (!sched_smt_active() || is_core_idle(cpu)) &&
> > + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + }
> >
> > return true;
> > }
> > --
> > 2.54.0
> >
Thanks,
-Andrea