From: Andrea Righi <arighi@nvidia.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Christian Loehle <christian.loehle@arm.com>,
Koba Ko <kobak@nvidia.com>,
Felix Abecassis <fabecassis@nvidia.com>,
Balbir Singh <balbirs@nvidia.com>,
Joel Fernandes <joelagnelf@nvidia.com>,
Shrikanth Hegde <sshegde@linux.ibm.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
Date: Fri, 24 Apr 2026 10:46:14 +0200
Message-ID: <aest1kBSzJ4JUvim@gpd4>
In-Reply-To: <75cf4fd1-2e80-4167-9113-954015ba63e1@amd.com>
Hi Prateek,
On Fri, Apr 24, 2026 at 10:44:09AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 4/23/2026 1:06 PM, Andrea Righi wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 69361c63353ad..934eb663f445e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> >
> > - if (sched_feat(SIS_UTIL)) {
> > + if (sched_feat(SIS_UTIL) && sd->shared) {
> > /*
> > * Increment because !--nr is the condition to stop scan.
> > *
> > @@ -12840,7 +12840,8 @@ static void set_cpu_sd_state_busy(int cpu)
> > goto unlock;
> > sd->nohz_idle = 0;
>
> I just realised this flag only matters for the "nr_busy_cpus" accounting
> and we can bail out earlier if we don't have an sd->shared at all.
>
> You can probably adapt this to use guard(rcu)() while you are at it,
> and send these bits first as a separate cleanup, noting that the
> assumption of sd_llc->shared always existing will change with the
> coming patches and that you are introducing guard rails for it.
Ack.
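
Something like this, maybe? Untested sketch, and set_cpu_sd_state_idle()
would get the symmetric treatment:

static void set_cpu_sd_state_busy(int cpu)
{
	struct sched_domain *sd;

	guard(rcu)();
	sd = rcu_dereference(per_cpu(sd_llc, cpu));

	/* nohz_idle only matters for the nr_busy_cpus accounting. */
	if (!sd || !sd->shared || !sd->nohz_idle)
		return;
	sd->nohz_idle = 0;

	atomic_inc(&sd->shared->nr_busy_cpus);
}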
>
> >
> > - atomic_inc(&sd->shared->nr_busy_cpus);
> > + if (sd->shared)
> > + atomic_inc(&sd->shared->nr_busy_cpus);
> > unlock:
> > rcu_read_unlock();
> > }
> > @@ -12869,7 +12870,8 @@ static void set_cpu_sd_state_idle(int cpu)
> > goto unlock;
> > sd->nohz_idle = 1;
> >
> > - atomic_dec(&sd->shared->nr_busy_cpus);
> > + if (sd->shared)
> > + atomic_dec(&sd->shared->nr_busy_cpus);
> > unlock:
> > rcu_read_unlock();
> > }
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5847b83d9d552..dc50193b198c6 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -680,19 +680,39 @@ static void update_top_cache_domain(int cpu)
> > int id = cpu;
> > int size = 1;
> >
> > + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> > + /*
> > + * The shared object is attached to sd_asym_cpucapacity only when the
> > + * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> > + * On overlapping (NUMA) asym domains we fall back to letting the
> > + * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> > + * here.
> > + */
> > + if (sd && sd->shared)
> > + sds = sd->shared;
> > +
> > + rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> > +
> > sd = highest_flag_domain(cpu, SD_SHARE_LLC);
> > if (sd) {
> > id = cpumask_first(sched_domain_span(sd));
> > size = cpumask_weight(sched_domain_span(sd));
> >
> > - /* If sd_llc exists, sd_llc_shared should exist too. */
> > - WARN_ON_ONCE(!sd->shared);
> > - sds = sd->shared;
> > + /*
> > + * If sd_asym_cpucapacity didn't claim the shared object,
> > + * sd_llc must have one linked.
> > + */
> > + if (!sds) {
> > + WARN_ON_ONCE(!sd->shared);
> > + sds = sd->shared;
> > + }
> > }
> >
> > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > per_cpu(sd_llc_size, cpu) = size;
> > per_cpu(sd_llc_id, cpu) = id;
> > +
> > + /* TODO: Rename sd_llc_shared to fit the new role. */
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> Would love for folks to chime in but IMO "sd_wakeup_shared" sounds
> pretty reasonable since it is mainly the wakeup path that depends on
> this except for one !ASYM load balancing trigger.
sd_wakeup_shared captures the biggest consumer (wakeup), but not the nohz
balancer kick logic.

Maybe "sd_balance_shared" (balance in a broad sense, since wakeup still
affects balancing in the end), or "sd_effective_shared" (if we want to
stress that the owning topology level may move: LLC vs asym)?
>
> >
> > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > @@ -711,9 +731,6 @@ static void update_top_cache_domain(int cpu)
> >
> > sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> > rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> > -
> > - sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> > - rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> > }
> >
> > /*
> > @@ -2650,6 +2667,15 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
> > }
> > }
> >
> > +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> > +{
> > + int sd_id = cpumask_first(sched_domain_span(sd));
> > +
> > + sd->shared = *per_cpu_ptr(d->sds, sd_id);
> > + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> > + atomic_inc(&sd->shared->ref);
> > +}
> > +
> > /*
> > * Build sched domains for a given set of CPUs and attach the sched domains
> > * to the individual CPUs
> > @@ -2708,20 +2734,53 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > }
> >
> > for_each_cpu(i, cpu_map) {
> > + struct sched_domain *sd_asym = NULL;
> > + bool asym_claimed = false;
> > +
> > sd = *per_cpu_ptr(d.sd, i);
> > if (!sd)
> > continue;
> >
> > + /*
> > + * In case of ASYM_CPUCAPACITY, attach sd->shared to
> > + * sd_asym_cpucapacity for wakeup stat tracking.
> > + *
> > + * Caveats:
> > + *
> > + * 1) has_asym is system-wide, but a given CPU may still
> > + * lack an SD_ASYM_CPUCAPACITY_FULL ancestor (e.g., an
> > + * exclusive cpuset carving out a symmetric capacity island).
> > + * Such CPUs must fall through to the LLC seeding path below.
> > + *
> > + * 2) Skip the asym attach if the asym ancestor is an
> > + * overlapping domain (SD_NUMA). On those topologies let the
> > + * LLC path own the shared object instead.
> > + *
> > + * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
> > + * always has more than one group else it is prone to
> > + * degeneration.
>
> I looked into this and we only set SD_ASYM_CPUCAPACITY if we find more
> than one capacity, and SD_ASYM_CPUCAPACITY_FULL implies there are at
> least two CPUs covering different capacities in the span.
>
> The very first SD_ASYM_CPUCAPACITY_FULL domain should be safe from
> degeneration when it is non-overlapping.
Makes sense, maybe we can replace the XXX part with a note like this:
* Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct
* capacities exist in the domain span, so the asym domain we attach
* to cannot degenerate into a single-capacity group. The relevant
* edge cases are instead covered by the caveats above.
>
> > + */
> > + sd_asym = sd;
> > + while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> > + sd_asym = sd_asym->parent;
> > +
> > + if (sd_asym && !(sd_asym->flags & SD_NUMA)) {
> > + init_sched_domain_shared(&d, sd_asym);
> > + asym_claimed = true;
> > + }
>
> We should probably guard this behind a "has_asym" check. Maybe even
> extract it into a separate helper if the nesting gets too deep. Thoughts?
Ack, we can add an `if (has_asym)` check as a quick bail-out and fold the
walk + NUMA check into a small helper.
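
Maybe something like this (untested, and the helper name is just a
placeholder):

static struct sched_domain *asym_shared_domain(struct sched_domain *sd)
{
	/* Walk up to the first SD_ASYM_CPUCAPACITY_FULL ancestor. */
	while (sd && !(sd->flags & SD_ASYM_CPUCAPACITY_FULL))
		sd = sd->parent;

	/* On overlapping (NUMA) asym domains let the LLC path own sd->shared. */
	if (sd && (sd->flags & SD_NUMA))
		return NULL;

	return sd;
}

And the attach in build_sched_domains() would become:

	if (has_asym) {
		sd_asym = asym_shared_domain(sd);
		if (sd_asym) {
			init_sched_domain_shared(&d, sd_asym);
			asym_claimed = true;
		}
	}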
>
> > +
> > /* First, find the topmost SD_SHARE_LLC domain */
> > + sd = *per_cpu_ptr(d.sd, i);
>
> nit.
>
> I think this reassignment is no longer required since you use a separate
> "sd_asym" variable now.
Ack.
>
> > while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> > sd = sd->parent;
> >
> > if (sd->flags & SD_SHARE_LLC) {
> > - int sd_id = cpumask_first(sched_domain_span(sd));
> > -
> > - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> > - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> > - atomic_inc(&sd->shared->ref);
> > + /*
> > + * Initialize the sd->shared for SD_SHARE_LLC unless
> > + * the asym path above already claimed it.
> > + */
> > + if (!asym_claimed)
> > + init_sched_domain_shared(&d, sd);
>
> Tbh, if "has_asym" is true, we probably don't even need this since the
> nr_busy_cpus accounting gets us nothing.
>
> Might save a little overhead and space on those systems but I would
> love to hear if there are any concerns if we just drop the
> sd_llc->shared when we detect asym capacities.
Hm... but "has_asym" is global; we may still need the LLC-owned shared
object for the symmetric-capacity islands and the NUMA-overlap cases, no?
Thanks,
-Andrea