Date: Tue, 28 Apr 2026 12:15:15 +0530
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/6] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
To: Andrea Righi, K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
 Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
 Balbir Singh, Joel Fernandes, linux-kernel@vger.kernel.org,
 Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
References: <20260428051720.3180182-1-arighi@nvidia.com>
 <20260428051720.3180182-3-arighi@nvidia.com>
From: Shrikanth Hegde
In-Reply-To: <20260428051720.3180182-3-arighi@nvidia.com>

On 4/28/26 10:46 AM, Andrea Righi wrote:
> From: K Prateek Nayak
>
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
>
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
>
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
>
> Fall back to attaching the shared object to sd_llc in three cases:
>
>   1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
>
>   2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>      island: has_asym is system-wide but those CPUs have no
>      SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>      the symmetric LLC path in select_idle_sibling();
>
>   3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>      SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>      blob off cpumask_first(span), which on overlapping NUMA domains
>      would alias unrelated spans onto the same blob. Keep the shared
>      object on the LLC there; select_idle_capacity() gracefully skips
>      the has_idle_cores preference when sd->shared is NULL.
>

Can you share an example topology where this helps? Is the
SD_ASYM_CPUCAPACITY_FULL domain one level above the LLC but below NUMA?

> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
>

LLC scans happen at wakeup, but the name sd_balance_shared suggests it is
for load balancing.

> Co-developed-by: Andrea Righi
> Signed-off-by: Andrea Righi
> Signed-off-by: K Prateek Nayak
> ---
>  kernel/sched/fair.c     | 20 +++++----
>  kernel/sched/sched.h    |  2 +-
>  kernel/sched/topology.c | 91 +++++++++++++++++++++++++++++++++++------
>  3 files changed, 91 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fc0828150c780..ece3a26f59c27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7790,7 +7790,7 @@ static inline void set_idle_cores(int cpu, int val)
>  {
>  	struct sched_domain_shared *sds;
>
> -	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>  	if (sds)
>  		WRITE_ONCE(sds->has_idle_cores, val);
>  }
> @@ -7799,7 +7799,7 @@ static inline bool test_idle_cores(int cpu)
>  {
>  	struct sched_domain_shared *sds;
>
> -	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>  	if (sds)
>  		return READ_ONCE(sds->has_idle_cores);
>
> @@ -7808,7 +7808,7 @@ static inline bool test_idle_cores(int cpu)
>
>  /*
>   * Scans the local SMT mask to see if the entire core is idle, and records this
> - * information in sd_llc_shared->has_idle_cores.
> + * information in sd_balance_shared->has_idle_cores.
>   *
>   * Since SMT siblings share all cache levels, inspecting this limited remote
>   * state should be fairly cheap.
> @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
>
> -	if (sched_feat(SIS_UTIL)) {
> +	if (sched_feat(SIS_UTIL) && sd->shared) {
>  		/*
>  		 * Increment because !--nr is the condition to stop scan.
>  		 *
> @@ -12759,7 +12759,7 @@ static bool nohz_balancer_needs_kick(struct rq *rq)
>  		return false;
>  	}
>
> -	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>  	if (sds) {
>  		/*
>  		 * If there is an imbalance between LLC domains (IOW we could
> @@ -12841,10 +12841,13 @@ static void set_cpu_sd_state_busy(int cpu)
>  	guard(rcu)();
>  	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> -	if (!sd || !sd->nohz_idle)
> +	/*
> +	 * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this LLC
> +	 * domain has no shared object there is nothing to clear or account.
> +	 */
> +	if (!sd || !sd->shared || !sd->nohz_idle)
>  		return;
>  	sd->nohz_idle = 0;
> -
>  	atomic_inc(&sd->shared->nr_busy_cpus);
>  }
>
> @@ -12868,7 +12871,8 @@ static void set_cpu_sd_state_idle(int cpu)
>  	guard(rcu)();
>  	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> -	if (!sd || sd->nohz_idle)
> +	/* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
> +	if (!sd || !sd->shared || sd->nohz_idle)
>  		return;
>  	sd->nohz_idle = 1;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d1..330f5893c4561 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_size);
>  DECLARE_PER_CPU(int, sd_llc_id);
>  DECLARE_PER_CPU(int, sd_share_id);
> -DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d552..1e6ce369a4bbc 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(int, sd_share_id);
> -DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -680,20 +680,39 @@ static void update_top_cache_domain(int cpu)
>  	int id = cpu;
>  	int size = 1;
>
> +	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> +	/*
> +	 * The shared object is attached to sd_asym_cpucapacity only when the
> +	 * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> +	 * On overlapping (NUMA) asym domains we fall back to letting the
> +	 * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> +	 * here.
> +	 */
> +	if (sd && sd->shared)
> +		sds = sd->shared;
> +
> +	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> +
>  	sd = highest_flag_domain(cpu, SD_SHARE_LLC);
>  	if (sd) {
>  		id = cpumask_first(sched_domain_span(sd));
>  		size = cpumask_weight(sched_domain_span(sd));
>
> -		/* If sd_llc exists, sd_llc_shared should exist too. */
> -		WARN_ON_ONCE(!sd->shared);
> -		sds = sd->shared;
> +		/*
> +		 * If sd_asym_cpucapacity didn't claim the shared object,
> +		 * sd_llc must have one linked.
> +		 */
> +		if (!sds) {
> +			WARN_ON_ONCE(!sd->shared);
> +			sds = sd->shared;
> +		}
>  	}
>
>  	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>  	per_cpu(sd_llc_size, cpu) = size;
>  	per_cpu(sd_llc_id, cpu) = id;
> -	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> +
> +	rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
>
>  	sd = lowest_flag_domain(cpu, SD_CLUSTER);
>  	if (sd)
> @@ -711,9 +730,6 @@ static void update_top_cache_domain(int cpu)
>
>  	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>  	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> -
> -	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> -	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
>  }
>
>  /*
> @@ -2650,6 +2666,49 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>  	}
>  }
>
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> +	int sd_id = cpumask_first(sched_domain_span(sd));
> +
> +	sd->shared = *per_cpu_ptr(d->sds, sd_id);
> +	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> +	atomic_inc(&sd->shared->ref);
> +}
> +
> +/*
> + * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
> + * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
> + * not an overlapping NUMA-built domain (then LLC should claim shared).
> + *
> + * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
> + * then LLC must claim shared instead.
> + *
> + * Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct capacities
> + * exist in the domain span, so the asym domain we attach to cannot degenerate
> + * into a single-capacity group. The relevant edge cases are instead covered by
> + * the caveats above.
> + *
> + * Return true if this CPU's asym path claimed sd->shared, false otherwise.
> + */
> +static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
> +{
> +	struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
> +	struct sched_domain *sd_asym;
> +
> +	if (!sd)
> +		return false;
> +
> +	sd_asym = sd;
> +	while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> +		sd_asym = sd_asym->parent;
> +
> +	if (!sd_asym || (sd_asym->flags & SD_NUMA))
> +		return false;
> +
> +	init_sched_domain_shared(d, sd_asym);
> +	return true;
> +}
> +
>  /*
>   * Build sched domains for a given set of CPUs and attach the sched domains
>   * to the individual CPUs
> @@ -2708,20 +2767,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  	}
>
>  	for_each_cpu(i, cpu_map) {
> +		bool asym_claimed = false;
> +
>  		sd = *per_cpu_ptr(d.sd, i);
>  		if (!sd)
>  			continue;
>
> +		if (has_asym)
> +			asym_claimed = claim_asym_sched_domain_shared(&d, i);
> +
>  		/* First, find the topmost SD_SHARE_LLC domain */
>  		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
>  			sd = sd->parent;
>
>  		if (sd->flags & SD_SHARE_LLC) {
> -			int sd_id = cpumask_first(sched_domain_span(sd));
> -
> -			sd->shared = *per_cpu_ptr(d.sds, sd_id);
> -			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> -			atomic_inc(&sd->shared->ref);
> +			/*
> +			 * Initialize the sd->shared for SD_SHARE_LLC unless
> +			 * the asym path above already claimed it.
> +			 */
> +			if (!asym_claimed)
> +				init_sched_domain_shared(&d, sd);
>
>  			/*
>  			 * In presence of higher domains, adjust the