public inbox for linux-kernel@vger.kernel.org
* [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation
@ 2026-03-12  4:44 K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
                   ` (9 more replies)
  0 siblings, 10 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Hello folks,

I got distracted for a bit but here is v4 of the series with most of the
feedback on v3 incorporated. Nothing much has changed but if you see
anything you don't like, please let me know and we can discuss how to
best address it.

Background
==========

As discussed at LPC'25, allocating per-CPU "sched_domain_shared"
objects for every topology level was found to be unnecessary since only
"sd_llc_shared" is ever used by the scheduler; the rest are either
reclaimed during __sdt_free() or remain allocated without any purpose.

Folks are already optimizing away unnecessary sched domain allocations,
with commit f79c9aa446d6 ("x86/smpboot: avoid SMT domain attach/destroy
if SMT is not enabled") removing the SMT level entirely on the x86 side
when it is known that the domain will be degenerated anyway by the
scheduler.

New approach to sd->shared allocation
=====================================

This goes one step further with the "sched_domain_shared" allocations
by moving them out of "sd_data", which is allocated for every topology
level, and into "s_data" instead, which is allocated once per partition.

"sd->shared" is only allocated for the topmost SD_SHARE_LLC domain and
the topology layer uses the sched domain degeneration path to pass the
reference down to the final "sd_llc" domain. Since degeneration of a
parent ensures a 1:1 mapping between its span and the child's, and
since SD_SHARE_LLC domains never overlap, degeneration of an
SD_SHARE_LLC domain means either that its span is the same as that of
its child, or that it contains only a single CPU, making it redundant.

Future work
===========

This is an initial optimization towards the larger idea of breaking the
global "nohz.idle_cpus_mask" into per-LLC chunks, which would embed
themselves in "sd_llc_shared" and bloat the struct. Reducing the
overhead of allocating "sched_domain_shared" now will pay off later by
reducing the temporary memory pressure experienced during sched domain
rebuild.

Shrikanth's suggestion to entirely remove the per-CPU "sd_llc_shared"
has been deferred to this future work, once a few more users of
"sd_llc_shared" in CONFIG_NO_HZ_COMMON are converted over to use the
per-CPU "sd_nohz->shared" reference, leaving only
{test,set}_idle_cores() using the per-CPU "sd_llc_shared" references.

Misc cleanups
=============

Since the topology layer also checks for the existence of a valid
"sd->shared" when "sd_llc" is present, the handling of "sd_llc_shared"
can also be simplified when a reference to "sd_llc" is already present
in the scope (Patch 8 and Patch 9).

Patches are based on top of:

  git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit 54a66e431eea ("sched/headers: Inline raw_spin_rq_unlock()").
---
Changelog v3..v4:

o Collected tags from Chenyu, Shrikanth, and Valentin (Thanks a ton for
  reviewing and testing v3)

o Broke off the "imb_numa_nr" calculation into a separate helper to
  avoid having two "if" conditions - one searching for sd_llc and the
  other for sd_llc's parent. (Valentin)

o Moved the claiming of "d.sds" objects into claim_allocations() to keep
  all those bits in one place. (Shrikanth)

o Elaborated with a comment in Patch 9 on why dereferencing
  "sd->shared" in select_idle_cpu() is safe: its lifetime is tied to
  the "sd_llc" dereferenced in the caller, and as long as there exists
  an RCU protected reference to "sd", "sd->shared" is also valid.
  (Shrikanth)

o Illustrated the problematic case highlighted in Patch 1 with an
  example. (Shrikanth)

o Made a note of the larger optimization to "nohz.idle_cpus_mask"
  coming up in the cover letter that is helped along by this
  optimization. (Peter)

o Rebase and more testing with hotplug and cpuset.

v3: https://lore.kernel.org/lkml/20260120113246.27987-1-kprateek.nayak@amd.com/


Changelog rfc v2..v3:

o Broke off the "sd->shared" assignment optimization into a separate
  series for easier review.

o Spotted a case of incorrect calculation of load balancing periods
  in the presence of cpuset partitions (Patch 1).

o Broke off the single "sd->shared" assignment optimization patch into
  3 parts for easier review (Patch 2 - Patch 4). The "Reviewed-by:" tag
  from Gautham was dropped as a result.

o Building on the recent effort from Peter to remove the superfluous
  usage of rcu_read_lock() in !preemptible() regions, Patch 5 and
  Patch 6 clean up the fair task's wakeup path before adding more
  cleanups in Patch 7 and Patch 8.

o Dropped the RFC tag.

v2: https://lore.kernel.org/lkml/20251208083602.31898-1-kprateek.nayak@amd.com/
---
K Prateek Nayak (9):
  sched/topology: Compute sd_weight considering cpuset partitions
  sched/topology: Extract "imb_numa_nr" calculation into a separate
    helper
  sched/topology: Allocate per-CPU sched_domain_shared in s_data
  sched/topology: Switch to assigning "sd->shared" from s_data
  sched/topology: Remove sched_domain_shared allocation with sd_data
  sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
  sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
  sched/fair: Simplify the entry condition for update_idle_cpu_scan()
  sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()

 include/linux/sched/topology.h |   1 -
 kernel/sched/fair.c            |  70 ++++-----
 kernel/sched/sched.h           |   2 +-
 kernel/sched/topology.c        | 263 +++++++++++++++++++++------------
 4 files changed, 199 insertions(+), 137 deletions(-)


base-commit: 54a66e431eeacf23e1dc47cb3507f2d0c068aaf0
-- 
2.34.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-12  9:34   ` Peter Zijlstra
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper K Prateek Nayak
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

The "sd_weight" used for calculating the load balancing interval, and
its limits, considers the span weight of the entire topology level
without accounting for cpuset partitions.

For example, consider a large system of 128 CPUs divided into 8
partitions of 16 CPUs each, which is typical when deploying virtual
machines:

  [                      PKG Domain: 128CPUs                      ]

  [Partition0: 16CPUs][Partition1: 16CPUs] ... [Partition7: 16CPUs]

Although each partition only contains 16 CPUs, the load balancing
interval is set to a minimum of 128 jiffies based on the span of the
entire domain with 128 CPUs, which can lead to longer-lived imbalances
within the partition even though balancing within it is cheaper with
only 16 CPUs.

Compute the "sd_weight" after computing the "sd_span", considering the
cpu_map covered by the partition, and set the load balancing interval
and its limits accordingly.

For the above example, the balancing intervals for the partition's PKG
domain change as follows:

                  before   after
balance_interval   128      16
min_interval       128      16
max_interval       256      32

Intervals are now proportional to the number of CPUs in the partitioned
domain, as was intended by the original formula.

Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o Illustrated the changes in the load balancing intervals with an
  example. (Shrikanth)

o Collected the tags from Chenyu, Shrikanth, and Valentin. (Thanks a
  ton!)
---
 kernel/sched/topology.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c85f555..34b20b0e1867 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1645,8 +1645,6 @@ sd_init(struct sched_domain_topology_level *tl,
 	struct cpumask *sd_span;
 	u64 now = sched_clock();
 
-	sd_weight = cpumask_weight(tl->mask(tl, cpu));
-
 	if (tl->sd_flags)
 		sd_flags = (*tl->sd_flags)();
 	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
@@ -1654,8 +1652,6 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd_flags &= TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
-		.min_interval		= sd_weight,
-		.max_interval		= 2*sd_weight,
 		.busy_factor		= 16,
 		.imbalance_pct		= 117,
 
@@ -1675,7 +1671,6 @@ sd_init(struct sched_domain_topology_level *tl,
 					,
 
 		.last_balance		= jiffies,
-		.balance_interval	= sd_weight,
 
 		/* 50% success rate */
 		.newidle_call		= 512,
@@ -1693,6 +1688,11 @@ sd_init(struct sched_domain_topology_level *tl,
 	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
 	sd_id = cpumask_first(sd_span);
 
+	sd_weight = cpumask_weight(sd_span);
+	sd->min_interval = sd_weight;
+	sd->max_interval = 2 * sd_weight;
+	sd->balance_interval = sd_weight;
+
 	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
 
 	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-12 13:37   ` kernel test robot
                     ` (2 more replies)
  2026-03-12  4:44 ` [PATCH v4 3/9] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Subsequent changes to assign "sd->shared" from "s_data" will
necessitate finding the topmost SD_SHARE_LLC domain to assign the
shared object to.

This is very similar to the "imb_numa_nr" computation loop except that
"imb_numa_nr" cares about the first domain without the SD_SHARE_LLC flag
(immediate parent of sd_llc) whereas the "sd->shared" assignment would
require sd_llc itself.

Extract the "imb_numa_nr" calculation into a helper,
adjust_numa_imbalance(), and use the current loop in
build_sched_domains() to find the sd_llc.

While at it, guard the call behind CONFIG_NUMA's status since
"imb_numa_nr" only makes sense on NUMA-enabled configs with SD_NUMA
domains.

No functional changes intended.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o New patch based on the suggestion from Valentin and Chenyu in
  https://lore.kernel.org/lkml/xhsmh343e43fd.mognet@vschneid-thinkpadt14sgen2i.remote.csb/

  Notable deviation is moving the entire "imb_numa_nr" loop into the
  adjust_numa_imbalance() helper to keep all the bits in one place
  instead of passing "imb" and "imb_span" as references to the helper.

o Guarded the call behind CONFIG_NUMA's status to save overhead when
  NUMA domains don't exist.
---
 kernel/sched/topology.c | 133 ++++++++++++++++++++++++----------------
 1 file changed, 80 insertions(+), 53 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 34b20b0e1867..7f25c784c038 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2551,6 +2551,74 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 	return true;
 }
 
+/*
+ * Calculate an allowed NUMA imbalance such that LLCs do not get
+ * imbalanced.
+ */
+static void adjust_numa_imbalance(struct sched_domain *sd_llc)
+{
+	struct sched_domain *parent;
+	unsigned int imb_span = 1;
+	unsigned int imb = 0;
+	unsigned int nr_llcs;
+
+	WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
+	WARN_ON(!sd_llc->parent);
+
+	/*
+	 * For a single LLC per node, allow an
+	 * imbalance up to 12.5% of the node. This is
+	 * arbitrary cutoff based two factors -- SMT and
+	 * memory channels. For SMT-2, the intent is to
+	 * avoid premature sharing of HT resources but
+	 * SMT-4 or SMT-8 *may* benefit from a different
+	 * cutoff. For memory channels, this is a very
+	 * rough estimate of how many channels may be
+	 * active and is based on recent CPUs with
+	 * many cores.
+	 *
+	 * For multiple LLCs, allow an imbalance
+	 * until multiple tasks would share an LLC
+	 * on one node while LLCs on another node
+	 * remain idle. This assumes that there are
+	 * enough logical CPUs per LLC to avoid SMT
+	 * factors and that there is a correlation
+	 * between LLCs and memory channels.
+	 */
+	nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
+	if (nr_llcs == 1)
+		imb = sd_llc->parent->span_weight >> 3;
+	else
+		imb = nr_llcs;
+
+	imb = max(1U, imb);
+	sd_llc->parent->imb_numa_nr = imb;
+
+	/*
+	 * Set span based on the first NUMA domain.
+	 *
+	 * NUMA systems always add a NODE domain before
+	 * iterating the NUMA domains. Since this is before
+	 * degeneration, start from sd_llc's parent's
+	 * parent which is the lowest an SD_NUMA domain can
+	 * be relative to sd_llc.
+	 */
+	parent = sd_llc->parent->parent;
+	while (parent && !(parent->flags & SD_NUMA))
+		parent = parent->parent;
+
+	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
+
+	/* Update the upper remainder of the topology */
+	parent = sd_llc->parent;
+	while (parent) {
+		int factor = max(1U, (parent->span_weight / imb_span));
+
+		parent->imb_numa_nr = imb * factor;
+		parent = parent->parent;
+	}
+}
+
 /*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
@@ -2608,62 +2676,21 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
-	/*
-	 * Calculate an allowed NUMA imbalance such that LLCs do not get
-	 * imbalanced.
-	 */
 	for_each_cpu(i, cpu_map) {
-		unsigned int imb = 0;
-		unsigned int imb_span = 1;
+		sd = *per_cpu_ptr(d.sd, i);
+		if (!sd)
+			continue;
 
-		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-			struct sched_domain *child = sd->child;
-
-			if (!(sd->flags & SD_SHARE_LLC) && child &&
-			    (child->flags & SD_SHARE_LLC)) {
-				struct sched_domain __rcu *top_p;
-				unsigned int nr_llcs;
-
-				/*
-				 * For a single LLC per node, allow an
-				 * imbalance up to 12.5% of the node. This is
-				 * arbitrary cutoff based two factors -- SMT and
-				 * memory channels. For SMT-2, the intent is to
-				 * avoid premature sharing of HT resources but
-				 * SMT-4 or SMT-8 *may* benefit from a different
-				 * cutoff. For memory channels, this is a very
-				 * rough estimate of how many channels may be
-				 * active and is based on recent CPUs with
-				 * many cores.
-				 *
-				 * For multiple LLCs, allow an imbalance
-				 * until multiple tasks would share an LLC
-				 * on one node while LLCs on another node
-				 * remain idle. This assumes that there are
-				 * enough logical CPUs per LLC to avoid SMT
-				 * factors and that there is a correlation
-				 * between LLCs and memory channels.
-				 */
-				nr_llcs = sd->span_weight / child->span_weight;
-				if (nr_llcs == 1)
-					imb = sd->span_weight >> 3;
-				else
-					imb = nr_llcs;
-				imb = max(1U, imb);
-				sd->imb_numa_nr = imb;
-
-				/* Set span based on the first NUMA domain. */
-				top_p = sd->parent;
-				while (top_p && !(top_p->flags & SD_NUMA)) {
-					top_p = top_p->parent;
-				}
-				imb_span = top_p ? top_p->span_weight : sd->span_weight;
-			} else {
-				int factor = max(1U, (sd->span_weight / imb_span));
+		/* First, find the topmost SD_SHARE_LLC domain */
+		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+			sd = sd->parent;
 
-				sd->imb_numa_nr = imb * factor;
-			}
-		}
+		/*
+		 * In presence of higher domains, adjust the
+		 * NUMA imbalance stats for the hierarchy.
+		 */
+		if (IS_ENABLED(CONFIG_NUMA) && (sd->flags & SD_SHARE_LLC) && sd->parent)
+			adjust_numa_imbalance(sd);
 	}
 
 	/* Calculate CPU capacity for physical packages and nodes */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 3/9] sched/topology: Allocate per-CPU sched_domain_shared in s_data
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 4/9] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

The "sched_domain_shared" objects are allocated for every topology
level in __sdt_alloc() and are freed after the sched domain rebuild if
they aren't assigned during sd_init().

"sd->shared" is only assigned for SD_SHARE_LLC domains and out of all
the assigned objects, only "sd_llc_shared" is ever used by the
scheduler.

Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains
never overlap, allocate only a single range of per-CPU
"sched_domain_shared" objects with s_data instead of doing it per
topology level.

The subsequent commit uses the degeneration path to correctly assign the
"sd->shared" to the topmost SD_SHARE_LLC domain.

No functional changes are expected at this point.

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o Collected tags from Valentin, and Chenyu. (Thanks a ton!)
---
 kernel/sched/topology.c | 48 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 7f25c784c038..f0541c6511fa 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -782,6 +782,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 }
 
 struct s_data {
+	struct sched_domain_shared * __percpu *sds;
 	struct sched_domain * __percpu *sd;
 	struct root_domain	*rd;
 };
@@ -789,6 +790,7 @@ struct s_data {
 enum s_alloc {
 	sa_rootdomain,
 	sa_sd,
+	sa_sd_shared,
 	sa_sd_storage,
 	sa_none,
 };
@@ -1535,6 +1537,9 @@ static void set_domain_attribute(struct sched_domain *sd,
 static void __sdt_free(const struct cpumask *cpu_map);
 static int __sdt_alloc(const struct cpumask *cpu_map);
 
+static void __sds_free(struct s_data *d, const struct cpumask *cpu_map);
+static int __sds_alloc(struct s_data *d, const struct cpumask *cpu_map);
+
 static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 				 const struct cpumask *cpu_map)
 {
@@ -1546,6 +1551,9 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 	case sa_sd:
 		free_percpu(d->sd);
 		fallthrough;
+	case sa_sd_shared:
+		__sds_free(d, cpu_map);
+		fallthrough;
 	case sa_sd_storage:
 		__sdt_free(cpu_map);
 		fallthrough;
@@ -1561,9 +1569,11 @@ __visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
 
 	if (__sdt_alloc(cpu_map))
 		return sa_sd_storage;
+	if (__sds_alloc(d, cpu_map))
+		return sa_sd_shared;
 	d->sd = alloc_percpu(struct sched_domain *);
 	if (!d->sd)
-		return sa_sd_storage;
+		return sa_sd_shared;
 	d->rd = alloc_rootdomain();
 	if (!d->rd)
 		return sa_sd;
@@ -2466,6 +2476,42 @@ static void __sdt_free(const struct cpumask *cpu_map)
 	}
 }
 
+static int __sds_alloc(struct s_data *d, const struct cpumask *cpu_map)
+{
+	int j;
+
+	d->sds = alloc_percpu(struct sched_domain_shared *);
+	if (!d->sds)
+		return -ENOMEM;
+
+	for_each_cpu(j, cpu_map) {
+		struct sched_domain_shared *sds;
+
+		sds = kzalloc_node(sizeof(struct sched_domain_shared),
+				GFP_KERNEL, cpu_to_node(j));
+		if (!sds)
+			return -ENOMEM;
+
+		*per_cpu_ptr(d->sds, j) = sds;
+	}
+
+	return 0;
+}
+
+static void __sds_free(struct s_data *d, const struct cpumask *cpu_map)
+{
+	int j;
+
+	if (!d->sds)
+		return;
+
+	for_each_cpu(j, cpu_map)
+		kfree(*per_cpu_ptr(d->sds, j));
+
+	free_percpu(d->sds);
+	d->sds = NULL;
+}
+
 static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 4/9] sched/topology: Switch to assigning "sd->shared" from s_data
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (2 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 3/9] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 5/9] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Use the "sched_domain_shared" object allocated in s_data for
"sd->shared" assignments. Assign "sd->shared" for the topmost
SD_SHARE_LLC domain before degeneration and rely on the degeneration
path to correctly pass down the shared object to "sd_llc".

sd_parent_degenerate() ensures that degenerating domains have the same
sched_domain_span(), which guarantees a 1:1 hand-off of the shared
object.
If the topmost SD_SHARE_LLC domain degenerates, the shared object is
freed from destroy_sched_domain() when the last reference is dropped.

claim_allocations() NULLs out the objects that have been assigned as
"sd->shared" and the unassigned ones are freed from the __sds_free()
path.

To keep all the claim_allocations() bits in one place,
claim_allocations() has been extended to accept "s_data" and iterate the
domains internally to free both "sched_domain_shared" and the
per-topology-level data for the particular CPU in one place.

Post cpu_attach_domain(), all reclaims of "sd->shared" are handled by
call_rcu() on the sched_domain object via destroy_sched_domains_rcu().

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o Moved claiming the per-CPU "d.sds" reference into claim_allocations()
  to keep everything in one place. (Shrikanth)

o Slightly different diff as a result of moving the "imb_numa_nr"
  calculation into a separate helper in Patch 2.
---
 kernel/sched/topology.c | 73 +++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f0541c6511fa..ebd955faab40 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -685,6 +685,9 @@ static void update_top_cache_domain(int cpu)
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
+
+		/* If sd_llc exists, sd_llc_shared should exist too. */
+		WARN_ON_ONCE(!sd->shared);
 		sds = sd->shared;
 	}
 
@@ -733,6 +736,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 		if (sd_parent_degenerate(tmp, parent)) {
 			tmp->parent = parent->parent;
 
+			/* Pick reference to parent->shared. */
+			if (parent->shared) {
+				WARN_ON_ONCE(tmp->shared);
+				tmp->shared = parent->shared;
+				parent->shared = NULL;
+			}
+
 			if (parent->parent) {
 				parent->parent->child = tmp;
 				parent->parent->groups->flags = tmp->flags;
@@ -1586,21 +1596,28 @@ __visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
  * sched_group structure so that the subsequent __free_domain_allocs()
  * will not free the data we're using.
  */
-static void claim_allocations(int cpu, struct sched_domain *sd)
+static void claim_allocations(int cpu, struct s_data *d)
 {
-	struct sd_data *sdd = sd->private;
+	struct sched_domain *sd;
+
+	if (atomic_read(&(*per_cpu_ptr(d->sds, cpu))->ref))
+		*per_cpu_ptr(d->sds, cpu) = NULL;
 
-	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
-	*per_cpu_ptr(sdd->sd, cpu) = NULL;
+	for (sd = *per_cpu_ptr(d->sd, cpu); sd; sd = sd->parent) {
+		struct sd_data *sdd = sd->private;
 
-	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
-		*per_cpu_ptr(sdd->sds, cpu) = NULL;
+		WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
+		*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
-		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+		if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
+			*per_cpu_ptr(sdd->sds, cpu) = NULL;
 
-	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
-		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+		if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
+			*per_cpu_ptr(sdd->sg, cpu) = NULL;
+
+		if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
+			*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+	}
 }
 
 #ifdef CONFIG_NUMA
@@ -1740,16 +1757,6 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 	}
 
-	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
-	 * instance.
-	 */
-	if (sd->flags & SD_SHARE_LLC) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
-		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
-
 	sd->private = sdd;
 
 	return sd;
@@ -2731,12 +2738,20 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
 			sd = sd->parent;
 
-		/*
-		 * In presence of higher domains, adjust the
-		 * NUMA imbalance stats for the hierarchy.
-		 */
-		if (IS_ENABLED(CONFIG_NUMA) && (sd->flags & SD_SHARE_LLC) && sd->parent)
-			adjust_numa_imbalance(sd);
+		if (sd->flags & SD_SHARE_LLC) {
+			int sd_id = cpumask_first(sched_domain_span(sd));
+
+			sd->shared = *per_cpu_ptr(d.sds, sd_id);
+			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+			atomic_inc(&sd->shared->ref);
+
+			/*
+			 * In presence of higher domains, adjust the
+			 * NUMA imbalance stats for the hierarchy.
+			 */
+			if (IS_ENABLED(CONFIG_NUMA) && sd->parent)
+				adjust_numa_imbalance(sd);
+		}
 	}
 
 	/* Calculate CPU capacity for physical packages and nodes */
@@ -2744,10 +2759,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		if (!cpumask_test_cpu(i, cpu_map))
 			continue;
 
-		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-			claim_allocations(i, sd);
+		claim_allocations(i, &d);
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent)
 			init_sched_groups_capacity(i, sd);
-		}
 	}
 
 	/* Attach the domains */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 5/9] sched/topology: Remove sched_domain_shared allocation with sd_data
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (3 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 4/9] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Now that "sd->shared" assignments are using the sched_domain_shared
objects allocated with s_data, remove the sd_data based allocations.

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o Collected tag from Valentin. (Thanks a ton!)
---
 include/linux/sched/topology.h |  1 -
 kernel/sched/topology.c        | 19 -------------------
 2 files changed, 20 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a1e1032426dc..51c29581f15e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -172,7 +172,6 @@ typedef int (*sched_domain_flags_f)(void);
 
 struct sd_data {
 	struct sched_domain *__percpu *sd;
-	struct sched_domain_shared *__percpu *sds;
 	struct sched_group *__percpu *sg;
 	struct sched_group_capacity *__percpu *sgc;
 };
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ebd955faab40..963007d83216 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1609,9 +1609,6 @@ static void claim_allocations(int cpu, struct s_data *d)
 		WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 		*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-		if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
-			*per_cpu_ptr(sdd->sds, cpu) = NULL;
-
 		if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 			*per_cpu_ptr(sdd->sg, cpu) = NULL;
 
@@ -2392,10 +2389,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 		if (!sdd->sd)
 			return -ENOMEM;
 
-		sdd->sds = alloc_percpu(struct sched_domain_shared *);
-		if (!sdd->sds)
-			return -ENOMEM;
-
 		sdd->sg = alloc_percpu(struct sched_group *);
 		if (!sdd->sg)
 			return -ENOMEM;
@@ -2406,7 +2399,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
-			struct sched_domain_shared *sds;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
 
@@ -2417,13 +2409,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
-					GFP_KERNEL, cpu_to_node(j));
-			if (!sds)
-				return -ENOMEM;
-
-			*per_cpu_ptr(sdd->sds, j) = sds;
-
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sg)
@@ -2465,8 +2450,6 @@ static void __sdt_free(const struct cpumask *cpu_map)
 				kfree(*per_cpu_ptr(sdd->sd, j));
 			}
 
-			if (sdd->sds)
-				kfree(*per_cpu_ptr(sdd->sds, j));
 			if (sdd->sg)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
@@ -2474,8 +2457,6 @@ static void __sdt_free(const struct cpumask *cpu_map)
 		}
 		free_percpu(sdd->sd);
 		sdd->sd = NULL;
-		free_percpu(sdd->sds);
-		sdd->sds = NULL;
 		free_percpu(sdd->sg);
 		sdd->sg = NULL;
 		free_percpu(sdd->sgc);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (4 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 5/9] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-12  9:46   ` Peter Zijlstra
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Similar to commit 71fedc41c23b ("sched/fair: Switch to
rcu_dereference_all()"), switch to checking for rcu_read_lock_any_held()
in idle_get_state() to allow removing superfluous rcu_read_lock()
regions in the fair task's wakeup path where the pi_lock is held and
IRQs are disabled.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o No changes.
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 953d89d71804..c4fc7726f82a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2853,7 +2853,7 @@ static inline void idle_set_state(struct rq *rq,
 
 static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	WARN_ON_ONCE(!rcu_read_lock_any_held());
 
 	return rq->idle_state;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (5 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-15 23:36   ` Dietmar Eggemann
  2026-03-18  8:08   ` [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() in the " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 8/9] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

select_task_rq_fair() is always called with p->pi_lock held and IRQs
disabled, which makes it equivalent to an RCU read-side critical
section.

Since commit 71fedc41c23b ("sched/fair: Switch to
rcu_dereference_all()") switched to using rcu_dereference_all() in the
wakeup path, drop the explicit rcu_read_{lock,unlock}() in the fair
task's wakeup path.

Future plans to reuse select_task_rq_fair() /
find_energy_efficient_cpu() in the fair class' balance callback will do
so with IRQs disabled and will comply with the requirements of
rcu_dereference_all(), keeping this change safe for that upcoming work
as well.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o No changes.
---
 kernel/sched/fair.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d57c02e82f3a..28853c0abb83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8570,10 +8570,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	struct perf_domain *pd;
 	struct energy_env eenv;
 
-	rcu_read_lock();
 	pd = rcu_dereference_all(rd->pd);
 	if (!pd)
-		goto unlock;
+		return target;
 
 	/*
 	 * Energy-aware wake-up happens on the lowest sched_domain starting
@@ -8583,13 +8582,13 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
 		sd = sd->parent;
 	if (!sd)
-		goto unlock;
+		return target;
 
 	target = prev_cpu;
 
 	sync_entity_load_avg(&p->se);
 	if (!task_util_est(p) && p_util_min == 0)
-		goto unlock;
+		return target;
 
 	eenv_task_busy_time(&eenv, p, prev_cpu);
 
@@ -8684,7 +8683,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 						    prev_cpu);
 			/* CPU utilization has changed */
 			if (prev_delta < base_energy)
-				goto unlock;
+				return target;
 			prev_delta -= base_energy;
 			prev_actual_cap = cpu_actual_cap;
 			best_delta = min(best_delta, prev_delta);
@@ -8708,7 +8707,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 						   max_spare_cap_cpu);
 			/* CPU utilization has changed */
 			if (cur_delta < base_energy)
-				goto unlock;
+				return target;
 			cur_delta -= base_energy;
 
 			/*
@@ -8725,7 +8724,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			best_actual_cap = cpu_actual_cap;
 		}
 	}
-	rcu_read_unlock();
 
 	if ((best_fits > prev_fits) ||
 	    ((best_fits > 0) && (best_delta < prev_delta)) ||
@@ -8733,11 +8731,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		target = best_energy_cpu;
 
 	return target;
-
-unlock:
-	rcu_read_unlock();
-
-	return target;
 }
 
 /*
@@ -8782,7 +8775,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
@@ -8808,14 +8800,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 			break;
 	}
 
-	if (unlikely(sd)) {
-		/* Slow path */
-		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
-	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
-		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-	}
-	rcu_read_unlock();
+	/* Slow path */
+	if (unlikely(sd))
+		return sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
+
+	/* Fast path */
+	if (wake_flags & WF_TTWU)
+		return select_idle_sibling(p, prev_cpu, new_cpu);
 
 	return new_cpu;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 8/9] sched/fair: Simplify the entry condition for update_idle_cpu_scan()
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (6 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-12  4:44 ` [PATCH v4 9/9] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
  2026-03-16  0:22 ` [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation Dietmar Eggemann
  9 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Only the topmost SD_SHARE_LLC domain has the "sd->shared" assigned.
Simply use "sd->shared" as an indicator for load balancing at the highest
SD_SHARE_LLC domain in update_idle_cpu_scan() instead of relying on
llc_size.

Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o Collected tags from Chenyu. (Thanks a ton!)
---
 kernel/sched/fair.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28853c0abb83..d7e4de909a63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11234,6 +11234,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
 				 unsigned long sum_util)
 {
 	struct sched_domain_shared *sd_share;
+	struct sched_domain *sd = env->sd;
 	int llc_weight, pct;
 	u64 x, y, tmp;
 	/*
@@ -11247,11 +11248,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
 		return;
 
-	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
-	if (env->sd->span_weight != llc_weight)
-		return;
-
-	sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, env->dst_cpu));
+	sd_share = sd->shared;
 	if (!sd_share)
 		return;
 
@@ -11285,10 +11282,11 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	 */
 	/* equation [3] */
 	x = sum_util;
+	llc_weight = sd->span_weight;
 	do_div(x, llc_weight);
 
 	/* equation [4] */
-	pct = env->sd->imbalance_pct;
+	pct = sd->imbalance_pct;
 	tmp = x * x * pct * pct;
 	do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
 	tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v4 9/9] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (7 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 8/9] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
@ 2026-03-12  4:44 ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2026-03-16  0:22 ` [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation Dietmar Eggemann
  9 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  4:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
	Shrikanth Hegde, Li Chen, Gautham R. Shenoy, K Prateek Nayak

Use the "sd_llc" passed to select_idle_cpu() to obtain the
"sd_llc_shared" instead of dereferencing the per-CPU variable.

Since "sd->shared" is always reclaimed at the same time as "sd" via
call_rcu() and update_top_cache_domain() always ensures a valid
"sd->shared" assignment when "sd_llc" is present, "sd_llc->shared" can
always be dereferenced without needing an additional check.

While at it, move the cpumask_and() operation after the SIS_UTIL bailout
check to avoid unnecessarily computing the cpumask.

Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog v3..v4:

o Collected tags from Chenyu and Shrikanth. (Thanks a ton!)

o Added a brief comment in select_idle_cpu() on why directly
  dereferencing "sd->shared" is safe as long as an RCU-protected
  reference to "sd" exists. (Shrikanth).
---
 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7e4de909a63..8dbf63d460b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7876,21 +7876,26 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	struct sched_domain_shared *sd_share;
-
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	if (sched_feat(SIS_UTIL)) {
-		sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target));
-		if (sd_share) {
-			/* because !--nr is the condition to stop scan */
-			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
-			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
-				return -1;
-		}
+		/*
+		 * Increment because !--nr is the condition to stop scan.
+		 *
+		 * Since "sd" is "sd_llc" for target CPU dereferenced in the
+		 * caller, it is safe to directly dereference "sd->shared".
+		 * Topology bits always ensure it is assigned for "sd_llc" and
+		 * it cannot disappear as long as we have an RCU-protected
+		 * reference to the associated "sd" here.
+		 */
+		nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+		/* overloaded LLC is unlikely to have idle cpu/core */
+		if (nr == 1)
+			return -1;
 	}
 
+	if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr))
+		return -1;
+
 	if (static_branch_unlikely(&sched_cluster_active)) {
 		struct sched_group *sg = sd->groups;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-12  4:44 ` [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
@ 2026-03-12  9:34   ` Peter Zijlstra
  2026-03-12  9:59     ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  1 sibling, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-12  9:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

On Thu, Mar 12, 2026 at 04:44:26AM +0000, K Prateek Nayak wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 061f8c85f555..34b20b0e1867 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1645,8 +1645,6 @@ sd_init(struct sched_domain_topology_level *tl,
>  	struct cpumask *sd_span;
>  	u64 now = sched_clock();
>  
> -	sd_weight = cpumask_weight(tl->mask(tl, cpu));
> -
>  	if (tl->sd_flags)
>  		sd_flags = (*tl->sd_flags)();
>  	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
> @@ -1654,8 +1652,6 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd_flags &= TOPOLOGY_SD_FLAGS;
>  
>  	*sd = (struct sched_domain){
> -		.min_interval		= sd_weight,
> -		.max_interval		= 2*sd_weight,
>  		.busy_factor		= 16,
>  		.imbalance_pct		= 117,
>  
> @@ -1675,7 +1671,6 @@ sd_init(struct sched_domain_topology_level *tl,
>  					,
>  
>  		.last_balance		= jiffies,
> -		.balance_interval	= sd_weight,
>  
>  		/* 50% success rate */
>  		.newidle_call		= 512,
> @@ -1693,6 +1688,11 @@ sd_init(struct sched_domain_topology_level *tl,
>  	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
>  	sd_id = cpumask_first(sd_span);
>  
> +	sd_weight = cpumask_weight(sd_span);
> +	sd->min_interval = sd_weight;
> +	sd->max_interval = 2 * sd_weight;
> +	sd->balance_interval = sd_weight;
> +
>  	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
>  
>  	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==


Why not like so?

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c85f555..79bab80af8f2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1645,13 +1645,17 @@ sd_init(struct sched_domain_topology_level *tl,
 	struct cpumask *sd_span;
 	u64 now = sched_clock();
 
-	sd_weight = cpumask_weight(tl->mask(tl, cpu));
+	sd_span = sched_domain_span(sd);
+	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+	sd_weight = cpumask_weight(sd_span);
+	sd_id = cpumask_first(sd_span);
 
 	if (tl->sd_flags)
 		sd_flags = (*tl->sd_flags)();
 	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
-			"wrong sd_flags in topology description\n"))
+		      "wrong sd_flags in topology description\n"))
 		sd_flags &= TOPOLOGY_SD_FLAGS;
+	sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
@@ -1689,12 +1693,6 @@ sd_init(struct sched_domain_topology_level *tl,
 		.name			= tl->name,
 	};
 
-	sd_span = sched_domain_span(sd);
-	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
-	sd_id = cpumask_first(sd_span);
-
-	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
-
 	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
 		  (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
 		  "CPU capacity asymmetry not supported on SMT\n");

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
  2026-03-12  4:44 ` [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
@ 2026-03-12  9:46   ` Peter Zijlstra
  2026-03-12 10:06     ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  1 sibling, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-12  9:46 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

On Thu, Mar 12, 2026 at 04:44:31AM +0000, K Prateek Nayak wrote:
> Similar to commit 71fedc41c23b ("sched/fair: Switch to
> rcu_dereference_all()"), switch to checking for rcu_read_lock_any_held()
> in idle_get_state() to allow removing superfluous rcu_read_lock()
> regions in the fair task's wakeup path where the pi_lock is held and
> IRQs are disabled.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog v3..v4:
> 
> o No changes.
> ---
>  kernel/sched/sched.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 953d89d71804..c4fc7726f82a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2853,7 +2853,7 @@ static inline void idle_set_state(struct rq *rq,
>  
>  static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>  {
> -	WARN_ON_ONCE(!rcu_read_lock_held());
> +	WARN_ON_ONCE(!rcu_read_lock_any_held());

Should we perhaps make that:

	lockdep_assert(rcu_read_lock_any_held());

?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-12  9:34   ` Peter Zijlstra
@ 2026-03-12  9:59     ` K Prateek Nayak
  2026-03-12 10:01       ` Peter Zijlstra
  0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12  9:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

Hello Peter,

On 3/12/2026 3:04 PM, Peter Zijlstra wrote:
> Why not like so?
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 061f8c85f555..79bab80af8f2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1645,13 +1645,17 @@ sd_init(struct sched_domain_topology_level *tl,
>  	struct cpumask *sd_span;
>  	u64 now = sched_clock();
>  
> -	sd_weight = cpumask_weight(tl->mask(tl, cpu));
> +	sd_span = sched_domain_span(sd);
> +	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> +	sd_weight = cpumask_weight(sd_span);
> +	sd_id = cpumask_first(sd_span);
>  
>  	if (tl->sd_flags)
>  		sd_flags = (*tl->sd_flags)();
>  	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
> -			"wrong sd_flags in topology description\n"))
> +		      "wrong sd_flags in topology description\n"))
>  		sd_flags &= TOPOLOGY_SD_FLAGS;
> +	sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);


That can work too. Since sd_span is just a variable array at the
end of sched_domain, it shouldn't be affected by the assignment below.
I'll update this in the next version.

>  
>  	*sd = (struct sched_domain){
>  		.min_interval		= sd_weight,
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-12  9:59     ` K Prateek Nayak
@ 2026-03-12 10:01       ` Peter Zijlstra
  2026-03-12 10:09         ` K Prateek Nayak
  0 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-12 10:01 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

On Thu, Mar 12, 2026 at 03:29:09PM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 3/12/2026 3:04 PM, Peter Zijlstra wrote:
> > Why not like so?
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 061f8c85f555..79bab80af8f2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1645,13 +1645,17 @@ sd_init(struct sched_domain_topology_level *tl,
> >  	struct cpumask *sd_span;
> >  	u64 now = sched_clock();
> >  
> > -	sd_weight = cpumask_weight(tl->mask(tl, cpu));
> > +	sd_span = sched_domain_span(sd);
> > +	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> > +	sd_weight = cpumask_weight(sd_span);
> > +	sd_id = cpumask_first(sd_span);
> >  
> >  	if (tl->sd_flags)
> >  		sd_flags = (*tl->sd_flags)();
> >  	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
> > -			"wrong sd_flags in topology description\n"))
> > +		      "wrong sd_flags in topology description\n"))
> >  		sd_flags &= TOPOLOGY_SD_FLAGS;
> > +	sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
> 
> 
> That can work too. Since sd_span is is just a variable array at the
> end of sched_domain, it shouldn't be affected by the assignment below.
> I'll update this in the next version.

I've already fixed it up, was about to go push it out to
queue/sched/core.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
  2026-03-12  9:46   ` Peter Zijlstra
@ 2026-03-12 10:06     ` K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

Hello Peter,

On 3/12/2026 3:16 PM, Peter Zijlstra wrote:
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 953d89d71804..c4fc7726f82a 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2853,7 +2853,7 @@ static inline void idle_set_state(struct rq *rq,
>>  
>>  static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>>  {
>> -	WARN_ON_ONCE(!rcu_read_lock_held());
>> +	WARN_ON_ONCE(!rcu_read_lock_any_held());
> 
> Should we perhaps make that:
> 
> 	lockdep_assert(rcu_read_lock_any_held());
> 
> ?

That makes sense! I was under the impression this was behind
CONFIG_DEBUG_LOCK_ALLOC but looks like this still checks for
!preemptible() on !CONFIG_DEBUG_LOCK_ALLOC which does add a
bit of overhead.

I'll modify it to lockdep_assert() in the next version.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-12 10:01       ` Peter Zijlstra
@ 2026-03-12 10:09         ` K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12 10:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

On 3/12/2026 3:31 PM, Peter Zijlstra wrote:
>> That can work too. Since sd_span is is just a variable array at the
>> end of sched_domain, it shouldn't be affected by the assignment below.
>> I'll update this in the next version.
> 
> I've already fixed it up, was about to go push it out to
> queue/sched/core.

Oh! Thanks a ton. I'll base the rest of nohz stuff on top of that and
hopefully cut out a new version by next week.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-12  4:44 ` [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper K Prateek Nayak
@ 2026-03-12 13:37   ` kernel test robot
  2026-03-12 15:42     ` K Prateek Nayak
  2026-03-16  0:18   ` Dietmar Eggemann
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2 siblings, 1 reply; 56+ messages in thread
From: kernel test robot @ 2026-03-12 13:37 UTC (permalink / raw)
  To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: oe-kbuild-all, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy,
	K Prateek Nayak

Hi Prateek,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 54a66e431eeacf23e1dc47cb3507f2d0c068aaf0]

url:    https://github.com/intel-lab-lkp/linux/commits/K-Prateek-Nayak/sched-topology-Compute-sd_weight-considering-cpuset-partitions/20260312-125021
base:   54a66e431eeacf23e1dc47cb3507f2d0c068aaf0
patch link:    https://lore.kernel.org/r/20260312044434.1974-3-kprateek.nayak%40amd.com
patch subject: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
config: nios2-randconfig-r131-20260312 (https://download.01.org/0day-ci/archive/20260312/202603122149.xyvcIkPY-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 8.5.0
sparse: v0.6.5-rc1
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260312/202603122149.xyvcIkPY-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603122149.xyvcIkPY-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   kernel/sched/build_utility.c: note: in included file:
   kernel/sched/debug.c:730:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/debug.c:730:17: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/debug.c:730:17: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/debug.c:1069:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/debug.c:1069:9: sparse:     expected struct task_struct *tsk
   kernel/sched/debug.c:1069:9: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/debug.c:1069:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct task_struct *tsk @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/debug.c:1069:9: sparse:     expected struct task_struct *tsk
   kernel/sched/debug.c:1069:9: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/build_utility.c: note: in included file:
   kernel/sched/stats.c:136:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/stats.c:136:17: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/stats.c:136:17: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/build_utility.c: note: in included file:
   kernel/sched/topology.c:116:56: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:116:56: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:116:56: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:135:60: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:135:60: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:135:60: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:158:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:158:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:158:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:469:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct perf_domain *pd @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:469:19: sparse:     expected struct perf_domain *pd
   kernel/sched/topology.c:469:19: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:644:49: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:644:49: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:644:49: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:729:50: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:729:50: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:729:50: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:737:55: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@     got struct sched_domain *[assigned] tmp @@
   kernel/sched/topology.c:737:55: sparse:     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
   kernel/sched/topology.c:737:55: sparse:     got struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:750:29: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:750:29: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:750:29: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:755:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:755:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:755:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:776:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *sd @@
   kernel/sched/topology.c:776:13: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:776:13: sparse:     got struct sched_domain [noderef] __rcu *sd
   kernel/sched/topology.c:938:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:938:70: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:938:70: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:967:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:967:59: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:967:59: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1013:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1013:57: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:1013:57: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1015:25: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sibling @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1015:25: sparse:     expected struct sched_domain *sibling
   kernel/sched/topology.c:1015:25: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1023:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1023:55: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:1023:55: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1025:25: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sibling @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1025:25: sparse:     expected struct sched_domain *sibling
   kernel/sched/topology.c:1025:25: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1095:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1095:62: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:1095:62: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1199:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1199:40: sparse:     expected struct sched_domain *child
   kernel/sched/topology.c:1199:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1337:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:1337:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:1337:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:1683:43: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain [noderef] __rcu *child @@     got struct sched_domain *child @@
   kernel/sched/topology.c:1683:43: sparse:     expected struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1683:43: sparse:     got struct sched_domain *child
   kernel/sched/topology.c:2478:31: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *parent @@     got struct sched_domain *sd @@
   kernel/sched/topology.c:2478:31: sparse:     expected struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2478:31: sparse:     got struct sched_domain *sd
>> kernel/sched/topology.c:2606:16: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *[noderef] __rcu parent @@
   kernel/sched/topology.c:2606:16: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:2606:16: sparse:     got struct sched_domain [noderef] __rcu *[noderef] __rcu parent
>> kernel/sched/topology.c:2608:24: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2608:24: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:2608:24: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2613:16: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2613:16: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:2613:16: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2618:24: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2618:24: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:2618:24: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2667:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2667:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2667:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2686:28: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2686:28: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2686:28: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2701:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2701:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2701:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/build_utility.c: note: in included file:
   kernel/sched/sched.h:2367:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2367:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2367:25: sparse:    struct task_struct *

vim +2606 kernel/sched/topology.c

  2553	
  2554	/*
  2555	 * Calculate an allowed NUMA imbalance such that LLCs do not get
  2556	 * imbalanced.
  2557	 */
  2558	static void adjust_numa_imbalance(struct sched_domain *sd_llc)
  2559	{
  2560		struct sched_domain *parent;
  2561		unsigned int imb_span = 1;
  2562		unsigned int imb = 0;
  2563		unsigned int nr_llcs;
  2564	
  2565		WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
  2566		WARN_ON(!sd_llc->parent);
  2567	
  2568		/*
  2569		 * For a single LLC per node, allow an
  2570		 * imbalance up to 12.5% of the node. This is
  2571		 * arbitrary cutoff based two factors -- SMT and
  2572		 * memory channels. For SMT-2, the intent is to
  2573		 * avoid premature sharing of HT resources but
  2574		 * SMT-4 or SMT-8 *may* benefit from a different
  2575		 * cutoff. For memory channels, this is a very
  2576		 * rough estimate of how many channels may be
  2577		 * active and is based on recent CPUs with
  2578		 * many cores.
  2579		 *
  2580		 * For multiple LLCs, allow an imbalance
  2581		 * until multiple tasks would share an LLC
  2582		 * on one node while LLCs on another node
  2583		 * remain idle. This assumes that there are
  2584		 * enough logical CPUs per LLC to avoid SMT
  2585		 * factors and that there is a correlation
  2586		 * between LLCs and memory channels.
  2587		 */
  2588		nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
  2589		if (nr_llcs == 1)
  2590			imb = sd_llc->parent->span_weight >> 3;
  2591		else
  2592			imb = nr_llcs;
  2593	
  2594		imb = max(1U, imb);
  2595		sd_llc->parent->imb_numa_nr = imb;
  2596	
  2597		/*
  2598		 * Set span based on the first NUMA domain.
  2599		 *
  2600		 * NUMA systems always add a NODE domain before
  2601		 * iterating the NUMA domains. Since this is before
  2602		 * degeneration, start from sd_llc's parent's
  2603		 * parent which is the lowest an SD_NUMA domain can
  2604		 * be relative to sd_llc.
  2605		 */
> 2606		parent = sd_llc->parent->parent;
  2607		while (parent && !(parent->flags & SD_NUMA))
> 2608			parent = parent->parent;
  2609	
  2610		imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
  2611	
  2612		/* Update the upper remainder of the topology */
  2613		parent = sd_llc->parent;
  2614		while (parent) {
  2615			int factor = max(1U, (parent->span_weight / imb_span));
  2616	
  2617			parent->imb_numa_nr = imb * factor;
  2618			parent = parent->parent;
  2619		}
  2620	}
  2621	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-12 13:37   ` kernel test robot
@ 2026-03-12 15:42     ` K Prateek Nayak
  2026-03-12 16:02       ` Peter Zijlstra
  0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-12 15:42 UTC (permalink / raw)
  To: kernel test robot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: oe-kbuild-all, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Chen Yu, Shrikanth Hegde, Li Chen, Gautham R. Shenoy

On 3/12/2026 7:07 PM, kernel test robot wrote:
> sparse warnings: (new ones prefixed by >>)
>    kernel/sched/build_utility.c: note: in included file:
>    kernel/sched/debug.c:730:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@

So what is our official stance on sparse in the sched bits? Because I
can make this go away with:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 963007d83216..7bf1f830067f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2591,7 +2591,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
  */
 static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 {
-	struct sched_domain *parent;
+	struct sched_domain __rcu *parent;
 	unsigned int imb_span = 1;
 	unsigned int imb = 0;
 	unsigned int nr_llcs;
---

But I can make a ton more go away by doing:

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 51c29581f15e..7d1efd981caf 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,8 +72,8 @@ struct sched_domain_shared {
 
 struct sched_domain {
 	/* These fields must be setup */
-	struct sched_domain __rcu *parent;	/* top domain must be null terminated */
-	struct sched_domain __rcu *child;	/* bottom domain must be null terminated */
+	struct sched_domain *parent;	/* top domain must be null terminated */
+	struct sched_domain *child;	/* bottom domain must be null terminated */
 	struct sched_group *groups;	/* the balancing groups of the domain */
 	unsigned long min_interval;	/* Minimum balance interval ms */
 	unsigned long max_interval;	/* Maximum balance interval ms */
---

"__rcu" evaluates to "noderef, address_space(__rcu)", but we do end up
dereferencing a bunch of these directly (like sd->parent->parent), and
noderef suggests that is illegal?

One place this probably helps is to spot cases where a pointer *needs*
to be accessed via rcu_dereference*() but it isn't - that is indeed nice
to have but ...

Then it also complains about using rcu_dereference*() on pointers that
aren't __rcu annotated, but perhaps that is solvable (although some of
it isn't very pretty, like "cpumask ** __rcu *sched_domains_numa_masks").
-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-12 15:42     ` K Prateek Nayak
@ 2026-03-12 16:02       ` Peter Zijlstra
  0 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-12 16:02 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: kernel test robot, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Valentin Schneider, linux-kernel, oe-kbuild-all, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

On Thu, Mar 12, 2026 at 09:12:50PM +0530, K Prateek Nayak wrote:
> On 3/12/2026 7:07 PM, kernel test robot wrote:
> > sparse warnings: (new ones prefixed by >>)
> >    kernel/sched/build_utility.c: note: in included file:
> >    kernel/sched/debug.c:730:17: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
> 
> So what is our official stance on sparse in the sched bits? Because I
> can make this go away with:

I take patches for correctness :-)

I do not take patches that don't affect correctness but make the code
unreadable -- there was a submission along those lines recently.

I can be convinced to take patches in the middle provided they don't
affect readability too much.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
  2026-03-12  4:44 ` [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
@ 2026-03-15 23:36   ` Dietmar Eggemann
  2026-03-16  3:19     ` K Prateek Nayak
  2026-03-18  8:08     ` [tip: sched/core] PM: EM: Switch to rcu_dereference_all() in " tip-bot2 for Dietmar Eggemann
  2026-03-18  8:08   ` [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() in the " tip-bot2 for K Prateek Nayak
  1 sibling, 2 replies; 56+ messages in thread
From: Dietmar Eggemann @ 2026-03-15 23:36 UTC (permalink / raw)
  To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

On 12.03.26 05:44, K Prateek Nayak wrote:
> select_task_rq_fair() is always called with p->pi_lock held and IRQs
> disabled which makes it equivalent of an RCU read-side.
> 
> Since commit 71fedc41c23b ("sched/fair: Switch to
> rcu_dereference_all()") switched to using rcu_dereference_all() in the
> wakeup path, drop the explicit rcu_read_{lock,unlock}() in the fair
> task's wakeup path.
> 
> Future plans to reuse select_task_rq_fair() /
> find_energy_efficient_cpu() in the fair class' balance callback will do
> so with IRQs disabled and will comply with the requirements of
> rcu_dereference_all() which makes this safe keeping in mind future
> development plans too.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog v3..v4:
> 
> o No changes.
> ---
>  kernel/sched/fair.c | 33 ++++++++++++---------------------
>  1 file changed, 12 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d57c02e82f3a..28853c0abb83 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8570,10 +8570,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	struct perf_domain *pd;
>  	struct energy_env eenv;
>  
> -	rcu_read_lock();
>  	pd = rcu_dereference_all(rd->pd);

Got an RCU-related warning when running EAS:

[    3.795872] EM: rcu read lock needed
[    3.795903] WARNING: ./include/linux/energy_model.h:251 at compute_energy+0x3
a4/0x3bc, CPU#1: swapper/1/0
[    3.813755] Modules linked in:
[    3.816844] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 7.0.0-rc2-00295-
g93189edc73c8-dirty #30 PREEMPT
[    3.826807] Hardware name: ARM Juno development board (r0) (DT)
[    3.832752] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    3.839750] pc : compute_energy+0x3a4/0x3bc
[    3.843970] lr : compute_energy+0x3a4/0x3bc
[    3.848185] sp : ffff8000838f39e0
[    3.851518] x29: ffff8000838f3a00 x28: ffff0008008fe560 x27: 00000000000000f9
[    3.858717] x26: ffff800082702250 x25: ffff000803211e00 x24: ffff8000838f3af8
[    3.865913] x23: ffff0008009a3580 x22: 00000000000000f9 x21: 00000000ffffffff
[    3.873108] x20: ffff0008030377c0 x19: 00000000000000b7 x18: 0000000000000028
[    3.880303] x17: ffff8008fc91f000 x16: ffff8000838f0000 x15: 0000000000000007
[    3.887497] x14: fffffffffffc73af x13: 0a64656465656e20 x12: 0000000000008000
[    3.894690] x11: ffff80008272ca18 x10: ffff8000825f7000 x9 : 0000000000000050
[    3.901884] x8 : 0000000000000008 x7 : ffff800080190100 x6 : 0000000000000001
[    3.909076] x5 : 0000000000000161 x4 : 0000000000000000 x3 : 0000000000000001
[    3.916268] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0008002e1ac0
[    3.923461] Call trace:
[    3.925924]  compute_energy+0x3a4/0x3bc (P)
[    3.930147]  select_task_rq_fair+0x590/0x1990
[    3.934541]  try_to_wake_up+0x1f8/0xac4
[    3.938420]  wake_up_process+0x18/0x24
[    3.942208]  process_timeout+0x14/0x20
[    3.945995]  call_timer_fn+0xb8/0x470

We need to adapt em_cpu_energy() in include/linux/energy_model.h as
well.

---8<---

From 74f9067751b02f4bd0934ba6d47f2a204c763abe Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Sun, 15 Mar 2026 23:45:39 +0100
Subject: [PATCH] PM: EM: Switch to rcu_dereference_all() in wakeup path

em_cpu_energy() is part of the EAS (Fair) task wakeup path. Now that
rcu_read_{,un}lock() have been removed from find_energy_efficient_cpu(),
switch to rcu_dereference_all() and check for rcu_read_lock_any_held()
in em_cpu_energy() as well.
The EAS (Fair) task wakeup path is a preempt/IRQ-disabled region, so
rcu_read_{,un}lock() can be removed.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 include/linux/energy_model.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e7497f804644..c909a8ba22e8 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -248,7 +248,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	struct em_perf_state *ps;
 	int i;
 
-	WARN_ONCE(!rcu_read_lock_held(), "EM: rcu read lock needed\n");
+	lockdep_assert(rcu_read_lock_any_held());
 
 	if (!sum_util)
 		return 0;
@@ -267,7 +267,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested performance.
 	 */
-	em_table = rcu_dereference(pd->em_table);
+	em_table = rcu_dereference_all(pd->em_table);
 	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
 	ps = &em_table->state[i];
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-12  4:44 ` [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper K Prateek Nayak
  2026-03-12 13:37   ` kernel test robot
@ 2026-03-16  0:18   ` Dietmar Eggemann
  2026-03-16  3:41     ` K Prateek Nayak
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2 siblings, 1 reply; 56+ messages in thread
From: Dietmar Eggemann @ 2026-03-16  0:18 UTC (permalink / raw)
  To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

On 12.03.26 05:44, K Prateek Nayak wrote:

[...]

> +/*
> + * Calculate an allowed NUMA imbalance such that LLCs do not get
> + * imbalanced.
> + */
> +static void adjust_numa_imbalance(struct sched_domain *sd_llc)
> +{
> +	struct sched_domain *parent;
> +	unsigned int imb_span = 1;
> +	unsigned int imb = 0;
> +	unsigned int nr_llcs;
> +
> +	WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
> +	WARN_ON(!sd_llc->parent);
> +
> +	/*
> +	 * For a single LLC per node, allow an
> +	 * imbalance up to 12.5% of the node. This is
> +	 * arbitrary cutoff based two factors -- SMT and
> +	 * memory channels. For SMT-2, the intent is to
> +	 * avoid premature sharing of HT resources but
> +	 * SMT-4 or SMT-8 *may* benefit from a different
> +	 * cutoff. For memory channels, this is a very
> +	 * rough estimate of how many channels may be
> +	 * active and is based on recent CPUs with
> +	 * many cores.
> +	 *
> +	 * For multiple LLCs, allow an imbalance
> +	 * until multiple tasks would share an LLC
> +	 * on one node while LLCs on another node
> +	 * remain idle. This assumes that there are
> +	 * enough logical CPUs per LLC to avoid SMT
> +	 * factors and that there is a correlation
> +	 * between LLCs and memory channels.
> +	 */
> +	nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
> +	if (nr_llcs == 1)
> +		imb = sd_llc->parent->span_weight >> 3;
> +	else
> +		imb = nr_llcs;
> +
> +	imb = max(1U, imb);
> +	sd_llc->parent->imb_numa_nr = imb;

Here you set imb_numa_nr e.g. for PKG ...

> +
> +	/*
> +	 * Set span based on the first NUMA domain.
> +	 *
> +	 * NUMA systems always add a NODE domain before
> +	 * iterating the NUMA domains. Since this is before
> +	 * degeneration, start from sd_llc's parent's
> +	 * parent which is the lowest an SD_NUMA domain can
> +	 * be relative to sd_llc.
> +	 */
> +	parent = sd_llc->parent->parent;
> +	while (parent && !(parent->flags & SD_NUMA))
> +		parent = parent->parent;
> +
> +	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
> +
> +	/* Update the upper remainder of the topology */
> +	parent = sd_llc->parent;
> +	while (parent) {
> +		int factor = max(1U, (parent->span_weight / imb_span));
> +
> +		parent->imb_numa_nr = imb * factor;

... and here again.

Shouldn't we only set it for 'if (parent->flags & SD_NUMA)'?

Not sure if there are cases in which PKG would persist in

... -> MC -> PKG -> NODE -> NUMA -> ... ?

Although access to sd->imb_numa_nr seems to be guarded by sd->flags &
SD_NUMA.

> +		parent = parent->parent;
> +	}
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation
  2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
                   ` (8 preceding siblings ...)
  2026-03-12  4:44 ` [PATCH v4 9/9] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
@ 2026-03-16  0:22 ` Dietmar Eggemann
  9 siblings, 0 replies; 56+ messages in thread
From: Dietmar Eggemann @ 2026-03-16  0:22 UTC (permalink / raw)
  To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

On 12.03.26 05:44, K Prateek Nayak wrote:
> Hello folks,
> 
> I got distracted for a bit but here is v4 of the series with most of the
> feedback on v3 incorporated. Nothing much has changed but if you see
> anything you don't like, please let me know and we can discuss how to
> best address it.

[...]

> ---
> K Prateek Nayak (9):
>   sched/topology: Compute sd_weight considering cpuset partitions
>   sched/topology: Extract "imb_numa_nr" calculation into a separate
>     helper
>   sched/topology: Allocate per-CPU sched_domain_shared in s_data
>   sched/topology: Switch to assigning "sd->shared" from s_data
>   sched/topology: Remove sched_domain_shared allocation with sd_data
>   sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
>   sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
>   sched/fair: Simplify the entry condition for update_idle_cpu_scan()
>   sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
> 
>  include/linux/sched/topology.h |   1 -
>  kernel/sched/fair.c            |  70 ++++-----
>  kernel/sched/sched.h           |   2 +-
>  kernel/sched/topology.c        | 263 +++++++++++++++++++++------------
>  4 files changed, 199 insertions(+), 137 deletions(-)

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
  2026-03-15 23:36   ` Dietmar Eggemann
@ 2026-03-16  3:19     ` K Prateek Nayak
  2026-03-18  8:08     ` [tip: sched/core] PM: EM: Switch to rcu_dereference_all() in " tip-bot2 for Dietmar Eggemann
  1 sibling, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-16  3:19 UTC (permalink / raw)
  To: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

Hello Dietmar,

On 3/16/2026 5:06 AM, Dietmar Eggemann wrote:
> We need to adapt em_cpu_energy() in include/linux/energy_model.h as
> well.

I clearly didn't test the EAS bits! I'll make sure to spin something up
with QEMU next time to hit these paths.

> 
> ---8<---
> 
> From 74f9067751b02f4bd0934ba6d47f2a204c763abe Mon Sep 17 00:00:00 2001
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Sun, 15 Mar 2026 23:45:39 +0100
> Subject: [PATCH] PM: EM: Switch to rcu_dereference_all() in wakeup path
> 
> em_cpu_energy() is part of the EAS (Fair) task wakeup path. Now that
> rcu_read_{,un}lock() have been removed from find_energy_efficient_cpu(),
> switch to rcu_dereference_all() and check for rcu_read_lock_any_held()
> in em_cpu_energy() as well.
> The EAS (Fair) task wakeup path is a preempt/IRQ-disabled region, so
> rcu_read_{,un}lock() can be removed.
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thank you for catching my mistake and for the fix! Feel free to
include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
>  include/linux/energy_model.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index e7497f804644..c909a8ba22e8 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -248,7 +248,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	struct em_perf_state *ps;
>  	int i;
>  
> -	WARN_ONCE(!rcu_read_lock_held(), "EM: rcu read lock needed\n");
> +	lockdep_assert(rcu_read_lock_any_held());
>  
>  	if (!sum_util)
>  		return 0;
> @@ -267,7 +267,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	 * Find the lowest performance state of the Energy Model above the
>  	 * requested performance.
>  	 */
> -	em_table = rcu_dereference(pd->em_table);
> +	em_table = rcu_dereference_all(pd->em_table);
>  	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
>  	ps = &em_table->state[i];
>  

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-16  0:18   ` Dietmar Eggemann
@ 2026-03-16  3:41     ` K Prateek Nayak
  2026-03-16  8:24       ` Dietmar Eggemann
  0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-16  3:41 UTC (permalink / raw)
  To: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

Hello Dietmar,

On 3/16/2026 5:48 AM, Dietmar Eggemann wrote:
>> +	/*
>> +	 * For a single LLC per node, allow an
>> +	 * imbalance up to 12.5% of the node. This is
>> +	 * arbitrary cutoff based two factors -- SMT and
>> +	 * memory channels. For SMT-2, the intent is to
>> +	 * avoid premature sharing of HT resources but
>> +	 * SMT-4 or SMT-8 *may* benefit from a different
>> +	 * cutoff. For memory channels, this is a very
>> +	 * rough estimate of how many channels may be
>> +	 * active and is based on recent CPUs with
>> +	 * many cores.
>> +	 *
>> +	 * For multiple LLCs, allow an imbalance
>> +	 * until multiple tasks would share an LLC
>> +	 * on one node while LLCs on another node
>> +	 * remain idle. This assumes that there are
>> +	 * enough logical CPUs per LLC to avoid SMT
>> +	 * factors and that there is a correlation
>> +	 * between LLCs and memory channels.
>> +	 */
>> +	nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
>> +	if (nr_llcs == 1)
>> +		imb = sd_llc->parent->span_weight >> 3;
>> +	else
>> +		imb = nr_llcs;
>> +
>> +	imb = max(1U, imb);
>> +	sd_llc->parent->imb_numa_nr = imb;
> 
> Here you set imb_numa_nr e.g. for PKG ...

Ack! That is indeed a redundant assignment since it gets reassigned
in the bottom loop. For this commit, we have kept it 1:1 with the
loop that existed before in build_sched_domains().

> 
>> +
>> +	/*
>> +	 * Set span based on the first NUMA domain.
>> +	 *
>> +	 * NUMA systems always add a NODE domain before
>> +	 * iterating the NUMA domains. Since this is before
>> +	 * degeneration, start from sd_llc's parent's
>> +	 * parent which is the lowest an SD_NUMA domain can
>> +	 * be relative to sd_llc.
>> +	 */
>> +	parent = sd_llc->parent->parent;
>> +	while (parent && !(parent->flags & SD_NUMA))
>> +		parent = parent->parent;
>> +
>> +	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
>> +
>> +	/* Update the upper remainder of the topology */
>> +	parent = sd_llc->parent;
>> +	while (parent) {
>> +		int factor = max(1U, (parent->span_weight / imb_span));
>> +
>> +		parent->imb_numa_nr = imb * factor;
> 
> ... and here again.
> 
> Shouldn't we only set it for 'if (parent->flags & SD_NUMA)'?
> 
> Not sure if there are case in which PKG would persist in
> 
> ... -> MC -> PKG -> NODE -> NUMA -> ... ?
> 
> Although access to sd->imb_numa_nr seems to be guarded by sd->flags &
> SD_NUMA.

Indeed! "imb_numa_nr" only makes sense when looking at NUMA domains
and having it assigned to 1 for lower domains is harmless
(but wasteful indeed). I'm 99% sure we can simply do:

  (Only build tested)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 43150591914b..e9068a809dbc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2623,9 +2623,6 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	else
 		imb = nr_llcs;
 
-	imb = max(1U, imb);
-	sd_llc->parent->imb_numa_nr = imb;
-
 	/*
 	 * Set span based on the first NUMA domain.
 	 *
@@ -2639,10 +2636,14 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	while (parent && !(parent->flags & SD_NUMA))
 		parent = parent->parent;
 
-	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
+	/* No NUMA domain to adjust imbalance for! */
+	if (!parent)
+		return;
+
+	imb = max(1U, imb);
+	imb_span = parent->span_weight;
 
 	/* Update the upper remainder of the topology */
-	parent = sd_llc->parent;
 	while (parent) {
 		int factor = max(1U, (parent->span_weight / imb_span));
 
---

If we have NUMA domains, we definitely have a NODE domain, and NODE
sets neither SD_SHARE_LLC nor SD_NUMA, so sd->parent is likely the
PKG / NODE domain. The NUMA walk has to start at sd->parent->parent
and break at the first SD_NUMA domain.

If it doesn't exist, we don't have any NUMA domains and nothing to worry
about, and if we do, the final loop will adjust the NUMA imbalance.

Thoughts? Again, this commit was kept 1:1 with the previous loop but we
can always improve :-)

> 
>> +		parent = parent->parent;
>> +	}
>> +}
>> +
> [...]

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-16  3:41     ` K Prateek Nayak
@ 2026-03-16  8:24       ` Dietmar Eggemann
  2026-03-16  8:50         ` K Prateek Nayak
  0 siblings, 1 reply; 56+ messages in thread
From: Dietmar Eggemann @ 2026-03-16  8:24 UTC (permalink / raw)
  To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

Hi Prateek,

On 16.03.26 04:41, K Prateek Nayak wrote:
> Hello Dietmar,
> 
> On 3/16/2026 5:48 AM, Dietmar Eggemann wrote:

[...]

> Indeed! "imb_numa_nr" only makes sense when looking at NUMA domains
> and having it assigned to 1 for lower domains is harmless
> (but wasteful indeed). I'm 99% sure we can simply do:
> 
>   (Only build tested)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 43150591914b..e9068a809dbc 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2623,9 +2623,6 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>  	else
>  		imb = nr_llcs;
>  
> -	imb = max(1U, imb);
> -	sd_llc->parent->imb_numa_nr = imb;
> -
>  	/*
>  	 * Set span based on the first NUMA domain.
>  	 *
> @@ -2639,10 +2636,14 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>  	while (parent && !(parent->flags & SD_NUMA))
>  		parent = parent->parent;
>  
> -	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
> +	/* No NUMA domain to adjust imbalance for! */
> +	if (!parent)
> +		return;
> +
> +	imb = max(1U, imb);
> +	imb_span = parent->span_weight;
>  
>  	/* Update the upper remainder of the topology */
> -	parent = sd_llc->parent;
>  	while (parent) {
>  		int factor = max(1U, (parent->span_weight / imb_span));
>  
> ---
> 
> If we have NUMA domains, we definitely have a NODE domain, and NODE
> sets neither SD_SHARE_LLC nor SD_NUMA, so sd->parent is likely the
> PKG / NODE domain. The NUMA walk has to start at sd->parent->parent
> and break at the first SD_NUMA domain.
> 
> If it doesn't exist, we don't have any NUMA domains and nothing to worry
> about, and if we do, the final loop will adjust the NUMA imbalance.
> 
> Thoughts? Again, this commit was kept 1:1 with the previous loop but we
> can always improve :-)
Ah, I see!

This would work, IMHO.

Tested on qemu-system-aarch64 w/

  -smp 8,sockets=2,clusters=2,cores=2,threads=1

Are you aware of a setup in which PKG would survive between MC and
lowest NUMA?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-16  8:24       ` Dietmar Eggemann
@ 2026-03-16  8:50         ` K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-16  8:50 UTC (permalink / raw)
  To: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Valentin Schneider, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu, Shrikanth Hegde,
	Li Chen, Gautham R. Shenoy

Hello Dietmar,

On 3/16/2026 1:54 PM, Dietmar Eggemann wrote:
> Hi Prateek,
> 
> On 16.03.26 04:41, K Prateek Nayak wrote:
>> Hello Dietmar,
>>
>> On 3/16/2026 5:48 AM, Dietmar Eggemann wrote:
> 
> [...]
> 
>> Indeed! "imb_numa_nr" only makes sense when looking at NUMA domains
>> and having it assigned to 1 for lower domains is harmless
>> (but wasteful indeed). I'm 99% sure we can simply do:
>>
>>   (Only build tested)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 43150591914b..e9068a809dbc 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2623,9 +2623,6 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>>  	else
>>  		imb = nr_llcs;
>>  
>> -	imb = max(1U, imb);
>> -	sd_llc->parent->imb_numa_nr = imb;
>> -
>>  	/*
>>  	 * Set span based on the first NUMA domain.
>>  	 *
>> @@ -2639,10 +2636,14 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>>  	while (parent && !(parent->flags & SD_NUMA))
>>  		parent = parent->parent;
>>  
>> -	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
>> +	/* No NUMA domain to adjust imbalance for! */
>> +	if (!parent)
>> +		return;
>> +
>> +	imb = max(1U, imb);
>> +	imb_span = parent->span_weight;
>>  
>>  	/* Update the upper remainder of the topology */
>> -	parent = sd_llc->parent;
>>  	while (parent) {
>>  		int factor = max(1U, (parent->span_weight / imb_span));
>>  
>> ---
>>
>> If we have NUMA domains, we definitely have a NODE domain, and NODE
>> sets neither SD_SHARE_LLC nor SD_NUMA, so sd->parent is likely the
>> PKG / NODE domain. The NUMA walk has to start at sd->parent->parent
>> and break at the first SD_NUMA domain.
>>
>> If it doesn't exist, we don't have any NUMA domains and nothing to worry
>> about, and if we do, the final loop will adjust the NUMA imbalance.
>>
>> Thoughts? Again, this commit was kept 1:1 with the previous loop but we
>> can always improve :-)
> Ah, I see!
> 
> This would work, IMHO.
> 
> Tested on qemu-system-aarch64 w/
> 
>   -smp 8,sockets=2,clusters=2,cores=2,threads=1
> 
> Are you aware of a setup in which PKG would survive between MC and
> lowest NUMA?

On x86, you can have:

  -smp 8,sockets=2,dies=2,cores=2,threads=1

and each "die" will appear as an MC within the socket so we get

  NUMA {         0-7         }
  NODE {   0-3   } {   4-7   }
  PKG  {   0-3   } {   4-7   }
  MC   {0,1} {2,3} {4,5} {6,7}

In the above case, NODE is degenerated since it matches with PKG
and MC, PKG, NUMA survive at the end.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
  2026-03-12  4:44 ` [PATCH v4 9/9] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Chen Yu, Shrikanth Hegde,
	Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fe7171d0d5dfbe189e41db99580ebacafc3c09ce
Gitweb:        https://git.kernel.org/tip/fe7171d0d5dfbe189e41db99580ebacafc3c09ce
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:34 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:50 +01:00

sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()

Use the "sd_llc" passed to select_idle_cpu() to obtain the
"sd_llc_shared" instead of dereferencing the per-CPU variable.

Since "sd->shared" is always reclaimed at the same time as "sd" via
call_rcu() and update_top_cache_domain() always ensures a valid
"sd->shared" assignment when "sd_llc" is present, "sd_llc->shared" can
always be dereferenced without needing an additional check.

While at it, move the cpumask_and() operation after the SIS_UTIL
bailout check to avoid computing the cpumask unnecessarily.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-10-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85c22f0..0a35a82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7876,21 +7876,26 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool 
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
-	struct sched_domain_shared *sd_share;
-
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
 	if (sched_feat(SIS_UTIL)) {
-		sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target));
-		if (sd_share) {
-			/* because !--nr is the condition to stop scan */
-			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
-			/* overloaded LLC is unlikely to have idle cpu/core */
-			if (nr == 1)
-				return -1;
-		}
+		/*
+		 * Increment because !--nr is the condition to stop scan.
+		 *
+		 * Since "sd" is "sd_llc" for the target CPU dereferenced in
+		 * the caller, it is safe to directly dereference "sd->shared".
+		 * Topology bits always ensure it is assigned for "sd_llc" and
+		 * it cannot disappear as long as we have an RCU protected
+		 * reference to the associated "sd" here.
+		 */
+		nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+		/* overloaded LLC is unlikely to have idle cpu/core */
+		if (nr == 1)
+			return -1;
 	}
 
+	if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr))
+		return -1;
+
 	if (static_branch_unlikely(&sched_cluster_active)) {
 		struct sched_group *sg = sd->groups;
 

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/fair: Simplify the entry condition for update_idle_cpu_scan()
  2026-03-12  4:44 ` [PATCH v4 8/9] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Chen Yu,
	Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f1320a8dd8ba6518ddb53ea4e3efcb49dc41d257
Gitweb:        https://git.kernel.org/tip/f1320a8dd8ba6518ddb53ea4e3efcb49dc41d257
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:33 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:50 +01:00

sched/fair: Simplify the entry condition for update_idle_cpu_scan()

Only the topmost SD_SHARE_LLC domain has the "sd->shared" assigned.
Simply use "sd->shared" as an indicator for load balancing at the highest
SD_SHARE_LLC domain in update_idle_cpu_scan() instead of relying on
llc_size.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-9-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e24d3e..85c22f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11234,6 +11234,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
 				 unsigned long sum_util)
 {
 	struct sched_domain_shared *sd_share;
+	struct sched_domain *sd = env->sd;
 	int llc_weight, pct;
 	u64 x, y, tmp;
 	/*
@@ -11247,11 +11248,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
 		return;
 
-	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
-	if (env->sd->span_weight != llc_weight)
-		return;
-
-	sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, env->dst_cpu));
+	sd_share = sd->shared;
 	if (!sd_share)
 		return;
 
@@ -11285,10 +11282,11 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	 */
 	/* equation [3] */
 	x = sum_util;
+	llc_weight = sd->span_weight;
 	do_div(x, llc_weight);
 
 	/* equation [4] */
-	pct = env->sd->imbalance_pct;
+	pct = sd->imbalance_pct;
 	tmp = x * x * pct * pct;
 	do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
 	tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
  2026-03-12  4:44 ` [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
  2026-03-15 23:36   ` Dietmar Eggemann
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  1 sibling, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Dietmar Eggemann, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fa6874dfeee06352ce7c4c271be6a25d84a38b54
Gitweb:        https://git.kernel.org/tip/fa6874dfeee06352ce7c4c271be6a25d84a38b54
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:32 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:50 +01:00

sched/fair: Remove superfluous rcu_read_lock() in the wakeup path

select_task_rq_fair() is always called with p->pi_lock held and IRQs
disabled which makes it equivalent of an RCU read-side.

Since commit 71fedc41c23b ("sched/fair: Switch to
rcu_dereference_all()") switched to using rcu_dereference_all() in the
wakeup path, drop the explicit rcu_read_{lock,unlock}() in the fair
task's wakeup path.

Future plans to reuse select_task_rq_fair() /
find_energy_efficient_cpu() in the fair class' balance callback will do
so with IRQs disabled, which also satisfies the requirements of
rcu_dereference_all(), so this change remains safe going forward.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-8-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1e5c82..3e24d3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8570,10 +8570,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	struct perf_domain *pd;
 	struct energy_env eenv;
 
-	rcu_read_lock();
 	pd = rcu_dereference_all(rd->pd);
 	if (!pd)
-		goto unlock;
+		return target;
 
 	/*
 	 * Energy-aware wake-up happens on the lowest sched_domain starting
@@ -8583,13 +8582,13 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
 		sd = sd->parent;
 	if (!sd)
-		goto unlock;
+		return target;
 
 	target = prev_cpu;
 
 	sync_entity_load_avg(&p->se);
 	if (!task_util_est(p) && p_util_min == 0)
-		goto unlock;
+		return target;
 
 	eenv_task_busy_time(&eenv, p, prev_cpu);
 
@@ -8684,7 +8683,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 						    prev_cpu);
 			/* CPU utilization has changed */
 			if (prev_delta < base_energy)
-				goto unlock;
+				return target;
 			prev_delta -= base_energy;
 			prev_actual_cap = cpu_actual_cap;
 			best_delta = min(best_delta, prev_delta);
@@ -8708,7 +8707,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 						   max_spare_cap_cpu);
 			/* CPU utilization has changed */
 			if (cur_delta < base_energy)
-				goto unlock;
+				return target;
 			cur_delta -= base_energy;
 
 			/*
@@ -8725,7 +8724,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			best_actual_cap = cpu_actual_cap;
 		}
 	}
-	rcu_read_unlock();
 
 	if ((best_fits > prev_fits) ||
 	    ((best_fits > 0) && (best_delta < prev_delta)) ||
@@ -8733,11 +8731,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		target = best_energy_cpu;
 
 	return target;
-
-unlock:
-	rcu_read_unlock();
-
-	return target;
 }
 
 /*
@@ -8782,7 +8775,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
@@ -8808,14 +8800,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 			break;
 	}
 
-	if (unlikely(sd)) {
-		/* Slow path */
-		new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
-	} else if (wake_flags & WF_TTWU) { /* XXX always ? */
-		/* Fast path */
-		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-	}
-	rcu_read_unlock();
+	/* Slow path */
+	if (unlikely(sd))
+		return sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
+
+	/* Fast path */
+	if (wake_flags & WF_TTWU)
+		return select_idle_sibling(p, prev_cpu, new_cpu);
 
 	return new_cpu;
 }

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] PM: EM: Switch to rcu_dereference_all() in wakeup path
  2026-03-15 23:36   ` Dietmar Eggemann
  2026-03-16  3:19     ` K Prateek Nayak
@ 2026-03-18  8:08     ` tip-bot2 for Dietmar Eggemann
  1 sibling, 0 replies; 56+ messages in thread
From: tip-bot2 for Dietmar Eggemann @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Peter Zijlstra (Intel), K Prateek Nayak, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8ca12326f592f7554acf2788ecb1c5c954dcf31c
Gitweb:        https://git.kernel.org/tip/8ca12326f592f7554acf2788ecb1c5c954dcf31c
Author:        Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate:    Mon, 16 Mar 2026 00:36:22 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:49 +01:00

PM: EM: Switch to rcu_dereference_all() in wakeup path

em_cpu_energy() is part of the EAS (Fair) task wakeup path. Now that
rcu_read_{,un}lock() have been removed from find_energy_efficient_cpu(),
switch to rcu_dereference_all() and check for rcu_read_lock_any_held()
in em_cpu_energy() as well.
The EAS (Fair) task wakeup path is a preempt/IRQ disabled region, so
rcu_read_{,un}lock() can be removed.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/5b1228b7-5949-4a45-9f62-e8ce936de694@arm.com
---
 include/linux/energy_model.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index e7497f8..c909a8b 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -248,7 +248,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	struct em_perf_state *ps;
 	int i;
 
-	WARN_ONCE(!rcu_read_lock_held(), "EM: rcu read lock needed\n");
+	lockdep_assert(rcu_read_lock_any_held());
 
 	if (!sum_util)
 		return 0;
@@ -267,7 +267,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	 * Find the lowest performance state of the Energy Model above the
 	 * requested performance.
 	 */
-	em_table = rcu_dereference(pd->em_table);
+	em_table = rcu_dereference_all(pd->em_table);
 	i = em_pd_get_efficient_state(em_table->state, pd, max_util);
 	ps = &em_table->state[i];
 

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
  2026-03-12  4:44 ` [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
  2026-03-12  9:46   ` Peter Zijlstra
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  1 sibling, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Dietmar Eggemann, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f494bfb04615119f31dbd3222c9d39fea3817d40
Gitweb:        https://git.kernel.org/tip/f494bfb04615119f31dbd3222c9d39fea3817d40
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:31 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:49 +01:00

sched/core: Check for rcu_read_lock_any_held() in idle_get_state()

Similar to commit 71fedc41c23b ("sched/fair: Switch to
rcu_dereference_all()"), switch to checking for rcu_read_lock_any_held()
in idle_get_state() to allow removing superfluous rcu_read_lock()
regions in the fair task's wakeup path where the pi_lock is held and
IRQs are disabled.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-7-kprateek.nayak@amd.com
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 953d89d..b863bbd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2853,7 +2853,7 @@ static inline void idle_set_state(struct rq *rq,
 
 static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 {
-	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert(rcu_read_lock_any_held());
 
 	return rq->idle_state;
 }

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/topology: Remove sched_domain_shared allocation with sd_data
  2026-03-12  4:44 ` [PATCH v4 5/9] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Valentin Schneider,
	Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     10febd397591d93f42adb743c2c664041e7f1bcb
Gitweb:        https://git.kernel.org/tip/10febd397591d93f42adb743c2c664041e7f1bcb
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:30 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:49 +01:00

sched/topology: Remove sched_domain_shared allocation with sd_data

Now that "sd->shared" assignments are using the sched_domain_shared
objects allocated with s_data, remove the sd_data based allocations.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-6-kprateek.nayak@amd.com
---
 include/linux/sched/topology.h |  1 -
 kernel/sched/topology.c        | 19 -------------------
 2 files changed, 20 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a1e1032..51c2958 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -172,7 +172,6 @@ typedef int (*sched_domain_flags_f)(void);
 
 struct sd_data {
 	struct sched_domain *__percpu *sd;
-	struct sched_domain_shared *__percpu *sds;
 	struct sched_group *__percpu *sg;
 	struct sched_group_capacity *__percpu *sgc;
 };
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b19d84f..4315059 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1609,9 +1609,6 @@ static void claim_allocations(int cpu, struct s_data *d)
 		WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
 		*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-		if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
-			*per_cpu_ptr(sdd->sds, cpu) = NULL;
-
 		if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
 			*per_cpu_ptr(sdd->sg, cpu) = NULL;
 
@@ -2390,10 +2387,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 		if (!sdd->sd)
 			return -ENOMEM;
 
-		sdd->sds = alloc_percpu(struct sched_domain_shared *);
-		if (!sdd->sds)
-			return -ENOMEM;
-
 		sdd->sg = alloc_percpu(struct sched_group *);
 		if (!sdd->sg)
 			return -ENOMEM;
@@ -2404,7 +2397,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
-			struct sched_domain_shared *sds;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
 
@@ -2415,13 +2407,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
-					GFP_KERNEL, cpu_to_node(j));
-			if (!sds)
-				return -ENOMEM;
-
-			*per_cpu_ptr(sdd->sds, j) = sds;
-
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sg)
@@ -2463,8 +2448,6 @@ static void __sdt_free(const struct cpumask *cpu_map)
 				kfree(*per_cpu_ptr(sdd->sd, j));
 			}
 
-			if (sdd->sds)
-				kfree(*per_cpu_ptr(sdd->sds, j));
 			if (sdd->sg)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
@@ -2472,8 +2455,6 @@ static void __sdt_free(const struct cpumask *cpu_map)
 		}
 		free_percpu(sdd->sd);
 		sdd->sd = NULL;
-		free_percpu(sdd->sds);
-		sdd->sds = NULL;
 		free_percpu(sdd->sg);
 		sdd->sg = NULL;
 		free_percpu(sdd->sgc);

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/topology: Switch to assigning "sd->shared" from s_data
  2026-03-12  4:44 ` [PATCH v4 4/9] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Dietmar Eggemann, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     bb7a5e44fc6f3d5a252d95c48d057d5beccb8b35
Gitweb:        https://git.kernel.org/tip/bb7a5e44fc6f3d5a252d95c48d057d5beccb8b35
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:29 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:48 +01:00

sched/topology: Switch to assigning "sd->shared" from s_data

Use the "sched_domain_shared" object allocated in s_data for
"sd->shared" assignments. Assign "sd->shared" for the topmost
SD_SHARE_LLC domain before degeneration and rely on the degeneration
path to correctly pass down the shared object to "sd_llc".

sd_degenerate_parent() ensures that degenerating domains have the same
sched_domain_span(), which guarantees a 1:1 handover of the shared object.
If the topmost SD_SHARE_LLC domain degenerates, the shared object is
freed from destroy_sched_domain() when the last reference is dropped.

claim_allocations() NULLs out the objects that have been assigned as
"sd->shared" and the unassigned ones are freed from the __sds_free()
path.

To keep all the claim_allocations() bits in one place,
claim_allocations() has been extended to accept "s_data" and iterate the
domains internally to free both "sched_domain_shared" and the
per-topology-level data for the particular CPU in one place.

Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
call_rcu() on the sched_domain object via destroy_sched_domains_rcu().

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-5-kprateek.nayak@amd.com
---
 kernel/sched/topology.c | 73 ++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9006586..b19d84f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -685,6 +685,9 @@ static void update_top_cache_domain(int cpu)
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
+
+		/* If sd_llc exists, sd_llc_shared should exist too. */
+		WARN_ON_ONCE(!sd->shared);
 		sds = sd->shared;
 	}
 
@@ -733,6 +736,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 		if (sd_parent_degenerate(tmp, parent)) {
 			tmp->parent = parent->parent;
 
+			/* Pick reference to parent->shared. */
+			if (parent->shared) {
+				WARN_ON_ONCE(tmp->shared);
+				tmp->shared = parent->shared;
+				parent->shared = NULL;
+			}
+
 			if (parent->parent) {
 				parent->parent->child = tmp;
 				parent->parent->groups->flags = tmp->flags;
@@ -1586,21 +1596,28 @@ __visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
  * sched_group structure so that the subsequent __free_domain_allocs()
  * will not free the data we're using.
  */
-static void claim_allocations(int cpu, struct sched_domain *sd)
+static void claim_allocations(int cpu, struct s_data *d)
 {
-	struct sd_data *sdd = sd->private;
+	struct sched_domain *sd;
+
+	if (atomic_read(&(*per_cpu_ptr(d->sds, cpu))->ref))
+		*per_cpu_ptr(d->sds, cpu) = NULL;
 
-	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
-	*per_cpu_ptr(sdd->sd, cpu) = NULL;
+	for (sd = *per_cpu_ptr(d->sd, cpu); sd; sd = sd->parent) {
+		struct sd_data *sdd = sd->private;
 
-	if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
-		*per_cpu_ptr(sdd->sds, cpu) = NULL;
+		WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
+		*per_cpu_ptr(sdd->sd, cpu) = NULL;
 
-	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
-		*per_cpu_ptr(sdd->sg, cpu) = NULL;
+		if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
+			*per_cpu_ptr(sdd->sds, cpu) = NULL;
 
-	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
-		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+		if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
+			*per_cpu_ptr(sdd->sg, cpu) = NULL;
+
+		if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
+			*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+	}
 }
 
 #ifdef CONFIG_NUMA
@@ -1738,16 +1755,6 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 	}
 
-	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
-	 * instance.
-	 */
-	if (sd->flags & SD_SHARE_LLC) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
-		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
-
 	sd->private = sdd;
 
 	return sd;
@@ -2729,12 +2736,20 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
 			sd = sd->parent;
 
-		/*
-		 * In presence of higher domains, adjust the
-		 * NUMA imbalance stats for the hierarchy.
-		 */
-		if (IS_ENABLED(CONFIG_NUMA) && (sd->flags & SD_SHARE_LLC) && sd->parent)
-			adjust_numa_imbalance(sd);
+		if (sd->flags & SD_SHARE_LLC) {
+			int sd_id = cpumask_first(sched_domain_span(sd));
+
+			sd->shared = *per_cpu_ptr(d.sds, sd_id);
+			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+			atomic_inc(&sd->shared->ref);
+
+			/*
+			 * In presence of higher domains, adjust the
+			 * NUMA imbalance stats for the hierarchy.
+			 */
+			if (IS_ENABLED(CONFIG_NUMA) && sd->parent)
+				adjust_numa_imbalance(sd);
+		}
 	}
 
 	/* Calculate CPU capacity for physical packages and nodes */
@@ -2742,10 +2757,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		if (!cpumask_test_cpu(i, cpu_map))
 			continue;
 
-		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-			claim_allocations(i, sd);
+		claim_allocations(i, &d);
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent)
 			init_sched_groups_capacity(i, sd);
-		}
 	}
 
 	/* Attach the domains */

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/topology: Allocate per-CPU sched_domain_shared in s_data
  2026-03-12  4:44 ` [PATCH v4 3/9] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Valentin Schneider,
	Chen Yu, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     1cc8a33ca7e8d38f962b64ece2a42c411a67bc76
Gitweb:        https://git.kernel.org/tip/1cc8a33ca7e8d38f962b64ece2a42c411a67bc76
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:28 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:48 +01:00

sched/topology: Allocate per-CPU sched_domain_shared in s_data

The "sched_domain_shared" objects are allocated for every topology level
in __sdt_alloc() and are freed after the sched domain rebuild if they
weren't assigned during sd_init().

"sd->shared" is only assigned for SD_SHARE_LLC domains, and of all the
assigned objects, only "sd_llc_shared" is ever used by the scheduler.

Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains
never overlap, allocate only a single range of per-CPU
"sched_domain_shared" objects in s_data instead of allocating one per
topology level.

The subsequent commit uses the degeneration path to correctly assign the
"sd->shared" to the topmost SD_SHARE_LLC domain.

No functional changes are expected at this point.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-4-kprateek.nayak@amd.com
---
 kernel/sched/topology.c | 48 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6303790..9006586 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -782,6 +782,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 }
 
 struct s_data {
+	struct sched_domain_shared * __percpu *sds;
 	struct sched_domain * __percpu *sd;
 	struct root_domain	*rd;
 };
@@ -789,6 +790,7 @@ struct s_data {
 enum s_alloc {
 	sa_rootdomain,
 	sa_sd,
+	sa_sd_shared,
 	sa_sd_storage,
 	sa_none,
 };
@@ -1535,6 +1537,9 @@ static void set_domain_attribute(struct sched_domain *sd,
 static void __sdt_free(const struct cpumask *cpu_map);
 static int __sdt_alloc(const struct cpumask *cpu_map);
 
+static void __sds_free(struct s_data *d, const struct cpumask *cpu_map);
+static int __sds_alloc(struct s_data *d, const struct cpumask *cpu_map);
+
 static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 				 const struct cpumask *cpu_map)
 {
@@ -1546,6 +1551,9 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
 	case sa_sd:
 		free_percpu(d->sd);
 		fallthrough;
+	case sa_sd_shared:
+		__sds_free(d, cpu_map);
+		fallthrough;
 	case sa_sd_storage:
 		__sdt_free(cpu_map);
 		fallthrough;
@@ -1561,9 +1569,11 @@ __visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
 
 	if (__sdt_alloc(cpu_map))
 		return sa_sd_storage;
+	if (__sds_alloc(d, cpu_map))
+		return sa_sd_shared;
 	d->sd = alloc_percpu(struct sched_domain *);
 	if (!d->sd)
-		return sa_sd_storage;
+		return sa_sd_shared;
 	d->rd = alloc_rootdomain();
 	if (!d->rd)
 		return sa_sd;
@@ -2464,6 +2474,42 @@ static void __sdt_free(const struct cpumask *cpu_map)
 	}
 }
 
+static int __sds_alloc(struct s_data *d, const struct cpumask *cpu_map)
+{
+	int j;
+
+	d->sds = alloc_percpu(struct sched_domain_shared *);
+	if (!d->sds)
+		return -ENOMEM;
+
+	for_each_cpu(j, cpu_map) {
+		struct sched_domain_shared *sds;
+
+		sds = kzalloc_node(sizeof(struct sched_domain_shared),
+				GFP_KERNEL, cpu_to_node(j));
+		if (!sds)
+			return -ENOMEM;
+
+		*per_cpu_ptr(d->sds, j) = sds;
+	}
+
+	return 0;
+}
+
+static void __sds_free(struct s_data *d, const struct cpumask *cpu_map)
+{
+	int j;
+
+	if (!d->sds)
+		return;
+
+	for_each_cpu(j, cpu_map)
+		kfree(*per_cpu_ptr(d->sds, j));
+
+	free_percpu(d->sds);
+	d->sds = NULL;
+}
+
 static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)


* [tip: sched/core] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
  2026-03-12  4:44 ` [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper K Prateek Nayak
  2026-03-12 13:37   ` kernel test robot
  2026-03-16  0:18   ` Dietmar Eggemann
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  2 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Valentin Schneider, K Prateek Nayak, Peter Zijlstra (Intel),
	Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5a7b576b3ec1acc2694c5b58f80cd1d44a11b2c1
Gitweb:        https://git.kernel.org/tip/5a7b576b3ec1acc2694c5b58f80cd1d44a11b2c1
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:27 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:48 +01:00

sched/topology: Extract "imb_numa_nr" calculation into a separate helper

Subsequent changes to assign "sd->shared" from "s_data" would
necessitate finding the topmost SD_SHARE_LLC domain to assign the
shared object to.

This is very similar to the "imb_numa_nr" computation loop except that
"imb_numa_nr" cares about the first domain without the SD_SHARE_LLC flag
(immediate parent of sd_llc) whereas the "sd->shared" assignment would
require sd_llc itself.

Extract the "imb_numa_nr" calculation into a helper,
adjust_numa_imbalance(), and use the existing loop in
build_sched_domains() to find sd_llc.

While at it, guard the call behind CONFIG_NUMA since "imb_numa_nr" only
makes sense on NUMA-enabled configs with SD_NUMA domains.

No functional changes intended.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-3-kprateek.nayak@amd.com
---
 kernel/sched/topology.c | 133 +++++++++++++++++++++++----------------
 1 file changed, 80 insertions(+), 53 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79bab80..6303790 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2550,6 +2550,74 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 }
 
 /*
+ * Calculate an allowed NUMA imbalance such that LLCs do not get
+ * imbalanced.
+ */
+static void adjust_numa_imbalance(struct sched_domain *sd_llc)
+{
+	struct sched_domain *parent;
+	unsigned int imb_span = 1;
+	unsigned int imb = 0;
+	unsigned int nr_llcs;
+
+	WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
+	WARN_ON(!sd_llc->parent);
+
+	/*
+	 * For a single LLC per node, allow an
+	 * imbalance up to 12.5% of the node. This is
+	 * arbitrary cutoff based two factors -- SMT and
+	 * memory channels. For SMT-2, the intent is to
+	 * avoid premature sharing of HT resources but
+	 * SMT-4 or SMT-8 *may* benefit from a different
+	 * cutoff. For memory channels, this is a very
+	 * rough estimate of how many channels may be
+	 * active and is based on recent CPUs with
+	 * many cores.
+	 *
+	 * For multiple LLCs, allow an imbalance
+	 * until multiple tasks would share an LLC
+	 * on one node while LLCs on another node
+	 * remain idle. This assumes that there are
+	 * enough logical CPUs per LLC to avoid SMT
+	 * factors and that there is a correlation
+	 * between LLCs and memory channels.
+	 */
+	nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
+	if (nr_llcs == 1)
+		imb = sd_llc->parent->span_weight >> 3;
+	else
+		imb = nr_llcs;
+
+	imb = max(1U, imb);
+	sd_llc->parent->imb_numa_nr = imb;
+
+	/*
+	 * Set span based on the first NUMA domain.
+	 *
+	 * NUMA systems always add a NODE domain before
+	 * iterating the NUMA domains. Since this is before
+	 * degeneration, start from sd_llc's parent's
+	 * parent which is the lowest an SD_NUMA domain can
+	 * be relative to sd_llc.
+	 */
+	parent = sd_llc->parent->parent;
+	while (parent && !(parent->flags & SD_NUMA))
+		parent = parent->parent;
+
+	imb_span = parent ? parent->span_weight : sd_llc->parent->span_weight;
+
+	/* Update the upper remainder of the topology */
+	parent = sd_llc->parent;
+	while (parent) {
+		int factor = max(1U, (parent->span_weight / imb_span));
+
+		parent->imb_numa_nr = imb * factor;
+		parent = parent->parent;
+	}
+}
+
+/*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
  */
@@ -2606,62 +2674,21 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
-	/*
-	 * Calculate an allowed NUMA imbalance such that LLCs do not get
-	 * imbalanced.
-	 */
 	for_each_cpu(i, cpu_map) {
-		unsigned int imb = 0;
-		unsigned int imb_span = 1;
+		sd = *per_cpu_ptr(d.sd, i);
+		if (!sd)
+			continue;
 
-		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-			struct sched_domain *child = sd->child;
-
-			if (!(sd->flags & SD_SHARE_LLC) && child &&
-			    (child->flags & SD_SHARE_LLC)) {
-				struct sched_domain __rcu *top_p;
-				unsigned int nr_llcs;
-
-				/*
-				 * For a single LLC per node, allow an
-				 * imbalance up to 12.5% of the node. This is
-				 * arbitrary cutoff based two factors -- SMT and
-				 * memory channels. For SMT-2, the intent is to
-				 * avoid premature sharing of HT resources but
-				 * SMT-4 or SMT-8 *may* benefit from a different
-				 * cutoff. For memory channels, this is a very
-				 * rough estimate of how many channels may be
-				 * active and is based on recent CPUs with
-				 * many cores.
-				 *
-				 * For multiple LLCs, allow an imbalance
-				 * until multiple tasks would share an LLC
-				 * on one node while LLCs on another node
-				 * remain idle. This assumes that there are
-				 * enough logical CPUs per LLC to avoid SMT
-				 * factors and that there is a correlation
-				 * between LLCs and memory channels.
-				 */
-				nr_llcs = sd->span_weight / child->span_weight;
-				if (nr_llcs == 1)
-					imb = sd->span_weight >> 3;
-				else
-					imb = nr_llcs;
-				imb = max(1U, imb);
-				sd->imb_numa_nr = imb;
-
-				/* Set span based on the first NUMA domain. */
-				top_p = sd->parent;
-				while (top_p && !(top_p->flags & SD_NUMA)) {
-					top_p = top_p->parent;
-				}
-				imb_span = top_p ? top_p->span_weight : sd->span_weight;
-			} else {
-				int factor = max(1U, (sd->span_weight / imb_span));
+		/* First, find the topmost SD_SHARE_LLC domain */
+		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+			sd = sd->parent;
 
-				sd->imb_numa_nr = imb * factor;
-			}
-		}
+		/*
+		 * In presence of higher domains, adjust the
+		 * NUMA imbalance stats for the hierarchy.
+		 */
+		if (IS_ENABLED(CONFIG_NUMA) && (sd->flags & SD_SHARE_LLC) && sd->parent)
+			adjust_numa_imbalance(sd);
 	}
 
 	/* Calculate CPU capacity for physical packages and nodes */


* [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-12  4:44 ` [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
  2026-03-12  9:34   ` Peter Zijlstra
@ 2026-03-18  8:08   ` tip-bot2 for K Prateek Nayak
  2026-03-20 23:58     ` Nathan Chancellor
  1 sibling, 1 reply; 56+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-03-18  8:08 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Peter Zijlstra (Intel), Shrikanth Hegde, Chen Yu,
	Valentin Schneider, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8e8e23dea43e64ddafbd1246644c3219209be113
Gitweb:        https://git.kernel.org/tip/8e8e23dea43e64ddafbd1246644c3219209be113
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Thu, 12 Mar 2026 04:44:26 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:47 +01:00

sched/topology: Compute sd_weight considering cpuset partitions

The "sd_weight" used for calculating the load balancing interval, and
its limits, considers the span weight of the entire topology level
without accounting for cpuset partitions.

For example, consider a large system of 128 CPUs divided into 8
partitions of 16 CPUs each, which is typical when deploying virtual
machines:

  [                      PKG Domain: 128CPUs                      ]

  [Partition0: 16CPUs][Partition1: 16CPUs] ... [Partition7: 16CPUs]

Although each partition only contains 16 CPUs, the load balancing
interval is set to a minimum of 128 jiffies based on the span of the
entire 128-CPU domain. This can leave imbalances within a partition
standing for longer, even though balancing across just 16 CPUs is
cheaper.

Compute the "sd_weight" after computing the "sd_span" restricted to the
cpu_map covered by the partition, and set the load balancing interval
and its limits accordingly.

For the above example, the balancing intervals for each partition's PKG
domain change as follows:

                  before   after
balance_interval   128      16
min_interval       128      16
max_interval       256      32

Intervals are now proportional to the number of CPUs in the partitioned
domain, as was intended by the original formula.

Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-2-kprateek.nayak@amd.com
---
 kernel/sched/topology.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c8..79bab80 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1645,13 +1645,17 @@ sd_init(struct sched_domain_topology_level *tl,
 	struct cpumask *sd_span;
 	u64 now = sched_clock();
 
-	sd_weight = cpumask_weight(tl->mask(tl, cpu));
+	sd_span = sched_domain_span(sd);
+	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+	sd_weight = cpumask_weight(sd_span);
+	sd_id = cpumask_first(sd_span);
 
 	if (tl->sd_flags)
 		sd_flags = (*tl->sd_flags)();
 	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
-			"wrong sd_flags in topology description\n"))
+		      "wrong sd_flags in topology description\n"))
 		sd_flags &= TOPOLOGY_SD_FLAGS;
+	sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
@@ -1689,12 +1693,6 @@ sd_init(struct sched_domain_topology_level *tl,
 		.name			= tl->name,
 	};
 
-	sd_span = sched_domain_span(sd);
-	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
-	sd_id = cpumask_first(sd_span);
-
-	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
-
 	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
 		  (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
 		  "CPU capacity asymmetry not supported on SMT\n");


* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
@ 2026-03-20 23:58     ` Nathan Chancellor
  2026-03-21  3:36       ` K Prateek Nayak
  2026-03-21 16:38       ` [PATCH] sched/topology: Initialize sd_span after assignment to *sd K Prateek Nayak
  0 siblings, 2 replies; 56+ messages in thread
From: Nathan Chancellor @ 2026-03-20 23:58 UTC (permalink / raw)
  To: Peter Zijlstra, K Prateek Nayak
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde, Chen Yu,
	Valentin Schneider, Dietmar Eggemann, x86

Hi all,

On Wed, Mar 18, 2026 at 08:08:44AM -0000, tip-bot2 for K Prateek Nayak wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     8e8e23dea43e64ddafbd1246644c3219209be113
> Gitweb:        https://git.kernel.org/tip/8e8e23dea43e64ddafbd1246644c3219209be113
> Author:        K Prateek Nayak <kprateek.nayak@amd.com>
> AuthorDate:    Thu, 12 Mar 2026 04:44:26 
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 18 Mar 2026 09:06:47 +01:00
> 
> sched/topology: Compute sd_weight considering cpuset partitions
> 
> [ commit message and diff snipped ]

Apologies if this has already been reported or addressed but I am seeing
a crash when booting certain ARM configurations after this change landed
in -next. I reduced it down to

  $ cat kernel/configs/schedstats.config
  CONFIG_SCHEDSTATS=y

  $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- mrproper defconfig schedstats.config zImage

  $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/arm-rootfs.cpio.zst | zstd -d >rootfs.cpio

  $ qemu-system-arm \
      -display none \
      -nodefaults \
      -no-reboot \
      -machine virt \
      -append 'console=ttyAMA0 earlycon' \
      -kernel arch/arm/boot/zImage \
      -initrd rootfs.cpio \
      -m 1G \
      -serial mon:stdio
  [    0.000000] Booting Linux on physical CPU 0x0
  [    0.000000] Linux version 7.0.0-rc4-00017-g8e8e23dea43e (nathan@framework-amd-ryzen-maxplus-395) (arm-linux-gnueabi-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP Fri Mar 20 16:12:05 MST 2026
  ...
  [    0.031929] 8<--- cut here ---
  [    0.031999] Unable to handle kernel NULL pointer dereference at virtual address 00000000 when write
  [    0.032172] [00000000] *pgd=00000000
  [    0.032459] Internal error: Oops: 805 [#1] SMP ARM
  [    0.032902] Modules linked in:
  [    0.033466] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-00017-g8e8e23dea43e #1 VOLUNTARY
  [    0.033658] Hardware name: Generic DT based system
  [    0.033770] PC is at build_sched_domains+0x7d0/0x1628
  [    0.034091] LR is at build_sched_domains+0x78c/0x1628
  [    0.034166] pc : [<c03c54bc>]    lr : [<c03c5478>]    psr: 20000053
  [    0.034255] sp : f080dec0  ip : 00000000  fp : c1e244a4
  [    0.034339] r10: c1e04fd4  r9 : c1e24518  r8 : 00000000
  [    0.034415] r7 : c2088f20  r6 : c28db924  r5 : c1e051ec  r4 : 00000010
  [    0.034508] r3 : 00000000  r2 : 00000000  r1 : 00000010  r0 : 00000010
  [    0.034623] Flags: nzCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  Segment none
  [    0.034730] Control: 10c5387d  Table: 4020406a  DAC: 00000051
  [    0.034819] Register r0 information: zero-size pointer
  [    0.034990] Register r1 information: zero-size pointer
  [    0.035064] Register r2 information: NULL pointer
  [    0.035133] Register r3 information: NULL pointer
  [    0.035198] Register r4 information: zero-size pointer
  [    0.035266] Register r5 information: non-slab/vmalloc memory
  [    0.035376] Register r6 information: slab kmalloc-512 start c28db800 pointer offset 292 size 512
  [    0.035623] Register r7 information: non-slab/vmalloc memory
  [    0.035703] Register r8 information: NULL pointer
  [    0.035769] Register r9 information: non-slab/vmalloc memory
  [    0.035848] Register r10 information: non-slab/vmalloc memory
  [    0.035928] Register r11 information: non-slab/vmalloc memory
  [    0.036006] Register r12 information: NULL pointer
  [    0.036083] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
  [    0.036243] Stack: (0xf080dec0 to 0xf080e000)
  [    0.036339] dec0: 00000000 c139a06c 00000001 00000000 c1e243f4 c28db924 c28db800 00000000
  [    0.036450] dee0: 00000000 ffff8ad3 00000000 00000001 c18f9f1c 00000000 c1e03d80 c1a8d4d0
  [    0.036559] df00: 00000000 c2073b8f c28e3180 00000000 c20d0050 c1d7ea64 c28b8800 f4b63fe3
  [    0.036665] df20: c1e22714 c2969480 c1e22714 00000000 c2074620 c1a8a0e8 00000000 00000000
  [    0.036772] df40: f080df6c c1c1d724 20000053 c0303d80 f080df64 f4b63fe3 c1d703dc c1d703dc
  [    0.036878] df60: c1d703dc 00000000 00000000 c1c01368 c2969480 f080df74 f080df74 f4b63fe3
  [    0.036989] df80: 00000000 c1e04f80 c13979fc 00000000 00000000 00000000 00000000 00000000
  [    0.037097] dfa0: 00000000 c1397a14 00000000 c03001ac 00000000 00000000 00000000 00000000
  [    0.037206] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  [    0.037316] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
  [    0.037447] Call trace:
  [    0.037698]  build_sched_domains from sched_init_smp+0x80/0x108
  [    0.037943]  sched_init_smp from kernel_init_freeable+0xe8/0x24c
  [    0.038029]  kernel_init_freeable from kernel_init+0x18/0x12c
  [    0.038122]  kernel_init from ret_from_fork+0x14/0x28
  [    0.038209] Exception stack(0xf080dfb0 to 0xf080dff8)
  [    0.038277] dfa0:                                     00000000 00000000 00000000 00000000
  [    0.038386] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  [    0.038495] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
  [    0.038640] Code: e58d3020 e58d300c e59d3020 e59d200c (e5832000)
  [    0.038903] ---[ end trace 0000000000000000 ]---
  [    0.039275] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
  [    0.039628] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

If there is any more information I can provide or patches I can test, I
am more than happy to do so.

Cheers,
Nathan

# bad: [b5d083a3ed1e2798396d5e491432e887da8d4a06] Add linux-next specific files for 20260319
# good: [8a30aeb0d1b4e4aaf7f7bae72f20f2ae75385ccb] Merge tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect start 'b5d083a3ed1e2798396d5e491432e887da8d4a06' '8a30aeb0d1b4e4aaf7f7bae72f20f2ae75385ccb'
# good: [21fbd87ec0afe2af5457f5a7f9acbee4bf5db891] Merge branch 'main' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
git bisect good 21fbd87ec0afe2af5457f5a7f9acbee4bf5db891
# good: [bffa4391cf4ee844778893a781f14faa55c75cce] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
git bisect good bffa4391cf4ee844778893a781f14faa55c75cce
# bad: [a360efb89caee066919156db3921e616093c43b6] Merge branch 'for-leds-next' of https://git.kernel.org/pub/scm/linux/kernel/git/lee/leds.git
git bisect bad a360efb89caee066919156db3921e616093c43b6
# good: [77f1b9e1181ac53ae9ce7c3c0e52002d02495c5e] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
git bisect good 77f1b9e1181ac53ae9ce7c3c0e52002d02495c5e
# bad: [d0b3afea83e48990083c0367c10f02af751166b4] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
git bisect bad d0b3afea83e48990083c0367c10f02af751166b4
# good: [fe58c95c6f191a8c45dc183a2348a3b4caa77ed8] Merge branch into tip/master: 'perf/core'
git bisect good fe58c95c6f191a8c45dc183a2348a3b4caa77ed8
# bad: [90924d8b73ac96a1a8b1cb9ba6cae36e193061a1] Merge branch into tip/master: 'timers/vdso'
git bisect bad 90924d8b73ac96a1a8b1cb9ba6cae36e193061a1
# bad: [91396a53d7c7cb694627c665e0dbd2589c99eb0a] Merge branch into tip/master: 'timers/core'
git bisect bad 91396a53d7c7cb694627c665e0dbd2589c99eb0a
# bad: [fe7171d0d5dfbe189e41db99580ebacafc3c09ce] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
git bisect bad fe7171d0d5dfbe189e41db99580ebacafc3c09ce
# good: [54a66e431eeacf23e1dc47cb3507f2d0c068aaf0] sched/headers: Inline raw_spin_rq_unlock()
git bisect good 54a66e431eeacf23e1dc47cb3507f2d0c068aaf0
# bad: [1cc8a33ca7e8d38f962b64ece2a42c411a67bc76] sched/topology: Allocate per-CPU sched_domain_shared in s_data
git bisect bad 1cc8a33ca7e8d38f962b64ece2a42c411a67bc76
# good: [786244f70322e41c937e69f0f935bfd11a9611bf] Merge tag 'v7.0-rc4' into sched/core, to pick up scheduler fixes
git bisect good 786244f70322e41c937e69f0f935bfd11a9611bf
# bad: [5a7b576b3ec1acc2694c5b58f80cd1d44a11b2c1] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
git bisect bad 5a7b576b3ec1acc2694c5b58f80cd1d44a11b2c1
# bad: [8e8e23dea43e64ddafbd1246644c3219209be113] sched/topology: Compute sd_weight considering cpuset partitions
git bisect bad 8e8e23dea43e64ddafbd1246644c3219209be113
# first bad commit: [8e8e23dea43e64ddafbd1246644c3219209be113] sched/topology: Compute sd_weight considering cpuset partitions


* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-20 23:58     ` Nathan Chancellor
@ 2026-03-21  3:36       ` K Prateek Nayak
  2026-03-21  7:33         ` Chen, Yu C
  2026-03-21 16:38       ` [PATCH] sched/topology: Initialize sd_span after assignment to *sd K Prateek Nayak
  1 sibling, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-21  3:36 UTC (permalink / raw)
  To: Nathan Chancellor, Peter Zijlstra
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde, Chen Yu,
	Valentin Schneider, Dietmar Eggemann, x86

Hello Nathan,

Thank you for the report.

On 3/21/2026 5:28 AM, Nathan Chancellor wrote:
>   $ cat kernel/configs/schedstats.config
>   CONFIG_SCHEDSTATS=y

Is the "schedstats.config" available somewhere? I tried these
steps on my end but couldn't reproduce the crash with my config.

Also, are you saying it is necessary to enable CONFIG_SCHEDSTATS
to observe the crash?

> 
>   $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- mrproper defconfig schedstats.config zImage
> 
>   $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/arm-rootfs.cpio.zst | zstd -d >rootfs.cpio
> 
>   $ qemu-system-arm \
>       -display none \
>       -nodefaults \
>       -no-reboot \
>       -machine virt \
>       -append 'console=ttyAMA0 earlycon' \
>       -kernel arch/arm/boot/zImage \
>       -initrd rootfs.cpio \
>       -m 1G \
>       -serial mon:stdio
>   [    0.000000] Booting Linux on physical CPU 0x0
>   [    0.000000] Linux version 7.0.0-rc4-00017-g8e8e23dea43e (nathan@framework-amd-ryzen-maxplus-395) (arm-linux-gnueabi-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP Fri Mar 20 16:12:05 MST 2026
>   ...
>   [    0.031929] 8<--- cut here ---
>   [    0.031999] Unable to handle kernel NULL pointer dereference at virtual address 00000000 when write
>   [    0.032172] [00000000] *pgd=00000000
>   [    0.032459] Internal error: Oops: 805 [#1] SMP ARM
>   [    0.032902] Modules linked in:
>   [    0.033466] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-00017-g8e8e23dea43e #1 VOLUNTARY
>   [    0.033658] Hardware name: Generic DT based system
>   [    0.033770] PC is at build_sched_domains+0x7d0/0x1628

For me, this points to:

  $ scripts/faddr2line vmlinux build_sched_domains+0x7d0/0x1628
  build_sched_domains+0x7d0/0x1628:
  find_next_bit_wrap at include/linux/find.h:455
  (inlined by) build_sched_groups at kernel/sched/topology.c:1255
  (inlined by) build_sched_domains at kernel/sched/topology.c:2603

which is the:

  span = sched_domain_span(sd);

  for_each_cpu_wrap(i, span, cpu) /* Here */ {
    ...
  }

in build_sched_groups(), so we are likely running off the end of the
allocated cpumask. But before that, we do this in the caller:

  sd->span_weight = cpumask_weight(sched_domain_span(sd));

which should have crashed too if we had a NULL pointer in the
cpumask range. So I'm at a loss. Maybe the pc points to a
different location in your build?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21  3:36       ` K Prateek Nayak
@ 2026-03-21  7:33         ` Chen, Yu C
  2026-03-21  7:47           ` Chen, Yu C
  0 siblings, 1 reply; 56+ messages in thread
From: Chen, Yu C @ 2026-03-21  7:33 UTC (permalink / raw)
  To: K Prateek Nayak, Nathan Chancellor, Peter Zijlstra
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86

On 3/21/2026 11:36 AM, K Prateek Nayak wrote:
> Hello Nathan,
> 
> Thank you for the report.
> 
> On 3/21/2026 5:28 AM, Nathan Chancellor wrote:
>>    $ cat kernel/configs/schedstats.config
>>    CONFIG_SCHEDSTATS=y
> 
> Is the "schedstats.config" available somewhere? I tried these
> steps on my end but couldn't reproduce the crash with my config.
> 
> Also, are you saying it is necessary to enable CONFIG_SCHEDSTATS
> to observe the crash?
> 
>>
>>    $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- mrproper defconfig schedstats.config zImage
>>
>>    $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/arm-rootfs.cpio.zst | zstd -d >rootfs.cpio
>>
>>    $ qemu-system-arm \
>>        -display none \
>>        -nodefaults \
>>        -no-reboot \
>>        -machine virt \
>>        -append 'console=ttyAMA0 earlycon' \
>>        -kernel arch/arm/boot/zImage \
>>        -initrd rootfs.cpio \
>>        -m 1G \
>>        -serial mon:stdio
>>    [    0.000000] Booting Linux on physical CPU 0x0
>>    [    0.000000] Linux version 7.0.0-rc4-00017-g8e8e23dea43e (nathan@framework-amd-ryzen-maxplus-395) (arm-linux-gnueabi-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP Fri Mar 20 16:12:05 MST 2026
>>    ...
>>    [    0.031929] 8<--- cut here ---
>>    [    0.031999] Unable to handle kernel NULL pointer dereference at virtual address 00000000 when write
>>    [    0.032172] [00000000] *pgd=00000000
>>    [    0.032459] Internal error: Oops: 805 [#1] SMP ARM
>>    [    0.032902] Modules linked in:
>>    [    0.033466] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-00017-g8e8e23dea43e #1 VOLUNTARY
>>    [    0.033658] Hardware name: Generic DT based system
>>    [    0.033770] PC is at build_sched_domains+0x7d0/0x1628
> 
> For me, this points to:
> 
>    $ scripts/faddr2line vmlinux build_sched_domains+0x7d0/0x1628

I suppose we might need to use arm-linux-gnueabi-addr2line, just in
case of a toolchain mismatch.

>    build_sched_domains+0x7d0/0x1628:
>    find_next_bit_wrap at include/linux/find.h:455
>    (inlined by) build_sched_groups at kernel/sched/topology.c:1255
>    (inlined by) build_sched_domains at kernel/sched/topology.c:2603
> 
> which is the:
> 
>    span = sched_domain_span(sd);
> 
>    for_each_cpu_wrap(i, span, cpu) /* Here */ {
>      ...
>    }
> 
> in build_sched_groups() so we are likely going off the allocated
> cpumask size but before that, we do this in the caller:
> 
>    sd->span_weight = cpumask_weight(sched_domain_span(sd));
> 
> which should have crashed too if we had a NULL pointer in the
> cpumask range. So I'm at a loss. Maybe the pc points to a
> different location in your build?
> 

A wild guess: the major change is that we now access sd->span before
initializing the sd structure with *sd = { ... }. The sd is allocated
via alloc_percpu() uninitialized, so the span at the end of the sd
structure remains uninitialized, and it is unclear how
cpumask_weight(sd->span) might be affected by this uninitialized state.
Before this patch, the contents of sd->span were explicitly set to 0
once *sd = { ... } was executed, which might be safer?

Thanks,
Chenyu


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21  7:33         ` Chen, Yu C
@ 2026-03-21  7:47           ` Chen, Yu C
  2026-03-21  8:59             ` K Prateek Nayak
  0 siblings, 1 reply; 56+ messages in thread
From: Chen, Yu C @ 2026-03-21  7:47 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86, Nathan Chancellor,
	Peter Zijlstra

On 3/21/2026 3:33 PM, Chen, Yu C wrote:
> On 3/21/2026 11:36 AM, K Prateek Nayak wrote:
>>    sd->span_weight = cpumask_weight(sched_domain_span(sd));
>>
>> which should have crashed too if we had a NULL pointer in the
>> cpumask range. So I'm at a loss. Maybe the pc points to a
>> different location in your build?
>>
> 
> A wild guess, the major change is that we access sd->span, before
> initializing  the sd structure with *sd = { ... }. The sd is allocated
> via alloc_percpu() uninitialized, the span at the end of the sd structure
> remain uninitialized. It is unclear how cpumask_weight(sd->span) might be
> affected by this uninitialized state. Before this patch, after *sd = 
> { ... }
> is executed, the contents of  sd->span are explicitly set to 0, which might
> be safer?
> 

I replied too fast, please ignore the above comments: sd->span should
have been set via cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu)).


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21  7:47           ` Chen, Yu C
@ 2026-03-21  8:59             ` K Prateek Nayak
  2026-03-21  9:45               ` K Prateek Nayak
  0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-21  8:59 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86, Nathan Chancellor,
	Peter Zijlstra

Hello Chenyu,

On 3/21/2026 1:17 PM, Chen, Yu C wrote:
> On 3/21/2026 3:33 PM, Chen, Yu C wrote:
>> On 3/21/2026 11:36 AM, K Prateek Nayak wrote:
>>>    sd->span_weight = cpumask_weight(sched_domain_span(sd));
>>>
>>> which should have crashed too if we had a NULL pointer in the
>>> cpumask range. So I'm at a loss. Maybe the pc points to a
>>> different location in your build?
>>>
>>
>> A wild guess, the major change is that we access sd->span, before
>> initializing  the sd structure with *sd = { ... }. The sd is allocated
>> via alloc_percpu() uninitialized, the span at the end of the sd structure
>> remain uninitialized. It is unclear how cpumask_weight(sd->span) might be
>> affected by this uninitialized state. Before this patch, after *sd = { ... }
>> is executed, the contents of  sd->span are explicitly set to 0, which might
>> be safer?
>>
> 
> I replied too fast, please ignore above comments, the sd->span should have been
> set via cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu))

So I managed to reproduce the crash and it is actually crashing at:

  last->next = first;

in build_sched_groups(). If I print the span before and after we do
the *sd = { ... }, I see:

  [    0.056301] span before: 0
  [    0.056559] span after:
  [    0.056686] span double check:

double check does a cpumask_pr_args(sched_domain_span(sd)).
This solves the crash on top of this patch:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79bab80af8f2..b347ae5d2786 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1693,6 +1693,8 @@ sd_init(struct sched_domain_topology_level *tl,
 		.name			= tl->name,
 	};
 
+	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+
 	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
 		  (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
 		  "CPU capacity asymmetry not supported on SMT\n");
---

And I see:

  [    0.056479] span before: 0
  [    0.056749] span after: 0
  [    0.056881] span double check: 0


But since span[] is a variable-length array at the end of the
sched_domain struct, doing a *sd = { ... } shouldn't modify it: the size
isn't known at compile time and the compiler will only overwrite the
fixed fields.

Is there a compiler angle I'm missing here?

The cpumask_and() that comes first looks like:

@ kernel/sched/topology.c:1649:         cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
        ldr     r3, [r9]        @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317], MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
@ kernel/sched/topology.c:1646:         u64 now = sched_clock();
        strd    r0, [sp, #28]   @,,
@ kernel/sched/topology.c:1649:         cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
        mov     r1, r6  @, i
        mov     r0, r9  @, ivtmp.1798
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        mov     r4, fp  @ tmp740, sd
@ kernel/sched/topology.c:1649:         cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
        blx     r3              @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        ldr     r3, [r0]        @ MEM[(const long unsigned int *)_356], MEM[(const long unsigned int *)_356]
        ldr     r0, [r7]        @ MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)cpu_map_104(D)]
        and     r0, r0, r3      @ tmp736, MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)_356]
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        uxth    r0, r0  @ _360, tmp736
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        str     r0, [r4, #292]! @ _360, MEM[(long unsigned int *)sd_352 + 292B]
---

*sd assignment looks as follows in my disassembly:

.L1867:
@ kernel/sched/topology.c:1660:         *sd = (struct sched_domain){
        ldr     ip, [sp, #48]   @ tmp1203, %sfp
        mov     r2, #296        @,
        mov     r0, fp  @, sd
        mov     r1, #0  @,
        ldr     r3, [ip]        @ jiffies.324_453, jiffies
        str     r3, [sp, #36]   @ jiffies.324_453, %sfp
        ldr     ip, [ip]        @ jiffies.326_454, jiffies
@ kernel/sched/topology.c:1693:                 .name                   = tl->name,
        ldr     r3, [r9, #28]   @ _455, MEM[(char * *)tl_317 + 28B]
        str     r3, [sp, #16]   @ _455, %sfp
@ kernel/sched/topology.c:1660:         *sd = (struct sched_domain){
        str     ip, [sp, #8]    @ jiffies.326_454, %sfp
        bl      memset          @
        ldr     r3, [sp, #36]   @ jiffies.324_453, %sfp
        ldr     r2, [sp, #28]   @ now, %sfp
        str     r3, [fp, #48]   @ jiffies.324_453, sd_352->last_balance
        ldr     r3, [sp, #16]   @ _455, %sfp
        ldr     ip, [sp, #8]    @ jiffies.326_454, %sfp
        str     r2, [fp, #72]   @ now, sd_352->newidle_stamp
        str     r3, [fp, #272]  @ _455, sd_352->name
        mov     r3, #16 @ tmp1502,
        ldr     r2, [sp, #32]   @ now, %sfp
        str     r3, [fp, #20]   @ tmp1502, sd_352->busy_factor
@ kernel/sched/topology.c:1678:                                         | sd_flags
        orr     r3, r4, #4096   @ _452, sd_flags,
@ kernel/sched/topology.c:1696:         WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
        and     r4, r4, #160    @ tmp779, sd_flags,
@ kernel/sched/topology.c:1678:                                         | sd_flags
        orr     r3, r3, #23     @ _452, _452,
@ kernel/sched/topology.c:1660:         *sd = (struct sched_domain){
        str     r2, [fp, #76]   @ now, sd_352->newidle_stamp
@ kernel/sched/topology.c:1696:         WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
        cmp     r4, #160        @ tmp779,
@ kernel/sched/topology.c:1660:         *sd = (struct sched_domain){
        mov     r2, #512        @ tmp776,
        str     ip, [fp, #88]   @ jiffies.326_454, sd_352->last_decay_max_lb_cost
        str     r2, [fp, #60]   @ tmp776, sd_352->newidle_call
        str     r2, [fp, #68]   @ tmp776, sd_352->newidle_ratio
@ kernel/sched/topology.c:1662:                 .max_interval           = 2*sd_weight,
        lsl     r2, r10, #1     @ tmp773, _484,
@ kernel/sched/topology.c:1660:         *sd = (struct sched_domain){
        str     r5, [fp, #4]    @ sd, sd_352->child
        str     r2, [fp, #16]   @ tmp773, sd_352->max_interval
        mov     r2, #117        @ tmp775,
        str     r10, [fp, #12]  @ _484, sd_352->min_interval
        str     r2, [fp, #24]   @ tmp775, sd_352->imbalance_pct
        mov     r2, #256        @ tmp777,
        str     r10, [fp, #52]  @ _484, sd_352->balance_interval
        str     r3, [fp, #40]   @ _452, sd_352->flags
        str     r2, [fp, #64]   @ tmp777, sd_352->newidle_success
---


If I add the new cpumask_and() I get the following after *sd assignment:

@ kernel/sched/topology.c:1696:         cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
        ldr     r3, [r9]        @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317], MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
        blx     r3              @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        ldr     r3, [r7]        @ MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)cpu_map_104(D)]
        ldr     r2, [r0]        @ MEM[(const long unsigned int *)_457], MEM[(const long unsigned int *)_457]
        and     r3, r3, r2      @ tmp788, MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)_457]
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        uxth    r3, r3  @ tmp791, tmp788
@ ./include/linux/bitmap.h:329:                 return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
        str     r3, [fp, #292]  @ tmp791, MEM[(long unsigned int *)sd_352 + 292B]
---


Both cpumask_and() seems to store to:

  MEM[(long unsigned int *)sd_352 + 292B]

So I'm at a loss as to why this happens. Let me dig a little more.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21  8:59             ` K Prateek Nayak
@ 2026-03-21  9:45               ` K Prateek Nayak
  2026-03-21 10:13                 ` K Prateek Nayak
  0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-21  9:45 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86, Nathan Chancellor,
	Peter Zijlstra

Hello folks,

On 3/21/2026 2:29 PM, K Prateek Nayak wrote:
> So I managed to reproduce the crash and it is actually crashing at:
> 
>   last->next = first;
> 
> in build_sched_groups(). If I print the span before and after we do
> the *sd = { ... }, I see:
> 
>   [    0.056301] span before: 0
>   [    0.056559] span after:
>   [    0.056686] span double check:
> 
> double check does a cpumask_pr_args(sched_domain_span(sd)).
> This solves the crash on top of this patch:
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79bab80af8f2..b347ae5d2786 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1693,6 +1693,8 @@ sd_init(struct sched_domain_topology_level *tl,
>  		.name			= tl->name,
>  	};
>  
> +	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> +
>  	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
>  		  (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
>  		  "CPU capacity asymmetry not supported on SMT\n");
> ---
> 
> And I see:
> 
>   [    0.056479] span before: 0
>   [    0.056749] span after: 0
>   [    0.056881] span double check: 0
> 
> 
> But since span[] is a variable array at the end of sched_domain struct,
> doing a *sd = { ... } shouldn't modify it since the size isn't known at
> compile time and the compiler will only overwrite the fixed fields.
> 
> Is there a compiler angle I'm missing here?

So this is what I've found: By default we have:

  cpumask_size: 4
  struct sched_domain size: 296

If I do:

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a1e1032426dc..f0bebce274f7 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -148,7 +148,7 @@ struct sched_domain {
 	 * by attaching extra space to the end of the structure,
 	 * depending on how many CPUs the kernel has booted up with)
 	 */
-	unsigned long span[];
+	unsigned long span[1];
 };
 
 static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
---

I still see:

  cpumask_size: 4
  struct sched_domain size: 296

Which means we are overwriting sd->span during the *sd assignment even
with the variable-length array at the end :-(

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21  9:45               ` K Prateek Nayak
@ 2026-03-21 10:13                 ` K Prateek Nayak
  2026-03-21 12:48                   ` Chen, Yu C
  2026-03-21 14:13                   ` Shrikanth Hegde
  0 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-21 10:13 UTC (permalink / raw)
  To: Peter Zijlstra, Chen, Yu C
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86, Nathan Chancellor

On 3/21/2026 3:15 PM, K Prateek Nayak wrote:
> So this is what I've found: By default we have:
> 
>   cpumask_size: 4
>   struct sched_domain size: 296
> 
> If I do:
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a1e1032426dc..f0bebce274f7 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -148,7 +148,7 @@ struct sched_domain {
>  	 * by attaching extra space to the end of the structure,
>  	 * depending on how many CPUs the kernel has booted up with)
>  	 */
> -	unsigned long span[];
> +	unsigned long span[1];
>  };
>  
>  static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> ---
> 
> I still see:
> 
>   cpumask_size: 4
>   struct sched_domain size: 296
> 
> Which means we are overwriting the sd->span during *sd assignment even
> with the variable length array at the end :-(
> 

And more evidence - by default we have:

  sched_domain size: 296
  offset of sd_span: 292

sizeof() seems to account for some 4-byte tail padding in the struct,
which leaves the offset of sd->span below the struct size.

To resolve this, we can also do:

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a1e1032426dc..48bea2f7f750 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -148,7 +148,7 @@ struct sched_domain {
 	 * by attaching extra space to the end of the structure,
 	 * depending on how many CPUs the kernel has booted up with)
 	 */
-	unsigned long span[];
+	unsigned long span[] __aligned(2 * sizeof(int));
 };
 
 static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
---

and the kernel boots fine with the sd_span offset aligned with
sched_domain struct size:

  sched_domain size: 296
  offset of sd_span: 296


So Peter, which solution do you prefer?

1. Doing cpumask_and() after the *sd = { ... } initialization. (or)

2. Align sd->span to an 8-byte boundary.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21 10:13                 ` K Prateek Nayak
@ 2026-03-21 12:48                   ` Chen, Yu C
  2026-03-24  2:54                     ` K Prateek Nayak
  2026-03-21 14:13                   ` Shrikanth Hegde
  1 sibling, 1 reply; 56+ messages in thread
From: Chen, Yu C @ 2026-03-21 12:48 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra
  Cc: linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86, Nathan Chancellor

On 3/21/2026 6:13 PM, K Prateek Nayak wrote:
> On 3/21/2026 3:15 PM, K Prateek Nayak wrote:
>> So this is what I've found: By default we have:
>>
>>    cpumask_size: 4
>>    struct sched_domain size: 296
>>
>> If I do:
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a1e1032426dc..f0bebce274f7 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -148,7 +148,7 @@ struct sched_domain {
>>   	 * by attaching extra space to the end of the structure,
>>   	 * depending on how many CPUs the kernel has booted up with)
>>   	 */
>> -	unsigned long span[];
>> +	unsigned long span[1];
>>   };
>>   
>>   static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>> ---
>>
>> I still see:
>>
>>    cpumask_size: 4
>>    struct sched_domain size: 296
>>
>> Which means we are overwriting the sd->span during *sd assignment even
>> with the variable length array at the end :-(
>>

Ah, that's right.

> 
> And more evidence - by default we have:
> 
>    sched_domain size: 296
>    offset of sd_span: 292
> 
> sizeof() seems to account some sort of 4-byte padding for the struct which
> pushes the offset of sd->span into the struct size.
> 

In your disassembly for *sd = {...}

mov     r2, #296
mov     r0, fp
mov     r1, #0
...
bl memset  <-- oops!

> To resolve this, we can also do:
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a1e1032426dc..48bea2f7f750 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -148,7 +148,7 @@ struct sched_domain {
>   	 * by attaching extra space to the end of the structure,
>   	 * depending on how many CPUs the kernel has booted up with)
>   	 */
> -	unsigned long span[];
> +	unsigned long span[] __aligned(2 * sizeof(int));
>   };
>   
>   static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> ---
> 
> and the kernel boots fine with the sd_span offset aligned with
> sched_domain struct size:
> 
>    sched_domain size: 296
>    offset of sd_span: 296
> 
> 
> So Peter, which solution do you prefer?
> 
> 1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
> 
> 2. Align sd->span to an 8-byte boundary.
> 

I vote for option 1, as option 2 relies on how the compiler
interprets sizeof() and the offset of each member within
the structure IMO. Initializing the values after *sd = {} seems
safer and more generic, but the decision is up to Peter : )

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21 10:13                 ` K Prateek Nayak
  2026-03-21 12:48                   ` Chen, Yu C
@ 2026-03-21 14:13                   ` Shrikanth Hegde
  2026-03-21 15:14                     ` K Prateek Nayak
  1 sibling, 1 reply; 56+ messages in thread
From: Shrikanth Hegde @ 2026-03-21 14:13 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Chen, Yu C
  Cc: linux-tip-commits, linux-kernel, Valentin Schneider,
	Dietmar Eggemann, x86, Nathan Chancellor

Hi Prateek.

On 3/21/26 3:43 PM, K Prateek Nayak wrote:
> On 3/21/2026 3:15 PM, K Prateek Nayak wrote:
>> So this is what I've found: By default we have:
>>
>>    cpumask_size: 4
>>    struct sched_domain size: 296
>>
>> If I do:
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a1e1032426dc..f0bebce274f7 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -148,7 +148,7 @@ struct sched_domain {
>>   	 * by attaching extra space to the end of the structure,
>>   	 * depending on how many CPUs the kernel has booted up with)
>>   	 */
>> -	unsigned long span[];
>> +	unsigned long span[1];
>>   };
>>   
>>   static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>> ---
>>
>> I still see:
>>
>>    cpumask_size: 4
>>    struct sched_domain size: 296
>>
>> Which means we are overwriting the sd->span during *sd assignment even
>> with the variable length array at the end :-(
>>
> 
> And more evidence - by default we have:
> 
>    sched_domain size: 296
>    offset of sd_span: 292
> 
> sizeof() seems to account some sort of 4-byte padding for the struct which
> pushes the offset of sd->span into the struct size.
> 
> To resolve this, we can also do:
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a1e1032426dc..48bea2f7f750 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -148,7 +148,7 @@ struct sched_domain {
>   	 * by attaching extra space to the end of the structure,
>   	 * depending on how many CPUs the kernel has booted up with)
>   	 */
> -	unsigned long span[];
> +	unsigned long span[] __aligned(2 * sizeof(int));
>   };

Wouldn't that be susceptible to changes in sched_domain somewhere in between?
Right now, it may be aligning to 296 since the struct is 8-byte aligned.

But let's say someone adds a new int in between. Then the size of sched_domain
would be 300, but span would still be at 296 since it is 8-byte aligned?

>   
>   static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> ---
> 
> and the kernel boots fine with the sd_span offset aligned with
> sched_domain struct size:
> 
>    sched_domain size: 296
>    offset of sd_span: 296
> 
>
> So Peter, which solution do you prefer?
> 
> 1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
> 
> 2. Align sd->span to an 8-byte boundary.
> 

Only update sd_weight and leave everything as it was earlier?

sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21 14:13                   ` Shrikanth Hegde
@ 2026-03-21 15:14                     ` K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-21 15:14 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra, Chen, Yu C
  Cc: linux-tip-commits, linux-kernel, Valentin Schneider,
	Dietmar Eggemann, x86, Nathan Chancellor

Hello Shrikanth,

On 3/21/2026 7:43 PM, Shrikanth Hegde wrote:
>> And more evidence - by default we have:
>>
>>    sched_domain size: 296
>>    offset of sd_span: 292
>>
>> sizeof() seems to account some sort of 4-byte padding for the struct which
>> pushes the offset of sd->span into the struct size.
>>
>> To resolve this, we can also do:
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a1e1032426dc..48bea2f7f750 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -148,7 +148,7 @@ struct sched_domain {
>>        * by attaching extra space to the end of the structure,
>>        * depending on how many CPUs the kernel has booted up with)
>>        */
>> -    unsigned long span[];
>> +    unsigned long span[] __aligned(2 * sizeof(int));
>>   };
> 
> Wouldn't that be susceptible to change in sched_domain somewhere in between?
> Right now, it maybe aligning to 296 since it is 8 byte aligned.
> 
> But lets say someone adds a new int in between. Then size of sched_domain would be 300.
> but span would still be 296 since it 8 bit aligned?

So the official GCC specification for "Arrays of Length Zero" [1] says:

  Although the size of a zero-length array is zero, an array member of
  this kind may increase the size of the enclosing type as a result of
  tail padding.

so you can either have:

  struct sched_domain {
    ...
    unsigned int span_length;  /* 288  4 */
    unsigned long span[];      /* 292  0 */

    /* XXX 4 byte tail padding */

    /* size: 296 */
  }

or:

  struct sched_domain {
    ...
    unsigned int span_length;  /* 288  4 */

    /* XXX 4 bytes hole, try to pack */

    unsigned long span[];      /* 296  0 */

    /* size: 296 */
  }

If the variable-length array is aligned, there is no need for tail
padding, and in both cases the length of span[] is always 0.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

But ...

> 
>>     static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>> ---
>>
>> and the kernel boots fine with the sd_span offset aligned with
>> sched_domain struct size:
>>
>>    sched_domain size: 296
>>    offset of sd_span: 296
>>
>>
>> So Peter, which solution do you prefer?
>>
>> 1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
>>
>> 2. Align sd->span to an 8-byte boundary.
>>
> 
> Only update sd_weight and leave everything as it was earlier?
> 
> sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));

... I agree with you and Chenyu that this approach is better, since
padding and alignment are again dependent on the compiler.

Anyhow, we already do a cpumask_weight() for sd_weight, and
cpumask_weight_and(), being of the same complexity, shouldn't add any
more overhead.

While we are at it, we can also remove that "sd->span_weight" assignment
before the build_sched_groups() loop since we already have it here.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-20 23:58     ` Nathan Chancellor
  2026-03-21  3:36       ` K Prateek Nayak
@ 2026-03-21 16:38       ` K Prateek Nayak
  2026-03-23  9:08         ` Shrikanth Hegde
  2026-03-23  9:36         ` Peter Zijlstra
  1 sibling, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-21 16:38 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Nathan Chancellor, Valentin Schneider, Dietmar Eggemann,
	Shrikanth Hegde, Chen Yu, linux-kernel
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Gautham R. Shenoy, x86,
	K Prateek Nayak

Nathan reported a kernel panic on his ARM builds after commit
8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
partitions") which was root caused to the compiler zeroing out the first
few bytes of sd->span.

During debugging [1], it was discovered that, on some configs,
offsetof(struct sched_domain, span) at 292 was less than
sizeof(struct sched_domain) at 296, resulting in:

  *sd = { ... }

assignment clearing out the first 4 bytes of sd->span, which was
initialized earlier.

The official GCC specification for "Arrays of Length Zero" [2] says:

  Although the size of a zero-length array is zero, an array member of
  this kind may increase the size of the enclosing type as a result of
  tail padding.

which means the offset of the variable-length array at the end of the
struct can indeed be less than sizeof() of the struct as a result of
tail padding, so any data of the flexible array that overlaps with the
padding is overwritten whenever the struct is initialized as a whole.

Partially revert commit 8e8e23dea43e ("sched/topology: Compute sd_weight
considering cpuset partitions") to initialize sd_span after the fixed
members of sd.

Use

  cpumask_weight_and(cpu_map, tl->mask(tl, cpu))

to calculate span_weight before initializing the sd_span.
cpumask_weight_and() is of the same complexity as cpumask_and(), so the
additional overhead is negligible.

While at it, also initialize sd->span_weight in sd_init() since
sd_weight now captures the cpu_map constraints. Fix up
sd->span_weight whenever sd_span is fixed up by the generic topology
layer.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/all/20260320235824.GA1176840@ax162/
Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
Link: https://lore.kernel.org/all/a8c125fd-960d-4b35-b640-95a33584eb08@amd.com/ [1]
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Nathan, can you please check if this fixes the issue you are observing -
it at least fixed one that I'm observing ;-)

Peter, if you would like to keep revert and enhancements separate, let
me know and I'll spin a v2.
---
 kernel/sched/topology.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 43150591914b..721ed9b883b8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1669,17 +1669,13 @@ sd_init(struct sched_domain_topology_level *tl,
 	struct cpumask *sd_span;
 	u64 now = sched_clock();
 
-	sd_span = sched_domain_span(sd);
-	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
-	sd_weight = cpumask_weight(sd_span);
-	sd_id = cpumask_first(sd_span);
+	sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));
 
 	if (tl->sd_flags)
 		sd_flags = (*tl->sd_flags)();
 	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
 		      "wrong sd_flags in topology description\n"))
 		sd_flags &= TOPOLOGY_SD_FLAGS;
-	sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
@@ -1715,8 +1711,15 @@ sd_init(struct sched_domain_topology_level *tl,
 		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
 		.name			= tl->name,
+		.span_weight		= sd_weight,
 	};
 
+	sd_span = sched_domain_span(sd);
+	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+	sd_id = cpumask_first(sd_span);
+
+	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
+
 	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
 		  (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
 		  "CPU capacity asymmetry not supported on SMT\n");
@@ -2518,6 +2521,8 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 			cpumask_or(sched_domain_span(sd),
 				   sched_domain_span(sd),
 				   sched_domain_span(child));
+
+			sd->span_weight = cpumask_weight(sched_domain_span(sd));
 		}
 
 	}
@@ -2697,7 +2702,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	/* Build the groups for the domains */
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-			sd->span_weight = cpumask_weight(sched_domain_span(sd));
 			if (sd->flags & SD_NUMA) {
 				if (build_overlap_sched_groups(sd, i))
 					goto error;

base-commit: fe7171d0d5dfbe189e41db99580ebacafc3c09ce
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-21 16:38       ` [PATCH] sched/topology: Initialize sd_span after assignment to *sd K Prateek Nayak
@ 2026-03-23  9:08         ` Shrikanth Hegde
  2026-03-23 17:34           ` K Prateek Nayak
  2026-03-23  9:36         ` Peter Zijlstra
  1 sibling, 1 reply; 56+ messages in thread
From: Shrikanth Hegde @ 2026-03-23  9:08 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Nathan Chancellor, Valentin Schneider, Dietmar Eggemann, Chen Yu,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86



On 3/21/26 10:08 PM, K Prateek Nayak wrote:
> Nathan reported a kernel panic on his ARM builds after commit
> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
> partitions") which was root caused to the compiler zeroing out the first
> few bytes of sd->span.
> 
> During the debug [1], it was discovered that, on some configs,
> offsetof(struct sched_domain, span) at 292 was less than
> sizeof(struct sched_domain) at 296 resulting in:
> 
>    *sd = { ... }
> 
> assignment clearing out first 4 bytes of sd->span which was initialized
> before.
> 
> The official GCC specification for "Arrays of Length Zero" [2] says:
> 
>    Although the size of a zero-length array is zero, an array member of
>    this kind may increase the size of the enclosing type as a result of
>    tail padding.
> 
> which means the relative offset of the variable length array at the end
> of the struct can indeed be less than sizeof() the struct as a result of
> tail padding thus overwriting that data of the flexible array that
> overlapped with the padding whenever the struct is initialized as whole.
> 
> Partially revert commit 8e8e23dea43e ("sched/topology: Compute sd_weight
> considering cpuset partitions") to initialize sd_span after the fixed
> members of sd.
> 
> Use
> 
>    cpumask_weight_and(cpu_map, tl->mask(tl, cpu))
> 
> to calculate span_weight before initializing the sd_span.
> cpumask_weight_and() is of the same complexity as cpumask_and() and the
> additional overhead is negligible.
> 
> While at it, also initialize sd->span_weight in sd_init() since
> sd_weight now captures the cpu_map constraints. Fixup the
> sd->span_weight whenever sd_span is fixed up by the generic topology
> layer.
> 


This description is a bit confusing. The fixup happens naturally since
cpu_map now reflects the changes, right?

Maybe mention that removal in build_sched_domains?

> Reported-by: Nathan Chancellor <nathan@kernel.org>
> Closes: https://lore.kernel.org/all/20260320235824.GA1176840@ax162/
> Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
> Link: https://lore.kernel.org/all/a8c125fd-960d-4b35-b640-95a33584eb08@amd.com/ [1]
> Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Nathan, can you please check if this fixes the issue you are observing -
> it at least fixed one that I'm observing ;-)
> 
> Peter, if you would like to keep revert and enhancements separate, let
> me know and I'll spin a v2.
> ---
>   kernel/sched/topology.c | 16 ++++++++++------
>   1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 43150591914b..721ed9b883b8 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1669,17 +1669,13 @@ sd_init(struct sched_domain_topology_level *tl,
>   	struct cpumask *sd_span;
>   	u64 now = sched_clock();
>   
> -	sd_span = sched_domain_span(sd);
> -	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> -	sd_weight = cpumask_weight(sd_span);
> -	sd_id = cpumask_first(sd_span);
> +	sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));
>   
>   	if (tl->sd_flags)
>   		sd_flags = (*tl->sd_flags)();
>   	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
>   		      "wrong sd_flags in topology description\n"))
>   		sd_flags &= TOPOLOGY_SD_FLAGS;
> -	sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
>   
>   	*sd = (struct sched_domain){
>   		.min_interval		= sd_weight,
> @@ -1715,8 +1711,15 @@ sd_init(struct sched_domain_topology_level *tl,
>   		.last_decay_max_lb_cost	= jiffies,
>   		.child			= child,
>   		.name			= tl->name,
> +		.span_weight		= sd_weight,
>   	};
>   
> +	sd_span = sched_domain_span(sd);
> +	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> +	sd_id = cpumask_first(sd_span);
> +
> +	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
> +
>   	WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
>   		  (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
>   		  "CPU capacity asymmetry not supported on SMT\n");
> @@ -2518,6 +2521,8 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>   			cpumask_or(sched_domain_span(sd),
>   				   sched_domain_span(sd),
>   				   sched_domain_span(child));
> +
> +			sd->span_weight = cpumask_weight(sched_domain_span(sd));
>   		}
>   
>   	}
> @@ -2697,7 +2702,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>   	/* Build the groups for the domains */
>   	for_each_cpu(i, cpu_map) {
>   		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> -			sd->span_weight = cpumask_weight(sched_domain_span(sd));
>   			if (sd->flags & SD_NUMA) {
>   				if (build_overlap_sched_groups(sd, i))
>   					goto error;
> 
> base-commit: fe7171d0d5dfbe189e41db99580ebacafc3c09ce


Other than nits in changelog:
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>


PS: b4 am -Q was quite confused about which patch to pick for 0001,
maybe since it was a reply to the thread. Not sure. So I pulled
each patch separately and applied them.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-21 16:38       ` [PATCH] sched/topology: Initialize sd_span after assignment to *sd K Prateek Nayak
  2026-03-23  9:08         ` Shrikanth Hegde
@ 2026-03-23  9:36         ` Peter Zijlstra
  2026-03-23 13:24           ` Jon Hunter
                             ` (4 more replies)
  1 sibling, 5 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-23  9:36 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Nathan Chancellor,
	Valentin Schneider, Dietmar Eggemann, Shrikanth Hegde, Chen Yu,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86, Kees Cook

On Sat, Mar 21, 2026 at 04:38:52PM +0000, K Prateek Nayak wrote:
> Nathan reported a kernel panic on his ARM builds after commit
> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
> partitions") which was root caused to the compiler zeroing out the first
> few bytes of sd->span.
> 
> During the debug [1], it was discovered that, on some configs,
> offsetof(struct sched_domain, span) at 292 was less than
> sizeof(struct sched_domain) at 296 resulting in:
> 
>   *sd = { ... }
> 
> assignment clearing out first 4 bytes of sd->span which was initialized
> before.
> 
> The official GCC specification for "Arrays of Length Zero" [2] says:
> 
>   Although the size of a zero-length array is zero, an array member of
>   this kind may increase the size of the enclosing type as a result of
>   tail padding.
> 
> which means the relative offset of the variable length array at the end
> of the struct can indeed be less than sizeof() the struct as a result of
> tail padding thus overwriting that data of the flexible array that
> overlapped with the padding whenever the struct is initialized as whole.

WTF! that's terrible :(

Why is this allowed, this makes no bloody sense :/

However the way we allocate space for flex arrays is: sizeof(*obj) +
count * sizeof(*obj->member); this means that we do have sufficient
space, irrespective of this extra padding.


Does this work?

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 51c29581f15e..defa86ed9b06 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -153,7 +153,21 @@ struct sched_domain {
 
 static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 {
-	return to_cpumask(sd->span);
+	/*
+	 * Because C is an absolutely broken piece of shit, it is allowed for
+	 * offsetof(*sd, span) < sizeof(*sd), this means that structure
+	 * initialization *sd = { ... }; which will clear every unmentioned
+	 * member, can over-write the start of the flexible array member.
+	 *
+	 * Luckily, the way we allocate the flexible array is by:
+	 *
+	 *   sizeof(*sd) + count * sizeof(*sd->span)
+	 *
+	 * this means that we have sufficient space for the whole flex array
+	 * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
+	 */
+	unsigned long *bitmap = (void *)sd + sizeof(*sd);
+	return to_cpumask(bitmap);
 }
 
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-23  9:36         ` Peter Zijlstra
@ 2026-03-23 13:24           ` Jon Hunter
  2026-03-23 15:36           ` Chen, Yu C
                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Jon Hunter @ 2026-03-23 13:24 UTC (permalink / raw)
  To: Peter Zijlstra, K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Nathan Chancellor,
	Valentin Schneider, Dietmar Eggemann, Shrikanth Hegde, Chen Yu,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86, Kees Cook, linux-tegra@vger.kernel.org

Hi Peter,

On 23/03/2026 09:36, Peter Zijlstra wrote:
> On Sat, Mar 21, 2026 at 04:38:52PM +0000, K Prateek Nayak wrote:
>> Nathan reported a kernel panic on his ARM builds after commit
>> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
>> partitions") which was root caused to the compiler zeroing out the first
>> few bytes of sd->span.
>>
>> During the debug [1], it was discovered that, on some configs,
>> offsetof(struct sched_domain, span) at 292 was less than
>> sizeof(struct sched_domain) at 296 resulting in:
>>
>>    *sd = { ... }
>>
>> assignment clearing out first 4 bytes of sd->span which was initialized
>> before.
>>
>> The official GCC specification for "Arrays of Length Zero" [2] says:
>>
>>    Although the size of a zero-length array is zero, an array member of
>>    this kind may increase the size of the enclosing type as a result of
>>    tail padding.
>>
>> which means the relative offset of the variable length array at the end
>> of the struct can indeed be less than sizeof() the struct as a result of
>> tail padding thus overwriting that data of the flexible array that
>> overlapped with the padding whenever the struct is initialized as whole.
> 
> WTF! that's terrible :(
> 
> Why is this allowed, this makes no bloody sense :/
> 
> However the way we allocate space for flex arrays is: sizeof(*obj) +
> count * sizeof(*obj->member); this means that we do have sufficient
> space, irrespective of this extra padding.
> 
> 
> Does this work?
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 51c29581f15e..defa86ed9b06 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -153,7 +153,21 @@ struct sched_domain {
>   
>   static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>   {
> -	return to_cpumask(sd->span);
> +	/*
> +	 * Because C is an absolutely broken piece of shit, it is allowed for
> +	 * offsetof(*sd, span) < sizeof(*sd), this means that structure
> +	 * initialization *sd = { ... }; which will clear every unmentioned
> +	 * member, can over-write the start of the flexible array member.
> +	 *
> +	 * Luckily, the way we allocate the flexible array is by:
> +	 *
> +	 *   sizeof(*sd) + count * sizeof(*sd->span)
> +	 *
> +	 * this means that we have sufficient space for the whole flex array
> +	 * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
> +	 */
> +	unsigned long *bitmap = (void *)sd + sizeof(*sd);
> +	return to_cpumask(bitmap);
>   }
>   
>   extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],


I noticed the same issue that Nathan reported on 32-bit Tegra and the 
above does fix it for me.

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-23  9:36         ` Peter Zijlstra
  2026-03-23 13:24           ` Jon Hunter
@ 2026-03-23 15:36           ` Chen, Yu C
  2026-03-23 17:24           ` K Prateek Nayak
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 56+ messages in thread
From: Chen, Yu C @ 2026-03-23 15:36 UTC (permalink / raw)
  To: Peter Zijlstra, K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Nathan Chancellor,
	Valentin Schneider, Dietmar Eggemann, Shrikanth Hegde,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86, Kees Cook

On 3/23/2026 5:36 PM, Peter Zijlstra wrote:
> On Sat, Mar 21, 2026 at 04:38:52PM +0000, K Prateek Nayak wrote:
>> Nathan reported a kernel panic on his ARM builds after commit
>> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
>> partitions") which was root caused to the compiler zeroing out the first
>> few bytes of sd->span.
>>
>> During the debug [1], it was discovered that, on some configs,
>> offsetof(struct sched_domain, span) at 292 was less than
>> sizeof(struct sched_domain) at 296 resulting in:
>>
>>    *sd = { ... }
>>
>> assignment clearing out first 4 bytes of sd->span which was initialized
>> before.
>>
>> The official GCC specification for "Arrays of Length Zero" [2] says:
>>
>>    Although the size of a zero-length array is zero, an array member of
>>    this kind may increase the size of the enclosing type as a result of
>>    tail padding.
>>
>> which means the relative offset of the variable length array at the end
>> of the struct can indeed be less than sizeof() the struct as a result of
>> tail padding thus overwriting that data of the flexible array that
>> overlapped with the padding whenever the struct is initialized as whole.
> 
> WTF! that's terrible :(
> 
> Why is this allowed, this makes no bloody sense :/
> 
> However the way we allocate space for flex arrays is: sizeof(*obj) +
> count * sizeof(*obj->member); this means that we do have sufficient
> space, irrespective of this extra padding.
> 
> 
> Does this work?
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 51c29581f15e..defa86ed9b06 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -153,7 +153,21 @@ struct sched_domain {
>   
>   static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>   {
> -	return to_cpumask(sd->span);
> +	/*
> +	 * Because C is an absolutely broken piece of shit, it is allowed for
> +	 * offsetof(*sd, span) < sizeof(*sd), this means that structure
> +	 * initialization *sd = { ... }; which will clear every unmentioned
> +	 * member, can over-write the start of the flexible array member.
> +	 *
> +	 * Luckily, the way we allocate the flexible array is by:
> +	 *
> +	 *   sizeof(*sd) + count * sizeof(*sd->span)
> +	 *
> +	 * this means that we have sufficient space for the whole flex array
> +	 * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
> +	 */
> +	unsigned long *bitmap = (void *)sd + sizeof(*sd);
> +	return to_cpumask(bitmap);
>   }
>   
>   extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

While I still wonder if it is risky to initialize the structure members
before *sd = { ... }, this patch could keep the current sd_init()
unchanged.
According to the tests on GNR, it works as expected with no regressions
noticed on top of sched/core commit 349edbba1125 ("sched/fair: Simplify
SIS_UTIL handling in select_idle_cpu()"),

Tested-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-23  9:36         ` Peter Zijlstra
  2026-03-23 13:24           ` Jon Hunter
  2026-03-23 15:36           ` Chen, Yu C
@ 2026-03-23 17:24           ` K Prateek Nayak
  2026-03-23 22:41           ` Nathan Chancellor
  2026-03-24  9:10           ` [tip: sched/core] sched/topology: Fix sched_domain_span() tip-bot2 for Peter Zijlstra
  4 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-23 17:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Nathan Chancellor,
	Valentin Schneider, Dietmar Eggemann, Shrikanth Hegde, Chen Yu,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86, Kees Cook

Hello Peter,

On 3/23/2026 3:06 PM, Peter Zijlstra wrote:
> However the way we allocate space for flex arrays is: sizeof(*obj) +
> count * sizeof(*obj->member); this means that we do have sufficient
> space, irrespective of this extra padding.
> 
> 
> Does this work?

Solves the panic on the setup shared by Nathan and KASAN hasn't
noted anything in my baremetal testing so feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-23  9:08         ` Shrikanth Hegde
@ 2026-03-23 17:34           ` K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-23 17:34 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Nathan Chancellor, Valentin Schneider, Dietmar Eggemann, Chen Yu,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86

Hello Shrikanth,

On 3/23/2026 2:38 PM, Shrikanth Hegde wrote:
>> While at it, also initialize sd->span_weight in sd_init() since
>> sd_weight now captures the cpu_map constraints. Fixup the
>> sd->span_weight whenever sd_span is fixed up by the generic topology
>> layer.
>>
> 
> 
> This description is a bit confusing. Fixup happens naturally since
> cpu_map now reflects the changes right?

That was for the hunk in build_sched_domain() where the sd span
is fixed up if it is found that the child isn't a subset of the
parent, in which case span_weight needs to be calculated again after
the cpumask_or().

[..snip..]

> Other than nits in changelog:
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>

Thanks for the review but Peter has found an alternate approach to
work around this with the current flow of computing span first.

> PS: b4 am -Q was quite confused which patch to pick for 0001.
> may since it was a reply to the thread. Not sure. So i pulled
> each patch separate and applied.

Sorry for the inconvenience. For a single patch it should still be fine
to grab the raw patch, but for larger series I'll make sure to post them
out separately for convenience. Will be mindful next time.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH] sched/topology: Initialize sd_span after assignment to *sd
  2026-03-23  9:36         ` Peter Zijlstra
                             ` (2 preceding siblings ...)
  2026-03-23 17:24           ` K Prateek Nayak
@ 2026-03-23 22:41           ` Nathan Chancellor
  2026-03-24  9:10           ` [tip: sched/core] sched/topology: Fix sched_domain_span() tip-bot2 for Peter Zijlstra
  4 siblings, 0 replies; 56+ messages in thread
From: Nathan Chancellor @ 2026-03-23 22:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Dietmar Eggemann, Shrikanth Hegde, Chen Yu,
	linux-kernel, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, x86, Kees Cook

On Mon, Mar 23, 2026 at 10:36:27AM +0100, Peter Zijlstra wrote:
> Does this work?

Yes, that avoids the initial panic I reported.

Tested-by: Nathan Chancellor <nathan@kernel.org>

> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 51c29581f15e..defa86ed9b06 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -153,7 +153,21 @@ struct sched_domain {
>  
>  static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>  {
> -	return to_cpumask(sd->span);
> +	/*
> +	 * Because C is an absolutely broken piece of shit, it is allowed for
> +	 * offsetof(*sd, span) < sizeof(*sd), this means that structure
> +	 * initialization *sd = { ... }; which will clear every unmentioned
> +	 * member, can over-write the start of the flexible array member.
> +	 *
> +	 * Luckily, the way we allocate the flexible array is by:
> +	 *
> +	 *   sizeof(*sd) + count * sizeof(*sd->span)
> +	 *
> +	 * this means that we have sufficient space for the whole flex array
> +	 * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
> +	 */
> +	unsigned long *bitmap = (void *)sd + sizeof(*sd);
> +	return to_cpumask(bitmap);
>  }
>  
>  extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tip: sched/core] sched/topology: Compute sd_weight considering cpuset partitions
  2026-03-21 12:48                   ` Chen, Yu C
@ 2026-03-24  2:54                     ` K Prateek Nayak
  0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-24  2:54 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Peter Zijlstra, linux-tip-commits, linux-kernel, Shrikanth Hegde,
	Valentin Schneider, Dietmar Eggemann, x86, Nathan Chancellor

Hello Chenyu,

On 3/21/2026 6:18 PM, Chen, Yu C wrote:
>> And more evidence - by default we have:
>>
>>    sched_domain size: 296
>>    offset of sd_span: 292
>>
sizeof() seems to account for some sort of 4-byte tail padding in the
struct, which places the offset of sd->span inside the struct size.
>>
> 
> In your disassembly for *sd = {...}
> 
> mov     r2, #296
> mov     r0, fp
> mov     r1, #0
> ...
> bl memset  <-- oops!

Ah! I was not able to see this correctly on Saturday. Thank you for
pointing it out.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [tip: sched/core] sched/topology: Fix sched_domain_span()
  2026-03-23  9:36         ` Peter Zijlstra
                             ` (3 preceding siblings ...)
  2026-03-23 22:41           ` Nathan Chancellor
@ 2026-03-24  9:10           ` tip-bot2 for Peter Zijlstra
  4 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-03-24  9:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Nathan Chancellor, Peter Zijlstra (Intel), Jon Hunter, Chen Yu,
	K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e379dce8af11d8d6040b4348316a499bfd174bfb
Gitweb:        https://git.kernel.org/tip/e379dce8af11d8d6040b4348316a499bfd174bfb
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 23 Mar 2026 10:36:27 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Mar 2026 10:07:04 +01:00

sched/topology: Fix sched_domain_span()

Commit 8e8e23dea43e ("sched/topology: Compute sd_weight considering
cpuset partitions") ends up relying on the fact that structure
initialization should not touch the flexible array.

However, the official GCC specification for "Arrays of Length Zero"
[*] says:

  Although the size of a zero-length array is zero, an array member of
  this kind may increase the size of the enclosing type as a result of
  tail padding.

Additionally, structure initialization will zero tail padding, with the
end result that, since offsetof(*type, member) < sizeof(*type),
structure initialization will clobber the flex array.

Luckily, the way flexible array sizes are calculated is:

  sizeof(*type) + count * sizeof(*type->member)

This means we have the complete size of the flex array *outside* of
sizeof(*type), so use that instead of relying on the broken flex array
definition.

[*] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260323093627.GY3738010@noisy.programming.kicks-ass.net
---
 include/linux/sched/topology.h | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 51c2958..36553e1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -142,18 +142,30 @@ struct sched_domain {
 
 	unsigned int span_weight;
 	/*
-	 * Span of all CPUs in this domain.
+	 * See sched_domain_span(), on why flex arrays are broken.
 	 *
-	 * NOTE: this field is variable length. (Allocated dynamically
-	 * by attaching extra space to the end of the structure,
-	 * depending on how many CPUs the kernel has booted up with)
-	 */
 	unsigned long span[];
+	 */
 };
 
 static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 {
-	return to_cpumask(sd->span);
+	/*
+	 * Turns out that C flexible arrays are fundamentally broken since it
+	 * is allowed for offsetof(*sd, span) < sizeof(*sd), this means that
+	 * structure initialization *sd = { ... }; which writes every byte
+	 * inside sizeof(*type), will over-write the start of the flexible
+	 * array.
+	 *
+	 * Luckily, the way we allocate sched_domain is by:
+	 *
+	 *   sizeof(*sd) + cpumask_size()
+	 *
+	 * this means that we have sufficient space for the whole flex array
+	 * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
+	 */
+	unsigned long *bitmap = (void *)sd + sizeof(*sd);
+	return to_cpumask(bitmap);
 }
 
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

^ permalink raw reply related	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2026-03-24  9:10 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-12  4:44 [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
2026-03-12  9:34   ` Peter Zijlstra
2026-03-12  9:59     ` K Prateek Nayak
2026-03-12 10:01       ` Peter Zijlstra
2026-03-12 10:09         ` K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-20 23:58     ` Nathan Chancellor
2026-03-21  3:36       ` K Prateek Nayak
2026-03-21  7:33         ` Chen, Yu C
2026-03-21  7:47           ` Chen, Yu C
2026-03-21  8:59             ` K Prateek Nayak
2026-03-21  9:45               ` K Prateek Nayak
2026-03-21 10:13                 ` K Prateek Nayak
2026-03-21 12:48                   ` Chen, Yu C
2026-03-24  2:54                     ` K Prateek Nayak
2026-03-21 14:13                   ` Shrikanth Hegde
2026-03-21 15:14                     ` K Prateek Nayak
2026-03-21 16:38       ` [PATCH] sched/topology: Initialize sd_span after assignment to *sd K Prateek Nayak
2026-03-23  9:08         ` Shrikanth Hegde
2026-03-23 17:34           ` K Prateek Nayak
2026-03-23  9:36         ` Peter Zijlstra
2026-03-23 13:24           ` Jon Hunter
2026-03-23 15:36           ` Chen, Yu C
2026-03-23 17:24           ` K Prateek Nayak
2026-03-23 22:41           ` Nathan Chancellor
2026-03-24  9:10           ` [tip: sched/core] sched/topology: Fix sched_domain_span() tip-bot2 for Peter Zijlstra
2026-03-12  4:44 ` [PATCH v4 2/9] sched/topology: Extract "imb_numa_nr" calculation into a separate helper K Prateek Nayak
2026-03-12 13:37   ` kernel test robot
2026-03-12 15:42     ` K Prateek Nayak
2026-03-12 16:02       ` Peter Zijlstra
2026-03-16  0:18   ` Dietmar Eggemann
2026-03-16  3:41     ` K Prateek Nayak
2026-03-16  8:24       ` Dietmar Eggemann
2026-03-16  8:50         ` K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 3/9] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 4/9] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 5/9] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 6/9] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
2026-03-12  9:46   ` Peter Zijlstra
2026-03-12 10:06     ` K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 7/9] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
2026-03-15 23:36   ` Dietmar Eggemann
2026-03-16  3:19     ` K Prateek Nayak
2026-03-18  8:08     ` [tip: sched/core] PM: EM: Switch to rcu_dereference_all() in " tip-bot2 for Dietmar Eggemann
2026-03-18  8:08   ` [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() in the " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 8/9] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-12  4:44 ` [PATCH v4 9/9] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
2026-03-18  8:08   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-03-16  0:22 ` [PATCH v4 0/9] sched/topology: Optimize sd->shared allocation Dietmar Eggemann
