* [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-01-21 14:45 ` Chen, Yu C
` (2 more replies)
2026-01-20 11:32 ` [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
` (7 subsequent siblings)
8 siblings, 3 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
The "sd_weight" used for calculating the load balancing interval, and
its limits, considers the span weight of the entire topology level
without accounting for cpuset partitions.
Compute the "sd_weight" after computing the "sd_span" considering the
cpu_map covered by the partition, and set the load balancing interval,
and its limits accordingly.
Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o New patch.
---
kernel/sched/topology.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..649674bb6c3c 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1638,8 +1638,6 @@ sd_init(struct sched_domain_topology_level *tl,
int sd_id, sd_weight, sd_flags = 0;
struct cpumask *sd_span;
- sd_weight = cpumask_weight(tl->mask(tl, cpu));
-
if (tl->sd_flags)
sd_flags = (*tl->sd_flags)();
if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
@@ -1647,8 +1645,6 @@ sd_init(struct sched_domain_topology_level *tl,
sd_flags &= TOPOLOGY_SD_FLAGS;
*sd = (struct sched_domain){
- .min_interval = sd_weight,
- .max_interval = 2*sd_weight,
.busy_factor = 16,
.imbalance_pct = 117,
@@ -1668,7 +1664,6 @@ sd_init(struct sched_domain_topology_level *tl,
,
.last_balance = jiffies,
- .balance_interval = sd_weight,
/* 50% success rate */
.newidle_call = 512,
@@ -1685,6 +1680,11 @@ sd_init(struct sched_domain_topology_level *tl,
cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
sd_id = cpumask_first(sd_span);
+ sd_weight = cpumask_weight(sd_span);
+ sd->min_interval = sd_weight;
+ sd->max_interval = 2 * sd_weight;
+ sd->balance_interval = sd_weight;
+
sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
--
2.34.1
* Re: [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions
2026-01-20 11:32 ` [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
@ 2026-01-21 14:45 ` Chen, Yu C
2026-01-21 15:42 ` Shrikanth Hegde
2026-02-05 16:53 ` Valentin Schneider
2 siblings, 0 replies; 36+ messages in thread
From: Chen, Yu C @ 2026-01-21 14:45 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Shrikanth Hegde, Gautham R. Shenoy,
Ingo Molnar, Juri Lelli, Vincent Guittot, Peter Zijlstra,
linux-kernel
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> The "sd_weight" used for calculating the load balancing interval, and
> its limits, considers the span weight of the entire topology level
> without accounting for cpuset partitions.
>
> Compute the "sd_weight" after computing the "sd_span" considering the
> cpu_map covered by the partition, and set the load balancing interval,
> and its limits accordingly.
>
> Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
This not only fixes the issue for the periodic load balancer
but also for nohz balancing, because it gives a correct value for
nr_busy_cpus. So, from my understanding,
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
* Re: [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions
2026-01-20 11:32 ` [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
2026-01-21 14:45 ` Chen, Yu C
@ 2026-01-21 15:42 ` Shrikanth Hegde
2026-01-22 2:51 ` K Prateek Nayak
2026-02-05 16:53 ` Valentin Schneider
2 siblings, 1 reply; 36+ messages in thread
From: Shrikanth Hegde @ 2026-01-21 15:42 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Chen Yu, Gautham R. Shenoy
On 1/20/26 5:02 PM, K Prateek Nayak wrote:
> The "sd_weight" used for calculating the load balancing interval, and
> its limits, considers the span weight of the entire topology level
> without accounting for cpuset partitions.
>
Please add an example showing the wrong sd_weights
when cpuset partitions are in use. That would be helpful.
> Compute the "sd_weight" after computing the "sd_span" considering the
> cpu_map covered by the partition, and set the load balancing interval,
> and its limits accordingly.
>
> Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o New patch.
> ---
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
* Re: [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions
2026-01-21 15:42 ` Shrikanth Hegde
@ 2026-01-22 2:51 ` K Prateek Nayak
0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-22 2:51 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Chen Yu, Gautham R. Shenoy
Hello Shrikanth,
On 1/21/2026 9:12 PM, Shrikanth Hegde wrote:
>
>
> On 1/20/26 5:02 PM, K Prateek Nayak wrote:
>> The "sd_weight" used for calculating the load balancing interval, and
>> its limits, considers the span weight of the entire topology level
>> without accounting for cpuset partitions.
>>
>
> Please add an example showing the wrong sd_weights
> when cpuset partitions are in use. That would be helpful.
Ack! I'll update with an example in the next version.
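Something along these lines, perhaps (hypothetical topology where a
16-CPU domain is split into two 8-CPU cpuset partitions; numbers purely
for illustration):

  - tl->mask(tl, cpu) still spans all 16 CPUs, so sd_init() currently
    computes, for both partitions:

	sd_weight        = 16
	min_interval     = 16, max_interval = 2 * 16 = 32
	balance_interval = 16

  - The sd_span, after AND-ing with the partition's cpu_map, only
    covers 8 CPUs, so each partition's domain should instead get:

	sd_weight        = 8
	min_interval     = 8, max_interval = 16
	balance_interval = 8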
>
>> Compute the "sd_weight" after computing the "sd_span" considering the
>> cpu_map covered by the partition, and set the load balancing interval,
>> and its limits accordingly.
>>
>> Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> Changelog rfc v2..v3:
>>
>> o New patch.
>> ---
>
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Thanks a ton for the review.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions
2026-01-20 11:32 ` [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
2026-01-21 14:45 ` Chen, Yu C
2026-01-21 15:42 ` Shrikanth Hegde
@ 2026-02-05 16:53 ` Valentin Schneider
2 siblings, 0 replies; 36+ messages in thread
From: Valentin Schneider @ 2026-02-05 16:53 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
Shrikanth Hegde, Gautham R. Shenoy, K Prateek Nayak
On 20/01/26 11:32, K Prateek Nayak wrote:
> The "sd_weight" used for calculating the load balancing interval, and
> its limits, considers the span weight of the entire topology level
> without accounting for cpuset partitions.
>
> Compute the "sd_weight" after computing the "sd_span" considering the
> cpu_map covered by the partition, and set the load balancing interval,
> and its limits accordingly.
>
> Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
* [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
2026-01-20 11:32 ` [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-01-21 15:17 ` Chen, Yu C
2026-02-05 16:53 ` Valentin Schneider
2026-01-20 11:32 ` [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
` (6 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
The "sched_domain_shared" object is allocated for every topology level
in __sdt_alloc() and is freed post sched domain rebuild if it isn't
assigned during sd_init().
"sd->shared" is only assigned for SD_SHARE_LLC domains and out of all
the assigned objects, only "sd_llc_shared" is ever used by the
scheduler.
Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains
never overlap, allocate only a single range of per-CPU
"sched_domain_shared" object with s_data instead of doing it per
topology level.
The subsequent commit uses the degeneration path to correctly assign the
"sd->shared" to the topmost SD_SHARE_LLC domain.
No functional changes are expected at this point.
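As a rough illustration of the savings (hypothetical machine, numbers
purely for illustration): with 512 CPUs and 5 topology levels (SMT,
CLS, MC, PKG, NUMA), __sdt_alloc() currently allocates 512 * 5 = 2560
"sched_domain_shared" objects on every rebuild, out of which only the
ones attached as "sd->shared" on SD_SHARE_LLC domains survive
claim_allocations(); the rest are freed again. Allocating a single
per-CPU range in s_data caps this at 512 objects.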
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o Broke off from a single large patch. Previously
https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
---
kernel/sched/topology.c | 48 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 649674bb6c3c..623e8835d322 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -776,6 +776,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
}
struct s_data {
+ struct sched_domain_shared * __percpu *sds;
struct sched_domain * __percpu *sd;
struct root_domain *rd;
};
@@ -783,6 +784,7 @@ struct s_data {
enum s_alloc {
sa_rootdomain,
sa_sd,
+ sa_sd_shared,
sa_sd_storage,
sa_none,
};
@@ -1529,6 +1531,9 @@ static void set_domain_attribute(struct sched_domain *sd,
static void __sdt_free(const struct cpumask *cpu_map);
static int __sdt_alloc(const struct cpumask *cpu_map);
+static void __sds_free(struct s_data *d, const struct cpumask *cpu_map);
+static int __sds_alloc(struct s_data *d, const struct cpumask *cpu_map);
+
static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
const struct cpumask *cpu_map)
{
@@ -1540,6 +1545,9 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
case sa_sd:
free_percpu(d->sd);
fallthrough;
+ case sa_sd_shared:
+ __sds_free(d, cpu_map);
+ fallthrough;
case sa_sd_storage:
__sdt_free(cpu_map);
fallthrough;
@@ -1555,9 +1563,11 @@ __visit_domain_allocation_hell(struct s_data *d, const struct cpumask *cpu_map)
if (__sdt_alloc(cpu_map))
return sa_sd_storage;
+ if (__sds_alloc(d, cpu_map))
+ return sa_sd_shared;
d->sd = alloc_percpu(struct sched_domain *);
if (!d->sd)
- return sa_sd_storage;
+ return sa_sd_shared;
d->rd = alloc_rootdomain();
if (!d->rd)
return sa_sd;
@@ -2458,6 +2468,42 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}
+static int __sds_alloc(struct s_data *d, const struct cpumask *cpu_map)
+{
+ int j;
+
+ d->sds = alloc_percpu(struct sched_domain_shared *);
+ if (!d->sds)
+ return -ENOMEM;
+
+ for_each_cpu(j, cpu_map) {
+ struct sched_domain_shared *sds;
+
+ sds = kzalloc_node(sizeof(struct sched_domain_shared),
+ GFP_KERNEL, cpu_to_node(j));
+ if (!sds)
+ return -ENOMEM;
+
+ *per_cpu_ptr(d->sds, j) = sds;
+ }
+
+ return 0;
+}
+
+static void __sds_free(struct s_data *d, const struct cpumask *cpu_map)
+{
+ int j;
+
+ if (!d->sds)
+ return;
+
+ for_each_cpu(j, cpu_map)
+ kfree(*per_cpu_ptr(d->sds, j));
+
+ free_percpu(d->sds);
+ d->sds = NULL;
+}
+
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
struct sched_domain *child, int cpu)
--
2.34.1
* Re: [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data
2026-01-20 11:32 ` [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
@ 2026-01-21 15:17 ` Chen, Yu C
2026-02-05 16:53 ` Valentin Schneider
1 sibling, 0 replies; 36+ messages in thread
From: Chen, Yu C @ 2026-01-21 15:17 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Shrikanth Hegde, Gautham R. Shenoy,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel, Tim Chen
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> The "sched_domain_shared" object is allocated for every topology level
> in __sdt_alloc() and is freed post sched domain rebuild if it isn't
> assigned during sd_init().
>
> "sd->shared" is only assigned for SD_SHARE_LLC domains and out of all
> the assigned objects, only "sd_llc_shared" is ever used by the
> scheduler.
>
> Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains
> never overlap, allocate only a single range of per-CPU
> "sched_domain_shared" object with s_data instead of doing it per
> topology level.
>
> The subsequent commit uses the degeneration path to correctly assign the
> "sd->shared" to the topmost SD_SHARE_LLC domain.
>
> No functional changes are expected at this point.
>
The sched_domain_shared objects are allocated in s_data and then
immediately released, because *per_cpu_ptr(d->sds, j) is still
non-NULL when __sds_free() runs. I guess this is the reason why you
mentioned "No functional changes".
For the cluster domain that shares L2, I suppose we can bring
it back if there is a need in the future.
From my understanding,
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
* Re: [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data
2026-01-20 11:32 ` [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
2026-01-21 15:17 ` Chen, Yu C
@ 2026-02-05 16:53 ` Valentin Schneider
1 sibling, 0 replies; 36+ messages in thread
From: Valentin Schneider @ 2026-02-05 16:53 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
Shrikanth Hegde, Gautham R. Shenoy, K Prateek Nayak
On 20/01/26 11:32, K Prateek Nayak wrote:
> The "sched_domain_shared" object is allocated for every topology level
> in __sdt_alloc() and is freed post sched domain rebuild if it isn't
> assigned during sd_init().
>
> "sd->shared" is only assigned for SD_SHARE_LLC domains and out of all
> the assigned objects, only "sd_llc_shared" is ever used by the
> scheduler.
>
> Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains
> never overlap, allocate only a single range of per-CPU
> "sched_domain_shared" object with s_data instead of doing it per
> topology level.
>
> The subsequent commit uses the degeneration path to correctly assign the
> "sd->shared" to the topmost SD_SHARE_LLC domain.
>
> No functional changes are expected at this point.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
* [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
2026-01-20 11:32 ` [PATCH v3 1/8] sched/topology: Compute sd_weight considering cpuset partitions K Prateek Nayak
2026-01-20 11:32 ` [PATCH v3 2/8] sched/topology: Allocate per-CPU sched_domain_shared in s_data K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-01-21 15:26 ` Chen, Yu C
` (2 more replies)
2026-01-20 11:32 ` [PATCH v3 4/8] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
` (5 subsequent siblings)
8 siblings, 3 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
Use the "sched_domain_shared" object allocated in s_data for
"sd->shared" assignments. Assign "sd->shared" for the topmost
SD_SHARE_LLC domain before degeneration and rely on the degeneration
path to correctly pass down the shared object to "sd_llc".
sd_parent_degenerate() ensures degenerating domains must have the same
sched_domain_span() which ensures 1:1 passing down of the shared object.
If the topmost SD_SHARE_LLC domain degenerates, the shared object is
freed from destroy_sched_domain() when the last reference is dropped.
build_sched_domains() NULLs out the objects that have been assigned as
"sd->shared" and the unassigned ones are freed from the __sds_free()
path.
Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
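As an example (hypothetical topology, purely for illustration): if the
CLS and MC domains end up with the same span on some machine, the MC
domain is degenerated via sd_parent_degenerate(). If that MC domain was
the topmost SD_SHARE_LLC domain holding "sd->shared",
cpu_attach_domain() hands the reference down to CLS, so
"sd_llc_shared" still resolves to the same object after degeneration.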
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o Broke off from a single large patch. Previously
https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
---
kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 623e8835d322..0f56462fef6f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
+
+ /* If sd_llc exists, sd_llc_shared should exist too. */
+ WARN_ON_ONCE(!sd->shared);
sds = sd->shared;
}
@@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
if (sd_parent_degenerate(tmp, parent)) {
tmp->parent = parent->parent;
+ /* Pick reference to parent->shared. */
+ if (parent->shared) {
+ WARN_ON_ONCE(tmp->shared);
+ tmp->shared = parent->shared;
+ parent->shared = NULL;
+ }
+
if (parent->parent) {
parent->parent->child = tmp;
parent->parent->groups->flags = tmp->flags;
@@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
sd->cache_nice_tries = 1;
}
- /*
- * For all levels sharing cache; connect a sched_domain_shared
- * instance.
- */
- if (sd->flags & SD_SHARE_LLC) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
-
sd->private = sdd;
return sd;
@@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb_span = 1;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ struct sched_domain *parent = sd->parent;
struct sched_domain *child = sd->child;
+ /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) &&
+ (!parent || !(parent->flags & SD_SHARE_LLC))) {
+ int llc_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d.sds, llc_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+ }
+
if (!(sd->flags & SD_SHARE_LLC) && child &&
(child->flags & SD_SHARE_LLC)) {
struct sched_domain __rcu *top_p;
@@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (!cpumask_test_cpu(i, cpu_map))
continue;
+ if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
+ *per_cpu_ptr(d.sds, i) = NULL;
+
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
--
2.34.1
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-20 11:32 ` [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
@ 2026-01-21 15:26 ` Chen, Yu C
2026-01-22 2:49 ` K Prateek Nayak
2026-01-22 8:12 ` Shrikanth Hegde
2026-02-05 16:53 ` Valentin Schneider
2 siblings, 1 reply; 36+ messages in thread
From: Chen, Yu C @ 2026-01-21 15:26 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Shrikanth Hegde, Gautham R. Shenoy, Tim Chen,
linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> Use the "sched_domain_shared" object allocated in s_data for
> "sd->shared" assignments. Assign "sd->shared" for the topmost
> SD_SHARE_LLC domain before degeneration and rely on the degeneration
> path to correctly pass down the shared object to "sd_llc".
>
> sd_parent_degenerate() ensures degenerating domains must have the same
> sched_domain_span() which ensures 1:1 passing down of the shared object.
> If the topmost SD_SHARE_LLC domain degenerates, the shared object is
> freed from destroy_sched_domain() when the last reference is dropped.
>
> build_sched_domains() NULLs out the objects that have been assigned as
> "sd->shared" and the unassigned ones are freed from the __sds_free()
> path.
>
> Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
> call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o Broke off from a single large patch. Previously
> https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
> ---
> kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 623e8835d322..0f56462fef6f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> +
> + /* If sd_llc exists, sd_llc_shared should exist too. */
> + WARN_ON_ONCE(!sd->shared);
> sds = sd->shared;
> }
>
> @@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> if (sd_parent_degenerate(tmp, parent)) {
> tmp->parent = parent->parent;
>
> + /* Pick reference to parent->shared. */
> + if (parent->shared) {
> + WARN_ON_ONCE(tmp->shared);
> + tmp->shared = parent->shared;
> + parent->shared = NULL;
> + }
> +
> if (parent->parent) {
> parent->parent->child = tmp;
> parent->parent->groups->flags = tmp->flags;
> @@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 1;
> }
>
> - /*
> - * For all levels sharing cache; connect a sched_domain_shared
> - * instance.
> - */
> - if (sd->flags & SD_SHARE_LLC) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> -
> sd->private = sdd;
>
> return sd;
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
I agree that in the current implementation, we use llc_id = "first CPU"
to index into d.sds, and this value actually represents the LLC ID. In
the cache-aware scheduling work, we plan to convert the llc_id to a
logical ID that is no longer tied to the CPU number. Just my 2 cents:
to avoid confusion, maybe rename the aforementioned llc_id to sd_id?
Anyway I will run some tests on the entire patch set and provide feedback
afterward.
thanks,
Chenyu
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-21 15:26 ` Chen, Yu C
@ 2026-01-22 2:49 ` K Prateek Nayak
0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-22 2:49 UTC (permalink / raw)
To: Chen, Yu C
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Shrikanth Hegde, Gautham R. Shenoy, Tim Chen,
linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Hello Chenyu,
On 1/21/2026 8:56 PM, Chen, Yu C wrote:
>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> unsigned int imb_span = 1;
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> + struct sched_domain *parent = sd->parent;
>> struct sched_domain *child = sd->child;
>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>> + if ((sd->flags & SD_SHARE_LLC) &&
>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>> + int llc_id = cpumask_first(sched_domain_span(sd));
>> +
>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>
> I agree that in the current implementation, we use the llc_id="first CPU" to
> index into d.sds, and this value actually represents the LLC ID. In the
> cache-aware scheduling, we plan to convert the llc_id to a logical ID that
> is no longer tied to the CPU number. Just 2 cents, to avoid confusion, maybe
> rename the aforementioned llc_id to sd_id?
Ack! Will modify in the next version!
>
> Anyway I will run some tests on the entire patch set and provide feedback
> afterward.
Thanks a ton for taking a look at the series! Much appreciated.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-20 11:32 ` [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
2026-01-21 15:26 ` Chen, Yu C
@ 2026-01-22 8:12 ` Shrikanth Hegde
2026-01-22 8:36 ` K Prateek Nayak
2026-02-05 16:53 ` Valentin Schneider
2 siblings, 1 reply; 36+ messages in thread
From: Shrikanth Hegde @ 2026-01-22 8:12 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
On 1/20/26 5:02 PM, K Prateek Nayak wrote:
> Use the "sched_domain_shared" object allocated in s_data for
> "sd->shared" assignments. Assign "sd->shared" for the topmost
> SD_SHARE_LLC domain before degeneration and rely on the degeneration
> path to correctly pass down the shared object to "sd_llc".
>
> sd_parent_degenerate() ensures degenerating domains must have the same
> sched_domain_span() which ensures 1:1 passing down of the shared object.
> If the topmost SD_SHARE_LLC domain degenerates, the shared object is
> freed from destroy_sched_domain() when the last reference is dropped.
>
> build_sched_domains() NULLs out the objects that have been assigned as
> "sd->shared" and the unassigned ones are freed from the __sds_free()
> path.
>
> Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
> call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o Broke off from a single large patch. Previously
> https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
> ---
> kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 623e8835d322..0f56462fef6f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> +
> + /* If sd_llc exists, sd_llc_shared should exist too. */
> + WARN_ON_ONCE(!sd->shared);
> sds = sd->shared;
> }
>
> @@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> if (sd_parent_degenerate(tmp, parent)) {
> tmp->parent = parent->parent;
>
> + /* Pick reference to parent->shared. */
> + if (parent->shared) {
> + WARN_ON_ONCE(tmp->shared);
> + tmp->shared = parent->shared;
> + parent->shared = NULL;
> + }
> +
> if (parent->parent) {
> parent->parent->child = tmp;
> parent->parent->groups->flags = tmp->flags;
> @@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 1;
> }
>
> - /*
> - * For all levels sharing cache; connect a sched_domain_shared
> - * instance.
> - */
> - if (sd->flags & SD_SHARE_LLC) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> -
> sd->private = sdd;
>
> return sd;
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> + }
> +
> if (!(sd->flags & SD_SHARE_LLC) && child &&
> (child->flags & SD_SHARE_LLC)) {
> struct sched_domain __rcu *top_p;
> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (!cpumask_test_cpu(i, cpu_map))
> continue;
>
> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
> + *per_cpu_ptr(d.sds, i) = NULL;
> +
Can we do this in claim_allocations() only?
sdt_alloc and free are complicated already.
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> claim_allocations(i, sd);
> init_sched_groups_capacity(i, sd);
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-22 8:12 ` Shrikanth Hegde
@ 2026-01-22 8:36 ` K Prateek Nayak
2026-01-23 4:08 ` Shrikanth Hegde
0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-22 8:36 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
Hello Shrikanth,
On 1/22/2026 1:42 PM, Shrikanth Hegde wrote:
>> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> if (!cpumask_test_cpu(i, cpu_map))
>> continue;
>> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
>> + *per_cpu_ptr(d.sds, i) = NULL;
>> +
>
> Can we do this in claim_allocations() only?
I didn't do it there since we don't have a reference to the "s_data"
inside claim_allocations().
If I remember this right, only init_sched_groups_capacity() has the
requirement to traverse the CPUs in reverse to do
update_group_capacity() when we hit the first CPU in the group.
It doesn't modify the "->ref" of any allocations.
I can put the claim_allocations() bits in the previous loop and pass the
CPU and the s_data reference so it can free both the "d.sds" and all the
"d.sd" bits in one place, and retain this reverse loop for
init_sched_groups_capacity(). Does that sound better?
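Roughly something like this (an untested sketch; assumes
claim_allocations() grows an s_data parameter):

static void claim_allocations(int cpu, struct s_data *d, struct sched_domain *sd)
{
	struct sd_data *sdd = sd->private;
	struct sched_domain_shared *sds = *per_cpu_ptr(d->sds, cpu);

	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
	*per_cpu_ptr(sdd->sd, cpu) = NULL;

	/*
	 * Also claim the shared object from s_data. This runs once per
	 * domain level, so the slot may already have been cleared.
	 */
	if (sds && atomic_read(&sds->ref))
		*per_cpu_ptr(d->sds, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
		*per_cpu_ptr(sdd->sg, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
}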
> sdt_alloc and free are complicated already.
>
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> claim_allocations(i, sd);
>> init_sched_groups_capacity(i, sd);
>
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-22 8:36 ` K Prateek Nayak
@ 2026-01-23 4:08 ` Shrikanth Hegde
2026-01-23 4:53 ` K Prateek Nayak
0 siblings, 1 reply; 36+ messages in thread
From: Shrikanth Hegde @ 2026-01-23 4:08 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
On 1/22/26 2:06 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 1/22/2026 1:42 PM, Shrikanth Hegde wrote:
>>> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> if (!cpumask_test_cpu(i, cpu_map))
>>> continue;
>>> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
>>> + *per_cpu_ptr(d.sds, i) = NULL;
>>> +
>>
>> Can we do this in claim_allocations() only?
>
> I didn't do it there since we don't have a reference to the "s_data"
> inside claim_allocations().
>
> If I remember this right, only init_sched_groups_capacity() has the
> requirement to traverse the CPUs in reverse to do
> update_group_capacity() when we hit the first CPU in the group.
> It doesn't modify the "->ref" of any allocations.
>
> I can put the claim_allocations() bits in the previous loop and pass the
> CPU and the s_data reference so it can free both the "d.sds" and all the
> "d.sd" bits in one place, and retain this reverse loop for
> init_sched_groups_capacity(). Does that sound better?
>
Yes. IMO, having it in one place is better.
Even the next loop could be used to do that.
>> sdt_alloc and free are complicated already.
>>
>>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>>> claim_allocations(i, sd);
>>> init_sched_groups_capacity(i, sd);
>>
>
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-23 4:08 ` Shrikanth Hegde
@ 2026-01-23 4:53 ` K Prateek Nayak
0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-23 4:53 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
On 1/23/2026 9:38 AM, Shrikanth Hegde wrote:
>> I can put the claim_allocations() bits in the previous loop and pass the
>> CPU and the s_data reference so it can free both the "d.sds" and all the
>> "d.sd" bits in one place, and retain this reverse loop for
>> init_sched_groups_capacity(). Does that sound better?
>>
>
> Yes. IMO, having it in one place is better.
> Even the next loop could be used to do that.
Ack! Will change accordingly in the next version. Thank you again for
the suggestion.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-01-20 11:32 ` [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
2026-01-21 15:26 ` Chen, Yu C
2026-01-22 8:12 ` Shrikanth Hegde
@ 2026-02-05 16:53 ` Valentin Schneider
2026-02-06 5:20 ` K Prateek Nayak
2 siblings, 1 reply; 36+ messages in thread
From: Valentin Schneider @ 2026-02-05 16:53 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
Shrikanth Hegde, Gautham R. Shenoy, K Prateek Nayak
On 20/01/26 11:32, K Prateek Nayak wrote:
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> + }
> +
We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
but given this needs to write the sd->imb_numa_nr for every SD I couldn't
factorize this into something that looked sane :(
> if (!(sd->flags & SD_SHARE_LLC) && child &&
> (child->flags & SD_SHARE_LLC)) {
> struct sched_domain __rcu *top_p;
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-02-05 16:53 ` Valentin Schneider
@ 2026-02-06 5:20 ` K Prateek Nayak
2026-02-06 9:38 ` Valentin Schneider
2026-02-14 2:59 ` Chen, Yu C
0 siblings, 2 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-02-06 5:20 UTC (permalink / raw)
To: Valentin Schneider, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
Shrikanth Hegde, Gautham R. Shenoy
Hello Valentin,
On 2/5/2026 10:23 PM, Valentin Schneider wrote:
> On 20/01/26 11:32, K Prateek Nayak wrote:
>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> unsigned int imb_span = 1;
>>
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> + struct sched_domain *parent = sd->parent;
>> struct sched_domain *child = sd->child;
>>
>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>> + if ((sd->flags & SD_SHARE_LLC) &&
>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>> + int llc_id = cpumask_first(sched_domain_span(sd));
>> +
>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>> + atomic_inc(&sd->shared->ref);
>> + }
>> +
>
> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
> factorize this into something that looked sane :(
Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
it, and the "sd->shared" assignment cares about when we are *at* the sd_llc.
Since we have to assign the "sd->shared" before claim_allocations(),
I couldn't find a better spot to assign it.
That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
and its "parent". I'll let you be the judge of whether the following is
better or worse ;-)
(Only build tested)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ac268da91778..e98bb812de35 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb_span = 1;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- struct sched_domain *child = sd->child;
+ struct sched_domain *parent = sd->parent;
- if (!(sd->flags & SD_SHARE_LLC) && child &&
- (child->flags & SD_SHARE_LLC)) {
- struct sched_domain __rcu *top_p;
+ /* Topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) &&
+ (!parent || !(parent->flags & SD_SHARE_LLC))) {
+ int sd_id = cpumask_first(sched_domain_span(sd));
+ struct sched_domain *top_p;
unsigned int nr_llcs;
+ sd->shared = *per_cpu_ptr(d.sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+
+ /* No SD_NUMA domains. */
+ if (!parent)
+ break;
+
/*
* For a single LLC per node, allow an
* imbalance up to 12.5% of the node. This is
@@ -2641,7 +2651,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* factors and that there is a correlation
* between LLCs and memory channels.
*/
- nr_llcs = sd->span_weight / child->span_weight;
+ nr_llcs = parent->span_weight / sd->span_weight;
if (nr_llcs == 1)
imb = sd->span_weight >> 3;
else
@@ -2650,11 +2660,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
sd->imb_numa_nr = imb;
/* Set span based on the first NUMA domain. */
- top_p = sd->parent;
+ top_p = parent;
while (top_p && !(top_p->flags & SD_NUMA)) {
top_p = top_p->parent;
}
- imb_span = top_p ? top_p->span_weight : sd->span_weight;
+ imb_span = top_p ? top_p->span_weight : parent->span_weight;
} else {
int factor = max(1U, (sd->span_weight / imb_span));
---
>
>> if (!(sd->flags & SD_SHARE_LLC) && child &&
>> (child->flags & SD_SHARE_LLC)) {
>> struct sched_domain __rcu *top_p;
>
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-02-06 5:20 ` K Prateek Nayak
@ 2026-02-06 9:38 ` Valentin Schneider
2026-02-14 3:04 ` Chen, Yu C
2026-02-14 2:59 ` Chen, Yu C
1 sibling, 1 reply; 36+ messages in thread
From: Valentin Schneider @ 2026-02-06 9:38 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
Shrikanth Hegde, Gautham R. Shenoy
On 06/02/26 10:50, K Prateek Nayak wrote:
> Hello Valentin,
>
> On 2/5/2026 10:23 PM, Valentin Schneider wrote:
>> On 20/01/26 11:32, K Prateek Nayak wrote:
>>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> unsigned int imb_span = 1;
>>>
>>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>>> + struct sched_domain *parent = sd->parent;
>>> struct sched_domain *child = sd->child;
>>>
>>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>>> + if ((sd->flags & SD_SHARE_LLC) &&
>>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>>> + int llc_id = cpumask_first(sched_domain_span(sd));
>>> +
>>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>>> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>>> + atomic_inc(&sd->shared->ref);
>>> + }
>>> +
>>
>> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
>> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
>> factorize this into something that looked sane :(
>
> Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
> it and "sd->shared" assignment cares when we are *at* the sd_llc.
>
> Since we have to assign the "sd->shared" before claim_allocations(),
> I couldn't find a better spot to assign it.
>
> That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
> and its "parent". I'll let you be the judge of whether the following is
> better or worse ;-)
>
> (Only build tested)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ac268da91778..e98bb812de35 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> - struct sched_domain *child = sd->child;
> + struct sched_domain *parent = sd->parent;
>
> - if (!(sd->flags & SD_SHARE_LLC) && child &&
> - (child->flags & SD_SHARE_LLC)) {
> - struct sched_domain __rcu *top_p;
> + /* Topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int sd_id = cpumask_first(sched_domain_span(sd));
> + struct sched_domain *top_p;
> unsigned int nr_llcs;
>
> + sd->shared = *per_cpu_ptr(d.sds, sd_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +
> + /* No SD_NUMA domains. */
> + if (!parent)
> + break;
> +
AIUI we currently write sd->imb_numa_nr for all SD's, but it's only useful
for the SD_NUMA ones... How about the lightly tested:
---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index afb2c26efb4e5..03db45658f6bd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2575,6 +2575,51 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
return true;
}
+static inline void adjust_numa_imbalance(struct sched_domain *sd_llc,
+ unsigned int *imb, unsigned int *imb_span)
+{
+ /*
+ * For a single LLC per node, allow an
+ * imbalance up to 12.5% of the node. This is
+ * arbitrary cutoff based two factors -- SMT and
+ * memory channels. For SMT-2, the intent is to
+ * avoid premature sharing of HT resources but
+ * SMT-4 or SMT-8 *may* benefit from a different
+ * cutoff. For memory channels, this is a very
+ * rough estimate of how many channels may be
+ * active and is based on recent CPUs with
+ * many cores.
+ *
+ * For multiple LLCs, allow an imbalance
+ * until multiple tasks would share an LLC
+ * on one node while LLCs on another node
+ * remain idle. This assumes that there are
+ * enough logical CPUs per LLC to avoid SMT
+ * factors and that there is a correlation
+ * between LLCs and memory channels.
+ */
+ struct sched_domain *top_p;
+ unsigned int nr_llcs;
+
+ WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
+ WARN_ON(!sd_llc->parent);
+
+ nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
+ if (nr_llcs == 1)
+ *imb = sd_llc->parent->span_weight >> 3;
+ else
+ *imb = nr_llcs;
+ *imb = max(1U, *imb);
+ sd_llc->parent->imb_numa_nr = *imb;
+
+ /* Set span based on the first NUMA domain. */
+ top_p = sd_llc->parent->parent;
+ while (top_p && !(top_p->flags & SD_NUMA)) {
+ top_p = top_p->parent;
+ }
+ *imb_span = top_p ? top_p->span_weight : sd_llc->parent->span_weight;
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2640,63 +2685,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb = 0;
unsigned int imb_span = 1;
- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- struct sched_domain *parent = sd->parent;
-
- /* Topmost SD_SHARE_LLC domain. */
- if ((sd->flags & SD_SHARE_LLC) &&
- (!parent || !(parent->flags & SD_SHARE_LLC))) {
- int sd_id = cpumask_first(sched_domain_span(sd));
- struct sched_domain *top_p;
- unsigned int nr_llcs;
-
- sd->shared = *per_cpu_ptr(d.sds, sd_id);
- atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
- atomic_inc(&sd->shared->ref);
-
- /* No SD_NUMA domains. */
- if (!parent)
- break;
-
- /*
- * For a single LLC per node, allow an
- * imbalance up to 12.5% of the node. This is
- * arbitrary cutoff based two factors -- SMT and
- * memory channels. For SMT-2, the intent is to
- * avoid premature sharing of HT resources but
- * SMT-4 or SMT-8 *may* benefit from a different
- * cutoff. For memory channels, this is a very
- * rough estimate of how many channels may be
- * active and is based on recent CPUs with
- * many cores.
- *
- * For multiple LLCs, allow an imbalance
- * until multiple tasks would share an LLC
- * on one node while LLCs on another node
- * remain idle. This assumes that there are
- * enough logical CPUs per LLC to avoid SMT
- * factors and that there is a correlation
- * between LLCs and memory channels.
- */
- nr_llcs = parent->span_weight / sd->span_weight;
- if (nr_llcs == 1)
- imb = sd->span_weight >> 3;
- else
- imb = nr_llcs;
- imb = max(1U, imb);
- sd->imb_numa_nr = imb;
-
- /* Set span based on the first NUMA domain. */
- top_p = parent;
- while (top_p && !(top_p->flags & SD_NUMA)) {
- top_p = top_p->parent;
- }
- imb_span = top_p ? top_p->span_weight : parent->span_weight;
- } else {
- int factor = max(1U, (sd->span_weight / imb_span));
+ sd = *per_cpu_ptr(d.sd, i);
+ /* First, find the topmost SD_SHARE_LLC domain */
+ while (sd && sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+ sd = sd->parent;
- sd->imb_numa_nr = imb * factor;
- }
+ if (sd->flags & SD_SHARE_LLC) {
+ int sd_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d.sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+ }
+
+ /* Special case the first parent of the topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) && sd->parent) {
+ adjust_numa_imbalance(sd, &imb, &imb_span);
+ sd = sd->parent->parent;
+ }
+
+ /* Update the upper remainder of the topology */
+ while (sd) {
+ int factor = max(1U, (sd->span_weight / imb_span));
+ sd->imb_numa_nr = imb * factor;
+ sd = sd->parent;
}
}
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-02-06 9:38 ` Valentin Schneider
@ 2026-02-14 3:04 ` Chen, Yu C
2026-02-16 3:50 ` K Prateek Nayak
0 siblings, 1 reply; 36+ messages in thread
From: Chen, Yu C @ 2026-02-14 3:04 UTC (permalink / raw)
To: Valentin Schneider
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Shrikanth Hegde, Gautham R. Shenoy, K Prateek Nayak, Ingo Molnar,
Juri Lelli, Vincent Guittot, Peter Zijlstra, linux-kernel
On 2/6/2026 5:38 PM, Valentin Schneider wrote:
> On 06/02/26 10:50, K Prateek Nayak wrote:
>> Hello Valentin,
>>
>> On 2/5/2026 10:23 PM, Valentin Schneider wrote:
>>> On 20/01/26 11:32, K Prateek Nayak wrote:
[ ... ]
>
> AIUI we currently write sd->imb_numa_nr for all SD's, but it's only useful
> for the SD_NUMA ones... How about the lightly tested:
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index afb2c26efb4e5..03db45658f6bd 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2575,6 +2575,51 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> return true;
> }
>
> +static inline void adjust_numa_imbalance(struct sched_domain *sd_llc,
> + unsigned int *imb, unsigned int *imb_span)
> +{
> + /*
> + * For a single LLC per node, allow an
> + * imbalance up to 12.5% of the node. This is
> + * arbitrary cutoff based two factors -- SMT and
> + * memory channels. For SMT-2, the intent is to
> + * avoid premature sharing of HT resources but
> + * SMT-4 or SMT-8 *may* benefit from a different
> + * cutoff. For memory channels, this is a very
> + * rough estimate of how many channels may be
> + * active and is based on recent CPUs with
> + * many cores.
> + *
> + * For multiple LLCs, allow an imbalance
> + * until multiple tasks would share an LLC
> + * on one node while LLCs on another node
> + * remain idle. This assumes that there are
> + * enough logical CPUs per LLC to avoid SMT
> + * factors and that there is a correlation
> + * between LLCs and memory channels.
> + */
> + struct sched_domain *top_p;
> + unsigned int nr_llcs;
> +
> + WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
> + WARN_ON(!sd_llc->parent);
> +
> + nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
> + if (nr_llcs == 1)
> + *imb = sd_llc->parent->span_weight >> 3;
> + else
> + *imb = nr_llcs;
> + *imb = max(1U, *imb);
> + sd_llc->parent->imb_numa_nr = *imb;
> +
> + /* Set span based on the first NUMA domain. */
> + top_p = sd_llc->parent->parent;
> + while (top_p && !(top_p->flags & SD_NUMA)) {
> + top_p = top_p->parent;
> + }
> + *imb_span = top_p ? top_p->span_weight : sd_llc->parent->span_weight;
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2640,63 +2685,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb = 0;
> unsigned int imb_span = 1;
>
> - for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> - struct sched_domain *parent = sd->parent;
> -
> - /* Topmost SD_SHARE_LLC domain. */
> - if ((sd->flags & SD_SHARE_LLC) &&
> - (!parent || !(parent->flags & SD_SHARE_LLC))) {
> - int sd_id = cpumask_first(sched_domain_span(sd));
> - struct sched_domain *top_p;
> - unsigned int nr_llcs;
> -
> - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> - atomic_inc(&sd->shared->ref);
> -
> - /* No SD_NUMA domains. */
> - if (!parent)
> - break;
> -
> - /*
> - * For a single LLC per node, allow an
> - * imbalance up to 12.5% of the node. This is
> - * arbitrary cutoff based two factors -- SMT and
> - * memory channels. For SMT-2, the intent is to
> - * avoid premature sharing of HT resources but
> - * SMT-4 or SMT-8 *may* benefit from a different
> - * cutoff. For memory channels, this is a very
> - * rough estimate of how many channels may be
> - * active and is based on recent CPUs with
> - * many cores.
> - *
> - * For multiple LLCs, allow an imbalance
> - * until multiple tasks would share an LLC
> - * on one node while LLCs on another node
> - * remain idle. This assumes that there are
> - * enough logical CPUs per LLC to avoid SMT
> - * factors and that there is a correlation
> - * between LLCs and memory channels.
> - */
> - nr_llcs = parent->span_weight / sd->span_weight;
> - if (nr_llcs == 1)
> - imb = sd->span_weight >> 3;
> - else
> - imb = nr_llcs;
> - imb = max(1U, imb);
> - sd->imb_numa_nr = imb;
> -
> - /* Set span based on the first NUMA domain. */
> - top_p = parent;
> - while (top_p && !(top_p->flags & SD_NUMA)) {
> - top_p = top_p->parent;
> - }
> - imb_span = top_p ? top_p->span_weight : parent->span_weight;
> - } else {
> - int factor = max(1U, (sd->span_weight / imb_span));
> + sd = *per_cpu_ptr(d.sd, i);
might be:

	if (!sd)
		continue;

otherwise the sd->flags access below might cause a NULL pointer
exception.
thanks,
Chenyu
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-02-14 3:04 ` Chen, Yu C
@ 2026-02-16 3:50 ` K Prateek Nayak
0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-02-16 3:50 UTC (permalink / raw)
To: Chen, Yu C, Valentin Schneider
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Shrikanth Hegde, Gautham R. Shenoy, Ingo Molnar, Juri Lelli,
Vincent Guittot, Peter Zijlstra, linux-kernel
Hello Chenyu,
On 2/14/2026 8:34 AM, Chen, Yu C wrote:
>> + sd = *per_cpu_ptr(d.sd, i);
>
> might be
> if (!sd)
> continue;
> otherwise sd->flags below might cause NULL pointer exception.
Ack! I'll add in that check in the next version. Thank you for catching
that.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data
2026-02-06 5:20 ` K Prateek Nayak
2026-02-06 9:38 ` Valentin Schneider
@ 2026-02-14 2:59 ` Chen, Yu C
1 sibling, 0 replies; 36+ messages in thread
From: Chen, Yu C @ 2026-02-14 2:59 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Valentin Schneider, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Shrikanth Hegde, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Vincent Guittot, Juri Lelli, linux-kernel
On 2/6/2026 1:20 PM, K Prateek Nayak wrote:
> Hello Valentin,
>
> On 2/5/2026 10:23 PM, Valentin Schneider wrote:
>> On 20/01/26 11:32, K Prateek Nayak wrote:
[...]
>>
>> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
>> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
>> factorize this into something that looked sane :(
>
> Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
> it and "sd->shared" assignment cares when we are *at* the sd_llc.
>
> Since we have to assign the "sd->shared" before claim_allocations(),
> I couldn't find a better spot to assign it.
>
> That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
> and its "parent". I'll let you be the judge of whether the following is
> better or worse ;-)
>
> (Only build tested)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ac268da91778..e98bb812de35 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> - struct sched_domain *child = sd->child;
> + struct sched_domain *parent = sd->parent;
>
> - if (!(sd->flags & SD_SHARE_LLC) && child &&
> - (child->flags & SD_SHARE_LLC)) {
> - struct sched_domain __rcu *top_p;
> + /* Topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int sd_id = cpumask_first(sched_domain_span(sd));
> + struct sched_domain *top_p;
> unsigned int nr_llcs;
>
> + sd->shared = *per_cpu_ptr(d.sds, sd_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +
> + /* No SD_NUMA domains. */
> + if (!parent)
> + break;
> +
> /*
> * For a single LLC per node, allow an
> * imbalance up to 12.5% of the node. This is
> @@ -2641,7 +2651,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> * factors and that there is a correlation
> * between LLCs and memory channels.
> */
> - nr_llcs = sd->span_weight / child->span_weight;
> + nr_llcs = parent->span_weight / sd->span_weight;
> if (nr_llcs == 1)
> imb = sd->span_weight >> 3;
> else
> @@ -2650,11 +2660,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> sd->imb_numa_nr = imb;
>
> /* Set span based on the first NUMA domain. */
> - top_p = sd->parent;
> + top_p = parent;
maybe this should be top_p = parent->parent? With "sd" now being the
topmost SD_SHARE_LLC domain, "parent" plays the role of the old "sd",
so the original top_p = sd->parent translates to parent->parent.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v3 4/8] sched/topology: Remove sched_domain_shared allocation with sd_data
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
` (2 preceding siblings ...)
2026-01-20 11:32 ` [PATCH v3 3/8] sched/topology: Switch to assigning "sd->shared" from s_data K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-02-05 16:53 ` Valentin Schneider
2026-01-20 11:32 ` [PATCH v3 5/8] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
` (4 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
Now that "sd->shared" assignments are using the sched_domain_shared
objects allocated with s_data, remove the sd_data based allocations.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o Broke off from a single large patch. Previously
https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
---
include/linux/sched/topology.h | 1 -
kernel/sched/topology.c | 19 -------------------
2 files changed, 20 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..fc3d89160513 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -171,7 +171,6 @@ typedef int (*sched_domain_flags_f)(void);
struct sd_data {
struct sched_domain *__percpu *sd;
- struct sched_domain_shared *__percpu *sds;
struct sched_group *__percpu *sg;
struct sched_group_capacity *__percpu *sgc;
};
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 0f56462fef6f..cba91f20b4e0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1597,9 +1597,6 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
*per_cpu_ptr(sdd->sd, cpu) = NULL;
- if (atomic_read(&(*per_cpu_ptr(sdd->sds, cpu))->ref))
- *per_cpu_ptr(sdd->sds, cpu) = NULL;
-
if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
*per_cpu_ptr(sdd->sg, cpu) = NULL;
@@ -2377,10 +2374,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sdd->sd)
return -ENOMEM;
- sdd->sds = alloc_percpu(struct sched_domain_shared *);
- if (!sdd->sds)
- return -ENOMEM;
-
sdd->sg = alloc_percpu(struct sched_group *);
if (!sdd->sg)
return -ENOMEM;
@@ -2391,7 +2384,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
for_each_cpu(j, cpu_map) {
struct sched_domain *sd;
- struct sched_domain_shared *sds;
struct sched_group *sg;
struct sched_group_capacity *sgc;
@@ -2402,13 +2394,6 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sd, j) = sd;
- sds = kzalloc_node(sizeof(struct sched_domain_shared),
- GFP_KERNEL, cpu_to_node(j));
- if (!sds)
- return -ENOMEM;
-
- *per_cpu_ptr(sdd->sds, j) = sds;
-
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sg)
@@ -2450,8 +2435,6 @@ static void __sdt_free(const struct cpumask *cpu_map)
kfree(*per_cpu_ptr(sdd->sd, j));
}
- if (sdd->sds)
- kfree(*per_cpu_ptr(sdd->sds, j));
if (sdd->sg)
kfree(*per_cpu_ptr(sdd->sg, j));
if (sdd->sgc)
@@ -2459,8 +2442,6 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
free_percpu(sdd->sd);
sdd->sd = NULL;
- free_percpu(sdd->sds);
- sdd->sds = NULL;
free_percpu(sdd->sg);
sdd->sg = NULL;
free_percpu(sdd->sgc);
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/8] sched/topology: Remove sched_domain_shared allocation with sd_data
2026-01-20 11:32 ` [PATCH v3 4/8] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
@ 2026-02-05 16:53 ` Valentin Schneider
0 siblings, 0 replies; 36+ messages in thread
From: Valentin Schneider @ 2026-02-05 16:53 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Chen Yu,
Shrikanth Hegde, Gautham R. Shenoy, K Prateek Nayak
On 20/01/26 11:32, K Prateek Nayak wrote:
> Now that "sd->shared" assignments are using the sched_domain_shared
> objects allocated with s_data, remove the sd_data based allocations.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v3 5/8] sched/core: Check for rcu_read_lock_any_held() in idle_get_state()
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
` (3 preceding siblings ...)
2026-01-20 11:32 ` [PATCH v3 4/8] sched/topology: Remove sched_domain_shared allocation with sd_data K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-01-20 11:32 ` [PATCH v3 6/8] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
` (3 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
Similar to commit 71fedc41c23b ("sched/fair: Switch to
rcu_dereference_all()"), switch to checking for rcu_read_lock_any_held()
in idle_get_state() to allow removing superfluous rcu_read_lock()
regions in the fair task's wakeup path where the pi_lock is held and
IRQs are disabled.
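For reference, rcu_read_lock_any_held() (paraphrased here from
kernel/rcu/update.c for the CONFIG_DEBUG_LOCK_ALLOC case) also accepts
contexts that are merely non-preemptible:

	int rcu_read_lock_any_held(void)
	{
		if (rcu_read_lock_held())
			return 1;
		if (rcu_read_lock_bh_held())
			return 1;
		if (rcu_read_lock_sched_held())
			return 1;
		/* true with preemption or IRQs disabled */
		return !preemptible();
	}

With IRQs disabled, !preemptible() holds, so the WARN_ON_ONCE() stays
quiet in the wakeup path.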
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o New patch.
---
kernel/sched/sched.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58c9d244f12b..14fc9fca2502 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2783,7 +2783,7 @@ static inline void idle_set_state(struct rq *rq,
static inline struct cpuidle_state *idle_get_state(struct rq *rq)
{
- WARN_ON_ONCE(!rcu_read_lock_held());
+ WARN_ON_ONCE(!rcu_read_lock_any_held());
return rq->idle_state;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v3 6/8] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
` (4 preceding siblings ...)
2026-01-20 11:32 ` [PATCH v3 5/8] sched/core: Check for rcu_read_lock_any_held() in idle_get_state() K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-01-20 11:32 ` [PATCH v3 7/8] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
` (2 subsequent siblings)
8 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
select_task_rq_fair() is always called with p->pi_lock held and IRQs
disabled, which makes it the equivalent of an RCU read-side critical
section.
Since commit 71fedc41c23b ("sched/fair: Switch to
rcu_dereference_all()") switched to using rcu_dereference_all() in the
wakeup path, drop the explicit rcu_read_{lock,unlock}() in the fair
task's wakeup path.
Future plans to reuse select_task_rq_fair() /
find_energy_efficient_cpu() in the fair class' balance callback will
also run with IRQs disabled and will therefore comply with the
requirements of rcu_dereference_all(), keeping this change safe for
that planned development as well.
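For reference, the wakeup-side call site looks roughly like this
(paraphrased from try_to_wake_up(); details elided):

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	/* ... */
	cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
	/* ... */
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

IRQs remain disabled across select_task_rq_fair(), which is exactly the
condition rcu_dereference_all() checks for.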
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o New patch.
---
kernel/sched/fair.c | 33 ++++++++++++---------------------
1 file changed, 12 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04993c763a06..e4f208c44916 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8323,10 +8323,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
struct perf_domain *pd;
struct energy_env eenv;
- rcu_read_lock();
pd = rcu_dereference_all(rd->pd);
if (!pd)
- goto unlock;
+ return target;
/*
* Energy-aware wake-up happens on the lowest sched_domain starting
@@ -8336,13 +8335,13 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
sd = sd->parent;
if (!sd)
- goto unlock;
+ return target;
target = prev_cpu;
sync_entity_load_avg(&p->se);
if (!task_util_est(p) && p_util_min == 0)
- goto unlock;
+ return target;
eenv_task_busy_time(&eenv, p, prev_cpu);
@@ -8437,7 +8436,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
prev_cpu);
/* CPU utilization has changed */
if (prev_delta < base_energy)
- goto unlock;
+ return target;
prev_delta -= base_energy;
prev_actual_cap = cpu_actual_cap;
best_delta = min(best_delta, prev_delta);
@@ -8461,7 +8460,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
max_spare_cap_cpu);
/* CPU utilization has changed */
if (cur_delta < base_energy)
- goto unlock;
+ return target;
cur_delta -= base_energy;
/*
@@ -8478,7 +8477,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
best_actual_cap = cpu_actual_cap;
}
}
- rcu_read_unlock();
if ((best_fits > prev_fits) ||
((best_fits > 0) && (best_delta < prev_delta)) ||
@@ -8486,11 +8484,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = best_energy_cpu;
return target;
-
-unlock:
- rcu_read_unlock();
-
- return target;
}
/*
@@ -8535,7 +8528,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
}
- rcu_read_lock();
for_each_domain(cpu, tmp) {
/*
* If both 'cpu' and 'prev_cpu' are part of this domain,
@@ -8561,14 +8553,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
break;
}
- if (unlikely(sd)) {
- /* Slow path */
- new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
- } else if (wake_flags & WF_TTWU) { /* XXX always ? */
- /* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
- }
- rcu_read_unlock();
+ /* Slow path */
+ if (unlikely(sd))
+ return sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
+
+ /* Fast path */
+ if (wake_flags & WF_TTWU)
+ return select_idle_sibling(p, prev_cpu, new_cpu);
return new_cpu;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v3 7/8] sched/fair: Simplify the entry condition for update_idle_cpu_scan()
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
` (5 preceding siblings ...)
2026-01-20 11:32 ` [PATCH v3 6/8] sched/fair: Remove superfluous rcu_read_lock() in the wakeup path K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-02-14 15:41 ` Chen, Yu C
2026-01-20 11:32 ` [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
2026-01-21 16:16 ` [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation Peter Zijlstra
8 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
Only the topmost SD_SHARE_LLC domain has "sd->shared" assigned.
Simply use "sd->shared" as an indicator for load balancing at the highest
SD_SHARE_LLC domain in update_idle_cpu_scan() instead of relying on
llc_size.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o No changes.
---
kernel/sched/fair.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4f208c44916..c308c0700a7f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10996,6 +10996,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
unsigned long sum_util)
{
struct sched_domain_shared *sd_share;
+ struct sched_domain *sd = env->sd;
int llc_weight, pct;
u64 x, y, tmp;
/*
@@ -11009,11 +11010,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
return;
- llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
- if (env->sd->span_weight != llc_weight)
- return;
-
- sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, env->dst_cpu));
+ sd_share = sd->shared;
if (!sd_share)
return;
@@ -11047,10 +11044,11 @@ static void update_idle_cpu_scan(struct lb_env *env,
*/
/* equation [3] */
x = sum_util;
+ llc_weight = sd->span_weight;
do_div(x, llc_weight);
/* equation [4] */
- pct = env->sd->imbalance_pct;
+ pct = sd->imbalance_pct;
tmp = x * x * pct * pct;
do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v3 7/8] sched/fair: Simplify the entry condition for update_idle_cpu_scan()
2026-01-20 11:32 ` [PATCH v3 7/8] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
@ 2026-02-14 15:41 ` Chen, Yu C
0 siblings, 0 replies; 36+ messages in thread
From: Chen, Yu C @ 2026-02-14 15:41 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Shrikanth Hegde, Gautham R. Shenoy,
Ingo Molnar, Peter Zijlstra, Vincent Guittot, Juri Lelli,
linux-kernel
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> Only the topmost SD_SHARE_LLC domain has the "sd->shared" assigned.
> Simply use "sd->shared" as an indicator for load balancing at the highest
> SD_SHARE_LLC domain in update_idle_cpu_scan() instead of relying on
> llc_size.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
` (6 preceding siblings ...)
2026-01-20 11:32 ` [PATCH v3 7/8] sched/fair: Simplify the entry condition for update_idle_cpu_scan() K Prateek Nayak
@ 2026-01-20 11:32 ` K Prateek Nayak
2026-01-23 6:06 ` Shrikanth Hegde
2026-02-14 15:56 ` Chen, Yu C
2026-01-21 16:16 ` [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation Peter Zijlstra
8 siblings, 2 replies; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-20 11:32 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy,
K Prateek Nayak
Use the "sd_llc" passed to select_idle_cpu() to obtain the
"sd_llc_shared" instead of dereferencing the per-CPU variable.
Since "sd->shared" is always reclaimed at the same time as "sd" via
call_rcu() and update_top_cache_domain() always ensures a valid
"sd->shared" assignment when "sd_llc" is present, "sd_llc->shared" can
always be dereferenced without needing an additional check.
While at it move the cpumask_and() operation after the SIS_UTIL bailout
check to avoid unnecessarily computing the cpumask.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o No changes to the diff. Added more details on directly dereferencing
"sd->shared" without a NULL check in the commit message.
---
kernel/sched/fair.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c308c0700a7f..b4ae9444d32f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7629,21 +7629,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- struct sched_domain_shared *sd_share;
-
- cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
if (sched_feat(SIS_UTIL)) {
- sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target));
- if (sd_share) {
- /* because !--nr is the condition to stop scan */
- nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
- /* overloaded LLC is unlikely to have idle cpu/core */
- if (nr == 1)
- return -1;
- }
+ /* because !--nr is the condition to stop scan */
+ nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+ /* overloaded LLC is unlikely to have idle cpu/core */
+ if (nr == 1)
+ return -1;
}
+ if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr))
+ return -1;
+
if (static_branch_unlikely(&sched_cluster_active)) {
struct sched_group *sg = sd->groups;
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
2026-01-20 11:32 ` [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
@ 2026-01-23 6:06 ` Shrikanth Hegde
2026-01-23 6:27 ` K Prateek Nayak
2026-02-14 15:56 ` Chen, Yu C
1 sibling, 1 reply; 36+ messages in thread
From: Shrikanth Hegde @ 2026-01-23 6:06 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
On 1/20/26 5:02 PM, K Prateek Nayak wrote:
> Use the "sd_llc" passed to select_idle_cpu() to obtain the
> "sd_llc_shared" instead of dereferencing the per-CPU variable.
>
> Since "sd->shared" is always reclaimed at the same time as "sd" via
> call_rcu() and update_top_cache_domain() always ensures a valid
> "sd->shared" assignment when "sd_llc" is present, "sd_llc->shared" can
> always be dereferenced without needing an additional check.
>
> While at it move the cpumask_and() operation after the SIS_UTIL bailout
> check to avoid unnecessarily computing the cpumask.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o No changes to the diff. Added more details on directly dereferencing
> "sd->shared" without a NULL check in the commit message.
> ---
> kernel/sched/fair.c | 19 ++++++++-----------
> 1 file changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c308c0700a7f..b4ae9444d32f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7629,21 +7629,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> - struct sched_domain_shared *sd_share;
> -
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>
> if (sched_feat(SIS_UTIL)) {
> - sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target));
> - if (sd_share) {
> - /* because !--nr is the condition to stop scan */
> - nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> - /* overloaded LLC is unlikely to have idle cpu/core */
> - if (nr == 1)
> - return -1;
> - }
> + /* because !--nr is the condition to stop scan */
> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> + /* overloaded LLC is unlikely to have idle cpu/core */
> + if (nr == 1)
> + return -1;
> }
I stared at sd->shared->nr_idle_scan for a while to see why it is safe
even when, let's say, there is no LLC domain.

It is because "sd" here is sd_llc, not any other domain, and there is
a sd_llc NULL check before calling select_idle_cpu().

So maybe add a comment here saying the NULL check for sd_llc has
already been done and that's why it is safe to dereference it directly.
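Something along these lines perhaps (wording is only a suggestion):

	if (sched_feat(SIS_UTIL)) {
		/*
		 * "sd" is sd_llc, already NULL-checked by the caller,
		 * and sd->shared lives and dies together with sd, so
		 * it can be dereferenced without a further check.
		 */
		/* because !--nr is the condition to stop scan */
		nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
		...
	}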
>
> + if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr))
> + return -1;
> +
> if (static_branch_unlikely(&sched_cluster_active)) {
> struct sched_group *sg = sd->groups;
>
While reading this series, I was reminded that we had discussed
unifying sd_llc->shared and the per-CPU sd_llc_shared into one (in v1
or v2). Is that dropped, or do you plan to fix it after this series?
Other than minor comments and nits series looks good to me.
So, for the series.
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
2026-01-23 6:06 ` Shrikanth Hegde
@ 2026-01-23 6:27 ` K Prateek Nayak
2026-01-23 7:14 ` Shrikanth Hegde
0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-23 6:27 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
Hello Shrikanth,
On 1/23/2026 11:36 AM, Shrikanth Hegde wrote:
>> + /* because !--nr is the condition to stop scan */
>> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
>> + /* overloaded LLC is unlikely to have idle cpu/core */
>> + if (nr == 1)
>> + return -1;
>> }
>
>
> I stared at sd->shared->nr_idle_scan for a while to see why it is safe
> even when lets say there is no LLC domain.
>
> It is because it is sd_llc here. Not any other domains. and
> there is sd_llc check before calling select_idle_cpu.
Ack! We come here with a valid "sd_llc" from select_idle_sibling(),
and "sd" and "sd->shared" are freed at the same time via call_rcu()
when the last reference is dropped. So holding a reference to "sd"
guarantees "sd->shared" is not freed, and the topology bits will ensure
"sd_llc->shared" is always present (or it screams and we crash here).
>
> So maybe add a comment here, saying null check for sd_llc is already there
> and that's why it is safe to call it directly.
>
>> + if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr))
>> + return -1;
>> +
>> if (static_branch_unlikely(&sched_cluster_active)) {
>> struct sched_group *sg = sd->groups;
>>
>
> While reading this series, this reminded me we had discussed about unifying
> sd_llc->shared and sd_llc_shared thing into one (in v1 or v2).
> is that dropped or you plan to fix it after this series?
Must have slipped out of my mind! I believe the only other user of
"sd_llc_shared" directly would then be nohz_balancer_kick() and
{test,set}_idle_cores().
Out of those, I would only consider set_idle_core() from wakeup to
be a fast-path but we'll already have a "sd_llc" reference there
so we should be able to flip the idle_cores indicator without
needing an extra dereference.
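i.e., something like this sketch (hypothetical signature, assuming the
caller passes down the sd_llc reference it already holds):

	static inline void set_idle_cores(struct sched_domain *sd_llc, int val)
	{
		WRITE_ONCE(sd_llc->shared->has_idle_cores, val);
	}

instead of today's rcu_dereference(per_cpu(sd_llc_shared, cpu)) plus
NULL check.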
We can then keep only the per-CPU "sd_llc" and remove
"sd_llc_shared". I hope that is what you were suggesting; otherwise,
please let me know if I misinterpreted the question.
>
>
> Other than minor comments and nits series looks good to me.
> So, for the series.
>
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Thank you for reviewing the series.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
2026-01-23 6:27 ` K Prateek Nayak
@ 2026-01-23 7:14 ` Shrikanth Hegde
0 siblings, 0 replies; 36+ messages in thread
From: Shrikanth Hegde @ 2026-01-23 7:14 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Gautham R. Shenoy, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
On 1/23/26 11:57 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 1/23/2026 11:36 AM, Shrikanth Hegde wrote:
>>> + /* because !--nr is the condition to stop scan */
>>> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
>>> + /* overloaded LLC is unlikely to have idle cpu/core */
>>> + if (nr == 1)
>>> + return -1;
>>> }
>>
>>
>> I stared at sd->shared->nr_idle_scan for a while to see why it is safe
>> even when lets say there is no LLC domain.
>>
>> It is because it is sd_llc here. Not any other domains. and
>> there is sd_llc check before calling select_idle_cpu.
>
> Ack! We come here with a valid "sd_llc" from select_idle_sibling()
> and "sd" and "sd->shared" are freed at the same time via call_rcu() when
> the last reference is dropped so having a reference to "sd" guarantees
> "sd->shared" is not freed and the topology bits will ensure
> "sd_llc->shared" is always present (or it screams and we crash here).
>
>>
>> So maybe add a comment here, saying null check for sd_llc is already there
>> and that's why it is safe to call it directly.
>>
>>> + if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr))
>>> + return -1;
>>> +
>>> if (static_branch_unlikely(&sched_cluster_active)) {
>>> struct sched_group *sg = sd->groups;
>>>
>>
>> While reading this series, this reminded me we had discussed about unifying
>> sd_llc->shared and sd_llc_shared thing into one (in v1 or v2).
>> is that dropped or you plan to fix it after this series?
>
> Must have slipped out of my mind! I believe the only other user of
> "sd_llc_shared" directly would then be nohz_balancer_kick() and
> {test,set}_idle_cores().
>
> Out of those, I would only consider set_idle_core() from wakeup to
> be a fast-path but we'll already have a "sd_llc" reference there
> so we should be able to flip the idle_cores indicator without
> needing an extra dereference.
>
> We can only keep per-CPU "sd_llc" and remove "sd_llc_shared". I
> hope that is what you were suggesting. Otherwise please let me
> know if I misinterpreted the question.
>
You got it right: keep sd_llc only.
>>
>>
>> Other than minor comments and nits series looks good to me.
>> So, for the series.
>>
>> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>
> Thank you for reviewing the series.
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
2026-01-20 11:32 ` [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
2026-01-23 6:06 ` Shrikanth Hegde
@ 2026-02-14 15:56 ` Chen, Yu C
1 sibling, 0 replies; 36+ messages in thread
From: Chen, Yu C @ 2026-02-14 15:56 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Shrikanth Hegde, Gautham R. Shenoy,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> Use the "sd_llc" passed to select_idle_cpu() to obtain the
> "sd_llc_shared" instead of dereferencing the per-CPU variable.
>
> Since "sd->shared" is always reclaimed at the same time as "sd" via
> call_rcu() and update_top_cache_domain() always ensures a valid
> "sd->shared" assignment when "sd_llc" is present, "sd_llc->shared" can
> always be dereferenced without needing an additional check.
>
> While at it move the cpumask_and() operation after the SIS_UTIL bailout
> check to avoid unnecessarily computing the cpumask.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
I ran netperf, schbench, hackbench, stress-ng-context,
and stream on a platform with 384 CPUs and 6 nodes, and no
significant differences were observed. (Netperf is of the
greatest interest because SIS_UTIL had a substantial impact
on netperf when it was first introduced.) I suppose
"sched/topology: Switch to assigning "sd->shared" from s_data" will
have a new version; I'll look at it when it is posted.
netperf
                        baseline                 llc
Hmean-96pairs      119.86 (  0.00%)    122.21 (  1.96%)
Hmean-192pairs      56.48 (  0.00%)     57.37 (  1.57%)
Hmean-288pairs      78.27 (  0.00%)     78.21 ( -0.08%)
Hmean-384pairs      62.69 (  0.00%)     62.98 (  0.47%)
thanks,
Chenyu
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation
2026-01-20 11:32 [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation K Prateek Nayak
` (7 preceding siblings ...)
2026-01-20 11:32 ` [PATCH v3 8/8] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu() K Prateek Nayak
@ 2026-01-21 16:16 ` Peter Zijlstra
2026-01-22 2:56 ` K Prateek Nayak
8 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2026-01-21 16:16 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy
On Tue, Jan 20, 2026 at 11:32:38AM +0000, K Prateek Nayak wrote:
> "sd->shared" is only allocated for the topmost SD_SHARE_LLC domain and
> the topology layer uses the sched domain degeneration path to pass the
> reference to the final "sd_llc" domain.
I'm fairly sure we've had patches that introduced it for other levels at
various times, but clearly none of those ever made it.
Anyway, a quick peek seems to suggest it is still easy to extend.
> include/linux/sched/topology.h | 1 -
> kernel/sched/fair.c | 62 +++++++-----------
> kernel/sched/sched.h | 2 +-
> kernel/sched/topology.c | 111 ++++++++++++++++++++++-----------
> 4 files changed, 101 insertions(+), 75 deletions(-)
Is this really worth the extra lines though?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation
2026-01-21 16:16 ` [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation Peter Zijlstra
@ 2026-01-22 2:56 ` K Prateek Nayak
2026-01-23 9:54 ` Peter Zijlstra
0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2026-01-22 2:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy
Hello Peter,
On 1/21/2026 9:46 PM, Peter Zijlstra wrote:
> On Tue, Jan 20, 2026 at 11:32:38AM +0000, K Prateek Nayak wrote:
>
>> "sd->shared" is only allocated for the topmost SD_SHARE_LLC domain and
>> the topology layer uses the sched domain degeneration path to pass the
>> reference to the final "sd_llc" domain.
>
> I'm fairly sure we've had patches that introduced it for other levels at
> various times, but clearly none of those ever made it.
>
> Anyway, a quick peek seems to suggest it is still easy to extend.
>
>
>> include/linux/sched/topology.h | 1 -
>> kernel/sched/fair.c | 62 +++++++-----------
>> kernel/sched/sched.h | 2 +-
>> kernel/sched/topology.c | 111 ++++++++++++++++++++++-----------
>> 4 files changed, 101 insertions(+), 75 deletions(-)
>
> Is this really worth the extra lines though?
The larger plan was to move the "nohz.idle_cpus" tracking into the
sched_domain_shared instance, which will bloat these allocations.

Instead of a (#CPUs x #topology_levels) surplus, most of which would
get reclaimed at the end anyway, we'll only have #CPUs worth of
allocations now.
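As a concrete example (numbers are hypothetical): on a 384-CPU machine
with 4 topology levels, the sd_data scheme allocates 384 * 4 = 1536
sched_domain_shared objects up front and frees most of them again once
the unclaimed ones are reaped, whereas with s_data it is at most 384,
one per CPU.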
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 0/8] sched/topology: Optimize sd->shared allocation
2026-01-22 2:56 ` K Prateek Nayak
@ 2026-01-23 9:54 ` Peter Zijlstra
0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2026-01-23 9:54 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Chen Yu, Shrikanth Hegde, Gautham R. Shenoy
On Thu, Jan 22, 2026 at 08:26:29AM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 1/21/2026 9:46 PM, Peter Zijlstra wrote:
> > On Tue, Jan 20, 2026 at 11:32:38AM +0000, K Prateek Nayak wrote:
> >
> >> "sd->shared" is only allocated for the topmost SD_SHARE_LLC domain and
> >> the topology layer uses the sched domain degeneration path to pass the
> >> reference to the final "sd_llc" domain.
> >
> > I'm fairly sure we've had patches that introduced it for other levels at
> > various times, but clearly none of those ever made it.
> >
> > Anyway, a quick peek seems to suggest it is still easy to extend.
> >
> >
> >> include/linux/sched/topology.h | 1 -
> >> kernel/sched/fair.c | 62 +++++++-----------
> >> kernel/sched/sched.h | 2 +-
> >> kernel/sched/topology.c | 111 ++++++++++++++++++++++-----------
> >> 4 files changed, 101 insertions(+), 75 deletions(-)
> >
> > Is this really worth the extra lines though?
>
> The larger plan was to move the "nohz.idle_cpus" tracking into the
> sched_domain_shared instance which will bloat these allocations.
>
> Instead of (#CPUs x #topology_levels) surplus, most of which will get
> reclaimed at the end anyways, we'll only have #CPUs worth of
> allocations now.
Fair enough I suppose. Be sure to call this out as the primary reason
for doing this.
^ permalink raw reply [flat|nested] 36+ messages in thread