linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/2] Improving topology_span_sane
@ 2025-03-04 16:08 Steve Wahl
  2025-03-04 16:08 ` [PATCH v4 1/2] sched/topology: improve topology_span_sane speed Steve Wahl
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Steve Wahl @ 2025-03-04 16:08 UTC (permalink / raw)
  To: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, K Prateek Nayak,
	Vishal Chourasia, samir
  Cc: Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

topology_span_sane() has an O(N^2) algorithm that takes an inordinate
amount of time on systems with a large number of CPUs.

The first patch in this series replaces the algorithm used with an O(N)
method that should exactly duplicate the previous code's results.

The second patch simplifies the first. It takes a similar amount of
time to run, but could return different results than the previous code
in situations not believed to actually occur, such as a CPU not being
included in its own span.

Version 1:
  * Original patch

Version 2:

  * Adopted simplifications from K Prateek Nayak, and fixed use of
    num_possible_cpus().

Version 3:

  * Undid the simplifications from version 2 after noticing that
    results could differ from the original code; kept the
    num_possible_cpus() fix.

Version 4:

  * Turned the patch into a series of two; the second patch
    re-introduces the simplifications and includes a further
    simplification suggested by Valentin Schneider in the discussion
    for Version 2.

Steve Wahl (2):
  sched/topology: improve topology_span_sane speed
  sched/topology: Refinement to topology_span_sane speedup

 kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 25 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-03-04 16:08 [PATCH v4 0/2] Improving topology_span_sane Steve Wahl
@ 2025-03-04 16:08 ` Steve Wahl
  2025-04-08 19:05   ` [tip: sched/core] " tip-bot2 for Steve Wahl
  2025-06-10 11:07   ` [PATCH v4 1/2] " Leon Romanovsky
  2025-03-04 16:08 ` [PATCH v4 2/2] sched/topology: Refinement to topology_span_sane speedup Steve Wahl
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Steve Wahl @ 2025-03-04 16:08 UTC (permalink / raw)
  To: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, K Prateek Nayak,
	Vishal Chourasia, samir
  Cc: Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Use a different approach to topology_span_sane() that checks for the
same constraint of no partial overlaps for any two CPU sets for
non-NUMA topology levels, but does so in a way that is O(N) rather
than O(N^2).

Instead of comparing with all other masks to detect collisions, keep
one mask that includes all CPUs seen so far and detect collisions with
a single cpumask_intersects test.

If the current mask has no collisions with previously seen masks, it
should be a new mask, which can be uniquely identified by the lowest
bit set in this mask.  Keep a pointer to this mask for future
reference (in an array indexed by the lowest bit set), and add the
CPUs in this mask to the list of those seen.

If the current mask does collide with previously seen masks, it should
be exactly equal to a mask seen before, which can be looked up in the
same array indexed by the lowest bit set in the mask: a single
comparison.

Move the topology_span_sane() check out of the existing topology level
loop and give it its own loop, so that the array allocation can be done
only once, shared across levels.
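The scheme described above (one "covered" mask plus an array indexed by each span's lowest set bit) can be sketched as a stand-alone C function. This is an illustration only, not the kernel code: it uses plain 64-bit integers in place of struct cpumask, handles a single topology level, and the names (spans_sane, tl_mask) are invented for the example.

```c
#define MAX_CPUS 64

/*
 * Sketch of the O(N) span check: tl_mask[cpu] stands in for
 * tl->mask(cpu) at one non-NUMA topology level.  Returns 1 if the
 * masks are sane (any two are equal or disjoint), 0 otherwise.
 */
static int spans_sane(const unsigned long long *tl_mask, int nr_cpus)
{
	const unsigned long long *masks[MAX_CPUS] = { 0 };
	unsigned long long covered = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned long long m = tl_mask[cpu];
		int id;

		/* zeroed masks cannot possibly collide */
		if (!m)
			continue;

		/* lowest bit set in this mask is used as a unique id */
		id = __builtin_ctzll(m);

		if (!(m & covered)) {
			/* no collision: this must be a brand-new span */
			if (masks[id])
				return 0;
			/* record the mask seen for this id */
			masks[id] = &tl_mask[cpu];
			covered |= m;
		} else if (!masks[id] || *masks[id] != m) {
			/*
			 * a collision with covered must exactly match a
			 * previously seen mask with the same id
			 */
			return 0;
		}
	}
	return 1;
}
```

Each CPU's mask is intersected with covered once and compared with at most one stored mask, which gives the O(N) behavior the patch is after (for a fixed mask width).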

On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
the average time to take one processor offline is reduced from 2.18
seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
34m49.765s without this change, 16m10.038s with this change in place.)

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
---

Version 4: No change.

Version 3 discussion:
    https://lore.kernel.org/all/20250210154259.375312-1-steve.wahl@hpe.com/

Version 3: While the intent of this patch is no functional change, I
discovered that version 2 had conditions where it would give different
results than the original code.  Version 3 returns to the V1 approach,
fixing the num_possible_cpus() problem Peter Zijlstra noted.  In a
stand-alone test program that used all possible sets of four 4-bit
masks, this algorithm matched the original code in all cases, whereas
the others did not.

Version 2 discussion:
    https://lore.kernel.org/all/20241031200431.182443-1-steve.wahl@hpe.com/

Version 2: Adopted suggestion by K Prateek Nayak that removes an array and
simplifies the code, and eliminates the erroneous use of
num_possible_cpus() that Peter Zijlstra noted.

Version 1 discussion:
    https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/

 kernel/sched/topology.c | 83 ++++++++++++++++++++++++++++-------------
 1 file changed, 58 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..3fb834301315 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2356,36 +2356,69 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 
 /*
  * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
- * any two given CPUs at this (non-NUMA) topology level.
+ * any two given CPUs on non-NUMA topology levels.
  */
-static bool topology_span_sane(struct sched_domain_topology_level *tl,
-			      const struct cpumask *cpu_map, int cpu)
+static bool topology_span_sane(const struct cpumask *cpu_map)
 {
-	int i = cpu + 1;
+	struct sched_domain_topology_level *tl;
+	const struct cpumask **masks;
+	struct cpumask *covered;
+	int cpu, id;
+	bool ret = false;
 
-	/* NUMA levels are allowed to overlap */
-	if (tl->flags & SDTL_OVERLAP)
-		return true;
+	lockdep_assert_held(&sched_domains_mutex);
+	covered = sched_domains_tmpmask;
+
+	masks = kmalloc_array(nr_cpu_ids, sizeof(struct cpumask *), GFP_KERNEL);
+	if (!masks)
+		return ret;
+
+	for_each_sd_topology(tl) {
+
+		/* NUMA levels are allowed to overlap */
+		if (tl->flags & SDTL_OVERLAP)
+			continue;
+
+		cpumask_clear(covered);
+		memset(masks, 0, nr_cpu_ids * sizeof(struct cpumask *));
 
-	/*
-	 * Non-NUMA levels cannot partially overlap - they must be either
-	 * completely equal or completely disjoint. Otherwise we can end up
-	 * breaking the sched_group lists - i.e. a later get_group() pass
-	 * breaks the linking done for an earlier span.
-	 */
-	for_each_cpu_from(i, cpu_map) {
 		/*
-		 * We should 'and' all those masks with 'cpu_map' to exactly
-		 * match the topology we're about to build, but that can only
-		 * remove CPUs, which only lessens our ability to detect
-		 * overlaps
+		 * Non-NUMA levels cannot partially overlap - they must be either
+		 * completely equal or completely disjoint. Otherwise we can end up
+		 * breaking the sched_group lists - i.e. a later get_group() pass
+		 * breaks the linking done for an earlier span.
 		 */
-		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
-			return false;
+		for_each_cpu(cpu, cpu_map) {
+			/* lowest bit set in this mask is used as a unique id */
+			id = cpumask_first(tl->mask(cpu));
+
+			/* zeroed masks cannot possibly collide */
+			if (id >= nr_cpu_ids)
+				continue;
+
+			/* if this mask doesn't collide with what we've already seen */
+			if (!cpumask_intersects(tl->mask(cpu), covered)) {
+				/* this failing would be an error in this algorithm */
+				if (WARN_ON(masks[id]))
+					goto notsane;
+
+				/* record the mask we saw for this id */
+				masks[id] = tl->mask(cpu);
+				cpumask_or(covered, tl->mask(cpu), covered);
+			} else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
+				/*
+				 * a collision with covered should have exactly matched
+				 * a previously seen mask with the same id
+				 */
+				goto notsane;
+			}
+		}
 	}
+	ret = true;
 
-	return true;
+ notsane:
+	kfree(masks);
+	return ret;
 }
 
 /*
@@ -2417,9 +2450,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		sd = NULL;
 		for_each_sd_topology(tl) {
 
-			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
-				goto error;
-
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
 			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
@@ -2433,6 +2463,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	if (WARN_ON(!topology_span_sane(cpu_map)))
+		goto error;
+
 	/* Build the groups for the domains */
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
-- 
2.26.2



* [PATCH v4 2/2] sched/topology: Refinement to topology_span_sane speedup
  2025-03-04 16:08 [PATCH v4 0/2] Improving topology_span_sane Steve Wahl
  2025-03-04 16:08 ` [PATCH v4 1/2] sched/topology: improve topology_span_sane speed Steve Wahl
@ 2025-03-04 16:08 ` Steve Wahl
  2025-04-08 19:05   ` [tip: sched/core] " tip-bot2 for Steve Wahl
  2025-03-06  6:46 ` [PATCH v4 0/2] Improving topology_span_sane K Prateek Nayak
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Steve Wahl @ 2025-03-04 16:08 UTC (permalink / raw)
  To: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, K Prateek Nayak,
	Vishal Chourasia, samir
  Cc: Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Simplify the topology_span_sane code further, removing the need to
allocate an array and the gotos used to make sure the array gets freed.

This version is in a separate commit because it could return a
different sanity result than the previous code, but only in odd
circumstances that are not expected to actually occur; for example,
when a CPU is not listed in its own mask.
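With the same stand-alone conventions used to sketch patch 1 (plain 64-bit masks instead of struct cpumask, invented names), the refinement might look like the following. Note the assumption called out above: every CPU must appear in its own mask, otherwise the lowest-bit lookup on an empty mask (__builtin_ctzll(0)) is undefined, which mirrors the "odd circumstances" caveat.

```c
/*
 * Sketch of the simplified check: instead of an allocated array of
 * mask pointers, a second bitmask (id_seen) records which "first CPU"
 * ids have been seen.  Assumes every CPU is present in its own mask,
 * so tl_mask[cpu] is never zero.
 */
static int spans_sane_simple(const unsigned long long *tl_mask, int nr_cpus)
{
	unsigned long long covered = 0, id_seen = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned long long m = tl_mask[cpu];
		/* lowest bit set in this mask is used as a unique id */
		int id = __builtin_ctzll(m);

		if (id_seen & (1ULL << id)) {
			/* id's first CPU already seen: spans must be identical */
			if (tl_mask[id] != m)
				return 0;
		} else {
			/* new id: span must not touch anything already covered */
			if (m & covered)
				return 0;
			covered |= m;
			id_seen |= 1ULL << id;
		}
	}
	return 1;
}
```

Trading the pointer array for a second temporary cpumask is what lets the kernel version drop the kmalloc_array() call and the cleanup gotos.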

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
---

Version 4: First appearance of this second patch.

 kernel/sched/topology.c | 48 ++++++++++++++++-------------------------
 1 file changed, 19 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3fb834301315..23b2012ff2af 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2361,17 +2361,12 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 static bool topology_span_sane(const struct cpumask *cpu_map)
 {
 	struct sched_domain_topology_level *tl;
-	const struct cpumask **masks;
-	struct cpumask *covered;
-	int cpu, id;
-	bool ret = false;
+	struct cpumask *covered, *id_seen;
+	int cpu;
 
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
-
-	masks = kmalloc_array(nr_cpu_ids, sizeof(struct cpumask *), GFP_KERNEL);
-	if (!masks)
-		return ret;
+	id_seen = sched_domains_tmpmask2;
 
 	for_each_sd_topology(tl) {
 
@@ -2380,7 +2375,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 			continue;
 
 		cpumask_clear(covered);
-		memset(masks, 0, nr_cpu_ids * sizeof(struct cpumask *));
+		cpumask_clear(id_seen);
 
 		/*
 		 * Non-NUMA levels cannot partially overlap - they must be either
@@ -2389,36 +2384,27 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 		 * breaks the linking done for an earlier span.
 		 */
 		for_each_cpu(cpu, cpu_map) {
-			/* lowest bit set in this mask is used as a unique id */
-			id = cpumask_first(tl->mask(cpu));
+			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
+			int id;
 
-			/* zeroed masks cannot possibly collide */
-			if (id >= nr_cpu_ids)
-				continue;
+			/* lowest bit set in this mask is used as a unique id */
+			id = cpumask_first(tl_cpu_mask);
 
-			/* if this mask doesn't collide with what we've already seen */
-			if (!cpumask_intersects(tl->mask(cpu), covered)) {
-				/* this failing would be an error in this algorithm */
-				if (WARN_ON(masks[id]))
-					goto notsane;
+			if (cpumask_test_cpu(id, id_seen)) {
+				/* First CPU has already been seen, ensure identical spans */
+				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
+					return false;
+			} else {
+				/* First CPU hasn't been seen before, ensure it's a completely new span */
+				if (cpumask_intersects(tl_cpu_mask, covered))
+					return false;
 
-				/* record the mask we saw for this id */
-				masks[id] = tl->mask(cpu);
-				cpumask_or(covered, tl->mask(cpu), covered);
-			} else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
-				/*
-				 * a collision with covered should have exactly matched
-				 * a previously seen mask with the same id
-				 */
-				goto notsane;
+				cpumask_or(covered, covered, tl_cpu_mask);
+				cpumask_set_cpu(id, id_seen);
 			}
 		}
 	}
-	ret = true;
-
- notsane:
-	kfree(masks);
-	return ret;
+	return true;
 }
 
 /*
-- 
2.26.2



* Re: [PATCH v4 0/2] Improving topology_span_sane
  2025-03-04 16:08 [PATCH v4 0/2] Improving topology_span_sane Steve Wahl
  2025-03-04 16:08 ` [PATCH v4 1/2] sched/topology: improve topology_span_sane speed Steve Wahl
  2025-03-04 16:08 ` [PATCH v4 2/2] sched/topology: Refinement to topology_span_sane speedup Steve Wahl
@ 2025-03-06  6:46 ` K Prateek Nayak
  2025-03-06 14:33 ` Valentin Schneider
  2025-03-07 10:06 ` Madadi Vineeth Reddy
  4 siblings, 0 replies; 25+ messages in thread
From: K Prateek Nayak @ 2025-03-06  6:46 UTC (permalink / raw)
  To: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir
  Cc: Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Hello Steve,

On 3/4/2025 9:38 PM, Steve Wahl wrote:
> topology_span_sane() has an O(N^2) algorithm that takes an inordinate
> amount of time on systems with a large number of CPUs.
> 
> The first patch in this series replaces the algorithm used with an O(N)
> method that should exactly duplicate the previous code's results.
> 
> The second patch simplifies the first, taking a similar amount of time
> to run, but potentially has different results than previous code under
> situations believed to not truly exist, like a CPU not being included
> in its own span.

I've tested Patch 1 individually and the whole series as is on top of
tip:sched/core and I haven't run into any issues with the optimization
on my 3rd Generation EPYC system.

Please feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek

> 
> Version 1:
>    * Original patch
> 
> Version 2:
> 
>    * Adopted simplifications from K Prateek Nayak, and fixed use of
>      num_possible_cpus().
> 
> Version 3:
> 
>    * Undid the simplifications from version 2 when noticed that results
>      could differ from original code; kept num_possible_cpus() fix.
> 
> Version 4:
> 
>    * Turned the patch into a series of 2, the second re-introduces the
>      simplifications, and includes further simplification suggested by
>      Valentin Schneider in the discussion for Version 2.
> 
> Steve Wahl (2):
>    sched/topology: improve topology_span_sane speed
>    sched/topology: Refinement to topology_span_sane speedup
> 
>   kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
>   1 file changed, 48 insertions(+), 25 deletions(-)
> 




* Re: [PATCH v4 0/2] Improving topology_span_sane
  2025-03-04 16:08 [PATCH v4 0/2] Improving topology_span_sane Steve Wahl
                   ` (2 preceding siblings ...)
  2025-03-06  6:46 ` [PATCH v4 0/2] Improving topology_span_sane K Prateek Nayak
@ 2025-03-06 14:33 ` Valentin Schneider
  2025-03-07 10:06 ` Madadi Vineeth Reddy
  4 siblings, 0 replies; 25+ messages in thread
From: Valentin Schneider @ 2025-03-06 14:33 UTC (permalink / raw)
  To: Steve Wahl, Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, linux-kernel, K Prateek Nayak, Vishal Chourasia,
	samir
  Cc: Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On 04/03/25 10:08, Steve Wahl wrote:
> topology_span_sane() has an O(N^2) algorithm that takes an inordinate
> amount of time on systems with a large number of CPUs.
>
> The first patch in this series replaces the algorithm used with an O(N)
> method that should exactly duplicate the previous code's results.
>
> The second patch simplifies the first, taking a similar amount of time
> to run, but potentially has different results than previous code under
> situations believed to not truly exist, like a CPU not being included
> in its own span.
>

Had to hack up arch_topology.c some more to replicate the setup described
in

  ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")

but eventually got there, and it was correctly caught by topology_span_sane().

Thanks!

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>



* Re: [PATCH v4 0/2] Improving topology_span_sane
  2025-03-04 16:08 [PATCH v4 0/2] Improving topology_span_sane Steve Wahl
                   ` (3 preceding siblings ...)
  2025-03-06 14:33 ` Valentin Schneider
@ 2025-03-07 10:06 ` Madadi Vineeth Reddy
  4 siblings, 0 replies; 25+ messages in thread
From: Madadi Vineeth Reddy @ 2025-03-07 10:06 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Naman Jain,
	Saurabh Singh Sengar, srivatsa, Michael Kelley, Russ Anderson,
	Dimitri Sivanich, K Prateek Nayak, Vishal Chourasia, samir,
	Madadi Vineeth Reddy

Hi Steve,

On 04/03/25 21:38, Steve Wahl wrote:
> topology_span_sane() has an O(N^2) algorithm that takes an inordinate
> amount of time on systems with a large number of CPUs.
> 
> The first patch in this series replaces the algorithm used with an O(N)
> method that should exactly duplicate the previous code's results.
> 
> The second patch simplifies the first, taking a similar amount of time
> to run, but potentially has different results than previous code under
> situations believed to not truly exist, like a CPU not being included
> in its own span.

I have reviewed the proposed approach for the topology sanity check and
it looks good to me.

I have also tested the patch on a Power10 system with 12 cores (96 CPUs).
The average CPU hotplug latency decreased by around 10%.

Therefore,

Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks,
Madadi Vineeth Reddy

> 
> Version 1:
>   * Original patch
> 
> Version 2:
> 
>   * Adopted simplifications from K Prateek Nayak, and fixed use of
>     num_possible_cpus().
> 
> Version 3:
> 
>   * Undid the simplifications from version 2 when noticed that results
>     could differ from original code; kept num_possible_cpus() fix.
> 
> Version 4:
> 
>   * Turned the patch into a series of 2, the second re-introduces the
>     simplifications, and includes further simplification suggested by
>     Valentin Schneider in the discussion for Version 2.
> 
> Steve Wahl (2):
>   sched/topology: improve topology_span_sane speed
>   sched/topology: Refinement to topology_span_sane speedup
> 
>  kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 25 deletions(-)
> 



* [tip: sched/core] sched/topology: Refinement to topology_span_sane speedup
  2025-03-04 16:08 ` [PATCH v4 2/2] sched/topology: Refinement to topology_span_sane speedup Steve Wahl
@ 2025-04-08 19:05   ` tip-bot2 for Steve Wahl
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for Steve Wahl @ 2025-04-08 19:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steve Wahl, Peter Zijlstra (Intel), Valentin Schneider,
	Madadi Vineeth Reddy, K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ce29a7da84cdeafc7c08c32d329037c71ab3f3dd
Gitweb:        https://git.kernel.org/tip/ce29a7da84cdeafc7c08c32d329037c71ab3f3dd
Author:        Steve Wahl <steve.wahl@hpe.com>
AuthorDate:    Tue, 04 Mar 2025 10:08:44 -06:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:52 +02:00

sched/topology: Refinement to topology_span_sane speedup

Simplify the topology_span_sane code further, removing the need to
allocate an array and the gotos used to make sure the array gets freed.

This version is in a separate commit because it could return a
different sanity result than the previous code, but only in odd
circumstances that are not expected to actually occur; for example,
when a CPU is not listed in its own mask.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304160844.75373-3-steve.wahl@hpe.com
---
 kernel/sched/topology.c | 52 ++++++++++++++--------------------------
 1 file changed, 19 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 439e6ce..b334f25 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2352,17 +2352,12 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 static bool topology_span_sane(const struct cpumask *cpu_map)
 {
 	struct sched_domain_topology_level *tl;
-	const struct cpumask **masks;
-	struct cpumask *covered;
-	int cpu, id;
-	bool ret = false;
+	struct cpumask *covered, *id_seen;
+	int cpu;
 
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
-
-	masks = kmalloc_array(nr_cpu_ids, sizeof(struct cpumask *), GFP_KERNEL);
-	if (!masks)
-		return ret;
+	id_seen = sched_domains_tmpmask2;
 
 	for_each_sd_topology(tl) {
 
@@ -2371,7 +2366,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 			continue;
 
 		cpumask_clear(covered);
-		memset(masks, 0, nr_cpu_ids * sizeof(struct cpumask *));
+		cpumask_clear(id_seen);
 
 		/*
 		 * Non-NUMA levels cannot partially overlap - they must be either
@@ -2380,36 +2375,27 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 		 * breaks the linking done for an earlier span.
 		 */
 		for_each_cpu(cpu, cpu_map) {
-			/* lowest bit set in this mask is used as a unique id */
-			id = cpumask_first(tl->mask(cpu));
+			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
+			int id;
 
-			/* zeroed masks cannot possibly collide */
-			if (id >= nr_cpu_ids)
-				continue;
+			/* lowest bit set in this mask is used as a unique id */
+			id = cpumask_first(tl_cpu_mask);
 
-			/* if this mask doesn't collide with what we've already seen */
-			if (!cpumask_intersects(tl->mask(cpu), covered)) {
-				/* this failing would be an error in this algorithm */
-				if (WARN_ON(masks[id]))
-					goto notsane;
+			if (cpumask_test_cpu(id, id_seen)) {
+				/* First CPU has already been seen, ensure identical spans */
+				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
+					return false;
+			} else {
+				/* First CPU hasn't been seen before, ensure it's a completely new span */
+				if (cpumask_intersects(tl_cpu_mask, covered))
+					return false;
 
-				/* record the mask we saw for this id */
-				masks[id] = tl->mask(cpu);
-				cpumask_or(covered, tl->mask(cpu), covered);
-			} else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
-				/*
-				 * a collision with covered should have exactly matched
-				 * a previously seen mask with the same id
-				 */
-				goto notsane;
+				cpumask_or(covered, covered, tl_cpu_mask);
+				cpumask_set_cpu(id, id_seen);
 			}
 		}
 	}
-	ret = true;
-
- notsane:
-	kfree(masks);
-	return ret;
+	return true;
 }
 
 /*


* [tip: sched/core] sched/topology: improve topology_span_sane speed
  2025-03-04 16:08 ` [PATCH v4 1/2] sched/topology: improve topology_span_sane speed Steve Wahl
@ 2025-04-08 19:05   ` tip-bot2 for Steve Wahl
  2025-06-10 11:07   ` [PATCH v4 1/2] " Leon Romanovsky
  1 sibling, 0 replies; 25+ messages in thread
From: tip-bot2 for Steve Wahl @ 2025-04-08 19:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Steve Wahl, Peter Zijlstra (Intel), Valentin Schneider,
	Madadi Vineeth Reddy, K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     f55dac1dafb3334be1d5b54bf385e8cfaa0ab3b3
Gitweb:        https://git.kernel.org/tip/f55dac1dafb3334be1d5b54bf385e8cfaa0ab3b3
Author:        Steve Wahl <steve.wahl@hpe.com>
AuthorDate:    Tue, 04 Mar 2025 10:08:43 -06:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 08 Apr 2025 20:55:51 +02:00

sched/topology: improve topology_span_sane speed

Use a different approach to topology_span_sane() that checks for the
same constraint of no partial overlaps for any two CPU sets for
non-NUMA topology levels, but does so in a way that is O(N) rather
than O(N^2).

Instead of comparing with all other masks to detect collisions, keep
one mask that includes all CPUs seen so far and detect collisions with
a single cpumask_intersects test.

If the current mask has no collisions with previously seen masks, it
should be a new mask, which can be uniquely identified by the lowest
bit set in this mask.  Keep a pointer to this mask for future
reference (in an array indexed by the lowest bit set), and add the
CPUs in this mask to the list of those seen.

If the current mask does collide with previously seen masks, it should
be exactly equal to a mask seen before, which can be looked up in the
same array indexed by the lowest bit set in the mask: a single
comparison.

Move the topology_span_sane() check out of the existing topology level
loop and give it its own loop, so that the array allocation can be done
only once, shared across levels.

On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
the average time to take one processor offline is reduced from 2.18
seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
34m49.765s without this change, 16m10.038s with this change in place.)

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lore.kernel.org/r/20250304160844.75373-2-steve.wahl@hpe.com
---
 kernel/sched/topology.c | 83 +++++++++++++++++++++++++++-------------
 1 file changed, 58 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f1ebc60..439e6ce 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2347,36 +2347,69 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 
 /*
  * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
- * any two given CPUs at this (non-NUMA) topology level.
+ * any two given CPUs on non-NUMA topology levels.
  */
-static bool topology_span_sane(struct sched_domain_topology_level *tl,
-			      const struct cpumask *cpu_map, int cpu)
+static bool topology_span_sane(const struct cpumask *cpu_map)
 {
-	int i = cpu + 1;
+	struct sched_domain_topology_level *tl;
+	const struct cpumask **masks;
+	struct cpumask *covered;
+	int cpu, id;
+	bool ret = false;
 
-	/* NUMA levels are allowed to overlap */
-	if (tl->flags & SDTL_OVERLAP)
-		return true;
+	lockdep_assert_held(&sched_domains_mutex);
+	covered = sched_domains_tmpmask;
+
+	masks = kmalloc_array(nr_cpu_ids, sizeof(struct cpumask *), GFP_KERNEL);
+	if (!masks)
+		return ret;
+
+	for_each_sd_topology(tl) {
+
+		/* NUMA levels are allowed to overlap */
+		if (tl->flags & SDTL_OVERLAP)
+			continue;
+
+		cpumask_clear(covered);
+		memset(masks, 0, nr_cpu_ids * sizeof(struct cpumask *));
 
-	/*
-	 * Non-NUMA levels cannot partially overlap - they must be either
-	 * completely equal or completely disjoint. Otherwise we can end up
-	 * breaking the sched_group lists - i.e. a later get_group() pass
-	 * breaks the linking done for an earlier span.
-	 */
-	for_each_cpu_from(i, cpu_map) {
 		/*
-		 * We should 'and' all those masks with 'cpu_map' to exactly
-		 * match the topology we're about to build, but that can only
-		 * remove CPUs, which only lessens our ability to detect
-		 * overlaps
+		 * Non-NUMA levels cannot partially overlap - they must be either
+		 * completely equal or completely disjoint. Otherwise we can end up
+		 * breaking the sched_group lists - i.e. a later get_group() pass
+		 * breaks the linking done for an earlier span.
 		 */
-		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
-			return false;
+		for_each_cpu(cpu, cpu_map) {
+			/* lowest bit set in this mask is used as a unique id */
+			id = cpumask_first(tl->mask(cpu));
+
+			/* zeroed masks cannot possibly collide */
+			if (id >= nr_cpu_ids)
+				continue;
+
+			/* if this mask doesn't collide with what we've already seen */
+			if (!cpumask_intersects(tl->mask(cpu), covered)) {
+				/* this failing would be an error in this algorithm */
+				if (WARN_ON(masks[id]))
+					goto notsane;
+
+				/* record the mask we saw for this id */
+				masks[id] = tl->mask(cpu);
+				cpumask_or(covered, tl->mask(cpu), covered);
+			} else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
+				/*
+				 * a collision with covered should have exactly matched
+				 * a previously seen mask with the same id
+				 */
+				goto notsane;
+			}
+		}
 	}
+	ret = true;
 
-	return true;
+ notsane:
+	kfree(masks);
+	return ret;
 }
 
 /*
@@ -2408,9 +2441,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		sd = NULL;
 		for_each_sd_topology(tl) {
 
-			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
-				goto error;
-
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
 			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
@@ -2424,6 +2454,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	if (WARN_ON(!topology_span_sane(cpu_map)))
+		goto error;
+
 	/* Build the groups for the domains */
 	for_each_cpu(i, cpu_map) {
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {


* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-03-04 16:08 ` [PATCH v4 1/2] sched/topology: improve topology_span_sane speed Steve Wahl
  2025-04-08 19:05   ` [tip: sched/core] " tip-bot2 for Steve Wahl
@ 2025-06-10 11:07   ` Leon Romanovsky
  2025-06-10 11:33     ` K Prateek Nayak
  1 sibling, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-10 11:07 UTC (permalink / raw)
  To: Steve Wahl
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, K Prateek Nayak,
	Vishal Chourasia, samir, Naman Jain, Saurabh Singh Sengar,
	srivatsa, Michael Kelley, Russ Anderson, Dimitri Sivanich

On Tue, Mar 04, 2025 at 10:08:43AM -0600, Steve Wahl wrote:
> Use a different approach to topology_span_sane(), that checks for the
> same constraint of no partial overlaps for any two CPU sets for
> non-NUMA topology levels, but does so in a way that is O(N) rather
> than O(N^2).
> 
> Instead of comparing with all other masks to detect collisions, keep
> one mask that includes all CPUs seen so far and detect collisions with
> a single cpumask_intersects test.
> 
> If the current mask has no collisions with previously seen masks, it
> should be a new mask, which can be uniquely identified by the lowest
> bit set in this mask.  Keep a pointer to this mask for future
> reference (in an array indexed by the lowest bit set), and add the
> CPUs in this mask to the list of those seen.
> 
> If the current mask does collide with previously seen masks, it should
> be exactly equal to a mask seen before, looked up in the same array
> indexed by the lowest bit set in the mask, a single comparison.
> 
> Move the topology_span_sane() check out of the existing topology level
> loop, let it use its own loop so that the array allocation can be done
> only once, shared across levels.
> 
> On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> the average time to take one processor offline is reduced from 2.18
> seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> 34m49.765s without this change, 16m10.038s with this change in place.)
> 
> Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> ---
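
The approach described in the commit message above can be sketched as a toy model in plain C. This is illustrative only: `spans_sane`, the `uint64_t` bitmask standing in for `struct cpumask`, and the fixed-size `seen` array are assumptions made for the sketch, not the kernel's actual implementation (which uses the `cpumask_*` helpers and a kmalloc'd pointer array).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the O(N) span-sanity check: one pass over the CPUs,
 * one "covered" union mask, and one recorded span per unique id
 * (the lowest set bit of the span).
 */
#define TOY_NR_CPUS 64

static bool spans_sane(const uint64_t *mask_of, int nr_cpus)
{
	uint64_t covered = 0;               /* union of all spans seen so far */
	uint64_t seen[TOY_NR_CPUS] = { 0 }; /* span recorded for each unique id */

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		uint64_t m = mask_of[cpu];

		if (!m)
			continue;       /* an empty span cannot collide */

		/* lowest set bit serves as the span's unique id */
		int id = __builtin_ctzll(m);

		if (!(m & covered)) {
			/* new span: its id must not already be taken */
			if (seen[id])
				return false;
			seen[id] = m;
			covered |= m;
		} else if (seen[id] != m) {
			/* any overlap must exactly match the recorded span */
			return false;
		}
	}
	return true;
}
```

A partial overlap between two spans (e.g. {0,1} and {1,2}) trips the check, while identical spans and empty spans pass, mirroring the "no partial overlaps" constraint the commit message describes.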

<...>

>  
> +	if (WARN_ON(!topology_span_sane(cpu_map)))
> +		goto error;

Hi,

This WARN_ON() generates the following splat in our regression tests over VMs.

 [    0.408379] ------------[ cut here ]------------
 [    0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
 [    0.410797] Modules linked in:
 [    0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE 
 [    0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [    0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
 [    0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
 [    0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
 [    0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
 [    0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
 [    0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
 [    0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
 [    0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
 [    0.425164] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
 [    0.426751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [    0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
 [    0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [    0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [    0.431429] Call Trace:
 [    0.431983]  <TASK>
 [    0.432500]  sched_init_smp+0x32/0xa0
 [    0.433069]  ? stop_machine+0x2c/0x40
 [    0.433821]  kernel_init_freeable+0xf5/0x260
 [    0.434682]  ? rest_init+0xc0/0xc0
 [    0.435399]  kernel_init+0x16/0x120
 [    0.436140]  ret_from_fork+0x5e/0xd0
 [    0.436817]  ? rest_init+0xc0/0xc0
 [    0.437526]  ret_from_fork_asm+0x11/0x20
 [    0.438335]  </TASK>
 [    0.438841] ---[ end trace 0000000000000000 ]---

Thanks

> +
>  	/* Build the groups for the domains */
>  	for_each_cpu(i, cpu_map) {
>  		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-10 11:07   ` [PATCH v4 1/2] " Leon Romanovsky
@ 2025-06-10 11:33     ` K Prateek Nayak
  2025-06-10 12:36       ` Leon Romanovsky
  0 siblings, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-10 11:33 UTC (permalink / raw)
  To: Leon Romanovsky, Steve Wahl
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Vishal Chourasia, samir,
	Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Hello Leon,

On 6/10/2025 4:37 PM, Leon Romanovsky wrote:

[..snip..]

>>   
>> +	if (WARN_ON(!topology_span_sane(cpu_map)))
>> +		goto error;
> 
> Hi,
> 
> This WARN_ON() generates the following splat in our regression tests over VMs.
>
>   [    0.408379] ------------[ cut here ]------------
>   [    0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
>   [    0.410797] Modules linked in:
>   [    0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
>   [    0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>   [    0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
>   [    0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
>   [    0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
>   [    0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
>   [    0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
>   [    0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
>   [    0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
>   [    0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
>   [    0.425164] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
>   [    0.426751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [    0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
>   [    0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   [    0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   [    0.431429] Call Trace:
>   [    0.431983]  <TASK>
>   [    0.432500]  sched_init_smp+0x32/0xa0
>   [    0.433069]  ? stop_machine+0x2c/0x40
>   [    0.433821]  kernel_init_freeable+0xf5/0x260
>   [    0.434682]  ? rest_init+0xc0/0xc0
>   [    0.435399]  kernel_init+0x16/0x120
>   [    0.436140]  ret_from_fork+0x5e/0xd0
>   [    0.436817]  ? rest_init+0xc0/0xc0
>   [    0.437526]  ret_from_fork_asm+0x11/0x20
>   [    0.438335]  </TASK>
>   [    0.438841] ---[ end trace 0000000000000000 ]---

Would it be possible for you to boot the guest with "sched_verbose" in
kernel cmdline and attach the full dmesg? Thanks in advance.

-- 
Thanks and Regards,
Prateek

> 
> Thanks
> 
>> +
>>   	/* Build the groups for the domains */
>>   	for_each_cpu(i, cpu_map) {
>>   		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> -- 
>> 2.26.2
>>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-10 11:33     ` K Prateek Nayak
@ 2025-06-10 12:36       ` Leon Romanovsky
  2025-06-10 13:09         ` Leon Romanovsky
  0 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-10 12:36 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On Tue, Jun 10, 2025 at 05:03:14PM +0530, K Prateek Nayak wrote:
> Hello Leon,
> 
> On 6/10/2025 4:37 PM, Leon Romanovsky wrote:
> 
> [..snip..]
> 
> > > +	if (WARN_ON(!topology_span_sane(cpu_map)))
> > > +		goto error;
> > 
> > Hi,
> > 
> > This WARN_ON() generates the following splat in our regression tests over VMs.
> >
> > [    0.408379] ------------[ cut here ]------------
> >   [    0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
> >   [    0.410797] Modules linked in:
> >   [    0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
> >   [    0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >   [    0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
> >   [    0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
> >   [    0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
> >   [    0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
> >   [    0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
> >   [    0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
> >   [    0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
> >   [    0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
> >   [    0.425164] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
> >   [    0.426751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >   [    0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
> >   [    0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >   [    0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >   [    0.431429] Call Trace:
> >   [    0.431983]  <TASK>
> >   [    0.432500]  sched_init_smp+0x32/0xa0
> >   [    0.433069]  ? stop_machine+0x2c/0x40
> >   [    0.433821]  kernel_init_freeable+0xf5/0x260
> >   [    0.434682]  ? rest_init+0xc0/0xc0
> >   [    0.435399]  kernel_init+0x16/0x120
> >   [    0.436140]  ret_from_fork+0x5e/0xd0
> >   [    0.436817]  ? rest_init+0xc0/0xc0
> >   [    0.437526]  ret_from_fork_asm+0x11/0x20
> >   [    0.438335]  </TASK>
> >   [    0.438841] ---[ end trace 0000000000000000 ]---
> 
> Would it be possible for you to boot the guest with "sched_verbose" in
> kernel cmdline and attach the full dmesg? Thanks in advance.

I'll try, but can't promise, due to how this kernel is being run in
our systems.

Thanks

> 
> -- 
> Thanks and Regards,
> Prateek
> 
> > 
> > Thanks
> > 
> > > +
> > >   	/* Build the groups for the domains */
> > >   	for_each_cpu(i, cpu_map) {
> > >   		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > -- 
> > > 2.26.2
> > > 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-10 12:36       ` Leon Romanovsky
@ 2025-06-10 13:09         ` Leon Romanovsky
  2025-06-10 19:39           ` Steve Wahl
  0 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:09 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich



On Tue, Jun 10, 2025, at 15:36, Leon Romanovsky wrote:
> On Tue, Jun 10, 2025 at 05:03:14PM +0530, K Prateek Nayak wrote:
>> Hello Leon,
>> 
>> On 6/10/2025 4:37 PM, Leon Romanovsky wrote:
>> 
>> [..snip..]
>> 
>> > > +	if (WARN_ON(!topology_span_sane(cpu_map)))
>> > > +		goto error;
>> > 
>> > Hi,
>> > 
>> > This WARN_ON() generates the following splat in our regression tests over VMs.
>> >
>> > [    0.408379] ------------[ cut here ]------------
>> >   [    0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
>> >   [    0.410797] Modules linked in:
>> >   [    0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
>> >   [    0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>> >   [    0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
>> >   [    0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
>> >   [    0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
>> >   [    0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
>> >   [    0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
>> >   [    0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
>> >   [    0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
>> >   [    0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
>> >   [    0.425164] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
>> >   [    0.426751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >   [    0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
>> >   [    0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> >   [    0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> >   [    0.431429] Call Trace:
>> >   [    0.431983]  <TASK>
>> >   [    0.432500]  sched_init_smp+0x32/0xa0
>> >   [    0.433069]  ? stop_machine+0x2c/0x40
>> >   [    0.433821]  kernel_init_freeable+0xf5/0x260
>> >   [    0.434682]  ? rest_init+0xc0/0xc0
>> >   [    0.435399]  kernel_init+0x16/0x120
>> >   [    0.436140]  ret_from_fork+0x5e/0xd0
>> >   [    0.436817]  ? rest_init+0xc0/0xc0
>> >   [    0.437526]  ret_from_fork_asm+0x11/0x20
>> >   [    0.438335]  </TASK>
>> >   [    0.438841] ---[ end trace 0000000000000000 ]---
>> 
>> Would it be possible for you to boot the guest with "sched_verbose" in
>> kernel cmdline and attach the full dmesg? Thanks in advance.
>
> I'll try, but can't promise due to how this kernel is been running in
> our systems.



[    0.032233] [mem 0xc0000000-0xfed1bfff] available for PCI devices
[    0.032237] Booting paravirtualized kernel on KVM
[    0.032238] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.036921] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
[    0.038074] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
[    0.038108] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose
[    0.038222] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 selinux=0 biosdevname=0 audit=0", will be passed to user space.
[    0.038235] random: crng init done
[    0.038235] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.038236] printk: log_buf_len total cpu_extra contributions: 36864 bytes
[    0.038237] printk: log_buf_len min size: 65536 bytes
[    0.038330] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
[    0.038331] printk: early log buf free: 56792(86%)
[    0.038452] software IO TLB: area num 16.
[    0.049552] Fallback order for Node 0: 0 4 3 2 1
[    0.049556] Fallback order for Node 1: 1 4 3 2 0
[    0.049559] Fallback order for Node 2: 2 4 3 0 1
[    0.049561] Fallback order for Node 3: 3 4 1 0 2
[    0.049563] Fallback order for Node 4: 4 0 1 2 3
[    0.049569] Built 5 zonelists, mobility grouping on.  Total pages: 3932026
[    0.049570] Policy zone: Normal
[    0.049571] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.073214] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
[    0.082959] ftrace: allocating 46168 entries in 182 pages
[    0.082961] ftrace: allocated 182 pages with 5 groups
[    0.083102] rcu: Hierarchical RCU implementation.
[    0.083102] rcu:        RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
[    0.083104] Rude variant of Tasks RCU enabled.
[    0.083104] Tracing variant of Tasks RCU enabled.
[    0.083105] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.083106] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
[    0.083115] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
[    0.083117] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
[    0.089643] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
[    0.089831] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.100835] Console: colour VGA+ 80x25
[    0.100838] printk: legacy console [tty0] enabled
[    0.132452] printk: legacy console [ttyS1] enabled
[    0.221725] ACPI: Core revision 20250404
[    0.222382] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.223635] APIC: Switch to symmetric I/O mode setup
[    0.224298] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
[    0.225262] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
[    0.226436] kvm-guest: setup PV IPIs
[    0.227740] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.228537] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2563bd843df, max_idle_ns: 440795257314 ns
[    0.229871] Calibrating delay loop (skipped) preset value.. 5187.80 BogoMIPS (lpj=10375616)
[    0.231044] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.234092] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.234805] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.235598] Speculative Store Bypass: Vulnerable
[    0.236229] GDS: Unknown: Dependent on hypervisor status
[    0.236955] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.237871] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.238713] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.239535] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.240439] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[    0.241219] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.242085] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.242927] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.243794] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.244595] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[    0.245401] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[    0.246078] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
[    0.249871] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
[    0.250683] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
[    0.251500] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
[    0.253380] Freeing SMP alternatives memory: 48K
[    0.253876] pid_max: default: 32768 minimum: 301
[    0.254516] LSM: initializing lsm=capability
[    0.255115] stackdepot: allocating hash table of 1048576 entries via kvcalloc
[    0.262981] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
[    0.265481] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
[    0.266233] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[    0.267255] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[    0.268594] smpboot: CPU0: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (family: 0x6, model: 0x55, stepping: 0x7)
[    0.269870] Performance Events: Skylake events, full-width counters, Intel PMU driver.
[    0.269870] ... version:                2
[    0.269870] ... bit width:              48
[    0.269870] ... generic registers:      4
[    0.269873] ... value mask:             0000ffffffffffff
[    0.270548] ... max period:             00007fffffffffff
[    0.271220] ... fixed-purpose events:   3
[    0.271763] ... event mask:             000000070000000f
[    0.272574] signal: max sigframe size: 3216
[    0.273155] rcu: Hierarchical SRCU implementation.
[    0.273773] rcu:        Max phase no-delay instances is 1000.
[    0.274097] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[    0.275329] smp: Bringing up secondary CPUs ...
[    0.276031] smpboot: x86: Booting SMP configuration:
[    0.276689] .... node  #0, CPUs:        #1
[    0.277528] .... node  #1, CPUs:    #2  #3
[    0.278084] .... node  #2, CPUs:    #4  #5
[    0.279023] .... node  #3, CPUs:    #6  #7
[    0.279946] .... node  #4, CPUs:    #8  #9
[    0.313886] smp: Brought up 5 nodes, 10 CPUs
[    0.315058] smpboot: Total of 10 processors activated (51878.08 BogoMIPS)
[    0.316713] ------------[ cut here ]------------
[    0.316713] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
[    0.318187] Modules linked in:
[    0.318619] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
[    0.319928] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[    0.321286] RIP: 0010:build_sched_domains+0xe67/0x13a0
[    0.321873] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
[    0.324099] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
[    0.324779] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
[    0.325659] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
[    0.326109] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
[    0.326989] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
[    0.327868] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
[    0.328743] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
[    0.329772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.330069] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
[    0.330973] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.331858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.332740] Call Trace:
[    0.333111]  <TASK>
[    0.333453]  sched_init_smp+0x32/0xa0
[    0.333877]  ? stop_machine+0x2c/0x40
[    0.334382]  kernel_init_freeable+0xf5/0x260
[    0.334954]  ? rest_init+0xc0/0xc0
[    0.335423]  kernel_init+0x16/0x120
[    0.335907]  ret_from_fork+0x5e/0xd0
[    0.336396]  ? rest_init+0xc0/0xc0
[    0.336866]  ret_from_fork_asm+0x11/0x20
[    0.337409]  </TASK>
[    0.337755] ---[ end trace 0000000000000000 ]---
[    0.338089] Memory: 15307024K/15728104K available (14320K kernel code, 2394K rwdata, 9212K rodata, 1668K init, 1272K bss, 371220K reserved, 0K cma-reserved)
[    0.340215] devtmpfs: initialized
[    0.341149] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.342235] posixtimers hash table entries: 8192 (order: 5, 131072 bytes, vmalloc)
[    0.343256] futex hash table entries: 512 (32768 bytes on 5 NUMA nodes, total 160 KiB, linear).
[    0.346367] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.347279] thermal_sys: Registered thermal governor 'step_wise'
[    0.347288] cpuidle: using governor ladder
[    0.348603] cpuidle: using governor menu
[    0.349254] PCI: ECAM [mem 0xb0000000-0xbfffffff] (base 0xb0000000) for domain 0000 [bus 00-ff]
[    0.350190] PCI: ECAM [mem 0xb0000000-0xbfffffff] reserved as E820 entry
[    0.351025] PCI: Using configuration type 1 for base access
[    0.351822] kprobes: kprobe jump-optimization is enabled. All kprobes are optimized if possible.
[    0.381999] HugeTLB: allocation took 0ms with hugepage_allocation_threads=2
[    0.393902] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[    0.394769] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
[    0.402159] ACPI: Added _OSI(Module Device)
[    0.402744] ACPI: Added _OSI(Processor Device)
[    0.403326] ACPI: Added _OSI(Processor Aggregator Device)
[    0.404648] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    0.405807] ACPI: Interpreter enabled

Thanks

>
> Thanks
>
>> 
>> -- 
>> Thanks and Regards,
>> Prateek
>> 
>> > 
>> > Thanks
>> > 
>> > > +
>> > >   	/* Build the groups for the domains */
>> > >   	for_each_cpu(i, cpu_map) {
>> > >   		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> > > -- 
>> > > 2.26.2
>> > > 
>>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-10 13:09         ` Leon Romanovsky
@ 2025-06-10 19:39           ` Steve Wahl
  2025-06-11  6:06             ` Leon Romanovsky
  0 siblings, 1 reply; 25+ messages in thread
From: Steve Wahl @ 2025-06-10 19:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: K Prateek Nayak, Steve Wahl, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Vishal Chourasia, samir, Naman Jain, Saurabh Singh Sengar,
	srivatsa, Michael Kelley, Russ Anderson, Dimitri Sivanich

On Tue, Jun 10, 2025 at 04:09:52PM +0300, Leon Romanovsky wrote:
> 
> 
> On Tue, Jun 10, 2025, at 15:36, Leon Romanovsky wrote:
> > On Tue, Jun 10, 2025 at 05:03:14PM +0530, K Prateek Nayak wrote:
> >> Hello Leon,
> >> 
> >> On 6/10/2025 4:37 PM, Leon Romanovsky wrote:
> >> 
> >> [..snip..]
> >> 
> >> > > +	if (WARN_ON(!topology_span_sane(cpu_map)))
> >> > > +		goto error;
> >> > 
> >> > Hi,
> >> > 
> >> > This WARN_ON() generates the following splat in our regression tests over VMs.
> >> >
> >> > [    0.408379] ------------[ cut here ]------------
> >> >   [    0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
> >> >   [    0.410797] Modules linked in:
> >> >   [    0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
> >> >   [    0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >> >   [    0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
> >> >   [    0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
> >> >   [    0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
> >> >   [    0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
> >> >   [    0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
> >> >   [    0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
> >> >   [    0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
> >> >   [    0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
> >> >   [    0.425164] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
> >> >   [    0.426751] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> >   [    0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
> >> >   [    0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> >   [    0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> >   [    0.431429] Call Trace:
> >> >   [    0.431983]  <TASK>
> >> >   [    0.432500]  sched_init_smp+0x32/0xa0
> >> >   [    0.433069]  ? stop_machine+0x2c/0x40
> >> >   [    0.433821]  kernel_init_freeable+0xf5/0x260
> >> >   [    0.434682]  ? rest_init+0xc0/0xc0
> >> >   [    0.435399]  kernel_init+0x16/0x120
> >> >   [    0.436140]  ret_from_fork+0x5e/0xd0
> >> >   [    0.436817]  ? rest_init+0xc0/0xc0
> >> >   [    0.437526]  ret_from_fork_asm+0x11/0x20
> >> >   [    0.438335]  </TASK>
> >> >   [    0.438841] ---[ end trace 0000000000000000 ]---
> >> 
> >> Would it be possible for you to boot the guest with "sched_verbose" in
> >> kernel cmdline and attach the full dmesg? Thanks in advance.
> >
> > I'll try, but can't promise due to how this kernel is been running in
> > our systems.
> 
> 
> 
> [    0.032233] [mem 0xc0000000-0xfed1bfff] available for PCI devices
> [    0.032237] Booting paravirtualized kernel on KVM
> [    0.032238] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
> [    0.036921] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
> [    0.038074] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
> [    0.038108] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose
> [    0.038222] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 selinux=0 biosdevname=0 audit=0", will be passed to user space.
> [    0.038235] random: crng init done
> [    0.038235] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [    0.038236] printk: log_buf_len total cpu_extra contributions: 36864 bytes
> [    0.038237] printk: log_buf_len min size: 65536 bytes
> [    0.038330] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
> [    0.038331] printk: early log buf free: 56792(86%)
> [    0.038452] software IO TLB: area num 16.
> [    0.049552] Fallback order for Node 0: 0 4 3 2 1
>  [    0.049556] Fallback order for Node 1: 1 4 3 2 0
>  [    0.049559] Fallback order for Node 2: 2 4 3 0 1
>  [    0.049561] Fallback order for Node 3: 3 4 1 0 2
>  [    0.049563] Fallback order for Node 4: 4 0 1 2 3
>  [    0.049569] Built 5 zonelists, mobility grouping on.  Total pages: 3932026
> [    0.049570] Policy zone: Normal
> [    0.049571] mem auto-init: stack:off, heap alloc:off, heap free:off
> [    0.073214] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
> [    0.082959] ftrace: allocating 46168 entries in 182 pages
> [    0.082961] ftrace: allocated 182 pages with 5 groups
> [    0.083102] rcu: Hierarchical RCU implementation.
> [    0.083102] rcu:        RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
> [    0.083104] Rude variant of Tasks RCU enabled.
> [    0.083104] Tracing variant of Tasks RCU enabled.
> [    0.083105] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [    0.083106] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
> [    0.083115] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
> [    0.083117] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
> [    0.089643] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
> [    0.089831] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    0.100835] Console: colour VGA+ 80x25
> [    0.100838] printk: legacy console [tty0] enabled
> [    0.132452] printk: legacy console [ttyS1] enabled
> [    0.221725] ACPI: Core revision 20250404
> [    0.222382] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
> [    0.223635] APIC: Switch to symmetric I/O mode setup
> [    0.224298] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
> [    0.225262] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
> [    0.226436] kvm-guest: setup PV IPIs
> [    0.227740] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.228537] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2563bd843df, max_idle_ns: 440795257314 ns
> [    0.229871] Calibrating delay loop (skipped) preset value.. 5187.80 BogoMIPS (lpj=10375616)
> [    0.231044] x86/cpu: User Mode Instruction Prevention (UMIP) activated
> [    0.234092] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [    0.234805] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [    0.235598] Speculative Store Bypass: Vulnerable
> [    0.236229] GDS: Unknown: Dependent on hypervisor status
> [    0.236955] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [    0.237871] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [    0.238713] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [    0.239535] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> [    0.240439] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> [    0.241219] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> [    0.242085] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> [    0.242927] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> [    0.243794] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> [    0.244595] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
> [    0.245401] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
> [    0.246078] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
> [    0.249871] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
> [    0.250683] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
> [    0.251500] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
> [    0.253380] Freeing SMP alternatives memory: 48K
> [    0.253876] pid_max: default: 32768 minimum: 301
> [    0.254516] LSM: initializing lsm=capability
> [    0.255115] stackdepot: allocating hash table of 1048576 entries via kvcalloc
> [    0.262981] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
> [    0.265481] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
> [    0.266233] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
> [    0.267255] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
> [    0.268594] smpboot: CPU0: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (family: 0x6, model: 0x55, stepping: 0x7)
> [    0.269870] Performance Events: Skylake events, full-width counters, Intel PMU driver.
> [    0.269870] ... version:                2
> [    0.269870] ... bit width:              48
> [    0.269870] ... generic registers:      4
> [    0.269873] ... value mask:             0000ffffffffffff
> [    0.270548] ... max period:             00007fffffffffff
> [    0.271220] ... fixed-purpose events:   3
> [    0.271763] ... event mask:             000000070000000f
> [    0.272574] signal: max sigframe size: 3216
> [    0.273155] rcu: Hierarchical SRCU implementation.
> [    0.273773] rcu:        Max phase no-delay instances is 1000.
> [    0.274097] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [    0.275329] smp: Bringing up secondary CPUs ...
> [    0.276031] smpboot: x86: Booting SMP configuration:
> [    0.276689] .... node  #0, CPUs:        #1
> [    0.277528] .... node  #1, CPUs:    #2  #3
> [    0.278084] .... node  #2, CPUs:    #4  #5
> [    0.279023] .... node  #3, CPUs:    #6  #7
> [    0.279946] .... node  #4, CPUs:    #8  #9
> [    0.313886] smp: Brought up 5 nodes, 10 CPUs
> [    0.315058] smpboot: Total of 10 processors activated (51878.08 BogoMIPS)
> [    0.316713] ------------[ cut here ]------------
> [    0.316713] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
> [    0.318187] Modules linked in:
> [    0.318619] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
>  [    0.319928] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [    0.321286] RIP: 0010:build_sched_domains+0xe67/0x13a0
> [    0.321873] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
> [    0.324099] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
> [    0.324779] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
> [    0.325659] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
> [    0.326109] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
> [    0.326989] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
> [    0.327868] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
> [    0.328743] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
> [    0.329772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.330069] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
> [    0.330973] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    0.331858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    0.332740] Call Trace:
> [    0.333111]  <TASK>
> [    0.333453]  sched_init_smp+0x32/0xa0
> [    0.333877]  ? stop_machine+0x2c/0x40
> [    0.334382]  kernel_init_freeable+0xf5/0x260
> [    0.334954]  ? rest_init+0xc0/0xc0
> [    0.335423]  kernel_init+0x16/0x120
> [    0.335907]  ret_from_fork+0x5e/0xd0
> [    0.336396]  ? rest_init+0xc0/0xc0
> [    0.336866]  ret_from_fork_asm+0x11/0x20
> [    0.337409]  </TASK>
> [    0.337755] ---[ end trace 0000000000000000 ]---
> [    0.338089] Memory: 15307024K/15728104K available (14320K kernel code, 2394K rwdata, 9212K rodata, 1668K init, 1272K bss, 371220K reserved, 0K cma-reserved)
> [    0.340215] devtmpfs: initialized
> [    0.341149] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
> [    0.342235] posixtimers hash table entries: 8192 (order: 5, 131072 bytes, vmalloc)
> [    0.343256] futex hash table entries: 512 (32768 bytes on 5 NUMA nodes, total 160 KiB, linear).
> [    0.346367] NET: Registered PF_NETLINK/PF_ROUTE protocol family
> [    0.347279] thermal_sys: Registered thermal governor 'step_wise'
> [    0.347288] cpuidle: using governor ladder
> [    0.348603] cpuidle: using governor menu
> [    0.349254] PCI: ECAM [mem 0xb0000000-0xbfffffff] (base 0xb0000000) for domain 0000 [bus 00-ff]
> [    0.350190] PCI: ECAM [mem 0xb0000000-0xbfffffff] reserved as E820 entry
> [    0.351025] PCI: Using configuration type 1 for base access
> [    0.351822] kprobes: kprobe jump-optimization is enabled. All kprobes are optimized if possible.
> [    0.381999] HugeTLB: allocation took 0ms with hugepage_allocation_threads=2
> [    0.393902] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
> [    0.394769] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
> [    0.402159] ACPI: Added _OSI(Module Device)
> [    0.402744] ACPI: Added _OSI(Processor Device)
> [    0.403326] ACPI: Added _OSI(Processor Aggregator Device)
> [    0.404648] ACPI: 1 ACPI AML tables successfully acquired and loaded
> [    0.405807] ACPI: Interpreter enabled
> 
> Thanks

I don't think that's the full dmesg output; maybe it's a console capture
with a reduced loglevel?  I'm not finding the output of sched_domain_debug()
and sched_domain_debug_one() here.

Thanks,

Steve Wahl

> >
> > Thanks
> >
> >> 
> >> -- 
> >> Thanks and Regards,
> >> Prateek
> >> 
> >> > 
> >> > Thanks
> >> > 
> >> > > +
> >> > >   	/* Build the groups for the domains */
> >> > >   	for_each_cpu(i, cpu_map) {
> >> > >   		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> >> > > -- 
> >> > > 2.26.2
> >> > > 
> >>

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-10 19:39           ` Steve Wahl
@ 2025-06-11  6:06             ` Leon Romanovsky
  2025-06-11  6:56               ` K Prateek Nayak
  0 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-11  6:06 UTC (permalink / raw)
  To: Steve Wahl
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On Tue, Jun 10, 2025 at 02:39:54PM -0500, Steve Wahl wrote:
> On Tue, Jun 10, 2025 at 04:09:52PM +0300, Leon Romanovsky wrote:
> > 
> > 
> > On Tue, Jun 10, 2025, at 15:36, Leon Romanovsky wrote:
> > > On Tue, Jun 10, 2025 at 05:03:14PM +0530, K Prateek Nayak wrote:
> > >> Hello Leon,
> > >> 
> > >> On 6/10/2025 4:37 PM, Leon Romanovsky wrote:
> > >> 
> > >> [..snip..]
> > >> 
> > >> > > +	if (WARN_ON(!topology_span_sane(cpu_map)))
> > >> > > +		goto error;
> > >> > 
> > >> > Hi,

<...>

> > [    0.032233] [mem 0xc0000000-0xfed1bfff] available for PCI devices
> > [    0.032237] Booting paravirtualized kernel on KVM
> > [    0.032238] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
> > [    0.036921] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
> > [    0.038074] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
> > [    0.038108] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose
> > [    0.038222] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 selinux=0 biosdevname=0 audit=0", will be passed to user space.

<...>

> 
> I don't think that's the full dmesg output, maybe a console capture
> with reduced levels?  I'm not finding the output of sched_domain_debug() and
> sched_domain_debug_one() here.

It is not reduced; it is a standard console log level capture, at which
KERN_DEBUG prints aren't shown.

I don't know why sched_verbose is implemented the way it is, but all
these KERN_DEBUG prints in sched_domain_debug_one() are not controlled
by sched_verbose.

Thanks

> 
> Thanks,
> 
> Steve Wahl
> 
> > >
> > > Thanks
> > >
> > >> 
> > >> -- 
> > >> Thanks and Regards,
> > >> Prateek
> > >> 
> > >> > 
> > >> > Thanks
> > >> > 
> > >> > > +
> > >> > >   	/* Build the groups for the domains */
> > >> > >   	for_each_cpu(i, cpu_map) {
> > >> > >   		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > >> > > -- 
> > >> > > 2.26.2
> > >> > > 
> > >>
> 
> -- 
> Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-11  6:06             ` Leon Romanovsky
@ 2025-06-11  6:56               ` K Prateek Nayak
  2025-06-12  7:41                 ` Leon Romanovsky
  0 siblings, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-11  6:56 UTC (permalink / raw)
  To: Leon Romanovsky, Steve Wahl
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Vishal Chourasia, samir,
	Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Hello Leon,

On 6/11/2025 11:36 AM, Leon Romanovsky wrote:
>> I don't think that's the full dmesg output, maybe a console capture
>> with reduced levels?  I'm not finding the output of sched_domain_debug() and
>> sched_domain_debug_one() here.
> It is not reduced, but standard debug level log, where KERN_DEBUG prints
> aren't printed.
> 
> I don't know why sched_verbose is implemented how it is implemented,
> but all these KERN_DEBUG prints in sched_domain_debug_one() are not controlled
> through sched_verbose.

Sorry for this oversight! Would it be possible to get the logs with
"ignore_loglevel" added to the kernel cmdline? Please and thank you.

The qemu cmdline for the guest would also help! We could then try
reproducing it at our end. Thank you for all the help.
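
As a quick sanity check before recapturing, something like the
hypothetical helper below can confirm both flags actually reached the
kernel (shown here against a sample string; on the guest itself one
would pass "$(cat /proc/cmdline)" instead):

```shell
# Hypothetical helper: report whether the debug-related boot flags are
# present on a kernel command line passed as the first argument.
check_flags() {
    cmdline=" $1 "
    for flag in sched_verbose ignore_loglevel; do
        case "$cmdline" in
            *" $flag "*) echo "$flag: present" ;;
            *)           echo "$flag: MISSING" ;;
        esac
    done
}

check_flags "root=UUID=xxxx ro console=ttyS1,115200 sched_verbose ignore_loglevel"
```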

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-11  6:56               ` K Prateek Nayak
@ 2025-06-12  7:41                 ` Leon Romanovsky
  2025-06-12  9:30                   ` K Prateek Nayak
  0 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-12  7:41 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On Wed, Jun 11, 2025 at 12:26:53PM +0530, K Prateek Nayak wrote:
> Hello Leon,
> 
> On 6/11/2025 11:36 AM, Leon Romanovsky wrote:
> > > I don't think that's the full dmesg output, maybe a console capture
> > > with reduced levels?  I'm not finding the output of sched_domain_debug() and
> > > sched_domain_debug_one() here.
> > It is not reduced, but standard debug level log, where KERN_DEBUG prints
> > aren't printed.
> > 
> > I don't know why sched_verbose is implemented how it is implemented,
> > but all these KERN_DEBUG prints in sched_domain_debug_one() are not controlled
> > through sched_verbose.
> 
> Sorry for this oversight! Would it be possible to get the logs with
> "ignore_loglevel" added to the kernel cmdline? Please and thank you.

 [    0.000000] Linux version 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 (svc-nbu-sw-nic@kernel-build-19019-747zt-3xglw-0bkcj) (gcc (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3), GNU ld version 2.41-38.fc40) #1 SMP Mon Jun  9 11:49:32 UTC 2025
 [    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose ignore_loglevel
 [    0.000000] BIOS-provided physical RAM map:
 [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
 [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
 [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
 [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffdbfff] usable
 [    0.000000] BIOS-e820: [mem 0x000000007ffdc000-0x000000007fffffff] reserved
 [    0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
 [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
 [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
 [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
 [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000043fffffff] usable
 [    0.000000] printk: debug: ignoring loglevel setting.
 [    0.000000] NX (Execute Disable) protection: active
 [    0.000000] APIC: Static calls initialized
 [    0.000000] SMBIOS 2.8 present.
 [    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [    0.000000] DMI: Memory slots populated: 1/1
 [    0.000000] Hypervisor detected: KVM
 [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
 [    0.000000] kvm-clock: using sched offset of 68382447374 cycles
 [    0.000002] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
 [    0.000008] tsc: Detected 2593.904 MHz processor
 [    0.000797] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
 [    0.000799] e820: remove [mem 0x000a0000-0x000fffff] usable
 [    0.000802] last_pfn = 0x440000 max_arch_pfn = 0x400000000
 [    0.000830] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
 [    0.000834] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
 [    0.000869] last_pfn = 0x7ffdc max_arch_pfn = 0x400000000
 [    0.000878] Using GB pages for direct mapping
 [    0.001019] RAMDISK: [mem 0x3572a000-0x36b8cfff]
 [    0.001023] ACPI: Early table checksum verification disabled
 [    0.001036] ACPI: RSDP 0x00000000000F59B0 000014 (v00 BOCHS )
 [    0.001040] ACPI: RSDT 0x000000007FFE22A8 00003C (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
 [    0.001050] ACPI: FACP 0x000000007FFE1E53 0000F4 (v03 BOCHS  BXPCFACP 00000001 BXPC 00000001)
 [    0.001056] ACPI: DSDT 0x000000007FFDFC40 002213 (v01 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
 [    0.001059] ACPI: FACS 0x000000007FFDFC00 000040
 [    0.001062] ACPI: APIC 0x000000007FFE1F47 0000C0 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
 [    0.001065] ACPI: HPET 0x000000007FFE2007 000038 (v01 BOCHS  BXPCHPET 00000001 BXPC 00000001)
 [    0.001068] ACPI: SRAT 0x000000007FFE203F 0001E8 (v01 BOCHS  BXPCSRAT 00000001 BXPC 00000001)
 [    0.001071] ACPI: SLIT 0x000000007FFE2227 000045 (v01 BOCHS  BXPCSLIT 00000001 BXPC 00000001)
 [    0.001074] ACPI: MCFG 0x000000007FFE226C 00003C (v01 BOCHS  BXPCMCFG 00000001 BXPC 00000001)
 [    0.001076] ACPI: Reserving FACP table memory at [mem 0x7ffe1e53-0x7ffe1f46]
 [    0.001078] ACPI: Reserving DSDT table memory at [mem 0x7ffdfc40-0x7ffe1e52]
 [    0.001079] ACPI: Reserving FACS table memory at [mem 0x7ffdfc00-0x7ffdfc3f]
 [    0.001079] ACPI: Reserving APIC table memory at [mem 0x7ffe1f47-0x7ffe2006]
 [    0.001080] ACPI: Reserving HPET table memory at [mem 0x7ffe2007-0x7ffe203e]
 [    0.001080] ACPI: Reserving SRAT table memory at [mem 0x7ffe203f-0x7ffe2226]
 [    0.001081] ACPI: Reserving SLIT table memory at [mem 0x7ffe2227-0x7ffe226b]
 [    0.001082] ACPI: Reserving MCFG table memory at [mem 0x7ffe226c-0x7ffe22a7]
 [    0.001132] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
 [    0.001134] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
 [    0.001135] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
 [    0.001136] ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1ffffffff]
 [    0.001137] ACPI: SRAT: Node 2 PXM 2 [mem 0x200000000-0x2bfffffff]
 [    0.001138] ACPI: SRAT: Node 3 PXM 3 [mem 0x2c0000000-0x37fffffff]
 [    0.001138] ACPI: SRAT: Node 4 PXM 4 [mem 0x380000000-0x43fffffff]
 [    0.001141] NUMA: Initialized distance table, cnt=5
 [    0.001143] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
 [    0.001145] NUMA: Node 0 [mem 0x00001000-0x7fffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0x00001000-0x13fffffff]
 [    0.001150] NODE_DATA(0) allocated [mem 0x13fffc800-0x13fffffff]
 [    0.001156] NODE_DATA(1) allocated [mem 0x1ffffc800-0x1ffffffff]
 [    0.001161] NODE_DATA(2) allocated [mem 0x2bfffc800-0x2bfffffff]
 [    0.001165] NODE_DATA(3) allocated [mem 0x37fffc800-0x37fffffff]
 [    0.001170] NODE_DATA(4) allocated [mem 0x43fff9800-0x43fffcfff]
 [    0.001204] Zone ranges:
 [    0.001205]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
 [    0.001206]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
 [    0.001208]   Normal   [mem 0x0000000100000000-0x000000043fffffff]
 [    0.001209] Movable zone start for each node
 [    0.001209] Early memory node ranges
 [    0.001210]   node   0: [mem 0x0000000000001000-0x000000000009efff]
 [    0.001212]   node   0: [mem 0x0000000000100000-0x000000007ffdbfff]
 [    0.001213]   node   0: [mem 0x0000000100000000-0x000000013fffffff]
 [    0.001214]   node   1: [mem 0x0000000140000000-0x00000001ffffffff]
 [    0.001215]   node   2: [mem 0x0000000200000000-0x00000002bfffffff]
 [    0.001216]   node   3: [mem 0x00000002c0000000-0x000000037fffffff]
 [    0.001217]   node   4: [mem 0x0000000380000000-0x000000043fffffff]
 [    0.001218] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
 [    0.001220] Initmem setup node 1 [mem 0x0000000140000000-0x00000001ffffffff]
 [    0.001221] Initmem setup node 2 [mem 0x0000000200000000-0x00000002bfffffff]
 [    0.001222] Initmem setup node 3 [mem 0x00000002c0000000-0x000000037fffffff]
 [    0.001223] Initmem setup node 4 [mem 0x0000000380000000-0x000000043fffffff]
 [    0.001227] On node 0, zone DMA: 1 pages in unavailable ranges
 [    0.001263] On node 0, zone DMA: 97 pages in unavailable ranges
 [    0.007283] On node 0, zone Normal: 36 pages in unavailable ranges
 [    0.032111] ACPI: PM-Timer IO Port: 0x608
 [    0.032126] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
 [    0.032160] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
 [    0.032164] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
 [    0.032166] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
 [    0.032167] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
 [    0.032173] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
 [    0.032174] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
 [    0.032179] ACPI: Using ACPI (MADT) for SMP configuration information
 [    0.032180] ACPI: HPET id: 0x8086a201 base: 0xfed00000
 [    0.032184] TSC deadline timer available
 [    0.032188] CPU topo: Max. logical packages:  10
 [    0.032189] CPU topo: Max. logical dies:      10
 [    0.032189] CPU topo: Max. dies per package:   1
 [    0.032194] CPU topo: Max. threads per core:   1
 [    0.032194] CPU topo: Num. cores per package:     1
 [    0.032195] CPU topo: Num. threads per package:   1
 [    0.032195] CPU topo: Allowing 10 present CPUs plus 0 hotplug CPUs
 [    0.032209] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
 [    0.032217] kvm-guest: KVM setup pv remote TLB flush
 [    0.032219] kvm-guest: setup PV sched yield
 [    0.032234] [mem 0xc0000000-0xfed1bfff] available for PCI devices
 [    0.032237] Booting paravirtualized kernel on KVM
 [    0.032238] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
 [    0.036873] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
 [    0.038028] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
 [    0.038033] pcpu-alloc: s177240 r8192 d31656 u1048576 alloc=1*2097152
 [    0.038035] pcpu-alloc: [0] 00 01 [1] 02 03 [2] 04 05 [3] 06 07 
 [    0.038040] pcpu-alloc: [4] 08 09 
 [    0.038062] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose ignore_loglevel
 [    0.038186] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 selinux=0 biosdevname=0 audit=0", will be passed to user space.
 [    0.038199] random: crng init done
 [    0.038200] printk: log_buf_len individual max cpu contribution: 4096 bytes
 [    0.038201] printk: log_buf_len total cpu_extra contributions: 36864 bytes
 [    0.038201] printk: log_buf_len min size: 65536 bytes
 [    0.038295] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
 [    0.038296] printk: early log buf free: 56704(86%)
 [    0.038437] software IO TLB: area num 16.
 [    0.049582] Fallback order for Node 0: 0 4 3 2 1 
 [    0.049586] Fallback order for Node 1: 1 4 3 2 0 
 [    0.049588] Fallback order for Node 2: 2 4 3 0 1 
 [    0.049591] Fallback order for Node 3: 3 4 1 0 2 
 [    0.049593] Fallback order for Node 4: 4 0 1 2 3 
 [    0.049598] Built 5 zonelists, mobility grouping on.  Total pages: 3932026
 [    0.049599] Policy zone: Normal
 [    0.049601] mem auto-init: stack:off, heap alloc:off, heap free:off
 [    0.073185] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
 [    0.082951] ftrace: allocating 46168 entries in 182 pages
 [    0.082953] ftrace: allocated 182 pages with 5 groups
 [    0.083097] rcu: Hierarchical RCU implementation.
 [    0.083098] rcu: 	RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
 [    0.083099] 	Rude variant of Tasks RCU enabled.
 [    0.083100] 	Tracing variant of Tasks RCU enabled.
 [    0.083100] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
 [    0.083101] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
 [    0.083110] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
 [    0.083112] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
 [    0.089648] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
 [    0.089836] rcu: srcu_init: Setting srcu_struct sizes based on contention.
 [    0.100697] Console: colour VGA+ 80x25
 [    0.100700] printk: legacy console [tty0] enabled
 [    0.132430] printk: legacy console [ttyS1] enabled
 [    0.229785] ACPI: Core revision 20250404
 [    0.230478] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
 [    0.231809] APIC: Switch to symmetric I/O mode setup
 [    0.232506] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
 [    0.233506] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
 [    0.234730] kvm-guest: setup PV IPIs
 [    0.236077] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
 [    0.236911] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2563bd843df, max_idle_ns: 440795257314 ns
 [    0.238295] Calibrating delay loop (skipped) preset value.. 5187.80 BogoMIPS (lpj=10375616)
 [    0.242677] x86/cpu: User Mode Instruction Prevention (UMIP) activated
 [    0.243537] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
 [    0.244273] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
 [    0.245073] Speculative Store Bypass: Vulnerable
 [    0.245714] GDS: Unknown: Dependent on hypervisor status
 [    0.246511] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
 [    0.247580] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
 [    0.248414] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
 [    0.250296] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
 [    0.251201] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
 [    0.251970] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
 [    0.254507] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
 [    0.255330] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
 [    0.256198] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
 [    0.257000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
 [    0.257814] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
 [    0.258503] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
 [    0.259316] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
 [    0.260115] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
 [    0.260921] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
 [    0.262998] Freeing SMP alternatives memory: 48K
 [    0.263638] pid_max: default: 32768 minimum: 301
 [    0.264305] LSM: initializing lsm=capability
 [    0.264928] stackdepot: allocating hash table of 1048576 entries via kvcalloc
 [    0.272932] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
 [    0.275571] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
 [    0.276757] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
 [    0.277798] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
 [    0.278910] smpboot: CPU0: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (family: 0x6, model: 0x55, stepping: 0x7)
 [    0.280313] Performance Events: Skylake events, full-width counters, Intel PMU driver.
 [    0.281375] ... version:                2
 [    0.281946] ... bit width:              48
 [    0.282295] ... generic registers:      4
 [    0.282463] ... value mask:             0000ffffffffffff
 [    0.283187] ... max period:             00007fffffffffff
 [    0.283911] ... fixed-purpose events:   3
 [    0.284485] ... event mask:             000000070000000f
 [    0.285356] signal: max sigframe size: 3216
 [    0.285969] rcu: Hierarchical SRCU implementation.
 [    0.286476] rcu: 	Max phase no-delay instances is 1000.
 [    0.287216] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
 [    0.288498] smp: Bringing up secondary CPUs ...
 [    0.289225] smpboot: x86: Booting SMP configuration:
 [    0.289900] .... node  #0, CPUs:        #1
 [    0.290511] .... node  #1, CPUs:    #2  #3
 [    0.291559] .... node  #2, CPUs:    #4  #5
 [    0.292557] .... node  #3, CPUs:    #6  #7
 [    0.293593] .... node  #4, CPUs:    #8  #9
 [    0.326310] smp: Brought up 5 nodes, 10 CPUs
 [    0.327532] smpboot: Total of 10 processors activated (51878.08 BogoMIPS)
 [    0.329252] ------------[ cut here ]------------
 [    0.329252] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
 [    0.330608] Modules linked in:
 [    0.331050] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE 
 [    0.332386] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [    0.333767] RIP: 0010:build_sched_domains+0xe67/0x13a0
 [    0.334298] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
 [    0.336635] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
 [    0.337326] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
 [    0.338234] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
 [    0.338523] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
 [    0.339425] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
 [    0.340323] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
 [    0.341221] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
 [    0.342298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [    0.343096] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
 [    0.344042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [    0.344927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [    0.345811] Call Trace:
 [    0.346191]  <TASK>
 [    0.346429]  sched_init_smp+0x32/0xa0
 [    0.346944]  ? stop_machine+0x2c/0x40
 [    0.347460]  kernel_init_freeable+0xf5/0x260
 [    0.348031]  ? rest_init+0xc0/0xc0
 [    0.348513]  kernel_init+0x16/0x120
 [    0.349008]  ret_from_fork+0x5e/0xd0
 [    0.349510]  ? rest_init+0xc0/0xc0
 [    0.349998]  ret_from_fork_asm+0x11/0x20
 [    0.350464]  </TASK>
 [    0.350812] ---[ end trace 0000000000000000 ]---

> 
> Even the qemu cmdline for the guest can help! We can try reproducing
> it at our end then. Thank you for all the help.

It is a custom QEMU with limited access to the hypervisor. This crash is
inside the VM.

Thanks

> 
> -- 
> Thanks and Regards,
> Prateek
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-12  7:41                 ` Leon Romanovsky
@ 2025-06-12  9:30                   ` K Prateek Nayak
  2025-06-12 10:41                     ` K Prateek Nayak
  0 siblings, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-12  9:30 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Hello Leon,

Thank you for the additional info!

On 6/12/2025 1:11 PM, Leon Romanovsky wrote:
>   [    0.032188] CPU topo: Max. logical packages:  10
>   [    0.032189] CPU topo: Max. logical dies:      10
>   [    0.032189] CPU topo: Max. dies per package:   1
>   [    0.032194] CPU topo: Max. threads per core:   1
>   [    0.032194] CPU topo: Num. cores per package:     1
>   [    0.032195] CPU topo: Num. threads per package:   1
>   [    0.032195] CPU topo: Allowing 10 present CPUs plus 0 hotplug CPUs

This indicates each CPU is a socket, leading to 10 sockets ...

>   [    0.288498] smp: Bringing up secondary CPUs ...
>   [    0.289225] smpboot: x86: Booting SMP configuration:
>   [    0.289900] .... node  #0, CPUs:        #1
>   [    0.290511] .... node  #1, CPUs:    #2  #3
>   [    0.291559] .... node  #2, CPUs:    #4  #5
>   [    0.292557] .... node  #3, CPUs:    #6  #7
>   [    0.293593] .... node  #4, CPUs:    #8  #9
>   [    0.326310] smp: Brought up 5 nodes, 10 CPUs

... and this indicates two sockets are grouped as one NUMA node,
leading to 5 nodes in total. I tried the following:

     qemu-system-x86_64 -enable-kvm \
     -cpu EPYC-Milan-v2 -m 20G -smp cpus=10,sockets=10 \
     -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     ...

but could not hit this issue with v6.16-rc1 kernel and QEMU emulator
version 10.0.50 (v10.0.0-1610-gd9ce74873a)

>   [    0.327532] smpboot: Total of 10 processors activated (51878.08 BogoMIPS)
>   [    0.329252] ------------[ cut here ]------------
>   [    0.329252] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
>   [    0.330608] Modules linked in:
>   [    0.331050] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
>   [    0.332386] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>   [    0.333767] RIP: 0010:build_sched_domains+0xe67/0x13a0
>   [    0.334298] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
>   [    0.336635] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
>   [    0.337326] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
>   [    0.338234] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
>   [    0.338523] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
>   [    0.339425] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
>   [    0.340323] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
>   [    0.341221] FS:  0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
>   [    0.342298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [    0.343096] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
>   [    0.344042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   [    0.344927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   [    0.345811] Call Trace:
>   [    0.346191]  <TASK>
>   [    0.346429]  sched_init_smp+0x32/0xa0
>   [    0.346944]  ? stop_machine+0x2c/0x40
>   [    0.347460]  kernel_init_freeable+0xf5/0x260
>   [    0.348031]  ? rest_init+0xc0/0xc0
>   [    0.348513]  kernel_init+0x16/0x120
>   [    0.349008]  ret_from_fork+0x5e/0xd0
>   [    0.349510]  ? rest_init+0xc0/0xc0
>   [    0.349998]  ret_from_fork_asm+0x11/0x20
>   [    0.350464]  </TASK>
>   [    0.350812] ---[ end trace 0000000000000000 ]---

Ah! Since this happens so early, the topology isn't created yet for
the debug prints to hit! Is it possible to get a dmesg with
"ignore_loglevel" and "sched_verbose" on an older kernel that
did not throw this error on the same host?

> 
>>
>> Even the qemu cmdline for the guest can help! We can try reproducing
>> it at our end then. Thank you for all the help.
> 
> It is a custom QEMU with limited access to the hypervisor. This crash is
> inside the VM.

Noted! Thanks a ton for all the data provided.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-12  9:30                   ` K Prateek Nayak
@ 2025-06-12 10:41                     ` K Prateek Nayak
  2025-06-15  6:42                       ` Leon Romanovsky
  0 siblings, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-12 10:41 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On 6/12/2025 3:00 PM, K Prateek Nayak wrote:
> Ah! Since this happens so early topology isn't created yet for
> the debug prints to hit! Is it possible to get a dmesg with
> "ignore_loglevel" and "sched_verbose" on an older kernel that
> did not throw this error on the same host?

One better would be running with the following diff on top of v6.16-rc1,
if possible:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9026d325d0fd..811c8d0f5b9a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2398,7 +2398,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
  {
  	struct sched_domain_topology_level *tl;
  	struct cpumask *covered, *id_seen;
-	int cpu;
+	int cpu, id;
  
  	lockdep_assert_held(&sched_domains_mutex);
  	covered = sched_domains_tmpmask;
@@ -2421,19 +2421,21 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
  		 */
  		for_each_cpu(cpu, cpu_map) {
  			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
-			int id;
  
  			/* lowest bit set in this mask is used as a unique id */
  			id = cpumask_first(tl_cpu_mask);
  
+			pr_warn("tl(%s) CPU(%d) ID(%d) CPU_TL_SPAN(%*pbl) ID_TL_SPAN(%*pbl)\n",
+				tl->name, cpu, id, cpumask_pr_args(tl->mask(cpu)), cpumask_pr_args(tl->mask(id)));
+
  			if (cpumask_test_cpu(id, id_seen)) {
  				/* First CPU has already been seen, ensure identical spans */
  				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
-					return false;
+					goto fail;
  			} else {
  				/* First CPU hasn't been seen before, ensure it's a completely new span */
  				if (cpumask_intersects(tl_cpu_mask, covered))
-					return false;
+					goto fail;
  
  				cpumask_or(covered, covered, tl_cpu_mask);
  				cpumask_set_cpu(id, id_seen);
@@ -2441,6 +2443,16 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
  		}
  	}
  	return true;
+
+fail:
+	pr_warn("Failed tl: %s\n", tl->name);
+	pr_warn("Failed for CPU: %d\n", cpu);
+	pr_warn("ID CPU at tl: %d\n", id);
+	pr_warn("Failed CPU span at tl: %*pbl\n", cpumask_pr_args(tl->mask(cpu)));
+	pr_warn("ID CPU span: %*pbl\n", cpumask_pr_args(tl->mask(id)));
+	pr_warn("ID CPUs seen: %*pbl\n", cpumask_pr_args(id_seen));
+	pr_warn("CPUs covered: %*pbl\n", cpumask_pr_args(covered));
+	return false;
  }
  
  /*
--

In my case, it logs the following (no failures seen yet):

     tl(SMT) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
     tl(SMT) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
     tl(SMT) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
     tl(SMT) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
     tl(SMT) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
     tl(SMT) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
     tl(SMT) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
     tl(SMT) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
     tl(SMT) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
     tl(SMT) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
     tl(CLS) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
     tl(CLS) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
     tl(CLS) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
     tl(CLS) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
     tl(CLS) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
     tl(CLS) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
     tl(CLS) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
     tl(CLS) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
     tl(CLS) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
     tl(CLS) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
     tl(MC) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
     tl(MC) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
     tl(MC) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
     tl(MC) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
     tl(MC) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
     tl(MC) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
     tl(MC) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
     tl(MC) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
     tl(MC) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
     tl(MC) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
     tl(PKG) CPU(0) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
     tl(PKG) CPU(1) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
     tl(PKG) CPU(2) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
     tl(PKG) CPU(3) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
     tl(PKG) CPU(4) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
     tl(PKG) CPU(5) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
     tl(PKG) CPU(6) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
     tl(PKG) CPU(7) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
     tl(PKG) CPU(8) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
     tl(PKG) CPU(9) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
     tl(NODE) CPU(0) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(1) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(2) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(3) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(4) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(5) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(6) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(7) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(8) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
     tl(NODE) CPU(9) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
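As a cross-check of the logic being instrumented, the id-based span check can be modeled in plain userspace Python (a sketch, not the kernel code; the `pkg` and `node` masks below are assumptions mirroring the PKG and NODE spans printed above):

```python
def span_sane(levels):
    """Userspace model of topology_span_sane(): at each topology level,
    the lowest CPU in a span acts as that span's id; CPUs sharing an id
    must report identical spans, and a span with a fresh id must not
    overlap anything already covered."""
    for masks in levels:                 # masks[cpu] = frozenset of CPUs in cpu's span
        covered, id_seen = set(), set()
        for cpu in sorted(masks):
            span = masks[cpu]
            span_id = min(span)          # "lowest bit set in this mask"
            if span_id in id_seen:
                if masks[span_id] != span:   # same id, different span
                    return False
            else:
                if span & covered:           # fresh id, overlapping span
                    return False
                covered |= span
                id_seen.add(span_id)
    return True

# Masks mirroring the log above: PKG pairs CPUs {0,1}, {2,3}, ... {8,9};
# NODE spans all of 0-9 for every CPU.
pkg = {c: frozenset({c // 2 * 2, c // 2 * 2 + 1}) for c in range(10)}
node = {c: frozenset(range(10)) for c in range(10)}
print(span_sane([pkg, node]))            # True for this sane layout
```

A layout where, say, CPU3's PKG span leaked into CPU4's would trip the overlap test and return false, which is the path that ends in the build_sched_domains() warning.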

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-12 10:41                     ` K Prateek Nayak
@ 2025-06-15  6:42                       ` Leon Romanovsky
  2025-06-16 14:18                         ` Steve Wahl
  0 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-15  6:42 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On Thu, Jun 12, 2025 at 04:11:52PM +0530, K Prateek Nayak wrote:
> On 6/12/2025 3:00 PM, K Prateek Nayak wrote:
> > Ah! Since this happens so early topology isn't created yet for
> > the debug prints to hit! Is it possible to get a dmesg with
> > "ignore_loglevel" and "sched_verbose" on an older kernel that
> > did not throw this error on the same host?

This is the dmesg with the two commits "sched/topology: Refinement to
topology_span_sane speedup" and "sched/topology: improve
topology_span_sane speed" reverted:

[    0.034409] TSC deadline timer available
[    0.034413] CPU topo: Max. logical packages:  10
[    0.034414] CPU topo: Max. logical dies:      10
[    0.034414] CPU topo: Max. dies per package:   1
[    0.034418] CPU topo: Max. threads per core:   1
[    0.034418] CPU topo: Num. cores per package:     1
[    0.034419] CPU topo: Num. threads per package:   1
[    0.034419] CPU topo: Allowing 10 present CPUs plus 0 hotplug CPUs
[    0.034433] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
[    0.034441] kvm-guest: KVM setup pv remote TLB flush
[    0.034444] kvm-guest: setup PV sched yield
[    0.034458] [mem 0xc0000000-0xfed1bfff] available for PCI devices
[    0.034462] Booting paravirtualized kernel on KVM
[    0.034463] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.039358] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
[    0.040564] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
[    0.040569] pcpu-alloc: s177240 r8192 d31656 u1048576 alloc=1*2097152
[    0.040571] pcpu-alloc: [0] 00 01 [1] 02 03 [2] 04 05 [3] 06 07
[    0.040576] pcpu-alloc: [4] 08 09
[    0.040596] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_10_14_45 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose ignore_loglevel
[    0.040729] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_10_14_45 selinux=0 biosdevname=0 audit=0", will be passed to user space.
[    0.040741] random: crng init done
[    0.040742] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.040743] printk: log_buf_len total cpu_extra contributions: 36864 bytes
[    0.040743] printk: log_buf_len min size: 65536 bytes
[    0.040844] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
[    0.040845] printk: early log buf free: 56704(86%)
[    0.040976] software IO TLB: area num 16.
[    0.052732] Fallback order for Node 0: 0 4 3 2 1
[    0.052736] Fallback order for Node 1: 1 4 3 2 0
[    0.052739] Fallback order for Node 2: 2 4 3 0 1
[    0.052741] Fallback order for Node 3: 3 4 1 0 2
[    0.052744] Fallback order for Node 4: 4 0 1 2 3
[    0.052749] Built 5 zonelists, mobility grouping on.  Total pages: 3932026
[    0.052750] Policy zone: Normal
[    0.052751] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.078003] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
[    0.088263] ftrace: allocating 46166 entries in 182 pages
[    0.088265] ftrace: allocated 182 pages with 5 groups
[    0.088409] rcu: Hierarchical RCU implementation.
[    0.088409] rcu:        RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
[    0.088411] Rude variant of Tasks RCU enabled.
[    0.088411] Tracing variant of Tasks RCU enabled.
[    0.088412] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.088413] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
[    0.088422] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
[    0.088424] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
[    0.095295] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
[    0.095483] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.110676] Console: colour VGA+ 80x25
[    0.110679] printk: legacy console [tty0] enabled
[    0.160222] printk: legacy console [ttyS1] enabled
[    0.289596] ACPI: Core revision 20250404
[    0.290499] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.292323] APIC: Switch to symmetric I/O mode setup
[    0.293258] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
[    0.294633] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
[    0.296279] kvm-guest: setup PV IPIs
[    0.298170] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.299291] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x21134f58f0d, max_idle_ns: 440795217993 ns
[    0.301176] Calibrating delay loop (skipped) preset value.. 4589.21 BogoMIPS (lpj=9178432)
[    0.302763] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.303945] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.305458] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.306528] Speculative Store Bypass: Vulnerable
[    0.307389] GDS: Unknown: Dependent on hypervisor status
[    0.308387] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.309599] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.310740] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.313177] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.314406] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[    0.315447] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.316572] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.317494] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.318748] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.319841] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[    0.321493] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[    0.322579] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
[    0.323691] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
[    0.324820] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
[    0.325489] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
[    0.327966] Freeing SMP alternatives memory: 48K
[    0.329452] pid_max: default: 32768 minimum: 301
[    0.330372] LSM: initializing lsm=capability
[    0.331218] stackdepot: allocating hash table of 1048576 entries via kvcalloc
[    0.339580] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
[    0.342571] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
[    0.345701] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[    0.347131] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[    0.348912] smpboot: CPU0: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (family: 0x6, model: 0x55, stepping: 0x4)
[    0.349176] Performance Events: Skylake events, full-width counters, Intel PMU driver.
[    0.349176] ... version:                2
[    0.349179] ... bit width:              48
[    0.349954] ... generic registers:      4
[    0.350715] ... value mask:             0000ffffffffffff
[    0.351674] ... max period:             00007fffffffffff
[    0.352650] ... fixed-purpose events:   3
[    0.353178] ... event mask:             000000070000000f
[    0.354281] signal: max sigframe size: 3216
[    0.355106] rcu: Hierarchical SRCU implementation.
[    0.356006] rcu:        Max phase no-delay instances is 1000.
[    0.357014] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[    0.357826] smp: Bringing up secondary CPUs ...
[    0.358794] smpboot: x86: Booting SMP configuration:
[    0.359726] .... node  #0, CPUs:        #1
[    0.360869] .... node  #1, CPUs:    #2  #3
[    0.361475] .... node  #2, CPUs:    #4  #5
[    0.362677] .... node  #3, CPUs:    #6  #7
[    0.363879] .... node  #4, CPUs:    #8  #9
[    0.397193] smp: Brought up 5 nodes, 10 CPUs
[    0.398864] smpboot: Total of 10 processors activated (45892.16 BogoMIPS)
[    0.400864] CPU0 attaching sched-domain(s):
[    0.401441]  domain-0: span=0-1 level=PKG
[    0.402243]   groups: 0:{ span=0 }, 1:{ span=1 }
[    0.403145]   domain-1: span=0-1,8-9 level=NUMA
[    0.404026]    groups: 0:{ span=0-1 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.405178]    domain-2: span=0-1,6-9 level=NUMA
[    0.406077]     groups: 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
[    0.407704]     domain-3: span=0-1,4-9 level=NUMA
[    0.408609]      groups: 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
[    0.409640]      domain-4: span=0-9 level=NUMA
[    0.410466]       groups: 0:{ span=0-1,4-9 mask=0-1 cap=8192 }, 2:{ span=2-9 mask=2-3 cap=8192 }
[    0.412025] CPU1 attaching sched-domain(s):
[    0.412831]  domain-0: span=0-1 level=PKG
[    0.413429]   groups: 1:{ span=1 }, 0:{ span=0 }
[    0.414289]   domain-1: span=0-1,8-9 level=NUMA
[    0.415138]    groups: 0:{ span=0-1 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.416310]    domain-2: span=0-1,6-9 level=NUMA
[    0.417166]     groups: 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
[    0.417636]     domain-3: span=0-1,4-9 level=NUMA
[    0.418511]      groups: 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
[    0.420089]      domain-4: span=0-9 level=NUMA
[    0.420937]       groups: 0:{ span=0-1,4-9 mask=0-1 cap=8192 }, 2:{ span=2-9 mask=2-3 cap=8192 }
[    0.421649] CPU2 attaching sched-domain(s):
[    0.422470]  domain-0: span=2-3 level=PKG
[    0.423253]   groups: 2:{ span=2 }, 3:{ span=3 }
[    0.424138]   domain-1: span=2-3,8-9 level=NUMA
[    0.425012]    groups: 2:{ span=2-3 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.425523]    domain-2: span=2-3,6-9 level=NUMA
[    0.426401]     groups: 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
[    0.427994]     domain-3: span=2-9 level=NUMA
[    0.428839]      groups: 2:{ span=2-3,6-9 mask=2-3 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
[    0.429632]      domain-4: span=0-9 level=NUMA
[    0.430486]       groups: 2:{ span=2-9 mask=2-3 cap=8192 }, 0:{ span=0-1,4-9 mask=0-1 cap=8192 }
[    0.432109] CPU3 attaching sched-domain(s):
[    0.432930]  domain-0: span=2-3 level=PKG
[    0.433431]   groups: 3:{ span=3 }, 2:{ span=2 }
[    0.434303]   domain-1: span=2-3,8-9 level=NUMA
[    0.435174]    groups: 2:{ span=2-3 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.436375]    domain-2: span=2-3,6-9 level=NUMA
[    0.437178]     groups: 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
[    0.438756]     domain-3: span=2-9 level=NUMA
[    0.439580]      groups: 2:{ span=2-3,6-9 mask=2-3 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
[    0.441178]      domain-4: span=0-9 level=NUMA
[    0.442012]       groups: 2:{ span=2-9 mask=2-3 cap=8192 }, 0:{ span=0-1,4-9 mask=0-1 cap=8192 }
[    0.443577] CPU4 attaching sched-domain(s):
[    0.444371]  domain-0: span=4-5 level=PKG
[    0.445133]   groups: 4:{ span=4 }, 5:{ span=5 }
[    0.445445]   domain-1: span=4-5,8-9 level=NUMA
[    0.446288]    groups: 4:{ span=4-5 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.447448]    domain-2: span=4-9 level=NUMA
[    0.448264]     groups: 4:{ span=4-5,8-9 mask=4-5 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
[    0.449629]     domain-3: span=0-9 level=NUMA
[    0.450456]      groups: 4:{ span=4-9 mask=4-5 cap=6144 }, 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 2:{ span=2-3,6-9 mask=2-3 cap=6144 }
[    0.452542] CPU5 attaching sched-domain(s):
[    0.453179]  domain-0: span=4-5 level=PKG
[    0.453971]   groups: 5:{ span=5 }, 4:{ span=4 }
[    0.454858]   domain-1: span=4-5,8-9 level=NUMA
[    0.455741]    groups: 4:{ span=4-5 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.456899]    domain-2: span=4-9 level=NUMA
[    0.457446]     groups: 4:{ span=4-5,8-9 mask=4-5 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
[    0.458994]     domain-3: span=0-9 level=NUMA
[    0.459815]      groups: 4:{ span=4-9 mask=4-5 cap=6144 }, 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 2:{ span=2-3,6-9 mask=2-3 cap=6144 }
[    0.461734] CPU6 attaching sched-domain(s):
[    0.462527]  domain-0: span=6-7 level=PKG
[    0.463297]   groups: 6:{ span=6 }, 7:{ span=7 }
[    0.464154]   domain-1: span=6-9 level=NUMA
[    0.464936]    groups: 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.465503]    domain-2: span=0-9 level=NUMA
[    0.466311]     groups: 6:{ span=6-9 mask=6-7 cap=4096 }, 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 4:{ span=4-5,8-9 mask=4-5 cap=4096 }
[    0.469027] CPU7 attaching sched-domain(s):
[    0.469432]  domain-0: span=6-7 level=PKG
[    0.470238]   groups: 7:{ span=7 }, 6:{ span=6 }
[    0.471127]   domain-1: span=6-9 level=NUMA
[    0.471940]    groups: 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }
[    0.473158]    domain-2: span=0-9 level=NUMA
[    0.473442]     groups: 6:{ span=6-9 mask=6-7 cap=4096 }, 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 4:{ span=4-5,8-9 mask=4-5 cap=4096 }
[    0.476610] CPU8 attaching sched-domain(s):
[    0.477179]  domain-0: span=8-9 level=PKG
[    0.477957]   groups: 8:{ span=8 }, 9:{ span=9 }
[    0.478841]   domain-1: span=0-9 level=NUMA
[    0.479661]    groups: 8:{ span=8-9 cap=2048 }, 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }
[    0.481764] CPU9 attaching sched-domain(s):
[    0.482570]  domain-0: span=8-9 level=PKG
[    0.483360]   groups: 9:{ span=9 }, 8:{ span=8 }
[    0.484245]   domain-1: span=0-9 level=NUMA
[    0.485065]    groups: 8:{ span=8-9 cap=2048 }, 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }
[    0.485780] root domain span: 0-9
[    0.486507] Memory: 15306544K/15728104K available (14320K kernel code, 2394K rwdata, 9212K rodata, 1668K init, 1272K bss, 371220K reserved, 0K cma-reserved)
[    0.489652] devtmpfs: initialized

> 
> One better would be running with the following diff on top of v6.16-rc1,
> if possible:

We are working to get this one too.

Thanks

> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9026d325d0fd..811c8d0f5b9a 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2398,7 +2398,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  {
>  	struct sched_domain_topology_level *tl;
>  	struct cpumask *covered, *id_seen;
> -	int cpu;
> +	int cpu, id;
>  	lockdep_assert_held(&sched_domains_mutex);
>  	covered = sched_domains_tmpmask;
> @@ -2421,19 +2421,21 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  		 */
>  		for_each_cpu(cpu, cpu_map) {
>  			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> -			int id;
>  			/* lowest bit set in this mask is used as a unique id */
>  			id = cpumask_first(tl_cpu_mask);
> +			pr_warn("tl(%s) CPU(%d) ID(%d) CPU_TL_SPAN(%*pbl) ID_TL_SPAN(%*pbl)\n",
> +				tl->name, cpu, id, cpumask_pr_args(tl->mask(cpu)), cpumask_pr_args(tl->mask(id)));
> +
>  			if (cpumask_test_cpu(id, id_seen)) {
>  				/* First CPU has already been seen, ensure identical spans */
>  				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
> -					return false;
> +					goto fail;
>  			} else {
>  				/* First CPU hasn't been seen before, ensure it's a completely new span */
>  				if (cpumask_intersects(tl_cpu_mask, covered))
> -					return false;
> +					goto fail;
>  				cpumask_or(covered, covered, tl_cpu_mask);
>  				cpumask_set_cpu(id, id_seen);
> @@ -2441,6 +2443,16 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  		}
>  	}
>  	return true;
> +
> +fail:
> +	pr_warn("Failed tl: %s\n", tl->name);
> +	pr_warn("Failed for CPU: %d\n", cpu);
> +	pr_warn("ID CPU at tl: %d\n", id);
> +	pr_warn("Failed CPU span at tl: %*pbl\n", cpumask_pr_args(tl->mask(cpu)));
> +	pr_warn("ID CPU span: %*pbl\n", cpumask_pr_args(tl->mask(id)));
> +	pr_warn("ID CPUs seen: %*pbl\n", cpumask_pr_args(id_seen));
> +	pr_warn("CPUs covered: %*pbl\n", cpumask_pr_args(covered));
> +	return false;
>  }
>  /*
> --
> 
> In my case, it logs the following (no failures seen yet):
> 
>     tl(SMT) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
>     tl(SMT) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
>     tl(SMT) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
>     tl(SMT) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
>     tl(SMT) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
>     tl(SMT) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
>     tl(SMT) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
>     tl(SMT) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
>     tl(SMT) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
>     tl(SMT) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
>     tl(CLS) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
>     tl(CLS) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
>     tl(CLS) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
>     tl(CLS) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
>     tl(CLS) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
>     tl(CLS) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
>     tl(CLS) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
>     tl(CLS) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
>     tl(CLS) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
>     tl(CLS) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
>     tl(MC) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
>     tl(MC) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
>     tl(MC) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
>     tl(MC) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
>     tl(MC) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
>     tl(MC) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
>     tl(MC) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
>     tl(MC) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
>     tl(MC) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
>     tl(MC) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
>     tl(PKG) CPU(0) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
>     tl(PKG) CPU(1) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
>     tl(PKG) CPU(2) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
>     tl(PKG) CPU(3) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
>     tl(PKG) CPU(4) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
>     tl(PKG) CPU(5) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
>     tl(PKG) CPU(6) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
>     tl(PKG) CPU(7) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
>     tl(PKG) CPU(8) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
>     tl(PKG) CPU(9) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
>     tl(NODE) CPU(0) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(1) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(2) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(3) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(4) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(5) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(6) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(7) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(8) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
>     tl(NODE) CPU(9) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> 
> -- 
> Thanks and Regards,
> Prateek
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-15  6:42                       ` Leon Romanovsky
@ 2025-06-16 14:18                         ` Steve Wahl
  2025-06-17  3:04                           ` K Prateek Nayak
  2025-06-17  7:34                           ` Leon Romanovsky
  0 siblings, 2 replies; 25+ messages in thread
From: Steve Wahl @ 2025-06-16 14:18 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: K Prateek Nayak, Steve Wahl, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel,
	Vishal Chourasia, samir, Naman Jain, Saurabh Singh Sengar,
	srivatsa, Michael Kelley, Russ Anderson, Dimitri Sivanich

On Sun, Jun 15, 2025 at 09:42:07AM +0300, Leon Romanovsky wrote:
> On Thu, Jun 12, 2025 at 04:11:52PM +0530, K Prateek Nayak wrote:
> > On 6/12/2025 3:00 PM, K Prateek Nayak wrote:
> > > Ah! Since this happens so early, the topology isn't created yet
> > > for the debug prints to hit! Is it possible to get a dmesg with
> > > "ignore_loglevel" and "sched_verbose" on an older kernel that
> > > did not throw this error on the same host?
> 
> This is the dmesg with two commits reverted: "sched/topology:
> Refinement to topology_span_sane speedup" and "sched/topology:
> improve topology_span_sane speed".

I would be interested in whether there's a difference with only the
second patch reverted.  The first patch is expected to get the exact
same results as the previous code, only faster.  The second added
simplifications suggested by others that could give different results
under conditions that were not expected to exist.  The commit message
for the second patch explains this.
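
For illustration, the per-level walk that the first patch introduces can
be modeled outside the kernel.  This is a Python sketch of the logic
(sets standing in for cpumasks), not the actual implementation: the
lowest CPU in each mask serves as the span's unique id, every CPU that
maps to the same id must report an identical mask, and distinct ids must
never produce overlapping masks.

```python
def span_sane(cpu_map, tl_mask):
    """Model of the O(N) span-sanity walk for one topology level."""
    covered = set()   # union of all spans accepted so far at this level
    id_seen = set()   # first-CPU ids already encountered
    for cpu in sorted(cpu_map):
        span = frozenset(tl_mask(cpu))
        span_id = min(span)  # lowest bit set in the mask is the unique id
        if span_id in id_seen:
            # Id already seen: this CPU must report the identical span.
            if span != frozenset(tl_mask(span_id)):
                return False
        else:
            # New id: its span must not overlap any accepted span.
            if span & covered:
                return False
            covered |= span
            id_seen.add(span_id)
    return True

# Two disjoint two-CPU packages: sane.
pkg = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}}
print(span_sane(pkg.keys(), pkg.get))    # True

# Overlapping but unequal spans: not sane.
node = {0: {0, 1, 8}, 1: {0, 1, 8}, 2: {2, 3, 8}}
print(span_sane(node.keys(), node.get))  # False
```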

Thanks,

--> Steve Wahl

> [    0.034409] TSC deadline timer available
> [    0.034413] CPU topo: Max. logical packages:  10
> [    0.034414] CPU topo: Max. logical dies:      10
> [    0.034414] CPU topo: Max. dies per package:   1
> [    0.034418] CPU topo: Max. threads per core:   1
> [    0.034418] CPU topo: Num. cores per package:     1
> [    0.034419] CPU topo: Num. threads per package:   1
> [    0.034419] CPU topo: Allowing 10 present CPUs plus 0 hotplug CPUs
> [    0.034433] kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
> [    0.034441] kvm-guest: KVM setup pv remote TLB flush
> [    0.034444] kvm-guest: setup PV sched yield
> [    0.034458] [mem 0xc0000000-0xfed1bfff] available for PCI devices
> [    0.034462] Booting paravirtualized kernel on KVM
> [    0.034463] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
> [    0.039358] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
> [    0.040564] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
> [    0.040569] pcpu-alloc: s177240 r8192 d31656 u1048576 alloc=1*2097152
> [    0.040571] pcpu-alloc: [0] 00 01 [1] 02 03 [2] 04 05 [3] 06 07
> [    0.040576] pcpu-alloc: [4] 08 09
> [    0.040596] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_10_14_45 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose ignore_loglevel
> [    0.040729] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_10_14_45 selinux=0 biosdevname=0 audit=0", will be passed to user space.
> [    0.040741] random: crng init done
> [    0.040742] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [    0.040743] printk: log_buf_len total cpu_extra contributions: 36864 bytes
> [    0.040743] printk: log_buf_len min size: 65536 bytes
> [    0.040844] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
> [    0.040845] printk: early log buf free: 56704(86%)
> [    0.040976] software IO TLB: area num 16.
> [    0.052732] Fallback order for Node 0: 0 4 3 2 1
> [    0.052736] Fallback order for Node 1: 1 4 3 2 0
> [    0.052739] Fallback order for Node 2: 2 4 3 0 1
> [    0.052741] Fallback order for Node 3: 3 4 1 0 2
> [    0.052744] Fallback order for Node 4: 4 0 1 2 3
> [    0.052749] Built 5 zonelists, mobility grouping on.  Total pages: 3932026
> [    0.052750] Policy zone: Normal
> [    0.052751] mem auto-init: stack:off, heap alloc:off, heap free:off
> [    0.078003] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
> [    0.088263] ftrace: allocating 46166 entries in 182 pages
> [    0.088265] ftrace: allocated 182 pages with 5 groups
> [    0.088409] rcu: Hierarchical RCU implementation.
> [    0.088409] rcu:        RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
> [    0.088411] Rude variant of Tasks RCU enabled.
> [    0.088411] Tracing variant of Tasks RCU enabled.
> [    0.088412] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [    0.088413] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
> [    0.088422] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
> [    0.088424] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
> [    0.095295] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
> [    0.095483] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    0.110676] Console: colour VGA+ 80x25
> [    0.110679] printk: legacy console [tty0] enabled
> [    0.160222] printk: legacy console [ttyS1] enabled
> [    0.289596] ACPI: Core revision 20250404
> [    0.290499] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
> [    0.292323] APIC: Switch to symmetric I/O mode setup
> [    0.293258] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
> [    0.294633] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
> [    0.296279] kvm-guest: setup PV IPIs
> [    0.298170] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [    0.299291] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x21134f58f0d, max_idle_ns: 440795217993 ns
> [    0.301176] Calibrating delay loop (skipped) preset value.. 4589.21 BogoMIPS (lpj=9178432)
> [    0.302763] x86/cpu: User Mode Instruction Prevention (UMIP) activated
> [    0.303945] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [    0.305458] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [    0.306528] Speculative Store Bypass: Vulnerable
> [    0.307389] GDS: Unknown: Dependent on hypervisor status
> [    0.308387] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [    0.309599] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [    0.310740] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [    0.313177] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> [    0.314406] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> [    0.315447] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> [    0.316572] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> [    0.317494] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> [    0.318748] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> [    0.319841] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
> [    0.321493] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
> [    0.322579] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
> [    0.323691] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
> [    0.324820] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
> [    0.325489] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
> [    0.327966] Freeing SMP alternatives memory: 48K
> [    0.329452] pid_max: default: 32768 minimum: 301
> [    0.330372] LSM: initializing lsm=capability
> [    0.331218] stackdepot: allocating hash table of 1048576 entries via kvcalloc
> [    0.339580] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
> [    0.342571] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
> [    0.345701] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
> [    0.347131] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
> [    0.348912] smpboot: CPU0: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (family: 0x6, model: 0x55, stepping: 0x4)
> [    0.349176] Performance Events: Skylake events, full-width counters, Intel PMU driver.
> [    0.349176] ... version:                2
> [    0.349179] ... bit width:              48
> [    0.349954] ... generic registers:      4
> [    0.350715] ... value mask:             0000ffffffffffff
> [    0.351674] ... max period:             00007fffffffffff
> [    0.352650] ... fixed-purpose events:   3
> [    0.353178] ... event mask:             000000070000000f
> [    0.354281] signal: max sigframe size: 3216
> [    0.355106] rcu: Hierarchical SRCU implementation.
> [    0.356006] rcu:        Max phase no-delay instances is 1000.
> [    0.357014] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [    0.357826] smp: Bringing up secondary CPUs ...
> [    0.358794] smpboot: x86: Booting SMP configuration:
> [    0.359726] .... node  #0, CPUs:        #1
> [    0.360869] .... node  #1, CPUs:    #2  #3
> [    0.361475] .... node  #2, CPUs:    #4  #5
> [    0.362677] .... node  #3, CPUs:    #6  #7
> [    0.363879] .... node  #4, CPUs:    #8  #9
> [    0.397193] smp: Brought up 5 nodes, 10 CPUs
> [    0.398864] smpboot: Total of 10 processors activated (45892.16 BogoMIPS)
> [    0.400864] CPU0 attaching sched-domain(s):
> [    0.401441]  domain-0: span=0-1 level=PKG
> [    0.402243]   groups: 0:{ span=0 }, 1:{ span=1 }
> [    0.403145]   domain-1: span=0-1,8-9 level=NUMA
> [    0.404026]    groups: 0:{ span=0-1 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.405178]    domain-2: span=0-1,6-9 level=NUMA
> [    0.406077]     groups: 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
> [    0.407704]     domain-3: span=0-1,4-9 level=NUMA
> [    0.408609]      groups: 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
> [    0.409640]      domain-4: span=0-9 level=NUMA
> [    0.410466]       groups: 0:{ span=0-1,4-9 mask=0-1 cap=8192 }, 2:{ span=2-9 mask=2-3 cap=8192 }
> [    0.412025] CPU1 attaching sched-domain(s):
> [    0.412831]  domain-0: span=0-1 level=PKG
> [    0.413429]   groups: 1:{ span=1 }, 0:{ span=0 }
> [    0.414289]   domain-1: span=0-1,8-9 level=NUMA
> [    0.415138]    groups: 0:{ span=0-1 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.416310]    domain-2: span=0-1,6-9 level=NUMA
> [    0.417166]     groups: 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
> [    0.417636]     domain-3: span=0-1,4-9 level=NUMA
> [    0.418511]      groups: 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
> [    0.420089]      domain-4: span=0-9 level=NUMA
> [    0.420937]       groups: 0:{ span=0-1,4-9 mask=0-1 cap=8192 }, 2:{ span=2-9 mask=2-3 cap=8192 }
> [    0.421649] CPU2 attaching sched-domain(s):
> [    0.422470]  domain-0: span=2-3 level=PKG
> [    0.423253]   groups: 2:{ span=2 }, 3:{ span=3 }
> [    0.424138]   domain-1: span=2-3,8-9 level=NUMA
> [    0.425012]    groups: 2:{ span=2-3 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.425523]    domain-2: span=2-3,6-9 level=NUMA
> [    0.426401]     groups: 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
> [    0.427994]     domain-3: span=2-9 level=NUMA
> [    0.428839]      groups: 2:{ span=2-3,6-9 mask=2-3 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
> [    0.429632]      domain-4: span=0-9 level=NUMA
> [    0.430486]       groups: 2:{ span=2-9 mask=2-3 cap=8192 }, 0:{ span=0-1,4-9 mask=0-1 cap=8192 }
> [    0.432109] CPU3 attaching sched-domain(s):
> [    0.432930]  domain-0: span=2-3 level=PKG
> [    0.433431]   groups: 3:{ span=3 }, 2:{ span=2 }
> [    0.434303]   domain-1: span=2-3,8-9 level=NUMA
> [    0.435174]    groups: 2:{ span=2-3 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.436375]    domain-2: span=2-3,6-9 level=NUMA
> [    0.437178]     groups: 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
> [    0.438756]     domain-3: span=2-9 level=NUMA
> [    0.439580]      groups: 2:{ span=2-3,6-9 mask=2-3 cap=6144 }, 4:{ span=4-9 mask=4-5 cap=6144 }
> [    0.441178]      domain-4: span=0-9 level=NUMA
> [    0.442012]       groups: 2:{ span=2-9 mask=2-3 cap=8192 }, 0:{ span=0-1,4-9 mask=0-1 cap=8192 }
> [    0.443577] CPU4 attaching sched-domain(s):
> [    0.444371]  domain-0: span=4-5 level=PKG
> [    0.445133]   groups: 4:{ span=4 }, 5:{ span=5 }
> [    0.445445]   domain-1: span=4-5,8-9 level=NUMA
> [    0.446288]    groups: 4:{ span=4-5 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.447448]    domain-2: span=4-9 level=NUMA
> [    0.448264]     groups: 4:{ span=4-5,8-9 mask=4-5 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
> [    0.449629]     domain-3: span=0-9 level=NUMA
> [    0.450456]      groups: 4:{ span=4-9 mask=4-5 cap=6144 }, 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 2:{ span=2-3,6-9 mask=2-3 cap=6144 }
> [    0.452542] CPU5 attaching sched-domain(s):
> [    0.453179]  domain-0: span=4-5 level=PKG
> [    0.453971]   groups: 5:{ span=5 }, 4:{ span=4 }
> [    0.454858]   domain-1: span=4-5,8-9 level=NUMA
> [    0.455741]    groups: 4:{ span=4-5 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.456899]    domain-2: span=4-9 level=NUMA
> [    0.457446]     groups: 4:{ span=4-5,8-9 mask=4-5 cap=4096 }, 6:{ span=6-9 mask=6-7 cap=4096 }
> [    0.458994]     domain-3: span=0-9 level=NUMA
> [    0.459815]      groups: 4:{ span=4-9 mask=4-5 cap=6144 }, 0:{ span=0-1,6-9 mask=0-1 cap=6144 }, 2:{ span=2-3,6-9 mask=2-3 cap=6144 }
> [    0.461734] CPU6 attaching sched-domain(s):
> [    0.462527]  domain-0: span=6-7 level=PKG
> [    0.463297]   groups: 6:{ span=6 }, 7:{ span=7 }
> [    0.464154]   domain-1: span=6-9 level=NUMA
> [    0.464936]    groups: 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.465503]    domain-2: span=0-9 level=NUMA
> [    0.466311]     groups: 6:{ span=6-9 mask=6-7 cap=4096 }, 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 4:{ span=4-5,8-9 mask=4-5 cap=4096 }
> [    0.469027] CPU7 attaching sched-domain(s):
> [    0.469432]  domain-0: span=6-7 level=PKG
> [    0.470238]   groups: 7:{ span=7 }, 6:{ span=6 }
> [    0.471127]   domain-1: span=6-9 level=NUMA
> [    0.471940]    groups: 6:{ span=6-7 cap=2048 }, 8:{ span=8-9 cap=2048 }
> [    0.473158]    domain-2: span=0-9 level=NUMA
> [    0.473442]     groups: 6:{ span=6-9 mask=6-7 cap=4096 }, 0:{ span=0-1,8-9 mask=0-1 cap=4096 }, 2:{ span=2-3,8-9 mask=2-3 cap=4096 }, 4:{ span=4-5,8-9 mask=4-5 cap=4096 }
> [    0.476610] CPU8 attaching sched-domain(s):
> [    0.477179]  domain-0: span=8-9 level=PKG
> [    0.477957]   groups: 8:{ span=8 }, 9:{ span=9 }
> [    0.478841]   domain-1: span=0-9 level=NUMA
> [    0.479661]    groups: 8:{ span=8-9 cap=2048 }, 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }
> [    0.481764] CPU9 attaching sched-domain(s):
> [    0.482570]  domain-0: span=8-9 level=PKG
> [    0.483360]   groups: 9:{ span=9 }, 8:{ span=8 }
> [    0.484245]   domain-1: span=0-9 level=NUMA
> [    0.485065]    groups: 8:{ span=8-9 cap=2048 }, 0:{ span=0-1 cap=2048 }, 2:{ span=2-3 cap=2048 }, 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }
> [    0.485780] root domain span: 0-9
> [    0.486507] Memory: 15306544K/15728104K available (14320K kernel code, 2394K rwdata, 9212K rodata, 1668K init, 1272K bss, 371220K reserved, 0K cma-reserved)
> [    0.489652] devtmpfs: initialized
> 
> > 
> > One better would be running with the following diff on top of v6.16-rc1
> > is possible:
> 
> We are working to get this one too.
> 
> Thanks
> 
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9026d325d0fd..811c8d0f5b9a 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2398,7 +2398,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> >  {
> >  	struct sched_domain_topology_level *tl;
> >  	struct cpumask *covered, *id_seen;
> > -	int cpu;
> > +	int cpu, id;
> >  	lockdep_assert_held(&sched_domains_mutex);
> >  	covered = sched_domains_tmpmask;
> > @@ -2421,19 +2421,21 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> >  		 */
> >  		for_each_cpu(cpu, cpu_map) {
> >  			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> > -			int id;
> >  			/* lowest bit set in this mask is used as a unique id */
> >  			id = cpumask_first(tl_cpu_mask);
> > +			pr_warn("tl(%s) CPU(%d) ID(%d) CPU_TL_SPAN(%*pbl) ID_TL_SPAN(%*pbl)\n",
> > +				tl->name, cpu, id, cpumask_pr_args(tl->mask(cpu)), cpumask_pr_args(tl->mask(id)));
> > +
> >  			if (cpumask_test_cpu(id, id_seen)) {
> >  				/* First CPU has already been seen, ensure identical spans */
> >  				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
> > -					return false;
> > +					goto fail;
> >  			} else {
> >  				/* First CPU hasn't been seen before, ensure it's a completely new span */
> >  				if (cpumask_intersects(tl_cpu_mask, covered))
> > -					return false;
> > +					goto fail;
> >  				cpumask_or(covered, covered, tl_cpu_mask);
> >  				cpumask_set_cpu(id, id_seen);
> > @@ -2441,6 +2443,16 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> >  		}
> >  	}
> >  	return true;
> > +
> > +fail:
> > +	pr_warn("Failed tl: %s\n", tl->name);
> > +	pr_warn("Failed for CPU: %d\n", cpu);
> > +	pr_warn("ID CPU at tl: %d\n", id);
> > +	pr_warn("Failed CPU span at tl: %*pbl\n", cpumask_pr_args(tl->mask(cpu)));
> > +	pr_warn("ID CPU span: %*pbl\n", cpumask_pr_args(tl->mask(id)));
> > +	pr_warn("ID CPUs seen: %*pbl\n", cpumask_pr_args(id_seen));
> > +	pr_warn("CPUs covered: %*pbl\n", cpumask_pr_args(covered));
> > +	return false;
> >  }
> >  /*
> > --
> > 
> > In my case, it logs the following (no failures seen yet):
> > 
> >     tl(SMT) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
> >     tl(SMT) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
> >     tl(SMT) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
> >     tl(SMT) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
> >     tl(SMT) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
> >     tl(SMT) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
> >     tl(SMT) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
> >     tl(SMT) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
> >     tl(SMT) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
> >     tl(SMT) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
> >     tl(CLS) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
> >     tl(CLS) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
> >     tl(CLS) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
> >     tl(CLS) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
> >     tl(CLS) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
> >     tl(CLS) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
> >     tl(CLS) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
> >     tl(CLS) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
> >     tl(CLS) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
> >     tl(CLS) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
> >     tl(MC) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
> >     tl(MC) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
> >     tl(MC) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
> >     tl(MC) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
> >     tl(MC) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
> >     tl(MC) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
> >     tl(MC) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
> >     tl(MC) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
> >     tl(MC) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
> >     tl(MC) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
> >     tl(PKG) CPU(0) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
> >     tl(PKG) CPU(1) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
> >     tl(PKG) CPU(2) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
> >     tl(PKG) CPU(3) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
> >     tl(PKG) CPU(4) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
> >     tl(PKG) CPU(5) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
> >     tl(PKG) CPU(6) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
> >     tl(PKG) CPU(7) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
> >     tl(PKG) CPU(8) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
> >     tl(PKG) CPU(9) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
> >     tl(NODE) CPU(0) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(1) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(2) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(3) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(4) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(5) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(6) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(7) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(8) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> >     tl(NODE) CPU(9) ID(0) CPU_TL_SPAN(0-9) ID_TL_SPAN(0-9)
> > 
> > -- 
> > Thanks and Regards,
> > Prateek
> > 

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-16 14:18                         ` Steve Wahl
@ 2025-06-17  3:04                           ` K Prateek Nayak
  2025-06-17  7:55                             ` Leon Romanovsky
  2025-06-17  7:34                           ` Leon Romanovsky
  1 sibling, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-17  3:04 UTC (permalink / raw)
  To: Steve Wahl, Leon Romanovsky
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Vishal Chourasia, samir,
	Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Hello Steve,

On 6/16/2025 7:48 PM, Steve Wahl wrote:
> On Sun, Jun 15, 2025 at 09:42:07AM +0300, Leon Romanovsky wrote:
>> On Thu, Jun 12, 2025 at 04:11:52PM +0530, K Prateek Nayak wrote:
>>> On 6/12/2025 3:00 PM, K Prateek Nayak wrote:
>>>> Ah! Since this happens so early, the topology isn't created yet
>>>> for the debug prints to hit! Is it possible to get a dmesg with
>>>> "ignore_loglevel" and "sched_verbose" on an older kernel that
>>>> did not throw this error on the same host?
>>
>> This is the dmesg with two commits reverted: "sched/topology:
>> Refinement to topology_span_sane speedup" and "sched/topology:
>> improve topology_span_sane speed".
> 
> I would be interested in whether there's a difference with only the
> second patch reverted.  The first patch is expected to get the exact
> same results as the previous code, only faster.  The second added
> simplifications suggested by others that could give different results
> under conditions that were not expected to exist.  The commit message
> for the second patch explains this.

Since NUMA domains are skipped as a result of SD_OVERLAP, the remaining
PKG domains don't show any discrepancy that would fail the current
check:

     CPU0 attaching sched-domain(s):
      domain-0: span=0-1 level=PKG               id:0    span:0-1
       groups: 0:{ span=0 }, 1:{ span=1 }
     CPU1 attaching sched-domain(s):
      domain-0: span=0-1 level=PKG               id:0    span:0-1
       groups: 1:{ span=1 }, 0:{ span=0 }
     CPU2 attaching sched-domain(s):
      domain-0: span=2-3 level=PKG               id:2    span:2-3
       groups: 2:{ span=2 }, 3:{ span=3 }
     CPU3 attaching sched-domain(s):
      domain-0: span=2-3 level=PKG               id:2    span:2-3
       groups: 3:{ span=3 }, 2:{ span=2 }
     CPU4 attaching sched-domain(s):
      domain-0: span=4-5 level=PKG               id:4    span:4-5
       groups: 4:{ span=4 }, 5:{ span=5 }
     CPU5 attaching sched-domain(s):
      domain-0: span=4-5 level=PKG               id:4    span:4-5
       groups: 5:{ span=5 }, 4:{ span=4 }
     CPU6 attaching sched-domain(s):
      domain-0: span=6-7 level=PKG               id:6    span:6-7
       groups: 6:{ span=6 }, 7:{ span=7 }
     CPU7 attaching sched-domain(s):
      domain-0: span=6-7 level=PKG               id:6    span:6-7
       groups: 7:{ span=7 }, 6:{ span=6 }
     CPU8 attaching sched-domain(s):
      domain-0: span=8-9 level=PKG               id:8    span:8-9
       groups: 8:{ span=8 }, 9:{ span=9 }
     CPU9 attaching sched-domain(s):
      domain-0: span=8-9 level=PKG               id:8    span:8-9
       groups: 9:{ span=9 }, 8:{ span=8 }

I suspected a degenerated topology level could be behind the failed
check, but looking at the degeneration path, a domain must either
contain a single CPU (SMT, CLS, MC) or have the same span as PKG (the
NODE domain) in order to degenerate, and both of those cases should be
sane.

Leon, could you also paste the output of numactl -H from within the
guest, please? I'm wondering whether the NUMA topology somehow makes
a difference here.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-16 14:18                         ` Steve Wahl
  2025-06-17  3:04                           ` K Prateek Nayak
@ 2025-06-17  7:34                           ` Leon Romanovsky
  2025-06-17  9:22                             ` K Prateek Nayak
  1 sibling, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-17  7:34 UTC (permalink / raw)
  To: Steve Wahl
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On Mon, Jun 16, 2025 at 09:18:41AM -0500, Steve Wahl wrote:
> On Sun, Jun 15, 2025 at 09:42:07AM +0300, Leon Romanovsky wrote:
> > On Thu, Jun 12, 2025 at 04:11:52PM +0530, K Prateek Nayak wrote:
> > > On 6/12/2025 3:00 PM, K Prateek Nayak wrote:
> > > > Ah! Since this happens so early, the topology isn't created yet
> > > > for the debug prints to hit! Is it possible to get a dmesg with
> > > > "ignore_loglevel" and "sched_verbose" on an older kernel that
> > > > did not throw this error on the same host?
> > 
> > This is the dmesg with two commits reverted: "sched/topology:
> > Refinement to topology_span_sane speedup" and "sched/topology:
> > improve topology_span_sane speed".

<...>

> > > 
> > > One better would be running with the following diff on top of v6.16-rc1
> > > is possible:
> > 
> > We are working to get this one too.

 [    0.435961] smp: Bringing up secondary CPUs ...
 [    0.437573] smpboot: x86: Booting SMP configuration:
 [    0.438611] .... node  #0, CPUs:        #1
 [    0.440449] .... node  #1, CPUs:    #2  #3
 [    0.442906] .... node  #2, CPUs:    #4  #5
 [    0.445298] .... node  #3, CPUs:    #6  #7
 [    0.447715] .... node  #4, CPUs:    #8  #9
 [    0.481482] smp: Brought up 5 nodes, 10 CPUs
 [    0.483160] smpboot: Total of 10 processors activated (45892.16 BogoMIPS)
 [    0.486872] tl(SMT) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
 [    0.488029] tl(SMT) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
 [    0.489151] tl(SMT) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
 [    0.489761] tl(SMT) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
 [    0.490876] tl(SMT) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
 [    0.491996] tl(SMT) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
 [    0.493115] tl(SMT) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
 [    0.493754] tl(SMT) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
 [    0.494875] tl(SMT) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
 [    0.496008] tl(SMT) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
 [    0.497129] tl(PKG) CPU(0) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
 [    0.497763] tl(PKG) CPU(1) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
 [    0.498954] tl(PKG) CPU(2) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
 [    0.500167] tl(PKG) CPU(3) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
 [    0.501371] tl(PKG) CPU(4) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
 [    0.501792] tl(PKG) CPU(5) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
 [    0.503001] tl(PKG) CPU(6) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
 [    0.504202] tl(PKG) CPU(7) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
 [    0.505419] tl(PKG) CPU(8) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
 [    0.506637] tl(PKG) CPU(9) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
 [    0.507843] tl(NODE) CPU(0) ID(0) CPU_TL_SPAN(0-1,8-9) ID_TL_SPAN(0-1,8-9)
 [    0.509199] tl(NODE) CPU(1) ID(0) CPU_TL_SPAN(0-1,8-9) ID_TL_SPAN(0-1,8-9)
 [    0.509792] tl(NODE) CPU(2) ID(2) CPU_TL_SPAN(2-3,8-9) ID_TL_SPAN(2-3,8-9)
 [    0.511143] Failed tl: NODE
 [    0.511789] Failed for CPU: 2
 [    0.512466] ID CPU at tl: 2
 [    0.513115] Failed CPU span at tl: 2-3,8-9
 [    0.513701] ID CPU span: 2-3,8-9
 [    0.514419] ID CPUs seen: 0
 [    0.515055] CPUs covered: 0-1,8-9
 [    0.515802] ------------[ cut here ]------------
 [    0.516753] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2499 build_sched_domains.cold+0x96/0x23a
 [    0.517937] Modules linked in:
 [    0.518630] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1master_70c6e66 #1 NONE
 [    0.520353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 [    0.522008] RIP: 0010:build_sched_domains.cold+0x96/0x23a
 [    0.523102] Code: c2 80 33 91 83 48 c7 c7 0d 6e 68 82 e8 76 1f 00 00 8b 35 80 1d f8 01 48 c7 c2 c0 33 91 83 48 c7 c7 24 6e 68 82 e8 5d 1f 00 00 <0f> 0b bd f4 ff ff ff e9 fe 08 28 00 be 40 00 00 00 bf 0f 00 00 00
 [    0.526338] RSP: 0000:ffff88810096be18 EFLAGS: 00010246
 [    0.527408] RAX: 0000000000000015 RBX: 0000000000000002 RCX: ffff88843ffd26a8
 [    0.528804] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
 [    0.529808] RBP: ffff888100062150 R08: 0000000000000000 R09: 0000000000000000
 [    0.531211] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000002
 [    0.532607] R13: 0000000000000002 R14: 0000000000000002 R15: ffff88838000b480
 [    0.533813] FS:  0000000000000000(0000) GS:ffff8881b9358000(0000) knlGS:0000000000000000
 [    0.535478] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [    0.536639] CR2: ffff88843ffff000 CR3: 0000000002e5d001 CR4: 0000000000370eb0
 [    0.537802] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [    0.539203] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [    0.540564] Call Trace:
 [    0.541138]  <TASK>
 [    0.541647]  sched_init_smp+0x32/0xa0
 [    0.542434]  kernel_init_freeable+0x169/0x330
 [    0.543329]  ? rest_init+0x1b0/0x1b0
 [    0.544092]  kernel_init+0x17/0x140
 [    0.544830]  ret_from_fork+0x140/0x1b0
 [    0.545419]  ? rest_init+0x1b0/0x1b0
 [    0.546185]  ret_from_fork_asm+0x11/0x20
 [    0.547041]  </TASK>
 [    0.547586] irq event stamp: 8887
 [    0.548321] hardirqs last  enabled at (8897): [<ffffffff814b3b9a>] __up_console_sem+0x5a/0x70
 [    0.549918] hardirqs last disabled at (8908): [<ffffffff814b3b7f>] __up_console_sem+0x3f/0x70
 [    0.551593] softirqs last  enabled at (8292): [<ffffffff814363b2>] irq_exit_rcu+0x82/0xe0
 [    0.553208] softirqs last disabled at (8285): [<ffffffff814363b2>] irq_exit_rcu+0x82/0xe0
 [    0.553909] ---[ end trace 0000000000000000 ]---
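
The warning above is internally consistent: at the NODE level, CPUs 0-1
report span 0-1,8-9 while CPU 2 reports span 2-3,8-9 under a new id (2),
and the two spans share CPUs 8-9 without being equal, so the
new-id/intersection path bails out.  A small sketch (Python, sets
standing in for cpumasks, masks copied from the log above) reproduces
the decision:

```python
# NODE-level spans as printed in the failure log.
node_span = {
    0: {0, 1, 8, 9}, 1: {0, 1, 8, 9},  # CPUs 0-1: span 0-1,8-9
    2: {2, 3, 8, 9},                   # CPU 2:   span 2-3,8-9
}

covered = set()
id_seen = set()
for cpu in sorted(node_span):
    span = node_span[cpu]
    span_id = min(span)                # lowest CPU acts as the span id
    if span_id in id_seen:
        # CPUs 0 and 1 agree on the span, so this branch passes.
        assert span == node_span[span_id]
    else:
        if span & covered:
            # CPU 2: id 2 is new, but its span shares 8-9 with CPUs 0-1.
            print(f"fail at CPU {cpu}: span {sorted(span)} "
                  f"intersects covered {sorted(covered)}")
            break
        covered |= span
        id_seen.add(span_id)
```

This prints the same facts the pr_warn() lines report: failure at CPU 2,
with "CPUs covered" being 0-1,8-9.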


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-17  3:04                           ` K Prateek Nayak
@ 2025-06-17  7:55                             ` Leon Romanovsky
  0 siblings, 0 replies; 25+ messages in thread
From: Leon Romanovsky @ 2025-06-17  7:55 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Steve Wahl, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, linux-kernel, Vishal Chourasia,
	samir, Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On Tue, Jun 17, 2025 at 08:34:53AM +0530, K Prateek Nayak wrote:
> Hello Steve,

<...>

> Leon, could you also paste the output of numactl -H from within the
> guest please. I'm wondering if the NUMA topology makes a difference
> here somehow.

[leonro@vm ~]$ sudo numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: 2927 MB
node 0 free: 1603 MB
node 1 cpus: 2 3
node 1 size: 3023 MB
node 1 free: 3008 MB
node 2 cpus: 4 5
node 2 size: 3023 MB
node 2 free: 3007 MB
node 3 cpus: 6 7
node 3 size: 3023 MB
node 3 free: 3002 MB
node 4 cpus: 8 9
node 4 size: 3022 MB
node 4 free: 2718 MB
node distances:
node   0   1   2   3   4 
  0:  10  39  38  37  36 
  1:  39  10  38  37  36 
  2:  38  38  10  37  36 
  3:  37  37  37  10  36 
  4:  36  36  36  36  10 


> 
> -- 
> Thanks and Regards,
> Prateek
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-17  7:34                           ` Leon Romanovsky
@ 2025-06-17  9:22                             ` K Prateek Nayak
  2025-06-23  6:06                               ` K Prateek Nayak
  0 siblings, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-17  9:22 UTC (permalink / raw)
  To: Leon Romanovsky, Steve Wahl
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Vishal Chourasia, samir,
	Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

Hello Leon,

On 6/17/2025 1:04 PM, Leon Romanovsky wrote:
> On Mon, Jun 16, 2025 at 09:18:41AM -0500, Steve Wahl wrote:
>> On Sun, Jun 15, 2025 at 09:42:07AM +0300, Leon Romanovsky wrote:
>>> On Thu, Jun 12, 2025 at 04:11:52PM +0530, K Prateek Nayak wrote:
>>>> On 6/12/2025 3:00 PM, K Prateek Nayak wrote:
>>>>> Ah! Since this happens so early topology isn't created yet for
>>>>> the debug prints to hit! Is it possible to get a dmesg with
>>>>> "ignore_loglevel" and "sched_verbose" on an older kernel that
>>>>> did not throw this error on the same host?
>>>
>>> This is the dmesg with the two commits "sched/topology: Refinement
>>> to topology_span_sane speedup" and "sched/topology: improve
>>> topology_span_sane speed" reverted.
> 
> <...>
> 
>>>>
>>>> Even better would be running with the following diff on top of
>>>> v6.16-rc1, if possible:
>>>
>>> We are working to get this one too.

Thank you for all the data! Using the NUMA topology from the other
thread:

On 6/17/2025 1:25 PM, Leon Romanovsky wrote:
> [leonro@vm ~]$ sudo numactl -H
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: 2927 MB
> node 0 free: 1603 MB
> node 1 cpus: 2 3
> node 1 size: 3023 MB
> node 1 free: 3008 MB
> node 2 cpus: 4 5
> node 2 size: 3023 MB
> node 2 free: 3007 MB
> node 3 cpus: 6 7
> node 3 size: 3023 MB
> node 3 free: 3002 MB
> node 4 cpus: 8 9
> node 4 size: 3022 MB
> node 4 free: 2718 MB
> node distances:
> node   0   1   2   3   4
>    0:  10  39  38  37  36
>    1:  39  10  38  37  36
>    2:  38  38  10  37  36
>    3:  37  37  37  10  36
>    4:  36  36  36  36  10 

I could reproduce the warning using:

     sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
     -cpu host \
     -m 20G -smp cpus=10,sockets=10 -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     -numa dist,src=0,dst=1,val=39 \
     -numa dist,src=0,dst=2,val=38 \
     -numa dist,src=0,dst=3,val=37 \
     -numa dist,src=0,dst=4,val=36 \
     -numa dist,src=1,dst=0,val=39 \
     -numa dist,src=1,dst=2,val=38 \
     -numa dist,src=1,dst=3,val=37 \
     -numa dist,src=1,dst=4,val=36 \
     -numa dist,src=2,dst=0,val=38 \
     -numa dist,src=2,dst=1,val=38 \
     -numa dist,src=2,dst=3,val=37 \
     -numa dist,src=2,dst=4,val=36 \
     -numa dist,src=3,dst=0,val=37 \
     -numa dist,src=3,dst=1,val=37 \
     -numa dist,src=3,dst=2,val=37 \
     -numa dist,src=3,dst=4,val=36 \
     -numa dist,src=4,dst=0,val=36 \
     -numa dist,src=4,dst=1,val=36 \
     -numa dist,src=4,dst=2,val=36 \
     -numa dist,src=4,dst=3,val=36 \
     ...

> 
>   [    0.435961] smp: Bringing up secondary CPUs ...
>   [    0.437573] smpboot: x86: Booting SMP configuration:
>   [    0.438611] .... node  #0, CPUs:        #1
>   [    0.440449] .... node  #1, CPUs:    #2  #3
>   [    0.442906] .... node  #2, CPUs:    #4  #5
>   [    0.445298] .... node  #3, CPUs:    #6  #7
>   [    0.447715] .... node  #4, CPUs:    #8  #9
>   [    0.481482] smp: Brought up 5 nodes, 10 CPUs
>   [    0.483160] smpboot: Total of 10 processors activated (45892.16 BogoMIPS)
>   [    0.486872] tl(SMT) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
>   [    0.488029] tl(SMT) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
>   [    0.489151] tl(SMT) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
>   [    0.489761] tl(SMT) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
>   [    0.490876] tl(SMT) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
>   [    0.491996] tl(SMT) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
>   [    0.493115] tl(SMT) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
>   [    0.493754] tl(SMT) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
>   [    0.494875] tl(SMT) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
>   [    0.496008] tl(SMT) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
>   [    0.497129] tl(PKG) CPU(0) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
>   [    0.497763] tl(PKG) CPU(1) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
>   [    0.498954] tl(PKG) CPU(2) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
>   [    0.500167] tl(PKG) CPU(3) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
>   [    0.501371] tl(PKG) CPU(4) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
>   [    0.501792] tl(PKG) CPU(5) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
>   [    0.503001] tl(PKG) CPU(6) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
>   [    0.504202] tl(PKG) CPU(7) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
>   [    0.505419] tl(PKG) CPU(8) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
>   [    0.506637] tl(PKG) CPU(9) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
>   [    0.507843] tl(NODE) CPU(0) ID(0) CPU_TL_SPAN(0-1,8-9) ID_TL_SPAN(0-1,8-9)
>   [    0.509199] tl(NODE) CPU(1) ID(0) CPU_TL_SPAN(0-1,8-9) ID_TL_SPAN(0-1,8-9)
>   [    0.509792] tl(NODE) CPU(2) ID(2) CPU_TL_SPAN(2-3,8-9) ID_TL_SPAN(2-3,8-9)

Looking at this, NODE should be an SD_OVERLAP domain here, since the
spans across the nodes overlap. The following solves the warning for me:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8e06b1d22e91..759f7b8e24e6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2010,6 +2010,7 @@ void sched_init_numa(int offline_node)
  	 */
  	tl[i++] = (struct sched_domain_topology_level){
  		.mask = sd_numa_mask,
+		.flags = SDTL_OVERLAP,
  		.numa_level = 0,
  		SD_INIT_NAME(NODE)
  	};
--

Based on my tracing, the NODE domain eventually gets degenerated via
the default return in sd_parent_degenerate(), since "~cflags & pflags"
between PKG and NODE is 0 (NODE always has 1 group), but I'm not sure
whether this requires a more fundamental modification to
"sd_numa_mask".

Valentin, Peter, what is the right solution here?

>   [    0.511143] Failed tl: NODE
>   [    0.511789] Failed for CPU: 2
>   [    0.512466] ID CPU at tl: 2
>   [    0.513115] Failed CPU span at tl: 2-3,8-9
>   [    0.513701] ID CPU span: 2-3,8-9
>   [    0.514419] ID CPUs seen: 0
>   [    0.515055] CPUs covered: 0-1,8-9 
-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed
  2025-06-17  9:22                             ` K Prateek Nayak
@ 2025-06-23  6:06                               ` K Prateek Nayak
  0 siblings, 0 replies; 25+ messages in thread
From: K Prateek Nayak @ 2025-06-23  6:06 UTC (permalink / raw)
  To: Leon Romanovsky, Steve Wahl
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel, Vishal Chourasia, samir,
	Naman Jain, Saurabh Singh Sengar, srivatsa, Michael Kelley,
	Russ Anderson, Dimitri Sivanich

On 6/17/2025 2:52 PM, K Prateek Nayak wrote:
>>   [    0.435961] smp: Bringing up secondary CPUs ...
>>   [    0.437573] smpboot: x86: Booting SMP configuration:
>>   [    0.438611] .... node  #0, CPUs:        #1
>>   [    0.440449] .... node  #1, CPUs:    #2  #3
>>   [    0.442906] .... node  #2, CPUs:    #4  #5
>>   [    0.445298] .... node  #3, CPUs:    #6  #7
>>   [    0.447715] .... node  #4, CPUs:    #8  #9
>>   [    0.481482] smp: Brought up 5 nodes, 10 CPUs
>>   [    0.483160] smpboot: Total of 10 processors activated (45892.16 BogoMIPS)
>>   [    0.486872] tl(SMT) CPU(0) ID(0) CPU_TL_SPAN(0) ID_TL_SPAN(0)
>>   [    0.488029] tl(SMT) CPU(1) ID(1) CPU_TL_SPAN(1) ID_TL_SPAN(1)
>>   [    0.489151] tl(SMT) CPU(2) ID(2) CPU_TL_SPAN(2) ID_TL_SPAN(2)
>>   [    0.489761] tl(SMT) CPU(3) ID(3) CPU_TL_SPAN(3) ID_TL_SPAN(3)
>>   [    0.490876] tl(SMT) CPU(4) ID(4) CPU_TL_SPAN(4) ID_TL_SPAN(4)
>>   [    0.491996] tl(SMT) CPU(5) ID(5) CPU_TL_SPAN(5) ID_TL_SPAN(5)
>>   [    0.493115] tl(SMT) CPU(6) ID(6) CPU_TL_SPAN(6) ID_TL_SPAN(6)
>>   [    0.493754] tl(SMT) CPU(7) ID(7) CPU_TL_SPAN(7) ID_TL_SPAN(7)
>>   [    0.494875] tl(SMT) CPU(8) ID(8) CPU_TL_SPAN(8) ID_TL_SPAN(8)
>>   [    0.496008] tl(SMT) CPU(9) ID(9) CPU_TL_SPAN(9) ID_TL_SPAN(9)
>>   [    0.497129] tl(PKG) CPU(0) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
>>   [    0.497763] tl(PKG) CPU(1) ID(0) CPU_TL_SPAN(0-1) ID_TL_SPAN(0-1)
>>   [    0.498954] tl(PKG) CPU(2) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
>>   [    0.500167] tl(PKG) CPU(3) ID(2) CPU_TL_SPAN(2-3) ID_TL_SPAN(2-3)
>>   [    0.501371] tl(PKG) CPU(4) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
>>   [    0.501792] tl(PKG) CPU(5) ID(4) CPU_TL_SPAN(4-5) ID_TL_SPAN(4-5)
>>   [    0.503001] tl(PKG) CPU(6) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
>>   [    0.504202] tl(PKG) CPU(7) ID(6) CPU_TL_SPAN(6-7) ID_TL_SPAN(6-7)
>>   [    0.505419] tl(PKG) CPU(8) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
>>   [    0.506637] tl(PKG) CPU(9) ID(8) CPU_TL_SPAN(8-9) ID_TL_SPAN(8-9)
>>   [    0.507843] tl(NODE) CPU(0) ID(0) CPU_TL_SPAN(0-1,8-9) ID_TL_SPAN(0-1,8-9)
>>   [    0.509199] tl(NODE) CPU(1) ID(0) CPU_TL_SPAN(0-1,8-9) ID_TL_SPAN(0-1,8-9)
>>   [    0.509792] tl(NODE) CPU(2) ID(2) CPU_TL_SPAN(2-3,8-9) ID_TL_SPAN(2-3,8-9)
> 
> Looking at this, NODE should be an SD_OVERLAP domain here, since the
> spans across the nodes overlap. The following solves the warning for me:

So it turns out the mask resolved for NODE is all wrong!

> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8e06b1d22e91..759f7b8e24e6 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2010,6 +2010,7 @@ void sched_init_numa(int offline_node)
>        */
>       tl[i++] = (struct sched_domain_topology_level){
>           .mask = sd_numa_mask,
> +        .flags = SDTL_OVERLAP,
>           .numa_level = 0,
>           SD_INIT_NAME(NODE)
>       };
> -- 

And this solution is wrong too! Leon, could you please try the below diff
and let me know if it solves the issue in your case:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..e106035d78d8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2426,6 +2426,14 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
  		cpumask_clear(covered);
  		cpumask_clear(id_seen);
  
+#ifdef CONFIG_NUMA
+		/*
+		 * Reset sched_domains_curr_level since tl->mask(cpu)
+		 * below can resolve to sd_numa_mask() for NODE.
+		 */
+		sched_domains_curr_level = tl->numa_level;
+#endif
+
  		/*
  		 * Non-NUMA levels cannot partially overlap - they must be either
  		 * completely equal or completely disjoint. Otherwise we can end up
---

We could also reset "sched_domains_curr_level" to 0 before the loop;
that should work too, since all NUMA levels >= 1 have SDTL_OVERLAP set,
but resetting it per-level errs on the side of caution.

Previously, topology_span_sane() used sched_domain_span(), which didn't
depend on "sched_domains_curr_level" to resolve tl->mask(); since the
rework now uses tl directly, this reset is needed.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-06-23  6:06 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-04 16:08 [PATCH v4 0/2] Improving topology_span_sane Steve Wahl
2025-03-04 16:08 ` [PATCH v4 1/2] sched/topology: improve topology_span_sane speed Steve Wahl
2025-04-08 19:05   ` [tip: sched/core] " tip-bot2 for Steve Wahl
2025-06-10 11:07   ` [PATCH v4 1/2] " Leon Romanovsky
2025-06-10 11:33     ` K Prateek Nayak
2025-06-10 12:36       ` Leon Romanovsky
2025-06-10 13:09         ` Leon Romanovsky
2025-06-10 19:39           ` Steve Wahl
2025-06-11  6:06             ` Leon Romanovsky
2025-06-11  6:56               ` K Prateek Nayak
2025-06-12  7:41                 ` Leon Romanovsky
2025-06-12  9:30                   ` K Prateek Nayak
2025-06-12 10:41                     ` K Prateek Nayak
2025-06-15  6:42                       ` Leon Romanovsky
2025-06-16 14:18                         ` Steve Wahl
2025-06-17  3:04                           ` K Prateek Nayak
2025-06-17  7:55                             ` Leon Romanovsky
2025-06-17  7:34                           ` Leon Romanovsky
2025-06-17  9:22                             ` K Prateek Nayak
2025-06-23  6:06                               ` K Prateek Nayak
2025-03-04 16:08 ` [PATCH v4 2/2] sched/topology: Refinement to topology_span_sane speedup Steve Wahl
2025-04-08 19:05   ` [tip: sched/core] " tip-bot2 for Steve Wahl
2025-03-06  6:46 ` [PATCH v4 0/2] Improving topology_span_sane K Prateek Nayak
2025-03-06 14:33 ` Valentin Schneider
2025-03-07 10:06 ` Madadi Vineeth Reddy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).