public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH 1/2] cpumask: Add for_each_cpu_from()
       [not found] ` <20240319185148.985729-2-kyle.meyer@hpe.com>
@ 2024-03-20 18:31   ` Yury Norov
  2024-03-20 20:22     ` Yury Norov
  0 siblings, 1 reply; 10+ messages in thread
From: Yury Norov @ 2024-03-20 18:31 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: andriy.shevchenko, linux, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, linux-kernel, russ.anderson, dimitri.sivanich,
	steve.wahl

On Tue, Mar 19, 2024 at 01:51:47PM -0500, Kyle Meyer wrote:
> Add for_each_cpu_from() as a generic cpumask macro.
> 
> for_each_cpu_from() is the same as for_each_cpu(), except it starts at
> @cpu instead of zero.
> 
> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>

Acked-by: Kyle Meyer <kyle.meyer@hpe.com>

> ---
>  include/linux/cpumask.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 1c29947db848..655211db38ff 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -368,6 +368,16 @@ unsigned int __pure cpumask_next_wrap(int n, const struct cpumask *mask, int sta
>  #define for_each_cpu_or(cpu, mask1, mask2)				\
>  	for_each_or_bit(cpu, cpumask_bits(mask1), cpumask_bits(mask2), small_cpumask_bits)
>  
> +/**
> + * for_each_cpu_from - iterate over every cpu present in @mask, starting at @cpu
> + * @cpu: the (optionally unsigned) integer iterator
> + * @mask: the cpumask pointer
> + *
> + * After the loop, cpu is >= nr_cpu_ids.
> + */
> +#define for_each_cpu_from(cpu, mask)				\
> +	for_each_set_bit_from(cpu, cpumask_bits(mask), small_cpumask_bits)
> +
>  /**
>   * cpumask_any_but - return a "random" in a cpumask, but not this one.
>   * @mask: the cpumask to search
> -- 
> 2.44.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] sched/topology: Optimize topology_span_sane()
       [not found] ` <20240319185148.985729-3-kyle.meyer@hpe.com>
@ 2024-03-20 18:32   ` Yury Norov
  0 siblings, 0 replies; 10+ messages in thread
From: Yury Norov @ 2024-03-20 18:32 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: andriy.shevchenko, linux, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, linux-kernel, russ.anderson, dimitri.sivanich,
	steve.wahl

On Tue, Mar 19, 2024 at 01:51:48PM -0500, Kyle Meyer wrote:
> Optimize topology_span_sane() by removing duplicate comparisons.
> 
> The total number of comparisons is reduced from N * (N - 1) to
> N * (N - 1) / 2 (per non-NUMA scheduling domain level).
> 
> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>

Reviewed-by: Yury Norov <yury.norov@gmail.com>

> ---
>  kernel/sched/topology.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 99ea5986038c..b6bcafc09969 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2347,7 +2347,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  			      const struct cpumask *cpu_map, int cpu)
>  {
> -	int i;
> +	int i = cpu + 1;
>  
>  	/* NUMA levels are allowed to overlap */
>  	if (tl->flags & SDTL_OVERLAP)
> @@ -2359,9 +2359,7 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  	 * breaking the sched_group lists - i.e. a later get_group() pass
>  	 * breaks the linking done for an earlier span.
>  	 */
> -	for_each_cpu(i, cpu_map) {
> -		if (i == cpu)
> -			continue;
> +	for_each_cpu_from(i, cpu_map) {
>  		/*
>  		 * We should 'and' all those masks with 'cpu_map' to exactly
>  		 * match the topology we're about to build, but that can only
> -- 
> 2.44.0


* Re: [PATCH 1/2] cpumask: Add for_each_cpu_from()
  2024-03-20 18:31   ` [PATCH 1/2] cpumask: Add for_each_cpu_from() Yury Norov
@ 2024-03-20 20:22     ` Yury Norov
  0 siblings, 0 replies; 10+ messages in thread
From: Yury Norov @ 2024-03-20 20:22 UTC (permalink / raw)
  To: Kyle Meyer
  Cc: andriy.shevchenko, linux, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, vschneid, linux-kernel, russ.anderson, dimitri.sivanich,
	steve.wahl

On Wed, Mar 20, 2024 at 11:31:18AM -0700, Yury Norov wrote:
> On Tue, Mar 19, 2024 at 01:51:47PM -0500, Kyle Meyer wrote:
> > Add for_each_cpu_from() as a generic cpumask macro.
> > 
> > for_each_cpu_from() is the same as for_each_cpu(), except it starts at
> > @cpu instead of zero.
> > 
> > Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
> 
> Acked-by: Kyle Meyer <kyle.meyer@hpe.com>

Sorry, please read: 

Acked-by: Yury Norov <yury.norov@gmail.com>


* Re: [PATCH 0/2] sched/topology: Optimize topology_span_sane()
       [not found] <20240319185148.985729-1-kyle.meyer@hpe.com>
       [not found] ` <20240319185148.985729-2-kyle.meyer@hpe.com>
       [not found] ` <20240319185148.985729-3-kyle.meyer@hpe.com>
@ 2024-04-09  8:31 ` Valentin Schneider
  2 siblings, 0 replies; 10+ messages in thread
From: Valentin Schneider @ 2024-04-09  8:31 UTC (permalink / raw)
  To: Kyle Meyer, yury.norov, andriy.shevchenko, linux, mingo, peterz,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, linux-kernel
  Cc: russ.anderson, dimitri.sivanich, steve.wahl, Kyle Meyer

On 19/03/24 13:51, Kyle Meyer wrote:
> A soft lockup is being detected in build_sched_domains() on 32 socket
> Sapphire Rapids systems with 3840 processors.
>
> topology_span_sane(), called by build_sched_domains(), checks that each
> processor's non-NUMA scheduling domains are completely equal or
> completely disjoint. If a non-NUMA scheduling domain partially overlaps
> another, scheduling groups can break.
>
> This series adds for_each_cpu_from() as a generic cpumask macro to
> optimize topology_span_sane() by removing duplicate comparisons. The
> total number of comparisons is reduced from N * (N - 1) to
> N * (N - 1) / 2 (per non-NUMA scheduling domain level), decreasing the
> boot time by approximately 20 seconds and preventing the soft lockup on
> the mentioned systems.
>
> Kyle Meyer (2):
>   cpumask: Add for_each_cpu_from()
>   sched/topology: Optimize topology_span_sane()

I somehow never got 2/2, and it doesn't show up on lore.kernel.org
either. I can see it from Yury's reply and it looks OK to me, but you'll
have to resend it for maintainers to be able to pick it up.

>
>  include/linux/cpumask.h | 10 ++++++++++
>  kernel/sched/topology.c |  6 ++----
>  2 files changed, 12 insertions(+), 4 deletions(-)
>
> -- 
> 2.44.0



* [PATCH 2/2] sched/topology: optimize topology_span_sane()
  2024-08-02 17:57 [PATCH 0/2] sched/topology: optimize topology_span_sane() Yury Norov
@ 2024-08-02 17:57 ` Yury Norov
  2024-08-06 15:50   ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Yury Norov @ 2024-08-02 17:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yury Norov, Christophe JAILLET, Leonardo Bras, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

The function may call cpumask_equal() with tl->mask(cpu) == tl->mask(i),
even when cpu != i. In such a case, cpumask_equal() would always return
true, and we can proceed to the next iteration immediately.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 kernel/sched/topology.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 754ad5fa3c99..571759606954 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2368,6 +2368,8 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 	 */
 	for_each_cpu_from(i, cpu_map) {
 		mi = tl->mask(i);
+		if (mi == mc)
+			continue;
 
 		/*
 		 * We should 'and' all those masks with 'cpu_map' to exactly
-- 
2.43.0



* Re: [PATCH 2/2] sched/topology: optimize topology_span_sane()
  2024-08-02 17:57 ` [PATCH 2/2] " Yury Norov
@ 2024-08-06 15:50   ` Valentin Schneider
  2024-08-06 18:00     ` Yury Norov
  0 siblings, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2024-08-06 15:50 UTC (permalink / raw)
  To: Yury Norov, linux-kernel
  Cc: Yury Norov, Christophe JAILLET, Leonardo Bras, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman

On 02/08/24 10:57, Yury Norov wrote:
> The function may call cpumask_equal with tl->mask(cpu) == tl->mask(i),
> even when cpu != i.

For which architecture have you observed this? AFAIA all implementations of
tl->sched_domain_mask_f are built on a per-CPU cpumask.

e.g. for the default topology, we have:

  cpu_smt_mask() -> topology_sibling_cpumask()

which is implemented as:

  arch/loongarch/include/asm/topology.h:35:#define topology_sibling_cpumask(cpu)		(&cpu_sibling_map[cpu])
  arch/mips/include/asm/topology.h:18:#define topology_sibling_cpumask(cpu)		(&cpu_sibling_map[cpu])
  arch/powerpc/include/asm/topology.h:139:#define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
  arch/s390/include/asm/topology.h:31:#define topology_sibling_cpumask(cpu)	  (&cpu_topology[cpu].thread_mask)
  arch/sparc/include/asm/topology_64.h:50:#define topology_sibling_cpumask(cpu)		(&per_cpu(cpu_sibling_map, cpu))
  arch/x86/include/asm/topology.h:186:#define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
  include/linux/arch_topology.h:91:#define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
  include/linux/topology.h:218:#define topology_sibling_cpumask(cpu)		cpumask_of(cpu)

and ditto for cpu_coregroup_mask() & cpu_cpu_mask().



* Re: [PATCH 2/2] sched/topology: optimize topology_span_sane()
  2024-08-06 15:50   ` Valentin Schneider
@ 2024-08-06 18:00     ` Yury Norov
  2024-08-07 13:53       ` Valentin Schneider
  0 siblings, 1 reply; 10+ messages in thread
From: Yury Norov @ 2024-08-06 18:00 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Christophe JAILLET, Leonardo Bras, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman

On Tue, Aug 06, 2024 at 05:50:23PM +0200, Valentin Schneider wrote:
> On 02/08/24 10:57, Yury Norov wrote:
> > The function may call cpumask_equal with tl->mask(cpu) == tl->mask(i),
> > even when cpu != i.
> 
> For which architecture have you observed this? AFAIA all implementations of
> tl->sched_domain_mask_f are built on a per-CPU cpumask.

x86_64, qemu emulating 16 CPUs in 4 nodes, Linux 6.10, approximately
defconfig.


* Re: [PATCH 2/2] sched/topology: optimize topology_span_sane()
  2024-08-06 18:00     ` Yury Norov
@ 2024-08-07 13:53       ` Valentin Schneider
  2024-08-07 16:39         ` Yury Norov
  0 siblings, 1 reply; 10+ messages in thread
From: Valentin Schneider @ 2024-08-07 13:53 UTC (permalink / raw)
  To: Yury Norov
  Cc: linux-kernel, Christophe JAILLET, Leonardo Bras, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman

On 06/08/24 11:00, Yury Norov wrote:
> On Tue, Aug 06, 2024 at 05:50:23PM +0200, Valentin Schneider wrote:
>> On 02/08/24 10:57, Yury Norov wrote:
>> > The function may call cpumask_equal with tl->mask(cpu) == tl->mask(i),
>> > even when cpu != i.
>>
>> For which architecture have you observed this? AFAIA all implementations of
>> tl->sched_domain_mask_f are built on a per-CPU cpumask.
>
> x86_64, qemu emulating 16 CPUs in 4 nodes, Linux 6.10, approximately
> defconfig.

For the default_topology:
cpu_smt_mask() # SMT
  (per_cpu(cpu_sibling_map, cpu))

cpu_clustergroup_mask() # CLS
  per_cpu(cpu_l2c_shared_map, cpu);

cpu_coregroup_mask() # MC
  per_cpu(cpu_llc_shared_map, cpu);

cpu_cpu_mask() # PKG
  cpumask_of_node(cpu_to_node(cpu));

Ok so PKG can potentially hit that condition, and so can any
sched_domain_mask_f that relies on the node masks...

I'm thinking ideally we should have checks in place to ensure all
node_to_cpumask_map[] masks are disjoint, then we could entirely skip the levels
that use these masks in topology_span_sane(), but there's unfortunately no
nice way to flag them... Also there would be cases where there's no real
difference between PKG and NODE other than NODE is still based on a per-cpu
cpumask and PKG isn't, so I don't see a nicer way to go about this.

Please add something like the following to the changelog, and with that:
Reviewed-by: Valentin Schneider <vschneid@redhat.com>

"""
Some topology levels (e.g. PKG in default_topology[]) have a
sched_domain_mask_f implementation that reuses the same mask for several
CPUs (in PKG's case, one mask for all CPUs in the same NUMA node).

For such topology levels, repeating cpumask_equal() checks is wasteful -
check that the tl->mask(i) pointers aren't the same first.
"""



* Re: [PATCH 2/2] sched/topology: optimize topology_span_sane()
  2024-08-07 13:53       ` Valentin Schneider
@ 2024-08-07 16:39         ` Yury Norov
  0 siblings, 0 replies; 10+ messages in thread
From: Yury Norov @ 2024-08-07 16:39 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: linux-kernel, Christophe JAILLET, Leonardo Bras, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman

On Wed, Aug 07, 2024 at 03:53:18PM +0200, Valentin Schneider wrote:
> On 06/08/24 11:00, Yury Norov wrote:
> > On Tue, Aug 06, 2024 at 05:50:23PM +0200, Valentin Schneider wrote:
> >> On 02/08/24 10:57, Yury Norov wrote:
> >> > The function may call cpumask_equal with tl->mask(cpu) == tl->mask(i),
> >> > even when cpu != i.
> >>
> >> For which architecture have you observed this? AFAIA all implementations of
> >> tl->sched_domain_mask_f are built on a per-CPU cpumask.
> >
> > x86_64, qemu emulating 16 CPUs in 4 nodes, Linux 6.10, approximately
> > defconfig.
> 
> For the default_topology:
> cpu_smt_mask() # SMT
>   (per_cpu(cpu_sibling_map, cpu))
> 
> cpu_clustergroup_mask() # CLS
>   per_cpu(cpu_l2c_shared_map, cpu);
> 
> cpu_coregroup_mask() # MC
>   per_cpu(cpu_llc_shared_map, cpu);
> 
> cpu_cpu_mask() # PKG
>   cpumask_of_node(cpu_to_node(cpu));
> 
> Ok so PKG can potentially hit that condition, and so can any
> sched_domain_mask_f that relies on the node masks...
> 
> I'm thinking ideally we should have checks in place to ensure all
> node_to_cpumask_map[] masks are disjoint, then we could entirely skip the levels
> that use these masks in topology_span_sane(), but there's unfortunately no
> nice way to flag them... Also there would be cases where there's no real
> difference between PKG and NODE other than NODE is still based on a per-cpu
> cpumask and PKG isn't, so I don't see a nicer way to go about this.
> 
> Please add something like the following to the changelog, and with that:
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>

Sure, will do.

> """
> Some topology levels (e.g. PKG in default_topology[]) have a
> sched_domain_mask_f implementation that reuses the same mask for several
> CPUs (in PKG's case, one mask for all CPUs in the same NUMA node).
> 
> For such topology levels, repeating cpumask_equal() checks is wasteful -
> check that the tl->mask(i) pointers aren't the same first.
> """


* [PATCH 2/2] sched/topology: optimize topology_span_sane()
  2024-08-07 19:05 [PATCH v2 0/2] " Yury Norov
@ 2024-08-07 19:05 ` Yury Norov
  0 siblings, 0 replies; 10+ messages in thread
From: Yury Norov @ 2024-08-07 19:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yury Norov, Chen Yu, Christophe JAILLET, Leonardo Bras,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider

The function may call cpumask_equal() with tl->mask(cpu) == tl->mask(i),
even though cpu != i. In such a case, cpumask_equal() would always return
true, and we can proceed to the next iteration immediately.

The comment was provided by Valentin Schneider.

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
---
 kernel/sched/topology.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8af3b48da458..3661d4173d1f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2370,6 +2370,18 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 	 */
 	for_each_cpu_from(i, cpu_map) {
 		mi = tl->mask(i);
+		/*
+		 * Some topology levels (e.g. PKG in default_topology[])
+		 * have a sched_domain_mask_f implementation that reuses
+		 * the same mask for several CPUs (in PKG's case, one mask
+		 * for all CPUs in the same NUMA node).
+		 *
+		 * For such topology levels, repeating cpumask_equal()
+		 * checks is wasteful. Instead, we first check that the
+		 * tl->mask(i) pointers aren't the same.
+		 */
+		if (mi == mc)
+			continue;
 
 		/*
 		 * We should 'and' all those masks with 'cpu_map' to exactly
-- 
2.43.0



