Re: [PATCH v2] sched/topology: improve topology_span_sane speed

All of lore.kernel.org
 help / color / mirror / Atom feed

From: samir <samir@linux.ibm.com>
To: Steve Wahl <steve.wahl@hpe.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vishal Chourasia <vishalc@linux.ibm.com>,
	Russ Anderson <rja@hpe.com>, Dimitri Sivanich <sivanich@hpe.com>,
	srikar@linux.ibm.com, sshegde@linux.ibm.com
Subject: Re: [PATCH v2] sched/topology: improve topology_span_sane speed
Date: Mon, 25 Nov 2024 21:50:53 +0530	[thread overview]
Message-ID: <02dbeaa81619ccf86204f577c5ef7705@linux.ibm.com> (raw)
In-Reply-To: <20241031200431.182443-1-steve.wahl@hpe.com>

On 2024-11-01 01:34, Steve Wahl wrote:
> Use a different approach to topology_span_sane(), that checks for the
> same constraint of no partial overlaps for any two CPU sets for
> non-NUMA topology levels, but does so in a way that is O(N) rather
> than O(N^2).
> 
> Instead of comparing with all other masks to detect collisions, keep
> one mask that includes all CPUs seen so far and detect collisions with
> a single cpumask_intersects test.
> 
> If the current mask has no collisions with previously seen masks, it
> should be a new mask, which can be uniquely identified ("id") by the
> lowest bit set in this mask.  Mark that we've seen a mask with this
> id, and add the CPUs in this mask to the list of those seen.
> 
> If the current mask does collide with previously seen masks, it should
> be exactly equal to a mask seen before, identified once again by the
> lowest bit the current mask has set.  It's an error if we haven't seen
> a mask with that id, or if the current mask doesn't match the one we
> get by looking up that id.
> 
> Move the topology_span_sane() check out of the existing topology level
> loop, let it do its own looping to match the needs of this algorithm.
> 
> On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> the average time to take one processor offline is reduced from 2.18
> seconds to 1.01 seconds.  (Off-lining 959 of 1920 processors took
> 34m49.765s without this change, 16m10.038s with this change in place.)
> 
> Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
> ---
> 
> Version 2: Adopted suggestion by K Prateek Nayak that removes an array 
> and
> simplifies the code, and eliminates the erroneous use of
> num_possible_cpus() that Peter Zijlstra noted.
> 
> Version 1 discussion:
>     
> https://lore.kernel.org/all/20241010155111.230674-1-steve.wahl@hpe.com/
> 
>  kernel/sched/topology.c | 73 +++++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9748a4c8d668..6a2a3e91d59e 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2356,35 +2356,58 @@ static struct sched_domain
> *build_sched_domain(struct sched_domain_topology_leve
> 
>  /*
>   * Ensure topology masks are sane, i.e. there are no conflicts 
> (overlaps) for
> - * any two given CPUs at this (non-NUMA) topology level.
> + * any two given CPUs on non-NUMA topology levels.
>   */
> -static bool topology_span_sane(struct sched_domain_topology_level *tl,
> -			      const struct cpumask *cpu_map, int cpu)
> +static bool topology_span_sane(const struct cpumask *cpu_map)
>  {
> -	int i = cpu + 1;
> +	struct sched_domain_topology_level *tl;
> +	struct cpumask *covered, *id_seen;
> +	int cpu;
> 
> -	/* NUMA levels are allowed to overlap */
> -	if (tl->flags & SDTL_OVERLAP)
> -		return true;
> +	lockdep_assert_held(&sched_domains_mutex);
> +	covered = sched_domains_tmpmask;
> +	id_seen = sched_domains_tmpmask2;
> +
> +	for_each_sd_topology(tl) {
> +
> +		/* NUMA levels are allowed to overlap */
> +		if (tl->flags & SDTL_OVERLAP)
> +			continue;
> +
> +		cpumask_clear(covered);
> +		cpumask_clear(id_seen);
> 
> -	/*
> -	 * Non-NUMA levels cannot partially overlap - they must be either
> -	 * completely equal or completely disjoint. Otherwise we can end up
> -	 * breaking the sched_group lists - i.e. a later get_group() pass
> -	 * breaks the linking done for an earlier span.
> -	 */
> -	for_each_cpu_from(i, cpu_map) {
>  		/*
> -		 * We should 'and' all those masks with 'cpu_map' to exactly
> -		 * match the topology we're about to build, but that can only
> -		 * remove CPUs, which only lessens our ability to detect
> -		 * overlaps
> +		 * Non-NUMA levels cannot partially overlap - they must be either
> +		 * completely equal or completely disjoint. Otherwise we can end up
> +		 * breaking the sched_group lists - i.e. a later get_group() pass
> +		 * breaks the linking done for an earlier span.
>  		 */
> -		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> -			return false;
> +		for_each_cpu(cpu, cpu_map) {
> +			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> +			int id;
> +
> +			/* lowest bit set in this mask is used as a unique id */
> +			id = cpumask_first(tl_cpu_mask);
> +
> +			/* if this mask doesn't collide with what we've already seen */
> +			if (!cpumask_intersects(tl_cpu_mask, covered)) {
> +				/* Really odd case when cpu != id, likely not sane */
> +				if ((cpu != id) && !cpumask_equal(tl_cpu_mask, tl->mask(id)))
> +					return false;
> +				if (cpumask_test_and_set_cpu(id, id_seen))
> +					return false;
> +				cpumask_or(covered, tl_cpu_mask, covered);
> +			} else if ((!cpumask_test_cpu(id, id_seen)) ||
> +				    !cpumask_equal(tl->mask(id), tl_cpu_mask)) {
> +				/*
> +				 * a collision with covered should have exactly matched
> +				 * a previously seen mask with the same id
> +				 */
> +				return false;
> +			}
> +		}
>  	}
> -
>  	return true;
>  }
> 
> @@ -2417,9 +2440,6 @@ build_sched_domains(const struct cpumask
> *cpu_map, struct sched_domain_attr *att
>  		sd = NULL;
>  		for_each_sd_topology(tl) {
> 
> -			if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> -				goto error;
> -
>  			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
> 
>  			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
> @@ -2433,6 +2453,9 @@ build_sched_domains(const struct cpumask
> *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
> 
> +	if (WARN_ON(!topology_span_sane(cpu_map)))
> +		goto error;
> +
>  	/* Build the groups for the domains */
>  	for_each_cpu(i, cpu_map) {
>  		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {

Hello Steve,

I have verified the above patch on PowerPC, and here are my findings:
Below are the 5 iteration data for “time ppc64_cpu --smt=off/4” 
command(min, max, Average, and Std Dev).

——lscpu—-
CPU(s):                   360
  On-line CPU(s) list:    0-359

Without Patch:
————uname -a————
6.12.0+ 

Metric        SMT Off (s)            SMT 4 (s)

Min	      71.123			37.629

Max	      75.031			40.785

Average       73.175			39.992

Std Dev	      1.619			1.191


With patch:
————uname -a————
6.12.0+ 

Metric        SMT Off (s)            SMT 4 (s)

Min		63.638			37.114

Max		73.707			40.773

Average		70.106			 38.641

Std Dev		 3.443			1.472

The logs indicate minimal improvement in the SMT 4 state with the patch. 
However, a slight improvement is observed in the SMT OFF state with the 
patch, as reflected in all three metrics: Min, Max, and Average.

SMT 4 State:
	•	Across all metrics (Min, Max, and Average), there is negligible 
improvement with the patch.
	•	The differences are within the margin of measurement variability, as 
reflected in the similar Std Dev values (~1.1–1.4).
SMT Off State:
	•	A noticeable improvement is observed in the Min, Max, and Average 
values with the patch:
	◦	Min: Improved by ~10.5% (from 71.123 to 63.638 seconds).
	◦	Max: Improved by ~1.8% (from 75.031 to 73.707 seconds).
	◦	Average: Improved by ~4.2% (from 73.175 to 70.106 seconds).
	•	The Std Dev increased significantly (from 1.619 to 3.443).

Thanks,
Samir

next prev parent reply	other threads:[~2024-11-25 16:21 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-31 20:04 [PATCH v2] sched/topology: improve topology_span_sane speed Steve Wahl
2024-11-06  4:49 ` K Prateek Nayak
2024-11-07 17:43   ` Steve Wahl
2024-11-12 16:15 ` Valentin Schneider
2024-11-13 15:42   ` Steve Wahl
2024-11-19 21:03     ` Steve Wahl
2024-11-19 22:54       ` Valentin Schneider
2024-11-20 15:52         ` Steve Wahl
2024-11-25 16:20 ` samir [this message]
2024-11-26 16:25 ` Michael Kelley
2024-12-13  6:31 ` Saurabh Singh Sengar
2024-12-16 15:38 ` Valentin Schneider
2024-12-17 18:16   ` Steve Wahl

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=02dbeaa81619ccf86204f577c5ef7705@linux.ibm.com \
    --to=samir@linux.ibm.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rja@hpe.com \
    --cc=rostedt@goodmis.org \
    --cc=sivanich@hpe.com \
    --cc=srikar@linux.ibm.com \
    --cc=sshegde@linux.ibm.com \
    --cc=steve.wahl@hpe.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vishalc@linux.ibm.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.