Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@techsingularity.net>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: peterz@infradead.org, aubrey.li@linux.intel.com, efault@gmx.de,
	gautham.shenoy@amd.com, linux-kernel@vger.kernel.org,
	mingo@kernel.org, song.bao.hua@hisilicon.com,
	srikar@linux.vnet.ibm.com, valentin.schneider@arm.com,
	vincent.guittot@linaro.org
Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group
Date: Thu, 17 Feb 2022 10:05:24 +0000	[thread overview]
Message-ID: <20220217100523.GV3366@techsingularity.net> (raw)
In-Reply-To: <20220217055408.28151-1-kprateek.nayak@amd.com>

Thanks Prateek,

On Thu, Feb 17, 2022 at 11:24:08AM +0530, K Prateek Nayak wrote:
> In AMD Zen like systems which contains multiple LLCs per socket,
> users want to spread bandwidth hungry applications across multiple
> LLCs. Stream is one such representative workload where the best
> performance is obtained by limiting one stream thread per LLC. To
> ensure this, users are known to pin the tasks to a specify a subset of
> the CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
> 
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
> 
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
> 

In this case the stream threads can use any CPU of the subset, presumably
this is parallelised with OpenMP without specifying spread or bind
directives.

> stream-5045    [032] d..2.   167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045    [032] d..2.   167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045    [032] d..2.   167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045    [032] d..2.   167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
> 

Resulting in some stacking with the baseline

> stream-4733    [032] d..2.   116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733    [032] d..2.   116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733    [032] d..2.   116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733    [032] d..2.   116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733    [032] d..2.   116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733    [032] d..2.   116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733    [032] d..2.   116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
> 

And no stacking with your patch. So far so good.

> <SNIP>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..6e875f1f34e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9130,6 +9130,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  
>  	case group_has_spare:
>  		if (sd->flags & SD_NUMA) {
> +			struct cpumask *cpus;
> +			int imb;
>  #ifdef CONFIG_NUMA_BALANCING
>  			int idlest_cpu;
>  			/*
> @@ -9147,10 +9149,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * Otherwise, keep the task close to the wakeup source
>  			 * and improve locality if the number of running tasks
>  			 * would remain below threshold where an imbalance is
> -			 * allowed. If there is a real need of migration,
> -			 * periodic load balance will take care of it.
> +			 * allowed while accounting for the possibility the
> +			 * task is pinned to a subset of CPUs. If there is a
> +			 * real need of migration, periodic load balance will
> +			 * take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> +			cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +			cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> +			imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>  				return NULL;

One concern I have is that we incur a cpumask setup and cpumask_weight
cost on every clone whether a restricted CPU mask is used or not.  Peter,
is it acceptable to avoid the cpumask check if there is no restrictions
on allowed cpus like this?

	imb = sd->imb_numa_nr;
	if (p->nr_cpus_allowed != num_online_cpus())
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

		cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
		imb = min(cpumask_weight(cpus), imb);
	}

It's not perfect as a hotplug event could occur but that would be a fairly
harmless race with a limited impact (race with hotplug during clone may
stack for a short interval before LB intervenes).

-- 
Mel Gorman
SUSE Labs

next prev parent reply	other threads:[~2022-02-17 10:05 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-17  5:54 [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group K Prateek Nayak
2022-02-17 10:05 ` Mel Gorman [this message]
2022-02-17 11:23   ` K Prateek Nayak
2022-02-17 13:15     ` Mel Gorman
2022-02-18  3:55       ` K Prateek Nayak
2022-02-22  6:00       ` K Prateek Nayak
2022-02-22  8:45         ` Mel Gorman
2022-02-22  9:39           ` K Prateek Nayak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220217100523.GV3366@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=aubrey.li@linux.intel.com \
    --cc=efault@gmx.de \
    --cc=gautham.shenoy@amd.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=song.bao.hua@hisilicon.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.