public inbox for linux-kernel@vger.kernel.org
From: Mel Gorman <mgorman@techsingularity.net>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: peterz@infradead.org, aubrey.li@linux.intel.com, efault@gmx.de,
	gautham.shenoy@amd.com, linux-kernel@vger.kernel.org,
	mingo@kernel.org, song.bao.hua@hisilicon.com,
	srikar@linux.vnet.ibm.com, valentin.schneider@arm.com,
	vincent.guittot@linaro.org
Subject: Re: [PATCH v4] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group
Date: Thu, 17 Feb 2022 10:05:24 +0000	[thread overview]
Message-ID: <20220217100523.GV3366@techsingularity.net> (raw)
In-Reply-To: <20220217055408.28151-1-kprateek.nayak@amd.com>

Thanks Prateek,

On Thu, Feb 17, 2022 at 11:24:08AM +0530, K Prateek Nayak wrote:
> On AMD Zen-like systems which contain multiple LLCs per socket,
> users want to spread bandwidth-hungry applications across multiple
> LLCs. Stream is one such representative workload where the best
> performance is obtained by limiting one stream thread per LLC. To
> ensure this, users are known to pin the tasks to a subset of the
> CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
> 
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
> 
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
> 

In this case the stream threads can use any CPU of the subset, presumably
this is parallelised with OpenMP without specifying spread or bind
directives.

> stream-5045    [032] d..2.   167.914699: sched_wakeup_new: comm=stream pid=5047 prio=120 target_cpu=048
> stream-5045    [032] d..2.   167.914746: sched_wakeup_new: comm=stream pid=5048 prio=120 target_cpu=000
> stream-5045    [032] d..2.   167.914846: sched_wakeup_new: comm=stream pid=5049 prio=120 target_cpu=016
> stream-5045    [032] d..2.   167.914891: sched_wakeup_new: comm=stream pid=5050 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.914928: sched_wakeup_new: comm=stream pid=5051 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.914976: sched_wakeup_new: comm=stream pid=5052 prio=120 target_cpu=032
> stream-5045    [032] d..2.   167.915011: sched_wakeup_new: comm=stream pid=5053 prio=120 target_cpu=032
> 

Resulting in some stacking with the baseline

> stream-4733    [032] d..2.   116.017980: sched_wakeup_new: comm=stream pid=4735 prio=120 target_cpu=048
> stream-4733    [032] d..2.   116.018032: sched_wakeup_new: comm=stream pid=4736 prio=120 target_cpu=000
> stream-4733    [032] d..2.   116.018127: sched_wakeup_new: comm=stream pid=4737 prio=120 target_cpu=064
> stream-4733    [032] d..2.   116.018185: sched_wakeup_new: comm=stream pid=4738 prio=120 target_cpu=112
> stream-4733    [032] d..2.   116.018235: sched_wakeup_new: comm=stream pid=4739 prio=120 target_cpu=096
> stream-4733    [032] d..2.   116.018289: sched_wakeup_new: comm=stream pid=4740 prio=120 target_cpu=016
> stream-4733    [032] d..2.   116.018334: sched_wakeup_new: comm=stream pid=4741 prio=120 target_cpu=080
> 

And no stacking with your patch. So far so good.

> <SNIP>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c4bfffe8c2c..6e875f1f34e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9130,6 +9130,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  
>  	case group_has_spare:
>  		if (sd->flags & SD_NUMA) {
> +			struct cpumask *cpus;
> +			int imb;
>  #ifdef CONFIG_NUMA_BALANCING
>  			int idlest_cpu;
>  			/*
> @@ -9147,10 +9149,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * Otherwise, keep the task close to the wakeup source
>  			 * and improve locality if the number of running tasks
>  			 * would remain below threshold where an imbalance is
> -			 * allowed. If there is a real need of migration,
> -			 * periodic load balance will take care of it.
> +			 * allowed while accounting for the possibility the
> +			 * task is pinned to a subset of CPUs. If there is a
> +			 * real need of migration, periodic load balance will
> +			 * take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> +			cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +			cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> +			imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>  				return NULL;

One concern I have is that we incur the cpumask setup and cpumask_weight
cost on every clone, whether a restricted CPU mask is used or not. Peter,
is it acceptable to avoid the cpumask check if there are no restrictions
on the allowed CPUs, like this?

	imb = sd->imb_numa_nr;
	if (p->nr_cpus_allowed != num_online_cpus()) {
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

		cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
		imb = min(cpumask_weight(cpus), imb);
	}

It's not perfect as a hotplug event could occur, but that would be a fairly
harmless race with limited impact (a clone racing with hotplug may stack
tasks for a short interval before load balancing intervenes).

-- 
Mel Gorman
SUSE Labs
