Re: [PATCH] numa,sched: only consider less busy nodes as numa balancing destination

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jirka Hladky <jhladky@redhat.com>
To: Rik van Riel <riel@redhat.com>, Artem Bityutskiy <dedekind1@gmail.com>
Cc: linux-kernel@vger.kernel.org, mgorman@suse.de, peterz@infradead.org
Subject: Re: [PATCH] numa,sched: only consider less busy nodes as numa balancing destination
Date: Mon, 11 May 2015 14:44:54 +0200	[thread overview]
Message-ID: <5550A446.2050005@redhat.com> (raw)
In-Reply-To: <20150506114128.0c846a37@cuia.bos.redhat.com>

Hi Rik,

we have results for SPECjbb2005 and Linpack&Stream benchmarks with

4.1.0-0.rc1.git0.1.el7.x86_64 (without patch)
4.1.0-0.rc2.git0.3.el7.x86_64 with your patch
4.1.0-0.rc2.git0.3.el7.x86_64 with your patch and AUTONUMA disabled

The tests has been conducted on 3 different systems with 4 NUMA nodes 
and different versions of Intel processors and different amount of RAM.


For SPECjbb benchmark we see
-with your latest proposed patch applied
   * gains in range 7-15% !! for single instance  SPECjbb (tested on 
variety of systems, biggest gains on brickland system, gains are growing 
with growing number of threads)
   * for multi-instance SPECjbb run (4 parallel jobs on 4 NUMA node 
system) on change in results
   * for linpack no change
   * for stream bench slight improvements (but very close to error margin)
- with AUTONUMA disabled
   * with SPECjbb (both single and 4 parallel jobs) performance drop to 
1/2 of performance with AUTONUMA enabled
   * for linpack and stream performance drop by 30% compared with 
AUTONUMA enabled

In summary:
* the proposed patch improves performance for single process SPECjbb 
bench without hurting anything
* With AUTUNUMA disabled, performance drop is huge

Please let me know if you need more details.

Thanks
Jirka

On 05/06/2015 05:41 PM, Rik van Riel wrote:
> On Wed, 06 May 2015 13:35:30 +0300
> Artem Bityutskiy <dedekind1@gmail.com> wrote:
>
>> we observe a tremendous regression between kernel version 3.16 and 3.17
>> (and up), and I've bisected it to this commit:
>>
>> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
> Artem, Jirka, does this patch fix (or at least improve) the issues you
> have been seeing?  Does it introduce any new regressions?
>
> Peter, Mel, I think it may be time to stop waiting for the impedance
> mismatch between the load balancer and NUMA balancing to be resolved,
> and try to just avoid the issue in the NUMA balancing code...
>
> ----8<----
>
> Subject: numa,sched: only consider less busy nodes as numa balancing destination
>
> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
> preferred node") fixes an issue where workloads would never converge
> on a fully loaded (or overloaded) system.
>
> However, it introduces a regression on less than fully loaded systems,
> where workloads converge on a few NUMA nodes, instead of properly staying
> spread out across the whole system. This leads to a reduction in available
> memory bandwidth, and usable CPU cache, with predictable performance problems.
>
> The root cause appears to be an interaction between the load balancer and
> NUMA balancing, where the short term load represented by the load balancer
> differs from the long term load the NUMA balancing code would like to base
> its decisions on.
>
> Simply reverting a43455a1 would re-introduce the non-convergence of
> workloads on fully loaded systems, so that is not a good option. As
> an aside, the check done before a43455a1 only applied to a task's
> preferred node, not to other candidate nodes in the system, so the
> converge-on-too-few-nodes problem still happens, just to a lesser
> degree.
>
> Instead, try to compensate for the impedance mismatch between the
> load balancer and NUMA balancing by only ever considering a lesser
> loaded node as a destination for NUMA balancing, regardless of
> whether the task is trying to move to the preferred node, or to another
> node.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Artem Bityutski <dedekind1@gmail.com>
> Reported-by: Jirka Hladky <jhladky@redhat.com>
> ---
>   kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
>   1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffeaa4105e48..480e6a35ab35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>   	}
>   }
>   
> +/* Only move tasks to a NUMA node less busy than the current node. */
> +static bool numa_has_capacity(struct task_numa_env *env)
> +{
> +	struct numa_stats *src = &env->src_stats;
> +	struct numa_stats *dst = &env->dst_stats;
> +
> +	if (src->has_free_capacity && !dst->has_free_capacity)
> +		return false;
> +
> +	/*
> +	 * Only consider a task move if the source has a higher destination
> +	 * than the destination, corrected for CPU capacity on each node.
> +	 *
> +	 *      src->load                dst->load
> +	 * --------------------- vs ---------------------
> +	 * src->compute_capacity    dst->compute_capacity
> +	 */
> +	if (src->load * dst->compute_capacity >
> +	    dst->load * src->compute_capacity)
> +		return true;
> +
> +	return false;
> +}
> +
>   static int task_numa_migrate(struct task_struct *p)
>   {
>   	struct task_numa_env env = {
> @@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
>   	update_numa_stats(&env.dst_stats, env.dst_nid);
>   
>   	/* Try to find a spot on the preferred nid. */
> -	task_numa_find_cpu(&env, taskimp, groupimp);
> +	if (numa_has_capacity(&env))
> +		task_numa_find_cpu(&env, taskimp, groupimp);
>   
>   	/*
>   	 * Look at other nodes in these cases:
> @@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p)
>   			env.dist = dist;
>   			env.dst_nid = nid;
>   			update_numa_stats(&env.dst_stats, env.dst_nid);
> -			task_numa_find_cpu(&env, taskimp, groupimp);
> +			if (numa_has_capacity(&env))
> +				task_numa_find_cpu(&env, taskimp, groupimp);
>   		}
>   	}
>

next prev parent reply	other threads:[~2015-05-11 12:45 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-06 10:35 autoNUMA web workload regression Artem Bityutskiy
2015-05-06 10:37 ` Bityutskiy, Artem
2015-05-06 14:40 ` Rik van Riel
2015-05-06 15:41 ` [PATCH] numa,sched: only consider less busy nodes as numa balancing destination Rik van Riel
2015-05-06 17:00   ` Peter Zijlstra
2015-05-06 17:06     ` Rik van Riel
2015-05-07 13:29   ` Artem Bityutskiy
2015-05-08 13:13   ` Artem Bityutskiy
2015-05-08 20:03     ` Rik van Riel
2015-05-08 22:52       ` Rik van Riel
2015-05-11 11:11       ` Artem Bityutskiy
2015-05-11 14:20         ` Rik van Riel
2015-05-12 13:50       ` Artem Bityutskiy
2015-05-12 15:45         ` Rik van Riel
2015-05-13  6:29           ` Peter Zijlstra
2015-05-13  6:31             ` Peter Zijlstra
2015-05-13 10:59             ` Artem Bityutskiy
2015-05-13 13:51             ` Rik van Riel
2015-05-11 12:44   ` Jirka Hladky [this message]
2015-05-11 14:44     ` Rik van Riel
2015-05-26 20:29   ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5550A446.2050005@redhat.com \
    --to=jhladky@redhat.com \
    --cc=dedekind1@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.