[PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing
@ 2015-05-27 19:04 riel
  2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: riel @ 2015-05-27 19:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: mgorman, jhladky, peterz, mingo, dedekind1

A previous attempt to resolve a major conflict between load balancing and
NUMA balancing, changeset 095bebf61a46 ("sched/numa: Do not move past the
balance point if unbalanced"), introduced its own problems.

Revert that changeset, and introduce a new fix, which actually seems to
resolve the issues observed in Jirka's tests.

A test where the main thread creates a large memory area, and spawns
a worker thread to iterate over the memory (placed on another node
by select_task_rq_fair), after which the main thread goes to sleep
and waits for the worker thread to loop over all the memory now sees
the worker thread migrated to where the memory is, instead of having
all the memory migrated over like before.

Jirka has run a number of performance tests on several systems:
single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
system, with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
  2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel
@ 2015-05-27 19:04 ` riel
  2015-06-07 17:47   ` [tip:sched/core] Revert " tip-bot for Rik van Riel
  2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel
  2015-05-29  7:44 ` [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing Artem Bityutskiy
  2 siblings, 1 reply; 11+ messages in thread
From: riel @ 2015-05-27 19:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: mgorman, jhladky, peterz, mingo, dedekind1

From: Rik van Riel <riel@redhat.com>

Commit 095bebf61a46 ("sched/numa: Do not move past the balance point
if unbalanced") broke convergence of workloads with just one runnable
thread, by making it impossible for the one runnable thread on the
system to move from one NUMA node to another.

Instead, the thread would remain where it was, and pull all the memory
across to its location, which is much slower than just migrating the
thread to where the memory is.

The next patch has a better fix for the issue that 095bebf61a46 tried
to address.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa4105e48..c47bf0dffb34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1198,11 +1198,9 @@ static void task_numa_assign(struct task_numa_env *env,
 static bool load_too_imbalanced(long src_load, long dst_load,
 				struct task_numa_env *env)
 {
+	long imb, old_imb;
+	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
-	long orig_src_load;
-	long load_a, load_b;
-	long moved_load;
-	long imb;
 
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
@@ -1215,39 +1213,30 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	dst_capacity = env->dst_stats.compute_capacity;
 
 	/* We care about the slope of the imbalance, not the direction. */
-	load_a = dst_load;
-	load_b = src_load;
-	if (load_a < load_b)
-		swap(load_a, load_b);
+	if (dst_load < src_load)
+		swap(dst_load, src_load);
 
 	/* Is the difference below the threshold? */
-	imb = load_a * src_capacity * 100 -
-		load_b * dst_capacity * env->imbalance_pct;
+	imb = dst_load * src_capacity * 100 -
+	      src_load * dst_capacity * env->imbalance_pct;
 	if (imb <= 0)
 		return false;
 
 	/*
 	 * The imbalance is above the allowed threshold.
-	 * Allow a move that brings us closer to a balanced situation,
-	 * without moving things past the point of balance.
+	 * Compare it with the old imbalance.
 	 */
 	orig_src_load = env->src_stats.load;
+	orig_dst_load = env->dst_stats.load;
 
-	/*
-	 * In a task swap, there will be one load moving from src to dst,
-	 * and another moving back. This is the net sum of both moves.
-	 * A simple task move will always have a positive value.
-	 * Allow the move if it brings the system closer to a balanced
-	 * situation, without crossing over the balance point.
-	 */
-	moved_load = orig_src_load - src_load;
+	if (orig_dst_load < orig_src_load)
+		swap(orig_dst_load, orig_src_load);
 
-	if (moved_load > 0)
-		/* Moving src -> dst. Did we overshoot balance? */
-		return src_load * dst_capacity < dst_load * src_capacity;
-	else
-		/* Moving dst -> src. Did we overshoot balance? */
-		return dst_load * src_capacity < src_load * dst_capacity;
+	old_imb = orig_dst_load * src_capacity * 100 -
+		  orig_src_load * dst_capacity * env->imbalance_pct;
+
+	/* Would this change make things worse? */
+	return (imb > old_imb);
 }
 
 /*
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [tip:sched/core] Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
  2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
@ 2015-06-07 17:47   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot for Rik van Riel @ 2015-06-07 17:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, torvalds, mingo, hpa, akpm, jhladky, riel,
	linux-kernel

Commit-ID:  e4991b240c622f0441c21f4869e13209abc08c5e
Gitweb:     http://git.kernel.org/tip/e4991b240c622f0441c21f4869e13209abc08c5e
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 27 May 2015 15:04:27 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 7 Jun 2015 15:57:44 +0200

Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")

Commit 095bebf61a46 ("sched/numa: Do not move past the balance point
if unbalanced") broke convergence of workloads with just one runnable
thread, by making it impossible for the one runnable thread on the
system to move from one NUMA node to another.

Instead, the thread would remain where it was, and pull all the memory
across to its location, which is much slower than just migrating the
thread to where the memory is.

The next patch has a better fix for the issue that 095bebf61a46 tried
to address.

Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1432753468-7785-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 41 +++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84ada05..723d69e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1198,11 +1198,9 @@ static void task_numa_assign(struct task_numa_env *env,
 static bool load_too_imbalanced(long src_load, long dst_load,
 				struct task_numa_env *env)
 {
+	long imb, old_imb;
+	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
-	long orig_src_load;
-	long load_a, load_b;
-	long moved_load;
-	long imb;
 
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
@@ -1215,39 +1213,30 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	dst_capacity = env->dst_stats.compute_capacity;
 
 	/* We care about the slope of the imbalance, not the direction. */
-	load_a = dst_load;
-	load_b = src_load;
-	if (load_a < load_b)
-		swap(load_a, load_b);
+	if (dst_load < src_load)
+		swap(dst_load, src_load);
 
 	/* Is the difference below the threshold? */
-	imb = load_a * src_capacity * 100 -
-		load_b * dst_capacity * env->imbalance_pct;
+	imb = dst_load * src_capacity * 100 -
+	      src_load * dst_capacity * env->imbalance_pct;
 	if (imb <= 0)
 		return false;
 
 	/*
 	 * The imbalance is above the allowed threshold.
-	 * Allow a move that brings us closer to a balanced situation,
-	 * without moving things past the point of balance.
+	 * Compare it with the old imbalance.
 	 */
 	orig_src_load = env->src_stats.load;
+	orig_dst_load = env->dst_stats.load;
 
-	/*
-	 * In a task swap, there will be one load moving from src to dst,
-	 * and another moving back. This is the net sum of both moves.
-	 * A simple task move will always have a positive value.
-	 * Allow the move if it brings the system closer to a balanced
-	 * situation, without crossing over the balance point.
-	 */
-	moved_load = orig_src_load - src_load;
+	if (orig_dst_load < orig_src_load)
+		swap(orig_dst_load, orig_src_load);
 
-	if (moved_load > 0)
-		/* Moving src -> dst. Did we overshoot balance? */
-		return src_load * dst_capacity < dst_load * src_capacity;
-	else
-		/* Moving dst -> src. Did we overshoot balance? */
-		return dst_load * src_capacity < src_load * dst_capacity;
+	old_imb = orig_dst_load * src_capacity * 100 -
+		  orig_src_load * dst_capacity * env->imbalance_pct;
+
+	/* Would this change make things worse? */
+	return (imb > old_imb);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination
  2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel
  2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
@ 2015-05-27 19:04 ` riel
  2015-05-28  8:29   ` Mel Gorman
  2015-05-28 11:07   ` Srikar Dronamraju
  2015-05-29  7:44 ` [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing Artem Bityutskiy
  2 siblings, 2 replies; 11+ messages in thread
From: riel @ 2015-05-27 19:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: mgorman, jhladky, peterz, mingo, dedekind1

From: Rik van Riel <riel@redhat.com>

Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly staying
spread out across the whole system. This leads to a reduction in available
memory bandwidth, and usable CPU cache, with predictable performance problems.

The root cause appears to be an interaction between the load balancer and
NUMA balancing, where the short term load represented by the load balancer
differs from the long term load the NUMA balancing code would like to base
its decisions on.

Simply reverting a43455a1 would re-introduce the non-convergence of
workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the
load balancer and NUMA balancing by only ever considering a lesser
loaded node as a destination for NUMA balancing, regardless of
whether the task is trying to move to the preferred node, or to another
node.

This patch also addresses the issue that a system with a single runnable
thread would never migrate that thread to near its memory, introduced by
095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced").

A test where the main thread creates a large memory area, and spawns
a worker thread to iterate over the memory (placed on another node
by select_task_rq_fair), after which the main thread goes to sleep
and waits for the worker thread to loop over all the memory now sees
the worker thread migrated to where the memory is, instead of having
all the memory migrated over like before.

Jirka has run a number of performance tests on several systems:
single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
system, with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Artem Bityutski <dedekind1@gmail.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c47bf0dffb34..f655f2ad155d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }

+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+	struct numa_stats *src = &env->src_stats;
+	struct numa_stats *dst = &env->dst_stats;
+
+	if (src->has_free_capacity && !dst->has_free_capacity)
+		return false;
+
+	/*
+	 * Only consider a task move if the source has a higher destination
+	 * than the destination, corrected for CPU capacity on each node.
+	 *
+	 *      src->load                dst->load
+	 * --------------------- vs ---------------------
+	 * src->compute_capacity    dst->compute_capacity
+	 */
+	if (src->load * dst->compute_capacity >
+	    dst->load * src->compute_capacity)
+		return true;
+
+	return false;
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);

 	/* Try to find a spot on the preferred nid. */
-	task_numa_find_cpu(&env, taskimp, groupimp);
+	if (numa_has_capacity(&env))
+		task_numa_find_cpu(&env, taskimp, groupimp);

 	/*
 	 * Look at other nodes in these cases:
@@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, taskimp, groupimp);
+			if (numa_has_capacity(&env))
+				task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}

-- 
2.1.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination
  2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel
@ 2015-05-28  8:29   ` Mel Gorman
  2015-05-28 11:07   ` Srikar Dronamraju
  1 sibling, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2015-05-28  8:29 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, jhladky, peterz, mingo, dedekind1

On Wed, May 27, 2015 at 03:04:28PM -0400, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
> preferred node") fixes an issue where workloads would never converge
> on a fully loaded (or overloaded) system.
> 
> However, it introduces a regression on less than fully loaded systems,
> where workloads converge on a few NUMA nodes, instead of properly staying
> spread out across the whole system. This leads to a reduction in available
> memory bandwidth, and usable CPU cache, with predictable performance problems.
> 
> The root cause appears to be an interaction between the load balancer and
> NUMA balancing, where the short term load represented by the load balancer
> differs from the long term load the NUMA balancing code would like to base
> its decisions on.
> 
> Simply reverting a43455a1 would re-introduce the non-convergence of
> workloads on fully loaded systems, so that is not a good option. As
> an aside, the check done before a43455a1 only applied to a task's
> preferred node, not to other candidate nodes in the system, so the
> converge-on-too-few-nodes problem still happens, just to a lesser
> degree.
> 
> Instead, try to compensate for the impedance mismatch between the
> load balancer and NUMA balancing by only ever considering a lesser
> loaded node as a destination for NUMA balancing, regardless of
> whether the task is trying to move to the preferred node, or to another
> node.
> 
> This patch also addresses the issue that a system with a single runnable
> thread would never migrate that thread to near its memory, introduced by
> 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced").
> 
> A test where the main thread creates a large memory area, and spawns
> a worker thread to iterate over the memory (placed on another node
> by select_task_rq_fair), after which the main thread goes to sleep
> and waits for the worker thread to loop over all the memory now sees
> the worker thread migrated to where the memory is, instead of having
> all the memory migrated over like before.
> 
> Jirka has run a number of performance tests on several systems:
> single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
> system, with higher gains on systems with more cores per socket.
> Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
> little or no changes with the revert of 095bebf61a46 and this patch.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Artem Bityutski <dedekind1@gmail.com>
> Reported-by: Jirka Hladky <jhladky@redhat.com>
> Tested-by: Jirka Hladky <jhladky@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination
  2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel
  2015-05-28  8:29   ` Mel Gorman
@ 2015-05-28 11:07   ` Srikar Dronamraju
  2015-05-28 13:33     ` Rik van Riel
  1 sibling, 1 reply; 11+ messages in thread
From: Srikar Dronamraju @ 2015-05-28 11:07 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, jhladky, peterz, mingo, dedekind1

* riel@redhat.com <riel@redhat.com> [2015-05-27 15:04:28]:

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c47bf0dffb34..f655f2ad155d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  	}
>  }
>  
> +/* Only move tasks to a NUMA node less busy than the current node. */
> +static bool numa_has_capacity(struct task_numa_env *env)
> +{
> +	struct numa_stats *src = &env->src_stats;
> +	struct numa_stats *dst = &env->dst_stats;
> +
> +	if (src->has_free_capacity && !dst->has_free_capacity)
> +		return false;
> +
> +	/*
> +	 * Only consider a task move if the source has a higher destination
> +	 * than the destination, corrected for CPU capacity on each node.

In the above comment, did you mean source has higher load than the
destination?

> +	 *
> +	 *      src->load                dst->load
> +	 * --------------------- vs ---------------------
> +	 * src->compute_capacity    dst->compute_capacity
> +	 */
> +	if (src->load * dst->compute_capacity >
> +	    dst->load * src->compute_capacity)
> +		return true;
> +
> +	return false;
> +}
> +
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination
  2015-05-28 11:07   ` Srikar Dronamraju
@ 2015-05-28 13:33     ` Rik van Riel
  2015-05-28 13:37       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2015-05-28 13:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linux-kernel, mgorman, jhladky, peterz, mingo, dedekind1

On 05/28/2015 07:07 AM, Srikar Dronamraju wrote:
> * riel@redhat.com <riel@redhat.com> [2015-05-27 15:04:28]:
> 
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c47bf0dffb34..f655f2ad155d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>  	}
>>  }
>>  
>> +/* Only move tasks to a NUMA node less busy than the current node. */
>> +static bool numa_has_capacity(struct task_numa_env *env)
>> +{
>> +	struct numa_stats *src = &env->src_stats;
>> +	struct numa_stats *dst = &env->dst_stats;
>> +
>> +	if (src->has_free_capacity && !dst->has_free_capacity)
>> +		return false;
>> +
>> +	/*
>> +	 * Only consider a task move if the source has a higher destination
>> +	 * than the destination, corrected for CPU capacity on each node.
> 
> In the above comment, did you mean source has higher load than the
> destination?

Uh yes, indeed. Thanks for spotting that.

Peter, do you want me to send a v2 with the typo fixed, or
is it easier for you to fix up the typo before committing?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination
  2015-05-28 13:33     ` Rik van Riel
@ 2015-05-28 13:37       ` Peter Zijlstra
  2015-05-28 13:52         ` [PATCH v2 " Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2015-05-28 13:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Srikar Dronamraju, linux-kernel, mgorman, jhladky, mingo,
	dedekind1

On Thu, May 28, 2015 at 09:33:25AM -0400, Rik van Riel wrote:
> Peter, do you want me to send a v2 with the typo fixed, or
> is it easier for you to fix up the typo before committing?

Please send a new one, and also fix the Changelog to only include 12
char SHA1 references :-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] numa,sched: only consider less busy nodes as numa balancing destination
  2015-05-28 13:37       ` Peter Zijlstra
@ 2015-05-28 13:52         ` Rik van Riel
  2015-06-07 17:47           ` [tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations tip-bot for Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2015-05-28 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, linux-kernel, mgorman, jhladky, mingo,
	dedekind1

Here you are :)

---8<---

Subject: numa,sched: only consider less busy nodes as numa balancing destination

Changeset a43455a1d572 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly staying
spread out across the whole system. This leads to a reduction in available
memory bandwidth, and usable CPU cache, with predictable performance problems.

The root cause appears to be an interaction between the load balancer and
NUMA balancing, where the short term load represented by the load balancer
differs from the long term load the NUMA balancing code would like to base
its decisions on.

Simply reverting a43455a1d572 would re-introduce the non-convergence of
workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1d572 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the
load balancer and NUMA balancing by only ever considering a lesser
loaded node as a destination for NUMA balancing, regardless of
whether the task is trying to move to the preferred node, or to another
node.

This patch also addresses the issue that a system with a single runnable
thread would never migrate that thread to near its memory, introduced by
095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced").

A test where the main thread creates a large memory area, and spawns
a worker thread to iterate over the memory (placed on another node
by select_task_rq_fair), after which the main thread goes to sleep
and waits for the worker thread to loop over all the memory now sees
the worker thread migrated to where the memory is, instead of having
all the memory migrated over like before.

Jirka has run a number of performance tests on several systems:
single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
system, with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Artem Bityutski <dedekind1@gmail.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c47bf0dffb34..424d6ec1f61d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }

+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+	struct numa_stats *src = &env->src_stats;
+	struct numa_stats *dst = &env->dst_stats;
+
+	if (src->has_free_capacity && !dst->has_free_capacity)
+		return false;
+
+	/*
+	 * Only consider a task move if the source has a higher load
+	 * than the destination, corrected for CPU capacity on each node.
+	 *
+	 *      src->load                dst->load
+	 * --------------------- vs ---------------------
+	 * src->compute_capacity    dst->compute_capacity
+	 */
+	if (src->load * dst->compute_capacity >
+	    dst->load * src->compute_capacity)
+		return true;
+
+	return false;
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);

 	/* Try to find a spot on the preferred nid. */
-	task_numa_find_cpu(&env, taskimp, groupimp);
+	if (numa_has_capacity(&env))
+		task_numa_find_cpu(&env, taskimp, groupimp);

 	/*
 	 * Look at other nodes in these cases:
@@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, taskimp, groupimp);
+			if (numa_has_capacity(&env))
+				task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations
  2015-05-28 13:52         ` [PATCH v2 " Rik van Riel
@ 2015-06-07 17:47           ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 11+ messages in thread
From: tip-bot for Rik van Riel @ 2015-06-07 17:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, mingo, riel, jhladky, torvalds, tglx, srikar, dedekind1,
	linux-kernel, mgorman, akpm, hpa

Commit-ID:  6f9aad0bc37286c0441b57f0ba8cffee50715426
Gitweb:     http://git.kernel.org/tip/6f9aad0bc37286c0441b57f0ba8cffee50715426
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Thu, 28 May 2015 09:52:49 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 7 Jun 2015 15:57:45 +0200

sched/numa: Only consider less busy nodes as numa balancing destinations

Changeset a43455a1d572 ("sched/numa: Ensure task_numa_migrate() checks
the preferred node") fixes an issue where workloads would never
converge on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly
staying spread out across the whole system. This leads to a reduction
in available memory bandwidth, and usable CPU cache, with predictable
performance problems.

The root cause appears to be an interaction between the load balancer
and NUMA balancing, where the short term load represented by the load
balancer differs from the long term load the NUMA balancing code would
like to base its decisions on.

Simply reverting a43455a1d572 would re-introduce the non-convergence
of workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1d572 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the load
balancer and NUMA balancing by only ever considering a lesser loaded
node as a destination for NUMA balancing, regardless of whether the
task is trying to move to the preferred node, or to another node.

This patch also addresses the issue that a system with a single
runnable thread would never migrate that thread to near its memory,
introduced by 095bebf61a46 ("sched/numa: Do not move past the balance
point if unbalanced").

A test where the main thread creates a large memory area, and spawns a
worker thread to iterate over the memory (placed on another node by
select_task_rq_fair), after which the main thread goes to sleep and
waits for the worker thread to loop over all the memory now sees the
worker thread migrated to where the memory is, instead of having all
the memory migrated over like before.

Jirka has run a number of performance tests on several systems: single
instance SpecJBB 2005 performance is 7-15% higher on a 4 node system,
with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.

Reported-by: Artem Bityutski <dedekind1@gmail.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 723d69e..4b6e5f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }

+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+	struct numa_stats *src = &env->src_stats;
+	struct numa_stats *dst = &env->dst_stats;
+
+	if (src->has_free_capacity && !dst->has_free_capacity)
+		return false;
+
+	/*
+	 * Only consider a task move if the source has a higher load
+	 * than the destination, corrected for CPU capacity on each node.
+	 *
+	 *      src->load                dst->load
+	 * --------------------- vs ---------------------
+	 * src->compute_capacity    dst->compute_capacity
+	 */
+	if (src->load * dst->compute_capacity >
+	    dst->load * src->compute_capacity)
+		return true;
+
+	return false;
+}
+
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);

 	/* Try to find a spot on the preferred nid. */
-	task_numa_find_cpu(&env, taskimp, groupimp);
+	if (numa_has_capacity(&env))
+		task_numa_find_cpu(&env, taskimp, groupimp);

 	/*
 	 * Look at other nodes in these cases:
@@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			task_numa_find_cpu(&env, taskimp, groupimp);
+			if (numa_has_capacity(&env))
+				task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing
  2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel
  2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
  2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel
@ 2015-05-29  7:44 ` Artem Bityutskiy
  2 siblings, 0 replies; 11+ messages in thread
From: Artem Bityutskiy @ 2015-05-29  7:44 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, mgorman, jhladky, peterz, mingo

On Wed, 2015-05-27 at 15:04 -0400, riel@redhat.com wrote:
> A previous attempt to resolve a major conflict between load balancing and
> NUMA balancing, changeset 095bebf61a46 ("sched/numa: Do not move past the
> balance point if unbalanced"), introduced its own problems.
> 
> Revert that changeset, and introduce a new fix, which actually seems to
> resolve the issues observed in Jirka's tests.
> 
> A test where the main thread creates a large memory area, and spawns
> a worker thread to iterate over the memory (placed on another node
> by select_task_rq_fair), after which the main thread goes to sleep
> and waits for the worker thread to loop over all the memory now sees
> the worker thread migrated to where the memory is, instead of having
> all the memory migrated over like before.
> 
> Jirka has run a number of performance tests on several systems:
> single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
> system, with higher gains on systems with more cores per socket.
> Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
> little or no changes with the revert of 095bebf61a46 and this patch.

[Re-sending since it didn't hit the mailing list first time, due to
HTML]

Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>

Hi Rik,

I've executed our eCommerce Web workload benchmark. Last time I did not
revert 095bebf61a46, now I tested this patch-set. Let me summarize
everything.

Here is the average web server response time in millisecs for various
kernels.

1. v4.1-rc1 - 1450
2. v4.1-rc1 + a43455a1d572daf7b730fe12eb747d1e17411365 reverted  -  300
3. v4.1-rc1 + NUMA disabled  - 480
4. v4.1-rc5 + this patch-set - 1260

So as you see, for our workload reverting
a43455a1d572daf7b730fe12eb747d1e17411365
results in Web server being most responsive (reminder - this is about a
2-socket Haswell-EP
server).

Just disabling NUMA also gives a big improvement, but not as good as
reverting the
"offending" (from our workload's POW) patch.

This patch-set does result in a noticeable improvement too.

Artem.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-06-07 17:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-27 19:04 [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing riel
2015-05-27 19:04 ` [PATCH 1/2] revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced") riel
2015-06-07 17:47   ` [tip:sched/core] Revert " tip-bot for Rik van Riel
2015-05-27 19:04 ` [PATCH 2/2] numa,sched: only consider less busy nodes as numa balancing destination riel
2015-05-28  8:29   ` Mel Gorman
2015-05-28 11:07   ` Srikar Dronamraju
2015-05-28 13:33     ` Rik van Riel
2015-05-28 13:37       ` Peter Zijlstra
2015-05-28 13:52         ` [PATCH v2 " Rik van Riel
2015-06-07 17:47           ` [tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations tip-bot for Rik van Riel
2015-05-29  7:44 ` [PATCH 0/2] numa,sched: resolve conflict between load balancing and NUMA balancing Artem Bityutskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox