* [PATCH] sched,numa: do not move past the balance point if unbalanced
From: Rik van Riel @ 2015-01-12 21:30 UTC
  To: linux-kernel; +Cc: mgorman, peterz, mingo

There is a subtle interaction between the logic introduced in commit
e63da03639cc9e6e83b62e7ef8ffdbb92421416a, the way the load balancer
counts the load on each NUMA node, and the way NUMA hinting faults are
done.

Specifically, the load balancer only counts currently running tasks
in the load, while NUMA hinting faults may cause tasks to stop if
the page is locked by another task.

This could cause all of the threads of a large single instance workload,
like SPECjbb2005, to migrate to the same NUMA node. This was possible
because occasionally they all fault on the same few pages, and only one
of the threads remains runnable. That thread can move to the process's
preferred NUMA node without making the imbalance worse, because nothing
else is running at that time.

The fix is to check the direction of the net movement of load, and to
refuse a NUMA move if it would cause the system to move past the point
of balance.  In an unbalanced state, only moves that bring us closer
to the balance point are allowed.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40667cb..5b07eef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1197,9 +1197,11 @@ static void task_numa_assign(struct task_numa_env *env,
 static bool load_too_imbalanced(long src_load, long dst_load,
 				struct task_numa_env *env)
 {
-	long imb, old_imb;
-	long orig_src_load, orig_dst_load;
 	long src_capacity, dst_capacity;
+	long orig_src_load;
+	long load_a, load_b;
+	long moved_load;
+	long imb;
 
 	/*
 	 * The load is corrected for the CPU capacity available on each node.
@@ -1212,30 +1214,38 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	dst_capacity = env->dst_stats.compute_capacity;
 
 	/* We care about the slope of the imbalance, not the direction. */
-	if (dst_load < src_load)
-		swap(dst_load, src_load);
+	load_a = dst_load;
+	load_b = src_load;
+	if (load_a < load_b)
+		swap(load_a, load_b);
 
 	/* Is the difference below the threshold? */
-	imb = dst_load * src_capacity * 100 -
-	      src_load * dst_capacity * env->imbalance_pct;
+	imb = load_a * src_capacity * 100 -
+	      load_b * dst_capacity * env->imbalance_pct;
 	if (imb <= 0)
 		return false;
 
 	/*
 	 * The imbalance is above the allowed threshold.
-	 * Compare it with the old imbalance.
+	 * Allow the move if it brings us closer to a balanced situation,
+	 * without moving things past the point of balance.
 	 */
 	orig_src_load = env->src_stats.load;
-	orig_dst_load = env->dst_stats.load;
 
-	if (orig_dst_load < orig_src_load)
-		swap(orig_dst_load, orig_src_load);
-
-	old_imb = orig_dst_load * src_capacity * 100 -
-		  orig_src_load * dst_capacity * env->imbalance_pct;
+	/*
+	 * In a task swap, there will be one load moving from src to dst,
+	 * and another moving back. This is the net sum of both moves.
+	 * Allow the move if it brings the system closer to a balanced
+	 * situation, without crossing over the balance point.
+	 */
+	moved_load = orig_src_load - src_load;
 
-	/* Would this change make things worse? */
-	return (imb > old_imb);
+	if (moved_load > 0)
+		/* Moving src -> dst. Did we overshoot balance? */
+		return src_load < dst_load;
+	else
+		/* Moving dst -> src. Did we overshoot balance? */
+		return dst_load < src_load;
 }
 
 /*

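For concreteness, the threshold check can be exercised outside the
kernel. The sketch below reimplements just that check with made-up
loads, capacities and imbalance_pct values; it is an illustration,
not part of the patch:

#include <stdio.h>

/* Standalone sketch of the capacity-corrected threshold check in
 * load_too_imbalanced(); all input values below are made up.
 */
static int above_threshold(long src_load, long dst_load,
			   long src_capacity, long dst_capacity,
			   long imbalance_pct)
{
	long load_a = dst_load;
	long load_b = src_load;
	long imb;

	/* We care about the slope of the imbalance, not the direction. */
	if (load_a < load_b) {
		long tmp = load_a;
		load_a = load_b;
		load_b = tmp;
	}

	/* Is the difference below the threshold? */
	imb = load_a * src_capacity * 100 -
	      load_b * dst_capacity * imbalance_pct;
	return imb > 0;
}

int main(void)
{
	/* Equal capacities with a 25% margin: 1300 vs 1000 is 30%
	 * apart, so the check fires; 1200 vs 1000 stays within it.
	 */
	printf("%d\n", above_threshold(1000, 1300, 1024, 1024, 125)); /* 1 */
	printf("%d\n", above_threshold(1000, 1200, 1024, 1024, 125)); /* 0 */
	return 0;
}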

* Re: [PATCH] sched,numa: do not move past the balance point if unbalanced
From: Peter Zijlstra @ 2015-01-15 10:45 UTC
  To: Rik van Riel; +Cc: linux-kernel, mgorman, mingo

On Mon, Jan 12, 2015 at 04:30:39PM -0500, Rik van Riel wrote:
> There is a subtle interaction between the logic introduced in commit
> e63da03639cc9e6e83b62e7ef8ffdbb92421416a, the way the load balancer

  e63da03639cc ("sched/numa: Allow task switch if load imbalance improves")

I have the below in my .gitconfig to easily create them things:

[core]
        abbrev = 12
[alias]
        one = show -s --pretty='format:%h (\"%s\")'
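
With that alias, the reference comes straight from the commit; for the
commit above:

	$ git one e63da03639cc
	e63da03639cc ("sched/numa: Allow task switch if load imbalance improves")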

> counts the load on each NUMA node, and the way NUMA hinting faults are
> done.
> 
> Specifically, the load balancer only counts currently running tasks
> in the load, while NUMA hinting faults may cause tasks to stop if
> the page is locked by another task.
> 
> This could cause all of the threads of a large single instance workload,
> like SPECjbb2005, to migrate to the same NUMA node. This was possible
> because occasionally they all fault on the same few pages, and only one
> of the threads remains runnable. That thread can move to the process's
> preferred NUMA node without making the imbalance worse, because nothing
> else is running at that time.
> 
> The fix is to check the direction of the net movement of load, and to
> refuse a NUMA move if it would cause the system to move past the point
> of balance.  In an unbalanced state, only moves that bring us closer
> to the balance point are allowed.

Did you also test with whatever load needed the previous thing? It's far
too easy to fix one and break the other in my experience ;-)

>  	orig_src_load = env->src_stats.load;
> -	orig_dst_load = env->dst_stats.load;
>  
> -	if (orig_dst_load < orig_src_load)
> -		swap(orig_dst_load, orig_src_load);
> -
> -	old_imb = orig_dst_load * src_capacity * 100 -
> -		  orig_src_load * dst_capacity * env->imbalance_pct;
> +	/*
> +	 * In a task swap, there will be one load moving from src to dst,
> +	 * and another moving back. This is the net sum of both moves.
> +	 * Allow the move if it brings the system closer to a balanced
> +	 * situation, without crossing over the balance point.
> +	 */

This comment seems to 'forget' about the !swap moves?

> +	moved_load = orig_src_load - src_load;
>  
> -	/* Would this change make things worse? */
> -	return (imb > old_imb);
> +	if (moved_load > 0)
> +		/* Moving src -> dst. Did we overshoot balance? */
> +		return src_load < dst_load;

So here we inhibit movement when the src cpu gained (moved_load > 0) and
src was smaller than dst.

However there is no check the new src value is in fact bigger than dst;
src could have gained and still be smaller. And afaict that's a valid
move, even under the proposed semantics, right?

> +	else
> +		/* Moving dst -> src. Did we overshoot balance? */
> +		return dst_load < src_load;

And vs.

>  }

One should use capacity muck when comparing load between CPUs.

In any case, brain hurt.




* Re: [PATCH] sched,numa: do not move past the balance point if unbalanced
From: Rik van Riel @ 2015-01-15 18:21 UTC
  To: Peter Zijlstra; +Cc: linux-kernel, mgorman, mingo

On 01/15/2015 05:45 AM, Peter Zijlstra wrote:
> On Mon, Jan 12, 2015 at 04:30:39PM -0500, Rik van Riel wrote:
>> There is a subtle interaction between the logic introduced in commit
>> e63da03639cc9e6e83b62e7ef8ffdbb92421416a, the way the load balancer
>
>    e63da03639cc ("sched/numa: Allow task switch if load imbalance improves")

Will do. Thanks for the git config :)

>> The fix is to check the direction of the net movement of load, and to
>> refuse a NUMA move if it would cause the system to move past the point
>> of balance.  In an unbalanced state, only moves that bring us closer
>> to the balance point are allowed.
>
> Did you also test with whatever load needed the previous thing? Its far
> too easy to fix one and break the other in my experience ;-)

The load that caused the need for the previous fix
was one where all CPUs were overloaded, and one
process did not have its threads converged on one
node yet.

In that case, the load from moving one task is a
small fraction of the total load, and we are unable
to move through the balance point to the other side.
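
To make that concrete with made-up numbers: with 10 units of load on
src and 8 on dst, moving a task worth 1 unit lands at 9 vs 9 and is
allowed. Moving a second task would land at 8 vs 10; if that exceeds
the imbalance threshold, the new check sees that src has dropped below
dst and refuses to cross over.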

>>   	orig_src_load = env->src_stats.load;
>> -	orig_dst_load = env->dst_stats.load;
>>
>> -	if (orig_dst_load < orig_src_load)
>> -		swap(orig_dst_load, orig_src_load);
>> -
>> -	old_imb = orig_dst_load * src_capacity * 100 -
>> -		  orig_src_load * dst_capacity * env->imbalance_pct;
>> +	/*
>> +	 * In a task swap, there will be one load moving from src to dst,
>> +	 * and another moving back. This is the net sum of both moves.
>> +	 * Allow the move if it brings the system closer to a balanced
>> +	 * situation, without crossing over the balance point.
>> +	 */
>
> This comment seems to 'forget' about the !swap moves?

Not sure how to describe that, except by pointing out
that a task move is always from src to dst :)

>> +	moved_load = orig_src_load - src_load;
>>
>> -	/* Would this change make things worse? */
>> -	return (imb > old_imb);
>> +	if (moved_load > 0)
>> +		/* Moving src -> dst. Did we overshoot balance? */
>> +		return src_load < dst_load;
>
> So here we inhibit movement when the src cpu gained (moved_load > 0) and
> src was smaller than dst.

moved_load > 0 means that the src cpu loses load (for example,
orig_src_load = 100 and src_load = 80 give moved_load = 20).

> However there is no check the new src value is in fact bigger than dst;
> src could have gained and still be smaller. And afaict that's a valid
> move, even under the proposed semantics, right?

If src gains load, moved_load will be negative, and we take the
else branch.

>> +	else
>> +		/* Moving dst -> src. Did we overshoot balance? */
>> +		return dst_load < src_load;
>
> And vs.
>
>>   }
>
> One should use capacity muck when comparing load between CPUs.

You are right, I should use the capacities here.
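
Something along these lines (untested sketch, scaling each load by the
opposite node's capacity, the same correction the threshold check
already applies):

	if (moved_load > 0)
		/* Moving src -> dst. Did we overshoot balance? */
		return src_load * dst_capacity < dst_load * src_capacity;
	else
		/* Moving dst -> src. Did we overshoot balance? */
		return dst_load * src_capacity < src_load * dst_capacity;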


