public inbox for linux-kernel@vger.kernel.org
* [PATCH] Load balancing problem in 2.6.2-mm1
@ 2004-02-06  9:24 Rick Lindsley
  2004-02-06  9:38 ` Nick Piggin
  2004-02-06 10:30 ` Anton Blanchard
  0 siblings, 2 replies; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06  9:24 UTC (permalink / raw)
  To: piggin, akpm; +Cc: linux-kernel, mjbligh, dvhltc

Nick, Andrew --

Found a problem in Nick's code which had it way overbalancing much of
the time.  I've included a patch below.

I had been porting my schedstats patch to the -mm tree and noticed some
huge imbalances (on the order of 33 million) being "corrected" and at
first I thought it was a mis-port. But it was right.  We really were
deciding 33 million processes had to move.  Of course, we never found
that many, but we still moved as many as can_migrate_task() would allow.

In find_busiest_group(), after we exit the do/while, we select our
imbalance.  But max_load, avg_load, and this_load are all unsigned,
so min(x,y) will make a bad choice if max_load < avg_load < this_load
(that is, a choice between two negative [very large] numbers).

Unfortunately, there is a bug when max_load never gets changed from zero
(look in the loop and think what happens if the only load on the machine
is being created by cpu groups of which we are a member). And you have
a recipe for some really bogus values for imbalance.

Even if you fix the max_load == 0 bug, there will still be times when
avg_load - this_load will be negative (thus very large) and you'll make
the decision to move stuff when you shouldn't have.

This patch allows for this_load to set max_load, which if I understand
the logic properly is correct.  It then adds a check to imbalance to make
sure a negative number hasn't been coerced into a large positive number.
With this patch applied, the algorithm is *much* more conservative ...
maybe *too* conservative but that's for another round of testing ...

Rick

diff -rup linux-2.6.2-mm1/kernel/sched.c linux-2.6.2-mm1-fix/kernel/sched.c
--- linux-2.6.2-mm1/kernel/sched.c	Thu Feb  5 14:47:17 2004
+++ linux-2.6.2-mm1-fix/kernel/sched.c	Thu Feb  5 21:44:04 2004
@@ -1352,7 +1624,7 @@ static struct sched_group *
 find_busiest_group(struct sched_domain *domain, int this_cpu,
 				unsigned long *imbalance, enum idle_type idle)
 {
-	unsigned long max_load, avg_load, total_load, this_load;
+	unsigned long max_load, avg_load, total_load, this_load, load_diff;
 	int modify, total_nr_cpus, busiest_nr_cpus = 0;
 	enum idle_type package_idle = IDLE;
 	struct sched_group *busiest = NULL, *group = domain->groups;
@@ -1407,14 +1679,13 @@ find_busiest_group(struct sched_domain *
 		total_nr_cpus += nr_cpus;
 		avg_load /= nr_cpus;
 
+		if (avg_load > max_load)
+			max_load = avg_load;
+
 		if (local_group) {
 			this_load = avg_load;
-			goto nextgroup;
-		}
-
-		if (avg_load >= max_load) {
+		} else if (avg_load >= max_load) {
 			busiest = group;
-			max_load = avg_load;
 			busiest_nr_cpus = nr_cpus;
 		}
 nextgroup:
@@ -1437,8 +1708,19 @@ nextgroup:
 	 * reduce the max loaded cpu below the average load, as either of these
 	 * actions would just result in more rebalancing later, and ping-pong
 	 * tasks around. Thus we look for the minimum possible imbalance.
+	 * Negative imbalances (*we* are more loaded than anyone else) will
+	 * be counted as no imbalance for these purposes -- we can't fix that
+	 * by pulling tasks to us.  Be careful of negative numbers as they'll
+	 * appear as very large values with unsigned longs.
 	 */
-	*imbalance = min(max_load - avg_load, avg_load - this_load);
+	if (avg_load > this_load)
+		load_diff = avg_load - this_load;
+	else
+		load_diff = 0;
+
+	*imbalance = min(max_load - avg_load, load_diff);
+	if ((long)*imbalance < 0)
+		*imbalance = 0;
 
 	/* Get rid of the scaling factor now, rounding *up* as we divide */
 	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06  9:24 [PATCH] Load balancing problem in 2.6.2-mm1 Rick Lindsley
@ 2004-02-06  9:38 ` Nick Piggin
  2004-02-06 18:13   ` Rick Lindsley
  2004-02-06 10:30 ` Anton Blanchard
  1 sibling, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-06  9:38 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: akpm, linux-kernel, mjbligh, dvhltc



Rick Lindsley wrote:

>Nick, Andrew --
>
>Found a problem in Nick's code which had it way overbalancing much of
>the time.  I've included a patch below.
>
>I had been porting my schedstats patch to the -mm tree and noticed some
>huge imbalances (on the order of 33 million) being "corrected" and at
>first I thought it was a mis-port. But it was right.  We really were
>deciding 33 million processes had to move.  Of course, we never found
>that many, but we still moved as many as can_migrate_task() would allow.
>
>In find_busiest_group(), after we exit the do/while, we select our
>imbalance.  But max_load, avg_load, and this_load are all unsigned,
>so min(x,y) will make a bad choice if max_load < avg_load < this_load
>(that is, a choice between two negative [very large] numbers).
>
>Unfortunately, there is a bug when max_load never gets changed from zero
>(look in the loop and think what happens if the only load on the machine
>is being created by cpu groups of which we are a member). And you have
>a recipe for some really bogus values for imbalance.
>
>Even if you fix the max_load == 0 bug, there will still be times when
>avg_load - this_load will be negative (thus very large) and you'll make
>the decision to move stuff when you shouldn't have.
>
>This patch allows for this_load to set max_load, which if I understand
>the logic properly is correct.  It then adds a check to imbalance to make
>sure a negative number hasn't been coerced into a large positive number.
>With this patch applied, the algorithm is *much* more conservative ...
>maybe *too* conservative but that's for another round of testing ...
>
>Rick
>
>diff -rup linux-2.6.2-mm1/kernel/sched.c linux-2.6.2-mm1-fix/kernel/sched.c
>--- linux-2.6.2-mm1/kernel/sched.c	Thu Feb  5 14:47:17 2004
>+++ linux-2.6.2-mm1-fix/kernel/sched.c	Thu Feb  5 21:44:04 2004
>@@ -1352,7 +1624,7 @@ static struct sched_group *
> find_busiest_group(struct sched_domain *domain, int this_cpu,
> 				unsigned long *imbalance, enum idle_type idle)
> {
>-	unsigned long max_load, avg_load, total_load, this_load;
>+	unsigned long max_load, avg_load, total_load, this_load, load_diff;
> 	int modify, total_nr_cpus, busiest_nr_cpus = 0;
> 	enum idle_type package_idle = IDLE;
> 	struct sched_group *busiest = NULL, *group = domain->groups;
>@@ -1407,14 +1679,13 @@ find_busiest_group(struct sched_domain *
> 		total_nr_cpus += nr_cpus;
> 		avg_load /= nr_cpus;
> 
>+		if (avg_load > max_load)
>+			max_load = avg_load;
>+
> 		if (local_group) {
> 			this_load = avg_load;
>-			goto nextgroup;
>-		}
>-
>-		if (avg_load >= max_load) {
>+		} else if (avg_load >= max_load) {
> 			busiest = group;
>-			max_load = avg_load;
> 			busiest_nr_cpus = nr_cpus;
> 		}
> nextgroup:
>@@ -1437,8 +1708,19 @@ nextgroup:
> 	 * reduce the max loaded cpu below the average load, as either of these
> 	 * actions would just result in more rebalancing later, and ping-pong
> 	 * tasks around. Thus we look for the minimum possible imbalance.
>+	 * Negative imbalances (*we* are more loaded than anyone else) will
>+	 * be counted as no imbalance for these purposes -- we can't fix that
>+	 * by pulling tasks to us.  Be careful of negative numbers as they'll
>+	 * appear as very large values with unsigned longs.
> 	 */
>-	*imbalance = min(max_load - avg_load, avg_load - this_load);
>+	if (avg_load > this_load)
>+		load_diff = avg_load - this_load;
>+	else
>+		load_diff = 0;
>+
>+	*imbalance = min(max_load - avg_load, load_diff);
>+	if ((long)*imbalance < 0)
>+		*imbalance = 0;
> 
>

You're right Rick, thanks for catching this. Why do you have
this last test though? This shouldn't be possible?



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06  9:24 [PATCH] Load balancing problem in 2.6.2-mm1 Rick Lindsley
  2004-02-06  9:38 ` Nick Piggin
@ 2004-02-06 10:30 ` Anton Blanchard
  2004-02-06 18:15   ` Rick Lindsley
  2004-02-06 18:33   ` Martin J. Bligh
  1 sibling, 2 replies; 37+ messages in thread
From: Anton Blanchard @ 2004-02-06 10:30 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: piggin, akpm, linux-kernel, mjbligh, dvhltc


Hi,

> This patch allows for this_load to set max_load, which if I understand
> the logic properly is correct.  It then adds a check to imbalance to make
> sure a negative number hasn't been coerced into a large positive number.
> With this patch applied, the algorithm is *much* more conservative ...
> maybe *too* conservative but that's for another round of testing ...

Good stuff, I just gave the patch a spin and things seem a little
calmer. However, I'm still seeing a lot of balancing going on within a
node.

Setup:

2 threads per cpu.
2 nodes of 16 threads each.

I ran a single "yes > /dev/null"

And it looks like that process is bouncing around the entire node.
Below is a 2 second average.

Anton

cpu    user  system    idle             cpu    user  system    idle

node 0:
cpu0      2       0      99             cpu1      9       0      91
cpu2      1       0      99             cpu3      8       0      92
cpu4      3       0      97             cpu5     10       0      90
cpu6      2       0      98             cpu7     10       0      90
cpu8      2       0      98             cpu9      9       0      90
cpu10     3       0      96             cpu11     9       0      90
cpu12     2       0      98             cpu13    10       0      90
cpu14     2       1      97             cpu15    10       1      89

node 1:
cpu16     0       0     100             cpu17     0       0     100
cpu18     0       0     101             cpu19     0       0     100
cpu20     0       0     100             cpu21     0       0     101
cpu22     0       0     101             cpu23     0       0     100
cpu24     0       0     100             cpu25     0       0     100
cpu26     0       0     100             cpu27     0       0     100
cpu28     0       0     101             cpu29     0       0     100
cpu30     0       0     100             cpu31     0       0     100


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06  9:38 ` Nick Piggin
@ 2004-02-06 18:13   ` Rick Lindsley
  2004-02-06 21:57     ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06 18:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel, mjbligh, dvhltc

    >+	if ((long)*imbalance < 0)
    >+		*imbalance = 0;
    > 
    
    You're right Rick, thanks for catching this. Why do you have
    this last test though? This shouldn't be possible?

A combination of paranoia and leftover code from a previous fix :)  At the
least, this could probably become a BUG_ON now, or I wouldn't cry if
we took it out entirely.

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 10:30 ` Anton Blanchard
@ 2004-02-06 18:15   ` Rick Lindsley
  2004-02-06 18:39     ` Martin J. Bligh
  2004-02-06 18:33   ` Martin J. Bligh
  1 sibling, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06 18:15 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: piggin, akpm, linux-kernel, mjbligh, dvhltc

    Good stuff, I just gave the patch a spin and things seem a little
    calmer. However, I'm still seeing a lot of balancing going on within a
    node.

This is a clearly recognizable edge case, so I'll try drawing this up on
some paper and see if I can suggest another patch.  There's no good reason
to move one lone process from a particular processor to another idle one.

But it also approaches a question that's come up before:  if you have 2
tasks on processor A and 1 on processor B, do you move one from A to B?
One argument is that the two tasks on A will take twice as long as
the one on B if you do nothing.  But another says that bouncing a task
around can't correct the overall imbalance and so is wasteful.  I know
of benchmarks where both behaviors are considered important.  Thoughts?

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 10:30 ` Anton Blanchard
  2004-02-06 18:15   ` Rick Lindsley
@ 2004-02-06 18:33   ` Martin J. Bligh
  1 sibling, 0 replies; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-06 18:33 UTC (permalink / raw)
  To: Anton Blanchard, Rick Lindsley; +Cc: piggin, akpm, linux-kernel, dvhltc

>> This patch allows for this_load to set max_load, which if I understand
>> the logic properly is correct.  It then adds a check to imbalance to make
>> sure a negative number hasn't been coerced into a large positive number.
>> With this patch applied, the algorithm is *much* more conservative ...
>> maybe *too* conservative but that's for another round of testing ...
> 
> Good stuff, I just gave the patch a spin and things seem a little
> calmer. However, I'm still seeing a lot of balancing going on within a
> node.
> 
> Setup:
> 
> 2 threads per cpu.
> 2 nodes of 16 threads each.
> 
> I ran a single "yes > /dev/null"
> 
> And it looks like that process is bouncing around the entire node.
> Below is a 2 second average.

OK, what happens if you apply this ... does it fix it?
(not saying this is the correct solution, just debug)

M.

diff -purN -X /home/mbligh/.diff.exclude mm1/kernel/sched.c mm1-schedfix/kernel/sched.c
--- mm1/kernel/sched.c	2004-02-06 10:28:55.000000000 -0800
+++ mm1-schedfix/kernel/sched.c	2004-02-06 10:32:04.000000000 -0800
@@ -1440,15 +1440,11 @@ nextgroup:
 	 */
 	*imbalance = min(max_load - avg_load, avg_load - this_load);
 
-	/* Get rid of the scaling factor now, rounding *up* as we divide */
-	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
+	/* Get rid of the scaling factor now */
+	*imbalance = *imbalance >> SCHED_LOAD_SHIFT;
 
 	if (*imbalance == 0) {
-		if (package_idle != NOT_IDLE && domain->flags & SD_FLAG_IDLE
-			&& max_load * busiest_nr_cpus > (3*SCHED_LOAD_SCALE/2))
-			*imbalance = 1;
-		else
-			busiest = NULL;
+		busiest = NULL;
 	}
 
 	return busiest;



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 18:15   ` Rick Lindsley
@ 2004-02-06 18:39     ` Martin J. Bligh
  2004-02-06 22:02       ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-06 18:39 UTC (permalink / raw)
  To: Rick Lindsley, Anton Blanchard; +Cc: piggin, akpm, linux-kernel, dvhltc

>     Good stuff, I just gave the patch a spin and things seem a little
>     calmer. However, I'm still seeing a lot of balancing going on within a
>     node.
> 
> This is a clearly recognizable edge case, so I'll try drawing this up on
> some paper and see if I can suggest another patch.  There's no good reason
> to move one lone process from a particular processor to another idle one.
> 
> But it also approaches a question that's come up before:  if you have 2
> tasks on processor A and 1 on processor B, do you move one from A to B?
> One argument is that the two tasks on A will take twice as long as
> the one on B if you do nothing.  But another says that bouncing a task
> around can't correct the overall imbalance and so is wasteful.  I know
> of benchmarks where both behaviors are considered important.  Thoughts?

It's the classic fairness vs throughput thing we've argued about before.
Most workloads don't have that static a number of processes, but it 
probably does need to do it if the imbalance is persistent ... but much
more reluctantly than normal balancing. See the patch I sent out a bit
earlier to test it - that may be *too* extreme in the other direction,
but it should confirm what's going on, at least.

M.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 18:13   ` Rick Lindsley
@ 2004-02-06 21:57     ` Nick Piggin
  2004-02-06 22:30       ` Rick Lindsley
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 21:57 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: akpm, linux-kernel, mjbligh, dvhltc



Rick Lindsley wrote:

>    >+	if ((long)*imbalance < 0)
>    >+		*imbalance = 0;
>    > 
>    
>    You're right Rick, thanks for catching this. Why do you have
>    this last test though? This shouldn't be possible?
>
>A combination of paranoia and leftover code from a previous fix :)  At the
>least, this could probably become a BUG_ON now, or I wouldn't cry if
>we took it out entirely.
>
>

I gave it a run with WARN_ON for a while and it's fine, so we'll
take it out altogether.

We also shouldn't need load_diff, because if (avg_load <= this_load)
then imbalance will be zero, so I'll fix that up.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 18:39     ` Martin J. Bligh
@ 2004-02-06 22:02       ` Nick Piggin
  2004-02-06 22:34         ` Rick Lindsley
  2004-02-06 22:42         ` Martin J. Bligh
  0 siblings, 2 replies; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 22:02 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc



Martin J. Bligh wrote:

>>    Good stuff, I just gave the patch a spin and things seem a little
>>    calmer. However, I'm still seeing a lot of balancing going on within a
>>    node.
>>
>>This is a clearly recognizable edge case, so I'll try drawing this up on
>>some paper and see if I can suggest another patch.  There's no good reason
>>to move one lone process from a particular processor to another idle one.
>>
>>But it also approaches a question that's come up before:  if you have 2
>>tasks on processor A and 1 on processor B, do you move one from A to B?
>>One argument is that the two tasks on A will take twice as long as
>>the one on B if you do nothing.  But another says that bouncing a task
>>around can't correct the overall imbalance and so is wasteful.  I know
>>of benchmarks where both behaviors are considered important.  Thoughts?
>>
>
>It's the classic fairness vs throughput thing we've argued about before.
>Most workloads don't have that static a number of processes, but it 
>probably does need to do it if the imbalance is persistent ... but much
>more reluctantly than normal balancing. See the patch I sent out a bit
>earlier to test it - that may be *too* extreme in the other direction,
>but it should confirm what's going on, at least.
>
>

Yep. I've argued for fairness here, and that is presently what
we get. Between nodes the threshold should probably be higher
though.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 21:57     ` Nick Piggin
@ 2004-02-06 22:30       ` Rick Lindsley
  2004-02-06 22:40         ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06 22:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel, mjbligh, dvhltc

    We also shouldn't need load_diff, because if (avg_load <= this_load)
    then imbalance will be zero, so I'll fix that up.

Are we sure imbalance will be zero?  I think we still need that.  We can
turn it into a single C statement if we want to be clever, but what we
save in a temporary variable we'd better replace with comments to make the
cleverness plain.  We need there to be no "negative" numbers in the
min() statement in case max-avg is non-zero but avg-this is "negative".

Imagine loads of

    cpu0	0
    cpu1	0
    cpu2	3 
    cpu3	2
    cpu4	0

and we're running on cpu3.

    max_load=3
    avg_load=1
    this_load=2

min(max-avg, avg-this) will be min(3-1, 1-2); since 1-2 wraps to a huge
unsigned value, that min() is 2, and we'll choose to try to pull two
tasks to cpu3 instead of just leaving it alone, which is the right thing
to do.

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:02       ` Nick Piggin
@ 2004-02-06 22:34         ` Rick Lindsley
  2004-02-06 22:48           ` Nick Piggin
  2004-02-06 22:42         ` Martin J. Bligh
  1 sibling, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06 22:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Martin J. Bligh, Anton Blanchard, akpm, linux-kernel, dvhltc

    Yep. I've argued for fairness here, and that is presently what
    we get. Between nodes the threshold should probably be higher
    though.

While I like the idea of a self-tuning scheduler, when combined with
this new sched_domain algorithm it's hard to tell if the tuning or the
algorithm is at fault when we get results we don't like.  Have you done
much running with the auto-tuning turned off, using the old values,
to see the impact (positive or negative) that just the new algorithm has?

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:30       ` Rick Lindsley
@ 2004-02-06 22:40         ` Nick Piggin
  2004-02-06 22:49           ` Rick Lindsley
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 22:40 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: akpm, linux-kernel, mjbligh, dvhltc



Rick Lindsley wrote:

>    We also shouldn't need load_diff, because if (avg_load <= this_load)
>    then imbalance will be zero, so I'll fix that up.
>
>Are we sure imbalance will be zero?  I think we still need that.  We can
>turn it into a single C statement if we want to be clever but what you
>save in temporary variable we'd better replace in comments to make the
>cleverness plain.  We need there to be no "negative" numbers in the
>min() statement in case max-avg is non-zero but avg-this is "negative".
>
>Imagine loads of
>
>    cpu0	0
>    cpu1	0
>    cpu2	3 
>    cpu3	2
>    cpu4	0
>
>and we're running on cpu3.
>
>    max_load=3
>    avg_load=1
>    this_load=2
>
>min(max-avg, avg-this) will be min(3-1,1-2) or two, and we'll choose to
>try to pull two to cpu3 instead of just leaving it alone which is the
>right thing to do.
>
>

No you definitely still need the test... this is what I mean:

	if (avg_load > this_load)
		*imbalance = min(max_load - avg_load, avg_load - this_load);
	else
		*imbalance = 0;




* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:02       ` Nick Piggin
  2004-02-06 22:34         ` Rick Lindsley
@ 2004-02-06 22:42         ` Martin J. Bligh
  2004-02-06 22:53           ` Nick Piggin
  2004-02-06 23:11           ` Rick Lindsley
  1 sibling, 2 replies; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-06 22:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc

>> It's the classic fairness vs throughput thing we've argued about before.
>> Most workloads don't have that static a number of processes, but it 
>> probably does need to do it if the imbalance is persistent ... but much
>> more reluctantly than normal balancing. See the patch I sent out a bit
>> earlier to test it - that may be *too* extreme in the other direction,
>> but it should confirm what's going on, at least.
> 
> Yep. I've argued for fairness here, and that is presently what
> we get. Between nodes the threshold should probably be higher
> though.

OK, but do you agree that the rate we rebalance things like 2 vs 1 should
be slower than the rate we rebalance 3 vs 1 ? Fairness is only relevant
over a long term imbalance anyway, so there should be a big damper on
"fairness only" rebalances.

Moreover, as Rick pointed out, it's particularly futile over idle cpus ;-)

m.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:34         ` Rick Lindsley
@ 2004-02-06 22:48           ` Nick Piggin
  0 siblings, 0 replies; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 22:48 UTC (permalink / raw)
  To: Rick Lindsley
  Cc: Martin J. Bligh, Anton Blanchard, akpm, linux-kernel, dvhltc



Rick Lindsley wrote:

>    Yep. I've argued for fairness here, and that is presently what
>    we get. Between nodes the threshold should probably be higher
>    though.
>
>While I like the idea of a self-tuning scheduler, when combined with
>this new sched_domain algorithm it's hard to tell if the tuning or the
>algorithm is at fault when we get results we don't like.  Have you done
>much running with the auto-tuning turned off, using the old values,
>to see the impact (positive or negative) that just the new algorithm has?
>
>

I'm not sure what you mean by self-tuning. Do you mean the scheduling
backoff stuff? Because that makes very little difference on a 16-way
NUMAQ. However it becomes critical for SGI above around 128 CPUs IIRC
so I just kept it in mind when doing sched domains.

The new balancing calculations are definitely a win in my tests. One
tiny regression (on the order of 1%) I saw on the NUMAQ was tbench due to
increased idle time. But I'll still take it as a win because we were
doing nearly 1000 times less inter node balancing.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:40         ` Nick Piggin
@ 2004-02-06 22:49           ` Rick Lindsley
  2004-02-06 23:08             ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06 22:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel, mjbligh, dvhltc

    No you definitely still need the test... this is what I mean:
    
           if (avg_load > this_load)
                    *imbalance = min(max_load - avg_load, avg_load - this_load);
            else
                    *imbalance = 0;

Ah, yes, ok. I thought you were saying it would be zero because of code
elsewhere.  Yup, that'll work.

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:42         ` Martin J. Bligh
@ 2004-02-06 22:53           ` Nick Piggin
  2004-02-06 23:11           ` Rick Lindsley
  1 sibling, 0 replies; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 22:53 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc



Martin J. Bligh wrote:

>>>It's the classic fairness vs throughput thing we've argued about before.
>>>Most workloads don't have that static a number of processes, but it 
>>>probably does need to do it if the imbalance is persistent ... but much
>>>more reluctantly than normal balancing. See the patch I sent out a bit
>>>earlier to test it - that may be *too* extreme in the other direction,
>>>but it should confirm what's going on, at least.
>>>
>>Yep. I've argued for fairness here, and that is presently what
>>we get. Between nodes the threshold should probably be higher
>>though.
>>
>
>OK, but do you agree that the rate we rebalance things like 2 vs 1 should
>be slower than the rate we rebalance 3 vs 1 ? Fairness is only relevant
>over a long term imbalance anyway, so there should be a big damper on
>"fairness only" rebalances.
>
>

Well presently it happens at the same rate. This isn't bad though,
because you just use the more conservative rate. It's probably not
worth distinguishing the two cases.

If a CPU becomes idle, it will attempt to balance immediately.

>Moreover, as Rick pointed out, it's particularly futile over idle cpus ;-)
>
>

I don't follow...



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:49           ` Rick Lindsley
@ 2004-02-06 23:08             ` Nick Piggin
  0 siblings, 0 replies; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 23:08 UTC (permalink / raw)
  To: Rick Lindsley, akpm; +Cc: linux-kernel, mjbligh, dvhltc

[-- Attachment #1: Type: text/plain, Size: 445 bytes --]

Rick Lindsley wrote:

>    No you definitely still need the test... this is what I mean:
>    
>           if (avg_load > this_load)
>                    *imbalance = min(max_load - avg_load, avg_load - this_load);
>            else
>                    *imbalance = 0;
>
>Ah, yes, ok. I thought you were saying it would be zero because of code
>elsewhere.  Yup, that'll work.
>
>Rick
>
>

OK, I moved the rescaling too. Andrew, please apply.



[-- Attachment #2: sched-fix.patch --]
[-- Type: text/plain, Size: 2848 bytes --]


From: Rick Lindsley <ricklind@us.ibm.com>

In find_busiest_group(), after we exit the do/while, we select our
imbalance.  But max_load, avg_load, and this_load are all unsigned,
so min(x,y) will make a bad choice if max_load < avg_load < this_load
(that is, a choice between two negative [very large] numbers).

Unfortunately, there is a bug when max_load never gets changed from zero
(look in the loop and think what happens if the only load on the machine
is being created by cpu groups of which we are a member). And you have
a recipe for some really bogus values for imbalance.

Even if you fix the max_load == 0 bug, there will still be times when
avg_load - this_load will be negative (thus very large) and you'll make
the decision to move stuff when you shouldn't have.

This patch allows for this_load to set max_load, which if I understand
the logic properly is correct. With this patch applied, the algorithm is
*much* more conservative ... maybe *too* conservative but that's for 
another round of testing ...



 linux-2.6-npiggin/kernel/sched.c |   24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff -puN kernel/sched.c~sched-fix kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-fix	2004-02-06 20:41:27.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2004-02-07 09:59:48.000000000 +1100
@@ -1407,14 +1407,13 @@ find_busiest_group(struct sched_domain *
 		total_nr_cpus += nr_cpus;
 		avg_load /= nr_cpus;
 
+		if (avg_load > max_load)
+			max_load = avg_load;
+
 		if (local_group) {
 			this_load = avg_load;
-			goto nextgroup;
-		}
-
-		if (avg_load >= max_load) {
+		} else if (avg_load >= max_load) {
 			busiest = group;
-			max_load = avg_load;
 			busiest_nr_cpus = nr_cpus;
 		}
 nextgroup:
@@ -1437,11 +1436,18 @@ nextgroup:
 	 * reduce the max loaded cpu below the average load, as either of these
 	 * actions would just result in more rebalancing later, and ping-pong
 	 * tasks around. Thus we look for the minimum possible imbalance.
+	 * Negative imbalances (*we* are more loaded than anyone else) will
+	 * be counted as no imbalance for these purposes -- we can't fix that
+	 * by pulling tasks to us.  Be careful of negative numbers as they'll
+	 * appear as very large values with unsigned longs.
 	 */
-	*imbalance = min(max_load - avg_load, avg_load - this_load);
-
-	/* Get rid of the scaling factor now, rounding *up* as we divide */
-	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
+	if (avg_load >= this_load) {
+		*imbalance = min(max_load - avg_load, avg_load - this_load);
+		/* Get rid of the scaling factor, rounding *up* as we divide */
+		*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1)
+						>> SCHED_LOAD_SHIFT;
+	} else
+		*imbalance = 0;
 
 	if (*imbalance == 0) {
 		if (package_idle != NOT_IDLE && domain->flags & SD_FLAG_IDLE

_


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 22:42         ` Martin J. Bligh
  2004-02-06 22:53           ` Nick Piggin
@ 2004-02-06 23:11           ` Rick Lindsley
  2004-02-06 23:20             ` Nick Piggin
  1 sibling, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-06 23:11 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Nick Piggin, Anton Blanchard, akpm, linux-kernel, dvhltc

    OK, but do you agree that the rate we rebalance things like 2 vs 1 should
    be slower than the rate we rebalance 3 vs 1 ? Fairness is only relevant
    over a long term imbalance anyway, so there should be a big damper on
    "fairness only" rebalances.

I think, given the precision we're granted via SCHED_LOAD_SCALE, in
combination with the new "load average" (cpu_load) code, that we can
achieve what we want.

If cpu0 has 2 runnable tasks and cpu1 has 1 runnable task, won't we see
the "load average" of cpu0 slowly approach 2, but not jump there?

Right now, we round up on all fractions and Martin has proposed a patch
which takes it the other way and rounds down.  What if, in marginal
cases like this where there is a small but persistent difference, we
could bump the task to another cpu when it reaches (say) 1.8 or 1.9?
That would keep it there longer for shorter-lived tasks, but for those
long-runners, they'd eventually spread the pain around a little.

And yes, a cpu_load of even 1.0 should *never* get migrated to a cpu
with a load of 0.0.  Instead of

    *imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

how about, for instance,

    if (max_load <= SCHED_LOAD_SCALE)
	*imbalance = 0;
    else
	*imbalance = (*imbalance + (SCHED_LOAD_SCALE / 6) - 1)
	    >> SCHED_LOAD_SHIFT;

The intent is to never move anything if max_load is 1 or less (what
advantage is there?) and to create a slight tendency to round up at
loads greater than that, which would still tend to leave things where
they were until they'd been there a while.  In fact the "bonus"
(SCHED_LOAD_SCALE / 6 - 1) could be another configurable in the scheduling
domain so that at some level you're not interested in fairness and
they just don't bounce at all.

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 23:11           ` Rick Lindsley
@ 2004-02-06 23:20             ` Nick Piggin
  2004-02-06 23:33               ` Martin J. Bligh
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 23:20 UTC (permalink / raw)
  To: Rick Lindsley
  Cc: Martin J. Bligh, Anton Blanchard, akpm, linux-kernel, dvhltc



Rick Lindsley wrote:

>    OK, but do you agree that the rate we rebalance things like 2 vs 1 should
>    be slower than the rate we rebalance 3 vs 1 ? Fairness is only relevant
>    over a long term imbalance anyway, so there should be a big damper on
>    "fairness only" rebalances.
>
>I think, given the precision we're granted via SCHED_LOAD_SCALE, in
>combination with the new "load average" (cpu_load) code, that we can
>achieve what we want.
>
>If cpu0 has 2 runnable tasks and cpu1 has 1 runnable task, won't we see
>the "load average" of cpu0 slowly approach 2, but not jump there?
>
>

Yep

>Right now, we round up on all fractions and Martin has proposed a patch
>which takes it the other way and rounds down.  What if in marginal
>cases like this where there is a small but persistent difference, we
>could bump the task to another cpu when it reaches (say) 1.8 or 1.9?
>That would keep it there longer for shorter-lived tasks, but for those
>long-runners, they'd eventually spread the pain around a little.
>
>And yes, a cpu_load of even 1.0 should *never* get migrated to a cpu
>with a load of 0.0.  Instead of
>

This isn't the load though, but imbalance. It has already passed
through our imbalance_pct filter (or we are idle), so we can pretty
safely assume that we want to try to move at least one task.

>
>    *imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
>
>how about, for instance,
>
>    if (max_load <= SCHED_LOAD_SCALE)
>	*imbalance = 0;
>    else
>	*imbalance = (*imbalance + (SCHED_LOAD_SCALE / 6) - 1)
>	    >> SCHED_LOAD_SHIFT;
>
>The intent is to never move anything if max_load is 1 or less (what
>advantage is there?) and to create a slight tendency to round up at
>loads greater than that, which would still tend to leave things where
>they were until they'd been there a while.  In fact the "bonus"
>(SCHED_LOAD_SCALE / 6 - 1) could be another configurable in the scheduling
>domain so that at some level you're not interested in fairness and
>they just don't bounce at all.
>
>

Hopefully just tending to round down more would damp it better.
*imbalance = (*imbalance + SCHED_LOAD_SCALE/2) >> SCHED_LOAD_SHIFT;
Or even remove the addition altogether.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 23:20             ` Nick Piggin
@ 2004-02-06 23:33               ` Martin J. Bligh
  2004-02-06 23:41                 ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-06 23:33 UTC (permalink / raw)
  To: Nick Piggin, Rick Lindsley; +Cc: Anton Blanchard, akpm, linux-kernel, dvhltc

> From a later email ....
>
> Hopefully just tending to round down more would damp it better.
> *imbalance = (*imbalance + SCHED_LOAD_SCALE/2) >> SCHED_LOAD_SHIFT;
> Or even remove the addition altogether.

I'd side with just removing the addition altogether ...

>>Moreover, as Rick pointed out, it's particularly futile over idle cpus ;-)
>
> I don't follow...

If CPU 7 has 1 task, and cpu 8 has 0 tasks, there's an imbalance of 1.
There is no point whatsoever in bouncing that task back and forth
between cpu 7 and 8 - it just makes things slower, and thrashes the cache.
There's *no* fairness issue here.

If CPU 8 has 2 tasks, and cpu 1 has 1 task, there's an imbalance of 1.
*If* that imbalance persists (and it probably won't, given tasks being
created, destroyed, and blocking for IO), we may want to rotate that 
to 1 vs 2, and then back to 2 vs 1, etc. in the interests of fairness,
even though it's slower throughput overall.

However, my point is that we shouldn't do that as aggressively (in terms
of how long the imbalance persists for) as we should for loads of 3 vs 1 -
that's a "real" imbalance, not just a fairness problem.

M.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 23:33               ` Martin J. Bligh
@ 2004-02-06 23:41                 ` Nick Piggin
  2004-02-06 23:47                   ` Martin J. Bligh
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-06 23:41 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc



Martin J. Bligh wrote:

>>From a later email ....
>>
>>Hopefully just tending to round down more would damp it better.
>>*imbalance = (*imbalance + SCHED_LOAD_SCALE/2) >> SCHED_LOAD_SHIFT;
>>Or even remove the addition altogether.
>>
>
>I'd side with just removing the addition altogether ...
>
>

By that stage though, it has already passed through the
imbalance_pct filter, and with the higher precision and
averaging of previous loads, it might take a while to
get there.

>>>Moreover, as Rick pointed out, it's particularly futile over idle cpus ;-)
>>>
>>I don't follow...
>>
>
>If CPU 7 has 1 task, and cpu 8 has 0 tasks, there's an imbalance of 1.
>There is no point whatsoever in bouncing that task back and forth
>between cpu 7 and 8 - it just makes things slower, and thrashes the cache.
>There's *no* fairness issue here.
>
>

Right, it should not be moved. I think Anton is seeing a problem
with active balancing, and not so much a problem with the imbalance
calculation though.

>If CPU 8 has 2 tasks, and cpu 1 has 1 task, there's an imbalance of 1.
>*If* that imbalance persists (and it probably won't, given tasks being
>created, destroyed, and blocking for IO), we may want to rotate that 
>to 1 vs 2, and then back to 2 vs 1, etc. in the interests of fairness,
>even though it's slower throughput overall.
>

Yes, although as long as it's node local and happens a couple of
times a second, you should be pretty hard pressed to notice the
difference.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 23:41                 ` Nick Piggin
@ 2004-02-06 23:47                   ` Martin J. Bligh
  2004-02-07  0:11                     ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-06 23:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc

>> If CPU 8 has 2 tasks, and cpu 1 has 1 task, there's an imbalance of 1.
>> *If* that imbalance persists (and it probably won't, given tasks being
>> created, destroyed, and blocking for IO), we may want to rotate that 
>> to 1 vs 2, and then back to 2 vs 1, etc. in the interests of fairness,
>> even though it's slower throughput overall.
>> 
> 
> Yes, although as long as it's node local and happens a couple of
> times a second, you should be pretty hard pressed to notice the
> difference.

Not sure how true that turns out to be in practice ... probably depends
heavily on both the workload (how heavily it's using the cache) and the
chip (larger caches have proportionately more to lose).

As we go forward in time, cache warmth gets increasingly important, as
CPUs accelerate speeds quicker than memory. Cache sizes also get larger.
I'd really like us to be conservative here - the unfairness thing is 
really hard to hit anyway - you need a static number of processes that
don't ever block on IO or anything.

M.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-06 23:47                   ` Martin J. Bligh
@ 2004-02-07  0:11                     ` Nick Piggin
  2004-02-07  0:25                       ` Martin J. Bligh
  2004-02-09 16:37                       ` Timothy Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Nick Piggin @ 2004-02-07  0:11 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc



Martin J. Bligh wrote:

>>>If CPU 8 has 2 tasks, and cpu 1 has 1 task, there's an imbalance of 1.
>>>*If* that imbalance persists (and it probably won't, given tasks being
>>>created, destroyed, and blocking for IO), we may want to rotate that 
>>>to 1 vs 2, and then back to 2 vs 1, etc. in the interests of fairness,
>>>even though it's slower throughput overall.
>>>
>>>
>>Yes, although as long as it's node local and happens a couple of
>>times a second, you should be pretty hard pressed to notice the
>>difference.
>>
>
>Not sure how true that turns out to be in practice ... probably depends
>heavily on both the workload (how heavily it's using the cache) and the
>chip (larger caches have proportionately more to lose).
>
>As we go forward in time, cache warmth gets increasingly important, as
>CPUs accelerate speeds quicker than memory. Cache sizes also get larger.
>I'd really like us to be conservative here - the unfairness thing is 
>really hard to hit anyway - you need a static number of processes that
>don't ever block on IO or anything.
>
>

Can we keep current behaviour default, and if arches want to
override it they can? And if someone one day does testing to
show it really isn't a good idea, then we can change the default.

I'd like to try to stick to the fairness-first approach.

We got quite a few complaints about unfairness when the
scheduler used to keep 2 on one cpu and 1 on another, even in
development kernels. I suspect that most wouldn't have known
one way or the other if only top showed 66% each, but still.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-07  0:11                     ` Nick Piggin
@ 2004-02-07  0:25                       ` Martin J. Bligh
  2004-02-07  0:31                         ` Nick Piggin
  2004-02-09 16:37                       ` Timothy Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-07  0:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc

>> Not sure how true that turns out to be in practice ... probably depends
>> heavily on both the workload (how heavily it's using the cache) and the
>> chip (larger caches have proportionately more to lose).
>> 
>> As we go forward in time, cache warmth gets increasingly important, as
>> CPUs accelerate speeds quicker than memory. Cache sizes also get larger.
>> I'd really like us to be conservative here - the unfairness thing is 
>> really hard to hit anyway - you need a static number of processes that
>> don't ever block on IO or anything.
> 
> Can we keep current behaviour default, and if arches want to
> override it they can? And if someone one day does testing to
> show it really isn't a good idea, then we can change the default.

Well, that should be a pretty easy test to do. I'll try it.

M.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-07  0:25                       ` Martin J. Bligh
@ 2004-02-07  0:31                         ` Nick Piggin
  2004-02-07  9:50                           ` Anton Blanchard
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-07  0:31 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc



Martin J. Bligh wrote:

>>>Not sure how true that turns out to be in practice ... probably depends
>>>heavily on both the workload (how heavily it's using the cache) and the
>>>chip (larger caches have proportionately more to lose).
>>>
>>>As we go forward in time, cache warmth gets increasingly important, as
>>>CPUs accelerate speeds quicker than memory. Cache sizes also get larger.
>>>I'd really like us to be conservative here - the unfairness thing is 
>>>really hard to hit anyway - you need a static number of processes that
>>>don't ever block on IO or anything.
>>>
>>Can we keep current behaviour default, and if arches want to
>>override it they can? And if someone one day does testing to
>>show it really isn't a good idea, then we can change the default.
>>
>
>Well, that should be a pretty easy test to do. I'll try it.
>
>

OK, use the revision of Rick's patch I posted, and don't use
CONFIG_SCHED_SMT because I think there is a problem with it.

Thanks
Nick



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-07  0:31                         ` Nick Piggin
@ 2004-02-07  9:50                           ` Anton Blanchard
  2004-02-08  0:40                             ` Rick Lindsley
  0 siblings, 1 reply; 37+ messages in thread
From: Anton Blanchard @ 2004-02-07  9:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Martin J. Bligh, Rick Lindsley, akpm, linux-kernel, dvhltc


> OK, use the revision of Rick's patch I posted, and don't use
> CONFIG_SCHED_SMT because I think there is a problem with it.

Missed this email :) I gave your last patch a spin and I'm seeing similar
behaviour.

It's got to be an overly enthusiastic active balance: the migration threads
have used about 10 minutes of cpu time, and a single cpu-bound process
will never sleep (assuming there is nothing else to run) and so cannot be
moved by normal means.

Anton


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-07  9:50                           ` Anton Blanchard
@ 2004-02-08  0:40                             ` Rick Lindsley
  2004-02-08  1:12                               ` Anton Blanchard
  0 siblings, 1 reply; 37+ messages in thread
From: Rick Lindsley @ 2004-02-08  0:40 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Nick Piggin, Martin J. Bligh, akpm, linux-kernel, dvhltc

    It's got to be an overly enthusiastic active balance: the migration threads
    have used about 10 minutes of cpu time, and a single cpu-bound process
    will never sleep (assuming there is nothing else to run) and so cannot be
    moved by normal means.

The current imbalance code rounds up to 1, meaning that we'll often
see an "imbalance" of 1 even when it's 1 to 0 and just been moved.
Did you see these results even with Martin's patch to not round up to 1?

Easiest way to turn off the active balance (for this test, at least)
is this patch, which just turns off that code:

diff -rup linux-2.6.2-mm1/kernel/sched.c linux-2.6.2-mm1+/kernel/sched.c
--- linux-2.6.2-mm1/kernel/sched.c	Thu Feb  5 14:47:17 2004
+++ linux-2.6.2-mm1+/kernel/sched.c	Sat Feb  7 16:39:18 2004
@@ -1525,6 +1525,7 @@ out:
 	if (!balanced && nr_moved == 0)
 		failed = 1;
 
+#if 0
 	if (domain->flags & SD_FLAG_IDLE && failed && busiest &&
 	   		domain->nr_balance_failed > domain->cache_nice_tries) {
 		int i;
@@ -1546,6 +1547,7 @@ out:
 				wake_up_process(busiest->migration_thread);
 		}
 	}
+#endif
 
 	if (failed)
 		domain->nr_balance_failed++;

Not the right long-term solution but at least we can pin down where this
obviously incorrect behavior is coming from.

Rick


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  0:40                             ` Rick Lindsley
@ 2004-02-08  1:12                               ` Anton Blanchard
  2004-02-08  1:21                                 ` Nick Piggin
  2004-02-08  1:22                                 ` Anton Blanchard
  0 siblings, 2 replies; 37+ messages in thread
From: Anton Blanchard @ 2004-02-08  1:12 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: Nick Piggin, Martin J. Bligh, akpm, linux-kernel, dvhltc

 
Hi,

> The current imbalance code rounds up to 1, meaning that we'll often
> see an "imbalance" of 1 even when it's 1 to 0 and just been moved.
> Did you see these results even with Martin's patch to not round up to 1?

Indeed Martin's patch does fix the problem:

cpu    user  system    idle             cpu    user  system    idle
cpu0      0       0     100             cpu1      0       0     100
cpu2      0       0     100             cpu3      0       0     100
cpu4      0       0     100             cpu5      0       0     100
cpu6      0       0     100             cpu7      0       0     100
cpu8      0       0     100             cpu9      0       0     100
cpu10     0       0     100             cpu11     0       0     100
cpu12     0       0     100             cpu13   100       0       0
cpu14     0       0     100             cpu15     0       0     100

My current tree has your patch and Martin's patch. So far it's looking
good.

Anton


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  1:12                               ` Anton Blanchard
@ 2004-02-08  1:21                                 ` Nick Piggin
  2004-02-08  1:41                                   ` Anton Blanchard
  2004-02-08  1:22                                 ` Anton Blanchard
  1 sibling, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-08  1:21 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Rick Lindsley, Martin J. Bligh, akpm, linux-kernel, dvhltc



Anton Blanchard wrote:

> 
>Hi,
>
>
>>The current imbalance code rounds up to 1, meaning that we'll often
>>see an "imbalance" of 1 even when it's 1 to 0 and just been moved.
>>Did you see these results even with Martin's patch to not round up to 1?
>>
>
>Indeed Martin's patch does fix the problem:
>
>cpu    user  system    idle             cpu    user  system    idle
>cpu0      0       0     100             cpu1      0       0     100
>cpu2      0       0     100             cpu3      0       0     100
>cpu4      0       0     100             cpu5      0       0     100
>cpu6      0       0     100             cpu7      0       0     100
>cpu8      0       0     100             cpu9      0       0     100
>cpu10     0       0     100             cpu11     0       0     100
>cpu12     0       0     100             cpu13   100       0       0
>cpu14     0       0     100             cpu15     0       0     100
>
>My current tree has your patch and Martin's patch. So far it's looking
>good.
>
>

Rick's being the one I sent you?

Does active balancing still work? Ie. get two processes running on the
same physical CPU and see if one is migrated away.



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  1:12                               ` Anton Blanchard
  2004-02-08  1:21                                 ` Nick Piggin
@ 2004-02-08  1:22                                 ` Anton Blanchard
  1 sibling, 0 replies; 37+ messages in thread
From: Anton Blanchard @ 2004-02-08  1:22 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: Nick Piggin, Martin J. Bligh, akpm, linux-kernel, dvhltc


> My current tree has your patch and Martins patch. So far its looking
> good.

Actually now it refuses to let me out of 1 cpu (2 threads). There are 5
things in the runqueue but I'm only using 1 cpu.

Anton

cpu    user  system    idle             cpu    user  system    idle
cpu0     98       2       0             cpu1     98       2       0
cpu2      0       0     100             cpu3      0       0     100
cpu4      0       0     100             cpu5      0       0     100
cpu6      0       0     100             cpu7      0       0     100
cpu8      0       0     100             cpu9      0       0     100
cpu10     0       0     100             cpu11     0       0     100
cpu12     0       0     100             cpu13     0       0     100
cpu14     0       0     100             cpu15     0       0     100


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  1:21                                 ` Nick Piggin
@ 2004-02-08  1:41                                   ` Anton Blanchard
  2004-02-08  3:20                                     ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Anton Blanchard @ 2004-02-08  1:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rick Lindsley, Martin J. Bligh, akpm, linux-kernel, dvhltc


> Does active balancing still work? Ie. get two processes running on the
> same physical CPU and see if one is migrated away.

Both Rick's patch and your patch still have the active rebalance issue.

Martin's patch doesn't, but that seems to be because it does absolutely no
rebalancing (all my runnable tasks are on one physical cpu) :)

Anton


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  1:41                                   ` Anton Blanchard
@ 2004-02-08  3:20                                     ` Nick Piggin
  2004-02-08  3:57                                       ` Anton Blanchard
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-08  3:20 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Rick Lindsley, Martin J. Bligh, akpm, linux-kernel, dvhltc



Anton Blanchard wrote:

>>Does active balancing still work? Ie. get two processes running on the
>>same physical CPU and see if one is migrated away.
>>
>
>Both Rick's patch and your patch still have the active rebalance issue.
>
>Martin's patch doesn't, but that seems to be because it does absolutely no
>rebalancing (all my runnable tasks are on one physical cpu) :)
>
>

Yeah it's because you have a lot of cpus, so the average is still
small. You also need something like

if (*imbalance == 0 && max_load - this_load > SCHED_LOAD_SCALE)
    *imbalance = 1;

I don't have a >= 4 CPU box to test on, so I hate to be feeding
you lots of little unproven patches.




* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  3:20                                     ` Nick Piggin
@ 2004-02-08  3:57                                       ` Anton Blanchard
  2004-02-08  4:05                                         ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Anton Blanchard @ 2004-02-08  3:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rick Lindsley, Martin J. Bligh, akpm, linux-kernel, dvhltc

 
Hi,

> Yeah it's because you have a lot of cpus, so the average is still
> small. You also need something like
> 
> if (*imbalance == 0 && max_load - this_load > SCHED_LOAD_SCALE)
>    *imbalance = 1;

OK I'll give that a try.

> I don't have a >= 4 CPU box to test on, so I hate to be feeding
> you lots of little unproven patches.

I don't have a problem with that. I'd be chasing this myself but I've got
to get a few other things done first. I can however reboot and test
things at the same time :)

Anton


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  3:57                                       ` Anton Blanchard
@ 2004-02-08  4:05                                         ` Nick Piggin
  2004-02-08 12:14                                           ` Anton Blanchard
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2004-02-08  4:05 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Rick Lindsley, Martin J. Bligh, akpm, linux-kernel, dvhltc

[-- Attachment #1: Type: text/plain, Size: 346 bytes --]



Anton Blanchard wrote:

> 
>Hi,
>
>  
>
>>Yeah it's because you have a lot of cpus, so the average is still
>>small. You also need something like
>>
>>if (*imbalance == 0 && max_load - this_load > SCHED_LOAD_SCALE)
>>   *imbalance = 1;
>>    
>>
>
>OK I'll give that a try.
>  
>
>  
>

Can you try this patch instead pretty please ;)

Thanks



[-- Attachment #2: rollup.patch --]
[-- Type: text/plain, Size: 2941 bytes --]

 linux-2.6-npiggin/kernel/sched.c |   52 ++++++++++++++++++++++++---------------
 1 files changed, 32 insertions(+), 20 deletions(-)

diff -puN kernel/sched.c~rollup kernel/sched.c
--- linux-2.6/kernel/sched.c~rollup	2004-02-08 15:03:53.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2004-02-08 15:03:53.000000000 +1100
@@ -1405,16 +1405,28 @@ find_busiest_group(struct sched_domain *
 
 		total_load += avg_load;
 		total_nr_cpus += nr_cpus;
-		avg_load /= nr_cpus;
+
+		/*
+		 * Load is cumulative over SD_FLAG_IDLE domains, but
+		 * spread over !SD_FLAG_IDLE domains. For example, 2
+		 * processes running on an SMT CPU puts a load of 2 on
+		 * that CPU, however 2 processes running on 2 CPUs puts
+		 * a load of 1 on that domain.
+		 *
+		 * This should be configurable so as SMT siblings become
+		 * more powerful, they can "spread" more load - for example,
+		 * the above case might only count as a load of 1.7.
+		 */
+		if (!(domain->flags & SD_FLAG_IDLE))
+			avg_load /= nr_cpus;
+
+		if (avg_load > max_load)
+			max_load = avg_load;
 
 		if (local_group) {
 			this_load = avg_load;
-			goto nextgroup;
-		}
-
-		if (avg_load >= max_load) {
+		} else if (avg_load >= max_load) {
 			busiest = group;
-			max_load = avg_load;
 			busiest_nr_cpus = nr_cpus;
 		}
 nextgroup:
@@ -1424,8 +1436,10 @@ nextgroup:
 	if (!busiest)
 		goto out_balanced;
 
-	avg_load = total_load / total_nr_cpus;
-	if (idle == NOT_IDLE && this_load >= avg_load)
+	if (!(domain->flags & SD_FLAG_IDLE))
+		avg_load = total_load / total_nr_cpus;
+
+	if (this_load >= avg_load)
 		goto out_balanced;
 
 	if (idle == NOT_IDLE && 100*max_load <= domain->imbalance_pct*this_load)
@@ -1437,20 +1451,18 @@ nextgroup:
 	 * reduce the max loaded cpu below the average load, as either of these
 	 * actions would just result in more rebalancing later, and ping-pong
 	 * tasks around. Thus we look for the minimum possible imbalance.
+	 * Negative imbalances (*we* are more loaded than anyone else) will
+	 * be counted as no imbalance for these purposes -- we can't fix that
+	 * by pulling tasks to us.  Be careful of negative numbers as they'll
+	 * appear as very large values with unsigned longs.
 	 */
-	*imbalance = min(max_load - avg_load, avg_load - this_load);
-
-	/* Get rid of the scaling factor now, rounding *up* as we divide */
-	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
-
-	if (*imbalance == 0) {
-		if (package_idle != NOT_IDLE && domain->flags & SD_FLAG_IDLE
-			&& max_load * busiest_nr_cpus > (3*SCHED_LOAD_SCALE/2))
-			*imbalance = 1;
-		else
-			busiest = NULL;
-	}
+	*imbalance = min(max_load - avg_load, avg_load - this_load) / 2;
+	/* Get rid of the scaling factor, rounding *up* as we divide */
+	*imbalance = (*imbalance + SCHED_LOAD_SCALE-1)
+					>> SCHED_LOAD_SHIFT;
 
+	if (*imbalance == 0 && (max_load - this_load) > SCHED_LOAD_SCALE)
+		*imbalance = 1;
 	return busiest;
 
 out_balanced:

_


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-08  4:05                                         ` Nick Piggin
@ 2004-02-08 12:14                                           ` Anton Blanchard
  0 siblings, 0 replies; 37+ messages in thread
From: Anton Blanchard @ 2004-02-08 12:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Rick Lindsley, Martin J. Bligh, akpm, linux-kernel, dvhltc


Hi,

> Can you try this patch instead pretty please ;)

Still had the active rebalance problem. Martin suggested I try just the
first chunk of his patch:

@@ -1442,8 +1442,8 @@ nextgroup:
        if ((long)*imbalance < 0)
                *imbalance = 0;
  
-       /* Get rid of the scaling factor now, rounding *up* as we divide */
-       *imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
+       /* Get rid of the scaling factor now */
+       *imbalance = *imbalance >> SCHED_LOAD_SHIFT;

This fixed the active rebalance issue; however, now it doesn't want to
rebalance out of the node. I've got one completely empty node and one full
one (all primary and secondary threads are busy).

Anton


* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-07  0:11                     ` Nick Piggin
  2004-02-07  0:25                       ` Martin J. Bligh
@ 2004-02-09 16:37                       ` Timothy Miller
  2004-02-09 16:43                         ` Martin J. Bligh
  1 sibling, 1 reply; 37+ messages in thread
From: Timothy Miller @ 2004-02-09 16:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin J. Bligh, Rick Lindsley, Anton Blanchard, akpm,
	linux-kernel, dvhltc



Nick Piggin wrote:

> 
> Can we keep current behaviour default, and if arches want to
> override it they can? And if someone one day does testing to
> show it really isn't a good idea, then we can change the default.
> 
> I'd like to try to stick to the fairness-first approach.
> 
> We got quite a few complaints about unfairness when the
> scheduler used to keep 2 on one cpu and 1 on another, even in
> development kernels. I suspect that most wouldn't have known
> one way or the other if only top showed 66% each, but still.
> 


Stupid question:  Does the balancing consider process priority?  Is it 
unfair to have two lower pri tasks always on one cpu while the highest 
pri of the three is always by itself?



* Re: [PATCH] Load balancing problem in 2.6.2-mm1
  2004-02-09 16:37                       ` Timothy Miller
@ 2004-02-09 16:43                         ` Martin J. Bligh
  0 siblings, 0 replies; 37+ messages in thread
From: Martin J. Bligh @ 2004-02-09 16:43 UTC (permalink / raw)
  To: Timothy Miller, Nick Piggin
  Cc: Rick Lindsley, Anton Blanchard, akpm, linux-kernel, dvhltc

>> Can we keep current behaviour default, and if arches want to
>> override it they can? And if someone one day does testing to
>> show it really isn't a good idea, then we can change the default.
>> 
>> I'd like to try to stick to the fairness-first approach.
>> 
>> We got quite a few complaints about unfairness when the
>> scheduler used to keep 2 on one cpu and 1 on another, even in
>> development kernels. I suspect that most wouldn't have known
>> one way or the other if only top showed 66% each, but still.
> 
> Stupid question:  Does the balancing consider process priority?  Is it 
> unfair to have two lower pri tasks always on one cpu while the highest 
> pri of the three is always by itself?

No, it doesn't take account of that. We've discussed that before at some 
point, and yes, I think it's wrong. But on the other hand, we want to be 
reasonably fast at deciding what to take, and there are a whole bunch of 
other criteria that we ought to be taking into account as well - it's difficult.

M.



end of thread, other threads:[~2004-02-09 16:43 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-02-06  9:24 [PATCH] Load balancing problem in 2.6.2-mm1 Rick Lindsley
2004-02-06  9:38 ` Nick Piggin
2004-02-06 18:13   ` Rick Lindsley
2004-02-06 21:57     ` Nick Piggin
2004-02-06 22:30       ` Rick Lindsley
2004-02-06 22:40         ` Nick Piggin
2004-02-06 22:49           ` Rick Lindsley
2004-02-06 23:08             ` Nick Piggin
2004-02-06 10:30 ` Anton Blanchard
2004-02-06 18:15   ` Rick Lindsley
2004-02-06 18:39     ` Martin J. Bligh
2004-02-06 22:02       ` Nick Piggin
2004-02-06 22:34         ` Rick Lindsley
2004-02-06 22:48           ` Nick Piggin
2004-02-06 22:42         ` Martin J. Bligh
2004-02-06 22:53           ` Nick Piggin
2004-02-06 23:11           ` Rick Lindsley
2004-02-06 23:20             ` Nick Piggin
2004-02-06 23:33               ` Martin J. Bligh
2004-02-06 23:41                 ` Nick Piggin
2004-02-06 23:47                   ` Martin J. Bligh
2004-02-07  0:11                     ` Nick Piggin
2004-02-07  0:25                       ` Martin J. Bligh
2004-02-07  0:31                         ` Nick Piggin
2004-02-07  9:50                           ` Anton Blanchard
2004-02-08  0:40                             ` Rick Lindsley
2004-02-08  1:12                               ` Anton Blanchard
2004-02-08  1:21                                 ` Nick Piggin
2004-02-08  1:41                                   ` Anton Blanchard
2004-02-08  3:20                                     ` Nick Piggin
2004-02-08  3:57                                       ` Anton Blanchard
2004-02-08  4:05                                         ` Nick Piggin
2004-02-08 12:14                                           ` Anton Blanchard
2004-02-08  1:22                                 ` Anton Blanchard
2004-02-09 16:37                       ` Timothy Miller
2004-02-09 16:43                         ` Martin J. Bligh
2004-02-06 18:33   ` Martin J. Bligh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox