* [Patch] don't kick ALB in the presence of pinned task
@ 2005-08-02 0:42 Siddha, Suresh B
2005-08-02 6:09 ` Nick Piggin
2005-08-02 9:27 ` Ingo Molnar
0 siblings, 2 replies; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-02 0:42 UTC (permalink / raw)
To: mingo, nickpiggin; +Cc: akpm, linux-kernel, steiner
Jack Steiner brought up this issue at my OLS talk.
Take a scenario where two tasks are pinned to two HT threads in a physical
package. Idle packages in the system will keep kicking migration_thread
on the busy package without any success.
We will run into similar scenarios in the presence of CMP/NUMA.
Patch appended.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
--- linux-2.6.13-rc4/kernel/sched.c~ 2005-08-01 10:50:27.085884216 -0700
+++ linux-2.6.13-rc4/kernel/sched.c 2005-08-01 14:39:04.147573872 -0700
@@ -2098,6 +2098,16 @@
 	if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
 
 		spin_lock(&busiest->lock);
+
+		/* don't kick the migration_thread, if the curr
+		 * task on busiest cpu can't be moved to this_cpu
+		 */
+		if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) {
+			spin_unlock(&busiest->lock);
+			all_pinned = 1;
+			goto out_one_pinned;
+		}
+
 		if (!busiest->active_balance) {
 			busiest->active_balance = 1;
 			busiest->push_cpu = this_cpu;
@@ -2135,6 +2145,7 @@
 
 out_balanced:
 	spin_unlock(&this_rq->lock);
+out_one_pinned:
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;
^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Patch] don't kick ALB in the presence of pinned task
  2005-08-02  0:42 [Patch] don't kick ALB in the presence of pinned task Siddha, Suresh B
@ 2005-08-02  6:09 ` Nick Piggin
  2005-08-02  9:43   ` Ingo Molnar
  2005-08-02 21:12   ` Siddha, Suresh B
  2005-08-02  9:27 ` Ingo Molnar
  1 sibling, 2 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-02  6:09 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: mingo, akpm, linux-kernel, steiner

[-- Attachment #1: Type: text/plain, Size: 745 bytes --]

Siddha, Suresh B wrote:
> Jack Steiner brought up this issue at my OLS talk.
>
> Take a scenario where two tasks are pinned to two HT threads in a physical
> package. Idle packages in the system will keep kicking migration_thread
> on the busy package without any success.
>
> We will run into similar scenarios in the presence of CMP/NUMA.
>
> Patch appended.

Hmm, I would have hoped the new "all_pinned" logic should have handled
this case properly. Are you actually seeing this happen?

I have a patch here which I still need to do more testing with, which
might help performance on HT systems.

I found that idle siblings could cause SMP and NUMA balancing to be
too aggressive in some cases.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: sched-opt-ht.patch --]
[-- Type: text/plain, Size: 2941 bytes --]

If an idle sibling of an HT queue encounters a busy sibling, then make
higher level load balancing of the non-idle variety.
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-07-29 19:30:39.000000000 +1000
+++ linux-2.6/kernel/sched.c	2005-07-29 19:35:01.000000000 +1000
@@ -1889,7 +1889,7 @@ out:
  */
 static struct sched_group *
 find_busiest_group(struct sched_domain *sd, int this_cpu,
-		unsigned long *imbalance, enum idle_type idle)
+		unsigned long *imbalance, enum idle_type idle, int *sd_idle)
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
@@ -1914,6 +1914,9 @@ find_busiest_group(struct sched_domain *
 		avg_load = 0;
 
 		for_each_cpu_mask(i, group->cpumask) {
+			if (*sd_idle && !idle_cpu(i))
+				*sd_idle = 0;
+
 			/* Bias balancing toward cpus of our domain */
 			if (local_group)
 				load = target_load(i, load_idx);
@@ -2057,11 +2060,15 @@ static int load_balance(int this_cpu, ru
 	unsigned long imbalance;
 	int nr_moved, all_pinned = 0;
 	int active_balance = 0;
+	int sd_idle = 0;
+
+	if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER)
+		sd_idle = 1;
 
 	spin_lock(&this_rq->lock);
 	schedstat_inc(sd, lb_cnt[idle]);
 
-	group = find_busiest_group(sd, this_cpu, &imbalance, idle);
+	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
@@ -2136,6 +2143,8 @@ static int load_balance(int this_cpu, ru
 		sd->balance_interval *= 2;
 	}
 
+	if (!nr_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER)
+		return -1;
 	return nr_moved;
 
 out_balanced:
@@ -2149,6 +2158,8 @@ out_balanced:
 	    (sd->balance_interval < sd->max_interval))
 		sd->balance_interval *= 2;
 
+	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER)
+		return -1;
 	return 0;
 }
 
@@ -2166,9 +2177,13 @@ static int load_balance_newidle(int this
 	runqueue_t *busiest = NULL;
 	unsigned long imbalance;
 	int nr_moved = 0;
+	int sd_idle = 0;
 
+	if (sd->flags & SD_SHARE_CPUPOWER)
+		sd_idle = 1;
+
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
-	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
+	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, &sd_idle);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
 		goto out_balanced;
@@ -2193,15 +2208,19 @@ static int load_balance_newidle(int this
 		spin_unlock(&busiest->lock);
 	}
 
-	if (!nr_moved)
+	if (!nr_moved) {
 		schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
-	else
+		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER)
+			return -1;
+	} else
 		sd->nr_balance_failed = 0;
 
 	return nr_moved;
 
 out_balanced:
 	schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
+	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER)
+		return -1;
 	sd->nr_balance_failed = 0;
 	return 0;
 }

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [Patch] don't kick ALB in the presence of pinned task
  2005-08-02  6:09 ` Nick Piggin
@ 2005-08-02  9:43   ` Ingo Molnar
  2005-08-02 10:06     ` Nick Piggin
  2005-08-02 21:12   ` Siddha, Suresh B
  1 sibling, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2005-08-02  9:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Siddha, Suresh B, akpm, linux-kernel, steiner

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Siddha, Suresh B wrote:
> > Jack Steiner brought up this issue at my OLS talk.
> >
> > Take a scenario where two tasks are pinned to two HT threads in a physical
> > package. Idle packages in the system will keep kicking migration_thread
> > on the busy package without any success.
> >
> > We will run into similar scenarios in the presence of CMP/NUMA.
> >
> > Patch appended.
>
> Hmm, I would have hoped the new "all_pinned" logic should have handled
> this case properly. [...]

no, active_balance is a different case, not covered by the all_pinned
logic. This is a HT-special scenario, where busiest->nr_running == 1,
and we have to do active load-balancing. This does not go through
move_tasks() and does not set all_pinned. (If nr_running weren't 1 we'd
not have to kick active load-balancing.)

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [Patch] don't kick ALB in the presence of pinned task
  2005-08-02  9:43 ` Ingo Molnar
@ 2005-08-02 10:06   ` Nick Piggin
  0 siblings, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2005-08-02 10:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Siddha, Suresh B, akpm, linux-kernel, steiner

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > Hmm, I would have hoped the new "all_pinned" logic should have handled
> > this case properly. [...]
>
> no, active_balance is a different case, not covered by the all_pinned
> logic. This is a HT-special scenario, where busiest->nr_running == 1,
> and we have to do active load-balancing. This does not go through
> move_tasks() and does not set all_pinned. (If nr_running weren't 1 we'd
> not have to kick active load-balancing.)

Yeah I see. It looks like Suresh's patch should do a reasonable job at
doing "all pinned backoff" too, using the existing logic.

So I agree - great catch.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [Patch] don't kick ALB in the presence of pinned task
  2005-08-02  6:09 ` Nick Piggin
  2005-08-02  9:43   ` Ingo Molnar
@ 2005-08-02 21:12   ` Siddha, Suresh B
  1 sibling, 0 replies; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-02 21:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Siddha, Suresh B, mingo, akpm, linux-kernel, steiner

On Tue, Aug 02, 2005 at 04:09:17PM +1000, Nick Piggin wrote:
> I have a patch here which I still need to do more testing with,
> which might help performance on HT systems.
>
> I found that idle siblings could cause SMP and NUMA balancing to
> be too aggressive in some cases.
> --
> If an idle sibling of an HT queue encounters a busy sibling, then
> make higher level load balancing of the non-idle variety.

Makes sense and the patch looks good. Please push this minor comment
fix along with your patch. Thanks.

--- linux-2.6.13-rc4/kernel/sched.c~	2005-08-02 13:36:34.804764128 -0700
+++ linux-2.6.13-rc4/kernel/sched.c	2005-08-02 13:38:00.689707632 -0700
@@ -2316,7 +2316,9 @@
 
 		if (j - sd->last_balance >= interval) {
 			if (load_balance(this_cpu, this_rq, sd, idle)) {
-				/* We've pulled tasks over so no longer idle */
+				/* We've pulled tasks over so no longer idle
+				 * or one of our SMT siblings is not idle.
+				 */
 				idle = NOT_IDLE;
 			}
 			sd->last_balance += interval;

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [Patch] don't kick ALB in the presence of pinned task
  2005-08-02  0:42 [Patch] don't kick ALB in the presence of pinned task Siddha, Suresh B
  2005-08-02  6:09 ` Nick Piggin
@ 2005-08-02  9:27 ` Ingo Molnar
  2005-08-09 23:08   ` allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task) Siddha, Suresh B
  1 sibling, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2005-08-02  9:27 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, akpm, linux-kernel, steiner

* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> Jack Steiner brought up this issue at my OLS talk.
>
> Take a scenario where two tasks are pinned to two HT threads in a physical
> package. Idle packages in the system will keep kicking migration_thread
> on the busy package without any success.
>
> We will run into similar scenarios in the presence of CMP/NUMA.
>
> Patch appended.
>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

nice catch!

fine for -mm, but i dont think we need this fix in 2.6.13, as the effect
of the bug is an extra context-switch per 'CPU goes idle' event, in this
very specific (and arguably broken) task binding scenario. In a
worst-case scenario a CPU going idle can 'spam' that other CPU with
migration requests, but it still seems like a pretty artificial workload
scenario where the system has significant idle time left.

I have tested your fix on a HT box and it solves the problem.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread
* allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-02  9:27 ` Ingo Molnar
@ 2005-08-09 23:08   ` Siddha, Suresh B
  2005-08-10  0:27     ` Nick Piggin
  2005-08-10  7:16     ` Ingo Molnar
  0 siblings, 2 replies; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-09 23:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Siddha, Suresh B, nickpiggin, akpm, linux-kernel, steiner,
	dvhltc, mbligh

On Tue, Aug 02, 2005 at 11:27:17AM +0200, Ingo Molnar wrote:
> * Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:
>
> > Jack Steiner brought up this issue at my OLS talk.
> >
> > Take a scenario where two tasks are pinned to two HT threads in a physical
> > package. Idle packages in the system will keep kicking migration_thread
> > on the busy package without any success.
> >
> > We will run into similar scenarios in the presence of CMP/NUMA.
> >
> > Patch appended.
> >
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
>
> nice catch!
>
> fine for -mm, but i dont think we need this fix in 2.6.13, as the effect
> of the bug is an extra context-switch per 'CPU goes idle' event, in this
> very specific (and arguably broken) task binding scenario.

No. This is not a broken scenario. It's possible in the NUMA case as well.

For example, let's take two nodes, each having two physical packages.
And assume that there are two tasks and both of them are on (may or may
not be pinned to) two packages in node-0.

Today's load balance will detect that there is an imbalance between the
two nodes and will try to distribute the load between the nodes.

In general, we should allow the load of a group to grow upto its
cpu_power and stop preventing these costly movements.

Appended patch will fix this. I have done limited testing of this patch.
Guys with big NUMA boxes, please give this patch a try.

--

When the system is lightly loaded, don't bother about the average load.
In this case, allow the load of a sched group to grow upto its cpu_power.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

--- linux-2.6.13-rc5/kernel/sched.c~	2005-08-09 13:30:19.067072328 -0700
+++ linux-2.6.13-rc5/kernel/sched.c	2005-08-09 14:39:08.363323880 -0700
@@ -1932,9 +1932,23 @@
 		group = group->next;
 	} while (group != sd->groups);
 
-	if (!busiest || this_load >= max_load)
+	if (!busiest || this_load >= max_load || max_load <= SCHED_LOAD_SCALE)
 		goto out_balanced;
 
+	/* When the system is lightly loaded, don't bother about
+	 * the average load. Just make sure all the sched groups
+	 * are within their capacities (i.e., load <= group's cpu_power)
+	 */
+	if (total_load <= total_pwr) {
+		if (this_load >= SCHED_LOAD_SCALE)
+			goto out_balanced;
+
+		*imbalance = min((max_load - SCHED_LOAD_SCALE) * busiest->cpu_power,
+				(SCHED_LOAD_SCALE - this_load) * this->cpu_power)
+				/ SCHED_LOAD_SCALE;
+
+		goto fix_imbalance;
+	}
+
 	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
 
 	if (this_load >= avg_load ||
@@ -1957,6 +1971,7 @@
 			(avg_load - this_load) * this->cpu_power)
 			/ SCHED_LOAD_SCALE;
 
+fix_imbalance:
 	if (*imbalance < SCHED_LOAD_SCALE) {
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-09 23:08 ` allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task) Siddha, Suresh B
@ 2005-08-10  0:27   ` Nick Piggin
  2005-08-10  2:03     ` Siddha, Suresh B
  1 sibling, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-10  0:27 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, akpm, linux-kernel, steiner, dvhltc, mbligh

Siddha, Suresh B wrote:

> For example, lets take two nodes each having two physical packages. And
> assume that there are two tasks and both of them are on (may or may n't be
> pinned) two packages in node-0
>
> Todays load balance will detect that there is an imbalance between the
> two nodes and will try to distribute the load between the nodes.
>
> In general, we should allow the load of a group to grow upto its cpu_power
> and stop preventing these costly movements.
>
> Appended patch will fix this. I have done limited testing of this patch.
> Guys with big NUMA boxes, please give this patch a try.
>
> --
>
> When the system is lightly loaded, don't bother about the average load.
> In this case, allow the load of a sched group to grow upto its cpu_power.

Yeah this makes sense. Thanks.

I think we'll only need your first line change to fix this, though.

Your second change will break situations where a single group is very
loaded, but it is in a domain with lots of cpu_power
(total_load <= total_pwr).

Nick

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-10  0:27 ` Nick Piggin
@ 2005-08-10  2:03   ` Siddha, Suresh B
  2005-08-11  3:09     ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-10  2:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ingo Molnar, akpm, linux-kernel, steiner,
	dvhltc, mbligh

On Wed, Aug 10, 2005 at 10:27:44AM +1000, Nick Piggin wrote:
> Yeah this makes sense. Thanks.
>
> I think we'll only need your first line change to fix this, though.
>
> Your second change will break situations where a single group is very
> loaded, but it is in a domain with lots of cpu_power
> (total_load <= total_pwr).

In that case, we will move the excess load from that group to some
other group which is below its capacity. Instead of bringing everyone
to the average load, we make sure that everyone is at or below its
cpu_power. This will minimize the movements between the nodes.

For example, let us assume sched groups node-0 and node-1 each have
4*SCHED_LOAD_SCALE as their cpu_power.

With 6 tasks on node-0 and 0 on node-1, the current load balance will
move 3 tasks from node-0 to node-1. But with my patch, it will move
only 2 tasks to node-1. Is this what you are referring to as breakage?

Even with just the first line change, we will still allow going into a
state of 4 tasks on node-0 and 2 on node-1. With the second hunk of the
patch we are minimizing the movement between the nodes and at the same
time making sure everyone is below its cpu_power, when the system is
lightly loaded.

If the group's resources are very critical, the group's cpu_power
should represent that criticality.

thanks,
suresh

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-10  2:03 ` Siddha, Suresh B
@ 2005-08-11  3:09   ` Nick Piggin
  2005-08-11 18:14     ` Siddha, Suresh B
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11  3:09 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, Andrew Morton, lkml, steiner, dvhltc, mbligh

[-- Attachment #1: Type: text/plain, Size: 1569 bytes --]

On Tue, 2005-08-09 at 19:03 -0700, Siddha, Suresh B wrote:
> On Wed, Aug 10, 2005 at 10:27:44AM +1000, Nick Piggin wrote:
> > Yeah this makes sense. Thanks.
> >
> > I think we'll only need your first line change to fix this, though.
> >
> > Your second change will break situations where a single group is very
> > loaded, but it is in a domain with lots of cpu_power
> > (total_load <= total_pwr).
>
> In that case, we will move the excess load from that group to some
> other group which is below its capacity. Instead of bringing everyone
> to avg load, we make sure that everyone is at or below its cpu_power.
> This will minimize the movements between the nodes.
>
> For example, let us assume sched groups node-0 and node-1 each have
> 4*SCHED_LOAD_SCALE as their cpu_power.
>
> And with 6 tasks on node-0 and 0 on node-1, current load balance
> will move 3 tasks from node-0 to 1. But with my patch, it will move only
> 2 tasks to node-1. Is this what you are referring to as breakage?

No, I had thought it was possible to get into a situation where one
queue could be very loaded but not have anyone to pull from it if
total_load <= total_pwr. I see that shouldn't happen though.

I have a variation on the 2nd part of your patch which I think I would
prefer. IMO it kind of generalises the current imbalance calculation to
handle this case rather than introducing a new special case.

Untested as yet, but I'll queue it to send to Andrew after it gets some
testing - unless you have any objections that is.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: sched-fix-fbg.patch --]
[-- Type: text/x-patch, Size: 1578 bytes --]

Don't pull tasks from a group if that would cause the group's total
load to drop below its total cpu_power (ie. cause the group to start
going idle).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-08-11 12:10:10.199651212 +1000
+++ linux-2.6/kernel/sched.c	2005-08-11 12:53:15.361971195 +1000
@@ -1886,6 +1886,7 @@
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
+	unsigned long max_pull;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -1932,7 +1933,7 @@
 		group = group->next;
 	} while (group != sd->groups);
 
-	if (!busiest || this_load >= max_load)
+	if (!busiest || this_load >= max_load || max_load <= SCHED_LOAD_SCALE)
 		goto out_balanced;
 
 	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
@@ -1952,8 +1953,12 @@
 	 * by pulling tasks to us. Be careful of negative numbers as they'll
 	 * appear as very large values with unsigned longs.
 	 */
+
+	/* Don't want to pull so many tasks that a group would go idle */
+	max_pull = min(max_load - avg_load, max_load - SCHED_LOAD_SCALE);
+
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min((max_load - avg_load) * busiest->cpu_power,
+	*imbalance = min(max_pull * busiest->cpu_power,
 				(avg_load - this_load) * this->cpu_power)
 			/ SCHED_LOAD_SCALE;
 

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-11  3:09 ` Nick Piggin
@ 2005-08-11 18:14   ` Siddha, Suresh B
  2005-08-11 23:49     ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-11 18:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ingo Molnar, Andrew Morton, lkml, steiner,
	dvhltc, mbligh

On Thu, Aug 11, 2005 at 01:09:10PM +1000, Nick Piggin wrote:
> I have a variation on the 2nd part of your patch which I think
> I would prefer. IMO it kind of generalises the current imbalance
> calculation to handle this case rather than introducing a new
> special case.

There is a difference between our changes.

When the system is lightly loaded, my patch minimizes the number of
groups picking up that load. This will help in power savings for
example in the context of CMP. There are more changes required
(user or kernel) for complete power savings, but this is a direction
towards that.

How about this patch?

--

Don't pull tasks from a group if that would cause the group's total
load to drop below its total cpu_power (ie. cause the group to start
going idle).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>

--- linux-2.6.13-rc5/kernel/sched.c~	2005-08-09 13:30:19.067072328 -0700
+++ linux-2.6.13-rc5/kernel/sched.c	2005-08-11 09:29:55.937128384 -0700
@@ -1886,6 +1886,7 @@
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
+	unsigned long excess_load, max_pull;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -1932,7 +1933,7 @@
 		group = group->next;
 	} while (group != sd->groups);
 
-	if (!busiest || this_load >= max_load)
+	if (!busiest || this_load >= max_load || max_load <= SCHED_LOAD_SCALE)
 		goto out_balanced;
 
 	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
@@ -1952,9 +1953,19 @@
 	 * by pulling tasks to us. Be careful of negative numbers as they'll
 	 * appear as very large values with unsigned longs.
 	 */
+
+	/* Don't want to pull so many tasks that a group would go idle */
+	excess_load = min(max_load - avg_load, max_load - SCHED_LOAD_SCALE);
+
+	if (this_load < SCHED_LOAD_SCALE)
+		/* pull as many tasks so that this group is fully utilized */
+		max_pull = max(avg_load - this_load, SCHED_LOAD_SCALE - this_load);
+	else
+		max_pull = avg_load - this_load;
+
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min((max_load - avg_load) * busiest->cpu_power,
-				(avg_load - this_load) * this->cpu_power)
+	*imbalance = min(excess_load * busiest->cpu_power,
+				max_pull * this->cpu_power)
 			/ SCHED_LOAD_SCALE;
 
 	if (*imbalance < SCHED_LOAD_SCALE) {

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-11 18:14 ` Siddha, Suresh B
@ 2005-08-11 23:49   ` Nick Piggin
  2005-08-12  0:39     ` Siddha, Suresh B
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-11 23:49 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, Andrew Morton, lkml, steiner, dvhltc, mbligh

Siddha, Suresh B wrote:
> On Thu, Aug 11, 2005 at 01:09:10PM +1000, Nick Piggin wrote:
>
> > I have a variation on the 2nd part of your patch which I think
> > I would prefer. IMO it kind of generalises the current imbalance
> > calculation to handle this case rather than introducing a new
> > special case.
>
> There is a difference between our changes.
>
> When the system is lightly loaded, my patch minimizes the number of
> groups picking up that load. This will help in power savings for
> example in the context of CMP. There are more changes required
> (user or kernel) for complete power savings, but this is a direction
> towards that.
>
> How about this patch?

Well, it is a departure from our current idea of balancing.

I would prefer to use my patch initially to fix the _bug_ you found,
then we can think about changing policy for power savings.

Main things I'm worried about:

Idle time regressions that pop up any time we put restrictions on
balancing.

This can tend to unbalance memory controllers (eg. on POWER5, CMP
Opteron) which can be a performance problem on those systems.

Lastly, complexity in the calculation.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-11 23:49 ` Nick Piggin
@ 2005-08-12  0:39   ` Siddha, Suresh B
  2005-08-12  1:24     ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-12  0:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ingo Molnar, Andrew Morton, lkml, steiner,
	dvhltc, mbligh

On Fri, Aug 12, 2005 at 09:49:36AM +1000, Nick Piggin wrote:
> Well, it is a departure from our current idea of balancing.

That idea is already changing from the first line of the patch. And the
change is "allowing the load to grow upto the sched group's cpu_power".

> I would prefer to use my patch initially to fix the _bug_
> you found, then we can think about changing policy for
> power savings.

The second part of the patch is just a logical extension of the first one.

> Main things I'm worried about:
>
> Idle time regressions that pop up any time we put
> restrictions on balancing.

We are talking about the lightly loaded case anyway. We are not changing
the behavior of a heavily loaded system.

> This can tend to unbalance memory controllers (eg. on POWER5,
> CMP Opteron) which can be a performance problem on those
> systems.

We will do that already with the first line in the patch.

If we want to distribute uniformly among the memory controllers,
cpu_power of the group should reflect it (shared-resources bottleneck).

> Lastly, complexity in the calculation.

My first patch is a simple, straightforward one.

thanks,
suresh

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-12  0:39 ` Siddha, Suresh B
@ 2005-08-12  1:24   ` Nick Piggin
  2005-08-12  1:44     ` Siddha, Suresh B
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2005-08-12  1:24 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, Andrew Morton, lkml, steiner, dvhltc, mbligh

Siddha, Suresh B wrote:
> On Fri, Aug 12, 2005 at 09:49:36AM +1000, Nick Piggin wrote:
>
> > Well, it is a departure from our current idea of balancing.
>
> That idea is already changing from the first line of the patch.
> And the change is "allowing the load to grow upto the sched
> group's cpu_power"

But this is the intended behaviour of the scheduler. IMO it was a bug
in the implementation.

> > I would prefer to use my patch initially to fix the _bug_
> > you found, then we can think about changing policy for
> > power savings.
>
> Second part of the patch is just a logical extension of the
> first one.
>
> > Main things I'm worried about:
> >
> > Idle time regressions that pop up any time we put
> > restrictions on balancing.
>
> We are talking about lightly loaded case anyway. We are not changing
> the behavior of a heavily loaded system.

But if the system is going idle, it _looks_ like it is lightly loaded
to the CPU scheduler. However, it is imperative that any spare tasks
get moved to idle CPUs in order to keep things like IO going. It seems
to be a problem with database workloads.

> > This can tend to unbalance memory controllers (eg. on POWER5,
> > CMP Opteron) which can be a performance problem on those
> > systems.
>
> We will do that already with the first line in the patch.

But that's because the tasks are already running on that CPU, and
affinity is more important in that case (especially for NUMA systems).

> If we want to distribute uniformly among the memory controllers,
> cpu_power of the group should reflect it (shared-resources bottleneck).

I don't mean that we want to completely distribute load evenly over
memory controllers, but I think it is better not to introduce this kind
of bias to the scheduler. At least not without some justification.

> > Lastly, complexity in the calculation.
>
> My first patch is a simple, straightforward one.

It is simpler than the second, but it introduces (yet another) special
case in find_busiest_group.

I'm not saying that I would reject any patch that did this or changed
behaviour in the way that you would propose, however I would like to
merge the version I sent as a bug fix first.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-12  1:24 ` Nick Piggin
@ 2005-08-12  1:44   ` Siddha, Suresh B
  0 siblings, 0 replies; 16+ messages in thread
From: Siddha, Suresh B @ 2005-08-12  1:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Siddha, Suresh B, Ingo Molnar, Andrew Morton, lkml, steiner,
	dvhltc, mbligh

On Fri, Aug 12, 2005 at 11:24:14AM +1000, Nick Piggin wrote:
> I'm not saying that I would reject any patch that did this or changed
> behaviour in the way that you would propose, however I would like
> to merge the version I sent as a bug fix first.

Please go ahead. Depending on the need, we will revisit this when we
add CMP enhancements.

thanks,
suresh

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task)
  2005-08-09 23:08 ` allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task) Siddha, Suresh B
  2005-08-10  0:27   ` Nick Piggin
@ 2005-08-10  7:16   ` Ingo Molnar
  1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2005-08-10  7:16 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, akpm, linux-kernel, steiner, dvhltc, mbligh

* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> On Tue, Aug 02, 2005 at 11:27:17AM +0200, Ingo Molnar wrote:
> > fine for -mm, but i dont think we need this fix in 2.6.13, as the effect
> > of the bug is an extra context-switch per 'CPU goes idle' event, in this
> > very specific (and arguably broken) task binding scenario.
>
> No. This is not a broken scenario. It's possible in the NUMA case as well.
>
> For example, let's take two nodes, each having two physical packages.
> And assume that there are two tasks and both of them are on (may or may
> not be pinned to) two packages in node-0.
>
> Today's load balance will detect that there is an imbalance between the
> two nodes and will try to distribute the load between the nodes.
>
> In general, we should allow the load of a group to grow upto its
> cpu_power and stop preventing these costly movements.
>
> Appended patch will fix this. I have done limited testing of this
> patch. Guys with big NUMA boxes, please give this patch a try.

makes sense in general - we should not try to move things around when we
are under-utilized. (In theory there could be heavily fluctuating
workloads which do not produce an above 100% average utilization, and
which could benefit from a perfectly even distribution of tasks - but i
dont think the load-balancer should care, as load-balancing is mostly a
"slow" mechanism.)

Again, 2.6.14 stuff.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2005-08-12  1:45 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-08-02  0:42 [Patch] don't kick ALB in the presence of pinned task Siddha, Suresh B
2005-08-02  6:09 ` Nick Piggin
2005-08-02  9:43   ` Ingo Molnar
2005-08-02 10:06     ` Nick Piggin
2005-08-02 21:12   ` Siddha, Suresh B
2005-08-02  9:27 ` Ingo Molnar
2005-08-09 23:08   ` allow the load to grow upto its cpu_power (was Re: [Patch] don't kick ALB in the presence of pinned task) Siddha, Suresh B
2005-08-10  0:27     ` Nick Piggin
2005-08-10  2:03       ` Siddha, Suresh B
2005-08-11  3:09         ` Nick Piggin
2005-08-11 18:14           ` Siddha, Suresh B
2005-08-11 23:49             ` Nick Piggin
2005-08-12  0:39               ` Siddha, Suresh B
2005-08-12  1:24                 ` Nick Piggin
2005-08-12  1:44                   ` Siddha, Suresh B
2005-08-10  7:16     ` Ingo Molnar