public inbox for linux-kernel@vger.kernel.org
* [PATCH] Fix longstanding load balancing bug in the scheduler.
@ 2006-09-06 23:38 Christoph Lameter
  2006-09-07 10:58 ` Nick Piggin
  2006-09-08 17:35 ` Siddha, Suresh B
  0 siblings, 2 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-09-06 23:38 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, npiggin

The scheduler will stop load balancing if the most busy processor
contains processes pinned via processor affinity.

The scheduler currently only does one search for the busiest cpu. If it
cannot pull any tasks away from the busiest cpu because they were pinned,
the scheduler goes into a corner and sulks, leaving the idle processors
idle.

For example, if processor 0 is busy running four tasks pinned via taskset
and there are none on processor 1, and one then starts two processes on
processor 2, the scheduler will not move one of the two processes away
from processor 2.

This patch fixes that issue by forcing the scheduler to come out of
its corner and retry the load balancing, considering other processors.
Instead of sulking, the scheduler will simply shun the run queue with
the pinned, unmovable threads.

This patch was originally developed by John Hawkes and discussed
at http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

I have removed extraneous material, simplified it and gone back to 
equipping struct rq with the cpu the queue is associated with since this 
makes the patch much easier and it is likely that others in the future 
will have the same difficulty of figuring out which processor owns which 
runqueue.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.18-rc5-mm1/kernel/sched.c
===================================================================
--- linux-2.6.18-rc5-mm1.orig/kernel/sched.c	2006-09-06 16:13:40.000000000 -0700
+++ linux-2.6.18-rc5-mm1/kernel/sched.c	2006-09-06 16:13:41.000000000 -0700
@@ -239,6 +239,7 @@
 	/* For active balancing */
 	int active_balance;
 	int push_cpu;
+	int cpu;		/* cpu of this runqueue */
 
 	struct task_struct *migration_thread;
 	struct list_head migration_queue;
@@ -268,6 +269,15 @@
 
 static DEFINE_PER_CPU(struct rq, runqueues);
 
+int cpu_of(struct rq *rq)
+{
+#ifdef CONFIG_SMP
+	return rq->cpu;
+#else
+	return 0;
+#endif
+}
+
 /*
  * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
  * See detach_destroy_domains: synchronize_sched for details.
@@ -2206,7 +2216,8 @@
  */
 static struct sched_group *
 find_busiest_group(struct sched_domain *sd, int this_cpu,
-		   unsigned long *imbalance, enum idle_type idle, int *sd_idle)
+		   unsigned long *imbalance, enum idle_type idle, int *sd_idle,
+		   cpumask_t *cpus)
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
@@ -2243,7 +2254,12 @@
 		sum_weighted_load = sum_nr_running = avg_load = 0;
 
 		for_each_cpu_mask(i, group->cpumask) {
-			struct rq *rq = cpu_rq(i);
+			struct rq *rq;
+
+			if (!cpu_isset(i, *cpus))
+				continue;
+
+			rq = cpu_rq(i);
 
 			if (*sd_idle && !idle_cpu(i))
 				*sd_idle = 0;
@@ -2461,13 +2477,17 @@
  */
 static struct rq *
 find_busiest_queue(struct sched_group *group, enum idle_type idle,
-		   unsigned long imbalance)
+		   unsigned long imbalance, cpumask_t *cpus)
 {
 	struct rq *busiest = NULL, *rq;
 	unsigned long max_load = 0;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
+
+		if (!cpu_isset(i, *cpus))
+			continue;
+
 		rq = cpu_rq(i);
 
 		if (rq->nr_running == 1 && rq->raw_weighted_load > imbalance)
@@ -2506,6 +2526,7 @@
 	struct sched_group *group;
 	unsigned long imbalance;
 	struct rq *busiest;
+	cpumask_t cpus = cpu_online_map;
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -2519,13 +2540,15 @@
 
 	schedstat_inc(sd, lb_cnt[idle]);
 
-	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle);
+redo:
+	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
+							&cpus);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle, imbalance);
+	busiest = find_busiest_queue(group, idle, imbalance, &cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2550,8 +2573,12 @@
 		double_rq_unlock(this_rq, busiest);
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
-		if (unlikely(all_pinned))
+		if (unlikely(all_pinned)) {
+			cpu_clear(cpu_of(busiest), cpus);
+			if (!cpus_empty(cpus))
+				goto redo;
 			goto out_balanced;
+		}
 	}
 
 	if (!nr_moved) {
@@ -2640,6 +2667,7 @@
 	unsigned long imbalance;
 	int nr_moved = 0;
 	int sd_idle = 0;
+	cpumask_t cpus = cpu_online_map;
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -2652,13 +2680,16 @@
 		sd_idle = 1;
 
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
-	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, &sd_idle);
+redo:
+	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE,
+				&sd_idle, &cpus);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
+	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance,
+				&cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;
@@ -2676,6 +2707,12 @@
 					minus_1_or_zero(busiest->nr_running),
 					imbalance, sd, NEWLY_IDLE, NULL);
 		spin_unlock(&busiest->lock);
+
+		if (!nr_moved) {
+			cpu_clear(cpu_of(busiest), cpus);
+			if (!cpus_empty(cpus))
+				goto redo;
+		}
 	}
 
 	if (!nr_moved) {
@@ -6878,6 +6915,7 @@
 			rq->cpu_load[j] = 0;
 		rq->active_balance = 0;
 		rq->push_cpu = 0;
+		rq->cpu = i;
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
 #endif

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-06 23:38 [PATCH] Fix longstanding load balancing bug in the scheduler Christoph Lameter
@ 2006-09-07 10:58 ` Nick Piggin
  2006-09-07 15:29   ` Christoph Lameter
  2006-09-07 17:24   ` Christoph Lameter
  2006-09-08 17:35 ` Siddha, Suresh B
  1 sibling, 2 replies; 19+ messages in thread
From: Nick Piggin @ 2006-09-07 10:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel

On Wed, Sep 06, 2006 at 04:38:33PM -0700, Christoph Lameter wrote:
> The scheduler will stop load balancing if the most busy processor
> contains processes pinned via processor affinity.
> 
> The scheduler currently only does one search for the busiest cpu. If it
> cannot pull any tasks away from the busiest cpu because they were pinned,
> the scheduler goes into a corner and sulks, leaving the idle processors
> idle.
> 
> For example, if processor 0 is busy running four tasks pinned via taskset
> and there are none on processor 1, and one then starts two processes on
> processor 2, the scheduler will not move one of the two processes away
> from processor 2.
> 
> This patch fixes that issue by forcing the scheduler to come out of
> its corner and retry the load balancing, considering other processors.
> Instead of sulking, the scheduler will simply shun the run queue with
> the pinned, unmovable threads.
> 
> This patch was originally developed by John Hawkes and discussed
> at http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.
> 
> I have removed extraneous material, simplified it and gone back to 
> equipping struct rq with the cpu the queue is associated with since this 
> makes the patch much easier and it is likely that others in the future 
> will have the same difficulty of figuring out which processor owns which 
> runqueue.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>

So what I worry about with this approach is that it can really blow
out the latency of a balancing operation. Say you have N-1 CPUs with
lots of stuff locked on their runqueues.

The solution I envisage is to do a "rotor" approach. For example
the last attempted CPU could be stored in the starving CPU's sd...
and it will subsequently try another one.

I've been hot and cold on such an implementation for a while: on one
hand it is a real problem we have; OTOH I was hoping that the domain
balancing might be better generalised. But I increasingly don't
think we should let perfect stand in the way of good... ;)

Would you be interested in testing a patch?


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-07 10:58 ` Nick Piggin
@ 2006-09-07 15:29   ` Christoph Lameter
  2006-09-07 17:24   ` Christoph Lameter
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-09-07 15:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel

On Thu, 7 Sep 2006, Nick Piggin wrote:

> So what I worry about with this approach is that it can really blow
> out the latency of a balancing operation. Say you have N-1 CPUs with
> lots of stuff locked on their runqueues.
> 
> The solution I envisage is to do a "rotor" approach. For example
> the last attempted CPU could be stored in the starving CPU's sd...
> and it will subsequently try another one.
> 
> I've been hot and cold on such an implementation for a while: on one
> hand it is a real problem we have; OTOH I was hoping that the domain
> balancing might be better generalised. But I increasingly don't
> think we should let perfect stand in the way of good... ;)
> 
> Would you be interested in testing a patch?

Sure, but I think we should move fast on this one. This has now been known 
for around a year or so.
 


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-07 10:58 ` Nick Piggin
  2006-09-07 15:29   ` Christoph Lameter
@ 2006-09-07 17:24   ` Christoph Lameter
  2006-09-07 21:47     ` Nick Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-07 17:24 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel

Hmmm... Some more comments

On Thu, 7 Sep 2006, Nick Piggin wrote:

> So what I worry about with this approach is that it can really blow
> out the latency of a balancing operation. Say you have N-1 CPUs with
> lots of stuff locked on their runqueues.

Right but that situation is rare and the performance is certainly
better if the unpinned processes are not all running on the same 
processor.

> The solution I envisage is to do a "rotor" approach. For example
> the last attempted CPU could be stored in the starving CPU's sd...
> and it will subsequently try another one.

That won't work since the notion of "pinned" is relative to a cpu.
A process may be pinned to a group of processors. It will only appear to 
be pinned for cpus outside that set of processors!

What good does storing a processor number do if the processes can change 
dynamically, and so can their pinning? We are dealing with a set of
processes, each of which may be pinned to a set of processors.

> I've been hot and cold on such an implementation for a while: on one
> hand it is a real problem we have; OTOH I was hoping that the domain
> balancing might be better generalised. But I increasingly don't
> think we should let perfect stand in the way of good... ;)

I think we should fix the problem ASAP. Let's do this fix and then we can 
think about other solutions. You already had lots of time to think about
the rotor.

This looks to me like a design flaw that would require either a major rework 
of the scheduler or (my favorite) a delegation of difficult (and 
computationally complex and expensive) scheduler decisions to user space.


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-07 17:24   ` Christoph Lameter
@ 2006-09-07 21:47     ` Nick Piggin
  2006-09-07 22:20       ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-09-07 21:47 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel

On Thu, Sep 07, 2006 at 10:24:26AM -0700, Christoph Lameter wrote:
> Hmmm... Some more comments
> 
> On Thu, 7 Sep 2006, Nick Piggin wrote:
> 
> > So what I worry about with this approach is that it can really blow
> > out the latency of a balancing operation. Say you have N-1 CPUs with
> > lots of stuff locked on their runqueues.
> 
> Right but that situation is rare and the performance is certainly
> better if the unpinned processes are not all running on the same 
> processor.

But the situation is rare full stop, otherwise we'd have had more
complaints. And we do tend to care about theoretical max latency
in many paths more obscure than the scheduler.

How about if you have N/2 CPUs with lots of stuff on runqueues?
Then the other N/2 will each be scanning N/2+1 runqueues... that's
a lot of bus traffic going on which you probably don't want.

> 
> > The solution I envisage is to do a "rotor" approach. For example
> > the last attempted CPU could be stored in the starving CPU's sd...
> > and it will subsequently try another one.
> 
> That won't work since the notion of "pinned" is relative to a cpu.
> A process may be pinned to a group of processors. It will only appear to 
> be pinned for cpus outside that set of processors!

A sched-domain is per-CPU as well. Why doesn't it work?

> What good does storing a processor number do if the processes can change 
> dynamically, and so can their pinning? We are dealing with a set of
> processes, each of which may be pinned to a set of processors.

Yes, it isn't going to be perfect, but it would at least work (which
it doesn't now) without introducing that latency.

> 
> > I've been hot and cold on such an implementation for a while: on one
> > hand it is a real problem we have; OTOH I was hoping that the domain
> > balancing might be better generalised. But I increasingly don't
> > think we should let perfect stand in the way of good... ;)
> 
> I think we should fix the problem ASAP. Let's do this fix and then we can 
> think about other solutions. You already had lots of time to think about
> the rotor.

The problem has been known about for many years. It doesn't remain
unfixed because we didn't know about this brute force patch ;)
Given that Ingo also maintains the -RT patchset, it might face a
rocky road... but if he is OK with it then OK by me for now.

> This looks to me like a design flaw that would require either a major rework 
> of the scheduler or (my favorite) a delegation of difficult (and 
> computationally complex and expensive) scheduler decisions to user space.

I'd be really happy to move this to userspace :)



* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-07 21:47     ` Nick Piggin
@ 2006-09-07 22:20       ` Christoph Lameter
  2006-09-07 22:32         ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-07 22:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel

On Thu, 7 Sep 2006, Nick Piggin wrote:

> > Right but that situation is rare and the performance is certainly
> > better if the unpinned processes are not all running on the same 
> > processor.
> 
> But the situation is rare full stop, otherwise we'd have had more
> complaints. And we do tend to care about theoretical max latency
> in many paths more obscure than the scheduler.

The situation is rare if you just have a couple of cpus and it also does 
not occur if you do not pin processes.

> How about if you have N/2 CPUs with lots of stuff on runqueues?
> Then the other N/2 will each be scanning N/2+1 runqueues... that's
> a lot of bus traffic going on which you probably don't want.

Then the load will likely be sitting on a runqueue and run 
much slower since it has to share cpus although many cpus are available 
to take new jobs. The scanning is very fast and it is certainly better 
than continuing to overload a single processor.

> > That won't work since the notion of "pinned" is relative to a cpu.
> > A process may be pinned to a group of processors. It will only appear to 
> > be pinned for cpus outside that set of processors!
> 
> A sched-domain is per-CPU as well. Why doesn't it work?

Let's say you store the latest set of pinned cpus encountered (I am not 
sure what a cpu number would accomplish). The next time you have to 
reschedule the situation may be completely different and you would have
to revalidate the per cpu pinned cpu set. 

> > What good does storing a processor number do if the processes can change 
> > dynamically, and so can their pinning? We are dealing with a set of
> > processes, each of which may be pinned to a set of processors.
> 
> Yes, it isn't going to be perfect, but it would at least work (which
> it doesn't now) without introducing that latency.

Could you tell us how this could work?

> > This looks to me like a design flaw that would require either a major rework 
> > of the scheduler or (my favorite) a delegation of difficult (and 
> > computationally complex and expensive) scheduler decisions to user space.
> 
> I'd be really happy to move this to userspace :)

Can we merge this patch and then I will try to move this as fast as 
possible to userspace?

Ok then we need to do the following to address the current issue and to
open up the possibilities for future enhancements:

1. Add a new field to /proc/<pid>/cpu that can be written to and that
   forces the process to be moved to the corresponding cpu.

2. Extend statistics in /proc/<pid>/ to allow the export of more scheduler
   information and a special flag that is set if a process cannot be 
   rescheduled. Initially the user space scheduler may just poll this
   field.

3. Optional: Some sort of notification mechanism that a user space 
   scheduler helper could subscribe to. If a pinned tasks situation
   is encountered then the notification passes the number of the
   idle cpu.



* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-07 22:20       ` Christoph Lameter
@ 2006-09-07 22:32         ` Nick Piggin
  2006-09-08  2:52           ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-09-07 22:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, Ingo Molnar, Suresh B

On Thu, Sep 07, 2006 at 03:20:32PM -0700, Christoph Lameter wrote:
> On Thu, 7 Sep 2006, Nick Piggin wrote:
> 
> > How about if you have N/2 CPUs with lots of stuff on runqueues?
> > Then the other N/2 will each be scanning N/2+1 runqueues... that's
> > a lot of bus traffic going on which you probably don't want.
> 
> Then the load will likely be sitting on a runqueue and run 
> much slower since it has to share cpus although many cpus are available 
> to take new jobs. The scanning is very fast and it is certainly better 
> than continuing to overload a single processor.

But it is N^2... I thought SGI of all would hesitate to put such an
algorithm into the scheduler ;)

> > A sched-domain is per-CPU as well. Why doesn't it work?
> 
> Let's say you store the latest set of pinned cpus encountered (I am not 
> sure what a cpu number would accomplish). The next time you have to 
> reschedule the situation may be completely different and you would have
> to revalidate the per cpu pinned cpu set. 

Maybe you misunderstood, I'll explain below.

> > Yes, it isn't going to be perfect, but it would at least work (which
> > it doesn't now) without introducing that latency.
> 
> Could you tell us how this could work?

You keep track of a fallback CPU. Then, if balancing fails because all
processes are pinned, you just try pulling from the fallback CPU. If
that fails, set the fallback to the next CPU.

OTOH that still results in a suboptimal scheduler, and now that I remember
your workload, you had a small number of CPUs screwing up balancing for
a big system. Hmm... so this may not be great for you.

> 
> > > This looks to me like a design flaw that would require either a major rework 
> > > of the scheduler or (my favorite) a delegation of difficult (and 
> > > computationally complex and expensive) scheduler decisions to user space.
> > 
> > I'd be really happy to move this to userspace :)
> 
> Can we merge this patch and then I will try to move this as fast as 
> possible to userspace?

I'd like to get an ack from Ingo, but otherwise OK I guess.

> 
> Ok then we need to do the following to address the current issue and to
> open up the possibilities for future enhancements:
> 
> 1. Add a new field to /proc/<pid>/cpu that can be written to and that
>    forces the process to be moved to the corresponding cpu.
> 
> 2. Extend statistics in /proc/<pid>/ to allow the export of more scheduler
>    information and a special flag that is set if a process cannot be 
>    rescheduled. Initially the user space scheduler may just poll this
>    field.
> 
> 3. Optional: Some sort of notification mechanism that a user space 
>    scheduler helper could subscribe to. If a pinned tasks situation
>    is encountered then the notification passes the number of the
>    idle cpu.

Hmm, how about

1. Export SD information in /sys/
2. Export runqueue load information there too
3. One attribute in /sys/.../CPUn/ can be written with a CPU number m and
   a count t, which will try to pull t tasks from CPUm to CPUn
4. Another attribute would be the result of the last balance attempt
   (eg. 'all pinned').

?



* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-07 22:32         ` Nick Piggin
@ 2006-09-08  2:52           ` Christoph Lameter
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter @ 2006-09-08  2:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, linux-kernel, Ingo Molnar, Suresh B

On Fri, 8 Sep 2006, Nick Piggin wrote:

> > Then the load will likely be sitting on a runqueue and run 
> > much slower since it has to share cpus although many cpus are available 
> > to take new jobs. The scanning is very fast and it is certainly better 
> > than continuing to overload a single processor.
> 
> But it is N^2... I thought SGI of all would hesitate to put such an
> algorithm into the scheduler ;)

You definitely got that one right. But we would prefer a working 
scheduler over a broken one.

> > Could you tell us how this could work?
> 
> You keep track of a fallback CPU. Then, if balancing fails because all
> processes are pinned, you just try pulling from the fallback CPU. If
> that fails, set the fallback to the next CPU.

Ok. As you note below this won't do too much good.
 
> OTOH that still results in a suboptimal scheduler, and now that I remember
> your workload, you had a small number of CPUs screwing up balancing for
> a big system. Hmm... so this may not be great for you.

Right. Let's put this in as a temporary measure and then we'll try to get to 
a long-term solution that avoids doing N^2 searches in the kernel.

> I'd like to get an ack from Ingo, but otherwise OK I guess.

Ok. Let's see what others say.

> Hmm, how about
> 
> 1. Export SD information in /sys/
> 2. Export runqueue load information there too

Ok. That would work.

> 3. One attribute in /sys/.../CPUn/ can be written with a CPU number m and
>    a count t, which will try to pull t tasks from CPUm to CPUn

Too contorted and too difficult to use. Having a cpu field in 
/proc/<pid>/cpu is self-explanatory, easy to understand and allows
per-process control for the user space mechanism. It could also be used
to test things or to manually intervene in the scheduler.

Transferring an abstract number of processes from one cpu to another is 
rare. One strives to have one process per cpu.

> 4. Another attribute would be the result of the last balance attempt
>    (eg. 'all pinned').

Ok.


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-06 23:38 [PATCH] Fix longstanding load balancing bug in the scheduler Christoph Lameter
  2006-09-07 10:58 ` Nick Piggin
@ 2006-09-08 17:35 ` Siddha, Suresh B
  2006-09-08 18:40   ` Christoph Lameter
  1 sibling, 1 reply; 19+ messages in thread
From: Siddha, Suresh B @ 2006-09-08 17:35 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, linux-kernel, npiggin, mingo

On Wed, Sep 06, 2006 at 04:38:33PM -0700, Christoph Lameter wrote:
> +int cpu_of(struct rq *rq)
> +{
> +#ifdef CONFIG_SMP
> +	return rq->cpu;
> +#else
> +	return 0;
> +#endif
> +}

Is there a reason why this is not made inline?

>  static struct sched_group *
>  find_busiest_group(struct sched_domain *sd, int this_cpu,
> -		   unsigned long *imbalance, enum idle_type idle, int
> *sd_idle)
> +		   unsigned long *imbalance, enum idle_type idle, int
> *sd_idle,
> +		   cpumask_t *cpus)
>  {
>  	struct sched_group *busiest = NULL, *this = NULL, *group =
> sd->groups;
>  	unsigned long max_load, avg_load, total_load, this_load,
> total_pwr;
> @@ -2243,7 +2254,12 @@
>  		sum_weighted_load = sum_nr_running = avg_load = 0;
>  
>  		for_each_cpu_mask(i, group->cpumask) {
> -			struct rq *rq = cpu_rq(i);
> +			struct rq *rq;
> +
> +			if (!cpu_isset(i, *cpus))
> +				continue;

In normal conditions can we make this "cpus" argument NULL and only set/use it
when we run into the pinned condition? That will at least avoid unnecessary memory
test-bit operations in the normal case.

thanks,
suresh


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-08 17:35 ` Siddha, Suresh B
@ 2006-09-08 18:40   ` Christoph Lameter
  2006-09-08 20:00     ` Siddha, Suresh B
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-08 18:40 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: akpm, linux-kernel, npiggin, mingo

On Fri, 8 Sep 2006, Siddha, Suresh B wrote:

> > +#endif
> > +}
> 
> Is there a reason why this is not made inline?

Just did not think of it.

> > +			struct rq *rq;
> > +
> > +			if (!cpu_isset(i, *cpus))
> > +				continue;
> 
> In normal conditions can we make this "cpus" argument NULL and only set/use it
> when we run into the pinned condition? That will at least avoid unnecessary memory
> test-bit operations in the normal case.

The balancing operation is not that frequent and having to treat a special 
case in the callers would make code more complicated and likely offset the
gains in this function.

Fix up the declaration of cpu_of()

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.18-rc5-mm1/kernel/sched.c
===================================================================
--- linux-2.6.18-rc5-mm1.orig/kernel/sched.c	2006-09-08 11:38:35.852594785 -0700
+++ linux-2.6.18-rc5-mm1/kernel/sched.c	2006-09-08 11:39:29.182308471 -0700
@@ -269,7 +269,7 @@ struct rq {
 
 static DEFINE_PER_CPU(struct rq, runqueues);
 
-int cpu_of(struct rq *rq)
+static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
 	return rq->cpu;



* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-08 18:40   ` Christoph Lameter
@ 2006-09-08 20:00     ` Siddha, Suresh B
  2006-09-08 20:19       ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Siddha, Suresh B @ 2006-09-08 20:00 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Siddha, Suresh B, akpm, linux-kernel, npiggin, mingo

On Fri, Sep 08, 2006 at 11:40:51AM -0700, Christoph Lameter wrote:
> The balancing operation is not that frequent and having to treat a special 
> case in the callers would make code more complicated and likely offset the
> gains in this function.

This solution as such is not accurate and clean :) and my suggestion is
not making it any more ugly.

With increase in NR_CPUS, cost of cpumask operations will increase and 
we shouldn't penalize the other logical threads or cores sharing the caches by
bringing in unnecessary cache lines.

thanks,
suresh


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-08 20:00     ` Siddha, Suresh B
@ 2006-09-08 20:19       ` Christoph Lameter
  2006-09-09  0:03         ` Siddha, Suresh B
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-08 20:19 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: akpm, linux-kernel, npiggin, mingo

On Fri, 8 Sep 2006, Siddha, Suresh B wrote:

> On Fri, Sep 08, 2006 at 11:40:51AM -0700, Christoph Lameter wrote:
> > The balancing operation is not that frequent and having to treat a special 
> > case in the callers would make code more complicated and likely offset the
> > gains in this function.
> 
> This solution as such is not accurate and clean :) and my suggestion is
> not making it any more ugly.
> 
> With increase in NR_CPUS, cost of cpumask operations will increase and 
> we shouldn't penalize the other logical threads or cores sharing the caches by
> bringing in unnecessary cache lines.

One cacheline of 128 bytes will support all 1024 cpus that IA64 allows. 
Cacheline-align the cpumask?



* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-08 20:19       ` Christoph Lameter
@ 2006-09-09  0:03         ` Siddha, Suresh B
  2006-09-09  5:25           ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Siddha, Suresh B @ 2006-09-09  0:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Siddha, Suresh B, akpm, linux-kernel, npiggin, mingo

On Fri, Sep 08, 2006 at 01:19:28PM -0700, Christoph Lameter wrote:
> On Fri, 8 Sep 2006, Siddha, Suresh B wrote:
> 
> > On Fri, Sep 08, 2006 at 11:40:51AM -0700, Christoph Lameter wrote:
> > > The balancing operation is not that frequent and having to treat a special 
> > > case in the callers would make code more complicated and likely offset the
> > > gains in this function.
> > 
> > This solution as such is not accurate and clean :) and my suggestion is
> > not making it any more ugly.
> > 
> > With increase in NR_CPUS, cost of cpumask operations will increase and 
> > we shouldn't penalize the other logical threads or cores sharing the caches by
> > bringing in unnecessary cache lines.
> 
> One cacheline of 128 bytes will support all 1024 cpus that IA64 allows. 
> Cacheline-align the cpumask?

one or more, it is unnecessary for the common case.

thanks,
suresh


* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-09  0:03         ` Siddha, Suresh B
@ 2006-09-09  5:25           ` Christoph Lameter
  2006-09-09 19:56             ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-09  5:25 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: akpm, linux-kernel, npiggin, mingo

On Fri, 8 Sep 2006, Siddha, Suresh B wrote:

> > One cacheline of 128 bytes will support all 1024 cpus that IA64 allows. 
> > Cacheline-align the cpumask?
> 
> one or more, it is unnecessary for the common case.

The common case is an arch with far fewer cpus. The maximum on i386,
e.g., is 255, meaning 8 bytes. That fits in the cacheline that is already
used for the stack frame of the calling function. 



* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-09  5:25           ` Christoph Lameter
@ 2006-09-09 19:56             ` Christoph Lameter
  2006-09-11  8:37               ` Nick Piggin
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-09 19:56 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: akpm, linux-kernel, npiggin, mingo

On Fri, 8 Sep 2006, Christoph Lameter wrote:

> > one or more, it is unnecessary for the common case.
> 
> The common case is an arch with far fewer cpus. The maximum on i386
> f.e. is 255, meaning 8 bytes. That fits in the cacheline that is already
> used for the stack frame of the calling function. 

Ughh. Wrong. 255 cpus require 32 bytes. Systems rarely have that 
many. If you configure a kernel with fewer than 32 cpus then this will be 
one word on the stack.

Also note that the patch restricts the search to online cpus. The 
scheduler will check offline cpus without this patch. That may actually 
result in speed improvements since the cachelines from offline cpus are
no longer brought in during the search for the busiest group / cpu.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Fix longstanding load balancing bug in the scheduler.
  2006-09-09 19:56             ` Christoph Lameter
@ 2006-09-11  8:37               ` Nick Piggin
  2006-09-12 18:37                 ` [PATCH] Fix longstanding load balancing bug in the scheduler V2 Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Nick Piggin @ 2006-09-11  8:37 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Siddha, Suresh B, akpm, linux-kernel, mingo

On Sat, Sep 09, 2006 at 12:56:16PM -0700, Christoph Lameter wrote:
> On Fri, 8 Sep 2006, Christoph Lameter wrote:
> 
> > > one or more, it is unnecessary for the common case.
> > 
> > The common case is an arch with far fewer cpus. The maximum on i386
> > f.e. is 255, meaning 8 bytes. That fits in the cacheline that is already
> > used for the stack frame of the calling function. 
> 
> Ughh. Wrong. 255 cpus require 32 bytes. Systems rarely have that 
> many. If you configure a kernel with fewer than 32 cpus then this will be 
> one word on the stack.
> 
> Also note that the patch restricts the search to online cpus. The 
> scheduler will check offline cpus without this patch. That may actually 
> result in speed improvements since the cachelines from offline cpus are
> no longer brought in during the search for the busiest group / cpu.

This should not be the case. The sched-domain structure should always
reflect the online CPUs only (see our hotplug cpu handler), and if you
find otherwise then that would be a bug.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Fix longstanding load balancing bug in the scheduler V2
  2006-09-11  8:37               ` Nick Piggin
@ 2006-09-12 18:37                 ` Christoph Lameter
  2006-09-15  0:26                   ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2006-09-12 18:37 UTC (permalink / raw)
  To: akpm; +Cc: Nick Piggin, Siddha, Suresh B, linux-kernel, mingo

Fix longstanding load balancing bug in the scheduler V2.

AFAIK this is an important scheduler bug that needs to go 
into 2.6.18 and all stable releases since the issue can stall the 
scheduler for good.


The scheduler will stop load balancing if the most busy processor
contains processes pinned via processor affinity.

The scheduler currently only does one search for busiest cpu. If it cannot
pull any tasks away from the busiest cpu because they were pinned then the
scheduler goes into a corner and sulks leaving the idle processors idle.

F.e. if processor 0 is busy running four tasks pinned via
taskset, processor 1 is idle and one just started two
processes on processor 2, then the scheduler will not move one of
the two processes from processor 2 over to processor 1.

This patch fixes that issue by forcing the scheduler to come out of
its corner and retrying the load balancing by considering other
processors for load balancing.

This patch was originally developed by John Hawkes and discussed
at http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

I have removed extraneous material and gone back to equipping
struct rq with the cpu the queue is associated with since this
makes the patch much easier and it is likely that others in the
future will have the same difficulty of figuring out which
processor owns which runqueue.

The overhead added through these patches is a single word on the stack if
the kernel is configured to support 32 cpus or less (32 bit). For 32 bit
environments the maximum number of cpus that can be configured is 255, which
would result in an additional 32 bytes on the stack. On IA64 up to
1k cpus can be configured, which will result in an additional 128
bytes on the stack. The maximum additional cache footprint is one cacheline.
Typically memory use will be much less than a cacheline and the additional
cpumask will be placed on the stack in a cacheline that already contains
other local variables.

V1->V2:
- Add static inline to cpu_of() (Thanks, Suresh)
- Use CPU_MASK_ALL instead of cpu_online_map since sched domain only
  include active processors (Thanks, Nick).
- Add discussion of performance impact

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.18-rc6-mm1/kernel/sched.c
===================================================================
--- linux-2.6.18-rc6-mm1.orig/kernel/sched.c	2006-09-12 13:25:17.510076005 -0500
+++ linux-2.6.18-rc6-mm1/kernel/sched.c	2006-09-12 13:26:33.391524056 -0500
@@ -239,6 +239,7 @@
 	/* For active balancing */
 	int active_balance;
 	int push_cpu;
+	int cpu;		/* cpu of this runqueue */
 
 	struct task_struct *migration_thread;
 	struct list_head migration_queue;
@@ -268,6 +269,15 @@
 
 static DEFINE_PER_CPU(struct rq, runqueues);
 
+static inline int cpu_of(struct rq *rq)
+{
+#ifdef CONFIG_SMP
+	return rq->cpu;
+#else
+	return 0;
+#endif
+}
+
 /*
  * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
  * See detach_destroy_domains: synchronize_sched for details.
@@ -2206,7 +2216,8 @@
  */
 static struct sched_group *
 find_busiest_group(struct sched_domain *sd, int this_cpu,
-		   unsigned long *imbalance, enum idle_type idle, int *sd_idle)
+		   unsigned long *imbalance, enum idle_type idle, int *sd_idle,
+		   cpumask_t *cpus)
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
@@ -2243,7 +2254,12 @@
 		sum_weighted_load = sum_nr_running = avg_load = 0;
 
 		for_each_cpu_mask(i, group->cpumask) {
-			struct rq *rq = cpu_rq(i);
+			struct rq *rq;
+
+			if (!cpu_isset(i, *cpus))
+				continue;
+
+			rq = cpu_rq(i);
 
 			if (*sd_idle && !idle_cpu(i))
 				*sd_idle = 0;
@@ -2461,13 +2477,17 @@
  */
 static struct rq *
 find_busiest_queue(struct sched_group *group, enum idle_type idle,
-		   unsigned long imbalance)
+		   unsigned long imbalance, cpumask_t *cpus)
 {
 	struct rq *busiest = NULL, *rq;
 	unsigned long max_load = 0;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
+
+		if (!cpu_isset(i, *cpus))
+			continue;
+
 		rq = cpu_rq(i);
 
 		if (rq->nr_running == 1 && rq->raw_weighted_load > imbalance)
@@ -2506,6 +2526,7 @@
 	struct sched_group *group;
 	unsigned long imbalance;
 	struct rq *busiest;
+	cpumask_t cpus = CPU_MASK_ALL;
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -2519,13 +2540,15 @@
 
 	schedstat_inc(sd, lb_cnt[idle]);
 
-	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle);
+redo:
+	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
+							&cpus);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle, imbalance);
+	busiest = find_busiest_queue(group, idle, imbalance, &cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2550,8 +2573,12 @@
 		double_rq_unlock(this_rq, busiest);
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
-		if (unlikely(all_pinned))
+		if (unlikely(all_pinned)) {
+			cpu_clear(cpu_of(busiest), cpus);
+			if (!cpus_empty(cpus))
+				goto redo;
 			goto out_balanced;
+		}
 	}
 
 	if (!nr_moved) {
@@ -2640,6 +2667,7 @@
 	unsigned long imbalance;
 	int nr_moved = 0;
 	int sd_idle = 0;
+	cpumask_t cpus = CPU_MASK_ALL;
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -2652,13 +2680,16 @@
 		sd_idle = 1;
 
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
-	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, &sd_idle);
+redo:
+	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE,
+				&sd_idle, &cpus);
 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
+	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance,
+				&cpus);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;
@@ -2676,6 +2707,12 @@
 					minus_1_or_zero(busiest->nr_running),
 					imbalance, sd, NEWLY_IDLE, NULL);
 		spin_unlock(&busiest->lock);
+
+		if (!nr_moved) {
+			cpu_clear(cpu_of(busiest), cpus);
+			if (!cpus_empty(cpus))
+				goto redo;
+		}
 	}
 
 	if (!nr_moved) {
@@ -6878,6 +6915,7 @@
 			rq->cpu_load[j] = 0;
 		rq->active_balance = 0;
 		rq->push_cpu = 0;
+		rq->cpu = i;
 		rq->migration_thread = NULL;
 		INIT_LIST_HEAD(&rq->migration_queue);
 #endif

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Fix longstanding load balancing bug in the scheduler V2
  2006-09-12 18:37                 ` [PATCH] Fix longstanding load balancing bug in the scheduler V2 Christoph Lameter
@ 2006-09-15  0:26                   ` Andrew Morton
  2006-09-15  9:44                     ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2006-09-15  0:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nick Piggin, Siddha, Suresh B, linux-kernel, mingo

On Tue, 12 Sep 2006 11:37:55 -0700 (PDT)
Christoph Lameter <christoph@sgi.com> wrote:

> Fix longstanding load balancing bug in the scheduler V2.
> 
> AFAIK this is an important scheduler bug that needs to go 
> into 2.6.18 and all stable releases since the issue can stall the 
> scheduler for good.

The timing is of course problematic.  One approach could be to merge it
into 2.6.19-early, backport into 2.6.18.x after a few weeks.  I don't know
if that's a lot better, really - it's unlikely that anyone will be running
serious performance testing against 2.6.19-rc1 or -rc2.

I'm struggling to understand how serious this really is - if the bug is
"longstanding" then very few machines must be encountering it?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Fix longstanding load balancing bug in the scheduler V2
  2006-09-15  0:26                   ` Andrew Morton
@ 2006-09-15  9:44                     ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2006-09-15  9:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Nick Piggin, Siddha, Suresh B, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> On Tue, 12 Sep 2006 11:37:55 -0700 (PDT)
> Christoph Lameter <christoph@sgi.com> wrote:
> 
> > Fix longstanding load balancing bug in the scheduler V2.
> > 
> > AFAIK this is an important scheduler bug that needs to go 
> > into 2.6.18 and all stable releases since the issue can stall the 
> > scheduler for good.
> 
> The timing is of course problematic.  One approach could be to merge 
> it into 2.6.19-early, backport into 2.6.18.x after a few weeks.  I 
> don't know if that's a lot better, really - it's unlikely that anyone 
> will be running serious performance testing against 2.6.19-rc1 or 
> -rc2.

with that release approach it's:

Acked-by: Ingo Molnar <mingo@elte.hu>

> I'm struggling to understand how serious this really is - if the bug 
> is "longstanding" then very few machines must be encountering it?

yeah.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2006-09-15  9:52 UTC | newest]

Thread overview: 19+ messages
2006-09-06 23:38 [PATCH] Fix longstanding load balancing bug in the scheduler Christoph Lameter
2006-09-07 10:58 ` Nick Piggin
2006-09-07 15:29   ` Christoph Lameter
2006-09-07 17:24   ` Christoph Lameter
2006-09-07 21:47     ` Nick Piggin
2006-09-07 22:20       ` Christoph Lameter
2006-09-07 22:32         ` Nick Piggin
2006-09-08  2:52           ` Christoph Lameter
2006-09-08 17:35 ` Siddha, Suresh B
2006-09-08 18:40   ` Christoph Lameter
2006-09-08 20:00     ` Siddha, Suresh B
2006-09-08 20:19       ` Christoph Lameter
2006-09-09  0:03         ` Siddha, Suresh B
2006-09-09  5:25           ` Christoph Lameter
2006-09-09 19:56             ` Christoph Lameter
2006-09-11  8:37               ` Nick Piggin
2006-09-12 18:37                 ` [PATCH] Fix longstanding load balancing bug in the scheduler V2 Christoph Lameter
2006-09-15  0:26                   ` Andrew Morton
2006-09-15  9:44                     ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox