public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched: smpnice work around for active_load_balance()
@ 2006-03-28  6:00 Peter Williams
  2006-03-28 19:25 ` Siddha, Suresh B
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Williams @ 2006-03-28  6:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Siddha, Suresh B, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2019 bytes --]

Problem:

It is undesirable for HT/MC packages to have more than one of their CPUs
busy if there are other packages that have all of their CPUs idle.  Fixing
this involves moving the only running task (i.e. the one actually on the
CPU) off onto another CPU and is achieved by using active_load_balance()
and relying on the fact that (when it starts) the queue's migration thread
will preempt the sole running task and (therefore) make it movable.  The
migration thread then moves it to an idle package.

Unfortunately, the mechanism for setting a run queue's active_balance
flag is buried deep inside load_balance() and relies heavily on
find_busiest_group() and find_busiest_queue() reporting success even if
the busiest queue has only one task running.  This requirement is not
currently met.

Solution:

The long term solution to this problem is to provide an alternative
mechanism for triggering active load balancing for run queues that need it.

However, in the meantime, this patch modifies find_busiest_group() so
that (when idle != NEWLY_IDLE) it prefers groups with at least one CPU 
with more than one running task over those with only one and modifies 
find_busiest_queue() so that (when idle != NEWLY_IDLE) it prefers queues 
with more than one running task over those with only one.  This means 
that the "busiest" queue found in load_balance() will only have one or 
fewer runnable tasks if there were no queues with more than one runnable 
task.  When called from load_balance_newidle() both functions retain 
their existing behaviour.  These measures will prevent high load weight 
tasks that have a CPU to themselves from suppressing load balancing 
between other queues.

NB This patch should be backed out when a proper fix for the problems 
inherent in HT and MC systems is implemented.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


[-- Attachment #2: smpnice-active_load_balance --]
[-- Type: text/plain, Size: 2175 bytes --]

Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c	2006-03-27 17:02:53.000000000 +1100
+++ MM-2.6.X/kernel/sched.c	2006-03-28 16:57:58.000000000 +1100
@@ -2098,6 +2098,7 @@ find_busiest_group(struct sched_domain *
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
 	unsigned long this_load_per_task, this_nr_running;
+	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -2152,7 +2153,15 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (avg_load > max_load && nr_loaded_cpus) {
+		} else if (nr_loaded_cpus) {
+			if (avg_load > max_load || !busiest_has_loaded_cpus) {
+				max_load = avg_load;
+				busiest = group;
+				busiest_nr_running = sum_nr_running;
+				busiest_load_per_task = sum_weighted_load;
+				busiest_has_loaded_cpus = 1;
+			}
+		} else if (!busiest_has_loaded_cpus && avg_load < max_load) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2161,7 +2170,7 @@ find_busiest_group(struct sched_domain *
 		group = group->next;
 	} while (group != sd->groups);
 
-	if (!busiest || this_load >= max_load)
+	if (!busiest || this_load >= max_load || busiest_nr_running == 0)
 		goto out_balanced;
 
 	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
@@ -2265,12 +2274,19 @@ static runqueue_t *find_busiest_queue(st
 {
 	unsigned long max_load = 0;
 	runqueue_t *busiest = NULL, *rqi;
+	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
 		rqi = cpu_rq(i);
 
-		if (rqi->raw_weighted_load > max_load && rqi->nr_running > 1) {
+		if (rqi->nr_running > 1) {
+			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
+				max_load = rqi->raw_weighted_load;
+				busiest = rqi;
+				busiest_is_loaded = 1;
+			}
+		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
 			max_load = rqi->raw_weighted_load;
 			busiest = rqi;
 		}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-28  6:00 [PATCH] sched: smpnice work around for active_load_balance() Peter Williams
@ 2006-03-28 19:25 ` Siddha, Suresh B
  2006-03-28 22:44   ` Peter Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-03-28 19:25 UTC (permalink / raw)
  To: Peter Williams
  Cc: Andrew Morton, Siddha, Suresh B, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Tue, Mar 28, 2006 at 05:00:50PM +1100, Peter Williams wrote:
> Problem:
> 
> It is undesirable for HT/MC packages to have more than one of their CPUs
> busy if there are other packages that have all of their CPUs idle.  This

We need to balance even if the other packages are not idle.. For example,
consider a 4-core DP system, if we have 6 runnable (assume same priority)
processes, we want to schedule 3 of them in each package..

Today's active load balance implementation is very simple and generic, and
hence it works smoothly with dual- and multi-core.. Please read my OLS 
2005 paper, which talks about different scheduling scenarios and also how 
we were planning to implement the power savings policy in the case of 
multi-core.. I had a prototype patch for doing this, which I held up 
before going on vacation, as it needed some rework with your smpnice 
patch in place.. I will post a patch on top of current mainline for your
reference.

> +		} else if (!busiest_has_loaded_cpus && avg_load < max_load) {

I haven't fully digested the result of this patch but should this be
avg_load < max_load or avg_load > max_load ?

Either way, I can show scheduling scenarios which will fail...

>  
> -		if (rqi->raw_weighted_load > max_load && rqi->nr_running > 1) {
> +		if (rqi->nr_running > 1) {
> +			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
> +				max_load = rqi->raw_weighted_load;
> +				busiest = rqi;
> +				busiest_is_loaded = 1;
> +			}
> +		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {

Please note the point that the same scheduling logic has to work for all
the different levels of scheduler domains... I think these checks complicate
the decisions as we go up in the scheduling hierarchy.. Please go through
the HT/MC/MP/NUMA combinations and with same/different priority processes for
different scenarios..

Even with no HT and MC, this patch still has issues in the presence
of different priority tasks... consider a simple DP system and run two
instances of high priority tasks (simple infinite loops) and two normal priority
tasks. With "top" I observed that these normal priority tasks keep on jumping
from one processor to another... Ideally with smpnice, we would expect that 
each processor should have two tasks (one high priority and another one 
with normal priority)..

thanks,
suresh


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-28 19:25 ` Siddha, Suresh B
@ 2006-03-28 22:44   ` Peter Williams
  2006-03-29  2:14     ` Peter Williams
  2006-03-29  2:52     ` Siddha, Suresh B
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Williams @ 2006-03-28 22:44 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Tue, Mar 28, 2006 at 05:00:50PM +1100, Peter Williams wrote:
>> Problem:
>>
>> It is undesirable for HT/MC packages to have more than one of their CPUs
>> busy if there are other packages that have all of their CPUs idle.  This
> 
> We need to balance even if the other packages are not idle.. For example,
> consider a 4-core DP system, if we have 6 runnable(assume same priority)
> processes, we want to schedule 3 of them in each package..

Well I hope that when you do a proper implementation for this issue 
it takes this into account.  The current implementation doesn't.

> 
> Today's active load balance implementation is very simple and generic, and
> hence it works smoothly with dual- and multi-core..

The application of active balancing to address your problem in the 
current implementation is essentially random.

> Please read my OLS 
> 2005 paper which talks about different scheduling scenarios and also how 

A URL would be handy.

> we were planning to implement the power savings policy in the case of multi-core.. 
> I had a prototype patch for doing this, which I held up before going
> on vacation, as it needed some rework with your smpnice patch in place..
> I will post a patch on top of current mainline for your reference.
> 
>> +		} else if (!busiest_has_loaded_cpus && avg_load < max_load) {
> 
> I haven't fully digested the result of this patch but should this be
> avg_load < max_load or avg_load > max_load ?

Yes.  Thanks for spotting that.

> 
> Either way, I can show scheduling scenarios which will fail...

I'd be interested to see the ones that would fail with the corrected 
code.  I can show lots of examples where load balancing fails to do the 
right thing without the smpnice patches so it becomes a matter of which 
are more important.

> 
>>  
>> -		if (rqi->raw_weighted_load > max_load && rqi->nr_running > 1) {
>> +		if (rqi->nr_running > 1) {
>> +			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
>> +				max_load = rqi->raw_weighted_load;
>> +				busiest = rqi;
>> +				busiest_is_loaded = 1;
>> +			}
>> +		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
> 
> Please note the point that same scheduling logic has to work for all
> the different levels of scheduler domains... I think these checks complicates
> the decisions as we go up in the scheduling hirerachy.. Please go through
> the HT/MC/MP/Numa combinations and with same/different priority processes for
> different scenarios..

Sometimes complexity is necessary.  E.g. to handle the limitations of HT 
technology.  In this case, the complexity is necessary to make "nice" 
work on SMP systems.  The thing that broke "nice" on SMP systems was the 
adoption of separate run queues for each CPU and backing out that change 
in order to fix the problem is not an option so alternative solutions 
such as smpnice are required.

> 
> Even with no HT and MC, this patch still has issues in the presence
> of different priority tasks... consider a simple DP system and run two
> instances of high priority tasks (simple infinite loops) and two normal priority
> tasks. With "top" I observed that these normal priority tasks keep on jumping
> from one processor to another... Ideally with smpnice, we would expect that 
> each processor should have two tasks (one high priority and another one 
> with normal priority)..

Yes, but you are failing to take into account the effect of the other 
tasks on your system (e.g. top) that run from time to time.  If their 
burst of CPU use happens to coincide with some load balancing activity 
they will cause an imbalance to be detected (that is different to that 
which only considers your test tasks) and this will result in some tasks 
being moved.  Beware the Heisenberg Uncertainty Principle :-).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-28 22:44   ` Peter Williams
@ 2006-03-29  2:14     ` Peter Williams
  2006-03-29  2:52     ` Siddha, Suresh B
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-03-29  2:14 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Peter Williams wrote:
> Siddha, Suresh B wrote:
>> On Tue, Mar 28, 2006 at 05:00:50PM +1100, Peter Williams wrote:
[bits deleted]
>>
>> Even with no HT and MC, this patch still has issues in the presence
>> of different priority tasks... consider a simple DP system and run two
>> instances of high priority tasks (simple infinite loops) and two normal 
>> priority
>> tasks. With "top" I observed that these normal priority tasks keep on 
>> jumping
>> from one processor to another... Ideally with smpnice, we would expect 
>> that each processor should have two tasks (one high priority and 
>> another one with normal priority)..
> 
> Yes, but you are failing to take into account the effect of the other 
> tasks on your system (e.g. top) that run from time to time.  If their 
> burst of CPU use happens to coincide with some load balancing activity 
> they will cause an imbalance to be detected (that is different to that 
> which only considers your test tasks) and this will result in some tasks 
> being moved.  Beware the Heisenberg Uncertainty Principle :-).

Notwithstanding the HUP, I've investigated this and have found that 
there is more instability than expected and that it is due to a silly 
bit of code (by me) at the end of find_busiest_queue() marked by the 
comment:

/* or if there's a reasonable chance that *imbalance is big
  * enough to cause a move
  */

that makes load balancing more aggressive.  The functionality it 
implemented should have been abandoned when the code was updated to use 
average run queue loads instead of SCHED_LOAD_SCALE in the code that 
handled small imbalances, but it wasn't.  I'll send Andrew a patch that 
removes the offending code shortly.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-28 22:44   ` Peter Williams
  2006-03-29  2:14     ` Peter Williams
@ 2006-03-29  2:52     ` Siddha, Suresh B
  2006-03-29  3:42       ` Peter Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-03-29  2:52 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Wed, Mar 29, 2006 at 09:44:49AM +1100, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > We need to balance even if the other packages are not idle.. For example,
> > consider a 4-core DP system, if we have 6 runnable(assume same priority)
> > processes, we want to schedule 3 of them in each package..
> 
> Well I hope that when you do a proper implementation for this issue 
> it takes this into account.  The current implementation doesn't.

This will also have issues when we want to implement power savings policy
for multi-core. Attached is the prototype patch(against 2.6.16-git15)
I was planning to send to mainline..

> > Today's active load balance implementation is very simple and generic, and
> > hence it works smoothly with dual- and multi-core..
> 
> The application of active balancing to address your problem in the 
> current implementation is essentially random.

Why so? We wanted to implement these HT and MC optimizations generically
in the scheduler, and the domain topology (and sched groups cpu_power) provided
that infrastructure cleanly..

> 
> > Please read my OLS 
> > 2005 paper which talks about different scheduling scenarios and also how 
> 
> A URL would be handy.

http://www.linuxsymposium.org/2005/linuxsymposium_procv2.pdf
Look for the paper titled "Chip Multi Processing aware Linux Kernel Scheduler"

> > 
> > Either way, I can show scheduling scenarios which will fail...
> 
> I'd be interested to see the ones that would fail with the corrected 
> code.  

4-way system with HT (8 logical processors) ... 

Package-P0: T0 has a high priority task, T1 is idle.
Package-P1: both T0 and T1 have 1 normal priority task each..
P2 and P3 are idle.

The scheduler needs to move one of the normal priority tasks to P2 or P3.. 
But find_busiest_group() will always treat P0 as the busiest group and
will not distribute the load as expected..

I am giving so many examples that I am confused at the end of the day which
examples are fixed and which are not by your patches :)
So please send the latest smpnice patch, which you think is clean and fixes 
all the issues (look at all my examples and also the ones mentioned in the 
OLS paper...)

> Sometimes complexity is necessary.  E.g. to handle the limitations of HT 
> technology.  In this case, the complexity is necessary to make "nice" 
> work on SMP systems.  The thing that broke "nice" on SMP systems was the 
> adoption of separate run queues for each CPU and backing out that change 
> in order to fix the problem is not an option so alternative solutions 
> such as smpnice are required.

I agree with the issue you are trying to address, but at the same time
we should make sure that the new patch is clean, easy to understand, 
has predictable behavior, and really fixes the issues that we
are addressing..

--

Two sysctls enable the SMT/MC power savings policy for the scheduler.

kernel.sched_core_power_savings
kernel.sched_smt_power_savings

If any of the above values is non-zero, then the power savings policy will
be enabled for that level.  Based on these sysctls, sched groups cpu power
will be determined for different domains.  When the power savings policy is
enabled and under light load conditions, the scheduler will minimize the
number of physical packages carrying the load, thus conserving power.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

diff -pNru linux-2.6.16-git15/arch/i386/kernel/smpboot.c linux/arch/i386/kernel/smpboot.c
--- linux-2.6.16-git15/arch/i386/kernel/smpboot.c	2006-03-28 12:07:28.325536192 -0800
+++ linux/arch/i386/kernel/smpboot.c	2006-03-28 14:43:23.837282224 -0800
@@ -449,10 +449,16 @@ cpumask_t cpu_coregroup_map(int cpu)
 	struct cpuinfo_x86 *c = cpu_data + cpu;
 	/*
 	 * For perf, we return last level cache shared map.
-	 * TBD: when power saving sched policy is added, we will return
-	 *      cpu_core_map when power saving policy is enabled
+	 * And for power savings, we return cpu_core_map
+	 *
+	 * TBD: On CPU generations where there are no shared resources
+	 * (like last level cache) between cores,
+	 * let the OS choose core_power_savings sched policy during bootup.
 	 */
-	return c->llc_shared_map;
+	if (core_power_savings || smt_power_savings)
+		return cpu_core_map[cpu];
+	else
+		return c->llc_shared_map;
 }
 
 /* representing cpus for which sibling maps can be computed */
diff -pNru linux-2.6.16-git15/arch/x86_64/kernel/smpboot.c linux/arch/x86_64/kernel/smpboot.c
--- linux-2.6.16-git15/arch/x86_64/kernel/smpboot.c	2006-03-28 12:07:28.692480408 -0800
+++ linux/arch/x86_64/kernel/smpboot.c	2006-03-28 14:42:26.428009760 -0800
@@ -454,10 +454,16 @@ cpumask_t cpu_coregroup_map(int cpu)
 	struct cpuinfo_x86 *c = cpu_data + cpu;
 	/*
 	 * For perf, we return last level cache shared map.
-	 * TBD: when power saving sched policy is added, we will return
-	 *      cpu_core_map when power saving policy is enabled
+	 * And for power savings, we return cpu_core_map
+	 *
+	 * TBD: On CPU generations where there are no shared resources
+	 * (like last level cache) between cores,
+	 * let the OS choose core_power_savings sched policy during bootup.
 	 */
-	return c->llc_shared_map;
+	if (core_power_savings || smt_power_savings)
+		return cpu_core_map[cpu];
+	else
+		return c->llc_shared_map;
 }
 
 /* representing cpus for which sibling maps can be computed */
diff -pNru linux-2.6.16-git15/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.16-git15/include/linux/sched.h	2006-03-28 12:07:34.735561720 -0800
+++ linux/include/linux/sched.h	2006-03-28 12:09:36.100111504 -0800
@@ -38,6 +38,7 @@
 #include <linux/futex.h>
 
 #include <linux/auxvec.h>	/* For AT_VECTOR_SIZE */
+#include <linux/sysctl.h>
 
 struct exec_domain;
 
@@ -564,6 +565,10 @@ enum idle_type
 #define SD_WAKE_AFFINE		32	/* Wake task to waking CPU */
 #define SD_WAKE_BALANCE		64	/* Perform balancing at task wakeup */
 #define SD_SHARE_CPUPOWER	128	/* Domain members share cpu power */
+#define SD_POWERSAVINGS_BALANCE	256	/* Balance for power savings */
+
+#define BALANCE_FOR_POWER	((core_power_savings || smt_power_savings) \
+				 ? SD_POWERSAVINGS_BALANCE : 0)
 
 struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
@@ -1396,6 +1401,10 @@ static inline void arch_pick_mmap_layout
 
 extern long sched_setaffinity(pid_t pid, cpumask_t new_mask);
 extern long sched_getaffinity(pid_t pid, cpumask_t *mask);
+extern int smt_power_savings, core_power_savings;
+extern int sched_sysctl_handler(ctl_table *table, int write, struct file *file,
+				void __user *buffer, size_t *length, loff_t *ppos);
+
 
 extern void normalize_rt_tasks(void);
 
diff -pNru linux-2.6.16-git15/include/linux/sysctl.h linux/include/linux/sysctl.h
--- linux-2.6.16-git15/include/linux/sysctl.h	2006-03-28 12:07:34.747559896 -0800
+++ linux/include/linux/sysctl.h	2006-03-28 12:10:09.184081976 -0800
@@ -148,6 +148,8 @@ enum
 	KERN_SPIN_RETRY=70,	/* int: number of spinlock retries */
 	KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
 	KERN_IA64_UNALIGNED=72, /* int: ia64 unaligned userland trap enable */
+ 	KERN_SCHED_HT_POWER=73,	/* int: schedule for saving ht power */
+ 	KERN_SCHED_CORE_POWER=74,/* int: schedule for saving core power */
 };
 
 
diff -pNru linux-2.6.16-git15/include/linux/topology.h linux/include/linux/topology.h
--- linux-2.6.16-git15/include/linux/topology.h	2006-03-28 12:07:34.749559592 -0800
+++ linux/include/linux/topology.h	2006-03-28 12:09:36.102111200 -0800
@@ -134,7 +134,8 @@
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
-				| SD_WAKE_AFFINE,	\
+				| SD_WAKE_AFFINE	\
+				| BALANCE_FOR_POWER,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\
diff -pNru linux-2.6.16-git15/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.16-git15/kernel/sched.c	2006-03-28 12:07:34.831547128 -0800
+++ linux/kernel/sched.c	2006-03-28 16:31:31.181056064 -0800
@@ -1061,6 +1061,12 @@ static int sched_balance_self(int cpu, i
 	struct task_struct *t = current;
 	struct sched_domain *tmp, *sd = NULL;
 
+	/*
+	 * If any Power savings logic is enabled, we don't sched balance
+	 */
+	if (smt_power_savings || core_power_savings)
+		return cpu;
+
 	for_each_domain(cpu, tmp)
 		if (tmp->flags & flag)
 			sd = tmp;
@@ -1928,9 +1934,9 @@ static struct sched_group *
 find_busiest_group(struct sched_domain *sd, int this_cpu,
 		   unsigned long *imbalance, enum idle_type idle, int *sd_idle)
 {
-	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups, *min_group = NULL, *group_leader = NULL;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
-	unsigned long max_pull;
+	unsigned long max_pull, excess_load, min_load = ULONG_MAX, leader_load = 0, unit_load;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -1977,13 +1983,53 @@ find_busiest_group(struct sched_domain *
 			max_load = avg_load;
 			busiest = group;
 		}
+
+		if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+			goto group_next;
+
+		/* this group is already running at full capacity. Don't get
+		 * involved in any power savings calculations
+		 */
+		if (avg_load >= SCHED_LOAD_SCALE)
+			goto group_next;
+
+		unit_load = (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE) / group->cpu_power;
+		/* Calculate the group which has the least non-idle load
+		 * This is the group from where we need to pick up the load
+		 * for saving power
+		 */
+		if (avg_load >= unit_load) {
+			if ((avg_load < min_load) ||
+			    (avg_load == min_load &&
+			      first_cpu(group->cpumask) < 
+		              first_cpu(min_group->cpumask))) {
+				min_group = group;
+				min_load = avg_load;
+			}
+		}
+
+		/* Calculate the group which is almost near its
+		 * capacity but still has some space to pick up some load
+		 * from other group and save more power
+		 */
+		if (avg_load <= (SCHED_LOAD_SCALE - unit_load))
+			if (avg_load && (avg_load > leader_load || 
+			    (avg_load == leader_load &&
+			     first_cpu(group->cpumask) > 
+			      first_cpu(group_leader->cpumask)))) {
+				group_leader = group;
+				leader_load = avg_load;
+			}
+
+group_next:
 		group = group->next;
 	} while (group != sd->groups);
 
+	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
+
 	if (!busiest || this_load >= max_load || max_load <= SCHED_LOAD_SCALE)
 		goto out_balanced;
 
-	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
 
 	if (this_load >= avg_load ||
 			100*max_load <= sd->imbalance_pct*this_load)
@@ -2002,11 +2048,17 @@ find_busiest_group(struct sched_domain *
 	 */
 
 	/* Don't want to pull so many tasks that a group would go idle */
-	max_pull = min(max_load - avg_load, max_load - SCHED_LOAD_SCALE);
+	excess_load = min(max_load - avg_load, max_load - SCHED_LOAD_SCALE);
+
+	if (this_load < SCHED_LOAD_SCALE)
+		/* pull as many tasks so that this group is fully utilized */
+		max_pull = max(avg_load - this_load, SCHED_LOAD_SCALE - this_load);
+	else
+		max_pull = avg_load - this_load;
 
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = min(max_pull * busiest->cpu_power,
-				(avg_load - this_load) * this->cpu_power)
+	*imbalance = min(excess_load * busiest->cpu_power,
+				max_pull * this->cpu_power)
 			/ SCHED_LOAD_SCALE;
 
 	if (*imbalance < SCHED_LOAD_SCALE) {
@@ -2056,7 +2108,32 @@ find_busiest_group(struct sched_domain *
 	return busiest;
 
 out_balanced:
+	if (!(sd->flags & SD_POWERSAVINGS_BALANCE))
+		goto ret;
 
+	/* when this group is above its capacity or overall the system
+	 * is loaded heavily, we can't do any power savings. We will simply
+	 * return
+	 */
+	if (this_load >= SCHED_LOAD_SCALE || avg_load >= SCHED_LOAD_SCALE)
+		goto ret;
+
+	if (this == group_leader && group_leader != min_group) {
+		/*
+		 * Ideally this should be
+		 * (SCHED_LOAD_SCALE - leader_load) * group_leader->cpu_power
+		 * 			/ (SCHED_LOAD_SCALE * SCHED_LOAD_SCALE);
+		 * Anyhow, we need to rely on active load balance when the
+		 * system is lightly loaded. And active load balance
+		 * moves one at a time.
+		 *
+		 * we should be Ok with 1.
+		 */
+		*imbalance = 1;
+		return min_group;
+	}
+
+ret:
 	*imbalance = 0;
 	return NULL;
 }
@@ -5561,6 +5638,8 @@ static cpumask_t sched_domain_node_span(
 }
 #endif
 
+
+int smt_power_savings = 0, core_power_savings = 0;
 /*
  * At the moment, CONFIG_SCHED_SMT is never defined, but leave it in so we
  * can switch it on easily if needed.
@@ -5861,7 +5940,7 @@ void build_sched_domains(const cpumask_t
 	/* Calculate CPU power for physical packages and nodes */
 	for_each_cpu_mask(i, *cpu_map) {
 		int power;
-		struct sched_domain *sd;
+		struct sched_domain *sd, *sd1;
 #ifdef CONFIG_SCHED_SMT
 		sd = &per_cpu(cpu_domains, i);
 		power = SCHED_LOAD_SCALE;
@@ -5869,26 +5948,53 @@ void build_sched_domains(const cpumask_t
 #endif
 #ifdef CONFIG_SCHED_MC
 		sd = &per_cpu(core_domains, i);
-		power = SCHED_LOAD_SCALE + (cpus_weight(sd->groups->cpumask)-1)
+		if (smt_power_savings)
+			power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
+		else
+			power = SCHED_LOAD_SCALE + (cpus_weight(sd->groups->cpumask)-1)
 					    * SCHED_LOAD_SCALE / 10;
 		sd->groups->cpu_power = power;
 
 		sd = &per_cpu(phys_domains, i);
+		if (i != first_cpu(sd->groups->cpumask))
+			continue;
 
- 		/*
- 		 * This has to be < 2 * SCHED_LOAD_SCALE
- 		 * Lets keep it SCHED_LOAD_SCALE, so that
- 		 * while calculating NUMA group's cpu_power
- 		 * we can simply do
- 		 *  numa_group->cpu_power += phys_group->cpu_power;
- 		 *
- 		 * See "only add power once for each physical pkg"
- 		 * comment below
- 		 */
- 		sd->groups->cpu_power = SCHED_LOAD_SCALE;
+		sd->groups->cpu_power = 0;
+		if (core_power_savings || smt_power_savings) {
+			int j;
+
+ 			for_each_cpu_mask(j, sd->groups->cpumask) {
+ 				sd1 = &per_cpu(core_domains, j);
+ 				/*
+ 			 	 * for each core we will add once
+ 				 * to the group in physical domain
+ 			 	 */
+  	 			if (j != first_cpu(sd1->groups->cpumask))
+ 					continue;
+ 
+ 				if (smt_power_savings)
+   					sd->groups->cpu_power += SCHED_LOAD_SCALE * cpus_weight(sd1->groups->cpumask);
+ 				else
+   					sd->groups->cpu_power += SCHED_LOAD_SCALE;
+   			}
+ 		} else 
+ 			/*
+ 			 * This has to be < 2 * SCHED_LOAD_SCALE
+ 			 * Lets keep it SCHED_LOAD_SCALE, so that
+ 			 * while calculating NUMA group's cpu_power
+ 			 * we can simply do
+ 			 *  numa_group->cpu_power += phys_group->cpu_power;
+ 			 *
+ 			 * See "only add power once for each physical pkg"
+ 			 * comment below
+ 			 */
+ 			sd->groups->cpu_power = SCHED_LOAD_SCALE;
 #else
 		sd = &per_cpu(phys_domains, i);
-		power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
+		if (smt_power_savings)
+			power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
+		else
+			power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
 				(cpus_weight(sd->groups->cpumask)-1) / 10;
 		sd->groups->cpu_power = power;
 #endif
@@ -6017,6 +6123,21 @@ void partition_sched_domains(cpumask_t *
 		build_sched_domains(partition2);
 }
 
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+int sched_sysctl_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec(table, write, file, buffer, length, ppos);
+	if (write) {
+		lock_cpu_hotplug();
+		detach_destroy_domains(&cpu_online_map);
+		arch_init_sched_domains(&cpu_online_map);
+		unlock_cpu_hotplug();
+	}
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * Force a reinitialization of the sched domains hierarchy.  The domains
diff -pNru linux-2.6.16-git15/kernel/sysctl.c linux/kernel/sysctl.c
--- linux-2.6.16-git15/kernel/sysctl.c	2006-03-28 12:07:34.837546216 -0800
+++ linux/kernel/sysctl.c	2006-03-28 16:32:53.354563792 -0800
@@ -683,6 +683,26 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SCHED_SMT
+	{
+		.ctl_name	= KERN_SCHED_HT_POWER,
+		.procname	= "sched_smt_power_savings",
+		.data		= &smt_power_savings,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &sched_sysctl_handler,
+	},
+#endif
+#ifdef CONFIG_SCHED_MC
+	{
+		.ctl_name	= KERN_SCHED_CORE_POWER,
+		.procname	= "sched_core_power_savings",
+		.data		= &core_power_savings,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &sched_sysctl_handler,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-29  2:52     ` Siddha, Suresh B
@ 2006-03-29  3:42       ` Peter Williams
  2006-03-29 22:52         ` Siddha, Suresh B
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Williams @ 2006-03-29  3:42 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Wed, Mar 29, 2006 at 09:44:49AM +1100, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> We need to balance even if the other packages are not idle.. For example,
>>> consider a 4-core DP system, if we have 6 runnable(assume same priority)
>>> processes, we want to schedule 3 of them in each package..
>> Well I hope that when you do a proper implementation for this issue that 
>> it takes this into account.  The current implementation doesn't.
> 
> This will also have issues when we want to implement power savings policy
> for multi-core. Attached is the prototype patch(against 2.6.16-git15)
> I was planning to send to mainline..
> 
>>> Todays active load balance implementation is very simple and generic. And
>>> hence it works smoothly with dual and multi-core..
>> The application of active balancing to address your problem in the 
>> current implementation is essentially random.
> 
> why so? we wanted to implement these HT and MC optimizations generically
> in the scheduler and domain topology(and sched groups cpu_power) provided
> that infrastructure cleanly..

I meant that it doesn't explicitly address your problem.  What it does 
is ASSUME that failure of load balancing to move tasks is because there 
was exactly one task on the source run queue and that this makes it a 
suitable candidate to have that single task moved elsewhere in the blind 
hope that it may fix an HT/MC imbalance that may or may not exist.  In 
my mind this is very close to random.  Also back to front and inefficient.

> 
>>> Please read my OLS 
>>> 2005 paper which talks about different scheduling scenarios and also how 
>> A URL would be handy.
> 
> http://www.linuxsymposium.org/2005/linuxsymposium_procv2.pdf
> Look for the paper titled "Chip Multi Processing aware Linux Kernel Scheduler"
> 
>>> Either way, I can show scheduling scenarios which will fail...
>> I'd be interested to see the ones that would fail with the corrected 
>> code.  
> 
> 4-way system with HT (8 logical processors) ... 
> 
> Package-P0 T0 has a highest priority task, T1 is idle
> Package-P1 Both T0 and T1 have 1 normal priority task each..
> P2 and P3 are idle.
> 
> Scheduler needs to move one of the normal priority tasks to P2 or P3.. 
> But find_busiest_group() will always think P0 as the busy group and
> will not distribute the load as expected..

This is a variation of the problem that active_load_balance() is being 
used as a random fix for.  I.e. it's HT/MC specific and should 
eventually be fixed in HT/MC specific code.  As I've said several times 
the mechanism for triggering active_load_balance() to try and achieve 
your aims is essentially random and it's a matter of luck whether it 
works or not (even without the smpnice patches).  The smpnice patches 
probably reduce the odds that it will work but that's not their problem.  The 
unique load balancing needs of HT/MC need to be handled more 
deterministically.

> 
> I am giving so many examples that I am confused at the end of the day, which
> examples are fixed and which are not by your patches :)
> So please send the latest smpnice patch, which you think is clean and fixes 
> all the issues (look at all my examples and also the ones mentioned in the 
> OLS paper...)

I'm in the process of doing that.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-29  3:42       ` Peter Williams
@ 2006-03-29 22:52         ` Siddha, Suresh B
  2006-03-29 23:40           ` Peter Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-03-29 22:52 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Wed, Mar 29, 2006 at 02:42:45PM +1100, Peter Williams wrote:
> I meant that it doesn't explicitly address your problem.  What it does 
> is ASSUME that failure of load balancing to move tasks is because there 
> was exactly one task on the source run queue and that this makes it a 
> suitable candidate to have that single task moved elsewhere in the blind 
> hope that it may fix an HT/MC imbalance that may or may not exist.  In 
> my mind this is very close to random.  

That so-called assumption happens only when load balancing has
failed more than the domain specific cache_nice_tries times. The only
reasons it can fail that many times are that all the tasks are pinned or that
only a single task is running on that particular CPU. The load balancing code
takes care of both these scenarios..

sched groups cpu_power controls the mechanism of implementing HT/MC
optimizations in addition to active balance code... There is no randomness
in this.


> Also back to front and inefficient.

HT/MC imbalance is detected in a normal way.. A lightly loaded group
finds an imbalance and tries to pull some load from a busy group (which
is in line with normal load balance)... pull fails because the only task
on that cpu is busy running and needs to go off the cpu (which is triggered
by active load balance)... Scheduler load balance is generally done by a 
pull mechanism and here (HT/MC) it is still a pull mechanism (triggering a 
final push only because of the single running task) 

If you have any better generic and simple method, please let us know.

thanks,
suresh


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-29 22:52         ` Siddha, Suresh B
@ 2006-03-29 23:40           ` Peter Williams
  2006-03-30  0:50             ` Siddha, Suresh B
  2006-04-03  1:04             ` [PATCH] sched: smpnice work around for active_load_balance() Peter Williams
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Williams @ 2006-03-29 23:40 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Wed, Mar 29, 2006 at 02:42:45PM +1100, Peter Williams wrote:
>> I meant that it doesn't explicitly address your problem.  What it does 
>> is ASSUME that failure of load balancing to move tasks is because there 
>> was exactly one task on the source run queue and that this makes it a 
>> suitable candidate to have that single task moved elsewhere in the blind 
>> hope that it may fix an HT/MC imbalance that may or may not exist.  In 
>> my mind this is very close to random.  
> 
> That so called assumption happens only when load balancing has
> failed for more than the domain specific cache_nice_tries. Only reason
> why it can fail so many times is because of all pinned tasks or only a single
> task is running on that particular CPU. load balancing code takes care of both
> these scenarios..
> 
> sched groups cpu_power controls the mechanism of implementing HT/MC
> optimizations in addition to active balance code... There is no randomness
> in this.

The above explanation just increases my belief in the randomness of this 
solution.  This code is mostly done without locks and is therefore very 
racy, and any assumptions made based on the number of times load 
balancing has failed etc. are highly speculative.

And even if there is only one task on the CPU there's no guarantee that
that CPU is in a package that meets the other requirements to make the 
move desirable.  So there's a good probability that you'll be moving 
tasks unnecessarily.

It's a poor solution and it's being inflicted on architectures that 
don't need it.  Even if cache_nice_tries is used to suppress this 
behaviour on architectures that don't need it they have to carry the 
code in their kernel.

> 
> 
>> Also back to front and inefficient.
> 
> HT/MC imbalance is detected in a normal way.. A lightly loaded group
> finds an imbalance and tries to pull some load from a busy group (which
> is inline with normal load balance)... pull fails because the only task
> on that cpu is busy running and needs to go off the cpu (which is triggered
> by active load balance)... Scheduler load balance is generally done by a 
> pull mechansim and here (HT/MC) it is still a pull mechanism(triggering a 
> final push only because of the single running task) 
> 
> If you have any better generic and simple method, please let us know.

I gave an example in a previous e-mail.  Basically, at the end of 
scheduler_tick(), if rebalance_tick() doesn't move any tasks (it would be 
foolish to contemplate moving tasks off the queue just after you've moved 
some there) and the run queue has exactly one running task and it's time 
for an HT/MC rebalance check on the package that this run queue belongs 
to, then check that package to see if it meets the rest of the criteria 
for needing to lose some tasks.  If it does, look for a package that is a 
suitable recipient for the moved task and, if you find one, mark this 
run queue as needing active load balancing and arrange for its migration 
thread to be started.

Simple, direct and amenable to being only built on architectures that 
need the functionality.

Another (more complex) solution that would also allow improvements to 
other HT related code (e.g. the sleeping dependent code) would be to 
modify the load balancing code so that all CPUs in a package share a run 
queue and load balancing is then done between packages.  As long as the 
number of CPUs in a package is small this shouldn't have scalability 
issues.  The big disadvantage of this approach is its complexity which 
is probably too great to contemplate doing it in 2.6.X kernels.


Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-29 23:40           ` Peter Williams
@ 2006-03-30  0:50             ` Siddha, Suresh B
  2006-03-30  1:14               ` Peter Williams
  2006-04-03  1:04             ` [PATCH] sched: smpnice work around for active_load_balance() Peter Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-03-30  0:50 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Thu, Mar 30, 2006 at 10:40:24AM +1100, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > On Wed, Mar 29, 2006 at 02:42:45PM +1100, Peter Williams wrote:
> >> I meant that it doesn't explicitly address your problem.  What it does 
> >> is ASSUME that failure of load balancing to move tasks is because there 
> >> was exactly one task on the source run queue and that this makes it a 
> >> suitable candidate to have that single task moved elsewhere in the blind 
> >> hope that it may fix an HT/MC imbalance that may or may not exist.  In 
> >> my mind this is very close to random.  
> > 
> > That so called assumption happens only when load balancing has
> > failed for more than the domain specific cache_nice_tries. Only reason
> > why it can fail so many times is because of all pinned tasks or only a single
> > task is running on that particular CPU. load balancing code takes care of both
> > these scenarios..
> > 
> > sched groups cpu_power controls the mechanism of implementing HT/MC
> > optimizations in addition to active balance code... There is no randomness
> > in this.
> 
> The above explanation just increases my belief in the randomness of this 
> solution.  This code is mostly done without locks and is therefore very 
> racy and any assumptions made based on the number of times load 
> balancing has failed etc. are highly speculative.

Isn't it the same case with regular cpu load calculations during load
balance?

> And even if there is only one task on the CPU there's no guarantee that
> that CPU is in a package that meets the other requirements to make the 
> move desirable.  So there's a good probability that you'll be moving 
> tasks unnecessarily.

sched groups cpu_power and domain topology information cleanly
encapsulates the imbalance identification and source/destination groups
to fix the imbalance.

> It's a poor solution and it's being inflicted on architectures that 
> don't need it.  Even if cache_nice_tries is used to suppress this 
> behaviour on architectures that don't need it they have to carry the 
> code in their kernel.

We can clearly throw CONFIG_SCHED_MC/SMT around that code.. Nick/Ingo
do you see any issue?

> 
> > 
> > 
> >> Also back to front and inefficient.
> > 
> > HT/MC imbalance is detected in a normal way.. A lightly loaded group
> > finds an imbalance and tries to pull some load from a busy group (which
> > is inline with normal load balance)... pull fails because the only task
> > on that cpu is busy running and needs to go off the cpu (which is triggered
> > by active load balance)... Scheduler load balance is generally done by a 
> > pull mechansim and here (HT/MC) it is still a pull mechanism(triggering a 
> > final push only because of the single running task) 
> > 
> > If you have any better generic and simple method, please let us know.
> 
> I gave an example in a previous e-mail.  Basically, at the end of 
> scheduler_tick() if rebalance_tick() doesn't move any tasks (it would be 
> foolish to contemplate moving tasks of the queue just after you've moved 
> some there) and the run queue has exactly one running task and it's time 
> for a HT/MC rebalance check on the package that this run queue belongs 
> to then check that package to to see if it meets the rest of criteria 
> for needing to lose some tasks.  If it does look for a package that is a 
> suitable recipient for the moved task and if you find one then mark this 
> run queue as needing active load balancing and arrange for its migration 
> thread to be started.
> 
> Simple, direct and amenable to being only built on architectures that 
> need the functionality.

First of all we will be doing unnecessary checks to see if there is
an imbalance.. Current code triggers the checks and movement only when
it is necessary.. And second, finding the correct destination cpu in the 
presence of SMT and MC is really complicated.. Look at different examples
in the OLS paper.. Domain topology provides all this info with no added
complexity...

> Another (more complex) solution that would also allow improvements to 
> other HT related code (e.g. the sleeping dependent code) would be to 
> modify the load balancing code so that all CPUs in a package share a run 
> queue and load balancing is then done between packages.  As long as the 
> number of CPUs in a package is small this shouldn't have scalability 
> issues.  The big disadvantage of this approach is its complexity which 
> is probably too great to contemplate doing it in 2.6.X kernels.

In the presence of SMT and MC, implementation of a power-savings scheduler
policy will present more challenges...

thanks,
suresh


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-30  0:50             ` Siddha, Suresh B
@ 2006-03-30  1:14               ` Peter Williams
  2006-04-02  4:48                 ` smpnice loadbalancing with high priority tasks Siddha, Suresh B
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Williams @ 2006-03-30  1:14 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Thu, Mar 30, 2006 at 10:40:24AM +1100, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> On Wed, Mar 29, 2006 at 02:42:45PM +1100, Peter Williams wrote:
>>>> I meant that it doesn't explicitly address your problem.  What it does 
>>>> is ASSUME that failure of load balancing to move tasks is because there 
>>>> was exactly one task on the source run queue and that this makes it a 
>>>> suitable candidate to have that single task moved elsewhere in the blind 
>>>> hope that it may fix an HT/MC imbalance that may or may not exist.  In 
>>>> my mind this is very close to random.  
>>> That so called assumption happens only when load balancing has
>>> failed for more than the domain specific cache_nice_tries. Only reason
>>> why it can fail so many times is because of all pinned tasks or only a single
>>> task is running on that particular CPU. load balancing code takes care of both
>>> these scenarios..
>>>
>>> sched groups cpu_power controls the mechanism of implementing HT/MC
>>> optimizations in addition to active balance code... There is no randomness
>>> in this.
>> The above explanation just increases my belief in the randomness of this 
>> solution.  This code is mostly done without locks and is therefore very 
>> racy and any assumptions made based on the number of times load 
>> balancing has failed etc. are highly speculative.
> 
> Isn't it the same case with regular cpu load calculations during load
> balance?

Yes.  Which is why move_tasks() is designed to cope.

But this doesn't affect the argument w.r.t. your code.

> 
>> And even if there is only one task on the CPU there's no guarantee that
>> that CPU is in a package that meets the other requirements to make the 
>> move desirable.  So there's a good probability that you'll be moving 
>> tasks unnecessarily.
> 
> sched groups cpu_power and domain topology information cleanly
> encapsulates the imbalance identification and source/destination groups
> to fix the imbalance.

But you don't look at the rest of the queues in the package to see if 
the move is REALLY required.

> 
>> It's a poor solution and it's being inflicted on architectures that 
>> don't need it.  Even if cache_nice_tries is used to suppress this 
>> behaviour on architectures that don't need it they have to carry the 
>> code in their kernel.
> 
> We can clearly throw CONFIG_SCHED_MC/SMT around that code.. Nick/Ingo
> do you see any issue?

That just makes it a poor solution and ugly. :-)

> 
>>>
>>>> Also back to front and inefficient.
>>> HT/MC imbalance is detected in a normal way.. A lightly loaded group
>>> finds an imbalance and tries to pull some load from a busy group (which
>>> is inline with normal load balance)... pull fails because the only task
>>> on that cpu is busy running and needs to go off the cpu (which is triggered
>>> by active load balance)... Scheduler load balance is generally done by a 
>>> pull mechansim and here (HT/MC) it is still a pull mechanism(triggering a 
>>> final push only because of the single running task) 
>>>
>>> If you have any better generic and simple method, please let us know.
>> I gave an example in a previous e-mail.  Basically, at the end of 
>> scheduler_tick() if rebalance_tick() doesn't move any tasks (it would be 
>> foolish to contemplate moving tasks of the queue just after you've moved 
>> some there) and the run queue has exactly one running task and it's time 
>> for a HT/MC rebalance check on the package that this run queue belongs 
>> to then check that package to to see if it meets the rest of criteria 
>> for needing to lose some tasks.  If it does look for a package that is a 
>> suitable recipient for the moved task and if you find one then mark this 
>> run queue as needing active load balancing and arrange for its migration 
>> thread to be started.
>>
>> Simple, direct and amenable to being only built on architectures that 
>> need the functionality.
> 
> First of all we will be doing unnecessary checks to see if there is
> an imbalance.. Current code triggers the checks and movement only when
> it is necessary.. And second, finding the correct destination cpu in the 
> presence of SMT and MC is really complicated.. Look at different examples
> in the OLS paper.. Domain topology provides all this info with no added
> complexity...
> 
>> Another (more complex) solution that would also allow improvements to 
>> other HT related code (e.g. the sleeping dependent code) would be to 
>> modify the load balancing code so that all CPUs in a package share a run 
>> queue and load balancing is then done between packages.  As long as the 
>> number of CPUs in a package is small this shouldn't have scalability 
>> issues.  The big disadvantage of this approach is its complexity which 
>> is probably too great to contemplate doing it in 2.6.X kernels.
> 
> Presence of SMT and MC, implementation of power-savings scheduler
> policy will present more challenges...

And I would recommend a similar approach to what I've suggested above. 
They could probably be combined into a single neat well encapsulated 
solution.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* smpnice loadbalancing with high priority tasks
  2006-03-30  1:14               ` Peter Williams
@ 2006-04-02  4:48                 ` Siddha, Suresh B
  2006-04-02  7:08                   ` Peter Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-02  4:48 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

Peter,

There are still issues which we need to address.. These are surfacing
as we are patching issue by issue (instead of addressing the root issue,
which is: the presence of high priority tasks messes up load balancing of
normal priority tasks..)

for example

a) on a simple 4-way MP system, if we have one high priority and 4 normal
priority tasks, with smpnice we would like to see the high priority task
scheduled on one cpu, two other cpus getting one normal task each and the
fourth cpu getting the remaining two normal tasks. but with smpnice that 
extra normal priority task keeps jumping from one cpu to another cpu that is
running a normal priority task.

This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We
are not including the cpu with the high priority task in max_load calculations
but including it in the total and avg_load calculations.. leading to max_load <
avg_load, and load balance between the cpus running normal priority tasks
(2 vs 1) will always show an imbalance, so the extra normal priority task
will keep moving from one cpu to another cpu having a normal priority task..

b) on a simple DP system, if we have two high priority and two normal priority
tasks, ideally we should schedule one high and one normal priority task on 
each cpu.. current code doesn't find an imbalance if both the normal priority
tasks get scheduled on the same cpu (running one high priority task)

there may not be benchmarks which expose these conditions.. but I think
we haven't addressed the corner case conditions well enough..

thanks,
suresh


* Re: smpnice loadbalancing with high priority tasks
  2006-04-02  4:48                 ` smpnice loadbalancing with high priority tasks Siddha, Suresh B
@ 2006-04-02  7:08                   ` Peter Williams
  2006-04-04  0:24                     ` Siddha, Suresh B
  2006-04-20  1:24                     ` [patch] smpnice: don't consider sched groups which are lightly loaded for balancing Siddha, Suresh B
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-02  7:08 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> Peter,
> 
> There are still issues which we need to address.. These are surfacing
> as we are patching issue by issue(instead of addressing the root issue, which
> is: presence of high priority tasks messes up load balancing of normal
> priority tasks..)
> 
> for example
> 
> a) on a simple 4-way MP system, if we have one high priority and 4 normal
> priority tasks, with smpnice we would like to see the high priority task
> scheduled on one cpu, two other cpus getting one normal task each and the
> fourth cpu getting the remaining two normal tasks. but with smpnice that 
> extra normal priority task keeps jumping from one cpu to another cpu having
> the normal priority task.
> 
> This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We
> are not including the cpu with high priority task in max_load calculations
> but including that in total and avg_load calcuations.. leading to max_load <
> avg_load and load balance between cpus running normal priority tasks(2 Vs 1)
> will always show imbalanace as one normal priority and the extra normal
> priority task will keep moving from one cpu to another cpu having
> normal priority task..

I can't see anything like this in the code.  Can you send a patch to fix 
what you think the problem is?

The effect you describe can be caused by other tasks running on the 
system (see below for fuller description).

> 
> b) on a simple DP system, if we have two high priority and two normal priority
> tasks, ideally we should schedule one high and one normal priority task on 
> each cpu.. current code doesn't find an imbalance if both the normal priority
> tasks gets scheduled on the same cpu(running one high priority task)

This is one of my standard tests and it works for me.  The only time the 
two normal priority tasks end up on the same CPU during my tests is when 
some other normal priority tasks (e.g. top, X.org) happen to be running 
when load balancing occurs.  This causes an imbalance and tasks that 
aren't actually on the CPU get moved to fix the imbalance.  This is 
usually the test tasks as (because they are hard spinners) they have a 
smaller interactive bonus than the other tasks and get preempted as a 
result.

> 
> there may not be benchmarks which expose these conditions.. but I think
> we haven't addressed the corner case conditions well enough..

Both of the problems you describe above are probably caused by the fact 
that there are other tasks (than those in your tests) running on your 
system and if they happen to be on a run queue at the time load 
balancing is done they will cause an imbalance to be detected that is 
different to what you expect based on a simplistic view of the world 
that only considers the test tasks.  When this happens tasks will get 
moved to restore balance.  This (in my opinion) is what you are seeing.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-03-29 23:40           ` Peter Williams
  2006-03-30  0:50             ` Siddha, Suresh B
@ 2006-04-03  1:04             ` Peter Williams
  2006-04-03 16:57               ` Siddha, Suresh B
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Williams @ 2006-04-03  1:04 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Peter Williams wrote:
> Siddha, Suresh B wrote:
>> HT/MC imbalance is detected in a normal way.. A lightly loaded group
>> finds an imbalance and tries to pull some load from a busy group (which
>> is inline with normal load balance)... pull fails because the only task
>> on that cpu is busy running and needs to go off the cpu (which is 
>> triggered
>> by active load balance)... Scheduler load balance is generally done by 
>> a pull mechansim and here (HT/MC) it is still a pull 
>> mechanism(triggering a final push only because of the single running 
>> task)
>> If you have any better generic and simple method, please let us know.
> 
> I gave an example in a previous e-mail.  Basically, at the end of 
> scheduler_tick() if rebalance_tick() doesn't move any tasks (it would be 
> foolish to contemplate moving tasks of the queue just after you've moved 
> some there) and the run queue has exactly one running task and it's time 
> for a HT/MC rebalance check on the package that this run queue belongs 
> to then check that package to to see if it meets the rest of criteria 
> for needing to lose some tasks.  If it does look for a package that is a 
> suitable recipient for the moved task and if you find one then mark this 
> run queue as needing active load balancing and arrange for its migration 
> thread to be started.
> 
> Simple, direct and amenable to being only built on architectures that 
> need the functionality.

Are you working on this idea or should I do it?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-04-03  1:04             ` [PATCH] sched: smpnice work around for active_load_balance() Peter Williams
@ 2006-04-03 16:57               ` Siddha, Suresh B
  2006-04-03 23:11                 ` Peter Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-03 16:57 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Mon, Apr 03, 2006 at 11:04:52AM +1000, Peter Williams wrote:
> Peter Williams wrote:
> > I gave an example in a previous e-mail.  Basically, at the end of 
> > scheduler_tick() if rebalance_tick() doesn't move any tasks (it would be 
> > foolish to contemplate moving tasks of the queue just after you've moved 
> > some there) and the run queue has exactly one running task and it's time 
> > for a HT/MC rebalance check on the package that this run queue belongs 
> > to then check that package to to see if it meets the rest of criteria 
> > for needing to lose some tasks.  If it does look for a package that is a 
> > suitable recipient for the moved task and if you find one then mark this 
> > run queue as needing active load balancing and arrange for its migration 
> > thread to be started.
> > 
> > Simple, direct and amenable to being only built on architectures that 
> > need the functionality.
> 
> Are you working on this idea or should I do it?

my issues raised in response to this idea are unanswered.

<issues>
First of all we will be doing unnecessary checks to see if there is
an imbalance.. Current code triggers the checks and movement only when
it is necessary.. And second, finding the correct destination cpu in the 
presence of SMT and MC is really complicated.. Look at different examples
in the OLS paper.. Domain topology provides all this info with no added
complexity...
</issues>

I don't see a merit and so I am not looking into this.

thanks,
suresh


* Re: [PATCH] sched: smpnice work around for active_load_balance()
  2006-04-03 16:57               ` Siddha, Suresh B
@ 2006-04-03 23:11                 ` Peter Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-03 23:11 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Mon, Apr 03, 2006 at 11:04:52AM +1000, Peter Williams wrote:
>> Peter Williams wrote:
>>> I gave an example in a previous e-mail.  Basically, at the end of 
>>> scheduler_tick() if rebalance_tick() doesn't move any tasks (it would be 
> > > foolish to contemplate moving tasks off the queue just after you've moved 
>>> some there) and the run queue has exactly one running task and it's time 
>>> for a HT/MC rebalance check on the package that this run queue belongs 
> > > to then check that package to see if it meets the rest of the criteria 
>>> for needing to lose some tasks.  If it does look for a package that is a 
>>> suitable recipient for the moved task and if you find one then mark this 
>>> run queue as needing active load balancing and arrange for its migration 
>>> thread to be started.
>>>
>>> Simple, direct and amenable to being only built on architectures that 
>>> need the functionality.
>> Are you working on this idea or should I do it?
> 
> my issues raised in response to this idea are unanswered.
> 
> <issues>
> First of all we will be doing unnecessary checks to see if there is
> an imbalance..

The overhead saved when the current code is removed can be used to pay 
for this.  Also, if you examine the idea you'll find that it would be 
very cheap, and the possibilities for early abandonment of the checks 
would make them very efficient.

> Current code triggers the checks and movement only when
> it is necessary..

That is untrue.  The best you can say is that they are triggered when they MIGHT be necessary.

> And second, finding the correct destination cpu in the 
> presence of SMT and MC is really complicated.. Look at different examples
> in the OLS paper.. Domain topology provides all this info with no added
> complexity...

This to me would seem to be an argument in favour of change rather than 
an argument for retaining the current highly random process.

Finding the appropriate destination package/queue can be left to 
active_load_balance() and this would reduce the impact of the inherent 
raciness of load balancing.

> </issues>
> 
> I don't see a merit and so I am not looking into this.

OK.  I'll do it.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: smpnice loadbalancing with high priority tasks
  2006-04-02  7:08                   ` Peter Williams
@ 2006-04-04  0:24                     ` Siddha, Suresh B
  2006-04-04  1:22                       ` Peter Williams
  2006-04-20  1:24                     ` [patch] smpnice: don't consider sched groups which are lightly loaded for balancing Siddha, Suresh B
  1 sibling, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-04  0:24 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Sun, Apr 02, 2006 at 05:08:33PM +1000, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > Peter,
> > 
> > There are still issues which we need to address.. These are surfacing
> > as we are patching issue by issue(instead of addressing the root issue, which
> > is: presence of high priority tasks messes up load balancing of normal
> > priority tasks..)
> > 
> > for example
> > 
> > a) on a simple 4-way MP system, if we have one high priority and 4 normal
> > priority tasks, with smpnice we would like to see the high priority task
> > scheduled on one cpu, two other cpus getting one normal task each and the
> > fourth cpu getting the remaining two normal tasks. but with smpnice that 
> > extra normal priority task keeps jumping from one cpu to another cpu having
> > the normal priority task.
> > 
> > This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We
> > are not including the cpu with high priority task in max_load calculations
> > but including that in total and avg_load calculations.. leading to max_load <
> > avg_load and load balance between cpus running normal priority tasks(2 Vs 1)
> > will always show imbalance as one normal priority and the extra normal
> > priority task will keep moving from one cpu to another cpu having
> > normal priority task..
> 
> I can't see anything like this in the code.  

Don't you see a condition where max_load < avg_load (as mentioned in the
above example)?  In this case the code ignores avg_load, the imbalance
will always be the extra normal priority task (coming from
"max_load - busiest_load_per_task"), and this normal priority task keeps 
hopping from one cpu to another cpu having a normal priority task..
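
A rough numeric sketch of the condition being described (the weights
below are invented purely for illustration; the kernel's smpnice
weights are different, and the max_load exclusion is simplified):

```python
# 4-way MP sketch: cpu0 runs one high priority task (weight 5), cpu1
# runs two normal tasks (weight 1 each), cpu2 and cpu3 one normal each.
# All weights are made up for illustration.
loads = {"cpu0": 5, "cpu1": 2, "cpu2": 1, "cpu3": 1}

avg_load = sum(loads.values()) / len(loads)   # 9 / 4 = 2.25

# The busiest_has_loaded_cpus logic skips cpu0 (a lone high priority
# task) when computing max_load, but cpu0 still counts toward avg_load:
max_load = max(v for cpu, v in loads.items() if cpu != "cpu0")  # 2

# max_load < avg_load, so the code falls into the small-imbalance path
# and the imbalance becomes "max_load - busiest_load_per_task", i.e.
# the extra normal task, which keeps bouncing between cpu1/cpu2/cpu3.
print(max_load < avg_load)
```

This only illustrates how excluding one cpu from max_load but not from
avg_load can leave max_load below the average.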

> Can you send a patch to fix 
> what you think the problem is?

I am looking at ways of fixing all these issues cleanly... I don't have
a clean solution yet...

> 
> The effect you describe can be caused by other tasks running on the 
> system (see below for fuller description).
> 
> > 
> > b) on a simple DP system, if we have two high priority and two normal priority
> > tasks, ideally we should schedule one high and one normal priority task on 
> > each cpu.. current code doesn't find an imbalance if both the normal priority
> > tasks gets scheduled on the same cpu(running one high priority task)

if the system goes into this state (whatever the reason may be...), how can
the current code detect and resolve this imbalance... (reviewing the
code shows that the imbalance in this particular situation can't be detected...)


> This is one of my standard tests and it works for me.  The only time the 

Force your test into the above scenario and see how it behaves... if you
simply start new processes, then exec balance will nicely balance...
things go bad only when we go into the state described above..

I can give numerous other examples which fail

c) DP system: if the cpu-0 has two high priority tasks and cpu-1 has one normal
priority task, how can the current code detect this imbalance..

d) 4-way MP system: if the cpu-0 has two high priority tasks, cpu-1 has
one high priority and four normal priority and cpu-2,3 each has one
high priority task.. how does the current code distribute the normal
priority tasks among cpu-1,2,3... (in this case, max_load will always
point to cpu-0 and will never distribute the normal priority tasks...)

by giving these examples I am just pointing out various corner conditions..

> This (in my opinion) is what you are seeing.

Simple review of the current code is showing all these problems... when you
start new tasks, all will be well because of fork/exec balance... but
during run time, if the system goes to a particular state, current imbalance
code behaves poorly in the presence of high priority tasks...

thanks,
suresh


* Re: smpnice loadbalancing with high priority tasks
  2006-04-04  0:24                     ` Siddha, Suresh B
@ 2006-04-04  1:22                       ` Peter Williams
  2006-04-04  1:34                         ` Peter Williams
  2006-04-04  2:11                         ` Siddha, Suresh B
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-04  1:22 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Sun, Apr 02, 2006 at 05:08:33PM +1000, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> Peter,
>>>
>>> There are still issues which we need to address.. These are surfacing
>>> as we are patching issue by issue(instead of addressing the root issue, which
>>> is: presence of high priority tasks messes up load balancing of normal
>>> priority tasks..)
>>>
>>> for example
>>>
>>> a) on a simple 4-way MP system, if we have one high priority and 4 normal
>>> priority tasks, with smpnice we would like to see the high priority task
>>> scheduled on one cpu, two other cpus getting one normal task each and the
>>> fourth cpu getting the remaining two normal tasks. but with smpnice that 
>>> extra normal priority task keeps jumping from one cpu to another cpu having
>>> the normal priority task.
>>>
>>> This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We
>>> are not including the cpu with high priority task in max_load calculations
>>> but including that in total and avg_load calculations.. leading to max_load <
>>> avg_load and load balance between cpus running normal priority tasks(2 Vs 1)
>>> will always show imbalance as one normal priority and the extra normal
>>> priority task will keep moving from one cpu to another cpu having
>>> normal priority task..
>> I can't see anything like this in the code.  
> 
> Don't you see a condition where max_load < avg_load(as mentioned in the
> above example) and in this case, code ignores avg_load and imbalance
> will always be the extra normal priority task (coming from
> "max_load - busiest_load_per_task") and this normal priority task keeps 
> hopping from one cpu to another cpu having normal priority task..
> 
>> Can you send a patch to fix 
>> what you think the problem is?
> 
> I am looking at ways of fixing all these issues cleanly... I don't have
> a clean solution yet...

OK.  I think this means some fiddling with avg_load may be necessary in 
some cases but this will be complex.  I'm not really happy about making 
this code more complex until some of the current unnecessary complexity 
is removed.  I.e. until a proper solution to the problem of triggering 
active_load_balance() is implemented.

> 
>> The effect you describe can be caused by other tasks running on the 
>> system (see below for fuller description).
>>
>>> b) on a simple DP system, if we have two high priority and two normal priority
>>> tasks, ideally we should schedule one high and one normal priority task on 
>>> each cpu.. current code doesn't find an imbalance if both the normal priority
>>> tasks gets scheduled on the same cpu(running one high priority task)
> 
> if the system goes into this state(whatever the reason may be...), how can
> the current code detect this and resolve this imbalance... (by reviewing the
> code, it shows that imbalance in this particular situation can't be detected...)
> 
> 
>> This is one of my standard tests and it works for me.  The only time the 
> 
> Force your test into the above scenario and see how it behaves... if you
> simply start new processes, then exec balance will nicely balance...
> things go bad only when we go into the described above state..
> 
> I can give numerous other examples which fail
> 
> c) DP system: if the cpu-0 has two high priority and cpu-1 has one normal
> priority task, how can the current code detect this imbalance..

How would it not?

> 
> d) 4-way MP system: if the cpu-0 has two high priority tasks, cpu-1 has
> one high priority and four normal priority and cpu-2,3 each has one
> high priority task.. how does the current code distribute the normal
> priority tasks among cpu-1,2,3... (in this case, max_load will always
> point to cpu-0 and will never distribute the normal priority tasks...)

This should cause cpu-0 to lose one of its tasks, creating a new state, 
and load balancing will then be looking at a different situation to try 
and balance.

> 
> by giving these examples I am just pointing out various corner conditions..

Without smpnice, can you show how the default load balancing would 
result in the "nice" values being reliably enforced in your examples?

> 
>> This (in my opinion) is what you are seeing.
> 
> Simple review of the current code is showing all these problems... when you
> start new tasks, all will be well because of fork/exec balance... but
> during run time, if the system goes to a particular state, current imbalance
> code behaves poorly in the presence of high priority tasks...

The good news is that, in real life, high priority tasks generally only 
use very short bursts of CPU. :-)

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: smpnice loadbalancing with high priority tasks
  2006-04-04  1:22                       ` Peter Williams
@ 2006-04-04  1:34                         ` Peter Williams
  2006-04-04  2:11                         ` Siddha, Suresh B
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-04  1:34 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Peter Williams wrote:
> Siddha, Suresh B wrote:
>> On Sun, Apr 02, 2006 at 05:08:33PM +1000, Peter Williams wrote:
>>> Siddha, Suresh B wrote:
>>>> Peter,
>>>>
>>>> There are still issues which we need to address.. These are surfacing
>>>> as we are patching issue by issue(instead of addressing the root 
>>>> issue, which
>>>> is: presence of high priority tasks messes up load balancing of normal
>>>> priority tasks..)
>>>>
>>>> for example
>>>>
>>>> a) on a simple 4-way MP system, if we have one high priority and 4 
>>>> normal
>>>> priority tasks, with smpnice we would like to see the high priority 
>>>> task
>>>> scheduled on one cpu, two other cpus getting one normal task each 
>>>> and the
>>>> fourth cpu getting the remaining two normal tasks. but with smpnice 
>>>> that extra normal priority task keeps jumping from one cpu to 
>>>> another cpu having
>>>> the normal priority task.
>>>>
>>>> This is because of the busiest_has_loaded_cpus, nr_loaded_cpus 
>>>> logic.. We
>>>> are not including the cpu with high priority task in max_load 
>>>> calculations
>>>> but including that in total and avg_load calculations.. leading to 
>>>> max_load <
>>>> avg_load and load balance between cpus running normal priority 
>>>> tasks(2 Vs 1)
>>>> will always show imbalance as one normal priority and the extra normal
>>>> priority task will keep moving from one cpu to another cpu having
>>>> normal priority task..
>>> I can't see anything like this in the code.  
>>
>> Don't you see a condition where max_load < avg_load(as mentioned in the
>> above example) and in this case, code ignores avg_load and imbalance
>> will always be the extra normal priority task (coming from
>> "max_load - busiest_load_per_task") and this normal priority task 
>> keeps hopping from one cpu to another cpu having normal priority task..
>>
>>> Can you send a patch to fix what you think the problem is?
>>
>> I am looking at ways of fixing all these issues cleanly... I don't have
>> a clean solution yet...
> 
> OK.  I think this means some fiddling with avg_load may be necessary in 
> some cases but this will be complex.  I'm not really happy about making 
> this code more complex until some of the current unnecessary complexity 
> is removed.  I.e. until a proper solution to the problem of triggering 
> active_load_balance() is implemented.

I forgot to mention that I've been looking at whether mucking around 
with avg_load is necessary and so far have been unable to convince 
myself that it is.  Your argument above hasn't changed that opinion.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: smpnice loadbalancing with high priority tasks
  2006-04-04  1:22                       ` Peter Williams
  2006-04-04  1:34                         ` Peter Williams
@ 2006-04-04  2:11                         ` Siddha, Suresh B
  2006-04-04  3:24                           ` Peter Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-04  2:11 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

On Tue, Apr 04, 2006 at 11:22:23AM +1000, Peter Williams wrote:
> OK.  I think this means some fiddling with avg_load may be necessary in 
> some cases but this will be complex.  I'm not really happy about making 
> this code more complex until some of the current unnecessary complexity 
> is removed.  I.e. until a proper solution to the problem of triggering 
> active_load_balance() is implemented.

Here is Nicks view about active_load_balance()

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3950745131e23472fb5ace2ee4a2093e7590ec69

> > 
> > c) DP system: if the cpu-0 has two high priority and cpu-1 has one normal
> > priority task, how can the current code detect this imbalance..
> 
> How would it not?

imbalance will be always < busiest_load_per_task and
max_load - this_load will be < 2 * busiest_load_per_task...
and pwr_move will be <= pwr_now...
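
Plugging invented weights into case (c) makes these inequalities
concrete (the weights are illustrative only, not the kernel's actual
smpnice values):

```python
# Case (c): cpu-0 runs two high priority tasks (weight 3 each),
# cpu-1 runs one normal priority task (weight 1).  Weights invented.
high, normal = 3, 1

max_load = 2 * high                      # busiest group's load: 6
this_load = 1 * normal                   # pulling group's load: 1
busiest_load_per_task = max_load / 2     # 3

# The gap between the groups never reaches two busiest-sized tasks,
# so the "*2" test in find_busiest_group never forces a move:
assert max_load - this_load < 2 * busiest_load_per_task   # 5 < 6
```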

> > d) 4-way MP system: if the cpu-0 has two high priority tasks, cpu-1 has
> > one high priority and four normal priority and cpu-2,3 each has one
> > high priority task.. how does the current code distribute the normal
> > priority tasks among cpu-1,2,3... (in this case, max_load will always
> point to cpu-0 and will never distribute the normal priority tasks...)
> 
> This should cause cpu-0 to lose one of its tasks creating a new state 

how? in this case also...

imbalance will be always < busiest_load_per_task and
max_load - this_load will be < 2 * busiest_load_per_task...
and pwr_move will be <= pwr_now...
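
The same arithmetic can be sketched for case (d), again with invented
weights (high priority = 5, normal = 1; illustrative only):

```python
# Case (d), 4-way MP: cpu-0 has two high priority tasks (5 each),
# cpu-1 one high plus four normal (5 + 4), cpu-2/cpu-3 one high each.
high, normal = 5, 1
loads = {"cpu0": 2 * high, "cpu1": high + 4 * normal,
         "cpu2": high, "cpu3": high}

# max_load pins to cpu-0 (10), not to cpu-1 where the normal tasks are:
assert max(loads, key=loads.get) == "cpu0"

max_load = loads["cpu0"]                 # 10
busiest_load_per_task = max_load / 2     # 5

# Seen from cpu-2 (load 5), the gap never reaches two busiest-sized
# tasks, so the normal tasks on cpu-1 are never spread to cpu-2/cpu-3:
assert max_load - loads["cpu2"] < 2 * busiest_load_per_task   # 5 < 10
```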


> Without smpnice, can you show how the default load balancing would 
> result in the "nice" values being reliably enforced in your examples?

I agree with the issue that we are trying to fix here.. but I feel
it is incomplete.. With the current code in mainline, anyone can say the 
behavior by going through the code.... with smpnice code, code is complex
and really doesn't achieve what that patch really wants to fix..

> The good news is that, in real life, high priority tasks generally only 
> use very short bursts of CPU. :-)

do we then really need smpnice complexity?

thanks,
suresh


* Re: smpnice loadbalancing with high priority tasks
  2006-04-04  2:11                         ` Siddha, Suresh B
@ 2006-04-04  3:24                           ` Peter Williams
  2006-04-04  4:34                             ` Peter Williams
  2006-04-06  2:14                             ` Peter Williams
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-04  3:24 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Tue, Apr 04, 2006 at 11:22:23AM +1000, Peter Williams wrote:
>> OK.  I think this means some fiddling with avg_load may be necessary in 
>> some cases but this will be complex.  I'm not really happy about making 
>> this code more complex until some of the current unnecessary complexity 
>> is removed.  I.e. until a proper solution to the problem of triggering 
>> active_load_balance() is implemented.
> 
> Here is Nicks view about active_load_balance()
> 
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3950745131e23472fb5ace2ee4a2093e7590ec69

That's pre smpnice. :-)

> 
>>> c) DP system: if the cpu-0 has two high priority and cpu-1 has one normal
>>> priority task, how can the current code detect this imbalance..
>> How would it not?
> 
> imbalance will be always < busiest_load_per_task and
> max_load - this_load will be < 2 * busiest_load_per_task...
> and pwr_move will be <= pwr_now...

I had thought about substituting (busiest_load_per_task + 
this_load_per_task) for (busiest_load_per_task * 2) but couldn't 
convince myself that it was the right thing to do.  (The final update to 
this_load_per_task would need to be moved.)  The reason I couldn't 
convince myself is that I thought it might be too aggressive and cause 
excessive balancing.  Maybe something more sophisticated is needed to 
prevent that possibility.  It should be noted that the relative sizes of 
busiest_load_per_task and this_load_per_task may be useful in deciding 
what to do in these cases.  I'll put some thought into that.

BTW load balancing without smpnice would do just as badly here in that 
it wouldn't notice an imbalance either.

> 
>>> d) 4-way MP system: if the cpu-0 has two high priority tasks, cpu-1 has
>>> one high priority and four normal priority and cpu-2,3 each has one
>>> high priority task.. how does the current code distribute the normal
>>> priority tasks among cpu-1,2,3... (in this case, max_load will always
>>> point to cpu-0 and will never distribute the normal priority tasks...)
>> This should cause cpu-0 to lose one of its tasks creating a new state 
> 
> how? in this case also...
> 
> imbalance will be always < busiest_load_per_task and
> max_load - this_load will be < 2 * busiest_load_per_task...
> and pwr_move will be <= pwr_now...
> 
> 
>> Without smpnice, can you show how the default load balancing would 
>> result in the "nice" values being reliably enforced in your examples?
> 
> I agree with the issue that we are trying to fix here.. but I feel
> it is incomplete.. With the current code in mainline, anyone can say the 
> behavior by going through the code.... with smpnice code, code is complex
> and really doesn't achieve what that patch really wants to fix..

It does in most cases and we could reduce the complexity if we had an 
alternative trigger for active_load_balance() :-)

> 
>> The good news is that, in real life, high priority tasks generally only 
>> use very short bursts of CPU. :-)
> 
> do we then really need smpnice complexity?

Most people who express unhappiness with SMP and nice are looking at the 
other end of the problem i.e. they nice 19 a process to make it run in 
the background but it gets a CPU to itself while a couple nice 0 tasks 
have to share the other CPU.  The high priority case has to be 
considered as well (e.g. one high priority task and one normal priority 
task running on a 2 CPU machine with a CPU each when another task wakes 
-- you'd like that to end up on the CPU of the normal priority task not 
the one with the high priority task, etc.) but its effects are more 
likely to be transitory and high priority tasks would not be expected to 
have a long term effect on balancing.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: smpnice loadbalancing with high priority tasks
  2006-04-04  3:24                           ` Peter Williams
@ 2006-04-04  4:34                             ` Peter Williams
  2006-04-06  2:14                             ` Peter Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-04  4:34 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Peter Williams wrote:
> Siddha, Suresh B wrote:
>>>> c) DP system: if the cpu-0 has two high priority and cpu-1 has one 
>>>> normal
>>>> priority task, how can the current code detect this imbalance..
>>> How would it not?
>>
>> imbalance will be always < busiest_load_per_task and
>> max_load - this_load will be < 2 * busiest_load_per_task...
>> and pwr_move will be <= pwr_now...
> 
> I had thought about substituting (busiest_load_per_task + 
> this_load_per_task) for (busiest_load_per_task * 2) but couldn't 
> convince myself that it was the right thing to do.  (The final update to 
> this_load_per_task would need to be moved.)  The reason I couldn't 
> convince myself is that I thought it might be too aggressive and cause 
> excessive balancing.  Maybe something more sophisticated is needed to 
> prevent that possibility.  It should be noted that the relative sizes of 
> busiest_load_per_task and this_load_per_task may be useful in deciding 
> what to do in these cases.  I'll put some thought into that.

How does this bit of code look?

if (busiest_load_per_task > this_load_per_task) {
	if (max_load - this_load > busiest_load_per_task) {
		*imbalance = busiest_load_per_task;
		return busiest;
	}
} else if (max_load - this_load >= busiest_load_per_task*2) {
	*imbalance = busiest_load_per_task;
	return busiest;
}

My maths indicate this will work even in cases when the difference 
between the two load per task values is small.  By "work" I mean that it 
will move one of the high priority tasks and it won't bounce back.  Do 
you agree?

The actual patch would be a little neater than this, of course.
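
As a sanity check, the condition can be exercised with the invented
weights from the earlier DP example (high priority task = 3, normal = 1;
illustrative values only, not real smpnice weights):

```python
# Sketch of the proposed condition from the snippet above.
def forces_move(max_load, this_load, busiest_lpt, this_lpt):
    """Return True when the proposed test would set *imbalance."""
    if busiest_lpt > this_lpt:
        return max_load - this_load > busiest_lpt
    return max_load - this_load >= busiest_lpt * 2

# Before: cpu-0 has two high (load 6, 3 per task), cpu-1 one normal (1).
assert forces_move(6, 1, 3, 1)        # 6 - 1 = 5 > 3: a move is forced

# After one high priority task moves: cpu-0 = 3 (3 per task),
# cpu-1 = 3 + 1 = 4 (2 per task).  Viewed from cpu-0's side, cpu-1 is
# now busiest but the condition does not fire, so no bounce back.
assert not forces_move(4, 3, 2, 3)    # 4 - 3 = 1 < 2 * 2
```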

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: smpnice loadbalancing with high priority tasks
  2006-04-04  3:24                           ` Peter Williams
  2006-04-04  4:34                             ` Peter Williams
@ 2006-04-06  2:14                             ` Peter Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-06  2:14 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Andrew Morton, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

Peter Williams wrote:
> Siddha, Suresh B wrote:
>> do we then really need smpnice complexity?
> 
> Most people who express unhappiness with SMP and nice are looking at the 
> other end of the problem i.e. they nice 19 a process to make it run in 
> the background but it gets a CPU to itself while a couple nice 0 tasks 
> have to share the other CPU.  The high priority case has to be 
> considered as well (e.g. one high priority task and one normal priority 
> task running on a 2 CPU machine with a CPU each when another task wakes 
> -- you'd like that to end up on the CPU of the normal priority task not 
> the one with the high priority task, etc.) but its effects are more 
> likely to be transitory and high priority tasks would not be expected to 
> have a long term effect on balancing.

Another advantage of smpnice load balancing that I failed to point out 
is that it increases the chances that the tasks at the head of the 
queues within a HT package are approximately the same priority.  This, 
in turn, reduces the probability that it will be necessary to force one 
of the channels to idle and this will tend to increase the overall 
system throughput.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-02  7:08                   ` Peter Williams
  2006-04-04  0:24                     ` Siddha, Suresh B
@ 2006-04-20  1:24                     ` Siddha, Suresh B
  2006-04-20  5:19                       ` Peter Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-20  1:24 UTC (permalink / raw)
  To: Peter Williams, akpm
  Cc: Siddha, Suresh B, Andrew Morton, Mike Galbraith, Nick Piggin,
	Ingo Molnar, Con Kolivas, Chen, Kenneth W,
	Linux Kernel Mailing List

with smpnice, sched groups with the highest priority tasks can mask the 
imbalance between the other sched groups within the same domain. 
This patch fixes some of the listed scenarios by not considering 
the sched groups which are lightly loaded. 

a) on a simple 4-way MP system, if we have one high priority and 4 normal
priority tasks, with smpnice we would like to see the high priority task
scheduled on one cpu, two other cpus getting one normal task each and the
fourth cpu getting the remaining two normal tasks. but with the current 
smpnice the extra normal priority task keeps jumping from one cpu to 
another cpu having the normal priority task.  This is because of the 
busiest_has_loaded_cpus, nr_loaded_cpus logic.. We are not including 
the cpu with high priority task in max_load calculations but including 
that in total and avg_load calculations.. leading to max_load < avg_load 
and load balance between cpus running normal priority tasks(2 Vs 1) 
will always show imbalance 
as one normal priority and the extra normal priority task will keep moving 
from one cpu to another cpu having normal priority task..

b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest 
priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority 
task each..  P2 and P3 are idle.  With this patch, one of the normal priority 
tasks on P1 will be moved to P2 or P3..

c) With the current weighted smp nice calculations, it doesn't always make 
sense to look at the highest weighted runqueue in the busy group..
Consider a load balance scenario on a DP with HT system, with Package-0 
containing one high priority and one low priority, Package-1 containing 
one low priority(with other thread being idle)..  Package-1 thinks that it 
needs to take the low priority thread from Package-0. And find_busiest_queue() 
returns the cpu thread with highest priority task.. And ultimately(with help 
of active load balance) we move high priority task to Package-1. And same 
continues with Package-0 now, moving high priority task from package-1 to 
package-0..  Even without the presence of active load balance, load balance 
will fail to balance the above scenario..  Fix find_busiest_queue to use 
"imbalance" when it is lightly loaded.

To fix all the scenarios, we probably need another look at the smpnice 
balancing patches..

This patch doesn't fix this issue for example:
4-way simple MP system. P0 containing two high priority tasks, P1 containing
one high priority and two normal priority tasks, and one high priority task
each on P2, P3. The current load balance doesn't detect/fix the
imbalance by moving one of the normal priority tasks running on P1 to P2 or P3.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

--- linux-2.6.17-rc1/kernel/sched.c	2006-04-13 11:38:16.897333112 -0700
+++ linux~/kernel/sched.c	2006-04-19 16:19:01.294816528 -0700
@@ -2145,7 +2145,6 @@ find_busiest_group(struct sched_domain *
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
 	unsigned long this_load_per_task, this_nr_running;
-	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -2200,15 +2199,8 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (nr_loaded_cpus) {
-			if (avg_load > max_load || !busiest_has_loaded_cpus) {
-				max_load = avg_load;
-				busiest = group;
-				busiest_nr_running = sum_nr_running;
-				busiest_load_per_task = sum_weighted_load;
-				busiest_has_loaded_cpus = 1;
-			}
-		} else if (!busiest_has_loaded_cpus && avg_load > max_load) {
+		} else if (avg_load > max_load && 
+			   sum_nr_running > group->cpu_power / SCHED_LOAD_SCALE) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2241,6 +2233,16 @@ find_busiest_group(struct sched_domain *
 	if (max_load <= busiest_load_per_task)
 		goto out_balanced;
 
+	/*
+	 * In the presence of smp nice balancing, certain scenarios can have
+	 * max load less than avg load(as we skip the groups at or below
+	 * its cpu_power, while calculating max_load..) 
+	 */
+	if (max_load < avg_load) {
+		*imbalance = 0;
+		goto small_imbalance;
+	}
+
 	/* Don't want to pull so many tasks that a group would go idle */
 	max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
 
@@ -2255,6 +2257,7 @@ find_busiest_group(struct sched_domain *
 	 * a think about bumping its value to force at least one task to be
 	 * moved
 	 */
+small_imbalance:
 	if (*imbalance < busiest_load_per_task) {
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;
@@ -2318,23 +2321,19 @@ out_balanced:
  * find_busiest_queue - find the busiest runqueue among the cpus in group.
  */
 static runqueue_t *find_busiest_queue(struct sched_group *group,
-	enum idle_type idle)
+	enum idle_type idle, unsigned long imbalance)
 {
 	unsigned long max_load = 0;
 	runqueue_t *busiest = NULL, *rqi;
-	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
 		rqi = cpu_rq(i);
 
-		if (rqi->nr_running > 1) {
-			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
-				max_load = rqi->raw_weighted_load;
-				busiest = rqi;
-				busiest_is_loaded = 1;
-			}
-		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
+		if (rqi->nr_running == 1 && rqi->raw_weighted_load > imbalance)
+			continue;
+
+		if (rqi->raw_weighted_load > max_load) {
 			max_load = rqi->raw_weighted_load;
 			busiest = rqi;
 		}
@@ -2377,7 +2376,7 @@ static int load_balance(int this_cpu, ru
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle);
+	busiest = find_busiest_queue(group, idle, imbalance);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2501,7 +2500,7 @@ static int load_balance_newidle(int this
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE);
+	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20  1:24                     ` [patch] smpnice: don't consider sched groups which are lightly loaded for balancing Siddha, Suresh B
@ 2006-04-20  5:19                       ` Peter Williams
  2006-04-20 16:54                         ` Siddha, Suresh B
  2006-04-20 17:04                         ` Siddha, Suresh B
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-20  5:19 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: akpm, Mike Galbraith, Nick Piggin, Ingo Molnar, Con Kolivas,
	Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> with smpnice, sched groups with highest priority tasks can mask the 
> imbalance between the other sched groups within the same domain.
> This patch fixes some of the scenarios listed below by not considering
> the sched groups which are lightly loaded. 
> 
> a) on a simple 4-way MP system, if we have one high priority and 4 normal
> priority tasks, with smpnice we would like to see the high priority task
> scheduled on one cpu, two other cpus getting one normal task each and the
> fourth cpu getting the remaining two normal tasks. but with current smpnice
> extra normal priority task keeps jumping from one cpu to another cpu having
> the normal priority task.  This is because of the busiest_has_loaded_cpus, 
> nr_loaded_cpus logic.. We are not including the cpu with high priority 
> task in max_load calculations but including that in total and avg_load 
> calculations.. leading to max_load < avg_load and load balance between
> cpus running normal priority tasks(2 Vs 1) will always show imbalance
> as one normal priority and the extra normal priority task will keep moving 
> from one cpu to another cpu having normal priority task..
> 
> b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest 
> priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority 
> task each..  P2 and P3 are idle.  With this patch, one of the normal priority 
> tasks on P1 will be moved to P2 or P3..
> 
> c) With the current weighted smp nice calculations, it doesn't always make 
> sense to look at the highest weighted runqueue in the busy group..
> Consider a load balance scenario on a DP with HT system, with Package-0 
> containing one high priority and one low priority, Package-1 containing 
> one low priority(with other thread being idle)..  Package-1 thinks that it 
> needs to take the low priority thread from Package-0. And find_busiest_queue()
> returns the cpu thread with highest priority task.. And ultimately(with help 
> of active load balance) we move high priority task to Package-1. And same 
> continues with Package-0 now, moving high priority task from package-1 to 
> package-0..  Even without the presence of active load balance, load balance 
> will fail to balance the above scenario..  Fix find_busiest_queue to use 
> "imbalance" when it is lightly loaded.

Looks neater than the original code.

Some comments in line below (including inside the patch).

> 
> To fix all the scenarios, we probably need another look at the smpnice 
> balancing patches..

I think that you should only describe the problem(s) that the patch 
fixes as this has to go into the Change Log and the ones it doesn't fix 
are of no interest there.

> 
> This patch doesn't fix this issue for example:
> 4-way simple MP system. P0 containing two high priority tasks, P1 containing
> one high priority and two normal priority tasks, one high priority task
> each on P2, P3. Current load balance doesn't detect/fix the
> imbalance by moving one of the normal priority tasks running on P1 to P2 or P3.

Is this always the case or just a possibility?  Please describe the hole 
it slips through (and please do that every time you provide a scenario).

> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> 
> --- linux-2.6.17-rc1/kernel/sched.c	2006-04-13 11:38:16.897333112 -0700
> +++ linux~/kernel/sched.c	2006-04-19 16:19:01.294816528 -0700
> @@ -2145,7 +2145,6 @@ find_busiest_group(struct sched_domain *
>  	unsigned long max_pull;
>  	unsigned long busiest_load_per_task, busiest_nr_running;
>  	unsigned long this_load_per_task, this_nr_running;
> -	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
>  	int load_idx;
>  
>  	max_load = this_load = total_load = total_pwr = 0;
> @@ -2200,15 +2199,8 @@ find_busiest_group(struct sched_domain *
>  			this = group;
>  			this_nr_running = sum_nr_running;
>  			this_load_per_task = sum_weighted_load;
> -		} else if (nr_loaded_cpus) {
> -			if (avg_load > max_load || !busiest_has_loaded_cpus) {
> -				max_load = avg_load;
> -				busiest = group;
> -				busiest_nr_running = sum_nr_running;
> -				busiest_load_per_task = sum_weighted_load;
> -				busiest_has_loaded_cpus = 1;
> -			}
> -		} else if (!busiest_has_loaded_cpus && avg_load > max_load) {
> +		} else if (avg_load > max_load && 

White space at the end of this line (will upset Andrew).

> +			   sum_nr_running > group->cpu_power / SCHED_LOAD_SCALE) {
>  			max_load = avg_load;
>  			busiest = group;
>  			busiest_nr_running = sum_nr_running;
> @@ -2241,6 +2233,16 @@ find_busiest_group(struct sched_domain *
>  	if (max_load <= busiest_load_per_task)
>  		goto out_balanced;
>  
> +	/*
> +	 * In the presence of smp nice balancing, certain scenarios can have
> +	 * max load less than avg load(as we skip the groups at or below
> +	 * its cpu_power, while calculating max_load..) 

Ditto.

> +	 */
> +	if (max_load < avg_load) {
> +		*imbalance = 0;
> +		goto small_imbalance;
> +	}
> +
>  	/* Don't want to pull so many tasks that a group would go idle */
>  	max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
>  
> @@ -2255,6 +2257,7 @@ find_busiest_group(struct sched_domain *
>  	 * a think about bumping its value to force at least one task to be
>  	 * moved
>  	 */
> +small_imbalance:
>  	if (*imbalance < busiest_load_per_task) {

I think small_imbalance should be moved to here to save an unnecessary 
comparison.  The only time busiest_load_per_task could be zero (making the 
above expression fail when imbalance is zero) is if the only tasks 
running in "busiest" were the migration tasks.

>  		unsigned long pwr_now = 0, pwr_move = 0;
>  		unsigned long tmp;
> @@ -2318,23 +2321,19 @@ out_balanced:
>   * find_busiest_queue - find the busiest runqueue among the cpus in group.
>   */
>  static runqueue_t *find_busiest_queue(struct sched_group *group,
> -	enum idle_type idle)
> +	enum idle_type idle, unsigned long imbalance)
>  {
>  	unsigned long max_load = 0;
>  	runqueue_t *busiest = NULL, *rqi;
> -	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
>  	int i;
>  
>  	for_each_cpu_mask(i, group->cpumask) {
>  		rqi = cpu_rq(i);
>  
> -		if (rqi->nr_running > 1) {
> -			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
> -				max_load = rqi->raw_weighted_load;
> -				busiest = rqi;
> -				busiest_is_loaded = 1;
> -			}
> -		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
> +		if (rqi->nr_running == 1 && rqi->raw_weighted_load > imbalance)
> +			continue;
> +
> +		if (rqi->raw_weighted_load > max_load) {
>  			max_load = rqi->raw_weighted_load;
>  			busiest = rqi;
>  		}
> @@ -2377,7 +2376,7 @@ static int load_balance(int this_cpu, ru
>  		goto out_balanced;
>  	}
>  
> -	busiest = find_busiest_queue(group, idle);
> +	busiest = find_busiest_queue(group, idle, imbalance);
>  	if (!busiest) {
>  		schedstat_inc(sd, lb_nobusyq[idle]);
>  		goto out_balanced;
> @@ -2501,7 +2500,7 @@ static int load_balance_newidle(int this
>  		goto out_balanced;
>  	}
>  
> -	busiest = find_busiest_queue(group, NEWLY_IDLE);
> +	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
>  	if (!busiest) {
>  		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
>  		goto out_balanced;


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20  5:19                       ` Peter Williams
@ 2006-04-20 16:54                         ` Siddha, Suresh B
  2006-04-20 23:11                           ` Peter Williams
  2006-04-20 23:49                           ` Andrew Morton
  2006-04-20 17:04                         ` Siddha, Suresh B
  1 sibling, 2 replies; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-20 16:54 UTC (permalink / raw)
  To: Peter Williams, akpm
  Cc: Siddha, Suresh B, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

updated patch appended. thanks.

--
with smpnice, sched groups with highest priority tasks can mask the
imbalance between the other sched groups within the same domain.
This patch fixes some of the scenarios listed below by not considering
the sched groups which are lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal
priority tasks, with smpnice we would like to see the high priority task
scheduled on one cpu, two other cpus getting one normal task each and the
fourth cpu getting the remaining two normal tasks. but with current smpnice
extra normal priority task keeps jumping from one cpu to another cpu having
the normal priority task.  This is because of the busiest_has_loaded_cpus,
nr_loaded_cpus logic.. We are not including the cpu with high priority
task in max_load calculations but including that in total and avg_load
calculations.. leading to max_load < avg_load and load balance between
cpus running normal priority tasks(2 Vs 1) will always show imbalance
as one normal priority and the extra normal priority task will keep moving
from one cpu to another cpu having normal priority task..

b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest
priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority
task each..  P2 and P3 are idle.  With this patch, one of the normal priority
tasks on P1 will be moved to P2 or P3..

c) With the current weighted smp nice calculations, it doesn't always make
sense to look at the highest weighted runqueue in the busy group..
Consider a load balance scenario on a DP with HT system, with Package-0
containing one high priority and one low priority, Package-1 containing
one low priority(with other thread being idle)..  Package-1 thinks that it
needs to take the low priority thread from Package-0. And find_busiest_queue()
returns the cpu thread with highest priority task.. And ultimately(with help
of active load balance) we move high priority task to Package-1. And same
continues with Package-0 now, moving high priority task from package-1 to
package-0..  Even without the presence of active load balance, load balance
will fail to balance the above scenario..  Fix find_busiest_queue to use
"imbalance" when it is lightly loaded.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

--- linux-2.6.17-rc1/kernel/sched.c	2006-04-13 11:38:16.897333112 -0700
+++ linux~/kernel/sched.c	2006-04-20 07:52:31.974898888 -0700
@@ -2145,7 +2145,6 @@ find_busiest_group(struct sched_domain *
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
 	unsigned long this_load_per_task, this_nr_running;
-	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -2200,15 +2199,8 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (nr_loaded_cpus) {
-			if (avg_load > max_load || !busiest_has_loaded_cpus) {
-				max_load = avg_load;
-				busiest = group;
-				busiest_nr_running = sum_nr_running;
-				busiest_load_per_task = sum_weighted_load;
-				busiest_has_loaded_cpus = 1;
-			}
-		} else if (!busiest_has_loaded_cpus && avg_load > max_load) {
+		} else if (avg_load > max_load &&
+			   sum_nr_running > group->cpu_power / SCHED_LOAD_SCALE) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2241,6 +2233,16 @@ find_busiest_group(struct sched_domain *
 	if (max_load <= busiest_load_per_task)
 		goto out_balanced;
 
+	/*
+	 * In the presence of smp nice balancing, certain scenarios can have
+	 * max load less than avg load(as we skip the groups at or below
+	 * its cpu_power, while calculating max_load..)
+	 */
+	if (max_load < avg_load) {
+		*imbalance = 0;
+		goto small_imbalance;
+	}
+
 	/* Don't want to pull so many tasks that a group would go idle */
 	max_pull = min(max_load - avg_load, max_load - busiest_load_per_task);
 
@@ -2256,6 +2258,7 @@ find_busiest_group(struct sched_domain *
 	 * moved
 	 */
 	if (*imbalance < busiest_load_per_task) {
+small_imbalance:
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;
 		unsigned int imbn = 2;
@@ -2318,23 +2321,19 @@ out_balanced:
  * find_busiest_queue - find the busiest runqueue among the cpus in group.
  */
 static runqueue_t *find_busiest_queue(struct sched_group *group,
-	enum idle_type idle)
+	enum idle_type idle, unsigned long imbalance)
 {
 	unsigned long max_load = 0;
 	runqueue_t *busiest = NULL, *rqi;
-	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
 		rqi = cpu_rq(i);
 
-		if (rqi->nr_running > 1) {
-			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
-				max_load = rqi->raw_weighted_load;
-				busiest = rqi;
-				busiest_is_loaded = 1;
-			}
-		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
+		if (rqi->nr_running == 1 && rqi->raw_weighted_load > imbalance)
+			continue;
+
+		if (rqi->raw_weighted_load > max_load) {
 			max_load = rqi->raw_weighted_load;
 			busiest = rqi;
 		}
@@ -2377,7 +2376,7 @@ static int load_balance(int this_cpu, ru
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, idle);
+	busiest = find_busiest_queue(group, idle, imbalance);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[idle]);
 		goto out_balanced;
@@ -2501,7 +2500,7 @@ static int load_balance_newidle(int this
 		goto out_balanced;
 	}
 
-	busiest = find_busiest_queue(group, NEWLY_IDLE);
+	busiest = find_busiest_queue(group, NEWLY_IDLE, imbalance);
 	if (!busiest) {
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out_balanced;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20  5:19                       ` Peter Williams
  2006-04-20 16:54                         ` Siddha, Suresh B
@ 2006-04-20 17:04                         ` Siddha, Suresh B
  2006-04-21  0:00                           ` Peter Williams
  1 sibling, 1 reply; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-20 17:04 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, akpm, Mike Galbraith, Nick Piggin, Ingo Molnar,
	Con Kolivas, Chen, Kenneth W, Linux Kernel Mailing List

On Thu, Apr 20, 2006 at 03:19:52PM +1000, Peter Williams wrote:
> > 
> > This patch doesn't fix this issue for example:
> > 4-way simple MP system. P0 containing two high priority tasks, P1 containing
> > one high priority and two normal priority tasks, one high priority task
> > each on P2, P3. Current load balance doesn't detect/fix the
> > imbalance by moving one of the normal priority task running on P1 to P2 or P3.
> 
> Is this always the case or just a possibility?  Please describe the hole 
> it slips through (and please do that every time you provide a scenario).

I thought a scenario was enough to show the hole :) Anyhow, I brought this 
issue up before too..
http://www.ussg.iu.edu/hypermail/linux/kernel/0604.0/0517.html

Load balance on P2 or P3 will always show P0 as max load, but it will not
be able to move any load from P0: the
imbalance will always be < busiest_load_per_task,
max_load - this_load will be < imbn(2) * busiest_load_per_task...
and pwr_move will be <= pwr_now...

Basically sched groups with highest priority tasks can mask the 
imbalance between the other sched groups with in the same domain. 

thanks,
suresh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20 16:54                         ` Siddha, Suresh B
@ 2006-04-20 23:11                           ` Peter Williams
  2006-04-20 23:49                           ` Andrew Morton
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-20 23:11 UTC (permalink / raw)
  To: Siddha, Suresh B, akpm
  Cc: Mike Galbraith, Nick Piggin, Ingo Molnar, Con Kolivas,
	Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> updated patch appended. thanks.
> 
> --
> with smpnice, sched groups with highest priority tasks can mask the
> imbalance between the other sched groups within the same domain.
> This patch fixes some of the listed down scenarios by not considering
> the sched groups which are lightly loaded.
> 
> a) on a simple 4-way MP system, if we have one high priority and 4 normal
> priority tasks, with smpnice we would like to see the high priority task
> scheduled on one cpu, two other cpus getting one normal task each and the
> fourth cpu getting the remaining two normal tasks. but with current smpnice
> extra normal priority task keeps jumping from one cpu to another cpu having
> the normal priority task.  This is because of the busiest_has_loaded_cpus,
> nr_loaded_cpus logic.. We are not including the cpu with high priority
> task in max_load calculations but including that in total and avg_load
> calculations.. leading to max_load < avg_load and load balance between
> cpus running normal priority tasks(2 Vs 1) will always show imbalance
> as one normal priority and the extra normal priority task will keep moving
> from one cpu to another cpu having normal priority task..
> 
> b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest
> priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority
> task each..  P2 and P3 are idle.  With this patch, one of the normal priority
> tasks on P1 will be moved to P2 or P3..
> 
> c) With the current weighted smp nice calculations, it doesn't always make
> sense to look at the highest weighted runqueue in the busy group..
> Consider a load balance scenario on a DP with HT system, with Package-0
> containing one high priority and one low priority, Package-1 containing
> one low priority(with other thread being idle)..  Package-1 thinks that it
> needs to take the low priority thread from Package-0. And find_busiest_queue()
> returns the cpu thread with highest priority task.. And ultimately(with help
> of active load balance) we move high priority task to Package-1. And same
> continues with Package-0 now, moving high priority task from package-1 to
> package-0..  Even without the presence of active load balance, load balance
> will fail to balance the above scenario..  Fix find_busiest_queue to use
> "imbalance" when it is lightly loaded.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

Acked-by: Peter Williams <pwil3058@bigpond.com.au>

-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20 16:54                         ` Siddha, Suresh B
  2006-04-20 23:11                           ` Peter Williams
@ 2006-04-20 23:49                           ` Andrew Morton
  2006-04-21  0:25                             ` Siddha, Suresh B
  2006-04-21  0:28                             ` Peter Williams
  1 sibling, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2006-04-20 23:49 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: pwil3058, suresh.b.siddha, efault, nickpiggin, mingo, kernel,
	kenneth.w.chen, linux-kernel

"Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
>
> updated patch appended. thanks.

Where are we up to with smpnice now?  Are there still any known
regressions/problems/bugs/etc?  Has sufficient testing been done for us to
know this?

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20 17:04                         ` Siddha, Suresh B
@ 2006-04-21  0:00                           ` Peter Williams
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Williams @ 2006-04-21  0:00 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: akpm, Mike Galbraith, Nick Piggin, Ingo Molnar, Con Kolivas,
	Chen, Kenneth W, Linux Kernel Mailing List

Siddha, Suresh B wrote:
> On Thu, Apr 20, 2006 at 03:19:52PM +1000, Peter Williams wrote:
>>> This patch doesn't fix this issue for example:
>>> 4-way simple MP system. P0 containing two high priority tasks, P1 containing
>>> one high priority and two normal priority tasks, one high priority task
>>> each on P2, P3. Current load balance doesn't detect/fix the
>>> imbalance by moving one of the normal priority task running on P1 to P2 or P3.
>> Is this always the case or just a possibility?  Please describe the hole 
>> it slips through (and please do that every time you provide a scenario).
> 
> I thought a scenario is enough to show the hole :) Anyhow, I brought this 
> issue before also..
> http://www.ussg.iu.edu/hypermail/linux/kernel/0604.0/0517.html
> 
> Load balance on P2 or P3 will always show P0 as max load but it will not
> be able to move any load from P0. As
> imbalance will be always < busiest_load_per_task and
> max_load - this_load will be < imbn(2) * busiest_load_per_task...
> and pwr_move will be <= pwr_now...

This will depend on how high the priority of the high priority tasks is 
relative to normal tasks.  E.g. it's quite possible to have two high 
priority tasks whose combined load weight is less than that of two 
normal tasks and a high priority task.

> 
> Basically sched groups with highest priority tasks can mask the 
> imbalance between the other sched groups within the same domain. 

Sometimes.

I don't think that this stable state is so bad that anything special 
needs to be done, especially as high priority tasks tend to use the CPU 
only in short bursts, which means that the state probably won't exist for 
very long.

To paraphrase Ingo (from another thread), load balancing is a 
probabilistic exercise.  For a start, achieving a deterministic optimal 
distribution would be an NP-hard problem, and by the time you determined the 
correct distribution (which could be a very long time) the "state" 
information on which the determination was based would have changed 
(possibly a lot).  This latter bit (probably minus the possibly a lot) 
is true anyway as find_busiest_group() and find_busiest_queue() are 
called without locks meaning that the state upon which their results are 
determined may change before move_tasks() is called.

I think this justifies saying that this scenario probably doesn't matter 
and, therefore, fixing it isn't urgent.

BTW I agree with your earlier statements that the modification to 
move_tasks() to circumvent the skip mechanism in some circumstances 
needs to be refined so that it doesn't move the highest priority task of 
the busiest queue.  I'll be submitting a patch later today.

I think that the next thing that needs to be addressed after that is a 
modification to try_to_wake_up() to improve the distribution of high 
priority tasks across CPUs.  I think that just sticking them on any CPU 
and waiting for the load balancing code to kick in and move them 
unnecessarily increases their latency.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20 23:49                           ` Andrew Morton
@ 2006-04-21  0:25                             ` Siddha, Suresh B
  2006-04-21  0:28                             ` Peter Williams
  1 sibling, 0 replies; 32+ messages in thread
From: Siddha, Suresh B @ 2006-04-21  0:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Siddha, Suresh B, pwil3058, efault, nickpiggin, mingo, kernel,
	kenneth.w.chen, linux-kernel

On Thu, Apr 20, 2006 at 04:49:36PM -0700, Andrew Morton wrote:
> "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
> >
> > updated patch appended. thanks.
> 
> Where are we up to with smpnice now?  Are there still any known
> regressions/problems/bugs/etc?  Has sufficient testing been done for us to
> know this?

Based on the code review, so far I have been giving load balancing examples 
which break. Peter is also planning to fix the outstanding issues which are open.

In the coming days, we are planning to run some benchmarks.

Nick, it will be great if you can review and provide your feedback.

thanks,
suresh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-20 23:49                           ` Andrew Morton
  2006-04-21  0:25                             ` Siddha, Suresh B
@ 2006-04-21  0:28                             ` Peter Williams
  2006-04-21  1:25                               ` Andrew Morton
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Williams @ 2006-04-21  0:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Siddha, Suresh B, efault, nickpiggin, mingo, kernel,
	kenneth.w.chen, linux-kernel

Andrew Morton wrote:
> "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
>> updated patch appended. thanks.
> 
> Where are we up to with smpnice now?  Are there still any known
> regressions/problems/bugs/etc?

One more change to move_tasks() is required to address an issue raised 
by Suresh w.r.t. the possibility of unnecessary movement of the highest 
priority task from the busiest queue (possible because of the 
active/expired array mechanism).  I hope to forward a patch for this 
later today.

After that the only thing I would like to do at this stage is modify 
try_to_wake_up() so that it tries harder to distribute high priority 
tasks across the CPUs.  I wouldn't classify this as absolutely necessary 
as it's really just a measure that attempts to reduce latency for high 
priority tasks as it should get them onto a CPU more quickly than just 
sticking them anywhere and waiting for load balancing to kick in if 
they've been put on a CPU with a higher priority task already running. 
Also it's only really necessary when there are a lot of high priority tasks 
running.  So this isn't urgent and probably needs to be coordinated with 
Ingo's RT load balancing stuff anyway.

>  Has sufficient testing been done for us to
> know this?

I run smpnice kernels on all of my SMP machines all of the time.  But I 
don't have anything with more than 2 CPUs so I've been relying on their 
presence in -mm to get wider testing on larger machines.

I think that once this patch and the move_tasks() one that I'll forward 
later today are incorporated we should have something that (although not 
perfect) works pretty well.  Neither of these changes should cause a 
behavioural change in the case where all tasks are nice==0.

As load balancing is inherently probabilistic I don't think that we 
should hold out for "perfect".

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch] smpnice: don't consider sched groups which are lightly loaded for balancing
  2006-04-21  0:28                             ` Peter Williams
@ 2006-04-21  1:25                               ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2006-04-21  1:25 UTC (permalink / raw)
  To: Peter Williams
  Cc: suresh.b.siddha, efault, nickpiggin, mingo, kernel,
	kenneth.w.chen, linux-kernel

Peter Williams <pwil3058@bigpond.net.au> wrote:
>
> Andrew Morton wrote:
> > "Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:
> >> updated patch appended. thanks.
> > 
> > Where are we up to with smpnice now?  Are there still any known
> > regressions/problems/bugs/etc?
> 
> One more change to move_tasks() is required to address an issue raised 
> by Suresh w.r.t. the possibility of unnecessary movement of the highest 
> priority task from the busiest queue (possible because of the 
> active/expired array mechanism).  I hope to forward a patch for this 
> later today.

OK.

> After that the only thing I would like to do at this stage is modify 
> try_to_wake_up() so that it tries harder to distribute high priority 
> tasks across the CPUs.  I wouldn't classify this as absolutely necessary 
> as it's really just a measure that attempts to reduce latency for high 
> priority tasks as it should get them onto a CPU more quickly than just 
> sticking them anywhere and waiting for load balancing to kick in if 
> they've been put on a CPU with a higher priority task already running. 
> Also it's only really necessary when there are a lot of high priority tasks 
> running.  So this isn't urgent and probably needs to be coordinated with 
> Ingo's RT load balancing stuff anyway.

Sure, we can leave things like that until later.

> >  Has sufficient testing been done for us to
> > know this?

I should have said "testing for regressions".  We know that smpnice
improves some things.  My concern is that it doesn't cause any non-silly
workloads to worsen.  Once we're at that stage I think we're ready to go.

IOW: at this stage we should concentrate upon not taking any workloads
backwards, rather than upon taking even more workloads even more forwards. 
That can come later.

> I run smpnice kernels on all of my SMP machines all of the time.  But I 
> don't have anything with more than 2 CPUs so I've been relying on their 
> presence in -mm to get wider testing on larger machines.

Sure.  A mortal doesn't have the hardware and isn't set up to test certain
high-value workloads...

> As load balancing is inherently probabilistic I don't think that we 
> should hold out for "perfect".

Sure.  "same or better" is the aim here.


end of thread, other threads:[~2006-04-21  1:27 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-28  6:00 [PATCH] sched: smpnice work around for active_load_balance() Peter Williams
2006-03-28 19:25 ` Siddha, Suresh B
2006-03-28 22:44   ` Peter Williams
2006-03-29  2:14     ` Peter Williams
2006-03-29  2:52     ` Siddha, Suresh B
2006-03-29  3:42       ` Peter Williams
2006-03-29 22:52         ` Siddha, Suresh B
2006-03-29 23:40           ` Peter Williams
2006-03-30  0:50             ` Siddha, Suresh B
2006-03-30  1:14               ` Peter Williams
2006-04-02  4:48                 ` smpnice loadbalancing with high priority tasks Siddha, Suresh B
2006-04-02  7:08                   ` Peter Williams
2006-04-04  0:24                     ` Siddha, Suresh B
2006-04-04  1:22                       ` Peter Williams
2006-04-04  1:34                         ` Peter Williams
2006-04-04  2:11                         ` Siddha, Suresh B
2006-04-04  3:24                           ` Peter Williams
2006-04-04  4:34                             ` Peter Williams
2006-04-06  2:14                             ` Peter Williams
2006-04-20  1:24                     ` [patch] smpnice: don't consider sched groups which are lightly loaded for balancing Siddha, Suresh B
2006-04-20  5:19                       ` Peter Williams
2006-04-20 16:54                         ` Siddha, Suresh B
2006-04-20 23:11                           ` Peter Williams
2006-04-20 23:49                           ` Andrew Morton
2006-04-21  0:25                             ` Siddha, Suresh B
2006-04-21  0:28                             ` Peter Williams
2006-04-21  1:25                               ` Andrew Morton
2006-04-20 17:04                         ` Siddha, Suresh B
2006-04-21  0:00                           ` Peter Williams
2006-04-03  1:04             ` [PATCH] sched: smpnice work around for active_load_balance() Peter Williams
2006-04-03 16:57               ` Siddha, Suresh B
2006-04-03 23:11                 ` Peter Williams
