public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH] sched: aggressive idle balance
@ 2004-11-02 20:16 Andrew Theurer
  2004-11-02 21:29 ` Chen, Kenneth W
  2004-11-02 22:34 ` Chen, Kenneth W
  0 siblings, 2 replies; 5+ messages in thread
From: Andrew Theurer @ 2004-11-02 20:16 UTC (permalink / raw)
  To: kenneth.w.chen, kernel, ricklind, Nick Piggin, mingo; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2229 bytes --]

Hello all,

This patch allows more aggressive idle balances, reducing idle time in 
scenarios where should not be any, where nr_running > nr_cpus.  We have seen 
this in a couple of online transaction workloads.  Three areas are targeted:

1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if 
the task->cpu is busy and the sibling is idle.  This has been expanded to any 
idle cpu, but the closest idle cpu is picked first by starting with cpu->sd, 
then going up the domains as necessary.

2) can_migrate() is modified.  Honestly part of it looked kind of backwards.  
We checked for task_hot() if the cpu was idle.  I would think we would do 
that if is was -not- idle.  Changes to can migrate are:
  a) same policies for running tasks and cpus_allowed, return 0
  b) We omit task_hot check for -whole- idle cpus (not just a sibling),
     when task_cpu(p) and this_cpu are member siblings in the same core,
     or when nr_balance_failed > cache_nice_tries
  c) finally, if this_cpu is busy we still do the task_hot check

3) Add SD_BALANCE_NEWIDLE to SD_NODE_INIT.  All this does is allow a newly 
idle cpu a broader span of cpus, only if necessary, to pull a task.  We still 
target cpus closest to the this_cpu, starting with this_cpu->sd, then as 
needed moving up the domains.  **This is vital for Opteron, where each node 
only has one cpu.  idle_balance() never works because it would never consider 
looking beyond the cpu->sd, because cpu->sd->sd did not have 
SD_BALANCE_NEWIDLE set.

So far we have seen 3-5% with these patches on online transaction workolads 
and no regressions on SDET.  Kenneth, I am particularly interested in using 
this with your increased cache_hot_time value, where you got your best 
throughput:

> cache_hot_time  | workload throughput
> --------------------------------------
> 25ms            - 111.1   (5% idle)

...but still had idle time.  Do you think you could try these patches with 
your 25ms cache_hot_time?  I think your workload could benefit from both the 
longer cache_hot_time for busy cpus, but more aggressive idle balances, 
hopefully driving your workload to 100% cpu utilization.

Any feedback greatly appreciated, thanks.

-Andrew Theurer


[-- Attachment #2: idle-balance-patch-2610rc1bk9 --]
[-- Type: text/x-diff, Size: 6611 bytes --]

diff -Naurp linux-2.6.10-rc1-bk9/include/asm-i386/topology.h linux-2.6.10-rc1-bk9-idle/include/asm-i386/topology.h
--- linux-2.6.10-rc1-bk9/include/asm-i386/topology.h	2004-11-02 18:06:26.000000000 -0800
+++ linux-2.6.10-rc1-bk9-idle/include/asm-i386/topology.h	2004-11-02 18:18:15.000000000 -0800
@@ -83,6 +83,8 @@ static inline cpumask_t pcibus_to_cpumas
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
+				| SD_BALANCE_NEWIDLE	\
+				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
diff -Naurp linux-2.6.10-rc1-bk9/include/asm-ia64/topology.h linux-2.6.10-rc1-bk9-idle/include/asm-ia64/topology.h
--- linux-2.6.10-rc1-bk9/include/asm-ia64/topology.h	2004-11-02 18:06:26.000000000 -0800
+++ linux-2.6.10-rc1-bk9-idle/include/asm-ia64/topology.h	2004-11-02 18:18:15.000000000 -0800
@@ -56,6 +56,8 @@ void build_cpu_to_node_map(void);
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
+				| SD_BALANCE_NEWIDLE	\
+				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
diff -Naurp linux-2.6.10-rc1-bk9/include/asm-ppc64/topology.h linux-2.6.10-rc1-bk9-idle/include/asm-ppc64/topology.h
--- linux-2.6.10-rc1-bk9/include/asm-ppc64/topology.h	2004-11-02 18:06:26.000000000 -0800
+++ linux-2.6.10-rc1-bk9-idle/include/asm-ppc64/topology.h	2004-11-02 18:18:15.000000000 -0800
@@ -51,6 +51,8 @@ static inline int node_to_first_cpu(int 
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
+				| SD_BALANCE_NEWIDLE	\
+				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
diff -Naurp linux-2.6.10-rc1-bk9/include/asm-x86_64/topology.h linux-2.6.10-rc1-bk9-idle/include/asm-x86_64/topology.h
--- linux-2.6.10-rc1-bk9/include/asm-x86_64/topology.h	2004-11-02 18:06:26.000000000 -0800
+++ linux-2.6.10-rc1-bk9-idle/include/asm-x86_64/topology.h	2004-11-02 18:18:15.000000000 -0800
@@ -46,7 +46,9 @@ static inline cpumask_t __pcibus_to_cpum
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
+				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
+				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
diff -Naurp linux-2.6.10-rc1-bk9/include/linux/topology.h linux-2.6.10-rc1-bk9-idle/include/linux/topology.h
--- linux-2.6.10-rc1-bk9/include/linux/topology.h	2004-11-02 18:06:26.000000000 -0800
+++ linux-2.6.10-rc1-bk9-idle/include/linux/topology.h	2004-11-02 18:17:21.000000000 -0800
@@ -120,6 +120,7 @@ static inline int __next_node_with_cpus(
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
+				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
diff -Naurp linux-2.6.10-rc1-bk9/kernel/sched.c linux-2.6.10-rc1-bk9-idle/kernel/sched.c
--- linux-2.6.10-rc1-bk9/kernel/sched.c	2004-11-02 18:06:47.000000000 -0800
+++ linux-2.6.10-rc1-bk9-idle/kernel/sched.c	2004-11-02 18:22:32.000000000 -0800
@@ -441,6 +441,31 @@ static inline void rq_unlock(runqueue_t 
 	spin_unlock_irq(&rq->lock);
 }
 
+#ifdef CONFIG_SCHED_SMT
+static int cpu_and_siblings_are_idle(int cpu)
+{
+	int sib;
+	for_each_cpu_mask(sib, cpu_sibling_map[cpu]) {
+		if (idle_cpu(sib))
+			continue;
+		return 0;
+	}
+
+	return 1;
+}
+
+static inline int cpus_are_siblings(int cpu1, int cpu2)
+{
+        cpumask_t tmp;
+        cpus_and(tmp, cpu_sibling_map[cpu1], cpu_sibling_map[cpu2]);
+        return cpus_weight(tmp);
+}
+#else
+#define cpus_are_siblings(A, B) (0)
+#define cpu_and_siblings_are_idle(A) idle_cpu(A)
+#endif
+
+
 #ifdef CONFIG_SCHEDSTATS
 /*
  * Called when a process is dequeued from the active array and given
@@ -923,9 +948,10 @@ static inline unsigned long target_load(
 #endif
 
 /*
- * wake_idle() is useful especially on SMT architectures to wake a
- * task onto an idle sibling if we would otherwise wake it onto a
- * busy sibling.
+ * wake_idle() will wake a task on an idle cpu if task->cpu is
+ * not idle and an idle cpu is available.  The span of cpus to
+ * search starts with cpus closest then further out as needed,
+ * so we always favor a closer, idle cpu.
  *
  * Returns the CPU we should wake onto.
  */
@@ -933,24 +959,23 @@ static inline unsigned long target_load(
 static int wake_idle(int cpu, task_t *p)
 {
 	cpumask_t tmp;
-	runqueue_t *rq = cpu_rq(cpu);
 	struct sched_domain *sd;
 	int i;
 
 	if (idle_cpu(cpu))
 		return cpu;
 
-	sd = rq->sd;
-	if (!(sd->flags & SD_WAKE_IDLE))
-		return cpu;
-
-	cpus_and(tmp, sd->span, p->cpus_allowed);
-
-	for_each_cpu_mask(i, tmp) {
-		if (idle_cpu(i))
-			return i;
+	for_each_domain(cpu, sd) {
+		if (sd->flags & SD_WAKE_IDLE) {
+			cpus_and(tmp, sd->span, cpu_online_map);
+			cpus_and(tmp, tmp, p->cpus_allowed);
+			for_each_cpu_mask(i, tmp) {
+				if (idle_cpu(i))
+					return i;
+			}
+		}
+		else break;
 	}
-
 	return cpu;
 }
 #else
@@ -1062,7 +1087,7 @@ static int try_to_wake_up(task_t * p, un
 out_set_cpu:
 	schedstat_inc(rq, ttwu_attempts);
 	new_cpu = wake_idle(new_cpu, p);
-	if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) {
+	if (new_cpu != cpu) {
 		schedstat_inc(rq, ttwu_moved);
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
@@ -1637,13 +1662,20 @@ int can_migrate_task(task_t *p, runqueue
 	if (!cpu_isset(this_cpu, p->cpus_allowed))
 		return 0;
 
-	/* Aggressive migration if we've failed balancing */
-	if (idle == NEWLY_IDLE ||
-			sd->nr_balance_failed < sd->cache_nice_tries) {
-		if (task_hot(p, rq->timestamp_last_tick, sd))
-			return 0;
-	}
+	/* 
+	 * Aggressive migration allowed if:
+	 * 1) the [whole] cpu is idle, or
+	 * 2) too many balance attempts have failed, or
+	 * 3) both cpus are siblings in the same core
+	 */
+
+	if (cpu_and_siblings_are_idle(this_cpu) || \
+			sd->nr_balance_failed > sd->cache_nice_tries || \
+			cpus_are_siblings(this_cpu, task_cpu(p)))
+		return 1;
 
+	if (task_hot(p, rq->timestamp_last_tick, sd))
+			return 0;
 	return 1;
 }
 
@@ -2058,23 +2090,6 @@ static inline void idle_balance(int this
 	}
 }
 
-#ifdef CONFIG_SCHED_SMT
-static int cpu_and_siblings_are_idle(int cpu)
-{
-	int sib;
-	for_each_cpu_mask(sib, cpu_sibling_map[cpu]) {
-		if (idle_cpu(sib))
-			continue;
-		return 0;
-	}
-
-	return 1;
-}
-#else
-#define cpu_and_siblings_are_idle(A) idle_cpu(A)
-#endif
-
-
 /*
  * active_load_balance is run by migration threads. It pushes running tasks
  * off the busiest CPU onto idle CPUs. It requires at least 1 task to be

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [RFC][PATCH] sched: aggressive idle balance
  2004-11-02 20:16 [RFC][PATCH] sched: aggressive idle balance Andrew Theurer
@ 2004-11-02 21:29 ` Chen, Kenneth W
  2004-11-02 22:34 ` Chen, Kenneth W
  1 sibling, 0 replies; 5+ messages in thread
From: Chen, Kenneth W @ 2004-11-02 21:29 UTC (permalink / raw)
  To: 'Andrew Theurer', kernel, ricklind, Nick Piggin, mingo
  Cc: linux-kernel

Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
> So far we have seen 3-5% with these patches on online transaction workolads
> and no regressions on SDET.  Kenneth, I am particularly interested in using
> this with your increased cache_hot_time value, where you got your best
> throughput:
>
> ...but still had idle time.  Do you think you could try these patches with
> your 25ms cache_hot_time?  I think your workload could benefit from both the
> longer cache_hot_time for busy cpus, but more aggressive idle balances,
> hopefully driving your workload to 100% cpu utilization.

Looks interesting, I will queue this up on our benchmark setup.

- Ken



^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [RFC][PATCH] sched: aggressive idle balance
  2004-11-02 20:16 [RFC][PATCH] sched: aggressive idle balance Andrew Theurer
  2004-11-02 21:29 ` Chen, Kenneth W
@ 2004-11-02 22:34 ` Chen, Kenneth W
  2004-11-03  2:28   ` Andrew Theurer
  1 sibling, 1 reply; 5+ messages in thread
From: Chen, Kenneth W @ 2004-11-02 22:34 UTC (permalink / raw)
  To: 'Andrew Theurer', kernel, ricklind, Nick Piggin, mingo
  Cc: linux-kernel

Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
> This patch allows more aggressive idle balances, reducing idle time in
> scenarios where should not be any, where nr_running > nr_cpus.  We have seen
> this in a couple of online transaction workloads.  Three areas are targeted:
>
> 1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
> the task->cpu is busy and the sibling is idle.  This has been expanded to any
> idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
> then going up the domains as necessary.

It occurs to me that half of the patch only applicable to HT, like the change
in wake_idle().  And also, do you really want to put that functionality in
wake_idle()?  Seems defeating the original intention of that function, which
only tries to wake up sibling cpu as far as how I understand the code.

My setup is 4-way SMP, no HT (4-way itanium2 processor), sorry, I won't be able
to tell you how this portion of the change affect online transaction workload.

- Ken



^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [RFC][PATCH] sched: aggressive idle balance
@ 2004-11-02 22:55 Chen, Kenneth W
  0 siblings, 0 replies; 5+ messages in thread
From: Chen, Kenneth W @ 2004-11-02 22:55 UTC (permalink / raw)
  To: 'Andrew Theurer', kernel, ricklind, 'Nick Piggin',
	mingo
  Cc: linux-kernel

Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>
> This patch allows more aggressive idle balances, reducing idle time in
> scenarios where should not be any, where nr_running > nr_cpus.  We have seen
> this in a couple of online transaction workloads.  Three areas are targeted:
>
> 1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
> the task->cpu is busy and the sibling is idle.  This has been expanded to any
> idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
> then going up the domains as necessary.

Chen, Kenneth W wrote on Tuesday, November 02, 2004 2:35 PM
> It occurs to me that half of the patch only applicable to HT, like the change
> in wake_idle().  And also, do you really want to put that functionality in
> wake_idle()?  Seems defeating the original intention of that function, which
> only tries to wake up sibling cpu as far as how I understand the code.
>
> My setup is 4-way SMP, no HT (4-way itanium2 processor), sorry, I won't be able
> to tell you how this portion of the change affect online transaction workload.


Move that functionality into try_to_wake_up directly.  I'm going to try this
on my setup.


--- kernel/sched.c.orig	2004-11-02 13:35:33.000000000 -0800
+++ kernel/sched.c	2004-11-02 14:51:08.000000000 -0800
@@ -1059,13 +1059,21 @@ static int try_to_wake_up(task_t * p, un
 				schedstat_inc(sd, ttwu_wake_balance);
 				goto out_set_cpu;
 			}
+		} else if (sd->flags & SD_WAKE_IDLE) {
+			cpus_and(tmp, sd->span, cpu_online_map);
+			cpus_and(tmp, tmp, p->cpus_allowed);
+			for_each_cpu_mask(i, tmp) {
+				if (idle_cpu(i)) {
+					new_cpu = i;
+					goto out_set_cpu;
+				}
+			}
 		}
 	}

 	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
 out_set_cpu:
 	schedstat_inc(rq, ttwu_attempts);
-	new_cpu = wake_idle(new_cpu, p);
 	if (new_cpu != cpu && cpu_isset(new_cpu, p->cpus_allowed)) {
 		schedstat_inc(rq, ttwu_moved);
 		set_task_cpu(p, new_cpu);



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC][PATCH] sched: aggressive idle balance
  2004-11-02 22:34 ` Chen, Kenneth W
@ 2004-11-03  2:28   ` Andrew Theurer
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Theurer @ 2004-11-03  2:28 UTC (permalink / raw)
  To: Chen, Kenneth W; +Cc: kernel, ricklind, Nick Piggin, mingo, linux-kernel

Chen, Kenneth W wrote:

>Andrew Theurer wrote on Tuesday, November 02, 2004 12:17 PM
>  
>
>>This patch allows more aggressive idle balances, reducing idle time in
>>scenarios where should not be any, where nr_running > nr_cpus.  We have seen
>>this in a couple of online transaction workloads.  Three areas are targeted:
>>
>>1) In try_to_wake_up(), wake_idle() is called to move the task to a sibling if
>>the task->cpu is busy and the sibling is idle.  This has been expanded to any
>>idle cpu, but the closest idle cpu is picked first by starting with cpu->sd,
>>then going up the domains as necessary.
>>    
>>
>
>It occurs to me that half of the patch only applicable to HT, like the change
>in wake_idle().  And also, do you really want to put that functionality in
>wake_idle()?  Seems defeating the original intention of that function, which
>only tries to wake up sibling cpu as far as how I understand the code.
>  
>
The patch (wake_idle()) should still make a difference with nonHT 
systems.  It is true that in the mainline code, only HT systems 
benefited from wake_idle().  In fact, wake_idle() probably did nothing 
at all for Itanium, but with this patch it will, since we scan for an 
idle cpu up the domains that have flag SD_WAKE_IDLE (and we addedd that 
flag to SD_CPU_INIT and SD_NODE_INIT).

The patch's intention is to make sure we don't leave idle cpus idle.  
This can happen because the rate at which tasks are activated and 
deactivated is much higher than the idle balance interval, and because 
the task_hot prevents migrations even when we do encounter an idle 
balance tick.

Honsetly, I think you should be able to see the benefits form all 
seciotns of this patch, even without HT.

-Andrew Theurer

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2004-11-03  2:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-11-02 20:16 [RFC][PATCH] sched: aggressive idle balance Andrew Theurer
2004-11-02 21:29 ` Chen, Kenneth W
2004-11-02 22:34 ` Chen, Kenneth W
2004-11-03  2:28   ` Andrew Theurer
  -- strict thread matches above, loose matches on Subject: below --
2004-11-02 22:55 Chen, Kenneth W

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox