* [RFC] (How to) Let idle CPUs sleep
@ 2005-05-07 18:27 Srivatsa Vaddagiri
2005-05-08 3:50 ` Rusty Russell
2005-05-08 13:31 ` Andi Kleen
0 siblings, 2 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-07 18:27 UTC (permalink / raw)
To: schwidefsky, jdike, Andrew Morton, Ingo Molnar, Nick Piggin,
Rusty Russell, rmk+lkml
Cc: linux-kernel, user-mode-linux-devel
Hello,
I need some inputs from the community (specifically from virtual
machine and embedded/power-management folks) on something that I am working on.
This is regarding cutting off the regular timer ticks when a CPU
becomes idle and does not have any timer set to expire in the "near"
term. Both CONFIG_VST and CONFIG_NO_IDLE_HZ deal with this, and both embedded and
virtualized platforms (e.g. UML/S390) benefit from it. For example, if 100s
of guests are running on a single box, then cutting off the useless HZ ticks
in the idle CPUs of all guests will lead to more efficient use of the host CPU's cycles.
Cutting off local timer ticks has an effect on the scheduler's load-balance
activity, and I am trying to see how best to reduce that impact.
Two solutions have been proposed so far:
A. As per Nick's suggestion, impose a max limit (say some 100 ms or
say a second, Nick?) on how long an idle CPU can avoid taking
local-timer ticks. As a result, the load imbalance could exist only
for this max duration, after which the sleeping CPU will wake up
and balance itself. If there is no imbalance, it can go and sleep
again for the max duration (a rough sketch of this idle path follows
after option B).
For example, let's say an idle CPU finds that it doesn't have any near timer
for the next 1 minute. Instead of letting it sleep for 1 minute in
a single stretch, we let it sleep in bursts of 100 msec (or whatever
max duration is chosen). This is still better than having
the idle CPU take HZ ticks per second.
As a special case, when all the CPUs of an image go idle, we
could consider completely shutting off local timer ticks
across all CPUs (till the next non-timer interrupt).
B. Don't impose any max limit on how long an idle CPU can sleep.
Here we let the idle CPU sleep as long as it wants. It is
woken up by a "busy" CPU when that CPU detects an imbalance; the
busy CPU acts as a watchdog here. If there are no such
busy CPUs, then nobody acts as a watchdog
and the idle CPUs sleep as long as they want. A possible watchdog
implementation has been discussed at:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111287808905764&w=2
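A minimal sketch of what option A's idle path could look like, assuming a
hypothetical MAX_IDLE_SLEEP cap and a hypothetical arch_sleep_until() that
stops the local tick until a given jiffy; next_timer_interrupt() and
nohz_cpu_mask are the existing kernel facilities referred to later in this
thread. This is only an illustration, not a proposed patch:

#define MAX_IDLE_SLEEP	msecs_to_jiffies(100)	/* cap on tickless sleep */

static void idle_sleep_bounded(void)
{
	unsigned long delta = next_timer_interrupt() - jiffies;

	if (delta > MAX_IDLE_SLEEP)
		delta = MAX_IDLE_SLEEP;		/* wake early to rebalance */

	/* advertise that this CPU stops taking local timer ticks */
	cpu_set(smp_processor_id(), nohz_cpu_mask);
	arch_sleep_until(jiffies + delta);	/* hypothetical: tickless sleep */
	cpu_clear(smp_processor_id(), nohz_cpu_mask);
}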
A is obviously simpler to implement than B!
Whether the two are more or less equally efficient is something that I don't know.
To help us decide which way to go, could I have some comments from the virtual
machine and embedded folks on which solution they prefer and why?
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-07 18:27 [RFC] (How to) Let idle CPUs sleep Srivatsa Vaddagiri
@ 2005-05-08 3:50 ` Rusty Russell
2005-05-08 4:14 ` Nick Piggin
2005-05-08 10:13 ` Arjan van de Ven
2005-05-08 13:31 ` Andi Kleen
1 sibling, 2 replies; 36+ messages in thread
From: Rusty Russell @ 2005-05-08 3:50 UTC (permalink / raw)
To: vatsa
Cc: schwidefsky, jdike, Andrew Morton, Ingo Molnar, Nick Piggin,
rmk+lkml, linux-kernel, user-mode-linux-devel
On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
> Two solutions have been proposed so far:
>
> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
> say a second, Nick?) on how long a idle CPU can avoid taking
> local-timer ticks. As a result, the load imbalance could exist only
> for this max duration, after which the sleeping CPU will wake up
> and balance itself. If there is no imbalance, it can go and sleep
> again for the max duration.
>
> For ex, lets say a idle CPU found that it doesn't have any near timer
> for the next 1 minute. Instead of letting it sleep for 1 minute in
> a single stretch, we let it sleep in bursts of 100 msec (or whatever
> is the max. duration chosen). This still is better than having
> the idle CPU take HZ ticks a second.
>
> As a special case, when all the CPUs of an image go idle, we
> could consider completely shutting off local timer ticks
> across all CPUs (till the next non-timer interrupt).
>
>
> B. Don't impose any max limit on how long a idle CPU can sleep.
> Here we let the idle CPU sleep as long as it wants. It is
> woken up by a "busy" CPU when it detects an imbalance. The
> busy CPU acts as a watchdog here. If there are no such
> busy CPUs, then it means that nobody will acts as watchdogs
> and idle CPUs sleep as long as they want. A possible watchdog
> implementation has been discussed at:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111287808905764&w=2
My preference would be the second: fix the scheduler so it doesn't rely
on regular polling. However, as long as the UP case runs with no timer
interrupts when idle, many people will be happy (eg. most embedded).
Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 3:50 ` Rusty Russell
@ 2005-05-08 4:14 ` Nick Piggin
2005-05-08 12:19 ` Srivatsa Vaddagiri
` (2 more replies)
2005-05-08 10:13 ` Arjan van de Ven
1 sibling, 3 replies; 36+ messages in thread
From: Nick Piggin @ 2005-05-08 4:14 UTC (permalink / raw)
To: Rusty Russell
Cc: vatsa, schwidefsky, jdike, Andrew Morton, Ingo Molnar, rmk+lkml,
linux-kernel, user-mode-linux-devel
Rusty Russell wrote:
> On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
>
>>Two solutions have been proposed so far:
>>
>> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
>> say a second, Nick?) on how long a idle CPU can avoid taking
Yeah, probably something around that order of magnitude. I suspect
you'll quickly reach a point where either other timers start
going off more frequently, and/or you simply get very quickly
diminishing returns on the amount of power saving gained from
increasing the period.
>> local-timer ticks. As a result, the load imbalance could exist only
>> for this max duration, after which the sleeping CPU will wake up
>> and balance itself. If there is no imbalance, it can go and sleep
>> again for the max duration.
>>
>> For ex, lets say a idle CPU found that it doesn't have any near timer
>> for the next 1 minute. Instead of letting it sleep for 1 minute in
>> a single stretch, we let it sleep in bursts of 100 msec (or whatever
>> is the max. duration chosen). This still is better than having
>> the idle CPU take HZ ticks a second.
>>
>> As a special case, when all the CPUs of an image go idle, we
>> could consider completely shutting off local timer ticks
>> across all CPUs (till the next non-timer interrupt).
>>
>>
>> B. Don't impose any max limit on how long a idle CPU can sleep.
>> Here we let the idle CPU sleep as long as it wants. It is
>> woken up by a "busy" CPU when it detects an imbalance. The
>> busy CPU acts as a watchdog here. If there are no such
>> busy CPUs, then it means that nobody will acts as watchdogs
>> and idle CPUs sleep as long as they want. A possible watchdog
>> implementation has been discussed at:
>>
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=111287808905764&w=2
>
>
> My preference would be the second: fix the scheduler so it doesn't rely
> on regular polling.
It is not so much a matter of "fixing" the scheduler as just adding
more heuristics. When are we too busy? When should we wake another
CPU? What if that CPU is an SMT sibling? What if it is across the
other side of the topology, and other CPUs closer to it are busy
as well? What if they're busy but not as busy as we are? etc.
We've already got that covered in the existing periodic pull balancing,
so instead of duplicating this logic and moving this extra work to busy
CPUs, we can just use the existing framework.
At least we should try method A first, and if that isn't good enough
(though I suspect it will be), then think about adding more complexity
to the scheduler.
> However, as long as the UP case runs with no timer
> interrupts when idle, many people will be happy (eg. most embedded).
>
Well in the UP case, both A and B should basically degenerate to the
same thing.
Probably the more important case for the scheduler is to be able to
turn off idle SMP hypervisor clients, Srivatsa?
--
SUSE Labs, Novell Inc.
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 3:50 ` Rusty Russell
2005-05-08 4:14 ` Nick Piggin
@ 2005-05-08 10:13 ` Arjan van de Ven
2005-05-08 13:33 ` Andi Kleen
1 sibling, 1 reply; 36+ messages in thread
From: Arjan van de Ven @ 2005-05-08 10:13 UTC (permalink / raw)
To: Rusty Russell
Cc: vatsa, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
Nick Piggin, rmk+lkml, linux-kernel, user-mode-linux-devel
On Sun, 2005-05-08 at 13:50 +1000, Rusty Russell wrote:
> My preference would be the second: fix the scheduler so it doesn't rely
> on regular polling. However, as long as the UP case runs with no timer
> interrupts when idle, many people will be happy (eg. most embedded).
Alternatively, if a CPU is idle a long time we could do a software-level
hotunplug on it (after setting it to the lowest possible frequency and
power state), and have some sort of thing that keeps track of "spare but
unplugged" CPUs and can plug CPUs back in on demand.
That would also be nice for all the virtual environments where this could
interact with the hypervisor etc.
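One way the "spare but unplugged" bookkeeping might look, assuming
CONFIG_HOTPLUG_CPU and the existing cpu_down()/cpu_up(); the spare_cpus mask
and both helpers below are hypothetical, and the policy that decides when to
call them (the "idleness threshold" discussed further down) is left out:

static cpumask_t spare_cpus;		/* software-unplugged idle CPUs */

static void park_idle_cpu(int cpu)
{
	if (cpu_down(cpu) == 0)		/* cpu_down() returns 0 on success */
		cpu_set(cpu, spare_cpus);
}

static void unpark_one_cpu(void)
{
	int cpu = first_cpu(spare_cpus);

	if (cpu < NR_CPUS && cpu_up(cpu) == 0)
		cpu_clear(cpu, spare_cpus);
}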
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 4:14 ` Nick Piggin
@ 2005-05-08 12:19 ` Srivatsa Vaddagiri
2005-05-09 6:27 ` Nick Piggin
2005-05-11 18:03 ` Tony Lindgren
2005-06-30 12:47 ` Srivatsa Vaddagiri
2 siblings, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-08 12:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
rmk+lkml, linux-kernel, user-mode-linux-devel
On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
> Yeah probably something around that order of magnitude. I suspect
> there will fast be a point where either you'll get other timers
> going off more frequently, and / or you simply get very quickly
> diminishing returns on the amount of power saving gained from
> increasing the period.
I am looking at it from the other perspective also, i.e., a virtualized
environment. Any amount of unnecessary timer ticks will lead to an equivalent
amount of unnecessary context switches among the guest OSes.
> It is not so much a matter of "fixing" the scheduler as just adding
> more heuristics. When are we too busy? When should we wake another
> CPU? What if that CPU is an SMT sibling? What if it is across the
> other side of the topology, and other CPUs closer to it are busy
> as well? What if they're busy but not as busy as we are? etc.
>
> We've already got that covered in the existing periodic pull balancing,
> so instead of duplicating this logic and moving this extra work to busy
> CPUs, we can just use the existing framework.
I don't think we have to duplicate the logic, just "reuse" whatever logic
exists (in find_busiest_group etc.). However, I do agree there is movement
of extra work to busy CPUs, but that is only to help the idle CPU sleep longer.
Whether it justifies the additional complexity or not is what this RFC is
about, I guess!
FWIW, I have also made some modifications in the original proposal
for reducing the watchdog workload (instead of the same non-idle CPU waking
up all the sleeping CPUs it finds in the same rebalance_tick, the work
is spread over multiple non-idle CPUs in different rebalance_ticks).
The new (lightly tested) patch is below in this mail.
> At least we should try method A first, and if that isn't good enough
> (though I suspect it will be), then think about adding more complexity
> to the scheduler.
What would be good to measure between the two approaches is the CPU utilization
(over a period of time - say 10 hrs) of somewhat lightly loaded SMP guest OSes
(i.e. some CPUs are idle and other CPUs of the same guest are not idle), when
multiple such guest OSes are running simultaneously on the same box. This
means I need a port of VST to UML :(
> Well in the UP case, both A and B should basically degenerate to the
> same thing.
I agree.
> Probably the more important case for the scheduler is to be able to
> turn off idle SMP hypervisor clients, Srivatsa?
True. To make a distinction, these SMP clients can be either completely
idle (all their CPUs idle) or partially idle (only a fraction of their CPUs idle).
It would be good to cater to both kinds of clients.
My latest watchdog implementation is below for reference:
---
linux-2.6.12-rc3-mm2-vatsa/include/linux/sched.h | 1
linux-2.6.12-rc3-mm2-vatsa/kernel/sched.c | 150 ++++++++++++++++++++++-
2 files changed, 146 insertions(+), 5 deletions(-)
diff -puN kernel/sched.c~sched-nohz kernel/sched.c
--- linux-2.6.12-rc3-mm2/kernel/sched.c~sched-nohz 2005-05-04 18:23:30.000000000 +0530
+++ linux-2.6.12-rc3-mm2-vatsa/kernel/sched.c 2005-05-07 22:09:04.000000000 +0530
@@ -1875,6 +1875,25 @@ out:
return pulled;
}
+static inline struct sched_domain *
+sched_domain_ptr(int dst_cpu, int src_cpu, struct sched_domain *src_ptr)
+{
+ struct sched_domain *tmp, *dst_ptr;
+
+ dst_ptr = cpu_rq(dst_cpu)->sd;
+
+ for_each_domain(src_cpu, tmp) {
+ if (tmp == src_ptr || !dst_ptr)
+ break;
+ dst_ptr = dst_ptr->parent;
+ }
+
+ if (tmp == NULL)
+ dst_ptr = NULL;
+
+ return dst_ptr;
+}
+
/*
* find_busiest_group finds and returns the busiest CPU group within the
* domain. It calculates and returns the number of tasks which should be
@@ -1882,11 +1901,18 @@ out:
*/
static struct sched_group *
find_busiest_group(struct sched_domain *sd, int this_cpu,
- unsigned long *imbalance, enum idle_type idle)
+ unsigned long *imbalance, enum idle_type idle,
+ cpumask_t *wakemaskp)
{
struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
unsigned long max_load, avg_load, total_load, this_load, total_pwr;
int load_idx;
+#ifdef CONFIG_NO_IDLE_HZ
+ int grp_sleeping = 0, woken = 0;
+ cpumask_t tmpmask;
+ struct sched_domain *sd1;
+ unsigned long interval;
+#endif
max_load = this_load = total_load = total_pwr = 0;
if (idle == NOT_IDLE)
@@ -1896,6 +1922,11 @@ find_busiest_group(struct sched_domain *
else
load_idx = sd->idle_idx;
+#ifdef CONFIG_NO_IDLE_HZ
+ if (wakemaskp)
+ cpus_clear(*wakemaskp);
+#endif
+
do {
unsigned long load;
int local_group;
@@ -1906,6 +1937,17 @@ find_busiest_group(struct sched_domain *
/* Tally up the load of all CPUs in the group */
avg_load = 0;
+#ifdef CONFIG_NO_IDLE_HZ
+ grp_sleeping = 0;
+ woken = 0;
+ if (wakemaskp && idle == NOT_IDLE) {
+ /* Are all CPUs in the group sleeping ? */
+ cpus_and(tmpmask, group->cpumask, nohz_cpu_mask);
+ if (cpus_equal(tmpmask, group->cpumask))
+ grp_sleeping = 1;
+ }
+#endif
+
for_each_cpu_mask(i, group->cpumask) {
/* Bias balancing toward cpus of our domain */
if (local_group)
@@ -1914,6 +1956,36 @@ find_busiest_group(struct sched_domain *
load = source_load(i, load_idx);
avg_load += load;
+
+#ifdef CONFIG_NO_IDLE_HZ
+ /* Try to find a CPU that can be woken up from the
+ * sleeping group. After we wake up one CPU, we will let
+ * it wakeup others in its group.
+ */
+ if (!grp_sleeping || woken)
+ continue;
+
+ sd1 = sched_domain_ptr(i, this_cpu, sd);
+
+ if (!sd1 || !(sd1->flags & SD_LOAD_BALANCE))
+ continue;
+
+ interval = sd1->balance_interval;
+ /* scale ms to jiffies */
+ interval = msecs_to_jiffies(interval);
+ if (unlikely(!interval))
+ interval = 1;
+
+ if (jiffies - sd1->last_balance >= interval) {
+ /* Lets record this CPU as a possible target
+ * to be woken up. Whether we actually wake it
+ * up or not depends on the CPU's imbalance wrt
+ * others in the domain.
+ */
+ woken = 1;
+ cpu_set(i, *wakemaskp);
+ }
+#endif
}
total_load += avg_load;
@@ -2050,11 +2122,15 @@ static int load_balance(int this_cpu, ru
unsigned long imbalance;
int nr_moved, all_pinned = 0;
int active_balance = 0;
+ cpumask_t wakemask;
+#ifdef CONFIG_NO_IDLE_HZ
+ struct sched_domain *sd1;
+#endif
spin_lock(&this_rq->lock);
schedstat_inc(sd, lb_cnt[idle]);
- group = find_busiest_group(sd, this_cpu, &imbalance, idle);
+ group = find_busiest_group(sd, this_cpu, &imbalance, idle, &wakemask);
if (!group) {
schedstat_inc(sd, lb_nobusyg[idle]);
goto out_balanced;
@@ -2130,9 +2206,11 @@ static int load_balance(int this_cpu, ru
sd->balance_interval *= 2;
}
- return nr_moved;
+ goto out_nohz;
out_balanced:
+ nr_moved = 0;
+
spin_unlock(&this_rq->lock);
schedstat_inc(sd, lb_balanced[idle]);
@@ -2143,7 +2221,36 @@ out_balanced:
(sd->balance_interval < sd->max_interval))
sd->balance_interval *= 2;
- return 0;
+out_nohz:
+#ifdef CONFIG_NO_IDLE_HZ
+ if (!cpus_empty(wakemask)) {
+ int i;
+
+ /* Lets try to wakeup one CPU from the mask. Rest of the cpus
+ * in the mask can be woken up by other CPUs when they do load
+ * balancing in this domain. That way, the overhead of watchdog
+ * functionality is spread across (non-idle) CPUs in the domain.
+ */
+
+ for_each_cpu_mask(i, wakemask) {
+
+ sd1 = sched_domain_ptr(i, this_cpu, sd);
+
+ if (!sd1)
+ continue;
+
+ find_busiest_group(sd1, i, &imbalance, SCHED_IDLE,
+ NULL);
+ if (imbalance > 0) {
+ spin_lock(&cpu_rq(i)->lock);
+ resched_task(cpu_rq(i)->idle);
+ spin_unlock(&cpu_rq(i)->lock);
+ break;
+ }
+ }
+ }
+#endif
+ return nr_moved;
}
/*
@@ -2162,7 +2269,7 @@ static int load_balance_newidle(int this
int nr_moved = 0;
schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
- group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
+ group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE, NULL);
if (!group) {
schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
goto out_balanced;
@@ -2323,6 +2430,39 @@ static void rebalance_tick(int this_cpu,
}
}
}
+
+#ifdef CONFIG_NO_IDLE_HZ
+/*
+ * Try hard to pull tasks. Called by idle task before it sleeps cutting off
+ * local timer ticks. This clears the various load counters and tries to pull
+ * tasks.
+ *
+ * Returns 1 if tasks were pulled over, 0 otherwise.
+ */
+int idle_balance_retry(void)
+{
+ int j, moved = 0, this_cpu = smp_processor_id();
+ runqueue_t *this_rq = this_rq();
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ for (j = 0; j < 3; j++)
+ this_rq->cpu_load[j] = 0;
+
+ rebalance_tick(this_cpu, this_rq, SCHED_IDLE);
+
+ if (this_rq->nr_running) {
+ moved = 1;
+ set_tsk_need_resched(current);
+ }
+
+ local_irq_restore(flags);
+
+ return moved;
+}
+#endif
+
#else
/*
* on UP we do not need to balance between CPUs:
diff -puN include/linux/sched.h~sched-nohz include/linux/sched.h
--- linux-2.6.12-rc3-mm2/include/linux/sched.h~sched-nohz 2005-05-04 18:23:30.000000000 +0530
+++ linux-2.6.12-rc3-mm2-vatsa/include/linux/sched.h 2005-05-04 18:23:37.000000000 +0530
@@ -897,6 +897,7 @@ extern int task_curr(const task_t *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern task_t *idle_task(int cpu);
+extern int idle_balance_retry(void);
void yield(void);
_
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-07 18:27 [RFC] (How to) Let idle CPUs sleep Srivatsa Vaddagiri
2005-05-08 3:50 ` Rusty Russell
@ 2005-05-08 13:31 ` Andi Kleen
2005-05-08 15:26 ` Srivatsa Vaddagiri
1 sibling, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2005-05-08 13:31 UTC (permalink / raw)
To: vatsa; +Cc: linux-kernel, user-mode-linux-devel
Srivatsa Vaddagiri <vatsa@in.ibm.com> writes:
> Hello,
> I need some inputs from the community (specifically from virtual
> machine and embedded/power-management folks) on something that I am working on.
I think the best way is to let other CPUs handle the load balancing
for idle CPUs. Basically, when a CPU goes fully idle you mark
this in some global data structure, and CPUs doing load balancing,
after doing their own thing, look for others that need to be balanced
too and handle them as well. When no CPU is left non-idle then nothing needs
to be load balanced anyway. When an idle CPU gets a task it just gets
a reschedule IPI as usual, which wakes it up.
I call this the "scoreboard".
The trick is to evenly load balance the work over the remaining CPUs.
Something simple like never doing work for more than 1/idlecpus is
probably enough. In theory one could even use machine NUMA topology
information for this, but that would probably be overkill for the
first implementation.
With the scoreboard implementation CPUs could be virtually idle
forever, which I think is best for virtualization.
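A sketch of how a busy CPU's "scoreboard" pass might look, reusing names from
this thread: nohz_cpu_mask as the global mark and the resched_task() wakeup
from the watchdog patch above; idle_cpu_needs_tasks() is a hypothetical
stand-in for the real imbalance check done in find_busiest_group():

/* Run by a busy CPU after balancing itself.  The modulo crudely spreads
 * the sleeping CPUs over the online CPUs so that no single busy CPU ends
 * up doing all of the extra watchdog work. */
static void scoreboard_balance(int this_cpu)
{
	int cpu;

	for_each_cpu_mask(cpu, nohz_cpu_mask) {
		if ((cpu % num_online_cpus()) != this_cpu)
			continue;	/* another busy CPU covers this one */

		if (idle_cpu_needs_tasks(cpu)) {	/* hypothetical check */
			spin_lock(&cpu_rq(cpu)->lock);
			resched_task(cpu_rq(cpu)->idle);	/* reschedule IPI */
			spin_unlock(&cpu_rq(cpu)->lock);
		}
	}
}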
BTW we need a very similar thing for RCU too.
-Andi
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 10:13 ` Arjan van de Ven
@ 2005-05-08 13:33 ` Andi Kleen
2005-05-08 13:44 ` Arjan van de Ven
0 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2005-05-08 13:33 UTC (permalink / raw)
To: Arjan van de Ven
Cc: vatsa, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
Nick Piggin, rmk+lkml, linux-kernel, user-mode-linux-devel
Arjan van de Ven <arjan@infradead.org> writes:
> On Sun, 2005-05-08 at 13:50 +1000, Rusty Russell wrote:
>> My preference would be the second: fix the scheduler so it doesn't rely
>> on regular polling. However, as long as the UP case runs with no timer
>> interrupts when idle, many people will be happy (eg. most embedded).
>
> alternatively; if a CPU is idle a long time we could do a software level
> hotunplug on it (after setting it to the lowest possible frequency and
> power state), and have some sort of thing that keeps track of "spare but
> unplugged" cpus that can plug cpus back in on demand.
We need to do this anyway for RCU, because fully idle CPUs don't go
through quiescent states and could stall the whole RCU system.
But it has to be *really* lightweight because these transitions can
happen a lot (consider a CPU that very often goes to sleep for a short time).
>
> That also be nice for all the virtual environments where this could
> interact with the hypervisor etc
I am not sure how useful it is to make this heavyweight by involving
more subsystems. I would try to keep the idle state as lightweight
as possible, to keep the cost of going to sleep/waking up low.
-Andi
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 13:33 ` Andi Kleen
@ 2005-05-08 13:44 ` Arjan van de Ven
2005-05-08 14:53 ` Andi Kleen
0 siblings, 1 reply; 36+ messages in thread
From: Arjan van de Ven @ 2005-05-08 13:44 UTC (permalink / raw)
To: Andi Kleen
Cc: vatsa, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
Nick Piggin, rmk+lkml, linux-kernel, user-mode-linux-devel
> But it has to be *really* lightweight because these transistion can
> happen a lot (consider a CPU that very often goes to sleep for a short time)
Lightweight is good of course. But even if it's medium weight, it just
means you need to be REALLY idle (e.g. for a longer time) for it to trigger.
I guess we need some sort of per-arch "idleness threshold" for this.
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 13:44 ` Arjan van de Ven
@ 2005-05-08 14:53 ` Andi Kleen
0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2005-05-08 14:53 UTC (permalink / raw)
To: Arjan van de Ven
Cc: vatsa, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
Nick Piggin, rmk+lkml, linux-kernel, user-mode-linux-devel
On Sun, May 08, 2005 at 03:44:14PM +0200, Arjan van de Ven wrote:
>
> > But it has to be *really* lightweight because these transistion can
> > happen a lot (consider a CPU that very often goes to sleep for a short time)
>
> lightweight is good of course. But even if it's medium weight.. it just
> means you need to be REALLY idle (eg for longer time) for it to trigger.
> I guess we need some sort of per arch "idleness threshhold" for this.
The question is how useful it is for the hypervisor to even know that.
Why can't it just detect long idle periods by itself if it really wants?
-Andi
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 13:31 ` Andi Kleen
@ 2005-05-08 15:26 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-08 15:26 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-kernel, user-mode-linux-devel, rusty, schwidefsky, jdike,
akpm, mingo, rmk+lkml, nickpiggin, Dipankar
On Sun, May 08, 2005 at 03:31:00PM +0200, Andi Kleen wrote:
> I think the best way is to let other CPUs handle the load balancing
> for idle CPUs. Basically when a CPU goes fully idle then you mark
> this in some global data structure,
nohz_cpu_mask already exists for this purpose.
> and CPUs doing load balancing after doing their own thing look for others
> that need to be balanced too and handle them too.
This is precisely what I had proposed in my watchdog implementation.
> When no CPU is left non idle then nothing needs to be load balanced anyways.
> When a idle CPU gets a task it just gets an reschedule IPI as usual, that
> wakes it up.
True.
>
> I call this the "scoreboard".
>
> The trick is to evenly load balance the work over the remaining CPUs.
> Something simple like never doing work for more than 1/idlecpus is
> probably enough.
Well, there is this imbalance_pct which acts as a trigger threshold below
which load balancing won't happen. I do take this into account before
waking up the sleeping idle CPU (the same imbalance_pct logic would
have been followed by the idle CPU if it were to continue taking timer
ticks).
So I guess your 1/idlecpus and the imbalance_pct may act along parallel lines.
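For reference, a schematic of the kind of imbalance_pct gate meant here; the
real comparison lives in find_busiest_group(), and the load arguments below
are placeholders:

static int worth_waking_sleeper(struct sched_domain *sd,
				unsigned long busiest_load,
				unsigned long sleeper_load)
{
	/* only wake the sleeping CPU when the busiest group exceeds its
	 * load by more than the domain's imbalance_pct percentage */
	return 100 * busiest_load > sd->imbalance_pct * sleeper_load;
}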
> In theory one could even use machine NUMA topology
> information for this, but that would be probably overkill for the
> first implementation.
>
> With the scoreboard implementation CPus could be virtually idle
> forever, which I think is best for virtualization.
>
> BTW we need a very similar thing for RCU too.
RCU is taken care of already, except it is broken: there is a small
race which is not fixed. The following patch (which I wrote against a 2.6.10
kernel, maybe) should fix that race. I intend to post this patch after testing
it against a more recent kernel.
--- kernel/rcupdate.c.org 2005-02-11 11:38:47.000000000 +0530
+++ kernel/rcupdate.c 2005-02-11 11:44:08.000000000 +0530
@@ -199,8 +199,11 @@ static void rcu_start_batch(struct rcu_c
*/
static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
{
+ cpumask_t tmpmask;
+
cpu_clear(cpu, rsp->cpumask);
- if (cpus_empty(rsp->cpumask)) {
+ cpus_andnot(tmpmask, rsp->cpumask, nohz_cpu_mask);
+ if (cpus_empty(tmpmask)) {
/* batch completed ! */
rcp->completed = rcp->cur;
rcu_start_batch(rcp, rsp, 0);
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 12:19 ` Srivatsa Vaddagiri
@ 2005-05-09 6:27 ` Nick Piggin
2005-05-12 8:38 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-05-09 6:27 UTC (permalink / raw)
To: vatsa
Cc: Rusty Russell, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
rmk+lkml, linux-kernel, user-mode-linux-devel
Srivatsa Vaddagiri wrote:
> On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
>
>>Yeah probably something around that order of magnitude. I suspect
>>there will fast be a point where either you'll get other timers
>>going off more frequently, and / or you simply get very quickly
>>diminishing returns on the amount of power saving gained from
>>increasing the period.
>
>
> I am looking at it from the other perspective also i.e, virtualized
> env. Any amount of unnecessary timer ticks will lead to equivalent amount
> of unnecessary context switches among the guest OSes.
>
Yep.
>
>>It is not so much a matter of "fixing" the scheduler as just adding
>>more heuristics. When are we too busy? When should we wake another
>>CPU? What if that CPU is an SMT sibling? What if it is across the
>>other side of the topology, and other CPUs closer to it are busy
>>as well? What if they're busy but not as busy as we are? etc.
>>
>>We've already got that covered in the existing periodic pull balancing,
>>so instead of duplicating this logic and moving this extra work to busy
>>CPUs, we can just use the existing framework.
>
>
> I don't think we have to duplicate the logic, just "reuse" whatever logic
> exists (in find_busiest_group etc). However I do agree there is movement
OK, that may possibly be an option... however:
> of extra work to busy CPUs, but that is only to help the idle CPU sleep longer.
> Whether it justifies the additional complexity or not is what this RFC is
> about I guess!
>
Yeah, this is a bit worrying. In general we should not be loading
up busy CPUs with any more work, and putting idle CPUs to sleep should be
done as a blunt "slowpath" operation, i.e. something that works well
enough.
> FWIW, I have also made some modifications in the original proposal
> for reducing the watchdog workload (instead of the same non-idle cpu waking
> up all the sleeping CPUs it finds in the same rebalance_tick, the task
> is spread over multiple non-idle tasks in different rebalance_ticks).
> New (lightly tested) patch is in the mail below.
>
Mmyeah, I'm not a big fan :)
I could probably find some time to do my implementation if you have
a complete working patch for eg. UML.
>
>
>>At least we should try method A first, and if that isn't good enough
>>(though I suspect it will be), then think about adding more complexity
>>to the scheduler.
>
>
> What would be good to measure between the two approaches is the CPU utilization
> (over a period of time - say 10 hrs) of somewhat lightly loaded SMP guest OSes
> (i.e some CPUs are idle and other CPUs of the same guest are not idle), when
> multiple such guest OSes are running simultaneously on the same box. This
> means I need a port of VST to UML :(
>
Yeah that would be good.
--
SUSE Labs, Novell Inc.
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 4:14 ` Nick Piggin
2005-05-08 12:19 ` Srivatsa Vaddagiri
@ 2005-05-11 18:03 ` Tony Lindgren
2005-05-12 8:46 ` Srivatsa Vaddagiri
2005-06-30 12:47 ` Srivatsa Vaddagiri
2 siblings, 1 reply; 36+ messages in thread
From: Tony Lindgren @ 2005-05-11 18:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, vatsa, schwidefsky, jdike, Andrew Morton,
Ingo Molnar, rmk+lkml, linux-kernel, user-mode-linux-devel
* Nick Piggin <nickpiggin@yahoo.com.au> [050507 21:15]:
> Rusty Russell wrote:
> >On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
> >
> >>Two solutions have been proposed so far:
> >>
> >> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
> >> say a second, Nick?) on how long a idle CPU can avoid taking
>
> Yeah probably something around that order of magnitude. I suspect
> there will fast be a point where either you'll get other timers
> going off more frequently, and / or you simply get very quickly
> diminishing returns on the amount of power saving gained from
> increasing the period.
>
> >> local-timer ticks. As a result, the load imbalance could exist
> >> only
> >> for this max duration, after which the sleeping CPU will wake up
> >> and balance itself. If there is no imbalance, it can go and sleep
> >> again for the max duration.
> >>
> >> For ex, lets say a idle CPU found that it doesn't have any near
> >> timer
> >> for the next 1 minute. Instead of letting it sleep for 1 minute in
> >> a single stretch, we let it sleep in bursts of 100 msec (or
> >> whatever
> >> is the max. duration chosen). This still is better than having
> >> the idle CPU take HZ ticks a second.
> >>
> >> As a special case, when all the CPUs of an image go idle, we
> >> could consider completely shutting off local timer ticks
> >> across all CPUs (till the next non-timer interrupt).
> >>
> >>
> >> B. Don't impose any max limit on how long a idle CPU can sleep.
> >> Here we let the idle CPU sleep as long as it wants. It is
> >> woken up by a "busy" CPU when it detects an imbalance. The
> >> busy CPU acts as a watchdog here. If there are no such
> >> busy CPUs, then it means that nobody will acts as watchdogs
> >> and idle CPUs sleep as long as they want. A possible watchdog
> >> implementation has been discussed at:
> >>
> >> http://marc.theaimsgroup.com/?l=linux-kernel&m=111287808905764&w=2
> >
> >
> >My preference would be the second: fix the scheduler so it doesn't rely
> >on regular polling.
>
> It is not so much a matter of "fixing" the scheduler as just adding
> more heuristics. When are we too busy? When should we wake another
> CPU? What if that CPU is an SMT sibling? What if it is across the
> other side of the topology, and other CPUs closer to it are busy
> as well? What if they're busy but not as busy as we are? etc.
>
> We've already got that covered in the existing periodic pull balancing,
> so instead of duplicating this logic and moving this extra work to busy
> CPUs, we can just use the existing framework.
>
> At least we should try method A first, and if that isn't good enough
> (though I suspect it will be), then think about adding more complexity
> to the scheduler.
>
> > However, as long as the UP case runs with no timer
> >interrupts when idle, many people will be happy (eg. most embedded).
> >
>
> Well in the UP case, both A and B should basically degenerate to the
> same thing.
>
> Probably the more important case for the scheduler is to be able to
> turn off idle SMP hypervisor clients, Srivatsa?
Sorry to jump in late. For embedded stuff we should be able to skip
ticks until something _really_ happens, like an interrupt.
So we need to be able to skip ticks several seconds at a time. Ticks
should be event driven. For embedded systems option B is really
the only way to go to take advantage of the power savings.
Of course the situation is different on servers, where the goal is
to save ticks to be able to run more virtual machines. There,
cutting the ticks down to a few per second does the trick.
Regards,
Tony
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-09 6:27 ` Nick Piggin
@ 2005-05-12 8:38 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-12 8:38 UTC (permalink / raw)
To: Nick Piggin; +Cc: schwidefsky, Ingo Molnar, linux-kernel
On Mon, May 09, 2005 at 04:27:26PM +1000, Nick Piggin wrote:
> I could probably find some time to do my implementation if you have
> a complete working patch for eg. UML.
Well, it turns out that if we restrict the amount of time idle CPUs are
allowed to sleep, then very little change is required in the scheduler.
Most of the calculation of exponential sleep times can be done outside
it (in the idle CPU's code).
First, the scheduler support to zero the cpu_load[] counters before an idle
CPU sleeps:
---
linux-2.6.12-rc3-mm3-vatsa/include/linux/sched.h | 1
linux-2.6.12-rc3-mm3-vatsa/kernel/sched.c | 33 +++++++++++++++++++++++
2 files changed, 34 insertions(+)
diff -puN kernel/sched.c~sched-nohz kernel/sched.c
--- linux-2.6.12-rc3-mm3/kernel/sched.c~sched-nohz 2005-05-11 17:05:13.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/kernel/sched.c 2005-05-11 17:06:38.000000000 +0530
@@ -2323,6 +2323,39 @@ static void rebalance_tick(int this_cpu,
}
}
}
+
+#ifdef CONFIG_NO_IDLE_HZ
+/*
+ * Try hard to pull tasks. Called by idle task before it sleeps cutting off
+ * local timer ticks. This clears the various load counters and tries to pull
+ * tasks.
+ *
+ * Returns 1 if tasks were pulled over, 0 otherwise.
+ */
+int idle_balance_retry(void)
+{
+ int j, moved = 0, this_cpu = smp_processor_id();
+ runqueue_t *this_rq = this_rq();
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ for (j = 0; j < 3; j++)
+ this_rq->cpu_load[j] = 0;
+
+ rebalance_tick(this_cpu, this_rq, SCHED_IDLE);
+
+ if (this_rq->nr_running) {
+ moved = 1;
+ set_tsk_need_resched(current);
+ }
+
+ local_irq_restore(flags);
+
+ return moved;
+}
+#endif
+
#else
/*
* on UP we do not need to balance between CPUs:
diff -puN include/linux/sched.h~sched-nohz include/linux/sched.h
--- linux-2.6.12-rc3-mm3/include/linux/sched.h~sched-nohz 2005-05-11 17:05:13.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/include/linux/sched.h 2005-05-11 17:13:19.000000000 +0530
@@ -897,6 +897,7 @@ extern int task_curr(const task_t *p);
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern task_t *idle_task(int cpu);
+extern int idle_balance_retry(void);
void yield(void);
_
A sample patch that implements exponential sleep times is below. Note that this
patch only makes the idle CPU pretend to be asleep (instead of really cutting
off timer ticks); I used it merely to test the scheduler change.
Martin,
You probably need something like this for the S390 arch!
---
linux-2.6.12-rc3-mm3-vatsa/arch/i386/Kconfig | 4 +
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/apic.c | 16 ++++--
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/irq.c | 4 +
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/process.c | 47 ++++++++++++++++--
linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/smp.c | 6 ++
5 files changed, 69 insertions(+), 8 deletions(-)
diff -puN arch/i386/Kconfig~vst-sim arch/i386/Kconfig
--- linux-2.6.12-rc3-mm3/arch/i386/Kconfig~vst-sim 2005-05-10 15:53:33.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/Kconfig 2005-05-10 15:54:22.000000000 +0530
@@ -443,6 +443,10 @@ config X86_OOSTORE
depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR
default y
+config NO_IDLE_HZ
+ bool "Tickless Idle CPUs support"
+ default n
+
config HPET_TIMER
bool "HPET Timer Support"
help
diff -puN arch/i386/kernel/process.c~vst-sim arch/i386/kernel/process.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/process.c~vst-sim 2005-05-10 15:53:34.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/process.c 2005-05-12 14:06:16.000000000 +0530
@@ -94,6 +94,12 @@ void enable_hlt(void)
EXPORT_SYMBOL(enable_hlt);
+DEFINE_PER_CPU(int, idle_asleep);
+DEFINE_PER_CPU(unsigned long, sleep_duration);
+
+#define MAX_SLEEP_DURATION 128 /* in tick counts */
+#define MIN_SLEEP_DURATION 8 /* in tick counts */
+
/*
* We use this if we don't have any better
* idle routine..
@@ -102,8 +108,36 @@ void default_idle(void)
{
if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
local_irq_disable();
- if (!need_resched())
- safe_halt();
+ if (!need_resched()) {
+ unsigned long jif_next, jif_delta;
+
+ jif_next = next_timer_interrupt();
+ jif_delta = jif_next - jiffies;
+
+ if (jif_delta > MIN_SLEEP_DURATION) {
+ unsigned long slpint;
+
+ if (idle_balance_retry()) {
+ local_irq_enable();
+ return;
+ }
+
+ slpint = min(__get_cpu_var(sleep_duration),
+ jif_delta);
+
+ jif_next = jiffies + slpint;
+ /* Hack to discard local timer ticks */
+ __get_cpu_var(idle_asleep) = 1;
+ cpu_set(smp_processor_id(), nohz_cpu_mask);
+ local_irq_enable();
+ while ((jiffies < jif_next-1) &&
+ __get_cpu_var(idle_asleep))
+ cpu_relax();
+ __get_cpu_var(idle_asleep) = 0;
+ cpu_clear(smp_processor_id(), nohz_cpu_mask);
+ } else
+ safe_halt();
+ }
else
local_irq_enable();
} else {
@@ -178,6 +212,8 @@ void cpu_idle(void)
{
int cpu = _smp_processor_id();
+ __get_cpu_var(sleep_duration) = MIN_SLEEP_DURATION;
+
/* endless idle loop with no priority at all */
while (1) {
while (!need_resched()) {
@@ -189,7 +225,7 @@ void cpu_idle(void)
rmb();
idle = pm_idle;
- if (!idle)
+ //if (!idle)
idle = default_idle;
if (cpu_is_offline(cpu))
@@ -197,7 +233,12 @@ void cpu_idle(void)
__get_cpu_var(irq_stat).idle_timestamp = jiffies;
idle();
+
+ if (__get_cpu_var(sleep_duration) < MAX_SLEEP_DURATION)
+ __get_cpu_var(sleep_duration) *= 2;
+
}
+ __get_cpu_var(sleep_duration) = MIN_SLEEP_DURATION;
schedule();
}
}
diff -puN arch/i386/kernel/irq.c~vst-sim arch/i386/kernel/irq.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/irq.c~vst-sim 2005-05-10 15:53:34.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/irq.c 2005-05-10 15:53:47.000000000 +0530
@@ -46,6 +46,8 @@ static union irq_ctx *hardirq_ctx[NR_CPU
static union irq_ctx *softirq_ctx[NR_CPUS];
#endif
+DECLARE_PER_CPU(int, idle_asleep);
+
/*
* do_IRQ handles all normal device IRQ's (the special
* SMP cross-CPU interrupts have their own specific
@@ -60,6 +62,8 @@ fastcall unsigned int do_IRQ(struct pt_r
u32 *isp;
#endif
+ __get_cpu_var(idle_asleep) = 0;
+
irq_enter();
#ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
diff -puN arch/i386/kernel/smp.c~vst-sim arch/i386/kernel/smp.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/smp.c~vst-sim 2005-05-11 16:59:38.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/smp.c 2005-05-11 16:59:58.000000000 +0530
@@ -309,6 +309,8 @@ static inline void leave_mm (unsigned lo
* 2) Leave the mm if we are in the lazy tlb mode.
*/
+DECLARE_PER_CPU(int, idle_asleep);
+
fastcall void smp_invalidate_interrupt(struct pt_regs *regs)
{
unsigned long cpu;
@@ -336,6 +338,7 @@ fastcall void smp_invalidate_interrupt(s
leave_mm(cpu);
}
ack_APIC_irq();
+ __get_cpu_var(idle_asleep) = 0;
smp_mb__before_clear_bit();
cpu_clear(cpu, flush_cpumask);
smp_mb__after_clear_bit();
@@ -598,6 +601,8 @@ void smp_send_stop(void)
fastcall void smp_reschedule_interrupt(struct pt_regs *regs)
{
ack_APIC_irq();
+
+ __get_cpu_var(idle_asleep) = 0;
}
fastcall void smp_call_function_interrupt(struct pt_regs *regs)
@@ -607,6 +612,7 @@ fastcall void smp_call_function_interrup
int wait = call_data->wait;
ack_APIC_irq();
+ __get_cpu_var(idle_asleep) = 0;
/*
* Notify initiating CPU that I've grabbed the data and am
* about to execute the function
diff -puN arch/i386/kernel/apic.c~vst-sim arch/i386/kernel/apic.c
--- linux-2.6.12-rc3-mm3/arch/i386/kernel/apic.c~vst-sim 2005-05-10 15:53:36.000000000 +0530
+++ linux-2.6.12-rc3-mm3-vatsa/arch/i386/kernel/apic.c 2005-05-10 15:53:47.000000000 +0530
@@ -1171,6 +1171,8 @@ inline void smp_local_timer_interrupt(st
*/
}
+DECLARE_PER_CPU(int, idle_asleep);
+
/*
* Local APIC timer interrupt. This is the most natural way for doing
* local interrupts, but local timer interrupts can be emulated by
@@ -1185,15 +1187,19 @@ fastcall void smp_apic_timer_interrupt(s
int cpu = smp_processor_id();
/*
- * the NMI deadlock-detector uses this.
- */
- per_cpu(irq_stat, cpu).apic_timer_irqs++;
-
- /*
* NOTE! We'd better ACK the irq immediately,
* because timer handling can be slow.
*/
ack_APIC_irq();
+
+ if (__get_cpu_var(idle_asleep))
+ return;
+
+ /*
+ * the NMI deadlock-detector uses this.
+ */
+ per_cpu(irq_stat, cpu).apic_timer_irqs++;
+
/*
* update_process_times() expects us to have done irq_enter().
* Besides, if we don't timer interrupts ignore the global
_
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-11 18:03 ` Tony Lindgren
@ 2005-05-12 8:46 ` Srivatsa Vaddagiri
2005-05-12 16:01 ` Lee Revell
0 siblings, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-12 8:46 UTC (permalink / raw)
To: Tony Lindgren; +Cc: Nick Piggin, schwidefsky, jdike, Ingo Molnar, linux-kernel
On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> Sorry to jump in late. For embedded stuff we should be able to skip
> ticks until something _really_ happens, like an interrupt.
>
> So we need to be able to skip ticks several seconds at a time. Ticks
> should be event driven. For embedded systems option B is really
> the only way to go to take advantage of the power savings.
I don't know how sensitive embedded platforms are to load imbalance.
If they are not sensitive, then we could let the max time idle
CPUs are allowed to sleep be a few seconds. That way, an idle CPU
wakes up once every 3 or 4 seconds to check for imbalance and still
saves power for the 3-4 seconds that it sleeps.
I guess it is a tradeoff between the complexity we want to
put in and the real benefit we get. It's hard for me to get the numbers
(since I don't have easy access to the right tools to measure them)
to show how big the real benefit is :(
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 8:46 ` Srivatsa Vaddagiri
@ 2005-05-12 16:01 ` Lee Revell
2005-05-12 16:16 ` Tony Lindgren
0 siblings, 1 reply; 36+ messages in thread
From: Lee Revell @ 2005-05-12 16:01 UTC (permalink / raw)
To: vatsa
Cc: Tony Lindgren, Nick Piggin, schwidefsky, jdike, Ingo Molnar,
linux-kernel
On Thu, 2005-05-12 at 14:16 +0530, Srivatsa Vaddagiri wrote:
> On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> > Sorry to jump in late. For embedded stuff we should be able to skip
> > ticks until something _really_ happens, like an interrupt.
> >
> > So we need to be able to skip ticks several seconds at a time. Ticks
> > should be event driven. For embedded systems option B is really
> > the only way to go to take advantage of the power savings.
>
> I don't know how sensitive embedded platforms are to load imbalance.
> If they are not sensitive, then we could let the max time idle
> cpus are allowed to sleep to be few seconds. That way, idle CPU
> wakes up once in 3 or 4 seconds to check for imbalance and still
> be able to save power for those 3/4 seconds that it sleeps.
Not very. Embedded systems are usually UP so don't care at all. If an
embedded system is SMP often it's because one CPU is dedicated to RT
tasks, and this model will become less common as RT preemption allows
you to do everything on a single processor.
Lee
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 16:01 ` Lee Revell
@ 2005-05-12 16:16 ` Tony Lindgren
2005-05-12 16:28 ` Jesse Barnes
0 siblings, 1 reply; 36+ messages in thread
From: Tony Lindgren @ 2005-05-12 16:16 UTC (permalink / raw)
To: Lee Revell
Cc: vatsa, Nick Piggin, schwidefsky, jdike, Ingo Molnar, linux-kernel
* Lee Revell <rlrevell@joe-job.com> [050512 09:05]:
> On Thu, 2005-05-12 at 14:16 +0530, Srivatsa Vaddagiri wrote:
> > On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> > > Sorry to jump in late. For embedded stuff we should be able to skip
> > > ticks until something _really_ happens, like an interrupt.
> > >
> > > So we need to be able to skip ticks several seconds at a time. Ticks
> > > should be event driven. For embedded systems option B is really
> > > the only way to go to take advantage of the power savings.
> >
> > I don't know how sensitive embedded platforms are to load imbalance.
> > If they are not sensitive, then we could let the max time idle
> > cpus are allowed to sleep to be few seconds. That way, idle CPU
> > wakes up once in 3 or 4 seconds to check for imbalance and still
> > be able to save power for those 3/4 seconds that it sleeps.
>
> Not very. Embedded systems are usually UP so don't care at all. If an
> embedded system is SMP often it's because one CPU is dedicated to RT
> tasks, and this model will become less common as RT preemption allows
> you to do everything on a single processor.
Yes, mostly UP, and it sounds like this only affects MP systems,
although there are ARM MP support patches available.
I guess embedded MP systems may be used for multimedia
stuff eventually.
Tony
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 16:16 ` Tony Lindgren
@ 2005-05-12 16:28 ` Jesse Barnes
2005-05-12 17:12 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 36+ messages in thread
From: Jesse Barnes @ 2005-05-12 16:28 UTC (permalink / raw)
To: Tony Lindgren
Cc: Lee Revell, vatsa, Nick Piggin, schwidefsky, jdike, Ingo Molnar,
linux-kernel
On Thursday, May 12, 2005 9:16 am, Tony Lindgren wrote:
> * Lee Revell <rlrevell@joe-job.com> [050512 09:05]:
> > On Thu, 2005-05-12 at 14:16 +0530, Srivatsa Vaddagiri wrote:
> > > On Wed, May 11, 2005 at 11:03:49AM -0700, Tony Lindgren wrote:
> > > > Sorry to jump in late. For embedded stuff we should be able to
> > > > skip ticks until something _really_ happens, like an interrupt.
> > > >
> > > > So we need to be able to skip ticks several seconds at a time.
> > > > Ticks should be event driven. For embedded systems option B is
> > > > really the only way to go to take advantage of the power
> > > > savings.
That seems like a lot of added complexity. Isn't it possible to go
totally tickless and actually *remove* some of the complexity of the
current design? I know Linus has frowned upon the idea in the past,
but I've had to deal with the tick code a bit in the past, and it seems
like getting rid of ticks entirely might actually simplify things (both
conceptually and code-wise).
Seems like we could schedule timer interrupts based solely on add_timer
type stuff; the scheduler could use it if necessary for load balancing
(along with fork/exec based balancing perhaps) on large machines where
load imbalances hurt throughput a lot. But on small systems if all
your processes were blocked, you'd just go to sleep indefinitely and
save a bit of power and avoid unnecessary overhead.
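A sketch of that event-driven idea, assuming the existing
next_timer_interrupt() (also used in the VST patch earlier in this thread) and
a hypothetical arch_program_timer() that arms a one-shot timer interrupt:

static void tickless_idle(void)
{
	unsigned long next = next_timer_interrupt();

	/* arm a single interrupt for the next queued timer instead of
	 * waking up every 1/HZ; any other interrupt still ends the sleep */
	arch_program_timer(next);
	safe_halt();			/* sleep until an interrupt arrives */
}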
I haven't looked at the latest tickless patches, so I'm not sure if my
claims of simplicity are overblown, but especially as multiprocessor
systems become more and more common it just seems wasteful to wake up
all the CPUs every so often only to have them find that they have
nothing to do.
Jesse
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 16:28 ` Jesse Barnes
@ 2005-05-12 17:12 ` Srivatsa Vaddagiri
2005-05-12 17:59 ` Jesse Barnes
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-12 17:12 UTC (permalink / raw)
To: Jesse Barnes
Cc: Tony Lindgren, Lee Revell, Nick Piggin, schwidefsky, jdike,
Ingo Molnar, linux-kernel, george
On Thu, May 12, 2005 at 09:28:55AM -0700, Jesse Barnes wrote:
> Seems like we could schedule timer interrupts based solely on add_timer
> type stuff; the scheduler could use it if necessary for load balancing
> (along with fork/exec based balancing perhaps) on large machines where
> load imbalances hurt throughput a lot. But on small systems if all
Even if we were to go for this tickless design, the fundamental question
remains: who wakes up the (sleeping) idle CPU upon an imbalance? Does some other
(busy) CPU wake it up (which makes the implementation complex), or does the idle CPU
check for imbalance itself at periodic intervals (which restricts the amount of
time an idle CPU may sleep)?
> your processes were blocked, you'd just go to sleep indefinitely and
> save a bit of power and avoid unnecessary overhead.
>
> I haven't looked at the lastest tickless patches, so I'm not sure if my
> claims of simplicity are overblown, but especially as multiprocessor
> systems become more and more common it just seems wasteful to wakeup
> all the CPUs every so often only to have them find that they have
> nothing to do.
I guess George's experience in implementing tickless systems is that
it is more of an overhead for a general purpose OS like Linux. George?
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 17:12 ` Srivatsa Vaddagiri
@ 2005-05-12 17:59 ` Jesse Barnes
2005-05-12 18:16 ` Tony Lindgren
2005-05-13 6:27 ` Srivatsa Vaddagiri
2005-05-12 18:08 ` Martin Schwidefsky
2005-05-12 21:16 ` George Anzinger
2 siblings, 2 replies; 36+ messages in thread
From: Jesse Barnes @ 2005-05-12 17:59 UTC (permalink / raw)
To: vatsa
Cc: Tony Lindgren, Lee Revell, Nick Piggin, schwidefsky, jdike,
Ingo Molnar, linux-kernel, george
Srivatsa Vaddagiri wrote:
>Even if we were to go for this tickless design, the fundamental question
>remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some other
>(busy) CPU wake it up (which makes the implementation complex) or the idle CPU
>checks imbalance itself at periodic intervals (which restricts the amount of
>time a idle CPU may sleep).
>
>
Waking it up at fork or exec time might be doable, and having a busy CPU
wake up other CPUs wouldn't add too much complexity, would it?
>I guess George's experience in implementing tickless systems is that
>it is more of an overhead for a general purpose OS like Linux. George?
>
>
The latest patches seem to do tick skipping rather than wholesale
ticklessness. Admittedly, the latter is a more invasive change, but one
that may end up being simpler in the long run. But maybe George did a
design like that in the past and rejected it?
Jesse
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 17:12 ` Srivatsa Vaddagiri
2005-05-12 17:59 ` Jesse Barnes
@ 2005-05-12 18:08 ` Martin Schwidefsky
2005-05-12 18:21 ` Tony Lindgren
2005-05-13 6:23 ` Srivatsa Vaddagiri
2005-05-12 21:16 ` George Anzinger
2 siblings, 2 replies; 36+ messages in thread
From: Martin Schwidefsky @ 2005-05-12 18:08 UTC (permalink / raw)
To: vatsa
Cc: george, jdike, Jesse Barnes, linux-kernel, Ingo Molnar,
Nick Piggin, Lee Revell, Tony Lindgren
> > Seems like we could schedule timer interrupts based solely on add_timer
> > type stuff; the scheduler could use it if necessary for load balancing
> > (along with fork/exec based balancing perhaps) on large machines where
> > load imbalances hurt throughput a lot. But on small systems if all
>
> Even if we were to go for this tickless design, the fundamental question
> remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some
> other (busy) CPU wake it up (which makes the implementation complex) or the
> idle CPU checks imbalance itself at periodic intervals (which restricts the
> amount of time a idle CPU may sleep).
I would prefer a solution where the busy CPU wakes up an idle CPU if the
imbalance is too large. Any scheme that requires an idle CPU to poll at
some interval will have one of two problems: either the poll interval
is large, in which case the imbalance will stay around for a long time, or the
poll interval is small, in which case this will behave badly in a heavily
virtualized environment with many images.
> > your processes were blocked, you'd just go to sleep indefinitely and
> > save a bit of power and avoid unnecessary overhead.
> >
> > I haven't looked at the lastest tickless patches, so I'm not sure if my
> > claims of simplicity are overblown, but especially as multiprocessor
> > systems become more and more common it just seems wasteful to wakeup
> > all the CPUs every so often only to have them find that they have
> > nothing to do.
>
> I guess George's experience in implementing tickless systems is that
> it is more of an overhead for a general purpose OS like Linux. George?
The original implementation of a tickless system introduced an overhead
in the system call path. The recent solution is tickless only while idle,
which does not have that overhead any longer. The second source of
overhead is the need to reprogram the timer interrupt source. That
can be expensive (see i386) or cheap (see s390). So it depends, as usual.
In our experience on s390, tickless operation is a huge win.
blue skies,
Martin
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 17:59 ` Jesse Barnes
@ 2005-05-12 18:16 ` Tony Lindgren
2005-05-13 6:27 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 36+ messages in thread
From: Tony Lindgren @ 2005-05-12 18:16 UTC (permalink / raw)
To: Jesse Barnes
Cc: vatsa, Lee Revell, Nick Piggin, schwidefsky, jdike, Ingo Molnar,
linux-kernel, george
* Jesse Barnes <jbarnes@virtuousgeek.org> [050512 11:01]:
> Srivatsa Vaddagiri wrote:
>
> >Even if we were to go for this tickless design, the fundamental question
> >remains: who wakes up the (sleeping) idle CPU upon a imbalance? Does some
> >other
> >(busy) CPU wake it up (which makes the implementation complex) or the idle
> >CPU checks imbalance itself at periodic intervals (which restricts the
> >amount of
> >time a idle CPU may sleep).
> >
> >
> Waking it up at fork or exec time might be doable, and having a busy CPU
> wake up other CPUs wouldn't add too much complexity, would it?
At least then it would be event driven rather than a polling approach.
Tony
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 18:08 ` Martin Schwidefsky
@ 2005-05-12 18:21 ` Tony Lindgren
2005-05-13 6:23 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 36+ messages in thread
From: Tony Lindgren @ 2005-05-12 18:21 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: vatsa, george, jdike, Jesse Barnes, linux-kernel, Ingo Molnar,
Nick Piggin, Lee Revell
* Martin Schwidefsky <schwidefsky@de.ibm.com> [050512 11:15]:
> > > Seems like we could schedule timer interrupts based solely on add_timer
>
> > > type stuff; the scheduler could use it if necessary for load balancing
> > > (along with fork/exec based balancing perhaps) on large machines where
> > > load imbalances hurt throughput a lot. But on small systems if all
> >
> > Even if we were to go for this tickless design, the fundamental question
> > remains: who wakes up the (sleeping) idle CPU upon an imbalance? Does some
> > other (busy) CPU wake it up (which makes the implementation complex), or
> > does the idle CPU check for imbalance itself at periodic intervals (which
> > restricts the amount of time an idle CPU may sleep)?
>
> I would prefer a solution where the busy CPU wakes up an idle CPU if the
> imbalance is too large. Any scheme that requires an idle CPU to poll at
> some interval will have one of two problems: either the poll interval
> is large, in which case the imbalance will stay around for a long time, or
> the poll interval is small, in which case it will behave badly in a heavily
> virtualized environment with many images.
>
> > > your processes were blocked, you'd just go to sleep indefinitely and
> > > save a bit of power and avoid unnecessary overhead.
> > >
> > > I haven't looked at the latest tickless patches, so I'm not sure if my
> > > claims of simplicity are overblown, but especially as multiprocessor
> > > systems become more and more common it just seems wasteful to wakeup
> > > all the CPUs every so often only to have them find that they have
> > > nothing to do.
> >
> > I guess George's experience in implementing tickless systems is that
> > it is more of an overhead for a general purpose OS like Linux. George?
>
> The original implementation of a tickless system introduced an overhead
> in the system call path. The recent solution is tickless only while idle,
> which does not have that overhead any longer. The second source of
> overhead is the need to reprogram the timer interrupt source. That
> can be expensive (see i386) or cheap (see s390). So it depends, as usual.
> In our experience on s390, tickless operation is a huge win.
In the dyn-tick patch for ARM and x86, timer reprogramming is only
done during idle. When the system is busy, there is a continuous tick
and timer reprogramming is skipped. So the second source of overhead
should be minimal.
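Roughly, the idle path in such a patch has the shape of the standalone sketch
below (this is only a model of the idea, not the actual dyn-tick code;
next_timer_interrupt(), reprogram_timer() and wait_for_interrupt() here are
made-up stand-ins):

/*
 * Rough model of the tick-skipping idle path: the one-shot timer is only
 * reprogrammed on entry to idle; while the CPU is busy the normal periodic
 * tick runs and nothing is reprogrammed.  All helpers are stand-ins for
 * the real arch/timer code, and the "interrupt" is simulated.
 */
#include <stdio.h>

static unsigned long jiffies;

static unsigned long next_timer_interrupt(void)
{
        return jiffies + 250;   /* pretend the next timer is 250 ticks away */
}

static void reprogram_timer(unsigned long expires)
{
        printf("one-shot timer programmed for jiffy %lu\n", expires);
}

static unsigned long wait_for_interrupt(void)
{
        return jiffies + 100;   /* pretend some interrupt arrives earlier */
}

static void tickless_idle(void)
{
        unsigned long next = next_timer_interrupt();
        unsigned long woken_at;

        reprogram_timer(next);          /* skip the intermediate ticks          */
        woken_at = wait_for_interrupt();/* halt until the timer or another irq  */

        jiffies = woken_at;             /* account for the jiffies we slept     */
        reprogram_timer(jiffies + 1);   /* resume the normal periodic tick      */
}

int main(void)
{
        tickless_idle();
        printf("jiffies is now %lu\n", jiffies);
        return 0;
}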
Tony
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 17:12 ` Srivatsa Vaddagiri
2005-05-12 17:59 ` Jesse Barnes
2005-05-12 18:08 ` Martin Schwidefsky
@ 2005-05-12 21:16 ` George Anzinger
2005-05-12 21:35 ` Jesse Barnes
2 siblings, 1 reply; 36+ messages in thread
From: George Anzinger @ 2005-05-12 21:16 UTC (permalink / raw)
To: vatsa
Cc: Jesse Barnes, Tony Lindgren, Lee Revell, Nick Piggin, schwidefsky,
jdike, Ingo Molnar, linux-kernel
Srivatsa Vaddagiri wrote:
> On Thu, May 12, 2005 at 09:28:55AM -0700, Jesse Barnes wrote:
>
>>Seems like we could schedule timer interrupts based solely on add_timer
>>type stuff; the scheduler could use it if necessary for load balancing
>>(along with fork/exec based balancing perhaps) on large machines where
>>load imbalances hurt throughput a lot. But on small systems if all
>
>
> Even if we were to go for this tickless design, the fundamental question
> remains: who wakes up the (sleeping) idle CPU upon an imbalance? Does some
> other (busy) CPU wake it up (which makes the implementation complex), or
> does the idle CPU check for imbalance itself at periodic intervals (which
> restricts the amount of time an idle CPU may sleep)?
>
>
>>your processes were blocked, you'd just go to sleep indefinitely and
>>save a bit of power and avoid unnecessary overhead.
>>
>>I haven't looked at the latest tickless patches, so I'm not sure if my
>>claims of simplicity are overblown, but especially as multiprocessor
>>systems become more and more common it just seems wasteful to wakeup
>>all the CPUs every so often only to have them find that they have
>>nothing to do.
>
>
> I guess George's experience in implementing tickless systems is that
> it is more of an overhead for a general purpose OS like Linux. George?
The tickless system differs from VST in that it is designed to only "tick" when
there is something in the time list to do and it does this ALWAYS. The VST
notion is that ticks are not needed if the cpu is idle. This is VASTLY simpler
to do than a tickless system, mostly because of the accounting requirements. When
a VST cpu is not ticking, the full sleep time is charged to the idle task and
the system does not need to time slice or do any time driven user profiling or
execution limiting.
And this is exactly where the tickless system runs into overload. Simply
speaking, each task has with it certain limits and requirements WRT time. It is
almost always time sliced, but it may also have execution limits and settimer
execution time interrupts that it wants. These need to be set up for each task
when it is switched to and removed when the system switches away from it. In
the test I did, I reduced all these timers to one (I think I just used the slice
time, but this is not the thing to do if the user is trying to do profiling. In
any case, only one timer needs to be set up, possibly some work needs to be done
to find the min. of all the execution time timers it has, but only that one
needs to go in the time list.). BUT it needs to happen each context switch time
and thus adds overhead to the switch time. In my testing, the overhead for this
became higher than the ticked system overhead for the same services at a
relatively low context switch rate. From a systems point of view, you just
don't want to increase overhead when the load goes up. This leads to fragile
systems.
>
Now, the question still remains, if a cpu in an SMP system is sleeping because
of VST, who or how is it to be wakened to respond to increases in the system
load? If all CPUs are sleeping, there is some event (i.e. an interrupt) that does
this. I think, in an SMP system, any awake CPUs should, during their load
balancing, notice that there are sleeping CPUs and wake them as the load increases.
>
--
George Anzinger george@mvista.com
High-res-timers: http://sourceforge.net/projects/high-res-timers/
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 21:16 ` George Anzinger
@ 2005-05-12 21:35 ` Jesse Barnes
2005-05-12 22:15 ` George Anzinger
0 siblings, 1 reply; 36+ messages in thread
From: Jesse Barnes @ 2005-05-12 21:35 UTC (permalink / raw)
To: george
Cc: vatsa, Tony Lindgren, Lee Revell, Nick Piggin, schwidefsky, jdike,
Ingo Molnar, linux-kernel
On Thursday, May 12, 2005 2:16 pm, George Anzinger wrote:
> The tickless system differs from VST in that it is designed to only
> "tick" when there is something in the time list to do and it does
> this ALWAYS.
Right, that's what I gathered (and what I was getting at).
> And this is exactly where the tickless system runs into overload.
> Simply speaking, each task has with it certain limits and
> requirements WRT time. It is almost always time sliced, but it may
> also have execution limits and settimer execution time interrupts
> that it wants.
Right...
> These need to be set up for each task when it is
> switched to and removed when the system switches away from it.
Why do they need to be switched when the task's slice is up? Can't they
remain in the timer list? I guess I'm imagining a global outstanding
timer list that's manipulated by add/remove_timer, settimer, etc..
When a timeout occurs the timer interrupt handler could mark tasks
runnable again, bump their priority, send SIGALRM, or whatever. Most
timers are deleted before they expire anyway, right? If that's true
you definitely don't want to save and restore them on every context
switch...
> In
> the test I did, I reduced all these timers to one (I think I just
> used the slice time, but this is not the thing to do if the user is
> trying to do profiling. In any case, only one timer needs to be set
> up, possibly some work needs to be done to find the min. of all the
> execution time timers it has, but only that one needs to go in the
> time list.).
Or at least only the nearest one has to be programmed as the next
interrupt, and before going back to sleep the kernel could check if any
timers had expired while the last one was running (aren't missing
jiffies accounted for this way too, but of course jiffies would go away
in this scheme)?
> BUT it needs to happen each context switch time and
> thus adds overhead to the switch time. In my testing, the overhead
> for this became higher than the ticked system overhead for the same
> services at a relatively low context switch rate.
Yeah, that does sound expensive.
> From a systems
> point of view, you just don't want to increase overhead when the load
> goes up. This leads to fragile systems.
>
> Now, the question still remains, if a cpu in an SMP system is
> sleeping because of VST, who or how is it to be wakened to respond
> to increases in the system load. If all CPUs are sleeping, there is
> some event (i.e. interrupt) that does this. I think, in an SMP
> system, any awake CPUs should, during their load balancing, notice
> that there are sleeping CPUs and wake them as the load increases.
Sounds reasonable to me, should be as simple as a 'wake up and load
balance' IPI, right?
Jesse
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 21:35 ` Jesse Barnes
@ 2005-05-12 22:15 ` George Anzinger
2005-05-13 0:43 ` Jesse Barnes
0 siblings, 1 reply; 36+ messages in thread
From: George Anzinger @ 2005-05-12 22:15 UTC (permalink / raw)
To: Jesse Barnes
Cc: vatsa, Tony Lindgren, Lee Revell, Nick Piggin, schwidefsky, jdike,
Ingo Molnar, linux-kernel
Jesse Barnes wrote:
> On Thursday, May 12, 2005 2:16 pm, George Anzinger wrote:
>
>>The tickless system differs from VST in that it is designed to only
>>"tick" when there is something in the time list to do and it does
>>this ALWAYS.
>
>
> Right, that's what I gathered (and what I was getting at).
>
>
>>And this is exactly where the tickless system runs into overload.
>>Simply speaking, each task has with it certain limits and
>>requirements WRT time. It is almost always time sliced, but it may
>>also have execution limits and settimer execution time interrupts
>>that it wants.
>
>
> Right...
>
>
>>These need to be set up for each task when it is
>>switched to and removed when the system switches away from it.
>
>
> Why do they need to be switched when the task's slice is up?
No, that would not be too bad, but they need to be removed when the task is
switched away from. This happens FAR more often than a slice expiring. Most
tasks are switched away from because they block, not because their time is over.
> Can't they remain in the timer list?
We are timing the task's slice. It is only active when the task has the cpu...
> I guess I'm imagining a global outstanding
> timer list that's manipulated by add/remove_timer, settimer, etc..
> When a timeout occurs the timer interrupt handler could mark tasks
> runnable again, bump their priority, send SIGALRM, or whatever.
The timers that cause the problem are the ones that only run when the task is
active. These are the slice timer, the profile timer (ITIMER_PROF), the
execution limit timer and the settime timer that is relative to execution time
(ITIMER_VIRTUAL).
Again, we can collapse all these to one, but still it needs to be set up when the
task is switched to and removed when it is switched away.
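To make that concrete, the switch-time work is roughly the following (a
standalone model only; the struct fields and helpers are made up for
illustration and are not the real kernel interfaces):

/*
 * Standalone model of the "one collapsed per-task timer": on every context
 * switch the nearest of the incoming task's execution-time limits is armed,
 * and the outgoing task's timer is cancelled.  Types, fields and helpers
 * are made up for illustration; they are not the real kernel interfaces.
 */
#include <limits.h>
#include <stdio.h>

struct task {
        unsigned long slice_left;       /* time-slice remaining             */
        unsigned long itimer_virt;      /* ITIMER_VIRTUAL ticks remaining   */
        unsigned long itimer_prof;      /* ITIMER_PROF ticks remaining      */
        unsigned long rlimit_cpu;       /* execution-limit ticks remaining  */
};

static unsigned long min2(unsigned long a, unsigned long b)
{
        return a < b ? a : b;
}

static void arm_one_shot(unsigned long expires_in)
{
        printf("per-task timer armed for %lu ticks\n", expires_in);
}

static void cancel_one_shot(void)
{
        printf("per-task timer cancelled\n");
}

/* the work that has to happen on every switch - this is the overhead */
static void context_switch_timers(struct task *prev, struct task *next)
{
        unsigned long nearest;

        (void)prev;
        cancel_one_shot();
        nearest = min2(min2(next->slice_left, next->itimer_virt),
                       min2(next->itimer_prof, next->rlimit_cpu));
        arm_one_shot(nearest);
}

int main(void)
{
        struct task a = { 10, ULONG_MAX, ULONG_MAX, ULONG_MAX };
        struct task b = { 10, 7, ULONG_MAX, 500 };

        context_switch_timers(&a, &b);  /* switching from a to b arms 7 */
        return 0;
}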
> Most timers are deleted before they expire anyway, right? If that's true
> you definitely don't want to save and restore them on every context
> switch...
One of these timers is almost ALWAYS needed. And yes, mostly they do not
expire. That is usually good, as an expiry costs more than a removal (provided
you do not need to reprogram the timer as a result).
Timers that run on system time (ITIMER_REAL) stay in the list even when the task
is inactive.
>
>
>>In
>>the test I did, I reduced all these timers to one (I think I just
>>used the slice time, but this is not the thing to do if the user is
>>trying to do profiling. In any case, only one timer needs to be set
>>up, possibly some work needs to be done to find the min. of all the
>>execution time timers it has, but only that one needs to go in the
>>time list.).
>
>
> Or at least only the nearest one has to be programmed as the next
> interrupt, and before going back to sleep the kernel could check if any
> timers had expired while the last one was running (aren't missing
> jiffies accounted for this way too, but of course jiffies would go away
> in this scheme)?
>
>
>>BUT it needs to happen each context switch time and
>>thus adds overhead to the switch time. In my testing, the overhead
>>for this became higher than the ticked system overhead for the same
>>services at a relatively low context switch rate.
>
>
> Yeah, that does sound expensive.
>
>
>>From a systems
>>point of view, you just don't want to increase overhead when the load
>>goes up. This leads to fragile systems.
>>
>>Now, the question still remains, if a cpu in an SMP system is
>>sleeping because of VST, who or how is it to be wakened to respond
>>to increases in the system load. If all CPUs are sleeping, there is
>>some event (i.e. interrupt) that does this. I think, in an SMP
>>system, any awake CPUs should, during their load balancing, notice
>>that there are sleeping CPUs and wake them as the load increases.
>
>
> Sounds reasonable to me, should be as simple as a 'wake up and load
> balance' IPI, right?
I think there is already an IPI to tell another cpu that it has work. That
should be enough. Need to check, however. From the VST point of view, any
interrupt wakes the cpu from the VST sleep, so it need not even target the
scheduler.
--
George Anzinger george@mvista.com
High-res-timers: http://sourceforge.net/projects/high-res-timers/
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 22:15 ` George Anzinger
@ 2005-05-13 0:43 ` Jesse Barnes
2005-05-13 6:31 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 36+ messages in thread
From: Jesse Barnes @ 2005-05-13 0:43 UTC (permalink / raw)
To: george
Cc: Jesse Barnes, vatsa, Tony Lindgren, Lee Revell, Nick Piggin,
schwidefsky, jdike, Ingo Molnar, linux-kernel
On Thursday, May 12, 2005 3:15 pm, George Anzinger wrote:
> The timers that cause the problem are the ones that only run when the
> task is active. These are the slice timer, the profile timer
> (ITIMER_PROF), the execution limit timer and the settime timer that
> is relative to execution time (ITIMER_VIRTUAL).
ITIMER_PROF could simply be ignored if the task it corresponds to isn't
active when it fires, so it wouldn't incur any overhead.
ITIMER_VIRTUAL sounds like it would uglify things though, and of course
unused timer slice interrupts would have to be cleared out.
> Again, we can collapse all these to one, but still it needs to be
> set up when the task is switched to and removed when it is switched
> away.
Right, I see what you're saying now. It's not as simple as I had hoped.
> Timers that run on system time (ITIMER_REAL) stay in the list even
> when the task is inactive.
Right, they'll cause the task they're associated with to become runnable
again, or get a signal, or whatever.
> I think there is already an IPI to tell another cpu that it has work.
> That should be enough. Need to check, however. From the VST point
> of view, any interrupt wakes the cpu from the VST sleep, so it need
> not even target the scheduler.
But in this case you probably want it to, so it can rebalance tasks to
the CPU that just woke up.
Jesse
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 18:08 ` Martin Schwidefsky
2005-05-12 18:21 ` Tony Lindgren
@ 2005-05-13 6:23 ` Srivatsa Vaddagiri
2005-05-13 7:16 ` Nick Piggin
1 sibling, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-13 6:23 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: george, jdike, Jesse Barnes, linux-kernel, Ingo Molnar,
Nick Piggin, Lee Revell, Tony Lindgren
On Thu, May 12, 2005 at 08:08:26PM +0200, Martin Schwidefsky wrote:
> I would prefer a solution where the busy CPU wakes up an idle CPU if the
> imbalance is too large. Any scheme that requires an idle CPU to poll at
> some interval will have one of two problems: either the poll interval
> is large, in which case the imbalance will stay around for a long time, or
> the poll interval is small, in which case it will behave badly in a heavily
> virtualized environment with many images.
I guess all the discussions we are having boil down to this: what is the maximum
time one can afford to have an imbalanced system because a sleeping idle CPU is
not participating in load balancing? 10 ms, 100 ms, 1 sec or more?
Maybe the answer depends on how much imbalance we are talking about here.
A high degree of imbalance would mean that the sleeping idle CPUs have
to be woken up quickly, while a low degree of imbalance could mean that
we can let them sleep longer.
From all the discussions we have been having, I think a watchdog
implementation makes more sense. Nick/Ingo, what do you think
should be our final decision on this?
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-12 17:59 ` Jesse Barnes
2005-05-12 18:16 ` Tony Lindgren
@ 2005-05-13 6:27 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-13 6:27 UTC (permalink / raw)
To: Jesse Barnes
Cc: Tony Lindgren, Lee Revell, Nick Piggin, schwidefsky, jdike,
Ingo Molnar, linux-kernel, george
On Thu, May 12, 2005 at 10:59:59AM -0700, Jesse Barnes wrote:
> The latest patches seem to do tick skipping rather than wholesale
> ticklessness. Admittedly, the latter is a more invasive change, but one
If you are referring to my i386 patch, that was only a hack to test the
scheduler change!
> that may end up being simpler in the long run. But maybe George did a
> design like that in the past and rejected it?
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-13 0:43 ` Jesse Barnes
@ 2005-05-13 6:31 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-13 6:31 UTC (permalink / raw)
To: Jesse Barnes
Cc: george, Jesse Barnes, Tony Lindgren, Lee Revell, Nick Piggin,
schwidefsky, jdike, Ingo Molnar, linux-kernel
On Thu, May 12, 2005 at 05:43:46PM -0700, Jesse Barnes wrote:
> But in this case you probably want it to, so it can rebalance tasks to
> the CPU that just woke up.
Usually the sleeping idle CPU is sent a resched IPI, which will cause it to
call schedule or idle_balance_retry/rebalance_tick (ref: my earlier patch) to
find out if it has to do a load balance.
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-13 6:23 ` Srivatsa Vaddagiri
@ 2005-05-13 7:16 ` Nick Piggin
2005-05-13 8:04 ` Ingo Molnar
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-05-13 7:16 UTC (permalink / raw)
To: vatsa
Cc: Martin Schwidefsky, george, jdike, Jesse Barnes, linux-kernel,
Ingo Molnar, Lee Revell, Tony Lindgren
Srivatsa Vaddagiri wrote:
> On Thu, May 12, 2005 at 08:08:26PM +0200, Martin Schwidefsky wrote:
>
>>I would prefer a solution where the busy CPU wakes up an idle CPU if the
>>imbalance is too large. Any scheme that requires an idle CPU to poll at
>>some interval will have one of two problems: either the poll interval
>>is large, in which case the imbalance will stay around for a long time, or
>>the poll interval is small, in which case it will behave badly in a heavily
>>virtualized environment with many images.
>
>
> I guess all the discussions we are having boil down to this: what is the maximum
> time one can afford to have an imbalanced system because a sleeping idle CPU is
> not participating in load balancing? 10 ms, 100 ms, 1 sec or more?
>
Exactly. Any scheme that requires a busy CPU to poll at some intervals
will also have problems: you are putting more work on the busy CPUs,
there will be superfluous IPIs, and there will be cases where the
imbalance is allowed to get too large. Not really much different from
doing the work with the idle CPUs, *except* that it will introduce
a lot of complexity that nobody has shown is needed.
> Maybe the answer depends on how much imbalance we are talking about here.
> A high degree of imbalance would mean that the sleeping idle CPUs have
> to be woken up quickly, while a low degree of imbalance could mean that
> we can let them sleep longer.
>
> From all the discussions we have been having, I think a watchdog
> implementation makes more sense. Nick/Ingo, what do you think
> should be our final decision on this?
>
Well the complex solution won't go in until it is shown that the
simple version has fundamental failure cases - but I don't think
we need to make a final decision yet do we?
--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-13 7:16 ` Nick Piggin
@ 2005-05-13 8:04 ` Ingo Molnar
2005-05-13 8:27 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2005-05-13 8:04 UTC (permalink / raw)
To: Nick Piggin
Cc: vatsa, Martin Schwidefsky, george, jdike, Jesse Barnes,
linux-kernel, Lee Revell, Tony Lindgren
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > From all the discussions we have been having, I think a watchdog
> > implementation makes more sense. Nick/Ingo, what do you think
> > should be our final decision on this?
>
> Well the complex solution won't go in until it is shown that the
> simple version has fundamental failure cases - but I don't think we
> need to make a final decision yet do we?
there's no need to make a final decision yet. But the more complex
watchdog solution does have the advantage of putting idle CPUs to sleep
immediately and perpetually.
the power equation is really easy: the implicit cost of a deep CPU sleep
is say 1-2 msecs. (that's how long it takes to shut the CPU and the bus
down, etc.) If we do an exponential backoff we periodically re-wake the
CPU fully up again - wasting 1-2msec (or more) more power. With the
watchdog solution we have more overhead on the busy CPU but it takes
_much_ less power for a truly idle CPU to be turned off. [the true
'effective cost' all depends on the scheduling pattern as well, but the
calculation before is still valid.] Whatever the algorithmic overhead of
the watchdog code, it's dwarfed by the power overhead caused by false
idle-wakeups of CPUs under exponential backoff.
the watchdog solution - despite being more complex - is also more
orthogonal in that it does not change the balancing decisions at all -
they just get offloaded to another CPU. The exponential backoff OTOH
materially changes how we do SMP balancing - which might or might not
matter much, but it will always depend on circumstances. So in the long
run the watchdog solution is probably easier to control. (because it's
just an algorithm offload, not a material scheduling feature.)
so unless there are strong implementational arguments against the
watchdog solution, i definitely think it's the higher quality solution,
both in terms of power savings, and in terms of impact.
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-13 8:04 ` Ingo Molnar
@ 2005-05-13 8:27 ` Nick Piggin
2005-05-13 9:19 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2005-05-13 8:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: vatsa, Martin Schwidefsky, george, jdike, Jesse Barnes,
linux-kernel, Lee Revell, Tony Lindgren
Ingo Molnar wrote:
> the power equation is really easy: the implicit cost of a deep CPU sleep
> is say 1-2 msecs. (that's how long it takes to shut the CPU and the bus
> down, etc.) If we do an exponential backoff we periodically re-wake the
> CPU fully up again - wasting 1-2 msec (or more) worth of power each time. With the
> watchdog solution we have more overhead on the busy CPU but it takes
> _much_ less power for a truly idle CPU to be turned off. [the true
> 'effective cost' all depends on the scheduling pattern as well, but the
> calculation before is still valid.] Whatever the algorithmic overhead of
> the watchdog code, it's dwarfed by the power overhead caused by false
> idle-wakeups of CPUs under exponential backoff.
>
Well, it really depends on how it is implemented, and what tradeoffs
you make.
Let's say that you don't start deep sleeping until you've backed off
to 64ms rebalancing. Now the CPU power consumption is reduced to less
than 2% of ideal.
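(Roughly: with a ~1 msec wakeup cost, waking once every 64 msec wastes about
1/64, i.e. around 1.6%.) A minimal standalone sketch of such a backoff, with
made-up names and numbers:

/*
 * Standalone sketch of exponentially backing off the idle rebalance
 * interval: every time the idle CPU wakes, finds nothing to pull and goes
 * back to sleep, the interval doubles up to a cap; pulling real work
 * resets it.  The names and numbers here are made up for illustration.
 */
#include <stdio.h>

#define MIN_INTERVAL_MS  10
#define MAX_INTERVAL_MS  64             /* deep sleep once we reach this */

static unsigned int idle_interval_ms = MIN_INTERVAL_MS;

static unsigned int next_idle_sleep(int pulled_work)
{
        if (pulled_work) {
                idle_interval_ms = MIN_INTERVAL_MS;     /* stay responsive */
        } else {
                idle_interval_ms *= 2;                  /* back off        */
                if (idle_interval_ms > MAX_INTERVAL_MS)
                        idle_interval_ms = MAX_INTERVAL_MS;
        }
        return idle_interval_ms;
}

int main(void)
{
        int i;

        for (i = 0; i < 5; i++)
                printf("idle, nothing pulled: sleep %u ms\n",
                       next_idle_sleep(0));
        printf("pulled work: back to %u ms\n", next_idle_sleep(1));
        return 0;
}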
Now we don't have to worry about uniprocessor, and SMP systems that
go *completely* idle can have a mechanism to deep sleep all CPUs
indefinitely until there is real work.
What you're left with are SMP systems with *some* activity happening,
and of those, I bet most idle CPUs will have reasons other than the
scheduler tick to be woken up anyway.
And don't forget that the watchdog approach can just as easily deep
sleep a CPU only to immediately wake it up again if it detects an
imbalance.
So in terms of real, system-wide power savings, I'm guessing the
difference would really be immeasurable.
And the CPU usage / wakeup cost arguments cut both ways. The busy
CPUs have to do extra work in the watchdog case.
> the watchdog solution - despite being more complex - is also more
> orthogonal in that it does not change the balancing decisions at all -
> they just get offloaded to another CPU. The exponential backoff OTOH
> materially changes how we do SMP balancing - which might or might not
> matter much, but it will always depend on circumstances. So in the long
> run the watchdog solution is probably easier to control. (because it's
> just an algorithm offload, not a material scheduling feature.)
>
Well so does the watchdog, really. But it's probably not like you have
to *really* tune sleep algorithms _exactly_ right, yeah? So long as you
get within even 5% of total theoretical power saving on SMP systems,
it's likely good enough.
> so unless there are strong implementational arguments against the
> watchdog solution, i definitely think it's the higher quality solution,
> both in terms of power savings, and in terms of impact.
>
I think the power-saving difference between the two approaches will be
unmeasurable, backoff will be quite a lot less complex, and it will have
less impact on CPUs that are busy doing real work.
Smells like a bakeoff coming up :)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-13 8:27 ` Nick Piggin
@ 2005-05-13 9:19 ` Srivatsa Vaddagiri
2005-05-13 9:33 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-05-13 9:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Martin Schwidefsky, george, jdike, Jesse Barnes,
linux-kernel, Lee Revell, Tony Lindgren
On Fri, May 13, 2005 at 08:29:17AM +0000, Nick Piggin wrote:
> And don't forget that the watchdog approach can just as easily deep
> sleep a CPU only to immediately wake it up again if it detects an
> imbalance.
I think we should increase the threshold beyond which the idle CPU
is woken up (more than the imbalance_pct that exists already). This
should justify waking up the CPU.
> And the CPU usage / wakeup cost arguments cut both ways. The busy
> CPUs have to do extra work in the watchdog case.
Maybe with a really smart watchdog solution, we can cut down this overhead.
I did think of other schemes - a dedicated CPU per node acting as watchdog
for that node, and per-node watchdog kernel threads - to name a few. What I had
proposed was the best I could think of. But maybe we can look at improving it
to see if the overhead concern you have can be reduced - meeting the interests
of both worlds :)
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-13 9:19 ` Srivatsa Vaddagiri
@ 2005-05-13 9:33 ` Nick Piggin
0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2005-05-13 9:33 UTC (permalink / raw)
To: vatsa
Cc: Ingo Molnar, Martin Schwidefsky, george, jdike, Jesse Barnes,
linux-kernel, Lee Revell, Tony Lindgren
Srivatsa Vaddagiri wrote:
> On Fri, May 13, 2005 at 08:29:17AM +0000, Nick Piggin wrote:
>
>>And don't forget that the watchdog approach can just as easily deep
>>sleep a CPU only to immediately wake it up again if it detects an
>>imbalance.
>
>
> I think we should increase the threshold beyond which the idle CPU
> is woken up (more than the imbalance_pct that exists already). This
> should justify waking up the CPU.
>
Oh yeah that's possible (and maybe preferable - testing will need to
be done). But again it doesn't solve the corner cases where problems
happen. And it introduces more divergence to the balancing algorithms.
Basically I'm trying to counter the notion that the watchdog solution
is fundamentally better just because it allows indefinite sleeping and
backoff doesn't. You'll always be waking things up when they should
stay sleeping, and putting them to sleep only to require they be woken
up again.
>
>>And the CPU usage / wakeup cost arguments cut both ways. The busy
>>CPUs have to do extra work in the watchdog case.
>
>
> Maybe with a really smart watchdog solution, we can cut down this overhead.
Really smart usually == bad, especially when it's not something that
has been shown to be terribly critical.
> I did think of other schemes - a dedicated CPU per node acting as watchdog
> for that node, and per-node watchdog kernel threads - to name a few. What I had
> proposed was the best I could think of. But maybe we can look at improving it
> to see if the overhead concern you have can be reduced - meeting the interests
> of both worlds :)
My main concern is complexity, second concern is diminishing returns,
third concern is overhead on other CPUs :)
But I won't pretend to know it all - I don't have a good grasp of the
problem domains, so I'm just making some noise now so we don't put in
a complex solution where a simple one would suffice.
The best idea obviously would be to get the interested parties involved,
and get different approaches running side by side, then measure things!
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-05-08 4:14 ` Nick Piggin
2005-05-08 12:19 ` Srivatsa Vaddagiri
2005-05-11 18:03 ` Tony Lindgren
@ 2005-06-30 12:47 ` Srivatsa Vaddagiri
2005-07-06 17:31 ` Srivatsa Vaddagiri
2 siblings, 1 reply; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-06-30 12:47 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, schwidefsky, jdike, Andrew Morton, Ingo Molnar,
rmk+lkml, linux-kernel, manfred
[ This is a continuation of the thread at : http://lkml.org/lkml/2005/5/7/98 ]
On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
> Rusty Russell wrote:
> >On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
> >
> >>Two solutions have been proposed so far:
> >>
> >> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
> >> say a second, Nick?) on how long a idle CPU can avoid taking
>
> Yeah probably something around that order of magnitude. I suspect
> there will fast be a point where either you'll get other timers
> going off more frequently, and / or you simply get very quickly
> diminishing returns on the amount of power saving gained from
> increasing the period.
I did some measurements on the average interval that an idle CPU is allowed to
sleep before it wakes up. I also have some numbers on Xen comparing performance
between the two approaches (100ms limit vs no limit on sleep interval).
First, the avg sleep interval.
Numbers were obtained on this configuration:
Machine : 8-way Intel (P3) box
Distro : RHEL3
Kernel : 2.6.12
Each time a CPU became idle, next_timer_interrupt was called to find the next
nearest timer (sleep_time_req) and the CPU slept until either that time elapsed
or it was interrupted by an interrupt (external/device, reschedule, invalidate
or call_function). When any of these happened, the actual amount of time slept
in jiffies (sleep_time_real) was recorded. In order to measure what effect
internal system events (reschedule/invalidate/call-function IPIs) alone have on
the avg sleep interval, I redirected all external interrupts to CPU 0.
Also, the numbers were obtained at init runlevel 3 with the system mostly idle.
I basically logged in and ran the test to get the numbers. Apart from
this test, the only other workloads active were the system daemons that are
started at runlevel 3. The test cleared various per-cpu counters (which were
used to capture the sleep_time_real numbers), slept for 2 minutes and read the
counters at the end of it.
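The instrumentation was roughly of the following shape (a reconstruction for
illustration only, not the actual test patch; all names are stand-ins):

/*
 * Reconstruction (for illustration only) of the shape of the
 * instrumentation: per-cpu accumulators for how long each CPU really
 * slept (sleep_time_real) and how often, updated whenever the tickless
 * idle path wakes up again.  This is not the patch that produced the
 * numbers below.
 */
#include <stdio.h>

#define NR_CPUS 8

struct vst_stats {
        unsigned long samples;
        unsigned long long total_slept;         /* in jiffies */
};

static struct vst_stats stats[NR_CPUS];

/* would be called from the idle path when the CPU wakes up again */
static void account_idle_sleep(int cpu, unsigned long slept_jiffies)
{
        stats[cpu].samples++;
        stats[cpu].total_slept += slept_jiffies;
}

static void report(void)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!stats[cpu].samples)
                        continue;
                printf("CPU%d: %lu samples, mean %.3f jiffies\n", cpu,
                       stats[cpu].samples,
                       (double)stats[cpu].total_slept / stats[cpu].samples);
        }
}

int main(void)
{
        account_idle_sleep(1, 1088);    /* fake data just to exercise it */
        account_idle_sleep(1, 1090);
        report();
        return 0;
}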
Below are the numbers I obtained from the test:
CPU#   samples    Mean        Std Dev      Max         Min
1: 113 1088.690 761.352 2001.000 2.000
2: 76 1633.197 640.230 2002.000 94.000
3: 80 1552.300 682.050 2003.000 1.000
4: 71 1735.606 531.744 2004.000 165.000
5: 2525 48.669 6.391 50.000 2.000
6: 74 1652.905 601.908 2006.000 97.000
7: 1783 68.915 33.495 100.000 0.000
[ CPU0 is excluded because its avg sleep interval is just 1 jiffy -
the external timer interrupt that drives jiffies keeps waking it up every jiffy ]
The mean above represents sleep_time_real, i.e. the real time a CPU
was allowed to sleep before something woke it. The best mean
time was for CPU4 (1735.606 jiffies, or ~1.7 sec). Also note that
the maximum time it actually slept was ~2 sec.
Digging further revealed that this max time was restricted by
various timers the kernel uses. Mostly it was found to be because of
the slab allocator reap timer (it requests a timer every ~2 sec on
every CPU) and the machine_check timer (MCE_RATE in arch/i386/kernel/cpu/mcheck/
non-fatal.c).
Just to see the effect of increasing the timeouts of these timers,
I changed MCE_RATE to 60*HZ (from 15*HZ), REAPTIMEOUT_CPUC to 20*HZ (from 2*HZ),
and REAPTIMEOUT_LIST3 to 40*HZ (from 4*HZ). With this change, and rerunning
my test under the same conditions, the new numbers that were obtained for
sleep_time_real are:
CPU#   samples    Mean        Std Dev      Max         Min
1: 24 5229.833 7341.283 20001.000 4.000
2: 2520 49.797 2.745 50.000 3.000
3: 8 15002.250 8575.149 20003.000 1127.000
4: 8 15002.875 8557.532 20004.000 1156.000
5: 8 15003.750 8537.711 20005.000 1189.000
6: 996 125.923 1209.668 30000.000 2.000
7: 134 940.269 176.835 1060.000 310.000
Note that the average sleep_time_real for CPU4 is now 15002.875 (~15 sec)
and its max was 20004.000 (~20 sec).
Now I am not saying that we can blindly increase these timeouts
(REAPTIMEOUT_CPUC/REAPTIMEOUT_LIST3/MCE_RATE). But I think that we can
make these timers idle-cpu aware. For example, if the slab timer notices
that it is running on an idle CPU and the number of slabs it is actually
reaping has come down very much, then it can request longer timeouts
instead of firing every ~2 sec the way it does now.
It is probably good for virtualization/power-management environments that we
identify such short timeouts requested in the kernel and make them idle-cpu
aware to the extent possible. Given that, restricting the max sleep time to
100 ms, IMO, is not going to be efficient.
In order to see the benefit of sleeping longer (rather than waking
every 100 ms), I ran some benchmarks on Xen. Basically I measured
the throughput of a CPU-intensive benchmark (which finds the first 30 random
prime numbers). The time taken to find the 30 primes was used as the metric
to compare the two approaches.
Configuration used was:
Machine : 4-way Intel Xeon box (HT disabled in BIOS)
Distro : RHEL4
Kernel : Xen-unstable kernel (based on 2.6.11) for both
Domain 0 and guest domains
Xen-unstable was used since it provides SMP guest support. Additional 4-way
domains were started using the ttylinux distro
(http://easynews.dl.sourceforge.net/sourceforge/xen/ttylinux-xen.bz2).
The prime-number benchmark was run in one of the domains while the other
domains were idle. The throughput of the benchmark was measured once with
a total of 30 4-way domains started and another time with 100 4-way domains
started. In each case, to keep the virtual-cpu/physical-cpu ratio the same,
I pinned each virtual CPU of each domain to the corresponding physical CPU.
For ex: virtual CPU 0 of domains 1-30 was pinned to physical CPU 0, virtual CPU 1
of domains 1-30 was pinned to physical CPU 1, and so on.
Case A : No-limit-on-sleep-interval
Avg of 10 runs Std Dev
30 domains started 28.55 sec 0.009
100 domains started 28.835 sec 0.011
Note that in going from 30->100 domains, the performance has deteriorated by
only ~1%.
Case B : 100 ms limit on sleep interval. Since by default the Xen guest domain
kernel follows case A, it was modified to restrict the sleep interval to 100 ms.
Avg of 10 runs Std Dev
30 domains started 28.867 0.010
100 domains started 29.993 0.009
Note that in going from 30->100 domains, there is a deterioration of 3.9%
Comparing the two approaches,
No limit 100 ms limit Difference
30 domains 28.55 28.867 1.1%
100 domains 28.835 29.993 4%
The difference grows with the degree of virtualization. I have heard that
the number of partitions can be much higher (>300?) on some high-end zSeries
machines.
So those are the numbers. To conclude:
- Currently, in the 2.6.12 kernel, the avg real sleep time obtained
is ~1.7 sec.
- We can (and IMO we must) look at the short timers in use by the
kernel and potentially increase this avg real sleep time
further (at least for idle CPUs). This means that waking up
every 100 ms is inefficient.
- Even if we decide to go for the simple approach, we should increase
the sleep-time limit to a few seconds. I don't know what that means
for load balancing.
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] (How to) Let idle CPUs sleep
2005-06-30 12:47 ` Srivatsa Vaddagiri
@ 2005-07-06 17:31 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2005-07-06 17:31 UTC (permalink / raw)
To: Nick Piggin; +Cc: schwidefsky, Ingo Molnar, linux-kernel, manfred
On Thu, Jun 30, 2005 at 06:17:11PM +0530, Srivatsa Vaddagiri wrote:
> Digging further revealed that this max time was restricted by
> various timers kernel uses. Mostly it was found to be because of
> the slab allocator reap timer (it requests a timer every ~2sec on
> every CPU) and machine_check timer (MCE_RATE in arch/i386/kernel/cpu/mcheck/
> non-fatal.c ).
I modified the slab allocator to do a slightly better job of handling its reap
timer (instead of blindly increasing the reap timeout to 20 sec as I did in my
last run). The patch below is merely a hack meant to see what kind of benefit we
can possibly hope to get by reworking the reap timer: when a reap pass finds
little or nothing to free, the per-CPU reap interval is doubled (up to 30*HZ);
otherwise it is reset to the usual 2*HZ. A good solution would probably
increase/decrease the reap period based on memory pressure and on how idle the
system is. Anyway, the patch I tried out is:
---
linux-2.6.13-rc1-root/mm/slab.c | 40 ++++++++++++++++++++++++++++++++--------
1 files changed, 32 insertions(+), 8 deletions(-)
diff -puN mm/slab.c~vst-slab mm/slab.c
--- linux-2.6.13-rc1/mm/slab.c~vst-slab 2005-07-05 16:36:03.000000000 +0530
+++ linux-2.6.13-rc1-root/mm/slab.c 2005-07-06 18:01:28.000000000 +0530
@@ -371,6 +371,8 @@ struct kmem_cache_s {
*/
#define REAPTIMEOUT_CPUC (2*HZ)
#define REAPTIMEOUT_LIST3 (4*HZ)
+#define MAX_REAP_TIMEOUT (30*HZ)
+#define MAX_DRAIN_COUNT 2
#if STATS
#define STATS_INC_ACTIVE(x) ((x)->num_active++)
@@ -569,6 +571,7 @@ static enum {
} g_cpucache_up;
static DEFINE_PER_CPU(struct work_struct, reap_work);
+static DEFINE_PER_CPU(unsigned long, last_timeout);
static void free_block(kmem_cache_t* cachep, void** objpp, int len);
static void enable_cpucache (kmem_cache_t *cachep);
@@ -883,8 +886,10 @@ static int __init cpucache_init(void)
* pages to gfp.
*/
for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (cpu_online(cpu))
+ if (cpu_online(cpu)) {
+ per_cpu(last_timeout, cpu) = REAPTIMEOUT_CPUC + cpu;
start_cpu_timer(cpu);
+ }
}
return 0;
@@ -1543,7 +1548,7 @@ static void smp_call_function_all_cpus(v
preempt_enable();
}
-static void drain_array_locked(kmem_cache_t* cachep,
+static int drain_array_locked(kmem_cache_t* cachep,
struct array_cache *ac, int force);
static void do_drain(void *arg)
@@ -2753,10 +2758,10 @@ static void enable_cpucache(kmem_cache_t
cachep->name, -err);
}
-static void drain_array_locked(kmem_cache_t *cachep,
+static int drain_array_locked(kmem_cache_t *cachep,
struct array_cache *ac, int force)
{
- int tofree;
+ int tofree = 1;
check_spinlock_acquired(cachep);
if (ac->touched && !force) {
@@ -2771,6 +2776,8 @@ static void drain_array_locked(kmem_cach
memmove(&ac_entry(ac)[0], &ac_entry(ac)[tofree],
sizeof(void*)*ac->avail);
}
+
+ return tofree;
}
/**
@@ -2787,17 +2794,20 @@ static void drain_array_locked(kmem_cach
static void cache_reap(void *unused)
{
struct list_head *walk;
+ int drain_count = 0, freed_slab_count = 0;
+ unsigned long timeout = __get_cpu_var(last_timeout);
+ int cpu = smp_processor_id();
if (down_trylock(&cache_chain_sem)) {
/* Give up. Setup the next iteration. */
- schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC + smp_processor_id());
+ schedule_delayed_work(&__get_cpu_var(reap_work), timeout);
return;
}
list_for_each(walk, &cache_chain) {
kmem_cache_t *searchp;
struct list_head* p;
- int tofree;
+ int tofree, count;
struct slab *slabp;
searchp = list_entry(walk, kmem_cache_t, next);
@@ -2809,7 +2819,9 @@ static void cache_reap(void *unused)
spin_lock_irq(&searchp->spinlock);
- drain_array_locked(searchp, ac_data(searchp), 0);
+ count = drain_array_locked(searchp, ac_data(searchp), 0);
+ if (count > drain_count)
+ drain_count = count;
if(time_after(searchp->lists.next_reap, jiffies))
goto next_unlock;
@@ -2825,6 +2837,9 @@ static void cache_reap(void *unused)
}
tofree = (searchp->free_limit+5*searchp->num-1)/(5*searchp->num);
+ if (tofree > freed_slab_count)
+ freed_slab_count = tofree;
+
do {
p = list3_data(searchp)->slabs_free.next;
if (p == &(list3_data(searchp)->slabs_free))
@@ -2854,7 +2869,16 @@ next:
up(&cache_chain_sem);
drain_remote_pages();
/* Setup the next iteration */
- schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC + smp_processor_id());
+#ifdef CONFIG_NO_IDLE_HZ
+ if (drain_count < MAX_DRAIN_COUNT && !freed_slab_count) {
+ if (timeout * 2 < MAX_REAP_TIMEOUT)
+ timeout *= 2;
+ } else
+#endif
+ timeout = REAPTIMEOUT_CPUC + cpu;
+
+ __get_cpu_var(last_timeout) = timeout;
+ schedule_delayed_work(&__get_cpu_var(reap_work), timeout);
}
#ifdef CONFIG_PROC_FS
_
The results with this patch on an 8-way Intel box are:
CPU#   samples    Mean        Std Dev      Max         Min
1: 31 4124.742 4331.914 14317.000 0.000
2: 35 3603.600 3792.050 12556.000 14.000
3: 2585 49.458 4.207 50.000 2.000
4: 151 847.682 329.343 1139.000 15.000
5: 23 5432.652 3461.856 12024.000 120.000
6: 19 6229.158 5641.813 15000.000 169.000
7: 67 1865.672 1528.343 5000.000 19.000
Note that the best average is now around ~6 sec (with the max being 15 sec).
Given this, do you still advocate that we restrict idle CPUs to waking up
every 100 ms to check for imbalance? IMO, we should let them sleep
much longer (a few seconds). What are the consequences for load balancing
if idle CPUs sleep that long? Does it mean that the system can remain
unresponsive for a few seconds under some circumstances (when there is a burst
of activity while the idle CPUs are sleeping)?
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2005-07-07 0:23 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-07 18:27 [RFC] (How to) Let idle CPUs sleep Srivatsa Vaddagiri
2005-05-08 3:50 ` Rusty Russell
2005-05-08 4:14 ` Nick Piggin
2005-05-08 12:19 ` Srivatsa Vaddagiri
2005-05-09 6:27 ` Nick Piggin
2005-05-12 8:38 ` Srivatsa Vaddagiri
2005-05-11 18:03 ` Tony Lindgren
2005-05-12 8:46 ` Srivatsa Vaddagiri
2005-05-12 16:01 ` Lee Revell
2005-05-12 16:16 ` Tony Lindgren
2005-05-12 16:28 ` Jesse Barnes
2005-05-12 17:12 ` Srivatsa Vaddagiri
2005-05-12 17:59 ` Jesse Barnes
2005-05-12 18:16 ` Tony Lindgren
2005-05-13 6:27 ` Srivatsa Vaddagiri
2005-05-12 18:08 ` Martin Schwidefsky
2005-05-12 18:21 ` Tony Lindgren
2005-05-13 6:23 ` Srivatsa Vaddagiri
2005-05-13 7:16 ` Nick Piggin
2005-05-13 8:04 ` Ingo Molnar
2005-05-13 8:27 ` Nick Piggin
2005-05-13 9:19 ` Srivatsa Vaddagiri
2005-05-13 9:33 ` Nick Piggin
2005-05-12 21:16 ` George Anzinger
2005-05-12 21:35 ` Jesse Barnes
2005-05-12 22:15 ` George Anzinger
2005-05-13 0:43 ` Jesse Barnes
2005-05-13 6:31 ` Srivatsa Vaddagiri
2005-06-30 12:47 ` Srivatsa Vaddagiri
2005-07-06 17:31 ` Srivatsa Vaddagiri
2005-05-08 10:13 ` Arjan van de Ven
2005-05-08 13:33 ` Andi Kleen
2005-05-08 13:44 ` Arjan van de Ven
2005-05-08 14:53 ` Andi Kleen
2005-05-08 13:31 ` Andi Kleen
2005-05-08 15:26 ` Srivatsa Vaddagiri