[git pull request] scheduler updates

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [git pull request] scheduler updates
@ 2007-08-12 16:32 Ingo Molnar
  2007-08-14  8:37 ` [accounting regression since rc1] " Christian Borntraeger
       [not found] ` <200708141032.47235.borntraeger@de.ibm.com>
  0 siblings, 2 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-12 16:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel

Linus, please pull the latest scheduler git tree from:

  git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

three bugfixes:

- a nice fix from eagle-eye Oleg for a subtle typo in the balancing
  code, the effect of this bug was more agressive idle balancing. This
  bug was introduced by one of the original CFS commits.

- a round of global->static fixes from Adrian Bunk - this change,
  besides the cleanup effect, chops 100 bytes off sched.o.

- Peter Zijlstra noticed a sleeper-bonus bug. I kept this patch under
  observation and testing this past week and saw no ill effects so far. 
  It could fix two suspected regressions. (It could improve Kasper
  Sandberg's workload and it could improve the sleeper/runner
  problem/bug Roman Zippel was seeing.)

test-built and test-booted on x86-32 and x86-64, and did a dozen of 
randconfig builds for good measure (which uncovered two new build errors 
in latest -git).

Thanks,

	Ingo

--------------->
Adrian Bunk (1):
      sched: make global code static

Ingo Molnar (1):
      sched: fix sleeper bonus

Oleg Nesterov (1):
      sched: run_rebalance_domains: s/SCHED_IDLE/CPU_IDLE/

 include/linux/cpu.h |    2 --
 kernel/sched.c      |   48 ++++++++++++++++++++++++------------------------
 kernel/sched_fair.c |   12 ++++++------
 3 files changed, 30 insertions(+), 32 deletions(-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-12 16:32 [git pull request] scheduler updates Ingo Molnar
@ 2007-08-14  8:37 ` Christian Borntraeger
  2007-08-16  8:17   ` [PATCH][RFC] Re: accounting regression since rc1 Christian Borntraeger
  2007-08-20 15:45   ` [accounting regression since rc1] scheduler updates Ingo Molnar
       [not found] ` <200708141032.47235.borntraeger@de.ibm.com>
  1 sibling, 2 replies; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-14  8:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

This is a 2nd try with correct email address, sorry for the duplicates.

Am Sonntag, 12. August 2007 schrieb Ingo Molnar:
> Linus, please pull the latest scheduler git tree from:

Hello Ingo,

this is a followup to the discussion in 
http://lkml.org/lkml/2007/7/19/538

Since 2.6.12, s390 already does precise accouting for system and user time.  
Depending on CONFIG_VIRT_CPU_ACCOUNTING, we  use two 64bit hardware timers on 
s390: the first returns the wall clock time and is stepped even if the 
virtual cpu is not backed by a physical cpu. The second timer is only 
stepped, when the virtual cpu is backed by a physical cpu. The timers have a 
very high accurancy, and the architecture guarantees that bit 51 is increased 
by one/microsecond. We store both timers on each context switch, irq, 
syscall, and machinecheck in entry.S. The calculation are made in 
arch/s390/kernel/vtime.c in accouting_system_vtime and friends with 
microsecond accurracy. This is also used for irq accouting (see the 
definition of irq_enter). It basically boils down to precise numbers in the 
cpu stat and the utime/stime for processes as well as knowledge about time 
stolen by the hypervisor. 

With CFS the accounting was changed, and everything is now based on 
sum_exec_runtime.  There is now an accounting regression on s390 (and maybe 
ppc64), as the default jiffy implemenation does not know anything about 
virtual cpus.

While looking for a solution, I started with a very quick hack and reverted 
b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa for the procfs related changes. If I 
revert that commit, it seems that I get the old behaviour -  but of course 
this is just a hack.

I see some options now:

1. Jan could finish his sched_clock implementation for s390 and we would get 
close to the precise numbers. This would also let CFS make better decisions. 
Downside: its not as precise as before as we do some math on the numbers and 
it will burn cycles to compute numbers we already have 
(utime=sum*utime/stime). 
2. set sum_exec_runtime based on the precise utime and stime. Dont know enough 
about CFS if this would show different scheduling behaviour than 1
3. ifdef fs/proc/array.c depending on CONFIG_VIRT_CPU_ACCOUNTING. This will 
save some cycles, and the numbers are precise to a microsecond. Downside: the 
scheduler gets no information about virtual cpus and steal time so its 
probably not completely fair
4. implement sched_clock AND reuse the exisiting utime and stime numbers.
5. other clever solutions I cannot see

Any suggestions?

Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH][RFC] Re: accounting regression since rc1
  2007-08-14  8:37 ` [accounting regression since rc1] " Christian Borntraeger
@ 2007-08-16  8:17   ` Christian Borntraeger
  2007-08-20 15:45   ` [accounting regression since rc1] scheduler updates Ingo Molnar
  1 sibling, 0 replies; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-16  8:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

Ingo,

this patch fixes the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING. It 
reverts parts of commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa by converting 
fs/proc/array.c back to cputime_t. The new functions task_utime and 
task_stime now return cputime_t instead of clock_t. If 
CONFIG_VIRT_CPU_ACCOUTING is set, task->utime and task->stime are returned 
directly instead of using sum_exec_runtime. 

Patch is tested on s390x with and without VIRT_CPU_ACCOUTING as well as on 
i386. Not tested on ppc64 - Paul?

Feedback is welcome.

Signed-Off-By: Christian Borntraeger <borntraeger@de.ibm.com>

---
 fs/proc/array.c |   41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/proc/array.c
===================================================================
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -320,7 +320,18 @@ int proc_pid_status(struct task_struct *
 	return buffer - orig;
 }
 
-static clock_t task_utime(struct task_struct *p)
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING)
+static cputime_t task_utime(struct task_struct *p)
+{
+	return p->utime;
+}
+
+static cputime_t task_stime(struct task_struct *p)
+{
+	return p->stime;
+}
+#else
+static cputime_t task_utime(struct task_struct *p)
 {
 	clock_t utime = cputime_to_clock_t(p->utime),
 		total = utime + cputime_to_clock_t(p->stime);
@@ -337,10 +348,10 @@ static clock_t task_utime(struct task_st
 	}
 	utime = (clock_t)temp;
 
-	return utime;
+	return clock_t_to_cputime(utime);
 }
 
-static clock_t task_stime(struct task_struct *p)
+static cputime_t task_stime(struct task_struct *p)
 {
 	clock_t stime;
 
@@ -349,10 +360,12 @@ static clock_t task_stime(struct task_st
 	 * the total, to make sure the total observed by userspace
 	 * grows monotonically - apps rely on that):
 	 */
-	stime = nsec_to_clock_t(p->se.sum_exec_runtime) - task_utime(p);
+	stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
+			cputime_to_clock_t(task_utime(p));
 
-	return stime;
+	return clock_t_to_cputime(stime);
 }
+#endif
 
 static int do_task_stat(struct task_struct *task, char *buffer, int whole)
 {
@@ -368,8 +381,7 @@ static int do_task_stat(struct task_stru
 	unsigned long long start_time;
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long  min_flt = 0,  maj_flt = 0;
-	cputime_t cutime, cstime;
-	clock_t utime, stime;
+	cputime_t cutime, cstime, utime, stime;
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
@@ -387,8 +399,7 @@ static int do_task_stat(struct task_stru
 
 	sigemptyset(&sigign);
 	sigemptyset(&sigcatch);
-	cutime = cstime = cputime_zero;
-	utime = stime = 0;
+	cutime = cstime = utime = stime = cputime_zero;
 
 	rcu_read_lock();
 	if (lock_task_sighand(task, &flags)) {
@@ -414,15 +425,15 @@ static int do_task_stat(struct task_stru
 			do {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
-				utime += task_utime(t);
-				stime += task_stime(t);
+				utime = cputime_add(utime, task_utime(t));
+				stime = cputime_add(stime, task_stime(t));
 				t = next_thread(t);
 			} while (t != task);
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			utime += cputime_to_clock_t(sig->utime);
-			stime += cputime_to_clock_t(sig->stime);
+			utime = cputime_add(utime, sig->utime);
+			stime = cputime_add(stime, sig->stime);
 		}
 
 		sid = signal_session(sig);
@@ -471,8 +482,8 @@ static int do_task_stat(struct task_stru
 		cmin_flt,
 		maj_flt,
 		cmaj_flt,
-		utime,
-		stime,
+		cputime_to_clock_t(utime),
+		cputime_to_clock_t(stime),
 		cputime_to_clock_t(cutime),
 		cputime_to_clock_t(cstime),
 		priority,

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-14  8:37 ` [accounting regression since rc1] " Christian Borntraeger
  2007-08-16  8:17   ` [PATCH][RFC] Re: accounting regression since rc1 Christian Borntraeger
@ 2007-08-20 15:45   ` Ingo Molnar
  2007-08-20 17:03     ` Martin Schwidefsky
  2007-08-21  8:17     ` Christian Borntraeger
  1 sibling, 2 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-20 15:45 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> 1. Jan could finish his sched_clock implementation for s390 and we 
> would get close to the precise numbers. This would also let CFS make 
> better decisions. [...]

i think this is the best option and it should give us the same /proc 
accuracy on s390 as before, plus improved scheduler precision. (and 
improved tracing accuracy, etc. etc.) Note that for architectures that 
already have sched_clock() at least as precise as the stime/utime stats 
there's no problem - and that seems to include all architectures except 
s390.

could you send that precise sched_clock() patch? It should be an order 
of magnitude simpler than the high-precision stime/utime tracking you 
already do, and it's needed for quality scheduling anyway.

> [...] Downside: its not as precise as before as we do some math on the 
> numbers and it will burn cycles to compute numbers we already have 
> (utime=sum*utime/stime).

i can see no real downside to it: if all of stime, utime and 
sum_exec_clock are precise, then the numbers we present via /proc are 
precise too:

   sum_exec * utime / stime;

there should be no loss of precision on s390 because the 
multiplication/division rounding is not accumulating - we keep the 
precise sum_exec, utime and stime values untouched.

on x86 we dont really want to slow down every irq and syscall event with 
precise stime/utime stats for 'top' to display. On s390 the 
multiplication and division is indeed superfluous but it keeps the code 
generic for arches where utime/stime is less precise and irq-sampled - 
while the sum is always precise. It also animates architectures that 
have an imprecise sched_clock() implementation to improve its accuracy. 
Accessing the /proc files alone is many orders of magnitude more 
expensive than this simple multiplication and division.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 15:45   ` [accounting regression since rc1] scheduler updates Ingo Molnar
@ 2007-08-20 17:03     ` Martin Schwidefsky
  2007-08-20 18:08       ` Ingo Molnar
  2007-08-21  8:17     ` Christian Borntraeger
  1 sibling, 1 reply; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-20 17:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

On Mon, 2007-08-20 at 17:45 +0200, Ingo Molnar wrote: 
> * Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> > 1. Jan could finish his sched_clock implementation for s390 and we 
> > would get close to the precise numbers. This would also let CFS make 
> > better decisions. [...]
> 
> i think this is the best option and it should give us the same /proc 
> accuracy on s390 as before, plus improved scheduler precision. (and 
> improved tracing accuracy, etc. etc.) Note that for architectures that 
> already have sched_clock() at least as precise as the stime/utime stats 
> there's no problem - and that seems to include all architectures except 
> s390.

For far we have used the TOD clock for sched_clock. This clocks measures
real time with an accuracy of 1usec or better. The [us]time accounting
with CONFIG_VIRT_CPU_ACCOUNTING=y is done using the CPU timer. This
timer measures virtual time with an accuracy of 1usec of better. Without
CONFIG_VIRT_CPU_ACCOUNTING the [us]time accounting is done with HZ
ticks. Which means that sched_clock() is at least as precise as [us]time
on s390 as well, only that we distinguish between real time / virtual
time if the improved accounting is used.

> could you send that precise sched_clock() patch? It should be an order 
> of magnitude simpler than the high-precision stime/utime tracking you 
> already do, and it's needed for quality scheduling anyway.

Sure if you can explain what it should do. This is still unclear to me,
for a non-idle CPU the virtual cpu time should be used but for an idle
CPU the real time should be used ? That seems rather ill-defined to me.
On s390 we have three times to consider, real time, virtual cpu time and
steal time. For a given period we have real = virtual + steal. And if a
cpu is idle we have real = steal, virtual = 0. My best interpretation of
what you want is that sched_clock should progress with virtual cpu time
if the current process is not idle and with the real time if it is. No ?

> > [...] Downside: its not as precise as before as we do some math on the 
> > numbers and it will burn cycles to compute numbers we already have 
> > (utime=sum*utime/stime).
> 
> i can see no real downside to it: if all of stime, utime and 
> sum_exec_clock are precise, then the numbers we present via /proc are 
> precise too:
> 
>    sum_exec * utime / stime;
> 
> there should be no loss of precision on s390 because the 
> multiplication/division rounding is not accumulating - we keep the 
> precise sum_exec, utime and stime values untouched.

But then sched_clock() has to return the virtual cpu time only,
otherwise it will be hard to make sum_exec exact, wouldn't it?
And why should we jump through all these loops to come up with values
that are only as good as the values we already have?

> on x86 we dont really want to slow down every irq and syscall event with 
> precise stime/utime stats for 'top' to display. On s390 the 
> multiplication and division is indeed superfluous but it keeps the code 
> generic for arches where utime/stime is less precise and irq-sampled - 
> while the sum is always precise. It also animates architectures that 
> have an imprecise sched_clock() implementation to improve its accuracy. 
> Accessing the /proc files alone is many orders of magnitude more 
> expensive than this simple multiplication and division.

Yes, I can understand why you don't want to have the exact cpu
accounting scheme on x86 since it will slow down every context switch
quite a bit (that includes user <-> kernel, softirq <-> hardirq <->
process context, ..). On s390 the cost is acceptable, for an empty
system call it is about 40 additional cycles for the precise accounting.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 17:03     ` Martin Schwidefsky
@ 2007-08-20 18:08       ` Ingo Molnar
  2007-08-20 18:33         ` Martin Schwidefsky
                           ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-20 18:08 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras


* Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> > could you send that precise sched_clock() patch? It should be an 
> > order of magnitude simpler than the high-precision stime/utime 
> > tracking you already do, and it's needed for quality scheduling 
> > anyway.
> 
> Sure if you can explain what it should do. This is still unclear to 
> me, for a non-idle CPU the virtual cpu time should be used but for an 
> idle CPU the real time should be used ? That seems rather ill-defined 
> to me. On s390 we have three times to consider, real time, virtual cpu 
> time and steal time. For a given period we have real = virtual + 
> steal. And if a cpu is idle we have real = steal, virtual = 0. My best 
> interpretation of what you want is that sched_clock should progress 
> with virtual cpu time if the current process is not idle and with the 
> real time if it is. No ?

The core scheduler is invariant to sched_clock()'s behavior during idle 
periods [i.e. sched_clock() can do _anything_ during idle periods, and 
scheduling would not/schould not change], and that's the source of the 
uncertainty you noted.

So the best way to proceed is still a bit unclear to me - but in any 
case, a few boundary conditions are already cast into stone: we 
definitely dont want to make life harder for s390 (without giving any 
tangible benefits) and we dont want to regress any existing precision of 
s390 either.

We seem to agree wrt. sched_clock()'s behavior while the virtual CPU is 
busy: sched_clock() very much wants to track virtual time. (real time is 
pretty much meaningless and coupling sched_clock() to real time would 
make the virtual machine's behavior dependent on the host's load, which 
breaks the "seemless virtualization to inside observers" common-sense 
requirement of virtual-CPU scheduling.)

For sched_clock()'s behavior while the virtual CPU is idle: my current 
idea for that is the patch below (a loosely analoguous problem exists 
with nohz/dynticks): it makes sched_clock() valid across idle periods 
too and uses wall-clock time for that.

If a virtual CPU is idle then i think the "real = steal, virtual = 0" 
way of thinking about idle looks a bit unnatural to me - wouldnt it be 
better to think in terms of "steal = 0, virtual = real" ? Basically a 
virtual CPU can idle at "perfect speed", without the host "stealing" any 
cycles from it. And with that way of thinking, if s390 passed in the 
real-idle-time value to the new callbacks below it would all fall into 
place. Hm?

that way we'd have a meaningful sched_clock() across idle periods too, 
useful for tracers, better scheduler debug-statistics, etc.

	Ingo

---------->
Subject: sched: sched_clock_idle_[sleep|wakeup]_event()
From: Ingo Molnar <mingo@elte.hu>

construct a more or less wall-clock time out of sched_clock(), by
using ACPI-idle's existing knowledge about how much time we spent
idling. This allows the rq clock to work around TSC-stops-in-C2,
TSC-gets-corrupted-in-C3 type of problems.

( Besides the scheduler's statistics this also benefits blktrace and
  printk-timestamps as well. )

Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup
callbacks allow the scheduler to get out the most of the period where
the CPU has a reliable TSC. This results in slightly more precise
task statistics.

the ACPI bits were acked by Len.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Len Brown <len.brown@intel.com>
---
 arch/i386/kernel/tsc.c        |    1 -
 drivers/acpi/processor_idle.c |   32 +++++++++++++++++++++++++-------
 include/linux/sched.h         |    3 ++-
 kernel/sched.c                |   41 ++++++++++++++++++++++++++++++++---------
 kernel/sched_debug.c          |    3 ++-
 5 files changed, 61 insertions(+), 19 deletions(-)

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -292,7 +292,6 @@ static struct clocksource clocksource_ts
 
 void mark_tsc_unstable(char *reason)
 {
-	sched_clock_unstable_event();
 	if (!tsc_unstable) {
 		tsc_unstable = 1;
 		tsc_enabled = 0;
Index: linux/drivers/acpi/processor_idle.c
===================================================================
--- linux.orig/drivers/acpi/processor_idle.c
+++ linux/drivers/acpi/processor_idle.c
@@ -63,6 +63,7 @@
 ACPI_MODULE_NAME("processor_idle");
 #define ACPI_PROCESSOR_FILE_POWER	"power"
 #define US_TO_PM_TIMER_TICKS(t)		((t * (PM_TIMER_FREQUENCY/1000)) / 1000)
+#define PM_TIMER_TICK_NS		(1000000000ULL/PM_TIMER_FREQUENCY)
 #define C2_OVERHEAD			4	/* 1us (3.579 ticks per us) */
 #define C3_OVERHEAD			4	/* 1us (3.579 ticks per us) */
 static void (*pm_idle_save) (void) __read_mostly;
@@ -462,6 +463,9 @@ static void acpi_processor_idle(void)
 		 * TBD: Can't get time duration while in C1, as resumes
 		 *      go to an ISR rather than here.  Need to instrument
 		 *      base interrupt handler.
+		 *
+		 * Note: the TSC better not stop in C1, sched_clock() will
+		 *       skew otherwise.
 		 */
 		sleep_ticks = 0xFFFFFFFF;
 		break;
@@ -469,6 +473,8 @@ static void acpi_processor_idle(void)
 	case ACPI_STATE_C2:
 		/* Get start time (ticks) */
 		t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
+		/* Tell the scheduler that we are going deep-idle: */
+		sched_clock_idle_sleep_event();
 		/* Invoke C2 */
 		acpi_state_timer_broadcast(pr, cx, 1);
 		acpi_cstate_enter(cx);
@@ -479,17 +485,22 @@ static void acpi_processor_idle(void)
 		/* TSC halts in C2, so notify users */
 		mark_tsc_unstable("possible TSC halt in C2");
 #endif
+		/* Compute time (ticks) that we were actually asleep */
+		sleep_ticks = ticks_elapsed(t1, t2);
+
+		/* Tell the scheduler how much we idled: */
+		sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+
 		/* Re-enable interrupts */
 		local_irq_enable();
+		/* Do not account our idle-switching overhead: */
+		sleep_ticks -= cx->latency_ticks + C2_OVERHEAD;
+
 		current_thread_info()->status |= TS_POLLING;
-		/* Compute time (ticks) that we were actually asleep */
-		sleep_ticks =
-		    ticks_elapsed(t1, t2) - cx->latency_ticks - C2_OVERHEAD;
 		acpi_state_timer_broadcast(pr, cx, 0);
 		break;
 
 	case ACPI_STATE_C3:
-
 		/*
 		 * disable bus master
 		 * bm_check implies we need ARB_DIS
@@ -518,6 +529,8 @@ static void acpi_processor_idle(void)
 		t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
 		/* Invoke C3 */
 		acpi_state_timer_broadcast(pr, cx, 1);
+		/* Tell the scheduler that we are going deep-idle: */
+		sched_clock_idle_sleep_event();
 		acpi_cstate_enter(cx);
 		/* Get end time (ticks) */
 		t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
@@ -531,12 +544,17 @@ static void acpi_processor_idle(void)
 		/* TSC halts in C3, so notify users */
 		mark_tsc_unstable("TSC halts in C3");
 #endif
+		/* Compute time (ticks) that we were actually asleep */
+		sleep_ticks = ticks_elapsed(t1, t2);
+		/* Tell the scheduler how much we idled: */
+		sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+
 		/* Re-enable interrupts */
 		local_irq_enable();
+		/* Do not account our idle-switching overhead: */
+		sleep_ticks -= cx->latency_ticks + C3_OVERHEAD;
+
 		current_thread_info()->status |= TS_POLLING;
-		/* Compute time (ticks) that we were actually asleep */
-		sleep_ticks =
-		    ticks_elapsed(t1, t2) - cx->latency_ticks - C3_OVERHEAD;
 		acpi_state_timer_broadcast(pr, cx, 0);
 		break;
 
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1388,7 +1388,8 @@ extern void sched_exec(void);
 #define sched_exec()   {}
 #endif
 
-extern void sched_clock_unstable_event(void);
+extern void sched_clock_idle_sleep_event(void);
+extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
 #ifdef CONFIG_HOTPLUG_CPU
 extern void idle_task_exit(void);
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -262,7 +262,8 @@ struct rq {
 	s64 clock_max_delta;
 
 	unsigned int clock_warps, clock_overflows;
-	unsigned int clock_unstable_events;
+	u64 idle_clock;
+	unsigned int clock_deep_idle_events;
 	u64 tick_timestamp;
 
 	atomic_t nr_iowait;
@@ -556,18 +557,40 @@ static inline struct rq *this_rq_lock(vo
 }
 
 /*
- * CPU frequency is/was unstable - start new by setting prev_clock_raw:
+ * We are going deep-idle (irqs are disabled):
  */
-void sched_clock_unstable_event(void)
+void sched_clock_idle_sleep_event(void)
 {
-	unsigned long flags;
-	struct rq *rq;
+	struct rq *rq = cpu_rq(smp_processor_id());
 
-	rq = task_rq_lock(current, &flags);
-	rq->prev_clock_raw = sched_clock();
-	rq->clock_unstable_events++;
-	task_rq_unlock(rq, &flags);
+	spin_lock(&rq->lock);
+	__update_rq_clock(rq);
+	spin_unlock(&rq->lock);
+	rq->clock_deep_idle_events++;
+}
+EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
+
+/*
+ * We just idled delta nanoseconds (called with irqs disabled):
+ */
+void sched_clock_idle_wakeup_event(u64 delta_ns)
+{
+	struct rq *rq = cpu_rq(smp_processor_id());
+	u64 now = sched_clock();
+
+	rq->idle_clock += delta_ns;
+	/*
+	 * Override the previous timestamp and ignore all
+	 * sched_clock() deltas that occured while we idled,
+	 * and use the PM-provided delta_ns to advance the
+	 * rq clock:
+	 */
+	spin_lock(&rq->lock);
+	rq->prev_clock_raw = now;
+	rq->clock += delta_ns;
+	spin_unlock(&rq->lock);
 }
+EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
 /*
  * resched_task - mark a task 'to be rescheduled now'.
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -154,10 +154,11 @@ static void print_cpu(struct seq_file *m
 	P(next_balance);
 	P(curr->pid);
 	P(clock);
+	P(idle_clock);
 	P(prev_clock_raw);
 	P(clock_warps);
 	P(clock_overflows);
-	P(clock_unstable_events);
+	P(clock_deep_idle_events);
 	P(clock_max_delta);
 	P(cpu_load[0]);
 	P(cpu_load[1]);


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:08       ` Ingo Molnar
@ 2007-08-20 18:33         ` Martin Schwidefsky
  2007-08-20 19:00           ` Balbir Singh
                             ` (3 more replies)
  2007-08-20 23:07         ` Paul Mackerras
  2007-08-21  2:18         ` Andi Kleen
  2 siblings, 4 replies; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-20 18:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> For sched_clock()'s behavior while the virtual CPU is idle: my current 
> idea for that is the patch below (a loosely analoguous problem exists 
> with nohz/dynticks): it makes sched_clock() valid across idle periods 
> too and uses wall-clock time for that.

Ok, that would mean that sched_clock can just return the virtual cpu
time and the two hooks starts and stops the idle periods as far as the
scheduler is concerned. In this case we can use the patch from Jan with
the new implementation for sched_clock and add the two hooks to the
places where the cpu-idle notifiers are done (do_monitor_call and
default_idle). In fact this could be an idle-notifier. Hmm, I take a
closer look tomorrow when I'm back at the office.

> If a virtual CPU is idle then i think the "real = steal, virtual = 0" 
> way of thinking about idle looks a bit unnatural to me - wouldnt it be 
> better to think in terms of "steal = 0, virtual = real" ? Basically a 
> virtual CPU can idle at "perfect speed", without the host "stealing" any 
> cycles from it. And with that way of thinking, if s390 passed in the 
> real-idle-time value to the new callbacks below it would all fall into 
> place. Hm?

How you think about an idle cpu depends on your viewpoint. The source
for the virtual cpu time on s390 is the cpu timer. This timer is stopped
when a virtual cpu looses the physical cpu, so it seems natural to me to
think that real=steal, virtual=0 because the cpu timer is stopped while
the cpu is idle. The other way of thinking about it is as valid though.

> that way we'd have a meaningful sched_clock() across idle periods too, 
> useful for tracers, better scheduler debug-statistics, etc.

That would be good.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:33         ` Martin Schwidefsky
@ 2007-08-20 19:00           ` Balbir Singh
  2007-08-20 19:05           ` Ingo Molnar
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Balbir Singh @ 2007-08-20 19:00 UTC (permalink / raw)
  To: schwidefsky
  Cc: Ingo Molnar, Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

Martin Schwidefsky wrote:
> On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
>> For sched_clock()'s behavior while the virtual CPU is idle: my current 
>> idea for that is the patch below (a loosely analoguous problem exists 
>> with nohz/dynticks): it makes sched_clock() valid across idle periods 
>> too and uses wall-clock time for that.
> 
> Ok, that would mean that sched_clock can just return the virtual cpu
> time and the two hooks starts and stops the idle periods as far as the
> scheduler is concerned. In this case we can use the patch from Jan with
> the new implementation for sched_clock and add the two hooks to the
> places where the cpu-idle notifiers are done (do_monitor_call and
> default_idle). In fact this could be an idle-notifier. Hmm, I take a
> closer look tomorrow when I'm back at the office.
> 

<snip>

I am partially responsible for the regression. While working on the
CPU accounting change, I for some unknown reason always assumed
that sched_clock() was virtualized. I should have taken a closer look.

Ingo, with this new approach, sched_clock() although not virtualized,
advances as if it is (due to the idle state change accounting).
I have one question though, what if the underlying CPU is forcefully
scheduled out from the virtual CPU?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:33         ` Martin Schwidefsky
  2007-08-20 19:00           ` Balbir Singh
@ 2007-08-20 19:05           ` Ingo Molnar
  2007-08-21  7:20             ` Christian Borntraeger
  2007-08-20 19:12           ` Ingo Molnar
  2007-08-21  7:00           ` Christian Borntraeger
  3 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-20 19:05 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras


* Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> > For sched_clock()'s behavior while the virtual CPU is idle: my current 
> > idea for that is the patch below (a loosely analoguous problem exists 
> > with nohz/dynticks): it makes sched_clock() valid across idle periods 
> > too and uses wall-clock time for that.
> 
> Ok, that would mean that sched_clock can just return the virtual cpu 
> time and the two hooks starts and stops the idle periods as far as the 
> scheduler is concerned. In this case we can use the patch from Jan 
> with the new implementation for sched_clock and add the two hooks to 
> the places where the cpu-idle notifiers are done (do_monitor_call and 
> default_idle). In fact this could be an idle-notifier. Hmm, I take a 
> closer look tomorrow when I'm back at the office.

ok. Just to make it sure wrt. release-management: you said s390 
sched_clock() is currently at least as precise as stime/utime - so this 
would suggest that there is no regression over v2.6.22? Regardless of 
whether it's a live regression or not, i think we want the nohz 
improvement (and the s390 patch if the callbacks are OK to you) in .23, 
and we want to migrate all users of "raw" sched_clock() [blktrace, 
softlockup-detector, print-timestamps, etc.] over to the better 
cpu_clock() interface.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 19:05           ` Ingo Molnar
@ 2007-08-21  7:20             ` Christian Borntraeger
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21  7:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Montag, 20. August 2007 schrieb Ingo Molnar:
> ok. Just to make it sure wrt. release-management: you said s390 
> sched_clock() is currently at least as precise as stime/utime - so this 
> would suggest that there is no regression over v2.6.22? Regardless of 

On current git s390 has a sched_clock implementaton based on real time. That 
means we have no knowledge about steal time. E.g. if you only get 50% of your 
physial cpu from the hypervisor on 2.6.22 a single cpu bound process has 50% 
in top while on 2.6.23-rc top shows 100%, so yes, there is a regression.

Christian




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:33         ` Martin Schwidefsky
  2007-08-20 19:00           ` Balbir Singh
  2007-08-20 19:05           ` Ingo Molnar
@ 2007-08-20 19:12           ` Ingo Molnar
  2007-08-21  7:00           ` Christian Borntraeger
  3 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-20 19:12 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras


* Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> > If a virtual CPU is idle then i think the "real = steal, virtual = 
> > 0" way of thinking about idle looks a bit unnatural to me - wouldnt 
> > it be better to think in terms of "steal = 0, virtual = real" ? 
> > Basically a virtual CPU can idle at "perfect speed", without the 
> > host "stealing" any cycles from it. And with that way of thinking, 
> > if s390 passed in the real-idle-time value to the new callbacks 
> > below it would all fall into place. Hm?
> 
> How you think about an idle cpu depends on your viewpoint. The source 
> for the virtual cpu time on s390 is the cpu timer. This timer is 
> stopped when a virtual cpu looses the physical cpu, so it seems 
> natural to me to think that real=steal, virtual=0 because the cpu 
> timer is stopped while the cpu is idle. The other way of thinking 
> about it is as valid though.

my thinking is this: the structure of "idle time" only matters if it can 
be observed from "within" a virtual machine - via timers. Are on s390 
any of the typical app-visible timers (timer_list, etc.) driven by the 
virtual tick? [which slows down if a virtual CPU is scheduled away by 
the host/monitor/hypervisor?] Or is the virtual tick only affecting 
scheduling/cpu-accounting statistics in essence?

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:33         ` Martin Schwidefsky
                             ` (2 preceding siblings ...)
  2007-08-20 19:12           ` Ingo Molnar
@ 2007-08-21  7:00           ` Christian Borntraeger
  2007-08-21  9:18             ` Martin Schwidefsky
  3 siblings, 1 reply; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21  7:00 UTC (permalink / raw)
  To: schwidefsky
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Montag, 20. August 2007 schrieb Martin Schwidefsky:
> On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> Ok, that would mean that sched_clock can just return the virtual cpu
> time and the two hooks starts and stops the idle periods as far as the
> scheduler is concerned. In this case we can use the patch from Jan with
> the new implementation for sched_clock and add the two hooks to the
> places where the cpu-idle notifiers are done (do_monitor_call and
> default_idle). In fact this could be an idle-notifier. Hmm, I take a
> closer look tomorrow when I'm back at the office.
> 
> > If a virtual CPU is idle then i think the "real = steal, virtual = 0" 
> > way of thinking about idle looks a bit unnatural to me - wouldnt it be 
> > better to think in terms of "steal = 0, virtual = real" ? Basically a 
> > virtual CPU can idle at "perfect speed", without the host "stealing" any 
> > cycles from it. And with that way of thinking, if s390 passed in the 
> > real-idle-time value to the new callbacks below it would all fall into 
> > place. Hm?

Martin,

I think we already do something like this. If you look at cpustat in 2.6.22 
and earlier we already have steal increase = 0, idle increase = 100 % on an 
idle cpu, even on s390. So while from the hardware perspective steal is 
growing, we do the right thing in Linux, no?

Chrisian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  7:00           ` Christian Borntraeger
@ 2007-08-21  9:18             ` Martin Schwidefsky
  0 siblings, 0 replies; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-21  9:18 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

On Tue, 2007-08-21 at 09:00 +0200, Christian Borntraeger wrote:
> Am Montag, 20. August 2007 schrieb Martin Schwidefsky:
> > On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote:
> > Ok, that would mean that sched_clock can just return the virtual cpu
> > time and the two hooks starts and stops the idle periods as far as the
> > scheduler is concerned. In this case we can use the patch from Jan with
> > the new implementation for sched_clock and add the two hooks to the
> > places where the cpu-idle notifiers are done (do_monitor_call and
> > default_idle). In fact this could be an idle-notifier. Hmm, I take a
> > closer look tomorrow when I'm back at the office.
> > 
> > > If a virtual CPU is idle then i think the "real = steal, virtual = 0" 
> > > way of thinking about idle looks a bit unnatural to me - wouldnt it be 
> > > better to think in terms of "steal = 0, virtual = real" ? Basically a 
> > > virtual CPU can idle at "perfect speed", without the host "stealing" any 
> > > cycles from it. And with that way of thinking, if s390 passed in the 
> > > real-idle-time value to the new callbacks below it would all fall into 
> > > place. Hm?
> 
> Martin,
> 
> I think we already do something like this. If you look at cpustat in 2.6.22 
> and earlier we already have steal increase = 0, idle increase = 100 % on an 
> idle cpu, even on s390. So while from the hardware perspective steal is 
> growing, we do the right thing in Linux, no?

This is done in kernel/sched.c:account_steal_time(). If the architecture
backend reports steal time for idle it is accounted as idle time.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:08       ` Ingo Molnar
  2007-08-20 18:33         ` Martin Schwidefsky
@ 2007-08-20 23:07         ` Paul Mackerras
  2007-08-21  2:18         ` Andi Kleen
  2 siblings, 0 replies; 40+ messages in thread
From: Paul Mackerras @ 2007-08-20 23:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds,
	Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens

Ingo Molnar writes:

> We seem to agree wrt. sched_clock()'s behavior while the virtual CPU is 
> busy: sched_clock() very much wants to track virtual time. (real time is 
> pretty much meaningless and coupling sched_clock() to real time would 
> make the virtual machine's behavior dependent on the host's load, which 
> breaks the "seemless virtualization to inside observers" common-sense 
> requirement of virtual-CPU scheduling.)

OK, that would imply that virtualized powerpc machines want to use the
PURR register for sched_clock().  The PURR basically counts time that
this virtual CPU (SMT thread) spends dispatching instructions, so it
excludes both the time taken by the hypervisor and the time (dispatch
cycles) taken by the other thread.

> For sched_clock()'s behavior while the virtual CPU is idle: my current 
> idea for that is the patch below (a loosely analoguous problem exists 
> with nohz/dynticks): it makes sched_clock() valid across idle periods 
> too and uses wall-clock time for that.

The straightforward thing is just to use the PURR all the time, even
during idle.  What that means is this:

* If the other thread is active then sched_clock() will continue to
  advance at a slow rate during idle (reflecting the fact that the
  active thread is getting almost all of the dispatch cycles).

* If the other thread is idle and the partition is a "shared
  processor" partition then sched_clock() not advance since in that
  case we idle in the hypervisor.

* If the other thread is idle and the partition is a "dedicated
  processor" partition then sched_clock() will advance at half speed
  on both threads, since the two threads each get half of the dispatch
  cycles.

It sounds like this behaviour should be OK - do you agree?

> If a virtual CPU is idle then i think the "real = steal, virtual = 0" 
> way of thinking about idle looks a bit unnatural to me - wouldnt it be 
> better to think in terms of "steal = 0, virtual = real" ? Basically a 

Stolen time while the virtual CPU is idle gets (or at least, used to
get :) accounted as idle time.

Paul.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 18:08       ` Ingo Molnar
  2007-08-20 18:33         ` Martin Schwidefsky
  2007-08-20 23:07         ` Paul Mackerras
@ 2007-08-21  2:18         ` Andi Kleen
  2007-08-21  7:09           ` Ingo Molnar
  2 siblings, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2007-08-21  2:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds,
	Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens,
	Paul Mackerras

Ingo Molnar <mingo@elte.hu> writes:

You should just be using idle notifiers instead instead of adding more
and more custom hooks (like NOHZ has already) x86-64 still has them
and there is a old patch around to add them to i386.

-Andi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  2:18         ` Andi Kleen
@ 2007-08-21  7:09           ` Ingo Molnar
  2007-08-21 10:07             ` Andi Kleen
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21  7:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds,
	Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens,
	Paul Mackerras


* Andi Kleen <andi@firstfloor.org> wrote:

> You should just be using idle notifiers instead instead of adding more 
> and more custom hooks (like NOHZ has already) x86-64 still has them 
> and there is a old patch around to add them to i386.

these are specially placed callbacks that we want to call from certain 
ACPI codepaths but not from all of them.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  7:09           ` Ingo Molnar
@ 2007-08-21 10:07             ` Andi Kleen
  2007-08-21 10:20               ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2007-08-21 10:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Martin Schwidefsky, Christian Borntraeger,
	Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber,
	heiko.carstens, Paul Mackerras

On Tue, Aug 21, 2007 at 09:09:22AM +0200, Ingo Molnar wrote:
> 
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > You should just be using idle notifiers instead instead of adding more 
> > and more custom hooks (like NOHZ has already) x86-64 still has them 
> > and there is a old patch around to add them to i386.
> 
> these are specially placed callbacks that we want to call from certain 
> ACPI codepaths but not from all of them.

Because you believe TSC only stops in C2 and C3? That's not correct on
all systems.

-Andi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 10:07             ` Andi Kleen
@ 2007-08-21 10:20               ` Ingo Molnar
  2007-08-21 11:15                 ` Andi Kleen
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 10:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds,
	Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens,
	Paul Mackerras


* Andi Kleen <andi@firstfloor.org> wrote:

> On Tue, Aug 21, 2007 at 09:09:22AM +0200, Ingo Molnar wrote:
> > 
> > * Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > You should just be using idle notifiers instead instead of adding 
> > > more and more custom hooks (like NOHZ has already) x86-64 still 
> > > has them and there is a old patch around to add them to i386.
> > 
> > these are specially placed callbacks that we want to call from 
> > certain ACPI codepaths but not from all of them.
> 
> Because you believe TSC only stops in C2 and C3? That's not correct on 
> all systems.

i know there are some incredibly broken (but rare) boxes where the bios 
will report it only knows C1 and do C2? Is that the case you are 
referring to, or is there something else too?

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 10:20               ` Ingo Molnar
@ 2007-08-21 11:15                 ` Andi Kleen
  2007-08-21 11:20                   ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Andi Kleen @ 2007-08-21 11:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Martin Schwidefsky, Christian Borntraeger,
	Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber,
	heiko.carstens, Paul Mackerras

> i know there are some incredibly broken (but rare) boxes where the bios 
> will report it only knows C1 and do C2? Is that the case you are 
> referring to, or is there something else too?

There are first a couple of older and not so old (Centaur) chips that 
generally stop the TSC in C1.

And also some boxes who shouldn't have anything deeper than C2 have
trouble with the TSC. For example I got a Opteron machine (which
definitely shouldn't have any C2 since it's two socket) where the
TSC appears to stop or at least slow down a lot in C1.

And thirdly it's just unclean to add all kinds of custom hooks
there. It was already ugly in NOHZ, please don't continue that.

-Andi

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:15                 ` Andi Kleen
@ 2007-08-21 11:20                   ` Ingo Molnar
  0 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 11:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds,
	Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens,
	Paul Mackerras

* Andi Kleen <andi@firstfloor.org> wrote:

> > i know there are some incredibly broken (but rare) boxes where the bios 
> > will report it only knows C1 and do C2? Is that the case you are 
> > referring to, or is there something else too?
> 
> There are first a couple of older and not so old (Centaur) chips that 
> generally stop the TSC in C1.

there's not much we can do about them: the ACPI code doesnt measure 
their idle time, right? This is mostly for statistics purposes, so 
unless "broken" means tons of boxes, we dont have to have 100% coverage.

> And also some boxes who shouldn't have anything deeper than C2 have 
> trouble with the TSC. For example I got a Opteron machine (which 
> definitely shouldn't have any C2 since it's two socket) where the TSC 
> appears to stop or at least slow down a lot in C1.

how much is "a lot" in C1? There's an AMD TSC-slows-down-C1 erratum but 
that should be less than 1%. (which is fine enough for idle time 
measurements)

> And thirdly it's just unclean to add all kinds of custom hooks there. 
> It was already ugly in NOHZ, please don't continue that.

i'm not opposed to the idle notifiers but iirc the idle notifiers caused 
problems in themselves so part of them were reverted. We can do this 
more cleanly in .24 - it will make the benefits of the notifier cleanup 
even more apparent.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-20 15:45   ` [accounting regression since rc1] scheduler updates Ingo Molnar
  2007-08-20 17:03     ` Martin Schwidefsky
@ 2007-08-21  8:17     ` Christian Borntraeger
  2007-08-21  8:42       ` Ingo Molnar
  2007-08-21 11:25       ` Ingo Molnar
  1 sibling, 2 replies; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21  8:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Montag, 20. August 2007 schrieb Ingo Molnar:
> could you send that precise sched_clock() patch? It should be an order 
> of magnitude simpler than the high-precision stime/utime tracking you 
> already do, and it's needed for quality scheduling anyway.

I have a question about that. I just played with sched_clock, and even when I 
intentionally slow down sched_clock by a factor of 2, my cpu bound process 
gets 100 % in top. If this is intentional, I dont understand how a 
virtualized sched_clock would fix the accounting change?

Thanks

Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  8:17     ` Christian Borntraeger
@ 2007-08-21  8:42       ` Ingo Molnar
  2007-08-21  9:11         ` Martin Schwidefsky
  2007-08-21 11:25       ` Ingo Molnar
  1 sibling, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21  8:42 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> Am Montag, 20. August 2007 schrieb Ingo Molnar:
> > could you send that precise sched_clock() patch? It should be an order 
> > of magnitude simpler than the high-precision stime/utime tracking you 
> > already do, and it's needed for quality scheduling anyway.
> 
> I have a question about that. I just played with sched_clock, and even 
> when I intentionally slow down sched_clock by a factor of 2, my cpu 
> bound process gets 100 % in top. If this is intentional, I dont 
> understand how a virtualized sched_clock would fix the accounting 
> change?

hm, does on s390 scheduler_tick() get driven in virtual time or in real 
time? The very latest scheduler code will enforce a minimum rate of 
sched_clock() across two scheduler_tick() calls (in rc3 and later 
kernels). If sched_clock() "slows down" but scheduler_tick() still has a 
real-time frequency then that impacts the quality of scheduling. So 
scheduler_tick() and sched_clock() must really have the same behavior 
(either both are virtual or both are real), so that scheduling becomes 
invariant to steal-time.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  8:42       ` Ingo Molnar
@ 2007-08-21  9:11         ` Martin Schwidefsky
  2007-08-21  9:34           ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-21  9:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

On Tue, 2007-08-21 at 10:42 +0200, Ingo Molnar wrote:
> * Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> > Am Montag, 20. August 2007 schrieb Ingo Molnar:
> > > could you send that precise sched_clock() patch? It should be an order 
> > > of magnitude simpler than the high-precision stime/utime tracking you 
> > > already do, and it's needed for quality scheduling anyway.
> > 
> > I have a question about that. I just played with sched_clock, and even 
> > when I intentionally slow down sched_clock by a factor of 2, my cpu 
> > bound process gets 100 % in top. If this is intentional, I dont 
> > understand how a virtualized sched_clock would fix the accounting 
> > change?
> 
> hm, does on s390 scheduler_tick() get driven in virtual time or in real 
> time? The very latest scheduler code will enforce a minimum rate of 
> sched_clock() across two scheduler_tick() calls (in rc3 and later 
> kernels). If sched_clock() "slows down" but scheduler_tick() still has a 
> real-time frequency then that impacts the quality of scheduling. So 
> scheduler_tick() and sched_clock() must really have the same behavior 
> (either both are virtual or both are real), so that scheduling becomes 
> invariant to steal-time.

scheduler_tick() is based on the HZ timer which uses the TOD clock =
real time. sched_clock() currently uses the TOD clock as well so in
regard to the new scheduler we currently do not have a problem. We have
a problem with cpu time accounting, the change to the /proc code breaks
the precise accounting on s390. To solve the cpu time accounting we need
to change sched_clock() to the cpu timer = virtual time. To change the
scheduler_tick() as well requires another patch and I fear it would
complicate things in the s390 backend.

And if you say that the scheduling becomes invariant to steal-time, how
is the cpu time accounting via sum_exec supposed to work if it does not
take steal-time into account ?

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  9:11         ` Martin Schwidefsky
@ 2007-08-21  9:34           ` Ingo Molnar
  2007-08-21  9:48             ` Paul Mackerras
                               ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21  9:34 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

* Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> > hm, does on s390 scheduler_tick() get driven in virtual time or in 
> > real time? The very latest scheduler code will enforce a minimum 
> > rate of sched_clock() across two scheduler_tick() calls (in rc3 and 
> > later kernels). If sched_clock() "slows down" but scheduler_tick() 
> > still has a real-time frequency then that impacts the quality of 
> > scheduling. So scheduler_tick() and sched_clock() must really have 
> > the same behavior (either both are virtual or both are real), so 
> > that scheduling becomes invariant to steal-time.
> 
> scheduler_tick() is based on the HZ timer which uses the TOD clock = 
> real time. sched_clock() currently uses the TOD clock as well so in 
> regard to the new scheduler we currently do not have a problem. We 
> have a problem with cpu time accounting, the change to the /proc code 
> breaks the precise accounting on s390. To solve the cpu time 
> accounting we need to change sched_clock() to the cpu timer = virtual 
> time. To change the scheduler_tick() as well requires another patch 
> and I fear it would complicate things in the s390 backend.

my feeling is that it gives us generally higher-quality scheduling if we 
drive all things scheduler via virtual time. Do you agree with that?

> And if you say that the scheduling becomes invariant to steal-time, 
> how is the cpu time accounting via sum_exec supposed to work if it 
> does not take steal-time into account ?

right now there are two distinct and independent things: scheduler 
behavior (the scheduling decisions the scheduler makes) and accounting 
behavior.

the 'invariant' i mentioned only covers scheduler behavior, not 
accounting behavior. Accounting is separate in theory, but coupled in 
practice now via sum_exec_runtime.

Before we do a patch to decouple them again, lets make sure we agree on 
the direction to take here. There are two ways to account within a 
virtual machine: either in real time or in virtual time.

it seems you'd like accounting to be sensitive to 'external load' - i.e. 
you'd like an 'internal' top to show the 'real' CPU accounting, right? 

Wouldnt it be more consistent if a virtual box would not show any 
dependency on external load? (i.e. it would slow down all of its 
internal functionality transparently, without exposing it via /proc. The 
only way to observe that would be the TOD interfaces: gettimeofday and 
real-time clock driven POSIX timers. Even timer_list could be driven via 
virtual time - although that would probably break user expectations, 
right?) Or would accounting-in-virtual-time break user expectations too? 
(most of the other hypervisors let guests account in virtual time.)

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  9:34           ` Ingo Molnar
@ 2007-08-21  9:48             ` Paul Mackerras
  2007-08-21 10:38             ` Martin Schwidefsky
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Paul Mackerras @ 2007-08-21  9:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds,
	Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens

Ingo Molnar writes:

> my feeling is that it gives us generally higher-quality scheduling if we 
> drive all things scheduler via virtual time. Do you agree with that?

On PowerPC at least, while we can measure virtual time, there is no
hardware facility for getting an interrupt after a certain amount of
virtual time has elapsed, but there is a facility to get an interrupt
after an elapsed real-time interval.  So sched_clock() could easily
change to measure virtual time, but I don't see how to make
scheduler_tick() be driven off virtual time.  It sounds like s390 is
the same.

> it seems you'd like accounting to be sensitive to 'external load' - i.e. 
> you'd like an 'internal' top to show the 'real' CPU accounting, right? 
> 
> Wouldnt it be more consistent if a virtual box would not show any 
> dependency on external load? (i.e. it would slow down all of its 
> internal functionality transparently, without exposing it via /proc. The 

The tools that use the data in /proc get unhappy if user time + system
time + hardirq time + softirq time + idle time + stolen time doesn't
add up to real time.  The way we handle that is by making stolen time
represent the time taken away by the hypervisor (and on PowerPC with
SMT, the time taken by the other thread too).

Paul.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  9:34           ` Ingo Molnar
  2007-08-21  9:48             ` Paul Mackerras
@ 2007-08-21 10:38             ` Martin Schwidefsky
  2007-08-21 11:36               ` Ingo Molnar
  2007-08-21 10:39             ` Christian Borntraeger
  2007-08-21 10:43             ` Christian Borntraeger
  3 siblings, 1 reply; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-21 10:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

On Tue, 2007-08-21 at 11:34 +0200, Ingo Molnar wrote:
> * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> 
> > > hm, does on s390 scheduler_tick() get driven in virtual time or in 
> > > real time? The very latest scheduler code will enforce a minimum 
> > > rate of sched_clock() across two scheduler_tick() calls (in rc3 and 
> > > later kernels). If sched_clock() "slows down" but scheduler_tick() 
> > > still has a real-time frequency then that impacts the quality of 
> > > scheduling. So scheduler_tick() and sched_clock() must really have 
> > > the same behavior (either both are virtual or both are real), so 
> > > that scheduling becomes invariant to steal-time.
> > 
> > scheduler_tick() is based on the HZ timer which uses the TOD clock = 
> > real time. sched_clock() currently uses the TOD clock as well so in 
> > regard to the new scheduler we currently do not have a problem. We 
> > have a problem with cpu time accounting, the change to the /proc code 
> > breaks the precise accounting on s390. To solve the cpu time 
> > accounting we need to change sched_clock() to the cpu timer = virtual 
> > time. To change the scheduler_tick() as well requires another patch 
> > and I fear it would complicate things in the s390 backend.
> 
> my feeling is that it gives us generally higher-quality scheduling if we 
> drive all things scheduler via virtual time. Do you agree with that?

Yes, I'm in favour of converting sched_clock to use virtual time. It
makes sense to me.

> > And if you say that the scheduling becomes invariant to steal-time, 
> > how is the cpu time accounting via sum_exec supposed to work if it 
> > does not take steal-time into account ?
> 
> right now there are two distinct and independent things: scheduler 
> behavior (the scheduling decisions the scheduler makes) and accounting 
> behavior.
> 
> the 'invariant' i mentioned only covers scheduler behavior, not 
> accounting behavior. Accounting is separate in theory, but coupled in 
> practice now via sum_exec_runtime.

Hmm, ok. But the fact is that right now the accounting via
sum_exec_runtime is broken in regard to virtual cpus, isn't it?

> Before we do a patch to decouple them again, lets make sure we agree on 
> the direction to take here. There are two ways to account within a 
> virtual machine: either in real time or in virtual time.
> 
> it seems you'd like accounting to be sensitive to 'external load' - i.e. 
> you'd like an 'internal' top to show the 'real' CPU accounting, right? 

Yes, we want utime and stime represent the time spent on the physical
cpu. To make up for the missing time the steal time field has been
introduced.

> Wouldnt it be more consistent if a virtual box would not show any 
> dependency on external load? (i.e. it would slow down all of its 
> internal functionality transparently, without exposing it via /proc. The 
> only way to observe that would be the TOD interfaces: gettimeofday and 
> real-time clock driven POSIX timers. Even timer_list could be driven via 
> virtual time - although that would probably break user expectations, 
> right?) Or would accounting-in-virtual-time break user expectations too? 
> (most of the other hypervisors let guests account in virtual time.

No, imho it is less consistent if the virtual box shows virtual time. If
you look at the top output as a user and it shows that some process used
x% of cpu what does it tell you? With virtual cpus next to nothing, you
have to normalize the numbers with the %steal while the process was
running. But even then it still is not a good number because the %steal
changes while a process is running. The only good solution is to use
virtual time for all cputime values.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 10:38             ` Martin Schwidefsky
@ 2007-08-21 11:36               ` Ingo Molnar
  2007-08-21 11:58                 ` Martin Schwidefsky
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 11:36 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

* Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> > Wouldnt it be more consistent if a virtual box would not show any 
> > dependency on external load? (i.e. it would slow down all of its 
> > internal functionality transparently, without exposing it via /proc. 
> > The only way to observe that would be the TOD interfaces: 
> > gettimeofday and real-time clock driven POSIX timers. Even 
> > timer_list could be driven via virtual time - although that would 
> > probably break user expectations, right?) Or would 
> > accounting-in-virtual-time break user expectations too? (most of the 
> > other hypervisors let guests account in virtual time.
> 
> No, imho it is less consistent if the virtual box shows virtual time. 
> If you look at the top output as a user and it shows that some process 
> used x% of cpu what does it tell you? [...]

it tells me something really important: that the virtual box's scheduler 
gave this task as much CPU time as it could.

> [...] With virtual cpus next to nothing, you have to normalize the 
> numbers with the %steal while the process was running. But even then 
> it still is not a good number because the %steal changes while a 
> process is running. The only good solution is to use virtual time for 
> all cputime values.

the steal percentage is really a concept one step higher in the 
scheduling hierarchy. You should be running top (or an equivalent tool) 
in the _hypervisor_ context if you want to know about how much time each 
virtual box gets. 'mixing' that information with the 'internal' task 
statistics of the virtual box is less consistent IMO and leads to the 
loss of information. (with the 'mixing' method there's no way to tell 
whether a task got all CPU time it asked for - and _that_ is which an 
admin is interested in just as much as the time allocation between 
virtual boxes.)

so in say KVM you determine the steal percentage by looking at 'top' 
output in the hypervisor context. (or looking at steal% in the internal 
top output) Looking at 'top' in the guest tells you everything internal 
to that guest, without mixing external scheduling information into it.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:36               ` Ingo Molnar
@ 2007-08-21 11:58                 ` Martin Schwidefsky
  0 siblings, 0 replies; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-21 11:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

On Tue, 2007-08-21 at 13:36 +0200, Ingo Molnar wrote:
> * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> 
> > > Wouldnt it be more consistent if a virtual box would not show any 
> > > dependency on external load? (i.e. it would slow down all of its 
> > > internal functionality transparently, without exposing it via /proc. 
> > > The only way to observe that would be the TOD interfaces: 
> > > gettimeofday and real-time clock driven POSIX timers. Even 
> > > timer_list could be driven via virtual time - although that would 
> > > probably break user expectations, right?) Or would 
> > > accounting-in-virtual-time break user expectations too? (most of the 
> > > other hypervisors let guests account in virtual time.
> > 
> > No, imho it is less consistent if the virtual box shows virtual time. 
> > If you look at the top output as a user and it shows that some process 
> > used x% of cpu what does it tell you? [...]
> 
> it tells me something really important: that the virtual box's scheduler 
> gave this task as much CPU time as it could.

So? As far as accounting is concerned the user doesn't care one bit what
the scheduler decided. The user cares how much cpu was used. If you want
to know how much of the cputime available to the virtual box was used
for a process you just have to add/subtract the steal time. Your have
the complete picture what happened. You cannot get the complete picture
if you have do process accounting based on the TOD clock, you never know
how much cpu was actually spent for a process.

> > [...] With virtual cpus next to nothing, you have to normalize the 
> > numbers with the %steal while the process was running. But even then 
> > it still is not a good number because the %steal changes while a 
> > process is running. The only good solution is to use virtual time for 
> > all cputime values.
> 
> the steal percentage is really a concept one step higher in the 
> scheduling hierarchy. You should be running top (or an equivalent tool) 
> in the _hypervisor_ context if you want to know about how much time each 
> virtual box gets. 'mixing' that information with the 'internal' task 
> statistics of the virtual box is less consistent IMO and leads to the 
> loss of information. (with the 'mixing' method there's no way to tell 
> whether a task got all CPU time it asked for - and _that_ is which an 
> admin is interested in just as much as the time allocation between 
> virtual boxes.)

Not really. If I look at a process I want to know how much cpu it used.
Not virtual but real cpu. And we learned this the hard way, trying to do
accounting with numbers that have to get normalized by numbers from the
hypervisor is just plain broken.

> so in say KVM you determine the steal percentage by looking at 'top' 
> output in the hypervisor context. (or looking at steal% in the internal 
> top output) Looking at 'top' in the guest tells you everything internal 
> to that guest, without mixing external scheduling information into it.

The information about accounting numbers that are internal to the guest
is 99.99% useless. You always have to take the external scheduling into
account. We choose to do the accounting in a way that does not require
additional steps to get to useful numbers.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  9:34           ` Ingo Molnar
  2007-08-21  9:48             ` Paul Mackerras
  2007-08-21 10:38             ` Martin Schwidefsky
@ 2007-08-21 10:39             ` Christian Borntraeger
  2007-08-21 10:43             ` Christian Borntraeger
  3 siblings, 0 replies; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21 10:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Dienstag, 21. August 2007 schrieb Ingo Molnar:
> Wouldnt it be more consistent if a virtual box would not show any 
> dependency on external load? (i.e. it would slow down all of its 
> internal functionality transparently, without exposing it via /proc. The 
> only way to observe that would be the TOD interfaces: gettimeofday and 
> real-time clock driven POSIX timers. Even timer_list could be driven via 
> virtual time - although that would probably break user expectations, 
> right?) Or would accounting-in-virtual-time break user expectations too? 
> (most of the other hypervisors let guests account in virtual time.)

Most hypervisors let guest account in virtual time because this requires no 
code change. We had that in the past as well. But now we have lots of 
customers that rely on a different accounting model. Before 
CONFIG_VIRT_CPU_ACCOUNTING we got this model of top showing 99% of cpu used, 
even if the hypervisor gives us 20% of the physical one.  We actually want to 
show that this process gets only 19.8% of a real cpu for several reasons:
- people do workload management based on used cycles
- people/departments pay money based on used cycles
- If your physical cpu has to much load, you want to identify processes by 
physical cpu usage

There are even some vendors that claimed that Linux accouting was completely 
broken and useless on s390 and people should use their vendor tool to get the 
right numbers. While these tools still have important features 
CONFIG_VIRT_CPU_ACCOUNTING fixed at least the "broken" parts. 


Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  9:34           ` Ingo Molnar
                               ` (2 preceding siblings ...)
  2007-08-21 10:39             ` Christian Borntraeger
@ 2007-08-21 10:43             ` Christian Borntraeger
  2007-08-21 11:15               ` Ingo Molnar
  3 siblings, 1 reply; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21 10:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Dienstag, 21. August 2007 schrieb Ingo Molnar:
> the 'invariant' i mentioned only covers scheduler behavior, not 
> accounting behavior. Accounting is separate in theory, but coupled in 
> practice now via sum_exec_runtime.

Forgot to answer about that: That means that the current design does not cover 
our requirement of showing process real time, even if we implement 
sched_clock? In that case I would suggest to merge my patch as a quick but 
correct solution and do it properly for 2.6.24. Of course better solutions 
are welcome :-)

Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 10:43             ` Christian Borntraeger
@ 2007-08-21 11:15               ` Ingo Molnar
  2007-08-21 11:24                 ` Christian Borntraeger
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 11:15 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras


* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> Am Dienstag, 21. August 2007 schrieb Ingo Molnar:
> > the 'invariant' i mentioned only covers scheduler behavior, not 
> > accounting behavior. Accounting is separate in theory, but coupled in 
> > practice now via sum_exec_runtime.
> 
> Forgot to answer about that: That means that the current design does 
> not cover our requirement of showing process real time, even if we 
> implement sched_clock? In that case I would suggest to merge my patch 
> as a quick but correct solution and do it properly for 2.6.24. Of 
> course better solutions are welcome :-)

you mean to revert b27f03d4bd? I'd really like to see this fixed for 
real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the 
only case affected)

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:15               ` Ingo Molnar
@ 2007-08-21 11:24                 ` Christian Borntraeger
  2007-08-21 11:30                   ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21 11:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Dienstag, 21. August 2007 schrieb Ingo Molnar:
> you mean to revert b27f03d4bd? I'd really like to see this fixed for 
> real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the 
> only case affected)

Not a complete revert, but an ifdef-workaround. I wrote this patch last week 
and you were on cc:
http://marc.info/?l=linux-mm-commits&m=118737949222362&w=2

This patch should fix s390 and let everybody else use your new code. If you 
are conviced that we get a better solution before 2.6.23 hits the street, 
thats fine with me. 

Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:24                 ` Christian Borntraeger
@ 2007-08-21 11:30                   ` Ingo Molnar
  2007-08-21 11:58                     ` Christian Borntraeger
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 11:30 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> Am Dienstag, 21. August 2007 schrieb Ingo Molnar:
> > you mean to revert b27f03d4bd? I'd really like to see this fixed for 
> > real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the 
> > only case affected)
> 
> Not a complete revert, but an ifdef-workaround. I wrote this patch last week 
> and you were on cc:
> http://marc.info/?l=linux-mm-commits&m=118737949222362&w=2
> 
> This patch should fix s390 and let everybody else use your new code. 
> If you are conviced that we get a better solution before 2.6.23 hits 
> the street, thats fine with me.

what do you think about the rq_clock() #ifdef i did in the previous mail 
plus you making sched_clock() virtual? That way you can keep 
scheduler_tick() driven by real-time, although that generally will cause 
artifacts with SMP load-balancing too. (that was true in the past too)

but i dont mind your patch either - it's really the architecture's 
choice how visible it wants to make external load to the task stats of 
its virtual machines. I think it is more logical to say that 100% CPU 
time displayed in 'top' means that the task got all the CPU time it 
asked for from the virtual machine. (and if you are curious about how 
much time was stolen from the virtual box altogether you look at the 
stolen-time stats in isolation.)

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:30                   ` Ingo Molnar
@ 2007-08-21 11:58                     ` Christian Borntraeger
  2007-08-21 12:21                       ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-21 11:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

> what do you think about the rq_clock() #ifdef i did in the previous mail 
> plus you making sched_clock() virtual? That way you can keep 
> scheduler_tick() driven by real-time, although that generally will cause 
> artifacts with SMP load-balancing too. (that was true in the past too)

I just has a test run with a virtual sched_clock and your patch. 
Unfortunately, it doesnt work. top shows 100% for a cpu bound process, but 
steal time shows about 5% stolen cpu.

This brings me to another problem: runtime.
Let me give an example. You get 90% cpu from your hipervisor in a shared 
environment. If you now start a cpu bound task that gets the full cpu for 
lets say 10 minutes. I REALLY want to see 9 minutes in ps and top because my 
department might pay for used cpu cycles. 

> 
> but i dont mind your patch either - it's really the architecture's 
> choice how visible it wants to make external load to the task stats of 
> its virtual machines. I think it is more logical to say that 100% CPU 
> time displayed in 'top' means that the task got all the CPU time it 
> asked for from the virtual machine. (and if you are curious about how 
> much time was stolen from the virtual box altogether you look at the 
> stolen-time stats in isolation.)

Well, as I said we started with the same approach (virtual cpu) but we learned 
that these numbers have no meaning at all because the hypervisor does have 
different scheduling timeslices and having 100% inside the guest can still 
result in almost nothing if the system is really loaded.

Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:58                     ` Christian Borntraeger
@ 2007-08-21 12:21                       ` Ingo Molnar
  2007-08-21 12:57                         ` Martin Schwidefsky
  0 siblings, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 12:21 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel,
	Jan Glauber, heiko.carstens, Paul Mackerras

* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> > but i dont mind your patch either - it's really the architecture's 
> > choice how visible it wants to make external load to the task stats 
> > of its virtual machines. I think it is more logical to say that 100% 
> > CPU time displayed in 'top' means that the task got all the CPU time 
> > it asked for from the virtual machine. (and if you are curious about 
> > how much time was stolen from the virtual box altogether you look at 
> > the stolen-time stats in isolation.)
> 
> Well, as I said we started with the same approach (virtual cpu) but we 
> learned that these numbers have no meaning at all because the 
> hypervisor does have different scheduling timeslices and having 100% 
> inside the guest can still result in almost nothing if the system is 
> really loaded.

hm, i think i must have used the wrong terminology, so let me describe 
what i mean, so that we can argue this more efficiently ;-)

What i call "real time sched_clock()" is a sched_clock() that returns 
the GTOD (the real time) of the hypervisor. I.e. sched_clock() advances 
by 1 billion units every wall-clock second, in each guest.

A "virtual time sched_clock()" is a sched_clock() that returns only the 
amount of time the virtual CPU was executed by the hypervisor. I.e. on a 
3 times overloaded hypervisor with 3 guests it will advance 333 million 
nanoseconds per 1 wall-clock second, in each guest. (it is 'virtual' 
because the clock slows down as load goes up. In CFS-speak the virtual 
clock is the "fair-clock".)

to me the right scheme for sched_clock() is the virtual variant: to 
return the load-scaled nanoseconds. That way CFS will be able to 
schedule fairly even if time has been "stolen" from a task [by virtue of 
the hypervisor scheduling away the guest context without giving any 
notice about this to the guest kernel] - because sched_clock() measures 
the virtual time that got allocated to that guest by the hypervisor.

[ here i'm assuming precise host and precise guest statistics (which is 
  naturally the case if both are Linux), and in that context the virtual 
  numbers very much make sense, and whether 'top' displays 100% for a 
  sole CPU-bound task should be mostly a matter of tooling. ]

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 12:21                       ` Ingo Molnar
@ 2007-08-21 12:57                         ` Martin Schwidefsky
  0 siblings, 0 replies; 40+ messages in thread
From: Martin Schwidefsky @ 2007-08-21 12:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton,
	linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras

On Tue, 2007-08-21 at 14:21 +0200, Ingo Molnar wrote:
> hm, i think i must have used the wrong terminology, so let me describe 
> what i mean, so that we can argue this more efficiently ;-)

Ok, seems we need to be a bit more precise.

> What i call "real time sched_clock()" is a sched_clock() that returns 
> the GTOD (the real time) of the hypervisor. I.e. sched_clock() advances 
> by 1 billion units every wall-clock second, in each guest.

Which is what we call the TOD clock.

> A "virtual time sched_clock()" is a sched_clock() that returns only the 
> amount of time the virtual CPU was executed by the hypervisor. I.e. on a 
> 3 times overloaded hypervisor with 3 guests it will advance 333 million 
> nanoseconds per 1 wall-clock second, in each guest. (it is 'virtual' 
> because the clock slows down as load goes up. In CFS-speak the virtual 
> clock is the "fair-clock".)

We 100% agree that sched_clock() should be virtual.

> to me the right scheme for sched_clock() is the virtual variant: to 
> return the load-scaled nanoseconds. That way CFS will be able to 
> schedule fairly even if time has been "stolen" from a task [by virtue of 
> the hypervisor scheduling away the guest context without giving any 
> notice about this to the guest kernel] - because sched_clock() measures 
> the virtual time that got allocated to that guest by the hypervisor.

Ok, this means we will need Jans patch that makes sched_clock() use the
cpu timer. You said that this change would require that scheduler_tick()
has to use virtual time as well, which would be the second patch.
And then we require a third patch that fixes the process accounting.

> [ here i'm assuming precise host and precise guest statistics (which is 
>   naturally the case if both are Linux), and in that context the virtual 
>   numbers very much make sense, and whether 'top' displays 100% for a 
>   sole CPU-bound task should be mostly a matter of tooling. ]

It is not only a matter of tooling. The tool needs to have enough,
precise information to actually return the requested information. If the
user want to know how much real cpu a process has used (and our user do
want to know that), the output of /proc/<pid>/stat fields 14-17 have to
contain the real cpu usage for user/system/cumulated user and cumulated
system time. If they would contain the virtual cpu usage you cannot tell
how much real cpu a process used, even if you have access to the steal
timer numbers. The reason is that you would have to synchronize the
scheduling points in the guest and the hypervisor to come up with
reasonable numbers. This is something we obviously do not want to do.
The output of /proc/<pid>/stat is what Christian and I are complaining
about. Since the introduction of CFS and VIRT_ACCOUNTING=y the output
of /proc/<pid>/stat has changed its meaning and in our opinion it is
wrong now.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21  8:17     ` Christian Borntraeger
  2007-08-21  8:42       ` Ingo Molnar
@ 2007-08-21 11:25       ` Ingo Molnar
  2007-08-22  7:50         ` Christian Borntraeger
  1 sibling, 1 reply; 40+ messages in thread
From: Ingo Molnar @ 2007-08-21 11:25 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras


* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> Am Montag, 20. August 2007 schrieb Ingo Molnar:
> > could you send that precise sched_clock() patch? It should be an order 
> > of magnitude simpler than the high-precision stime/utime tracking you 
> > already do, and it's needed for quality scheduling anyway.
> 
> I have a question about that. I just played with sched_clock, and even 
> when I intentionally slow down sched_clock by a factor of 2, my cpu 
> bound process gets 100 % in top. If this is intentional, I dont 
> understand how a virtualized sched_clock would fix the accounting 
> change?

could you try the patch below, does it work any better?

	Ingo

---
 kernel/sched.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -333,6 +333,14 @@ static void __update_rq_clock(struct rq 
 #ifdef CONFIG_SCHED_DEBUG
 	WARN_ON_ONCE(cpu_of(rq) != smp_processor_id());
 #endif
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+	/*
+	 * Trust sched_clock on s390:
+	 */
+	if (unlikely(delta > rq->clock_max_delta))
+		rq->clock_max_delta = delta;
+	clock += delta;
+#else
 	/*
 	 * Protect against sched_clock() occasionally going backwards:
 	 */
@@ -355,6 +363,7 @@ static void __update_rq_clock(struct rq 
 			clock += delta;
 		}
 	}
+#endif
 
 	rq->prev_clock_raw = now;
 	rq->clock = clock;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-21 11:25       ` Ingo Molnar
@ 2007-08-22  7:50         ` Christian Borntraeger
  2007-08-22  7:59           ` Ingo Molnar
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-22  7:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Dienstag, 21. August 2007 schrieben Sie:
> could you try the patch below, does it work any better?

I looked again at the scheduler code and things are getting better when I run 
the patch below on top of your patch and with our sched_clock prototype. I 
guess there is a reason why you want rq->clock advanced by at least one tick?

We discussed calling scheduler_tick with virtual time as well.
Would it have the same result?
What would be the impact on latency?

After looking at the current s390 timer code, it seems that this kind of 
change is not trivial enough to be rc3+ ready. 
I personally think, that for 2.6.23 we should use the patch against 
fs/proc/array.c and everything else for 2.6.24?

Christian

---
 kernel/sched.c |    6 ------
 1 file changed, 6 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3321,15 +3321,9 @@ void scheduler_tick(void)
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *curr = rq->curr;
-	u64 next_tick = rq->tick_timestamp + TICK_NSEC;

 	spin_lock(&rq->lock);
 	__update_rq_clock(rq);
-	/*
-	 * Let rq->clock advance by at least TICK_NSEC:
-	 */
-	if (unlikely(rq->clock < next_tick))
-		rq->clock = next_tick;
 	rq->tick_timestamp = rq->clock;
 	update_cpu_load(rq);
 	if (curr != rq->idle) /* FIXME: needed? */

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [accounting regression since rc1]  scheduler updates
  2007-08-22  7:50         ` Christian Borntraeger
@ 2007-08-22  7:59           ` Ingo Molnar
  0 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-22  7:59 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

* Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> Am Dienstag, 21. August 2007 schrieben Sie:
> > could you try the patch below, does it work any better?
> 
> I looked again at the scheduler code and things are getting better 
> when I run the patch below on top of your patch and with our 
> sched_clock prototype. I guess there is a reason why you want 
> rq->clock advanced by at least one tick?

yeah - on PCs if for whatever reason the TSC misbehaves (and that's 
quite frequent) then this code sets a minimum boundary for behavior. If 
sched_clock() is totally random or does not advance at all or goes 
backwards all the time then rq_clock() still functions and falls back to 
jiffies-granularity behavior in essence.

> We discussed calling scheduler_tick with virtual time as well.
> Would it have the same result?
> What would be the impact on latency?

if you call scheduler_tick() with virtual time then the "safety" 
measures in rq_clock() do not kick in and sched_clock() behaves 
correctly as far as the scheduler is concerned. (if everything is in 
virtual time then the scheduler has no way to observe/notice that in 
reality this is a virtual machine.)

> After looking at the current s390 timer code, it seems that this kind of 
> change is not trivial enough to be rc3+ ready. 
> I personally think, that for 2.6.23 we should use the patch against 
> fs/proc/array.c and everything else for 2.6.24?

yes, that has the least impact for .23 - i have added your array.c patch 
to my queue.

	Ingo

^ permalink raw reply	[flat|nested] 40+ messages in thread

[parent not found: <200708141032.47235.borntraeger@de.ibm.com>]

[parent not found: <alpine.LFD.0.999.0708140835240.30176@woody.linux-foundation.org>]

* Re: [accounting regression since rc1]  scheduler updates
       [not found]   ` <alpine.LFD.0.999.0708140835240.30176@woody.linux-foundation.org>
@ 2007-08-14 18:19     ` Christian Borntraeger
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Borntraeger @ 2007-08-14 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrew Morton, linux-kernel, Martin Schwidefsky,
	Jan Glauber, heiko.carstens, Paul Mackerras

Am Dienstag, 14. August 2007 schrieb Linus Torvalds:
> On Tue, 14 Aug 2007, Christian Borntraeger wrote:
> > Hello Ingo,
> Just a fyi: I think Ingo is on vacation this week without any internet 
> access, I think he'll be back next Monday.
> 
> 		Linus

Well deserved.
As Martin and Jan are away as well I have a week to look into that problem 
myself :-)

Christian

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2007-08-22  7:59 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-12 16:32 [git pull request] scheduler updates Ingo Molnar
2007-08-14  8:37 ` [accounting regression since rc1] " Christian Borntraeger
2007-08-16  8:17   ` [PATCH][RFC] Re: accounting regression since rc1 Christian Borntraeger
2007-08-20 15:45   ` [accounting regression since rc1] scheduler updates Ingo Molnar
2007-08-20 17:03     ` Martin Schwidefsky
2007-08-20 18:08       ` Ingo Molnar
2007-08-20 18:33         ` Martin Schwidefsky
2007-08-20 19:00           ` Balbir Singh
2007-08-20 19:05           ` Ingo Molnar
2007-08-21  7:20             ` Christian Borntraeger
2007-08-20 19:12           ` Ingo Molnar
2007-08-21  7:00           ` Christian Borntraeger
2007-08-21  9:18             ` Martin Schwidefsky
2007-08-20 23:07         ` Paul Mackerras
2007-08-21  2:18         ` Andi Kleen
2007-08-21  7:09           ` Ingo Molnar
2007-08-21 10:07             ` Andi Kleen
2007-08-21 10:20               ` Ingo Molnar
2007-08-21 11:15                 ` Andi Kleen
2007-08-21 11:20                   ` Ingo Molnar
2007-08-21  8:17     ` Christian Borntraeger
2007-08-21  8:42       ` Ingo Molnar
2007-08-21  9:11         ` Martin Schwidefsky
2007-08-21  9:34           ` Ingo Molnar
2007-08-21  9:48             ` Paul Mackerras
2007-08-21 10:38             ` Martin Schwidefsky
2007-08-21 11:36               ` Ingo Molnar
2007-08-21 11:58                 ` Martin Schwidefsky
2007-08-21 10:39             ` Christian Borntraeger
2007-08-21 10:43             ` Christian Borntraeger
2007-08-21 11:15               ` Ingo Molnar
2007-08-21 11:24                 ` Christian Borntraeger
2007-08-21 11:30                   ` Ingo Molnar
2007-08-21 11:58                     ` Christian Borntraeger
2007-08-21 12:21                       ` Ingo Molnar
2007-08-21 12:57                         ` Martin Schwidefsky
2007-08-21 11:25       ` Ingo Molnar
2007-08-22  7:50         ` Christian Borntraeger
2007-08-22  7:59           ` Ingo Molnar
     [not found] ` <200708141032.47235.borntraeger@de.ibm.com>
     [not found]   ` <alpine.LFD.0.999.0708140835240.30176@woody.linux-foundation.org>
2007-08-14 18:19     ` Christian Borntraeger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox