linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/4] [RFC] true vs. system idle cputime
@ 2008-10-08 16:19 Martin Schwidefsky
  2008-10-08 16:19 ` [patch 1/4] fix scaled & unscaled cputime accounting Martin Schwidefsky
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-08 16:19 UTC (permalink / raw)
  To: linux-arch
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

Greetings,
while working on the analysis of a mismatch between the cputime accounting
numbers of z/VM as the host and Linux as the guest I started to wonder
about the accounting of idle time. z/VM showed more cpu time for the guest
as the guest itself. With the current code everything that the idle process
does is accounted as idle time. If idle is sleeping that is fine, but if
idle is actually using cpu cycles this is wrong.

The question is how wrong? To find out I've implemented really precise
accounting of true idle vs. system idle cputime for s390. A really simple
test that wakes up 100 times per second to do some minimal work before
going back to sleep showed 0.35% of system idle time. If you are dealing
with lots of virtual penguins this quickly becomes significant.

There are four patches in this series:
Patch #1: Cleanup scaled / unscaled cputime accounting
Patch #2: Change the accounting interface to allow the architectures to do
          precise idle time accounting
Patch #3: s390 patch to improve the precision of the idle_time_us value
Patch #4: s390 patch to implement improved idle time accounting

There is one change in patch #2 that might require a change on powerpc
and/or ia64. The generic TICK_ONESHOT/NO_HZ code calculates the number
of ticks spent with a disabled HZ timer and accounts this as idle time.
For a configuration for VIRT_CPU_ACCOUNTING=y this is horribly wrong.
Either you have precise accounting or you don't. Patch #2 just removes
the calculation for VIRT_CPU_ACCOUNTING=y. The architectures which support
precise accounting have to deal with it on their own. This is where the
powerpc and ia64 maintainer come into play. Would you look at patch #2
please ?

To make it clearer what happens in tick_nohz_restart_sched_tick I've added
a new function account_idle_ticks(). And for good measure another one named
account_steal_ticks() for xen where "interesting" things have been done
with the account_steal_time interface.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [patch 1/4] fix scaled & unscaled cputime accounting
  2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
@ 2008-10-08 16:19 ` Martin Schwidefsky
  2008-10-16  4:31   ` Paul Mackerras
  2008-10-08 16:20 ` [patch 2/4] idle " Martin Schwidefsky
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-08 16:19 UTC (permalink / raw)
  To: linux-arch
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling, Martin Schwidefsky

[-- Attachment #1: 202-cputime-scaled.diff --]
[-- Type: text/plain, Size: 11204 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

The utimescaled / stimescaled fields in the task structure and the
global cpustat should be set on all architectures. On s390 the calls
to account_user_time_scaled and account_system_time_scaled never have
been added. In addition system time that is accounted as guest time
to the user time of a process is accounted to the scaled system time
instead of the scaled user time. 
To fix the bugs and to prevent future forgetfulness this patch merges
account_system_time_scaled into account_system_time and
account_user_time_scaled into account_user_time.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/ia64/kernel/time.c     |   12 ++++--------
 arch/powerpc/kernel/time.c  |    7 ++-----
 arch/s390/kernel/vtime.c    |   10 +++++-----
 include/linux/kernel_stat.h |    6 ++----
 kernel/sched.c              |   41 ++++++++++++++++-------------------------
 kernel/time/tick-sched.c    |    5 +++--
 kernel/timer.c              |   12 +++++-------
 7 files changed, 37 insertions(+), 56 deletions(-)

Index: linux-idle/arch/ia64/kernel/time.c
===================================================================
--- linux-idle.orig/arch/ia64/kernel/time.c
+++ linux-idle/arch/ia64/kernel/time.c
@@ -93,13 +93,11 @@ void ia64_account_on_switch(struct task_
 	now = ia64_get_itc();
 
 	delta_stime = cycle_to_cputime(pi->ac_stime + (now - pi->ac_stamp));
-	account_system_time(prev, 0, delta_stime);
-	account_system_time_scaled(prev, delta_stime);
+	account_system_time(prev, 0, delta_stime, delta_stime);
 
 	if (pi->ac_utime) {
 		delta_utime = cycle_to_cputime(pi->ac_utime);
-		account_user_time(prev, delta_utime);
-		account_user_time_scaled(prev, delta_utime);
+		account_user_time(prev, delta_utime, delta_utime);
 	}
 
 	pi->ac_stamp = ni->ac_stamp = now;
@@ -122,8 +120,7 @@ void account_system_vtime(struct task_st
 	now = ia64_get_itc();
 
 	delta_stime = cycle_to_cputime(ti->ac_stime + (now - ti->ac_stamp));
-	account_system_time(tsk, 0, delta_stime);
-	account_system_time_scaled(tsk, delta_stime);
+	account_system_time(tsk, 0, delta_stime, delta_stime);
 	ti->ac_stime = 0;
 
 	ti->ac_stamp = now;
@@ -143,8 +140,7 @@ void account_process_tick(struct task_st
 
 	if (ti->ac_utime) {
 		delta_utime = cycle_to_cputime(ti->ac_utime);
-		account_user_time(p, delta_utime);
-		account_user_time_scaled(p, delta_utime);
+		account_user_time(p, delta_utime, delta_utime);
 		ti->ac_utime = 0;
 	}
 }
Index: linux-idle/arch/powerpc/kernel/time.c
===================================================================
--- linux-idle.orig/arch/powerpc/kernel/time.c
+++ linux-idle/arch/powerpc/kernel/time.c
@@ -258,8 +258,7 @@ void account_system_vtime(struct task_st
 		delta += sys_time;
 		get_paca()->system_time = 0;
 	}
-	account_system_time(tsk, 0, delta);
-	account_system_time_scaled(tsk, deltascaled);
+	account_system_time(tsk, 0, delta, deltascaled);
 	per_cpu(cputime_last_delta, smp_processor_id()) = delta;
 	per_cpu(cputime_scaled_last_delta, smp_processor_id()) = deltascaled;
 	local_irq_restore(flags);
@@ -277,10 +276,8 @@ void account_process_tick(struct task_st
 
 	utime = get_paca()->user_time;
 	get_paca()->user_time = 0;
-	account_user_time(tsk, utime);
-
 	utimescaled = cputime_to_scaled(utime);
-	account_user_time_scaled(tsk, utimescaled);
+	account_user_time(tsk, utime, utimescaled);
 }
 
 /*
Index: linux-idle/arch/s390/kernel/vtime.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/vtime.c
+++ linux-idle/arch/s390/kernel/vtime.c
@@ -51,12 +51,12 @@ void account_process_tick(struct task_st
 	rcu_user_flag = cputime != 0;
 	S390_lowcore.user_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_user_time(tsk, cputime);
+	account_user_time(tsk, cputime, cputime);
 
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, HARDIRQ_OFFSET, cputime);
+	account_system_time(tsk, HARDIRQ_OFFSET, cputime, cputime);
 
 	cputime = S390_lowcore.steal_clock;
 	if ((__s64) cputime > 0) {
@@ -83,12 +83,12 @@ void account_vtime(struct task_struct *t
 	cputime = S390_lowcore.user_timer >> 12;
 	S390_lowcore.user_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_user_time(tsk, cputime);
+	account_user_time(tsk, cputime, cputime);
 
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, 0, cputime);
+	account_system_time(tsk, 0, cputime, cputime);
 }
 
 /*
@@ -108,7 +108,7 @@ void account_system_vtime(struct task_st
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, 0, cputime);
+	account_system_time(tsk, 0, cputime, cputime);
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
Index: linux-idle/include/linux/kernel_stat.h
===================================================================
--- linux-idle.orig/include/linux/kernel_stat.h
+++ linux-idle/include/linux/kernel_stat.h
@@ -52,10 +52,8 @@ static inline int kstat_irqs(int irq)
 	return sum;
 }
 
-extern void account_user_time(struct task_struct *, cputime_t);
-extern void account_user_time_scaled(struct task_struct *, cputime_t);
-extern void account_system_time(struct task_struct *, int, cputime_t);
-extern void account_system_time_scaled(struct task_struct *, cputime_t);
+extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
+extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(struct task_struct *, cputime_t);
 
 #endif /* _LINUX_KERNEL_STAT_H */
Index: linux-idle/kernel/sched.c
===================================================================
--- linux-idle.orig/kernel/sched.c
+++ linux-idle/kernel/sched.c
@@ -4063,13 +4063,17 @@ unsigned long long task_sched_runtime(st
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
  * @cputime: the cpu time spent in user space since the last update
+ * @cputime_scaled: cputime scaled by cpu frequency
  */
-void account_user_time(struct task_struct *p, cputime_t cputime)
+void account_user_time(struct task_struct *p, cputime_t cputime,
+		       cputime_t cputime_scaled)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t tmp;
 
+	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
+	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -4085,50 +4089,48 @@ void account_user_time(struct task_struc
  * Account guest cpu time to a process.
  * @p: the process that the cpu time gets accounted to
  * @cputime: the cpu time spent in virtual machine since the last update
+ * @cputime_scaled: cputime scaled by cpu frequency
  */
-static void account_guest_time(struct task_struct *p, cputime_t cputime)
+static void account_guest_time(struct task_struct *p, cputime_t cputime,
+			       cputime_t cputime_scaled)
 {
 	cputime64_t tmp;
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
+	/* Add guest time to process. */
 	p->utime = cputime_add(p->utime, cputime);
 	p->gtime = cputime_add(p->gtime, cputime);
+	p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
 
+	/* Add guest time to cpustat. */
 	cpustat->user = cputime64_add(cpustat->user, tmp);
 	cpustat->guest = cputime64_add(cpustat->guest, tmp);
 }
 
 /*
- * Account scaled user cpu time to a process.
- * @p: the process that the cpu time gets accounted to
- * @cputime: the cpu time spent in user space since the last update
- */
-void account_user_time_scaled(struct task_struct *p, cputime_t cputime)
-{
-	p->utimescaled = cputime_add(p->utimescaled, cputime);
-}
-
-/*
  * Account system cpu time to a process.
  * @p: the process that the cpu time gets accounted to
  * @hardirq_offset: the offset to subtract from hardirq_count()
  * @cputime: the cpu time spent in kernel space since the last update
+ * @cputime_scaled: cputime scaled by cpu frequency
  */
 void account_system_time(struct task_struct *p, int hardirq_offset,
-			 cputime_t cputime)
+			 cputime_t cputime, cputime_t cputime_scaled)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	struct rq *rq = this_rq();
 	cputime64_t tmp;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
-		account_guest_time(p, cputime);
+		account_guest_time(p, cputime, cputime_scaled);
 		return;
 	}
 
+	/* Add system time to process. */
 	p->stime = cputime_add(p->stime, cputime);
+	p->stimescaled = cputime_add(p->stimescaled, cputime_scaled);
 
 	/* Add system time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
@@ -4147,17 +4149,6 @@ void account_system_time(struct task_str
 }
 
 /*
- * Account scaled system cpu time to a process.
- * @p: the process that the cpu time gets accounted to
- * @hardirq_offset: the offset to subtract from hardirq_count()
- * @cputime: the cpu time spent in kernel space since the last update
- */
-void account_system_time_scaled(struct task_struct *p, cputime_t cputime)
-{
-	p->stimescaled = cputime_add(p->stimescaled, cputime);
-}
-
-/*
  * Account for involuntary wait time.
  * @p: the process from which the cpu time has been stolen
  * @steal: the cpu time spent in involuntary wait
Index: linux-idle/kernel/time/tick-sched.c
===================================================================
--- linux-idle.orig/kernel/time/tick-sched.c
+++ linux-idle/kernel/time/tick-sched.c
@@ -378,6 +378,7 @@ void tick_nohz_restart_sched_tick(void)
 	int cpu = smp_processor_id();
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	unsigned long ticks;
+	cputime_t cputime;
 	ktime_t now;
 
 	local_irq_disable();
@@ -410,8 +411,8 @@ void tick_nohz_restart_sched_tick(void)
 	 */
 	if (ticks && ticks < LONG_MAX) {
 		add_preempt_count(HARDIRQ_OFFSET);
-		account_system_time(current, HARDIRQ_OFFSET,
-				    jiffies_to_cputime(ticks));
+		cputime = jiffies_to_cputime(ticks);
+		account_system_time(current, HARDIRQ_OFFSET, cputime, cputime);
 		sub_preempt_count(HARDIRQ_OFFSET);
 	}
 
Index: linux-idle/kernel/timer.c
===================================================================
--- linux-idle.orig/kernel/timer.c
+++ linux-idle/kernel/timer.c
@@ -954,13 +954,11 @@ void account_process_tick(struct task_st
 {
 	cputime_t one_jiffy = jiffies_to_cputime(1);
 
-	if (user_tick) {
-		account_user_time(p, one_jiffy);
-		account_user_time_scaled(p, cputime_to_scaled(one_jiffy));
-	} else {
-		account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
-		account_system_time_scaled(p, cputime_to_scaled(one_jiffy));
-	}
+	if (user_tick)
+		account_user_time(p, one_jiffy, cputime_to_scaled(one_jiffy));
+	else
+		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
+				    cputime_to_scaled(one_jiffy));
 }
 #endif
 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [patch 2/4] idle cputime accounting
  2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
  2008-10-08 16:19 ` [patch 1/4] fix scaled & unscaled cputime accounting Martin Schwidefsky
@ 2008-10-08 16:20 ` Martin Schwidefsky
  2008-10-16  4:59   ` Paul Mackerras
  2008-10-08 16:20 ` [patch 3/4] improve precision of idle accounting Martin Schwidefsky
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-08 16:20 UTC (permalink / raw)
  To: linux-arch
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling, Martin Schwidefsky

[-- Attachment #1: 203-cputime-idle.diff --]
[-- Type: text/plain, Size: 13606 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

The cpu time spent by the idle process actually doing something is
currently accounted as idle time. This is plain wrong, the architectures
that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
time spent doing nothing and the time spent by idle doing work. The first
is accounted with account_steal_time and the second with account_idle_time.
To improve the tick based accounting as well we would need an architecture
primitive that can tell us if the pt_regs of the interrupted context
points to the magic instruction that halts the cpu.

In addition idle time is no more added to the stime of the idle process.
This field now contains the system time of the idle process as it should
be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
every tick that occurs while idle is running will be accounted as idle
time.

This patch contains the necessary common code changes to be able to
distinguish idle system time and true idle time. The architectures with
support for VIRT_CPU_ACCOUNTING need some changes to exploit this.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/ia64/kernel/time.c     |   10 ++++-
 arch/powerpc/kernel/time.c  |   13 +++++--
 arch/s390/kernel/vtime.c    |   20 ++++++++---
 arch/x86/xen/time.c         |   10 ++---
 include/linux/kernel_stat.h |    7 +++
 include/linux/sched.h       |    1 
 kernel/sched.c              |   80 ++++++++++++++++++++++++++++++++++----------
 kernel/time/tick-sched.c    |   11 ++----
 kernel/timer.c              |   13 -------
 9 files changed, 111 insertions(+), 54 deletions(-)

Index: linux-idle/arch/ia64/kernel/time.c
===================================================================
--- linux-idle.orig/arch/ia64/kernel/time.c
+++ linux-idle/arch/ia64/kernel/time.c
@@ -93,7 +93,10 @@ void ia64_account_on_switch(struct task_
 	now = ia64_get_itc();
 
 	delta_stime = cycle_to_cputime(pi->ac_stime + (now - pi->ac_stamp));
-	account_system_time(prev, 0, delta_stime, delta_stime);
+	if (idle_task(smp_processor_id()) != prev)
+		account_system_time(prev, 0, delta_stime, delta_stime);
+	else
+		account_idle_time(delta_stime);
 
 	if (pi->ac_utime) {
 		delta_utime = cycle_to_cputime(pi->ac_utime);
@@ -120,7 +123,10 @@ void account_system_vtime(struct task_st
 	now = ia64_get_itc();
 
 	delta_stime = cycle_to_cputime(ti->ac_stime + (now - ti->ac_stamp));
-	account_system_time(tsk, 0, delta_stime, delta_stime);
+	if (irq_count() || idle_task(smp_processor_id()) != tsk)
+		account_system_time(tsk, 0, delta_stime, delta_stime);
+	else
+		account_idle_time(delta_stime);
 	ti->ac_stime = 0;
 
 	ti->ac_stamp = now;
Index: linux-idle/arch/powerpc/kernel/time.c
===================================================================
--- linux-idle.orig/arch/powerpc/kernel/time.c
+++ linux-idle/arch/powerpc/kernel/time.c
@@ -258,7 +258,10 @@ void account_system_vtime(struct task_st
 		delta += sys_time;
 		get_paca()->system_time = 0;
 	}
-	account_system_time(tsk, 0, delta, deltascaled);
+	if (in_irq() || idle_task(smp_processor_id()) != tsk)
+		account_system_time(tsk, 0, delta, deltascaled);
+	else
+		account_idle_time(delta);
 	per_cpu(cputime_last_delta, smp_processor_id()) = delta;
 	per_cpu(cputime_scaled_last_delta, smp_processor_id()) = deltascaled;
 	local_irq_restore(flags);
@@ -337,8 +340,12 @@ void calculate_steal_time(void)
 	tb = mftb();
 	purr = mfspr(SPRN_PURR);
 	stolen = (tb - pme->tb) - (purr - pme->purr);
-	if (stolen > 0)
-		account_steal_time(current, stolen);
+	if (stolen > 0) {
+		if (idle_task(smp_processor_id()) != current)
+			account_steal_time(stolen);
+		else
+			account_idle_time(stolen);
+	}
 	pme->tb = tb;
 	pme->purr = purr;
 }
Index: linux-idle/arch/s390/kernel/vtime.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/vtime.c
+++ linux-idle/arch/s390/kernel/vtime.c
@@ -56,13 +56,19 @@ void account_process_tick(struct task_st
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, HARDIRQ_OFFSET, cputime, cputime);
+	if (idle_task(smp_processor_id()) != current)
+		account_system_time(tsk, HARDIRQ_OFFSET, cputime, cputime);
+	else
+		account_idle_time(cputime);
 
 	cputime = S390_lowcore.steal_clock;
 	if ((__s64) cputime > 0) {
 		cputime >>= 12;
 		S390_lowcore.steal_clock -= cputime << 12;
-		account_steal_time(tsk, cputime);
+		if (idle_task(smp_processor_id()) != current)
+			account_steal_time(cputime);
+		else
+			account_idle_time(cputime);
 	}
 }
 
@@ -88,7 +94,10 @@ void account_vtime(struct task_struct *t
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, 0, cputime, cputime);
+	if (idle_task(smp_processor_id()) != current)
+		account_system_time(tsk, 0, cputime, cputime);
+	else
+		account_idle_time(cputime);
 }
 
 /*
@@ -108,7 +117,10 @@ void account_system_vtime(struct task_st
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, 0, cputime, cputime);
+	if (in_irq() || idle_task(smp_processor_id()) != current)
+		account_system_time(tsk, 0, cputime, cputime);
+	else
+		account_idle_time(cputime);
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
Index: linux-idle/arch/x86/xen/time.c
===================================================================
--- linux-idle.orig/arch/x86/xen/time.c
+++ linux-idle/arch/x86/xen/time.c
@@ -134,8 +134,7 @@ static void do_stolen_accounting(void)
 	*snap = state;
 
 	/* Add the appropriate number of ticks of stolen time,
-	   including any left-overs from last time.  Passing NULL to
-	   account_steal_time accounts the time as stolen. */
+	   including any left-overs from last time. */
 	stolen = runnable + offline + __get_cpu_var(residual_stolen);
 
 	if (stolen < 0)
@@ -143,11 +142,10 @@ static void do_stolen_accounting(void)
 
 	ticks = iter_div_u64_rem(stolen, NS_PER_TICK, &stolen);
 	__get_cpu_var(residual_stolen) = stolen;
-	account_steal_time(NULL, ticks);
+	account_steal_ticks(ticks);
 
 	/* Add the appropriate number of ticks of blocked time,
-	   including any left-overs from last time.  Passing idle to
-	   account_steal_time accounts the time as idle/wait. */
+	   including any left-overs from last time. */
 	blocked += __get_cpu_var(residual_blocked);
 
 	if (blocked < 0)
@@ -155,7 +153,7 @@ static void do_stolen_accounting(void)
 
 	ticks = iter_div_u64_rem(blocked, NS_PER_TICK, &blocked);
 	__get_cpu_var(residual_blocked) = blocked;
-	account_steal_time(idle_task(smp_processor_id()), ticks);
+	account_idle_ticks(ticks);
 }
 
 /*
Index: linux-idle/include/linux/kernel_stat.h
===================================================================
--- linux-idle.orig/include/linux/kernel_stat.h
+++ linux-idle/include/linux/kernel_stat.h
@@ -54,6 +54,11 @@ static inline int kstat_irqs(int irq)
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
-extern void account_steal_time(struct task_struct *, cputime_t);
+extern void account_steal_time(cputime_t);
+extern void account_idle_time(cputime_t);
+
+extern void account_process_tick(struct task_struct *, int user);
+extern void account_steal_ticks(unsigned long ticks);
+extern void account_idle_ticks(unsigned long ticks);
 
 #endif /* _LINUX_KERNEL_STAT_H */
Index: linux-idle/include/linux/sched.h
===================================================================
--- linux-idle.orig/include/linux/sched.h
+++ linux-idle/include/linux/sched.h
@@ -284,7 +284,6 @@ long io_schedule_timeout(long timeout);
 
 extern void cpu_init (void);
 extern void trap_init(void);
-extern void account_process_tick(struct task_struct *task, int user);
 extern void update_process_times(int user);
 extern void scheduler_tick(void);
 extern void hrtick_resched(void);
Index: linux-idle/kernel/sched.c
===================================================================
--- linux-idle.orig/kernel/sched.c
+++ linux-idle/kernel/sched.c
@@ -4120,7 +4120,6 @@ void account_system_time(struct task_str
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	struct rq *rq = this_rq();
 	cputime64_t tmp;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
@@ -4138,37 +4137,84 @@ void account_system_time(struct task_str
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
 	else if (softirq_count())
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
-	else if (p != rq->idle)
-		cpustat->system = cputime64_add(cpustat->system, tmp);
-	else if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
 	else
-		cpustat->idle = cputime64_add(cpustat->idle, tmp);
+		cpustat->system = cputime64_add(cpustat->system, tmp);
+
 	/* Account for system time used */
 	acct_update_integrals(p);
 }
 
 /*
  * Account for involuntary wait time.
- * @p: the process from which the cpu time has been stolen
  * @steal: the cpu time spent in involuntary wait
  */
-void account_steal_time(struct task_struct *p, cputime_t steal)
+void account_steal_time(cputime_t cputime)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t cputime64 = cputime_to_cputime64(cputime);
+
+	cpustat->steal = cputime64_add(cpustat->steal, cputime64);
+}
+
+/*
+ * Account for idle time.
+ * @cputime: the cpu time spent in idle wait
+ */
+void account_idle_time(cputime_t cputime)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t tmp = cputime_to_cputime64(steal);
+	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
-	if (p == rq->idle) {
-		p->stime = cputime_add(p->stime, steal);
-		if (atomic_read(&rq->nr_iowait) > 0)
-			cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
-		else
-			cpustat->idle = cputime64_add(cpustat->idle, tmp);
-	} else
-		cpustat->steal = cputime64_add(cpustat->steal, tmp);
+	if (atomic_read(&rq->nr_iowait) > 0)
+		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
+	else
+		cpustat->idle = cputime64_add(cpustat->idle, cputime64);
 }
 
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
+
+/*
+ * Account a single tick of cpu time.
+ * @p: the process that the cpu time gets accounted to
+ * @user_tick: indicates if the tick is a user or a system tick
+ */
+void account_process_tick(struct task_struct *p, int user_tick)
+{
+	cputime_t one_jiffy = jiffies_to_cputime(1);
+	cputime_t one_jiffy_scaled = cputime_to_scaled(one_jiffy);
+	struct rq *rq = this_rq();
+
+	if (user_tick)
+		account_user_time(p, one_jiffy, one_jiffy_scaled);
+	else if (p != rq->idle)
+		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
+				    one_jiffy_scaled);
+	else
+		account_idle_time(one_jiffy);
+}
+
+/*
+ * Account multiple ticks of steal time.
+ * @p: the process from which the cpu time has been stolen
+ * @ticks: number of stolen ticks
+ */
+void account_steal_ticks(unsigned long ticks)
+{
+	account_steal_time(jiffies_to_cputime(ticks));
+}
+
+/*
+ * Account multiple ticks of idle time.
+ * @ticks: number of stolen ticks
+ */
+void account_idle_ticks(unsigned long ticks)
+{
+	account_idle_time(jiffies_to_cputime(ticks));
+}
+
+#endif
+
 /*
  * Use precise platform statistics if available:
  */
Index: linux-idle/kernel/time/tick-sched.c
===================================================================
--- linux-idle.orig/kernel/time/tick-sched.c
+++ linux-idle/kernel/time/tick-sched.c
@@ -378,7 +378,6 @@ void tick_nohz_restart_sched_tick(void)
 	int cpu = smp_processor_id();
 	struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
 	unsigned long ticks;
-	cputime_t cputime;
 	ktime_t now;
 
 	local_irq_disable();
@@ -400,6 +399,7 @@ void tick_nohz_restart_sched_tick(void)
 	tick_do_update_jiffies64(now);
 	cpu_clear(cpu, nohz_cpu_mask);
 
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING
 	/*
 	 * We stopped the tick in idle. Update process times would miss the
 	 * time we slept as update_process_times does only a 1 tick
@@ -409,12 +409,9 @@ void tick_nohz_restart_sched_tick(void)
 	/*
 	 * We might be one off. Do not randomly account a huge number of ticks!
 	 */
-	if (ticks && ticks < LONG_MAX) {
-		add_preempt_count(HARDIRQ_OFFSET);
-		cputime = jiffies_to_cputime(ticks);
-		account_system_time(current, HARDIRQ_OFFSET, cputime, cputime);
-		sub_preempt_count(HARDIRQ_OFFSET);
-	}
+	if (ticks && ticks < LONG_MAX)
+		account_idle_ticks(ticks);
+#endif
 
 	touch_softlockup_watchdog();
 	/*
Index: linux-idle/kernel/timer.c
===================================================================
--- linux-idle.orig/kernel/timer.c
+++ linux-idle/kernel/timer.c
@@ -949,19 +949,6 @@ unsigned long get_next_timer_interrupt(u
 }
 #endif
 
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING
-void account_process_tick(struct task_struct *p, int user_tick)
-{
-	cputime_t one_jiffy = jiffies_to_cputime(1);
-
-	if (user_tick)
-		account_user_time(p, one_jiffy, cputime_to_scaled(one_jiffy));
-	else
-		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
-				    cputime_to_scaled(one_jiffy));
-}
-#endif
-
 /*
  * Called from the timer interrupt handler to charge one tick to the current
  * process.  user_tick is 1 if the tick is user time, 0 for system.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [patch 3/4] improve precision of idle accounting.
  2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
  2008-10-08 16:19 ` [patch 1/4] fix scaled & unscaled cputime accounting Martin Schwidefsky
  2008-10-08 16:20 ` [patch 2/4] idle " Martin Schwidefsky
@ 2008-10-08 16:20 ` Martin Schwidefsky
  2008-10-08 16:20 ` [patch 4/4] improve idle cputime accounting Martin Schwidefsky
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-08 16:20 UTC (permalink / raw)
  To: linux-arch
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling, Martin Schwidefsky

[-- Attachment #1: 204-cputime-idle-stck.diff --]
[-- Type: text/plain, Size: 5683 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Increase the precision of the idle time calculation that is exported
to user space via /sys/devices/system/cpu/cpu<x>/idle_time_us

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/s390/include/asm/cpu.h |    3 -
 arch/s390/kernel/process.c  |   67 +++++++++++++++++++++++++++++---------------
 arch/s390/kernel/smp.c      |   25 ++++++++--------
 3 files changed, 59 insertions(+), 36 deletions(-)

Index: linux-idle/arch/s390/include/asm/cpu.h
===================================================================
--- linux-idle.orig/arch/s390/include/asm/cpu.h
+++ linux-idle/arch/s390/include/asm/cpu.h
@@ -14,7 +14,6 @@
 
 struct s390_idle_data {
 	spinlock_t lock;
-	unsigned int in_idle;
 	unsigned long long idle_count;
 	unsigned long long idle_enter;
 	unsigned long long idle_time;
@@ -26,7 +25,7 @@ void s390_idle_leave(void);
 
 static inline void s390_idle_check(void)
 {
-	if ((&__get_cpu_var(s390_idle))->in_idle)
+	if ((&__get_cpu_var(s390_idle))->idle_enter != 0ULL)
 		s390_idle_leave();
 }
 
Index: linux-idle/arch/s390/kernel/process.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/process.c
+++ linux-idle/arch/s390/kernel/process.c
@@ -38,6 +38,7 @@
 #include <linux/utsname.h>
 #include <linux/tick.h>
 #include <linux/elfcore.h>
+#include <linux/kernel_stat.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/system.h>
@@ -79,30 +80,19 @@ DEFINE_PER_CPU(struct s390_idle_data, s3
 	.lock = __SPIN_LOCK_UNLOCKED(s390_idle.lock)
 };
 
-static int s390_idle_enter(void)
+void s390_idle_leave(void)
 {
 	struct s390_idle_data *idle;
+	unsigned long long idle_time;
 
 	idle = &__get_cpu_var(s390_idle);
+	idle_time = S390_lowcore.int_clock - idle->idle_enter;
 	spin_lock(&idle->lock);
+	idle->idle_time += idle_time;
+	idle->idle_enter = 0ULL;
 	idle->idle_count++;
-	idle->in_idle = 1;
-	idle->idle_enter = get_clock();
 	spin_unlock(&idle->lock);
-	vtime_stop_cpu_timer();
-	return NOTIFY_OK;
-}
-
-void s390_idle_leave(void)
-{
-	struct s390_idle_data *idle;
-
 	vtime_start_cpu_timer();
-	idle = &__get_cpu_var(s390_idle);
-	spin_lock(&idle->lock);
-	idle->idle_time += get_clock() - idle->idle_enter;
-	idle->in_idle = 0;
-	spin_unlock(&idle->lock);
 }
 
 extern void s390_handle_mcck(void);
@@ -111,16 +101,16 @@ extern void s390_handle_mcck(void);
  */
 static void default_idle(void)
 {
+	struct s390_idle_data *idle = &__get_cpu_var(s390_idle);
+	unsigned long addr;
+	psw_t psw;
+
 	/* CPU is going idle. */
 	local_irq_disable();
 	if (need_resched()) {
 		local_irq_enable();
 		return;
 	}
-	if (s390_idle_enter() == NOTIFY_BAD) {
-		local_irq_enable();
-		return;
-	}
 #ifdef CONFIG_HOTPLUG_CPU
 	if (cpu_is_offline(smp_processor_id())) {
 		preempt_enable_no_resched();
@@ -136,9 +126,42 @@ static void default_idle(void)
 		return;
 	}
 	trace_hardirqs_on();
+	vtime_stop_cpu_timer();
+
+	/*
+	 * The inline assembly is equivalent to
+	 * 	idle->idle_enter = get_clock();
+	 * 	__load_psw_mask(psw_kernel_bits | PSW_MASK_WAIT |
+	 *			   PSW_MASK_IO | PSW_MASK_EXT);
+	 * The difference is that the inline assembly makes sure that
+	 * the stck instruction is right before the lpsw instruction.
+	 * This is done to increase the precision.
+	 */
+
 	/* Wait for external, I/O or machine check interrupt. */
-	__load_psw_mask(psw_kernel_bits | PSW_MASK_WAIT |
-			PSW_MASK_IO | PSW_MASK_EXT);
+	psw.mask = psw_kernel_bits|PSW_MASK_WAIT|PSW_MASK_IO|PSW_MASK_EXT;
+#ifndef __s390x__
+	asm volatile(
+		"	basr	%0,0\n"
+		"0:	ahi	%0,1f-0b\n"
+		"	st	%0,4(%2)\n"
+		"	stck	0(%3)\n"
+		"	lpsw	0(%2)\n"
+		"1:"
+		: "=&d" (addr), "=m" (idle->idle_enter)
+		: "a" (&psw), "a" (&idle->idle_enter), "m" (psw)
+		: "memory", "cc");
+#else /* __s390x__ */
+	asm volatile(
+		"	larl	%0,1f\n"
+		"	stg	%0,8(%2)\n"
+		"	stck	0(%3)\n"
+		"	lpswe	0(%2)\n"
+		"1:"
+		: "=&d" (addr), "=m" (idle->idle_enter)
+		: "a" (&psw), "a" (&idle->idle_enter), "m" (psw)
+		: "memory", "cc");
+#endif /* __s390x__ */
 }
 
 void cpu_idle(void)
Index: linux-idle/arch/s390/kernel/smp.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/smp.c
+++ linux-idle/arch/s390/kernel/smp.c
@@ -992,9 +992,11 @@ static ssize_t show_idle_count(struct sy
 	unsigned long long idle_count;
 
 	idle = &per_cpu(s390_idle, dev->id);
-	spin_lock_irq(&idle->lock);
+	spin_lock(&idle->lock);
 	idle_count = idle->idle_count;
-	spin_unlock_irq(&idle->lock);
+	if (idle->idle_enter)
+		idle_count++;
+	spin_unlock(&idle->lock);
 	return sprintf(buf, "%llu\n", idle_count);
 }
 static SYSDEV_ATTR(idle_count, 0444, show_idle_count, NULL);
@@ -1003,18 +1005,17 @@ static ssize_t show_idle_time(struct sys
 				struct sysdev_attribute *attr, char *buf)
 {
 	struct s390_idle_data *idle;
-	unsigned long long new_time;
+	unsigned long long now, idle_time, idle_enter;
 
 	idle = &per_cpu(s390_idle, dev->id);
-	spin_lock_irq(&idle->lock);
-	if (idle->in_idle) {
-		new_time = get_clock();
-		idle->idle_time += new_time - idle->idle_enter;
-		idle->idle_enter = new_time;
-	}
-	new_time = idle->idle_time;
-	spin_unlock_irq(&idle->lock);
-	return sprintf(buf, "%llu\n", new_time >> 12);
+	spin_lock(&idle->lock);
+	now = get_clock();
+	idle_time = idle->idle_time;
+	idle_enter = idle->idle_enter;
+	if (idle_enter != 0ULL && idle_enter < now)
+		idle_time += now - idle_enter;
+	spin_unlock(&idle->lock);
+	return sprintf(buf, "%llu\n", idle_time >> 12);
 }
 static SYSDEV_ATTR(idle_time_us, 0444, show_idle_time, NULL);
 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [patch 4/4] improve idle cputime accounting
  2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
                   ` (2 preceding siblings ...)
  2008-10-08 16:20 ` [patch 3/4] improve precision of idle accounting Martin Schwidefsky
@ 2008-10-08 16:20 ` Martin Schwidefsky
  2008-10-08 21:22 ` [patch 0/4] [RFC] true vs. system idle cputime Luck, Tony
  2008-10-15 14:01 ` Martin Schwidefsky
  5 siblings, 0 replies; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-08 16:20 UTC (permalink / raw)
  To: linux-arch
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling, Martin Schwidefsky

[-- Attachment #1: 205-cputime-idle-s390.diff --]
[-- Type: text/plain, Size: 4779 bytes --]

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Distinguish the cputime of the idle process where idle is actually using
cpu cycles from the cputime where idle is sleeping on an enabled wait psw.
The former is acconuted as system time, the later as idle time.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/s390/include/asm/cpu.h |    1 +
 arch/s390/kernel/process.c  |    7 +++++++
 arch/s390/kernel/s390_ext.c |    2 +-
 arch/s390/kernel/vtime.c    |   24 ++++++------------------
 drivers/s390/cio/cio.c      |    2 +-
 5 files changed, 16 insertions(+), 20 deletions(-)

Index: linux-idle/arch/s390/include/asm/cpu.h
===================================================================
--- linux-idle.orig/arch/s390/include/asm/cpu.h
+++ linux-idle/arch/s390/include/asm/cpu.h
@@ -15,6 +15,7 @@
 struct s390_idle_data {
 	spinlock_t lock;
 	unsigned long long idle_count;
+	unsigned long long idle_delta;
 	unsigned long long idle_enter;
 	unsigned long long idle_time;
 };
Index: linux-idle/arch/s390/kernel/process.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/process.c
+++ linux-idle/arch/s390/kernel/process.c
@@ -87,6 +87,13 @@ void s390_idle_leave(void)
 
 	idle = &__get_cpu_var(s390_idle);
 	idle_time = S390_lowcore.int_clock - idle->idle_enter;
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+	idle_time += idle->idle_delta;
+	idle->idle_delta = idle_time & 4095;
+	idle_time -= idle->idle_delta;
+	account_idle_time(idle_time >> 12);
+	S390_lowcore.last_update_clock = S390_lowcore.int_clock;
+#endif
 	spin_lock(&idle->lock);
 	idle->idle_time += idle_time;
 	idle->idle_enter = 0ULL;
Index: linux-idle/arch/s390/kernel/s390_ext.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/s390_ext.c
+++ linux-idle/arch/s390/kernel/s390_ext.c
@@ -119,8 +119,8 @@ void do_extint(struct pt_regs *regs, uns
 	struct pt_regs *old_regs;
 
 	old_regs = set_irq_regs(regs);
-	irq_enter();
 	s390_idle_check();
+	irq_enter();
 	if (S390_lowcore.int_clock >= S390_lowcore.clock_comparator)
 		/* Serve timer interrupts first. */
 		clock_comparator_work();
Index: linux-idle/arch/s390/kernel/vtime.c
===================================================================
--- linux-idle.orig/arch/s390/kernel/vtime.c
+++ linux-idle/arch/s390/kernel/vtime.c
@@ -56,19 +56,13 @@ void account_process_tick(struct task_st
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	if (idle_task(smp_processor_id()) != current)
-		account_system_time(tsk, HARDIRQ_OFFSET, cputime, cputime);
-	else
-		account_idle_time(cputime);
+	account_system_time(tsk, HARDIRQ_OFFSET, cputime, cputime);
 
 	cputime = S390_lowcore.steal_clock;
 	if ((__s64) cputime > 0) {
 		cputime >>= 12;
 		S390_lowcore.steal_clock -= cputime << 12;
-		if (idle_task(smp_processor_id()) != current)
-			account_steal_time(cputime);
-		else
-			account_idle_time(cputime);
+		account_steal_time(cputime);
 	}
 }
 
@@ -94,10 +88,7 @@ void account_vtime(struct task_struct *t
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	if (idle_task(smp_processor_id()) != current)
-		account_system_time(tsk, 0, cputime, cputime);
-	else
-		account_idle_time(cputime);
+	account_system_time(tsk, 0, cputime, cputime);
 }
 
 /*
@@ -117,10 +108,7 @@ void account_system_vtime(struct task_st
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	if (in_irq() || idle_task(smp_processor_id()) != current)
-		account_system_time(tsk, 0, cputime, cputime);
-	else
-		account_idle_time(cputime);
+	account_system_time(tsk, 0, cputime, cputime);
 }
 EXPORT_SYMBOL_GPL(account_system_vtime);
 
@@ -158,8 +146,8 @@ void vtime_start_cpu_timer(void)
 	if (vt_list->idle & 1LL<<63)
 		return;
 
-	if (!list_empty(&vt_list->list))
-		set_vtimer(vt_list->idle);
+	S390_lowcore.last_update_timer = S390_lowcore.async_enter_timer;
+	set_vtimer(vt_list->idle);
 }
 
 void vtime_stop_cpu_timer(void)
Index: linux-idle/drivers/s390/cio/cio.c
===================================================================
--- linux-idle.orig/drivers/s390/cio/cio.c
+++ linux-idle/drivers/s390/cio/cio.c
@@ -626,8 +626,8 @@ do_IRQ (struct pt_regs *regs)
 	struct pt_regs *old_regs;
 
 	old_regs = set_irq_regs(regs);
-	irq_enter();
 	s390_idle_check();
+	irq_enter();
 	if (S390_lowcore.int_clock >= S390_lowcore.clock_comparator)
 		/* Serve timer interrupts first. */
 		clock_comparator_work();

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [patch 0/4] [RFC] true vs. system idle cputime
  2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
                   ` (3 preceding siblings ...)
  2008-10-08 16:20 ` [patch 4/4] improve idle cputime accounting Martin Schwidefsky
@ 2008-10-08 21:22 ` Luck, Tony
  2008-10-09  8:03   ` Martin Schwidefsky
  2008-10-15 14:01 ` Martin Schwidefsky
  5 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2008-10-08 21:22 UTC (permalink / raw)
  To: Martin Schwidefsky, linux-arch@vger.kernel.org
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

> There is one change in patch #2 that might require a change on powerpc
> and/or ia64. The generic TICK_ONESHOT/NO_HZ code calculates the number
> of ticks spent with a disabled HZ timer and accounts this as idle time.

Your patches apply, build and boot cleanly on ia64.

At the moment ia64 doesn't do NO_HZ (exisiting cpus only support a C1
state, which we can enter/leave pretty quickly ... so power saving
for NO_HZ is negligible) ... so your code change isn't an issue for
ia64.


-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [patch 0/4] [RFC] true vs. system idle cputime
  2008-10-08 21:22 ` [patch 0/4] [RFC] true vs. system idle cputime Luck, Tony
@ 2008-10-09  8:03   ` Martin Schwidefsky
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-09  8:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: linux-arch@vger.kernel.org, Heiko Carstens, Paul Mackerras,
	Benjamin Herrenschmidt, Hidetoshi Seto, Jeremy Fitzhardinge,
	Chris Wright, Michael Neuling

On Wed, 2008-10-08 at 14:22 -0700, Luck, Tony wrote:
> > There is one change in patch #2 that might require a change on powerpc
> > and/or ia64. The generic TICK_ONESHOT/NO_HZ code calculates the number
> > of ticks spent with a disabled HZ timer and accounts this as idle time.
> 
> Your patches apply, build and boot cleanly on ia64.
> 
> At the moment ia64 doesn't do NO_HZ (exisiting cpus only support a C1
> state, which we can enter/leave pretty quickly ... so power saving
> for NO_HZ is negligible) ... so your code change isn't an issue for
> ia64.

Thanks for checking.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 0/4] [RFC] true vs. system idle cputime
  2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
                   ` (4 preceding siblings ...)
  2008-10-08 21:22 ` [patch 0/4] [RFC] true vs. system idle cputime Luck, Tony
@ 2008-10-15 14:01 ` Martin Schwidefsky
  2008-10-15 20:56   ` Benjamin Herrenschmidt
  5 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-15 14:01 UTC (permalink / raw)
  To: linux-arch
  Cc: Heiko Carstens, Paul Mackerras, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

On Wed, 2008-10-08 at 18:19 +0200, Martin Schwidefsky wrote:
> Greetings,
> while working on the analysis of a mismatch between the cputime accounting
> numbers of z/VM as the host and Linux as the guest I started to wonder
> about the accounting of idle time. z/VM showed more cpu time for the guest
> as the guest itself. With the current code everything that the idle process
> does is accounted as idle time. If idle is sleeping that is fine, but if
> idle is actually using cpu cycles this is wrong.
> 
> The question is how wrong? To find out I've implemented really precise
> accounting of true idle vs. system idle cputime for s390. A really simple
> test that wakes up 100 times per second to do some minimal work before
> going back to sleep showed 0.35% of system idle time. If you are dealing
> with lots of virtual penguins this quickly becomes significant.
> 
> There are four patches in this series:
> Patch #1: Cleanup scaled / unscaled cputime accounting
> Patch #2: Change the accounting interface to allow the architectures to do
>           precise idle time accounting
> Patch #3: s390 patch to improve the precision of the idle_time_us value
> Patch #4: s390 patch to implement improved idle time accounting
> 
> There is one change in patch #2 that might require a change on powerpc
> and/or ia64. The generic TICK_ONESHOT/NO_HZ code calculates the number
> of ticks spent with a disabled HZ timer and accounts this as idle time.
> For a configuration for VIRT_CPU_ACCOUNTING=y this is horribly wrong.
> Either you have precise accounting or you don't. Patch #2 just removes
> the calculation for VIRT_CPU_ACCOUNTING=y. The architectures which support
> precise accounting have to deal with it on their own. This is where the
> powerpc and ia64 maintainer come into play. Would you look at patch #2
> please ?
> 
> To make it clearer what happens in tick_nohz_restart_sched_tick I've added
> a new function account_idle_ticks(). And for good measure another one named
> account_steal_ticks() for xen where "interesting" things have been done
> with the account_steal_time interface.

Any news about powerpc? Do these patches break anything or does it work?

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 0/4] [RFC] true vs. system idle cputime
  2008-10-15 14:01 ` Martin Schwidefsky
@ 2008-10-15 20:56   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 13+ messages in thread
From: Benjamin Herrenschmidt @ 2008-10-15 20:56 UTC (permalink / raw)
  To: schwidefsky
  Cc: linux-arch, Heiko Carstens, Paul Mackerras, Hidetoshi Seto,
	Tony Luck, Jeremy Fitzhardinge, Chris Wright, Michael Neuling

On Wed, 2008-10-15 at 16:01 +0200, Martin Schwidefsky wrote:
> > There is one change in patch #2 that might require a change on powerpc
> > and/or ia64. The generic TICK_ONESHOT/NO_HZ code calculates the number
> > of ticks spent with a disabled HZ timer and accounts this as idle time.
> > For a configuration for VIRT_CPU_ACCOUNTING=y this is horribly wrong.
> > Either you have precise accounting or you don't. Patch #2 just removes
> > the calculation for VIRT_CPU_ACCOUNTING=y. The architectures which support
> > precise accounting have to deal with it on their own. This is where the
> > powerpc and ia64 maintainer come into play. Would you look at patch #2
> > please ?
> > 
> > To make it clearer what happens in tick_nohz_restart_sched_tick I've added
> > a new function account_idle_ticks(). And for good measure another one named
> > account_steal_ticks() for xen where "interesting" things have been done
> > with the account_steal_time interface.
> 
> Any news about powerpc? Do these patches break anything or does it work?

I didn't have a chance to look at it yet. I'll try to get that looked at
today.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 1/4] fix scaled & unscaled cputime accounting
  2008-10-08 16:19 ` [patch 1/4] fix scaled & unscaled cputime accounting Martin Schwidefsky
@ 2008-10-16  4:31   ` Paul Mackerras
  0 siblings, 0 replies; 13+ messages in thread
From: Paul Mackerras @ 2008-10-16  4:31 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-arch, Heiko Carstens, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

Martin Schwidefsky writes:

> The utimescaled / stimescaled fields in the task structure and the
> global cpustat should be set on all architectures. On s390 the calls
> to account_user_time_scaled and account_system_time_scaled never have
> been added. In addition system time that is accounted as guest time
> to the user time of a process is accounted to the scaled system time
> instead of the scaled user time. 
> To fix the bugs and to prevent future forgetfulness this patch merges
> account_system_time_scaled into account_system_time and
> account_user_time_scaled into account_user_time.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

Looks fine to me.

Acked-by: Paul Mackerras <paulus@samba.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 2/4] idle cputime accounting
  2008-10-08 16:20 ` [patch 2/4] idle " Martin Schwidefsky
@ 2008-10-16  4:59   ` Paul Mackerras
  2008-10-16  6:42     ` Martin Schwidefsky
  0 siblings, 1 reply; 13+ messages in thread
From: Paul Mackerras @ 2008-10-16  4:59 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-arch, Heiko Carstens, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

Martin Schwidefsky writes:

> The cpu time spent by the idle process actually doing something is
> currently accounted as idle time. This is plain wrong, the architectures
> that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
> time spent doing nothing and the time spent by idle doing work. The first
> is accounted with account_steal_time and the second with account_idle_time.

I don't think what you said in that last sentence is correct.  If the
hypervisor takes cpu time away from us while we're doing nothing,
that's still idle time, not stolen time.  Otherwise, imagine the
situation where we have one process running periodically and using say
2% of the cpu, and nothing else (except the idle tasks) running.  At
the moment the kernel will report 98% idle and 0% stolen, which makes
sense.  If we do what you say above, then the kernel will report close
to 0% idle and 98% stolen time.  That looks like this partition is
fully loaded and the machine as a whole is quite overloaded, which is
incorrect.

However, that doesn't seem to be what your patch does, at least as far
as powerpc is concerned.  What you seem to have done in the patch is
to move the logic that says "stolen time is idle time if stolen from
the idle task" from generic code into arch-specific code.  Which is
fine, but it isn't mentioned in your patch description.

> This patch contains the necessary common code changes to be able to
> distinguish idle system time and true idle time. The architectures with
> support for VIRT_CPU_ACCOUNTING need some changes to exploit this.

I can't see how you can make that distinction without adding hooks
into the idle task code so it can say "now we're really idle" and "now
we're executing idle task stuff".  I don't see that this patch does
that for any architecture.  Is that what your last sentence above is
saying?

Finally, with this patch I get the following compilation error on
powerpc:

arch/powerpc/kernel/process.c: In function '__switch_to':
arch/powerpc/kernel/process.c:400: error: implicit declaration of function 'account_process_tick'
make[2]: *** [arch/powerpc/kernel/process.o] Error 1
make[1]: *** [arch/powerpc/kernel] Error 2

Overall, most of the patch looks OK, but the patch description needs
to explain things more clearly.

Paul.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 2/4] idle cputime accounting
  2008-10-16  4:59   ` Paul Mackerras
@ 2008-10-16  6:42     ` Martin Schwidefsky
  2008-10-16  9:08       ` Martin Schwidefsky
  0 siblings, 1 reply; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-16  6:42 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Heiko Carstens, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

On Thu, 2008-10-16 at 15:59 +1100, Paul Mackerras wrote:
> Martin Schwidefsky writes:
> 
> > The cpu time spent by the idle process actually doing something is
> > currently accounted as idle time. This is plain wrong, the architectures
> > that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
> > time spent doing nothing and the time spent by idle doing work. The first
> > is accounted with account_steal_time and the second with account_idle_time.
> 
> I don't think what you said in that last sentence is correct.  If the
> hypervisor takes cpu time away from us while we're doing nothing,
> that's still idle time, not stolen time.  Otherwise, imagine the
> situation where we have one process running periodically and using say
> 2% of the cpu, and nothing else (except the idle tasks) running.  At
> the moment the kernel will report 98% idle and 0% stolen, which makes
> sense.  If we do what you say above, then the kernel will report close
> to 0% idle and 98% stolen time.  That looks like this partition is
> fully loaded and the machine as a whole is quite overloaded, which is
> incorrect.

True, that is nonsense. It should be "The first is accounted with
account_idle_time and the second with account_system_time." I'll fix the
description, good catch. The point I'm trying to make is that it is a
difference if idle is actively using the cpu vs it is truly idle. If it
is using the cpu and the hypervisor steals cpu then this has to be added
as steal time.

> However, that doesn't seem to be what your patch does, at least as far
> as powerpc is concerned.  What you seem to have done in the patch is
> to move the logic that says "stolen time is idle time if stolen from
> the idle task" from generic code into arch-specific code.  Which is
> fine, but it isn't mentioned in your patch description.

That is what is done for powerpc until you can come up with an improved
method to measure true idle time.

> > This patch contains the necessary common code changes to be able to
> > distinguish idle system time and true idle time. The architectures with
> > support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
> 
> I can't see how you can make that distinction without adding hooks
> into the idle task code so it can say "now we're really idle" and "now
> we're executing idle task stuff".  I don't see that this patch does
> that for any architecture.  Is that what your last sentence above is
> saying?

You need to look at patch #3 and patch #4. They are s390 specific and
what they do is to add the required hooks. We already have code that
measures the true idle time which is the time the cpu has an enabled
wait psw loaded. With the two s390 patches the code does a store-clock
right before loading the enabled wait psw and another store-clock as the
first instruction on the asynchronous interrupt paths in entry[64].S.
The difference between these two time stamps is the true idle time.
Is is possible to pull of this scheme on powerpc as well ?

> Finally, with this patch I get the following compilation error on
> powerpc:
> 
> arch/powerpc/kernel/process.c: In function '__switch_to':
> arch/powerpc/kernel/process.c:400: error: implicit declaration of function 'account_process_tick'
> make[2]: *** [arch/powerpc/kernel/process.o] Error 1
> make[1]: *** [arch/powerpc/kernel] Error 2

This is with CONFIG_VIRT_CPU_ACCOUNTING=y?

> Overall, most of the patch looks OK, but the patch description needs
> to explain things more clearly.

Yes, I'll fix it.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 2/4] idle cputime accounting
  2008-10-16  6:42     ` Martin Schwidefsky
@ 2008-10-16  9:08       ` Martin Schwidefsky
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Schwidefsky @ 2008-10-16  9:08 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: linux-arch, Heiko Carstens, Benjamin Herrenschmidt,
	Hidetoshi Seto, Tony Luck, Jeremy Fitzhardinge, Chris Wright,
	Michael Neuling

On Thu, 2008-10-16 at 08:42 +0200, Martin Schwidefsky wrote:
> > Finally, with this patch I get the following compilation error on
> > powerpc:
> > 
> > arch/powerpc/kernel/process.c: In function '__switch_to':
> > arch/powerpc/kernel/process.c:400: error: implicit declaration of
> function 'account_process_tick'
> > make[2]: *** [arch/powerpc/kernel/process.o] Error 1
> > make[1]: *** [arch/powerpc/kernel] Error 2
> 
> This is with CONFIG_VIRT_CPU_ACCOUNTING=y?

Ok, the include for kernel_stat.h has been missing in
arch/powerpc/kernel/process.c. All the other call sites already have the
include.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-10-16  9:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-08 16:19 [patch 0/4] [RFC] true vs. system idle cputime Martin Schwidefsky
2008-10-08 16:19 ` [patch 1/4] fix scaled & unscaled cputime accounting Martin Schwidefsky
2008-10-16  4:31   ` Paul Mackerras
2008-10-08 16:20 ` [patch 2/4] idle " Martin Schwidefsky
2008-10-16  4:59   ` Paul Mackerras
2008-10-16  6:42     ` Martin Schwidefsky
2008-10-16  9:08       ` Martin Schwidefsky
2008-10-08 16:20 ` [patch 3/4] improve precision of idle accounting Martin Schwidefsky
2008-10-08 16:20 ` [patch 4/4] improve idle cputime accounting Martin Schwidefsky
2008-10-08 21:22 ` [patch 0/4] [RFC] true vs. system idle cputime Luck, Tony
2008-10-09  8:03   ` Martin Schwidefsky
2008-10-15 14:01 ` Martin Schwidefsky
2008-10-15 20:56   ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).