public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
@ 2008-05-26 14:31 Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
  To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

The following RFC patch tries to implement scaled CPU utilisation statistics
using the APERF and MPERF MSR registers on x86 platforms.

CPU capacity changes significantly when the CPU's frequency is reduced for
power savings.  Applications that run at such lower CPU frequencies are still
accounted full real CPU time by default.  Had the applications run at full CPU
frequency, they would have finished the work faster and not been charged for
excessive CPU time.

One solution to this problem is to scale the utime and stime entitlement of
the process according to the current CPU frequency.  This technique is used
on the powerpc architecture with the help of hardware registers that
accurately capture the entitlement.

On x86 hardware, APERF and MPERF are MSR registers that provide feedback on
the current CPU frequency.  Currently these registers are used to detect the
current frequency of each core in a multi-core x86 processor, where the
frequency of the entire package is changed at once.

This patch demonstrates the idea of scaling utime and stime based on CPU
frequency.  The scaled values are exported through the taskstats delay
accounting infrastructure.

Example:

On two-socket, two-CPU x86 hardware:
./getdelays -d -l -m0-3  

PID     4172


CPU             count     real total  virtual total    delay total
                43873      148009250     3368915732       28751295
IO              count    delay total
                    0              0
MEM             count    delay total
                    0              0
                utime          stime
                40000         108000
         scaled utime   scaled stime          total
                26676          72032       98714169

The utime/stime and scaled utime/stime values are printed in microseconds,
while the totals are in nanoseconds.  The CPU was running at 66% of its
maximum frequency.

We can observe that the scaled utime/stime values are 66% of their normal
accumulated runtime values, and 'total' is 66% of 'real total'.

The following output is for a CPU-intensive job running for 10s:

PID     4134


CPU             count     real total  virtual total    delay total
                   61    10000625000     9807860434              2
IO              count    delay total
                    0              0
MEM             count    delay total
                    0              0
                utime          stime
             10000000              0
         scaled utime   scaled stime          total
              9886696              0     9887313918

The ondemand governor was running and it took some time to switch the
frequency to maximum.  Hence the scaled values are marginally less than the
elapsed utime.


Limitations:

* RFC patch to communicate the idea; the implementation may need rework
* Works only on 32-bit x86 hardware
* The MSRs are read and the APERF/MPERF ratio is calculated at every context
  switch, which is very slow
* Hacked cputime_t task_struct->utime to hold 'jiffies * 1000' values just to
  account for fractional jiffies.  Since cputime_t is in jiffies on x86, we
  cannot add fractional jiffies at each context switch.  Need to convert the
  scaled utime/stime data types and units to microseconds or nanoseconds.


ToDo:

* Compute the scaling ratio per package only at each frequency switch
  -- Notify frequency changes to all affected CPUs
* Use a more accurate time unit for x86 scaled utime and stime

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

---

Vaidyanathan Srinivasan (3):
      Print scaled utime and stime in getdelays
      Make calls to account_scaled_stats
      General framework for APERF/MPERF access and accounting


 Documentation/accounting/getdelays.c       |   13 ++
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   21 +++
 arch/x86/kernel/process_32.c               |    8 +
 arch/x86/kernel/time_32.c                  |  171 ++++++++++++++++++++++++++++
 include/linux/hardirq.h                    |    4 +
 kernel/delayacct.c                         |    7 +
 kernel/timer.c                             |    2 --
 kernel/tsacct.c                            |   10 +-
 8 files changed, 225 insertions(+), 11 deletions(-)

-- 
        Vaidyanathan Srinivasan,
        Linux Technology Center,
        IBM India Systems and Technology Labs.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting
  2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
@ 2008-05-26 14:31 ` Vaidyanathan Srinivasan
  2008-05-26 18:11   ` Balbir Singh
  2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
  To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

General framework for low level APERF/MPERF access.
* Avoid resetting APERF/MPERF in acpi-cpufreq.c
* Implement functions that will calculate the scaled stats
* acpi_get_pm_msrs_delta() will give delta values after reading the current values
* Change get_measured_perf() to use acpi_get_pm_msrs_delta()
* scaled_stats_init() detects availability of APERF/MPERF using cpuid
* reset_for_scaled_stats() is called when a process occupies the CPU
* account_scaled_stats() is called when the process leaves the CPU

Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   21 +++
 arch/x86/kernel/time_32.c                  |  171 ++++++++++++++++++++++++++++
 2 files changed, 188 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
index b0c8208..761beec 100644
--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -59,6 +59,13 @@ enum {
 #define INTEL_MSR_RANGE		(0xffff)
 #define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
 
+/* Buffer to store old snapshot values */
+DEFINE_PER_CPU(u64, cpufreq_old_aperf);
+DEFINE_PER_CPU(u64, cpufreq_old_mperf);
+
+extern void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta,
+				   u64 *aperf_old, u64 *mperf_old, int reset);
+
 struct acpi_cpufreq_data {
 	struct acpi_processor_performance *acpi_data;
 	struct cpufreq_frequency_table *freq_table;
@@ -265,6 +272,7 @@ static unsigned int get_measured_perf(unsigned int cpu)
 		} split;
 		u64 whole;
 	} aperf_cur, mperf_cur;
+	u64 *aperf_old, *mperf_old;
 
 	cpumask_t saved_mask;
 	unsigned int perf_percent;
@@ -278,11 +286,16 @@ static unsigned int get_measured_perf(unsigned int cpu)
 		return 0;
 	}
 
-	rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi);
-	rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi);
+	/*
+	 * Get the old APERF/MPERF values for this cpu and pass it to
+	 * acpi_get_pm_msrs_delta() which will read the current values
+	 * and return the delta.
+	 */
+	aperf_old = &(per_cpu(cpufreq_old_aperf, smp_processor_id()));
+	mperf_old = &(per_cpu(cpufreq_old_mperf, smp_processor_id()));
 
-	wrmsr(MSR_IA32_APERF, 0,0);
-	wrmsr(MSR_IA32_MPERF, 0,0);
+	acpi_get_pm_msrs_delta(&aperf_cur.whole, &mperf_cur.whole, 
+				aperf_old, mperf_old, 1);
 
 #ifdef __i386__
 	/*
diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c
index 2ff21f3..5131e01 100644
--- a/arch/x86/kernel/time_32.c
+++ b/arch/x86/kernel/time_32.c
@@ -32,10 +32,13 @@
 #include <linux/interrupt.h>
 #include <linux/time.h>
 #include <linux/mca.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/arch_hooks.h>
 #include <asm/hpet.h>
 #include <asm/time.h>
+#include <asm/processor.h>
+#include <asm/cputime.h>
 
 #include "do_timer.h"
 
@@ -136,3 +139,171 @@ void __init time_init(void)
 	tsc_init();
 	late_time_init = choose_time_init();
 }
+
+/*
+ * This function should be used to get the APERF/MPERF MSRs delta from the CPU.
+ * We let the individual users of this function store the old values of the
+ * APERF and MPERF registers in per-cpu variables.  They pass these old values
+ * as the 3rd and 4th arguments.  'reset' tells whether the old values should
+ * be reset or not.  Most users of this function will want to reset them.
+ */
+void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, u64 *aperf_old,
+				u64 *mperf_old, int reset)
+{
+	union {
+		struct {
+			u32 lo;
+			u32 hi;
+		} split;
+		u64 whole;
+	} aperf_cur, mperf_cur;
+	unsigned long flags;
+
+	/* Read current values of APERF and MPERF MSRs */
+	local_irq_save(flags);
+	rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi);
+	rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi);
+	local_irq_restore(flags);
+
+	/*
+	 * If the new values are less than the previous values, there has
+	 * been an overflow and both APERF and MPERF have been reset to
+	 * zero. In this case consider absolute value as diff/delta.
+ * Note that we do not check for 'reset' here, since resetting
+ * is no longer optional and has to be done for the values to make sense.
+	 */
+	if (unlikely((mperf_cur.whole <= *mperf_old) ||
+			(aperf_cur.whole <= *aperf_old)))
+	{
+		*aperf_old = 0;
+		*mperf_old = 0;
+	}
+
+	/* Calculate the delta from the current and per cpu old values */
+	*mperf_delta = mperf_cur.whole - *mperf_old;
+	*aperf_delta = aperf_cur.whole - *aperf_old;
+
+	/* Set the per cpu variables to current readings */
+	if (reset) {
+		*mperf_old = mperf_cur.whole;
+		*aperf_old = aperf_cur.whole;
+	}
+}
+EXPORT_SYMBOL(acpi_get_pm_msrs_delta);
+
+
+DEFINE_PER_CPU(u64, cputime_old_aperf);
+DEFINE_PER_CPU(u64, cputime_old_mperf);
+
+DEFINE_PER_CPU(cputime_t, task_utime_old);
+DEFINE_PER_CPU(cputime_t, task_stime_old);
+
+
+#define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
+
+static int cpu_supports_freq_scaling;
+
+/* Initialize scaled stat functions */
+void scaled_stats_init(void)
+{
+	struct cpuinfo_x86 *c = &cpu_data(0);
+
+	/* Check for APERF/MPERF support in hardware */
+	if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) {
+		unsigned int ecx;
+		ecx = cpuid_ecx(6);
+		if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY)
+			cpu_supports_freq_scaling = 1;
+		else
+			cpu_supports_freq_scaling = -1;
+	}
+}
+	
+/*
+ * Reset the old utime and stime (percpu) value to the new task
+ * that we are going to switch to.
+ */
+void reset_for_scaled_stats(struct task_struct *tsk)
+{
+	u64 aperf_delta, mperf_delta;
+	u64 *aperf_old, *mperf_old;
+
+	if (cpu_supports_freq_scaling < 0)
+		return;
+
+	if (!cpu_supports_freq_scaling) {
+		scaled_stats_init();
+		if (cpu_supports_freq_scaling < 0)
+			return;
+	}
+
+	aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id()));
+	mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id()));
+	acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old,
+				mperf_old, 1);
+
+	per_cpu(task_utime_old, smp_processor_id()) = tsk->utime;
+	per_cpu(task_stime_old, smp_processor_id()) = tsk->stime;
+}
+
+
+/* Account scaled statistics for a task on context switch */
+void account_scaled_stats(struct task_struct *tsk)
+{
+	u64 aperf_delta, mperf_delta;
+	u64 *aperf_old, *mperf_old;
+	cputime_t time;
+	u64 time_msec;
+	int readmsrs = 1;
+
+	if (!cpu_supports_freq_scaling)
+		scaled_stats_init();
+
+	/*
+	 * Get the old APERF/MPERF values for this cpu and pass it to
+	 * acpi_get_pm_msrs_delta() which will read the current values
+	 * and return the delta.
+	 */
+	aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id()));
+	mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id()));
+
+	if (cputime_gt(tsk->utime, per_cpu(task_utime_old,
+					   smp_processor_id()))) {
+		acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old,
+					mperf_old, 1);
+		readmsrs = 0;
+		time = cputime_sub(tsk->utime, per_cpu(task_utime_old,
+							smp_processor_id()));
+		time_msec = cputime_to_msecs(time);
+		time_msec *= 1000; /* Scale it to hold fractional values */
+		if (cpu_supports_freq_scaling == 1) {
+			time_msec *= aperf_delta;
+			time_msec = div64_u64(time_msec, mperf_delta);
+		}
+		time = msecs_to_cputime(time_msec);
+		account_user_time_scaled(tsk, time);
+		per_cpu(task_utime_old, smp_processor_id()) = tsk->utime;
+	}
+
+	if (cputime_gt(tsk->stime, per_cpu(task_stime_old,
+					   smp_processor_id()))) {
+		if (readmsrs)
+			acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta,
+						aperf_old, mperf_old, 1);
+		time = cputime_sub(tsk->stime, per_cpu(task_stime_old,
+							smp_processor_id()));
+
+		time_msec = cputime_to_msecs(time);
+		time_msec *= 1000; /* Scale it to hold fractional values */
+		if (cpu_supports_freq_scaling == 1) {
+			time_msec *= aperf_delta;
+			time_msec = div64_u64(time_msec, mperf_delta);
+		}
+		time = msecs_to_cputime(time_msec);
+		account_system_time_scaled(tsk, time);
+		per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; 
+	}
+
+}	
+EXPORT_SYMBOL(account_scaled_stats);
+


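The snapshot-and-delta logic of acpi_get_pm_msrs_delta() in the patch above
can be sketched in plain C (illustrative; the optional 'reset' flag is
dropped here and the snapshots are always updated):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the delta computation (not the kernel code):
 * compute APERF/MPERF deltas against per-cpu snapshots.  If either
 * counter reads at or below its snapshot, the MSRs were reset
 * elsewhere, so the absolute readings are taken as the deltas.
 */
static void msrs_delta(uint64_t aperf_cur, uint64_t mperf_cur,
		       uint64_t *aperf_old, uint64_t *mperf_old,
		       uint64_t *aperf_delta, uint64_t *mperf_delta)
{
	if (mperf_cur <= *mperf_old || aperf_cur <= *aperf_old) {
		/* Counters were reset: use absolute values as the delta */
		*aperf_old = 0;
		*mperf_old = 0;
	}

	*aperf_delta = aperf_cur - *aperf_old;
	*mperf_delta = mperf_cur - *mperf_old;

	/* Snapshot the current readings for the next call */
	*aperf_old = aperf_cur;
	*mperf_old = mperf_cur;
}
```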

* [RFC PATCH v1 2/3] Make calls to account_scaled_stats
  2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
@ 2008-05-26 14:31 ` Vaidyanathan Srinivasan
  2008-05-26 18:18   ` Balbir Singh
  2008-05-29 15:18   ` Michael Neuling
  2008-05-26 14:31 ` [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays Vaidyanathan Srinivasan
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
  3 siblings, 2 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
  To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

Hook various accounting functions to call scaled stats

* Hook process context switch: __switch_to()
* Hook IRQ handling in account_system_vtime() in hardirq.h
* Update __delayacct_add_tsk() to take care of scaling by 1000
* Update bacct_add_tsk() to take care of scaling by 1000

Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 arch/x86/kernel/process_32.c |    8 ++++++++
 include/linux/hardirq.h      |    4 ++++
 kernel/delayacct.c           |    7 ++++++-
 kernel/timer.c               |    2 --
 kernel/tsacct.c              |   10 ++++++++--
 5 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index f8476df..c81a783 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -56,6 +56,9 @@
 #include <asm/cpu.h>
 #include <asm/kdebug.h>
 
+extern void account_scaled_stats(struct task_struct *tsk);
+extern void reset_for_scaled_stats(struct task_struct *tsk);
+
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 
 static int hlt_counter;
@@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct
 		loadsegment(gs, next->gs);
 
 	x86_write_percpu(current_task, next_p);
+	/* Account scaled statistics for the task leaving CPU */
+	account_scaled_stats(prev_p);
+	barrier();
+	/* Initialise stats counter for new task */
+	reset_for_scaled_stats(next_p);
 
 	return prev_p;
 }
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 181006c..4458736 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -7,6 +7,9 @@
 #include <asm/hardirq.h>
 #include <asm/system.h>
 
+/* TBD: Add config option */
+extern void account_scaled_stats(struct task_struct *tsk);
+
 /*
  * We put the hardirq and softirq counter into the preemption
  * counter. The bitmask has the following meaning:
@@ -115,6 +118,7 @@ struct task_struct;
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 static inline void account_system_vtime(struct task_struct *tsk)
 {
+	account_scaled_stats(tsk);
 }
 #endif
 
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 10e43fd..3e2938f 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 
 	tmp = (s64)d->cpu_scaled_run_real_total;
 	cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts);
-	tmp += timespec_to_ns(&ts);
+	/* HACK: Remember, we multiplied the cputime_t by 1000 to include the
+	 * fraction.  Now it is time to scale it back to the correct 'ns'
+	 * value.  Perhaps we should use a nanosecond unit (u64 type) for
+	 * utimescaled and stimescaled?
+	 */
+	tmp += div_s64(timespec_to_ns(&ts), 1000);
 	d->cpu_scaled_run_real_total =
 		(tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp;
 
diff --git a/kernel/timer.c b/kernel/timer.c
index ceacc66..de8a615 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int user_tick)
 
 	if (user_tick) {
 		account_user_time(p, one_jiffy);
-		account_user_time_scaled(p, cputime_to_scaled(one_jiffy));
 	} else {
 		account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
-		account_system_time_scaled(p, cputime_to_scaled(one_jiffy));
 	}
 }
 #endif
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 4ab1b58..ee0d93b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
 	rcu_read_unlock();
 	stats->ac_utime	 = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
 	stats->ac_stime	 = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
+	/* HACK: the cputime unit is jiffies on x86 and not good for
+	 * fractional addition.  The cputime_t {u,s}timescaled is multiplied
+	 * by 1000 for scaled accounting.  Hence, cputime_to_msecs actually
+	 * gives the required microsecond value, and the multiplier
+	 * USEC_PER_MSEC has been dropped.
+	 */
 	stats->ac_utimescaled =
-		cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC;
+		cputime_to_msecs(tsk->utimescaled);
 	stats->ac_stimescaled =
-		cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC;
+		cputime_to_msecs(tsk->stimescaled);
 	stats->ac_minflt = tsk->min_flt;
 	stats->ac_majflt = tsk->maj_flt;
 



* [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays
  2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan
@ 2008-05-26 14:31 ` Vaidyanathan Srinivasan
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
  3 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
  To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

Add prints in the getdelays program to show scaled stats from taskstats.
* Normal utime and stime are printed in microseconds
* Scaled utime and stime are printed in microseconds
* Total scaled run real time is printed in nanoseconds

The values in taskstats are printed as-is, hence the difference in units.

getdelays -d should print these values.
The preferred cmdline is getdelays -l -p -m0-3 to get the final delay count
at exit.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---

 Documentation/accounting/getdelays.c |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c
index 40121b5..f695563 100644
--- a/Documentation/accounting/getdelays.c
+++ b/Documentation/accounting/getdelays.c
@@ -197,13 +197,22 @@ void print_delayacct(struct taskstats *t)
 	       "IO    %15s%15s\n"
 	       "      %15llu%15llu\n"
 	       "MEM   %15s%15s\n"
-	       "      %15llu%15llu\n",
+               "      %15llu%15llu\n"
+               "      %15s%15s\n"
+               "      %15llu%15llu\n"
+               "      %15s%15s%15s\n"
+               "      %15llu%15llu%15llu\n",
 	       "count", "real total", "virtual total", "delay total",
 	       t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total,
 	       t->cpu_delay_total,
 	       "count", "delay total",
 	       t->blkio_count, t->blkio_delay_total,
-	       "count", "delay total", t->swapin_count, t->swapin_delay_total);
+	       "count", "delay total", t->swapin_count, t->swapin_delay_total,
+	       "utime", "stime",
+	       t->ac_utime, t->ac_stime,
+               "scaled utime", "scaled stime", "total",
+               t->ac_utimescaled, t->ac_stimescaled, 
+               t->cpu_scaled_run_real_total);
 }
 
 void task_context_switch_counts(struct taskstats *t)



* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
                   ` (2 preceding siblings ...)
  2008-05-26 14:31 ` [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays Vaidyanathan Srinivasan
@ 2008-05-26 15:50 ` Arjan van de Ven
  2008-05-26 17:24   ` Balbir Singh
                     ` (2 more replies)
  3 siblings, 3 replies; 31+ messages in thread
From: Arjan van de Ven @ 2008-05-26 15:50 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

On Mon, 26 May 2008 20:01:33 +0530
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> The following RFC patch tries to implement scaled CPU utilisation
> statistics using APERF and MPERF MSR registers in an x86 platform.
> 
> The CPU capacity is significantly changed when the CPU's frequency is
> reduced for the purpose of power savings.  The applications that run
> at such lower CPU frequencies are also accounted for real CPU time by
> default.  If the applications have been run at full CPU frequency,
> they would have finished the work faster and not get charged for
> excessive CPU time.
> 
> One solution to this problem is to scale the utime and stime
> entitlement for the process as per the current CPU frequency.  This
> technique is used in powerpc architecture with the help of hardware
> registers that accurately capture the entitlement.
> 

there are some issues with this unfortunately, and these make it
a very complex thing to do. 
Just to mention a few:
1) What if the BIOS no longer allows us to go to the max frequency for
a period (for example as a result of overheating); with the approach
above, the admin would THINK he can go faster, but he cannot in reality,
so there's misleading information (the system looks half busy, while in
reality it's actually the opposite, it's overloaded). Management tools
will take the wrong decisions (such as moving MORE work to the box, not
less)
2) On systems with Intel Dynamic Acceleration technology, you can get
over 100% of cycles this way. (For those who don't know what IDA is;
IDA is basically a case where if your Penryn based dual core laptop is
only using 1 core, the other core can go faster than 100% as long as
thermals etc allow it). How do you want to deal with this?


* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
@ 2008-05-26 17:24   ` Balbir Singh
  2008-05-26 18:00     ` Arjan van de Ven
  2008-05-27 14:04   ` Vaidyanathan Srinivasan
  2008-05-31 21:13   ` Pavel Machek
  2 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2008-05-26 17:24 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

Arjan van de Ven wrote:
> On Mon, 26 May 2008 20:01:33 +0530
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
>> The following RFC patch tries to implement scaled CPU utilisation
>> statistics using APERF and MPERF MSR registers in an x86 platform.
>>
>> The CPU capacity is significantly changed when the CPU's frequency is
>> reduced for the purpose of power savings.  The applications that run
>> at such lower CPU frequencies are also accounted for real CPU time by
>> default.  If the applications have been run at full CPU frequency,
>> they would have finished the work faster and not get charged for
>> excessive CPU time.
>>
>> One solution to this problem is to scale the utime and stime
>> entitlement for the process as per the current CPU frequency.  This
>> technique is used in powerpc architecture with the help of hardware
>> registers that accurately capture the entitlement.
>>
> 
> there are some issues with this unfortunately, and these make it
> a very complex thing to do. 
> Just to mention a few:
> 1) What if the BIOS no longer allows us to go to the max frequency for
> a period (for example as a result of overheating); with the approach
> above, the admin would THINK he can go faster, but he cannot in reality,
> so there's misleading information (the system looks half busy, while in
> reality it's actually the opposite, it's overloaded). Management tools
> will take the wrong decisions (such as moving MORE work to the box, not
> less)
> 2) On systems with Intel Dynamic Acceleration technology, you can get
> over 100% of cycles this way. (For those who don't know what IDA is;
> IDA is basically a case where if your Penryn based dual core laptop is
> only using 1 core, the other core can go faster than 100% as long as
> thermals etc allow it). How do you want to deal with this?

Arjan,


These problems exist anyway, irrespective of scaled accounting (I'd say that
they are exceptions)

1. The management tool does have access to the current frequency and maximum
frequency, irrespective of scaled accounting. The decision could still be taken
on the data that is already available and management tools can already use them
2. With IDA, we'd have to document that APERF/MPERF can be greater than 100% if
the system is overclocked.

Scaled accounting only intends to provide data already available. Interpretation
is left to management tools and we'll document the corner cases that you just
mentioned.



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL



* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 17:24   ` Balbir Singh
@ 2008-05-26 18:00     ` Arjan van de Ven
  2008-05-26 18:36       ` Balbir Singh
  0 siblings, 1 reply; 31+ messages in thread
From: Arjan van de Ven @ 2008-05-26 18:00 UTC (permalink / raw)
  To: balbir
  Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

On Mon, 26 May 2008 22:54:43 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Arjan,
> 
> 
> These problems exist anyway, irrespective of scaled accounting (I'd
> say that they are exceptions)
> 
> 1. The management tool does have access to the current frequency and
> maximum frequency, irrespective of scaled accounting. The decision
> could still be taken on the data that is already available and
> management tools can already use them 

it's sadly not as easy as you make it sound. From everything you wrote
you're making the assumption "if we're not at maximum frequency, we
have room to spare", which is very much not a correct assumption

> 2. With IDA, we'd have to
> document that APERF/MPERF can be greater than 100% if the system is
> overclocked.
> 
> Scaled accounting only intends to provide data already available.
> Interpretation is left to management tools and we'll document the
> corner cases that you just mentioned.

IDA is not overclocking, nor is it a corner case *at all*. It's the
common case in fact on more modern systems. Having the kernel present
"raw" data to applications that then have no idea how to really use it
to be honest isn't very attractive to me as idea: you're presenting a
very raw hardware interface that will keep changing over time in terms
of how to interpret the data... the kernel needs to abstract such hard
stuff from applications, not fully expose them to it. Especially since
these things *ARE* tricky and *WILL* change. Future x86 hardware will
have behavior that makes the "oh we'll document the corner cases"
extremely unpractical. Heck, even todays hardware (but arguably not yet
the server hardware) behaves like that. "Documenting the common case as
corner case" is not the right thing to do when introducing some new
behavior/interface. Sorry.


* Re: [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting
  2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
@ 2008-05-26 18:11   ` Balbir Singh
  2008-05-27 14:54     ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2008-05-26 18:11 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

Vaidyanathan Srinivasan wrote:
> General framework for low level APERF/MPERF access.
> * Avoid resetting APERF/MPERF in acpi-cpufreq.c
> * Implement functions that will calculate the scaled stats
> * acpi_get_pm_msrs_delta() will give delta values after reading the current values
> * Change get_measured_perf() to use acpi_get_pm_msrs_delta()
> * scaled_stats_init() detect availability of APERF/MPERF using cpuid
> * reset_for_scaled_stats() called when a process occupies the CPU
> * account_scaled_stats() is called when the process leaves the CPU
> 
> Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   21 +++
>  arch/x86/kernel/time_32.c                  |  171 ++++++++++++++++++++++++++++
>  2 files changed, 188 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> index b0c8208..761beec 100644
> --- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> +++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> @@ -59,6 +59,13 @@ enum {
>  #define INTEL_MSR_RANGE		(0xffff)
>  #define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
> 
> +/* Buffer to store old snapshot values */
> +DEFINE_PER_CPU(u64, cpufreq_old_aperf);
> +DEFINE_PER_CPU(u64, cpufreq_old_mperf);
> +
> +extern void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta,
> +				   u64 *aperf_old, u64 *mperf_old, int reset);
> +
>  struct acpi_cpufreq_data {
>  	struct acpi_processor_performance *acpi_data;
>  	struct cpufreq_frequency_table *freq_table;
> @@ -265,6 +272,7 @@ static unsigned int get_measured_perf(unsigned int cpu)
>  		} split;
>  		u64 whole;
>  	} aperf_cur, mperf_cur;
> +	u64 *aperf_old, *mperf_old;
> 
>  	cpumask_t saved_mask;
>  	unsigned int perf_percent;
> @@ -278,11 +286,16 @@ static unsigned int get_measured_perf(unsigned int cpu)
>  		return 0;
>  	}
> 
> -	rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi);
> -	rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi);
> +	/*
> +	 * Get the old APERF/MPERF values for this cpu and pass them to
> +	 * acpi_get_pm_msrs_delta() which will read the current values
> +	 * and return the delta.
> +	 */
> +	aperf_old = &(per_cpu(cpufreq_old_aperf, smp_processor_id()));
> +	mperf_old = &(per_cpu(cpufreq_old_mperf, smp_processor_id()));
> 
> -	wrmsr(MSR_IA32_APERF, 0,0);
> -	wrmsr(MSR_IA32_MPERF, 0,0);
> +	acpi_get_pm_msrs_delta(&aperf_cur.whole, &mperf_cur.whole, 
> +				aperf_old, mperf_old, 1);
> 
>  #ifdef __i386__
>  	/*
> diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c
> index 2ff21f3..5131e01 100644
> --- a/arch/x86/kernel/time_32.c
> +++ b/arch/x86/kernel/time_32.c
> @@ -32,10 +32,13 @@
>  #include <linux/interrupt.h>
>  #include <linux/time.h>
>  #include <linux/mca.h>
> +#include <linux/kernel_stat.h>
> 
>  #include <asm/arch_hooks.h>
>  #include <asm/hpet.h>
>  #include <asm/time.h>
> +#include <asm/processor.h>
> +#include <asm/cputime.h>
> 
>  #include "do_timer.h"
> 
> @@ -136,3 +139,171 @@ void __init time_init(void)
>  	tsc_init();
>  	late_time_init = choose_time_init();
>  }
> +
> +/*
> + * This function should be used to get the APERF/MPERF MSRs delta from the cpu.
> + * We let the individual users of this function store the old values of the
> + * APERF and MPERF registers in per cpu variables. They pass these old values
> + * as the 3rd and 4th arguments. 'reset' tells whether the old values should
> + * be reset. Most users of this function will want to reset the old values.
> + */
> +void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, u64 *aperf_old,
> +				u64 *mperf_old, int reset)
> +{
> +	union {
> +		struct {
> +			u32 lo;
> +			u32 hi;
> +		} split;
> +		u64 whole;
> +	} aperf_cur, mperf_cur;
> +	unsigned long flags;
> +
> +	/* Read current values of APERF and MPERF MSRs */
> +	local_irq_save(flags);

Why do we need to do this? Hasn't the caller already disabled preemption?
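If the concern is only the per-cpu pointers, preemption would indeed be enough; one possible reason for the irq disable (my reading, not stated in the patch) is that the two rdmsr()s must land back-to-back so the APERF/MPERF pair is sampled consistently. A toy model of the hazard, with invented numbers:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: both counters advance together, so only their ratio is
 * meaningful, and only if they are sampled as a pair. An interrupt
 * landing between the two reads tears the snapshot. */
struct msrs { uint64_t aperf; uint64_t mperf; };

static void run_cycles(struct msrs *m, uint64_t actual, uint64_t reference)
{
	m->aperf += actual;
	m->mperf += reference;
}

/* Returns the apparent "performance percent" from a torn snapshot. */
static unsigned int torn_ratio(void)
{
	struct msrs m = { 0, 0 };
	uint64_t mperf_snap, aperf_snap;

	run_cycles(&m, 1000, 1000);	/* task runs at full speed       */
	mperf_snap = m.mperf;		/* first rdmsr()                 */
	run_cycles(&m, 500, 500);	/* "interrupt" between the reads */
	aperf_snap = m.aperf;		/* second rdmsr()                */

	/* The true ratio is 100%, but the torn pair reports 150%. */
	return (unsigned int)(aperf_snap * 100 / mperf_snap);
}
```

With interrupts off, the cycles spent in a handler cannot land between the two reads and inflate one counter snapshot relative to the other.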

> +	rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi);
> +	rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi);
> +	local_irq_restore(flags);
> +
> +	/*
> +	 * If the new values are less than the previous values, there has
> +	 * been an overflow and both APERF and MPERF have been reset to
> +	 * zero. In this case consider absolute value as diff/delta.
> +	 * Note that we do not check for 'reset' here, since resetting here
> +	 * is no longer optional and has to be done for the values to make sense.
> +	 */
> +	if (unlikely((mperf_cur.whole <= *mperf_old) ||
> +			(aperf_cur.whole <= *aperf_old)))
> +	{
> +		*aperf_old = 0;
> +		*mperf_old = 0;
> +	}
> +
> +	/* Calculate the delta from the current and per cpu old values */
> +	*mperf_delta = mperf_cur.whole - *mperf_old;
> +	*aperf_delta = aperf_cur.whole - *aperf_old;
> +
> +	/* Set the per cpu variables to current readings */
> +	if (reset) {
> +		*mperf_old = mperf_cur.whole;
> +		*aperf_old = aperf_cur.whole;
> +	}
> +}
> +EXPORT_SYMBOL(acpi_get_pm_msrs_delta);
> +
> +
> +DEFINE_PER_CPU(u64, cputime_old_aperf);
> +DEFINE_PER_CPU(u64, cputime_old_mperf);
> +
> +DEFINE_PER_CPU(cputime_t, task_utime_old);
> +DEFINE_PER_CPU(cputime_t, task_stime_old);
> +

I fail to understand the per cpu variables task_utime_old and task_stime_old.
What do they represent? Why are they global?


> +
> +#define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
> +
> +static int cpu_supports_freq_scaling;
> +
> +/* Initialize scaled stat functions */
> +void scaled_stats_init(void)
> +{
> +	struct cpuinfo_x86 *c = &cpu_data(0);
> +
> +	/* Check for APERF/MPERF support in hardware */
> +	if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) {
> +		unsigned int ecx;
> +		ecx = cpuid_ecx(6);
> +		if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY)
> +			cpu_supports_freq_scaling = 1;
> +		else
> +			cpu_supports_freq_scaling = -1;
> +	}
> +}
> +	
> +/*
> + * Reset the old utime and stime (percpu) value to the new task
> + * that we are going to switch to.
> + */
> +void reset_for_scaled_stats(struct task_struct *tsk)
> +{
> +	u64 aperf_delta, mperf_delta;
> +	u64 *aperf_old, *mperf_old;
> +
> +	if (cpu_supports_freq_scaling < 0)
> +		return;
> +
> +	if(!cpu_supports_freq_scaling) {
> +		scaled_stats_init();
> +		if(cpu_supports_freq_scaling < 0)
> +			return;
> +	}
> +
> +	aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id()));
> +	mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id()));
> +	acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old,
> +				mperf_old, 1);
> +
> +	per_cpu(task_utime_old, smp_processor_id()) = tsk->utime;
> +	per_cpu(task_stime_old, smp_processor_id()) = tsk->stime;
> +}
> +

I hope this routine is called with preemption disabled

> +
> +/* Account scaled statistics for a task on context switch */
> +void account_scaled_stats(struct task_struct *tsk)
> +{
> +	u64 aperf_delta, mperf_delta;
> +	u64 *aperf_old, *mperf_old;
> +	cputime_t time;
> +	u64 time_msec;
> +	int readmsrs = 1;
> +
> +	if(!cpu_supports_freq_scaling)
> +		scaled_stats_init();
> +
> +	/*
> +	 * Get the old APERF/MPERF values for this cpu and pass them to
> +	 * acpi_get_pm_msrs_delta() which will read the current values
> +	 * and return the delta.
> +	 */
> +	aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id()));
> +	mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id()));
> +
> +	if (cputime_gt(tsk->utime, per_cpu(task_utime_old,
> +					   smp_processor_id()))) {
> +		acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old,
> +					mperf_old, 1);
> +		readmsrs = 0;
> +		time = cputime_sub(tsk->utime, per_cpu(task_utime_old,
> +							smp_processor_id()));
> +		time_msec = cputime_to_msecs(time);
> +		time_msec *= 1000; /* Scale it to hold fractional values */

What does the 1000 represent? The code is not clear.
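To make this concrete: the 1000 appears to carry the jiffies-resolution cputime at 1000x so the aperf/mperf fraction survives integer division. A sketch with a named constant (the name is invented here, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: cputime_t is jiffies-resolution on x86, too
 * coarse to hold the fractional result of aperf/mperf scaling, so
 * the scaled value is carried at 1000x resolution. */
#define CPUTIME_SCALE_FACTOR	1000ULL

static uint64_t scale_time_msec(uint64_t time_msec,
				uint64_t aperf_delta, uint64_t mperf_delta)
{
	uint64_t t = time_msec * CPUTIME_SCALE_FACTOR;

	/* Discount elapsed time by actual vs. reference cycles. */
	return t * aperf_delta / mperf_delta;
}
```

Half-speed example: 100 ms with aperf/mperf = 1/2 yields 50000, i.e. 50 ms carried at 1000x resolution.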

> +		if (cpu_supports_freq_scaling == 1) {
> +			time_msec *= aperf_delta;
> +			time_msec = div64_u64(time_msec, mperf_delta);
> +		}
> +		time = msecs_to_cputime(time_msec);
> +		account_user_time_scaled(tsk, time);
> +		per_cpu(task_utime_old, smp_processor_id()) = tsk->utime;
> +	}
> +
> +	if (cputime_gt(tsk->stime, per_cpu(task_stime_old,
> +					   smp_processor_id()))) {
> +		if (readmsrs)
> +			acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta,
> +						aperf_old, mperf_old, 1);
> +		time = cputime_sub(tsk->stime, per_cpu(task_stime_old,
> +							smp_processor_id()));
> +
> +		time_msec = cputime_to_msecs(time);
> +		time_msec *= 1000; /* Scale it to hold fractional values */
> +		if (cpu_supports_freq_scaling == 1) {
> +			time_msec *= aperf_delta;
> +			time_msec = div64_u64(time_msec, mperf_delta);
> +		}
> +		time = msecs_to_cputime(time_msec);
> +		account_system_time_scaled(tsk, time);
> +		per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; 
> +	}
> +
> +}	
> +EXPORT_SYMBOL(account_scaled_stats);
> +
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats
  2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan
@ 2008-05-26 18:18   ` Balbir Singh
  2008-05-27 15:02     ` Vaidyanathan Srinivasan
  2008-05-29 15:18   ` Michael Neuling
  1 sibling, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2008-05-26 18:18 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

Vaidyanathan Srinivasan wrote:
> Hook various accounting functions to call scaled stats
> 
> * Hook process context switch: __switch_to()
> * Hook IRQ handling account_system_vtime() in hardirq.h
> * Update __delayacct_add_tsk() to take care of scaling by 1000
> * Update bacct_add_tsk() to take care of scaling by 1000
> 
> Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  arch/x86/kernel/process_32.c |    8 ++++++++
>  include/linux/hardirq.h      |    4 ++++
>  kernel/delayacct.c           |    7 ++++++-
>  kernel/timer.c               |    2 --
>  kernel/tsacct.c              |   10 ++++++++--
>  5 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index f8476df..c81a783 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -56,6 +56,9 @@
>  #include <asm/cpu.h>
>  #include <asm/kdebug.h>
> 
> +extern void account_scaled_stats(struct task_struct *tsk);
> +extern void reset_for_scaled_stats(struct task_struct *tsk);
> +
>  asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
> 
>  static int hlt_counter;
> @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct
>  		loadsegment(gs, next->gs);
> 
>  	x86_write_percpu(current_task, next_p);
> +	/* Account scaled statistics for the task leaving CPU */
> +	account_scaled_stats(prev_p);
> +	barrier();
> +	/* Initialise stats counter for new task */
> +	reset_for_scaled_stats(next_p);
> 

This is a bad place to hook into. Can't we do scaled accounting the way we do
it for powerpc (SPURR)? I would rather hook into account_*time.

>  	return prev_p;
>  }
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index 181006c..4458736 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -7,6 +7,9 @@
>  #include <asm/hardirq.h>
>  #include <asm/system.h>
> 
> +/* TBD: Add config option */
> +extern void account_scaled_stats(struct task_struct *tsk);
> +
>  /*
>   * We put the hardirq and softirq counter into the preemption
>   * counter. The bitmask has the following meaning:
> @@ -115,6 +118,7 @@ struct task_struct;
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING
>  static inline void account_system_vtime(struct task_struct *tsk)
>  {
> +	account_scaled_stats(tsk);
>  }
>  #endif
> 
> diff --git a/kernel/delayacct.c b/kernel/delayacct.c
> index 10e43fd..3e2938f 100644
> --- a/kernel/delayacct.c
> +++ b/kernel/delayacct.c
> @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
> 
>  	tmp = (s64)d->cpu_scaled_run_real_total;
>  	cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts);
> -	tmp += timespec_to_ns(&ts);
> +	/* HACK: Remember, we multiplied the cputime_t by 1000 to include the
> +	 * fraction.  Now it is time to scale it back to the correct 'ns' value.
> +	 * Perhaps we should use a nanosecond unit (u64 type) for utimescaled
> +	 * and stimescaled?
> +	 */
> +	tmp += div_s64(timespec_to_ns(&ts),1000);

1000 is a bit magical; please use a meaningful #define.

>  	d->cpu_scaled_run_real_total =
>  		(tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp;
> 
> diff --git a/kernel/timer.c b/kernel/timer.c
> index ceacc66..de8a615 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int user_tick)
> 
>  	if (user_tick) {
>  		account_user_time(p, one_jiffy);
> -		account_user_time_scaled(p, cputime_to_scaled(one_jiffy));
>  	} else {
>  		account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
> -		account_system_time_scaled(p, cputime_to_scaled(one_jiffy));
>  	}

Why couldn't we leverage these functions here?

>  }
>  #endif
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index 4ab1b58..ee0d93b 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
>  	rcu_read_unlock();
>  	stats->ac_utime	 = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
>  	stats->ac_stime	 = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
> +	/* HACK: The cputime unit is jiffies on x86, which is not good for
> +	 * fractional addition.  The cputime_t {u,s}timescaled values are
> +	 * multiplied by 1000 for scaled accounting.  Hence, cputime_to_msecs
> +	 * will actually give the required microsecond value, and the
> +	 * multiplier USEC_PER_MSEC has been dropped.
> +	 */
>  	stats->ac_utimescaled =
> -		cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC;
> +		cputime_to_msecs(tsk->utimescaled);
>  	stats->ac_stimescaled =
> -		cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC;
> +		cputime_to_msecs(tsk->stimescaled);
>  	stats->ac_minflt = tsk->min_flt;
>  	stats->ac_majflt = tsk->maj_flt;
> 
> 


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 18:00     ` Arjan van de Ven
@ 2008-05-26 18:36       ` Balbir Singh
  2008-05-26 18:51         ` Arjan van de Ven
  0 siblings, 1 reply; 31+ messages in thread
From: Balbir Singh @ 2008-05-26 18:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

Arjan van de Ven wrote:
> On Mon, 26 May 2008 22:54:43 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Arjan,
>>
>>
>> These problems exist anyway, irrespective of scaled accounting (I'd
>> say that they are exceptions)
>>
>> 1. The management tool does have access to the current frequency and
>> maximum frequency, irrespective of scaled accounting. The decision
>> could still be taken on the data that is already available and
>> management tools can already use them 
> 
> it's sadly not as easy as you make it sound. From everything you wrote
> you're making the assumption "if we're not at maximum frequency, we
> have room to spare", which is very much not a correct assumption
> 

That's true in general. If the CPUs are throttled due to overheating, the system
management application will figure out that it cannot change the frequency. How
do I interpret my CPU frequency applet's data when it says that the system is
running at 46%?

>> 2. With IDA, we'd have to
>> document that APERF/MPERF can be greater than 100% if the system is
>> overclocked.
>>
>> Scaled accounting only intends to provide data already available.
>> Interpretation is left to management tools and we'll document the
>> corner cases that you just mentioned.
> 
> IDA is not overclocking, nor is it a corner case *at all*. It's the
> common case in fact on more modern systems. Having the kernel present
> "raw" data to applications that then have no idea how to really use it
> to be honest isn't very attractive to me as idea: you're presenting a
> very raw hardware interface that will keep changing over time in terms
> of how to interpret the data... the kernel needs to abstract such hard
> stuff from applications, not fully expose them to it. Especially since
> these things *ARE* tricky and *WILL* change. Future x86 hardware will
> have behavior that makes the "oh we'll document the corner cases"
> extremely unpractical. Heck, even todays hardware (but arguably not yet
> the server hardware) behaves like that. "Documenting the common case as
> corner case" is not the right thing to do when introducing some new
> behavior/interface. Sorry.

Before I argue against that, I would like to ask

1. How are APERF/MPERF meant to be utilized?
2. The CPU frequency driver/governor uses APERF/MPERF as well - we could argue
and say that it should not be using/exposing that data to user space or using
that data to make decisions.
3. How do I answer the following problem

My CPU utilization is 50% at all frequencies (since utilization is time based),
does it mean that frequency scaling does not impact my workload?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 18:36       ` Balbir Singh
@ 2008-05-26 18:51         ` Arjan van de Ven
  2008-05-27 12:59           ` Balbir Singh
  2008-05-27 13:29           ` Vaidyanathan Srinivasan
  0 siblings, 2 replies; 31+ messages in thread
From: Arjan van de Ven @ 2008-05-26 18:51 UTC (permalink / raw)
  To: balbir
  Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

On Tue, 27 May 2008 00:06:56 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Arjan van de Ven wrote:
> > On Mon, 26 May 2008 22:54:43 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> >> Arjan,
> >>
> >>
> >> These problems exist anyway, irrespective of scaled accounting (I'd
> >> say that they are exceptions)
> >>
> >> 1. The management tool does have access to the current frequency
> >> and maximum frequency, irrespective of scaled accounting. The
> >> decision could still be taken on the data that is already
> >> available and management tools can already use them 
> > 
> > it's sadly not as easy as you make it sound. From everything you
> > wrote you're making the assumption "if we're not at maximum
> > frequency, we have room to spare", which is very much not a correct
> > assumption
> > 
> 
> That's true in general. If the CPUs are throttled due to overheating,
> the system management application will figure out that it cannot
> change the frequency. 

It's not the system management application but the kernel (and the
hardware! Esp in case of IDA) that manage the frequency. 

> How do I interpret my CPU frequency applet's
> data when it says that the system is running at 46%?

That is a very good question. The answer is "uhh badly". Sad but true.
I'm not arguing against that.
The problem I have is that what you're doing does not make it better!

> 
> >> 2. With IDA, we'd have to
> >> document that APERF/MPERF can be greater than 100% if the system is
> >> overclocked.
> >>
> >> Scaled accounting only intends to provide data already available.
> >> Interpretation is left to management tools and we'll document the
> >> corner cases that you just mentioned.
> > 
> > IDA is not overclocking, nor is it a corner case *at all*. It's the
> > common case in fact on more modern systems. Having the kernel
> > present "raw" data to applications that then have no idea how to
> > really use it to be honest isn't very attractive to me as idea:
> > you're presenting a very raw hardware interface that will keep
> > changing over time in terms of how to interpret the data... the
> > kernel needs to abstract such hard stuff from applications, not
> > fully expose them to it. Especially since these things *ARE* tricky
> > and *WILL* change. Future x86 hardware will have behavior that
> > makes the "oh we'll document the corner cases" extremely
> > unpractical. Heck, even todays hardware (but arguably not yet the
> > server hardware) behaves like that. "Documenting the common case as
> > corner case" is not the right thing to do when introducing some new
> > behavior/interface. Sorry.
> 
> Before I argue against that, I would like to ask
> 
> 1. How are APERF/MPERF meant to be utilized?

It's meant to be used by the cpu frequency governors to figure out how
many actual cycles are actually used (esp needed in case of IDA).

> 2. The CPU frequency driver/governor uses APERF/MPERF as well - we
> could argue and say that it should not be using/exposing that data to
> user space or using that data to make decisions.

that's a case where it really makes sense; it's the case where the
thing that controls the cpu P-state actually learns about how much work
was done to reevaluate what the cpu frequency should be going forward.
Eg it's a case of comparing actual frequency (APERF/MPERF) to see
what's useful to set next.
IDA makes this all needed due to the dynamic nature of the concept of
"frequency".

> 3. How do I answer the following problem
> 
> My CPU utilization is 50% at all frequencies (since utilization is
> time based), does it mean that frequency scaling does not impact my
> workload?

Without knowing anything other than this, then yes, that would be a
logical conclusion: the most likely cause would be that your cpu is
memory bound. In fact, you could then scale down your cpu
frequency/voltage to be lower, and save some power without losing
performance. 
It's a weird workload though; it's probably a time-based thing where you
alternate between idle and fully memory bound loads.

(which is another case where your patches would then expose idle time
even though your cpu is fully utilized for the 50% of the time it's
running)



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 18:51         ` Arjan van de Ven
@ 2008-05-27 12:59           ` Balbir Singh
  2008-05-27 13:19             ` Vaidyanathan Srinivasan
  2008-05-31 21:27             ` Pavel Machek
  2008-05-27 13:29           ` Vaidyanathan Srinivasan
  1 sibling, 2 replies; 31+ messages in thread
From: Balbir Singh @ 2008-05-27 12:59 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

Arjan van de Ven wrote:
> On Tue, 27 May 2008 00:06:56 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
>> Arjan van de Ven wrote:
>>> On Mon, 26 May 2008 22:54:43 +0530
>>> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>>>
>>>> Arjan,
>>>>
>>>>
>>>> These problems exist anyway, irrespective of scaled accounting (I'd
>>>> say that they are exceptions)
>>>>
>>>> 1. The management tool does have access to the current frequency
>>>> and maximum frequency, irrespective of scaled accounting. The
>>>> decision could still be taken on the data that is already
>>>> available and management tools can already use them 
>>> it's sadly not as easy as you make it sound. From everything you
>>> wrote you're making the assumption "if we're not at maximum
>>> frequency, we have room to spare", which is very much not a correct
>>> assumption
>>>
>> That's true in general. If the CPUs are throttled due to overheating,
>> the system management application will figure out that it cannot
>> change the frequency. 
> 
> It's not the system management application but the kernel (and the
> hardware! Esp in case of IDA) that manage the frequency. 
> 
>> How do I interpret my CPU frequency applet's
>> data when it says that the system is running at 46%?
> 
> That is a very good question. The answer is "uhh badly". Sad but true.
> I'm not arguing against that.
> The problem I have is that what you're doing does not make it better!
> 

Well, how do we make that problem better? Do you have a solution in mind? Many
of us can live with and interpret the data we see - I know that I am running at
46% of the maximum potential capacity at this moment.

>>>> 2. With IDA, we'd have to
>>>> document that APERF/MPERF can be greater than 100% if the system is
>>>> overclocked.
>>>>
>>>> Scaled accounting only intends to provide data already available.
>>>> Interpretation is left to management tools and we'll document the
>>>> corner cases that you just mentioned.
>>> IDA is not overclocking, nor is it a corner case *at all*. It's the
>>> common case in fact on more modern systems. Having the kernel
>>> present "raw" data to applications that then have no idea how to
>>> really use it to be honest isn't very attractive to me as idea:
>>> you're presenting a very raw hardware interface that will keep
>>> changing over time in terms of how to interpret the data... the
>>> kernel needs to abstract such hard stuff from applications, not
>>> fully expose them to it. Especially since these things *ARE* tricky
>>> and *WILL* change. Future x86 hardware will have behavior that
>>> makes the "oh we'll document the corner cases" extremely
>>> unpractical. Heck, even todays hardware (but arguably not yet the
>>> server hardware) behaves like that. "Documenting the common case as
>>> corner case" is not the right thing to do when introducing some new
>>> behavior/interface. Sorry.
>> Before I argue against that, I would like to ask
>>
>> 1. How are APERF/MPERF meant to be utilized?
> 
> It's meant to be used by the cpu frequency governors to figure out how
> many actual cycles are actually used (esp needed in case of IDA).
> 
>> 2. The CPU frequency driver/governor uses APERF/MPERF as well - we
>> could argue and say that it should not be using/exposing that data to
>> user space or using that data to make decisions.
> 
> that's a case where it really makes sense; it's the case where the
> thing that controls the cpu P-state actually learns about how much work
> was done to reevaluate what the cpu frequency should be going forward.

So why should the in-kernel governor be the only one to do so? What if I wanted
to write a user space governor that potentially wanted to do the same thing?
How would it work? Why should the governor be the only one calculating how much
work was done; why can't a user space application do the same thing?

> Eg it's a case of comparing actual frequency (APERF/MPERF) to see
> what's useful to set next.
> IDA makes this all needed due to the dynamic nature of the concept of
> "frequency".
> 

Could you point me to the IDA specification, so that I can read and understand
your concern better.

>> 3. How do I answer the following problem
>>
>> My CPU utilization is 50% at all frequencies (since utilization is
>> time based), does it mean that frequency scaling does not impact my
>> workload?
> 
> without knowing anything else than this, then yes that would be a
> logical conclusion: the most likely cause would be because your cpu is
> memory bound. In fact, you could then scale down your cpu
> frequency/voltage to be lower, and save some power without losing
> performance. 
> It's a weird workload though, its probably a time based thing where you
> alternate between idle and fully memory bound loads.
> 
> (which is another case where your patches would then expose idle time
> even though your cpu is fully utilized for the 50% of the time it's
> running)

We expect the end user to see 50% as scaled utilization and 100% as normal
utilization. We don't intend to remove tsk->utime and tsk->stime. Our patches
intend to provide the data and not impose what control action should be taken.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 12:59           ` Balbir Singh
@ 2008-05-27 13:19             ` Vaidyanathan Srinivasan
  2008-05-27 14:15               ` Arjan van de Ven
  2008-05-31 21:27             ` Pavel Machek
  1 sibling, 1 reply; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 13:19 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Arjan van de Ven, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora


[snipped]

> >> 3. How do I answer the following problem
> >>
> >> My CPU utilization is 50% at all frequencies (since utilization is
> >> time based), does it mean that frequency scaling does not impact my
> >> workload?
> > 
> > without knowing anything else than this, then yes that would be a
> > logical conclusion: the most likely cause would be because your cpu is
> > memory bound. In fact, you could then scale down your cpu
> > frequency/voltage to be lower, and save some power without losing
> > performance. 
> > It's a weird workload though, its probably a time based thing where you
> > alternate between idle and fully memory bound loads.
> > 
> > (which is another case where your patches would then expose idle time
> > even though your cpu is fully utilized for the 50% of the time it's
> > running)
> 
> We expect the end user to see 50% as scaled utilization and 100% as normal
> utilization. We don't intend to remove tsk->utime and tsk->stime. Our patches
> intend to provide the data and not impose what control action should be taken.

Hi Arjan,

As Balbir mentioned, we are not changing the idle time calculations.
The meaning of current utime and stime are preserved and they are
relative to current CPU capacity.

We are just adding a new metric (which already exist in taskstats) to
provide more utilisation data for higher level management software to
take decisions.

At any time we will have both the traditional utilisation value
relative to current CPU capacity, and scaled utilisation that is
relative to maximum CPU capacity.
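To illustrate the two views with invented numbers: a task busy for 80ms of a
100ms window on a CPU running at half its maximum frequency (aperf/mperf =
1/2) shows 80% traditional utilisation but 40% scaled utilisation:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: utilisation relative to *current* CPU capacity,
 * i.e. what utime/stime already report today. */
static unsigned int traditional_util_pct(uint64_t busy_ms, uint64_t window_ms)
{
	return (unsigned int)(busy_ms * 100 / window_ms);
}

/* Illustrative only: the same busy time discounted by the
 * actual-to-reference cycle ratio, i.e. relative to *maximum*
 * CPU capacity, as the scaled statistics are meant to report. */
static unsigned int scaled_util_pct(uint64_t busy_ms, uint64_t window_ms,
				    uint64_t aperf_delta, uint64_t mperf_delta)
{
	return (unsigned int)(busy_ms * 100 * aperf_delta /
			      (mperf_delta * window_ms));
}
```

Both numbers are reported side by side, so management software can pick
whichever view suits its decision.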

--Vaidy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 18:51         ` Arjan van de Ven
  2008-05-27 12:59           ` Balbir Singh
@ 2008-05-27 13:29           ` Vaidyanathan Srinivasan
  2008-05-27 14:19             ` Arjan van de Ven
  1 sibling, 1 reply; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 13:29 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: balbir, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-26 11:51:08]:

> On Tue, 27 May 2008 00:06:56 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Arjan van de Ven wrote:
> > > On Mon, 26 May 2008 22:54:43 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > >> Arjan,
> > >>
> > >>
> > >> These problems exist anyway, irrespective of scaled accounting (I'd
> > >> say that they are exceptions)
> > >>
> > >> 1. The management tool does have access to the current frequency
> > >> and maximum frequency, irrespective of scaled accounting. The
> > >> decision could still be taken on the data that is already
> > >> available and management tools can already use them 
> > > 
> > > it's sadly not as easy as you make it sound. From everything you
> > > wrote you're making the assumption "if we're not at maximum
> > > frequency, we have room to spare", which is very much not a correct
> > > assumption
> > > 
> > 
> > That's true in general. If the CPUs are throttled due to overheating,
> > the system management application will figure out that it cannot
> > change the frequency. 
> 
> It's not the system management application but the kernel (and the
> hardware! Esp in case of IDA) that manage the frequency. 
> 
> > How do I interpret my CPU frequency applet's
> > data when it says that the system is running at 46%?
> 
> That is a very good question. The answer is "uhh badly". Sad but true.
> I'm not arguing against that.
> The problem I have is that what you're doing does not make it better!
> 
> > 
> > >> 2. With IDA, we'd have to
> > >> document that APERF/MPERF can be greater than 100% if the system is
> > >> overclocked.
> > >>
> > >> Scaled accounting only intends to provide data already available.
> > >> Interpretation is left to management tools and we'll document the
> > >> corner cases that you just mentioned.
> > > 
> > > IDA is not overclocking, nor is it a corner case *at all*. It's the
> > > common case in fact on more modern systems. Having the kernel
> > > present "raw" data to applications that then have no idea how to
> > > really use it to be honest isn't very attractive to me as idea:
> > > you're presenting a very raw hardware interface that will keep
> > > changing over time in terms of how to interpret the data... the
> > > kernel needs to abstract such hard stuff from applications, not
> > > fully expose them to it. Especially since these things *ARE* tricky
> > > and *WILL* change. Future x86 hardware will have behavior that
> > > makes the "oh we'll document the corner cases" extremely
> > > unpractical. Heck, even todays hardware (but arguably not yet the
> > > server hardware) behaves like that. "Documenting the common case as
> > > corner case" is not the right thing to do when introducing some new
> > > behavior/interface. Sorry.
> > 
> > Before I argue against that, I would like to ask
> > 
> > 1. How are APERF/MPERF meant to be utilized?
> 
> It's meant to be used by the cpu frequency governors to figure out how
> many actual cycles are actually used (esp needed in case of IDA).
> 
> > 2. The CPU frequency driver/governor uses APERF/MPERF as well - we
> > could argue and say that it should not be using/exposing that data to
> > user space or using that data to make decisions.
> 
> that's a case where it really makes sense; it's the case where the
> thing that controls the cpu P-state actually learns about how much work
> was done to reevaluate what the cpu frequency should be going forward.
> Eg it's a case of comparing actual frequency (APERF/MPERF) to see
> what's useful to set next.
> IDA makes this all needed due to the dynamic nature of the concept of
> "frequency".

Scaled statistics relative to maximum CPU capacity are just a method of
exposing the actual CPU utilisation of applications, independent of CPU
frequency changes.

The reason behind the metric is the same as the fact you mention
above.  The CPU frequency governors cannot make decisions based only
on the idle-time ratio.  They need to know the current utilisation
(used cycles) relative to maximum capacity so that the frequency can
be raised to the next higher level.

Higher level management software that wants to control CPU capacity
externally will need similar information.
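As an illustration of that metric (a user-space sketch with hypothetical names, not code from this patch), utilisation relative to maximum capacity can be derived by weighting the time-based utilisation with the APERF/MPERF delta ratio:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale a time-based utilisation figure by the
 * effective frequency ratio (delta APERF / delta MPERF) observed over
 * the same interval, yielding utilisation relative to maximum CPU
 * capacity, in percent.  Integer arithmetic only.
 */
static uint64_t scaled_utilisation(uint64_t busy, uint64_t total,
				   uint64_t aperf_delta, uint64_t mperf_delta)
{
	return (busy * 100 * aperf_delta) / (total * mperf_delta);
}
```

With this, a CPU that is 100% busy while running at half the reference frequency reports 50% scaled utilisation, telling the governor (or an external management tool) that it can still double the delivered capacity.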

> 
> > 3. How do I answer the following problem
> > 
> > My CPU utilization is 50% at all frequencies (since utilization is
> > time based), does it mean that frequency scaling does not impact my
> > workload?
> 
> without knowing anything else than this, then yes that would be a
> logical conclusion: the most likely cause would be because your cpu is
> memory bound. In fact, you could then scale down your cpu
> frequency/voltage to be lower, and save some power without losing
> performance. 
> It's a weird workload though, its probably a time based thing where you
> alternate between idle and fully memory bound loads.
> 
> (which is another case where your patches would then expose idle time
> even though your cpu is fully utilized for the 50% of the time it's
> running)
> 
> 
> 
> -- 
> If you want to reach me at my work email, use arjan@linux.intel.com
> For development, discussion and tips for power savings, 
> visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
  2008-05-26 17:24   ` Balbir Singh
@ 2008-05-27 14:04   ` Vaidyanathan Srinivasan
  2008-05-27 16:40     ` Arjan van de Ven
  2008-05-31 21:17     ` Pavel Machek
  2008-05-31 21:13   ` Pavel Machek
  2 siblings, 2 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 14:04 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-26 08:50:00]:

> On Mon, 26 May 2008 20:01:33 +0530
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> > The following RFC patch tries to implement scaled CPU utilisation
> > statistics using APERF and MPERF MSR registers in an x86 platform.
> > 
> > The CPU capacity is significantly changed when the CPU's frequency is
> > reduced for the purpose of power savings.  The applications that run
> > at such lower CPU frequencies are also accounted for real CPU time by
> > default.  Had the applications run at full CPU frequency, they would
> > have finished the work faster and not been charged for excessive CPU
> > time.
> > 
> > One of the solutions to this problem is to scale the utime and stime
> > entitlement for the process as per the current CPU frequency.  This
> > technique is used in the powerpc architecture with the help of hardware
> > registers that accurately capture the entitlement.
> > 
> 
> there are some issues with this unfortunately, and these make it
> a very complex thing to do. 
> Just to mention a few:
> 1) What if the BIOS no longer allows us to go to the max frequency for
> a period (for example as a result of overheating); with the approach
> above, the admin would THINK he can go faster, but he cannot in reality,
> so there's misleading information (the system looks half busy, while in
> reality it's actually the opposite, it's overloaded). Management tools
> will take the wrong decisions (such as moving MORE work to the box, not
> less)
> 2) On systems with Intel Dynamic Acceleration technology, you can get
> over 100% of cycles this way. (For those who don't know what IDA is;
> IDA is basically a case where if your Penryn based dual core laptop is
> only using 1 core, the other core can go faster than 100% as long as
> thermals etc allow it). How do you want to deal with this?

Hi Arjan,

Thank you for the inputs.  The above issues are very valid, and our
solution should react appropriately to these situations.

What we are proposing is a scaled time value that is scaled to the
current CPU capacity.  If the scaled utilisation is 50% when the CPU
is at 100% capacity, it is expected to remain at 50% even if the CPU's
capacity is dropped to 50%, while the traditional utilisation value
will be 100%.

The problem in the above two cases is that we assumed the maximum CPU
capacity to be 100% at normal capacity (without IDA).

If the CPU is at half the maximum frequency, then scaled stats should
show 50%.  

Now in case 1, the CPU's capacity cannot be increased further, and you
would expect the scaled stats to show 100% as well.  If the process
ran for 10s, then the scaled time should be 10s, since we cannot make
the process run any faster.

In case 2, the CPU's capacity increases beyond the assumed 100%, and
the tasks can now exceed 100% utilisation.  If the real run time is
5s, then the scaled runtime will be more than 5s, say 7s.  This
essentially says that the process has done work worth 7s of CPU time
as if it had run at 100% capacity.

The point I am trying to make is that whether scaling should be done
relative to the CPU's designed maximum capacity or to the maximum
capacity under current constraints is still to be discussed.

Case A:
------

Scaled stats are relative to the maximum designed capacity, including
IDA.

In this case the nominal utilisation will always be less than 100%,
but the higher-level software needs to know the environment and
interpret the values so as to determine the remaining capacity.

Example: Assume IDA provides a 20% boost

We know that normal capacity is 83%; with a 20% boost from IDA, we
can reach 100%.

If there is a P-state constraint due to the power/thermal envelope,
then we know that the maximum capacity is reduced to, say, 40%.

Remaining cap = Available cap - Current Cap

Available capacity is a dynamically varying quantity, but the stats
are still useful and interpretable.
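The case A arithmetic can be sketched as follows (a hypothetical user-space illustration, not part of the patch), with both terms expressed as percentages of the designed maximum capacity:

```c
#include <assert.h>

/*
 * Hypothetical sketch of case A: both quantities are percentages of
 * the designed maximum capacity (IDA boost included), so nominal
 * operation tops out at ~83% and an IDA boost can reach 100%.
 */
static unsigned int remaining_capacity(unsigned int available_cap,
				       unsigned int current_cap)
{
	/* Remaining cap = Available cap - Current cap */
	return available_cap > current_cap ?
		available_cap - current_cap : 0;
}
```

For example, with a thermal constraint capping available capacity at 40% and a task consuming 25%, 15% remains; if current use already exceeds the newly lowered available capacity, the box is overloaded and no capacity remains.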

Case B:
------

Scaled stats are relative to the currently available capacity.

In this case we assume that 100% means the currently available
capacity, and hence any scaled utilisation below 100% can be counted
as spare capacity.

If we are constrained to half frequency, and we are running at 1/4th
frequency, then our scaled stats will show 50%, implying that we can
still double our capacity by switching to the next higher frequency.

This is not a very elegant statistic, because the scaled 'time' value
will not make sense after a long runtime spanning various transitions
through constraints and acceleration.

In this case the OS needs to know about the constraint.  Higher-level
management software will not be able to interpret the scaled stats at
the moment a constraint changes; it needs to know about the
constraints and changes as well.

I would prefer the metric as in case A.  I assume that even in the
case of IDA, we will know the CPU's maximum capacity with
acceleration, as well as its nominal capacity.  The APERF/MPERF ratio
can be interpreted correctly even if it exceeds 1.

Please let us know if we can improve the framework to include both the
power constraint case and acceleration case.

Thanks,
Vaidy


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 13:19             ` Vaidyanathan Srinivasan
@ 2008-05-27 14:15               ` Arjan van de Ven
  2008-05-27 15:27                 ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 31+ messages in thread
From: Arjan van de Ven @ 2008-05-27 14:15 UTC (permalink / raw)
  To: svaidy
  Cc: Balbir Singh, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

On Tue, 27 May 2008 18:49:26 +0530
> At any time we will have both the traditional utilisation value
> relative to current CPU capacity, and scaled utilisation that is
> relative to maximum CPU capacity.
> 

this is where I raise a red flag *because the patch is not doing
that* !!

sadly it gives a random metric which in some circumstances looks like
it does that, but in reality it is not doing that.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 13:29           ` Vaidyanathan Srinivasan
@ 2008-05-27 14:19             ` Arjan van de Ven
  2008-05-27 15:20               ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 31+ messages in thread
From: Arjan van de Ven @ 2008-05-27 14:19 UTC (permalink / raw)
  To: svaidy
  Cc: balbir, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

> > 
> > that's a case where it really makes sense; it's the case where the
> > thing that controls the cpu P-state actually learns about how much
> > work was done to reevaluate what the cpu frequency should be going
> > forward. Eg it's a case of comparing actual frequency (APERF/MPERF)
> > to see what's useful to set next.
> > IDA makes this all needed due to the dynamic nature of the concept
> > of "frequency".
> 
> Scaled statistics relative to maximum CPU capacity is just a method of
> exposing the actual CPU utilisation of applications independent of CPU
> frequency changes.
> 
> Reason behind the metric is same as the above fact that you have
> mentioned.  The CPU frequency governors cannot make decisions only
> based on idle time ratio.  It needs to know current utilisation (used
> cycles) relative to maximum capacity so that the frequency can be
> changed to next higher level.
> 
> Higher level management software that wants to control CPU capacity
> externally will need similar information.
>
I entirely understand that desire.
But you're not giving it that information!
The patch is giving it a really poor approximation, an approximation
that will get worse and worse in upcoming cpu generations.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting
  2008-05-26 18:11   ` Balbir Singh
@ 2008-05-27 14:54     ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 14:54 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-26 23:41:02]:

> Vaidyanathan Srinivasan wrote:
> > General framework for low level APERF/MPERF access.
> > * Avoid resetting APERF/MPERF in acpi-cpufreq.c
> > * Implement functions that will calculate the scaled stats
> > * acpi_get_pm_msrs_delta() will give delta values after reading the current values
> > * Change get_measured_perf() to use acpi_get_pm_msrs_delta()
> > * scaled_stats_init() detect availability of APERF/MPERF using cpuid
> > * reset_for_scaled_stats() called when a process occupies the CPU
> > * account_scaled_stats() is called when the process leaves the CPU
> > 
> > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > ---
> > 
> >  arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   21 +++
> >  arch/x86/kernel/time_32.c                  |  171 ++++++++++++++++++++++++++++
> >  2 files changed, 188 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> > index b0c8208..761beec 100644
> > --- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> > +++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
> > @@ -59,6 +59,13 @@ enum {
> >  #define INTEL_MSR_RANGE		(0xffff)
> >  #define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
> > 
> > +/* Buffer to store old snapshot values */
> > +DEFINE_PER_CPU(u64, cpufreq_old_aperf);
> > +DEFINE_PER_CPU(u64, cpufreq_old_mperf);
> > +
> > +extern void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta,
> > +				   u64 *aperf_old, u64 *mperf_old, int reset);
> > +
> >  struct acpi_cpufreq_data {
> >  	struct acpi_processor_performance *acpi_data;
> >  	struct cpufreq_frequency_table *freq_table;
> > @@ -265,6 +272,7 @@ static unsigned int get_measured_perf(unsigned int cpu)
> >  		} split;
> >  		u64 whole;
> >  	} aperf_cur, mperf_cur;
> > +	u64 *aperf_old, *mperf_old;
> > 
> >  	cpumask_t saved_mask;
> >  	unsigned int perf_percent;
> > @@ -278,11 +286,16 @@ static unsigned int get_measured_perf(unsigned int cpu)
> >  		return 0;
> >  	}
> > 
> > -	rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi);
> > -	rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi);
> > +	/*
> > +	 * Get the old APERF/MPERF values for this cpu and pass it to
> > +	 * acpi_get_pm_msrs_delta() which will read the current values
> > +	 * and return the delta.
> > +	 */
> > +	aperf_old = &(per_cpu(cpufreq_old_aperf, smp_processor_id()));
> > +	mperf_old = &(per_cpu(cpufreq_old_mperf, smp_processor_id()));
> > 
> > -	wrmsr(MSR_IA32_APERF, 0,0);
> > -	wrmsr(MSR_IA32_MPERF, 0,0);
> > +	acpi_get_pm_msrs_delta(&aperf_cur.whole, &mperf_cur.whole, 
> > +				aperf_old, mperf_old, 1);
> > 
> >  #ifdef __i386__
> >  	/*
> > diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c
> > index 2ff21f3..5131e01 100644
> > --- a/arch/x86/kernel/time_32.c
> > +++ b/arch/x86/kernel/time_32.c
> > @@ -32,10 +32,13 @@
> >  #include <linux/interrupt.h>
> >  #include <linux/time.h>
> >  #include <linux/mca.h>
> > +#include <linux/kernel_stat.h>
> > 
> >  #include <asm/arch_hooks.h>
> >  #include <asm/hpet.h>
> >  #include <asm/time.h>
> > +#include <asm/processor.h>
> > +#include <asm/cputime.h>
> > 
> >  #include "do_timer.h"
> > 
> > @@ -136,3 +139,171 @@ void __init time_init(void)
> >  	tsc_init();
> >  	late_time_init = choose_time_init();
> >  }
> > +
> > +/*
> > + * This function should be used to get the APERF/MPERF MSRS delta from the cpu.
> > + * We let the individual users of this function store the old values of APERF
> > + * and MPERF registers in per cpu variables. They pass these old values as 3rd
> > + * and 4th arguments. 'reset' tells if the old values should be reset or not.
> > + * Mostly, users of this function will like to reset the old values.
> > + */
> > +void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, u64 *aperf_old,
> > +				u64 *mperf_old, int reset)
> > +{
> > +	union {
> > +		struct {
> > +			u32 lo;
> > +			u32 hi;
> > +		} split;
> > +		u64 whole;
> > +	} aperf_cur, mperf_cur;
> > +	unsigned long flags;
> > +
> > +	/* Read current values of APERF and MPERF MSRs*/
> > +	local_irq_save(flags);
> 
> Why do we need to do this? We've already disabled pre-emption in the caller?

Hi Balbir,

Thanks for the detailed review.

I will check and correct this in the next iteration.


> 
> > +	rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi);
> > +	rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi);
> > +	local_irq_restore(flags);
> > +
> > +	/*
> > +	 * If the new values are less than the previous values, there has
> > +	 * been an overflow and both APERF and MPERF have been reset to
> > +	 * zero. In this case consider absolute value as diff/delta.
> > +	 * Note that we do not check for 'reset' here, since resetting here
> > +	 * is no more optional and has to be done for values to make sense.
> > +	 */
> > +	if (unlikely((mperf_cur.whole <= *mperf_old) ||
> > +			(aperf_cur.whole <= *aperf_old)))
> > +	{
> > +		*aperf_old = 0;
> > +		*mperf_old = 0;
> > +	}
> > +
> > +	/* Calculate the delta from the current and per cpu old values */
> > +	*mperf_delta = mperf_cur.whole - *mperf_old;
> > +	*aperf_delta = aperf_cur.whole - *aperf_old;
> > +
> > +	/* Set the per cpu variables to current readings */
> > +	if (reset) {
> > +		*mperf_old = mperf_cur.whole;
> > +		*aperf_old = aperf_cur.whole;
> > +	}
> > +}
> > +EXPORT_SYMBOL(acpi_get_pm_msrs_delta);
> > +
> > +
> > +DEFINE_PER_CPU(u64, cputime_old_aperf);
> > +DEFINE_PER_CPU(u64, cputime_old_mperf);
> > +
> > +DEFINE_PER_CPU(cputime_t, task_utime_old);
> > +DEFINE_PER_CPU(cputime_t, task_stime_old);
> > +
> 
> I fail to understand the per-cpu variables task_utime_old and task_stime_old.
> What do they represent? Why are they global?

These are needed to store the previous value of utime and stime so
that we can compute the difference and scale it when the process
leaves the CPU.

reset_for_scaled_stats() will store the value of utime and stime when
the process occupies the CPU.  account_scaled_stats() will get the
difference when the process leaves the CPU and then scale the time.

I will get rid of this once I get centralised scaling ratio
computation from CPUfreq driver infrastructure.
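The switch-in/switch-out sampling can be modelled in user space like this (hypothetical names; the kernel patch does the equivalent with per-cpu variables and real MSR reads):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the per-timeslice accounting: snapshot
 * APERF/MPERF when the task gets the CPU (reset_for_scaled_stats),
 * take the deltas when it leaves (account_scaled_stats), and scale
 * the elapsed run time by the frequency ratio seen in that window.
 */
static uint64_t scaled_runtime(uint64_t aperf_in, uint64_t aperf_out,
			       uint64_t mperf_in, uint64_t mperf_out,
			       uint64_t runtime)
{
	uint64_t aperf_delta = aperf_out - aperf_in;
	uint64_t mperf_delta = mperf_out - mperf_in;

	return runtime * aperf_delta / mperf_delta;
}
```

A task that ran for 10 ticks while the CPU retired cycles at half the reference rate is thus credited with 5 scaled ticks.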

> 
> > +
> > +#define CPUID_6_ECX_APERFMPERF_CAPABILITY	(0x1)
> > +
> > +static int cpu_supports_freq_scaling;
> > +
> > +/* Initialize scaled stat functions */
> > +void scaled_stats_init(void)
> > +{
> > +	struct cpuinfo_x86 *c = &cpu_data(0);
> > +
> > +	/* Check for APERF/MPERF support in hardware */
> > +	if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) {
> > +		unsigned int ecx;
> > +		ecx = cpuid_ecx(6);
> > +		if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY)
> > +			cpu_supports_freq_scaling = 1;
> > +		else
> > +			cpu_supports_freq_scaling = -1;
> > +	}
> > +}
> > +	
> > +/*
> > + * Reset the old utime and stime (percpu) value to the new task
> > + * that we are going to switch to.
> > + */
> > +void reset_for_scaled_stats(struct task_struct *tsk)
> > +{
> > +	u64 aperf_delta, mperf_delta;
> > +	u64 *aperf_old, *mperf_old;
> > +
> > +	if (cpu_supports_freq_scaling < 0)
> > +		return;
> > +
> > +	if(!cpu_supports_freq_scaling) {
> > +		scaled_stats_init();
> > +		if(cpu_supports_freq_scaling < 0)
> > +			return;
> > +	}
> > +
> > +	aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id()));
> > +	mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id()));
> > +	acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old,
> > +				mperf_old, 1);
> > +
> > +	per_cpu(task_utime_old, smp_processor_id()) = tsk->utime;
> > +	per_cpu(task_stime_old, smp_processor_id()) = tsk->stime;
> > +}
> > +
> 
> I hope this routine is called with preemption disabled

Yes, this is called from __switch_to().  I will work towards removing
such costly code from the critical path.

> > +
> > +/* Account scaled statistics for a task on context switch */
> > +void account_scaled_stats(struct task_struct *tsk)
> > +{
> > +	u64 aperf_delta, mperf_delta;
> > +	u64 *aperf_old, *mperf_old;
> > +	cputime_t time;
> > +	u64 time_msec;
> > +	int readmsrs = 1;
> > +
> > +	if(!cpu_supports_freq_scaling)
> > +		scaled_stats_init();
> > +
> > +	/*
> > +	 * Get the old APERF/MPERF values for this cpu and pass it to
> > +	 * acpi_get_pm_msrs_delta() which will read the current values
> > +	 * and return the delta.
> > +	 */
> > +	aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id()));
> > +	mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id()));
> > +
> > +	if (cputime_gt(tsk->utime, per_cpu(task_utime_old,
> > +					   smp_processor_id()))) {
> > +		acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old,
> > +					mperf_old, 1);
> > +		readmsrs = 0;
> > +		time = cputime_sub(tsk->utime, per_cpu(task_utime_old,
> > +							smp_processor_id()));
> > +		time_msec = cputime_to_msecs(time);
> > +		time_msec *= 1000; /* Scale it to hold fractional values */
> 
> What is 1000? The code is not clear

The constant 1000 is the scaling factor I have used: 1 jiffy = 1000
units, so that I can store fractional jiffy values like 0.66 jiffies.
This is a hack, as documented in the intro.

Basically we need a unit more granular than jiffies in order to scale
the time.  A u64 holding microseconds or nanoseconds would be a good
idea, but that would mean changing the existing type of cputime_t.

I am hopeful that we will move away from jiffies-granularity
accounting on x86 and switch to a more accurate metric, given that we
already use high-resolution timers and a fine-grained clock source in
CFS.

If cputime_t becomes more granular than 1 jiffy, then we can get rid
of this hack.
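The multiply-by-1000 hack can be sketched as follows (a user-space illustration; SCALED_STATS_FACTOR is a name invented here, the patch uses the bare constant):

```c
#include <assert.h>
#include <stdint.h>

/* 1 jiffy is stored as 1000 units so that fractions like 0.666 jiffies
 * survive integer arithmetic; consumers must divide by 1000 again. */
#define SCALED_STATS_FACTOR 1000ULL

static uint64_t scale_cputime(uint64_t jiffies_delta,
			      uint64_t aperf_delta, uint64_t mperf_delta)
{
	uint64_t t = jiffies_delta * SCALED_STATS_FACTOR;

	/* weight by the effective frequency ratio over the interval */
	return t * aperf_delta / mperf_delta;	/* 1/1000ths of a jiffy */
}
```

E.g. 2 jiffies accumulated at one third of the reference frequency become 666 units, i.e. 0.666 jiffies, which a plain jiffy-granular cputime_t could not represent.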

--Vaidy

> > +		if (cpu_supports_freq_scaling == 1) {
> > +			time_msec *= aperf_delta;
> > +			time_msec = div64_u64(time_msec, mperf_delta);
> > +		}
> > +		time = msecs_to_cputime(time_msec);
> > +		account_user_time_scaled(tsk, time);
> > +		per_cpu(task_utime_old, smp_processor_id()) = tsk->utime;
> > +	}
> > +
> > +	if (cputime_gt(tsk->stime, per_cpu(task_stime_old,
> > +					   smp_processor_id()))) {
> > +		if (readmsrs)
> > +			acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta,
> > +						aperf_old, mperf_old, 1);
> > +		time = cputime_sub(tsk->stime, per_cpu(task_stime_old,
> > +							smp_processor_id()));
> > +
> > +		time_msec = cputime_to_msecs(time);
> > +		time_msec *= 1000; /* Scale it to hold fractional values */
> > +		if (cpu_supports_freq_scaling == 1) {
> > +			time_msec *= aperf_delta;
> > +			time_msec = div64_u64(time_msec, mperf_delta);
> > +		}
> > +		time = msecs_to_cputime(time_msec);
> > +		account_system_time_scaled(tsk, time);
> > +		per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; 
> > +	}
> > +
> > +}	
> > +EXPORT_SYMBOL(account_scaled_stats);
> > +
> > 
> 
> 
> -- 
> 	Warm Regards,
> 	Balbir Singh
> 	Linux Technology Center
> 	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats
  2008-05-26 18:18   ` Balbir Singh
@ 2008-05-27 15:02     ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 15:02 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-26 23:48:04]:

> Vaidyanathan Srinivasan wrote:
> > Hook various accounting functions to call scaled stats
> > 
> > * Hook process context switch: __switch_to()
> > * Hook IRQ handling account_system_vtime() in hardirq.h
> > * Update __delayacct_add_tsk() to take care of scaling by 1000
> > * Update bacct_add_tsk() to take care of scaling by 1000
> > 
> > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > ---
> > 
> >  arch/x86/kernel/process_32.c |    8 ++++++++
> >  include/linux/hardirq.h      |    4 ++++
> >  kernel/delayacct.c           |    7 ++++++-
> >  kernel/timer.c               |    2 --
> >  kernel/tsacct.c              |   10 ++++++++--
> >  5 files changed, 26 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > index f8476df..c81a783 100644
> > --- a/arch/x86/kernel/process_32.c
> > +++ b/arch/x86/kernel/process_32.c
> > @@ -56,6 +56,9 @@
> >  #include <asm/cpu.h>
> >  #include <asm/kdebug.h>
> > 
> > +extern void account_scaled_stats(struct task_struct *tsk);
> > +extern void reset_for_scaled_stats(struct task_struct *tsk);
> > +
> >  asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
> > 
> >  static int hlt_counter;
> > @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct
> >  		loadsegment(gs, next->gs);
> > 
> >  	x86_write_percpu(current_task, next_p);
> > +	/* Account scaled statistics for the task leaving CPU */
> > +	account_scaled_stats(prev_p);
> > +	barrier();
> > +	/* Initialise stats counter for new task */
> > +	reset_for_scaled_stats(next_p);
> > 
> 
> This is a bad place to hook into. Can't we do scaled accounting the way we do
> it for powerpc (SPURR)? I would rather hook into account_*time

Agreed, and I have also documented it in the intro.  I will try to
remove them in the next iteration.

> 
> >  	return prev_p;
> >  }
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index 181006c..4458736 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -7,6 +7,9 @@
> >  #include <asm/hardirq.h>
> >  #include <asm/system.h>
> > 
> > +/* TBD: Add config option */
> > +extern void account_scaled_stats(struct task_struct *tsk);
> > +
> >  /*
> >   * We put the hardirq and softirq counter into the preemption
> >   * counter. The bitmask has the following meaning:
> > @@ -115,6 +118,7 @@ struct task_struct;
> >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING
> >  static inline void account_system_vtime(struct task_struct *tsk)
> >  {
> > +	account_scaled_stats(tsk);
> >  }
> >  #endif
> > 
> > diff --git a/kernel/delayacct.c b/kernel/delayacct.c
> > index 10e43fd..3e2938f 100644
> > --- a/kernel/delayacct.c
> > +++ b/kernel/delayacct.c
> > @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
> > 
> >  	tmp = (s64)d->cpu_scaled_run_real_total;
> >  	cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts);
> > -	tmp += timespec_to_ns(&ts);
> > +	/* HACK: Remember, we multiplied the cputime_t by 1000 to include
> > +	 * fraction.  Now it is time to scale it back to correct 'ns' value.
> > +	 * Perhaps, we should use nano second unit (u64 type) for utimescaled 
> > +	 * and stimescaled?
> > +	 */
> > +	tmp += div_s64(timespec_to_ns(&ts),1000);
> 
> 1000 is a bit magical, please use a meaningful #define


Sure, I can use a #define; in any case, this needs to be solved.
This is documented in the intro and explained in another email.
 
> >  	d->cpu_scaled_run_real_total =
> >  		(tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp;
> > 
> > diff --git a/kernel/timer.c b/kernel/timer.c
> > index ceacc66..de8a615 100644
> > --- a/kernel/timer.c
> > +++ b/kernel/timer.c
> > @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int user_tick)
> > 
> >  	if (user_tick) {
> >  		account_user_time(p, one_jiffy);
> > -		account_user_time_scaled(p, cputime_to_scaled(one_jiffy));
> >  	} else {
> >  		account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
> > -		account_system_time_scaled(p, cputime_to_scaled(one_jiffy));
> >  	}
> 
> Why couldn't we leverage these functions here?

We don't know the scaling ratio at this time.  The ratio is being
calculated at every context switch by sampling APERF/MPERF when the
process occupies the CPU and when it leaves the CPU.

We should use these functions once I get the centralised scaling
ratio computation done from the cpufreq driver.  Out-of-band
frequency variation needs to be solved for the other method.  The
current implementation will do correct accounting even if the
frequency variation is out-of-band.

--Vaidy

> >  }
> >  #endif
> > diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> > index 4ab1b58..ee0d93b 100644
> > --- a/kernel/tsacct.c
> > +++ b/kernel/tsacct.c
> > @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
> >  	rcu_read_unlock();
> >  	stats->ac_utime	 = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
> >  	stats->ac_stime	 = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
> > +	/* HACK: cputime unit is jiffies on x86 and not good for fractional
> > +	 * addition.  cputime_t type {u,s}timescaled is multiplied by
> > +	 * 1000 for scaled accounting.  Hence, cputime_to_msecs will actually 
> > +	 * give the required micro second value.  The multiplier 
> > +	 * USEC_PER_MSEC has been dropped.
> > +	 */
> >  	stats->ac_utimescaled =
> > -		cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC;
> > +		cputime_to_msecs(tsk->utimescaled);
> >  	stats->ac_stimescaled =
> > -		cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC;
> > +		cputime_to_msecs(tsk->stimescaled);
> >  	stats->ac_minflt = tsk->min_flt;
> >  	stats->ac_majflt = tsk->maj_flt;
> > 
> > 
> 
> 
> -- 
> 	Warm Regards,
> 	Balbir Singh
> 	Linux Technology Center
> 	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 14:19             ` Arjan van de Ven
@ 2008-05-27 15:20               ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 15:20 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: balbir, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-27 07:19:00]:

> > > 
> > > that's a case where it really makes sense; it's the case where the
> > > thing that controls the cpu P-state actually learns about how much
> > > work was done to reevaluate what the cpu frequency should be going
> > > forward. Eg it's a case of comparing actual frequency (APERF/MPERF)
> > > to see what's useful to set next.
> > > IDA makes this all needed due to the dynamic nature of the concept
> > > of "frequency".
> > 
> > Scaled statistics relative to maximum CPU capacity is just a method of
> > exposing the actual CPU utilisation of applications independent of CPU
> > frequency changes.
> > 
> > Reason behind the metric is same as the above fact that you have
> > mentioned.  The CPU frequency governors cannot make decisions only
> > based on idle time ratio.  It needs to know current utilisation (used
> > cycles) relative to maximum capacity so that the frequency can be
> > changed to next higher level.
> > 
> > Higher level management software that wants to control CPU capacity
> > externally will need similar information.
> >
> I entirely understand that desire.

Good :)

> But you're not giving it that information!
> The patch is giving it a really poor approximation, an approximation
> that will get worse and worse in upcoming cpu generations.

I agree that power capping and acceleration make the metric
approximate.  But we are trying to be as accurate and meaningful as
the APERF/MPERF ratio is in the processor hardware.

Can I state the problem like this: the metric is as accurate and
meaningful as the APERF/MPERF ratio, but the interpretation of the
metric is subject to knowledge of the power constraint or
acceleration currently in effect.

--Vaidy


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 14:15               ` Arjan van de Ven
@ 2008-05-27 15:27                 ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 15:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Balbir Singh, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-27 07:15:32]:

> On Tue, 27 May 2008 18:49:26 +0530
> > At any time we will have both the traditional utilisation value
> > relative to current CPU capacity, and scaled utilisation that is
> > relative to maximum CPU capacity.
> > 
> 
> this is where I raise a red flag *because the patch is not doing
> that* !!

Your concern is valid.  I assume you are objecting to the usefulness
and interpretation of the scaled metric, and not to the fact that both
the scaled and the default non-scaled metrics are independently
available.
 
> sadly it gives a random metric which in some circumstances looks like
> it does that, but in reality it is not doing that.

I agree, and I am interested in designing a metric that is useful in
most circumstances.  I am open to suggestions to cover the
power-constraint and acceleration scenarios.

Thanks,
Vaidy


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 14:04   ` Vaidyanathan Srinivasan
@ 2008-05-27 16:40     ` Arjan van de Ven
  2008-05-27 18:26       ` Vaidyanathan Srinivasan
  2008-05-31 21:17     ` Pavel Machek
  1 sibling, 1 reply; 31+ messages in thread
From: Arjan van de Ven @ 2008-05-27 16:40 UTC (permalink / raw)
  To: svaidy
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

On Tue, 27 May 2008 19:34:40 +0530
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> 
> What we are proposing is a scaled time value that is scaled to the
> current CPU capacity.  If the scaled utilisation is 50% when the CPU
> is at 100% capacity, it is expected to remain at 50% even if the CPU's
> capacity is dropped to 50%, while the traditional utilisation value
> will be 100%.

When you use the word "capacity" I cringe ;(

> 
> The problem in the above two cases is that we had assumed that the
> maximum CPU capacity is 100% at normal capacity (without IDA).
> 
> If the CPU is at half the maximum frequency, then scaled stats should
> show 50%.  

see frequency != capacity.
It's about more than frequency. It's about how much cache you have
available too. If  you run single threaded on a dual core cpu, you have
100% of the cache, but the cpu is 50% idle. But that doesn't mean that
when you double the load, you actually get 2x the performance. So
you're not at 50% of capacity!



> The point I am trying to make is whether scaling should be done
> relative to CPUs designed maximum capacity or maximum capacity under
> current constraints is to be discussed.

now you're back at "capacity".. we were at frequency before ;(


> 
> Case A:
> ------
> 
> Scaled stats is stats relative to maximum designed capacity including
> IDA

you don't know what that is though.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 16:40     ` Arjan van de Ven
@ 2008-05-27 18:26       ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-27 18:26 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
	Michael Neuling, Balbir Singh, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-27 09:40:35]:

> On Tue, 27 May 2008 19:34:40 +0530
> Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:
> > 
> > What we are proposing is a scaled time value that is scaled to the
> > current CPU capacity.  If the scaled utilisation is 50% when the CPU
> > is at 100% capacity, it is expected to remain at 50% even if the CPU's
> > capacity is dropped to 50%, while the traditional utilisation value
> > will be 100%.
> 
> When you use the word "capacity" I cringe ;(
> 
> > 
> > The problem in the above two cases is that we had assumed that the
> > maximum CPU capacity is 100% at normal capacity (without IDA).
> > 
> > If the CPU is at half the maximum frequency, then scaled stats should
> > show 50%.  
> 
> see frequency != capacity.
> It's about more than frequency. It's about how much cache you have
> available too. If  you run single threaded on a dual core cpu, you have
> 100% of the cache, but the cpu is 50% idle. But that doesn't mean that
> when you double the load, you actually get 2x the performance. So
> you're not at 50% of capacity!
 
You are right.... I used frequency and capacity somewhat
interchangeably, assuming capacity is linearly proportional to
frequency!

In reality, I agree that capacity is not linearly proportional to
frequency; it also depends on cache usage and the like.

These differences apart, I hope I have conveyed the relationship
between scaled stats and CPU frequency.  I used the term 'capacity' to
generalise CPU performance, but I guess it may lead to a different
discussion, so I will stick to frequency. :)

--Vaidy
 
> > The point I am trying to make is whether scaling should be done
> > relative to CPUs designed maximum capacity or maximum capacity under
> > current constraints is to be discussed.
> 
> now you're back at "capacity".. we were at frequency before ;(
> 
> 
> > 
> > Case A:
> > ------
> > 
> > Scaled stats is stats relative to maximum designed capacity including
> > IDA
> 
> you don't know what that is though.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats
  2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan
  2008-05-26 18:18   ` Balbir Singh
@ 2008-05-29 15:18   ` Michael Neuling
  2008-05-29 18:23     ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 31+ messages in thread
From: Michael Neuling @ 2008-05-29 15:18 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Balbir Singh,
	Amit K. Arora

In message <20080526143146.24680.36724.stgit@drishya.in.ibm.com> you wrote:
> Hook various accounting functions to call scaled stats
> 
> * Hook porcess contect switch: __switch_to()
> * Hook IRQ handling account_system_vtime() in hardirq.hA
> * Update __delayacct_add_tsk() to take care of scaling by 1000
> * Update bacct_add_tsk() to take care of scaling by 1000
> 
> Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> ---
> 
>  arch/x86/kernel/process_32.c |    8 ++++++++
>  include/linux/hardirq.h      |    4 ++++
>  kernel/delayacct.c           |    7 ++++++-
>  kernel/timer.c               |    2 --
>  kernel/tsacct.c              |   10 ++++++++--
>  5 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index f8476df..c81a783 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -56,6 +56,9 @@
>  #include <asm/cpu.h>
>  #include <asm/kdebug.h>
>  
> +extern void account_scaled_stats(struct task_struct *tsk);
> +extern void reset_for_scaled_stats(struct task_struct *tsk);
> +
>  asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
>  
>  static int hlt_counter;
> @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *pre
v_p, struct task_struct
>  		loadsegment(gs, next->gs);
>  
>  	x86_write_percpu(current_task, next_p);
> +	/* Account scaled statistics for the task leaving CPU */
> +	account_scaled_stats(prev_p);
> +	barrier();
> +	/* Initialise stats counter for new task */
> +	reset_for_scaled_stats(next_p);
>  
>  	return prev_p;
>  }
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index 181006c..4458736 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -7,6 +7,9 @@
>  #include <asm/hardirq.h>
>  #include <asm/system.h>
>  
> +/* TBD: Add config option */
> +extern void account_scaled_stats(struct task_struct *tsk);
> +
>  /*
>   * We put the hardirq and softirq counter into the preemption
>   * counter. The bitmask has the following meaning:
> @@ -115,6 +118,7 @@ struct task_struct;
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING
>  static inline void account_system_vtime(struct task_struct *tsk)
>  {
> +	account_scaled_stats(tsk);
>  }
>  #endif
>  
> diff --git a/kernel/delayacct.c b/kernel/delayacct.c
> index 10e43fd..3e2938f 100644
> --- a/kernel/delayacct.c
> +++ b/kernel/delayacct.c
> @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task
_struct *tsk)
>  
>  	tmp = (s64)d->cpu_scaled_run_real_total;
>  	cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts);
> -	tmp += timespec_to_ns(&ts);
> +	/* HACK: Remember, we multipled the cputime_t by 1000 to include
> +	 * fraction.  Now it is time to scale it back to correct 'ns' value.
> +	 * Perhaps, we should use nano second unit (u64 type) for utimescaled 
> +	 * and stimescaled?
> +	 */
> +	tmp += div_s64(timespec_to_ns(&ts),1000);

This is going to break other archs (specifically powerpc), which
don't do this magical scale by 1000.

How often is this function called?  The divide is going to slow
things down.

>  	d->cpu_scaled_run_real_total =
>  		(tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp;
>  
> diff --git a/kernel/timer.c b/kernel/timer.c
> index ceacc66..de8a615 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int use
r_tick)
>  
>  	if (user_tick) {
>  		account_user_time(p, one_jiffy);
> -		account_user_time_scaled(p, cputime_to_scaled(one_jiffy));
>  	} else {
>  		account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
> -		account_system_time_scaled(p, cputime_to_scaled(one_jiffy));
>  	}
>  }

Why did you remove this?

>  #endif
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index 4ab1b58..ee0d93b 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_s
truct *tsk)
>  	rcu_read_unlock();
>  	stats->ac_utime	 = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
>  	stats->ac_stime	 = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
> +	/* HACK: cputime unit is jiffies on x86 and not good for fractional 
> +	 * additional.  cputime_t type {u,s}timescaled is multiplied by 
> +	 * 1000 for scaled accounting.  Hence, cputime_to_msecs will actually 
> +	 * give the required micro second value.  The multiplier 
> +	 * USEC_PER_MSEC has been dropped.
> +	 */
>  	stats->ac_utimescaled =
> -		cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC;
> +		cputime_to_msecs(tsk->utimescaled);
>  	stats->ac_stimescaled =
> -		cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC;
> +		cputime_to_msecs(tsk->stimescaled);

Again, isn't this going to affect other archs?

Mikey

>  	stats->ac_minflt = tsk->min_flt;
>  	stats->ac_majflt = tsk->maj_flt;
>  
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats
  2008-05-29 15:18   ` Michael Neuling
@ 2008-05-29 18:23     ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-29 18:23 UTC (permalink / raw)
  To: Michael Neuling
  Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Balbir Singh,
	Amit K. Arora

* Michael Neuling <mikey@neuling.org> [2008-05-29 10:18:56]:

> In message <20080526143146.24680.36724.stgit@drishya.in.ibm.com> you wrote:
> > Hook various accounting functions to call scaled stats
> > 
> > * Hook porcess contect switch: __switch_to()
> > * Hook IRQ handling account_system_vtime() in hardirq.hA
> > * Update __delayacct_add_tsk() to take care of scaling by 1000
> > * Update bacct_add_tsk() to take care of scaling by 1000
> > 
> > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com>
> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> > ---
> > 
> >  arch/x86/kernel/process_32.c |    8 ++++++++
> >  include/linux/hardirq.h      |    4 ++++
> >  kernel/delayacct.c           |    7 ++++++-
> >  kernel/timer.c               |    2 --
> >  kernel/tsacct.c              |   10 ++++++++--
> >  5 files changed, 26 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > index f8476df..c81a783 100644
> > --- a/arch/x86/kernel/process_32.c
> > +++ b/arch/x86/kernel/process_32.c
> > @@ -56,6 +56,9 @@
> >  #include <asm/cpu.h>
> >  #include <asm/kdebug.h>
> >  
> > +extern void account_scaled_stats(struct task_struct *tsk);
> > +extern void reset_for_scaled_stats(struct task_struct *tsk);
> > +
> >  asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
> >  
> >  static int hlt_counter;
> > @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *pre
> v_p, struct task_struct
> >  		loadsegment(gs, next->gs);
> >  
> >  	x86_write_percpu(current_task, next_p);
> > +	/* Account scaled statistics for the task leaving CPU */
> > +	account_scaled_stats(prev_p);
> > +	barrier();
> > +	/* Initialise stats counter for new task */
> > +	reset_for_scaled_stats(next_p);
> >  
> >  	return prev_p;
> >  }
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index 181006c..4458736 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -7,6 +7,9 @@
> >  #include <asm/hardirq.h>
> >  #include <asm/system.h>
> >  
> > +/* TBD: Add config option */
> > +extern void account_scaled_stats(struct task_struct *tsk);
> > +
> >  /*
> >   * We put the hardirq and softirq counter into the preemption
> >   * counter. The bitmask has the following meaning:
> > @@ -115,6 +118,7 @@ struct task_struct;
> >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING
> >  static inline void account_system_vtime(struct task_struct *tsk)
> >  {
> > +	account_scaled_stats(tsk);
> >  }
> >  #endif
> >  
> > diff --git a/kernel/delayacct.c b/kernel/delayacct.c
> > index 10e43fd..3e2938f 100644
> > --- a/kernel/delayacct.c
> > +++ b/kernel/delayacct.c
> > @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task
> _struct *tsk)
> >  
> >  	tmp = (s64)d->cpu_scaled_run_real_total;
> >  	cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts);
> > -	tmp += timespec_to_ns(&ts);
> > +	/* HACK: Remember, we multipled the cputime_t by 1000 to include
> > +	 * fraction.  Now it is time to scale it back to correct 'ns' value.
> > +	 * Perhaps, we should use nano second unit (u64 type) for utimescaled 
> > +	 * and stimescaled?
> > +	 */
> > +	tmp += div_s64(timespec_to_ns(&ts),1000);
> 
> This is going to break other archs (specifically powerpc) which doesn't
> do this magical scale by 1000.
> 
> How often is this function called as the divide is going to slow things
> down? 

Hi Mikey,

Thanks for the review comments.  This scaling is a hack to store
fractions in the cputime_t data type, as mentioned in the intro and in
other replies.  It should certainly go away once I find a clean method
to store fractional jiffies values on x86.

> 
> >  	d->cpu_scaled_run_real_total =
> >  		(tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp;
> >  
> > diff --git a/kernel/timer.c b/kernel/timer.c
> > index ceacc66..de8a615 100644
> > --- a/kernel/timer.c
> > +++ b/kernel/timer.c
> > @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int use
> r_tick)
> >  
> >  	if (user_tick) {
> >  		account_user_time(p, one_jiffy);
> > -		account_user_time_scaled(p, cputime_to_scaled(one_jiffy));
> >  	} else {
> >  		account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
> > -		account_system_time_scaled(p, cputime_to_scaled(one_jiffy));
> >  	}
> >  }
> 
> Why did you remove this?

In this preliminary RFC implementation, meant to demonstrate the idea,
I have not hooked into these routines since I do not know the scaling
factor at this point.  In the next version I plan to have the cpufreq
driver maintain the scaling ratio and simply use it in the accounting
subsystem.  Once the cpufreq subsystem maintains the scaling ratio,
I can use these functions.

> >  #endif
> > diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> > index 4ab1b58..ee0d93b 100644
> > --- a/kernel/tsacct.c
> > +++ b/kernel/tsacct.c
> > @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_s
> truct *tsk)
> >  	rcu_read_unlock();
> >  	stats->ac_utime	 = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC;
> >  	stats->ac_stime	 = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC;
> > +	/* HACK: cputime unit is jiffies on x86 and not good for fractional 
> > +	 * additional.  cputime_t type {u,s}timescaled is multiplied by 
> > +	 * 1000 for scaled accounting.  Hence, cputime_to_msecs will actually 
> > +	 * give the required micro second value.  The multiplier 
> > +	 * USEC_PER_MSEC has been dropped.
> > +	 */
> >  	stats->ac_utimescaled =
> > -		cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC;
> > +		cputime_to_msecs(tsk->utimescaled);
> >  	stats->ac_stimescaled =
> > -		cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC;
> > +		cputime_to_msecs(tsk->stimescaled);
> 
> Again, isn't this going to effect other archs?  

Yes, this is part of the scaling-factor hack.  I will get rid of it
once we can store fractional cputime_t values.

--Vaidy


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
  2008-05-26 17:24   ` Balbir Singh
  2008-05-27 14:04   ` Vaidyanathan Srinivasan
@ 2008-05-31 21:13   ` Pavel Machek
  2008-06-02  6:08     ` Balbir Singh
  2 siblings, 1 reply; 31+ messages in thread
From: Pavel Machek @ 2008-05-31 21:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora

Hi!

> > entitlement for the process as per the current CPU frequency.  This
> > technique is used in powerpc architecture with the help of hardware
> > registers that accurately capture the entitlement.
> > 
> 
> there are some issues with this unfortunately, and these make it
> a very complex thing to do. 
> Just to mention a few:
> 1) What if the BIOS no longer allows us to go to the max frequency for
> a period (for example as a result of overheating); with the approach
> above, the admin would THINK he can go faster, but he cannot in reality,
> so there's misleading information (the system looks half busy, while in

Plus, 'time one-second-computation-job' returning anything else is
just wrong.  Even when it only happens when overheated... or only on
battery power.

If you want scaled utime, you need a new interface.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 14:04   ` Vaidyanathan Srinivasan
  2008-05-27 16:40     ` Arjan van de Ven
@ 2008-05-31 21:17     ` Pavel Machek
  1 sibling, 0 replies; 31+ messages in thread
From: Pavel Machek @ 2008-05-31 21:17 UTC (permalink / raw)
  To: Arjan van de Ven, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora

Hi!

> > > The following RFC patch tries to implement scaled CPU utilisation
> > > statistics using APERF and MPERF MSR registers in an x86 platform.
> > > 
> > > The CPU capacity is significantly changed when the CPU's frequency is
> > > reduced for the purpose of power savings.  The applications that run
> > > at such lower CPU frequencies are also accounted for real CPU time by
> > > default.  If the applications have been run at full CPU frequency,
> > > they would have finished the work faster and not get charged for
> > > excessive CPU time.
> > > 
> > > One of the solution to this problem it so scale the utime and stime
> > > entitlement for the process as per the current CPU frequency.  This
> > > technique is used in powerpc architecture with the help of hardware
> > > registers that accurately capture the entitlement.
> > > 
> > 
> > there are some issues with this unfortunately, and these make it
> > a very complex thing to do. 
> > Just to mention a few:
> > 1) What if the BIOS no longer allows us to go to the max frequency for
> > a period (for example as a result of overheating); with the approach
> > above, the admin would THINK he can go faster, but he cannot in reality,
> > so there's misleading information (the system looks half busy, while in
> > reality it's actually the opposite, it's overloaded). Management tools
> > will take the wrong decisions (such as moving MORE work to the box, not
> > less)
> > 2) On systems with Intel Dynamic Acceleration technology, you can get
> > over 100% of cycles this way. (For those who don't know what IDA is;
> > IDA is basically a case where if your Penryn based dual core laptop is
> > only using 1 core, the other core can go faster than 100% as long as
> > thermals etc allow it). How do you want to deal with this?
> 
> Hi Arjan,
> 
> Thanks you for the inputs.  The above issues are very valid and our
> solution should be able to react appropriately to the above situation.
> 
> What we are proposing is a scaled time value that is scaled to the
> current CPU capacity.  If the scaled utilisation is 50% when the CPU
> is at 100% capacity, it is expected to remain at 50% even if the CPU's
> capacity is dropped to 50%, while the traditional utilisation value
> will be 100%.

time one-second-busy-loop should return close to one second. That's
current behaviour. You don't like it, but it is useful.

If you change it, that's called 'regression'.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-27 12:59           ` Balbir Singh
  2008-05-27 13:19             ` Vaidyanathan Srinivasan
@ 2008-05-31 21:27             ` Pavel Machek
  2008-06-02 17:54               ` Vaidyanathan Srinivasan
  1 sibling, 1 reply; 31+ messages in thread
From: Pavel Machek @ 2008-05-31 21:27 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, Linux Kernel,
	venkatesh.pallipadi, suresh.b.siddha, Michael Neuling,
	Amit K. Arora

Hi!

> > without knowing anything else than this, then yes that would be a
> > logical conclusion: the most likely cause would be because your cpu is
> > memory bound. In fact, you could then scale down your cpu
> > frequency/voltage to be lower, and save some power without losing
> > performance. 
> > It's a weird workload though, its probably a time based thing where you
> > alternate between idle and fully memory bound loads.
> > 
> > (which is another case where your patches would then expose idle time
> > even though your cpu is fully utilized for the 50% of the time it's
> > running)
> 
> We expect the end user to see 50% as scaled utilization and 100% as normal
> utilization. We don't intend to remove tsk->utime and tsk->stime. Our patches
> intend to provide the data and not impose what control action should be taken.

Aha, ok, forget about my regression comments.

Still, what you want to do seems hard.  What if the cpu is running at
max frequency but memory is not?  What if the cpu and memory are
running at max frequency but the frontside bus is not?

You want some give_me_bogomips( cpu freq, mem freq, fsb freq ) function,
but that depends on workload, so it is impossible to do...

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-31 21:13   ` Pavel Machek
@ 2008-06-02  6:08     ` Balbir Singh
  0 siblings, 0 replies; 31+ messages in thread
From: Balbir Singh @ 2008-06-02  6:08 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Arjan van de Ven, Vaidyanathan Srinivasan, Linux Kernel,
	venkatesh.pallipadi, suresh.b.siddha, Michael Neuling,
	Amit K. Arora

Pavel Machek wrote:
> Hi!
> 
>>> entitlement for the process as per the current CPU frequency.  This
>>> technique is used in powerpc architecture with the help of hardware
>>> registers that accurately capture the entitlement.
>>>
>> there are some issues with this unfortunately, and these make it
>> a very complex thing to do. 
>> Just to mention a few:
>> 1) What if the BIOS no longer allows us to go to the max frequency for
>> a period (for example as a result of overheating); with the approach
>> above, the admin would THINK he can go faster, but he cannot in reality,
>> so there's misleading information (the system looks half busy, while in
> 
> Plus time one-second-computation-job returning anything else is just
> wrong. Even when it only happens when overheated... or only on battery
> power.
> 
> If you want scaled utime, you need new interface.

We do have a new interface: two new per-task fields, utimescaled and
stimescaled.  They already exist in task_struct.  Ditto for the delay
accounting pieces.  Did I misunderstand your need for a new interface?
We still keep utime and stime around.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-05-31 21:27             ` Pavel Machek
@ 2008-06-02 17:54               ` Vaidyanathan Srinivasan
  2008-06-03  2:20                 ` Arjan van de Ven
  0 siblings, 1 reply; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-06-02 17:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Balbir Singh, Arjan van de Ven, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

* Pavel Machek <pavel@ucw.cz> [2008-05-31 23:27:10]:

> Hi!
> 
> > > without knowing anything else than this, then yes that would be a
> > > logical conclusion: the most likely cause would be because your cpu is
> > > memory bound. In fact, you could then scale down your cpu
> > > frequency/voltage to be lower, and save some power without losing
> > > performance. 
> > > It's a weird workload though, its probably a time based thing where you
> > > alternate between idle and fully memory bound loads.
> > > 
> > > (which is another case where your patches would then expose idle time
> > > even though your cpu is fully utilized for the 50% of the time it's
> > > running)
> > 
> > We expect the end user to see 50% as scaled utilization and 100% as normal
> > utilization. We don't intend to remove tsk->utime and tsk->stime. Our patches
> > intend to provide the data and not impose what control action should be taken.
> 
> Aha, ok, forget about my regression comments.
> 
> Still, what you want to do seems hard. What if cpu is running at max
> frequency but memory is not? What if cpu and memory is running at max
> frequency but frontside bus is not?

uh... you made the scope of the problem bigger :)
If we take into consideration the various system parameters, like
memory and FSB frequency, that we can control to save power, then this
scaling factor is not accurate.

> You want some give_me_bogomips( cpu freq, mem freq, fsb freq ) function,
> but that depends on workload, so it is impossible to do...

Isn't this a good problem to solve :)

Let's start adding parameters in steps to make the scaled statistics
accurate.  We want to start with the CPU and see if we can provide
a reasonable solution.

Workload variation is outside the scope, since we are estimating how
much CPU was provided to the application or workload.  The fact that
the workload could not utilise it effectively is not an accounting
problem.  Accounting the exact CPU resource that an application used
is a performance-feedback problem, which can potentially be derived
from the accounting.

--Vaidy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
  2008-06-02 17:54               ` Vaidyanathan Srinivasan
@ 2008-06-03  2:20                 ` Arjan van de Ven
  0 siblings, 0 replies; 31+ messages in thread
From: Arjan van de Ven @ 2008-06-03  2:20 UTC (permalink / raw)
  To: svaidy
  Cc: Pavel Machek, Balbir Singh, Linux Kernel, venkatesh.pallipadi,
	suresh.b.siddha, Michael Neuling, Amit K. Arora

On Mon, 2 Jun 2008 23:24:00 +0530
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> 
> Lets start adding parameters in steps to make the scaled statistics
> accurate.  We want to start with cpu and see if we can provide
> a reasonable solution.

sadly your patches don't even achieve that ;-(
At least not in terms of CPU capacity etc.


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2008-06-03  2:21 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
2008-05-26 18:11   ` Balbir Singh
2008-05-27 14:54     ` Vaidyanathan Srinivasan
2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan
2008-05-26 18:18   ` Balbir Singh
2008-05-27 15:02     ` Vaidyanathan Srinivasan
2008-05-29 15:18   ` Michael Neuling
2008-05-29 18:23     ` Vaidyanathan Srinivasan
2008-05-26 14:31 ` [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays Vaidyanathan Srinivasan
2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
2008-05-26 17:24   ` Balbir Singh
2008-05-26 18:00     ` Arjan van de Ven
2008-05-26 18:36       ` Balbir Singh
2008-05-26 18:51         ` Arjan van de Ven
2008-05-27 12:59           ` Balbir Singh
2008-05-27 13:19             ` Vaidyanathan Srinivasan
2008-05-27 14:15               ` Arjan van de Ven
2008-05-27 15:27                 ` Vaidyanathan Srinivasan
2008-05-31 21:27             ` Pavel Machek
2008-06-02 17:54               ` Vaidyanathan Srinivasan
2008-06-03  2:20                 ` Arjan van de Ven
2008-05-27 13:29           ` Vaidyanathan Srinivasan
2008-05-27 14:19             ` Arjan van de Ven
2008-05-27 15:20               ` Vaidyanathan Srinivasan
2008-05-27 14:04   ` Vaidyanathan Srinivasan
2008-05-27 16:40     ` Arjan van de Ven
2008-05-27 18:26       ` Vaidyanathan Srinivasan
2008-05-31 21:17     ` Pavel Machek
2008-05-31 21:13   ` Pavel Machek
2008-06-02  6:08     ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox