* [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
@ 2008-05-26 14:31 Vaidyanathan Srinivasan
2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
` (3 more replies)
0 siblings, 4 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
Michael Neuling, Balbir Singh, Amit K. Arora
The following RFC patch series implements scaled CPU utilisation statistics
using the APERF and MPERF MSRs on x86 platforms.
CPU capacity changes significantly when the CPU's frequency is reduced to
save power. By default, applications that run at such lower CPU frequencies
are still accounted the real (wall-clock) CPU time. Had the applications run
at full CPU frequency, they would have finished the work faster and not been
charged for the excess CPU time.
One solution to this problem is to scale the utime and stime entitlement for
the process according to the current CPU frequency. This technique is already
used on the powerpc architecture, with the help of hardware registers that
accurately capture the entitlement.
On x86 hardware, APERF and MPERF are MSRs that provide feedback on the
current CPU frequency. Today these registers are used to detect the current
frequency of each core in a multi-core x86 processor, where the frequency of
the entire package is changed at once.
This patch series demonstrates the idea of scaling utime and stime based on
CPU frequency. The scaled values are exported through the taskstats delay
accounting infrastructure.
Example:
On a two-socket, two-CPU x86 machine:
./getdelays -d -l -m0-3
PID 4172
CPU count real total virtual total delay total
43873 148009250 3368915732 28751295
IO count delay total
0 0
MEM count delay total
0 0
utime stime
40000 108000
scaled utime scaled stime total
26676 72032 98714169
The utime/stime and scaled utime/stime values are printed in microseconds,
while the totals are in nanoseconds. The CPU was running at 66% of its
maximum frequency, and we can observe that the scaled utime/stime values are
roughly 66% of their unscaled counterparts, and that 'total' is roughly 66%
of 'real total'.
The following output is for a CPU-intensive job that ran for 10 seconds:
PID 4134
CPU count real total virtual total delay total
61 10000625000 9807860434 2
IO count delay total
0 0
MEM count delay total
0 0
utime stime
10000000 0
scaled utime scaled stime total
9886696 0 9887313918
The ondemand governor was running and took some time to switch the frequency
to maximum, hence the scaled values are marginally less than the elapsed
utime.
Limitations:
* RFC patch meant to communicate the idea; the implementation may need rework
* Works only on 32-bit x86 hardware
* The MSRs are read and the APERF/MPERF ratio is computed at every context
  switch, which is very slow
* cputime_t task_struct->utime is hacked to hold 'jiffies * 1000' values just
  to account for fractional jiffies. Since cputime_t is in jiffies on x86, we
  cannot add fractional jiffies at each context switch. The scaled utime/stime
  data types and units need to be converted to microseconds or nanoseconds.
ToDo:
* Compute the scaling ratio per package only at each frequency switch
  -- Notify the frequency change to all affected CPUs
* Use a more accurate time unit for x86 scaled utime and stime
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
Vaidyanathan Srinivasan (3):
Print scaled utime and stime in getdelays
Make calls to account_scaled_stats
General framework for APERF/MPERF access and accounting
Documentation/accounting/getdelays.c | 13 ++
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 21 +++
arch/x86/kernel/process_32.c | 8 +
arch/x86/kernel/time_32.c | 171 ++++++++++++++++++++++++++++
include/linux/hardirq.h | 4 +
kernel/delayacct.c | 7 +
kernel/timer.c | 2
kernel/tsacct.c | 10 +-
8 files changed, 225 insertions(+), 11 deletions(-)
--
Vaidyanathan Srinivasan,
Linux Technology Center,
IBM India Systems and Technology Labs.
^ permalink raw reply [flat|nested] 31+ messages in thread* [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting 2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan @ 2008-05-26 14:31 ` Vaidyanathan Srinivasan 2008-05-26 18:11 ` Balbir Singh 2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan ` (2 subsequent siblings) 3 siblings, 1 reply; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw) To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora General framework for low level APERF/MPERF access. * Avoid resetting APERF/MPERF in acpi-cpufreq.c * Implement functions that will calculate the scaled stats * acpi_get_pm_msrs_delta() will give delta values after reading the current values * Change get_measured_perf() to use acpi_get_pm_msrs_delta() * scaled_stats_init() detect availability of APERF/MPERF using cpuid * reset_for_scaled_stats() called when a process occupies the CPU * account_scaled_stats() is called when the process leaves the CPU Signed-off-by: Amit K. 
Arora <aarora@linux.vnet.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> --- arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 21 +++ arch/x86/kernel/time_32.c | 171 ++++++++++++++++++++++++++++ 2 files changed, 188 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c index b0c8208..761beec 100644 --- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c +++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c @@ -59,6 +59,13 @@ enum { #define INTEL_MSR_RANGE (0xffff) #define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1) +/* Buffer to store old snapshot values */ +DEFINE_PER_CPU(u64, cpufreq_old_aperf); +DEFINE_PER_CPU(u64, cpufreq_old_mperf); + +extern void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, + u64 *aperf_old, u64 *mperf_old, int reset); + struct acpi_cpufreq_data { struct acpi_processor_performance *acpi_data; struct cpufreq_frequency_table *freq_table; @@ -265,6 +272,7 @@ static unsigned int get_measured_perf(unsigned int cpu) } split; u64 whole; } aperf_cur, mperf_cur; + u64 *aperf_old, *mperf_old; cpumask_t saved_mask; unsigned int perf_percent; @@ -278,11 +286,16 @@ static unsigned int get_measured_perf(unsigned int cpu) return 0; } - rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi); - rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi); + /* + * Get the old APERF/MPERF values for this cpu and pass it to + * acpi_get_pm_msrs_delta() which will read the current values + * and return the delta. 
+ */ + aperf_old = &(per_cpu(cpufreq_old_aperf, smp_processor_id())); + mperf_old = &(per_cpu(cpufreq_old_mperf, smp_processor_id())); - wrmsr(MSR_IA32_APERF, 0,0); - wrmsr(MSR_IA32_MPERF, 0,0); + acpi_get_pm_msrs_delta(&aperf_cur.whole, &mperf_cur.whole, + aperf_old, mperf_old, 1); #ifdef __i386__ /* diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c index 2ff21f3..5131e01 100644 --- a/arch/x86/kernel/time_32.c +++ b/arch/x86/kernel/time_32.c @@ -32,10 +32,13 @@ #include <linux/interrupt.h> #include <linux/time.h> #include <linux/mca.h> +#include <linux/kernel_stat.h> #include <asm/arch_hooks.h> #include <asm/hpet.h> #include <asm/time.h> +#include <asm/processor.h> +#include <asm/cputime.h> #include "do_timer.h" @@ -136,3 +139,171 @@ void __init time_init(void) tsc_init(); late_time_init = choose_time_init(); } + +/* + * This function should be used to get the APERF/MPERF MSRS delta from the cpu. + * We let the individual users of this function store the old values of APERF + * and MPERF registers in per cpu variables. They pass these old values as 3rd + * and 4th arguments. 'reset' tells if the old values should be reset or not. + * Mostly, users of this function will like to reset the old values. + */ +void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, u64 *aperf_old, + u64 *mperf_old, int reset) +{ + union { + struct { + u32 lo; + u32 hi; + } split; + u64 whole; + } aperf_cur, mperf_cur; + unsigned long flags; + + /* Read current values of APERF and MPERF MSRs*/ + local_irq_save(flags); + rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi); + rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi); + local_irq_restore(flags); + + /* + * If the new values are less than the previous values, there has + * been an overflow and both APERF and MPERF have been reset to + * zero. In this case consider absolute value as diff/delta. 
+ * Note that we do not check for 'reset' here, since resetting here + * is no more optional and has to be done for values to make sense. + */ + if (unlikely((mperf_cur.whole <= *mperf_old) || + (aperf_cur.whole <= *aperf_old))) + { + *aperf_old = 0; + *mperf_old = 0; + } + + /* Calculate the delta from the current and per cpu old values */ + *mperf_delta = mperf_cur.whole - *mperf_old; + *aperf_delta = aperf_cur.whole - *aperf_old; + + /* Set the per cpu variables to current readings */ + if (reset) { + *mperf_old = mperf_cur.whole; + *aperf_old = aperf_cur.whole; + } +} +EXPORT_SYMBOL(acpi_get_pm_msrs_delta); + + +DEFINE_PER_CPU(u64, cputime_old_aperf); +DEFINE_PER_CPU(u64, cputime_old_mperf); + +DEFINE_PER_CPU(cputime_t, task_utime_old); +DEFINE_PER_CPU(cputime_t, task_stime_old); + + +#define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1) + +static int cpu_supports_freq_scaling; + +/* Initialize scaled stat functions */ +void scaled_stats_init(void) +{ + struct cpuinfo_x86 *c = &cpu_data(0); + + /* Check for APERF/MPERF support in hardware */ + if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) { + unsigned int ecx; + ecx = cpuid_ecx(6); + if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY) + cpu_supports_freq_scaling = 1; + else + cpu_supports_freq_scaling = -1; + } +} + +/* + * Reset the old utime and stime (percpu) value to the new task + * that we are going to switch to. 
+ */ +void reset_for_scaled_stats(struct task_struct *tsk) +{ + u64 aperf_delta, mperf_delta; + u64 *aperf_old, *mperf_old; + + if (cpu_supports_freq_scaling < 0) + return; + + if(!cpu_supports_freq_scaling) { + scaled_stats_init(); + if(cpu_supports_freq_scaling < 0) + return; + } + + aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id())); + mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id())); + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old, + mperf_old, 1); + + per_cpu(task_utime_old, smp_processor_id()) = tsk->utime; + per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; +} + + +/* Account scaled statistics for a task on context switch */ +void account_scaled_stats(struct task_struct *tsk) +{ + u64 aperf_delta, mperf_delta; + u64 *aperf_old, *mperf_old; + cputime_t time; + u64 time_msec; + int readmsrs = 1; + + if(!cpu_supports_freq_scaling) + scaled_stats_init(); + + /* + * Get the old APERF/MPERF values for this cpu and pass it to + * acpi_get_pm_msrs_delta() which will read the current values + * and return the delta. 
+ */ + aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id())); + mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id())); + + if (cputime_gt(tsk->utime, per_cpu(task_utime_old, + smp_processor_id()))) { + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old, + mperf_old, 1); + readmsrs = 0; + time = cputime_sub(tsk->utime, per_cpu(task_utime_old, + smp_processor_id())); + time_msec = cputime_to_msecs(time); + time_msec *= 1000; /* Scale it to hold fractional values */ + if (cpu_supports_freq_scaling == 1) { + time_msec *= aperf_delta; + time_msec = div64_u64(time_msec, mperf_delta); + } + time = msecs_to_cputime(time_msec); + account_user_time_scaled(tsk, time); + per_cpu(task_utime_old, smp_processor_id()) = tsk->utime; + } + + if (cputime_gt(tsk->stime, per_cpu(task_stime_old, + smp_processor_id()))) { + if (readmsrs) + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, + aperf_old, mperf_old, 1); + time = cputime_sub(tsk->stime, per_cpu(task_stime_old, + smp_processor_id())); + + time_msec = cputime_to_msecs(time); + time_msec *= 1000; /* Scale it to hold fractional values */ + if (cpu_supports_freq_scaling == 1) { + time_msec *= aperf_delta; + time_msec = div64_u64(time_msec, mperf_delta); + } + time = msecs_to_cputime(time_msec); + account_system_time_scaled(tsk, time); + per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; + } + +} +EXPORT_SYMBOL(account_scaled_stats); + ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting 2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan @ 2008-05-26 18:11 ` Balbir Singh 2008-05-27 14:54 ` Vaidyanathan Srinivasan 0 siblings, 1 reply; 31+ messages in thread From: Balbir Singh @ 2008-05-26 18:11 UTC (permalink / raw) To: Vaidyanathan Srinivasan Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora Vaidyanathan Srinivasan wrote: > General framework for low level APERF/MPERF access. > * Avoid resetting APERF/MPERF in acpi-cpufreq.c > * Implement functions that will calculate the scaled stats > * acpi_get_pm_msrs_delta() will give delta values after reading the current values > * Change get_measured_perf() to use acpi_get_pm_msrs_delta() > * scaled_stats_init() detect availability of APERF/MPERF using cpuid > * reset_for_scaled_stats() called when a process occupies the CPU > * account_scaled_stats() is called when the process leaves the CPU > > Signed-off-by: Amit K. 
Arora <aarora@linux.vnet.ibm.com> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> > --- > > arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 21 +++ > arch/x86/kernel/time_32.c | 171 ++++++++++++++++++++++++++++ > 2 files changed, 188 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c > index b0c8208..761beec 100644 > --- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c > +++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c > @@ -59,6 +59,13 @@ enum { > #define INTEL_MSR_RANGE (0xffff) > #define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1) > > +/* Buffer to store old snapshot values */ > +DEFINE_PER_CPU(u64, cpufreq_old_aperf); > +DEFINE_PER_CPU(u64, cpufreq_old_mperf); > + > +extern void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, > + u64 *aperf_old, u64 *mperf_old, int reset); > + > struct acpi_cpufreq_data { > struct acpi_processor_performance *acpi_data; > struct cpufreq_frequency_table *freq_table; > @@ -265,6 +272,7 @@ static unsigned int get_measured_perf(unsigned int cpu) > } split; > u64 whole; > } aperf_cur, mperf_cur; > + u64 *aperf_old, *mperf_old; > > cpumask_t saved_mask; > unsigned int perf_percent; > @@ -278,11 +286,16 @@ static unsigned int get_measured_perf(unsigned int cpu) > return 0; > } > > - rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi); > - rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi); > + /* > + * Get the old APERF/MPERF values for this cpu and pass it to > + * acpi_get_pm_msrs_delta() which will read the current values > + * and return the delta. 
> + */ > + aperf_old = &(per_cpu(cpufreq_old_aperf, smp_processor_id())); > + mperf_old = &(per_cpu(cpufreq_old_mperf, smp_processor_id())); > > - wrmsr(MSR_IA32_APERF, 0,0); > - wrmsr(MSR_IA32_MPERF, 0,0); > + acpi_get_pm_msrs_delta(&aperf_cur.whole, &mperf_cur.whole, > + aperf_old, mperf_old, 1); > > #ifdef __i386__ > /* > diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c > index 2ff21f3..5131e01 100644 > --- a/arch/x86/kernel/time_32.c > +++ b/arch/x86/kernel/time_32.c > @@ -32,10 +32,13 @@ > #include <linux/interrupt.h> > #include <linux/time.h> > #include <linux/mca.h> > +#include <linux/kernel_stat.h> > > #include <asm/arch_hooks.h> > #include <asm/hpet.h> > #include <asm/time.h> > +#include <asm/processor.h> > +#include <asm/cputime.h> > > #include "do_timer.h" > > @@ -136,3 +139,171 @@ void __init time_init(void) > tsc_init(); > late_time_init = choose_time_init(); > } > + > +/* > + * This function should be used to get the APERF/MPERF MSRS delta from the cpu. > + * We let the individual users of this function store the old values of APERF > + * and MPERF registers in per cpu variables. They pass these old values as 3rd > + * and 4th arguments. 'reset' tells if the old values should be reset or not. > + * Mostly, users of this function will like to reset the old values. > + */ > +void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, u64 *aperf_old, > + u64 *mperf_old, int reset) > +{ > + union { > + struct { > + u32 lo; > + u32 hi; > + } split; > + u64 whole; > + } aperf_cur, mperf_cur; > + unsigned long flags; > + > + /* Read current values of APERF and MPERF MSRs*/ > + local_irq_save(flags); Why do we need to do this? We've already disabled pre-emption in the caller? 
> + rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi); > + rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi); > + local_irq_restore(flags); > + > + /* > + * If the new values are less than the previous values, there has > + * been an overflow and both APERF and MPERF have been reset to > + * zero. In this case consider absolute value as diff/delta. > + * Note that we do not check for 'reset' here, since resetting here > + * is no more optional and has to be done for values to make sense. > + */ > + if (unlikely((mperf_cur.whole <= *mperf_old) || > + (aperf_cur.whole <= *aperf_old))) > + { > + *aperf_old = 0; > + *mperf_old = 0; > + } > + > + /* Calculate the delta from the current and per cpu old values */ > + *mperf_delta = mperf_cur.whole - *mperf_old; > + *aperf_delta = aperf_cur.whole - *aperf_old; > + > + /* Set the per cpu variables to current readings */ > + if (reset) { > + *mperf_old = mperf_cur.whole; > + *aperf_old = aperf_cur.whole; > + } > +} > +EXPORT_SYMBOL(acpi_get_pm_msrs_delta); > + > + > +DEFINE_PER_CPU(u64, cputime_old_aperf); > +DEFINE_PER_CPU(u64, cputime_old_mperf); > + > +DEFINE_PER_CPU(cputime_t, task_utime_old); > +DEFINE_PER_CPU(cputime_t, task_stime_old); > + I fail to understand the per cpu variable for task_utime_old and task_stime_old? What does it represent? Why is it global? 
> + > +#define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1) > + > +static int cpu_supports_freq_scaling; > + > +/* Initialize scaled stat functions */ > +void scaled_stats_init(void) > +{ > + struct cpuinfo_x86 *c = &cpu_data(0); > + > + /* Check for APERF/MPERF support in hardware */ > + if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) { > + unsigned int ecx; > + ecx = cpuid_ecx(6); > + if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY) > + cpu_supports_freq_scaling = 1; > + else > + cpu_supports_freq_scaling = -1; > + } > +} > + > +/* > + * Reset the old utime and stime (percpu) value to the new task > + * that we are going to switch to. > + */ > +void reset_for_scaled_stats(struct task_struct *tsk) > +{ > + u64 aperf_delta, mperf_delta; > + u64 *aperf_old, *mperf_old; > + > + if (cpu_supports_freq_scaling < 0) > + return; > + > + if(!cpu_supports_freq_scaling) { > + scaled_stats_init(); > + if(cpu_supports_freq_scaling < 0) > + return; > + } > + > + aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id())); > + mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id())); > + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old, > + mperf_old, 1); > + > + per_cpu(task_utime_old, smp_processor_id()) = tsk->utime; > + per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; > +} > + I hope this routine is called with preemption disabled > + > +/* Account scaled statistics for a task on context switch */ > +void account_scaled_stats(struct task_struct *tsk) > +{ > + u64 aperf_delta, mperf_delta; > + u64 *aperf_old, *mperf_old; > + cputime_t time; > + u64 time_msec; > + int readmsrs = 1; > + > + if(!cpu_supports_freq_scaling) > + scaled_stats_init(); > + > + /* > + * Get the old APERF/MPERF values for this cpu and pass it to > + * acpi_get_pm_msrs_delta() which will read the current values > + * and return the delta. 
> + */ > + aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id())); > + mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id())); > + > + if (cputime_gt(tsk->utime, per_cpu(task_utime_old, > + smp_processor_id()))) { > + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old, > + mperf_old, 1); > + readmsrs = 0; > + time = cputime_sub(tsk->utime, per_cpu(task_utime_old, > + smp_processor_id())); > + time_msec = cputime_to_msecs(time); > + time_msec *= 1000; /* Scale it to hold fractional values */ What is 1000? The code is not clear > + if (cpu_supports_freq_scaling == 1) { > + time_msec *= aperf_delta; > + time_msec = div64_u64(time_msec, mperf_delta); > + } > + time = msecs_to_cputime(time_msec); > + account_user_time_scaled(tsk, time); > + per_cpu(task_utime_old, smp_processor_id()) = tsk->utime; > + } > + > + if (cputime_gt(tsk->stime, per_cpu(task_stime_old, > + smp_processor_id()))) { > + if (readmsrs) > + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, > + aperf_old, mperf_old, 1); > + time = cputime_sub(tsk->stime, per_cpu(task_stime_old, > + smp_processor_id())); > + > + time_msec = cputime_to_msecs(time); > + time_msec *= 1000; /* Scale it to hold fractional values */ > + if (cpu_supports_freq_scaling == 1) { > + time_msec *= aperf_delta; > + time_msec = div64_u64(time_msec, mperf_delta); > + } > + time = msecs_to_cputime(time_msec); > + account_system_time_scaled(tsk, time); > + per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; > + } > + > +} > +EXPORT_SYMBOL(account_scaled_stats); > + > -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting 2008-05-26 18:11 ` Balbir Singh @ 2008-05-27 14:54 ` Vaidyanathan Srinivasan 0 siblings, 0 replies; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-27 14:54 UTC (permalink / raw) To: Balbir Singh Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora * Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-26 23:41:02]: > Vaidyanathan Srinivasan wrote: > > General framework for low level APERF/MPERF access. > > * Avoid resetting APERF/MPERF in acpi-cpufreq.c > > * Implement functions that will calculate the scaled stats > > * acpi_get_pm_msrs_delta() will give delta values after reading the current values > > * Change get_measured_perf() to use acpi_get_pm_msrs_delta() > > * scaled_stats_init() detect availability of APERF/MPERF using cpuid > > * reset_for_scaled_stats() called when a process occupies the CPU > > * account_scaled_stats() is called when the process leaves the CPU > > > > Signed-off-by: Amit K. 
Arora <aarora@linux.vnet.ibm.com> > > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> > > --- > > > > arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 21 +++ > > arch/x86/kernel/time_32.c | 171 ++++++++++++++++++++++++++++ > > 2 files changed, 188 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c > > index b0c8208..761beec 100644 > > --- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c > > +++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c > > @@ -59,6 +59,13 @@ enum { > > #define INTEL_MSR_RANGE (0xffff) > > #define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1) > > > > +/* Buffer to store old snapshot values */ > > +DEFINE_PER_CPU(u64, cpufreq_old_aperf); > > +DEFINE_PER_CPU(u64, cpufreq_old_mperf); > > + > > +extern void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, > > + u64 *aperf_old, u64 *mperf_old, int reset); > > + > > struct acpi_cpufreq_data { > > struct acpi_processor_performance *acpi_data; > > struct cpufreq_frequency_table *freq_table; > > @@ -265,6 +272,7 @@ static unsigned int get_measured_perf(unsigned int cpu) > > } split; > > u64 whole; > > } aperf_cur, mperf_cur; > > + u64 *aperf_old, *mperf_old; > > > > cpumask_t saved_mask; > > unsigned int perf_percent; > > @@ -278,11 +286,16 @@ static unsigned int get_measured_perf(unsigned int cpu) > > return 0; > > } > > > > - rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi); > > - rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi); > > + /* > > + * Get the old APERF/MPERF values for this cpu and pass it to > > + * acpi_get_pm_msrs_delta() which will read the current values > > + * and return the delta. 
> > + */ > > + aperf_old = &(per_cpu(cpufreq_old_aperf, smp_processor_id())); > > + mperf_old = &(per_cpu(cpufreq_old_mperf, smp_processor_id())); > > > > - wrmsr(MSR_IA32_APERF, 0,0); > > - wrmsr(MSR_IA32_MPERF, 0,0); > > + acpi_get_pm_msrs_delta(&aperf_cur.whole, &mperf_cur.whole, > > + aperf_old, mperf_old, 1); > > > > #ifdef __i386__ > > /* > > diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c > > index 2ff21f3..5131e01 100644 > > --- a/arch/x86/kernel/time_32.c > > +++ b/arch/x86/kernel/time_32.c > > @@ -32,10 +32,13 @@ > > #include <linux/interrupt.h> > > #include <linux/time.h> > > #include <linux/mca.h> > > +#include <linux/kernel_stat.h> > > > > #include <asm/arch_hooks.h> > > #include <asm/hpet.h> > > #include <asm/time.h> > > +#include <asm/processor.h> > > +#include <asm/cputime.h> > > > > #include "do_timer.h" > > > > @@ -136,3 +139,171 @@ void __init time_init(void) > > tsc_init(); > > late_time_init = choose_time_init(); > > } > > + > > +/* > > + * This function should be used to get the APERF/MPERF MSRS delta from the cpu. > > + * We let the individual users of this function store the old values of APERF > > + * and MPERF registers in per cpu variables. They pass these old values as 3rd > > + * and 4th arguments. 'reset' tells if the old values should be reset or not. > > + * Mostly, users of this function will like to reset the old values. > > + */ > > +void acpi_get_pm_msrs_delta(u64 *aperf_delta, u64 *mperf_delta, u64 *aperf_old, > > + u64 *mperf_old, int reset) > > +{ > > + union { > > + struct { > > + u32 lo; > > + u32 hi; > > + } split; > > + u64 whole; > > + } aperf_cur, mperf_cur; > > + unsigned long flags; > > + > > + /* Read current values of APERF and MPERF MSRs*/ > > + local_irq_save(flags); > > Why do we need to do this? We've already disabled pre-emption in the caller? Hi Balbir, Thanks for detailed review. I will check and correct this in the next iteration. 
> > > + rdmsr(MSR_IA32_MPERF, mperf_cur.split.lo, mperf_cur.split.hi); > > + rdmsr(MSR_IA32_APERF, aperf_cur.split.lo, aperf_cur.split.hi); > > + local_irq_restore(flags); > > + > > + /* > > + * If the new values are less than the previous values, there has > > + * been an overflow and both APERF and MPERF have been reset to > > + * zero. In this case consider absolute value as diff/delta. > > + * Note that we do not check for 'reset' here, since resetting here > > + * is no more optional and has to be done for values to make sense. > > + */ > > + if (unlikely((mperf_cur.whole <= *mperf_old) || > > + (aperf_cur.whole <= *aperf_old))) > > + { > > + *aperf_old = 0; > > + *mperf_old = 0; > > + } > > + > > + /* Calculate the delta from the current and per cpu old values */ > > + *mperf_delta = mperf_cur.whole - *mperf_old; > > + *aperf_delta = aperf_cur.whole - *aperf_old; > > + > > + /* Set the per cpu variables to current readings */ > > + if (reset) { > > + *mperf_old = mperf_cur.whole; > > + *aperf_old = aperf_cur.whole; > > + } > > +} > > +EXPORT_SYMBOL(acpi_get_pm_msrs_delta); > > + > > + > > +DEFINE_PER_CPU(u64, cputime_old_aperf); > > +DEFINE_PER_CPU(u64, cputime_old_mperf); > > + > > +DEFINE_PER_CPU(cputime_t, task_utime_old); > > +DEFINE_PER_CPU(cputime_t, task_stime_old); > > + > > I fail to understand the per cpu variable for task_utime_old and task_stime_old? > What does it represent? Why is it global? These are needed to store the previous value of utime and stime so that we can compute the difference and scale it when the process leaves the CPU. reset_for_scaled_stats() will store the value of utime and stime when the process occupies the CPU. account_scaled_stats() will get the difference when the process leaves the CPU and then scale the time. I will get rid of this once I get centralised scaling ratio computation from CPUfreq driver infrastructure. 
> > > + > > +#define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1) > > + > > +static int cpu_supports_freq_scaling; > > + > > +/* Initialize scaled stat functions */ > > +void scaled_stats_init(void) > > +{ > > + struct cpuinfo_x86 *c = &cpu_data(0); > > + > > + /* Check for APERF/MPERF support in hardware */ > > + if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) { > > + unsigned int ecx; > > + ecx = cpuid_ecx(6); > > + if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY) > > + cpu_supports_freq_scaling = 1; > > + else > > + cpu_supports_freq_scaling = -1; > > + } > > +} > > + > > +/* > > + * Reset the old utime and stime (percpu) value to the new task > > + * that we are going to switch to. > > + */ > > +void reset_for_scaled_stats(struct task_struct *tsk) > > +{ > > + u64 aperf_delta, mperf_delta; > > + u64 *aperf_old, *mperf_old; > > + > > + if (cpu_supports_freq_scaling < 0) > > + return; > > + > > + if(!cpu_supports_freq_scaling) { > > + scaled_stats_init(); > > + if(cpu_supports_freq_scaling < 0) > > + return; > > + } > > + > > + aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id())); > > + mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id())); > > + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old, > > + mperf_old, 1); > > + > > + per_cpu(task_utime_old, smp_processor_id()) = tsk->utime; > > + per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; > > +} > > + > > I hope this routine is called with preemption disabled Yes, this is called from __switch_to(). I will work towards removing such costly code in critical path. 
> > + > > +/* Account scaled statistics for a task on context switch */ > > +void account_scaled_stats(struct task_struct *tsk) > > +{ > > + u64 aperf_delta, mperf_delta; > > + u64 *aperf_old, *mperf_old; > > + cputime_t time; > > + u64 time_msec; > > + int readmsrs = 1; > > + > > + if(!cpu_supports_freq_scaling) > > + scaled_stats_init(); > > + > > + /* > > + * Get the old APERF/MPERF values for this cpu and pass it to > > + * acpi_get_pm_msrs_delta() which will read the current values > > + * and return the delta. > > + */ > > + aperf_old = &(per_cpu(cputime_old_aperf, smp_processor_id())); > > + mperf_old = &(per_cpu(cputime_old_mperf, smp_processor_id())); > > + > > + if (cputime_gt(tsk->utime, per_cpu(task_utime_old, > > + smp_processor_id()))) { > > + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, aperf_old, > > + mperf_old, 1); > > + readmsrs = 0; > > + time = cputime_sub(tsk->utime, per_cpu(task_utime_old, > > + smp_processor_id())); > > + time_msec = cputime_to_msecs(time); > > + time_msec *= 1000; /* Scale it to hold fractional values */ > > What is 1000? The code is not clear Constant 1000 is the scaling factor I have used. 1 jiffies = 1000 so that I can store fractional jiffies values like 0.66 jiffies. This is a hack as documented in the intro. Basically we need a unit that is more granular than jiffies in order to scale it. u64 type to store micro secs or nano secs will be a good idea, but that will mean change in existing type of cputime_t. I am hopeful that we will move away from jiffies granularity accounting in x86 and switch to more accurate metric given the fact that we already use high resolution timers and clock source in CFS. If the cputime_t is more granular that 1 jiffies, then we can get rid of this hack. 
--Vaidy > > + if (cpu_supports_freq_scaling == 1) { > > + time_msec *= aperf_delta; > > + time_msec = div64_u64(time_msec, mperf_delta); > > + } > > + time = msecs_to_cputime(time_msec); > > + account_user_time_scaled(tsk, time); > > + per_cpu(task_utime_old, smp_processor_id()) = tsk->utime; > > + } > > + > > + if (cputime_gt(tsk->stime, per_cpu(task_stime_old, > > + smp_processor_id()))) { > > + if (readmsrs) > > + acpi_get_pm_msrs_delta(&aperf_delta, &mperf_delta, > > + aperf_old, mperf_old, 1); > > + time = cputime_sub(tsk->stime, per_cpu(task_stime_old, > > + smp_processor_id())); > > + > > + time_msec = cputime_to_msecs(time); > > + time_msec *= 1000; /* Scale it to hold fractional values */ > > + if (cpu_supports_freq_scaling == 1) { > > + time_msec *= aperf_delta; > > + time_msec = div64_u64(time_msec, mperf_delta); > > + } > > + time = msecs_to_cputime(time_msec); > > + account_system_time_scaled(tsk, time); > > + per_cpu(task_stime_old, smp_processor_id()) = tsk->stime; > > + } > > + > > +} > > +EXPORT_SYMBOL(account_scaled_stats); > > + > > > > > -- > Warm Regards, > Balbir Singh > Linux Technology Center > IBM, ISTL ^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC PATCH v1 2/3] Make calls to account_scaled_stats
  2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
@ 2008-05-26 14:31 ` Vaidyanathan Srinivasan
  2008-05-26 18:18   ` Balbir Singh
  2008-05-29 15:18   ` Michael Neuling
  2008-05-26 14:31 ` [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays Vaidyanathan Srinivasan
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
  3 siblings, 2 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
    Michael Neuling, Balbir Singh, Amit K. Arora

Hook various accounting functions to call scaled stats

* Hook process context switch: __switch_to()
* Hook IRQ handling account_system_vtime() in hardirq.h
* Update __delayacct_add_tsk() to take care of scaling by 1000
* Update bacct_add_tsk() to take care of scaling by 1000

Signed-off-by: Amit K.
Arora <aarora@linux.vnet.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> --- arch/x86/kernel/process_32.c | 8 ++++++++ include/linux/hardirq.h | 4 ++++ kernel/delayacct.c | 7 ++++++- kernel/timer.c | 2 -- kernel/tsacct.c | 10 ++++++++-- 5 files changed, 26 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index f8476df..c81a783 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -56,6 +56,9 @@ #include <asm/cpu.h> #include <asm/kdebug.h> +extern void account_scaled_stats(struct task_struct *tsk); +extern void reset_for_scaled_stats(struct task_struct *tsk); + asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); static int hlt_counter; @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct loadsegment(gs, next->gs); x86_write_percpu(current_task, next_p); + /* Account scaled statistics for the task leaving CPU */ + account_scaled_stats(prev_p); + barrier(); + /* Initialise stats counter for new task */ + reset_for_scaled_stats(next_p); return prev_p; } diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 181006c..4458736 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -7,6 +7,9 @@ #include <asm/hardirq.h> #include <asm/system.h> +/* TBD: Add config option */ +extern void account_scaled_stats(struct task_struct *tsk); + /* * We put the hardirq and softirq counter into the preemption * counter. 
The bitmask has the following meaning: @@ -115,6 +118,7 @@ struct task_struct; #ifndef CONFIG_VIRT_CPU_ACCOUNTING static inline void account_system_vtime(struct task_struct *tsk) { + account_scaled_stats(tsk); } #endif diff --git a/kernel/delayacct.c b/kernel/delayacct.c index 10e43fd..3e2938f 100644 --- a/kernel/delayacct.c +++ b/kernel/delayacct.c @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) tmp = (s64)d->cpu_scaled_run_real_total; cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts); - tmp += timespec_to_ns(&ts); + /* HACK: Remember, we multipled the cputime_t by 1000 to include + * fraction. Now it is time to scale it back to correct 'ns' value. + * Perhaps, we should use nano second unit (u64 type) for utimescaled + * and stimescaled? + */ + tmp += div_s64(timespec_to_ns(&ts),1000); d->cpu_scaled_run_real_total = (tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp; diff --git a/kernel/timer.c b/kernel/timer.c index ceacc66..de8a615 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int user_tick) if (user_tick) { account_user_time(p, one_jiffy); - account_user_time_scaled(p, cputime_to_scaled(one_jiffy)); } else { account_system_time(p, HARDIRQ_OFFSET, one_jiffy); - account_system_time_scaled(p, cputime_to_scaled(one_jiffy)); } } #endif diff --git a/kernel/tsacct.c b/kernel/tsacct.c index 4ab1b58..ee0d93b 100644 --- a/kernel/tsacct.c +++ b/kernel/tsacct.c @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk) rcu_read_unlock(); stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC; stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC; + /* HACK: cputime unit is jiffies on x86 and not good for fractional + * additional. cputime_t type {u,s}timescaled is multiplied by + * 1000 for scaled accounting. Hence, cputime_to_msecs will actually + * give the required micro second value. 
The multiplier + * USEC_PER_MSEC has been dropped. + */ stats->ac_utimescaled = - cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC; + cputime_to_msecs(tsk->utimescaled); stats->ac_stimescaled = - cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC; + cputime_to_msecs(tsk->stimescaled); stats->ac_minflt = tsk->min_flt; stats->ac_majflt = tsk->maj_flt; ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats 2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan @ 2008-05-26 18:18 ` Balbir Singh 2008-05-27 15:02 ` Vaidyanathan Srinivasan 2008-05-29 15:18 ` Michael Neuling 1 sibling, 1 reply; 31+ messages in thread From: Balbir Singh @ 2008-05-26 18:18 UTC (permalink / raw) To: Vaidyanathan Srinivasan Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora Vaidyanathan Srinivasan wrote: > Hook various accounting functions to call scaled stats > > * Hook porcess contect switch: __switch_to() > * Hook IRQ handling account_system_vtime() in hardirq.hA > * Update __delayacct_add_tsk() to take care of scaling by 1000 > * Update bacct_add_tsk() to take care of scaling by 1000 > > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> > --- > > arch/x86/kernel/process_32.c | 8 ++++++++ > include/linux/hardirq.h | 4 ++++ > kernel/delayacct.c | 7 ++++++- > kernel/timer.c | 2 -- > kernel/tsacct.c | 10 ++++++++-- > 5 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c > index f8476df..c81a783 100644 > --- a/arch/x86/kernel/process_32.c > +++ b/arch/x86/kernel/process_32.c > @@ -56,6 +56,9 @@ > #include <asm/cpu.h> > #include <asm/kdebug.h> > > +extern void account_scaled_stats(struct task_struct *tsk); > +extern void reset_for_scaled_stats(struct task_struct *tsk); > + > asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); > > static int hlt_counter; > @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct > loadsegment(gs, next->gs); > > x86_write_percpu(current_task, next_p); > + /* Account scaled statistics for the task leaving CPU */ > + account_scaled_stats(prev_p); > + barrier(); > + /* Initialise stats counter for new task */ > + 
reset_for_scaled_stats(next_p); > This is a bad place to hook into. Can't we do scaled accounting they way we do it for powerpc (SPURR)? I would rather hook into account_*time > return prev_p; > } > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h > index 181006c..4458736 100644 > --- a/include/linux/hardirq.h > +++ b/include/linux/hardirq.h > @@ -7,6 +7,9 @@ > #include <asm/hardirq.h> > #include <asm/system.h> > > +/* TBD: Add config option */ > +extern void account_scaled_stats(struct task_struct *tsk); > + > /* > * We put the hardirq and softirq counter into the preemption > * counter. The bitmask has the following meaning: > @@ -115,6 +118,7 @@ struct task_struct; > #ifndef CONFIG_VIRT_CPU_ACCOUNTING > static inline void account_system_vtime(struct task_struct *tsk) > { > + account_scaled_stats(tsk); > } > #endif > > diff --git a/kernel/delayacct.c b/kernel/delayacct.c > index 10e43fd..3e2938f 100644 > --- a/kernel/delayacct.c > +++ b/kernel/delayacct.c > @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) > > tmp = (s64)d->cpu_scaled_run_real_total; > cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts); > - tmp += timespec_to_ns(&ts); > + /* HACK: Remember, we multipled the cputime_t by 1000 to include > + * fraction. Now it is time to scale it back to correct 'ns' value. > + * Perhaps, we should use nano second unit (u64 type) for utimescaled > + * and stimescaled? > + */ > + tmp += div_s64(timespec_to_ns(&ts),1000); 1000 is a bit magical, please use a meaningful #define > d->cpu_scaled_run_real_total = > (tmp < (s64)d->cpu_scaled_run_real_total) ? 
0 : tmp; > > diff --git a/kernel/timer.c b/kernel/timer.c > index ceacc66..de8a615 100644 > --- a/kernel/timer.c > +++ b/kernel/timer.c > @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int user_tick) > > if (user_tick) { > account_user_time(p, one_jiffy); > - account_user_time_scaled(p, cputime_to_scaled(one_jiffy)); > } else { > account_system_time(p, HARDIRQ_OFFSET, one_jiffy); > - account_system_time_scaled(p, cputime_to_scaled(one_jiffy)); > } Why couldn't we leverage these functions here? > } > #endif > diff --git a/kernel/tsacct.c b/kernel/tsacct.c > index 4ab1b58..ee0d93b 100644 > --- a/kernel/tsacct.c > +++ b/kernel/tsacct.c > @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk) > rcu_read_unlock(); > stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC; > stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC; > + /* HACK: cputime unit is jiffies on x86 and not good for fractional > + * additional. cputime_t type {u,s}timescaled is multiplied by > + * 1000 for scaled accounting. Hence, cputime_to_msecs will actually > + * give the required micro second value. The multiplier > + * USEC_PER_MSEC has been dropped. > + */ > stats->ac_utimescaled = > - cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC; > + cputime_to_msecs(tsk->utimescaled); > stats->ac_stimescaled = > - cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC; > + cputime_to_msecs(tsk->stimescaled); > stats->ac_minflt = tsk->min_flt; > stats->ac_majflt = tsk->maj_flt; > > -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats 2008-05-26 18:18 ` Balbir Singh @ 2008-05-27 15:02 ` Vaidyanathan Srinivasan 0 siblings, 0 replies; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-27 15:02 UTC (permalink / raw) To: Balbir Singh Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora * Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-26 23:48:04]: > Vaidyanathan Srinivasan wrote: > > Hook various accounting functions to call scaled stats > > > > * Hook porcess contect switch: __switch_to() > > * Hook IRQ handling account_system_vtime() in hardirq.hA > > * Update __delayacct_add_tsk() to take care of scaling by 1000 > > * Update bacct_add_tsk() to take care of scaling by 1000 > > > > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com> > > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> > > --- > > > > arch/x86/kernel/process_32.c | 8 ++++++++ > > include/linux/hardirq.h | 4 ++++ > > kernel/delayacct.c | 7 ++++++- > > kernel/timer.c | 2 -- > > kernel/tsacct.c | 10 ++++++++-- > > 5 files changed, 26 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c > > index f8476df..c81a783 100644 > > --- a/arch/x86/kernel/process_32.c > > +++ b/arch/x86/kernel/process_32.c > > @@ -56,6 +56,9 @@ > > #include <asm/cpu.h> > > #include <asm/kdebug.h> > > > > +extern void account_scaled_stats(struct task_struct *tsk); > > +extern void reset_for_scaled_stats(struct task_struct *tsk); > > + > > asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); > > > > static int hlt_counter; > > @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct > > loadsegment(gs, next->gs); > > > > x86_write_percpu(current_task, next_p); > > + /* Account scaled statistics for the task leaving CPU */ > > + account_scaled_stats(prev_p); > > + barrier(); > > + /* Initialise stats counter for new task */ > > 
+ reset_for_scaled_stats(next_p); > > > > This is a bad place to hook into. Can't we do scaled accounting they way we do > it for powerpc (SPURR)? I would rather hook into account_*time Agreed and I have also documented it in the intro. I will try to remove them in the next iteration. > > > return prev_p; > > } > > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h > > index 181006c..4458736 100644 > > --- a/include/linux/hardirq.h > > +++ b/include/linux/hardirq.h > > @@ -7,6 +7,9 @@ > > #include <asm/hardirq.h> > > #include <asm/system.h> > > > > +/* TBD: Add config option */ > > +extern void account_scaled_stats(struct task_struct *tsk); > > + > > /* > > * We put the hardirq and softirq counter into the preemption > > * counter. The bitmask has the following meaning: > > @@ -115,6 +118,7 @@ struct task_struct; > > #ifndef CONFIG_VIRT_CPU_ACCOUNTING > > static inline void account_system_vtime(struct task_struct *tsk) > > { > > + account_scaled_stats(tsk); > > } > > #endif > > > > diff --git a/kernel/delayacct.c b/kernel/delayacct.c > > index 10e43fd..3e2938f 100644 > > --- a/kernel/delayacct.c > > +++ b/kernel/delayacct.c > > @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) > > > > tmp = (s64)d->cpu_scaled_run_real_total; > > cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts); > > - tmp += timespec_to_ns(&ts); > > + /* HACK: Remember, we multipled the cputime_t by 1000 to include > > + * fraction. Now it is time to scale it back to correct 'ns' value. > > + * Perhaps, we should use nano second unit (u64 type) for utimescaled > > + * and stimescaled? > > + */ > > + tmp += div_s64(timespec_to_ns(&ts),1000); > > 1000 is a bit magical, please use a meaningful #define Sure I can use a #define.... anyway this needs to be solved. This is documented in the intro and explained in other email. > > d->cpu_scaled_run_real_total = > > (tmp < (s64)d->cpu_scaled_run_real_total) ? 
0 : tmp; > > > > diff --git a/kernel/timer.c b/kernel/timer.c > > index ceacc66..de8a615 100644 > > --- a/kernel/timer.c > > +++ b/kernel/timer.c > > @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int user_tick) > > > > if (user_tick) { > > account_user_time(p, one_jiffy); > > - account_user_time_scaled(p, cputime_to_scaled(one_jiffy)); > > } else { > > account_system_time(p, HARDIRQ_OFFSET, one_jiffy); > > - account_system_time_scaled(p, cputime_to_scaled(one_jiffy)); > > } > > Why couldn't we leverage these functions here? We don't know the scaling ratio at this time. The ratio is being calculated at every context switch by sampling APERF/MPERF when the process occupies the CPU and when it leave the CPU. We should use these functions once I get centralised scaling ratio computation done from cpufreq driver. Out-of-band frequency variation needs to be solved for the other method. Current implementation will do correct accounting even if the freq variation is out-of-band. --Vaidy > > } > > #endif > > diff --git a/kernel/tsacct.c b/kernel/tsacct.c > > index 4ab1b58..ee0d93b 100644 > > --- a/kernel/tsacct.c > > +++ b/kernel/tsacct.c > > @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk) > > rcu_read_unlock(); > > stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC; > > stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC; > > + /* HACK: cputime unit is jiffies on x86 and not good for fractional > > + * additional. cputime_t type {u,s}timescaled is multiplied by > > + * 1000 for scaled accounting. Hence, cputime_to_msecs will actually > > + * give the required micro second value. The multiplier > > + * USEC_PER_MSEC has been dropped. 
> > + */ > > stats->ac_utimescaled = > > - cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC; > > + cputime_to_msecs(tsk->utimescaled); > > stats->ac_stimescaled = > > - cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC; > > + cputime_to_msecs(tsk->stimescaled); > > stats->ac_minflt = tsk->min_flt; > > stats->ac_majflt = tsk->maj_flt; > > > > > > > -- > Warm Regards, > Balbir Singh > Linux Technology Center > IBM, ISTL ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats 2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan 2008-05-26 18:18 ` Balbir Singh @ 2008-05-29 15:18 ` Michael Neuling 2008-05-29 18:23 ` Vaidyanathan Srinivasan 1 sibling, 1 reply; 31+ messages in thread From: Michael Neuling @ 2008-05-29 15:18 UTC (permalink / raw) To: Vaidyanathan Srinivasan Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Balbir Singh, Amit K. Arora In message <20080526143146.24680.36724.stgit@drishya.in.ibm.com> you wrote: > Hook various accounting functions to call scaled stats > > * Hook porcess contect switch: __switch_to() > * Hook IRQ handling account_system_vtime() in hardirq.hA > * Update __delayacct_add_tsk() to take care of scaling by 1000 > * Update bacct_add_tsk() to take care of scaling by 1000 > > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com> > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> > --- > > arch/x86/kernel/process_32.c | 8 ++++++++ > include/linux/hardirq.h | 4 ++++ > kernel/delayacct.c | 7 ++++++- > kernel/timer.c | 2 -- > kernel/tsacct.c | 10 ++++++++-- > 5 files changed, 26 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c > index f8476df..c81a783 100644 > --- a/arch/x86/kernel/process_32.c > +++ b/arch/x86/kernel/process_32.c > @@ -56,6 +56,9 @@ > #include <asm/cpu.h> > #include <asm/kdebug.h> > > +extern void account_scaled_stats(struct task_struct *tsk); > +extern void reset_for_scaled_stats(struct task_struct *tsk); > + > asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); > > static int hlt_counter; > @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *pre v_p, struct task_struct > loadsegment(gs, next->gs); > > x86_write_percpu(current_task, next_p); > + /* Account scaled statistics for the task leaving CPU */ > + account_scaled_stats(prev_p); > + barrier(); > + /* Initialise stats 
counter for new task */ > + reset_for_scaled_stats(next_p); > > return prev_p; > } > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h > index 181006c..4458736 100644 > --- a/include/linux/hardirq.h > +++ b/include/linux/hardirq.h > @@ -7,6 +7,9 @@ > #include <asm/hardirq.h> > #include <asm/system.h> > > +/* TBD: Add config option */ > +extern void account_scaled_stats(struct task_struct *tsk); > + > /* > * We put the hardirq and softirq counter into the preemption > * counter. The bitmask has the following meaning: > @@ -115,6 +118,7 @@ struct task_struct; > #ifndef CONFIG_VIRT_CPU_ACCOUNTING > static inline void account_system_vtime(struct task_struct *tsk) > { > + account_scaled_stats(tsk); > } > #endif > > diff --git a/kernel/delayacct.c b/kernel/delayacct.c > index 10e43fd..3e2938f 100644 > --- a/kernel/delayacct.c > +++ b/kernel/delayacct.c > @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task _struct *tsk) > > tmp = (s64)d->cpu_scaled_run_real_total; > cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts); > - tmp += timespec_to_ns(&ts); > + /* HACK: Remember, we multipled the cputime_t by 1000 to include > + * fraction. Now it is time to scale it back to correct 'ns' value. > + * Perhaps, we should use nano second unit (u64 type) for utimescaled > + * and stimescaled? > + */ > + tmp += div_s64(timespec_to_ns(&ts),1000); This is going to break other archs (specifically powerpc) which doesn't do this magical scale by 1000. How often is this function called as the divide is going to slow things down? > d->cpu_scaled_run_real_total = > (tmp < (s64)d->cpu_scaled_run_real_total) ? 
0 : tmp; > > diff --git a/kernel/timer.c b/kernel/timer.c > index ceacc66..de8a615 100644 > --- a/kernel/timer.c > +++ b/kernel/timer.c > @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int use r_tick) > > if (user_tick) { > account_user_time(p, one_jiffy); > - account_user_time_scaled(p, cputime_to_scaled(one_jiffy)); > } else { > account_system_time(p, HARDIRQ_OFFSET, one_jiffy); > - account_system_time_scaled(p, cputime_to_scaled(one_jiffy)); > } > } Why did you remove this? > #endif > diff --git a/kernel/tsacct.c b/kernel/tsacct.c > index 4ab1b58..ee0d93b 100644 > --- a/kernel/tsacct.c > +++ b/kernel/tsacct.c > @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_s truct *tsk) > rcu_read_unlock(); > stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC; > stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC; > + /* HACK: cputime unit is jiffies on x86 and not good for fractional > + * additional. cputime_t type {u,s}timescaled is multiplied by > + * 1000 for scaled accounting. Hence, cputime_to_msecs will actually > + * give the required micro second value. The multiplier > + * USEC_PER_MSEC has been dropped. > + */ > stats->ac_utimescaled = > - cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC; > + cputime_to_msecs(tsk->utimescaled); > stats->ac_stimescaled = > - cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC; > + cputime_to_msecs(tsk->stimescaled); Again, isn't this going to effect other archs? Mikey > stats->ac_minflt = tsk->min_flt; > stats->ac_majflt = tsk->maj_flt; > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 2/3] Make calls to account_scaled_stats 2008-05-29 15:18 ` Michael Neuling @ 2008-05-29 18:23 ` Vaidyanathan Srinivasan 0 siblings, 0 replies; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-29 18:23 UTC (permalink / raw) To: Michael Neuling Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Balbir Singh, Amit K. Arora * Michael Neuling <mikey@neuling.org> [2008-05-29 10:18:56]: > In message <20080526143146.24680.36724.stgit@drishya.in.ibm.com> you wrote: > > Hook various accounting functions to call scaled stats > > > > * Hook porcess contect switch: __switch_to() > > * Hook IRQ handling account_system_vtime() in hardirq.hA > > * Update __delayacct_add_tsk() to take care of scaling by 1000 > > * Update bacct_add_tsk() to take care of scaling by 1000 > > > > Signed-off-by: Amit K. Arora <aarora@linux.vnet.ibm.com> > > Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> > > --- > > > > arch/x86/kernel/process_32.c | 8 ++++++++ > > include/linux/hardirq.h | 4 ++++ > > kernel/delayacct.c | 7 ++++++- > > kernel/timer.c | 2 -- > > kernel/tsacct.c | 10 ++++++++-- > > 5 files changed, 26 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c > > index f8476df..c81a783 100644 > > --- a/arch/x86/kernel/process_32.c > > +++ b/arch/x86/kernel/process_32.c > > @@ -56,6 +56,9 @@ > > #include <asm/cpu.h> > > #include <asm/kdebug.h> > > > > +extern void account_scaled_stats(struct task_struct *tsk); > > +extern void reset_for_scaled_stats(struct task_struct *tsk); > > + > > asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); > > > > static int hlt_counter; > > @@ -660,6 +663,11 @@ struct task_struct * __switch_to(struct task_struct *pre > v_p, struct task_struct > > loadsegment(gs, next->gs); > > > > x86_write_percpu(current_task, next_p); > > + /* Account scaled statistics for the task leaving CPU */ > > + account_scaled_stats(prev_p); > > + barrier(); > > + 
/* Initialise stats counter for new task */ > > + reset_for_scaled_stats(next_p); > > > > return prev_p; > > } > > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h > > index 181006c..4458736 100644 > > --- a/include/linux/hardirq.h > > +++ b/include/linux/hardirq.h > > @@ -7,6 +7,9 @@ > > #include <asm/hardirq.h> > > #include <asm/system.h> > > > > +/* TBD: Add config option */ > > +extern void account_scaled_stats(struct task_struct *tsk); > > + > > /* > > * We put the hardirq and softirq counter into the preemption > > * counter. The bitmask has the following meaning: > > @@ -115,6 +118,7 @@ struct task_struct; > > #ifndef CONFIG_VIRT_CPU_ACCOUNTING > > static inline void account_system_vtime(struct task_struct *tsk) > > { > > + account_scaled_stats(tsk); > > } > > #endif > > > > diff --git a/kernel/delayacct.c b/kernel/delayacct.c > > index 10e43fd..3e2938f 100644 > > --- a/kernel/delayacct.c > > +++ b/kernel/delayacct.c > > @@ -117,7 +117,12 @@ int __delayacct_add_tsk(struct taskstats *d, struct task > _struct *tsk) > > > > tmp = (s64)d->cpu_scaled_run_real_total; > > cputime_to_timespec(tsk->utimescaled + tsk->stimescaled, &ts); > > - tmp += timespec_to_ns(&ts); > > + /* HACK: Remember, we multipled the cputime_t by 1000 to include > > + * fraction. Now it is time to scale it back to correct 'ns' value. > > + * Perhaps, we should use nano second unit (u64 type) for utimescaled > > + * and stimescaled? > > + */ > > + tmp += div_s64(timespec_to_ns(&ts),1000); > > This is going to break other archs (specifically powerpc) which doesn't > do this magical scale by 1000. > > How often is this function called as the divide is going to slow things > down? Hi Mikey, Thanks for the review comments. This scaling is a hack to store fractions in cputime_t data type as mentioned in the intro and other replies. This should certainly go away once I find a clean method to store fractional jiffies values for x86. 
> > > d->cpu_scaled_run_real_total = > > (tmp < (s64)d->cpu_scaled_run_real_total) ? 0 : tmp; > > > > diff --git a/kernel/timer.c b/kernel/timer.c > > index ceacc66..de8a615 100644 > > --- a/kernel/timer.c > > +++ b/kernel/timer.c > > @@ -964,10 +964,8 @@ void account_process_tick(struct task_struct *p, int use > r_tick) > > > > if (user_tick) { > > account_user_time(p, one_jiffy); > > - account_user_time_scaled(p, cputime_to_scaled(one_jiffy)); > > } else { > > account_system_time(p, HARDIRQ_OFFSET, one_jiffy); > > - account_system_time_scaled(p, cputime_to_scaled(one_jiffy)); > > } > > } > > Why did you remove this? In this preliminary RFC implementation to demonstrate the idea, I have not hooked into these routines since I do not know the scaling factor at this time. I am trying to maintain the scaling ratio from cpufreq driver in the next version and just use it in the accounting subsystem. Once I have cpufreq subsystem to maintain the scaling ratio, I can use these functions. > > #endif > > diff --git a/kernel/tsacct.c b/kernel/tsacct.c > > index 4ab1b58..ee0d93b 100644 > > --- a/kernel/tsacct.c > > +++ b/kernel/tsacct.c > > @@ -62,10 +62,16 @@ void bacct_add_tsk(struct taskstats *stats, struct task_s > truct *tsk) > > rcu_read_unlock(); > > stats->ac_utime = cputime_to_msecs(tsk->utime) * USEC_PER_MSEC; > > stats->ac_stime = cputime_to_msecs(tsk->stime) * USEC_PER_MSEC; > > + /* HACK: cputime unit is jiffies on x86 and not good for fractional > > + * additional. cputime_t type {u,s}timescaled is multiplied by > > + * 1000 for scaled accounting. Hence, cputime_to_msecs will actually > > + * give the required micro second value. The multiplier > > + * USEC_PER_MSEC has been dropped. 
> > + */ > > stats->ac_utimescaled = > > - cputime_to_msecs(tsk->utimescaled) * USEC_PER_MSEC; > > + cputime_to_msecs(tsk->utimescaled); > > stats->ac_stimescaled = > > - cputime_to_msecs(tsk->stimescaled) * USEC_PER_MSEC; > > + cputime_to_msecs(tsk->stimescaled); > > Again, isn't this going to effect other archs? Yes, this is part of the scaling factor hack. I will get rid of this once we can store fractional cputime_t. --Vaidy ^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays
  2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan
  2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan
@ 2008-05-26 14:31 ` Vaidyanathan Srinivasan
  2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven
  3 siblings, 0 replies; 31+ messages in thread
From: Vaidyanathan Srinivasan @ 2008-05-26 14:31 UTC (permalink / raw)
To: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha,
    Michael Neuling, Balbir Singh, Amit K. Arora

Add print in getdelays program to show scaled stats from taskstats.

* Normal utime and stime are printed in microseconds
* Scaled utime and stime are printed in microseconds
* Total scaled run real time is printed in nanoseconds

The values in the taskstats are printed as-is, hence the change in units.
getdelays -d should print these values. Preferred cmdline is
getdelays -l -p -m0-3 to get the final delay count at exit.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> --- Documentation/accounting/getdelays.c | 13 +++++++++++-- 1 files changed, 11 insertions(+), 2 deletions(-) diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index 40121b5..f695563 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -197,13 +197,22 @@ void print_delayacct(struct taskstats *t) "IO %15s%15s\n" " %15llu%15llu\n" "MEM %15s%15s\n" - " %15llu%15llu\n", + " %15llu%15llu\n" + " %15s%15s\n" + " %15llu%15llu\n" + " %15s%15s%15s\n" + " %15llu%15llu%15llu\n", "count", "real total", "virtual total", "delay total", t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total, t->cpu_delay_total, "count", "delay total", t->blkio_count, t->blkio_delay_total, - "count", "delay total", t->swapin_count, t->swapin_delay_total); + "count", "delay total", t->swapin_count, t->swapin_delay_total, + "utime", "stime", + t->ac_utime, t->ac_stime, + "scaled utime", "scaled stime", "total", + t->ac_utimescaled, t->ac_stimescaled, + t->cpu_scaled_run_real_total); } void task_context_switch_counts(struct taskstats *t) ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Arjan van de Ven @ 2008-05-26 15:50 UTC
To: Vaidyanathan Srinivasan
Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora

On Mon, 26 May 2008 20:01:33 +0530
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> The following RFC patch tries to implement scaled CPU utilisation
> statistics using APERF and MPERF MSR registers in an x86 platform.
>
> The CPU capacity is significantly changed when the CPU's frequency is
> reduced for the purpose of power savings. The applications that run
> at such lower CPU frequencies are also accounted for real CPU time by
> default. If the applications had been run at full CPU frequency,
> they would have finished the work faster and not been charged for
> excessive CPU time.
>
> One of the solutions to this problem is to scale the utime and stime
> entitlement for the process as per the current CPU frequency. This
> technique is used in the powerpc architecture with the help of
> hardware registers that accurately capture the entitlement.

There are some issues with this, unfortunately, and these make it a very
complex thing to do. Just to mention a few:

1) What if the BIOS no longer allows us to go to the max frequency for
a period (for example as a result of overheating)? With the approach
above, the admin would THINK he can go faster, but he cannot in reality,
so there's misleading information (the system looks half busy, while in
reality it's actually the opposite: it's overloaded). Management tools
will take the wrong decisions (such as moving MORE work to the box, not
less).

2) On systems with Intel Dynamic Acceleration technology, you can get
over 100% of cycles this way. (For those who don't know what IDA is:
IDA is basically a case where, if your Penryn-based dual core laptop is
only using one core, the other core can go faster than 100% as long as
thermals etc. allow it.) How do you want to deal with this?
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Balbir Singh @ 2008-05-26 17:24 UTC
To: Arjan van de Ven
Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

Arjan van de Ven wrote:
> there are some issues with this unfortunately, and these make it
> a very complex thing to do. Just to mention a few:
> 1) What if the BIOS no longer allows us to go to the max frequency for
> a period (for example as a result of overheating); with the approach
> above, the admin would THINK he can go faster, but he cannot in reality,
> so there's misleading information (the system looks half busy, while in
> reality it's actually the opposite, it's overloaded). Management tools
> will take the wrong decisions (such as moving MORE work to the box, not
> less)
> 2) On systems with Intel Dynamic Acceleration technology, you can get
> over 100% of cycles this way. (For those who don't know what IDA is;
> IDA is basically a case where if your Penryn based dual core laptop is
> only using 1 core, the other core can go faster than 100% as long as
> thermals etc allow it). How do you want to deal with this?

Arjan,

These problems exist anyway, irrespective of scaled accounting (I'd say
that they are exceptions):

1. The management tool does have access to the current frequency and
maximum frequency, irrespective of scaled accounting. The decision could
still be taken on the data that is already available, and management
tools can already use it.

2. With IDA, we'd have to document that APERF/MPERF can be greater than
100% if the system is overclocked.

Scaled accounting only intends to provide data already available.
Interpretation is left to management tools, and we'll document the
corner cases that you just mentioned.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Arjan van de Ven @ 2008-05-26 18:00 UTC
To: balbir
Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

On Mon, 26 May 2008 22:54:43 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> These problems exist anyway, irrespective of scaled accounting (I'd
> say that they are exceptions)
>
> 1. The management tool does have access to the current frequency and
> maximum frequency, irrespective of scaled accounting. The decision
> could still be taken on the data that is already available and
> management tools can already use them

It's sadly not as easy as you make it sound. From everything you wrote
you're making the assumption "if we're not at maximum frequency, we
have room to spare", which is very much not a correct assumption.

> 2. With IDA, we'd have to
> document that APERF/MPERF can be greater than 100% if the system is
> overclocked.
>
> Scaled accounting only intends to provide data already available.
> Interpretation is left to management tools and we'll document the
> corner cases that you just mentioned.

IDA is not overclocking, nor is it a corner case *at all*. It's the
common case, in fact, on more modern systems. Having the kernel present
"raw" data to applications that then have no idea how to really use it,
to be honest, isn't very attractive to me as an idea: you're presenting
a very raw hardware interface that will keep changing over time in
terms of how to interpret the data... the kernel needs to abstract such
hard stuff from applications, not fully expose them to it. Especially
since these things *ARE* tricky and *WILL* change. Future x86 hardware
will have behavior that makes "oh, we'll document the corner cases"
extremely impractical. Heck, even today's hardware (but arguably not
yet the server hardware) behaves like that. "Documenting the common
case as corner case" is not the right thing to do when introducing some
new behavior/interface. Sorry.
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Balbir Singh @ 2008-05-26 18:36 UTC
To: Arjan van de Ven
Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

Arjan van de Ven wrote:
> it's sadly not as easy as you make it sound. From everything you wrote
> you're making the assumption "if we're not at maximum frequency, we
> have room to spare", which is very much not a correct assumption

That's true in general. If the CPUs are throttled due to overheating,
the system management application will figure out that it cannot change
the frequency. How do I interpret my CPU frequency applet's data when
it says that the system is running at 46%?

> IDA is not overclocking, nor is it a corner case *at all*. It's the
> common case in fact on more modern systems. [snip] "Documenting the
> common case as corner case" is not the right thing to do when
> introducing some new behavior/interface. Sorry.

Before I argue against that, I would like to ask:

1. How are APERF/MPERF meant to be utilized?

2. The CPU frequency driver/governor uses APERF/MPERF as well - we
could argue and say that it should not be using/exposing that data to
user space, or using that data to make decisions.

3. How do I answer the following problem:

My CPU utilization is 50% at all frequencies (since utilization is time
based); does it mean that frequency scaling does not impact my
workload?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Arjan van de Ven @ 2008-05-26 18:51 UTC
To: balbir
Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

On Tue, 27 May 2008 00:06:56 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> That's true in general. If the CPUs are throttled due to overheating,
> the system management application will figure out that it cannot
> change the frequency.

It's not the system management application but the kernel (and the
hardware! Esp. in case of IDA) that manages the frequency.

> How do I interpret my CPU frequency applet's
> data when it says that the system is running at 46%?

That is a very good question. The answer is "uhh, badly". Sad but true.
I'm not arguing against that. The problem I have is that what you're
doing does not make it better!

> 1. How are APERF/MPERF meant to be utilized?

It's meant to be used by the cpu frequency governors to figure out how
many actual cycles are actually used (esp. needed in case of IDA).

> 2. The CPU frequency driver/governor uses APERF/MPERF as well - we
> could argue and say that it should not be using/exposing that data to
> user space or using that data to make decisions.

That's a case where it really makes sense; it's the case where the
thing that controls the cpu P-state actually learns about how much work
was done, to reevaluate what the cpu frequency should be going forward.
E.g. it's a case of comparing actual frequency (APERF/MPERF) to see
what's useful to set next. IDA makes this all needed due to the dynamic
nature of the concept of "frequency".

> 3. How do I answer the following problem
>
> My CPU utilization is 50% at all frequencies (since utilization is
> time based), does it mean that frequency scaling does not impact my
> workload?

Without knowing anything else than this, then yes, that would be a
logical conclusion: the most likely cause would be that your cpu is
memory bound. In fact, you could then scale your cpu frequency/voltage
down to be lower, and save some power without losing performance.
It's a weird workload though; it's probably a time-based thing where
you alternate between idle and fully memory bound loads.

(Which is another case where your patches would then expose idle time
even though your cpu is fully utilized for the 50% of the time it's
running.)

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Balbir Singh @ 2008-05-27 12:59 UTC
To: Arjan van de Ven
Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

Arjan van de Ven wrote:
> That is a very good question. The answer is "uhh badly". Sad but true.
> I'm not arguing against that.
> The problem I have is that what you're doing does not make it better!

Well, how do we make that problem better? Do you have a solution in
mind? Many of us can live with and interpret the data we see - I know
that I am running at 46% of the maximum potential capacity at this
moment.

> that's a case where it really makes sense; it's the case where the
> thing that controls the cpu P-state actually learns about how much work
> was done to reevaluate what the cpu frequency should be going forward.

So why should the in-kernel governor be the only one to do so? What if
I wanted to write a user space governor that potentially wanted to do
the same thing? How would it work? Why should the governor be the only
one calculating how much work was done; why can't a user space
application do the same thing?

> Eg it's a case of comparing actual frequency (APERF/MPERF) to see
> what's useful to set next.
> IDA makes this all needed due to the dynamic nature of the concept of
> "frequency".

Could you point me to the IDA specification, so that I can read it and
understand your concern better?

> (which is another case where your patches would then expose idle time
> even though your cpu is fully utilized for the 50% of the time it's
> running)

We expect the end user to see 50% as scaled utilization and 100% as
normal utilization. We don't intend to remove tsk->utime and
tsk->stime. Our patches intend to provide the data and not impose what
control action should be taken.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Vaidyanathan Srinivasan @ 2008-05-27 13:19 UTC
To: Balbir Singh
Cc: Arjan van de Ven, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

[sniped]

> > (which is another case where your patches would then expose idle time
> > even though your cpu is fully utilized for the 50% of the time it's
> > running)
>
> We expect the end user to see 50% as scaled utilization and 100% as
> normal utilization. We don't intend to remove tsk->utime and
> tsk->stime. Our patches intend to provide the data and not impose what
> control action should be taken.

Hi Arjan,

As Balbir mentioned, we are not changing the idle time calculations.
The meaning of the current utime and stime is preserved, and they are
relative to current CPU capacity. We are just adding a new metric
(which already exists in taskstats) to provide more utilisation data
for higher level management software to take decisions. At any time we
will have both the traditional utilisation value relative to current
CPU capacity, and the scaled utilisation that is relative to maximum
CPU capacity.

--Vaidy
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Arjan van de Ven @ 2008-05-27 14:15 UTC
To: svaidy
Cc: Balbir Singh, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

On Tue, 27 May 2008 18:49:26 +0530:

> At any time we will have both the traditional utilisation value
> relative to current CPU capacity, and scaled utilisation that is
> relative to maximum CPU capacity.

This is where I raise a red flag, *because the patch is not doing
that*!! Sadly it gives a random metric which in some circumstances
looks like it does that, but in reality it is not doing that.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Vaidyanathan Srinivasan @ 2008-05-27 15:27 UTC
To: Arjan van de Ven
Cc: Balbir Singh, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-27 07:15:32]:

> this is where I raise a red flag *because the patch is not doing
> that* !!

Your concern is valid. I assume you are objecting to the usefulness and
interpretation of the scaled metric, and not to the fact that both the
scaled and the default non-scaled metrics are independently available.

> sadly it gives a random metric which in some circumstances looks like
> it does that, but in reality it is not doing that.

I agree, and I am interested in designing a metric that is useful in
most circumstances. I am open to suggestions to cover the
power-constraint and acceleration scenarios.

Thanks,
Vaidy
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Pavel Machek @ 2008-05-31 21:27 UTC
To: Balbir Singh
Cc: Arjan van de Ven, Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

Hi!

> We expect the end user to see 50% as scaled utilization and 100% as
> normal utilization. We don't intend to remove tsk->utime and
> tsk->stime. Our patches intend to provide the data and not impose what
> control action should be taken.

Aha, ok, forget about my regression comments.

Still, what you want to do seems hard. What if the cpu is running at
max frequency but the memory is not? What if the cpu and memory are
running at max frequency but the frontside bus is not?

You want some give_me_bogomips( cpu freq, mem freq, fsb freq ) function,
but that depends on the workload, so it is impossible to do...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Vaidyanathan Srinivasan @ 2008-06-02 17:54 UTC
To: Pavel Machek
Cc: Balbir Singh, Arjan van de Ven, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

* Pavel Machek <pavel@ucw.cz> [2008-05-31 23:27:10]:

> Still, what you want to do seems hard. What if cpu is running at max
> frequency but memory is not? What if cpu and memory is running at max
> frequency but frontside bus is not?

uh... you made the scope of the problem bigger :) If we take into
consideration various system parameters like memory and fsb that we can
control to save power, then this scaling factor is not accurate.

> You want some give_me_bogomips( cpu freq, mem freq, fsb freq ) function,
> but that depends on workload, so it is impossible to do...

Isn't this a good problem to solve :) Let's start adding parameters in
steps to make the scaled statistics accurate. We want to start with the
cpu and see if we can provide a reasonable solution.

Workload variation is outside the scope, since we are estimating how
much cpu was provided to the application or workload. The fact that the
workload could not utilise it effectively is not an accounting problem.
Accounting the exact cpu resource that an application used is a
performance feedback problem, which can potentially be derived from
accounting.

--Vaidy
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86
From: Arjan van de Ven @ 2008-06-03 2:20 UTC
To: svaidy
Cc: Pavel Machek, Balbir Singh, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

On Mon, 2 Jun 2008 23:24:00 +0530
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote:

> Lets start adding parameters in steps to make the scaled statistics
> accurate. We want to start with cpu and see if we can provide
> a reasonable solution.

Sadly your patches don't even achieve that ;-( At least not in terms of
CPU capacity etc.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-26 18:51 ` Arjan van de Ven 2008-05-27 12:59 ` Balbir Singh @ 2008-05-27 13:29 ` Vaidyanathan Srinivasan 2008-05-27 14:19 ` Arjan van de Ven 1 sibling, 1 reply; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-27 13:29 UTC (permalink / raw) To: Arjan van de Ven Cc: balbir, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora * Arjan van de Ven <arjan@infradead.org> [2008-05-26 11:51:08]: > On Tue, 27 May 2008 00:06:56 +0530 > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > Arjan van de Ven wrote: > > > On Mon, 26 May 2008 22:54:43 +0530 > > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > > > > >> Arjan, > > >> > > >> > > >> These problems exist anyway, irrespective of scaled accounting (I'd > > >> say that they are exceptions) > > >> > > >> 1. The management tool does have access to the current frequency > > >> and maximum frequency, irrespective of scaled accounting. The > > >> decision could still be taken on the data that is already > > >> available and management tools can already use them > > > > > > it's sadly not as easy as you make it sound. From everything you > > > wrote you're making the assumption "if we're not at maximum > > > frequency, we have room to spare", which is very much not a correct > > > assumption > > > > > > > That's true in general. If the CPUs are throttled due to overheating, > > the system management application will figure out that it cannot > > change the frequency. > > It's not the system management application but the kernel (and the > hardware! Esp in case of IDA) that manage the frequency. > > > How do I interpret my CPU frequency applet's > > data when it says that the system is running at 46%? > > That is a very good question. The answer is "uhh badly". Sad but true. > I'm not arguing against that. > The problem I have is that what you're doing does not make it better! > > > > > >> 2. 
With IDA, we'd have to > > >> document that APERF/MPERF can be greater than 100% if the system is > > >> overclocked. > > >> > > >> Scaled accounting only intends to provide data already available. > > >> Interpretation is left to management tools and we'll document the > > >> corner cases that you just mentioned. > > > > > > IDA is not overclocking, nor is it a corner case *at all*. It's the > > > common case in fact on more modern systems. Having the kernel > > > present "raw" data to applications that then have no idea how to > > > really use it to be honest isn't very attractive to me as idea: > > > you're presenting a very raw hardware interface that will keep > > > changing over time in terms of how to interpret the data... the > > > kernel needs to abstract such hard stuff from applications, not > > > fully expose them to it. Especially since these things *ARE* tricky > > > and *WILL* change. Future x86 hardware will have behavior that > > > makes the "oh we'll document the corner cases" extremely > > > unpractical. Heck, even todays hardware (but arguably not yet the > > > server hardware) behaves like that. "Documenting the common case as > > > corner case" is not the right thing to do when introducing some new > > > behavior/interface. Sorry. > > > > Before I argue against that, I would like to ask > > > > 1. How are APERF/MPERF be meant to be utilized? > > It's meant to be used by the cpu frequency governors to figure out how > many actual cycles are actually used (esp needed in case of IDA). > > > 2. The CPU frequency driver/governer uses APERF/MPERF as well - we > > could argue and say that it should not be using/exposing that data to > > user space or using that data to make decisions. > > that's a case where it really makes sense; it's the case where the > thing that controls the cpu P-state actually learns about how much work > was done to reevaluate what the cpu frequency should be going forward. 
> Eg it's a case of comparing actual frequency (APERF/MPERF) to see > what's useful to set next. > IDA makes this all needed due to the dynamic nature of the concept of > "frequency".

Scaled statistics relative to maximum CPU capacity are just a method of exposing the actual CPU utilisation of applications independent of CPU frequency changes.

The reason behind the metric is the same as the fact you mention above. The CPU frequency governors cannot make decisions based only on the idle time ratio. They need to know the current utilisation (used cycles) relative to the maximum capacity so that the frequency can be changed to the next higher level.

Higher level management software that wants to control CPU capacity externally will need similar information.

> > > 3. How do I answer the following problem > > > > My CPU utilization is 50% at all frequencies (since utilization is > > time based), does it mean that frequency scaling does not impact my > > workload? > > without knowing anything else than this, then yes that would be a > logical conclusion: the most likely cause would be because your cpu is > memory bound. In fact, you could then scale down your cpu > frequency/voltage to be lower, and save some power without losing > performance. > It's a weird workload though, its probably a time based thing where you > alternate between idle and fully memory bound loads. > > (which is another case where your patches would then expose idle time > even though your cpu is fully utilized for the 50% of the time it's > running) > > > > -- > If you want to reach me at my work email, use arjan@linux.intel.com > For development, discussion and tips for power savings, > visit http://www.lesswatts.org

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-27 13:29 ` Vaidyanathan Srinivasan @ 2008-05-27 14:19 ` Arjan van de Ven 2008-05-27 15:20 ` Vaidyanathan Srinivasan 0 siblings, 1 reply; 31+ messages in thread From: Arjan van de Ven @ 2008-05-27 14:19 UTC (permalink / raw) To: svaidy Cc: balbir, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

> > > > that's a case where it really makes sense; it's the case where the > > thing that controls the cpu P-state actually learns about how much > > work was done to reevaluate what the cpu frequency should be going > > forward. Eg it's a case of comparing actual frequency (APERF/MPERF) > > to see what's useful to set next. > > IDA makes this all needed due to the dynamic nature of the concept > > of "frequency". > > Scaled statistics relative to maximum CPU capacity is just a method of > exposing the actual CPU utilisation of applications independent of CPU > frequency changes. > > Reason behind the metric is same as the above fact that you have > mentioned. The CPU frequency governors cannot make decisions only > based on idle time ratio. It needs to know current utilisation (used > cycles) relative to maximum capacity so that the frequency can be > changed to next higher level. > > Higher level management software that wants to control CPU capacity > externally will need similar information. >

I entirely understand that desire. But you're not giving it that information! The patch is giving it a really poor approximation, an approximation that will get worse and worse in upcoming cpu generations.

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-27 14:19 ` Arjan van de Ven @ 2008-05-27 15:20 ` Vaidyanathan Srinivasan 0 siblings, 0 replies; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-27 15:20 UTC (permalink / raw) To: Arjan van de Ven Cc: balbir, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora

* Arjan van de Ven <arjan@infradead.org> [2008-05-27 07:19:00]: > > > > > > that's a case where it really makes sense; it's the case where the > > > thing that controls the cpu P-state actually learns about how much > > > work was done to reevaluate what the cpu frequency should be going > > > forward. Eg it's a case of comparing actual frequency (APERF/MPERF) > > > to see what's useful to set next. > > > IDA makes this all needed due to the dynamic nature of the concept > > > of "frequency". > > > > Scaled statistics relative to maximum CPU capacity is just a method of > > exposing the actual CPU utilisation of applications independent of CPU > > frequency changes. > > > > Reason behind the metric is same as the above fact that you have > > mentioned. The CPU frequency governors cannot make decisions only > > based on idle time ratio. It needs to know current utilisation (used > > cycles) relative to maximum capacity so that the frequency can be > > changed to next higher level. > > > > Higher level management software that wants to control CPU capacity > > externally will need similar information. > >

> I entirely understand that desire.

Good :)

> But you're not giving it that information! > The patch is giving it a really poor approximation, an approximation > that will get worse and worse in upcoming cpu generations.

I agree that power capping and acceleration make the metric approximate. But we are trying to be as accurate and meaningful as the APERF/MPERF ratio is in the processor hardware.
Can I state the problem like this: the metric is as accurate and meaningful as the APERF/MPERF ratio, but the interpretation of the metric is subject to knowledge of the power constraints or acceleration currently in effect.

--Vaidy

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven 2008-05-26 17:24 ` Balbir Singh @ 2008-05-27 14:04 ` Vaidyanathan Srinivasan 2008-05-27 16:40 ` Arjan van de Ven 2008-05-31 21:17 ` Pavel Machek 2008-05-31 21:13 ` Pavel Machek 2 siblings, 2 replies; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-27 14:04 UTC (permalink / raw) To: Arjan van de Ven Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora * Arjan van de Ven <arjan@infradead.org> [2008-05-26 08:50:00]: > On Mon, 26 May 2008 20:01:33 +0530 > Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote: > > > The following RFC patch tries to implement scaled CPU utilisation > > statistics using APERF and MPERF MSR registers in an x86 platform. > > > > The CPU capacity is significantly changed when the CPU's frequency is > > reduced for the purpose of power savings. The applications that run > > at such lower CPU frequencies are also accounted for real CPU time by > > default. If the applications have been run at full CPU frequency, > > they would have finished the work faster and not get charged for > > excessive CPU time. > > > > One of the solution to this problem it so scale the utime and stime > > entitlement for the process as per the current CPU frequency. This > > technique is used in powerpc architecture with the help of hardware > > registers that accurately capture the entitlement. > > > > there are some issues with this unfortunately, and these make it > a very complex thing to do. 
> Just to mention a few: > 1) What if the BIOS no longer allows us to go to the max frequency for > a period (for example as a result of overheating); with the approach > above, the admin would THINK he can go faster, but he cannot in reality, > so there's misleading information (the system looks half busy, while in > reality it's actually the opposite, it's overloaded). Management tools > will take the wrong decisions (such as moving MORE work to the box, not > less) > 2) On systems with Intel Dynamic Acceleration technology, you can get > over 100% of cycles this way. (For those who don't know what IDA is; > IDA is basically a case where if your Penryn based dual core laptop is > only using 1 core, the other core can go faster than 100% as long as > thermals etc allow it). How do you want to deal with this?

Hi Arjan,

Thank you for the inputs. The above issues are very valid, and our solution should be able to react appropriately to these situations.

What we are proposing is a scaled time value that is scaled to the current CPU capacity. If the scaled utilisation is 50% when the CPU is at 100% capacity, it is expected to remain at 50% even if the CPU's capacity is dropped to 50%, while the traditional utilisation value will be 100%.

The problem in the above two cases is that we had assumed that the maximum CPU capacity is 100% at normal capacity (without IDA). If the CPU is at half the maximum frequency, then scaled stats should show 50%.

Now in case 1, the CPU's capacity cannot be increased further, and we would expect the scaled stats to show 100% as well. If the process ran for 10s, then the scaled time should be 10s, since we cannot make the process run faster.

In case 2, the CPU's capacity increases beyond the assumed 100%, and the tasks can show in excess of 100% utilisation. If the real run time is 5s, then the scaled runtime will be more than 5s, say 7s. This essentially says that the process has done work worth 7s of CPU time at 100% capacity.
The point I am trying to make is that whether scaling should be done relative to the CPU's designed maximum capacity, or to the maximum capacity under current constraints, is still to be discussed.

Case A:
------

Scaled stats are relative to the maximum designed capacity, including IDA. In this case nominal utilisation will always be less than 100%, but the higher level software needs to know the environment and interpret the values so as to determine the remaining capacity.

Example: assume IDA provides a 20% boost. We know that normal capacity will be 83%; with the 20% boost from IDA, we can reach 100%. If there is a P-state constraint due to the power/thermal envelope, then we know that the maximum capacity is reduced to, say, 40%.

Remaining cap = Available cap - Current cap

Available capacity is a dynamically varying quantity, but still the stats are useful and interpretable.

Case B:
------

Scaled stats are relative to the currently available capacity. In this case we assume 100% means the currently available capacity, and hence any scaled utilisation less than 100% can be counted as spare capacity. If we are constrained to half frequency, and we are running at 1/4th freq, then our scaled stats will be 50%, implying that we can still double our capacity by switching to the next higher frequency.

This is not a very elegant statistic, because the scaled 'time' value will not make sense after a long runtime over various transitions through the constraints and acceleration. In this case the OS needs to know about the constraint. The higher level management software will not be able to interpret the scaled stats at the moment when the constraint has changed. They need to know about the constraints and changes as well.

I would prefer the metric as in Case A. I assume that even in the case of IDA, we will know the CPU's maximum capacity with acceleration, and its nominal capacity. The APERF/MPERF ratio can be interpreted correctly even if it exceeds 1.
Please let us know if we can improve the framework to include both the power constraint case and acceleration case. Thanks, Vaidy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-27 14:04 ` Vaidyanathan Srinivasan @ 2008-05-27 16:40 ` Arjan van de Ven 2008-05-27 18:26 ` Vaidyanathan Srinivasan 2008-05-31 21:17 ` Pavel Machek 1 sibling, 1 reply; 31+ messages in thread From: Arjan van de Ven @ 2008-05-27 16:40 UTC (permalink / raw) To: svaidy Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora On Tue, 27 May 2008 19:34:40 +0530 Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote: > > What we are proposing is a scaled time value that is scaled to the > current CPU capacity. If the scaled utilisation is 50% when the CPU > is at 100% capacity, it is expected to remain at 50% even if the CPU's > capacity is dropped to 50%, while the traditional utilisation value > will be 100%. When you use the word "capacity" I cringe ;( > > The problem in the above two cases is that we had assumed that the > maximum CPU capacity is 100% at normal capacity (without IDA). > > If the CPU is at half the maximum frequency, then scaled stats should > show 50%. see frequency != capacity. It's about more than frequency. It's about how much cache you have available too. If you run single threaded on a dual core cpu, you have 100% of the cache, but the cpu is 50% idle. But that doesn't mean that when you double the load, you actually get 2x the performance. So you're not at 50% of capacity! > The point I am trying to make is whether scaling should be done > relative to CPUs designed maximum capacity or maximum capacity under > current constraints is to be discussed. now you're back at "capacity".. we were at frequency before ;( > > Case A: > ------ > > Scaled stats is stats relative to maximum designed capacity including > IDA you don't know what that is though. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-27 16:40 ` Arjan van de Ven @ 2008-05-27 18:26 ` Vaidyanathan Srinivasan 0 siblings, 0 replies; 31+ messages in thread From: Vaidyanathan Srinivasan @ 2008-05-27 18:26 UTC (permalink / raw) To: Arjan van de Ven Cc: Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora * Arjan van de Ven <arjan@infradead.org> [2008-05-27 09:40:35]: > On Tue, 27 May 2008 19:34:40 +0530 > Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> wrote: > > > > What we are proposing is a scaled time value that is scaled to the > > current CPU capacity. If the scaled utilisation is 50% when the CPU > > is at 100% capacity, it is expected to remain at 50% even if the CPU's > > capacity is dropped to 50%, while the traditional utilisation value > > will be 100%. > > When you use the word "capacity" I cringe ;( > > > > > The problem in the above two cases is that we had assumed that the > > maximum CPU capacity is 100% at normal capacity (without IDA). > > > > If the CPU is at half the maximum frequency, then scaled stats should > > show 50%. > > see frequency != capacity. > It's about more than frequency. It's about how much cache you have > available too. If you run single threaded on a dual core cpu, you have > 100% of the cache, but the cpu is 50% idle. But that doesn't mean that > when you double the load, you actually get 2x the performance. So > you're not at 50% of capacity! You are right.... I kind of interchangeably used frequency and capacity assuming capacity is linearly proportional to frequency! In reality, I agree that capacity is not linearly proportional to frequency and it is dependent on cache usage etc. The differences apart, I am sure I have conveyed the relationship between scaled stats and CPU frequency. I used the term 'capacity' to generalise CPU performance, but I guess it may lead to a different discussion. I will stick to frequency. 
:) --Vaidy > > The point I am trying to make is whether scaling should be done > > relative to CPUs designed maximum capacity or maximum capacity under > > current constraints is to be discussed. > > now you're back at "capacity".. we were at frequency before ;( > > > > > > Case A: > > ------ > > > > Scaled stats is stats relative to maximum designed capacity including > > IDA > > you don't know what that is though. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-27 14:04 ` Vaidyanathan Srinivasan 2008-05-27 16:40 ` Arjan van de Ven @ 2008-05-31 21:17 ` Pavel Machek 1 sibling, 0 replies; 31+ messages in thread From: Pavel Machek @ 2008-05-31 21:17 UTC (permalink / raw) To: Arjan van de Ven, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora Hi! > > > The following RFC patch tries to implement scaled CPU utilisation > > > statistics using APERF and MPERF MSR registers in an x86 platform. > > > > > > The CPU capacity is significantly changed when the CPU's frequency is > > > reduced for the purpose of power savings. The applications that run > > > at such lower CPU frequencies are also accounted for real CPU time by > > > default. If the applications have been run at full CPU frequency, > > > they would have finished the work faster and not get charged for > > > excessive CPU time. > > > > > > One of the solution to this problem it so scale the utime and stime > > > entitlement for the process as per the current CPU frequency. This > > > technique is used in powerpc architecture with the help of hardware > > > registers that accurately capture the entitlement. > > > > > > > there are some issues with this unfortunately, and these make it > > a very complex thing to do. > > Just to mention a few: > > 1) What if the BIOS no longer allows us to go to the max frequency for > > a period (for example as a result of overheating); with the approach > > above, the admin would THINK he can go faster, but he cannot in reality, > > so there's misleading information (the system looks half busy, while in > > reality it's actually the opposite, it's overloaded). Management tools > > will take the wrong decisions (such as moving MORE work to the box, not > > less) > > 2) On systems with Intel Dynamic Acceleration technology, you can get > > over 100% of cycles this way. 
(For those who don't know what IDA is; > > IDA is basically a case where if your Penryn based dual core laptop is > > only using 1 core, the other core can go faster than 100% as long as > > thermals etc allow it). How do you want to deal with this? > > Hi Arjan, > > Thanks you for the inputs. The above issues are very valid and our > solution should be able to react appropriately to the above situation. > > What we are proposing is a scaled time value that is scaled to the > current CPU capacity. If the scaled utilisation is 50% when the CPU > is at 100% capacity, it is expected to remain at 50% even if the CPU's > capacity is dropped to 50%, while the traditional utilisation value > will be 100%. time one-second-busy-loop should return close to one second. That's current behaviour. You don't like it, but it is useful. If you change it, that's called 'regression'. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven 2008-05-26 17:24 ` Balbir Singh 2008-05-27 14:04 ` Vaidyanathan Srinivasan @ 2008-05-31 21:13 ` Pavel Machek 2008-06-02 6:08 ` Balbir Singh 2 siblings, 1 reply; 31+ messages in thread From: Pavel Machek @ 2008-05-31 21:13 UTC (permalink / raw) To: Arjan van de Ven Cc: Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Balbir Singh, Amit K. Arora Hi! > > entitlement for the process as per the current CPU frequency. This > > technique is used in powerpc architecture with the help of hardware > > registers that accurately capture the entitlement. > > > > there are some issues with this unfortunately, and these make it > a very complex thing to do. > Just to mention a few: > 1) What if the BIOS no longer allows us to go to the max frequency for > a period (for example as a result of overheating); with the approach > above, the admin would THINK he can go faster, but he cannot in reality, > so there's misleading information (the system looks half busy, while in Plus time one-second-computation-job returning anything else is just wrong. Even when it only happens when overheated... or only on battery power. If you want scaled utime, you need new interface. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 2008-05-31 21:13 ` Pavel Machek @ 2008-06-02 6:08 ` Balbir Singh 0 siblings, 0 replies; 31+ messages in thread From: Balbir Singh @ 2008-06-02 6:08 UTC (permalink / raw) To: Pavel Machek Cc: Arjan van de Ven, Vaidyanathan Srinivasan, Linux Kernel, venkatesh.pallipadi, suresh.b.siddha, Michael Neuling, Amit K. Arora Pavel Machek wrote: > Hi! > >>> entitlement for the process as per the current CPU frequency. This >>> technique is used in powerpc architecture with the help of hardware >>> registers that accurately capture the entitlement. >>> >> there are some issues with this unfortunately, and these make it >> a very complex thing to do. >> Just to mention a few: >> 1) What if the BIOS no longer allows us to go to the max frequency for >> a period (for example as a result of overheating); with the approach >> above, the admin would THINK he can go faster, but he cannot in reality, >> so there's misleading information (the system looks half busy, while in > > Plus time one-second-computation-job returning anything else is just > wrong. Even when it only happens when overheated... or only on battery > power. > > If you want scaled utime, you need new interface. We do have a new interface, two new parameters per-task utimescaled and stimescaled. They already exist in task_struct. Ditto for the delay accounting pieces. Did I misunderstand your need for a new interface? We still keep utime and stime around. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2008-06-03 2:21 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-05-26 14:31 [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Vaidyanathan Srinivasan 2008-05-26 14:31 ` [RFC PATCH v1 1/3] General framework for APERF/MPERF access and accounting Vaidyanathan Srinivasan 2008-05-26 18:11 ` Balbir Singh 2008-05-27 14:54 ` Vaidyanathan Srinivasan 2008-05-26 14:31 ` [RFC PATCH v1 2/3] Make calls to account_scaled_stats Vaidyanathan Srinivasan 2008-05-26 18:18 ` Balbir Singh 2008-05-27 15:02 ` Vaidyanathan Srinivasan 2008-05-29 15:18 ` Michael Neuling 2008-05-29 18:23 ` Vaidyanathan Srinivasan 2008-05-26 14:31 ` [RFC PATCH v1 3/3] Print scaled utime and stime in getdelays Vaidyanathan Srinivasan 2008-05-26 15:50 ` [RFC PATCH v1 0/3] Scaled statistics using APERF/MPERF in x86 Arjan van de Ven 2008-05-26 17:24 ` Balbir Singh 2008-05-26 18:00 ` Arjan van de Ven 2008-05-26 18:36 ` Balbir Singh 2008-05-26 18:51 ` Arjan van de Ven 2008-05-27 12:59 ` Balbir Singh 2008-05-27 13:19 ` Vaidyanathan Srinivasan 2008-05-27 14:15 ` Arjan van de Ven 2008-05-27 15:27 ` Vaidyanathan Srinivasan 2008-05-31 21:27 ` Pavel Machek 2008-06-02 17:54 ` Vaidyanathan Srinivasan 2008-06-03 2:20 ` Arjan van de Ven 2008-05-27 13:29 ` Vaidyanathan Srinivasan 2008-05-27 14:19 ` Arjan van de Ven 2008-05-27 15:20 ` Vaidyanathan Srinivasan 2008-05-27 14:04 ` Vaidyanathan Srinivasan 2008-05-27 16:40 ` Arjan van de Ven 2008-05-27 18:26 ` Vaidyanathan Srinivasan 2008-05-31 21:17 ` Pavel Machek 2008-05-31 21:13 ` Pavel Machek 2008-06-02 6:08 ` Balbir Singh
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox