* [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies
@ 2006-04-07 6:30 Srivatsa Vaddagiri
2006-04-07 23:04 ` Paul Mackerras
0 siblings, 1 reply; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2006-04-07 6:30 UTC (permalink / raw)
To: anton, benh, paulus; +Cc: linuxppc-dev, sri_vatsa_v
Currently, only boot CPU calls do_timer to update jiffies. This prevents
idle boot CPU from skipping ticks. Patch below, against 2.6.17-rc1-mm1,
allows jiffies to be updated from any CPU.
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
---
linux-2.6.17-rc1-root/arch/powerpc/kernel/time.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)
diff -puN arch/powerpc/kernel/time.c~boot_cpu_fix arch/powerpc/kernel/time.c
--- linux-2.6.17-rc1/arch/powerpc/kernel/time.c~boot_cpu_fix 2006-04-07 04:13:53.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/kernel/time.c 2006-04-07 04:14:03.000000000 +0530
@@ -685,19 +685,17 @@ void timer_interrupt(struct pt_regs * re
if (!cpu_is_offline(cpu))
account_process_time(regs);
- /*
- * No need to check whether cpu is offline here; boot_cpuid
- * should have been fixed up by now.
- */
- if (cpu != boot_cpuid)
- continue;
-
write_seqlock(&xtime_lock);
- tb_last_jiffy += tb_ticks_per_jiffy;
- tb_last_stamp = per_cpu(last_jiffy, cpu);
- do_timer(regs);
- timer_recalc_offset(tb_last_jiffy);
- timer_check_rtc();
+ if (tb_ticks_since(tb_last_stamp) >= tb_ticks_per_jiffy) {
+ tb_last_jiffy += tb_ticks_per_jiffy;
+ tb_last_stamp += tb_ticks_per_jiffy;
+ /* Handle RTCL overflow on 601 */
+ if (__USE_RTC() && tb_last_stamp >= 1000000000)
+ tb_last_stamp -= 1000000000;
+ do_timer(regs);
+ timer_recalc_offset(tb_last_jiffy);
+ timer_check_rtc();
+ }
write_sequnlock(&xtime_lock);
}
_
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies
2006-04-07 6:30 [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies Srivatsa Vaddagiri
@ 2006-04-07 23:04 ` Paul Mackerras
2006-04-10 11:49 ` Srivatsa Vaddagiri
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Paul Mackerras @ 2006-04-07 23:04 UTC (permalink / raw)
To: vatsa; +Cc: sri_vatsa_v, linuxppc-dev
Srivatsa Vaddagiri writes:
> Currently, only boot CPU calls do_timer to update jiffies. This prevents
> idle boot CPU from skipping ticks. Patch below, against 2.6.17-rc1-mm1,
> allows jiffies to be updated from any CPU.
We have to be very careful here. The code that keeps xtime and
gettimeofday in sync relies on xtime being incremented as close as
possible in time to when the timebase passes specific values. Since
we currently stagger the timer interrupts for the cpus throughout a
jiffy, having cpus other than the boot cpus calling do_timer will
break this and introduce inaccuracies. There are also implications
for the stolen time accounting on shared-processor LPAR systems.
I think we need to remove the staggering, thus having all cpus take
their timer interrupt at the same time. That way, any of them can
call do_timer. However we then have to be much more careful about
possible contention, e.g. on xtime_lock. Your patch has every cpu
taking xtime_lock for writing rather than just the boot cpu. I'd like
to see if there is some way to avoid that (while still having just one
cpu call do_timer, of course).
Regards,
Paul.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies
2006-04-07 23:04 ` Paul Mackerras
@ 2006-04-10 11:49 ` Srivatsa Vaddagiri
2006-04-10 12:18 ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
2006-04-10 12:19 ` [PATCH 2/2] tickless idle cpus: allow boot cpu to skip ticks Srivatsa Vaddagiri
2 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2006-04-10 11:49 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev
On Sat, Apr 08, 2006 at 09:04:15AM +1000, Paul Mackerras wrote:
> Srivatsa Vaddagiri writes:
>
> > Currently, only boot CPU calls do_timer to update jiffies. This prevents
> > idle boot CPU from skipping ticks. Patch below, against 2.6.17-rc1-mm1,
> > allows jiffies to be updated from any CPU.
>
> We have to be very careful here. The code that keeps xtime and
> gettimeofday in sync relies on xtime being incremented as close as
> possible in time to when the timebase passes specific values. Since
> we currently stagger the timer interrupts for the cpus throughout a
> jiffy, having cpus other than the boot cpus calling do_timer will
> break this and introduce inaccuracies. There are also implications
> for the stolen time accounting on shared-processor LPAR systems.
>
> I think we need to remove the staggering, thus having all cpus take
> their timer interrupt at the same time. That way, any of them can
> call do_timer. However we then have to be much more careful about
> possible contention, e.g. on xtime_lock. Your patch has every cpu
> taking xtime_lock for writing rather than just the boot cpu. I'd like
> to see if there is some way to avoid that (while still having just one
> cpu call do_timer, of course).
Paul,
Thanks for the feedback on the patches.
Avoiding contention on xtime_lock doesnt seem to be trivial. Any
solution to it is fraught with races. Anyway, I have attempted one
solution (in the followon Patch 2/2) which keeps the overhead in timer
interrupt handler low.
Let me know if you have other suggestions to avoid xtime_lock
contention!
Following patches are sent in separate mails:
Patch 1/2 - Core patch to skip ticks - v2
Patch 2/2 - Allow boot CPU to skip ticks - v2
The sysctl control patch and decrementer statistics patch are as before
and hence I am not resending them this time.
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/2] tickless idle cpus: core patch - v2
2006-04-07 23:04 ` Paul Mackerras
2006-04-10 11:49 ` Srivatsa Vaddagiri
@ 2006-04-10 12:18 ` Srivatsa Vaddagiri
2006-04-11 17:35 ` Paul Mackerras
2006-04-21 10:49 ` Paul Mackerras
2006-04-10 12:19 ` [PATCH 2/2] tickless idle cpus: allow boot cpu to skip ticks Srivatsa Vaddagiri
2 siblings, 2 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2006-04-10 12:18 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev
This is the v2 of the core patch to skip ticks when a CPU is idle.
Changes since v1:
- fix the buggy call to stop_hz_timer in idle_power4.S (hopefully it
is correct now!).
- Dont allow boot cpu to skip ticks (a follow-on patch will
remove this restriction)
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
---
linux-2.6.17-rc1-root/arch/powerpc/Kconfig | 6
linux-2.6.17-rc1-root/arch/powerpc/kernel/idle_power4.S | 5
linux-2.6.17-rc1-root/arch/powerpc/kernel/irq.c | 3
linux-2.6.17-rc1-root/arch/powerpc/kernel/time.c | 143 +++++++++--
linux-2.6.17-rc1-root/arch/powerpc/kernel/traps.c | 1
linux-2.6.17-rc1-root/arch/powerpc/platforms/pseries/setup.c | 6
linux-2.6.17-rc1-root/include/asm-powerpc/time.h | 8
7 files changed, 147 insertions(+), 25 deletions(-)
diff -puN arch/powerpc/kernel/time.c~no_idle_hz arch/powerpc/kernel/time.c
--- linux-2.6.17-rc1/arch/powerpc/kernel/time.c~no_idle_hz 2006-04-09 10:40:58.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/kernel/time.c 2006-04-10 14:32:04.000000000 +0530
@@ -633,40 +633,97 @@ static void iSeries_tb_recal(void)
}
#endif
-/*
- * For iSeries shared processors, we have to let the hypervisor
- * set the hardware decrementer. We set a virtual decrementer
- * in the lppaca and call the hypervisor if the virtual
- * decrementer is less than the current value in the hardware
- * decrementer. (almost always the new decrementer value will
- * be greater than the current hardware decementer so the hypervisor
- * call will not be needed)
- */
+#ifdef CONFIG_NO_IDLE_HZ
-/*
- * timer_interrupt - gets called when the decrementer overflows,
- * with interrupts disabled.
+static void account_ticks(struct pt_regs *regs);
+
+/* Returns 1 if this CPU was set in the mask */
+static inline int clear_hzless_mask(void)
+{
+ int cpu = smp_processor_id();
+ int rc = 0;
+
+ if (unlikely(cpu_isset(cpu, nohz_cpu_mask))) {
+ cpu_clear(cpu, nohz_cpu_mask);
+ rc = 1;
+ }
+
+ return rc;
+}
+
+#define MAX_DEC_COUNT UINT_MAX /* Decrementer is 32-bit */
+static int min_skip = 2; /* Minimum number of ticks to skip */
+static int max_skip; /* Maximum number of ticks to skip */
+
+
+int sysctl_hz_timer = 1;
+
+/* Defer timer interrupt for as long as possible. This is accomplished by
+ * programming the decrementer to a suitable value such that it raises the
+ * exception after desired interval. This features allows CPUs to
+ * be used more efficiently in virtualized environments and/or allows for
+ * lower power consumption.
+ *
+ * Called with interrupts disabled on an idle CPU. Caller has to ensure that
+ * idle loop is not exited w/o start_hz_timer being called via an interrupt
+ * to restore timer interrupt frequency.
*/
-void timer_interrupt(struct pt_regs * regs)
+
+void stop_hz_timer(void)
{
+ unsigned long cpu = smp_processor_id(), seq, delta;
int next_dec;
- int cpu = smp_processor_id();
- unsigned long ticks;
-#ifdef CONFIG_PPC32
- if (atomic_read(&ppc_n_lost_interrupts) != 0)
- do_IRQ(regs);
-#endif
+ if (sysctl_hz_timer != 0 || cpu == boot_cpuid)
+ return;
- irq_enter();
+ cpu_set(cpu, nohz_cpu_mask);
+ mb();
+ if (rcu_pending(cpu) || local_softirq_pending()) {
+ cpu_clear(cpu, nohz_cpu_mask);
+ return;
+ }
- profile_tick(CPU_PROFILING, regs);
- calculate_steal_time();
+ do {
+ seq = read_seqbegin(&xtime_lock);
-#ifdef CONFIG_PPC_ISERIES
- get_lppaca()->int_dword.fields.decr_int = 0;
+ delta = next_timer_interrupt() - jiffies;
+
+ if (delta < min_skip) {
+ cpu_clear(cpu, nohz_cpu_mask);
+ return;
+ }
+
+ if (delta > max_skip)
+ delta = max_skip;
+
+ next_dec = tb_last_stamp + delta * tb_ticks_per_jiffy;
+
+ } while (read_seqretry(&xtime_lock, seq));
+
+ next_dec -= get_tb();
+ set_dec(next_dec);
+
+ return;
+}
+
+/* Take into account skipped ticks and restore the HZ timer frequency */
+void start_hz_timer(struct pt_regs *regs)
+{
+ if (clear_hzless_mask())
+ account_ticks(regs);
+}
+
+#else
+static inline int clear_hzless_mask(void) { return 0;}
#endif
+static void account_ticks(struct pt_regs *regs)
+{
+ int next_dec;
+ int cpu = smp_processor_id();
+ unsigned long ticks;
+
while ((ticks = tb_ticks_since(per_cpu(last_jiffy, cpu)))
>= tb_ticks_per_jiffy) {
/* Update last_jiffy */
@@ -703,6 +760,41 @@ void timer_interrupt(struct pt_regs * re
next_dec = tb_ticks_per_jiffy - ticks;
set_dec(next_dec);
+}
+
+/*
+ * For iSeries shared processors, we have to let the hypervisor
+ * set the hardware decrementer. We set a virtual decrementer
+ * in the lppaca and call the hypervisor if the virtual
+ * decrementer is less than the current value in the hardware
+ * decrementer. (almost always the new decrementer value will
+ * be greater than the current hardware decementer so the hypervisor
+ * call will not be needed)
+ */
+
+/*
+ * timer_interrupt - gets called when the decrementer overflows,
+ * with interrupts disabled.
+ */
+void timer_interrupt(struct pt_regs * regs)
+{
+#ifdef CONFIG_PPC32
+ if (atomic_read(&ppc_n_lost_interrupts) != 0)
+ do_IRQ(regs);
+#endif
+
+ irq_enter();
+
+ clear_hzless_mask();
+
+ profile_tick(CPU_PROFILING, regs);
+ calculate_steal_time();
+
+#ifdef CONFIG_PPC_ISERIES
+ get_lppaca()->int_dword.fields.decr_int = 0;
+#endif
+
+ account_ticks(regs);
#ifdef CONFIG_PPC_ISERIES
if (hvlpevent_is_pending())
@@ -957,6 +1049,9 @@ void __init time_init(void)
tb_ticks_per_usec = ppc_tb_freq / 1000000;
tb_to_us = mulhwu_scale_factor(ppc_tb_freq, 1000000);
calc_cputime_factors();
+#ifdef CONFIG_NO_IDLE_HZ
+ max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
+#endif
/*
* Calculate the length of each tick in ns. It will not be
diff -puN arch/powerpc/kernel/irq.c~no_idle_hz arch/powerpc/kernel/irq.c
--- linux-2.6.17-rc1/arch/powerpc/kernel/irq.c~no_idle_hz 2006-04-09 10:40:58.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/kernel/irq.c 2006-04-09 10:40:59.000000000 +0530
@@ -60,6 +60,7 @@
#ifdef CONFIG_PPC_ISERIES
#include <asm/paca.h>
#endif
+#include <asm/time.h>
int __irq_offset_value;
#ifdef CONFIG_PPC32
@@ -189,6 +190,8 @@ void do_IRQ(struct pt_regs *regs)
irq_enter();
+ start_hz_timer(regs);
+
#ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 2KB free? */
{
diff -puN include/asm-powerpc/time.h~no_idle_hz include/asm-powerpc/time.h
--- linux-2.6.17-rc1/include/asm-powerpc/time.h~no_idle_hz 2006-04-09 10:40:59.000000000 +0530
+++ linux-2.6.17-rc1-root/include/asm-powerpc/time.h 2006-04-09 10:40:59.000000000 +0530
@@ -198,6 +198,14 @@ static inline unsigned long tb_ticks_sin
return get_tbl() - tstamp;
}
+#ifdef CONFIG_NO_IDLE_HZ
+extern void stop_hz_timer(void);
+extern void start_hz_timer(struct pt_regs *);
+#else
+static inline void stop_hz_timer(void) { }
+static inline void start_hz_timer(struct pt_regs *regs) { }
+#endif
+
#define mulhwu(x,y) \
({unsigned z; asm ("mulhwu %0,%1,%2" : "=r" (z) : "r" (x), "r" (y)); z;})
diff -puN arch/powerpc/Kconfig~no_idle_hz arch/powerpc/Kconfig
--- linux-2.6.17-rc1/arch/powerpc/Kconfig~no_idle_hz 2006-04-09 10:40:59.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/Kconfig 2006-04-09 10:40:59.000000000 +0530
@@ -593,6 +593,12 @@ config HOTPLUG_CPU
Say N if you are unsure.
+config NO_IDLE_HZ
+ depends on EXPERIMENTAL && (PPC_PSERIES || PPC_PMAC || PPC_MAPLE)
+ bool "Switch off timer ticks on idle CPUs"
+ help
+ Switches the HZ timer interrupts off when a CPU is idle.
+
config KEXEC
bool "kexec system call (EXPERIMENTAL)"
depends on PPC_MULTIPLATFORM && EXPERIMENTAL
diff -puN arch/powerpc/kernel/traps.c~no_idle_hz arch/powerpc/kernel/traps.c
--- linux-2.6.17-rc1/arch/powerpc/kernel/traps.c~no_idle_hz 2006-04-09 10:40:59.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/kernel/traps.c 2006-04-09 10:40:59.000000000 +0530
@@ -875,6 +875,7 @@ void altivec_unavailable_exception(struc
void performance_monitor_exception(struct pt_regs *regs)
{
+ start_hz_timer(regs);
perf_irq(regs);
}
diff -puN arch/powerpc/platforms/pseries/setup.c~no_idle_hz arch/powerpc/platforms/pseries/setup.c
--- linux-2.6.17-rc1/arch/powerpc/platforms/pseries/setup.c~no_idle_hz 2006-04-09 10:40:59.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/platforms/pseries/setup.c 2006-04-09 10:40:59.000000000 +0530
@@ -463,8 +463,10 @@ static void pseries_dedicated_idle_sleep
* very low priority. The cede enables interrupts, which
* doesn't matter here.
*/
- if (!lppaca[cpu ^ 1].idle || poll_pending() == H_PENDING)
+ if (!lppaca[cpu ^ 1].idle || poll_pending() == H_PENDING) {
+ stop_hz_timer();
cede_processor();
+ }
out:
HMT_medium();
@@ -479,6 +481,8 @@ static void pseries_shared_idle_sleep(vo
*/
get_lppaca()->idle = 1;
+ stop_hz_timer();
+
/*
* Yield the processor to the hypervisor. We return if
* an external interrupt occurs (which are driven prior
diff -puN arch/powerpc/kernel/idle_power4.S~no_idle_hz arch/powerpc/kernel/idle_power4.S
--- linux-2.6.17-rc1/arch/powerpc/kernel/idle_power4.S~no_idle_hz 2006-04-09 10:40:59.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/kernel/idle_power4.S 2006-04-10 14:50:36.000000000 +0530
@@ -30,6 +30,11 @@ END_FTR_SECTION_IFCLR(CPU_FTR_CAN_NAP)
cmpwi 0,r4,0
beqlr
+ mflr r0
+ std r0,16(r1)
+ bl .stop_hz_timer
+ ld r0,16(r1)
+ mtlr r0
/* Go to NAP now */
BEGIN_FTR_SECTION
DSSALL
_
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] tickless idle cpus: allow boot cpu to skip ticks
2006-04-07 23:04 ` Paul Mackerras
2006-04-10 11:49 ` Srivatsa Vaddagiri
2006-04-10 12:18 ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
@ 2006-04-10 12:19 ` Srivatsa Vaddagiri
2 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2006-04-10 12:19 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev
This patch (version 2) lets boot cpu to skip ticks. Tested against
2.6.17-rc1-mm1.
Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
---
linux-2.6.17-rc1-root/arch/powerpc/kernel/time.c | 71 ++++++++++++++++++++---
1 file changed, 63 insertions(+), 8 deletions(-)
diff -puN arch/powerpc/kernel/time.c~boot_cpu_fix arch/powerpc/kernel/time.c
--- linux-2.6.17-rc1/arch/powerpc/kernel/time.c~boot_cpu_fix 2006-04-10 17:43:11.000000000 +0530
+++ linux-2.6.17-rc1-root/arch/powerpc/kernel/time.c 2006-04-10 17:44:32.000000000 +0530
@@ -637,6 +637,39 @@ static void iSeries_tb_recal(void)
static void account_ticks(struct pt_regs *regs);
+static spinlock_t do_timer_cpulock = SPIN_LOCK_UNLOCKED;
+static int do_timer_cpu; /* Which CPU should call do_timer? */
+
+static int __devinit do_timer_cpucallback(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ int cpu = (long)hcpu;
+
+ switch (action) {
+ case CPU_DOWN_PREPARE:
+ spin_lock(&do_timer_cpulock);
+ if (do_timer_cpu == cpu) {
+ cpumask_t tmpmask;
+ int new_cpu;
+
+ cpus_complement(tmpmask, nohz_cpu_mask);
+ cpu_clear(cpu, tmpmask);
+ new_cpu = any_online_cpu(tmpmask);
+ if (new_cpu != NR_CPUS)
+ do_timer_cpu = new_cpu;
+ }
+ spin_unlock(&do_timer_cpulock);
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata do_timer_notifier =
+{
+ .notifier_call = do_timer_cpucallback
+};
+
/* Returns 1 if this CPU was set in the mask */
static inline int clear_hzless_mask(void)
{
@@ -645,8 +678,12 @@ static inline int clear_hzless_mask(void
if (unlikely(cpu_isset(cpu, nohz_cpu_mask))) {
cpu_clear(cpu, nohz_cpu_mask);
- rc = 1;
- }
+ spin_lock(&do_timer_cpulock);
+ if (do_timer_cpu == NR_CPUS)
+ do_timer_cpu = cpu;
+ spin_unlock(&do_timer_cpulock);
+ rc = 1;
+ }
return rc;
}
@@ -684,6 +721,15 @@ void stop_hz_timer(void)
return;
}
+ spin_lock(&do_timer_cpulock);
+ if (do_timer_cpu == cpu) {
+ cpumask_t tmpmask;
+
+ cpus_complement(tmpmask, nohz_cpu_mask);
+ do_timer_cpu = any_online_cpu(tmpmask);
+ }
+ spin_unlock(&do_timer_cpulock);
+
do {
seq = read_seqbegin(&xtime_lock);
@@ -716,6 +762,7 @@ void start_hz_timer(struct pt_regs *regs
#else
static inline int clear_hzless_mask(void) { return 0;}
+#define do_timer_cpu boot_cpuid
#endif
static void account_ticks(struct pt_regs *regs)
@@ -742,16 +789,15 @@ static void account_ticks(struct pt_regs
if (!cpu_is_offline(cpu))
account_process_time(regs);
- /*
- * No need to check whether cpu is offline here; boot_cpuid
- * should have been fixed up by now.
- */
- if (cpu != boot_cpuid)
+ if (cpu != do_timer_cpu)
continue;
write_seqlock(&xtime_lock);
tb_last_jiffy += tb_ticks_per_jiffy;
- tb_last_stamp = per_cpu(last_jiffy, cpu);
+ tb_last_stamp += tb_ticks_per_jiffy;
+ /* Handle RTCL overflow on 601 */
+ if (__USE_RTC() && tb_last_stamp >= 1000000000)
+ tb_last_stamp -= 1000000000;
do_timer(regs);
timer_recalc_offset(tb_last_jiffy);
timer_check_rtc();
@@ -836,6 +882,13 @@ void __init smp_space_timers(unsigned in
unsigned long offset = tb_ticks_per_jiffy / max_cpus;
unsigned long previous_tb = per_cpu(last_jiffy, boot_cpuid);
+#ifdef CONFIG_NO_IDLE_HZ
+ /* Don't space timers - we want to let any CPU call do_timer to
+ * increment xtime.
+ */
+ half = offset = 0;
+#endif
+
/* make sure tb > per_cpu(last_jiffy, cpu) for all cpus always */
previous_tb -= tb_ticks_per_jiffy;
/*
@@ -1051,6 +1104,8 @@ void __init time_init(void)
calc_cputime_factors();
#ifdef CONFIG_NO_IDLE_HZ
max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
+ do_timer_cpu = boot_cpuid;
+ register_cpu_notifier(&do_timer_notifier);
#endif
/*
_
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tickless idle cpus: core patch - v2
2006-04-10 12:18 ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
@ 2006-04-11 17:35 ` Paul Mackerras
2006-04-12 4:50 ` Srivatsa Vaddagiri
2006-04-21 10:49 ` Paul Mackerras
1 sibling, 1 reply; 9+ messages in thread
From: Paul Mackerras @ 2006-04-11 17:35 UTC (permalink / raw)
To: vatsa; +Cc: linuxppc-dev
Srivatsa Vaddagiri writes:
> This is the v2 of the core patch to skip ticks when a CPU is idle.
> Changes since v1:
>
> - fix the buggy call to stop_hz_timer in idle_power4.S (hopefully it
> is correct now!).
It would be nice if we could arrange to call stop_hz_timer from the
top-level cpu_idle() function rather than having to call it from the
individual power_save() functions such as power4_idle(). Can you see
a problem with doing that?
Paul.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tickless idle cpus: core patch - v2
2006-04-11 17:35 ` Paul Mackerras
@ 2006-04-12 4:50 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2006-04-12 4:50 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev
On Wed, Apr 12, 2006 at 03:35:20AM +1000, Paul Mackerras wrote:
> It would be nice if we could arrange to call stop_hz_timer from the
> top-level cpu_idle() function rather than having to call it from the
> individual power_save() functions such as power4_idle(). Can you see
> a problem with doing that?
I had considered doing that, but one problem with it is - how do we ensure that
start_hz_timer will be called before idle thread calls schedule? A problem
scenario is when the power_save() function returns without taking an interrupt
(as is possible in pseries_dedicated_idle_sleep?), since start_hz_timer is
currently called from only an interrupt context.
Now we could contemplate calling start_hz_timer directly from cpu_idle
when power_save() function returns - but how do we get the register
context required as an argument in start_hz_timer()?
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tickless idle cpus: core patch - v2
2006-04-10 12:18 ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
2006-04-11 17:35 ` Paul Mackerras
@ 2006-04-21 10:49 ` Paul Mackerras
2006-04-24 15:39 ` Srivatsa Vaddagiri
1 sibling, 1 reply; 9+ messages in thread
From: Paul Mackerras @ 2006-04-21 10:49 UTC (permalink / raw)
To: vatsa; +Cc: linuxppc-dev
Srivatsa Vaddagiri writes:
> This is the v2 of the core patch to skip ticks when a CPU is idle.
...
> +/* Returns 1 if this CPU was set in the mask */
> +static inline int clear_hzless_mask(void)
> +{
> + int cpu = smp_processor_id();
> + int rc = 0;
> +
> + if (unlikely(cpu_isset(cpu, nohz_cpu_mask))) {
> + cpu_clear(cpu, nohz_cpu_mask);
Is the nohz_cpu_mask accessed by other cpus? I wonder if using a
1-byte per-cpu variable for this, or even a bit in the thread_info
struct, might be better because it would reduce the number of atomic
bit set/clear operations we have to do.
> +#define MAX_DEC_COUNT UINT_MAX /* Decrementer is 32-bit */
The decrementer value should be restricted to INT_MAX. I think some
implementations will take a decrementer interrupt if you set the
decrementer to a negative value.
> +static void account_ticks(struct pt_regs *regs)
> +{
> + int next_dec;
> + int cpu = smp_processor_id();
> + unsigned long ticks;
> +
> while ((ticks = tb_ticks_since(per_cpu(last_jiffy, cpu)))
> >= tb_ticks_per_jiffy) {
> /* Update last_jiffy */
This is just the loop body from timer_interrupt, right? Why do we
have to loop around N times after we skipped N ticks? What we're
doing inside the loop is:
- account_process_time, but we know we were in the idle task, so the
time is just idle time. If we have the accurate accounting stuff
turned on the first call to account_process_time will account all
the idle time anyway.
- we're not skipping ticks on the boot cpu (yet), so we won't do the
do_timer and timer_recalc_offset calls.
I think we could probably rearrange this code to eliminate the need
for the regs argument - all it's used for is telling whether we were
in user mode, and we know we weren't since we were in the idle task.
That would mean that we maybe could call start_hz_timer from the idle
task body instead of inside do_IRQ etc. The only thing we'd have to
watch out for is updating the decrementer if some interrupt handler
called add_timer/mod_timer etc.
Assuming we make the changes we have discussed to reduce the skewing
of the decrementer interrupts quite small, and allow any cpu to call
do_timer, then how where you thinking that xtime and the NTP stuff
would get updated, in the situation where all cpus are skipping ticks?
By doing N calls to do_timer in a row? Or by adding N-1 to jiffies_64
and then calling do_timer once?
> +#ifdef CONFIG_NO_IDLE_HZ
> + max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
> +#endif
If we allow the boot cpu to skip ticks, then we will have to make sure
that at least one cpu wakes up in time to do the updating in
recalc_timer_offset.
Paul.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] tickless idle cpus: core patch - v2
2006-04-21 10:49 ` Paul Mackerras
@ 2006-04-24 15:39 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 9+ messages in thread
From: Srivatsa Vaddagiri @ 2006-04-24 15:39 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev
Paul,
At the outset I am slightly confused on whether you are
proposing to allow boot cpus to skip ticks or not. A clarification on
this will help!
On Fri, Apr 21, 2006 at 08:49:48PM +1000, Paul Mackerras wrote:
> Is the nohz_cpu_mask accessed by other cpus? I wonder if using a
Yes, nohz_cpu_mask can be accessed by other CPUs, as part of RCU batch
processing.
> 1-byte per-cpu variable for this, or even a bit in the thread_info
> struct, might be better because it would reduce the number of atomic
> bit set/clear operations we have to do.
Agreed that updating a global bitmask is not a scalable thing to do. But
unfortunately this is tied to RCU implementation ATM. I know that Dipankar has
plans to get rid of nohz_cpu_mask in the RCU implementation. When that happens,
maybe we can revist this.
> > +#define MAX_DEC_COUNT UINT_MAX /* Decrementer is 32-bit */
>
> The decrementer value should be restricted to INT_MAX. I think some
> implementations will take a decrementer interrupt if you set the
> decrementer to a negative value.
Ok. Thanks for pointing it out.
> > +static void account_ticks(struct pt_regs *regs)
> > +{
> > + int next_dec;
> > + int cpu = smp_processor_id();
> > + unsigned long ticks;
> > +
> > while ((ticks = tb_ticks_since(per_cpu(last_jiffy, cpu)))
> > >= tb_ticks_per_jiffy) {
> > /* Update last_jiffy */
>
> This is just the loop body from timer_interrupt, right? Why do we
Correct. Note that this loop body is common to both timer_interrupt and
start_hz_timer currently.
> have to loop around N times after we skipped N ticks? What we're
> doing inside the loop is:
>
> - account_process_time, but we know we were in the idle task, so the
> time is just idle time. If we have the accurate accounting stuff
> turned on the first call to account_process_time will account all
> the idle time anyway.
Do you mean to say that "the first call to account_system_vtime will account
all the idle time"? AFAICS account_process_time (->account_process_vtime) is
accounting just usertime.
I had a related question: if we put the idle CPU to some power-saving
state in the period that it is tickless, would its PURR value keep
getting incremented? If not, would that affect the idle time we
calculate after skipping N ticks (in case accurate acct is turned ON
that is)?
> - we're not skipping ticks on the boot cpu (yet), so we won't do the
> do_timer and timer_recalc_offset calls.
Don't follow you here. I thought we wanted to let the boot cpu to skip
ticks too (as you seem to indicate down below - while talking of
updating xtime/NTP)? If we want to allow that, then we need to be able to
call do_timer with a regs argument?
> I think we could probably rearrange this code to eliminate the need
> for the regs argument - all it's used for is telling whether we were
> in user mode, and we know we weren't since we were in the idle task.
> That would mean that we maybe could call start_hz_timer from the idle
> task body instead of inside do_IRQ etc. The only thing we'd have to
> watch out for is updating the decrementer if some interrupt handler
> called add_timer/mod_timer etc.
One problem in deferring the call to start_hz_timer like this is RCU. Ideally
we want to clear the tickless idle-CPU from nohz_cpu_mask -as soon- as it
starts processing an interrupt. Otherwise RCU will think that the CPU is in
tickless state (hence not accessing RCU-protected data structures) while the
idle-CPU is processing interrupts/softirqs and will not include the CPU in its
grace-period processing. This may be a bad thing to allow.
The other problem in deferring a call to start_hz_timer is we could cause them
to be running with an incorrect jiffy value (can happen if all cpus were
skipping ticks for a period).
IMO we should let start_hz_timer be called as it is being called now, i.e
immediately when a tickless idle-CPU wakes up. This would also
unfortunately mean that we need to embed calls to start_hz_timer from all
the interrupt paths (do_IRQ, performance_monitor_exception etc) where we
come out of tickless state.
> Assuming we make the changes we have discussed to reduce the skewing
> of the decrementer interrupts quite small, and allow any cpu to call
> do_timer, then how where you thinking that xtime and the NTP stuff
> would get updated, in the situation where all cpus are skipping ticks?
> By doing N calls to do_timer in a row? Or by adding N-1 to jiffies_64
> and then calling do_timer once?
I assume this implies we want to let boot-cpu to skip ticks?
Ideally it would be good if we add N-1 to jiffies_64 and call do_timer
once. But I wonder if it can be done w/o being racy. In order to calculate
N, we may calculate the difference, delta, between current timebase and
tb_last_stamp as:
delta = tb_ticks_since(tb_last_stamp); -> Step 1
We may then calculate N as:
N = delta / tb_ticks_per_jiffy; -> Step 2
However this will be racy, if we happen to sample timebase (Step 1) just
before a jiffy boundary. In which case, we would have missed to update
jiffy by 1? To avoid this, we may need to check for the corner case and
introduce a 'goto retry' loop?
BTW, I realized that my patch to let boot cpu to skip ticks also is
slightly buggy here:
if (cpu != do_timer_cpu)
continue;
write_seqlock(&xtime_lock);
...
do_timer(regs);
...
write_sequnlock(&xtime_lock);
There needs to a different while loop to call do_timer based on
tb_ticks_since(tb_last_stamp). Will fix it next time around.
>
> > +#ifdef CONFIG_NO_IDLE_HZ
> > + max_skip = __USE_RTC() ? HZ : MAX_DEC_COUNT / tb_ticks_per_jiffy;
> > +#endif
>
> If we allow the boot cpu to skip ticks, then we will have to make sure
> that at least one cpu wakes up in time to do the updating in
> recalc_timer_offset.
Good point. However is this critical, especially if all cpus had been
idle for quite a while? As I understand, if we don't wakeup in time to
do timer_recalc_offset, the only drawback is the first gettimeofday will
be slow (because it has make a system call)?
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2006-04-24 15:41 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-07 6:30 [PATCH 1/4] tickless idle cpu - Allow any CPU to update jiffies Srivatsa Vaddagiri
2006-04-07 23:04 ` Paul Mackerras
2006-04-10 11:49 ` Srivatsa Vaddagiri
2006-04-10 12:18 ` [PATCH 1/2] tickless idle cpus: core patch - v2 Srivatsa Vaddagiri
2006-04-11 17:35 ` Paul Mackerras
2006-04-12 4:50 ` Srivatsa Vaddagiri
2006-04-21 10:49 ` Paul Mackerras
2006-04-24 15:39 ` Srivatsa Vaddagiri
2006-04-10 12:19 ` [PATCH 2/2] tickless idle cpus: allow boot cpu to skip ticks Srivatsa Vaddagiri
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).