CONFIG_NO_HZ breaks blktrace timestamps

linux-btrace.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* CONFIG_NO_HZ breaks blktrace timestamps
@ 2008-01-09 22:48 David Dillow
  2008-01-10 20:25 ` David Dillow
  0 siblings, 1 reply; 37+ messages in thread
From: David Dillow @ 2008-01-09 22:48 UTC (permalink / raw)
  To: linux-btrace; +Cc: linux-kernel

While trying to gain some insight into a disk issue, I found that
blktrace/blkparse was giving me bogus traces -- I was seeing requests
complete before they were even dispatched or queued even! I had thought
that maybe this was an issue with SMP on the box, but when running with
'maxcpus=1', it told me that my 53 second test run only took 3.5
seconds.

I started tracking this down, and upon looking at cpu_clock(), and found
that it uses sched_clock(), which is based on jiffies. At this point I
had an ahah! moment and remembered that I had NO_HZ enabled.

Turning that off got blktrace to report the proper time for the test
run.

I'm not sure of the proper fix, nor am I sure I'm going to have much
time to track this down at the momenet, but I thought someone should
know.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-09 22:48 CONFIG_NO_HZ breaks blktrace timestamps David Dillow
@ 2008-01-10 20:25 ` David Dillow
  2008-01-10 22:44   ` Guillaume Chazarain
  2008-01-11  9:34   ` Ingo Molnar
  0 siblings, 2 replies; 37+ messages in thread
From: David Dillow @ 2008-01-10 20:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-btrace, mingo, tglx

Ingo, Thomas added as I think this is related to
sched.c:__update_rq_clock()'s checking for forward time warps.

On Wed, 2008-01-09 at 17:48 -0500, David Dillow wrote:
> While trying to gain some insight into a disk issue, I found that
> blktrace/blkparse was giving me bogus traces -- I was seeing requests
> complete before they were even dispatched or queued even! I had thought
> that maybe this was an issue with SMP on the box, but when running with
> 'maxcpus=1', it told me that my 53 second test run only took 3.5
> seconds.
> 
> I started tracking this down, and upon looking at cpu_clock(), and found
> that it uses sched_clock(), which is based on jiffies. At this point I
> had an ahah! moment and remembered that I had NO_HZ enabled.

[I did figure out that the sched_clock() jiffies implementation in
sched.c is a fallback.]

A few pieces of info I forgot in the original message -- this is on an
quad-processor, dual core Opteron box, running 2.6.24-rc7 x86_64. I'm
currently booting it with maxcpus=1, though at this point it is just a
hold-over from my initial bug hunting.

I can provide a full .config/dmesg if needed, but off the top:
CONFIG_NO_HZ=y
CONFIG_HZ\x1000
CONFIG_CPU_FREQ is not set
CONFIG_CPU_IDLE makes no difference

hpet is the current clocksource

When booting with "nohz=off", rq->clock and rq->prev_clock_raw
from /proc/sched_debug track nicely with ktime_get() ("now is at ...
msecs"). rq->clock_overflows is non-zero, but increases relatively
slowly -- 650 for ~355 seconds.

With "nohz=on", rq->prev_clock_raw still tracks nicely with ktime_get(),
but rq->clock is moving very slowly and rq->clock_overflows is
incrementing fairly rapidly -- 53718 for ~357 seconds.

Either way gives a rq->max_delta of 0.999844 -- ms, I presume.

rq->clock_overflows is only incremented in sched.c:__update_rq_clock(),
and only when delta pushes clock more than TICK_NSEC past the current
tick_timestamp. I'm assuming this happens when the CPU's been idle for a
bit, with everything waiting for IO (4 to 5ms in the ticked blktrace),
so there's been no updates of rq->clock from sched_clock().

At the moment, I'm not sure how to track this farther, or how to fix it
properly. Any advice would be appreciated.

Thanks!
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-10 20:25 ` David Dillow
@ 2008-01-10 22:44   ` Guillaume Chazarain
  2008-01-11  3:04     ` David Dillow
  2008-01-11 10:41     ` Guillaume Chazarain
  2008-01-11  9:34   ` Ingo Molnar
  1 sibling, 2 replies; 37+ messages in thread
From: Guillaume Chazarain @ 2008-01-10 22:44 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-kernel, linux-btrace, mingo, tglx

David Dillow <dillowda@ornl.gov> wrote:

> At the moment, I'm not sure how to track this farther, or how to fix it
> properly. Any advice would be appreciated.

Just out of curiosity, could you try the appended cumulative patch and
report .clock_warps, .clock_overflows and .clock_underflows as you did.

Thanks.

commit 20fa02359d971bdb820d238184fabd42d8018e4f
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Thu Jan 10 23:36:43 2008 +0100

    sched: monitor clock underflows in /proc/sched_debug
    
    We monitor clock overflows, let's also monitor clock underflows.
    
    Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>

diff --git a/kernel/sched.c b/kernel/sched.c
index 37cf07a..cab9756 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -317,7 +317,7 @@ struct rq {
 	u64 clock, prev_clock_raw;
 	s64 clock_max_delta;
 
-	unsigned int clock_warps, clock_overflows;
+	unsigned int clock_warps, clock_overflows, clock_underflows;
 	u64 idle_clock;
 	unsigned int clock_deep_idle_events;
 	u64 tick_timestamp;
@@ -3485,8 +3485,10 @@ void scheduler_tick(void)
 	/*
 	 * Let rq->clock advance by at least TICK_NSEC:
 	 */
-	if (unlikely(rq->clock < next_tick))
+	if (unlikely(rq->clock < next_tick)) {
 		rq->clock = next_tick;
+		rq->clock_underflows++;
+	}
 	rq->tick_timestamp = rq->clock;
 	update_cpu_load(rq);
 	if (curr != rq->idle) /* FIXME: needed? */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 80fbbfc..9e5de09 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -179,6 +179,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 	PN(prev_clock_raw);
 	P(clock_warps);
 	P(clock_overflows);
+	P(clock_underflows);
 	P(clock_deep_idle_events);
 	PN(clock_max_delta);
 	P(cpu_load[0]);

commit c146421cae64bb626714dc951fa39b55d2f819c1
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Wed Jan 2 14:10:17 2008 +0100

    commit 60c6397ce4e8c9fd7feaeaef4167ace71c3949c8
    
        x86: scale cyc_2_nsec according to CPU frequency
    
        scale the sched_clock() cyc_2_nsec scaling factor according to
        CPU frequency changes.
    
        [ mingo@elte.hu: simplified it and fixed it for SMP. ]
    
        Signed-off-by: Ingo Molnar <mingo@elte.hu>
        Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/arch/x86/kernel/tsc_32.c b/arch/x86/kernel/tsc_32.c
index 9ebc0da..00bb4c1 100644
--- a/arch/x86/kernel/tsc_32.c
+++ b/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	/*
+	 * Start smoothly with the new frequency:
+	 */
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_block *nb, unsigned long val, void *data)
 						ref_freq, freq->new);
 			if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
 				tsc_khz = cpu_khz;
-				set_cyc2ns_scale(cpu_khz);
+				preempt_disable();
+				set_cyc2ns_scale(cpu_khz, smp_processor_id());
+				preempt_enable();
 				/*
 				 * TSC based sched_clock turns
 				 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_reliable(void) { }
 
 void __init tsc_init(void)
 {
+	int cpu;
+
 	if (!cpu_has_tsc || tsc_disable)
 		goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
 				(unsigned long)cpu_khz / 1000,
 				(unsigned long)cpu_khz % 1000);
 
-	set_cyc2ns_scale(cpu_khz);
+	/*
+	 * Secondary CPUs do not run through tsc_init(), so set up
+	 * all the scale factors for all CPUs, assuming the same
+	 * speed as the bootup CPU. (cpufreq notifiers will fix this
+	 * up if their speed diverges)
+	 */
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(cpu_khz, cpu);
+
 	use_tsc_delay();
 
 	/* Check and install the TSC clocksource */
diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c
index 9c70af4..32edd2c 100644
--- a/arch/x86/kernel/tsc_64.c
+++ b/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,48 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math (suggested by george@mvista.com) to get:
+ *		ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *		ns = cycles * cyc2ns_scale / SC
+ *
+ *	And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 unsigned long long sched_clock(void)
@@ -100,7 +133,9 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz_ref);
+	preempt_disable();
+	set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+	preempt_enable();
 
 	return 0;
 }
@@ -151,7 +186,7 @@ static unsigned long __init tsc_read_refs(unsigned long *pm,
 void __init tsc_calibrate(void)
 {
 	unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
-	int hpet = is_hpet_enabled();
+	int hpet = is_hpet_enabled(), cpu;
 
 	local_irq_save(flags);
 
@@ -206,7 +241,9 @@ void __init tsc_calibrate(void)
 	}
 
 	tsc_khz = tsc2 / tsc1;
-	set_cyc2ns_scale(tsc_khz);
+
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(tsc_khz, cpu);
 }
 
 /*
diff --git a/include/asm-x86/timer.h b/include/asm-x86/timer.h
index 0db7e99..4f6fcb0 100644
--- a/include/asm-x86/timer.h
+++ b/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(cyc);
+	local_irq_restore(flags);
+
+	return ns;
+}
 
 #endif

commit 235fdf2d607e7139d3a7392506996a476229d12e
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Thu Jan 10 23:36:44 2008 +0100

    sched: fix rq->clock warps on frequency changes
    
    Fix 2bacec8c318ca0418c0ee9ac662ee44207765dd4
    (sched: touch softlockup watchdog after idling) that reintroduced warps
    on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
    that checks rq->clock for warps, so call it after adjusting rq->clock.
    
    Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>

diff --git a/kernel/sched.c b/kernel/sched.c
index cab9756..eee610d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -668,7 +668,6 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)
 	struct rq *rq = cpu_rq(smp_processor_id());
 	u64 now = sched_clock();
 
-	touch_softlockup_watchdog();
 	rq->idle_clock += delta_ns;
 	/*
 	 * Override the previous timestamp and ignore all
@@ -680,6 +679,7 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)
 	rq->prev_clock_raw = now;
 	rq->clock += delta_ns;
 	spin_unlock(&rq->lock);
+	touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 

commit 241a8c3fb4d6fd452518a8e4e8eeabe9696c82aa
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Thu Jan 10 23:36:45 2008 +0100

    sched: Fix rq->clock overflows detection with CONFIG_NO_HZ
    
    When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
    We check that the number of skipped ticks matches the clock jump seen in
    __update_rq_clock().
    
    
    Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>

diff --git a/kernel/sched.c b/kernel/sched.c
index eee610d..6f1271d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -288,6 +288,7 @@ struct rq {
 	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
 	unsigned char idle_at_tick;
 #ifdef CONFIG_NO_HZ
+	unsigned long last_tick_seen;
 	unsigned char in_nohz_recently;
 #endif
 	/* capture load from *all* tasks on this cpu: */
@@ -379,6 +380,32 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+#ifdef CONFIG_NO_HZ
+static inline bool nohz_on(int cpu)
+{
+	return tick_get_tick_sched(cpu)->nohz_mode != NOHZ_MODE_INACTIVE;
+}
+
+static inline u64 max_skipped_ticks(struct rq *rq)
+{
+	return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
+}
+
+static inline void update_last_tick_seen(struct rq *rq)
+{
+	rq->last_tick_seen = jiffies;
+}
+#else
+static inline u64 max_skipped_ticks(struct rq *rq)
+{
+	return 1;
+}
+
+static inline void update_last_tick_seen(struct rq *rq)
+{
+}
+#endif
+
 /*
  * Update the per-runqueue clock, as finegrained as the platform can give
  * us, but without assuming monotonicity, etc.:
@@ -403,9 +430,12 @@ static void __update_rq_clock(struct rq *rq)
 		/*
 		 * Catch too large forward jumps too:
 		 */
-		if (unlikely(clock + delta > rq->tick_timestamp + TICK_NSEC)) {
-			if (clock < rq->tick_timestamp + TICK_NSEC)
-				clock = rq->tick_timestamp + TICK_NSEC;
+		u64 max_jump = max_skipped_ticks(rq) * TICK_NSEC;
+		u64 max_time = rq->tick_timestamp + max_jump;
+
+		if (unlikely(clock + delta > max_time)) {
+			if (clock < max_time)
+				clock = max_time;
 			else
 				clock++;
 			rq->clock_overflows++;
@@ -3490,6 +3520,7 @@ void scheduler_tick(void)
 		rq->clock_underflows++;
 	}
 	rq->tick_timestamp = rq->clock;
+	update_last_tick_seen(rq);
 	update_cpu_load(rq);
 	if (curr != rq->idle) /* FIXME: needed? */
 		curr->sched_class->task_tick(rq, curr);
@@ -6762,6 +6793,7 @@ void __init sched_init(void)
 		lockdep_set_class(&rq->lock, &rq->rq_lock_key);
 		rq->nr_running = 0;
 		rq->clock = 1;
+		update_last_tick_seen(rq);
 		init_cfs_rq(&rq->cfs, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);

commit 437f7a5480a0916e5b0877e04ca87f998ef3f3e2
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Thu Jan 10 23:36:45 2008 +0100

    sched: make sure jiffies is up to date before calling __update_rq_clock()
    
    Now that __update_rq_clock() uses jiffies to detect clock overflows, make sure
    jiffies are up to date before touch_softlockup_watchdog().
    
    Removed a touch_softlockup_watchdog() call becoming redundant with the added
    tick_nohz_update_jiffies().
    
    Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>

diff --git a/kernel/sched.c b/kernel/sched.c
index 6f1271d..f6b3160 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -63,6 +63,7 @@
 #include <linux/reciprocal_div.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
+#include <linux/tick.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -709,7 +710,7 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)
 	rq->prev_clock_raw = now;
 	rq->clock += delta_ns;
 	spin_unlock(&rq->lock);
-	touch_softlockup_watchdog();
+	tick_nohz_update_jiffies();
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index cb89fa8..2190593 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -133,14 +133,13 @@ void tick_nohz_update_jiffies(void)
 	if (!ts->tick_stopped)
 		return;
 
-	touch_softlockup_watchdog();
-
 	cpu_clear(cpu, nohz_cpu_mask);
 	now = ktime_get();
 
 	local_irq_save(flags);
 	tick_do_update_jiffies64(now);
 	local_irq_restore(flags);
+	touch_softlockup_watchdog();
 }
 
 /**

commit 63a885d40031190e17860c899446376b0fceefa8
Author: Guillaume Chazarain <guichaz@yahoo.fr>
Date:   Thu Jan 10 23:36:45 2008 +0100

    scheduler_tick() is not called every jiffies

diff --git a/kernel/sched.c b/kernel/sched.c
index f6b3160..9f73005 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -387,6 +387,11 @@ static inline bool nohz_on(int cpu)
 	return tick_get_tick_sched(cpu)->nohz_mode != NOHZ_MODE_INACTIVE;
 }
 
+static inline u64 min_skipped_ticks(struct rq *rq)
+{
+	return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen : 1;
+}
+
 static inline u64 max_skipped_ticks(struct rq *rq)
 {
 	return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
@@ -397,6 +402,11 @@ static inline void update_last_tick_seen(struct rq *rq)
 	rq->last_tick_seen = jiffies;
 }
 #else
+static inline u64 min_skipped_ticks(struct rq *rq)
+{
+	return 1;
+}
+
 static inline u64 max_skipped_ticks(struct rq *rq)
 {
 	return 1;
@@ -3498,25 +3508,23 @@ void account_steal_time(struct task_struct *p, cputime_t steal)
 }
 
 /*
- * This function gets called by the timer code, with HZ frequency.
+ * This function gets called by the timer code with HZ frequency or less when
+ * using CONFIG_NO_HZ.
  * We call it with interrupts disabled.
- *
- * It also gets called by the fork code, when changing the parent's
- * timeslices.
  */
 void scheduler_tick(void)
 {
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *curr = rq->curr;
-	u64 next_tick = rq->tick_timestamp + TICK_NSEC;
+	u64 next_tick = rq->tick_timestamp + min_skipped_ticks(rq) * TICK_NSEC;
 
 	spin_lock(&rq->lock);
 	__update_rq_clock(rq);
 	/*
-	 * Let rq->clock advance by at least TICK_NSEC:
+	 * Let rq->clock advance by at least the number of elapsed ticks:
 	 */
-	if (unlikely(rq->clock < next_tick)) {
+	if (unlikely(rq->clock < next_tick - nohz_on(cpu) * TICK_NSEC)) {
 		rq->clock = next_tick;
 		rq->clock_underflows++;
 	}


-- 
Guillaume

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-10 22:44   ` Guillaume Chazarain
@ 2008-01-11  3:04     ` David Dillow
  2008-01-11  9:07       ` Jens Axboe
  2008-01-11  9:44       ` Ingo Molnar
  2008-01-11 10:41     ` Guillaume Chazarain
  1 sibling, 2 replies; 37+ messages in thread
From: David Dillow @ 2008-01-11  3:04 UTC (permalink / raw)
  To: Guillaume Chazarain; +Cc: linux-kernel, linux-btrace, mingo, tglx


On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
> David Dillow <dillowda@ornl.gov> wrote:
> 
> > At the moment, I'm not sure how to track this farther, or how to fix it
> > properly. Any advice would be appreciated.
> 
> Just out of curiosity, could you try the appended cumulative patch and
> report .clock_warps, .clock_overflows and .clock_underflows as you did.

With those patches, CONFIG_NO_HZ works just fine.

Patched kernel, nohz=off:
now at 214257.820809 msecs
  .clock                         : 214212.727559
  .idle_clock                    : 0.000000
  .prev_clock_raw                : 244569.402345
  .clock_warps                   : 0
  .clock_overflows               : 577
  .clock_underflows              : 213887
  .clock_deep_idle_events        : 4
  .clock_max_delta               : 0.999830

Patched kernel, nohz=on:
now at 248931.524381 msecs
  .clock                         : 248745.808465
  .idle_clock                    : 0.000000
  .prev_clock_raw                : 270911.098507
  .clock_warps                   : 0
  .clock_overflows               : 69
  .clock_underflows              : 784
  .clock_deep_idle_events        : 4
  .clock_max_delta               : 100.639397

Running my disk test, blktrace is getting the proper timestamps now with
CONFIG_NO_HZ.

Thanks!
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  3:04     ` David Dillow
@ 2008-01-11  9:07       ` Jens Axboe
  2008-01-11  9:23         ` Ingo Molnar
  2008-01-11  9:28         ` nigel
  2008-01-11  9:44       ` Ingo Molnar
  1 sibling, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-11  9:07 UTC (permalink / raw)
  To: David Dillow; +Cc: Guillaume Chazarain, linux-kernel, linux-btrace, mingo, tglx

On Thu, Jan 10 2008, David Dillow wrote:
> 
> On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
> > David Dillow <dillowda@ornl.gov> wrote:
> > 
> > > At the moment, I'm not sure how to track this farther, or how to fix it
> > > properly. Any advice would be appreciated.
> > 
> > Just out of curiosity, could you try the appended cumulative patch and
> > report .clock_warps, .clock_overflows and .clock_underflows as you did.
> 
> With those patches, CONFIG_NO_HZ works just fine.
> 
> Patched kernel, nohz=off:
> now at 214257.820809 msecs
>   .clock                         : 214212.727559
>   .idle_clock                    : 0.000000
>   .prev_clock_raw                : 244569.402345
>   .clock_warps                   : 0
>   .clock_overflows               : 577
>   .clock_underflows              : 213887
>   .clock_deep_idle_events        : 4
>   .clock_max_delta               : 0.999830
> 
> Patched kernel, nohz=on:
> now at 248931.524381 msecs
>   .clock                         : 248745.808465
>   .idle_clock                    : 0.000000
>   .prev_clock_raw                : 270911.098507
>   .clock_warps                   : 0
>   .clock_overflows               : 69
>   .clock_underflows              : 784
>   .clock_deep_idle_events        : 4
>   .clock_max_delta               : 100.639397
> 
> Running my disk test, blktrace is getting the proper timestamps now with
> CONFIG_NO_HZ.

Thanks for reporting this. Guillaume, did you write this patch? We need
to get it into 2.6.24-rc7 asap. Let me know if I should take care of
that, or if it's already queued up elsewhere.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:07       ` Jens Axboe
@ 2008-01-11  9:23         ` Ingo Molnar
  2008-01-11  9:25           ` Jens Axboe
  2008-01-11  9:29           ` CONFIG_NO_HZ breaks " Ingo Molnar
  2008-01-11  9:28         ` nigel
  1 sibling, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx


* Jens Axboe <jens.axboe@oracle.com> wrote:

> Thanks for reporting this. Guillaume, did you write this patch? We 
> need to get it into 2.6.24-rc7 asap. Let me know if I should take care 
> of that, or if it's already queued up elsewhere.

they are from the scheduler git tree (except the first debug patch), but 
queued up for v2.6.25 at the moment.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:23         ` Ingo Molnar
@ 2008-01-11  9:25           ` Jens Axboe
  2008-01-11  9:42             ` Ingo Molnar
  2008-01-11  9:29           ` CONFIG_NO_HZ breaks " Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2008-01-11  9:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > Thanks for reporting this. Guillaume, did you write this patch? We 
> > need to get it into 2.6.24-rc7 asap. Let me know if I should take care 
> > of that, or if it's already queued up elsewhere.
> 
> they are from the scheduler git tree (except the first debug patch), but 
> queued up for v2.6.25 at the moment.

So this means that blktrace will be broken with CONFIG_NO_HZ for 2.6.24?
That's clearly a regression.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:25           ` Jens Axboe
@ 2008-01-11  9:42             ` Ingo Molnar
  2008-01-11  9:56               ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

* Jens Axboe <jens.axboe@oracle.com> wrote:

> > they are from the scheduler git tree (except the first debug patch), 
> > but queued up for v2.6.25 at the moment.
> 
> So this means that blktrace will be broken with CONFIG_NO_HZ for 
> 2.6.24? That's clearly a regression.

64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 32-bit 
too and it didnt happen in v2.6.23 32-bit then it's a regression.

all this comes from blktrace's original decision of using sched_clock()
:-) It's not a global timesource and it's not trivial to turn it into a
halfways usable global timesource.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:42             ` Ingo Molnar
@ 2008-01-11  9:56               ` Jens Axboe
  2008-01-11 10:29                 ` [patch] block: fix " Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2008-01-11  9:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > they are from the scheduler git tree (except the first debug patch), 
> > > but queued up for v2.6.25 at the moment.
> > 
> > So this means that blktrace will be broken with CONFIG_NO_HZ for 
> > 2.6.24? That's clearly a regression.
> 
> 64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 32-bit 
> too and it didnt happen in v2.6.23 32-bit then it's a regression.

If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of some
option that isn't immediately apparent, then it's a regression. Period.

> all this comes from blktrace's original decision of using sched_clock()
> :-) It's not a global timesource and it's not trivial to turn it into a
> halfways usable global timesource.

Hey, it was a high res time source and the only one easily available :)
I'm fine with using another timesource, I'll take suggestions or patches
any day!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [patch] block: fix blktrace timestamps
  2008-01-11  9:56               ` Jens Axboe
@ 2008-01-11 10:29                 ` Ingo Molnar
  2008-01-11 10:47                   ` Guillaume Chazarain
                                     ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11 10:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx


(David, could you try the patch further below - does it fix bkltrace 
timestamps too?)

* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Jan 11 2008, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > > they are from the scheduler git tree (except the first debug patch), 
> > > > but queued up for v2.6.25 at the moment.
> > > 
> > > So this means that blktrace will be broken with CONFIG_NO_HZ for 
> > > 2.6.24? That's clearly a regression.
> > 
> > 64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 
> > 32-bit too and it didnt happen in v2.6.23 32-bit then it's a 
> > regression.
> 
> If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of some 
> option that isn't immediately apparent, then it's a regression. 
> Period.

not completely correct. CONFIG_NO_HZ is a default-disabled option that 
became newly available on 64-bit x86. So if NO_HZ does not completely 
work on 64-bit, and if 32-bit works fine - which we dont know yet (my 
guess would be that it's similarly broken on the same box) then it's not 
a regression.

But even if it's not "technically" a regression, it's something we want 
to fix in .24 if we can, so i'm all with you Jens :)

and you can see that by looking at the patch. It does not unbreak 
something that 2.6.24 broke. It makes something work better that used to 
be so broken for such a long time. (sched_clock() based blktrace)

anyway, the most dangerous one of these patches has become well-tested 
meanwhile in x86.git. So maybe we can get this fix into v2.6.24. Maybe 
not. We'll see.

> > all this comes from blktrace's original decision of using sched_clock()
> > :-) It's not a global timesource and it's not trivial to turn it into a
> > halfways usable global timesource.
> 
> Hey, it was a high res time source and the only one easily available 
> :) I'm fine with using another timesource, I'll take suggestions or 
> patches any day!

Correction: it was not a high res time source, it was "the scheduler's 
per-cpu, non-exported, non-coherent, warps-and-jumps-like-hell high-res 
timesource that was intentionally called the _sched_ clock" ;-)

ktime_get() should have been used instead, which is a proper GTOD 
clocksource. The patch below implements this.

	Ingo

------------->
Subject: block: fix blktrace timestamps
From: Ingo Molnar <mingo@elte.hu>

David Dillow reported broken blktrace timestamps. The reason
is cpu_clock() which is not a global time source.

Fix bkltrace timestamps by using ktime_get() like the networking
code does for packet timestamps. This also removes a whole lot
of complexity from bkltrace.c and shrinks the code by 500 bytes:

   text    data     bss     dec     hex filename
   2888     124      44    3056     bf0 blktrace.o.before
   2390     116      44    2550     9f6 blktrace.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 block/blktrace.c |   69 +------------------------------------------------------
 1 file changed, 2 insertions(+), 67 deletions(-)

Index: linux-x86.q/block/blktrace.c
=================================--- linux-x86.q.orig/block/blktrace.c
+++ linux-x86.q/block/blktrace.c
@@ -25,7 +25,6 @@
 #include <linux/time.h>
 #include <asm/uaccess.h>
 
-static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, };
 static unsigned int blktrace_seq __read_mostly = 1;
 
 /*
@@ -41,7 +40,7 @@ static void trace_note(struct blk_trace 
 		const int cpu = smp_processor_id();
 
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
-		t->time = cpu_clock(cpu) - per_cpu(blk_trace_cpu_offset, cpu);
+		t->time = ktime_to_ns(ktime_get());
 		t->device = bt->dev;
 		t->action = action;
 		t->pid = pid;
@@ -159,7 +158,7 @@ void __blk_add_trace(struct blk_trace *b
 
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
 		t->sequence = ++(*sequence);
-		t->time = cpu_clock(cpu) - per_cpu(blk_trace_cpu_offset, cpu);
+		t->time = ktime_to_ns(ktime_get());
 		t->sector = sector;
 		t->bytes = bytes;
 		t->action = what;
@@ -506,73 +505,9 @@ void blk_trace_shutdown(struct request_q
 	}
 }
 
-/*
- * Average offset over two calls to cpu_clock() with a gettimeofday()
- * in the middle
- */
-static void blk_check_time(unsigned long long *t, int this_cpu)
-{
-	unsigned long long a, b;
-	struct timeval tv;
-
-	a = cpu_clock(this_cpu);
-	do_gettimeofday(&tv);
-	b = cpu_clock(this_cpu);
-
-	*t = tv.tv_sec * 1000000000 + tv.tv_usec * 1000;
-	*t -= (a + b) / 2;
-}
-
-/*
- * calibrate our inter-CPU timings
- */
-static void blk_trace_check_cpu_time(void *data)
-{
-	unsigned long long *t;
-	int this_cpu = get_cpu();
-
-	t = &per_cpu(blk_trace_cpu_offset, this_cpu);
-
-	/*
-	 * Just call it twice, hopefully the second call will be cache hot
-	 * and a little more precise
-	 */
-	blk_check_time(t, this_cpu);
-	blk_check_time(t, this_cpu);
-
-	put_cpu();
-}
-
-static void blk_trace_set_ht_offsets(void)
-{
-#if defined(CONFIG_SCHED_SMT)
-	int cpu, i;
-
-	/*
-	 * now make sure HT siblings have the same time offset
-	 */
-	preempt_disable();
-	for_each_online_cpu(cpu) {
-		unsigned long long *cpu_off, *sibling_off;
-
-		for_each_cpu_mask(i, per_cpu(cpu_sibling_map, cpu)) {
-			if (i = cpu)
-				continue;
-
-			cpu_off = &per_cpu(blk_trace_cpu_offset, cpu);
-			sibling_off = &per_cpu(blk_trace_cpu_offset, i);
-			*sibling_off = *cpu_off;
-		}
-	}
-	preempt_enable();
-#endif
-}
-
 static __init int blk_trace_init(void)
 {
 	mutex_init(&blk_tree_mutex);
-	on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1);
-	blk_trace_set_ht_offsets();
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 10:29                 ` [patch] block: fix " Ingo Molnar
@ 2008-01-11 10:47                   ` Guillaume Chazarain
  2008-01-11 10:50                     ` Ingo Molnar
  2008-01-11 12:28                   ` Jens Axboe
  2008-01-11 15:36                   ` David Dillow
  2 siblings, 1 reply; 37+ messages in thread
From: Guillaume Chazarain @ 2008-01-11 10:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, David Dillow, linux-kernel, linux-btrace, mingo, tglx

Ingo Molnar <mingo@elte.hu> wrote:

> Correction: it was not a high res time source, it was "the scheduler's 
> per-cpu, non-exported, non-coherent, warps-and-jumps-like-hell high-res 
> timesource that was intentionally called the _sched_ clock" ;-)

I think the warts of cpu_clock() are fixable, except maybe
unsynchronization on SMP which is harder.

-- 
Guillaume

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 10:47                   ` Guillaume Chazarain
@ 2008-01-11 10:50                     ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11 10:50 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: Jens Axboe, David Dillow, linux-kernel, linux-btrace, mingo, tglx


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> > Correction: it was not a high res time source, it was "the 
> > scheduler's per-cpu, non-exported, non-coherent, 
> > warps-and-jumps-like-hell high-res timesource that was intentionally 
> > called the _sched_ clock" ;-)
> 
> I think the warts of cpu_clock() are fixable, except maybe 
> unsynchronization on SMP which is harder.

yes - but in terms of v2.6.24, the bkltrace patch i sent is a lot less 
risky.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 10:29                 ` [patch] block: fix " Ingo Molnar
  2008-01-11 10:47                   ` Guillaume Chazarain
@ 2008-01-11 12:28                   ` Jens Axboe
  2008-01-11 12:42                     ` Jens Axboe
  2008-01-11 13:21                     ` Ingo Molnar
  2008-01-11 15:36                   ` David Dillow
  2 siblings, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-11 12:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> (David, could you try the patch further below - does it fix bkltrace 
> timestamps too?)
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Fri, Jan 11 2008, Ingo Molnar wrote:
> > > 
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > > they are from the scheduler git tree (except the first debug patch), 
> > > > > but queued up for v2.6.25 at the moment.
> > > > 
> > > > So this means that blktrace will be broken with CONFIG_NO_HZ for 
> > > > 2.6.24? That's clearly a regression.
> > > 
> > > 64-bit CONFIG_NO_HZ is a new feature in v2.6.24. If it happens on 
> > > 32-bit too and it didnt happen in v2.6.23 32-bit then it's a 
> > > regression.
> > 
> > If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of some 
> > option that isn't immediately apparent, then it's a regression. 
> > Period.
> 
> not completely correct. CONFIG_NO_HZ is a default-disabled option that 
> became newly available on 64-bit x86. So if NO_HZ does not completely 
> work on 64-bit, and if 32-bit works fine - which we dont know yet (my 
> guess would be that it's similarly broken on the same box) then it's not 
> a regression.

Ingo, it doesn't matter if the option is disabled by default or not!
The fact is that functionality foo works in 2.6.23 and doesn't in 2.6.24
because of something unrelated. And that IS a regression, no matter what
kind of word play you are doing here :-)

> But even if it's not "technically" a regression, it's something we want 
> to fix in .24 if we can, so i'm all with you Jens :)

That's good :)

> ktime_get() should have been used instead, which is a proper GTOD 
> clocksource. The patch below implements this.

Will give it a whirl, it looks promising indeed and gets rid of the ugly
cpu sync stuff. What is the cost of ktime_get() compared to
sched_clock()?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 12:28                   ` Jens Axboe
@ 2008-01-11 12:42                     ` Jens Axboe
  2008-01-11 13:21                     ` Ingo Molnar
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-11 12:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Jens Axboe wrote:
> > ktime_get() should have been used instead, which is a proper GTOD 
> > clocksource. The patch below implements this.
> 
> Will give it a whirl, it looks promising indeed and gets rid of the ugly
> cpu sync stuff. What is the cost of ktime_get() compared to
> sched_clock()?

Works for me - will apply it right now, we can always evaluate any
potential performance impact later. It's more important to have a
functional blktrace in 2.6.24.

Thanks Ingo!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 12:28                   ` Jens Axboe
  2008-01-11 12:42                     ` Jens Axboe
@ 2008-01-11 13:21                     ` Ingo Molnar
  2008-01-11 17:18                       ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11 13:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

* Jens Axboe <jens.axboe@oracle.com> wrote:

> > > If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of 
> > > some option that isn't immediately apparent, then it's a 
> > > regression. Period.
> > 
> > not completely correct. CONFIG_NO_HZ is a default-disabled option 
> > that became newly available on 64-bit x86. So if NO_HZ does not 
> > completely work on 64-bit, and if 32-bit works fine - which we dont 
> > know yet (my guess would be that it's similarly broken on the same 
> > box) then it's not a regression.
> 
> Ingo, it doesn't matter if the option is disabled by default or not! 
> The fact is that functionality foo works in 2.6.23 and doesn't in 
> 2.6.24 because of something unrelated. And that IS a regression, no 
> matter what kind of word play you are doing here :-)

argh, Jens. Really. I believe you stopped using your brain because 
CONFIG_BKL_IO_TRACE=y is affected by this bug and apparently you've got 
a weak spot for it ;)

Think about it another way then, in the context of another, real kernel 
feature we introduced in v2.6.24, namely CONFIG_CPU_IDLE=y:

- Fact: feature CONFIG_CPU_IDLE=y did not exist in 64-bit x86 in v2.6.23.
  It has known bugs but they are not flagged 'regressions' (because the 
  feature did not exist before and the option is default-disabled) hence 
  the feature can stay. Some fixes to it are too dangerous or too 
  unproven and are delayed to v2.6.25.

and now apply the same rule to CONFIG_NO_HZ:

- Fact: feature CONFIG_NO_HZ=y did not exist in 64-bit x86 in v2.6.23.
  It has known bugs but they are not flagged 'regressions' (because the 
  feature did not exist before and the option is default-disabled) hence 
  the feature can stay. Some fixes to it are too dangerous or too 
  unproven and are delayed to v2.6.25.

ok?

Yes, it's bad that this happened, and perhaps it _is_ a regression, but 
not for the reason you claim. [ It might be a regression if 32-bit 
blktrace has the same problem under NO_HZ for example _AND_ bkltrace 
worked perfectly on the same box on v2.6.23. ]

Kernel regressions have a _strict_ definition: they mean "anything that 
worked before will work in the future too". Not: "anything that worked 
before will work in the future too if you enable random new non-default 
kernel features".

[ If the latter was the criterium we'd probably never see new features
  integrated! New stuff has bugs, because it's not nearly as well-tested 
  as older stuff. What matters is to 1) not turn it on by default if it 
  has known bugs 2) to always make positive progress on it, i.e.: to not
  regress new features in future kernel releases. This way, in the ideal 
  case, we'll have a monotonic series towards a perfect, bug-free kernel 
  ;) ]

> > ktime_get() should have been used instead, which is a proper GTOD 
> > clocksource. The patch below implements this.
> 
> Will give it a whirl, it looks promising indeed and gets rid of the 
> ugly cpu sync stuff. [...]

fantastic! :)

> [...] What is the cost of ktime_get() compared to sched_clock()?

compared to the costs of IO transfers it should be OK. It can be really 
fast but in the worst-case it can be as slow as 3-6 usecs (when we use 
the acpi_pm clocksource). If there's complaints then probably the 
networking folks will yell first :)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 13:21                     ` Ingo Molnar
@ 2008-01-11 17:18                       ` Jens Axboe
  2008-01-14  7:51                         ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2008-01-11 17:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > > > If blktrace worked in 2.6.23 and it doesn't in 2.6.24 because of 
> > > > some option that isn't immediately apparent, then it's a 
> > > > regression. Period.
> > > 
> > > not completely correct. CONFIG_NO_HZ is a default-disabled option 
> > > that became newly available on 64-bit x86. So if NO_HZ does not 
> > > completely work on 64-bit, and if 32-bit works fine - which we dont 
> > > know yet (my guess would be that it's similarly broken on the same 
> > > box) then it's not a regression.
> > 
> > Ingo, it doesn't matter if the option is disabled by default or not! 
> > The fact is that functionality foo works in 2.6.23 and doesn't in 
> > 2.6.24 because of something unrelated. And that IS a regression, no 
> > matter what kind of word play you are doing here :-)
> 
> argh, Jens. Really. I believe you stopped using your brain because 
> CONFIG_BKL_IO_TRACE=y is affected by this bug and apparently you've got 
> a weak spot for it ;)

I don't think so.

> Think about it another way then, in the context of another, real kernel 
> feature we introduced in v2.6.24, namely CONFIG_CPU_IDLE=y:
> 
> - Fact: feature CONFIG_CPU_IDLE=y did not exist in 64-bit x86 in v2.6.23.
>   It has known bugs but they are not flagged 'regressions' (because the 
>   feature did not exist before and the option is default-disabled) hence 
>   the feature can stay. Some fixes to it are too dangerous or too 
>   unproven and are delayed to v2.6.25.

> 
> and now apply the same rule to CONFIG_NO_HZ:
> 
> - Fact: feature CONFIG_NO_HZ=y did not exist in 64-bit x86 in v2.6.23.
>   It has known bugs but they are not flagged 'regressions' (because the 
>   feature did not exist before and the option is default-disabled) hence 
>   the feature can stay. Some fixes to it are too dangerous or too 
>   unproven and are delayed to v2.6.25.

Forget about this unrelated feature, it has nothing to do with this
situation. What if some (eg) sata driver broke because someone enabled
CONFIG_NO_HZ. Would you say that's not a regression because it worked
before that option was there? That's crap, no user would buy that
argument. As far as the user is concerned, it IS a regression.

Since we obviously disagree on this, lets agree to not debate it any
further. It doesn't matter anyway, since the issue is resolved.

> > [...] What is the cost of ktime_get() compared to sched_clock()?
> 
> compared to the costs of IO transfers it should be OK. It can be really 
> fast but in the worst-case it can be as slow as 3-6 usecs (when we use 
> the acpi_pm clocksource). If there's complaints then probably the 
> networking folks will yell first :)

How does that compare to sched_clock()? 3-6 usec would render blktrace
unusable for certain types of testing, unfortunately. This isn't just
queried per IO, it's done for every action for that IO. So you'd
typically have up to eg 10 traces for a single IO piece, perhaps even
more.

For 2.6.24 the solution is fine, I'll revisit the impact when I can.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 17:18                       ` Jens Axboe
@ 2008-01-14  7:51                         ` Ingo Molnar
  2008-01-14  7:59                           ` Ingo Molnar
  2008-01-14  8:39                           ` Christoph Hellwig
  0 siblings, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-01-14  7:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx, Andrew Morton

* Jens Axboe <jens.axboe@oracle.com> wrote:

> > - Fact: feature CONFIG_CPU_IDLE=y did not exist in 64-bit x86 in v2.6.23.
> >   It has known bugs but they are not flagged 'regressions' (because the 
> >   feature did not exist before and the option is default-disabled) hence 
> >   the feature can stay. Some fixes to it are too dangerous or too 
> >   unproven and are delayed to v2.6.25.
> 
> > 
> > and now apply the same rule to CONFIG_NO_HZ:
> > 
> > - Fact: feature CONFIG_NO_HZ=y did not exist in 64-bit x86 in v2.6.23.
> >   It has known bugs but they are not flagged 'regressions' (because the 
> >   feature did not exist before and the option is default-disabled) hence 
> >   the feature can stay. Some fixes to it are too dangerous or too 
> >   unproven and are delayed to v2.6.25.
> 
> Forget about this unrelated feature, it has nothing to do with this 
> situation. [...]

why "forget" about a perfectly valid example right from v2.6.24 that 
demonstrates our definition for regressions and demonstrates the rules 
we are applying? Because it does not fit your argument? :)

> [...] What if some (eg) sata driver broke because someone enabled 
> CONFIG_NO_HZ. Would you say that's not a regression because it worked 
> before that option was there? [...]

as long as it's the same situation as have here (i.e. that NO_HZ 
appeared on an architecture where it wasnt offered before): yes, 
exactly, it's _NOT_ a regression in that sata driver [the sata driver 
might not even have been touched], and it's _NOT_ a regression in 
CONFIG_NO_HZ either because a regression can only be something that 
"something breaks that worked before". In that case it's a _bug_ in the 
newly introduced NO_HZ code.

> [...] That's crap, no user would buy that argument. As far as the user 
> is concerned, it IS a regression.

the user enabled a newly introduced kernel feature and that new feature 
might be not perfect. It might break a whole lot of unrelated code. We 
consider them bugs (of course), and work on fixing them, but still we 
dont consider such bugs a regression and dont revert a new feature if 
it's not fixed. It's a special-case. We apply that special case to all 
sorts of new kernel features, such as the new iwl3945 driver in 2.6.24, 
which has a number of unfixed bugs. We applied that to new kernel 
features as well, such as containers/control-groups in 2.6.23, etc. This 
is nothing new, we did this numerous times ever since the new-style 
development model was introduced around 2.6.12. _You_ might not know 
about it but that does not change the situation.

because a perfectly working system is:

 "a user's .config that worked before should work with the new kernel 
  too"

not:

 "a user's .config that worked before should work now too, with random
  new kernel features enabled as well."

the latter appears to be the rule you are applying, but it's not the 
regression rule we are using.

> Since we obviously disagree on this, lets agree to not debate it any 
> further. It doesn't matter anyway, since the issue is resolved.

this is not a matter of opinion or "disagreement", this is a matter of 
fact. You were the one telling me: "go fix that regression", so this 
issue might come up anytime in the future. Better clear it once and for 
all, we all have got better things to do than to rehash the same issue 
again and again.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-14  7:51                         ` Ingo Molnar
@ 2008-01-14  7:59                           ` Ingo Molnar
  2008-01-14  8:10                             ` Jens Axboe
  2008-01-14  8:39                           ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-14  7:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx, Andrew Morton

* Ingo Molnar <mingo@elte.hu> wrote:

> because a perfectly working system is:
> 
>  "a user's .config that worked before should work with the new kernel
>   too"
> 
> not:
> 
>  "a user's .config that worked before should work now too, with random
>   new kernel features enabled as well."
> 
> the latter appears to be the rule you are applying, but it's not the 
> regression rule we are using.

Jens, just to bring your definition of regressions to its logical 
conclusion: does this mean that if there is any longstanding bug in the 
block layer that you know about, but i didnt ever utilize that bit of 
the block layer it in my .config, and if i enable it now in the .config 
and i experience that bug, does it suddenly count as a regression? Do 
you realize that your definition for "regressions" turns _almost every_ 
current bug in the kernel into a regression?

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-14  7:59                           ` Ingo Molnar
@ 2008-01-14  8:10                             ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-14  8:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx, Andrew Morton

On Mon, Jan 14 2008, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > because a perfectly working system is:
> > 
> >  "a user's .config that worked before should work with the new kernel
> >   too"
> > 
> > not:
> > 
> >  "a user's .config that worked before should work now too, with random
> >   new kernel features enabled as well."
> > 
> > the latter appears to be the rule you are applying, but it's not the 
> > regression rule we are using.
> 
> Jens, just to bring your definition of regressions to its logical 
> conclusion: does this mean that if there is any longstanding bug in the 
> block layer that you know about, but i didnt ever utilize that bit of 
> the block layer it in my .config, and if i enable it now in the .config 
> and i experience that bug, does it suddenly count as a regression? Do 
> you realize that your definition for "regressions" turns _almost every_ 
> current bug in the kernel into a regression?

Ingo, why do you keep harping this issue? I thought I suggested we agree
to disagree on this and let it rest.

And I would say that, yes, that is a regression, if that config option
is a core option that people are likely to enable. The CONFIG_NO_HZ is a
new option, people will select it. Your example pertain more to the 'use
mmio for IO operations' type options for drivers. If you enable that and
your driver suddenly stops working, you have a clear idea of WHY it
stops working and how to fix it. Not so with this blktrace scenario, I
bet that would take people quite a while to figure out how it broke.

Can we drop this subject now, please? The issue is resolved (and
merged), debating definitions of regressions is not very productive :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-14  7:51                         ` Ingo Molnar
  2008-01-14  7:59                           ` Ingo Molnar
@ 2008-01-14  8:39                           ` Christoph Hellwig
  2008-01-14  8:42                             ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2008-01-14  8:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, David Dillow, Guillaume Chazarain, linux-kernel,
	linux-btrace, mingo, tglx, Andrew Morton

Folks, this is getting a little silly.

Even if CONFIG_NO_HZ is new this is a an important regression, and
yes we should avoid regressions wherever we can, and for such a quite
important feature we should fix it.  On the other hand blktrace is using
the wrong interface, and it has been told multiple times on lkml that
sched_clock() should never ever be used outside the scheduler.  So the
burden to fix this regression lies on the shoulders of the blktrace
maintainer.

No need for silly name calling here.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-14  8:39                           ` Christoph Hellwig
@ 2008-01-14  8:42                             ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-14  8:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, David Dillow, Guillaume Chazarain, linux-kernel,
	linux-btrace, mingo, tglx, Andrew Morton

On Mon, Jan 14 2008, Christoph Hellwig wrote:
> 
> Folks, this is getting a little silly.

It is

> Even if CONFIG_NO_HZ is new this is a an important regression, and
> yes we should avoid regressions wherever we can, and for such a quite
> important feature we should fix it.  On the other hand blktrace is using
> the wrong interface, and it has been told multiple times on lkml that
> sched_clock() should never ever be used outside the scheduler.  So the
> burden to fix this regression lies on the shoulders of the blktrace
> maintainer.

That is true. So far alternatives have all been slower though, so not
very tempting to transition to.

> No need for silly name calling here.

I don't know what thread you are reading, but neiter Ingo nor I have
resorted to silly name calling. The thread has long since diverted from
the original topic, though.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 10:29                 ` [patch] block: fix " Ingo Molnar
  2008-01-11 10:47                   ` Guillaume Chazarain
  2008-01-11 12:28                   ` Jens Axboe
@ 2008-01-11 15:36                   ` David Dillow
  2008-01-11 16:44                     ` Ingo Molnar
  2 siblings, 1 reply; 37+ messages in thread
From: David Dillow @ 2008-01-11 15:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx


On Fri, 2008-01-11 at 11:29 +0100, Ingo Molnar wrote:
> (David, could you try the patch further below - does it fix bkltrace 
> timestamps too?)

Jens already got back to you, but I can confirm it works here as well.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 15:36                   ` David Dillow
@ 2008-01-11 16:44                     ` Ingo Molnar
  2008-01-11 17:26                       ` Jens Axboe
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11 16:44 UTC (permalink / raw)
  To: David Dillow
  Cc: Jens Axboe, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx


* David Dillow <dillowda@ornl.gov> wrote:

> On Fri, 2008-01-11 at 11:29 +0100, Ingo Molnar wrote:
> > (David, could you try the patch further below - does it fix bkltrace 
> > timestamps too?)
> 
> Jens already got back to you, but I can confirm it works here as well.

great, thanks for testing it. (i'm not sure Jens saw the same issues as 
you, so your test was useful too.)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] block: fix blktrace timestamps
  2008-01-11 16:44                     ` Ingo Molnar
@ 2008-01-11 17:26                       ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-11 17:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * David Dillow <dillowda@ornl.gov> wrote:
> 
> > On Fri, 2008-01-11 at 11:29 +0100, Ingo Molnar wrote:
> > > (David, could you try the patch further below - does it fix bkltrace 
> > > timestamps too?)
> > 
> > Jens already got back to you, but I can confirm it works here as well.
> 
> great, thanks for testing it. (i'm not sure Jens saw the same issues as 
> you, so your test was useful too.)

Right, I merely checked that ktime_get() worked for me.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:23         ` Ingo Molnar
  2008-01-11  9:25           ` Jens Axboe
@ 2008-01-11  9:29           ` Ingo Molnar
  2008-01-11  9:34             ` Jens Axboe
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx


* Ingo Molnar <mingo@elte.hu> wrote:

> > Thanks for reporting this. Guillaume, did you write this patch? We 
> > need to get it into 2.6.24-rc7 asap. Let me know if I should take 
> > care of that, or if it's already queued up elsewhere.
> 
> they are from the scheduler git tree (except the first debug patch), 
> but queued up for v2.6.25 at the moment.

some of them are not - Guillaume has not sent them. I'll sort it out.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:29           ` CONFIG_NO_HZ breaks " Ingo Molnar
@ 2008-01-11  9:34             ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2008-01-11  9:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

On Fri, Jan 11 2008, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > Thanks for reporting this. Guillaume, did you write this patch? We 
> > > need to get it into 2.6.24-rc7 asap. Let me know if I should take 
> > > care of that, or if it's already queued up elsewhere.
> > 
> > they are from the scheduler git tree (except the first debug patch), 
> > but queued up for v2.6.25 at the moment.
> 
> some of them are not - Guillaume has not sent them. I'll sort it out.

Thanks Ingo!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:07       ` Jens Axboe
  2008-01-11  9:23         ` Ingo Molnar
@ 2008-01-11  9:28         ` nigel
  2008-01-11  9:32           ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: nigel @ 2008-01-11  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Dillow, Guillaume Chazarain, linux-kernel, linux-btrace,
	mingo, tglx

Hi.

Jens Axboe wrote:
> On Thu, Jan 10 2008, David Dillow wrote:
>> On Thu, 2008-01-10 at 23:44 +0100, Guillaume Chazarain wrote:
>>> David Dillow <dillowda@ornl.gov> wrote:
>>>
>>>> At the moment, I'm not sure how to track this farther, or how to fix it
>>>> properly. Any advice would be appreciated.
>>> Just out of curiosity, could you try the appended cumulative patch and
>>> report .clock_warps, .clock_overflows and .clock_underflows as you did.
>> With those patches, CONFIG_NO_HZ works just fine.

Could these patches also help with hibernation issues? I'm trying
x86_64+NO_HZ, and seeing activity delayed during the atomic copy and
afterwards until I manually generate interrupts (by pressing keys).

Regards,

Nigel


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:28         ` nigel
@ 2008-01-11  9:32           ` Ingo Molnar
  2008-01-13 22:54             ` nigel
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:32 UTC (permalink / raw)
  To: nigel
  Cc: Jens Axboe, David Dillow, Guillaume Chazarain, linux-kernel,
	linux-btrace, mingo, tglx

* nigel@suspend2.net <nigel@suspend2.net> wrote:

> >>> Just out of curiosity, could you try the appended cumulative patch 
> >>> and report .clock_warps, .clock_overflows and .clock_underflows as 
> >>> you did.
> >> With those patches, CONFIG_NO_HZ works just fine.
> 
> Could these patches also help with hibernation issues? I'm trying 
> x86_64+NO_HZ, and seeing activity delayed during the atomic copy and 
> afterwards until I manually generate interrupts (by pressing keys).

i dont think that should be related to cpu_clock() use. Does the patch 
below make any difference? (or could you try x86.git to get the whole 
stack of x86 changes that we have at the moment.) Here's the coordinates 
for x86.git:

--------------{ x86.git instructions }---------->

# Add Linus's tree as a remote
git remote --add linus
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

# Add Ingo's tree as a remote
git remote --add x86
git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

# With that setup, just run the following to get any changes you
# don't have.  It will also notice any new branches Ingo/Linus
# add to their repo.  Look in .git/config afterwards, the format
# to add new remotes is easy to figure out.
git remote update

	Ingo

----------->
Subject: x86: kick CPUS that might be sleeping in cpus_idle_wait
From: Steven Rostedt <rostedt@goodmis.org>

Sometimes cpu_idle_wait gets stuck because it might miss CPUS that are
already in idle, have no tasks waiting to run and have no interrupts going
to them.  This is common on bootup when switching cpu idle governors.

This patch gives those CPUS that don't check in an IPI kick.

Background:
-----------
I notice this while developing the mcount patches, that every once in a
while the system would hang. Looking deeper, the hang was always at boot
up when registering init_menu of the cpu_idle menu governor. Talking
with Thomas Gliexner, we discovered that one of the CPUS had no timer
events scheduled for it and it was in idle (running with NO_HZ). So the
CPU would not set the cpu_idle_state bit.

Hitting sysrq-t a few times would eventually route the interrupt to the
stuck CPU and the system would continue.

Note, I would have used the PDA isidle but that is set after the
cpu_idle_state bit is cleared, and would leave a window open where we
may miss being kicked.

hmm, looking closer at this, we still have a small race window between
clearing the cpu_idle_state and disabling interrupts (hence the RFC).

    CPU0:                          CPU 1:
  ---------                       ---------
 cpu_idle_wait():                 cpu_idle():
      |                           __cpu_cpu_var(is_idle) = 1;
      |                           if (__get_cpu_var(cpu_idle_state)) /* = 0 */
 per_cpu(cpu_idle_state, 1) = 1;         |
 if (per_cpu(is_idle, 1)) /* = 1 */     |
 smp_call_function(1)                    |
      |                             receives ipi and runs do_nothing.
 wait on map = empty               idle();
   /* waits forever */

So really we need interrupts off for most of this then. One might think
that we could simply clear the cpu_idle_state from do_nothing, but I'm
assuming that cpu_idle governors can be removed, and this might cause a
race that a governor might be used after the module was removed.

Venki said:

  I think your RFC patch is the right solution here.  As I see it, there is
  no race with your RFC patch.  As long as you call a dummy smp_call_function
  on all CPUs, we should be OK.  We can get rid of cpu_idle_state and the
  current wait forever logic altogether with dummy smp_call_function.  And so
  there wont be any wait forever scenario.

  The whole point of cpu_idle_wait() is to make all CPUs come out of idle
  loop atleast once.  The caller will use cpu_idle_wait something like this.

  // Want to change idle handler

  - Switch global idle handler to always present default_idle

  - call cpu_idle_wait so that all cpus come out of idle for an instant
    and stop using old idle pointer and start using default idle

  - Change the idle handler to a new handler

  - optional cpu_idle_wait if you want all cpus to start using the new
    handler immediately.

Maybe the below 1s patch is safe bet for .24.  But for .25, I would say we
just replace all complicated logic by simple dummy smp_call_function and
remove cpu_idle_state altogether.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process_32.c |   11 +++++++++++
 arch/x86/kernel/process_64.c |   11 +++++++++++
 2 files changed, 22 insertions(+)

Index: linux-x86.q/arch/x86/kernel/process_32.c
=================================--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -214,6 +214,10 @@ void cpu_idle(void)
 	}
 }

+static void do_nothing(void *unused)
+{
+}
+
 void cpu_idle_wait(void)
 {
 	unsigned int cpu, this_cpu = get_cpu();
@@ -238,6 +242,13 @@ void cpu_idle_wait(void)
 				cpu_clear(cpu, map);
 		}
 		cpus_and(map, map, cpu_online_map);
+		/*
+		 * We waited 1 sec, if a CPU still did not call idle
+		 * it may be because it is in idle and not waking up
+		 * because it has nothing to do.
+		 * Give all the remaining CPUS a kick.
+		 */
+		smp_call_function_mask(map, do_nothing, 0, 0);
 	} while (!cpus_empty(map));

 	set_cpus_allowed(current, tmp);
Index: linux-x86.q/arch/x86/kernel/process_64.c
=================================--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -204,6 +204,10 @@ void cpu_idle(void)
 	}
 }

+static void do_nothing(void *unused)
+{
+}
+
 void cpu_idle_wait(void)
 {
 	unsigned int cpu, this_cpu = get_cpu();
@@ -228,6 +232,13 @@ void cpu_idle_wait(void)
 				cpu_clear(cpu, map);
 		}
 		cpus_and(map, map, cpu_online_map);
+		/*
+		 * We waited 1 sec, if a CPU still did not call idle
+		 * it may be because it is in idle and not waking up
+		 * because it has nothing to do.
+		 * Give all the remaining CPUS a kick.
+		 */
+		smp_call_function_mask(map, do_nothing, 0, 0);
 	} while (!cpus_empty(map));

 	set_cpus_allowed(current, tmp);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:32           ` Ingo Molnar
@ 2008-01-13 22:54             ` nigel
  0 siblings, 0 replies; 37+ messages in thread
From: nigel @ 2008-01-13 22:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, David Dillow, Guillaume Chazarain, linux-kernel,
	linux-btrace, mingo, tglx

Hi.

Ingo Molnar wrote:
> * nigel@suspend2.net <nigel@suspend2.net> wrote:
> 
>>>>> Just out of curiosity, could you try the appended cumulative patch 
>>>>> and report .clock_warps, .clock_overflows and .clock_underflows as 
>>>>> you did.
>>>> With those patches, CONFIG_NO_HZ works just fine.
>> Could these patches also help with hibernation issues? I'm trying 
>> x86_64+NO_HZ, and seeing activity delayed during the atomic copy and 
>> afterwards until I manually generate interrupts (by pressing keys).
> 
> i dont think that should be related to cpu_clock() use. Does the patch 
> below make any difference? (or could you try x86.git to get the whole 
> stack of x86 changes that we have at the moment.) Here's the coordinates 
> for x86.git:

Sorry for the delay in replying. Something seems to help, but I haven't
managed to identify what yet. I don't think it was the patch appended
because I'm on UP. If you care, I'll see if I can find the time to look
more carefully.

Nigel


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  3:04     ` David Dillow
  2008-01-11  9:07       ` Jens Axboe
@ 2008-01-11  9:44       ` Ingo Molnar
  2008-01-11  9:51         ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:44 UTC (permalink / raw)
  To: David Dillow; +Cc: Guillaume Chazarain, linux-kernel, linux-btrace, mingo, tglx


* David Dillow <dillowda@ornl.gov> wrote:

> > Just out of curiosity, could you try the appended cumulative patch 
> > and report .clock_warps, .clock_overflows and .clock_underflows as 
> > you did.
> 
> With those patches, CONFIG_NO_HZ works just fine.

could you please try the two patches below, do they fix the problem as 
well? They got a ton of testing in x86.git in the past ~2 months and we 
could perhaps still push them into v2.6.24.

	Ingo

------------------>
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" <guichaz@yahoo.fr>

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ mingo@elte.hu: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/tsc_32.c |   43 ++++++++++++++++++++++++++++++-----
 arch/x86/kernel/tsc_64.c |   57 ++++++++++++++++++++++++++++++++++++++---------
 include/asm-x86/timer.h  |   23 ++++++++++++++----
 3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux-x86.q/arch/x86/kernel/tsc_32.c
=================================--- linux-x86.q.orig/arch/x86/kernel/tsc_32.c
+++ linux-x86.q/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include <linux/jiffies.h>
 #include <linux/init.h>
 #include <linux/dmi.h>
+#include <linux/percpu.h>
 
 #include <asm/delay.h>
 #include <asm/tsc.h>
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-unsigned long cyc2ns_scale __read_mostly;
 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long cpu_khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
+
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	/*
+	 * Start smoothly with the new frequency:
+	 */
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 /*
@@ -239,7 +258,9 @@ time_cpufreq_notifier(struct notifier_bl
 						ref_freq, freq->new);
 			if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
 				tsc_khz = cpu_khz;
-				set_cyc2ns_scale(cpu_khz);
+				preempt_disable();
+				set_cyc2ns_scale(cpu_khz, smp_processor_id());
+				preempt_enable();
 				/*
 				 * TSC based sched_clock turns
 				 * to junk w/ cpufreq
@@ -367,6 +388,8 @@ static inline void check_geode_tsc_relia
 
 void __init tsc_init(void)
 {
+	int cpu;
+
 	if (!cpu_has_tsc || tsc_disable)
 		goto out_no_tsc;
 
@@ -380,7 +403,15 @@ void __init tsc_init(void)
 				(unsigned long)cpu_khz / 1000,
 				(unsigned long)cpu_khz % 1000);
 
-	set_cyc2ns_scale(cpu_khz);
+	/*
+	 * Secondary CPUs do not run through tsc_init(), so set up
+	 * all the scale factors for all CPUs, assuming the same
+	 * speed as the bootup CPU. (cpufreq notifiers will fix this
+	 * up if their speed diverges)
+	 */
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(cpu_khz, cpu);
+
 	use_tsc_delay();
 
 	/* Check and install the TSC clocksource */
Index: linux-x86.q/arch/x86/kernel/tsc_64.c
=================================--- linux-x86.q.orig/arch/x86/kernel/tsc_64.c
+++ linux-x86.q/arch/x86/kernel/tsc_64.c
@@ -10,6 +10,7 @@
 
 #include <asm/hpet.h>
 #include <asm/timex.h>
+#include <asm/timer.h>
 
 static int notsc __initdata = 0;
 
@@ -18,16 +19,48 @@ EXPORT_SYMBOL(cpu_khz);
 unsigned int tsc_khz;
 EXPORT_SYMBOL(tsc_khz);
 
-static unsigned int cyc2ns_scale __read_mostly;
+/* Accelerators for sched_clock()
+ * convert from cycles(64bits) => nanoseconds (64bits)
+ *  basic equation:
+ *		ns = cycles / (freq / ns_per_sec)
+ *		ns = cycles * (ns_per_sec / freq)
+ *		ns = cycles * (10^9 / (cpu_khz * 10^3))
+ *		ns = cycles * (10^6 / cpu_khz)
+ *
+ *	Then we use scaling math (suggested by george@mvista.com) to get:
+ *		ns = cycles * (10^6 * SC / cpu_khz) / SC
+ *		ns = cycles * cyc2ns_scale / SC
+ *
+ *	And since SC is a constant power of two, we can convert the div
+ *  into a shift.
+ *
+ *  We can use khz divisor instead of mhz to keep a better precision, since
+ *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
+ *  (mathieu.desnoyers@polymtl.ca)
+ *
+ *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+DEFINE_PER_CPU(unsigned long, cyc2ns);
 
-static inline void set_cyc2ns_scale(unsigned long khz)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 {
-	cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz;
-}
+	unsigned long flags, prev_scale, *scale;
+	unsigned long long tsc_now, ns_now;
 
-static unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-	return (cyc * cyc2ns_scale) >> NS_SCALE;
+	local_irq_save(flags);
+	sched_clock_idle_sleep_event();
+
+	scale = &per_cpu(cyc2ns, cpu);
+
+	rdtscll(tsc_now);
+	ns_now = __cycles_2_ns(tsc_now);
+
+	prev_scale = *scale;
+	if (cpu_khz)
+		*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
+
+	sched_clock_idle_wakeup_event(0);
+	local_irq_restore(flags);
 }
 
 unsigned long long sched_clock(void)
@@ -100,7 +133,9 @@ static int time_cpufreq_notifier(struct 
 			mark_tsc_unstable("cpufreq changes");
 	}
 
-	set_cyc2ns_scale(tsc_khz_ref);
+	preempt_disable();
+	set_cyc2ns_scale(tsc_khz_ref, smp_processor_id());
+	preempt_enable();
 
 	return 0;
 }
@@ -151,7 +186,7 @@ static unsigned long __init tsc_read_ref
 void __init tsc_calibrate(void)
 {
 	unsigned long flags, tsc1, tsc2, tr1, tr2, pm1, pm2, hpet1, hpet2;
-	int hpet = is_hpet_enabled();
+	int hpet = is_hpet_enabled(), cpu;
 
 	local_irq_save(flags);
 
@@ -206,7 +241,9 @@ void __init tsc_calibrate(void)
 	}
 
 	tsc_khz = tsc2 / tsc1;
-	set_cyc2ns_scale(tsc_khz);
+
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(tsc_khz, cpu);
 }
 
 /*
Index: linux-x86.q/include/asm-x86/timer.h
=================================--- linux-x86.q.orig/include/asm-x86/timer.h
+++ linux-x86.q/include/asm-x86/timer.h
@@ -2,6 +2,7 @@
 #define _ASMi386_TIMER_H
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/percpu.h>
 
 #define TICK_SIZE (tick_nsec / 1000)
 
@@ -16,7 +17,7 @@ extern int recalibrate_cpu_khz(void);
 #define calculate_cpu_khz() native_calculate_cpu_khz()
 #endif
 
-/* Accellerators for sched_clock()
+/* Accelerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
  *		ns = cycles / (freq / ns_per_sec)
@@ -31,20 +32,32 @@ extern int recalibrate_cpu_khz(void);
  *	And since SC is a constant power of two, we can convert the div
  *  into a shift.
  *
- *  We can use khz divisor instead of mhz to keep a better percision, since
+ *  We can use khz divisor instead of mhz to keep a better precision, since
  *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
  *  (mathieu.desnoyers@polymtl.ca)
  *
  *			-johnstul@us.ibm.com "math is hard, lets go shopping!"
  */
-extern unsigned long cyc2ns_scale __read_mostly;
+
+DECLARE_PER_CPU(unsigned long, cyc2ns);
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
 {
-	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+	return cyc * per_cpu(cyc2ns, smp_processor_id()) >> CYC2NS_SCALE_FACTOR;
 }
 
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+	unsigned long long ns;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	ns = __cycles_2_ns(cyc);
+	local_irq_restore(flags);
+
+	return ns;
+}
 
 #endif

--------------------->
Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar <mingo@elte.hu>

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

(the ACPI idle code already does this.)

[ update the 64-bit side too, as noticed by Jiri Slaby. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process_32.c |   15 ++++++++++++---
 arch/x86/kernel/process_64.c |   13 ++++++++++---
 2 files changed, 22 insertions(+), 6 deletions(-)

Index: linux-x86.q/arch/x86/kernel/process_32.c
=================================--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */
Index: linux-x86.q/arch/x86/kernel/process_64.c
=================================--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -116,9 +116,16 @@ static void default_idle(void)
 	smp_mb();
 	local_irq_disable();
 	if (!need_resched()) {
-		/* Enables interrupts one instruction before HLT.
-		   x86 special cases this so there is no race. */
-		safe_halt();
+		ktime_t t0, t1;
+		u64 t0n, t1n;
+
+		t0 = ktime_get();
+		t0n = ktime_to_ns(t0);
+		safe_halt();	/* enables interrupts racelessly */
+		local_irq_disable();
+		t1 = ktime_get();
+		t1n = ktime_to_ns(t1);
+		sched_clock_idle_wakeup_event(t1n - t0n);
 	} else
 		local_irq_enable();
 	current_thread_info()->status |= TS_POLLING;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:44       ` Ingo Molnar
@ 2008-01-11  9:51         ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:51 UTC (permalink / raw)
  To: David Dillow; +Cc: Guillaume Chazarain, linux-kernel, linux-btrace, mingo, tglx


* Ingo Molnar <mingo@elte.hu> wrote:

> > With those patches, CONFIG_NO_HZ works just fine.
> 
> could you please try the two patches below, do they fix the problem as 
> well? They got a ton of testing in x86.git in the past ~2 months and 
> we could perhaps still push them into v2.6.24.

plus, if needed, perhaps this patch below from Guillaume ontop of it.

	Ingo

---------------->
Subject: sched: fix rq->clock warps on frequency changes
From: Guillaume Chazarain <guichaz@yahoo.fr>

sched: fix rq->clock warps on frequency changes

Fix 2bacec8c318ca0418c0ee9ac662ee44207765dd4
(sched: touch softlockup watchdog after idling) that reintroduced warps
on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
that checks rq->clock for warps, so call it after adjusting rq->clock.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86.q/kernel/sched.c
=================================--- linux-x86.q.orig/kernel/sched.c
+++ linux-x86.q/kernel/sched.c
@@ -668,7 +668,6 @@ void sched_clock_idle_wakeup_event(u64 d
 	struct rq *rq = cpu_rq(smp_processor_id());
 	u64 now = sched_clock();
 
-	touch_softlockup_watchdog();
 	rq->idle_clock += delta_ns;
 	/*
 	 * Override the previous timestamp and ignore all
@@ -680,6 +679,7 @@ void sched_clock_idle_wakeup_event(u64 d
 	rq->prev_clock_raw = now;
 	rq->clock += delta_ns;
 	spin_unlock(&rq->lock);
+	touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-10 22:44   ` Guillaume Chazarain
  2008-01-11  3:04     ` David Dillow
@ 2008-01-11 10:41     ` Guillaume Chazarain
  2008-01-11 10:55       ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Guillaume Chazarain @ 2008-01-11 10:41 UTC (permalink / raw)
  To: mingo; +Cc: David Dillow, linux-kernel, linux-btrace, tglx, Jens Axboe, nigel

David Dillow <dillowda@ornl.gov> wrote:

> Patched kernel, nohz=off:
>   .clock_underflows              : 213887

A little bit of warning about these patches, they are WIP, that's why I
did not send them earlier. It regress nohz=off.

A bit of context: these patches aim at making sure cpu_clock() on my
laptop (cpufreq enabled) never overflows/underflows/warps with
CONFIG_NOHZ enabled. With these patches, I have a few hundreds
overflows and underflows during early bootup, and then nothing :-)

Ingo Molnar <mingo@elte.hu> wrote:

> they are from the scheduler git tree (except the first debug patch), but 
> queued up for v2.6.25 at the moment.

You are talking about "x86: scale cyc_2_nsec according to CPU
frequency" here, but I don't think it is at stakes here as David has:

> CONFIG_CPU_FREQ is not set

Let me review my patches myself to give a bit of context:

>     sched: monitor clock underflows in /proc/sched_debug

This, I'd like to have it in .25 just for convenience.

>         x86: scale cyc_2_nsec according to CPU frequency

You already know that one ;-)

>     sched: fix rq->clock warps on frequency changes

This is a bugfix for .25 once the previous patch is applied. I don't
think it helps David, but it could help blktrace users with cpufreq
enabled.

>     sched: Fix rq->clock overflows detection with CONFIG_NO_HZ

I think this one is the most important for David, but unfortunately it
has some problems.

> +static inline u64 max_skipped_ticks(struct rq *rq)
> +{
> +	return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
> +}

Here, I initially wrote rq->last_tick_seen + 1 but experiments showed
that +2 was needed as I really saw deltas of 2 milliseconds.

These patches have two objectives:
 - taking into account that jiffies are not always incremented by 1
thanks to nohz
 - as the tick is stopped and restarted it may not tick at the exact
expected moment, so allow a window of 1 jiffie. If the tick occurs
during the right jiffy, we know the TSC is more precise than the tick
so don't correct the clock.

And the problem is that I seem to need a window of 2 jiffies, so I need
some help.

>     sched: make sure jiffies is up to date before calling __update_rq_clock()

This is one is needed too but I'm less confident in its validity.

>     scheduler_tick() is not called every jiffies

This one is a bit ugly and seems to break nohz=off.

> -	if (unlikely(rq->clock < next_tick)) {
> +	if (unlikely(rq->clock < next_tick - nohz_on(cpu) * TICK_NSEC)) {

No, I'm not proud of this :-(

Thanks.

-- 
Guillaume

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11 10:41     ` Guillaume Chazarain
@ 2008-01-11 10:55       ` Ingo Molnar
  2008-01-11 22:30         ` Guillaume Chazarain
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11 10:55 UTC (permalink / raw)
  To: Guillaume Chazarain
  Cc: mingo, David Dillow, linux-kernel, linux-btrace, tglx, Jens Axboe,
	nigel


* Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> David Dillow <dillowda@ornl.gov> wrote:
> 
> > Patched kernel, nohz=off:
> >   .clock_underflows              : 213887
> 
> A little bit of warning about these patches, they are WIP, that's why 
> I did not send them earlier. It regress nohz=off.

ok. I have applied all but this one:

> A bit of context: these patches aim at making sure cpu_clock() on my
> laptop (cpufreq enabled) never overflows/underflows/warps with
> CONFIG_NOHZ enabled. With these patches, I have a few hundreds
> overflows and underflows during early bootup, and then nothing :-)

cool :-)

> >     sched: Fix rq->clock overflows detection with CONFIG_NO_HZ
> 
> I think this one is the most important for David, but unfortunately it
> has some problems.
> 
> > +static inline u64 max_skipped_ticks(struct rq *rq)
> > +{
> > +	return nohz_on(cpu_of(rq)) ? jiffies - rq->last_tick_seen + 2 : 1;
> > +}
> 
> Here, I initially wrote rq->last_tick_seen + 1 but experiments showed
> that +2 was needed as I really saw deltas of 2 milliseconds.
> 
> These patches have two objectives:
>  - taking into account that jiffies are not always incremented by 1
> thanks to nohz
>  - as the tick is stopped and restarted it may not tick at the exact
> expected moment, so allow a window of 1 jiffie. If the tick occurs
> during the right jiffy, we know the TSC is more precise than the tick
> so don't correct the clock.

i think it's much simpler to do what i have below. Could you try it on 
your box? Or if it is using ACPI idle - in that case the callbacks 
should already be there and there should be no need for further fixups.

> And the problem is that I seem to need a window of 2 jiffies, so I need
> some help.
> 
> >     sched: make sure jiffies is up to date before calling __update_rq_clock()
> 
> This is one is needed too but I'm less confident in its validity.
> 
> >     scheduler_tick() is not called every jiffies
> 
> This one is a bit ugly and seems to break nohz=off.

ok, i took this one out.

	Ingo

-------------------->
Subject: x86: idle wakeup event in the HLT loop
From: Ingo Molnar <mingo@elte.hu>

do a proper idle-wakeup event on HLT as well - some CPUs stop the TSC
in HLT too, not just when going through the ACPI methods.

(the ACPI idle code already does this.)

[ update the 64-bit side too, as noticed by Jiri Slaby. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/process_32.c |   15 ++++++++++++---
 arch/x86/kernel/process_64.c |   13 ++++++++++---
 2 files changed, 22 insertions(+), 6 deletions(-)

Index: linux-x86.q/arch/x86/kernel/process_32.c
=================================--- linux-x86.q.orig/arch/x86/kernel/process_32.c
+++ linux-x86.q/arch/x86/kernel/process_32.c
@@ -113,10 +113,19 @@ void default_idle(void)
 		smp_mb();
 
 		local_irq_disable();
-		if (!need_resched())
+		if (!need_resched()) {
+			ktime_t t0, t1;
+			u64 t0n, t1n;
+
+			t0 = ktime_get();
+			t0n = ktime_to_ns(t0);
 			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
+			local_irq_disable();
+			t1 = ktime_get();
+			t1n = ktime_to_ns(t1);
+			sched_clock_idle_wakeup_event(t1n - t0n);
+		}
+		local_irq_enable();
 		current_thread_info()->status |= TS_POLLING;
 	} else {
 		/* loop is done by the caller */
Index: linux-x86.q/arch/x86/kernel/process_64.c
=================================--- linux-x86.q.orig/arch/x86/kernel/process_64.c
+++ linux-x86.q/arch/x86/kernel/process_64.c
@@ -116,9 +116,16 @@ static void default_idle(void)
 	smp_mb();
 	local_irq_disable();
 	if (!need_resched()) {
-		/* Enables interrupts one instruction before HLT.
-		   x86 special cases this so there is no race. */
-		safe_halt();
+		ktime_t t0, t1;
+		u64 t0n, t1n;
+
+		t0 = ktime_get();
+		t0n = ktime_to_ns(t0);
+		safe_halt();	/* enables interrupts racelessly */
+		local_irq_disable();
+		t1 = ktime_get();
+		t1n = ktime_to_ns(t1);
+		sched_clock_idle_wakeup_event(t1n - t0n);
 	} else
 		local_irq_enable();
 	current_thread_info()->status |= TS_POLLING;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11 10:55       ` Ingo Molnar
@ 2008-01-11 22:30         ` Guillaume Chazarain
  2008-01-12  0:03           ` Guillaume Chazarain
  0 siblings, 1 reply; 37+ messages in thread
From: Guillaume Chazarain @ 2008-01-11 22:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, linux-kernel, linux-btrace, tglx, Jens Axboe, nigel

Ingo Molnar <mingo@elte.hu> wrote:

> ok. I have applied all but this one

Hmm, I couldn't find them in mingo/linux-2.6-sched-devel.git.

> i think it's much simpler to do what i have below. Could you try it on 
> your box? Or if it is using ACPI idle - in that case the callbacks 
> should already be there and there should be no need for further fixups.
> 
> Subject: x86: idle wakeup event in the HLT loop

I use ACPI, so this patch has no effect.

FYI, I'm currently trying to track down where rq->clock started to
overflow with nohz=off, and it seems to be before 2.6.23, so my patches
are not at fault ;-) Or maybe I am dreaming and it was always
overflowing. Investigating ...

-- 
Guillaume

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11 22:30         ` Guillaume Chazarain
@ 2008-01-12  0:03           ` Guillaume Chazarain
  0 siblings, 0 replies; 37+ messages in thread
From: Guillaume Chazarain @ 2008-01-12  0:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Dillow, linux-kernel, linux-btrace, tglx, Jens Axboe, nigel

Guillaume Chazarain <guichaz@yahoo.fr> wrote:

> FYI, I'm currently trying to track down where rq->clock started to
> overflow with nohz=off, and it seems to be before 2.6.23, so my patches
> are not at fault ;-) Or maybe I am dreaming and it was always
> overflowing. Investigating ...

And the winner is:

commit 529c77261bccd9d37f110f58b0753d95beaa9fa2
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Aug 10 23:05:11 2007 +0200

    sched: improve rq-clock overflow logic
    
    improve the rq-clock overflow logic: limit the absolute rq->clock
    delta since the last scheduler tick, instead of limiting the delta
    itself.
    
    tested by Arjan van de Ven - whole laptop was misbehaving due to
    an incorrectly calibrated cpu_khz confusing sched_clock().
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index b0afd8d..6247e4a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -263,6 +263,7 @@ struct rq {
 
 	unsigned int clock_warps, clock_overflows;
 	unsigned int clock_unstable_events;
+	u64 tick_timestamp;
 
 	atomic_t nr_iowait;
 
@@ -341,8 +342,11 @@ static void __update_rq_clock(struct rq *rq)
 		/*
 		 * Catch too large forward jumps too:
 		 */
-		if (unlikely(delta > 2*TICK_NSEC)) {
-			clock++;
+		if (unlikely(clock + delta > rq->tick_timestamp + TICK_NSEC)) {
+			if (clock < rq->tick_timestamp + TICK_NSEC)
+				clock = rq->tick_timestamp + TICK_NSEC;
+			else
+				clock++;
 			rq->clock_overflows++;
 		} else {
 			if (unlikely(delta > rq->clock_max_delta))
@@ -3308,9 +3312,16 @@ void scheduler_tick(void)
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *curr = rq->curr;
+	u64 next_tick = rq->tick_timestamp + TICK_NSEC;
 
 	spin_lock(&rq->lock);
 	__update_rq_clock(rq);
+	/*
+	 * Let rq->clock advance by at least TICK_NSEC:
+	 */
+	if (unlikely(rq->clock < next_tick))
+		rq->clock = next_tick;
+	rq->tick_timestamp = rq->clock;
 	update_cpu_load(rq);
 	if (curr != rq->idle) /* FIXME: needed? */
 		curr->sched_class->task_tick(rq, curr);


Seems like I originally was not the only one seeing 2 jiffies jumps ;-)
I'll adapt my patches.

-- 
Guillaume

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-10 20:25 ` David Dillow
  2008-01-10 22:44   ` Guillaume Chazarain
@ 2008-01-11  9:34   ` Ingo Molnar
  2008-01-11 15:43     ` David Dillow
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2008-01-11  9:34 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-kernel, linux-btrace, mingo, tglx, Jens Axboe


* David Dillow <dillowda@ornl.gov> wrote:

> Ingo, Thomas added as I think this is related to 
> sched.c:__update_rq_clock()'s checking for forward time warps.

yep, we've got some fixes in this area. Do blktrace timestamps work fine 
in v2.6.23, with NOHZ?

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: CONFIG_NO_HZ breaks blktrace timestamps
  2008-01-11  9:34   ` Ingo Molnar
@ 2008-01-11 15:43     ` David Dillow
  0 siblings, 0 replies; 37+ messages in thread
From: David Dillow @ 2008-01-11 15:43 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, linux-btrace, mingo, tglx, Jens Axboe


On Fri, 2008-01-11 at 10:34 +0100, Ingo Molnar wrote:
> * David Dillow <dillowda@ornl.gov> wrote:
> 
> > Ingo, Thomas added as I think this is related to 
> > sched.c:__update_rq_clock()'s checking for forward time warps.
> 
> yep, we've got some fixes in this area. Do blktrace timestamps work fine 
> in v2.6.23, with NOHZ?

Do you still want this tested, given that your ktime change works? If
so, it will likely be next week before I can devote some cycles to it --
it'll take installing a new distro on the machines.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2008-01-14  8:42 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-09 22:48 CONFIG_NO_HZ breaks blktrace timestamps David Dillow
2008-01-10 20:25 ` David Dillow
2008-01-10 22:44   ` Guillaume Chazarain
2008-01-11  3:04     ` David Dillow
2008-01-11  9:07       ` Jens Axboe
2008-01-11  9:23         ` Ingo Molnar
2008-01-11  9:25           ` Jens Axboe
2008-01-11  9:42             ` Ingo Molnar
2008-01-11  9:56               ` Jens Axboe
2008-01-11 10:29                 ` [patch] block: fix " Ingo Molnar
2008-01-11 10:47                   ` Guillaume Chazarain
2008-01-11 10:50                     ` Ingo Molnar
2008-01-11 12:28                   ` Jens Axboe
2008-01-11 12:42                     ` Jens Axboe
2008-01-11 13:21                     ` Ingo Molnar
2008-01-11 17:18                       ` Jens Axboe
2008-01-14  7:51                         ` Ingo Molnar
2008-01-14  7:59                           ` Ingo Molnar
2008-01-14  8:10                             ` Jens Axboe
2008-01-14  8:39                           ` Christoph Hellwig
2008-01-14  8:42                             ` Jens Axboe
2008-01-11 15:36                   ` David Dillow
2008-01-11 16:44                     ` Ingo Molnar
2008-01-11 17:26                       ` Jens Axboe
2008-01-11  9:29           ` CONFIG_NO_HZ breaks " Ingo Molnar
2008-01-11  9:34             ` Jens Axboe
2008-01-11  9:28         ` nigel
2008-01-11  9:32           ` Ingo Molnar
2008-01-13 22:54             ` nigel
2008-01-11  9:44       ` Ingo Molnar
2008-01-11  9:51         ` Ingo Molnar
2008-01-11 10:41     ` Guillaume Chazarain
2008-01-11 10:55       ` Ingo Molnar
2008-01-11 22:30         ` Guillaume Chazarain
2008-01-12  0:03           ` Guillaume Chazarain
2008-01-11  9:34   ` Ingo Molnar
2008-01-11 15:43     ` David Dillow

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).