[patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1

public inbox for linux-kernel@vger.kernel.org

* [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1
@ 2006-11-09 23:38 Thomas Gleixner
  2006-11-09 23:38 ` [patch 01/19] hrtimers: state tracking Thomas Gleixner
                   ` (19 more replies)
  0 siblings, 20 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

Andrew,

this is a drop in replacement for the following patches in 2.6.19-rc5-mm1:

hrtimers-state-tracking.patch
up to
acpi-verify-lapic-timer-fix.patch

The patch set takes all the changes made during the -mm testing into
account and merges them back at the appropriate places.

Changes vs. the initially reviewed patch set:

- Trivial compile / Kconfig fixes
- command line parameter fixups
- apic timer / ACPI (C states) broadcasting fixups
- Disabled apic timer for high res on UP systems due to the unsolvable
  brokenness of BIOS supplied C state functionality
- APIC code cleanup in i386
- reworked APIC timer calibration

Dropped from the rc5-mm1 patch conglomerate:

- the naive attempt to detect the local APIC timer brokenness in C2 state
  due to circular dependency on interrupts (where the local APIC timer
  interrupt might be the only active one). The problem is detectable,
  but it needs more thought and the gathered information/experience is
  not lost! Replaced by brute force for now.

Some annotations for making the review simpler:

hrtimers-state-tracking.patch
	callback state trivial fix

hrtimers-clean-up-callback-tracking.patch
	no changes, kept for linearity

hrtimers-move-and-add-documentation.patch
	no changes, kept for linearity

clockevents-core.patch
	Off-by-one bug fixed
	include and compile fixes
	broadcast support

acpi-include-apic-h.patch
	new

acpi-keep-track-of-timer-broadcast.patch
	new

acpi-add-hres-dyntick-broadcast-support.patch
	new

i386-cleanup-apic.patch
	new, no functional changes

clockevents-drivers-for-i386.patch
	broadcast fixups

pm-timer-allow-early-access.patch
	new, no functional changes
	
i386-lapic-calibrate-timer.patch
	new

high-res-timers-core.patch
	trivial fixups

gtod-mark-tsc-unusable-for-highres-timers.patch
	no changes, kept for linearity

dynticks-core.patch
	trivial fixups

dynticks-add-nohz-stats-to-proc-stat.patch
	no changes, patch fuzz due to prior patches

dynticks-i386-arch-code.patch
	no changes, patch fuzz due to prior patches

dynticks-i386-nmi-fix.patch
	new

high-res-timers-dynticks-enable-i386-support.patch
	no changes, patch fuzz due to prior patches

debugging-feature-timer-stats.patch
	no changes, patch fuzz due to prior patches

Thanks,

	tglx
--



* [patch 01/19] hrtimers: state tracking
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10  9:19   ` Arjan van de Ven
  2006-11-23 22:26   ` Roman Zippel
  2006-11-09 23:38 ` [patch 02/19] hrtimers: clean up callback tracking Thomas Gleixner
                   ` (18 subsequent siblings)
  19 siblings, 2 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: hrtimers-state-tracking.patch --]
[-- Type: text/plain, Size: 6381 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Reintroduce ktimers feature "optimized away" by the ktimers review process:
multiple hrtimer states to enable the running of hrtimers without holding the
cpu-base-lock.

(The "optimized" rbtree hack carried only 2 states worth of information and we
need 4 for high resolution timers and dynamic ticks.)
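
As a rough illustration of the four states, the transitions can be replayed
in plain user-space C. This is only a sketch under stated assumptions: struct
model_timer and the model_* helpers are made-up names that stand in for
struct hrtimer and the functions touched by this patch, the rbtree handling
and cpu_base->lock are left out, and only the state bits are taken from the
patch itself.

```c
/* User-space model of the hrtimer state bits; this is not kernel code. */
#include <assert.h>

#define HRTIMER_STATE_INACTIVE	0x00
#define HRTIMER_STATE_ENQUEUED	0x01
#define HRTIMER_STATE_CALLBACK	0x02

/* Hypothetical stand-in for struct hrtimer, reduced to the state word. */
struct model_timer {
	unsigned long state;
};

/* Same test as hrtimer_active(): enqueued, callback running, or both. */
static int model_active(const struct model_timer *t)
{
	return t->state != HRTIMER_STATE_INACTIVE;
}

/* Same test as hrtimer_is_queued(): true for states 0x01 and 0x03. */
static int model_is_queued(const struct model_timer *t)
{
	return t->state != HRTIMER_STATE_INACTIVE &&
		t->state != HRTIMER_STATE_CALLBACK;
}

/* Like enqueue_hrtimer(): OR the bit in so a running callback stays visible. */
static void model_enqueue(struct model_timer *t)
{
	t->state |= HRTIMER_STATE_ENQUEUED;
}

/* Like __remove_hrtimer(): the caller chooses the state after removal. */
static void model_remove(struct model_timer *t, unsigned long newstate)
{
	t->state = newstate;
}

/*
 * Replays the SMP requeue scenario described in the hrtimer.h comment and
 * records the state after each step in seen[0..3].
 */
static void model_replay(unsigned long seen[4])
{
	struct model_timer t = { HRTIMER_STATE_INACTIVE };

	model_enqueue(&t);				/* timer is armed */
	assert(model_is_queued(&t));
	seen[0] = t.state;

	model_remove(&t, HRTIMER_STATE_CALLBACK);	/* expiry, callback starts */
	assert(model_active(&t) && !model_is_queued(&t));
	seen[1] = t.state;

	model_enqueue(&t);				/* another CPU rearms it */
	seen[2] = t.state;

	t.state &= ~HRTIMER_STATE_CALLBACK;		/* callback has returned */
	seen[3] = t.state;
}
```

The third recorded state is the one the two-state rbtree encoding could not
express: the timer is enqueued again while its callback is still running.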

Build-fixes-from: Andrew Morton <akpm@osdl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hrtimer.h	2006-11-09 21:06:05.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hrtimer.h	2006-11-09 21:06:07.000000000 +0100
@@ -40,6 +40,34 @@ enum hrtimer_restart {
 	HRTIMER_RESTART,	/* Timer must be restarted */
 };
 
+/*
+ * Bit values to track state of the timer
+ *
+ * Possible states:
+ *
+ * 0x00		inactive
+ * 0x01		enqueued into rbtree
+ * 0x02		callback function running
+ * 0x03		callback function running and enqueued
+ *		(was requeued on another CPU)
+ *
+ * The "callback function running and enqueued" status is only possible on
+ * SMP. It happens for example when a posix timer expired and the callback
+ * queued a signal. Between dropping the lock which protects the posix timer
+ * and reacquiring the base lock of the hrtimer, another CPU can deliver the
+ * signal and rearm the timer. We have to preserve the callback running state,
+ * as otherwise the timer could be removed before the softirq code finishes the
+ * handling of the timer.
+ *
+ * The HRTIMER_STATE_ENQUEUED bit is always or'ed to the current state to
+ * preserve the HRTIMER_STATE_CALLBACK bit in the above scenario.
+ *
+ * All state transitions are protected by cpu_base->lock.
+ */
+#define HRTIMER_STATE_INACTIVE	0x00
+#define HRTIMER_STATE_ENQUEUED	0x01
+#define HRTIMER_STATE_CALLBACK	0x02
+
 /**
  * struct hrtimer - the basic hrtimer structure
  * @node:	red black tree node for time ordered insertion
@@ -48,6 +76,7 @@ enum hrtimer_restart {
  *		which the timer is based.
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
+ * @state:	state information (See bit values above)
  *
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
@@ -56,6 +85,7 @@ struct hrtimer {
 	ktime_t				expires;
 	enum hrtimer_restart		(*function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
+	unsigned long			state;
 };
 
 /**
@@ -141,9 +171,13 @@ extern int hrtimer_get_res(const clockid
 extern ktime_t hrtimer_get_next_event(void);
 #endif
 
+/*
+ * A timer is active, when it is enqueued into the rbtree or the callback
+ * function is running.
+ */
 static inline int hrtimer_active(const struct hrtimer *timer)
 {
-	return rb_parent(&timer->node) != &timer->node;
+	return timer->state != HRTIMER_STATE_INACTIVE;
 }
 
 /* Forward a hrtimer so it expires after now: */
Index: linux-2.6.19-rc5-mm1/kernel/hrtimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/hrtimer.c	2006-11-09 21:06:05.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/hrtimer.c	2006-11-09 21:06:07.000000000 +0100
@@ -235,6 +235,12 @@ lock_hrtimer_base(const struct hrtimer *
 
 #endif	/* !CONFIG_SMP */
 
+static inline int hrtimer_is_queued(struct hrtimer *timer)
+{
+	return timer->state != HRTIMER_STATE_INACTIVE &&
+		timer->state != HRTIMER_STATE_CALLBACK;
+}
+
 /*
  * Functions for the union type storage format of ktime_t which are
  * too large for inlining:
@@ -385,6 +391,11 @@ static void enqueue_hrtimer(struct hrtim
 	 */
 	rb_link_node(&timer->node, parent, link);
 	rb_insert_color(&timer->node, &base->active);
+	/*
+	 * HRTIMER_STATE_ENQUEUED is or'ed to the current state to preserve the
+	 * state of a possibly running callback.
+	 */
+	timer->state |= HRTIMER_STATE_ENQUEUED;
 
 	if (!base->first || timer->expires.tv64 <
 	    rb_entry(base->first, struct hrtimer, node)->expires.tv64)
@@ -397,7 +408,8 @@ static void enqueue_hrtimer(struct hrtim
  * Caller must hold the base lock.
  */
 static void __remove_hrtimer(struct hrtimer *timer,
-			     struct hrtimer_clock_base *base)
+			     struct hrtimer_clock_base *base,
+			     unsigned long newstate)
 {
 	/*
 	 * Remove the timer from the rbtree and replace the
@@ -406,7 +418,7 @@ static void __remove_hrtimer(struct hrti
 	if (base->first == &timer->node)
 		base->first = rb_next(&timer->node);
 	rb_erase(&timer->node, &base->active);
-	rb_set_parent(&timer->node, &timer->node);
+	timer->state = newstate;
 }
 
 /*
@@ -415,8 +427,8 @@ static void __remove_hrtimer(struct hrti
 static inline int
 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
-	if (hrtimer_active(timer)) {
-		__remove_hrtimer(timer, base);
+	if (hrtimer_is_queued(timer)) {
+		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
 		return 1;
 	}
 	return 0;
@@ -488,7 +500,7 @@ int hrtimer_try_to_cancel(struct hrtimer
 
 	base = lock_hrtimer_base(timer, &flags);
 
-	if (base->cpu_base->curr_timer != timer)
+	if (!(timer->state & HRTIMER_STATE_CALLBACK))
 		ret = remove_hrtimer(timer, base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -593,7 +605,6 @@ void hrtimer_init(struct hrtimer *timer,
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &cpu_base->clock_base[clock_id];
-	rb_set_parent(&timer->node, &timer->node);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -644,13 +655,14 @@ static inline void run_hrtimer_queue(str
 
 		fn = timer->function;
 		set_curr_timer(cpu_base, timer);
-		__remove_hrtimer(timer, base);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
 		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
 
 		spin_lock_irq(&cpu_base->lock);
 
+		timer->state &= ~HRTIMER_STATE_CALLBACK;
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
 			enqueue_hrtimer(timer, base);
@@ -821,7 +833,8 @@ static void migrate_hrtimer_list(struct 
 
 	while ((node = rb_first(&old_base->active))) {
 		timer = rb_entry(node, struct hrtimer, node);
-		__remove_hrtimer(timer, old_base);
+		BUG_ON(timer->state & HRTIMER_STATE_CALLBACK);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_INACTIVE);
 		timer->base = new_base;
 		enqueue_hrtimer(timer, new_base);
 	}

--



* [patch 02/19] hrtimers: clean up callback tracking
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
  2006-11-09 23:38 ` [patch 01/19] hrtimers: state tracking Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10  9:20   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 03/19] hrtimers: Move and add documentation Thomas Gleixner
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: hrtimers-clean-up-callback-tracking.patch --]
[-- Type: text/plain, Size: 2653 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Reintroduce ktimers feature "optimized away" by the ktimers review process:
remove the curr_timer pointer from the cpu-base and use the hrtimer state.
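
To make the effect concrete, here is a user-space sketch of the cancel
decision once the per-cpu curr_timer pointer is gone and the timer's own
state word is consulted instead. The names sketch_timer, sketch_try_to_cancel
and sketch_demo are hypothetical, locking is omitted, and the -1/0/1 return
convention is assumed from hrtimer_try_to_cancel() rather than visible in the
hunks of this series.

```c
/* User-space sketch of the cancel decision; this is not kernel code. */
#include <assert.h>

#define HRTIMER_STATE_INACTIVE	0x00
#define HRTIMER_STATE_ENQUEUED	0x01
#define HRTIMER_STATE_CALLBACK	0x02

/* Hypothetical stand-in for struct hrtimer, reduced to the state word. */
struct sketch_timer {
	unsigned long state;
};

/*
 * Assumed return convention:
 *  -1  the callback is currently running, the timer cannot be stopped
 *   0  the timer was not queued
 *   1  the timer was queued and has been removed
 */
static int sketch_try_to_cancel(struct sketch_timer *t)
{
	int ret = -1;

	/* Previously: base->cpu_base->curr_timer != timer */
	if (!(t->state & HRTIMER_STATE_CALLBACK)) {
		if (t->state & HRTIMER_STATE_ENQUEUED) {
			t->state = HRTIMER_STATE_INACTIVE;
			ret = 1;
		} else {
			ret = 0;
		}
	}
	return ret;
}

/* Exercises the three outcomes; returns 0 if all of them match. */
static int sketch_demo(void)
{
	struct sketch_timer queued = { HRTIMER_STATE_ENQUEUED };
	struct sketch_timer idle = { HRTIMER_STATE_INACTIVE };
	struct sketch_timer running = { HRTIMER_STATE_CALLBACK };

	assert(sketch_try_to_cancel(&queued) == 1);
	assert(sketch_try_to_cancel(&idle) == 0);
	assert(sketch_try_to_cancel(&running) == -1);
	return 0;
}
```

Because the check reads only the timer itself, it needs neither the per-cpu
pointer nor the set_curr_timer() bookkeeping around the callback invocation.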

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hrtimer.h	2006-11-09 21:06:07.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hrtimer.h	2006-11-09 21:06:09.000000000 +0100
@@ -136,7 +136,6 @@ struct hrtimer_cpu_base {
 	spinlock_t			lock;
 	struct lock_class_key		lock_key;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
-	struct hrtimer			*curr_timer;
 };
 
 /*
Index: linux-2.6.19-rc5-mm1/kernel/hrtimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/hrtimer.c	2006-11-09 21:06:07.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/hrtimer.c	2006-11-09 21:06:09.000000000 +0100
@@ -150,8 +150,6 @@ static void hrtimer_get_softirq_time(str
  */
 #ifdef CONFIG_SMP
 
-#define set_curr_timer(b, t)		do { (b)->curr_timer = (t); } while (0)
-
 /*
  * We are using hashed locking: holding per_cpu(hrtimer_bases)[n].lock
  * means that all timers which are tied to this base via timer->base are
@@ -205,7 +203,7 @@ switch_hrtimer_base(struct hrtimer *time
 		 * completed. There is no conflict as we hold the lock until
 		 * the timer is enqueued.
 		 */
-		if (unlikely(base->cpu_base->curr_timer == timer))
+		if (unlikely(timer->state & HRTIMER_STATE_CALLBACK))
 			return base;
 
 		/* See the comment in lock_timer_base() */
@@ -219,8 +217,6 @@ switch_hrtimer_base(struct hrtimer *time
 
 #else /* CONFIG_SMP */
 
-#define set_curr_timer(b, t)		do { } while (0)
-
 static inline struct hrtimer_clock_base *
 lock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
 {
@@ -654,7 +650,6 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		set_curr_timer(cpu_base, timer);
 		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
 		spin_unlock_irq(&cpu_base->lock);
 
@@ -668,7 +663,6 @@ static inline void run_hrtimer_queue(str
 			enqueue_hrtimer(timer, base);
 		}
 	}
-	set_curr_timer(cpu_base, NULL);
 	spin_unlock_irq(&cpu_base->lock);
 }
 
@@ -855,8 +849,6 @@ static void migrate_hrtimers(int cpu)
 	spin_lock(&old_base->lock);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
-		BUG_ON(old_base->curr_timer);
-
 		migrate_hrtimer_list(&old_base->clock_base[i],
 				     &new_base->clock_base[i]);
 	}

--



* [patch 03/19] hrtimers: Move and add documentation
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
  2006-11-09 23:38 ` [patch 01/19] hrtimers: state tracking Thomas Gleixner
  2006-11-09 23:38 ` [patch 02/19] hrtimers: clean up callback tracking Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 04/19] Add a framework to manage clock event devices Thomas Gleixner
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: hrtimers-move-and-add-documentation.patch --]
[-- Type: text/plain, Size: 32203 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Move the initial hrtimer.txt document to the new directory
"Documentation/hrtimer"

Add design notes for the high resolution timer and dynamic tick functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/Documentation/hrtimer/highres.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/Documentation/hrtimer/highres.txt	2006-11-09 21:06:11.000000000 +0100
@@ -0,0 +1,249 @@
+High resolution timers and dynamic ticks design notes
+-----------------------------------------------------
+
+Further information can be found in the paper of the OLS 2006 talk "hrtimers
+and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
+be found on the OLS website:
+http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf
+
+The slides to this talk are available from:
+http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf
+
+The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
+changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
+design of the Linux time(r) system before hrtimers and other building blocks
+got merged into mainline.
+
+Note: the paper and the slides talk about "clock event sources", while we have
+switched to the name "clock event devices" in the meantime.
+
+The design contains the following basic building blocks:
+
+- hrtimer base infrastructure
+- timeofday and clock source management
+- clock event management
+- high resolution timer functionality
+- dynamic ticks
+
+
+hrtimer base infrastructure
+---------------------------
+
+The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
+the base implementation are covered in Documentation/hrtimer/hrtimer.txt. See
+also figure #2 (OLS slides p. 15).
+
+The main differences to the timer wheel, which holds the armed timer_list type
+timers are:
+       - time ordered enqueueing into a rb-tree
+       - independent of ticks (the processing is based on nanoseconds)
+
+
+timeofday and clock source management
+-------------------------------------
+
+John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
+code out of the architecture-specific areas into a generic management
+framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
+specific portion is reduced to the low level hardware details of the clock
+sources, which are registered in the framework and selected on a quality based
+decision. The low level code provides hardware setup and readout routines and
+initializes data structures, which are used by the generic time keeping code to
+convert the clock ticks to nanosecond based time values. All other time keeping
+related functionality is moved into the generic code. The GTOD base patch got
+merged into the 2.6.18 kernel.
+
+Further information about the Generic Time Of Day framework is available in the
+OLS 2005 Proceedings Volume 1:
+http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf
+
+The paper "We Are Not Getting Any Younger: A New Approach to Time and
+Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.
+
+Figure #3 (OLS slides p.18) illustrates the transformation.
+
+
+clock event management
+----------------------
+
+While clock sources provide read access to the monotonically increasing time
+value, clock event devices are used to schedule the next event
+interrupt(s). The next event is currently defined to be periodic, with its
+period defined at compile time. The setup and selection of the event device
+for various event driven functionalities is hardwired into the architecture
+dependent code. This results in duplicated code across all architectures and
+makes it extremely difficult to change the configuration of the system to use
+event interrupt devices other than those already built into the
+architecture. Another implication of the current design is that it is necessary
+to touch all the architecture-specific implementations in order to provide new
+functionality like high resolution timers or dynamic ticks.
+
+The clock events subsystem tries to address this problem by providing a generic
+solution to manage clock event devices and their usage for the various clock
+event driven kernel functionalities. The goal of the clock event subsystem is
+to minimize the clock event related architecture dependent code to the pure
+hardware related handling and to allow easy addition and utilization of new
+clock event devices. It also minimizes the duplicated code across the
+architectures as it provides generic functionality down to the interrupt
+service handler, which is almost inherently hardware dependent.
+
+Clock event devices are registered either by the architecture dependent boot
+code or at module insertion time. Each clock event device fills a data
+structure with clock-specific property parameters and callback functions. The
+clock event management decides, by using the specified property parameters, the
+set of system functions a clock event device will be used to support. This
+includes the distinction of per-CPU and per-system global event devices.
+
+System-level global event devices are used for the Linux periodic tick. Per-CPU
+event devices are used to provide local CPU functionality such as process
+accounting, profiling, and high resolution timers.
+
+The management layer assigns one or more of the following functions to a clock
+event device:
+      - system global periodic tick (jiffies update)
+      - cpu local update_process_times
+      - cpu local profiling
+      - cpu local next event interrupt (non periodic mode)
+
+The clock event device delegates the selection of those timer interrupt related
+functions completely to the management layer. The clock management layer stores
+a function pointer in the device description structure, which has to be called
+from the hardware level handler. This removes a lot of duplicated code from the
+architecture specific timer interrupt handlers and hands the control over the
+clock event devices and the assignment of timer interrupt related functionality
+to the core code.
+
+The clock event layer API is rather small. Aside from the clock event device
+registration interface it provides functions to schedule the next event
+interrupt, clock event device notification service and support for suspend and
+resume.
+
+The framework adds about 700 lines of code which results in a 2KB increase of
+the kernel binary size. The conversion of i386 removes about 100 lines of
+code. The binary size decrease is in the range of 400 bytes. We believe that the
+increase of flexibility and the avoidance of duplicated code across
+architectures justifies the slight increase of the binary size.
+
+The conversion of an architecture has no functional impact, but allows the
+high resolution and dynamic tick functionalities to be used without any change
+to the clock event device and timer interrupt code. After the conversion the
+enabling of high resolution timers and dynamic ticks is simply provided by
+adding the kernel/time/Kconfig file to the architecture specific Kconfig and
+adding the dynamic tick specific calls to the idle routine (a total of 3 lines
+added to the idle function and the Kconfig file).
+
+Figure #4 (OLS slides p.20) illustrates the transformation.
+
+
+high resolution timer functionality
+-----------------------------------
+
+During system boot it is not possible to use the high resolution timer
+functionality, while making it possible would be difficult and would serve no
+useful function. The initialization of the clock event device framework, the
+clock source framework (GTOD) and hrtimers itself has to be done and
+appropriate clock sources and clock event devices have to be registered before
+the high resolution functionality can work. Up to the point where hrtimers are
+initialized, the system works in the usual low resolution periodic mode. The
+clock source and the clock event device layers provide notification functions
+which inform hrtimers about availability of new hardware. hrtimers validates
+the usability of the registered clock sources and clock event devices before
+switching to high resolution mode. This ensures also that a kernel which is
+configured for high resolution timers can run on a system which lacks the
+necessary hardware support.
+
+The high resolution timer code does not support SMP machines which have only
+global clock event devices. The support of such hardware would involve IPI
+calls when an interrupt happens. The overhead would be much larger than the
+benefit. This is the reason why we currently disable high resolution and
+dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
+state. A workaround is available as an idea, but the problem has not been
+tackled yet.
+
+The time ordered insertion of timers provides all the infrastructure to decide
+whether the event device has to be reprogrammed when a timer is added. The
+decision is made per timer base and synchronized across per-cpu timer bases in
+a support function. The design allows the system to utilize separate per-CPU
+clock event devices for the per-CPU timer bases, but currently only one
+reprogrammable clock event device per-CPU is utilized.
+
+When the timer interrupt happens, the next event interrupt handler is called
+from the clock event distribution code and moves expired timers from the
+red-black tree to a separate double linked list and invokes the softirq
+handler. An additional mode field in the hrtimer structure allows the system to
+execute callback functions directly from the next event interrupt handler. This
+is restricted to code which can safely be executed in the hard interrupt
+context. This applies, for example, to the common case of a wakeup function as
+used by nanosleep. The advantage of executing the handler in the interrupt
+context is the avoidance of up to two context switches - from the interrupted
+context to the softirq and to the task which is woken up by the expired
+timer.
+
+Once a system has switched to high resolution mode, the periodic tick is
+switched off. This disables the per system global periodic clock event device -
+e.g. the PIT on i386 SMP systems.
+
+The periodic tick functionality is provided by a per-cpu hrtimer. The callback
+function is executed in the next event interrupt context and updates jiffies
+and calls update_process_times and profiling. The implementation of the hrtimer
+based periodic tick is designed to be extended with dynamic tick functionality.
+This allows a single clock event device to schedule high resolution
+timer and periodic events (jiffies tick, profiling, process accounting) on UP
+systems. This has been proved to work with the PIT on i386 and the Incrementer
+on PPC.
+
+The softirq for running the hrtimer queues and executing the callbacks has been
+separated from the tick bound timer softirq to allow accurate delivery of high
+resolution timer signals which are used by itimer and POSIX interval
+timers. The execution of this softirq can still be delayed by other softirqs,
+but the overall latencies have been significantly improved by this separation.
+
+Figure #5 (OLS slides p.22) illustrates the transformation.
+
+
+dynamic ticks
+-------------
+
+Dynamic ticks are the logical consequence of the hrtimer based periodic tick
+replacement (sched_tick). The functionality of the sched_tick hrtimer is
+extended by three functions:
+
+- hrtimer_stop_sched_tick
+- hrtimer_restart_sched_tick
+- hrtimer_update_jiffies
+
+hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
+evaluates the next scheduled timer event (from both hrtimers and the timer
+wheel) and in case that the next event is further away than the next tick it
+reprograms the sched_tick to this future event, to allow longer idle sleeps
+without worthless interruption by the periodic tick. The function is also
+called when an interrupt happens during the idle period, which does not cause a
+reschedule. The call is necessary as the interrupt handler might have armed a
+new timer whose expiry time is before the time which was identified as the
+nearest event in the previous call to hrtimer_stop_sched_tick.
+
+hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
+it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick,
+which is kept active until the next call to hrtimer_stop_sched_tick().
+
+hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
+in the idle period to make sure that jiffies are up to date and the interrupt
+handler does not have to deal with a possibly stale jiffy value.
+
+The dynamic tick feature provides statistical values which are exported to
+userspace via /proc/stat and can be made available for enhanced power
+management control.
+
+The implementation leaves room for further development like full tickless
+systems, where the time slice is controlled by the scheduler, variable
+frequency profiling, and a complete removal of jiffies in the future.
+
+
+Aside from the current initial submission of i386 support, the patchset has been
+extended to x86_64 and ARM already. Initial (work in progress) support is also
+available for MIPS and PowerPC.
+
+	  Thomas, Ingo
+
+
+
Index: linux-2.6.19-rc5-mm1/Documentation/hrtimer/hrtimers.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/Documentation/hrtimer/hrtimers.txt	2006-11-09 21:06:11.000000000 +0100
@@ -0,0 +1,178 @@
+
+hrtimers - subsystem for high-resolution kernel timers
+----------------------------------------------------
+
+This patch introduces a new subsystem for high-resolution kernel timers.
+
+One might ask the question: we already have a timer subsystem
+(kernel/timers.c), why do we need two timer subsystems? After a lot of
+back and forth trying to integrate high-resolution and high-precision
+features into the existing timer framework, and after testing various
+such high-resolution timer implementations in practice, we came to the
+conclusion that the timer wheel code is fundamentally not suitable for
+such an approach. We initially didn't believe this ('there must be a way
+to solve this'), and spent a considerable effort trying to integrate
+things into the timer wheel, but we failed. In hindsight, there are
+several reasons why such integration is hard/impossible:
+
+- the forced handling of low-resolution and high-resolution timers in
+  the same way leads to a lot of compromises, macro magic and #ifdef
+  mess. The timers.c code is very "tightly coded" around jiffies and
+  32-bitness assumptions, and has been honed and micro-optimized for a
+  relatively narrow use case (jiffies in a relatively narrow HZ range)
+  for many years - and thus even small extensions to it easily break
+  the wheel concept, leading to even worse compromises. The timer wheel
+  code is very good and tight code, there's zero problems with it in its
+  current usage - but it is simply not suitable to be extended for
+  high-res timers.
+
+- the unpredictable [O(N)] overhead of cascading leads to delays which
+  necessitate a more complex handling of high resolution timers, which
+  in turn decreases robustness. Such a design still led to rather large
+  timing inaccuracies. Cascading is a fundamental property of the timer
+  wheel concept, it cannot be 'designed out' without inevitably
+  degrading other portions of the timers.c code in an unacceptable way.
+
+- the implementation of the current posix-timer subsystem on top of
+  the timer wheel has already introduced a quite complex handling of
+  the required readjusting of absolute CLOCK_REALTIME timers at
+  settimeofday or NTP time - further underlining our experience by
+  example: that the timer wheel data structure is too rigid for high-res
+  timers.
+
+- the timer wheel code is most optimal for use cases which can be
+  identified as "timeouts". Such timeouts are usually set up to cover
+  error conditions in various I/O paths, such as networking and block
+  I/O. The vast majority of those timers never expire and are rarely
+  recascaded because the expected correct event arrives in time so they
+  can be removed from the timer wheel before any further processing of
+  them becomes necessary. Thus the users of these timeouts can accept
+  the granularity and precision tradeoffs of the timer wheel, and
+  largely expect the timer subsystem to have near-zero overhead.
+  Accurate timing for them is not a core purpose - in fact most of the
+  timeout values used are ad-hoc. For them it is at most a necessary
+  evil to guarantee the processing of actual timeout completions
+  (because most of the timeouts are deleted before completion), which
+  should thus be as cheap and unintrusive as possible.
+
+The primary users of precision timers are user-space applications that
+utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
+users like drivers and subsystems which require precise timed events
+(e.g. multimedia) can benefit from the availability of a separate
+high-resolution timer subsystem as well.
+
+While this subsystem does not offer high-resolution clock sources just
+yet, the hrtimer subsystem can be easily extended with high-resolution
+clock capabilities, and patches for that exist and are maturing quickly.
+The increasing demand for realtime and multimedia applications along
+with other potential users for precise timers gives another reason to
+separate the "timeout" and "precise timer" subsystems.
+
+Another potential benefit is that such a separation allows even more
+special-purpose optimization of the existing timer wheel for the low
+resolution and low precision use cases - once the precision-sensitive
+APIs are separated from the timer wheel and are migrated over to
+hrtimers. E.g. we could decrease the frequency of the timeout subsystem
+from 250 Hz to 100 Hz (or even lower).
+
+hrtimer subsystem implementation details
+----------------------------------------
+
+the basic design considerations were:
+
+- simplicity
+
+- data structure not bound to jiffies or any other granularity. All the
+  kernel logic works at 64-bit nanoseconds resolution - no compromises.
+
+- simplification of existing, timing related kernel code
+
+another basic requirement was the immediate enqueueing and ordering of
+timers at activation time. After looking at several possible solutions
+such as radix trees and hashes, we chose the red black tree as the basic
+data structure. Rbtrees are available as a library in the kernel and are
+used in various performance-critical areas of e.g. memory management and
+file systems. The rbtree is solely used for time sorted ordering, while
+a separate list is used to give the expiry code fast access to the
+queued timers, without having to walk the rbtree.
+
+(This separate list is also useful for later when we'll introduce
+high-resolution clocks, where we need separate pending and expired
+queues while keeping the time-order intact.)
+
+Time-ordered enqueueing is not purely for the purposes of
+high-resolution clocks though, it also simplifies the handling of
+absolute timers based on a low-resolution CLOCK_REALTIME. The existing
+implementation needed to keep an extra list of all armed absolute
+CLOCK_REALTIME timers along with complex locking. In case of
+settimeofday and NTP, all the timers (!) had to be dequeued, the
+time-changing code had to fix them up one by one, and all of them had to
+be enqueued again. The time-ordered enqueueing and the storage of the
+expiry time in absolute time units removes all this complex and poorly
+scaling code from the posix-timer implementation - the clock can simply
+be set without having to touch the rbtree. This also makes the handling
+of posix-timers simpler in general.
+
+The locking and per-CPU behavior of hrtimers was mostly taken from the
+existing timer wheel code, as it is mature and well suited. Sharing code
+was not really a win, due to the different data structures. Also, the
+hrtimer functions now have clearer behavior and clearer names - such as
+hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
+equivalent to del_timer() and del_timer_sync()] - so there's no direct
+1:1 mapping between them on the algorithmical level, and thus no real
+potential for code sharing either.
+
+Basic data types: every time value, absolute or relative, is in a
+special nanosecond-resolution type: ktime_t. The kernel-internal
+representation of ktime_t values and operations is implemented via
+macros and inline functions, and can be switched between a "hybrid
+union" type and a plain "scalar" 64bit nanoseconds representation (at
+compile time). The hybrid union type optimizes time conversions on 32bit
+CPUs. This build-time-selectable ktime_t storage format was implemented
+to avoid the performance impact of 64-bit multiplications and divisions
+on 32bit CPUs. Such operations are frequently necessary to convert
+between the storage formats provided by kernel and userspace interfaces
+and the internal time format. (See include/linux/ktime.h for further
+details.)
+
+hrtimers - rounding of timer values
+-----------------------------------
+
+the hrtimer code will round timer events to lower-resolution clocks
+because it has to. Otherwise it will do no artificial rounding at all.
+
+one question is, what resolution value should be returned to the user by
+the clock_getres() interface. This will return whatever real resolution
+a given clock has - be it low-res, high-res, or artificially-low-res.
+
+hrtimers - testing and verification
+-----------------------------------
+
+We used the high-resolution clock subsystem on top of hrtimers to verify
+the hrtimer implementation details in practice, and we also ran the posix
+timer tests in order to ensure specification compliance. We also ran
+tests on low-resolution clocks.
+
+The hrtimer patch converts the following kernel functionality to use
+hrtimers:
+
+ - nanosleep
+ - itimers
+ - posix-timers
+
+The conversion of nanosleep and posix-timers enabled the unification of
+nanosleep and clock_nanosleep.
+
+The code was successfully compiled for the following platforms:
+
+ i386, x86_64, ARM, PPC, PPC64, IA64
+
+The code was run-tested on the following platforms:
+
+ i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
+
+hrtimers were also integrated into the -rt tree, along with a
+hrtimers-based high-resolution clock implementation, so the hrtimers
+code got a healthy amount of testing and use in practice.
+
+	Thomas Gleixner, Ingo Molnar
Index: linux-2.6.19-rc5-mm1/Documentation/hrtimers.txt
===================================================================
--- linux-2.6.19-rc5-mm1.orig/Documentation/hrtimers.txt	2006-11-09 21:05:32.000000000 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,178 +0,0 @@
-
-hrtimers - subsystem for high-resolution kernel timers
-----------------------------------------------------
-
-This patch introduces a new subsystem for high-resolution kernel timers.
-
-One might ask the question: we already have a timer subsystem
-(kernel/timers.c), why do we need two timer subsystems? After a lot of
-back and forth trying to integrate high-resolution and high-precision
-features into the existing timer framework, and after testing various
-such high-resolution timer implementations in practice, we came to the
-conclusion that the timer wheel code is fundamentally not suitable for
-such an approach. We initially didn't believe this ('there must be a way
-to solve this'), and spent a considerable effort trying to integrate
-things into the timer wheel, but we failed. In hindsight, there are
-several reasons why such integration is hard/impossible:
-
-- the forced handling of low-resolution and high-resolution timers in
-  the same way leads to a lot of compromises, macro magic and #ifdef
-  mess. The timers.c code is very "tightly coded" around jiffies and
-  32-bitness assumptions, and has been honed and micro-optimized for a
-  relatively narrow use case (jiffies in a relatively narrow HZ range)
-  for many years - and thus even small extensions to it easily break
-  the wheel concept, leading to even worse compromises. The timer wheel
-  code is very good and tight code, there's zero problems with it in its
-  current usage - but it is simply not suitable to be extended for
-  high-res timers.
-
-- the unpredictable [O(N)] overhead of cascading leads to delays which
-  necessitate a more complex handling of high resolution timers, which
-  in turn decreases robustness. Such a design still led to rather large
-  timing inaccuracies. Cascading is a fundamental property of the timer
-  wheel concept, it cannot be 'designed out' without unevitably
-  degrading other portions of the timers.c code in an unacceptable way.
-
-- the implementation of the current posix-timer subsystem on top of
-  the timer wheel has already introduced a quite complex handling of
-  the required readjusting of absolute CLOCK_REALTIME timers at
-  settimeofday or NTP time - further underlying our experience by
-  example: that the timer wheel data structure is too rigid for high-res
-  timers.
-
-- the timer wheel code is most optimal for use cases which can be
-  identified as "timeouts". Such timeouts are usually set up to cover
-  error conditions in various I/O paths, such as networking and block
-  I/O. The vast majority of those timers never expire and are rarely
-  recascaded because the expected correct event arrives in time so they
-  can be removed from the timer wheel before any further processing of
-  them becomes necessary. Thus the users of these timeouts can accept
-  the granularity and precision tradeoffs of the timer wheel, and
-  largely expect the timer subsystem to have near-zero overhead.
-  Accurate timing for them is not a core purpose - in fact most of the
-  timeout values used are ad-hoc. For them it is at most a necessary
-  evil to guarantee the processing of actual timeout completions
-  (because most of the timeouts are deleted before completion), which
-  should thus be as cheap and unintrusive as possible.
-
-The primary users of precision timers are user-space applications that
-utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
-users like drivers and subsystems which require precise timed events
-(e.g. multimedia) can benefit from the availability of a separate
-high-resolution timer subsystem as well.
-
-While this subsystem does not offer high-resolution clock sources just
-yet, the hrtimer subsystem can be easily extended with high-resolution
-clock capabilities, and patches for that exist and are maturing quickly.
-The increasing demand for realtime and multimedia applications along
-with other potential users for precise timers gives another reason to
-separate the "timeout" and "precise timer" subsystems.
-
-Another potential benefit is that such a separation allows even more
-special-purpose optimization of the existing timer wheel for the low
-resolution and low precision use cases - once the precision-sensitive
-APIs are separated from the timer wheel and are migrated over to
-hrtimers. E.g. we could decrease the frequency of the timeout subsystem
-from 250 Hz to 100 HZ (or even smaller).
-
-hrtimer subsystem implementation details
-----------------------------------------
-
-the basic design considerations were:
-
-- simplicity
-
-- data structure not bound to jiffies or any other granularity. All the
-  kernel logic works at 64-bit nanoseconds resolution - no compromises.
-
-- simplification of existing, timing related kernel code
-
-another basic requirement was the immediate enqueueing and ordering of
-timers at activation time. After looking at several possible solutions
-such as radix trees and hashes, we chose the red black tree as the basic
-data structure. Rbtrees are available as a library in the kernel and are
-used in various performance-critical areas of e.g. memory management and
-file systems. The rbtree is solely used for time sorted ordering, while
-a separate list is used to give the expiry code fast access to the
-queued timers, without having to walk the rbtree.
-
-(This separate list is also useful for later when we'll introduce
-high-resolution clocks, where we need separate pending and expired
-queues while keeping the time-order intact.)
-
-Time-ordered enqueueing is not purely for the purposes of
-high-resolution clocks though, it also simplifies the handling of
-absolute timers based on a low-resolution CLOCK_REALTIME. The existing
-implementation needed to keep an extra list of all armed absolute
-CLOCK_REALTIME timers along with complex locking. In case of
-settimeofday and NTP, all the timers (!) had to be dequeued, the
-time-changing code had to fix them up one by one, and all of them had to
-be enqueued again. The time-ordered enqueueing and the storage of the
-expiry time in absolute time units removes all this complex and poorly
-scaling code from the posix-timer implementation - the clock can simply
-be set without having to touch the rbtree. This also makes the handling
-of posix-timers simpler in general.
-
-The locking and per-CPU behavior of hrtimers was mostly taken from the
-existing timer wheel code, as it is mature and well suited. Sharing code
-was not really a win, due to the different data structures. Also, the
-hrtimer functions now have clearer behavior and clearer names - such as
-hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
-equivalent to del_timer() and del_timer_sync()] - so there's no direct
-1:1 mapping between them on the algorithmical level, and thus no real
-potential for code sharing either.
-
-Basic data types: every time value, absolute or relative, is in a
-special nanosecond-resolution type: ktime_t. The kernel-internal
-representation of ktime_t values and operations is implemented via
-macros and inline functions, and can be switched between a "hybrid
-union" type and a plain "scalar" 64bit nanoseconds representation (at
-compile time). The hybrid union type optimizes time conversions on 32bit
-CPUs. This build-time-selectable ktime_t storage format was implemented
-to avoid the performance impact of 64-bit multiplications and divisions
-on 32bit CPUs. Such operations are frequently necessary to convert
-between the storage formats provided by kernel and userspace interfaces
-and the internal time format. (See include/linux/ktime.h for further
-details.)
-
-hrtimers - rounding of timer values
------------------------------------
-
-the hrtimer code will round timer events to lower-resolution clocks
-because it has to. Otherwise it will do no artificial rounding at all.
-
-one question is, what resolution value should be returned to the user by
-the clock_getres() interface. This will return whatever real resolution
-a given clock has - be it low-res, high-res, or artificially-low-res.
-
-hrtimers - testing and verification
-----------------------------------
-
-We used the high-resolution clock subsystem ontop of hrtimers to verify
-the hrtimer implementation details in praxis, and we also ran the posix
-timer tests in order to ensure specification compliance. We also ran
-tests on low-resolution clocks.
-
-The hrtimer patch converts the following kernel functionality to use
-hrtimers:
-
- - nanosleep
- - itimers
- - posix-timers
-
-The conversion of nanosleep and posix-timers enabled the unification of
-nanosleep and clock_nanosleep.
-
-The code was successfully compiled for the following platforms:
-
- i386, x86_64, ARM, PPC, PPC64, IA64
-
-The code was run-tested on the following platforms:
-
- i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
-
-hrtimers were also integrated into the -rt tree, along with a
-hrtimers-based high-resolution clock implementation, so the hrtimers
-code got a healthy amount of testing and use in practice.
-
-	Thomas Gleixner, Ingo Molnar

--



* [patch 04/19] Add a framework to manage clock event devices.
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (2 preceding siblings ...)
  2006-11-09 23:38 ` [patch 03/19] hrtimers: Move and add documentation Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10  9:47   ` Arjan van de Ven
  2006-11-23 22:36   ` Roman Zippel
  2006-11-09 23:38 ` [patch 05/19] ACPI: Include apic.h Thomas Gleixner
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: clockevents-core.patch --]
[-- Type: text/plain, Size: 28879 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

We have two types of clock event devices:
- global events (one device per system)
- local events (one device per cpu)

We assign the various time(r) related interrupts to those devices:

- global tick (advances jiffies)
- update process times (per cpu)
- profiling (per cpu)
- next timer events (per cpu)

Architectures register their clock event devices, with specific capability
bits set, and the framework code assigns the appropriate event handler to the
event device.  The functionality is assigned via an event handler to avoid
runtime evaluation of the assigned function bits.
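
A minimal user-space sketch of the handler assignment described above,
assuming the three base capability bits simply index a table of
pre-combined handlers. All names here (select_handler(), remaining_caps(),
the h_*() functions and the counters) are hypothetical stand-ins and not the
API of this patch; only the bit values mirror CLOCK_CAP_TICK, CLOCK_CAP_UPDATE
and CLOCK_CAP_PROFILE.

```c
#include <stddef.h>

/* Capability bits; values mirror the CLOCK_CAP_* defines in the patch. */
#define CAP_TICK	0x1u
#define CAP_UPDATE	0x2u
#define CAP_PROFILE	0x4u
#define BASE_CAPS_MASK	(CAP_TICK | CAP_UPDATE | CAP_PROFILE)

/*
 * Hypothetical counters standing in for do_timer(),
 * update_process_times() and profile_tick().
 */
static int ticks, updates, profiles;

typedef void (*handler_fn)(void);

static void h_noop(void) { }
static void h_tick(void) { ticks++; }
static void h_update(void) { updates++; }
static void h_tick_update(void) { ticks++; updates++; }
static void h_profile(void) { profiles++; }
static void h_tick_profile(void) { ticks++; profiles++; }
static void h_update_profile(void) { updates++; profiles++; }
static void h_tick_update_profile(void) { ticks++; updates++; profiles++; }

/* One handler per combination of the three base capability bits. */
static const handler_fn handlers[] = {
	h_noop,			/* 0: nothing selected */
	h_tick,			/* 1: tick */
	h_update,		/* 2: update */
	h_tick_update,		/* 3: tick + update */
	h_profile,		/* 4: profile */
	h_tick_profile,		/* 5: tick + profile */
	h_update_profile,	/* 6: update + profile */
	h_tick_update_profile,	/* 7: tick + update + profile */
};

/* Pick the handler once, at setup time, instead of testing bits per event. */
static handler_fn select_handler(unsigned int caps)
{
	return handlers[caps & BASE_CAPS_MASK];
}

/* Capabilities an existing device keeps after a new device takes newcaps. */
static unsigned int remaining_caps(unsigned int real_caps, unsigned int newcaps)
{
	return real_caps & ~newcaps;
}
```

In this sketch select_handler(CAP_TICK | CAP_PROFILE) returns
h_tick_profile, and remaining_caps(BASE_CAPS_MASK, CAP_UPDATE | CAP_PROFILE)
leaves only CAP_TICK, which corresponds to the PIT keeping just the periodic
tick once a local APIC timer has been registered on SMP.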

This allows the core code to control the clock event devices without the
architectures having to worry about the details of function assignment.  It is
also a prerequisite for high resolution timers and dynamic ticks, which need
the core code to control the clock functionality without intrusive changes to
the architecture code.

For x86 based systems the code provides the ability to broadcast timer events.
This is necessary because the per CPU local APIC timers are stopped in power
saving states.

When high resolution timers and dynamic ticks are disabled, there is no change
in the behaviour of the system.
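
The mult/shift fields of struct clock_event_device and the div_sc() helper
in the patch below rely on scaled integer math to convert between nanoseconds
and device ticks. The following stand-alone sketch illustrates that
arithmetic with fixed-width types; calc_mult(), ns_to_ticks() and
ticks_to_ns() are hypothetical names chosen for illustration, and the PIT
input clock of 1193182 Hz in the usage note is an assumed example value.

```c
#include <stdint.h>

/*
 * factor = (clock_ticks << shift) / nanoseconds
 *
 * Same formula as div_sc() in the patch. The caller has to pick a shift
 * for which the result still fits into 32 bits.
 */
static uint32_t calc_mult(uint32_t ticks, uint32_t nsec, int shift)
{
	return (uint32_t)(((uint64_t)ticks << shift) / nsec);
}

/*
 * clock_ticks = (nanoseconds * factor) >> shift
 *
 * The product must not overflow 64 bits, so ns is limited to roughly
 * 2^32 in this sketch. The result is rounded down.
 */
static uint64_t ns_to_ticks(uint64_t ns, uint32_t mult, int shift)
{
	return (ns * mult) >> shift;
}

/*
 * nanoseconds = (clock_ticks << shift) / factor
 *
 * Inverse direction, comparable to clockevent_delta2ns() without the
 * clamping to the minimum resolution and LONG_MAX.
 */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t mult, int shift)
{
	return (ticks << shift) / mult;
}
```

With calc_mult(1193182, 1000000000, 32) as the factor, ns_to_ticks() maps
1 ms (1000000 ns) to 1193 ticks, and ticks_to_ns(1193, ...) returns slightly
less than 1000000 ns because both conversions truncate.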

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/include/linux/clockchips.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/include/linux/clockchips.h	2006-11-09 21:06:13.000000000 +0100
@@ -0,0 +1,143 @@
+/*  linux/include/linux/clockchips.h
+ *
+ *  This file contains the structure definitions for clockchips.
+ *
+ *  If you are not a clockchip, or the time of day code, you should
+ *  not be including this file!
+ */
+#ifndef _LINUX_CLOCKCHIPS_H
+#define _LINUX_CLOCKCHIPS_H
+
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+
+#include <linux/clocksource.h>
+#include <linux/interrupt.h>
+
+struct clock_event_device;
+
+/* Clock event mode commands */
+enum clock_event_mode {
+	CLOCK_EVT_PERIODIC,
+	CLOCK_EVT_ONESHOT,
+	CLOCK_EVT_SHUTDOWN,
+};
+
+/*
+ * Clock event capability flags:
+ *
+ * CAP_TICK:	The event source should be used for the periodic tick
+ * CAP_UPDATE:	The event source handler should call update_process_times()
+ * CAP_PROFILE: The event source handler should call profile_tick()
+ * CAP_NEXTEVT:	The event source can be reprogrammed in oneshot mode and is
+ *		a per cpu event source.
+ *
+ * The capability flags are used to select the appropriate handler for an event
+ * source. On an i386 UP system the PIT can serve all of the functionalities,
+ * while on a SMP system the PIT is solely used for the periodic tick and the
+ * local APIC timers are used for UPDATE / PROFILE / NEXTEVT. To avoid the run
+ * time query of those flags, the clock events layer assigns the appropriate
+ * event handler function, which contains only the selected calls, to the
+ * event.
+ */
+#define CLOCK_CAP_TICK		0x000001
+#define CLOCK_CAP_UPDATE	0x000002
+#define CLOCK_CAP_PROFILE	0x000004
+#ifdef CONFIG_HIGH_RES_TIMERS
+# define CLOCK_CAP_NEXTEVT	0x000008
+#else
+# define CLOCK_CAP_NEXTEVT	0x000000
+#endif
+
+#define CLOCK_BASE_CAPS_MASK	(CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | \
+				 CLOCK_CAP_UPDATE)
+#define CLOCK_CAPS_MASK		(CLOCK_BASE_CAPS_MASK | CLOCK_CAP_NEXTEVT)
+
+/**
+ * struct clock_event_device - clock event descriptor
+ *
+ * @name:		ptr to clock event name
+ * @capabilities:	capabilities of the event chip
+ * @max_delta_ns:	maximum delta value in ns
+ * @min_delta_ns:	minimum delta value in ns
+ * @mult:		nanosecond to cycles multiplier
+ * @shift:		nanoseconds to cycles divisor (power of two)
+ * @set_next_event:	set next event
+ * @set_mode:		set mode function
+ * @suspend:		suspend function (optional)
+ * @resume:		resume function (optional)
+ * @event_handler:	Assigned by the framework to be called by the low
+ *			level handler of the event source
+ */
+struct clock_event_device {
+	const char	*name;
+	unsigned int	capabilities;
+	unsigned long	max_delta_ns;
+	unsigned long	min_delta_ns;
+	unsigned long	mult;
+	int		shift;
+	void		(*set_next_event)(unsigned long evt,
+					  struct clock_event_device *);
+	void		(*set_mode)(enum clock_event_mode mode,
+				    struct clock_event_device *);
+	void		(*event_handler)(struct pt_regs *regs);
+};
+
+/*
+ * Calculate a multiplication factor for scaled math, which is used to convert
+ * nanoseconds based values to clock ticks:
+ *
+ * clock_ticks = (nanoseconds * factor) >> shift.
+ *
+ * div_sc is the rearranged equation to calculate a factor from a given clock
+ * ticks / nanoseconds ratio:
+ *
+ * factor = (clock_ticks << shift) / nanoseconds
+ */
+static inline unsigned long div_sc(unsigned long ticks, unsigned long nsec,
+				   int shift)
+{
+	uint64_t tmp = ((uint64_t)ticks) << shift;
+
+	do_div(tmp, nsec);
+	return (unsigned long) tmp;
+}
+
+/* Clock event layer functions */
+extern int register_local_clockevent(struct clock_event_device *);
+extern int register_global_clockevent(struct clock_event_device *);
+extern unsigned long clockevent_delta2ns(unsigned long latch,
+					 struct clock_event_device *evt);
+extern void clockevents_init(void);
+
+extern int clockevents_init_next_event(void);
+extern int clockevents_set_next_event(ktime_t expires, int force);
+extern int clockevents_next_event_available(void);
+extern void clockevents_resume_events(void);
+
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
+extern void clockevents_set_broadcast(struct clock_event_device *evt,
+				      int broadcast);
+extern void clockevents_set_global_broadcast(struct clock_event_device *evt,
+					     int broadcast);
+extern int clockevents_register_broadcast(void (*fun)(cpumask_t *mask));
+#else
+static inline void clockevents_set_broadcast(struct clock_event_device *evt,
+					     int broadcast)
+{
+}
+#endif
+
+#else
+
+# define clockevents_init()		do { } while(0)
+# define clockevents_resume_events()	do { } while(0)
+
+struct clock_event_device;
+static inline void clockevents_set_broadcast(struct clock_event_device *evt,
+					     int broadcast)
+{
+}
+
+#endif
+
+#endif
Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hrtimer.h	2006-11-09 21:06:09.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hrtimer.h	2006-11-09 21:06:13.000000000 +0100
@@ -144,6 +144,9 @@ struct hrtimer_cpu_base {
  * is expired in the next softirq when the clock was advanced.
  */
 #define clock_was_set()		do { } while (0)
+#define hrtimer_clock_notify()	do { } while (0)
+extern ktime_t ktime_get(void);
+extern ktime_t ktime_get_real(void);
 
 /* Exported timer functions: */
 
Index: linux-2.6.19-rc5-mm1/init/main.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/init/main.c	2006-11-09 21:05:32.000000000 +0100
+++ linux-2.6.19-rc5-mm1/init/main.c	2006-11-09 21:06:13.000000000 +0100
@@ -37,6 +37,7 @@
 #include <linux/moduleparam.h>
 #include <linux/kallsyms.h>
 #include <linux/writeback.h>
+#include <linux/clockchips.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/efi.h>
@@ -532,6 +533,7 @@ asmlinkage void __init start_kernel(void
 	rcu_init();
 	init_IRQ();
 	pidhash_init();
+	clockevents_init();
 	init_timers();
 	hrtimers_init();
 	softirq_init();
Index: linux-2.6.19-rc5-mm1/kernel/hrtimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/hrtimer.c	2006-11-09 21:06:09.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/hrtimer.c	2006-11-09 21:06:13.000000000 +0100
@@ -31,6 +31,7 @@
  *  For licencing details see kernel-base/COPYING
  */
 
+#include <linux/clockchips.h>
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/percpu.h>
@@ -46,7 +47,7 @@
  *
  * returns the time in ktime_t format
  */
-static ktime_t ktime_get(void)
+ktime_t ktime_get(void)
 {
 	struct timespec now;
 
@@ -60,7 +61,7 @@ static ktime_t ktime_get(void)
  *
  * returns the time in ktime_t format
  */
-static ktime_t ktime_get_real(void)
+ktime_t ktime_get_real(void)
 {
 	struct timespec now;
 
@@ -299,6 +300,7 @@ static unsigned long ktime_divns(const k
  */
 void hrtimer_notify_resume(void)
 {
+	clockevents_resume_events();
 	clock_was_set();
 }
 
Index: linux-2.6.19-rc5-mm1/kernel/time/Makefile
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/time/Makefile	2006-11-09 21:05:32.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/time/Makefile	2006-11-09 21:06:13.000000000 +0100
@@ -1 +1,3 @@
 obj-y += ntp.o clocksource.o jiffies.o
+
+obj-$(CONFIG_GENERIC_CLOCKEVENTS) += clockevents.o
Index: linux-2.6.19-rc5-mm1/kernel/time/clockevents.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/kernel/time/clockevents.c	2006-11-09 21:06:13.000000000 +0100
@@ -0,0 +1,757 @@
+/*
+ * linux/kernel/time/clockevents.c
+ *
+ * This file contains functions which manage clock event drivers.
+ *
+ * Copyright(C) 2005-2006, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2005-2006, Red Hat, Inc., Ingo Molnar
+ *
+ * We have two types of clock event devices:
+ * - global events (one device per system)
+ * - local events (one device per cpu)
+ *
+ * We assign the various time(r) related interrupts to those devices
+ *
+ * - global tick
+ * - profiling (per cpu)
+ * - next timer events (per cpu)
+ *
+ * TODO:
+ * - implement variable frequency profiling
+ *
+ * This code is licenced under the GPL version 2. For details see
+ * kernel-base/COPYING.
+ */
+
+#include <linux/clockchips.h>
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/irq.h>
+#include <linux/init.h>
+#include <linux/hrtimer.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/sysdev.h>
+
+#define MAX_CLOCK_EVENTS	4
+#define GLOBAL_CLOCK_EVENT	MAX_CLOCK_EVENTS
+
+struct event_descr {
+	struct clock_event_device *event;
+	unsigned int mode;
+	unsigned int real_caps;
+	struct irqaction action;
+};
+
+struct local_events {
+	int installed;
+	struct event_descr events[MAX_CLOCK_EVENTS];
+	struct clock_event_device *nextevt;
+	ktime_t	expires_next;
+};
+
+/* Variables related to the global event device */
+static __read_mostly struct event_descr global_eventdevice;
+
+/*
+ * Lock to protect the above.
+ *
+ * Only the public management functions have to take this lock. The fast path
+ * of the framework, e.g. reprogramming the next event device is lockless as
+ * it is per cpu.
+ */
+static DEFINE_SPINLOCK(events_lock);
+
+/* Variables related to the per cpu local event devices */
+static DEFINE_PER_CPU(struct local_events, local_eventdevices);
+
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
+static void clockevents_check_broadcast(struct event_descr *descr);
+#else
+static inline void clockevents_check_broadcast(struct event_descr *descr) { }
+#endif
+
+/*
+ * Math helper. Convert a latch value (device ticks) to nanoseconds
+ */
+unsigned long clockevent_delta2ns(unsigned long latch,
+				  struct clock_event_device *evt)
+{
+	u64 clc = ((u64) latch << evt->shift);
+
+	do_div(clc, evt->mult);
+	if (clc < KTIME_MONOTONIC_RES.tv64)
+		clc = KTIME_MONOTONIC_RES.tv64;
+	if (clc > LONG_MAX)
+		clc = LONG_MAX;
+
+	return (unsigned long) clc;
+}
+
+/*
+ * Bootup and lowres handler: ticks only
+ */
+static void handle_tick(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+}
+
+/*
+ * Bootup and lowres handler: ticks and update_process_times
+ */
+static void handle_tick_update(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	update_process_times(user_mode(regs));
+}
+
+/*
+ * Bootup and lowres handler: ticks and profiling
+ */
+static void handle_tick_profile(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	profile_tick(CPU_PROFILING);
+}
+
+/*
+ * Bootup and lowres handler: ticks, update_process_times and profiling
+ */
+static void handle_tick_update_profile(struct pt_regs *regs)
+{
+	write_seqlock(&xtime_lock);
+	do_timer(1);
+	write_sequnlock(&xtime_lock);
+
+	update_process_times(user_mode(regs));
+	profile_tick(CPU_PROFILING);
+}
+
+/*
+ * Bootup and lowres handler: update_process_times
+ */
+static void handle_update(struct pt_regs *regs)
+{
+	update_process_times(user_mode(regs));
+}
+
+/*
+ * Bootup and lowres handler: update_process_times and profiling
+ */
+static void handle_update_profile(struct pt_regs *regs)
+{
+	update_process_times(user_mode(regs));
+	profile_tick(CPU_PROFILING);
+}
+
+/*
+ * Bootup and lowres handler: profiling
+ */
+static void handle_profile(struct pt_regs *regs)
+{
+	profile_tick(CPU_PROFILING);
+}
+
+/*
+ * Noop handler when we shut down an event device
+ */
+static void handle_noop(struct pt_regs *regs)
+{
+}
+
+/*
+ * Lookup table for bootup and lowres event assignment
+ *
+ * The event handler is chosen by the capability flags of the clock event
+ * device.
+ */
+static void __read_mostly *event_handlers[] = {
+	handle_noop,			/* 0: No capability selected */
+	handle_tick,			/* 1: Tick only	*/
+	handle_update,			/* 2: Update process times */
+	handle_tick_update,		/* 3: Tick + update process times */
+	handle_profile,			/* 4: Profiling int */
+	handle_tick_profile,		/* 5: Tick + Profiling int */
+	handle_update_profile,		/* 6: Update process times +
+					      profiling */
+	handle_tick_update_profile,	/* 7: Tick + update process times +
+					      profiling */
+#ifdef CONFIG_HIGH_RES_TIMERS
+	hrtimer_interrupt,		/* 8: Reprogrammable event device */
+#endif
+};
+
+/*
+ * Start up an event device
+ */
+static void startup_event(struct clock_event_device *evt, unsigned int caps)
+{
+	int mode;
+
+	if (caps == CLOCK_CAP_NEXTEVT)
+		mode = CLOCK_EVT_ONESHOT;
+	else
+		mode = CLOCK_EVT_PERIODIC;
+
+	evt->set_mode(mode, evt);
+}
+
+/*
+ * Setup an event device. Assign a handler and start it up
+ */
+static void setup_event(struct event_descr *descr,
+			struct clock_event_device *evt, unsigned int caps)
+{
+	void *handler = event_handlers[caps];
+
+	/* Set the event handler */
+	evt->event_handler = handler;
+
+	/* Store all relevant information */
+	descr->real_caps = caps;
+
+	startup_event(evt, caps);
+
+	printk(KERN_INFO "Clock event device %s configured with caps set: "
+	       "%02x\n", evt->name, descr->real_caps);
+}
+
+/**
+ * register_global_clockevent - register the device which generates
+ *			     global clock events
+ * @evt:	The device which generates global clock events (ticks)
+ *
+ * This can be a device which is only necessary for bootup. On UP systems this
+ * might be the only event device which is used for everything including
+ * high resolution events.
+ *
+ * When a cpu local event device is installed the global event device is
+ * switched off in the high resolution timer / tickless mode.
+ */
+int __init register_global_clockevent(struct clock_event_device *evt)
+{
+	/* Already installed? */
+	if (global_eventdevice.event) {
+		printk(KERN_ERR "Global clock event device already installed: "
+		       "%s. Ignoring new global eventsource %s\n",
+		       global_eventdevice.event->name,
+		       evt->name);
+		return -EBUSY;
+	}
+
+	/* Preset the handler in any case */
+	evt->event_handler = handle_noop;
+
+	/*
+	 * Check, whether it is a valid global event device
+	 */
+	if (!(evt->capabilities & CLOCK_BASE_CAPS_MASK)) {
+		printk(KERN_ERR "Unsupported clock event device %s\n",
+		       evt->name);
+		return -EINVAL;
+	}
+
+	/*
+	 * On UP systems the global clock event device can be used as the next
+	 * event device. On SMP this is disabled because the next event device
+	 * must be per CPU.
+	 */
+	if (num_possible_cpus() > 1)
+		evt->capabilities &= ~CLOCK_CAP_NEXTEVT;
+
+
+	/* Mask out high resolution capabilities for now */
+	global_eventdevice.event = evt;
+	setup_event(&global_eventdevice, evt,
+		    evt->capabilities & CLOCK_BASE_CAPS_MASK);
+	return 0;
+}
+
+/*
+ * Mask out the functionality which is covered by the new event device
+ * and assign a new event handler.
+ */
+static void recalc_active_event(struct event_descr *descr,
+				unsigned int newcaps)
+{
+	unsigned int caps;
+
+	if (!descr->real_caps)
+		return;
+
+	/* Mask the overlapping bits */
+	caps = descr->real_caps & ~newcaps;
+
+	/* Assign the new event handler */
+	if (caps) {
+		descr->event->event_handler = event_handlers[caps];
+		printk(KERN_INFO "Clock event device %s new caps set: %02x\n" ,
+		       descr->event->name, caps);
+	} else {
+		descr->event->event_handler = handle_noop;
+
+		if (descr->event->set_mode)
+			descr->event->set_mode(CLOCK_EVT_SHUTDOWN,
+					       descr->event);
+
+		printk(KERN_INFO "Clock event device %s disabled\n" ,
+		       descr->event->name);
+	}
+	descr->real_caps = caps;
+	clockevents_check_broadcast(descr);
+}
+
+/*
+ * Recalc the events and reassign the handlers if necessary
+ *
+ * Called with events_lock held to protect the global event device.
+ */
+static int recalc_events(struct local_events *devices,
+			 struct event_descr *descr,
+			 struct clock_event_device *evt, unsigned int caps)
+{
+	int i;
+
+	if (!descr && devices->installed == MAX_CLOCK_EVENTS)
+		return -ENOSPC;
+
+	/*
+	 * If there is no handler and this is not a next-event capable
+	 * event device, refuse to handle it
+	 */
+	if (!(evt->capabilities & CLOCK_CAP_NEXTEVT) && !event_handlers[caps]) {
+		printk(KERN_ERR "Unsupported clock event device %s\n",
+		       evt->name);
+		return -EINVAL;
+	}
+
+	if (caps) {
+		if (global_eventdevice.event && descr != &global_eventdevice)
+			recalc_active_event(&global_eventdevice, caps);
+
+		for (i = 0; i < devices->installed; i++) {
+			if (&devices->events[i] != descr)
+				recalc_active_event(&devices->events[i], caps);
+		}
+	}
+
+	/* New device ? */
+	if (!descr) {
+		descr = &devices->events[devices->installed++];
+		descr->event = evt;
+	}
+
+	if (caps) {
+		/* Is next_event event device going to be installed? */
+		if (caps & CLOCK_CAP_NEXTEVT)
+			caps = CLOCK_CAP_NEXTEVT;
+
+		setup_event(descr, evt, caps);
+	} else
+		printk(KERN_INFO "Inactive clock event device %s registered\n",
+		       evt->name);
+
+	return 0;
+}
+
+/**
+ * register_local_clockevent - Set up a cpu local clock event device
+ * @evt:	event device to be registered
+ */
+int register_local_clockevent(struct clock_event_device *evt)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	/* Preset the handler in any case */
+	evt->event_handler = handle_noop;
+
+	/* Recalc event devices and maybe reassign handlers */
+	ret = recalc_events(devices, NULL, evt,
+			    evt->capabilities & CLOCK_BASE_CAPS_MASK);
+
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	/*
+	 * Trigger hrtimers, when the event device is next-event
+	 * capable
+	 */
+	if (!ret && (evt->capabilities & CLOCK_CAP_NEXTEVT))
+		hrtimer_clock_notify();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(register_local_clockevent);
+
+/*
+ * Find a next-event capable event device
+ *
+ * Called with events_lock held to protect the global event device.
+ */
+static int get_next_event_device(void)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	int i;
+
+	for (i = 0; i < devices->installed; i++) {
+		struct clock_event_device *evt;
+
+		evt = devices->events[i].event;
+		if (evt->capabilities & CLOCK_CAP_NEXTEVT)
+			return i;
+	}
+
+	if (global_eventdevice.event->capabilities & CLOCK_CAP_NEXTEVT)
+		return GLOBAL_CLOCK_EVENT;
+
+	return -ENODEV;
+}
+
+/**
+ * clockevents_next_event_available - Check for an installed next-event device
+ *
+ * Returns 1, when such a device exists, otherwise 0
+ */
+int clockevents_next_event_available(void)
+{
+	unsigned long flags;
+	int idx;
+
+	spin_lock_irqsave(&events_lock, flags);
+	idx = get_next_event_device();
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return IS_ERR_VALUE(idx) ? 0 : 1;
+}
+
+/**
+ * clockevents_init_next_event - switch to next event (oneshot) mode
+ *
+ * Switch to one shot mode. On SMP systems the global event (tick) device is
+ * switched off. It is replaced by a hrtimer. On UP systems the global event
+ * device might be the only one and can be used as the next event device too.
+ *
+ * Returns 0 on success, otherwise an error code.
+ */
+int clockevents_init_next_event(void)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	struct event_descr *nextevt;
+	unsigned long flags;
+	int idx, ret = -ENODEV;
+
+	if (devices->nextevt)
+		return -EBUSY;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	idx = get_next_event_device();
+	if (IS_ERR_VALUE(idx))
+		goto out_unlock;
+
+	if (idx == GLOBAL_CLOCK_EVENT)
+		nextevt = &global_eventdevice;
+	else
+		nextevt = &devices->events[idx];
+
+	ret = recalc_events(devices, nextevt, nextevt->event, CLOCK_CAPS_MASK);
+	if (!ret)
+		devices->nextevt = nextevt->event;
+ out_unlock:
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return ret;
+}
+
+/*
+ * Reprogram the clock event device. Internal helper function
+ */
+static void do_clockevents_set_next_event(struct clock_event_device *nextevt,
+					 int64_t delta)
+{
+	unsigned long long clc;
+
+	if (delta > nextevt->max_delta_ns)
+		delta = nextevt->max_delta_ns;
+	if (delta < nextevt->min_delta_ns)
+		delta = nextevt->min_delta_ns;
+
+	clc = delta * nextevt->mult;
+	clc >>= nextevt->shift;
+	nextevt->set_next_event((unsigned long)clc, nextevt);
+}
+
+/**
+ * clockevents_set_next_event - Reprogram the clock event device.
+ * @expires:	absolute expiry time (monotonic clock)
+ * @force:	when set, enforce reprogramming, even if the event is in the
+ *		past
+ *
+ * Returns 0 on success, -ETIME when the event is in the past and force is not
+ * set.
+ * Called with interrupts disabled.
+ */
+int clockevents_set_next_event(ktime_t expires, int force)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	struct clock_event_device *nextevt = devices->nextevt;
+	int64_t delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+
+	if (delta <= 0 && !force) {
+		devices->expires_next.tv64 = KTIME_MAX;
+		return -ETIME;
+	}
+
+	devices->expires_next = expires;
+
+	do_clockevents_set_next_event(nextevt, delta);
+
+	return 0;
+}
+
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
+
+static cpumask_t global_event_broadcast;
+static cpumask_t local_event_broadcast;
+static void (*broadcast_function)(cpumask_t *mask);
+static void (*global_event_handler)(struct pt_regs *regs);
+
+/**
+ * clockevents_set_broadcast - switch next event device from/to broadcast mode
+ *
+ * Called, when the PM code enters a state, where the next event device is
+ * switched off or comes back from this state.
+ */
+void clockevents_set_broadcast(struct clock_event_device *evt, int broadcast)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	struct clock_event_device *glblevt = global_eventdevice.event;
+	int cpu = smp_processor_id();
+	ktime_t expires = { .tv64 = KTIME_MAX };
+	int64_t delta;
+	unsigned long flags;
+
+	if (devices->nextevt != evt)
+		return;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	if (broadcast) {
+		cpu_set(cpu, local_event_broadcast);
+		evt->set_mode(CLOCK_EVT_SHUTDOWN, evt);
+	} else {
+		cpu_clear(cpu, local_event_broadcast);
+		evt->set_mode(CLOCK_EVT_ONESHOT, evt);
+		if (devices->expires_next.tv64 != KTIME_MAX)
+			clockevents_set_next_event(devices->expires_next, 1);
+	}
+
+	/* Reprogram the broadcast device */
+	for (cpu = first_cpu(local_event_broadcast); cpu != NR_CPUS;
+	     cpu = next_cpu(cpu, local_event_broadcast)) {
+		devices = &per_cpu(local_eventdevices, cpu);
+		if (devices->expires_next.tv64 < expires.tv64)
+			expires = devices->expires_next;
+	}
+
+	if (expires.tv64 != KTIME_MAX) {
+		delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
+		do_clockevents_set_next_event(glblevt, delta);
+	}
+
+	spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/**
+ * clockevents_set_global_broadcast - mark event device for global broadcast
+ *
+ * Switch an event device from / to global broadcasting. This is only relevant
+ * when the system has not switched to high resolution mode.
+ */
+void clockevents_set_global_broadcast(struct clock_event_device *evt,
+				      int broadcast)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	int cpu = smp_processor_id();
+	unsigned long flags;
+
+	spin_lock_irqsave(&events_lock, flags);
+
+	if (broadcast) {
+		if (!cpu_isset(cpu, global_event_broadcast)) {
+			cpu_set(cpu, global_event_broadcast);
+			if (devices->nextevt != evt)
+				evt->set_mode(CLOCK_EVT_SHUTDOWN, evt);
+		}
+	} else {
+		if (cpu_isset(cpu, global_event_broadcast)) {
+			cpu_clear(cpu, global_event_broadcast);
+			if (devices->nextevt != evt)
+				evt->set_mode(CLOCK_EVT_PERIODIC, evt);
+		}
+	}
+
+	spin_unlock_irqrestore(&events_lock, flags);
+}
+
+/*
+ * Broadcast tick handler:
+ */
+static void handle_tick_broadcast(struct pt_regs *regs)
+{
+	/* Call the original global tick handler */
+	global_event_handler(regs);
+	broadcast_function(&global_event_broadcast);
+}
+
+/*
+ * Broadcast next event handler:
+ */
+static void handle_nextevt_broadcast(struct pt_regs *regs)
+{
+	struct local_events *devices;
+	ktime_t now = ktime_get();
+	cpumask_t mask = CPU_MASK_NONE;
+	int cpu;
+
+	spin_lock(&events_lock);
+	/* Find all expired events */
+	for (cpu = first_cpu(local_event_broadcast); cpu != NR_CPUS;
+	     cpu = next_cpu(cpu, local_event_broadcast)) {
+		devices = &per_cpu(local_eventdevices, cpu);
+		if (devices->expires_next.tv64 <= now.tv64) {
+			devices->expires_next.tv64 = KTIME_MAX;
+			cpu_set(cpu, mask);
+		}
+	}
+	spin_unlock(&events_lock);
+	/* Wakeup the cpus which have an expired event */
+	broadcast_function(&mask);
+}
+
+/*
+ * Check, if the reconfigured event device is the global broadcast device.
+ *
+ * Called with interrupts disabled and events_lock held
+ */
+static void clockevents_check_broadcast(struct event_descr *descr)
+{
+	if (descr != &global_eventdevice)
+		return;
+
+	/* The device was disabled. switch it to oneshot mode instead */
+	if (!descr->real_caps) {
+		global_event_handler = NULL;
+		descr->event->set_mode(CLOCK_EVT_ONESHOT, descr->event);
+		descr->event->event_handler = handle_nextevt_broadcast;
+	} else {
+		global_event_handler = descr->event->event_handler;
+		descr->event->event_handler = handle_tick_broadcast;
+	}
+
+}
+
+/*
+ * Install a broadcast function
+ */
+int clockevents_register_broadcast(void (*fun)(cpumask_t *mask))
+{
+	unsigned long flags;
+
+	if (broadcast_function)
+		return -EBUSY;
+
+	spin_lock_irqsave(&events_lock, flags);
+	broadcast_function = fun;
+	clockevents_check_broadcast(&global_eventdevice);
+	spin_unlock_irqrestore(&events_lock, flags);
+
+	return 0;
+}
+
+#endif
+
+/*
+ * Resume the cpu local clock events
+ */
+static void clockevents_resume_local_events(void *arg)
+{
+	struct local_events *devices = &__get_cpu_var(local_eventdevices);
+	int i;
+
+	for (i = 0; i < devices->installed; i++) {
+		if (devices->events[i].real_caps)
+			startup_event(devices->events[i].event,
+				      devices->events[i].real_caps);
+	}
+	touch_softlockup_watchdog();
+}
+
+/**
+ * clockevents_resume_events - resume the active clock devices
+ *
+ * Called after timekeeping is functional again
+ */
+void clockevents_resume_events(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	/* Resume global event device */
+	if (global_eventdevice.real_caps)
+		startup_event(global_eventdevice.event,
+			      global_eventdevice.real_caps);
+
+	local_irq_restore(flags);
+
+	/* Restart the CPU local events everywhere */
+	on_each_cpu(clockevents_resume_local_events, NULL, 0, 1);
+}
+
+/*
+ * Functions related to initialization and hotplug
+ */
+static int clockevents_cpu_notify(struct notifier_block *self,
+				  unsigned long action, void *hcpu)
+{
+	switch(action) {
+	case CPU_UP_PREPARE:
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+		/*
+		 * Do something sensible here !
+		 * Disable the cpu local clock event devices ???
+		 */
+		break;
+#endif
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata clockevents_nb = {
+	.notifier_call	= clockevents_cpu_notify,
+};
+
+void __init clockevents_init(void)
+{
+	clockevents_cpu_notify(&clockevents_nb, (unsigned long)CPU_UP_PREPARE,
+				(void *)(long)smp_processor_id());
+	register_cpu_notifier(&clockevents_nb);
+}
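
The nanosecond to device cycle conversion in do_clockevents_set_next_event()
can be tried out in userspace. The following is only a minimal sketch: the
struct evt_sketch type and the ns_to_cycles() name are hypothetical stand-ins
for the relevant clock_event_device fields, and a negative delta is simply
treated as the minimum delay here, which is an assumption of the sketch and
not a statement about the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant fields of struct clock_event_device. */
struct evt_sketch {
	uint64_t max_delta_ns;	/* longest programmable delay */
	uint64_t min_delta_ns;	/* shortest programmable delay */
	uint32_t mult;		/* ns -> cycles multiplier */
	uint32_t shift;		/* ns -> cycles right shift */
};

/* Clamp delta to the device limits, then scale: cycles = (ns * mult) >> shift. */
uint64_t ns_to_cycles(const struct evt_sketch *evt, int64_t delta_ns)
{
	uint64_t delta;

	assert(evt != 0 && evt->shift < 64);

	/* Assumption of this sketch: an event in the past gets the minimum delay. */
	if (delta_ns < (int64_t)evt->min_delta_ns)
		delta = evt->min_delta_ns;
	else if ((uint64_t)delta_ns > evt->max_delta_ns)
		delta = evt->max_delta_ns;
	else
		delta = (uint64_t)delta_ns;

	return (delta * evt->mult) >> evt->shift;
}
```

With mult = 3 and shift = 1 the device counts 1.5 cycles per nanosecond, so a
delta of 100ns maps to 150 cycles, and deltas outside [min_delta_ns,
max_delta_ns] are clamped before scaling.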

--


^ permalink raw reply	[flat|nested] 70+ messages in thread
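
The broadcast bookkeeping in the clockevents patch above (reprogram the
global device for the earliest pending expiry, then wake the CPUs whose event
has expired) can be illustrated roughly as follows. This is a userspace
sketch under assumptions: a plain 32 bit mask replaces cpumask_t, a fixed
array replaces the per-CPU expires_next fields, and all SKETCH_* names and
both function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS		4		/* hypothetical CPU count */
#define SKETCH_KTIME_MAX	INT64_MAX	/* stand-in for KTIME_MAX */

/* Earliest pending expiry among the CPUs which are in broadcast mode. */
int64_t earliest_broadcast_expiry(uint32_t bcast_mask, const int64_t *expires)
{
	int64_t earliest = SKETCH_KTIME_MAX;
	int cpu;

	assert(expires != NULL);

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!(bcast_mask & (UINT32_C(1) << cpu)))
			continue;
		if (expires[cpu] < earliest)
			earliest = expires[cpu];
	}
	return earliest;
}

/* Collect the broadcast CPUs with an expired event and mark them handled. */
uint32_t collect_expired(uint32_t bcast_mask, int64_t *expires, int64_t now)
{
	uint32_t expired = 0;	/* the result starts out as an empty mask */
	int cpu;

	assert(expires != NULL);

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		uint32_t bit = UINT32_C(1) << cpu;

		if (!(bcast_mask & bit))
			continue;
		if (expires[cpu] <= now) {
			expires[cpu] = SKETCH_KTIME_MAX;
			expired |= bit;
		}
	}
	return expired;
}
```

The returned mask corresponds to what would be handed to the registered
broadcast function. A result of SKETCH_KTIME_MAX from the first helper means
that there is nothing to program.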

* [patch 05/19] ACPI: Include apic.h
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (3 preceding siblings ...)
  2006-11-09 23:38 ` [patch 04/19] Add a framework to manage clock event devices Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 06/19] ACPI: Keep track of timer broadcast Thomas Gleixner
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: acpi-include-apic-h.patch --]
[-- Type: text/plain, Size: 1204 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

apic.h does not get included on UP compiles. As a result
ARCH_APICTIMER_STOPS_ON_C3 is not defined and UP boxen have no support for
timer broadcasting. This was never noticed, because the lapic timer is only
used for profiling on UP.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/drivers/acpi/processor_idle.c	2006-11-09 17:47:58.000000000 +0100
+++ linux-2.6.19-rc5-mm1/drivers/acpi/processor_idle.c	2006-11-09 20:20:06.000000000 +0100
@@ -40,6 +40,16 @@
 #include <linux/sched.h>	/* need_resched() */
 #include <linux/latency.h>
 
+/*
+ * Include the apic definitions for x86 to have the APIC timer related defines
+ * available also for UP (on SMP it gets magically included via linux/smp.h).
+ * asm/acpi.h is not an option, as it would require more include magic. Also
+ * creating an empty asm-ia64/apic.h would just trade pest vs. cholera.
+ */
+#ifdef CONFIG_X86
+#include <asm/apic.h>
+#endif
+
 #include <asm/io.h>
 #include <asm/uaccess.h>
 

--


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 06/19] ACPI: Keep track of timer broadcast
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (4 preceding siblings ...)
  2006-11-09 23:38 ` [patch 05/19] ACPI: Include apic.h Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10  9:51   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 07/19] ACPI: Add state propagation for dynamic broadcasting Thomas Gleixner
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: acpi-keep-track-of-timer-broadcast.patch --]
[-- Type: text/plain, Size: 3702 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

This is a preparatory patch for highres/dyntick:

- replace the big #ifdef ARCH_APICTIMER_STOPS_ON_C3 hackery by
  functions
- remove the double switch in the power verify function
  (in the worst case we switched ipi to apic and 20usec later
   apic to ipi)
- keep track of the state which stops the local APIC timer
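
The state tracking boils down to remembering the lowest C state index which
stops the local APIC timer. A rough userspace sketch of that rule, assuming
simplified stand-in types (struct cstate_sketch, the SKETCH_* enum and the
function name are hypothetical and not part of the patch):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum cstate_type { SKETCH_C1 = 1, SKETCH_C2, SKETCH_C3 };

/* Hypothetical, reduced version of struct acpi_processor_cx. */
struct cstate_sketch {
	enum cstate_type type;
	int valid;
};

/*
 * Return the index of the first valid state which stops the local APIC
 * timer, or INT_MAX when no such state exists.
 */
int first_timer_stopping_state(const struct cstate_sketch *states,
			       int nstates, int vendor_is_amd)
{
	int threshold = INT_MAX;
	int i;

	assert(states != NULL);

	/* Index 0 is unused, as in the verify loop of the patch. */
	for (i = 1; i < nstates; i++) {
		const struct cstate_sketch *cx = &states[i];

		if (!cx->valid)
			continue;
		if (cx->type != SKETCH_C2 && cx->type != SKETCH_C3)
			continue;
		/* A lower state already marked the timer as unstable. */
		if (threshold < i)
			continue;
		/* C3 always stops the timer, on AMD a C2 state is suspect too. */
		if (cx->type == SKETCH_C3 || vendor_is_amd)
			threshold = i;
	}
	return threshold;
}
```

INT_MAX as the "never" value keeps the later comparison against the current
state index free of special cases.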

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff -puN drivers/acpi/processor_idle.c~acpi-keep-track-of-timer-broadcast drivers/acpi/processor_idle.c
--- a/drivers/acpi/processor_idle.c~acpi-keep-track-of-timer-broadcast
+++ a/drivers/acpi/processor_idle.c
@@ -246,6 +246,49 @@ static void acpi_cstate_enter(struct acp
 	}
 }
 
+#ifdef ARCH_APICTIMER_STOPS_ON_C3
+
+/*
+ * Some BIOS implementations switch to C3 in the published C2 state. This seems
+ * to be a common problem on AMD boxen.
+ */
+static void acpi_timer_check_state(int state, struct acpi_processor *pr,
+				   struct acpi_processor_cx *cx)
+{
+	struct acpi_processor_power *pwr = &pr->power;
+
+	/*
+	 * Check, if one of the previous states already marked the lapic
+	 * unstable
+	 */
+	if (pwr->timer_broadcast_on_state < state)
+		return;
+
+	if (cx->type == ACPI_STATE_C3 ||
+	    boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+		pr->power.timer_broadcast_on_state = state;
+		return;
+	}
+}
+
+static void acpi_propagate_timer_broadcast(struct acpi_processor *pr)
+{
+	cpumask_t mask = cpumask_of_cpu(pr->id);
+
+	if (pr->power.timer_broadcast_on_state < INT_MAX)
+		on_each_cpu(switch_APIC_timer_to_ipi, &mask, 1, 1);
+	else
+		on_each_cpu(switch_ipi_to_APIC_timer, &mask, 1, 1);
+}
+
+#else
+
+static void acpi_timer_check_state(int state, struct acpi_processor *pr,
+				   struct acpi_processor_cx *cstate) { }
+static void acpi_propagate_timer_broadcast(struct acpi_processor *pr) { }
+
+#endif
+
 static void acpi_processor_idle(void)
 {
 	struct acpi_processor *pr = NULL;
@@ -912,11 +955,7 @@ static int acpi_processor_power_verify(s
 	unsigned int i;
 	unsigned int working = 0;
 
-#ifdef ARCH_APICTIMER_STOPS_ON_C3
-	int timer_broadcast = 0;
-	cpumask_t mask = cpumask_of_cpu(pr->id);
-	on_each_cpu(switch_ipi_to_APIC_timer, &mask, 1, 1);
-#endif
+	pr->power.timer_broadcast_on_state = INT_MAX;
 
 	for (i = 1; i < ACPI_PROCESSOR_MAX_POWER; i++) {
 		struct acpi_processor_cx *cx = &pr->power.states[i];
@@ -928,21 +967,14 @@ static int acpi_processor_power_verify(s
 
 		case ACPI_STATE_C2:
 			acpi_processor_power_verify_c2(cx);
-#ifdef ARCH_APICTIMER_STOPS_ON_C3
-			/* Some AMD systems fake C3 as C2, but still
-			   have timer troubles */
-			if (cx->valid && 
-				boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
-				timer_broadcast++;
-#endif
+			if (cx->valid)
+				acpi_timer_check_state(i, pr, cx);
 			break;
 
 		case ACPI_STATE_C3:
 			acpi_processor_power_verify_c3(pr, cx);
-#ifdef ARCH_APICTIMER_STOPS_ON_C3
 			if (cx->valid)
-				timer_broadcast++;
-#endif
+				acpi_timer_check_state(i, pr, cx);
 			break;
 		}
 
@@ -950,10 +982,7 @@ static int acpi_processor_power_verify(s
 			working++;
 	}
 
-#ifdef ARCH_APICTIMER_STOPS_ON_C3
-	if (timer_broadcast)
-		on_each_cpu(switch_APIC_timer_to_ipi, &mask, 1, 1);
-#endif
+	acpi_propagate_timer_broadcast(pr);
 
 	return (working);
 }
diff -puN include/acpi/processor.h~acpi-keep-track-of-timer-broadcast include/acpi/processor.h
--- a/include/acpi/processor.h~acpi-keep-track-of-timer-broadcast
+++ a/include/acpi/processor.h
@@ -79,6 +79,7 @@ struct acpi_processor_power {
 	u32 bm_activity;
 	int count;
 	struct acpi_processor_cx states[ACPI_PROCESSOR_MAX_POWER];
+	int timer_broadcast_on_state;
 };
 
 /* Performance Management */
_

--


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 07/19] ACPI: Add state propagation for dynamic broadcasting
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (5 preceding siblings ...)
  2006-11-09 23:38 ` [patch 06/19] ACPI: Keep track of timer broadcast Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10  9:52   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 08/19] i386: cleanup apic code Thomas Gleixner
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: acpi-add-hres-dyntick-broadcast-support.patch --]
[-- Type: text/plain, Size: 4343 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

This is a preparatory patch for high resolution timers and dynticks.

The local APIC timer is fast and per CPU, but it gets stopped in C3 state.
On some broken systems, especially AMD based ones, it already gets stopped in
C2. This also affects akpm's jinxed VAIO.

The broadcast function informs the local APIC management code that a state
which stops the local APIC is going to be entered/exited. This switches the
local APIC timer to the PIT broadcast mode. The clockevents layer takes care
of the distribution of events.

The lapic_timer_idle_broadcast() function is an empty inline for now, which
will be replaced by the later clockevents patches for the affected
architectures.
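
The enter/exit bracketing around the idle states can be pictured with the
following userspace sketch. It assumes a function pointer hook in place of
lapic_timer_idle_broadcast(). The power_sketch struct, the recording hook
and its counters are hypothetical and exist only to make the example
observable.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical hook standing in for lapic_timer_idle_broadcast(). */
typedef void (*broadcast_hook_t)(int broadcast);

struct power_sketch {
	int timer_broadcast_on_state;	/* INT_MAX: no state stops the timer */
	broadcast_hook_t hook;
};

/* Counters for the example hook below. */
int sketch_on_calls;
int sketch_off_calls;

/* Example hook: just record how often broadcast was switched on and off. */
void sketch_record_hook(int broadcast)
{
	if (broadcast)
		sketch_on_calls++;
	else
		sketch_off_calls++;
}

/* Notify the hook only for states at or beyond the recorded threshold. */
void state_timer_broadcast(const struct power_sketch *pwr, int state,
			   int broadcast)
{
	assert(pwr != NULL && pwr->hook != NULL);

	if (state >= pwr->timer_broadcast_on_state)
		pwr->hook(broadcast);
}

/* Bracket a simulated idle period the way the C2 / C3 paths do. */
void enter_idle_state(const struct power_sketch *pwr, int state)
{
	state_timer_broadcast(pwr, state, 1);	/* timer is about to stop */
	/* ... the CPU would sleep here ... */
	state_timer_broadcast(pwr, state, 0);	/* timer is running again */
}
```

States below the threshold never touch the hook, so systems without a
timer-stopping state pay only for one comparison per idle entry and exit.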

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/drivers/acpi/processor_idle.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/drivers/acpi/processor_idle.c	2006-11-09 21:13:00.000000000 +0100
+++ linux-2.6.19-rc5-mm1/drivers/acpi/processor_idle.c	2006-11-09 21:13:06.000000000 +0100
@@ -281,11 +281,27 @@ static void acpi_propagate_timer_broadca
 		on_each_cpu(switch_ipi_to_APIC_timer, &mask, 1, 1);
 }
 
+/* Power(C) State timer broadcast control */
+static void acpi_state_timer_broadcast(struct acpi_processor *pr,
+				       struct acpi_processor_cx *cx,
+				       int broadcast)
+{
+	int state = cx - pr->power.states;
+
+	if (state >= pr->power.timer_broadcast_on_state)
+		lapic_timer_idle_broadcast(broadcast);
+}
+
 #else
 
 static void acpi_timer_check_state(int state, struct acpi_processor *pr,
 				   struct acpi_processor_cx *cstate) { }
 static void acpi_propagate_timer_broadcast(struct acpi_processor *pr) { }
+static void acpi_state_timer_broadcast(struct acpi_processor *pr,
+				       struct acpi_processor_cx *cx,
+				       int broadcast)
+{
+}
 
 #endif
 
@@ -431,6 +447,7 @@ static void acpi_processor_idle(void)
 		/* Get start time (ticks) */
 		t1 = inl(acpi_fadt.xpm_tmr_blk.address);
 		/* Invoke C2 */
+		acpi_state_timer_broadcast(pr, cx, 1);
 		acpi_cstate_enter(cx);
 		/* Get end time (ticks) */
 		t2 = inl(acpi_fadt.xpm_tmr_blk.address);
@@ -445,6 +462,7 @@ static void acpi_processor_idle(void)
 		/* Compute time (ticks) that we were actually asleep */
 		sleep_ticks =
 		    ticks_elapsed(t1, t2) - cx->latency_ticks - C2_OVERHEAD;
+		acpi_state_timer_broadcast(pr, cx, 0);
 		break;
 
 	case ACPI_STATE_C3:
@@ -467,6 +485,7 @@ static void acpi_processor_idle(void)
 		/* Get start time (ticks) */
 		t1 = inl(acpi_fadt.xpm_tmr_blk.address);
 		/* Invoke C3 */
+		acpi_state_timer_broadcast(pr, cx, 1);
 		acpi_cstate_enter(cx);
 		/* Get end time (ticks) */
 		t2 = inl(acpi_fadt.xpm_tmr_blk.address);
@@ -487,6 +506,7 @@ static void acpi_processor_idle(void)
 		/* Compute time (ticks) that we were actually asleep */
 		sleep_ticks =
 		    ticks_elapsed(t1, t2) - cx->latency_ticks - C3_OVERHEAD;
+		acpi_state_timer_broadcast(pr, cx, 0);
 		break;
 
 	default:
Index: linux-2.6.19-rc5-mm1/include/asm-i386/apic.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/apic.h	2006-11-09 21:14:36.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/apic.h	2006-11-09 21:16:34.000000000 +0100
@@ -113,6 +113,7 @@ extern void setup_secondary_APIC_clock (
 extern int APIC_init_uniprocessor (void);
 extern void disable_APIC_timer(void);
 extern void enable_APIC_timer(void);
+static inline void lapic_timer_idle_broadcast(int broadcast) { }
 
 extern void enable_NMI_through_LVT0 (void * dummy);
 
Index: linux-2.6.19-rc5-mm1/include/asm-x86_64/apic.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-x86_64/apic.h	2006-11-08 16:40:14.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-x86_64/apic.h	2006-11-09 21:17:00.000000000 +0100
@@ -84,6 +84,7 @@ extern int APIC_init_uniprocessor (void)
 extern void disable_APIC_timer(void);
 extern void enable_APIC_timer(void);
 extern void clustered_apic_check(void);
+static inline void lapic_timer_idle_broadcast(int broadcast) { }
 
 extern void setup_APIC_extened_lvt(unsigned char lvt_off, unsigned char vector,
 				   unsigned char msg_type, unsigned char mask);

--


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [patch 08/19] i386: cleanup apic code
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (6 preceding siblings ...)
  2006-11-09 23:38 ` [patch 07/19] ACPI: Add state propagation for dynamic broadcasting Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10 10:04   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 09/19] i386: Convert to clock event devices Thomas Gleixner
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: i386-cleanup-apic.patch --]
[-- Type: text/plain, Size: 72055 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The apic code is quite unstructured and missing a lot of comments.

- Restructure the code into helper functions, timer, setup/shutdown,
  interrupt and power management blocks. 
- Fixup comments.
- Namespace fixups
- Inline helpers for version and is_integrated
- Combine the ack_bad_irq functions

No functional changes.
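
For reviewers who want to play with the new inline helpers outside the
kernel, their decoding of the local APIC version register can be sketched as
below. The SKETCH_* macros mirror the field layout of the asm-i386
apicdef.h macros as far as I remember it, so treat the masks as an
assumption. The sketch_* functions take the raw register value as a
parameter instead of calling apic_read() and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout assumed to match GET_APIC_VERSION() and friends. */
#define SKETCH_GET_APIC_VERSION(x)	((x) & 0xFFu)
#define SKETCH_GET_APIC_MAXLVT(x)	(((x) >> 16) & 0xFFu)
#define SKETCH_APIC_INTEGRATED(ver)	((ver) & 0xF0u)

/* Version field of the LVR value. */
int sketch_lapic_version(uint32_t lvr)
{
	return (int)SKETCH_GET_APIC_VERSION(lvr);
}

/* Integrated APIC (non zero high nibble) or external 82489DX? */
int sketch_lapic_is_integrated(uint32_t lvr)
{
	return SKETCH_APIC_INTEGRATED(SKETCH_GET_APIC_VERSION(lvr)) != 0;
}

/* 82489DXs do not report the number of LVT entries, so assume 2. */
int sketch_lapic_maxlvt(uint32_t lvr)
{
	return sketch_lapic_is_integrated(lvr) ?
		(int)SKETCH_GET_APIC_MAXLVT(lvr) : 2;
}

/* Modern APIC: AMD family 0xf and later, or version 0x14 and later. */
int sketch_modern_apic(uint32_t lvr, int amd_family_f_or_later)
{
	assert(amd_family_f_or_later == 0 || amd_family_f_or_later == 1);

	if (amd_family_f_or_later)
		return 1;
	return sketch_lapic_version(lvr) >= 0x14;
}
```

An LVR value of 0x00050014 decodes to version 0x14, integrated, maxlvt 5 in
this sketch, while 0x00000001 is treated as an external 82489DX with the
fixed maxlvt of 2.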

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/apic.c	2006-11-09 21:05:30.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/apic.c	2006-11-09 21:06:15.000000000 +0100
@@ -44,6 +44,13 @@
 #include "io_ports.h"
 
 /*
+ * Sanity check
+ */
+#if (SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F
+# error SPURIOUS_APIC_VECTOR definition error
+#endif
+
+/*
  * cpu_mask that denotes the CPUs that needs timer interrupt coming in as
  * IPIs in place of local APIC timers
  */
@@ -51,561 +58,559 @@ static cpumask_t timer_bcast_ipi;
 
 /*
  * Knob to control our willingness to enable the local APIC.
+ *
+ * -1=force-disable, +1=force-enable
  */
-static int enable_local_apic __initdata = 0; /* -1=force-disable, +1=force-enable */
-
-static inline void lapic_disable(void)
-{
-	enable_local_apic = -1;
-	clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
-}
-
-static inline void lapic_enable(void)
-{
-	enable_local_apic = 1;
-}
+static int enable_local_apic __initdata = 0;
 
 /*
- * Debug level
+ * Debug level, exported for io_apic.c
  */
 int apic_verbosity;
 
-
 static void apic_pm_activate(void);
 
-static int modern_apic(void)
+
+/* Using APIC to generate smp_local_timer_interrupt? */
+int using_apic_timer __read_mostly = 0;
+
+/* Local APIC was disabled by the BIOS and enabled by the kernel */
+static int enabled_via_apicbase;
+
+/*
+ * Get the LAPIC version
+ */
+static inline int lapic_get_version(void)
 {
-	unsigned int lvr, version;
-	/* AMD systems use old APIC versions, so check the CPU */
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
-		boot_cpu_data.x86 >= 0xf)
-		return 1;
-	lvr = apic_read(APIC_LVR);
-	version = GET_APIC_VERSION(lvr);
-	return version >= 0x14;
+	return GET_APIC_VERSION(apic_read(APIC_LVR));
 }
 
 /*
- * 'what should we do if we get a hw irq event on an illegal vector'.
- * each architecture has to answer this themselves.
+ * Check, if the APIC is integrated or a separate chip
  */
-void ack_bad_irq(unsigned int irq)
+static inline int lapic_is_integrated(void)
 {
-	printk("unexpected IRQ trap at vector %02x\n", irq);
-	/*
-	 * Currently unexpected vectors happen only on SMP and APIC.
-	 * We _must_ ack these because every local APIC has only N
-	 * irq slots per priority level, and a 'hanging, unacked' IRQ
-	 * holds up an irq slot - in excessive cases (when multiple
-	 * unexpected vectors occur) that might lock up the APIC
-	 * completely.
-	 * But only ack when the APIC is enabled -AK
-	 */
-	if (cpu_has_apic)
-		ack_APIC_irq();
+	return APIC_INTEGRATED(lapic_get_version());
 }
 
-void __init apic_intr_init(void)
+/*
+ * Check, whether this is a modern or a first generation APIC
+ */
+static int modern_apic(void)
 {
-#ifdef CONFIG_SMP
-	smp_intr_init();
-#endif
-	/* self generated IPI for local APIC timer */
-	set_intr_gate(LOCAL_TIMER_VECTOR, apic_timer_interrupt);
-
-	/* IPI vectors for APIC spurious and error interrupts */
-	set_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
-	set_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
-
-	/* thermal monitor LVT interrupt */
-#ifdef CONFIG_X86_MCE_P4THERMAL
-	set_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt);
-#endif
+	/* AMD systems use old APIC versions, so check the CPU */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+	    boot_cpu_data.x86 >= 0xf)
+		return 1;
+	return lapic_get_version() >= 0x14;
 }
 
-/* Using APIC to generate smp_local_timer_interrupt? */
-int using_apic_timer __read_mostly = 0;
-
-static int enabled_via_apicbase;
-
+/**
+ * enable_NMI_through_LVT0 - enable NMI through local vector table 0
+ */
 void enable_NMI_through_LVT0 (void * dummy)
 {
-	unsigned int v, ver;
+	unsigned int v = APIC_DM_NMI;
 
-	ver = apic_read(APIC_LVR);
-	ver = GET_APIC_VERSION(ver);
-	v = APIC_DM_NMI;			/* unmask and set to NMI */
-	if (!APIC_INTEGRATED(ver))		/* 82489DX */
+	/* Level triggered for 82489DX */
+	if (!lapic_is_integrated())
 		v |= APIC_LVT_LEVEL_TRIGGER;
 	apic_write_around(APIC_LVT0, v);
 }
 
+/**
+ * get_physical_broadcast - Get number of physical broadcast IDs
+ */
 int get_physical_broadcast(void)
 {
-	if (modern_apic())
-		return 0xff;
-	else
-		return 0xf;
+	return modern_apic() ? 0xff : 0xf;
 }
 
-int get_maxlvt(void)
+/**
+ * lapic_get_maxlvt - get the maximum number of local vector table entries
+ */
+int lapic_get_maxlvt(void)
 {
-	unsigned int v, ver, maxlvt;
+	unsigned int v = apic_read(APIC_LVR);
 
-	v = apic_read(APIC_LVR);
-	ver = GET_APIC_VERSION(v);
 	/* 82489DXs do not report # of LVT entries. */
-	maxlvt = APIC_INTEGRATED(ver) ? GET_APIC_MAXLVT(v) : 2;
-	return maxlvt;
+	return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2;
 }
 
-void clear_local_APIC(void)
+/*
+ * Local APIC timer
+ */
+
+/*
+ * This part sets up the APIC 32 bit clock in LVTT1, with HZ interrupts
+ * per second. We assume that the caller has already set up the local
+ * APIC.
+ *
+ * The APIC timer is not exactly sync with the external timer chip, it
+ * closely follows bus clocks.
+ */
+
+/*
+ * The timer chip is already set up at HZ interrupts per second here,
+ * but we do not accept timer interrupts yet. We only allow the BP
+ * to calibrate.
+ */
+static unsigned int __devinit get_8254_timer_count(void)
 {
-	int maxlvt;
-	unsigned long v;
+	unsigned long flags;
 
-	maxlvt = get_maxlvt();
+	unsigned int count;
 
-	/*
-	 * Masking an LVT entry can trigger a local APIC error
-	 * if the vector is zero. Mask LVTERR first to prevent this.
-	 */
-	if (maxlvt >= 3) {
-		v = ERROR_APIC_VECTOR; /* any non-zero vector will do */
-		apic_write_around(APIC_LVTERR, v | APIC_LVT_MASKED);
-	}
-	/*
-	 * Careful: we have to set masks only first to deassert
-	 * any level-triggered sources.
-	 */
-	v = apic_read(APIC_LVTT);
-	apic_write_around(APIC_LVTT, v | APIC_LVT_MASKED);
-	v = apic_read(APIC_LVT0);
-	apic_write_around(APIC_LVT0, v | APIC_LVT_MASKED);
-	v = apic_read(APIC_LVT1);
-	apic_write_around(APIC_LVT1, v | APIC_LVT_MASKED);
-	if (maxlvt >= 4) {
-		v = apic_read(APIC_LVTPC);
-		apic_write_around(APIC_LVTPC, v | APIC_LVT_MASKED);
-	}
+	spin_lock_irqsave(&i8253_lock, flags);
 
-/* lets not touch this if we didn't frob it */
-#ifdef CONFIG_X86_MCE_P4THERMAL
-	if (maxlvt >= 5) {
-		v = apic_read(APIC_LVTTHMR);
-		apic_write_around(APIC_LVTTHMR, v | APIC_LVT_MASKED);
-	}
-#endif
-	/*
-	 * Clean APIC state for other OSs:
-	 */
-	apic_write_around(APIC_LVTT, APIC_LVT_MASKED);
-	apic_write_around(APIC_LVT0, APIC_LVT_MASKED);
-	apic_write_around(APIC_LVT1, APIC_LVT_MASKED);
-	if (maxlvt >= 3)
-		apic_write_around(APIC_LVTERR, APIC_LVT_MASKED);
-	if (maxlvt >= 4)
-		apic_write_around(APIC_LVTPC, APIC_LVT_MASKED);
+	outb_p(0x00, PIT_MODE);
+	count = inb_p(PIT_CH0);
+	count |= inb_p(PIT_CH0) << 8;
 
-#ifdef CONFIG_X86_MCE_P4THERMAL
-	if (maxlvt >= 5)
-		apic_write_around(APIC_LVTTHMR, APIC_LVT_MASKED);
-#endif
-	v = GET_APIC_VERSION(apic_read(APIC_LVR));
-	if (APIC_INTEGRATED(v)) {	/* !82489DX */
-		if (maxlvt > 3)		/* Due to Pentium errata 3AP and 11AP. */
-			apic_write(APIC_ESR, 0);
-		apic_read(APIC_ESR);
-	}
-}
+	spin_unlock_irqrestore(&i8253_lock, flags);
 
-void __init connect_bsp_APIC(void)
-{
-	if (pic_mode) {
-		/*
-		 * Do not trust the local APIC being empty at bootup.
-		 */
-		clear_local_APIC();
-		/*
-		 * PIC mode, enable APIC mode in the IMCR, i.e.
-		 * connect BSP's local APIC to INT and NMI lines.
-		 */
-		apic_printk(APIC_VERBOSE, "leaving PIC mode, "
-				"enabling APIC mode.\n");
-		outb(0x70, 0x22);
-		outb(0x01, 0x23);
-	}
-	enable_apic_mode();
+	return count;
 }
 
-void disconnect_bsp_APIC(int virt_wire_setup)
+/* next tick in 8254 can be caught by catching timer wraparound */
+static void __devinit wait_8254_wraparound(void)
 {
-	if (pic_mode) {
-		/*
-		 * Put the board back into PIC mode (has an effect
-		 * only on certain older boards).  Note that APIC
-		 * interrupts, including IPIs, won't work beyond
-		 * this point!  The only exception are INIT IPIs.
-		 */
-		apic_printk(APIC_VERBOSE, "disabling APIC mode, "
-				"entering PIC mode.\n");
-		outb(0x70, 0x22);
-		outb(0x00, 0x23);
-	}
-	else {
-		/* Go back to Virtual Wire compatibility mode */
-		unsigned long value;
+	unsigned int curr_count, prev_count;
 
-		/* For the spurious interrupt use vector F, and enable it */
-		value = apic_read(APIC_SPIV);
-		value &= ~APIC_VECTOR_MASK;
-		value |= APIC_SPIV_APIC_ENABLED;
-		value |= 0xf;
-		apic_write_around(APIC_SPIV, value);
+	curr_count = get_8254_timer_count();
+	do {
+		prev_count = curr_count;
+		curr_count = get_8254_timer_count();
 
-		if (!virt_wire_setup) {
-			/* For LVT0 make it edge triggered, active high, external and enabled */
-			value = apic_read(APIC_LVT0);
-			value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING |
-				APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
-				APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED );
-			value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
-			value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT);
-			apic_write_around(APIC_LVT0, value);
-		}
-		else {
-			/* Disable LVT0 */
-			apic_write_around(APIC_LVT0, APIC_LVT_MASKED);
-		}
+		/* workaround for broken Mercury/Neptune */
+		if (prev_count >= curr_count + 0x100)
+			curr_count = get_8254_timer_count();
 
-		/* For LVT1 make it edge triggered, active high, nmi and enabled */
-		value = apic_read(APIC_LVT1);
-		value &= ~(
-			APIC_MODE_MASK | APIC_SEND_PENDING |
-			APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
-			APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED);
-		value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
-		value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI);
-		apic_write_around(APIC_LVT1, value);
-	}
+	} while (prev_count >= curr_count);
 }
 
-void disable_local_APIC(void)
+/*
+ * Default initialization for 8254 timers. If we use other timers like HPET,
+ * we override this later
+ */
+void (*wait_timer_tick)(void) __devinitdata = wait_8254_wraparound;
+
+/*
+ * This function sets up the local APIC timer, with a timeout of
+ * 'clocks' APIC bus clocks. During calibration we actually call
+ * this function twice on the boot CPU, once with a bogus timeout
+ * value, second time for real. The other (noncalibrating) CPUs
+ * call this function only once, with the real, calibrated value.
+ *
+ * We do reads before writes even if unnecessary, to get around the
+ * P5 APIC double write bug.
+ */
+
+#define APIC_DIVISOR 16
+
+static void __setup_APIC_LVTT(unsigned int clocks)
 {
-	unsigned long value;
+	unsigned int lvtt_value, tmp_value;
+	int cpu = smp_processor_id();
 
-	clear_local_APIC();
+	lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR;
+	if (!lapic_is_integrated())
+		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
+
+	if (cpu_isset(cpu, timer_bcast_ipi))
+		lvtt_value |= APIC_LVT_MASKED;
+
+	apic_write_around(APIC_LVTT, lvtt_value);
 
 	/*
-	 * Disable APIC (implies clearing of registers
-	 * for 82489DX!).
+	 * Divide PICLK by 16
 	 */
-	value = apic_read(APIC_SPIV);
-	value &= ~APIC_SPIV_APIC_ENABLED;
-	apic_write_around(APIC_SPIV, value);
+	tmp_value = apic_read(APIC_TDCR);
+	apic_write_around(APIC_TDCR, (tmp_value
+				& ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE))
+				| APIC_TDR_DIV_16);
 
-	if (enabled_via_apicbase) {
-		unsigned int l, h;
-		rdmsr(MSR_IA32_APICBASE, l, h);
-		l &= ~MSR_IA32_APICBASE_ENABLE;
-		wrmsr(MSR_IA32_APICBASE, l, h);
-	}
+	apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
 }
 
-/*
- * This is to verify that we're looking at a real local APIC.
- * Check these against your board if the CPUs aren't getting
- * started for no apparent reason.
- */
-int __init verify_local_APIC(void)
+static void __devinit setup_APIC_timer(unsigned int clocks)
 {
-	unsigned int reg0, reg1;
+	unsigned long flags;
+
+	local_irq_save(flags);
 
 	/*
-	 * The version register is read-only in a real APIC.
+	 * Wait for IRQ0's slice:
 	 */
-	reg0 = apic_read(APIC_LVR);
-	apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg0);
-	apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK);
-	reg1 = apic_read(APIC_LVR);
-	apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg1);
+	wait_timer_tick();
+
+	__setup_APIC_LVTT(clocks);
+
+	local_irq_restore(flags);
+}
+
+/*
+ * In this function we calibrate APIC bus clocks to the external
+ * timer. Unfortunately we cannot use jiffies and the timer irq
+ * to calibrate, since some later bootup code depends on getting
+ * the first irq? Ugh.
+ *
+ * We want to do the calibration only once since we
+ * want to have local timer irqs synchronous. CPUs connected
+ * by the same APIC bus have the very same bus frequency.
+ * And we want to have irqs off anyways, no accidental
+ * APIC irq that way.
+ */
+
+static int __init calibrate_APIC_clock(void)
+{
+	unsigned long long t1 = 0, t2 = 0;
+	long tt1, tt2;
+	long result;
+	int i;
+	const int LOOPS = HZ/10;
+
+	apic_printk(APIC_VERBOSE, "calibrating APIC timer ...\n");
 
 	/*
-	 * The two version reads above should print the same
-	 * numbers.  If the second one is different, then we
-	 * poke at a non-APIC.
+	 * Put whatever arbitrary (but long enough) timeout
+	 * value into the APIC clock, we just want to get the
+	 * counter running for calibration.
 	 */
-	if (reg1 != reg0)
-		return 0;
+	__setup_APIC_LVTT(1000000000);
 
 	/*
-	 * Check if the version looks reasonably.
+	 * The timer chip counts down to zero. Let's wait
+	 * for a wraparound to start exact measurement:
+	 * (the current tick might have been already half done)
 	 */
-	reg1 = GET_APIC_VERSION(reg0);
-	if (reg1 == 0x00 || reg1 == 0xff)
-		return 0;
-	reg1 = get_maxlvt();
-	if (reg1 < 0x02 || reg1 == 0xff)
-		return 0;
+
+	wait_timer_tick();
 
 	/*
-	 * The ID register is read/write in a real APIC.
+	 * We wrapped around just now. Let's start:
 	 */
-	reg0 = apic_read(APIC_ID);
-	apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg0);
+	if (cpu_has_tsc)
+		rdtscll(t1);
+	tt1 = apic_read(APIC_TMCCT);
 
 	/*
-	 * The next two are just to see if we have sane values.
-	 * They're only really relevant if we're in Virtual Wire
-	 * compatibility mode, but most boxes are anymore.
+	 * Let's wait LOOPS wraparounds:
 	 */
-	reg0 = apic_read(APIC_LVT0);
-	apic_printk(APIC_DEBUG, "Getting LVT0: %x\n", reg0);
-	reg1 = apic_read(APIC_LVT1);
-	apic_printk(APIC_DEBUG, "Getting LVT1: %x\n", reg1);
+	for (i = 0; i < LOOPS; i++)
+		wait_timer_tick();
 
-	return 1;
-}
+	tt2 = apic_read(APIC_TMCCT);
+	if (cpu_has_tsc)
+		rdtscll(t2);
 
-void __init sync_Arb_IDs(void)
-{
-	/* Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1
-	   And not needed on AMD */
-	if (modern_apic())
-		return;
 	/*
-	 * Wait for idle.
+	 * The APIC bus clock counter is 32 bits only, it
+	 * might have overflowed, but note that we use signed
+	 * longs, thus no extra care needed.
+	 *
+	 * underflowed to be exact, as the timer counts down ;)
 	 */
-	apic_wait_icr_idle();
 
-	apic_printk(APIC_DEBUG, "Synchronizing Arb IDs.\n");
-	apic_write_around(APIC_ICR, APIC_DEST_ALLINC | APIC_INT_LEVELTRIG
-				| APIC_DM_INIT);
+	result = (tt1-tt2)*APIC_DIVISOR/LOOPS;
+
+	if (cpu_has_tsc)
+		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
+			"%ld.%04ld MHz.\n",
+			((long)(t2-t1)/LOOPS)/(1000000/HZ),
+			((long)(t2-t1)/LOOPS)%(1000000/HZ));
+
+	apic_printk(APIC_VERBOSE, "..... host bus clock speed is "
+		"%ld.%04ld MHz.\n",
+		result/(1000000/HZ),
+		result%(1000000/HZ));
+
+	return result;
 }
 
-extern void __error_in_apic_c (void);
+static unsigned int calibration_result;
 
-/*
- * An initial setup of the virtual wire mode.
- */
-void __init init_bsp_APIC(void)
+void __init setup_boot_APIC_clock(void)
 {
-	unsigned long value, ver;
+	unsigned long flags;
+	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n");
+	using_apic_timer = 1;
+
+	local_irq_save(flags);
 
+	calibration_result = calibrate_APIC_clock();
 	/*
-	 * Don't do the setup now if we have a SMP BIOS as the
-	 * through-I/O-APIC virtual wire mode might be active.
+	 * Now set up the timer for real.
 	 */
-	if (smp_found_config || !cpu_has_apic)
-		return;
+	setup_APIC_timer(calibration_result);
+
+	local_irq_restore(flags);
+}
 
-	value = apic_read(APIC_LVR);
-	ver = GET_APIC_VERSION(value);
+void __devinit setup_secondary_APIC_clock(void)
+{
+	setup_APIC_timer(calibration_result);
+}
 
-	/*
-	 * Do not trust the local APIC being empty at bootup.
-	 */
-	clear_local_APIC();
+void disable_APIC_timer(void)
+{
+	if (using_apic_timer) {
+		unsigned long v;
 
-	/*
-	 * Enable APIC.
-	 */
-	value = apic_read(APIC_SPIV);
-	value &= ~APIC_VECTOR_MASK;
-	value |= APIC_SPIV_APIC_ENABLED;
-	
-	/* This bit is reserved on P4/Xeon and should be cleared */
-	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 15))
-		value &= ~APIC_SPIV_FOCUS_DISABLED;
-	else
-		value |= APIC_SPIV_FOCUS_DISABLED;
-	value |= SPURIOUS_APIC_VECTOR;
-	apic_write_around(APIC_SPIV, value);
+		v = apic_read(APIC_LVTT);
+		/*
+		 * When an illegal vector value (0-15) is written to an LVT
+		 * entry and delivery mode is Fixed, the APIC may signal an
+		 * illegal vector error, without regard to whether the mask
+		 * bit is set or whether an interrupt is actually seen on
+		 * input.
+		 *
+		 * Boot sequence might call this function when the LVTT has
+		 * '0' vector value. So make sure vector field is set to
+		 * valid value.
+		 */
+		v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
+		apic_write_around(APIC_LVTT, v);
+	}
+}
 
-	/*
-	 * Set up the virtual wire mode.
-	 */
-	apic_write_around(APIC_LVT0, APIC_DM_EXTINT);
-	value = APIC_DM_NMI;
-	if (!APIC_INTEGRATED(ver))		/* 82489DX */
-		value |= APIC_LVT_LEVEL_TRIGGER;
-	apic_write_around(APIC_LVT1, value);
+void enable_APIC_timer(void)
+{
+	int cpu = smp_processor_id();
+
+	if (using_apic_timer && !cpu_isset(cpu, timer_bcast_ipi)) {
+		unsigned long v;
+
+		v = apic_read(APIC_LVTT);
+		apic_write_around(APIC_LVTT, v & ~APIC_LVT_MASKED);
+	}
 }
 
-void __devinit setup_local_APIC(void)
+void switch_APIC_timer_to_ipi(void *cpumask)
 {
-	unsigned long oldvalue, value, ver, maxlvt;
-	int i, j;
+	cpumask_t mask = *(cpumask_t *)cpumask;
+	int cpu = smp_processor_id();
 
-	/* Pound the ESR really hard over the head with a big hammer - mbligh */
-	if (esr_disable) {
-		apic_write(APIC_ESR, 0);
-		apic_write(APIC_ESR, 0);
-		apic_write(APIC_ESR, 0);
-		apic_write(APIC_ESR, 0);
+	if (cpu_isset(cpu, mask) &&
+	    !cpu_isset(cpu, timer_bcast_ipi)) {
+		disable_APIC_timer();
+		cpu_set(cpu, timer_bcast_ipi);
 	}
+}
+EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
 
-	value = apic_read(APIC_LVR);
-	ver = GET_APIC_VERSION(value);
+void switch_ipi_to_APIC_timer(void *cpumask)
+{
+	cpumask_t mask = *(cpumask_t *)cpumask;
+	int cpu = smp_processor_id();
+
+	if (cpu_isset(cpu, mask) &&
+	    cpu_isset(cpu, timer_bcast_ipi)) {
+		cpu_clear(cpu, timer_bcast_ipi);
+		enable_APIC_timer();
+	}
+}
+EXPORT_SYMBOL(switch_ipi_to_APIC_timer);
 
-	if ((SPURIOUS_APIC_VECTOR & 0x0f) != 0x0f)
-		__error_in_apic_c();
+/*
+ * Local timer interrupt handler. It does both profiling and
+ * process statistics/rescheduling.
+ */
+inline void smp_local_timer_interrupt(void)
+{
+	profile_tick(CPU_PROFILING);
+#ifdef CONFIG_SMP
+	update_process_times(user_mode_vm(get_irq_regs()));
+#endif
 
 	/*
-	 * Double-check whether this APIC is really registered.
+	 * We take the 'long' return path, and there every subsystem
+	 * grabs the appropriate locks (kernel lock/ irq lock).
+	 *
+	 * we might want to decouple profiling from the 'long path',
+	 * and do the profiling totally in assembly.
+	 *
+	 * Currently this isn't too much of an issue (performance wise),
+	 * we can take more than 100K local irqs per second on a 100 MHz P5.
 	 */
-	if (!apic_id_registered())
-		BUG();
+}
+
+/*
+ * Local APIC timer interrupt. This is the most natural way for doing
+ * local interrupts, but local timer interrupts can be emulated by
+ * broadcast interrupts too. [in case the hw doesn't support APIC timers]
+ *
+ * [ if a single-CPU system runs an SMP kernel then we call the local
+ *   interrupt as well. Thus we cannot inline the local irq ... ]
+ */
+
+fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+	int cpu = smp_processor_id();
 
 	/*
-	 * Intel recommends to set DFR, LDR and TPR before enabling
-	 * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
-	 * document number 292116).  So here it goes...
+	 * the NMI deadlock-detector uses this.
 	 */
-	init_apic_ldr();
+	per_cpu(irq_stat, cpu).apic_timer_irqs++;
 
 	/*
-	 * Set Task Priority to 'accept all'. We never change this
-	 * later on.
+	 * NOTE! We'd better ACK the irq immediately,
+	 * because timer handling can be slow.
 	 */
-	value = apic_read(APIC_TASKPRI);
-	value &= ~APIC_TPRI_MASK;
-	apic_write_around(APIC_TASKPRI, value);
-
+	ack_APIC_irq();
 	/*
-	 * After a crash, we no longer service the interrupts and a pending
-	 * interrupt from previous kernel might still have ISR bit set.
-	 *
-	 * Most probably by now CPU has serviced that pending interrupt and
-	 * it might not have done the ack_APIC_irq() because it thought,
-	 * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
-	 * does not clear the ISR bit and cpu thinks it has already serivced
-	 * the interrupt. Hence a vector might get locked. It was noticed
-	 * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
+	 * update_process_times() expects us to have done irq_enter().
+	 * Besides, if we don't, timer interrupts ignore the global
+	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	for (i = APIC_ISR_NR - 1; i >= 0; i--) {
-		value = apic_read(APIC_ISR + i*0x10);
-		for (j = 31; j >= 0; j--) {
-			if (value & (1<<j))
-				ack_APIC_irq();
-		}
-	}
+	irq_enter();
+	smp_local_timer_interrupt();
+	irq_exit();
+	set_irq_regs(old_regs);
+}
 
-	/*
-	 * Now that we are all set up, enable the APIC
-	 */
-	value = apic_read(APIC_SPIV);
-	value &= ~APIC_VECTOR_MASK;
-	/*
-	 * Enable APIC
-	 */
-	value |= APIC_SPIV_APIC_ENABLED;
+#ifndef CONFIG_SMP
+static void up_apic_timer_interrupt_call(void)
+{
+	int cpu = smp_processor_id();
 
 	/*
-	 * Some unknown Intel IO/APIC (or APIC) errata is biting us with
-	 * certain networking cards. If high frequency interrupts are
-	 * happening on a particular IOAPIC pin, plus the IOAPIC routing
-	 * entry is masked/unmasked at a high rate as well then sooner or
-	 * later IOAPIC line gets 'stuck', no more interrupts are received
-	 * from the device. If focus CPU is disabled then the hang goes
-	 * away, oh well :-(
-	 *
-	 * [ This bug can be reproduced easily with a level-triggered
-	 *   PCI Ne2000 networking cards and PII/PIII processors, dual
-	 *   BX chipset. ]
-	 */
-	/*
-	 * Actually disabling the focus CPU check just makes the hang less
-	 * frequent as it makes the interrupt distributon model be more
-	 * like LRU than MRU (the short-term load is more even across CPUs).
-	 * See also the comment in end_level_ioapic_irq().  --macro
+	 * the NMI deadlock-detector uses this.
 	 */
-#if 1
-	/* Enable focus processor (bit==0) */
-	value &= ~APIC_SPIV_FOCUS_DISABLED;
+	per_cpu(irq_stat, cpu).apic_timer_irqs++;
+
+	smp_local_timer_interrupt();
+}
+#endif
+
+void smp_send_timer_broadcast_ipi(void)
+{
+	cpumask_t mask;
+
+	cpus_and(mask, cpu_online_map, timer_bcast_ipi);
+	if (!cpus_empty(mask)) {
+#ifdef CONFIG_SMP
+		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
 #else
-	/* Disable focus processor (bit==1) */
-	value |= APIC_SPIV_FOCUS_DISABLED;
+		/*
+		 * We can directly call the apic timer interrupt handler
+		 * in UP case. Minus all irq related functions
+		 */
+		up_apic_timer_interrupt_call();
 #endif
-	/*
-	 * Set spurious IRQ vector
-	 */
-	value |= SPURIOUS_APIC_VECTOR;
-	apic_write_around(APIC_SPIV, value);
+	}
+}
+
+int setup_profiling_timer(unsigned int multiplier)
+{
+	return -EINVAL;
+}
+
+/*
+ * Local APIC start and shutdown
+ */
+
+/**
+ * clear_local_APIC - shutdown the local APIC
+ *
+ * This is called when a CPU is disabled and before rebooting, so the state of
+ * the local APIC has no dangling leftovers. Also used to clean out any BIOS
+ * leftovers during boot.
+ */
+void clear_local_APIC(void)
+{
+	int maxlvt = lapic_get_maxlvt();
+	unsigned long v;
 
 	/*
-	 * Set up LVT0, LVT1:
-	 *
-	 * set up through-local-APIC on the BP's LINT0. This is not
-	 * strictly necessery in pure symmetric-IO mode, but sometimes
-	 * we delegate interrupts to the 8259A.
+	 * Masking an LVT entry can trigger a local APIC error
+	 * if the vector is zero. Mask LVTERR first to prevent this.
 	 */
+	if (maxlvt >= 3) {
+		v = ERROR_APIC_VECTOR; /* any non-zero vector will do */
+		apic_write_around(APIC_LVTERR, v | APIC_LVT_MASKED);
+	}
 	/*
-	 * TODO: set up through-local-APIC from through-I/O-APIC? --macro
+	 * Careful: we have to set masks only first to deassert
+	 * any level-triggered sources.
 	 */
-	value = apic_read(APIC_LVT0) & APIC_LVT_MASKED;
-	if (!smp_processor_id() && (pic_mode || !value)) {
-		value = APIC_DM_EXTINT;
-		apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n",
-				smp_processor_id());
-	} else {
-		value = APIC_DM_EXTINT | APIC_LVT_MASKED;
-		apic_printk(APIC_VERBOSE, "masked ExtINT on CPU#%d\n",
-				smp_processor_id());
+	v = apic_read(APIC_LVTT);
+	apic_write_around(APIC_LVTT, v | APIC_LVT_MASKED);
+	v = apic_read(APIC_LVT0);
+	apic_write_around(APIC_LVT0, v | APIC_LVT_MASKED);
+	v = apic_read(APIC_LVT1);
+	apic_write_around(APIC_LVT1, v | APIC_LVT_MASKED);
+	if (maxlvt >= 4) {
+		v = apic_read(APIC_LVTPC);
+		apic_write_around(APIC_LVTPC, v | APIC_LVT_MASKED);
 	}
-	apic_write_around(APIC_LVT0, value);
 
+	/* let's not touch this if we didn't frob it */
+#ifdef CONFIG_X86_MCE_P4THERMAL
+	if (maxlvt >= 5) {
+		v = apic_read(APIC_LVTTHMR);
+		apic_write_around(APIC_LVTTHMR, v | APIC_LVT_MASKED);
+	}
+#endif
 	/*
-	 * only the BP should see the LINT1 NMI signal, obviously.
+	 * Clean APIC state for other OSs:
 	 */
-	if (!smp_processor_id())
-		value = APIC_DM_NMI;
-	else
-		value = APIC_DM_NMI | APIC_LVT_MASKED;
-	if (!APIC_INTEGRATED(ver))		/* 82489DX */
-		value |= APIC_LVT_LEVEL_TRIGGER;
-	apic_write_around(APIC_LVT1, value);
-
-	if (APIC_INTEGRATED(ver) && !esr_disable) {		/* !82489DX */
-		maxlvt = get_maxlvt();
-		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP. */
-			apic_write(APIC_ESR, 0);
-		oldvalue = apic_read(APIC_ESR);
+	apic_write_around(APIC_LVTT, APIC_LVT_MASKED);
+	apic_write_around(APIC_LVT0, APIC_LVT_MASKED);
+	apic_write_around(APIC_LVT1, APIC_LVT_MASKED);
+	if (maxlvt >= 3)
+		apic_write_around(APIC_LVTERR, APIC_LVT_MASKED);
+	if (maxlvt >= 4)
+		apic_write_around(APIC_LVTPC, APIC_LVT_MASKED);
 
-		value = ERROR_APIC_VECTOR;      // enables sending errors
-		apic_write_around(APIC_LVTERR, value);
-		/*
-		 * spec says clear errors after enabling vector.
-		 */
+#ifdef CONFIG_X86_MCE_P4THERMAL
+	if (maxlvt >= 5)
+		apic_write_around(APIC_LVTTHMR, APIC_LVT_MASKED);
+#endif
+	/* Integrated APIC (!82489DX) ? */
+	if (lapic_is_integrated()) {
 		if (maxlvt > 3)
+			/* Clear ESR due to Pentium errata 3AP and 11AP */
 			apic_write(APIC_ESR, 0);
-		value = apic_read(APIC_ESR);
-		if (value != oldvalue)
-			apic_printk(APIC_VERBOSE, "ESR value before enabling "
-				"vector: 0x%08lx  after: 0x%08lx\n",
-				oldvalue, value);
-	} else {
-		if (esr_disable)	
-			/* 
-			 * Something untraceble is creating bad interrupts on 
-			 * secondary quads ... for the moment, just leave the
-			 * ESR disabled - we can't do anything useful with the
-			 * errors anyway - mbligh
-			 */
-			printk("Leaving ESR disabled.\n");
-		else 
-			printk("No ESR for 82489DX.\n");
+		apic_read(APIC_ESR);
 	}
+}
 
-	setup_apic_nmi_watchdog(NULL);
-	apic_pm_activate();
+/**
+ * disable_local_APIC - clear and disable the local APIC
+ */
+void disable_local_APIC(void)
+{
+	unsigned long value;
+
+	clear_local_APIC();
+
+	/*
+	 * Disable APIC (implies clearing of registers
+	 * for 82489DX!).
+	 */
+	value = apic_read(APIC_SPIV);
+	value &= ~APIC_SPIV_APIC_ENABLED;
+	apic_write_around(APIC_SPIV, value);
+
+	/*
+	 * When LAPIC was disabled by the BIOS and enabled by the kernel,
+	 * restore the disabled state.
+	 */
+	if (enabled_via_apicbase) {
+		unsigned int l, h;
+
+		rdmsr(MSR_IA32_APICBASE, l, h);
+		l &= ~MSR_IA32_APICBASE_ENABLE;
+		wrmsr(MSR_IA32_APICBASE, l, h);
+	}
 }
 
 /*
- * If Linux enabled the LAPIC against the BIOS default
- * disable it down before re-entering the BIOS on shutdown.
- * Otherwise the BIOS may get confused and not power-off.
- * Additionally clear all LVT entries before disable_local_APIC
+ * If Linux enabled the LAPIC against the BIOS default, disable it before
+ * re-entering the BIOS on shutdown.  Otherwise the BIOS may get confused and
+ * not power off.  Additionally clear all LVT entries before disable_local_APIC
  * for the case where Linux didn't enable the LAPIC.
  */
 void lapic_shutdown(void)
@@ -624,666 +629,519 @@ void lapic_shutdown(void)
 	local_irq_restore(flags);
 }
 
-#ifdef CONFIG_PM
-
-static struct {
-	int active;
-	/* r/w apic fields */
-	unsigned int apic_id;
-	unsigned int apic_taskpri;
-	unsigned int apic_ldr;
-	unsigned int apic_dfr;
-	unsigned int apic_spiv;
-	unsigned int apic_lvtt;
-	unsigned int apic_lvtpc;
-	unsigned int apic_lvt0;
-	unsigned int apic_lvt1;
-	unsigned int apic_lvterr;
-	unsigned int apic_tmict;
-	unsigned int apic_tdcr;
-	unsigned int apic_thmr;
-} apic_pm_state;
-
-static int lapic_suspend(struct sys_device *dev, pm_message_t state)
+/*
+ * This is to verify that we're looking at a real local APIC.
+ * Check these against your board if the CPUs aren't getting
+ * started for no apparent reason.
+ */
+int __init verify_local_APIC(void)
 {
-	unsigned long flags;
+	unsigned int reg0, reg1;
 
-	if (!apic_pm_state.active)
+	/*
+	 * The version register is read-only in a real APIC.
+	 */
+	reg0 = apic_read(APIC_LVR);
+	apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg0);
+	apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK);
+	reg1 = apic_read(APIC_LVR);
+	apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg1);
+
+	/*
+	 * The two version reads above should print the same
+	 * numbers.  If the second one is different, then we
+	 * poke at a non-APIC.
+	 */
+	if (reg1 != reg0)
 		return 0;
 
-	apic_pm_state.apic_id = apic_read(APIC_ID);
-	apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI);
-	apic_pm_state.apic_ldr = apic_read(APIC_LDR);
-	apic_pm_state.apic_dfr = apic_read(APIC_DFR);
-	apic_pm_state.apic_spiv = apic_read(APIC_SPIV);
-	apic_pm_state.apic_lvtt = apic_read(APIC_LVTT);
-	apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC);
-	apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0);
-	apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1);
-	apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR);
-	apic_pm_state.apic_tmict = apic_read(APIC_TMICT);
-	apic_pm_state.apic_tdcr = apic_read(APIC_TDCR);
-	apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
-	
-	local_irq_save(flags);
-	disable_local_APIC();
-	local_irq_restore(flags);
-	return 0;
-}
-
-static int lapic_resume(struct sys_device *dev)
-{
-	unsigned int l, h;
-	unsigned long flags;
-
-	if (!apic_pm_state.active)
-		return 0;
-
-	local_irq_save(flags);
-
 	/*
-	 * Make sure the APICBASE points to the right address
-	 *
-	 * FIXME! This will be wrong if we ever support suspend on
-	 * SMP! We'll need to do this as part of the CPU restore!
+	 * Check if the version looks reasonable.
 	 */
-	rdmsr(MSR_IA32_APICBASE, l, h);
-	l &= ~MSR_IA32_APICBASE_BASE;
-	l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr;
-	wrmsr(MSR_IA32_APICBASE, l, h);
-
-	apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED);
-	apic_write(APIC_ID, apic_pm_state.apic_id);
-	apic_write(APIC_DFR, apic_pm_state.apic_dfr);
-	apic_write(APIC_LDR, apic_pm_state.apic_ldr);
-	apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri);
-	apic_write(APIC_SPIV, apic_pm_state.apic_spiv);
-	apic_write(APIC_LVT0, apic_pm_state.apic_lvt0);
-	apic_write(APIC_LVT1, apic_pm_state.apic_lvt1);
-	apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
-	apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
-	apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
-	apic_write(APIC_TDCR, apic_pm_state.apic_tdcr);
-	apic_write(APIC_TMICT, apic_pm_state.apic_tmict);
-	apic_write(APIC_ESR, 0);
-	apic_read(APIC_ESR);
-	apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr);
-	apic_write(APIC_ESR, 0);
-	apic_read(APIC_ESR);
-	local_irq_restore(flags);
-	return 0;
-}
-
-/*
- * This device has no shutdown method - fully functioning local APICs
- * are needed on every CPU up until machine_halt/restart/poweroff.
- */
-
-static struct sysdev_class lapic_sysclass = {
-	set_kset_name("lapic"),
-	.resume		= lapic_resume,
-	.suspend	= lapic_suspend,
-};
-
-static struct sys_device device_lapic = {
-	.id	= 0,
-	.cls	= &lapic_sysclass,
-};
-
-static void __devinit apic_pm_activate(void)
-{
-	apic_pm_state.active = 1;
-}
-
-static int __init init_lapic_sysfs(void)
-{
-	int error;
-
-	if (!cpu_has_apic)
+	reg1 = GET_APIC_VERSION(reg0);
+	if (reg1 == 0x00 || reg1 == 0xff)
+		return 0;
+	reg1 = lapic_get_maxlvt();
+	if (reg1 < 0x02 || reg1 == 0xff)
 		return 0;
-	/* XXX: remove suspend/resume procs if !apic_pm_state.active? */
-
-	error = sysdev_class_register(&lapic_sysclass);
-	if (!error)
-		error = sysdev_register(&device_lapic);
-	return error;
-}
-device_initcall(init_lapic_sysfs);
-
-#else	/* CONFIG_PM */
-
-static void apic_pm_activate(void) { }
-
-#endif	/* CONFIG_PM */
-
-/*
- * Detect and enable local APICs on non-SMP boards.
- * Original code written by Keir Fraser.
- */
-
-static int __init apic_set_verbosity(char *str)
-{
-	if (strcmp("debug", str) == 0)
-		apic_verbosity = APIC_DEBUG;
-	else if (strcmp("verbose", str) == 0)
-		apic_verbosity = APIC_VERBOSE;
-	return 1;
-}
-
-__setup("apic=", apic_set_verbosity);
-
-static int __init detect_init_APIC (void)
-{
-	u32 h, l, features;
-
-	/* Disabled by kernel option? */
-	if (enable_local_apic < 0)
-		return -1;
-
-	switch (boot_cpu_data.x86_vendor) {
-	case X86_VENDOR_AMD:
-		if ((boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model > 1) ||
-		    (boot_cpu_data.x86 == 15))	    
-			break;
-		goto no_apic;
-	case X86_VENDOR_INTEL:
-		if (boot_cpu_data.x86 == 6 || boot_cpu_data.x86 == 15 ||
-		    (boot_cpu_data.x86 == 5 && cpu_has_apic))
-			break;
-		goto no_apic;
-	default:
-		goto no_apic;
-	}
-
-	if (!cpu_has_apic) {
-		/*
-		 * Over-ride BIOS and try to enable the local
-		 * APIC only if "lapic" specified.
-		 */
-		if (enable_local_apic <= 0) {
-			printk("Local APIC disabled by BIOS -- "
-			       "you can enable it with \"lapic\"\n");
-			return -1;
-		}
-		/*
-		 * Some BIOSes disable the local APIC in the
-		 * APIC_BASE MSR. This can only be done in
-		 * software for Intel P6 or later and AMD K7
-		 * (Model > 1) or later.
-		 */
-		rdmsr(MSR_IA32_APICBASE, l, h);
-		if (!(l & MSR_IA32_APICBASE_ENABLE)) {
-			printk("Local APIC disabled by BIOS -- reenabling.\n");
-			l &= ~MSR_IA32_APICBASE_BASE;
-			l |= MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE;
-			wrmsr(MSR_IA32_APICBASE, l, h);
-			enabled_via_apicbase = 1;
-		}
-	}
-	/*
-	 * The APIC feature bit should now be enabled
-	 * in `cpuid'
-	 */
-	features = cpuid_edx(1);
-	if (!(features & (1 << X86_FEATURE_APIC))) {
-		printk("Could not enable APIC!\n");
-		return -1;
-	}
-	set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
-	mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
-
-	/* The BIOS may have set up the APIC at some other address */
-	rdmsr(MSR_IA32_APICBASE, l, h);
-	if (l & MSR_IA32_APICBASE_ENABLE)
-		mp_lapic_addr = l & MSR_IA32_APICBASE_BASE;
-
-	if (nmi_watchdog != NMI_NONE)
-		nmi_watchdog = NMI_LOCAL_APIC;
-
-	printk("Found and enabled local APIC!\n");
-
-	apic_pm_activate();
-
-	return 0;
-
-no_apic:
-	printk("No local APIC present or hardware disabled\n");
-	return -1;
-}
-
-void __init init_apic_mappings(void)
-{
-	unsigned long apic_phys;
 
 	/*
-	 * If no local APIC can be found then set up a fake all
-	 * zeroes page to simulate the local APIC and another
-	 * one for the IO-APIC.
+	 * The ID register is read/write in a real APIC.
 	 */
-	if (!smp_found_config && detect_init_APIC()) {
-		apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE);
-		apic_phys = __pa(apic_phys);
-	} else
-		apic_phys = mp_lapic_addr;
-
-	set_fixmap_nocache(FIX_APIC_BASE, apic_phys);
-	printk(KERN_DEBUG "mapped APIC to %08lx (%08lx)\n", APIC_BASE,
-	       apic_phys);
+	reg0 = apic_read(APIC_ID);
+	apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg0);
 
 	/*
-	 * Fetch the APIC ID of the BSP in case we have a
-	 * default configuration (or the MP table is broken).
+	 * The next two are just to see if we have sane values.
+	 * They're only really relevant if we're in Virtual Wire
+	 * compatibility mode, but most boxes are these days.
 	 */
-	if (boot_cpu_physical_apicid == -1U)
-		boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
-
-#ifdef CONFIG_X86_IO_APIC
-	{
-		unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
-		int i;
-
-		for (i = 0; i < nr_ioapics; i++) {
-			if (smp_found_config) {
-				ioapic_phys = mp_ioapics[i].mpc_apicaddr;
-				if (!ioapic_phys) {
-					printk(KERN_ERR
-					       "WARNING: bogus zero IO-APIC "
-					       "address found in MPTABLE, "
-					       "disabling IO/APIC support!\n");
-					smp_found_config = 0;
-					skip_ioapic_setup = 1;
-					goto fake_ioapic_page;
-				}
-			} else {
-fake_ioapic_page:
-				ioapic_phys = (unsigned long)
-					      alloc_bootmem_pages(PAGE_SIZE);
-				ioapic_phys = __pa(ioapic_phys);
-			}
-			set_fixmap_nocache(idx, ioapic_phys);
-			printk(KERN_DEBUG "mapped IOAPIC to %08lx (%08lx)\n",
-			       __fix_to_virt(idx), ioapic_phys);
-			idx++;
-		}
-	}
-#endif
-}
-
-/*
- * This part sets up the APIC 32 bit clock in LVTT1, with HZ interrupts
- * per second. We assume that the caller has already set up the local
- * APIC.
- *
- * The APIC timer is not exactly sync with the external timer chip, it
- * closely follows bus clocks.
- */
-
-/*
- * The timer chip is already set up at HZ interrupts per second here,
- * but we do not accept timer interrupts yet. We only allow the BP
- * to calibrate.
- */
-static unsigned int __devinit get_8254_timer_count(void)
-{
-	unsigned long flags;
-
-	unsigned int count;
-
-	spin_lock_irqsave(&i8253_lock, flags);
-
-	outb_p(0x00, PIT_MODE);
-	count = inb_p(PIT_CH0);
-	count |= inb_p(PIT_CH0) << 8;
-
-	spin_unlock_irqrestore(&i8253_lock, flags);
-
-	return count;
-}
-
-/* next tick in 8254 can be caught by catching timer wraparound */
-static void __devinit wait_8254_wraparound(void)
-{
-	unsigned int curr_count, prev_count;
-
-	curr_count = get_8254_timer_count();
-	do {
-		prev_count = curr_count;
-		curr_count = get_8254_timer_count();
-
-		/* workaround for broken Mercury/Neptune */
-		if (prev_count >= curr_count + 0x100)
-			curr_count = get_8254_timer_count();
-
-	} while (prev_count >= curr_count);
-}
-
-/*
- * Default initialization for 8254 timers. If we use other timers like HPET,
- * we override this later
- */
-void (*wait_timer_tick)(void) __devinitdata = wait_8254_wraparound;
-
-/*
- * This function sets up the local APIC timer, with a timeout of
- * 'clocks' APIC bus clock. During calibration we actually call
- * this function twice on the boot CPU, once with a bogus timeout
- * value, second time for real. The other (noncalibrating) CPUs
- * call this function only once, with the real, calibrated value.
- *
- * We do reads before writes even if unnecessary, to get around the
- * P5 APIC double write bug.
- */
-
-#define APIC_DIVISOR 16
-
-static void __setup_APIC_LVTT(unsigned int clocks)
-{
-	unsigned int lvtt_value, tmp_value, ver;
-	int cpu = smp_processor_id();
-
-	ver = GET_APIC_VERSION(apic_read(APIC_LVR));
-	lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR;
-	if (!APIC_INTEGRATED(ver))
-		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
-
-	if (cpu_isset(cpu, timer_bcast_ipi))
-		lvtt_value |= APIC_LVT_MASKED;
+	reg0 = apic_read(APIC_LVT0);
+	apic_printk(APIC_DEBUG, "Getting LVT0: %x\n", reg0);
+	reg1 = apic_read(APIC_LVT1);
+	apic_printk(APIC_DEBUG, "Getting LVT1: %x\n", reg1);
 
-	apic_write_around(APIC_LVTT, lvtt_value);
+	return 1;
+}
 
+/**
+ * sync_Arb_IDs - synchronize APIC bus arbitration IDs
+ */
+void __init sync_Arb_IDs(void)
+{
 	/*
-	 * Divide PICLK by 16
+	 * Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1 And not
+	 * needed on AMD.
 	 */
-	tmp_value = apic_read(APIC_TDCR);
-	apic_write_around(APIC_TDCR, (tmp_value
-				& ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE))
-				| APIC_TDR_DIV_16);
+	if (modern_apic())
+		return;
+	/*
+	 * Wait for idle.
+	 */
+	apic_wait_icr_idle();
 
-	apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
+	apic_printk(APIC_DEBUG, "Synchronizing Arb IDs.\n");
+	apic_write_around(APIC_ICR, APIC_DEST_ALLINC | APIC_INT_LEVELTRIG
+				| APIC_DM_INIT);
 }
 
-static void __devinit setup_APIC_timer(unsigned int clocks)
+/*
+ * An initial setup of the virtual wire mode.
+ */
+void __init init_bsp_APIC(void)
 {
-	unsigned long flags;
+	unsigned long value;
 
-	local_irq_save(flags);
+	/*
+	 * Don't do the setup now if we have a SMP BIOS as the
+	 * through-I/O-APIC virtual wire mode might be active.
+	 */
+	if (smp_found_config || !cpu_has_apic)
+		return;
 
 	/*
-	 * Wait for IRQ0's slice:
+	 * Do not trust the local APIC being empty at bootup.
 	 */
-	wait_timer_tick();
+	clear_local_APIC();
 
-	__setup_APIC_LVTT(clocks);
+	/*
+	 * Enable APIC.
+	 */
+	value = apic_read(APIC_SPIV);
+	value &= ~APIC_VECTOR_MASK;
+	value |= APIC_SPIV_APIC_ENABLED;
 
-	local_irq_restore(flags);
+	/* This bit is reserved on P4/Xeon and should be cleared */
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
+	    (boot_cpu_data.x86 == 15))
+		value &= ~APIC_SPIV_FOCUS_DISABLED;
+	else
+		value |= APIC_SPIV_FOCUS_DISABLED;
+	value |= SPURIOUS_APIC_VECTOR;
+	apic_write_around(APIC_SPIV, value);
+
+	/*
+	 * Set up the virtual wire mode.
+	 */
+	apic_write_around(APIC_LVT0, APIC_DM_EXTINT);
+	value = APIC_DM_NMI;
+	if (!lapic_is_integrated())		/* 82489DX */
+		value |= APIC_LVT_LEVEL_TRIGGER;
+	apic_write_around(APIC_LVT1, value);
 }
 
-/*
- * In this function we calibrate APIC bus clocks to the external
- * timer. Unfortunately we cannot use jiffies and the timer irq
- * to calibrate, since some later bootup code depends on getting
- * the first irq? Ugh.
- *
- * We want to do the calibration only once since we
- * want to have local timer irqs syncron. CPUs connected
- * by the same APIC bus have the very same bus frequency.
- * And we want to have irqs off anyways, no accidental
- * APIC irq that way.
+/**
+ * setup_local_APIC - setup the local APIC
  */
-
-static int __init calibrate_APIC_clock(void)
+void __devinit setup_local_APIC(void)
 {
-	unsigned long long t1 = 0, t2 = 0;
-	long tt1, tt2;
-	long result;
-	int i;
-	const int LOOPS = HZ/10;
+	unsigned long oldvalue, value, maxlvt, integrated;
+	int i, j;
 
-	apic_printk(APIC_VERBOSE, "calibrating APIC timer ...\n");
+	/* Pound the ESR really hard over the head with a big hammer - mbligh */
+	if (esr_disable) {
+		apic_write(APIC_ESR, 0);
+		apic_write(APIC_ESR, 0);
+		apic_write(APIC_ESR, 0);
+		apic_write(APIC_ESR, 0);
+	}
+
+	integrated = lapic_is_integrated();
 
 	/*
-	 * Put whatever arbitrary (but long enough) timeout
-	 * value into the APIC clock, we just want to get the
-	 * counter running for calibration.
+	 * Double-check whether this APIC is really registered.
 	 */
-	__setup_APIC_LVTT(1000000000);
+	if (!apic_id_registered())
+		BUG();
 
 	/*
-	 * The timer chip counts down to zero. Let's wait
-	 * for a wraparound to start exact measurement:
-	 * (the current tick might have been already half done)
+	 * Intel recommends to set DFR, LDR and TPR before enabling
+	 * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
+	 * document number 292116).  So here it goes...
 	 */
-
-	wait_timer_tick();
+	init_apic_ldr();
 
 	/*
-	 * We wrapped around just now. Let's start:
+	 * Set Task Priority to 'accept all'. We never change this
+	 * later on.
 	 */
-	if (cpu_has_tsc)
-		rdtscll(t1);
-	tt1 = apic_read(APIC_TMCCT);
+	value = apic_read(APIC_TASKPRI);
+	value &= ~APIC_TPRI_MASK;
+	apic_write_around(APIC_TASKPRI, value);
 
 	/*
-	 * Let's wait LOOPS wraprounds:
+	 * After a crash, we no longer service the interrupts and a pending
+	 * interrupt from previous kernel might still have ISR bit set.
+	 *
+	 * Most probably by now CPU has serviced that pending interrupt and
+	 * it might not have done the ack_APIC_irq() because it thought,
+	 * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
+	 * does not clear the ISR bit and cpu thinks it has already serviced
+	 * the interrupt. Hence a vector might get locked. It was noticed
+	 * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
 	 */
-	for (i = 0; i < LOOPS; i++)
-		wait_timer_tick();
+	for (i = APIC_ISR_NR - 1; i >= 0; i--) {
+		value = apic_read(APIC_ISR + i*0x10);
+		for (j = 31; j >= 0; j--) {
+			if (value & (1<<j))
+				ack_APIC_irq();
+		}
+	}
 
-	tt2 = apic_read(APIC_TMCCT);
-	if (cpu_has_tsc)
-		rdtscll(t2);
+	/*
+	 * Now that we are all set up, enable the APIC
+	 */
+	value = apic_read(APIC_SPIV);
+	value &= ~APIC_VECTOR_MASK;
+	/*
+	 * Enable APIC
+	 */
+	value |= APIC_SPIV_APIC_ENABLED;
 
 	/*
-	 * The APIC bus clock counter is 32 bits only, it
-	 * might have overflown, but note that we use signed
-	 * longs, thus no extra care needed.
+	 * Some unknown Intel IO/APIC (or APIC) errata is biting us with
+	 * certain networking cards. If high frequency interrupts are
+	 * happening on a particular IOAPIC pin, plus the IOAPIC routing
+	 * entry is masked/unmasked at a high rate as well then sooner or
+	 * later IOAPIC line gets 'stuck', no more interrupts are received
+	 * from the device. If focus CPU is disabled then the hang goes
+	 * away, oh well :-(
 	 *
-	 * underflown to be exact, as the timer counts down ;)
+	 * [ This bug can be reproduced easily with a level-triggered
+	 *   PCI Ne2000 networking cards and PII/PIII processors, dual
+	 *   BX chipset. ]
+	 */
+	/*
+	 * Actually disabling the focus CPU check just makes the hang less
+	 * frequent as it makes the interrupt distribution model be more
+	 * like LRU than MRU (the short-term load is more even across CPUs).
+	 * See also the comment in end_level_ioapic_irq().  --macro
 	 */
 
-	result = (tt1-tt2)*APIC_DIVISOR/LOOPS;
-
-	if (cpu_has_tsc)
-		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
-			"%ld.%04ld MHz.\n",
-			((long)(t2-t1)/LOOPS)/(1000000/HZ),
-			((long)(t2-t1)/LOOPS)%(1000000/HZ));
-
-	apic_printk(APIC_VERBOSE, "..... host bus clock speed is "
-		"%ld.%04ld MHz.\n",
-		result/(1000000/HZ),
-		result%(1000000/HZ));
-
-	return result;
-}
-
-static unsigned int calibration_result;
-
-void __init setup_boot_APIC_clock(void)
-{
-	unsigned long flags;
-	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n");
-	using_apic_timer = 1;
-
-	local_irq_save(flags);
+	/* Enable focus processor (bit==0) */
+	value &= ~APIC_SPIV_FOCUS_DISABLED;
 
-	calibration_result = calibrate_APIC_clock();
 	/*
-	 * Now set up the timer for real.
+	 * Set spurious IRQ vector
 	 */
-	setup_APIC_timer(calibration_result);
+	value |= SPURIOUS_APIC_VECTOR;
+	apic_write_around(APIC_SPIV, value);
 
-	local_irq_restore(flags);
-}
+	/*
+	 * Set up LVT0, LVT1:
+	 *
+	 * set up through-local-APIC on the BP's LINT0. This is not
+	 * strictly necessary in pure symmetric-IO mode, but sometimes
+	 * we delegate interrupts to the 8259A.
+	 */
+	/*
+	 * TODO: set up through-local-APIC from through-I/O-APIC? --macro
+	 */
+	value = apic_read(APIC_LVT0) & APIC_LVT_MASKED;
+	if (!smp_processor_id() && (pic_mode || !value)) {
+		value = APIC_DM_EXTINT;
+		apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n",
+				smp_processor_id());
+	} else {
+		value = APIC_DM_EXTINT | APIC_LVT_MASKED;
+		apic_printk(APIC_VERBOSE, "masked ExtINT on CPU#%d\n",
+				smp_processor_id());
+	}
+	apic_write_around(APIC_LVT0, value);
 
-void __devinit setup_secondary_APIC_clock(void)
-{
-	setup_APIC_timer(calibration_result);
-}
+	/*
+	 * only the BP should see the LINT1 NMI signal, obviously.
+	 */
+	if (!smp_processor_id())
+		value = APIC_DM_NMI;
+	else
+		value = APIC_DM_NMI | APIC_LVT_MASKED;
+	if (!integrated)		/* 82489DX */
+		value |= APIC_LVT_LEVEL_TRIGGER;
+	apic_write_around(APIC_LVT1, value);
 
-void disable_APIC_timer(void)
-{
-	if (using_apic_timer) {
-		unsigned long v;
+	if (integrated && !esr_disable) {		/* !82489DX */
+		maxlvt = lapic_get_maxlvt();
+		if (maxlvt > 3)		/* Due to the Pentium erratum 3AP. */
+			apic_write(APIC_ESR, 0);
+		oldvalue = apic_read(APIC_ESR);
 
-		v = apic_read(APIC_LVTT);
+		/* enables sending errors */
+		value = ERROR_APIC_VECTOR;
+		apic_write_around(APIC_LVTERR, value);
 		/*
-		 * When an illegal vector value (0-15) is written to an LVT
-		 * entry and delivery mode is Fixed, the APIC may signal an
-		 * illegal vector error, with out regard to whether the mask
-		 * bit is set or whether an interrupt is actually seen on input.
-		 *
-		 * Boot sequence might call this function when the LVTT has
-		 * '0' vector value. So make sure vector field is set to
-		 * valid value.
+		 * spec says clear errors after enabling vector.
 		 */
-		v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
-		apic_write_around(APIC_LVTT, v);
+		if (maxlvt > 3)
+			apic_write(APIC_ESR, 0);
+		value = apic_read(APIC_ESR);
+		if (value != oldvalue)
+			apic_printk(APIC_VERBOSE, "ESR value before enabling "
+				"vector: 0x%08lx  after: 0x%08lx\n",
+				oldvalue, value);
+	} else {
+		if (esr_disable)
+			/*
+			 * Something untraceable is creating bad interrupts on
+			 * secondary quads ... for the moment, just leave the
+			 * ESR disabled - we can't do anything useful with the
+			 * errors anyway - mbligh
+			 */
+			printk(KERN_INFO "Leaving ESR disabled.\n");
+		else
+			printk(KERN_INFO "No ESR for 82489DX.\n");
 	}
+
+	setup_apic_nmi_watchdog(NULL);
+	apic_pm_activate();
 }
 
-void enable_APIC_timer(void)
+/*
+ * Detect and initialize APIC
+ */
+static int __init detect_init_APIC (void)
 {
-	int cpu = smp_processor_id();
+	u32 h, l, features;
 
-	if (using_apic_timer &&
-	    !cpu_isset(cpu, timer_bcast_ipi)) {
-		unsigned long v;
+	/* Disabled by kernel option? */
+	if (enable_local_apic < 0)
+		return -1;
 
-		v = apic_read(APIC_LVTT);
-		apic_write_around(APIC_LVTT, v & ~APIC_LVT_MASKED);
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_AMD:
+		if ((boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model > 1) ||
+		    (boot_cpu_data.x86 == 15))
+			break;
+		goto no_apic;
+	case X86_VENDOR_INTEL:
+		if (boot_cpu_data.x86 == 6 || boot_cpu_data.x86 == 15 ||
+		    (boot_cpu_data.x86 == 5 && cpu_has_apic))
+			break;
+		goto no_apic;
+	default:
+		goto no_apic;
 	}
-}
-
-void switch_APIC_timer_to_ipi(void *cpumask)
-{
-	cpumask_t mask = *(cpumask_t *)cpumask;
-	int cpu = smp_processor_id();
 
-	if (cpu_isset(cpu, mask) &&
-	    !cpu_isset(cpu, timer_bcast_ipi)) {
-		disable_APIC_timer();
-		cpu_set(cpu, timer_bcast_ipi);
+	if (!cpu_has_apic) {
+		/*
+		 * Over-ride BIOS and try to enable the local APIC only if
+		 * "lapic" specified.
+		 */
+		if (enable_local_apic <= 0) {
+			printk(KERN_INFO "Local APIC disabled by BIOS -- "
+			       "you can enable it with \"lapic\"\n");
+			return -1;
+		}
+		/*
+		 * Some BIOSes disable the local APIC in the APIC_BASE
+		 * MSR. This can only be done in software for Intel P6 or later
+		 * and AMD K7 (Model > 1) or later.
+		 */
+		rdmsr(MSR_IA32_APICBASE, l, h);
+		if (!(l & MSR_IA32_APICBASE_ENABLE)) {
+			printk(KERN_INFO
+			       "Local APIC disabled by BIOS -- reenabling.\n");
+			l &= ~MSR_IA32_APICBASE_BASE;
+			l |= MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE;
+			wrmsr(MSR_IA32_APICBASE, l, h);
+			enabled_via_apicbase = 1;
+		}
 	}
-}
-EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
+	/*
+	 * The APIC feature bit should now be enabled
+	 * in `cpuid'
+	 */
+	features = cpuid_edx(1);
+	if (!(features & (1 << X86_FEATURE_APIC))) {
+		printk(KERN_WARNING "Could not enable APIC!\n");
+		return -1;
+	}
+	set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+	mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
 
-void switch_ipi_to_APIC_timer(void *cpumask)
-{
-	cpumask_t mask = *(cpumask_t *)cpumask;
-	int cpu = smp_processor_id();
+	/* The BIOS may have set up the APIC at some other address */
+	rdmsr(MSR_IA32_APICBASE, l, h);
+	if (l & MSR_IA32_APICBASE_ENABLE)
+		mp_lapic_addr = l & MSR_IA32_APICBASE_BASE;
 
-	if (cpu_isset(cpu, mask) &&
-	    cpu_isset(cpu, timer_bcast_ipi)) {
-		cpu_clear(cpu, timer_bcast_ipi);
-		enable_APIC_timer();
-	}
-}
-EXPORT_SYMBOL(switch_ipi_to_APIC_timer);
+	if (nmi_watchdog != NMI_NONE)
+		nmi_watchdog = NMI_LOCAL_APIC;
 
-#undef APIC_DIVISOR
+	printk(KERN_INFO "Found and enabled local APIC!\n");
 
-/*
- * Local timer interrupt handler. It does both profiling and
- * process statistics/rescheduling.
- *
- * We do profiling in every local tick, statistics/rescheduling
- * happen only every 'profiling multiplier' ticks. The default
- * multiplier is 1 and it can be changed by writing the new multiplier
- * value into /proc/profile.
- */
+	apic_pm_activate();
 
-inline void smp_local_timer_interrupt(void)
-{
-	profile_tick(CPU_PROFILING);
-#ifdef CONFIG_SMP
-	update_process_times(user_mode_vm(get_irq_regs()));
-#endif
+	return 0;
 
-	/*
-	 * We take the 'long' return path, and there every subsystem
-	 * grabs the apropriate locks (kernel lock/ irq lock).
-	 *
-	 * we might want to decouple profiling from the 'long path',
-	 * and do the profiling totally in assembly.
-	 *
-	 * Currently this isn't too much of an issue (performance wise),
-	 * we can take more than 100K local irqs per second on a 100 MHz P5.
-	 */
+no_apic:
+	printk(KERN_INFO "No local APIC present or hardware disabled\n");
+	return -1;
 }
 
-/*
- * Local APIC timer interrupt. This is the most natural way for doing
- * local interrupts, but local timer interrupts can be emulated by
- * broadcast interrupts too. [in case the hw doesn't support APIC timers]
- *
- * [ if a single-CPU system runs an SMP kernel then we call the local
- *   interrupt as well. Thus we cannot inline the local irq ... ]
+/**
+ * init_apic_mappings - initialize APIC mappings
  */
-
-fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
+void __init init_apic_mappings(void)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
-	int cpu = smp_processor_id();
+	unsigned long apic_phys;
 
 	/*
-	 * the NMI deadlock-detector uses this.
+	 * If no local APIC can be found then set up a fake all
+	 * zeroes page to simulate the local APIC and another
+	 * one for the IO-APIC.
 	 */
-	per_cpu(irq_stat, cpu).apic_timer_irqs++;
+	if (!smp_found_config && detect_init_APIC()) {
+		apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE);
+		apic_phys = __pa(apic_phys);
+	} else
+		apic_phys = mp_lapic_addr;
+
+	set_fixmap_nocache(FIX_APIC_BASE, apic_phys);
+	printk(KERN_DEBUG "mapped APIC to %08lx (%08lx)\n", APIC_BASE,
+	       apic_phys);
 
 	/*
-	 * NOTE! We'd better ACK the irq immediately,
-	 * because timer handling can be slow.
-	 */
-	ack_APIC_irq();
-	/*
-	 * update_process_times() expects us to have done irq_enter().
-	 * Besides, if we don't timer interrupts ignore the global
-	 * interrupt lock, which is the WrongThing (tm) to do.
+	 * Fetch the APIC ID of the BSP in case we have a
+	 * default configuration (or the MP table is broken).
 	 */
-	irq_enter();
-	smp_local_timer_interrupt();
-	irq_exit();
-	set_irq_regs(old_regs);
+	if (boot_cpu_physical_apicid == -1U)
+		boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
+
+#ifdef CONFIG_X86_IO_APIC
+	{
+		unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
+		int i;
+
+		for (i = 0; i < nr_ioapics; i++) {
+			if (smp_found_config) {
+				ioapic_phys = mp_ioapics[i].mpc_apicaddr;
+				if (!ioapic_phys) {
+					printk(KERN_ERR
+					       "WARNING: bogus zero IO-APIC "
+					       "address found in MPTABLE, "
+					       "disabling IO/APIC support!\n");
+					smp_found_config = 0;
+					skip_ioapic_setup = 1;
+					goto fake_ioapic_page;
+				}
+			} else {
+fake_ioapic_page:
+				ioapic_phys = (unsigned long)
+					      alloc_bootmem_pages(PAGE_SIZE);
+				ioapic_phys = __pa(ioapic_phys);
+			}
+			set_fixmap_nocache(idx, ioapic_phys);
+			printk(KERN_DEBUG "mapped IOAPIC to %08lx (%08lx)\n",
+			       __fix_to_virt(idx), ioapic_phys);
+			idx++;
+		}
+	}
+#endif
 }
 
-#ifndef CONFIG_SMP
-static void up_apic_timer_interrupt_call(void)
+/*
+ * This initializes the IO-APIC and APIC hardware if this is
+ * a UP kernel.
+ */
+int __init APIC_init_uniprocessor (void)
 {
-	int cpu = smp_processor_id();
+	if (enable_local_apic < 0)
+		clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+
+	if (!smp_found_config && !cpu_has_apic)
+		return -1;
 
 	/*
-	 * the NMI deadlock-detector uses this.
+	 * Complain if the BIOS pretends there is one.
 	 */
-	per_cpu(irq_stat, cpu).apic_timer_irqs++;
+	if (!cpu_has_apic &&
+	    APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) {
+		printk(KERN_ERR "BIOS bug, local APIC #%d not detected!...\n",
+		       boot_cpu_physical_apicid);
+		clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+		return -1;
+	}
 
-	smp_local_timer_interrupt();
-}
+	verify_local_APIC();
+
+	connect_bsp_APIC();
+
+	/*
+	 * Hack: In case of kdump, after a crash, kernel might be booting
+	 * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid
+	 * might be zero if read from MP tables. Get it from LAPIC.
+	 */
+#ifdef CONFIG_CRASH_DUMP
+	boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
 #endif
+	phys_cpu_present_map = physid_mask_of_physid(boot_cpu_physical_apicid);
 
-void smp_send_timer_broadcast_ipi(void)
-{
-	cpumask_t mask;
+	setup_local_APIC();
 
-	cpus_and(mask, cpu_online_map, timer_bcast_ipi);
-	if (!cpus_empty(mask)) {
-#ifdef CONFIG_SMP
-		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
-#else
-		/*
-		 * We can directly call the apic timer interrupt handler
-		 * in UP case. Minus all irq related functions
-		 */
-		up_apic_timer_interrupt_call();
+#ifdef CONFIG_X86_IO_APIC
+	if (smp_found_config)
+		if (!skip_ioapic_setup && nr_ioapics)
+			setup_IO_APIC();
 #endif
-	}
+	setup_boot_APIC_clock();
+
+	return 0;
+}
+
+/*
+ * APIC command line parameters
+ */
+static int __init parse_lapic(char *arg)
+{
+	enable_local_apic = 1;
+	return 0;
+}
+early_param("lapic", parse_lapic);
+
+static int __init parse_nolapic(char *arg)
+{
+	enable_local_apic = -1;
+	clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+	return 0;
 }
+early_param("nolapic", parse_nolapic);
 
-int setup_profiling_timer(unsigned int multiplier)
+static int __init apic_set_verbosity(char *str)
 {
-	return -EINVAL;
+	if (strcmp("debug", str) == 0)
+		apic_verbosity = APIC_DEBUG;
+	else if (strcmp("verbose", str) == 0)
+		apic_verbosity = APIC_VERBOSE;
+	return 1;
 }
 
+__setup("apic=", apic_set_verbosity);
+
+
+/*
+ * Local APIC interrupts
+ */
+
 /*
  * This interrupt should _never_ happen with our APIC/SMP architecture
  */
@@ -1302,15 +1160,14 @@ fastcall void smp_spurious_interrupt(str
 		ack_APIC_irq();
 
 	/* see sw-dev-man vol 3, chapter 7.4.13.5 */
-	printk(KERN_INFO "spurious APIC interrupt on CPU#%d, should never happen.\n",
-			smp_processor_id());
+	printk(KERN_INFO "spurious APIC interrupt on CPU#%d, "
+	       "should never happen.\n", smp_processor_id());
 	irq_exit();
 }
 
 /*
  * This interrupt should never happen with our APIC/SMP architecture
  */
-
 fastcall void smp_error_interrupt(struct pt_regs *regs)
 {
 	unsigned long v, v1;
@@ -1334,69 +1191,247 @@ fastcall void smp_error_interrupt(struct
 	   7: Illegal register address
 	*/
 	printk (KERN_DEBUG "APIC error on CPU%d: %02lx(%02lx)\n",
-	        smp_processor_id(), v , v1);
+		smp_processor_id(), v , v1);
 	irq_exit();
 }
 
 /*
- * This initializes the IO-APIC and APIC hardware if this is
- * a UP kernel.
+ * Initialize APIC interrupts
  */
-int __init APIC_init_uniprocessor (void)
+void __init apic_intr_init(void)
 {
-	if (enable_local_apic < 0)
-		clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+#ifdef CONFIG_SMP
+	smp_intr_init();
+#endif
+	/* self generated IPI for local APIC timer */
+	set_intr_gate(LOCAL_TIMER_VECTOR, apic_timer_interrupt);
 
-	if (!smp_found_config && !cpu_has_apic)
-		return -1;
+	/* IPI vectors for APIC spurious and error interrupts */
+	set_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
+	set_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
 
-	/*
-	 * Complain if the BIOS pretends there is one.
-	 */
-	if (!cpu_has_apic && APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) {
-		printk(KERN_ERR "BIOS bug, local APIC #%d not detected!...\n",
-			boot_cpu_physical_apicid);
-		clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
-		return -1;
+	/* thermal monitor LVT interrupt */
+#ifdef CONFIG_X86_MCE_P4THERMAL
+	set_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt);
+#endif
+}
+
+/**
+ * connect_bsp_APIC - attach the APIC to the interrupt system
+ */
+void __init connect_bsp_APIC(void)
+{
+	if (pic_mode) {
+		/*
+		 * Do not trust the local APIC being empty at bootup.
+		 */
+		clear_local_APIC();
+		/*
+		 * PIC mode, enable APIC mode in the IMCR, i.e.  connect BSP's
+		 * local APIC to INT and NMI lines.
+		 */
+		apic_printk(APIC_VERBOSE, "leaving PIC mode, "
+				"enabling APIC mode.\n");
+		outb(0x70, 0x22);
+		outb(0x01, 0x23);
 	}
+	enable_apic_mode();
+}
 
-	verify_local_APIC();
+/**
+ * disconnect_bsp_APIC - detach the APIC from the interrupt system
+ * @virt_wire_setup:	indicates whether virtual wire mode is selected
+ *
+ * Virtual wire mode is necessary to deliver legacy interrupts even when the
+ * APIC is disabled.
+ */
+void disconnect_bsp_APIC(int virt_wire_setup)
+{
+	if (pic_mode) {
+		/*
+		 * Put the board back into PIC mode (has an effect only on
+		 * certain older boards).  Note that APIC interrupts, including
+		 * IPIs, won't work beyond this point!  The only exception are
+		 * INIT IPIs.
+		 */
+		apic_printk(APIC_VERBOSE, "disabling APIC mode, "
+				"entering PIC mode.\n");
+		outb(0x70, 0x22);
+		outb(0x00, 0x23);
+	} else {
+		/* Go back to Virtual Wire compatibility mode */
+		unsigned long value;
 
-	connect_bsp_APIC();
+		/* For the spurious interrupt use vector F, and enable it */
+		value = apic_read(APIC_SPIV);
+		value &= ~APIC_VECTOR_MASK;
+		value |= APIC_SPIV_APIC_ENABLED;
+		value |= 0xf;
+		apic_write_around(APIC_SPIV, value);
 
-	/*
-	 * Hack: In case of kdump, after a crash, kernel might be booting
-	 * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid
-	 * might be zero if read from MP tables. Get it from LAPIC.
-	 */
-#ifdef CONFIG_CRASH_DUMP
-	boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
-#endif
-	phys_cpu_present_map = physid_mask_of_physid(boot_cpu_physical_apicid);
+		if (!virt_wire_setup) {
+			/*
+			 * For LVT0 make it edge triggered, active high,
+			 * external and enabled
+			 */
+			value = apic_read(APIC_LVT0);
+			value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING |
+				APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+				APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED );
+			value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+			value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT);
+			apic_write_around(APIC_LVT0, value);
+		} else {
+			/* Disable LVT0 */
+			apic_write_around(APIC_LVT0, APIC_LVT_MASKED);
+		}
 
-	setup_local_APIC();
+		/*
+		 * For LVT1 make it edge triggered, active high, nmi and
+		 * enabled
+		 */
+		value = apic_read(APIC_LVT1);
+		value &= ~(
+			APIC_MODE_MASK | APIC_SEND_PENDING |
+			APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+			APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED);
+		value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+		value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI);
+		apic_write_around(APIC_LVT1, value);
+	}
+}
 
-#ifdef CONFIG_X86_IO_APIC
-	if (smp_found_config)
-		if (!skip_ioapic_setup && nr_ioapics)
-			setup_IO_APIC();
-#endif
-	setup_boot_APIC_clock();
+/*
+ * Power management
+ */
+#ifdef CONFIG_PM
+
+static struct {
+	int active;
+	/* r/w apic fields */
+	unsigned int apic_id;
+	unsigned int apic_taskpri;
+	unsigned int apic_ldr;
+	unsigned int apic_dfr;
+	unsigned int apic_spiv;
+	unsigned int apic_lvtt;
+	unsigned int apic_lvtpc;
+	unsigned int apic_lvt0;
+	unsigned int apic_lvt1;
+	unsigned int apic_lvterr;
+	unsigned int apic_tmict;
+	unsigned int apic_tdcr;
+	unsigned int apic_thmr;
+} apic_pm_state;
+
+static int lapic_suspend(struct sys_device *dev, pm_message_t state)
+{
+	unsigned long flags;
+
+	if (!apic_pm_state.active)
+		return 0;
+
+	apic_pm_state.apic_id = apic_read(APIC_ID);
+	apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI);
+	apic_pm_state.apic_ldr = apic_read(APIC_LDR);
+	apic_pm_state.apic_dfr = apic_read(APIC_DFR);
+	apic_pm_state.apic_spiv = apic_read(APIC_SPIV);
+	apic_pm_state.apic_lvtt = apic_read(APIC_LVTT);
+	apic_pm_state.apic_lvtpc = apic_read(APIC_LVTPC);
+	apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0);
+	apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1);
+	apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR);
+	apic_pm_state.apic_tmict = apic_read(APIC_TMICT);
+	apic_pm_state.apic_tdcr = apic_read(APIC_TDCR);
+	apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
 
+	local_irq_save(flags);
+	disable_local_APIC();
+	local_irq_restore(flags);
 	return 0;
 }
 
-static int __init parse_lapic(char *arg)
+static int lapic_resume(struct sys_device *dev)
 {
-	lapic_enable();
+	unsigned int l, h;
+	unsigned long flags;
+
+	if (!apic_pm_state.active)
+		return 0;
+
+	local_irq_save(flags);
+
+	/*
+	 * Make sure the APICBASE points to the right address
+	 *
+	 * FIXME! This will be wrong if we ever support suspend on
+	 * SMP! We'll need to do this as part of the CPU restore!
+	 */
+	rdmsr(MSR_IA32_APICBASE, l, h);
+	l &= ~MSR_IA32_APICBASE_BASE;
+	l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr;
+	wrmsr(MSR_IA32_APICBASE, l, h);
+
+	apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED);
+	apic_write(APIC_ID, apic_pm_state.apic_id);
+	apic_write(APIC_DFR, apic_pm_state.apic_dfr);
+	apic_write(APIC_LDR, apic_pm_state.apic_ldr);
+	apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri);
+	apic_write(APIC_SPIV, apic_pm_state.apic_spiv);
+	apic_write(APIC_LVT0, apic_pm_state.apic_lvt0);
+	apic_write(APIC_LVT1, apic_pm_state.apic_lvt1);
+	apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
+	apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
+	apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
+	apic_write(APIC_TDCR, apic_pm_state.apic_tdcr);
+	apic_write(APIC_TMICT, apic_pm_state.apic_tmict);
+	apic_write(APIC_ESR, 0);
+	apic_read(APIC_ESR);
+	apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr);
+	apic_write(APIC_ESR, 0);
+	apic_read(APIC_ESR);
+	local_irq_restore(flags);
 	return 0;
 }
-early_param("lapic", parse_lapic);
 
-static int __init parse_nolapic(char *arg)
+/*
+ * This device has no shutdown method - fully functioning local APICs
+ * are needed on every CPU up until machine_halt/restart/poweroff.
+ */
+
+static struct sysdev_class lapic_sysclass = {
+	set_kset_name("lapic"),
+	.resume		= lapic_resume,
+	.suspend	= lapic_suspend,
+};
+
+static struct sys_device device_lapic = {
+	.id	= 0,
+	.cls	= &lapic_sysclass,
+};
+
+static void __devinit apic_pm_activate(void)
 {
-	lapic_disable();
-	return 0;
+	apic_pm_state.active = 1;
 }
-early_param("nolapic", parse_nolapic);
 
+static int __init init_lapic_sysfs(void)
+{
+	int error;
+
+	if (!cpu_has_apic)
+		return 0;
+	/* XXX: remove suspend/resume procs if !apic_pm_state.active? */
+
+	error = sysdev_class_register(&lapic_sysclass);
+	if (!error)
+		error = sysdev_register(&device_lapic);
+	return error;
+}
+device_initcall(init_lapic_sysfs);
+
+#else	/* CONFIG_PM */
+
+static void apic_pm_activate(void) { }
+
+#endif	/* CONFIG_PM */
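
Note for reviewers: the ISR drain loop added to setup_local_APIC() above
is easy to misread, so here is a minimal user-space sketch of the same
walk. It is not kernel code. The sim_* names and the in-memory ISR array
are made up for illustration, and sim_eoi() assumes the usual local APIC
rule that an EOI retires the highest-priority in-service vector.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_ISR_NR 8	/* 8 words x 32 bits = 256 vectors */

/* Hypothetical stand-in for the APIC_ISR register bank. */
static uint32_t sim_isr[SIM_ISR_NR];

/* Mark one vector as in service (illustration helper only). */
static void sim_mark_in_service(unsigned int vector)
{
	assert(vector < SIM_ISR_NR * 32);
	sim_isr[vector / 32] |= 1u << (vector % 32);
}

/* Number of vectors still marked in service. */
static int sim_isr_pending(void)
{
	int i, j, pending = 0;

	for (i = 0; i < SIM_ISR_NR; i++)
		for (j = 0; j < 32; j++)
			if (sim_isr[i] & (1u << j))
				pending++;
	return pending;
}

/*
 * Hypothetical stand-in for ack_APIC_irq(): assumed to clear the
 * highest-numbered in-service bit, one bit per call.
 */
static void sim_eoi(void)
{
	int i, j;

	for (i = SIM_ISR_NR - 1; i >= 0; i--) {
		for (j = 31; j >= 0; j--) {
			if (sim_isr[i] & (1u << j)) {
				sim_isr[i] &= ~(1u << j);
				return;
			}
		}
	}
}

/*
 * Same shape as the loop in setup_local_APIC(): read each ISR word
 * once, from the highest word down, and issue one EOI per set bit.
 * Returns the number of EOIs issued.
 */
static int sim_drain_isr(void)
{
	int i, j, eois = 0;

	for (i = SIM_ISR_NR - 1; i >= 0; i--) {
		uint32_t value = sim_isr[i];	/* snapshot of one word */

		for (j = 31; j >= 0; j--) {
			if (value & (1u << j)) {
				sim_eoi();
				eois++;
			}
		}
	}
	return eois;
}
```

Marking vectors 0x31 and 0xe0 in service with sim_mark_in_service() and
then calling sim_drain_isr() issues two EOIs and leaves every simulated
ISR word clear. The top-down walk matters under the stated assumption:
each EOI retires the highest remaining vector, which is the bit the loop
is looking at.
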
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/io_apic.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/io_apic.c	2006-11-09 21:05:30.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/io_apic.c	2006-11-09 21:06:15.000000000 +0100
@@ -1599,7 +1599,7 @@ void /*__init*/ print_local_APIC(void * 
 	v = apic_read(APIC_LVR);
 	printk(KERN_INFO "... APIC VERSION: %08x\n", v);
 	ver = GET_APIC_VERSION(v);
-	maxlvt = get_maxlvt();
+	maxlvt = lapic_get_maxlvt();
 
 	v = apic_read(APIC_TASKPRI);
 	printk(KERN_DEBUG "... APIC TASKPRI: %08x (%02x)\n", v, v & APIC_TPRI_MASK);
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/irq.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/irq.c	2006-11-09 21:05:30.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/irq.c	2006-11-09 21:06:15.000000000 +0100
@@ -10,7 +10,6 @@
  * io_apic.c.)
  */
 
-#include <asm/uaccess.h>
 #include <linux/module.h>
 #include <linux/seq_file.h>
 #include <linux/interrupt.h>
@@ -19,19 +18,34 @@
 #include <linux/cpu.h>
 #include <linux/delay.h>
 
+#include <asm/apic.h>
+#include <asm/uaccess.h>
+
 DEFINE_PER_CPU(irq_cpustat_t, irq_stat) ____cacheline_internodealigned_in_smp;
 EXPORT_PER_CPU_SYMBOL(irq_stat);
 
-#ifndef CONFIG_X86_LOCAL_APIC
 /*
  * 'what should we do if we get a hw irq event on an illegal vector'.
  * each architecture has to answer this themselves.
  */
 void ack_bad_irq(unsigned int irq)
 {
-	printk("unexpected IRQ trap at vector %02x\n", irq);
-}
+	printk(KERN_ERR "unexpected IRQ trap at vector %02x\n", irq);
+
+#ifdef CONFIG_X86_LOCAL_APIC
+	/*
+	 * Currently unexpected vectors happen only on SMP and APIC.
+	 * We _must_ ack these because every local APIC has only N
+	 * irq slots per priority level, and a 'hanging, unacked' IRQ
+	 * holds up an irq slot - in excessive cases (when multiple
+	 * unexpected vectors occur) that might lock up the APIC
+	 * completely.
+	 * But only ack when the APIC is enabled -AK
+	 */
+	if (cpu_has_apic)
+		ack_APIC_irq();
 #endif
+}
 
 #ifdef CONFIG_4KSTACKS
 /*
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/smpboot.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/smpboot.c	2006-11-09 21:05:30.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/smpboot.c	2006-11-09 21:06:15.000000000 +0100
@@ -744,7 +744,7 @@ wakeup_secondary_cpu(int logical_apicid,
 	/*
 	 * Due to the Pentium erratum 3AP.
 	 */
-	maxlvt = get_maxlvt();
+	maxlvt = lapic_get_maxlvt();
 	if (maxlvt > 3) {
 		apic_read_around(APIC_SPIV);
 		apic_write(APIC_ESR, 0);
@@ -834,7 +834,7 @@ wakeup_secondary_cpu(int phys_apicid, un
 	 */
 	Dprintk("#startup loops: %d.\n", num_starts);
 
-	maxlvt = get_maxlvt();
+	maxlvt = lapic_get_maxlvt();
 
 	for (j = 1; j <= num_starts; j++) {
 		Dprintk("Sending STARTUP #%d.\n",j);
Index: linux-2.6.19-rc5-mm1/include/asm-i386/apic.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/apic.h	2006-11-09 21:05:30.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/apic.h	2006-11-09 21:06:15.000000000 +0100
@@ -95,7 +95,7 @@ static inline void ack_APIC_irq(void)
 
 extern void (*wait_timer_tick)(void);
 
-extern int get_maxlvt(void);
+extern int lapic_get_maxlvt(void);
 extern void clear_local_APIC(void);
 extern void connect_bsp_APIC (void);
 extern void disconnect_bsp_APIC (int virt_wire_setup);

--



* [patch 09/19] i386: Convert to clock event devices
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (7 preceding siblings ...)
  2006-11-09 23:38 ` [patch 08/19] i386: cleanup apic code Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10 10:10   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 10/19] PM_timer: allow early access and move externs to a header file Thomas Gleixner
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: clockevents-drivers-for-i386.patch --]
[-- Type: text/plain, Size: 26149 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add clockevent drivers for i386: lapic (local) and PIT (global).  Update the
timer IRQ to call into the PIT driver's event handler and the lapic-timer IRQ
to call into the lapic clockevent driver.  The assignment of timer
functionality is delegated to the core framework code and replaces the compile
and runtime evaluation in do_timer_interrupt_hook().
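
Each driver added here describes its hardware to the framework through a
mult/shift pair plus minimum and maximum deltas. The userspace sketch below
models the arithmetic that setup_pit_timer() performs. The bodies of div_sc()
and clockevent_delta2ns() belong to the clockevents core patch and are not
part of this mail, so the formulas used here are assumptions about their
behaviour. The value 1193182 for CLOCK_TICK_RATE is assumed as well, and all
sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of CLOCK_TICK_RATE on i386 (PIT input clock in Hz). */
#define SKETCH_PIT_HZ		1193182ULL
#define SKETCH_NSEC_PER_SEC	1000000000ULL

/* Hypothetical stand-in for the two conversion fields of the device. */
struct sketch_clockevent {
	uint64_t mult;
	unsigned int shift;
};

/* Assumed behaviour of div_sc(): (ticks << shift) / nsec. */
static uint64_t sketch_div_sc(uint64_t ticks, uint64_t nsec,
			      unsigned int shift)
{
	assert(nsec != 0);
	return (ticks << shift) / nsec;
}

/* Assumed behaviour of clockevent_delta2ns(): (delta << shift) / mult. */
static uint64_t sketch_delta2ns(uint64_t delta,
				const struct sketch_clockevent *evt)
{
	assert(evt->mult != 0);
	return (delta << evt->shift) / evt->mult;
}

/* Opposite direction: nanoseconds to device ticks, (ns * mult) >> shift. */
static uint64_t sketch_ns2delta(uint64_t ns,
				const struct sketch_clockevent *evt)
{
	return (ns * evt->mult) >> evt->shift;
}

/* Mirrors the mult computation in setup_pit_timer() with shift = 32. */
static struct sketch_clockevent sketch_pit_setup(void)
{
	struct sketch_clockevent evt = { .mult = 0, .shift = 32 };

	evt.mult = sketch_div_sc(SKETCH_PIT_HZ, SKETCH_NSEC_PER_SEC,
				 evt.shift);
	return evt;
}
```

With these assumptions, sketch_delta2ns(0x7FFF, &evt) gives the longest
interval the 16 bit PIT counter can be programmed for, roughly 27 ms, which
is why a dynticks kernel on a PIT-only machine still has to wake up that
often.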

Use the clockevents broadcast support and implement the lapic_broadcast function
for ACPI.
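
The mask handling in lapic_timer_broadcast() can be modelled in userspace as
follows. This is only a sketch: cpumask_t is replaced by a 32 bit word, the
sketch_* names are hypothetical, and the real function calls
local_apic_timer_interrupt() and send_IPI_mask() instead of returning a plan.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 32 CPUs. */
typedef uint32_t sketch_cpumask;

struct sketch_broadcast_plan {
	int handle_locally;		/* nonzero: call the local handler directly */
	sketch_cpumask ipi_targets;	/* remaining CPUs that need an IPI */
};

static struct sketch_broadcast_plan
sketch_plan_broadcast(unsigned int this_cpu, sketch_cpumask online,
		      sketch_cpumask requested)
{
	struct sketch_broadcast_plan plan = { 0, 0 };
	sketch_cpumask mask;

	assert(this_cpu < 32);

	/* Only online CPUs are eligible, as with cpus_and() in the patch. */
	mask = online & requested;

	/* The calling CPU is served by a direct call, not an IPI to itself. */
	if (mask & (1u << this_cpu)) {
		mask &= ~(1u << this_cpu);
		plan.handle_locally = 1;
	}

	plan.ipi_targets = mask;
	return plan;
}
```

On a UP kernel the IPI part is compiled out, so only the direct call remains.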

Build-fixes-from: Andrew Morton <akpm@osdl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/arch/i386/Kconfig
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/Kconfig	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/Kconfig	2006-11-09 21:17:29.000000000 +0100
@@ -18,6 +18,14 @@ config GENERIC_TIME
 	bool
 	default y
 
+config GENERIC_CLOCKEVENTS
+	bool
+	default y
+
+config GENERIC_CLOCKEVENTS_BROADCAST
+	bool
+	default y
+
 config LOCKDEP_SUPPORT
 	bool
 	default y
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/apic.c	2006-11-09 21:17:25.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/apic.c	2006-11-09 21:17:29.000000000 +0100
@@ -25,6 +25,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/sysdev.h>
 #include <linux/cpu.h>
+#include <linux/clockchips.h>
 #include <linux/module.h>
 
 #include <asm/atomic.h>
@@ -51,28 +52,49 @@
 #endif
 
 /*
- * cpu_mask that denotes the CPUs that needs timer interrupt coming in as
- * IPIs in place of local APIC timers
- */
-static cpumask_t timer_bcast_ipi;
-
-/*
  * Knob to control our willingness to enable the local APIC.
  *
  * -1=force-disable, +1=force-enable
  */
 static int enable_local_apic __initdata = 0;
 
+/* Enable local APIC timer for highres/dyntick on UP */
+static int enable_local_apic_timer __initdata = 0;
+
 /*
  * Debug level, exported for io_apic.c
  */
 int apic_verbosity;
 
-static void apic_pm_activate(void);
+static unsigned int calibration_result;
 
+static void lapic_next_event(unsigned long delta,
+			     struct clock_event_device *evt);
+static void lapic_timer_setup(enum clock_event_mode mode,
+			      struct clock_event_device *evt);
+static void lapic_timer_broadcast(cpumask_t *mask);
+static void apic_pm_activate(void);
 
-/* Using APIC to generate smp_local_timer_interrupt? */
-int using_apic_timer __read_mostly = 0;
+/*
+ * The local apic timer can be used for any function which is CPU local.
+ */
+static struct clock_event_device lapic_clockevent = {
+	.name = "lapic",
+	.capabilities = CLOCK_CAP_PROFILE
+#ifdef CONFIG_SMP
+	/*
+	 * On UP we keep update_process_times() on the PIT interrupt to
+	 * resemble the original behaviour as closely as possible. SMP
+	 * requires this to run CPU local.
+	 */
+			| CLOCK_CAP_UPDATE
+#endif
+	,
+	.shift = 32,
+	.set_mode = lapic_timer_setup,
+	.set_next_event = lapic_next_event,
+};
+static DEFINE_PER_CPU(struct clock_event_device, lapic_events);
 
 /* Local APIC was disabled by the BIOS and enabled by the kernel */
 static int enabled_via_apicbase;
@@ -151,6 +173,11 @@ int lapic_get_maxlvt(void)
  */
 
 /*
+ * FIXME: Move this to i8253.h. There is no need to keep the access to
+ * the PIT scattered all around the place -tglx
+ */
+
+/*
  * The timer chip is already set up at HZ interrupts per second here,
  * but we do not accept timer interrupts yet. We only allow the BP
  * to calibrate.
@@ -208,16 +235,17 @@ void (*wait_timer_tick)(void) __devinitd
 
 #define APIC_DIVISOR 16
 
-static void __setup_APIC_LVTT(unsigned int clocks)
+static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 {
 	unsigned int lvtt_value, tmp_value;
-	int cpu = smp_processor_id();
 
-	lvtt_value = APIC_LVT_TIMER_PERIODIC | LOCAL_TIMER_VECTOR;
+	lvtt_value = LOCAL_TIMER_VECTOR;
+	if (!oneshot)
+		lvtt_value |= APIC_LVT_TIMER_PERIODIC;
 	if (!lapic_is_integrated())
 		lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
 
-	if (cpu_isset(cpu, timer_bcast_ipi))
+	if (!irqen)
 		lvtt_value |= APIC_LVT_MASKED;
 
 	apic_write_around(APIC_LVTT, lvtt_value);
@@ -230,31 +258,67 @@ static void __setup_APIC_LVTT(unsigned i
 				& ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE))
 				| APIC_TDR_DIV_16);
 
-	apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
+	if (!oneshot)
+		apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
 }
 
-static void __devinit setup_APIC_timer(unsigned int clocks)
+/*
+ * Program the next event, relative to now
+ */
+static void lapic_next_event(unsigned long delta,
+			     struct clock_event_device *evt)
+{
+	apic_write_around(APIC_TMICT, delta);
+}
+
+/*
+ * Setup the lapic timer in periodic or oneshot mode
+ */
+static void lapic_timer_setup(enum clock_event_mode mode,
+			      struct clock_event_device *evt)
 {
 	unsigned long flags;
+	unsigned int v;
 
 	local_irq_save(flags);
 
-	/*
-	 * Wait for IRQ0's slice:
-	 */
-	wait_timer_tick();
-
-	__setup_APIC_LVTT(clocks);
+	switch (mode) {
+	case CLOCK_EVT_PERIODIC:
+	case CLOCK_EVT_ONESHOT:
+		__setup_APIC_LVTT(calibration_result,
+				  mode != CLOCK_EVT_PERIODIC, 1);
+		break;
+	case CLOCK_EVT_SHUTDOWN:
+		v = apic_read(APIC_LVTT);
+		v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
+		apic_write_around(APIC_LVTT, v);
+		break;
+	}
 
 	local_irq_restore(flags);
 }
 
 /*
+ * Setup the local APIC timer for this CPU. Copy the initialized values
+ * of the boot CPU and register the clock event in the framework.
+ */
+static void __devinit setup_APIC_timer(void)
+{
+	struct clock_event_device *levt = &__get_cpu_var(lapic_events);
+
+	memcpy(levt, &lapic_clockevent, sizeof(*levt));
+
+	register_local_clockevent(levt);
+}
+
+/*
  * In this function we calibrate APIC bus clocks to the external
  * timer. Unfortunately we cannot use jiffies and the timer irq
  * to calibrate, since some later bootup code depends on getting
  * the first irq? Ugh.
  *
+ * TODO: Fix this rather than saying "Ugh" -tglx
+ *
  * We want to do the calibration only once since we
  * want to have local timer irqs syncron. CPUs connected
  * by the same APIC bus have the very same bus frequency.
@@ -277,7 +341,7 @@ static int __init calibrate_APIC_clock(v
 	 * value into the APIC clock, we just want to get the
 	 * counter running for calibration.
 	 */
-	__setup_APIC_LVTT(1000000000);
+	__setup_APIC_LVTT(1000000000, 0, 0);
 
 	/*
 	 * The timer chip counts down to zero. Let's wait
@@ -314,6 +378,17 @@ static int __init calibrate_APIC_clock(v
 
 	result = (tt1-tt2)*APIC_DIVISOR/LOOPS;
 
+	/* Calculate the scaled math multiplication factor */
+	lapic_clockevent.mult = div_sc(tt1-tt2, TICK_NSEC * LOOPS, 32);
+	lapic_clockevent.max_delta_ns =
+		clockevent_delta2ns(0x7FFFFF, &lapic_clockevent);
+	lapic_clockevent.min_delta_ns =
+		clockevent_delta2ns(0xF, &lapic_clockevent);
+
+	apic_printk(APIC_VERBOSE, "..... tt1-tt2 %ld\n", tt1 - tt2);
+	apic_printk(APIC_VERBOSE, "..... mult: %ld\n", lapic_clockevent.mult);
+	apic_printk(APIC_VERBOSE, "..... calibration result: %ld\n", result);
+
 	if (cpu_has_tsc)
 		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
 			"%ld.%04ld MHz.\n",
@@ -328,8 +403,6 @@ static int __init calibrate_APIC_clock(v
 	return result;
 }
 
-static unsigned int calibration_result;
-
 void __init setup_boot_APIC_clock(void)
 {
 	unsigned long flags;
@@ -342,97 +415,66 @@ void __init setup_boot_APIC_clock(void)
 	/*
 	 * Now set up the timer for real.
 	 */
-	setup_APIC_timer(calibration_result);
+	setup_APIC_timer();
 
 	local_irq_restore(flags);
 }
 
 void __devinit setup_secondary_APIC_clock(void)
 {
-	setup_APIC_timer(calibration_result);
-}
-
-void disable_APIC_timer(void)
-{
-	if (using_apic_timer) {
-		unsigned long v;
-
-		v = apic_read(APIC_LVTT);
-		/*
-		 * When an illegal vector value (0-15) is written to an LVT
-		 * entry and delivery mode is Fixed, the APIC may signal an
-		 * illegal vector error, with out regard to whether the mask
-		 * bit is set or whether an interrupt is actually seen on
-		 * input.
-		 *
-		 * Boot sequence might call this function when the LVTT has
-		 * '0' vector value. So make sure vector field is set to
-		 * valid value.
-		 */
-		v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
-		apic_write_around(APIC_LVTT, v);
-	}
-}
-
-void enable_APIC_timer(void)
-{
-	int cpu = smp_processor_id();
-
-	if (using_apic_timer && !cpu_isset(cpu, timer_bcast_ipi)) {
-		unsigned long v;
-
-		v = apic_read(APIC_LVTT);
-		apic_write_around(APIC_LVTT, v & ~APIC_LVT_MASKED);
-	}
+	setup_APIC_timer();
 }
 
 void switch_APIC_timer_to_ipi(void *cpumask)
 {
+	struct clock_event_device *levt = &__get_cpu_var(lapic_events);
 	cpumask_t mask = *(cpumask_t *)cpumask;
 	int cpu = smp_processor_id();
 
-	if (cpu_isset(cpu, mask) &&
-	    !cpu_isset(cpu, timer_bcast_ipi)) {
-		disable_APIC_timer();
-		cpu_set(cpu, timer_bcast_ipi);
-	}
+	if (cpu_isset(cpu, mask) && levt->event_handler)
+		clockevents_set_global_broadcast(levt, 1);
 }
 EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
 
 void switch_ipi_to_APIC_timer(void *cpumask)
 {
+	struct clock_event_device *levt = &__get_cpu_var(lapic_events);
 	cpumask_t mask = *(cpumask_t *)cpumask;
 	int cpu = smp_processor_id();
 
-	if (cpu_isset(cpu, mask) &&
-	    cpu_isset(cpu, timer_bcast_ipi)) {
-		cpu_clear(cpu, timer_bcast_ipi);
-		enable_APIC_timer();
-	}
+	if (cpu_isset(cpu, mask) && levt->event_handler)
+		clockevents_set_global_broadcast(levt, 0);
 }
 EXPORT_SYMBOL(switch_ipi_to_APIC_timer);
 
 /*
- * Local timer interrupt handler. It does both profiling and
- * process statistics/rescheduling.
+ * The guts of the apic timer interrupt
  */
-inline void smp_local_timer_interrupt(void)
+fastcall void local_apic_timer_interrupt(struct pt_regs *regs)
 {
-	profile_tick(CPU_PROFILING);
-#ifdef CONFIG_SMP
-	update_process_times(user_mode_vm(get_irq_regs()));
-#endif
+	int cpu = smp_processor_id();
+	struct clock_event_device *evt = &per_cpu(lapic_events, cpu);
 
 	/*
-	 * We take the 'long' return path, and there every subsystem
-	 * grabs the apropriate locks (kernel lock/ irq lock).
-	 *
-	 * we might want to decouple profiling from the 'long path',
-	 * and do the profiling totally in assembly.
+	 * Normally we should not be here till LAPIC has been initialized but
+	 * in some cases like kdump, it's possible that there is a pending LAPIC
+	 * timer interrupt from previous kernel's context and is delivered in
+	 * new kernel the moment interrupts are enabled.
 	 *
-	 * Currently this isn't too much of an issue (performance wise),
-	 * we can take more than 100K local irqs per second on a 100 MHz P5.
-	 */
+	 * Interrupts are enabled early and LAPIC is setup much later, hence
+	 * it's possible that when we get here evt->event_handler is NULL.
+	 * Check for event_handler being NULL and discard the interrupt as
+	 * spurious.
+	 */
+	if (!evt->event_handler) {
+		printk(KERN_WARNING
+		       "Spurious LAPIC timer interrupt on cpu %d\n", cpu);
+		return;
+	}
+
+	per_cpu(irq_stat, cpu).apic_timer_irqs++;
+
+	evt->event_handler(regs);
 }
 
 /*
@@ -447,12 +489,6 @@ inline void smp_local_timer_interrupt(vo
 fastcall void smp_apic_timer_interrupt(struct pt_regs *regs)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
-	int cpu = smp_processor_id();
-
-	/*
-	 * the NMI deadlock-detector uses this.
-	 */
-	per_cpu(irq_stat, cpu).apic_timer_irqs++;
 
 	/*
 	 * NOTE! We'd better ACK the irq immediately,
@@ -465,41 +501,40 @@ fastcall void smp_apic_timer_interrupt(s
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
 	irq_enter();
-	smp_local_timer_interrupt();
+	local_apic_timer_interrupt(regs);
 	irq_exit();
 	set_irq_regs(old_regs);
 }
 
-#ifndef CONFIG_SMP
-static void up_apic_timer_interrupt_call(void)
+/*
+ * Local APIC timer broadcast function
+ */
+static void lapic_timer_broadcast(cpumask_t *cpumask)
 {
 	int cpu = smp_processor_id();
-
-	/*
-	 * the NMI deadlock-detector uses this.
-	 */
-	per_cpu(irq_stat, cpu).apic_timer_irqs++;
-
-	smp_local_timer_interrupt();
-}
-#endif
-
-void smp_send_timer_broadcast_ipi(void)
-{
 	cpumask_t mask;
 
-	cpus_and(mask, cpu_online_map, timer_bcast_ipi);
-	if (!cpus_empty(mask)) {
+	cpus_and(mask, cpu_online_map, *cpumask);
+	if (cpu_isset(cpu, mask)) {
+		cpu_clear(cpu, mask);
+		local_apic_timer_interrupt(get_irq_regs());
+	}
 #ifdef CONFIG_SMP
+	if (!cpus_empty(mask))
 		send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
-#else
-		/*
-		 * We can directly call the apic timer interrupt handler
-		 * in UP case. Minus all irq related functions
-		 */
-		up_apic_timer_interrupt_call();
 #endif
-	}
+}
+
+/*
+ * Local APIC set next event broadcast
+ */
+void lapic_timer_idle_broadcast(int broadcast)
+{
+	int cpu = smp_processor_id();
+	struct clock_event_device *evt = &per_cpu(lapic_events, cpu);
+
+	if (evt->event_handler)
+		clockevents_set_broadcast(evt, broadcast);
 }
 
 int setup_profiling_timer(unsigned int multiplier)
@@ -912,6 +947,11 @@ void __devinit setup_local_APIC(void)
 			printk(KERN_INFO "No ESR for 82489DX.\n");
 	}
 
+	/* Disable the local apic timer */
+	value = apic_read(APIC_LVTT);
+	value |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
+	apic_write_around(APIC_LVTT, value);
+
 	setup_apic_nmi_watchdog(NULL);
 	apic_pm_activate();
 }
@@ -1126,6 +1166,13 @@ static int __init parse_nolapic(char *ar
 }
 early_param("nolapic", parse_nolapic);
 
+static int __init apic_enable_lapic_timer(char *str)
+{
+	enable_local_apic_timer = 1;
+	return 0;
+}
+early_param("lapictimer", apic_enable_lapic_timer);
+
 static int __init apic_set_verbosity(char *str)
 {
 	if (strcmp("debug", str) == 0)
@@ -1137,7 +1184,6 @@ static int __init apic_set_verbosity(cha
 
 __setup("apic=", apic_set_verbosity);
 
-
 /*
  * Local APIC interrupts
  */
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/i8253.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/i8253.c	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/i8253.c	2006-11-09 21:17:29.000000000 +0100
@@ -2,7 +2,7 @@
  * i8253.c  8253/PIT functions
  *
  */
-#include <linux/clocksource.h>
+#include <linux/clockchips.h>
 #include <linux/spinlock.h>
 #include <linux/jiffies.h>
 #include <linux/sysdev.h>
@@ -19,20 +19,96 @@
 DEFINE_SPINLOCK(i8253_lock);
 EXPORT_SYMBOL(i8253_lock);
 
-void setup_pit_timer(void)
+#ifdef CONFIG_HPET_TIMER
+/*
+ * HPET replaces the PIT when enabled, so we need to know which of
+ * the two timers is in use.
+ */
+struct clock_event_device *global_clock_event;
+#endif
+
+/*
+ * Initialize the PIT timer.
+ *
+ * This is also called after resume to bring the PIT into operation again.
+ */
+static void init_pit_timer(enum clock_event_mode mode,
+			   struct clock_event_device *evt)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&i8253_lock, flags);
+
+	switch(mode) {
+	case CLOCK_EVT_PERIODIC:
+		/* binary, mode 2, LSB/MSB, ch 0 */
+		outb_p(0x34, PIT_MODE);
+		udelay(10);
+		outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
+		udelay(10);
+		outb(LATCH >> 8 , PIT_CH0);	/* MSB */
+		break;
+
+	case CLOCK_EVT_ONESHOT:
+	case CLOCK_EVT_SHUTDOWN:
+		/* One shot setup */
+		outb_p(0x38, PIT_MODE);
+		udelay(10);
+		break;
+	}
+	spin_unlock_irqrestore(&i8253_lock, flags);
+}
+
+/*
+ * Program the next event in oneshot mode
+ *
+ * Delta is given in PIT ticks
+ */
+static void pit_next_event(unsigned long delta, struct clock_event_device *evt)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&i8253_lock, flags);
-	outb_p(0x34,PIT_MODE);		/* binary, mode 2, LSB/MSB, ch 0 */
-	udelay(10);
-	outb_p(LATCH & 0xff , PIT_CH0);	/* LSB */
-	udelay(10);
-	outb(LATCH >> 8 , PIT_CH0);	/* MSB */
+	outb_p(delta & 0xff , PIT_CH0);	/* LSB */
+	outb(delta >> 8 , PIT_CH0);	/* MSB */
 	spin_unlock_irqrestore(&i8253_lock, flags);
 }
 
 /*
+ * On UP the PIT can serve all of the possible timer functions. On SMP systems
+ * it can be solely used for the global tick.
+ *
+ * The profiling and update capabilities are switched off once the local apic is
+ * registered. This mechanism replaces the previous #ifdef LOCAL_APIC -
+ * !using_apic_timer decisions in do_timer_interrupt_hook()
+ */
+struct clock_event_device pit_clockevent = {
+	.name		= "pit",
+	.capabilities	= CLOCK_CAP_TICK | CLOCK_CAP_PROFILE | CLOCK_CAP_UPDATE
+			| CLOCK_CAP_NEXTEVT,
+	.set_mode	= init_pit_timer,
+	.set_next_event = pit_next_event,
+	.shift		= 32,
+};
+
+/*
+ * Initialize the conversion factor and the min/max deltas of the clock event
+ * structure and register the clock event source with the framework.
+ */
+void __init setup_pit_timer(void)
+{
+	pit_clockevent.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, 32);
+	pit_clockevent.max_delta_ns =
+		clockevent_delta2ns(0x7FFF, &pit_clockevent);
+	pit_clockevent.min_delta_ns =
+		clockevent_delta2ns(0xF, &pit_clockevent);
+	register_global_clockevent(&pit_clockevent);
+#ifdef CONFIG_HPET_TIMER
+	global_clock_event = &pit_clockevent;
+#endif
+}
+
+/*
  * Since the PIT overflows every tick, its not very useful
  * to just read by itself. So use jiffies to emulate a free
  * running counter:
@@ -46,7 +122,7 @@ static cycle_t pit_read(void)
 	static u32 old_jifs;
 
 	spin_lock_irqsave(&i8253_lock, flags);
-        /*
+	/*
 	 * Although our caller may have the read side of xtime_lock,
 	 * this is now a seqlock, and we are cheating in this routine
 	 * by having side effects on state that we cannot undo if
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/time.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/time.c	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/time.c	2006-11-09 21:17:29.000000000 +0100
@@ -161,15 +161,6 @@ EXPORT_SYMBOL(profile_pc);
  */
 irqreturn_t timer_interrupt(int irq, void *dev_id)
 {
-	/*
-	 * Here we are in the timer irq handler. We just have irqs locally
-	 * disabled but we don't know if the timer_bh is running on the other
-	 * CPU. We need to avoid to SMP race with it. NOTE: we don' t need
-	 * the irq version of write_lock because as just said we have irq
-	 * locally disabled. -arca
-	 */
-	write_seqlock(&xtime_lock);
-
 #ifdef CONFIG_X86_IO_APIC
 	if (timer_ack) {
 		/*
@@ -188,7 +179,6 @@ irqreturn_t timer_interrupt(int irq, voi
 
 	do_timer_interrupt_hook();
 
-
 	if (MCA_bus) {
 		/* The PS/2 uses level-triggered interrupts.  You can't
 		turn them off, nor would you want to (any attempt to
@@ -203,13 +193,6 @@ irqreturn_t timer_interrupt(int irq, voi
 		outb_p( irq_v|0x80, 0x61 );	/* reset the IRQ */
 	}
 
-	write_sequnlock(&xtime_lock);
-
-#ifdef CONFIG_X86_LOCAL_APIC
-	if (using_apic_timer)
-		smp_send_timer_broadcast_ipi();
-#endif
-
 	return IRQ_HANDLED;
 }
 
@@ -278,39 +261,6 @@ void notify_arch_cmos_timer(void)
 	mod_timer(&sync_cmos_timer, jiffies + 1);
 }
 
-static int timer_resume(struct sys_device *dev)
-{
-#ifdef CONFIG_HPET_TIMER
-	if (is_hpet_enabled())
-		hpet_reenable();
-#endif
-	setup_pit_timer();
-	touch_softlockup_watchdog();
-	return 0;
-}
-
-static struct sysdev_class timer_sysclass = {
-	.resume = timer_resume,
-	set_kset_name("timer"),
-};
-
-
-/* XXX this driverfs stuff should probably go elsewhere later -john */
-static struct sys_device device_timer = {
-	.id	= 0,
-	.cls	= &timer_sysclass,
-};
-
-static int time_init_device(void)
-{
-	int error = sysdev_class_register(&timer_sysclass);
-	if (!error)
-		error = sysdev_register(&device_timer);
-	return error;
-}
-
-device_initcall(time_init_device);
-
 #ifdef CONFIG_HPET_TIMER
 extern void (*late_time_init)(void);
 /* Duplicate of time_init() below, with hpet_enable part added */
Index: linux-2.6.19-rc5-mm1/include/asm-i386/i8253.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/i8253.h	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/i8253.h	2006-11-09 21:17:29.000000000 +0100
@@ -1,6 +1,26 @@
 #ifndef __ASM_I8253_H__
 #define __ASM_I8253_H__
 
+#include <linux/clockchips.h>
+
 extern spinlock_t i8253_lock;
 
+#ifdef CONFIG_HPET_TIMER
+extern struct clock_event_device *global_clock_event;
+#else
+extern struct clock_event_device pit_clockevent;
+# define global_clock_event (&pit_clockevent)
+#endif
+
+/**
+ * pit_interrupt_hook - hook into timer tick
+ * @regs:	standard registers from interrupt
+ *
+ * Call the global clock event handler.
+ **/
+static inline void pit_interrupt_hook(struct pt_regs *regs)
+{
+	global_clock_event->event_handler(regs);
+}
+
 #endif	/* __ASM_I8253_H__ */
Index: linux-2.6.19-rc5-mm1/include/asm-i386/mach-default/do_timer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/mach-default/do_timer.h	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/mach-default/do_timer.h	2006-11-09 21:17:29.000000000 +0100
@@ -1,36 +1,18 @@
 /* defines for inline arch setup functions */
+#include <linux/clockchips.h>
 
-#include <asm/apic.h>
 #include <asm/i8259.h>
+#include <asm/i8253.h>
 
 /**
  * do_timer_interrupt_hook - hook into timer tick
- * @regs:	standard registers from interrupt
  *
- * Description:
- *	This hook is called immediately after the timer interrupt is ack'd.
- *	It's primary purpose is to allow architectures that don't possess
- *	individual per CPU clocks (like the CPU APICs supply) to broadcast the
- *	timer interrupt as a means of triggering reschedules etc.
+ * Call the pit clock event handler. see asm/i8253.h
  **/
 
 static inline void do_timer_interrupt_hook(void)
 {
-	do_timer(1);
-#ifndef CONFIG_SMP
-	update_process_times(user_mode_vm(get_irq_regs()));
-#endif
-/*
- * In the SMP case we use the local APIC timer interrupt to do the
- * profiling, except when we simulate SMP mode on a uniprocessor
- * system, in that case we have to call the local interrupt handler.
- */
-#ifndef CONFIG_X86_LOCAL_APIC
-	profile_tick(CPU_PROFILING);
-#else
-	if (!using_apic_timer)
-		smp_local_timer_interrupt();
-#endif
+	pit_interrupt_hook(get_irq_regs());
 }
 
 
Index: linux-2.6.19-rc5-mm1/include/asm-i386/mach-voyager/do_timer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/mach-voyager/do_timer.h	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/mach-voyager/do_timer.h	2006-11-09 21:17:29.000000000 +0100
@@ -1,13 +1,18 @@
 /* defines for inline arch setup functions */
+#include <linux/clockchips.h>
+
 #include <asm/voyager.h>
+#include <asm/i8253.h>
 
+/**
+ * do_timer_interrupt_hook - hook into timer tick
+ *
+ * Call the pit clock event handler (see asm/i8253.h), then run the
+ * Voyager specific timer handling.
+ **/
 static inline void do_timer_interrupt_hook(void)
 {
-	do_timer(1);
-#ifndef CONFIG_SMP
-	update_process_times(user_mode_vm(irq_regs));
-#endif
-
+	pit_interrupt_hook(get_irq_regs());
 	voyager_timer_interrupt();
 }
 
Index: linux-2.6.19-rc5-mm1/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.19-rc5-mm1.orig/Documentation/kernel-parameters.txt	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/Documentation/kernel-parameters.txt	2006-11-09 21:17:29.000000000 +0100
@@ -748,6 +748,11 @@ and is between 256 and 4096 characters. 
 	lapic		[IA-32,APIC] Enable the local APIC even if BIOS
 			disabled it.
 
+	lapictimer	[IA-32,APIC] Enable the local APIC timer on UP
+			systems for high resolution timers and dynticks.
+			This only has an effect when the local APIC is
+			available. It does not imply the "lapic" option.
+
 	lasi=		[HW,SCSI] PARISC LASI driver for the 53c700 chip
 			Format: addr:<io>,irq:<irq>
 
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/smpboot.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/smpboot.c	2006-11-09 21:17:25.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/smpboot.c	2006-11-09 21:17:29.000000000 +0100
@@ -438,9 +438,7 @@ static void __devinit smp_callin(void)
 	/*
 	 * Save our processor parameters
 	 */
- 	smp_store_cpu_info(cpuid);
-
-	disable_APIC_timer();
+	smp_store_cpu_info(cpuid);
 
 	/*
 	 * Allow the master to continue.
@@ -557,7 +555,6 @@ static void __devinit start_secondary(vo
 		enable_NMI_through_LVT0(NULL);
 		enable_8259A_irq(0);
 	}
-	enable_APIC_timer();
 	/*
 	 * low-memory mappings have been cleared, flush them from
 	 * the local TLBs too.
Index: linux-2.6.19-rc5-mm1/include/asm-i386/apic.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/apic.h	2006-11-09 21:17:25.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/apic.h	2006-11-09 21:18:00.000000000 +0100
@@ -111,13 +111,10 @@ extern void smp_local_timer_interrupt (v
 extern void setup_boot_APIC_clock (void);
 extern void setup_secondary_APIC_clock (void);
 extern int APIC_init_uniprocessor (void);
-extern void disable_APIC_timer(void);
-extern void enable_APIC_timer(void);
-static inline void lapic_timer_idle_broadcast(int broadcast) { }
+extern void lapic_timer_idle_broadcast(int broadcast);
 
 extern void enable_NMI_through_LVT0 (void * dummy);
 
-void smp_send_timer_broadcast_ipi(void);
 void switch_APIC_timer_to_ipi(void *cpumask);
 void switch_ipi_to_APIC_timer(void *cpumask);
 #define ARCH_APICTIMER_STOPS_ON_C3	1
Index: linux-2.6.19-rc5-mm1/include/asm-i386/mpspec.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/mpspec.h	2006-11-09 21:14:33.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/mpspec.h	2006-11-09 21:17:29.000000000 +0100
@@ -23,7 +23,6 @@ extern struct mpc_config_intsrc mp_irqs 
 extern int mpc_default_type;
 extern unsigned long mp_lapic_addr;
 extern int pic_mode;
-extern int using_apic_timer;
 
 #ifdef CONFIG_ACPI
 extern void mp_register_lapic (u8 id, u8 enabled);

--



* [patch 10/19] PM_timer: allow early access and move externs to a header file
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (8 preceding siblings ...)
  2006-11-09 23:38 ` [patch 09/19] i386: Convert to clock event devices Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10 10:12   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 11/19] i386: Rework local APIC calibration Thomas Gleixner
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: pm-timer-allow-early-access.patch --]
[-- Type: text/plain, Size: 3991 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Allow early access to the power management timer by exposing
the verified read function and providing a helper function
which checks the pmtmr_ioport variable and returns either the
pm timer readout or 0 in case the pm timer is not available.
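
A userspace sketch of the helper described above, with the port access
replaced by a stub. All sketch_* names are hypothetical. The delta helper
is not part of this patch; it is an assumption about how a caller such as
the reworked calibration might cope with the 24 bit counter wrapping once
between two readouts.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PM_MASK		0xFFFFFFu	/* counter is 24 bits wide */
#define SKETCH_PM_OVERRUN	(1u << 24)	/* counter period */

/* Hypothetical stand-ins for pmtmr_ioport and the hardware counter. */
static uint32_t sketch_ioport;
static uint32_t sketch_counter;

/* Stub for inl(pmtmr_ioport): advances on every read, masked to 24 bits. */
static uint32_t sketch_read_port(void)
{
	sketch_counter = (sketch_counter + 3) & SKETCH_PM_MASK;
	return sketch_counter;
}

/* Same acceptance condition as acpi_pm_read_verified() in the patch. */
static uint32_t sketch_read_verified(void)
{
	uint32_t v1, v2, v3;

	do {
		v1 = sketch_read_port();
		v2 = sketch_read_port();
		v3 = sketch_read_port();
	} while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1) ||
		 (v3 > v1 && v3 < v2));

	return v2;
}

/* Models acpi_pm_read_early(): 0 when no PM timer port is known. */
static uint32_t sketch_read_early(void)
{
	if (!sketch_ioport)
		return 0;
	return sketch_read_verified();
}

/* Hypothetical caller side: elapsed ticks across at most one wrap. */
static uint32_t sketch_pm_delta(uint32_t start, uint32_t end)
{
	if (end < start)
		end += SKETCH_PM_OVERRUN;
	return end - start;
}
```

Since 0 is also a legal counter value, a caller that needs to tell "no PM
timer" apart from a real readout would have to check the port variable
itself or treat two consecutive zero readouts as "not available".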

Create a new header file and also replace the ifdef'ed extern
declaration in arch/i386/kernel/acpi/boot.c

This is a preparatory patch for the rework of the local apic timer
calibration.

No functional changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/acpi/boot.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/acpi/boot.c	2006-11-09 20:55:37.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/acpi/boot.c	2006-11-09 21:06:19.000000000 +0100
@@ -25,6 +25,7 @@
 
 #include <linux/init.h>
 #include <linux/acpi.h>
+#include <linux/acpi_pmtmr.h>
 #include <linux/efi.h>
 #include <linux/cpumask.h>
 #include <linux/module.h>
@@ -702,10 +703,6 @@ static int __init acpi_parse_hpet(unsign
 #define	acpi_parse_hpet	NULL
 #endif
 
-#ifdef CONFIG_X86_PM_TIMER
-extern u32 pmtmr_ioport;
-#endif
-
 static int __init acpi_parse_fadt(unsigned long phys, unsigned long size)
 {
 	struct fadt_descriptor *fadt = NULL;
Index: linux-2.6.19-rc5-mm1/drivers/clocksource/acpi_pm.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/drivers/clocksource/acpi_pm.c	2006-11-09 20:55:37.000000000 +0100
+++ linux-2.6.19-rc5-mm1/drivers/clocksource/acpi_pm.c	2006-11-09 21:06:19.000000000 +0100
@@ -16,15 +16,13 @@
  * This file is licensed under the GPL v2.
  */
 
+#include <linux/acpi_pmtmr.h>
 #include <linux/clocksource.h>
 #include <linux/errno.h>
 #include <linux/init.h>
 #include <linux/pci.h>
 #include <asm/io.h>
 
-/* Number of PMTMR ticks expected during calibration run */
-#define PMTMR_TICKS_PER_SEC 3579545
-
 /*
  * The I/O port the PMTMR resides at.
  * The location is detected during setup_arch(),
@@ -32,15 +30,13 @@
  */
 u32 pmtmr_ioport __read_mostly;
 
-#define ACPI_PM_MASK CLOCKSOURCE_MASK(24) /* limit it to 24 bits */
-
 static inline u32 read_pmtmr(void)
 {
 	/* mask the output to 24 bits */
 	return inl(pmtmr_ioport) & ACPI_PM_MASK;
 }
 
-static cycle_t acpi_pm_read_verified(void)
+u32 acpi_pm_read_verified(void)
 {
 	u32 v1 = 0, v2 = 0, v3 = 0;
 
@@ -57,7 +53,12 @@ static cycle_t acpi_pm_read_verified(voi
 	} while (unlikely((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
 			  || (v3 > v1 && v3 < v2)));
 
-	return (cycle_t)v2;
+	return v2;
+}
+
+static cycle_t acpi_pm_read_slow(void)
+{
+	return (cycle_t)acpi_pm_read_verified();
 }
 
 static cycle_t acpi_pm_read(void)
@@ -87,7 +88,7 @@ __setup("acpi_pm_good", acpi_pm_good_set
 
 static inline void acpi_pm_need_workaround(void)
 {
-	clocksource_acpi_pm.read = acpi_pm_read_verified;
+	clocksource_acpi_pm.read = acpi_pm_read_slow;
 	clocksource_acpi_pm.rating = 110;
 }
 
Index: linux-2.6.19-rc5-mm1/include/linux/acpi_pmtmr.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/include/linux/acpi_pmtmr.h	2006-11-09 21:06:19.000000000 +0100
@@ -0,0 +1,38 @@
+#ifndef _ACPI_PMTMR_H_
+#define _ACPI_PMTMR_H_
+
+#include <linux/clocksource.h>
+
+/* Number of PMTMR ticks expected during calibration run */
+#define PMTMR_TICKS_PER_SEC 3579545
+
+/* limit it to 24 bits */
+#define ACPI_PM_MASK CLOCKSOURCE_MASK(24)
+
+/* Overrun value */
+#define ACPI_PM_OVRRUN	(1<<24)
+
+#ifdef CONFIG_X86_PM_TIMER
+
+extern u32 acpi_pm_read_verified(void);
+extern u32 pmtmr_ioport;
+
+static inline u32 acpi_pm_read_early(void)
+{
+	if (!pmtmr_ioport)
+		return 0;
+	/* mask the output to 24 bits */
+	return acpi_pm_read_verified();
+}
+
+#else
+
+static inline u32 acpi_pm_read_early(void)
+{
+	return 0;
+}
+
+#endif
+
+#endif
+

--



* [patch 11/19] i386: Rework local APIC calibration
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (9 preceding siblings ...)
  2006-11-09 23:38 ` [patch 10/19] PM_timer: allow early access and move externs to a header file Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10 10:17   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 12/19] high-res timers: core Thomas Gleixner
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: i386-lapic-calibrate-timer.patch --]
[-- Type: text/plain, Size: 15869 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The local APIC timer calibration has two problem cases:

1. The calibration reads the PIT/HPET timer to detect the wrap of the
periodic tick. Some boxes get stuck in the calibration loop because the
PIT readout is broken.

2. CoreDuo boxes sporadically show a PIT which runs too slow, which results
in a wrong lapic calibration. The PIT goes back to normal operation once
the lapic timer is switched to periodic mode.

Rework the code to address both problems:
- Make the calibration interrupt driven. This removes the wait_timer_tick
  magic hackery from apic.c and time_hpet.c. The clockevents framework
  allows easy substitution of the global tick event handler for the
  calibration. This is more accurate than monitoring jiffies. At this
  point of the boot process, nothing disturbs the interrupt delivery, so
  the results are very accurate.

- Verify the calibration against the PM timer, when available, by using the
  early access function. When the measured calibration period is outside
  of a one percent window, the lapic timer calibration is adjusted to the
  PM timer result.

- Verify the calibration by running the lapic timer with the calibration
  handler. Disable lapic timer in case of deviation.
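
The PM timer check described above can be sketched in plain user-space C.
This is only an illustration of the arithmetic, not the kernel code: the
function name sketch_lapic_delta_checked() is hypothetical, and the only
values taken from the patch are PMTMR_TICKS_PER_SEC, the 100ms measurement
window and the one percent threshold.

```c
#include <stdint.h>

/* ACPI PM timer frequency in Hz, as defined in the patch */
#define PMTMR_TICKS_PER_SEC 3579545L

/*
 * Hypothetical helper: return the lapic delta to use for calibration.
 *
 * delta   - lapic counter ticks measured over the nominal 100ms window
 * deltapm - PM timer ticks measured over the same window, 0 if the
 *           PM timer is not available
 */
static long sketch_lapic_delta_checked(long delta, long deltapm)
{
	const long pm_100ms = PMTMR_TICKS_PER_SEC / 10;	/* expected PM ticks */
	const long pm_thresh = pm_100ms / 100;		/* one percent window */

	/* No PM timer: nothing to verify against */
	if (!deltapm)
		return delta;

	/* Inside the window: the tick source ran at the expected rate */
	if (deltapm > pm_100ms - pm_thresh && deltapm < pm_100ms + pm_thresh)
		return delta;

	/* Outside the window: scale the lapic delta to a real 100ms */
	return (long)(((int64_t)delta * pm_100ms) / deltapm);
}
```

With a PM timer delta of twice the expected value (the window really lasted
200ms), a measured lapic delta is halved; with a PM timer delta inside the
window it is returned unchanged.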

This also removes the "synchronization" of the local apic timer to the
global tick. This synchronization never worked, as there is no way to
synchronize PIT(HPET) and local APIC timer. The synchronization by waiting
for the tick just aligns the local APIC timer for the first events, but
later the events drift away due to the different clocks. Removing the
"sync" is just randomizing the asynchronous behaviour at setup time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/apic.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/apic.c	2006-11-09 21:06:17.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/apic.c	2006-11-09 21:06:22.000000000 +0100
@@ -26,6 +26,7 @@
 #include <linux/sysdev.h>
 #include <linux/cpu.h>
 #include <linux/clockchips.h>
+#include <linux/acpi_pmtmr.h>
 #include <linux/module.h>
 
 #include <asm/atomic.h>
@@ -163,64 +164,8 @@ int lapic_get_maxlvt(void)
  * Local APIC timer
  */
 
-/*
- * This part sets up the APIC 32 bit clock in LVTT1, with HZ interrupts
- * per second. We assume that the caller has already set up the local
- * APIC.
- *
- * The APIC timer is not exactly sync with the external timer chip, it
- * closely follows bus clocks.
- */
-
-/*
- * FIXME: Move this to i8253.h. There is no need to keep the access to
- * the PIT scattered all around the place -tglx
- */
-
-/*
- * The timer chip is already set up at HZ interrupts per second here,
- * but we do not accept timer interrupts yet. We only allow the BP
- * to calibrate.
- */
-static unsigned int __devinit get_8254_timer_count(void)
-{
-	unsigned long flags;
-
-	unsigned int count;
-
-	spin_lock_irqsave(&i8253_lock, flags);
-
-	outb_p(0x00, PIT_MODE);
-	count = inb_p(PIT_CH0);
-	count |= inb_p(PIT_CH0) << 8;
-
-	spin_unlock_irqrestore(&i8253_lock, flags);
-
-	return count;
-}
-
-/* next tick in 8254 can be caught by catching timer wraparound */
-static void __devinit wait_8254_wraparound(void)
-{
-	unsigned int curr_count, prev_count;
-
-	curr_count = get_8254_timer_count();
-	do {
-		prev_count = curr_count;
-		curr_count = get_8254_timer_count();
-
-		/* workaround for broken Mercury/Neptune */
-		if (prev_count >= curr_count + 0x100)
-			curr_count = get_8254_timer_count();
-
-	} while (prev_count >= curr_count);
-}
-
-/*
- * Default initialization for 8254 timers. If we use other timers like HPET,
- * we override this later
- */
-void (*wait_timer_tick)(void) __devinitdata = wait_8254_wraparound;
+/* Clock divisor is set to 16 */
+#define APIC_DIVISOR 16
 
 /*
  * This function sets up the local APIC timer, with a timeout of
@@ -232,9 +177,6 @@ void (*wait_timer_tick)(void) __devinitd
  * We do reads before writes even if unnecessary, to get around the
  * P5 APIC double write bug.
  */
-
-#define APIC_DIVISOR 16
-
 static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
 {
 	unsigned int lvtt_value, tmp_value;
@@ -312,112 +254,244 @@ static void __devinit setup_APIC_timer(v
 }
 
 /*
- * In this function we calibrate APIC bus clocks to the external
- * timer. Unfortunately we cannot use jiffies and the timer irq
- * to calibrate, since some later bootup code depends on getting
- * the first irq? Ugh.
+ * In this function we calibrate APIC bus clocks to the external timer.
+ *
+ * We want to do the calibration only once since we want to have local timer
+ * irqs synchronous. CPUs connected by the same APIC bus have the very same
+ * bus frequency.
  *
- * TODO: Fix this rather than saying "Ugh" -tglx
+ * This was previously done by reading the PIT/HPET and waiting for a wrap
+ * around to find out, that a tick has elapsed. I have a box, where the PIT
+ * readout is broken, so it never gets out of the wait loop again. This was
+ * also reported by others.
  *
- * We want to do the calibration only once since we
- * want to have local timer irqs syncron. CPUs connected
- * by the same APIC bus have the very same bus frequency.
- * And we want to have irqs off anyways, no accidental
- * APIC irq that way.
+ * Monitoring the jiffies value is inaccurate and the clockevents
+ * infrastructure allows us to do a simple substitution of the interrupt
+ * handler.
+ *
+ * The calibration routine also uses the pm_timer when possible, as the PIT
+ * happens to run way too slow (factor 2.3 on my VAIO CoreDuo, which goes
+ * back to normal later in the boot process).
  */
 
-static int __init calibrate_APIC_clock(void)
+#define LAPIC_CAL_LOOPS		(HZ/10)
+
+static __initdata volatile int lapic_cal_loops = -1;
+static __initdata long lapic_cal_t1, lapic_cal_t2;
+static __initdata unsigned long long lapic_cal_tsc1, lapic_cal_tsc2;
+static __initdata unsigned long lapic_cal_pm1, lapic_cal_pm2;
+static __initdata unsigned long lapic_cal_j1, lapic_cal_j2;
+
+/*
+ * Temporary interrupt handler.
+ */
+static void __init lapic_cal_handler(struct pt_regs *regs)
 {
-	unsigned long long t1 = 0, t2 = 0;
-	long tt1, tt2;
-	long result;
-	int i;
-	const int LOOPS = HZ/10;
+	unsigned long long tsc = 0;
+	long tapic = apic_read(APIC_TMCCT);
+	unsigned long pm = acpi_pm_read_early();
 
-	apic_printk(APIC_VERBOSE, "calibrating APIC timer ...\n");
+	if (cpu_has_tsc)
+		rdtscll(tsc);
 
-	/*
-	 * Put whatever arbitrary (but long enough) timeout
-	 * value into the APIC clock, we just want to get the
-	 * counter running for calibration.
-	 */
-	__setup_APIC_LVTT(1000000000, 0, 0);
+	switch (lapic_cal_loops++) {
+	case 0:
+		lapic_cal_t1 = tapic;
+		lapic_cal_tsc1 = tsc;
+		lapic_cal_pm1 = pm;
+		lapic_cal_j1 = jiffies;
+		break;
 
-	/*
-	 * The timer chip counts down to zero. Let's wait
-	 * for a wraparound to start exact measurement:
-	 * (the current tick might have been already half done)
-	 */
+	case LAPIC_CAL_LOOPS:
+		lapic_cal_t2 = tapic;
+		lapic_cal_tsc2 = tsc;
+		if (pm < lapic_cal_pm1)
+			pm += ACPI_PM_OVRRUN;
+		lapic_cal_pm2 = pm;
+		lapic_cal_j2 = jiffies;
+		break;
+	}
+}
 
-	wait_timer_tick();
+/*
+ * Setup the boot APIC
+ *
+ * Calibrate and verify the result.
+ */
+void __init setup_boot_APIC_clock(void)
+{
+	struct clock_event_device *levt = &__get_cpu_var(lapic_events);
+	const long pm_100ms = PMTMR_TICKS_PER_SEC/10;
+	const long pm_thresh = pm_100ms/100;
+	void (*real_handler)(struct pt_regs *regs);
+	unsigned long deltaj;
+	long delta, deltapm;
+	cpumask_t cpumask;
 
-	/*
-	 * We wrapped around just now. Let's start:
-	 */
-	if (cpu_has_tsc)
-		rdtscll(t1);
-	tt1 = apic_read(APIC_TMCCT);
+	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
+		    "calibrating APIC timer ...\n");
+
+	/* Register broadcast function */
+	clockevents_register_broadcast(lapic_timer_broadcast);
 
 	/*
-	 * Let's wait LOOPS wraprounds:
+	 * Enable the apic timer next event capability only for
+	 * SMP and on UP, when requested via commandline
 	 */
-	for (i = 0; i < LOOPS; i++)
-		wait_timer_tick();
+	if (num_possible_cpus() > 1 || enable_local_apic_timer)
+		lapic_clockevent.capabilities |= CLOCK_CAP_NEXTEVT;
 
-	tt2 = apic_read(APIC_TMCCT);
-	if (cpu_has_tsc)
-		rdtscll(t2);
+	local_irq_disable();
+
+	/* Replace the global interrupt handler */
+	real_handler = global_clock_event->event_handler;
+	global_clock_event->event_handler = lapic_cal_handler;
 
 	/*
-	 * The APIC bus clock counter is 32 bits only, it
-	 * might have overflown, but note that we use signed
-	 * longs, thus no extra care needed.
-	 *
-	 * underflown to be exact, as the timer counts down ;)
+	 * Setup the APIC counter to 1e9. There is no way the lapic
+	 * can underflow in the 100ms detection time frame
 	 */
+	__setup_APIC_LVTT(1000000000, 0, 0);
+
+	/* Let the interrupts run */
+	local_irq_enable();
+
+	while(lapic_cal_loops <= LAPIC_CAL_LOOPS);
+
+	local_irq_disable();
+
+	/* Restore the real event handler */
+	global_clock_event->event_handler = real_handler;
+
+	/* Build delta t1-t2 as apic timer counts down */
+	delta = lapic_cal_t1 - lapic_cal_t2;
+	apic_printk(APIC_VERBOSE, "... lapic delta = %ld\n", delta);
+
+	/* Check, if the PM timer is available */
+	deltapm = lapic_cal_pm2 - lapic_cal_pm1;
+	apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm);
 
-	result = (tt1-tt2)*APIC_DIVISOR/LOOPS;
+	if (deltapm) {
+		unsigned long mult;
+		u64 res;
+
+		mult = clocksource_hz2mult(PMTMR_TICKS_PER_SEC, 22);
+
+		if (deltapm > (pm_100ms - pm_thresh) &&
+		    deltapm < (pm_100ms + pm_thresh)) {
+			apic_printk(APIC_VERBOSE, "... PM timer result ok\n");
+		} else {
+			res = (((u64) deltapm) *  mult) >> 22;
+			do_div(res, 1000000);
+			printk(KERN_WARNING "APIC calibration not consistent "
+			       "with PM Timer: %ldms instead of 100ms\n",
+			       (long)res);
+			/* Correct the lapic counter value */
+			res = (((u64) delta ) * pm_100ms);
+			do_div(res, deltapm);
+			printk(KERN_INFO "APIC delta adjusted to PM-Timer: "
+			       "%lu (%ld)\n", (unsigned long) res, delta);
+			delta = (long) res;
+		}
+	}
 
 	/* Calculate the scaled math multiplication factor */
-	lapic_clockevent.mult = div_sc(tt1-tt2, TICK_NSEC * LOOPS, 32);
+	lapic_clockevent.mult = div_sc(delta, TICK_NSEC * LAPIC_CAL_LOOPS, 32);
 	lapic_clockevent.max_delta_ns =
 		clockevent_delta2ns(0x7FFFFF, &lapic_clockevent);
 	lapic_clockevent.min_delta_ns =
 		clockevent_delta2ns(0xF, &lapic_clockevent);
 
-	apic_printk(APIC_VERBOSE, "..... tt1-tt2 %ld\n", tt1 - tt2);
+	calibration_result = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
+
+	apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta);
 	apic_printk(APIC_VERBOSE, "..... mult: %ld\n", lapic_clockevent.mult);
-	apic_printk(APIC_VERBOSE, "..... calibration result: %ld\n", result);
+	apic_printk(APIC_VERBOSE, "..... calibration result: %u\n",
+		    calibration_result);
 
-	if (cpu_has_tsc)
+	if (cpu_has_tsc) {
+		delta = (long)(lapic_cal_tsc2 - lapic_cal_tsc1);
 		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
-			"%ld.%04ld MHz.\n",
-			((long)(t2-t1)/LOOPS)/(1000000/HZ),
-			((long)(t2-t1)/LOOPS)%(1000000/HZ));
+			    "%ld.%04ld MHz.\n",
+			    (delta / LAPIC_CAL_LOOPS) / (1000000 / HZ),
+			    (delta / LAPIC_CAL_LOOPS) % (1000000 / HZ));
+	}
 
 	apic_printk(APIC_VERBOSE, "..... host bus clock speed is "
-		"%ld.%04ld MHz.\n",
-		result/(1000000/HZ),
-		result%(1000000/HZ));
+		    "%u.%04u MHz.\n",
+		    calibration_result / (1000000 / HZ),
+		    calibration_result % (1000000 / HZ));
 
-	return result;
-}
 
-void __init setup_boot_APIC_clock(void)
-{
-	unsigned long flags;
-	apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n");
-	using_apic_timer = 1;
-
-	local_irq_save(flags);
+	apic_printk(APIC_VERBOSE, "... verify APIC timer\n");
 
-	calibration_result = calibrate_APIC_clock();
 	/*
-	 * Now set up the timer for real.
+	 * Start LAPIC timer and verify that the calculated factor is correct
 	 */
 	setup_APIC_timer();
 
-	local_irq_restore(flags);
+	/* Replace the lapic interrupt handler */
+	real_handler = levt->event_handler;
+	levt->event_handler = lapic_cal_handler;
+	lapic_cal_loops = -1;
+
+	/* Let the interrupts run */
+	local_irq_enable();
+
+	while(lapic_cal_loops <= LAPIC_CAL_LOOPS);
+
+	local_irq_disable();
+
+	/* Restore the real event handler */
+	levt->event_handler = real_handler;
+
+	local_irq_enable();
+
+	/* Jiffies delta */
+	deltaj = lapic_cal_j2 - lapic_cal_j1;
+	apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj);
+
+	/* Check, if the PM timer is available */
+	deltapm = lapic_cal_pm2 - lapic_cal_pm1;
+	apic_printk(APIC_VERBOSE, "... PM timer delta = %ld\n", deltapm);
+
+	if (deltapm) {
+		if (deltapm > (pm_100ms - pm_thresh) &&
+		    deltapm < (pm_100ms + pm_thresh)) {
+			apic_printk(APIC_VERBOSE, "... PM timer result ok\n");
+			/* Check, if the jiffies result is consistent */
+			if (deltaj < LAPIC_CAL_LOOPS-2 ||
+			    deltaj > LAPIC_CAL_LOOPS+2) {
+				/*
+				 * Not sure, what we can do about this one.
+				 * When high resolution timers are active
+				 * and the lapic timer does not stop in C3
+				 * we are fine. Otherwise more trouble might
+				 * be waiting. -- tglx
+				 */
+				printk(KERN_WARNING "Global event device %s "
+				       "has wrong frequency "
+				       "(%lu ticks instead of %d)\n",
+				       global_clock_event->name, deltaj,
+				       LAPIC_CAL_LOOPS);
+			}
+			return;
+		}
+	} else {
+		/* Check, if the jiffies result is consistent */
+		if (deltaj >= LAPIC_CAL_LOOPS-2 &&
+		    deltaj <= LAPIC_CAL_LOOPS+2) {
+			apic_printk(APIC_VERBOSE, "... jiffies result ok\n");
+			return;
+		}
+	}
+
+	printk(KERN_WARNING
+	       "APIC timer disabled due to verification failure.\n");
+	local_irq_disable();
+	cpumask = cpumask_of_cpu(smp_processor_id());
+	switch_APIC_timer_to_ipi(&cpumask);
+	local_irq_enable();
 }
 
 void __devinit setup_secondary_APIC_clock(void)
Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/time_hpet.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/time_hpet.c	2006-11-09 20:55:35.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/time_hpet.c	2006-11-09 21:06:22.000000000 +0100
@@ -43,23 +43,6 @@ static void hpet_writel(unsigned long d,
 	writel(d, hpet_virt_address + a);
 }
 
-#ifdef CONFIG_X86_LOCAL_APIC
-/*
- * HPET counters dont wrap around on every tick. They just change the
- * comparator value and continue. Next tick can be caught by checking
- * for a change in the comparator value. Used in apic.c.
- */
-static void __devinit wait_hpet_tick(void)
-{
-	unsigned int start_cmp_val, end_cmp_val;
-
-	start_cmp_val = hpet_readl(HPET_T0_CMP);
-	do {
-		end_cmp_val = hpet_readl(HPET_T0_CMP);
-	} while (start_cmp_val == end_cmp_val);
-}
-#endif
-
 static int hpet_timer_stop_set_go(unsigned long tick)
 {
 	unsigned int cfg;
@@ -213,11 +196,6 @@ int __init hpet_enable(void)
 		hpet_alloc(&hd);
 	}
 #endif
-
-#ifdef CONFIG_X86_LOCAL_APIC
-	if (hpet_use_timer)
-		wait_timer_tick = wait_hpet_tick;
-#endif
 	return 0;
 }
 
Index: linux-2.6.19-rc5-mm1/include/asm-i386/apic.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/asm-i386/apic.h	2006-11-09 21:06:17.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/asm-i386/apic.h	2006-11-09 21:06:22.000000000 +0100
@@ -93,8 +93,6 @@ static inline void ack_APIC_irq(void)
 	apic_write_around(APIC_EOI, 0);
 }
 
-extern void (*wait_timer_tick)(void);
-
 extern int lapic_get_maxlvt(void);
 extern void clear_local_APIC(void);
 extern void connect_bsp_APIC (void);

--



* [patch 12/19] high-res timers: core
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (10 preceding siblings ...)
  2006-11-09 23:38 ` [patch 11/19] i386: Rework local APIC calibration Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10 10:26   ` Arjan van de Ven
  2006-11-09 23:38 ` [patch 13/19] GTOD: Mark TSC unusable for highres timers Thomas Gleixner
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: high-res-timers-core.patch --]
[-- Type: text/plain, Size: 35240 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add the core bits of high-res timers support.

The design makes use of the existing hrtimers subsystem which manages a
per-CPU and per-clock tree of timers, and the clockevents framework, which
provides a standard API to request programmable clock events from.  The core
code does not have to know about the clock details - it makes use of
clockevents_set_next_event().
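
As a rough illustration of how the core can stay ignorant of clock details,
the following user-space sketch picks the earliest expiry across the clock
bases after translating each one to the monotonic base. All sketch_* names
and the plain int64_t nanosecond representation are assumptions made for
the example; the kernel code works on ktime_t and the rbtree of each base.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_KTIME_MAX INT64_MAX

/* Hypothetical, simplified stand-in for one per-clock timer base */
struct sketch_clock_base {
	int	has_timer;	/* non-zero if at least one timer is queued */
	int64_t	first_expiry;	/* earliest expiry in this clock's time, ns */
	int64_t	offset;		/* offset of this clock to monotonic, ns */
};

/*
 * Return the earliest expiry over all bases in monotonic time, or
 * SKETCH_KTIME_MAX when no timer is queued. The caller would hand the
 * result to the clock event layer to program the next event.
 */
static int64_t sketch_next_event(const struct sketch_clock_base *bases,
				 size_t n)
{
	int64_t next = SKETCH_KTIME_MAX;
	size_t i;

	for (i = 0; i < n; i++) {
		int64_t expires;

		if (!bases[i].has_timer)
			continue;
		/* Translate to the monotonic base before comparing */
		expires = bases[i].first_expiry - bases[i].offset;
		if (expires < next)
			next = expires;
	}
	return next;
}
```

Keeping every comparison in monotonic time means a single clock event
device per CPU is enough, regardless of how many clock bases exist.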

Once the preliminaries for high resolution mode (a continuous time source for
time keeping and a reprogrammable clock event device) are available, the
hrtimer code is switched to high resolution mode.  The per-cpu clock event
devices are switched into one shot mode and on SMP systems a global clock
event device, if available (e.g.  PIT on i386), is switched off.  The
periodic tick, which updates jiffies and calls update_process_times and
profiling, is provided by a per-cpu hrtimer.  The callback function is
executed in the timer interrupt context.  The hrtimer based implementation of
the periodic tick is designed to be extended with dynamic tick functionality.
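
A periodic tick emulated by a one shot timer has to be re-armed relative to
its previous expiry, not relative to "now", otherwise latency accumulates
as drift. The user-space sketch below shows one way to do that and to count
how many tick periods have elapsed; it is similar in spirit to forwarding
an hrtimer by an interval, but the function name sketch_tick_forward() and
the int64_t nanosecond arguments are assumptions made for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: advance *expires by whole periods until it lies
 * after 'now'. Returns the number of periods added, which is the number
 * of ticks the caller has to account for (e.g. in a jiffies-like counter).
 * All times are in nanoseconds.
 */
static uint64_t sketch_tick_forward(int64_t *expires, int64_t now,
				    int64_t period)
{
	uint64_t ticks;

	/* Timer has not expired yet, or the period is invalid */
	if (period <= 0 || now < *expires)
		return 0;

	/* Whole periods already elapsed, plus one to get past 'now' */
	ticks = (uint64_t)((now - *expires) / period) + 1;
	*expires += (int64_t)ticks * period;

	return ticks;
}
```

Returning the tick count rather than assuming exactly one tick per call is
what makes such a scheme extensible towards skipping ticks while idle.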

The impact to non-high-res architectures is intended to be minimal.

More detailed information is available in Documentation/hrtimer/highres.txt

Build-fixes-from: Valdis.Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.19-rc5-mm1.orig/Documentation/kernel-parameters.txt	2006-11-09 21:06:17.000000000 +0100
+++ linux-2.6.19-rc5-mm1/Documentation/kernel-parameters.txt	2006-11-09 21:06:24.000000000 +0100
@@ -594,6 +594,10 @@ and is between 256 and 4096 characters. 
 			highmem otherwise. This also works to reduce highmem
 			size on bigger boxes.
 
+	highres=	[KNL] Enable/disable high resolution timer mode.
+			Valid parameters: "on", "off"
+			Default: "on"
+
 	hisax=		[HW,ISDN]
 			See Documentation/isdn/README.HiSax.
 
Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hrtimer.h	2006-11-09 21:06:13.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hrtimer.h	2006-11-09 21:06:24.000000000 +0100
@@ -17,6 +17,7 @@
 
 #include <linux/rbtree.h>
 #include <linux/ktime.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/wait.h>
@@ -41,6 +42,23 @@ enum hrtimer_restart {
 };
 
 /*
+ * hrtimer callback modes:
+ *
+ *	HRTIMER_CB_SOFTIRQ:		Callback must run in softirq context
+ *	HRTIMER_CB_IRQSAFE:		Callback may run in hardirq context
+ *	HRTIMER_CB_IRQSAFE_NO_RESTART:	Callback may run in hardirq context and
+ *					does not restart the timer
+ *	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:	Callback must run in hardirq context
+ *					Special mode for tick emulation
+ */
+enum hrtimer_cb_mode {
+	HRTIMER_CB_SOFTIRQ,
+	HRTIMER_CB_IRQSAFE,
+	HRTIMER_CB_IRQSAFE_NO_RESTART,
+	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ,
+};
+
+/*
  * Bit values to track state of the timer
  *
  * Possible states:
@@ -50,6 +68,7 @@ enum hrtimer_restart {
  * 0x02		callback function running
  * 0x03		callback function running and enqueued
  *		(was requeued on another CPU)
+ * 0x04		callback pending (high resolution mode)
  *
  * The "callback function running and enqueued" status is only possible on
  * SMP. It happens for example when a posix timer expired and the callback
@@ -67,6 +86,7 @@ enum hrtimer_restart {
 #define HRTIMER_STATE_INACTIVE	0x00
 #define HRTIMER_STATE_ENQUEUED	0x01
 #define HRTIMER_STATE_CALLBACK	0x02
+#define HRTIMER_STATE_PENDING	0x04
 
 /**
  * struct hrtimer - the basic hrtimer structure
@@ -77,6 +97,9 @@ enum hrtimer_restart {
  * @function:	timer expiry callback function
  * @base:	pointer to the timer base (per cpu and per clock)
  * @state:	state information (See bit values above)
+ * @cb_mode:	high resolution timer feature to select the callback execution
+ *		 mode
+ * @cb_entry:	list head to enqueue an expired timer into the callback list
  *
  * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
  */
@@ -86,6 +109,10 @@ struct hrtimer {
 	enum hrtimer_restart		(*function)(struct hrtimer *);
 	struct hrtimer_clock_base	*base;
 	unsigned long			state;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	enum hrtimer_cb_mode		cb_mode;
+	struct list_head		cb_entry;
+#endif
 };
 
 /**
@@ -110,6 +137,9 @@ struct hrtimer_sleeper {
  * @get_time:		function to retrieve the current time of the clock
  * @get_softirq_time:	function to retrieve the current time from the softirq
  * @softirq_time:	the time when running the hrtimer queue in the softirq
+ * @cb_pending:		list of timers where the callback is pending
+ * @offset:		offset of this clock to the monotonic base
+ * @reprogram:		function to reprogram the timer event
  */
 struct hrtimer_clock_base {
 	struct hrtimer_cpu_base	*cpu_base;
@@ -120,6 +150,12 @@ struct hrtimer_clock_base {
 	ktime_t			(*get_time)(void);
 	ktime_t			(*get_softirq_time)(void);
 	ktime_t			softirq_time;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	ktime_t			offset;
+	int			(*reprogram)(struct hrtimer *t,
+					     struct hrtimer_clock_base *b,
+					     ktime_t n);
+#endif
 };
 
 #define HRTIMER_MAX_CLOCK_BASES 2
@@ -131,20 +167,77 @@ struct hrtimer_clock_base {
  * @lock_key:		the lock_class_key for use with lockdep
  * @clock_base:		array of clock bases for this cpu
  * @curr_timer:		the timer which is executing a callback right now
+ * @expires_next:	absolute time of the next event which was scheduled
+ *			via clockevents_set_next_event()
+ * @hres_active:	State of high resolution mode
+ * @check_clocks:	Indicator, when set evaluate time source and clock
+ *			event devices whether high resolution mode can be
+ *			activated.
+ * @cb_pending:		Expired timers are moved from the rbtree to this
+ *			list in the timer interrupt. The list is processed
+ *			in the softirq.
+ * @sched_timer:	hrtimer to schedule the periodic tick in high
+ *			resolution mode
  */
 struct hrtimer_cpu_base {
 	spinlock_t			lock;
 	struct lock_class_key		lock_key;
 	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
+#ifdef CONFIG_HIGH_RES_TIMERS
+	ktime_t				expires_next;
+	int				hres_active;
+	unsigned long			check_clocks;
+	struct list_head		cb_pending;
+	struct hrtimer			sched_timer;
+#endif
 };
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+extern void hrtimer_clock_notify(void);
+extern void clock_was_set(void);
+extern void hrtimer_interrupt(struct pt_regs *regs);
+
+/*
+ * In high resolution mode the time reference must be read accurately
+ */
+static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
+{
+	return timer->base->get_time();
+}
+
+/*
+ * The resolution of the clocks. The resolution value is returned in
+ * the clock_getres() system call to give application programmers an
+ * idea of the (in)accuracy of timers. Timer values are rounded up to
+ * this resolution values.
+ */
+# define KTIME_HIGH_RES		(ktime_t) { .tv64 = 1 }
+# define KTIME_MONOTONIC_RES	KTIME_HIGH_RES
+
+#else
+
+# define KTIME_MONOTONIC_RES	KTIME_LOW_RES
+
 /*
  * clock_was_set() is a NOP for non- high-resolution systems. The
  * time-sorted order guarantees that a timer does not expire early and
  * is expired in the next softirq when the clock was advanced.
  */
-#define clock_was_set()		do { } while (0)
-#define hrtimer_clock_notify()	do { } while (0)
+static inline void clock_was_set(void) { }
+static inline void hrtimer_clock_notify(void) { }
+
+/*
+ * In non high resolution mode the time reference is taken from
+ * the base softirq time variable.
+ */
+static inline ktime_t hrtimer_cb_get_time(struct hrtimer *timer)
+{
+	return timer->base->softirq_time;
+}
+
+#endif
+
 extern ktime_t ktime_get(void);
 extern ktime_t ktime_get_real(void);
 
Index: linux-2.6.19-rc5-mm1/include/linux/interrupt.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/interrupt.h	2006-11-09 20:55:32.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/interrupt.h	2006-11-09 21:06:24.000000000 +0100
@@ -236,7 +236,10 @@ enum
 	NET_TX_SOFTIRQ,
 	NET_RX_SOFTIRQ,
 	BLOCK_SOFTIRQ,
-	TASKLET_SOFTIRQ
+	TASKLET_SOFTIRQ,
+#ifdef CONFIG_HIGH_RES_TIMERS
+	HRTIMER_SOFTIRQ,
+#endif
 };
 
 /* softirq mask and active fields moved to irq_cpustat_t in
Index: linux-2.6.19-rc5-mm1/include/linux/ktime.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/ktime.h	2006-11-09 20:55:32.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/ktime.h	2006-11-09 21:06:24.000000000 +0100
@@ -261,8 +261,7 @@ static inline u64 ktime_to_ns(const ktim
  * idea of the (in)accuracy of timers. Timer values are rounded up to
  * this resolution values.
  */
-#define KTIME_REALTIME_RES	(ktime_t){ .tv64 = TICK_NSEC }
-#define KTIME_MONOTONIC_RES	(ktime_t){ .tv64 = TICK_NSEC }
+#define KTIME_LOW_RES		(ktime_t){ .tv64 = TICK_NSEC }
 
 /* Get the monotonic time in timespec format: */
 extern void ktime_get_ts(struct timespec *ts);
Index: linux-2.6.19-rc5-mm1/kernel/hrtimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/hrtimer.c	2006-11-09 21:06:13.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/hrtimer.c	2006-11-09 21:06:24.000000000 +0100
@@ -38,7 +38,12 @@
 #include <linux/hrtimer.h>
 #include <linux/notifier.h>
 #include <linux/syscalls.h>
+#include <linux/kallsyms.h>
 #include <linux/interrupt.h>
+#include <linux/clockchips.h>
+#include <linux/profile.h>
+#include <linux/seq_file.h>
+#include <linux/err.h>
 
 #include <asm/uaccess.h>
 
@@ -81,7 +86,7 @@ EXPORT_SYMBOL_GPL(ktime_get_real);
  * This ensures that we capture erroneous accesses to these clock ids
  * rather than moving them into the range of valid clock id's.
  */
-static DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
+DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
 {
 
 	.clock_base =
@@ -89,12 +94,12 @@ static DEFINE_PER_CPU(struct hrtimer_cpu
 		{
 			.index = CLOCK_REALTIME,
 			.get_time = &ktime_get_real,
-			.resolution = KTIME_REALTIME_RES,
+			.resolution = KTIME_LOW_RES,
 		},
 		{
 			.index = CLOCK_MONOTONIC,
 			.get_time = &ktime_get,
-			.resolution = KTIME_MONOTONIC_RES,
+			.resolution = KTIME_LOW_RES,
 		},
 	}
 };
@@ -228,7 +233,7 @@ lock_hrtimer_base(const struct hrtimer *
 	return base;
 }
 
-#define switch_hrtimer_base(t, b)	(b)
+# define switch_hrtimer_base(t, b)	(b)
 
 #endif	/* !CONFIG_SMP */
 
@@ -265,9 +270,6 @@ ktime_t ktime_add_ns(const ktime_t kt, u
 
 	return ktime_add(kt, tmp);
 }
-
-#else /* CONFIG_KTIME_SCALAR */
-
 # endif /* !CONFIG_KTIME_SCALAR */
 
 /*
@@ -295,11 +297,437 @@ static unsigned long ktime_divns(const k
 # define ktime_divns(kt, div)		(unsigned long)((kt).tv64 / (div))
 #endif /* BITS_PER_LONG >= 64 */
 
+/* High resolution timer related functions */
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+/*
+ * High resolution timer enabled ?
+ */
+static int hrtimer_hres_enabled __read_mostly  = 1;
+
+/*
+ * Enable / Disable high resolution mode
+ */
+static int __init setup_hrtimer_hres(char *str)
+{
+	if (!strcmp(str, "off"))
+		hrtimer_hres_enabled = 0;
+	else if (!strcmp(str, "on"))
+		hrtimer_hres_enabled = 1;
+	else
+		return 0;
+	return 1;
+}
+
+__setup("highres=", setup_hrtimer_hres);
+
+/*
+ * Is the high resolution mode active ?
+ */
+static inline int hrtimer_hres_active(void)
+{
+	return __get_cpu_var(hrtimer_bases).hres_active;
+}
+
+/*
+ * The time, when the last jiffy update happened. Protected by xtime_lock.
+ */
+static ktime_t last_jiffies_update;
+
+/*
+ * Reprogram the event source with checking both queues for the
+ * next event
+ * Called with interrupts disabled and base->lock held
+ */
+static void hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base)
+{
+	int i;
+	struct hrtimer_clock_base *base = cpu_base->clock_base;
+	ktime_t expires;
+
+	cpu_base->expires_next.tv64 = KTIME_MAX;
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) {
+		struct hrtimer *timer;
+
+		if (!base->first)
+			continue;
+		timer = rb_entry(base->first, struct hrtimer, node);
+		expires = ktime_sub(timer->expires, base->offset);
+		if (expires.tv64 < cpu_base->expires_next.tv64)
+			cpu_base->expires_next = expires;
+	}
+
+	if (cpu_base->expires_next.tv64 != KTIME_MAX)
+		clockevents_set_next_event(cpu_base->expires_next, 1);
+}
+
+/*
+ * Shared reprogramming for clock_realtime and clock_monotonic
+ *
+ * When a timer is enqueued and expires earlier than the already enqueued
+ * timers, we have to check, whether it expires earlier than the timer for
+ * which the clock event device was armed.
+ *
+ * Called with interrupts disabled and base->cpu_base.lock held
+ */
+static int hrtimer_reprogram(struct hrtimer *timer,
+			     struct hrtimer_clock_base *base)
+{
+	ktime_t *expires_next = &__get_cpu_var(hrtimer_bases).expires_next;
+	ktime_t expires = ktime_sub(timer->expires, base->offset);
+	int res;
+
+	/*
+	 * When the callback is running, we do not reprogram the clock event
+	 * device. The timer callback is either running on a different CPU or
+	 * the callback is executed in the hrtimer_interrupt context. The
+	 * reprogramming is handled either by the softirq, which called the
+	 * callback or at the end of the hrtimer_interrupt.
+	 */
+	if (timer->state & HRTIMER_STATE_CALLBACK)
+		return 0;
+
+	if (expires.tv64 >= expires_next->tv64)
+		return 0;
+
+	/*
+	 * Clockevents returns -ETIME, when the event was in the past.
+	 */
+	res = clockevents_set_next_event(expires, 0);
+	if (!IS_ERR_VALUE(res))
+		*expires_next = expires;
+	return res;
+}
+
+
+/*
+ * Retrigger next event is called after clock was set
+ *
+ * Called with interrupts disabled via on_each_cpu()
+ */
+static void retrigger_next_event(void *arg)
+{
+	struct hrtimer_cpu_base *base;
+	struct timespec realtime_offset;
+	unsigned long seq;
+
+	if (!hrtimer_hres_active())
+		return;
+
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		set_normalized_timespec(&realtime_offset,
+					-wall_to_monotonic.tv_sec,
+					-wall_to_monotonic.tv_nsec);
+	} while (read_seqretry(&xtime_lock, seq));
+
+	base = &__get_cpu_var(hrtimer_bases);
+
+	/* Adjust CLOCK_REALTIME offset */
+	spin_lock(&base->lock);
+	base->clock_base[CLOCK_REALTIME].offset =
+		timespec_to_ktime(realtime_offset);
+
+	hrtimer_force_reprogram(base);
+	spin_unlock(&base->lock);
+}
+
+/*
+ * Clock realtime was set
+ *
+ * Change the offset of the realtime clock vs. the monotonic
+ * clock.
+ *
+ * We might have to reprogram the high resolution timer interrupt. On
+ * SMP we call the architecture specific code to retrigger _all_ high
+ * resolution timer interrupts. On UP we just disable interrupts and
+ * call the high resolution interrupt code.
+ */
+void clock_was_set(void)
+{
+	/* Retrigger the CPU local events everywhere */
+	on_each_cpu(retrigger_next_event, NULL, 0, 1);
+}
+
+/**
+ * hrtimer_clock_notify - A clock source or a clock event has been installed
+ *
+ * Notify the per cpu softirqs to recheck the clock sources and events
+ */
+void hrtimer_clock_notify(void)
+{
+	int i;
+
+	if (hrtimer_hres_enabled) {
+		for_each_possible_cpu(i)
+			set_bit(0, &per_cpu(hrtimer_bases, i).check_clocks);
+	}
+}
+
+static const ktime_t nsec_per_hz = { .tv64 = NSEC_PER_SEC / HZ };
+
+/*
+ * We switched off the global tick source when switching to high resolution
+ * mode. Update jiffies64.
+ *
+ * Must be called with interrupts disabled !
+ *
+ * FIXME: We need a mechanism to assign the update to a CPU. In principle this
+ * is not hard, but when dynamic ticks come into play it starts to be. We don't
+ * want to wake up a completely idle CPU just to update jiffies, so we need
+ * something more intelligent than a mere "do this only on CPUx".
+ */
+static void update_jiffies64(ktime_t now)
+{
+	unsigned long seq;
+	ktime_t delta;
+
+	/* Preevaluate to avoid lock contention */
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		delta = ktime_sub(now, last_jiffies_update);
+	} while (read_seqretry(&xtime_lock, seq));
+
+	if (delta.tv64 < nsec_per_hz.tv64)
+		return;
+
+	/* Reevaluate with xtime_lock held */
+	write_seqlock(&xtime_lock);
+
+	delta = ktime_sub(now, last_jiffies_update);
+	if (delta.tv64 >= nsec_per_hz.tv64) {
+		unsigned long ticks = 1;
+
+		delta = ktime_sub(delta, nsec_per_hz);
+		last_jiffies_update = ktime_add(last_jiffies_update,
+						nsec_per_hz);
+
+		/* Slow path for long timeouts */
+		if (unlikely(delta.tv64 >= nsec_per_hz.tv64)) {
+			s64 incr = ktime_to_ns(nsec_per_hz);
+
+			ticks = ktime_divns(delta, incr);
+
+			last_jiffies_update = ktime_add_ns(last_jiffies_update,
+							   incr * ticks);
+			ticks++;
+		}
+		do_timer(ticks);
+	}
+	write_sequnlock(&xtime_lock);
+}
+
+/*
+ * We rearm the timer until we get disabled by the idle code
+ * Called with interrupts disabled.
+ */
+static enum hrtimer_restart hrtimer_sched_tick(struct hrtimer *timer)
+{
+	struct hrtimer_cpu_base *cpu_base =
+		container_of(timer, struct hrtimer_cpu_base, sched_timer);
+	struct pt_regs *regs = get_irq_regs();
+
+	/*
+	 * Do not call, when we are not in irq context and have
+	 * no valid regs pointer
+	 */
+	if (regs) {
+		/*
+		 * update_process_times() might take tasklist_lock, hence
+		 * drop the base lock. sched-tick hrtimers are per-CPU and
+		 * never accessible by userspace APIs, so this is safe to do.
+		 */
+		spin_unlock(&cpu_base->lock);
+		update_process_times(user_mode(regs));
+		profile_tick(CPU_PROFILING);
+		spin_lock(&cpu_base->lock);
+	}
+
+	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
+
+	return HRTIMER_RESTART;
+}
+
+/*
+ * A change in the clock source or clock events was detected.
+ * Check the clock source and the events, whether we can switch to
+ * high resolution mode or not.
+ *
+ * TODO: Handle the removal of clock sources / events
+ */
+static void hrtimer_check_clocks(void)
+{
+	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
+	unsigned long flags;
+	ktime_t now;
+
+	if (!test_and_clear_bit(0, &base->check_clocks))
+		return;
+
+	if (!timekeeping_is_continuous())
+		return;
+
+	if (!clockevents_next_event_available())
+		return;
+
+	local_irq_save(flags);
+
+	if (base->hres_active) {
+		local_irq_restore(flags);
+		return;
+	}
+
+	now = ktime_get();
+	if (clockevents_init_next_event()) {
+		local_irq_restore(flags);
+		return;
+	}
+	base->hres_active = 1;
+	base->clock_base[CLOCK_REALTIME].resolution = KTIME_HIGH_RES;
+	base->clock_base[CLOCK_MONOTONIC].resolution = KTIME_HIGH_RES;
+
+	/* Did we start the jiffies update yet ? */
+	if (last_jiffies_update.tv64 == 0) {
+		write_seqlock(&xtime_lock);
+		last_jiffies_update = now;
+		write_sequnlock(&xtime_lock);
+	}
+
+	/*
+	 * Emulate tick processing via per-CPU hrtimers:
+	 */
+	hrtimer_init(&base->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	base->sched_timer.function = hrtimer_sched_tick;
+	base->sched_timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
+	hrtimer_start(&base->sched_timer, nsec_per_hz, HRTIMER_MODE_REL);
+
+	/* "Retrigger" the interrupt to get things going */
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	printk(KERN_INFO "Switched to high resolution mode on CPU %d\n",
+	       smp_processor_id());
+}
+
+/*
+ * Check, whether the timer is on the callback pending list
+ */
+static inline int hrtimer_cb_pending(const struct hrtimer *timer)
+{
+	return timer->state == HRTIMER_STATE_PENDING;
+}
+
+/*
+ * Remove a timer from the callback pending list
+ */
+static inline void hrtimer_remove_cb_pending(struct hrtimer *timer)
+{
+	list_del_init(&timer->cb_entry);
+}
+
+/*
+ * Initialize the high resolution related parts of cpu_base
+ */
+static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base)
+{
+	base->expires_next.tv64 = KTIME_MAX;
+	set_bit(0, &base->check_clocks);
+	base->hres_active = 0;
+	INIT_LIST_HEAD(&base->cb_pending);
+}
+
+/*
+ * Initialize the high resolution related parts of a hrtimer
+ */
+static inline void hrtimer_init_timer_hres(struct hrtimer *timer)
+{
+	INIT_LIST_HEAD(&timer->cb_entry);
+}
+
+/*
+ * When High resolution timers are active, try to reprogram. Note, that in case
+ * the state has HRTIMER_STATE_CALLBACK set, no reprogramming and no expiry
+ * check happens. The timer gets enqueued into the rbtree. The reprogramming
+ * and expiry check is done in the hrtimer_interrupt or in the softirq.
+ */
+static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
+					    struct hrtimer_clock_base *base)
+{
+	if (base->cpu_base->hres_active && hrtimer_reprogram(timer, base)) {
+
+		/* Timer is expired, act upon the callback mode */
+		switch (timer->cb_mode) {
+		case HRTIMER_CB_IRQSAFE_NO_RESTART:
+			/*
+			 * We can call the callback from here. No restart
+			 * happens, so no danger of recursion
+			 */
+			BUG_ON(timer->function(timer) != HRTIMER_NORESTART);
+			return 1;
+		case HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:
+			/*
+			 * This is solely for the sched tick emulation with
+			 * dynamic tick support to ensure that we do not
+			 * restart the tick right on the edge and end up with
+			 * the tick timer in the softirq ! The calling site
+			 * takes care of this.
+			 */
+			return 1;
+		case HRTIMER_CB_IRQSAFE:
+		case HRTIMER_CB_SOFTIRQ:
+			/*
+			 * Move everything else into the softirq pending list !
+			 */
+			list_add_tail(&timer->cb_entry,
+				      &base->cpu_base->cb_pending);
+			timer->state = HRTIMER_STATE_PENDING;
+			raise_softirq(HRTIMER_SOFTIRQ);
+			return 1;
+		default:
+			BUG();
+		}
+	}
+	return 0;
+}
+
+/*
+ * Called after timekeeping resumed and updated jiffies64. Set the jiffies
+ * update time to now.
+ */
+static inline void hrtimer_resume_jiffy_update(void)
+{
+	unsigned long flags;
+	ktime_t now = ktime_get();
+
+	write_seqlock_irqsave(&xtime_lock, flags);
+	last_jiffies_update = now;
+	write_sequnlock_irqrestore(&xtime_lock, flags);
+}
+
+#else
+
+static inline int hrtimer_hres_active(void) { return 0; }
+static inline void hrtimer_check_clocks(void) { }
+static inline void hrtimer_force_reprogram(struct hrtimer_cpu_base *base) { }
+static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
+					    struct hrtimer_clock_base *base)
+{
+	return 0;
+}
+static inline int hrtimer_cb_pending(struct hrtimer *timer) { return 0; }
+static inline void hrtimer_remove_cb_pending(struct hrtimer *timer) { }
+static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base) { }
+static inline void hrtimer_init_timer_hres(struct hrtimer *timer) { }
+static inline void hrtimer_resume_jiffy_update(void) { }
+
+#endif /* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Timekeeping resumed notification
  */
 void hrtimer_notify_resume(void)
 {
+	hrtimer_resume_jiffy_update();
 	clockevents_resume_events();
 	clock_was_set();
 }
@@ -361,7 +789,7 @@ hrtimer_forward(struct hrtimer *timer, k
  * red black tree is O(log(n)). Must hold the base lock.
  */
 static void enqueue_hrtimer(struct hrtimer *timer,
-			    struct hrtimer_clock_base *base)
+			    struct hrtimer_clock_base *base, int reprogram)
 {
 	struct rb_node **link = &base->active.rb_node;
 	struct rb_node *parent = NULL;
@@ -387,6 +815,22 @@ static void enqueue_hrtimer(struct hrtim
 	 * Insert the timer to the rbtree and check whether it
 	 * replaces the first pending timer
 	 */
+	if (!base->first || timer->expires.tv64 <
+	    rb_entry(base->first, struct hrtimer, node)->expires.tv64) {
+		/*
+		 * Reprogram the clock event device. When the timer is already
+		 * expired hrtimer_enqueue_reprogram has either called the
+		 * callback or added it to the pending list and raised the
+		 * softirq.
+		 *
+		 * This is a NOP for !HIGHRES
+		 */
+		if (reprogram && hrtimer_enqueue_reprogram(timer, base))
+			return;
+
+		base->first = &timer->node;
+	}
+
 	rb_link_node(&timer->node, parent, link);
 	rb_insert_color(&timer->node, &base->active);
 	/*
@@ -394,28 +838,38 @@ static void enqueue_hrtimer(struct hrtim
 	 * state of a possibly running callback.
 	 */
 	timer->state |= HRTIMER_STATE_ENQUEUED;
-
-	if (!base->first || timer->expires.tv64 <
-	    rb_entry(base->first, struct hrtimer, node)->expires.tv64)
-		base->first = &timer->node;
 }
 
 /*
  * __remove_hrtimer - internal function to remove a timer
  *
  * Caller must hold the base lock.
+ *
+ * High resolution timer mode reprograms the clock event device when the
+ * timer is the one which expires next. The caller can disable this by setting
+ * reprogram to zero. This is useful, when the context does a reprogramming
+ * anyway (e.g. timer interrupt)
  */
 static void __remove_hrtimer(struct hrtimer *timer,
 			     struct hrtimer_clock_base *base,
-			     unsigned long newstate)
+			     unsigned long newstate, int reprogram)
 {
-	/*
-	 * Remove the timer from the rbtree and replace the
-	 * first entry pointer if necessary.
-	 */
-	if (base->first == &timer->node)
-		base->first = rb_next(&timer->node);
-	rb_erase(&timer->node, &base->active);
+	/* High res. callback list. NOP for !HIGHRES */
+	if (hrtimer_cb_pending(timer))
+		hrtimer_remove_cb_pending(timer);
+	else {
+		/*
+		 * Remove the timer from the rbtree and replace the
+		 * first entry pointer if necessary.
+		 */
+		if (base->first == &timer->node) {
+			base->first = rb_next(&timer->node);
+			/* Reprogram the clock event device, if enabled */
+			if (reprogram && hrtimer_hres_active())
+				hrtimer_force_reprogram(base->cpu_base);
+		}
+		rb_erase(&timer->node, &base->active);
+	}
 	timer->state = newstate;
 }
 
@@ -426,7 +880,19 @@ static inline int
 remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_is_queued(timer)) {
-		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE);
+		int reprogram;
+
+		/*
+		 * Remove the timer and force reprogramming when high
+		 * resolution mode is active and the timer is on the current
+		 * CPU. If we remove a timer on another CPU, reprogramming is
+		 * skipped. The interrupt event on this CPU is fired and
+		 * reprogramming happens in the interrupt handler. This is a
+		 * rare case and less expensive than an SMP call.
+		 */
+		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE,
+				 reprogram);
 		return 1;
 	}
 	return 0;
@@ -472,7 +938,7 @@ hrtimer_start(struct hrtimer *timer, kti
 	}
 	timer->expires = tim;
 
-	enqueue_hrtimer(timer, new_base);
+	enqueue_hrtimer(timer, new_base, base == new_base);
 
 	unlock_hrtimer_base(timer, &flags);
 
@@ -603,6 +1069,7 @@ void hrtimer_init(struct hrtimer *timer,
 		clock_id = CLOCK_MONOTONIC;
 
 	timer->base = &cpu_base->clock_base[clock_id];
+	hrtimer_init_timer_hres(timer);
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -625,6 +1092,138 @@ int hrtimer_get_res(const clockid_t whic
 }
 EXPORT_SYMBOL_GPL(hrtimer_get_res);
 
+#ifdef CONFIG_HIGH_RES_TIMERS
+
+/*
+ * High resolution timer interrupt
+ * Called with interrupts disabled
+ */
+void hrtimer_interrupt(struct pt_regs *regs)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	struct hrtimer_clock_base *base;
+	ktime_t expires_next, now;
+	int i, raise = 0;
+
+	BUG_ON(!cpu_base->hres_active);
+
+ retry:
+	now = ktime_get();
+
+	/* Check, if the jiffies need an update */
+	update_jiffies64(now);
+
+	expires_next.tv64 = KTIME_MAX;
+
+	base = cpu_base->clock_base;
+
+	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
+		ktime_t basenow;
+		struct rb_node *node;
+
+		spin_lock(&cpu_base->lock);
+
+		basenow = ktime_add(now, base->offset);
+
+		while ((node = base->first)) {
+			struct hrtimer *timer;
+
+			timer = rb_entry(node, struct hrtimer, node);
+
+			if (basenow.tv64 < timer->expires.tv64) {
+				ktime_t expires;
+
+				expires = ktime_sub(timer->expires,
+						    base->offset);
+				if (expires.tv64 < expires_next.tv64)
+					expires_next = expires;
+				break;
+			}
+
+			/* Move softirq callbacks to the pending list */
+			if (timer->cb_mode == HRTIMER_CB_SOFTIRQ) {
+				__remove_hrtimer(timer, base,
+						 HRTIMER_STATE_PENDING, 0);
+				list_add_tail(&timer->cb_entry,
+					      &base->cpu_base->cb_pending);
+				raise = 1;
+				continue;
+			}
+
+			__remove_hrtimer(timer, base,
+					 HRTIMER_STATE_CALLBACK, 0);
+
+			if (timer->function(timer) != HRTIMER_NORESTART) {
+				BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+				/*
+				 * Do not reprogram. We do this when we break
+				 * out of the loop !
+				 */
+				enqueue_hrtimer(timer, base, 0);
+			}
+			timer->state &= ~HRTIMER_STATE_CALLBACK;
+		}
+		spin_unlock(&cpu_base->lock);
+		base++;
+	}
+
+	cpu_base->expires_next = expires_next;
+
+	/* Reprogramming necessary ? */
+	if (expires_next.tv64 != KTIME_MAX) {
+		if (clockevents_set_next_event(expires_next, 0))
+			goto retry;
+	}
+
+	/* Raise softirq ? */
+	if (raise)
+		raise_softirq(HRTIMER_SOFTIRQ);
+}
+
+static void run_hrtimer_softirq(struct softirq_action *h)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+	spin_lock_irq(&cpu_base->lock);
+
+	while (!list_empty(&cpu_base->cb_pending)) {
+		enum hrtimer_restart (*fn)(struct hrtimer *);
+		struct hrtimer *timer;
+		int restart;
+
+		timer = list_entry(cpu_base->cb_pending.next,
+				   struct hrtimer, cb_entry);
+
+		fn = timer->function;
+		__remove_hrtimer(timer, timer->base, HRTIMER_STATE_CALLBACK, 0);
+		spin_unlock_irq(&cpu_base->lock);
+
+		restart = fn(timer);
+
+		spin_lock_irq(&cpu_base->lock);
+
+		timer->state &= ~HRTIMER_STATE_CALLBACK;
+		if (restart == HRTIMER_RESTART) {
+			BUG_ON(hrtimer_active(timer));
+			/*
+			 * Enqueue the timer, allow reprogramming of the event
+			 * device
+			 */
+			enqueue_hrtimer(timer, timer->base, 1);
+		} else if (hrtimer_active(timer)) {
+			/*
+			 * If the timer was rearmed on another CPU, reprogram
+			 * the event device.
+			 */
+			if (timer->base->first == &timer->node)
+				hrtimer_reprogram(timer, timer->base);
+		}
+	}
+	spin_unlock_irq(&cpu_base->lock);
+}
+
+#endif	/* CONFIG_HIGH_RES_TIMERS */
+
 /*
  * Expire the per base hrtimer-queue:
  */
@@ -652,7 +1251,7 @@ static inline void run_hrtimer_queue(str
 			break;
 
 		fn = timer->function;
-		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
+		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
 
 		restart = fn(timer);
@@ -662,7 +1261,7 @@ static inline void run_hrtimer_queue(str
 		timer->state &= ~HRTIMER_STATE_CALLBACK;
 		if (restart != HRTIMER_NORESTART) {
 			BUG_ON(hrtimer_active(timer));
-			enqueue_hrtimer(timer, base);
+			enqueue_hrtimer(timer, base, 0);
 		}
 	}
 	spin_unlock_irq(&cpu_base->lock);
@@ -670,12 +1269,21 @@ static inline void run_hrtimer_queue(str
 
 /*
  * Called from timer softirq every jiffy, expire hrtimers:
+ *
+ * For HRT it's the fallback code to run the softirq in the timer
+ * softirq context in case the hrtimer initialization failed or has
+ * not been done yet.
  */
 void hrtimer_run_queues(void)
 {
 	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
 	int i;
 
+	hrtimer_check_clocks();
+
+	if (hrtimer_hres_active())
+		return;
+
 	hrtimer_get_softirq_time(cpu_base);
 
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
@@ -702,6 +1310,9 @@ void hrtimer_init_sleeper(struct hrtimer
 {
 	sl->timer.function = hrtimer_wakeup;
 	sl->task = task;
+#ifdef CONFIG_HIGH_RES_TIMERS
+	sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_RESTART;
+#endif
 }
 
 static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
@@ -712,7 +1323,8 @@ static int __sched do_nanosleep(struct h
 		set_current_state(TASK_INTERRUPTIBLE);
 		hrtimer_start(&t->timer, t->timer.expires, mode);
 
-		schedule();
+		if (likely(t->task))
+			schedule();
 
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_MODE_ABS;
@@ -817,6 +1429,7 @@ static void __devinit init_hrtimers_cpu(
 	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++)
 		cpu_base->clock_base[i].cpu_base = cpu_base;
 
+	hrtimer_init_hres(cpu_base);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -830,9 +1443,12 @@ static void migrate_hrtimer_list(struct 
 	while ((node = rb_first(&old_base->active))) {
 		timer = rb_entry(node, struct hrtimer, node);
 		BUG_ON(timer->state & HRTIMER_STATE_CALLBACK);
-		__remove_hrtimer(timer, old_base, HRTIMER_STATE_INACTIVE);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_INACTIVE, 0);
 		timer->base = new_base;
-		enqueue_hrtimer(timer, new_base);
+		/*
+		 * Enqueue the timer. Allow reprogramming of the event device
+		 */
+		enqueue_hrtimer(timer, new_base, 1);
 	}
 }
 
@@ -895,5 +1511,8 @@ void __init hrtimers_init(void)
 	hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
 			  (void *)(long)smp_processor_id());
 	register_cpu_notifier(&hrtimers_nb);
+#ifdef CONFIG_HIGH_RES_TIMERS
+	open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
+#endif
 }
 
Index: linux-2.6.19-rc5-mm1/kernel/itimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/itimer.c	2006-11-09 21:06:03.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/itimer.c	2006-11-09 21:06:24.000000000 +0100
@@ -136,7 +136,7 @@ enum hrtimer_restart it_real_fn(struct h
 	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk);
 
 	if (sig->it_real_incr.tv64 != 0) {
-		hrtimer_forward(timer, timer->base->softirq_time,
+		hrtimer_forward(timer, hrtimer_cb_get_time(timer),
 				sig->it_real_incr);
 		return HRTIMER_RESTART;
 	}
Index: linux-2.6.19-rc5-mm1/kernel/posix-timers.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/posix-timers.c	2006-11-09 21:06:03.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/posix-timers.c	2006-11-09 21:06:24.000000000 +0100
@@ -356,7 +356,7 @@ static enum hrtimer_restart posix_timer_
 		if (timr->it.real.interval.tv64 != 0) {
 			timr->it_overrun +=
 				hrtimer_forward(timer,
-						timer->base->softirq_time,
+						hrtimer_cb_get_time(timer),
 						timr->it.real.interval);
 			ret = HRTIMER_RESTART;
 			++timr->it_requeue_pending;
Index: linux-2.6.19-rc5-mm1/kernel/time/Kconfig
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/kernel/time/Kconfig	2006-11-09 21:06:24.000000000 +0100
@@ -0,0 +1,11 @@
+#
+# Timer subsystem related configuration options
+#
+config HIGH_RES_TIMERS
+	bool "High Resolution Timer Support"
+	depends on GENERIC_TIME && GENERIC_CLOCKEVENTS
+	help
+	  This option enables high resolution timer support. If your
+	  hardware is not capable then this option only increases
+	  the size of the kernel image.
+
Index: linux-2.6.19-rc5-mm1/kernel/timer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/timer.c	2006-11-09 21:06:02.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/timer.c	2006-11-09 21:06:24.000000000 +0100
@@ -1048,6 +1048,7 @@ static void update_wall_time(void)
 	if (change_clocksource()) {
 		clock->error = 0;
 		clock->xtime_nsec = 0;
+		hrtimer_clock_notify();
 		clocksource_calculate_interval(clock, tick_nsec);
 	}
 }

--



* [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (11 preceding siblings ...)
  2006-11-09 23:38 ` [patch 12/19] high-res timers: core Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-10  1:10   ` john stultz
  2006-11-09 23:38 ` [patch 14/19] dynticks: core code Thomas Gleixner
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: gtod-mark-tsc-unusable-for-highres-timers.patch --]
[-- Type: text/plain, Size: 1697 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The TSC is too unstable and unreliable to be used with high resolution timers.
The automatic detection of TSC instability fails once we have switched to high
resolution mode, because the tick emulation would use the TSC as its reference.
This results in a circular dependency.  Mark it unusable for high res up front.
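
To make the effect of the lowered rating and the cleared is_continuous flag
concrete, here is a minimal user-space sketch. It is not the kernel code:
struct clocksource_model, select_clocksource() and can_switch_to_highres() are
hypothetical names made up for illustration, and the model assumes only that
the highest rated source gets selected and that high resolution mode is entered
only when the selected source is continuous.

```c
#include <stddef.h>

/* Hypothetical, simplified model of a clocksource; not the kernel's struct. */
struct clocksource_model {
	const char *name;
	int rating;		/* higher is preferred */
	int is_continuous;	/* usable as reference for high res mode */
};

/* Pick the entry with the highest rating; returns NULL for an empty array. */
static const struct clocksource_model *
select_clocksource(const struct clocksource_model *cs, size_t n)
{
	const struct clocksource_model *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!best || cs[i].rating > best->rating)
			best = &cs[i];
	}
	return best;
}

/* Assumption: the switch is allowed only for a continuous selected source. */
static int can_switch_to_highres(const struct clocksource_model *selected)
{
	return selected && selected->is_continuous;
}
```

With a TSC entry at rating 50 and is_continuous 0, as set by this patch, any
continuous source with a higher rating wins the selection and permits the
switch.  A system that only has the TSC stays in low resolution mode.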

[akpm@osdl.org: updated for i386-time-avoid-pit-smp-lockups.patch]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff -puN arch/i386/kernel/tsc.c~gtod-mark-tsc-unusable-for-highres-timers arch/i386/kernel/tsc.c
--- a/arch/i386/kernel/tsc.c~gtod-mark-tsc-unusable-for-highres-timers
+++ a/arch/i386/kernel/tsc.c
@@ -459,10 +459,23 @@ static int __init init_tsc_clocksource(v
 		current_tsc_khz = tsc_khz;
 		clocksource_tsc.mult = clocksource_khz2mult(current_tsc_khz,
 							clocksource_tsc.shift);
+#ifndef CONFIG_HIGH_RES_TIMERS
 		/* lower the rating if we already know its unstable: */
 		if (check_tsc_unstable())
 			clocksource_tsc.rating = 0;
-
+#else
+		/*
+		 * Mark TSC unsuitable for high resolution timers. TSC has so
+		 * many pitfalls: frequency changes, stop in idle ...  When we
+		 * switch to high resolution mode we can no longer detect a
+		 * firmware-caused frequency change, as the emulated tick uses
+		 * TSC as reference. This results in a circular dependency.
+		 * Switch to high resolution mode only if pm_timer or such
+		 * is available.
+		 */
+		clocksource_tsc.rating = 50;
+		clocksource_tsc.is_continuous = 0;
+#endif
 		init_timer(&verify_tsc_freq_timer);
 		verify_tsc_freq_timer.function = verify_tsc_freq;
 		verify_tsc_freq_timer.expires =
_

--



* [patch 14/19] dynticks: core code
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (12 preceding siblings ...)
  2006-11-09 23:38 ` [patch 13/19] GTOD: Mark TSC unusable for highres timers Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 15/19] dyntick: add nohz stats to /proc/stat Thomas Gleixner
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: dynticks-core.patch --]
[-- Type: text/plain, Size: 16067 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Dynamic ticks core code.

This is an extension to the per-cpu sched_tick timer of the high resolution
timer functionality.  The sched_tick timer is reprogrammed to a longer timeout
before going idle, when no timer events are due in the next tick.  The
periodic tick is resumed when the CPU leaves the idle state.  If a non-timer
IRQ hits the idle task, jiffies are updated from irq_enter() before calling the
interrupt code; otherwise the interrupt handler might end up dealing with a
stale jiffies value.
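
The arithmetic behind both steps can be sketched in plain user-space C.  This
is an illustration only, not the code in the patch: TICK_NS assumes HZ = 250,
and ticks_to_account() and idle_tick_expiry() are hypothetical helpers that
mirror what update_jiffies64() and hrtimer_stop_sched_tick() compute, without
any of the locking.

```c
#include <stdint.h>

/* Assumption for illustration: HZ = 250, so one tick is 4,000,000 ns. */
#define TICK_NS (1000000000LL / 250)

/*
 * Return the number of whole ticks that elapsed between the last jiffies
 * update and now. *last_update_ns is advanced by exactly that many tick
 * periods, so the sub-tick remainder is carried over to the next call.
 */
static uint64_t ticks_to_account(int64_t now_ns, int64_t *last_update_ns)
{
	int64_t delta = now_ns - *last_update_ns;
	uint64_t ticks;

	if (delta < TICK_NS)
		return 0;

	ticks = (uint64_t)(delta / TICK_NS);
	*last_update_ns += (int64_t)ticks * TICK_NS;
	return ticks;
}

/*
 * Absolute expiry for the stopped tick: the time of the last jiffies update
 * plus the distance to the next timer wheel event, in whole tick periods.
 */
static int64_t idle_tick_expiry(int64_t last_update_ns,
				unsigned long delta_jiffies)
{
	return last_update_ns + TICK_NS * (int64_t)delta_jiffies;
}
```

If the next timer wheel event is 25 jiffies away, the tick is armed 25 tick
periods after the last jiffies update.  If an interrupt wakes the CPU 10.5 tick
periods into that sleep, ticks_to_account() reports 10 ticks to hand to
do_timer() and keeps the remaining half tick for the next update.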

The per-cpu idle statistics information can be used to optimize power
management decisions.

More detailed information is available in Documentation/hrtimer/highres.txt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/include/linux/hardirq.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hardirq.h	2006-11-09 20:14:41.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hardirq.h	2006-11-09 20:16:11.000000000 +0100
@@ -106,6 +106,16 @@ static inline void account_system_vtime(
  * always balanced, so the interrupted value of ->hardirq_context
  * will always be restored.
  */
+#define __irq_enter()					\
+	do {						\
+		account_system_vtime(current);		\
+		add_preempt_count(HARDIRQ_OFFSET);	\
+		trace_hardirq_enter();			\
+	} while (0)
+
+/*
+ * Enter irq context (on NO_HZ, update jiffies):
+ */
 extern void irq_enter(void);
 
 /*
@@ -123,7 +133,7 @@ extern void irq_enter(void);
  */
 extern void irq_exit(void);
 
-#define nmi_enter()		do { lockdep_off(); irq_enter(); } while (0)
+#define nmi_enter()		do { lockdep_off(); __irq_enter(); } while (0)
 #define nmi_exit()		do { __irq_exit(); lockdep_on(); } while (0)
 
 #endif /* LINUX_HARDIRQ_H */
Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hrtimer.h	2006-11-09 20:16:06.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hrtimer.h	2006-11-09 20:16:11.000000000 +0100
@@ -22,6 +22,7 @@
 #include <linux/list.h>
 #include <linux/wait.h>
 
+struct seq_file;
 struct hrtimer_clock_base;
 struct hrtimer_cpu_base;
 
@@ -178,6 +179,17 @@ struct hrtimer_clock_base {
  *			in the softirq.
  * @sched_timer:	hrtimer to schedule the periodic tick in high
  *			resolution mode
+ * @nr_events:		Total number of timer interrupt events
+ * @idle_tick:		Store the last idle tick expiry time when the tick
+ *			timer is modified for idle sleeps. This is necessary
+ *			to resume the tick timer operation in the timeline
+ *			when the CPU returns from idle
+ * @tick_stopped:	Indicator that the idle tick has been stopped
+ * @idle_jiffies:	jiffies at the entry to idle for idle time accounting
+ * @idle_calls:		Total number of idle calls
+ * @idle_sleeps:	Number of idle calls, where the sched tick was stopped
+ * @idle_entrytime:	Time when the idle call was entered
+ * @idle_sleeptime:	Sum of the time slept in idle with sched tick stopped
  */
 struct hrtimer_cpu_base {
 	spinlock_t			lock;
@@ -189,6 +201,16 @@ struct hrtimer_cpu_base {
 	unsigned long			check_clocks;
 	struct list_head		cb_pending;
 	struct hrtimer			sched_timer;
+	unsigned long			nr_events;
+#endif
+#ifdef CONFIG_NO_HZ
+	ktime_t				idle_tick;
+	int				tick_stopped;
+	unsigned long			idle_jiffies;
+	unsigned long			idle_calls;
+	unsigned long			idle_sleeps;
+	ktime_t				idle_entrytime;
+	ktime_t				idle_sleeptime;
 #endif
 };
 
@@ -295,6 +317,18 @@ extern void hrtimer_run_queues(void);
 /* Resume notification */
 void hrtimer_notify_resume(void);
 
+#ifdef CONFIG_NO_HZ
+extern void hrtimer_stop_sched_tick(void);
+extern void hrtimer_restart_sched_tick(void);
+extern void hrtimer_update_jiffies(void);
+extern void show_no_hz_stats(struct seq_file *p);
+#else
+static inline void hrtimer_stop_sched_tick(void) { }
+static inline void hrtimer_restart_sched_tick(void) { }
+static inline void hrtimer_update_jiffies(void) { }
+static inline void show_no_hz_stats(struct seq_file *p) { }
+#endif
+
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
Index: linux-2.6.19-rc5-mm1/kernel/hrtimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/hrtimer.c	2006-11-09 20:16:06.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/hrtimer.c	2006-11-09 20:16:11.000000000 +0100
@@ -44,6 +44,7 @@
 #include <linux/profile.h>
 #include <linux/seq_file.h>
 #include <linux/err.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/uaccess.h>
 
@@ -482,6 +483,7 @@ static void update_jiffies64(ktime_t now
 {
 	unsigned long seq;
 	ktime_t delta;
+	unsigned long ticks = 0;
 
 	/* Preevaluate to avoid lock contention */
 	do {
@@ -497,7 +499,6 @@ static void update_jiffies64(ktime_t now
 
 	delta = ktime_sub(now, last_jiffies_update);
 	if (delta.tv64 >= nsec_per_hz.tv64) {
-		unsigned long ticks = 1;
 
 		delta = ktime_sub(delta, nsec_per_hz);
 		last_jiffies_update = ktime_add(last_jiffies_update,
@@ -511,13 +512,238 @@ static void update_jiffies64(ktime_t now
 
 			last_jiffies_update = ktime_add_ns(last_jiffies_update,
 							   incr * ticks);
-			ticks++;
 		}
+		ticks++;
 		do_timer(ticks);
 	}
 	write_sequnlock(&xtime_lock);
 }
 
+#ifdef CONFIG_NO_HZ
+/**
+ * hrtimer_update_jiffies - update jiffies when idle was interrupted
+ *
+ * Called from interrupt entry when the CPU was idle
+ *
+ * In case the sched_tick was stopped on this CPU, we have to check if jiffies
+ * must be updated. Otherwise an interrupt handler could use a stale jiffy
+ * value.
+ */
+void hrtimer_update_jiffies(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	unsigned long flags;
+	ktime_t now;
+
+	if (!cpu_base->tick_stopped || !cpu_base->hres_active)
+		return;
+
+	now = ktime_get();
+
+	local_irq_save(flags);
+	update_jiffies64(now);
+	local_irq_restore(flags);
+}
+
+/**
+ * hrtimer_stop_sched_tick - stop the idle tick from the idle task
+ *
+ * When the next event is more than a tick into the future, stop the idle tick.
+ * Called either from the idle loop or from irq_exit() when an idle period was
+ * just interrupted by an interrupt which did not cause a reschedule.
+ */
+void hrtimer_stop_sched_tick(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	unsigned long seq, last_jiffies, next_jiffies;
+	ktime_t last_update, expires, now;
+	unsigned long delta_jiffies;
+	unsigned long flags;
+
+	if (unlikely(!cpu_base->hres_active))
+		return;
+
+	local_irq_save(flags);
+
+	now = ktime_get();
+	/*
+	 * When called from irq_exit we need to account the idle sleep time
+	 * correctly.
+	 */
+	if (cpu_base->tick_stopped) {
+		ktime_t delta = ktime_sub(now, cpu_base->idle_entrytime);
+
+		cpu_base->idle_sleeptime = ktime_add(cpu_base->idle_sleeptime,
+						     delta);
+	}
+
+	cpu_base->idle_entrytime = now;
+	cpu_base->idle_calls++;
+
+	/* Read jiffies and the time when jiffies were updated last */
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		last_update = last_jiffies_update;
+		last_jiffies = jiffies;
+	} while (read_seqretry(&xtime_lock, seq));
+
+	/* Get the next timer wheel timer */
+	next_jiffies = get_next_timer_interrupt(last_jiffies);
+	delta_jiffies = next_jiffies - last_jiffies;
+
+	if ((long)delta_jiffies >= 1) {
+		/*
+		 * hrtimer_stop_sched_tick can be called several times before
+		 * the hrtimer_restart_sched_tick is called. This happens when
+		 * interrupts arrive which do not cause a reschedule. In the
+		 * first call we save the current tick time, so we can restart
+		 * the scheduler tick in hrtimer_restart_sched_tick.
+		 */
+		if (!cpu_base->tick_stopped) {
+			cpu_base->idle_tick = cpu_base->sched_timer.expires;
+			cpu_base->tick_stopped = 1;
+			cpu_base->idle_jiffies = last_jiffies;
+		}
+		/* calculate the expiry time for the next timer wheel timer */
+		expires = ktime_add_ns(last_update,
+				       nsec_per_hz.tv64 * delta_jiffies);
+		hrtimer_start(&cpu_base->sched_timer, expires,
+			      HRTIMER_MODE_ABS);
+		cpu_base->idle_sleeps++;
+	} else {
+		/* Raise the softirq if the timer wheel is behind jiffies */
+		if ((long) delta_jiffies < 0)
+			raise_softirq_irqoff(TIMER_SOFTIRQ);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * hrtimer_restart_sched_tick - restart the idle tick from the idle task
+ *
+ * Restart the idle tick when the CPU is woken up from idle
+ */
+void hrtimer_restart_sched_tick(void)
+{
+	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+	unsigned long ticks;
+	ktime_t now, delta;
+
+	if (!cpu_base->hres_active || !cpu_base->tick_stopped)
+		return;
+
+	/* Update jiffies first */
+	now = ktime_get();
+
+	local_irq_disable();
+	update_jiffies64(now);
+
+	/* Account the idle time */
+	delta = ktime_sub(now, cpu_base->idle_entrytime);
+	cpu_base->idle_sleeptime = ktime_add(cpu_base->idle_sleeptime, delta);
+
+	/*
+	 * We stopped the tick in idle. update_process_times() would miss the
+	 * time we slept, as it only accounts a single tick. Enforce that this
+	 * is accounted to idle!
+	 */
+	ticks = jiffies - cpu_base->idle_jiffies;
+	/*
+	 * We might be one off. Do not randomly account a huge number of ticks!
+	 */
+	if (ticks && ticks < LONG_MAX) {
+		add_preempt_count(HARDIRQ_OFFSET);
+		account_system_time(current, HARDIRQ_OFFSET,
+				    jiffies_to_cputime(ticks));
+		sub_preempt_count(HARDIRQ_OFFSET);
+	}
+
+	/*
+	 * Cancel the scheduled timer and restore the tick
+	 */
+	cpu_base->tick_stopped  = 0;
+	hrtimer_cancel(&cpu_base->sched_timer);
+	cpu_base->sched_timer.expires = cpu_base->idle_tick;
+
+	while (1) {
+		/* Forward the time to expire in the future */
+		hrtimer_forward(&cpu_base->sched_timer, now, nsec_per_hz);
+		hrtimer_start(&cpu_base->sched_timer,
+			      cpu_base->sched_timer.expires, HRTIMER_MODE_ABS);
+
+		/* Check, if the timer was already in the past */
+		if (hrtimer_active(&cpu_base->sched_timer))
+			break;
+		/* Update jiffies and reread time */
+		update_jiffies64(now);
+		now = ktime_get();
+	}
+	local_irq_enable();
+}
+
+/**
+ * show_no_hz_stats - print out the no hz statistics
+ *
+ * The no_hz statistics are appended at the end of /proc/stat
+ *
+ * I: total number of idle calls
+ * S: number of idle calls which stopped the sched tick
+ * T: Summed up sleep time in idle with sched tick stopped (unit is seconds)
+ * A: Average sleep time: T/S (unit is seconds)
+ * E: Total number of timer interrupt events
+ */
+void show_no_hz_stats(struct seq_file *p)
+{
+	unsigned long calls = 0, sleeps = 0, events = 0;
+	struct timeval tsum, tavg;
+	ktime_t totaltime = { .tv64 = 0 };
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu);
+
+		calls += base->idle_calls;
+		sleeps += base->idle_sleeps;
+		totaltime = ktime_add(totaltime, base->idle_sleeptime);
+		events += base->nr_events;
+
+#ifdef CONFIG_SMP
+		tsum = ktime_to_timeval(base->idle_sleeptime);
+		if (base->idle_sleeps) {
+			uint64_t nsec = ktime_to_ns(base->idle_sleeptime);
+
+			do_div(nsec, base->idle_sleeps);
+			tavg = ns_to_timeval(nsec);
+		} else
+			tavg.tv_sec = tavg.tv_usec = 0;
+
+		seq_printf(p, "nohz cpu%d I:%lu S:%lu T:%d.%06d A:%d.%06d E: %lu\n",
+			   cpu, base->idle_calls, base->idle_sleeps,
+			   (int) tsum.tv_sec, (int) tsum.tv_usec,
+			   (int) tavg.tv_sec, (int) tavg.tv_usec,
+			   base->nr_events);
+#endif
+	}
+
+	tsum = ktime_to_timeval(totaltime);
+	if (sleeps) {
+		uint64_t nsec = ktime_to_ns(totaltime);
+
+		do_div(nsec, sleeps);
+		tavg = ns_to_timeval(nsec);
+	} else
+		tavg.tv_sec = tavg.tv_usec = 0;
+
+	seq_printf(p, "nohz total I:%lu S:%lu T:%d.%06d A:%d.%06d E: %lu\n",
+		   calls, sleeps,
+		   (int) tsum.tv_sec, (int) tsum.tv_usec,
+		   (int) tavg.tv_sec, (int) tavg.tv_usec,
+		   events);
+}
+
+#endif
+
 /*
  * We rearm the timer until we get disabled by the idle code
  * Called with interrupts disabled.
@@ -527,12 +753,30 @@ static enum hrtimer_restart hrtimer_sche
 	struct hrtimer_cpu_base *cpu_base =
 		container_of(timer, struct hrtimer_cpu_base, sched_timer);
  	struct pt_regs *regs = get_irq_regs();
+	ktime_t now = ktime_get();
+
+	/* Check, if the jiffies need an update */
+	update_jiffies64(now);
 
 	/*
 	 * Do not call, when we are not in irq context and have
 	 * no valid regs pointer
 	 */
 	if (regs) {
+#ifdef CONFIG_NO_HZ
+		/*
+		 * When we are idle and the tick is stopped, we have to touch
+		 * the watchdog as we might not schedule for a really long
+		 * time. This happens on completely idle SMP systems while
+		 * waiting on the login prompt. We also increment the "start of
+		 * idle" jiffy stamp so the idle accounting adjustment we do
+		 * when we go busy again does not account too many ticks.
+		 */
+		if (cpu_base->tick_stopped) {
+			touch_softlockup_watchdog();
+			cpu_base->idle_jiffies++;
+		}
+#endif
 		/*
 		 * update_process_times() might take tasklist_lock, hence
 		 * drop the base lock. sched-tick hrtimers are per-CPU and
@@ -544,7 +788,13 @@ static enum hrtimer_restart hrtimer_sche
 		spin_lock(&cpu_base->lock);
 	}
 
-	hrtimer_forward(timer, hrtimer_cb_get_time(timer), nsec_per_hz);
+#ifdef CONFIG_NO_HZ
+	/* Do not restart, when we are in the idle loop */
+	if (cpu_base->tick_stopped)
+		return HRTIMER_NORESTART;
+#endif
+
+	hrtimer_forward(timer, now, nsec_per_hz);
 
 	return HRTIMER_RESTART;
 }
@@ -1106,13 +1356,11 @@ void hrtimer_interrupt(struct pt_regs *r
 	int i, raise = 0;
 
 	BUG_ON(!cpu_base->hres_active);
+	cpu_base->nr_events++;
 
  retry:
 	now = ktime_get();
 
-	/* Check, if the jiffies need an update */
-	update_jiffies64(now);
-
 	expires_next.tv64 = KTIME_MAX;
 
 	base = cpu_base->clock_base;
Index: linux-2.6.19-rc5-mm1/kernel/softirq.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/softirq.c	2006-11-09 20:14:41.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/softirq.c	2006-11-09 20:16:11.000000000 +0100
@@ -278,9 +278,11 @@ EXPORT_SYMBOL(do_softirq);
  */
 void irq_enter(void)
 {
-	account_system_vtime(current);
-	add_preempt_count(HARDIRQ_OFFSET);
-	trace_hardirq_enter();
+	__irq_enter();
+#ifdef CONFIG_NO_HZ
+	if (idle_cpu(smp_processor_id()))
+		hrtimer_update_jiffies();
+#endif
 }
 
 #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
@@ -299,6 +301,12 @@ void irq_exit(void)
 	sub_preempt_count(IRQ_EXIT_OFFSET);
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
+
+#ifdef CONFIG_NO_HZ
+	/* Make sure that timer wheel updates are propagated */
+	if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched())
+		hrtimer_stop_sched_tick();
+#endif
 	preempt_enable_no_resched();
 }
 
Index: linux-2.6.19-rc5-mm1/kernel/time/Kconfig
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/time/Kconfig	2006-11-09 20:16:06.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/time/Kconfig	2006-11-09 20:16:33.000000000 +0100
@@ -9,3 +9,10 @@ config HIGH_RES_TIMERS
 	  hardware is not capable then this option only increases
 	  the size of the kernel image.
 
+config NO_HZ
+	bool "Tickless System (Dynamic Ticks)"
+	depends on HIGH_RES_TIMERS
+	help
+	  This option enables a tickless system: timer interrupts will
+	  only trigger on an as-needed basis both when the system is
+	  busy and when the system is idle.
Index: linux-2.6.19-rc5-mm1/kernel/timer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/timer.c	2006-11-09 20:16:06.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/timer.c	2006-11-09 20:16:11.000000000 +0100
@@ -462,7 +462,7 @@ static inline void __run_timers(tvec_bas
 	spin_unlock_irq(&base->lock);
 }
 
-#ifdef CONFIG_NO_IDLE_HZ
+#if defined(CONFIG_NO_IDLE_HZ) || defined(CONFIG_NO_HZ)
 /*
  * Find out when the next timer event is due to happen. This
  * is used on S/390 to stop all activity when a cpus is idle.

--



* [patch 15/19] dyntick: add nohz stats to /proc/stat
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (13 preceding siblings ...)
  2006-11-09 23:38 ` [patch 14/19] dynticks: core code Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 16/19] dynticks: i386 arch code Thomas Gleixner
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: dynticks-add-nohz-stats-to-proc-stat.patch --]
[-- Type: text/plain, Size: 493 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add nohz stats to /proc/stat.
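
For readers who want to consume the new line from userspace, here is a minimal sketch of a parser. It assumes the "nohz total" format string used by show_no_hz_stats() in the core patch; the names nohz_totals and parse_nohz_total are made up for this illustration and are not part of the patch.

```c
#include <stdio.h>

/* Hypothetical container for the fields of the "nohz total" line. */
struct nohz_totals {
	unsigned long idle_calls;	/* I: total number of idle calls */
	unsigned long idle_sleeps;	/* S: idle calls which stopped the tick */
	long sleep_sec, sleep_usec;	/* T: summed up sleep time */
	long avg_sec, avg_usec;		/* A: average sleep time, T/S */
	unsigned long events;		/* E: timer interrupt events */
};

/*
 * Parse one line of /proc/stat. Returns 0 when the line is a complete
 * "nohz total" line, -1 otherwise. The fractional parts are read as
 * plain decimal numbers, so "0.038888" yields avg_usec == 38888.
 */
static int parse_nohz_total(const char *line, struct nohz_totals *t)
{
	int n = sscanf(line,
		       "nohz total I:%lu S:%lu T:%ld.%ld A:%ld.%ld E: %lu",
		       &t->idle_calls, &t->idle_sleeps,
		       &t->sleep_sec, &t->sleep_usec,
		       &t->avg_sec, &t->avg_usec, &t->events);

	return n == 7 ? 0 : -1;
}
```

A caller would read /proc/stat line by line with fgets() and hand each line to parse_nohz_total() until it returns 0. The per-cpu "nohz cpuN" lines, which are only printed on SMP, would need a second format string.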

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff -puN fs/proc/proc_misc.c~dynticks-add-nohz-stats-to-proc-stat fs/proc/proc_misc.c
--- a/fs/proc/proc_misc.c~dynticks-add-nohz-stats-to-proc-stat
+++ a/fs/proc/proc_misc.c
@@ -527,6 +527,8 @@ static int show_stat(struct seq_file *p,
 		nr_running(),
 		nr_iowait());
 
+	show_no_hz_stats(p);
+
 	return 0;
 }
 
_

--



* [patch 16/19] dynticks: i386 arch code
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (14 preceding siblings ...)
  2006-11-09 23:38 ` [patch 15/19] dyntick: add nohz stats to /proc/stat Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 17/19] dynticks: Fix nmi watchdog Thomas Gleixner
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: dynticks-i386-arch-code.patch --]
[-- Type: text/plain, Size: 771 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Prepare i386 for dyntick: idle handler callbacks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff -puN arch/i386/kernel/process.c~dynticks-i386-arch-code arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c~dynticks-i386-arch-code
+++ a/arch/i386/kernel/process.c
@@ -168,6 +168,7 @@ void cpu_idle(void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
+		hrtimer_stop_sched_tick();
 		while (!need_resched()) {
 			void (*idle)(void);
 
@@ -186,6 +187,7 @@ void cpu_idle(void)
 			__get_cpu_var(irq_stat).idle_timestamp = jiffies;
 			idle();
 		}
+		hrtimer_restart_sched_tick();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
_

--



* [patch 17/19] dynticks: Fix nmi watchdog
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (15 preceding siblings ...)
  2006-11-09 23:38 ` [patch 16/19] dynticks: i386 arch code Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 18/19] high-res timers, dynticks: enable i386 support Thomas Gleixner
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: dynticks-i386-nmi-fix.patch --]
[-- Type: text/plain, Size: 1527 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The NMI watchdog implementation assumes that the local APIC timer
interrupt is happening. This assumption is no longer true when
high resolution timers and dynamic ticks come into play, as they
may switch off the local APIC timer completely. Take the PIT/HPET
interrupts into account too, to avoid false positives.
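
The check can be modelled in plain C as follows. This is only a sketch of the idea, not the code in nmi.c: the struct cpu_irq_counts and the helper timers_made_progress() are hypothetical names, and the real code additionally honours the "touched" flag and an alert counter.

```c
#include <stdbool.h>

/* Hypothetical per-CPU snapshot of the two interrupt counters. */
struct cpu_irq_counts {
	unsigned int apic_timer_irqs;	/* local APIC timer interrupts */
	unsigned int irq0_count;	/* PIT/HPET interrupts (IRQ 0) */
};

/*
 * Returns true when at least one of the timer sources fired since the
 * previous call. The sum of both counters is compared against the last
 * seen sum, so it does not matter which of the two is active.
 */
static bool timers_made_progress(const struct cpu_irq_counts *c,
				 unsigned int *last_sum)
{
	unsigned int sum = c->apic_timer_irqs + c->irq0_count;
	bool progressed = (sum != *last_sum);

	*last_sum = sum;
	return progressed;
}
```

With only the APIC counter in the sum, a CPU whose local APIC timer was switched off by highres/dyntick would look stuck although IRQ 0 keeps ticking; adding the second counter removes that false positive.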

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/arch/i386/kernel/nmi.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/kernel/nmi.c	2006-11-09 17:47:58.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/kernel/nmi.c	2006-11-09 20:52:29.000000000 +0100
@@ -23,6 +23,7 @@
 #include <linux/dmi.h>
 #include <linux/kprobes.h>
 #include <linux/cpumask.h>
+#include <linux/kernel_stat.h>
 
 #include <asm/smp.h>
 #include <asm/nmi.h>
@@ -920,9 +921,13 @@ __kprobes int nmi_watchdog_tick(struct p
 		cpu_clear(cpu, backtrace_mask);
 	}
 
-	sum = per_cpu(irq_stat, cpu).apic_timer_irqs;
+	/*
+	 * Take the local apic timer and PIT/HPET into account. We don't
+	 * know which one is active, when we have highres/dyntick on
+	 */
+	sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_irqs(0);
 
-	/* if the apic timer isn't firing, this cpu isn't doing much */
+	/* if none of the timers is firing, this cpu isn't doing much */
 	if (!touched && last_irq_sums[cpu] == sum) {
 		/*
 		 * Ayiee, looks like this CPU is stuck ...

--



* [patch 18/19] high-res timers, dynticks: enable i386 support
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (16 preceding siblings ...)
  2006-11-09 23:38 ` [patch 17/19] dynticks: Fix nmi watchdog Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-09 23:38 ` [patch 19/19] debugging feature: timer stats Thomas Gleixner
  2006-11-23 22:24 ` [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Roman Zippel
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: high-res-timers-dynticks-enable-i386-support.patch --]
[-- Type: text/plain, Size: 639 bytes --]

From: Ingo Molnar <mingo@elte.hu>

Enable high-res timers and dyntick on i386.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Index: linux-2.6.19-rc5-mm1/arch/i386/Kconfig
===================================================================
--- linux-2.6.19-rc5-mm1.orig/arch/i386/Kconfig	2006-11-09 20:15:54.000000000 +0100
+++ linux-2.6.19-rc5-mm1/arch/i386/Kconfig	2006-11-09 20:16:52.000000000 +0100
@@ -82,6 +82,8 @@ source "init/Kconfig"
 
 menu "Processor type and features"
 
+source "kernel/time/Kconfig"
+
 config SMP
 	bool "Symmetric multi-processing support"
 	---help---

--



* [patch 19/19] debugging feature: timer stats
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (17 preceding siblings ...)
  2006-11-09 23:38 ` [patch 18/19] high-res timers, dynticks: enable i386 support Thomas Gleixner
@ 2006-11-09 23:38 ` Thomas Gleixner
  2006-11-23 22:24 ` [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Roman Zippel
  19 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-09 23:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Ingo Molnar, Len Brown, John Stultz, Arjan van de Ven,
	Andi Kleen, Roman Zippel

[-- Attachment #1: debugging-feature-timer-stats.patch --]
[-- Type: text/plain, Size: 24012 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add /proc/timer_stats support: a debugging feature to profile timer expiration.
The starting site, the process/PID and the expiration function are captured.
This allows the quick identification of timer event sources in a system.

Sample output:

 # echo 1 > /proc/timer_stats
 # cat /proc/timer_stats
 Timerstats sample period: 3.888770 s
   12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   15,     1 swapper          hcd_submit_urb (rh_timer_func)
    4,   959 kedac            schedule_timeout (process_timeout)
    1,     0 swapper          page_writeback_init (wb_timer_fn)
   28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
    3,  3100 bash             schedule_timeout (process_timeout)
    1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
    1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
    1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
    1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
    1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
 90 total events, 30.0 events/sec
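
How the event counts in the first column come about can be sketched in userspace C. This mirrors the lookup in timer_stats_update_stats() only roughly: the table is shrunk to 8 slots for illustration (the patch uses 1024), locking and the active/inactive state are left out, and the names entry, table and account_event are made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES	8	/* illustration only; the patch uses 1024 */
#define COMM_LEN	16

struct entry {
	const void	*timer;		/* identity of the timer object */
	const void	*start_func;	/* site which started the timer */
	const void	*expire_func;	/* callback run on expiry */
	unsigned long	counter;	/* number of expiry events */
	int		pid;
	char		comm[COMM_LEN + 1];
};

static struct entry table[MAX_ENTRIES];

/*
 * Account one expiry event. A matching entry gets its counter bumped,
 * otherwise the first free slot is claimed. Returns the slot index or
 * -1 when the table is full and the event is dropped.
 */
static int account_event(const void *timer, int pid, const void *startf,
			 const void *timerf, const char *comm)
{
	for (int i = 0; i < MAX_ENTRIES; i++) {
		struct entry *e = &table[i];

		if (e->timer == timer && e->start_func == startf &&
		    e->expire_func == timerf && e->pid == pid) {
			e->counter++;
			return i;
		}
		if (!e->timer) {
			e->timer = timer;
			e->start_func = startf;
			e->expire_func = timerf;
			e->counter = 1;
			e->pid = pid;
			strncpy(e->comm, comm, COMM_LEN);
			e->comm[COMM_LEN] = '\0';
			return i;
		}
	}
	return -1;
}
```

Because the timer address is part of the key, two different timers started from the same site by the same task end up in separate rows, which would explain the repeated hrtimer_stop_sched_tick and queue_delayed_work_on lines in the sample above.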

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux-2.6.19-rc5-mm1/Documentation/hrtimer/timer_stats.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/Documentation/hrtimer/timer_stats.txt	2006-11-09 20:17:05.000000000 +0100
@@ -0,0 +1,68 @@
+timer_stats - timer usage statistics
+------------------------------------
+
+timer_stats is a debugging facility to make the timer (ab)usage in a Linux
+system visible to kernel and userspace developers. It is not intended for
+production usage as it adds significant overhead to the (hr)timer code and the
+(hr)timer data structures.
+
+timer_stats should be used by kernel and userspace developers to verify that
+their code does not make undue use of timers. This helps to avoid unnecessary
+wakeups and thereby to reduce power consumption.
+
+It can be enabled by CONFIG_TIMER_STATS in the "Kernel hacking" configuration
+section.
+
+timer_stats collects information about the timer events which are fired in a
+Linux system over a sample period:
+
+- the pid of the task (process) which initialized the timer
+- the name of the process which initialized the timer
+- the function where the timer was initialized
+- the callback function which is associated with the timer
+- the number of events (callbacks)
+
+timer_stats adds an entry to /proc: /proc/timer_stats
+
+This entry is used to control the statistics functionality and to read out the
+sampled information.
+
+The timer_stats functionality is inactive on bootup.
+
+To activate a sample period issue:
+# echo 1 >/proc/timer_stats
+
+To stop a sample period issue:
+# echo 0 >/proc/timer_stats
+
+The statistics can be retrieved by:
+# cat /proc/timer_stats
+
+The readout of /proc/timer_stats automatically disables sampling. The sampled
+information is kept until a new sample period is started. This allows multiple
+readouts.
+
+Sample output of /proc/timer_stats:
+
+Timerstats sample period: 3.888770 s
+  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
+  15,     1 swapper          hcd_submit_urb (rh_timer_func)
+   4,   959 kedac            schedule_timeout (process_timeout)
+   1,     0 swapper          page_writeback_init (wb_timer_fn)
+  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
+  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
+   3,  3100 bash             schedule_timeout (process_timeout)
+   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
+   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
+   1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
+   1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
+   1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
+90 total events, 30.0 events/sec
+
+The first column is the number of events, the second column the pid, the third
+column is the name of the process. The fourth column shows the function which
+initialized the timer and in parentheses the callback function which was
+executed on expiry.
+
+    Thomas, Ingo
+
Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/hrtimer.h	2006-11-09 20:16:11.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/hrtimer.h	2006-11-09 20:17:05.000000000 +0100
@@ -101,8 +101,14 @@ enum hrtimer_cb_mode {
  * @cb_mode:	high resolution timer feature to select the callback execution
  *		 mode
  * @cb_entry:	list head to enqueue an expired timer into the callback list
+ * @start_site:	timer statistics field to store the site where the timer
+ *		was started
+ * @start_comm: timer statistics field to store the name of the process which
+ *		started the timer
+ * @start_pid: timer statistics field to store the pid of the task which
+ *		started the timer
  *
- * The hrtimer structure must be initialized by init_hrtimer_#CLOCKTYPE()
+ * The hrtimer structure must be initialized by hrtimer_init()
  */
 struct hrtimer {
 	struct rb_node			node;
@@ -114,6 +120,11 @@ struct hrtimer {
 	enum hrtimer_cb_mode		cb_mode;
 	struct list_head		cb_entry;
 #endif
+#ifdef CONFIG_TIMER_STATS
+	void				*start_site;
+	char				start_comm[16];
+	int				start_pid;
+#endif
 };
 
 /**
@@ -332,4 +343,44 @@ static inline void show_no_hz_stats(stru
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
 
+/*
+ * Timer-statistics info:
+ */
+#ifdef CONFIG_TIMER_STATS
+
+extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
+				     void *timerf, char * comm);
+
+static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
+{
+	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
+				 timer->function, timer->start_comm);
+}
+
+extern void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer,
+						 void *addr);
+
+static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
+{
+	__timer_stats_hrtimer_set_start_info(timer, __builtin_return_address(0));
+}
+
+static inline void timer_stats_hrtimer_clear_start_info(struct hrtimer *timer)
+{
+	timer->start_site = NULL;
+}
+#else
+static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
+{
+}
+
+static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
+{
+}
+
+static inline void timer_stats_hrtimer_clear_start_info(struct hrtimer *timer)
+{
+}
+#endif
+
 #endif
Index: linux-2.6.19-rc5-mm1/include/linux/timer.h
===================================================================
--- linux-2.6.19-rc5-mm1.orig/include/linux/timer.h	2006-11-09 17:47:59.000000000 +0100
+++ linux-2.6.19-rc5-mm1/include/linux/timer.h	2006-11-09 20:17:05.000000000 +0100
@@ -2,6 +2,7 @@
 #define _LINUX_TIMER_H
 
 #include <linux/list.h>
+#include <linux/ktime.h>
 #include <linux/spinlock.h>
 #include <linux/stddef.h>
 
@@ -15,6 +16,11 @@ struct timer_list {
 	unsigned long data;
 
 	struct tvec_t_base_s *base;
+#ifdef CONFIG_TIMER_STATS
+	void *start_site;
+	char start_comm[16];
+	int start_pid;
+#endif
 };
 
 extern struct tvec_t_base_s boot_tvec_bases;
@@ -73,6 +79,49 @@ extern unsigned long next_timer_interrup
  */
 extern unsigned long get_next_timer_interrupt(unsigned long now);
 
+/*
+ * Timer-statistics info:
+ */
+#ifdef CONFIG_TIMER_STATS
+
+extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
+				     void *timerf, char * comm);
+
+static inline void timer_stats_account_timer(struct timer_list *timer)
+{
+	timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
+				 timer->function, timer->start_comm);
+}
+
+extern void __timer_stats_timer_set_start_info(struct timer_list *timer,
+					       void *addr);
+
+static inline void timer_stats_timer_set_start_info(struct timer_list *timer)
+{
+	__timer_stats_timer_set_start_info(timer, __builtin_return_address(0));
+}
+
+static inline void timer_stats_timer_clear_start_info(struct timer_list *timer)
+{
+	timer->start_site = NULL;
+}
+#else
+static inline void timer_stats_account_timer(struct timer_list *timer)
+{
+}
+
+static inline void timer_stats_timer_set_start_info(struct timer_list *timer)
+{
+}
+
+static inline void timer_stats_timer_clear_start_info(struct timer_list *timer)
+{
+}
+#endif
+
+extern void delayed_work_timer_fn(unsigned long __data);
+
+
 /***
  * add_timer - start a timer
  * @timer: the timer to be added
Index: linux-2.6.19-rc5-mm1/kernel/hrtimer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/hrtimer.c	2006-11-09 20:16:11.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/hrtimer.c	2006-11-09 20:17:05.000000000 +0100
@@ -972,6 +972,18 @@ static inline void hrtimer_resume_jiffy_
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
 
+#ifdef CONFIG_TIMER_STATS
+void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer, void *addr)
+{
+	if (timer->start_site)
+		return;
+
+	timer->start_site = addr;
+	memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
+	timer->start_pid = current->pid;
+}
+#endif
+
 /*
  * Timekeeping resumed notification
  */
@@ -1140,6 +1152,7 @@ remove_hrtimer(struct hrtimer *timer, st
 		 * reprogramming happens in the interrupt handler. This is a
 		 * rare case and less expensive than a smp call.
 		 */
+		timer_stats_hrtimer_clear_start_info(timer);
 		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
 		__remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE,
 				 reprogram);
@@ -1188,6 +1201,8 @@ hrtimer_start(struct hrtimer *timer, kti
 	}
 	timer->expires = tim;
 
+	timer_stats_hrtimer_set_start_info(timer);
+
 	enqueue_hrtimer(timer, new_base, base == new_base);
 
 	unlock_hrtimer_base(timer, &flags);
@@ -1320,6 +1335,12 @@ void hrtimer_init(struct hrtimer *timer,
 
 	timer->base = &cpu_base->clock_base[clock_id];
 	hrtimer_init_timer_hres(timer);
+
+#ifdef CONFIG_TIMER_STATS
+	timer->start_site = NULL;
+	timer->start_pid = -1;
+	memset(timer->start_comm, 0, TASK_COMM_LEN);
+#endif
 }
 EXPORT_SYMBOL_GPL(hrtimer_init);
 
@@ -1400,6 +1421,7 @@ void hrtimer_interrupt(struct pt_regs *r
 
 			__remove_hrtimer(timer, base,
 					 HRTIMER_STATE_CALLBACK, 0);
+			timer_stats_account_hrtimer(timer);
 
 			if (timer->function(timer) != HRTIMER_NORESTART) {
 				BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
@@ -1442,6 +1464,8 @@ static void run_hrtimer_softirq(struct s
 		timer = list_entry(cpu_base->cb_pending.next,
 				   struct hrtimer, cb_entry);
 
+		timer_stats_account_hrtimer(timer);
+
 		fn = timer->function;
 		__remove_hrtimer(timer, timer->base, HRTIMER_STATE_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
@@ -1498,6 +1522,8 @@ static inline void run_hrtimer_queue(str
 		if (base->softirq_time.tv64 <= timer->expires.tv64)
 			break;
 
+		timer_stats_account_hrtimer(timer);
+
 		fn = timer->function;
 		__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
 		spin_unlock_irq(&cpu_base->lock);
Index: linux-2.6.19-rc5-mm1/kernel/time/Makefile
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/time/Makefile	2006-11-09 17:52:38.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/time/Makefile	2006-11-09 20:17:05.000000000 +0100
@@ -1,3 +1,4 @@
 obj-y += ntp.o clocksource.o jiffies.o
 
-obj-$(CONFIG_GENERIC_CLOCKEVENTS) += clockevents.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS)	+= clockevents.o
+obj-$(CONFIG_TIMER_STATS)		+= timer_stats.o
Index: linux-2.6.19-rc5-mm1/kernel/time/timer_stats.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19-rc5-mm1/kernel/time/timer_stats.c	2006-11-09 20:17:05.000000000 +0100
@@ -0,0 +1,244 @@
+/*
+ * kernel/time/timer_stats.c
+ *
+ * Collect timer usage statistics.
+ *
+ * Copyright(C) 2006, Red Hat, Inc., Ingo Molnar
+ * Copyright(C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
+ *
+ * timer_stats is based on timer_top, a similar functionality which was part of
+ * Con Kolivas' dyntick patch set. It was developed by Daniel Petrini at the
+ * Instituto Nokia de Tecnologia - INdT - Manaus. timer_top's design was based
+ * on dynamic allocation of the statistics entries rather than the static array
+ * which is used by timer_stats. It was written for the pre hrtimer kernel code
+ * and therefore did not take hrtimers into account. Nevertheless it provided
+ * the base for the timer_stats implementation and was a helpful source of
+ * inspiration in the first place. Kudos to Daniel and the Nokia folks for this
+ * effort.
+ *
+ * timer_top.c is
+ *	Copyright (C) 2005 Instituto Nokia de Tecnologia - INdT - Manaus
+ *	Written by Daniel Petrini <d.pensator@gmail.com>
+ *	timer_top.c was released under the GNU General Public License version 2
+ *
+ * We export the addresses and counting of timer functions being called,
+ * the pid and cmdline from the owner process if applicable.
+ *
+ * Start/stop data collection:
+ * # echo 1[0] >/proc/timer_stats
+ *
+ * Display the collected information:
+ * # cat /proc/timer_stats
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/proc_fs.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+
+#include <asm/uaccess.h>
+
+enum tstats_stat {
+	TSTATS_INACTIVE,
+	TSTATS_ACTIVE,
+	TSTATS_READOUT,
+	TSTATS_RESET,
+};
+
+struct tstats_entry {
+	void			*timer;
+	void			*start_func;
+	void			*expire_func;
+	unsigned long		counter;
+	pid_t			pid;
+	char			comm[TASK_COMM_LEN + 1];
+};
+
+#define TSTATS_MAX_ENTRIES	1024
+
+static struct tstats_entry tstats[TSTATS_MAX_ENTRIES];
+static DEFINE_SPINLOCK(tstats_lock);
+static enum tstats_stat tstats_status;
+static ktime_t tstats_time;
+
+/**
+ * timer_stats_update_stats - Update the statistics for a timer.
+ * @timer:	pointer to either a timer_list or a hrtimer
+ * @pid:	the pid of the task which set up the timer
+ * @startf:	pointer to the function which did the timer setup
+ * @timerf:	pointer to the timer callback function of the timer
+ * @comm:	name of the process which set up the timer
+ *
+ * When the timer is already registered, then the event counter is
+ * incremented. Otherwise the timer is registered in a free slot.
+ */
+void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
+			      void *timerf, char * comm)
+{
+	struct tstats_entry *entry = tstats;
+	unsigned long flags;
+	int i;
+
+	spin_lock_irqsave(&tstats_lock, flags);
+	if (tstats_status != TSTATS_ACTIVE)
+		goto out_unlock;
+
+	for (i = 0; i < TSTATS_MAX_ENTRIES; i++, entry++) {
+		if (entry->timer == timer &&
+		    entry->start_func == startf &&
+		    entry->expire_func == timerf &&
+		    entry->pid == pid) {
+
+			entry->counter++;
+			break;
+		}
+		if (!entry->timer) {
+			entry->timer = timer;
+			entry->start_func = startf;
+			entry->expire_func = timerf;
+			entry->counter = 1;
+			entry->pid = pid;
+			memcpy(entry->comm, comm, TASK_COMM_LEN);
+			entry->comm[TASK_COMM_LEN] = 0;
+			break;
+		}
+	}
+
+ out_unlock:
+	spin_unlock_irqrestore(&tstats_lock, flags);
+}
+
+static void print_name_offset(struct seq_file *m, unsigned long addr)
+{
+	char namebuf[KSYM_NAME_LEN+1];
+	unsigned long size, offset;
+	const char *sym_name;
+	char *modname;
+
+	sym_name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
+	if (sym_name)
+		seq_printf(m, "%s", sym_name);
+	else
+		seq_printf(m, "<%p>", (void *)addr);
+}
+
+static int tstats_show(struct seq_file *m, void *v)
+{
+	struct tstats_entry *entry = tstats;
+	struct timespec period;
+	unsigned long ms;
+	long events = 0;
+	int i;
+
+	spin_lock_irq(&tstats_lock);
+	switch (tstats_status) {
+	case TSTATS_ACTIVE:
+		tstats_time = ktime_sub(ktime_get(), tstats_time);
+	case TSTATS_INACTIVE:
+		tstats_status = TSTATS_READOUT;
+		break;
+	default:
+		spin_unlock_irq(&tstats_lock);
+		return -EBUSY;
+	}
+	spin_unlock_irq(&tstats_lock);
+
+	period = ktime_to_timespec(tstats_time);
+	ms = period.tv_nsec / 1000000;
+
+	seq_printf(m, "Timerstats sample period: %ld.%03lu s\n",
+		   period.tv_sec, ms);
+
+	for (i = 0; i < TSTATS_MAX_ENTRIES && entry->timer; i++, entry++) {
+		seq_printf(m, "%4lu, %5d %-16s ", entry->counter, entry->pid,
+			   entry->comm);
+
+		print_name_offset(m, (unsigned long)entry->start_func);
+		seq_puts(m, " (");
+		print_name_offset(m, (unsigned long)entry->expire_func);
+		seq_puts(m, ")\n");
+		events += entry->counter;
+	}
+
+	ms += period.tv_sec * 1000;
+	if (events && period.tv_sec)
+		seq_printf(m, "%ld total events, %lu.%03lu events/sec\n", events,
+			   events * 1000 / ms, (events * 1000000 / ms) % 1000);
+	else
+		seq_printf(m, "%ld total events\n", events);
+
+	tstats_status = TSTATS_INACTIVE;
+	return 0;
+}
+
+static ssize_t tstats_write(struct file *file, const char __user *buf,
+			    size_t count, loff_t *offs)
+{
+	char ctl[2];
+
+	if (count != 2 || *offs)
+		return -EINVAL;
+
+	if (copy_from_user(ctl, buf, count))
+		return -EFAULT;
+
+	switch (ctl[0]) {
+	case '0':
+		spin_lock_irq(&tstats_lock);
+		if (tstats_status == TSTATS_ACTIVE) {
+			tstats_status = TSTATS_INACTIVE;
+			tstats_time = ktime_sub(ktime_get(), tstats_time);
+		}
+		spin_unlock_irq(&tstats_lock);
+		break;
+	case '1':
+		spin_lock_irq(&tstats_lock);
+		if (tstats_status == TSTATS_INACTIVE) {
+			tstats_status = TSTATS_RESET;
+			memset(tstats, 0, sizeof(tstats));
+			tstats_time = ktime_get();
+			tstats_status = TSTATS_ACTIVE;
+		}
+		spin_unlock_irq(&tstats_lock);
+		break;
+	default:
+		count = -EINVAL;
+	}
+
+	return count;
+}
+
+static int tstats_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, tstats_show, NULL);
+}
+
+static struct file_operations tstats_fops = {
+	.open		= tstats_open,
+	.read		= seq_read,
+	.write		= tstats_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init init_tstats(void)
+{
+	struct proc_dir_entry *pe;
+
+	pe = create_proc_entry("timer_stats", 0666, NULL);
+
+	if (!pe)
+		return -ENOMEM;
+
+	pe->proc_fops = &tstats_fops;
+
+	return 0;
+}
+module_init(init_tstats);
Index: linux-2.6.19-rc5-mm1/kernel/timer.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/timer.c	2006-11-09 20:16:11.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/timer.c	2006-11-09 20:17:05.000000000 +0100
@@ -34,6 +34,7 @@
 #include <linux/cpu.h>
 #include <linux/syscalls.h>
 #include <linux/delay.h>
+#include <linux/kallsyms.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -133,6 +134,18 @@ static void internal_add_timer(tvec_base
 	list_add_tail(&timer->entry, vec);
 }
 
+#ifdef CONFIG_TIMER_STATS
+void __timer_stats_timer_set_start_info(struct timer_list *timer, void *addr)
+{
+	if (timer->start_site)
+		return;
+
+	timer->start_site = addr;
+	memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
+	timer->start_pid = current->pid;
+}
+#endif
+
 /**
  * init_timer - initialize a timer.
  * @timer: the timer to be initialized
@@ -144,11 +157,16 @@ void fastcall init_timer(struct timer_li
 {
 	timer->entry.next = NULL;
 	timer->base = __raw_get_cpu_var(tvec_bases);
+#ifdef CONFIG_TIMER_STATS
+	timer->start_site = NULL;
+	timer->start_pid = -1;
+	memset(timer->start_comm, 0, TASK_COMM_LEN);
+#endif
 }
 EXPORT_SYMBOL(init_timer);
 
 static inline void detach_timer(struct timer_list *timer,
-					int clear_pending)
+				int clear_pending)
 {
 	struct list_head *entry = &timer->entry;
 
@@ -195,6 +213,7 @@ int __mod_timer(struct timer_list *timer
 	unsigned long flags;
 	int ret = 0;
 
+	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
 
 	base = lock_timer_base(timer, &flags);
@@ -245,6 +264,7 @@ void add_timer_on(struct timer_list *tim
 	tvec_base_t *base = per_cpu(tvec_bases, cpu);
   	unsigned long flags;
 
+	timer_stats_timer_set_start_info(timer);
   	BUG_ON(timer_pending(timer) || !timer->function);
 	spin_lock_irqsave(&base->lock, flags);
 	timer->base = base;
@@ -277,6 +297,7 @@ int mod_timer(struct timer_list *timer, 
 {
 	BUG_ON(!timer->function);
 
+	timer_stats_timer_set_start_info(timer);
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -307,6 +328,7 @@ int del_timer(struct timer_list *timer)
 	unsigned long flags;
 	int ret = 0;
 
+	timer_stats_timer_clear_start_info(timer);
 	if (timer_pending(timer)) {
 		base = lock_timer_base(timer, &flags);
 		if (timer_pending(timer)) {
@@ -440,6 +462,8 @@ static inline void __run_timers(tvec_bas
  			fn = timer->function;
  			data = timer->data;
 
+			timer_stats_account_timer(timer);
+
 			set_running_timer(base, timer);
 			detach_timer(timer, 1);
 			spin_unlock_irq(&base->lock);
@@ -1128,7 +1152,8 @@ static void run_timer_softirq(struct sof
 {
 	tvec_base_t *base = __get_cpu_var(tvec_bases);
 
- 	hrtimer_run_queues();
+	hrtimer_run_queues();
+
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
 }
Index: linux-2.6.19-rc5-mm1/kernel/workqueue.c
===================================================================
--- linux-2.6.19-rc5-mm1.orig/kernel/workqueue.c	2006-11-09 17:47:59.000000000 +0100
+++ linux-2.6.19-rc5-mm1/kernel/workqueue.c	2006-11-09 20:17:05.000000000 +0100
@@ -122,7 +122,7 @@ int fastcall queue_work(struct workqueue
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
-static void delayed_work_timer_fn(unsigned long __data)
+void delayed_work_timer_fn(unsigned long __data)
 {
 	struct work_struct *work = (struct work_struct *)__data;
 	struct workqueue_struct *wq = work->wq_data;
@@ -143,11 +143,12 @@ static void delayed_work_timer_fn(unsign
  * Returns 0 if @work was already on a queue, non-zero otherwise.
  */
 int fastcall queue_delayed_work(struct workqueue_struct *wq,
-			struct work_struct *work, unsigned long delay)
+				struct work_struct *work, unsigned long delay)
 {
 	int ret = 0;
 	struct timer_list *timer = &work->timer;
 
+	timer_stats_timer_set_start_info(&work->timer);
 	if (!test_and_set_bit(0, &work->pending)) {
 		BUG_ON(timer_pending(timer));
 		BUG_ON(!list_empty(&work->entry));
@@ -489,6 +490,7 @@ EXPORT_SYMBOL(schedule_work);
  */
 int fastcall schedule_delayed_work(struct work_struct *work, unsigned long delay)
 {
+	timer_stats_timer_set_start_info(&work->timer);
 	return queue_delayed_work(keventd_wq, work, delay);
 }
 EXPORT_SYMBOL(schedule_delayed_work);
Index: linux-2.6.19-rc5-mm1/lib/Kconfig.debug
===================================================================
--- linux-2.6.19-rc5-mm1.orig/lib/Kconfig.debug	2006-11-09 17:47:59.000000000 +0100
+++ linux-2.6.19-rc5-mm1/lib/Kconfig.debug	2006-11-09 20:17:05.000000000 +0100
@@ -125,6 +125,17 @@ config SCHEDSTATS
 	  application, you can say N to avoid the very slight overhead
 	  this adds.
 
+config TIMER_STATS
+	bool "Collect kernel timers statistics"
+	depends on DEBUG_KERNEL && PROC_FS
+	help
+	  If you say Y here, additional code will be inserted into the
+	  timer routines to collect statistics about kernel timers being
+	  reprogrammed. The statistics can be read from /proc/timer_stats.
+	  The statistics collection is started by writing 1 to /proc/timer_stats,
+	  writing 0 stops it. This feature is useful to collect information
+	  about timer usage patterns in kernel and userspace.
+
 config DEBUG_SLAB
 	bool "Debug slab memory allocations"
 	depends on DEBUG_KERNEL && SLAB

--
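
The start/stop/readout interface described in the header comment and the Kconfig help above can be driven from a small userspace helper. This is a minimal sketch, not part of the patch: the function names timer_stats_ctl() and timer_stats_dump() are hypothetical, and the path is passed in so the helpers can be pointed at /proc/timer_stats on a kernel built with CONFIG_TIMER_STATS. Since tstats_write() rejects any write whose length is not exactly 2, the command is sent as one character plus a newline, which is what "echo 1" produces.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Send a one-character command ('1' = start, '0' = stop) followed by a
 * newline. The 2-byte length matches the "count != 2" check in the
 * patch's tstats_write(). Returns 0 on success, -1 on failure.
 * Hypothetical helper, not part of the patch.
 */
static int timer_stats_ctl(const char *path, char cmd)
{
	char buf[2] = { cmd, '\n' };
	ssize_t n;
	int fd;

	assert(path);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, buf, sizeof(buf));
	close(fd);
	return n == (ssize_t)sizeof(buf) ? 0 : -1;
}

/*
 * Copy the collected statistics line by line to the given stream.
 * Returns 0 on success, -1 if the file cannot be opened.
 * Hypothetical helper, not part of the patch.
 */
static int timer_stats_dump(const char *path, FILE *out)
{
	char line[256];
	FILE *in;

	assert(path && out);
	in = fopen(path, "r");
	if (!in)
		return -1;
	while (fgets(line, sizeof(line), in))
		fputs(line, out);
	fclose(in);
	return 0;
}
```

A typical sequence would be timer_stats_ctl(path, '1'), a sleep() covering the interval of interest, timer_stats_ctl(path, '0'), then timer_stats_dump(path, stdout).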



* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-09 23:38 ` [patch 13/19] GTOD: Mark TSC unusable for highres timers Thomas Gleixner
@ 2006-11-10  1:10   ` john stultz
  2006-11-10  5:10     ` Andi Kleen
  0 siblings, 1 reply; 70+ messages in thread
From: john stultz @ 2006-11-10  1:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, Arjan van de Ven,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> plain text document attachment
> (gtod-mark-tsc-unusable-for-highres-timers.patch)
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The TSC is too unstable and unreliable to be used with high resolution timers.
> The automatic detection of TSC instability fails once we switched to high
> resolution mode, because the tick emulation would use the TSC as reference. 
> This results in a circular dependency.  Mark it unusable for high res upfront.
> 
> [akpm@osdl.org: updated for i386-time-avoid-pit-smp-lockups.patch]
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> diff -puN arch/i386/kernel/tsc.c~gtod-mark-tsc-unusable-for-highres-timers arch/i386/kernel/tsc.c
> --- a/arch/i386/kernel/tsc.c~gtod-mark-tsc-unusable-for-highres-timers
> +++ a/arch/i386/kernel/tsc.c
> @@ -459,10 +459,23 @@ static int __init init_tsc_clocksource(v
>  		current_tsc_khz = tsc_khz;
>  		clocksource_tsc.mult = clocksource_khz2mult(current_tsc_khz,
>  							clocksource_tsc.shift);
> +#ifndef CONFIG_HIGH_RES_TIMERS
>  		/* lower the rating if we already know its unstable: */
>  		if (check_tsc_unstable())
>  			clocksource_tsc.rating = 0;
> -
> +#else
> +		/*
> +		 * Mark TSC unsuitable for high resolution timers. TSC has so
> +		 * many pitfalls: frequency changes, stop in idle ...  When we
> +		 * switch to high resolution mode we can no longer detect a
> +		 * firmware caused frequency change, as the emulated tick uses
> +		 * TSC as reference. This results in a circular dependency.
> +		 * Switch only to high resolution mode, if pm_timer or such
> +		 * is available.
> +		 */
> +		clocksource_tsc.rating = 50;
> +		clocksource_tsc.is_continuous = 0;
> +#endif
>  		init_timer(&verify_tsc_freq_timer);
>  		verify_tsc_freq_timer.function = verify_tsc_freq;
>  		verify_tsc_freq_timer.expires =


Hmmm. I wish this patch was unnecessary, but I don't see an easy
solution. 

Mind adding a warning so users know why a system that might use the TSC
normally does not use the TSC w/ highres timers?

Otherwise looks ok.

Acked-by: John Stultz <johnstul@us.ibm.com>

thanks
-john



* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  1:10   ` john stultz
@ 2006-11-10  5:10     ` Andi Kleen
  2006-11-10  8:10       ` Thomas Gleixner
  2006-11-10 10:28       ` Arjan van de Ven
  0 siblings, 2 replies; 70+ messages in thread
From: Andi Kleen @ 2006-11-10  5:10 UTC (permalink / raw)
  To: john stultz
  Cc: Thomas Gleixner, Andrew Morton, LKML, Ingo Molnar, Len Brown,
	Arjan van de Ven, Roman Zippel

> >  		current_tsc_khz = tsc_khz;
> >  		clocksource_tsc.mult = clocksource_khz2mult(current_tsc_khz,
> >  							clocksource_tsc.shift);
> > +#ifndef CONFIG_HIGH_RES_TIMERS
> >  		/* lower the rating if we already know its unstable: */
> >  		if (check_tsc_unstable())
> >  			clocksource_tsc.rating = 0;
> > -
> > +#else
> > +		/*
> > +		 * Mark TSC unsuitable for high resolution timers. TSC has so
> > +		 * many pitfalls: frequency changes, stop in idle ...  When we
> > +		 * switch to high resolution mode we can no longer detect a
> > +		 * firmware caused frequency change, as the emulated tick uses
> > +		 * TSC as reference. This results in a circular dependency.
> > +		 * Switch only to high resolution mode, if pm_timer or such
> > +		 * is available.
> > +		 */
> > +		clocksource_tsc.rating = 50;
> > +		clocksource_tsc.is_continuous = 0;
> > +#endif
> >  		init_timer(&verify_tsc_freq_timer);
> >  		verify_tsc_freq_timer.function = verify_tsc_freq;
> >  		verify_tsc_freq_timer.expires =
> 
> 
> Hmmm. I wish this patch was unnecessary, but I don't see an easy
> solution. 

Very sad. This will make a lot of people unhappy, even to the point
where they might prefer disabling noidlehz over super slow gettimeofday. 
I assume you at least have a suitable command line option for that, right?

Can we get a summary on which systems the TSC is considered unstable?
Normally we assume if it's stable enough for gettimeofday it should
be stable enough for longer delays too.

-Andi



* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  5:10     ` Andi Kleen
@ 2006-11-10  8:10       ` Thomas Gleixner
  2006-11-10  8:50         ` Andrew Morton
  2006-11-10 11:11         ` Pavel Machek
  2006-11-10 10:28       ` Arjan van de Ven
  1 sibling, 2 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-10  8:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: john stultz, Andrew Morton, LKML, Ingo Molnar, Len Brown,
	Arjan van de Ven, Roman Zippel

On Fri, 2006-11-10 at 06:10 +0100, Andi Kleen wrote:
> > >  		verify_tsc_freq_timer.function = verify_tsc_freq;
> > >  		verify_tsc_freq_timer.expires =
> > 
> > 
> > Hmmm. I wish this patch was unnecessary, but I don't see an easy
> > solution. 
> 
> Very sad. This will make a lot of people unhappy, even to the point
> where they might prefer disabling noidlehz over super slow gettimeofday. 
> I assume you at least have a suitable command line option for that, right?

Yes it is sad. And the saddest part is that AMD and Intel have been asked
to fix that more than 5 years ago. They did not get their brain straight
and now we are the dimwits.

> Can we get a summary on which systems the TSC is considered unstable?
> Normally we assume if it's stable enough for gettimeofday it should
> be stable enough for longer delays too.

TSC is simply a nightmare:

- Frequency changes with CPU clock
- Unsynced across CPUs
- Stops in C3, which makes it completely unusable

Once you take away periodic interrupts it is simply broken. AMD and
Intel can run in circles, it does not get better.

	tglx






* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:10       ` Thomas Gleixner
@ 2006-11-10  8:50         ` Andrew Morton
  2006-11-10  8:57           ` Ingo Molnar
  2006-11-10 11:11         ` Pavel Machek
  1 sibling, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2006-11-10  8:50 UTC (permalink / raw)
  To: tglx
  Cc: Andi Kleen, john stultz, LKML, Ingo Molnar, Len Brown,
	Arjan van de Ven, Roman Zippel

On Fri, 10 Nov 2006 09:10:06 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Fri, 2006-11-10 at 06:10 +0100, Andi Kleen wrote:
> > > >  		verify_tsc_freq_timer.function = verify_tsc_freq;
> > > >  		verify_tsc_freq_timer.expires =
> > > 
> > > 
> > > Hmmm. I wish this patch was unnecessary, but I don't see an easy
> > > solution. 
> > 
> > Very sad. This will make a lot of people unhappy, even to the point
> > where they might prefer disabling noidlehz over super slow gettimeofday. 
> > I assume you at least have a suitable command line option for that, right?
> 
> Yes it is sad. And the saddest part is that AMD and Intel have been asked
> to fix that more than 5 years ago. They did not get their brain straight
> and now we are the dimwits.
> 
> > Can we get a summary on which systems the TSC is considered unstable?
> > Normally we assume if it's stable enough for gettimeofday it should
> > be stable enough for longer delays too.
> 
> TSC is simply a nightmare:
> 
> - Frequency changes with CPU clock
> - Unsynced across CPUs
> - Stops in C3, which makes it completely unusable
> 
> Once you take away periodic interrupts it is simply broken. AMD and
> Intel can run in circles, it does not get better.
> 

What is the actual problem?  verify_tsc_freq()?

If so, could that function use the PIT/pmtimer/etc for working out if
the TSC is bust, rather than directly using jiffies?
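
The idea of checking the TSC against an independent reference instead of jiffies can be sketched as below. This is an illustration under stated assumptions only, not code from the patch set: the function name, its parameters and the permille tolerance are hypothetical, and how the reference interval (PIT, PM timer or similar) is read is left to the caller. Whether such a check is sufficient is disputed in the replies that follow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical check: convert a TSC cycle delta to nanoseconds and
 * compare it with the time measured over the same interval by an
 * independent reference clock. Returns 1 if the two differ by more
 * than tolerance_permille parts per thousand of the reference, else 0.
 */
static int tsc_disagrees_with_ref(uint64_t tsc_cycles, uint64_t tsc_khz,
				  uint64_t ref_ns, uint64_t tolerance_permille)
{
	uint64_t tsc_ns, diff;

	assert(tsc_khz != 0);

	/*
	 * cycles / kHz yields milliseconds, so cycles * 10^6 / kHz yields
	 * nanoseconds. The multiplication can overflow for very long
	 * intervals; a sketch like this assumes short sampling periods.
	 */
	tsc_ns = tsc_cycles * 1000000ULL / tsc_khz;
	diff = tsc_ns > ref_ns ? tsc_ns - ref_ns : ref_ns - tsc_ns;

	return diff * 1000 > ref_ns * tolerance_permille;
}
```

Such a check only observes the one CPU it runs on during the sampled interval, so it says nothing about skew between cores or sockets.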


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:50         ` Andrew Morton
@ 2006-11-10  8:57           ` Ingo Molnar
  2006-11-10  9:13             ` Andrew Morton
                               ` (3 more replies)
  0 siblings, 4 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10  8:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: tglx, Andi Kleen, john stultz, LKML, Len Brown, Arjan van de Ven,
	Roman Zippel

* Andrew Morton <akpm@osdl.org> wrote:

> If so, could that function use the PIT/pmtimer/etc for working out if 
> the TSC is bust, rather than directly using jiffies?

there's no reliable way to figure out the TSC is bust: some CPUs have a 
slight 'skew' between cores for example. On some systems the TSC might 
skew between sockets. A CPU might break its TSC only once some 
powersaving mode has been activated - which might be long after bootup. 
The whole TSC business is a nightmare and cannot be supported reliably. 
AFAIK Windows doesn't use it, so it's a continuous minefield for new 
hardware to break.

We should wait until CPU makers get their act together and implement a 
TSC variant that is /architecturally promised/ to have constant 
frequency (system bus frequency or whatever) and which never stops.

	Ingo


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:57           ` Ingo Molnar
@ 2006-11-10  9:13             ` Andrew Morton
  2006-11-10  9:29               ` Andi Kleen
  2006-11-10 10:35               ` Arjan van de Ven
  2006-11-10  9:27             ` Andi Kleen
                               ` (2 subsequent siblings)
  3 siblings, 2 replies; 70+ messages in thread
From: Andrew Morton @ 2006-11-10  9:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, Andi Kleen, john stultz, LKML, Len Brown, Arjan van de Ven,
	Roman Zippel

On Fri, 10 Nov 2006 09:57:28 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > If so, could that function use the PIT/pmtimer/etc for working out if 
> > the TSC is bust, rather than directly using jiffies?
> 
> there's no reliable way to figure out the TSC is bust: some CPUs have a 
> slight 'skew' between cores for example. On some systems the TSC might 
> skew between sockets. A CPU might break its TSC only once some 
> powersaving mode has been activated - which might be long after bootup. 
> The whole TSC business is a nightmare and cannot be supported reliably. 
> AFAIK Windows doesn't use it, so it's a continuous minefield for new 
> hardware to break.

But that's different.

We're limping along in a semi-OK fashion with the TSC.  But now Thomas is
proposing that we effectively kill it off for all x86 because of hrtimers.

And afaict the reason for that is that we're using jiffies to determine if
the TSC has gone bad, and that test is getting false positives.

> We should wait until CPU makers get their act together and implement a 
> TSC variant that is /architecturally promised/ to have constant 
> frequency (system bus frequency or whatever) and which never stops.
> 

That'll hurt the big machines rather a lot, won't it?


* Re: [patch 01/19] hrtimers: state tracking
  2006-11-09 23:38 ` [patch 01/19] hrtimers: state tracking Thomas Gleixner
@ 2006-11-10  9:19   ` Arjan van de Ven
  2006-11-10  9:40     ` Andrew Morton
  2006-11-23 22:26   ` Roman Zippel
  1 sibling, 1 reply; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10  9:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel


> +/*
> + * Bit values to track state of the timer
> + *
> + * Possible states:
> + *
> + * 0x00		inactive
> + * 0x01		enqueued into rbtree
> + * 0x02		callback function running
> + * 0x03		callback function running and enqueued
> + *		(was requeued on another CPU)
> + *
> + * The "callback function running and enqueued" status is only possible on
> + * SMP. It happens for example when a posix timer expired and the callback
> + * queued a signal. Between dropping the lock which protects the posix timer
> + * and reacquiring the base lock of the hrtimer, another CPU can deliver the
> + * signal and rearm the timer. We have to preserve the callback running state,
> + * as otherwise the timer could be removed before the softirq code finishes the
> + * handling of the timer.
> + *
> + * The HRTIMER_STATE_ENQUEUED bit is always or'ed to the current state to
> + * preserve the HRTIMER_STATE_CALLBACK bit in the above scenario.
> + *
> + * All state transitions are protected by cpu_base->lock.
> + */
> +#define HRTIMER_STATE_INACTIVE	0x00
> +#define HRTIMER_STATE_ENQUEUED	0x01
> +#define HRTIMER_STATE_CALLBACK	0x02

where is the define for 0x03?

>  
> +static inline int hrtimer_is_queued(struct hrtimer *timer)
> +{
> +	return timer->state != HRTIMER_STATE_INACTIVE &&
> +		timer->state != HRTIMER_STATE_CALLBACK;
> +}

the state things are either bits or they're not. If they're bits, you
probably want to make this a bitcheck instead...
>  	rb_insert_color(&timer->node, &base->active);
> +	/*
> +	 * HRTIMER_STATE_ENQUEUED is or'ed to the current state to preserve the
> +	 * state of a possibly running callback.
> +	 */
> +	timer->state |= HRTIMER_STATE_ENQUEUED;

ok so it IS a bit thing, see comment about hrtimer_is_queued() not being
a bit check then...



> -	if (base->cpu_base->curr_timer != timer)
> +	if (!(timer->state & HRTIMER_STATE_CALLBACK))
>  		ret = remove_hrtimer(timer, base);

if there is a hrtimer_is_queued() inline, might as well make a
hrtimer_is_running() inline as well
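
For illustration, the two bit-check helpers suggested above could look roughly like this. It is a minimal sketch, assuming the state defines quoted earlier in this mail; struct hrtimer here is a stand-in carrying only the state field, and hrtimer_is_running() is the name proposed in this review, not an existing function in the patch.

```c
#include <assert.h>

#define HRTIMER_STATE_INACTIVE	0x00
#define HRTIMER_STATE_ENQUEUED	0x01
#define HRTIMER_STATE_CALLBACK	0x02

/* Stand-in with only the field used here, not the kernel's struct. */
struct hrtimer {
	unsigned long state;
};

/*
 * True in states 0x01 and 0x03, i.e. whenever the ENQUEUED bit is set.
 * For the four documented states this gives the same result as the
 * "!= INACTIVE && != CALLBACK" comparison in the patch.
 */
static inline int hrtimer_is_queued(const struct hrtimer *timer)
{
	assert(timer);
	return (timer->state & HRTIMER_STATE_ENQUEUED) != 0;
}

/* Hypothetical helper: true in states 0x02 and 0x03. */
static inline int hrtimer_is_running(const struct hrtimer *timer)
{
	assert(timer);
	return (timer->state & HRTIMER_STATE_CALLBACK) != 0;
}
```

With these, the open-coded "timer->state & HRTIMER_STATE_CALLBACK" tests could be replaced by calls to the second helper.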


otherwise looks ok; if you fix these few comments:
Acked-by: Arjan van de Ven <arjan@linux.intel.com>

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org



* Re: [patch 02/19] hrtimers: clean up callback tracking
  2006-11-09 23:38 ` [patch 02/19] hrtimers: clean up callback tracking Thomas Gleixner
@ 2006-11-10  9:20   ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10  9:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> plain text document attachment
> (hrtimers-clean-up-callback-tracking.patch)
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Reintroduce ktimers feature "optimized away" by the ktimers review process:
> remove the curr_timer pointer from the cpu-base and use the hrtimer state.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> Index: linux-2.6.19-rc5-mm1/include/linux/hrtimer.h
> ===================================================================
> --- linux-2.6.19-rc5-mm1.orig/include/l
> -		if (unlikely(base->cpu_base->curr_timer == timer))
> +		if (unlikely(timer->state & HRTIMER_STATE_CALLBACK))
>  			return base;

this also could use the hrtimer_is_running() inline from the [01/19]
review


otherwise looks ok:

Acked-by: Arjan van de Ven <arjan@linux.intel.com>

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org



* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:57           ` Ingo Molnar
  2006-11-10  9:13             ` Andrew Morton
@ 2006-11-10  9:27             ` Andi Kleen
  2006-11-10 10:14             ` Alan Cox
  2006-11-10 11:12             ` Pavel Machek
  3 siblings, 0 replies; 70+ messages in thread
From: Andi Kleen @ 2006-11-10  9:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, tglx, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

On Friday 10 November 2006 09:57, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > If so, could that function use the PIT/pmtimer/etc for working out if 
> > the TSC is bust, rather than directly using jiffies?
> 
> there's no reliable way to figure out the TSC is bust: some CPUs have a 
> slight 'skew' between cores for example. 

We find this out by black listing them. I got that working reliably as far as I know.

The main cases I know where we can't use it right now are:
- AMD >1 core
* when clock ramping is disabled it gets a little better, but on multi socket
it is still broken
* also it varies in frequency here which has to be handled
+ There is a little issue here that the frequency takes some unpredictable 
time to stabilize after the frequency change. AFAIK the error is too small
to cause problems though.
- Some Intel NUMA systems (IBM x4xx, Unisys ES7000, ScaleMP) 
* handled by detecting multiple Apic Clusters
- Intel systems with C3
* stops in C3. disable here
- a few P4 dual cores seem to lose TSC synchronization when overclocked
(or most likely overvolted) and running out of spec
* I chose to ignore this case. User fault. They can set command line options.
- We had one Intel BIOS which misprogrammed the FSB dividers
* Got fixed by BIOS update. Also it was an obscure case that can be handled
with command line options.

I don't see how this is changing much with dyntimers. The only
difference that should be there is that you require TSC stability for
a longer time (instead of only HZ), but normally when the TSC is unstable
it already causes trouble in the current setup.

You're probably overreacting to something. Maybe one of the old bugs?
(I had a typo in the Intel C3 detection for a long time that broke
a lot of Intel laptops) 

> On some systems the TSC might  
> skew between sockets. A CPU might break its TSC only once some 
> powersaving mode has been activated - which might be long after bootup. 
> The whole TSC business is a nightmare and cannot be supported reliably.

I disagree. 

> AFAIK Windows doesn't use it, so it's a continuous minefield for new 
> hardware to break.

Not true.
 
> We should wait until CPU makers get their act together and implement a 
> TSC variant that is /architecturally promised/ to have constant 
> frequency (system bus frequency or whatever) 

Intel already has that (modulo totally broken BIOS and overclocking). 
TSC is running always at highest P-state and usually synchronized too.

AMD is getting there.

> and which never stops. 

That's unrealistic unfortunately any time soon.  All the CPU vendors
are pushing for much more aggressive power saving and this basically 
means turning off the CPU completely in the deeper sleep states.

-Andi

 


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  9:13             ` Andrew Morton
@ 2006-11-10  9:29               ` Andi Kleen
  2006-11-11 11:14                 ` Thomas Gleixner
  2006-11-10 10:35               ` Arjan van de Ven
  1 sibling, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2006-11-10  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, tglx, john stultz, LKML, Len Brown, Arjan van de Ven,
	Roman Zippel


> But that's different.
> 
> We're limping along in a semi-OK fashion with the TSC.  But now Thomas is
> proposing that we effectively kill it off for all x86 because of hrtimers.

I'm totally against that.
 
> And afaict the reason for that is that we're using jiffies to determine if
> the TSC has gone bad, and that test is getting false positives.


The i386 clocksource has always had trouble with that, e.g. I have a box
where the TSC works perfectly fine on a 64bit kernel, but since the new i386
clocksource code is in, it always insists on disabling it shortly after boot.
My guess is that some of the checks in there are just broken and need
to be fixed.



> 
> > We should wait until CPU makers get their act together and implement a 
> > TSC variant that is /architecturally promised/ to have constant 
> > frequency (system bus frequency or whatever) and which never stops.
> > 
> 
> That'll hurt the big machines rather a lot, won't it?

It's unrealistic and short term it will cause extreme pain in many workloads
which are gettimeofday intensive (networking, databases etc.) 

-Andi


* Re: [patch 01/19] hrtimers: state tracking
  2006-11-10  9:19   ` Arjan van de Ven
@ 2006-11-10  9:40     ` Andrew Morton
  2006-11-10  9:45       ` Thomas Gleixner
  0 siblings, 1 reply; 70+ messages in thread
From: Andrew Morton @ 2006-11-10  9:40 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Thomas Gleixner, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Fri, 10 Nov 2006 10:19:48 +0100
Arjan van de Ven <arjan@infradead.org> wrote:

> 
> > +/*
> > + * Bit values to track state of the timer
> > + *
> > + * Possible states:
> > + *
> > + * 0x00		inactive
> > + * 0x01		enqueued into rbtree
> > + * 0x02		callback function running
> > + * 0x03		callback function running and enqueued
> > + *		(was requeued on another CPU)
> > + *
> > + * The "callback function running and enqueued" status is only possible on
> > + * SMP. It happens for example when a posix timer expired and the callback
> > + * queued a signal. Between dropping the lock which protects the posix timer
> > + * and reacquiring the base lock of the hrtimer, another CPU can deliver the
> > + * signal and rearm the timer. We have to preserve the callback running state,
> > + * as otherwise the timer could be removed before the softirq code finishes the
> > + * handling of the timer.
> > + *
> > + * The HRTIMER_STATE_ENQUEUED bit is always or'ed to the current state to
> > + * preserve the HRTIMER_STATE_CALLBACK bit in the above scenario.
> > + *
> > + * All state transitions are protected by cpu_base->lock.
> > + */
> > +#define HRTIMER_STATE_INACTIVE	0x00
> > +#define HRTIMER_STATE_ENQUEUED	0x01
> > +#define HRTIMER_STATE_CALLBACK	0x02
> 
> where is the define for 0x03?
> 
> >  
> > +static inline int hrtimer_is_queued(struct hrtimer *timer)
> > +{
> > +	return timer->state != HRTIMER_STATE_INACTIVE &&
> > +		timer->state != HRTIMER_STATE_CALLBACK;
> > +}
> 
> the state things are either bits or they're not. If they're bits, you
> probably want to make this a bitcheck instead...
> >  	rb_insert_color(&timer->node, &base->active);
> > +	/*
> > +	 * HRTIMER_STATE_ENQUEUED is or'ed to the current state to preserve the
> > +	 * state of a possibly running callback.
> > +	 */
> > +	timer->state |= HRTIMER_STATE_ENQUEUED;
> 
> ok so it IS a bit thing, see comment about hrtimer_is_queued() not being
> a bit check then...
> 

eek.  I exhaustively went over that confusion in my initial (and lengthy)
review of these patches.

I don't think we ever saw a point-by-point reply.  What got lost?
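
For reference, the bit check Arjan asks for can be sketched as below. This is a minimal illustration and not part of the posted patch: the three state defines are copied from the quoted hunk, while struct demo_timer and both helper names are hypothetical stand-ins. With only the four documented states (0x00 to 0x03) the two forms give the same answer; the bit test simply states the intent directly.

```c
#include <stdbool.h>

/* State bits copied from the quoted patch. */
#define HRTIMER_STATE_INACTIVE	0x00
#define HRTIMER_STATE_ENQUEUED	0x01
#define HRTIMER_STATE_CALLBACK	0x02

/* Hypothetical stand-in for struct hrtimer; only the state field matters here. */
struct demo_timer {
	unsigned long state;
};

/* Comparison form, as in the quoted hrtimer_is_queued(). */
static bool demo_is_queued_compare(const struct demo_timer *t)
{
	return t->state != HRTIMER_STATE_INACTIVE &&
	       t->state != HRTIMER_STATE_CALLBACK;
}

/* Bit-test form: true whenever the ENQUEUED bit is set, whatever else is set. */
static bool demo_is_queued_bittest(const struct demo_timer *t)
{
	return (t->state & HRTIMER_STATE_ENQUEUED) != 0;
}
```

The difference would only show up if a further state bit were added later: the comparison form would then report such a timer as queued even when the ENQUEUED bit is clear.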

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 01/19] hrtimers: state tracking
  2006-11-10  9:40     ` Andrew Morton
@ 2006-11-10  9:45       ` Thomas Gleixner
  0 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-10  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Fri, 2006-11-10 at 01:40 -0800, Andrew Morton wrote:
> > 
> > ok so it IS a bit thing, see comment about hrtimer_is_queued() not being
> > a bit check then...
> > 
> 
> eek.  I exhaustively went over that confusion in my initial (and lengthy)
> review of these patches.
> 
> I don't think we ever saw a point-by-point reply.  What got lost?

I added comments in the defines and in the code as you requested.
Obviously not enough comments.

	tglx



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 04/19] Add a framework to manage clock event devices.
  2006-11-09 23:38 ` [patch 04/19] Add a framework to manage clock event devices Thomas Gleixner
@ 2006-11-10  9:47   ` Arjan van de Ven
  2006-11-23 22:36   ` Roman Zippel
  1 sibling, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10  9:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> + * struct clock_event_device - clock event descriptor
> + *
> + * @name:		ptr to clock event name
> + * @capabilities:	capabilities of the event chip
> + * @max_delta_ns:	maximum delta value in ns
> + * @min_delta_ns:	minimum delta value in ns
> + * @mult:		nanosecond to cycles multiplier
> + * @shift:		nanoseconds to cycles divisor (power of two)
> + * @set_next_event:	set next event
> + * @set_mode:		set mode function
> + * @suspend:		suspend function (optional)
> + * @resume:		resume function (optional)
> + * @evthandler:		Assigned by the framework to be called by the low
> + *			level handler of the event source
> + */

it would be nice if the datastructure was "pure"; eg entirely owned by
the source (and could be made const), however the only way I can see
that done is by having a private duplicate datastructure in each clock
driver... which is way overkill for one single function pointer ;(


well you could do

struct clock_event_device 
{
	const struct clock_ops *ops;
	void *instance;
	evthndlr_t *evthandler;
}

that way if you have, say, 3 hpet channels you can use the same
"ops" (or maybe "props") structure for all three, but still register the
per channel state as well..
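
A rough sketch of how that split could be used, assuming three channels of an HPET-like device. Everything except the three field names in the struct above is made up for illustration (struct clock_ops contents, demo_channel, demo_ops, the arrays); it is not code from the patch set.

```c
#include <stddef.h>

struct clock_event_device;

/* Hypothetical read-only operations table, shared by all instances. */
struct clock_ops {
	const char *name;
	void (*set_next_event)(unsigned long evt,
			       struct clock_event_device *dev);
};

typedef void evthndlr_t(struct clock_event_device *dev);

struct clock_event_device {
	const struct clock_ops *ops;	/* shared, can live in rodata */
	void *instance;			/* per-channel private state */
	evthndlr_t *evthandler;		/* assigned by the framework */
};

/* Hypothetical per-channel state. */
struct demo_channel {
	int index;
	unsigned long last_evt;
};

static void demo_set_next_event(unsigned long evt,
				struct clock_event_device *dev)
{
	struct demo_channel *ch = dev->instance;

	ch->last_evt = evt;	/* a real driver would program a comparator */
}

/* One const ops table for all three channels. */
static const struct clock_ops demo_ops = {
	.name		= "demo-hpet",
	.set_next_event	= demo_set_next_event,
};

static struct demo_channel demo_channels[3] = { { 0, 0 }, { 1, 0 }, { 2, 0 } };

static struct clock_event_device demo_devices[3] = {
	{ &demo_ops, &demo_channels[0], NULL },
	{ &demo_ops, &demo_channels[1], NULL },
	{ &demo_ops, &demo_channels[2], NULL },
};
```

Only the small per-device struct stays writable; the function pointers and name are shared and const.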


you maybe also want to have a "costs" member, so that you can pick the
cheapest available timer..


> +struct clock_event_device {
> +	const char	*name;
> +	unsigned int	capabilities;
> +	unsigned long	max_delta_ns;
> +	unsigned long	min_delta_ns;
> +	unsigned long	mult;
> +	int		shift;
> +	void		(*set_next_event)(unsigned long evt,
> +					  struct clock_event_device *);
> +	void		(*set_mode)(enum clock_event_mode mode,
> +				    struct clock_event_device *);
> +	void		(*event_handler)(struct pt_regs *regs);

is pt_regs really really needed here? We got rid of it in most places
(and made it a per task struct thing), I wonder if it can be made to go
away here too.


> +/*
> + * Start up an event device
> + */
> +static void startup_event(struct clock_event_device *evt, unsigned int caps)
> +{
> +	int mode;
> +
> +	if (caps == CLOCK_CAP_NEXTEVT)

isn't caps a bitfield ? if so, shouldn't this be a & ?
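
To make the difference concrete, here is a small sketch with made-up capability values (the real CLOCK_CAP_* values are not quoted in this mail, so the DEMO_ names and numbers are assumptions). It may well be that startup_event() is only ever called with exactly one capability selected, in which case == is intentional; the sketch only shows what changes when more than one bit is set.

```c
#include <stdbool.h>

/* Hypothetical capability bits, for illustration only. */
#define DEMO_CAP_TICK		0x01
#define DEMO_CAP_NEXTEVT	0x02

/* Equality: true only if NEXTEVT is the one and only bit set. */
static bool demo_nextevt_eq(unsigned int caps)
{
	return caps == DEMO_CAP_NEXTEVT;
}

/* Mask test: true if NEXTEVT is set, regardless of other bits. */
static bool demo_nextevt_and(unsigned int caps)
{
	return (caps & DEMO_CAP_NEXTEVT) != 0;
}
```

For caps = DEMO_CAP_TICK | DEMO_CAP_NEXTEVT the equality form is false and the mask form is true.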


> + */
> +int clockevents_set_next_event(ktime_t expires, int force)
> +{
> +	struct local_events *devices = &__get_cpu_var(local_eventdevices);
> +	struct clock_event_device *nextevt = devices->nextevt;
> +	int64_t delta = ktime_to_ns(ktime_sub(expires, ktime_get()));
> +
> +	if (delta <= 0 && !force) {
> +		devices->expires_next.tv64 = KTIME_MAX;
> +		return -ETIME;
> +	}

hmmm so if I set a timer 10 nsec in the future, and then I get an
interrupt, I suddenly get an infinite time timer? Sounds more like a
case of "please just run it right away"
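
One way to read the "run it right away" contract is sketched below. It is a simplified model with hypothetical names, using plain nanosecond integers instead of ktime_t; it does not claim to describe what the patch does with expires_next. An already expired, non-forced request is reported with -ETIME so the caller can run the handler immediately, and anything else is clamped to the device minimum before programming.

```c
#include <stdint.h>
#include <errno.h>

/*
 * Hypothetical helper: compute the delta to program into the device.
 * Returns 0 and stores the delta, or -ETIME if the event is already due
 * and the caller did not ask to force programming.
 */
static int demo_next_event_delta(int64_t now_ns, int64_t expires_ns,
				 int64_t min_delta_ns, int force,
				 int64_t *delta_out)
{
	int64_t delta = expires_ns - now_ns;

	if (delta <= 0 && !force)
		return -ETIME;		/* caller handles the expiry right away */

	if (delta < min_delta_ns)
		delta = min_delta_ns;	/* never program below the device minimum */

	*delta_out = delta;
	return 0;
}
```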


> + * Resume the cpu local clock events
> + */
> +static void clockevents_resume_local_events(void *arg)
> +{
> +	struct local_events *devices = &__get_cpu_var(local_eventdevices);
> +	int i;
> +
> +	for (i = 0; i < devices->installed; i++) {
> +		if (devices->events[i].real_caps)
> +			startup_event(devices->events[i].event,
> +				      devices->events[i].real_caps);
> +	}
> +	touch_softlockup_watchdog();
> +}

what is this watchdog touching for?

> +static int clockevents_cpu_notify(struct notifier_block *self,
> +				  unsigned long action, void *hcpu)
> +{
> +	switch(action) {
> +	case CPU_UP_PREPARE:
> +		break;

don't you want to start the per cpu timer in such a case?




-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 06/19] ACPI: Keep track of timer broadcast
  2006-11-09 23:38 ` [patch 06/19] ACPI: Keep track of timer broadcast Thomas Gleixner
@ 2006-11-10  9:51   ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10  9:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:

Acked-by: Arjan van de Ven <arjan@linux.intel.com>

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 07/19] ACPI: Add state propagation for dynamic broadcasting
  2006-11-09 23:38 ` [patch 07/19] ACPI: Add state propagation for dynamic broadcasting Thomas Gleixner
@ 2006-11-10  9:52   ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10  9:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> plain text document attachment
> (acpi-add-hres-dyntick-broadcast-support.patch)

> +		lapic_timer_idle_broadcast(broadcast);
> +}

is this really lapic specific?

anyway Acked-by: Arjan van de Ven <arjan@linux.intel.com>
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 08/19] i386: cleanup apic code
  2006-11-09 23:38 ` [patch 08/19] i386: cleanup apic code Thomas Gleixner
@ 2006-11-10 10:04   ` Arjan van de Ven
  2006-11-10 10:16     ` Thomas Gleixner
  0 siblings, 1 reply; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

>  
>  /*
>   * Knob to control our willingness to enable the local APIC.
> + *
> + * -1=force-disable, +1=force-enable

mind doing 2 defines for these? Makes things more readable I suspect

> -	return maxlvt;
> +	return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2;
>  }

why not use lapic_is_integrated() here?
> \
> +	if (cpu_has_tsc)
> +		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "

please put "approximated at" or something here; or people will call
support lines if they bought a 3.4GHz processor and this shows 3.39999GHz



> +EXPORT_SYMBOL(switch_APIC_timer_to_ipi);

why is this exported at all? Modules really shouldn't be touching apic
level details.... 



this patch is extremely difficult to review because diff has made a mess
out of it ;(

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 09/19] i386: Convert to clock event devices
  2006-11-09 23:38 ` [patch 09/19] i386: Convert to clock event devices Thomas Gleixner
@ 2006-11-10 10:10   ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> - * Local timer interrupt handler. It does both profiling and
> - * process statistics/rescheduling.
> + * The guts of the apic timer interrupt
>   */
> -inline void smp_local_timer_interrupt(void)
> +fastcall void local_apic_timer_interrupt(struct pt_regs *regs)

please don't add more "fastcall"; CONFIG_REGPARM makes that the default
anyway



> +void __init setup_pit_timer(void)
> +{
> +	pit_clockevent.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, 32);
> +	pit_clockevent.max_delta_ns =
> +		clockevent_delta2ns(0x7FFF, &pit_clockevent);
> +	pit_clockevent.min_delta_ns =
> +		clockevent_delta2ns(0xF, &pit_clockevent);
> +	register_global_clockevent(&pit_clockevent);
> +#ifdef CONFIG_HPET_TIMER
> +	global_clock_event = &pit_clockevent;
> +#endif
> +}

ok this ifdef looks really really weird to me. Why does PIT code depend
on CONFIG_HPET ? HPET is mostly a runtime property!
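
As a side note on the mult / min_delta_ns / max_delta_ns lines in the quoted setup_pit_timer(): the arithmetic can be worked through with the sketch below. It assumes a PIT input clock of 1193182 Hz and a shift of 32, and it assumes the helpers scale as (value << shift) / divisor; the demo_ functions are stand-ins written for this note, not the kernel's div_sc() or clockevent_delta2ns().

```c
#include <stdint.h>

#define DEMO_CLOCK_TICK_RATE	1193182ULL	/* assumed PIT input clock, Hz */
#define DEMO_NSEC_PER_SEC	1000000000ULL
#define DEMO_SHIFT		32		/* assumed shift value */

/* Fixed-point factor "ticks per nanosecond", scaled by 2^shift. */
static uint64_t demo_div_sc(uint64_t ticks, uint64_t nsec, int shift)
{
	return (ticks << shift) / nsec;
}

/* Convert a device tick count back into nanoseconds with that factor. */
static uint64_t demo_delta2ns(uint64_t ticks, uint64_t mult, int shift)
{
	return (ticks << shift) / mult;
}
```

Under these assumptions, 0x7FFF ticks corresponds to roughly 27.5 ms and 0xF ticks to roughly 12.6 us, which would be the programmable range of the PIT as a clock event device.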


other than that Acked-by: Arjan van de Ven <arjan@linux.intel.com>

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 10/19] PM_timer: allow early access and move externs to a header file
  2006-11-09 23:38 ` [patch 10/19] PM_timer: allow early access and move externs to a header file Thomas Gleixner
@ 2006-11-10 10:12   ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> +/* Overrun value */
> +#define ACPI_PM_OVRRUN	1<<24

technically the PM timer can be either 24 or 32 bits and you can find
out at runtime; 24 is safe value though...
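
A small sketch of wrap-safe delta handling for the 24-bit case follows. The mask macro and the helper are hypothetical names for illustration. Note also that the quoted define has no parentheses, so an expression such as ACPI_PM_OVRRUN - 1 would expand to 1<<24 - 1 and evaluate as 1 << 23; the sketch uses a parenthesized variant.

```c
#include <stdint.h>

/* Parenthesized variant of the quoted define. */
#define DEMO_ACPI_PM_OVRRUN	(1u << 24)
/* Hypothetical mask covering the 24 valid counter bits. */
#define DEMO_ACPI_PM_MASK	(DEMO_ACPI_PM_OVRRUN - 1)

/*
 * Ticks elapsed between two reads of a 24-bit free-running counter.
 * Correct across at most one wraparound between the two reads.
 */
static uint32_t demo_pm_delta(uint32_t then, uint32_t now)
{
	return (now - then) & DEMO_ACPI_PM_MASK;
}
```

Treating a 32-bit timer as 24 bits wide this way only shortens the interval that can be measured between reads; it does not produce wrong deltas, which is presumably why 24 is the safe value.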


Acked-by: Arjan van de Ven <arjan@linux.intel.com>
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:57           ` Ingo Molnar
  2006-11-10  9:13             ` Andrew Morton
  2006-11-10  9:27             ` Andi Kleen
@ 2006-11-10 10:14             ` Alan Cox
  2006-11-10 11:19               ` Ingo Molnar
  2006-11-10 15:43               ` Chris Friesen
  2006-11-10 11:12             ` Pavel Machek
  3 siblings, 2 replies; 70+ messages in thread
From: Alan Cox @ 2006-11-10 10:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

On Fri, 2006-11-10 at 09:57 +0100, Ingo Molnar wrote:
> AFAIK Windows doesnt use it, so it's a continuous minefield for new 
> hardware to break.

Windows uses it extensively especially games. The AMD desync upset a lot
of Windows gamers.

> We should wait until CPU makers get their act together and implement a 
> TSC variant that is /architecturally promised/ to have constant 
> frequency (system bus frequency or whatever) and which never stops.

This will never happen for the really big boxes, light is just too
slow... Our current TSC handling is not perfect but the TSC is often
quite usable.

If hrtimer needs and requires we stop TSC support then we should delay
the merge of HRTIMERS until these new processors are out and common ;)

Alan


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 08/19] i386: cleanup apic code
  2006-11-10 10:16     ` Thomas Gleixner
@ 2006-11-10 10:16       ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:16 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Fri, 2006-11-10 at 11:16 +0100, Thomas Gleixner wrote:
> > > +EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
> > 
> > why is this exported at all? Modules really shouldn't be touching apic
> > level details.... 
> 
> This is exported for ACPI to handle the C3 stops lapic hell.


sounds like that's so internal that it truly deserves to be a _GPL export

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 08/19] i386: cleanup apic code
  2006-11-10 10:04   ` Arjan van de Ven
@ 2006-11-10 10:16     ` Thomas Gleixner
  2006-11-10 10:16       ` Arjan van de Ven
  0 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-10 10:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Fri, 2006-11-10 at 11:04 +0100, Arjan van de Ven wrote:
> >  
> >  /*
> >   * Knob to control our willingness to enable the local APIC.
> > + *
> > + * -1=force-disable, +1=force-enable
> 
> mind doing 2 defines for these? Makes things more readable I suspect

Yep.

> > -	return maxlvt;
> > +	return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2;
> >  }
> 
> why not use lapic_is_integrated() here?

oops.

> > \
> > +	if (cpu_has_tsc)
> > +		apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
> 
> please put "approximated at" or something here; or people will call
> support lines if they bought a 3.4GHz processor and this shows 3.39999GHz

:)

> > +EXPORT_SYMBOL(switch_APIC_timer_to_ipi);
> 
> why is this exported at all? Modules really shouldn't be touching apic
> level details.... 

This is exported for ACPI to handle the C3 stops lapic hell.

	tglx



^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 11/19] i386: Rework local APIC calibration
  2006-11-09 23:38 ` [patch 11/19] i386: Rework local APIC calibration Thomas Gleixner
@ 2006-11-10 10:17   ` Arjan van de Ven
  2006-11-10 10:23     ` Thomas Gleixner
  2006-11-10 11:10     ` Ingo Molnar
  0 siblings, 2 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> plain text document attachment (i386-lapic-calibrate-timer.patch)
> From: Thomas Gleixner <tglx@linutronix.de>

One question: why do the irq measurement at all if pmtimer is
available? 

Acked-by: Arjan van de Ven <arjan@linux.intel.com>
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 11/19] i386: Rework local APIC calibration
  2006-11-10 10:17   ` Arjan van de Ven
@ 2006-11-10 10:23     ` Thomas Gleixner
  2006-11-10 11:10     ` Ingo Molnar
  1 sibling, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-10 10:23 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Fri, 2006-11-10 at 11:17 +0100, Arjan van de Ven wrote:
> On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> > plain text document attachment (i386-lapic-calibrate-timer.patch)
> > From: Thomas Gleixner <tglx@linutronix.de>
> 
> One question: why do the irq measurement at all if pmtimer is
> available? 

Good point. OTOH this works everywhere.

	tglx


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 12/19] high-res timers: core
  2006-11-09 23:38 ` [patch 12/19] high-res timers: core Thomas Gleixner
@ 2006-11-10 10:26   ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel

On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> + * hrtimer callback modes:
> + *
> + *	HRTIMER_CB_SOFTIRQ:		Callback must run in softirq context
> + *	HRTIMER_CB_IRQSAFE:		Callback may run in hardirq context
> + *	HRTIMER_CB_IRQSAFE_NO_RESTART:	Callback may run in hardirq context and
> + *					does not restart the timer
> + *	HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:	Callback must run in softirq context
> + *					Special mode for tick emulation

This naming is treacherous (or the comment is wrong); NO_SOFTIRQ
suggests "can't run in softirq" but your comment says "must run in
softirq".. which is it?





> +/**
> + * hrtimer_clock_notify - A clock source or a clock event has been installed
> + *
> + * Notify the per cpu softirqs to recheck the clock sources and events
> + */
> +void hrtimer_clock_notify(void)
> +{
> +	int i;
> +
> +	if (hrtimer_hres_enabled) {
> +		for_each_possible_cpu(i)

hmm. possible or online or .. 


If you fix the comment/define: 
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  5:10     ` Andi Kleen
  2006-11-10  8:10       ` Thomas Gleixner
@ 2006-11-10 10:28       ` Arjan van de Ven
  2006-11-10 10:30         ` Andi Kleen
  1 sibling, 1 reply; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: john stultz, Thomas Gleixner, Andrew Morton, LKML, Ingo Molnar,
	Len Brown, Roman Zippel

On Fri, 2006-11-10 at 06:10 +0100, Andi Kleen wrote:
> Very sad. This will make a lot of people unhappy, even to the point
> where they might prefer disabling noidlehz over super slow gettimeofday. 
> I assume you at least have a suitable command line option for that, right?
> 
> Can we get a summary on which systems the TSC is considered unstable?

the part where it stops in idle...
(the rest is fixed in recent enough hw)

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:28       ` Arjan van de Ven
@ 2006-11-10 10:30         ` Andi Kleen
  2006-11-10 10:37           ` Arjan van de Ven
  0 siblings, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2006-11-10 10:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: john stultz, Thomas Gleixner, Andrew Morton, LKML, Ingo Molnar,
	Len Brown, Roman Zippel

On Friday 10 November 2006 11:28, Arjan van de Ven wrote:
> On Fri, 2006-11-10 at 06:10 +0100, Andi Kleen wrote:
> > Very sad. This will make a lot of people unhappy, even to the point
> > where they might prefer disabling noidlehz over super slow gettimeofday. 
> > I assume you at least have a suitable command line option for that, right?
> > 
> > Can we get a summary on which systems the TSC is considered unstable?
> 
> the part where it stops in idle...

That is handled by if (intel && C3 available) disable

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  9:13             ` Andrew Morton
  2006-11-10  9:29               ` Andi Kleen
@ 2006-11-10 10:35               ` Arjan van de Ven
  2006-11-10 10:47                 ` Andi Kleen
  1 sibling, 1 reply; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Roman Zippel

> We're limping along in a semi-OK fashion with the TSC. 

that's because we fake it a heck of a lot; like after C3 we just make
the kernel guesstimate how much to progress it so that it has just enough
duct tape on it to not totally fall apart ;(

There's no easy answer. We can keep trying to duct tape the TSC everywhere
it sort of breaks (cpu frequency changes on older chips, C3 idle (which
old kernels hit less often just because of the constant timer ticks),
cross cpu drifts and offsets etc etc). 
What that would need at minimum is
1) a per cpu "offset" that gets added to whatever we read from rdtsc
instruction
2) a per cpu "multiplier" or something that gets applied to tsc deltas
3) all code that gets to mop up where TSC breaks (cpuspeed and C3 power
states) use "other timers" to adjust the offset/multiplier values on a
per cpu basis, rather than "hardware TSC".

I suspect that is enough to mostly keep it limping along. It's not
cheap, but it moves the costs mostly to the places where the hardware
can't do it, so if you want to call gettimeofday() in a tight loop at
least you don't pay the hpet tax. (only an add and maybe a mul but those
are cheap and effectively unavoidable if we want to keep the illusion
alive)
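
The three points above could be sketched roughly as follows. This is only an illustration of the idea under stated assumptions: the struct, its fields and both function names are hypothetical, the raw TSC value is passed in as a parameter instead of being read with rdtsc, and the reference time would come from one of the "other timers" (PM timer, HPET).

```c
#include <stdint.h>

/* Hypothetical per-CPU correction state. */
struct demo_tsc_adj {
	uint64_t last_tsc;	/* raw TSC value at the last resync */
	uint64_t base_ns;	/* reference time at the last resync (the "offset") */
	uint32_t mult;		/* ns per cycle, scaled by 2^shift (the "multiplier") */
	uint32_t shift;
};

/* Fast path: one subtract, one multiply, one shift, one add. */
static uint64_t demo_tsc_to_ns(const struct demo_tsc_adj *a, uint64_t raw_tsc)
{
	uint64_t delta = raw_tsc - a->last_tsc;

	/* delta * mult can overflow for very long intervals; resync bounds it */
	return a->base_ns + ((delta * a->mult) >> a->shift);
}

/*
 * Slow path, run where the hardware TSC cannot be trusted (after C3,
 * after a frequency change): re-anchor against a reference clock.
 */
static void demo_tsc_resync(struct demo_tsc_adj *a, uint64_t raw_tsc,
			    uint64_t ref_ns, uint32_t new_mult)
{
	a->last_tsc = raw_tsc;
	a->base_ns = ref_ns;
	a->mult = new_mult;
}
```

The expensive reference-clock read happens only in the resync path, so a tight gettimeofday() loop pays just the arithmetic in the fast path.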

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:30         ` Andi Kleen
@ 2006-11-10 10:37           ` Arjan van de Ven
  0 siblings, 0 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: john stultz, Thomas Gleixner, Andrew Morton, LKML, Ingo Molnar,
	Len Brown, Roman Zippel

On Fri, 2006-11-10 at 11:30 +0100, Andi Kleen wrote:
> On Friday 10 November 2006 11:28, Arjan van de Ven wrote:
> > On Fri, 2006-11-10 at 06:10 +0100, Andi Kleen wrote:
> > > Very sad. This will make a lot of people unhappy, even to the point
> > > where they might prefer disabling noidlehz over super slow gettimeofday. 
> > > I assume you at least have a suitable command line option for that, right?
> > > 
> > > Can we get a summary on which systems the TSC is considered unstable?
> > 
> > the part where it stops in idle...
> 
> That is handled by if (intel && C3 available) disable

I'm not so sure it doesn't stop on AMD; the ACPI spec at least allows
it; just that I've seen few AMD CPUs that actually have C3, but that
could be a matter of time.


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:35               ` Arjan van de Ven
@ 2006-11-10 10:47                 ` Andi Kleen
  2006-11-10 10:55                   ` Arjan van de Ven
  0 siblings, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2006-11-10 10:47 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Ingo Molnar, tglx, john stultz, LKML, Len Brown,
	Roman Zippel

On Friday 10 November 2006 11:35, Arjan van de Ven wrote:
> 
> > We're limping along in a semi-OK fashion with the TSC. 
> 
> that's because we fake it a heck of a lot; like after C3 we just make
> the kernel guesstimate how much to progress it so that it has just enough
> duct tape on it to not totally fall apart ;(

Do we? Where?  AFAIK we just do some resetting after cpu frequency
changes, but on C3 TSC is just disabled globally.

That is better than it sounds.

Most systems don't have C3 right now. And on those that have
(laptops) it tends to be not that critical because they normally
don't run workload where gettimeofday() is really time critical
(and nobody expects them to be particularly fast anyways)

[... proposal for per CPU TSC state snipped ...]

All is being worked on.

-Andi

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:47                 ` Andi Kleen
@ 2006-11-10 10:55                   ` Arjan van de Ven
  2006-11-10 11:13                     ` Ingo Molnar
  2006-11-10 11:28                     ` Andi Kleen
  0 siblings, 2 replies; 70+ messages in thread
From: Arjan van de Ven @ 2006-11-10 10:55 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Ingo Molnar, tglx, john stultz, LKML, Len Brown,
	Roman Zippel


> 
> Do we? Where?  AFAIK we just do some resetting after cpu frequency
> changes, but on C3 TSC is just disabled globally.
> 
> That is better than it sounds.

is it?
> 
> Most systems don't have C3 right now. And on those that have
> (laptops) it tends to be not that critical because they normally
> don't run workload where gettimeofday() is really time critical
> (and nobody expects them to be particularly fast anyways)

and that got changed when the blade people decided to start using laptop
processors ......

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 11/19] i386: Rework local APIC calibration
  2006-11-10 10:17   ` Arjan van de Ven
  2006-11-10 10:23     ` Thomas Gleixner
@ 2006-11-10 11:10     ` Ingo Molnar
  1 sibling, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10 11:10 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Thomas Gleixner, Andrew Morton, LKML, Len Brown, John Stultz,
	Andi Kleen, Roman Zippel


* Arjan van de Ven <arjan@infradead.org> wrote:

> On Thu, 2006-11-09 at 23:38 +0000, Thomas Gleixner wrote:
> > plain text document attachment (i386-lapic-calibrate-timer.patch)
> > From: Thomas Gleixner <tglx@linutronix.de>
> 
> One question: why do the irq measurement at all if pmtimer is 
> available?

the pmtimer read will always return zero on platforms where there's no 
pm-timer clock - in this case the irq measurement is what calibrates.

	Ingo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:10       ` Thomas Gleixner
  2006-11-10  8:50         ` Andrew Morton
@ 2006-11-10 11:11         ` Pavel Machek
  1 sibling, 0 replies; 70+ messages in thread
From: Pavel Machek @ 2006-11-10 11:11 UTC (permalink / raw)
  To: Thomas Gleixner, vojtech, jbohac
  Cc: Andi Kleen, john stultz, Andrew Morton, LKML, Ingo Molnar,
	Len Brown, Arjan van de Ven, Roman Zippel

Hi!

This discussion (lkml) might be useful to you...
								Pavel

On Fri 2006-11-10 09:10:06, Thomas Gleixner wrote:
> On Fri, 2006-11-10 at 06:10 +0100, Andi Kleen wrote:
> > > >  		verify_tsc_freq_timer.function = verify_tsc_freq;
> > > >  		verify_tsc_freq_timer.expires =
> > > 
> > > 
> > > Hmmm. I wish this patch was unnecessary, but I don't see an easy
> > > solution. 
> > 
> > Very sad. This will make a lot of people unhappy, even to the point
> > where they might prefer disabling noidlehz over super slow gettimeofday. 
> > I assume you at least have a suitable command line option for that, right?
> 
> Yes it is sad. And the saddest part is that AMD and Intel have been asked
> to fix that more than 5 years ago. They did not get their brain straight
> and now we are the dimwits.
> 
> > Can we get a summary on which systems the TSC is considered unstable?
> > Normally we assume if it's stable enough for gettimeofday it should
> > be stable enough for longer delays too.
> 
> TSC is simply a nightmare:
> 
> - Frequency changes with CPU clock
> - Unsynced across CPUs
> - Stops in C3, which makes it completely unusable
> 
> Once you take away periodic interrupts it is simply broken. AMD and
> Intel can run in circles, it does not get better.
> 
> 	tglx
> 
> 
> 
> 

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  8:57           ` Ingo Molnar
                               ` (2 preceding siblings ...)
  2006-11-10 10:14             ` Alan Cox
@ 2006-11-10 11:12             ` Pavel Machek
  2006-11-10 11:48               ` Ingo Molnar
  3 siblings, 1 reply; 70+ messages in thread
From: Pavel Machek @ 2006-11-10 11:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

Hi!

> > If so, could that function use the PIT/pmtimer/etc for working out if 
> > the TSC is bust, rather than directly using jiffies?
> 
> there's no reliable way to figure out the TSC is bust: some CPUs have a 
> slight 'skew' between cores for example. On some systems the TSC might 
> skew between sockets. A CPU might break its TSC only once some 

But we could still do a whitelist?
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:55                   ` Arjan van de Ven
@ 2006-11-10 11:13                     ` Ingo Molnar
  2006-11-10 11:28                     ` Andi Kleen
  1 sibling, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10 11:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Andrew Morton, tglx, john stultz, LKML, Len Brown,
	Roman Zippel


* Arjan van de Ven <arjan@infradead.org> wrote:

> > Most systems don't have C3 right now. And on those that have 
> > (laptops) it tends to be not that critical because they normally 
> > don't run workload where gettimeofday() is really time critical (and 
> > nobody expects them to be particularly fast anyways)
> 
> and that got changed when the blade people decided to start using 
> laptop processors ......

and some systems disable the lapic in C2 already: BIOSes started doing 
lowlevel-C3 in their C2 functionality and lie to the OS about it.

	Ingo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:14             ` Alan Cox
@ 2006-11-10 11:19               ` Ingo Molnar
  2006-11-10 15:43               ` Chris Friesen
  1 sibling, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10 11:19 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andrew Morton, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Fri, 2006-11-10 at 09:57 +0100, Ingo Molnar wrote:
> > AFAIK Windows doesnt use it, so it's a continuous minefield for new 
> > hardware to break.
> 
> Windows uses it extensively especially games. The AMD desync upset a 
> lot of Windows gamers.

well, i meant the Windows kernel itself, not applications. (maybe the 
Windows kernel uses it on SMP systems where the TSC /used to be/ pretty 
stable, i dont know)

> > We should wait until CPU makers get their act together and implement 
> > a TSC variant that is /architecturally promised/ to have constant 
> > frequency (system bus frequency or whatever) and which never stops.
> 
> This will never happen for the really big boxes, light is just too 
> slow... [...]

that's not a problem - time goes as fast as light [by definition] :-)

> If hrtimer needs and requires we stop TSC support [...]

no, it doesnt, so there's no real friction here. We just observed that 
in the past 10 years no generally working TSC-based gettimeofday was 
written (and i wrote the first version of it for the Pentium, so the 
blame is on me too), and that we might be better off without it. If 
someone can pull off a working TSC-based gettimeofday() implementation 
then there's no objection from us.

	Ingo


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:55                   ` Arjan van de Ven
  2006-11-10 11:13                     ` Ingo Molnar
@ 2006-11-10 11:28                     ` Andi Kleen
  1 sibling, 0 replies; 70+ messages in thread
From: Andi Kleen @ 2006-11-10 11:28 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Ingo Molnar, tglx, john stultz, LKML, Len Brown,
	Roman Zippel

On Friday 10 November 2006 11:55, Arjan van de Ven wrote:

> > 
> > Most systems don't have C3 right now. And on those that have
> > (laptops) it tends to be not that critical because they normally
> > don't run workload where gettimeofday() is really time critical
> > (and nobody expects them to be particularly fast anyways)
> 
> and that got changed when the blade people decided to start using laptop
> processors ......

Well those will be handled eventually. Currently they just have
a slower gettimeofday.

But the majority of systems is not impacted.

BTW if someone really wants to have fast gettimeofday on a blade
they can just disable C3 and force TSC.

-Andi


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 11:12             ` Pavel Machek
@ 2006-11-10 11:48               ` Ingo Molnar
  2006-11-10 11:56                 ` Andi Kleen
  2006-11-10 12:00                 ` Pavel Machek
  0 siblings, 2 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10 11:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel


* Pavel Machek <pavel@ucw.cz> wrote:

> > > If so, could that function use the PIT/pmtimer/etc for working out 
> > > if the TSC is bust, rather than directly using jiffies?
> > 
> > there's no reliable way to figure out the TSC is bust: some CPUs 
> > have a slight 'skew' between cores for example. On some systems the 
> > TSC might skew between sockets. A CPU might break its TSC only once 
> > some
> 
> But we could still do a whitelist?

we could, but it would have to be almost empty right now :-) Reason: 
even on systems that have (hardware-initialized) 'perfect' TSCs and 
which do not support any frequency scaling or power-saving mode, our 
current TSC initialization on SMP systems introduces a small (1-2 usecs) 
skew.

but even that limited set of systems is now mostly obsolete: no 
multi-core CPU based system i'm aware of would qualify. I have written 
user-space testcode for TSC and gettimeofday warps, see:

   http://redhat.com/~mingo/time-warp-test/time-warp-test.c

no SMP system i have passes at the moment, running 2.6.17/18:

 --------------------------------------
 jupiter:~> ./time-warp-test
 4 CPUs, running 4 parallel test-tasks.
 checking for time-warps via:
 - read time stamp counter (RDTSC) instruction (cycle resolution)
 - gettimeofday (TOD) syscall (usec resolution)

 [...]
 new TSC-warp maximum:     -6392 cycles, 0000294e1f3b6100 -> 0000294e1f3b4808
 | # of TSC-warps:183606 |

 --------------------------------------
 venus:~> ./time-warp-test
 4 CPUs, running 4 parallel test-tasks.
 [...]
 new TSC-warp maximum:     -1328 cycles, 00001d9549c6c738 -> 00001d9549c6c208
 | # of TSC-warps:332510 |

 --------------------------------------
 neptune:~> ./time-warp-test
 2 CPUs, running 2 parallel test-tasks.
 [...]
 new TSC-warp maximum:      -332 cycles, 0000005e00b1b89e -> 0000005e00b1b752
 | # of TSC-warps:340 |

 [and i'm too lazy to turn on the 8-way now, but that has TSC warps too.]

so i'd love to see non-warping time, but after 10 years of trying i'm 
not holding my breath.
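
As a rough illustration of what such a test measures (a minimal sketch, not the code from time-warp-test.c; the type and function names are made up for this example): a warp is any pair of consecutive readings of a shared counter where the later reading is smaller than the earlier one. The helper below scans a sequence of readings, assumed to be in the order they were observed, and reports the number of warps and the largest backward step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary type for this sketch; not from time-warp-test.c. */
struct warp_stats {
	unsigned long	nr_warps;	/* number of backward steps seen */
	int64_t		max_warp;	/* most negative step in cycles, 0 if none */
};

/*
 * Scan readings in the order they were observed (e.g. serialized through
 * a shared variable by several CPUs) and record every case where the
 * counter appears to go backwards.
 */
static struct warp_stats scan_for_warps(const uint64_t *readings, size_t n)
{
	struct warp_stats st = { 0, 0 };
	size_t i;

	assert(readings != NULL || n == 0);

	for (i = 1; i < n; i++) {
		/* Signed difference: negative means time went backwards. */
		int64_t delta = (int64_t)(readings[i] - readings[i - 1]);

		if (delta < 0) {
			st.nr_warps++;
			if (delta < st.max_warp)
				st.max_warp = delta;
		}
	}
	return st;
}
```

Feeding it the pair from the jupiter run, 0x0000294e1f3b6100 followed by 0x0000294e1f3b4808, yields one warp of -6392 cycles, which matches the maximum quoted in the output.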

	Ingo


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 11:48               ` Ingo Molnar
@ 2006-11-10 11:56                 ` Andi Kleen
  2006-11-10 13:12                   ` Ingo Molnar
  2006-11-10 12:00                 ` Pavel Machek
  1 sibling, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2006-11-10 11:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pavel Machek, Andrew Morton, tglx, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel


> we could, but it would have to be almost empty right now :-) Reason: 
> even on systems that have (hardware-initialized) 'perfect' TSCs and 
> which do not support any frequency scaling or power-saving mode, our 
> current TSC initialization on SMP systems introduces a small (1-2 usecs) 
> skew.

On Intel we don't sync the TSC anymore and on most systems users seem
to be happy at least. And on multicore AMD it is drifting anyways and 
usually turned off.

-Andi


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 11:48               ` Ingo Molnar
  2006-11-10 11:56                 ` Andi Kleen
@ 2006-11-10 12:00                 ` Pavel Machek
  2006-11-10 13:14                   ` Ingo Molnar
  1 sibling, 1 reply; 70+ messages in thread
From: Pavel Machek @ 2006-11-10 12:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

Hi!

> > > > If so, could that function use the PIT/pmtimer/etc for working out 
> > > > if the TSC is bust, rather than directly using jiffies?
> > > 
> > > there's no reliable way to figure out the TSC is bust: some CPUs 
> > > have a slight 'skew' between cores for example. On some systems the 
> > > TSC might skew between sockets. A CPU might break its TSC only once 
> > > some
> > 
> > But we could still do a whitelist?
> 
> we could, but it would have to be almost empty right now :-) Reason: 

Well, if it would contain at least 50% of the UP machines... that
would be a reasonably long list for a start.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 11:56                 ` Andi Kleen
@ 2006-11-10 13:12                   ` Ingo Molnar
  0 siblings, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10 13:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Pavel Machek, Andrew Morton, tglx, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel


* Andi Kleen <ak@suse.de> wrote:

> > we could, but it would have to be almost empty right now :-) Reason: 
> > even on systems that have (hardware-initialized) 'perfect' TSCs and 
> > which do not support any frequency scaling or power-saving mode, our 
> > current TSC initialization on SMP systems introduces a small (1-2 usecs) 
> > skew.
> 
> On Intel we don't sync the TSC anymore [...]

yeah, after i reported this a few months ago ;-)

	Ingo


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 12:00                 ` Pavel Machek
@ 2006-11-10 13:14                   ` Ingo Molnar
  0 siblings, 0 replies; 70+ messages in thread
From: Ingo Molnar @ 2006-11-10 13:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, tglx, Andi Kleen, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel


* Pavel Machek <pavel@ucw.cz> wrote:

> > we could, but it would have to be almost empty right now :-) Reason:
> 
> Well, if it would contain at least 50% of the UP machines... that 
> would be a reasonably long list for a start.

which 50%? Does it include those where the TSC slows down due to a 
thermal event SMM?

	Ingo


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10 10:14             ` Alan Cox
  2006-11-10 11:19               ` Ingo Molnar
@ 2006-11-10 15:43               ` Chris Friesen
  1 sibling, 0 replies; 70+ messages in thread
From: Chris Friesen @ 2006-11-10 15:43 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Andrew Morton, tglx, Andi Kleen, john stultz, LKML,
	Len Brown, Arjan van de Ven, Roman Zippel

Alan Cox wrote:
> Ar Gwe, 2006-11-10 am 09:57 +0100, ysgrifennodd Ingo Molnar:

>>We should wait until CPU makers get their act together and implement a 
>>TSC variant that is /architecturally promised/ to have constant 
>>frequency (system bus frequency or whatever) and which never stops.
> 
> This will never happen for the really big boxes, light is just too
> slow... Our current TSC handling is not perfect but the TSC is often
> quite usable.

This hypothetical clock wouldn't have to run full speed, would it?  You 
could have a 1MHz clock distributed across even a large system fairly 
easily.

Wouldn't that be good enough?

Chris



* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-10  9:29               ` Andi Kleen
@ 2006-11-11 11:14                 ` Thomas Gleixner
  2006-11-11 13:51                   ` Andi Kleen
  0 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-11 11:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Ingo Molnar, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

On Fri, 2006-11-10 at 10:29 +0100, Andi Kleen wrote:
> > But that's different.
> > 
> > We're limping along in a semi-OK fashion with the TSC.  But now Thomas is
> > proposing that we effectively kill it off for all x86 because of hrtimers.
> 
> I'm totally against that.

I'm working on that. The general disable is indeed overkill. All I need
to prevent is switching over to highres/dyntick when there is no
fallback (e.g. pm_timer) available. Otherwise I end up in a circular
dependency, as the emulated tick depends on the monotonic clock.
 
> > And afaict the reason for that is that we're using jiffies to determine if
> > the TSC has gone bad, and that test is getting false positives.
> 
> The i386 clocksource had always trouble with that. e.g.  I have a box
> where the TSC works perfectly fine on a 64bit kernel, but since the new i386
> clocksource code is in it always insists on disabling it shortly after boot.
> My guess is that some of the checks in there are just broken and need
> to be fixed.

It's the unconditional mark_unstable call in ACPI C2 state. /me looks.

	tglx




* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-11 11:14                 ` Thomas Gleixner
@ 2006-11-11 13:51                   ` Andi Kleen
  2006-11-11 13:58                     ` Thomas Gleixner
  0 siblings, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2006-11-11 13:51 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, Ingo Molnar, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel


> > > And afaict the reason for that is that we're using jiffies to determine if
> > > the TSC has gone bad, and that test is getting false positives.
> > 
> > The i386 clocksource had always trouble with that. e.g.  I have a box
> > where the TSC works perfectly fine on a 64bit kernel, but since the new i386
> > clocksource code is in it always insists on disabling it shortly after boot.

shortly after boot means in user space here, not during the first idling.

> > My guess is that some of the checks in there are just broken and need
> > to be fixed.
> 
> It's the unconditional mark_unstable call in ACPI C2 state. /me looks.

The system doesn't support C2 states. It's an older single socket Athlon 64 
with VIA chipset. I haven't looked in detail on why it fails.

-Andi


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-11 13:51                   ` Andi Kleen
@ 2006-11-11 13:58                     ` Thomas Gleixner
  2006-11-11 13:59                       ` Andi Kleen
  0 siblings, 1 reply; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-11 13:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Ingo Molnar, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

On Sat, 2006-11-11 at 14:51 +0100, Andi Kleen wrote:
> > > > And afaict the reason for that is that we're using jiffies to determine if
> > > > the TSC has gone bad, and that test is getting false positives.
> > > 
> > > The i386 clocksource had always trouble with that. e.g.  I have a box
> > > where the TSC works perfectly fine on a 64bit kernel, but since the new i386
> > > clocksource code is in it always insists on disabling it shortly after boot.
> 
> shortly after boot means in user space here, not during the first idling.
> 
> > > My guess is that some of the checks in there are just broken and need
> > > to be fixed.
> > 
> > It's the unconditional mark_unstable call in ACPI C2 state. /me looks.
> 
> The system doesn't support C2 states. It's an older single socket Athlon 64 
> with VIA chipset. I haven't looked in detail on why it fails.

Does it have cpu frequency changing?

	tglx




* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-11 13:58                     ` Thomas Gleixner
@ 2006-11-11 13:59                       ` Andi Kleen
  2006-11-11 14:08                         ` Thomas Gleixner
  0 siblings, 1 reply; 70+ messages in thread
From: Andi Kleen @ 2006-11-11 13:59 UTC (permalink / raw)
  To: tglx
  Cc: Andrew Morton, Ingo Molnar, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

On Saturday 11 November 2006 14:58, Thomas Gleixner wrote:

> > 
> > > > My guess is that some of the checks in there are just broken and need
> > > > to be fixed.
> > > 
> > > It's the unconditional mark_unstable call in ACPI C2 state. /me looks.
> > 
> > The system doesn't support C2 states. It's an older single socket Athlon 64 
> > with VIA chipset. I haven't looked in detail on why it fails.
> 
> Does it have cpu frequency changing?

Yep. But only OS controlled one (powernow).

Most likely it happens when ondemand starts doing its thing.

-Andi


* Re: [patch 13/19] GTOD: Mark TSC unusable for highres timers
  2006-11-11 13:59                       ` Andi Kleen
@ 2006-11-11 14:08                         ` Thomas Gleixner
  0 siblings, 0 replies; 70+ messages in thread
From: Thomas Gleixner @ 2006-11-11 14:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Ingo Molnar, john stultz, LKML, Len Brown,
	Arjan van de Ven, Roman Zippel

On Sat, 2006-11-11 at 14:59 +0100, Andi Kleen wrote:
> On Saturday 11 November 2006 14:58, Thomas Gleixner wrote:
> 
> > > 
> > > > > My guess is that some of the checks in there are just broken and need
> > > > > to be fixed.
> > > > 
> > > > It's the unconditional mark_unstable call in ACPI C2 state. /me looks.
> > > 
> > > The system doesn't support C2 states. It's an older single socket Athlon 64 
> > > with VIA chipset. I haven't looked in detail on why it fails.
> > 
> > Does it have cpu frequency changing?
> 
> Yep. But only OS controlled one (powernow).
> 
> Most likely it happens when ondemand starts doing its thing.

Yes, that's one of the criteria the tsc clocksource is using. I'm
looking into that right now.

	tglx




* Re: [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1
  2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
                   ` (18 preceding siblings ...)
  2006-11-09 23:38 ` [patch 19/19] debugging feature: timer stats Thomas Gleixner
@ 2006-11-23 22:24 ` Roman Zippel
  19 siblings, 0 replies; 70+ messages in thread
From: Roman Zippel @ 2006-11-23 22:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Arjan van de Ven, Andi Kleen

Hi,

On Thu, 9 Nov 2006, Thomas Gleixner wrote:

> Andrew,
> 
> this is a drop in replacement for the following patches in 2.6.19-rc5-mm1:
> 
> hrtimers-state-tracking.patch
> up to
> acpi-verify-lapic-timer-fix.patch

There is still the gtod-exponential-update_wall_time patch before that. I 
explained previously why it's wrong and how to fix this properly. Andrew, 
please drop this one.

http://www.ussg.iu.edu/hypermail/linux/kernel/0609.3/1320.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0609.3/1303.html

Something I also wanted to mention about the OLS paper: It's an 
interesting read and answers a few questions, but not all. It concentrates 
very much on the past (previous and current implementations); what I'm 
missing are more details on how it can be used in the future. IMO this is 
very important information regarding merging, i.e. how can this be applied 
to our various architectures. This is where I have my doubts; more 
questions about it later.

The paper stresses the point that it provides a generic infrastructure, 
but as such it also brings some amazing complexities. Dedicated 
implementations often have the advantage of being simpler and faster (I'm 
not saying that current ones are). How does your implementation keep the 
source and runtime complexities under control? Such generic frameworks 
have the tendency to grow - new requirements have to be met and thus 
complexity further increases.

bye, Roman


* Re: [patch 01/19] hrtimers: state tracking
  2006-11-09 23:38 ` [patch 01/19] hrtimers: state tracking Thomas Gleixner
  2006-11-10  9:19   ` Arjan van de Ven
@ 2006-11-23 22:26   ` Roman Zippel
  1 sibling, 0 replies; 70+ messages in thread
From: Roman Zippel @ 2006-11-23 22:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Arjan van de Ven, Andi Kleen

Hi,

On Thu, 9 Nov 2006, Thomas Gleixner wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Reintroduce ktimers feature "optimized away" by the ktimers review process:
> multiple hrtimer states to enable the running of hrtimers without holding the
> cpu-base-lock.

They were "optimized away" for a reason...

> (The "optimized" rbtree hack carried only 2 states worth of information and we
> need 4 for high resolution timers and dynamic ticks.)

If you need further flags for dynticks, then do it conditionally, but 
keep this as is (at least for now); keep it small for the simple stuff.
As others have noted your usage is confusing, and something like this is 
hard to maintain - every time a flag is added/changed, almost every user 
has to be checked to ensure the state machine stays correct. Keep the basic 
states separate, and if the flag field should be needed for another reason, 
it should be easy enough to convert. This is not needed for an initial merge.
(Same goes for the next patch.)

bye, Roman


* Re: [patch 04/19] Add a framework to manage clock event devices.
  2006-11-09 23:38 ` [patch 04/19] Add a framework to manage clock event devices Thomas Gleixner
  2006-11-10  9:47   ` Arjan van de Ven
@ 2006-11-23 22:36   ` Roman Zippel
  1 sibling, 0 replies; 70+ messages in thread
From: Roman Zippel @ 2006-11-23 22:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, LKML, Ingo Molnar, Len Brown, John Stultz,
	Arjan van de Ven, Andi Kleen

Hi,

On Thu, 9 Nov 2006, Thomas Gleixner wrote:

> From: Thomas Gleixner <tglx@linutronix.de>
> 
> We have two types of clock event devices:
> - global events (one device per system)
> - local events (one device per cpu)
> 
> We assign the various time(r) related interrupts to those devices:
> 
> - global tick (advances jiffies)
> - update process times (per cpu)
> - profiling (per cpu)
> - next timer events (per cpu)
> 
> Architectures register their clock event devices, with specific capability
> bits set, and the framework code assigns the appropriate event handler to the
> event device.  The functionality is assigned via an event handler to avoid
> runtime evaluation of the assigned function bits.
> 
> This allows to control the clock event devices without the architectures
> having to worry about the details of function assignment.  This is also a
> preliminary for high resolution timers and dynamic ticks to allow the core
> code to control the clock functionality without intrusive changes to the
> architecture code.

I have a few problems with this code and I'd really prefer that some more 
arch maintainers look at this (i.e. post it to the arch ml).
It's basically limited to only one global and one per cpu timer. Are you 
sure this is enough? Large systems may have several timers.

Even for the cases I'm interested in I have no idea how to make use of it, 
e.g. I have a somewhat limited timer, which can't be reprogrammed without 
losing accuracy, but I have two (or maybe more) of them, so I can use one 
as the general tick timer, whose interrupt can be disabled as needed, and a 
second timer can be used for dynamic timer events.
Something else I want to use a separate timer for is kernel profiling: 
currently events started from the timer tick are basically invisible, so 
I'd like to start profiling on a different timer with a different 
frequency.
Currently high resolution timers are used for quite a lot once enabled, but 
I would like to see the option to limit them, i.e. use a low resolution 
timer for standard tasks (e.g. itimer, nanosleep) and provide a separate 
high resolution posix timer to user space.

This should give some idea of the background with which I'm looking at 
this code and I'm trying to find an answer to the question, how generic 
this really is and how usable this is beyond dynamic ticks.

> +struct clock_event_device {
> +	const char	*name;
> +	unsigned int	capabilities;
> +	unsigned long	max_delta_ns;
> +	unsigned long	min_delta_ns;
> +	unsigned long	mult;
> +	int		shift;
> +	void		(*set_next_event)(unsigned long evt,
> +					  struct clock_event_device *);
> +	void		(*set_mode)(enum clock_event_mode mode,
> +				    struct clock_event_device *);
> +	void		(*event_handler)(struct pt_regs *regs);
> +};
> +
> +/*
> + * Calculate a multiplication factor for scaled math, which is used to convert
> + * nanoseconds based values to clock ticks:
> + *
> + * clock_ticks = (nanoseconds * factor) >> shift.
> + *
> + * div_sc is the rearranged equation to calculate a factor from a given clock
> + * ticks / nanoseconds ratio:
> + *
> + * factor = (clock_ticks << shift) / nanoseconds
> + */
> +static inline unsigned long div_sc(unsigned long ticks, unsigned long nsec,
> +				   int shift)
> +{
> +	uint64_t tmp = ((uint64_t)ticks) << shift;
> +
> +	do_div(tmp, nsec);
> +	return (unsigned long) tmp;
> +}

One possible problem in this area: the nsec2cycle multiplier is mostly
constant AFAICT, whereas the clock source cycle2nsec isn't (especially if 
controlled via ntp). This means this could produce slightly wrong 
results, and the error grows with the period between timer interrupts.
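
To make the arithmetic concrete, here is a minimal user-space sketch of the scaled math from the quoted comment, plus a rough estimate of the error described above. The 64-bit helper names, the 1 MHz device rate and the 100 ppm correction are assumptions picked for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* factor = (clock_ticks << shift) / nanoseconds, as in the quoted div_sc() */
static uint64_t div_sc64(uint64_t ticks, uint64_t nsec, int shift)
{
	assert(nsec != 0);
	return (ticks << shift) / nsec;
}

/* clock_ticks = (nanoseconds * factor) >> shift */
static uint64_t ns_to_ticks(uint64_t ns, uint64_t mult, int shift)
{
	return (ns * mult) >> shift;
}

/*
 * If the clock source is being corrected by ppm parts per million while
 * the event device keeps its fixed factor, an interval of interval_ns is
 * programmed off by roughly this many nanoseconds. The error is linear
 * in the interval length.
 */
static uint64_t drift_ns(uint64_t interval_ns, uint64_t ppm)
{
	return interval_ns * ppm / 1000000;
}
```

For a hypothetical 1 MHz device and shift = 32, div_sc64(1000000, 1000000000, 32) gives 4294967, and ns_to_ticks(1000000, 4294967, 32) gives 999 rather than 1000 because both steps truncate. With an assumed 100 ppm correction, drift_ns() gives 100 ns for a 1 ms interval but 100000 ns for a 1 s interval, so long dyntick sleeps are where the mismatch would show up.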

> +
> +#define MAX_CLOCK_EVENTS	4
> +#define GLOBAL_CLOCK_EVENT	MAX_CLOCK_EVENTS
> +
> +struct event_descr {
> +	struct clock_event_device *event;
> +	unsigned int mode;
> +	unsigned int real_caps;
> +	struct irqaction action;
> +};
> +
> +struct local_events {
> +	int installed;
> +	struct event_descr events[MAX_CLOCK_EVENTS];
> +	struct clock_event_device *nextevt;
> +	ktime_t	expires_next;
> +};
> +
[...]
> +static void handle_tick(struct pt_regs *regs)
> +{
> +	write_seqlock(&xtime_lock);
> +	do_timer(1);
> +	write_sequnlock(&xtime_lock);
> +}
> +
> +/*
> + * Bootup and lowres handler: ticks and update_process_times
> + */
> +static void handle_tick_update(struct pt_regs *regs)
> +{
> +	write_seqlock(&xtime_lock);
> +	do_timer(1);
> +	write_sequnlock(&xtime_lock);
> +
> +	update_process_times(user_mode(regs));
> +}
> +
> +/*
> + * Bootup and lowres handler: ticks and profiling
> + */
> +static void handle_tick_profile(struct pt_regs *regs)
> +{
> +	write_seqlock(&xtime_lock);
> +	do_timer(1);
> +	write_sequnlock(&xtime_lock);
> +
> +	profile_tick(CPU_PROFILING);
> +}
> +
> +/*
> + * Bootup and lowres handler: ticks, update_process_times and profiling
> + */
> +static void handle_tick_update_profile(struct pt_regs *regs)
> +{
> +	write_seqlock(&xtime_lock);
> +	do_timer(1);
> +	write_sequnlock(&xtime_lock);
> +
> +	update_process_times(user_mode(regs));
> +	profile_tick(CPU_PROFILING);
> +}
> +
> +/*
> + * Bootup and lowres handler: update_process_times
> + */
> +static void handle_update(struct pt_regs *regs)
> +{
> +	update_process_times(user_mode(regs));
> +}
> +
> +/*
> + * Bootup and lowres handler: update_process_times and profiling
> + */
> +static void handle_update_profile(struct pt_regs *regs)
> +{
> +	update_process_times(user_mode(regs));
> +	profile_tick(CPU_PROFILING);
> +}
> +
> +/*
> + * Bootup and lowres handler: profiling
> + */
> +static void handle_profile(struct pt_regs *regs)
> +{
> +	profile_tick(CPU_PROFILING);
> +}
> +
> +/*
> + * Noop handler when we shut down an event device
> + */
> +static void handle_noop(struct pt_regs *regs)
> +{
> +}
> +
> +/*
> + * Lookup table for bootup and lowres event assignment
> + *
> + * The event handler is chosen by the capability flags of the clock event
> + * device.
> + */
> +static void __read_mostly *event_handlers[] = {
> +	handle_noop,			/* 0: No capability selected */
> +	handle_tick,			/* 1: Tick only	*/
> +	handle_update,			/* 2: Update process times */
> +	handle_tick_update,		/* 3: Tick + update process times */
> +	handle_profile,			/* 4: Profiling int */
> +	handle_tick_profile,		/* 5: Tick + Profiling int */
> +	handle_update_profile,		/* 6: Update process times +
> +					      profiling */
> +	handle_tick_update_profile,	/* 7: Tick + update process times +
> +					      profiling */
> +#ifdef CONFIG_HIGH_RES_TIMERS
> +	hrtimer_interrupt,		/* 8: Reprogrammable event device */
> +#endif
> +};
> +
> [...]

What's the point of all these little helper functions? This looks rather 
inflexible to me. The generic code has to know about all timer events and 
even provide helpers to handle all the variations between them?
There might be a small performance benefit in doing this, usually I'm all 
for it, but I don't think this one is really worth it. A simple list of 
active events would be much simpler, e.g. something like this is what I 
have in mind:

	source->time = gettime();
	source->timeout += MAX_TIMEOUT;
	for_each_handler() {
		if (source->time >= timer->timeout) {
			...
			source->timeout = min(source->timeout, timer->timeout);
		}
	}
	reprogram_timer();

The interrupt system already maintains a list of handlers, so something 
like this should be reasonably possible without any extra overhead. AFAICT 
it could even simplify your hrtimer code by keeping realtime and 
monotonic hrtimer separate. Right now you have one loop for 
HRTIMER_MAX_CLOCK_BASES in the hrtimer code and another in the clock code 
only to calculate the next timeout.
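
A self-contained user-space sketch of the list-of-active-events idea might look like the following. The struct, the MAX_TIMEOUT_NS bound and the function names are hypothetical and only meant to show the shape of the loop; nothing here is kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on how far ahead the hardware is programmed. */
#define MAX_TIMEOUT_NS	1000000000ULL

/* One client of a timer source: tick, profiling, hrtimer base, ... */
struct timer_client {
	uint64_t	timeout;	/* absolute expiry time in ns */
	uint64_t	period;		/* re-arm interval in ns */
	void		(*handler)(struct timer_client *);
	unsigned long	fired;		/* bookkeeping for this example */
};

/* Trivial handler that only counts invocations. */
static void count_fire(struct timer_client *c)
{
	c->fired++;
}

/*
 * Run every expired client, re-arm it, and return the earliest pending
 * expiry. The return value is what would be handed to the hardware as
 * the next event.
 */
static uint64_t run_clients(uint64_t now, struct timer_client *clients,
			    size_t n)
{
	uint64_t next = now + MAX_TIMEOUT_NS;
	size_t i;

	for (i = 0; i < n; i++) {
		struct timer_client *c = &clients[i];

		assert(c->period != 0 && c->handler != NULL);

		/* Catch up if more than one period has elapsed. */
		while (now >= c->timeout) {
			c->handler(c);
			c->timeout += c->period;
		}
		if (c->timeout < next)
			next = c->timeout;
	}
	return next;
}
```

With a 1 ms tick client and a 250 us profiling client, both first due one period after time zero, calling run_clients() at now = 1000000 runs the tick once and the profiler four times, and returns 1250000 as the next event. The generic loop does not need to know what either client does.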

The generic interrupt system already has a concept of per cpu interrupts, 
IMO it's worth it to check out how it can be used for clock events. IMO 
the core clock code should have almost no idea of per cpu clocks.

Basically I have these problems with this code (maybe there is a reason 
for all this, but I don't see it from the existing documentation):
- generic code shouldn't care about the number of hw timers used
- generic code shouldn't care about how these timers are used
- the various clients shouldn't know about each other (e.g. the sched_tick 
  code in the hrtimer source is just ugly).

In this context I still don't understand why you insist on the strict 
separation of clock events and sources, these _are_ related and separating 
it like this prevents more flexible usage, e.g. to have multiple clock 
sources and timers (and in many cases they are even the same).

> +/*
> + * Setup an event device. Assign a handler and start it up
> + */
> +static void setup_event(struct event_descr *descr,
> +			struct clock_event_device *evt, unsigned int caps)
> +{
> +	void *handler = event_handlers[caps];

These flags aren't really capabilities but intended usage. Every clock 
event driver already has its possible usage hardcoded (sometimes even with 
ugly ifdefs, e.g. lapic).
Capabilities would be something like:
- cpu local/global
- reprogrammable/fixed timer
Based on this the generic code should dynamically decide how to use and 
attach the clients.
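
One way to picture this is the sketch below. All names and the selection policy are assumptions made up for illustration; this is not an existing kernel interface. Drivers describe only what the hardware can do, and a generic routine decides what each device is used for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware capabilities, as opposed to intended usage. */
enum {
	CAP_PERCPU		= 1 << 0,	/* cpu local; global if unset */
	CAP_REPROGRAMMABLE	= 1 << 1,	/* one-shot capable; fixed rate if unset */
};

enum usage {
	USE_NONE,
	USE_PERIODIC_TICK,	/* drives the global tick */
	USE_ONESHOT_EVENTS,	/* drives next-timer events */
};

struct evt_dev {
	const char	*name;
	unsigned int	caps;
	enum usage	use;
};

/*
 * Example policy: the first reprogrammable device serves one-shot
 * events, the first remaining global device serves the periodic tick,
 * and everything else stays unused.
 */
static void assign_usage(struct evt_dev *devs, size_t n)
{
	struct evt_dev *tick = NULL, *oneshot = NULL;
	size_t i;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		devs[i].use = USE_NONE;
		if ((devs[i].caps & CAP_REPROGRAMMABLE) && !oneshot)
			oneshot = &devs[i];
		else if (!tick && !(devs[i].caps & CAP_PERCPU))
			tick = &devs[i];
	}
	if (tick)
		tick->use = USE_PERIODIC_TICK;
	if (oneshot)
		oneshot->use = USE_ONESHOT_EVENTS;
}
```

Given a global fixed-rate device and a per-cpu reprogrammable one, this policy picks the former for the tick and the latter for one-shot events. If only a single reprogrammable device exists, it gets the one-shot role and the tick would have to be emulated on top of it.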

bye, Roman


end of thread, other threads:[~2006-11-23 22:37 UTC | newest]

Thread overview: 70+ messages
2006-11-09 23:38 [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Thomas Gleixner
2006-11-09 23:38 ` [patch 01/19] hrtimers: state tracking Thomas Gleixner
2006-11-10  9:19   ` Arjan van de Ven
2006-11-10  9:40     ` Andrew Morton
2006-11-10  9:45       ` Thomas Gleixner
2006-11-23 22:26   ` Roman Zippel
2006-11-09 23:38 ` [patch 02/19] hrtimers: clean up callback tracking Thomas Gleixner
2006-11-10  9:20   ` Arjan van de Ven
2006-11-09 23:38 ` [patch 03/19] hrtimers: Move and add documentation Thomas Gleixner
2006-11-09 23:38 ` [patch 04/19] Add a framework to manage clock event devices Thomas Gleixner
2006-11-10  9:47   ` Arjan van de Ven
2006-11-23 22:36   ` Roman Zippel
2006-11-09 23:38 ` [patch 05/19] ACPI: Include apic.h Thomas Gleixner
2006-11-09 23:38 ` [patch 06/19] ACPI: Keep track of timer broadcast Thomas Gleixner
2006-11-10  9:51   ` Arjan van de Ven
2006-11-09 23:38 ` [patch 07/19] ACPI: Add state propagation for dynamic broadcasting Thomas Gleixner
2006-11-10  9:52   ` Arjan van de Ven
2006-11-09 23:38 ` [patch 08/19] i386: cleanup apic code Thomas Gleixner
2006-11-10 10:04   ` Arjan van de Ven
2006-11-10 10:16     ` Thomas Gleixner
2006-11-10 10:16       ` Arjan van de Ven
2006-11-09 23:38 ` [patch 09/19] i386: Convert to clock event devices Thomas Gleixner
2006-11-10 10:10   ` Arjan van de Ven
2006-11-09 23:38 ` [patch 10/19] PM_timer: allow early access and move externs to a header file Thomas Gleixner
2006-11-10 10:12   ` Arjan van de Ven
2006-11-09 23:38 ` [patch 11/19] i386: Rework local APIC calibration Thomas Gleixner
2006-11-10 10:17   ` Arjan van de Ven
2006-11-10 10:23     ` Thomas Gleixner
2006-11-10 11:10     ` Ingo Molnar
2006-11-09 23:38 ` [patch 12/19] high-res timers: core Thomas Gleixner
2006-11-10 10:26   ` Arjan van de Ven
2006-11-09 23:38 ` [patch 13/19] GTOD: Mark TSC unusable for highres timers Thomas Gleixner
2006-11-10  1:10   ` john stultz
2006-11-10  5:10     ` Andi Kleen
2006-11-10  8:10       ` Thomas Gleixner
2006-11-10  8:50         ` Andrew Morton
2006-11-10  8:57           ` Ingo Molnar
2006-11-10  9:13             ` Andrew Morton
2006-11-10  9:29               ` Andi Kleen
2006-11-11 11:14                 ` Thomas Gleixner
2006-11-11 13:51                   ` Andi Kleen
2006-11-11 13:58                     ` Thomas Gleixner
2006-11-11 13:59                       ` Andi Kleen
2006-11-11 14:08                         ` Thomas Gleixner
2006-11-10 10:35               ` Arjan van de Ven
2006-11-10 10:47                 ` Andi Kleen
2006-11-10 10:55                   ` Arjan van de Ven
2006-11-10 11:13                     ` Ingo Molnar
2006-11-10 11:28                     ` Andi Kleen
2006-11-10  9:27             ` Andi Kleen
2006-11-10 10:14             ` Alan Cox
2006-11-10 11:19               ` Ingo Molnar
2006-11-10 15:43               ` Chris Friesen
2006-11-10 11:12             ` Pavel Machek
2006-11-10 11:48               ` Ingo Molnar
2006-11-10 11:56                 ` Andi Kleen
2006-11-10 13:12                   ` Ingo Molnar
2006-11-10 12:00                 ` Pavel Machek
2006-11-10 13:14                   ` Ingo Molnar
2006-11-10 11:11         ` Pavel Machek
2006-11-10 10:28       ` Arjan van de Ven
2006-11-10 10:30         ` Andi Kleen
2006-11-10 10:37           ` Arjan van de Ven
2006-11-09 23:38 ` [patch 14/19] dynticks: core code Thomas Gleixner
2006-11-09 23:38 ` [patch 15/19] dyntick: add nohz stats to /proc/stat Thomas Gleixner
2006-11-09 23:38 ` [patch 16/19] dynticks: i386 arch code Thomas Gleixner
2006-11-09 23:38 ` [patch 17/19] dynticks: Fix nmi watchdog Thomas Gleixner
2006-11-09 23:38 ` [patch 18/19] high-res timers, dynticks: enable i386 support Thomas Gleixner
2006-11-09 23:38 ` [patch 19/19] debugging feature: timer stats Thomas Gleixner
2006-11-23 22:24 ` [patch 00/21] Highres / dynticks drop in replacement for 2.6.19-rc5-mm1 Roman Zippel
