[PATCH] ktimers subsystem 2.6.14-rc2-kt5

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
@ 2005-09-28 20:43 tglx
  2005-09-28 23:59 ` Frank Sorenson
                   ` (3 more replies)
  0 siblings, 4 replies; 67+ messages in thread
From: tglx @ 2005-09-28 20:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, akpm, george, johnstul, paulmck, hch, oleg, zippel,
	tim.bird

This is an updated version which contains following changes:

- Selectable time storage format: union/struct based, scalar (64bit)
- Fixed an endless loop in forward_posix_timer (George Anzinger)
- Fixed a wrong sizeof(x) (George Anzinger)
- Fixed build problems for non x86 architectures

Roman pointed out that the penalty for some architectures 
would be quite big when using the nsec_t (64bit) scalar time 
storage format. After a long discussion and some more detailed 
tests especially on ARM it turned out that the scalar format 
is unfortunately not suitable everywhere. The tradeoff between 
performance and cleanliness seems too big for some architectures. 

After several rounds of functional conversions and 
cleanups an acceptable compromise between cleanliness and 
storage format flexibility was found.

For 64bit architectures the scalar representation is definitely
a win and therefor enabled unconditionally. The code defaults to
the union/struct based implementation on 32bit archs, but can be
switched to the scalar storage format by setting 
CONFIG_KTIME_SCALAR=y if there is a benefit for the particular 
architecture. The union/struct magic has an advantage over the 
struct timespec based format which I considered to use first. It
produces better and denser code for most architecures and does no
harm anywhere else. This might change with improvements of 
compilers, but then it requires just a replacement of the related
macros / inlines.

The code is not harder to understand than the previous 
open coded scalar storage based implementation.

The correctness was verified with the posix timer tests from 
the HRT project on the forward ported ktimers based high 
resolution proof of concept implementation.
For those interested in this topic the patchseries is available
at http://www.tglx.de/private/tglx/ktimers/patch-2.6.14-rc2-kt5.patches.tar.bz2


Thanks for review and feedback.

tglx


ktimers seperate the "timer API" from the "timeout API". 
ktimers are used for:
- nanosleep
- posixtimers
- itimers


The patch contains the base implementation of ktimers and the
conversion of nanosleep, posixtimers and itimers to ktimer users. 

The patch does not require other changes to the Linux time(r) core
system.

The implementation was done with following constraints in mind:

- Not bound to jiffies
- Multiple time sources
- Per CPU timer queues
- Simplification of absolute CLOCK_REALTIME posix timers
- High resolution timer aware
- Allows the timeout API to reschedule the next event 
  (for tickless systems)

Ktimers enqueue the timers into a time sorted list, which is implemented 
with a rbtree, which is effiecient and already used in other performance 
critical parts of the kernel. This is a bit slower than the timer wheel, 
but due to the fact that the vast majority of timers is actually 
expiring it has to be waged versus the cascading penalty.

The code supports multiple time sources. Currently implemented are 
CLOCK_REALTIME and CLOCK_MONOTONIC. They provide seperate timer queues 
and support functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

---
Index: linux-2.6.14-rc2-rt4/include/linux/calc64.h
===================================================================
--- /dev/null
+++ linux-2.6.14-rc2-rt4/include/linux/calc64.h
@@ -0,0 +1,31 @@
+#ifndef _linux_CALC64_H
+#define _linux_CALC64_H
+
+#include <linux/types.h>
+#include <asm/div64.h>
+
+#ifndef div_long_long_rem
+#define div_long_long_rem(dividend,divisor,remainder) 	\
+({							\
+	u64 result = dividend;				\
+	*remainder = do_div(result,divisor);		\
+	result;						\
+})
+#endif
+
+static inline long div_long_long_rem_signed(long long dividend,
+					    long divisor,
+					    long *remainder)
+{
+	long res;
+
+	if (unlikely(dividend < 0)) {
+		res = -div_long_long_rem(-dividend, divisor, remainder);
+		*remainder = -(*remainder);
+	} else {
+		res = div_long_long_rem(dividend, divisor, remainder);
+	}
+	return res;
+}
+
+#endif
Index: linux-2.6.14-rc2-rt4/include/linux/jiffies.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/jiffies.h
+++ linux-2.6.14-rc2-rt4/include/linux/jiffies.h
@@ -1,21 +1,12 @@
 #ifndef _LINUX_JIFFIES_H
 #define _LINUX_JIFFIES_H
 
+#include <linux/calc64.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/time.h>
 #include <linux/timex.h>
 #include <asm/param.h>			/* for HZ */
-#include <asm/div64.h>
-
-#ifndef div_long_long_rem
-#define div_long_long_rem(dividend,divisor,remainder) \
-({							\
-	u64 result = dividend;				\
-	*remainder = do_div(result,divisor);		\
-	result;						\
-})
-#endif
 
 /*
  * The following defines establish the engineering parameters of the PLL
Index: linux-2.6.14-rc2-rt4/fs/exec.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/fs/exec.c
+++ linux-2.6.14-rc2-rt4/fs/exec.c
@@ -645,9 +645,10 @@ static inline int de_thread(struct task_
 		 * synchronize with any firing (by calling del_timer_sync)
 		 * before we can safely let the old group leader die.
 		 */
-		sig->real_timer.data = (unsigned long)current;
-		if (del_timer_sync(&sig->real_timer))
-			add_timer(&sig->real_timer);
+		sig->real_timer.data = current;
+		if (stop_ktimer(&sig->real_timer))
+			start_ktimer(&sig->real_timer, NULL,
+				     KTIMER_RESTART|KTIMER_NOCHECK);
 	}
 	while (atomic_read(&sig->count) > count) {
 		sig->group_exit_task = current;
@@ -659,7 +660,7 @@ static inline int de_thread(struct task_
 	}
 	sig->group_exit_task = NULL;
 	sig->notify_count = 0;
-	sig->real_timer.data = (unsigned long)current;
+	sig->real_timer.data = current;
 	spin_unlock_irq(lock);
 
 	/*
Index: linux-2.6.14-rc2-rt4/fs/proc/array.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/fs/proc/array.c
+++ linux-2.6.14-rc2-rt4/fs/proc/array.c
@@ -330,7 +330,7 @@ static int do_task_stat(struct task_stru
 	unsigned long  min_flt = 0,  maj_flt = 0;
 	cputime_t cutime, cstime, utime, stime;
 	unsigned long rsslim = 0;
-	unsigned long it_real_value = 0;
+	DEFINE_KTIME(it_real_value);
 	struct task_struct *t;
 	char tcomm[sizeof(task->comm)];
 
@@ -386,7 +386,7 @@ static int do_task_stat(struct task_stru
 			utime = cputime_add(utime, task->signal->utime);
 			stime = cputime_add(stime, task->signal->stime);
 		}
-		it_real_value = task->signal->it_real_value;
+		it_real_value = task->signal->real_timer.expires;
 	}
 	ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
 	read_unlock(&tasklist_lock);
@@ -435,7 +435,7 @@ static int do_task_stat(struct task_stru
 		priority,
 		nice,
 		num_threads,
-		jiffies_to_clock_t(it_real_value),
+		(clock_t) ktime_to_clock_t(it_real_value),
 		start_time,
 		vsize,
 		mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
Index: linux-2.6.14-rc2-rt4/include/linux/ktimer.h
===================================================================
--- /dev/null
+++ linux-2.6.14-rc2-rt4/include/linux/ktimer.h
@@ -0,0 +1,335 @@
+#ifndef _LINUX_KTIMER_H
+#define _LINUX_KTIMER_H
+
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/time.h>
+#include <linux/wait.h>
+
+/* Timer API */
+
+/*
+ * Select the ktime_t data type
+ */
+#if defined(CONFIG_KTIME_SCALAR) || (BITS_PER_LONG == 64)
+ #define KTIME_IS_SCALAR
+#endif
+
+#ifndef KTIME_IS_SCALAR
+typedef union {
+	s64	tv64;
+	struct {
+#ifdef __BIG_ENDIAN
+	s32	sec, nsec;
+#else
+	s32	nsec, sec;
+#endif
+	} tv;
+} ktime_t;
+
+#else
+
+typedef s64 ktime_t;
+
+#endif
+
+struct ktimer_base;
+
+/*
+ * Timer structure must be initialized by init_ktimer_xxx !
+ */
+struct ktimer {
+	struct rb_node		node;
+	struct list_head	list;
+	ktime_t			expires;
+	ktime_t			expired;
+	ktime_t			interval;
+	int 	 	 	overrun;
+	unsigned long		status;
+	void 			(*function)(void *);
+	void			*data;
+	struct ktimer_base 	*base;
+};
+
+/*
+ * Timer base struct
+ */
+struct ktimer_base {
+	int			index;
+	char			*name;
+	spinlock_t		lock;
+	struct rb_root		active;
+	struct list_head	pending;
+	int			count;
+	unsigned long		resolution;
+	ktime_t			(*get_time)(void);
+	struct ktimer		*running_timer;
+	wait_queue_head_t	wait_for_running_timer;
+};
+
+/*
+ * Values for the mode argument of xxx_ktimer functions
+ */
+enum
+{
+	KTIMER_NOREARM,	/* Internal value */
+	KTIMER_ABS,	/* Time value is absolute */
+	KTIMER_REL,	/* Time value is relativ to now */
+	KTIMER_INCR,	/* Time value is relativ to previous expiry time */
+	KTIMER_FORWARD,	/* Timer is rearmed with value. Overruns are accounted */
+	KTIMER_REARM,	/* Timer is rearmed with interval. Overruns are accounted */
+	KTIMER_RESTART	/* Timer is restarted with the stored expiry value */
+};
+
+/* The timer states */
+enum
+{
+	KTIMER_INACTIVE,
+	KTIMER_PENDING,
+	KTIMER_EXPIRED,
+	KTIMER_EXPIRED_NOQUEUE,
+};
+
+/* Expiry must not be checked when the timer is started */
+#define KTIMER_NOCHECK		0x10000
+
+#define KTIMER_POISON		((void *) 0x00100101)
+
+#define KTIME_ZERO 		0LL
+
+#define ktimer_active(t) ((t)->status != KTIMER_INACTIVE)
+#define ktimer_before(t1, t2) (ktime_cmp((t1)->expires, <, (t2)->expires))
+
+#ifndef KTIME_IS_SCALAR
+/*
+ * Helper macros/inlines to get the math with ktime_t right. Uurgh, that's
+ * ugly as hell, but for performance sake we have to use this. The
+ * nsec_t based code was nice and simple. :(
+ *
+ * Be careful when using this stuff. It blows up on you if you dön't
+ * get the weirdness right.
+ *
+ * Be especially aware, that negative values are represented in the
+ * form:
+ * tv.sec < 0 and 0 >= tv.nsec < NSEC_PER_SEC
+ *
+ */
+#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
+
+#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
+#define ktime_cmp_val(a, op, b) ((a).tv64 op b)
+
+#define ktime_set(s,n) 		\
+({				\
+	ktime_t __kt;		\
+	__kt.tv.sec = s;	\
+	__kt.tv.nsec = n;	\
+	__kt;			\
+})
+
+#define ktime_set_zero(k) k.tv64 = 0LL
+
+#define ktime_set_low_high(l,h) ktime_set(h,l)
+
+#define ktime_get_low(t)	(t).tv.nsec
+#define ktime_get_high(t)	(t).tv.sec
+
+static inline ktime_t ktime_set_normalized(long sec, long nsec)
+{
+	ktime_t res;
+
+	while (nsec < 0) {
+                nsec += NSEC_PER_SEC;
+		sec--;
+        }
+	while (nsec >= NSEC_PER_SEC) {
+                nsec -= NSEC_PER_SEC;
+		sec++;
+	}
+
+	res.tv.sec = sec;
+	res.tv.nsec = nsec;
+	return res;
+}
+
+static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
+{
+	ktime_t res;
+
+	res.tv64 = a.tv64 - b.tv64;
+	if (res.tv.nsec < 0)
+		res.tv.nsec += NSEC_PER_SEC;
+
+	return res;
+}
+
+static inline ktime_t ktime_add(ktime_t a, ktime_t b)
+{
+	ktime_t res;
+
+	res.tv64 = a.tv64 + b.tv64;
+	if (res.tv.nsec >= NSEC_PER_SEC) {
+		res.tv.nsec -= NSEC_PER_SEC;
+		res.tv.sec++;
+	}
+	return res;
+}
+
+static inline ktime_t ktime_add_ns(ktime_t a, u64 nsec)
+{
+	ktime_t tmp;
+
+	if (likely(nsec < NSEC_PER_SEC)) {
+		tmp.tv64 = nsec;
+	} else {
+		unsigned long rem;
+		rem = do_div(nsec, NSEC_PER_SEC);
+		tmp = ktime_set((long)nsec, rem);
+	}
+	return ktime_add(a,tmp);
+}
+
+#define timespec_to_ktime(ts)			\
+({						\
+	ktime_t __kt;				\
+	struct timespec __ts = (ts);		\
+	__kt.tv.sec = (s32)__ts.tv_sec;		\
+	__kt.tv.nsec = (s32)__ts.tv_nsec;	\
+	__kt;					\
+})
+
+#define ktime_to_timespec(kt)			\
+({						\
+	struct timespec __ts;			\
+	ktime_t __kt = (kt);			\
+	__ts.tv_sec = (time_t)__kt.tv.sec;	\
+	__ts.tv_nsec = (long)__kt.tv.nsec;	\
+	__ts;					\
+})
+
+#define ktime_to_timeval(kt)					\
+({								\
+	struct timeval __tv;					\
+	ktime_t __kt = (kt);					\
+	__tv.tv_sec = (time_t)__kt.tv.sec;			\
+	__tv.tv_usec = (long)(__kt.tv.nsec / NSEC_PER_USEC);	\
+	__tv;							\
+})
+
+#define ktime_to_clock_t(kt)				\
+({							\
+	ktime_t __kt = (kt);				\
+	u64 nsecs = (u64) __kt.tv.sec * NSEC_PER_SEC;	\
+	nsec_to_clock_t(nsecs + (u64) __kt.tv.nsec);	\
+})
+
+#define ktime_to_ns(kt) 					\
+({								\
+	ktime_t __kt = (kt);					\
+	(((u64)__kt.tv.sec * NSEC_PER_SEC) + (u64)__kt.tv.nsec);\
+})
+
+#else
+
+/* ktime_t macros when using a 64bit variable */
+
+#define DEFINE_KTIME(kt) ktime_t kt = 0LL
+
+#define ktime_cmp(a,op,b) ((a) op (b))
+#define ktime_cmp_val(a,op,b) ((a) op b)
+
+#define ktime_set(s,n) (((s64) s * NSEC_PER_SEC) + (s64)n)
+#define ktime_set_zero(kt) kt = 0LL
+
+#define ktime_set_low_high(l,h) ((s64)((u64)l) | (((s64) h) << 32))
+
+#define ktime_get_low(t)	((t) & 0xFFFFFFFFLL)
+#define ktime_get_high(t)	((t) >> 32)
+
+#define ktime_sub(a,b)	((a) - (b))
+#define ktime_add(a,b)	((a) + (b))
+#define ktime_add_ns(a,b) ((a) + (b))
+
+#define timespec_to_ktime(ts) ktime_set(ts.tv_sec, ts.tv_nsec)
+
+#define ktime_to_timespec(kt) ns_to_timespec(kt)
+#define ktime_to_timeval(kt) ns_to_timeval(kt)
+
+#define ktime_to_clock_t(kt) nsec_to_clock_t(kt)
+
+#define ktime_to_ns(kt) (kt)
+
+#define ktime_set_normalized(s,n) ktime_set(s,n)
+
+#endif
+
+/* Exported functions */
+extern void fastcall init_ktimer_real(struct ktimer *timer);
+extern void fastcall init_ktimer_mono(struct ktimer *timer);
+extern int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
+extern int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
+extern int try_to_stop_ktimer(struct ktimer *timer);
+extern int stop_ktimer(struct ktimer *timer);
+extern ktime_t get_remtime_ktimer(struct ktimer *timer, long fake);
+extern ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now);
+extern void __init init_ktimers(void);
+
+/* Conversion functions with rounding based on resolution */
+extern ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv);
+extern ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts);
+
+/* Posix timers current quirks */
+extern int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp);
+extern int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp);
+
+/* nanosleep functions */
+long ktimer_nanosleep_mono(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
+long ktimer_nanosleep_real(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
+
+#if defined(CONFIG_SMP)
+extern void wait_for_ktimer(struct ktimer *timer);
+#else
+#define wait_for_ktimer(t) do {} while (0)
+#endif
+
+#define KTIME_REALTIME_RES (NSEC_PER_SEC/HZ)
+#define KTIME_MONOTONIC_RES (NSEC_PER_SEC/HZ)
+
+static inline void get_ktime_mono_ts(struct timespec *ts)
+{
+	unsigned long seq;
+	struct timespec tomono;
+	do {
+		seq = read_seqbegin(&xtime_lock);
+		getnstimeofday(ts);
+		tomono = wall_to_monotonic;
+	} while (read_seqretry(&xtime_lock, seq));
+
+
+	set_normalized_timespec(ts, ts->tv_sec + tomono.tv_sec,
+				ts->tv_nsec + tomono.tv_nsec);
+
+}
+
+static inline ktime_t do_get_ktime_mono(void)
+{
+	struct timespec now;
+
+	get_ktime_mono_ts(&now);
+	return timespec_to_ktime(now);
+}
+
+#define get_ktime_real_ts(ts) getnstimeofday(ts)
+static inline ktime_t do_get_ktime_real(void)
+{
+	struct timespec now;
+
+	getnstimeofday(&now);
+	return timespec_to_ktime(now);
+}
+
+#define clock_was_set() do { } while (0)
+extern void run_ktimer_queues(void);
+
+#endif
Index: linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/posix-timers.h
+++ linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
@@ -51,10 +51,9 @@ struct k_itimer {
 	struct sigqueue *sigq;		/* signal queue entry. */
 	union {
 		struct {
-			struct timer_list timer;
-			struct list_head abs_timer_entry; /* clock abs_timer_list */
-			struct timespec wall_to_prev;   /* wall_to_monotonic used when set */
-			unsigned long incr; /* interval in jiffies */
+			struct ktimer timer;
+			ktime_t incr;
+			int overrun;
 		} real;
 		struct cpu_timer_list cpu;
 		struct {
@@ -66,10 +65,6 @@ struct k_itimer {
 	} it;
 };
 
-struct k_clock_abs {
-	struct list_head list;
-	spinlock_t lock;
-};
 struct k_clock {
 	int res;		/* in nano seconds */
 	int (*clock_getres) (clockid_t which_clock, struct timespec *tp);
@@ -77,7 +72,7 @@ struct k_clock {
 	int (*clock_set) (clockid_t which_clock, struct timespec * tp);
 	int (*clock_get) (clockid_t which_clock, struct timespec * tp);
 	int (*timer_create) (struct k_itimer *timer);
-	int (*nsleep) (clockid_t which_clock, int flags, struct timespec *);
+	int (*nsleep) (clockid_t which_clock, int flags, struct timespec *, struct timespec __user *);
 	int (*timer_set) (struct k_itimer * timr, int flags,
 			  struct itimerspec * new_setting,
 			  struct itimerspec * old_setting);
@@ -91,37 +86,104 @@ void register_posix_clock(clockid_t cloc
 
 /* Error handlers for timer_create, nanosleep and settime */
 int do_posix_clock_notimer_create(struct k_itimer *timer);
-int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *);
+int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *, struct timespec __user *);
 int do_posix_clock_nosettime(clockid_t, struct timespec *tp);
 
 /* function to call to trigger timer event */
 int posix_timer_event(struct k_itimer *timr, int si_private);
 
-struct now_struct {
-	unsigned long jiffies;
-};
-
-#define posix_get_now(now) (now)->jiffies = jiffies;
-#define posix_time_before(timer, now) \
-                      time_before((timer)->expires, (now)->jiffies)
-
-#define posix_bump_timer(timr, now)					\
-         do {								\
-              long delta, orun;						\
-	      delta = now.jiffies - (timr)->it.real.timer.expires;	\
-              if (delta >= 0) {						\
-	           orun = 1 + (delta / (timr)->it.real.incr);		\
-	          (timr)->it.real.timer.expires +=			\
-			 orun * (timr)->it.real.incr;			\
-                  (timr)->it_overrun += orun;				\
-              }								\
-            }while (0)
+#if (BITS_PER_LONG < 64)
+static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
+{
+	ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
+	unsigned long orun = 1;
+
+	if (ktime_cmp_val(delta, <, KTIME_ZERO))
+		goto out;
+
+	if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
+
+		int sft = 0;
+		u64 div, dclc, inc, dns;
+
+		dclc = dns = ktime_to_ns(delta);
+		div = inc = ktime_to_ns(t->it.real.incr);
+		/* Make sure the divisor is less than 2^32 */
+		while(div >> 32) {
+			sft++;
+			div >>= 1;
+		}
+		dclc >>= sft;
+		do_div(dclc, (unsigned long) div);
+		orun = (unsigned long) dclc;
+		if (likely(!(inc >> 32)))
+			dclc *= (unsigned long) inc;
+		else
+			dclc *= inc;
+		t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
+							dclc);
+	} else {
+		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+						     t->it.real.incr);
+	}
+	/*
+	 * Here is the correction for exact.  Also covers delta == incr
+	 * which is the else clause above.
+	 */
+	if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
+		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+						     t->it.real.incr);
+		orun++;
+	}
+	t->it_overrun += orun;
+
+ out:
+	return ktime_sub(t->it.real.timer.expires, now);
+}
+#else
+static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
+{
+	ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
+	unsigned long orun = 1;
+
+	if (ktime_cmp_val(delta, <, KTIME_ZERO))
+		goto out;
+
+	if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
+
+		u64 dns, inc;
+
+		dns = ktime_to_ns(delta);
+		inc = ktime_to_ns(t->it.real.incr);
+
+		orun = dns / inc;
+		t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
+							orun * inc);
+	} else {
+		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+						     t->it.real.incr);
+	}
+	/*
+	 * Here is the correction for exact.  Also covers delta == incr
+	 * which is the else clause above.
+	 */
+	if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
+		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
+						     t->it.real.incr);
+		orun++;
+	}
+	t->it_overrun += orun;
+ out:
+	return ktime_sub(t->it.real.timer.expires, now);
+}
+#endif
 
 int posix_cpu_clock_getres(clockid_t which_clock, struct timespec *);
 int posix_cpu_clock_get(clockid_t which_clock, struct timespec *);
 int posix_cpu_clock_set(clockid_t which_clock, const struct timespec *tp);
 int posix_cpu_timer_create(struct k_itimer *);
-int posix_cpu_nsleep(clockid_t, int, struct timespec *);
+int posix_cpu_nsleep(clockid_t, int, struct timespec *,
+		     struct timespec __user *);
 int posix_cpu_timer_set(struct k_itimer *, int,
 			struct itimerspec *, struct itimerspec *);
 int posix_cpu_timer_del(struct k_itimer *);
Index: linux-2.6.14-rc2-rt4/include/linux/sched.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/sched.h
+++ linux-2.6.14-rc2-rt4/include/linux/sched.h
@@ -104,6 +104,7 @@ extern unsigned long nr_iowait(void);
 #include <linux/param.h>
 #include <linux/resource.h>
 #include <linux/timer.h>
+#include <linux/ktimer.h>
 
 #include <asm/processor.h>
 
@@ -346,8 +347,7 @@ struct signal_struct {
 	struct list_head posix_timers;
 
 	/* ITIMER_REAL timer for the process */
-	struct timer_list real_timer;
-	unsigned long it_real_value, it_real_incr;
+	struct ktimer real_timer;
 
 	/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
 	cputime_t it_prof_expires, it_virt_expires;
Index: linux-2.6.14-rc2-rt4/include/linux/timer.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/timer.h
+++ linux-2.6.14-rc2-rt4/include/linux/timer.h
@@ -91,6 +91,6 @@ static inline void add_timer(struct time
 
 extern void init_timers(void);
 extern void run_local_timers(void);
-extern void it_real_fn(unsigned long);
+extern void it_real_fn(void *);
 
 #endif
Index: linux-2.6.14-rc2-rt4/init/main.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/init/main.c
+++ linux-2.6.14-rc2-rt4/init/main.c
@@ -485,6 +485,7 @@ asmlinkage void __init start_kernel(void
 	init_IRQ();
 	pidhash_init();
 	init_timers();
+	init_ktimers();
 	softirq_init();
 	time_init();
 
Index: linux-2.6.14-rc2-rt4/kernel/Makefile
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/Makefile
+++ linux-2.6.14-rc2-rt4/kernel/Makefile
@@ -7,7 +7,8 @@ obj-y     = sched.o fork.o exec_domain.o
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
-	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o
+	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
+	    ktimers.o
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
Index: linux-2.6.14-rc2-rt4/kernel/exit.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/exit.c
+++ linux-2.6.14-rc2-rt4/kernel/exit.c
@@ -842,7 +842,7 @@ fastcall NORET_TYPE void do_exit(long co
 	update_mem_hiwater(tsk);
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
 	if (group_dead) {
- 		del_timer_sync(&tsk->signal->real_timer);
+ 		stop_ktimer(&tsk->signal->real_timer);
 		acct_process(code);
 	}
 	exit_mm(tsk);
Index: linux-2.6.14-rc2-rt4/kernel/fork.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/fork.c
+++ linux-2.6.14-rc2-rt4/kernel/fork.c
@@ -804,10 +804,9 @@ static inline int copy_signal(unsigned l
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
 
-	sig->it_real_value = sig->it_real_incr = 0;
+	init_ktimer_mono(&sig->real_timer);
 	sig->real_timer.function = it_real_fn;
-	sig->real_timer.data = (unsigned long) tsk;
-	init_timer(&sig->real_timer);
+	sig->real_timer.data = tsk;
 
 	sig->it_virt_expires = cputime_zero;
 	sig->it_virt_incr = cputime_zero;
Index: linux-2.6.14-rc2-rt4/kernel/itimer.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/itimer.c
+++ linux-2.6.14-rc2-rt4/kernel/itimer.c
@@ -12,36 +12,22 @@
 #include <linux/syscalls.h>
 #include <linux/time.h>
 #include <linux/posix-timers.h>
+#include <linux/ktimer.h>
 
 #include <asm/uaccess.h>
 
-static unsigned long it_real_value(struct signal_struct *sig)
-{
-	unsigned long val = 0;
-	if (timer_pending(&sig->real_timer)) {
-		val = sig->real_timer.expires - jiffies;
-
-		/* look out for negative/zero itimer.. */
-		if ((long) val <= 0)
-			val = 1;
-	}
-	return val;
-}
-
 int do_getitimer(int which, struct itimerval *value)
 {
 	struct task_struct *tsk = current;
-	unsigned long interval, val;
+	ktime_t interval, val;
 	cputime_t cinterval, cval;
 
 	switch (which) {
 	case ITIMER_REAL:
-		spin_lock_irq(&tsk->sighand->siglock);
-		interval = tsk->signal->it_real_incr;
-		val = it_real_value(tsk->signal);
-		spin_unlock_irq(&tsk->sighand->siglock);
-		jiffies_to_timeval(val, &value->it_value);
-		jiffies_to_timeval(interval, &value->it_interval);
+		interval = tsk->signal->real_timer.interval;
+		val = get_remtime_ktimer(&tsk->signal->real_timer, NSEC_PER_USEC);
+		value->it_value = ktime_to_timeval(val);
+		value->it_interval = ktime_to_timeval(interval);
 		break;
 	case ITIMER_VIRTUAL:
 		read_lock(&tasklist_lock);
@@ -113,59 +99,35 @@ asmlinkage long sys_getitimer(int which,
 }
 
 
-void it_real_fn(unsigned long __data)
+/*
+ * The timer is automagically restarted, when interval != 0
+ */
+void it_real_fn(void *data)
 {
-	struct task_struct * p = (struct task_struct *) __data;
-	unsigned long inc = p->signal->it_real_incr;
-
-	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, p);
-
-	/*
-	 * Now restart the timer if necessary.  We don't need any locking
-	 * here because do_setitimer makes sure we have finished running
-	 * before it touches anything.
-	 * Note, we KNOW we are (or should be) at a jiffie edge here so
-	 * we don't need the +1 stuff.  Also, we want to use the prior
-	 * expire value so as to not "slip" a jiffie if we are late.
-	 * Deal with requesting a time prior to "now" here rather than
-	 * in add_timer.
-	 */
-	if (!inc)
-		return;
-	while (time_before_eq(p->signal->real_timer.expires, jiffies))
-		p->signal->real_timer.expires += inc;
-	add_timer(&p->signal->real_timer);
+	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, data);
 }
 
 int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
 {
 	struct task_struct *tsk = current;
- 	unsigned long val, interval, expires;
+	struct ktimer *timer;
+	ktime_t expires;
 	cputime_t cval, cinterval, nval, ninterval;
 
 	switch (which) {
 	case ITIMER_REAL:
-again:
-		spin_lock_irq(&tsk->sighand->siglock);
-		interval = tsk->signal->it_real_incr;
-		val = it_real_value(tsk->signal);
-		/* We are sharing ->siglock with it_real_fn() */
-		if (try_to_del_timer_sync(&tsk->signal->real_timer) < 0) {
-			spin_unlock_irq(&tsk->sighand->siglock);
-			goto again;
-		}
-		tsk->signal->it_real_incr =
-			timeval_to_jiffies(&value->it_interval);
-		expires = timeval_to_jiffies(&value->it_value);
-		if (expires)
-			mod_timer(&tsk->signal->real_timer,
-				  jiffies + 1 + expires);
-		spin_unlock_irq(&tsk->sighand->siglock);
+		timer = &tsk->signal->real_timer;
+		stop_ktimer(timer);
 		if (ovalue) {
-			jiffies_to_timeval(val, &ovalue->it_value);
-			jiffies_to_timeval(interval,
-					   &ovalue->it_interval);
-		}
+			ovalue->it_value = ktime_to_timeval(
+				get_remtime_ktimer(timer, NSEC_PER_USEC));
+			ovalue->it_interval = ktime_to_timeval(timer->interval);
+		}
+		timer->interval = ktimer_convert_timeval(timer, &value->it_interval);
+		expires = ktimer_convert_timeval(timer, &value->it_value);
+		if (ktime_cmp_val(expires, != , KTIME_ZERO))
+			modify_ktimer(timer, &expires, KTIMER_REL | KTIMER_NOCHECK);
+
 		break;
 	case ITIMER_VIRTUAL:
 		nval = timeval_to_cputime(&value->it_value);
Index: linux-2.6.14-rc2-rt4/kernel/ktimers.c
===================================================================
--- /dev/null
+++ linux-2.6.14-rc2-rt4/kernel/ktimers.c
@@ -0,0 +1,824 @@
+/*
+ *  linux/kernel/ktimers.c
+ *
+ *  Copyright(C) 2005 Thomas Gleixner <tglx@linutronix.de>
+ *
+ *  Kudos to Ingo Molnar for review, criticism, ideas
+ *
+ *  Credits:
+ *	Lot of ideas and implementation details taken from
+ *	timer.c and related code
+ *
+ *  Kernel timers
+ *
+ *  In contrast to the timeout related API found in kernel/timer.c,
+ *  ktimers provide finer resolution and accuracy depending on system
+ *  configuration and capabilities.
+ *
+ *  These timers are used for
+ *  - itimers
+ *  - posixtimers
+ *  - nanosleep
+ *  - precise in kernel timing
+ *
+ *  Please do not abuse this API for simple timeouts.
+ *
+ *  For licencing details see kernel-base/COPYING
+ *
+ */
+
+#include <linux/cpu.h>
+#include <linux/interrupt.h>
+#include <linux/ktimer.h>
+#include <linux/module.h>
+#include <linux/notifier.h>
+#include <linux/percpu.h>
+#include <linux/syscalls.h>
+
+#include <asm/uaccess.h>
+
+static ktime_t get_ktime_mono(void);
+static ktime_t get_ktime_real(void);
+
+/* The time bases */
+#define MAX_KTIMER_BASES	2
+static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =
+{
+	{
+		.index = CLOCK_REALTIME,
+		.name = "Realtime",
+		.get_time = &get_ktime_real,
+		.resolution = KTIME_REALTIME_RES,
+	},
+	{
+		.index = CLOCK_MONOTONIC,
+		.name = "Monotonic",
+		.get_time = &get_ktime_mono,
+		.resolution = KTIME_MONOTONIC_RES,
+	},
+};
+
+/*
+ * The SMP/UP kludge goes here
+ */
+#if defined(CONFIG_SMP)
+
+#define set_running_timer(b,t) b->running_timer = t
+#define wake_up_timer_waiters(b) wake_up(&b->wait_for_running_timer)
+#define ktimer_base_can_change (1)
+/*
+ * Wait for a running timer
+ */
+void wait_for_ktimer(struct ktimer *timer)
+{
+	struct ktimer_base *base = timer->base;
+
+	if (base && base->running_timer == timer)
+		wait_event(base->wait_for_running_timer,
+			   base->running_timer != timer);
+}
+
+/*
+ * We are using hashed locking: holding per_cpu(ktimer_bases)[n].lock
+ * means that all timers which are tied to this base via timer->base are
+ * locked, and the base itself is locked too.
+ *
+ * So __run_timers/migrate_timers can safely modify all timers which could
+ * be found on the lists/queues.
+ *
+ * When the timer's base is locked, and the timer removed from list, it is
+ * possible to set timer->base = NULL and drop the lock: the timer remains
+ * locked.
+ */
+static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
+					    unsigned long *flags)
+{
+	struct ktimer_base *base;
+
+	for (;;) {
+		base = timer->base;
+		if (likely(base != NULL)) {
+			spin_lock_irqsave(&base->lock, *flags);
+			if (likely(base == timer->base))
+				return base;
+			/* The timer has migrated to another CPU */
+			spin_unlock_irqrestore(&base->lock, *flags);
+		}
+		cpu_relax();
+	}
+}
+
+static inline struct ktimer_base *switch_ktimer_base(struct ktimer *timer,
+						     struct ktimer_base *base)
+{
+	int ktidx = base->index;
+	struct ktimer_base *new_base = &__get_cpu_var(ktimer_bases[ktidx]);
+
+	if (base != new_base) {
+		/*
+		 * We are trying to schedule the timer on the local CPU.
+		 * However we can't change timer's base while it is running,
+		 * so we keep it on the same CPU. No hassle vs. reprogramming
+		 * the event source in the high resolution case. The softirq
+		 * code will take care of this when the timer function has
+		 * completed. There is no conflict as we hold the lock until
+		 * the timer is enqueued.
+		 */
+		if (unlikely(base->running_timer == timer)) {
+			return base;
+		} else {
+			/* See the comment in lock_timer_base() */
+			timer->base = NULL;
+			spin_unlock(&base->lock);
+			spin_lock(&new_base->lock);
+			timer->base = new_base;
+		}
+	}
+	return new_base;
+}
+
+/*
+ * Get the timer base unlocked
+ *
+ * Take care of timer->base = NULL in switch_ktimer_base !
+ */
+static inline struct ktimer_base *get_ktimer_base_unlocked(struct ktimer *timer)
+{
+	struct ktimer_base *base;
+	while (!(base = timer->base));
+	return base;
+}
+#else
+
+#define set_running_timer(b,t) do {} while (0)
+#define wake_up_timer_waiters(b)  do {} while (0)
+
+static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
+					    unsigned long *flags)
+{
+	struct ktimer_base *base;
+
+	base = timer->base;
+	spin_lock_irqsave(&base->lock, *flags);
+	return base;
+}
+
+#define switch_ktimer_base(t, b) b
+
+#define get_ktimer_base_unlocked(t) (t)->base
+#define ktimer_base_can_change (0)
+
+#endif	/* !CONFIG_SMP */
+
+/*
+ * Convert timespec to ktime_t with resolution adjustment
+ *
+ * Note: We can access base without locking here, as ktimers can
+ * migrate between CPUs but can not be moved from one clock source to
+ * another. The clock source binding is set at init_ktimer_XXX.
+ */
+ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts)
+{
+	struct ktimer_base *base = get_ktimer_base_unlocked(timer);
+	ktime_t t;
+	long rem = ts->tv_nsec % base->resolution;
+
+	t = ktime_set(ts->tv_sec, ts->tv_nsec);
+
+	/* Check, if the value has to be rounded */
+	if (rem)
+		t = ktime_add_ns(t, base->resolution - rem);
+	return t;
+}
+
+/*
+ * Convert timeval to ktime_t with resolution adjustment
+ */
+ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv)
+{
+	struct timespec ts;
+
+	ts.tv_sec = tv->tv_sec;
+	ts.tv_nsec = tv->tv_usec * NSEC_PER_USEC;
+
+	return ktimer_convert_timespec(timer, &ts);
+}
+
+/*
+ * Internal function to add (re)start a timer
+ *
+ * The timer is inserted in expiry order.
+ * Insertion into the red black tree is O(log(n))
+ *
+ */
+static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
+			   ktime_t *tim, int mode)
+{
+	struct rb_node **link = &base->active.rb_node;
+	struct rb_node *parent = NULL;
+	struct ktimer *entry;
+	struct list_head *prev = &base->pending;
+	ktime_t now;
+
+	/* Get current time */
+	now = base->get_time();
+
+	/* Timer expiry mode */
+	switch (mode & ~KTIMER_NOCHECK) {
+	case KTIMER_ABS:
+		timer->expires = *tim;
+		break;
+	case KTIMER_REL:
+		timer->expires = ktime_add(now, *tim);
+		break;
+	case KTIMER_INCR:
+		timer->expires = ktime_add(timer->expires, *tim);
+		break;
+	case KTIMER_FORWARD:
+		while ktime_cmp(timer->expires, <= , now) {
+			timer->expires = ktime_add(timer->expires, *tim);
+			timer->overrun++;
+		}
+		goto nocheck;
+	case KTIMER_REARM:
+		while ktime_cmp(timer->expires, <= , now) {
+			timer->expires = ktime_add(timer->expires, *tim);
+			timer->overrun++;
+		}
+		goto nocheck;
+	case KTIMER_RESTART:
+		break;
+	default:
+		BUG();
+	}
+
+	/* Already expired.*/
+	if ktime_cmp(timer->expires, <=, now) {
+		timer->expired = now;
+		/* The caller takes care of expiry */
+		if (!(mode & KTIMER_NOCHECK))
+			return -1;
+	}
+ nocheck:
+
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct ktimer, node);
+		/*
+		 * We dont care about collisions. Nodes with
+		 * the same expiry time stay together.
+		 */
+		if (ktimer_before(timer, entry))
+			link = &(*link)->rb_left;
+		else {
+			link = &(*link)->rb_right;
+			prev = &entry->list;
+		}
+	}
+
+	rb_link_node(&timer->node, parent, link);
+	rb_insert_color(&timer->node, &base->active);
+	list_add(&timer->list, prev);
+	timer->status = KTIMER_PENDING;
+	base->count++;
+	return 0;
+}
+
+/*
+ * Internal helper to remove a timer
+ *
+ * The function allows automatic rearming for interval
+ * timers.
+ *
+ */
+static inline void do_remove_ktimer(struct ktimer *timer,
+				    struct ktimer_base *base, int rearm)
+{
+	list_del(&timer->list);
+	rb_erase(&timer->node, &base->active);
+	timer->node.rb_parent = KTIMER_POISON;
+	timer->status = KTIMER_INACTIVE;
+	base->count--;
+	BUG_ON(base->count < 0);
+	/* Auto rearm the timer ? */
+	if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
+		enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
+}
+
+/*
+ * Called with base lock held
+ */
+static inline int remove_ktimer(struct ktimer *timer, struct ktimer_base *base)
+{
+	if (ktimer_active(timer)) {
+		do_remove_ktimer(timer, base, KTIMER_NOREARM);
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * Internal function to (re)start a timer.
+ */
+static int internal_restart_ktimer(struct ktimer *timer, ktime_t *tim,
+				   int mode)
+{
+	struct ktimer_base *base, *new_base;
+  	unsigned long flags;
+	int ret;
+
+	BUG_ON(!timer->function);
+
+	base = lock_ktimer_base(timer, &flags);
+
+	/* Remove an active timer from the queue */
+	ret = remove_ktimer(timer, base);
+
+	/* Switch the timer base, if necessary */
+	new_base = switch_ktimer_base(timer, base);
+
+	/*
+	 * When the new timer setting is already expired,
+	 * let the calling code deal with it.
+	 */
+	if (enqueue_ktimer(timer, new_base, tim, mode))
+		ret = -1;
+
+	spin_unlock_irqrestore(&new_base->lock, flags);
+	return ret;
+}
+
+/***
+ * modify_ktimer - modify a running timer
+ * @timer: the timer to be modified
+ * @tim: expiry time (required)
+ * @mode: timer setup mode
+ *
+ */
+int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
+{
+  	BUG_ON(!tim || !timer->function);
+	return internal_restart_ktimer(timer, tim, mode);
+}
+
+/***
+ * start_ktimer - start a timer on current CPU
+ * @timer: the timer to be added
+ * @tim: expiry time (optional, if not set in the timer)
+ * @mode: timer setup mode
+ */
+int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
+{
+  	BUG_ON(ktimer_active(timer) || !timer->function);
+
+	return internal_restart_ktimer(timer, tim, mode);
+}
+
+/***
+ * try_to_stop_ktimer - try to deactivate a timer
+ */
+int try_to_stop_ktimer(struct ktimer *timer)
+{
+	struct ktimer_base *base;
+	unsigned long flags;
+	int ret = -1;
+
+	base = lock_ktimer_base(timer, &flags);
+
+	if (base->running_timer != timer) {
+		ret = remove_ktimer(timer, base);
+		if (ret)
+			timer->expired = base->get_time();
+	}
+
+	spin_unlock_irqrestore(&base->lock, flags);
+
+	return ret;
+
+}
+
+/***
+ * stop_timer_sync - deactivate a timer and wait for the handler to finish.
+ * @timer: the timer to be deactivated
+ *
+ */
+int stop_ktimer(struct ktimer *timer)
+{
+	for (;;) {
+		int ret = try_to_stop_ktimer(timer);
+		if (ret >= 0)
+			return ret;
+		wait_for_ktimer(timer);
+	}
+}
+
+/***
+ * get_remtime_ktimer - get remaining time for the timer
+ * @timer: the timer to read
+ * @fake:  when fake > 0 a pending, but expired timer
+ *	   returns fake (itimers need this, uurg)
+ */
+ktime_t get_remtime_ktimer(struct ktimer *timer, long fake)
+{
+	struct ktimer_base *base;
+	unsigned long flags;
+	ktime_t rem;
+
+	base = lock_ktimer_base(timer, &flags);
+	if (ktimer_active(timer)) {
+		rem = ktime_sub(timer->expires,base->get_time());
+		if (fake && ktime_cmp_val(rem, <=, KTIME_ZERO))
+			rem = ktime_set(0, fake);
+	} else {
+		if (!fake)
+			rem = ktime_sub(timer->expires,base->get_time());
+		else
+			ktime_set_zero(rem);
+	}
+	spin_unlock_irqrestore(&base->lock, flags);
+	return rem;
+}
+
+/***
+ * get_expiry_ktimer - get expiry time for the timer
+ * @timer: the timer to read
+ * @now:   if != NULL store current base->time
+ */
+ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now)
+{
+	struct ktimer_base *base;
+	unsigned long flags;
+	ktime_t expiry;
+
+	base = lock_ktimer_base(timer, &flags);
+	expiry = timer->expires;
+	if (now)
+		*now = base->get_time();
+	spin_unlock_irqrestore(&base->lock, flags);
+	return expiry;
+}
+
+/*
+ * Functions related to clock sources
+ */
+
+static inline void ktimer_common_init(struct ktimer *timer)
+{
+	memset(timer, 0, sizeof(struct ktimer));
+	timer->node.rb_parent = KTIMER_POISON;
+}
+
+/*
+ * Get monotonic time
+ */
+static ktime_t get_ktime_mono(void)
+{
+	return do_get_ktime_mono();
+}
+
+/***
+ * init_ktimer_mono - initialize a timer on monotonic time
+ * @timer: the timer to be initialized
+ *
+ */
+void fastcall init_ktimer_mono(struct ktimer *timer)
+{
+	ktimer_common_init(timer);
+	timer->base =
+		&per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC];
+}
+
+/***
+ * get_ktimer_mono_res - get the monotonic timer resolution
+ *
+ */
+int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp)
+{
+	tp->tv_sec = 0;
+	tp->tv_nsec =
+		per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC].resolution;
+	return 0;
+}
+
+/*
+ * Get real time
+ */
+static ktime_t get_ktime_real(void)
+{
+	return do_get_ktime_real();
+}
+
+/***
+ * init_ktimer_real - initialize a timer on real time
+ * @timer: the timer to be initialized
+ *
+ */
+void fastcall init_ktimer_real(struct ktimer *timer)
+{
+	ktimer_common_init(timer);
+	timer->base =
+		&per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME];
+}
+
+/***
+ * get_ktimer_real_res - get the real timer resolution
+ *
+ */
+int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp)
+{
+	tp->tv_sec = 0;
+	tp->tv_nsec =
+		per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME].resolution;
+	return 0;
+}
+
+/*
+ * The per base runqueue
+ */
+static inline void run_ktimer_queue(struct ktimer_base *base)
+{
+	ktime_t now = base->get_time();
+
+	spin_lock_irq(&base->lock);
+	while (!list_empty(&base->pending)) {
+		void (*fn)(void *);
+		void *data;
+		struct ktimer *timer = list_entry(base->pending.next,
+						  struct ktimer, list);
+		if ktime_cmp(now, <=, timer->expires)
+			break;
+		timer->expired = now;
+		fn = timer->function;
+		data = timer->data;
+		set_running_timer(base, timer);
+		do_remove_ktimer(timer, base, KTIMER_REARM);
+		spin_unlock_irq(&base->lock);
+ 		fn(data);
+		spin_lock_irq(&base->lock);
+		set_running_timer(base, NULL);
+	}
+	spin_unlock_irq(&base->lock);
+	wake_up_timer_waiters(base);
+}
+
+/*
+ * Called from timer softirq every jiffy
+ */
+void run_ktimer_queues(void)
+{
+	struct ktimer_base *base = __get_cpu_var(ktimer_bases);
+	int i;
+
+	for (i = 0; i < MAX_KTIMER_BASES; i++)
+		run_ktimer_queue(&base[i]);
+}
+
+/*
+ * Functions related to initialization
+ */
+static void __devinit init_ktimers_cpu(int cpu)
+{
+	struct ktimer_base *base = per_cpu(ktimer_bases, cpu);
+	int i;
+
+	for (i = 0; i < MAX_KTIMER_BASES; i++) {
+		spin_lock_init(&base->lock);
+		INIT_LIST_HEAD(&base->pending);
+		init_waitqueue_head(&base->wait_for_running_timer);
+		base++;
+	}
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void migrate_ktimer_list(struct ktimer_base *old_base,
+				struct ktimer_base *new_base)
+{
+	struct ktimer *timer;
+	struct rb_node *node;
+
+	while ((node = rb_first(&old_base->active))) {
+		timer = rb_entry(node, struct ktimer, node);
+		remove_ktimer(timer, old_base);
+		timer->base = new_base;
+		enqueue_ktimer(timer, new_base, NULL, KTIMER_RESTART);
+	}
+}
+
+static void __devinit migrate_ktimers(int cpu)
+{
+	struct ktimer_base *old_base;
+	struct ktimer_base *new_base;
+	int i;
+
+	BUG_ON(cpu_online(cpu));
+	old_base = per_cpu(ktimer_bases, cpu);
+	new_base = get_cpu_var(ktimer_bases);
+
+	local_irq_disable();
+
+	for (i = 0; i < MAX_KTIMER_BASES; i++) {
+
+		spin_lock(&new_base->lock);
+		spin_lock(&old_base->lock);
+
+		if (old_base->running_timer)
+			BUG();
+
+		migrate_ktimer_list(old_base, new_base);
+
+		spin_unlock(&old_base->lock);
+		spin_unlock(&new_base->lock);
+		old_base++;
+		new_base++;
+	}
+
+	local_irq_enable();
+	&put_cpu_var(ktimer_bases);
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+static int __devinit ktimer_cpu_notify(struct notifier_block *self,
+				       unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+	switch(action) {
+	case CPU_UP_PREPARE:
+		init_ktimers_cpu(cpu);
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+		migrate_ktimers(cpu);
+		break;
+#endif
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata ktimers_nb = {
+	.notifier_call	= ktimer_cpu_notify,
+};
+
+void __init init_ktimers(void)
+{
+	ktimer_cpu_notify(&ktimers_nb, (unsigned long)CPU_UP_PREPARE,
+				(void *)(long)smp_processor_id());
+	register_cpu_notifier(&ktimers_nb);
+}
+
+/*
+ * system interface related functions
+ */
+static void process_ktimer(void *data)
+{
+	wake_up_process(data);
+}
+
+/**
+ * schedule_ktimer - sleep until timeout
+ * @timeout: timeout value
+ * @state:   state to use for sleep
+ * @rel:    timeout value is abs/rel
+ *
+ * Make the current task sleep until @timeout is
+ * elapsed.
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout is guaranteed to
+ * pass before the routine returns. The routine will return 0
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the current task. In this case the remaining time
+ * will be returned
+ *
+ * The current task state is guaranteed to be TASK_RUNNING when this
+ * routine returns.
+ *
+ */
+static fastcall ktime_t __sched schedule_ktimer(struct ktimer *timer,
+					ktime_t *t, int state, int mode)
+{
+	timer->data = current;
+	timer->function = process_ktimer;
+
+	current->state = state;
+	if (start_ktimer(timer, t, mode)) {
+		current->state = TASK_RUNNING;
+		goto out;
+	}
+	if (current->state != TASK_RUNNING)
+		schedule();
+	stop_ktimer(timer);
+ out:
+	/* Store the absolute expiry time */
+	*t = timer->expires;
+	/* Return the remaining time */
+	return ktime_sub(timer->expires, timer->expired);
+}
+
+static long __sched nanosleep_restart(struct ktimer *timer,
+				      struct restart_block *restart)
+{
+	struct timespec tu;
+	ktime_t t, rem;
+	void *rfn = restart->fn;
+	struct timespec __user *rmtp = (struct timespec __user *) restart->arg2;
+
+	restart->fn = do_no_restart_syscall;
+
+	t = ktime_set_low_high(restart->arg0, restart->arg1);
+
+	rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, KTIMER_ABS);
+
+	if (ktime_cmp_val(rem, <=, KTIME_ZERO))
+		return 0;
+
+	tu = ktime_to_timespec(rem);
+	if (rmtp && copy_to_user(rmtp, &rem, sizeof(tu)))
+		return -EFAULT;
+
+	restart->fn = rfn;
+	/* The other values in restart are already filled in */
+	return -ERESTART_RESTARTBLOCK;
+}
+
+static long __sched nanosleep_restart_mono(struct restart_block *restart)
+{
+	struct ktimer timer;
+
+	init_ktimer_mono(&timer);
+	return nanosleep_restart(&timer, restart);
+}
+
+static long __sched nanosleep_restart_real(struct restart_block *restart)
+{
+	struct ktimer timer;
+
+	init_ktimer_real(&timer);
+	return nanosleep_restart(&timer, restart);
+}
+
+static long ktimer_nanosleep(struct ktimer *timer, struct timespec *rqtp,
+			     struct timespec __user *rmtp, int mode,
+			     long (*rfn)(struct restart_block *))
+{
+	struct timespec tu;
+	ktime_t rem, t;
+	struct restart_block *restart;
+
+	t = ktimer_convert_timespec(timer, rqtp);
+
+	/* t is updated to absolute expiry time ! */
+	rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, mode);
+
+	if (ktime_cmp_val(rem, <=, KTIME_ZERO))
+		return 0;
+
+	tu = ktime_to_timespec(rem);
+
+	if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu)))
+		return -EFAULT;
+
+	restart = &current_thread_info()->restart_block;
+	restart->fn = rfn;
+	restart->arg0 = ktime_get_low(t);
+	restart->arg1 = ktime_get_high(t);
+	restart->arg2 = (unsigned long) rmtp;
+	return -ERESTART_RESTARTBLOCK;
+
+}
+
+long ktimer_nanosleep_mono(struct timespec *rqtp,
+			   struct timespec __user *rmtp, int mode)
+{
+	struct ktimer timer;
+
+	init_ktimer_mono(&timer);
+	return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_mono);
+}
+
+long ktimer_nanosleep_real(struct timespec *rqtp,
+			   struct timespec __user *rmtp, int mode)
+{
+	struct ktimer timer;
+
+	init_ktimer_real(&timer);
+	return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_real);
+}
+
+asmlinkage long sys_nanosleep(struct timespec __user *rqtp,
+			      struct timespec __user *rmtp)
+{
+	struct timespec tu;
+
+	if (copy_from_user(&tu, rqtp, sizeof(tu)))
+		return -EFAULT;
+
+	if (!timespec_valid(&tu))
+		return -EINVAL;
+
+	return ktimer_nanosleep_mono(&tu, rmtp, KTIMER_REL);
+}
+
Index: linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/posix-cpu-timers.c
+++ linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
@@ -1394,7 +1394,7 @@ void set_process_cpu_timer(struct task_s
 static long posix_cpu_clock_nanosleep_restart(struct restart_block *);
 
 int posix_cpu_nsleep(clockid_t which_clock, int flags,
-		     struct timespec *rqtp)
+		     struct timespec *rqtp, struct timespec __user *rmtp)
 {
 	struct restart_block *restart_block =
 	    &current_thread_info()->restart_block;
@@ -1419,7 +1419,6 @@ int posix_cpu_nsleep(clockid_t which_clo
 	error = posix_cpu_timer_create(&timer);
 	timer.it_process = current;
 	if (!error) {
-		struct timespec __user *rmtp;
 		static struct itimerspec zero_it;
 		struct itimerspec it = { .it_value = *rqtp,
 					 .it_interval = {} };
@@ -1466,7 +1465,6 @@ int posix_cpu_nsleep(clockid_t which_clo
 		/*
 		 * Report back to the user the time still remaining.
 		 */
-		rmtp = (struct timespec __user *) restart_block->arg1;
 		if (rmtp != NULL && !(flags & TIMER_ABSTIME) &&
 		    copy_to_user(rmtp, &it.it_value, sizeof *rmtp))
 			return -EFAULT;
@@ -1474,6 +1472,7 @@ int posix_cpu_nsleep(clockid_t which_clo
 		restart_block->fn = posix_cpu_clock_nanosleep_restart;
 		/* Caller already set restart_block->arg1 */
 		restart_block->arg0 = which_clock;
+		restart_block->arg1 = (unsigned long) rmtp;
 		restart_block->arg2 = rqtp->tv_sec;
 		restart_block->arg3 = rqtp->tv_nsec;
 
@@ -1487,10 +1486,15 @@ static long
 posix_cpu_clock_nanosleep_restart(struct restart_block *restart_block)
 {
 	clockid_t which_clock = restart_block->arg0;
-	struct timespec t = { .tv_sec = restart_block->arg2,
-			      .tv_nsec = restart_block->arg3 };
+	struct timespec __user *rmtp;
+	struct timespec t;
+
+	rmtp = (struct timespec __user *) restart_block->arg1;
+	t.tv_sec = restart_block->arg2;
+	t.tv_nsec = restart_block->arg3;
+
 	restart_block->fn = do_no_restart_syscall;
-	return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t);
+	return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t, rmtp);
 }
 
 
@@ -1511,9 +1515,10 @@ static int process_cpu_timer_create(stru
 	return posix_cpu_timer_create(timer);
 }
 static int process_cpu_nsleep(clockid_t which_clock, int flags,
-			      struct timespec *rqtp)
+			      struct timespec *rqtp,
+			      struct timespec __user *rmtp)
 {
-	return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp);
+	return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp, rmtp);
 }
 static int thread_cpu_clock_getres(clockid_t which_clock, struct timespec *tp)
 {
@@ -1529,7 +1534,7 @@ static int thread_cpu_timer_create(struc
 	return posix_cpu_timer_create(timer);
 }
 static int thread_cpu_nsleep(clockid_t which_clock, int flags,
-			      struct timespec *rqtp)
+			      struct timespec *rqtp, struct timespec __user *rmtp)
 {
 	return -EINVAL;
 }
Index: linux-2.6.14-rc2-rt4/kernel/posix-timers.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/posix-timers.c
+++ linux-2.6.14-rc2-rt4/kernel/posix-timers.c
@@ -48,21 +48,6 @@
 #include <linux/workqueue.h>
 #include <linux/module.h>
 
-#ifndef div_long_long_rem
-#include <asm/div64.h>
-
-#define div_long_long_rem(dividend,divisor,remainder) ({ \
-		       u64 result = dividend;		\
-		       *remainder = do_div(result,divisor); \
-		       result; })
-
-#endif
-#define CLOCK_REALTIME_RES TICK_NSEC  /* In nano seconds. */
-
-static inline u64  mpy_l_X_l_ll(unsigned long mpy1,unsigned long mpy2)
-{
-	return (u64)mpy1 * mpy2;
-}
 /*
  * Management arrays for POSIX timers.	 Timers are kept in slab memory
  * Timer ids are allocated by an external routine that keeps track of the
@@ -148,18 +133,18 @@ static DEFINE_SPINLOCK(idr_lock);
  */
 
 static struct k_clock posix_clocks[MAX_CLOCKS];
+
 /*
- * We only have one real clock that can be set so we need only one abs list,
- * even if we should want to have several clocks with differing resolutions.
+ * These ones are defined below.
  */
-static struct k_clock_abs abs_list = {.list = LIST_HEAD_INIT(abs_list.list),
-				      .lock = SPIN_LOCK_UNLOCKED};
+static int common_nsleep(clockid_t, int flags, struct timespec *t,
+			 struct timespec __user *rmtp);
+static void common_timer_get(struct k_itimer *, struct itimerspec *);
+static int common_timer_set(struct k_itimer *, int,
+			    struct itimerspec *, struct itimerspec *);
+static int common_timer_del(struct k_itimer *timer);
 
-static void posix_timer_fn(unsigned long);
-static u64 do_posix_clock_monotonic_gettime_parts(
-	struct timespec *tp, struct timespec *mo);
-int do_posix_clock_monotonic_gettime(struct timespec *tp);
-static int do_posix_clock_monotonic_get(clockid_t, struct timespec *tp);
+static void posix_timer_fn(void *data);
 
 static struct k_itimer *lock_timer(timer_t timer_id, unsigned long *flags);
 
@@ -205,21 +190,25 @@ static inline int common_clock_set(clock
 
 static inline int common_timer_create(struct k_itimer *new_timer)
 {
-	INIT_LIST_HEAD(&new_timer->it.real.abs_timer_entry);
-	init_timer(&new_timer->it.real.timer);
-	new_timer->it.real.timer.data = (unsigned long) new_timer;
+	return -EINVAL;
+}
+
+static int timer_create_mono(struct k_itimer *new_timer)
+{
+	init_ktimer_mono(&new_timer->it.real.timer);
+	new_timer->it.real.timer.data = new_timer;
+	new_timer->it.real.timer.function = posix_timer_fn;
+	return 0;
+}
+
+static int timer_create_real(struct k_itimer *new_timer)
+{
+	init_ktimer_real(&new_timer->it.real.timer);
+	new_timer->it.real.timer.data = new_timer;
 	new_timer->it.real.timer.function = posix_timer_fn;
 	return 0;
 }
 
-/*
- * These ones are defined below.
- */
-static int common_nsleep(clockid_t, int flags, struct timespec *t);
-static void common_timer_get(struct k_itimer *, struct itimerspec *);
-static int common_timer_set(struct k_itimer *, int,
-			    struct itimerspec *, struct itimerspec *);
-static int common_timer_del(struct k_itimer *timer);
 
 /*
  * Return nonzero iff we know a priori this clockid_t value is bogus.
@@ -239,19 +228,44 @@ static inline int invalid_clockid(clocki
 	return 1;
 }
 
+/*
+ * Get real time for posix timers
+ */
+static int posix_get_ktime_real_ts(clockid_t which_clock, struct timespec *tp)
+{
+	get_ktime_real_ts(tp);
+	return 0;
+}
+
+/*
+ * Get monotonic time for posix timers
+ */
+static int posix_get_ktime_mono_ts(clockid_t which_clock, struct timespec *tp)
+{
+	get_ktime_mono_ts(tp);
+	return 0;
+}
+
+void do_posix_clock_monotonic_gettime(struct timespec *ts)
+{
+	get_ktime_mono_ts(ts);
+}
 
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
 static __init int init_posix_timers(void)
 {
-	struct k_clock clock_realtime = {.res = CLOCK_REALTIME_RES,
-					 .abs_struct = &abs_list
+	struct k_clock clock_realtime = {
+		.clock_getres = get_ktimer_real_res,
+		.clock_get = posix_get_ktime_real_ts,
+		.timer_create = timer_create_real,
 	};
-	struct k_clock clock_monotonic = {.res = CLOCK_REALTIME_RES,
-		.abs_struct = NULL,
-		.clock_get = do_posix_clock_monotonic_get,
-		.clock_set = do_posix_clock_nosettime
+	struct k_clock clock_monotonic = {
+		.clock_getres = get_ktimer_mono_res,
+		.clock_get = posix_get_ktime_mono_ts,
+		.clock_set = do_posix_clock_nosettime,
+		.timer_create = timer_create_mono,
 	};
 
 	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
@@ -265,117 +279,17 @@ static __init int init_posix_timers(void
 
 __initcall(init_posix_timers);
 
-static void tstojiffie(struct timespec *tp, int res, u64 *jiff)
-{
-	long sec = tp->tv_sec;
-	long nsec = tp->tv_nsec + res - 1;
-
-	if (nsec > NSEC_PER_SEC) {
-		sec++;
-		nsec -= NSEC_PER_SEC;
-	}
-
-	/*
-	 * The scaling constants are defined in <linux/time.h>
-	 * The difference between there and here is that we do the
-	 * res rounding and compute a 64-bit result (well so does that
-	 * but it then throws away the high bits).
-  	 */
-	*jiff =  (mpy_l_X_l_ll(sec, SEC_CONVERSION) +
-		  (mpy_l_X_l_ll(nsec, NSEC_CONVERSION) >> 
-		   (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
-}
-
-/*
- * This function adjusts the timer as needed as a result of the clock
- * being set.  It should only be called for absolute timers, and then
- * under the abs_list lock.  It computes the time difference and sets
- * the new jiffies value in the timer.  It also updates the timers
- * reference wall_to_monotonic value.  It is complicated by the fact
- * that tstojiffies() only handles positive times and it needs to work
- * with both positive and negative times.  Also, for negative offsets,
- * we need to defeat the res round up.
- *
- * Return is true if there is a new time, else false.
- */
-static long add_clockset_delta(struct k_itimer *timr,
-			       struct timespec *new_wall_to)
-{
-	struct timespec delta;
-	int sign = 0;
-	u64 exp;
-
-	set_normalized_timespec(&delta,
-				new_wall_to->tv_sec -
-				timr->it.real.wall_to_prev.tv_sec,
-				new_wall_to->tv_nsec -
-				timr->it.real.wall_to_prev.tv_nsec);
-	if (likely(!(delta.tv_sec | delta.tv_nsec)))
-		return 0;
-	if (delta.tv_sec < 0) {
-		set_normalized_timespec(&delta,
-					-delta.tv_sec,
-					1 - delta.tv_nsec -
-					posix_clocks[timr->it_clock].res);
-		sign++;
-	}
-	tstojiffie(&delta, posix_clocks[timr->it_clock].res, &exp);
-	timr->it.real.wall_to_prev = *new_wall_to;
-	timr->it.real.timer.expires += (sign ? -exp : exp);
-	return 1;
-}
-
-static void remove_from_abslist(struct k_itimer *timr)
-{
-	if (!list_empty(&timr->it.real.abs_timer_entry)) {
-		spin_lock(&abs_list.lock);
-		list_del_init(&timr->it.real.abs_timer_entry);
-		spin_unlock(&abs_list.lock);
-	}
-}
 
 static void schedule_next_timer(struct k_itimer *timr)
 {
-	struct timespec new_wall_to;
-	struct now_struct now;
-	unsigned long seq;
-
-	/*
-	 * Set up the timer for the next interval (if there is one).
-	 * Note: this code uses the abs_timer_lock to protect
-	 * it.real.wall_to_prev and must hold it until exp is set, not exactly
-	 * obvious...
-
-	 * This function is used for CLOCK_REALTIME* and
-	 * CLOCK_MONOTONIC* timers.  If we ever want to handle other
-	 * CLOCKs, the calling code (do_schedule_next_timer) would need
-	 * to pull the "clock" info from the timer and dispatch the
-	 * "other" CLOCKs "next timer" code (which, I suppose should
-	 * also be added to the k_clock structure).
-	 */
-	if (!timr->it.real.incr)
+	if (ktime_cmp_val(timr->it.real.incr, ==, KTIME_ZERO))
 		return;
 
-	do {
-		seq = read_seqbegin(&xtime_lock);
-		new_wall_to =	wall_to_monotonic;
-		posix_get_now(&now);
-	} while (read_seqretry(&xtime_lock, seq));
-
-	if (!list_empty(&timr->it.real.abs_timer_entry)) {
-		spin_lock(&abs_list.lock);
-		add_clockset_delta(timr, &new_wall_to);
-
-		posix_bump_timer(timr, now);
-
-		spin_unlock(&abs_list.lock);
-	} else {
-		posix_bump_timer(timr, now);
-	}
-	timr->it_overrun_last = timr->it_overrun;
-	timr->it_overrun = -1;
+	timr->it_overrun_last = timr->it.real.overrun;
+	timr->it.real.overrun = timr->it.real.timer.overrun = -1;
 	++timr->it_requeue_pending;
-	add_timer(&timr->it.real.timer);
+	start_ktimer(&timr->it.real.timer, &timr->it.real.incr, KTIMER_FORWARD);
+	timr->it.real.overrun = timr->it.real.timer.overrun;
 }
 
 /*
@@ -413,14 +327,7 @@ int posix_timer_event(struct k_itimer *t
 {
 	memset(&timr->sigq->info, 0, sizeof(siginfo_t));
 	timr->sigq->info.si_sys_private = si_private;
-	/*
-	 * Send signal to the process that owns this timer.
-
-	 * This code assumes that all the possible abs_lists share the
-	 * same lock (there is only one list at this time). If this is
-	 * not the case, the CLOCK info would need to be used to find
-	 * the proper abs list lock.
-	 */
+	/* Send signal to the process that owns this timer.*/
 
 	timr->sigq->info.si_signo = timr->it_sigev_signo;
 	timr->sigq->info.si_errno = 0;
@@ -454,65 +361,28 @@ EXPORT_SYMBOL_GPL(posix_timer_event);
 
  * This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers.
  */
-static void posix_timer_fn(unsigned long __data)
+static void posix_timer_fn(void *data)
 {
-	struct k_itimer *timr = (struct k_itimer *) __data;
+	struct k_itimer *timr = data;
 	unsigned long flags;
-	unsigned long seq;
-	struct timespec delta, new_wall_to;
-	u64 exp = 0;
-	int do_notify = 1;
+	int si_private = 0;
 
 	spin_lock_irqsave(&timr->it_lock, flags);
-	if (!list_empty(&timr->it.real.abs_timer_entry)) {
-		spin_lock(&abs_list.lock);
-		do {
-			seq = read_seqbegin(&xtime_lock);
-			new_wall_to =	wall_to_monotonic;
-		} while (read_seqretry(&xtime_lock, seq));
-		set_normalized_timespec(&delta,
-					new_wall_to.tv_sec -
-					timr->it.real.wall_to_prev.tv_sec,
-					new_wall_to.tv_nsec -
-					timr->it.real.wall_to_prev.tv_nsec);
-		if (likely((delta.tv_sec | delta.tv_nsec ) == 0)) {
-			/* do nothing, timer is on time */
-		} else if (delta.tv_sec < 0) {
-			/* do nothing, timer is already late */
-		} else {
-			/* timer is early due to a clock set */
-			tstojiffie(&delta,
-				   posix_clocks[timr->it_clock].res,
-				   &exp);
-			timr->it.real.wall_to_prev = new_wall_to;
-			timr->it.real.timer.expires += exp;
-			add_timer(&timr->it.real.timer);
-			do_notify = 0;
-		}
-		spin_unlock(&abs_list.lock);
 
-	}
-	if (do_notify)  {
-		int si_private=0;
+	if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
+		si_private = ++timr->it_requeue_pending;
 
-		if (timr->it.real.incr)
-			si_private = ++timr->it_requeue_pending;
-		else {
-			remove_from_abslist(timr);
-		}
+	if (posix_timer_event(timr, si_private))
+		/*
+		 * signal was not sent because of sig_ignor
+		 * we will not get a call back to restart it AND
+		 * it should be restarted.
+		 */
+		schedule_next_timer(timr);
 
-		if (posix_timer_event(timr, si_private))
-			/*
-			 * signal was not sent because of sig_ignor
-			 * we will not get a call back to restart it AND
-			 * it should be restarted.
-			 */
-			schedule_next_timer(timr);
-	}
 	unlock_timer(timr, flags); /* hold thru abs lock to keep irq off */
 }
 
-
 static inline struct task_struct * good_sigevent(sigevent_t * event)
 {
 	struct task_struct *rtn = current->group_leader;
@@ -776,39 +646,40 @@ static struct k_itimer * lock_timer(time
 static void
 common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting)
 {
-	unsigned long expires;
-	struct now_struct now;
+	ktime_t expires, now, remaining;
+	struct ktimer *timer = &timr->it.real.timer;
 
-	do
-		expires = timr->it.real.timer.expires;
-	while ((volatile long) (timr->it.real.timer.expires) != expires);
-
-	posix_get_now(&now);
-
-	if (expires &&
-	    ((timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) &&
-	    !timr->it.real.incr &&
-	    posix_time_before(&timr->it.real.timer, &now))
-		timr->it.real.timer.expires = expires = 0;
-	if (expires) {
-		if (timr->it_requeue_pending & REQUEUE_PENDING ||
-		    (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
-			posix_bump_timer(timr, now);
-			expires = timr->it.real.timer.expires;
-		}
-		else
-			if (!timer_pending(&timr->it.real.timer))
-				expires = 0;
-		if (expires)
-			expires -= now.jiffies;
-	}
-	jiffies_to_timespec(expires, &cur_setting->it_value);
-	jiffies_to_timespec(timr->it.real.incr, &cur_setting->it_interval);
-
-	if (cur_setting->it_value.tv_sec < 0) {
+	memset(cur_setting, 0, sizeof(struct itimerspec));
+	expires = get_expiry_ktimer(timer, &now);
+	remaining = ktime_sub(expires, now);
+
+	/* Time left ? or timer pending */
+	if (ktime_cmp_val(remaining, >, KTIME_ZERO) || ktimer_active(timer))
+		goto calci;
+	/* interval timer ? */
+	if (ktime_cmp_val(timr->it.real.incr, ==, 0))
+		return;
+	/*
+	 * When a requeue is pending or this is a SIGEV_NONE timer
+	 * move the expiry time forward by intervals, so expiry is >
+	 * now.
+	 * The active (non SIGEV_NONE) rearm should be done
+	 * automatically by the ktimer REARM mode. Thats the next
+	 * iteration.  The REQUEUE_PENDING part will go away !
+	 */
+	if (timr->it_requeue_pending & REQUEUE_PENDING ||
+	    (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
+		remaining = forward_posix_timer(timr, now);
+	}
+ calci:
+	/* interval timer ? */
+	if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
+		cur_setting->it_interval = ktime_to_timespec(timr->it.real.incr);
+	/* Return 0 only, when the timer is expired and not pending */
+	if (ktime_cmp_val(remaining, <=, KTIME_ZERO))
 		cur_setting->it_value.tv_nsec = 1;
-		cur_setting->it_value.tv_sec = 0;
-	}
+	else
+		cur_setting->it_value = ktime_to_timespec(remaining);
 }
 
 /* Get the time remaining on a POSIX.1b interval timer. */
@@ -832,6 +703,7 @@ sys_timer_gettime(timer_t timer_id, stru
 
 	return 0;
 }
+
 /*
  * Get the number of overruns of a POSIX.1b interval timer.  This is to
  * be the overrun of the timer last delivered.  At the same time we are
@@ -858,84 +730,6 @@ sys_timer_getoverrun(timer_t timer_id)
 
 	return overrun;
 }
-/*
- * Adjust for absolute time
- *
- * If absolute time is given and it is not CLOCK_MONOTONIC, we need to
- * adjust for the offset between the timer clock (CLOCK_MONOTONIC) and
- * what ever clock he is using.
- *
- * If it is relative time, we need to add the current (CLOCK_MONOTONIC)
- * time to it to get the proper time for the timer.
- */
-static int adjust_abs_time(struct k_clock *clock, struct timespec *tp, 
-			   int abs, u64 *exp, struct timespec *wall_to)
-{
-	struct timespec now;
-	struct timespec oc = *tp;
-	u64 jiffies_64_f;
-	int rtn =0;
-
-	if (abs) {
-		/*
-		 * The mask pick up the 4 basic clocks 
-		 */
-		if (!((clock - &posix_clocks[0]) & ~CLOCKS_MASK)) {
-			jiffies_64_f = do_posix_clock_monotonic_gettime_parts(
-				&now,  wall_to);
-			/*
-			 * If we are doing a MONOTONIC clock
-			 */
-			if((clock - &posix_clocks[0]) & CLOCKS_MONO){
-				now.tv_sec += wall_to->tv_sec;
-				now.tv_nsec += wall_to->tv_nsec;
-			}
-		} else {
-			/*
-			 * Not one of the basic clocks
-			 */
-			clock->clock_get(clock - posix_clocks, &now);
-			jiffies_64_f = get_jiffies_64();
-		}
-		/*
-		 * Take away now to get delta and normalize
-		 */
-		set_normalized_timespec(&oc, oc.tv_sec - now.tv_sec,
-					oc.tv_nsec - now.tv_nsec);
-	}else{
-		jiffies_64_f = get_jiffies_64();
-	}
-	/*
-	 * Check if the requested time is prior to now (if so set now)
-	 */
-	if (oc.tv_sec < 0)
-		oc.tv_sec = oc.tv_nsec = 0;
-
-	if (oc.tv_sec | oc.tv_nsec)
-		set_normalized_timespec(&oc, oc.tv_sec,
-					oc.tv_nsec + clock->res);
-	tstojiffie(&oc, clock->res, exp);
-
-	/*
-	 * Check if the requested time is more than the timer code
-	 * can handle (if so we error out but return the value too).
-	 */
-	if (*exp > ((u64)MAX_JIFFY_OFFSET))
-			/*
-			 * This is a considered response, not exactly in
-			 * line with the standard (in fact it is silent on
-			 * possible overflows).  We assume such a large 
-			 * value is ALMOST always a programming error and
-			 * try not to compound it by setting a really dumb
-			 * value.
-			 */
-			rtn = -EINVAL;
-	/*
-	 * return the actual jiffies expire time, full 64 bits
-	 */
-	*exp += jiffies_64_f;
-	return rtn;
-}
 
 /* Set a POSIX.1b interval timer. */
 /* timr->it_lock is taken. */
@@ -943,68 +737,52 @@ static inline int
 common_timer_set(struct k_itimer *timr, int flags,
 		 struct itimerspec *new_setting, struct itimerspec *old_setting)
 {
-	struct k_clock *clock = &posix_clocks[timr->it_clock];
-	u64 expire_64;
+	ktime_t expires;
+	int mode;
 
 	if (old_setting)
 		common_timer_get(timr, old_setting);
 
 	/* disable the timer */
-	timr->it.real.incr = 0;
+	ktime_set_zero(timr->it.real.incr);
 	/*
 	 * careful here.  If smp we could be in the "fire" routine which will
 	 * be spinning as we hold the lock.  But this is ONLY an SMP issue.
 	 */
-	if (try_to_del_timer_sync(&timr->it.real.timer) < 0) {
-#ifdef CONFIG_SMP
-		/*
-		 * It can only be active if on an other cpu.  Since
-		 * we have cleared the interval stuff above, it should
-		 * clear once we release the spin lock.  Of course once
-		 * we do that anything could happen, including the
-		 * complete melt down of the timer.  So return with
-		 * a "retry" exit status.
-		 */
+	if (try_to_stop_ktimer(&timr->it.real.timer) < 0)
 		return TIMER_RETRY;
-#endif
-	}
-
-	remove_from_abslist(timr);
 
 	timr->it_requeue_pending = (timr->it_requeue_pending + 2) & 
 		~REQUEUE_PENDING;
 	timr->it_overrun_last = 0;
 	timr->it_overrun = -1;
-	/*
-	 *switch off the timer when it_value is zero
-	 */
-	if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec) {
-		timr->it.real.timer.expires = 0;
+
+	/* switch off the timer when it_value is zero */
+	if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec)
 		return 0;
-	}
 
-	if (adjust_abs_time(clock,
-			    &new_setting->it_value, flags & TIMER_ABSTIME, 
-			    &expire_64, &(timr->it.real.wall_to_prev))) {
-		return -EINVAL;
-	}
-	timr->it.real.timer.expires = (unsigned long)expire_64;
-	tstojiffie(&new_setting->it_interval, clock->res, &expire_64);
-	timr->it.real.incr = (unsigned long)expire_64;
+	mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;
 
-	/*
-	 * We do not even queue SIGEV_NONE timers!  But we do put them
-	 * in the abs list so we can do that right.
+	/* Posix madness. Only absolute CLOCK_REALTIME timers
+	 * are affected by clock sets. So we must reiniatilize
+	 * the timer.
 	 */
+	if (timr->it_clock == CLOCK_REALTIME && mode == KTIMER_ABS)
+		timer_create_real(timr);
+	else
+		timer_create_mono(timr);
+
+	expires = ktimer_convert_timespec(&timr->it.real.timer,
+					  &new_setting->it_value);
+	/* This should be moved to the auto rearm code */
+	timr->it.real.incr = ktimer_convert_timespec(&timr->it.real.timer,
+						     &new_setting->it_interval);
+
+	/* SIGEV_NONE timers are not queued ! See common_timer_get */
 	if (((timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE))
-		add_timer(&timr->it.real.timer);
+		start_ktimer(&timr->it.real.timer, &expires,
+			     mode | KTIMER_NOCHECK);
 
-	if (flags & TIMER_ABSTIME && clock->abs_struct) {
-		spin_lock(&clock->abs_struct->lock);
-		list_add_tail(&(timr->it.real.abs_timer_entry),
-			      &(clock->abs_struct->list));
-		spin_unlock(&clock->abs_struct->lock);
-	}
 	return 0;
 }
 
@@ -1039,6 +817,7 @@ retry:
 
 	unlock_timer(timr, flag);
 	if (error == TIMER_RETRY) {
+		wait_for_ktimer(&timr->it.real.timer);
 		rtn = NULL;	// We already got the old time...
 		goto retry;
 	}
@@ -1052,24 +831,10 @@ retry:
 
 static inline int common_timer_del(struct k_itimer *timer)
 {
-	timer->it.real.incr = 0;
+	ktime_set_zero(timer->it.real.incr);
 
-	if (try_to_del_timer_sync(&timer->it.real.timer) < 0) {
-#ifdef CONFIG_SMP
-		/*
-		 * It can only be active if on an other cpu.  Since
-		 * we have cleared the interval stuff above, it should
-		 * clear once we release the spin lock.  Of course once
-		 * we do that anything could happen, including the
-		 * complete melt down of the timer.  So return with
-		 * a "retry" exit status.
-		 */
+	if (try_to_stop_ktimer(&timer->it.real.timer) < 0)
 		return TIMER_RETRY;
-#endif
-	}
-
-	remove_from_abslist(timer);
-
 	return 0;
 }
 
@@ -1085,24 +850,17 @@ sys_timer_delete(timer_t timer_id)
 	struct k_itimer *timer;
 	long flags;
 
-#ifdef CONFIG_SMP
-	int error;
 retry_delete:
-#endif
 	timer = lock_timer(timer_id, &flags);
 	if (!timer)
 		return -EINVAL;
 
-#ifdef CONFIG_SMP
-	error = timer_delete_hook(timer);
-
-	if (error == TIMER_RETRY) {
+	if (timer_delete_hook(timer) == TIMER_RETRY) {
 		unlock_timer(timer, flags);
+		wait_for_ktimer(&timer->it.real.timer);
 		goto retry_delete;
 	}
-#else
-	timer_delete_hook(timer);
-#endif
+
 	spin_lock(&current->sighand->siglock);
 	list_del(&timer->list);
 	spin_unlock(&current->sighand->siglock);
@@ -1119,6 +877,7 @@ retry_delete:
 	release_posix_timer(timer, IT_ID_SET);
 	return 0;
 }
+
 /*
  * return timer owned by the process, used by exit_itimers
  */
@@ -1126,22 +885,14 @@ static inline void itimer_delete(struct 
 {
 	unsigned long flags;
 
-#ifdef CONFIG_SMP
-	int error;
 retry_delete:
-#endif
 	spin_lock_irqsave(&timer->it_lock, flags);
 
-#ifdef CONFIG_SMP
-	error = timer_delete_hook(timer);
-
-	if (error == TIMER_RETRY) {
+	if (timer_delete_hook(timer) == TIMER_RETRY) {
 		unlock_timer(timer, flags);
+		wait_for_ktimer(&timer->it.real.timer);
 		goto retry_delete;
 	}
-#else
-	timer_delete_hook(timer);
-#endif
 	list_del(&timer->list);
 	/*
 	 * This keeps any tasks waiting on the spin lock from thinking
@@ -1170,60 +921,7 @@ void exit_itimers(struct signal_struct *
 	}
 }
 
-/*
- * And now for the "clock" calls
- *
- * These functions are called both from timer functions (with the timer
- * spin_lock_irq() held and from clock calls with no locking.	They must
- * use the save flags versions of locks.
- */
-
-/*
- * We do ticks here to avoid the irq lock ( they take sooo long).
- * The seqlock is great here.  Since we a reader, we don't really care
- * if we are interrupted since we don't take lock that will stall us or
- * any other cpu. Voila, no irq lock is needed.
- *
- */
-
-static u64 do_posix_clock_monotonic_gettime_parts(
-	struct timespec *tp, struct timespec *mo)
-{
-	u64 jiff;
-	unsigned int seq;
-
-	do {
-		seq = read_seqbegin(&xtime_lock);
-		getnstimeofday(tp);
-		*mo = wall_to_monotonic;
-		jiff = jiffies_64;
-
-	} while(read_seqretry(&xtime_lock, seq));
-
-	return jiff;
-}
-
-static int do_posix_clock_monotonic_get(clockid_t clock, struct timespec *tp)
-{
-	struct timespec wall_to_mono;
-
-	do_posix_clock_monotonic_gettime_parts(tp, &wall_to_mono);
-
-	tp->tv_sec += wall_to_mono.tv_sec;
-	tp->tv_nsec += wall_to_mono.tv_nsec;
-
-	if ((tp->tv_nsec - NSEC_PER_SEC) > 0) {
-		tp->tv_nsec -= NSEC_PER_SEC;
-		tp->tv_sec++;
-	}
-	return 0;
-}
-
-int do_posix_clock_monotonic_gettime(struct timespec *tp)
-{
-	return do_posix_clock_monotonic_get(CLOCK_MONOTONIC, tp);
-}
-
+/* Not available / possible... functions */
 int do_posix_clock_nosettime(clockid_t clockid, struct timespec *tp)
 {
 	return -EINVAL;
@@ -1236,7 +934,8 @@ int do_posix_clock_notimer_create(struct
 }
 EXPORT_SYMBOL_GPL(do_posix_clock_notimer_create);
 
-int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t)
+int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t,
+			       struct timespec __user *r)
 {
 #ifndef ENOTSUP
 	return -EOPNOTSUPP;	/* aka ENOTSUP in userland for POSIX */
@@ -1295,125 +994,34 @@ sys_clock_getres(clockid_t which_clock, 
 	return error;
 }
 
-static void nanosleep_wake_up(unsigned long __data)
-{
-	struct task_struct *p = (struct task_struct *) __data;
-
-	wake_up_process(p);
-}
-
 /*
- * The standard says that an absolute nanosleep call MUST wake up at
- * the requested time in spite of clock settings.  Here is what we do:
- * For each nanosleep call that needs it (only absolute and not on
- * CLOCK_MONOTONIC* (as it can not be set)) we thread a little structure
- * into the "nanosleep_abs_list".  All we need is the task_struct pointer.
- * When ever the clock is set we just wake up all those tasks.	 The rest
- * is done by the while loop in clock_nanosleep().
- *
- * On locking, clock_was_set() is called from update_wall_clock which
- * holds (or has held for it) a write_lock_irq( xtime_lock) and is
- * called from the timer bh code.  Thus we need the irq save locks.
- *
- * Also, on the call from update_wall_clock, that is done as part of a
- * softirq thing.  We don't want to delay the system that much (possibly
- * long list of timers to fix), so we defer that work to keventd.
+ * nanosleep for monotonic and realtime clocks
  */
-
-static DECLARE_WAIT_QUEUE_HEAD(nanosleep_abs_wqueue);
-static DECLARE_WORK(clock_was_set_work, (void(*)(void*))clock_was_set, NULL);
-
-static DECLARE_MUTEX(clock_was_set_lock);
-
-void clock_was_set(void)
+static int common_nsleep(clockid_t which_clock, int flags,
+			 struct timespec *tsave, struct timespec __user *rmtp)
 {
-	struct k_itimer *timr;
-	struct timespec new_wall_to;
-	LIST_HEAD(cws_list);
-	unsigned long seq;
-
+	int mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;
 
-	if (unlikely(in_interrupt())) {
-		schedule_work(&clock_was_set_work);
-		return;
+	switch (which_clock) {
+	case CLOCK_REALTIME:
+		/* Posix madness. Only absolute timers on clock realtime
+		   are affected by clock set. */
+		if (mode == KTIMER_ABS)
+			return ktimer_nanosleep_real(tsave, rmtp, mode);
+	case CLOCK_MONOTONIC:
+		return ktimer_nanosleep_mono(tsave, rmtp, mode);
+	default:
+		break;
 	}
-	wake_up_all(&nanosleep_abs_wqueue);
-
-	/*
-	 * Check if there exist TIMER_ABSTIME timers to correct.
-	 *
-	 * Notes on locking: This code is run in task context with irq
-	 * on.  We CAN be interrupted!  All other usage of the abs list
-	 * lock is under the timer lock which holds the irq lock as
-	 * well.  We REALLY don't want to scan the whole list with the
-	 * interrupt system off, AND we would like a sequence lock on
-	 * this code as well.  Since we assume that the clock will not
-	 * be set often, it seems ok to take and release the irq lock
-	 * for each timer.  In fact add_timer will do this, so this is
-	 * not an issue.  So we know when we are done, we will move the
-	 * whole list to a new location.  Then as we process each entry,
-	 * we will move it to the actual list again.  This way, when our
-	 * copy is empty, we are done.  We are not all that concerned
-	 * about preemption so we will use a semaphore lock to protect
-	 * aginst reentry.  This way we will not stall another
-	 * processor.  It is possible that this may delay some timers
-	 * that should have expired, given the new clock, but even this
-	 * will be minimal as we will always update to the current time,
-	 * even if it was set by a task that is waiting for entry to
-	 * this code.  Timers that expire too early will be caught by
-	 * the expire code and restarted.
-
-	 * Absolute timers that repeat are left in the abs list while
-	 * waiting for the task to pick up the signal.  This means we
-	 * may find timers that are not in the "add_timer" list, but are
-	 * in the abs list.  We do the same thing for these, save
-	 * putting them back in the "add_timer" list.  (Note, these are
-	 * left in the abs list mainly to indicate that they are
-	 * ABSOLUTE timers, a fact that is used by the re-arm code, and
-	 * for which we have no other flag.)
-
-	 */
-
-	down(&clock_was_set_lock);
-	spin_lock_irq(&abs_list.lock);
-	list_splice_init(&abs_list.list, &cws_list);
-	spin_unlock_irq(&abs_list.lock);
-	do {
-		do {
-			seq = read_seqbegin(&xtime_lock);
-			new_wall_to =	wall_to_monotonic;
-		} while (read_seqretry(&xtime_lock, seq));
-
-		spin_lock_irq(&abs_list.lock);
-		if (list_empty(&cws_list)) {
-			spin_unlock_irq(&abs_list.lock);
-			break;
-		}
-		timr = list_entry(cws_list.next, struct k_itimer,
-				  it.real.abs_timer_entry);
-
-		list_del_init(&timr->it.real.abs_timer_entry);
-		if (add_clockset_delta(timr, &new_wall_to) &&
-		    del_timer(&timr->it.real.timer))  /* timer run yet? */
-			add_timer(&timr->it.real.timer);
-		list_add(&timr->it.real.abs_timer_entry, &abs_list.list);
-		spin_unlock_irq(&abs_list.lock);
-	} while (1);
-
-	up(&clock_was_set_lock);
+	return -EINVAL;
 }
 
-long clock_nanosleep_restart(struct restart_block *restart_block);
-
 asmlinkage long
 sys_clock_nanosleep(clockid_t which_clock, int flags,
 		    const struct timespec __user *rqtp,
 		    struct timespec __user *rmtp)
 {
 	struct timespec t;
-	struct restart_block *restart_block =
-	    &(current_thread_info()->restart_block);
-	int ret;
 
 	if (invalid_clockid(which_clock))
 		return -EINVAL;
@@ -1421,135 +1029,8 @@ sys_clock_nanosleep(clockid_t which_cloc
 	if (copy_from_user(&t, rqtp, sizeof (struct timespec)))
 		return -EFAULT;
 
-	if ((unsigned) t.tv_nsec >= NSEC_PER_SEC || t.tv_sec < 0)
+	if (!timespec_valid(&t))
 		return -EINVAL;
 
-	/*
-	 * Do this here as nsleep function does not have the real address.
-	 */
-	restart_block->arg1 = (unsigned long)rmtp;
-
-	ret = CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t));
-
-	if ((ret == -ERESTART_RESTARTBLOCK) && rmtp &&
-					copy_to_user(rmtp, &t, sizeof (t)))
-		return -EFAULT;
-	return ret;
-}
-
-
-static int common_nsleep(clockid_t which_clock,
-			 int flags, struct timespec *tsave)
-{
-	struct timespec t, dum;
-	struct timer_list new_timer;
-	DECLARE_WAITQUEUE(abs_wqueue, current);
-	u64 rq_time = (u64)0;
-	s64 left;
-	int abs;
-	struct restart_block *restart_block =
-	    &current_thread_info()->restart_block;
-
-	abs_wqueue.flags = 0;
-	init_timer(&new_timer);
-	new_timer.expires = 0;
-	new_timer.data = (unsigned long) current;
-	new_timer.function = nanosleep_wake_up;
-	abs = flags & TIMER_ABSTIME;
-
-	if (restart_block->fn == clock_nanosleep_restart) {
-		/*
-		 * Interrupted by a non-delivered signal, pick up remaining
-		 * time and continue.  Remaining time is in arg2 & 3.
-		 */
-		restart_block->fn = do_no_restart_syscall;
-
-		rq_time = restart_block->arg3;
-		rq_time = (rq_time << 32) + restart_block->arg2;
-		if (!rq_time)
-			return -EINTR;
-		left = rq_time - get_jiffies_64();
-		if (left <= (s64)0)
-			return 0;	/* Already passed */
-	}
-
-	if (abs && (posix_clocks[which_clock].clock_get !=
-			    posix_clocks[CLOCK_MONOTONIC].clock_get))
-		add_wait_queue(&nanosleep_abs_wqueue, &abs_wqueue);
-
-	do {
-		t = *tsave;
-		if (abs || !rq_time) {
-			adjust_abs_time(&posix_clocks[which_clock], &t, abs,
-					&rq_time, &dum);
-		}
-
-		left = rq_time - get_jiffies_64();
-		if (left >= (s64)MAX_JIFFY_OFFSET)
-			left = (s64)MAX_JIFFY_OFFSET;
-		if (left < (s64)0)
-			break;
-
-		new_timer.expires = jiffies + left;
-		__set_current_state(TASK_INTERRUPTIBLE);
-		add_timer(&new_timer);
-
-		schedule();
-
-		del_timer_sync(&new_timer);
-		left = rq_time - get_jiffies_64();
-	} while (left > (s64)0 && !test_thread_flag(TIF_SIGPENDING));
-
-	if (abs_wqueue.task_list.next)
-		finish_wait(&nanosleep_abs_wqueue, &abs_wqueue);
-
-	if (left > (s64)0) {
-
-		/*
-		 * Always restart abs calls from scratch to pick up any
-		 * clock shifting that happened while we are away.
-		 */
-		if (abs)
-			return -ERESTARTNOHAND;
-
-		left *= TICK_NSEC;
-		tsave->tv_sec = div_long_long_rem(left, 
-						  NSEC_PER_SEC, 
-						  &tsave->tv_nsec);
-		/*
-		 * Restart works by saving the time remaing in 
-		 * arg2 & 3 (it is 64-bits of jiffies).  The other
-		 * info we need is the clock_id (saved in arg0). 
-		 * The sys_call interface needs the users 
-		 * timespec return address which _it_ saves in arg1.
-		 * Since we have cast the nanosleep call to a clock_nanosleep
-		 * both can be restarted with the same code.
-		 */
-		restart_block->fn = clock_nanosleep_restart;
-		restart_block->arg0 = which_clock;
-		/*
-		 * Caller sets arg1
-		 */
-		restart_block->arg2 = rq_time & 0xffffffffLL;
-		restart_block->arg3 = rq_time >> 32;
-
-		return -ERESTART_RESTARTBLOCK;
-	}
-
-	return 0;
-}
-/*
- * This will restart clock_nanosleep.
- */
-long
-clock_nanosleep_restart(struct restart_block *restart_block)
-{
-	struct timespec t;
-	int ret = common_nsleep(restart_block->arg0, 0, &t);
-
-	if ((ret == -ERESTART_RESTARTBLOCK) && restart_block->arg1 &&
-	    copy_to_user((struct timespec __user *)(restart_block->arg1), &t,
-			 sizeof (t)))
-		return -EFAULT;
-	return ret;
+	return CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t, rmtp));
 }
Index: linux-2.6.14-rc2-rt4/kernel/timer.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/timer.c
+++ linux-2.6.14-rc2-rt4/kernel/timer.c
@@ -912,6 +912,7 @@ static void run_timer_softirq(struct sof
 {
 	tvec_base_t *base = &__get_cpu_var(tvec_bases);
 
+ 	run_ktimer_queues();
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
 }
@@ -1177,62 +1178,6 @@ asmlinkage long sys_gettid(void)
 	return current->pid;
 }
 
-static long __sched nanosleep_restart(struct restart_block *restart)
-{
-	unsigned long expire = restart->arg0, now = jiffies;
-	struct timespec __user *rmtp = (struct timespec __user *) restart->arg1;
-	long ret;
-
-	/* Did it expire while we handled signals? */
-	if (!time_after(expire, now))
-		return 0;
-
-	expire = schedule_timeout_interruptible(expire - now);
-
-	ret = 0;
-	if (expire) {
-		struct timespec t;
-		jiffies_to_timespec(expire, &t);
-
-		ret = -ERESTART_RESTARTBLOCK;
-		if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
-			ret = -EFAULT;
-		/* The 'restart' block is already filled in */
-	}
-	return ret;
-}
-
-asmlinkage long sys_nanosleep(struct timespec __user *rqtp, struct timespec __user *rmtp)
-{
-	struct timespec t;
-	unsigned long expire;
-	long ret;
-
-	if (copy_from_user(&t, rqtp, sizeof(t)))
-		return -EFAULT;
-
-	if ((t.tv_nsec >= 1000000000L) || (t.tv_nsec < 0) || (t.tv_sec < 0))
-		return -EINVAL;
-
-	expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);
-	expire = schedule_timeout_interruptible(expire);
-
-	ret = 0;
-	if (expire) {
-		struct restart_block *restart;
-		jiffies_to_timespec(expire, &t);
-		if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
-			return -EFAULT;
-
-		restart = &current_thread_info()->restart_block;
-		restart->fn = nanosleep_restart;
-		restart->arg0 = jiffies + expire;
-		restart->arg1 = (unsigned long) rmtp;
-		ret = -ERESTART_RESTARTBLOCK;
-	}
-	return ret;
-}
-
 /*
  * sys_sysinfo - fill in sysinfo struct
  */ 
Index: linux-2.6.14-rc2-rt4/include/linux/time.h
===================================================================
--- linux-2.6.14-rc2-rt4.orig/include/linux/time.h
+++ linux-2.6.14-rc2-rt4/include/linux/time.h
@@ -4,6 +4,7 @@
 #include <linux/types.h>
 
 #ifdef __KERNEL__
+#include <linux/calc64.h>
 #include <linux/seqlock.h>
 #endif
 
@@ -38,6 +39,11 @@ static __inline__ int timespec_equal(str
 	return (a->tv_sec == b->tv_sec) && (a->tv_nsec == b->tv_nsec);
 } 
 
+#define timespec_valid(ts) \
+(((ts)->tv_sec >= 0) && (((unsigned) (ts)->tv_nsec) < NSEC_PER_SEC))
+
+typedef s64 nsec_t;
+
 /* Converts Gregorian date to seconds since 1970-01-01 00:00:00.
  * Assumes input in normal date format, i.e. 1980-12-31 23:59:59
  * => year=1980, mon=12, day=31, hour=23, min=59, sec=59.
@@ -88,8 +94,7 @@ struct timespec current_kernel_time(void
 extern void do_gettimeofday(struct timeval *tv);
 extern int do_settimeofday(struct timespec *tv);
 extern int do_sys_settimeofday(struct timespec *tv, struct timezone *tz);
-extern void clock_was_set(void); // call when ever the clock is set
-extern int do_posix_clock_monotonic_gettime(struct timespec *tp);
+extern void do_posix_clock_monotonic_gettime(struct timespec *ts);
 extern long do_utimes(char __user * filename, struct timeval * times);
 struct itimerval;
 extern int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue);
@@ -113,6 +118,40 @@ set_normalized_timespec (struct timespec
 	ts->tv_nsec = nsec;
 }
 
+static __inline__ nsec_t timespec_to_ns(struct timespec *s)
+{
+	nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
+	return res + (nsec_t) s->tv_nsec;
+}
+
+static __inline__ struct timespec ns_to_timespec(nsec_t n)
+{
+	struct timespec ts;
+
+	if (n)
+		ts.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &ts.tv_nsec);
+	else
+		ts.tv_sec = ts.tv_nsec = 0;
+	return ts;
+}
+
+static __inline__ nsec_t timeval_to_ns(struct timeval *s)
+{
+	nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
+	return res + (nsec_t) s->tv_usec * NSEC_PER_USEC;
+}
+
+static __inline__ struct timeval ns_to_timeval(nsec_t n)
+{
+	struct timeval tv;
+	if (n) {
+		tv.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &tv.tv_usec);
+		tv.tv_usec /= 1000;
+	} else
+		tv.tv_sec = tv.tv_usec = 0;
+	return tv;
+}
+
 #endif /* __KERNEL__ */
 
 #define NFDBITS			__NFDBITS
@@ -145,23 +184,18 @@ struct	itimerval {
 /*
  * The IDs of the various system clocks (for POSIX.1b interval timers).
  */
-#define CLOCK_REALTIME		  0
-#define CLOCK_MONOTONIC	  1
+#define CLOCK_REALTIME		 0
+#define CLOCK_MONOTONIC	  	 1
 #define CLOCK_PROCESS_CPUTIME_ID 2
 #define CLOCK_THREAD_CPUTIME_ID	 3
-#define CLOCK_REALTIME_HR	 4
-#define CLOCK_MONOTONIC_HR	  5
 
 /*
  * The IDs of various hardware clocks
  */
-
-
 #define CLOCK_SGI_CYCLE 10
 #define MAX_CLOCKS 16
-#define CLOCKS_MASK  (CLOCK_REALTIME | CLOCK_MONOTONIC | \
-                     CLOCK_REALTIME_HR | CLOCK_MONOTONIC_HR)
-#define CLOCKS_MONO (CLOCK_MONOTONIC & CLOCK_MONOTONIC_HR)
+#define CLOCKS_MASK  (CLOCK_REALTIME | CLOCK_MONOTONIC)
+#define CLOCKS_MONO (CLOCK_MONOTONIC)
 
 /*
  * The various flags for setting POSIX.1b interval timers.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-28 20:43 [PATCH] ktimers subsystem 2.6.14-rc2-kt5 tglx
@ 2005-09-28 23:59 ` Frank Sorenson
  2005-09-29  0:50   ` Frank Sorenson
  2005-09-29  1:10 ` john stultz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 67+ messages in thread
From: Frank Sorenson @ 2005-09-28 23:59 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, mingo, akpm, george, johnstul, paulmck, hch, oleg,
	zippel, tim.bird

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

tglx@linutronix.de wrote:
> This is an updated version which contains following changes:
<snip>
> Thanks for review and feedback.
> 
> tglx

I get this kernel panic on boot (serial capture) with the latest
git tree (2.6.14-rc2++) plus this version of ktimers:

[4294709.646000] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[4294709.646000]  printing eip:
[4294709.646000] c0137578
[4294709.646000] *pde = 00000000
[4294709.646000] Oops: 0000 [#1]
[4294709.646000] PREEMPT 
[4294709.646000] Modules linked in: ipw2200 ieee80211 ieee80211_crypt
[4294709.646000] CPU:    0
[4294709.646000] EIP:    0060:[<c0137578>]    Not tainted VLI
[4294709.646000] EFLAGS: 00010087   (2.6.14-rc2-fs2) 
[4294709.646000] EIP is at enqueue_ktimer+0x168/0x280
[4294709.646000] eax: 00000000   ebx: c051c1b4   ecx: 0000002a   edx: 14712508
[4294709.646000] esi: 00000000   edi: f7f42240   ebp: c051c1b8   esp: c05e4f58
[4294709.646000] ds: 007b   es: 007b   ss: 0068
[4294709.646000] Process swapper (pid: 0, threadinfo=c05e4000 task=c0515bc0)
[4294709.646000] Stack: c051c1b4 00000000 c051c1ac 147a7f90 0000002a 147a7f90 0000002a f7f42240 
[4294709.646000]        147a6ff0 0000002a c051c1ac c0137d72 00000005 c05e4000 c051c1b8 c051c1b4 
[4294709.646000]        c05e4000 c0124380 f7f69a90 147a6ff0 0000002a 00000001 c06136c8 0000000a 
[4<0>Kernel panic - not syncing: Fatal exception in interrupt
[4294709.680000]  



Frank
- -- 
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
frank@tuxrocks.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDOy5haI0dwg4A47wRAnXcAJ996Yrw2nkjuNThfLCep2GRZ0VjzgCcDIWl
IvIgmrrHG3qB8LNszTPITX8=
=TLMU
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-28 23:59 ` Frank Sorenson
@ 2005-09-29  0:50   ` Frank Sorenson
  2005-09-29  0:56     ` john stultz
  0 siblings, 1 reply; 67+ messages in thread
From: Frank Sorenson @ 2005-09-29  0:50 UTC (permalink / raw)
  To: Frank Sorenson
  Cc: tglx, linux-kernel, mingo, akpm, george, johnstul, paulmck, hch,
	oleg, zippel, tim.bird

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Frank Sorenson wrote:
> I get this kernel panic on boot (serial capture) with the latest
> git tree (2.6.14-rc2++) plus this version of ktimers:

Here's a little more information.  I've narrowed the panic down to ntpd
startup.  Without ntpd, the system seems to run okay, but panics the
moment I startup ntpd.

Hope this helps,

Frank
- --
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
frank@tuxrocks.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDOzpSaI0dwg4A47wRAipFAJ0c6/2tif49xVEhDZCH2drgpJXQmACgoY+G
tT9LkOWmS67SyX5Vekrl024=
=f/qY
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-29  0:50   ` Frank Sorenson
@ 2005-09-29  0:56     ` john stultz
  2005-09-29  1:05       ` Frank Sorenson
  0 siblings, 1 reply; 67+ messages in thread
From: john stultz @ 2005-09-29  0:56 UTC (permalink / raw)
  To: Frank Sorenson
  Cc: tglx, linux-kernel, mingo, akpm, george, paulmck, hch, oleg,
	zippel, tim.bird

On Wed, 2005-09-28 at 18:50 -0600, Frank Sorenson wrote:
> Frank Sorenson wrote:
> > I get this kernel panic on boot (serial capture) with the latest
> > git tree (2.6.14-rc2++) plus this version of ktimers:
> 
> Here's a little more information.  I've narrowed the panic down to ntpd
> startup.  Without ntpd, the system seems to run okay, but panics the
> moment I startup ntpd.

Are you just testing the ktimers patch or the full set of patches Thomas
is working with (including my code)?

thanks
-john


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-29  0:56     ` john stultz
@ 2005-09-29  1:05       ` Frank Sorenson
  0 siblings, 0 replies; 67+ messages in thread
From: Frank Sorenson @ 2005-09-29  1:05 UTC (permalink / raw)
  To: john stultz
  Cc: tglx, linux-kernel, mingo, akpm, george, paulmck, hch, oleg,
	zippel, tim.bird

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

john stultz wrote:
> On Wed, 2005-09-28 at 18:50 -0600, Frank Sorenson wrote:
> 
>>Frank Sorenson wrote:
>>
>>>I get this kernel panic on boot (serial capture) with the latest
>>>git tree (2.6.14-rc2++) plus this version of ktimers:
>>
>>Here's a little more information.  I've narrowed the panic down to ntpd
>>startup.  Without ntpd, the system seems to run okay, but panics the
>>moment I startup ntpd.
> 
> 
> Are you just testing the ktimers patch or the full set of patches Thomas
> is working with (including my code)?
> 
> thanks
> -john

After first testing with other patches, I verified that the panic occurs
without any other patches involved.

So, I am just testing this particular ktimers patch, without any others.

Am I correct in my understanding that this patch is standalone?

Frank
- --
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
frank@tuxrocks.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDOz3faI0dwg4A47wRAn+/AKDsu/lRzUhbln8pNoRpfZ2V45D0NgCfQLHF
lK6+uXzWFQQhp8SvqBxPw1M=
=B9oy
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-28 20:43 [PATCH] ktimers subsystem 2.6.14-rc2-kt5 tglx
  2005-09-28 23:59 ` Frank Sorenson
@ 2005-09-29  1:10 ` john stultz
  2005-09-29  6:53   ` Thomas Gleixner
  2005-09-29 19:57 ` George Anzinger
  2005-10-01  1:03 ` Roman Zippel
  3 siblings, 1 reply; 67+ messages in thread
From: john stultz @ 2005-09-29  1:10 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, mingo, akpm, george, paulmck, hch, oleg, zippel,
	tim.bird

On Wed, 2005-09-28 at 22:43 +0200, tglx@linutronix.de wrote:
> +static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
> +			   ktime_t *tim, int mode)
> +{
> +	struct rb_node **link = &base->active.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct ktimer *entry;
> +	struct list_head *prev = &base->pending;
> +	ktime_t now;
> +
> +	/* Get current time */
> +	now = base->get_time();
> +
> +	/* Timer expiry mode */
> +	switch (mode & ~KTIMER_NOCHECK) {
> +	case KTIMER_ABS:
> +		timer->expires = *tim;
> +		break;
> +	case KTIMER_REL:
> +		timer->expires = ktime_add(now, *tim);
> +		break;
> +	case KTIMER_INCR:
> +		timer->expires = ktime_add(timer->expires, *tim);
> +		break;

...



> +static inline void do_remove_ktimer(struct ktimer *timer,
> +				    struct ktimer_base *base, int rearm)
> +{
> +	list_del(&timer->list);
> +	rb_erase(&timer->node, &base->active);
> +	timer->node.rb_parent = KTIMER_POISON;
> +	timer->status = KTIMER_INACTIVE;
> +	base->count--;
> +	BUG_ON(base->count < 0);
> +	/* Auto rearm the timer ? */
> +	if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
> +		enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
> +}


There's a couple of places like this where you pass NULL as the ktime_t
pointer tim to enqueue_ktimer(). However in enqueue_ktimer, you
dereference tim in a few spots w/o checking for NULL.

I'm guessing this is what Frank is seeing.

thanks
-john



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-29  1:10 ` john stultz
@ 2005-09-29  6:53   ` Thomas Gleixner
  2005-09-30 15:58     ` Frank Sorenson
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2005-09-29  6:53 UTC (permalink / raw)
  To: john stultz
  Cc: linux-kernel, mingo, akpm, george, paulmck, hch, oleg, zippel,
	tim.bird

On Wed, 2005-09-28 at 18:10 -0700, john stultz wrote:

> > +	/* Auto rearm the timer ? */
> > +	if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
> > +		enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
> > +}
> 
> 
> There's a couple of places like this where you pass NULL as the ktime_t
> pointer tim to enqueue_ktimer(). However in enqueue_ktimer, you
> dereference tim in a few spots w/o checking for NULL.
> 

The KTIMER_REARM case is the broken spot. I fixed this already as it was
oopsing here to, but somehow I messed up with quilt.

tglx

Index: linux-2.6.14-rc2-rt4/kernel/ktimers.c
===================================================================
--- linux-2.6.14-rc2-rt4.orig/kernel/ktimers.c
+++ linux-2.6.14-rc2-rt4/kernel/ktimers.c
@@ -242,7 +242,7 @@ static int enqueue_ktimer(struct ktimer 
 		goto nocheck;
 	case KTIMER_REARM:
 		while ktime_cmp(timer->expires, <= , now) {
-			timer->expires = ktime_add(timer->expires, *tim);
+			timer->expires = ktime_add(timer->expires, timer->interval);
 			timer->overrun++;
 		}
 		goto nocheck;



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-29  6:53   ` Thomas Gleixner
@ 2005-09-30 15:58     ` Frank Sorenson
  0 siblings, 0 replies; 67+ messages in thread
From: Frank Sorenson @ 2005-09-30 15:58 UTC (permalink / raw)
  To: tglx
  Cc: john stultz, linux-kernel, mingo, akpm, george, paulmck, hch,
	oleg, zippel, tim.bird

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thomas Gleixner wrote:
> The KTIMER_REARM case is the broken spot. I fixed this already as it was
> oopsing here to, but somehow I messed up with quilt.
> 
> tglx

This does indeed fix the panic.  Thanks.

Frank
- --
Frank Sorenson - KD7TZK
Systems Manager, Computer Science Department
Brigham Young University
frank@tuxrocks.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFDPWCKaI0dwg4A47wRAmjAAJ0XarfSYFyqAvGKi+uHbXZLg4+fEwCgso39
5hdrQfgzwMDdT9zM+4GkwLk=
=UoVd
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-28 20:43 [PATCH] ktimers subsystem 2.6.14-rc2-kt5 tglx
  2005-09-28 23:59 ` Frank Sorenson
  2005-09-29  1:10 ` john stultz
@ 2005-09-29 19:57 ` George Anzinger
  2005-10-01  1:03 ` Roman Zippel
  3 siblings, 0 replies; 67+ messages in thread
From: George Anzinger @ 2005-09-29 19:57 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, mingo, akpm, johnstul, paulmck, hch, oleg, zippel,
	tim.bird

Am I the only one finding "=20\n" and other corruption in this patch?

George
-- 

tglx@linutronix.de wrote:
> This is an updated version which contains following changes:
> 
> - Selectable time storage format: union/struct based, scalar (64bit)
> - Fixed an endless loop in forward_posix_timer (George Anzinger)
> - Fixed a wrong sizeof(x) (George Anzinger)
> - Fixed build problems for non x86 architectures
> 
> Roman pointed out that the penalty for some architectures 
> would be quite big when using the nsec_t (64bit) scalar time 
> storage format. After a long discussion and some more detailed 
> tests especially on ARM it turned out that the scalar format 
> is unfortunately not suitable everywhere. The tradeoff between 
> performance and cleanliness seems too big for some architectures. 
> 
> After several rounds of functional conversions and 
> cleanups an acceptable compromise between cleanliness and 
> storage format flexibility was found.
> 
> For 64bit architectures the scalar representation is definitely
> a win and therefor enabled unconditionally. The code defaults to
> the union/struct based implementation on 32bit archs, but can be
> switched to the scalar storage format by setting 
> CONFIG_KTIME_SCALAR=y if there is a benefit for the particular 
> architecture. The union/struct magic has an advantage over the 
> struct timespec based format which I considered to use first. It
> produces better and denser code for most architecures and does no
> harm anywhere else. This might change with improvements of 
> compilers, but then it requires just a replacement of the related
> macros / inlines.
> 
> The code is not harder to understand than the previous 
> open coded scalar storage based implementation.
> 
> The correctness was verified with the posix timer tests from 
> the HRT project on the forward ported ktimers based high 
> resolution proof of concept implementation.
> For those interested in this topic the patchseries is available
> at http://www.tglx.de/private/tglx/ktimers/patch-2.6.14-rc2-kt5.patches.tar.bz2
> 
> 
> Thanks for review and feedback.
> 
> tglx
> 
> 
> ktimers seperate the "timer API" from the "timeout API". 
> ktimers are used for:
> - nanosleep
> - posixtimers
> - itimers
> 
> 
> The patch contains the base implementation of ktimers and the
> conversion of nanosleep, posixtimers and itimers to ktimer users. 
> 
> The patch does not require other changes to the Linux time(r) core
> system.
> 
> The implementation was done with following constraints in mind:
> 
> - Not bound to jiffies
> - Multiple time sources
> - Per CPU timer queues
> - Simplification of absolute CLOCK_REALTIME posix timers
> - High resolution timer aware
> - Allows the timeout API to reschedule the next event 
>   (for tickless systems)
> 
> Ktimers enqueue the timers into a time sorted list, which is implemented 
> with a rbtree, which is effiecient and already used in other performance 
> critical parts of the kernel. This is a bit slower than the timer wheel, 
> but due to the fact that the vast majority of timers is actually 
> expiring it has to be waged versus the cascading penalty.
> 
> The code supports multiple time sources. Currently implemented are 
> CLOCK_REALTIME and CLOCK_MONOTONIC. They provide seperate timer queues 
> and support functions.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> ---
> Index: linux-2.6.14-rc2-rt4/include/linux/calc64.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.14-rc2-rt4/include/linux/calc64.h
> @@ -0,0 +1,31 @@
> +#ifndef _linux_CALC64_H
> +#define _linux_CALC64_H
> +
> +#include <linux/types.h>
> +#include <asm/div64.h>
> +
> +#ifndef div_long_long_rem
> +#define div_long_long_rem(dividend,divisor,remainder) 	\
> +({							\
> +	u64 result = dividend;				\
> +	*remainder = do_div(result,divisor);		\
> +	result;						\
> +})
> +#endif
> +
> +static inline long div_long_long_rem_signed(long long dividend,
> +					    long divisor,
> +					    long *remainder)
> +{
> +	long res;
> +
> +	if (unlikely(dividend < 0)) {
> +		res = -div_long_long_rem(-dividend, divisor, remainder);
> +		*remainder = -(*remainder);
> +	} else {
> +		res = div_long_long_rem(dividend, divisor, remainder);
> +	}
> +	return res;
> +}
> +
> +#endif
> Index: linux-2.6.14-rc2-rt4/include/linux/jiffies.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/jiffies.h
> +++ linux-2.6.14-rc2-rt4/include/linux/jiffies.h
> @@ -1,21 +1,12 @@
>  #ifndef _LINUX_JIFFIES_H
>  #define _LINUX_JIFFIES_H
>  
> +#include <linux/calc64.h>
>  #include <linux/kernel.h>
>  #include <linux/types.h>
>  #include <linux/time.h>
>  #include <linux/timex.h>
>  #include <asm/param.h>			/* for HZ */
> -#include <asm/div64.h>
> -
> -#ifndef div_long_long_rem
> -#define div_long_long_rem(dividend,divisor,remainder) \
> -({							\
> -	u64 result = dividend;				\
> -	*remainder = do_div(result,divisor);		\
> -	result;						\
> -})
> -#endif
>  
>  /*
>   * The following defines establish the engineering parameters of the PLL
> Index: linux-2.6.14-rc2-rt4/fs/exec.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/fs/exec.c
> +++ linux-2.6.14-rc2-rt4/fs/exec.c
> @@ -645,9 +645,10 @@ static inline int de_thread(struct task_
>  		 * synchronize with any firing (by calling del_timer_sync)
>  		 * before we can safely let the old group leader die.
>  		 */
> -		sig->real_timer.data = (unsigned long)current;
> -		if (del_timer_sync(&sig->real_timer))
> -			add_timer(&sig->real_timer);
> +		sig->real_timer.data = current;
> +		if (stop_ktimer(&sig->real_timer))
> +			start_ktimer(&sig->real_timer, NULL,
> +				     KTIMER_RESTART|KTIMER_NOCHECK);
>  	}
>  	while (atomic_read(&sig->count) > count) {
>  		sig->group_exit_task = current;
> @@ -659,7 +660,7 @@ static inline int de_thread(struct task_
>  	}
>  	sig->group_exit_task = NULL;
>  	sig->notify_count = 0;
> -	sig->real_timer.data = (unsigned long)current;
> +	sig->real_timer.data = current;
>  	spin_unlock_irq(lock);
>  
>  	/*
> Index: linux-2.6.14-rc2-rt4/fs/proc/array.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/fs/proc/array.c
> +++ linux-2.6.14-rc2-rt4/fs/proc/array.c
> @@ -330,7 +330,7 @@ static int do_task_stat(struct task_stru
>  	unsigned long  min_flt = 0,  maj_flt = 0;
>  	cputime_t cutime, cstime, utime, stime;
>  	unsigned long rsslim = 0;
> -	unsigned long it_real_value = 0;
> +	DEFINE_KTIME(it_real_value);
>  	struct task_struct *t;
>  	char tcomm[sizeof(task->comm)];
>  
> @@ -386,7 +386,7 @@ static int do_task_stat(struct task_stru
>  			utime = cputime_add(utime, task->signal->utime);
>  			stime = cputime_add(stime, task->signal->stime);
>  		}
> -		it_real_value = task->signal->it_real_value;
> +		it_real_value = task->signal->real_timer.expires;
>  	}
>  	ppid = pid_alive(task) ? task->group_leader->real_parent->tgid : 0;
>  	read_unlock(&tasklist_lock);
> @@ -435,7 +435,7 @@ static int do_task_stat(struct task_stru
>  		priority,
>  		nice,
>  		num_threads,
> -		jiffies_to_clock_t(it_real_value),
> +		(clock_t) ktime_to_clock_t(it_real_value),
>  		start_time,
>  		vsize,
>  		mm ? get_mm_counter(mm, rss) : 0, /* you might want to shift this left 3 */
> Index: linux-2.6.14-rc2-rt4/include/linux/ktimer.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.14-rc2-rt4/include/linux/ktimer.h
> @@ -0,0 +1,335 @@
> +#ifndef _LINUX_KTIMER_H
> +#define _LINUX_KTIMER_H
> +
> +#include <linux/init.h>
> +#include <linux/list.h>
> +#include <linux/rbtree.h>
> +#include <linux/time.h>
> +#include <linux/wait.h>
> +
> +/* Timer API */
> +
> +/*
> + * Select the ktime_t data type
> + */
> +#if defined(CONFIG_KTIME_SCALAR) || (BITS_PER_LONG == 64)
> + #define KTIME_IS_SCALAR
> +#endif
> +
> +#ifndef KTIME_IS_SCALAR
> +typedef union {
> +	s64	tv64;
> +	struct {
> +#ifdef __BIG_ENDIAN
> +	s32	sec, nsec;
> +#else
> +	s32	nsec, sec;
> +#endif
> +	} tv;
> +} ktime_t;
> +
> +#else
> +
> +typedef s64 ktime_t;
> +
> +#endif
> +
> +struct ktimer_base;
> +
> +/*
> + * Timer structure must be initialized by init_ktimer_xxx !
> + */
> +struct ktimer {
> +	struct rb_node		node;
> +	struct list_head	list;
> +	ktime_t			expires;
> +	ktime_t			expired;
> +	ktime_t			interval;
> +	int 	 	 	overrun;
> +	unsigned long		status;
> +	void 			(*function)(void *);
> +	void			*data;
> +	struct ktimer_base 	*base;
> +};
> +
> +/*
> + * Timer base struct
> + */
> +struct ktimer_base {
> +	int			index;
> +	char			*name;
> +	spinlock_t		lock;
> +	struct rb_root		active;
> +	struct list_head	pending;
> +	int			count;
> +	unsigned long		resolution;
> +	ktime_t			(*get_time)(void);
> +	struct ktimer		*running_timer;
> +	wait_queue_head_t	wait_for_running_timer;
> +};
> +
> +/*
> + * Values for the mode argument of xxx_ktimer functions
> + */
> +enum
> +{
> +	KTIMER_NOREARM,	/* Internal value */
> +	KTIMER_ABS,	/* Time value is absolute */
> +	KTIMER_REL,	/* Time value is relativ to now */
> +	KTIMER_INCR,	/* Time value is relativ to previous expiry time */
> +	KTIMER_FORWARD,	/* Timer is rearmed with value. Overruns are accounted */
> +	KTIMER_REARM,	/* Timer is rearmed with interval. Overruns are accounted */
> +	KTIMER_RESTART	/* Timer is restarted with the stored expiry value */
> +};
> +
> +/* The timer states */
> +enum
> +{
> +	KTIMER_INACTIVE,
> +	KTIMER_PENDING,
> +	KTIMER_EXPIRED,
> +	KTIMER_EXPIRED_NOQUEUE,
> +};
> +
> +/* Expiry must not be checked when the timer is started */
> +#define KTIMER_NOCHECK		0x10000
> +
> +#define KTIMER_POISON		((void *) 0x00100101)
> +
> +#define KTIME_ZERO 		0LL
> +
> +#define ktimer_active(t) ((t)->status != KTIMER_INACTIVE)
> +#define ktimer_before(t1, t2) (ktime_cmp((t1)->expires, <, (t2)->expires))
> +
> +#ifndef KTIME_IS_SCALAR
> +/*
> + * Helper macros/inlines to get the math with ktime_t right. Uurgh, that's
> + * ugly as hell, but for performance sake we have to use this. The
> + * nsec_t based code was nice and simple. :(
> + *
> + * Be careful when using this stuff. It blows up on you if you dön't
> + * get the weirdness right.
> + *
> + * Be especially aware, that negative values are represented in the
> + * form:
> + * tv.sec < 0 and 0 >= tv.nsec < NSEC_PER_SEC
> + *
> + */
> +#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
> +
> +#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
> +#define ktime_cmp_val(a, op, b) ((a).tv64 op b)
> +
> +#define ktime_set(s,n) 		\
> +({				\
> +	ktime_t __kt;		\
> +	__kt.tv.sec = s;	\
> +	__kt.tv.nsec = n;	\
> +	__kt;			\
> +})
> +
> +#define ktime_set_zero(k) k.tv64 = 0LL
> +
> +#define ktime_set_low_high(l,h) ktime_set(h,l)
> +
> +#define ktime_get_low(t)	(t).tv.nsec
> +#define ktime_get_high(t)	(t).tv.sec
> +
> +static inline ktime_t ktime_set_normalized(long sec, long nsec)
> +{
> +	ktime_t res;
> +
> +	while (nsec < 0) {
> +                nsec += NSEC_PER_SEC;
> +		sec--;
> +        }
> +	while (nsec >= NSEC_PER_SEC) {
> +                nsec -= NSEC_PER_SEC;
> +		sec++;
> +	}
> +
> +	res.tv.sec = sec;
> +	res.tv.nsec = nsec;
> +	return res;
> +}
> +
> +static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
> +{
> +	ktime_t res;
> +
> +	res.tv64 = a.tv64 - b.tv64;
> +	if (res.tv.nsec < 0)
> +		res.tv.nsec += NSEC_PER_SEC;
> +
> +	return res;
> +}
> +
> +static inline ktime_t ktime_add(ktime_t a, ktime_t b)
> +{
> +	ktime_t res;
> +
> +	res.tv64 = a.tv64 + b.tv64;
> +	if (res.tv.nsec >= NSEC_PER_SEC) {
> +		res.tv.nsec -= NSEC_PER_SEC;
> +		res.tv.sec++;
> +	}
> +	return res;
> +}
> +
> +static inline ktime_t ktime_add_ns(ktime_t a, u64 nsec)
> +{
> +	ktime_t tmp;
> +
> +	if (likely(nsec < NSEC_PER_SEC)) {
> +		tmp.tv64 = nsec;
> +	} else {
> +		unsigned long rem;
> +		rem = do_div(nsec, NSEC_PER_SEC);
> +		tmp = ktime_set((long)nsec, rem);
> +	}
> +	return ktime_add(a,tmp);
> +}
> +
> +#define timespec_to_ktime(ts)			\
> +({						\
> +	ktime_t __kt;				\
> +	struct timespec __ts = (ts);		\
> +	__kt.tv.sec = (s32)__ts.tv_sec;		\
> +	__kt.tv.nsec = (s32)__ts.tv_nsec;	\
> +	__kt;					\
> +})
> +
> +#define ktime_to_timespec(kt)			\
> +({						\
> +	struct timespec __ts;			\
> +	ktime_t __kt = (kt);			\
> +	__ts.tv_sec = (time_t)__kt.tv.sec;	\
> +	__ts.tv_nsec = (long)__kt.tv.nsec;	\
> +	__ts;					\
> +})
> +
> +#define ktime_to_timeval(kt)					\
> +({								\
> +	struct timeval __tv;					\
> +	ktime_t __kt = (kt);					\
> +	__tv.tv_sec = (time_t)__kt.tv.sec;			\
> +	__tv.tv_usec = (long)(__kt.tv.nsec / NSEC_PER_USEC);	\
> +	__tv;							\
> +})
> +
> +#define ktime_to_clock_t(kt)				\
> +({							\
> +	ktime_t __kt = (kt);				\
> +	u64 nsecs = (u64) __kt.tv.sec * NSEC_PER_SEC;	\
> +	nsec_to_clock_t(nsecs + (u64) __kt.tv.nsec);	\
> +})
> +
> +#define ktime_to_ns(kt) 					\
> +({								\
> +	ktime_t __kt = (kt);					\
> +	(((u64)__kt.tv.sec * NSEC_PER_SEC) + (u64)__kt.tv.nsec);\
> +})
> +
> +#else
> +
> +/* ktime_t macros when using a 64bit variable */
> +
> +#define DEFINE_KTIME(kt) ktime_t kt = 0LL
> +
> +#define ktime_cmp(a,op,b) ((a) op (b))
> +#define ktime_cmp_val(a,op,b) ((a) op b)
> +
> +#define ktime_set(s,n) (((s64) s * NSEC_PER_SEC) + (s64)n)
> +#define ktime_set_zero(kt) kt = 0LL
> +
> +#define ktime_set_low_high(l,h) ((s64)((u64)l) | (((s64) h) << 32))
> +
> +#define ktime_get_low(t)	((t) & 0xFFFFFFFFLL)
> +#define ktime_get_high(t)	((t) >> 32)
> +
> +#define ktime_sub(a,b)	((a) - (b))
> +#define ktime_add(a,b)	((a) + (b))
> +#define ktime_add_ns(a,b) ((a) + (b))
> +
> +#define timespec_to_ktime(ts) ktime_set(ts.tv_sec, ts.tv_nsec)
> +
> +#define ktime_to_timespec(kt) ns_to_timespec(kt)
> +#define ktime_to_timeval(kt) ns_to_timeval(kt)
> +
> +#define ktime_to_clock_t(kt) nsec_to_clock_t(kt)
> +
> +#define ktime_to_ns(kt) (kt)
> +
> +#define ktime_set_normalized(s,n) ktime_set(s,n)
> +
> +#endif
> +
> +/* Exported functions */
> +extern void fastcall init_ktimer_real(struct ktimer *timer);
> +extern void fastcall init_ktimer_mono(struct ktimer *timer);
> +extern int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
> +extern int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode);
> +extern int try_to_stop_ktimer(struct ktimer *timer);
> +extern int stop_ktimer(struct ktimer *timer);
> +extern ktime_t get_remtime_ktimer(struct ktimer *timer, long fake);
> +extern ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now);
> +extern void __init init_ktimers(void);
> +
> +/* Conversion functions with rounding based on resolution */
> +extern ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv);
> +extern ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts);
> +
> +/* Posix timers current quirks */
> +extern int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp);
> +extern int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp);
> +
> +/* nanosleep functions */
> +long ktimer_nanosleep_mono(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
> +long ktimer_nanosleep_real(struct timespec *rqtp, struct timespec __user *rmtp, int mode);
> +
> +#if defined(CONFIG_SMP)
> +extern void wait_for_ktimer(struct ktimer *timer);
> +#else
> +#define wait_for_ktimer(t) do {} while (0)
> +#endif
> +
> +#define KTIME_REALTIME_RES (NSEC_PER_SEC/HZ)
> +#define KTIME_MONOTONIC_RES (NSEC_PER_SEC/HZ)
> +
> +static inline void get_ktime_mono_ts(struct timespec *ts)
> +{
> +	unsigned long seq;
> +	struct timespec tomono;
> +	do {
> +		seq = read_seqbegin(&xtime_lock);
> +		getnstimeofday(ts);
> +		tomono = wall_to_monotonic;
> +	} while (read_seqretry(&xtime_lock, seq));
> +
> +
> +	set_normalized_timespec(ts, ts->tv_sec + tomono.tv_sec,
> +				ts->tv_nsec + tomono.tv_nsec);
> +
> +}
> +
> +static inline ktime_t do_get_ktime_mono(void)
> +{
> +	struct timespec now;
> +
> +	get_ktime_mono_ts(&now);
> +	return timespec_to_ktime(now);
> +}
> +
> +#define get_ktime_real_ts(ts) getnstimeofday(ts)
> +static inline ktime_t do_get_ktime_real(void)
> +{
> +	struct timespec now;
> +
> +	getnstimeofday(&now);
> +	return timespec_to_ktime(now);
> +}
> +
> +#define clock_was_set() do { } while (0)
> +extern void run_ktimer_queues(void);
> +
> +#endif
> Index: linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/posix-timers.h
> +++ linux-2.6.14-rc2-rt4/include/linux/posix-timers.h
> @@ -51,10 +51,9 @@ struct k_itimer {
>  	struct sigqueue *sigq;		/* signal queue entry. */
>  	union {
>  		struct {
> -			struct timer_list timer;
> -			struct list_head abs_timer_entry; /* clock abs_timer_list */
> -			struct timespec wall_to_prev;   /* wall_to_monotonic used when set */
> -			unsigned long incr; /* interval in jiffies */
> +			struct ktimer timer;
> +			ktime_t incr;
> +			int overrun;
>  		} real;
>  		struct cpu_timer_list cpu;
>  		struct {
> @@ -66,10 +65,6 @@ struct k_itimer {
>  	} it;
>  };
>  
> -struct k_clock_abs {
> -	struct list_head list;
> -	spinlock_t lock;
> -};
>  struct k_clock {
>  	int res;		/* in nano seconds */
>  	int (*clock_getres) (clockid_t which_clock, struct timespec *tp);
> @@ -77,7 +72,7 @@ struct k_clock {
>  	int (*clock_set) (clockid_t which_clock, struct timespec * tp);
>  	int (*clock_get) (clockid_t which_clock, struct timespec * tp);
>  	int (*timer_create) (struct k_itimer *timer);
> -	int (*nsleep) (clockid_t which_clock, int flags, struct timespec *);
> +	int (*nsleep) (clockid_t which_clock, int flags, struct timespec *, struct timespec __user *);
>  	int (*timer_set) (struct k_itimer * timr, int flags,
>  			  struct itimerspec * new_setting,
>  			  struct itimerspec * old_setting);
> @@ -91,37 +86,104 @@ void register_posix_clock(clockid_t cloc
>  
>  /* Error handlers for timer_create, nanosleep and settime */
>  int do_posix_clock_notimer_create(struct k_itimer *timer);
> -int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *);
> +int do_posix_clock_nonanosleep(clockid_t, int flags, struct timespec *, struct timespec __user *);
>  int do_posix_clock_nosettime(clockid_t, struct timespec *tp);
>  
>  /* function to call to trigger timer event */
>  int posix_timer_event(struct k_itimer *timr, int si_private);
>  
> -struct now_struct {
> -	unsigned long jiffies;
> -};
> -
> -#define posix_get_now(now) (now)->jiffies = jiffies;
> -#define posix_time_before(timer, now) \
> -                      time_before((timer)->expires, (now)->jiffies)
> -
> -#define posix_bump_timer(timr, now)					\
> -         do {								\
> -              long delta, orun;						\
> -	      delta = now.jiffies - (timr)->it.real.timer.expires;	\
> -              if (delta >= 0) {						\
> -	           orun = 1 + (delta / (timr)->it.real.incr);		\
> -	          (timr)->it.real.timer.expires +=			\
> -			 orun * (timr)->it.real.incr;			\
> -                  (timr)->it_overrun += orun;				\
> -              }								\
> -            }while (0)
> +#if (BITS_PER_LONG < 64)
> +static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
> +{
> +	ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
> +	unsigned long orun = 1;
> +
> +	if (ktime_cmp_val(delta, <, KTIME_ZERO))
> +		goto out;
> +
> +	if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
> +
> +		int sft = 0;
> +		u64 div, dclc, inc, dns;
> +
> +		dclc = dns = ktime_to_ns(delta);
> +		div = inc = ktime_to_ns(t->it.real.incr);
> +		/* Make sure the divisor is less than 2^32 */
> +		while(div >> 32) {
> +			sft++;
> +			div >>= 1;
> +		}
> +		dclc >>= sft;
> +		do_div(dclc, (unsigned long) div);
> +		orun = (unsigned long) dclc;
> +		if (likely(!(inc >> 32)))
> +			dclc *= (unsigned long) inc;
> +		else
> +			dclc *= inc;
> +		t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
> +							dclc);
> +	} else {
> +		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> +						     t->it.real.incr);
> +	}
> +	/*
> +	 * Here is the correction for exact.  Also covers delta == incr
> +	 * which is the else clause above.
> +	 */
> +	if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
> +		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> +						     t->it.real.incr);
> +		orun++;
> +	}
> +	t->it_overrun += orun;
> +
> + out:
> +	return ktime_sub(t->it.real.timer.expires, now);
> +}
> +#else
> +static inline ktime_t forward_posix_timer(struct k_itimer *t, ktime_t now)
> +{
> +	ktime_t delta = ktime_sub(now, t->it.real.timer.expires);
> +	unsigned long orun = 1;
> +
> +	if (ktime_cmp_val(delta, <, KTIME_ZERO))
> +		goto out;
> +
> +	if (unlikely(ktime_cmp(delta, >, t->it.real.incr))) {
> +
> +		u64 dns, inc;
> +
> +		dns = ktime_to_ns(delta);
> +		inc = ktime_to_ns(t->it.real.incr);
> +
> +		orun = dns / inc;
> +		t->it.real.timer.expires = ktime_add_ns(t->it.real.timer.expires,
> +							orun * inc);
> +	} else {
> +		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> +						     t->it.real.incr);
> +	}
> +	/*
> +	 * Here is the correction for exact.  Also covers delta == incr
> +	 * which is the else clause above.
> +	 */
> +	if (ktime_cmp(t->it.real.timer.expires, <=, now)) {
> +		t->it.real.timer.expires = ktime_add(t->it.real.timer.expires,
> +						     t->it.real.incr);
> +		orun++;
> +	}
> +	t->it_overrun += orun;
> + out:
> +	return ktime_sub(t->it.real.timer.expires, now);
> +}
> +#endif
>  
>  int posix_cpu_clock_getres(clockid_t which_clock, struct timespec *);
>  int posix_cpu_clock_get(clockid_t which_clock, struct timespec *);
>  int posix_cpu_clock_set(clockid_t which_clock, const struct timespec *tp);
>  int posix_cpu_timer_create(struct k_itimer *);
> -int posix_cpu_nsleep(clockid_t, int, struct timespec *);
> +int posix_cpu_nsleep(clockid_t, int, struct timespec *,
> +		     struct timespec __user *);
>  int posix_cpu_timer_set(struct k_itimer *, int,
>  			struct itimerspec *, struct itimerspec *);
>  int posix_cpu_timer_del(struct k_itimer *);
> Index: linux-2.6.14-rc2-rt4/include/linux/sched.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/sched.h
> +++ linux-2.6.14-rc2-rt4/include/linux/sched.h
> @@ -104,6 +104,7 @@ extern unsigned long nr_iowait(void);
>  #include <linux/param.h>
>  #include <linux/resource.h>
>  #include <linux/timer.h>
> +#include <linux/ktimer.h>
>  
>  #include <asm/processor.h>
>  
> @@ -346,8 +347,7 @@ struct signal_struct {
>  	struct list_head posix_timers;
>  
>  	/* ITIMER_REAL timer for the process */
> -	struct timer_list real_timer;
> -	unsigned long it_real_value, it_real_incr;
> +	struct ktimer real_timer;
>  
>  	/* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
>  	cputime_t it_prof_expires, it_virt_expires;
> Index: linux-2.6.14-rc2-rt4/include/linux/timer.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/timer.h
> +++ linux-2.6.14-rc2-rt4/include/linux/timer.h
> @@ -91,6 +91,6 @@ static inline void add_timer(struct time
>  
>  extern void init_timers(void);
>  extern void run_local_timers(void);
> -extern void it_real_fn(unsigned long);
> +extern void it_real_fn(void *);
>  
>  #endif
> Index: linux-2.6.14-rc2-rt4/init/main.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/init/main.c
> +++ linux-2.6.14-rc2-rt4/init/main.c
> @@ -485,6 +485,7 @@ asmlinkage void __init start_kernel(void
>  	init_IRQ();
>  	pidhash_init();
>  	init_timers();
> +	init_ktimers();
>  	softirq_init();
>  	time_init();
>  
> Index: linux-2.6.14-rc2-rt4/kernel/Makefile
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/Makefile
> +++ linux-2.6.14-rc2-rt4/kernel/Makefile
> @@ -7,7 +7,8 @@ obj-y     = sched.o fork.o exec_domain.o
>  	    sysctl.o capability.o ptrace.o timer.o user.o \
>  	    signal.o sys.o kmod.o workqueue.o pid.o \
>  	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
> -	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o
> +	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \
> +	    ktimers.o
>  
>  obj-$(CONFIG_FUTEX) += futex.o
>  obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
> Index: linux-2.6.14-rc2-rt4/kernel/exit.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/exit.c
> +++ linux-2.6.14-rc2-rt4/kernel/exit.c
> @@ -842,7 +842,7 @@ fastcall NORET_TYPE void do_exit(long co
>  	update_mem_hiwater(tsk);
>  	group_dead = atomic_dec_and_test(&tsk->signal->live);
>  	if (group_dead) {
> - 		del_timer_sync(&tsk->signal->real_timer);
> + 		stop_ktimer(&tsk->signal->real_timer);
>  		acct_process(code);
>  	}
>  	exit_mm(tsk);
> Index: linux-2.6.14-rc2-rt4/kernel/fork.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/fork.c
> +++ linux-2.6.14-rc2-rt4/kernel/fork.c
> @@ -804,10 +804,9 @@ static inline int copy_signal(unsigned l
>  	init_sigpending(&sig->shared_pending);
>  	INIT_LIST_HEAD(&sig->posix_timers);
>  
> -	sig->it_real_value = sig->it_real_incr = 0;
> +	init_ktimer_mono(&sig->real_timer);
>  	sig->real_timer.function = it_real_fn;
> -	sig->real_timer.data = (unsigned long) tsk;
> -	init_timer(&sig->real_timer);
> +	sig->real_timer.data = tsk;
>  
>  	sig->it_virt_expires = cputime_zero;
>  	sig->it_virt_incr = cputime_zero;
> Index: linux-2.6.14-rc2-rt4/kernel/itimer.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/itimer.c
> +++ linux-2.6.14-rc2-rt4/kernel/itimer.c
> @@ -12,36 +12,22 @@
>  #include <linux/syscalls.h>
>  #include <linux/time.h>
>  #include <linux/posix-timers.h>
> +#include <linux/ktimer.h>
>  
>  #include <asm/uaccess.h>
>  
> -static unsigned long it_real_value(struct signal_struct *sig)
> -{
> -	unsigned long val = 0;
> -	if (timer_pending(&sig->real_timer)) {
> -		val = sig->real_timer.expires - jiffies;
> -
> -		/* look out for negative/zero itimer.. */
> -		if ((long) val <= 0)
> -			val = 1;
> -	}
> -	return val;
> -}
> -
>  int do_getitimer(int which, struct itimerval *value)
>  {
>  	struct task_struct *tsk = current;
> -	unsigned long interval, val;
> +	ktime_t interval, val;
>  	cputime_t cinterval, cval;
>  
>  	switch (which) {
>  	case ITIMER_REAL:
> -		spin_lock_irq(&tsk->sighand->siglock);
> -		interval = tsk->signal->it_real_incr;
> -		val = it_real_value(tsk->signal);
> -		spin_unlock_irq(&tsk->sighand->siglock);
> -		jiffies_to_timeval(val, &value->it_value);
> -		jiffies_to_timeval(interval, &value->it_interval);
> +		interval = tsk->signal->real_timer.interval;
> +		val = get_remtime_ktimer(&tsk->signal->real_timer, NSEC_PER_USEC);
> +		value->it_value = ktime_to_timeval(val);
> +		value->it_interval = ktime_to_timeval(interval);
>  		break;
>  	case ITIMER_VIRTUAL:
>  		read_lock(&tasklist_lock);
> @@ -113,59 +99,35 @@ asmlinkage long sys_getitimer(int which,
>  }
>  
>  
> -void it_real_fn(unsigned long __data)
> +/*
> + * The timer is automagically restarted, when interval != 0
> + */
> +void it_real_fn(void *data)
>  {
> -	struct task_struct * p = (struct task_struct *) __data;
> -	unsigned long inc = p->signal->it_real_incr;
> -
> -	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, p);
> -
> -	/*
> -	 * Now restart the timer if necessary.  We don't need any locking
> -	 * here because do_setitimer makes sure we have finished running
> -	 * before it touches anything.
> -	 * Note, we KNOW we are (or should be) at a jiffie edge here so
> -	 * we don't need the +1 stuff.  Also, we want to use the prior
> -	 * expire value so as to not "slip" a jiffie if we are late.
> -	 * Deal with requesting a time prior to "now" here rather than
> -	 * in add_timer.
> -	 */
> -	if (!inc)
> -		return;
> -	while (time_before_eq(p->signal->real_timer.expires, jiffies))
> -		p->signal->real_timer.expires += inc;
> -	add_timer(&p->signal->real_timer);
> +	send_group_sig_info(SIGALRM, SEND_SIG_PRIV, data);
>  }
>  
>  int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
>  {
>  	struct task_struct *tsk = current;
> - 	unsigned long val, interval, expires;
> +	struct ktimer *timer;
> +	ktime_t expires;
>  	cputime_t cval, cinterval, nval, ninterval;
>  
>  	switch (which) {
>  	case ITIMER_REAL:
> -again:
> -		spin_lock_irq(&tsk->sighand->siglock);
> -		interval = tsk->signal->it_real_incr;
> -		val = it_real_value(tsk->signal);
> -		/* We are sharing ->siglock with it_real_fn() */
> -		if (try_to_del_timer_sync(&tsk->signal->real_timer) < 0) {
> -			spin_unlock_irq(&tsk->sighand->siglock);
> -			goto again;
> -		}
> -		tsk->signal->it_real_incr =
> -			timeval_to_jiffies(&value->it_interval);
> -		expires = timeval_to_jiffies(&value->it_value);
> -		if (expires)
> -			mod_timer(&tsk->signal->real_timer,
> -				  jiffies + 1 + expires);
> -		spin_unlock_irq(&tsk->sighand->siglock);
> +		timer = &tsk->signal->real_timer;
> +		stop_ktimer(timer);
>  		if (ovalue) {
> -			jiffies_to_timeval(val, &ovalue->it_value);
> -			jiffies_to_timeval(interval,
> -					   &ovalue->it_interval);
> -		}
> +			ovalue->it_value = ktime_to_timeval(
> +				get_remtime_ktimer(timer, NSEC_PER_USEC));
> +			ovalue->it_interval = ktime_to_timeval(timer->interval);
> +		}
> +		timer->interval = ktimer_convert_timeval(timer, &value->it_interval);
> +		expires = ktimer_convert_timeval(timer, &value->it_value);
> +		if (ktime_cmp_val(expires, != , KTIME_ZERO))
> +			modify_ktimer(timer, &expires, KTIMER_REL | KTIMER_NOCHECK);
> +
>  		break;
>  	case ITIMER_VIRTUAL:
>  		nval = timeval_to_cputime(&value->it_value);
> Index: linux-2.6.14-rc2-rt4/kernel/ktimers.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.14-rc2-rt4/kernel/ktimers.c
> @@ -0,0 +1,824 @@
> +/*
> + *  linux/kernel/ktimers.c
> + *
> + *  Copyright(C) 2005 Thomas Gleixner <tglx@linutronix.de>
> + *
> + *  Kudos to Ingo Molnar for review, criticism, ideas
> + *
> + *  Credits:
> + *	Lot of ideas and implementation details taken from
> + *	timer.c and related code
> + *
> + *  Kernel timers
> + *
> + *  In contrast to the timeout related API found in kernel/timer.c,
> + *  ktimers provide finer resolution and accuracy depending on system
> + *  configuration and capabilities.
> + *
> + *  These timers are used for
> + *  - itimers
> + *  - posixtimers
> + *  - nanosleep
> + *  - precise in kernel timing
> + *
> + *  Please do not abuse this API for simple timeouts.
> + *
> + *  For licencing details see kernel-base/COPYING
> + *
> + */
> +
> +#include <linux/cpu.h>
> +#include <linux/interrupt.h>
> +#include <linux/ktimer.h>
> +#include <linux/module.h>
> +#include <linux/notifier.h>
> +#include <linux/percpu.h>
> +#include <linux/syscalls.h>
> +
> +#include <asm/uaccess.h>
> +
> +static ktime_t get_ktime_mono(void);
> +static ktime_t get_ktime_real(void);
> +
> +/* The time bases */
> +#define MAX_KTIMER_BASES	2
> +static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =
> +{
> +	{
> +		.index = CLOCK_REALTIME,
> +		.name = "Realtime",
> +		.get_time = &get_ktime_real,
> +		.resolution = KTIME_REALTIME_RES,
> +	},
> +	{
> +		.index = CLOCK_MONOTONIC,
> +		.name = "Monotonic",
> +		.get_time = &get_ktime_mono,
> +		.resolution = KTIME_MONOTONIC_RES,
> +	},
> +};
> +
> +/*
> + * The SMP/UP kludge goes here
> + */
> +#if defined(CONFIG_SMP)
> +
> +#define set_running_timer(b,t) b->running_timer = t
> +#define wake_up_timer_waiters(b) wake_up(&b->wait_for_running_timer)
> +#define ktimer_base_can_change (1)
> +/*
> + * Wait for a running timer
> + */
> +void wait_for_ktimer(struct ktimer *timer)
> +{
> +	struct ktimer_base *base = timer->base;
> +
> +	if (base && base->running_timer == timer)
> +		wait_event(base->wait_for_running_timer,
> +			   base->running_timer != timer);
> +}
> +
> +/*
> + * We are using hashed locking: holding per_cpu(ktimer_bases)[n].lock
> + * means that all timers which are tied to this base via timer->base are
> + * locked, and the base itself is locked too.
> + *
> + * So __run_timers/migrate_timers can safely modify all timers which could
> + * be found on the lists/queues.
> + *
> + * When the timer's base is locked, and the timer removed from list, it is
> + * possible to set timer->base = NULL and drop the lock: the timer remains
> + * locked.
> + */
> +static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
> +					    unsigned long *flags)
> +{
> +	struct ktimer_base *base;
> +
> +	for (;;) {
> +		base = timer->base;
> +		if (likely(base != NULL)) {
> +			spin_lock_irqsave(&base->lock, *flags);
> +			if (likely(base == timer->base))
> +				return base;
> +			/* The timer has migrated to another CPU */
> +			spin_unlock_irqrestore(&base->lock, *flags);
> +		}
> +		cpu_relax();
> +	}
> +}
> +
> +static inline struct ktimer_base *switch_ktimer_base(struct ktimer *timer,
> +						     struct ktimer_base *base)
> +{
> +	int ktidx = base->index;
> +	struct ktimer_base *new_base = &__get_cpu_var(ktimer_bases[ktidx]);
> +
> +	if (base != new_base) {
> +		/*
> +		 * We are trying to schedule the timer on the local CPU.
> +		 * However we can't change timer's base while it is running,
> +		 * so we keep it on the same CPU. No hassle vs. reprogramming
> +		 * the event source in the high resolution case. The softirq
> +		 * code will take care of this when the timer function has
> +		 * completed. There is no conflict as we hold the lock until
> +		 * the timer is enqueued.
> +		 */
> +		if (unlikely(base->running_timer == timer)) {
> +			return base;
> +		} else {
> +			/* See the comment in lock_timer_base() */
> +			timer->base = NULL;
> +			spin_unlock(&base->lock);
> +			spin_lock(&new_base->lock);
> +			timer->base = new_base;
> +		}
> +	}
> +	return new_base;
> +}
> +
> +/*
> + * Get the timer base unlocked
> + *
> + * Take care of timer->base = NULL in switch_ktimer_base !
> + */
> +static inline struct ktimer_base *get_ktimer_base_unlocked(struct ktimer *timer)
> +{
> +	struct ktimer_base *base;
> +	while (!(base = timer->base));
> +	return base;
> +}
> +#else
> +
> +#define set_running_timer(b,t) do {} while (0)
> +#define wake_up_timer_waiters(b)  do {} while (0)
> +
> +static inline struct ktimer_base *lock_ktimer_base(struct ktimer *timer,
> +					    unsigned long *flags)
> +{
> +	struct ktimer_base *base;
> +
> +	base = timer->base;
> +	spin_lock_irqsave(&base->lock, *flags);
> +	return base;
> +}
> +
> +#define switch_ktimer_base(t, b) b
> +
> +#define get_ktimer_base_unlocked(t) (t)->base
> +#define ktimer_base_can_change (0)
> +
> +#endif	/* !CONFIG_SMP */
> +
> +/*
> + * Convert timespec to ktime_t with resolution adjustment
> + *
> + * Note: We can access base without locking here, as ktimers can
> + * migrate between CPUs but can not be moved from one clock source to
> + * another. The clock source binding is set at init_ktimer_XXX.
> + */
> +ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts)
> +{
> +	struct ktimer_base *base = get_ktimer_base_unlocked(timer);
> +	ktime_t t;
> +	long rem = ts->tv_nsec % base->resolution;
> +
> +	t = ktime_set(ts->tv_sec, ts->tv_nsec);
> +
> +	/* Check, if the value has to be rounded */
> +	if (rem)
> +		t = ktime_add_ns(t, base->resolution - rem);
> +	return t;
> +}
> +
> +/*
> + * Convert timeval to ktime_t with resolution adjustment
> + */
> +ktime_t ktimer_convert_timeval(struct ktimer *timer, struct timeval *tv)
> +{
> +	struct timespec ts;
> +
> +	ts.tv_sec = tv->tv_sec;
> +	ts.tv_nsec = tv->tv_usec * NSEC_PER_USEC;
> +
> +	return ktimer_convert_timespec(timer, &ts);
> +}
> +
> +/*
> + * Internal function to add (re)start a timer
> + *
> + * The timer is inserted in expiry order.
> + * Insertion into the red black tree is O(log(n))
> + *
> + */
> +static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
> +			   ktime_t *tim, int mode)
> +{
> +	struct rb_node **link = &base->active.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct ktimer *entry;
> +	struct list_head *prev = &base->pending;
> +	ktime_t now;
> +
> +	/* Get current time */
> +	now = base->get_time();
> +
> +	/* Timer expiry mode */
> +	switch (mode & ~KTIMER_NOCHECK) {
> +	case KTIMER_ABS:
> +		timer->expires = *tim;
> +		break;
> +	case KTIMER_REL:
> +		timer->expires = ktime_add(now, *tim);
> +		break;
> +	case KTIMER_INCR:
> +		timer->expires = ktime_add(timer->expires, *tim);
> +		break;
> +	case KTIMER_FORWARD:
> +		while ktime_cmp(timer->expires, <= , now) {
> +			timer->expires = ktime_add(timer->expires, *tim);
> +			timer->overrun++;
> +		}
> +		goto nocheck;
> +	case KTIMER_REARM:
> +		while ktime_cmp(timer->expires, <= , now) {
> +			timer->expires = ktime_add(timer->expires, *tim);
> +			timer->overrun++;
> +		}
> +		goto nocheck;
> +	case KTIMER_RESTART:
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	/* Already expired.*/
> +	if ktime_cmp(timer->expires, <=, now) {
> +		timer->expired = now;
> +		/* The caller takes care of expiry */
> +		if (!(mode & KTIMER_NOCHECK))
> +			return -1;
> +	}
> + nocheck:
> +
> +	while (*link) {
> +		parent = *link;
> +		entry = rb_entry(parent, struct ktimer, node);
> +		/*
> +		 * We dont care about collisions. Nodes with
> +		 * the same expiry time stay together.
> +		 */
> +		if (ktimer_before(timer, entry))
> +			link = &(*link)->rb_left;
> +		else {
> +			link = &(*link)->rb_right;
> +			prev = &entry->list;
> +		}
> +	}
> +
> +	rb_link_node(&timer->node, parent, link);
> +	rb_insert_color(&timer->node, &base->active);
> +	list_add(&timer->list, prev);
> +	timer->status = KTIMER_PENDING;
> +	base->count++;
> +	return 0;
> +}
> +
> +/*
> + * Internal helper to remove a timer
> + *
> + * The function allows automatic rearming for interval
> + * timers.
> + *
> + */
> +static inline void do_remove_ktimer(struct ktimer *timer,
> +				    struct ktimer_base *base, int rearm)
> +{
> +	list_del(&timer->list);
> +	rb_erase(&timer->node, &base->active);
> +	timer->node.rb_parent = KTIMER_POISON;
> +	timer->status = KTIMER_INACTIVE;
> +	base->count--;
> +	BUG_ON(base->count < 0);
> +	/* Auto rearm the timer ? */
> +	if (rearm && ktime_cmp_val(timer->interval, !=, KTIME_ZERO))
> +		enqueue_ktimer(timer, base, NULL, KTIMER_REARM);
> +}
> +
> +/*
> + * Called with base lock held
> + */
> +static inline int remove_ktimer(struct ktimer *timer, struct ktimer_base *base)
> +{
> +	if (ktimer_active(timer)) {
> +		do_remove_ktimer(timer, base, KTIMER_NOREARM);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Internal function to (re)start a timer.
> + */
> +static int internal_restart_ktimer(struct ktimer *timer, ktime_t *tim,
> +				   int mode)
> +{
> +	struct ktimer_base *base, *new_base;
> +  	unsigned long flags;
> +	int ret;
> +
> +	BUG_ON(!timer->function);
> +
> +	base = lock_ktimer_base(timer, &flags);
> +
> +	/* Remove an active timer from the queue */
> +	ret = remove_ktimer(timer, base);
> +
> +	/* Switch the timer base, if necessary */
> +	new_base = switch_ktimer_base(timer, base);
> +
> +	/*
> +	 * When the new timer setting is already expired,
> +	 * let the calling code deal with it.
> +	 */
> +	if (enqueue_ktimer(timer, new_base, tim, mode))
> +		ret = -1;
> +
> +	spin_unlock_irqrestore(&new_base->lock, flags);
> +	return ret;
> +}
> +
> +/***
> + * modify_ktimer - modify a running timer
> + * @timer: the timer to be modified
> + * @tim: expiry time (required)
> + * @mode: timer setup mode
> + *
> + */
> +int modify_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
> +{
> +  	BUG_ON(!tim || !timer->function);
> +	return internal_restart_ktimer(timer, tim, mode);
> +}
> +
> +/***
> + * start_ktimer - start a timer on current CPU
> + * @timer: the timer to be added
> + * @tim: expiry time (optional, if not set in the timer)
> + * @mode: timer setup mode
> + */
> +int start_ktimer(struct ktimer *timer, ktime_t *tim, int mode)
> +{
> +  	BUG_ON(ktimer_active(timer) || !timer->function);
> +
> +	return internal_restart_ktimer(timer, tim, mode);
> +}
> +
> +/***
> + * try_to_stop_ktimer - try to deactivate a timer
> + */
> +int try_to_stop_ktimer(struct ktimer *timer)
> +{
> +	struct ktimer_base *base;
> +	unsigned long flags;
> +	int ret = -1;
> +
> +	base = lock_ktimer_base(timer, &flags);
> +
> +	if (base->running_timer != timer) {
> +		ret = remove_ktimer(timer, base);
> +		if (ret)
> +			timer->expired = base->get_time();
> +	}
> +
> +	spin_unlock_irqrestore(&base->lock, flags);
> +
> +	return ret;
> +
> +}
> +
> +/***
> + * stop_timer_sync - deactivate a timer and wait for the handler to finish.
> + * @timer: the timer to be deactivated
> + *
> + */
> +int stop_ktimer(struct ktimer *timer)
> +{
> +	for (;;) {
> +		int ret = try_to_stop_ktimer(timer);
> +		if (ret >= 0)
> +			return ret;
> +		wait_for_ktimer(timer);
> +	}
> +}
> +
> +/***
> + * get_remtime_ktimer - get remaining time for the timer
> + * @timer: the timer to read
> + * @fake:  when fake > 0 a pending, but expired timer
> + *	   returns fake (itimers need this, uurg)
> + */
> +ktime_t get_remtime_ktimer(struct ktimer *timer, long fake)
> +{
> +	struct ktimer_base *base;
> +	unsigned long flags;
> +	ktime_t rem;
> +
> +	base = lock_ktimer_base(timer, &flags);
> +	if (ktimer_active(timer)) {
> +		rem = ktime_sub(timer->expires,base->get_time());
> +		if (fake && ktime_cmp_val(rem, <=, KTIME_ZERO))
> +			rem = ktime_set(0, fake);
> +	} else {
> +		if (!fake)
> +			rem = ktime_sub(timer->expires,base->get_time());
> +		else
> +			ktime_set_zero(rem);
> +	}
> +	spin_unlock_irqrestore(&base->lock, flags);
> +	return rem;
> +}
> +
> +/***
> + * get_expiry_ktimer - get expiry time for the timer
> + * @timer: the timer to read
> + * @now:   if != NULL store current base->time
> + */
> +ktime_t get_expiry_ktimer(struct ktimer *timer, ktime_t *now)
> +{
> +	struct ktimer_base *base;
> +	unsigned long flags;
> +	ktime_t expiry;
> +
> +	base = lock_ktimer_base(timer, &flags);
> +	expiry = timer->expires;
> +	if (now)
> +		*now = base->get_time();
> +	spin_unlock_irqrestore(&base->lock, flags);
> +	return expiry;
> +}
> +
> +/*
> + * Functions related to clock sources
> + */
> +
> +static inline void ktimer_common_init(struct ktimer *timer)
> +{
> +	memset(timer, 0, sizeof(struct ktimer));
> +	timer->node.rb_parent = KTIMER_POISON;
> +}
> +
> +/*
> + * Get monotonic time
> + */
> +static ktime_t get_ktime_mono(void)
> +{
> +	return do_get_ktime_mono();
> +}
> +
> +/***
> + * init_ktimer_mono - initialize a timer on monotonic time
> + * @timer: the timer to be initialized
> + *
> + */
> +void fastcall init_ktimer_mono(struct ktimer *timer)
> +{
> +	ktimer_common_init(timer);
> +	timer->base =
> +		&per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC];
> +}
> +
> +/***
> + * get_ktimer_mono_res - get the monotonic timer resolution
> + *
> + */
> +int get_ktimer_mono_res(clockid_t which_clock, struct timespec *tp)
> +{
> +	tp->tv_sec = 0;
> +	tp->tv_nsec =
> +		per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_MONOTONIC].resolution;
> +	return 0;
> +}
> +
> +/*
> + * Get real time
> + */
> +static ktime_t get_ktime_real(void)
> +{
> +	return do_get_ktime_real();
> +}
> +
> +/***
> + * init_ktimer_real - initialize a timer on real time
> + * @timer: the timer to be initialized
> + *
> + */
> +void fastcall init_ktimer_real(struct ktimer *timer)
> +{
> +	ktimer_common_init(timer);
> +	timer->base =
> +		&per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME];
> +}
> +
> +/***
> + * get_ktimer_real_res - get the real timer resolution
> + *
> + */
> +int get_ktimer_real_res(clockid_t which_clock, struct timespec *tp)
> +{
> +	tp->tv_sec = 0;
> +	tp->tv_nsec =
> +		per_cpu(ktimer_bases, raw_smp_processor_id())[CLOCK_REALTIME].resolution;
> +	return 0;
> +}
> +
> +/*
> + * The per base runqueue
> + */
> +static inline void run_ktimer_queue(struct ktimer_base *base)
> +{
> +	ktime_t now = base->get_time();
> +
> +	spin_lock_irq(&base->lock);
> +	while (!list_empty(&base->pending)) {
> +		void (*fn)(void *);
> +		void *data;
> +		struct ktimer *timer = list_entry(base->pending.next,
> +						  struct ktimer, list);
> +		if ktime_cmp(now, <=, timer->expires)
> +			break;
> +		timer->expired = now;
> +		fn = timer->function;
> +		data = timer->data;
> +		set_running_timer(base, timer);
> +		do_remove_ktimer(timer, base, KTIMER_REARM);
> +		spin_unlock_irq(&base->lock);
> + 		fn(data);
> +		spin_lock_irq(&base->lock);
> +		set_running_timer(base, NULL);
> +	}
> +	spin_unlock_irq(&base->lock);
> +	wake_up_timer_waiters(base);
> +}
> +
> +/*
> + * Called from timer softirq every jiffy
> + */
> +void run_ktimer_queues(void)
> +{
> +	struct ktimer_base *base = __get_cpu_var(ktimer_bases);
> +	int i;
> +
> +	for (i = 0; i < MAX_KTIMER_BASES; i++)
> +		run_ktimer_queue(&base[i]);
> +}
> +
> +/*
> + * Functions related to initialization
> + */
> +static void __devinit init_ktimers_cpu(int cpu)
> +{
> +	struct ktimer_base *base = per_cpu(ktimer_bases, cpu);
> +	int i;
> +
> +	for (i = 0; i < MAX_KTIMER_BASES; i++) {
> +		spin_lock_init(&base->lock);
> +		INIT_LIST_HEAD(&base->pending);
> +		init_waitqueue_head(&base->wait_for_running_timer);
> +		base++;
> +	}
> +}
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +static void migrate_ktimer_list(struct ktimer_base *old_base,
> +				struct ktimer_base *new_base)
> +{
> +	struct ktimer *timer;
> +	struct rb_node *node;
> +
> +	while ((node = rb_first(&old_base->active))) {
> +		timer = rb_entry(node, struct ktimer, node);
> +		remove_ktimer(timer, old_base);
> +		timer->base = new_base;
> +		enqueue_ktimer(timer, new_base, NULL, KTIMER_RESTART);
> +	}
> +}
> +
> +static void __devinit migrate_ktimers(int cpu)
> +{
> +	struct ktimer_base *old_base;
> +	struct ktimer_base *new_base;
> +	int i;
> +
> +	BUG_ON(cpu_online(cpu));
> +	old_base = per_cpu(ktimer_bases, cpu);
> +	new_base = get_cpu_var(ktimer_bases);
> +
> +	local_irq_disable();
> +
> +	for (i = 0; i < MAX_KTIMER_BASES; i++) {
> +
> +		spin_lock(&new_base->lock);
> +		spin_lock(&old_base->lock);
> +
> +		if (old_base->running_timer)
> +			BUG();
> +
> +		migrate_ktimer_list(old_base, new_base);
> +
> +		spin_unlock(&old_base->lock);
> +		spin_unlock(&new_base->lock);
> +		old_base++;
> +		new_base++;
> +	}
> +
> +	local_irq_enable();
> +	&put_cpu_var(ktimer_bases);
> +}
> +#endif /* CONFIG_HOTPLUG_CPU */
> +
> +static int __devinit ktimer_cpu_notify(struct notifier_block *self,
> +				       unsigned long action, void *hcpu)
> +{
> +	long cpu = (long)hcpu;
> +	switch(action) {
> +	case CPU_UP_PREPARE:
> +		init_ktimers_cpu(cpu);
> +		break;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	case CPU_DEAD:
> +		migrate_ktimers(cpu);
> +		break;
> +#endif
> +	default:
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block __devinitdata ktimers_nb = {
> +	.notifier_call	= ktimer_cpu_notify,
> +};
> +
> +void __init init_ktimers(void)
> +{
> +	ktimer_cpu_notify(&ktimers_nb, (unsigned long)CPU_UP_PREPARE,
> +				(void *)(long)smp_processor_id());
> +	register_cpu_notifier(&ktimers_nb);
> +}
> +
> +/*
> + * system interface related functions
> + */
> +static void process_ktimer(void *data)
> +{
> +	wake_up_process(data);
> +}
> +
> +/**
> + * schedule_ktimer - sleep until timeout
> + * @timeout: timeout value
> + * @state:   state to use for sleep
> + * @rel:    timeout value is abs/rel
> + *
> + * Make the current task sleep until @timeout is
> + * elapsed.
> + *
> + * You can set the task state as follows -
> + *
> + * %TASK_UNINTERRUPTIBLE - at least @timeout is guaranteed to
> + * pass before the routine returns. The routine will return 0
> + *
> + * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
> + * delivered to the current task. In this case the remaining time
> + * will be returned
> + *
> + * The current task state is guaranteed to be TASK_RUNNING when this
> + * routine returns.
> + *
> + */
> +static fastcall ktime_t __sched schedule_ktimer(struct ktimer *timer,
> +					ktime_t *t, int state, int mode)
> +{
> +	timer->data = current;
> +	timer->function = process_ktimer;
> +
> +	current->state = state;
> +	if (start_ktimer(timer, t, mode)) {
> +		current->state = TASK_RUNNING;
> +		goto out;
> +	}
> +	if (current->state != TASK_RUNNING)
> +		schedule();
> +	stop_ktimer(timer);
> + out:
> +	/* Store the absolute expiry time */
> +	*t = timer->expires;
> +	/* Return the remaining time */
> +	return ktime_sub(timer->expires, timer->expired);
> +}
> +
> +static long __sched nanosleep_restart(struct ktimer *timer,
> +				      struct restart_block *restart)
> +{
> +	struct timespec tu;
> +	ktime_t t, rem;
> +	void *rfn = restart->fn;
> +	struct timespec __user *rmtp = (struct timespec __user *) restart->arg2;
> +
> +	restart->fn = do_no_restart_syscall;
> +
> +	t = ktime_set_low_high(restart->arg0, restart->arg1);
> +
> +	rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, KTIMER_ABS);
> +
> +	if (ktime_cmp_val(rem, <=, KTIME_ZERO))
> +		return 0;
> +
> +	tu = ktime_to_timespec(rem);
> +	if (rmtp && copy_to_user(rmtp, &rem, sizeof(tu)))
> +		return -EFAULT;
> +
> +	restart->fn = rfn;
> +	/* The other values in restart are already filled in */
> +	return -ERESTART_RESTARTBLOCK;
> +}
> +
> +static long __sched nanosleep_restart_mono(struct restart_block *restart)
> +{
> +	struct ktimer timer;
> +
> +	init_ktimer_mono(&timer);
> +	return nanosleep_restart(&timer, restart);
> +}
> +
> +static long __sched nanosleep_restart_real(struct restart_block *restart)
> +{
> +	struct ktimer timer;
> +
> +	init_ktimer_real(&timer);
> +	return nanosleep_restart(&timer, restart);
> +}
> +
> +static long ktimer_nanosleep(struct ktimer *timer, struct timespec *rqtp,
> +			     struct timespec __user *rmtp, int mode,
> +			     long (*rfn)(struct restart_block *))
> +{
> +	struct timespec tu;
> +	ktime_t rem, t;
> +	struct restart_block *restart;
> +
> +	t = ktimer_convert_timespec(timer, rqtp);
> +
> +	/* t is updated to absolute expiry time ! */
> +	rem = schedule_ktimer(timer, &t, TASK_INTERRUPTIBLE, mode);
> +
> +	if (ktime_cmp_val(rem, <=, KTIME_ZERO))
> +		return 0;
> +
> +	tu = ktime_to_timespec(rem);
> +
> +	if (rmtp && copy_to_user(rmtp, &tu, sizeof(tu)))
> +		return -EFAULT;
> +
> +	restart = &current_thread_info()->restart_block;
> +	restart->fn = rfn;
> +	restart->arg0 = ktime_get_low(t);
> +	restart->arg1 = ktime_get_high(t);
> +	restart->arg2 = (unsigned long) rmtp;
> +	return -ERESTART_RESTARTBLOCK;
> +
> +}
> +
> +long ktimer_nanosleep_mono(struct timespec *rqtp,
> +			   struct timespec __user *rmtp, int mode)
> +{
> +	struct ktimer timer;
> +
> +	init_ktimer_mono(&timer);
> +	return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_mono);
> +}
> +
> +long ktimer_nanosleep_real(struct timespec *rqtp,
> +			   struct timespec __user *rmtp, int mode)
> +{
> +	struct ktimer timer;
> +
> +	init_ktimer_real(&timer);
> +	return ktimer_nanosleep(&timer, rqtp, rmtp, mode, nanosleep_restart_real);
> +}
> +
> +asmlinkage long sys_nanosleep(struct timespec __user *rqtp,
> +			      struct timespec __user *rmtp)
> +{
> +	struct timespec tu;
> +
> +	if (copy_from_user(&tu, rqtp, sizeof(tu)))
> +		return -EFAULT;
> +
> +	if (!timespec_valid(&tu))
> +		return -EINVAL;
> +
> +	return ktimer_nanosleep_mono(&tu, rmtp, KTIMER_REL);
> +}
> +
> Index: linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/posix-cpu-timers.c
> +++ linux-2.6.14-rc2-rt4/kernel/posix-cpu-timers.c
> @@ -1394,7 +1394,7 @@ void set_process_cpu_timer(struct task_s
>  static long posix_cpu_clock_nanosleep_restart(struct restart_block *);
>  
>  int posix_cpu_nsleep(clockid_t which_clock, int flags,
> -		     struct timespec *rqtp)
> +		     struct timespec *rqtp, struct timespec __user *rmtp)
>  {
>  	struct restart_block *restart_block =
>  	    &current_thread_info()->restart_block;
> @@ -1419,7 +1419,6 @@ int posix_cpu_nsleep(clockid_t which_clo
>  	error = posix_cpu_timer_create(&timer);
>  	timer.it_process = current;
>  	if (!error) {
> -		struct timespec __user *rmtp;
>  		static struct itimerspec zero_it;
>  		struct itimerspec it = { .it_value = *rqtp,
>  					 .it_interval = {} };
> @@ -1466,7 +1465,6 @@ int posix_cpu_nsleep(clockid_t which_clo
>  		/*
>  		 * Report back to the user the time still remaining.
>  		 */
> -		rmtp = (struct timespec __user *) restart_block->arg1;
>  		if (rmtp != NULL && !(flags & TIMER_ABSTIME) &&
>  		    copy_to_user(rmtp, &it.it_value, sizeof *rmtp))
>  			return -EFAULT;
> @@ -1474,6 +1472,7 @@ int posix_cpu_nsleep(clockid_t which_clo
>  		restart_block->fn = posix_cpu_clock_nanosleep_restart;
>  		/* Caller already set restart_block->arg1 */
>  		restart_block->arg0 = which_clock;
> +		restart_block->arg1 = (unsigned long) rmtp;
>  		restart_block->arg2 = rqtp->tv_sec;
>  		restart_block->arg3 = rqtp->tv_nsec;
>  
> @@ -1487,10 +1486,15 @@ static long
>  posix_cpu_clock_nanosleep_restart(struct restart_block *restart_block)
>  {
>  	clockid_t which_clock = restart_block->arg0;
> -	struct timespec t = { .tv_sec = restart_block->arg2,
> -			      .tv_nsec = restart_block->arg3 };
> +	struct timespec __user *rmtp;
> +	struct timespec t;
> +
> +	rmtp = (struct timespec __user *) restart_block->arg1;
> +	t.tv_sec = restart_block->arg2;
> +	t.tv_nsec = restart_block->arg3;
> +
>  	restart_block->fn = do_no_restart_syscall;
> -	return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t);
> +	return posix_cpu_nsleep(which_clock, TIMER_ABSTIME, &t, rmtp);
>  }
>  
>  
> @@ -1511,9 +1515,10 @@ static int process_cpu_timer_create(stru
>  	return posix_cpu_timer_create(timer);
>  }
>  static int process_cpu_nsleep(clockid_t which_clock, int flags,
> -			      struct timespec *rqtp)
> +			      struct timespec *rqtp,
> +			      struct timespec __user *rmtp)
>  {
> -	return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp);
> +	return posix_cpu_nsleep(PROCESS_CLOCK, flags, rqtp, rmtp);
>  }
>  static int thread_cpu_clock_getres(clockid_t which_clock, struct timespec *tp)
>  {
> @@ -1529,7 +1534,7 @@ static int thread_cpu_timer_create(struc
>  	return posix_cpu_timer_create(timer);
>  }
>  static int thread_cpu_nsleep(clockid_t which_clock, int flags,
> -			      struct timespec *rqtp)
> +			      struct timespec *rqtp, struct timespec __user *rmtp)
>  {
>  	return -EINVAL;
>  }
> Index: linux-2.6.14-rc2-rt4/kernel/posix-timers.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/posix-timers.c
> +++ linux-2.6.14-rc2-rt4/kernel/posix-timers.c
> @@ -48,21 +48,6 @@
>  #include <linux/workqueue.h>
>  #include <linux/module.h>
>  
> -#ifndef div_long_long_rem
> -#include <asm/div64.h>
> -
> -#define div_long_long_rem(dividend,divisor,remainder) ({ \
> -		       u64 result = dividend;		\
> -		       *remainder = do_div(result,divisor); \
> -		       result; })
> -
> -#endif
> -#define CLOCK_REALTIME_RES TICK_NSEC  /* In nano seconds. */
> -
> -static inline u64  mpy_l_X_l_ll(unsigned long mpy1,unsigned long mpy2)
> -{
> -	return (u64)mpy1 * mpy2;
> -}
>  /*
>   * Management arrays for POSIX timers.	 Timers are kept in slab memory
>   * Timer ids are allocated by an external routine that keeps track of the
> @@ -148,18 +133,18 @@ static DEFINE_SPINLOCK(idr_lock);
>   */
>  
>  static struct k_clock posix_clocks[MAX_CLOCKS];
> +
>  /*
> - * We only have one real clock that can be set so we need only one abs list,
> - * even if we should want to have several clocks with differing resolutions.
> + * These ones are defined below.
>   */
> -static struct k_clock_abs abs_list = {.list = LIST_HEAD_INIT(abs_list.list),
> -				      .lock = SPIN_LOCK_UNLOCKED};
> +static int common_nsleep(clockid_t, int flags, struct timespec *t,
> +			 struct timespec __user *rmtp);
> +static void common_timer_get(struct k_itimer *, struct itimerspec *);
> +static int common_timer_set(struct k_itimer *, int,
> +			    struct itimerspec *, struct itimerspec *);
> +static int common_timer_del(struct k_itimer *timer);
>  
> -static void posix_timer_fn(unsigned long);
> -static u64 do_posix_clock_monotonic_gettime_parts(
> -	struct timespec *tp, struct timespec *mo);
> -int do_posix_clock_monotonic_gettime(struct timespec *tp);
> -static int do_posix_clock_monotonic_get(clockid_t, struct timespec *tp);
> +static void posix_timer_fn(void *data);
>  
>  static struct k_itimer *lock_timer(timer_t timer_id, unsigned long *flags);
>  
> @@ -205,21 +190,25 @@ static inline int common_clock_set(clock
>  
>  static inline int common_timer_create(struct k_itimer *new_timer)
>  {
> -	INIT_LIST_HEAD(&new_timer->it.real.abs_timer_entry);
> -	init_timer(&new_timer->it.real.timer);
> -	new_timer->it.real.timer.data = (unsigned long) new_timer;
> +	return -EINVAL;
> +}
> +
> +static int timer_create_mono(struct k_itimer *new_timer)
> +{
> +	init_ktimer_mono(&new_timer->it.real.timer);
> +	new_timer->it.real.timer.data = new_timer;
> +	new_timer->it.real.timer.function = posix_timer_fn;
> +	return 0;
> +}
> +
> +static int timer_create_real(struct k_itimer *new_timer)
> +{
> +	init_ktimer_real(&new_timer->it.real.timer);
> +	new_timer->it.real.timer.data = new_timer;
>  	new_timer->it.real.timer.function = posix_timer_fn;
>  	return 0;
>  }
>  
> -/*
> - * These ones are defined below.
> - */
> -static int common_nsleep(clockid_t, int flags, struct timespec *t);
> -static void common_timer_get(struct k_itimer *, struct itimerspec *);
> -static int common_timer_set(struct k_itimer *, int,
> -			    struct itimerspec *, struct itimerspec *);
> -static int common_timer_del(struct k_itimer *timer);
>  
>  /*
>   * Return nonzero iff we know a priori this clockid_t value is bogus.
> @@ -239,19 +228,44 @@ static inline int invalid_clockid(clocki
>  	return 1;
>  }
>  
> +/*
> + * Get real time for posix timers
> + */
> +static int posix_get_ktime_real_ts(clockid_t which_clock, struct timespec *tp)
> +{
> +	get_ktime_real_ts(tp);
> +	return 0;
> +}
> +
> +/*
> + * Get monotonic time for posix timers
> + */
> +static int posix_get_ktime_mono_ts(clockid_t which_clock, struct timespec *tp)
> +{
> +	get_ktime_mono_ts(tp);
> +	return 0;
> +}
> +
> +void do_posix_clock_monotonic_gettime(struct timespec *ts)
> +{
> +	get_ktime_mono_ts(ts);
> +}
>  
>  /*
>   * Initialize everything, well, just everything in Posix clocks/timers ;)
>   */
>  static __init int init_posix_timers(void)
>  {
> -	struct k_clock clock_realtime = {.res = CLOCK_REALTIME_RES,
> -					 .abs_struct = &abs_list
> +	struct k_clock clock_realtime = {
> +		.clock_getres = get_ktimer_real_res,
> +		.clock_get = posix_get_ktime_real_ts,
> +		.timer_create = timer_create_real,
>  	};
> -	struct k_clock clock_monotonic = {.res = CLOCK_REALTIME_RES,
> -		.abs_struct = NULL,
> -		.clock_get = do_posix_clock_monotonic_get,
> -		.clock_set = do_posix_clock_nosettime
> +	struct k_clock clock_monotonic = {
> +		.clock_getres = get_ktimer_mono_res,
> +		.clock_get = posix_get_ktime_mono_ts,
> +		.clock_set = do_posix_clock_nosettime,
> +		.timer_create = timer_create_mono,
>  	};
>  
>  	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
> @@ -265,117 +279,17 @@ static __init int init_posix_timers(void
>  
>  __initcall(init_posix_timers);
>  
> -static void tstojiffie(struct timespec *tp, int res, u64 *jiff)
> -{
> -	long sec = tp->tv_sec;
> -	long nsec = tp->tv_nsec + res - 1;
> -
> -	if (nsec > NSEC_PER_SEC) {
> -		sec++;
> -		nsec -= NSEC_PER_SEC;
> -	}
> -
> -	/*
> -	 * The scaling constants are defined in <linux/time.h>
> -	 * The difference between there and here is that we do the
> -	 * res rounding and compute a 64-bit result (well so does that
> -	 * but it then throws away the high bits).
> -  	 */
> -	*jiff =  (mpy_l_X_l_ll(sec, SEC_CONVERSION) +
> -		  (mpy_l_X_l_ll(nsec, NSEC_CONVERSION) >> 
> -		   (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
> -}
> -
> -/*
> - * This function adjusts the timer as needed as a result of the clock
> - * being set.  It should only be called for absolute timers, and then
> - * under the abs_list lock.  It computes the time difference and sets
> - * the new jiffies value in the timer.  It also updates the timers
> - * reference wall_to_monotonic value.  It is complicated by the fact
> - * that tstojiffies() only handles positive times and it needs to work
> - * with both positive and negative times.  Also, for negative offsets,
> - * we need to defeat the res round up.
> - *
> - * Return is true if there is a new time, else false.
> - */
> -static long add_clockset_delta(struct k_itimer *timr,
> -			       struct timespec *new_wall_to)
> -{
> -	struct timespec delta;
> -	int sign = 0;
> -	u64 exp;
> -
> -	set_normalized_timespec(&delta,
> -				new_wall_to->tv_sec -
> -				timr->it.real.wall_to_prev.tv_sec,
> -				new_wall_to->tv_nsec -
> -				timr->it.real.wall_to_prev.tv_nsec);
> -	if (likely(!(delta.tv_sec | delta.tv_nsec)))
> -		return 0;
> -	if (delta.tv_sec < 0) {
> -		set_normalized_timespec(&delta,
> -					-delta.tv_sec,
> -					1 - delta.tv_nsec -
> -					posix_clocks[timr->it_clock].res);
> -		sign++;
> -	}
> -	tstojiffie(&delta, posix_clocks[timr->it_clock].res, &exp);
> -	timr->it.real.wall_to_prev = *new_wall_to;
> -	timr->it.real.timer.expires += (sign ? -exp : exp);
> -	return 1;
> -}
> -
> -static void remove_from_abslist(struct k_itimer *timr)
> -{
> -	if (!list_empty(&timr->it.real.abs_timer_entry)) {
> -		spin_lock(&abs_list.lock);
> -		list_del_init(&timr->it.real.abs_timer_entry);
> -		spin_unlock(&abs_list.lock);
> -	}
> -}
>  
>  static void schedule_next_timer(struct k_itimer *timr)
>  {
> -	struct timespec new_wall_to;
> -	struct now_struct now;
> -	unsigned long seq;
> -
> -	/*
> -	 * Set up the timer for the next interval (if there is one).
> -	 * Note: this code uses the abs_timer_lock to protect
> -	 * it.real.wall_to_prev and must hold it until exp is set, not exactly
> -	 * obvious...
> -
> -	 * This function is used for CLOCK_REALTIME* and
> -	 * CLOCK_MONOTONIC* timers.  If we ever want to handle other
> -	 * CLOCKs, the calling code (do_schedule_next_timer) would need
> -	 * to pull the "clock" info from the timer and dispatch the
> -	 * "other" CLOCKs "next timer" code (which, I suppose should
> -	 * also be added to the k_clock structure).
> -	 */
> -	if (!timr->it.real.incr)
> +	if (ktime_cmp_val(timr->it.real.incr, ==, KTIME_ZERO))
>  		return;
>  
> -	do {
> -		seq = read_seqbegin(&xtime_lock);
> -		new_wall_to =	wall_to_monotonic;
> -		posix_get_now(&now);
> -	} while (read_seqretry(&xtime_lock, seq));
> -
> -	if (!list_empty(&timr->it.real.abs_timer_entry)) {
> -		spin_lock(&abs_list.lock);
> -		add_clockset_delta(timr, &new_wall_to);
> -
> -		posix_bump_timer(timr, now);
> -
> -		spin_unlock(&abs_list.lock);
> -	} else {
> -		posix_bump_timer(timr, now);
> -	}
> -	timr->it_overrun_last = timr->it_overrun;
> -	timr->it_overrun = -1;
> +	timr->it_overrun_last = timr->it.real.overrun;
> +	timr->it.real.overrun = timr->it.real.timer.overrun = -1;
>  	++timr->it_requeue_pending;
> -	add_timer(&timr->it.real.timer);
> +	start_ktimer(&timr->it.real.timer, &timr->it.real.incr, KTIMER_FORWARD);
> +	timr->it.real.overrun = timr->it.real.timer.overrun;
>  }
>  
>  /*
> @@ -413,14 +327,7 @@ int posix_timer_event(struct k_itimer *t
>  {
>  	memset(&timr->sigq->info, 0, sizeof(siginfo_t));
>  	timr->sigq->info.si_sys_private = si_private;
> -	/*
> -	 * Send signal to the process that owns this timer.
> -
> -	 * This code assumes that all the possible abs_lists share the
> -	 * same lock (there is only one list at this time). If this is
> -	 * not the case, the CLOCK info would need to be used to find
> -	 * the proper abs list lock.
> -	 */
> +	/* Send signal to the process that owns this timer.*/
>  
>  	timr->sigq->info.si_signo = timr->it_sigev_signo;
>  	timr->sigq->info.si_errno = 0;
> @@ -454,65 +361,28 @@ EXPORT_SYMBOL_GPL(posix_timer_event);
>  
>   * This code is for CLOCK_REALTIME* and CLOCK_MONOTONIC* timers.
>   */
> -static void posix_timer_fn(unsigned long __data)
> +static void posix_timer_fn(void *data)
>  {
> -	struct k_itimer *timr = (struct k_itimer *) __data;
> +	struct k_itimer *timr = data;
>  	unsigned long flags;
> -	unsigned long seq;
> -	struct timespec delta, new_wall_to;
> -	u64 exp = 0;
> -	int do_notify = 1;
> +	int si_private = 0;
>  
>  	spin_lock_irqsave(&timr->it_lock, flags);
> -	if (!list_empty(&timr->it.real.abs_timer_entry)) {
> -		spin_lock(&abs_list.lock);
> -		do {
> -			seq = read_seqbegin(&xtime_lock);
> -			new_wall_to =	wall_to_monotonic;
> -		} while (read_seqretry(&xtime_lock, seq));
> -		set_normalized_timespec(&delta,
> -					new_wall_to.tv_sec -
> -					timr->it.real.wall_to_prev.tv_sec,
> -					new_wall_to.tv_nsec -
> -					timr->it.real.wall_to_prev.tv_nsec);
> -		if (likely((delta.tv_sec | delta.tv_nsec ) == 0)) {
> -			/* do nothing, timer is on time */
> -		} else if (delta.tv_sec < 0) {
> -			/* do nothing, timer is already late */
> -		} else {
> -			/* timer is early due to a clock set */
> -			tstojiffie(&delta,
> -				   posix_clocks[timr->it_clock].res,
> -				   &exp);
> -			timr->it.real.wall_to_prev = new_wall_to;
> -			timr->it.real.timer.expires += exp;
> -			add_timer(&timr->it.real.timer);
> -			do_notify = 0;
> -		}
> -		spin_unlock(&abs_list.lock);
>  
> -	}
> -	if (do_notify)  {
> -		int si_private=0;
> +	if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
> +		si_private = ++timr->it_requeue_pending;
>  
> -		if (timr->it.real.incr)
> -			si_private = ++timr->it_requeue_pending;
> -		else {
> -			remove_from_abslist(timr);
> -		}
> +	if (posix_timer_event(timr, si_private))
> +		/*
> +		 * signal was not sent because of sig_ignor
> +		 * we will not get a call back to restart it AND
> +		 * it should be restarted.
> +		 */
> +		schedule_next_timer(timr);
>  
> -		if (posix_timer_event(timr, si_private))
> -			/*
> -			 * signal was not sent because of sig_ignor
> -			 * we will not get a call back to restart it AND
> -			 * it should be restarted.
> -			 */
> -			schedule_next_timer(timr);
> -	}
>  	unlock_timer(timr, flags); /* hold thru abs lock to keep irq off */
>  }
>  
> -
>  static inline struct task_struct * good_sigevent(sigevent_t * event)
>  {
>  	struct task_struct *rtn = current->group_leader;
> @@ -776,39 +646,40 @@ static struct k_itimer * lock_timer(time
>  static void
>  common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting)
>  {
> -	unsigned long expires;
> -	struct now_struct now;
> +	ktime_t expires, now, remaining;
> +	struct ktimer *timer = &timr->it.real.timer;
>  
> -	do
> -		expires = timr->it.real.timer.expires;
> -	while ((volatile long) (timr->it.real.timer.expires) != expires);
> -
> -	posix_get_now(&now);
> -
> -	if (expires &&
> -	    ((timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) &&
> -	    !timr->it.real.incr &&
> -	    posix_time_before(&timr->it.real.timer, &now))
> -		timr->it.real.timer.expires = expires = 0;
> -	if (expires) {
> -		if (timr->it_requeue_pending & REQUEUE_PENDING ||
> -		    (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
> -			posix_bump_timer(timr, now);
> -			expires = timr->it.real.timer.expires;
> -		}
> -		else
> -			if (!timer_pending(&timr->it.real.timer))
> -				expires = 0;
> -		if (expires)
> -			expires -= now.jiffies;
> -	}
> -	jiffies_to_timespec(expires, &cur_setting->it_value);
> -	jiffies_to_timespec(timr->it.real.incr, &cur_setting->it_interval);
> -
> -	if (cur_setting->it_value.tv_sec < 0) {
> +	memset(cur_setting, 0, sizeof(struct itimerspec));
> +	expires = get_expiry_ktimer(timer, &now);
> +	remaining = ktime_sub(expires, now);
> +
> +	/* Time left ? or timer pending */
> +	if (ktime_cmp_val(remaining, >, KTIME_ZERO) || ktimer_active(timer))
> +		goto calci;
> +	/* interval timer ? */
> +	if (ktime_cmp_val(timr->it.real.incr, ==, 0))
> +		return;
> +	/*
> +	 * When a requeue is pending or this is a SIGEV_NONE timer
> +	 * move the expiry time forward by intervals, so expiry is >
> +	 * now.
> +	 * The active (non SIGEV_NONE) rearm should be done
> +	 * automatically by the ktimer REARM mode. Thats the next
> +	 * iteration.  The REQUEUE_PENDING part will go away !
> +	 */
> +	if (timr->it_requeue_pending & REQUEUE_PENDING ||
> +	    (timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE) {
> +		remaining = forward_posix_timer(timr, now);
> +	}
> + calci:
> +	/* interval timer ? */
> +	if (ktime_cmp_val(timr->it.real.incr, !=, KTIME_ZERO))
> +		cur_setting->it_interval = ktime_to_timespec(timr->it.real.incr);
> +	/* Return 0 only, when the timer is expired and not pending */
> +	if (ktime_cmp_val(remaining, <=, KTIME_ZERO))
>  		cur_setting->it_value.tv_nsec = 1;
> -		cur_setting->it_value.tv_sec = 0;
> -	}
> +	else
> +		cur_setting->it_value = ktime_to_timespec(remaining);
>  }
>  
>  /* Get the time remaining on a POSIX.1b interval timer. */
> @@ -832,6 +703,7 @@ sys_timer_gettime(timer_t timer_id, stru
>  
>  	return 0;
>  }
> +
>  /*
>   * Get the number of overruns of a POSIX.1b interval timer.  This is to
>   * be the overrun of the timer last delivered.  At the same time we are
> @@ -858,84 +730,6 @@ sys_timer_getoverrun(timer_t timer_id)
>  
>  	return overrun;
>  }
> -/*
> - * Adjust for absolute time
> - *
> - * If absolute time is given and it is not CLOCK_MONOTONIC, we need to
> - * adjust for the offset between the timer clock (CLOCK_MONOTONIC) and
> - * what ever clock he is using.
> - *
> - * If it is relative time, we need to add the current (CLOCK_MONOTONIC)
> - * time to it to get the proper time for the timer.
> - */
> -static int adjust_abs_time(struct k_clock *clock, struct timespec *tp, 
> -			   int abs, u64 *exp, struct timespec *wall_to)
> -{
> -	struct timespec now;
> -	struct timespec oc = *tp;
> -	u64 jiffies_64_f;
> -	int rtn =0;
> -
> -	if (abs) {
> -		/*
> -		 * The mask pick up the 4 basic clocks 
> -		 */
> -		if (!((clock - &posix_clocks[0]) & ~CLOCKS_MASK)) {
> -			jiffies_64_f = do_posix_clock_monotonic_gettime_parts(
> -				&now,  wall_to);
> -			/*
> -			 * If we are doing a MONOTONIC clock
> -			 */
> -			if((clock - &posix_clocks[0]) & CLOCKS_MONO){
> -				now.tv_sec += wall_to->tv_sec;
> -				now.tv_nsec += wall_to->tv_nsec;
> -			}
> -		} else {
> -			/*
> -			 * Not one of the basic clocks
> -			 */
> -			clock->clock_get(clock - posix_clocks, &now);
> -			jiffies_64_f = get_jiffies_64();
> -		}
> -		/*
> -		 * Take away now to get delta and normalize
> -		 */
> -		set_normalized_timespec(&oc, oc.tv_sec - now.tv_sec,
> -					oc.tv_nsec - now.tv_nsec);
> -	}else{
> -		jiffies_64_f = get_jiffies_64();
> -	}
> -	/*
> -	 * Check if the requested time is prior to now (if so set now)
> -	 */
> -	if (oc.tv_sec < 0)
> -		oc.tv_sec = oc.tv_nsec = 0;
> -
> -	if (oc.tv_sec | oc.tv_nsec)
> -		set_normalized_timespec(&oc, oc.tv_sec,
> -					oc.tv_nsec + clock->res);
> -	tstojiffie(&oc, clock->res, exp);
> -
> -	/*
> -	 * Check if the requested time is more than the timer code
> -	 * can handle (if so we error out but return the value too).
> -	 */
> -	if (*exp > ((u64)MAX_JIFFY_OFFSET))
> -			/*
> -			 * This is a considered response, not exactly in
> -			 * line with the standard (in fact it is silent on
> -			 * possible overflows).  We assume such a large 
> -			 * value is ALMOST always a programming error and
> -			 * try not to compound it by setting a really dumb
> -			 * value.
> -			 */
> -			rtn = -EINVAL;
> -	/*
> -	 * return the actual jiffies expire time, full 64 bits
> -	 */
> -	*exp += jiffies_64_f;
> -	return rtn;
> -}
>  
>  /* Set a POSIX.1b interval timer. */
>  /* timr->it_lock is taken. */
> @@ -943,68 +737,52 @@ static inline int
>  common_timer_set(struct k_itimer *timr, int flags,
>  		 struct itimerspec *new_setting, struct itimerspec *old_setting)
>  {
> -	struct k_clock *clock = &posix_clocks[timr->it_clock];
> -	u64 expire_64;
> +	ktime_t expires;
> +	int mode;
>  
>  	if (old_setting)
>  		common_timer_get(timr, old_setting);
>  
>  	/* disable the timer */
> -	timr->it.real.incr = 0;
> +	ktime_set_zero(timr->it.real.incr);
>  	/*
>  	 * careful here.  If smp we could be in the "fire" routine which will
>  	 * be spinning as we hold the lock.  But this is ONLY an SMP issue.
>  	 */
> -	if (try_to_del_timer_sync(&timr->it.real.timer) < 0) {
> -#ifdef CONFIG_SMP
> -		/*
> -		 * It can only be active if on an other cpu.  Since
> -		 * we have cleared the interval stuff above, it should
> -		 * clear once we release the spin lock.  Of course once
> -		 * we do that anything could happen, including the
> -		 * complete melt down of the timer.  So return with
> -		 * a "retry" exit status.
> -		 */
> +	if (try_to_stop_ktimer(&timr->it.real.timer) < 0)
>  		return TIMER_RETRY;
> -#endif
> -	}
> -
> -	remove_from_abslist(timr);
>  
>  	timr->it_requeue_pending = (timr->it_requeue_pending + 2) & 
>  		~REQUEUE_PENDING;
>  	timr->it_overrun_last = 0;
>  	timr->it_overrun = -1;
> -	/*
> -	 *switch off the timer when it_value is zero
> -	 */
> -	if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec) {
> -		timr->it.real.timer.expires = 0;
> +
> +	/* switch off the timer when it_value is zero */
> +	if (!new_setting->it_value.tv_sec && !new_setting->it_value.tv_nsec)
>  		return 0;
> -	}
>  
> -	if (adjust_abs_time(clock,
> -			    &new_setting->it_value, flags & TIMER_ABSTIME, 
> -			    &expire_64, &(timr->it.real.wall_to_prev))) {
> -		return -EINVAL;
> -	}
> -	timr->it.real.timer.expires = (unsigned long)expire_64;
> -	tstojiffie(&new_setting->it_interval, clock->res, &expire_64);
> -	timr->it.real.incr = (unsigned long)expire_64;
> +	mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;
>  
> -	/*
> -	 * We do not even queue SIGEV_NONE timers!  But we do put them
> -	 * in the abs list so we can do that right.
> +	/* Posix madness. Only absolute CLOCK_REALTIME timers
> +	 * are affected by clock sets. So we must reiniatilize
> +	 * the timer.
>  	 */
> +	if (timr->it_clock == CLOCK_REALTIME && mode == KTIMER_ABS)
> +		timer_create_real(timr);
> +	else
> +		timer_create_mono(timr);
> +
> +	expires = ktimer_convert_timespec(&timr->it.real.timer,
> +					  &new_setting->it_value);
> +	/* This should be moved to the auto rearm code */
> +	timr->it.real.incr = ktimer_convert_timespec(&timr->it.real.timer,
> +						     &new_setting->it_interval);
> +
> +	/* SIGEV_NONE timers are not queued ! See common_timer_get */
>  	if (((timr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE))
> -		add_timer(&timr->it.real.timer);
> +		start_ktimer(&timr->it.real.timer, &expires,
> +			     mode | KTIMER_NOCHECK);
>  
> -	if (flags & TIMER_ABSTIME && clock->abs_struct) {
> -		spin_lock(&clock->abs_struct->lock);
> -		list_add_tail(&(timr->it.real.abs_timer_entry),
> -			      &(clock->abs_struct->list));
> -		spin_unlock(&clock->abs_struct->lock);
> -	}
>  	return 0;
>  }
>  
> @@ -1039,6 +817,7 @@ retry:
>  
>  	unlock_timer(timr, flag);
>  	if (error == TIMER_RETRY) {
> +		wait_for_ktimer(&timr->it.real.timer);
>  		rtn = NULL;	// We already got the old time...
>  		goto retry;
>  	}
> @@ -1052,24 +831,10 @@ retry:
>  
>  static inline int common_timer_del(struct k_itimer *timer)
>  {
> -	timer->it.real.incr = 0;
> +	ktime_set_zero(timer->it.real.incr);
>  
> -	if (try_to_del_timer_sync(&timer->it.real.timer) < 0) {
> -#ifdef CONFIG_SMP
> -		/*
> -		 * It can only be active if on an other cpu.  Since
> -		 * we have cleared the interval stuff above, it should
> -		 * clear once we release the spin lock.  Of course once
> -		 * we do that anything could happen, including the
> -		 * complete melt down of the timer.  So return with
> -		 * a "retry" exit status.
> -		 */
> +	if (try_to_stop_ktimer(&timer->it.real.timer) < 0)
>  		return TIMER_RETRY;
> -#endif
> -	}
> -
> -	remove_from_abslist(timer);
> -
>  	return 0;
>  }
>  
> @@ -1085,24 +850,17 @@ sys_timer_delete(timer_t timer_id)
>  	struct k_itimer *timer;
>  	long flags;
>  
> -#ifdef CONFIG_SMP
> -	int error;
>  retry_delete:
> -#endif
>  	timer = lock_timer(timer_id, &flags);
>  	if (!timer)
>  		return -EINVAL;
>  
> -#ifdef CONFIG_SMP
> -	error = timer_delete_hook(timer);
> -
> -	if (error == TIMER_RETRY) {
> +	if (timer_delete_hook(timer) == TIMER_RETRY) {
>  		unlock_timer(timer, flags);
> +		wait_for_ktimer(&timer->it.real.timer);
>  		goto retry_delete;
>  	}
> -#else
> -	timer_delete_hook(timer);
> -#endif
> +
>  	spin_lock(&current->sighand->siglock);
>  	list_del(&timer->list);
>  	spin_unlock(&current->sighand->siglock);
> @@ -1119,6 +877,7 @@ retry_delete:
>  	release_posix_timer(timer, IT_ID_SET);
>  	return 0;
>  }
> +
>  /*
>   * return timer owned by the process, used by exit_itimers
>   */
> @@ -1126,22 +885,14 @@ static inline void itimer_delete(struct 
>  {
>  	unsigned long flags;
>  
> -#ifdef CONFIG_SMP
> -	int error;
>  retry_delete:
> -#endif
>  	spin_lock_irqsave(&timer->it_lock, flags);
>  
> -#ifdef CONFIG_SMP
> -	error = timer_delete_hook(timer);
> -
> -	if (error == TIMER_RETRY) {
> +	if (timer_delete_hook(timer) == TIMER_RETRY) {
>  		unlock_timer(timer, flags);
> +		wait_for_ktimer(&timer->it.real.timer);
>  		goto retry_delete;
>  	}
> -#else
> -	timer_delete_hook(timer);
> -#endif
>  	list_del(&timer->list);
>  	/*
>  	 * This keeps any tasks waiting on the spin lock from thinking
> @@ -1170,60 +921,7 @@ void exit_itimers(struct signal_struct *
>  	}
>  }
>  
> -/*
> - * And now for the "clock" calls
> - *
> - * These functions are called both from timer functions (with the timer
> - * spin_lock_irq() held and from clock calls with no locking.	They must
> - * use the save flags versions of locks.
> - */
> -
> -/*
> - * We do ticks here to avoid the irq lock ( they take sooo long).
> - * The seqlock is great here.  Since we a reader, we don't really care
> - * if we are interrupted since we don't take lock that will stall us or
> - * any other cpu. Voila, no irq lock is needed.
> - *
> - */
> -
> -static u64 do_posix_clock_monotonic_gettime_parts(
> -	struct timespec *tp, struct timespec *mo)
> -{
> -	u64 jiff;
> -	unsigned int seq;
> -
> -	do {
> -		seq = read_seqbegin(&xtime_lock);
> -		getnstimeofday(tp);
> -		*mo = wall_to_monotonic;
> -		jiff = jiffies_64;
> -
> -	} while(read_seqretry(&xtime_lock, seq));
> -
> -	return jiff;
> -}
> -
> -static int do_posix_clock_monotonic_get(clockid_t clock, struct timespec *tp)
> -{
> -	struct timespec wall_to_mono;
> -
> -	do_posix_clock_monotonic_gettime_parts(tp, &wall_to_mono);
> -
> -	tp->tv_sec += wall_to_mono.tv_sec;
> -	tp->tv_nsec += wall_to_mono.tv_nsec;
> -
> -	if ((tp->tv_nsec - NSEC_PER_SEC) > 0) {
> -		tp->tv_nsec -= NSEC_PER_SEC;
> -		tp->tv_sec++;
> -	}
> -	return 0;
> -}
> -
> -int do_posix_clock_monotonic_gettime(struct timespec *tp)
> -{
> -	return do_posix_clock_monotonic_get(CLOCK_MONOTONIC, tp);
> -}
> -
> +/* Not available / possible... functions */
>  int do_posix_clock_nosettime(clockid_t clockid, struct timespec *tp)
>  {
>  	return -EINVAL;
> @@ -1236,7 +934,8 @@ int do_posix_clock_notimer_create(struct
>  }
>  EXPORT_SYMBOL_GPL(do_posix_clock_notimer_create);
>  
> -int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t)
> +int do_posix_clock_nonanosleep(clockid_t clock, int flags, struct timespec *t,
> +			       struct timespec __user *r)
>  {
>  #ifndef ENOTSUP
>  	return -EOPNOTSUPP;	/* aka ENOTSUP in userland for POSIX */
> @@ -1295,125 +994,34 @@ sys_clock_getres(clockid_t which_clock, 
>  	return error;
>  }
>  
> -static void nanosleep_wake_up(unsigned long __data)
> -{
> -	struct task_struct *p = (struct task_struct *) __data;
> -
> -	wake_up_process(p);
> -}
> -
>  /*
> - * The standard says that an absolute nanosleep call MUST wake up at
> - * the requested time in spite of clock settings.  Here is what we do:
> - * For each nanosleep call that needs it (only absolute and not on
> - * CLOCK_MONOTONIC* (as it can not be set)) we thread a little structure
> - * into the "nanosleep_abs_list".  All we need is the task_struct pointer.
> - * When ever the clock is set we just wake up all those tasks.	 The rest
> - * is done by the while loop in clock_nanosleep().
> - *
> - * On locking, clock_was_set() is called from update_wall_clock which
> - * holds (or has held for it) a write_lock_irq( xtime_lock) and is
> - * called from the timer bh code.  Thus we need the irq save locks.
> - *
> - * Also, on the call from update_wall_clock, that is done as part of a
> - * softirq thing.  We don't want to delay the system that much (possibly
> - * long list of timers to fix), so we defer that work to keventd.
> + * nanosleep for monotonic and realtime clocks
>   */
> -
> -static DECLARE_WAIT_QUEUE_HEAD(nanosleep_abs_wqueue);
> -static DECLARE_WORK(clock_was_set_work, (void(*)(void*))clock_was_set, NULL);
> -
> -static DECLARE_MUTEX(clock_was_set_lock);
> -
> -void clock_was_set(void)
> +static int common_nsleep(clockid_t which_clock, int flags,
> +			 struct timespec *tsave, struct timespec __user *rmtp)
>  {
> -	struct k_itimer *timr;
> -	struct timespec new_wall_to;
> -	LIST_HEAD(cws_list);
> -	unsigned long seq;
> -
> +	int mode = flags & TIMER_ABSTIME ? KTIMER_ABS : KTIMER_REL;
>  
> -	if (unlikely(in_interrupt())) {
> -		schedule_work(&clock_was_set_work);
> -		return;
> +	switch (which_clock) {
> +	case CLOCK_REALTIME:
> +		/* Posix madness. Only absolute timers on clock realtime
> +		   are affected by clock set. */
> +		if (mode == KTIMER_ABS)
> +			return ktimer_nanosleep_real(tsave, rmtp, mode);
> +	case CLOCK_MONOTONIC:
> +		return ktimer_nanosleep_mono(tsave, rmtp, mode);
> +	default:
> +		break;
>  	}
> -	wake_up_all(&nanosleep_abs_wqueue);
> -
> -	/*
> -	 * Check if there exist TIMER_ABSTIME timers to correct.
> -	 *
> -	 * Notes on locking: This code is run in task context with irq
> -	 * on.  We CAN be interrupted!  All other usage of the abs list
> -	 * lock is under the timer lock which holds the irq lock as
> -	 * well.  We REALLY don't want to scan the whole list with the
> -	 * interrupt system off, AND we would like a sequence lock on
> -	 * this code as well.  Since we assume that the clock will not
> -	 * be set often, it seems ok to take and release the irq lock
> -	 * for each timer.  In fact add_timer will do this, so this is
> -	 * not an issue.  So we know when we are done, we will move the
> -	 * whole list to a new location.  Then as we process each entry,
> -	 * we will move it to the actual list again.  This way, when our
> -	 * copy is empty, we are done.  We are not all that concerned
> -	 * about preemption so we will use a semaphore lock to protect
> -	 * aginst reentry.  This way we will not stall another
> -	 * processor.  It is possible that this may delay some timers
> -	 * that should have expired, given the new clock, but even this
> -	 * will be minimal as we will always update to the current time,
> -	 * even if it was set by a task that is waiting for entry to
> -	 * this code.  Timers that expire too early will be caught by
> -	 * the expire code and restarted.
> -
> -	 * Absolute timers that repeat are left in the abs list while
> -	 * waiting for the task to pick up the signal.  This means we
> -	 * may find timers that are not in the "add_timer" list, but are
> -	 * in the abs list.  We do the same thing for these, save
> -	 * putting them back in the "add_timer" list.  (Note, these are
> -	 * left in the abs list mainly to indicate that they are
> -	 * ABSOLUTE timers, a fact that is used by the re-arm code, and
> -	 * for which we have no other flag.)
> -
> -	 */
> -
> -	down(&clock_was_set_lock);
> -	spin_lock_irq(&abs_list.lock);
> -	list_splice_init(&abs_list.list, &cws_list);
> -	spin_unlock_irq(&abs_list.lock);
> -	do {
> -		do {
> -			seq = read_seqbegin(&xtime_lock);
> -			new_wall_to =	wall_to_monotonic;
> -		} while (read_seqretry(&xtime_lock, seq));
> -
> -		spin_lock_irq(&abs_list.lock);
> -		if (list_empty(&cws_list)) {
> -			spin_unlock_irq(&abs_list.lock);
> -			break;
> -		}
> -		timr = list_entry(cws_list.next, struct k_itimer,
> -				  it.real.abs_timer_entry);
> -
> -		list_del_init(&timr->it.real.abs_timer_entry);
> -		if (add_clockset_delta(timr, &new_wall_to) &&
> -		    del_timer(&timr->it.real.timer))  /* timer run yet? */
> -			add_timer(&timr->it.real.timer);
> -		list_add(&timr->it.real.abs_timer_entry, &abs_list.list);
> -		spin_unlock_irq(&abs_list.lock);
> -	} while (1);
> -
> -	up(&clock_was_set_lock);
> +	return -EINVAL;
>  }
>  
> -long clock_nanosleep_restart(struct restart_block *restart_block);
> -
>  asmlinkage long
>  sys_clock_nanosleep(clockid_t which_clock, int flags,
>  		    const struct timespec __user *rqtp,
>  		    struct timespec __user *rmtp)
>  {
>  	struct timespec t;
> -	struct restart_block *restart_block =
> -	    &(current_thread_info()->restart_block);
> -	int ret;
>  
>  	if (invalid_clockid(which_clock))
>  		return -EINVAL;
> @@ -1421,135 +1029,8 @@ sys_clock_nanosleep(clockid_t which_cloc
>  	if (copy_from_user(&t, rqtp, sizeof (struct timespec)))
>  		return -EFAULT;
>  
> -	if ((unsigned) t.tv_nsec >= NSEC_PER_SEC || t.tv_sec < 0)
> +	if (!timespec_valid(&t))
>  		return -EINVAL;
>  
> -	/*
> -	 * Do this here as nsleep function does not have the real address.
> -	 */
> -	restart_block->arg1 = (unsigned long)rmtp;
> -
> -	ret = CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t));
> -
> -	if ((ret == -ERESTART_RESTARTBLOCK) && rmtp &&
> -					copy_to_user(rmtp, &t, sizeof (t)))
> -		return -EFAULT;
> -	return ret;
> -}
> -
> -
> -static int common_nsleep(clockid_t which_clock,
> -			 int flags, struct timespec *tsave)
> -{
> -	struct timespec t, dum;
> -	struct timer_list new_timer;
> -	DECLARE_WAITQUEUE(abs_wqueue, current);
> -	u64 rq_time = (u64)0;
> -	s64 left;
> -	int abs;
> -	struct restart_block *restart_block =
> -	    &current_thread_info()->restart_block;
> -
> -	abs_wqueue.flags = 0;
> -	init_timer(&new_timer);
> -	new_timer.expires = 0;
> -	new_timer.data = (unsigned long) current;
> -	new_timer.function = nanosleep_wake_up;
> -	abs = flags & TIMER_ABSTIME;
> -
> -	if (restart_block->fn == clock_nanosleep_restart) {
> -		/*
> -		 * Interrupted by a non-delivered signal, pick up remaining
> -		 * time and continue.  Remaining time is in arg2 & 3.
> -		 */
> -		restart_block->fn = do_no_restart_syscall;
> -
> -		rq_time = restart_block->arg3;
> -		rq_time = (rq_time << 32) + restart_block->arg2;
> -		if (!rq_time)
> -			return -EINTR;
> -		left = rq_time - get_jiffies_64();
> -		if (left <= (s64)0)
> -			return 0;	/* Already passed */
> -	}
> -
> -	if (abs && (posix_clocks[which_clock].clock_get !=
> -			    posix_clocks[CLOCK_MONOTONIC].clock_get))
> -		add_wait_queue(&nanosleep_abs_wqueue, &abs_wqueue);
> -
> -	do {
> -		t = *tsave;
> -		if (abs || !rq_time) {
> -			adjust_abs_time(&posix_clocks[which_clock], &t, abs,
> -					&rq_time, &dum);
> -		}
> -
> -		left = rq_time - get_jiffies_64();
> -		if (left >= (s64)MAX_JIFFY_OFFSET)
> -			left = (s64)MAX_JIFFY_OFFSET;
> -		if (left < (s64)0)
> -			break;
> -
> -		new_timer.expires = jiffies + left;
> -		__set_current_state(TASK_INTERRUPTIBLE);
> -		add_timer(&new_timer);
> -
> -		schedule();
> -
> -		del_timer_sync(&new_timer);
> -		left = rq_time - get_jiffies_64();
> -	} while (left > (s64)0 && !test_thread_flag(TIF_SIGPENDING));
> -
> -	if (abs_wqueue.task_list.next)
> -		finish_wait(&nanosleep_abs_wqueue, &abs_wqueue);
> -
> -	if (left > (s64)0) {
> -
> -		/*
> -		 * Always restart abs calls from scratch to pick up any
> -		 * clock shifting that happened while we are away.
> -		 */
> -		if (abs)
> -			return -ERESTARTNOHAND;
> -
> -		left *= TICK_NSEC;
> -		tsave->tv_sec = div_long_long_rem(left, 
> -						  NSEC_PER_SEC, 
> -						  &tsave->tv_nsec);
> -		/*
> -		 * Restart works by saving the time remaing in 
> -		 * arg2 & 3 (it is 64-bits of jiffies).  The other
> -		 * info we need is the clock_id (saved in arg0). 
> -		 * The sys_call interface needs the users 
> -		 * timespec return address which _it_ saves in arg1.
> -		 * Since we have cast the nanosleep call to a clock_nanosleep
> -		 * both can be restarted with the same code.
> -		 */
> -		restart_block->fn = clock_nanosleep_restart;
> -		restart_block->arg0 = which_clock;
> -		/*
> -		 * Caller sets arg1
> -		 */
> -		restart_block->arg2 = rq_time & 0xffffffffLL;
> -		restart_block->arg3 = rq_time >> 32;
> -
> -		return -ERESTART_RESTARTBLOCK;
> -	}
> -
> -	return 0;
> -}
> -/*
> - * This will restart clock_nanosleep.
> - */
> -long
> -clock_nanosleep_restart(struct restart_block *restart_block)
> -{
> -	struct timespec t;
> -	int ret = common_nsleep(restart_block->arg0, 0, &t);
> -
> -	if ((ret == -ERESTART_RESTARTBLOCK) && restart_block->arg1 &&
> -	    copy_to_user((struct timespec __user *)(restart_block->arg1), &t,
> -			 sizeof (t)))
> -		return -EFAULT;
> -	return ret;
> +	return CLOCK_DISPATCH(which_clock, nsleep, (which_clock, flags, &t, rmtp));
>  }
> Index: linux-2.6.14-rc2-rt4/kernel/timer.c
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/kernel/timer.c
> +++ linux-2.6.14-rc2-rt4/kernel/timer.c
> @@ -912,6 +912,7 @@ static void run_timer_softirq(struct sof
>  {
>  	tvec_base_t *base = &__get_cpu_var(tvec_bases);
>  
> + 	run_ktimer_queues();
>  	if (time_after_eq(jiffies, base->timer_jiffies))
>  		__run_timers(base);
>  }
> @@ -1177,62 +1178,6 @@ asmlinkage long sys_gettid(void)
>  	return current->pid;
>  }
>  
> -static long __sched nanosleep_restart(struct restart_block *restart)
> -{
> -	unsigned long expire = restart->arg0, now = jiffies;
> -	struct timespec __user *rmtp = (struct timespec __user *) restart->arg1;
> -	long ret;
> -
> -	/* Did it expire while we handled signals? */
> -	if (!time_after(expire, now))
> -		return 0;
> -
> -	expire = schedule_timeout_interruptible(expire - now);
> -
> -	ret = 0;
> -	if (expire) {
> -		struct timespec t;
> -		jiffies_to_timespec(expire, &t);
> -
> -		ret = -ERESTART_RESTARTBLOCK;
> -		if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
> -			ret = -EFAULT;
> -		/* The 'restart' block is already filled in */
> -	}
> -	return ret;
> -}
> -
> -asmlinkage long sys_nanosleep(struct timespec __user *rqtp, struct timespec __user *rmtp)
> -{
> -	struct timespec t;
> -	unsigned long expire;
> -	long ret;
> -
> -	if (copy_from_user(&t, rqtp, sizeof(t)))
> -		return -EFAULT;
> -
> -	if ((t.tv_nsec >= 1000000000L) || (t.tv_nsec < 0) || (t.tv_sec < 0))
> -		return -EINVAL;
> -
> -	expire = timespec_to_jiffies(&t) + (t.tv_sec || t.tv_nsec);
> -	expire = schedule_timeout_interruptible(expire);
> -
> -	ret = 0;
> -	if (expire) {
> -		struct restart_block *restart;
> -		jiffies_to_timespec(expire, &t);
> -		if (rmtp && copy_to_user(rmtp, &t, sizeof(t)))
> -			return -EFAULT;
> -
> -		restart = &current_thread_info()->restart_block;
> -		restart->fn = nanosleep_restart;
> -		restart->arg0 = jiffies + expire;
> -		restart->arg1 = (unsigned long) rmtp;
> -		ret = -ERESTART_RESTARTBLOCK;
> -	}
> -	return ret;
> -}
> -
>  /*
>   * sys_sysinfo - fill in sysinfo struct
>   */ 
> Index: linux-2.6.14-rc2-rt4/include/linux/time.h
> ===================================================================
> --- linux-2.6.14-rc2-rt4.orig/include/linux/time.h
> +++ linux-2.6.14-rc2-rt4/include/linux/time.h
> @@ -4,6 +4,7 @@
>  #include <linux/types.h>
>  
>  #ifdef __KERNEL__
> +#include <linux/calc64.h>
>  #include <linux/seqlock.h>
>  #endif
>  
> @@ -38,6 +39,11 @@ static __inline__ int timespec_equal(str
>  	return (a->tv_sec == b->tv_sec) && (a->tv_nsec == b->tv_nsec);
>  } 
>  
> +#define timespec_valid(ts) \
> +(((ts)->tv_sec >= 0) && (((unsigned) (ts)->tv_nsec) < NSEC_PER_SEC))
> +
> +typedef s64 nsec_t;
> +
>  /* Converts Gregorian date to seconds since 1970-01-01 00:00:00.
>   * Assumes input in normal date format, i.e. 1980-12-31 23:59:59
>   * => year=1980, mon=12, day=31, hour=23, min=59, sec=59.
> @@ -88,8 +94,7 @@ struct timespec current_kernel_time(void
>  extern void do_gettimeofday(struct timeval *tv);
>  extern int do_settimeofday(struct timespec *tv);
>  extern int do_sys_settimeofday(struct timespec *tv, struct timezone *tz);
> -extern void clock_was_set(void); // call when ever the clock is set
> -extern int do_posix_clock_monotonic_gettime(struct timespec *tp);
> +extern void do_posix_clock_monotonic_gettime(struct timespec *ts);
>  extern long do_utimes(char __user * filename, struct timeval * times);
>  struct itimerval;
>  extern int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue);
> @@ -113,6 +118,40 @@ set_normalized_timespec (struct timespec
>  	ts->tv_nsec = nsec;
>  }
>  
> +static __inline__ nsec_t timespec_to_ns(struct timespec *s)
> +{
> +	nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
> +	return res + (nsec_t) s->tv_nsec;
> +}
> +
> +static __inline__ struct timespec ns_to_timespec(nsec_t n)
> +{
> +	struct timespec ts;
> +
> +	if (n)
> +		ts.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &ts.tv_nsec);
> +	else
> +		ts.tv_sec = ts.tv_nsec = 0;
> +	return ts;
> +}
> +
> +static __inline__ nsec_t timeval_to_ns(struct timeval *s)
> +{
> +	nsec_t res = (nsec_t) s->tv_sec * NSEC_PER_SEC;
> +	return res + (nsec_t) s->tv_usec * NSEC_PER_USEC;
> +}
> +
> +static __inline__ struct timeval ns_to_timeval(nsec_t n)
> +{
> +	struct timeval tv;
> +	if (n) {
> +		tv.tv_sec = div_long_long_rem_signed(n, NSEC_PER_SEC, &tv.tv_usec);
> +		tv.tv_usec /= 1000;
> +	} else
> +		tv.tv_sec = tv.tv_usec = 0;
> +	return tv;
> +}
> +
>  #endif /* __KERNEL__ */
>  
>  #define NFDBITS			__NFDBITS
> @@ -145,23 +184,18 @@ struct	itimerval {
>  /*
>   * The IDs of the various system clocks (for POSIX.1b interval timers).
>   */
> -#define CLOCK_REALTIME		  0
> -#define CLOCK_MONOTONIC	  1
> +#define CLOCK_REALTIME		 0
> +#define CLOCK_MONOTONIC	  	 1
>  #define CLOCK_PROCESS_CPUTIME_ID 2
>  #define CLOCK_THREAD_CPUTIME_ID	 3
> -#define CLOCK_REALTIME_HR	 4
> -#define CLOCK_MONOTONIC_HR	  5
>  
>  /*
>   * The IDs of various hardware clocks
>   */
> -
> -
>  #define CLOCK_SGI_CYCLE 10
>  #define MAX_CLOCKS 16
> -#define CLOCKS_MASK  (CLOCK_REALTIME | CLOCK_MONOTONIC | \
> -                     CLOCK_REALTIME_HR | CLOCK_MONOTONIC_HR)
> -#define CLOCKS_MONO (CLOCK_MONOTONIC & CLOCK_MONOTONIC_HR)
> +#define CLOCKS_MASK  (CLOCK_REALTIME | CLOCK_MONOTONIC)
> +#define CLOCKS_MONO (CLOCK_MONOTONIC)
>  
>  /*
>   * The various flags for setting POSIX.1b interval timers.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-09-28 20:43 [PATCH] ktimers subsystem 2.6.14-rc2-kt5 tglx
                   ` (2 preceding siblings ...)
  2005-09-29 19:57 ` George Anzinger
@ 2005-10-01  1:03 ` Roman Zippel
  2005-10-01 11:22   ` Ingo Molnar
                     ` (2 more replies)
  3 siblings, 3 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-01  1:03 UTC (permalink / raw)
  To: tglx
  Cc: linux-kernel, mingo, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

Hi,

On Wed, 28 Sep 2005 tglx@linutronix.de wrote:

Your patch introduces some whitespace damage, search for "^\+  " in your 
patch.

> ktimers seperate the "timer API" from the "timeout API". 

I'm not really happy with these names, timeouts are what timers do, so 
these names don't tell at all, what the difference is.
Calling them "process timer" and "kernel timer" would include their main 
usage, although that also means ptimer were the more correct abbreviation.

> +#ifndef KTIME_IS_SCALAR
> +typedef union {
> +	s64	tv64;
> +	struct {
> +#ifdef __BIG_ENDIAN
> +	s32	sec, nsec;
> +#else
> +	s32	nsec, sec;
> +#endif
> +	} tv;
> +} ktime_t;
> +
> +#else
> +
> +typedef s64 ktime_t;
> +
> +#endif

Making the union unconditional, would make tv64 always available and a lot 
of macros unnessary.

> +struct ktimer {
> +	struct rb_node		node;
> +	struct list_head	list;
> +	ktime_t			expires;
> +	ktime_t			expired;
> +	ktime_t			interval;
> +	int 	 	 	overrun;
> +	unsigned long		status;
> +	void 			(*function)(void *);
> +	void			*data;
> +	struct ktimer_base 	*base;
> +};

This structure is rather large and I think a lot can be avoided.
- list: AFAICT it's only used by run_ktimer_queue() to get the first 
pending entry. This can also be done by keeping track of the first entry 
in the base structure (useful in other places as well).
- expired: can be replaced by base->last_expired (may also be useful in 
other places)
- status: only user is ktimer_active(), the same test can be done by 
testing node.rb_parent.
- interval/overrun: this is only needed by itimers and I think it's 
possible to leave it there. Main change would be to let 'function' return 
a value indicating whether to rearm the timer or not (this includes 
expires is updated).

> +#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
> +
> +#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
> +#define ktime_cmp_val(a, op, b) ((a).tv64 op b)

A union ktime would especially avoid this.

> +static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
> +{
> +	ktime_t res;
> +
> +	res.tv64 = a.tv64 - b.tv64;
> +	if (res.tv.nsec < 0)
> +		res.tv.nsec += NSEC_PER_SEC;
> +
> +	return res;
> +}
> +
> +static inline ktime_t ktime_add(ktime_t a, ktime_t b)
> +{
> +	ktime_t res;
> +
> +	res.tv64 = a.tv64 + b.tv64;
> +	if (res.tv.nsec >= NSEC_PER_SEC) {
> +		res.tv.nsec -= NSEC_PER_SEC;
> +		res.tv.sec++;
> +	}
> +	return res;
> +}

Not using 64bit math here allows gcc to generate better code, e.g. gcc 
has to add another test for "nsec < 0" because the condition code is 
already used for the overflow, adding the "sec--" instead is IMO faster 
(i.e. less likely).

> +/* The time bases */
> +#define MAX_KTIMER_BASES	2
> +static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =

Do you have any numbers (besides maybe microbenchmarks) that show a real 
advantage by using per cpu data? What kind of usage do you expect here?

The other thing is that this assumes, that all time sources are 
programmable per cpu, otherwise it will be more complicated for a time 
source to run the timers for every cpu, I don't know how safe that 
assumption is.
Changing the array of structures into an array of pointers to the 
structures would allow to switch between percpu bases and a single base.

> +ktime_t ktimer_convert_timespec(struct ktimer *timer, struct timespec *ts)
> +{
> +	struct ktimer_base *base = get_ktimer_base_unlocked(timer);
> +	ktime_t t;
> +	long rem = ts->tv_nsec % base->resolution;
> +
> +	t = ktime_set(ts->tv_sec, ts->tv_nsec);
> +
> +	/* Check, if the value has to be rounded */
> +	if (rem)
> +		t = ktime_add_ns(t, base->resolution - rem);
> +	return t;
> +}

Could you explain a little the resolution handling behind in your patch?
If I read SUS correctly clock resolution and timer resolution don't have 
to be the same, the first is returned by clock_getres() and the latter 
only documented somewhere (and AFAICT our implementation always returned 
the wrong value).
IMO this also means we can don't have to make the rounding that 
complicated. Actually it could be done automatically by the timer, e.g. 
interval timer are reprogrammed at (now + interval) and the timer 
resolution will automatically round it up.

> +static int enqueue_ktimer(struct ktimer *timer, struct ktimer_base *base,
> +			   ktime_t *tim, int mode)
> +{
> +	struct rb_node **link = &base->active.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct ktimer *entry;
> +	struct list_head *prev = &base->pending;
> +	ktime_t now;
> +
> +	/* Get current time */
> +	now = base->get_time();

As get_time() is not necessarily cheap, it can be avoided for nonrelative 
timers by comparing it with the first pending timer. Maintaining a pointer 
to the first timer here, avoids the timer list and is a simple check 
whether the time source needs any reprogramming later.

> +	if ktime_cmp(timer->expires, <=, now) {
> +		timer->expired = now;
> +		/* The caller takes care of expiry */
> +		if (!(mode & KTIMER_NOCHECK))
> +			return -1;

I think KTIMER_NOFAIL would be better name, for a while that had me 
confused, as you actually do check the value, but you don't fail it and 
enqueue it anyway.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-01  1:03 ` Roman Zippel
@ 2005-10-01 11:22   ` Ingo Molnar
  2005-10-04  1:59     ` George Anzinger
  2005-10-10 12:42     ` Roman Zippel
  2005-10-01 12:05   ` Thomas Gleixner
  2005-10-04  1:55   ` George Anzinger
  2 siblings, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-01 11:22 UTC (permalink / raw)
  To: Roman Zippel
  Cc: tglx, linux-kernel, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > +/* The time bases */
> > +#define MAX_KTIMER_BASES	2
> > +static DEFINE_PER_CPU(struct ktimer_base, ktimer_bases[MAX_KTIMER_BASES]) =
> 
> Do you have any numbers (besides maybe microbenchmarks) that show a 
> real advantage by using per cpu data? What kind of usage do you expect 
> here?

it has countless advantages, and these days we basically only design 
per-CPU data structures within the kernel, unless some limitation (such 
as API or hw property) forces us to do otherwise. So i turn around the 
question: what would be your reason for _not_ doing this clean per-CPU 
design for SMP systems?

> The other thing is that this assumes, that all time sources are 
> programmable per cpu, otherwise it will be more complicated for a time 
> source to run the timers for every cpu, I don't know how safe that 
> assumption is. Changing the array of structures into an array of 
> pointers to the structures would allow to switch between percpu bases 
> and a single base.

yeah, and that's an assumption that simplifies things on SMP 
significantly. PIT on SMP systems for HRT is so gross that it's not 
funny. If anyone wants to revive that notion, please do a separate patch 
and make the case convincing enough ...

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-01 11:22   ` Ingo Molnar
@ 2005-10-04  1:59     ` George Anzinger
  2005-10-04  5:51       ` Ingo Molnar
  2005-10-10 12:42     ` Roman Zippel
  1 sibling, 1 reply; 67+ messages in thread
From: George Anzinger @ 2005-10-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Roman Zippel, tglx, linux-kernel, Andrew Morton, johnstul,
	paulmck, Christoph Hellwig, oleg, tim.bird

Ingo Molnar wrote:
> * Roman Zippel <zippel@linux-m68k.org> wrote:
> > 
>>The other thing is that this assumes, that all time sources are 
>>programmable per cpu, otherwise it will be more complicated for a time 
>>source to run the timers for every cpu, I don't know how safe that 
>>assumption is. Changing the array of structures into an array of 
>>pointers to the structures would allow to switch between percpu bases 
>>and a single base.
> 
> 
> yeah, and that's an assumption that simplifies things on SMP 
> significantly. PIT on SMP systems for HRT is so gross that it's not 
> funny. If anyone wants to revive that notion, please do a separate patch 
> and make the case convincing enough ...
> 
Lets not talk about PIT, but, a lot of SMP platforms do NOT have per cpu timers.  For those, it 
would seem having per cpu lists to handle the timer is not really reasonable.

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-04  1:59     ` George Anzinger
@ 2005-10-04  5:51       ` Ingo Molnar
  0 siblings, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-04  5:51 UTC (permalink / raw)
  To: George Anzinger
  Cc: Roman Zippel, tglx, linux-kernel, Andrew Morton, johnstul,
	paulmck, Christoph Hellwig, oleg, tim.bird

* George Anzinger <george@mvista.com> wrote:

> > yeah, and that's an assumption that simplifies things on SMP 
> > significantly. PIT on SMP systems for HRT is so gross that it's not 
> > funny. If anyone wants to revive that notion, please do a separate 
> > patch and make the case convincing enough ...
>
> Lets not talk about PIT, but, a lot of SMP platforms do NOT have per 
> cpu timers.  For those, it would seem having per cpu lists to handle 
> the timer is not really reasonable.

frankly, such systems are rare, and are an afterthought at most. Think 
about it: 8 CPUs and only one hres timer source? It cannot work nor 
scale well.

i agree that they might eventually be handled (although i think we 
shouldnt bother, all sane SMP designs have per-CPU timers), but we 
definite wont design for them. What such an architecture has to do is to 
provide the proper do_hr_timer_int() and arch_hrtimer_reprogram() 
semantics, via locking around that timer source (naturally), and via 
cross-CPU calls - as if they were per-CPU timers.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-01 11:22   ` Ingo Molnar
  2005-10-04  1:59     ` George Anzinger
@ 2005-10-10 12:42     ` Roman Zippel
  2005-10-10 14:04       ` Ingo Molnar
  1 sibling, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-10 12:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: tglx, linux-kernel, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

Hi,

On Sat, 1 Oct 2005, Ingo Molnar wrote:

> > Do you have any numbers (besides maybe microbenchmarks) that show a 
> > real advantage by using per cpu data? What kind of usage do you expect 
> > here?
> 
> it has countless advantages, and these days we basically only design 
> per-CPU data structures within the kernel, unless some limitation (such 
> as API or hw property) forces us to do otherwise. So i turn around the 
> question: what would be your reason for _not_ doing this clean per-CPU 
> design for SMP systems?

Did I say I'm against it? No, I was just hoping someone put some more 
thought into it than just "all the other kids are doing it".
I was just curious how well it really scales compared to the simple 
version, e.g. what happens if most timer end up on a single cpu or what 
happens if we want to start the timer on a different cpu. Is this so wrong 
that you have to go into attack mode? :(

> > The other thing is that this assumes, that all time sources are 
> > programmable per cpu, otherwise it will be more complicated for a time 
> > source to run the timers for every cpu, I don't know how safe that 
> > assumption is. Changing the array of structures into an array of 
> > pointers to the structures would allow to switch between percpu bases 
> > and a single base.
> 
> yeah, and that's an assumption that simplifies things on SMP 
> significantly. PIT on SMP systems for HRT is so gross that it's not 
> funny. If anyone wants to revive that notion, please do a separate patch 
> and make the case convincing enough ...

Why do use "PIT on SMP" as an extreme example to reject the general 
concept completely? This doesn't explain, why first such a (simple) SMP 
design shouldn't exist and why secondly my suggestion is such a big 
problem.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-10 12:42     ` Roman Zippel
@ 2005-10-10 14:04       ` Ingo Molnar
  0 siblings, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-10 14:04 UTC (permalink / raw)
  To: Roman Zippel
  Cc: tglx, linux-kernel, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird


* Roman Zippel <zippel@linux-m68k.org> wrote:

> > > Do you have any numbers (besides maybe microbenchmarks) that show a 
> > > real advantage by using per cpu data? What kind of usage do you expect 
> > > here?
> > 
> > it has countless advantages, and these days we basically only design 
> > per-CPU data structures within the kernel, unless some limitation (such 
> > as API or hw property) forces us to do otherwise. So i turn around the 
> > question: what would be your reason for _not_ doing this clean per-CPU 
> > design for SMP systems?
> 
> Did I say I'm against it? No, I was just hoping someone put some more 
> thought into it than just "all the other kids are doing it". I was 
> just curious how well it really scales compared to the simple version, 
> e.g. what happens if most timer end up on a single cpu or what happens 
> if we want to start the timer on a different cpu. Is this so wrong 
> that you have to go into attack mode? :(

[ sorry, and i didnt go into 'attack mode'. I believe you'll distinctly 
  notice when i do that :-) ]

just think NUMA, and the generic advantages of PER_CPU become obvious.  
(via PER_CPU the different data structures indexed can properly end up 
on another domain's RAM, and can thus improve caching characteristics.)

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-01  1:03 ` Roman Zippel
  2005-10-01 11:22   ` Ingo Molnar
@ 2005-10-01 12:05   ` Thomas Gleixner
  2005-10-10 17:22     ` Roman Zippel
  2005-10-04  1:55   ` George Anzinger
  2 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2005-10-01 12:05 UTC (permalink / raw)
  To: Roman Zippel
  Cc: linux-kernel, mingo, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

[-- Attachment #1: Type: text/plain, Size: 8003 bytes --]

Roman,

On Sat, 2005-10-01 at 03:03 +0200, Roman Zippel wrote:

> Your patch introduces some whitespace damage, search for "^\+  " in your 
> patch.

Ok.

> > ktimers seperate the "timer API" from the "timeout API". 
> I'm not really happy with these names, timeouts are what timers do, so 
> these names don't tell at all, what the difference is.

There is a clear distinction between timers and timeouts.

>From IT-dictonary:

"Timeout is a specified period of time that will be allowed to elapse in
a system before a specified event is to take place, unless another
specified event occurs first; in either case, the period is terminated
when either event takes place."

"A timer is a specialized type of clock. A timer can be used to control
the sequence of an event or process."

> Calling them "process timer" and "kernel timer" would include their main 
> usage, although that also means ptimer were the more correct abbreviation.

As said before I think the disctinction between timers and timeouts
makes perfectly sense and ktimers are _not_ restricted to process
timers. 

> > +#ifndef KTIME_IS_SCALAR
> > +typedef union {
> > +	s64	tv64;
> > +	struct {
> > +#ifdef __BIG_ENDIAN
> > +	s32	sec, nsec;
> > +#else
> > +	s32	nsec, sec;
> > +#endif
> > +	} tv;
> > +} ktime_t;
> > +
> > +#else
> > +
> > +typedef s64 ktime_t;
> > +
> > +#endif
> 
> Making the union unconditional, would make tv64 always available and a lot 
> of macros unnessary.

nsec,sec storage format is essentially different to the scalar storage
format and has to be handled different.

The above gives a clear distinction between scalar and sec/nsec based
cases. So you cannot mess up without notice. 

I prefer having a lot more macros / inlines around rather than tracking
down _one_ single bug which happens by a non clearly seperated
implementation.

> > +struct ktimer {
> > +	struct rb_node		node;
> > +	struct list_head	list;
> > +	ktime_t			expires;
> > +	ktime_t			expired;
> > +	ktime_t			interval;
> > +	int 	 	 	overrun;
> > +	unsigned long		status;
> > +	void 			(*function)(void *);
> > +	void			*data;
> > +	struct ktimer_base 	*base;
> > +};
> 
> This structure is rather large and I think a lot can be avoided.
> - list: AFAICT it's only used by run_ktimer_queue() to get the first 
> pending entry. This can also be done by keeping track of the first entry 
> in the base structure (useful in other places as well).

You are right that the list is not necessary for the plain integration
into the current system, but it is necessary once you start to upgrade
to high resolution timers.

> - expired: can be replaced by base->last_expired (may also be useful in 
> other places)

How gives base->last_expired a per timer expired information? And where
would it be useful ?

> - status: only user is ktimer_active(), the same test can be done by 
> testing node.rb_parent.

Uurg. Been there and discarded the idea, because its ugly and clashes
with further extensibilty requirements e.g. high resolution timers,
where we have more than two states. 

Having status information bound to arbitrary pointers is trading a
variable against flexibility, cleanliness and maintainability. 

> - interval/overrun: this is only needed by itimers and I think it's 
> possible to leave it there. Main change would be to let 'function' return 
> a value indicating whether to rearm the timer or not (this includes 
> expires is updated).

It is also used by the posix timer code and I plan to do another round
of simplification also there.


This implementation is chosen to be flexible and easy exstensible for
use cases like high resolution timers. 


I do not want to end up with a next round of discussion there about
either introducing tons of new ifdefs, macros or redesigning the code
base another time. 

As others have stated too, we have to wage the tradeoff between 

simplicity, flexibility, maintainability 
vs.
size and performance impacts

Performance is definitely an important issue and was accepted and
addressed.

The tradeoff of the size in question is not a valid argument to give up
a clear, flexible and maintainable design.

> > +#define DEFINE_KTIME(k) ktime_t k = {.tv64 = 0LL }
> > +
> > +#define ktime_cmp(a,op,b) ((a).tv64 op (b).tv64)
> > +#define ktime_cmp_val(a, op, b) ((a).tv64 op b)
> 
> A union ktime would especially avoid this.

See above

> > +static inline ktime_t ktime_sub(ktime_t a, ktime_t b)
> > +{
> > +	ktime_t res;
> > +
> > +	res.tv64 = a.tv64 - b.tv64;
> > +	if (res.tv.nsec < 0)
> > +		res.tv.nsec += NSEC_PER_SEC;
> > +
> > +	return res;
> > +}
> > +
> > +static inline ktime_t ktime_add(ktime_t a, ktime_t b)
> > +{
> > +	ktime_t res;
> > +
> > +	res.tv64 = a.tv64 + b.tv64;
> > +	if (res.tv.nsec >= NSEC_PER_SEC) {
> > +		res.tv.nsec -= NSEC_PER_SEC;
> > +		res.tv.sec++;
> > +	}
> > +	return res;
> > +}
> 
> Not using 64bit math here allows gcc to generate better code, e.g. gcc 
> has to add another test for "nsec < 0" because the condition code is 
> already used for the overflow, adding the "sec--" instead is IMO faster 
> (i.e. less likely).

i686
DOADD32         00000048
DOADD64         0000002a
DOSUB32         00000060
DOSUB64         0000002f
arm
DOADD32         0000004c
DOADD64         0000004c
DOSUB32         00000040
DOSUB64         00000038
m68k
DOADD32         0000003c
DOADD64         0000002e
DOSUB32         00000036
DOSUB64         00000028
powerpc
DOADD32         00000040
DOADD64         00000044
DOSUB32         00000044
DOSUB64         00000044
m68k
DOADD32         0000003c
DOADD64         0000002e
DOSUB32         00000036
DOSUB64         00000028

Please do not tell me that size does not matter. :)

I attached the assembler dumps, so you can have a look yourself. I did
these tests during the implementation and decided on the results rather
than on assumptions about gcc.


> Could you explain a little the resolution handling behind in your patch?
> If I read SUS correctly clock resolution and timer resolution don't have 
> to be the same, the first is returned by clock_getres() and the latter 
> only documented somewhere (and AFAICT our implementation always returned 
> the wrong value).

As far as I understand SUS timer resolution is equal to clock resolution
and the timer value/interval is rounded up to the resolution.

> IMO this also means we can don't have to make the rounding that 
> complicated. Actually it could be done automatically by the timer, e.g. 
> interval timer are reprogrammed at (now + interval) and the timer 
> resolution will automatically round it up.

Reprogramming interval timers by now + interval is completely wrong.
Reprogramming has to be 
timer->expires + interval and nothing else. 

Doing real rounding in the reprogramming code would be a performance
impact.

> > +	/* Get current time */
> > +	now = base->get_time();
> 
> As get_time() is not necessarily cheap, it can be avoided for nonrelative 
> timers by comparing it with the first pending timer. Maintaining a pointer 
> to the first timer here, avoids the timer list and is a simple check 
> whether the time source needs any reprogramming later.

Would you please care to read the complete related code to find out why
this does not work. This is totaly unrelated to reprogramming of the
time event source in the HRT case.
...
	case KTIMER_FORWARD:
		while ktime_cmp(timer->expires, <= , now) {
...

	case KTIMER_REARM:
		while ktime_cmp(timer->expires, <= , now) {
			timer->expires = ktime_add(timer->expires,

and of course the expiry check below.

> > +	if ktime_cmp(timer->expires, <=, now) {
> > +		timer->expired = now;
> > +		/* The caller takes care of expiry */
> > +		if (!(mode & KTIMER_NOCHECK))
> > +			return -1;
> 
> I think KTIMER_NOFAIL would be better name, for a while that had me 
> confused, as you actually do check the value, but you don't fail it and 
> enqueue it anyway.

It does not fail. It returns in the case that the timer is already
expired. The NOCHECK flag is used to skip the check.

tglx


[-- Attachment #2: testarchs.dmp --]
[-- Type: text/plain, Size: 18161 bytes --]



DOADD32

ktime.o:     file format elf32-i386

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	83 ec 0c             	sub    $0xc,%esp
   6:	89 1c 24             	mov    %ebx,(%esp)
   9:	8b 45 14             	mov    0x14(%ebp),%eax
   c:	8b 5d 10             	mov    0x10(%ebp),%ebx
   f:	89 74 24 04          	mov    %esi,0x4(%esp)
  13:	8b 55 18             	mov    0x18(%ebp),%edx
  16:	89 7c 24 08          	mov    %edi,0x8(%esp)
  1a:	8d 34 03             	lea    (%ebx,%eax,1),%esi
  1d:	81 fe ff c9 9a 3b    	cmp    $0x3b9ac9ff,%esi
  23:	8d 3c 13             	lea    (%ebx,%edx,1),%edi
  26:	7e 07                	jle    2f <ktime_ops+0x2f>
  28:	81 ee 00 ca 9a 3b    	sub    $0x3b9aca00,%esi
  2e:	47                   	inc    %edi
  2f:	8b 45 08             	mov    0x8(%ebp),%eax
  32:	89 30                	mov    %esi,(%eax)
  34:	89 78 04             	mov    %edi,0x4(%eax)
  37:	8b 1c 24             	mov    (%esp),%ebx
  3a:	8b 74 24 04          	mov    0x4(%esp),%esi
  3e:	8b 7c 24 08          	mov    0x8(%esp),%edi
  42:	89 ec                	mov    %ebp,%esp
  44:	5d                   	pop    %ebp
  45:	c2 04 00             	ret    $0x4
-------------------------------------------------------------------------

DOADD64

ktime.o:     file format elf32-i386

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	8b 55 14             	mov    0x14(%ebp),%edx
   6:	03 55 0c             	add    0xc(%ebp),%edx
   9:	8b 4d 18             	mov    0x18(%ebp),%ecx
   c:	8b 45 08             	mov    0x8(%ebp),%eax
   f:	13 4d 10             	adc    0x10(%ebp),%ecx
  12:	81 fa ff c9 9a 3b    	cmp    $0x3b9ac9ff,%edx
  18:	7e 07                	jle    21 <ktime_ops+0x21>
  1a:	81 ea 00 ca 9a 3b    	sub    $0x3b9aca00,%edx
  20:	41                   	inc    %ecx
  21:	89 10                	mov    %edx,(%eax)
  23:	89 48 04             	mov    %ecx,0x4(%eax)
  26:	5d                   	pop    %ebp
  27:	c2 04 00             	ret    $0x4
-------------------------------------------------------------------------

DOSUB32

ktime.o:     file format elf32-i386

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	83 ec 0c             	sub    $0xc,%esp
   6:	89 74 24 04          	mov    %esi,0x4(%esp)
   a:	89 7c 24 08          	mov    %edi,0x8(%esp)
   e:	89 1c 24             	mov    %ebx,(%esp)
  11:	8b 5d 10             	mov    0x10(%ebp),%ebx
  14:	8b 45 14             	mov    0x14(%ebp),%eax
  17:	8b 55 18             	mov    0x18(%ebp),%edx
  1a:	89 de                	mov    %ebx,%esi
  1c:	89 df                	mov    %ebx,%edi
  1e:	29 c6                	sub    %eax,%esi
  20:	29 d7                	sub    %edx,%edi
  22:	85 f6                	test   %esi,%esi
  24:	78 1a                	js     40 <ktime_ops+0x40>
  26:	8b 45 08             	mov    0x8(%ebp),%eax
  29:	89 30                	mov    %esi,(%eax)
  2b:	89 78 04             	mov    %edi,0x4(%eax)
  2e:	8b 1c 24             	mov    (%esp),%ebx
  31:	8b 74 24 04          	mov    0x4(%esp),%esi
  35:	8b 7c 24 08          	mov    0x8(%esp),%edi
  39:	89 ec                	mov    %ebp,%esp
  3b:	5d                   	pop    %ebp
  3c:	c2 04 00             	ret    $0x4
  3f:	90                   	nop    
  40:	8b 45 08             	mov    0x8(%ebp),%eax
  43:	81 c6 00 ca 9a 3b    	add    $0x3b9aca00,%esi
  49:	4f                   	dec    %edi
  4a:	89 30                	mov    %esi,(%eax)
  4c:	89 78 04             	mov    %edi,0x4(%eax)
  4f:	8b 1c 24             	mov    (%esp),%ebx
  52:	8b 74 24 04          	mov    0x4(%esp),%esi
  56:	8b 7c 24 08          	mov    0x8(%esp),%edi
  5a:	89 ec                	mov    %ebp,%esp
  5c:	5d                   	pop    %ebp
  5d:	c2 04 00             	ret    $0x4
-------------------------------------------------------------------------

DOSUB64

ktime.o:     file format elf32-i386

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	8b 55 0c             	mov    0xc(%ebp),%edx
   6:	2b 55 14             	sub    0x14(%ebp),%edx
   9:	8b 4d 10             	mov    0x10(%ebp),%ecx
   c:	8b 45 08             	mov    0x8(%ebp),%eax
   f:	1b 4d 18             	sbb    0x18(%ebp),%ecx
  12:	85 d2                	test   %edx,%edx
  14:	78 0a                	js     20 <ktime_ops+0x20>
  16:	89 10                	mov    %edx,(%eax)
  18:	89 48 04             	mov    %ecx,0x4(%eax)
  1b:	5d                   	pop    %ebp
  1c:	c2 04 00             	ret    $0x4
  1f:	90                   	nop    
  20:	89 48 04             	mov    %ecx,0x4(%eax)
  23:	81 c2 00 ca 9a 3b    	add    $0x3b9aca00,%edx
  29:	89 10                	mov    %edx,(%eax)
  2b:	5d                   	pop    %ebp
  2c:	c2 04 00             	ret    $0x4
-------------------------------------------------------------------------

DOADD32

ktime.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	e24dd004 	sub	sp, sp, #4	; 0x4
   4:	e92d4070 	stmdb	sp!, {r4, r5, r6, lr}
   8:	e3e0c331 	mvn	ip, #-1006632960	; 0xc4000000
   c:	e24cc865 	sub	ip, ip, #6619136	; 0x650000
  10:	e24ccc36 	sub	ip, ip, #13824	; 0x3600
  14:	e58d3010 	str	r3, [sp, #16]
  18:	e28de010 	add	lr, sp, #16	; 0x10
  1c:	e89e0018 	ldmia	lr, {r3, r4}
  20:	e0825003 	add	r5, r2, r3
  24:	e155000c 	cmp	r5, ip
  28:	c2855331 	addgt	r5, r5, #-1006632960	; 0xc4000000
  2c:	e0826004 	add	r6, r2, r4
  30:	c2855865 	addgt	r5, r5, #6619136	; 0x650000
  34:	c2855c36 	addgt	r5, r5, #13824	; 0x3600
  38:	c2866001 	addgt	r6, r6, #1	; 0x1
  3c:	e8800060 	stmia	r0, {r5, r6}
  40:	e8bd4070 	ldmia	sp!, {r4, r5, r6, lr}
  44:	e28dd004 	add	sp, sp, #4	; 0x4
  48:	e1a0f00e 	mov	pc, lr
-------------------------------------------------------------------------

DOADD64

ktime.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	e24dd004 	sub	sp, sp, #4	; 0x4
   4:	e92d4030 	stmdb	sp!, {r4, r5, lr}
   8:	e58d300c 	str	r3, [sp, #12]
   c:	e28de010 	add	lr, sp, #16	; 0x10
  10:	e3e03331 	mvn	r3, #-1006632960	; 0xc4000000
  14:	e2433865 	sub	r3, r3, #6619136	; 0x650000
  18:	e2433c36 	sub	r3, r3, #13824	; 0x3600
  1c:	e81e0030 	ldmda	lr, {r4, r5}
  20:	e0944001 	adds	r4, r4, r1
  24:	e0a55002 	adc	r5, r5, r2
  28:	e1540003 	cmp	r4, r3
  2c:	c2844331 	addgt	r4, r4, #-1006632960	; 0xc4000000
  30:	c2844865 	addgt	r4, r4, #6619136	; 0x650000
  34:	c2844c36 	addgt	r4, r4, #13824	; 0x3600
  38:	c2855001 	addgt	r5, r5, #1	; 0x1
  3c:	e8800030 	stmia	r0, {r4, r5}
  40:	e8bd4030 	ldmia	sp!, {r4, r5, lr}
  44:	e28dd004 	add	sp, sp, #4	; 0x4
  48:	e1a0f00e 	mov	pc, lr
-------------------------------------------------------------------------

DOSUB32

ktime.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	e24dd004 	sub	sp, sp, #4	; 0x4
   4:	e92d4070 	stmdb	sp!, {r4, r5, r6, lr}
   8:	e58d3010 	str	r3, [sp, #16]
   c:	e28de010 	add	lr, sp, #16	; 0x10
  10:	e89e0018 	ldmia	lr, {r3, r4}
  14:	e0635002 	rsb	r5, r3, r2
  18:	e3550000 	cmp	r5, #0	; 0x0
  1c:	b28555ee 	addlt	r5, r5, #998244352	; 0x3b800000
  20:	e0646002 	rsb	r6, r4, r2
  24:	b285596b 	addlt	r5, r5, #1753088	; 0x1ac000
  28:	b2855c0a 	addlt	r5, r5, #2560	; 0xa00
  2c:	b2466001 	sublt	r6, r6, #1	; 0x1
  30:	e8800060 	stmia	r0, {r5, r6}
  34:	e8bd4070 	ldmia	sp!, {r4, r5, r6, lr}
  38:	e28dd004 	add	sp, sp, #4	; 0x4
  3c:	e1a0f00e 	mov	pc, lr
-------------------------------------------------------------------------

DOSUB64

ktime.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	e24dd004 	sub	sp, sp, #4	; 0x4
   4:	e52d4004 	str	r4, [sp, #-4]!
   8:	e58d3004 	str	r3, [sp, #4]
   c:	e99d0018 	ldmib	sp, {r3, r4}
  10:	e0511003 	subs	r1, r1, r3
  14:	e0c22004 	sbc	r2, r2, r4
  18:	e3510000 	cmp	r1, #0	; 0x0
  1c:	b28115ee 	addlt	r1, r1, #998244352	; 0x3b800000
  20:	b281196b 	addlt	r1, r1, #1753088	; 0x1ac000
  24:	b2811c0a 	addlt	r1, r1, #2560	; 0xa00
  28:	e8800006 	stmia	r0, {r1, r2}
  2c:	e8bd0010 	ldmia	sp!, {r4}
  30:	e28dd004 	add	sp, sp, #4	; 0x4
  34:	e1a0f00e 	mov	pc, lr
-------------------------------------------------------------------------

DOADD32

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f03           	movel %d3,%sp@-
   6:	2f02           	movel %d2,%sp@-
   8:	206e 0008      	moveal %fp@(8),%a0
   c:	226e 000c      	moveal %fp@(12),%a1
  10:	202e 0010      	movel %fp@(16),%d0
  14:	222e 0014      	movel %fp@(20),%d1
  18:	2408           	movel %a0,%d2
  1a:	d480           	addl %d0,%d2
  1c:	2608           	movel %a0,%d3
  1e:	d681           	addl %d1,%d3
  20:	0c83 3b9a c9ff 	cmpil #999999999,%d3
  26:	6f08           	bles 30 <ktime_ops+0x30>
  28:	0683 c465 3600 	addil #-1000000000,%d3
  2e:	5282           	addql #1,%d2
  30:	2002           	movel %d2,%d0
  32:	2203           	movel %d3,%d1
  34:	241f           	movel %sp@+,%d2
  36:	261f           	movel %sp@+,%d3
  38:	4e5e           	unlk %fp
  3a:	4e75           	rts
-------------------------------------------------------------------------

DOADD64

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f02           	movel %d2,%sp@-
   6:	202e 0008      	movel %fp@(8),%d0
   a:	222e 000c      	movel %fp@(12),%d1
   e:	242e 0010      	movel %fp@(16),%d2
  12:	d2ae 0014      	addl %fp@(20),%d1
  16:	d182           	addxl %d2,%d0
  18:	0c81 3b9a c9ff 	cmpil #999999999,%d1
  1e:	6f08           	bles 28 <ktime_ops+0x28>
  20:	0681 c465 3600 	addil #-1000000000,%d1
  26:	5280           	addql #1,%d0
  28:	241f           	movel %sp@+,%d2
  2a:	4e5e           	unlk %fp
  2c:	4e75           	rts
-------------------------------------------------------------------------

DOSUB32

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f03           	movel %d3,%sp@-
   6:	2f02           	movel %d2,%sp@-
   8:	202e 0008      	movel %fp@(8),%d0
   c:	222e 000c      	movel %fp@(12),%d1
  10:	206e 0010      	moveal %fp@(16),%a0
  14:	226e 0014      	moveal %fp@(20),%a1
  18:	2400           	movel %d0,%d2
  1a:	9488           	subl %a0,%d2
  1c:	9089           	subl %a1,%d0
  1e:	2600           	movel %d0,%d3
  20:	6c08           	bges 2a <ktime_ops+0x2a>
  22:	0683 3b9a ca00 	addil #1000000000,%d3
  28:	5382           	subql #1,%d2
  2a:	2002           	movel %d2,%d0
  2c:	2203           	movel %d3,%d1
  2e:	241f           	movel %sp@+,%d2
  30:	261f           	movel %sp@+,%d3
  32:	4e5e           	unlk %fp
  34:	4e75           	rts
-------------------------------------------------------------------------

DOSUB64

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f02           	movel %d2,%sp@-
   6:	202e 0008      	movel %fp@(8),%d0
   a:	222e 000c      	movel %fp@(12),%d1
   e:	242e 0010      	movel %fp@(16),%d2
  12:	92ae 0014      	subl %fp@(20),%d1
  16:	9182           	subxl %d2,%d0
  18:	4a81           	tstl %d1
  1a:	6c06           	bges 22 <ktime_ops+0x22>
  1c:	0681 3b9a ca00 	addil #1000000000,%d1
  22:	241f           	movel %sp@+,%d2
  24:	4e5e           	unlk %fp
  26:	4e75           	rts
-------------------------------------------------------------------------

DOADD32

ktime.o:     file format elf32-powerpc

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	81 64 00 00 	lwz     r11,0(r4)
   4:	3c 00 3b 9a 	lis     r0,15258
   8:	81 45 00 04 	lwz     r10,4(r5)
   c:	60 00 c9 ff 	ori     r0,r0,51711
  10:	81 25 00 00 	lwz     r9,0(r5)
  14:	7d 0b 52 14 	add     r8,r11,r10
  18:	7f 88 00 00 	cmpw    cr7,r8,r0
  1c:	7c eb 4a 14 	add     r7,r11,r9
  20:	3d 68 c4 65 	addis   r11,r8,-15259
  24:	7c 69 1b 78 	mr      r9,r3
  28:	40 9d 00 0c 	ble-    cr7,34 <ktime_ops+0x34>
  2c:	39 0b 36 00 	addi    r8,r11,13824
  30:	38 e7 00 01 	addi    r7,r7,1
  34:	90 e9 00 00 	stw     r7,0(r9)
  38:	91 09 00 04 	stw     r8,4(r9)
  3c:	4e 80 00 20 	blr
-------------------------------------------------------------------------

DOADD64

ktime.o:     file format elf32-powerpc

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	81 25 00 00 	lwz     r9,0(r5)
   4:	3c 00 3b 9a 	lis     r0,15258
   8:	81 45 00 04 	lwz     r10,4(r5)
   c:	60 00 c9 ff 	ori     r0,r0,51711
  10:	81 64 00 00 	lwz     r11,0(r4)
  14:	81 84 00 04 	lwz     r12,4(r4)
  18:	7d 8c 50 14 	addc    r12,r12,r10
  1c:	7d 6b 49 14 	adde    r11,r11,r9
  20:	7c 69 1b 78 	mr      r9,r3
  24:	7f 8c 00 00 	cmpw    cr7,r12,r0
  28:	3d 4c c4 65 	addis   r10,r12,-15259
  2c:	40 9d 00 0c 	ble-    cr7,38 <ktime_ops+0x38>
  30:	39 8a 36 00 	addi    r12,r10,13824
  34:	39 6b 00 01 	addi    r11,r11,1
  38:	91 69 00 00 	stw     r11,0(r9)
  3c:	91 89 00 04 	stw     r12,4(r9)
  40:	4e 80 00 20 	blr
-------------------------------------------------------------------------

DOSUB32

ktime.o:     file format elf32-powerpc

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	81 24 00 00 	lwz     r9,0(r4)
   4:	81 85 00 04 	lwz     r12,4(r5)
   8:	81 65 00 00 	lwz     r11,0(r5)
   c:	7d 0c 48 50 	subf    r8,r12,r9
  10:	2f 88 00 00 	cmpwi   cr7,r8,0
  14:	7c eb 48 50 	subf    r7,r11,r9
  18:	3d 68 3b 9b 	addis   r11,r8,15259
  1c:	7c 69 1b 78 	mr      r9,r3
  20:	41 9c 00 10 	blt-    cr7,30 <ktime_ops+0x30>
  24:	90 e9 00 00 	stw     r7,0(r9)
  28:	91 09 00 04 	stw     r8,4(r9)
  2c:	4e 80 00 20 	blr
  30:	39 0b ca 00 	addi    r8,r11,-13824
  34:	38 e7 ff ff 	addi    r7,r7,-1
  38:	90 e9 00 00 	stw     r7,0(r9)
  3c:	91 09 00 04 	stw     r8,4(r9)
  40:	4e 80 00 20 	blr
-------------------------------------------------------------------------

DOSUB64

ktime.o:     file format elf32-powerpc

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	81 65 00 00 	lwz     r11,0(r5)
   4:	7c 68 1b 78 	mr      r8,r3
   8:	81 24 00 00 	lwz     r9,0(r4)
   c:	81 44 00 04 	lwz     r10,4(r4)
  10:	81 85 00 04 	lwz     r12,4(r5)
  14:	7d 4c 50 10 	subfc   r10,r12,r10
  18:	7d 2b 49 10 	subfe   r9,r11,r9
  1c:	2f 8a 00 00 	cmpwi   cr7,r10,0
  20:	3d 6a 3b 9b 	addis   r11,r10,15259
  24:	41 9c 00 10 	blt-    cr7,34 <ktime_ops+0x34>
  28:	91 28 00 00 	stw     r9,0(r8)
  2c:	91 48 00 04 	stw     r10,4(r8)
  30:	4e 80 00 20 	blr
  34:	39 4b ca 00 	addi    r10,r11,-13824
  38:	91 28 00 00 	stw     r9,0(r8)
  3c:	91 48 00 04 	stw     r10,4(r8)
  40:	4e 80 00 20 	blr
-------------------------------------------------------------------------

DOADD32

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f03           	movel %d3,%sp@-
   6:	2f02           	movel %d2,%sp@-
   8:	206e 0008      	moveal %fp@(8),%a0
   c:	226e 000c      	moveal %fp@(12),%a1
  10:	202e 0010      	movel %fp@(16),%d0
  14:	222e 0014      	movel %fp@(20),%d1
  18:	2408           	movel %a0,%d2
  1a:	d480           	addl %d0,%d2
  1c:	2608           	movel %a0,%d3
  1e:	d681           	addl %d1,%d3
  20:	0c83 3b9a c9ff 	cmpil #999999999,%d3
  26:	6f08           	bles 30 <ktime_ops+0x30>
  28:	0683 c465 3600 	addil #-1000000000,%d3
  2e:	5282           	addql #1,%d2
  30:	2002           	movel %d2,%d0
  32:	2203           	movel %d3,%d1
  34:	241f           	movel %sp@+,%d2
  36:	261f           	movel %sp@+,%d3
  38:	4e5e           	unlk %fp
  3a:	4e75           	rts
-------------------------------------------------------------------------

DOADD64

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f02           	movel %d2,%sp@-
   6:	202e 0008      	movel %fp@(8),%d0
   a:	222e 000c      	movel %fp@(12),%d1
   e:	242e 0010      	movel %fp@(16),%d2
  12:	d2ae 0014      	addl %fp@(20),%d1
  16:	d182           	addxl %d2,%d0
  18:	0c81 3b9a c9ff 	cmpil #999999999,%d1
  1e:	6f08           	bles 28 <ktime_ops+0x28>
  20:	0681 c465 3600 	addil #-1000000000,%d1
  26:	5280           	addql #1,%d0
  28:	241f           	movel %sp@+,%d2
  2a:	4e5e           	unlk %fp
  2c:	4e75           	rts
-------------------------------------------------------------------------

DOSUB32

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f03           	movel %d3,%sp@-
   6:	2f02           	movel %d2,%sp@-
   8:	202e 0008      	movel %fp@(8),%d0
   c:	222e 000c      	movel %fp@(12),%d1
  10:	206e 0010      	moveal %fp@(16),%a0
  14:	226e 0014      	moveal %fp@(20),%a1
  18:	2400           	movel %d0,%d2
  1a:	9488           	subl %a0,%d2
  1c:	9089           	subl %a1,%d0
  1e:	2600           	movel %d0,%d3
  20:	6c08           	bges 2a <ktime_ops+0x2a>
  22:	0683 3b9a ca00 	addil #1000000000,%d3
  28:	5382           	subql #1,%d2
  2a:	2002           	movel %d2,%d0
  2c:	2203           	movel %d3,%d1
  2e:	241f           	movel %sp@+,%d2
  30:	261f           	movel %sp@+,%d3
  32:	4e5e           	unlk %fp
  34:	4e75           	rts
-------------------------------------------------------------------------

DOSUB64

ktime.o:     file format elf32-m68k

Disassembly of section .text:

00000000 <ktime_ops>:
   0:	4e56 0000      	linkw %fp,#0
   4:	2f02           	movel %d2,%sp@-
   6:	202e 0008      	movel %fp@(8),%d0
   a:	222e 000c      	movel %fp@(12),%d1
   e:	242e 0010      	movel %fp@(16),%d2
  12:	92ae 0014      	subl %fp@(20),%d1
  16:	9182           	subxl %d2,%d0
  18:	4a81           	tstl %d1
  1a:	6c06           	bges 22 <ktime_ops+0x22>
  1c:	0681 3b9a ca00 	addil #1000000000,%d1
  22:	241f           	movel %sp@+,%d2
  24:	4e5e           	unlk %fp
  26:	4e75           	rts
-------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-01 12:05   ` Thomas Gleixner
@ 2005-10-10 17:22     ` Roman Zippel
  2005-10-11  7:42       ` Thomas Gleixner
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-10 17:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, mingo, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

Hi,

On Sat, 1 Oct 2005, Thomas Gleixner wrote:

> > > ktimers seperate the "timer API" from the "timeout API". 
> > I'm not really happy with these names, timeouts are what timers do, so 
> > these names don't tell at all, what the difference is.
> 
> There is a clear distinction between timers and timeouts.
> 
> >From IT-dictonary:
> 
> "Timeout is a specified period of time that will be allowed to elapse in
> a system before a specified event is to take place, unless another
> specified event occurs first; in either case, the period is terminated
> when either event takes place."
> 
> "A timer is a specialized type of clock. A timer can be used to control
> the sequence of an event or process."

IOW a timer uses timeouts to control a sequence of events, it's still part 
of the same thing, which makes "timer API" and "timeout API" very 
confusing.

> > Calling them "process timer" and "kernel timer" would include their main 
> > usage, although that also means ptimer were the more correct abbreviation.
> 
> As said before I think the disctinction between timers and timeouts
> makes perfectly sense and ktimers are _not_ restricted to process
> timers. 

"main usage" != "restricted to"

> > > +#ifndef KTIME_IS_SCALAR
> > > +typedef union {
> > > +	s64	tv64;
> > > +	struct {
> > > +#ifdef __BIG_ENDIAN
> > > +	s32	sec, nsec;
> > > +#else
> > > +	s32	nsec, sec;
> > > +#endif
> > > +	} tv;
> > > +} ktime_t;
> > > +
> > > +#else
> > > +
> > > +typedef s64 ktime_t;
> > > +
> > > +#endif
> > 
> > Making the union unconditional, would make tv64 always available and a lot 
> > of macros unnessary.
> 
> nsec,sec storage format is essentially different to the scalar storage
> format and has to be handled different.
> 
> The above gives a clear distinction between scalar and sec/nsec based
> cases. So you cannot mess up without notice. 

There are enough macros to do this anyway. There are a number of 
operations which are identical. Separating them artifically makes 
everything only more complicated.

> > > +struct ktimer {
> > > +	struct rb_node		node;
> > > +	struct list_head	list;
> > > +	ktime_t			expires;
> > > +	ktime_t			expired;
> > > +	ktime_t			interval;
> > > +	int 	 	 	overrun;
> > > +	unsigned long		status;
> > > +	void 			(*function)(void *);
> > > +	void			*data;
> > > +	struct ktimer_base 	*base;
> > > +};
> > 
> > This structure is rather large and I think a lot can be avoided.
> > - list: AFAICT it's only used by run_ktimer_queue() to get the first 
> > pending entry. This can also be done by keeping track of the first entry 
> > in the base structure (useful in other places as well).
> 
> You are right that the list is not necessary for the plain integration
> into the current system, but it is necessary once you start to upgrade
> to high resolution timers.

Could you please specifiy these requirements?

> > - expired: can be replaced by base->last_expired (may also be useful in 
> > other places)
> 
> How gives base->last_expired a per timer expired information? And where
> would it be useful ?

If a callback needs that information, it can it get from there.

> > - status: only user is ktimer_active(), the same test can be done by 
> > testing node.rb_parent.
> 
> Uurg. Been there and discarded the idea, because its ugly and clashes
> with further extensibilty requirements e.g. high resolution timers,
> where we have more than two states. 
> 
> Having status information bound to arbitrary pointers is trading a
> variable against flexibility, cleanliness and maintainability. 

If you want to introduce more states later, it requires changing _one_ 
macro, so I don't really see the problem.

> > - interval/overrun: this is only needed by itimers and I think it's 
> > possible to leave it there. Main change would be to let 'function' return 
> > a value indicating whether to rearm the timer or not (this includes 
> > expires is updated).
> 
> It is also used by the posix timer code and I plan to do another round
> of simplification also there.

Please explain.

> I do not want to end up with a next round of discussion there about
> either introducing tons of new ifdefs, macros or redesigning the code
> base another time. 

I don't really see why this should be an excuse to introduce now more 
complex code than really necessary. If that extra complexity can't stand 
on it's own please introduce as soon as it becomes necessary.
I like most of the patch, but I would prefer to do a simple 
implementation/ cleanup first and then build anything more complex on top 
of it. If you need another complete redesign for this, then you likely do 
something wrong already now.

> > Not using 64bit math here allows gcc to generate better code, e.g. gcc 
> > has to add another test for "nsec < 0" because the condition code is 
> > already used for the overflow, adding the "sec--" instead is IMO faster 
> > (i.e. less likely).
> 
> i686
> DOADD32         00000048
> DOADD64         0000002a
> DOSUB32         00000060
> DOSUB64         0000002f
> arm
> DOADD32         0000004c
> DOADD64         0000004c
> DOSUB32         00000040
> DOSUB64         00000038
> m68k
> DOADD32         0000003c
> DOADD64         0000002e
> DOSUB32         00000036
> DOSUB64         00000028
> powerpc
> DOADD32         00000040
> DOADD64         00000044
> DOSUB32         00000044
> DOSUB64         00000044
> 
> Please do not tell me that size does not matter. :)
> 
> I attached the assembler dumps, so you can have a look yourself. I did
> these tests during the implementation and decided on the results rather
> than on assumptions about gcc.

Did you look at the generating code? Most of it is function prologue/ 
epilogue, which is quite unimportant for inline functions. The other thing 
I forgot to mention last time is that passing values by reference instead 
of value also makes a difference.
For m68k I actually got smaller code this way (mostly because addx/subx 
are limited in their addressing modes). In the other cases I'm actually 
surprised gcc doesn't use the previous result from the sub and adds 
another test. The remaining difference comes from how gcc deals with 
structure vs. integral values, which could use some improvement, 
especially the add case should have produced nearly identical results.

Anyway, this point wasn't that important, it's only microoptimizations and 
at least having the option to change it later (after more tests) is fine 
with me.

> > Could you explain a little the resolution handling behind in your patch?
> > If I read SUS correctly clock resolution and timer resolution don't have 
> > to be the same, the first is returned by clock_getres() and the latter 
> > only documented somewhere (and AFAICT our implementation always returned 
> > the wrong value).
> 
> As far as I understand SUS timer resolution is equal to clock resolution
> and the timer value/interval is rounded up to the resolution.

Please check the rationale about clocks and timers. It talks about clocks 
and timer services based on them and their resolutions can be different.

> > IMO this also means we can don't have to make the rounding that 
> > complicated. Actually it could be done automatically by the timer, e.g. 
> > interval timer are reprogrammed at (now + interval) and the timer 
> > resolution will automatically round it up.
> 
> Reprogramming interval timers by now + interval is completely wrong.
> Reprogramming has to be 
> timer->expires + interval and nothing else. 

Where do get the requirement for an explicit rounding from?
The point is that the timer should not expire early, but there is more 
than one way to do this and can be done implicitly using the timer 
resolution.

> > > +	/* Get current time */
> > > +	now = base->get_time();
> > 
> > As get_time() is not necessarily cheap, it can be avoided for nonrelative 
> > timers by comparing it with the first pending timer. Maintaining a pointer 
> > to the first timer here, avoids the timer list and is a simple check 
> > whether the time source needs any reprogramming later.
> 
> Would you please care to read the complete related code to find out why
> this does not work. This is totaly unrelated to reprogramming of the
> time event source in the HRT case.

You saw that I restricted this to "nonrelative timers"?

> > > +	if ktime_cmp(timer->expires, <=, now) {
> > > +		timer->expired = now;
> > > +		/* The caller takes care of expiry */
> > > +		if (!(mode & KTIMER_NOCHECK))
> > > +			return -1;
> > 
> > I think KTIMER_NOFAIL would be better name, for a while that had me 
> > confused, as you actually do check the value, but you don't fail it and 
> > enqueue it anyway.
> 
> It does not fail. It returns in the case that the timer is already
> expired. The NOCHECK flag is used to skip the check.

It returns with a failure value!? The NOCHECK name is ambiguous about what 
should not be checked, the NOFAIL name is more clear that the caller 
doesn't need to check the return value, because the function won't fail.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-10 17:22     ` Roman Zippel
@ 2005-10-11  7:42       ` Thomas Gleixner
  2005-10-12 22:36         ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2005-10-11  7:42 UTC (permalink / raw)
  To: Roman Zippel
  Cc: linux-kernel, mingo, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

On Mon, 2005-10-10 at 19:22 +0200, Roman Zippel wrote:
> > The above gives a clear distinction between scalar and sec/nsec based
> > cases. So you cannot mess up without notice. 
> 
> There are enough macros to do this anyway. There are a number of 
> operations which are identical. Separating them artifically makes 
> everything only more complicated.

I don't see a distinct set of macros around which is providing all the
functionality.

> > As far as I understand SUS timer resolution is equal to clock resolution
> > and the timer value/interval is rounded up to the resolution.
> 
> Please check the rationale about clocks and timers. It talks about clocks 
> and timer services based on them and their resolutions can be different.

clock_settime():
... Time values that are between two consecutive non-negative integer
multiples of the resolution of the specified clock shall be truncated
down to the smaller multiple of the resolution.

timer_settime():
...Time values that are between two consecutive non-negative integer
multiples of the resolution of the specified timer shall be rounded up
to the larger multiple of the resolution. Quantization error shall not
cause the timer to expire earlier than the rounded time value.

> > Reprogramming interval timers by now + interval is completely wrong.
> > Reprogramming has to be 
> > timer->expires + interval and nothing else. 
> 
> Where do get the requirement for an explicit rounding from?
> The point is that the timer should not expire early, but there is more 
> than one way to do this and can be done implicitly using the timer 
> resolution.

See above.

tglx



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-11  7:42       ` Thomas Gleixner
@ 2005-10-12 22:36         ` Roman Zippel
  2005-10-12 23:46           ` George Anzinger
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-12 22:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, mingo, Andrew Morton, george, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

Hi,

On Tue, 11 Oct 2005, Thomas Gleixner wrote:

> > > As far as I understand SUS timer resolution is equal to clock resolution
> > > and the timer value/interval is rounded up to the resolution.
> > 
> > Please check the rationale about clocks and timers. It talks about clocks 
> > and timer services based on them and their resolutions can be different.
> 
> clock_settime():
> ... Time values that are between two consecutive non-negative integer
> multiples of the resolution of the specified clock shall be truncated
> down to the smaller multiple of the resolution.
> 
> timer_settime():
> ...Time values that are between two consecutive non-negative integer
> multiples of the resolution of the specified timer shall be rounded up
> to the larger multiple of the resolution. Quantization error shall not
> cause the timer to expire earlier than the rounded time value.

Where does it say anything about that their resolution is equal?

> > > Reprogramming interval timers by now + interval is completely wrong.
> > > Reprogramming has to be 
> > > timer->expires + interval and nothing else. 
> > 
> > Where do get the requirement for an explicit rounding from?
> > The point is that the timer should not expire early, but there is more 
> > than one way to do this and can be done implicitly using the timer 
> > resolution.
> 
> See above.

I know it and above is an _interface_ description, but what leads you to 
the conclusion that your _implementation_ is the only correct one?

Thomas, are you even interested in discussing this? Do you just expect 
that everyone accepts your patch and is happy? So far it's difficult 
enough to get you to explain your design, but a serious discussion also 
requires to look at the possible alternatives. It's quite possible I'm 
wrong, but you have to try a little harder at explaining why.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-12 22:36         ` Roman Zippel
@ 2005-10-12 23:46           ` George Anzinger
  2005-10-16 16:34             ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: George Anzinger @ 2005-10-12 23:46 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, linux-kernel, mingo, Andrew Morton, johnstul,
	paulmck, Christoph Hellwig, oleg, tim.bird

Roman Zippel wrote:
> Hi,
> 
> On Tue, 11 Oct 2005, Thomas Gleixner wrote:
> 
> 
>>>>As far as I understand SUS timer resolution is equal to clock resolution
>>>>and the timer value/interval is rounded up to the resolution.
>>>
>>>Please check the rationale about clocks and timers. It talks about clocks 
>>>and timer services based on them and their resolutions can be different.

Well, yes and no.  Under timer_settime() it talks about ticks and resolution being the inverse of 
the tick rate.  AND it does imply that timers on a given CLOCK will have that clocks resolution as 
returned by clock_res().  This is fine as far as it goes.  In practical systems we almost always 
have a much higher resolution for the clock_gettime() and gettimeofday() than the tick rate.  What 
the standard does not seem to want to do is to admit that a clock may have the ability to be read at 
a better resolution than its tick rate.

For this reason, the usual practice is to return the "timer" resolution for clock_res() and to 
return clock values with as much resolution as possible.  In no case should the actual clock 
resolution be less than what clock_res() returns.

>>
>>clock_settime():
>>... Time values that are between two consecutive non-negative integer
>>multiples of the resolution of the specified clock shall be truncated
>>down to the smaller multiple of the resolution.
>>
>>timer_settime():
>>...Time values that are between two consecutive non-negative integer
>>multiples of the resolution of the specified timer shall be rounded up
>>to the larger multiple of the resolution. Quantization error shall not
>>cause the timer to expire earlier than the rounded time value.

Here the standard uses "resolution of the specified timer" but the only way, in the standard, to 
associate a resolution with a timer is via the CLOCK used.
> 
> 
> Where does it say anything about that their resolution is equal?

So the timers resolution is the same as the CLOCKs resolution as returned by clock_res() but, as I 
said above, the usual practice is to return clock values (via clock_gettime or gettimeofday) with 
higher resolution.
> 
> 
>>>>Reprogramming interval timers by now + interval is completely wrong.
>>>>Reprogramming has to be 
>>>>timer->expires + interval and nothing else. 
>>>
>>>Where do get the requirement for an explicit rounding from?
>>>The point is that the timer should not expire early, but there is more 
>>>than one way to do this and can be done implicitly using the timer 
>>>resolution.
>>
>>See above.

The standard requires that timer expiry times and interval times be rounded up to the next 
"resolution" value.  For the first or initial time of a repeating timer we, usually, have to add an 
additional "resolution" to account for starting the timer at some point between ticks.  For the 
interval on repeating timers, we know that the interval is starting at the last expiry time and thus 
do not need to account for the between tick start time.
> 
~

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-12 23:46           ` George Anzinger
@ 2005-10-16 16:34             ` Roman Zippel
  2005-10-16 19:26               ` Thomas Gleixner
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-16 16:34 UTC (permalink / raw)
  To: George Anzinger
  Cc: Thomas Gleixner, linux-kernel, mingo, Andrew Morton, johnstul,
	paulmck, Christoph Hellwig, oleg, tim.bird

Hi,

On Wed, 12 Oct 2005, George Anzinger wrote:

> > > > > As far as I understand SUS timer resolution is equal to clock
> > > > > resolution
> > > > > and the timer value/interval is rounded up to the resolution.
> > > > 
> > > > Please check the rationale about clocks and timers. It talks about
> > > > clocks and timer services based on them and their resolutions can be
> > > > different.
> 
> Well, yes and no.  Under timer_settime() it talks about ticks and resolution
> being the inverse of the tick rate.  AND it does imply that timers on a given
> CLOCK will have that clocks resolution as returned by clock_res().  This is
> fine as far as it goes.  In practical systems we almost always have a much
> higher resolution for the clock_gettime() and gettimeofday() than the tick
> rate.  What the standard does not seem to want to do is to admit that a clock
> may have the ability to be read at a better resolution than its tick rate.

The interesting question is what resolution has CLOCK_REALTIME really?
This paragraph in timer_settime() doesn't mention CLOCK_REALTIME and 
AFAICT historically the resolution of e.g. gettimeofday() was really in 
the msec range.

IMO there is a far more interesting in sentence under clock_getres(): "If 
the time argument of clock_settime() is not a multiple of res, then the 
value is truncated to a multiple of res."
This is relatively obvious for hardware clocks, e.g. we could define a 
CLOCK_JIFFIES with a resolution of TICK_NSEC or CLOCK_PIT with a 
resolution of 838 nsec. The conversion from the actual clock value to/from 
timespec automatically takes care of any truncation/rounding.

CLOCK_REALTIME is now a bit special as it doesn't map directly to a 
hardware clock, it also includes adjustments and these are done in nsec 
resolution (actually even fractions of that in the NTP code). In 2.6 we 
don't truncate the value anywhere and maintain it as a nsec value, 
therefore the resolution of CLOCK_REALTIME should really really 1 nsec 
(and 1 usec under 2.4).

OTOH the precision with which the clock can be read is a different matter 
and depends on the hardware clock CLOCK_REALTIME is derived of. It would 
really help if we could agree on something what clock resolution really 
means (especially for CLOCK_REALTIME). For hardware clocks the resolution 
is defined by the conversion factor from clock cycles to timespec, but 
CLOCK_REALTIME is a virtual clock, so is its resolution the precision with 
which the clock can be read or written? clock_getres() specifically 
mentions clock_settime()...

Depending on this is how we define what timer resolution means. Currently 
we convert the timespec value from/into a jiffies value, so I guess the 
resolution is really TICK_NSEC, as it's the resolution at which we 
maintain the timer value. Thomas's patch now changes this and we keep a 
nsec value, but doesn't that mean the resolution of the timer becomes 1 
nsec? It's basically the same question as above, is the timer resolution 
the precision at which we maintain the values, the precision with which 
the timer can be read or the precision with which the timer can be 
programmed?

The spec is not really clear and Thomas refusal to explain his design 
decision is as also not really helpful. :-(
He sets the timer resolution to (NSEC_PER_SEC/HZ) which matches no value 
above and this way he basically creates another virtual timer, which has 
only little to do with the real kernel timer tick.

I'm open to other interpretations and I think it's important to get to 
some agreement, _before_ we start to change interfaces.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-16 16:34             ` Roman Zippel
@ 2005-10-16 19:26               ` Thomas Gleixner
  2005-10-16 23:03                 ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2005-10-16 19:26 UTC (permalink / raw)
  To: Roman Zippel
  Cc: George Anzinger, linux-kernel, mingo, Andrew Morton, johnstul,
	paulmck, Christoph Hellwig, oleg, tim.bird

On Sun, 2005-10-16 at 18:34 +0200, Roman Zippel wrote:

> The spec is not really clear and Thomas refusal to explain his design 
> decision is as also not really helpful. :-(

I did explain, why I did the rounding in the way it is implemented. If
you define the fact that I have a different interpretation of SUS than
you as refusal, then we can stop this thread right here.

> He sets the timer resolution to (NSEC_PER_SEC/HZ) which matches no value 
> above and this way he basically creates another virtual timer, which has 
> only little to do with the real kernel timer tick.

As George explained already we return the resolution of the timer as the
value which can be assumed to be the resolution of the event source,
which drives the timer, because that seems to be the only interesting
value for an application programmer. The theoretical resolution of a
jiffie based timer system is NSEC_PER_SEC/HZ. 

So why is NSEC_PER_SEC/HZ creating a virtual timer ? Because the ntp
adjusted resolution per tick is 1% off ?

I really don't see any sense in returning changing resolution values
every 5 minutes due to NTP adjustments. I imagine the happiness of
application programmers which actually do calculations based on such a
resolution value.

And in the logical consequence you would have to save the original
userspace timespec value including the time when the timer is set up and
redo the rounding and calculation every time NTP changes the
NSEC_PER_TICK value for _all_ timers which are related to
CLOCK_MONOTONIC and CLOCK_REALTIME. 

The code does not introduce a virtual timer at all. It uses the ntp
adjusted time reference and guarantees that the timer goes not off
early. Usually it expires with the next tick - of course system load can
delay it further. 

	tglx

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-16 19:26               ` Thomas Gleixner
@ 2005-10-16 23:03                 ` Roman Zippel
  2005-10-17  7:59                   ` Ingo Molnar
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-16 23:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: George Anzinger, linux-kernel, mingo, Andrew Morton, johnstul,
	paulmck, Christoph Hellwig, oleg, tim.bird

Hi,

On Sun, 16 Oct 2005, Thomas Gleixner wrote:

> > The spec is not really clear and Thomas refusal to explain his design 
> > decision is as also not really helpful. :-(
> 
> I did explain, why I did the rounding in the way it is implemented. If
> you define the fact that I have a different interpretation of SUS than
> you as refusal, then we can stop this thread right here.

I have no problem with you having a different opinion, I have a problem 
with your childish behaviour. :-(
You completely ignore the rest of my mail, trying to establish some base 
definitions, which would help to figure out the options we have based on 
the spec. You instead just insist on your interpretation without going 
into any detail.

> > He sets the timer resolution to (NSEC_PER_SEC/HZ) which matches no value 
> > above and this way he basically creates another virtual timer, which has 
> > only little to do with the real kernel timer tick.
> 
> As George explained already we return the resolution of the timer as the
> value which can be assumed to be the resolution of the event source,
> which drives the timer, because that seems to be the only interesting
> value for an application programmer. The theoretical resolution of a
> jiffie based timer system is NSEC_PER_SEC/HZ. 

You still don't explain, how you you get to this conclusion based on the 
spec. Instead you redefine it now to useful assupmtions for application
programmers who can't read the spec...
You still completely leave the question unanswered of the possibility of 
different resolutions. We can still discuss what resolution to return with 
clock_getres(), but first we have to establish with what kind of resoltion 
we're dealing with here.

> I really don't see any sense in returning changing resolution values
> every 5 minutes due to NTP adjustments. I imagine the happiness of
> application programmers which actually do calculations based on such a
> resolution value.

Why are they doing this kind of calculations based on this value?
We can discuss returning a reasonable value for these applications, but I 
don't see how these assumptions should control how the kernel works.

> And in the logical consequence you would have to save the original
> userspace timespec value including the time when the timer is set up and
> redo the rounding and calculation every time NTP changes the
> NSEC_PER_TICK value for _all_ timers which are related to
> CLOCK_MONOTONIC and CLOCK_REALTIME. 

The rounding is done based on your interpretation of the spec, which you 
refuse to discuss. AFAICT the spec leaves enough room to avoid this 
rounding completely.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-16 23:03                 ` Roman Zippel
@ 2005-10-17  7:59                   ` Ingo Molnar
  2005-10-17  8:26                     ` Steven Rostedt
  2005-10-17  9:29                     ` Roman Zippel
  0 siblings, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17  7:59 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > > The spec is not really clear and Thomas refusal to explain his design 
> > > decision is as also not really helpful. :-(
> > 
> > I did explain, why I did the rounding in the way it is implemented. If
> > you define the fact that I have a different interpretation of SUS than
> > you as refusal, then we can stop this thread right here.
> 
> I have no problem with you having a different opinion, I have a problem 
> with your childish behaviour. :-(

Roman, IMO Thomas has been more than reasonable in replying to you - i'd 
have stopped replying to you after the first couple of mails, and we are 
at mail round 10 now! Thomas is being very patient with you. You are 
being difficult, and IMO you are wasting his and others' time.

the thing is that Thomas has advanced the whole issue of timeouts and 
timekeeping by leaps and bounds and he has written thousands of lines of 
new and excellent code for a kernel subsystem that has seen little 
activity for many years, before John got involved. One of Thomas' 
accomplishments is a timer/time design that allows the enabling of HRT 
timers via an _18 lines_ architecture patch. (!)

on the other hand, i have yet to see a single line of code from you and 
have yet to receive a single bugreport from you. (!)

so for me as a patch integrator and upstream maintainer the equation is 
very simple, and i am not nearly as tolerant as Thomas: shut up Roman 
already and show us the code!

really, start sending in patches. Testreports. Useful feedback. Those we 
can judge by their merits. Talk is cheap. The time subsystem has been 
dormant for years, and it has had more than enough talk already.

the moment you express yourself via patches we'll know that 1) you 
understand what we have done so far 2) you have useful ideas of what 
should be done differently 3) you have the coder capability to implement 
and test those ideas. Patches wont be ignored, i can assure you. Get the 
patches rolling!

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  7:59                   ` Ingo Molnar
@ 2005-10-17  8:26                     ` Steven Rostedt
  2005-10-17  9:29                     ` Roman Zippel
  1 sibling, 0 replies; 67+ messages in thread
From: Steven Rostedt @ 2005-10-17  8:26 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ingo Molnar, linux-kernel


Trivial "stupid" patch.  MAKE Makefile HAVE -kt($num)!!!!

This will help with ketchup :-)

-- Steve

Index: linux-2.6.14-rc4-kt2/Makefile
===================================================================
--- linux-2.6.14-rc4-kt2.orig/Makefile	2005-10-17 10:14:26.000000000 +0200
+++ linux-2.6.14-rc4-kt2/Makefile	2005-10-17 10:15:12.000000000 +0200
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 14
-EXTRAVERSION =-rc4
+EXTRAVERSION =-rc4-kt2
 NAME=Affluent Albatross

 # *DOCUMENTATION*


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  7:59                   ` Ingo Molnar
  2005-10-17  8:26                     ` Steven Rostedt
@ 2005-10-17  9:29                     ` Roman Zippel
  2005-10-17  9:41                       ` Ingo Molnar
  2005-10-17  9:54                       ` Steven Rostedt
  1 sibling, 2 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-17  9:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> the thing is that Thomas has advanced the whole issue of timeouts and 
> timekeeping by leaps and bounds and he has written thousands of lines of 
> new and excellent code for a kernel subsystem that has seen little 
> activity for many years, before John got involved. One of Thomas' 
> accomplishments is a timer/time design that allows the enabling of HRT 
> timers via an _18 lines_ architecture patch. (!)

Did I say these patches were bad in general? All I'm asking for is an 
explanation for a few design decisions to understand the patch and its 
behaviour better and evaluate alternative solutions.
Neither of you have shown any real interest in this so far.

> the moment you express yourself via patches we'll know that 1) you 
> understand what we have done so far 2) you have useful ideas of what 
> should be done differently 3) you have the coder capability to implement 
> and test those ideas. Patches wont be ignored, i can assure you. Get the 
> patches rolling!

This "shut up and show code" attitude is sometimes quite funny, but it's 
no real threat to me. I hoped to avoid this and solve this more civilized. 
Of course I'll understand the issues better afterwards, but you could as 
easily just tell me. It will waste my time, I could spend on other 
projects and it will put Andrew in the unfortunate position to decide, 
which patch to accept.
Is this really what you want?

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  9:29                     ` Roman Zippel
@ 2005-10-17  9:41                       ` Ingo Molnar
  2005-10-17  9:56                         ` Andrew Morton
  2005-10-17 16:33                         ` Roman Zippel
  2005-10-17  9:54                       ` Steven Rostedt
  1 sibling, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17  9:41 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > the moment you express yourself via patches we'll know that 1) you 
> > understand what we have done so far 2) you have useful ideas of what 
> > should be done differently 3) you have the coder capability to implement 
> > and test those ideas. Patches wont be ignored, i can assure you. Get the 
> > patches rolling!
> 
> This "shut up and show code" attitude is sometimes quite funny, but 
> it's no real threat to me. I hoped to avoid this and solve this more 
> civilized. Of course I'll understand the issues better afterwards, but 
> you could as easily just tell me. [...]

if a dozen mails werent enough then one more probably wont make a 
difference, especially with your last mail calling Thomas's behavior 
"childish" - when all he did was to try to explain his reasons to you as 
patiently as possible! Thomas is not obliged to teach you or bear with 
you - it is his own free choice. (But if you want to discuss this 
personal angle any further please take the public lists (and other 
people) off the Cc: list, it's getting very off-topic.)

Thomas's stuff is now fully integrated into the -rt tree and it works 
excellently. I have measured a 12 usecs worst-case HR timer-delivery 
latency (using cyclictest). _That_ is the thing i care about.

> [...] It will waste my time, I could spend on other projects and it 
> will put Andrew in the unfortunate position to decide, which patch to 
> accept. [...]

yes, please, put Andrew (and me too) into that unfortunate position!  
Please, pretty please, get on with the patches!

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  9:41                       ` Ingo Molnar
@ 2005-10-17  9:56                         ` Andrew Morton
  2005-10-17 11:00                           ` Ingo Molnar
  2005-10-17 16:25                           ` Roman Zippel
  2005-10-17 16:33                         ` Roman Zippel
  1 sibling, 2 replies; 67+ messages in thread
From: Andrew Morton @ 2005-10-17  9:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: zippel, tglx, george, linux-kernel, johnstul, paulmck, hch, oleg,
	tim.bird

Ingo Molnar <mingo@elte.hu> wrote:
>
> > [...] It will waste my time, I could spend on other projects and it 
>  > will put Andrew in the unfortunate position to decide, which patch to 
>  > accept. [...]
> 
>  yes, please, put Andrew (and me too) into that unfortunate position!  
>  Please, pretty please, get on with the patches!

I'm with Roman on this one - the old "show me the code" trick which people
use to quash other people's objections is rather poor form - we should simply
address the objections as raised.

That being said, I'll confess that I've largely ignored this discussion in
the hope that things would get sorted out.  Seems that this won't be
happening and as Roman's opinions carry weight I do intend to solicit a
(brief!) summary of his objections from him when the patch comes round
again.  Sorry.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  9:56                         ` Andrew Morton
@ 2005-10-17 11:00                           ` Ingo Molnar
  2005-10-17 16:25                           ` Roman Zippel
  1 sibling, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: zippel, tglx, george, linux-kernel, johnstul, paulmck, hch, oleg,
	tim.bird


* Andrew Morton <akpm@osdl.org> wrote:

> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > > [...] It will waste my time, I could spend on other projects and it 
> >  > will put Andrew in the unfortunate position to decide, which patch to 
> >  > accept. [...]
> > 
> >  yes, please, put Andrew (and me too) into that unfortunate position!  
> >  Please, pretty please, get on with the patches!
> 
> I'm with Roman on this one - the old "show me the code" trick which 
> people use to quash other people's objections is rather poor form - we 
> should simply address the objections as raised.
> 
> That being said, I'll confess that I've largely ignored this 
> discussion in the hope that things would get sorted out.  Seems that 
> this won't be happening and as Roman's opinions carry weight I do 
> intend to solicit a (brief!) summary of his objections from him when 
> the patch comes round again.  Sorry.

Fine with me. A brief summary of technical objections (without any 
personal attacks) is all we wanted to have to begin with. "Show me the 
code" was my last-ditch attempt to move this seemingly unmovable 
discussion from a communication channel where the chemistry doesnt seem 
to work out to a more objective format.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  9:56                         ` Andrew Morton
  2005-10-17 11:00                           ` Ingo Molnar
@ 2005-10-17 16:25                           ` Roman Zippel
  2005-10-17 16:49                             ` Tim Bird
  2005-10-17 20:55                             ` Thomas Gleixner
  1 sibling, 2 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-17 16:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, tglx, george, linux-kernel, johnstul, paulmck, hch,
	oleg, tim.bird

Hi,

On Mon, 17 Oct 2005, Andrew Morton wrote:

> That being said, I'll confess that I've largely ignored this discussion in
> the hope that things would get sorted out.  Seems that this won't be
> happening and as Roman's opinions carry weight I do intend to solicit a
> (brief!) summary of his objections from him when the patch comes round
> again.  Sorry.

It's rather simple:
- "timer API" vs "timeout API": I got absolutely no acknowlegement that 
this might be a little confusing and in consequence "process timer" may be 
a better name.
- I pointed out various (IMO) unnecessary complexities, which were rather 
quickly brushed off e.g. with a need for further (not closer specified) 
cleanups.
- resolution handling: at what resolution should/does the kernel work and 
what do we report to user space. The spec allows multiple interpretations 
and I have a hard time to get at least one coherent interpretation out of 
Thomas.

Maybe I'm the only one who found Thomas answers a little superficial, but 
as this is a central kernel subsystem I think it deserves a closer look 
and everytime I tried to poke a little deeper I got nothing.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:25                           ` Roman Zippel
@ 2005-10-17 16:49                             ` Tim Bird
  2005-10-17 17:26                               ` Steven Rostedt
  2005-10-17 18:49                               ` Roman Zippel
  2005-10-17 20:55                             ` Thomas Gleixner
  1 sibling, 2 replies; 67+ messages in thread
From: Tim Bird @ 2005-10-17 16:49 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Andrew Morton, Ingo Molnar, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Roman Zippel wrote:
> On Mon, 17 Oct 2005, Andrew Morton wrote:
>>That being said, I'll confess that I've largely ignored this discussion in
>>the hope that things would get sorted out.  Seems that this won't be
>>happening and as Roman's opinions carry weight I do intend to solicit a
>>(brief!) summary of his objections from him when the patch comes round
>>again.  Sorry.
> 
> 
> It's rather simple:
> - "timer API" vs "timeout API": I got absolutely no acknowlegement that 
> this might be a little confusing and in consequence "process timer" may be 
> a better name.

I agree with Thomas on this one.  Maybe "timer" and "timeout" are too 
close, but I think they are the most descriptive names.
  - timeout is something used for a timeout.  Timeouts only actually
  expire infrequently, so they have a host of attributes associated
  with that characteristic.
  - timer is something used to time something.  They almost always
  expire as part of their normal behaviour.  In the ktimer code they
  have a host of attributes related to this characteristic.

Thomas answered the suggestion to use "process timer" as an alternative 
name, but I didn't see a reply after that from Roman (I may have missed it.)

> - I pointed out various (IMO) unnecessary complexities, which were rather 
> quickly brushed off e.g. with a need for further (not closer specified) 
> cleanups.

This is rather vague.  It is rather easy to raise hypothetical
issues.  From what I've seen, Thomas has gone to great lengths to
address specific issues raised.  For example, he actually compiled
code on 4 different platforms to get the REAL size of the assembly
fragments, in order to address your concern about CONJECTURED size
problems.

> - resolution handling: at what resolution should/does the kernel work and 
> what do we report to user space. The spec allows multiple interpretations 
> and I have a hard time to get at least one coherent interpretation out of 
> Thomas.

Huh?  I thought Thomas' last answer was pretty clear.

> 
> Maybe I'm the only one who found Thomas answers a little superficial, but 
> as this is a central kernel subsystem I think it deserves a closer look 
> and everytime I tried to poke a little deeper I got nothing.

No one minds poking deep.  But frankly, I find hypothetical arguments
to be less useful than reality-backed ones.  I would rather not hear
reasoning about a resolution issue - I'd like to numbers, if possible,
about the degradation of performance, if that's the issue.  If
it's confusion about the API, then maybe we just need clear statements
that "X API provides resolution at Y level (from one of: hardware, tick, 
something else).

Regards,
  -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:49                             ` Tim Bird
@ 2005-10-17 17:26                               ` Steven Rostedt
  2005-10-17 18:49                               ` Roman Zippel
  1 sibling, 0 replies; 67+ messages in thread
From: Steven Rostedt @ 2005-10-17 17:26 UTC (permalink / raw)
  To: Tim Bird
  Cc: Roman Zippel, Andrew Morton, Ingo Molnar, tglx, george,
	linux-kernel, johnstul, paulmck, hch, oleg

On Mon, 17 Oct 2005, Tim Bird wrote:

> >
> >
> > It's rather simple:
> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that
> > this might be a little confusing and in consequence "process timer" may be
> > a better name.
>
> I agree with Thomas on this one.  Maybe "timer" and "timeout" are too
> close, but I think they are the most descriptive names.
>   - timeout is something used for a timeout.  Timeouts only actually
>   expire infrequently, so they have a host of attributes associated
>   with that characteristic.
>   - timer is something used to time something.  They almost always
>   expire as part of their normal behaviour.  In the ktimer code they
>   have a host of attributes related to this characteristic.
>
> Thomas answered the suggestion to use "process timer" as an alternative
> name, but I didn't see a reply after that from Roman (I may have missed it.)
>

I can add to this.  After this was brought up, I did a little
non-scientific survey. I walked around and asked various engineers here at
my customer's site, what it meant to them if I had two types of timer
APIs, one for "timers" and one for "timeouts".  All 100% of 8 people that
I asked (not a lot, but still), had no confusion with what they meant.  I
asked them to explain what these names mean to them, and every one said
basically, timeouts are for situations that are for things that lasted too
long, and timers and for things where they want to be notified of an
event that takes place at some time.

They all agreed with me that timeouts were for exceptions and not expected
to be triggered, and timers were the other way around and should always be
triggered.

Not only that, I also asked if these timers would make sense if we called
them "kernel" timers and "process" timers.  These names confused them
because they use both timers in their kernel modules.

That convinced me enough to think that Thomas' naming convention is not
confusing.

-- Steve

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:49                             ` Tim Bird
  2005-10-17 17:26                               ` Steven Rostedt
@ 2005-10-17 18:49                               ` Roman Zippel
  2005-10-17 19:19                                 ` Tim Bird
  2005-10-17 20:09                                 ` Ingo Molnar
  1 sibling, 2 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-17 18:49 UTC (permalink / raw)
  To: Tim Bird
  Cc: Andrew Morton, Ingo Molnar, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Hi,

On Mon, 17 Oct 2005, Tim Bird wrote:

> > It's rather simple:
> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that this
> > might be a little confusing and in consequence "process timer" may be a
> > better name.
> 
> I agree with Thomas on this one.  Maybe "timer" and "timeout" are too close,
> but I think they are the most descriptive names.
>  - timeout is something used for a timeout.  Timeouts only actually
>  expire infrequently, so they have a host of attributes associated
>  with that characteristic.
>  - timer is something used to time something.  They almost always
>  expire as part of their normal behaviour.  In the ktimer code they
>  have a host of attributes related to this characteristic.

There is of course a difference, but is it big enough that they deserve 
different APIs? Just look into <linux/timer.h> it doesn't mention timeout 
once, but according to Thomas that's our "timeout API". Look at the 
description of mod_timer() in timer.c: "modify a timer's timeout".
It seems I'm not only one who thinks that both are closely related.

> Thomas answered the suggestion to use "process timer" as an alternative name,
> but I didn't see a reply after that from Roman (I may have missed it.)

It was short and painless:

} > > Calling them "process timer" and "kernel timer" would include their main 
} > > usage, although that also means ptimer were the more correct abbreviation.
} > 
} > As said before I think the disctinction between timers and timeouts
} > makes perfectly sense and ktimers are _not_ restricted to process
} > timers. 
} 
} "main usage" != "restricted to"

IOW I didn't say that "process timer" are restricted to processes, but 
it's their intended main usage. "kernel timer" are OTOH the first choice 
for any internal kernel time issues (which are not just timeouts).

> > - I pointed out various (IMO) unnecessary complexities, which were rather
> > quickly brushed off e.g. with a need for further (not closer specified)
> > cleanups.
> 
> This is rather vague.  It is rather easy to raise hypothetical
> issues.  From what I've seen, Thomas has gone to great lengths to
> address specific issues raised.  For example, he actually compiled
> code on 4 different platforms to get the REAL size of the assembly
> fragments, in order to address your concern about CONJECTURED size
> problems.

This was the _only_ issue where he got into any detail, but I also 
mentioned later that this one of the minor issues.
Above was about the size of the ktimer structure and interval timer. 

> > - resolution handling: at what resolution should/does the kernel work and
> > what do we report to user space. The spec allows multiple interpretations
> > and I have a hard time to get at least one coherent interpretation out of
> > Thomas.
> 
> Huh?  I thought Thomas' last answer was pretty clear.

Then I must have missed something. Earlier he just quotes something from 
SUS without any explanation. His last answer was just about user 
expectations without any connection to the different resolutions at the 
kernel side I described in the mail before.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 18:49                               ` Roman Zippel
@ 2005-10-17 19:19                                 ` Tim Bird
  2005-10-17 19:48                                   ` Roman Zippel
  2005-10-17 20:09                                 ` Ingo Molnar
  1 sibling, 1 reply; 67+ messages in thread
From: Tim Bird @ 2005-10-17 19:19 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Bird, Tim, Andrew Morton, Ingo Molnar, tglx, george, linux-kernel,
	johnstul, paulmck, hch, oleg

Roman Zippel wrote:
> } > > Calling them "process timer" and "kernel timer" would include
> their main 
> } > > usage, although that also means ptimer were the more correct
> abbreviation.
> } > 
> } > As said before I think the disctinction between timers and timeouts
> } > makes perfectly sense and ktimers are _not_ restricted to process
> } > timers. 
> } 
> } "main usage" != "restricted to"
> 
> IOW I didn't say that "process timer" are restricted to processes, but 
> it's their intended main usage. "kernel timer" are OTOH the first choice
> 
> for any internal kernel time issues (which are not just timeouts).

Maybe for a more experienced kernel person such as
yourself, this distinction make sense.  But
"process timer" and "kernel timer" don't carry much
semantic value for me.  They seem to convey an
arbitrary expectation of usage patterns.  Maybe
they match the current usage patterns in the kernel,
but I'd prefer naming based on functionality or
behaviour of the API.

> There is of course a difference, but is it big enough that they deserve 
> different APIs?

IMHO yes.  I think having separate APIs will eventually be
beneficial to allow better handling of resolution
manipulation in the future.

For example, timeouts are likely to need less resolution,
and it may be valuable to adjust the resolution of timeouts
to support coalescing timeouts for better tickless operation.
(Driving towards better power management performance for
embedded devices.)

> Just look into <linux/timer.h> it doesn't mention timeout 
> once, but according to Thomas that's our "timeout API". Look at the 
> description of mod_timer() in timer.c: "modify a timer's timeout".
> It seems I'm not only one who thinks that both are closely related.

I'm not sure if you are arguing for renaming the
old API.  I would be in favor of this (from an abstract
perspective, to clarify the usage in the kernel), but
it might be too big a change right now.

Regards,
 -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 19:19                                 ` Tim Bird
@ 2005-10-17 19:48                                   ` Roman Zippel
  2005-10-17 20:13                                     ` Ingo Molnar
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-17 19:48 UTC (permalink / raw)
  To: Tim Bird
  Cc: Andrew Morton, Ingo Molnar, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Hi,

On Mon, 17 Oct 2005, Tim Bird wrote:

> Maybe for a more experienced kernel person such as
> yourself, this distinction make sense.  But
> "process timer" and "kernel timer" don't carry much
> semantic value for me.  They seem to convey an
> arbitrary expectation of usage patterns.  Maybe
> they match the current usage patterns in the kernel,
> but I'd prefer naming based on functionality or
> behaviour of the API.

Let's say you want to implement a watchdog timer for a driver, which runs 
about every second to do something. Now if you have the choice between 
"timer API" vs. "timeout API" and "kernel timer" vs. "process timer", what 
would you choose based on the name?

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 19:48                                   ` Roman Zippel
@ 2005-10-17 20:13                                     ` Ingo Molnar
  2005-10-17 20:31                                       ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17 20:13 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > Maybe for a more experienced kernel person such as
> > yourself, this distinction make sense.  But
> > "process timer" and "kernel timer" don't carry much
> > semantic value for me.  They seem to convey an
> > arbitrary expectation of usage patterns.  Maybe
> > they match the current usage patterns in the kernel,
> > but I'd prefer naming based on functionality or
> > behaviour of the API.
> 
> Let's say you want to implement a watchdog timer for a driver, which 
> runs about every second to do something. Now if you have the choice 
> between "timer API" vs. "timeout API" and "kernel timer" vs. "process 
> timer", what would you choose based on the name?

why you insist on ktimers being 'process timers'? They are totally 
separate entities, not limited to any process notion. One of their first 
practical use happens to be POSIX process timers (both itimers and 
ptimers) via them, but no way are ktimers only 'process timers'. They 
are very generic timers, usable for any kernel purpose.

so to answer your question: it is totally possible for a watchdog 
mechanism to use ktimers. In fact it would be desirable from a 
robustness POV too: e.g. we dont want a watchdog from being 
overload-able via too many timeouts in the timer wheel ...

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 20:13                                     ` Ingo Molnar
@ 2005-10-17 20:31                                       ` Roman Zippel
  2005-10-18  8:46                                         ` Ingo Molnar
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-17 20:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> why you insist on ktimers being 'process timers'?

Because they are optimized for process usage. OTOH kernel usage is more 
than just "timeouts".

> so to answer your question: it is totally possible for a watchdog 
> mechanism to use ktimers. In fact it would be desirable from a 
> robustness POV too: 

"possible" and "desirable" is still different from "preferable", as they 
involve a higher cost.

> e.g. we dont want a watchdog from being 
> overload-able via too many timeouts in the timer wheel ...

Please explain.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 20:31                                       ` Roman Zippel
@ 2005-10-18  8:46                                         ` Ingo Molnar
  2005-10-18 23:52                                           ` Tim Bird
  2005-10-19  1:58                                           ` Roman Zippel
  0 siblings, 2 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-18  8:46 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

* Roman Zippel <zippel@linux-m68k.org> wrote:

> On Mon, 17 Oct 2005, Ingo Molnar wrote:
> 
> > why you insist on ktimers being 'process timers'?
> 
> Because they are optimized for process usage. OTOH kernel usage is 
> more than just "timeouts".

you have cut out the rest of what i write in the paragraph, which IMO 
answers your question:

> > They are totally separate entities, not limited to any process 
> > notion. One of their first practical use happens to be POSIX process 
> > timers (both itimers and ptimers) via them, but no way are ktimers 
> > only 'process timers'. They are very generic timers, usable for any
> > kernel purpose.

so i can only repeat that ktimers is a generic timer subsystem, with a 
focus on _actually delivering a timer event_.

and no, ktimers are not "optimized for process usage" (or tied to 
whatever other process notion, as i said before), they are optimized 
for:

 - the delivery of time related events

as contrasted to the timeout-API (a'ka "timer wheel") code in 
kernel/timers.c that is optimized towards:

 - the fast adding/removal of timers

without too much focus on robust and deterministic delivery of events.

these two concepts are conflicting, and i claim that a (sane) data 
structure that maximally fulfills both sets of requirements does not 
exist, mathematically. (to repeat, the requirements are: 'fast 
add/remove' and 'fast+deterministic expiry')

at this point i'd really suggest for readers to lean back and think 
about the mathematical foundations of timer data structures for a bit, 
with a focus on the tradeoffs that the timer wheel data structure has, 
vs. the tradeoffs of the rbtree data structure that ktimers has.

My claim is that if you _know_ that a timer will expire most likely, you 
want it to order at insertion time - i.e. you want to have a tree 
structure. If you _know_ that a timer will most likely _not_ expire, 
then you can avoid the tree overhead by 'delaying' the decision of 
sorting timers, to the point in the future where we really are forced to 
do so.

The result of this mathematical paradox is that we end up with two data 
structures: one is the timer wheel (kernel/timers.c) for 
timeout/exception related use; the other one is ktimers 
(kernel/ktimers.c), for expiry oriented use.

> > so to answer your question: it is totally possible for a watchdog 
> > mechanism to use ktimers. In fact it would be desirable from a 
> > robustness POV too: 
> 
> "possible" and "desirable" is still different from "preferable", as 
> they involve a higher cost.

[ in my answer above you are free to substitute "preferable" with
  "desirable" - i do mean it as it reads in plain English. ]

> > e.g. we dont want a watchdog from being 
> > overload-able via too many timeouts in the timer wheel ...
> 
> Please explain.

e.g. on busy networked servers (i.e. ones that do have a need for 
watchdogs) the timer wheel often includes large numbers of timeouts, 
99.9% of which never expire. If they do expire en masse for whatever 
reason, then we can get into overload mode: a million timers might have 
to expire before we get to process the watchdog event and act upon it.  
This can delay the watchdog event significantly, which delay might (or 
might not) matter to the watchdog application.

in short: the timer wheel was not designed with determinism in mind (nor 
should 'simple timeouts' care about determinism). Watchdogs are 
preferably (and desirably) implemented via the most deterministic timer 
mechanism that the kernel offers: ktimers in this particular case.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-18  8:46                                         ` Ingo Molnar
@ 2005-10-18 23:52                                           ` Tim Bird
  2005-10-19  0:03                                             ` George Anzinger
  2005-10-19  1:58                                           ` Roman Zippel
  1 sibling, 1 reply; 67+ messages in thread
From: Tim Bird @ 2005-10-18 23:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Roman Zippel, Bird, Tim, Andrew Morton, tglx, george,
	linux-kernel, johnstul, paulmck, hch, oleg

Ingo Molnar wrote:
> My claim is that if you _know_ that a timer will expire most likely, you 
> want it to order at insertion time - i.e. you want to have a tree 
> structure. If you _know_ that a timer will most likely _not_ expire, 
> then you can avoid the tree overhead by 'delaying' the decision of 
> sorting timers, to the point in the future where we really are forced to
> do so.
> 
> The result of this mathematical paradox is that we end up with two data 
> structures: one is the timer wheel (kernel/timers.c) for 
> timeout/exception related use; the other one is ktimers 
> (kernel/ktimers.c), for expiry oriented use.

I'd like to make an observation on another
difference between the wheel and the rbtree.  Note that
the wheel implementation inherently coalesces timeouts
that are near each other, due to it's relatively
low resolution (at tick granularity - which is
still pretty low resolution on embedded hardware -
usually 10 milliseconds.)

One concern I have with the rbtree is that this
automatic coalescing is lost, and there may be
unanticipated overhead in the move to support
high resolution timers.

Whether some form of coalescing should be
preserved for timers, even when the system
supports higher resolution, will be a
function of the number of timers and their
intended use.  I don't see any support for that
in the current patch, but maybe I'm missing
something.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-18 23:52                                           ` Tim Bird
@ 2005-10-19  0:03                                             ` George Anzinger
  0 siblings, 0 replies; 67+ messages in thread
From: George Anzinger @ 2005-10-19  0:03 UTC (permalink / raw)
  To: Tim Bird
  Cc: Ingo Molnar, Roman Zippel, Andrew Morton, tglx, linux-kernel,
	johnstul, paulmck, hch, oleg

Tim Bird wrote:
> Ingo Molnar wrote:
> 
>>My claim is that if you _know_ that a timer will expire most likely, you 
>>want it to order at insertion time - i.e. you want to have a tree 
>>structure. If you _know_ that a timer will most likely _not_ expire, 
>>then you can avoid the tree overhead by 'delaying' the decision of 
>>sorting timers, to the point in the future where we really are forced to
>>do so.
>>
>>The result of this mathematical paradox is that we end up with two data 
>>structures: one is the timer wheel (kernel/timers.c) for 
>>timeout/exception related use; the other one is ktimers 
>>(kernel/ktimers.c), for expiry oriented use.
> 
> 
> I'd like to make an observation on another
> difference between the wheel and the rbtree.  Note that
> the wheel implementation inherently coalesces timeouts
> that are near each other, due to it's relatively
> low resolution (at tick granularity - which is
> still pretty low resolution on embedded hardware -
> usually 10 milliseconds.)
> 
> One concern I have with the rbtree is that this
> automatic coalescing is lost, and there may be
> unanticipated overhead in the move to support
> high resolution timers.

I think the coalescing is really done by the resolution rounding.  There will always be the list 
removal overhead, but short of a duplex tree (i.e. one entry per time with dup times linked from the 
first (Ug)) you will always have that.  What you want to coalesce is the interrupt overhead, not the 
list overhead, the former being MUCH larger.  The difference here is that we don't see the 
resolution reflected in the tree structure, but that, I think, is good.
> 
> Whether some form of coalescing should be
> preserved for timers, even when the system
> supports higher resolution, will be a
> function of the number of timers and their
> intended use.  I don't see any support for that
> in the current patch, but maybe I'm missing
> something.
> 
> =============================
~
-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-18  8:46                                         ` Ingo Molnar
  2005-10-18 23:52                                           ` Tim Bird
@ 2005-10-19  1:58                                           ` Roman Zippel
  2005-10-19  6:46                                             ` Ingo Molnar
                                                               ` (3 more replies)
  1 sibling, 4 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-19  1:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Hi,

On Tue, 18 Oct 2005, Ingo Molnar wrote:

> > Because they are optimized for process usage. OTOH kernel usage is 
> > more than just "timeouts".
> 
> you have cut out the rest of what i write in the paragraph, which IMO 
> answers your question:
> 
> > > They are totally separate entities, not limited to any process 
> > > notion. One of their first practical use happens to be POSIX process 
> > > timers (both itimers and ptimers) via them, but no way are ktimers 
> > > only 'process timers'. They are very generic timers, usable for any
> > > kernel purpose.
> 
> so i can only repeat that ktimers is a generic timer subsystem, with a 
> focus on _actually delivering a timer event_.

It doesn't answer it at all. The new timer system is definitively not 
"usable for any kernel purpose", it has certain properties, which makes it 
only applicable under certain conditions.

> and no, ktimers are not "optimized for process usage" (or tied to 
> whatever other process notion, as i said before), they are optimized 
> for:
> 
>  - the delivery of time related events
> 
> as contrasted to the timeout-API (a'ka "timer wheel") code in 
> kernel/timers.c that is optimized towards:
> 
>  - the fast adding/removal of timers
> 
> without too much focus on robust and deterministic delivery of events.

You forgot the main property of high resolution, which implies a higher 
maintainance cost. 
Whether the timer event is delivered or not is completely unimportant, as 
at some point the event has to be removed anyway, so that optimizing a 
timer for (non)delivery is complete nonsense.

> these two concepts are conflicting, and i claim that a (sane) data 
> structure that maximally fulfills both sets of requirements does not 
> exist, mathematically. (to repeat, the requirements are: 'fast 
> add/remove' and 'fast+deterministic expiry')

to repeat: low resolution/overhead vs high resolution.
Both are hopefully deterministic (only at different resolutions) or we 
have serious bug at hand.

> > > e.g. we dont want a watchdog from being 
> > > overload-able via too many timeouts in the timer wheel ...
> > 
> > Please explain.
> 
> e.g. on busy networked servers (i.e. ones that do have a need for 
> watchdogs) the timer wheel often includes large numbers of timeouts, 
> 99.9% of which never expire. If they do expire en masse for whatever 
> reason, then we can get into overload mode: a million timers might have 
> to expire before we get to process the watchdog event and act upon it.  
> This can delay the watchdog event significantly, which delay might (or 
> might not) matter to the watchdog application.

I already mentioned earlier that it's possible to reduce the timer load by 
using a watchdog timer to filter most of these events, so that you get 
into the interesting situation that most kernel timer actually do expire 
and suddenly you easily can have more "timers" than "timeouts".

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-19  1:58                                           ` Roman Zippel
@ 2005-10-19  6:46                                             ` Ingo Molnar
  2005-10-19 10:49                                             ` kernel/timer.c design (was: Re: ktimers subsystem) Ingo Molnar
                                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-19  6:46 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg


* Roman Zippel <zippel@linux-m68k.org> wrote:

> > and no, ktimers are not "optimized for process usage" (or tied to 
> > whatever other process notion, as i said before), they are optimized 
> > for:
> > 
> >  - the delivery of time related events
> > 
> > as contrasted to the timeout-API (a'ka "timer wheel") code in 
> > kernel/timers.c that is optimized towards:
> > 
> >  - the fast adding/removal of timers
> > 
> > without too much focus on robust and deterministic delivery of events.
> 
> You forgot the main property of high resolution, which implies a 
> higher maintainance cost.

what did i forget? I did not mention "high resolution" anywhere. And 
what precisely do you mean by "higher maintainance cost"?

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* kernel/timer.c design (was: Re: ktimers subsystem)
  2005-10-19  1:58                                           ` Roman Zippel
  2005-10-19  6:46                                             ` Ingo Molnar
@ 2005-10-19 10:49                                             ` Ingo Molnar
  2005-10-19 17:48                                               ` kernel/timer.c design Tim Bird
                                                                 ` (2 more replies)
  2005-10-19 11:40                                             ` [PATCH] ktimers subsystem 2.6.14-rc2-kt5 Ingo Molnar
  2005-10-19 11:58                                             ` Ingo Molnar
  3 siblings, 3 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-19 10:49 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

* Roman Zippel <zippel@linux-m68k.org> wrote:

> Whether the timer event is delivered or not is completely unimportant, 
> as at some point the event has to be removed anyway, so that 
> optimizing a timer for (non)delivery is complete nonsense.

completely wrong! To explain this, let me first give you an introduction 
to the design goals and implementation/optimization details of the 
upstream kernel/timer.c code:

The current design has remained largely unchanged since Finn Arne 
Gangstad implemented timer wheels in 1997.

The code implements 'struct timer_list' objects, which can be 'added'
via add_timer() to 'expire' in N jiffies, and can be 'removed' via
del_timer() before expiry. If timers are not removed before expiry then
they will expire, at which point the kernel has to call
timer->fn(timer->data). Time has a granularity of 1/HZ and timeouts are
32 bits.

[ sidenote: there are other details, like timer modification and other
  API variants, SMP scalability and other issues - in that sense this
  writeup is simplified, but the essence of the algorithms is still the
  same. ]

since timers can be added in arbitrary time order (a timer that will
expire sooner can be added after a timer has been added that will expire
later, etc.), the kernel has to have timers sorted when they expire.
Note: there is no requirement to sort timers _before_ expiry!

the initial Linux timer implementation did not (have to) bother about 
the 'millions of timers' workloads yet, so it went for the simplest 
model: it has put all timers into a doubly-linked list, and sorted 
timers at insertion time, which made addition O(N). It also had an O(N) 
removal function, only expiry was O(1).

[ the name 'struct timer_list' originates from this linked-list model, 
  and this name has survived 15 years. The reason for the O(N) removal 
  overhead of the original implementation was that it maintained a 'next 
  timer will expire in N jiffies' value for every timer on the list, 
  which the kernel could have used to implement dynamic timer ticks. We 
  never ended up using that particular aspect of the implementation, and 
  future timer implementations removed that property altogether. ]

one could implement a add:O(N)/del:O(1)/exp:O(1) algorithm for sorted 
linked lists, the original implementation was suboptimal in doing a O(N) 
del_timer().

one could also implement a add:O(1)/del:O(1)/exp:O(N) algorithm via an 
unsorted linked list. In any case, if there's only a single list then 
either insertion or expiry has to carry the O(N) linear sorting 
overhead.

another canonical 'computer science' way of dealing with timers is to 
put them into a binary tree that sorts by expiry-time: this means that 
at add_timer() time we have to insert the timer into the binary tree 
(O(log(N)) overhead), removal and expiry is O(1).

the fastest theoretical timer algorithm is to have a linear array of 
lists [timer buckets] for every future jiffy (and a running index to 
represent the current jiffy): then adding a timer is a simple add_list() 
for the array entry indexed by the target timeout. Removing a timer is a 
simple list_del(), and expiring the timer is a matter of advancing the 
'current time' index by one and expiring all (if any) timers that are in 
the next slot. Thus adding, removing and expiring a timer has constant 
O(1) overhead, and the worst-case behavior is constant bounded too.

what makes this algorithm impossible in practice is its huge RAM 
footprint: tens of gigabytes of RAM to represent all ~2^32 jiffies.  
(Some OSs still do this, at the price of restricting either timer 
granularity, or the maximum possible timeout)

it can be proven that under our assumptions this 'linear array of time' 
approach is the best fully O(1) algorithm [with constant worst-case 
behavior as well], so whatever other solution we choose to significantly 
reduce the RAM footprint, it wont be fully O(1).

we've seen two practical approaches so far: the 'historical Linux 
implementation' which was add:O(N)/del:O(N)/exp:O(1), and the 'timer 
tree' solution which is add:O(log(N))/del:O(1)/exp:O(1).

but the current Linux kernel uses a third algorithm: the timer wheels.  
This is a variant of the simple 'array of future jiffies' model, but 
instead of representing every future jiffy in a bucket, it categorizes 
future jiffies into a 'logarithmic array of arrays' where the arrays 
represent buckets with larger and larger 'scope/granularity': the 
further a jiffy is in the future, the more jiffies belong to the same 
single bucket.

In practice it's done by categorizing all future jiffies into 5 groups:

1..256, 257..16384, 16385..1048576, 1048577..67108864, 67108865..4294967295

the first category consists of 256 buckets (each bucket representing a 
single jiffy), the second category consists of 64 buckets equally 
divided (each bucket represents 256 subsequent jiffies), the third 
category consists of 64 buckets too (each bucket representing 256*64 == 
16384 jiffies), the fourth category consists of 64 buckets too (each 
bucket representing 256*64*64 == 1048576 jiffies), the fifth category 
consists of 64 buckets too (each bucket representing 67108864 jiffies).

the buckets of each category are put into a per-category fixed-size 
array, called the "timer vector" - named tv1, tv2, tv3, tv4 and tv5.

as you can see, we only used 256+64+64+64+64 == 512 buckets, but we've 
managed to map all 4294967295 future jiffies to these buckets! In other 
words: we've split up the 32 bits of 'timeout' value into 8+6+6+6+6 
bits.

[ you might ask: why dont we use an even number of buckets such as 
  8+8+8+8, which would simplify the code? The reason is mostly RAM 
  footprint optimizations: an 8+8+8+8 splitup gives a total of 
  256+256+256+256 == 1024 buckets, which was considered a bit too high 
  back when this code was designed. In fact, in recent 2.6 kernels, if 
  CONFIG_BASE_SMALL is specified then we use a 6+4+4+4+4 splitup and 
  round down the remaining 10 bits, which gives an embedded-friendly RAM 
  footprint of 128 buckets. The 'splitup' is under constant revision and 
  we might switch to the simpler (and slightly faster) 8+8+8+8 model in 
  the future, for servers. ]

how do we insert timers? In add_timer() we can calculate their "target 
category" in constant overhead (with at most 5 comparisons), and put the 
timer into that bucket. Note: unless it's in the first category, timers 
with different timeout values can end up in the same bucket. E.g. timers 
expiring at jiffy 260 and 265 will be both put into the first bucket of 
category 2. This means that timers in these buckets are 'partially 
sorted': they are only sorted in their highest bits, initially. So 
add_timer() is O(1) overhead.

removal is simple: we remove the timer from the bucket, which is a 
list_del(), so O(1) overhead too.

we knew that there's no free lunch, right? The main complication is how 
we do expiry. The first 256 jiffies are not a problem, because they are 
represented by the first array of buckets, so the expiry code only has 
to check whether there are any timers to be expired in that bucket.  
Expiry overhead is O(1) for these steps. But at jiffy 257 we do 
something special: the expiry code 'cascades' the first bucket of the 
second array 'down into' the first 256 buckets. It does it the hard way: 
walks the list of timers in that bucket (if any), and removes them from 
that list and inserts them into one of the first 256 buckets (depending 
on what the timeout value of that timer is). Then the expiry code goes 
back to bucket 1, and expires the timers there (if any). The expiry code 
keeps a persistent running index for every category, and if that index 
overflows back to 1, it increments the next category's index by one and 
'cascades down' timers from that bucket into the previous category.

in other words: what happens is that we sort timers "piecemail wise", 
first we order by the highest bits of their timeout value, then we sort 
by the lower bits too - in the end they are fully sorted. If all timers 
expire and are never removed then still we have won relative to the 
fully-sorted-list approach: all timers will end up fully sorted, and 
average per-timer expiry overhead is still O(1)! But expiry worst-case 
is not bounded, it is O(N).

One cost is the burstiness of processing: a single step of cascading can 
take many timers to be processed (if they happen to be in that same 
bucket), and no timers may expire while we do that processing. The 
worst-case expiry behavior is O(N). (The average cost is still O(1), 
because we process every timer at most 5 times.) Another cost is that we 
touch (and dirty) the timers again and again during their lifetime, 
bringing them into cache multiple times.

But there's a hidden win as well from this approach: if a timer is 
removed before it expires, we've saved the remaining cascading steps!  
This happens surprisingly often: on a busy networked server, the 
majority of the timers never expire, and are removed before they have to 
be cascaded even once.

in other words: we 'lazy sort' timers, and we push most of the sorting 
overhead as much into the future as possible, in the hope of the problem 
of having to sort them going away, because they get removed before they 
expire. (and even if we wanted, we couldnt sort earlier in this model, 
due to the RAM footprint limits)

with all these details in mind, lets go back to Roman's assertion:

> Whether the timer event is delivered or not is completely unimportant, 
> as at some point the event has to be removed anyway, so that 
> optimizing a timer for (non)delivery is complete nonsense.

it is very much crutial whether a timer event is delivered. Think about 
the 'millions of network timers' case: most of them are removed before 
cascaded even once! By removing early we might not have to propagate and 
sort the timer in any way: it is added to a bucket and soon removed from 
the same bucket.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: kernel/timer.c design
  2005-10-19 10:49                                             ` kernel/timer.c design (was: Re: ktimers subsystem) Ingo Molnar
@ 2005-10-19 17:48                                               ` Tim Bird
  2005-10-19 18:00                                               ` Tim Bird
  2005-10-19 22:12                                               ` kernel/timer.c design (was: Re: ktimers subsystem) Roman Zippel
  2 siblings, 0 replies; 67+ messages in thread
From: Tim Bird @ 2005-10-19 17:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Roman Zippel, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Ingo,

Thanks for the excellent description of the timer wheel
implementation.

Ingo Molnar wrote:
> One cost is the burstiness of processing: a single step of cascading can 
> take many timers to be processed (if they happen to be in that same 
> bucket)...

> But there's a hidden win as well from this approach: if a timer is 
> removed before it expires, we've saved the remaining cascading steps!  
> This happens surprisingly often: on a busy networked server, the 
> majority of the timers never expire, and are removed before they have to 
> be cascaded even once.

Unfortunately, this means that the actual costs of the wheel
implementation vary depending on the relationship between HZ,
the average timeout duration, and the bucket mappings (which,
as you say, can be adjusted for size reasons.)  This is one of
the downsides of the wheel implementation.  It's very difficult
to tell in advance whether a particular timer load
will cascade or not, making the costs (although bounded)
unexpectedly variable.

One solution (even suggested by Linus) for high resolution
timers was to increase HZ and skip timer ticks.  Unfortunately,
this has a dramatic affect on the cost of cascading, and on
the maximum duration available for timers.  (By increasing
HZ, you push more timers to higher tiers in the wheel, which
means you potentially end up cascading them more often,
even when they are removed before expiry.) These types
of unexpected consequences are one good reason for avoiding
use of the wheel for high res timers.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: kernel/timer.c design
  2005-10-19 10:49                                             ` kernel/timer.c design (was: Re: ktimers subsystem) Ingo Molnar
  2005-10-19 17:48                                               ` kernel/timer.c design Tim Bird
@ 2005-10-19 18:00                                               ` Tim Bird
  2005-10-19 19:04                                                 ` Thomas Gleixner
  2005-10-19 22:12                                               ` kernel/timer.c design (was: Re: ktimers subsystem) Roman Zippel
  2 siblings, 1 reply; 67+ messages in thread
From: Tim Bird @ 2005-10-19 18:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Roman Zippel, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Ingo,

Thanks for the excellent description of the timer wheel
implementation.

Ingo Molnar wrote:
> One cost is the burstiness of processing: a single step of cascading can 
> take many timers to be processed (if they happen to be in that same 
> bucket)...

> But there's a hidden win as well from this approach: if a timer is 
> removed before it expires, we've saved the remaining cascading steps!  
> This happens surprisingly often: on a busy networked server, the 
> majority of the timers never expire, and are removed before they have to 
> be cascaded even once.

Unfortunately, this means that the actual costs of the wheel
implementation vary depending on the relationship between HZ,
the average timeout duration, and the bucket mappings (which,
as you say, can be adjusted for size reasons.)  This is one of
the downsides of the wheel implementation.  It's very difficult
to tell in advance whether a particular timer load
will cascade or not, making the costs (although bounded)
unexpectedly variable.

One solution (even suggested by Linus) for high resolution
timers was to increase HZ and skip timer ticks.  Unfortunately,
this has a dramatic affect on the cost of cascading, and on
the maximum duration available for timers.  (By increasing
HZ, you push more timers to higher tiers in the wheel, which
means you potentially end up cascading them more often,
even when they are removed before expiry.) These types
of unexpected consequences are one good reason for avoiding
use of the wheel for high res timers.

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Electronics
=============================

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: kernel/timer.c design
  2005-10-19 18:00                                               ` Tim Bird
@ 2005-10-19 19:04                                                 ` Thomas Gleixner
  0 siblings, 0 replies; 67+ messages in thread
From: Thomas Gleixner @ 2005-10-19 19:04 UTC (permalink / raw)
  To: Tim Bird
  Cc: Ingo Molnar, Roman Zippel, Andrew Morton, george, linux-kernel,
	johnstul, paulmck, hch, oleg

On Wed, 2005-10-19 at 11:00 -0700, Tim Bird wrote:
> > But there's a hidden win as well from this approach: if a timer is 
> > removed before it expires, we've saved the remaining cascading steps!  
> > This happens surprisingly often: on a busy networked server, the 
> > majority of the timers never expire, and are removed before they have to 
> > be cascaded even once.
> 
> Unfortunately, this means that the actual costs of the wheel
> implementation vary depending on the relationship between HZ,
> the average timeout duration, and the bucket mappings (which,
> as you say, can be adjusted for size reasons.)  This is one of
> the downsides of the wheel implementation.  It's very difficult
> to tell in advance whether a particular timer load
> will cascade or not, making the costs (although bounded)
> unexpectedly variable.

Thats exactly the problem we described earlier in the ktimer discussion:

Changing HZ from 100 to 1000 while keeping the primary wheel size
unchanged caused increased cascading load.

HZ     CONFIG_BASE_SMALL=n     CONFIG_BASE_SMALL=y
 100    2560 ms                 640 ms
 250    1024 ms                 256 ms
 1000    256 ms                  64 ms

A lot of timeouts are in the range of 500ms. While the HZ=100 and HZ=250
settings keep them in the primary wheel either until expiry or early
removal, HZ=1000 and CONFIG_BASE_SMALL with HZ > 100 make cascading more
likely when the system load goes up.

Thats hard to balance for sure.

	tglx



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: kernel/timer.c design (was: Re: ktimers subsystem)
  2005-10-19 10:49                                             ` kernel/timer.c design (was: Re: ktimers subsystem) Ingo Molnar
  2005-10-19 17:48                                               ` kernel/timer.c design Tim Bird
  2005-10-19 18:00                                               ` Tim Bird
@ 2005-10-19 22:12                                               ` Roman Zippel
  2 siblings, 0 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-19 22:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Hi,

On Wed, 19 Oct 2005, Ingo Molnar wrote:

> > Whether the timer event is delivered or not is completely unimportant, 
> > as at some point the event has to be removed anyway, so that 
> > optimizing a timer for (non)delivery is complete nonsense.
> 
> completely wrong! To explain this, let me first give you an introduction 
> to the design goals and implementation/optimization details of the 
> upstream kernel/timer.c code:

I indeed made a mistake, thanks for pointing it out so elaborately.

I'd like to mention something else here. It's rather bad style to start 
with "completely wrong!" and then continue to gloat with "let me give you 
an introduction", unless you intentionally want to insult me. Usually I 
would just ignore this, as it can happen to anyone, but I can find this 
style too often in your mails lately with the most obvious example of your 
"shut up or show code" comment. You're more busy trying to prove me wrong 
than adressing the actual issue. It never was my intention to discuss the 
kernel timer design (the one in timer.c you describe here), the original 
issue was and still is that "timer API" is a too generic term and you 
actually proved my point by using the terms timer and their timeout values 
very consistently in your description.

It's possible I read this wrong, in that case I apologize already in 
advance, but please rethink the attitude you're showing, otherwise I'll 
reduce our conversion to a minimum. You're certainly have the more 
detailed knowledge in this area, but you don't have to show it off like 
this.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-19  1:58                                           ` Roman Zippel
  2005-10-19  6:46                                             ` Ingo Molnar
  2005-10-19 10:49                                             ` kernel/timer.c design (was: Re: ktimers subsystem) Ingo Molnar
@ 2005-10-19 11:40                                             ` Ingo Molnar
  2005-10-19 11:58                                             ` Ingo Molnar
  3 siblings, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-19 11:40 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg


* Roman Zippel <zippel@linux-m68k.org> wrote:

> > so i can only repeat that ktimers is a generic timer subsystem, with a 
> > focus on _actually delivering a timer event_.
> 
> It doesn't answer it at all. The new timer system is definitively not 
> "usable for any kernel purpose", it has certain properties, which 
> makes it only applicable under certain conditions.

what "certain properties" and under what "certain conditions"? Please 
provide specifics to prove your point. I repeat for the third time: 
ktimers is a generic timer subsystem, with a focus on timer event 
delivery.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-19  1:58                                           ` Roman Zippel
                                                               ` (2 preceding siblings ...)
  2005-10-19 11:40                                             ` [PATCH] ktimers subsystem 2.6.14-rc2-kt5 Ingo Molnar
@ 2005-10-19 11:58                                             ` Ingo Molnar
  2005-10-19 22:24                                               ` Roman Zippel
  3 siblings, 1 reply; 67+ messages in thread
From: Ingo Molnar @ 2005-10-19 11:58 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > > > e.g. we dont want a watchdog from being 
> > > > overload-able via too many timeouts in the timer wheel ...
> > > 
> > > Please explain.
> > 
> > e.g. on busy networked servers (i.e. ones that do have a need for 
> > watchdogs) the timer wheel often includes large numbers of timeouts, 
> > 99.9% of which never expire. If they do expire en masse for whatever 
> > reason, then we can get into overload mode: a million timers might have 
> > to expire before we get to process the watchdog event and act upon it.  
> > This can delay the watchdog event significantly, which delay might (or 
> > might not) matter to the watchdog application.
> 
> I already mentioned earlier that it's possible to reduce the timer 
> load by using a watchdog timer to filter most of these events, so that 
> you get into the interesting situation that most kernel timer actually 
> do expire and suddenly you easily can have more "timers" than 
> "timeouts".

this sentence does not parse at all, for me. Here's the effort i did 
trying to decypher it:

Firstly, you mention 'watchdog' without clarifying whether it's the 
examplary watchdog we were talking about above, or whether it's some 
other, new mechanism. The former makes no sense (what does the watchdog 
timer in a random driver have to do with the millions of network timers 
i was talking about, and how could it be used to filter anything?), the 
later you dont explain.

Secondly, the above sentence is the first time in the ktimer discussion 
that you ever mentioned the word 'filter', and you never mentioned the 
word 'watchdog' outside of the example we were discussing, so i'm 
curious about the source of the above "I already mentioned earlier" 
statement. When earlier? Which email? Frankly, the whole paragraph reads 
as if from another planet, i see the words but the content seems totally 
out of context and makes no sense to me.

So i cannot even agree or disagree with anything you said in that 
sentence, because the sentence simply does not parse. Please enlighten 
me!

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-19 11:58                                             ` Ingo Molnar
@ 2005-10-19 22:24                                               ` Roman Zippel
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-19 22:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg

Hi,

On Wed, 19 Oct 2005, Ingo Molnar wrote:

> Secondly, the above sentence is the first time in the ktimer discussion 
> that you ever mentioned the word 'filter', and you never mentioned the 
> word 'watchdog' outside of the example we were discussing, so i'm 
> curious about the source of the above "I already mentioned earlier" 
> statement. When earlier? Which email?

http://marc.theaimsgroup.com/?l=linux-kernel&m=112752984710746&w=2

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 18:49                               ` Roman Zippel
  2005-10-17 19:19                                 ` Tim Bird
@ 2005-10-17 20:09                                 ` Ingo Molnar
  1 sibling, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17 20:09 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Tim Bird, Andrew Morton, tglx, george, linux-kernel, johnstul,
	paulmck, hch, oleg


* Roman Zippel <zippel@linux-m68k.org> wrote:

> > > It's rather simple:
> > > - "timer API" vs "timeout API": I got absolutely no acknowlegement that this
> > > might be a little confusing and in consequence "process timer" may be a
> > > better name.
> > 
> > I agree with Thomas on this one.  Maybe "timer" and "timeout" are too close,
> > but I think they are the most descriptive names.
> >  - timeout is something used for a timeout.  Timeouts only actually
> >  expire infrequently, so they have a host of attributes associated
> >  with that characteristic.
> >  - timer is something used to time something.  They almost always
> >  expire as part of their normal behaviour.  In the ktimer code they
> >  have a host of attributes related to this characteristic.
> 
> There is of course a difference, but is it big enough that they 
> deserve different APIs? Just look into <linux/timer.h> it doesn't 
> mention timeout once, but according to Thomas that's our "timeout 
> API". Look at the description of mod_timer() in timer.c: "modify a 
> timer's timeout". It seems I'm not only one who thinks that both are 
> closely related.

this is one more area where there's no good substitute from 'walking the 
walk', i.e. getting yourself dirty with actual code. I have been 
involved with the following variants which were part of the -rt tree:

- we implemented both timeouts and timers with the same
  timeout-optimized framework [i.e. with the 'wheel'] - it sucked.

- timers and timeouts with a timer-optimized framework [i.e. with a
  binary tree] sucks too, due to the tree overhead.

- we in fact tried another variant too: a hybrid method where timers and
  timeouts lived in the timer wheel and some time before (hr) timers
  were about to time out they were put into a separate hr-list. This
  hybrid solution sucked too.

so then we tried a separate API and subsystem for both of them, and 
voila, many of the uglinesses went away, and things became robust.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:25                           ` Roman Zippel
  2005-10-17 16:49                             ` Tim Bird
@ 2005-10-17 20:55                             ` Thomas Gleixner
  2005-10-18  0:07                               ` Roman Zippel
  1 sibling, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2005-10-17 20:55 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Andrew Morton, Ingo Molnar, george, linux-kernel, johnstul,
	paulmck, hch, oleg, tim.bird

On Mon, 2005-10-17 at 18:25 +0200, Roman Zippel wrote:
> It's rather simple:
> - "timer API" vs "timeout API": I got absolutely no acknowlegement that 
> this might be a little confusing and in consequence "process timer" may be 
> a better name.

Not only me, also a lot of other people do _not_ find it confusing and I
explained why it is a clear technical distcinction. I also explained why
I think that process_timers is too restrictive IMO.

I accept that you find it confusing, but I dont understand neither what
kind of acknowledgement you want nor how you deduce my obligation for
acknowledging whatever.

> - I pointed out various (IMO) unnecessary complexities, which were rather 
> quickly brushed off e.g. with a need for further (not closer specified) 
> cleanups.

The so called complexities are a not various. You complained about
exactly 5 members of the ktimer structure. 

- list, expired, status, interval, overrun

which are superflous in your opinion.

Again an explanation for each :

list: 
allows fast access to the time sorted list without walking the rbtree
and is a preliminary for the extension to high resolution timers.

-----------

expired:
The field was added for simplification of some delta calculations in the
return path. e.g. nanosleep in the expired case to avoid the extra call
to get the current time. Also quite useful for debugging.

-----------

status:
A simple field, which stores at the moment 2 states and is necessary for
extensions to high resolution timers too, as we have more states there. 
The suggested usage of the rbnode.parent pointer is wrong IMO as the
overloading of arbitrary pointers for status information is a kind of
pseudo optimization which is reducing in fact maintainability and
clarity for a the win of a 32bit variable.

-----------

interval, overrun:
Interval holds the converted interval value for itimers. The overrun
member is used by the rearm code so the caller can figure out the number
of missed events. 

The cleanup I pointed out for the posix timer interval timers is pretty
obvious. It makes use of interval and overrun and removes two members of
the posix timer structure.

-----------

The size of the ktimer structure is a matter of micro optimizations in
the same way as the macros/inlines are. 

Calling the pure existence of some struct members complexity is an
exaggeration and contradicts your own request for a simple and clear
design. 

The implementation was done clear and simple from the very beginning and
I really dont understand why the preparation for further extensions in
the first place is bad. 

Doing a design with the final goal in mind is much cleaner than doing
micro optimizations in the first place and afterwards working around
them when you apply extensions.

> - resolution handling: at what resolution should/does the kernel work and 
> what do we report to user space. The spec allows multiple interpretations 
> and I have a hard time to get at least one coherent interpretation out of 
> Thomas.

I interpret the spec in the way I do for following reasons:

1. It is _usual practice_ to return the "timer" resolution for
clock_res() and to return clock values with as much resolution as
possible. In no case should the actual clock resolution be less than
what clock_res() returns.
- George Anzinger in this thread. Similar opinions can be found via
Google. I came to the same conclusion and saw no reason to repeat
Georges statement. I thought a simple pointer would be sufficient.

2. The rounding to the resolution value is explicitly required by the
standard.

3. It makes a lot of sense to do what (1.) describes, due to the fact
that we actually want to restrict the timer resolution to avoid
interrupt and reprogramming floods in very short intervals. This in fact
is the default behaviour in a jiffy driven environment. Pretending a
real nsec resolution and doing no rounding at all is violating (2.).
>From an application programmers view it makes sense to return the timer
resolution so he actually can adjust the program behaviour on the
provided resolution and not rely on wild guess assumptions about what
might be there. Applications need to be able to verify whether the
system can handle the required intervals or not.

	tglx

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 20:55                             ` Thomas Gleixner
@ 2005-10-18  0:07                               ` Roman Zippel
  2005-10-18  1:03                                 ` George Anzinger
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-18  0:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Ingo Molnar, george, linux-kernel, johnstul,
	paulmck, hch, oleg, tim.bird

Hi,

On Mon, 17 Oct 2005, Thomas Gleixner wrote:

> On Mon, 2005-10-17 at 18:25 +0200, Roman Zippel wrote:
> > It's rather simple:
> > - "timer API" vs "timeout API": I got absolutely no acknowlegement that 
> > this might be a little confusing and in consequence "process timer" may be 
> > a better name.
> 
> Not only me, also a lot of other people do _not_ find it confusing and I
> explained why it is a clear technical distcinction. I also explained why
> I think that process_timers is too restrictive IMO.

People don't find it confusing, exactly because it gives them the wrong 
idea about it, neither "API" is restricted to just timeouts or timer.
I don't insist on the term "process timer", but I'd really like to find 
something better than ktimer. We already have kernel timer API, which is 
the primary API for kernel usage (for both timer and timeouts).

> list: 
> allows fast access to the time sorted list without walking the rbtree
> and is a preliminary for the extension to high resolution timers.

Only access to first element is needed, which can be cached in the base.
Please explain the second part.

> expired:
> The field was added for simplification of some delta calculations in the
> return path. e.g. nanosleep in the expired case to avoid the extra call
> to get the current time. Also quite useful for debugging.

The return path can also get it from the base.

> status:
> A simple field, which stores at the moment 2 states and is necessary for
> extensions to high resolution timers too, as we have more states there. 
> The suggested usage of the rbnode.parent pointer is wrong IMO as the
> overloading of arbitrary pointers for status information is a kind of
> pseudo optimization which is reducing in fact maintainability and
> clarity for a the win of a 32bit variable.

Testing a pointer is not "arbitrary", we do it all the time in the kernel.

> interval, overrun:
> Interval holds the converted interval value for itimers. The overrun
> member is used by the rearm code so the caller can figure out the number
> of missed events. 
> 
> The cleanup I pointed out for the posix timer interval timers is pretty
> obvious. It makes use of interval and overrun and removes two members of
> the posix timer structure.

Where I think it's possible to separate the timer from the interval 
functionality to get a simpler timer base implementation.

> The size of the ktimer structure is a matter of micro optimizations in
> the same way as the macros/inlines are. 

Not really, these fields have to be initialized and maintained, which 
quickly goes beyond "micro optimizations".

> Calling the pure existence of some struct members complexity is an
> exaggeration and contradicts your own request for a simple and clear
> design. 

That's not all what I had it in mind regarding complexity, I just started 
with the more simpler parts and never got to the more complex part.

> Doing a design with the final goal in mind is much cleaner than doing
> micro optimizations in the first place and afterwards working around
> them when you apply extensions.

This is fine, but then you should explain them, I'm not a mind reader, so 
that I can guess what you're planning.

> > - resolution handling: at what resolution should/does the kernel work and 
> > what do we report to user space. The spec allows multiple interpretations 
> > and I have a hard time to get at least one coherent interpretation out of 
> > Thomas.
> 
> I interpret the spec in the way I do for following reasons:
> 
> 1. It is _usual practice_ to return the "timer" resolution for
> clock_res() and to return clock values with as much resolution as
> possible. In no case should the actual clock resolution be less than
> what clock_res() returns.
> - George Anzinger in this thread. Similar opinions can be found via
> Google. I came to the same conclusion and saw no reason to repeat
> Georges statement. I thought a simple pointer would be sufficient.

In this case you don't interpret the spec, you ignore the spec. (I'll 
leave it open whether that's a good or bad thing.)

> 2. The rounding to the resolution value is explicitly required by the
> standard.

It doesn't explicitly specify which resolution (see the previous mail).
It doesn't explicitly specify how this rounding has to be implemented.

> 3. It makes a lot of sense to do what (1.) describes, due to the fact
> that we actually want to restrict the timer resolution to avoid
> interrupt and reprogramming floods in very short intervals. This in fact
> is the default behaviour in a jiffy driven environment. Pretending a
> real nsec resolution and doing no rounding at all is violating (2.).
> >From an application programmers view it makes sense to return the timer
> resolution so he actually can adjust the program behaviour on the
> provided resolution and not rely on wild guess assumptions about what
> might be there. Applications need to be able to verify whether the
> system can handle the required intervals or not.

A portable application simply cannot make this assumption.

Anyway, it's rather confusing how you ignore the spec, when "it makes a 
lot of sense" and OTOH how you can stick to the spec. I honestly don't 
know how to argue on this basis, where the spec can be arbitrarily 
redefined based on undocumented assumptions.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-18  0:07                               ` Roman Zippel
@ 2005-10-18  1:03                                 ` George Anzinger
  2005-10-19  1:26                                   ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: George Anzinger @ 2005-10-18  1:03 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, linux-kernel,
	johnstul, paulmck, hch, oleg, tim.bird

Roman Zippel wrote:
> Hi,
> 
> On Mon, 17 Oct 2005, Thomas Gleixner wrote:
> 
> 
>>On Mon, 2005-10-17 at 18:25 +0200, Roman Zippel wrote:
>>
~
>>interval, overrun:
>>Interval holds the converted interval value for itimers. The overrun
>>member is used by the rearm code so the caller can figure out the number
>>of missed events. 
>>
>>The cleanup I pointed out for the posix timer interval timers is pretty
>>obvious. It makes use of interval and overrun and removes two members of
>>the posix timer structure.
> 
> 
> Where I think it's possible to separate the timer from the interval 
> functionality to get a simpler timer base implementation.

They are required fields for the POSIX timer.  I think you are saying that they should be there and 
not in the ktime struct, which is part of the POSIX timer struct.  Is that right?

Along this line, I have a bit of a problem with the ktimer code doing the timer repeat stuff.  This 
is NOT used by POSIX timers because we want to wait for the user to pick up the signal before 
starting the next interval.  This is key to avoiding timer storms and I would think that puting the 
repeat stuff in ktimer code opens it to the possibility of other users starting a timer storm via 
this.  I think the itimer code should also use the signal call back to start the next interval, and 
for the same reason.
> 
> 
~
>>>- resolution handling: at what resolution should/does the kernel work and 
>>>what do we report to user space. The spec allows multiple interpretations 
>>>and I have a hard time to get at least one coherent interpretation out of 
>>>Thomas.
>>
>>I interpret the spec in the way I do for following reasons:
>>
>>1. It is _usual practice_ to return the "timer" resolution for
>>clock_res() and to return clock values with as much resolution as
>>possible. In no case should the actual clock resolution be less than
>>what clock_res() returns.
>>- George Anzinger in this thread. Similar opinions can be found via
>>Google. I came to the same conclusion and saw no reason to repeat
>>Georges statement. I thought a simple pointer would be sufficient.
> 
> 
> In this case you don't interpret the spec, you ignore the spec. (I'll 
> leave it open whether that's a good or bad thing.)

Eh?  Granted we don't truncate the time on settime, but how else is it ignored?
> 
> 
>>2. The rounding to the resolution value is explicitly required by the
>>standard.
> 
> 
> It doesn't explicitly specify which resolution (see the previous mail).
> It doesn't explicitly specify how this rounding has to be implemented.

In the timer_settime() call there is only one possible resolution refered to, that of the specified 
clock.  The standard says(http://www.opengroup.org/onlinepubs/009695399/functions/timer_settime.html):

Time values that are between two consecutive non-negative integer multiples of the resolution of the 
specified timer shall be rounded up to the larger multiple of the resolution. Quantization error 
shall not cause the timer to expire earlier than the rounded time value.

This says a) round to the next resolution, and b) don't allow the resulting timer to expire early. 
The implication is that timers are to expire on resolution boundries so we need to round such that 
the expire happens _after_ the rounded time.

Am I missing something here?

The assumption, that I think you made, that we can let the hardware do the rounding runs contrary to 
one of the main reasons for resolution, i.e. to group timers so that we can reduce the system 
overhead.  Just because we have timer hardware with microsecond resolution is not reason enough to 
offer it to the user as handling an interrupt every micro second is way too much overhead, and, in 
most cases, the user doesn't even want to such a fine resolution.
> 
> 
>>3. It makes a lot of sense to do what (1.) describes, due to the fact
>>that we actually want to restrict the timer resolution to avoid
>>interrupt and reprogramming floods in very short intervals. This in fact
>>is the default behaviour in a jiffy driven environment. Pretending a
>>real nsec resolution and doing no rounding at all is violating (2.).
>>>From an application programmers view it makes sense to return the timer
>>resolution so he actually can adjust the program behaviour on the
>>provided resolution and not rely on wild guess assumptions about what
>>might be there. Applications need to be able to verify whether the
>>system can handle the required intervals or not.
> 
> 
> A portable application simply cannot make this assumption.

POSIX clocks and timers are part of the REAL TIME POSIX extension.  Arguing that real time apps need 
to be portable is, I think, rather beside the point.  At the same time, if rounding follows the 
rules, one can set up a timer_settime() timer_gettime() sequence to get the resolution, even with 
the itimer one can do this.  So resolution is available to the user in one way or another.  What he 
does with it is up to him, but at least some RT apps. set up timer to expire early and after expiry, 
busy wait until the "appointed" time.  Knowing the resolution helps to know how to set this up...
~
-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-18  1:03                                 ` George Anzinger
@ 2005-10-19  1:26                                   ` Roman Zippel
  2005-10-19  2:52                                     ` George Anzinger
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-19  1:26 UTC (permalink / raw)
  To: George Anzinger
  Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, linux-kernel,
	johnstul, paulmck, hch, oleg, tim.bird

Hi,

On Mon, 17 Oct 2005, George Anzinger wrote:

> > Where I think it's possible to separate the timer from the interval
> > functionality to get a simpler timer base implementation.
> 
> They are required fields for the POSIX timer.  I think you are saying that
> they should be there and not in the ktime struct, which is part of the POSIX
> timer struct.  Is that right?

Basically, yes. I would take some simpler steps in creating the new timer 
system. Thomas' patch introduces multiple concepts at once, which are hard 
to digest via a simple review. As it looks right now I have to take the 
patch apart myself and split it into simpler patches.

> > > 2. The rounding to the resolution value is explicitly required by the
> > > standard.
> > 
> > 
> > It doesn't explicitly specify which resolution (see the previous mail).
> > It doesn't explicitly specify how this rounding has to be implemented.
> 
> In the timer_settime() call there is only one possible resolution refered to,
> that of the specified clock.  The standard
> says(http://www.opengroup.org/onlinepubs/009695399/functions/timer_settime.html):
> 
> Time values that are between two consecutive non-negative integer multiples of
> the resolution of the specified timer shall be rounded up to the larger
> multiple of the resolution. Quantization error shall not cause the timer to
> expire earlier than the rounded time value.
> 
> This says a) round to the next resolution, and b) don't allow the resulting
> timer to expire early. The implication is that timers are to expire on
> resolution boundries so we need to round such that the expire happens _after_
> the rounded time.
> 
> Am I missing something here?

In short: rounding errors.

Above says IOW if we have a clock with a freqency f and a resolution with
r=10^9/f, we have to round time t up so that it becomes a integer multiple 
i of r, so that once the counter reaches the value i all timer with upto a
time value of i*r are expired.

If we now simply ignore the resolution fraction, we get a rounded value 
which is quickly far away from the real value (with a worst case of r-1 
nsec). This means an explicit rounding is likely only to make things 
worse and any rounding is better done as part of the conversion from/to 
timespec to/from the counter value according to the above rules and even 
this conversion should be avoided as much as possible to minimize rounding 
errors.

> The assumption, that I think you made, that we can let the hardware do the
> rounding runs contrary to one of the main reasons for resolution, i.e. to
> group timers so that we can reduce the system overhead.  Just because we have
> timer hardware with microsecond resolution is not reason enough to offer it to
> the user as handling an interrupt every micro second is way too much overhead,
> and, in most cases, the user doesn't even want to such a fine resolution.

This just means that we have two values to describe a timer, the clock 
resolution describes the precision with which the timer can be programmed 
and the timer precision which describes the maximum frequency of timer 
expiry. I think both values are of interest to user applications, so my 
current preference is to actually export them both properly instead of 
overloading the clock_getres() interface.

The spec allows both resolutions:

"an implementation (is required) to document the resolution supported for 
timers and nanosleep() if they differ from the supported clock resolution"

This means that unfortunately only one can be determined at runtime via 
standard means, so if we are going to create nonportable interfaces, we 
should do it at least properly.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-19  1:26                                   ` Roman Zippel
@ 2005-10-19  2:52                                     ` George Anzinger
  2005-10-21 16:22                                       ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: George Anzinger @ 2005-10-19  2:52 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, linux-kernel,
	johnstul, paulmck, hch, oleg, tim.bird

Roman Zippel wrote:
> Hi,
> 
> On Mon, 17 Oct 2005, George Anzinger wrote:
> 
~
> 
>>>>2. The rounding to the resolution value is explicitly required by the
>>>>standard.
>>>
>>>
>>>It doesn't explicitly specify which resolution (see the previous mail).
>>>It doesn't explicitly specify how this rounding has to be implemented.
>>
>>In the timer_settime() call there is only one possible resolution referred to,
>>that of the specified clock.  The standard
>>says(http://www.opengroup.org/onlinepubs/009695399/functions/timer_settime.html):
>>
>>Time values that are between two consecutive non-negative integer multiples of
>>the resolution of the specified timer shall be rounded up to the larger
>>multiple of the resolution. Quantization error shall not cause the timer to
>>expire earlier than the rounded time value.
>>
>>This says a) round to the next resolution, and b) don't allow the resulting
>>timer to expire early. The implication is that timers are to expire on
>>resolution boundaries so we need to round such that the expire happens _after_
>>the rounded time.
>>
>>Am I missing something here?
> 
> 
> In short: rounding errors.
> 
> Above says IOW if we have a clock with a frequency f and a resolution with
> r=10^9/f, we have to round time t up so that it becomes a integer multiple 
> i of r, so that once the counter reaches the value i all timer with up to a
> time value of i*r are expired.
> 
> If we now simply ignore the resolution fraction, we get a rounded value 
> which is quickly far away from the real value (with a worst case of r-1 
> nsec). This means an explicit rounding is likely only to make things 
> worse and any rounding is better done as part of the conversion from/to 
> timespec to/from the counter value according to the above rules and even 
> this conversion should be avoided as much as possible to minimize rounding 
> errors.

I think the rounding errors you are talking about would require us to define the clock period in 
something finer than nanoseconds.  The usual practice is to work with a resolution specified in 
nanoseconds (which is the same units the user hands us).  We then only worry about the last 
"resolution" or so of the elapsed time, rather than going back to the beginning of time.  The math 
becomes harder when converting to a particular timer with resolution in the nanosecond area, as, for 
example, the TSC.  Here we use what I call "scaled math" to both improve resolution and accuracy and 
to avoid the evil div instruction.  It is rather easy to get accuracy down to a few parts per 
billion.  I really don't think the math, however, is the issue here.

Rather I think you would like to turn the hardware resolution into the resolution we use and send to 
the user.  This, I think, is not quite the right way to go.  Suppose, for example, we have a timer 
that will do micro second resolution.  To provide this to the user implies that he is free to ask 
for timers that expire every micro second.  Today, this is not really a wise thing to do as we would 
soon use all the cpu cycles doing interrupt overhead.  So we define a resolution, say 100 micro 
seconds, and set things up that way.  This means we, at most, need to handle timer interrupts once 
every 100 usecs (still not really wise, put possible with some of todays hardware).

Now, if the timer we use actually has a resolution of 1.33333 usec, do we want to use a multiple of 
this as our resolution?  Not really, folks would just get confused. We can just tell them it is 
100usec and do the math.  The errors introduced by this are, at most, 1.3333 usec, and they are NOT 
cumulative, as long as we do the math for each expiry.  (If we try to compute a LATCH to use to get 
100 usec periods, we will accumulate errors, so why do that?)  A jitter of 1.3333 usec is well under 
the radar, being lost in the interrupt overhead.
> 
> 
>>The assumption, that I think you made, that we can let the hardware do the
>>rounding runs contrary to one of the main reasons for resolution, i.e. to
>>group timers so that we can reduce the system overhead.  Just because we have
>>timer hardware with microsecond resolution is not reason enough to offer it to
>>the user as handling an interrupt every micro second is way too much overhead,
>>and, in most cases, the user doesn't even want to such a fine resolution.
> 
> 
> This just means that we have two values to describe a timer, the clock 
> resolution describes the precision with which the timer can be programmed 
> and the timer precision which describes the maximum frequency of timer 
> expiry. I think both values are of interest to user applications, so my 
> current preference is to actually export them both properly instead of 
> overloading the clock_getres() interface.

But, as I say above, we don't want to export the hardware detail, but an abstraction we build on top 
of it.  Suppose we don't want to provide 100 usec timers except where really needed.  We could 
provide a different abstraction that has, say 10 ms resolution.  We could then set things up so that 
the user gets this all most all the time, say by define CLOCK_REALTIME with this resolution.  We 
then might define CLOCK_REALTIME_HR to have a resolution of 100 usec.  The user who needs it will 
realize that it has higher overhead (else why would we make it a bit harder to get to), and use it 
only when he needs the resolution it provides.

There is no reason that both of these "clocks" can not use the same underlying code and hardware. 
At the same time they do not have to.
> 
> The spec allows both resolutions:
> 
> "an implementation (is required) to document the resolution supported for 
> timers and nanosleep() if they differ from the supported clock resolution"

What we want to do, and what is done by others, is to define different clocks which carry their 
resolution to the timers used on them.  This is a little orthogonal to the standard, but seems to be 
a reasonable extension.
> 
> This means that unfortunately only one can be determined at runtime via 
> standard means, so if we are going to create non portable interfaces, we 
> should do it at least properly.
> 
> bye, Roman

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-19  2:52                                     ` George Anzinger
@ 2005-10-21 16:22                                       ` Roman Zippel
  2005-10-23 18:17                                         ` George Anzinger
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-21 16:22 UTC (permalink / raw)
  To: George Anzinger
  Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, linux-kernel,
	johnstul, paulmck, hch, oleg, tim.bird

Hi,

On Tue, 18 Oct 2005, George Anzinger wrote:

> > Above says IOW if we have a clock with a frequency f and a resolution with
> > r=10^9/f, we have to round time t up so that it becomes a integer multiple i
> > of r, so that once the counter reaches the value i all timer with up to a
> > time value of i*r are expired.

You don't specifically disagree, so I can assume you agree that this a 
valid interpretation of the spec?
(I'm asking because it's important for the design of the timer system.)

> > If we now simply ignore the resolution fraction, we get a rounded value
> > which is quickly far away from the real value (with a worst case of r-1
> > nsec). This means an explicit rounding is likely only to make things worse
> > and any rounding is better done as part of the conversion from/to timespec
> > to/from the counter value according to the above rules and even this
> > conversion should be avoided as much as possible to minimize rounding
> > errors.
> 
> I think the rounding errors you are talking about would require us to define
> the clock period in something finer than nanoseconds.

No, you don't have to, all you have to do is to make sure that 
"Quantization error shall not cause the timer to expire earlier than the 
rounded time value." IOW at the time the timer expires, the expiry time 
must not be greater than clock_gettime().

> Rather I think you would like to turn the hardware resolution into the
> resolution we use and send to the user.  This, I think, is not quite the right
> way to go.  Suppose, for example, we have a timer that will do micro second
> resolution.  To provide this to the user implies that he is free to ask for
> timers that expire every micro second.  Today, this is not really a wise thing
> to do as we would soon use all the cpu cycles doing interrupt overhead.  So we
> define a resolution, say 100 micro seconds, and set things up that way.  This
> means we, at most, need to handle timer interrupts once every 100 usecs (still
> not really wise, put possible with some of todays hardware).
> 
> Now, if the timer we use actually has a resolution of 1.33333 usec, do we want
> to use a multiple of this as our resolution?  Not really, folks would just get
> confused. We can just tell them it is 100usec and do the math.  The errors
> introduced by this are, at most, 1.3333 usec, and they are NOT cumulative, as
> long as we do the math for each expiry.  (If we try to compute a LATCH to use
> to get 100 usec periods, we will accumulate errors, so why do that?)  A jitter
> of 1.3333 usec is well under the radar, being lost in the interrupt overhead.

No, the error is worse, although I specifically talk about the rounding 
done in Thomas' patch, I'm not sure we're really talking about the same 
thing. I didn't mean the error caused by jitter, in this case I'd 
actually agree with you.

He sets the resolution right now to (NSEC_PER_SEC/HZ) and uses this value 
to explicit round the time values. For example a timer is set to the value 
1.1ms and is rounded to 2ms. The timer tick now actually expires at 1.2ms 
and could expire the timer, but it's instead expired at 2.2ms and user 
space sees an error of 1.1ms.
A similiar error even exists with interval timer, e.g. an interval timer 
is set to 0.9ms and rounded to 1ms. If the clock now expires a little 
too early the timer will expire repeatedly one tick too late.

In general due to this rounding and normal clock skew an extra error is 
added with an average value of half the timer resolution.

> But, as I say above, we don't want to export the hardware detail, but an
> abstraction we build on top of it.  Suppose we don't want to provide 100 usec
> timers except where really needed.  We could provide a different abstraction
> that has, say 10 ms resolution.  We could then set things up so that the user
> gets this all most all the time, say by define CLOCK_REALTIME with this
> resolution.  We then might define CLOCK_REALTIME_HR to have a resolution of
> 100 usec.  The user who needs it will realize that it has higher overhead
> (else why would we make it a bit harder to get to), and use it only when he
> needs the resolution it provides.
> 
> There is no reason that both of these "clocks" can not use the same underlying
> code and hardware. At the same time they do not have to.

I don't have a problem with this at all. I think it's fine to leave 
clock_getres(CLOCK_REALTIME) at a save value.

> > The spec allows both resolutions:
> > 
> > "an implementation (is required) to document the resolution supported for
> > timers and nanosleep() if they differ from the supported clock resolution"
> 
> What we want to do, and what is done by others, is to define different clocks
> which carry their resolution to the timers used on them.  This is a little
> orthogonal to the standard, but seems to be a reasonable extension.

Could you please give me an example for "others"?
I don't think I have a problem with this. My point is to better define 
"reasonable extension" or to be more specific what user expectations are 
reasonable. What is needed by applications and where exactly do we draw 
the line, when it comes to extra complexity in the timer design.

I wouldn't a priori exclude the possibility to break some user 
applications, which have unreasonable expectations. We did this in the 
past (e.g. sched_yield()), we simply fixed the applications and moved on, 
but this requires more information about what applications expect about 
high resolution timer.

George, many thanks for trying to understand me and helping me to get a 
better understanding of the issues. I see more misunderstandings than 
disagreements, so I'd be really grateful, if you can help me later to 
translate this into something Thomas and Ingo can understand. :-)

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-21 16:22                                       ` Roman Zippel
@ 2005-10-23 18:17                                         ` George Anzinger
  2005-10-27 20:23                                           ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: George Anzinger @ 2005-10-23 18:17 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, linux-kernel,
	johnstul, paulmck, hch, oleg, tim.bird

Roman Zippel wrote:
> Hi,
> 
> On Tue, 18 Oct 2005, George Anzinger wrote:
> 
> 
>>>Above says IOW if we have a clock with a frequency f and a resolution with
>>>r=10^9/f, we have to round time t up so that it becomes a integer multiple i
>>>of r, so that once the counter reaches the value i all timer with up to a
>>>time value of i*r are expired.
> 
> 
> You don't specifically disagree, so I can assume you agree that this a 
> valid interpretation of the spec?
> (I'm asking because it's important for the design of the timer system.)

I agree with the proviso that we can define such a clock as an abstraction of a clock with a better 
resolution.  I.e. we can provide clocks with lesser resolution than the physical clock has.
> 
> 
>>>If we now simply ignore the resolution fraction, we get a rounded value
>>>which is quickly far away from the real value (with a worst case of r-1
>>>nsec). This means an explicit rounding is likely only to make things worse
>>>and any rounding is better done as part of the conversion from/to timespec
>>>to/from the counter value according to the above rules and even this
>>>conversion should be avoided as much as possible to minimize rounding
>>>errors.
>>
>>I think the rounding errors you are talking about would require us to define
>>the clock period in something finer than nanoseconds.
> 
> 
> No, you don't have to, all you have to do is to make sure that 
> "Quantization error shall not cause the timer to expire earlier than the 
> rounded time value." IOW at the time the timer expires, the expiry time 
> must not be greater than clock_gettime().

That should be "less than", but yes.  The comment I was making is that the math is not that hard to 
get right.
> 
> 
>>Rather I think you would like to turn the hardware resolution into the
>>resolution we use and send to the user.  This, I think, is not quite the right
>>way to go.  Suppose, for example, we have a timer that will do micro second
>>resolution.  To provide this to the user implies that he is free to ask for
>>timers that expire every micro second.  Today, this is not really a wise thing
>>to do as we would soon use all the cpu cycles doing interrupt overhead.  So we
>>define a resolution, say 100 micro seconds, and set things up that way.  This
>>means we, at most, need to handle timer interrupts once every 100 usecs (still
>>not really wise, put possible with some of todays hardware).
>>
>>Now, if the timer we use actually has a resolution of 1.33333 usec, do we want
>>to use a multiple of this as our resolution?  Not really, folks would just get
>>confused. We can just tell them it is 100usec and do the math.  The errors
>>introduced by this are, at most, 1.3333 usec, and they are NOT cumulative, as
>>long as we do the math for each expiry.  (If we try to compute a LATCH to use
>>to get 100 usec periods, we will accumulate errors, so why do that?)  A jitter
>>of 1.3333 usec is well under the radar, being lost in the interrupt overhead.
> 
> 
> No, the error is worse, although I specifically talk about the rounding 
> done in Thomas' patch, I'm not sure we're really talking about the same 
> thing. I didn't mean the error caused by jitter, in this case I'd 
> actually agree with you.
> 
> He sets the resolution right now to (NSEC_PER_SEC/HZ) and uses this value 
> to explicit round the time values. For example a timer is set to the value 
> 1.1ms and is rounded to 2ms. The timer tick now actually expires at 1.2ms 
> and could expire the timer, but it's instead expired at 2.2ms and user 
> space sees an error of 1.1ms.
> A similiar error even exists with interval timer, e.g. an interval timer 
> is set to 0.9ms and rounded to 1ms. If the clock now expires a little 
> too early the timer will expire repeatedly one tick too late.
> 
> In general due to this rounding and normal clock skew an extra error is 
> added with an average value of half the timer resolution.


I admit I have not looked, in detail, at this part of ktimers, however, assuming that the clock 
ticks at HZ then the normal error to be expected is and average of 1/2 the resolution with a max of 
1 resolution.  This is AFTER the rounding to the next resolution, so we can expect the expiry to be 
any where from 0 to 2*resolution-1.  (up to resolution-1 from rounding, and up to one resolution 
from clock skew).  This the way I and every one I have worked with understand the standard.

In your example, consider a request for 0.1ms rounded to 1ms....
> 
> 
>>But, as I say above, we don't want to export the hardware detail, but an
>>abstraction we build on top of it.  Suppose we don't want to provide 100 usec
>>timers except where really needed.  We could provide a different abstraction
>>that has, say 10 ms resolution.  We could then set things up so that the user
>>gets this all most all the time, say by define CLOCK_REALTIME with this
>>resolution.  We then might define CLOCK_REALTIME_HR to have a resolution of
>>100 usec.  The user who needs it will realize that it has higher overhead
>>(else why would we make it a bit harder to get to), and use it only when he
>>needs the resolution it provides.
>>
>>There is no reason that both of these "clocks" can not use the same underlying
>>code and hardware. At the same time they do not have to.
> 
> 
> I don't have a problem with this at all. I think it's fine to leave 
> clock_getres(CLOCK_REALTIME) at a save value.
> 
> 
>>>The spec allows both resolutions:
>>>
>>>"an implementation (is required) to document the resolution supported for
>>>timers and nanosleep() if they differ from the supported clock resolution"
>>
>>What we want to do, and what is done by others, is to define different clocks
>>which carry their resolution to the timers used on them.  This is a little
>>orthogonal to the standard, but seems to be a reasonable extension.
> 
> 
> Could you please give me an example for "others"?

Well, I know that HP in the HPRT system (likely long gone by now) did it this way.  That was back 
prior to 1997.  The system was based on the HP PA risc arch which has a system timer based on a 
cycle counter (rather like the PPC, but different).

> I don't think I have a problem with this. My point is to better define 
> "reasonable extension" or to be more specific what user expectations are 
> reasonable. What is needed by applications and where exactly do we draw 
> the line, when it comes to extra complexity in the timer design.
> 
> I wouldn't a priori exclude the possibility to break some user 
> applications, which have unreasonable expectations. We did this in the 
> past (e.g. sched_yield()), we simply fixed the applications and moved on, 
> but this requires more information about what applications expect about 
> high resolution timer.

We have had good acceptance of the HRT patch in our customer base.  As far as I know we have not 
gotten any feed back on just what resolution they want or use.  We allow them to define it (within 
reason) at configure time.  I have been recommending, for the x86, nothing better than about 100usec 
but this is based on my machine being able to handle an interrupt in about that amount of time.

We don't require alignment of the resolution with the actual hardware resolution as at these levels 
the interrupt jitter smooths over any issues in this area.  A comment here, in some of your math 
examples you seem to be implying that we are going to use a particular resolution from the begining 
of time to compute expiry time.  In fact, we start from "now" as defined by a, possibly corrected, 
system clock.  Once we have the rounded expiry time we use full resolution math to figure how to fit 
that into the timing services.  So, infact, the resolution comes into play only over 1 to 2 
resolutions of the requested time.  In other words, errors do not accumulate since we always mark to 
the corrected clock.
> 
> George, many thanks for trying to understand me and helping me to get a 
> better understanding of the issues. I see more misunderstandings than 
> disagreements, so I'd be really grateful, if you can help me later to 
> translate this into something Thomas and Ingo can understand. :-)
> 
No problem.  Do be advised I will be out most of next week through the end of Oct.
-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-23 18:17                                         ` George Anzinger
@ 2005-10-27 20:23                                           ` Roman Zippel
  2005-10-28  4:52                                             ` Steven Rostedt
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-27 20:23 UTC (permalink / raw)
  To: George Anzinger
  Cc: Thomas Gleixner, Andrew Morton, Ingo Molnar, linux-kernel,
	johnstul, paulmck, hch, oleg, tim.bird

Hi,

On Sun, 23 Oct 2005, George Anzinger wrote:

> > > > Above says IOW if we have a clock with a frequency f and a resolution with
> > > > r=10^9/f, we have to round time t up so that it becomes a integer multiple i
> > > > of r, so that once the counter reaches the value i all timer with up to a
> > > > time value of i*r are expired.
> >
> >
> > You don't specifically disagree, so I can assume you agree that this a valid
> > interpretation of the spec?
> > (I'm asking because it's important for the design of the timer system.)
>
> I agree with the proviso that we can define such a clock as an abstraction of
> a clock with a better resolution.  I.e. we can provide clocks with lesser
> resolution than the physical clock has.

I had a different aspect in mind: at what resolution are we doing the 
calculations?
Let's say we have a clock with a frequency of 300Hz, now we could programm 
the timer like this:

	tmp = time * 300;
	clock = tmp / 10^9 + (tmp % 10^9 != 0);

This rounds the time to the next clock count and as soon as the clock 
reaches this count the timer is expired.
OTOH We could also do this at the timer interrupt:

	tmp = clock * 10^9;
	time = tmp / 300 + (tmp % 300 != 0);

and use time to expire all timer upto this time. In either case the 
behaviour is exactly the same.

The problem is now that we can export the real resolution only as integer 
value. What consequences has this to the kernel timer implementation? 
Something like above must be done anyway, so what's the point in doing an 
extra rounding step?
For example if we set a timer to expire at 999999990ns, so the next 
interrupt is at 1000000000ns, but rounding it to 3333333ns means the 
expiry time changes to 1003333233ns and the timer expires one clock tick 
later.
Which application seriously expects this kind of behaviour?

> I admit I have not looked, in detail, at this part of ktimers, however,
> assuming that the clock ticks at HZ then the normal error to be expected is
> and average of 1/2 the resolution with a max of 1 resolution.  This is AFTER
> the rounding to the next resolution, so we can expect the expiry to be any
> where from 0 to 2*resolution-1.  (up to resolution-1 from rounding, and up to
> one resolution from clock skew).  This the way I and every one I have worked
> with understand the standard.

For relative timer I agree that the error can be twice the resolution. 
First the value read from the clock must be rounded up and then we still 
have to wait till the next clock tick.
OTOH for absolute timer we don't need the first step, we just have to wait 
until the clock reaches this time. Why should we add an extra error to it, 
if we can avoid it? The spec actually says "a timer expiration signal is 
requested when the associated clock reaches or exceeds the specified 
time." The clock resolution causes the actual expiration time 
automatically to be a rounded value of the requested value.

Next question would be what happens if timer and clock resolution differs? 
For example if the clock has a resolution of 1us and the timer runs every 
1ms. For relative timer this would mean we can keep the error within 
1.001ms and for absolute timer within 1ms. Do we really have to force an 
error larger than really necessary?

Interesting is now that Thomas doesn't take the clock resolution into 
account at all. Let's say clock and timer resolution are 1ms (or HZ=1000). 
If we program a normal kernel timer, we do something like this:

	timer->expires = jiffies + 1 + usecs_to_jiffies(timeout);

Thomas does now basically this:

	timer->expires = jiffies * res + round(timeout, res);

IOW if the clock resolution is larger than the interrupt delay, the timer 
may expire early.

> We have had good acceptance of the HRT patch in our customer base.  As far as
> I know we have not gotten any feed back on just what resolution they want or
> use.  We allow them to define it (within reason) at configure time.  I have
> been recommending, for the x86, nothing better than about 100usec but this is
> based on my machine being able to handle an interrupt in about that amount of
> time.
> 
> We don't require alignment of the resolution with the actual hardware
> resolution as at these levels the interrupt jitter smooths over any issues in
> this area.

I expected as much, so users who do care make sure the timer resolution is 
good enough. In this case I would also expect that they are interested in 
keeping the error as small as possible.

>  A comment here, in some of your math examples you seem to be
> implying that we are going to use a particular resolution from the begining of
> time to compute expiry time.  In fact, we start from "now" as defined by a,
> possibly corrected, system clock.  Once we have the rounded expiry time we use
> full resolution math to figure how to fit that into the timing services.  So,
> infact, the resolution comes into play only over 1 to 2 resolutions of the
> requested time.  In other words, errors do not accumulate since we always mark
> to the corrected clock.

I didn't imply that, I tried to keep focus on the model as described in 
the spec. I think we should keep the focus on the behaviour this model 
describes, no user cares how the kernel implements the spec, just that the 
visible behaviour matches the spec.
SUS rationale specifically says "The interfaces also allow flexibility in 
the implementation of the functions. For example, ..." (in "Relationship 
of Timers to Clocks"), i.e. there is not one true implementation and so I 
think it's very well worth it to explore our options. Reducing the whole 
design to a single number (the resolution returned by clock_getres()) 
would be IMO very shortsighted. We could very well allow the user to 
define his own timer based on various parameters, so he can adjust the 
timer to his needs.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-27 20:23                                           ` Roman Zippel
@ 2005-10-28  4:52                                             ` Steven Rostedt
  2005-10-28 16:06                                               ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: Steven Rostedt @ 2005-10-28  4:52 UTC (permalink / raw)
  To: Roman Zippel
  Cc: tim.bird, oleg, hch, paulmck, johnstul, linux-kernel, Ingo Molnar,
	Andrew Morton, Thomas Gleixner, George Anzinger

On Thu, 2005-10-27 at 22:23 +0200, Roman Zippel wrote:

> 
> Next question would be what happens if timer and clock resolution differs? 
> For example if the clock has a resolution of 1us and the timer runs every 
> 1ms. For relative timer this would mean we can keep the error within 
> 1.001ms and for absolute timer within 1ms. Do we really have to force an 
> error larger than really necessary?
> 
> Interesting is now that Thomas doesn't take the clock resolution into 
> account at all. Let's say clock and timer resolution are 1ms (or HZ=1000). 
> If we program a normal kernel timer, we do something like this:
> 
> 	timer->expires = jiffies + 1 + usecs_to_jiffies(timeout);
> 
> Thomas does now basically this:
> 
> 	timer->expires = jiffies * res + round(timeout, res);
> 
> IOW if the clock resolution is larger than the interrupt delay, the timer 
> may expire early.

Roman, I think I know what you are trying to say here. Although it took
me several readings of what you wrote and then really just looking at
Thomas' code.

It's the old problem with:

1          2          3          4     
+----------+----------+----------+---------->>
       ^                ^
       |                |
     Start             End

Asking for 2 ms (with both clock and res the same at 1ms). We start the
clock at 1 but it really is 1.7 and we get the interrupt and return at 3
but really 3.2, so instead of receiving a wait of 2ms, we return with
3.2 - 1.7 = 1.5ms

Currently, this is not a problem when the clock is at a higher
frequency, (like the tsc).  So the base->get_time works now since the
clock is at a higher frequency, but if the get_time returned jiffies,
this would fail.  And the clock used is also much faster that the delay
it takes to get back to the calling process (which is much more than a
nanosecond today).

Is that what you were trying to say Roman?

Interesting though, I tried to force this scenario, by changing the
base->get_time to return jiffies.  I have a jitter test and ran this
several times, and I could never get it to expire early.  I even changed
HZ back to 100.

Then I looked at run_ktimer_queue.  And here we have the compare:

		timer = list_entry(base->pending.next, struct ktimer, list);
		if (ktime_cmp(now, <=, timer->expires))
			break;

So, the timer does _not_ get processed if it is after or _equal_ to the
current time.  So although the timer may go off early, the expired queue
does not get executed.  So the above example would not go off at 3.2,
but some time in the 4 category.

So the function will _not_ be executed early, although this could mean
that the timer could actually go off early (in the HRT case), but I
haven't taken a look there.  That is to say the interrupt goes off
early, not the function being executed.

-- Steve

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-28  4:52                                             ` Steven Rostedt
@ 2005-10-28 16:06                                               ` Roman Zippel
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Zippel @ 2005-10-28 16:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tim.bird, oleg, Christoph Hellwig, paulmck, johnstul,
	linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	George Anzinger

Hi,

On Fri, 28 Oct 2005, Steven Rostedt wrote:

> Roman, I think I know what you are trying to say here. Although it took
> me several readings of what you wrote and then really just looking at
> Thomas' code.

Thanks for the effort. :-) I know I'm sometimes a bit difficult to 
understand, which makes it easier to simply flame me for a silly mistake.

> It's the old problem with:
> 
> 1          2          3          4     
> +----------+----------+----------+---------->>
>        ^                ^
>        |                |
>      Start             End
> 
> Asking for 2 ms (with both clock and res the same at 1ms). We start the
> clock at 1 but it really is 1.7 and we get the interrupt and return at 3
> but really 3.2, so instead of receiving a wait of 2ms, we return with
> 3.2 - 1.7 = 1.5ms
> 
> Currently, this is not a problem when the clock is at a higher
> frequency, (like the tsc).  So the base->get_time works now since the
> clock is at a higher frequency, but if the get_time returned jiffies,
> this would fail.  And the clock used is also much faster that the delay
> it takes to get back to the calling process (which is much more than a
> nanosecond today).
> 
> Is that what you were trying to say Roman?

Yes.

> Interesting though, I tried to force this scenario, by changing the
> base->get_time to return jiffies.  I have a jitter test and ran this
> several times, and I could never get it to expire early.  I even changed
> HZ back to 100.
> 
> Then I looked at run_ktimer_queue.  And here we have the compare:
> 
> 		timer = list_entry(base->pending.next, struct ktimer, list);
> 		if (ktime_cmp(now, <=, timer->expires))
> 			break;
> 
> So, the timer does _not_ get processed if it is after or _equal_ to the
> current time.  So although the timer may go off early, the expired queue
> does not get executed.  So the above example would not go off at 3.2,
> but some time in the 4 category.
> 
> So the function will _not_ be executed early, although this could mean
> that the timer could actually go off early (in the HRT case), but I
> haven't taken a look there.  That is to say the interrupt goes off
> early, not the function being executed.

You're correct. I missed that comparison, so if clock resolution and timer 
resolution are equal, this indeed works.
It still goes wrong if the resolutions are different. get_time() normally 
wouldn't use jiffies but xtime. Thomas uses a fixed resolution, but xtime 
updates are not constant. On my machine here with HZ=250 the resolution 
would be 4000000ns, but xtime is updated by about 4000150ns every tick. 
This means the timer value is rounded to full 4ms, but this is not enough 
to get it past the next tick.
In the other more common case, where clock resolution is smaller than the 
timer resolution, this means the delay may be larger than really 
necessary.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  9:41                       ` Ingo Molnar
  2005-10-17  9:56                         ` Andrew Morton
@ 2005-10-17 16:33                         ` Roman Zippel
  2005-10-17 16:39                           ` Ingo Molnar
  1 sibling, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-17 16:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> if a dozen mails werent enough then one more probably wont make a 
> difference,

Just for the record: in this thread I got exactly three answers from 
Thomas. I don't know where you got the other nine mails from, maybe you 
could forward them to me, as they seem to contain the "patient 
explanations" I'm missing.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:33                         ` Roman Zippel
@ 2005-10-17 16:39                           ` Ingo Molnar
  2005-10-17 16:54                             ` Roman Zippel
  0 siblings, 1 reply; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17 16:39 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird


* Roman Zippel <zippel@linux-m68k.org> wrote:

> Hi,
> 
> On Mon, 17 Oct 2005, Ingo Molnar wrote:
> 
> > if a dozen mails werent enough then one more probably wont make a 
> > difference,
> 
> Just for the record: in this thread I got exactly three answers from 
> Thomas. I don't know where you got the other nine mails from, maybe 
> you could forward them to me, as they seem to contain the "patient 
> explanations" I'm missing.

here are all the replies from Thomas, regarding ktimers:

12359   * Sep 22 Thomas Gleixner ( 319) Re: [ANNOUNCE] ktimers subsystem
12362   * Sep 23 Thomas Gleixner (  49) Re: [ANNOUNCE] ktimers subsystem
12363   * Sep 23 Thomas Gleixner ( 235) Re: [ANNOUNCE] ktimers subsystem
12367   * Sep 24 Thomas Gleixner ( 214) Re: [ANNOUNCE] ktimers subsystem
12368   * Sep 25 Thomas Gleixner (  25) Re: [ANNOUNCE] ktimers subsystem
12369   * Sep 25 Thomas Gleixner (  17) Re: [ANNOUNCE] ktimers subsystem
12370   * Sep 25 Thomas Gleixner (  10) Re: [ANNOUNCE] ktimers subsystem
12387   * Oct 01 Thomas Gleixner ( 817) Re: [PATCH]  ktimers subsystem 2.6.14-rc
12419   * Oct 11 Thomas Gleixner (  41) Re: [PATCH]  ktimers subsystem 2.6.14-rc
12434   * Oct 16 Thomas Gleixner (  40) Re: [PATCH]  ktimers subsystem 2.6.14-rc

some of them very large and detailed.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:39                           ` Ingo Molnar
@ 2005-10-17 16:54                             ` Roman Zippel
  2005-10-17 17:35                               ` Ingo Molnar
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Zippel @ 2005-10-17 16:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird

Hi,

On Mon, 17 Oct 2005, Ingo Molnar wrote:

> here are all the replies from Thomas, regarding ktimers:
> 
> 12359   * Sep 22 Thomas Gleixner ( 319) Re: [ANNOUNCE] ktimers subsystem
> 12362   * Sep 23 Thomas Gleixner (  49) Re: [ANNOUNCE] ktimers subsystem
> 12363   * Sep 23 Thomas Gleixner ( 235) Re: [ANNOUNCE] ktimers subsystem
> 12367   * Sep 24 Thomas Gleixner ( 214) Re: [ANNOUNCE] ktimers subsystem
> 12368   * Sep 25 Thomas Gleixner (  25) Re: [ANNOUNCE] ktimers subsystem
> 12369   * Sep 25 Thomas Gleixner (  17) Re: [ANNOUNCE] ktimers subsystem
> 12370   * Sep 25 Thomas Gleixner (  10) Re: [ANNOUNCE] ktimers subsystem

Different thread and not directly related to issues with the patch.

> 12387   * Oct 01 Thomas Gleixner ( 817) Re: [PATCH]  ktimers subsystem 2.6.14-rc
> 12419   * Oct 11 Thomas Gleixner (  41) Re: [PATCH]  ktimers subsystem 2.6.14-rc
> 12434   * Oct 16 Thomas Gleixner (  40) Re: [PATCH]  ktimers subsystem 2.6.14-rc

That's the only mails related to the patch.

bye, Roman

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17 16:54                             ` Roman Zippel
@ 2005-10-17 17:35                               ` Ingo Molnar
  0 siblings, 0 replies; 67+ messages in thread
From: Ingo Molnar @ 2005-10-17 17:35 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Thomas Gleixner, George Anzinger, linux-kernel, Andrew Morton,
	johnstul, paulmck, Christoph Hellwig, oleg, tim.bird


* Roman Zippel <zippel@linux-m68k.org> wrote:

> > > Just for the record: in this thread I got exactly three answers 
> > > from Thomas. I don't know where you got the other nine mails from, 
> > > maybe you could forward them to me, as they seem to contain the 
> > > "patient explanations" I'm missing.
> > >
> > here are all the replies from Thomas, regarding ktimers:
> > 
> > 12359   * Sep 22 Thomas Gleixner ( 319) Re: [ANNOUNCE] ktimers subsystem
> > 12362   * Sep 23 Thomas Gleixner (  49) Re: [ANNOUNCE] ktimers subsystem
> > 12363   * Sep 23 Thomas Gleixner ( 235) Re: [ANNOUNCE] ktimers subsystem
> > 12367   * Sep 24 Thomas Gleixner ( 214) Re: [ANNOUNCE] ktimers subsystem
> > 12368   * Sep 25 Thomas Gleixner (  25) Re: [ANNOUNCE] ktimers subsystem
> > 12369   * Sep 25 Thomas Gleixner (  17) Re: [ANNOUNCE] ktimers subsystem
> > 12370   * Sep 25 Thomas Gleixner (  10) Re: [ANNOUNCE] ktimers subsystem
> 
> Different thread and not directly related to issues with the patch.

ugh, what were they about then, poetry?

Ah i think i know what you mean: these were about a PREVIOUS VERSION of 
the patch, and hence they fell off the face of the earth, regardless of 
their content, right? What a tricky little definition of "Thomas replied 
only 3 times" ...

> > 12387   * Oct 01 Thomas Gleixner ( 817) Re: [PATCH]  ktimers subsystem 2.6.14-rc
> > 12419   * Oct 11 Thomas Gleixner (  41) Re: [PATCH]  ktimers subsystem 2.6.14-rc
> > 12434   * Oct 16 Thomas Gleixner (  40) Re: [PATCH]  ktimers subsystem 2.6.14-rc
> 
> That's the only mails related to the patch.

your latest mail with the list of 'open' issues seems to contradict your 
assertion that the above 3 mails from Thomas where "the only mails 
related to the patch". E.g.:

' - "timer API" vs "timeout API": I got absolutely no acknowlegement 
     that this might be a little confusing and in consequence "process 
     timer" may be a better name. '

was raised and discussed in the first chunk of mails just as well.

	Ingo

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-17  9:29                     ` Roman Zippel
  2005-10-17  9:41                       ` Ingo Molnar
@ 2005-10-17  9:54                       ` Steven Rostedt
  1 sibling, 0 replies; 67+ messages in thread
From: Steven Rostedt @ 2005-10-17  9:54 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Ingo Molnar, Thomas Gleixner, George Anzinger, linux-kernel,
	Andrew Morton, johnstul, paulmck, Christoph Hellwig, oleg,
	tim.bird

On Mon, 17 Oct 2005, Roman Zippel wrote:

> Hi,
>
> On Mon, 17 Oct 2005, Ingo Molnar wrote:
>
> > the thing is that Thomas has advanced the whole issue of timeouts and
> > timekeeping by leaps and bounds and he has written thousands of lines of
> > new and excellent code for a kernel subsystem that has seen little
> > activity for many years, before John got involved. One of Thomas'
> > accomplishments is a timer/time design that allows the enabling of HRT
> > timers via an _18 lines_ architecture patch. (!)
>
> Did I say these patches were bad in general? All I'm asking for is an
> explanation for a few design decisions to understand the patch and its
> behaviour better and evaluate alternative solutions.
> Neither of you have shown any real interest in this so far.
>

Well, for me anyway, the best way I have with understanding ones decisions
in their code design _is_ to start playing with the code.  Try it the way
you want and you might realize things don't work so well, and then you
might understand why Thomas did it his way.

There's several times where I thought I could write something better, and
after playing with it, the problems start to arise where I then become
"enlightened" by the decisions others have made.

> > the moment you express yourself via patches we'll know that 1) you
> > understand what we have done so far 2) you have useful ideas of what
> > should be done differently 3) you have the coder capability to implement
> > and test those ideas. Patches wont be ignored, i can assure you. Get the
> > patches rolling!
>
> This "shut up and show code" attitude is sometimes quite funny, but it's
> no real threat to me. I hoped to avoid this and solve this more civilized.
> Of course I'll understand the issues better afterwards, but you could as
> easily just tell me. It will waste my time, I could spend on other
> projects and it will put Andrew in the unfortunate position to decide,
> which patch to accept.
> Is this really what you want?
>

I think what Ingo is saying, is to modify Thomas' code and show where it
is failing, instead of just talking about it.  You can ask "why" he did
something, but I think Thomas gave you enough in his answers.  If you are
still not satisfied, then that is the time to start playing with the code
and find the problems, fix them and show us that "yes" your way is better.
Don't just ask why Thomas did it one way without a patch that changes it
to show us why he shouldn't have.

-- Steve

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH]  ktimers subsystem 2.6.14-rc2-kt5
  2005-10-01  1:03 ` Roman Zippel
  2005-10-01 11:22   ` Ingo Molnar
  2005-10-01 12:05   ` Thomas Gleixner
@ 2005-10-04  1:55   ` George Anzinger
  2 siblings, 0 replies; 67+ messages in thread
From: George Anzinger @ 2005-10-04  1:55 UTC (permalink / raw)
  To: Roman Zippel
  Cc: tglx, linux-kernel, mingo, Andrew Morton, johnstul, paulmck,
	Christoph Hellwig, oleg, tim.bird

Roman Zippel wrote:
> 
> 
> Could you explain a little the resolution handling behind in your patch?
> If I read SUS correctly clock resolution and timer resolution don't have 
> to be the same, the first is returned by clock_getres() and the latter 
> only documented somewhere (and AFAICT our implementation always returned 
> the wrong value).
> IMO this also means we can don't have to make the rounding that 
> complicated. Actually it could be done automatically by the timer, e.g. 
> interval timer are reprogrammed at (now + interval) and the timer 
> resolution will automatically round it up.

As I understand it the resolution should apply to timers assigned to the given clock.  I assume most 
clock reads will return the best resolution possible, but we can only know what that is (in user 
land) by looking at at series of clock reads and making an educated guess (if indeed we care).

For timers, on the other hand, resolution serves two purposes: a) it tells the user/ application 
what to expect and allows him to take evasive action (such as asking for the timer to expire a "res" 
amount sooner) to get what he wants/needs.  b) for the kernel, it allows timers to be grouped such 
that we can limit the number of interrupts we need to service to handle timers.  Some of this might 
be possible by relying on the hardware, but a lot of hardware may actually be able to handle 
nanosecond resolution.  At that point you end up grouping by latency and getting to the point were, 
for no good reason, you have the possibility of timer storms.  For no good reason, i.e. the user 
really doesn't need or want that level of resolution, being happy with, for example 10 microseconds 
or some such.  This is why, in the HRT patch, the same can be said of the new ability to set HZ at 
configure time.
> 
> 

-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2005-10-28 16:07 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-09-28 20:43 [PATCH] ktimers subsystem 2.6.14-rc2-kt5 tglx
2005-09-28 23:59 ` Frank Sorenson
2005-09-29  0:50   ` Frank Sorenson
2005-09-29  0:56     ` john stultz
2005-09-29  1:05       ` Frank Sorenson
2005-09-29  1:10 ` john stultz
2005-09-29  6:53   ` Thomas Gleixner
2005-09-30 15:58     ` Frank Sorenson
2005-09-29 19:57 ` George Anzinger
2005-10-01  1:03 ` Roman Zippel
2005-10-01 11:22   ` Ingo Molnar
2005-10-04  1:59     ` George Anzinger
2005-10-04  5:51       ` Ingo Molnar
2005-10-10 12:42     ` Roman Zippel
2005-10-10 14:04       ` Ingo Molnar
2005-10-01 12:05   ` Thomas Gleixner
2005-10-10 17:22     ` Roman Zippel
2005-10-11  7:42       ` Thomas Gleixner
2005-10-12 22:36         ` Roman Zippel
2005-10-12 23:46           ` George Anzinger
2005-10-16 16:34             ` Roman Zippel
2005-10-16 19:26               ` Thomas Gleixner
2005-10-16 23:03                 ` Roman Zippel
2005-10-17  7:59                   ` Ingo Molnar
2005-10-17  8:26                     ` Steven Rostedt
2005-10-17  9:29                     ` Roman Zippel
2005-10-17  9:41                       ` Ingo Molnar
2005-10-17  9:56                         ` Andrew Morton
2005-10-17 11:00                           ` Ingo Molnar
2005-10-17 16:25                           ` Roman Zippel
2005-10-17 16:49                             ` Tim Bird
2005-10-17 17:26                               ` Steven Rostedt
2005-10-17 18:49                               ` Roman Zippel
2005-10-17 19:19                                 ` Tim Bird
2005-10-17 19:48                                   ` Roman Zippel
2005-10-17 20:13                                     ` Ingo Molnar
2005-10-17 20:31                                       ` Roman Zippel
2005-10-18  8:46                                         ` Ingo Molnar
2005-10-18 23:52                                           ` Tim Bird
2005-10-19  0:03                                             ` George Anzinger
2005-10-19  1:58                                           ` Roman Zippel
2005-10-19  6:46                                             ` Ingo Molnar
2005-10-19 10:49                                             ` kernel/timer.c design (was: Re: ktimers subsystem) Ingo Molnar
2005-10-19 17:48                                               ` kernel/timer.c design Tim Bird
2005-10-19 18:00                                               ` Tim Bird
2005-10-19 19:04                                                 ` Thomas Gleixner
2005-10-19 22:12                                               ` kernel/timer.c design (was: Re: ktimers subsystem) Roman Zippel
2005-10-19 11:40                                             ` [PATCH] ktimers subsystem 2.6.14-rc2-kt5 Ingo Molnar
2005-10-19 11:58                                             ` Ingo Molnar
2005-10-19 22:24                                               ` Roman Zippel
2005-10-17 20:09                                 ` Ingo Molnar
2005-10-17 20:55                             ` Thomas Gleixner
2005-10-18  0:07                               ` Roman Zippel
2005-10-18  1:03                                 ` George Anzinger
2005-10-19  1:26                                   ` Roman Zippel
2005-10-19  2:52                                     ` George Anzinger
2005-10-21 16:22                                       ` Roman Zippel
2005-10-23 18:17                                         ` George Anzinger
2005-10-27 20:23                                           ` Roman Zippel
2005-10-28  4:52                                             ` Steven Rostedt
2005-10-28 16:06                                               ` Roman Zippel
2005-10-17 16:33                         ` Roman Zippel
2005-10-17 16:39                           ` Ingo Molnar
2005-10-17 16:54                             ` Roman Zippel
2005-10-17 17:35                               ` Ingo Molnar
2005-10-17  9:54                       ` Steven Rostedt
2005-10-04  1:55   ` George Anzinger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).