public inbox for linux-kernel@vger.kernel.org
* [RFC] New timeofday proposal (v.A1)
@ 2004-12-08  1:55 john stultz
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: john stultz @ 2004-12-08  1:55 UTC (permalink / raw)
  To: lkml
  Cc: tim, george, albert, Ulrich.Windl, clameter, len.brown, linux,
	davidm, ak, paulus, schwidefsky, keith maanthey, greg kh,
	Patricia Gaughen, Chris McDermott, Max, mahuja

All, 	
	Once again here is my time of day proposal and code. I've not been able
to work too much on it since my last release in September, and not much
has changed in the proposal. However, since I'm doing another code
release, I figured I'd re-send the design/justification summary for
context.

New in this code release:
o Initial x86-64 port
o Re-worked the timesource structure as suggested by Christoph Lameter.
		
Any feedback or comments on this or the following code would be greatly
appreciated.
	
thanks
-john

Proposal for an architecture independent time of day implementation.
-------------------------------------------------------------------
John Stultz (johnstul@us.ibm.com)
DRAFT
Tue Dec  7 15:38:58 PST 2004

Credits:
	Keith Mannthey:	Helped with the initial design and contributed
			greatly to the implementation details.
	George Anzinger: Initial review and corrections.
	Ulrich Windl: Review and suggestions for clarity.

	Many of the time of day related issues that cropped up in 2.5
development occurred where a fix or change was made to a number of
architectures, but missed a few others. Currently every architecture has
its own set of timekeeping functions that basically do the same thing,
only using different (or frequently, not so different) types of
hardware. As hardware has changed, many architectures have had to
re-engineer their time system to handle multiple time and interrupt
sources. With little common infrastructure, either each separate
implementation has its own quirks and bugs, or we end up with a
reasonable quantity of duplicated code. Additionally the lack of a clear
time of day interface has led developers to use jiffies, HZ, and the raw
xtime values to calculate the time of day themselves. This has led to a
number of troublesome bugs.

	With the goal to simplify, streamline and consolidate the time-of-day
infrastructure, I propose the following common implementation across all
arches. This will allow generic bugs to be fixed once, reduce code
duplication, and with many architectures sharing the same time source,
this allows drivers to be written once for multiple architectures.
Additionally it will better delineate the lines between the timer
subsystem and the time-of-day subsystem, opening the door for more
flexible and better timekeeping.

Features of this design:
========================

o Splits time of day management from timer interrupts:
	This is necessary for virtualization & tickless systems. It allows us
to no longer care how often clock_interrupt() is called. Queued, delayed
or lost interrupts do not affect time keeping (within bounds - ie: the
time source cannot overflow). This isolates HZ and jiffies to the timer
subsystem (mostly), as they are frequently and incorrectly used to
calculate time.
	Additionally, it allows for dynamic tick interrupts / high-res ticks.
It avoids the need to interpolate between multiple shoddy time sources,
and lets us be agnostic to where the periodic interrupts come from
(cleans up i386 HPET interrupt code).

o Consolidates a large amount of code:
	Allows for shared time source implementations, such as: i386, x86-64
and ia64 all use HPET, i386 and x86-64 both have ACPI PM timers, and
i386 and ia64 both have cyclone counters. Time sources are just drivers!

o Generic algorithms which use time-source drivers chosen at runtime:
	Drivers are just simple hardware accessor functions with no internal state
needed. They can be loaded and changed while the system is running, like
normal modules.

o More consistent and readable code:
	Drop wall_to_monotonic & xtime in favor of the simpler system_time
and wall_time_offset variables, where system_time is the monotonically
increasing count of nanoseconds since boot and wall_time_offset is the
offset added to system_time to calculate the time of day.

o Uses nanoseconds as the kernel's base time unit.
	Rather than doing ugly manipulations to timevals or timespecs, this
simplifies math, and gives us plenty of room to grow (64bits of
nanoseconds ~= 584 years).

o Clearly separates the NTP code from the time code:
	Creates a clean and clear interface, keeping all the NTP related code
in a single place. Saves brains: normal people shouldn't have to think
about the in-kernel NTP machinery.
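
As a rough check of the "~= 584 years" figure above, the sketch below
divides the largest unsigned 64-bit nanosecond count by the number of
nanoseconds in a year. The helper name and the choice of a 365-day year
are assumptions made for illustration; nothing like this is in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define SEC_PER_DAY  86400ULL

/* Hypothetical helper: whole years (of days_per_year days each) that an
 * unsigned 64-bit nanosecond counter can represent before it wraps. */
uint64_t ns_counter_range_years(uint64_t days_per_year)
{
	uint64_t ns_per_year;

	assert(days_per_year > 0);	/* guard the division below */
	ns_per_year = NSEC_PER_SEC * SEC_PER_DAY * days_per_year;

	return UINT64_MAX / ns_per_year;
}
```

Calling ns_counter_range_years(365) gives the whole-year count behind the
estimate. A signed 64-bit counter would offer roughly half that range.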


Brief Pseudo-code to illustrate the design:
===========================================

Globals:
--------
offset_base: timesource cycle value at last call to timeofday_hook()
system_time: time in ns calculated at last call to timeofday_hook()
wall_offset: offset to monotonic_clock() to get current time of day

Functions:
----------
timeofday_hook()
	now = timesource_read();		/* read the timesource */
	ns = cyc2ns(now - offset_base); /* calc nsecs since last call */
	ntp_ns = ntp_scale(ns);		/* apply ntp scaling */
	system_time += ntp_ns;		/* add scaled value to system_time */
	ntp_advance(ns);		/* advance ntp state machine by ns */
	offset_base = now;		/* set new offset_base */

monotonic_clock()
	now = timesource_read();	/* read the timesource */
	ns = cyc2ns(now - offset_base);	/* calculate nsecs since last hook */
	ntp_ns = ntp_scale(ns);		/* apply ntp scaling */
	return system_time + ntp_ns;	/* return system_time plus scaled value */

settimeofday(desired)
	wall_offset = desired - monotonic_clock(); /* set wall offset */

gettimeofday()
	return wall_offset + monotonic_clock();	/* return current timeofday */
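
For readers who want to poke at the flow, here is a minimal userspace
sketch of the pseudo-code above. It is not the patch code: the
hand-advanced 1 MHz counter, the ntp_ppm variable, the trivial
ntp_scale()/ntp_advance() stand-ins and the tod_* names are all
assumptions made so the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t nsec_t;
typedef uint64_t cycle_t;

/* Hypothetical timesource: a 1 MHz counter advanced by hand
 * (1 cycle = 1000 ns). */
static cycle_t fake_counter;
static cycle_t timesource_read(void) { return fake_counter; }
static nsec_t cyc2ns(cycle_t cycles) { return cycles * 1000; }

/* Placeholder NTP hooks; the real ones live in ntp.c. */
static long ntp_ppm;	/* assumed signed parts-per-million correction */
static nsec_t ntp_scale(nsec_t ns)
{
	return ns + (nsec_t)(((int64_t)ns * ntp_ppm) / 1000000);
}
static void ntp_advance(nsec_t ns) { (void)ns; }

static cycle_t offset_base;	/* cycle value at last timeofday_hook() */
static nsec_t system_time;	/* ns since boot at last timeofday_hook() */
static nsec_t wall_offset;	/* added to monotonic_clock() for time of day */

void timeofday_hook(void)
{
	cycle_t now = timesource_read();
	nsec_t ns;

	assert(now >= offset_base);	/* this sketch ignores counter wrap */
	ns = cyc2ns(now - offset_base);	/* nsecs since last call */
	system_time += ntp_scale(ns);	/* accumulate the scaled value */
	ntp_advance(ns);		/* advance ntp state by raw ns */
	offset_base = now;
}

nsec_t monotonic_clock(void)
{
	nsec_t ns = cyc2ns(timesource_read() - offset_base);

	return system_time + ntp_scale(ns);
}

/* Named tod_* only to avoid clashing with the libc functions. */
void tod_settimeofday(nsec_t desired)
{
	wall_offset = desired - monotonic_clock();
}

nsec_t tod_gettimeofday(void)
{
	return wall_offset + monotonic_clock();
}
```

The point to notice is that monotonic_clock() never depends on how often
timeofday_hook() runs: the hook only moves the accumulation point forward,
so delayed or lost calls do not change the value returned (as long as the
counter does not wrap between calls).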


Points I'm glossing over for now:
====================================================

o Have to convert back to timeval for syscall interface

o ntp_scale(ns):  scales ns by NTP scaling factor
	- see ntp.c for details
	- costly, but correct.

o ntp_advance(ns): advances NTP state machine by ns
	- see ntp.c for details

o What is the cost of throwing around 64bit values for everything?
	- Do we need an arch specific time structure that varies size
accordingly?

o Some arches (arm, for example) do not have high res timing hardware
	- In this case we can have a "jiffies" timesource
		- cyc2ns(x) =  x*(NSEC_PER_SEC/HZ)
		- doesn't work for tickless systems

o vsyscalls/userspace gettimeofday()
	- Still done on a per-arch basis
		- My earlier arch independent plan won't work
	- Exporting the full NTP logic is going to be annoying

o suspend/resume
	- not yet implemented, but shouldn't be hard
o cpufreq effects
	- handled in the timesource driver
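
To make the "jiffies" timesource note above concrete, here is a small
sketch of a multiply-and-shift cycle-to-nanosecond conversion, of which
cyc2ns(x) = x*(NSEC_PER_SEC/HZ) is the shift = 0 case. The struct and
helper names are hypothetical stand-ins, and the 3 MHz counter in the
usage note is an invented example rather than real hardware.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t nsec_t;
typedef uint64_t cycle_t;

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, trimmed-down stand-in for a timesource's scaling fields. */
struct ts_sketch {
	uint32_t mult;	/* cycle to nanosecond multiplier */
	uint32_t shift;	/* cycle to nanosecond divisor (power of two) */
};

/* Derive mult for a counter running at freq_hz, given a chosen shift.
 * A jiffies-style source is simply freq_hz = HZ with shift = 0. */
struct ts_sketch ts_from_freq(uint64_t freq_hz, uint32_t shift)
{
	struct ts_sketch ts;
	uint64_t mult;

	assert(freq_hz > 0 && shift < 32);
	mult = (NSEC_PER_SEC << shift) / freq_hz;
	assert(mult <= UINT32_MAX);	/* must fit the 32-bit mult field */

	ts.mult = (uint32_t)mult;
	ts.shift = shift;
	return ts;
}

/* ns = (cycles * mult) >> shift; overflows for very large cycle deltas. */
nsec_t cyc2ns_sketch(const struct ts_sketch *ts, cycle_t cycles)
{
	return (cycles * ts->mult) >> ts->shift;
}
```

With ts_from_freq(1000, 0), i.e. HZ=1000, mult is NSEC_PER_SEC/HZ and the
conversion is exact. With a made-up 3 MHz counter and shift = 10, mult is
rounded down, so one second's worth of cycles converts to slightly less
than a second; a larger shift trades multiplication headroom for precision.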
	
Anything else? What am I missing or just being ignorant of?


* [RFC] new timeofday core subsystem (v.A1)
  2004-12-08  1:55 [RFC] New timeofday proposal (v.A1) john stultz
@ 2004-12-08  1:56 ` john stultz
  2004-12-08  1:57   ` [RFC] new timeofday arch specific hooks (v.A1) john stultz
                     ` (4 more replies)
  2004-12-08 18:25 ` [RFC] New timeofday proposal (v.A1) Christoph Lameter
  2004-12-08 18:43 ` Nicolas Pitre
  2 siblings, 5 replies; 29+ messages in thread
From: john stultz @ 2004-12-08  1:56 UTC (permalink / raw)
  To: lkml
  Cc: tim, george anzinger, albert, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

All,
	This patch implements the architecture independent portion of the time
of day subsystem. Included is timeofday.c (which includes all the time
of day management and accessor functions), ntp.c (which includes the
ntp scaling code, leapsecond processing, and ntp kernel state machine
code), timesource.c (for timesource specific management functions),
interface definition .h files, the example jiffies timesource (lowest
common denominator time source, mainly for use as example code) and
minimal hooks into arch independent code.

The patch does not function without minimal architecture specific hooks
(i386 and x86-64 examples to follow), and it can be applied to a tree
without affecting the code.

New in this version:
o Re-worked timesource structure from Christoph Lameter's suggestions.

I look forward to your comments and feedback.

thanks
-john


linux-2.6.10-rc3_timeofday-core_A1.patch
========================================
diff -Nru a/drivers/Makefile b/drivers/Makefile
--- a/drivers/Makefile	2004-12-07 16:47:19 -08:00
+++ b/drivers/Makefile	2004-12-07 16:47:19 -08:00
@@ -60,3 +60,4 @@
 obj-$(CONFIG_CPU_FREQ)		+= cpufreq/
 obj-$(CONFIG_MMC)		+= mmc/
 obj-y				+= firmware/
+obj-$(CONFIG_NEWTOD)		+= timesource/
diff -Nru a/drivers/timesource/Makefile b/drivers/timesource/Makefile
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/Makefile	2004-12-07 16:47:19 -08:00
@@ -0,0 +1 @@
+obj-y += jiffies.o
diff -Nru a/drivers/timesource/jiffies.c b/drivers/timesource/jiffies.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/jiffies.c	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,45 @@
+/*
+ * linux/drivers/timesource/jiffies.c
+ *
+ * Copyright (C) 2004 IBM
+ *
+ * This file contains the jiffies based time source.
+ *
+ */
+#include <linux/timesource.h>
+#include <linux/jiffies.h>
+#include <linux/init.h>
+
+/* The Jiffies based timesource is the lowest common
+ * denominator time source which should function on
+ * all systems. It has the same coarse resolution as
+ * the timer interrupt frequency HZ and it suffers
+ * inaccuracies caused by missed or lost timer
+ * interrupts and the inability for the timer
+ * interrupt hardware to accurately tick at the
+ * requested HZ value. It is also not recommended
+ * for "tick-less" systems.
+ */
+
+static cycle_t jiffies_read(void)
+{
+	cycle_t ret = get_jiffies_64();
+	return ret;
+}
+
+struct timesource_t timesource_jiffies = {
+	.name = "jiffies",
+	.priority = 0, /* lowest priority*/
+	.type = TIMESOURCE_FUNCTION,
+	.read_fnct = jiffies_read,
+	.mask = (cycle_t)~0,
+	.mult = NSEC_PER_SEC/HZ,
+	.shift = 0,
+};
+
+static int init_jiffies_timesource(void)
+{
+	register_timesource(&timesource_jiffies);
+	return 0;
+}
+module_init(init_jiffies_timesource);
diff -Nru a/include/linux/ntp.h b/include/linux/ntp.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/linux/ntp.h	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,22 @@
+/*	linux/include/linux/ntp.h
+ *
+ *	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+ *
+ *	This file contains the NTP interfaces used by the time of day code
+ */
+
+#ifndef _LINUX_NTP_H
+#define _LINUX_NTP_H
+#include <linux/types.h>
+#include <linux/time.h>
+#include <linux/timex.h>
+
+/* timeofday interfaces */
+nsec_t ntp_scale(nsec_t value);
+void ntp_advance(nsec_t value);
+int ntp_adjtimex(struct timex*);
+int ntp_leapsecond(struct timespec now);
+void ntp_clear(void);
+int get_ntp_status(void);
+
+#endif
diff -Nru a/include/linux/time.h b/include/linux/time.h
--- a/include/linux/time.h	2004-12-07 16:47:19 -08:00
+++ b/include/linux/time.h	2004-12-07 16:47:19 -08:00
@@ -27,6 +27,10 @@
 
 #ifdef __KERNEL__
 
+/* timeofday base types */
+typedef u64 nsec_t;
+typedef u64 cycle_t;
+
 /* Parameters used to convert the timespec values */
 #ifndef USEC_PER_SEC
 #define USEC_PER_SEC (1000000L)
diff -Nru a/include/linux/timeofday.h b/include/linux/timeofday.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/linux/timeofday.h	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,61 @@
+/*	linux/include/linux/timeofday.h
+ *
+ *	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+ *
+ *	This file contains the interface to the time of day subsystem
+ */
+#ifndef _LINUX_TIMEOFDAY_H
+#define _LINUX_TIMEOFDAY_H
+#include <linux/types.h>
+#include <linux/time.h>
+
+#ifdef CONFIG_NEWTOD
+nsec_t get_lowres_timestamp(void);
+nsec_t get_lowres_timeofday(void);
+nsec_t do_monotonic_clock(void);
+
+
+void do_gettimeofday(struct timeval *tv);
+int do_settimeofday(struct timespec *tv);
+int do_adjtimex(struct timex *tx);
+
+void timeofday_interrupt_hook(void);
+void timeofday_init(void);
+
+
+/* Helper functions */
+static inline struct timeval ns2timeval(nsec_t ns)
+{
+	struct timeval tv;
+	tv.tv_sec = div_long_long_rem(ns, NSEC_PER_SEC, &tv.tv_usec);
+	tv.tv_usec /= NSEC_PER_USEC;
+	return tv;
+}
+
+static inline struct timespec ns2timespec(nsec_t ns)
+{
+	struct timespec ts;
+	ts.tv_sec = div_long_long_rem(ns, NSEC_PER_SEC, &ts.tv_nsec);
+	return ts;
+}
+
+static inline nsec_t timespec2ns(struct timespec* ts)
+{
+	nsec_t ret;
+	ret = ((nsec_t)ts->tv_sec) * NSEC_PER_SEC;
+	ret += ts->tv_nsec;
+	return ret;
+}
+
+static inline nsec_t timeval2ns(struct timeval* tv)
+{
+	nsec_t ret;
+	ret = ((nsec_t)tv->tv_sec) * NSEC_PER_SEC;
+	ret += tv->tv_usec*NSEC_PER_USEC;
+	return ret;
+}
+#else /* CONFIG_NEWTOD */
+#define timeofday_interrupt_hook()
+#define timeofday_init()
+#endif /* CONFIG_NEWTOD */
+#endif /* _LINUX_TIMEOFDAY_H */
diff -Nru a/include/linux/timesource.h b/include/linux/timesource.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/linux/timesource.h	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,86 @@
+/*	linux/include/linux/timesource.h
+ *
+ *	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+ *
+ *	This file contains the structure definitions for timesources.
+ *
+ *	If you are not a timesource, or the time of day code, you should
+ *	not be including this file!
+ */
+#ifndef _LINUX_TIMESOURCE_H
+#define _LINUX_TIMESOURCE_H
+
+#include <linux/types.h>
+#include <linux/time.h>
+
+/* struct timesource_t:
+ *		Provides mostly state-free accessors to the underlying
+ *		hardware.
+ * name:	ptr to timesource name
+ * priority:	priority value for selection (higher is better)
+ * type:	defines timesource type
+ * read_fnct:	returns a cycle value
+ * ptr:		ptr to MMIO'ed counter
+ * mask:	bitmask for two's complement
+ *		subtraction of non 64 bit counters
+ * mult:	cycle to nanosecond multiplier
+ * shift:	cycle to nanosecond divisor (power of two)
+ */
+struct timesource_t {
+	char* name;
+	int priority;
+	enum {
+		TIMESOURCE_FUNCTION,
+		TIMESOURCE_MMIO_32,
+		TIMESOURCE_MMIO_64
+	} type;
+	cycle_t (*read_fnct)(void);
+	void* ptr;
+	cycle_t mask;
+	u32 mult;
+	u32 shift;
+};
+
+
+/* read_timesource():
+ *		Uses the timesource to return the current cycle_t value
+ */
+static inline cycle_t read_timesource(struct timesource_t* ts)
+{
+	u64* tmp64;
+	u32* tmp32;
+	switch (ts->type) {
+	case TIMESOURCE_MMIO_32:
+		tmp32 = (u32*)ts->ptr;
+		return (cycle_t)*tmp32;
+	case TIMESOURCE_MMIO_64:
+		tmp64 = (u64*)ts->ptr;
+		return (cycle_t)*tmp64;
+	default:/* case: TIMESOURCE_FUNCTION */
+		return ts->read_fnct();
+	}
+}
+
+/* cyc2ns():
+ *		Uses the timesource to convert cycle_ts to nanoseconds.
+ *		If rem is not null, it stores the remainder of the
+ *		calculation there.
+ *
+ *	XXX - Remainder math is not in place!
+ */
+static inline nsec_t cyc2ns(struct timesource_t* ts, cycle_t cycles, cycle_t* rem)
+{
+	u64 ret;
+	ret = (u64)cycles;
+	ret *= ts->mult;
+	ret >>= ts->shift;
+	if (rem) /* XXX we still need to do remainder math */
+		*rem = (cycle_t)0;
+	return (nsec_t)ret;
+}
+
+/* used to install a new time source */
+void register_timesource(struct timesource_t*);
+struct timesource_t* get_next_timesource(void);
+
+#endif
diff -Nru a/init/main.c b/init/main.c
--- a/init/main.c	2004-12-07 16:47:19 -08:00
+++ b/init/main.c	2004-12-07 16:47:19 -08:00
@@ -46,6 +46,7 @@
 #include <linux/rmap.h>
 #include <linux/mempolicy.h>
 #include <linux/key.h>
+#include <linux/timeofday.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -525,6 +526,7 @@
 	pidhash_init();
 	init_timers();
 	softirq_init();
+	timeofday_init();
 	time_init();
 
 	/*
diff -Nru a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile	2004-12-07 16:47:19 -08:00
+++ b/kernel/Makefile	2004-12-07 16:47:19 -08:00
@@ -9,6 +9,7 @@
 	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o
 
+obj-$(CONFIG_NEWTOD) += timeofday.o timesource.o ntp.o
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += cpu.o spinlock.o
diff -Nru a/kernel/ntp.c b/kernel/ntp.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/kernel/ntp.c	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,554 @@
+/********************************************************************
+* linux/kernel/ntp.c
+*
+* NTP state machine and time scaling code.
+*
+* Copyright (C) 2004 IBM, John Stultz (johnstul@us.ibm.com)
+*
+* Portions rewritten from kernel/time.c and kernel/timer.c
+* Please see those files for original copyrights.
+*
+* Hopefully you should never have to understand or touch
+* any of the code below, but don't let that keep you from trying!
+*
+* This code is loosely based on David Mills' RFC 1589 and its
+* updates. Please see the following for more details:
+*  http://www.eecis.udel.edu/~mills/database/rfc/rfc1589.txt
+*  http://www.eecis.udel.edu/~mills/database/reports/kern/kernb.pdf
+*
+* NOTE:	To simplify the code, we do not implement any of
+* the PPS code, as the code that uses it never was merged.
+*					-johnstul@us.ibm.com
+*
+* Revision History:
+* 2004-09-02: A0
+*	o First pass sent to lkml for review.
+* 2004-12-07: A1
+*	o No changes, sent to lkml for review.
+*
+* TODO List:
+*	o More documentation
+*	o More testing
+*	o More optimization
+*********************************************************************/
+
+#include <linux/ntp.h>
+#include <linux/errno.h>
+#include <linux/sched.h> /* Needed for capable() */
+
+/* NTP scaling code
+ * Functions:
+ * ----------
+ * nsec_t ntp_scale(nsec_t value):
+ *		Scales the nsec_t value using ntp kernel state
+ * void ntp_advance(nsec_t interval):
+ *		Increments the NTP state machine by interval time
+ * static int ntp_hardupdate(long offset, struct timeval tv)
+ *		ntp_adjtimex helper function
+ * int ntp_adjtimex(struct timex* tx):
+ *		Interface to adjust NTP state machine
+ * int ntp_leapsecond(struct timespec now)
+ *		Does NTP leapsecond processing. Returns number of
+ *		seconds current time should be adjusted by.
+ * void ntp_clear(void):
+ *		Clears the ntp kernel state
+ * int get_ntp_status(void):
+ *		returns ntp_status value
+ *
+ * Variables:
+ * ----------
+ * ntp kernel state variables:
+ *		See below for full list.
+ * ntp_lock:
+ *		Protects ntp kernel state variables
+ */
+
+
+
+/* Chapter 5: Kernel Variables [RFC 1589 pg. 28] */
+/* 5.1 Interface Variables */
+static int ntp_status		= STA_UNSYNC;		/* status */
+static long ntp_offset;					/* usec */
+static long ntp_constant	= 2;			/* ntp magic? */
+static long ntp_maxerror	= NTP_PHASE_LIMIT;	/* usec */
+static long ntp_esterror	= NTP_PHASE_LIMIT;	/* usec */
+static const long ntp_tolerance	= MAXFREQ;		/* shifted ppm */
+static const long ntp_precision	= 1;			/* constant */
+
+/* 5.2 Phase-Lock Loop Variables */
+static long ntp_freq;					/* shifted ppm */
+static long ntp_reftime;				/* sec */
+
+/* Extra values */
+static int ntp_state	= TIME_OK;		/* leapsecond state */
+static long ntp_tick	= USEC_PER_SEC/USER_HZ;	/* tick length */
+
+static s64 ss_offset_len;	/* SINGLESHOT offset adj interval (nsec)*/
+static long singleshot_adj;	/* +/- MAX_SINGLESHOT_ADJ (ppm)*/
+static long tick_adj; 		/* tx->tick adjustment (ppm) */
+static long offset_adj;		/* offset adjustment (ppm) */
+
+
+/* lock for the above variables */
+static seqlock_t ntp_lock = SEQLOCK_UNLOCKED;
+
+#define MILLION 1000000
+#define MAX_SINGLESHOT_ADJ 500 /* (ppm) */
+#define SEC_PER_DAY 86400
+
+/* Required to safely shift negative values */
+#define shiftR(x,s) (((x) < 0) ? (-((-(x)) >> (s))) : ((x) >> (s)))
+
+/* nsec_t ntp_scale(nsec_t):
+ *	Scale the raw time interval to an NTP scaled value
+ *
+ *	This is done in three steps:
+ *	1. Tick adjustment.
+ *	2. Frequency adjustment.
+ *	3. Singleshot offset adjustment
+ *
+ *	The first two are done over the entire raw interval, while
+ *	the third is only done as long as the raw interval is less
+ *	than the singleshot offset length.
+ *
+ *	The basic equation is:
+ *	raw = raw + ppm scaling
+ *
+ *	So while raw < ss_offset_len:
+ *	raw = raw + (raw*(tick_adj+freq_adj+ss_offset_adj))/MILLION
+ *
+ *	But when raw is > ss_offset_len:
+ *	raw = raw
+ *		+ (ss_offset_len*(tick_adj+freq_adj+ss_offset_adj))/MILLION
+ *		+ ((raw - ss_offset_len)*(tick_adj+freq_adj))/MILLION
+ *
+ * XXX - TODO:
+ *	o Get rid of nasty 64bit divides (try shifts)
+ *	o more comments!
+ */
+nsec_t ntp_scale(nsec_t raw)
+{
+	u64 ret = (u64)raw;
+	unsigned long seq;
+	u64 offset_len;
+	long freq_ppm, tick_ppm, ss_ppm, offset_ppm;
+
+	/* save off current kernel state */
+	do {
+		seq = read_seqbegin(&ntp_lock);
+
+		tick_ppm = tick_adj;
+		offset_ppm = offset_adj;
+		freq_ppm = shiftR(ntp_freq,SHIFT_USEC);
+		ss_ppm = singleshot_adj;
+		offset_len = (u64)ss_offset_len;
+
+	} while (read_seqretry(&ntp_lock, seq));
+
+
+	/* Due to sign issues, this is a bit ugly */
+	if (raw <= offset_len) {
+		/* calculate total ppm */
+		long ppm = tick_ppm + freq_ppm + offset_ppm + ss_ppm;
+
+		/* use abs(ppm) to avoid sign issues w/ do_div */
+		u64 tmp = raw * abs(ppm);
+
+		/* XXX - try to replace this w/ a 64bit shift */
+		do_div(tmp, MILLION);
+
+		/* reapply the sign lost by using abs(ppm) */
+		if(ppm < 0)
+			ret -= tmp;
+		else
+			ret += tmp;
+
+	} else {
+		/* Only apply MAX_SINGLESHOT_ADJ for the ss_offset_len */
+		s64 v1 = offset_len*(tick_ppm + freq_ppm + offset_ppm + ss_ppm);
+		s64 v2 = (s64)(raw - offset_len)*(tick_ppm + freq_ppm+ offset_ppm);
+
+		s64 tmp = v1+v2;
+
+		/* use |tmp| to avoid sign issues w/ do_div (open-coded: abs() is int-only) */
+		u64 tmp2 = (tmp < 0) ? -tmp : tmp;
+
+		/* XXX - try to replace this w/ a 64bit shift */
+		do_div(tmp2,MILLION);
+
+		/* reapply the sign of tmp */
+		if(tmp < 0)
+			ret -= tmp2;
+		else
+			ret += tmp2;
+	}
+
+	return (nsec_t)ret;
+}
+
+/* void ntp_advance(nsec_t interval):
+ *	Periodic hook which increments NTP state machine by interval
+ *	This is ntp_hardclock in the RFC.
+ */
+void ntp_advance(nsec_t interval)
+{
+	static u64 interval_sum=0;
+
+	/* inc interval sum */
+	interval_sum += interval;
+
+	write_seqlock_irq(&ntp_lock);
+
+	/* decrement singleshot offset interval */
+	ss_offset_len -= interval;
+	if(ss_offset_len < 0) /* make sure it doesn't go negative */
+		ss_offset_len=0;
+
+	/* Do second overflow code */
+	while (interval_sum > NSEC_PER_SEC) {
+		/* XXX - I'd prefer to smoothly apply this math
+		 * at each call to ntp_advance() rather than each
+		 * second.
+		 */
+		long tmp;
+
+		/* Bump maxerror by ntp_tolerance */
+		ntp_maxerror += shiftR(ntp_tolerance, SHIFT_USEC);
+		if (ntp_maxerror > NTP_PHASE_LIMIT) {
+			ntp_maxerror = NTP_PHASE_LIMIT;
+			ntp_status |= STA_UNSYNC;
+		}
+
+		/* Calculate offset_adj for the next second */
+		tmp = ntp_offset;
+		if (!(ntp_status & STA_FLL))
+		    tmp = shiftR(tmp, SHIFT_KG + ntp_constant);
+
+		/* bound the adjustment to MAXPHASE/MINSEC */
+		if (tmp > (MAXPHASE / MINSEC) << SHIFT_UPDATE)
+		    tmp = (MAXPHASE / MINSEC) << SHIFT_UPDATE;
+		if (tmp < -(MAXPHASE / MINSEC) << SHIFT_UPDATE)
+		    tmp = -(MAXPHASE / MINSEC) << SHIFT_UPDATE;
+
+		offset_adj = shiftR(tmp, SHIFT_UPDATE); /* (usec/sec) = ppm */
+		ntp_offset -= tmp;
+
+		interval_sum -= NSEC_PER_SEC;
+	}
+
+/* XXX - if still needed do equiv code to switching time_next_adjust */
+
+	write_sequnlock_irq(&ntp_lock);
+
+}
+
+
+/* called only by ntp_adjtimex while holding ntp_lock */
+static int ntp_hardupdate(long offset, struct timeval tv)
+{
+	int ret;
+	long tmp, interval;
+
+	ret = 0;
+	if (!(ntp_status & STA_PLL))
+		return ret;
+
+	tmp = offset;
+	/* Make sure offset is bounded by MAXPHASE */
+	if (tmp > MAXPHASE)
+		tmp = MAXPHASE;
+	if (tmp < -MAXPHASE)
+		tmp = -MAXPHASE;
+
+	ntp_offset = tmp << SHIFT_UPDATE;
+
+	if ((ntp_status & STA_FREQHOLD) || (ntp_reftime == 0))
+		ntp_reftime = tv.tv_sec;
+
+	/* calculate seconds since last call to hardupdate */
+	interval = tv.tv_sec - ntp_reftime;
+	ntp_reftime = tv.tv_sec;
+
+	if ((ntp_status & STA_FLL) && (interval >= MINSEC)) {
+		long damping;
+		tmp = (offset / interval); /* ppm (usec/sec)*/
+
+		/* convert to shifted ppm, then apply damping factor */
+
+		/* calculate damping factor - XXX bigger comment!*/
+		damping = SHIFT_KH - SHIFT_USEC;
+
+		/* apply damping factor */
+		ntp_freq += shiftR(tmp,damping);
+
+		printk("ntp->freq change: %ld\n",shiftR(tmp,damping));
+
+	} else if ((ntp_status & STA_PLL) && (interval < MAXSEC)) {
+		long damping;
+		tmp = offset * interval; /* ppm XXX - not quite*/
+
+		/* calculate damping factor - XXX bigger comment!*/
+		damping = (2 * ntp_constant) + SHIFT_KF - SHIFT_USEC;
+
+		/* apply damping factor */
+		ntp_freq += shiftR(tmp,damping);
+
+		printk("ntp->freq change: %ld\n", shiftR(tmp,damping));
+
+	} else { /* interval out of bounds */
+		printk("interval out of bounds: %ld\n", interval);
+		ret = -1; /* TIME_ERROR */
+	}
+
+	/* bound ntp_freq */
+	if (ntp_freq > ntp_tolerance)
+		ntp_freq = ntp_tolerance;
+	if (ntp_freq < -ntp_tolerance)
+		ntp_freq = -ntp_tolerance;
+
+	return ret;
+}
+
+/* int ntp_adjtimex(struct timex* tx)
+ *	Interface to change NTP state machine
+ */
+int ntp_adjtimex(struct timex* tx)
+{
+	long save_offset;
+	int result;
+
+/*=[Sanity checking]===============================*/
+	/* Check capabilities if we're trying to modify something */
+	if (tx->modes && !capable(CAP_SYS_TIME))
+		return -EPERM;
+
+	/* frequency adjustment limited to +/- MAXFREQ */
+	if ((tx->modes & ADJ_FREQUENCY)
+			&& (abs(tx->freq) > MAXFREQ))
+		return -EINVAL;
+
+	/* maxerror adjustment limited to NTP_PHASE_LIMIT */
+	if ((tx->modes & ADJ_MAXERROR)
+			&& (tx->maxerror < 0
+				|| tx->maxerror >= NTP_PHASE_LIMIT))
+		return -EINVAL;
+
+	/* esterror adjustment limited to NTP_PHASE_LIMIT */
+	if ((tx->modes & ADJ_ESTERROR)
+			&& (tx->esterror < 0
+				|| tx->esterror >= NTP_PHASE_LIMIT))
+		return -EINVAL;
+
+	/* constant adjustment must be non-negative */
+	if ((tx->modes & ADJ_TIMECONST)
+			&& (tx->constant < 0))
+		return -EINVAL;
+
+	/* Single shot mode can only be used by itself */
+	if (((tx->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
+			&& (tx->modes != ADJ_OFFSET_SINGLESHOT))
+		return -EINVAL;
+
+	/* offset adjustment limited to +/- MAXPHASE */
+	if ((tx->modes != ADJ_OFFSET_SINGLESHOT)
+			&& (tx->modes & ADJ_OFFSET)
+			&& (abs(tx->offset)>= MAXPHASE))
+		return -EINVAL;
+
+	/* tick adjustment limited to +/- 10% */
+	if ((tx->modes & ADJ_TICK)
+			&& ((tx->tick < 900000/USER_HZ)
+				||(tx->tick > 1100000/USER_HZ)))
+		return -EINVAL;
+
+	/* dbg output XXX - yank me! */
+	if(tx->modes) {
+		printk("adjtimex: tx->offset: %ld    tx->freq: %ld\n",
+				tx->offset, tx->freq);
+	}
+
+/*=[Kernel input bits]==========================*/
+	write_seqlock_irq(&ntp_lock);
+
+	result = ntp_state;
+
+	/* For ADJ_OFFSET_SINGLESHOT we must return the old offset */
+	save_offset = shiftR(ntp_offset, SHIFT_UPDATE);
+
+	/* Process input parameters */
+	if (tx->modes & ADJ_STATUS) {
+		ntp_status &=  STA_RONLY;
+		ntp_status |= tx->status & ~STA_RONLY;
+	}
+
+	if (tx->modes & ADJ_FREQUENCY)
+		ntp_freq = tx->freq;
+
+	if (tx->modes & ADJ_MAXERROR)
+		ntp_maxerror = tx->maxerror;
+
+	if (tx->modes & ADJ_ESTERROR)
+		ntp_esterror = tx->esterror;
+
+	if (tx->modes & ADJ_TIMECONST)
+		ntp_constant = tx->constant;
+
+	if (tx->modes & ADJ_OFFSET) {
+		if (tx->modes == ADJ_OFFSET_SINGLESHOT) {
+			if (tx->offset < 0)
+				singleshot_adj = -MAX_SINGLESHOT_ADJ;
+			else
+				singleshot_adj = MAX_SINGLESHOT_ADJ;
+			/* Calculate single shot offset interval:
+			 *
+			 *	offset is in usec, MAX_SINGLESHOT_ADJ in ppm
+			 *	(usec/sec), so:
+			 *	len (sec)  = abs(offset)/MAX_SINGLESHOT_ADJ
+			 *	len (nsec) = abs(offset)*NSEC_PER_SEC
+			 *			/MAX_SINGLESHOT_ADJ
+			 *	len (nsec) = abs(offset)*(NSEC_PER_SEC/MAX_SINGLESHOT_ADJ)
+			 */
+			ss_offset_len = abs(tx->offset);
+			ss_offset_len *= NSEC_PER_SEC / MAX_SINGLESHOT_ADJ;
+		}
+		/* call hardupdate() */
+		if (ntp_hardupdate(tx->offset, tx->time))
+			result = TIME_ERROR;
+	}
+
+	if (tx->modes & ADJ_TICK) {
+		/* first calculate usec/user_tick offset */
+		tick_adj = (USEC_PER_SEC/USER_HZ) - tx->tick;
+		/* multiply by user_hz to get usec/sec => ppm */
+		tick_adj *= USER_HZ;
+		/* save tx->tick for future calls to adjtimex */
+		ntp_tick = tx->tick;
+	}
+
+	if ((ntp_status & (STA_UNSYNC|STA_CLOCKERR)) != 0 )
+		result = TIME_ERROR;
+
+/*=[Kernel output bits]================================*/
+	/* write kernel state to user timex values*/
+	if ((tx->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
+		tx->offset = save_offset;
+	else
+		tx->offset = shiftR(ntp_offset, SHIFT_UPDATE);
+
+	tx->freq = ntp_freq;
+	tx->maxerror = ntp_maxerror;
+	tx->esterror = ntp_esterror;
+	tx->status = ntp_status;
+	tx->constant = ntp_constant;
+	tx->precision = ntp_precision;
+	tx->tolerance = ntp_tolerance;
+
+	/* PPS is not implemented, so these are zero */
+	tx->ppsfreq	= /*XXX - Not Implemented!*/ 0;
+	tx->jitter	= /*XXX - Not Implemented!*/ 0;
+	tx->shift	= /*XXX - Not Implemented!*/ 0;
+	tx->stabil	= /*XXX - Not Implemented!*/ 0;
+	tx->jitcnt	= /*XXX - Not Implemented!*/ 0;
+	tx->calcnt	= /*XXX - Not Implemented!*/ 0;
+	tx->errcnt	= /*XXX - Not Implemented!*/ 0;
+	tx->stbcnt	= /*XXX - Not Implemented!*/ 0;
+
+	write_sequnlock_irq(&ntp_lock);
+
+	return result;
+}
+
+
+/* int ntp_leapsecond(struct timespec now):
+ *	NTP Leapsecond processing code. Returns the number of
+ *	seconds (-1, 0, or 1) that should be added to the current
+ *	time to properly adjust for leapseconds.
+ */
+int ntp_leapsecond(struct timespec now)
+{
+	/*
+	 * Leap second processing. If in leap-insert state at
+	 * the end of the day, the system clock is set back one
+	 * second; if in leap-delete state, the system clock is
+	 * set ahead one second. The microtime() routine or
+	 * external clock driver will ensure that reported time
+	 * is always monotonic. The ugly divides should be
+	 * replaced.
+	 */
+	static time_t leaptime = 0;
+
+	switch (ntp_state) {
+	case TIME_OK:
+		if (ntp_status & STA_INS) {
+			ntp_state = TIME_INS;
+			/* calculate end of today (23:59:59)*/
+			leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
+		}
+		else if (ntp_status & STA_DEL) {
+			ntp_state = TIME_DEL;
+			/* calculate end of today (23:59:59)*/
+			leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
+		}
+		break;
+
+	case TIME_INS:
+		/* Once we are past leaptime (23:59:59), insert the second */
+		if (now.tv_sec > leaptime) {
+			ntp_state = TIME_OOP;
+			printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n");
+
+			return -1;
+		}
+		break;
+
+	case TIME_DEL:
+		/* Once we are at (or past) leaptime, delete the second */
+		if (now.tv_sec >= leaptime) {
+			ntp_state = TIME_WAIT;
+			printk(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n");
+
+			return 1;
+		}
+		break;
+
+	case TIME_OOP:
+		/*  Wait for the end of the leap second*/
+		if (now.tv_sec > (leaptime + 1))
+			ntp_state = TIME_WAIT;
+		break;
+
+	case TIME_WAIT:
+		if (!(ntp_status & (STA_INS | STA_DEL)))
+			ntp_state = TIME_OK;
+	}
+
+	return 0;
+}
+
+/* void ntp_clear(void):
+ *	Clears the NTP state machine.
+ */
+void ntp_clear(void)
+{
+	write_seqlock_irq(&ntp_lock);
+
+	/* clear everything */
+	ntp_status |= STA_UNSYNC;
+	ntp_maxerror = NTP_PHASE_LIMIT;
+	ntp_esterror = NTP_PHASE_LIMIT;
+	ss_offset_len=0;
+	singleshot_adj=0;
+	tick_adj=0;
+	offset_adj =0;
+
+	write_sequnlock_irq(&ntp_lock);
+}
+
+/* int get_ntp_status(void):
+ *  Returns the NTP status.
+ */
+int get_ntp_status(void)
+{
+	return ntp_status;
+}
+
diff -Nru a/kernel/time.c b/kernel/time.c
--- a/kernel/time.c	2004-12-07 16:47:19 -08:00
+++ b/kernel/time.c	2004-12-07 16:47:19 -08:00
@@ -36,6 +36,7 @@
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
+#include <linux/timeofday.h>
 
 /* 
  * The timezone where the local system is located.  Used as a default by some
@@ -219,6 +220,7 @@
 /* adjtimex mainly allows reading (and writing, if superuser) of
  * kernel time-keeping variables. used by xntpd.
  */
+#ifndef CONFIG_NEWTOD
 int do_adjtimex(struct timex *txc)
 {
         long ltemp, mtemp, save_adjust;
@@ -401,6 +403,7 @@
 	do_gettimeofday(&txc->time);
 	return(result);
 }
+#endif
 
 asmlinkage long sys_adjtimex(struct timex __user *txc_p)
 {
diff -Nru a/kernel/timeofday.c b/kernel/timeofday.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/kernel/timeofday.c	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,313 @@
+/*********************************************************************
+* linux/kernel/timeofday.c
+*
+* Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+*
+* This file contains the functions which access and manage
+* the system's time of day functionality.
+*
+* Revision History:
+* 2004-09-02:	A0
+*	o First pass sent to lkml for review.
+* 2004-12-07:	A1
+*	o Rework of timesource structure
+*	o Sent to lkml for review
+*
+* TODO List:
+*	o More testing
+**********************************************************************/
+
+#include <linux/timeofday.h>
+#include <linux/timesource.h>
+#include <linux/ntp.h>
+#include <linux/timex.h>
+
+/*XXX - remove later */
+#define TIME_DBG 1
+#define TIME_DBG_FREQ 120000
+
+/*[Nanosecond based variables]----------------
+ * system_time:
+ *	Monotonically increasing counter of the number of nanoseconds
+ *	since boot.
+ * wall_time_offset:
+ *	Offset added to system_time to provide accurate time-of-day
+ */
+static nsec_t system_time;
+static nsec_t wall_time_offset;
+
+
+/*[Cycle based variables]----------------
+ * offset_base:
+ *	Value of the timesource at the last timeofday_interrupt_hook()
+ *	(adjusted slightly to account for rounded off cycles)
+ */
+static cycle_t offset_base;
+
+/*[Time source data]-------------------
+ * timesource:
+ *	current timesource pointer
+ */
+static struct timesource_t *timesource;
+
+/*[Locks]----------------------------
+ * system_time_lock:
+ *	generic lock for all locally scoped time values
+ */
+static seqlock_t system_time_lock = SEQLOCK_UNLOCKED;
+
+
+/* [XXX - Hacks]--------------------
+ *	Makes stuff compile
+ */
+extern unsigned long get_cmos_time(void);
+extern void sync_persistant_clock(struct timespec ts);
+
+/* get_lowres_timestamp():
+ *		Returns a low res timestamp.
+ *		(ie: the value of system_time as calculated at
+ *			the last invocation of timeofday_interrupt_hook() )
+ */
+nsec_t get_lowres_timestamp(void)
+{
+	nsec_t ret;
+	unsigned long seq;
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		/* quickly grab system_time*/
+		ret = system_time;
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	return ret;
+}
+
+/* get_lowres_timeofday():
+ *		Returns a low res time of day, as calculated at the
+ *		last invocation of timeofday_interrupt_hook()
+ */
+nsec_t get_lowres_timeofday(void)
+{
+	nsec_t ret;
+	unsigned long seq;
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		/* quickly calculate low-res time of day */
+		ret = system_time + wall_time_offset;
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	return ret;
+}
+
+
+/* __monotonic_clock():
+ *		private function, must hold system_time_lock when being
+ *		called. Returns the monotonically increasing number of
+ *		nanoseconds since the system booted (adjusted by NTP scaling)
+ */
+static nsec_t __monotonic_clock(void)
+{
+	nsec_t ret, ns_offset;
+	cycle_t now, delta;
+
+	/* read timesource */
+	now = read_timesource(timesource);
+
+	/* calculate the delta since the last clock_interrupt */
+	delta = (now - offset_base) & timesource->mask;
+
+	/* convert to nanoseconds */
+	ns_offset = cyc2ns(timesource, delta, NULL);
+
+	/* apply the NTP scaling */
+	ns_offset = ntp_scale(ns_offset);
+
+	/* add result to system time */
+	ret = system_time + ns_offset;
+
+	return ret;
+}
+
+
+/* do_monotonic_clock():
+ *		Returns the monotonically increasing number of nanoseconds
+ *		since the system booted via __monotonic_clock()
+ */
+nsec_t do_monotonic_clock(void)
+{
+	nsec_t ret;
+	unsigned long seq;
+
+	/* atomically read __monotonic_clock() */
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		ret = __monotonic_clock();
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	return ret;
+}
+
+
+/* do_gettimeofday():
+ *		Returns the time of day
+ */
+void do_gettimeofday(struct timeval *tv)
+{
+	nsec_t wall, sys;
+	unsigned long seq;
+
+	/* atomically read wall and sys time */
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		wall = wall_time_offset;
+		sys = __monotonic_clock();
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	/* add them and convert to timeval */
+	*tv = ns2timeval(wall+sys);
+}
+
+
+/* do_settimeofday():
+ *		Sets the time of day
+ */
+int do_settimeofday(struct timespec *tv)
+{
+	/* convert timespec to ns */
+	nsec_t newtime = timespec2ns(tv);
+
+	/* atomically adjust wall_time_offset to the desired value */
+	write_seqlock_irq(&system_time_lock);
+
+	wall_time_offset = newtime - __monotonic_clock();
+
+	/* clear NTP settings */
+	ntp_clear();
+
+	write_sequnlock_irq(&system_time_lock);
+
+	return 0;
+}
+
+/* do_adjtimex:
+ *		Userspace NTP daemon's interface to the kernel NTP variables
+ */
+int do_adjtimex(struct timex *tx)
+{
+	do_gettimeofday(&tx->time); /* set timex->time*/
+								/* Note: We set tx->time first, */
+								/* because ntp_adjtimex uses it */
+
+	return ntp_adjtimex(tx);			/* call out to NTP code */
+}
+
+
+/* timeofday_interrupt_hook:
+ *		calculates the delta since the last interrupt,
+ *		updates system time and clears the offset.
+ *		likely called by timer_interrupt()
+ */
+void timeofday_interrupt_hook(void)
+{
+	cycle_t now, delta, remainder;
+	nsec_t ns, ntp_ns;
+	long leapsecond;
+	struct timesource_t* next;
+
+	write_seqlock(&system_time_lock);
+
+	/* read time source */
+	now = read_timesource(timesource);
+
+	/* calculate cycle delta */
+	delta = (now - offset_base) & timesource->mask;
+
+	/* convert cycles to ns  and save remainder */
+	ns = cyc2ns(timesource, delta, &remainder);
+
+	/* apply NTP scaling factor for this tick */
+	ntp_ns = ntp_scale(ns);
+
+#if TIME_DBG /* XXX - remove later*/
+{
+	static int dbg=0;
+	if(!(dbg++%TIME_DBG_FREQ)){
+		printk(KERN_DEBUG "now: %lluc - then: %lluc = delta: %lluc -> %llu ns + %llu cyc (ntp: %llu ns)\n",
+			now, offset_base, delta, ns, remainder, ntp_ns);
+	}
+}
+#endif
+	/* update system_time */
+	system_time += ntp_ns;
+
+	/* reset the offset_base */
+	offset_base = now;
+
+	/* subtract remainder to account for rounded off cycles */
+	offset_base = (offset_base - remainder) & timesource->mask;
+
+	/* advance the ntp state machine by ns*/
+	ntp_advance(ns);
+
+	/* do ntp leap second processing*/
+	leapsecond = ntp_leapsecond(ns2timespec(system_time+wall_time_offset));
+	if (leapsecond == 1) /* XXX could this be cleaner? */
+		wall_time_offset += NSEC_PER_SEC;
+	if (leapsecond == -1)
+		wall_time_offset -= NSEC_PER_SEC;
+
+	/* sync the persistent clock */
+	if (!(get_ntp_status() & STA_UNSYNC))
+		sync_persistant_clock(ns2timespec(system_time + wall_time_offset));
+
+	/* if necessary, switch timesources */
+	next = get_next_timesource();
+	if (next != timesource) {
+		/* immediately set new offset_base */
+		offset_base = read_timesource(next);
+		/* swap timesources */
+		timesource = next;
+		printk(KERN_INFO "Time: %s timesource has been installed\n",
+					timesource->name);
+	}
+
+
+	/* update legacy time values */
+	write_seqlock(&xtime_lock);
+	xtime = ns2timespec(system_time + wall_time_offset);
+	wall_to_monotonic = ns2timespec(wall_time_offset);
+	wall_to_monotonic.tv_sec = -wall_to_monotonic.tv_sec;
+	wall_to_monotonic.tv_nsec = -wall_to_monotonic.tv_nsec;
+	write_sequnlock(&xtime_lock);
+
+	write_sequnlock(&system_time_lock);
+}
+
+/* timeofday_init():
+ *		Initializes time variables
+ */
+void timeofday_init(void)
+{
+	write_seqlock(&system_time_lock);
+
+	/* initialize the timesource variable */
+	timesource = get_next_timesource();
+
+	/* clear and initialize offsets*/
+	offset_base = read_timesource(timesource);
+	wall_time_offset = ((u64)get_cmos_time()) * NSEC_PER_SEC;
+
+	/* clear NTP scaling factor*/
+	ntp_clear();
+
+	write_sequnlock(&system_time_lock);
+
+	return;
+}
diff -Nru a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c	2004-12-07 16:47:19 -08:00
+++ b/kernel/timer.c	2004-12-07 16:47:19 -08:00
@@ -568,6 +568,7 @@
 int tickadj = 500/HZ ? : 1;		/* microsecs */
 
 
+#ifndef CONFIG_NEWTOD
 /*
  * phase-lock loop variables
  */
@@ -798,6 +799,9 @@
 		}
 	} while (ticks);
 }
+#else /* CONFIG_NEWTOD */
+#define update_wall_time(x)
+#endif /* CONFIG_NEWTOD */
 
 static inline void do_process_times(struct task_struct *p,
 	unsigned long user, unsigned long system)
diff -Nru a/kernel/timesource.c b/kernel/timesource.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/kernel/timesource.c	2004-12-07 16:47:19 -08:00
@@ -0,0 +1,71 @@
+/*********************************************************************
+* linux/kernel/timesource.c
+*
+* Copyright (C) 2004 IBM, John Stultz (johnstul@us.ibm.com)
+*
+* This file contains the functions which manage
+* timesource drivers.
+*
+* Revision History:
+* 2004-12-07:	A1
+*	o Rework of timesource structure
+*	o Sent to lkml for review
+*
+* TODO List:
+*	o Allow timesource drivers to be registered and unregistered
+*	o Keep list of all currently registered timesources
+*	o Use "clock=xyz" boot option for selection overrides.
+*	o sysfs interface for manually choosing timesources
+*	o get rid of timesource_jiffies extern
+**********************************************************************/
+
+#include <linux/timesource.h>
+
+/*[Timesource internal variables]---------
+ * curr_timesource:
+ *	currently selected timesource. Initialized to timesource_jiffies.
+ * next_timesource:
+ *	pending next selected timesource.
+ * timesource_lock:
+ *	protects manipulations to curr_timesource and next_timesource
+ */
+/* XXX - Need to have a better way for initializing curr_timesource */
+extern struct timesource_t timesource_jiffies;
+static struct timesource_t *curr_timesource = &timesource_jiffies;
+static struct timesource_t *next_timesource;
+static seqlock_t timesource_lock = SEQLOCK_UNLOCKED;
+
+
+/* register_timesource():
+ *		Used to install a new timesource
+ */
+void register_timesource(struct timesource_t* t)
+{
+	write_seqlock(&timesource_lock);
+
+	/* XXX - check override */
+
+	/* if next_timesource has been set, make sure we beat that one too */
+	if (next_timesource) {
+		if (t->priority > next_timesource->priority)
+			next_timesource = t;
+	} else if(t->priority > curr_timesource->priority)
+		next_timesource = t;
+
+	write_sequnlock(&timesource_lock);
+}
+
+/* get_next_timesource():
+ *		Returns the selected timesource
+ */
+struct timesource_t* get_next_timesource(void)
+{
+	write_seqlock(&timesource_lock);
+	if (next_timesource) {
+		curr_timesource = next_timesource;
+		next_timesource = NULL;
+	}
+	write_sequnlock(&timesource_lock);
+
+	return curr_timesource;
+}




* [RFC] new timeofday arch specific hooks (v.A1)
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
@ 2004-12-08  1:57   ` john stultz
  2004-12-08  1:58     ` [RFC] new timeofday timesources (v.A1) john stultz
  2004-12-08  9:17   ` [RFC] new timeofday core subsystem (v.A1) Pavel Machek
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: john stultz @ 2004-12-08  1:57 UTC (permalink / raw)
  To: lkml
  Cc: tim, george anzinger, albert, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

All,
	This patch implements the minimal architecture specific hooks to enable
the new time of day subsystem code for i386 and x86-64. It applies on
top of my linux-2.6.10-rc3_timeofday-core_A1 patch, and with this patch
applied, you can test the new time of day subsystem.

	Basically it adds the call to timeofday_interrupt_hook() and cuts a lot
of code out of the build via #ifdefs. I know, I know, #ifdefs are ugly
and bad, and the final patch will just remove the old code. For now this
allows us to be flexible and easily switch between the two
implementations.
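
For readers who want to see what the hook does on each tick without reading the core patch, here is a minimal userspace sketch. It is an illustration only, not the kernel code: the struct tod_model type and tod_model_tick() are hypothetical names, NTP scaling is left out, and the rounding remainder that the real hook folds back into offset_base is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; the field names mirror the patch,
 * but this is not the kernel implementation. */
struct tod_model {
	uint64_t mask;        /* valid counter bits, e.g. 0xFFFFFF for 24 bits */
	uint32_t mult;        /* ns = (cycles * mult) >> shift */
	uint32_t shift;
	uint64_t offset_base; /* counter value seen at the previous tick */
	uint64_t system_time; /* accumulated nanoseconds */
};

/* Advance the model by one timer interrupt, given the current counter value.
 * Returns the updated system_time. */
static uint64_t tod_model_tick(struct tod_model *t, uint64_t now)
{
	uint64_t delta;

	assert(t);

	/* masking keeps the subtraction correct across a counter wrap */
	delta = (now - t->offset_base) & t->mask;

	/* convert cycles to nanoseconds (remainder dropped in this sketch) */
	t->system_time += (delta * t->mult) >> t->shift;

	/* the next delta is measured from this reading */
	t->offset_base = now & t->mask;

	return t->system_time;
}
```

With the acpi_pm values from the timesources patch (mask 0xFFFFFF, mult 286070, shift 10), a counter that wraps from 0xFFFFF0 to 0x000010 still yields a delta of 32 cycles rather than a huge negative jump.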

	The only new code is the sync_persistant_clock() function, which is
mostly ripped out of do_timer_interrupt(). Pretty uninteresting.
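
As a rough illustration of the check sync_persistant_clock() performs, the x86-64 variant boils down to "at most once every ~11 minutes, and only within half a tick of the half-second mark". The sketch below models just that decision in userspace. The struct rtc_sync_model type and rtc_sync_due() are hypothetical names, and the actual RTC write is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define RTC_SYNC_INTERVAL_SEC 660 /* ~11 minutes, as in the patch */

/* Hypothetical userspace model of the x86-64 check; not kernel code. */
struct rtc_sync_model {
	long next_update_sec; /* no RTC write until the clock passes this second */
	long tick_nsec;       /* length of one timer tick in nanoseconds */
};

/* Returns true when the RTC should be written for the time (sec, nsec),
 * and schedules the next allowed write ~11 minutes later. */
static bool rtc_sync_due(struct rtc_sync_model *m, long sec, long nsec)
{
	assert(m && m->tick_nsec > 0);

	/* too soon since the last write */
	if (sec <= m->next_update_sec)
		return false;

	/* only write within half a tick of the 500 ms mark */
	if (labs(nsec - 500000000L) > m->tick_nsec / 2)
		return false;

	m->next_update_sec = sec + RTC_SYNC_INTERVAL_SEC;
	return true;
}
```

The i386 variant in the patch expresses the same window with USEC_AFTER and USEC_BEFORE, and retries after 60 s if the write fails. That retry path is not modelled here.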

New in this version:
o x86-64 hooks
o generic div_long_long_rem implementation (swiped from jiffies.h)

I look forward to your comments and feedback.

thanks
-john

linux-2.6.9-rc1_timeofday-arch_A1.patch
=======================================
diff -Nru a/arch/i386/Kconfig b/arch/i386/Kconfig
--- a/arch/i386/Kconfig	2004-12-07 16:47:46 -08:00
+++ b/arch/i386/Kconfig	2004-12-07 16:47:46 -08:00
@@ -14,6 +14,10 @@
 	  486, 586, Pentiums, and various instruction-set-compatible chips by
 	  AMD, Cyrix, and others.
 
+config NEWTOD
+	bool
+	default y
+
 config MMU
 	bool
 	default y
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c	2004-12-07 16:47:46 -08:00
+++ b/arch/i386/kernel/time.c	2004-12-07 16:47:46 -08:00
@@ -67,6 +67,8 @@
 
 #include "io_ports.h"
 
+#include <linux/timeofday.h>
+
 extern spinlock_t i8259A_lock;
 int pit_latch_buggy;              /* extern */
 
@@ -87,6 +89,7 @@
 
 struct timer_opts *cur_timer = &timer_none;
 
+#ifndef CONFIG_NEWTOD
 /*
  * This version of gettimeofday has microsecond resolution
  * and better than microsecond precision on fast x86 machines with TSC.
@@ -169,6 +172,7 @@
 }
 
 EXPORT_SYMBOL(do_settimeofday);
+#endif
 
 static int set_rtc_mmss(unsigned long nowtime)
 {
@@ -194,11 +198,39 @@
  *		Note: This function is required to return accurate
  *		time even in the absence of multiple timer ticks.
  */
+#ifndef CONFIG_NEWTOD
 unsigned long long monotonic_clock(void)
 {
 	return cur_timer->monotonic_clock();
 }
 EXPORT_SYMBOL(monotonic_clock);
+#endif
+
+void sync_persistant_clock(struct timespec ts)
+{
+	/*
+	 * If we have an externally synchronized Linux clock, then update
+	 * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
+	 * called as close as possible to 500 ms before the new second starts.
+	 */
+	if (ts.tv_sec > last_rtc_update + 660 &&
+	    (ts.tv_nsec / 1000)
+			>= USEC_AFTER - ((unsigned) TICK_SIZE) / 2 &&
+	    (ts.tv_nsec / 1000)
+			<= USEC_BEFORE + ((unsigned) TICK_SIZE) / 2) {
+		/* horrible...FIXME */
+		if (efi_enabled) {
+			if (efi_set_rtc_mmss(ts.tv_sec) == 0)
+				last_rtc_update = ts.tv_sec;
+			else
+				last_rtc_update = ts.tv_sec - 600;
+		} else if (set_rtc_mmss(ts.tv_sec) == 0)
+			last_rtc_update = ts.tv_sec;
+		else
+			last_rtc_update = ts.tv_sec - 600; /* do it again in 60 s */
+	}
+
+}
 
 #if defined(CONFIG_SMP) && defined(CONFIG_FRAME_POINTER)
 unsigned long profile_pc(struct pt_regs *regs)
@@ -238,6 +270,7 @@
 
 	do_timer_interrupt_hook(regs);
 
+#ifndef CONFIG_NEWTOD
 	/*
 	 * If we have an externally synchronized Linux clock, then update
 	 * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
@@ -260,6 +293,7 @@
 		else
 			last_rtc_update = xtime.tv_sec - 600; /* do it again in 60 s */
 	}
+#endif
 
 #ifdef CONFIG_MCA
 	if( MCA_bus ) {
@@ -294,11 +328,15 @@
 	 */
 	write_seqlock(&xtime_lock);
 
+#ifndef CONFIG_NEWTOD
 	cur_timer->mark_offset();
+#endif
  
 	do_timer_interrupt(irq, NULL, regs);
 
 	write_sequnlock(&xtime_lock);
+
+	timeofday_interrupt_hook();
 	return IRQ_HANDLED;
 }
 
diff -Nru a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
--- a/arch/x86_64/Kconfig	2004-12-07 16:47:46 -08:00
+++ b/arch/x86_64/Kconfig	2004-12-07 16:47:46 -08:00
@@ -24,6 +24,10 @@
 	bool
 	default y
 
+config NEWTOD
+	bool
+	default y
+
 config MMU
 	bool
 	default y
diff -Nru a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c
--- a/arch/x86_64/kernel/time.c	2004-12-07 16:47:46 -08:00
+++ b/arch/x86_64/kernel/time.c	2004-12-07 16:47:46 -08:00
@@ -35,6 +35,7 @@
 #include <asm/sections.h>
 #include <linux/cpufreq.h>
 #include <linux/hpet.h>
+#include <linux/timeofday.h>
 #ifdef CONFIG_X86_LOCAL_APIC
 #include <asm/apic.h>
 #endif
@@ -106,6 +107,7 @@
 
 unsigned int (*do_gettimeoffset)(void) = do_gettimeoffset_tsc;
 
+#ifndef CONFIG_NEWTOD
 /*
  * This version of gettimeofday() has microsecond resolution and better than
  * microsecond precision, as we're using at least a 10 MHz (usually 14.31818
@@ -180,6 +182,7 @@
 }
 
 EXPORT_SYMBOL(do_settimeofday);
+#endif /* CONFIG_NEWTOD */
 
 unsigned long profile_pc(struct pt_regs *regs)
 {
@@ -281,6 +284,7 @@
 }
 
 
+#ifndef CONFIG_NEWTOD
 /* monotonic_clock(): returns # of nanoseconds passed since time_init()
  *		Note: This function is required to return accurate
  *		time even in the absence of multiple timer ticks.
@@ -357,6 +361,26 @@
     }
 #endif
 }
+#endif /* CONFIG_NEWTOD */
+
+
+void sync_persistant_clock(struct timespec ts)
+{
+	static unsigned long rtc_update = 0;
+	/*
+	 * If we have an externally synchronized Linux clock, then update
+	 * CMOS clock accordingly every ~11 minutes. set_rtc_mmss() will
+	 * be called in the jiffy closest to exactly 500 ms before the
+	 * next second. If the update fails, we don't care, as it'll be
+	 * updated on the next turn, and the problem (time way off) isn't
+	 * likely to go away much sooner anyway.
+	 */
+	if (ts.tv_sec > rtc_update &&
+		abs(ts.tv_nsec - 500000000) <= tick_nsec / 2) {
+		set_rtc_mmss(ts.tv_sec);
+		rtc_update = ts.tv_sec + 660;
+	}
+}
 
 static irqreturn_t timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
 {
@@ -373,6 +397,7 @@
 
 	write_seqlock(&xtime_lock);
 
+#ifndef CONFIG_NEWTOD
 	if (vxtime.hpet_address) {
 		offset = hpet_readl(HPET_T0_CMP) - hpet_tick;
 		delay = hpet_readl(HPET_COUNTER) - offset;
@@ -422,6 +447,7 @@
 		handle_lost_ticks(lost, regs);
 		jiffies += lost;
 	}
+#endif /* CONFIG_NEWTOD */
 
 /*
  * Do the timer stuff.
@@ -445,6 +471,7 @@
 		smp_local_timer_interrupt(regs);
 #endif
 
+#ifndef CONFIG_NEWTOD
 /*
  * If we have an externally synchronized Linux clock, then update CMOS clock
  * accordingly every ~11 minutes. set_rtc_mmss() will be called in the jiffy
@@ -458,9 +485,11 @@
 		set_rtc_mmss(xtime.tv_sec);
 		rtc_update = xtime.tv_sec + 660;
 	}
- 
+#endif /* CONFIG_NEWTOD */
+
 	write_sequnlock(&xtime_lock);
 
+	timeofday_interrupt_hook();
 	return IRQ_HANDLED;
 }
 
diff -Nru a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c
--- a/arch/x86_64/kernel/vsyscall.c	2004-12-07 16:47:46 -08:00
+++ b/arch/x86_64/kernel/vsyscall.c	2004-12-07 16:47:46 -08:00
@@ -171,8 +171,12 @@
 	BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
 	BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
 	map_vsyscall();
+/* XXX - disable vsyscall gettimeofday for now */
+#ifndef CONFIG_NEWTOD
 	sysctl_vsyscall = 1; 
-
+#else
+	sysctl_vsyscall = 0;
+#endif
 	return 0;
 }
 
diff -Nru a/include/asm-generic/div64.h b/include/asm-generic/div64.h
--- a/include/asm-generic/div64.h	2004-12-07 16:47:46 -08:00
+++ b/include/asm-generic/div64.h	2004-12-07 16:47:46 -08:00
@@ -55,4 +55,13 @@
 
 #endif /* BITS_PER_LONG */
 
+#ifndef div_long_long_rem
+#define div_long_long_rem(dividend,divisor,remainder) \
+({							\
+	u64 result = dividend;				\
+	*remainder = do_div(result,divisor);		\
+	result;						\
+})
+#endif
+
 #endif /* _ASM_GENERIC_DIV64_H */




* [RFC] new timeofday timesources (v.A1)
  2004-12-08  1:57   ` [RFC] new timeofday arch specific hooks (v.A1) john stultz
@ 2004-12-08  1:58     ` john stultz
  2004-12-08  2:02       ` john stultz
  0 siblings, 1 reply; 29+ messages in thread
From: john stultz @ 2004-12-08  1:58 UTC (permalink / raw)
  To: lkml
  Cc: tim, george anzinger, albert, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

All,
	This patch implements most of the time sources for i386 and x86-64
(tsc, pit, cyclone, acpi-pm and hpet). It applies on top of my
linux-2.6.10-rc3_timeofday-arch_A1 patch. It provides real timesources
(as opposed to the example jiffies timesource) that can be used for more
realistic testing.
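
To show how a registered timesource ends up replacing the jiffies one, here is a small userspace sketch of the priority selection done by register_timesource() and get_next_timesource() in the core patch. Locking is omitted, struct ts_model, struct ts_selector, ts_register() and ts_get_next() are hypothetical names, and the jiffies priority of 0 used in the usage note is an assumption (the acpi_pm and cyclone priorities are the ones in this patch).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; the real code takes timesource_lock. */
struct ts_model {
	const char *name;
	int priority;
};

struct ts_selector {
	const struct ts_model *curr; /* timesource in use */
	const struct ts_model *next; /* pending replacement, NULL if none */
};

/* Offer a new timesource; it becomes pending only if it beats both the
 * current timesource and any already pending one. */
static void ts_register(struct ts_selector *s, const struct ts_model *t)
{
	const struct ts_model *best;

	assert(s && s->curr && t);

	best = s->next ? s->next : s->curr;
	if (t->priority > best->priority)
		s->next = t;
}

/* Called once per tick: install the pending timesource, if there is one,
 * and return the timesource to use from now on. */
static const struct ts_model *ts_get_next(struct ts_selector *s)
{
	assert(s && s->curr);

	if (s->next) {
		s->curr = s->next;
		s->next = NULL;
	}
	return s->curr;
}
```

Starting from a jiffies entry with an assumed priority of 0, registering acpi_pm (200) and then cyclone (100) leaves acpi_pm pending, and the next call to ts_get_next() installs it.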

	This patch is the shabbiest of the three. It needs to be broken up and
cleaned up. The i386_pit.c is broken and the cpufreq code needs to be added
to tsc.c. Also acpi_pm and hpet need to be made generic so they can be
shared between i386 and x86-64. But for now it will get you going so you
can test and play with the core code.

New in this release:
o x86-64 hpet code
o more i386 tsc cleanups

thanks
-john

linux-2.6.10_rc3_timeofday-timesources_A1.patch
===================================================
diff -Nru a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile	2004-12-07 16:48:13 -08:00
+++ b/arch/i386/kernel/Makefile	2004-12-07 16:48:13 -08:00
@@ -7,7 +7,7 @@
 obj-y	:= process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \
 		ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
 		pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \
-		doublefault.o quirks.o
+		doublefault.o quirks.o tsc.o
 
 obj-y				+= cpu/
 obj-y				+= timers/
diff -Nru a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
--- a/arch/i386/kernel/setup.c	2004-12-07 16:48:13 -08:00
+++ b/arch/i386/kernel/setup.c	2004-12-07 16:48:13 -08:00
@@ -48,6 +48,7 @@
 #include <asm/io_apic.h>
 #include <asm/ist.h>
 #include <asm/io.h>
+#include <asm/tsc.h>
 #include "setup_arch_pre.h"
 #include <bios_ebda.h>
 
@@ -1421,6 +1422,7 @@
 	conswitchp = &dummy_con;
 #endif
 #endif
+	tsc_init();
 }
 
 #include "setup_arch_post.h"
diff -Nru a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c	2004-12-07 16:48:13 -08:00
+++ b/arch/i386/kernel/time.c	2004-12-07 16:48:13 -08:00
@@ -436,6 +436,7 @@
 
 void __init time_init(void)
 {
+#ifndef CONFIG_NEWTOD
 #ifdef CONFIG_HPET_TIMER
 	if (is_hpet_capable()) {
 		/*
@@ -453,6 +454,7 @@
 
 	cur_timer = select_timer();
 	printk(KERN_INFO "Using %s for high-res timesource\n",cur_timer->name);
+#endif
 
 	time_init_hook();
 }
diff -Nru a/arch/i386/kernel/timers/Makefile b/arch/i386/kernel/timers/Makefile
--- a/arch/i386/kernel/timers/Makefile	2004-12-07 16:48:13 -08:00
+++ b/arch/i386/kernel/timers/Makefile	2004-12-07 16:48:13 -08:00
@@ -4,6 +4,6 @@
 
 obj-y := timer.o timer_none.o timer_tsc.o timer_pit.o common.o
 
-obj-$(CONFIG_X86_CYCLONE_TIMER)	+= timer_cyclone.o
+#obj-$(CONFIG_X86_CYCLONE_TIMER)	+= timer_cyclone.o
 obj-$(CONFIG_HPET_TIMER)	+= timer_hpet.o
-obj-$(CONFIG_X86_PM_TIMER)	+= timer_pm.o
+#obj-$(CONFIG_X86_PM_TIMER)	+= timer_pm.o
diff -Nru a/arch/i386/kernel/timers/common.c b/arch/i386/kernel/timers/common.c
--- a/arch/i386/kernel/timers/common.c	2004-12-07 16:48:13 -08:00
+++ b/arch/i386/kernel/timers/common.c	2004-12-07 16:48:13 -08:00
@@ -22,8 +22,6 @@
  * device.
  */
 
-#define CALIBRATE_TIME	(5 * 1000020/HZ)
-
 unsigned long __init calibrate_tsc(void)
 {
 	mach_prepare_counter();
diff -Nru a/arch/i386/kernel/timers/timer.c b/arch/i386/kernel/timers/timer.c
--- a/arch/i386/kernel/timers/timer.c	2004-12-07 16:48:13 -08:00
+++ b/arch/i386/kernel/timers/timer.c	2004-12-07 16:48:13 -08:00
@@ -14,13 +14,13 @@
 /* list of timers, ordered by preference, NULL terminated */
 static struct init_timer_opts* __initdata timers[] = {
 #ifdef CONFIG_X86_CYCLONE_TIMER
-	&timer_cyclone_init,
+//	&timer_cyclone_init,
 #endif
 #ifdef CONFIG_HPET_TIMER
 	&timer_hpet_init,
 #endif
 #ifdef CONFIG_X86_PM_TIMER
-	&timer_pmtmr_init,
+//	&timer_pmtmr_init,
 #endif
 	&timer_tsc_init,
 	&timer_pit_init,
diff -Nru a/arch/i386/kernel/tsc.c b/arch/i386/kernel/tsc.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/arch/i386/kernel/tsc.c	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,39 @@
+#include <linux/init.h>
+#include <linux/timex.h>
+#include <asm/tsc.h>
+#include "mach_timer.h"
+
+unsigned long cpu_freq_khz;
+
+void tsc_init(void)
+{
+	unsigned long long start, end;
+	unsigned long count;
+	u64 delta64;
+	int i;
+
+	/* repeat 3 times to make sure the cache is warm */
+	for(i=0; i < 3; i++) {
+		mach_prepare_counter();
+		rdtscll(start);
+		mach_countup(&count);
+		rdtscll(end);
+	}
+	delta64 = end - start;
+
+	/* cpu freq too fast */
+	if(delta64 > (1ULL<<32))
+		return;
+	/* cpu freq too slow */
+	if (delta64 <= CALIBRATE_TIME)
+		return;
+
+	delta64 *= 1000;
+	do_div(delta64,CALIBRATE_TIME);
+	cpu_freq_khz = (unsigned long)delta64;
+
+	cpu_khz = cpu_freq_khz;
+
+	printk("Detected %lu.%03lu MHz processor.\n", cpu_khz / 1000, cpu_khz % 1000);
+
+}
diff -Nru a/drivers/timesource/Makefile b/drivers/timesource/Makefile
--- a/drivers/timesource/Makefile	2004-12-07 16:48:13 -08:00
+++ b/drivers/timesource/Makefile	2004-12-07 16:48:13 -08:00
@@ -1 +1,6 @@
 obj-y += jiffies.o
+obj-$(CONFIG_X86) += i386_tsc.o
+#obj-$(CONFIG_X86) += i386_pit.o
+obj-$(CONFIG_X86_SUMMIT) += cyclone.o
+obj-$(CONFIG_X86_PM_TIMER) += acpi_pm.o
+obj-$(CONFIG_X86_64) += x86-64_hpet.o
diff -Nru a/drivers/timesource/acpi_pm.c b/drivers/timesource/acpi_pm.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/acpi_pm.c	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,113 @@
+#include <linux/timesource.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <asm/io.h>
+#include "mach_timer.h"
+
+/* Number of PMTMR ticks expected during calibration run */
+#define PMTMR_TICKS_PER_SEC 3579545
+#define PMTMR_EXPECTED_RATE \
+  ((CALIBRATE_LATCH * (PMTMR_TICKS_PER_SEC >> 10)) / (CLOCK_TICK_RATE>>10))
+
+
+/* The I/O port the PMTMR resides at.
+ * The location is detected during setup_arch(),
+ * in arch/i386/acpi/boot.c */
+u32 pmtmr_ioport = 0;
+
+#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */
+
+static inline u32 read_pmtmr(void)
+{
+	u32 v1=0,v2=0,v3=0;
+	/* It has been reported that on various broken chipsets
+	 * (ICH4, PIIX4 and PIIX4E) the ACPI PM time source is
+	 * not latched, so you must read it multiple times to
+	 * ensure a safe value is read.
+	 */
+	do {
+		v1 = inl(pmtmr_ioport);
+		v2 = inl(pmtmr_ioport);
+		v3 = inl(pmtmr_ioport);
+	} while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
+			|| (v3 > v1 && v3 < v2));
+
+	/* mask the output to 24 bits */
+	return v2 & ACPI_PM_MASK;
+}
+
+
+static cycle_t acpi_pm_read(void)
+{
+	return (cycle_t)read_pmtmr();
+}
+
+struct timesource_t timesource_acpi_pm = {
+	.name = "acpi_pm",
+	.priority = 200,
+	.type = TIMESOURCE_FUNCTION,
+	.read_fnct = acpi_pm_read,
+	.mask = (cycle_t)ACPI_PM_MASK,
+	.mult = 286070,
+	.shift = 10,
+};
+
+/*
+ * Some boards have the PMTMR running way too fast. We check
+ * the PMTMR rate against PIT channel 2 to catch these cases.
+ */
+static int verify_pmtmr_rate(void)
+{
+	u32 value1, value2;
+	unsigned long count, delta;
+
+	mach_prepare_counter();
+	value1 = read_pmtmr();
+	mach_countup(&count);
+	value2 = read_pmtmr();
+	delta = (value2 - value1) & ACPI_PM_MASK;
+
+	/* Check that the PMTMR delta is within 5% of what we expect */
+	if (delta < (PMTMR_EXPECTED_RATE * 19) / 20 ||
+	    delta > (PMTMR_EXPECTED_RATE * 21) / 20) {
+		printk(KERN_INFO "PM-Timer running at invalid rate: %lu%% of normal - aborting.\n", 100UL * delta / PMTMR_EXPECTED_RATE);
+		return -1;
+	}
+
+	return 0;
+}
+
+
+static int init_acpi_pm_timesource(void)
+{
+	u32 value1, value2;
+	unsigned int i;
+
+	if (!pmtmr_ioport)
+		return -ENODEV;
+
+	/* "verify" this timing source */
+	value1 = read_pmtmr();
+	for (i = 0; i < 10000; i++) {
+		value2 = read_pmtmr();
+		if (value2 == value1)
+			continue;
+		if (value2 > value1)
+			goto pm_good;
+		if ((value2 < value1) && ((value2) < 0xFFF))
+			goto pm_good;
+		printk(KERN_INFO "PM-Timer had inconsistent results: %#x, %#x - aborting.\n", value1, value2);
+		return -EINVAL;
+	}
+	printk(KERN_INFO "PM-Timer had no reasonable result: %#x - aborting.\n", value1);
+	return -ENODEV;
+
+pm_good:
+	if (verify_pmtmr_rate() != 0)
+		return -ENODEV;
+
+	register_timesource(&timesource_acpi_pm);
+	return 0;
+}
+
+module_init(init_acpi_pm_timesource);
diff -Nru a/drivers/timesource/cyclone.c b/drivers/timesource/cyclone.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/cyclone.c	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,154 @@
+#include <linux/timesource.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/timex.h>
+#include <linux/init.h>
+
+#include <asm/io.h>
+#include <asm/pgtable.h>
+#include <asm/fixmap.h>
+#include "mach_timer.h"
+
+#define CYCLONE_CBAR_ADDR 0xFEB00CD0
+#define CYCLONE_PMCC_OFFSET 0x51A0
+#define CYCLONE_MPMC_OFFSET 0x51D0
+#define CYCLONE_MPCS_OFFSET 0x51A8
+#define CYCLONE_TIMER_FREQ 100000000
+#define CYCLONE_TIMER_MASK (((u64)1<<40)-1) /* 40 bit mask */
+
+unsigned long cyclone_freq_khz;
+
+int use_cyclone = 0;
+static u32* volatile cyclone_timer;	/* Cyclone MPMC0 register */
+
+/* helper macro to atomically read both cyclone counter registers */
+#define read_cyclone_counter(low,high) \
+	do{ \
+		high = cyclone_timer[1]; low = cyclone_timer[0]; \
+	} while (high != cyclone_timer[1]);
+
+
+static cycle_t cyclone_read(void)
+{
+	u32 low, high;
+	u64 ret;
+
+	read_cyclone_counter(low,high);
+	ret = ((u64)high << 32)|low;
+
+	return (cycle_t)ret;
+}
+
+struct timesource_t timesource_cyclone = {
+	.name = "cyclone",
+	.priority = 100,
+	.type = TIMESOURCE_FUNCTION,
+	.read_fnct = cyclone_read,
+	.mask = (cycle_t)CYCLONE_TIMER_MASK,
+	.mult = 10,
+	.shift = 0,
+};
+
+
+static void calibrate_cyclone(void)
+{
+	u32 startlow, starthigh, endlow, endhigh, delta32;
+	u64 start, end, delta64;
+	unsigned long i, count;
+	/* repeat 3 times to make sure the cache is warm */
+	for(i=0; i < 3; i++) {
+		mach_prepare_counter();
+		read_cyclone_counter(startlow,starthigh);
+		mach_countup(&count);
+		read_cyclone_counter(endlow,endhigh);
+	}
+	start = (u64)starthigh<<32|startlow;
+	end = (u64)endhigh<<32|endlow;
+
+	delta64 = end - start;
+	printk("cyclone delta: %llu\n", delta64);
+	delta64 *= (ACTHZ/1000)>>8;
+	printk("delta*hz = %llu\n", delta64);
+	delta32 = (u32)delta64;
+	cyclone_freq_khz = delta32/CALIBRATE_ITERATION;
+	printk("calculated cyclone_freq: %lu khz\n", cyclone_freq_khz);
+}
+
+static int init_cyclone_timesource(void)
+{
+	u32* reg;
+	u32 base;		/* saved cyclone base address */
+	u32 pageaddr;	/* page that contains cyclone_timer register */
+	u32 offset;		/* offset from pageaddr to cyclone_timer register */
+	int i;
+
+	/*make sure we're on a summit box*/
+	if(!use_cyclone) return -ENODEV;
+
+	printk(KERN_INFO "Summit chipset: Starting Cyclone Counter.\n");
+
+	/* find base address */
+	pageaddr = (CYCLONE_CBAR_ADDR)&PAGE_MASK;
+	offset = (CYCLONE_CBAR_ADDR)&(~PAGE_MASK);
+	set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+	reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+	if(!reg){
+		printk(KERN_ERR "Summit chipset: Could not find valid CBAR register.\n");
+		return -ENODEV;
+	}
+	base = *reg;
+	if(!base){
+		printk(KERN_ERR "Summit chipset: Could not find valid CBAR value.\n");
+		return -ENODEV;
+	}
+
+	/* setup PMCC */
+	pageaddr = (base + CYCLONE_PMCC_OFFSET)&PAGE_MASK;
+	offset = (base + CYCLONE_PMCC_OFFSET)&(~PAGE_MASK);
+	set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+	reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+	if(!reg){
+		printk(KERN_ERR "Summit chipset: Could not find valid PMCC register.\n");
+		return -ENODEV;
+	}
+	reg[0] = 0x00000001;
+
+	/* setup MPCS */
+	pageaddr = (base + CYCLONE_MPCS_OFFSET)&PAGE_MASK;
+	offset = (base + CYCLONE_MPCS_OFFSET)&(~PAGE_MASK);
+	set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+	reg = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+	if(!reg){
+		printk(KERN_ERR "Summit chipset: Could not find valid MPCS register.\n");
+		return -ENODEV;
+	}
+	reg[0] = 0x00000001;
+
+	/* map in cyclone_timer */
+	pageaddr = (base + CYCLONE_MPMC_OFFSET)&PAGE_MASK;
+	offset = (base + CYCLONE_MPMC_OFFSET)&(~PAGE_MASK);
+	set_fixmap_nocache(FIX_CYCLONE_TIMER, pageaddr);
+	cyclone_timer = (u32*)(fix_to_virt(FIX_CYCLONE_TIMER) + offset);
+	if(!cyclone_timer){
+		printk(KERN_ERR "Summit chipset: Could not find valid MPMC register.\n");
+		return -ENODEV;
+	}
+
+	/* quick test to make sure it's ticking */
+	for(i=0; i<3; i++){
+		u32 old = cyclone_timer[0];
+		int stall = 100;
+		while(stall--) barrier();
+		if(cyclone_timer[0] == old){
+			printk(KERN_ERR "Summit chipset: Counter not counting! DISABLED\n");
+			cyclone_timer = 0;
+			return -ENODEV;
+		}
+	}
+	calibrate_cyclone();
+	register_timesource(&timesource_cyclone);
+
+	return 0;
+}
+
+module_init(init_cyclone_timesource);
diff -Nru a/drivers/timesource/i386_pit.c b/drivers/timesource/i386_pit.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/i386_pit.c	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,100 @@
+/* pit timesource: XXX - broken!
+ */
+
+#include <linux/timesource.h>
+#include <linux/timex.h>
+#include <linux/init.h>
+
+#include <asm/io.h>
+#include <asm/timer.h>
+#include "io_ports.h"
+#include "do_timer.h"
+
+extern u64 jiffies_64;
+extern long jiffies;
+extern spinlock_t i8253_lock;
+
+/* Since the PIT overflows every tick, it's not very useful
+ * to just read by itself. So throw jiffies into the mix
+ * and just return nanoseconds in pit_read().
+ */
+
+static cycle_t pit_read(void)
+{
+	unsigned long flags;
+	int count;
+	unsigned long jiffies_t;
+	static int count_p;
+	static unsigned long jiffies_p = 0;
+
+	spin_lock_irqsave(&i8253_lock, flags);
+
+	outb_p(0x00, PIT_MODE);	/* latch the count ASAP */
+
+	count = inb_p(PIT_CH0);	/* read the latched count */
+	jiffies_t = jiffies;
+	count |= inb_p(PIT_CH0) << 8;
+
+	/* VIA686a test code... reset the latch if count > max + 1 */
+	if (count > LATCH) {
+		outb_p(0x34, PIT_MODE);
+		outb_p(LATCH & 0xff, PIT_CH0);
+		outb(LATCH >> 8, PIT_CH0);
+		count = LATCH - 1;
+	}
+
+	/*
+	 * avoiding timer inconsistencies (they are rare, but they happen)...
+	 * there are two kinds of problems that must be avoided here:
+	 *  1. the timer counter underflows
+	 *  2. hardware problem with the timer, not giving us continuous time,
+	 *     the counter does small "jumps" upwards on some Pentium systems,
+	 *     (see c't 95/10 page 335 for Neptun bug.)
+	 */
+
+	if( jiffies_t == jiffies_p ) {
+		if( count > count_p ) {
+			/* the nutcase */
+			count = do_timer_overflow(count);
+		}
+	} else
+		jiffies_p = jiffies_t;
+
+	count_p = count;
+
+	spin_unlock_irqrestore(&i8253_lock, flags);
+
+	count = ((LATCH-1) - count) * TICK_SIZE;
+	count = (count + LATCH/2) / LATCH;
+
+	count *= 1000; /* convert count from usec->nsec */
+
+	return (cycle_t)((jiffies_64 * TICK_NSEC) + count);
+}
+
+static cycle_t pit_delta(cycle_t now, cycle_t then)
+{
+	return now - then;
+}
+
+/* just return cyc, as it's already in ns */
+static nsec_t pit_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+	return (nsec_t)cyc;
+}
+
+static struct timesource_t timesource_pit = {
+	.name = "pit",
+	.priority = 0,
+	.read = pit_read,
+	.delta = pit_delta,
+	.cyc2ns = pit_cyc2ns,
+};
+
+static int init_pit_timesource(void)
+{
+	register_timesource(&timesource_pit);
+	return 0;
+}
+
+module_init(init_pit_timesource);
diff -Nru a/drivers/timesource/i386_tsc.c b/drivers/timesource/i386_tsc.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/i386_tsc.c	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,43 @@
+/* TODO:
+ *		o cpufreq code
+ *		o better calibration
+ */
+
+#include <linux/timesource.h>
+#include <linux/timex.h>
+#include <linux/init.h>
+#include <linux/cpufreq.h>
+
+static cycle_t tsc_read(void)
+{
+	u64 ret;
+	rdtscll(ret);
+	return (cycle_t)ret;
+}
+
+static struct timesource_t timesource_tsc = {
+	.name = "tsc",
+	.priority = 25,
+	.type = TIMESOURCE_FUNCTION,
+	.read_fnct = tsc_read,
+	.mask = (cycle_t)~0,
+	.mult = 0, /* to be set */
+	.shift = 22,
+};
+
+
+static int init_tsc_timesource(void)
+{
+	/* All the initialization is done in arch/i386/kernel/tsc.c */
+	if (cpu_has_tsc && cpu_khz) {
+		unsigned long long x;
+		x = (NSEC_PER_SEC/1000);	/* nsecs per msec; cpu_khz is cycles per msec */
+		x = x << timesource_tsc.shift;
+		do_div(x, cpu_khz);
+		timesource_tsc.mult = (unsigned long)x;
+		register_timesource(&timesource_tsc);
+	}
+	return 0;
+}
+
+module_init(init_tsc_timesource);
diff -Nru a/drivers/timesource/x86-64_hpet.c b/drivers/timesource/x86-64_hpet.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/x86-64_hpet.c	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,44 @@
+#include <linux/timesource.h>
+#include <linux/hpet.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <asm/io.h>
+#include <asm/hpet.h>
+
+#define HPET_MASK (~0L)
+#define HPET_SHIFT 32
+#define hpet_readl(a) readl((void *)fix_to_virt(FIX_HPET_BASE) + (a))
+
+
+static cycle_t hpet_read(void)
+{
+	return (cycle_t) hpet_readl(HPET_COUNTER);
+}
+
+
+struct timesource_t timesource_hpet = {
+	.name = "hpet",
+	.priority = 200,
+	.type = TIMESOURCE_FUNCTION,
+	.read_fnct = hpet_read,
+	.mask = (cycle_t)HPET_MASK,
+	.mult = 0, /* set below */
+	.shift = HPET_SHIFT,
+};
+
+static int init_hpet_timesource(void)
+{
+	unsigned long hpet_period, hpet_hz;
+
+	if (!vxtime.hpet_address)
+		return -ENODEV;
+
+	/* calculate and set the timesource multiplier */
+	hpet_period = hpet_readl(HPET_PERIOD);
+	hpet_hz = (1000000000000000L + hpet_period / 2) / hpet_period;
+	timesource_hpet.mult = (1000000L << HPET_SHIFT) / hpet_hz;
+
+	register_timesource(&timesource_hpet);
+	return 0;
+}
+module_init(init_hpet_timesource);
diff -Nru a/include/asm-i386/mach-default/mach_timer.h b/include/asm-i386/mach-default/mach_timer.h
--- a/include/asm-i386/mach-default/mach_timer.h	2004-12-07 16:48:13 -08:00
+++ b/include/asm-i386/mach-default/mach_timer.h	2004-12-07 16:48:13 -08:00
@@ -14,8 +14,12 @@
  */
 #ifndef _MACH_TIMER_H
 #define _MACH_TIMER_H
+#include <linux/jiffies.h>
+#include <asm/io.h>
 
-#define CALIBRATE_LATCH	(5 * LATCH)
+#define CALIBRATE_ITERATION 50
+#define CALIBRATE_LATCH	(CALIBRATE_ITERATION * LATCH)
+#define CALIBRATE_TIME	(CALIBRATE_ITERATION * 1000020/HZ)
 
 static inline void mach_prepare_counter(void)
 {
diff -Nru a/include/asm-i386/tsc.h b/include/asm-i386/tsc.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/asm-i386/tsc.h	2004-12-07 16:48:13 -08:00
@@ -0,0 +1,6 @@
+#ifndef _ASM_I386_TSC_H
+#define _ASM_I386_TSC_H
+extern unsigned long cpu_freq_khz;
+void tsc_init(void);
+
+#endif
diff -Nru a/include/linux/timeofday.h b/include/linux/timeofday.h
--- a/include/linux/timeofday.h	2004-12-07 16:48:13 -08:00
+++ b/include/linux/timeofday.h	2004-12-07 16:48:13 -08:00
@@ -8,6 +8,9 @@
 #define _LINUX_TIMEOFDAY_H
 #include <linux/types.h>
 #include <linux/time.h>
+#include <linux/timex.h>
+#include <asm/div64.h>
+
 
 #ifdef CONFIG_NEWTOD
 nsec_t get_lowres_timestamp(void);
diff -Nru a/include/linux/timesource.h b/include/linux/timesource.h
--- a/include/linux/timesource.h	2004-12-07 16:48:13 -08:00
+++ b/include/linux/timesource.h	2004-12-07 16:48:13 -08:00
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/time.h>
+#include <asm/div64.h>
 
 /* struct timesource_t:
  *		Provides mostly state-free accessors to the underlying




* Re: [RFC] new timeofday timesources (v.A1)
  2004-12-08  1:58     ` [RFC] new timeofday timesources (v.A1) john stultz
@ 2004-12-08  2:02       ` john stultz
  0 siblings, 0 replies; 29+ messages in thread
From: john stultz @ 2004-12-08  2:02 UTC (permalink / raw)
  To: lkml
  Cc: tim, george anzinger, albert, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Tue, 2004-12-07 at 17:58, john stultz wrote:
> All,
> 	This patch implements most of the time sources for i386 and x86-64
> (tsc, pit, cyclone, acpi-pm and hpet). It applies on top of my
> linux-2.6.10-rc3_timeofday-arch_A1 patch. It provides real timesources
> (opposed to the example jiffies timesource) that can be used for more
> realistic testing.

Ack. That last one got sent using the wrong from address. 

Please reply via johnstul@us.ibm.com.

sorry,
-john




* Re: [RFC] new timeofday core subsystem (v.A1)
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
  2004-12-08  1:57   ` [RFC] new timeofday arch specific hooks (v.A1) john stultz
@ 2004-12-08  9:17   ` Pavel Machek
  2004-12-08 18:44   ` Christoph Lameter
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Pavel Machek @ 2004-12-08  9:17 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, clameter,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott, Max, mahuja

Hi!

> I look forward to your comments and feedback.

> diff -Nru a/drivers/timesource/Makefile b/drivers/timesource/Makefile
> --- /dev/null	Wed Dec 31 16:00:00 196900
> +++ b/drivers/timesource/Makefile	2004-12-07 16:47:19 -08:00
> @@ -0,0 +1 @@
> +obj-y += jiffies.o


Perhaps drivers/time would be a better name? We do not have
drivers/characterdevices, either...
				Pavel

-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08  1:55 [RFC] New timeofday proposal (v.A1) john stultz
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
@ 2004-12-08 18:25 ` Christoph Lameter
  2004-12-08 19:11   ` john stultz
  2004-12-08 18:43 ` Nicolas Pitre
  2 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 18:25 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george, albert, Ulrich.Windl, len.brown, linux, davidm,
	ak, paulus, schwidefsky, keith maanthey, greg kh,
	Patricia Gaughen, Chris McDermott, Max, mahuja

On Tue, 7 Dec 2004, john stultz wrote:

> Features of this design:
> ========================

> o Consolidates a large amount of code:
> 	Allows for shared time source implementations, such as: i386, x86-64
> and ia64 all use HPET, i386 and x86-64 both have ACPI PM timers, and
> i386 and ia64 both have cyclone counters. Time sources are just drivers!

What I would also like to see included here is a clean posix
interface to these drivers. IMHO the current char interfaces to clock
drivers should be removed. Look at SGI's mmtimer implementation in
2.6.10-rc3. We have modified the posix interface to allow clock drivers to
export their timer values via CLOCK_SGI_CYCLE and we are also
now able to schedule hardware interrupts via timer_* posix functions
utilizing CLOCK_SGI_CYCLE that are then delivered as signals to an
application. Timer chips usually include time sources as well as the
ability to generate periodic or single shot
interrupts. There needs to be some way for clock drivers to cleanly
interface with these. The API may be the posix subsystem but I do not
like the quality of the code nor the current API for the clock drivers.

The API for user space to clocks already exists through the posix
standard. I would suggest to work with that standard for a way to deal
with clocks under Linux.

> Brief Pseudo-code to illustrate the design:
> ===========================================
>
> monotonic_clock()
> 	now = timesource_read();	/* read the timesource */
> 	ns = cyc2ns(now - offset_base);	/* calculate nsecs since last hook */
> 	ntp_ns = ntp_scale(ns);		/* apply ntp scaling */

These are not really functions right? timesource_read can be a direct
memory read and the cyc2ns and ntp_scale can be reduced to some scaling
factor?

> Points I'm glossing over for now:
> ====================================================
>
> o ntp_scale(ns):  scales ns by NTP scaling factor
> 	- see ntp.c for details
> 	- costly, but correct.

Please do not call this function from monotonic_clock but provide some
sort of scaling factor that is adjusted from time to time.

> o What is the cost of throwing around 64bit values for everything?
> 	- Do we need an arch specific time structure that varies size
> accordingly?

I think 64 bit values are fine but then I work for a 64 bit company and
this may just be the contextual predisposition.

> o Some arches (arm, for example) do not have high res timing hardware
> 	- In this case we can have a "jiffies" timesource
> 		- cyc2ns(x) =  x*(NSEC_PER_SEC/HZ)
> 		- doesn't work for tickless systems

In that case maybe the "ticks" are the timesource and not really tick
processing per se.
There could be a separation between "increment counter" and tick processing.


* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08  1:55 [RFC] New timeofday proposal (v.A1) john stultz
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
  2004-12-08 18:25 ` [RFC] New timeofday proposal (v.A1) Christoph Lameter
@ 2004-12-08 18:43 ` Nicolas Pitre
  2004-12-08 18:57   ` john stultz
  2 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2004-12-08 18:43 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george, albert, Ulrich.Windl, clameter, len.brown,
	linux, davidm, ak, paulus, schwidefsky, keith maanthey, greg kh,
	Patricia Gaughen, Chris McDermott, Max, mahuja

On Tue, 7 Dec 2004, john stultz wrote:

> Points I'm glossing over for now:
> ====================================================
> 
> o Some arches (arm, for example) do not have high res timing hardware

Just a note: The ARM architecture is rather a bunch of multiple 
subarchitectures sharing the same instruction set but with wildly 
different sets of peripherals.  Many of those ARM subarchitectures have 
high res (sub microsec) timer capabilities.


Nicolas


* Re: [RFC] new timeofday core subsystem (v.A1)
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
  2004-12-08  1:57   ` [RFC] new timeofday arch specific hooks (v.A1) john stultz
  2004-12-08  9:17   ` [RFC] new timeofday core subsystem (v.A1) Pavel Machek
@ 2004-12-08 18:44   ` Christoph Lameter
  2004-12-08 19:22     ` john stultz
  2004-12-08 18:53   ` Christoph Lameter
  2004-12-08 20:27   ` Martin Waitz
  4 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 18:44 UTC (permalink / raw)
  To: john stultz; +Cc: lkml

On Tue, 7 Dec 2004, john stultz wrote:

> +struct timesource_t {
> +	char* name;
> +	int priority;
> +	enum {
> +		TIMESOURCE_FUNCTION,
> +		TIMESOURCE_MMIO_32,
> +		TIMESOURCE_MMIO_64
> +	} type;
> +	cycle_t (*read_fnct)(void);
> +	void* ptr;
> +	cycle_t mask;
> +	u32 mult;
> +	u32 shift;
> +};

Maybe add TIMESOURCE_CPU or so? How can one specify a time source in a
cpu register?



* Re: [RFC] new timeofday core subsystem (v.A1)
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
                     ` (2 preceding siblings ...)
  2004-12-08 18:44   ` Christoph Lameter
@ 2004-12-08 18:53   ` Christoph Lameter
  2004-12-08 20:27   ` Martin Waitz
  4 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 18:53 UTC (permalink / raw)
  To: john stultz; +Cc: lkml

On Tue, 7 Dec 2004, john stultz wrote:

> +/* __monotonic_clock():
> + *		private function, must hold system_time_lock lock when being
> + *		called. Returns the monotonically increasing number of
> + *		nanoseconds	since the system booted (adjusted by NTP scaling)
> + */
> +static nsec_t __monotonic_clock(void)
> +{
> +	nsec_t ret, ns_offset;
> +	cycle_t now, delta;
> +
> +	/* read timesource */
> +	now = read_timesource(timesource);
> +
> +	/* calculate the delta since the last clock_interrupt */
> +	delta = (now - offset_base) & timesource->mask;
> +
> +	/* convert to nanoseconds */
> +	ns_offset = cyc2ns(timesource, delta,0);
> +
> +	/* apply the NTP scaling */
> +	ns_offset = ntp_scale(ns_offset);

The call to ntp_scale will significantly impact clock retrieval
performance. ntp_scale needs to be removed. Could you simply let the clock
run with a scaling factor (just make sure the scaling factor is a bit
slower than ntp time and then skip a few nanoseconds of time forward
at the next correction?)


* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 18:43 ` Nicolas Pitre
@ 2004-12-08 18:57   ` john stultz
  0 siblings, 0 replies; 29+ messages in thread
From: john stultz @ 2004-12-08 18:57 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, clameter,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott, Max, mahuja

On Wed, 2004-12-08 at 10:43, Nicolas Pitre wrote:
> On Tue, 7 Dec 2004, john stultz wrote:
> 
> > Points I'm glossing over for now:
> > ====================================================
> > 
> > o Some arches (arm, for example) do not have high res timing hardware
> 
> Just a note: The ARM architecture is rather a bunch of multiple 
> subarchitectures sharing the same instruction set but with wildly 
> different sets of peripherals.  Many of those ARM subarchitectures have 
> high res (sub microsec) timer capabilities.

Ah, I must have missed that when I looked over that code. I'll drop (or
at least qualify) the incorrect reference. 

Thanks for the correction!
-john




* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 18:25 ` [RFC] New timeofday proposal (v.A1) Christoph Lameter
@ 2004-12-08 19:11   ` john stultz
  2004-12-08 19:20     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: john stultz @ 2004-12-08 19:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 2004-12-08 at 10:25, Christoph Lameter wrote:
> On Tue, 7 Dec 2004, john stultz wrote:
> 
> > Features of this design:
> > ========================
> 
> > o Consolidates a large amount of code:
> > 	Allows for shared time source implementations, such as: i386, x86-64
> > and ia64 all use HPET, i386 and x86-64 both have ACPI PM timers, and
> > i386 and ia64 both have cyclone counters. Time sources are just drivers!
> 
> What I would also like to see included here is a clean posix
> interface to these drivers. IMHO the current char interfaces to clock
> drivers should be removed. Look at SGI's mmtimer implementation in
> 2.6.10-rc3. We have modified the posix interface to allow clock drivers to
> export their timer values via CLOCK_SGI_CYCLE and we are also
> now able to schedule hardware interrupts via timer_* posix functions
> utilizing CLOCK_SGI_CYCLE that are then delivered as signals to an
> application. Timer chips usually include time sources as well as the
> ability to generate periodic or single shot
> interrupts. There needs to be some way for clock drivers to cleanly
> interface with these. The API may be the posix subsystem but I do not
> like the quality of the code nor the current API for the clock drivers.

I'm not too familiar with the posix interfaces. I've been focused on
in-kernel uses at the moment. I'll try to take a look at the mmtimer
bits and see if I can better grasp your suggestion.

> The API for user space to clocks already exists through the posix
> standard. I would suggest to work with that standard for a way to deal
> with clocks under Linux.
> 
> > Brief Pseudo-code to illustrate the design:
> > ===========================================
> >
> > monotonic_clock()
> > 	now = timesource_read();	/* read the timesource */
> > 	ns = cyc2ns(now - offset_base);	/* calculate nsecs since last hook */
> > 	ntp_ns = ntp_scale(ns);		/* apply ntp scaling */
> 
> These are not really functions right? timesource_read can be a direct
> memory read and the cyc2ns and ntp_scale can be reduced to some scaling
> factor?

Yep, timesource_read() and cyc2ns are static inlines which call the
timesource read function, or just read the MMIO'ed address depending on
the timesource type.
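
A minimal user-space sketch of that dispatch, assuming the
struct timesource_t layout from the core patch. The typedef, the named
enum and the volatile casts are illustrative stand-ins, not the kernel
definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cycle_t;	/* stand-in for the kernel's cycle_t */

enum timesource_type {
	TIMESOURCE_FUNCTION,
	TIMESOURCE_MMIO_32,
	TIMESOURCE_MMIO_64
};

struct timesource_t {
	const char *name;
	int priority;
	enum timesource_type type;
	cycle_t (*read_fnct)(void);
	volatile void *ptr;
	cycle_t mask;
	uint32_t mult;
	uint32_t shift;
};

/* Read the raw counter: a plain load for MMIO sources, a call otherwise. */
static inline cycle_t read_timesource(const struct timesource_t *ts)
{
	switch (ts->type) {
	case TIMESOURCE_MMIO_32:
		return (cycle_t)*(volatile uint32_t *)ts->ptr;
	case TIMESOURCE_MMIO_64:
		return (cycle_t)*(volatile uint64_t *)ts->ptr;
	case TIMESOURCE_FUNCTION:
	default:
		assert(ts->read_fnct != NULL);
		return ts->read_fnct();
	}
}
```

For the MMIO types the inlined read is a single load, so no function
call remains on that path.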


> > Points I'm glossing over for now:
> > ====================================================
> >
> > o ntp_scale(ns):  scales ns by NTP scaling factor
> > 	- see ntp.c for details
> > 	- costly, but correct.
> 
> Please do not call this function from monotonic_clock but provide some
> sort of scaling factor that is adjusted from time to time.

You're going to have to expand on this. I had considered only NTP
scaling the wall time, but for consistency it made more sense to me that
we also apply NTP scaling to the monotonic clock. This avoids different
notions of nanoseconds, one being the inaccurate unadjusted system
nanoseconds and the other being the accurately ntp scaled wall time
nanoseconds. Trying to keep things sane.


> > o Some arches (arm, for example) do not have high res timing hardware
> > 	- In this case we can have a "jiffies" timesource
> > 		- cyc2ns(x) =  x*(NSEC_PER_SEC/HZ)
> > 		- doesn't work for tickless systems
> 
> In that case maybe the "ticks" are the timesource and not really tick
> processing per se.

If I understand you, yes. In this case the timer interrupt counter
(jiffies) is used as a free running counter. 

> There could be a separation between "increment counter" and tick processing.

Could you expand on this?

thanks for the review!
-john



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 19:11   ` john stultz
@ 2004-12-08 19:20     ` Christoph Lameter
  2004-12-08 19:58       ` john stultz
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 19:20 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 8 Dec 2004, john stultz wrote:

> > > Points I'm glossing over for now:
> > > ====================================================
> > >
> > > o ntp_scale(ns):  scales ns by NTP scaling factor
> > > 	- see ntp.c for details
> > > 	- costly, but correct.
> >
> > Please do not call this function from monotonic_clock but provide some
> > sort of scaling factor that is adjusted from time to time.
>
> You're going to have to expand on this. I had considered only NTP
> scaling the wall time, but for consistency it made more sense to me that
> we also apply NTP scaling to the monotonic clock. This avoids different
> notions of nanoseconds, one being the inaccurate unadjusted system
> nanoseconds and the other being the accurately ntp scaled wall time
> nanoseconds. Trying to keep things sane.

It certainly is more consistent but it's a big performance hit if that
scaling is done on every invocation of clock_gettime or gettimeofday().

With the improved scaling factor one should be able to come very close to
ntp scaled time without invoking ntp_scale itself. Tick processing will
then update time to be ntp scaled by fine tuning the scaling factor (with
the bit shifting we can get very fine tuning) and eventually skip a few
nanoseconds. It's basically some piece of interpolator logic in there so
that the heavyweight calculations can just be done once in a while.
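
As a rough illustration of the idea (a sketch under assumptions, not
the code in the patch): the NTP correction is folded into the
multiplier once per tick, and the read path stays a multiply and a
shift. The names adjust_mult_ppm() and cyc2ns_fast() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: fold a frequency correction of `ppm` parts per
 * million into a base multiplier. Meant to run at tick time only. */
static uint32_t adjust_mult_ppm(uint32_t base_mult, int32_t ppm)
{
	int64_t adj = (int64_t)base_mult * ppm / 1000000;
	int64_t mult = (int64_t)base_mult + adj;

	assert(mult > 0 && mult <= UINT32_MAX);
	return (uint32_t)mult;
}

/* Fast path: one multiply and one shift, no NTP code involved. */
static uint64_t cyc2ns_fast(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

With base_mult = 1 << 22 and shift = 22 (one cycle per nanosecond), a
+500ppm correction maps 1,000,000 cycles to 1,000,499 ns. The last
nanosecond is lost to integer truncation of the adjusted multiplier.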

> > In that case maybe the "ticks" are the timesource and not really tick
> > processing per se.
>
> If I understand you, yes. In this case the timer interrupt counter
> (jiffies) is used as a free running counter.
>
> > There could be a separation between "increment counter" and tick processing.
>
> Could you expand on this?

The timesource is really a memory location incremented by "increment
counter" and not part of tick processing.  Logically these are separate.
"increment counter" could happen apart from tick processing.
They are just conflated in the current tick implementation.


* Re: [RFC] new timeofday core subsystem (v.A1)
  2004-12-08 18:44   ` Christoph Lameter
@ 2004-12-08 19:22     ` john stultz
  0 siblings, 0 replies; 29+ messages in thread
From: john stultz @ 2004-12-08 19:22 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: lkml

On Wed, 2004-12-08 at 10:44, Christoph Lameter wrote:
> On Tue, 7 Dec 2004, john stultz wrote:
> 
> > +struct timesource_t {
> > +	char* name;
> > +	int priority;
> > +	enum {
> > +		TIMESOURCE_FUNCTION,
> > +		TIMESOURCE_MMIO_32,
> > +		TIMESOURCE_MMIO_64
> > +	} type;
> > +	cycle_t (*read_fnct)(void);
> > +	void* ptr;
> > +	cycle_t mask;
> > +	u32 mult;
> > +	u32 shift;
> > +};
> 
> Maybe add TIMESOURCE_CPU or so? How can one specify a time source in a
> cpu register?

Yea, for now it's a function. I was thinking that get_cycles() could be
used as an arch independent way to do this. 

Good suggestion! 
Thanks!
-john



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 19:20     ` Christoph Lameter
@ 2004-12-08 19:58       ` john stultz
  2004-12-08 20:14         ` Christoph Lameter
  2004-12-09  7:47         ` Ulrich Windl
  0 siblings, 2 replies; 29+ messages in thread
From: john stultz @ 2004-12-08 19:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 2004-12-08 at 11:20, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, john stultz wrote:
> 
> > > > Points I'm glossing over for now:
> > > > ====================================================
> > > >
> > > > o ntp_scale(ns):  scales ns by NTP scaling factor
> > > > 	- see ntp.c for details
> > > > 	- costly, but correct.
> > >
> > > Please do not call this function from monotonic_clock but provide some
> > > sort of scaling factor that is adjusted from time to time.
> >
> > You're going to have to expand on this. I had considered only NTP
> > scaling the wall time, but for consistency it made more sense to me that
> > we also apply NTP scaling to the monotonic clock. This avoids different
> > notions of nanoseconds, one being the inaccurate unadjusted system
> > nanoseconds and the other being the accurately ntp scaled wall time
> > nanoseconds. Trying to keep things sane.
> 
> It certainly is more consistent but it's a big performance hit if that
> scaling is done on every invocation of clock_gettime or gettimeofday().
> 
> With the improved scaling factor one should be able to come very close to
> ntp scaled time without invoking ntp_scale itself. Tick processing will
> then update time to be ntp scaled by fine tuning the scaling factor (with
> the bit shifting we can get very fine tuning) and eventually skip a few
> nanoseconds. It's basically some piece of interpolator logic in there so
> that the heavyweight calculations can just be done once in a while.

No. I agree ntp_scale() is a performance concern. However I'm not sure
how your suggestion of just slowing or tweaking the timesource
mult/shift frequency values will allow us to implement the expected
behavior of adjtimex().  We need to be able to implement the following
adjustments within a single tick:

1. Adjust the frequency by 500ppm for 10usecs 
2. After that adjust the frequency by 30ppm for the rest of the tick.
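
To make the two-step requirement concrete, here is a minimal
user-space sketch of piecewise scaling over one interval. It is only
an illustration of the arithmetic, not how ntp_scale() is written;
apply_ppm() and scale_two_phase() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a nanosecond interval by (1 + ppm/10^6), truncating toward zero. */
static int64_t apply_ppm(int64_t ns, int32_t ppm)
{
	return ns + ns * ppm / 1000000;
}

/* Apply ppm1 to the first phase1_len nanoseconds of the interval and
 * ppm2 to whatever is left of it. */
static int64_t scale_two_phase(int64_t ns, int64_t phase1_len,
			       int32_t ppm1, int32_t ppm2)
{
	int64_t first;

	assert(ns >= 0 && phase1_len >= 0);
	first = ns < phase1_len ? ns : phase1_len;
	return apply_ppm(first, ppm1) + apply_ppm(ns - first, ppm2);
}
```

For a 1ms tick with 500ppm over the first 10usecs and 30ppm after
that, scale_two_phase(1000000, 10000, 500, 30) gives 1000034 ns. A
single constant multiplier for the whole tick cannot reproduce the
intermediate values inside the first 10usecs.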

We can see how much of this can be fudged or generalized, but I'm
hesitant to be too flippant about changing the NTP behavior for fear
that the astronomers who so dearly care about leap seconds and minute
time adjustments will "forget" to mention the asteroid heading towards
my home. :) 

I may have asked this before, but w/ 32 bit mult and shifts, how
granular can these adjustments be?

Also additional complications arise when we have multiple things (like
cpufreq) playing with the timesource frequency values as well. 


thanks again for the thorough review!
-john




* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 19:58       ` john stultz
@ 2004-12-08 20:14         ` Christoph Lameter
  2004-12-08 21:25           ` George Anzinger
  2004-12-08 23:36           ` john stultz
  2004-12-09  7:47         ` Ulrich Windl
  1 sibling, 2 replies; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 20:14 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 8 Dec 2004, john stultz wrote:

> > With the improved scaling factor one should be able to come very close to
> > ntp scaled time without invoking ntp_scale itself. Tick processing will
> > then update time to be ntp scaled by fine tuning the scaling factor (with
> > the bit shifting we can get very fine tuning) and eventually skip a few
> > nanoseconds. It's basically some piece of interpolator logic in there so
> > that the heavyweight calculations can just be done once in a while.
>
> No. I agree ntp_scale() is a performance concern. However I'm not sure
> how your suggestion of just slowing or tweaking the timesource
> mult/shift frequency values will allow us to implement the expected
> behavior of adjtimex().  We need to be able to implement the following
> adjustments within a single tick:
>
> 1. Adjust the frequency by 500ppm for 10usecs
> 2. After that adjust the frequency by 30ppm for the rest of the tick.

A frequency adjustment just means an adjustment of the scaling factor.
Am I missing something?

> We can see how much of this can be fudged or generalized, but I'm
> hesitant to be too flippant about changing the NTP behavior for fear
> that the astronomers who so dearly care about leap seconds and minute
> time adjustments will "forget" to mention the asteroid heading towards
> my home. :)

I am not sure what NTP behavior needs to be fudged. Sorry about my limited
NTP knowledge. Could you elaborate on what the problem is?
>
> I may have asked this before, but w/ 32 bit mult and shifts, how
> granular can these adjustments be?

Yes. 128bit would be great for this. 64bit is fine though as
far as I can see and allows granularity up to fractions of
nanoseconds if applied between 1ms intervals.
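
The granularity of a 32 bit mult can be estimated with a short
calculation. The sketch below assumes the conversion
ns = (cycles * mult) >> shift and a counter frequency given in kHz;
the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* mult such that ns = (cycles * mult) >> shift, for a counter at freq_khz. */
static uint32_t mult_from_khz(uint32_t freq_khz, uint32_t shift)
{
	uint64_t mult;

	assert(freq_khz != 0 && shift < 32);
	mult = ((uint64_t)1000000 << shift) / freq_khz;
	assert(mult != 0 && mult <= UINT32_MAX);
	return (uint32_t)mult;
}

/* Relative rate change from one LSB of mult, in parts per billion. */
static uint64_t mult_step_ppb(uint32_t mult)
{
	assert(mult != 0);
	return 1000000000ULL / mult;
}

/* Time shift, in picoseconds, that one LSB causes over interval_ns. */
static uint64_t step_error_ps(uint64_t interval_ns, uint32_t mult)
{
	return interval_ns * mult_step_ppb(mult) / 1000000;
}
```

For a 2GHz counter with shift = 22, mult is 2097152, so one LSB changes
the rate by about 476 parts per billion (roughly 0.48ppm). Over a 1ms
interval that is about 476 picoseconds, i.e. a fraction of a
nanosecond.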

> Also additional complications arise when we have multiple things (like
> cpufreq) playing with the timesource frequency values as well.

I think these could all be taken into account by a scaling factor off a
certain base established at a tick-like event that does the ntp scaling.
The scaling between tick-like events needs to be just a scaling factor for
performance reasons.




* Re: [RFC] new timeofday core subsystem (v.A1)
  2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
                     ` (3 preceding siblings ...)
  2004-12-08 18:53   ` Christoph Lameter
@ 2004-12-08 20:27   ` Martin Waitz
       [not found]     ` <1102555933.1281.301.camel@cog.beaverton.ibm.com>
  4 siblings, 1 reply; 29+ messages in thread
From: Martin Waitz @ 2004-12-08 20:27 UTC (permalink / raw)
  To: john stultz; +Cc: lkml


hoi :)

On Tue, Dec 07, 2004 at 05:56:38PM -0800, john stultz wrote:
> +struct timesource_t {

usually only typedefs end with _t

> +	char* name;
> +	int priority;
> +	enum {
> +		TIMESOURCE_FUNCTION,
> +		TIMESOURCE_MMIO_32,
> +		TIMESOURCE_MMIO_64
> +	} type;
> +	cycle_t (*read_fnct)(void);
> +	void* ptr;

This could be made __iomem if it is intended to point to an IO memory region.
Hmm, but then there is no ioread64, so I guess I'm wrong.

> +	cycle_t mask;
> +	u32 mult;
> +	u32 shift;
> +};

> +static inline nsec_t cyc2ns(struct timesource_t* ts, cycle_t cycles, cycle_t* rem)
> +{
> +	u64 ret;
> +	ret = (u64)cycles;
> +	ret *= ts->mult;
> +	ret >>= ts->shift;
> +	if (rem) /* XXX we still need to do remainder math */
> +		*rem = (cycle_t)0;
> +	return (nsec_t)ret;
> +}

well, the math is simple:
	if (rem) *rem = ret & ((1ULL << ts->shift) - 1);
	ret >>= ts->shift;
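
For reference, the remainder extraction above can be written as a small
standalone sketch. Assumptions: fixed-width types from <stdint.h> stand in
for cycle_t and nsec_t, `struct timesource` is a hypothetical cut-down
version of the posted struct, and the mask is parenthesised explicitly
because `-` binds tighter than `<<` in C. The remainder here is in units of
cycles * mult, not cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the mult/shift part of struct timesource_t. */
struct timesource {
	uint32_t mult;
	uint32_t shift;
};

/*
 * Convert cycles to nanoseconds. If rem is non-NULL, store the low
 * "shift" bits that the right shift discards. That remainder is in
 * units of cycles * mult (scaled units), not in cycles.
 */
static uint64_t cyc2ns_rem(const struct timesource *ts, uint64_t cycles,
			   uint64_t *rem)
{
	uint64_t ret = cycles * ts->mult;

	assert(ts->shift < 64);
	if (rem)
		*rem = ret & ((UINT64_C(1) << ts->shift) - 1);
	return ret >> ts->shift;
}
```

With mult = 5 and shift = 2, three cycles scale to 15, which yields 3ns and
a scaled remainder of 3.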


-- 
Martin Waitz



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 20:14         ` Christoph Lameter
@ 2004-12-08 21:25           ` George Anzinger
  2004-12-08 23:47             ` Christoph Lameter
  2004-12-08 23:36           ` john stultz
  1 sibling, 1 reply; 29+ messages in thread
From: George Anzinger @ 2004-12-08 21:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: john stultz, lkml, tim, albert, Ulrich.Windl, Len Brown, linux,
	David Mosberger, Andi Kleen, paulus, schwidefsky, keith maanthey,
	greg kh, Patricia Gaughen, Chris McDermott, Max, mahuja

Christoph Lameter wrote:
> On Wed, 8 Dec 2004, john stultz wrote:
> 
> 
>>>With the improved scaling factor one should be able to come very close to
>>>ntp scaled time without invoking ntp_scale itself. Tick processing will
>>>then update time to be ntp scaled by fine tuning the scaling factor (with
>>>the bit shifting we can get very fine tuning) and eventually skip a few
>>>nanoseconds. Its basically some piece of interpolator logic in there so
>>>that the heavyweight calculations can just be done once in a while.
>>
>>No. I agree ntp_scale() is a performance concern. However I'm not sure
>>how your suggestion of just slowing or tweaking the timesource
>>mult/shift frequency values will allow us to implement the expected
>>behavior of adjtimex().  We need to be able to implement the following
>>adjustments within a single tick:
>>
>>1. Adjust the frequency by 500ppm for 10usecs
>>2. After that adjust the frequency by 30ppm for the rest of the tick.
> 
> 
> Frequency adjustments just means an adjustment of the scaling factor.
> Am I missing something?
> 
> 
>>We can see how much of this can be fudged or generalized, but I'm
>>hesitant to be too flippant about changing the NTP behavior for fear
>>that the astronomers who so dearly care about leap seconds and minute
>>time adjustments will "forget" to mention the asteroid heading towards
>>my home. :)
> 
> 
> I am not sure what NTP behavior needs to be fudged. Sorry about my limited
> NTP knowledge. Could you elaborate on what the problem is?
> 
>>I may have asked this before, but w/ 32 bit mult and shifts, how
>>granular can these adjustments be?
> 
> 
> Yes. 128bit would be great for this. 64bit is fine though as
> far as I can see and allows granularity up to fractions of
> nanoseconds if applied between 1ms intervals.
> 
> 
>>Also additional complications arise when we have multiple things (like
>>cpufreq) playing with the timesource frequency values as well.
> 
> 
> I think these could all be taken into account by a scaling factor off a
> certain base established at a tick-like event that does the ntp scaling.
> The scaling between tick-like event needs to be just a scaling factor for
> performance reasons.

Right.  We seem to be doing ok now by just adjusting things at tick time and 
using the "normal" interpolation between ticks.

As for the math, the current code keeps a running "remainder" which is the
amount of the correction that was finer than the clock resolution (i.e. less
than a nanosecond) and rolls this in on the next tick.  This gives resolution
out to several bits to the right of the nanosecond.  And I think this is all
done with 32 bit math (if memory serves).
> 
> 

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 20:14         ` Christoph Lameter
  2004-12-08 21:25           ` George Anzinger
@ 2004-12-08 23:36           ` john stultz
  2004-12-08 23:53             ` Christoph Lameter
  2004-12-09  7:57             ` Ulrich Windl
  1 sibling, 2 replies; 29+ messages in thread
From: john stultz @ 2004-12-08 23:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 2004-12-08 at 12:14, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, john stultz wrote:
> > > With the improved scaling factor one should be able to come very close to
> > > ntp scaled time without invoking ntp_scale itself. Tick processing will
> > > then update time to be ntp scaled by fine tuning the scaling factor (with
> > > the bit shifting we can get very fine tuning) and eventually skip a few
> > > nanoseconds. Its basically some piece of interpolator logic in there so
> > > that the heavyweight calculations can just be done once in a while.
> >
> > No. I agree ntp_scale() is a performance concern. However I'm not sure
> > how your suggestion of just slowing or tweaking the timesource
> > mult/shift frequency values will allow us to implement the expected
> > behavior of adjtimex().  We need to be able to implement the following
> > adjustments within a single tick:
> >
> > 1. Adjust the frequency by 500ppm for 10usecs
> > 2. After that adjust the frequency by 30ppm for the rest of the tick.
> 
> Frequency adjustments just means an adjustment of the scaling factor.
> Am I missing something?
> 
> > We can see how much of this can be fudged or generalized, but I'm
> > hesitant to be too flippant about changing the NTP behavior for fear
> > that the astronomers who so dearly care about leap seconds and minute
> > time adjustments will "forget" to mention the asteroid heading towards
> > my home. :)
> 
> I am not sure what NTP behavior needs to be fudged. Sorry about my limited
> NTP knowledge. Could you elaborate on what the problem is?

Take a look at the adjtimex man page as well as the ntp.c file from the
timeofday core patch. There are a number of different types of adjustments
that are made, possibly at the same time. Briefly, they are (to my
understanding - I'm going off my notes from awhile ago): 
o tick adjustments
	how much time should pass in a _user_ tick
o frequency adjustments
	long term adjustment to correct for constant drift
o offset adjustments
	additional ppm adjustment to correct for current offset from the ntp
server
o single shot offset adjustments 
	old style adjtime functionality

Tick, frequency and offset adjustments can be precalculated and summed
to a single ppm adjustment. This is similar to the style of adjustment
you propose directly onto the time source frequency values. 
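
A minimal sketch of how the precomputable adjustments could be summed and
folded into a mult value. The function names and the plain ppm units are
assumptions for illustration; the actual ntp.c in the patch may represent
these quantities differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sum the precomputable adjustments, all expressed in parts per
 * million. Names are illustrative only.
 */
static int32_t total_ppm(int32_t tick_ppm, int32_t freq_ppm,
			 int32_t offset_ppm)
{
	return tick_ppm + freq_ppm + offset_ppm;
}

/* Scale mult by (1 + ppm / 1e6), truncating the delta toward zero. */
static uint32_t adjust_mult(uint32_t mult, int32_t ppm)
{
	int64_t delta = ((int64_t)mult * ppm) / 1000000;
	int64_t adj = (int64_t)mult + delta;

	assert(adj > 0 && adj <= UINT32_MAX);
	return (uint32_t)adj;
}
```

One step of mult changes the rate by 1/mult, so with mult near 4,000,000
the finest possible adjustment is about 0.25ppm. A larger shift (and thus a
larger mult) gives finer steps.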

However, there are also these short term single shot adjustments. These
adjustments are made by applying the MAX_SINGLESHOT_ADJ (500ppm) scaling
for an amount of time (offset_len) which would compensate for the
offset. This style is difficult because we cannot precompute it and
apply it to an entire tick. Instead it needs to be applied for just a
specific amount of time which may be only a fraction of a tick. When we
start talking about systems with irregular tick frequency (via
virtualization, or tickless systems) it becomes even more problematic. 

If this can be fudged then it becomes less of an issue. Or at worst, we
have to do two mult/shift operations on two "parts" of the time interval
using different adjustments.

It's starting to look doable, but it's not necessarily the simplest thing
(for me at least). I'll put it on my list, but patches would be more
than welcome.

thanks
-john





* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 21:25           ` George Anzinger
@ 2004-12-08 23:47             ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 23:47 UTC (permalink / raw)
  To: George Anzinger
  Cc: john stultz, lkml, tim, albert, Ulrich.Windl, Len Brown, linux,
	David Mosberger, Andi Kleen, paulus, schwidefsky, keith maanthey,
	greg kh, Patricia Gaughen, Chris McDermott, Max, mahuja

On Wed, 8 Dec 2004, George Anzinger wrote:

> Right.  We seem to be doing ok now by just adjusting things at tick time and
> using the "normal" interpolation between ticks.
>
> As for the math, the current code keeps a running "remainder" which is the
> amount of the correction that was finer than the clock resolution (i.e. less
> than a nano second) and rolls this in on the next tick.  This gives resolution
> out to several bits to the right of the nano second.  And I think this is all
> done with 32 bit math (if memory serves).

That is probably the i386 version with which I am not familiar. The time
interpolator logic (IA64 and SPARC64) does fine with a scaled 64 bit
factor without a remainder. The factor may be used to express fractions of
nanoseconds.



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 23:36           ` john stultz
@ 2004-12-08 23:53             ` Christoph Lameter
  2004-12-09  0:17               ` john stultz
  2004-12-09  7:57             ` Ulrich Windl
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2004-12-08 23:53 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 8 Dec 2004, john stultz wrote:

> Take a look at the adjtimex man page as well as the ntp.c file from the
> timeofday core patch. There are number of different types of adjustments
> that are made, possibly at the same time. Briefly, they are (to my
> understanding - I'm going off my notes from awhile ago):
> o tick adjustments
> 	how much time should pass in a _user_ tick
> o frequency adjustments
> 	long term adjustment to correct for constant drift),
> o offset adjustments
> 	additional ppm adjustment to correct for current offset from the ntp
> server
> o single shot offset adjustments
> 	old style adjtime functionality
>
> Tick, frequency and offset adjustments can be precalculated and summed
> to a single ppm adjustment. This is similar to the style of adjustment
> you propose directly onto the time source frequency values.
>
> However, there is also this short term single shot adjustments. These
> adjustments are made by applying the MAX_SINGLESHOT_ADJ (500ppm) scaling
> for an amount of time (offset_len) which would compensate for the
> offset. This style is difficult because we cannot precompute it and
> apply it to an entire tick. Instead it needs to be applied for just a
> specific amount of time which may be only a fraction of a tick. When we
> start talking about systems with irregular tick frequency (via
> virtualization, or tickless systems) it becomes even more problematic.

We would need to schedule a special tick like event at a certain time but
otherwise I do not see a problem. Is there a requirement that these
"specific amounts of time" are less than 1 ms? The timer hardware (such as
the RTC clock) can generate an event in <200ns that could be used to
change the scaling. For a tickless system we would need to have such
scheduled events anyways.

> If this can be fudged then it becomes less of an issue. Or at worse, we
> have to do two mult/shift operations on two "parts" of the time interval
> using different adjustments.

That looks troublesome. Better avoid that.

> Its starting to look doable, but its not necessarily the simplest thing
> (for me at least). I'll put it on my list, but patches would be more
> then welcome.

I am still suffering from my limited NTP knowledge but will see what I can
do about this.


* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 23:53             ` Christoph Lameter
@ 2004-12-09  0:17               ` john stultz
  2004-12-09  0:40                 ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: john stultz @ 2004-12-09  0:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 2004-12-08 at 15:53, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, john stultz wrote:
> > However, there is also this short term single shot adjustments. These
> > adjustments are made by applying the MAX_SINGLESHOT_ADJ (500ppm) scaling
> > for an amount of time (offset_len) which would compensate for the
> > offset. This style is difficult because we cannot precompute it and
> > apply it to an entire tick. Instead it needs to be applied for just a
> > specific amount of time which may be only a fraction of a tick. When we
> > start talking about systems with irregular tick frequency (via
> > virtualization, or tickless systems) it becomes even more problematic.
> 
> We would need to schedule a special tick like event at a certain time but
> otherwise I do not see a problem. Is there a requirement that these
> "specific amounts of time" are less than 1 ms? The timer hardware (such as
> the RTC clock) can generate an event in <200ns that could be used to
> change the scaling. For a tickless system we would need to have such
> scheduled events anyways.

Eh, I'd like not to be dependent on event accuracy/frequency. Let's
see if we can do it w/o scheduling events.

> > If this can be fudged then it becomes less of an issue. Or at worse, we
> > have to do two mult/shift operations on two "parts" of the time interval
> > using different adjustments.
> 
> That looks troublesome. Better avoid that.

Well, it's not *that* bad. Similar to the ntp_scale() function, it would
look something like:

if (interval <= offset_len)
	return (interval * singleshot_mult)>>shift;
else {
	cycle_t v1,v2;
	v1 = (offset_len * singleshot_mult)>>shift;
	v2 = ((interval-offset_len)*adjusted_mult)>>shift;
	return v1+v2;
}

Where:
	singleshot_mult = original_mult + ntp_adj + ss_mult
and
	adjusted_mult = original_mult + ntp_adj
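
The two-part conversion above, written out as a compilable sketch. The
parameter names follow the pseudocode; the function name `scale_interval`,
the signed types for ntp_adj and ss_mult, and the use of <stdint.h> types
are assumptions, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale an interval (in cycles) to nanoseconds when a single shot
 * adjustment applies only to the first offset_len cycles of it.
 */
static uint64_t scale_interval(uint64_t interval, uint64_t offset_len,
			       uint32_t original_mult, int32_t ntp_adj,
			       int32_t ss_mult, uint32_t shift)
{
	/* Long term rate: base mult plus the precomputed NTP adjustment. */
	uint64_t adjusted_mult = (uint64_t)((int64_t)original_mult + ntp_adj);
	/* Short term rate: long term rate plus the single shot boost. */
	uint64_t singleshot_mult = (uint64_t)((int64_t)adjusted_mult + ss_mult);
	uint64_t v1, v2;

	assert(shift < 64);

	/* The whole interval falls inside the single shot window. */
	if (interval <= offset_len)
		return (interval * singleshot_mult) >> shift;

	/* Otherwise split it: boosted part first, then the normal rate. */
	v1 = (offset_len * singleshot_mult) >> shift;
	v2 = ((interval - offset_len) * adjusted_mult) >> shift;
	return v1 + v2;
}
```

With original_mult = 1024, shift = 10, ntp_adj = 0 and ss_mult = 512 (an
exaggerated 50% speed-up chosen only to make the numbers easy), an interval
of 100 with offset_len = 40 yields 60 + 60 = 120.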


> > Its starting to look doable, but its not necessarily the simplest thing
> > (for me at least). I'll put it on my list, but patches would be more
> > then welcome.
> 
> I am still suffering from my limited NTP knowlege but will see what I can
> do about this.

:) Any added NTP knowledge would be great to add to the pool. 

-john




* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-09  0:17               ` john stultz
@ 2004-12-09  0:40                 ` Christoph Lameter
  2004-12-09  0:51                   ` john stultz
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2004-12-09  0:40 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 8 Dec 2004, john stultz wrote:

> Well, its not *that* bad. Similar to the ntp_scale() function, it would
> look something like:
>
> if (interval <= offset_len)
> 	return (interval * singleshot_mult)>>shift;
> else {
> 	cycle_t v1,v2;
> 	v1 = (offset_len * singleshot_mult)>>shift;
> 	v2 = ((interval-offset_len)*adjusted_mult)>>shift;
> 	return v1+v2;
> }
>
> Where:
> 	singleshot_mult = original_mult + ntp_adj + ss_mult
> and
> 	adjusted_mult = original_mult + ntp_adj
>
>

Yuck. Do we support this kind of thing today? Support for periods of a
tick or so is not an issue right?



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-09  0:40                 ` Christoph Lameter
@ 2004-12-09  0:51                   ` john stultz
  2004-12-09  1:24                     ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: john stultz @ 2004-12-09  0:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 2004-12-08 at 16:40, Christoph Lameter wrote:
> On Wed, 8 Dec 2004, john stultz wrote:
> 
> > Well, its not *that* bad. Similar to the ntp_scale() function, it would
> > look something like:
> >
> > if (interval <= offset_len)
> > 	return (interval * singleshot_mult)>>shift;
> > else {
> > 	cycle_t v1,v2;
> > 	v1 = (offset_len * singleshot_mult)>>shift;
> > 	v2 = ((interval-offset_len)*adjusted_mult)>>shift;
> > 	return v1+v2;
> > }
> >
> > Where:
> > 	singleshot_mult = original_mult + ntp_adj + ss_mult
> > and
> > 	adjusted_mult = original_mult + ntp_adj
> >
> >
> 
> Yuck. Do we support this kind of thing today? Support for periods of a
> tick or so is not an issue right?

Well, ok, you're right. I got my wires twisted and misspoke. Today we
really don't, since NTP adjustments only occur on tick boundaries. So
yes, singleshot adjustments are in multiples of ticks right now. But we
do assume ticks arrive at regular periods. If they don't, well, then we
apply it for only one tick's worth, but we've lost a tick so we're wrong
anyway. 

The code above, however, can handle non-regular interrupt intervals,
which is something I think we should assume will occur.

thanks
-john




* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-09  0:51                   ` john stultz
@ 2004-12-09  1:24                     ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2004-12-09  1:24 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On Wed, 8 Dec 2004, john stultz wrote:

> Well, ok, you're right. I got my wires twisted and misspoke. Today we
> really don't, since NTP adjustments only occur on tick boundaries. So
> yes, singleshot adjustments are in multiples of ticks right now. But we
> do assume ticks arrive at regular periods. If they don't, well, then we
> apply it for only one ticks worth, but we've lost a tick so we're wrong
> anyway.
>
> The code above, however, can handle non-regular interrupt intervals,
> which is something I think we should assume will occur.

Then we might also assume that we are beyond the technology of the
eighties and that there is some way that hardware can give
us an interrupt at a certain point in the future that will allow us to
change the scaling factor? I think it's safe to not do this two scale
stuff. Too complex for the hot code path anyways.



* Re: [RFC] new timeofday core subsystem (v.A1)
       [not found]     ` <1102555933.1281.301.camel@cog.beaverton.ibm.com>
@ 2004-12-09  7:32       ` Martin Waitz
  0 siblings, 0 replies; 29+ messages in thread
From: Martin Waitz @ 2004-12-09  7:32 UTC (permalink / raw)
  To: john stultz; +Cc: linux-kernel

hoi :)

On Wed, Dec 08, 2004 at 05:32:13PM -0800, john stultz wrote:
> Ah, actually, you missed something. The remainder you propose above is
> in units of cycles multiplied by mult. Thus to get it back to just
> cycles, we have to divide. So:
> 
> 	ret *= ts->mult;
> 	if (rem)
> 		*rem = (ret & ((1ULL << ts->shift) - 1)) / ts->mult;
> 	ret >>= ts->shift;

you are of course right

> Agreed?

well, divisions are always slow and we would lose precision again.

I have another suggestion: just keep the remainder in units of
cycles*mult.

	ret *= ts->mult;
	if (rem) {
		ret += *rem;
		*rem = ret & ((1ULL << ts->shift) - 1);
	}
	ret >>= ts->shift;

and remove the offset_base adjustment below.
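
A standalone sketch of the carried-remainder variant, assuming <stdint.h>
types in place of cycle_t/nsec_t and passing mult and shift directly
instead of a timesource pointer. The function name is hypothetical. The
caller owns *rem and must start it at zero.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert cycles to nanoseconds, carrying the discarded low bits
 * (in units of cycles * mult) from one call to the next through *rem
 * so that repeated conversions do not accumulate truncation error.
 */
static uint64_t cyc2ns_carry(uint32_t mult, uint32_t shift,
			     uint64_t cycles, uint64_t *rem)
{
	uint64_t ret = cycles * mult;

	assert(shift < 64);
	if (rem) {
		/* Add back what the previous call had to drop. */
		ret += *rem;
		/* Save what this call is about to drop. */
		*rem = ret & ((UINT64_C(1) << shift) - 1);
	}
	return ret >> shift;
}
```

With mult = 3 and shift = 1 (1.5ns per cycle), four one-cycle conversions
return 1, 2, 1, 2 for a total of 6ns, where dropping the remainder each
time would give only 4ns.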

-- 
Martin Waitz



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 19:58       ` john stultz
  2004-12-08 20:14         ` Christoph Lameter
@ 2004-12-09  7:47         ` Ulrich Windl
  1 sibling, 0 replies; 29+ messages in thread
From: Ulrich Windl @ 2004-12-09  7:47 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On 8 Dec 2004 at 11:58, john stultz wrote:

[...]
> behavior of adjtimex().  We need to be able to implement the following
> adjustments within a single tick:
> 
> 1. Adjust the frequency by 500ppm for 10usecs 

What do you mean by "for 10usecs"?

> 2. After that adjust the frequency by 30ppm for the rest of the tick.

I'm not sure what you are talking about: plain old adjtime() or the NTP kernel
interface?

[...]
> I may have asked this before, but w/ 32 bit mult and shifts, how
> granular can these adjustments be?

Independent of any bits, the precision should be up to 1ns for reading and setting 
the clock, and as a consequence you might provide internal fractional nanoseconds 
(if you want to have a truly stable nanosecond clock model). If we get this right, 
there will be peace in this area until the wires in a PC are significantly shorter 
than 30cm (I think this is how far the light goes in 1ns). ;-)

> 
> Also additional complications arise when we have multiple things (like
> cpufreq) playing with the timesource frequency values as well. 

I see a big difference between precise time keeping and "playing" with
timesources.

Regards,
Ulrich



* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-08 23:36           ` john stultz
  2004-12-08 23:53             ` Christoph Lameter
@ 2004-12-09  7:57             ` Ulrich Windl
  2004-12-09  8:29               ` john stultz
  1 sibling, 1 reply; 29+ messages in thread
From: Ulrich Windl @ 2004-12-09  7:57 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott, Max,
	mahuja

On 8 Dec 2004 at 15:36, john stultz wrote:

[...]
> Take a look at the adjtimex man page as well as the ntp.c file from the
> timeofday core patch. There are number of different types of adjustments
> that are made, possibly at the same time. Briefly, they are (to my
> understanding - I'm going off my notes from awhile ago): 
> o tick adjustments
> 	how much time should pass in a _user_ tick

tick adjustments are considered obsolete, because if a clock implementation (or
hardware) is severely broken, users should reject using that stuff. Meaning: tick
adjustments are meant to be set once in the life of a computer system. No
continuous tuning.

> o frequency adjustments
> 	long term adjustment to correct for constant drift), 

Actually, you are compensating for a "tick problem" on a smaller scale (constant 
part), and variations caused by temperature, voltage, and others (variable part).

> o offset adjustments
> 	additional ppm adjustment to correct for current offset from the ntp
> server
> o single shot offset adjustments 
> 	old style adjtime functionality
> 
> Tick, frequency and offset adjustments can be precalculated and summed
> to a single ppm adjustment. This is similar to the style of adjustment
> you propose directly onto the time source frequency values. 
> 
> However, there is also this short term single shot adjustments. These
> adjustments are made by applying the MAX_SINGLESHOT_ADJ (500ppm) scaling
> for an amount of time (offset_len) which would compensate for the
> offset. This style is difficult because we cannot precompute it and
> apply it to an entire tick. Instead it needs to be applied for just a
> specific amount of time which may be only a fraction of a tick. When we

Yes, that's the "precise" variant of implementing it. Poor implementations are 
just accurate to one tick.

> start talking about systems with irregular tick frequency (via
> virtualization, or tickless systems) it becomes even more problematic. 
> 
> If this can be fudged then it becomes less of an issue. Or at worse, we
> have to do two mult/shift operations on two "parts" of the time interval
> using different adjustments.
> 
> Its starting to look doable, but its not necessarily the simplest thing
> (for me at least). I'll put it on my list, but patches would be more
> then welcome. 
> 
> thanks
> -john
> 
> 
> 




* Re: [RFC] New timeofday proposal (v.A1)
  2004-12-09  7:57             ` Ulrich Windl
@ 2004-12-09  8:29               ` john stultz
  0 siblings, 0 replies; 29+ messages in thread
From: john stultz @ 2004-12-09  8:29 UTC (permalink / raw)
  To: Ulrich Windl
  Cc: lkml, tim, george anzinger, albert, Len Brown, linux,
	David Mosberger, Andi Kleen, paulus, schwidefsky, keith maanthey,
	greg kh, Patricia Gaughen, Chris McDermott, Max Asbock, mahuja

On Thu, 2004-12-09 at 08:57 +0100, Ulrich Windl wrote:
> On 8 Dec 2004 at 15:36, john stultz wrote:
> 
> [...]
> > Take a look at the adjtimex man page as well as the ntp.c file from the
> > timeofday core patch. There are number of different types of adjustments
> > that are made, possibly at the same time. Briefly, they are (to my
> > understanding - I'm going off my notes from awhile ago): 
> > o tick adjustments
> > 	how much time should pass in a _user_ tick
> 
> tick adjustments are considered obsolete, because if a clock implementation (or
> hardware) is severely broken, users should reject using that stuff. Meaning: tick
> adjustments are meant to be set once in the life of a computer system. No
> continuous tuning.
> 
> > o frequency adjustments
> > 	long term adjustment to correct for constant drift), 
> 
> Actually, you are compensating for a "tick problem" on a smaller scale (constant 
> part), and variations caused by temperature, voltage, and others (variable part).
> 
> > o offset adjustments
> > 	additional ppm adjustment to correct for current offset from the ntp
> > server
> > o single shot offset adjustments 
> > 	old style adjtime functionality
> > 
> > Tick, frequency and offset adjustments can be precalculated and summed
> > to a single ppm adjustment. This is similar to the style of adjustment
> > you propose directly onto the time source frequency values. 
> > 
> > However, there is also this short term single shot adjustments. These
> > adjustments are made by applying the MAX_SINGLESHOT_ADJ (500ppm) scaling
> > for an amount of time (offset_len) which would compensate for the
> > offset. This style is difficult because we cannot precompute it and
> > apply it to an entire tick. Instead it needs to be applied for just a
> > specific amount of time which may be only a fraction of a tick. When we
> 
> Yes, that's the "precise" variant of implementing it. Poor implementations are 
> just accurate to one tick.

Thanks for your knowledgeable clarifications. It's good to know someone
out there deeply understands this stuff more than at a "this is what the
code is doing, and I have my own interpretation as to why" level. :)

thanks again,
-john




end of thread

Thread overview: 29+ messages
2004-12-08  1:55 [RFC] New timeofday proposal (v.A1) john stultz
2004-12-08  1:56 ` [RFC] new timeofday core subsystem (v.A1) john stultz
2004-12-08  1:57   ` [RFC] new timeofday arch specific hooks (v.A1) john stultz
2004-12-08  1:58     ` [RFC] new timeofday timesources (v.A1) john stultz
2004-12-08  2:02       ` john stultz
2004-12-08  9:17   ` [RFC] new timeofday core subsystem (v.A1) Pavel Machek
2004-12-08 18:44   ` Christoph Lameter
2004-12-08 19:22     ` john stultz
2004-12-08 18:53   ` Christoph Lameter
2004-12-08 20:27   ` Martin Waitz
     [not found]     ` <1102555933.1281.301.camel@cog.beaverton.ibm.com>
2004-12-09  7:32       ` Martin Waitz
2004-12-08 18:25 ` [RFC] New timeofday proposal (v.A1) Christoph Lameter
2004-12-08 19:11   ` john stultz
2004-12-08 19:20     ` Christoph Lameter
2004-12-08 19:58       ` john stultz
2004-12-08 20:14         ` Christoph Lameter
2004-12-08 21:25           ` George Anzinger
2004-12-08 23:47             ` Christoph Lameter
2004-12-08 23:36           ` john stultz
2004-12-08 23:53             ` Christoph Lameter
2004-12-09  0:17               ` john stultz
2004-12-09  0:40                 ` Christoph Lameter
2004-12-09  0:51                   ` john stultz
2004-12-09  1:24                     ` Christoph Lameter
2004-12-09  7:57             ` Ulrich Windl
2004-12-09  8:29               ` john stultz
2004-12-09  7:47         ` Ulrich Windl
2004-12-08 18:43 ` Nicolas Pitre
2004-12-08 18:57   ` john stultz
