Re: Linux time code

public inbox for linux-kernel@vger.kernel.org

* Re: Linux time code
@ 2006-08-23  6:25 linux
  2006-08-23 18:29 ` john stultz
  0 siblings, 1 reply; 28+ messages in thread
From: linux @ 2006-08-23  6:25 UTC (permalink / raw)
  To: linux-kernel

Just to summarize the kernel/NTP interaction for those not familiar....

The NTP daemon exchanges pings with a number of time sources.  Each ping
produces a round-trip time and a time offset; the latter is computed by
assuming that the one-way trip times are equal.  This is of course not
true, but is closest to true for pings with the shortest round-trip time,
and NTP tries to use those.
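
As a rough sketch of that per-ping arithmetic, here is the usual four-timestamp formulation.  The struct and function names are made up for illustration and are not taken from any NTP implementation; times are assumed to be plain nanosecond counts.

```c
#include <stdint.h>

/*
 * Four timestamps of one exchange, in nanoseconds.
 * t1: request sent (client clock)    t2: request received (server clock)
 * t3: reply sent (server clock)      t4: reply received (client clock)
 */
struct ntp_exchange {
	int64_t t1, t2, t3, t4;
};

/* Round-trip time on the network, excluding server processing time. */
static int64_t ntp_delay(const struct ntp_exchange *x)
{
	return (x->t4 - x->t1) - (x->t3 - x->t2);
}

/* Offset (server minus client), assuming equal one-way trip times. */
static int64_t ntp_offset(const struct ntp_exchange *x)
{
	return ((x->t2 - x->t1) + (x->t3 - x->t4)) / 2;
}
```

If the two one-way trips are not equal, the offset estimate is wrong by half the difference, which is why the exchanges with the shortest round trip are the most trustworthy.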

The sources are individually sanity-checked, then checked against each
other, in a rather complex way that has proved to be robust in practice.

I don't want to go into it in great detail, but there are three stages of
filtering after the per-source processing:
1) Selection, which takes all the sources' claims for the right time and
   the error in their claims, and finds the interval where the largest
   possible number of them overlap.  Sources that do not participate in
   the overlap do not proceed to clustering.
2) Clustering, where sources that are furthest from the average are
   repeatedly dropped to decrease the standard deviation.
3) Combining, where the remaining sources are averaged, weighted by
   their quality claims.

Note that a tight accuracy claim increases a source's weight in the third
stage, but makes it more likely that it'll be excluded by the first and
second stage filters.
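
To make the selection stage concrete, here is a simplified endpoint sweep that finds the interval covered by the most sources.  It is only a sketch of the idea, not the actual NTP intersection algorithm; best_overlap() and its arguments are hypothetical, and error bounds are assumed to be non-negative.

```c
#include <stdlib.h>

struct edge {
	double t;	/* endpoint position */
	int type;	/* +1 = interval opens, -1 = interval closes */
};

static int edge_cmp(const void *a, const void *b)
{
	const struct edge *x = a, *y = b;

	if (x->t != y->t)
		return (x->t < y->t) ? -1 : 1;
	/* At equal positions, handle opens before closes so that
	 * intervals touching at a point count as overlapping. */
	return y->type - x->type;
}

/*
 * Each source i claims the true offset lies in
 * [offset[i] - err[i], offset[i] + err[i]], with err[i] >= 0.
 * Returns the largest number of sources agreeing on a common
 * interval and stores that interval in *lo and *hi.
 * Returns 0 if n <= 0 or allocation fails.
 */
static int best_overlap(const double *offset, const double *err, int n,
			double *lo, double *hi)
{
	struct edge *e;
	int i, depth = 0, best = 0;

	if (n <= 0 || !(e = malloc(2 * (size_t)n * sizeof(*e))))
		return 0;
	for (i = 0; i < n; i++) {
		e[2 * i].t = offset[i] - err[i];
		e[2 * i].type = +1;
		e[2 * i + 1].t = offset[i] + err[i];
		e[2 * i + 1].type = -1;
	}
	qsort(e, 2 * (size_t)n, sizeof(*e), edge_cmp);

	for (i = 0; i < 2 * n; i++) {
		depth += e[i].type;
		if (e[i].type > 0 && depth > best) {
			best = depth;
			*lo = e[i].t;
			/* An open edge is never last, so e[i + 1] exists;
			 * it bounds the region at this depth. */
			*hi = e[i + 1].t;
		}
	}
	free(e);
	return best;
}
```

Note how a source with a very tight error bound contributes a narrow interval, which is exactly what makes it easy to fall outside the majority overlap.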

The above operation is run periodically (whenever there is new data from
one of the sources), and the output is a report on the amount by which
the local clock disagrees with the "right time", and a few associated
quality metrics.

This single time offset (and associated quality estimates) is the input
to the "clock discipline algorithm".  This is where it starts to get relevant
to the kernel.

NTP needs to divide the observed error into two categories: phase error
and frequency error.  The former is time offset that is uncorrelated
from sample to sample, and can be reduced by longer averaging.

The latter is due to the local clock's frequency changing, and cannot be
reduced by averaging.  Indeed, if you try to average together successive
errors of +1 ms, +2 ms, +3 ms, +4 ms, etc, the longer you average the
worse off you'll be before you start correcting and doing something
about the problem.

Now, phase error, a.k.a. jitter, is dominant over short time spans.
One simple source is the measurement granularity of your clock.
If you can only measure with 1 us resolution, you'll have +/-0.5 us of
jitter just from that.

Frequency error, a.k.a. wander, accumulates with time, so is dominant
over longer time intervals, particularly intervals over 1000 seconds.

When NTP first starts up, it considers all error to be frequency error,
to get the clock into approximately the right range.  This is not
terribly accurate, but numerically very stable; it never oscillates.
After a while, it shifts to a phase-locked loop where most of the error
is deemed to be phase error, and only a bit is frequency error.  This can
produce the best time, but can freak out and overshoot if the offsets
used as input are bizarrely behaved.  NTP adjusts the polling interval,
to check with the clock sources more frequently, if it notices that
things are getting a bit weird, and falls back to the frequency-locked
loop if it notices that things have gotten really bad.

Anyway, at the end of this computation, you get a frequency and phase
(time) correction to be applied to the local clock.  This is then applied
by adjusting the frequency of the system clock.  The frequency correction
is applied permanently; the phase correction is applied by adjusting
the frequency a bit more for a short while.

To be precise, 1/64 of the current phase error estimate is corrected
per second, and the phase error estimate is reduced accordingly.
This continues until the phase error is reduced to zero, or a new phase
error is computed.  So if the phase error is 64 microseconds, the clock is
adjusted by 1 ppm for one second, then by 63/64 ppm for the next second,
and so on.  This gives a half-life of 44 seconds.
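
The decay is easy to check numerically.  The following is a toy simulation of the once-per-second step only, in floating point; phase_remaining() is a made-up name, and the kernel itself uses scaled integer arithmetic.

```c
#include <stddef.h>

/*
 * Each second, 1/64 of the remaining phase error is applied to the
 * clock and removed from the estimate.  Returns the error left after
 * `seconds` steps.  If `first` is non-NULL, the correction applied in
 * the first second is stored there.
 */
static double phase_remaining(double error_us, int seconds, double *first)
{
	int i;

	for (i = 0; i < seconds; i++) {
		/* Microseconds per second is the same thing as ppm. */
		double step = error_us / 64.0;

		if (i == 0 && first != NULL)
			*first = step;
		error_us -= step;
	}
	return error_us;
}
```

Starting from 64 us, the first step is exactly 1 us (1 ppm for one second), and after 44 steps almost exactly half of the original error is left.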

Anyway, implementing the exponential phase correction in the kernel is
optional; what NTP really wants is a knob to adjust the clock frequency.
It can just call that once per second to make phase corrections if
required.

In practice, we pass the frequency and phase corrections into the
kernel via the adjtimex() call and let it amortize the phase correction.
Although more than strictly necessary, this is not all bad, as it avoids
the need to wake up a daemon every second to fiddle the clock frequency.
Theoretically, you can code all of this to not wake the kernel up every
tick, although it's not implemented that way right now.

Also note that the current exponential-average way of making gradual
phase corrections is not very critical.  You want to get the total
right, but typical closed-loop time-sync applications aren't even
very sensitive to errors there.  The details of the amortization
schedule are quite non-critical.

So if a tickless user-mode Linux instance is woken up after
a long sleep, it would be more than good enough to process the interval as:

	half_lives = interval / 44;
	interval -= half_lives * 44;
	correction = time_offset;
	correction -= time_offset >>= half_lives;
	xtime += correction;

	/* ... other necessary stuff from second_overflow */

	while (interval--)
		second_overflow();

--- Postscript: Tangents ---

My pet idea on how to do precision timestamping is to separate grabbing
a low-level timer read from converting to portable time units.  If you
can bound the time elapsed between those two events, you can just
keep the information needed to do the conversion around for that long.
If you can tolerate some slop in old time conversions, you can easily do
lossy compression on old conversion parameters.  If you use a piecewise
linear transformation between raw and portable timestamps (seconds =
raw * period + offset, valid for some interval), then you can collapse
two such segments into one by a suitable average of the clock periods.
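
A minimal sketch of such segments and of the collapsing step, assuming the two segments are adjacent and continuous.  The struct layout is invented for illustration, and the conversion is written relative to the segment start rather than as a bare raw * period + offset, purely to keep the numbers small.

```c
#include <stdint.h>

/*
 * One linear segment, valid for raw_start <= raw <= raw_end:
 *   seconds = (raw - raw_start) * period + offset
 */
struct segment {
	uint64_t raw_start, raw_end;
	double period;	/* seconds per raw tick */
	double offset;	/* portable time at raw_start */
};

static double seg_convert(const struct segment *s, uint64_t raw)
{
	return (double)(raw - s->raw_start) * s->period + s->offset;
}

/*
 * Collapse adjacent segments a and b (a->raw_end == b->raw_start)
 * into one that keeps both outer endpoints exact.  For continuous
 * segments the resulting period is the tick-weighted average of the
 * two original periods; conversions in the interior become lossy.
 */
static struct segment seg_merge(const struct segment *a,
				const struct segment *b)
{
	struct segment m;
	double end = seg_convert(b, b->raw_end);

	m.raw_start = a->raw_start;
	m.raw_end = b->raw_end;
	m.offset = a->offset;
	m.period = (end - a->offset) / (double)(m.raw_end - m.raw_start);
	return m;
}
```

The endpoints survive the merge; the slop shows up only for raw values near the old boundary between the two segments.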

The one problem there is that, given two adjacent raw samples, if one
is delayed a long time before being converted to a portable timestamp,
the lossy compression can violate monotonicity; i.e. the portable
timestamps might come out in the wrong order.  I have to confess, I
can't think of a 100% fix other than a hard bound on time to convert,
or a probably-messier-than-it's-worth registration of unconverted raw
timestamps which can be converted as part of pushing the conversion
parameters out of scope.  (The whole idea is to move work out of
hardirq handlers.)

However, is this sort of large conversion-delay skew for closely-spaced
timestamps likely?  It seems that they would have to come through
different code paths, and then does the ordering at the microsecond
level matter?

Oh, and if you're going to implement Posix gettimeofday(), have a look at
Markus Kuhn's UTS proposal (http://www.cl.cam.ac.uk/~mgk25/uts.txt).
Given that Posix mandates that days have 86400 seconds
(http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_14),
but the UTC standard maintained by the BIPM
(http://www.bipm.fr/en/scientific/tai/time_server.html) mandates
that days sometimes have 86401 seconds, his system is arguably
the least insane way to reconcile the irreconcilable.
(See also http://www.cl.cam.ac.uk/~mgk25/volatile/ITU-R-TF.460-4.pdf)

E.g.  The following handles positive leap seconds (only).
next_leap_second is the Posix timestamp of the next leap
second (23:59:60), which is always a multiple of 86400.
(http://hpiers.obspm.fr/eop-pc/earthor/utc/TAI-UTC_tab.html)

If you wanted to add the (strictly theoretical and never likely to be
used) negative leap second code, you could distinguish negative leap
seconds by the fact that next_leap_second is odd (congruent to -1
mod 86400).

#define MILLION 1000000
#define BILLION 1000000000

extern unsigned tai_minus_utc;	/* Currently 33 */
extern time_t next_leap_second;	/* UTC time after which tai_minus_utc++ */

	switch (clk_id) {
	case CLOCK_UTC:
		clock_gettime(CLOCK_TAI, tp);
		tp->tv_sec -= tai_minus_utc;
		/* Leap seconds per http://www.cl.cam.ac.uk/~mgk25/time/c/ */
		if (tp->tv_sec >= next_leap_second) {
			if (tp->tv_sec == next_leap_second)
				tp->tv_nsec += BILLION;
			tp->tv_sec--;
		}
		break;
	case CLOCK_UTS:
		/* Recommended for gettimeofday() & time() */
		/* See http://www.cl.cam.ac.uk/~mgk25/uts.txt */
		clock_gettime(CLOCK_TAI, tp);
		tp->tv_sec -= tai_minus_utc;
		if (tp->tv_sec > next_leap_second) {
			tp->tv_sec--;
		} else if (next_leap_second - tp->tv_sec < 1000) {
			/* 1000 UTC/TAI seconds = 999 UTS seconds */
			uint32_t offset = next_leap_second - tp->tv_sec + 1;
			offset *= MILLION;
			offset += (uint32_t)(BILLION - tp->tv_nsec)/1000;
			if ((tp->tv_nsec -= offset) < 0) {
				tp->tv_nsec += BILLION;
				tp->tv_sec--;
			}
		}
		break;
	}
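
For reference, the smoothing that the "1000 UTC/TAI seconds = 999 UTS seconds" comment describes can also be written on a flat 64-bit nanosecond count.  This is a standalone sketch of the mapping for an inserted leap second only, not a transcription of the code above; uts_from_preleap() is a hypothetical name and the input is assumed to be TAI minus the pre-leap tai_minus_utc.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * t_ns:   TAI minus the pre-leap TAI-UTC offset, in nanoseconds
 * leap_s: next_leap_second, i.e. the second labelled 23:59:60
 * Returns the smoothed (UTS-style) timestamp in nanoseconds.
 */
static int64_t uts_from_preleap(int64_t t_ns, int64_t leap_s)
{
	int64_t start = (leap_s - 999) * NSEC_PER_SEC;	/* 23:43:21 */
	int64_t end = (leap_s + 1) * NSEC_PER_SEC;	/* leap second over */

	if (t_ns < start)
		return t_ns;			/* before the window */
	if (t_ns >= end)
		return t_ns - NSEC_PER_SEC;	/* leap fully absorbed */
	/* Inside the window: fall behind by 1 ns per 1000 ns elapsed. */
	return t_ns - (t_ns - start) / 1000;
}
```

The result is continuous at both ends of the window and never steps backwards, which is the whole point of smoothing.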

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Linux time code
  2006-08-23  6:25 Linux time code linux
@ 2006-08-23 18:29 ` john stultz
  2006-08-24  2:35   ` linux
                     ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: john stultz @ 2006-08-23 18:29 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel, Roman Zippel, Theodore Ts'o

On Wed, 2006-08-23 at 02:25 -0400, linux@horizon.com wrote:
> Oh, and if you're going to implement Posix gettimeofday(), have a look at
> Markus Kuhn's UTS proposal (http://www.cl.cam.ac.uk/~mgk25/uts.txt).
> Given that Posix mandates that days have 86400 seconds
> (http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_14),
> but the UTC standard maintained by the BIPM
> (http://www.bipm.fr/en/scientific/tai/time_server.html) mandates
> that days sometimes have 86401 seconds, his system is arguably
> the least insane way to reconcile the irreconcilable.
> (See also http://www.cl.cam.ac.uk/~mgk25/volatile/ITU-R-TF.460-4.pdf)
> 
> E.g.  The following handles positive leap seconds (only).
> next_leap_second is the Posix timestamp of the next leap
> second (23:59:60), which is always a multiple of 86400.
> (http://hpiers.obspm.fr/eop-pc/earthor/utc/TAI-UTC_tab.html)
> 
> If you wanted to add the (strictly theoretical and never likely to be
> used) negative leap second code, you could distinguish negative leap
> seconds by the fact that next_leap_second is odd (congruent to -1
> mod 86400).
> 
> #define MILLION 1000000
> #define BILLION 1000000000
> 
> extern unsigned tai_minus_utc;	/* Currently 33 */
> extern time_t next_leap_second;	/* UTC time after which tai_minus_utc++ */
> 
> 	switch (clk_id) {
> 	case CLOCK_UTC:
> 		clock_gettime(CLOCK_TAI, tp);
> 		tp->tv_sec -= tai_minus_utc;
> 		/* Leap seconds per http://www.cl.cam.ac.uk/~mgk25/time/c/ */
> 		if (tp->tv_sec >= next_leap_second) {
> 			if (tp->tv_sec == next_leap_second)
> 				tp->tv_nsec += BILLION;
> 			tp->tv_sec--;
> 		}
> 		break;
> 	case CLOCK_UTS:
> 		/* Recommended for gettimeofday() & time() */
> 		/* See http://www.cl.cam.ac.uk/~mgk25/uts.txt */
> 		clock_gettime(CLOCK_TAI, tp);
> 		tp->tv_sec -= tai_minus_utc;
> 		if (tp->tv_sec > next_leap_second) {
> 			tp->tv_sec--;
> 		} else if (next_leap_second - tp->tv_sec < 1000) {
> 			/* 1000 UTC/TAI seconds = 999 UTS seconds */
> 			uint32_t offset = next_leap_second - tp->tv_sec + 1;
> 			offset *= MILLION;
> 			offset += (uint32_t)(BILLION - tp->tv_nsec)/1000;
> 			if ((tp->tv_nsec -= offset) < 0) {
> 				tp->tv_nsec += BILLION;
> 				tp->tv_sec--;
> 			}
> 		}
> 		break;
> 	}


I was talking about the UTS/leapsecond bits w/ Ted just the other day
and had a similar thought! To me it makes quite a bit of sense to
generate UTC and UTS from TAI, just as you do in the above, since UTC =
TAI + leapsecond offset, just as local time = GMT + timezone offset.

However the difficulty would be that while NTP provides leapsecond +/-
notifiers, it doesn't provide the absolute UTC offset from TAI. So there
isn't a way for the kernel to generate TAI, from a UTC settimeofday
call. Some method to distribute and inform the kernel of the absolute
leapsecond offset (tai_minus_utc in your code above) would be necessary.

Additionally creating UTS and UTC at the same time would be a bit
complicated. Your solution above isn't quite UTS, since it only handles
the leap insertion, however the insertion case is the one that causes
users most of the pain (since the clock goes backward), so it may very
well be good enough.

Overall, I like your idea quite a bit. Might we look forward to a
patch? :)

Roman: your thoughts?

thanks
-john



* Re: Linux time code
  2006-08-23 18:29 ` john stultz
@ 2006-08-24  2:35   ` linux
  2006-08-28 11:39     ` Roman Zippel
  2006-08-26  0:17   ` linux
  2006-08-26  3:46   ` linux
  2 siblings, 1 reply; 28+ messages in thread
From: linux @ 2006-08-24  2:35 UTC (permalink / raw)
  To: johnstul, linux; +Cc: linux-kernel, theotso, zippel

>> I was talking about the UTS/leapsecond bits w/ Ted just the other day
>> and had a similar thought! To me it makes quite a bit of sense to
>> generate UTC and UTS from TAI, just as you do in the above, since UTC =
>> TAI + leapsecond offset, just as local time = GMT + timezone offset.

> However the difficulty would be that while NTP provides leapsecond +/-
> notifiers, it doesn't provide the absolute UTC offset from TAI. So there
> isn't a way for the kernel to generate TAI, from a UTC settimeofday
> call. Some method to distribute and inform the kernel of the absolute
> leapsecond offset (tai_minus_utc in your code above) would be necessary.

Well, there are several possibilities.  For the opinion of experts on
the subject see paper #7 from http://www.cis.udel.edu/~mills/ntp.html:

Levine, J., and D. Mills. Using the Network Time Protocol to transmit
International Atomic Time (TAI). Proc. Precision Time and Time Interval
(PTTI) Applications and Planning Meeting (Reston VA, November 2000).

http://www.cis.udel.edu/~mills/database/papers/leapsecond.{ps,pdf}

This describes an NTP extension to disseminate leap second times.

GPS broadcasts the absolute offset of UTC from GPS, which is itself
19 s from TAI, so you can get TAI.

You can also poll
ftp://time-b.nist.gov/pub/leap-seconds.list
every few months.  Note that a directory listing will
tell you if anything has changed, since that's a symlink
to the real file, whose name includes an update timestamp.

You can also just accumulate the +/- notifiers to figure out the offset.

I think this can be entered very easily using sysctl.

> Additionally creating UTS and UTC at the same time would be a bit
> complicated. Your solution above isn't quite UTS, since it only handles
> the leap insertion, however the insertion case is the one that causes
> users most of the pain (since the clock goes backward), so it may very
> well be good enough.

It's not that it's hard to implement leap deletion, but it's code
on a moderately hot path (gettimeofday() is a very popular system
call) that will, as far as anyone knows, never be used.

If you want the full version, try:

	case CLOCK_UTS:
		/* Recommended for gettimeofday() & time() */
		/* See http://www.cl.cam.ac.uk/~mgk25/uts.txt */
		clock_gettime(CLOCK_TAI, tp);
		tp->tv_sec -= tai_minus_utc;

		if (tp->tv_sec > next_leap_second) {
			tp->tv_sec += (next_leap_second & 1) ? -1 : 1;

		} else if (next_leap_second - tp->tv_sec < 1000) {
			/* 1000 UTC/TAI seconds = 999 or 1001 UTS seconds */
			uint32_t offset = next_leap_second - tp->tv_sec + 1;
			offset *= MILLION;
			offset += (uint32_t)(BILLION - tp->tv_nsec)/1000;
			if (next_leap_second & 1) {
				/* Negative (deleted) leap second */
				if ((tp->tv_nsec += offset) >= BILLION) {
					tp->tv_nsec -= BILLION;
					tp->tv_sec++;
				}
			} else {
				/* Positive (inserted) leap second */
				if ((tp->tv_nsec -= offset) < 0) {
					tp->tv_nsec += BILLION;
					tp->tv_sec--;
				}
			}
		}
		break;

Note that this code does not interact nicely with updates to tai_minus_utc
and next_leap_second.  An RCU-like scheme would involve a pre- and
post-leap tai_minus_utc, which lets you schedule a new leap by:

<wait for idle>
# At this point, everyone knows that next_leap_second has passed, and
# so pre_tai_utc is don't care
pre_tai_utc = post_tai_utc;
<wait for idle>
# Now next_leap_second is don't care.
next_leap_second = <announced time>
<wait for idle>
# Now post_tai_utc can be rewritten.
post_tai_utc++;

Which doesn't require any locking on the part of the reader, just not
blocking during the conversion.
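
One possible shape for the reader side, as a sketch only.  It assumes the three variables are updated in the order shown above with a grace period between steps, uses C11 atomics in place of whatever primitives the kernel would actually use, and utc_from_tai() is a made-up name.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic int64_t next_leap_second;	/* in pre-leap UTC seconds */
static _Atomic int32_t pre_tai_utc;		/* offset before the leap */
static _Atomic int32_t post_tai_utc;		/* offset after the leap */

/* Lock-free reader: pick the offset that applies to this instant. */
static int64_t utc_from_tai(int64_t tai_sec)
{
	int64_t leap = atomic_load(&next_leap_second);
	int64_t t = tai_sec - atomic_load(&pre_tai_utc);

	if (t > leap)	/* leap second has passed: use the new offset */
		t = tai_sec - atomic_load(&post_tai_utc);
	return t;
}
```

At each step of the update sequence, the variable being rewritten is one that no reader's result can depend on, so a reader that does not block in the middle always sees a consistent answer.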

> Overall, I like your idea quite a bit. Might we look forward to a
> patch? :)

Um, the UTS one I talked about, or the two-phase grab-raw and
convert-to-portable implementation technique?  If the latter,
can we come to some agreement about the questions asked therein?

There's a very nice implementation in PHK's FreeBSD timecounter code.


* Re: Linux time code
  2006-08-24  2:35   ` linux
@ 2006-08-28 11:39     ` Roman Zippel
  2006-08-28 22:36       ` john stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Roman Zippel @ 2006-08-28 11:39 UTC (permalink / raw)
  To: linux; +Cc: johnstul, linux-kernel, theotso

Hi,

On Thu, 23 Aug 2006, linux@horizon.com wrote:

> > Additionally creating UTS and UTC at the same time would be a bit
> > complicated. Your solution above isn't quite UTS, since it only handles
> > the leap insertion, however the insertion case is the one that causes
> > users most of the pain (since the clock goes backward), so it may very
> > well be good enough.
> 
> It's not that it's hard to implement leap deletion, but it's code
> on a moderately hot path (gettimeofday() is a very popular system
> call) that will, as far as anyone knows, never be used.
> 
> If you want the full version, try:
> 
> 	case CLOCK_UTS:
> 		/* Recommended for gettimeofday() & time() */
> 		/* See http://www.cl.cam.ac.uk/~mgk25/uts.txt */
> 		clock_gettime(CLOCK_TAI, tp);
> 		tp->tv_sec -= tai_minus_utc;
> 
> 		if (tp->tv_sec > next_leap_second) {
> 			tp->tv_sec += (next_leap_second & 1) ? -1 : 1;
> 
> 		} else if (next_leap_second - tp->tv_sec < 1000) {
> 			/* 1000 UTC/TAI seconds = 999 or 1001 UTS seconds */
> 			uint32_t offset = next_leap_second - tp->tv_sec + 1;
> 			offset *= MILLION;
> 			offset += (uint32_t)(BILLION - tp->tv_nsec)/1000;
> 			if (next_leap_second & 1) {
> 				/* Negative (deleted) leap second */
> 				if ((tp->tv_nsec += offset) >= BILLION) {
> 					tp->tv_nsec -= BILLION;
> 					tp->tv_sec++;
> 				}
> 			} else {
> 				/* Positive (inserted) leap second */
> 				if ((tp->tv_nsec -= offset) < 0) {
> 					tp->tv_nsec += BILLION;
> 					tp->tv_sec--;
> 				}
> 			}
> 		}
> 		break;

Doing something like this for gettimeofday() is pretty much obsolete 
with the new time code. OTOH it's rather simple to smooth out the leap 
second now, you can set time_adjust and adjust MAX_TICKADJ and the clock 
will follow nicely.

bye, Roman


* Re: Linux time code
  2006-08-28 11:39     ` Roman Zippel
@ 2006-08-28 22:36       ` john stultz
  2006-08-29  3:28         ` linux
  2006-08-29 14:43         ` Roman Zippel
  0 siblings, 2 replies; 28+ messages in thread
From: john stultz @ 2006-08-28 22:36 UTC (permalink / raw)
  To: Roman Zippel; +Cc: linux, linux-kernel, theotso

On Mon, 2006-08-28 at 13:39 +0200, Roman Zippel wrote:
> On Thu, 23 Aug 2006, linux@horizon.com wrote:
> > > Additionally creating UTS and UTC at the same time would be a bit
> > > complicated. Your solution above isn't quite UTS, since it only handles
> > > the leap insertion, however the insertion case is the one that causes
> > > users most of the pain (since the clock goes backward), so it may very
> > > well be good enough.
> > 
> > It's not that it's hard to implement leap deletion, but it's code
> > on a moderately hot path (gettimeofday() is a very popular system
> > call) that will, as far as anyone knows, never be used.
> > 
> > If you want the full version, try:
> > 
> > 	case CLOCK_UTS:
> > 		/* Recommended for gettimeofday() & time() */
> > 		/* See http://www.cl.cam.ac.uk/~mgk25/uts.txt */
> > 		clock_gettime(CLOCK_TAI, tp);
> > 		tp->tv_sec -= tai_minus_utc;
> > 
> > 		if (tp->tv_sec > next_leap_second) {
> > 			tp->tv_sec += (next_leap_second & 1) ? -1 : 1;
> > 
> > 		} else if (next_leap_second - tp->tv_sec < 1000) {
> > 			/* 1000 UTC/TAI seconds = 999 or 1001 UTS seconds */
> > 			uint32_t offset = next_leap_second - tp->tv_sec + 1;
> > 			offset *= MILLION;
> > 			offset += (uint32_t)(BILLION - tp->tv_nsec)/1000;
> > 			if (next_leap_second & 1) {
> > 				/* Negative (deleted) leap second */
> > 				if ((tp->tv_nsec += offset) >= BILLION) {
> > 					tp->tv_nsec -= BILLION;
> > 					tp->tv_sec++;
> > 				}
> > 			} else {
> > 				/* Positive (inserted) leap second */
> > 				if ((tp->tv_nsec -= offset) < 0) {
> > 					tp->tv_nsec += BILLION;
> > 					tp->tv_sec--;
> > 				}
> > 			}
> > 		}
> > 		break;
> 
> Doing something like this for gettimeofday() is pretty much obsolete 
> with the new time code. OTOH it's rather simple to smooth out the leap 
> second now, you can set time_adjust and adjust MAX_TICKADJ and the clock 
> will follow nicely.

While it's possible to smooth out the leapsecond (which would be useful
to many folks), the problem is one's system would then diverge from UTC
for that leapsecond. 

The idea he's proposing here is to keep both UTC and UTS as separate
clock ids, allowing apps to choose which standard (well, UTS isn't
quite a standard) they want to follow.

I think this would be quite useful, as I've seen a number of requests
where users don't want the leapsecond inconsistency, and others where
they need to strictly follow UTC.

I think having TAI would be nice too, but that requires quite a bit of
infrastructure work (NTP distributing absolute leapsecond counts, etc).

thanks
-john



* Re: Linux time code
  2006-08-28 22:36       ` john stultz
@ 2006-08-29  3:28         ` linux
  2006-08-29 13:15           ` Theodore Tso
  2006-08-29 19:23           ` john stultz
  2006-08-29 14:43         ` Roman Zippel
  1 sibling, 2 replies; 28+ messages in thread
From: linux @ 2006-08-29  3:28 UTC (permalink / raw)
  To: johnstul, zippel; +Cc: linux-kernel, linux, theotso

> While it's possible to smooth out the leapsecond (which would be useful
> to many folks), the problem is one's system would then diverge from UTC
> for that leapsecond. 

The Posix-mandated behaviour *requires* diverging from UTC for some
time period around the leap second.  All you can do is decide how
to schedule the divergence.

Note that any real clock diverges from UTC by some amount, and
POSIX's denial of leap seconds

> The idea he's proposing here is to keep both UTC and UTS as separate
> clock ids, allowing apps to choose which standard (well, UTS isn't
> quite a standard) they want to follow.
> 
> I think this would be quite useful, as I've seen a number of requests
> where users don't want the leapsecond inconsistency, and others where
> they need to strictly follow UTC.

I think smoothing it out should be the default for Posix-specified things
like gettimeofday() and CLOCK_REALTIME, since that is, as I said, the
least insane way to deal with the contradictory POSIX requirements.

But also provide an explicit CLOCK_UTC and CLOCK_UTS for people who care
and want to be specific.  adjtimex() should stay UTC, since it returns leap
second information.

> I think having TAI would be nice too, but that requires quite a bit of
> infrastructure work (NTP distributing absolute leapsecond counts, etc).

Yes, but it would be damned nice.  To implement leap seconds at all,
you need to have notice of at least the next one.  The Olson time
zone files, which have a similar several-month advance-notice schedule,
include leap-second information.

Combining messages:
> With the new clocksource code, we can (currently just i386, but the
> architecture is generic and I'm working on the other arches) make use of
> continuous clocksources for accumulating time instead of having to deal
> with the problematic PIT (as well as the lost ticks issue).

If it's there, it's great, but what about i386EX embedded boards and
the like?  It's approximately manageable on uniprocessor, but can
I be sure there's always something (what?) better than the PIT on
*every* SMP system?

I need to study what you've done and see how to use it.

> Maybe I'm missing what you're proposing, but I think "that pit of
> madness" can now be avoided. :)

I'm just trying to start with the best possible worst-case situation,
and then improve on things from there.  Implement the robust slow path
first, then add fast paths for common cases.


* Re: Linux time code
  2006-08-29  3:28         ` linux
@ 2006-08-29 13:15           ` Theodore Tso
  2006-08-29 15:18             ` linux
  2006-08-29 19:23           ` john stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2006-08-29 13:15 UTC (permalink / raw)
  To: linux; +Cc: johnstul, zippel, linux-kernel, theotso

On Mon, Aug 28, 2006 at 11:28:29PM -0400, linux@horizon.com wrote:
> > While it's possible to smooth out the leapsecond (which would be useful
> > to many folks), the problem is one's system would then diverge from UTC
> > for that leapsecond. 
> 
> The Posix-mandated behaviour *requires* diverging from UTC for some
> time period around the leap second.  All you can do is decide how
> to schedule the divergence.

POSIX mandates this for gettimeofday() and CLOCK_REALTIME.  

However, a conforming implementation could (either in userspace or in
the kernel) provide access to other time bases, including TAI or the
proposed UTS time scales.

						- Ted


* Re: Linux time code
  2006-08-29 13:15           ` Theodore Tso
@ 2006-08-29 15:18             ` linux
  0 siblings, 0 replies; 28+ messages in thread
From: linux @ 2006-08-29 15:18 UTC (permalink / raw)
  To: linux, tytso; +Cc: johnstul, linux-kernel, theotso, zippel

>> The Posix-mandated behaviour *requires* diverging from UTC for some
>> time period around the leap second.  All you can do is decide how
>> to schedule the divergence.

> POSIX mandates this for gettimeofday() and CLOCK_REALTIME.  

> However, a conforming implementation could (either in userspace or in
> the kernel) provide access to other time bases, including TAI or the
> proposed UTS time scales.

The suggestion is to use UTS to implement CLOCK_REALTIME and
gettimeofday().

Since CLOCK_REALTIME has no specified accuracy bounds, it's a legal
realization, but UTS provides defined behavior when you have better time
sync than the 1s uncertainty inherent in the POSIX spec.

time() is more interesting, since it's so quantized already.  Is it better
to have a 2-second second, or to keep it in sync with gettimeofday()
and have 1000 1.001-second seconds?


* Re: Linux time code
  2006-08-29  3:28         ` linux
  2006-08-29 13:15           ` Theodore Tso
@ 2006-08-29 19:23           ` john stultz
  1 sibling, 0 replies; 28+ messages in thread
From: john stultz @ 2006-08-29 19:23 UTC (permalink / raw)
  To: linux; +Cc: zippel, linux-kernel, theotso

On Mon, 2006-08-28 at 23:28 -0400, linux@horizon.com wrote:
> > With the new clocksource code, we can (currently just i386, but the
> > architecture is generic and I'm working on the other arches) make use of
> > continuous clocksources for accumulating time instead of having to deal
> > with the problematic PIT (as well as the lost ticks issue).
> 
> If it's there, it's great, but what about i386EX embedded boards and
> the like?

The PIT clocksource is available for those situations, but is one of the
lowest rated clocksources, so anything else will be used if it's
available.

>   It's approximately manageable on uniprocessor, but can
> I be sure there's always something (what?) better than the PIT on
> *every* SMP system?

Yea. With the exception of NUMAQ almost every i386 SMP system either can
use the TSC or has an alternative clocksource (acpi pm, hpet, cyclone).


> I need to study what you've done and see how to use it.

Let me know if you have any questions or thoughts about it.

thanks
-john




* Re: Linux time code
  2006-08-28 22:36       ` john stultz
  2006-08-29  3:28         ` linux
@ 2006-08-29 14:43         ` Roman Zippel
  1 sibling, 0 replies; 28+ messages in thread
From: Roman Zippel @ 2006-08-29 14:43 UTC (permalink / raw)
  To: john stultz; +Cc: linux, linux-kernel, theotso

Hi,

On Mon, 28 Aug 2006, john stultz wrote:

> While it's possible to smooth out the leapsecond (which would be useful
> to many folks), the problem is one's system would then diverge from UTC
> for that leapsecond. 
> 
> The idea he's proposing here is to keep both UTC and UTS as separate
> clock ids, allowing apps to choose which standard (well, UTS isn't
> quite a standard) they want to follow.

Making it a separate clock would be a bit more complex and I don't know if 
it's really worth it for an event that only happens every few years.
We already have everything we need to adjust CLOCK_REALTIME, so it would 
not be a real problem to support a timezone UTS.

> I think this would be quite useful, as I've seen a number of requests
> where users don't want the leapsecond inconsistency, and others where
> they need to strictly follow UTC.
> 
> I think having TAI would be nice too, but that requires quite a bit of
> infrastructure work (NTP distributing absolute leapsecond counts, etc).

That's the other possibility, as soon as we update the userspace interface 
to NTP4, it will also include the TAI value, so it will be available via 
adjtimex()/ntp_gettime().

bye, Roman


* Re: Linux time code
  2006-08-23 18:29 ` john stultz
  2006-08-24  2:35   ` linux
@ 2006-08-26  0:17   ` linux
  2006-08-28 22:41     ` john stultz
  2006-08-26  3:46   ` linux
  2 siblings, 1 reply; 28+ messages in thread
From: linux @ 2006-08-26  0:17 UTC (permalink / raw)
  To: johnstul, linux; +Cc: linux-kernel, theotso, zippel

> Overall, I like your idea quite a bit. Might we look forward to a
> patch? :)

Researching this has led me into that pit of madness, the i8254
programmable interval timer specification.

This is the original IBM PC timer (well, the version after the original
8253), counting at 13125000/11 = 1193181.81... Hz, programmed to divide
by 11932 to generate a 100 Hz timer interrupt.  If you want to be picky,
the options are

Divisor	Exact frequency	Decimal Hz	Error

11932	13125000/131252	  99.998476...	-15 ppm
 4773	13125000/52503	 249.985715...	-57 ppm
 1193	13125000/13123	1000.152404...	152 ppm

But, of course, the (at best) 100 ppm quality of typical cheap quartz
oscillators means that only the first 4 digits of the frequency matter.
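
The table entries can be reproduced with a couple of one-liners.  This is just the arithmetic from the table; the helper names are made up.

```c
/*
 * The PIT input clock is 13125000/11 Hz, so a divisor d gives an
 * interrupt rate of 13125000 / (11 * d) Hz.
 */
static double pit_hz(unsigned divisor)
{
	return 13125000.0 / (11.0 * divisor);
}

/* Error of the achieved rate relative to the nominal HZ, in ppm. */
static double pit_error_ppm(unsigned divisor, unsigned nominal_hz)
{
	return (pit_hz(divisor) / nominal_hz - 1.0) * 1e6;
}
```

For example, pit_error_ppm(1193, 1000) comes out at roughly +152 ppm, matching the last row.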

Anyway, if you see a high value in the counter (it counts down, so this
means that it was just reset), the question arises: has the interrupt
handler run and propagated the carry bit yet?

Now, if you assume that the interrupt controller gets the signal in
negligible time, it's not that hard.  Grab the software-managed
high-order bits, read the low-order bits from the 8254, and if it
reports "just rolled over" (generally, in the first half of its count
range), re-grab the software bits.  If interrupts are disabled,
also check the interrupt pending bit on the controller.

The logic on FreeBSD to do that is in i8254_get_timecount
(source/i386/isa/clock.c):

static unsigned
i8254_get_timecount(struct timecounter *tc)
{
	u_int count;
	u_int high, low;
	u_int eflags;

	eflags = read_eflags();
	mtx_lock_spin(&clock_lock);

	/* Select timer0 and latch counter value. */
	outb(TIMER_MODE, TIMER_SEL0 | TIMER_LATCH);

	low = inb(TIMER_CNTR0);
	high = inb(TIMER_CNTR0);
	count = timer0_max_count - ((high << 8) | low);
	if (count < i8254_lastcount ||
	    (!i8254_ticked && (clkintr_pending ||
	    ((count < 20 || (!(eflags & PSL_I) && count < timer0_max_count / 2u)) &&
	    i8254_pending != NULL && i8254_pending(i8254_intsrc))))) {
		i8254_ticked = 1;
		i8254_offset += timer0_max_count;
	}
	i8254_lastcount = count;
	count += i8254_offset;
	mtx_unlock_spin(&clock_lock);
	return (count);
}

Parsing that four-line if () condition is a bit tricky.  First, notice
that count has already been converted to up-counter form.

 1) If the count has gone backwards, it has obviously wrapped, else
 2) If we've already recorded that it has wrapped, it hasn't, else
 3) If we have a pending clock interrupt, it's wrapped, else
4a) If the count is less than 20, or is less than half of full-scale and
    interrupts are disabled, AND
4b) The interrupt is pending at the interrupt controller,
    then it has wrapped.
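
The four rules above can be restated as a pure function, which is easier to
test than the original if ().  This is only a sketch: the struct and parameter
names are hypothetical, hardware access and locking are left out, and the
NULL check on i8254_pending is folded into the controller_pending argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the two pieces of software state involved. */
struct pit_state {
	unsigned last_count;	/* previous reading, in up-counter form */
	bool ticked;		/* wrap already accounted for this period */
};

/*
 * Decide whether the counter has wrapped without the software high-order
 * bits having been updated yet.  count is in up-counter form.
 */
static bool pit_has_wrapped(const struct pit_state *st, unsigned count,
			    unsigned max_count, bool irq_enabled,
			    bool handler_pending, bool controller_pending)
{
	bool just_reset;

	assert(st != 0);

	if (count < st->last_count)	/* rule 1: went backwards */
		return true;
	if (st->ticked)			/* rule 2: already recorded */
		return false;
	if (handler_pending)		/* rule 3: clock interrupt pending */
		return true;

	/* rule 4a: the counter looks freshly reloaded */
	just_reset = count < 20 || (!irq_enabled && count < max_count / 2u);

	/* rule 4b: and the interrupt controller agrees */
	return just_reset && controller_pending;
}
```

The caller would still have to set the "ticked" flag and add max_count to the
software offset when this returns true, as the FreeBSD code does.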

But I still don't see how to manage something like this in an SMP
environment, where the interrupt handler might be in the process of running
on a different processor.

It's sometimes possible to be Very Sneaky and use timer 1 (the refresh
counter) to detect overflows in the tick timer, as long as its period
does not evenly divide the tick timer's period.  Latch them simultaneously,
and use modular congruence to figure out what's happened.
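
As a rough illustration of the modular trick, here is a sketch under idealized
assumptions that are mine, not measured: both counters are treated as
up-counters that started from zero on the same input clock, and both are
latched at the same instant.  The function names are made up.  Given the wrap
count that software believes in, it checks whether timer 1's phase matches
"no unseen wrap" or "one unseen wrap" of timer 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Phase timer 1 should show if timer 0 has wrapped `wraps` times and
 * currently reads c0.  p0 and p1 are the two periods in input clocks.
 */
static unsigned expected_phase1(uint64_t wraps, unsigned c0,
				unsigned p0, unsigned p1)
{
	assert(p0 > 0 && p1 > 0 && c0 < p0);
	return (unsigned)((wraps * p0 + c0) % p1);
}

/*
 * Returns 0 if the software wrap count is consistent with timer 1,
 * 1 if exactly one wrap has not been accounted for yet, and -1 if the
 * reading is ambiguous or matches neither case (for example a glitch,
 * or p1 evenly dividing p0).
 */
static int unseen_wrap(uint64_t sw_wraps, unsigned c0, unsigned c1,
		       unsigned p0, unsigned p1)
{
	int no_wrap = expected_phase1(sw_wraps, c0, p0, p1) == c1;
	int one_wrap = expected_phase1(sw_wraps + 1, c0, p0, p1) == c1;

	if (no_wrap == one_wrap)
		return -1;
	return one_wrap;
}
```

With p0 = 11932 and p1 = 18, each wrap of timer 0 shifts timer 1's phase by
11932 mod 18 = 16, which is why the two cases can be told apart.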

But such sneakiness could have bad interactions with the wide variety
of crapola southbridge implementations out there.

Is there an alternate timer which is GUARANTEED to exist on SMP
systems, obviating the need to solve this problem?  Any other ideas?

Life would be a lot easier if I could use the RTC interrupt for tick
timing (128, 256, or 1024 Hz, as required), and leave the PIT at 18.2 Hz
to interpolate and detect lost ticks.  (You can also detect lost ticks
by reading the RTC just before and after each second to see if it reads
the expected value.)

(P.S. It's surprising how well Linux keeps working when you've disabled
the timer interrupt by buggy 8254 programming from user-space.  Not
well enough to let me undo it without rebooting, however.  For future
reference, after writing the mode, you MUST program the count, even
if the mode you programmed is the same as the previous one.)

* Re: Linux time code
  2006-08-26  0:17   ` linux
@ 2006-08-28 22:41     ` john stultz
  0 siblings, 0 replies; 28+ messages in thread
From: john stultz @ 2006-08-28 22:41 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel, theotso, zippel

On Fri, 2006-08-25 at 20:17 -0400, linux@horizon.com wrote:
> > Overall, I like your idea quite a bit. Might we look forward to a
> > patch? :)
> 
> Researching this has led me into that pit of madness, the i8254
> programmable interval timer specification.
> 
> This is the original IBM PC timer (well, the version after the original
> 8253), counting at 13125000/11 = 1193181.81... Hz, programmed to divide
> by 11932 to generate a 100 Hz timer interrupt.  If you want to be picky,
> the options are

With the new clocksource code, we can (currently just i386, but the
architecture is generic and I'm working on the other arches) make use of
continuous clocksources for accumulating time instead of having to deal
with the problematic PIT (as well as the lost ticks issue).

Maybe I'm missing what you're proposing, but I think "that pit of
madness" can now be avoided. :)

thanks
-john

* Re: Linux time code
  2006-08-23 18:29 ` john stultz
  2006-08-24  2:35   ` linux
  2006-08-26  0:17   ` linux
@ 2006-08-26  3:46   ` linux
  2 siblings, 0 replies; 28+ messages in thread
From: linux @ 2006-08-26  3:46 UTC (permalink / raw)
  To: johnstul, linux; +Cc: linux-kernel, theotso, zippel

If anyone's interested in the Very Sneaky option, here's a little
test program to judge its feasibility.

You can do an even nicer thing with timer 2, the speaker timer, since
it keeps counting mod-65536 when not in use, but coping with console
beeps or the PC speaker driver would be a nightmare.

On 4 machines (laptop, two desktops, and amd64), timer 1 is always
programmed for mode 2 (status = 0x94) with a count of 18 (0x12).  This
does not divide any of the standard timer rate divisors (including the
AMD Elan versions).  However, collecting histograms, I do see occasional
glitches:

Timer 1 status = 94
  1: 34
  3: 133
  4: 34
  6: 132
  7: 33
  9: 133
 10: 33
 12: 134
 13: 33
 15: 134
 16: 33
 18: 133
181: 1

(AMD64, nForce4 chipset - why it has a "refresh timer" when it doesn't
have any RAM attached is a mystery)

Also, it appears to take very nearly 3 us to latch the
count and read the timer, resulting in histograms like:

Timer 1 status = 94
  1: 167
  4: 167
  7: 166
 10: 166
 13: 167
 16: 167

Anyway, if anyone else would like to try, the following program should
be very conservative and not kill your machine.  Still, it disables
interrupts and does raw hardware accesses.  In fact, since nobody else
should be touching timer 1, the interrupt disabling is probably not
necessary, but I'd still be cautious running it on an SMP system.

/*
 * This program figures out what timer 1 (the memory refresh timer) is
 * doing on your computer.
 */
#include <stdio.h>
#include <sys/io.h>	/* For iopl(), inb(), outb() */
#include <asm/system.h>	/* For local_irq_disable(), local_irq_enable() */
#include <stdint.h>
#include <assert.h>

/* Timer registers at 0x40..0x43 */

static uint8_t
get_status(unsigned timer)
{
	uint8_t status;

	assert(timer < 3);

	local_irq_disable();
	outb(0xe0 + (2<<timer), 0x43);
	status = inb(0x40 + timer);
	local_irq_enable();

	return status;
}

/*
 * We collect a histogram of values to find out the highest counter
 * value and look for oddities like skipped values.
 */
static void
get_histogram(unsigned timer, unsigned counts[256], unsigned n)
{
	uint8_t const command = 0xd0 + (2<<timer);
	/* Alt: command = timer << 6 */
	uint16_t const port = 0x40 + timer;

	assert(timer < 3);
	assert(n);

	local_irq_disable();
	do {
		outb(command, 0x43);
		counts[inb(port)]++;
	} while (--n);
	local_irq_enable();
}

int
main(void)
{
	unsigned i;
	unsigned histogram[256];
	uint8_t status;

	if (iopl(3) < 0) {
		perror("iopl(3)");
		fputs("Are you running as root?\n", stderr);
		return 1;
	}

	status = get_status(1);

	printf("Timer 1 status = %02x\n", status);

	if (status & 0x40) {
		puts("Null count is set.  This is strange and interesting,\n"
		     "but means I don't know what's going on.  Aborting.");
		return 0;
	}
	if ((status & 0x30) != 0x10) {
		puts("Read mode is not lsbyte-only.  This is strange and interesting,\n"
		     "but means I don't know what's going on.  Aborting.");
		return 0;
	}
	if ((status & 0xe) != 4) {
		printf("Timer mode is %u, not 2, which is unexpected.  Still,\n"
		       "data can be collected.  Please report!\n",
		       (status >> 1) & 7);
	}
	if (status & 1) {
		puts("Timer is in BCD mode.  Most peculiar.  Please report!");
		/* But we can collect a histogram */
	}

	for (i = 0; i < 256; i++)
		histogram[i] = 0;

	/* Break up collection into bursts to avoid long interrupt latency */
	for (i = 0; i < 10; i++)
		get_histogram(1, histogram, 100);

	for (i = 0; i < 256; i++)
		if (histogram[i])
			printf("%3u: %u\n", i, histogram[i]);

	puts("Thank you.");
	return 0;
}

* Linux time code
@ 2006-08-16 12:26 Ulrich Windl
  2006-08-16 12:36 ` Oleg Verych
  2006-08-16 19:53 ` john stultz
  0 siblings, 2 replies; 28+ messages in thread
From: Ulrich Windl @ 2006-08-16 12:26 UTC (permalink / raw)
  To: linux-kernel

Hi everybody!

I've been viewing recent changes to the Linux kernel (specifically 2.6.15.1 to 
2.6.17.8), and I felt I'll have to say something:

First there's a new routine in kernel/time.c named "set_normalized_timespec()". 
That routine sets nothing besides the actual argument being passed by reference. 
Thus I feel that routine should rather be named "normalize_timespec()" (just to 
save a few bytes. No, not really ;-). Alternatively that thing could be a pure 
("const") function that returns the normalized timespec. In that case I'd call it 
"normalized_timespec()"...
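
As a sketch of the second suggestion, a pure variant could look like the
following in plain C.  It is written for userspace with the standard struct
timespec and a simple loop-based normalization; it is not the kernel's code,
and normalized_timespec() is just the name proposed above.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Return a timespec equal to sec + nsec/1e9 with the nanosecond field
 * brought into the range [0, NSEC_PER_SEC).  Nothing is modified through
 * a pointer; the result is returned by value.
 */
static struct timespec normalized_timespec(time_t sec, long nsec)
{
	struct timespec ts;

	while (nsec >= NSEC_PER_SEC) {
		nsec -= NSEC_PER_SEC;
		++sec;
	}
	while (nsec < 0) {
		nsec += NSEC_PER_SEC;
		--sec;
	}
	ts.tv_sec = sec;
	ts.tv_nsec = nsec;
	return ts;
}
```

A caller would write ts = normalized_timespec(sec, nsec); instead of passing
the destination by reference.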

OK, that issue won't make anybody feel hot I guess, so here's another one:

The existing routines for measuring time among the various architectures are an 
absolute mess. Well, it always had been, but it didn't become any better, but 
worse it seems. For example there is a POSIX-like sys_clock_gettime() intended to 
serve the end-user directly, but there's no counterpart do_clock_gettime() to 
serve any in-kernel needs. The implementation of clock_getres() is also hardly 
worth it. I once had implemented a routine like this:

void do_clock_getres(clockid_t sysclock, struct timespec *tsp)
{
	struct timespec ts;
	int retry_limit = 10;	/* arbitrary bound; was left uninitialized */

	ts.tv_sec = 0;
	do {
		struct timespec ts1, ts2;

		do_clock_gettime(sysclock, &ts1);
		do_clock_gettime(sysclock, &ts2);
		ts.tv_sec = ts2.tv_sec - ts1.tv_sec;
		ts.tv_nsec = ts2.tv_nsec - ts1.tv_nsec;
	} while (--retry_limit > 0 && (ts.tv_sec != 0 || ts.tv_nsec == 0));
	*tsp = ts;
}

That routine tries to get the typical clock resolution the user is expected to 
see, automatically adjusting to the interpolation method and CPU speed being used. 
I think that's preferable to just returning 1ns or "tick" or whatever.
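
The same idea can be tried from userspace with clock_gettime().  This is a
sketch, not the in-kernel routine above: do_clock_gettime() does not exist, so
it calls the POSIX function instead, and the function name and the retry bound
are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/*
 * Read the clock twice back to back and return the first positive
 * sub-second difference seen, in nanoseconds.  Returns -1 if the clock
 * cannot be read or no step is observed within retry_limit attempts.
 */
static long observed_clock_step_ns(clockid_t clock, int retry_limit)
{
	struct timespec t1, t2;

	assert(retry_limit > 0);

	while (retry_limit-- > 0) {
		long long delta;

		if (clock_gettime(clock, &t1) != 0 ||
		    clock_gettime(clock, &t2) != 0)
			return -1;

		delta = (long long)(t2.tv_sec - t1.tv_sec) * 1000000000LL +
			(t2.tv_nsec - t1.tv_nsec);
		if (delta > 0 && delta < 1000000000LL)
			return (long)delta;
	}
	return -1;
}
```

Note that the result also includes the cost of the call itself, so on a
fine-grained clock it measures read overhead rather than the hardware
resolution.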

Finally I have the personal need for an "unadjusted tick interpolator" 
(preferably being clocked by the same clock as the timer chip) to estimate the 
frequency error of the system clock (independently from any offset adjustments 
being made).

For those who might wonder: Yes, that's the code that had been thrown out recently: 
NTP PPS calibration.

To summarize: I'd wish for fewer, but more useful routines dealing with time. Some 
modules just don't export useful (and otherwise missing) routines, while other 
useful exported routines have different names for each architecture. A mess...

Sorry if you don't like that kind of message, but I just had to say that. It seems 
the time subsystem is already so complex that people are just adding new code 
instead of considering redesign or reuse of the existing code.

Regards,
Ulrich

* Re: Linux time code
  2006-08-16 12:26 Ulrich Windl
@ 2006-08-16 12:36 ` Oleg Verych
  2006-08-16 15:35   ` H. Peter Anvin
  2006-08-16 19:53 ` john stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Oleg Verych @ 2006-08-16 12:36 UTC (permalink / raw)
  To: linux-kernel

Ulrich Windl wrote:
> Hi everybody!
> 
> I've been viewing recent changes to the Linux kernel (specifically 2.6.15.1 to 
> 2.6.17.8), and I felt I'll have to say something:
> 
> First there's a new routine in kernel/time.c named "set_normalized_timespec()". 
[..something....]
> Sorry if you don't like that kind of message, but I just had to say that. It seems 
> the time subsystem is already so complex that people are just adding new code 
> instead of considering redesign or reuse of the existing code.
> 
As far as I can see, this is a "return -ENOPATCH;" kind of mailing list.
Did you read, and consider cooperating with the authors of:

"We Are Not Getting Any Younger: A New Approach to Time and Timers"
<http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf>

"Hrtimers and Beyond: Transforming the Linux Time Subsystems"
<https://ols2006.108.redhat.com/reprints/gleixner-reprint.pdf>  ?

> Regards,
> Ulrich
> 

-- 
-o--=O`C
  #oo'L O
<___=E M


* Re: Linux time code
  2006-08-16 12:36 ` Oleg Verych
@ 2006-08-16 15:35   ` H. Peter Anvin
  2006-08-16 15:12     ` Oleg Verych
  0 siblings, 1 reply; 28+ messages in thread
From: H. Peter Anvin @ 2006-08-16 15:35 UTC (permalink / raw)
  To: Oleg Verych; +Cc: linux-kernel

Oleg Verych wrote:
>
> As far as I can see, this is a "return -ENOPATCH;" kind of mailing list.
> Did you read, and consider cooperating with the authors of:
> 

I think you're barking up the wrong tree.  Ulrich has been actively 
involved in Linux timekeeping for over a decade.

	-hpa

* Re: Linux time code
  2006-08-16 15:35   ` H. Peter Anvin
@ 2006-08-16 15:12     ` Oleg Verych
  0 siblings, 0 replies; 28+ messages in thread
From: Oleg Verych @ 2006-08-16 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-kernel

H. Peter Anvin wrote:
> Oleg Verych wrote:
>> As far as I can see, this is a "return -ENOPATCH;" kind of mailing list.
>> Did you read, and consider cooperating with the authors of:
>>
> 
> I think you're barking up the wrong tree.  Ulrich has been actively 
> involved in Linux timekeeping for over a decade.
> 
So why are things so wrong? ;D

I hope my comments aren't so offensive that one couldn't just say "yes"
and give a short summary of those papers and the actual implementation.

>     -hpa
But your comments are; I just don't know what to do ;p

-- 
-o--=O`C
  #oo'L O
<___=E M


* Re: Linux time code
  2006-08-16 12:26 Ulrich Windl
  2006-08-16 12:36 ` Oleg Verych
@ 2006-08-16 19:53 ` john stultz
  2006-08-17  7:20   ` Ulrich Windl
  2006-08-17 11:43   ` Roman Zippel
  1 sibling, 2 replies; 28+ messages in thread
From: john stultz @ 2006-08-16 19:53 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: linux-kernel, Roman Zippel, Udo van den Heuvel

On Wed, 2006-08-16 at 14:26 +0200, Ulrich Windl wrote:
> I've been viewing recent changes to the Linux kernel (specifically 2.6.15.1 to 
> 2.6.17.8), and I felt I'll have to say something:

Hey Ulrich,

If you haven't already (and have the time), please also take a peek at
the current 2.6.18-rc patch as well as -mm, as a number of timekeeping
changes have been made since 2.6.17.x.

> First there's a new routine in kernel/time.c named "set_normalized_timespec()". 
> That routine sets nothing besides the actual argument being passed by reference. 
> Thus I feel that routine should rather be named "normalize_timespec()" (just to 
> save a few bytes. No, not really ;-). Alternatively that thing could be a pure 
> ("const") function that returns the normalized timespec. In that case I'd call it 
> "normalized_timespec()"...

Sounds reasonable.

> OK, that issue won't make anybody feel hot I guess, so here's another one:
> 
> The existing routines for measuring time among the various architectures are an 
> absolute mess. Well, it always had been, but it didn't become any better, but 
> worse it seems. 

As you know, myself and others are working on this. It's taken quite a
bit of time to get some of the groundwork in, and cleanups are still
needed, but I think we're on the right track. However, criticism is
welcome, and I'd appreciate your input (I did try to keep you CC'ed on
most of the early discussions, but forgive me as I left you out on some
of the more recent discussions)

> For example there is a POSIX-like sys_clock_gettime() intended to 
> serve the end-user directly, but there's no counterpart do_clock_gettime() to 
> serve any in-kernel needs. 

Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide this
info. Is there something missing here?

I will agree that the code in kernel/time.c, kernel/timer.c,
kernel/posix-timers.c, and kernel/hrtimer.c files could be better
organized so the layered logic is more clear. I am working on this (see
the ntp-move-all-the-ntp-related-code-to-ntpc-fix patch currently in
-mm), but untangling the code without breaking anyone (well, that's the
intent) is a slow process.

> The implementation of clock_getres() is also hardly 
> worth it. I once had implemented a routine like this:
[snip]
> That routine tries to get the typical clock resolution the user is expected to 
> see, automatically adjusting to the interpolation method and CPU speed being used. 
> I think that's preferable to just returning 1ns or "tick" or whatever.

Yea. This area could use improvement. The clocksource infrastructure
should better allow us to export the actual hardware resolution.

> Finally I have the personal need for an "unadjusted tick interpolator" 
> (preferably being clocked by the same clock as the timer chip) to estimate the 
> frequency error of the system clock (independently from any offset adjustments 
> being made).
> 
> For those who might wonder: Yes, that's the code that had been thrown out recently: 
> NTP PPS calibration.

The NTP PPS code was dropped because there were no in-kernel users of
that interface. But as I've always said, I'd be very happy to see your
PPS work get merged. I know there are a few out-of-tree patches
currently floating around (Udo mailed me awhile back with some links,
but I can't find them at the moment), and I'm sure the high level
of activity in this area makes it difficult to keep out of tree patches
up to date. Is there any reason these patches aren't being pushed into
mainline?

> To summarize: I'd wish for fewer, but more useful routines dealing with time. Some 
> modules just don't export useful (and otherwise missing) routines, while other 
> useful exported routines have different names for each architecture. A mess...

I agree, and folks are working to clean this up (I've got a
get_persistent_clock patch to try to unify all the different
get_rtc/cmos/boot_time() hooks across the arches coming soon). Again, I
very much welcome your experience, suggestions and patches to this area.

thanks
-john

* Re: Linux time code
  2006-08-16 19:53 ` john stultz
@ 2006-08-17  7:20   ` Ulrich Windl
  2006-08-17 19:15     ` john stultz
  2006-08-17 11:43   ` Roman Zippel
  1 sibling, 1 reply; 28+ messages in thread
From: Ulrich Windl @ 2006-08-17  7:20 UTC (permalink / raw)
  To: john stultz; +Cc: linux-kernel, Roman Zippel, Udo van den Heuvel

On 16 Aug 2006 at 12:53, john stultz wrote:

> On Wed, 2006-08-16 at 14:26 +0200, Ulrich Windl wrote:
> > I've been viewing recent changes to the Linux kernel (specifically 2.6.15.1 to 
> > 2.6.17.8), and I felt I'll have to say something:
> 
> Hey Ulrich,
> 
> If you haven't already (and have the time), please also take a peek at
> the current 2.6.18-rc patch as well as -mm, as a number of timekeeping
> changes have been made since 2.6.17.x.

Hi John,

during the nice weather I was quite lazy here, but the weather recently was so bad 
that I turned on the computer again ;-) I decided to download a "stable" kernel 
release to evaluate (hoping your code was in already). I cannot tell you when, but 
I'll have a look sooner or later ;-)

> 
> > First there's a new routine in kernel/time.c named "set_normalized_timespec()". 
> > That routine sets nothing besides the actual argument being passed by reference. 
> > Thus I feel that routine should rather be named "normalize_timespec()" (just to 
> > save a few bytes. No, not really ;-). Alternatively that thing could be a pure 
> > ("const") function that returns the normalized timespec. In that case I'd call it 
> > "normalized_timespec()"...
> 
> Sounds reasonable.
> 
> > OK, that issue won't make anybody feel hot I guess, so here's another one:
> > 
> > The existing routines for measuring time among the various architectures are an 
> > absolute mess. Well, it always had been, but it didn't become any better, but 
> > worse it seems. 
> 
> As you know, myself and others are working on this. It's taken quite a
> bit of time to get some of the groundwork in, and cleanups are still
> needed, but I think we're on the right track. However, criticism is
> welcome, and I'd appreciate your input (I did try to keep you CC'ed on
> most of the early discussions, but forgive me as I left you out on some
> of the more recent discussions)

No problem, I was on holiday anyway. The code I tried had a problem with my AMD 
Athlon X2 (dual core): the two cores run at different frequencies, a feature of power 
management, thus making hi-res timing unstable. I haven't investigated in-depth, 
but I thought the hpet timer was used.

> 
> > For example there is a POSIX-like sys_clock_gettime() intended to 
> > serve the end-user directly, but there's no counterpart do_clock_gettime() to 
> > serve any in-kernel needs. 
> 
> Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide this
> info. Is there something missing here?

From memory: Are those exported from posix_timer? I think I saw those, but wasn't 
sure whether they are for general cross-arch use.

> 
> I will agree that the code in kernel/time.c, kernel/timer.c,
> kernel/posix-timers.c, and kernel/hrtimer.c files could be better
> organized so the layered logic is more clear. I am working on this (see
> the ntp-move-all-the-ntp-related-code-to-ntpc-fix patch currently in
> -mm), but untangling the code without breaking anyone (well, that's the
> intent) is a slow process.
> 
> > The implementation of clock_getres() is also hardly 
> > worth it. I once had implemented a routine like this:
> [snip]
> > That routine tries to get the typical clock resolution the user is expected to 
> > see, automatically adjusting to the interpolation method and CPU speed being used. 
> > I think that's preferable to just returning 1ns or "tick" or whatever.
> 
> Yea. This area could use improvement. The clocksource infrastructure
> should better allow us to export the actual hardware resolution.
> 
> > Finally I have the personal need for an "unadjusted tick interpolator" 
> > (preferably being clocked by the same clock as the timer chip) to estimate the 
> > frequency error of the system clock (independently from any offset adjustments 
> > being made).
> > 
> > For those who might wonder: Yes, that's the code that had been thrown out recently: 
> > NTP PPS calibration.
> 
> The NTP PPS code was dropped because there were no in-kernel users of
> that interface. But as I've always said, I'd be very happy to see your
> PPS work get merged. I know there are a few out-of-tree patches
> currently floating around (Udo mailed me awhile back with some links,
> but I can't find them at the moment), and I'm sure the high level
> of activity in this area makes it difficult to keep out of tree patches
> up to date. Is there any reason these patches aren't being pushed into
> mainline?

I'm only waiting for a "pusher" ;-) No, actually I have my own quality checks, and 
currently the code fails those. I call it "alpha" myself. Until it's "beta" I 
won't ask anybody for inclusion.

I don't like the idea of a loadable module, because most of the code accesses 
several timing variables that are (or can be) private now. A module would make 
them public (for misuse). The time machinery should be a sealed black box IMHO.

> 
> > To summarize: I'd wish for fewer, but more useful routines dealing with time. Some 
> > modules just don't export useful (and otherwise missing) routines, while other 
> > useful exported routines have different names for each architecture. A mess...
> 
> I agree, and folks are working to clean this up (I've got a
> get_persistent_clock patch to try to unify all the different
> get_rtc/cmos/boot_time() hooks across the arches coming soon). Again, I
> very much welcome your experience, suggestions and patches to this area.

Regards,
Ulrich


* Re: Linux time code
  2006-08-17  7:20   ` Ulrich Windl
@ 2006-08-17 19:15     ` john stultz
  0 siblings, 0 replies; 28+ messages in thread
From: john stultz @ 2006-08-17 19:15 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: linux-kernel, Roman Zippel, Udo van den Heuvel

On Thu, 2006-08-17 at 09:20 +0200, Ulrich Windl wrote:
> On 16 Aug 2006 at 12:53, john stultz wrote:
> > As you know, myself and others are working on this. It's taken quite a
> > bit of time to get some of the groundwork in, and cleanups are still
> > needed, but I think we're on the right track. However, criticism is
> > welcome, and I'd appreciate your input (I did try to keep you CC'ed on
> > most of the early discussions, but forgive me as I left you out on some
> > of the more recent discussions)
> 
> No problem, I was on holiday anyway. The code I tried had a problem with my AMD 
> Athlon X2 (dual core): the two cores run at different frequencies, a feature of power 
> management, thus making hi-res timing unstable. I haven't investigated in-depth, 
> but I thought the hpet timer was used.

Hmmm.. The dualcore AMD TSC usage issue should be resolved now, so
please let me know if you can provide any further info on what you saw
(dmesg, config).


> > > For example there is a POSIX-like sys_clock_gettime() intended to 
> > > serve the end-user directly, but there's no counterpart do_clock_gettime() to 
> > > serve any in-kernel needs. 
> > 
> > Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide this
> > info. Is there something missing here?
> 
> From memory: Are those exported from posix_timer? I think I saw those, but wasn't 
> sure whether they are for general cross-arch use.

They should be cross-arch safe. Exported from ktime.h
 
> > The NTP PPS code was dropped because there were no in-kernel users of
> > that interface. But as I've always said, I'd be very happy to see your
> > PPS work get merged. I know there are a few out-of-tree patches
> > currently floating around (Udo mailed me awhile back with some links,
> > but I can't find them at the moment), and I'm sure the high level
> > of activity in this area makes it difficult to keep out of tree patches
> > up to date. Is there any reason these patches aren't being pushed into
> > mainline?
> 
> I'm only waiting for a "pusher" ;-) No, actually I have my own quality checks, and 
> currently the code fails those. I call it "alpha" myself. Until it's "beta" I 
> won't ask anybody for inclusion.

Even so, if I recall, your earlier PPS kit patch had both cleanups and
new features. Breaking that up and pushing just the cleanups might be an
easy way to reduce the patch size you have to maintain until it's beta.

> I don't like the idea of a loadable module, because most of the code accesses 
> several timing variables that are (or can be) private now. A module would make 
> them public (for misuse). The time machinery should be a sealed black box IMHO.

Agreed, although interfaces can be added if necessary, so let me know if
you find anything specific that is lacking.


thanks
-john


* Re: Linux time code
  2006-08-16 19:53 ` john stultz
  2006-08-17  7:20   ` Ulrich Windl
@ 2006-08-17 11:43   ` Roman Zippel
  2006-08-17 21:58     ` john stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Roman Zippel @ 2006-08-17 11:43 UTC (permalink / raw)
  To: john stultz; +Cc: Ulrich Windl, linux-kernel, Udo van den Heuvel

Hi,

On Wed, 16 Aug 2006, john stultz wrote:

> > For example there is a POSIX-like sys_clock_gettime() intended to 
> > serve the end-user directly, but there's no counterpart do_clock_gettime() to 
> > serve any in-kernel needs. 
> 
> Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide this
> info. Is there something missing here?

What is missing is the ability to map a clock to a posix clock, so that 
you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP controlled clocks and 
other CLOCK_* as the raw clock.
At some point I tried to discuss such possibilities, but it probably 
wasn't relevant for the rt kernel, so it was utterly ignored. :(

bye, Roman

* Re: Linux time code
  2006-08-17 11:43   ` Roman Zippel
@ 2006-08-17 21:58     ` john stultz
  2006-08-17 22:11       ` Jesse Barnes
  2006-08-20 17:10       ` Roman Zippel
  0 siblings, 2 replies; 28+ messages in thread
From: john stultz @ 2006-08-17 21:58 UTC (permalink / raw)
  To: Roman Zippel; +Cc: Ulrich Windl, linux-kernel, Udo van den Heuvel

On Thu, 2006-08-17 at 13:43 +0200, Roman Zippel wrote:
> On Wed, 16 Aug 2006, john stultz wrote:
> > > For example there is a POSIX-like sys_clock_gettime() intended to 
> > > serve the end-user directly, but there's no counterpart do_clock_gettime() to 
> > > serve any in-kernel needs. 
> > 
> > Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide this
> > info. Is there something missing here?
> 
> What is missing is the ability to map a clock to a posix clock, so that 
> you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP controlled clocks and 
> other CLOCK_* as the raw clock.

Is there a use case for this (wanting non-NTP corrected time on a system
running NTPd) you have in mind?

I'm not strictly opposed to this idea, but since it exposes a new
interface to userland it needs to be carefully thought out and well
understood.

thanks
-john


* Re: Linux time code
  2006-08-17 21:58     ` john stultz
@ 2006-08-17 22:11       ` Jesse Barnes
  2006-08-17 22:32         ` john stultz
  2006-08-20 17:10       ` Roman Zippel
  1 sibling, 1 reply; 28+ messages in thread
From: Jesse Barnes @ 2006-08-17 22:11 UTC (permalink / raw)
  To: john stultz; +Cc: Roman Zippel, Ulrich Windl, linux-kernel, Udo van den Heuvel

On Thursday, August 17, 2006 2:58 pm, john stultz wrote:
> On Thu, 2006-08-17 at 13:43 +0200, Roman Zippel wrote:
> > On Wed, 16 Aug 2006, john stultz wrote:
> > > > For example there is a POSIX-like sys_clock_gettime() intended
> > > > to serve the end-user directly, but there's no counterpart
> > > > do_clock_gettime() to serve any in-kernel needs.
> > >
> > > Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide
> > > this info. Is there something missing here?
> >
> > What is missing is the ability to map a clock to a posix clock, so
> > that you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP controlled
> > clocks and other CLOCK_* as the raw clock.
>
> Is there a use case for this (wanting non-NTP corrected time on a
> system running NTPd) you have in mind?

Isn't this what CLOCK_MONOTONIC[_HR] is for?  It's not supposed to jump 
around at all, so the basic usage model is to use this source for 
timestamping purposes...

Jesse

* Re: Linux time code
  2006-08-17 22:11       ` Jesse Barnes
@ 2006-08-17 22:32         ` john stultz
  2006-08-17 22:50           ` Jesse Barnes
  0 siblings, 1 reply; 28+ messages in thread
From: john stultz @ 2006-08-17 22:32 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Roman Zippel, Ulrich Windl, linux-kernel, Udo van den Heuvel

On Thu, 2006-08-17 at 15:11 -0700, Jesse Barnes wrote:
> On Thursday, August 17, 2006 2:58 pm, john stultz wrote:
> > On Thu, 2006-08-17 at 13:43 +0200, Roman Zippel wrote:
> > > On Wed, 16 Aug 2006, john stultz wrote:
> > > > > For example there is a POSIX-like sys_clock_gettime() intended
> > > > > to serve the end-user directly, but there's no counterpart
> > > > > do_clock_gettime() to serve any in-kernel needs.
> > > >
> > > > Hmmm.. ktime_get(), ktime_get_ts() and ktime_get_real(), provide
> > > > this info. Is there something missing here?
> > >
> > > What is missing is the ability to map a clock to a posix clock, so
> > > that you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP controlled
> > > clocks and other CLOCK_* as the raw clock.
> >
> > Is there a use case for this (wanting non-NTP corrected time on a
> > system running NTPd) you have in mind?
> 
> Isn't this what CLOCK_MONOTONIC[_HR] is for?  It's not supposed to jump 
> around at all, so the basic usage model is to use this source for 
> timestamping purposes...

Well, CLOCK_MONOTONIC is not affected by calls to settimeofday(), so it
will never go backward. However, it does get frequency correction if
provided by NTP (thus a second will be a correct second and you won't
accumulate error).
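
To illustrate the distinction drawn above, here is a minimal userspace
sketch in C. It measures elapsed time with CLOCK_MONOTONIC, which a
settimeofday() call does not move, while any NTP frequency correction
still shapes how fast it ticks. The helper names read_clock_ns() and
elapsed_since_ns() are hypothetical, chosen for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock as nanoseconds; returns -1 if clock_gettime() fails. */
int64_t read_clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Nanoseconds elapsed since an earlier CLOCK_MONOTONIC reading.
 * Because settimeofday() does not step CLOCK_MONOTONIC, the result
 * does not go negative when the wall clock is set backward.
 */
int64_t elapsed_since_ns(int64_t start_ns)
{
    return read_clock_ns(CLOCK_MONOTONIC) - start_ns;
}
```

A caller would take a reading with read_clock_ns(CLOCK_MONOTONIC) before
the work being timed and pass it to elapsed_since_ns() afterwards. The
same subtraction done on CLOCK_REALTIME readings could come out negative
if the clock were set in between.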

Also the _HR clocks have always been out of tree, so there isn't the
binary compatibility worry.

thanks
-john



* Re: Linux time code
  2006-08-17 22:32         ` john stultz
@ 2006-08-17 22:50           ` Jesse Barnes
  2006-08-17 23:02             ` john stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Jesse Barnes @ 2006-08-17 22:50 UTC (permalink / raw)
  To: john stultz; +Cc: Roman Zippel, Ulrich Windl, linux-kernel, Udo van den Heuvel

On Thursday, August 17, 2006 3:32 pm, john stultz wrote:
> On Thu, 2006-08-17 at 15:11 -0700, Jesse Barnes wrote:
> > On Thursday, August 17, 2006 2:58 pm, john stultz wrote:
> > > On Thu, 2006-08-17 at 13:43 +0200, Roman Zippel wrote:
> > > > > What is missing is the ability to map a clock to a posix clock,
> > > > so that you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP
> > > > controlled clocks and other CLOCK_* as the raw clock.
> > >
> > > Is there a use case for this (wanting non-NTP corrected time on a
> > > system running NTPd) you have in mind?
> >
> > Isn't this what CLOCK_MONOTONIC[_HR] is for?  It's not supposed to
> > jump around at all, so the basic usage model is to use this source
> > for timestamping purposes...
>
> Well, CLOCK_MONOTONIC is not affected by calls to settimeofday() so it
> will never go backward, however it does get frequency correction if
> provided by NTP (thus a second will be a correct second and you won't
> accumulate error).

Hm, I guess that's ok for most of the timestamp applications I'm aware of 
as long as NTP won't cause the clock to stand still...

FWIW I think many other Unices provide a raw cycle counter via the POSIX 
clock routines.  I don't imagine they're NTP corrected, since at least 
on IRIX the application is expected to handle even cycle counter 
wraparound.

Jesse

* Re: Linux time code
  2006-08-17 22:50           ` Jesse Barnes
@ 2006-08-17 23:02             ` john stultz
  2006-08-20 17:14               ` Roman Zippel
  0 siblings, 1 reply; 28+ messages in thread
From: john stultz @ 2006-08-17 23:02 UTC (permalink / raw)
  To: Jesse Barnes; +Cc: Roman Zippel, Ulrich Windl, linux-kernel, Udo van den Heuvel

On Thu, 2006-08-17 at 15:50 -0700, Jesse Barnes wrote:
> On Thursday, August 17, 2006 3:32 pm, john stultz wrote:
> > On Thu, 2006-08-17 at 15:11 -0700, Jesse Barnes wrote:
> > > On Thursday, August 17, 2006 2:58 pm, john stultz wrote:
> > > > On Thu, 2006-08-17 at 13:43 +0200, Roman Zippel wrote:
> > > > > > What is missing is the ability to map a clock to a posix clock,
> > > > > so that you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP
> > > > > controlled clocks and other CLOCK_* as the raw clock.
> > > >
> > > > Is there a use case for this (wanting non-NTP corrected time on a
> > > > system running NTPd) you have in mind?
> > >
> > > Isn't this what CLOCK_MONOTONIC[_HR] is for?  It's not supposed to
> > > jump around at all, so the basic usage model is to use this source
> > > for timestamping purposes...
> >
> > Well, CLOCK_MONOTONIC is not affected by calls to settimeofday() so it
> > will never go backward, however it does get frequency correction if
> > provided by NTP (thus a second will be a correct second and you won't
> > accumulate error).
> 
> Hm, I guess that's ok for most of the timestamp applications I'm aware of 
> as long as NTP won't cause the clock to stand still...

Nope. It's limited to a +/-500ppm adjustment max.
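
For scale, the arithmetic behind that figure can be sketched in C. This
covers only the frequency term quoted above (a reply later in the thread
points out that a time adjustment is added on top). The macro and
function names are hypothetical, used here only for illustration.

```c
#include <stdint.h>

/* The frequency-adjustment limit quoted in the message, in parts per million. */
#define MAX_FREQ_ADJ_PPM 500

/*
 * Largest offset, in microseconds, that a rate error of `ppm` can
 * accumulate over `seconds`: 1 ppm is 1 microsecond per second.
 */
int64_t max_drift_us(int64_t seconds, int64_t ppm)
{
    return seconds * ppm;
}

/*
 * Slowest rate the clock can be slewed to, expressed in ppm of nominal.
 * A result of 1000000 would be full speed; 0 would be a stopped clock.
 */
int64_t min_rate_ppm(int64_t ppm)
{
    return 1000000 - ppm;
}
```

With the 500ppm limit, max_drift_us(86400, MAX_FREQ_ADJ_PPM) gives
43200000, that is 43.2 seconds per day, and min_rate_ppm() gives 999500,
so the frequency term alone slows the clock by at most 0.05% and cannot
stop it.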


> FWIW I think many other Unices provide a raw cycle counter via the POSIX 
> clock routines.  I don't imagine they're NTP corrected, since at least 
> on IRIX the application is expected to handle even cycle counter 
> wraparound.

Yea, I just want to make sure we're not creating a
portability/maintenance nightmare by exporting too much raw hardware
information.

thanks
-john


* Re: Linux time code
  2006-08-17 23:02             ` john stultz
@ 2006-08-20 17:14               ` Roman Zippel
  0 siblings, 0 replies; 28+ messages in thread
From: Roman Zippel @ 2006-08-20 17:14 UTC (permalink / raw)
  To: john stultz; +Cc: Jesse Barnes, Ulrich Windl, linux-kernel, Udo van den Heuvel

Hi,

On Thu, 17 Aug 2006, john stultz wrote:

> > Hm, I guess that's ok for most of the timestamp applications I'm aware of 
> > as long as NTP won't cause the clock to stand still...
> 
> Nope. It's limited to a +/-500ppm adjustment max.

This is only the frequency adjustment; the time adjustment is added as
well, and it can be much larger than this.

bye, Roman

* Re: Linux time code
  2006-08-17 21:58     ` john stultz
  2006-08-17 22:11       ` Jesse Barnes
@ 2006-08-20 17:10       ` Roman Zippel
  1 sibling, 0 replies; 28+ messages in thread
From: Roman Zippel @ 2006-08-20 17:10 UTC (permalink / raw)
  To: john stultz; +Cc: Ulrich Windl, linux-kernel, Udo van den Heuvel

Hi,

On Thu, 17 Aug 2006, john stultz wrote:

> > What is missing is the ability to map a clock to a posix clock, so that 
> > you would have CLOCK_REALTIME/CLOCK_MONOTONIC as NTP controlled clocks and 
> > other CLOCK_* as the raw clock.
> 
> Is there a use case for this (wanting non-NTP corrected time on a system
> running NTPd) you have in mind?

Most are probably special cases, but a more general use would be to allow
tracking the stability of multiple clocks, so you can check which is most 
suited for a time server. So far you can only do it for one clock at a 
time and you have to turn off NTP for calibration.
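
As a rough sketch of that use case in C: the interface below did not
exist when this thread was written. It assumes a Linux kernel of 2.6.28
or later, which added CLOCK_MONOTONIC_RAW as a clock without NTP
adjustment, and it compares that clock against the NTP-corrected
CLOCK_MONOTONIC. The struct and function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* One paired reading of the NTP-corrected and the raw clock. */
struct clock_pair {
    int64_t corrected_ns;   /* CLOCK_MONOTONIC */
    int64_t raw_ns;         /* CLOCK_MONOTONIC_RAW (Linux-specific) */
};

static int64_t timespec_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Read both clocks back to back; returns 0 on success, -1 on failure. */
int sample_pair(struct clock_pair *p)
{
    struct timespec corrected, raw;

    if (clock_gettime(CLOCK_MONOTONIC, &corrected) != 0 ||
        clock_gettime(CLOCK_MONOTONIC_RAW, &raw) != 0)
        return -1;
    p->corrected_ns = timespec_to_ns(&corrected);
    p->raw_ns = timespec_to_ns(&raw);
    return 0;
}

/*
 * Rate of the corrected clock relative to the raw clock between two
 * samples, in ppm. Positive means the corrected clock ran faster.
 */
double relative_rate_ppm(const struct clock_pair *a,
                         const struct clock_pair *b)
{
    int64_t d_corrected = b->corrected_ns - a->corrected_ns;
    int64_t d_raw = b->raw_ns - a->raw_ns;

    if (d_raw <= 0)
        return 0.0;     /* no usable interval */
    return (double)(d_corrected - d_raw) / (double)d_raw * 1e6;
}
```

Taking two samples some minutes apart and logging relative_rate_ppm()
over time would show how much correction NTP is applying to the raw
hardware clock, without having to turn NTP off. It still observes only
the currently selected clocksource, so it does not cover comparing
several hardware clocks at once.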

bye, Roman

Thread overview: 28+ messages
2006-08-23  6:25 Linux time code linux
2006-08-23 18:29 ` john stultz
2006-08-24  2:35   ` linux
2006-08-28 11:39     ` Roman Zippel
2006-08-28 22:36       ` john stultz
2006-08-29  3:28         ` linux
2006-08-29 13:15           ` Theodore Tso
2006-08-29 15:18             ` linux
2006-08-29 19:23           ` john stultz
2006-08-29 14:43         ` Roman Zippel
2006-08-26  0:17   ` linux
2006-08-28 22:41     ` john stultz
2006-08-26  3:46   ` linux
  -- strict thread matches above, loose matches on Subject: below --
2006-08-16 12:26 Ulrich Windl
2006-08-16 12:36 ` Oleg Verych
2006-08-16 15:35   ` H. Peter Anvin
2006-08-16 15:12     ` Oleg Verych
2006-08-16 19:53 ` john stultz
2006-08-17  7:20   ` Ulrich Windl
2006-08-17 19:15     ` john stultz
2006-08-17 11:43   ` Roman Zippel
2006-08-17 21:58     ` john stultz
2006-08-17 22:11       ` Jesse Barnes
2006-08-17 22:32         ` john stultz
2006-08-17 22:50           ` Jesse Barnes
2006-08-17 23:02             ` john stultz
2006-08-20 17:14               ` Roman Zippel
2006-08-20 17:10       ` Roman Zippel
