[RFC 0/2] ABI for clock_gettime_ns

* [RFC 0/2] ABI for clock_gettime_ns
@ 2011-12-13  1:26 Andy Lutomirski
  2011-12-13  1:26 ` [RFC 1/2] Add clock_gettime_ns syscall Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-13  1:26 UTC
  To: linux-kernel, Kumar Sundararajan, john stultz, Arun Sharma
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andy Lutomirski

On x86-64, clock_gettime is so fast that the overhead converting to and
from nanoseconds is non-negligible.  clock_gettime_ns is a different
interface that is potentially faster.  If people like the ABI, I'll
implement an optimized version.
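
As a minimal sketch of the conversion in question: a userspace caller
that wants a flat nanosecond count today has to multiply and add after
every clock_gettime call.  The helper names timespec_to_ns and
monotonic_now_ns are hypothetical and are not part of this series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: flatten a struct timespec into nanoseconds. */
static uint64_t timespec_to_ns(const struct timespec *ts)
{
	/* Times before the epoch do not fit an unsigned count. */
	assert(ts->tv_sec >= 0);
	return (uint64_t)ts->tv_sec * NSEC_PER_SEC + (uint64_t)ts->tv_nsec;
}

/* Hypothetical wrapper: roughly what clock_gettime_ns would return. */
static int monotonic_now_ns(uint64_t *ns)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return -1;
	*ns = timespec_to_ns(&ts);
	return 0;
}
```

A native clock_gettime_ns would hand back the u64 directly, so the
multiply in timespec_to_ns, and the matching division on the producing
side, could be dropped.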

For the git-inclined, this series is at
https://git.kernel.org/?p=linux/kernel/git/luto/linux.git;a=shortlog;h=refs/heads/timing/clock_gettime_ns/rfc_v1

Andy Lutomirski (2):
  Add clock_gettime_ns syscall
  x86-64: Add __vdso_clock_gettime_ns vsyscall

 arch/x86/include/asm/unistd_64.h |    2 +
 arch/x86/vdso/vclock_gettime.c   |   70 +++++++++++++++++++++++++++++---------
 arch/x86/vdso/vdso.lds.S         |    7 ++++
 include/linux/syscalls.h         |    2 +
 kernel/posix-timers.c            |   29 ++++++++++++++++
 5 files changed, 94 insertions(+), 16 deletions(-)

-- 
1.7.7.4



* [RFC 1/2] Add clock_gettime_ns syscall
  2011-12-13  1:26 [RFC 0/2] ABI for clock_gettime_ns Andy Lutomirski
@ 2011-12-13  1:26 ` Andy Lutomirski
  2011-12-13  3:32   ` Richard Cochran
  2011-12-13  1:26 ` [RFC 2/2] x86-64: Add __vdso_clock_gettime_ns vsyscall Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-13  1:26 UTC
  To: linux-kernel, Kumar Sundararajan, john stultz, Arun Sharma
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andy Lutomirski

On some architectures, clock_gettime is fast enough that converting
between nanoseconds and struct timespec takes a significant amount of
time.  Introduce a new syscall that does the same thing but returns the
answer in nanoseconds.  2^64 nanoseconds since the epoch won't wrap
around until the year 2554, and by then we can use 128-bit types.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/unistd_64.h |    2 ++
 include/linux/syscalls.h         |    2 ++
 kernel/posix-timers.c            |   29 +++++++++++++++++++++++++++++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 2010405..3a48069 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -683,6 +683,8 @@ __SYSCALL(__NR_sendmmsg, sys_sendmmsg)
 __SYSCALL(__NR_setns, sys_setns)
 #define __NR_getcpu				309
 __SYSCALL(__NR_getcpu, sys_getcpu)
+#define __NR_clock_gettime_ns			310
+__SYSCALL(__NR_clock_gettime_ns, sys_clock_gettime_ns)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1ff0ec2..2502bc1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -316,6 +316,8 @@ asmlinkage long sys_clock_settime(clockid_t which_clock,
 				const struct timespec __user *tp);
 asmlinkage long sys_clock_gettime(clockid_t which_clock,
 				struct timespec __user *tp);
+asmlinkage long sys_clock_gettime_ns(clockid_t which_clock,
+				u64 __user *tp);
 asmlinkage long sys_clock_adjtime(clockid_t which_clock,
 				struct timex __user *tx);
 asmlinkage long sys_clock_getres(clockid_t which_clock,
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4556182..07e0772 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -980,6 +980,35 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
 	return error;
 }
 
+SYSCALL_DEFINE2(clock_gettime_ns, const clockid_t, which_clock,
+		u64 __user *, tp)
+{
+	/*
+	 * This implementation isn't as fast as it could be, but the syscall
+	 * entry will take much longer than the unnecessary division and
+	 * multiplication.  Arch-specific implementations can be made faster.
+	 */
+
+	struct k_clock *kc = clockid_to_kclock(which_clock);
+	struct timespec kernel_timespec;
+	u64 ns;
+	int error;
+
+	if (!kc)
+		return -EINVAL;
+
+	error = kc->clock_get(which_clock, &kernel_timespec);
+
+	if (!error) {
+		ns = (u64)kernel_timespec.tv_sec * NSEC_PER_SEC
+			+ kernel_timespec.tv_nsec;
+
+		error = copy_to_user(tp, &ns, sizeof(ns)) ? -EFAULT : 0;
+	}
+
+	return error;
+}
+
 SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
 		struct timex __user *, utx)
 {
-- 
1.7.7.4



* [RFC 2/2] x86-64: Add __vdso_clock_gettime_ns vsyscall
  2011-12-13  1:26 [RFC 0/2] ABI for clock_gettime_ns Andy Lutomirski
  2011-12-13  1:26 ` [RFC 1/2] Add clock_gettime_ns syscall Andy Lutomirski
@ 2011-12-13  1:26 ` Andy Lutomirski
  2011-12-13  3:24 ` [RFC 0/2] ABI for clock_gettime_ns Richard Cochran
  2011-12-21  0:50 ` Arun Sharma
  3 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-13  1:26 UTC
  To: linux-kernel, Kumar Sundararajan, john stultz, Arun Sharma
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andy Lutomirski

This is just for the ABI.  An optimized implementation will come later.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/vdso/vclock_gettime.c |   70 ++++++++++++++++++++++++++++++---------
 arch/x86/vdso/vdso.lds.S       |    7 ++++
 2 files changed, 61 insertions(+), 16 deletions(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 6bc0e72..79d6919 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -82,7 +82,7 @@ notrace static inline long vgetns(void)
 	return (v * gtod->clock.mult) >> gtod->clock.shift;
 }
 
-notrace static noinline int do_realtime(struct timespec *ts)
+notrace static noinline void do_realtime(struct timespec *ts)
 {
 	unsigned long seq, ns;
 	do {
@@ -92,10 +92,9 @@ notrace static noinline int do_realtime(struct timespec *ts)
 		ns = vgetns();
 	} while (unlikely(read_seqretry(&gtod->lock, seq)));
 	timespec_add_ns(ts, ns);
-	return 0;
 }
 
-notrace static noinline int do_monotonic(struct timespec *ts)
+notrace static noinline void do_monotonic(struct timespec *ts)
 {
 	unsigned long seq, ns, secs;
 	do {
@@ -115,11 +114,9 @@ notrace static noinline int do_monotonic(struct timespec *ts)
 	}
 	ts->tv_sec = secs;
 	ts->tv_nsec = ns;
-
-	return 0;
 }
 
-notrace static noinline int do_realtime_coarse(struct timespec *ts)
+notrace static noinline void do_realtime_coarse(struct timespec *ts)
 {
 	unsigned long seq;
 	do {
@@ -127,10 +124,9 @@ notrace static noinline int do_realtime_coarse(struct timespec *ts)
 		ts->tv_sec = gtod->wall_time_coarse.tv_sec;
 		ts->tv_nsec = gtod->wall_time_coarse.tv_nsec;
 	} while (unlikely(read_seqretry(&gtod->lock, seq)));
-	return 0;
 }
 
-notrace static noinline int do_monotonic_coarse(struct timespec *ts)
+notrace static noinline void do_monotonic_coarse(struct timespec *ts)
 {
 	unsigned long seq, ns, secs;
 	do {
@@ -150,25 +146,29 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
 	}
 	ts->tv_sec = secs;
 	ts->tv_nsec = ns;
-
-	return 0;
 }
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
-		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE))
-			return do_realtime(ts);
+		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE)) {
+			do_realtime(ts);
+			return 0;
+		}
 		break;
 	case CLOCK_MONOTONIC:
-		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE))
-			return do_monotonic(ts);
+		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE)) {
+			do_monotonic(ts);
+			return 0;
+		}
 		break;
 	case CLOCK_REALTIME_COARSE:
-		return do_realtime_coarse(ts);
+		do_realtime_coarse(ts);
+		return 0;
 	case CLOCK_MONOTONIC_COARSE:
-		return do_monotonic_coarse(ts);
+		do_monotonic_coarse(ts);
+		return 0;
 	}
 
 	return vdso_fallback_gettime(clock, ts);
@@ -176,6 +176,44 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 int clock_gettime(clockid_t, struct timespec *)
 	__attribute__((weak, alias("__vdso_clock_gettime")));
 
+notrace int __vdso_clock_gettime_ns(clockid_t clock, u64 *t)
+{
+	/* This implementation is slow.  It will be improved later. */
+
+	struct timespec ts;
+	int error;
+
+	switch (clock) {
+	case CLOCK_REALTIME:
+		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE)) {
+			do_realtime(&ts);
+			*t = ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+			return 0;
+		}
+		break;
+	case CLOCK_MONOTONIC:
+		if (likely(gtod->clock.vclock_mode != VCLOCK_NONE)) {
+			do_monotonic(&ts);
+			*t = ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+			return 0;
+		}
+		break;
+	case CLOCK_REALTIME_COARSE:
+		do_realtime_coarse(&ts);
+		*t = ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+		return 0;
+	case CLOCK_MONOTONIC_COARSE:
+		do_monotonic_coarse(&ts);
+		*t = ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+		return 0;
+	}
+
+	error = vdso_fallback_gettime(clock, &ts);
+	if (!error)
+		*t = ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+	return error;
+}
+
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
 	long ret;
diff --git a/arch/x86/vdso/vdso.lds.S b/arch/x86/vdso/vdso.lds.S
index b96b267..238f500 100644
--- a/arch/x86/vdso/vdso.lds.S
+++ b/arch/x86/vdso/vdso.lds.S
@@ -17,6 +17,10 @@
 VERSION {
 	LINUX_2.6 {
 	global:
+		/*
+		 * These are the original vsyscalls.  They have weak symbols
+		 * without the __vdso_ prefix for ABI compatibility.
+		 */
 		clock_gettime;
 		__vdso_clock_gettime;
 		gettimeofday;
@@ -25,6 +29,9 @@ VERSION {
 		__vdso_getcpu;
 		time;
 		__vdso_time;
+
+		/* New vsyscalls are just plain functions. */
+		__vdso_clock_gettime_ns;
 	local: *;
 	};
 }
-- 
1.7.7.4



* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-13  1:26 [RFC 0/2] ABI for clock_gettime_ns Andy Lutomirski
  2011-12-13  1:26 ` [RFC 1/2] Add clock_gettime_ns syscall Andy Lutomirski
  2011-12-13  1:26 ` [RFC 2/2] x86-64: Add __vdso_clock_gettime_ns vsyscall Andy Lutomirski
@ 2011-12-13  3:24 ` Richard Cochran
  2011-12-13  3:43   ` john stultz
  2011-12-21  0:50 ` Arun Sharma
  3 siblings, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-13  3:24 UTC
  To: Andy Lutomirski
  Cc: linux-kernel, Kumar Sundararajan, john stultz, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Mon, Dec 12, 2011 at 05:26:36PM -0800, Andy Lutomirski wrote:
> On x86-64, clock_gettime is so fast that the overhead converting to and
> from nanoseconds is non-negligible.  clock_gettime_ns is a different
> interface that is potentially faster.  If people like the ABI, I'll
> implement an optimized version.

I am not so interested in performance optimizations, but I do think
offering time in nanoseconds is attractive from an application point
of view. The timespec is impractical for everyone.

While you are at it with new syscalls, why not make a clean break from
POSIX and fix the uglies?

- New name, to distance ourselves from POSIX (clock_ns_get?)
- Family of calls, with set/get
- Sub nanosecond field
- TAI time base (or according to parameter?)

Thanks,
Richard


* Re: [RFC 1/2] Add clock_gettime_ns syscall
  2011-12-13  1:26 ` [RFC 1/2] Add clock_gettime_ns syscall Andy Lutomirski
@ 2011-12-13  3:32   ` Richard Cochran
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Cochran @ 2011-12-13  3:32 UTC
  To: Andy Lutomirski
  Cc: linux-kernel, Kumar Sundararajan, john stultz, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Mon, Dec 12, 2011 at 05:26:37PM -0800, Andy Lutomirski wrote:
> On some architectures, clock_gettime is fast enough that converting
> between nanoseconds and struct timespec takes a significant amount of
> time.  Introduce a new syscall that does the same thing but returns the
> answer in nanoseconds.  2^64 nanoseconds since the epoch won't wrap
> around until the year 2554, and by then we can use 128-bit types.

You have unsigned here, but the time_t in timespec is signed. To be
consistent with clock_gettime, it would have to be signed, and that
still gives you about 300 years.

OTOH, clearly new and different syscalls can happily be unsigned.

In any case, you should make it clear.
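
The ranges being compared can be checked with a few lines of C.  This
is only a worked calculation: SEC_PER_YEAR assumes an average Gregorian
year of 365.2425 days, and the function names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
/* Assumption: average Gregorian year, 365.2425 days = 31556952 s. */
#define SEC_PER_YEAR 31556952ULL

/* Whole years covered by a nanosecond counter with the given maximum. */
static uint64_t ns_range_years(uint64_t max_ns)
{
	return max_ns / NSEC_PER_SEC / SEC_PER_YEAR;
}

/* Year in which a counter started at the 1970 epoch runs out. */
static uint64_t wrap_year(uint64_t max_ns)
{
	return 1970 + ns_range_years(max_ns);
}

/* Self-check of the unsigned and signed 64-bit cases. */
static void check_ranges(void)
{
	assert(wrap_year(UINT64_MAX) == 2554);  /* u64: about 584 years */
	assert(wrap_year(INT64_MAX) == 2262);   /* s64: about 292 years */
}
```

The unsigned result matches the year 2554 given in the patch
description; the signed variant gives up half of that range.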

Thanks,
Richard


* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-13  3:24 ` [RFC 0/2] ABI for clock_gettime_ns Richard Cochran
@ 2011-12-13  3:43   ` john stultz
  2011-12-13  7:09     ` Andy Lutomirski
  2011-12-14  7:20     ` Richard Cochran
  0 siblings, 2 replies; 28+ messages in thread
From: john stultz @ 2011-12-13  3:43 UTC
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Tue, 2011-12-13 at 04:24 +0100, Richard Cochran wrote:
> On Mon, Dec 12, 2011 at 05:26:36PM -0800, Andy Lutomirski wrote:
> > On x86-64, clock_gettime is so fast that the overhead converting to and
> > from nanoseconds is non-negligible.  clock_gettime_ns is a different
> > interface that is potentially faster.  If people like the ABI, I'll
> > implement an optimized version.
> 
> I am not so interested in performance optimizations, but I do think
> offering time in nanoseconds is attractive from an application point
> of view. The timespec is impractical for everyone.
> 
> While you are at it with new syscalls, why not make a clean break from
> POSIX and fix the uglies?
> 
> - New name, to distance ourselves from POSIX (clock_ns_get?)
> - Family of calls, with set/get
> - Sub nanosecond field
> - TAI time base (or according to parameter?)

Having a CLOCK_TAI would be interesting across the board. We already
keep a TAI offset in the ntp code. However, I'm not sure if ntp actually
sets it these days.

thanks
-john



* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-13  3:43   ` john stultz
@ 2011-12-13  7:09     ` Andy Lutomirski
  2011-12-14  7:46       ` Richard Cochran
  2011-12-14  7:20     ` Richard Cochran
  1 sibling, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-13  7:09 UTC
  To: john stultz
  Cc: Richard Cochran, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Mon, Dec 12, 2011 at 7:43 PM, john stultz <johnstul@us.ibm.com> wrote:
> On Tue, 2011-12-13 at 04:24 +0100, Richard Cochran wrote:
>> On Mon, Dec 12, 2011 at 05:26:36PM -0800, Andy Lutomirski wrote:
>> > On x86-64, clock_gettime is so fast that the overhead converting to and
>> > from nanoseconds is non-negligible.  clock_gettime_ns is a different
>> > interface that is potentially faster.  If people like the ABI, I'll
>> > implement an optimized version.
>>
>> I am not so interested in performance optimizations, but I do think
>> offering time in nanoseconds is attractive from an application point
>> of view. The timespec is impractical for everyone.
>>
>> While you are at it with new syscalls, why not make a clean break from
>> POSIX and fix the uglies?
>>
>> - New name, to distance ourselves from POSIX (clock_ns_get?)

I will defer to the bikeshedding consensus :)

>> - Family of calls, with set/get

Setting the time is a big can of worms.  adjtimex is rather
incomprehensible (without reading lots of source and/or the rfc) and
IMO puts a lot of NTP magic into the kernel, where it doesn't belong.
But I don't really want to design, let alone implement, something
better, especially right now.  Maybe a better design would let you
open a file descriptor to control the time and apply offsets and
frequency correction (over a wide range, specified as a HZ-independent
fixed-point number) as needed.  But that's a whole different
discussion.

That being said, it might be nice to do something about leap seconds.
I always thought that the nanosecond count should include every
possible leap second so that every time that actually happens
corresponds to a unique count, but maybe that's just me.

>> - Sub nanosecond field

Meh.  A light-nanosecond is approximately a foot.  Other than things
local to a single computer, not much of interest happens on a
sub-nanosecond time scale.  Also, a single 64-bit count is nice, and
2^64 picoseconds isn't very long.
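
To put a number on "isn't very long", here is a small worked
calculation, with hypothetical names, of how many days an unsigned
64-bit picosecond counter would cover.

```c
#include <assert.h>
#include <stdint.h>

#define PSEC_PER_SEC 1000000000000ULL
#define SEC_PER_DAY  86400ULL

/* Whole days covered by a picosecond counter with the given maximum. */
static uint64_t ps_range_days(uint64_t max_ps)
{
	return max_ps / PSEC_PER_SEC / SEC_PER_DAY;
}

/* Self-check: an unsigned 64-bit picosecond count lasts under a year. */
static void check_ps_range(void)
{
	assert(ps_range_days(UINT64_MAX) == 213);
}
```

The result is 213 whole days, so a picosecond count since the epoch
would not fit in 64 bits.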

>> - TAI time base (or according to parameter?)
>
> Having a CLOCK_TAI would be interesting across the board. We already
> keep a TAI offset in the ntp code. However, I'm not sure if ntp actually
> sets it these days.

A friend of mine would probably appreciate various barycentric time
scales as well.  This would also be a different (and unrelated) patch.

--Andy


* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-13  3:43   ` john stultz
  2011-12-13  7:09     ` Andy Lutomirski
@ 2011-12-14  7:20     ` Richard Cochran
  2011-12-14 16:23       ` john stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-14  7:20 UTC
  To: john stultz
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Mon, Dec 12, 2011 at 07:43:02PM -0800, john stultz wrote:
> 
> Having a CLOCK_TAI would be interesting across the board. We already
> keep a TAI offset in the ntp code. However, I'm not sure if ntp actually
> sets it these days.

A bit OT, but what do you think of the idea of keeping TAI in the kernel,
and providing UTC via a tabular conversion routine?

Michel Hack wrote an article last year detailing how Linux botches the
leap second and suggested a more robust way to handle it.

Thanks,
Richard




* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-13  7:09     ` Andy Lutomirski
@ 2011-12-14  7:46       ` Richard Cochran
  2011-12-14 16:48         ` john stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-14  7:46 UTC
  To: Andy Lutomirski
  Cc: john stultz, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Mon, Dec 12, 2011 at 11:09:29PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 12, 2011 at 7:43 PM, john stultz <johnstul@us.ibm.com> wrote:
> >> - New name, to distance ourselves from POSIX (clock_ns_get?)
> 
> I will defer to the bikeshedding consensus :)
> 
> >> - Family of calls, with set/get
> 
> Setting the time is a big can of worms.  adjtimex is rather
> incomprehensible (without reading lots of source and/or the rfc) and
> IMO puts a lot of NTP magic into the kernel, where it doesn't belong.

I agree with you about that, but when you consider the HW/SW
environment at the time when NTP was developed, you can understand why
it had to be done like that. Actually even most of today's hardware does
not help us to keep the time.

With the newer PTP devices, with tunable HW clocks, we finally have a
chance to move the "magic" back into user space. We can only hope that
the idea will transfer over to CPU clocks as well.

> But I don't really want to design, let alone implement, something
> better, especially right now.  Maybe a better design would let you
> open a file descriptor to control the time and apply offsets and
> frequency correction (over a wide range, specified as a HZ-independent
> fixed-point number) as needed.  But that's a whole different
> discussion.

I initially proposed PTP clock support as a character device, since
that was the quick and easy way to do it, and it was adequate for the
user land needs. The reaction from the list, however, was, why not do
it this way ... (details) ... ?

The end result was much more work (and discussion), but the solution
that emerged was really, truly much nicer than the character device
idea.

So, I understand that all you need is a "quick and easy" performance
fix, but I see an opportunity here for a new, improved time interface.

> That being said, it might be nice to do something about leap seconds.
> I always thought that the nanosecond count should include every
> possible leap second so that every time that actually happens
> corresponds to a unique count, but maybe that's just me.

The advantage of working with TAI is that you can use simple addition
and subtraction (converting the result to UTC or whatever), and the
answer is always correct.
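
A minimal sketch of the table-driven conversion idea, assuming the TAI
count is defined as the POSIX UTC count plus the current TAI - UTC
offset.  The table is a two-row excerpt for illustration only, and all
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row: from this TAI second onward, TAI - UTC equals 'offset'. */
struct leap_entry {
	int64_t tai_start;  /* TAI seconds, counted from the 1970 epoch */
	int32_t offset;     /* TAI - UTC in seconds */
};

/*
 * Illustrative excerpt.  A real table would list every leap second and
 * be updated whenever a new one is announced.
 */
static const struct leap_entry leap_table[] = {
	{ 1136073600 + 33, 33 },  /* 2006-01-01 00:00:00 UTC */
	{ 1230768000 + 34, 34 },  /* 2009-01-01 00:00:00 UTC */
};

/* Convert a TAI second count to a POSIX UTC second count. */
static int64_t tai_to_utc(int64_t tai)
{
	size_t i = sizeof(leap_table) / sizeof(leap_table[0]);

	/* This excerpt has no data before its first row. */
	assert(tai >= leap_table[0].tai_start);

	/* Walk backwards to the newest row that has already started. */
	while (i-- > 0) {
		if (tai >= leap_table[i].tai_start)
			return tai - leap_table[i].offset;
	}
	return tai;  /* not reached while the assertion above holds */
}

/* Intervals are plain subtraction in TAI, even across a leap second. */
static int64_t tai_interval(int64_t tai_a, int64_t tai_b)
{
	return tai_b - tai_a;
}
```

Across the leap second at the end of 2008, the TAI values for 23:59:59
and 00:00:00 differ by 2, which is the true elapsed time, while the
corresponding POSIX UTC counts differ by only 1.  The inserted leap
second itself maps onto the same UTC count as the following second,
which is the ambiguity POSIX time already has.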

> >> - Sub nanosecond field
> 
> Meh.  A light-nanosecond is approximately a foot.  Other than things
> local to a single computer, not much of interest happens on a
> sub-nanosecond time scale.  Also, a single 64-bit count is nice, and
> 2^64 picoseconds isn't very long.

Believe it or not, people (from the Test and Measurement field) have
already been asking me about having subnanosecond time values from the
kernel.

What about this sort of time value?

struct sys_timeval {
	__s64 nanoseconds;
	__u32 fractional_ns;
};

The second field can just be zero, for now.

> >> - TAI time base (or according to parameter?)
> >
> > Having a CLOCK_TAI would be interesting across the board. We already
> > keep a TAI offset in the ntp code. However, I'm not sure if ntp actually
> > sets it these days.
> 
> A friend of mine would probably appreciate various barycentric time
> scales as well.  This would also be a different (and unrelated) patch.

What about this: a new, non-POSIX, rational time interface providing
TAI time values, and a user space library for time scale conversion?

Thanks,
Richard


* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14  7:20     ` Richard Cochran
@ 2011-12-14 16:23       ` john stultz
  2011-12-14 18:21         ` Richard Cochran
  0 siblings, 1 reply; 28+ messages in thread
From: john stultz @ 2011-12-14 16:23 UTC
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, 2011-12-14 at 08:20 +0100, Richard Cochran wrote:
> On Mon, Dec 12, 2011 at 07:43:02PM -0800, john stultz wrote:
> > 
> > Having a CLOCK_TAI would be interesting across the board. We already
> > keep a TAI offset in the ntp code. However, I'm not sure if ntp actually
> > sets it these days.
> 
> A bit OT, but what do you think of the idea of keeping TAI in the kernel,
> and providing UTC via a tabular conversion routine?

Agreed, it's OT for this thread.

> Michel Hack wrote an article last year detailing how Linux botches the
> leap second and suggested a more robust way to handle it.

Hmm. Do you have a link to the article? 

I like the idea of having TAI as a kernel clockid. The hard part is
getting systems to initialize it properly at boot. 

Also part of the issue with leapseconds is that time functions are such
a hot path, we can't really add extra conditionals checking for leap
seconds. Instead the leapsecond occurs on the first tick of the
leapsecond.

More interesting to me is Google's recent use of slewed leapseconds.
However, how that would work on a public network is a bit more fuzzy.
And being able to support both TAI and slewed leapseconds would require
quite a bit more logic.

thanks
-john


* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14  7:46       ` Richard Cochran
@ 2011-12-14 16:48         ` john stultz
  2011-12-14 17:15           ` Andy Lutomirski
  2011-12-14 18:30           ` Richard Cochran
  0 siblings, 2 replies; 28+ messages in thread
From: john stultz @ 2011-12-14 16:48 UTC
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, 2011-12-14 at 08:46 +0100, Richard Cochran wrote:
> On Mon, Dec 12, 2011 at 11:09:29PM -0800, Andy Lutomirski wrote:
> > On Mon, Dec 12, 2011 at 7:43 PM, john stultz <johnstul@us.ibm.com> wrote:
> > >> - New name, to distance ourselves from POSIX (clock_ns_get?)
> > 
> > I will defer to the bikeshedding consensus :)
> > 
> > >> - Family of calls, with set/get
> > 
> > Setting the time is a big can of worms.  adjtimex is rather
> > incomprehensible (without reading lots of source and/or the rfc) and
> > IMO puts a lot of NTP magic into the kernel, where it doesn't belong.

Honestly, I don't really see how we jumped to adjtimex from setting the
time, nor the complexity hinted at. First, the rationale for getting
clock_gettime_ns is to avoid the overhead of userland translating from
timespec to ns.   I doubt there are similar performance needs for
settimeofday().  Even if it was needed, it shouldn't be more complex
than the unit conversion done in this abi patch. Am I missing something?

> > That being said, it might be nice to do something about leap seconds.
> > I always thought that the nanosecond count should include every
> > possible leap second so that every time that actually happens
> > corresponds to a unique count, but maybe that's just me.
> 
> The advantage of working with TAI is that you can use simple addition
> and subtraction (converting the result to UTC or whatever), and the
> answer is always correct.

But again, the hard part with in-kernel TAI (possibly as the base of
time) is that initialization of the TAI/UTC offset needs to be able to be
phased in slowly, as we also have to preserve legacy interfaces and
behavior. 

> > >> - Sub nanosecond field
> > 
> > Meh.  A light-nanosecond is approximately a foot.  Other than things
> > local to a single computer, not much of interest happens on a
> > sub-nanosecond time scale.  Also, a single 64-bit count is nice, and
> > 2^64 picoseconds isn't very long.
> 
> Believe it or not, people (from the Test and Measurement field) have
> already been asking me about having subnanosecond time values from the
> kernel.
> 
> What about this sort of time value?
> 
> struct sys_timeval {
> 	__s64 nanoseconds;
> 	__u32 fractional_ns;
> };
> 
> The second field can just be zero, for now.

I'm mixed on this. 

We could do this, as the kernel keeps track of sub-ns granularity.
However, it's not stored in a decimal format. So I worry the extra math
needed to convert it to something usable might add extra overhead,
removing the gain of the proposed clock_gettime_ns() interface.


> > >> - TAI time base (or according to parameter?)
> > >
> > > Having a CLOCK_TAI would be interesting across the board. We already
> > > keep a TAI offset in the ntp code. However, I'm not sure if ntp actually
> > > sets it these days.
> > 
> > A friend of mine would probably appreciate various barycentric time
> > scales as well.  This would also be a different (and unrelated) patch.
> 
> What about this: a new, non-POSIX, rational time interface providing
> TAI time values, and a user space library for time scale conversion?

Why do we need a new interface for TAI? clock_gettime(CLOCK_TAI,...)
should be achievable. I do think it would be interesting, but I also
think it's separate from the goal of this proposal.

thanks
-john




* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 16:48         ` john stultz
@ 2011-12-14 17:15           ` Andy Lutomirski
  2011-12-14 17:31             ` john stultz
  2011-12-14 18:37             ` Richard Cochran
  2011-12-14 18:30           ` Richard Cochran
  1 sibling, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-14 17:15 UTC
  To: john stultz
  Cc: Richard Cochran, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 8:48 AM, john stultz <johnstul@us.ibm.com> wrote:
> On Wed, 2011-12-14 at 08:46 +0100, Richard Cochran wrote:
>> On Mon, Dec 12, 2011 at 11:09:29PM -0800, Andy Lutomirski wrote:
>> > On Mon, Dec 12, 2011 at 7:43 PM, john stultz <johnstul@us.ibm.com> wrote:
>> > >> - New name, to distance ourselves from POSIX (clock_ns_get?)
>> >
>> > I will defer to the bikeshedding consensus :)
>> >
>> > >> - Family of calls, with set/get
>> >
>> > Setting the time is a big can of worms.  adjtimex is rather
>> > incomprehensible (without reading lots of source and/or the rfc) and
>> > IMO puts a lot of NTP magic into the kernel, where it doesn't belong.
>
> Honestly, I don't really see how we jumped to adjtimex from setting the
> time, nor the complexity hinted at. First, the rationale for getting
> clock_gettime_ns is to avoid the overhead of userland translating from
> timespec to ns.   I doubt there are similar performance needs for
> settimeofday().  Even if it was needed, it shouldn't be more complex
> than the unit conversion done in this abi patch. Am I missing something?
>
>> > That being said, it might be nice to do something about leap seconds.
>> > I always thought that the nanosecond count should include every
>> > possible leap second so that every time that actually happens
>> > corresponds to a unique count, but maybe that's just me.
>>
>> The advantage of working with TAI is that you can use simple addition
>> and subtraction (converting the result to UTC or whatever), and the
>> answer is always correct.
>
> But again, the hard part with in-kernel TAI (possibly as the base of
> time) is that initialization of the TAI/UTC offset needs to be able to be
> phased in slowly, as we also have to preserve legacy interfaces and
> behavior.

I have a computer that ticks at the wrong rate, and trying to correct
it via the existing APIs is possible but seemed far more complicated
than it should have been.  That being said, I have almost no interest
in messing with this stuff.  I'll leave it to the NTP experts :)

It certainly has nothing to do with my patch.

>
>> > >> - Sub nanosecond field
>> >
>> > Meh.  A light-nanosecond is approximately a foot.  Other than things
>> > local to a single computer, not much of interest happens on a
>> > sub-nanosecond time scale.  Also, a single 64-bit count is nice, and
>> > 2^64 picoseconds isn't very long.
>>
>> Believe it or not, people (from the Test and Measurement field) have
>> already been asking me about having subnanosecond time values from the
>> kernel.

I'm curious how that works.  My personal record is synchronizing time
across a bunch of computers to within maybe half a nanosecond, but it
wasn't the *system* clock that I synchronized -- I just calibrated a
bunch of oscillator phase differences on ADC clocks that I was using.
I only relied on the system clock being correct to a few tens of
microseconds, which is easily done with PTP.

>>
>> What about this sort of time value?
>>
>> struct sys_timeval {
>>       __s64 nanoseconds;
>>       __u32 fractional_ns;
>> };
>>
>> The second field can just be zero, for now.
>
> I'm mixed on this.
>
> We could do this, as the kernel keeps track of sub-ns granularity.
> However, it's not stored in a decimal format. So I worry the extra math
> needed to convert it to something usable might add extra overhead,
> removing the gain of the proposed clock_gettime_ns() interface.
>

I would actually prefer units of 2^-32 ns over picoseconds.  I have no
attachment to SI picoseconds so long as the units are constant.
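
A rough sketch of why a binary fraction is cheaper than a decimal one,
assuming the clocksource-style scaling (cycles * mult) >> shift that
vgetns() in patch 2/2 uses.  The struct, the function names, and the
mult/shift values mentioned below are illustrative assumptions, not
kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Result split into whole nanoseconds and a binary fraction. */
struct ns_frac {
	uint64_t ns;      /* whole nanoseconds */
	uint32_t frac32;  /* fraction in units of 2^-32 ns */
};

/*
 * Scale a cycle count as (cycles * mult) >> shift.  The low 'shift'
 * bits that the shift would discard are the sub-nanosecond part.
 */
static struct ns_frac cycles_to_ns_frac(uint64_t cycles, uint32_t mult,
					uint32_t shift)
{
	struct ns_frac r;
	uint64_t scaled, low;

	assert(shift > 0 && shift <= 32);
	/* The caller keeps cycles small enough that this cannot overflow. */
	scaled = cycles * mult;
	low = scaled & ((1ULL << shift) - 1);

	r.ns = scaled >> shift;
	/* Widen the 'shift'-bit fraction to 32 bits: a shift, no division. */
	r.frac32 = (uint32_t)(low << (32 - shift));
	return r;
}

/* A decimal result needs a multiply on top: fraction to picoseconds. */
static uint32_t frac32_to_ps(uint32_t frac32)
{
	return (uint32_t)(((uint64_t)frac32 * 1000) >> 32);
}
```

With mult = 5 and shift = 2 (1.25 ns per cycle), 3 cycles give 3 ns
plus a fraction of 0xC0000000, that is 0.75 ns; turning that into
750 ps costs the extra multiply in frac32_to_ps.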

Windows sidesteps this issue by returning arbitrary units and telling
the user what those units are.  This adds a lot of unpleasantness (try
relating the timestamps to actual wall time) and we need to rescale
the time anyway for NTP.

What about:

struct sys_timeval {
    u64 nanoseconds;  /* unsigned.  the current time will always be
                         after 1970, and those extra 290 years might
                         be nice. */
    u64 padding;  /* for later.  currently always zero. */
};

That way, once there's both an implementation and a use case, we can
implement it.  In the mean time, the overhead is probably immeasurably
low -- it's a single assignment.
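
A minimal sketch of how such a layout could be filled from a struct
timespec, for illustration only.  The names sys_timeval_sketch,
NS_PER_SEC and timespec_to_sys_timeval are made up here and are not
part of the posted patches; uint64_t stands in for the kernel's u64.

```c
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000ULL

/* Hypothetical userspace mirror of the proposed 16-byte layout. */
struct sys_timeval_sketch {
    uint64_t nanoseconds;  /* flat nanoseconds since the epoch */
    uint64_t padding;      /* reserved for later; always zero here */
};

/* Flatten a non-negative timespec into the two-field structure. */
static struct sys_timeval_sketch timespec_to_sys_timeval(struct timespec ts)
{
    struct sys_timeval_sketch tv;

    tv.nanoseconds = (uint64_t)ts.tv_sec * NS_PER_SEC + (uint64_t)ts.tv_nsec;
    tv.padding = 0;  /* the extra assignment mentioned above */
    return tv;
}
```

A caller that only measures elapsed time would subtract two
nanoseconds values and ignore the padding field entirely.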

Note that rdtsc isn't good to a nanosecond, let alone sub-nanosecond
intervals, on any hardware I've ever seen.

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 17:15           ` Andy Lutomirski
@ 2011-12-14 17:31             ` john stultz
  2011-12-14 18:37             ` Richard Cochran
  1 sibling, 0 replies; 28+ messages in thread
From: john stultz @ 2011-12-14 17:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Richard Cochran, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, 2011-12-14 at 09:15 -0800, Andy Lutomirski wrote:
> On Wed, Dec 14, 2011 at 8:48 AM, john stultz <johnstul@us.ibm.com> wrote:
> > On Wed, 2011-12-14 at 08:46 +0100, Richard Cochran wrote:
> >>
> >> What about this sort of time value?
> >>
> >> struct sys_timeval {
> >>       __s64 nanoseconds;
> >>       __u32 fractional_ns;
> >> };
> >>
> >> The second field can just be zero, for now.
> >
> > I'm mixed on this.
> >
> > We could do this, as the kernel keeps track of sub-ns granularity.
> > However, it's not stored in a decimal format. So I worry the extra math
> > needed to convert it to something usable might add extra overhead,
> > removing the gain of the proposed clock_gettime_ns() interface.
> >
> 
> I would actually prefer units of 2^-32 ns over picoseconds.  I have no
> attachment to SI picoseconds so long as the units are constant.

2^-32ns would be much easier to do.
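
A rough sketch of why a binary fraction is cheaper than a decimal one.
It assumes a fixed-point time value with 32 fractional bits; that
format, FRAC_BITS and the fp_* helper names are assumptions made for
this illustration, not the kernel's actual internal representation.

```c
#include <stdint.h>

/* Hypothetical fixed-point time: nanoseconds scaled by 2^FRAC_BITS.
 * With 32 fractional bits a 64-bit value only spans about 4.29 s of
 * whole nanoseconds, so this is purely an illustration. */
#define FRAC_BITS 32

/* Whole nanoseconds: a single shift. */
static uint64_t fp_whole_ns(uint64_t fp)
{
    return fp >> FRAC_BITS;
}

/* Fraction in units of 2^-32 ns: a plain truncation, no arithmetic. */
static uint32_t fp_frac_2m32(uint64_t fp)
{
    return (uint32_t)fp;
}

/* Fraction in picoseconds: needs a multiply and a shift (truncating). */
static uint32_t fp_frac_ps(uint64_t fp)
{
    return (uint32_t)(((uint64_t)(uint32_t)fp * 1000u) >> FRAC_BITS);
}
```

The binary fraction falls out of the stored value directly, whereas
the picosecond form costs an extra multiply on every call.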


> Windows sidesteps this issue by returning arbitrary units and telling
> the user what those units are.  This adds a lot of unpleasantness (try
> relating the timestamps to actual wall time) and we need to rescale
> the time anyway for NTP.
> 
> What about:
> 
> struct sys_timeval {
>     u64 nanoseconds;  /* unsigned.  the current time will always be
> after 1970, and those extra 290 years might be nice. */

I'd suspect we will still need this to be signed if it goes to userland.
In-kernel u64 for nanoseconds is fine because it doesn't have to deal
with anything that far in the past. But for userland we probably should
use s64. 
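
For reference, the range arithmetic behind the signed/unsigned
trade-off can be sketched as below.  The helper names and the choice
of an average Gregorian year are assumptions made for this
illustration.

```c
#include <stdint.h>

/* Average Gregorian year: 365.2425 days of 86400 s each, in ns. */
#define NS_PER_YEAR (31556952ULL * 1000000000ULL)

/* Whole years representable on either side of 1970 with s64 ns. */
static uint64_t s64_ns_range_years(void)
{
    return (uint64_t)INT64_MAX / NS_PER_YEAR;
}

/* Whole years representable after 1970 with u64 ns. */
static uint64_t u64_ns_range_years(void)
{
    return UINT64_MAX / NS_PER_YEAR;
}
```

So s64 covers roughly 1970 plus or minus 292 years, while u64 covers
about 584 years forward and nothing before the epoch.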

>     u64 padding;  /* for later.  currently always zero. */
> 
> That way, once there's both an implementation and a use case, we can
> implement it.  In the mean time, the overhead is probably immeasurably
> low -- it's a single assignment.

This sounds good to me. 

Kumar, Arun, I know we've strayed a bit from your original patch, but
any objections here?

thanks
-john





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 16:23       ` john stultz
@ 2011-12-14 18:21         ` Richard Cochran
  2011-12-14 18:57           ` john stultz
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-14 18:21 UTC (permalink / raw)
  To: john stultz
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 08:23:52AM -0800, john stultz wrote:
> On Wed, 2011-12-14 at 08:20 +0100, Richard Cochran wrote:
> > Michel Hack wrote an article last year detailing how Linux botches the
> > leap second and suggested a more robust way to handle it.
> 
> Hmm. Do you have a link to the article? 

I don't think it is online. Do you have the magic IEEE access?

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5609776

> I like the idea of having TAI as a kernel clockid. The hard part is
> getting systems to initialize it properly at boot. 
> 
> Also part of the issue with leapseconds is that time functions are such
> a hot path, we can't really add extra conditionals checking for leap
> seconds. Instead the leapsecond occurs on the first tick of the
> leapsecond.

The idea would only involve one conditional and one addition:

- System clock represents TAI
- Table of {threshold; offset} values, read mostly, rarely updated
- Table has index pointing to next event

Get time becomes:

1. read system time
2. test threshold
3. apply correction
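
The steps above can be sketched roughly as follows.  This is only an
illustration under stated assumptions: the struct and function names
are hypothetical, times are whole TAI seconds, and the table index is
assumed to be advanced by a rarely-run updater rather than by readers.

```c
#include <stddef.h>
#include <stdint.h>

/* One leap event: from `threshold` (TAI seconds) on, TAI - UTC is `offset`. */
struct leap_entry {
    int64_t threshold;
    int32_t offset;
};

/* Read-mostly table; `next` indexes the next pending event. */
struct leap_table {
    const struct leap_entry *entries;
    size_t count;
    size_t next;
    int32_t current_offset;  /* TAI - UTC before the next event */
};

/* 1. caller reads system (TAI) time, 2. test threshold, 3. apply correction. */
static int64_t tai_to_utc(const struct leap_table *t, int64_t tai)
{
    int32_t offset = t->current_offset;

    if (t->next < t->count && tai >= t->entries[t->next].threshold)
        offset = t->entries[t->next].offset;

    return tai - offset;
}
```

Note that a flat UTC count still repeats a value across an inserted
leap second; the sketch only shows where the correction is applied.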

> More interesting to me is Google's recent use of slewed leapseconds.
> However, how that would work on a public network is a bit more fuzzy.
> And being able to support both TAI and slewed leapseconds would require
> quite a bit more logic.

Do you mean smoothing the jump over the entire day (or other
interval)? This is also discussed in Hack's paper.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 16:48         ` john stultz
  2011-12-14 17:15           ` Andy Lutomirski
@ 2011-12-14 18:30           ` Richard Cochran
  2011-12-14 19:07             ` john stultz
  1 sibling, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-14 18:30 UTC (permalink / raw)
  To: john stultz
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 08:48:30AM -0800, john stultz wrote:
> On Wed, 2011-12-14 at 08:46 +0100, Richard Cochran wrote:
> > On Mon, Dec 12, 2011 at 11:09:29PM -0800, Andy Lutomirski wrote:
> > > On Mon, Dec 12, 2011 at 7:43 PM, john stultz <johnstul@us.ibm.com> wrote:
> > > >> - New name, to distance ourselves from POSIX (clock_ns_get?)
> > > 
> > > I will defer to the bikeshedding consensus :)
> > > 
> > > >> - Family of calls, with set/get
> > > 
> > > Setting the time is a big can of worms.  adjtimex is rather
> > > incomprehensible (without reading lots of source and/or the rfc) and
> > > IMO puts a lot of NTP magic into the kernel, where it doesn't belong.
> 
> Honestly, I don't really see how we jumped to adjtimex from setting the
> time, nor the complexity hinted at. First, the rationale for getting
> clock_gettime_ns is to avoid the overhead of userland translating from
> timespec to ns.   I doubt there are similar performance needs for
> settimeofday().  Even if it was needed, it shouldn't be more complex
> than the unit conversion done in this ABI patch. Am I missing something?

So, you agree on adding new syscalls as a performance tweak?

I am not against it, but I do think syscalls should try to satisfy a
large number of use cases.

> But again, the hard part with in-kernel TAI (possibly as the base of
> time) is that initialization of the TAI/UTC offset needs to be able to be
> phased in slowly, as we also have to preserve legacy interfaces and
> behavior. 

With a brand new syscall, there are no legacy uses.

> Why do we need a new interface for TAI? clock_gettime(CLOCK_TAI,...)
> should be achievable. I do think it would be interesting, but I also
> think it's separate from the goal of this proposal.

I mean to define an interface that always returns TAI values, no matter
what the clock device.

Richard

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 17:15           ` Andy Lutomirski
  2011-12-14 17:31             ` john stultz
@ 2011-12-14 18:37             ` Richard Cochran
  1 sibling, 0 replies; 28+ messages in thread
From: Richard Cochran @ 2011-12-14 18:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: john stultz, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 09:15:29AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 14, 2011 at 8:48 AM, john stultz <johnstul@us.ibm.com> wrote:

> > On Wed, 2011-12-14 at 08:46 +0100, Richard Cochran wrote:
> >> Believe it or not, people (from the Test and Measurement field) have
> >> already been asking me about having subnanosecond time values from the
> >> kernel.
> 
> I'm curious how that works.  My personal record is synchronizing time
> across a bunch of computers to within maybe half a nanosecond, but it
> wasn't the *system* clock that I synchronized -- I just calibrated a
> bunch of oscillator phase differences on ADC clocks that I was using.
> I only relied on the system clock being correct to a few tens of
> microseconds, which is easily done with PTP.

One example to take a look at is the White Rabbit project.

It is possible to time stamp events and apply clock corrections at a
very fine resolution, for example with synchronized Ethernet and PTP.

> What about:
> 
> struct sys_timeval {
>     u64 nanoseconds;  /* unsigned.  the current time will always be
>                          after 1970, and those extra 290 years might
>                          be nice. */
>     u64 padding;  /* for later.  currently always zero. */
> };
> 
> That way, once there's both an implementation and a use case, we can
> implement it.  In the mean time, the overhead is probably immeasurably
> low -- it's a single assignment.

Agreed.

> Note that rdtsc isn't good to a nanosecond, let alone sub-nanosecond
> intervals, on any hardware I've ever seen.

But the hardware is coming, sooner or later.

Richard

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 18:21         ` Richard Cochran
@ 2011-12-14 18:57           ` john stultz
  2012-01-07 19:51             ` Richard Cochran
  0 siblings, 1 reply; 28+ messages in thread
From: john stultz @ 2011-12-14 18:57 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, 2011-12-14 at 19:21 +0100, Richard Cochran wrote:
> On Wed, Dec 14, 2011 at 08:23:52AM -0800, john stultz wrote:
> > On Wed, 2011-12-14 at 08:20 +0100, Richard Cochran wrote:
> > > Michel Hack wrote an article last year detailing how Linux botches the
> > > leap second and suggested a more robust way to handle it.
> > 
> > Hmm. Do you have a link to the article? 
> 
> I don't think it is online. Do you have the magic IEEE access?
> 
> http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5609776

Ah, research. I'll have to look into getting access.

> > I like the idea of having TAI as a kernel clockid. The hard part is
> > getting systems to initialize it properly at boot. 
> > 
> > Also part of the issue with leapseconds is that time functions are such
> > a hot path, we can't really add extra conditionals checking for leap
> > seconds. Instead the leapsecond occurs on the first tick of the
> > leapsecond.
> 
> The idea would only involve one conditional and one addition:
> 
> - System clock represents TAI
> - Table of {threshold; offset} values, read mostly, rarely updated
> - Table has index pointing to next event
> 
> Get time becomes:
> 
> 1. read system time
> 2. test threshold
> 3. apply correction

Again, this seems relatively reasonable. But the difficulty in changing
system clock to be TAI is getting the table initialized and updated on
legacy systems that don't have the userland support added. 

I'd suggest starting with adding the threshold check and leap-second
correction in the getnstimeofday() path, and then see how performance is
impacted. 

That would let us improve leapsecond handling and get a sense of the
performance impact prior to reworking the kernel internals to be TAI.

> > More interesting to me is Google's recent use of slewed leapseconds.
> > However, how that would work on a public network is a bit more fuzzy.
> > And being able to support both TAI and slewed leapseconds would require
> > quite a bit more logic.
> 
> Do you mean smoothing the jump over the entire day (or other
> interval)? This is also discussed in Hack's paper.

Yeah, it's been discussed widely before, but I've not heard of it actually
being implemented. You can read about it here:
http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html

It's an interesting and attractive solution, but again, I suspect it really
only works in a closed network where everyone is using the same
approach. Mixing leap-smeared ntp servers with legacy ntp servers would
likely cause trouble. 

Also, trying to mix TAI with leap-smeared UTC would be complex to do
in-kernel, since we would need another layer of indirection to keep the
smear adjustments to UTC separate from the frequency adjustments for
TAI. Not impossible, but not trivial.
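
To make the smearing idea concrete, a minimal sketch follows.  It
assumes a plain linear ramp over a fixed window, chosen here only for
simplicity; it is not claimed to match what the Google post describes,
and smear_ns and its parameters are hypothetical.

```c
#include <stdint.h>

#define SMEAR_TOTAL_NS 1000000000LL  /* one leap second, in ns */

/*
 * How much of the leap second (in ns) has been absorbed at time `now`,
 * for a smear window starting at `start` and lasting `len` seconds.
 * All three arguments are in whole seconds; `len` must be positive.
 */
static int64_t smear_ns(int64_t now, int64_t start, int64_t len)
{
    if (now <= start)
        return 0;
    if (now >= start + len)
        return SMEAR_TOTAL_NS;
    return (now - start) * SMEAR_TOTAL_NS / len;
}
```

The smeared clock would then report its unsmeared time minus this
amount, which is the extra layer of adjustment that has to be kept
separate from the ordinary frequency corrections.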

thanks
-john

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 18:30           ` Richard Cochran
@ 2011-12-14 19:07             ` john stultz
  2011-12-14 19:20               ` Andy Lutomirski
  2011-12-22 12:03               ` Richard Cochran
  0 siblings, 2 replies; 28+ messages in thread
From: john stultz @ 2011-12-14 19:07 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, 2011-12-14 at 19:30 +0100, Richard Cochran wrote:
> On Wed, Dec 14, 2011 at 08:48:30AM -0800, john stultz wrote:
> > On Wed, 2011-12-14 at 08:46 +0100, Richard Cochran wrote:
> > > On Mon, Dec 12, 2011 at 11:09:29PM -0800, Andy Lutomirski wrote:
> > > > On Mon, Dec 12, 2011 at 7:43 PM, john stultz <johnstul@us.ibm.com> wrote:
> > > > >> - New name, to distance ourselves from POSIX (clock_ns_get?)
> > > > 
> > > > I will defer to the bikeshedding consensus :)
> > > > 
> > > > >> - Family of calls, with set/get
> > > > 
> > > > Setting the time is a big can of worms.  adjtimex is rather
> > > > incomprehensible (without reading lots of source and/or the rfc) and
> > > > IMO puts a lot of NTP magic into the kernel, where it doesn't belong.
> > 
> > Honestly, I don't really see how we jumped to adjtimex from setting the
> > time, nor the complexity hinted at. First, the rationale for getting
> > clock_gettime_ns is to avoid the overhead of userland translating from
> > timespec to ns.   I doubt there are similar performance needs for
> > settimeofday().  Even if it was needed, it shouldn't be more complex
> > than the unit conversion done in this ABI patch. Am I missing something?
> 
> So, you agree on adding new syscalls as a performance tweak?

Well, the patch that started this off was introducing a new vdso
function (which had no syscall equivalent) that provided the same data
as clock_gettime(CLOCK_THREAD_CPUTIME,...) but in ns format, because
that was a reasonable performance win.

While I'm not eager to add duplicative interfaces, if the syscall data
structure is in the way of performance, folks will make ugly hacks to
get around it, so we might as well adapt to something better as a
standard interface. However, to avoid maintenance burden, I'd like to
keep the duplicative interfaces similar so we can share as much back-end
logic as possible.

> I am not against it, but I do think syscalls should try to satisfy a
> large number of use cases.

I don't think what is being proposed is trying to limit its use cases.
The only limitation API-wise was whether we should return just nanoseconds
or something with the potential for sub-ns values.

> > But again, the hard part with in-kernel TAI (possibly as the base of
> > time) is that initialization of the TAI/UTC offset needs to be able to be
> > phased in slowly, as we also have to preserve legacy interfaces and
> > behavior. 
> 
> With a brand new syscall, there are no legacy uses.
> 
> > Why do we need a new interface for TAI? clock_gettime(CLOCK_TAI,...)
> > should be achievable. I do think it would be interesting, but I also
> > think it's separate from the goal of this proposal.
> 
> I mean to define an interface that always returns TAI values, no matter
> what the clock device.

Maybe I'm still not understanding, but that seems more limited than what
is being proposed, at least in my mind. clock_gettime_ns() would still
take a clockid, so having a CLOCK_TAI would be a potential change in the
future.

I don't see a reason to limit clock_gettime_ns() to only CLOCK_TAI.
After all, while CLOCK_TAI doesn't have leapseconds, it still can be set
by userland if it's wrong, so it doesn't provide the same functionality
as CLOCK_MONOTONIC. Additionally, the motivation for this new interface
is for a more efficient CLOCK_THREAD_CPUTIME vdso implementation, so an
exclusive CLOCK_TAI interface wouldn't serve that need.

I agree adding CLOCK_TAI is an interesting feature, but that could be done
properly with the existing clock_gettime() interface. So I think the
CLOCK_TAI discussion is really disconnected from the new ABI being
proposed.

thanks
-john

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 19:07             ` john stultz
@ 2011-12-14 19:20               ` Andy Lutomirski
  2011-12-14 21:34                 ` john stultz
  2011-12-22 12:03               ` Richard Cochran
  1 sibling, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-14 19:20 UTC (permalink / raw)
  To: john stultz
  Cc: Richard Cochran, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 11:07 AM, john stultz <johnstul@us.ibm.com> wrote:
> I don't see a reason to limit clock_gettime_ns() to only CLOCK_TAI.
> After all, while CLOCK_TAI doesn't have leapseconds, it still can be set
> by userland if it's wrong, so it doesn't provide the same functionality
> as CLOCK_MONOTONIC. Additionally, the motivation for this new interface
> is for a more efficient CLOCK_THREAD_CPUTIME vdso implementation, so an
> exclusive CLOCK_TAI interface wouldn't serve that need.
>
> I agree adding CLOCK_TAI is an interesting feature, but that could be done
> properly with the existing clock_gettime() interface. So I think the
> CLOCK_TAI discussion is really disconnected from the new ABI being
> proposed.

Unless we want the future CLOCK_TAI to also return the UTC offset (or
vice versa), in which case we can use 32 bits of the padding field for
exactly that purpose.  (If TAI and UTC diverge by more than 2^32
seconds, we probably have other things to worry about.  That being
said, it might be useful to add a little more padding at the cost of a
bit of performance.)
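
One way the padding could carry a UTC offset, sketched under loud
assumptions: nothing in the patches defines this, the choice of the
low 32 bits is arbitrary, and the pad_* helper names are hypothetical.

```c
#include <stdint.h>

/* Store a signed TAI-UTC offset (seconds) in the low 32 bits of the
 * padding word, leaving the high 32 bits untouched for other uses. */
static uint64_t pad_set_utc_offset(uint64_t padding, int32_t offset)
{
    return (padding & 0xffffffff00000000ULL) | (uint32_t)offset;
}

/* Recover the signed offset from the low 32 bits.  The narrowing
 * conversion assumes the usual two's-complement behaviour. */
static int32_t pad_get_utc_offset(uint64_t padding)
{
    return (int32_t)(uint32_t)padding;
}
```

That would still leave 32 bits free, for instance for a sub-nanosecond
fraction, which is why a little more padding might be worth its cost.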

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 19:20               ` Andy Lutomirski
@ 2011-12-14 21:34                 ` john stultz
  2011-12-15 11:35                   ` Richard Cochran
  0 siblings, 1 reply; 28+ messages in thread
From: john stultz @ 2011-12-14 21:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Richard Cochran, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steve Allen

On Wed, 2011-12-14 at 11:20 -0800, Andy Lutomirski wrote:
> On Wed, Dec 14, 2011 at 11:07 AM, john stultz <johnstul@us.ibm.com> wrote:
> > I don't see a reason to limit clock_gettime_ns() to only CLOCK_TAI.
> > After all, while CLOCK_TAI doesn't have leapseconds, it still can be set
> > by userland if it's wrong, so it doesn't provide the same functionality
> > as CLOCK_MONOTONIC. Additionally, the motivation for this new interface
> > is for a more efficient CLOCK_THREAD_CPUTIME vdso implementation, so an
> > exclusive CLOCK_TAI interface wouldn't serve that need.
> >
> > I agree adding CLOCK_TAI is an interesting feature, but that could be done
> > properly with the existing clock_gettime() interface. So I think the
> > CLOCK_TAI discussion is really disconnected from the new ABI being
> > proposed.
> 
> Unless we want the future CLOCK_TAI to also return the UTC offset (or
> vice versa), in which case we can use 32 bits of the padding field for
> exactly that purpose.  (If TAI and UTC diverge by more than 2^32
> seconds, we probably have other things to worry about.  That being
> said, it might be useful to add a little more padding at the cost of a
> bit of performance.)

FYI: Interesting additional emails from Steve Allen below (forwarded
with permission) that put some caution around CLOCK_TAI. 

Also, the timescales web page looks like a great reference, which I'll
have to bookmark.

thanks
-john


-------- Forwarded Message --------
From: Steve Allen <sla@ucolick.org>
To: John Stultz <johnstul@us.ibm.com>
Subject: TAI in linux kernel
Date: Wed, 14 Dec 2011 11:44:23 -0800

Greetings John Stultz,

I note the linux kernel discussion mentioning the use of TAI.

It may be relevant to note the position of the CCTF and BIPM on
the use of TAI as expressed in Document CCTF/09-27

http://www.bipm.org/cc/CCTF/Allowed/18/CCTF_09-27_note_on_UTC-ITU-R.pdf

Their position is that they do not want TAI used as an operational
system time, and in their last paragraph they make it plain that
they would consider suppressing TAI in order to accomplish that.

Building TAI into the linux kernel could result in the use of
a nonexistent time standard.



-------- Forwarded Message --------
From: Steve Allen <sla@ucolick.org>
To: john stultz <johnstul@us.ibm.com>
Subject: Re: TAI in linux kernel
Date: Wed, 14 Dec 2011 12:33:40 -0800

Greetings again John,

On Wed 2011-12-14T12:11:52 -0800, Steve Allen hath writ:
> Feel free.  It's a BIPM document, and somewhat confusing because BIPM
> documents from a few years earlier indicated different things, so it's
> hard to tell what to rely upon when designing operational systems.

I have been trying to track what's what for quite a while.

For a few links pointing at previous position statements about TAI see
http://www.ucolick.org/~sla/leapsecs/timescales.html#TAI
The 1999 statement is quite opposite to the 2007/2009 document.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 21:34                 ` john stultz
@ 2011-12-15 11:35                   ` Richard Cochran
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Cochran @ 2011-12-15 11:35 UTC (permalink / raw)
  To: john stultz
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Steve Allen

On Wed, Dec 14, 2011 at 01:34:50PM -0800, john stultz wrote:

> FYI: Interesting additional emails from Steve Allen below (forwarded
> with permission) that put some caution around CLOCK_TAI. 

I saw that, too, and I have a comment, below.

> -------- Forwarded Message --------
> From: Steve Allen <sla@ucolick.org>
> To: John Stultz <johnstul@us.ibm.com>
> Subject: TAI in linux kernel
> Date: Wed, 14 Dec 2011 11:44:23 -0800
> 
> Greetings John Stultz,
> 
> I note the linux kernel discussion mentioning the use of TAI.
> 
> It may be relevant to note the position of the CCTF and BIPM on
> the use of TAI as expressed in Document CCTF/09-27
> 
> http://www.bipm.org/cc/CCTF/Allowed/18/CCTF_09-27_note_on_UTC-ITU-R.pdf
> 
> Their position is that they do not want TAI used as an operational
> system time, and in their last paragraph they make it plain that
> they would consider suppressing TAI in order to accomplish that.
> 
> Building TAI into the linux kernel could result in the use of
> a nonexistent time standard.

Yes, they would do this only if UTC becomes continuous and decoupled
from the whole leap second mess. I don't think it would be a problem
for the kernel, since we could simply re-name the kernel time scale to
UTC+X, or just remove the offset and call it UTC.

This would only happen if and when everyone agrees to the UTC
redefinition. I am not holding my breath. Earlier on in the article we
read that "after several years of discussions and analysis of
documents no firm position has been taken."  That article is dated
2007. Is the UTC fixup any closer now than it was back then?

Also, TAI has already been standardized as an operational time scale
by IEEE standard 1588.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-13  1:26 [RFC 0/2] ABI for clock_gettime_ns Andy Lutomirski
                   ` (2 preceding siblings ...)
  2011-12-13  3:24 ` [RFC 0/2] ABI for clock_gettime_ns Richard Cochran
@ 2011-12-21  0:50 ` Arun Sharma
  2011-12-21  1:07   ` Andy Lutomirski
  3 siblings, 1 reply; 28+ messages in thread
From: Arun Sharma @ 2011-12-21  0:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kumar Sundararajan, john stultz, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner

On 12/12/11 5:26 PM, Andy Lutomirski wrote:
> On x86-64, clock_gettime is so fast that the overhead converting to and
> from nanoseconds is non-negligible.  clock_gettime_ns is a different
> interface that is potentially faster.  If people like the ABI, I'll
> implement an optimized version.

Didn't see any major objections to this proposal.

Andy: will you be posting an optimized version?

  -Arun


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-21  0:50 ` Arun Sharma
@ 2011-12-21  1:07   ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-21  1:07 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, Kumar Sundararajan, john stultz, Peter Zijlstra,
	Ingo Molnar, Thomas Gleixner

On Tue, Dec 20, 2011 at 4:50 PM, Arun Sharma <asharma@fb.com> wrote:
> On 12/12/11 5:26 PM, Andy Lutomirski wrote:
>>
>> On x86-64, clock_gettime is so fast that the overhead converting to and
>> from nanoseconds is non-negligible.  clock_gettime_ns is a different
>> interface that is potentially faster.  If people like the ABI, I'll
>> implement an optimized version.
>
>
> Didn't see any major objections to this proposal.
>
> Andy: will you be posting an optimized version?

Yes, probably in a day or two.  It's mostly written, but I want to
generate some benchmarks to go with it.

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 19:07             ` john stultz
  2011-12-14 19:20               ` Andy Lutomirski
@ 2011-12-22 12:03               ` Richard Cochran
  2011-12-24  5:59                 ` Andy Lutomirski
  1 sibling, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-22 12:03 UTC (permalink / raw)
  To: john stultz
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 11:07:46AM -0800, john stultz wrote:
> On Wed, 2011-12-14 at 19:30 +0100, Richard Cochran wrote:
> > 
> > So, you agree on adding new syscalls as a performance tweak?
> 
> Well, the patch that started this off was introducing a new vdso
> function (which had no syscall equivalent) that provided the same data
> as clock_gettime(CLOCK_THREAD_CPUTIME,...) but in ns format, because
> that was a reasonable performance win.

I don't see anything about CLOCK_THREAD_CPUTIME in the patches, but I do
see CLOCK_REALTIME and CLOCK_MONOTONIC.

How is this about thread CPU time? (I assumed it was about fast time
stamps.)

> I don't think what is being proposed is trying to limit its use cases.
> The only limitation API-wise was whether we should return just nanoseconds
> or something with the potential for sub-ns values.

Yes, I do think new interfaces should anticipate sub-ns uses.

> > I mean to define an interface that always returns TAI values, no matter
> > what the clock device.
> 
> Maybe I'm still not understanding, but that seems more limited than what
> is being proposed, at least in my mind. clock_gettime_ns() would still
> take a clockid, so having a CLOCK_TAI would be a potential change in the
> future.

POSIX got the clock_gettime interface wrong, because you cannot tell
the time with it. The POSIX interface will return the same time value
for two consecutive seconds, due to leap seconds.

IMHO, new interfaces should correct this mistake. So, a new interface
providing UTC should also tell the user about leap seconds.

Just my 2 cents,

Richard

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-22 12:03               ` Richard Cochran
@ 2011-12-24  5:59                 ` Andy Lutomirski
  2011-12-24  6:50                   ` Richard Cochran
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-24  5:59 UTC (permalink / raw)
  To: Richard Cochran
  Cc: john stultz, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Thu, Dec 22, 2011 at 4:03 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Wed, Dec 14, 2011 at 11:07:46AM -0800, john stultz wrote:
>> Maybe I'm still not understanding, but that seems more limited than what
>> is being proposed, at least in my mind. clock_gettime_ns() would still
>> take a clockid, so having a CLOCK_TAI would be a potential change in the
>> future.
>
> POSIX got the clock_gettime interface wrong, because you cannot tell
> the time with it. The POSIX interface will return the same time value
> for two consecutive seconds, due to leap seconds.
>
> IMHO, new interfaces should correct this mistake. So, a new interface
> providing UTC should also tell the user about leap seconds.

I agree.  Hence the extra padding that can be used to fix this, once
the fix is available in core code.

I've written the patch and it's not as big an improvement as I hoped.
I'll play with it a bit and send it out soon.

--Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-24  5:59                 ` Andy Lutomirski
@ 2011-12-24  6:50                   ` Richard Cochran
  2011-12-25  4:06                     ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Cochran @ 2011-12-24  6:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: john stultz, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Fri, Dec 23, 2011 at 09:59:04PM -0800, Andy Lutomirski wrote:

> I've written the patch and it's not as big an improvement as I hoped.
> I'll play with it a bit and send it out soon.

Andy, can you say what the motivation or use case is?

Just curious,

Richard

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-24  6:50                   ` Richard Cochran
@ 2011-12-25  4:06                     ` Andy Lutomirski
  0 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2011-12-25  4:06 UTC (permalink / raw)
  To: Richard Cochran
  Cc: john stultz, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Fri, Dec 23, 2011 at 10:50 PM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Fri, Dec 23, 2011 at 09:59:04PM -0800, Andy Lutomirski wrote:
>
>> I've written the patch and it's not as big an improvement as I hoped.
>> I'll play with it a bit and send it out soon.
>
> Andy, can you say what the motivation or use case is?

Simplicity and performance.  Many users (e.g. I) use clock_gettime to
measure elapsed time.  We just end up converting to flat nanoseconds
over and over.  This syscall avoids the conversion.
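
The conversion in question looks roughly like the sketch below, which
uses only the existing clock_gettime() interface.  The helper names
timespec_to_ns and monotonic_now_ns are made up for this illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() in strict -std= modes */
#endif

#include <stdint.h>
#include <time.h>

/* Flatten a timespec to nanoseconds: one multiply and one add per call. */
static int64_t timespec_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/* Current CLOCK_MONOTONIC reading as flat nanoseconds. */
static int64_t monotonic_now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return timespec_to_ns(&ts);
}
```

Elapsed time is then the difference of two monotonic_now_ns() results;
a flat-nanosecond interface would hand back that value without the
timespec round trip.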

The added padding bits will allow UTC/TAI offset and/or sub-nanosecond
precision in the future if someone wants to add them.

I seem to be saving just over one nanosecond per call on my laptop,
which isn't that impressive.  The real-world improvement may be better
-- calling clock_gettime in a tight loop should give very good branch
prediction success, which will make the cost of the loops that flat ns
avoids seem a little lower than they are.

I'll send patches in the morning.

--Andy

>
> Just curious,
>
> Richard



-- 
Andy Lutomirski
AMA Capital Management, LLC
Office: (310) 553-5322
Mobile: (650) 906-0647

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/2] ABI for clock_gettime_ns
  2011-12-14 18:57           ` john stultz
@ 2012-01-07 19:51             ` Richard Cochran
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Cochran @ 2012-01-07 19:51 UTC (permalink / raw)
  To: john stultz
  Cc: Andy Lutomirski, linux-kernel, Kumar Sundararajan, Arun Sharma,
	Peter Zijlstra, Ingo Molnar, Thomas Gleixner

On Wed, Dec 14, 2011 at 10:57:03AM -0800, john stultz wrote:
> On Wed, 2011-12-14 at 19:21 +0100, Richard Cochran wrote:
> > - System clock represents TAI
> > - Table of {threshold; offset} values, read mostly, rarely updated
> > - Table has index pointing to next event
> > 
> > Get time becomes:
> > 
> > 1. read system time
> > 2. test threshold
> > 3. apply correction
> 
> Again, this seems relatively reasonable. But the difficulty in changing
> system clock to be TAI is getting the table initialized and updated on
> legacy systems that don't have the userland support added. 
> 
> I'd suggest starting with adding the threshold check and leap-second
> correction in the getnstimeofday() path, and then see how performance is
> impacted. 
> 
> That would let us improve leapsecond handling and get a sense of the
> performance impact prior to reworking the kernel internals to be TAI.
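
The threshold-and-offset scheme quoted above can be sketched roughly as
follows.  This is only an illustration under stated assumptions: the
names leap_entry, leap_table and tai_to_utc are hypothetical, the clock
is assumed to count TAI seconds and never step backwards, the table is
assumed sorted by ascending threshold, and the locking a real kernel
implementation would need around the index update is left out.  How to
represent the inserted leap second itself is also not addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: from TAI second `threshold` on, UTC = TAI - offset. */
struct leap_entry {
	int64_t threshold;	/* TAI seconds at which the offset takes effect */
	int32_t offset;		/* TAI - UTC, in seconds, from the threshold on */
};

/* Hypothetical read-mostly table with an index to the next pending event. */
struct leap_table {
	const struct leap_entry *entries;	/* sorted by ascending threshold */
	size_t count;
	size_t next;		/* index of the next event, == count if none */
};

static int64_t tai_to_utc(struct leap_table *t, int64_t tai_sec)
{
	assert(t->next <= t->count);

	/* Step 2: test the threshold, moving past events already reached. */
	while (t->next < t->count && tai_sec >= t->entries[t->next].threshold)
		t->next++;

	/* Before the first threshold no correction is known; return as-is. */
	if (t->next == 0)
		return tai_sec;

	/* Step 3: apply the correction from the most recent event. */
	return tai_sec - t->entries[t->next - 1].offset;
}
```

In the common case the loop body never runs, so the fast path costs one
comparison and one subtraction on top of reading the clock.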

BTW, another leap second is coming this summer, and another chance to
test the kernel leap second handling.

It will also be a good day to simply avoid any time measurements, turn
off your computer, and boot up again the next day.

Richard


end of thread, other threads:[~2012-01-07 19:51 UTC | newest]

Thread overview: 28+ messages
2011-12-13  1:26 [RFC 0/2] ABI for clock_gettime_ns Andy Lutomirski
2011-12-13  1:26 ` [RFC 1/2] Add clock_gettime_ns syscall Andy Lutomirski
2011-12-13  3:32   ` Richard Cochran
2011-12-13  1:26 ` [RFC 2/2] x86-64: Add __vdso_clock_gettime_ns vsyscall Andy Lutomirski
2011-12-13  3:24 ` [RFC 0/2] ABI for clock_gettime_ns Richard Cochran
2011-12-13  3:43   ` john stultz
2011-12-13  7:09     ` Andy Lutomirski
2011-12-14  7:46       ` Richard Cochran
2011-12-14 16:48         ` john stultz
2011-12-14 17:15           ` Andy Lutomirski
2011-12-14 17:31             ` john stultz
2011-12-14 18:37             ` Richard Cochran
2011-12-14 18:30           ` Richard Cochran
2011-12-14 19:07             ` john stultz
2011-12-14 19:20               ` Andy Lutomirski
2011-12-14 21:34                 ` john stultz
2011-12-15 11:35                   ` Richard Cochran
2011-12-22 12:03               ` Richard Cochran
2011-12-24  5:59                 ` Andy Lutomirski
2011-12-24  6:50                   ` Richard Cochran
2011-12-25  4:06                     ` Andy Lutomirski
2011-12-14  7:20     ` Richard Cochran
2011-12-14 16:23       ` john stultz
2011-12-14 18:21         ` Richard Cochran
2011-12-14 18:57           ` john stultz
2012-01-07 19:51             ` Richard Cochran
2011-12-21  0:50 ` Arun Sharma
2011-12-21  1:07   ` Andy Lutomirski