public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
@ 2011-03-28 15:06 Andy Lutomirski
  2011-03-28 15:06 ` [PATCH 1/6] x86-64: Optimize vread_tsc's barriers Andy Lutomirski
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
(tested on Sandy Bridge).  The patches are ordered in roughly decreasing
order of improvement.

These are meant for 2.6.40, but if anyone wants to take some of them
for 2.6.39 I won't object.

The changes and timings (fastest of 20 trials of 100M iters on Sandy
Bridge) are:

Unpatched:

CLOCK_MONOTONIC: 22.09ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.65ns

x86-64: Optimize vread_tsc's barriers

This replaces lfence;rdtsc;lfence with a faster sequence with similar
ordering guarantees.

CLOCK_MONOTONIC: 18.28ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.98ns

x86-64: Don't generate cmov in vread_tsc

GCC likes to generate a cmov on a branch that's almost completely
predictable.  Force it to generate a real branch instead.

CLOCK_MONOTONIC: 16.30ns
CLOCK_REALTIME_COARSE: 4.23ns
CLOCK_MONOTONIC_COARSE: 5.95ns

x86-64: Put vsyscall_gtod_data at a fixed virtual address

Because vsyscall_gtod_data's address isn't known until load time, the
code contains unnecessary address calculations.  Hardcode it.  This is
a nice speedup for the _COARSE variants as well.

CLOCK_MONOTONIC: 16.12ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 5.31ns

x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0

vset_normalized_timespec was more general than necessary.  Open-code
the appropriate normalization loops.  This is a big win for
CLOCK_MONOTONIC_COARSE.

CLOCK_MONOTONIC: 16.09ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 4.49ns

x86-64: Omit frame pointers on vread_tsc

This is a bit silly and needs work for gcc < 4.4 (if we even care),
but, rather surprisingly, it's 0.3ns faster.  I guess that the CPU's
stack frame optimizations aren't quite as good as I thought.

CLOCK_MONOTONIC: 15.79ns
CLOCK_REALTIME_COARSE: 3.70ns
CLOCK_MONOTONIC_COARSE: 4.50ns

x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO

We're building the vDSO with optimizations disabled that were meant
for kernel code.  Override that, except for -fno-omit-frame-pointer,
which we keep because omitting frame pointers might make userspace
debugging harder.

CLOCK_MONOTONIC: 15.66ns
CLOCK_REALTIME_COARSE: 3.44ns
CLOCK_MONOTONIC_COARSE: 4.23ns


Andy Lutomirski (6):
  x86-64: Optimize vread_tsc's barriers
  x86-64: Don't generate cmov in vread_tsc
  x86-64: Put vsyscall_gtod_data at a fixed virtual address
  x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
  x86-64: Omit frame pointers on vread_tsc
  x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO

 arch/x86/kernel/tsc.c          |   48 ++++++++++++++++++++++++++++++++-------
 arch/x86/kernel/vmlinux.lds.S  |   13 +++++-----
 arch/x86/vdso/Makefile         |   15 +++++++++++-
 arch/x86/vdso/vclock_gettime.c |   40 ++++++++++++++++++---------------
 arch/x86/vdso/vextern.h        |    9 ++++++-
 5 files changed, 90 insertions(+), 35 deletions(-)

-- 
1.7.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/6] x86-64: Optimize vread_tsc's barriers
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
@ 2011-03-28 15:06 ` Andy Lutomirski
  2011-03-29  6:18   ` Ingo Molnar
  2011-03-28 15:06 ` [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc Andy Lutomirski
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

RDTSC is completely unordered on modern Intel and AMD CPUs.  The
Intel manual says that lfence;rdtsc causes all previous instructions
to complete before the tsc is read, and the AMD manual says to use
mfence;rdtsc to do the same thing.

We want a stronger guarantee, though: we want the tsc to be read
before any memory access that occurs after the call to
vclock_gettime (or vgettimeofday).  We currently guarantee that with
a second lfence or mfence.  This sequence is not really supported by
the manual (AFAICT) and it's also slow.

This patch changes the rdtsc to use implicit memory ordering instead
of the second fence.  The sequence looks like this:

{l,m}fence
rdtsc
mov [something dependent on edx],[tmp]
return [some function of tmp]

This means that the time stamp has to be read before the load, and
the return value depends on tmp.  All x86-64 chips guarantee that no
memory access after a load moves before that load.  This means that
all memory access after vread_tsc occurs after the time stamp is
read.

The trick is that the answer should not actually change as a result
of the sneaky memory access.  I accomplish this by shifting rdx left
by 32 bits, twice, to generate the number zero.  (I can't imagine
that any CPU can break that dependency.)  Then I use "zero" as an
offset to a memory access that we had to do anyway.

On Sandy Bridge (i7-2600), this improves a loop of
clock_gettime(CLOCK_MONOTONIC) by 5 ns/iter (from ~22.7 to ~17.7).
time-warp-test still passes.

I suspect that it's sufficient to just load something dependent on
edx without using the result, but I don't see any solid evidence in
the manual that CPUs won't eliminate useless loads.  I leave scary
stuff like that to the real experts.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/tsc.c |   38 +++++++++++++++++++++++++++++---------
 1 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ffe5755..80e6017 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -767,18 +767,38 @@ static cycle_t read_tsc(struct clocksource *cs)
 static cycle_t __vsyscall_fn vread_tsc(void)
 {
 	cycle_t ret;
-
-	/*
-	 * Surround the RDTSC by barriers, to make sure it's not
-	 * speculated to outside the seqlock critical section and
-	 * does not cause time warps:
+	u64 zero, last;
+
+	/* rdtsc is unordered, and we want it to be ordered like
+	 * a load with respect to other CPUs (and we don't want
+	 * it to execute absurdly early wrt code on this CPU.)
+	 * rdtsc_barrier() is a barrier that provides this ordering
+	 * with respect to *earlier* loads.  (Which barrier to use
+	 * depends on the CPU.)
 	 */
 	rdtsc_barrier();
-	ret = (cycle_t)vget_cycles();
-	rdtsc_barrier();
 
-	return ret >= __vsyscall_gtod_data.clock.cycle_last ?
-		ret : __vsyscall_gtod_data.clock.cycle_last;
+	asm volatile ("rdtsc\n\t"
+		      "shl $0x20,%%rdx\n\t"
+		      "or %%rdx,%%rax\n\t"
+		      "shl $0x20,%%rdx"
+		      : "=a" (ret), "=d" (zero) : : "cc");
+
+	/* zero == 0, but as far as the processor is concerned, zero
+	 * depends on the output of rdtsc.  So we can use it as a
+	 * load barrier by loading something that depends on it.
+	 * x86-64 keeps all loads in order wrt each other, so this
+	 * ensures that rdtsc is ordered wrt all later loads.
+	 */
+
+	/* This doesn't multiply 'zero' by anything, which *should*
+	 * generate nicer code, except that gcc cleverly embeds the
+	 * dereference into the cmp and the cmovae.  Oh, well.
+	 */
+	last = *( (cycle_t *)
+		  ((char *)&__vsyscall_gtod_data.clock.cycle_last + zero) );
+
+	return ret >= last ? ret : last;
 }
 #endif
 
-- 
1.7.4



* [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
  2011-03-28 15:06 ` [PATCH 1/6] x86-64: Optimize vread_tsc's barriers Andy Lutomirski
@ 2011-03-28 15:06 ` Andy Lutomirski
  2011-03-29  6:15   ` Ingo Molnar
  2011-03-28 15:06 ` [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address Andy Lutomirski
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

vread_tsc checks whether rdtsc returns something less than
cycle_last, which is an extremely predictable branch.  GCC likes
to generate a cmov anyway, which is several cycles slower than
a predicted branch.  This saves a couple of nanoseconds.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/tsc.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 80e6017..a159fba 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -791,14 +791,21 @@ static cycle_t __vsyscall_fn vread_tsc(void)
 	 * ensures that rdtsc is ordered wrt all later loads.
 	 */
 
-	/* This doesn't multiply 'zero' by anything, which *should*
-	 * generate nicer code, except that gcc cleverly embeds the
-	 * dereference into the cmp and the cmovae.  Oh, well.
+	/* This doesn't multiply 'zero' by anything, which generates
+	 * very slightly nicer code than multiplying it by 8.
 	 */
 	last = *( (cycle_t *)
 		  ((char *)&__vsyscall_gtod_data.clock.cycle_last + zero) );
 
-	return ret >= last ? ret : last;
+	if (likely(ret >= last))
+		return ret;
+
+	/* GCC likes to generate cmov here, but this branch is extremely
+	   predictable (it's just a function of time and the likely is
+	   very likely) and there's a data dependence, so force GCC
+	   to generate a branch instead. */
+	asm volatile ("");
+	return last;
 }
 #endif
 
-- 
1.7.4



* [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
  2011-03-28 15:06 ` [PATCH 1/6] x86-64: Optimize vread_tsc's barriers Andy Lutomirski
  2011-03-28 15:06 ` [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc Andy Lutomirski
@ 2011-03-28 15:06 ` Andy Lutomirski
  2011-03-28 17:49   ` Thomas Gleixner
  2011-03-28 15:06 ` [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

This generates slightly better code because vclock_gettime doesn't
need to load the address.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/kernel/vmlinux.lds.S |   13 +++++++------
 arch/x86/vdso/vextern.h       |    9 ++++++++-
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index bf47007..1f545ff 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -174,12 +174,6 @@ SECTIONS
 		*(.vsyscall_fn)
 	}
 
-	. = ALIGN(L1_CACHE_BYTES);
-	.vsyscall_gtod_data : AT(VLOAD(.vsyscall_gtod_data)) {
-		*(.vsyscall_gtod_data)
-	}
-
-	vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data);
 	.vsyscall_clock : AT(VLOAD(.vsyscall_clock)) {
 		*(.vsyscall_clock)
 	}
@@ -208,6 +202,13 @@ SECTIONS
 		*(.vsyscall_3)
 	}
 
+	 /* This address must be kept in sync with arch/x86/vdso/vextern.h */
+	.vsyscall_gtod_data ADDR(.vsyscall_0) + 3072 + 256:
+	                    AT(VLOAD(.vsyscall_gtod_data)) {
+		*(.vsyscall_gtod_data)
+	}
+	vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data);
+
 	. = __vsyscall_0 + PAGE_SIZE;
 
 #undef VSYSCALL_ADDR
diff --git a/arch/x86/vdso/vextern.h b/arch/x86/vdso/vextern.h
index 1683ba2..5bfa77c 100644
--- a/arch/x86/vdso/vextern.h
+++ b/arch/x86/vdso/vextern.h
@@ -2,6 +2,11 @@
 #include <asm/vsyscall.h>
 #define VEXTERN(x) \
 	extern typeof(x) *vdso_ ## x __attribute__((visibility("hidden")));
+#define VEXTERN_FIXED(x, offset) \
+	static typeof(x) * const vdso_ ## x = \
+		(void *)(VSYSCALL_START + (offset));
+#else
+#define VEXTERN_FIXED(x, offset)
 #endif
 
 #define VMAGIC 0xfeedbabeabcdefabUL
@@ -13,4 +18,6 @@
 
 VEXTERN(jiffies)
 VEXTERN(vgetcpu_mode)
-VEXTERN(vsyscall_gtod_data)
+VEXTERN_FIXED(vsyscall_gtod_data, 3072 + 256)  /* Hard-coded in linker script */
+
+#undef VEXTERN_FIXED
-- 
1.7.4



* [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
                   ` (2 preceding siblings ...)
  2011-03-28 15:06 ` [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address Andy Lutomirski
@ 2011-03-28 15:06 ` Andy Lutomirski
  2011-03-29  6:21   ` Ingo Molnar
  2011-03-28 15:06 ` [PATCH 5/6] x86-64: Omit frame pointers on vread_tsc Andy Lutomirski
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

vclock_gettime's do_monotonic helper can't ever generate a negative
nsec value, so it doesn't need to check whether it's negative.  This
saves a single easily-predicted branch.
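
The open-coded normalization amounts to the following sketch (names
invented here): since ns is known to be nonnegative, only the carry
into seconds is needed, and the ns < 0 loop from the generic helper
is dropped entirely:

```c
#define NSEC_PER_SEC 1000000000L

/* Normalize a (secs, ns) pair where ns is known to be nonnegative,
 * as in do_monotonic(). */
static void normalize_nonneg(long *secs, long *ns)
{
	while (*ns >= NSEC_PER_SEC) {
		*ns -= NSEC_PER_SEC;
		++*secs;
	}
}
```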

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---
 arch/x86/vdso/vclock_gettime.c |   40 ++++++++++++++++++++++------------------
 1 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index ee55754..67d54bb 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -56,22 +56,6 @@ notrace static noinline int do_realtime(struct timespec *ts)
 	return 0;
 }
 
-/* Copy of the version in kernel/time.c which we cannot directly access */
-notrace static void
-vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
-{
-	while (nsec >= NSEC_PER_SEC) {
-		nsec -= NSEC_PER_SEC;
-		++sec;
-	}
-	while (nsec < 0) {
-		nsec += NSEC_PER_SEC;
-		--sec;
-	}
-	ts->tv_sec = sec;
-	ts->tv_nsec = nsec;
-}
-
 notrace static noinline int do_monotonic(struct timespec *ts)
 {
 	unsigned long seq, ns, secs;
@@ -82,7 +66,17 @@ notrace static noinline int do_monotonic(struct timespec *ts)
 		secs += gtod->wall_to_monotonic.tv_sec;
 		ns += gtod->wall_to_monotonic.tv_nsec;
 	} while (unlikely(read_seqretry(&gtod->lock, seq)));
-	vset_normalized_timespec(ts, secs, ns);
+
+	/* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
+	 * are all guaranteed to be nonnegative.
+	 */
+	while (ns >= NSEC_PER_SEC) {
+		ns -= NSEC_PER_SEC;
+		++secs;
+	}
+	ts->tv_sec = secs;
+	ts->tv_nsec = ns;
+
 	return 0;
 }
 
@@ -107,7 +101,17 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
 		secs += gtod->wall_to_monotonic.tv_sec;
 		ns += gtod->wall_to_monotonic.tv_nsec;
 	} while (unlikely(read_seqretry(&gtod->lock, seq)));
-	vset_normalized_timespec(ts, secs, ns);
+
+	/* wall_time_nsec and wall_to_monotonic.tv_nsec are
+	 * guaranteed to be between 0 and NSEC_PER_SEC.
+	 */
+	if (ns >= NSEC_PER_SEC) {
+		ns -= NSEC_PER_SEC;
+		++secs;
+	}
+	ts->tv_sec = secs;
+	ts->tv_nsec = ns;
+
 	return 0;
 }
 
-- 
1.7.4



* [PATCH 5/6] x86-64: Omit frame pointers on vread_tsc
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
                   ` (3 preceding siblings ...)
  2011-03-28 15:06 ` [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
@ 2011-03-28 15:06 ` Andy Lutomirski
  2011-03-29  6:24   ` Ingo Molnar
  2011-03-28 15:06 ` [PATCH 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO Andy Lutomirski
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

vread_tsc is short and hot, and it's userspace code so the usual
reasons to keep frame pointers around don't apply.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---

It might be nicer to move vread_tsc into a separate file and change
the options directly.

 arch/x86/kernel/tsc.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a159fba..1aea4a6 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -764,6 +764,8 @@ static cycle_t read_tsc(struct clocksource *cs)
 }
 
 #ifdef CONFIG_X86_64
+#pragma GCC push_options
+#pragma GCC optimize ("omit-frame-pointer")
 static cycle_t __vsyscall_fn vread_tsc(void)
 {
 	cycle_t ret;
@@ -807,6 +809,7 @@ static cycle_t __vsyscall_fn vread_tsc(void)
 	asm volatile ("");
 	return last;
 }
+#pragma GCC pop_options
 #endif
 
 static void resume_tsc(struct clocksource *cs)
-- 
1.7.4



* [PATCH 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
                   ` (4 preceding siblings ...)
  2011-03-28 15:06 ` [PATCH 5/6] x86-64: Omit frame pointers on vread_tsc Andy Lutomirski
@ 2011-03-28 15:06 ` Andy Lutomirski
  2011-03-29  6:27 ` [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Ingo Molnar
  2011-04-06 18:20 ` Andi Kleen
  7 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2011-03-28 15:06 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, John Stultz, Thomas Gleixner, Andy Lutomirski

The vDSO isn't part of the kernel, so profiling and kernel
backtraces don't really matter.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---

This might need a cc-option check for -foptimize-sibling-calls.

 arch/x86/vdso/Makefile |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/arch/x86/vdso/Makefile b/arch/x86/vdso/Makefile
index b6552b1..7b6df84 100644
--- a/arch/x86/vdso/Makefile
+++ b/arch/x86/vdso/Makefile
@@ -37,11 +37,24 @@ $(obj)/%.so: OBJCOPYFLAGS := -S
 $(obj)/%.so: $(obj)/%.so.dbg FORCE
 	$(call if_changed,objcopy)
 
+#
+# Don't omit frame pointers for ease of userspace debugging, but do
+# optimize sibling calls.
+#
 CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
-       $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector)
+       $(filter -g%,$(KBUILD_CFLAGS)) $(call cc-option, -fno-stack-protector) \
+       -fno-omit-frame-pointer -foptimize-sibling-calls
 
 $(vobjs): KBUILD_CFLAGS += $(CFL)
 
+#
+# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
+#
+CFLAGS_REMOVE_vdso-note.o = -pg
+CFLAGS_REMOVE_vclock_gettime.o = -pg
+CFLAGS_REMOVE_vgetcpu.o = -pg
+CFLAGS_REMOVE_vvar.o = -pg
+
 targets += vdso-syms.lds
 obj-$(VDSO64-y)			+= vdso-syms.lds
 
-- 
1.7.4



* Re: [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address
  2011-03-28 15:06 ` [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address Andy Lutomirski
@ 2011-03-28 17:49   ` Thomas Gleixner
  2011-03-28 18:09     ` Andrew Lutomirski
  2011-03-28 21:35     ` Andrew Lutomirski
  0 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2011-03-28 17:49 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz

On Mon, 28 Mar 2011, Andy Lutomirski wrote:
>  
> +	 /* This address must be kept in sync with arch/x86/vdso/vextern.h */

This is the worst idea ever. Breakage is lurking by making something:
Must be kept in sync ...

> +	.vsyscall_gtod_data ADDR(.vsyscall_0) + 3072 + 256:

And this magic constant is what ? Something which will blow up sooner
than later.

NAK.

Thanks,

	tglx


* Re: [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address
  2011-03-28 17:49   ` Thomas Gleixner
@ 2011-03-28 18:09     ` Andrew Lutomirski
  2011-03-28 21:35     ` Andrew Lutomirski
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Lutomirski @ 2011-03-28 18:09 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: x86, linux-kernel, John Stultz

On Mon, Mar 28, 2011 at 1:49 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 28 Mar 2011, Andy Lutomirski wrote:
>>
>> +      /* This address must be kept in sync with arch/x86/vdso/vextern.h */
>
> This is the worst idea ever. Breakage is lurking by making something:
> Must be kept in sync ...
>
>> +     .vsyscall_gtod_data ADDR(.vsyscall_0) + 3072 + 256:
>
> And this magic constant is what ? Something which will blow up sooner
> than later.

Fair enough.  The constant is pretty much arbitrary, except that it
has to be the same in both places.

Would you object less if I added a new header that just contained
offsets within the vsyscall page to clock, vgetcpu_mode, and
gtod_data?  That way I could remove the entire VMAGIC mechanism,
everything gets faster, and there's nothing to be kept in sync.
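
Such a header might look like the following.  This is a purely
hypothetical sketch to make the proposal concrete; the names are
invented, and VSYSCALL_START below assumes the fixed x86-64 vsyscall
base address:

```c
#include <stdint.h>

/* Hypothetical single source of truth: the linker script and the
 * vsyscall/vDSO code would both derive addresses from these offsets,
 * so nothing needs to be kept in sync by hand. */
#define VSYSCALL_START			0xffffffffff600000UL
#define VSYSCALL_GTOD_DATA_OFFSET	(3072 + 256)

static inline uintptr_t vsyscall_gtod_data_addr(void)
{
	return VSYSCALL_START + VSYSCALL_GTOD_DATA_OFFSET;
}
```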

I just checked and the linker will error out if any of these data
structures get too large to fit in their assigned places.  It looks
like this:

ld: section .vsyscall_1 loaded at [00000000018ba400,00000000018ba43d]
overlaps section .vsyscall_gtod_data loaded at
[00000000018ba140,00000000018ba5d7]
ld: .tmp_vmlinux1: section .vsyscall_1 vma 0xffffffffff600400 overlaps
previous sections

(That's on an unmodified kernel, but the same thing would happen with
hardcoded offsets.)

--Andy


* Re: [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address
  2011-03-28 17:49   ` Thomas Gleixner
  2011-03-28 18:09     ` Andrew Lutomirski
@ 2011-03-28 21:35     ` Andrew Lutomirski
  2011-03-28 23:13       ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Lutomirski @ 2011-03-28 21:35 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: x86, linux-kernel, John Stultz

On Mon, Mar 28, 2011 at 1:49 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> NAK.

Consider this one withdrawn for now.  I'll redo this patch later on
and resubmit it on its own.

Any comments on the rest of the series?  They're all independent
changes, other than touching the same code.

--Andy

>
> Thanks,
>
>        tglx
>


* Re: [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address
  2011-03-28 21:35     ` Andrew Lutomirski
@ 2011-03-28 23:13       ` Thomas Gleixner
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2011-03-28 23:13 UTC (permalink / raw)
  To: Andrew Lutomirski; +Cc: x86, linux-kernel, John Stultz

On Mon, 28 Mar 2011, Andrew Lutomirski wrote:

> On Mon, Mar 28, 2011 at 1:49 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > NAK.
> 
> Consider this one withdrawn for now.  I'll redo this patch later on
> and resubmit it on its own.
> 
> Any comments on the rest of the series?  They're all independent
> changes, other than touching the same code.

Not at the moment. I'm swamped, sorry. Will have a look later.

Thanks,

	tglx


* Re: [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc
  2011-03-28 15:06 ` [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc Andy Lutomirski
@ 2011-03-29  6:15   ` Ingo Molnar
  2011-03-29 11:52     ` Andrew Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2011-03-29  6:15 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner


* Andy Lutomirski <luto@MIT.EDU> wrote:

> -	/* This doesn't multiply 'zero' by anything, which *should*
> -	 * generate nicer code, except that gcc cleverly embeds the
> -	 * dereference into the cmp and the cmovae.  Oh, well.
> +	/* This doesn't multiply 'zero' by anything, which generates
> +	 * very slightly nicer code than multiplying it by 8.
>  	 */
>  	last = *( (cycle_t *)
>  		  ((char *)&__vsyscall_gtod_data.clock.cycle_last + zero) );
>  
> -	return ret >= last ? ret : last;
> +	if (likely(ret >= last))
> +		return ret;
> +
> +	/* GCC likes to generate cmov here, but this branch is extremely
> +	   predictable (it's just a function of time and the likely is
> +	   very likely) and there's a data dependence, so force GCC
> +	   to generate a branch instead. */
> +	asm volatile ("");

barrier() would do the same, right?

Also, a nit, please use the customary (multi-line) comment style:

  /*
   * Comment .....
   * ...... goes here.
   */

specified in Documentation/CodingStyle.

Thanks,

        Ingo


* Re: [PATCH 1/6] x86-64: Optimize vread_tsc's barriers
  2011-03-28 15:06 ` [PATCH 1/6] x86-64: Optimize vread_tsc's barriers Andy Lutomirski
@ 2011-03-29  6:18   ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2011-03-29  6:18 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner


* Andy Lutomirski <luto@MIT.EDU> wrote:

> @@ -767,18 +767,38 @@ static cycle_t read_tsc(struct clocksource *cs)
>  static cycle_t __vsyscall_fn vread_tsc(void)
>  {
>  	cycle_t ret;
> -
> -	/*
> -	 * Surround the RDTSC by barriers, to make sure it's not
> -	 * speculated to outside the seqlock critical section and
> -	 * does not cause time warps:
> +	u64 zero, last;
> +
> +	/* rdtsc is unordered, and we want it to be ordered like
> +	 * a load with respect to other CPUs (and we don't want
> +	 * it to execute absurdly early wrt code on this CPU.)
> +	 * rdtsc_barrier() is a barrier that provides this ordering
> +	 * with respect to *earlier* loads.  (Which barrier to use
> +	 * depends on the CPU.)
>  	 */
>  	rdtsc_barrier();
> -	ret = (cycle_t)vget_cycles();
> -	rdtsc_barrier();
>  
> -	return ret >= __vsyscall_gtod_data.clock.cycle_last ?
> -		ret : __vsyscall_gtod_data.clock.cycle_last;
> +	asm volatile ("rdtsc\n\t"
> +		      "shl $0x20,%%rdx\n\t"
> +		      "or %%rdx,%%rax\n\t"
> +		      "shl $0x20,%%rdx"
> +		      : "=a" (ret), "=d" (zero) : : "cc");
> +
> +	/* zero == 0, but as far as the processor is concerned, zero
> +	 * depends on the output of rdtsc.  So we can use it as a
> +	 * load barrier by loading something that depends on it.
> +	 * x86-64 keeps all loads in order wrt each other, so this
> +	 * ensures that rdtsc is ordered wrt all later loads.
> +	 */
> +
> +	/* This doesn't multiply 'zero' by anything, which *should*
> +	 * generate nicer code, except that gcc cleverly embeds the
> +	 * dereference into the cmp and the cmovae.  Oh, well.
> +	 */
> +	last = *( (cycle_t *)
> +		  ((char *)&__vsyscall_gtod_data.clock.cycle_last + zero) );
> +
> +	return ret >= last ? ret : last;

Looks like GCC hurts performance here more than it helps. Have you considered 
putting the whole function into assembly in a .S file? How maintainable does it 
look like?

Thanks,

	Ingo


* Re: [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
  2011-03-28 15:06 ` [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
@ 2011-03-29  6:21   ` Ingo Molnar
  2011-03-29 11:54     ` Andrew Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2011-03-29  6:21 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner


* Andy Lutomirski <luto@MIT.EDU> wrote:

> vclock_gettime's do_monotonic helper can't ever generate a negative
> nsec value, so it doesn't need to check whether it's negative.  This
> saves a single easily-predicted branch.
> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> ---
>  arch/x86/vdso/vclock_gettime.c |   40 ++++++++++++++++++++++------------------
>  1 files changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> index ee55754..67d54bb 100644
> --- a/arch/x86/vdso/vclock_gettime.c
> +++ b/arch/x86/vdso/vclock_gettime.c
> @@ -56,22 +56,6 @@ notrace static noinline int do_realtime(struct timespec *ts)
>  	return 0;
>  }
>  
> -/* Copy of the version in kernel/time.c which we cannot directly access */
> -notrace static void
> -vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
> -{
> -	while (nsec >= NSEC_PER_SEC) {
> -		nsec -= NSEC_PER_SEC;
> -		++sec;
> -	}
> -	while (nsec < 0) {
> -		nsec += NSEC_PER_SEC;
> -		--sec;
> -	}
> -	ts->tv_sec = sec;
> -	ts->tv_nsec = nsec;
> -}
> -
>  notrace static noinline int do_monotonic(struct timespec *ts)
>  {
>  	unsigned long seq, ns, secs;
> @@ -82,7 +66,17 @@ notrace static noinline int do_monotonic(struct timespec *ts)
>  		secs += gtod->wall_to_monotonic.tv_sec;
>  		ns += gtod->wall_to_monotonic.tv_nsec;
>  	} while (unlikely(read_seqretry(&gtod->lock, seq)));
> -	vset_normalized_timespec(ts, secs, ns);
> +
> +	/* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
> +	 * are all guaranteed to be nonnegative.
> +	 */
> +	while (ns >= NSEC_PER_SEC) {
> +		ns -= NSEC_PER_SEC;
> +		++secs;
> +	}
> +	ts->tv_sec = secs;
> +	ts->tv_nsec = ns;
> +
>  	return 0;
>  }
>  
> @@ -107,7 +101,17 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
>  		secs += gtod->wall_to_monotonic.tv_sec;
>  		ns += gtod->wall_to_monotonic.tv_nsec;
>  	} while (unlikely(read_seqretry(&gtod->lock, seq)));
> -	vset_normalized_timespec(ts, secs, ns);
> +
> +	/* wall_time_nsec and wall_to_monotonic.tv_nsec are
> +	 * guaranteed to be between 0 and NSEC_PER_SEC.
> +	 */
> +	if (ns >= NSEC_PER_SEC) {
> +		ns -= NSEC_PER_SEC;
> +		++secs;
> +	}
> +	ts->tv_sec = secs;
> +	ts->tv_nsec = ns;
> +
>  	return 0;

Beyond the change you describe in the changelog, you also uninlined the helper 
function.

You can use __always_inline instead and still keep the code maintainable.

Thanks,

	Ingo


* Re: [PATCH 5/6] x86-64: Omit frame pointers on vread_tsc
  2011-03-28 15:06 ` [PATCH 5/6] x86-64: Omit frame pointers on vread_tsc Andy Lutomirski
@ 2011-03-29  6:24   ` Ingo Molnar
  0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2011-03-29  6:24 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner


* Andy Lutomirski <luto@MIT.EDU> wrote:

> vread_tsc is short and hot, and it's userspace code so the usual
> reasons to keep frame pointers around don't apply.
> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> ---
> 
> It might be nicer to move vread_tsc into a separate file and change
> the options directly.
> 
>  arch/x86/kernel/tsc.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a159fba..1aea4a6 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -764,6 +764,8 @@ static cycle_t read_tsc(struct clocksource *cs)
>  }
>  
>  #ifdef CONFIG_X86_64
> +#pragma GCC push_options
> +#pragma GCC optimize ("omit-frame-pointer")
>  static cycle_t __vsyscall_fn vread_tsc(void)
>  {
>  	cycle_t ret;
> @@ -807,6 +809,7 @@ static cycle_t __vsyscall_fn vread_tsc(void)
>  	asm volatile ("");
>  	return last;
>  }
> +#pragma GCC pop_options
>  #endif

The cleaner option would be to put the __vsyscall functions into a new .c file 
and build it with framepointers off in the Makefile, which can be done via:

CFLAGS_vtsc.o := -fomit-frame-pointer

or so.

We could also consider putting this function into a .S too. Not sure it works 
out in a maintainable fashion though.

Thanks,

	Ingo


* Re: [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
                   ` (5 preceding siblings ...)
  2011-03-28 15:06 ` [PATCH 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO Andy Lutomirski
@ 2011-03-29  6:27 ` Ingo Molnar
  2011-04-06 18:20 ` Andi Kleen
  7 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2011-03-29  6:27 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner


* Andy Lutomirski <luto@mit.edu> wrote:

> This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
> (tested on Sandy Bridge).  They're ordered in roughly decreasing order
> of improvement.
> 
> These are meant for 2.6.40, but if anyone wants to take some of them
> for 2.6.39 I won't object.
> 
> The changes and timings (fastest of 20 trials of 100M iters on Sandy
> Bridge) are:
> 
> Unpatched:
> 
> CLOCK_MONOTONIC: 22.09ns
> CLOCK_REALTIME_COARSE: 4.23ns
> CLOCK_MONOTONIC_COARSE: 5.65ns

  [ Patched: ]

> CLOCK_MONOTONIC: 15.66ns
> CLOCK_REALTIME_COARSE: 3.44ns
> CLOCK_MONOTONIC_COARSE: 4.23ns

That looks like a nice speedup, and speedups are definitely welcome. Many of 
your speedup patches were framed in an unclean or hard-to-maintain fashion 
though, so those have to be improved for this series to become palatable.

Thanks,

	Ingo


* Re: [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc
  2011-03-29  6:15   ` Ingo Molnar
@ 2011-03-29 11:52     ` Andrew Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Lutomirski @ 2011-03-29 11:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner

On Tue, Mar 29, 2011 at 2:15 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andy Lutomirski <luto@MIT.EDU> wrote:
>
>> -     /* This doesn't multiply 'zero' by anything, which *should*
>> -      * generate nicer code, except that gcc cleverly embeds the
>> -      * dereference into the cmp and the cmovae.  Oh, well.
>> +     /* This doesn't multiply 'zero' by anything, which generates
>> +      * very slightly nicer code than multiplying it by 8.
>>        */
>>       last = *( (cycle_t *)
>>                 ((char *)&__vsyscall_gtod_data.clock.cycle_last + zero) );
>>
>> -     return ret >= last ? ret : last;
>> +     if (likely(ret >= last))
>> +             return ret;
>> +
>> +     /* GCC likes to generate cmov here, but this branch is extremely
>> +        predictable (it's just a function of time and the likely is
>> +        very likely) and there's a data dependence, so force GCC
>> +        to generate a branch instead. */
>> +     asm volatile ("");
>
> barrier() would do the same, right?

Yes.  It's overkill (the memory clobber is unnecessary) but should be harmless.

I'll take a look at folding this and [1/6] together and sticking them
in a .S.  The downside is that rdtsc_barrier() expands to two
alternatives, and it has other callers.  I'll see what it looks like.

>
> Also, a nit, please use the customary (multi-line) comment style:
>
>  /*
>   * Comment .....
>   * ...... goes here.
>   */
>
> specified in Documentation/CodingStyle.

I was hoping checkpatch would warn me :)

--Andy

>
> Thanks,
>
>        Ingo
>


* Re: [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
  2011-03-29  6:21   ` Ingo Molnar
@ 2011-03-29 11:54     ` Andrew Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Lutomirski @ 2011-03-29 11:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner

On Tue, Mar 29, 2011 at 2:21 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andy Lutomirski <luto@MIT.EDU> wrote:
>
>> vclock_gettime's do_monotonic helper can't ever generate a negative
>> nsec value, so it doesn't need to check whether it's negative.  This
>> saves a single easily-predicted branch.
>>
>> Signed-off-by: Andy Lutomirski <luto@mit.edu>
>> ---
>>  arch/x86/vdso/vclock_gettime.c |   40 ++++++++++++++++++++++------------------
>>  1 files changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> index ee55754..67d54bb 100644
>> --- a/arch/x86/vdso/vclock_gettime.c
>> +++ b/arch/x86/vdso/vclock_gettime.c
>> @@ -56,22 +56,6 @@ notrace static noinline int do_realtime(struct timespec *ts)
>>       return 0;
>>  }
>>
>> -/* Copy of the version in kernel/time.c which we cannot directly access */
>> -notrace static void
>> -vset_normalized_timespec(struct timespec *ts, long sec, long nsec)
>> -{
>> -     while (nsec >= NSEC_PER_SEC) {
>> -             nsec -= NSEC_PER_SEC;
>> -             ++sec;
>> -     }
>> -     while (nsec < 0) {
>> -             nsec += NSEC_PER_SEC;
>> -             --sec;
>> -     }
>> -     ts->tv_sec = sec;
>> -     ts->tv_nsec = nsec;
>> -}
>> -
>>  notrace static noinline int do_monotonic(struct timespec *ts)
>>  {
>>       unsigned long seq, ns, secs;
>> @@ -82,7 +66,17 @@ notrace static noinline int do_monotonic(struct timespec *ts)
>>               secs += gtod->wall_to_monotonic.tv_sec;
>>               ns += gtod->wall_to_monotonic.tv_nsec;
>>       } while (unlikely(read_seqretry(&gtod->lock, seq)));
>> -     vset_normalized_timespec(ts, secs, ns);
>> +
>> +     /* wall_time_nsec, vgetns(), and wall_to_monotonic.tv_nsec
>> +      * are all guaranteed to be nonnegative.
>> +      */
>> +     while (ns >= NSEC_PER_SEC) {
>> +             ns -= NSEC_PER_SEC;
>> +             ++secs;
>> +     }
>> +     ts->tv_sec = secs;
>> +     ts->tv_nsec = ns;
>> +
>>       return 0;
>>  }
>>
>> @@ -107,7 +101,17 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
>>               secs += gtod->wall_to_monotonic.tv_sec;
>>               ns += gtod->wall_to_monotonic.tv_nsec;
>>       } while (unlikely(read_seqretry(&gtod->lock, seq)));
>> -     vset_normalized_timespec(ts, secs, ns);
>> +
>> +     /* wall_time_nsec and wall_to_monotonic.tv_nsec are
>> +      * guaranteed to be between 0 and NSEC_PER_SEC.
>> +      */
>> +     if (ns >= NSEC_PER_SEC) {
>> +             ns -= NSEC_PER_SEC;
>> +             ++secs;
>> +     }
>> +     ts->tv_sec = secs;
>> +     ts->tv_nsec = ns;
>> +
>>       return 0;
>
> Beyond the change you describe in the changelog, you also uninlined the helper
> function.
>
> You can use __always_inline instead and still keep the code maintainable.
>

I think I should fix the changelog instead.  The two copies are
different (one uses a while loop because nsec > 2e9 is possible, and
the other uses an if because nsec < 2e9 always holds).

--Andy


* Re: [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
  2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
                   ` (6 preceding siblings ...)
  2011-03-29  6:27 ` [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Ingo Molnar
@ 2011-04-06 18:20 ` Andi Kleen
  2011-04-06 20:10   ` Andrew Lutomirski
  7 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2011-04-06 18:20 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner

Andy Lutomirski <luto@MIT.EDU> writes:

> This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
> (tested on Sandy Bridge).  They're ordered in roughly decreasing order
> of improvement.
>
> These are meant for 2.6.40, but if anyone wants to take some of them
> for 2.6.39 I won't object.

I read the whole patchkit and it looks good to me.  I felt a bit uneasy
about the barrier changes though; it may be worth running the
paranoid "check monotonicity on lots of CPUs" test cases to double-check
on different CPUs.  The interesting cases are P4 Prescott, Merom
(Core 2 Duo), and AMD K8.

Thanks for doing these optimizations again. Before generic clock source
these functions used to be somewhat faster, but they regressed
significantly back then. It may be worth comparing the current
asm code against the old code to see if there's still something
obvious missing.

Possible further optimizations, if you're still motivated:

- Move all the timer state/seqlock into one cache line and start
with a prefetch.
I made a similar attempt recently for the in-kernel timers.
You won't see any difference in a microbenchmark loop, but you may
in a workload that dirties lots of cache between timer calls.

- Replace the indirect call in vread() with an
"if (timer == TSC) inlined_call() else indirect_call()"
(manual devirtualization, essentially)

- Replace the sysctl checks with code patching, using the new
static branch framework

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
  2011-04-06 18:20 ` Andi Kleen
@ 2011-04-06 20:10   ` Andrew Lutomirski
  2011-04-06 20:14     ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Lutomirski @ 2011-04-06 20:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner

On Wed, Apr 6, 2011 at 2:20 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Andy Lutomirski <luto@MIT.EDU> writes:
>
>> This series speeds up vclock_gettime(CLOCK_MONOTONIC) by almost 30%
>> (tested on Sandy Bridge).  They're ordered in roughly decreasing order
>> of improvement.
>>
>> These are meant for 2.6.40, but if anyone wants to take some of them
>> for 2.6.39 I won't object.
>
> I read all the patchkit and it looks good to me.  I felt a bit uneasy
> about the barrier changes though, it may be worth running of the
> paranoid "check monotonicity on lots of cpus" test cases to double check
> on different CPUs.  The interesting cases are: P4-Prescott, Merom
> (C2Duo), AMD K8.

I ran Ingo's time-warp-test w/ 6, 7, and 8 threads on Sandy Bridge and
on a Xeon 5600 series chip.  My C2D laptop thinks that its TSC halts
in idle and my only AMD system has unsynchronized TSCs.

I couldn't make it fail at all, even without the barrier trick after
the rdtsc, probably because after the lfence the rdtsc runs pretty much
immediately in practice.

>
> Thanks for doing these optimizations again. Before generic clock source
> these functions used to be somewhat faster, but they regressed
> significantly back then. It may be worth comparing the current
> asm code against these old code and see if there's still something
> obvious missing.
>
> Possible more optimizations if you're still motivated:
>
> - Move all the timer state/seqlock into one cache line and start
> with a prefetch.
> I did a similar attempt recently for the in kernel timers.
> You won't see any difference in a micro benchmark loop, but you may
> in a workload that dirties lots of cache between timer calls.

For CLOCK_REALTIME they're already in one cache line.  I tried the
prefetch and couldn't measure a speedup even after playing with
clflush for a bit.

For CLOCK_MONOTONIC, there's an obvious speedup available, though:
pre-add the offset.

>
> - Replace the indirect call in vread() with a if ( timer == TSC)
> inline() else indirect_call
> (manual devirtualization essentially)
>
> - Replace the sysctl checks with code patching use the new
> static branch frameworks

Agreed.  In fact, I could do both in one fell swoop: have a flag for
the mode and have one option be "just issue the syscall."  Static
branch stuff scares me because this stuff runs in userspace and, in
theory, userspace might have COWed the page with this code in it.

v2 of the patchkit coming in a few days.  It'll be a lot cleaner, I
think, although it'll generate the same code.  I'll play with further
optimizations after this stuff gets merged somewhere if I'm motivated
enough.

The other thing to do down the line is to fix the in-kernel implementation.

--Andy


* Re: [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
  2011-04-06 20:10   ` Andrew Lutomirski
@ 2011-04-06 20:14     ` Andi Kleen
  2011-04-06 20:49       ` Andrew Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2011-04-06 20:14 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, John Stultz, Thomas Gleixner

On Wed, Apr 06, 2011 at 04:10:22PM -0400, Andrew Lutomirski wrote:
> I ran Ingo's time-warp-test w/ 6, 7, and 8 threads on Sandy Bridge and
> on a Xeon 5600 series chip.  My C2D laptop thinks that its TSC halts
> in idle and my only AMD system has unsynchronized TSCs.

I think you should have coverage on more systems. The original
problems that motivated the barriers were on older K8 AMD systems.

You can ask people on l-k to run such tests for you if you don't
have the hardware.

> > I did a similar attempt recently for the in kernel timers.
> > You won't see any difference in a micro benchmark loop, but you may
> > in a workload that dirties lots of cache between timer calls.
> 
> For CLOCK_REALTIME they're already in one cache line.  I tried the
> prefetch and couldn't measure a speedup even after playing with

Did you run a cache pig between the calls? With a tight loop it's obviously
useless.

> Agreed.  In fact, I could do both in one fell swoop: have a flag for
> the mode and have one option be "just issue the syscall."  Static
> branch stuff scares me because this stuff runs in userspace and, in
> theory, userspace might have COWed the page with this code in it.

The vDSO is never COWed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 0/6] x86-64: Micro-optimize vclock_gettime
  2011-04-06 20:14     ` Andi Kleen
@ 2011-04-06 20:49       ` Andrew Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Lutomirski @ 2011-04-06 20:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, John Stultz, Thomas Gleixner

On Wed, Apr 6, 2011 at 4:14 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Apr 06, 2011 at 04:10:22PM -0400, Andrew Lutomirski wrote:
>> I ran Ingo's time-warp-test w/ 6, 7, and 8 threads on Sandy Bridge and
>> on a Xeon 5600 series chip.  My C2D laptop thinks that its TSC halts
>> in idle and my only AMD system has unsynchronized TSCs.
>
> I think you should have coverage on more systems. The original
> problems that motivated the barriers were on older K8 AMD systems.
>
> You can ask people on l-k to run such tests for you if you don't
> have the hardware.

Will do, once v2 is ready.

>
>> > I did a similar attempt recently for the in kernel timers.
>> > You won't see any difference in a micro benchmark loop, but you may
>> > in a workload that dirties lots of cache between timer calls.
>>
>> For CLOCK_REALTIME they're already in one cache line.  I tried the
>> prefetch and couldn't measure a speedup even after playing with
>
> Did you run a cache pig between the calls? With a tight loop it's obviously
> useless.

No, but I clflushed the cache line with vsyscall_gtod_data in it.  I
think the result was just too noisy.

>
>> Agreed.  In fact, I could do both in one fell swoop: have a flag for
>> the mode and have one option be "just issue the syscall."  Static
>> branch stuff scares me because this stuff runs in userspace and, in
>> theory, userspace might have COWed the page with this code in it.
>
> The vdso is never cowed.

This program successfully gets SIGTRAP.  I assume COW is involved.

#include <dlfcn.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main()
{
  volatile char *vclock_gettime;
  struct timespec t;

  void *vdso = dlopen("linux-vdso.so.1", RTLD_NOW | RTLD_NOLOAD);
  if (!vdso) {
    fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }
  vclock_gettime = dlsym(vdso, "clock_gettime");
  if (!vclock_gettime) {
    fprintf(stderr, "dlsym: %s\n", dlerror());
    return 1;
  }

  if (mprotect((void*)((unsigned long)vclock_gettime & ~0xFFFUL),
	       4096, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
    perror("mprotect");
    return 1;
  }

  *vclock_gettime = 0xcc;  /* breakpoint */
  clock_gettime(CLOCK_MONOTONIC, &t);

  return 0;
}

--Andy


end of thread, other threads:[~2011-04-06 20:49 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-28 15:06 [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Andy Lutomirski
2011-03-28 15:06 ` [PATCH 1/6] x86-64: Optimize vread_tsc's barriers Andy Lutomirski
2011-03-29  6:18   ` Ingo Molnar
2011-03-28 15:06 ` [PATCH 2/6] x86-64: Don't generate cmov in vread_tsc Andy Lutomirski
2011-03-29  6:15   ` Ingo Molnar
2011-03-29 11:52     ` Andrew Lutomirski
2011-03-28 15:06 ` [PATCH 3/6] x86-64: Put vsyscall_gtod_data at a fixed virtual address Andy Lutomirski
2011-03-28 17:49   ` Thomas Gleixner
2011-03-28 18:09     ` Andrew Lutomirski
2011-03-28 21:35     ` Andrew Lutomirski
2011-03-28 23:13       ` Thomas Gleixner
2011-03-28 15:06 ` [PATCH 4/6] x86-64: vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0 Andy Lutomirski
2011-03-29  6:21   ` Ingo Molnar
2011-03-29 11:54     ` Andrew Lutomirski
2011-03-28 15:06 ` [PATCH 5/6] x86-64: Omit frame pointers on vread_tsc Andy Lutomirski
2011-03-29  6:24   ` Ingo Molnar
2011-03-28 15:06 ` [PATCH 6/6] x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO Andy Lutomirski
2011-03-29  6:27 ` [PATCH 0/6] x86-64: Micro-optimize vclock_gettime Ingo Molnar
2011-04-06 18:20 ` Andi Kleen
2011-04-06 20:10   ` Andrew Lutomirski
2011-04-06 20:14     ` Andi Kleen
2011-04-06 20:49       ` Andrew Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox