[PATCH] optimize writer path in time_interpolator_get

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] optimize writer path in time_interpolator_get_counter()
@ 2005-08-01 15:52 Alex Williamson
  2005-08-01 16:06 ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2005-08-01 15:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tony.luck, clameter


   When using a time interpolator that is susceptible to jitter there's
potentially contention over a cmpxchg used to prevent time from going
backwards.  This is unnecessary when the caller holds the xtime write
seqlock as all readers will be blocked from returning until the write is
complete.  We can therefore allow writers to insert a new value and exit
rather than fight with CPUs who only hold a reader lock.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>

diff -r cff8d3633e9c kernel/timer.c
--- a/kernel/timer.c	Fri Jul 29 22:01:15 2005
+++ b/kernel/timer.c	Mon Aug  1 08:39:44 2005
@@ -1428,7 +1428,7 @@
 	}
 }
 
-static inline u64 time_interpolator_get_counter(void)
+static inline u64 time_interpolator_get_counter(int writelock)
 {
 	unsigned int src = time_interpolator->source;
 
@@ -1436,6 +1436,20 @@
 	{
 		u64 lcycle;
 		u64 now;
+
+		if (writelock) {
+			lcycle = time_interpolator->last_cycle;
+			now = time_interpolator_get_cycles(src);
+			if (lcycle && time_after(lcycle, now))
+				return lcycle;
+
+			/* When holding the xtime write lock, there's no need
+			 * to add the overhead of the cmpxchg.  Readers are
+			 * force to retry until the write lock is released.
+			 */
+			time_interpolator->last_cycle = now;
+			return now;
+		}
 
 		do {
 			lcycle = time_interpolator->last_cycle;
@@ -1455,7 +1469,7 @@
 void time_interpolator_reset(void)
 {
 	time_interpolator->offset = 0;
-	time_interpolator->last_counter = time_interpolator_get_counter();
+	time_interpolator->last_counter = time_interpolator_get_counter(1);
 }
 
 #define GET_TI_NSECS(count,i) (((((count) - i->last_counter) & (i)->mask) * (i)->nsec_per_cyc) >> (i)->shift)
@@ -1467,7 +1481,7 @@
 		return 0;
 
 	return time_interpolator->offset +
-		GET_TI_NSECS(time_interpolator_get_counter(), time_interpolator);
+		GET_TI_NSECS(time_interpolator_get_counter(0), time_interpolator);
 }
 
 #define INTERPOLATOR_ADJUST 65536
@@ -1490,7 +1504,7 @@
 	 * and the tuning logic insures that.
          */
 
-	counter = time_interpolator_get_counter();
+	counter = time_interpolator_get_counter(1);
 	offset = time_interpolator->offset + GET_TI_NSECS(counter, time_interpolator);
 
 	if (delta_nsec < 0 || (unsigned long) delta_nsec < offset)



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-01 15:52 [PATCH] optimize writer path in time_interpolator_get_counter() Alex Williamson
@ 2005-08-01 16:06 ` Christoph Lameter
  2005-08-01 16:10   ` Alex Williamson
  2005-08-02 18:37   ` tony.luck
  0 siblings, 2 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-01 16:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: linux-kernel, akpm, tony.luck

Could we remove some code duplication?

--

   When using a time interpolator that is susceptible to jitter there's
potentially contention over a cmpxchg used to prevent time from going
backwards.  This is unnecessary when the caller holds the xtime write
seqlock as all readers will be blocked from returning until the write is
complete.  We can therefore allow writers to insert a new value and exit
rather than fight with CPUs who only hold a reader lock.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/kernel/timer.c
===================================================================
--- linux-2.6.orig/kernel/timer.c	2005-07-27 18:29:24.000000000 -0700
+++ linux-2.6/kernel/timer.c	2005-08-01 09:04:21.000000000 -0700
@@ -1428,7 +1428,7 @@
 	}
 }
 
-static inline u64 time_interpolator_get_counter(void)
+static inline u64 time_interpolator_get_counter(int writelock)
 {
 	unsigned int src = time_interpolator->source;
 
@@ -1442,6 +1442,15 @@
 			now = time_interpolator_get_cycles(src);
 			if (lcycle && time_after(lcycle, now))
 				return lcycle;
+
+			/* When holding the xtime write lock, there's no need
+			 * to add the overhead of the cmpxchg.  Readers are
+			 * force to retry until the write lock is released.
+			 */
+			if (writelock) {
+				time_interpolator->last_cycle = now;
+				return now;
+			}
 			/* Keep track of the last timer value returned. The use of cmpxchg here
 			 * will cause contention in an SMP environment.
 			 */
@@ -1455,7 +1464,7 @@
 void time_interpolator_reset(void)
 {
 	time_interpolator->offset = 0;
-	time_interpolator->last_counter = time_interpolator_get_counter();
+	time_interpolator->last_counter = time_interpolator_get_counter(1);
 }
 
 #define GET_TI_NSECS(count,i) (((((count) - i->last_counter) & (i)->mask) * (i)->nsec_per_cyc) >> (i)->shift)
@@ -1467,7 +1476,7 @@
 		return 0;
 
 	return time_interpolator->offset +
-		GET_TI_NSECS(time_interpolator_get_counter(), time_interpolator);
+		GET_TI_NSECS(time_interpolator_get_counter(0), time_interpolator);
 }
 
 #define INTERPOLATOR_ADJUST 65536
@@ -1490,7 +1499,7 @@
 	 * and the tuning logic insures that.
          */
 
-	counter = time_interpolator_get_counter();
+	counter = time_interpolator_get_counter(1);
 	offset = time_interpolator->offset + GET_TI_NSECS(counter, time_interpolator);
 
 	if (delta_nsec < 0 || (unsigned long) delta_nsec < offset)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-01 16:06 ` Christoph Lameter
@ 2005-08-01 16:10   ` Alex Williamson
  2005-08-02 18:37   ` tony.luck
  1 sibling, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2005-08-01 16:10 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, akpm, tony.luck

On Mon, 2005-08-01 at 09:06 -0700, Christoph Lameter wrote:
> Could we remove some code duplication?

   Works for me.  Thanks,

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-01 16:06 ` Christoph Lameter
  2005-08-01 16:10   ` Alex Williamson
@ 2005-08-02 18:37   ` tony.luck
  2005-08-02 21:19     ` Christoph Lameter
  2005-08-03 14:42     ` Alex Williamson
  1 sibling, 2 replies; 12+ messages in thread
From: tony.luck @ 2005-08-02 18:37 UTC (permalink / raw)
  To: Christoph Lameter, Alex Williamson; +Cc: linux-kernel, akpm

+			/* When holding the xtime write lock, there's no need
+			 * to add the overhead of the cmpxchg.  Readers are
+			 * force to retry until the write lock is released.
+			 */
+			if (writelock) {
+				time_interpolator->last_cycle = now;
+				return now;
+			}

This is lots prettier that my first suggestion (no surprise, even _I_ didn't like
my suggested patch :-)

In theory it looks like it should help a lot as it ought to prevent the worst case
spinning when one cpu has the xtime write lock.

Sadly, running my test case (running 1-4 tasks, each bound to a cpu, each pounding
on gettimeofday(2)) I'm still seeing significant time spent spinning in this loop.
Things are better: worst case time was down to just over 2ms from 34ms ... which
is a significant improvement ... but 2ms still sounds ugly.

I'm still seeing the asymmetric behavior where cpu3 sees the really high times,
while cpu0,1,2 are seeing peaks of 170us, which is still not pretty.

So this is a very half-hearted ACK for the patch.  It does improve things, but I
still think that code that does:

	do {
		val = xxx;

		// execute some stuff here
	} while (cmpxchg(&xxx, val, newval) != val);

is badly flawed.  If this were in some rare code path that was only likely to see
collisions when more than three planets were aligned, then it might be OK, but this
is a common path and it is easy for (malicious) users to force multiple cpus into
running this code ... making it likely that the cmpxchg will fail repeatedly.

-Tony

P.S. My ugly patch does even less to fix this problem ... times are worse than this.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-02 18:37   ` tony.luck
@ 2005-08-02 21:19     ` Christoph Lameter
  2005-08-03 14:42     ` Alex Williamson
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-02 21:19 UTC (permalink / raw)
  To: tony.luck; +Cc: Alex Williamson, linux-kernel, akpm

On Tue, 2 Aug 2005 tony.luck@intel.com wrote:

> Sadly, running my test case (running 1-4 tasks, each bound to a cpu, each pounding
> on gettimeofday(2)) I'm still seeing significant time spent spinning in this loop.
> Things are better: worst case time was down to just over 2ms from 34ms ... which
> is a significant improvement ... but 2ms still sounds ugly.

I think we should be happy about this. Sounds ok for the problematic 
hardware.

> I'm still seeing the asymmetric behavior where cpu3 sees the really high times,
> while cpu0,1,2 are seeing peaks of 170us, which is still not pretty.

Is this an SMP system? Updates are performed by cpu0 and therefore the 
cacheline is mostly exclusively owned by that processor and then later 
forced to become shared by processors 1,2,3.

> So this is a very half-hearted ACK for the patch.  It does improve things, but I
> still think that code that does:
> 
> 	do {
> 		val = xxx;
> 
> 		// execute some stuff here
> 	} while (cmpxchg(&xxx, val, newval) != val);
> 
> is badly flawed.  If this were in some rare code path that was only likely to see
> collisions when more than three planets were aligned, then it might be OK, but this
> is a common path and it is easy for (malicious) users to force multiple cpus into
> running this code ... making it likely that the cmpxchg will fail repeatedly.

cmpxchg is the best that one can do in this case. If you would take a 
spinlock then we would even need more processing.

We can still switch on the nojitter by default if the ITC's are guaranteed 
to be properly synchronized which will make all this ugliness go away. 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] optimize writer path in time_interpolator_get_counter()
@ 2005-08-02 21:50 Luck, Tony
  2005-08-02 22:09 ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2005-08-02 21:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Alex Williamson, linux-kernel, akpm

>> I'm still seeing the asymmetric behavior where cpu3 sees the really high times,
>> while cpu0,1,2 are seeing peaks of 170us, which is still not pretty.
>
>Is this an SMP system? Updates are performed by cpu0 and therefore the 
>cacheline is mostly exclusively owned by that processor and then later 
>forced to become shared by processors 1,2,3.

Yes this is an SMP system (Intel Tiger4).  Cpu0 is the boot cpu, and is
indeed the one that takes the write lock, and thus the fast-return from
the get_counter() code.  I'm just very confused as to why I only see these
10X worse outliers on cpu3.  There doesn't seem to be anything special
about it (/proc/interrupts shows similar stuff happening on each cpu).

>We can still switch on the nojitter by default if the ITC's are guaranteed 
>to be properly synchronized which will make all this ugliness go away. 

What do you mean by "properly synchronized"?  We can't get them
completely in step ... but we do know that they are very close.
Closer than the microsecond granularity that most users (gettimeofday)
are going to pass back to the caller.  But running with "nojitter" will
occasionally result in time (as seen on two different processors) sometimes
jumping back by a microsecond.  How bad is this in practice?  I can
see that a user that simply subtracts two times may sometimes see an
answer of minus one micro-second ... but does this hurt?

-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-02 21:50 Luck, Tony
@ 2005-08-02 22:09 ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-02 22:09 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Alex Williamson, linux-kernel, akpm

On Tue, 2 Aug 2005, Luck, Tony wrote:

> Yes this is an SMP system (Intel Tiger4).  Cpu0 is the boot cpu, and is
> indeed the one that takes the write lock, and thus the fast-return from
> the get_counter() code.  I'm just very confused as to why I only see these
> 10X worse outliers on cpu3.  There doesn't seem to be anything special
> about it (/proc/interrupts shows similar stuff happening on each cpu).

Are you sure that this is not some hardware contingency? I.e. cachelines 
are prioritized according to cpu number or some such thing? Switch the 
timer interrupt to a different cpu and see how this affects the 
situation.

> >We can still switch on the nojitter by default if the ITC's are guaranteed 
> >to be properly synchronized which will make all this ugliness go away. 
> 
> What do you mean by "properly synchronized"?  We can't get them
> completely in step ... but we do know that they are very close.

Properly synchronized means completely in step.

> Closer than the microsecond granularity that most users (gettimeofday)
> are going to pass back to the caller.  But running with "nojitter" will
> occasionally result in time (as seen on two different processors) sometimes
> jumping back by a microsecond.  How bad is this in practice?  I can
> see that a user that simply subtracts two times may sometimes see an
> answer of minus one micro-second ... but does this hurt?

In practice we have seen the time jump forward to A.D. 2587 which then 
results in tcp retransmits periods of a few centuries. I'd strongly advise 
against nojitter unless you know that the ITCs are in step. Isnt the 
hardware capable of making them run in step? It seems that some i386 
BIOSes have accomplished just that.

Note that the time subsystem for ia64 is not based on microseconds like 
i386 but on nanoseconds. clock_gettime will return nanosecond resolution 
which WILL become negative with devastating consequence if the clock 
jumps back a microsecond. A single ia64 processor can do multiple calls 
to retrieve time from user space within a microsecond.

The jitter compensation avoids the problems arising with time for ITC.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-02 18:37   ` tony.luck
  2005-08-02 21:19     ` Christoph Lameter
@ 2005-08-03 14:42     ` Alex Williamson
  2005-08-03 16:10       ` Christoph Lameter
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2005-08-03 14:42 UTC (permalink / raw)
  To: tony.luck; +Cc: Christoph Lameter, linux-kernel, akpm

On Tue, 2005-08-02 at 11:37 -0700, tony.luck@intel.com wrote:

> Sadly, running my test case (running 1-4 tasks, each bound to a cpu, each pounding
> on gettimeofday(2)) I'm still seeing significant time spent spinning in this loop.
> Things are better: worst case time was down to just over 2ms from 34ms ... which
> is a significant improvement ... but 2ms still sounds ugly.

   I'll settle for an order of magnitude improvement for a first pass :)

> I'm still seeing the asymmetric behavior where cpu3 sees the really high times,
> while cpu0,1,2 are seeing peaks of 170us, which is still not pretty.

   How does cpu3's ITC synchronization compare to the others?  I suspect
there's some reasonable way that we could do a backoff w/i the do {}
while() loop, but I'm not sure what it is.  For a while, I was toying
around with the idea of how to convert:

	if (lcycle && time_after(lcycle, now))
		return lcycle;

into:

	if (lcycle && time_after(lcycle + min_delta, now))
		return lcycle;

But I don't know how that min_delta would be determined.  A "close
enough" factor like this would still prevent jitter, but would introduce
a minimum granularity increment of gettimeofday().  I think this might
be a reasonable performance vs absolute accuracy trade-off.  What
happens to your worst case time if you (just for a test) hard code a
min_delta of something around 20-50?  There could be some kind of
boot/run time tunable to set this min_delta if there's no good way to
calculate it.  It should be trivial to add something like this to the
fsys path as well and shouldn't disrupt the nojitter path at all.
Thanks,

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-03 14:42     ` Alex Williamson
@ 2005-08-03 16:10       ` Christoph Lameter
  2005-08-03 16:32         ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2005-08-03 16:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: tony.luck, linux-kernel, akpm

On Wed, 3 Aug 2005, Alex Williamson wrote:

> be a reasonable performance vs absolute accuracy trade-off.  What
> happens to your worst case time if you (just for a test) hard code a
> min_delta of something around 20-50?  There could be some kind of

Think about a threaded process that gets time on multiple processors 
and then compares the times. This means that the time value obtained later 
on one thread may indicate a time earlier than that obtained on another 
thread. An essential requirement for time values is that they are 
monotonically increasing. You are changing that basic nature.

> boot/run time tunable to set this min_delta if there's no good way to
> calculate it.  It should be trivial to add something like this to the
> fsys path as well and shouldn't disrupt the nojitter path at all.

I assure you it is not going to be trivial. IA64 asm instructions bundling 
and cycle optimization will have to be changed.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-03 16:10       ` Christoph Lameter
@ 2005-08-03 16:32         ` Alex Williamson
  2005-08-03 20:49           ` Christoph Lameter
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2005-08-03 16:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: tony.luck, linux-kernel, akpm

On Wed, 2005-08-03 at 09:10 -0700, Christoph Lameter wrote:
> On Wed, 3 Aug 2005, Alex Williamson wrote:
> 
> > be a reasonable performance vs absolute accuracy trade-off.  What
> > happens to your worst case time if you (just for a test) hard code a
> > min_delta of something around 20-50?  There could be some kind of
> 
> Think about a threaded process that gets time on multiple processors 
> and then compares the times. This means that the time value obtained later 
> on one thread may indicate a time earlier than that obtained on another 
> thread. An essential requirement for time values is that they are 
> monotonically increasing. You are changing that basic nature.

   Ok, I can see the scenario where that could produce jitter.  However,
that implies than any exit through that path could produce jitter as it
is.  For instance:

CPU0				CPU1
read lcycle
read itc
				read lcycle
				read itc
cmpxchg
				oops, lcycle is stale

So the window already exists for this...

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] optimize writer path in time_interpolator_get_counter()
@ 2005-08-03 17:00 Luck, Tony
  0 siblings, 0 replies; 12+ messages in thread
From: Luck, Tony @ 2005-08-03 17:00 UTC (permalink / raw)
  To: Christoph Lameter, Alex Williamson; +Cc: linux-kernel, akpm

>Think about a threaded process that gets time on multiple processors 
>and then compares the times. This means that the time value obtained later 
>on one thread may indicate a time earlier than that obtained on another 
>thread. An essential requirement for time values is that they are 
>monotonically increasing. You are changing that basic nature.

But this comes down to how much time does it take for two threads
to perform some synchronization operation so that they know that one
thread made the call before the other[1]?  I think that it might be
possible to make the claim[2] that we have synchonized the ITC values
more closely than the fastest user synchronization method ... hence
a simplistic kernel implementation will be monotonic in practice.

-Tony

[1] It is pointless to expect one return value to be greater than
another unless we know that the calls were made in a particular sequence.

[2] Hands are waving so wildly here that it is a wonder that I can
even type.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] optimize writer path in time_interpolator_get_counter()
  2005-08-03 16:32         ` Alex Williamson
@ 2005-08-03 20:49           ` Christoph Lameter
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2005-08-03 20:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: tony.luck, linux-kernel, akpm

On Wed, 3 Aug 2005, Alex Williamson wrote:

>    Ok, I can see the scenario where that could produce jitter.  However,
> that implies than any exit through that path could produce jitter as it
> is.  For instance:

Well what is the difference of this approach from booting with "nojitter"? 
The ITC offsets are fixed right?

Just check that the ITC difference is less than what you want to risk and 
switch on "nojitter" during bootup if less than the limit. Same effect 
without changes to the critical timer code patchs.

The main problem occurs as far as I know when execution continues on 
another processor during time measurement. In that case you may get an 
earlier time later because the ITC of the processor you migrated to is not 
there yet.... Similar issues may occur if time information is communicated 
between threads running on different processor.

You have an awful nest of rats here so if you do this then please at 
least do a printk that warns people of what is going on.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-08-03 20:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-08-01 15:52 [PATCH] optimize writer path in time_interpolator_get_counter() Alex Williamson
2005-08-01 16:06 ` Christoph Lameter
2005-08-01 16:10   ` Alex Williamson
2005-08-02 18:37   ` tony.luck
2005-08-02 21:19     ` Christoph Lameter
2005-08-03 14:42     ` Alex Williamson
2005-08-03 16:10       ` Christoph Lameter
2005-08-03 16:32         ` Alex Williamson
2005-08-03 20:49           ` Christoph Lameter
  -- strict thread matches above, loose matches on Subject: below --
2005-08-02 21:50 Luck, Tony
2005-08-02 22:09 ` Christoph Lameter
2005-08-03 17:00 Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox