long delays (possibly infinite) in time_interpolator_get

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* long delays (possibly infinite) in time_interpolator_get_counter
@ 2005-07-29 22:06 tony.luck
  2005-07-29 23:31 ` Christoph Lameter
  2005-07-30  0:32 ` Christoph Lameter
  0 siblings, 2 replies; 9+ messages in thread
From: tony.luck @ 2005-07-29 22:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: alex.williamson, clameter

This loop in time_interpolator_get_counter():

		do {
			lcycle = time_interpolator->last_cycle;
			now = time_interpolator_get_cycles(src);
			if (lcycle && time_after(lcycle, now))
				return lcycle;
			/* Keep track of the last timer value returned. The use of cmpxchg here
			 * will cause contention in an SMP environment.
			 */
		} while (unlikely(cmpxchg(&time_interpolator->last_cycle, lcycle, now) != lcycle));

is causing problems.  Alex has managed to get systems to hang in here when
one cpu 'X' is trying to update the time, so owns the write_seqlock on xtime_lock
and two other cpus 'Y' and 'Z' began a gettimeofday *after* X obtained that lock.
X gets stuck in the above loop since one of Y or Z manages to update the value
of time_interpolator->last_cycle every time before it gets to the cmpxchg. Y
and Z are doomed to keep looping in:

	do {
		seq = read_seqbegin(&xtime_lock);
		offset = time_interpolator_get_offset();
		sec = xtime.tv_sec;
		nsec = xtime.tv_nsec;
	} while (unlikely(read_seqretry(&xtime_lock, seq)));

because X has a write lock on xtime_lock.  All this analysis by Alex, I've
just been following along until here.

Alex seems to be a "lucky" guy here, because noone else can reproduce the
hang ... though it is easy to see that it is theoretically possible.

I did throw some instrumentation into time_interpolator_get_counter() to see
how long we spent looping ... and on a 4-way ia64 box saw times as high as
34.7ms (yes, milli-seconds).  The distribution is odd ... the huge majority
of attempts get out of the loop in 0.1 to 100usec.  There is a second, very small
peak centered around 1ms (77 calls out of 5.6 billion), and then a tiny set of
outliers (8 calls) up around 30ms.  These high outliers only show up on cpu3,
max time on cpu0..cpu2 are all around 250us.

The patch below makes things less bad by not letting Y & Z do the cmpxchg if
they are going to fail the read_seqretry() test anyway.  But it is very ugly
to do this extra test on xtime_lock ... so I'm hoping that someone can come
up with something better.

Signed-off-by: Tony Luck <tony.luck@intel.com> // but I hope someone has a better fix :-)

diff --git a/include/linux/timex.h b/include/linux/timex.h
--- a/include/linux/timex.h
+++ b/include/linux/timex.h
@@ -304,7 +304,7 @@ struct time_interpolator {
 extern void register_time_interpolator(struct time_interpolator *);
 extern void unregister_time_interpolator(struct time_interpolator *);
 extern void time_interpolator_reset(void);
-extern unsigned long time_interpolator_get_offset(void);
+extern unsigned long time_interpolator_get_offset(long);
 
 #else /* !CONFIG_TIME_INTERPOLATION */
 
diff --git a/kernel/time.c b/kernel/time.c
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -494,7 +494,7 @@ void getnstimeofday (struct timespec *tv
 	do {
 		seq = read_seqbegin(&xtime_lock);
 		sec = xtime.tv_sec;
-		nsec = xtime.tv_nsec+time_interpolator_get_offset();
+		nsec = xtime.tv_nsec+time_interpolator_get_offset(seq);
 	} while (unlikely(read_seqretry(&xtime_lock, seq)));
 
 	while (unlikely(nsec >= NSEC_PER_SEC)) {
@@ -538,7 +538,7 @@ void do_gettimeofday (struct timeval *tv
 	unsigned long seq, nsec, usec, sec, offset;
 	do {
 		seq = read_seqbegin(&xtime_lock);
-		offset = time_interpolator_get_offset();
+		offset = time_interpolator_get_offset(seq);
 		sec = xtime.tv_sec;
 		nsec = xtime.tv_nsec;
 	} while (unlikely(read_seqretry(&xtime_lock, seq)));
diff --git a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1428,7 +1428,7 @@ static inline u64 time_interpolator_get_
 	}
 }
 
-static inline u64 time_interpolator_get_counter(void)
+static inline u64 time_interpolator_get_counter(long seq)
 {
 	unsigned int src = time_interpolator->source;
 
@@ -1442,6 +1442,14 @@ static inline u64 time_interpolator_get_
 			now = time_interpolator_get_cycles(src);
 			if (lcycle && time_after(lcycle, now))
 				return lcycle;
+
+			/*
+			 * If our caller is going to ignore us and retry, then
+			 * don't burn up the bus with the cmpxchg
+			 */
+			if (seq && unlikely(read_seqretry(&xtime_lock, seq)))
+				return now;
+
 			/* Keep track of the last timer value returned. The use of cmpxchg here
 			 * will cause contention in an SMP environment.
 			 */
@@ -1455,19 +1463,19 @@ static inline u64 time_interpolator_get_
 void time_interpolator_reset(void)
 {
 	time_interpolator->offset = 0;
-	time_interpolator->last_counter = time_interpolator_get_counter();
+	time_interpolator->last_counter = time_interpolator_get_counter(0);
 }
 
 #define GET_TI_NSECS(count,i) (((((count) - i->last_counter) & (i)->mask) * (i)->nsec_per_cyc) >> (i)->shift)
 
-unsigned long time_interpolator_get_offset(void)
+unsigned long time_interpolator_get_offset(long seq)
 {
 	/* If we do not have a time interpolator set up then just return zero */
 	if (!time_interpolator)
 		return 0;
 
 	return time_interpolator->offset +
-		GET_TI_NSECS(time_interpolator_get_counter(), time_interpolator);
+		GET_TI_NSECS(time_interpolator_get_counter(seq), time_interpolator);
 }
 
 #define INTERPOLATOR_ADJUST 65536
@@ -1490,7 +1498,7 @@ static void time_interpolator_update(lon
 	 * and the tuning logic insures that.
          */
 
-	counter = time_interpolator_get_counter();
+	counter = time_interpolator_get_counter(0);
 	offset = time_interpolator->offset + GET_TI_NSECS(counter, time_interpolator);
 
 	if (delta_nsec < 0 || (unsigned long) delta_nsec < offset)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-29 22:06 long delays (possibly infinite) in time_interpolator_get_counter tony.luck
@ 2005-07-29 23:31 ` Christoph Lameter
  2005-07-30 16:47   ` Alex Williamson
  2005-07-30  0:32 ` Christoph Lameter
  1 sibling, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2005-07-29 23:31 UTC (permalink / raw)
  To: tony.luck; +Cc: linux-kernel, alex.williamson

What you are dealing with is a machine that is using ITC as a time bases. 
That is a special case. The fix should not affect machines that have a 
proper time source. More below. You can circumvent the compensation for 
ITC inaccuracies by specifying "nojitter" on the kernel command if you are 
willing to take the risk of slightly inaccurate time.

On Fri, 29 Jul 2005 tony.luck@intel.com wrote:
> The patch below makes things less bad by not letting Y & Z do the cmpxchg if
> they are going to fail the read_seqretry() test anyway.  But it is very ugly
> to do this extra test on xtime_lock ... so I'm hoping that someone can come
> up with something better.

Well get a proper time source and do not use ITC for a time source in an 
SMP system. Got HPET hardware?

> -extern unsigned long time_interpolator_get_offset(void);
> +extern unsigned long time_interpolator_get_offset(long);

This is going to cause unnecessary overhead for non ITC users.

> -		nsec = xtime.tv_nsec+time_interpolator_get_offset();
> +		nsec = xtime.tv_nsec+time_interpolator_get_offset(seq);

Here it is.

> -		offset = time_interpolator_get_offset();
> +		offset = time_interpolator_get_offset(seq);

and here again.

> +			/*
> +			 * If our caller is going to ignore us and retry, then
> +			 * don't burn up the bus with the cmpxchg
> +			 */
> +			if (seq && unlikely(read_seqretry(&xtime_lock, seq)))
> +				return now;

Two read_seqretries() for one  read_seqbegin()? That its unusual to say 
the least.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-29 22:06 long delays (possibly infinite) in time_interpolator_get_counter tony.luck
  2005-07-29 23:31 ` Christoph Lameter
@ 2005-07-30  0:32 ` Christoph Lameter
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2005-07-30  0:32 UTC (permalink / raw)
  To: tony.luck; +Cc: linux-kernel, alex.williamson

> diff --git a/include/linux/timex.h b/include/linux/timex.h

Oh. Before I forget: You need to make the same changes to the asm code in 
arch/ia64/kernel/fsys.S in order for this to work properly. The asm code 
has been optimized to the hilt to save every cycle possible. Please dont 
add any. The C code is typically bypassed for all user space gettimeofday 
/ clock_gettime calls.

Hmm.. However, if you did not see the problem in the asm code (which does 
not have the nesting issue of C and wastes some time doing other things) 
then we may solve the issue by either also calling asm from kernel space 
or making sure that some time is wasted on something else then the 
cmpxchg in the inner loop.

Or we can make "nojitter" the default? Then do a

if (nojitter)
	printk(KERN_ERR "Beware: SMP system using ITC as a time source!"
		"Time may fluctuate.\n");

at bootup and hope for the best?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-29 23:31 ` Christoph Lameter
@ 2005-07-30 16:47   ` Alex Williamson
  2005-07-30 18:10     ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2005-07-30 16:47 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: tony.luck, linux-kernel

On Fri, 2005-07-29 at 16:31 -0700, Christoph Lameter wrote:
> What you are dealing with is a machine that is using ITC as a time bases. 
> That is a special case.

   The default time source for ia64 systems is a "special case"?  4
socket and smaller boxes typically do not have any other time source.

> The fix should not affect machines that have a 
> proper time source. More below. You can circumvent the compensation for 
> ITC inaccuracies by specifying "nojitter" on the kernel command if you are 
> willing to take the risk of slightly inaccurate time.

   And what if you don't have any HPET and aren't willing to take that
risk?  We need a solution that works with all time sources.  A system
with the default time source should not hang or have unreasonable delays
with the standard setup.  Why is a synchronized ITC driven from a common
clock such a terrible time source for small systems?

> Well get a proper time source and do not use ITC for a time source in an 
> SMP system. Got HPET hardware?

   No, HPET on small boxes is unnecessary, we should be able to come up
with something that can effectively use the ITC.  Does a seqlock really
make sense for the do_gettimeofday() path?  This problem arises because
a reader of time is actually updating and writing a part of time.  It
would certainly be too much overhead to obtain a write lock for every
gettimeofday(), but to avoid that we have to put some kind of contention
avoidance in the path.  I don't know whether that should be some kind of
back-off algorithm at the point of contention w/ the cmpxchg or up
higher with a new seqlock read entry point that blocks when a write is
in progress.  In any case, I think we need to focus on a solution that
works well on all systems, not just those with extra timer hardware.
Thanks,

	Alex

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-30 16:47   ` Alex Williamson
@ 2005-07-30 18:10     ` Christoph Lameter
  2005-07-31 17:02       ` Alex Williamson
  2005-08-01 16:57       ` Bjorn Helgaas
  0 siblings, 2 replies; 9+ messages in thread
From: Christoph Lameter @ 2005-07-30 18:10 UTC (permalink / raw)
  To: Alex Williamson; +Cc: tony.luck, linux-kernel

On Sat, 30 Jul 2005, Alex Williamson wrote:

> On Fri, 2005-07-29 at 16:31 -0700, Christoph Lameter wrote:
> > What you are dealing with is a machine that is using ITC as a time bases. 
> > That is a special case.
> 
>    The default time source for ia64 systems is a "special case"?  4
> socket and smaller boxes typically do not have any other time source.

It is a special case because we typically use other time sources. The 
limitations of cycle counters in SMP environments are well known and most 
hardware manufacturers have dealt with this issue by using another clock 
source.

>    And what if you don't have any HPET and aren't willing to take that
> risk?  We need a solution that works with all time sources.  A system
> with the default time source should not hang or have unreasonable delays
> with the standard setup.  Why is a synchronized ITC driven from a common
> clock such a terrible time source for small systems

If it is really synchronized then you can run with "nojitter" without any 
issue. Then you wont have to deal with the cmpxchg and everything is fine. 
But I suspect that the ITC are NOT truly synchronized (it has an 
"offset" that may be nonzero right?) otherwise you would have used nojitter.

> in progress.  In any case, I think we need to focus on a solution that
> works well on all systems, not just those with extra timer hardware.

Extra timer hardware is necessary to fix the ITC deficiency. You are at 
the source of the problem. Fix the damn hardware to provide a standardized 
synchronized clock or provide a truly synchronized ITC.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-30 18:10     ` Christoph Lameter
@ 2005-07-31 17:02       ` Alex Williamson
  2005-07-31 20:00         ` Christoph Lameter
  2005-08-01 16:57       ` Bjorn Helgaas
  1 sibling, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2005-07-31 17:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: tony.luck, linux-kernel


   Ok, here's an optimization that should help reduce contention on the
cmpxchg, has zero impact on the nojitter path, and doesn't require any
changes to fsys.  When a caller already had the xtime_lock write lock
there's no need to fight with other CPUs for the cmpxchg.  The other
"reader" CPUs will have to fetch it again since a seqlock write is in
progress.  Therefore we can simplify this path as shown below.  The
write is atomic, and we don't care if another CPU has changed last_cycle
since it can't return the value until the write lock is released.  This
has only been compile tested, but I'm interested to hear your opinion.
Thanks,

	Alex


diff -r cff8d3633e9c kernel/timer.c
--- a/kernel/timer.c	Fri Jul 29 22:01:15 2005
+++ b/kernel/timer.c	Sun Jul 31 10:25:41 2005
@@ -1452,10 +1452,34 @@
 		return time_interpolator_get_cycles(src);
 }
 
+static inline u64 time_interpolator_get_counter_locked(void)
+{
+	unsigned int src = time_interpolator->source;
+
+	if (time_interpolator->jitter)
+	{
+		u64 lcycle = time_interpolator->last_cycle;
+		u64 now = time_interpolator_get_cycles(src);
+
+		if (lcycle && time_after(lcycle, now))
+			return lcycle;
+
+		/*
+		 * This path is called when holding the xtime write lock.
+		 * This allows us to avoid the contention of the cmpxchg
+		 * in get_counter, and still ensures jitter protection.
+		 */
+		time_interpolator->last_cycle = now;
+		return now;
+	}
+	else
+		return time_interpolator_get_cycles(src);
+}
+
 void time_interpolator_reset(void)
 {
 	time_interpolator->offset = 0;
-	time_interpolator->last_counter = time_interpolator_get_counter();
+	time_interpolator->last_counter = time_interpolator_get_counter_locked();
 }
 
 #define GET_TI_NSECS(count,i) (((((count) - i->last_counter) & (i)->mask) * (i)->nsec_per_cyc) >> (i)->shift)
@@ -1490,7 +1514,7 @@
 	 * and the tuning logic insures that.
          */
 
-	counter = time_interpolator_get_counter();
+	counter = time_interpolator_get_counter_locked();
 	offset = time_interpolator->offset + GET_TI_NSECS(counter, time_interpolator);
 
 	if (delta_nsec < 0 || (unsigned long) delta_nsec < offset)



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-31 17:02       ` Alex Williamson
@ 2005-07-31 20:00         ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2005-07-31 20:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: tony.luck, linux-kernel

On Sun, 31 Jul 2005, Alex Williamson wrote:

>    Ok, here's an optimization that should help reduce contention on the
> cmpxchg, has zero impact on the nojitter path, and doesn't require any
> changes to fsys.  When a caller already had the xtime_lock write lock
> there's no need to fight with other CPUs for the cmpxchg.  The other

Yup correct.

> "reader" CPUs will have to fetch it again since a seqlock write is in
> progress.  Therefore we can simplify this path as shown below.  The
> write is atomic, and we don't care if another CPU has changed last_cycle
> since it can't return the value until the write lock is released.  This
> has only been compile tested, but I'm interested to hear your opinion.

time_interpolator_get_counter is static inline. So you may modify the
existing function and pass a constant parameter without a 
performance reduction. Two different versions will then be generated for 
time_interpolator_get_counter at compile time.

Maybe call the parameter "writelock"?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-07-30 18:10     ` Christoph Lameter
  2005-07-31 17:02       ` Alex Williamson
@ 2005-08-01 16:57       ` Bjorn Helgaas
  2005-08-01 17:04         ` Christoph Lameter
  1 sibling, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2005-08-01 16:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Alex Williamson, tony.luck, linux-kernel

On Saturday 30 July 2005 12:10 pm, Christoph Lameter wrote:
> On Sat, 30 Jul 2005, Alex Williamson wrote:
> 
> > On Fri, 2005-07-29 at 16:31 -0700, Christoph Lameter wrote:
> > > What you are dealing with is a machine that is using ITC as a time bases. 
> > > That is a special case.
> > 
> >    The default time source for ia64 systems is a "special case"?  4
> > socket and smaller boxes typically do not have any other time source.
> 
> It is a special case because we typically use other time sources.

Maybe you==SGI typically use other time sources, but most other
ia64 boxes have synchronized ITCs.  There's no reason such machines
should have to use the slower and lower precision HPET.

> If it is really synchronized then you can run with "nojitter" without any 
> issue. Then you wont have to deal with the cmpxchg and everything is fine. 
> But I suspect that the ITC are NOT truly synchronized (it has an 
> "offset" that may be nonzero right?) otherwise you would have used nojitter.

And why should everyday users have to be concerned with "nojitter"?

> Extra timer hardware is necessary to fix the ITC deficiency. You are at 
> the source of the problem. Fix the damn hardware to provide a standardized 
> synchronized clock or provide a truly synchronized ITC.

The "ITC deficiency" is a platform design issue.  Most small SMP platforms
*do* synchronize the clocks of all processors.  Obviously that's difficult
on large boxes, and then you may need extra timers in the platform.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: long delays (possibly infinite) in time_interpolator_get_counter
  2005-08-01 16:57       ` Bjorn Helgaas
@ 2005-08-01 17:04         ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2005-08-01 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Alex Williamson, tony.luck, linux-kernel

On Mon, 1 Aug 2005, Bjorn Helgaas wrote:

> > If it is really synchronized then you can run with "nojitter" without any 
> > issue. Then you wont have to deal with the cmpxchg and everything is fine. 
> > But I suspect that the ITC are NOT truly synchronized (it has an 
> > "offset" that may be nonzero right?) otherwise you would have used nojitter.
> 
> And why should everyday users have to be concerned with "nojitter"?

Because so far ITCs were not truly synchronized.
 
> > Extra timer hardware is necessary to fix the ITC deficiency. You are at 
> > the source of the problem. Fix the damn hardware to provide a standardized 
> > synchronized clock or provide a truly synchronized ITC.
> 
> The "ITC deficiency" is a platform design issue.  Most small SMP platforms
> *do* synchronize the clocks of all processors.  Obviously that's difficult
> on large boxes, and then you may need extra timers in the platform.

I have no objections to making nojitter the default. There is already some 
PAL information that provides information on ITC reliability. If you can 
insure that this is accurate (currently it indicates no drift even if 
the ITC's have an offset) then simply switch on nojitter for small 
boxes.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2005-08-01 17:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-07-29 22:06 long delays (possibly infinite) in time_interpolator_get_counter tony.luck
2005-07-29 23:31 ` Christoph Lameter
2005-07-30 16:47   ` Alex Williamson
2005-07-30 18:10     ` Christoph Lameter
2005-07-31 17:02       ` Alex Williamson
2005-07-31 20:00         ` Christoph Lameter
2005-08-01 16:57       ` Bjorn Helgaas
2005-08-01 17:04         ` Christoph Lameter
2005-07-30  0:32 ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox