via timer/clock problem workaround

public inbox for linux-kernel@vger.kernel.org

* via timer/clock problem workaround
@ 2002-05-23  3:18 Eric Seppanen
  2002-05-23  9:11 ` george anzinger
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Seppanen @ 2002-05-23  3:18 UTC
  To: linux-kernel

I've noticed a handful of messages in recent months regarding the problems 
with the VIA chipset timer.  It would appear that the timer fails every 
so often and this causes gettimeofday to start returning weird values.

This has the following symptoms that I've noticed:
- clock often jumps forward 71 minutes, then back
- screensaver kicks on unexpectedly
- video playback programs freeze or start stuttering
- PS/2 mouse flies up to upper right corner under X
... I'm sure there's more; odd timeofday values cause lots of strange 
things.
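
One plausible reading of the 71-minute figure, offered as an aside rather than a finding from this thread: 2^32 microseconds is about 71.58 minutes, which is what a slightly negative microsecond offset looks like once it is stored in a 32-bit unsigned long, as on i386. The sketch below only illustrates that arithmetic. Both helper names are hypothetical and nothing here is kernel code.

```c
#include <stdint.h>

/* Hypothetical helper (not kernel code): reinterpret a signed microsecond
 * offset as the 32-bit unsigned value that an i386 "unsigned long" holds. */
static uint32_t wrapped_offset_usec(int32_t signed_usec)
{
	return (uint32_t)signed_usec;
}

/* Whole minutes contained in a microsecond count. */
static uint32_t usec_to_whole_minutes(uint32_t usec)
{
	return usec / (60u * 1000000u);
}
```

With these definitions, an offset of -1 usec wraps to 4294967295 usec, which is 71 whole minutes.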

There are a few patches floating around that fix this in some cases, but 
not all.  I've looked into this further and created a patch that I think 
does a much better job, though it may not be perfect yet.

In 2.4.18, whenever the code sees the microsecond offset start to grow too 
large, it guesses that there's a timer problem and smacks the timer.  This 
seems to work, but I think the code is in the wrong place.

This workaround only happens if CONFIG_X86_TSC is not set. 
Athlon-optimized kernels seem likely to have CONFIG_X86_TSC set (the 
Red Hat Athlon kernel does), so it seems wrong to put the workaround there.

Additionally, there's a while loop in do_gettimeofday() that will loop 
millions of times if an unreasonable offset is returned from 
do_gettimeoffset().  This can be avoided by doing division instead.
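
The difference between the two normalization styles can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct tv_pair` and both function names are hypothetical stand-ins for the `sec`/`usec` locals in do_gettimeofday(). The loop's iteration count grows with the size of the offset (about 4294 iterations for a full 32-bit wrap of 2^32 usec), while the division does constant work.

```c
/* Hypothetical stand-in for the (sec, usec) locals in do_gettimeofday(). */
struct tv_pair {
	unsigned long sec;
	unsigned long usec;
};

/* Loop style: one iteration per second of overflow.
 * Returns the number of iterations so the cost is visible. */
static unsigned long normalize_loop(struct tv_pair *t)
{
	unsigned long iterations = 0;

	while (t->usec >= 1000000) {
		t->usec -= 1000000;
		t->sec++;
		iterations++;
	}
	return iterations;
}

/* Division style: same result, constant work regardless of the offset. */
static void normalize_div(struct tv_pair *t)
{
	t->sec += t->usec / 1000000;
	t->usec %= 1000000;
}
```

Both functions leave the same `sec` and `usec` behind; only the amount of work differs.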

I've worked over the code a bit, and I have a new patch that moves the 
timer-smack into the part of the code that executes whether the TSC is 
being used or not.  If you don't like the amount of code I've moved 
around, fear not: most of the code shuffling is just to make the debugging 
printk print the data I want.  It should be straightforward to make a 
smaller patch that does the same thing.

In my testing (using CONFIG_X86_TSC) this improves the situation quite a 
bit: before, the timer would stay messed up and the machine would act 
crazy until the next reboot.  Now, there may be a single bad value 
returned but the system goes back to normal after that.  Maybe not 
perfect, but certainly better.

I'd appreciate it if anyone experiencing odd behavior on VIA chipsets 
could give it a try.  The problem usually only occurs under heavy loads; I 
have reproduced it often by creating massive images (5000x5000 pixels) in 
The Gimp or playing MPEG files while copying huge files around.

The patch works well today, but there are still a few outstanding 
questions I have:

1. Why does this (bogus offset) happen?  Has the timer died?  Is there 
another way to prevent this from happening in the first place?
2. Is it possible to resurrect what the correct offset should be at this 
point?
3. If not, what's the best value to use as an offset here?  I'm still 
using the bogus value to calculate the timeofday returned.  Is there a 
better way?
4. What does the code (which I've named smack_timer) do?  Is it correct or 
just lucky?  I kept the workaround code that was already in 2.4.18, but I 
don't understand what it's doing.

Patch attached applies against 2.4.18 and the Red Hat 7.3 kernels.  I'll 
keep my latest version here:
http://www.reric.net/linux/viatimer/

Eric

[-- Attachment #2: eds_timer1.patch --]
[-- Type: text/plain, Size: 2117 bytes --]

--- time.c.orig	Wed May 22 12:01:24 2002
+++ linux/arch/i386/kernel/time.c	Wed May 22 20:38:29 2002
@@ -118,6 +118,14 @@

 extern spinlock_t i8259A_lock;

+static void smack_timer(void)
+{
+	outb_p(0x34, 0x43);
+	outb_p(LATCH & 0xff, 0x40);
+	outb(LATCH >> 8, 0x40);
+}
+
+
 #ifndef CONFIG_X86_TSC

 /* This function must be called with interrupts disabled 
@@ -179,14 +187,6 @@

 	count |= inb_p(0x40) << 8;

-        /* VIA686a test code... reset the latch if count > max + 1 */
-        if (count > LATCH) {
-                outb_p(0x34, 0x43);
-                outb_p(LATCH & 0xff, 0x40);
-                outb(LATCH >> 8, 0x40);
-                count = LATCH - 1;
-        }
-	
 	spin_unlock(&i8253_lock);

 	/*
@@ -267,23 +267,49 @@
 {
 	unsigned long flags;
 	unsigned long usec, sec;
+	unsigned long usec_overflow=0;
+	unsigned long lost;

 	read_lock_irqsave(&xtime_lock, flags);
 	usec = do_gettimeoffset();
-	{
-		unsigned long lost = jiffies - wall_jiffies;
-		if (lost)
-			usec += lost * (1000000 / HZ);
-	}
 	sec = xtime.tv_sec;
 	usec += xtime.tv_usec;
 	read_unlock_irqrestore(&xtime_lock, flags);
+	lost = jiffies - wall_jiffies;

-	while (usec >= 1000000) {
-		usec -= 1000000;
-		sec++;
+	/* if usec is overflowing calculate by how much */
+	if (usec >= 1000000) {
+		usec_overflow = usec / 1000000;
 	}

+	/* xtime.tv_usec could bring us almost to 1, so if we go over 2,
+           we're overflowing by over a second. */
+	if (usec_overflow > 2) {
+#ifdef CONFIG_X86_TSC
+		printk("gettimeofday bug(TSC): offset=0x%lx, sec=%lu, lost=%lu\n", usec, sec, lost);
+#else
+		printk("gettimeofday bug: offset=0x%lx, sec=%lu, lost=%lu\n", usec, sec, lost);
+#endif
+		smack_timer();
+	}
+
+	if (usec_overflow) {
+		usec = usec % 1000000;
+	}
+
+	if (lost) {
+		usec += lost * (1000000 / HZ);
+	}
+
+	/* this is a little redundant but now includes lost jiffies,
+           which I didn't want to count in the bug test above */
+	if (usec >= 1000000) {
+		usec_overflow += usec / 1000000;
+		usec = usec % 1000000;
+	}
+
+	sec+= usec_overflow;
+
 	tv->tv_sec = sec;
 	tv->tv_usec = usec;
 }


* Re: via timer/clock problem workaround
  2002-05-23  3:18 via timer/clock problem workaround Eric Seppanen
@ 2002-05-23  9:11 ` george anzinger
       [not found]   ` <20020523103340.A27767@reric.net>
  0 siblings, 1 reply; 4+ messages in thread
From: george anzinger @ 2002-05-23  9:11 UTC
  To: Eric Seppanen; +Cc: linux-kernel

Eric Seppanen wrote:
> 
> I've noticed a handful of messages in recent months regarding the problems
> with the VIA chipset timer.  It would appear that the timer fails every
> so often and this causes gettimeofday to start returning weird values.
> 
> This has the following symptoms that I've noticed:
> - clock often jumps forward 71 minutes, then back
> - screensaver kicks on unexpectedly
> - video playback programs freeze or start stuttering
> - PS/2 mouse flies up to upper right corner under X
> ... I'm sure there's more; odd timeofday values cause lots of strange
> things.
> 
> There are a few patches floating around that fix this in some cases, but
> not all.  I've looked into this further and created a patch that I think
> does a much better job, though it may not be perfect yet.
> 
> In 2.4.18, whenever the code sees the microsecond offset start to grow too
> large, it guesses that there's a timer problem and smacks the timer.  This
> seems to work, but I think the code is in the wrong place.
> 
> This workaround only happens if CONFIG_X86_TSC is not set.
> Athlon-optimized kernels seem likely to have CONFIG_X86_TSC set (the
> Red Hat Athlon kernel does), so it seems wrong to put the workaround there.
> 
> Additionally, there's a while loop in do_gettimeofday() that will loop
> millions of times if an unreasonable offset is returned from
> do_gettimeoffset().  This can be avoided by doing division instead.
> 
> I've worked over the code a bit, and I have a new patch that moves the
> timer-smack into the part of the code that executes whether the TSC is
> being used or not.  If you don't like the amount of code I've moved
> around, fear not: most of the code shuffling is just to make the debugging
> printk print the data I want.  It should be straightforward to make a
> smaller patch that does the same thing.
> 
> In my testing (using CONFIG_X86_TSC) this improves the situation quite a
> bit: before, the timer would stay messed up and the machine would act
> crazy until the next reboot.  Now, there may be a single bad value
> returned but the system goes back to normal after that.  Maybe not
> perfect, but certainly better.
> 
> I'd appreciate it if anyone experiencing odd behavior on VIA chipsets
> could give it a try.  The problem usually only occurs under heavy loads; I
> have reproduced it often by creating massive images (5000x5000 pixels) in
> The Gimp or playing MPEG files while copying huge files around.
> 
> The patch works well today, but there are still a few outstanding
> questions I have:
> 
> 1. Why does this (bogus offset) happen?  Has the timer died?  Is there
> another way to prevent this from happening in the first place?

Good question.  The code you call smack_timer reprograms the
timer chip to generate repeating interrupts every "LATCH"
units of time (LATCH is 1/HZ seconds in timer chip ticks).
This is a complete reprogramming of the chip's timer.
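
To make the reprogramming concrete, the 0x34 byte written to port 0x43 can be decoded using the standard 8253/8254 control word layout. The sketch below is an illustration under that assumption; `struct pit_command` and `decode_pit_command` are hypothetical names, not kernel identifiers.

```c
#include <stdint.h>

/* Fields of an 8253/8254 control word (hypothetical struct for illustration). */
struct pit_command {
	unsigned channel; /* bits 7-6: counter select */
	unsigned access;  /* bits 5-4: 3 = low byte, then high byte */
	unsigned mode;    /* bits 3-1: 2 = rate generator (periodic) */
	unsigned bcd;     /* bit 0: 0 = binary counting, 1 = BCD */
};

/* Split a control word into its fields. */
static struct pit_command decode_pit_command(uint8_t cmd)
{
	struct pit_command c;

	c.channel = (cmd >> 6) & 0x3;
	c.access  = (cmd >> 4) & 0x3;
	c.mode    = (cmd >> 1) & 0x7;
	c.bcd     = cmd & 0x1;
	return c;
}
```

Decoding 0x34 gives counter 0, low-then-high byte access, mode 2 and binary counting, which matches the two writes of `LATCH & 0xff` and `LATCH >> 8` to port 0x40 that follow it in smack_timer.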

Since you got here on a request for time of day, it is
possible that timer interrupts are just not happening.  In
fact, detecting over 1 second of lost time (from reading the
TSC) surely indicates that the timer is no longer
interrupting.  It might be instructive to read the timer
latch at this time to see what it holds, or where it is
stuck.  Normally the LATCH count is counted down to zero by
the chip and then the LATCH value is reloaded by the chip
and the count starts over again.  The chip has other modes
that just count through zero (i.e. going from 0 to 0xffff,
since it is 16 bits).  These modes are supposed to generate
only one interrupt.  Another factoid is that the maximum
time the chip can be programmed for is on the order of 50
msec.
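
The numbers behind LATCH and the roughly 50 msec ceiling can be worked out as below. This is a sketch under stated assumptions: the 1193180 Hz input clock and the round-to-nearest LATCH formula are recalled from the 2.4 i386 headers and should be checked against your tree, and the names `PIT_TICK_RATE`, `pit_latch` and `pit_max_period_usec` are hypothetical.

```c
#include <stdint.h>

/* Assumed PIT input clock in Hz (CLOCK_TICK_RATE in the 2.4 i386 headers,
 * as recalled; verify against your tree). */
#define PIT_TICK_RATE 1193180u

/* PIT clock ticks per jiffy for a given HZ, rounded to nearest. */
static uint32_t pit_latch(uint32_t hz)
{
	return (PIT_TICK_RATE + hz / 2) / hz;
}

/* Longest programmable period in microseconds: a 16-bit counter loaded
 * with 0 counts 65536 ticks before it wraps. */
static uint32_t pit_max_period_usec(void)
{
	return (uint32_t)((UINT64_C(65536) * 1000000u) / PIT_TICK_RATE);
}
```

Under these assumptions HZ=100 gives a LATCH of 11932, and the longest period is 54925 usec, i.e. about 55 msec.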

And, by the way, I don't think the offset is bogus.  It just
indicates that a great deal of time has passed since the
last interrupt.  Note that the TSC also rolls over (since
only the low 32 bits are used) and this will cause the
offset to be negative.  Also, the base TSC that is used is
only reset on a timer tick, so once you see the error it
will persist until the next timer interrupt (i.e. 1/HZ or 10
msec after you call smack_timer).  Also, on the next timer
interrupt, the wall clock will only be advanced by 10 msec
and the 2 seconds will be lost, i.e. time will appear to
jump backward (since the TSC base will also get reset at
this time).
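
The backward jump described above can be reproduced with a toy model. This is a simplified sketch, not the kernel's bookkeeping: `struct clock_model`, `reported_time` and `timer_tick` are hypothetical, and the model assumes the reported time is simply xtime plus the TSC-derived offset, with the offset returning to zero on each tick.

```c
#include <stdint.h>

/* Toy model: reported time = xtime + offset since the last tick. */
struct clock_model {
	uint64_t xtime_usec;  /* wall clock as of the last timer tick */
	uint64_t offset_usec; /* TSC-derived time since the last tick */
};

/* What a gettimeofday-style read would return in this model. */
static uint64_t reported_time(const struct clock_model *c)
{
	return c->xtime_usec + c->offset_usec;
}

/* A timer tick advances xtime by one jiffy and resets the TSC base,
 * which discards whatever offset had accumulated. */
static void timer_tick(struct clock_model *c, unsigned hz)
{
	c->xtime_usec += 1000000u / hz;
	c->offset_usec = 0;
}
```

With a 2 second offset accumulated and HZ=100, one tick moves the reported time back by 1.99 seconds, because only 10 msec of the 2 seconds is carried into xtime.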

> 2. Is it possible to resurrect what the correct offset should be at this
> point?

I think it is correct.  The "correct" thing to do is to
incorporate it into xtime AND to get the timer restarted in
such a way that it is in tune with the actual time.  The way
to do this is to advance jiffies by the number of whole
jiffies that the offset works out to.  Then the remainder
should be used to advance xtime.tv_usec.  The TSC base
should be reset at this time also.  This allows the next
interrupt to push the advanced jiffies through the update of
the wall clock, taking care of any NTP issues, and all the
other accounting things that need doing.  By doing the above
and smacking the timer, everything should get back on track
on the next tick interrupt.  Meanwhile, gettimeofday
requests will find small offsets (and a
jiffies-wall_jiffies correction) and will give correct
time.  Most importantly, time will not jump backward.
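
The arithmetic of that catch-up step might look like the sketch below. It covers only the split of the offset into whole jiffies and a microsecond remainder. Applying the result to jiffies and xtime.tv_usec, resetting the TSC base, and the locking all of that needs are left out. `struct catchup` and `split_offset` are hypothetical names.

```c
/* Hypothetical result type, for illustration only. */
struct catchup {
	unsigned long whole_jiffies;  /* amount to add to jiffies */
	unsigned long remainder_usec; /* amount to add to xtime.tv_usec */
};

/* Split a microsecond offset into whole jiffies plus a remainder. */
static struct catchup split_offset(unsigned long offset_usec, unsigned long hz)
{
	unsigned long usec_per_jiffy = 1000000UL / hz;
	struct catchup c;

	c.whole_jiffies = offset_usec / usec_per_jiffy;
	c.remainder_usec = offset_usec % usec_per_jiffy;
	return c;
}
```

For example, an offset of 2003500 usec at HZ=100 splits into 200 jiffies and a 3500 usec remainder.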

> 3. If not, what's the best value to use as an offset here?  I'm still
> using the bogus value to calculate the timeofday returned.  Is there a
> better way?

See above.

> 4. What does the code (which I've named smack_timer) do?  Is it correct or
> just lucky?  I kept the workaround code that was already in 2.4.18, but I
> don't understand what it's doing.

See above.  It programs the PIT to interrupt every LATCH
timer units.  LATCH is defined to give 1/HZ ticks.
> 
> Patch attached applies against 2.4.18 and the Red Hat 7.3 kernels.  I'll
> keep my latest version here:
> http://www.reric.net/linux/viatimer/
> 
> Eric
> 

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


* Re: via timer/clock problem workaround
       [not found]       ` <20020523113943.A28069@reric.net>
@ 2002-05-23 16:38         ` george anzinger
  2002-05-23 19:15           ` Eric Seppanen
  0 siblings, 1 reply; 4+ messages in thread
From: george anzinger @ 2002-05-23 16:38 UTC
  To: Eric Seppanen, linux-kernel@vger.kernel.org

Eric Seppanen wrote:
> 
> On Thu, May 23, 2002 at 08:14:52AM -0700, george anzinger wrote:
> > It occurs to me that you could be MUCH more aggressive in
> > your fault detection.  Since the code you are detecting the
> > fault in is interruptible (i.e. if timer interrupts were
> > happening they surely would happen at this time), you should
> > be able to safely assume that a value greater than, oh say,
> > 1.5*1/HZ IS a fault.  You could then immediately do the
> > "fix" code.  Note that it is not enough to just restart the
> > timer since it is now out of step with xtime.  The update I
> > suggested should fix things to, oh say, about a 10 usec or
> > less hiccup.
> 
> OK, makes sense.  My focus for now was to figure out how to detect the
> situation.  Since the offset returned is 71 minutes into the future, the
> number I chose (~1 second) wasn't real important.
> 
> > The unfortunate thing about the fix is that execution of the
> > detection code requires someone to request the time of
> > day.  This, of course, could be delayed by an arbitrary
> > time, depending on system activity.
> 
> Good point.  Looking at it from that perspective, it may be a waste of
> time to put fixes in do_gettimeofday().  A dead timer is pretty serious,
> and I can't think of a simple way to detect it.  A slower, backup timer
> would be an option.  Or maybe doing a timer-sanity-check every _x_
> interrupts.  But good luck getting that past Linus/Marcelo :)
> 
> Best thing would be to figure out what's wrong with the timer and try to
> find a way to use the timer that doesn't suffer this problem.  I don't
> have any good ideas to try here, however.

It would be good to have some input from VIA on this, but I
cannot find any errata on this (or anything else for that
matter -- they must make perfect chips :).

If you have an SMP box, there are other timers (in the APIC)
that generate interrupts, but, of course, most do not have
SMP boxen.  The ONLY thing in a UP box that generates a
stream of interrupts without being acknowledged is the
PIT.  This is why it drives the NMI stuff in SMP boxes, for
example.  There are other watchdogs, but I think they
require special hardware.

On the other hand, WHY does the box need to be busy for this
to fail?  From the timer point of view, exactly what is
busy?  I wonder if it is a voltage sag or some such.



* Re: via timer/clock problem workaround
  2002-05-23 16:38         ` george anzinger
@ 2002-05-23 19:15           ` Eric Seppanen
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Seppanen @ 2002-05-23 19:15 UTC
  To: linux-kernel@vger.kernel.org

Note that most of this thread occurred off-list and this is the tail end.

It looks as though the timer tick interrupt is going away.  If true, 
that's well outside the scope of what can be detected/repaired from inside 
do_gettimeofday().  So the workarounds posted on this list that I've seen 
look like the wrong approach.  That includes the fix that's in 2.4.18.

On Thu, May 23, 2002 at 09:38:12AM -0700, george anzinger wrote:
> Eric Seppanen wrote:
> > george anzinger wrote:
> > > The unfortunate thing about the fix is that execution of the
> > > detection code requires someone to request the time of
> > > day.  This, of course, could be delayed by an arbitrary
> > > time, depending on system activity.
> > 
> > Good point.  Looking at it from that perspective, it may be a waste of
> > time to put fixes in do_gettimeofday().  A dead timer is pretty serious,
> > and I can't think of a simple way to detect it.

On this note, if anybody has any hints on why the timer may be dying, how 
to debug it, or better places to detect/fix it, I'd be grateful.
