2.4.18 clock warps 4294 seconds

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* 2.4.18 clock warps 4294 seconds
@ 2002-07-24 12:35 Per Gregers Bilse
  2002-07-24 13:37 ` Richard B. Johnson
  0 siblings, 1 reply; 11+ messages in thread
From: Per Gregers Bilse @ 2002-07-24 12:35 UTC (permalink / raw)
  To: linux-kernel

[1.] System clock (as returned by eg time(2)) warps 4294 seconds
[2.] Full description of the problem/report can be found below
[3.] system clock, clock warp, Duron
[4.] Linux version 2.4.18 (root@wraith) (gcc version 2.95.3 20010315 (release)) #23 Tue Jul 23 19:22:21 BST 2002
[5.] No Oops
[6.] No small shell script or example program
[7.] Duron 800Mhz, PC Chips M807 mboard, 128M PC100 memory (all BIOS tweaks
set to very conservative values), soft raid
[7.1.] Software
Kernel modules         2.4.13
Gnu C                  2.95.3
Binutils               2.11.90.0.8
Linux C Library        2.2.4
Dynamic linker         ldd (GNU libc) 2.2.4
Procps                 2.0.7
Mount                  2.11g
Net-tools              1.60
Console-tools          0.3.3
Sh-utils               2.0.11
Modules Loaded         8139too mii loop rtc

[7.2.] Processor information (from /proc/cpuinfo):
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 6
model		: 3
model name	: AMD Duron(tm) Processor
stepping	: 1
cpu MHz		: 801.452
cache size	: 64 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips	: 1599.07

[7.3.] Module information (from /proc/modules):
8139too                13792   1
mii                     1120   0 [8139too]
loop                    8496   6 (autoclean)
rtc                     5592   0 (autoclean)

[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
4000-40ff : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
5000-500f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
6000-607f : VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
c000-cfff : PCI Bus #01
  c000-c07f : Silicon Integrated Systems [SiS] 86C326
d000-d00f : VIA Technologies, Inc. Bus Master IDE
  d000-d007 : ide0
  d008-d00f : ide1
d400-d41f : VIA Technologies, Inc. UHCI USB
dc00-dcff : VIA Technologies, Inc. AC97 Audio Controller
e000-e003 : VIA Technologies, Inc. AC97 Audio Controller
e400-e403 : VIA Technologies, Inc. AC97 Audio Controller
e800-e8ff : VIA Technologies, Inc. AC97 Modem Controller
ec00-ecff : Accton Technology Corporation SMC2-1211TX

00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000f0000-000fffff : System ROM
00100000-00efffff : System RAM
  00100000-00282ae9 : Kernel code
  00282aea-002ede97 : Kernel data
00f00000-00ffffff : reserved
01000000-07feffff : System RAM
07ff0000-07ff2fff : ACPI Non-volatile Storage
07ff3000-07ffffff : ACPI Tables
d0000000-d3ffffff : VIA Technologies, Inc. VT8363/8365 [KT133/KM133]
d4000000-d5ffffff : PCI Bus #01
  d5000000-d500ffff : Silicon Integrated Systems [SiS] 86C326
d6000000-d67fffff : PCI Bus #01
  d6000000-d67fffff : Silicon Integrated Systems [SiS] 86C326
d6800000-d68000ff : Accton Technology Corporation SMC2-1211TX
  d6800000-d68000ff : 8139too
ffff0000-ffffffff : reserved

[7.5.] PCI information ('lspci -vvv' as root)
00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133] (rev 03)
	Subsystem: Elitegroup Computer Systems: Unknown device 0987
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
	Latency: 0
	Region 0: Memory at d0000000 (32-bit, prefetchable) [size=64M]
	Capabilities: [a0] AGP version 2.0
		Status: RQ=31 SBA+ 64bit- FW- Rate=x1,x2
		Command: RQ=0 SBA- AGP- 64bit- FW- Rate=<none>
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP] (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
	Latency: 0
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000c000-0000cfff
	Memory behind bridge: d4000000-d5ffffff
	Prefetchable memory behind bridge: d6000000-d67fffff
	BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-
	Capabilities: [80] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 22)
	Subsystem: VIA Technologies, Inc. VT82C686/A PCI to ISA Bridge
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0

00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 10) (prog-if 8a [Master SecP PriP])
	Subsystem: VIA Technologies, Inc. Bus Master IDE
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32
	Region 4: I/O ports at d000 [size=16]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 10) (prog-if 00 [UHCI])
	Subsystem: Unknown device 0925:1234
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32, cache line size 08
	Interrupt: pin D routed to IRQ 5
	Region 4: I/O ports at d400 [size=32]
	Capabilities: [80] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 30)
	Subsystem: Elitegroup Computer Systems: Unknown device 0987
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Interrupt: pin ? routed to IRQ 3
	Capabilities: [68] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.5 Multimedia audio controller: VIA Technologies, Inc. AC97 Audio Controller (rev 20)
	Subsystem: VIA Technologies, Inc. AC97 Audio Controller
	Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Interrupt: pin C routed to IRQ 10
	Region 0: I/O ports at dc00 [size=256]
	Region 1: I/O ports at e000 [size=4]
	Region 2: I/O ports at e400 [size=4]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:07.6 Communication controller: VIA Technologies, Inc. AC97 Modem Controller (rev 20)
	Subsystem: SILICON Laboratories: Unknown device 4c21
	Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Interrupt: pin C routed to IRQ 10
	Region 0: I/O ports at e800 [size=256]
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

00:0d.0 Ethernet controller: Accton Technology Corporation SMC2-1211TX (rev 10)
	Subsystem: Accton Technology Corporation EN-1207D Fast Ethernet Adapter
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32 (8000ns min, 16000ns max)
	Interrupt: pin A routed to IRQ 11
	Region 0: I/O ports at ec00 [size=256]
	Region 1: Memory at d6800000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

01:00.0 VGA compatible controller: Silicon Integrated Systems [SiS] 86C326 (rev 0b) (prog-if 00 [VGA])
	Subsystem: Palit Microsystems Inc. SiS6326 GUI Accelerator
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32 (500ns min)
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at d6000000 (32-bit, prefetchable) [size=8M]
	Region 1: Memory at d5000000 (32-bit, non-prefetchable) [size=64K]
	Region 2: I/O ports at c000 [size=128]
	Expansion ROM at <unassigned> [disabled] [size=64K]
	Capabilities: [40] Power Management version 1
		Flags: PMEClk- DSI- D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] AGP version 1.0
		Status: RQ=1 SBA- 64bit- FW- Rate=x1,x2
		Command: RQ=0 SBA- AGP- 64bit- FW- Rate=<none>

[7.6.] No SCSI information (from /proc/scsi/scsi)
[7.7.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):
[X.] Other notes, patches, fixes, workarounds:

Hi,

I have been experiencing a strange problem, whereby the system clock on
my workstation warps forward 4294 seconds for either a second or two,
or much longer.  The first symptom of the problem was the X screensaver
misbehaving.  I have previously been searching any and all mailing lists
and FAQs, with no result, until the problem occurred on another machine
(same H/W as my workstation) used as a server.  That is the machine
I have taken the information above from (essentially RedHat 7.2 with
a lot of patches, and a GCC "downgrade" to the stable/recommended
version).  I then went searching again, and stumbled upon one posting
referencing the same problem, please see:

http://gwyn.tux.org/hypermail/linux-kernel/1999week20/0728.html

(What clinched my search this time was the "4294" seconds.)

The problem in my case seems to be closely linked to moderate network
activity and the absence (yes, absence) of disk activity.

The server runs among other things a BGP4 implementation I have developed,
and while loading up several sets of full Internet routing tables,
incoming network traffic is oto 500kbps / 30-40pps for a minute or two.
(For the record, the machine is just a host, and the kernel routing tables
are not modified, everything just stays in userland.)  CPU load is only a
few percent during this, and there is very little (just background noise)
disk activity.  During this period I will experience several thousand
instances of time warp per second (ie, time(2) will repeatedly return
a warped value), while everything is fine under otherwise steady state
conditions (little network activity), as well as under any CPU and/or
disk load, including complete congestion.

The plot thickens ...

The server had been up for nearly three months when the problem suddenly
manifested itself very firmly (basically, time-stamp based soft timers
in the BGP implementation were bombing out).  I had never seen it
on the server before, but once it appeared it stayed.  What made me
recognise it was that the warp was the exact same 4294 seconds I had
previously experienced on my workstation, and I'm pretty sure that what
made it appear was that I turned off debugging in the BGP implementation.
With full debug/log output, as I had been running with for the past couple
of months, it produces several Mbytes of logfile output per second while
loading up; with debugging off, there's practically no disk activity.

I've applied the hack/fix suggested in the posting mentioned above,
and the problem seems to have gone away, but I'm wondering if there
might be something else wrong, given that I'm experiencing it on single
CPU machines.

I'm not subscribed to the kernel mailing list, please explicitly Cc
any replies; thanks.

  -- Per G. Bilse


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-24 12:35 2.4.18 clock warps 4294 seconds Per Gregers Bilse
@ 2002-07-24 13:37 ` Richard B. Johnson
  2002-07-24 17:46   ` Per Gregers Bilse
  0 siblings, 1 reply; 11+ messages in thread
From: Richard B. Johnson @ 2002-07-24 13:37 UTC (permalink / raw)
  To: Per Gregers Bilse; +Cc: linux-kernel

On Wed, 24 Jul 2002, Per Gregers Bilse wrote:

> [1.] System clock (as returned by eg time(2)) warps 4294 seconds
[SNIPPED...]

Do you have a time-daemon running that synchronizes your machine to
some other?

Run this program as:
./xxx &>xxx.log &

...and see if your machine is really jumping.

#include <stdio.h>
#include <time.h>


int main()
{
    time_t tim;
    time_t last;
    (void)time(&last);
    (void)time(&tim);
    for(;;)
    {
        (void)time(&tim);
        if(tim != last)
        {
            if((last +1) != tim)
            {
                fprintf(stderr,"Time = %08lx\n", tim);
                fprintf(stderr,"Last = %08lx\n", last);
            }
            last = tim;
        }
    }
}

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-24 13:37 ` Richard B. Johnson
@ 2002-07-24 17:46   ` Per Gregers Bilse
  2002-07-24 18:50     ` george anzinger
  0 siblings, 1 reply; 11+ messages in thread
From: Per Gregers Bilse @ 2002-07-24 17:46 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

Hi,

On Jul 24,  9:37am, "Richard B. Johnson" <root@chaos.analogic.com> wrote:
> Do you have a time-daemon running that synchronizes your machine to
> some other?

NTP is running on both machines.

> Run this program as:
> ./xxx &>xxx.log &
> 
> ...and see if your machine is really jumping.

I have effectively the same in the code already running.  Once the
problem on the server started (yesterday), I added some sanity
(abbreviated for clarity):

    waitinterP = rt_needbgwalk() ? &zerosec : &onesec;
    ...
    sval = pselect(sockhi,&readset,&writeset,&exceptset,waitinterP,0);
    ...
#ifdef SHAKY_CLOCK
    newnow = time((time_t*)0);
    if ( newnow > now + 100 ) {  /* Some H/W bug on some mboards? */
      if ( warpcount % 1000 == 0 )
        error("timewarp fwd %d: now %d new %d",warpcount,now,newnow);
      warpcount++;
      now += sval == 0 ? waitinterP->tv_sec : 0;
    }
    else {
      ....
      now = newnow;
    }
#else
    now = time((time_t*)0);
#endif

Here's one snippet of output:

timewarp fwd 0: now 1027434974 new 1027439268
timewarp fwd 1000: now 1027434977 new 1027439271
timewarp fwd 2000: now 1027434978 new 1027439272

I know it's a tall order, but believe me, it really is happening.
I've always had it on my workstation, ever since I started using it,
first with a 2.4.7-10 kernel.  It's happened many many times that I
would slide the mouse down to the bottom of the screen and the autohiding
taskbar would appear, clock to the right, and the next second the clock
would jump forward an hour and some minutes.  I thought it was an obscure
bug in X or Gnome until it happened on the server.

As mentioned, the workaround suggested by the posting from 1999 seems
to be doing the trick (and the sanity code I added to my stuff handles
the problem by guessing), so it isn't a big issue for me, but I would
imagine other people with the same hardware are being bitten by it.

Thanks for your note.

  -- Per G. Bilse

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-24 17:46   ` Per Gregers Bilse
@ 2002-07-24 18:50     ` george anzinger
  2002-07-25 10:01       ` Per Gregers Bilse
  0 siblings, 1 reply; 11+ messages in thread
From: george anzinger @ 2002-07-24 18:50 UTC (permalink / raw)
  To: Per Gregers Bilse; +Cc: root, linux-kernel

Per Gregers Bilse wrote:
> 
> Hi,
> 
> On Jul 24,  9:37am, "Richard B. Johnson" <root@chaos.analogic.com> wrote:
> > Do you have a time-daemon running that synchronizes your machine to
> > some other?
> 
> NTP is running on both machines.

I doubt that NTP has anything to do with this.  It looks
like the TSC difference used to fill in the sub 1/HZ time
info is overflowing.  This would give a negative difference
(well it would if not unsigned) which would work out to the
4294 sec warp.  But on a 800 MHz box this means that it has
missed 2.68 (or so) seconds worth of ticks.  At the warp
back, it would NOT make them up, however the ntp code should
notice and start a correction sequence.  To confirm this you
might look at the ntp logs, AND/OR you could modify the test
code to query the other machine and report the time
difference.  Something this big should show up, even if the
network is slow.  Let us know if you try either of these.

Interestingly if this is the case, fixing the TSC difference
code to use 64 bits is not the whole solution, as we also
would need to account for the missed ticks.

And, no, this would not show up in the file times as they
use xtime without the sub 1/HZ correction.

-g
> 
> > Run this program as:
> > ./xxx &>xxx.log &
> >
> > ...and see if your machine is really jumping.
> 
> I have effectively the same in the code already running.  Once the
> problem on the server started (yesterday), I added some sanity
> (abbreviated for clarity):
> 
>     waitinterP = rt_needbgwalk() ? &zerosec : &onesec;
>     ...
>     sval = pselect(sockhi,&readset,&writeset,&exceptset,waitinterP,0);
>     ...
> #ifdef SHAKY_CLOCK
>     newnow = time((time_t*)0);
>     if ( newnow > now + 100 ) {  /* Some H/W bug on some mboards? */
>       if ( warpcount % 1000 == 0 )
>         error("timewarp fwd %d: now %d new %d",warpcount,now,newnow);
>       warpcount++;
>       now += sval == 0 ? waitinterP->tv_sec : 0;
>     }
>     else {
>       ....
>       now = newnow;
>     }
> #else
>     now = time((time_t*)0);
> #endif
> 
> Here's one snippet of output:
> 
> timewarp fwd 0: now 1027434974 new 1027439268
> timewarp fwd 1000: now 1027434977 new 1027439271
> timewarp fwd 2000: now 1027434978 new 1027439272
> 
> I know it's a tall order, but believe me, it really is happening.
> I've always had it on my workstation, ever since I started using it,
> first with a 2.4.7-10 kernel.  It's happened many many times that I
> would slide the mouse down to the bottom of the screen and the autohiding
> taskbar would appear, clock to the right, and the next second the clock
> would jump forward an hour and some minutes.  I thought it was an obscure
> bug in X or Gnome until it happened on the server.
> 
> As mentioned, the workaround suggested by the posting from 1999 seems
> to be doing the trick (and the sanity code I added to my stuff handles
> the problem by guessing), so it isn't a big issue for me, but I would
> imagine other people with the same hardware are being bitten by it.
> 
> Thanks for your note.
> 
>   -- Per G. Bilse
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-24 18:50     ` george anzinger
@ 2002-07-25 10:01       ` Per Gregers Bilse
  2002-07-25 10:36         ` Per Gregers Bilse
  0 siblings, 1 reply; 11+ messages in thread
From: Per Gregers Bilse @ 2002-07-25 10:01 UTC (permalink / raw)
  To: george anzinger; +Cc: linux-kernel

On Jul 24, 11:50am, george anzinger <george@mvista.com> wrote:
> missed 2.68 (or so) seconds worth of ticks.  At the warp
> back, it would NOT make them up, however the ntp code should
> notice and start a correction sequence.  To confirm this you
> might look at the ntp logs, AND/OR you could modify the test

This is indeed happening.  I've been puzzled by both machines
suffering from repeated loss of NTP synch, here's from the logs:

/var/log/messages.4:Jun 24 06:52:14 wraith ntpd[24927]: synchronisation lost
/var/log/messages.4:Jun 25 04:13:44 wraith ntpd[24927]: synchronisation lost
/var/log/messages.4:Jun 25 07:19:06 wraith ntpd[24927]: synchronisation lost
/var/log/messages.4:Jun 25 12:35:14 wraith ntpd[24927]: synchronisation lost
/var/log/messages.4:Jun 25 12:38:42 wraith ntpd[24927]: synchronisation lost
/var/log/messages.4:Jun 26 14:24:30 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jun 30 21:02:45 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  1 14:18:47 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  1 14:24:12 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  1 20:13:25 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  2 22:45:52 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  3 01:12:24 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  3 12:23:15 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  3 16:46:18 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  3 18:19:17 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  3 23:51:23 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  4 04:08:52 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  4 19:38:39 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  5 10:04:17 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  6 06:08:25 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  6 20:07:39 wraith ntpd[24927]: synchronisation lost
/var/log/messages.3:Jul  7 00:44:08 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul  7 12:26:35 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul  8 01:21:44 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul  8 12:49:24 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul  8 12:58:11 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul  8 16:16:52 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul  8 23:19:42 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul 10 03:22:27 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul 10 03:40:45 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul 10 05:12:57 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul 10 11:20:37 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul 10 11:24:05 wraith ntpd[24927]: synchronisation lost
/var/log/messages.2:Jul 13 10:09:24 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 17:09:57 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 17:31:27 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 19:14:15 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 19:43:14 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 20:26:19 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 20:37:12 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 17 21:04:03 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 19 13:02:30 wraith ntpd[24927]: synchronisation lost
/var/log/messages.1:Jul 19 19:29:56 wraith ntpd[24927]: synchronisation lost
/var/log/messages:Jul 23 05:24:06 wraith ntpd[24927]: synchronisation lost
/var/log/messages:Jul 23 19:28:54 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 24 01:18:13 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 24 06:39:41 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 24 14:47:58 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 24 14:52:34 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 24 16:28:00 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 24 22:10:54 wraith ntpd[690]: synchronisation lost
/var/log/messages:Jul 25 00:17:05 wraith ntpd[690]: synchronisation lost

I don't think I'm really losing genuine NTP synch:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LCL.            0 l    3   64  377    0.000    0.000   0.015
-bow.qbfox.com   Time4.Stupi.SE   2 u 1005 1024  377    0.015   73.062  17.196
-spur.qbfox.com  chronos.EU.net   2 u  218 1024  377    0.015   45.572  26.202
-chronos.EU.net  .GPS.            1 u  38m  68m  377   24.867   30.629  72.573
*Time1.Stupi.SE  .PPS.            1 u  58m  68m  377  116.053   43.776  20.203
-Time2.Stupi.SE  .PPS.            1 u 2035  68m  377  109.794   76.732  83.159
+Time3.Stupi.SE  Time2.Stupi.SE   2 u  58m  68m  377  112.871   40.217  35.787
+Time4.Stupi.SE  .PPS.            1 u  58m  68m  377  109.851   40.490  28.328
-veracity.mcc.ac swisstime.ee.et  2 u  59m  68m  337   24.609  -36.300  12.049
#ice.mcc.ac.uk   swisstime.ee.et  2 u  38m  68m  237    9.838   34.553  79.622
 ntp0.teletext.n 71.80.83.0       2 u  38m  68m  377    9.917   25.972 5780.24
#fnbhs.com       ntp0.mcs.anl.go  2 u  38m  68m  377  204.931  -27.539 126.563
#triangle.kansas navobs1.wustl.e  2 u  36m  68m  377  142.747   33.596  73.565
#cisco1-mhk.kans navobs1.wustl.e  2 u  38m  68m  377  139.877   34.206  72.558
#lovenun.maineco prowlingtabby.m  2 u  38m  68m  377  369.278  -60.517  18.498
#ns1.usg.edu     navobs1.gatech.  2 u  38m  68m  377   98.843   35.825  78.009
#tock.CS.unlv.ed usno.pa-x.dec.c  2 u  38m  68m  377  379.879  -71.831  22.986
#thor.capfed1.si rdu163-37-160.n  3 u  68m  68m  377  239.856  -51.136   3.906
#augean.eleceng. murgon.cs.mu.OZ  2 u  38m  68m  377  349.874   31.117  80.168
#WWW.UREGINA.CA  CLOCK.UREGINA.C  2 u  38m  68m  377  321.107  -43.573   0.314
#ns.keso.fi      ntp-cup.externa  2 u  36m  68m  377   89.870   44.390  65.048
#ns2.keso.fi     swisstime.ee.et  2 u  65m  68m  337   80.688  -41.038   5.025
-zen.via.ecp.fr  horlogegps.rese  2 u  38m  68m  377   31.053   34.715   2.282
-fartein.ifi.uio swisstime.ee.et  2 u  59m  68m  377   55.164  -41.223  11.385
#alpha.prao.psn. ntp0-rz.rrze.un  2 u 2042  68m  377  101.028   39.420  83.563
#maverick.mcc.ac ntp1.ja.net      2 u  38m  68m  377   18.858   59.411 176.100

(I turned into a bit of an NTP aficionado after having ported NTPv4 while
working at Cisco a couple of years ago.-)  I'm on a leased line into
the MAE-East equivalent in London, so I don't think it will be network
issues causing NTP synch loss.  Also notice nearly all NTP peers having
full reachability -- "reach" is in octal, it's 8 bits shifting left
indicating whether or not a reply was received at each "poll" interval.

I've added some debug to arch/i386/kernel/time.c:

[root@vulpes kernel]# diff -c time.c.orig time.c
*** time.c.orig	Thu Jul 25 10:36:41 2002
--- time.c	Thu Jul 25 10:39:10 2002
***************
*** 83,88 ****
--- 83,90 ----
  
  spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;
  
+ static unsigned long glob_eax;
+ 
  static inline unsigned long do_fast_gettimeoffset(void)
  {
  	register unsigned long eax, edx;
***************
*** 94,99 ****
--- 96,107 ----
  	/* .. relative to previous jiffy (32 bits is enough) */
  	eax -= last_tsc_low;	/* tsc_low delta */
  
+ 	if ( eax > 5000000 ) {
+ 		glob_eax = eax;
+ 		return delay_at_last_interrupt;
+ 	}
+ 
+ 
  	/*
           * Time offset = (tsc_low delta) * fast_gettimeoffset_quotient
           *             = (tsc_low delta) * (usecs_per_clock)
***************
*** 286,291 ****
--- 294,304 ----
  
  	tv->tv_sec = sec;
  	tv->tv_usec = usec;
+ 
+ 	if ( glob_eax != 0 ) {
+ 		printk("do_gettimeofday eax %lu\n",glob_eax);
+ 		glob_eax = 0;
+ 	}
  }
  
  void do_settimeofday(struct timeval *tv)
[root@vulpes kernel]# 


Ie, set flag/value in do_fast_gettimeoffset(), check/print in
do_gettimeofday().  I figure that should catch it if it happens
again, and safely so.

Thanks for your note.

  -- Per


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-25 10:01       ` Per Gregers Bilse
@ 2002-07-25 10:36         ` Per Gregers Bilse
  2002-07-25 17:34           ` george anzinger
  0 siblings, 1 reply; 11+ messages in thread
From: Per Gregers Bilse @ 2002-07-25 10:36 UTC (permalink / raw)
  To: george anzinger; +Cc: linux-kernel

On Jul 25, 11:01am, Per Gregers Bilse <bilse@qbfox.com> wrote:
> Ie, set flag/value in do_fast_gettimeoffset(), check/print in
> do_gettimeofday().  I figure that should catch it if it happens
> again, and safely so.

Not sure if that's the right thing to do, the check succeeds very
frequently.  A modified version, printing only every 1000th time,
yields:

Jul 25 11:18:36 vulpes kernel: do_gettimeofday warp 0 eax 6968466
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 1000 eax 5301486
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 2000 eax 5814470
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 3000 eax 6238995
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 4000 eax 6662828
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 5000 eax 7109559
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 6000 eax 7534440
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 7000 eax 7959034
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 8000 eax 6302366
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 9000 eax 6736399
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 10000 eax 7160461
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 11000 eax 7584511
Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 12000 eax 5855713
Jul 25 11:19:15 vulpes kernel: do_gettimeofday warp 13000 eax 6265463
Jul 25 11:19:15 vulpes kernel: do_gettimeofday warp 14000 eax 7992560
Jul 25 11:19:15 vulpes kernel: do_gettimeofday warp 15000 eax 6263791
Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 16000 eax 5247486
Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 17000 eax 6358315
Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 18000 eax 7393344
Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 19000 eax 6237626
Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 20000 eax 6406708
Jul 25 11:19:21 vulpes kernel: do_gettimeofday warp 21000 eax 6005906
Jul 25 11:19:21 vulpes kernel: do_gettimeofday warp 22000 eax 5056750
Jul 25 11:19:23 vulpes kernel: do_gettimeofday warp 23000 eax 6434750
Jul 25 11:19:24 vulpes kernel: do_gettimeofday warp 24000 eax 5093940
Jul 25 11:19:25 vulpes kernel: do_gettimeofday warp 25000 eax 6269335
Jul 25 11:19:26 vulpes kernel: do_gettimeofday warp 26000 eax 5596617
Jul 25 11:19:27 vulpes kernel: do_gettimeofday warp 27000 eax 7918244
Jul 25 11:19:29 vulpes kernel: do_gettimeofday warp 28000 eax 7266647
Jul 25 11:19:33 vulpes kernel: do_gettimeofday warp 29000 eax 5293260
Jul 25 11:19:53 vulpes kernel: do_gettimeofday warp 30000 eax 6751016
Jul 25 11:21:32 vulpes kernel: do_gettimeofday warp 31000 eax 6213632
Jul 25 11:23:24 vulpes kernel: do_gettimeofday warp 32000 eax 6997568
Jul 25 11:24:16 vulpes kernel: do_gettimeofday warp 33000 eax 5326413
Jul 25 11:25:10 vulpes kernel: do_gettimeofday warp 34000 eax 5717715
Jul 25 11:25:37 vulpes kernel: do_gettimeofday warp 35000 eax 6464694
Jul 25 11:26:19 vulpes kernel: do_gettimeofday warp 36000 eax 5969042
Jul 25 11:27:05 vulpes kernel: do_gettimeofday warp 37000 eax 7196006
Jul 25 11:28:10 vulpes kernel: do_gettimeofday warp 38000 eax 7129216
Jul 25 11:28:40 vulpes kernel: do_gettimeofday warp 39000 eax 5971848
Jul 25 11:29:04 vulpes kernel: do_gettimeofday warp 40000 eax 6347832
Jul 25 11:29:38 vulpes kernel: do_gettimeofday warp 41000 eax 6769141
Jul 25 11:30:19 vulpes kernel: do_gettimeofday warp 42000 eax 5279110
Jul 25 11:31:18 vulpes kernel: do_gettimeofday warp 43000 eax 6256288
Jul 25 11:31:34 vulpes kernel: do_gettimeofday warp 44000 eax 6483175

Here are complete code excerpts from arch/i386/kernel/time.c,
look for glob_eax:

static unsigned long glob_eax;

static inline unsigned long do_fast_gettimeoffset(void)
{
        register unsigned long eax, edx;

        /* Read the Time Stamp Counter */

        rdtsc(eax,edx);

        /* .. relative to previous jiffy (32 bits is enough) */
        eax -= last_tsc_low;    /* tsc_low delta */

        if ( eax > 5000000 ) {
                glob_eax = eax;
                return delay_at_last_interrupt;
        }


        /*
         * Time offset = (tsc_low delta) * fast_gettimeoffset_quotient
         *             = (tsc_low delta) * (usecs_per_clock)
         *             = (tsc_low delta) * (usecs_per_jiffy / clocks_per_jiffy)
         *
         * Using a mull instead of a divl saves up to 31 clock cycles
         * in the critical path.
         */

        __asm__("mull %2"
                :"=a" (eax), "=d" (edx)
                :"rm" (fast_gettimeoffset_quotient),
                 "0" (eax));

        /* our adjusted time offset in microseconds */
        return delay_at_last_interrupt + edx;
}


void do_gettimeofday(struct timeval *tv)
{
        unsigned long flags;
        unsigned long usec, sec;
        static unsigned warp;

        read_lock_irqsave(&xtime_lock, flags);
        usec = do_gettimeoffset();
        {
                unsigned long lost = jiffies - wall_jiffies;
                if (lost)
                        usec += lost * (1000000 / HZ);
        }
        sec = xtime.tv_sec;
        usec += xtime.tv_usec;
        read_unlock_irqrestore(&xtime_lock, flags);

        while (usec >= 1000000) {
                usec -= 1000000;
                sec++;
        }

        tv->tv_sec = sec;
        tv->tv_usec = usec;

        if ( glob_eax != 0 ) {
                if ( warp % 1000 == 0 )
                        printk("do_gettimeofday warp %u eax %lu\n",warp,glob_eax);
                warp++;
                glob_eax = 0;
        }
}

Is this correctly detecting the problem?  Or is the low word (eax) of the TSC
a red herring?

Thanks.

  -- Per


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-25 10:36         ` Per Gregers Bilse
@ 2002-07-25 17:34           ` george anzinger
  2002-07-26 10:22             ` Per Gregers Bilse
  0 siblings, 1 reply; 11+ messages in thread
From: george anzinger @ 2002-07-25 17:34 UTC (permalink / raw)
  To: Per Gregers Bilse; +Cc: linux-kernel

Per Gregers Bilse wrote:
> 
> On Jul 25, 11:01am, Per Gregers Bilse <bilse@qbfox.com> wrote:
> > Ie, set flag/value in do_fast_gettimeoffset(), check/print in
> > do_gettimeofday().  I figure that should catch it if it happens
> > again, and safely so.
> 
> Not sure if that's the right thing to do, the check succeeds very
> frequently.  A modified version, printing only every 1000th time,
> yields:

You have the number a bit low.  If I recall, this is an 800
MHz machine, so each TSC tick is 1.25 nano seconds.  This
means your threshold (with 5,000,000) is 6.25 ms.  The
system should easily recover from anything up to about 9.9
ms.  If you were to use 8,000,000 it would trigger on loss
of a tick (i.e. 10 ms).  On the other hand, the warp you see
is caused by the subtract overflowing.  For this the test
would be something like:

if ( eax & 0x80000000)  (i.e. is the sign bit set?)  OR you
could do:

if ( (int)eax < 0)


Now if we are convinced that you are having these long
"black outs" the question turns to why.  We are talking
about an interrupt hold off here, not a preemption issue. 
The first thing I would check is that you are using DMA for
you disc transfers.  To the best of my knowledge, the
default config file still has DMA off for these transfers
and this can cause long delays.  Beyond this, I would look
for special content in your system.  From the analytical
point of view, some analysis of what is running when this
happens might be of use.

-g


> 
> Jul 25 11:18:36 vulpes kernel: do_gettimeofday warp 0 eax 6968466
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 1000 eax 5301486
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 2000 eax 5814470
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 3000 eax 6238995
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 4000 eax 6662828
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 5000 eax 7109559
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 6000 eax 7534440
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 7000 eax 7959034
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 8000 eax 6302366
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 9000 eax 6736399
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 10000 eax 7160461
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 11000 eax 7584511
> Jul 25 11:18:39 vulpes kernel: do_gettimeofday warp 12000 eax 5855713
> Jul 25 11:19:15 vulpes kernel: do_gettimeofday warp 13000 eax 6265463
> Jul 25 11:19:15 vulpes kernel: do_gettimeofday warp 14000 eax 7992560
> Jul 25 11:19:15 vulpes kernel: do_gettimeofday warp 15000 eax 6263791
> Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 16000 eax 5247486
> Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 17000 eax 6358315
> Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 18000 eax 7393344
> Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 19000 eax 6237626
> Jul 25 11:19:20 vulpes kernel: do_gettimeofday warp 20000 eax 6406708
> Jul 25 11:19:21 vulpes kernel: do_gettimeofday warp 21000 eax 6005906
> Jul 25 11:19:21 vulpes kernel: do_gettimeofday warp 22000 eax 5056750
> Jul 25 11:19:23 vulpes kernel: do_gettimeofday warp 23000 eax 6434750
> Jul 25 11:19:24 vulpes kernel: do_gettimeofday warp 24000 eax 5093940
> Jul 25 11:19:25 vulpes kernel: do_gettimeofday warp 25000 eax 6269335
> Jul 25 11:19:26 vulpes kernel: do_gettimeofday warp 26000 eax 5596617
> Jul 25 11:19:27 vulpes kernel: do_gettimeofday warp 27000 eax 7918244
> Jul 25 11:19:29 vulpes kernel: do_gettimeofday warp 28000 eax 7266647
> Jul 25 11:19:33 vulpes kernel: do_gettimeofday warp 29000 eax 5293260
> Jul 25 11:19:53 vulpes kernel: do_gettimeofday warp 30000 eax 6751016
> Jul 25 11:21:32 vulpes kernel: do_gettimeofday warp 31000 eax 6213632
> Jul 25 11:23:24 vulpes kernel: do_gettimeofday warp 32000 eax 6997568
> Jul 25 11:24:16 vulpes kernel: do_gettimeofday warp 33000 eax 5326413
> Jul 25 11:25:10 vulpes kernel: do_gettimeofday warp 34000 eax 5717715
> Jul 25 11:25:37 vulpes kernel: do_gettimeofday warp 35000 eax 6464694
> Jul 25 11:26:19 vulpes kernel: do_gettimeofday warp 36000 eax 5969042
> Jul 25 11:27:05 vulpes kernel: do_gettimeofday warp 37000 eax 7196006
> Jul 25 11:28:10 vulpes kernel: do_gettimeofday warp 38000 eax 7129216
> Jul 25 11:28:40 vulpes kernel: do_gettimeofday warp 39000 eax 5971848
> Jul 25 11:29:04 vulpes kernel: do_gettimeofday warp 40000 eax 6347832
> Jul 25 11:29:38 vulpes kernel: do_gettimeofday warp 41000 eax 6769141
> Jul 25 11:30:19 vulpes kernel: do_gettimeofday warp 42000 eax 5279110
> Jul 25 11:31:18 vulpes kernel: do_gettimeofday warp 43000 eax 6256288
> Jul 25 11:31:34 vulpes kernel: do_gettimeofday warp 44000 eax 6483175
> 
> Here are complete code excerpts from arch/i386/kernel/time.c,
> look for glob_eax:
> 
> static unsigned long glob_eax;
> 
> static inline unsigned long do_fast_gettimeoffset(void)
> {
>         register unsigned long eax, edx;
> 
>         /* Read the Time Stamp Counter */
> 
>         rdtsc(eax,edx);
> 
>         /* .. relative to previous jiffy (32 bits is enough) */
>         eax -= last_tsc_low;    /* tsc_low delta */
> 
>         if ( eax > 5000000 ) {
>                 glob_eax = eax;
>                 return delay_at_last_interrupt;
>         }
> 
>         /*
>          * Time offset = (tsc_low delta) * fast_gettimeoffset_quotient
>          *             = (tsc_low delta) * (usecs_per_clock)
>          *             = (tsc_low delta) * (usecs_per_jiffy / clocks_per_jiffy)
>          *
>          * Using a mull instead of a divl saves up to 31 clock cycles
>          * in the critical path.
>          */
> 
>         __asm__("mull %2"
>                 :"=a" (eax), "=d" (edx)
>                 :"rm" (fast_gettimeoffset_quotient),
>                  "0" (eax));
> 
>         /* our adjusted time offset in microseconds */
>         return delay_at_last_interrupt + edx;
> }
> 
> void do_gettimeofday(struct timeval *tv)
> {
>         unsigned long flags;
>         unsigned long usec, sec;
>         static unsigned warp;
> 
>         read_lock_irqsave(&xtime_lock, flags);
>         usec = do_gettimeoffset();
>         {
>                 unsigned long lost = jiffies - wall_jiffies;
>                 if (lost)
>                         usec += lost * (1000000 / HZ);
>         }
>         sec = xtime.tv_sec;
>         usec += xtime.tv_usec;
>         read_unlock_irqrestore(&xtime_lock, flags);
> 
>         while (usec >= 1000000) {
>                 usec -= 1000000;
>                 sec++;
>         }
> 
>         tv->tv_sec = sec;
>         tv->tv_usec = usec;
> 
>         if ( glob_eax != 0 ) {
>                 if ( warp % 1000 == 0 )
>                         printk("do_gettimeofday warp %u eax %lu\n",warp,glob_eax);
>                 warp++;
>                 glob_eax = 0;
>         }
> }
> 
> Is this correctly detecting the problem?  Or is the low word (eax) of the TSC
> a red herring?
> 
> Thanks.
> 
>   -- Per

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-25 17:34           ` george anzinger
@ 2002-07-26 10:22             ` Per Gregers Bilse
  2002-07-29 18:39               ` george anzinger
  0 siblings, 1 reply; 11+ messages in thread
From: Per Gregers Bilse @ 2002-07-26 10:22 UTC (permalink / raw)
  To: george anzinger; +Cc: linux-kernel

On Jul 25, 10:34am, george anzinger <george@mvista.com> wrote:
> You have the number a bit low.  If I recall, this is an 800

Yes, I figured that, and bumped it up a lot (to 1e9).  Of course, since
setting the trap, things have been fine, including no loss of NTP synch.-/
Let's see over the weekend.

> The first thing I would check is that you are using DMA for
> you disc transfers.  To the best of my knowledge, the

Yes, both machines and both disks use DMA, and also allow interrupts
("unmaskirq" option) during disk transfers, here's from hdparm(8):

/dev/hda:
 multcount    = 16 (on)
 I/O support  =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 2434/255/63, sectors = 39102336, start = 0

The only slightly unusual thing is that both machines use soft RAID,
I don't know if that code might be doing something.  But the problem
occurred at the same time as I made an application change (debug/logging)
that vastly -reduced- disk I/O.

Anyway, I've been looking through archived log files, and found a few
entries from the 2.4.7-10 kernel that looked interesting, here's a pair:

Feb 23 04:07:52 vulpes kernel: probable hardware bug: clock timer configuration lost - probably a VIA686a motherboard.
Feb 23 04:07:52 vulpes kernel: probable hardware bug: restoring chip configuration.

Both machines indeed have identical VIA686a motherboards.  The messages
come from code in timer_interrupt() in time.c:

                /* read Pentium cycle counter */

                rdtscl(last_tsc_low);

                spin_lock(&i8253_lock);
                outb_p(0x00, 0x43);     /* latch the count ASAP */

                count = inb_p(0x40);    /* read the latched count */
                count |= inb(0x40) << 8;

                /* VIA686a test code... reset the latch if count > max */
                if (count > LATCH) {
                        static int last_whine;
                        outb_p(0x34, 0x43);
                        outb_p(LATCH & 0xff, 0x40);
                        outb(LATCH >> 8, 0x40);
                        count = LATCH - 1;
                        if(time_after(jiffies, last_whine))
                        {
                                printk(KERN_WARNING "probable hardware bug: clock timer configuration lost - probably a VIA686a motherboard.\n");
                                printk(KERN_WARNING "probable hardware bug: restoring chip configuration.\n");
                                last_whine = jiffies + HZ;
                        }
                }

                spin_unlock(&i8253_lock);

The "if (count > LATCH)" block has been taken out of the 2.4.18
kernel, while similar code is in do_slow_gettimeoffset() in both
the 2.4.7-10 and 2.4.18 kernels.  I'm not sufficiently familiar
with the hardware and the code to know if this is significant,
but it does seem that there are some known hardware bugs which
the earlier kernel tried to address (but with limited or no success).

Anyway, let's see what happens over the weekend.

Thanks.

  -- Per

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-26 10:22             ` Per Gregers Bilse
@ 2002-07-29 18:39               ` george anzinger
  2002-07-31 11:22                 ` Per Gregers Bilse
  0 siblings, 1 reply; 11+ messages in thread
From: george anzinger @ 2002-07-29 18:39 UTC (permalink / raw)
  To: Per Gregers Bilse; +Cc: linux-kernel

Per Gregers Bilse wrote:
> 
> On Jul 25, 10:34am, george anzinger <george@mvista.com> wrote:
> > You have the number a bit low.  If I recall, this is an 800
> 
> Yes, I figured that, and bumped it up a lot (to 1e9).  Of course, since
> setting the trap, things have been fine, including no loss of NTP synch.-/
> Let's see over the weekend.
> 
> > The first thing I would check is that you are using DMA for
> > you disc transfers.  To the best of my knowledge, the
> 
> Yes, both machines and both disks use DMA, and also allow interrupts
> ("unmaskirq" option) during disk transfers, here's from hdparm(8):
> 
> /dev/hda:
>  multcount    = 16 (on)
>  I/O support  =  1 (32-bit)
>  unmaskirq    =  1 (on)
>  using_dma    =  1 (on)
>  keepsettings =  0 (off)
>  nowerr       =  0 (off)
>  readonly     =  0 (off)
>  readahead    =  8 (on)
>  geometry     = 2434/255/63, sectors = 39102336, start = 0
> 
> The only slightly unusual thing is that both machines use soft RAID,
> I don't know if that code might be doing something.  But the problem
> occurred at the same time as I made an application change (debug/logging)
> that vastly -reduced- disk I/O.
> 
> Anyway, I've been looking through archived log files, and found a few
> entries from the 2.4.7-10 kernel that looked interesting, here's a pair:
> 
> Feb 23 04:07:52 vulpes kernel: probable hardware bug: clock timer configuration lost - probably a VIA686a motherboard.
> Feb 23 04:07:52 vulpes kernel: probable hardware bug: restoring chip configuration.
> 
> Both machines indeed have identical VIA686a motherboards.  The messages
> come from code in timer_interrupt() in time.c:
> 
>                 /* read Pentium cycle counter */
> 
>                 rdtscl(last_tsc_low);
> 
>                 spin_lock(&i8253_lock);
>                 outb_p(0x00, 0x43);     /* latch the count ASAP */
> 
>                 count = inb_p(0x40);    /* read the latched count */
>                 count |= inb(0x40) << 8;
> 
>                 /* VIA686a test code... reset the latch if count > max */
>                 if (count > LATCH) {
>                         static int last_whine;
>                         outb_p(0x34, 0x43);
>                         outb_p(LATCH & 0xff, 0x40);
>                         outb(LATCH >> 8, 0x40);
>                         count = LATCH - 1;
>                         if(time_after(jiffies, last_whine))
>                         {
>                                 printk(KERN_WARNING "probable hardware bug: clock timer configuration lost - probably a VIA686a motherboard.\n");
>                                 printk(KERN_WARNING "probable hardware bug: restoring chip configuration.\n");
>                                 last_whine = jiffies + HZ;
>                         }
>                 }
> 
>                 spin_unlock(&i8253_lock);
> 
> The "if (count > LATCH)" block has been taken out of the 2.4.18

I am not sure it was ever in the kernel in that form.  Are
you sure you did not put some patch in here?

> kernel, while similar code is in do_slow_gettimeoffset() in both
> the 2.4.7-10 and 2.4.18 kernels.  I'm not sufficiently familiar
> with the hardware and the code to know if this is significant,
> but it does seem that there are some known hardware bugs which
> the earlier kernel tried to address (but with limited or no success).

I wish I knew more about this hardware bug.  The test
suggests that the chip is not resetting the latch on
interrupt, but rather that it just rolls over (or under). 
This would cause the count to, again, reach zero (and,
hopefully interrupt) in about 50 ms.  On the other hand, the
chip could be switching modes and only the "0X34" mode will
continue to interrupt with out the chip being reprogrammed. 
In this case, it is hard to understand how the system keeps
ANY time at all.  

The above "fix" detects the count not being reset and
reprograms the chip, it does not attempt to correct for any
lost time.
> 
> Anyway, let's see what happens over the weekend.
> 
> Thanks.
> 
>   -- Per

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-29 18:39               ` george anzinger
@ 2002-07-31 11:22                 ` Per Gregers Bilse
  2002-08-29 15:19                   ` Per Gregers Bilse
  0 siblings, 1 reply; 11+ messages in thread
From: Per Gregers Bilse @ 2002-07-31 11:22 UTC (permalink / raw)
  To: george anzinger; +Cc: linux-kernel

On Jul 29, 11:39am, george anzinger <george@mvista.com> wrote:
> > The "if (count > LATCH)" block has been taken out of the 2.4.18
> 
> I am not sure it was ever in the kernel in that form.  Are
> you sure you did not put some patch in here?

I've never deliberately touched anything in there.  Everything
will have come from RedHat (7.2 CDs) or kernel.org, and I'm
pretty sure I didn't install any interim and/or experimental
versions or patches.

> I wish I knew more about this hardware bug.  The test
> suggests that the chip is not resetting the latch on
> interrupt, but rather that it just rolls over (or under). 
> This would cause the count to, again, reach zero (and,
> hopefully interrupt) in about 50 ms.  On the other hand, the
> chip could be switching modes and only the "0X34" mode will
> continue to interrupt with out the chip being reprogrammed. 
> In this case, it is hard to understand how the system keeps
> ANY time at all.  

Both machines generally seem to keep time just fine, apart
from the bumps.  But then, there were only a few of the
hardware bug warning messages in the logs, so it didn't
manifest itself very often.

Anyway, nothing caught over the weekend, unfortunately.  The trap
in do_fast_gettimeoffset() reads:

        if ( eax > 100000000 ) {
                glob_eax = eax;
                return delay_at_last_interrupt;
        }

Then a check and printk() in do_gettimeofday().  The machine
has lost NTP synch somewhat sporadically, but not as regularly
as before.  Sigh.  Maybe it's the weather, we've been having a
heatwave since Friday, and everybody and everything in the office
is roasting.

I'm going to try to fiddle around some more, also with the other
machine.

  -- Per

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: 2.4.18 clock warps 4294 seconds
  2002-07-31 11:22                 ` Per Gregers Bilse
@ 2002-08-29 15:19                   ` Per Gregers Bilse
  0 siblings, 0 replies; 11+ messages in thread
From: Per Gregers Bilse @ 2002-08-29 15:19 UTC (permalink / raw)
  To: george anzinger; +Cc: linux-kernel

Hello again,

I've been away for a couple of weeks due to family illness.

Quick recap:

On Jul 31, 12:22pm, Per Gregers Bilse <bilse@qbfox.com> wrote:
> Subject: Re: 2.4.18 clock warps 4294 seconds
> On Jul 29, 11:39am, george anzinger <george@mvista.com> wrote:
> > > The "if (count > LATCH)" block has been taken out of the 2.4.18
> > 
> > I am not sure it was ever in the kernel in that form.  Are
> > you sure you did not put some patch in here?
> 
> I've never deliberately touched anything in there.  Everything
> will have come from RedHat (7.2 CDs) or kernel.org, and I'm
> pretty sure I didn't install any interim and/or experimental
> versions or patches.
> 
> > I wish I knew more about this hardware bug.  The test
> > suggests that the chip is not resetting the latch on
> > interrupt, but rather that it just rolls over (or under). 
> > This would cause the count to, again, reach zero (and,
> > hopefully interrupt) in about 50 ms.  On the other hand, the
> > chip could be switching modes and only the "0X34" mode will
> > continue to interrupt with out the chip being reprogrammed. 
> > In this case, it is hard to understand how the system keeps
> > ANY time at all.  
> 
> Both machines generally seem to keep time just fine, apart
> from the bumps.  But then, there were only a few of the
> hardware bug warning messages in the logs, so it didn't
> manifest itself very often.
> 
> Anyway, nothing caught over the weekend, unfortunately.  The trap
> in do_fast_gettimeoffset() reads:
> 
>         if ( eax > 100000000 ) {
>                 glob_eax = eax;
>                 return delay_at_last_interrupt;
>         }
> 
> Then a check and printk() in do_gettimeofday().  The machine
> has lost NTP synch somewhat sporadically, but not as regularly
> as before.  Sigh.  Maybe it's the weather, we've been having a
> heatwave since Friday, and everybody and everything in the office
> is roasting.
> 
> I'm going to try to fiddle around some more, also with the other
> machine.
> 
>   -- Per
>-- End of excerpt from Per Gregers Bilse


I made a new trap, much simpler and well focused, here's the full
function:

/*
 * This version of gettimeofday has microsecond resolution
 * and better than microsecond precision on fast x86 machines with TSC.
 */
void do_gettimeofday(struct timeval *tv)
{
        unsigned long flags;
        unsigned long usec, sec;
        unsigned long offusec;

        read_lock_irqsave(&xtime_lock, flags);
        offusec = usec = do_gettimeoffset();
        {
                unsigned long lost = jiffies - wall_jiffies;
                if (lost)
                        usec += lost * (1000000 / HZ);
        }
        sec = xtime.tv_sec;
        usec += xtime.tv_usec;
        read_unlock_irqrestore(&xtime_lock, flags);

        if ( offusec > 2000000 )
                printk("do_gettimeofday offusec %lu\n",offusec);

        if ( usec > 2000000 )
                printk("do_gettimeofday usec %lu\n",usec);

        while (usec >= 1000000) {
                usec -= 1000000;
                sec++;
        }

        tv->tv_sec = sec;
        tv->tv_usec = usec;

}

It's basically the addition of 'offusec' to check the offset value.

Nothing happened for a week or so after I'd booted the machine, then
suddenly in the middle of the night:

Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294939478
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294939498
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294939849
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294939957
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294939997
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940016
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940037
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940056
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940077
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940094
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940115
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940133
Aug 28 01:05:03 vulpes kernel: do_gettimeofday offusec 4294940154

On and on it went, until I woke up in the morning and rebooted the
machine:

Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294951722
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294953414
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294953450
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294955967
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294956001
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294957942
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294957978
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294951303
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294951339
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294953107
Aug 28 05:51:23 vulpes kernel: do_gettimeofday offusec 4294953143

As you can see the incorrect offset increases by one second every
second, so I figure it must be the global delay_at_last_interrupt
which gets clobbered at some point.  This is added to the return
value in do_fast_gettimeoffset(), and looks like the only thing
(apart from a permanently clobbered hardware TSC) which could cause
do_fast_gettimeoffset() to repeatedly return a bad value, increasing by
one second per second.  In turn, delay_at_last_interrupt is only
set to anything in timer_interrupt(), where there once was code
to deal with buggy VIA686a motherboards -- you may recall this:

> Both machines indeed have identical VIA686a motherboards.  The messages
> come from code in timer_interrupt() in time.c:
> 
>                 /* read Pentium cycle counter */
> 
>                 rdtscl(last_tsc_low);
> 
>                 spin_lock(&i8253_lock);
>                 outb_p(0x00, 0x43);     /* latch the count ASAP */
> 
>                 count = inb_p(0x40);    /* read the latched count */
>                 count |= inb(0x40) << 8;
> 
>                 /* VIA686a test code... reset the latch if count > max */
>                 if (count > LATCH) {
>                         static int last_whine;
>                         outb_p(0x34, 0x43);
>                         outb_p(LATCH & 0xff, 0x40);
>                         outb(LATCH >> 8, 0x40);
>                         count = LATCH - 1;
>                         if(time_after(jiffies, last_whine))
>                         {
>                                 printk(KERN_WARNING "probable hardware bug: clock timer configuration lost - probably a VIA686a motherboard.\n");
>                                 printk(KERN_WARNING "probable hardware bug: restoring chip configuration.\n");
>                                 last_whine = jiffies + HZ;
>                         }
>                 }
> 
>                 spin_unlock(&i8253_lock);
> 
> The "if (count > LATCH)" block has been taken out of the 2.4.18
> kernel, while similar code is in do_slow_gettimeoffset() in both
> the 2.4.7-10 and 2.4.18 kernels.  I'm not sufficiently familiar
> with the hardware and the code to know if this is significant,
> but it does seem that there are some known hardware bugs which
> the earlier kernel tried to address (but with limited or no success).

So I'll try to put the recovery code shown above back in, and see
if I catch something.

BTW, if y'all are tired of this, let me know.  I've only got two of
the VIA686a boards, and I can live with them as is.  It's more a case
of contributing to the kernel if I can.  Not sure how many VIA686a
boards are in use, but I guess it could be some hundred thousand.

  -- Per


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2002-08-29 15:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-07-24 12:35 2.4.18 clock warps 4294 seconds Per Gregers Bilse
2002-07-24 13:37 ` Richard B. Johnson
2002-07-24 17:46   ` Per Gregers Bilse
2002-07-24 18:50     ` george anzinger
2002-07-25 10:01       ` Per Gregers Bilse
2002-07-25 10:36         ` Per Gregers Bilse
2002-07-25 17:34           ` george anzinger
2002-07-26 10:22             ` Per Gregers Bilse
2002-07-29 18:39               ` george anzinger
2002-07-31 11:22                 ` Per Gregers Bilse
2002-08-29 15:19                   ` Per Gregers Bilse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox