public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: Debugging hard lockups (hardware?)
@ 2003-04-07 23:24 Richard J Moore
  2003-04-09  1:24 ` Nick Urbanik
  0 siblings, 1 reply; 20+ messages in thread
From: Richard J Moore @ 2003-04-07 23:24 UTC (permalink / raw)
  To: linux-kernel

On Sun 6 April 2003 8:31 pm, Dave Jones wrote:
> On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
>  > > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
>  > > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
>  >
>  > ...
>  > Your choice of components looks fine, its all stuff I trust, even if the
>  > ethernet card is not good for performance it ought to be fine in
>  > general. If it is a faulty part most likely its a one off fault.
>
> Note the IDE controller, and 2.5 bugzilla #123
> That controller has been nothing but trouble for me.
>
> 		Dave
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Do you know that your platform support a watchdog timer. Not every thing
 does, thought most do. It's probably worh verifying (cli; hlt;).
Is it possible you have experiennced a triple-fault? Most CPUs reset but some
have been known to freeze and require a manual reset. This can be verified
relatively easily (make int 8 descriptor NP then loop recursively - does the
system hang or reboot?).

--
Richard J Moore
IBM Linux Technology Centre



^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Debugging hard lockups (hardware?)
@ 2003-04-11 18:08 Scott Lee
  2003-04-11 19:22 ` Bernd Eckenfels
  0 siblings, 1 reply; 20+ messages in thread
From: Scott Lee @ 2003-04-11 18:08 UTC (permalink / raw)
  To: linux-kernel

On Sunday, April 06, 2003 at 11:34 AM Alan Cox wrote:
> For the NMI watchdog to fail (if you have it enabled) requires pretty
> major disaster to have occurred since the NMI will be delivered through
> any kind of system hang

Has anyone ever thought about that fact that although the dog might bark (printk), klogd may be deaf in some cases?  It seems to me that there will be cases where this happens and thus we see no data.

I don't know if all x86 hardware has this, but I know that the couple of systems I typically use have a small amount of battery backed ram that is available for use in the RTC.  Anyone ever write a patch to dump data (small stack trace) to this memory so it can be retrieved after a reboot?  It seems to me that some data is better than none...  Of course it would be nice to know what the lowest common denominator is so that one patch will work with all PC hardware but I don't even know if a general approach is really feasible.

Regards,

Scott

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Debugging hard lockups (hardware?)
@ 2003-04-08 18:36 Scott Lee
  0 siblings, 0 replies; 20+ messages in thread
From: Scott Lee @ 2003-04-08 18:36 UTC (permalink / raw)
  To: linux-kernel

I too am experiencing hard lockups.  

I'm using 2.4.18-rc1-ac2 (2.4.18-18.7.x) on multiple units with fairly uniform HW.   (HW differences being PGA vs BGA processor and RAM/HD size.)  Because of the stage of software development I'm not really able roll the kernel.  It seems unlikely that it is bad hardware because it is happening on multiple units and those same units are rock solid if running older (2.0 kernel) based firmware.

Unfortunately I'm on a National GX1 processor so I don't have NMI capability.  The lockups are very sporatic and can not be reproduced at will.  Additionally most of the units are headless which complicates matters.  I've looked at the state of SUSPA# on a locked up box and it is not asserted which indicates that the CPU is not halted.  I can't be 100% sure but it looks like the only memory activity is refresh related so it seems like the CPU is in a loop that runs out of its puny 16K unified cache.

Since this is an embedded device it is running a fairly stripped kernel.  I suppose I can post the config if necessary but it's basically using ide, ext3, and pcnet32.  (Yes I know about the ext3 patches but they don't seem to help in my case.)

Also, recently, Jerry Carter (of Samba) has asked me for some help regarding one of his personal systems as he too is seeing hard lockups.  He hasn't yet had time to try the nmi_watchdog but I'll prod him some more on this.  His system has a SiS 730 chipset with Athlon 200XP+ and Promise IDE ultra 100 controller.  He has ATI Rage II PCI video but he has also seen the problem when using the onboard video.  As for kernels he is using 2.4.20 and has seen the problem w/ and w/o the ext3 patches and acl patches.  Additionally, he reverted his fs to ext2 at one point and still saw the problem.  He generally sees the problem when playing back mp3s but it has happened at other times.

Regards,

Scott Lee

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Debugging hard lockups (hardware?)
@ 2003-04-06 10:12 mikpe
  2003-04-06 16:01 ` Nick Urbanik
  0 siblings, 1 reply; 20+ messages in thread
From: mikpe @ 2003-04-06 10:12 UTC (permalink / raw)
  To: linux-kernel, nicku

On Sun, 06 Apr 2003 14:32:27 +0800, Nick Urbanik wrote:
>This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
>not respond.  The NMI watchdog does not kick in.  Alt-SysRq-keys do not
>respond.  Logs show no hint of any problem (that I recognise) before lockup.
>Occurs often during scrolling e.g., Mozilla.  I swapped the Radeon 7000 for a
>7500, then an Nvidia.
...
>flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm

You didn't supply your .config.

It could be local APIC breakage. Try the following:

1. Disable fancy power management support in the kernel.
   If you're using APM, either disable it completely, or keep
   only APM_DO_ENABLE enabled. Specifically, CPU_IDLE and
   DISPLAY_BLANK should be disabled.
   The problem is that power management calls into the BIOS
   _without_ disabling the local APIC before the call.
   I know that several of my machines used to hang due to
   the local APIC timer interrupt occurring while graphics
   card BIOS code was running because of APM.

2. If the above doesn't help, disable local APIC support.

If neither of these help, then the problem isn't the local APIC.

/Mikael

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Debugging hard lockups (hardware?)
@ 2003-04-06  6:32 Nick Urbanik
  2003-04-06  6:39 ` Nick Urbanik
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Nick Urbanik @ 2003-04-06  6:32 UTC (permalink / raw)
  To: Linux Kernel list

Dear team,

This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
not respond.  The NMI watchdog does not kick in.  Alt-SysRq-keys do not
respond.  Logs show no hint of any problem (that I recognise) before lockup.
Occurs often during scrolling e.g., Mozilla.  I swapped the Radeon 7000 for a
7500, then an Nvidia.

I guess hardware.  But memtest run exhaustively shows no problem.

Same is true for large numbers of kernels I've tried, mostly Red Hat and -ac
kernels, old and new.

current grub kernel line:
kernel /boot/vmlinuz-2.4.20-8custom ro root=LABEL=/ vga=6 console=ttyS0,38400
nmi_watchdog=1 hdm=ide-scsi

I have six 80 G IDE disks, software RAID, LVM on top.  On Red Hat 8.0 and 9.

Any hints on how to troubleshoot this (besides replacing motherboard and other
components I cannot afford to replace?)

$ cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 hdc1[1] hda1[0]
      10240128 blocks [2/2] [UU]

md2 : active raid5 hdk1[3] hdi1[2] hdg1[1] hde1[0]
      20480256 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 hdk2[5] hdi2[4] hdg2[2] hde2[3] hdc2[1] hda2[0]
      4088064 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid5 hdk3[6] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
      267514112 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
$ lspci
00:00.0 Host bridge: Intel Corp. 82845 845 (Brookdale) Chipset Host Bridge (rev
11)
00:01.0 PCI bridge: Intel Corp. 82845 845 (Brookdale) Chipset AGP Bridge (rev
11)
00:1d.0 USB Controller: Intel Corp. 82801DB USB (Hub #1) (rev 01)
00:1d.1 USB Controller: Intel Corp. 82801DB USB (Hub #2) (rev 01)
00:1d.2 USB Controller: Intel Corp. 82801DB USB (Hub #3) (rev 01)
00:1d.7 USB Controller: Intel Corp. 82801DB USB EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corp. 82801DB ISA Bridge (LPC) (rev 01)
00:1f.1 IDE interface: Intel Corp. 82801DB ICH4 IDE (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2 MX/MX 400]
(rev b2)
02:03.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
02:04.0 FireWire (IEEE 1394): NEC Corporation: Unknown device 00f2 (rev 01)
02:08.0 Ethernet controller: Intel Corp. 82801BD PRO/100 VE (LOM) Ethernet
Controller (rev 81)
02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
02:0c.0 SCSI storage controller: Advanced Micro Devices [AMD] 53c974 [PCscsi]
(rev 10)
02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
02:0e.0 RAID bus controller: CMD Technology Inc PCI0649 (rev 02)
$ lsmod
Module                  Size  Used by    Not tainted
nls_iso8859-1           3516   1 (autoclean)
cmpci                  35464   1 (autoclean)
soundcore               6436   4 (autoclean) [cmpci]
ppp_deflate             4504   2 (autoclean)
zlib_deflate           21624   0 (autoclean) [ppp_deflate]
osst                   48072   0 (autoclean) (unused)
st                     30736   0 (autoclean) (unused)
usb-storage            68276   0 (unused)
binfmt_misc             7400   1
parport_pc             19108   1 (autoclean)
lp                      8996   0 (autoclean)
parport                36960   1 (autoclean) [parport_pc lp]
nfsd                   79920   8 (autoclean)
lockd                  58128   1 (autoclean) [nfsd]
sunrpc                 81372   1 (autoclean) [nfsd lockd]
autofs                 13108   1 (autoclean)
n_hdlc                  8000   1
ppp_synctty             7840   1
ppp_async               9376   1
ppp_generic            24540   6 [ppp_deflate ppp_synctty ppp_async]
slhc                    6628   1 [ppp_generic]
ne2k-pci                7168   1 (autoclean)
8390                    8364   0 (autoclean) [ne2k-pci]
e100                   63908   1
ipt_multiport           1176   5 (autoclean)
ipt_REJECT              3832   2 (autoclean)
ipt_limit               1560   2 (autoclean)
ipt_MASQUERADE          2136   3 (autoclean)
iptable_filter          2412   1 (autoclean)
ipt_LOG                 4248  11
ipt_state               1080  23
ip_nat_ftp              4016   0 (unused)
ip_conntrack_ftp        5264   1
iptable_nat            20824   2 [ipt_MASQUERADE ip_nat_ftp]
ip_conntrack           26976   3 [ipt_MASQUERADE ipt_state ip_nat_ftp
ip_conntrack_ftp iptable_nat]
ip_tables              14840  10 [ipt_multiport ipt_REJECT ipt_limit
ipt_MASQUERADE iptable_filter ipt_LOG ipt_state iptable_nat]
sg                     35820   0 (autoclean)
sr_mod                 17912   0 (autoclean)
ide-scsi               12176   0
ide-cd                 35580   0
cdrom                  33120   0 [sr_mod ide-cd]
loop                   12056   3 (autoclean)
keybdev                 2976   0 (unused)
mousedev                5460   1
hid                    21956   0 (unused)
input                   5888   0 [keybdev mousedev hid]
usb-uhci               26188   0 (unused)
usbcore                78272   2 [usb-storage hid usb-uhci]
ext3                   69760  11
jbd                    51828  11 [ext3]
tmscsim                37088   0
sd_mod                 13484   0 (unused)
scsi_mod              106872   7 [osst st usb-storage sg sr_mod ide-scsi tmscsim
sd_mod]
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.26GHz
stepping        : 4
cpu MHz         : 2289.252
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4561.30

$ free
             total       used       free     shared    buffers     cached
Mem:       1030780    1020180      10600          0     158692     633092
-/+ buffers/cache:     228396     802384
Swap:      4088056      31020    4057036
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0              10079212   5805256   3761952  61% /
none                    515388         0    515388   0% /dev/shm
/dev/main/home         2873984    814604   1913564  30% /home
/dev/main/usr         10321208   5762360   4034560  59% /usr
/dev/main/var          3201076   1168292   1870176  39% /var
/dev/main/var2        24520572   5211756  18063224  23% /var2
/dev/main/nicku       55734652  27123980  26345748  51% /home/nicku
/dev/main/photos      46445552  28598296  15487960  65% /home/nicku/.photos
/dev/main/mail         8256952   5107616   2729908  66% /home/nicku/work/nsmail
/dev/main/ftp         98051740  77431664  15639340  84% /var/ftp
/dev/main/mp3         10321208   8117692   1679228  83% /mp3
/dev/main/cdimage     23738812  18602584   4171540  82% /cdimage
/$ sudo hdparm -tT /dev/md{0,1,2,3}

/dev/md0:
 Timing buffer-cache reads:   128 MB in  0.36 seconds =355.56 MB/sec
 Timing buffered disk reads:  64 MB in  1.67 seconds = 38.32 MB/sec

/dev/md1:
 Timing buffer-cache reads:   128 MB in  0.37 seconds =345.95 MB/sec
 Timing buffered disk reads:  64 MB in  0.77 seconds = 83.12 MB/sec

/dev/md2:
 Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
 Timing buffered disk reads:  64 MB in  1.31 seconds = 48.85 MB/sec

/dev/md3:
 Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
 Timing buffered disk reads:  64 MB in 21.93 seconds =  2.92 MB/sec
(last horribly. slow; get zillions of lines in syslog saying stuff like:
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->
4096
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->
4096
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
Apr  6 14:08:50 nicksbox last message repeated 2 times
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 4096 -->
1024
does not occur like that with RH 2.4.18-x kernels)

Any pointers to web sites, information that may help, any hints, suggestions,
ideas,... all most welcome.  Actually, if replacing the motherboard would fix
it, I'd do it, but I cannot guess why it should help; Asus motherboards have
always been good to me before.

--
Nick Urbanik   RHCE                                  nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel:   (852) 2436 8576, (852) 2436 8713          Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B     ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24   ID: BB9D2C24




^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2003-04-17 16:00 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-04-07 23:24 Debugging hard lockups (hardware?) Richard J Moore
2003-04-09  1:24 ` Nick Urbanik
  -- strict thread matches above, loose matches on Subject: below --
2003-04-11 18:08 Scott Lee
2003-04-11 19:22 ` Bernd Eckenfels
2003-04-08 18:36 Scott Lee
2003-04-06 10:12 mikpe
2003-04-06 16:01 ` Nick Urbanik
2003-04-06  6:32 Nick Urbanik
2003-04-06  6:39 ` Nick Urbanik
2003-04-06  9:27 ` John Bradford
2003-04-06 10:55 ` Bruce Harada
2003-04-06 18:34 ` Alan Cox
2003-04-06 20:29   ` Arador
2003-04-06 21:04     ` Alan Cox
2003-04-17 16:12       ` Arador
2003-04-06 20:31   ` Dave Jones
2003-04-06 22:02     ` Nick Urbanik
2003-04-06 23:15       ` Dave Jones
2003-04-08  3:33     ` Andre Hedrick
2003-04-08 12:35       ` Dave Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox