Debugging hard lockups (hardware?)


* Debugging hard lockups (hardware?)
@ 2003-04-06  6:32 Nick Urbanik
  2003-04-06  6:39 ` Nick Urbanik
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Nick Urbanik @ 2003-04-06  6:32 UTC (permalink / raw)
  To: Linux Kernel list

Dear team,

This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
not respond.  The NMI watchdog does not kick in.  Alt-SysRq-keys do not
respond.  Logs show no hint of any problem (that I recognise) before lockup.
Occurs often during scrolling e.g., Mozilla.  I swapped the Radeon 7000 for a
7500, then an Nvidia.

I guess hardware.  But memtest run exhaustively shows no problem.

Same is true for large numbers of kernels I've tried, mostly Red Hat and -ac
kernels, old and new.

current grub kernel line:
kernel /boot/vmlinuz-2.4.20-8custom ro root=LABEL=/ vga=6 console=ttyS0,38400 nmi_watchdog=1 hdm=ide-scsi

I have six 80 G IDE disks, software RAID, LVM on top.  On Red Hat 8.0 and 9.

Any hints on how to troubleshoot this (besides replacing motherboard and other
components I cannot afford to replace?)

$ cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 hdc1[1] hda1[0]
      10240128 blocks [2/2] [UU]

md2 : active raid5 hdk1[3] hdi1[2] hdg1[1] hde1[0]
      20480256 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 hdk2[5] hdi2[4] hdg2[2] hde2[3] hdc2[1] hda2[0]
      4088064 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

md3 : active raid5 hdk3[6] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
      267514112 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
$ lspci
00:00.0 Host bridge: Intel Corp. 82845 845 (Brookdale) Chipset Host Bridge (rev 11)
00:01.0 PCI bridge: Intel Corp. 82845 845 (Brookdale) Chipset AGP Bridge (rev 11)
00:1d.0 USB Controller: Intel Corp. 82801DB USB (Hub #1) (rev 01)
00:1d.1 USB Controller: Intel Corp. 82801DB USB (Hub #2) (rev 01)
00:1d.2 USB Controller: Intel Corp. 82801DB USB (Hub #3) (rev 01)
00:1d.7 USB Controller: Intel Corp. 82801DB USB EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corp. 82801DB ISA Bridge (LPC) (rev 01)
00:1f.1 IDE interface: Intel Corp. 82801DB ICH4 IDE (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2 MX/MX 400] (rev b2)
02:03.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
02:04.0 FireWire (IEEE 1394): NEC Corporation: Unknown device 00f2 (rev 01)
02:08.0 Ethernet controller: Intel Corp. 82801BD PRO/100 VE (LOM) Ethernet Controller (rev 81)
02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
02:0c.0 SCSI storage controller: Advanced Micro Devices [AMD] 53c974 [PCscsi] (rev 10)
02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
02:0e.0 RAID bus controller: CMD Technology Inc PCI0649 (rev 02)
$ lsmod
Module                  Size  Used by    Not tainted
nls_iso8859-1           3516   1 (autoclean)
cmpci                  35464   1 (autoclean)
soundcore               6436   4 (autoclean) [cmpci]
ppp_deflate             4504   2 (autoclean)
zlib_deflate           21624   0 (autoclean) [ppp_deflate]
osst                   48072   0 (autoclean) (unused)
st                     30736   0 (autoclean) (unused)
usb-storage            68276   0 (unused)
binfmt_misc             7400   1
parport_pc             19108   1 (autoclean)
lp                      8996   0 (autoclean)
parport                36960   1 (autoclean) [parport_pc lp]
nfsd                   79920   8 (autoclean)
lockd                  58128   1 (autoclean) [nfsd]
sunrpc                 81372   1 (autoclean) [nfsd lockd]
autofs                 13108   1 (autoclean)
n_hdlc                  8000   1
ppp_synctty             7840   1
ppp_async               9376   1
ppp_generic            24540   6 [ppp_deflate ppp_synctty ppp_async]
slhc                    6628   1 [ppp_generic]
ne2k-pci                7168   1 (autoclean)
8390                    8364   0 (autoclean) [ne2k-pci]
e100                   63908   1
ipt_multiport           1176   5 (autoclean)
ipt_REJECT              3832   2 (autoclean)
ipt_limit               1560   2 (autoclean)
ipt_MASQUERADE          2136   3 (autoclean)
iptable_filter          2412   1 (autoclean)
ipt_LOG                 4248  11
ipt_state               1080  23
ip_nat_ftp              4016   0 (unused)
ip_conntrack_ftp        5264   1
iptable_nat            20824   2 [ipt_MASQUERADE ip_nat_ftp]
ip_conntrack           26976   3 [ipt_MASQUERADE ipt_state ip_nat_ftp ip_conntrack_ftp iptable_nat]
ip_tables              14840  10 [ipt_multiport ipt_REJECT ipt_limit ipt_MASQUERADE iptable_filter ipt_LOG ipt_state iptable_nat]
sg                     35820   0 (autoclean)
sr_mod                 17912   0 (autoclean)
ide-scsi               12176   0
ide-cd                 35580   0
cdrom                  33120   0 [sr_mod ide-cd]
loop                   12056   3 (autoclean)
keybdev                 2976   0 (unused)
mousedev                5460   1
hid                    21956   0 (unused)
input                   5888   0 [keybdev mousedev hid]
usb-uhci               26188   0 (unused)
usbcore                78272   2 [usb-storage hid usb-uhci]
ext3                   69760  11
jbd                    51828  11 [ext3]
tmscsim                37088   0
sd_mod                 13484   0 (unused)
scsi_mod              106872   7 [osst st usb-storage sg sr_mod ide-scsi tmscsim sd_mod]
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.26GHz
stepping        : 4
cpu MHz         : 2289.252
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 4561.30

$ free
             total       used       free     shared    buffers     cached
Mem:       1030780    1020180      10600          0     158692     633092
-/+ buffers/cache:     228396     802384
Swap:      4088056      31020    4057036
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0              10079212   5805256   3761952  61% /
none                    515388         0    515388   0% /dev/shm
/dev/main/home         2873984    814604   1913564  30% /home
/dev/main/usr         10321208   5762360   4034560  59% /usr
/dev/main/var          3201076   1168292   1870176  39% /var
/dev/main/var2        24520572   5211756  18063224  23% /var2
/dev/main/nicku       55734652  27123980  26345748  51% /home/nicku
/dev/main/photos      46445552  28598296  15487960  65% /home/nicku/.photos
/dev/main/mail         8256952   5107616   2729908  66% /home/nicku/work/nsmail
/dev/main/ftp         98051740  77431664  15639340  84% /var/ftp
/dev/main/mp3         10321208   8117692   1679228  83% /mp3
/dev/main/cdimage     23738812  18602584   4171540  82% /cdimage
$ sudo hdparm -tT /dev/md{0,1,2,3}

/dev/md0:
 Timing buffer-cache reads:   128 MB in  0.36 seconds =355.56 MB/sec
 Timing buffered disk reads:  64 MB in  1.67 seconds = 38.32 MB/sec

/dev/md1:
 Timing buffer-cache reads:   128 MB in  0.37 seconds =345.95 MB/sec
 Timing buffered disk reads:  64 MB in  0.77 seconds = 83.12 MB/sec

/dev/md2:
 Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
 Timing buffered disk reads:  64 MB in  1.31 seconds = 48.85 MB/sec

/dev/md3:
 Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
 Timing buffered disk reads:  64 MB in 21.93 seconds =  2.92 MB/sec
(The last one is horribly slow, and I get zillions of lines in syslog like these:
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 --> 4096
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 --> 4096
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
Apr  6 14:08:50 nicksbox last message repeated 2 times
Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 4096 --> 1024
This does not occur like that with RH 2.4.18-x kernels.)
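
The hdparm figures above can also be compared mechanically rather than by eye. What follows is only a minimal sketch: it parses the `hdparm -tT` output pasted in this message and flags any array whose buffered disk read rate falls far below the median. The function names and the 25% threshold are assumptions chosen for illustration, not anything hdparm itself provides.

```python
import re
from statistics import median

# The hdparm -tT output from this message, pasted verbatim.
SAMPLE = """\
/dev/md0:
 Timing buffer-cache reads:   128 MB in  0.36 seconds =355.56 MB/sec
 Timing buffered disk reads:  64 MB in  1.67 seconds = 38.32 MB/sec

/dev/md1:
 Timing buffer-cache reads:   128 MB in  0.37 seconds =345.95 MB/sec
 Timing buffered disk reads:  64 MB in  0.77 seconds = 83.12 MB/sec

/dev/md2:
 Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
 Timing buffered disk reads:  64 MB in  1.31 seconds = 48.85 MB/sec

/dev/md3:
 Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
 Timing buffered disk reads:  64 MB in 21.93 seconds =  2.92 MB/sec
"""


def parse_hdparm(text):
    """Return {device: buffered disk read rate in MB/sec}."""
    rates = {}
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"^(/dev/\S+):$", line)
        if header:
            device = header.group(1)
            continue
        rate = re.search(r"buffered disk reads:.*=\s*([\d.]+)\s*MB/sec", line)
        if rate and device:
            rates[device] = float(rate.group(1))
    return rates


def slow_devices(rates, fraction=0.25):
    """Devices reading slower than `fraction` of the median rate.

    The 0.25 default is an arbitrary illustrative threshold.
    """
    mid = median(rates.values())
    return sorted(dev for dev, r in rates.items() if r < fraction * mid)


rates = parse_hdparm(SAMPLE)
print(slow_devices(rates))
```

On the numbers above only /dev/md3 falls under the threshold, which matches the observation that it is the one array that crawls.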

Any pointers to web sites, information that may help, any hints, suggestions,
ideas,... all most welcome.  Actually, if replacing the motherboard would fix
it, I'd do it, but I cannot guess why it should help; Asus motherboards have
always been good to me before.

--
Nick Urbanik   RHCE                                  nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel:   (852) 2436 8576, (852) 2436 8713          Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B     ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24   ID: BB9D2C24





* Re: Debugging hard lockups (hardware?)
  2003-04-06  6:32 Debugging hard lockups (hardware?) Nick Urbanik
@ 2003-04-06  6:39 ` Nick Urbanik
  2003-04-06  9:27 ` John Bradford
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Nick Urbanik @ 2003-04-06  6:39 UTC (permalink / raw)
  To: Linux Kernel list

Nick Urbanik wrote:

> Dear team,
>
> This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
> not respond.  The NMI watchdog does not kick in.  Alt-SysRq-keys do not
> respond.  Logs show no hint of any problem (that I recognise) before lockup.
> Occurs often during scrolling e.g., Mozilla.

I omitted one critical fact: it only happens while I am using the machine at the
console; it has never happened while I am connected via ssh, even when doing
plenty of I/O.

--
Nick Urbanik   RHCE                                  nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel:   (852) 2436 8576, (852) 2436 8713          Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B     ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24   ID: BB9D2C24





* Re: Debugging hard lockups (hardware?)
  2003-04-06  6:32 Debugging hard lockups (hardware?) Nick Urbanik
  2003-04-06  6:39 ` Nick Urbanik
@ 2003-04-06  9:27 ` John Bradford
  2003-04-06 10:55 ` Bruce Harada
  2003-04-06 18:34 ` Alan Cox
  3 siblings, 0 replies; 15+ messages in thread
From: John Bradford @ 2003-04-06  9:27 UTC (permalink / raw)
  To: Nick Urbanik; +Cc: linux-kernel

> This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
> not respond.  The NMI watchdog does not kick in.  Alt-SysRq-keys do not
> respond.  Logs show no hint of any problem (that I recognise) before lockup.
> Occurs often during scrolling e.g., Mozilla.  I swapped the Radeon 7000 for a
> 7500, then an Nvidia.
> 
> I guess hardware.  But memtest run exhaustively shows no problem.

Have you tried the CPUBURN utilities?

http://users.ev1.net/~redelm/

John.


* Re: Debugging hard lockups (hardware?)
  2003-04-06  6:32 Debugging hard lockups (hardware?) Nick Urbanik
  2003-04-06  6:39 ` Nick Urbanik
  2003-04-06  9:27 ` John Bradford
@ 2003-04-06 10:55 ` Bruce Harada
  2003-04-06 18:34 ` Alan Cox
  3 siblings, 0 replies; 15+ messages in thread
From: Bruce Harada @ 2003-04-06 10:55 UTC (permalink / raw)
  To: Nick Urbanik; +Cc: linux-kernel

On Sun, 06 Apr 2003 14:32:27 +0800
Nick Urbanik <nicku@vtc.edu.hk> wrote:

> Dear team,
> 
> This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll
> Lock do not respond.  The NMI watchdog does not kick in.  Alt-SysRq-keys do
> not respond.  Logs show no hint of any problem (that I recognise) before
> lockup. Occurs often during scrolling e.g., Mozilla.  I swapped the Radeon
> 7000 for a 7500, then an Nvidia.

Does booting with 'noapic' make any difference?


* Re: Debugging hard lockups (hardware?)
  2003-04-06  6:32 Debugging hard lockups (hardware?) Nick Urbanik
                   ` (2 preceding siblings ...)
  2003-04-06 10:55 ` Bruce Harada
@ 2003-04-06 18:34 ` Alan Cox
  2003-04-06 20:29   ` Arador
                     ` (2 more replies)
  3 siblings, 3 replies; 15+ messages in thread
From: Alan Cox @ 2003-04-06 18:34 UTC (permalink / raw)
  To: Nick Urbanik; +Cc: Linux Kernel Mailing List

On Sul, 2003-04-06 at 07:32, Nick Urbanik wrote:
> This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
> not respond.  The NMI watchdog does not kick in.

For the NMI watchdog to fail (if you have it enabled) requires a pretty
major disaster to have occurred, since the NMI will be delivered through
any kind of system hang.
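
One way to confirm that the watchdog is actually armed before the next lockup is to check that the NMI counter in /proc/interrupts keeps rising. The sketch below assumes the usual layout, a line starting with `NMI:` followed by one count per CPU; the helper names are hypothetical and the sample text is made up for illustration.

```python
def nmi_count(interrupts_text):
    """Sum the per-CPU NMI counts from /proc/interrupts-style text.

    Returns None if no 'NMI:' line is present.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Keep only the numeric columns; newer kernels append a
            # textual description after the counts.
            return sum(int(f) for f in fields[1:] if f.isdigit())
    return None


def watchdog_ticking(before, after):
    """True if the NMI count rose between two snapshots."""
    a, b = nmi_count(before), nmi_count(after)
    return a is not None and b is not None and b > a


# Illustrative snapshots, not taken from the machine in this thread.
snapshot_1 = "           CPU0\nNMI:        100\nLOC:       5000\n"
snapshot_2 = "           CPU0\nNMI:        160\nLOC:       5600\n"
print(watchdog_ticking(snapshot_1, snapshot_2))
```

In practice the two snapshots would come from reading /proc/interrupts a few seconds apart. A count that stays at zero suggests the watchdog never started, in which case its silence during a lockup says nothing.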

> I guess hardware.  But memtest run exhaustively shows no problem.

Memory errors normally generate "Oops" type lines rather than other
stuff

> I have six 80 G IDE disks, software RAID, LVM on top.  On Red Hat 8.0 and 9.
> 
> Any hints on how to troubleshoot this (besides replacing motherboard and other
> components I cannot afford to replace?)

Is your PSU up to scratch for six disks ?

> /dev/md3:
>  Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
>  Timing buffered disk reads:  64 MB in 21.93 seconds =  2.92 MB/sec
> (last horribly. slow; get zillions of lines in syslog saying stuff like:
> Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
> Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
> Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->

I'm not sure what this one indicates, actually.

> Any pointers to web sites, information that may help, any hints, suggestions,
> ideas,... all most welcome.  Actually, if replacing the motherboard would fix
> it, I'd do it, but I cannot guess why it should help; Asus motherboards have
> always been good to me before.

Your choice of components looks fine; it's all stuff I trust. Even if the
ethernet card is not good for performance, it ought to be fine in
general. If it is a faulty part, most likely it's a one-off fault.

Which bits of the system are not being used (sound, video, network ?)



* Re: Debugging hard lockups (hardware?)
  2003-04-06 18:34 ` Alan Cox
@ 2003-04-06 20:29   ` Arador
  2003-04-06 21:04     ` Alan Cox
  2003-04-06 20:31   ` Dave Jones
  2003-09-15  8:49   ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
  2 siblings, 1 reply; 15+ messages in thread
From: Arador @ 2003-04-06 20:29 UTC (permalink / raw)
  To: Alan Cox; +Cc: nicku, linux-kernel

On 06 Apr 2003 19:34:09 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> For the NMI watchdog to fail (if you have it enabled) requires pretty
> major disaster to have occurred since the NMI will be delivered through
> any kind of system hang

I have a similar hang: no oops, no sysrq, no NMI messages.
But mine only happens under 2.5, and has for a long time.
The one strange thing is that it does not seem to be completely hung,
since the X pointer moves at 3-5 second intervals (it even
changes shape at the windows' corners).
It happens without X too, but as I said, nothing survives:
no oops, sysrq, or NMI messages, and it doesn't answer pings.

I know from the sound of the fans that the CPU usage goes to 100%.

I'm thinking of a hardware failure too (but the odd X behaviour makes
me hesitate), since I don't remember it failing under 2.4.
This box didn't run 2.4 kernels much, though.

The box passes memtest86; it's a 2x800 box, IDE disk, 256 RAM,
VIA chipset... just in the (very strange) case that somebody
has exactly the same box and does or doesn't have the same problems.
(Elitegroup D6VAA motherboard)

Diego Calleja


* Re: Debugging hard lockups (hardware?)
  2003-04-06 18:34 ` Alan Cox
  2003-04-06 20:29   ` Arador
@ 2003-04-06 20:31   ` Dave Jones
  2003-04-06 22:02     ` Nick Urbanik
  2003-04-08  3:33     ` Andre Hedrick
  2003-09-15  8:49   ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
  2 siblings, 2 replies; 15+ messages in thread
From: Dave Jones @ 2003-04-06 20:31 UTC (permalink / raw)
  To: Alan Cox; +Cc: Nick Urbanik, Linux Kernel Mailing List

On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
 > > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
 > > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
 > ...
 > Your choice of components looks fine, its all stuff I trust, even if the
 > ethernet card is not good for performance it ought to be fine in
 > general. If it is a faulty part most likely its a one off fault.

Note the IDE controller, and 2.5 bugzilla #123
That controller has been nothing but trouble for me.

		Dave



* Re: Debugging hard lockups (hardware?)
  2003-04-06 20:29   ` Arador
@ 2003-04-06 21:04     ` Alan Cox
  2003-04-17 16:12       ` Arador
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Cox @ 2003-04-06 21:04 UTC (permalink / raw)
  To: Arador; +Cc: nicku, Linux Kernel Mailing List

On Sul, 2003-04-06 at 21:29, Arador wrote:
> I've a similar hang; no oops; no sysrq; no NMI messages;
> But mine only happens under 2.5; since long time ago.
> The one strange thing is that it seems that it's not hanged;
> since the X pointer moves in 3-5 seconds intervals (it even
> change the shape in the window's corners).

So it's not hanging, but acting like something is burning
CPU. If you can duplicate this outside X, next time it occurs
use right-alt, shift, or ctrl with Scroll Lock to get some data
on what it's doing.



* Re: Debugging hard lockups (hardware?)
  2003-04-06 20:31   ` Dave Jones
@ 2003-04-06 22:02     ` Nick Urbanik
  2003-04-06 23:15       ` Dave Jones
  2003-04-08  3:33     ` Andre Hedrick
  1 sibling, 1 reply; 15+ messages in thread
From: Nick Urbanik @ 2003-04-06 22:02 UTC (permalink / raw)
  To: Dave Jones; +Cc: Alan Cox, Linux Kernel Mailing List

Dave Jones wrote:

> On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
>  > > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
>  > > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
>  > ...
>  > Your choice of components looks fine, its all stuff I trust, even if the
>  > ethernet card is not good for performance it ought to be fine in
>  > general. If it is a faulty part most likely its a one off fault.
>
> Note the IDE controller, and 2.5 bugzilla #123
> That controller has been nothing but trouble for me.
>
>                 Dave

Yes, it was the first thing I suspected, so I went out to the fabled Golden
Shopping Centre and bought all the alternative disk controllers I could (except
for 3ware, which I lacked the cash for).  I tried HighPoint HPT 370A (Adaptec
1200A), HighPoint HPT368, Promise PDC20270 (FastTrack 100Tx2), and the PDC20276
built onto the motherboard, and a HighPoint HPT302 which didn't work properly at
all.  I still got the lockups with various permutations of the non-Silicon Image
chipsets, and found that to my amazement, the Silicon Image 680 chips gave the
best performance.  I too had major problems with CMD64x boards on a production
2-CPU system (server for student accounts), so remained with SCSI on that
machine.

Bugzilla for the kernel?  I didn't know there is one!  I'd better find it.
Thanks Dave.

--
Nick Urbanik   RHCE                                  nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel:   (852) 2436 8576, (852) 2436 8713          Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B     ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24   ID: BB9D2C24


* Re: Debugging hard lockups (hardware?)
  2003-04-06 22:02     ` Nick Urbanik
@ 2003-04-06 23:15       ` Dave Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Jones @ 2003-04-06 23:15 UTC (permalink / raw)
  To: Nick Urbanik; +Cc: Alan Cox, Linux Kernel Mailing List

On Mon, Apr 07, 2003 at 06:02:14AM +0800, Nick Urbanik wrote:

 > Bugzilla for the kernel?  I didn't know there is one!  I'd better find it.
 > Thanks Dave.

At this time, 2.5 only.  http://bugzilla.kernel.org

		Dave



* Re: Debugging hard lockups (hardware?)
  2003-04-06 20:31   ` Dave Jones
  2003-04-06 22:02     ` Nick Urbanik
@ 2003-04-08  3:33     ` Andre Hedrick
  2003-04-08 12:35       ` Dave Jones
  1 sibling, 1 reply; 15+ messages in thread
From: Andre Hedrick @ 2003-04-08  3:33 UTC (permalink / raw)
  To: Dave Jones; +Cc: Alan Cox, Nick Urbanik, Linux Kernel Mailing List


That controller is perfectly fine; give it a swing in 2.4 and it rocks.
2.5 is the problem, not the hardware.

On Sun, 6 Apr 2003, Dave Jones wrote:

> On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
>  > > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
>  > > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
>  > ...
>  > Your choice of components looks fine, its all stuff I trust, even if the
>  > ethernet card is not good for performance it ought to be fine in
>  > general. If it is a faulty part most likely its a one off fault.
> 
> Note the IDE controller, and 2.5 bugzilla #123
> That controller has been nothing but trouble for me.
> 
> 		Dave
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Andre Hedrick
LAD Storage Consulting Group



* Re: Debugging hard lockups (hardware?)
  2003-04-08  3:33     ` Andre Hedrick
@ 2003-04-08 12:35       ` Dave Jones
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Jones @ 2003-04-08 12:35 UTC (permalink / raw)
  To: Andre Hedrick; +Cc: Alan Cox, Nick Urbanik, Linux Kernel Mailing List

On Mon, Apr 07, 2003 at 08:33:12PM -0700, Andre Hedrick wrote:

 > That controller is perfectly fine, give it a swing in 2.4 and it rocks.
 > 2.5 is the problem not the hardware.

Exactly the same problem in 2.4.  Your words on this subject previously..

"Get over it, you have a hardware problem period! I have drives that do this every time."

So which is it ?

It does this with every drive I've tried with it, so I'm incredibly
sceptical there's a problem with the drives, especially when they
work 100% fine on other controllers.

		Dave


* Re: Debugging hard lockups (hardware?)
  2003-04-06 21:04     ` Alan Cox
@ 2003-04-17 16:12       ` Arador
  0 siblings, 0 replies; 15+ messages in thread
From: Arador @ 2003-04-17 16:12 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel

On 06 Apr 2003 22:04:55 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:

> On Sul, 2003-04-06 at 21:29, Arador wrote:
> > I've a similar hang; no oops; no sysrq; no NMI messages;
> > But mine only happens under 2.5; since long time ago.
> > The one strange thing is that it seems that it's not hanged;
> > since the X pointer moves in 3-5 seconds intervals (it even
> > change the shape in the window's corners).
> 
> So its not hanging, but acting like something gets burning
> CPU. If you can duplicate this in non X next time it occurs
> use right-alt or shift or ctrl and scrolllock get some data
> on what its doing.

It doesn't work either :( (sorry for the delay, but I use
X quite a lot and it took me a while to reproduce it again on the console)

The description is above: a hang that has happened randomly in 2.5
for ages, both with and without X, with no sysrq or NMI messages.
Now I have checked that right-alt/shift/ctrl and Scroll Lock don't work either.
I suspect this might be a hardware failure; I should run it for a while under
2.4 to see if I get similar behaviour.



* Re: Debugging hard lockups (hardware?): Power Supply!
  2003-04-06 18:34 ` Alan Cox
  2003-04-06 20:29   ` Arador
  2003-04-06 20:31   ` Dave Jones
@ 2003-09-15  8:49   ` Nick Urbanik
  2003-09-15 19:57     ` Resident Boxholder
  2 siblings, 1 reply; 15+ messages in thread
From: Nick Urbanik @ 2003-09-15  8:49 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List

Dear Folks,

Alan Cox wrote:

> On Sul, 2003-04-06 at 07:32, Nick Urbanik wrote:
> > This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
> > not respond.  The NMI watchdog does not kick in.
>
> For the NMI watchdog to fail (if you have it enabled) requires pretty
> major disaster to have occurred since the NMI will be delivered through
> any kind of system hang
>
> > I guess hardware.  But memtest run exhaustively shows no problem.
>
> Memory errors normally generate "Oops" type lines rather than other
> stuff
>
> > I have six 80 G IDE disks, software RAID, LVM on top.  On Red Hat 8.0 and 9.
> >
> > Any hints on how to troubleshoot this (besides replacing motherboard and other
> > components I cannot afford to replace?)
>
> Is your PSU up to scratch for six disks ?

It seems that the 470W ToTheMax power supply was the problem, though I've not tested
for more than 17 hours yet.

Yesterday, after adding two disks to the motherboard IDE together with the new 3ware
7506-8 with 8 x 80G disks (a total of 10 disks), the 3ware RAID 5 unit refused to
rebuild.  Then I noticed that many dma timeout messages appeared in the logs for all
the disks.  I promptly shut the system down, took my little boy Linus for a walk to
the fabled Golden Shopping Center and bought an Antec 550W supply, plus a few more
case fans.  The 3ware RAID5 unit, and also the kernel md RAID1 built nicely.  So far
no dma_intr messages have appeared (and it has not locked up).
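
For anyone wanting to sanity-check a supply before buying anything, the question of whether a PSU is up to N disks can be turned into a rough 12 V budget at power-on, when every disk spins up at once. This is only a sketch under stated assumptions: the 2.0 A per-disk spin-up current and the rail ratings are hypothetical placeholders, and the thread gives no 12 V rail figures for either the ToTheMax or the Antec supply. Real numbers have to come from the drive data sheets and the label on the supply.

```python
# ASSUMPTIONS (not from this thread): about 2.0 A per disk on the 12 V rail
# during spin-up, and the rail ratings passed to rail_ok() below.

def spinup_load_amps(n_disks, amps_per_disk=2.0, other_12v_amps=0.0):
    """Estimated peak 12 V current with every disk spinning up at once."""
    return n_disks * amps_per_disk + other_12v_amps


def rail_ok(load_amps, rail_amps, headroom=0.8):
    """True if the load stays within `headroom` of the rail's rated current."""
    return load_amps <= rail_amps * headroom


six = spinup_load_amps(6)    # the original six-disk setup
ten = spinup_load_amps(10)   # two on motherboard IDE plus eight on the 3ware
print(six, ten)
print(rail_ok(ten, rail_amps=18.0))  # hypothetical 18 A rail
```

With these placeholder figures, ten disks ask for about 20 A at spin-up, which a hypothetical 18 A rail cannot supply with any margin. The point is the shape of the calculation, not the specific numbers.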

I wish I had changed the power supply before buying the 3ware card so that I could
find out whether the Si680 cards were okay or not!  HK$135 * 3 <<< HK$3800 (comparing
price of 3 Si680 cards with one 3ware 7506-8).  Unfortunately the disks cannot be
moved from the 3ware back to the Si680s without another mighty backup and restore.

> > /dev/md3:
> >  Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
> >  Timing buffered disk reads:  64 MB in 21.93 seconds =  2.92 MB/sec
> > (last horribly. slow; get zillions of lines in syslog saying stuff like:
> > Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
> > Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
> > Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->
>
> Im not sure what this one indicates actually
>
> > Any pointers to web sites, information that may help, any hints, suggestions,
> > ideas,... all most welcome.  Actually, if replacing the motherboard would fix
> > it, I'd do it, but I cannot guess why it should help; Asus motherboards have
> > always been good to me before.
>
> Your choice of components looks fine, its all stuff I trust, even if the
> ethernet card is not good for performance it ought to be fine in
> general. If it is a faulty part most likely its a one off fault.
>
> Which bits of the system are not being used (sound, video, network ?)

--
Nick Urbanik   RHCE                               nicku(at)vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel:   (852) 2436 8576, (852) 2436 8713          Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B     ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24   ID: BB9D2C24





* Re: Debugging hard lockups (hardware?): Power Supply!
  2003-09-15  8:49   ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
@ 2003-09-15 19:57     ` Resident Boxholder
  0 siblings, 0 replies; 15+ messages in thread
From: Resident Boxholder @ 2003-09-15 19:57 UTC (permalink / raw)
  To: Kernel List

faq's for acpi and ide and no-ping or unrecognized ethernet problems,
seek error, random crash

When it seems like there's a bug in hardware or linux but it's just no acpi
id, default code runs slower, timing issues crop up, something breaks,
kill the scapegoats(swap cards, motherboards, try all config/kern opts,
ebay 3ware).

Power supply was the first thing I replaced, going to a 550w Coolmax
with the 3 amp 5v standby voltage.

The second thing one should do is flash bios. If bios acpi does not
recognize hardware id's, linux reasonable defaults institute pretty good
alright solutions which are slower than ideal, and on a system faster than
133fsb and 1ghz cpu that's enough to cause io to miss the love boat
with regard to resources and anything relating to timing.

memtest86

I set aside a pile of Promise cards, tried a SiI680, got almost there by
setting all options that allowed more time for dma resources to become
available. That got me pretty stable, but the bios flash I should have
done first finished the job, proper acpi hardware id's making all the
fastest code available so dma resources came available soon
enough.

Options that may improve stability are--
  --don't use acpi=off
  --don't use pci=noacpi
  --don't go back to a previous kernel
  --cmos setup apic off, let linux activate apic
  --don't oc or push memory before stable(somebody)
  --kernel local apic on, but try sub-option ioapic off if you
       can't ping through your ethernet onboard or card or
       you see garbage scrolling on boot or you have random
        freeze-ups
  --kernel opt try anticipatory-scheduling instead of
     deadline scheduling for more time for ide ops to live
  --cmos pci-delay transaction and hdparm -u1 to unmask
      irq would buy time but can't do with sii680 or cmd640
  --hdparm -d1 -c1 -p9 -X70 -u0 drive on sii680, pio9 is
     a special escape pseudo-mode for sii680, probably -p9
     just says "-d1 -c1 -u0 -p4", redundancy doesn't hurt
  --turn off unused usb, parallel, audio, acpi sleep and
     battery options, anything you don't need, at least until
     you get the system stable
  --3ware cards are cheap on ebay! try everything else first
     or you may be disappointed by throwing money at
     problems. success can strike at any moment--my
     3ware card is still in the mail but system passes
     bonnie++ and cp -aR /usr /tmp [different md's]
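
The -X value in the hdparm line above is easier to read once decoded. As a minimal sketch, the helper below follows the rule given in the hdparm man page: 8 + n selects PIO mode n, 32 + n multiword DMA mode n, and 64 + n UltraDMA mode n. The function name and the upper bounds on each range are assumptions made for illustration; the -p9 behaviour described above is not covered here.

```python
def decode_xfer_mode(value):
    """Translate an hdparm -X number into a transfer-mode name."""
    if 64 <= value <= 71:
        return "UDMA%d" % (value - 64)   # UltraDMA: 64 + mode
    if 32 <= value <= 39:
        return "MDMA%d" % (value - 32)   # multiword DMA: 32 + mode
    if 8 <= value <= 15:
        return "PIO%d" % (value - 8)     # PIO: 8 + mode
    return "unknown"


for v in (12, 34, 66, 70):
    print(v, decode_xfer_mode(v))
```

So -X70 asks the drive for UltraDMA mode 6, and it is worth confirming that the drive and cable can actually sustain that mode before forcing it.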

-Bob

Nick Urbanik wrote:

>Dear Folks,
>
>Alan Cox wrote:
>
>  
>
>>On Sul, 2003-04-06 at 07:32, Nick Urbanik wrote:
>>    
>>
>>>This machine locks up solid every few days.  Caps Lock, Num Lock, Scroll Lock do
>>>not respond.  The NMI watchdog does not kick in.
>>>      
>>>
>>For the NMI watchdog to fail (if you have it enabled) requires pretty
>>major disaster to have occurred since the NMI will be delivered through
>>any kind of system hang
>>
>>    
>>
>>>I guess hardware.  But memtest run exhaustively shows no problem.
>>>      
>>>
>>Memory errors normally generate "Oops" type lines rather than other
>>stuff
>>
>>>I have six 80 G IDE disks, software RAID, LVM on top.  On Red Hat 8.0 and 9.
>>>
>>>Any hints on how to troubleshoot this (besides replacing motherboard and other
>>>components I cannot afford to replace?)
>>>
>>Is your PSU up to scratch for six disks ?
>
>It seems that the 470W ToTheMax power supply was the problem, though I've not tested
>for more than 17 hours yet.
>
>Yesterday, after adding two disks to the motherboard IDE together with the new 3ware
>7506-8 with 8 x 80G disks (a total of 10 disks), the 3ware RAID 5 unit refused to
>rebuild.  Then I noticed that many dma timeout messages appeared in the logs for all
>the disks.  I promptly shut the system down, took my little boy Linus for a walk to
>the fabled Golden Shopping Center and bought an Antec 550W supply, plus a few more
>case fans.  The 3ware RAID5 unit, and also the kernel md RAID1 built nicely.  So far
>no dma_intr messages have appeared (and it has not locked up).
>
>I wish I had changed the power supply before buying the 3ware card so that I could
>find out whether the Si680 cards were okay or not!  HK$135 * 3 <<< HK$3800 (comparing
>price of 3 Si680 cards with one 3ware 7506-8).  Unfortunately the disks cannot be
>moved from the 3ware back to the Si680s without another mighty backup and restore.
>
>>>/dev/md3:
>>> Timing buffer-cache reads:   128 MB in  0.35 seconds =365.71 MB/sec
>>> Timing buffered disk reads:  64 MB in 21.93 seconds =  2.92 MB/sec
>>>(last horribly slow; get zillions of lines in syslog saying stuff like:
>>>Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
>>>Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
>>>Apr  6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->
>>>
>>I'm not sure what this one indicates, actually
>>
>>>Any pointers to web sites, information that may help, any hints, suggestions,
>>>ideas,... all most welcome.  Actually, if replacing the motherboard would fix
>>>it, I'd do it, but I cannot guess why it should help; Asus motherboards have
>>>always been good to me before.
>>>
>>Your choice of components looks fine, it's all stuff I trust; even if the
>>ethernet card is not good for performance it ought to be fine in
>>general. If it is a faulty part, most likely it's a one-off fault.
>>
>>Which bits of the system are not being used (sound, video, network ?)
>
>--
>Nick Urbanik   RHCE                               nicku(at)vtc.edu.hk
>Dept. of Information & Communications Technology
>Hong Kong Institute of Vocational Education (Tsing Yi)
>Tel:   (852) 2436 8576, (852) 2436 8713          Fax: (852) 2436 8526
>PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B     ID: 7529555D
>GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24   ID: BB9D2C24
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/

end of thread, other threads:[~2003-09-15 20:10 UTC | newest]

Thread overview: 15+ messages
2003-04-06  6:32 Debugging hard lockups (hardware?) Nick Urbanik
2003-04-06  6:39 ` Nick Urbanik
2003-04-06  9:27 ` John Bradford
2003-04-06 10:55 ` Bruce Harada
2003-04-06 18:34 ` Alan Cox
2003-04-06 20:29   ` Arador
2003-04-06 21:04     ` Alan Cox
2003-04-17 16:12       ` Arador
2003-04-06 20:31   ` Dave Jones
2003-04-06 22:02     ` Nick Urbanik
2003-04-06 23:15       ` Dave Jones
2003-04-08  3:33     ` Andre Hedrick
2003-04-08 12:35       ` Dave Jones
2003-09-15  8:49   ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
2003-09-15 19:57     ` Resident Boxholder