* Debugging hard lockups (hardware?)
@ 2003-04-06 6:32 Nick Urbanik
2003-04-06 6:39 ` Nick Urbanik
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Nick Urbanik @ 2003-04-06 6:32 UTC (permalink / raw)
To: Linux Kernel list
Dear team,
This machine locks up solid every few days. Caps Lock, Num Lock and Scroll Lock
do not respond. The NMI watchdog does not kick in. Alt-SysRq keys do not
respond. Logs show no hint of any problem (that I recognise) before the lockup.
It often occurs during scrolling, e.g. in Mozilla. I swapped the Radeon 7000 for
a 7500, then for an Nvidia card.
I guess hardware, but memtest run exhaustively shows no problem.
The same happens with the large number of kernels I've tried, mostly Red Hat and
-ac kernels, old and new.
Current grub kernel line:
kernel /boot/vmlinuz-2.4.20-8custom ro root=LABEL=/ vga=6 console=ttyS0,38400 nmi_watchdog=1 hdm=ide-scsi
I have six 80 GB IDE disks with software RAID and LVM on top, running Red Hat
8.0 and 9.
Any hints on how to troubleshoot this (besides replacing the motherboard and
other components I cannot afford to replace)?
$ cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 hdc1[1] hda1[0]
10240128 blocks [2/2] [UU]
md2 : active raid5 hdk1[3] hdi1[2] hdg1[1] hde1[0]
20480256 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid5 hdk2[5] hdi2[4] hdg2[2] hde2[3] hdc2[1] hda2[0]
4088064 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
md3 : active raid5 hdk3[6] hdi3[4] hdg3[3] hde3[2] hdc3[1] hda3[0]
267514112 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
unused devices: <none>
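A note on reading this output: in /proc/mdstat the bracketed status field shows one U per working member and an underscore for a missing one, so the four arrays above all report healthy. A minimal sketch of an automated check follows, assuming a POSIX shell and awk. The function name md_degraded is hypothetical and is not part of the original message.

```sh
# List md arrays whose status field (e.g. [UU_]) shows a missing member.
# $1 = file in /proc/mdstat format; defaults to /proc/mdstat itself.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember which array is being described
        match($0, /\[[U_]+\]/) {             # status field such as [UUU] or [U_U]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling md_degraded with no argument prints nothing when every array is complete, and one array name per line otherwise.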
$ lspci
00:00.0 Host bridge: Intel Corp. 82845 845 (Brookdale) Chipset Host Bridge (rev 11)
00:01.0 PCI bridge: Intel Corp. 82845 845 (Brookdale) Chipset AGP Bridge (rev 11)
00:1d.0 USB Controller: Intel Corp. 82801DB USB (Hub #1) (rev 01)
00:1d.1 USB Controller: Intel Corp. 82801DB USB (Hub #2) (rev 01)
00:1d.2 USB Controller: Intel Corp. 82801DB USB (Hub #3) (rev 01)
00:1d.7 USB Controller: Intel Corp. 82801DB USB EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corp. 82801DB ISA Bridge (LPC) (rev 01)
00:1f.1 IDE interface: Intel Corp. 82801DB ICH4 IDE (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2 MX/MX 400] (rev b2)
02:03.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
02:04.0 FireWire (IEEE 1394): NEC Corporation: Unknown device 00f2 (rev 01)
02:08.0 Ethernet controller: Intel Corp. 82801BD PRO/100 VE (LOM) Ethernet Controller (rev 81)
02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
02:0c.0 SCSI storage controller: Advanced Micro Devices [AMD] 53c974 [PCscsi] (rev 10)
02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
02:0e.0 RAID bus controller: CMD Technology Inc PCI0649 (rev 02)
$ lsmod
Module Size Used by Not tainted
nls_iso8859-1 3516 1 (autoclean)
cmpci 35464 1 (autoclean)
soundcore 6436 4 (autoclean) [cmpci]
ppp_deflate 4504 2 (autoclean)
zlib_deflate 21624 0 (autoclean) [ppp_deflate]
osst 48072 0 (autoclean) (unused)
st 30736 0 (autoclean) (unused)
usb-storage 68276 0 (unused)
binfmt_misc 7400 1
parport_pc 19108 1 (autoclean)
lp 8996 0 (autoclean)
parport 36960 1 (autoclean) [parport_pc lp]
nfsd 79920 8 (autoclean)
lockd 58128 1 (autoclean) [nfsd]
sunrpc 81372 1 (autoclean) [nfsd lockd]
autofs 13108 1 (autoclean)
n_hdlc 8000 1
ppp_synctty 7840 1
ppp_async 9376 1
ppp_generic 24540 6 [ppp_deflate ppp_synctty ppp_async]
slhc 6628 1 [ppp_generic]
ne2k-pci 7168 1 (autoclean)
8390 8364 0 (autoclean) [ne2k-pci]
e100 63908 1
ipt_multiport 1176 5 (autoclean)
ipt_REJECT 3832 2 (autoclean)
ipt_limit 1560 2 (autoclean)
ipt_MASQUERADE 2136 3 (autoclean)
iptable_filter 2412 1 (autoclean)
ipt_LOG 4248 11
ipt_state 1080 23
ip_nat_ftp 4016 0 (unused)
ip_conntrack_ftp 5264 1
iptable_nat 20824 2 [ipt_MASQUERADE ip_nat_ftp]
ip_conntrack 26976 3 [ipt_MASQUERADE ipt_state ip_nat_ftp ip_conntrack_ftp iptable_nat]
ip_tables 14840 10 [ipt_multiport ipt_REJECT ipt_limit ipt_MASQUERADE iptable_filter ipt_LOG ipt_state iptable_nat]
sg 35820 0 (autoclean)
sr_mod 17912 0 (autoclean)
ide-scsi 12176 0
ide-cd 35580 0
cdrom 33120 0 [sr_mod ide-cd]
loop 12056 3 (autoclean)
keybdev 2976 0 (unused)
mousedev 5460 1
hid 21956 0 (unused)
input 5888 0 [keybdev mousedev hid]
usb-uhci 26188 0 (unused)
usbcore 78272 2 [usb-storage hid usb-uhci]
ext3 69760 11
jbd 51828 11 [ext3]
tmscsim 37088 0
sd_mod 13484 0 (unused)
scsi_mod 106872 7 [osst st usb-storage sg sr_mod ide-scsi tmscsim sd_mod]
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.26GHz
stepping : 4
cpu MHz : 2289.252
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips : 4561.30
$ free
             total       used       free     shared    buffers     cached
Mem:       1030780    1020180      10600          0     158692     633092
-/+ buffers/cache:     228396     802384
Swap:      4088056      31020    4057036
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0              10079212   5805256   3761952  61% /
none                    515388         0    515388   0% /dev/shm
/dev/main/home         2873984    814604   1913564  30% /home
/dev/main/usr         10321208   5762360   4034560  59% /usr
/dev/main/var          3201076   1168292   1870176  39% /var
/dev/main/var2        24520572   5211756  18063224  23% /var2
/dev/main/nicku       55734652  27123980  26345748  51% /home/nicku
/dev/main/photos      46445552  28598296  15487960  65% /home/nicku/.photos
/dev/main/mail         8256952   5107616   2729908  66% /home/nicku/work/nsmail
/dev/main/ftp         98051740  77431664  15639340  84% /var/ftp
/dev/main/mp3         10321208   8117692   1679228  83% /mp3
/dev/main/cdimage     23738812  18602584   4171540  82% /cdimage
$ sudo hdparm -tT /dev/md{0,1,2,3}
/dev/md0:
Timing buffer-cache reads: 128 MB in 0.36 seconds =355.56 MB/sec
Timing buffered disk reads: 64 MB in 1.67 seconds = 38.32 MB/sec
/dev/md1:
Timing buffer-cache reads: 128 MB in 0.37 seconds =345.95 MB/sec
Timing buffered disk reads: 64 MB in 0.77 seconds = 83.12 MB/sec
/dev/md2:
Timing buffer-cache reads: 128 MB in 0.35 seconds =365.71 MB/sec
Timing buffered disk reads: 64 MB in 1.31 seconds = 48.85 MB/sec
/dev/md3:
Timing buffer-cache reads: 128 MB in 0.35 seconds =365.71 MB/sec
Timing buffered disk reads: 64 MB in 21.93 seconds = 2.92 MB/sec
(The last is horribly slow; I get zillions of lines in syslog such as:
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 --> 4096
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 --> 4096
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
Apr 6 14:08:50 nicksbox last message repeated 2 times
Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 4096 --> 1024
This does not occur like that with RH 2.4.18-x kernels.)
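To put a number on "zillions", the messages can be counted per second of log time. The following is only a sketch, assuming a POSIX shell with grep, awk and sort, and the Red Hat default log file /var/log/messages. The function name raid5_switch_rate is hypothetical. Lines of the form "last message repeated N times" are not expanded, so the real rate is at least what this reports.

```sh
# Count "raid5: switching cache buffer size" messages per timestamp.
# $1 = syslog file; defaults to /var/log/messages.
# Output: "<count> <month> <day> <hh:mm:ss>", busiest second first.
raid5_switch_rate() {
    grep 'raid5: switching cache buffer size' "${1:-/var/log/messages}" |
        awk '{ n[$1 " " $2 " " $3]++ } END { for (t in n) print n[t], t }' |
        sort -rn
}
```

Comparing the output from a 2.4.20 boot with one from a 2.4.18 boot would show whether the flood is specific to the newer kernel.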
Any pointers to web sites, information that may help, any hints, suggestions,
ideas,... all most welcome. Actually, if replacing the motherboard would fix
it, I'd do it, but I cannot guess why it should help; Asus motherboards have
always been good to me before.
--
Nick Urbanik RHCE nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel: (852) 2436 8576, (852) 2436 8713 Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 6:32 Debugging hard lockups (hardware?) Nick Urbanik
@ 2003-04-06 6:39 ` Nick Urbanik
2003-04-06 9:27 ` John Bradford
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Nick Urbanik @ 2003-04-06 6:39 UTC (permalink / raw)
To: Linux Kernel list
Nick Urbanik wrote:
> Dear team,
>
> This machine locks up solid every few days. Caps Lock, Num Lock, Scroll Lock do
> not respond. The NMI watchdog does not kick in. Alt-SysRq-keys do not
> respond. Logs show no hint of any problem (that I recognise) before lockup.
> Occurs often during scrolling e.g., Mozilla.
I omitted one critical fact: it is only while I am using the machine sitting at the
console; it has never happened while connecting via ssh, even when doing plenty of
I/O.
--
Nick Urbanik RHCE nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel: (852) 2436 8576, (852) 2436 8713 Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 6:32 Debugging hard lockups (hardware?) Nick Urbanik
2003-04-06 6:39 ` Nick Urbanik
@ 2003-04-06 9:27 ` John Bradford
2003-04-06 10:55 ` Bruce Harada
2003-04-06 18:34 ` Alan Cox
3 siblings, 0 replies; 15+ messages in thread
From: John Bradford @ 2003-04-06 9:27 UTC (permalink / raw)
To: Nick Urbanik; +Cc: linux-kernel
> This machine locks up solid every few days. Caps Lock, Num Lock, Scroll Lock do
> not respond. The NMI watchdog does not kick in. Alt-SysRq-keys do not
> respond. Logs show no hint of any problem (that I recognise) before lockup.
> Occurs often during scrolling e.g., Mozilla. I swapped the Radeon 7000 for a
> 7500, then an Nvidia.
>
> I guess hardware. But memtest run exhaustively shows no problem.
Have you tried the CPUBURN utilities?
http://users.ev1.net/~redelm/
John.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 6:32 Debugging hard lockups (hardware?) Nick Urbanik
2003-04-06 6:39 ` Nick Urbanik
2003-04-06 9:27 ` John Bradford
@ 2003-04-06 10:55 ` Bruce Harada
2003-04-06 18:34 ` Alan Cox
3 siblings, 0 replies; 15+ messages in thread
From: Bruce Harada @ 2003-04-06 10:55 UTC (permalink / raw)
To: Nick Urbanik; +Cc: linux-kernel
On Sun, 06 Apr 2003 14:32:27 +0800
Nick Urbanik <nicku@vtc.edu.hk> wrote:
> Dear team,
>
> This machine locks up solid every few days. Caps Lock, Num Lock, Scroll
> Lock do not respond. The NMI watchdog does not kick in. Alt-SysRq-keys do
> not respond. Logs show no hint of any problem (that I recognise) before
> lockup. Occurs often during scrolling e.g., Mozilla. I swapped the Radeon
> 7000 for a 7500, then an Nvidia.
Does booting with 'noapic' make any difference?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 6:32 Debugging hard lockups (hardware?) Nick Urbanik
` (2 preceding siblings ...)
2003-04-06 10:55 ` Bruce Harada
@ 2003-04-06 18:34 ` Alan Cox
2003-04-06 20:29 ` Arador
` (2 more replies)
3 siblings, 3 replies; 15+ messages in thread
From: Alan Cox @ 2003-04-06 18:34 UTC (permalink / raw)
To: Nick Urbanik; +Cc: Linux Kernel Mailing List
On Sun, 2003-04-06 at 07:32, Nick Urbanik wrote:
> This machine locks up solid every few days. Caps Lock, Num Lock, Scroll Lock do
> not respond. The NMI watchdog does not kick in.
For the NMI watchdog to fail (if you have it enabled) requires a pretty
major disaster to have occurred, since the NMI will be delivered through
any kind of system hang.
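One way to confirm that the watchdog is actually enabled, rather than merely requested on the kernel command line, is to see whether the NMI count in /proc/interrupts advances. This is a sketch under assumptions, not something from the original message: it assumes a POSIX shell and awk, and the function names nmi_count and nmi_ticking are hypothetical. A count that stays at zero suggests the watchdog never started; depending on the watchdog mode, an idle machine may also tick slowly.

```sh
# Sum the per-CPU NMI counts.
# $1 = file in /proc/interrupts format; defaults to /proc/interrupts.
nmi_count() {
    awk '$1 == "NMI:" { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i }
         END { print s + 0 }' "${1:-/proc/interrupts}"
}

# Sample the live count twice, $1 seconds apart (default 5), and report.
nmi_ticking() {
    before=$(nmi_count)
    sleep "${1:-5}"
    after=$(nmi_count)
    if [ "$after" -gt "$before" ]; then
        echo "NMI count advanced: $before -> $after"
    else
        echo "NMI count did not advance ($before); the watchdog may not be active"
    fi
}
```

Running nmi_ticking 10 on the affected machine shortly after boot would be a quick first check.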
> I guess hardware. But memtest run exhaustively shows no problem.
Memory errors normally produce "Oops"-type messages rather than this kind
of behaviour.
> I have six 80 G IDE disks, software RAID, LVM on top. On Red Hat 8.0 and 9.
>
> Any hints on how to troubleshoot this (besides replacing motherboard and other
> components I cannot afford to replace?)
Is your PSU up to scratch for six disks ?
> /dev/md3:
> Timing buffer-cache reads: 128 MB in 0.35 seconds =365.71 MB/sec
> Timing buffered disk reads: 64 MB in 21.93 seconds = 2.92 MB/sec
> (last horribly. slow; get zillions of lines in syslog saying stuff like:
> Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
> Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
> Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->
I'm not sure what this one indicates, actually.
> Any pointers to web sites, information that may help, any hints, suggestions,
> ideas,... all most welcome. Actually, if replacing the motherboard would fix
> it, I'd do it, but I cannot guess why it should help; Asus motherboards have
> always been good to me before.
Your choice of components looks fine; it's all stuff I trust. Even if the
ethernet card is not good for performance, it ought to be fine in
general. If it is a faulty part, most likely it's a one-off fault.
Which bits of the system are not being used (sound, video, network)?
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 18:34 ` Alan Cox
@ 2003-04-06 20:29 ` Arador
2003-04-06 21:04 ` Alan Cox
2003-04-06 20:31 ` Dave Jones
2003-09-15 8:49 ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
2 siblings, 1 reply; 15+ messages in thread
From: Arador @ 2003-04-06 20:29 UTC (permalink / raw)
To: Alan Cox; +Cc: nicku, linux-kernel
On 06 Apr 2003 19:34:09 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> For the NMI watchdog to fail (if you have it enabled) requires pretty
> major disaster to have occurred since the NMI will be delivered through
> any kind of system hang
I have a similar hang: no oops, no SysRq, no NMI messages.
But mine only happens under 2.5, and has done for a long time.
The one strange thing is that it does not seem to be fully hung,
since the X pointer moves at 3-5 second intervals (it even
changes shape at the window corners).
It happens without X too but, as I said, nothing survives:
no oops, SysRq or NMI messages, and it doesn't answer pings.
I can tell from the sound of the fans that CPU usage goes to 100%.
I'm thinking of a hardware failure too (but the odd X behaviour makes
me hesitate), since I don't remember it failing under 2.4.
This box hasn't run 2.4 kernels much, though.
The box passes memtest86; it's
a 2x800 MHz box, IDE disk, 256 MB RAM,
VIA chipset... just in the (very unlikely) case that somebody
has exactly the same box and does or doesn't have the same problems.
(Elitegroup D6VAA motherboard)
Diego Calleja
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 18:34 ` Alan Cox
2003-04-06 20:29 ` Arador
@ 2003-04-06 20:31 ` Dave Jones
2003-04-06 22:02 ` Nick Urbanik
2003-04-08 3:33 ` Andre Hedrick
2003-09-15 8:49 ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
2 siblings, 2 replies; 15+ messages in thread
From: Dave Jones @ 2003-04-06 20:31 UTC (permalink / raw)
To: Alan Cox; +Cc: Nick Urbanik, Linux Kernel Mailing List
On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
> > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
> > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
> ...
> Your choice of components looks fine, its all stuff I trust, even if the
> ethernet card is not good for performance it ought to be fine in
> general. If it is a faulty part most likely its a one off fault.
Note the IDE controller, and 2.5 bugzilla #123
That controller has been nothing but trouble for me.
Dave
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 20:29 ` Arador
@ 2003-04-06 21:04 ` Alan Cox
2003-04-17 16:12 ` Arador
0 siblings, 1 reply; 15+ messages in thread
From: Alan Cox @ 2003-04-06 21:04 UTC (permalink / raw)
To: Arador; +Cc: nicku, Linux Kernel Mailing List
On Sun, 2003-04-06 at 21:29, Arador wrote:
> I've a similar hang; no oops; no sysrq; no NMI messages;
> But mine only happens under 2.5; since long time ago.
> The one strange thing is that it seems that it's not hanged;
> since the X pointer moves in 3-5 seconds intervals (it even
> change the shape in the window's corners).
So it's not hanging, but acting as if something is burning
CPU. If you can reproduce this outside X, next time it occurs
use right-alt, shift or ctrl with ScrollLock to get some data
on what it's doing.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 20:31 ` Dave Jones
@ 2003-04-06 22:02 ` Nick Urbanik
2003-04-06 23:15 ` Dave Jones
2003-04-08 3:33 ` Andre Hedrick
1 sibling, 1 reply; 15+ messages in thread
From: Nick Urbanik @ 2003-04-06 22:02 UTC (permalink / raw)
To: Dave Jones; +Cc: Alan Cox, Linux Kernel Mailing List
Dave Jones wrote:
> On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
> > > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
> > > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
> > ...
> > Your choice of components looks fine, its all stuff I trust, even if the
> > ethernet card is not good for performance it ought to be fine in
> > general. If it is a faulty part most likely its a one off fault.
>
> Note the IDE controller, and 2.5 bugzilla #123
> That controller has been nothing but trouble for me.
>
> Dave
Yes, it was the first thing I suspected, so I went out to the fabled Golden
Shopping Centre and bought all the alternative disk controllers I could (except
for 3ware, which I lacked the cash for). I tried HighPoint HPT 370A (Adaptec
1200A), HighPoint HPT368, Promise PDC20270 (FastTrak100 TX2), and the PDC20276
built onto the motherboard, and a HighPoint HPT302 which didn't work properly at
all. I still got the lockups with various permutations of the non-Silicon Image
chipsets, and found that to my amazement, the Silicon Image 680 chips gave the
best performance. I too had major problems with CMD64x boards on a production
2-CPU system (server for student accounts), so remained with SCSI on that
machine.
Bugzilla for the kernel? I didn't know there is one! I'd better find it.
Thanks Dave.
--
Nick Urbanik RHCE nicku@vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel: (852) 2436 8576, (852) 2436 8713 Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 22:02 ` Nick Urbanik
@ 2003-04-06 23:15 ` Dave Jones
0 siblings, 0 replies; 15+ messages in thread
From: Dave Jones @ 2003-04-06 23:15 UTC (permalink / raw)
To: Nick Urbanik; +Cc: Alan Cox, Linux Kernel Mailing List
On Mon, Apr 07, 2003 at 06:02:14AM +0800, Nick Urbanik wrote:
> Bugzilla for the kernel? I didn't know there is one! I'd better find it.
> Thanks Dave.
At this time, 2.5 only. http://bugzilla.kernel.org
Dave
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 20:31 ` Dave Jones
2003-04-06 22:02 ` Nick Urbanik
@ 2003-04-08 3:33 ` Andre Hedrick
2003-04-08 12:35 ` Dave Jones
1 sibling, 1 reply; 15+ messages in thread
From: Andre Hedrick @ 2003-04-08 3:33 UTC (permalink / raw)
To: Dave Jones; +Cc: Alan Cox, Nick Urbanik, Linux Kernel Mailing List
That controller is perfectly fine, give it a swing in 2.4 and it rocks.
2.5 is the problem not the hardware.
On Sun, 6 Apr 2003, Dave Jones wrote:
> On Sun, Apr 06, 2003 at 07:34:09PM +0100, Alan Cox wrote:
> > > 02:0a.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
> > > 02:0b.0 RAID bus controller: CMD Technology Inc PCI0680 (rev 01)
> > ...
> > Your choice of components looks fine, its all stuff I trust, even if the
> > ethernet card is not good for performance it ought to be fine in
> > general. If it is a faulty part most likely its a one off fault.
>
> Note the IDE controller, and 2.5 bugzilla #123
> That controller has been nothing but trouble for me.
>
> Dave
Andre Hedrick
LAD Storage Consulting Group
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-08 3:33 ` Andre Hedrick
@ 2003-04-08 12:35 ` Dave Jones
0 siblings, 0 replies; 15+ messages in thread
From: Dave Jones @ 2003-04-08 12:35 UTC (permalink / raw)
To: Andre Hedrick; +Cc: Alan Cox, Nick Urbanik, Linux Kernel Mailing List
On Mon, Apr 07, 2003 at 08:33:12PM -0700, Andre Hedrick wrote:
> That controller is perfectly fine, give it a swing in 2.4 and it rocks.
> 2.5 is the problem not the hardware.
Exactly the same problem in 2.4. Your words on this subject previously:
"Get over it, you have a hardware problem period! I have drives that do this every time."
So which is it?
It does this with every drive I've tried with it, so I'm incredibly
sceptical there's a problem with the drives, especially when they
work 100% fine on other controllers.
Dave
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?)
2003-04-06 21:04 ` Alan Cox
@ 2003-04-17 16:12 ` Arador
0 siblings, 0 replies; 15+ messages in thread
From: Arador @ 2003-04-17 16:12 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel
On 06 Apr 2003 22:04:55 +0100
Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Sul, 2003-04-06 at 21:29, Arador wrote:
> > I've a similar hang; no oops; no sysrq; no NMI messages;
> > But mine only happens under 2.5; since long time ago.
> > The one strange thing is that it seems that it's not hanged;
> > since the X pointer moves in 3-5 seconds intervals (it even
> > change the shape in the window's corners).
>
> So its not hanging, but acting like something gets burning
> CPU. If you can duplicate this in non X next time it occurs
> use right-alt or shift or ctrl and scrolllock get some data
> on what its doing.
It doesn't work either :( (sorry for the delay, but I use
X quite a lot and it took me a while to reproduce it in the console)
The description is above: a hang that has happened randomly in 2.5
for ages, both with and without X, with no SysRq or NMI messages.
Now I have checked that right-alt/shift/ctrl with ScrollLock doesn't work either.
I suspect this might be a hardware failure; I should run it for a while under
2.4 to see if I get similar behaviour.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?): Power Supply!
2003-04-06 18:34 ` Alan Cox
2003-04-06 20:29 ` Arador
2003-04-06 20:31 ` Dave Jones
@ 2003-09-15 8:49 ` Nick Urbanik
2003-09-15 19:57 ` Resident Boxholder
2 siblings, 1 reply; 15+ messages in thread
From: Nick Urbanik @ 2003-09-15 8:49 UTC (permalink / raw)
To: Alan Cox; +Cc: Linux Kernel Mailing List
Dear Folks,
Alan Cox wrote:
> On Sul, 2003-04-06 at 07:32, Nick Urbanik wrote:
> > This machine locks up solid every few days. Caps Lock, Num Lock, Scroll Lock do
> > not respond. The NMI watchdog does not kick in.
>
> For the NMI watchdog to fail (if you have it enabled) requires pretty
> major disaster to have occurred since the NMI will be delivered through
> any kind of system hang
>
> > I guess hardware. But memtest run exhaustively shows no problem.
>
> Memory errors normally generate "Oops" type lines rather than other
> stuff
>
> > I have six 80 G IDE disks, software RAID, LVM on top. On Red Hat 8.0 and 9.
> >
> > Any hints on how to troubleshoot this (besides replacing motherboard and other
> > components I cannot afford to replace?)
>
> Is your PSU up to scratch for six disks ?
It seems that the 470W ToTheMax power supply was the problem, though I've not tested
for more than 17 hours yet.
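A rough way to sanity-check a supply against a stack of disks is to compare the combined spin-up current with the 12 V rail rating. The sketch below assumes a POSIX shell and awk. Every figure passed to it is a placeholder to be replaced with the spin-up current from the drive datasheet and the rail rating from the PSU label, and the function names spinup_amps and rail_ok are hypothetical. It ignores the CPU, video card and fans, which also draw from the supply, so a pass here is necessary rather than sufficient.

```sh
# Total 12 V current if every disk spins up at the same moment.
# $1 = number of disks, $2 = spin-up current per disk in amps (datasheet value)
spinup_amps() {
    awk -v n="$1" -v a="$2" 'BEGIN { printf "%.1f\n", n * a }'
}

# Succeed (exit 0) if the spin-up demand fits within the 12 V rail rating.
# $1 = disks, $2 = amps per disk, $3 = rail rating in amps (PSU label value)
rail_ok() {
    awk -v n="$1" -v a="$2" -v r="$3" 'BEGIN { exit !(n * a <= r) }'
}
```

For example, with a placeholder 2.0 A per disk, spinup_amps 10 2.0 prints 20.0, and rail_ok 10 2.0 15 fails, which is the kind of shortfall that staggered spin-up or a larger supply addresses.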
Yesterday, after adding two disks to the motherboard IDE together with the new 3ware
7506-8 with 8 x 80G disks (a total of 10 disks), the 3ware RAID 5 unit refused to
rebuild. Then I noticed that many dma timeout messages appeared in the logs for all
the disks. I promptly shut the system down, took my little boy Linus for a walk to
the fabled Golden Shopping Center and bought an Antec 550W supply, plus a few more
case fans. The 3ware RAID5 unit, and also the kernel md RAID1 built nicely. So far
no dma_intr messages have appeared (and it has not locked up).
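Errors spread across all the disks, rather than concentrated on one, point away from a single bad drive or cable. A minimal sketch for tallying them per device follows, assuming a POSIX shell with grep, awk and sort, and the Red Hat default log file /var/log/messages. The three message patterns are assumptions based on typical 2.4 IDE driver output and may need adjusting to match the actual log; the function name dma_errors_by_disk is hypothetical.

```sh
# Count DMA-related kernel error messages per IDE device.
# $1 = syslog file; defaults to /var/log/messages.
# Output: "<device> <count>", sorted by device name.
dma_errors_by_disk() {
    grep -E 'kernel: hd[a-z]+: (dma_intr|dma_timer_expiry|timeout waiting for DMA)' \
        "${1:-/var/log/messages}" |
        awk '{ sub(/:$/, "", $6); n[$6]++ } END { for (d in n) print d, n[d] }' |
        sort
}
```

An empty result after the supply swap, against a long list before it, would back up the power-supply diagnosis.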
I wish I had changed the power supply before buying the 3ware card so that I could
find out whether the Si680 cards were okay or not! HK$135 * 3 <<< HK$3800 (comparing
price of 3 Si680 cards with one 3ware 7506-8). Unfortunately the disks cannot be
moved from the 3ware back to the Si680s without another mighty backup and restore.
> > /dev/md3:
> > Timing buffer-cache reads: 128 MB in 0.35 seconds =365.71 MB/sec
> > Timing buffered disk reads: 64 MB in 21.93 seconds = 2.92 MB/sec
> > (last horribly. slow; get zillions of lines in syslog saying stuff like:
> > Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 4096
> > Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 0 --> 1024
> > Apr 6 14:08:50 nicksbox kernel: raid5: switching cache buffer size, 1024 -->
>
> Im not sure what this one indicates actually
>
> > Any pointers to web sites, information that may help, any hints, suggestions,
> > ideas,... all most welcome. Actually, if replacing the motherboard would fix
> > it, I'd do it, but I cannot guess why it should help; Asus motherboards have
> > always been good to me before.
>
> Your choice of components looks fine, its all stuff I trust, even if the
> ethernet card is not good for performance it ought to be fine in
> general. If it is a faulty part most likely its a one off fault.
>
> Which bits of the system are not being used (sound, video, network ?)
--
Nick Urbanik RHCE nicku(at)vtc.edu.hk
Dept. of Information & Communications Technology
Hong Kong Institute of Vocational Education (Tsing Yi)
Tel: (852) 2436 8576, (852) 2436 8713 Fax: (852) 2436 8526
PGP: 53 B6 6D 73 52 EE 1F EE EC F8 21 98 45 1C 23 7B ID: 7529555D
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Debugging hard lockups (hardware?): Power Supply!
2003-09-15 8:49 ` Debugging hard lockups (hardware?): Power Supply! Nick Urbanik
@ 2003-09-15 19:57 ` Resident Boxholder
0 siblings, 0 replies; 15+ messages in thread
From: Resident Boxholder @ 2003-09-15 19:57 UTC (permalink / raw)
To: Kernel List
FAQs for ACPI, IDE, no-ping or unrecognized ethernet problems,
seek errors and random crashes
When it seems like there's a bug in the hardware or in Linux but it's really
just a missing ACPI ID, default code runs slower, timing issues crop up and
something breaks. Kill the scapegoats (swap cards, motherboards, try all
config/kernel options, eBay a 3ware).
Power supply was the first thing I replaced, going to a 550 W Coolmax
with the 3 A 5 V standby rail.
The second thing one should do is flash the BIOS. If the BIOS ACPI does not
recognize hardware IDs, Linux falls back to reasonable defaults that work
fairly well but are slower than ideal, and on a system faster than
133 MHz FSB and a 1 GHz CPU that's enough to cause I/O to miss the boat
with regard to resources and anything relating to timing.
memtest86
I set aside a pile of Promise cards, tried a SiI680, and got almost there by
setting all the options that allowed more time for DMA resources to become
available. That got me pretty stable, but the BIOS flash I should have
done first finished the job: proper ACPI hardware IDs made all the
fastest code available, so DMA resources became available soon
enough.
Options that may improve stability are:
--don't use acpi=off
--don't use pci=noacpi
--don't go back to a previous kernel
--CMOS setup: APIC off, let Linux activate the APIC
--don't overclock or push memory before the system is stable (somebody)
--kernel: local APIC on, but try the sub-option IO-APIC off if you
  can't ping through your onboard or add-in ethernet, or
  you see garbage scrolling on boot, or you have random
  freeze-ups
--kernel option: try anticipatory scheduling instead of
  deadline scheduling, to give IDE operations more time to live
--CMOS PCI delayed transaction and hdparm -u1 to unmask
  IRQs would buy time, but can't be used with the SiI680 or CMD640
--hdparm -d1 -c1 -p9 -X70 -u0 for a drive on the SiI680; pio9 is
  a special escape pseudo-mode for the SiI680, and -p9 probably
  just says "-d1 -c1 -u0 -p4"; redundancy doesn't hurt
--turn off unused USB, parallel, audio, ACPI sleep and
  battery options, anything you don't need, at least until
  you get the system stable
--3ware cards are cheap on eBay! Try everything else first,
  or you may be disappointed by throwing money at
  problems. Success can strike at any moment: my
  3ware card is still in the mail, but the system passes
  bonnie++ and cp -aR /usr /tmp [different md's]
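The first two items in the list can be checked mechanically. What follows is a minimal sketch, assuming a POSIX shell; the function name check_cmdline is hypothetical, and it only looks for the two boot options named above, reading /proc/cmdline when no argument is given.

```sh
# Warn about boot options that the list above advises against.
# $1 = kernel command line; defaults to that of the running kernel.
check_cmdline() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for opt in acpi=off pci=noacpi; do
        case " $cmdline " in
            *" $opt "*) echo "warning: $opt is set" ;;
        esac
    done
}
```

Run with no argument it inspects the current boot; passing a string lets a grub kernel line be checked before rebooting.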
-Bob
^ permalink raw reply [flat|nested] 15+ messages in thread