* [parisc-linux] Re: Dodgy SCSI in L2000
From: Grant Grundler @ 2002-05-09 16:54 UTC
To: jamesb; +Cc: parisc-linux

"James Braid" wrote:
...
> Kernel Fault: Code=15 regs=000000004eb88000 (Addr=000000005eb80018)
> IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000010108fe0 0000000010108fe4

The fault was caused at 0x0000000010108fe0 - I need to see the
matching vmlinux and System.map to determine where this is in the code.

> IIR: 487a0030 ISR: 0000000000000000 IOR: 000000005eb80018
> CPU: 0 CR30: 000000004eb48000 CR31: 0000000010460000
> ORIG_R28: 000000001012c9fc
>
> I then hard rebooted (rs command from the gsp), ran dbench 10 on sdc,
> and the box just completely froze after about half a line of dots from
> dbench.

When the box freezes, do "tc" from GSP. On reboot, at PDC prompt
type "ser pim" to get the state of the machine when it was TC'd.
Once you've saved the PIM dump, it's good to clear PIM.
(iirc, "ser clearpim")
Again, save matching System.map and vmlinux.

> So I was thinking that sdc may be a dodgy disk/controller, so
> then I rebooted again, ran dbench 10 on sdd, and it worked fine. Tried
> sdb, that was fine too. Okay, so there's something weird happening here,
> I tried dbench 10 on sdc now, and it ran fine. Ran dbench 10 a few more
> times on sdc and it ran fine every time. Also ran dbench 10 over all
> disks serially 4 times and no errors.

Well, that's interesting.

> I rebooted it cleanly (finally!) and went into the boot menu thing, and
> into the service menu, and had a look through the options. I saw the
> scsi paths were set to fast for the boot disk and ultra for the other 3
> disks.

IIRC, setting _SYNC parameter to 10 is equivalent to "fast".

> I set all the scsi paths to "fast" instead of ultra, booted up
> and ran dbench 10 on sdb and sdc simultaneously...it went okay for a
> while, and then, another kernel panic:
>
> Dumping Stack from 0x0000000056f10000 to 0x0000000056f11380:
> WARNING! Stack pointer and cr30 do not correspond!

oic. In cases like this, we have to disable Stack Dumping since
it data page faults. I suspect that's what's happened in the
previous dumps too. You can disable stack dumps by changing "#if 1"
to "#if 0" on line 149 (show_stack()) in arch/parisc/kernel/traps.c.

BTW, typically this msg means a kernel driver is attempting to directly
access user space data instead of copying the data into kernel space.

...
> Hard rebooted it (*again*), ran dbench 10 on 1 disk (sdd), it ran fine,
> so I cranked it up to dbench 100. That crashed nicely with this panic:
>
> Dumping Stack from 0x0000000056390000 to 0x0000000056390000:
>
> Kernel Fault: Code=15 regs=0000000046390000 (Addr=0000000056388018)

did you get the "Stack pointer and cr30 do not correspond!" msg
before this? Well, I guess it doesn't matter...keep an eye out
for it though.

> I have no idea what's going on here now :(

Me either since I've not seen this problem. This does sound like
the SCSI interface driver is hitting a corner case and dying there.
But that's just a SWAG.

I'll have to get dbench and try it on the a500 when that's available.

> Is there anything I need to do to decode these kernel panics or anything
> (I'm not a kernel hacker at all, so I don't really know much about the
> panics). I did notice that the ORIG_R28 part is identical on the panics
> though - no idea what this means.

GR02 and IAOQ are my starting points.
get "a.c" from http://cvs.parisc-linux.org/build-tools/
and use that to look up symbols in System.map.

> I am running ext3 on all my disks - could this be causing any problems?

I doubt it. I'm running ext3 on all my machines.

> I did however notice that the problems still occurred running ext2
> before I re-made the filesystems.

yeah - i don't think this is related to anything in the file system.

...
> As for the good news, I tried a SMP kernel, and SMP works :)
> It sees both CPUs and uses them (I think, top doesn't show cpu usages,
> as per the bug in the bug tracking system).

SMP boots - but it's still less stable than UP. Maybe because of the
same problem you are running into here.

grant
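
The System.map lookup Grant describes (take the IAOQ address, find
the symbol it falls in) can be sketched roughly as below. This is
not the "a.c" tool from build-tools, just a minimal illustration of
the idea, assuming the usual "<hex address> <type> <name>" layout of
System.map. The function names, SAMPLE_MAP and the symbol names in
it are made up for the example; only the address 0x10108fe0 comes
from the panic above.

```python
import bisect

# Hypothetical System.map excerpt: these symbol names and addresses are
# invented for illustration and do not come from any real kernel build.
SAMPLE_MAP = """\
0000000010108f00 T hypothetical_func_a
0000000010108fc0 T hypothetical_func_b
0000000010109100 T hypothetical_func_c
"""


def parse_system_map(text):
    """Parse "<hex addr> <type> <name>" lines into a sorted (addr, name) list."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, _sym_type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def lookup(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # addr is below the first symbol in the map
    base, name = symbols[i]
    return name, addr - base


if __name__ == "__main__":
    syms = parse_system_map(SAMPLE_MAP)
    hit = lookup(syms, 0x0000000010108FE0)  # IAOQ value from the panic
    if hit is not None:
        print("%s+0x%x" % hit)
```

With a real System.map that matches the running vmlinux, the same
lookup would name the function containing the faulting instruction;
with a mismatched map the answer is meaningless, which is why the
matching pair has to be saved.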
* RE: [parisc-linux] Re: Dodgy SCSI in L2000
From: James Braid @ 2002-05-09 22:46 UTC
To: 'Grant Grundler'; +Cc: parisc-linux

> The fault was caused at 0x0000000010108fe0 - I need to see the
> matching vmlinux and System.map to determine where this is in
> the code.
>

Ah, mind if I send you the vmlinux and System.map off-list?

> When the box freezes, do "tc" from GSP. On reboot, at PDC prompt
> type "ser pim" to get the state of the machine when it was TC'd.
> Once you've saved the PIM dump, it's good to clear PIM.
> (iirc, "ser clearpim")
> Again, save matching System.map and vmlinux.
>
> IIRC, setting _SYNC parameter to 10 is equivalent to "fast".

Okay, I will try that and put a new kernel in.

> oic. In cases like this, we have to disable Stack Dumping since
> it data page faults. I suspect that's what's happened in the
> previous dumps too. You can disable stack dumps by changing "#if 1"
> to "#if 0" on line 149 (show_stack()) in
> arch/parisc/kernel/traps.c.

I will disable that as well and put a new kernel in.

> BTW, typically this msg means a kernel driver is attempting
> to directly access user space data instead of copying the data
> into kernel space.

That sounds nasty...

> > so I cranked it up to dbench 100. That crashed nicely with
> > this panic:
> >
> > Dumping Stack from 0x0000000056390000 to 0x0000000056390000:
> >
> > Kernel Fault: Code=15 regs=0000000046390000 (Addr=0000000056388018)
>
> did you get the "Stack pointer and cr30 do not correspond!"
> msg before this?

Yep, I did get the "Stack pointer..." stuff before this, I left it off
the email though.

> Me either since I've not seen this problem. This does sound like
> the SCSI interface driver is hitting a corner case and dying there.
> But that's just a SWAG.
>
> I'll have to get dbench and try it on the a500 when that's
> available.

Cool.

Thanks, James.
* Re: [parisc-linux] Re: Dodgy SCSI in L2000
From: Grant Grundler @ 2002-05-11 21:14 UTC
To: James Braid; +Cc: parisc-linux

"James Braid" wrote:
> Ah, mind if I send you the vmlinux and System.map off-list?

Yes - I do mind. For now, don't bother. I don't have time.
And the next round of debug info will be more interesting.

In general, please put them on a publicly accessible http or ftp
server and I'll pull them when I have time. Or someone else can
if I don't. If that's not possible, contact me off-list and
I'll set up an account for you to push them to.

grant
* [parisc-linux] Re: Dodgy SCSI in L2000
From: James Braid @ 2002-05-16 5:00 UTC
To: parisc-linux

Hey,

I have applied the patch just posted to the list (irq.c patch). I'm
running the latest CVS kernel on a dual 440MHz L2000, 1GB RAM, 4x 18.2GB
LVD SCSI disks.

I am seeing the same problems I have seen before (SCSI resets etc.), BUT
the box is not kernel panicking any more - which is an improvement.

Dbench works fine on single disks (i.e. running one instance of dbench
on one disk) - up to 200 clients (didn't bother trying further). But
when I try to run 2 instances of dbench on any 2 disks in the box, I get
all sorts of SCSI bus resets and errors. Here's a cut and paste from the
console:

---------
scsi : aborting command due to timeout : pid 200512, scsi0, channel 0, id 0, lun 0 Read (10) 00 02 03 78 20 00 00 08 00
sym53c8xx_abort: pid=200512 serial_number=200514 serial_number_at_timeout=200514
SCSI host 0 abort (pid 200512) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=200512 reset_flags=2 serial_number=200514 serial_number_at_timeout=200514
scsi : aborting command due to timeout : pid 200771, scsi0, channel 0, id 2, lun 0 Write (10) 00 01 98 22 c8 00 00 08 00
sym53c8xx_abort: pid=200771 serial_number=200773 serial_number_at_timeout=200773
scsi : aborting command due to timeout : pid 200772, scsi0, channel 0, id 2, lun 0 Write (10) 00 00 01 10 a8 00 00 08 00
sym53c8xx_abort: pid=200772 serial_number=200774 serial_number_at_timeout=200774
scsi : aborting command due to timeout : pid 200773, scsi0, channel 0, id 2, lun 0 Write (10) 00 02 00 63 e0 00 00 08 00
sym53c8xx_abort: pid=200773 serial_number=200775 serial_number_at_timeout=200775
scsi : aborting command due to timeout : pid 200774, scsi0, channel 0, id 2, lun 0 Write (10) 00 00 d0 51 b8 00 00 08 00
sym53c8xx_abort: pid=200774 serial_number=200776 serial_number_at_timeout=200776
scsi : aborting command due to timeout : pid 200775, scsi0, channel 0, id 0, lun 0 Write (10) 00 00 40 4a 38 00 00 18 00
sym53c8xx_abort: pid=200775 serial_number=200777 serial_number_at_timeout=200777
scsi : aborting command due to timeout : pid 200776, scsi0, channel 0, id 2, lun 0 Write (10) 00 01 b0 38 e0 00 00 08 00
sym53c8xx_abort: pid=200776 serial_number=200778 serial_number_at_timeout=200778
scsi : aborting command due to timeout : pid 200777, scsi0, channel 0, id 2, lun 0 Write (10) 00 00 04 2d 80 00 00 08 00
sym53c8xx_abort: pid=200777 serial_number=200779 serial_number_at_timeout=200779
scsi : aborting command due to timeout : pid 200778, scsi0, channel 0, id 2, lun 0 Write (10) 00 01 1c 5b 90 00 00 08 00
sym53c8xx_abort: pid=200778 serial_number=200780 serial_number_at_timeout=200780
scsi : aborting command due to timeout : pid 200779, scsi0, channel 0, id 2, lun 0 Write (10) 00 00 d0 52 c0 00 00 08 00
sym53c8xx_abort: pid=200779 serial_number=200781 serial_number_at_timeout=200781
SCSI host 0 abort (pid 200780) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=200780 reset_flags=2 serial_number=200782 serial_number_at_timeout=200782
SCSI host 0 abort (pid 201014) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201014 reset_flags=2 serial_number=201016 serial_number_at_timeout=201016
SCSI host 0 abort (pid 201161) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201161 reset_flags=2 serial_number=201163 serial_number_at_timeout=201163
SCSI host 0 abort (pid 201174) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201174 reset_flags=2 serial_number=201176 serial_number_at_timeout=201176
SCSI host 0 abort (pid 201187) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201187 reset_flags=2 serial_number=201189 serial_number_at_timeout=201189
SCSI host 0 abort (pid 201200) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201200 reset_flags=2 serial_number=201202 serial_number_at_timeout=201202
SCSI host 0 abort (pid 201213) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201213 reset_flags=2 serial_number=201215 serial_number_at_timeout=201215
SCSI host 0 abort (pid 201226) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201226 reset_flags=2 serial_number=201228 serial_number_at_timeout=201228
---------

And so on and so on like this.

Grant has mentioned that the termination or SCSI cables could be an
issue, but as I have no replacements for this box I can't really test
this out.

Before I applied the irq.c patch, the box would panic just running
dbench on one single disk.

If anyone has any ideas or possible solutions on what could be causing
this, I'd *love* to hear them. If you need any further details, just
let me know.

I've also tried compiling the Qlogic ISP scsi driver (we have a bunch of
these cards lying around from our SGI boxes) but it doesn't want to
compile on PA-RISC. Are there any other SCSI cards whose drivers are
known to compile under PA-RISC? I was thinking I could then leave just
the root disk on the core I/O board and use another SCSI controller for
the other 3 disks. Is this possible?

Cheers, James

--
James Braid
System Administrator
Peace Software
Ph: +64 9 373 0400
Email: james.braid@peace.com
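
The abort lines in the console log above carry the target id and the
command bytes, so they can be decoded to see which disks and block
ranges are timing out. The sketch below assumes the nine hex bytes
printed after "Read (10)" / "Write (10)" are bytes 1 through 9 of a
standard 10-byte SCSI command (flags, 4-byte big-endian LBA, group,
2-byte big-endian transfer length, control). The regular expression
is only fitted to the lines shown here, and the function and variable
names are made up for the example.

```python
import re
from collections import Counter

# Fitted to the "aborting command due to timeout" lines in the log above.
# \s+ tolerates the command bytes being wrapped onto a following line.
ABORT_RE = re.compile(
    r"id (\d+), lun (\d+)\s+(Read|Write) \(10\)\s+"
    r"((?:[0-9a-f]{2}\s+){8}[0-9a-f]{2})"
)


def decode_aborts(log_text):
    """Return one dict per aborted READ(10)/WRITE(10) found in log_text."""
    aborts = []
    for m in ABORT_RE.finditer(log_text):
        target, lun, op, hex_bytes = m.groups()
        cdb_tail = bytes.fromhex("".join(hex_bytes.split()))
        aborts.append({
            "target": int(target),
            "lun": int(lun),
            "op": op,
            # Bytes 2-5 of the CDB: logical block address, big-endian.
            "lba": int.from_bytes(cdb_tail[1:5], "big"),
            # Bytes 7-8 of the CDB: transfer length in blocks, big-endian.
            "blocks": int.from_bytes(cdb_tail[6:8], "big"),
        })
    return aborts


def aborts_per_target(aborts):
    """Count aborted commands per SCSI target id."""
    return Counter(a["target"] for a in aborts)


if __name__ == "__main__":
    sample = (
        "scsi : aborting command due to timeout : pid 200512, scsi0, "
        "channel 0, id 0, lun 0 Read (10) 00 02 03 78 20 00 00 08 00\n"
    )
    for a in decode_aborts(sample):
        print("id %(target)d %(op)s lba=0x%(lba)08x blocks=%(blocks)d" % a)
```

Feeding the whole console capture to decode_aborts() and then to
aborts_per_target() gives a quick per-disk tally. That does not by
itself separate a cabling or termination problem from a driver
problem, but it shows whether the timeouts cluster on one target.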