[parisc-linux] Re: Dodgy SCSI in L2000

Linux PARISC architecture development

* [parisc-linux] Re: Dodgy SCSI in L2000
       [not found] ` <000001c1f741$846d8730$0500a8c0@oscar>
@ 2002-05-09 16:54   ` Grant Grundler
  2002-05-09 22:46     ` James Braid
  0 siblings, 1 reply; 4+ messages in thread
From: Grant Grundler @ 2002-05-09 16:54 UTC (permalink / raw)
  To: jamesb; +Cc: parisc-linux

"James Braid" wrote:
...
> Kernel Fault: Code=15 regs=000000004eb88000 (Addr=000000005eb80018)

> IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000010108fe0 0000000010108fe4

The fault was caused at 0x0000000010108fe0 - I need to see the
matching vmlinux and System.map to determine where this is in
the code.
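
For anyone following along, here is a minimal sketch of pulling the
interesting addresses out of a dump like the one quoted above. It is
only an illustration: the function name parse_fault is made up, and the
sample text is the quoted dump with the "> " markers removed and the
wrapped IAOQ line joined back together.

```python
import re

# Register dump lines copied from the report above (quote markers removed,
# the wrapped IAOQ line rejoined).
DUMP = """\
Kernel Fault: Code=15 regs=000000004eb88000 (Addr=000000005eb80018)
IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000010108fe0 0000000010108fe4
 IIR: 487a0030    ISR: 0000000000000000  IOR: 000000005eb80018
"""

def parse_fault(text):
    """Return the fault code, the two IAOQ entries and the IOR as integers."""
    code = int(re.search(r"Code=(\d+)", text).group(1))
    # IAOQ holds two entries: the faulting instruction and the one after it.
    match = re.search(r"IAOQ:\s+([0-9a-f]+)\s+([0-9a-f]+)", text)
    iaoq = [int(value, 16) for value in match.groups()]
    # IOR is the data address that was being accessed.
    ior = int(re.search(r"IOR:\s+([0-9a-f]+)", text).group(1), 16)
    return {"code": code, "iaoq": iaoq, "ior": ior}

fault = parse_fault(DUMP)
print(hex(fault["iaoq"][0]), hex(fault["ior"]))
```

The first IAOQ entry is the address to look up in System.map; the IOR
value matches the Addr= field on the "Kernel Fault" line.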

>  IIR: 487a0030    ISR: 0000000000000000  IOR: 000000005eb80018
>  CPU:        0   CR30: 000000004eb48000 CR31: 0000000010460000
>  ORIG_R28: 000000001012c9fc
> 
> I then hard rebooted (rs command from the gsp), ran dbench 10 on sdc,
> and the box just completely froze after about half a line of dots from
> dbench.

When the box freezes, do "tc" from GSP. On reboot, at PDC prompt type
"ser pim" to get the state of the machine when it was TC'd.
Once you've saved the PIM dump, it's good to clear PIM.
(iirc, "ser clearpim")
Again, save matching System.map and vmlinux.

> So I was thinking that sdc may be a dodgy disk/controller, so
> then I rebooted again, ran dbench 10 on sdd, and it worked fine. Tried
> sdb, that was fine too. Okay, so there's something weird happening here,
> I tried dbench 10 on sdc now, and it ran fine. Ran dbench 10 a few more
> times on sdc and it ran fine every time. Also ran dbench 10 over all
> disks serially 4 times and no errors.

Well, that's interesting.

> I rebooted it cleanly (finally!) and went into the boot menu thing, and
> into the service menu, and had a look through the options. I saw the
> scsi paths were set to fast for the boot disk and ultra for the other 3
> disks.

IIRC, setting _SYNC parameter to 10 is equivalent to "fast".

> I set all the scsi paths to "fast" instead of ultra, booted up
> and ran dbench 10 on sdb and sdc simultaneously...it went okay for a
> while, and then, another kernel panic:
> 
> Dumping Stack from 0x0000000056f10000 to 0x0000000056f11380:
> WARNING! Stack pointer and cr30 do not correspond!

oic. In cases like this, we have to disable Stack Dumping since
it data page faults. I suspect that's what's happened in the
previous dumps too. You can disable stack dumps by changing "#if 1"
to "#if 0" on line 149 (show_stack()) in arch/parisc/kernel/traps.c.

BTW, typically this msg means a kernel driver is attempting to directly
access user space data instead of copying the data into kernel space.

...
> Hard rebooted it (*again*), ran dbench 10 on 1 disk (sdd), it ran fine,
> so I cranked it up to dbench 100. That crashed nicely with this panic:
> 
> Dumping Stack from 0x0000000056390000 to 0x0000000056390000:
> 
> Kernel Fault: Code=15 regs=0000000046390000 (Addr=0000000056388018)

did you get the "Stack pointer and cr30 do not correspond!" msg before this?
Well, I guess it doesn't matter...keep an eye out for it though.

> I have no idea whats going on here now :(

Me either since I've not seen this problem. This does sound like
the SCSI interface driver is hitting a corner case and dying there.
But that's just a SWAG.

I'll have to get dbench and try it on the a500 when that's available.

> Is there anything I need to do to decode these kernel panics or anything
> (I'm not a kernel hacker at all, so I don't really know much about the
> panics). I did notice that the ORIG_R28 part is identical on the panics
> though - no idea what this means.

GR02 and IAOQ are my starting points.
get "a.c" from http://cvs.parisc-linux.org/build-tools/
and use that to lookup symbols in System.map.
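
A minimal sketch of that kind of lookup, assuming the usual
"address type name" layout of System.map. This is not the a.c tool,
just an illustration of the idea: find the last symbol at or below the
address and report the offset into it. The function names and the two
sample symbols are hypothetical; real entries have to come from the
System.map that matches the crashed kernel.

```python
import bisect

def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def lookup(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, or None."""
    addrs = [address for address, _ in symbols]
    index = bisect.bisect_right(addrs, addr) - 1
    if index < 0:
        return None
    base, name = symbols[index]
    return name, addr - base

# Hypothetical entries for illustration only.
SAMPLE = [
    "0000000010108f00 T example_func_a",
    "0000000010109100 T example_func_b",
]
symbols = load_system_map(SAMPLE)
print(lookup(symbols, 0x10108FE0))
```

With a real file, something like load_system_map(open("System.map"))
should work the same way, since the function only iterates over lines.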

> I am running ext3 on all my disks - could this be causing any problems?

I doubt it. I'm running ext3 on all my machines.

> I did however notice that the problems still occurred running ext2
> before I re-made the filesystems.

yeah - i don't think this is related to anything in the file system.

...
> As for the good news, I tried a SMP kernel, and SMP works :)
> It sees both CPUs and uses them (I think, top doesn't show cpu usages,
> as per the bug in the bug tracking system).

SMP boots - but it's still less stable than UP. Maybe because of the same
problem you are running into here.

grant


* RE: [parisc-linux] Re: Dodgy SCSI in L2000
  2002-05-09 16:54   ` [parisc-linux] Re: Dodgy SCSI in L2000 Grant Grundler
@ 2002-05-09 22:46     ` James Braid
  2002-05-11 21:14       ` Grant Grundler
  0 siblings, 1 reply; 4+ messages in thread
From: James Braid @ 2002-05-09 22:46 UTC (permalink / raw)
  To: 'Grant Grundler'; +Cc: parisc-linux

 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The fault was caused at 0x0000000010108fe0 - I need to see the
> matching vmlinux and System.map to determine where this is in
> the code.
> 

Ah, mind if I send you the vmlinux and System.map off-list?


> When the box freezes, do "tc" from GSP. On reboot, at PDC prompt
> type "ser pim" to get the state of the machine when it was TC'd.
> Once you've saved the PIM dump, it's good to clear PIM.
> (iirc, "ser clearpim")
> Again, save matching System.map and vmlinux.
> 

> IIRC, setting _SYNC parameter to 10 is equivalent to "fast".

Okay, I will try that and put a new kernel in.
 
> oic. In cases like this, we have to disable Stack Dumping since
> it data page faults. I suspect that's what's happened in the
> previous dumps too. You can disable stack dumps by changing "#if 1"
> to "#if 0" on line 149 (show_stack()) in
> arch/parisc/kernel/traps.c.  

I will disable that as well and put a new kernel in.

> BTW, typically this msg means a kernel driver is attempting 
> to directly
> access user space data instead of copying the data into kernel
> space.  

That sounds nasty...

> > so I cranked it up to dbench 100. That crashed nicely with 
> this panic:
> > 
> > Dumping Stack from 0x0000000056390000 to 0x0000000056390000:
> > 
> > Kernel Fault: Code=15 regs=0000000046390000
> > (Addr=0000000056388018) 
> 
> did you get the "Stack pointer and cr30 do not correspond!" 
> msg before this?

Yep, I did get the "Stack pointer..." stuff before this, I left it
off the email though.

> Me either since I've not seen this problem. This does sound like
> the SCSI interface driver is hitting a corner case and dying there.
> But that's just a SWAG.
> 
> I'll have to get dbench and try it on the a500 when that's
> available.  

Cool.

Thanks, James.

-----BEGIN PGP SIGNATURE-----
Version: PGP 7.1.1

iQA/AwUBPNr8OFW+bhIOiSqWEQLoTwCeOG3NSZzK01Aq1w+tz7R421fs9xgAoPbb
9VCHtaEiG6tTKElsrWsPMiad
=6qMg
-----END PGP SIGNATURE-----


* Re: [parisc-linux] Re: Dodgy SCSI in L2000
  2002-05-09 22:46     ` James Braid
@ 2002-05-11 21:14       ` Grant Grundler
  2002-05-16  5:00         ` James Braid
  0 siblings, 1 reply; 4+ messages in thread
From: Grant Grundler @ 2002-05-11 21:14 UTC (permalink / raw)
  To: James Braid; +Cc: parisc-linux

"James Braid" wrote:
> Ah, mind if I send you the vmlinux and System.map off-list?

Yes - I do mind. For now, don't bother. I don't have time.
And the next round of debug info will be more interesting.

In general, please put them on a publicly accessible http
or ftp server and I'll pull them when I have time. Or someone
else can if I don't. If that's not possible, contact me off-list
and I'll set up an account for you to push them to.

grant


* [parisc-linux] Re: Dodgy SCSI in L2000
  2002-05-11 21:14       ` Grant Grundler
@ 2002-05-16  5:00         ` James Braid
  0 siblings, 0 replies; 4+ messages in thread
From: James Braid @ 2002-05-16  5:00 UTC (permalink / raw)
  To: parisc-linux

Hey,

I have applied the patch just posted to the list (irq.c patch). I'm
running the latest CVS kernel on a dual 440MHz L2000, 1GB RAM, 4x 18.2GB
LVD SCSI disks.

I am seeing the same problems I have seen before (SCSI resets etc), BUT
the box is not kernel panicking any more - which is an improvement.

Dbench works fine on single disks (i.e. running one instance of dbench on
one disk) - up to 200 clients (didn't bother trying further).

But when I try to run 2 instances of dbench on any 2 disks in the box, I
get all sorts of SCSI bus resets and errors.

Here's a cut and paste from the console:

---------

scsi : aborting command due to timeout : pid 200512, scsi0, channel 0,
id 0, lun 0 Read (10) 00 02 03 78 20 00 00 08 00
sym53c8xx_abort: pid=200512 serial_number=200514
serial_number_at_timeout=200514
SCSI host 0 abort (pid 200512) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=200512 reset_flags=2 serial_number=200514
serial_number_at_timeout=200514
scsi : aborting command due to timeout : pid 200771, scsi0, channel 0,
id 2, lun 0 Write (10) 00 01 98 22 c8 00 00 08 00
sym53c8xx_abort: pid=200771 serial_number=200773
serial_number_at_timeout=200773
scsi : aborting command due to timeout : pid 200772, scsi0, channel 0,
id 2, lun 0 Write (10) 00 00 01 10 a8 00 00 08 00
sym53c8xx_abort: pid=200772 serial_number=200774
serial_number_at_timeout=200774
scsi : aborting command due to timeout : pid 200773, scsi0, channel 0,
id 2, lun 0 Write (10) 00 02 00 63 e0 00 00 08 00
sym53c8xx_abort: pid=200773 serial_number=200775
serial_number_at_timeout=200775
scsi : aborting command due to timeout : pid 200774, scsi0, channel 0,
id 2, lun 0 Write (10) 00 00 d0 51 b8 00 00 08 00
sym53c8xx_abort: pid=200774 serial_number=200776
serial_number_at_timeout=200776
scsi : aborting command due to timeout : pid 200775, scsi0, channel 0,
id 0, lun 0 Write (10) 00 00 40 4a 38 00 00 18 00
sym53c8xx_abort: pid=200775 serial_number=200777
serial_number_at_timeout=200777
scsi : aborting command due to timeout : pid 200776, scsi0, channel 0,
id 2, lun 0 Write (10) 00 01 b0 38 e0 00 00 08 00
sym53c8xx_abort: pid=200776 serial_number=200778
serial_number_at_timeout=200778
scsi : aborting command due to timeout : pid 200777, scsi0, channel 0,
id 2, lun 0 Write (10) 00 00 04 2d 80 00 00 08 00
sym53c8xx_abort: pid=200777 serial_number=200779
serial_number_at_timeout=200779
scsi : aborting command due to timeout : pid 200778, scsi0, channel 0,
id 2, lun 0 Write (10) 00 01 1c 5b 90 00 00 08 00
sym53c8xx_abort: pid=200778 serial_number=200780
serial_number_at_timeout=200780
scsi : aborting command due to timeout : pid 200779, scsi0, channel 0,
id 2, lun 0 Write (10) 00 00 d0 52 c0 00 00 08 00
sym53c8xx_abort: pid=200779 serial_number=200781
serial_number_at_timeout=200781
SCSI host 0 abort (pid 200780) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=200780 reset_flags=2 serial_number=200782
serial_number_at_timeout=200782
SCSI host 0 abort (pid 201014) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201014 reset_flags=2 serial_number=201016
serial_number_at_timeout=201016
SCSI host 0 abort (pid 201161) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201161 reset_flags=2 serial_number=201163
serial_number_at_timeout=201163
SCSI host 0 abort (pid 201174) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201174 reset_flags=2 serial_number=201176
serial_number_at_timeout=201176
SCSI host 0 abort (pid 201187) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201187 reset_flags=2 serial_number=201189
serial_number_at_timeout=201189
SCSI host 0 abort (pid 201200) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201200 reset_flags=2 serial_number=201202
serial_number_at_timeout=201202
SCSI host 0 abort (pid 201213) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201213 reset_flags=2 serial_number=201215
serial_number_at_timeout=201215
SCSI host 0 abort (pid 201226) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
sym53c8xx_reset: pid=201226 reset_flags=2 serial_number=201228
serial_number_at_timeout=201228

---------
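
As an aside, the hex bytes after "Read (10)" / "Write (10)" can be
decoded to see what was being transferred. The sketch below assumes the
kernel printed the opcode as a name followed by the remaining nine
bytes of the 10-byte CDB, which is what the line lengths suggest; the
function name decode_rw10 is made up for this example. In a 10-byte
read/write CDB, bytes 2-5 are the big-endian logical block address and
bytes 7-8 the transfer length in blocks.

```python
def decode_rw10(hex_bytes):
    """Decode the nine hex bytes printed after 'Read (10)' / 'Write (10)'.

    Assumes the opcode byte itself was printed as a name, so the first
    value here is CDB byte 1.
    """
    values = [int(item, 16) for item in hex_bytes.split()]
    if len(values) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    lba = int.from_bytes(bytes(values[1:5]), "big")     # CDB bytes 2-5
    blocks = int.from_bytes(bytes(values[6:8]), "big")  # CDB bytes 7-8
    return lba, blocks

lba, blocks = decode_rw10("00 02 03 78 20 00 00 08 00")
print(hex(lba), blocks)
```

If the disks use 512-byte blocks (an assumption, not something the log
shows), a length of 8 would be a 4KB transfer.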

And so on and so on like this. Grant has mentioned that the termination
or SCSI cables could be an issue, but as I have no replacements for this
box I can't really test this out. Before I applied the irq.c patch, the
box would panic just running dbench on one single disk.
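
To get a feel for which target is involved, a log like the one above can
be tallied per SCSI id. This is a rough sketch that only matches the
message wording seen in this paste; the function name summarize is made
up, and the sample is a shortened excerpt of the console output.

```python
import re
from collections import Counter

# Shortened excerpt of the console output above, wrapped as in the mail.
LOG = """\
scsi : aborting command due to timeout : pid 200512, scsi0, channel 0,
id 0, lun 0 Read (10) 00 02 03 78 20 00 00 08 00
SCSI host 0 abort (pid 200512) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
scsi : aborting command due to timeout : pid 200771, scsi0, channel 0,
id 2, lun 0 Write (10) 00 01 98 22 c8 00 00 08 00
scsi : aborting command due to timeout : pid 200772, scsi0, channel 0,
id 2, lun 0 Write (10) 00 00 01 10 a8 00 00 08 00
"""

TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : pid \d+, scsi\d+, channel \d+, id (\d+)"
)

def summarize(log):
    """Count command timeouts per SCSI id and the number of bus resets."""
    text = " ".join(log.split())  # undo the line wrapping
    timeouts = Counter(int(m.group(1)) for m in TIMEOUT_RE.finditer(text))
    resets = len(re.findall(r"SCSI bus is being reset", text))
    return timeouts, resets

timeouts, resets = summarize(LOG)
print(dict(timeouts), resets)
```

Counting by hand over the full paste above gives 10 timeouts (2 on id 0,
8 on id 2) and 9 bus resets.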

If anyone has any ideas or possible solutions on what could be causing
this, I'd *love* to hear them. If you need any further details, just let
me know.

I've also tried compiling the Qlogic ISP (we have a bunch of these cards
lying around from our SGI boxes) scsi driver but it doesn't want to
compile on PA-RISC. Are there any other SCSI cards which are known to
compile under PA-RISC? I was thinking I could then leave just the root
disk on the core I/O board and use another SCSI controller for the other
3 disks. Is this possible?

Cheers, James


-- 
James Braid
System Administrator
Peace Software
Ph:		+64 9 373 0400
Email:	james.braid@peace.com

