sym2 oops in 2.6.9-rc2-BK

public inbox for linux-scsi@vger.kernel.org

* sym2 oops in 2.6.9-rc2-BK
@ 2004-09-28 13:58 Anton Blanchard
  2004-09-28 14:21 ` Anton Blanchard
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Anton Blanchard @ 2004-09-28 13:58 UTC (permalink / raw)
  To: linux-scsi


Hi,

I've got a 2.6.9-rc2-bk tree from about September 16 which exploded in
sym_prepare_nego. It turns out sdev is NULL, and scsi_device_dt(sdev)
causes the trouble.

A few lines above there is a check for sdev != NULL, so, assuming it is
valid for sdev to be NULL, add a check before scsi_device_dt() too.

Anton

Signed-off-by: Anton Blanchard <anton@samba.org>

diff -puN drivers/scsi/sym53c8xx_2/sym_hipd.c~fix-sym2 drivers/scsi/sym53c8xx_2/sym_hipd.c
--- gr_work/drivers/scsi/sym53c8xx_2/sym_hipd.c~fix-sym2	2004-09-28 03:03:26.493627814 -0500
+++ gr_work-anton/drivers/scsi/sym53c8xx_2/sym_hipd.c	2004-09-28 03:03:50.247458823 -0500
@@ -1550,7 +1550,7 @@ static int sym_prepare_nego(hcb_p np, cc
 	/*
 	 *  negotiate using PPR ?
 	 */
-	if (scsi_device_dt(sdev)) {
+	if (sdev && scsi_device_dt(sdev)) {
 		nego = NS_PPR;
 	} else {
 		/*


* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-28 13:58 sym2 oops in 2.6.9-rc2-BK Anton Blanchard
@ 2004-09-28 14:21 ` Anton Blanchard
  2004-09-28 15:17   ` Matthew Wilcox
  2004-09-28 14:56 ` Matthew Wilcox
  2004-09-28 15:38 ` James Bottomley
  2 siblings, 1 reply; 9+ messages in thread
From: Anton Blanchard @ 2004-09-28 14:21 UTC (permalink / raw)
  To: linux-scsi; +Cc: willy

 
> I've got a 2.6.9-rc2-bk tree from about September 16 which exploded in
> sym_prepare_nego. It turns out sdev is NULL, and scsi_device_dt(sdev)
> causes the trouble.
> 
> A few lines above there is a check for sdev != NULL, so, assuming it is
> valid for sdev to be NULL, add a check before scsi_device_dt() too.

With that patch I still managed to get an oops. There is a fair amount
of bad hardware in the box, but oopsing is pretty antisocial.

Looks like a refcount problem. We kref_get'ed something already freed,
then finally oopsed in scsi_device_get, trying to access address
0x100510.
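
A rough sketch of where 0x100510 might come from, assuming the bad pointer
was produced by list_entry() on a list entry that list_del() had already
poisoned. LIST_POISON1 is the value from include/linux/list.h, and the
register values are copied from the dump further down. The helper names are
made up for illustration, and the two offsets are inferences from the dump,
not values checked against the struct scsi_device layout.

```c
#include <stddef.h>

/* Written into list_head.next by list_del() in 2.6-era kernels
 * (LIST_POISON1 in include/linux/list.h). */
#define LIST_POISON1 0x00100100UL

/* Values copied from the register dump in this message. */
#define OOPS_GPR03 0x001000F0UL	/* first argument to scsi_device_get() */
#define OOPS_GPR09 0x00100100UL	/* same value as LIST_POISON1 */
#define OOPS_DAR   0x00100510UL	/* faulting data address */

/*
 * Hypothetical helper: offset of the list member inside its containing
 * struct, ASSUMING GPR03 is list_entry() applied to a poisoned pointer.
 */
static unsigned long implied_list_offset(void)
{
	return LIST_POISON1 - OOPS_GPR03;
}

/*
 * Hypothetical helper: offset of the field being read at the fault,
 * ASSUMING the access was relative to the bogus pointer in GPR03.
 */
static unsigned long implied_field_offset(void)
{
	return OOPS_DAR - OOPS_GPR03;
}
```

Under those assumptions the list member would sit at offset 0x10 and the
faulting read would be at offset 0x420 from the bogus pointer.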

Anton

sym.0014:03:01.0:11:0: ABORT operation started.
sym.0014:03:01.0:11:0: ABORT operation complete.
sym.0014:03:01.0:11:0: DEVICE RESET operation started.
sym.0014:03:01.0:11:0: DEVICE RESET operation complete.
sym.0014:03:01.0:11:control msgout: c.
sym.0014:03:01.0: TARGET 11 has been reset.
sym.1214:03:01.0:11:0: ABORT operation started.
sym.1214:03:01.0:11:0: ABORT operation complete.
sym.1214:03:01.0: SCSI parity error detected: SCR1=1 DBC=1500000e SBCL=ae
sym.1214:03:01.0:11:0: DEVICE RESET operation started.
sym.1214:03:01.0:11:0: DEVICE RESET operation complete.
sym.1214:03:01.0:11:control msgout: c.
sym.1214:03:01.0: TARGET 11 has been reset.
sym.0014:03:01.0:11:0: ABORT operation started.
sym.0014:03:01.0:11:0: ABORT operation complete.
sym.0014:03:01.0:11:0: BUS RESET operation started.
sym.0014:03:01.0:11:0: BUS RESET operation complete.
sym.0014:03:01.0: SCSI BUS reset detected.
sym.0014:03:01.0: SCSI BUS has been reset.
sym.1214:03:01.0:11:0: ABORT operation started.
sym.1214:03:01.0:11:0: ABORT operation complete.
sym.1214:03:01.0:11:0: BUS RESET operation started.
sym.1214:03:01.0:11:0: BUS RESET operation complete.
sym.1214:03:01.0: SCSI BUS reset detected.
sym.1214:03:01.0: SCSI BUS has been reset.
scsi: Device offlined - not ready after error recovery: host 2 channel 0 id 11 lun 0
Badness in kref_get at lib/kref.c:32
Call Trace:
[c0000032fcab3bd0] [c0000032fcab3c50] 0xc0000032fcab3c50 (unreliable)
[c0000032fcab3c50] [c00000000021f5b8] .get_device+0x20/0x3c
[c0000032fcab3cc0] [c000000000294c60] .scsi_device_get+0x38/0xe4
[c0000032fcab3d40] [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
[c0000032fcab3de0] [c000000000299bf8] .scsi_run_host_queues+0x34/0x58
[c0000032fcab3e60] [c0000000002989f8] .scsi_error_handler+0x268/0xaa0
[c0000032fcab3f90] [c000000000017aac] .kernel_thread+0x4c/0x68
sym.0014:03:01.0:11:control msgout: c.

NIP: C000000000294C48 XER: 0000000020000000 LR: C000000000294E30
REGS: c0000032fcab3a40 TRAP: 0300   Not tainted  (2.6.9-rc2-bml)
MSR: 9000000000001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 0000000000100510, DSISR: 0000000040000000
TASK: c000002bfd33d3c0[1494] 'scsi_eh_2' THREAD: c0000032fcab0000 CPU: 14
GPR00: FFFFFFFFFFFFFFFA C0000032FCAB3CC0 C0000000007297B8 00000000001000F0 
GPR04: C00000000E112800 0000000000000001 0000000000000000 0000000000000000 
GPR08: 0000000000000000 0000000000100100 C000001DFF875C28 9000000000009032 
GPR12: 0000000024FFFF22 C000000000545880 0000000000000000 0000000000000000 
GPR16: 0000000000000000 C00000000040D190 C000000000587058 C0000032FCAB3ED0 
GPR20: 00000000000000FC C00000000040D190 C000000000587058 C0000032FCAB3F00 
GPR24: C0000032FCAB3EF0 0000040180000000 C000000073847BB0 C00000000E112800 
GPR28: 9000000000009032 C000000FFFFA8800 00000000001002D8 00000000001000F0 
NIP [c000000000294c48] .scsi_device_get+0x20/0xe4
LR [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
Call Trace:
[c0000032fcab3cc0] [c000000000294da8] .scsi_device_put+0x9c/0xc4 (unreliable)
[c0000032fcab3d40] [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
[c0000032fcab3de0] [c000000000299bf8] .scsi_run_host_queues+0x34/0x58
[c0000032fcab3e60] [c0000000002989f8] .scsi_error_handler+0x268/0xaa0
[c0000032fcab3f90] [c000000000017aac] .kernel_thread+0x4c/0x68


* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-28 13:58 sym2 oops in 2.6.9-rc2-BK Anton Blanchard
  2004-09-28 14:21 ` Anton Blanchard
@ 2004-09-28 14:56 ` Matthew Wilcox
  2004-09-28 15:25   ` Anton Blanchard
  2004-09-28 15:38 ` James Bottomley
  2 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2004-09-28 14:56 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-scsi

On Tue, Sep 28, 2004 at 11:58:26PM +1000, Anton Blanchard wrote:
> I've got a 2.6.9-rc2-bk tree from about September 16 which exploded in
> sym_prepare_nego. It turns out sdev is NULL, and scsi_device_dt(sdev)
> causes the trouble.
> 
> A few lines above there is a check for sdev != NULL, so, assuming it is
> valid for sdev to be NULL, add a check before scsi_device_dt() too.

Yes, this looks like the right solution to me.

Can you tell me what circumstances you see it under, and do you
successfully negotiate 160MB/s with this patch?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain


* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-28 14:21 ` Anton Blanchard
@ 2004-09-28 15:17   ` Matthew Wilcox
  0 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2004-09-28 15:17 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: linux-scsi, willy

On Wed, Sep 29, 2004 at 12:21:04AM +1000, Anton Blanchard wrote:
> With that patch I still managed to get an oops. There is a fair amount
> of bad hardware in the box, but oopsing is pretty antisocial.
> 
> Looks like a refcount problem. We kref_get'ed something already freed,
> then finally oopsed in scsi_device_get, trying to access address
> 0x100510.

__scsi_iterate_devices is part of a shost_for_each_device() loop.  That
means we had a scsi_device sitting on the shost->__devices list with a
zero refcount.  I'll see if I can spot the leak in my current sources,
but some of the behaviour has changed recently and it may be gone.
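
As a minimal user-space sketch of that failure mode (not the kernel code):
a walker that takes a reference on each list element while visiting it,
and notices elements whose count is already zero. All names here (toy_dev,
toy_get, toy_put, toy_count_stale) are hypothetical. Unlike the real
kref_get(), which only warns ("Badness in kref_get") and carries on, this
toy version refuses the reference so the stale entry can be counted.

```c
#include <stddef.h>

/* Toy stand-in for a refcounted device linked on a host's device list. */
struct toy_dev {
	int refcount;
	struct toy_dev *next;
};

/*
 * Take a reference. A count of zero means the object has already been
 * released, so report failure instead of resurrecting it.
 */
static int toy_get(struct toy_dev *d)
{
	if (d->refcount <= 0)
		return -1;
	d->refcount++;
	return 0;
}

/* Drop a reference taken with toy_get(). */
static void toy_put(struct toy_dev *d)
{
	d->refcount--;
}

/*
 * Walk the list holding a reference on the current element, in the
 * spirit of a shost_for_each_device() loop. Returns how many elements
 * were still linked on the list with a zero refcount.
 */
static int toy_count_stale(struct toy_dev *head)
{
	int stale = 0;
	struct toy_dev *d;

	for (d = head; d != NULL; d = d->next) {
		if (toy_get(d) != 0) {
			stale++;	/* on the list, but already released */
			continue;
		}
		/* ... per-device work would go here ... */
		toy_put(d);
	}
	return stale;
}
```

A non-zero result from toy_count_stale() corresponds to the situation
above: something released its last reference without unlinking the entry.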

> scsi: Device offlined - not ready after error recovery: host 2 channel 0 id 11 lun 0
> Badness in kref_get at lib/kref.c:32
> Call Trace:
> [c0000032fcab3bd0] [c0000032fcab3c50] 0xc0000032fcab3c50 (unreliable)
> [c0000032fcab3c50] [c00000000021f5b8] .get_device+0x20/0x3c
> [c0000032fcab3cc0] [c000000000294c60] .scsi_device_get+0x38/0xe4
> [c0000032fcab3d40] [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
> [c0000032fcab3de0] [c000000000299bf8] .scsi_run_host_queues+0x34/0x58
> [c0000032fcab3e60] [c0000000002989f8] .scsi_error_handler+0x268/0xaa0
> [c0000032fcab3f90] [c000000000017aac] .kernel_thread+0x4c/0x68
> sym.0014:03:01.0:11:control msgout: c.
> 
> NIP: C000000000294C48 XER: 0000000020000000 LR: C000000000294E30
> REGS: c0000032fcab3a40 TRAP: 0300   Not tainted  (2.6.9-rc2-bml)
> MSR: 9000000000001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
> DAR: 0000000000100510, DSISR: 0000000040000000
> TASK: c000002bfd33d3c0[1494] 'scsi_eh_2' THREAD: c0000032fcab0000 CPU: 14
> GPR00: FFFFFFFFFFFFFFFA C0000032FCAB3CC0 C0000000007297B8 00000000001000F0 
> GPR04: C00000000E112800 0000000000000001 0000000000000000 0000000000000000 
> GPR08: 0000000000000000 0000000000100100 C000001DFF875C28 9000000000009032 
> GPR12: 0000000024FFFF22 C000000000545880 0000000000000000 0000000000000000 
> GPR16: 0000000000000000 C00000000040D190 C000000000587058 C0000032FCAB3ED0 
> GPR20: 00000000000000FC C00000000040D190 C000000000587058 C0000032FCAB3F00 
> GPR24: C0000032FCAB3EF0 0000040180000000 C000000073847BB0 C00000000E112800 
> GPR28: 9000000000009032 C000000FFFFA8800 00000000001002D8 00000000001000F0 
> NIP [c000000000294c48] .scsi_device_get+0x20/0xe4
> LR [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
> Call Trace:
> [c0000032fcab3cc0] [c000000000294da8] .scsi_device_put+0x9c/0xc4 (unreliable)
> [c0000032fcab3d40] [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
> [c0000032fcab3de0] [c000000000299bf8] .scsi_run_host_queues+0x34/0x58
> [c0000032fcab3e60] [c0000000002989f8] .scsi_error_handler+0x268/0xaa0
> [c0000032fcab3f90] [c000000000017aac] .kernel_thread+0x4c/0x68



* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-28 14:56 ` Matthew Wilcox
@ 2004-09-28 15:25   ` Anton Blanchard
  0 siblings, 0 replies; 9+ messages in thread
From: Anton Blanchard @ 2004-09-28 15:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-scsi

 
> Yes, this looks like the right solution to me.
> 
> Can you tell me what circumstances you see it under, and do you
> successfully negotiate 160MB/s with this patch?

There is a bunch of bad hardware in it, so I'm having a bit of trouble
working out exactly what is going on :) I'll look through the logs and
see if I can make sense of them.

Anton


* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-28 13:58 sym2 oops in 2.6.9-rc2-BK Anton Blanchard
  2004-09-28 14:21 ` Anton Blanchard
  2004-09-28 14:56 ` Matthew Wilcox
@ 2004-09-28 15:38 ` James Bottomley
  2004-09-30 13:05   ` Anton Blanchard
  2 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2004-09-28 15:38 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: SCSI Mailing List

On Tue, 2004-09-28 at 09:58, Anton Blanchard wrote:
> I've got a 2.6.9-rc2-bk tree from about September 16 which exploded in
> sym_prepare_nego. It turns out sdev is NULL, and scsi_device_dt(sdev)
> causes the trouble.
> 
> A few lines above there is a check for sdev != NULL, so, assuming it is
> valid for sdev to be NULL, add a check before scsi_device_dt() too.
> 
> Anton
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> 
> diff -puN drivers/scsi/sym53c8xx_2/sym_hipd.c~fix-sym2 drivers/scsi/sym53c8xx_2/sym_hipd.c
> --- gr_work/drivers/scsi/sym53c8xx_2/sym_hipd.c~fix-sym2	2004-09-28 03:03:26.493627814 -0500
> +++ gr_work-anton/drivers/scsi/sym53c8xx_2/sym_hipd.c	2004-09-28 03:03:50.247458823 -0500
> @@ -1550,7 +1550,7 @@ static int sym_prepare_nego(hcb_p np, cc
>  	/*
>  	 *  negotiate using PPR ?
>  	 */
> -	if (scsi_device_dt(sdev)) {
> +	if (sdev && scsi_device_dt(sdev)) {
>  		nego = NS_PPR;
>  	} else {
>  		/*

Actually, this patch can't be correct.  We should never be negotiating
with a NULL sdev.  Previously we negotiated after slave_alloc, but I've
tried to change the driver to defer negotiation until slave_configure.

What were the messages in the log prior to the NULL deref?  What I'm
trying to understand is how we got to this point.

James




* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-28 15:38 ` James Bottomley
@ 2004-09-30 13:05   ` Anton Blanchard
  2004-09-30 13:52     ` James Bottomley
  0 siblings, 1 reply; 9+ messages in thread
From: Anton Blanchard @ 2004-09-30 13:05 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

 
> Actually, this patch can't be correct.  We should never be negotiating
> with a NULL sdev.  Previously we negotiated after slave_alloc, but I've
> tried to change the driver to defer negotiation until slave_configure.
> 
> What were the messages in the log prior to the NULL deref?  What I'm
> trying to understand is how we got to this point.

I'm confused, why are we checking sdev earlier on? Unfortunately I don't
have the machine at the moment; if I get it back I'll get a dmesg.

Anton

static int sym_prepare_nego(hcb_p np, ccb_p cp, int nego, u_char *msgptr)
{
        tcb_p tp = &np->target[cp->target];
        int msglen = 0;
        struct scsi_device *sdev = tp->sdev;

        if (likely(sdev))
                sym_check_goals(sdev);

        /*
         *  Early C1010 chips need a work-around for DT
         *  data transfer to work.
         */
        if (!(np->features & FE_U3EN))
                tp->tinfo.goal.options = 0;
        /*
         *  negotiate using PPR ?
         */
        if (scsi_device_dt(sdev)) {
                nego = NS_PPR;
        } else {


* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-30 13:05   ` Anton Blanchard
@ 2004-09-30 13:52     ` James Bottomley
  2004-09-30 14:05       ` Anton Blanchard
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2004-09-30 13:52 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: SCSI Mailing List

On Thu, 2004-09-30 at 09:05, Anton Blanchard wrote:
> I'm confused, why are we checking sdev earlier on? Unfortunately I don't
> have the machine at the moment; if I get it back I'll get a dmesg.

Erm, because I didn't notice it and forgot to remove it when I revamped
the negotiation routines...

James




* Re: sym2 oops in 2.6.9-rc2-BK
  2004-09-30 13:52     ` James Bottomley
@ 2004-09-30 14:05       ` Anton Blanchard
  0 siblings, 0 replies; 9+ messages in thread
From: Anton Blanchard @ 2004-09-30 14:05 UTC (permalink / raw)
  To: James Bottomley; +Cc: SCSI Mailing List

 
> Erm, because I didn't notice it and forgot to remove it when I revamped
> the negotiation routines...

OK, I backed that last patch out and hit what looks to be my bug 2 again.

Anton

sym0: <1010-66> rev 0x1 at pci 0004:03:01.0 irq 87
sym.0004:03:01.0: No NVRAM, ID 7, Fast-80, LVD, parity checking
xics_enable_irq 47 buid 4 gqirm 255
sym.0004:03:01.0: SCSI BUS has been reset.
scsi0 : sym-2.1.18j
Using anticipatory io scheduler
sym.0004:03:01.0:10: FAST-40 WIDE SCSI 80.0 MB/s ST (25.0 ns, offset 31)
  Vendor: IBM       Model: IC35L036UCDY10-0  Rev: S25M
  Type:   Direct-Access                      ANSI SCSI revision: 03
sym.0004:03:01.0:10:0: tagged command queuing enabled, command queue depth 16.
scsi(0:0:10:0): Beginning Domain Validation
sym.0004:03:01.0:10: asynchronous.
sym.0004:03:01.0:10: wide asynchronous.
sym.0004:03:01.0:10: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
scsi(0:0:10:0): Ending Domain Validation
sym.0004:03:01.0:11:0:phase change 2-7 6@01050368 resid=5.
sym.0004:03:01.0:11:0:phase change 2-3 6@01050368 resid=5.
sym.0004:03:01.0:11: FAST-40 WIDE SCSI 80.0 MB/s ST (25.0 ns, offset 31)
sym.0004:03:01.0:11:control msgout: c.
sym.0004:03:01.0: TARGET 11 has been reset.
sym.0004:03:01.0:11:0: ABORT operation started.
sym.0004:03:01.0:11:0: ABORT operation complete.
sym.0004:03:01.0:11:0: DEVICE RESET operation started.
sym.0004:03:01.0:11:0: DEVICE RESET operation complete.
sym.0004:03:01.0:11:control msgout: c.
sym.0004:03:01.0: TARGET 11 has been reset.
sym.0004:03:01.0:11:0: ABORT operation started.
sym.0004:03:01.0:11:0: ABORT operation complete.
sym.0004:03:01.0:11:0: BUS RESET operation started.
sym.0004:03:01.0:11:0: BUS RESET operation complete.
sym.0004:03:01.0: SCSI BUS reset detected.
sym.0004:03:01.0: SCSI BUS has been reset.
scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 11 lun 0

NIP: C000000000294C48 XER: 0000000020000000 LR: C000000000294E30
REGS: c000001dfd0e7a40 TRAP: 0300   Not tainted  (2.6.9-rc2-bml)
MSR: 9000000000001032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 0000000000100510, DSISR: 0000000040000000
TASK: c000000ffe73b240[1463] 'scsi_eh_0' THREAD: c000001dfd0e4000 CPU: 3
GPR00: FFFFFFFFFFFFFFFA C000001DFD0E7CC0 C0000000007297B8 00000000001000F0
GPR04: C0000032FC834000 0000000000000001 0000000000000000 0000000000000000
GPR08: 0000000000000000 0000000000100100 C000000FFFFD7228 9000000000009032
GPR12: 0000000024FFFF22 C000000000542700 0000000000000000 0000000000000000
GPR16: 0000000000000000 C00000000040D190 C000000000587058 C000001DFD0E7ED0
GPR20: 00000000000000FC C00000000040D190 C000000000587058 C000001DFD0E7F00
GPR24: C000001DFD0E7EF0 0000040100000000 C000001DFD077D30 C0000032FC834000
GPR28: 9000000000009032 C000000FFFFC3800 00000000001002D8 00000000001000F0
NIP [c000000000294c48] .scsi_device_get+0x20/0xe4
LR [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
Call Trace:
[c000001dfd0e7cc0] [c000000000294da8] .scsi_device_put+0x9c/0xc4 (unreliable)
[c000001dfd0e7d40] [c000000000294e30] .__scsi_iterate_devices+0x60/0xfc
[c000001dfd0e7de0] [c000000000299bf8] .scsi_run_host_queues+0x34/0x58
[c000001dfd0e7e60] [c0000000002989f8] .scsi_error_handler+0x268/0xaa0
[c000001dfd0e7f90] [c000000000017aac] .kernel_thread+0x4c/0x68
