* [PATCH] OOPS due to clearing eh_action prior to aborting eh command
@ 2005-12-07 21:56 Michael Reed
2005-12-07 22:06 ` James Bottomley
0 siblings, 1 reply; 6+ messages in thread
From: Michael Reed @ 2005-12-07 21:56 UTC (permalink / raw)
To: linux-scsi; +Cc: Christoph Hellwig, James.Smart, James.Bottomley, Jeremy Higdon
[-- Attachment #1: Type: text/plain, Size: 6132 bytes --]
During my testing of fc transport attributes for the mpt fusion
driver, I came upon this OOPS. (Actually, I've come upon it
too many times. :( )
Attached is a patch which addresses the issue. Please give it a look.
Thanks,
Mike
duck /root# dmesg -n8
duck /root# sync
duck /root# sync
duck /root# mptscsih: ioc0: attempting task abort! (sc=e0000034f74d4500)
sd 10:0:4:0:
command: Read (10): 28 00 00 0e 11 00 00 01 00 00
mptscsih: ioc0: task abort: SUCCESS (sc=e0000034f74d4500)
scsi_send_eh_cmnd: wait_for_completion_timeout: 2500, 0
( above message shows timeout value for command and time remaining
upon return from wait_for_completion_timeout().
Return value of zero indicates that the command
was timed out.)
mptscsih: ioc0: attempting task abort! (sc=e0000034f74d4500)
sd 10:0:4:0:
command: Test Unit Ready: 00 00 00 00 00 00
Here's where the driver is called to abort it. Unfortunately,
the eh_action pointer (done routine) was set to zero by now, so when
the command is returned, the system goes OOPS.
scsi_eh_done scmd: e0000034f74d4500 result: 80000, device e0000034f63ff000, host e0000034f7518800, action 0000000000000000
Unable to handle kernel NULL pointer dereference (address 0000000000000008)
swapper[0]: Oops 11020886081536 [1]
Modules linked in: autofs nfsd ipv6 nfs lockd sunrpc usbcore
Pid: 0, CPU 0, comm: swapper
psr : 0000121008022038 ifs : 8000000000000388 ip : [<a00000010073c2a1>] Not tainted
ip is at _spin_lock_irqsave+0x41/0x1a0
unat: 0000000000000000 pfs : 0000000000000388 rsc : 0000000000000003
rnat: 5555555555555555 bsps: 000000000001003e pr : 800000187f55a555
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010073c280 b6 : e0000030023e5490 b7 : a0000001000e9f80
f6 : 1003e20c49ba5e353f7cf f7 : 1003e20c49ba5e353f7cf
f8 : 1003e00000000000000a0 f9 : 1003e00000000000004e2
f10 : 1003e000000000fa00000 f11 : 1003e000000003b9aca00
r1 : a000000100c9e500 r2 : 0000000000010001 r3 : a0000001008d8db0
r8 : 0000001008026038 r9 : e0000034f7518898 r10 : 00000000000000f4
r11 : 0000000000000002 r12 : a0000001008dfbb0 r13 : a0000001008d8000
r14 : a000000100aa88d8 r15 : 0000000000010002 r16 : a0000001008d8db0
r17 : 0000000000010001 r18 : 0000000000010002 r19 : 0000000000000013
r20 : 0000000000000000 r21 : 0000000000000000 r22 : 0000000000000000
r23 : 0000000000000004 r24 : 0000000000000000 r25 : a0000001008d8da4
r26 : a0000001008d8d90 r27 : 0000000000000073 r28 : 0000000000000000
r29 : 0000000000000073 r30 : e00000b0050f0060 r31 : 0000000000000000
Call Trace:
[<a000000100013760>] show_stack+0x80/0xa0
sp=a0000001008df730 bsp=a0000001008d9398
[<a000000100013fd0>] show_regs+0x850/0x880
sp=a0000001008df900 bsp=a0000001008d9338
[<a000000100038250>] die+0x210/0x2a0
sp=a0000001008df910 bsp=a0000001008d92f0
[<a00000010073f370>] ia64_do_page_fault+0x990/0xb40
sp=a0000001008df930 bsp=a0000001008d9288
[<a00000010000bce0>] ia64_leave_kernel+0x0/0x290
sp=a0000001008df9e0 bsp=a0000001008d9288
[<a00000010073c2a0>] _spin_lock_irqsave+0x40/0x1a0
sp=a0000001008dfbb0 bsp=a0000001008d9248
[<a0000001000aaa00>] complete+0x20/0xa0
sp=a0000001008dfbb0 bsp=a0000001008d9218
[<a00000010055a960>] scsi_eh_done+0xe0/0x100
sp=a0000001008dfbb0 bsp=a0000001008d91e8
[<a0000001005df060>] mptscsih_io_done+0x540/0x960
sp=a0000001008dfbb0 bsp=a0000001008d9120
[<a0000001005d6e40>] mpt_interrupt+0x4e0/0xde0
sp=a0000001008dfbb0 bsp=a0000001008d90c8
[<a000000100106bb0>] handle_IRQ_event+0x90/0x120
sp=a0000001008dfbb0 bsp=a0000001008d9088
[<a000000100106e70>] __do_IRQ+0x230/0x360
sp=a0000001008dfbb0 bsp=a0000001008d9030
[<a000000100010400>] ia64_handle_irq+0xc0/0x160
sp=a0000001008dfbb0 bsp=a0000001008d8fe8
[<a00000010000bce0>] ia64_leave_kernel+0x0/0x290
sp=a0000001008dfbb0 bsp=a0000001008d8fe8
[<a000000100011db0>] default_idle+0x150/0x180
sp=a0000001008dfd80 bsp=a0000001008d8f78
[<a000000100012ea0>] cpu_idle+0x1c0/0x2e0
sp=a0000001008dfe20 bsp=a0000001008d8f10
[<a000000100009230>] rest_init+0x90/0xc0
sp=a0000001008dfe20 bsp=a0000001008d8ef8
[<a00000010087d0e0>] start_kernel+0x520/0x5c0
sp=a0000001008dfe20 bsp=a0000001008d8e98
[<a000000100008650>] __end_ivt_text+0x330/0x350
sp=a0000001008dfe30 bsp=a0000001008d8e00
Entering kdb (current=0xa0000001008d8000, pid 0) on processor 0 Oops: <NULL>
due to oops @ 0xa00000010073c2a1
psr: 0x0000121008022038 ifs: 0x8000000000000388 ip: 0xa00000010073c2a0
unat: 0x0000000000000000 pfs: 0x0000000000000388 rsc: 0x0000000000000003
rnat: 0x5555555555555555 bsps: 0x000000000001003e pr: 0x800000187f55a555
ldrs: 0x0000000000000000 ccv: 0x0000000000000000 fpsr: 0x0009804c8a70033f
b0: 0xa00000010073c280 b6: 0xe0000030023e5490 b7: 0xa0000001000e9f80
r1: 0xa000000100c9e500 r2: 0x0000000000010001 r3: 0xa0000001008d8db0
r8: 0x0000001008026038 r9: 0xe0000034f7518898 r10: 0x00000000000000f4
r11: 0x0000000000000002 r12: 0xa0000001008dfbb0 r13: 0xa0000001008d8000
r14: 0xa000000100aa88d8 r15: 0x0000000000010002 r16: 0xa0000001008d8db0
r17: 0x0000000000010001 r18: 0x0000000000010002 r19: 0x0000000000000013
r20: 0x0000000000000000 r21: 0x0000000000000000 r22: 0x0000000000000000
r23: 0x0000000000000004 r24: 0x0000000000000000 r25: 0xa0000001008d8da4
r26: 0xa0000001008d8d90 r27: 0x0000000000000073 r28: 0x0000000000000000
r29: 0x0000000000000073 r30: 0xe00000b0050f0060 r31: 0x0000000000000000
®s = a0000001008df9f0
[0]kdb>
[-- Attachment #2: scsi_error.patch --]
[-- Type: text/x-patch, Size: 493 bytes --]
--- a/drivers/scsi/scsi_error.c 2005-12-07 13:41:19.000000000 -0800
+++ b/drivers/scsi/scsi_error.c 2005-12-07 12:52:59.576655059 -0800
@@ -459,9 +459,6 @@
timeleft = wait_for_completion_timeout(&done, timeout);
- scmd->request->rq_status = RQ_SCSI_DONE;
- shost->eh_action = NULL;
-
scsi_log_completion(scmd, SUCCESS);
SCSI_LOG_ERROR_RECOVERY(3,
@@ -500,6 +497,9 @@
rtn = FAILED;
}
+ scmd->request->rq_status = RQ_SCSI_DONE;
+ shost->eh_action = NULL;
+
return rtn;
}
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] OOPS due to clearing eh_action prior to aborting eh command
2005-12-07 21:56 [PATCH] OOPS due to clearing eh_action prior to aborting eh command Michael Reed
@ 2005-12-07 22:06 ` James Bottomley
2005-12-07 22:16 ` Michael Reed
2005-12-08 3:46 ` Michael Reed
0 siblings, 2 replies; 6+ messages in thread
From: James Bottomley @ 2005-12-07 22:06 UTC (permalink / raw)
To: Michael Reed; +Cc: linux-scsi, Christoph Hellwig, James.Smart, Jeremy Higdon
On Wed, 2005-12-07 at 15:56 -0600, Michael Reed wrote:
> During my testing of fc transport attributes for the mpt fusion
> driver, I came upon this OOPS. (Actually, I've come upon it
> too many times. :( )
>
> Attached is a patch which addresses the issue. Please give it a look.
Isn't a better patch simply to copy the eh_action and check it for null
before completing it? That will close the done after timeout race.
James
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] OOPS due to clearing eh_action prior to aborting eh command
2005-12-07 22:06 ` James Bottomley
@ 2005-12-07 22:16 ` Michael Reed
2005-12-07 23:34 ` James Bottomley
2005-12-08 3:46 ` Michael Reed
1 sibling, 1 reply; 6+ messages in thread
From: Michael Reed @ 2005-12-07 22:16 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi
James Bottomley wrote:
> On Wed, 2005-12-07 at 15:56 -0600, Michael Reed wrote:
>> During my testing of fc transport attributes for the mpt fusion
>> driver, I came upon this OOPS. (Actually, I've come upon it
>> too many times. :( )
>>
>> Attached is a patch which addresses the issue. Please give it a look.
>
> Isn't a better patch simply to copy the eh_action and check it for null
> before completing it? That will close the done after timeout race.
FWIW, the situation I'm reporting isn't done after timeout, it's
the scsi error handler calling the LLDD abort routine, which actually
aborts the command and completes it. The eh_action is cleared before
the abort, violating what appears to be the accepted protocol of having
the LLDD complete aborted commands.
Am I missing something in your comment? (It wouldn't surprise me!)
Thanks,
Mike
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] OOPS due to clearing eh_action prior to aborting eh command
2005-12-07 22:16 ` Michael Reed
@ 2005-12-07 23:34 ` James Bottomley
2005-12-08 2:31 ` Michael Reed
0 siblings, 1 reply; 6+ messages in thread
From: James Bottomley @ 2005-12-07 23:34 UTC (permalink / raw)
To: Michael Reed; +Cc: linux-scsi
On Wed, 2005-12-07 at 16:16 -0600, Michael Reed wrote:
> FWIW, the situation I'm reporting isn't done after timeout, it's
> the scsi error handler calling the LLDD abort routine, which actually
> aborts the command and completes it. The eh_action is cleared before
> the abort, violating what appears to be the accepted protocol of having
> the LLDD complete aborted commands.
>
> Am I missing something in your comment? (It wouldn't surprise me!)
The timeout race still exists, though. Both fixes would solve both
problems. It would just be easier to follow if the logic of the error
handler done follows that of the normal done function (i.e. don't do
anything after the command times out).
James
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] OOPS due to clearing eh_action prior to aborting eh command
2005-12-07 23:34 ` James Bottomley
@ 2005-12-08 2:31 ` Michael Reed
0 siblings, 0 replies; 6+ messages in thread
From: Michael Reed @ 2005-12-08 2:31 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi
James Bottomley wrote:
> On Wed, 2005-12-07 at 16:16 -0600, Michael Reed wrote:
>> FWIW, the situation I'm reporting isn't done after timeout, it's
>> the scsi error handler calling the LLDD abort routine, which actually
>> aborts the command and completes it. The eh_action is cleared before
>> the abort, violating what appears to be the accepted protocol of having
>> the LLDD complete aborted commands.
>>
>> Am I missing something in your comment? (It wouldn't surprise me!)
>
> The timeout race still exists, though. Both fixes would solve both
> problems. It would just be easier to follow if the logic of the error
> handler done follows that of the normal done function (i.e. don't do
> anything after the command times out).
"Ah ha", he said. Further examination of the code clarifies what you were
suggesting. I'll see what I can work up.
Thanks,
Mike
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] OOPS due to clearing eh_action prior to aborting eh command
2005-12-07 22:06 ` James Bottomley
2005-12-07 22:16 ` Michael Reed
@ 2005-12-08 3:46 ` Michael Reed
1 sibling, 0 replies; 6+ messages in thread
From: Michael Reed @ 2005-12-08 3:46 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi, Christoph Hellwig, James.Smart, Jeremy Higdon
[-- Attachment #1: Type: text/plain, Size: 512 bytes --]
James Bottomley wrote:
> On Wed, 2005-12-07 at 15:56 -0600, Michael Reed wrote:
>> During my testing of fc transport attributes for the mpt fusion
>> driver, I came upon this OOPS. (Actually, I've come upon it
>> too many times. :( )
>>
>> Attached is a patch which addresses the issue. Please give it a look.
>
> Isn't a better patch simply to copy the eh_action and check it for null
> before completing it? That will close the done after timeout race.
>
> James
>
>
OK. Try this. (I did!)
Mike
[-- Attachment #2: scsi_error.patch --]
[-- Type: text/x-patch, Size: 503 bytes --]
--- a/drivers/scsi/scsi_error.c 2005-12-07 19:14:44.000000000 -0800
+++ b/drivers/scsi/scsi_error.c 2005-12-07 19:13:24.734716721 -0800
@@ -422,10 +422,15 @@
**/
static void scsi_eh_done(struct scsi_cmnd *scmd)
{
+ struct completion *eh_action;
+
SCSI_LOG_ERROR_RECOVERY(3,
printk("%s scmd: %p result: %x\n",
__FUNCTION__, scmd, scmd->result));
- complete(scmd->device->host->eh_action);
+
+ eh_action = scmd->device->host->eh_action;
+ if (eh_action)
+ complete(eh_action);
}
/**
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2005-12-08 3:46 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-12-07 21:56 [PATCH] OOPS due to clearing eh_action prior to aborting eh command Michael Reed
2005-12-07 22:06 ` James Bottomley
2005-12-07 22:16 ` Michael Reed
2005-12-07 23:34 ` James Bottomley
2005-12-08 2:31 ` Michael Reed
2005-12-08 3:46 ` Michael Reed
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).