* qla2xx command abort regression
@ 2010-10-01 10:33 Hannes Reinecke
2010-10-01 12:14 ` Hannes Reinecke
0 siblings, 1 reply; 2+ messages in thread
From: Hannes Reinecke @ 2010-10-01 10:33 UTC (permalink / raw)
To: Andrew Vasquez; +Cc: SCSI Mailing List
Hi Andrew,
there is a regression in the qla2xxx driver, introduced by this commit:
commit 083a469db4ecf3b286a96b5b722c37fc1affe0be
Author: Giridhar Malavali <giridhar.malavali@qlogic.com>
Date: Fri May 28 15:08:18 2010 -0700
[SCSI] qla2xxx: Correct use-after-free oops seen during EH-abort.
Hold a reference to the srb (sp) while aborting an I/O -- as the
I/O can/will complete from within the interrupt-context.
Signed-off-by: Andrew Vasquez <andrew.vasquez@qlogic.com>
Signed-off-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
With this patch a reference counting is introduced for srb's.
However, there is this code in qla2xxx_eh_abort():
spin_unlock_irqrestore(&ha->hardware_lock, flags);
/* Wait for the command to be returned. */
if (wait) {
if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
qla_printk(KERN_ERR, ha,
"scsi(%ld:%d:%d): Abort handler timed out -- %lx "
"%x.\n", vha->host_no, id, lun, serial, ret);
ret = FAILED;
}
}
if (got_ref)
qla2x00_sp_compl(ha, sp);
where qla2x00_eh_wait_on_command() is waiting for a command to be
completed by the midlayer. Which will never happen, as the refcount
is held during that time and only released on the last lines.
Hence any command abort will be timed out and the error will be
escalated further.
I have fixed it by simply moving the last two lines above the
'if (wait)' condition. however I fail to see the race condition
mentioned, and hence the validity of the reference counting in the
first place.
So it might be that I'm missing something subtle here, so I would
ask you to have a look here.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: qla2xx command abort regression
2010-10-01 10:33 qla2xx command abort regression Hannes Reinecke
@ 2010-10-01 12:14 ` Hannes Reinecke
0 siblings, 0 replies; 2+ messages in thread
From: Hannes Reinecke @ 2010-10-01 12:14 UTC (permalink / raw)
To: Andrew Vasquez; +Cc: SCSI Mailing List
Hannes Reinecke wrote:
> Hi Andrew,
>
> there is a regression in the qla2xxx driver, introduced by this commit:
>
> commit 083a469db4ecf3b286a96b5b722c37fc1affe0be
> Author: Giridhar Malavali <giridhar.malavali@qlogic.com>
> Date: Fri May 28 15:08:18 2010 -0700
>
> [SCSI] qla2xxx: Correct use-after-free oops seen during EH-abort.
>
> Hold a reference to the srb (sp) while aborting an I/O -- as the
> I/O can/will complete from within the interrupt-context.
>
> Signed-off-by: Andrew Vasquez <andrew.vasquez@qlogic.com>
> Signed-off-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
>
> With this patch a reference counting is introduced for srb's.
> However, there is this code in qla2xxx_eh_abort():
>
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>
> /* Wait for the command to be returned. */
> if (wait) {
> if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
> qla_printk(KERN_ERR, ha,
> "scsi(%ld:%d:%d): Abort handler timed out -- %lx "
> "%x.\n", vha->host_no, id, lun, serial, ret);
> ret = FAILED;
> }
> }
>
> if (got_ref)
> qla2x00_sp_compl(ha, sp);
>
>
> where qla2x00_eh_wait_on_command() is waiting for a command to be
> completed by the midlayer. Which will never happen, as the refcount
> is held during that time and only released on the last lines.
> Hence any command abort will be timed out and the error will be
> escalated further.
>
> I have fixed it by simply moving the last two lines above the
> 'if (wait)' condition. however I fail to see the race condition
> mentioned, and hence the validity of the reference counting in the
> first place.
> So it might be that I'm missing something subtle here, so I would
> ask you to have a look here.
>
Actually, I found the race condition; it's here
(mentioned commit removed):
DEBUG2(printk("%s(%ld): aborting sp %p from RISC."
" pid=%ld.\n", __func__, vha->host_no, sp, serial));
spin_unlock_irqrestore(&ha->hardware_lock, flags);
if (ha->isp_ops->abort_command(sp)) {
DEBUG2(printk("%s(%ld): abort_command "
"mbx failed.\n", __func__, vha->host_no));
ret = FAILED;
So the command referenced by 'sp' might've been completed after
the unlock, in which case ->abort_command() will be accessing an
invalid 'sp' point. Ok. But then my fix is indeed valid.
I'll be sending a proper patch.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2010-10-01 12:14 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-01 10:33 qla2xx command abort regression Hannes Reinecke
2010-10-01 12:14 ` Hannes Reinecke
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).