* [PATCH] qla2xxx: Drop srb reference before waiting for completion
@ 2010-10-01 12:18 Hannes Reinecke
2010-10-01 17:02 ` Giridhar Malavali
2010-10-01 21:01 ` Mike Christie
0 siblings, 2 replies; 8+ messages in thread
From: Hannes Reinecke @ 2010-10-01 12:18 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi
This patch fixes a regression introduced by commit
083a469db4ecf3b286a96b5b722c37fc1affe0be
qla2xxx_eh_wait_on_command() is waiting for an srb to
complete, which will never happen as the routine took
a reference to the srb previously and will only drop it
after this function. So every command abort will fail.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: Andrew Vasquez <andrew.vasquez@qlogic.com>
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 1e4bff6..0af7549 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
}
spin_unlock_irqrestore(&ha->hardware_lock, flags);
+ if (got_ref)
+ qla2x00_sp_compl(ha, sp);
+
/* Wait for the command to be returned. */
if (wait) {
if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
@@ -893,9 +896,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
}
}
- if (got_ref)
- qla2x00_sp_compl(ha, sp);
-
qla_printk(KERN_INFO, ha,
"scsi(%ld:%d:%d): Abort command issued -- %d %lx %x.\n",
vha->host_no, id, lun, wait, serial, ret);
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-01 12:18 [PATCH] qla2xxx: Drop srb reference before waiting for completion Hannes Reinecke
@ 2010-10-01 17:02 ` Giridhar Malavali
2010-10-01 21:01 ` Mike Christie
1 sibling, 0 replies; 8+ messages in thread
From: Giridhar Malavali @ 2010-10-01 17:02 UTC (permalink / raw)
To: Hannes Reinecke, James Bottomley; +Cc: LinuxSCSI
On 10/1/10 5:18 AM, "Hannes Reinecke" <hare@suse.de> wrote:
>
>
> This patch fixes a regression introduced by commit
> 083a469db4ecf3b286a96b5b722c37fc1affe0be
>
> qla2xxx_eh_wait_on_command() is waiting for an srb to
> complete, which will never happen as the routine took
> a reference to the srb previously and will only drop it
> after this function. So every command abort will fail.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> Cc: Andrew Vasquez <andrew.vasquez@qlogic.com>
>
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 1e4bff6..0af7549 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
> }
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>
> + if (got_ref)
> + qla2x00_sp_compl(ha, sp);
> +
> /* Wait for the command to be returned. */
> if (wait) {
> if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
> @@ -893,9 +896,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
> }
> }
>
> - if (got_ref)
> - qla2x00_sp_compl(ha, sp);
> -
> qla_printk(KERN_INFO, ha,
> "scsi(%ld:%d:%d): Abort command issued -- %d %lx %x.\n",
> vha->host_no, id, lun, wait, serial, ret);
void
qla2x00_sp_compl(struct qla_hw_data *ha, srb_t *sp)
{
if (atomic_read(&sp->ref_count) == 0) {
DEBUG2(qla_printk(KERN_WARNING, ha,
"SP reference-count to ZERO -- sp=%p\n", sp));
DEBUG2(BUG());
return;
}
if (!atomic_dec_and_test(&sp->ref_count))
return;
qla2x00_sp_final_compl(ha, sp);
}
All completions will check for reference count going to zero before
releasing the sp structure. Since the ref count is incremented by the
eh_abort hanlder, I don't see sp getting released unless ref count goes to
0. I could not see any other places where sp is getting released. Do you
have any stack trace or logs so that I can analyze more.
-- Giri
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-01 12:18 [PATCH] qla2xxx: Drop srb reference before waiting for completion Hannes Reinecke
2010-10-01 17:02 ` Giridhar Malavali
@ 2010-10-01 21:01 ` Mike Christie
2010-10-01 21:10 ` Mike Christie
1 sibling, 1 reply; 8+ messages in thread
From: Mike Christie @ 2010-10-01 21:01 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: James Bottomley, linux-scsi
[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]
On 10/01/2010 07:18 AM, Hannes Reinecke wrote:
>
> This patch fixes a regression introduced by commit
> 083a469db4ecf3b286a96b5b722c37fc1affe0be
>
> qla2xxx_eh_wait_on_command() is waiting for an srb to
> complete, which will never happen as the routine took
> a reference to the srb previously and will only drop it
> after this function. So every command abort will fail.
>
> Signed-off-by: Hannes Reinecke<hare@suse.de>
> Cc: Andrew Vasquez<andrew.vasquez@qlogic.com>
>
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 1e4bff6..0af7549 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
> }
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>
> + if (got_ref)
> + qla2x00_sp_compl(ha, sp);
> +
You can just get rid of got_ref, because if you just move the compl call
up a little more you know that in that code path we always get a ref.
And there is no need to hold the ref to where it is above. See attached
patch.
We did something similar for qla4xxx. Looks like qla4xxx has a race when
it checks the CMD_SP validity and grabs a ref, but that is different story.
[-- Attachment #2: qla2xxx-rm-got-ref.patch --]
[-- Type: text/plain, Size: 3908 bytes --]
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index bdd53f0..6ea314f 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -832,7 +832,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
struct qla_hw_data *ha = vha->hw;
struct req_que *req = vha->req;
srb_t *spt;
- int got_ref = 0;
fc_block_scsi_eh(cmd);
@@ -866,7 +865,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
/* Get a reference to the sp and drop the lock.*/
sp_get(sp);
- got_ref++;
spin_unlock_irqrestore(&ha->hardware_lock, flags);
if (ha->isp_ops->abort_command(sp)) {
@@ -879,6 +877,7 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
wait = 1;
}
spin_lock_irqsave(&ha->hardware_lock, flags);
+ qla2x00_sp_compl(ha, sp);
break;
}
spin_unlock_irqrestore(&ha->hardware_lock, flags);
@@ -893,9 +892,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
}
}
- if (got_ref)
- qla2x00_sp_compl(ha, sp);
-
qla_printk(KERN_INFO, ha,
"scsi(%ld:%d:%d): Abort command issued -- %d %lx %x.\n",
vha->host_no, id, lun, wait, serial, ret);
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index acfc7d9..7921d76 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -470,7 +470,6 @@ static int iscsi_user_scan_session(struct device *dev, void *data)
struct Scsi_Host *shost;
struct iscsi_cls_host *ihost;
unsigned long flags;
- unsigned int id;
if (!iscsi_is_session_dev(dev))
return 0;
@@ -488,15 +487,14 @@ static int iscsi_user_scan_session(struct device *dev, void *data)
spin_unlock_irqrestore(&session->lock, flags);
goto user_scan_exit;
}
- id = session->target_id;
spin_unlock_irqrestore(&session->lock, flags);
- if (id != ISCSI_MAX_TARGET) {
+ if (session->target_id != ISCSI_MAX_TARGET) {
if ((scan_data->channel == SCAN_WILD_CARD ||
scan_data->channel == 0) &&
(scan_data->id == SCAN_WILD_CARD ||
- scan_data->id == id))
- scsi_scan_target(&session->dev, 0, id,
+ scan_data->id == session->target_id))
+ scsi_scan_target(&session->dev, 0, session->target_id,
scan_data->lun, 1);
}
@@ -642,6 +640,7 @@ static void __iscsi_unblock_session(struct work_struct *work)
void iscsi_unblock_session(struct iscsi_cls_session *session)
{
queue_work(iscsi_eh_timer_workq, &session->unblock_work);
+ flush_work(&session->unblock_work);
}
EXPORT_SYMBOL_GPL(iscsi_unblock_session);
@@ -662,6 +661,10 @@ static void __iscsi_block_session(struct work_struct *work)
queue_delayed_work(iscsi_eh_timer_workq,
&session->recovery_work,
session->recovery_tmo * HZ);
+
+ queue_delayed_work(iscsi_eh_timer_workq,
+ &session->dev_loss_work,
+ session->recovery_tmo * HZ);
}
void iscsi_block_session(struct iscsi_cls_session *session)
@@ -1383,6 +1386,10 @@ iscsi_set_param(struct iscsi_transport *transport, struct iscsi_uevent *ev)
sscanf(data, "%d", &value);
session->recovery_tmo = value;
break;
+ case ISCSI_PARAM_SESS_DEV_LOSS_TMO:
+ sscanf(data, "%d", &value);
+ session->dev_loss_tmo = value;
+ break;
default:
err = transport->set_param(conn, ev->u.set_param.param,
data, ev->u.set_param.len);
@@ -1849,6 +1856,7 @@ static ISCSI_CLASS_ATTR(priv_sess, field, S_IRUGO | S_IWUGO, \
show_priv_session_##field, \
store_priv_session_##field)
iscsi_priv_session_rw_attr(recovery_tmo, "%d");
+iscsi_priv_session_rw_attr(dev_loss_tmo, "%d");
/*
* iSCSI host attrs
@@ -2071,6 +2079,7 @@ iscsi_register_transport(struct iscsi_transport *tt)
SETUP_SESSION_RD_ATTR(ifacename, ISCSI_IFACE_NAME);
SETUP_SESSION_RD_ATTR(initiatorname, ISCSI_INITIATOR_NAME);
SETUP_SESSION_RD_ATTR(targetalias, ISCSI_TARGET_ALIAS);
+ SETUP_PRIV_SESSION_RW_ATTR(dev_loss_tmo);
SETUP_PRIV_SESSION_RW_ATTR(recovery_tmo);
SETUP_PRIV_SESSION_RD_ATTR(state);
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-01 21:01 ` Mike Christie
@ 2010-10-01 21:10 ` Mike Christie
2010-10-05 15:42 ` Giridhar Malavali
0 siblings, 1 reply; 8+ messages in thread
From: Mike Christie @ 2010-10-01 21:10 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: James Bottomley, linux-scsi
[-- Attachment #1: Type: text/plain, Size: 1219 bytes --]
On 10/01/2010 04:01 PM, Mike Christie wrote:
> On 10/01/2010 07:18 AM, Hannes Reinecke wrote:
>>
>> This patch fixes a regression introduced by commit
>> 083a469db4ecf3b286a96b5b722c37fc1affe0be
>>
>> qla2xxx_eh_wait_on_command() is waiting for an srb to
>> complete, which will never happen as the routine took
>> a reference to the srb previously and will only drop it
>> after this function. So every command abort will fail.
>>
>> Signed-off-by: Hannes Reinecke<hare@suse.de>
>> Cc: Andrew Vasquez<andrew.vasquez@qlogic.com>
>>
>> diff --git a/drivers/scsi/qla2xxx/qla_os.c
>> b/drivers/scsi/qla2xxx/qla_os.c
>> index 1e4bff6..0af7549 100644
>> --- a/drivers/scsi/qla2xxx/qla_os.c
>> +++ b/drivers/scsi/qla2xxx/qla_os.c
>> @@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
>> }
>> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>>
>> + if (got_ref)
>> + qla2x00_sp_compl(ha, sp);
>> +
>
> You can just get rid of got_ref, because if you just move the compl call
> up a little more you know that in that code path we always get a ref.
> And there is no need to hold the ref to where it is above. See attached
> patch.
>
Sorry. Last patch had some other iscsi dev loss stuff in there. See this
patch.
[-- Attachment #2: qla2xxx-rm-got-ref2.patch --]
[-- Type: text/plain, Size: 1517 bytes --]
qla2xxx: Drop srb reference before waiting for completion
This patch fixes a regression introduced by commit
083a469db4ecf3b286a96b5b722c37fc1affe0be
qla2xxx_eh_wait_on_command() is waiting for an srb to
complete, which will never happen as the routine took
a reference to the srb previously and will only drop it
after this function. So every command abort will fail.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index bdd53f0..6ea314f 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -832,7 +832,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
struct qla_hw_data *ha = vha->hw;
struct req_que *req = vha->req;
srb_t *spt;
- int got_ref = 0;
fc_block_scsi_eh(cmd);
@@ -866,7 +865,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
/* Get a reference to the sp and drop the lock.*/
sp_get(sp);
- got_ref++;
spin_unlock_irqrestore(&ha->hardware_lock, flags);
if (ha->isp_ops->abort_command(sp)) {
@@ -879,6 +877,7 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
wait = 1;
}
spin_lock_irqsave(&ha->hardware_lock, flags);
+ qla2x00_sp_compl(ha, sp);
break;
}
spin_unlock_irqrestore(&ha->hardware_lock, flags);
@@ -893,9 +892,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
}
}
- if (got_ref)
- qla2x00_sp_compl(ha, sp);
-
qla_printk(KERN_INFO, ha,
"scsi(%ld:%d:%d): Abort command issued -- %d %lx %x.\n",
vha->host_no, id, lun, wait, serial, ret);
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-01 21:10 ` Mike Christie
@ 2010-10-05 15:42 ` Giridhar Malavali
2010-10-05 18:18 ` Mike Christie
0 siblings, 1 reply; 8+ messages in thread
From: Giridhar Malavali @ 2010-10-05 15:42 UTC (permalink / raw)
To: Mike Christie, Hannes Reinecke; +Cc: James Bottomley, LinuxSCSI
On 10/1/10 2:10 PM, "Mike Christie" <michaelc@cs.wisc.edu> wrote:
> On 10/01/2010 04:01 PM, Mike Christie wrote:
>> On 10/01/2010 07:18 AM, Hannes Reinecke wrote:
>>>
>>> This patch fixes a regression introduced by commit
>>> 083a469db4ecf3b286a96b5b722c37fc1affe0be
>>>
>>> qla2xxx_eh_wait_on_command() is waiting for an srb to
>>> complete, which will never happen as the routine took
>>> a reference to the srb previously and will only drop it
>>> after this function. So every command abort will fail.
>>>
>>> Signed-off-by: Hannes Reinecke<hare@suse.de>
>>> Cc: Andrew Vasquez<andrew.vasquez@qlogic.com>
>>>
>>> diff --git a/drivers/scsi/qla2xxx/qla_os.c
>>> b/drivers/scsi/qla2xxx/qla_os.c
>>> index 1e4bff6..0af7549 100644
>>> --- a/drivers/scsi/qla2xxx/qla_os.c
>>> +++ b/drivers/scsi/qla2xxx/qla_os.c
>>> @@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
>>> }
>>> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>>>
>>> + if (got_ref)
>>> + qla2x00_sp_compl(ha, sp);
>>> +
>>
>> You can just get rid of got_ref, because if you just move the compl call
>> up a little more you know that in that code path we always get a ref.
>> And there is no need to hold the ref to where it is above. See attached
>> patch.
>>
>
> Sorry. Last patch had some other iscsi dev loss stuff in there. See this
> patch.
Mike,
The whole purpose of completing the command after the delay is to give
sufficient time for firmware to complete the original command and abort
command issued through interrupt. This makes sure that commands of concern
are no longer with firmware/hardware before completing it to upper layers. I
feel it is better to call qla2x00_sp_compl after waiting.
-- Giri
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-05 15:42 ` Giridhar Malavali
@ 2010-10-05 18:18 ` Mike Christie
2010-10-08 19:01 ` Giridhar Malavali
0 siblings, 1 reply; 8+ messages in thread
From: Mike Christie @ 2010-10-05 18:18 UTC (permalink / raw)
To: Giridhar Malavali; +Cc: Hannes Reinecke, James Bottomley, LinuxSCSI
[-- Attachment #1: Type: text/plain, Size: 7228 bytes --]
On 10/05/2010 10:42 AM, Giridhar Malavali wrote:
>
>
>
> On 10/1/10 2:10 PM, "Mike Christie"<michaelc@cs.wisc.edu> wrote:
>
>> On 10/01/2010 04:01 PM, Mike Christie wrote:
>>> On 10/01/2010 07:18 AM, Hannes Reinecke wrote:
>>>>
>>>> This patch fixes a regression introduced by commit
>>>> 083a469db4ecf3b286a96b5b722c37fc1affe0be
>>>>
>>>> qla2xxx_eh_wait_on_command() is waiting for an srb to
>>>> complete, which will never happen as the routine took
>>>> a reference to the srb previously and will only drop it
>>>> after this function. So every command abort will fail.
>>>>
>>>> Signed-off-by: Hannes Reinecke<hare@suse.de>
>>>> Cc: Andrew Vasquez<andrew.vasquez@qlogic.com>
>>>>
>>>> diff --git a/drivers/scsi/qla2xxx/qla_os.c
>>>> b/drivers/scsi/qla2xxx/qla_os.c
>>>> index 1e4bff6..0af7549 100644
>>>> --- a/drivers/scsi/qla2xxx/qla_os.c
>>>> +++ b/drivers/scsi/qla2xxx/qla_os.c
>>>> @@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
>>>> }
>>>> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>>>>
>>>> + if (got_ref)
>>>> + qla2x00_sp_compl(ha, sp);
>>>> +
>>>
>>> You can just get rid of got_ref, because if you just move the compl call
>>> up a little more you know that in that code path we always get a ref.
>>> And there is no need to hold the ref to where it is above. See attached
>>> patch.
>>>
>>
>> Sorry. Last patch had some other iscsi dev loss stuff in there. See this
>> patch.
>
> Mike,
>
> The whole purpose of completing the command after the delay is to give
> sufficient time for firmware to complete the original command and abort
> command issued through interrupt. This makes sure that commands of concern
> are no longer with firmware/hardware before completing it to upper layers. I
> feel it is better to call qla2x00_sp_compl after waiting.
Note: When the command is aborted the normal completion path in the scsi
layer is short circuited. If you try to do scsi_done on it, it will not
get processed pass the blk_complete_request blk_mark_rq_complete check.
Note2: In that abort path don't we already have a refcount on the sp
from when it was allocated and queued from the queuecommand? If we did
not, then if you did do sp_get on it or any access on it to test it you
would be accessing a freed sp.
For the initial problem Hannes was fixing..... If you wait until after
qla2x00_eh_wait_on_command to call qla2x00_sp_compl then as Hannes
pointed out the qla2x00_eh_wait_on_command is always going to fail. We
agree on that, right?
Right here, before calling sp_get, the sp would have a refcount of 1
from when the sp was created from the normal queuecommand path.
sp_get(sp);
Now after calling it here, sp will have a ref count of 2. This refcount
is supposed to protect where if the lock is dropped below and the
command completes, the sp is not freed, right?
got_ref++;
spin_unlock_irqrestore(&ha->hardware_lock, flags);
if (ha->isp_ops->abort_command(sp)) {
DEBUG2(printk("%s(%ld): abort_command "
"mbx failed.\n", __func__, vha->host_no));
ret = FAILED;
} else {
DEBUG3(printk("%s(%ld): abort_command "
"mbx success.\n", __func__, vha->host_no));
wait = 1;
}
spin_lock_irqsave(&ha->hardware_lock, flags);
break;
}
So right now the refcount is 2.
spin_unlock_irqrestore(&ha->hardware_lock, flags);
/* Wait for the command to be returned. */
if (wait) {
If the cmd does complete, then the normal/abort completion path will
drop the refcount from the initial queue comamnd path, so the refcount
is going to be stuck at 1, and so CMD_SP(cmd) is always going to be
non-null (since it only gets cleared when the ref count goes to zero),
and so we always time out from the wait.
if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
qla_printk(KERN_ERR, ha,
"scsi(%ld:%d:%d): Abort handler timed out
-- %lx "
"%x.\n", vha->host_no, id, lun, serial, ret);
ret = FAILED;
}
}
if (got_ref)
qla2x00_sp_compl(ha, sp);
Now, here the we release the ref count from the abort chunk, and so
CMD_SP is now null.
But the wait failed so we returned FAILED.
-----------------------------------------------------------------------
So that is the reason why we must move the ref count release upwards.
Now, for the answer why we it is ok to release where I did it.
/* Get a reference to the sp and drop the lock.*/
sp_get(sp);
Here we have the hardware lock and another ref to the sp (refcount = 2).
got_ref++;
spin_unlock_irqrestore(&ha->hardware_lock, flags);
if (ha->isp_ops->abort_command(sp)) {
DEBUG2(printk("%s(%ld): abort_command "
"mbx failed.\n", __func__, vha->host_no));
ret = FAILED;
} else {
DEBUG3(printk("%s(%ld): abort_command "
"mbx success.\n", __func__, vha->host_no));
wait = 1;
}
spin_lock_irqsave(&ha->hardware_lock, flags);
right here if the command completed from the abort or normal completion
the the refcount would be 1. If we release the ref that we took above,
it would free the sp and call scsi_done on the command. However, the
scsi eh has made sure that if you did this, it will not complete the IO
upwards. The scsi eh basically owns the command. It has to prevent races
like this for us (for example the scsi_done function could get called
while scsi_error.c is setting up the abort and now it would be accessing
freed/reallcoated memory if it did not handle this problem).
break;
}
spin_unlock_irqrestore(&ha->hardware_lock, flags);
So from here on we never touch the sp, so there is no need to keep its
memory around for qla2xxx's use. We are accessing the scsi command
though, but we are relying on the scsi eh doing the right thing since it
owns the command.
/* Wait for the command to be returned. */
if (wait) {
if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
qla_printk(KERN_ERR, ha,
"scsi(%ld:%d:%d): Abort handler timed out
-- %lx "
"%x.\n", vha->host_no, id, lun, serial, ret);
ret = FAILED;
}
}
Also, why do you do that loop in qla2xxx_eh_abort? Is the hardware lock
held while calling the compl function. If so it seems like if you know
the CMD_SP is not null then there is still a refcount on it so there
must be a valid sp. In the attached patch, I removed the loop. It
assumes that when the hardware lock is held when calling
qla2x00_sp_compl. Patch is only compile tested.
[-- Attachment #2: qla2xxx-rm-got-ref3.patch --]
[-- Type: text/plain, Size: 3013 bytes --]
qla2xxx: Drop srb reference before waiting for completion
This patch fixes a regression introduced by commit
083a469db4ecf3b286a96b5b722c37fc1affe0be
qla2xxx_eh_wait_on_command() is waiting for an srb to
complete, which will never happen as the routine took
a reference to the srb previously and will only drop it
after this function. So every command abort will fail.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index bdd53f0..a5b9af2 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -824,15 +824,12 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
{
scsi_qla_host_t *vha = shost_priv(cmd->device->host);
srb_t *sp;
- int ret, i;
+ int ret;
unsigned int id, lun;
unsigned long serial;
unsigned long flags;
int wait = 0;
struct qla_hw_data *ha = vha->hw;
- struct req_que *req = vha->req;
- srb_t *spt;
- int got_ref = 0;
fc_block_scsi_eh(cmd);
@@ -844,43 +841,32 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
id = cmd->device->id;
lun = cmd->device->lun;
serial = cmd->serial_number;
- spt = (srb_t *) CMD_SP(cmd);
- if (!spt)
- return SUCCESS;
- /* Check active list for command command. */
spin_lock_irqsave(&ha->hardware_lock, flags);
- for (i = 1; i < MAX_OUTSTANDING_COMMANDS; i++) {
- sp = req->outstanding_cmds[i];
-
- if (sp == NULL)
- continue;
- if ((sp->ctx) && !(sp->flags & SRB_FCP_CMND_DMA_VALID) &&
- !IS_PROT_IO(sp))
- continue;
- if (sp->cmd != cmd)
- continue;
+ sp = (srb_t *) CMD_SP(cmd);
+ if (!sp) {
+ spin_lock_irqsave(&ha->hardware_lock, flags);
+ return SUCCESS;
+ }
- DEBUG2(printk("%s(%ld): aborting sp %p from RISC."
+ DEBUG2(printk("%s(%ld): aborting sp %p from RISC."
" pid=%ld.\n", __func__, vha->host_no, sp, serial));
- /* Get a reference to the sp and drop the lock.*/
- sp_get(sp);
- got_ref++;
+ /* Get a reference to the sp and drop the lock.*/
+ sp_get(sp);
- spin_unlock_irqrestore(&ha->hardware_lock, flags);
- if (ha->isp_ops->abort_command(sp)) {
- DEBUG2(printk("%s(%ld): abort_command "
+ spin_unlock_irqrestore(&ha->hardware_lock, flags);
+ if (ha->isp_ops->abort_command(sp)) {
+ DEBUG2(printk("%s(%ld): abort_command "
"mbx failed.\n", __func__, vha->host_no));
- ret = FAILED;
- } else {
- DEBUG3(printk("%s(%ld): abort_command "
+ ret = FAILED;
+ } else {
+ DEBUG3(printk("%s(%ld): abort_command "
"mbx success.\n", __func__, vha->host_no));
- wait = 1;
- }
- spin_lock_irqsave(&ha->hardware_lock, flags);
- break;
+ wait = 1;
}
+ spin_lock_irqsave(&ha->hardware_lock, flags);
+ qla2x00_sp_compl(ha, sp);
spin_unlock_irqrestore(&ha->hardware_lock, flags);
/* Wait for the command to be returned. */
@@ -893,9 +879,6 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
}
}
- if (got_ref)
- qla2x00_sp_compl(ha, sp);
-
qla_printk(KERN_INFO, ha,
"scsi(%ld:%d:%d): Abort command issued -- %d %lx %x.\n",
vha->host_no, id, lun, wait, serial, ret);
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-05 18:18 ` Mike Christie
@ 2010-10-08 19:01 ` Giridhar Malavali
2010-10-08 21:44 ` Mike Christie
0 siblings, 1 reply; 8+ messages in thread
From: Giridhar Malavali @ 2010-10-08 19:01 UTC (permalink / raw)
To: Mike Christie; +Cc: Hannes Reinecke, James Bottomley, LinuxSCSI
On 10/5/10 11:18 AM, "Mike Christie" <michaelc@cs.wisc.edu> wrote:
> On 10/05/2010 10:42 AM, Giridhar Malavali wrote:
>>
>>
>>
>> On 10/1/10 2:10 PM, "Mike Christie"<michaelc@cs.wisc.edu> wrote:
>>
>>> On 10/01/2010 04:01 PM, Mike Christie wrote:
>>>> On 10/01/2010 07:18 AM, Hannes Reinecke wrote:
>>>>>
>>>>> This patch fixes a regression introduced by commit
>>>>> 083a469db4ecf3b286a96b5b722c37fc1affe0be
>>>>>
>>>>> qla2xxx_eh_wait_on_command() is waiting for an srb to
>>>>> complete, which will never happen as the routine took
>>>>> a reference to the srb previously and will only drop it
>>>>> after this function. So every command abort will fail.
>>>>>
>>>>> Signed-off-by: Hannes Reinecke<hare@suse.de>
>>>>> Cc: Andrew Vasquez<andrew.vasquez@qlogic.com>
>>>>>
>>>>> diff --git a/drivers/scsi/qla2xxx/qla_os.c
>>>>> b/drivers/scsi/qla2xxx/qla_os.c
>>>>> index 1e4bff6..0af7549 100644
>>>>> --- a/drivers/scsi/qla2xxx/qla_os.c
>>>>> +++ b/drivers/scsi/qla2xxx/qla_os.c
>>>>> @@ -883,6 +883,9 @@ qla2xxx_eh_abort(struct scsi_cmnd *cmd)
>>>>> }
>>>>> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>>>>>
>>>>> + if (got_ref)
>>>>> + qla2x00_sp_compl(ha, sp);
>>>>> +
>>>>
>>>> You can just get rid of got_ref, because if you just move the compl call
>>>> up a little more you know that in that code path we always get a ref.
>>>> And there is no need to hold the ref to where it is above. See attached
>>>> patch.
>>>>
>>>
>>> Sorry. Last patch had some other iscsi dev loss stuff in there. See this
>>> patch.
>>
>> Mike,
>>
>> The whole purpose of completing the command after the delay is to give
>> sufficient time for firmware to complete the original command and abort
>> command issued through interrupt. This makes sure that commands of concern
>> are no longer with firmware/hardware before completing it to upper layers. I
>> feel it is better to call qla2x00_sp_compl after waiting.
>
>
> Note: When the command is aborted the normal completion path in the scsi
> layer is short circuited. If you try to do scsi_done on it, it will not
> get processed pass the blk_complete_request blk_mark_rq_complete check.
>
>
> Note2: In that abort path don't we already have a refcount on the sp
> from when it was allocated and queued from the queuecommand? If we did
> not, then if you did do sp_get on it or any access on it to test it you
> would be accessing a freed sp.
>
>
>
> For the initial problem Hannes was fixing..... If you wait until after
> qla2x00_eh_wait_on_command to call qla2x00_sp_compl then as Hannes
> pointed out the qla2x00_eh_wait_on_command is always going to fail. We
> agree on that, right?
>
Yes. I agree.
> Right here, before calling sp_get, the sp would have a refcount of 1
> from when the sp was created from the normal queuecommand path.
>
> sp_get(sp);
>
> Now after calling it here, sp will have a ref count of 2. This refcount
> is supposed to protect where if the lock is dropped below and the
> command completes, the sp is not freed, right?
>
>
> got_ref++;
>
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
> if (ha->isp_ops->abort_command(sp)) {
> DEBUG2(printk("%s(%ld): abort_command "
> "mbx failed.\n", __func__, vha->host_no));
> ret = FAILED;
> } else {
> DEBUG3(printk("%s(%ld): abort_command "
> "mbx success.\n", __func__, vha->host_no));
> wait = 1;
> }
> spin_lock_irqsave(&ha->hardware_lock, flags);
> break;
> }
>
> So right now the refcount is 2.
>
>
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>
> /* Wait for the command to be returned. */
> if (wait) {
>
> If the cmd does complete, then the normal/abort completion path will
> drop the refcount from the initial queue comamnd path, so the refcount
> is going to be stuck at 1, and so CMD_SP(cmd) is always going to be
> non-null (since it only gets cleared when the ref count goes to zero),
> and so we always time out from the wait.
>
>
> if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
> qla_printk(KERN_ERR, ha,
> "scsi(%ld:%d:%d): Abort handler timed out
> -- %lx "
> "%x.\n", vha->host_no, id, lun, serial, ret);
> ret = FAILED;
> }
> }
>
> if (got_ref)
> qla2x00_sp_compl(ha, sp);
>
> Now, here the we release the ref count from the abort chunk, and so
> CMD_SP is now null.
>
> But the wait failed so we returned FAILED.
>
>
> -----------------------------------------------------------------------
>
>
> So that is the reason why we must move the ref count release upwards.
> Now, for the answer why we it is ok to release where I did it.
>
>
> /* Get a reference to the sp and drop the lock.*/
> sp_get(sp);
>
>
> Here we have the hardware lock and another ref to the sp (refcount = 2).
>
> got_ref++;
>
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
> if (ha->isp_ops->abort_command(sp)) {
> DEBUG2(printk("%s(%ld): abort_command "
> "mbx failed.\n", __func__, vha->host_no));
> ret = FAILED;
> } else {
> DEBUG3(printk("%s(%ld): abort_command "
> "mbx success.\n", __func__, vha->host_no));
> wait = 1;
> }
> spin_lock_irqsave(&ha->hardware_lock, flags);
>
> right here if the command completed from the abort or normal completion
> the the refcount would be 1. If we release the ref that we took above,
> it would free the sp and call scsi_done on the command. However, the
> scsi eh has made sure that if you did this, it will not complete the IO
> upwards. The scsi eh basically owns the command. It has to prevent races
> like this for us (for example the scsi_done function could get called
> while scsi_error.c is setting up the abort and now it would be accessing
> freed/reallcoated memory if it did not handle this problem).
>
>
> break;
> }
> spin_unlock_irqrestore(&ha->hardware_lock, flags);
>
>
> So from here on we never touch the sp, so there is no need to keep its
> memory around for qla2xxx's use. We are accessing the scsi command
> though, but we are relying on the scsi eh doing the right thing since it
> owns the command.
>
>
> /* Wait for the command to be returned. */
> if (wait) {
> if (qla2x00_eh_wait_on_command(cmd) != QLA_SUCCESS) {
> qla_printk(KERN_ERR, ha,
> "scsi(%ld:%d:%d): Abort handler timed out
> -- %lx "
> "%x.\n", vha->host_no, id, lun, serial, ret);
> ret = FAILED;
> }
> }
>
>
>
> Also, why do you do that loop in qla2xxx_eh_abort? Is the hardware lock
> held while calling the compl function. If so it seems like if you know
> the CMD_SP is not null then there is still a refcount on it so there
> must be a valid sp. In the attached patch, I removed the loop. It
> assumes that when the hardware lock is held when calling
> qla2x00_sp_compl. Patch is only compile tested.
>
Thanks for the clarification. I acknowledge the patch overall except for
spin_lock changes where we need to release the lock when we return success
or failure. I will test the patch and re-submit once I am done.
-- Giridhar
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] qla2xxx: Drop srb reference before waiting for completion
2010-10-08 19:01 ` Giridhar Malavali
@ 2010-10-08 21:44 ` Mike Christie
0 siblings, 0 replies; 8+ messages in thread
From: Mike Christie @ 2010-10-08 21:44 UTC (permalink / raw)
To: Giridhar Malavali; +Cc: Hannes Reinecke, James Bottomley, LinuxSCSI
On 10/08/2010 02:01 PM, Giridhar Malavali wrote:
> Thanks for the clarification. I acknowledge the patch overall except for
> spin_lock changes where we need to release the lock when we return success
> or failure.
Oops, yeah sorry. You are right. Bad cut and paste.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2010-10-08 21:38 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-01 12:18 [PATCH] qla2xxx: Drop srb reference before waiting for completion Hannes Reinecke
2010-10-01 17:02 ` Giridhar Malavali
2010-10-01 21:01 ` Mike Christie
2010-10-01 21:10 ` Mike Christie
2010-10-05 15:42 ` Giridhar Malavali
2010-10-05 18:18 ` Mike Christie
2010-10-08 19:01 ` Giridhar Malavali
2010-10-08 21:44 ` Mike Christie
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).