* libsas error-handling completion issue.
@ 2009-03-14 6:44 Ying Chu
2009-03-14 17:19 ` James Bottomley
0 siblings, 1 reply; 5+ messages in thread
From: Ying Chu @ 2009-03-14 6:44 UTC (permalink / raw)
To: linux-scsi; +Cc: James.Bottomley, jeff, Tejun Heo
As sas_eh_finish_cmd() routine is used when a sas task is marked
succfully recovered(or the device is marked OFFLINE), then the
corresponding task_done will be invoked after SAS_TASK_STATE_ABORTED is
unmasked. Whatever sas_scsi_task_done() or sas_ata_task_done() got
invoked, it will translate task_status to scsi_cmnd->request, free the
sas_task structure and invoke task_done, the problem is, after
task_done(), scsi_eh_finish_cmd() will still add the cmd to
sas_ha->eh_done_q, and in the coming scsi_eh_flush_done_q(), it will
retry the command or invoke scsi_finish_command() to finish the request
again.
Take a timeout SSP request in the race condition for example, if the
request is returned prior to lldd_abort_task() and set
SAS_TASK_STATE_DONE, the lldd_abort_task() handler will do nothing but
return TASK_IS_DONE in sas_scsi_find_task(), later sas_eh_finished_cmd()
will get invoked and unset SAS_TASK_STATE_ABORTED mask to let
sas_scsi_task_done() in, since it's a good response and the request will
be finished succfully.But current strategy will
add the scsi_cmnd to sas_ha->eh_done_q after task_done() is performed.
Then if possibe, we may scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
in scsi_eh_flush_done_q(). I meet the condition and sometime the system
hangs.
Did I miss something?
--
Regards,
Ying Chu
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: libsas error-handling completion issue.
2009-03-14 6:44 libsas error-handling completion issue Ying Chu
@ 2009-03-14 17:19 ` James Bottomley
2009-03-15 4:02 ` Ying Chu
0 siblings, 1 reply; 5+ messages in thread
From: James Bottomley @ 2009-03-14 17:19 UTC (permalink / raw)
To: Ying Chu; +Cc: linux-scsi, jeff, Tejun Heo
On Fri, 2009-03-13 at 23:44 -0700, Ying Chu wrote:
> As sas_eh_finish_cmd() routine is used when a sas task is marked
> succfully recovered(or the device is marked OFFLINE), then the
> corresponding task_done will be invoked after SAS_TASK_STATE_ABORTED is
> unmasked. Whatever sas_scsi_task_done() or sas_ata_task_done() got
> invoked, it will translate task_status to scsi_cmnd->request, free the
> sas_task structure and invoke task_done, the problem is, after
> task_done(), scsi_eh_finish_cmd() will still add the cmd to
> sas_ha->eh_done_q, and in the coming scsi_eh_flush_done_q(), it will
> retry the command or invoke scsi_finish_command() to finish the request
> again.
OK, so the immediate answer is that the eh_done_q is processed in the
strategy handler, in this case by sas_eh_handle_sas_errors() which skips
over any commands without sas tasks in the list.
> Take a timeout SSP request in the race condition for example, if the
> request is returned prior to lldd_abort_task() and set
> SAS_TASK_STATE_DONE, the lldd_abort_task() handler will do nothing but
> return TASK_IS_DONE in sas_scsi_find_task(), later sas_eh_finished_cmd()
> will get invoked and unset SAS_TASK_STATE_ABORTED mask to let
> sas_scsi_task_done() in, since it's a good response and the request will
> be finished succfully.But current strategy will
> add the scsi_cmnd to sas_ha->eh_done_q after task_done() is performed.
> Then if possibe, we may scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
> in scsi_eh_flush_done_q(). I meet the condition and sometime the system
> hangs.
>
> Did I miss something?
I'm not sure, since I didn't quite understand the problem.
So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually
the mediating flag in how completions are handled. There are two ways
through ->task_done() depending on the state of this flag. If this flag
is set, it means that libsas owns the task and ->task_done() may not
free it. Conversely if the timeout fires it checks the flags and if
SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably
the mid-layer will soon see it.
The apparent race between these flags is mediated using the
task_state_lock. When looking to test and set these flags, you must be
holding this lock.
The first case is in sas_scsi_host.c:sas_scsi_timed_out() where it takes
the lock, checks for done and only sets STATE_ABORTED if STATE_DONE
wasn't set.
The other, nastily, is in the driver (which is responsible for setting
STATE_DONE). I've never liked this bit because it's complex and easy to
get wrong. However, if you look at
aic94xx_task.c:asd_task_tasklet_complete() you see the big chunk of code
where it takes the lock, sets STATE_DONE, checks STATE_ABORTED and
doesn't call ->task_done() if this is set.
As far as I can tell, the locking around all of this, while nasty to the
user, is raceproof as long as the LLD gets it right.
James
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: libsas error-handling completion issue.
2009-03-14 17:19 ` James Bottomley
@ 2009-03-15 4:02 ` Ying Chu
2009-03-15 14:14 ` James Bottomley
0 siblings, 1 reply; 5+ messages in thread
From: Ying Chu @ 2009-03-15 4:02 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi, jeff, Tejun Heo
>So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means
> that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it.
What I meant is the corner case where interrupt is fired and the sas_task returned back at the monment but before sas_scsi_find_task() got invoked and set SAS_TASK_STATE_DONE flag. As in asd_task_tasklet_complete() routine, it will check if the STATE_ABORTED is set, if so, it doesn't invoke task_done() routine. Still returned to the strategy handler, sas_scsi_find_task() will try to abort the task and find it has been finished with STATE_DONE set, so it return TASK_IS_DONE and call sas_eh_finish_cmd(), where it unset the TASK_ABORTED flag and call task_done(). Noticed that in sas_eh_finish_cmd(), we call scsi_eh_finish_cmd() which will add the cmd to sas_ha->eh_done(), and in the coming flush_done_q it will be retried or finished again.
-----Original Message-----
From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
Sent: 2009年3月15日 1:19
To: Ying Chu
Cc: linux-scsi@vger.kernel.org; jeff@garzik.org; Tejun Heo
Subject: Re: libsas error-handling completion issue.
On Fri, 2009-03-13 at 23:44 -0700, Ying Chu wrote:
> As sas_eh_finish_cmd() routine is used when a sas task is marked
> succfully recovered(or the device is marked OFFLINE), then the
> corresponding task_done will be invoked after SAS_TASK_STATE_ABORTED
> is unmasked. Whatever sas_scsi_task_done() or sas_ata_task_done() got
> invoked, it will translate task_status to scsi_cmnd->result, free the
> sas_task structure and invoke task_done, the problem is, after
> task_done(), scsi_eh_finish_cmd() will still add the cmd to
> sas_ha->eh_done_q, and in the coming scsi_eh_flush_done_q(), it will
> retry the command or invoke scsi_finish_command() to finish the
> request again.
OK, so the immediate answer is that the eh_done_q is processed in the strategy handler, in this case by sas_eh_handle_sas_errors() which skips over any commands without sas tasks in the list.
> Take a timeout SSP request in the race condition for example, if
> the request is returned prior to lldd_abort_task() and set
> SAS_TASK_STATE_DONE, the lldd_abort_task() handler will do nothing but
> return TASK_IS_DONE in sas_scsi_find_task(), later
> sas_eh_finished_cmd() will get invoked and unset
> SAS_TASK_STATE_ABORTED mask to let
> sas_scsi_task_done() in, since it's a good response and the request
> will be finished succfully.But current strategy will add the scsi_cmnd
> to sas_ha->eh_done_q after task_done() is performed.
> Then if possibe, we may scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY)
> in scsi_eh_flush_done_q(). I meet the condition and sometime the
> system hangs.
>
> Did I miss something?
I'm not sure, since I didn't quite understand the problem.
So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it.
The apparent race between these flags is mediated using the task_state_lock. When looking to test and set these flags, you must be holding this lock.
The first case is in sas_scsi_host.c:sas_scsi_timed_out() where it takes the lock, checks for done and only sets STATE_ABORTED if STATE_DONE wasn't set.
The other, nastily, is in the driver (which is responsible for setting STATE_DONE). I've never liked this bit because it's complex and easy to get wrong. However, if you look at
aic94xx_task.c:asd_task_tasklet_complete() you see the big chunk of code where it takes the lock, sets STATE_DONE, checks STATE_ABORTED and doesn't call ->task_done() if this is set.
As far as I can tell, the locking around all of this, while nasty to the user, is raceproof as long as the LLD gets it right.
James
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: libsas error-handling completion issue.
2009-03-15 4:02 ` Ying Chu
@ 2009-03-15 14:14 ` James Bottomley
[not found] ` <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com>
0 siblings, 1 reply; 5+ messages in thread
From: James Bottomley @ 2009-03-15 14:14 UTC (permalink / raw)
To: Ying Chu; +Cc: linux-scsi, jeff, Tejun Heo
On Sat, 2009-03-14 at 21:02 -0700, Ying Chu wrote:
> >So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means
> > that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it.
>
> What I meant is the corner case where interrupt is fired and the
> sas_task returned back at the monment but before sas_scsi_find_task()
> got invoked and set SAS_TASK_STATE_DONE flag. As in
> asd_task_tasklet_complete() routine, it will check if the
> STATE_ABORTED is set, if so, it doesn't invoke task_done() routine.
> Still returned to the strategy handler, sas_scsi_find_task() will try
> to abort the task and find it has been finished with STATE_DONE set,
> so it return TASK_IS_DONE and call sas_eh_finish_cmd(), where it unset
> the TASK_ABORTED flag and call task_done(). Noticed that in
> sas_eh_finish_cmd(), we call scsi_eh_finish_cmd() which will add the
> cmd to sas_ha->eh_done(), and in the coming flush_done_q it will be
> retried or finished again.
I'm still not quite seeing what you think the problem is. Is it that
task_done() calls scsi_cmnd->scsi_done(), in which case you think
there's a double completion? That's fixed in the block layer:
scsi_done() is blk_complete_request() which does nothing if the request
has a completion set and the block layer sets the completion when the
timeout fires.
James
James
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: 答复: libsas error-handling completion issue.
[not found] ` <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com>
@ 2009-03-15 17:13 ` James Bottomley
0 siblings, 0 replies; 5+ messages in thread
From: James Bottomley @ 2009-03-15 17:13 UTC (permalink / raw)
To: Ying Chu; +Cc: linux-scsi, jeff, Tejun Heo
On Sun, 2009-03-15 at 08:59 -0700, Ying Chu wrote:
> Thanks, I'll figure out why my system hangs after 5 times retry.
Just as a point of mailing list etiquette, this email had a text/html
part ... vger is very intolerant of these, which is why it needs to be
text/plain only ... there's probably an outlook setting (I think it's
recipient doesn't want enriched mail) to fix this.
Failing after five retries sounds significant: 5 is SD_MAX_RETRIES, so
it sounds like something failed after the allowed number of retries was
exhausted.
James
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2009-03-15 17:13 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-14 6:44 libsas error-handling completion issue Ying Chu
2009-03-14 17:19 ` James Bottomley
2009-03-15 4:02 ` Ying Chu
2009-03-15 14:14 ` James Bottomley
[not found] ` <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com>
2009-03-15 17:13 ` 答复: " James Bottomley
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox