* libsas error-handling completion issue. @ 2009-03-14 6:44 Ying Chu 2009-03-14 17:19 ` James Bottomley 0 siblings, 1 reply; 5+ messages in thread From: Ying Chu @ 2009-03-14 6:44 UTC (permalink / raw) To: linux-scsi; +Cc: James.Bottomley, jeff, Tejun Heo As sas_eh_finish_cmd() routine is used when a sas task is marked succfully recovered(or the device is marked OFFLINE), then the corresponding task_done will be invoked after SAS_TASK_STATE_ABORTED is unmasked. Whatever sas_scsi_task_done() or sas_ata_task_done() got invoked, it will translate task_status to scsi_cmnd->request, free the sas_task structure and invoke task_done, the problem is, after task_done(), scsi_eh_finish_cmd() will still add the cmd to sas_ha->eh_done_q, and in the coming scsi_eh_flush_done_q(), it will retry the command or invoke scsi_finish_command() to finish the request again. Take a timeout SSP request in the race condition for example, if the request is returned prior to lldd_abort_task() and set SAS_TASK_STATE_DONE, the lldd_abort_task() handler will do nothing but return TASK_IS_DONE in sas_scsi_find_task(), later sas_eh_finished_cmd() will get invoked and unset SAS_TASK_STATE_ABORTED mask to let sas_scsi_task_done() in, since it's a good response and the request will be finished succfully.But current strategy will add the scsi_cmnd to sas_ha->eh_done_q after task_done() is performed. Then if possibe, we may scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY) in scsi_eh_flush_done_q(). I meet the condition and sometime the system hangs. Did I miss something? -- Regards, Ying Chu ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: libsas error-handling completion issue. 2009-03-14 6:44 libsas error-handling completion issue Ying Chu @ 2009-03-14 17:19 ` James Bottomley 2009-03-15 4:02 ` Ying Chu 0 siblings, 1 reply; 5+ messages in thread From: James Bottomley @ 2009-03-14 17:19 UTC (permalink / raw) To: Ying Chu; +Cc: linux-scsi, jeff, Tejun Heo On Fri, 2009-03-13 at 23:44 -0700, Ying Chu wrote: > As sas_eh_finish_cmd() routine is used when a sas task is marked > succfully recovered(or the device is marked OFFLINE), then the > corresponding task_done will be invoked after SAS_TASK_STATE_ABORTED is > unmasked. Whatever sas_scsi_task_done() or sas_ata_task_done() got > invoked, it will translate task_status to scsi_cmnd->request, free the > sas_task structure and invoke task_done, the problem is, after > task_done(), scsi_eh_finish_cmd() will still add the cmd to > sas_ha->eh_done_q, and in the coming scsi_eh_flush_done_q(), it will > retry the command or invoke scsi_finish_command() to finish the request > again. OK, so the immediate answer is that the eh_done_q is processed in the strategy handler, in this case by sas_eh_handle_sas_errors() which skips over any commands without sas tasks in the list. > Take a timeout SSP request in the race condition for example, if the > request is returned prior to lldd_abort_task() and set > SAS_TASK_STATE_DONE, the lldd_abort_task() handler will do nothing but > return TASK_IS_DONE in sas_scsi_find_task(), later sas_eh_finished_cmd() > will get invoked and unset SAS_TASK_STATE_ABORTED mask to let > sas_scsi_task_done() in, since it's a good response and the request will > be finished succfully.But current strategy will > add the scsi_cmnd to sas_ha->eh_done_q after task_done() is performed. > Then if possibe, we may scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY) > in scsi_eh_flush_done_q(). I meet the condition and sometime the system > hangs. > > Did I miss something? I'm not sure, since I didn't quite understand the problem. So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it. The apparent race between these flags is mediated using the task_state_lock. When looking to test and set these flags, you must be holding this lock. The first case is in sas_scsi_host.c:sas_scsi_timed_out() where it takes the lock, checks for done and only sets STATE_ABORTED if STATE_DONE wasn't set. The other, nastily, is in the driver (which is responsible for setting STATE_DONE). I've never liked this bit because it's complex and easy to get wrong. However, if you look at aic94xx_task.c:asd_task_tasklet_complete() you see the big chunk of code where it takes the lock, sets STATE_DONE, checks STATE_ABORTED and doesn't call ->task_done() if this is set. As far as I can tell, the locking around all of this, while nasty to the user, is raceproof as long as the LLD gets it right. James ^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: libsas error-handling completion issue. 2009-03-14 17:19 ` James Bottomley @ 2009-03-15 4:02 ` Ying Chu 2009-03-15 14:14 ` James Bottomley 0 siblings, 1 reply; 5+ messages in thread From: Ying Chu @ 2009-03-15 4:02 UTC (permalink / raw) To: James Bottomley; +Cc: linux-scsi, jeff, Tejun Heo >So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means > that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it. What I meant is the corner case where interrupt is fired and the sas_task returned back at the monment but before sas_scsi_find_task() got invoked and set SAS_TASK_STATE_DONE flag. As in asd_task_tasklet_complete() routine, it will check if the STATE_ABORTED is set, if so, it doesn't invoke task_done() routine. Still returned to the strategy handler, sas_scsi_find_task() will try to abort the task and find it has been finished with STATE_DONE set, so it return TASK_IS_DONE and call sas_eh_finish_cmd(), where it unset the TASK_ABORTED flag and call task_done(). Noticed that in sas_eh_finish_cmd(), we call scsi_eh_finish_cmd() which will add the cmd to sas_ha->eh_done(), and in the coming flush_done_q it will be retried or finished again. -----Original Message----- From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com] Sent: 2009年3月15日 1:19 To: Ying Chu Cc: linux-scsi@vger.kernel.org; jeff@garzik.org; Tejun Heo Subject: Re: libsas error-handling completion issue. On Fri, 2009-03-13 at 23:44 -0700, Ying Chu wrote: > As sas_eh_finish_cmd() routine is used when a sas task is marked > succfully recovered(or the device is marked OFFLINE), then the > corresponding task_done will be invoked after SAS_TASK_STATE_ABORTED > is unmasked. Whatever sas_scsi_task_done() or sas_ata_task_done() got > invoked, it will translate task_status to scsi_cmnd->result, free the > sas_task structure and invoke task_done, the problem is, after > task_done(), scsi_eh_finish_cmd() will still add the cmd to > sas_ha->eh_done_q, and in the coming scsi_eh_flush_done_q(), it will > retry the command or invoke scsi_finish_command() to finish the > request again. OK, so the immediate answer is that the eh_done_q is processed in the strategy handler, in this case by sas_eh_handle_sas_errors() which skips over any commands without sas tasks in the list. > Take a timeout SSP request in the race condition for example, if > the request is returned prior to lldd_abort_task() and set > SAS_TASK_STATE_DONE, the lldd_abort_task() handler will do nothing but > return TASK_IS_DONE in sas_scsi_find_task(), later > sas_eh_finished_cmd() will get invoked and unset > SAS_TASK_STATE_ABORTED mask to let > sas_scsi_task_done() in, since it's a good response and the request > will be finished succfully.But current strategy will add the scsi_cmnd > to sas_ha->eh_done_q after task_done() is performed. > Then if possibe, we may scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY) > in scsi_eh_flush_done_q(). I meet the condition and sometime the > system hangs. > > Did I miss something? I'm not sure, since I didn't quite understand the problem. So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it. The apparent race between these flags is mediated using the task_state_lock. When looking to test and set these flags, you must be holding this lock. The first case is in sas_scsi_host.c:sas_scsi_timed_out() where it takes the lock, checks for done and only sets STATE_ABORTED if STATE_DONE wasn't set. The other, nastily, is in the driver (which is responsible for setting STATE_DONE). I've never liked this bit because it's complex and easy to get wrong. However, if you look at aic94xx_task.c:asd_task_tasklet_complete() you see the big chunk of code where it takes the lock, sets STATE_DONE, checks STATE_ABORTED and doesn't call ->task_done() if this is set. As far as I can tell, the locking around all of this, while nasty to the user, is raceproof as long as the LLD gets it right. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: libsas error-handling completion issue. 2009-03-15 4:02 ` Ying Chu @ 2009-03-15 14:14 ` James Bottomley [not found] ` <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com> 0 siblings, 1 reply; 5+ messages in thread From: James Bottomley @ 2009-03-15 14:14 UTC (permalink / raw) To: Ying Chu; +Cc: linux-scsi, jeff, Tejun Heo On Sat, 2009-03-14 at 21:02 -0700, Ying Chu wrote: > >So, let me explain how the SAS_TASK_STATE_ABORTED works. It's actually the mediating flag in how completions are handled. There are two ways through ->task_done() depending on the state of this flag. If this flag is set, it means > > that libsas owns the task and ->task_done() may not free it. Conversely if the timeout fires it checks the flags and if SAS_TASK_STATE_DONE is set, it returns BLK_EH_HANDLED because presumably the mid-layer will soon see it. > > What I meant is the corner case where interrupt is fired and the > sas_task returned back at the monment but before sas_scsi_find_task() > got invoked and set SAS_TASK_STATE_DONE flag. As in > asd_task_tasklet_complete() routine, it will check if the > STATE_ABORTED is set, if so, it doesn't invoke task_done() routine. > Still returned to the strategy handler, sas_scsi_find_task() will try > to abort the task and find it has been finished with STATE_DONE set, > so it return TASK_IS_DONE and call sas_eh_finish_cmd(), where it unset > the TASK_ABORTED flag and call task_done(). Noticed that in > sas_eh_finish_cmd(), we call scsi_eh_finish_cmd() which will add the > cmd to sas_ha->eh_done(), and in the coming flush_done_q it will be > retried or finished again. I'm still not quite seeing what you think the problem is. Is it that task_done() calls scsi_cmnd->scsi_done(), in which case you think there's a double completion? That's fixed in the block layer: scsi_done() is blk_complete_request() which does nothing if the request has a completion set and the block layer sets the completion when the timeout fires. James James ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com>]
* Re: 答复: libsas error-handling completion issue. [not found] ` <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com> @ 2009-03-15 17:13 ` James Bottomley 0 siblings, 0 replies; 5+ messages in thread From: James Bottomley @ 2009-03-15 17:13 UTC (permalink / raw) To: Ying Chu; +Cc: linux-scsi, jeff, Tejun Heo On Sun, 2009-03-15 at 08:59 -0700, Ying Chu wrote: > Thanks, I'll figure out why my system hangs after 5 times retry. Just as a point of mailing list etiquette, this email had a text/html part ... vger is very intolerant of these, which is why it needs to be text/plain only ... there's probably an outlook setting (I think it's recipient doesn't want enriched mail) to fix this. Failing after five retries sounds significant: 5 is SD_MAX_RETRIES, so it sounds like something failed after the allowed number of retries was exhausted. James ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2009-03-15 17:13 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-14 6:44 libsas error-handling completion issue Ying Chu
2009-03-14 17:19 ` James Bottomley
2009-03-15 4:02 ` Ying Chu
2009-03-15 14:14 ` James Bottomley
[not found] ` <FE3F06125A99254E8D92161AA4569C6F029F0621@sc-exch02.marvell.com>
2009-03-15 17:13 ` 答复: " James Bottomley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox