* Deadlock in usb-storage error handling [not found] ` <53298181.9020206-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2014-03-19 20:31 ` Alan Stern [not found] ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2014-03-19 20:31 UTC (permalink / raw) To: Andreas Reis, James Bottomley; +Cc: USB list, SCSI development list On Wed, 19 Mar 2014, Andreas Reis wrote: > I've uploaded a dmesg with the new debugging patch to bugzilla: > https://bugzilla.kernel.org/attachment.cgi?id=130041 Thanks. I have now managed to reproduce many of the features of this problem on my own computer. James, I will need your help (or help from somebody who understands the SCSI error handler) to figure out how this problem should be fixed. Basically, usb-storage deadlocks when the SCSI error handler invokes the eh_device_reset_handler callback while a command is running. The command has timed out and will never complete normally, because the device's firmware has crashed. But usb-storage's device-reset routine waits for the current command to finish, which brings everything to a standstill. Is this design wrong? That is, should the device-reset routine wait for currently executing commands to finish, or should it abort them, or what? Or should the SCSI error handler abort the running command before invoking the eh_device_reset_handler callback? For the record, and in case anyone is curious, here's the detailed sequence of events during my test: sd issues a READ(10) command. For whatever reason, the device goes nuts and the command times out. scsi_times_out() calls scsi_abort_command(), which queues an abort request. scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which succeeds in aborting the READ. The READ command is retried (I didn't trace through the details of this). The retry fails with a Unit Attention (SK=6, ASC=0x29, Reset or Bus Device Reset Occurred). The READ command is retried a second time, and it times out again. This time around, scsi_times_out() calls scsi_abort_command() unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is still set). As a result, scsi_error_handler() calls scsi_unjam_host(), which calls scsi_eh_get_sense(). That routine calls scsi_request_sense(), which goes into scsi_send_eh_cmnd(). The calls to shost->hostt->queuecommand() all fail, because the READ command is still running and usb-storage has a queue depth of 1. The error messages produced by these failures are disconcerting but not dangerous. Since the REQUEST SENSE command was never issued, scsi_eh_get_sense() returns 0. scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which does essentially nothing because the SCSI_EH_CANCEL_CMD flag for the only command on work_q is clear. scsi_eh_test_devices() returns 0 because check_list is empty and work_q isn't. scsi_unjam_host() then calls scsi_eh_ready_devs(). This routine ends up calling scsi_eh_bus_device_reset(), at which point usb-storage deadlocks as described above. (On Andreas's system, the first READ retry times out as opposed to the second retry as on my computer. I doubt this makes any difference.) I can't tell if this is all working as intended or if it went off the tracks somewhere. Thanks for any guidance. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>]
* Re: Deadlock in usb-storage error handling [not found] ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> @ 2014-03-19 20:54 ` Dan Williams 2014-03-19 21:35 ` James Bottomley 1 sibling, 0 replies; 16+ messages in thread From: Dan Williams @ 2014-03-19 20:54 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, James Bottomley, USB list, SCSI development list On Wed, Mar 19, 2014 at 1:31 PM, Alan Stern <stern-nwvwT67g6+6dFdvTe/nMLpVzexx5G7lz@public.gmane.org> wrote: > On Wed, 19 Mar 2014, Andreas Reis wrote: > >> I've uploaded a dmesg with the new debugging patch to bugzilla: >> https://bugzilla.kernel.org/attachment.cgi?id=130041 > > Thanks. I have now managed to reproduce many of the features of this > problem on my own computer. > > James, I will need your help (or help from somebody who understands the > SCSI error handler) to figure out how this problem should be fixed. > > Basically, usb-storage deadlocks when the SCSI error handler invokes > the eh_device_reset_handler callback while a command is running. The > command has timed out and will never complete normally, because the > device's firmware has crashed. But usb-storage's device-reset routine > waits for the current command to finish, which brings everything to a > standstill. > > Is this design wrong? That is, should the device-reset routine wait > for currently executing commands to finish, or should it abort them, or > what? > > Or should the SCSI error handler abort the running command before > invoking the eh_device_reset_handler callback? > > For the record, and in case anyone is curious, here's the detailed > sequence of events during my test: > > sd issues a READ(10) command. For whatever reason, the device > goes nuts and the command times out. > > scsi_times_out() calls scsi_abort_command(), which queues an > abort request. > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which > succeeds in aborting the READ. > > The READ command is retried (I didn't trace through the details > of this). The retry fails with a Unit Attention (SK=6, > ASC=0x29, Reset or Bus Device Reset Occurred). > > The READ command is retried a second time, and it times out > again. > > This time around, scsi_times_out() calls scsi_abort_command() > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is > still set). > > As a result, scsi_error_handler() calls scsi_unjam_host(), > which calls scsi_eh_get_sense(). > > That routine calls scsi_request_sense(), which goes into > scsi_send_eh_cmnd(). > > The calls to shost->hostt->queuecommand() all fail, because the > READ command is still running and usb-storage has a queue > depth of 1. The error messages produced by these failures are > disconcerting but not dangerous. > > Since the REQUEST SENSE command was never issued, > scsi_eh_get_sense() returns 0. > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which > does essentially nothing because the SCSI_EH_CANCEL_CMD flag > for the only command on work_q is clear. > scsi_eh_test_devices() returns 0 because check_list is empty > and work_q isn't. > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This > routine ends up calling scsi_eh_bus_device_reset(), at which > point usb-storage deadlocks as described above. > > (On Andreas's system, the first READ retry times out as opposed to the > second retry as on my computer. I doubt this makes any difference.) > > I can't tell if this is all working as intended or if it went off the > tracks somewhere. > > Thanks for any guidance. > It seems to me we need an ->eh_strategy_handler() that understands the usb transport and will escalate to device reset around the time scsi_abort_command() fails. -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling [not found] ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> 2014-03-19 20:54 ` Dan Williams @ 2014-03-19 21:35 ` James Bottomley [not found] ` <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 16+ messages in thread From: James Bottomley @ 2014-03-19 21:35 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Wed, 2014-03-19 at 16:31 -0400, Alan Stern wrote: > On Wed, 19 Mar 2014, Andreas Reis wrote: > > > I've uploaded a dmesg with the new debugging patch to bugzilla: > > https://bugzilla.kernel.org/attachment.cgi?id=130041 > > Thanks. I have now managed to reproduce many of the features of this > problem on my own computer. > > James, I will need your help (or help from somebody who understands the > SCSI error handler) to figure out how this problem should be fixed. > > Basically, usb-storage deadlocks when the SCSI error handler invokes > the eh_device_reset_handler callback while a command is running. The > command has timed out and will never complete normally, because the > device's firmware has crashed. But usb-storage's device-reset routine > waits for the current command to finish, which brings everything to a > standstill. > > Is this design wrong? That is, should the device-reset routine wait > for currently executing commands to finish, or should it abort them, or > what? In some sense, yes, but not necessarily from the Point of View of USB. What we assume in SCSI is that commands are forgettable, meaning there's always some operation we can perform (be it abort or reset) that causes the device to forget about all outstanding commands and reset its internal state machine to a known good state. The cardinal SCSI assumption is that after we've successfully done an abort or reset on a command that it will never come back to us from the device. > Or should the SCSI error handler abort the running command before > invoking the eh_device_reset_handler callback? So this is rooted in the "Abort can be a Problem" issue: Abort sometimes works well (and it's not very disruptive) but sometimes if the device is having a problem in its command state machine, adding another command (which is what the abort is) doesn't actually do anything, so we need error escalation to reset. We can't wait for the abort or other commands to complete because they never will. The reset is expected to clear everything from the device (including the pending aborts). > For the record, and in case anyone is curious, here's the detailed > sequence of events during my test: > > sd issues a READ(10) command. For whatever reason, the device > goes nuts and the command times out. > > scsi_times_out() calls scsi_abort_command(), which queues an > abort request. > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which > succeeds in aborting the READ. > > The READ command is retried (I didn't trace through the details > of this). The retry fails with a Unit Attention (SK=6, > ASC=0x29, Reset or Bus Device Reset Occurred). > > The READ command is retried a second time, and it times out > again. > > This time around, scsi_times_out() calls scsi_abort_command() > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is > still set). >From the first time we sent the abort? That sounds like a problem in our state tracking. > As a result, scsi_error_handler() calls scsi_unjam_host(), > which calls scsi_eh_get_sense(). > > That routine calls scsi_request_sense(), which goes into > scsi_send_eh_cmnd(). I thought USB was autosense, so when it reports check condition, we should already have sense ... or are we calling request_sense without being sent a check condition status? > The calls to shost->hostt->queuecommand() all fail, because the > READ command is still running and usb-storage has a queue > depth of 1. The error messages produced by these failures are > disconcerting but not dangerous. > > Since the REQUEST SENSE command was never issued, > scsi_eh_get_sense() returns 0. > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which > does essentially nothing because the SCSI_EH_CANCEL_CMD flag > for the only command on work_q is clear. > scsi_eh_test_devices() returns 0 because check_list is empty > and work_q isn't. > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This > routine ends up calling scsi_eh_bus_device_reset(), at which > point usb-storage deadlocks as described above. OK, so in the case where the command can never complete (because the fw has crashed), what should be the process for resetting the device so it can again function? James > (On Andreas's system, the first READ retry times out as opposed to the > second retry as on my computer. I doubt this makes any difference.) > > I can't tell if this is all working as intended or if it went off the > tracks somewhere. > > Thanks for any guidance. > > Alan Stern > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>]
* Re: Deadlock in usb-storage error handling [not found] ` <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2014-03-20 15:36 ` Alan Stern 2014-03-20 16:09 ` James Bottomley 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2014-03-20 15:36 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Wed, 19 Mar 2014, James Bottomley wrote: > > Basically, usb-storage deadlocks when the SCSI error handler invokes > > the eh_device_reset_handler callback while a command is running. The > > command has timed out and will never complete normally, because the > > device's firmware has crashed. But usb-storage's device-reset routine > > waits for the current command to finish, which brings everything to a > > standstill. > > > > Is this design wrong? That is, should the device-reset routine wait > > for currently executing commands to finish, or should it abort them, or > > what? > > In some sense, yes, but not necessarily from the Point of View of USB. > What we assume in SCSI is that commands are forgettable, meaning there's > always some operation we can perform (be it abort or reset) that causes > the device to forget about all outstanding commands and reset its > internal state machine to a known good state. > > The cardinal SCSI assumption is that after we've successfully done an > abort or reset on a command that it will never come back to us from the > device. The same assumption is present in usb-storage. The difference appears to be that SCSI assumes a reset can be issued at any time, even while a command is running. In usb-storage, that's true for a _bus_ reset but it's not true for a _device_ reset. Perhaps this restriction on device resets could be lifted, but in real life it probably won't make much difference. The fact is, almost no USB mass-storage devices implement device reset correctly, whereas most of them do implement bus reset. (Possibly related factoid: MS Windows uses bus resets but doesn't use device resets in its USB mass-storage driver.) > > Or should the SCSI error handler abort the running command before > > invoking the eh_device_reset_handler callback? > > So this is rooted in the "Abort can be a Problem" issue: Abort > sometimes works well (and it's not very disruptive) but sometimes if the > device is having a problem in its command state machine, adding another > command (which is what the abort is) doesn't actually do anything, so we > need error escalation to reset. We can't wait for the abort or other > commands to complete because they never will. The reset is expected to > clear everything from the device (including the pending aborts). With usb-storage, aborts usually work pretty well. But sometimes they don't cure the underlying cause, so when the offending command is retried, it hangs up again. Something like that seems to be happening here. > > For the record, and in case anyone is curious, here's the detailed > > sequence of events during my test: > > > > sd issues a READ(10) command. For whatever reason, the device > > goes nuts and the command times out. > > > > scsi_times_out() calls scsi_abort_command(), which queues an > > abort request. > > > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which > > succeeds in aborting the READ. > > > > The READ command is retried (I didn't trace through the details > > of this). The retry fails with a Unit Attention (SK=6, > > ASC=0x29, Reset or Bus Device Reset Occurred). > > > > The READ command is retried a second time, and it times out > > again. > > > > This time around, scsi_times_out() calls scsi_abort_command() > > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is > > still set). > > From the first time we sent the abort? Yes. As far as I can see, the only place where SCSI_EH_ABORT_SCHEDULED gets cleared is in scsi_eh_finish_cmd(), which never got called. After the successful abort, scmd_eh_abort_handler() went directly into its "if (rtn == SUCCESS)" and "scmd %p retry aborted command" / scsi_queue_insert() code paths. > That sounds like a problem in > our state tracking. Could well be. How should I track it down further? Or can you suggest a patch just from this much information? > > As a result, scsi_error_handler() calls scsi_unjam_host(), > > which calls scsi_eh_get_sense(). > > > > That routine calls scsi_request_sense(), which goes into > > scsi_send_eh_cmnd(). > > I thought USB was autosense, so when it reports check condition, we > should already have sense ... or are we calling request_sense without > being sent a check condition status? usb-storage does autosense. As far as I can tell, the REQUEST SENSE is issued because there is no sense data, which is because the READ command is still running. Is this another state-tracking bug? > > The calls to shost->hostt->queuecommand() all fail, because the > > READ command is still running and usb-storage has a queue > > depth of 1. The error messages produced by these failures are > > disconcerting but not dangerous. > > > > Since the REQUEST SENSE command was never issued, > > scsi_eh_get_sense() returns 0. > > > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which > > does essentially nothing because the SCSI_EH_CANCEL_CMD flag > > for the only command on work_q is clear. > > scsi_eh_test_devices() returns 0 because check_list is empty > > and work_q isn't. > > > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This > > routine ends up calling scsi_eh_bus_device_reset(), at which > > point usb-storage deadlocks as described above. > > OK, so in the case where the command can never complete (because the fw > has crashed), what should be the process for resetting the device so it > can again function? Simply abort the command. usb-storage will automatically carry out a bus reset under those conditions, and that's the best we can do. As you can see, the sequence above started out doing exactly this. The problems began when the command was retried. That time it didn't get aborted, thanks to the leftover SCSI_EH_ABORT_SCHEDULED flag. Things got worse from there. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 15:36 ` Alan Stern @ 2014-03-20 16:09 ` James Bottomley 2014-03-20 16:34 ` Alan Stern [not found] ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 0 siblings, 2 replies; 16+ messages in thread From: James Bottomley @ 2014-03-20 16:09 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 2014-03-20 at 11:36 -0400, Alan Stern wrote: > On Wed, 19 Mar 2014, James Bottomley wrote: > > > > Basically, usb-storage deadlocks when the SCSI error handler invokes > > > the eh_device_reset_handler callback while a command is running. The > > > command has timed out and will never complete normally, because the > > > device's firmware has crashed. But usb-storage's device-reset routine > > > waits for the current command to finish, which brings everything to a > > > standstill. > > > > > > Is this design wrong? That is, should the device-reset routine wait > > > for currently executing commands to finish, or should it abort them, or > > > what? > > > > In some sense, yes, but not necessarily from the Point of View of USB. > > What we assume in SCSI is that commands are forgettable, meaning there's > > always some operation we can perform (be it abort or reset) that causes > > the device to forget about all outstanding commands and reset its > > internal state machine to a known good state. > > > > The cardinal SCSI assumption is that after we've successfully done an > > abort or reset on a command that it will never come back to us from the > > device. > > The same assumption is present in usb-storage. > > The difference appears to be that SCSI assumes a reset can be issued at > any time, even while a command is running. In usb-storage, that's true > for a _bus_ reset but it's not true for a _device_ reset. > > Perhaps this restriction on device resets could be lifted, but in real > life it probably won't make much difference. The fact is, almost no > USB mass-storage devices implement device reset correctly, whereas most > of them do implement bus reset. (Possibly related factoid: MS Windows > uses bus resets but doesn't use device resets in its USB mass-storage > driver.) Actually, you don't have to lift the restriction. Just fail the device reset if you can't issue it (so if there's any outstanding commands). SCSI will move on to a bus reset which you can accept. > > > Or should the SCSI error handler abort the running command before > > > invoking the eh_device_reset_handler callback? > > > > So this is rooted in the "Abort can be a Problem" issue: Abort > > sometimes works well (and it's not very disruptive) but sometimes if the > > device is having a problem in its command state machine, adding another > > command (which is what the abort is) doesn't actually do anything, so we > > need error escalation to reset. We can't wait for the abort or other > > commands to complete because they never will. The reset is expected to > > clear everything from the device (including the pending aborts). > > With usb-storage, aborts usually work pretty well. But sometimes they > don't cure the underlying cause, so when the offending command is > retried, it hangs up again. Something like that seems to be happening > here. > > > > For the record, and in case anyone is curious, here's the detailed > > > sequence of events during my test: > > > > > > sd issues a READ(10) command. For whatever reason, the device > > > goes nuts and the command times out. > > > > > > scsi_times_out() calls scsi_abort_command(), which queues an > > > abort request. > > > > > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which > > > succeeds in aborting the READ. > > > > > > The READ command is retried (I didn't trace through the details > > > of this). The retry fails with a Unit Attention (SK=6, > > > ASC=0x29, Reset or Bus Device Reset Occurred). > > > > > > The READ command is retried a second time, and it times out > > > again. > > > > > > This time around, scsi_times_out() calls scsi_abort_command() > > > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is > > > still set). > > > > From the first time we sent the abort? > > Yes. As far as I can see, the only place where SCSI_EH_ABORT_SCHEDULED > gets cleared is in scsi_eh_finish_cmd(), which never got called. > After the successful abort, scmd_eh_abort_handler() went directly into > its "if (rtn == SUCCESS)" and "scmd %p retry aborted command" / > scsi_queue_insert() code paths. OK, that's a bug ... I'll see if I can find it. > > That sounds like a problem in > > our state tracking. > > Could well be. How should I track it down further? Or can you suggest > a patch just from this much information? > > > > As a result, scsi_error_handler() calls scsi_unjam_host(), > > > which calls scsi_eh_get_sense(). > > > > > > That routine calls scsi_request_sense(), which goes into > > > scsi_send_eh_cmnd(). > > > > I thought USB was autosense, so when it reports check condition, we > > should already have sense ... or are we calling request_sense without > > being sent a check condition status? > > usb-storage does autosense. As far as I can tell, the REQUEST SENSE is > issued because there is no sense data, which is because the READ > command is still running. Is this another state-tracking bug? Yes ... we should only issue a request sense if we got a check condition return. If we're doing it without that, something is wrong somewhere. > > > The calls to shost->hostt->queuecommand() all fail, because the > > > READ command is still running and usb-storage has a queue > > > depth of 1. The error messages produced by these failures are > > > disconcerting but not dangerous. > > > > > > Since the REQUEST SENSE command was never issued, > > > scsi_eh_get_sense() returns 0. > > > > > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which > > > does essentially nothing because the SCSI_EH_CANCEL_CMD flag > > > for the only command on work_q is clear. > > > scsi_eh_test_devices() returns 0 because check_list is empty > > > and work_q isn't. > > > > > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This > > > routine ends up calling scsi_eh_bus_device_reset(), at which > > > point usb-storage deadlocks as described above. > > > > OK, so in the case where the command can never complete (because the fw > > has crashed), what should be the process for resetting the device so it > > can again function? > > Simply abort the command. usb-storage will automatically carry out a > bus reset under those conditions, and that's the best we can do. > > As you can see, the sequence above started out doing exactly this. > The problems began when the command was retried. That time it didn't > get aborted, thanks to the leftover SCSI_EH_ABORT_SCHEDULED flag. > Things got worse from there. OK, so I think we have three things to do 1. Investigate SCSI and fix it's abort state problem that's causing it not to send the abort second time around 2. Fix usb-storage to fail a reset it can't do (i.e. device reset with outstanding commands) 3. Find out why we're sending a spurious request sense. I can look at 1 and 3 if you want to take 2. James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 16:09 ` James Bottomley @ 2014-03-20 16:34 ` Alan Stern [not found] ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> 2014-03-20 18:22 ` James Bottomley [not found] ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 1 sibling, 2 replies; 16+ messages in thread From: Alan Stern @ 2014-03-20 16:34 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 20 Mar 2014, James Bottomley wrote: > OK, so I think we have three things to do > > 1. Investigate SCSI and fix it's abort state problem that's causing > it not to send the abort second time around > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > with outstanding commands) > 3. Find out why we're sending a spurious request sense. > > I can look at 1 and 3 if you want to take 2. It's a deal! Thanks for your help. Alan Stern ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>]
* Re: Deadlock in usb-storage error handling [not found] ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> @ 2014-03-20 17:57 ` James Bottomley 2014-03-20 19:49 ` Alan Stern 0 siblings, 1 reply; 16+ messages in thread From: James Bottomley @ 2014-03-20 17:57 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote: > On Thu, 20 Mar 2014, James Bottomley wrote: > > > OK, so I think we have three things to do > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > it not to send the abort second time around > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > with outstanding commands) > > 3. Find out why we're sending a spurious request sense. > > > > I can look at 1 and 3 if you want to take 2. > > It's a deal! Thanks for your help. OK, I think this is the fix for 1, if you could try it out. Thanks, James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..c52bfb2 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -145,14 +145,14 @@ scmd_eh_abort_handler(struct work_struct *work) "scmd %p retry " "aborted command\n", scmd)); scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY); - return; + goto out; } else { SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_WARNING, scmd, "scmd %p finish " "aborted command\n", scmd)); scsi_finish_command(scmd); - return; + goto out; } } else { SCSI_LOG_ERROR_RECOVERY(3, @@ -162,6 +162,8 @@ scmd_eh_abort_handler(struct work_struct *work) } } + scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED; + if (!scsi_eh_scmd_add(scmd, 0)) { SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_WARNING, scmd, @@ -170,6 +172,10 @@ scmd_eh_abort_handler(struct work_struct *work) scmd->result |= DID_TIME_OUT << 16; scsi_finish_command(scmd); } + return; + out: + scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED; + return; } /** -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 17:57 ` James Bottomley @ 2014-03-20 19:49 ` Alan Stern 2014-03-28 16:52 ` James Bottomley 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2014-03-20 19:49 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 20 Mar 2014, James Bottomley wrote: > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote: > > On Thu, 20 Mar 2014, James Bottomley wrote: > > > > > OK, so I think we have three things to do > > > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > > it not to send the abort second time around > > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > > with outstanding commands) > > > 3. Find out why we're sending a spurious request sense. > > > > > > I can look at 1 and 3 if you want to take 2. > > > > It's a deal! Thanks for your help. > > OK, I think this is the fix for 1, if you could try it out. > > Thanks, > > James > > --- > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 771c16b..c52bfb2 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -145,14 +145,14 @@ scmd_eh_abort_handler(struct work_struct *work) > "scmd %p retry " > "aborted command\n", scmd)); > scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY); > - return; > + goto out; > } else { > SCSI_LOG_ERROR_RECOVERY(3, > scmd_printk(KERN_WARNING, scmd, > "scmd %p finish " > "aborted command\n", scmd)); > scsi_finish_command(scmd); > - return; > + goto out; > } > } else { > SCSI_LOG_ERROR_RECOVERY(3, > @@ -162,6 +162,8 @@ scmd_eh_abort_handler(struct work_struct *work) > } > } > > + scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED; > + > if (!scsi_eh_scmd_add(scmd, 0)) { > SCSI_LOG_ERROR_RECOVERY(3, > scmd_printk(KERN_WARNING, scmd, > @@ -170,6 +172,10 @@ scmd_eh_abort_handler(struct work_struct *work) > scmd->result |= DID_TIME_OUT << 16; > scsi_finish_command(scmd); > } > + return; > + out: > + scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED; > + return; > } > > /** This worked the first time. :-) But I wonder, is it safe to access scmd after calling scsi_finish_command()? Alan Stern ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 19:49 ` Alan Stern @ 2014-03-28 16:52 ` James Bottomley 0 siblings, 0 replies; 16+ messages in thread From: James Bottomley @ 2014-03-28 16:52 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 2014-03-20 at 15:49 -0400, Alan Stern wrote: > On Thu, 20 Mar 2014, James Bottomley wrote: > > > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote: > > > On Thu, 20 Mar 2014, James Bottomley wrote: > > > > > > > OK, so I think we have three things to do > > > > > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > > > it not to send the abort second time around > > > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > > > with outstanding commands) > > > > 3. Find out why we're sending a spurious request sense. > > > > > > > > I can look at 1 and 3 if you want to take 2. > > > > > > It's a deal! Thanks for your help. > > > > OK, I think this is the fix for 1, if you could try it out. > > > > Thanks, > > > > James > > > > --- > > > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > > index 771c16b..c52bfb2 100644 > > --- a/drivers/scsi/scsi_error.c > > +++ b/drivers/scsi/scsi_error.c > > @@ -145,14 +145,14 @@ scmd_eh_abort_handler(struct work_struct *work) > > "scmd %p retry " > > "aborted command\n", scmd)); > > scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY); > > - return; > > + goto out; > > } else { > > SCSI_LOG_ERROR_RECOVERY(3, > > scmd_printk(KERN_WARNING, scmd, > > "scmd %p finish " > > "aborted command\n", scmd)); > > scsi_finish_command(scmd); > > - return; > > + goto out; > > } > > } else { > > SCSI_LOG_ERROR_RECOVERY(3, > > @@ -162,6 +162,8 @@ scmd_eh_abort_handler(struct work_struct *work) > > } > > } > > > > + scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED; > > + > > if (!scsi_eh_scmd_add(scmd, 0)) { > > SCSI_LOG_ERROR_RECOVERY(3, > > scmd_printk(KERN_WARNING, scmd, > > @@ -170,6 +172,10 @@ scmd_eh_abort_handler(struct work_struct *work) > > scmd->result |= DID_TIME_OUT << 16; > > scsi_finish_command(scmd); > > } > > + return; > > + out: > > + scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED; > > + return; > > } > > > > /** > > This worked the first time. :-) > > But I wonder, is it safe to access scmd after calling > scsi_finish_command()? Agree, I've redone the patch integrated into the try_to_abort call instead. Will post shortly. James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 16:34 ` Alan Stern [not found] ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> @ 2014-03-20 18:22 ` James Bottomley 2014-03-20 19:48 ` Alan Stern 1 sibling, 1 reply; 16+ messages in thread From: James Bottomley @ 2014-03-20 18:22 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote: > On Thu, 20 Mar 2014, James Bottomley wrote: > > > OK, so I think we have three things to do > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > it not to send the abort second time around > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > with outstanding commands) > > 3. Find out why we're sending a spurious request sense. > > > > I can look at 1 and 3 if you want to take 2. > > It's a deal! Thanks for your help. And this looks to be 3: a bug in the way we attach sense data to commands (we shouldn't look for attached sense if the device error code didn't imply there would be any). James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..d020149 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q, __func__)); break; } + if (status_byte(scmd->result) != CHECK_CONDITION) + /* + * don't request sense if there's no check condition + * status because the error we're processing isn't one + * that has a sense code (and some devices get + * confused by sense requests out of the blue) + */ + continue; + SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd, "%s: requesting sense\n", current->comm)); ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 18:22 ` James Bottomley @ 2014-03-20 19:48 ` Alan Stern [not found] ` <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2014-03-20 19:48 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 20 Mar 2014, James Bottomley wrote: > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote: > > On Thu, 20 Mar 2014, James Bottomley wrote: > > > > > OK, so I think we have three things to do > > > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > > it not to send the abort second time around > > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > > with outstanding commands) > > > 3. Find out why we're sending a spurious request sense. > > > > > > I can look at 1 and 3 if you want to take 2. > > > > It's a deal! Thanks for your help. > > And this looks to be 3: a bug in the way we attach sense data to > commands (we shouldn't look for attached sense if the device error code > didn't imply there would be any). > > James > > --- > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 771c16b..d020149 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q, > __func__)); > break; > } > + if (status_byte(scmd->result) != CHECK_CONDITION) > + /* > + * don't request sense if there's no check condition > + * status because the error we're processing isn't one > + * that has a sense code (and some devices get > + * confused by sense requests out of the blue) > + */ > + continue; > + > SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd, > "%s: requesting sense\n", > current->comm)); I tried this patch first, because fixing the earlier bug would mask this one. The patch sort of worked. But the first time I tried it, it failed in a rather amusing way. While the second retry was running and hung, scmd->result _was_ equal to CHECK_CONDITION -- because that was the result from the _first_ retry, and it had never gotten cleared! scmd->result needs to be set to 0 before the queuecommand callback is invoked. I ended up adding this to your patch, and then it worked perfectly: Index: usb-3.14/drivers/scsi/scsi_error.c =================================================================== --- usb-3.14.orig/drivers/scsi/scsi_error.c +++ usb-3.14/drivers/scsi/scsi_error.c @@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd memset(scmd->cmnd, 0, BLK_MAX_CDB); memset(&scmd->sdb, 0, sizeof(scmd->sdb)); scmd->request->next_rq = NULL; + scmd->result = 0; if (sense_bytes) { scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE, Index: usb-3.14/drivers/scsi/scsi_lib.c =================================================================== --- usb-3.14.orig/drivers/scsi/scsi_lib.c +++ usb-3.14/drivers/scsi/scsi_lib.c @@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s * lock such that the kblockd_schedule_work() call happens * before blk_cleanup_queue() finishes. */ + cmd->result = 0; spin_lock_irqsave(q->queue_lock, flags); blk_requeue_request(q, cmd->request); kblockd_schedule_work(q, &device->requeue_work); Maybe only the second one is necessary, but it seemed best to be consistent. Alan Stern ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>]
* Re: Deadlock in usb-storage error handling [not found] ` <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org> @ 2014-03-20 20:26 ` James Bottomley [not found] ` <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 16+ messages in thread From: James Bottomley @ 2014-03-20 20:26 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 2014-03-20 at 15:48 -0400, Alan Stern wrote: > On Thu, 20 Mar 2014, James Bottomley wrote: > > > On Thu, 2014-03-20 at 12:34 -0400, Alan Stern wrote: > > > On Thu, 20 Mar 2014, James Bottomley wrote: > > > > > > > OK, so I think we have three things to do > > > > > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > > > it not to send the abort second time around > > > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > > > with outstanding commands) > > > > 3. Find out why we're sending a spurious request sense. > > > > > > > > I can look at 1 and 3 if you want to take 2. > > > > > > It's a deal! Thanks for your help. > > > > And this looks to be 3: a bug in the way we attach sense data to > > commands (we shouldn't look for attached sense if the device error code > > didn't imply there would be any). > > > > James > > > > --- > > > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > > index 771c16b..d020149 100644 > > --- a/drivers/scsi/scsi_error.c > > +++ b/drivers/scsi/scsi_error.c > > @@ -1157,6 +1157,15 @@ int scsi_eh_get_sense(struct list_head *work_q, > > __func__)); > > break; > > } > > + if (status_byte(scmd->result) != CHECK_CONDITION) > > + /* > > + * don't request sense if there's no check condition > > + * status because the error we're processing isn't one > > + * that has a sense code (and some devices get > > + * confused by sense requests out of the blue) > > + */ > > + continue; > > + > > SCSI_LOG_ERROR_RECOVERY(2, scmd_printk(KERN_INFO, scmd, > > "%s: requesting sense\n", > > current->comm)); > > I tried this patch first, because fixing the earlier bug would mask > this one. > > The patch sort of worked. But the first time I tried it, it failed in > a rather amusing way. While the second retry was running and hung, > scmd->result _was_ equal to CHECK_CONDITION -- because that was the > result from the _first_ retry, and it had never gotten cleared! > > scmd->result needs to be set to 0 before the queuecommand callback is > invoked. I ended up adding this to your patch, and then it worked > perfectly: Wow, the stale data bugs are just crawling out of the code. Thanks for checking. > > Index: usb-3.14/drivers/scsi/scsi_error.c > =================================================================== > --- usb-3.14.orig/drivers/scsi/scsi_error.c > +++ usb-3.14/drivers/scsi/scsi_error.c > @@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd > memset(scmd->cmnd, 0, BLK_MAX_CDB); > memset(&scmd->sdb, 0, sizeof(scmd->sdb)); > scmd->request->next_rq = NULL; > + scmd->result = 0; > > if (sense_bytes) { > scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE, > Index: usb-3.14/drivers/scsi/scsi_lib.c > =================================================================== > --- usb-3.14.orig/drivers/scsi/scsi_lib.c > +++ usb-3.14/drivers/scsi/scsi_lib.c > @@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s > * lock such that the kblockd_schedule_work() call happens > * before blk_cleanup_queue() finishes. > */ > + cmd->result = 0; > spin_lock_irqsave(q->queue_lock, flags); > blk_requeue_request(q, cmd->request); > kblockd_schedule_work(q, &device->requeue_work); > > > Maybe only the second one is necessary, but it seemed best to be > consistent. Thanks, I'll add this one to the list as well and see if we can get it into the merge window. I take it you'd like a cc to stable on these three? James -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>]
* Re: Deadlock in usb-storage error handling [not found] ` <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2014-03-20 20:42 ` Alan Stern 0 siblings, 0 replies; 16+ messages in thread From: Alan Stern @ 2014-03-20 20:42 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 20 Mar 2014, James Bottomley wrote: > > I tried this patch first, because fixing the earlier bug would mask > > this one. > > > > The patch sort of worked. But the first time I tried it, it failed in > > a rather amusing way. While the second retry was running and hung, > > scmd->result _was_ equal to CHECK_CONDITION -- because that was the > > result from the _first_ retry, and it had never gotten cleared! > > > > scmd->result needs to be set to 0 before the queuecommand callback is > > invoked. I ended up adding this to your patch, and then it worked > > perfectly: > > Wow, the stale data bugs are just crawling out of the code. Thanks for > checking. > > > > > Index: usb-3.14/drivers/scsi/scsi_error.c > > =================================================================== > > --- usb-3.14.orig/drivers/scsi/scsi_error.c > > +++ usb-3.14/drivers/scsi/scsi_error.c > > @@ -924,6 +924,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd > > memset(scmd->cmnd, 0, BLK_MAX_CDB); > > memset(&scmd->sdb, 0, sizeof(scmd->sdb)); > > scmd->request->next_rq = NULL; > > + scmd->result = 0; > > > > if (sense_bytes) { > > scmd->sdb.length = min_t(unsigned, SCSI_SENSE_BUFFERSIZE, > > Index: usb-3.14/drivers/scsi/scsi_lib.c > > =================================================================== > > --- usb-3.14.orig/drivers/scsi/scsi_lib.c > > +++ usb-3.14/drivers/scsi/scsi_lib.c > > @@ -159,6 +159,7 @@ static void __scsi_queue_insert(struct s > > * lock such that the kblockd_schedule_work() call happens > > * before blk_cleanup_queue() finishes. > > */ > > + cmd->result = 0; > > spin_lock_irqsave(q->queue_lock, flags); > > blk_requeue_request(q, cmd->request); > > kblockd_schedule_work(q, &device->requeue_work); > > > > > > Maybe only the second one is necessary, but it seemed best to be > > consistent. > > Thanks, I'll add this one to the list as well and see if we can get it > into the merge window. I take it you'd like a cc to stable on these > three? Yes, please. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
[parent not found: <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>]
* Re: Deadlock in usb-storage error handling [not found] ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2014-03-20 19:59 ` Alan Stern 2014-03-20 20:27 ` James Bottomley 0 siblings, 1 reply; 16+ messages in thread From: Alan Stern @ 2014-03-20 19:59 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 20 Mar 2014, James Bottomley wrote: > OK, so I think we have three things to do > > 1. Investigate SCSI and fix it's abort state problem that's causing > it not to send the abort second time around > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > with outstanding commands) > 3. Find out why we're sending a spurious request sense. > > I can look at 1 and 3 if you want to take 2. I wrote a patch for 2. It turned out not to help much, because I was wrong -- while a command is running, usb-storage won't do a bus reset either. The lock occurs in a separate part of the code, where I wasn't looking earlier. When I tested the patch, the device reset failed immediately (as desired) and then the attempted bus reset hung. Overall, not much of an improvement... In fact, this restriction on bus resets is important and necessary. USB devices can be composite, meaning they can contain several functions all packed into a single device. If a driver for one of the other functions needs to perform a bus reset, we don't want it to interrupt an ongoing mass-storage command. Thus, it is important for the reset routine to wait for an outstanding command to complete. Anyway, this looks to be a moot point. With your fix for 1, neither sort of reset was necessary. Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 19:59 ` Alan Stern @ 2014-03-20 20:27 ` James Bottomley 2014-03-20 20:43 ` Alan Stern 0 siblings, 1 reply; 16+ messages in thread From: James Bottomley @ 2014-03-20 20:27 UTC (permalink / raw) To: Alan Stern; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 2014-03-20 at 15:59 -0400, Alan Stern wrote: > On Thu, 20 Mar 2014, James Bottomley wrote: > > > OK, so I think we have three things to do > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > it not to send the abort second time around > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > with outstanding commands) > > 3. Find out why we're sending a spurious request sense. > > > > I can look at 1 and 3 if you want to take 2. > > I wrote a patch for 2. It turned out not to help much, because I was > wrong -- while a command is running, usb-storage won't do a bus reset > either. The lock occurs in a separate part of the code, where I wasn't > looking earlier. When I tested the patch, the device reset failed > immediately (as desired) and then the attempted bus reset hung. > Overall, not much of an improvement... Hm, yes, sorry, after the bus reset fails, we'll just take the device offline. > In fact, this restriction on bus resets is important and necessary. > USB devices can be composite, meaning they can contain several > functions all packed into a single device. If a driver for one of the > other functions needs to perform a bus reset, we don't want it to > interrupt an ongoing mass-storage command. Thus, it is important for > the reset routine to wait for an outstanding command to complete. > > Anyway, this looks to be a moot point. With your fix for 1, neither > sort of reset was necessary. Well as long as the tiny hammer works ... but there are going to be devices that need the bigger one one day. James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Deadlock in usb-storage error handling 2014-03-20 20:27 ` James Bottomley @ 2014-03-20 20:43 ` Alan Stern 0 siblings, 0 replies; 16+ messages in thread From: Alan Stern @ 2014-03-20 20:43 UTC (permalink / raw) To: James Bottomley; +Cc: Andreas Reis, USB list, SCSI development list On Thu, 20 Mar 2014, James Bottomley wrote: > On Thu, 2014-03-20 at 15:59 -0400, Alan Stern wrote: > > On Thu, 20 Mar 2014, James Bottomley wrote: > > > > > OK, so I think we have three things to do > > > > > > 1. Investigate SCSI and fix it's abort state problem that's causing > > > it not to send the abort second time around > > > 2. Fix usb-storage to fail a reset it can't do (i.e. device reset > > > with outstanding commands) > > > 3. Find out why we're sending a spurious request sense. > > > > > > I can look at 1 and 3 if you want to take 2. > > > > I wrote a patch for 2. It turned out not to help much, because I was > > wrong -- while a command is running, usb-storage won't do a bus reset > > either. The lock occurs in a separate part of the code, where I wasn't > > looking earlier. When I tested the patch, the device reset failed > > immediately (as desired) and then the attempted bus reset hung. > > Overall, not much of an improvement... > > Hm, yes, sorry, after the bus reset fails, we'll just take the device > offline. > > > In fact, this restriction on bus resets is important and necessary. > > USB devices can be composite, meaning they can contain several > > functions all packed into a single device. If a driver for one of the > > other functions needs to perform a bus reset, we don't want it to > > interrupt an ongoing mass-storage command. Thus, it is important for > > the reset routine to wait for an outstanding command to complete. > > > > Anyway, this looks to be a moot point. With your fix for 1, neither > > sort of reset was necessary. > > Well as long as the tiny hammer works ... but there are going to be > devices that need the bigger one one day. Maybe so. I'll worry about them when they appear. :-) Alan Stern ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2014-03-28 16:52 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <53298181.9020206@gmail.com>
[not found] ` <53298181.9020206-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2014-03-19 20:31 ` Deadlock in usb-storage error handling Alan Stern
[not found] ` <Pine.LNX.4.44L0.1403191449290.887-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-03-19 20:54 ` Dan Williams
2014-03-19 21:35 ` James Bottomley
[not found] ` <1395264905.2185.55.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-03-20 15:36 ` Alan Stern
2014-03-20 16:09 ` James Bottomley
2014-03-20 16:34 ` Alan Stern
[not found] ` <Pine.LNX.4.44L0.1403201233150.858-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-03-20 17:57 ` James Bottomley
2014-03-20 19:49 ` Alan Stern
2014-03-28 16:52 ` James Bottomley
2014-03-20 18:22 ` James Bottomley
2014-03-20 19:48 ` Alan Stern
[not found] ` <Pine.LNX.4.44L0.1403201518130.863-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2014-03-20 20:26 ` James Bottomley
[not found] ` <1395347219.2244.47.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-03-20 20:42 ` Alan Stern
[not found] ` <1395331764.2244.16.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2014-03-20 19:59 ` Alan Stern
2014-03-20 20:27 ` James Bottomley
2014-03-20 20:43 ` Alan Stern
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox