* How are halted endpoints supposed to be handled in Linux? @ 2024-11-20 23:11 Michał Pecio 2024-11-21 0:02 ` Thinh Nguyen 0 siblings, 1 reply; 11+ messages in thread From: Michał Pecio @ 2024-11-20 23:11 UTC (permalink / raw) To: linux-usb; +Cc: Mathias Nyman Hi, I have been wondering about it after seeing how it's done in xhci_hcd, which looks wrong to me. USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL handshake or "transmission error" should cause both the device and host endpoints to be reset. I presume "transmission error" means any error detected by the HC which causes it to halt, various examples exist. USB 3.0 just refers to USB 2.0. Linux appears to ignore this part and only reset on STALL handshake, as advised in Documentation/driver-api/usb/error-codes.rst and practiced by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO. This wouldn't necessarily be bad in itself, but: On the HCD side, xHCI will: - give back the current URB with -EPROTO/-EPIPE status - reset the host side endpoint, clearing its toggle state - point the HC at the next URB if one exist - restart the endpoint without waiting for hcd->endpoint_reset() - ignore one subsequent call to hcd->endpoint_reset() For STALL, I think it's a little awkward, but acceptable. The ultimate result appears to be that all pending URBs are given back with -EPIPE and things start moving again after usb_clear_halt(). But if the device isn't stalled, the next URB may execute right away if the failure was transient. This makes it impossible to ensure in-order delivery on bulk OUT pipes, because one URB is skipped and the driver has no reliable way to retry it before some later ones may get executed. This behavior also creates an opportunity for toggle mismatch, and as far as I understand, the hardware will resolve it by silently dropping one packet. Another could be dropped if usb_clear_halt() were called. Either I'm missing something, or it seems quite broken? I wonder what other HCDs are doing in this case, and what's the idea behind it all? Regards, Michal ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-20 23:11 How are halted endpoints supposed to be handled in Linux? Michał Pecio @ 2024-11-21 0:02 ` Thinh Nguyen 2024-11-21 2:31 ` Alan Stern 0 siblings, 1 reply; 11+ messages in thread From: Thinh Nguyen @ 2024-11-21 0:02 UTC (permalink / raw) To: Michał Pecio; +Cc: linux-usb@vger.kernel.org, Mathias Nyman, Alan Stern ++Alan Hi, On Thu, Nov 21, 2024, Michał Pecio wrote: > Hi, > > I have been wondering about it after seeing how it's done in xhci_hcd, > which looks wrong to me. > > USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL > handshake or "transmission error" should cause both the device and host > endpoints to be reset. I presume "transmission error" means any error > detected by the HC which causes it to halt, various examples exist. > > USB 3.0 just refers to USB 2.0. > > Linux appears to ignore this part and only reset on STALL handshake, as > advised in Documentation/driver-api/usb/error-codes.rst and practiced > by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO. > This wouldn't necessarily be bad in itself, but: > > On the HCD side, xHCI will: > - give back the current URB with -EPROTO/-EPIPE status > - reset the host side endpoint, clearing its toggle state > - point the HC at the next URB if one exist > - restart the endpoint without waiting for hcd->endpoint_reset() > - ignore one subsequent call to hcd->endpoint_reset() > > For STALL, I think it's a little awkward, but acceptable. The ultimate > result appears to be that all pending URBs are given back with -EPIPE > and things start moving again after usb_clear_halt(). > > But if the device isn't stalled, the next URB may execute right away if > the failure was transient. This makes it impossible to ensure in-order > delivery on bulk OUT pipes, because one URB is skipped and the driver > has no reliable way to retry it before some later ones may get executed. The class driver will know of this error, and the retry/recovery should be done at the class driver base on this error and not from the HCD. > > This behavior also creates an opportunity for toggle mismatch, and as > far as I understand, the hardware will resolve it by silently dropping > one packet. Another could be dropped if usb_clear_halt() were called. > > > Either I'm missing something, or it seems quite broken? > > I wonder what other HCDs are doing in this case, and what's the idea > behind it all? > > Regards, > Michal > This long discussion may be relevant to your query: https://lore.kernel.org/linux-usb/ba06679f-93d2-4cb4-9218-9e288065bdfb@rowland.harvard.edu/#t The concensus is that these error recovery should be handled by the class drivers. How they should be handled may be different between different OSes. If you follow the thread discussion, it may not necessarily a good idea to use usb_clear_halt() to recover. BR, Thinh ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-21 0:02 ` Thinh Nguyen @ 2024-11-21 2:31 ` Alan Stern 2024-11-21 10:26 ` Michał Pecio 0 siblings, 1 reply; 11+ messages in thread From: Alan Stern @ 2024-11-21 2:31 UTC (permalink / raw) To: Thinh Nguyen; +Cc: Michał Pecio, linux-usb@vger.kernel.org, Mathias Nyman On Thu, Nov 21, 2024 at 12:02:19AM +0000, Thinh Nguyen wrote: > ++Alan > > Hi, > > On Thu, Nov 21, 2024, Michał Pecio wrote: > > Hi, > > > > I have been wondering about it after seeing how it's done in xhci_hcd, > > which looks wrong to me. > > > > USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL > > handshake or "transmission error" should cause both the device and host > > endpoints to be reset. I presume "transmission error" means any error > > detected by the HC which causes it to halt, various examples exist. The spec is unfortunately vague about what a halt condition is, although it apparently intends the various sorts of transmission errors to count. > > USB 3.0 just refers to USB 2.0. > > > > Linux appears to ignore this part and only reset on STALL handshake, as > > advised in Documentation/driver-api/usb/error-codes.rst and practiced > > by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO. This is generally a weakness in the drivers. It would be nice if the class specifications said what to do in these situations, but most of them don't. > > This wouldn't necessarily be bad in itself, but: > > > > On the HCD side, xHCI will: > > - give back the current URB with -EPROTO/-EPIPE status > > - reset the host side endpoint, clearing its toggle state > > - point the HC at the next URB if one exist > > - restart the endpoint without waiting for hcd->endpoint_reset() > > - ignore one subsequent call to hcd->endpoint_reset() This behavior is not correct. See the kerneldoc for usb_unlink_urb() in drivers/usb/core/urb.c, especially the section labelled "Unlinking and Endpoint Queues". In general, the only safe thing for class drivers to do when they get one of these errors on a bulk or interrupt endpoint is to unlink all the pending URBs for the endpoint and then call usb_clear_halt() when they have all completed. I know that usb-storage does this; I suspect that not many other drivers do. I suppose the USB core could take care of this automatically so that neither the class drivers nor the HCDs would have to worry about it. If everyone agrees that this wouldn't lead to other problems, it could be implemented without too much difficulty. However, as you point out, in general there's no way for the host to know whether the device accepted the failed transaction. This means that proper recovery requires knowledge of the higher-level protocol, which only the class driver is aware of. > > For STALL, I think it's a little awkward, but acceptable. The ultimate > > result appears to be that all pending URBs are given back with -EPIPE > > and things start moving again after usb_clear_halt(). > > > > But if the device isn't stalled, the next URB may execute right away if > > the failure was transient. This makes it impossible to ensure in-order > > delivery on bulk OUT pipes, because one URB is skipped and the driver > > has no reliable way to retry it before some later ones may get executed. Just so. > The class driver will know of this error, and the retry/recovery should > be done at the class driver base on this error and not from the HCD. Agreed. The minimum requirement is that the endpoint does not restart until all the pending URBs have completed, whether through explicit or automatic unlinks. > > This behavior also creates an opportunity for toggle mismatch, and as > > far as I understand, the hardware will resolve it by silently dropping > > one packet. Another could be dropped if usb_clear_halt() were called. > > > > > > Either I'm missing something, or it seems quite broken? > > > > I wonder what other HCDs are doing in this case, and what's the idea > > behind it all? As far as I know, they don't automatically send Clear-Halt requests to the device or automatically unlink anything. Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-21 2:31 ` Alan Stern @ 2024-11-21 10:26 ` Michał Pecio 2024-11-21 14:08 ` Mathias Nyman 2024-11-21 15:06 ` Alan Stern 0 siblings, 2 replies; 11+ messages in thread From: Michał Pecio @ 2024-11-21 10:26 UTC (permalink / raw) To: Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman Hi Alan, Thank you for taking the time to answer. I'm beginning to regret not asking this question earlier. On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote: > > > Linux appears to ignore this part and only reset on STALL > > > handshake, as advised in > > > Documentation/driver-api/usb/error-codes.rst and practiced by > > > drivers - they don't seem to bother with usb_clear_halt() on > > > -EPROTO. > > This is generally a weakness in the drivers. It would be nice if the > class specifications said what to do in these situations, but most of > them don't. There is also proprietary hardware with no specification at all. > > > On the HCD side, xHCI will: > > > - give back the current URB with -EPROTO/-EPIPE status > > > - reset the host side endpoint, clearing its toggle state > > > - point the HC at the next URB if one exist > > > - restart the endpoint without waiting for hcd->endpoint_reset() > > > - ignore one subsequent call to hcd->endpoint_reset() > > This behavior is not correct. See the kerneldoc for > usb_unlink_urb() in drivers/usb/core/urb.c, especially the section > labelled "Unlinking and Endpoint Queues". OK, let's go through it. * But when an URB terminates with an * error its queue generally stops (see below), at least until that URB's * completion routine returns. I don't see this working after xHCI adopted bottom half giveback, which is asynchronous. As you are the maintainer of EHCI, which also uses BH, how is EHCI dealing with it? One way I see with existing APIs is to wait until the driver submits a new URB, but are drivers prepared for this? Is EHCI doing the same? * It is guaranteed that a stopped queue * will not restart until all its unlinked URBs have been fully retired, * with their completion routines run I think xHCI can currently guarantee that nothing is restarted until any URB unlinked for any reason is given to the BH worker, but that's all we have, and I just broke it in usb-next: + /* Try to restart the endpoint if all is done */ + ring_doorbell_for_active_rings(xhci, slot_id, ep_index); + /* Start giving back any TDs invalidated above */ + xhci_giveback_invalidated_tds(ep); This is part of a legitimate bugfix patch tagged for stable. I should have insisted on keeping it a separate change, but it seemed a good idea at the time which would soon get implemented anyway... Maybe no more? * even if that's not until some time * after the original completion handler returns. Not entirely sure what this means. * The same behavior and * guarantee apply when an URB terminates because it was unlinked. Same caveat about BH asynchronicity in xHCI. > In general, the only safe thing for class drivers to do when they get > one of these errors on a bulk or interrupt endpoint is to unlink all > the pending URBs for the endpoint and then call usb_clear_halt() when > they have all completed. I know that usb-storage does this; I > suspect that not many other drivers do. Your suspicion isn't wrong AFAIK. One more thing which is safe at least wrt data corruption is to simply retry the same URB without resetting anything. But if an HCD wants to do it, existing API gives no way to notify the driver and allow it to opt out and handle things differently, so it mustn't retry forever. > I suppose the USB core could take care of this automatically so that > neither the class drivers nor the HCDs would have to worry about it. > If everyone agrees that this wouldn't lead to other problems, it > could be implemented without too much difficulty. This still appears to run into the double delivery problem that you described in the discussion linked by Thinh Nguyen, particularly in case of dodgy drivers for dodgy hardware. Considering that, I'm not sure if automatically resetting anything on -EPROTO is a good idea. > > > I wonder what other HCDs are doing in this case, and what's the > > > idea behind it all? > > As far as I know, they don't automatically send Clear-Halt requests > to the device or automatically unlink anything. That's what it looks like, grepping through drivers/usb/host. But my question was mainly about -EPROTO. When should an HCD restart a halted endpoint? Should it clear the sequence state? (probably not). Regards, Michal ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-21 10:26 ` Michał Pecio @ 2024-11-21 14:08 ` Mathias Nyman 2024-11-22 11:54 ` Michał Pecio 2024-11-21 15:06 ` Alan Stern 1 sibling, 1 reply; 11+ messages in thread From: Mathias Nyman @ 2024-11-21 14:08 UTC (permalink / raw) To: Michał Pecio, Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org On 21.11.2024 12.26, Michał Pecio wrote: > Hi Alan, > > Thank you for taking the time to answer. I'm beginning to regret not > asking this question earlier. > > On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote: >>>> Linux appears to ignore this part and only reset on STALL >>>> handshake, as advised in >>>> Documentation/driver-api/usb/error-codes.rst and practiced by >>>> drivers - they don't seem to bother with usb_clear_halt() on >>>> -EPROTO. For xhci I assumed that the device endpoint is halted when we receive a 'Stall Error' transfer event for bulk or interrupt transfers. Other errors such as 'Transaction Error' do halt the host side, but does not necessarily mean whole endpoint is halted. This is based on the info in xhci 4.6.8.1 "Soft Retry" : "xHC shall halt an endpoint with a USB Transaction Error after CErr retries have been performed. The USB device is not aware that the xHC has halted the endpoint, and will be waiting for another retry, so a Soft Retry may be used to perform additional retries and recover from an error which has caused the xHC to halt an endpoint." >> >> This is generally a weakness in the drivers. It would be nice if the >> class specifications said what to do in these situations, but most of >> them don't. > > There is also proprietary hardware with no specification at all. > >>>> On the HCD side, xHCI will: >>>> - give back the current URB with -EPROTO/-EPIPE status >>>> - reset the host side endpoint, clearing its toggle state >>>> - point the HC at the next URB if one exist >>>> - restart the endpoint without waiting for hcd->endpoint_reset() Intention was not to restart the endpoint, but turns out we end up doing it. Basically we should not ring the doorbell when 'Set TR Deq' command completes for a bulk or interrupt endpoint in case the command was queued to resolve a Stall Error. For control endpoint we should restart the ring (xHC halts the internal control endpoint on errors/stall) >>>> - ignore one subsequent call to hcd->endpoint_reset() >> >> This behavior is not correct. See the kerneldoc for >> usb_unlink_urb() in drivers/usb/core/urb.c, especially the section >> labelled "Unlinking and Endpoint Queues". > > OK, let's go through it. > > * But when an URB terminates with an > * error its queue generally stops (see below), at least until that URB's > * completion routine returns. > > I don't see this working after xHCI adopted bottom half giveback, which > is asynchronous. As you are the maintainer of EHCI, which also uses BH, > how is EHCI dealing with it? > > One way I see with existing APIs is to wait until the driver submits a > new URB, but are drivers prepared for this? Is EHCI doing the same? Using a BH also means class driver may queue a new URB to xhci after xhci has cleared its internal endpoint halt condition, but before class driver is aware that endpoint halted. > > * It is guaranteed that a stopped queue > * will not restart until all its unlinked URBs have been fully retired, > * with their completion routines run > > I think xHCI can currently guarantee that nothing is restarted until > any URB unlinked for any reason is given to the BH worker, but that's > all we have, and I just broke it in usb-next: > > + /* Try to restart the endpoint if all is done */ > + ring_doorbell_for_active_rings(xhci, slot_id, ep_index); > + /* Start giving back any TDs invalidated above */ > + xhci_giveback_invalidated_tds(ep); > > This is part of a legitimate bugfix patch tagged for stable. I should > have insisted on keeping it a separate change, but it seemed a good idea > at the time which would soon get implemented anyway... Maybe no more? I don't think your patch makes the existing issue worse in any significant way. Thanks Mathias ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-21 14:08 ` Mathias Nyman @ 2024-11-22 11:54 ` Michał Pecio 0 siblings, 0 replies; 11+ messages in thread From: Michał Pecio @ 2024-11-22 11:54 UTC (permalink / raw) To: Mathias Nyman; +Cc: Alan Stern, Thinh Nguyen, linux-usb@vger.kernel.org On Thu, 21 Nov 2024 16:08:15 +0200, Mathias Nyman wrote: > "xHC shall halt an endpoint with a USB Transaction Error after CErr > retries have been performed. The USB device is not aware that the xHC > has halted the endpoint, and will be waiting for another retry, so a > Soft Retry may be used to perform additional retries and recover from > an error which has caused the xHC to halt an endpoint." That's why it would be "nice to have" the ability to allow class drivers to perform further "soft retries" by killing all URBs and submitting them (together with the failed one) again. Or by some other means, as this is a rather crude API. I imagine it would be the most sensible option for some devices like serial, although I don't know if any driver currently attempts it. If this is to work reliably, the sequence state apparently needs to be preserved on both sides and not reset, regardless of what the USB spec says. So it would take a TSP=1 reset on -EPROTO and full reset on -EPIPE (no other choice here), each followed by Set Deq. If xHCI wants or needs to keep doing full reset on -EPROTO then CLEAR_FEATURE(ENDPOINT_HALT) seems like a good addition. It does have the "double delivery after missing ACK" problem with class drivers unable or unwilling to determine the true state of their device, but avoids guaranteed toggle mismatch half the time. > >>>> On the HCD side, xHCI will: > >>>> - give back the current URB with -EPROTO/-EPIPE status > >>>> - reset the host side endpoint, clearing its toggle state > >>>> - point the HC at the next URB if one exist > >>>> - restart the endpoint without waiting for hcd->endpoint_reset() > > Intention was not to restart the endpoint, but turns out we end up > doing it. > > Basically we should not ring the doorbell when 'Set TR Deq' command > completes for a bulk or interrupt endpoint in case the command was > queued to resolve a Stall Error. According to Alan's kerneldoc, also for -EPROTO and ordinary unlink(). So practically always. Apparently, no endpoint stopped for any reason is supposed to restart with any complete() callbacks still pending. I imagine going back to direct giveback in cases of unlink/halt might be an option, maybe even more reliable than waiting for a new URB. > >>>> - ignore one subsequent call to hcd->endpoint_reset() It's also a bug that EP_HARD_CLEAR_TOGGLE isn't cleared when a new URB is queued on the endpoint. Toggle state becomes unknown then. > Using a BH also means class driver may queue a new URB to xhci after > xhci has cleared its internal endpoint halt condition, but before > class driver is aware that endpoint halted. In fact, they can do even more. Network drivers with a simple complete- resubmit loop can schedule a work to resubmit later if their GFP_ATOMIC allocation of a fresh data buffer fails. More generally, drivers are under no obligation to only resubmit URBs from complete() callbacks. I suppose the combination of BH giveback and restarting on a new URB is fundamentally unable to provide documented guarantees, on EHCI too. > + /* Try to restart the endpoint if all is done */ > + ring_doorbell_for_active_rings(xhci, slot_id, ep_index); > + /* Start giving back any TDs invalidated above */ > + xhci_giveback_invalidated_tds(ep); > > I don't think your patch makes the existing issue worse in any > significant way. Currently, it's just a sub-microsecond differenece in an existing race. But if we are to eliminate the race, this code needs reordering. Regards, Michal ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-21 10:26 ` Michał Pecio 2024-11-21 14:08 ` Mathias Nyman @ 2024-11-21 15:06 ` Alan Stern 2024-11-22 12:57 ` Michał Pecio 1 sibling, 1 reply; 11+ messages in thread From: Alan Stern @ 2024-11-21 15:06 UTC (permalink / raw) To: Michał Pecio; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman On Thu, Nov 21, 2024 at 11:26:53AM +0100, Michał Pecio wrote: > Hi Alan, > > Thank you for taking the time to answer. I'm beginning to regret not > asking this question earlier. I hope all the material here is accurate; it's been a long time since I worked on these matters. > On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote: > > > > Linux appears to ignore this part and only reset on STALL > > > > handshake, as advised in > > > > Documentation/driver-api/usb/error-codes.rst and practiced by > > > > drivers - they don't seem to bother with usb_clear_halt() on > > > > -EPROTO. > > > > This is generally a weakness in the drivers. It would be nice if the > > class specifications said what to do in these situations, but most of > > them don't. > > There is also proprietary hardware with no specification at all. Indeed. > > > > On the HCD side, xHCI will: > > > > - give back the current URB with -EPROTO/-EPIPE status > > > > - reset the host side endpoint, clearing its toggle state > > > > - point the HC at the next URB if one exist > > > > - restart the endpoint without waiting for hcd->endpoint_reset() > > > > - ignore one subsequent call to hcd->endpoint_reset() > > > > This behavior is not correct. See the kerneldoc for > > usb_unlink_urb() in drivers/usb/core/urb.c, especially the section > > labelled "Unlinking and Endpoint Queues". > > OK, let's go through it. > > * But when an URB terminates with an > * error its queue generally stops (see below), at least until that URB's > * completion routine returns. > > I don't see this working after xHCI adopted bottom half giveback, which > is asynchronous. As you are the maintainer of EHCI, which also uses BH, > how is EHCI dealing with it? Yes, I am. For Bulk endpoints: When a transmission error occurs, ehci-hcd removes the endpoint's queue header from the controller's async list, in addition to giving back the failed URB. When the removal is complete, the queue is scanned for other unlinked URBs, and they are given back. After that happens, the next URB submission will cause the queue header to be put back on the controller's async list. You're right about the lack of synchronization caused by use of a BH. The HCD has no way to know when the class driver has finished processing a giveback. We do need to fix this. > One way I see with existing APIs is to wait until the driver submits a > new URB, but are drivers prepared for this? Is EHCI doing the same? Yes; see above. > * It is guaranteed that a stopped queue > * will not restart until all its unlinked URBs have been fully retired, > * with their completion routines run > > I think xHCI can currently guarantee that nothing is restarted until > any URB unlinked for any reason is given to the BH worker, but that's > all we have, and I just broke it in usb-next: > > + /* Try to restart the endpoint if all is done */ > + ring_doorbell_for_active_rings(xhci, slot_id, ep_index); > + /* Start giving back any TDs invalidated above */ > + xhci_giveback_invalidated_tds(ep); > > This is part of a legitimate bugfix patch tagged for stable. I should > have insisted on keeping it a separate change, but it seemed a good idea > at the time which would soon get implemented anyway... Maybe no more? > > * even if that's not until some time > * after the original completion handler returns. > > Not entirely sure what this means. Suppose URBs 1, 2, and 3 are queued when URB 1 gets a transmission error. It is given back, and its completion handler unlinks URBs 2 and 3. The completion handler for URB 1 then returns, but the queue won't restart until the completion handlers for URBs 2 and 3 have returned, even if that doesn't occur for some time. At least, that is the intent. > * The same behavior and > * guarantee apply when an URB terminates because it was unlinked. > > Same caveat about BH asynchronicity in xHCI. > > > In general, the only safe thing for class drivers to do when they get > > one of these errors on a bulk or interrupt endpoint is to unlink all > > the pending URBs for the endpoint and then call usb_clear_halt() when > > they have all completed. I know that usb-storage does this; I > > suspect that not many other drivers do. > > Your suspicion isn't wrong AFAIK. > > One more thing which is safe at least wrt data corruption is to simply > retry the same URB without resetting anything. But if an HCD wants to > do it, existing API gives no way to notify the driver and allow it to > opt out and handle things differently, so it mustn't retry forever. Yes, and in fact ehci-hcd _does_ retry up to 32 times. > > I suppose the USB core could take care of this automatically so that > > neither the class drivers nor the HCDs would have to worry about it. > > If everyone agrees that this wouldn't lead to other problems, it > > could be implemented without too much difficulty. > > This still appears to run into the double delivery problem that you > described in the discussion linked by Thinh Nguyen, particularly in > case of dodgy drivers for dodgy hardware. > > Considering that, I'm not sure if automatically resetting anything on > -EPROTO is a good idea. What about automatic unlinking? Note that some class drivers treat -EPROTO as a fatal error. That is, they don't retry and their completion-resubmission loop breaks down. When that happens, the only way forward is to unbind the driver (for example, by unplugging the device). Of course, this isn't a problem if the original cause of the -EPROTO error is that the device just _was_ unplugged. All other cases are sufficiently rare that we don't have a generally agreed-upon strategy for handling them. > > > > I wonder what other HCDs are doing in this case, and what's the > > > > idea behind it all? > > > > As far as I know, they don't automatically send Clear-Halt requests > > to the device or automatically unlink anything. > > That's what it looks like, grepping through drivers/usb/host. But my > question was mainly about -EPROTO. When should an HCD restart a halted > endpoint? Should it clear the sequence state? (probably not). Except for an immediate retry, I suspect the HCD should never restart a halted Bulk or Interrupt endpoint on its own initiative. Not until another URB is submitted. However, this seems impractical if the class driver wants to retain the existing URBs already on the endpoint's queue. (I don't know of any drivers that do this, but still...) Perhaps we should adopt the policy that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs to fail and enforce this policy in usbcore by automatic unlinking so that HC drivers don't have to do it. Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-21 15:06 ` Alan Stern @ 2024-11-22 12:57 ` Michał Pecio 2024-11-22 19:28 ` Alan Stern 0 siblings, 1 reply; 11+ messages in thread From: Michał Pecio @ 2024-11-22 12:57 UTC (permalink / raw) To: Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman On Thu, 21 Nov 2024 10:06:50 -0500, Alan Stern wrote:= > > One way I see with existing APIs is to wait until the driver > > submits a new URB, but are drivers prepared for this? Is EHCI doing > > the same? > > Yes; see above. That's comforting to hear. But still seems to have races, see Mathias and my reply to him. I suppose alternative solutions include: bypassing the BH on unlink and error paths, or making it call the HCD back when it's done. The latter may not be so trivial, as it would ideally be per-endpoint. > What about automatic unlinking? Maybe it could make things go faster and smoother. Networking can tolerate dropped packets, but considering that their MTU is larger than USB MTU, I suppose they have to split payloads across multiple USB packets and may get out of sync if only part of a payload is dropped. But I don't know, they could use packet headers to resync. > Note that some class drivers treat -EPROTO as a fatal error. That > is, they don't retry and their completion-resubmission loop breaks > down. Well, that's on EHCI. xHCI gives them a chance to continue with the remaining URBs. Hopefuly nobody relies on that... > However, this seems impractical if the class driver wants to retain > the existing URBs already on the endpoint's queue. (I don't know of > any drivers that do this, but still...) Perhaps we should adopt the > policy that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs > to fail and enforce this policy in usbcore by automatic unlinking so > that HC drivers don't have to do it. I wouldn't exclude the possibility of sloppy drivers leaving URBs simply because they don't care. Hard to tell what's right for them. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-22 12:57 ` Michał Pecio @ 2024-11-22 19:28 ` Alan Stern 2024-11-22 23:25 ` Michał Pecio 0 siblings, 1 reply; 11+ messages in thread From: Alan Stern @ 2024-11-22 19:28 UTC (permalink / raw) To: Michał Pecio; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman On Fri, Nov 22, 2024 at 01:57:33PM +0100, Michał Pecio wrote: > On Thu, 21 Nov 2024 10:06:50 -0500, Alan Stern wrote:= > > > One way I see with existing APIs is to wait until the driver > > > submits a new URB, but are drivers prepared for this? Is EHCI doing > > > the same? > > > > Yes; see above. > > That's comforting to hear. > But still seems to have races, see Mathias and my reply to him. > > I suppose alternative solutions include: bypassing the BH on unlink and > error paths, or making it call the HCD back when it's done. The latter > may not be so trivial, as it would ideally be per-endpoint. Bypassing the BH might not be a good idea, because driver's completion handlers expect to be called in order of URB completion. It probably wouldn't make any difference, but it's hard to be sure. > > What about automatic unlinking? > > Maybe it could make things go faster and smoother. > > Networking can tolerate dropped packets, but considering that their MTU > is larger than USB MTU, I suppose they have to split payloads across > multiple USB packets and may get out of sync if only part of a payload > is dropped. But I don't know, they could use packet headers to resync. > > > Note that some class drivers treat -EPROTO as a fatal error. That > > is, they don't retry and their completion-resubmission loop breaks > > down. > > Well, that's on EHCI. No, it's the behavior of the class driver and is independent of the type of host controller. > xHCI gives them a chance to continue with the > remaining URBs. Hopefuly nobody relies on that... > > > However, this seems impractical if the class driver wants to retain > > the existing URBs already on the endpoint's queue. (I don't know of > > any drivers that do this, but still...) Perhaps we should adopt the > > policy that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs > > to fail and enforce this policy in usbcore by automatic unlinking so > > that HC drivers don't have to do it. > > I wouldn't exclude the possibility of sloppy drivers leaving URBs > simply because they don't care. Hard to tell what's right for them. If they don't care then they won't care if the URBs get unlinked. Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-22 19:28 ` Alan Stern @ 2024-11-22 23:25 ` Michał Pecio 2024-11-23 2:39 ` Alan Stern 0 siblings, 1 reply; 11+ messages in thread From: Michał Pecio @ 2024-11-22 23:25 UTC (permalink / raw) To: Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman On Fri, 22 Nov 2024 14:28:58 -0500, Alan Stern wrote: > Bypassing the BH might not be a good idea, because driver's > completion handlers expect to be called in order of URB completion. > It probably wouldn't make any difference, but it's hard to be sure. Valid point. Expecting drivers to deal with reordered completions would be quite unintuitive, potentially laborious and bug-prone. > > > Note that some class drivers treat -EPROTO as a fatal error. That > > > is, they don't retry and their completion-resubmission loop breaks > > > down. > > > > Well, that's on EHCI. > > No, it's the behavior of the class driver and is independent of the > type of host controller. xHCI has been doing things differently for over a decade as far as I see, and it seems to implement the usb_unlink_urb() rules absolutely literally (restart when everything is given back), except for the BH delay problem added later. Maybe it was a common "idiom" before xHCI, but it seems to rely on undocumented behavior, and other undocumented behaviors exist today that sloppy drivers might depend on. So I don't know, it seems risky either way. Regards, Michal ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: How are halted endpoints supposed to be handled in Linux? 2024-11-22 23:25 ` Michał Pecio @ 2024-11-23 2:39 ` Alan Stern 0 siblings, 0 replies; 11+ messages in thread From: Alan Stern @ 2024-11-23 2:39 UTC (permalink / raw) To: Michał Pecio; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman On Sat, Nov 23, 2024 at 12:25:35AM +0100, Michał Pecio wrote: > On Fri, 22 Nov 2024 14:28:58 -0500, Alan Stern wrote: > > > > Note that some class drivers treat -EPROTO as a fatal error. That > > > > is, they don't retry and their completion-resubmission loop breaks > > > > down. > > > > > > Well, that's on EHCI. > > > > No, it's the behavior of the class driver and is independent of the > > type of host controller. > > xHCI has been doing things differently for over a decade as far as I > see, and it seems to implement the usb_unlink_urb() rules absolutely > literally (restart when everything is given back), except for the BH > delay problem added later. > > Maybe it was a common "idiom" before xHCI, but it seems to rely on > undocumented behavior, and other undocumented behaviors exist today > that sloppy drivers might depend on. I think you're misunderstanding what I wrote. I meant that several class drivers have completion handlers that look like this: urb_complete(struct usb_urb *urb, int status) { switch (status) { ... case -EPROTO: dev_warn(dev, "USB communication error\n"); return; ... } /* Process data from the URB */ ... usb_fill_bulk_urb(urb, ....); usb_submit_urb(urb); } The driver works by resubmitting a single URB over and over again. But when there's a -EPROTO error, it doesn't resubmit and the loop stops. The driver doesn't do anything more after that; it becomes useless. This behavior has nothing to do with EHCI or xHCI or any other type of host controller. And overall it works, because -EPROTO errors hardly ever occur except when a device is unplugged. > So I don't know, it seems risky either way. This approach is only a little risky, in the sense that the driver might die in a situation where it didn't really need to -- but that outcome is most unlikely. There is no risk of further communication errors or data corruption. Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-11-23 2:39 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-11-20 23:11 How are halted endpoints supposed to be handled in Linux? Michał Pecio 2024-11-21 0:02 ` Thinh Nguyen 2024-11-21 2:31 ` Alan Stern 2024-11-21 10:26 ` Michał Pecio 2024-11-21 14:08 ` Mathias Nyman 2024-11-22 11:54 ` Michał Pecio 2024-11-21 15:06 ` Alan Stern 2024-11-22 12:57 ` Michał Pecio 2024-11-22 19:28 ` Alan Stern 2024-11-22 23:25 ` Michał Pecio 2024-11-23 2:39 ` Alan Stern
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox