How are halted endpoints supposed to be handled in Linux?

public inbox for linux-usb@vger.kernel.org
 help / color / mirror / Atom feed

* How are halted endpoints supposed to be handled in Linux?
@ 2024-11-20 23:11 Michał Pecio
  2024-11-21  0:02 ` Thinh Nguyen
  0 siblings, 1 reply; 11+ messages in thread
From: Michał Pecio @ 2024-11-20 23:11 UTC (permalink / raw)
  To: linux-usb; +Cc: Mathias Nyman

Hi,

I have been wondering about it after seeing how it's done in xhci_hcd,
which looks wrong to me.

USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL
handshake or "transmission error" should cause both the device and host
endpoints to be reset. I presume "transmission error" means any error
detected by the HC which causes it to halt, various examples exist.

USB 3.0 just refers to USB 2.0.

Linux appears to ignore this part and only reset on STALL handshake, as
advised in Documentation/driver-api/usb/error-codes.rst and practiced
by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO.
This wouldn't necessarily be bad in itself, but:

On the HCD side, xHCI will:
- give back the current URB with -EPROTO/-EPIPE status
- reset the host side endpoint, clearing its toggle state
- point the HC at the next URB if one exist
- restart the endpoint without waiting for hcd->endpoint_reset()
- ignore one subsequent call to hcd->endpoint_reset()

For STALL, I think it's a little awkward, but acceptable. The ultimate
result appears to be that all pending URBs are given back with -EPIPE
and things start moving again after usb_clear_halt().

But if the device isn't stalled, the next URB may execute right away if
the failure was transient. This makes it impossible to ensure in-order
delivery on bulk OUT pipes, because one URB is skipped and the driver
has no reliable way to retry it before some later ones may get executed.

This behavior also creates an opportunity for toggle mismatch, and as
far as I understand, the hardware will resolve it by silently dropping
one packet. Another could be dropped if usb_clear_halt() were called.

Either I'm missing something, or it seems quite broken?

I wonder what other HCDs are doing in this case, and what's the idea
behind it all?

Regards,
Michal

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-20 23:11 How are halted endpoints supposed to be handled in Linux? Michał Pecio
@ 2024-11-21  0:02 ` Thinh Nguyen
  2024-11-21  2:31   ` Alan Stern
  0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2024-11-21  0:02 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb@vger.kernel.org, Mathias Nyman, Alan Stern

++Alan

Hi,

On Thu, Nov 21, 2024, Michał Pecio wrote:
> Hi,
> 
> I have been wondering about it after seeing how it's done in xhci_hcd,
> which looks wrong to me.
> 
> USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL
> handshake or "transmission error" should cause both the device and host
> endpoints to be reset. I presume "transmission error" means any error
> detected by the HC which causes it to halt, various examples exist.
> 
> USB 3.0 just refers to USB 2.0.
> 
> Linux appears to ignore this part and only reset on STALL handshake, as
> advised in Documentation/driver-api/usb/error-codes.rst and practiced
> by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO.
> This wouldn't necessarily be bad in itself, but:
> 
> On the HCD side, xHCI will:
> - give back the current URB with -EPROTO/-EPIPE status
> - reset the host side endpoint, clearing its toggle state
> - point the HC at the next URB if one exist
> - restart the endpoint without waiting for hcd->endpoint_reset()
> - ignore one subsequent call to hcd->endpoint_reset()
> 
> For STALL, I think it's a little awkward, but acceptable. The ultimate
> result appears to be that all pending URBs are given back with -EPIPE
> and things start moving again after usb_clear_halt().
> 
> But if the device isn't stalled, the next URB may execute right away if
> the failure was transient. This makes it impossible to ensure in-order
> delivery on bulk OUT pipes, because one URB is skipped and the driver
> has no reliable way to retry it before some later ones may get executed.

The class driver will know of this error, and the retry/recovery should
be done at the class driver base on this error and not from the HCD.

> 
> This behavior also creates an opportunity for toggle mismatch, and as
> far as I understand, the hardware will resolve it by silently dropping
> one packet. Another could be dropped if usb_clear_halt() were called.
> 
> 
> Either I'm missing something, or it seems quite broken?
> 
> I wonder what other HCDs are doing in this case, and what's the idea
> behind it all?
> 
> Regards,
> Michal
> 

This long discussion may be relevant to your query:
https://lore.kernel.org/linux-usb/ba06679f-93d2-4cb4-9218-9e288065bdfb@rowland.harvard.edu/#t

The concensus is that these error recovery should be handled by the
class drivers. How they should be handled may be different between
different OSes. If you follow the thread discussion, it may not
necessarily a good idea to use usb_clear_halt() to recover.

BR,
Thinh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-21  0:02 ` Thinh Nguyen
@ 2024-11-21  2:31   ` Alan Stern
  2024-11-21 10:26     ` Michał Pecio
  0 siblings, 1 reply; 11+ messages in thread
From: Alan Stern @ 2024-11-21  2:31 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: Michał Pecio, linux-usb@vger.kernel.org, Mathias Nyman

On Thu, Nov 21, 2024 at 12:02:19AM +0000, Thinh Nguyen wrote:
> ++Alan
> 
> Hi,
> 
> On Thu, Nov 21, 2024, Michał Pecio wrote:
> > Hi,
> > 
> > I have been wondering about it after seeing how it's done in xhci_hcd,
> > which looks wrong to me.
> > 
> > USB 2.0 spec 5.7.5/5.8.5 states that halt condition due to either STALL
> > handshake or "transmission error" should cause both the device and host
> > endpoints to be reset. I presume "transmission error" means any error
> > detected by the HC which causes it to halt, various examples exist.

The spec is unfortunately vague about what a halt condition is, although 
it apparently intends the various sorts of transmission errors to count.

> > USB 3.0 just refers to USB 2.0.
> > 
> > Linux appears to ignore this part and only reset on STALL handshake, as
> > advised in Documentation/driver-api/usb/error-codes.rst and practiced
> > by drivers - they don't seem to bother with usb_clear_halt() on -EPROTO.

This is generally a weakness in the drivers.  It would be nice if the 
class specifications said what to do in these situations, but most of 
them don't.

> > This wouldn't necessarily be bad in itself, but:
> > 
> > On the HCD side, xHCI will:
> > - give back the current URB with -EPROTO/-EPIPE status
> > - reset the host side endpoint, clearing its toggle state
> > - point the HC at the next URB if one exist
> > - restart the endpoint without waiting for hcd->endpoint_reset()
> > - ignore one subsequent call to hcd->endpoint_reset()

This behavior is not correct.  See the kerneldoc for 
usb_unlink_urb() in drivers/usb/core/urb.c, especially the section 
labelled "Unlinking and Endpoint Queues".

In general, the only safe thing for class drivers to do when they get 
one of these errors on a bulk or interrupt endpoint is to unlink all the 
pending URBs for the endpoint and then call usb_clear_halt() when they 
have all completed.  I know that usb-storage does this; I suspect that 
not many other drivers do.

I suppose the USB core could take care of this automatically so that 
neither the class drivers nor the HCDs would have to worry about it.  If 
everyone agrees that this wouldn't lead to other problems, it could be 
implemented without too much difficulty.

However, as you point out, in general there's no way for the host to 
know whether the device accepted the failed transaction.  This means 
that proper recovery requires knowledge of the higher-level protocol, 
which only the class driver is aware of.

> > For STALL, I think it's a little awkward, but acceptable. The ultimate
> > result appears to be that all pending URBs are given back with -EPIPE
> > and things start moving again after usb_clear_halt().
> > 
> > But if the device isn't stalled, the next URB may execute right away if
> > the failure was transient. This makes it impossible to ensure in-order
> > delivery on bulk OUT pipes, because one URB is skipped and the driver
> > has no reliable way to retry it before some later ones may get executed.

Just so.

> The class driver will know of this error, and the retry/recovery should
> be done at the class driver base on this error and not from the HCD.

Agreed.  The minimum requirement is that the endpoint does not restart 
until all the pending URBs have completed, whether through explicit or 
automatic unlinks.

> > This behavior also creates an opportunity for toggle mismatch, and as
> > far as I understand, the hardware will resolve it by silently dropping
> > one packet. Another could be dropped if usb_clear_halt() were called.
> > 
> > 
> > Either I'm missing something, or it seems quite broken?
> > 
> > I wonder what other HCDs are doing in this case, and what's the idea
> > behind it all?

As far as I know, they don't automatically send Clear-Halt requests to 
the device or automatically unlink anything.

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-21  2:31   ` Alan Stern
@ 2024-11-21 10:26     ` Michał Pecio
  2024-11-21 14:08       ` Mathias Nyman
  2024-11-21 15:06       ` Alan Stern
  0 siblings, 2 replies; 11+ messages in thread
From: Michał Pecio @ 2024-11-21 10:26 UTC (permalink / raw)
  To: Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman

Hi Alan,

Thank you for taking the time to answer. I'm beginning to regret not
asking this question earlier.

On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote:
> > > Linux appears to ignore this part and only reset on STALL
> > > handshake, as advised in
> > > Documentation/driver-api/usb/error-codes.rst and practiced by
> > > drivers - they don't seem to bother with usb_clear_halt() on
> > > -EPROTO.  
> 
> This is generally a weakness in the drivers.  It would be nice if the 
> class specifications said what to do in these situations, but most of 
> them don't.

There is also proprietary hardware with no specification at all.

> > > On the HCD side, xHCI will:
> > > - give back the current URB with -EPROTO/-EPIPE status
> > > - reset the host side endpoint, clearing its toggle state
> > > - point the HC at the next URB if one exist
> > > - restart the endpoint without waiting for hcd->endpoint_reset()
> > > - ignore one subsequent call to hcd->endpoint_reset()  
> 
> This behavior is not correct.  See the kerneldoc for 
> usb_unlink_urb() in drivers/usb/core/urb.c, especially the section 
> labelled "Unlinking and Endpoint Queues".

OK, let's go through it.

 * But when an URB terminates with an
 * error its queue generally stops (see below), at least until that URB's
 * completion routine returns.

I don't see this working after xHCI adopted bottom half giveback, which
is asynchronous. As you are the maintainer of EHCI, which also uses BH,
how is EHCI dealing with it?

One way I see with existing APIs is to wait until the driver submits a
new URB, but are drivers prepared for this? Is EHCI doing the same?

 * It is guaranteed that a stopped queue
 * will not restart until all its unlinked URBs have been fully retired,
 * with their completion routines run

I think xHCI can currently guarantee that nothing is restarted until
any URB unlinked for any reason is given to the BH worker, but that's
all we have, and I just broke it in usb-next:

+               /* Try to restart the endpoint if all is done */
+               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
+               /* Start giving back any TDs invalidated above */
+               xhci_giveback_invalidated_tds(ep);

This is part of a legitimate bugfix patch tagged for stable. I should
have insisted on keeping it a separate change, but it seemed a good idea
at the time which would soon get implemented anyway... Maybe no more?

 * even if that's not until some time
 * after the original completion handler returns.

Not entirely sure what this means.

 * The same behavior and
 * guarantee apply when an URB terminates because it was unlinked.

Same caveat about BH asynchronicity in xHCI.

> In general, the only safe thing for class drivers to do when they get 
> one of these errors on a bulk or interrupt endpoint is to unlink all
> the pending URBs for the endpoint and then call usb_clear_halt() when
> they have all completed.  I know that usb-storage does this; I
> suspect that not many other drivers do.

Your suspicion isn't wrong AFAIK.

One more thing which is safe at least wrt data corruption is to simply
retry the same URB without resetting anything. But if an HCD wants to
do it, existing API gives no way to notify the driver and allow it to
opt out and handle things differently, so it mustn't retry forever.

> I suppose the USB core could take care of this automatically so that 
> neither the class drivers nor the HCDs would have to worry about it.
> If everyone agrees that this wouldn't lead to other problems, it
> could be implemented without too much difficulty.

This still appears to run into the double delivery problem that you
described in the discussion linked by Thinh Nguyen, particularly in
case of dodgy drivers for dodgy hardware.

Considering that, I'm not sure if automatically resetting anything on
-EPROTO is a good idea.

> > > I wonder what other HCDs are doing in this case, and what's the
> > > idea behind it all?  
> 
> As far as I know, they don't automatically send Clear-Halt requests
> to the device or automatically unlink anything.

That's what it looks like, grepping through drivers/usb/host. But my
question was mainly about -EPROTO. When should an HCD restart a halted
endpoint? Should it clear the sequence state? (probably not).

Regards,
Michal

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-21 10:26     ` Michał Pecio
@ 2024-11-21 14:08       ` Mathias Nyman
  2024-11-22 11:54         ` Michał Pecio
  2024-11-21 15:06       ` Alan Stern
  1 sibling, 1 reply; 11+ messages in thread
From: Mathias Nyman @ 2024-11-21 14:08 UTC (permalink / raw)
  To: Michał Pecio, Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org

On 21.11.2024 12.26, Michał Pecio wrote:
> Hi Alan,
> 
> Thank you for taking the time to answer. I'm beginning to regret not
> asking this question earlier.
> 
> On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote:
>>>> Linux appears to ignore this part and only reset on STALL
>>>> handshake, as advised in
>>>> Documentation/driver-api/usb/error-codes.rst and practiced by
>>>> drivers - they don't seem to bother with usb_clear_halt() on
>>>> -EPROTO.

For xhci I assumed that the device endpoint is halted when we receive
a  'Stall Error' transfer event for bulk or interrupt transfers.

Other errors such as 'Transaction Error' do halt the host side,
but does not necessarily mean whole endpoint is halted. This is based
on the info in xhci 4.6.8.1 "Soft Retry" :

"xHC shall halt an endpoint with a USB Transaction Error after CErr
retries have been performed. The USB device is not aware that the xHC
has halted the endpoint, and will be waiting for another retry, so a
Soft Retry may be used to perform additional retries and recover from
an error which has caused the xHC to halt an endpoint."

>>
>> This is generally a weakness in the drivers.  It would be nice if the
>> class specifications said what to do in these situations, but most of
>> them don't.
> 
> There is also proprietary hardware with no specification at all.
> 
>>>> On the HCD side, xHCI will:
>>>> - give back the current URB with -EPROTO/-EPIPE status
>>>> - reset the host side endpoint, clearing its toggle state
>>>> - point the HC at the next URB if one exist
>>>> - restart the endpoint without waiting for hcd->endpoint_reset()

Intention was not to restart the endpoint, but turns out we end up doing it.

Basically we should not ring the doorbell  when 'Set TR Deq' command completes
for a bulk or interrupt endpoint in case the command was queued to resolve a
Stall Error. For control endpoint we should restart the ring (xHC halts
the internal control endpoint on errors/stall)

>>>> - ignore one subsequent call to hcd->endpoint_reset()

>>
>> This behavior is not correct.  See the kerneldoc for
>> usb_unlink_urb() in drivers/usb/core/urb.c, especially the section
>> labelled "Unlinking and Endpoint Queues".
> 
> OK, let's go through it.
> 
>   * But when an URB terminates with an
>   * error its queue generally stops (see below), at least until that URB's
>   * completion routine returns.
> 
> I don't see this working after xHCI adopted bottom half giveback, which
> is asynchronous. As you are the maintainer of EHCI, which also uses BH,
> how is EHCI dealing with it?
> 
> One way I see with existing APIs is to wait until the driver submits a
> new URB, but are drivers prepared for this? Is EHCI doing the same?

Using a BH also means class driver may queue a new URB to xhci after xhci has
cleared its internal endpoint halt condition, but before class driver is
aware that endpoint halted.

> 
>   * It is guaranteed that a stopped queue
>   * will not restart until all its unlinked URBs have been fully retired,
>   * with their completion routines run
> 
> I think xHCI can currently guarantee that nothing is restarted until
> any URB unlinked for any reason is given to the BH worker, but that's
> all we have, and I just broke it in usb-next:
> 
> +               /* Try to restart the endpoint if all is done */
> +               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
> +               /* Start giving back any TDs invalidated above */
> +               xhci_giveback_invalidated_tds(ep);
> 
> This is part of a legitimate bugfix patch tagged for stable. I should
> have insisted on keeping it a separate change, but it seemed a good idea
> at the time which would soon get implemented anyway... Maybe no more?

I don't think your patch makes the existing issue worse in any significant
way.

Thanks
Mathias


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-21 10:26     ` Michał Pecio
  2024-11-21 14:08       ` Mathias Nyman
@ 2024-11-21 15:06       ` Alan Stern
  2024-11-22 12:57         ` Michał Pecio
  1 sibling, 1 reply; 11+ messages in thread
From: Alan Stern @ 2024-11-21 15:06 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman

On Thu, Nov 21, 2024 at 11:26:53AM +0100, Michał Pecio wrote:
> Hi Alan,
> 
> Thank you for taking the time to answer. I'm beginning to regret not
> asking this question earlier.

I hope all the material here is accurate; it's been a long time since I 
worked on these matters.

> On Wed, 20 Nov 2024 21:31:29 -0500, Alan Stern wrote:
> > > > Linux appears to ignore this part and only reset on STALL
> > > > handshake, as advised in
> > > > Documentation/driver-api/usb/error-codes.rst and practiced by
> > > > drivers - they don't seem to bother with usb_clear_halt() on
> > > > -EPROTO.  
> > 
> > This is generally a weakness in the drivers.  It would be nice if the 
> > class specifications said what to do in these situations, but most of 
> > them don't.
> 
> There is also proprietary hardware with no specification at all.

Indeed.

> > > > On the HCD side, xHCI will:
> > > > - give back the current URB with -EPROTO/-EPIPE status
> > > > - reset the host side endpoint, clearing its toggle state
> > > > - point the HC at the next URB if one exist
> > > > - restart the endpoint without waiting for hcd->endpoint_reset()
> > > > - ignore one subsequent call to hcd->endpoint_reset()  
> > 
> > This behavior is not correct.  See the kerneldoc for 
> > usb_unlink_urb() in drivers/usb/core/urb.c, especially the section 
> > labelled "Unlinking and Endpoint Queues".
> 
> OK, let's go through it.
> 
>  * But when an URB terminates with an
>  * error its queue generally stops (see below), at least until that URB's
>  * completion routine returns.
> 
> I don't see this working after xHCI adopted bottom half giveback, which
> is asynchronous. As you are the maintainer of EHCI, which also uses BH,
> how is EHCI dealing with it?

Yes, I am.

For Bulk endpoints: When a transmission error occurs, ehci-hcd removes 
the endpoint's queue header from the controller's async list, in 
addition to giving back the failed URB.  When the removal is complete, 
the queue is scanned for other unlinked URBs, and they are given back.  
After that happens, the next URB submission will cause the queue header 
to be put back on the controller's async list.

You're right about the lack of synchronization caused by use of a BH.  
The HCD has no way to know when the class driver has finished processing 
a giveback.  We do need to fix this.

> One way I see with existing APIs is to wait until the driver submits a
> new URB, but are drivers prepared for this? Is EHCI doing the same?

Yes; see above.

>  * It is guaranteed that a stopped queue
>  * will not restart until all its unlinked URBs have been fully retired,
>  * with their completion routines run
> 
> I think xHCI can currently guarantee that nothing is restarted until
> any URB unlinked for any reason is given to the BH worker, but that's
> all we have, and I just broke it in usb-next:
> 
> +               /* Try to restart the endpoint if all is done */
> +               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
> +               /* Start giving back any TDs invalidated above */
> +               xhci_giveback_invalidated_tds(ep);
> 
> This is part of a legitimate bugfix patch tagged for stable. I should
> have insisted on keeping it a separate change, but it seemed a good idea
> at the time which would soon get implemented anyway... Maybe no more?
> 
>  * even if that's not until some time
>  * after the original completion handler returns.
> 
> Not entirely sure what this means.

Suppose URBs 1, 2, and 3 are queued when URB 1 gets a transmission 
error.  It is given back, and its completion handler unlinks URBs 2 and 
3.  The completion handler for URB 1 then returns, but the queue won't 
restart until the completion handlers for URBs 2 and 3 have returned, 
even if that doesn't occur for some time.  At least, that is the intent.

>  * The same behavior and
>  * guarantee apply when an URB terminates because it was unlinked.
> 
> Same caveat about BH asynchronicity in xHCI.
> 
> > In general, the only safe thing for class drivers to do when they get 
> > one of these errors on a bulk or interrupt endpoint is to unlink all
> > the pending URBs for the endpoint and then call usb_clear_halt() when
> > they have all completed.  I know that usb-storage does this; I
> > suspect that not many other drivers do.
> 
> Your suspicion isn't wrong AFAIK.
> 
> One more thing which is safe at least wrt data corruption is to simply
> retry the same URB without resetting anything. But if an HCD wants to
> do it, existing API gives no way to notify the driver and allow it to
> opt out and handle things differently, so it mustn't retry forever.

Yes, and in fact ehci-hcd _does_ retry up to 32 times.

> > I suppose the USB core could take care of this automatically so that 
> > neither the class drivers nor the HCDs would have to worry about it.
> > If everyone agrees that this wouldn't lead to other problems, it
> > could be implemented without too much difficulty.
> 
> This still appears to run into the double delivery problem that you
> described in the discussion linked by Thinh Nguyen, particularly in
> case of dodgy drivers for dodgy hardware.
> 
> Considering that, I'm not sure if automatically resetting anything on
> -EPROTO is a good idea.

What about automatic unlinking?

Note that some class drivers treat -EPROTO as a fatal error.  That is, 
they don't retry and their completion-resubmission loop breaks down.  
When that happens, the only way forward is to unbind the driver (for 
example, by unplugging the device).  Of course, this isn't a problem if 
the original cause of the -EPROTO error is that the device just _was_ 
unplugged.  All other cases are sufficiently rare that we don't have a 
generally agreed-upon strategy for handling them.

> > > > I wonder what other HCDs are doing in this case, and what's the
> > > > idea behind it all?  
> > 
> > As far as I know, they don't automatically send Clear-Halt requests
> > to the device or automatically unlink anything.
> 
> That's what it looks like, grepping through drivers/usb/host. But my
> question was mainly about -EPROTO. When should an HCD restart a halted
> endpoint? Should it clear the sequence state? (probably not).

Except for an immediate retry, I suspect the HCD should never restart a 
halted Bulk or Interrupt endpoint on its own initiative.  Not until 
another URB is submitted.

However, this seems impractical if the class driver wants to retain the 
existing URBs already on the endpoint's queue.  (I don't know of any 
drivers that do this, but still...)  Perhaps we should adopt the policy 
that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs to fail and 
enforce this policy in usbcore by automatic unlinking so that HC 
drivers don't have to do it.

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-21 14:08       ` Mathias Nyman
@ 2024-11-22 11:54         ` Michał Pecio
  0 siblings, 0 replies; 11+ messages in thread
From: Michał Pecio @ 2024-11-22 11:54 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Alan Stern, Thinh Nguyen, linux-usb@vger.kernel.org

On Thu, 21 Nov 2024 16:08:15 +0200, Mathias Nyman wrote:
> "xHC shall halt an endpoint with a USB Transaction Error after CErr
> retries have been performed. The USB device is not aware that the xHC
> has halted the endpoint, and will be waiting for another retry, so a
> Soft Retry may be used to perform additional retries and recover from
> an error which has caused the xHC to halt an endpoint."

That's why it would be "nice to have" the ability to allow class
drivers to perform further "soft retries" by killing all URBs and
submitting them (together with the failed one) again. Or by some
other means, as this is a rather crude API.

I imagine it would be the most sensible option for some devices like
serial, although I don't know if any driver currently attempts it.

If this is to work reliably, the sequence state apparently needs
to be preserved on both sides and not reset, regardless of what the
USB spec says. So it would take a TSP=1 reset on -EPROTO and full
reset on -EPIPE (no other choice here), each followed by Set Deq.

If xHCI wants or needs to keep doing full reset on -EPROTO then
CLEAR_FEATURE(ENDPOINT_HALT) seems like a good addition. It does
have the "double delivery after missing ACK" problem with class
drivers unable or unwilling to determine the true state of their
device, but avoids guaranteed toggle mismatch half the time.

> >>>> On the HCD side, xHCI will:
> >>>> - give back the current URB with -EPROTO/-EPIPE status
> >>>> - reset the host side endpoint, clearing its toggle state
> >>>> - point the HC at the next URB if one exist
> >>>> - restart the endpoint without waiting for hcd->endpoint_reset()
> 
> Intention was not to restart the endpoint, but turns out we end up
> doing it.
> 
> Basically we should not ring the doorbell  when 'Set TR Deq' command
> completes for a bulk or interrupt endpoint in case the command was
> queued to resolve a Stall Error.

According to Alan's kerneldoc, also for -EPROTO and ordinary unlink().
So practically always. Apparently, no endpoint stopped for any reason
is supposed to restart with any complete() callbacks still pending.

I imagine going back to direct giveback in cases of unlink/halt might
be an option, maybe even more reliable than waiting for a new URB. 

> >>>> - ignore one subsequent call to hcd->endpoint_reset()

It's also a bug that EP_HARD_CLEAR_TOGGLE isn't cleared when a new
URB is queued on the endpoint. Toggle state becomes unknown then.

> Using a BH also means class driver may queue a new URB to xhci after
> xhci has cleared its internal endpoint halt condition, but before
> class driver is aware that endpoint halted.

In fact, they can do even more. Network drivers with a simple complete-
resubmit loop can schedule a work to resubmit later if their GFP_ATOMIC
allocation of a fresh data buffer fails. More generally, drivers are
under no obligation to only resubmit URBs from complete() callbacks.

I suppose the combination of BH giveback and restarting on a new URB
is fundamentally unable to provide documented guarantees, on EHCI too.

> +               /* Try to restart the endpoint if all is done */
> +               ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
> +               /* Start giving back any TDs invalidated above */
> +               xhci_giveback_invalidated_tds(ep);
>
> I don't think your patch makes the existing issue worse in any
> significant way.

Currently, it's just a sub-microsecond differenece in an existing race.
But if we are to eliminate the race, this code needs reordering.

Regards,
Michal

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-21 15:06       ` Alan Stern
@ 2024-11-22 12:57         ` Michał Pecio
  2024-11-22 19:28           ` Alan Stern
  0 siblings, 1 reply; 11+ messages in thread
From: Michał Pecio @ 2024-11-22 12:57 UTC (permalink / raw)
  To: Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman

On Thu, 21 Nov 2024 10:06:50 -0500, Alan Stern wrote:=
> > One way I see with existing APIs is to wait until the driver
> > submits a new URB, but are drivers prepared for this? Is EHCI doing
> > the same?  
> 
> Yes; see above.

That's comforting to hear.
But still seems to have races, see Mathias and my reply to him.

I suppose alternative solutions include: bypassing the BH on unlink and
error paths, or making it call the HCD back when it's done. The latter
may not be so trivial, as it would ideally be per-endpoint.

> What about automatic unlinking?

Maybe it could make things go faster and smoother.

Networking can tolerate dropped packets, but considering that their MTU
is larger than USB MTU, I suppose they have to split payloads across
multiple USB packets and may get out of sync if only part of a payload
is dropped. But I don't know, they could use packet headers to resync.

> Note that some class drivers treat -EPROTO as a fatal error.  That
> is, they don't retry and their completion-resubmission loop breaks
> down.

Well, that's on EHCI. xHCI gives them a chance to continue with the
remaining URBs. Hopefuly nobody relies on that...

> However, this seems impractical if the class driver wants to retain
> the existing URBs already on the endpoint's queue.  (I don't know of
> any drivers that do this, but still...)  Perhaps we should adopt the
> policy that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs
> to fail and enforce this policy in usbcore by automatic unlinking so
> that HC drivers don't have to do it.

I wouldn't exclude the possibility of sloppy drivers leaving URBs
simply because they don't care. Hard to tell what's right for them.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-22 12:57         ` Michał Pecio
@ 2024-11-22 19:28           ` Alan Stern
  2024-11-22 23:25             ` Michał Pecio
  0 siblings, 1 reply; 11+ messages in thread
From: Alan Stern @ 2024-11-22 19:28 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman

On Fri, Nov 22, 2024 at 01:57:33PM +0100, Michał Pecio wrote:
> On Thu, 21 Nov 2024 10:06:50 -0500, Alan Stern wrote:=
> > > One way I see with existing APIs is to wait until the driver
> > > submits a new URB, but are drivers prepared for this? Is EHCI doing
> > > the same?  
> > 
> > Yes; see above.
> 
> That's comforting to hear.
> But still seems to have races, see Mathias and my reply to him.
> 
> I suppose alternative solutions include: bypassing the BH on unlink and
> error paths, or making it call the HCD back when it's done. The latter
> may not be so trivial, as it would ideally be per-endpoint.

Bypassing the BH might not be a good idea, because driver's completion 
handlers expect to be called in order of URB completion.  It probably 
wouldn't make any difference, but it's hard to be sure.

> > What about automatic unlinking?
> 
> Maybe it could make things go faster and smoother.
> 
> Networking can tolerate dropped packets, but considering that their MTU
> is larger than USB MTU, I suppose they have to split payloads across
> multiple USB packets and may get out of sync if only part of a payload
> is dropped. But I don't know, they could use packet headers to resync.
> 
> > Note that some class drivers treat -EPROTO as a fatal error.  That
> > is, they don't retry and their completion-resubmission loop breaks
> > down.
> 
> Well, that's on EHCI.

No, it's the behavior of the class driver and is independent of the 
type of host controller.

>  xHCI gives them a chance to continue with the
> remaining URBs. Hopefuly nobody relies on that...
> 
> > However, this seems impractical if the class driver wants to retain
> > the existing URBs already on the endpoint's queue.  (I don't know of
> > any drivers that do this, but still...)  Perhaps we should adopt the
> > policy that -EPROTO, -EILSEQ, and -ETIME cause all outstanding URBs
> > to fail and enforce this policy in usbcore by automatic unlinking so
> > that HC drivers don't have to do it.
> 
> I wouldn't exclude the possibility of sloppy drivers leaving URBs
> simply because they don't care. Hard to tell what's right for them.

If they don't care then they won't care if the URBs get unlinked.

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-22 19:28           ` Alan Stern
@ 2024-11-22 23:25             ` Michał Pecio
  2024-11-23  2:39               ` Alan Stern
  0 siblings, 1 reply; 11+ messages in thread
From: Michał Pecio @ 2024-11-22 23:25 UTC (permalink / raw)
  To: Alan Stern; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman

On Fri, 22 Nov 2024 14:28:58 -0500, Alan Stern wrote:
> Bypassing the BH might not be a good idea, because driver's
> completion handlers expect to be called in order of URB completion.
> It probably wouldn't make any difference, but it's hard to be sure.

Valid point. Expecting drivers to deal with reordered completions would
be quite unintuitive, potentially laborious and bug-prone.

> > > Note that some class drivers treat -EPROTO as a fatal error.  That
> > > is, they don't retry and their completion-resubmission loop breaks
> > > down.  
> > 
> > Well, that's on EHCI.  
> 
> No, it's the behavior of the class driver and is independent of the 
> type of host controller.

xHCI has been doing things differently for over a decade as far as I
see, and it seems to implement the usb_unlink_urb() rules absolutely
literally (restart when everything is given back), except for the BH
delay problem added later.

Maybe it was a common "idiom" before xHCI, but it seems to rely on
undocumented behavior, and other undocumented behaviors exist today
that sloppy drivers might depend on.

So I don't know, it seems risky either way.

Regards,
Michal

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: How are halted endpoints supposed to be handled in Linux?
  2024-11-22 23:25             ` Michał Pecio
@ 2024-11-23  2:39               ` Alan Stern
  0 siblings, 0 replies; 11+ messages in thread
From: Alan Stern @ 2024-11-23  2:39 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org, Mathias Nyman

On Sat, Nov 23, 2024 at 12:25:35AM +0100, Michał Pecio wrote:
> On Fri, 22 Nov 2024 14:28:58 -0500, Alan Stern wrote:
> > > > Note that some class drivers treat -EPROTO as a fatal error.  That
> > > > is, they don't retry and their completion-resubmission loop breaks
> > > > down.  
> > > 
> > > Well, that's on EHCI.  
> > 
> > No, it's the behavior of the class driver and is independent of the 
> > type of host controller.
> 
> xHCI has been doing things differently for over a decade as far as I
> see, and it seems to implement the usb_unlink_urb() rules absolutely
> literally (restart when everything is given back), except for the BH
> delay problem added later.
> 
> Maybe it was a common "idiom" before xHCI, but it seems to rely on
> undocumented behavior, and other undocumented behaviors exist today
> that sloppy drivers might depend on.

I think you're misunderstanding what I wrote.  I meant that several 
class drivers have completion handlers that look like this:

	urb_complete(struct usb_urb *urb, int status)
	{
		switch (status) {
		...
		case -EPROTO:
			dev_warn(dev, "USB communication error\n");
			return;
		...
		}

		/* Process data from the URB */
		...
		usb_fill_bulk_urb(urb, ....);
		usb_submit_urb(urb);
	}

The driver works by resubmitting a single URB over and over again.  But 
when there's a -EPROTO error, it doesn't resubmit and the loop stops.  
The driver doesn't do anything more after that; it becomes useless.

This behavior has nothing to do with EHCI or xHCI or any other type of 
host controller.  And overall it works, because -EPROTO errors hardly 
ever occur except when a device is unplugged.

> So I don't know, it seems risky either way.

This approach is only a little risky, in the sense that the driver 
might die in a situation where it didn't really need to -- but that 
outcome is most unlikely.  There is no risk of further communication 
errors or data corruption.

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-11-23  2:39 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-20 23:11 How are halted endpoints supposed to be handled in Linux? Michał Pecio
2024-11-21  0:02 ` Thinh Nguyen
2024-11-21  2:31   ` Alan Stern
2024-11-21 10:26     ` Michał Pecio
2024-11-21 14:08       ` Mathias Nyman
2024-11-22 11:54         ` Michał Pecio
2024-11-21 15:06       ` Alan Stern
2024-11-22 12:57         ` Michał Pecio
2024-11-22 19:28           ` Alan Stern
2024-11-22 23:25             ` Michał Pecio
2024-11-23  2:39               ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox