correctly handling EPROTO

public inbox for linux-usb@vger.kernel.org

* correctly handling EPROTO
@ 2026-03-12 13:55 Oliver Neukum
  2026-03-12 14:21 ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-12 13:55 UTC (permalink / raw)
  To: Bjørn Mork; +Cc: Alan Stern, USB list

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

Hi,

as we just discussed an HID device which tends to report
-EPROTO from time to time, the question arises what a driver is
to do when it gets -EPROTO.
Do we just retry? I am not really happy with that, as we
may very well get into a livelock. Nevertheless I think
we cannot just assume that -EPROTO is not recoverable.

Do we need to do something drastic like generalizing
the backoff logic from usbhid?

What do you think?

	Regards
		Oliver

[-- Attachment #2: 0001-usb-class-cdc-wdm-handle-EPROTO-on-interrupt-endpoin.patch --]
[-- Type: text/x-patch, Size: 1021 bytes --]

From c5656d49224908d03a4f7dc82353e919df5198e4 Mon Sep 17 00:00:00 2001
From: Oliver Neukum <oneukum@suse.com>
Date: Thu, 12 Mar 2026 12:29:23 +0100
Subject: [PATCH] usb: class: cdc-wdm handle EPROTO on interrupt endpoint

If we get an unexpected error, most likely EPROTO
during disconnect, there is no point in checking
for transmitted data. We do not want to process
such messages, even if they are long enough.
As we consider such events recoverable, jump
directly to resubmission.

Signed-off-by: Oliver Neukum <oneukum@suse.com>
---
 drivers/usb/class/cdc-wdm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/usb/class/cdc-wdm.c b/drivers/usb/class/cdc-wdm.c
index 9185295f5376..8067506d06a4 100644
--- a/drivers/usb/class/cdc-wdm.c
+++ b/drivers/usb/class/cdc-wdm.c
@@ -271,7 +271,7 @@ static void wdm_int_callback(struct urb *urb)
 		default:
 			dev_err_ratelimited(&desc->intf->dev,
 				"nonzero urb status received: %d\n", status);
-			break;
+			goto exit;
 		}
 	}
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: correctly handling EPROTO
  2026-03-12 13:55 correctly handling EPROTO Oliver Neukum
@ 2026-03-12 14:21 ` Alan Stern
  2026-03-12 15:57   ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-12 14:21 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Bjørn Mork, USB list

On Thu, Mar 12, 2026 at 02:55:48PM +0100, Oliver Neukum wrote:
> Hi,
> 
> as we just discussed an HID device which tends to report
> -EPROTO from time to time, the question arises what a driver is
> to do when it gets -EPROTO.
> Do we just retry? I am not really happy with that, as we
> may very well get into a livelock. Nevertheless I think
> we cannot just assume that -EPROTO is not recoverable.
> 
> Do we need to do something drastic like generalizing
> the backoff logic from usbhid?
> 
> What do you think?

There's a discussion about the same issue here:

	https://bugzilla.kernel.org/show_bug.cgi?id=221184

See especially the later parts, starting with comment #28.

Alan Stern


* Re: correctly handling EPROTO
  2026-03-12 14:21 ` Alan Stern
@ 2026-03-12 15:57   ` Oliver Neukum
  2026-03-13  7:53     ` Michal Pecio
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-12 15:57 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum; +Cc: Bjørn Mork, USB list

[-- Attachment #1: Type: text/plain, Size: 476 bytes --]

On 12.03.26 15:21, Alan Stern wrote:
  
> There's a discussion about the same issue here:
> 
> 	https://bugzilla.kernel.org/show_bug.cgi?id=221184
> 
> See especially the later parts, starting with comment #28.

Well, that is fascinating, but not necessarily in a comfortable
way. It seems to me that for all drivers to care about the
exact details of getting which toggle back into sync is not
a viable strategy. Thus I'd say, when in doubt, clear a halt.

	Regards
		Oliver

[-- Attachment #2: 0001-usb-class-cdc-wdm-handle-EPROTO-on-interrupt-endpoin.patch --]
[-- Type: text/x-patch, Size: 1229 bytes --]

From fc7a673c780e8eaf08a529938e70ad00a9edd7b3 Mon Sep 17 00:00:00 2001
From: Oliver Neukum <oneukum@suse.com>
Date: Thu, 12 Mar 2026 12:29:23 +0100
Subject: [PATCH] usb: class: cdc-wdm handle EPROTO on interrupt endpoint

Under some conditions -EPROTO requires a halt to be cleared.
This is too complicated to get optimal. We should not even
try. Hence the sane strategy is to clear a halt on
-EPROTO and directly retry for everything but a known
disconnect.

Signed-off-by: Oliver Neukum <oneukum@suse.com>

---
 drivers/usb/class/cdc-wdm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/class/cdc-wdm.c b/drivers/usb/class/cdc-wdm.c
index 9185295f5376..cfb31a8145ff 100644
--- a/drivers/usb/class/cdc-wdm.c
+++ b/drivers/usb/class/cdc-wdm.c
@@ -265,13 +265,14 @@ static void wdm_int_callback(struct urb *urb)
 		case -ECONNRESET:
 			return; /* unplug */
 		case -EPIPE:
+		case -EPROTO:
 			set_bit(WDM_INT_STALL, &desc->flags);
 			dev_err(&desc->intf->dev, "Stall on int endpoint\n");
 			goto sw; /* halt is cleared in work */
 		default:
 			dev_err_ratelimited(&desc->intf->dev,
 				"nonzero urb status received: %d\n", status);
-			break;
+			goto exit;
 		}
 	}
 
-- 
2.53.0



* Re: correctly handling EPROTO
  2026-03-12 15:57   ` Oliver Neukum
@ 2026-03-13  7:53     ` Michal Pecio
  2026-03-13 10:33       ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Pecio @ 2026-03-13  7:53 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Alan Stern, Bjørn Mork, USB list

On Thu, 12 Mar 2026 16:57:18 +0100, Oliver Neukum wrote:
> On 12.03.26 15:21, Alan Stern wrote:
>   
> > There's a discussion about the same issue here:
> > 
> > 	https://bugzilla.kernel.org/show_bug.cgi?id=221184
> > 
> > See especially the later parts, starting with comment #28.  
> 
> Well, that is fascinating, but not necessarily in a comfortable
> way. It seems to me that for all drivers to care about the
> exact details of getting which toggle back into sync is not
> a viable strategy. Thus I'd say, when in doubt, clear a halt.

The official USB philosophy appears to be:
- remove any remaining URBs from the endpoint
- clear halt (even if not halted)
- use class control requests to bring the HW and SW to a valid state
- submit new URBs
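
The four steps above can be sketched as a fixed sequence that stops at the first failure. This is a userspace model only, not kernel code: `ep_recovery_ops`, `ep_recover()` and the `demo_*` hooks are hypothetical names. In a real driver the hooks would presumably wrap URB unlinking, usb_clear_halt(), a class-specific control request and usb_submit_urb().

```c
#include <stddef.h>

/* Hypothetical per-endpoint hooks (assumption, not an existing kernel API).
 * Each returns 0 on success or a negative error code. */
struct ep_recovery_ops {
	int (*unlink_pending)(void *ctx);  /* 1: remove remaining URBs */
	int (*clear_halt)(void *ctx);      /* 2: clear halt, even if not halted */
	int (*class_resync)(void *ctx);    /* 3: class control requests */
	int (*resubmit)(void *ctx);        /* 4: submit new URBs */
};

/* Run the four steps in order and stop at the first failure. */
static int ep_recover(const struct ep_recovery_ops *ops, void *ctx)
{
	int (*const steps[])(void *) = {
		ops->unlink_pending, ops->clear_halt,
		ops->class_resync, ops->resubmit,
	};

	for (size_t i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		/* A NULL hook means the class has nothing to do for that step. */
		int ret = steps[i] ? steps[i](ctx) : 0;

		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Demo hooks that only record the order in which they were called. */
struct ep_trace {
	char order[5];
	size_t n;
};

static int trace_step(void *ctx, char tag)
{
	struct ep_trace *t = ctx;

	if (t->n < sizeof(t->order) - 1)
		t->order[t->n++] = tag;
	return 0;
}

static int demo_unlink(void *ctx) { return trace_step(ctx, 'U'); }
static int demo_clear(void *ctx)  { return trace_step(ctx, 'C'); }
static int demo_resync(void *ctx) { return trace_step(ctx, 'R'); }
static int demo_submit(void *ctx) { return trace_step(ctx, 'S'); }

static const struct ep_recovery_ops demo_ops = {
	.unlink_pending = demo_unlink,
	.clear_halt     = demo_clear,
	.class_resync   = demo_resync,
	.resubmit       = demo_submit,
};
```

Keeping the class-specific resync as its own hook reflects the point that only the class driver knows how to bring hardware and software back to a valid state.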

Some thoughts:

1. Giving up on EPROTO may be overly pessimistic. Bit flips happen.

   But EPROTO may also mean disconnection. Parent hub should notice,
   then disconnect() will be called at some point and usb_submit_urb()
   will begin returning ENODEV.

   Some cable failures cause persistent EPROTO without disconnection
   being detected, e.g. D+ or D- discontinuity at low or full speed.

2. The idea that a driver can retry a transfer by resubmitting the
   failed URB encounters certain problems.

 a Resubmitting a multi-packet URB is tricky - part of it may have
   already been transferred, so the URB may need to be modified.

 b With transaction translators (LS/FS behind a HS hub) one IN packet
   may already be lost forever when you see the first EPROTO. The
   interrupt case seems impossible to fix due to HW timings. Fixing the
   bulk case would take actions different than ordinary URB submission.
   No API nor EHCI support seems to exist, xHCI can't do it at all.

   Note: even a high-speed capable device may sometimes work like that.

 c If you usb_clear_halt() before resubmitting, a previously delivered
   packet may be resent and accepted again (if its ACK was lost). Both
   IN and OUT seem potentially affected. Some classes may not care.

3. EHCI/OHCI/UHCI can do proper retries, except case 2b, simply by
   resubmitting while minding points 2a and 2c (so no clear halt).

   The same action on xHCI currently results in 50% chance of the next
   packet being dropped by HW and the URB waiting for another packet.

   Tricking xHCI to behave like EHCI is uncharted territory. It seems
   to stray away from official USB recommendations.

4. xHCI can support retries cleanly and painlessly (except for the
   impossible case of TT) if the URB isn't given back or is given back
   with the understanding that it is still present in the HW queue and
   can only be unlinked or resumed at the point it got stuck. Issues:

 a No API exists for either option. However, a few retries are already
   made before completing with EPROTO status (except for TT).

 b This doesn't work 100% right and is disabled on some controllers.

5. If you have more URBs queued in advance, you may encounter bugs due
   to race conditions (or outright broken logic in case of xHCI).

Summary:

Retrying may or may not be productive - see 1.

Drivers written for EHCI encounter problems on xHCI - see 3.

Existing API is awkward/insufficient for retries - see 2a and 4.

In certain cases retries are impossible anyway - see 2b and 4b.
Extra work by class specific means is inevitable. See "USB philosophy".

Regards,
Michal


* Re: correctly handling EPROTO
  2026-03-13  7:53     ` Michal Pecio
@ 2026-03-13 10:33       ` Oliver Neukum
  2026-03-13 15:28         ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-13 10:33 UTC (permalink / raw)
  To: Michal Pecio, Oliver Neukum; +Cc: Alan Stern, Bjørn Mork, USB list

On 13.03.26 08:53, Michal Pecio wrote:

> The official USB philosophy appears to be:
> - remove any remaining URBs from the endpoint
> - clear halt (even if not halted)
> - use class control requests to bring the HW and SW to a valid state
> - submit new URBs

This looks a lot like saying that EPROTO is EPIPE by another
name in the eyes of a driver. Yes, it indicates that the issue comes
from the transport, not the device, but does that make a practical
difference?

> Some thoughts:
> 
> 1. Giving up on EPROTO may be overly pessimistic. Bit flips happen.
> 
>     But EPROTO may also mean disconnection. Parent hub should notice,
>     then disconnect() will be called at some point and usb_submit_urb()
>     will begin returning ENODEV.
> 
>     Some cable failures cause persistent EPROTO without disconnection
>     being detected, e.g. D+ or D- discontinuity at low or full speed.

There are also these pesky software systems where a perpetual immediate
retry causes a livelock.

We have to ask ourselves to which extent we want to deal with failing
hardware. I would prefer to deal with true hardware failure separately,
but if you think that we need to include the possibility on a driver level,
please say so.

> 
> 2. The idea that a driver can retry a transfer by resubmitting the
>     failed URB encounters certain problems.
> 

[snipped a fascinating but hideously complicated collection
of complexities]

I hope we agree that that is a level of complexity you cannot
expect any but the most complex drivers to even start thinking
about.

> 
> 3. EHCI/OHCI/UHCI can do proper retries, except case 2b, simply by
>     resubmitting while minding points 2a and 2c (so no clear halt).
> 
>     The same action on xHCI currently results in 50% chance of the next
>     packet being dropped by HW and the URB waiting for another packet.
> 
>     Tricking xHCI to behave like EHCI is uncharted territory. It seems
>     to stray away from official USB recommendations.

This is a gigantic layering violation. The specifics of error handling
in different HCs do not belong in drivers.
[..]

> Summary:
> 
> Retrying may or may not be productive - see 1.

Does it hurt? Furthermore is it likelier to succeed if we clear
a halt before we do so?

> In certain cases retries are impossible anyway - see 2b and 4b.
> Extra work by class specific means is inevitable. See "USB philosophy".

We are handling errors. That is, conditions that should not happen
in the first place. Perfection will not serve us. Can we come to
something reasonable that will mostly work and not preclude going
to more drastic methods if it fails?

	Regards
		Oliver


* Re: correctly handling EPROTO
  2026-03-13 10:33       ` Oliver Neukum
@ 2026-03-13 15:28         ` Alan Stern
  2026-03-13 22:45           ` Thinh Nguyen
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-13 15:28 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Michal Pecio, Bjørn Mork, USB list

On Fri, Mar 13, 2026 at 11:33:48AM +0100, Oliver Neukum wrote:
> On 13.03.26 08:53, Michal Pecio wrote:
> > The official USB philosophy appears to be:
> > - remove any remaining URBs from the endpoint
> > - clear halt (even if not halted)
> > - use class control requests to bring the HW and SW to a valid state
> > - submit new URBs
> 
> This looks a lot like saying that EPROTO is EPIPE by another
> name in the eyes of a driver. Yes, it indicates that the issue comes
> from the transport, not the device, but does that make a practical
> difference?

In some cases it may.  For example, EPROTO can indicate a short-term 
problem (and so retrying would be appropriate) whereas EPIPE generally 
means that the device does not support a particular request (so retrying 
would be futile).
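
The distinction drawn here can be written down as a small status classifier. This is a sketch under assumptions: `enum urb_action` and `classify_urb_status()` are hypothetical names, and the mapping is only one reading of this thread, not what any existing driver does.

```c
#include <errno.h>

/* Hypothetical actions a completion handler might choose between. */
enum urb_action {
	URB_OK,          /* no error: process the data and resubmit */
	URB_STOP,        /* URB was unlinked or device is going away */
	URB_CLEAR_HALT,  /* endpoint stalled: a halt must be cleared */
	URB_RETRY,       /* possibly transient: retry, but not forever */
};

static enum urb_action classify_urb_status(int status)
{
	switch (status) {
	case 0:
		return URB_OK;
	case -ESHUTDOWN:
	case -ENOENT:
	case -ECONNRESET:
		return URB_STOP;
	case -EPIPE:
		/* Generally means the device rejected the request. */
		return URB_CLEAR_HALT;
	case -EPROTO:
		/* May be a short-term problem, so a retry can make sense. */
		return URB_RETRY;
	default:
		/* Assumption: unexpected errors are treated as transient. */
		return URB_RETRY;
	}
}
```

Whether URB_RETRY should itself include clearing a halt is exactly the open question of this thread, so the sketch leaves that to the caller.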

> > Some thoughts:
> > 
> > 1. Giving up on EPROTO may be overly pessimistic. Bit flips happen.
> > 
> >     But EPROTO may also mean disconnection. Parent hub should notice,
> >     then disconnect() will be called at some point and usb_submit_urb()
> >     will begin returning ENODEV.
> > 
> >     Some cable failures cause persistent EPROTO without disconnection
> >     being detected, e.g. D+ or D- discontinuity at low or full speed.
> 
> There are also these pesky software systems where a perpetual immediate
> retry causes a livelock.

Yes, we need to avoid this.

> We have to ask ourselves to which extent we want to deal with failing
> hardware. I would prefer to deal with true hardware failure separately,
> but if you think that we need to include the possibility on a driver level,
> please say so.

I tend to group transaction-level errors like EPROTO into three 
categories:

	1. Device has been unplugged, hub will notify us soon;

	2. Unrecoverable device problem, needs reset or power cycle;

	3. Short term problem (cable issue, EMI, system load).

Retrying makes sense for 3 but not for 1 or 2.  Unfortunately we can't 
tell which category a particular fault lies in.

Furthermore, most drivers shouldn't have to include the code necessary 
to handle all these possibilities.  For some drivers, giving up entirely 
is a simplistic solution that might be good enough.  But others need to 
be more sophisticated; we should make this as easy as possible for them.

> > 2. The idea that a driver can retry a transfer by resubmitting the
> >     failed URB encounters certain problems.
> > 
> 
> [snipped a fascinating but hideously complicated collection
> of complexities]
> 
> I hope we agree that that is a level of complexity you cannot
> expect any but the most complex drivers to even start thinking
> about.

Indeed.

> > 3. EHCI/OHCI/UHCI can do proper retries, except case 2b, simply by
> >     resubmitting while minding points 2a and 2c (so no clear halt).
> > 
> >     The same action on xHCI currently results in 50% chance of the next
> >     packet being dropped by HW and the URB waiting for another packet.
> > 
> >     Tricking xHCI to behave like EHCI is uncharted territory. It seems
> >     to stray away from official USB recommendations.
> 
> This is a gigantic layering violation. The specifics of error handling
> in different HCs do not belong in drivers.

Also agreed.

> [..]
> > Summary:
> > 
> > Retrying may or may not be productive - see 1.
> 
> Does it hurt? Furthermore is it likelier to succeed if we clear
> a halt before we do so?
> 
> > In certain cases retries are impossible anyway - see 2b and 4b.
> > Extra work by class specific means is inevitable. See "USB philosophy".
> 
> We are handling errors. That is, conditions that should not happen
> in the first place. Perfection will not serve us. Can we come to
> something reasonable that will mostly work and not preclude going
> to more drastic methods if it fails?

And also bearing in mind that retrying will help only for problems of 
type 3 above.  (Also bearing in mind that the most drastic methods 
involve manual intervention; they can't be done in software.)

Alan Stern


* Re: correctly handling EPROTO
  2026-03-13 15:28         ` Alan Stern
@ 2026-03-13 22:45           ` Thinh Nguyen
  2026-03-14  2:39             ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-13 22:45 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, Michal Pecio, Bjørn Mork, USB list

On Fri, Mar 13, 2026, Alan Stern wrote:
> On Fri, Mar 13, 2026 at 11:33:48AM +0100, Oliver Neukum wrote:
> > On 13.03.26 08:53, Michal Pecio wrote:
> > 
> > There are also these pesky software systems where a perpetual immediate
> > retry causes a livelock.
> 
> Yes, we need to avoid this.
> 

This can be handled by the class driver (e.g. fall back to port reset
after 3 retry failures in a row).
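
A minimal sketch of such a fallback, assuming a simple counter of consecutive failed retries. `retry_state`, `retry_note_failure()` and `retry_note_success()` are hypothetical names; the limit of 3 is taken from the example above.

```c
#define MAX_CONSECUTIVE_RETRIES 3

enum recovery_step {
	STEP_RETRY,       /* try the transfer again */
	STEP_PORT_RESET,  /* retries exhausted: escalate */
};

struct retry_state {
	unsigned int retries;  /* retries issued since the last success */
};

/* Any successful transfer ends the streak. */
static void retry_note_success(struct retry_state *s)
{
	s->retries = 0;
}

/* Called for every failed transfer. Allows up to MAX_CONSECUTIVE_RETRIES
 * retries in a row, then asks for a port reset and starts over. */
static enum recovery_step retry_note_failure(struct retry_state *s)
{
	if (s->retries >= MAX_CONSECUTIVE_RETRIES) {
		s->retries = 0;
		return STEP_PORT_RESET;
	}
	s->retries++;
	return STEP_RETRY;
}
```

Because the counter is bounded, a persistent fault cannot turn into a livelock of immediate retries, whatever its cause.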

> > We have to ask ourselves to which extent we want to deal with failing
> > hardware. I would prefer to deal with true hardware failure separately,
> > but if you think that we need to include the possibility on a driver level,
> > please say so.
> 
> I tend to group transaction-level errors like EPROTO into three 
> categories:
> 
> 	1. Device has been unplugged, hub will notify us soon;
> 
> 	2. Unrecoverable device problem, needs reset or power cycle;
> 
> 	3. Short term problem (cable issue, EMI, system load).
> 
> Retrying makes sense for 3 but not for 1 or 2.  Unfortunately we can't 
> tell which category a particular fault lies in.

There's no need to distinguish them if we have a proper fallback
recovery (such as reset/power cycle) should the retry fail, as noted above.

>

<snip>

> 
> > [..]
> > > Summary:
> > > 
> > > Retrying may or may not be productive - see 1.
> > 
> > Does it hurt? Furthermore is it likelier to succeed if we clear
> > a halt before we do so?
> > 
> > > In certain cases retries are impossible anyway - see 2b and 4b.
> > > Extra work by class specific means is inevitable. See "USB philosophy".
> > 
> > We are handling errors. That is, conditions that should not happen
> > in the first place. Perfection will not serve us. Can we come to
> > something reasonable that will mostly work and not preclude going
> > to more drastic methods if it fails?
> 
> And also bearing in mind that retrying will help only for problems of 
> type 3 above.  (Also bearing in mind that the most drastic methods 
> involve manual intervention; they can't be done in software.)
> 

Just want to give my 2 cents here: I experimented and have seen Windows
drivers do retry for MSC (particularly for UASP devices), and it works
well.

The retry is not specifically at the failed URB. It's a retry of the
UASP command where the data block offset is specified, and the entire
transfer is resent.

This will probably not work for applications with isoc endpoints where
timing is critical or applications without some synchronization
protocol in their transfer headers.

BR,
Thinh


* Re: correctly handling EPROTO
  2026-03-13 22:45           ` Thinh Nguyen
@ 2026-03-14  2:39             ` Alan Stern
  2026-03-16 12:58               ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-14  2:39 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: Oliver Neukum, Michal Pecio, Bjørn Mork, USB list

On Fri, Mar 13, 2026 at 10:45:32PM +0000, Thinh Nguyen wrote:
> On Fri, Mar 13, 2026, Alan Stern wrote:
> > On Fri, Mar 13, 2026 at 11:33:48AM +0100, Oliver Neukum wrote:
> > > On 13.03.26 08:53, Michal Pecio wrote:
> > > 
> > > There are also these pesky software systems where a perpetual immediate
> > > retry causes a livelock.
> > 
> > Yes, we need to avoid this.
> > 
> 
> This can be handled by the class driver (e.g. fall back to port reset
> after 3 retry failures in a row).

Part of what we are discussing is how to carry out a retry.  It seems 
that the most general approach is to unlink all pending URBs for the 
endpoint, wait for them to complete, call usb_clear_halt(), and then 
resubmit everything.

And of course, isochronous transfers are never retried, by definition.

> > I tend to group transaction-level errors like EPROTO into three 
> > categories:
> > 
> > 	1. Device has been unplugged, hub will notify us soon;
> > 
> > 	2. Unrecoverable device problem, needs reset or power cycle;
> > 
> > 	3. Short term problem (cable issue, EMI, system load).
> > 
> > Retrying makes sense for 3 but not for 1 or 2.  Unfortunately we can't 
> > tell which category a particular fault lies in.
> 
> There's no need to distinguish them if we have a proper fallback
> recovery (such as reset/power cycle) should the retry fail, as noted above.

Yes.  Still, that's a fair amount of logic to add into every device 
driver.  We should be able to centralize it somehow.

Also, just to make things more difficult, these errors are reported in 
atomic context but the recovery procedure has to happen in process 
context.  Which means there has to be a way to cancel the recovery 
procedure if it's in progress when the driver is unbound.

> Just want to give my 2 cents here: I experimented and have seen Windows
> drivers do retry for MSC (particularly for UASP devices), and it works
> well.
> 
> The retry is not specifically at the failed URB. It's a retry of the
> UASP command where the data block offset is specified, and the entire
> transfer is resent.

Right.  usb-storage and uas rely on the SCSI layer to retry failed 
commands; we don't need to worry about them.

> This will probably not work for applications with isoc endpoints where
> timing is critical or applications without some synchronization
> protocol in their transfer headers.

Because the host and the device may disagree about whether the last 
transaction was received.  USB-2 would handle this okay if we skip the 
usb_clear_halt() step, but I'm not so sure that xHCI controllers will 
allow it to be skipped.
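
The disagreement can be illustrated with a toy model of the USB 2.0 data toggle for OUT transfers. Everything here is a simplified assumption (`toggle_link`, `send_out()` and `model_clear_halt()` are hypothetical names): the device consumes a packet only if the toggle matches what it expects, ACKs a mismatched packet without consuming it, and clearing a halt resets both sides to DATA0.

```c
#include <stdbool.h>

struct toggle_link {
	bool host_toggle;       /* toggle the host will send next */
	bool dev_toggle;        /* toggle the device expects next */
	unsigned int accepted;  /* packets the device actually consumed */
};

/* One OUT transaction carrying the host's current packet.
 * Returns true if the host saw the ACK and can move on. */
static bool send_out(struct toggle_link *l, bool ack_lost)
{
	if (l->host_toggle == l->dev_toggle) {
		l->accepted++;                 /* new data: consume it */
		l->dev_toggle = !l->dev_toggle;
	}
	/* On a mismatch the device still ACKs but discards the repeat. */
	if (ack_lost)
		return false;                  /* host keeps toggle, resends */
	l->host_toggle = !l->host_toggle;
	return true;
}

/* Model of the toggle reset implied by clearing a halt. */
static void model_clear_halt(struct toggle_link *l)
{
	l->host_toggle = false;
	l->dev_toggle = false;
}
```

In this model, resending after a lost ACK without the reset leaves `accepted` at 1, while resetting first makes the device consume the same packet a second time.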

Alan Stern


* Re: correctly handling EPROTO
  2026-03-14  2:39             ` Alan Stern
@ 2026-03-16 12:58               ` Oliver Neukum
  2026-03-16 14:02                 ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-16 12:58 UTC (permalink / raw)
  To: Alan Stern, Thinh Nguyen; +Cc: Michal Pecio, Bjørn Mork, USB list

On 14.03.26 03:39, Alan Stern wrote:
> On Fri, Mar 13, 2026 at 10:45:32PM +0000, Thinh Nguyen wrote:
>> On Fri, Mar 13, 2026, Alan Stern wrote:

> Part of what we are discussing is how to carry out a retry.  It seems
> that the most general approach is to unlink all pending URBs for the
> endpoint, wait for them to complete, call usb_clear_halt(), and then
> resubmit everything.

Yes. That raises the question how much can be centralized.
  
> And of course, isochronous transfers are never retried, by definition.

Do we still need to clear a halt?

>>> I tend to group transaction-level errors like EPROTO into three
>>> categories:
>>>
>>> 	1. Device has been unplugged, hub will notify us soon;
>>>
>>> 	2. Unrecoverable device problem, needs reset or power cycle;
>>>
>>> 	3. Short term problem (cable issue, EMI, system load).
>>>
>>> Retrying makes sense for 3 but not for 1 or 2.  Unfortunately we can't
>>> tell which category a particular fault lies in.
>>
>> There's no need to distinguish them if we have a proper fallback
>> recovery (such as reset/power cycle) should the retry fail, as noted above.
> 
> Yes.  Still, that's a fair amount of logic to add into every device
> driver.  We should be able to centralize it somehow.

That would suggest implementing an equivalent of usb_queue_reset_device()
for clearing halts.

> Also, just to make things more difficult, these errors are reported in
> atomic context but the recovery procedure has to happen in process
> context.  Which means there has to be a way to cancel the recovery
> procedure if it's in progress when the driver is unbound.

Well, no. Not exactly. If it is necessary to clear a halt before
you can communicate with the device again, we cannot reprobe
the device before the error is handled. It wouldn't work.
We need to wait for error handling to complete if the driver
is unbound.
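
A single-threaded userspace model of that ordering, purely as a sketch: the completion handler only records that recovery is needed, the work runs later in process context, and unbind finishes pending recovery instead of dropping it. `ep_recovery` and the `recovery_*()` functions are hypothetical names. In the kernel the two behaviours would roughly correspond to flush_work() as opposed to cancel_work_sync().

```c
#include <stdatomic.h>
#include <stdbool.h>

struct ep_recovery {
	atomic_bool pending;    /* set from atomic context */
	bool needs_clear_halt;  /* model: endpoint unusable until recovered */
	bool bound;             /* driver still bound to the interface */
};

/* Completion-handler side: must not sleep, so only note the request. */
static void recovery_request(struct ep_recovery *r)
{
	r->needs_clear_halt = true;
	atomic_store(&r->pending, true);
}

/* Process-context side: stands in for the sleeping recovery steps. */
static void recovery_work(struct ep_recovery *r)
{
	if (atomic_exchange(&r->pending, false))
		r->needs_clear_halt = false;
}

/* Unbind waits for error handling to complete rather than cancelling it,
 * so the next driver to probe finds a usable endpoint. */
static void recovery_unbind(struct ep_recovery *r)
{
	recovery_work(r);
	r->bound = false;
}
```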

	Regards
		Oliver



* Re: correctly handling EPROTO
  2026-03-16 12:58               ` Oliver Neukum
@ 2026-03-16 14:02                 ` Alan Stern
  2026-03-16 14:47                   ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-16 14:02 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On Mon, Mar 16, 2026 at 01:58:34PM +0100, Oliver Neukum wrote:
> On 14.03.26 03:39, Alan Stern wrote:
> > On Fri, Mar 13, 2026 at 10:45:32PM +0000, Thinh Nguyen wrote:
> > > On Fri, Mar 13, 2026, Alan Stern wrote:
> 
> > Part of what we are discussing is how to carry out a retry.  It seems
> > that the most general approach is to unlink all pending URBs for the
> > endpoint, wait for them to complete, call usb_clear_halt(), and then
> > resubmit everything.
> 
> Yes. That raises the question how much can be centralized.
> > And of course, isochronous transfers are never retried, by definition.
> 
> Do we still need to clear a halt?

Isochronous endpoints do not halt, and isochronous transfers are never 
retried.  And although the spec doesn't seem to say this explicitly, I 
believe isochronous endpoints do not pay any attention to the HALT 
feature setting (which can be changed by a Set-Feature or Clear-Feature 
request).

> > > > I tend to group transaction-level errors like EPROTO into three
> > > > categories:
> > > > 
> > > > 	1. Device has been unplugged, hub will notify us soon;
> > > > 
> > > > 	2. Unrecoverable device problem, needs reset or power cycle;
> > > > 
> > > > 	3. Short term problem (cable issue, EMI, system load).
> > > > 
> > > > Retrying makes sense for 3 but not for 1 or 2.  Unfortunately we can't
> > > > tell which category a particular fault lies in.
> > > 
> > > There's no need to distinguish them if we have a proper fallback
> > > recovery (such as reset/power cycle) should the retry fail, as noted above.
> > 
> > Yes.  Still, that's a fair amount of logic to add into every device
> > driver.  We should be able to centralize it somehow.
> 
> That would suggest implementing an equivalent of usb_queue_reset_device()
> for clearing halts.

My thought exactly.

> > Also, just to make things more difficult, these errors are reported in
> > atomic context but the recovery procedure has to happen in process
> > context.  Which means there has to be a way to cancel the recovery
> > procedure if it's in progress when the driver is unbound.
> 
> Well, no. Not exactly. If it is necessary to clear a halt before
> you can communicate with the device again, we cannot reprobe
> the device before the error is handled. It wouldn't work.
> We need to wait for error handling to complete if the driver
> is unbound.

Good point.  So not quite the same behavior as usb_queue_reset_device().

Alan Stern


* Re: correctly handling EPROTO
  2026-03-16 14:02                 ` Alan Stern
@ 2026-03-16 14:47                   ` Oliver Neukum
  2026-03-16 17:33                     ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-16 14:47 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum
  Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On 16.03.26 15:02, Alan Stern wrote:
> On Mon, Mar 16, 2026 at 01:58:34PM +0100, Oliver Neukum wrote:
>> On 14.03.26 03:39, Alan Stern wrote:

>> Yes. That raises the question how much can be centralized.
>>> And of course, isochronous transfers are never retried, by definition.
>>
>> Do we still need to clear a halt?
> 
> Isochronous endpoints do not halt, and isochronous transfers are never
> retried.  And although the spec doesn't seem to say this explicitly, I
> believe isochronous endpoints do not pay any attention to the HALT
> feature setting (which can be changed by a Set-Feature or Clear-Feature
> request).

That then raises the question how we resync.
  
>> That would suggest implementing an equivalent of usb_queue_reset_device()
>> for clearing halts.
> 
> My thought exactly.

Good. It would need to take a callback as an argument and in principle
you could have this for multiple endpoints. Any ideas for the API?
  
>>> Also, just to make things more difficult, these errors are reported in
>>> atomic context but the recovery procedure has to happen in process
>>> context.  Which means there has to be a way to cancel the recovery
>>> procedure if it's in progress when the driver is unbound.
>>
>> Well, no. Not exactly. If it is necessary to clear a halt before
>> you can communicate with the device again, we cannot reprobe
>> the device before the error is handled. It wouldn't work.
>> We need to wait for error handling to complete if the driver
>> is unbound.
> 
> Good point.  So not quite the same behavior as usb_queue_reset_device().

Actually you make me wonder whether the semantics of
usb_queue_reset_device() are good.

	Regards
		Oliver



* Re: correctly handling EPROTO
  2026-03-16 14:47                   ` Oliver Neukum
@ 2026-03-16 17:33                     ` Alan Stern
  2026-03-16 19:32                       ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-16 17:33 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On Mon, Mar 16, 2026 at 03:47:49PM +0100, Oliver Neukum wrote:
> On 16.03.26 15:02, Alan Stern wrote:
> > On Mon, Mar 16, 2026 at 01:58:34PM +0100, Oliver Neukum wrote:
> > > On 14.03.26 03:39, Alan Stern wrote:
> 
> > > Yes. That raises the question how much can be centralized.
> > > > And of course, isochronous transfers are never retried, by definition.
> > > 
> > > Do we still need to clear a halt?
> > 
> > Isochronous endpoints do not halt, and isochronous transfers are never
> > retried.  And although the spec doesn't seem to say this explicitly, I
> > believe isochronous endpoints do not pay any attention to the HALT
> > feature setting (which can be changed by a Set-Feature or Clear-Feature
> > request).
> 
> That then raises the question how we resync.

That's handled at the class level.  In the simplest approach there is no 
resync.  The host just keeps trying to send or receive isochronous 
packets at the previously scheduled intervals, and some data is lost.  
Consider an audio or video stream, for example.
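
The "no resync" approach can be sketched as a consumer that skips failed packets and carries on. `struct iso_packet` is a hypothetical stand-in for the per-packet status, offset and length that an isochronous URB reports; `iso_collect()` is likewise an assumed name.

```c
#include <stddef.h>
#include <string.h>

struct iso_packet {
	int status;            /* 0 on success, negative errno otherwise */
	size_t offset;         /* start of this packet's data in the buffer */
	size_t actual_length;  /* bytes actually received */
};

/* Copy the payload of all good packets to out and count the bad ones.
 * Returns the number of bytes copied. */
static size_t iso_collect(const unsigned char *buf,
			  const struct iso_packet *pkts, size_t npkts,
			  unsigned char *out, size_t out_size,
			  unsigned int *lost)
{
	size_t copied = 0;

	*lost = 0;
	for (size_t i = 0; i < npkts; i++) {
		if (pkts[i].status != 0) {
			(*lost)++;  /* never retried: this data is gone */
			continue;
		}
		if (pkts[i].actual_length > out_size - copied)
			break;      /* output buffer is full */
		memcpy(out + copied, buf + pkts[i].offset,
		       pkts[i].actual_length);
		copied += pkts[i].actual_length;
	}
	return copied;
}
```

What to do about the gap (silence, a repeated frame, a discontinuity marker) is a class-level decision and is left out here.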

> > > That would suggest implementing an equivalent of usb_queue_reset_device()
> > > for clearing halts.
> > 
> > My thought exactly.
> 
> Good. It would need to take a callback as an argument and in principle
> you could have this for multiple endpoints. Any ideas for the API?

It's more complicated than just clearing halts.  What if the driver has 
queued a bunch of URBs?  They all have to be unlinked first.

Then after the halt has been cleared, the driver has to resubmit the URB 
where the error occurred (keeping in mind that some initial part of it 
may have been sent/received already).  Maybe also submit the other URBs 
that were in the unlinked queue.
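
A sketch of the "initial part may have been transferred already" adjustment, using a hypothetical `struct xfer` in place of a real URB. It assumes the reported byte count can be trusted, which may not hold when the host and the device disagree about the last packet.

```c
#include <stddef.h>

struct xfer {
	unsigned char *buf;    /* start of the data still to be moved */
	size_t length;         /* bytes requested */
	size_t actual_length;  /* bytes reported as done before the error */
};

/* Shrink the transfer to the part that has not been moved yet.
 * Returns the number of bytes left; 0 means nothing to resubmit. */
static size_t xfer_prepare_retry(struct xfer *x)
{
	size_t done = x->actual_length <= x->length ?
		      x->actual_length : x->length;

	x->buf += done;
	x->length -= done;
	x->actual_length = 0;
	return x->length;
}
```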

There has to be a retry counter or timer because the driver should give 
up after some length of time.  When that happens, should we try to reset 
the device?

It's a mess.  Implementing it in usbhid was justified because that's 
such an important driver in such widespread use.  I'm not at all sure 
how it can be generalized for all sorts of other drivers.

> > > > Also, just to make things more difficult, these errors are reported in
> > > > atomic context but the recovery procedure has to happen in process
> > > > context.  Which means there has to be a way to cancel the recovery
> > > > procedure if it's in progress when the driver is unbound.
> > > 
> > > Well, no. Not exactly. If it is necessary to clear a halt before
> > > you can communicate with the device again, we cannot reprobe
> > > the device before the error is handled. It wouldn't work.
> > > We need to wait for error handling to complete if the driver
> > > is unbound.
> > 
> > Good point.  So not quite the same behavior as usb_queue_reset_device().
> 
> Actually you make me wonder whether the semantics for
> usb_queue_reset_device() is good.

That's a separate matter.  However, a driver that is clever enough to 
call usb_queue_reset_device() should also be clever enough to call 
usb_reset_device() from within its probe routine, if needed.

Alan Stern

* Re: correctly handling EPROTO
  2026-03-16 17:33                     ` Alan Stern
@ 2026-03-16 19:32                       ` Oliver Neukum
  2026-03-17  9:05                         ` Mathias Nyman
  2026-03-17 14:31                         ` Alan Stern
  0 siblings, 2 replies; 40+ messages in thread
From: Oliver Neukum @ 2026-03-16 19:32 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum
  Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On 16.03.26 18:33, Alan Stern wrote:

> That's handled at the class level.  In the simplest approach there is no
> resync.  The host just keeps trying to send or receive isochronous
> packets at the previously scheduled intervals, and some data is lost.
> Consider an audio or video stream, for example.

Very well. We can set that aside for now.

> It's more complicated than just clearing halts.  What if the driver has
> queued a bunch of URBs?  They all have to be unlinked first.

As far as I can tell, for some hardware those URBs may already be in execution
when the error is returned. So that is a hard problem. Frankly, I do not
see what more we can do than provide a suitable operation for anchors.
  
> Then after the halt has been cleared, the driver has to resubmit the URB
> where the error occurred (keeping in mind that some initial part of it
> may have been sent/received already).  Maybe also submit the other URBs
> that were in the unlinked queue.

Correct. Hence usbcore needs to notify the driver when a halt has been
cleared. I see two obvious options. Either we provide a callback with
the helper or we declare another full callback akin to pre/post_reset.

> There has to be a retry counter or timer because the driver should give
> up after some length of time.  When that happens, should we try to reset
> the device?

We need to notify the driver when a halt is cleared. How about we
provide the option based on the return value of the notification?

> It's a mess.  Implementing it in usbhid was justified because that's
> such an important driver in such widespread use.  I'm not at all sure
> how it can be generalized for all sorts of other drivers.

Don't you think that what usbhid does is a relatively useful model
for other drivers?

>> Actually you make me wonder whether the semantics for
>> usb_queue_reset_device() is good.
> 
> That's a separate matter.  However, a driver that is clever enough to
> call usb_queue_reset_device() should also be clever enough to call
> usb_reset_device() from within its probe routine, if needed.

Yes, one issue at a time.

	Regards
		Oliver


* Re: correctly handling EPROTO
  2026-03-16 19:32                       ` Oliver Neukum
@ 2026-03-17  9:05                         ` Mathias Nyman
  2026-03-17 14:31                         ` Alan Stern
  1 sibling, 0 replies; 40+ messages in thread
From: Mathias Nyman @ 2026-03-17  9:05 UTC (permalink / raw)
  To: Oliver Neukum, Alan Stern
  Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On 3/16/26 21:32, Oliver Neukum wrote:
> On 16.03.26 18:33, Alan Stern wrote:
> 
>> It's more complicated than just clearing halts.  What if the driver has
>> queued a bunch of URBs?  They all have to be unlinked first.
> 
> As far as I can tell for some hardware those URBs may already be in execution
> when the error is returned. So that is a hard problem. Frankly I do not
> see what we can do more than provide a suitable operation for anchors.
> 

The xHC stops executing URBs in the STALL (-EPIPE) and Transaction Error (-EPROTO)
cases, but there is a driver flaw that may restart the endpoint after giving back the
URB.

I'll look into this.

-Mathias



* Re: correctly handling EPROTO
  2026-03-16 19:32                       ` Oliver Neukum
  2026-03-17  9:05                         ` Mathias Nyman
@ 2026-03-17 14:31                         ` Alan Stern
  2026-03-17 16:20                           ` Oliver Neukum
  1 sibling, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-17 14:31 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On Mon, Mar 16, 2026 at 08:32:59PM +0100, Oliver Neukum wrote:
> On 16.03.26 18:33, Alan Stern wrote:
> > It's more complicated than just clearing halts.  What if the driver has
> > queued a bunch of URBs?  They all have to be unlinked first.
> 
> As far as I can tell for some hardware those URBs may already be in execution
> when the error is returned. So that is a hard problem. Frankly I do not
> see what we can do more than provide a suitable operation for anchors.

If this happens, it's a bug in the host controller driver.  All bulk and 
interrupt endpoint queues are supposed to stop when a transaction error 
occurs.  This is mentioned explicitly in the kerneldoc.

> > Then after the halt has been cleared, the driver has to resubmit the URB
> > where the error occurred (keeping in mind that some initial part of it
> > may have been sent/received already).  Maybe also submit the other URBs
> > that were in the unlinked queue.
> 
> Correct. Hence usbcore needs to notify the driver when a halt has been
> cleared. I see two obvious options. Either we provide a callback with
> the helper or we declare another full callback akin to pre/post_reset.
> 
> > There has to be a retry counter or timer because the driver should give
> > up after some length of time.  When that happens, should we try to reset
> > the device?
> 
> We need to notify the driver when a halt is cleared. How about we
> provide the option based on the return value of the notification?

Think about what needs to happen from the driver's point of view.  An 
URB completes with a -EPROTO (or similar) error.  We need to unlink all 
URBs queued to the same endpoint, wait for them to complete, and then 
try to recover from the error.  How should a core library routine handle 
this?

Luckily the core manages an URB queue for each endpoint (see 
usb_hcd_link_urb_to_ep()), so the routine will know what URBs need to be 
unlinked.  How will the driver's completion handler know to ignore these 
URBs when they complete?
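
One existing hook for that last question: URBs cancelled with 
usb_unlink_urb() complete with -ECONNRESET and those cancelled with 
usb_kill_urb() complete with -ENOENT, so a completion handler can sort 
statuses into a few buckets.  A minimal userspace sketch follows; the 
enum and classify_status() are made-up names, and lumping unknown 
errors in with transaction errors is only an assumption:

```c
#include <errno.h>

/* Hypothetical buckets for an URB completion status (Linux errno values). */
enum urb_verdict {
	URB_OK,		/* transfer succeeded: process the data */
	URB_CANCELLED,	/* URB was unlinked or killed: ignore it */
	URB_GONE,	/* endpoint or device is going away: stop */
	URB_RECOVER,	/* transaction error or stall: start recovery */
};

static enum urb_verdict classify_status(int status)
{
	switch (status) {
	case 0:
		return URB_OK;
	case -ECONNRESET:	/* cancelled via usb_unlink_urb() */
	case -ENOENT:		/* cancelled via usb_kill_urb() */
		return URB_CANCELLED;
	case -ESHUTDOWN:
	case -ENODEV:
		return URB_GONE;
	case -EPROTO:
	case -EILSEQ:
	case -ETIME:
	case -EPIPE:
		return URB_RECOVER;
	default:
		/* Assumption: treat anything unknown like a transaction error. */
		return URB_RECOVER;
	}
}
```

This only tells the handler that an URB was cancelled, not who 
cancelled it or why, so it does not settle the question by itself.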

Following the clear-halt, the URBs need to be resubmitted.  Should this 
be done by the driver or by the library routine?

When the library finishes its work, it needs to tell the driver either 
to start running again or to give up.  Presumably by means of some 
callback.

How will the library keep track of recent error recovery attempts?  It 
needs to know when to stop doing clear-halts & retries and instead issue 
a reset.  How will this reset interact with the driver's recovery 
mechanism?

> > It's a mess.  Implementing it in usbhid was justified because that's
> > such an important driver in such widespread use.  I'm not at all sure
> > how it can be generalized for all sorts of other drivers.
> 
> Don't you think that what usbhid does is a relatively useful model
> for other drivers?

It's a good deal of reasonably complex code, which should not be copied 
to every other USB driver.  While the approach is sound, the problem is 
finding a reasonable way to implement it.

Alan Stern

* Re: correctly handling EPROTO
  2026-03-17 14:31                         ` Alan Stern
@ 2026-03-17 16:20                           ` Oliver Neukum
  2026-03-17 18:03                             ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-17 16:20 UTC (permalink / raw)
  To: Alan Stern; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list



On 17.03.26 15:31, Alan Stern wrote:
> On Mon, Mar 16, 2026 at 08:32:59PM +0100, Oliver Neukum wrote:

> If this happens, it's a bug in the host controller driver.  All bulk and
> interrupt endpoint queues are supposed to stop when a transaction error
> occurs.  This is mentioned explicitly in the kerneldoc.

Good.
  
> Think about what needs to happen from the driver's point of view.  An
> URB completes with a -EPROTO (or similar) error.  We need to unlink all
> URBs queued to the same endpoint, wait for them to complete, and then
> try to recover from the error.  How should a core library routine handle
> this?

I am not sure.

> Luckily the core manages an URB queue for each endpoint (see
> usb_hcd_link_urb_to_ep()), so the routine will know what URBs need to be
> unlinked.  How will the driver's completion handler know to ignore these
> URBs when they complete?
> 
> Following the clear-halt, the URBs need to be resubmitted.  Should this
> be done by the driver or by the library routine?

My preference would be by the driver, because it is not clear whether
simply repeating the IO is the action the driver wants to take. The IO
may have become obsolete for example.
Yet the endpoint must be made usable again.

> When the library finishes its work, it needs to tell the driver either
> to start running again or to give up.  Presumably by means of some
> callback.

Yes, but we cannot assume that a full device reset is always the next
stage. Nor, and that really hurts, can we assume that only a single driver
of the device is involved.
  
> How will the library keep track of recent error recovery attempts?  It
> needs to know when to stop doing clear-halts & retries and instead issue
> a reset.  How will this reset interact with the driver's recovery
> mechanism?

In principle we know how a reset is handled, don't we?
Again, we cannot know whether a driver is the first, let alone only, driver
to request error handling.
If we decide to reset there is no point in clearing a halt and resubmitting
URBs would be wrong.

> It's a good deal of reasonably complex code, which should not be copied
> to every other USB driver.  While the approach is sound, the problem is
> finding a reasonable way to implement it.

How else would you handle errors of this kind? It seems to me that we
need to make the delays and number of retries tunable, but other than that
what can you do generically?

	Regards
		Oliver



* Re: correctly handling EPROTO
  2026-03-17 16:20                           ` Oliver Neukum
@ 2026-03-17 18:03                             ` Alan Stern
  2026-03-18  9:54                               ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-17 18:03 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On Tue, Mar 17, 2026 at 05:20:37PM +0100, Oliver Neukum wrote:
> 
> 
> On 17.03.26 15:31, Alan Stern wrote:
> > Following the clear-halt, the URBs need to be resubmitted.  Should this
> > be done by the driver or by the library routine?
> 
> My preference would be by the driver, because it is not clear whether
> simply repeating the IO is the action the driver wants to take. The IO
> may have become obsolete for example.
> Yet the endpoint must be made usable again.

You know, with a USB-2 host controller, transaction errors don't make an 
endpoint unusable, and clear-halt isn't necessary.  I wonder if xhci-hcd 
couldn't be adjusted to behave the same way (issuing its own clear-halt 
internally, if that is needed).  That would make things simpler.

Another possibility is to change usbcore to automatically unlink all 
the endpoint's pending URBs as soon as a transaction error occurs.  Then 
drivers wouldn't have to worry about it.  But even then, drivers would 
need to know how to react when it happened.

> > When the library finishes its work, it needs to tell the driver either
> > to start running again or to give up.  Presumably by means of some
> > callback.
> 
> Yes, but we cannot assume that a full device reset is always the next
> stage. Nor, and that really hurts, can we assume that only a single driver
> of the device is involved.

Reset isn't always the next step.  In some cases the driver should just 
give up.  For instance, if the device has been unplugged.

> > How will the library keep track of recent error recovery attempts?  It
> > needs to know when to stop doing clear-halts & retries and instead issue
> > a reset.  How will this reset interact with the driver's recovery
> > mechanism?
> 
> In principle we know how a reset is handled, don't we?
> Again, we cannot know whether a driver is the first, let alone only, driver
> to request error handling.
> If we decide to reset there is no point in clearing a halt and resubmitting
> URBs would be wrong.

In practice, does resetting ever help?  With usb-storage and uas, yes, 
occasionally.  But those are unusual drivers; what about all the other 
ones?

> > It's a good deal of reasonably complex code, which should not be copied
> > to every other USB driver.  While the approach is sound, the problem is
> > finding a reasonable way to implement it.
> 
> > How else would you handle errors of this kind? It seems to me that we
> need to make the delays and number of retries tunable, but other than that
> what can you do generically?

You're right that those are the only things to be done.  The question is 
whether they can be done in a way that will be easy for all drivers to 
adopt.

Consider that while error recovery is in progress, the rest of the 
driver has to remain essentially dormant because the endpoint cannot be 
used.  I don't think many drivers are written to support such an 
operating mode.

Alan Stern

* Re: correctly handling EPROTO
  2026-03-17 18:03                             ` Alan Stern
@ 2026-03-18  9:54                               ` Oliver Neukum
  2026-03-18 17:46                                 ` Alan Stern
  0 siblings, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-18  9:54 UTC (permalink / raw)
  To: Alan Stern, Oliver Neukum
  Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On 17.03.26 19:03, Alan Stern wrote:
> On Tue, Mar 17, 2026 at 05:20:37PM +0100, Oliver Neukum wrote:

> You know, with a USB-2 host controller, transaction errors don't make an
> endpoint unusable, and clear-halt isn't necessary.  I wonder if xhci-hcd
> couldn't be adjusted to behave the same way (issuing its own clear-halt
> internally, if that is needed).  That would make things simpler.

It could be. But do you want a HCD to generate requests to endpoint 0
on its own without coordination with usbcore or drivers bound to interfaces
of the device?

It would be an elegant solution, but I think it would bite us in the
posterior. E.g. what do we do if a reset is requested or somebody
wants to suspend the device or change the configuration or a setting?

> Another possibility is to change usbcore to automatically unlink all
> the endpoint's pending URBs as soon as a transaction error occurs.  Then
> drivers wouldn't have to worry about it.  But even then, drivers would
> need to know how to react when it happened.

Yes. That looks more plausible.

> Reset isn't always the next step.  In some cases the driver should just
> give up.  For instance, if the device has been unplugged.

Then we do not care. Signalling and detecting an unplug is not a driver's
task. A driver should leave enough time for that detection to happen, but
if usbcore does not eventually tell us that a device is gone, we need
to treat errors as genuine.

>> In principle we know how a reset is handled, don't we?
>> Again, we cannot know whether a driver is the first, let alone only, driver
>> to request error handling.
>> If we decide to reset there is no point in clearing a halt and resubmitting
>> URBs would be wrong.
> 
> In practice, does resetting ever help?  With usb-storage and uas, yes,
> occasionally.  But those are unusual drivers; what about all the other
> ones?

They don't even try. UAS, storage and usbhid are the only drivers with
full error handling. usbnet has vestiges. That is about it. Others
may try to clear a spurious halt and that's it.

Short of giving up, what is the generic alternative?
As far as other examples are concerned, isn't this scheme quite close
to what SCSI does in terms of algorithm?

>> How else would you handle errors of this kind? It seems to me that we
>> need to make the delays and number of retries tunable, but other than that
>> what can you do generically?
> 
> You're right that those are the only things to be done.  The question is
> whether they can be done in a way that will be easy for all drivers to
> adopt.

Provide defaults, again best copied from usbhid.

> Consider that while error recovery is in progress, the rest of the
> driver has to remain essentially dormant because the endpoint cannot be
> used.  I don't think many drivers are written to support such an
> operating mode.

Isn't that exactly what you have to do after pre_reset() and suspend()?
Nor do you have to use this facility, do you?

	Regards
		Oliver

* Re: correctly handling EPROTO
  2026-03-18  9:54                               ` Oliver Neukum
@ 2026-03-18 17:46                                 ` Alan Stern
  2026-03-18 21:38                                   ` Michal Pecio
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-18 17:46 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On Wed, Mar 18, 2026 at 10:54:20AM +0100, Oliver Neukum wrote:
> On 17.03.26 19:03, Alan Stern wrote:
> > On Tue, Mar 17, 2026 at 05:20:37PM +0100, Oliver Neukum wrote:
> 
> > You know, with a USB-2 host controller, transaction errors don't make an
> > endpoint unusable, and clear-halt isn't necessary.  I wonder if xhci-hcd
> > couldn't be adjusted to behave the same way (issuing its own clear-halt
> > internally, if that is needed).  That would make things simpler.
> 
> It could be. But do you want a HCD to generate requests to endpoint 0
> on its own without coordination with usbcore or drivers bound to interfaces
> of the device?

Michal should be the person to answer these questions.  I guess that a 
clear-halt is necessary for synchronizing an xHCI host controller with 
the device's endpoint after an error, but I don't really know.  Maybe 
it's necessary only for USB-3 devices?

> It would be an elegant solution, but I think it would bite us in the
> posterior. E.g. what do we do if a reset is requested or somebody
> wants to suspend the device or change the configuration or a setting?

A core library routine would face the same problems.

> > Another possibility is to change usbcore to automatically unlink all
> > the endpoint's pending URBs as soon as a transaction error occurs.  Then
> > drivers wouldn't have to worry about it.  But even then, drivers would
> > need to know how to react when it happened.
> 
> Yes. That looks more plausible.
> 
> > Reset isn't always the next step.  In some cases the driver should just
> > give up.  For instance, if the device has been unplugged.
> 
> Then we do not care. Signalling and detecting an unplug is not a driver's
> task. A driver should leave enough time for that detection to happen, but
> if usbcore does not eventually tell us that a device is gone, we need
> to treat errors as genuine.

I get your point.  Think of it this way: How long should error recovery 
persist?  Let's say you retry, with increasing delays, for 500 ms and 
then you do a reset.  If you still can't communicate with the endpoint 
after that, there's really nothing else you can do.  It's time to give 
up -- take the device offline, so to speak.
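
To make that concrete, here is a minimal userspace sketch of such a 
policy: doubling delays up to a cap, one reset once the time budget is 
spent, then giving up.  Only the 500 ms budget comes from the paragraph 
above; the 10 ms first delay, the 160 ms cap and all the names are 
assumptions for illustration, not values taken from usbhid:

```c
#include <stdbool.h>

#define FIRST_DELAY_MS	10	/* assumed starting delay */
#define MAX_DELAY_MS	160	/* assumed cap on a single delay */
#define BUDGET_MS	500	/* total waiting time before a reset */

enum next_step { STEP_RETRY, STEP_RESET, STEP_GIVE_UP };

struct retry_state {
	unsigned int delay_ms;	/* delay before the next retry, 0 = idle */
	unsigned int spent_ms;	/* total delay accumulated so far */
	bool reset_done;	/* the single reset has been attempted */
};

/* Call on every failed transfer; says what to do next. */
static enum next_step on_error(struct retry_state *rs)
{
	if (rs->reset_done)
		return STEP_GIVE_UP;	/* even the reset did not help */
	if (rs->spent_ms >= BUDGET_MS) {
		rs->reset_done = true;
		return STEP_RESET;
	}
	if (rs->delay_ms == 0)
		rs->delay_ms = FIRST_DELAY_MS;
	else if (rs->delay_ms < MAX_DELAY_MS)
		rs->delay_ms *= 2;
	rs->spent_ms += rs->delay_ms;
	return STEP_RETRY;		/* wait rs->delay_ms, then resubmit */
}

/* Call on every successful transfer; forgets the error history. */
static void on_success(struct retry_state *rs)
{
	rs->delay_ms = 0;
	rs->spent_ms = 0;
	rs->reset_done = false;
}
```

With these numbers a dead endpoint gets seven retries, then one reset, 
and the next error after the reset means giving up.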

> > > In principle we know how a reset is handled, don't we?
> > > Again, we cannot know whether a driver is the first, let alone only, driver
> > > to request error handling.
> > > If we decide to reset there is no point in clearing a halt and resubmitting
> > > URBs would be wrong.
> > 
> > In practice, does resetting ever help?  With usb-storage and uas, yes,
> > occasionally.  But those are unusual drivers; what about all the other
> > ones?
> 
> They don't even try. UAS, storage and usbhid are the only drivers with
> a full error handling. usbnet has vestiges. That is about it. Others
> may try to clear a spurious halt and that's it.
> 
> Generically speaking, short of giving up, what is the generic alternative?
> As far as other examples are concerned, isn't this scheme quite close
> to what SCSI does in terms of algorithm?

It's similar.  usb-storage doesn't retry sending packets; when a 
communication error occurs it immediately does a reset and relies on the 
SCSI layer to handle retrying the higher-level command.

I'm just wondering how helpful resets will be in general.  I suspect not 
very much.

> > > How else would you handle errors of this kind? It seems to me that we
> > > need to make the delays and number of retries tunable, but other than that
> > > what can you do generically?
> > 
> > You're right that those are the only things to be done.  The question is
> > whether they can be done in a way that will be easy for all drivers to
> > adopt.
> 
> Provide defaults, again best be copied from usbhid.
> 
> > Consider that while error recovery is in progress, the rest of the
> > driver has to remain essentially dormant because the endpoint cannot be
> > used.  I don't think many drivers are written to support such an
> > operating mode.
> 
> Isn't that exactly what you have to do after pre_reset() and suspend()?
> Nor do you have to use this facility, do you?

No.  But the alternative is to give up right away.

It sounds like we're saying that a library routine would have to:

	Start a thread for handling the recovery.  In the thread:

	Call the driver's pre_reset handler.

	Unlink all URBs queued for the endpoint.

	Issue clear-halt (if needed).

	Call the driver's post_reset handler.

It's not obvious where a retry counter, increasing time delay, and 
eventual reset would fit into this scheme.
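
For discussion, the step order above can be written down as a small 
userspace model.  Everything in it is hypothetical: struct 
ep_recovery_ops, its members and recover_endpoint() do not exist in 
usbcore, the thread is left out, and the retry counter and reset are 
deliberately not addressed:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-driver hooks; none of these exist in usbcore. */
struct ep_recovery_ops {
	void (*quiesce)(void *ctx);		/* plays the role of pre_reset */
	void (*unlink_all)(void *ctx);		/* cancel URBs queued on the endpoint */
	int (*clear_halt)(void *ctx);		/* 0 or a negative errno */
	void (*resume)(void *ctx, int result);	/* plays the role of post_reset */
};

/* Run the steps in the order listed above; returns the clear-halt result. */
static int recover_endpoint(const struct ep_recovery_ops *ops, void *ctx,
			    bool need_clear_halt)
{
	int ret = 0;

	ops->quiesce(ctx);
	ops->unlink_all(ctx);
	if (need_clear_halt)
		ret = ops->clear_halt(ctx);
	ops->resume(ctx, ret);
	return ret;
}

/* Demo callbacks that only record the order in which they were called. */
struct trace {
	char steps[8];
	size_t n;
	int halt_result;	/* what the fake clear-halt should return */
	int seen_result;	/* what resume() was told */
};

static void trace_add(struct trace *t, char c)
{
	if (t->n < sizeof(t->steps) - 1)
		t->steps[t->n++] = c;
}

static void demo_quiesce(void *ctx) { trace_add(ctx, 'Q'); }
static void demo_unlink_all(void *ctx) { trace_add(ctx, 'U'); }

static int demo_clear_halt(void *ctx)
{
	struct trace *t = ctx;

	trace_add(t, 'C');
	return t->halt_result;
}

static void demo_resume(void *ctx, int result)
{
	struct trace *t = ctx;

	trace_add(t, 'R');
	t->seen_result = result;
}

static const struct ep_recovery_ops demo_ops = {
	.quiesce = demo_quiesce,
	.unlink_all = demo_unlink_all,
	.clear_halt = demo_clear_halt,
	.resume = demo_resume,
};
```

Passing the clear-halt result to the resume hook is one way the driver 
could be told whether to start running again or to give up.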

Alan Stern

* Re: correctly handling EPROTO
  2026-03-18 17:46                                 ` Alan Stern
@ 2026-03-18 21:38                                   ` Michal Pecio
  2026-03-18 23:59                                     ` Thinh Nguyen
  2026-03-19  1:56                                     ` Alan Stern
  0 siblings, 2 replies; 40+ messages in thread
From: Michal Pecio @ 2026-03-18 21:38 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, Thinh Nguyen, Bjørn Mork, USB list

On Wed, 18 Mar 2026 13:46:26 -0400, Alan Stern wrote:
> On Wed, Mar 18, 2026 at 10:54:20AM +0100, Oliver Neukum wrote:
> > On 17.03.26 19:03, Alan Stern wrote:
> > > You know, with a USB-2 host controller, transaction errors don't
> > > make an endpoint unusable, and clear-halt isn't necessary.

Depends on what you mean by "usable". If you get EPROTO reading from
a Transaction Translator and the TT discards the packet (because of
timeout on Int or by explicit SW request on Bulk) then not only is the
particular packet lost because the device got its ACK and moved on, but
I suspect the next packet will be dropped too due to toggle mismatch.

Granted, EPROTO outside of disconnections is apparently rare enough
that a horde of users demanding this to be fixed hasn't materialized.

I found simple ways to produce recoverable EPROTO at low/full speed,
but no such luck with higher speeds so far. A way to do that would be helpful.

> > > I wonder if xhci-hcd couldn't be adjusted to behave the same way
> > > (issuing its own clear-halt internally, if that is needed). That
> > > would make things simpler.
> > 
> > It could be. But do you want a HCD to generate requests to endpoint
> > 0 on its own without coordination with usbcore or drivers bound to
> > interfaces of the device?

I believe the intent is to maintain the illusion that drivers can just
resubmit the failed URB and have that become a proper retry.

It likely could be done, but it's still not the same thing as old HCDs
are doing because it opens the possibility of double delivery, while
closing the possibility of further packet loss speculated about above.

Question is if this illusion is worth fighting for, when
1. it hasn't been 100% reliable since at least USB 2.0 and TTs
2. USB-IF keeps creating problems for software trying to be like that
3. xhci-hcd has been this mess for 15 years without loud complaints

What should an HCD do if such request fails? Just refuse URBs?

> Michal should be the person to answer these questions.  I guess that
> a clear-halt is necessary for synchronizing an xHCI host controller
> with the device's endpoint after an error, but I don't really know.
> Maybe it's necessary only for USB-3 devices?

Strictly, it's the opposite - synchronize the device with the host
having already zeroed its toggle or SuperSpeed sequence number.

Such toggle mismatch at USB2 speeds results in loss of next packet.
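
A toy userspace model of an IN endpoint shows the effect, as far as I
understand the USB 2.0 toggle rules: on a PID mismatch the host takes
the packet for a retransmission, ACKs it and discards the data. All
names here are made up for illustration:

```c
#include <stdbool.h>

/* Toy model of the data toggle on one IN endpoint. */
struct in_pipe {
	unsigned int host_toggle;	/* DATA0/DATA1 the host expects next */
	unsigned int dev_toggle;	/* DATA0/DATA1 the device sends next */
};

/* One error-free IN transaction; returns true if the host kept the data. */
static bool in_transaction(struct in_pipe *p)
{
	bool accepted = (p->dev_toggle == p->host_toggle);

	/*
	 * The host ACKs either way.  On a mismatch it assumes it is seeing
	 * a retransmission of data it already has and throws it away.
	 */
	if (accepted)
		p->host_toggle ^= 1;
	p->dev_toggle ^= 1;	/* the device saw an ACK and moves on */
	return accepted;
}

/* Host-side reset only, e.g. after resetting the endpoint in the HC. */
static void host_reset_toggle(struct in_pipe *p)
{
	p->host_toggle = 0;
}

/* Clear-halt resets both sides, so they agree again. */
static void clear_halt(struct in_pipe *p)
{
	p->host_toggle = 0;
	p->dev_toggle = 0;
}
```

After host_reset_toggle() with the device at DATA1, exactly one packet
is dropped and the two sides are back in step; after clear_halt()
nothing is dropped.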

AFAIU, SuperSpeed sequence mismatch is covered by USB3 8.11, which says
"discard with no response", so maybe we would keep getting EPROTO due
to timeouts while the wrong number is being retried, but I'm not sure.

Resetting host sequence state is not mandatory, but the alternative is
intended for retrying the same URB, not another identical one. That
would be an edge case possibly exercised by no SW other than Linux.

Regards,
Michal

* Re: correctly handling EPROTO
  2026-03-18 21:38                                   ` Michal Pecio
@ 2026-03-18 23:59                                     ` Thinh Nguyen
  2026-03-19  2:07                                       ` Alan Stern
  2026-03-19  1:56                                     ` Alan Stern
  1 sibling, 1 reply; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-18 23:59 UTC (permalink / raw)
  To: Michal Pecio, Alan Stern
  Cc: Oliver Neukum, Thinh Nguyen, Bjørn Mork, USB list

On Wed, Mar 18, 2026, Michal Pecio wrote:
> On Wed, 18 Mar 2026 13:46:26 -0400, Alan Stern wrote:
> > On Wed, Mar 18, 2026 at 10:54:20AM +0100, Oliver Neukum wrote:
> > > On 17.03.26 19:03, Alan Stern wrote:
> > > > You know, with a USB-2 host controller, transaction errors don't
> > > > make an endpoint unusable, and clear-halt isn't necessary.
> 
> Depends on what you mean by "usable". If you get EPROTO reading from
> a Transaction Translator and the TT discards the packet (because of
> timeout on Int or by explicit SW request on Bulk) then not only is the
> particular packet lost because the device got its ACK and moved on, but
> I suspect the next packet will be dropped too due to toggle mismatch.
> 
> Granted, EPROTO outside of disconnections is apparently rare enough
> that a horde of users demanding this to be fixed hasn't materialized.

I've seen Windows drivers handle UAS transaction error recovery
through clear-halt and retrying the SCSI command, and it works well. I hope to
see this type of recovery implemented for usb storage and uas devices in
the future.

> 
> I found simple ways to produce recoverable EPROTO at low/full speed,
> but no such luck with higher speeds so far. It would be helpful.

Get a bad cable; that should help trigger transaction errors. Back
when I was doing more hardware testing, we were able to trigger this
more easily when testing BOT and UAS devices behind certain hubs and
docks.

> 
> > > > I wonder if xhci-hcd couldn't be adjusted to behave the same way
> > > > (issuing its own clear-halt internally, if that is needed). That
> > > > would make things simpler.
> > > 
> > > It could be. But do you want a HCD to generate requests to endpoint
> > > 0 on its own without coordination with usbcore or drivers bound to
> > > interfaces of the device?
> 
> I believe the intent is to maintain the illusion that drivers can just
> resubmit the failed URB and have that become a proper retry.
> 
> It likely could be done, but it's still not the same thing as old HCDs
> are doing because it opens the possibility of double delivery, while
> closing the possibility of further packet loss speculated about above.
> 
> Question is if this illusion is worth fighting for, when
> 1. it hasn't been 100% reliable since at least USB 2.0 and TTs
> 2. USB-IF keeps creating problems for software trying to be like that
> 3. xhci-hcd has been this mess for 15 years without loud complaints
> 
> What should an HCD do if such request fails? Just refuse URBs?
> 
> > Michal should be the person to answer these questions.  I guess that
> > a clear-halt is necessary for synchronizing an xHCI host controller
> > with the device's endpoint after an error, but I don't really know.
> > Maybe it's necessary only for USB-3 devices?
> 
> Strictly, it's the opposite - synchronize the device with the host
> having already zeroed its toggle or SuperSpeed sequence number.
> 
> Such toggle mismatch at USB2 speeds results in loss of next packet.
> 
> AFAIU, SuperSpeed sequence mismatch is covered by USB3 8.11, which says
> "discard with no response", so maybe we would keep getting EPROTO due
> to timeouts while wrong number is being retried, but not sure.
> 
> Resetting host sequence state is not mandatory, but the alternative is
> intended for retrying the same URB, not another identical one. That
> would be an edge case possibly exercised by no SW other than Linux.
> 

Whether to retry the URB or send a new set of URBs should be a
decision for the class driver, where synchronization at the higher
protocol level may be needed. An URB failing with -EPROTO may mean that some
previously successful transfers need to be discarded and retried too.

What we do know is that -EPROTO means the xhci endpoint was
halted, and xhci needs to reset (not soft retry) the endpoint
before it can be used again. Since the bulk sequence is reset by the reset
ep command, we'd need clear-halt to synchronize with the device side.
The reset ep command behavior (and when to use it) is xhci specific, so
IMHO, it should be the xhci driver that handles the clear-halt. Which URB(s)
need to be retried should be a decision for the upper-layer drivers.

BR,
Thinh

* Re: correctly handling EPROTO
  2026-03-18 21:38                                   ` Michal Pecio
  2026-03-18 23:59                                     ` Thinh Nguyen
@ 2026-03-19  1:56                                     ` Alan Stern
  2026-03-19  8:40                                       ` Mathias Nyman
  2026-03-19  8:55                                       ` Michal Pecio
  1 sibling, 2 replies; 40+ messages in thread
From: Alan Stern @ 2026-03-19  1:56 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Oliver Neukum, Thinh Nguyen, Bjørn Mork, USB list

On Wed, Mar 18, 2026 at 10:38:51PM +0100, Michal Pecio wrote:
> On Wed, 18 Mar 2026 13:46:26 -0400, Alan Stern wrote:
> > On Wed, Mar 18, 2026 at 10:54:20AM +0100, Oliver Neukum wrote:
> > > On 17.03.26 19:03, Alan Stern wrote:
> > > > You know, with a USB-2 host controller, transaction errors don't
> > > > make an endpoint unusable, and clear-halt isn't necessary.
> 
> Depends on what you mean by "usable". If you get EPROTO reading from
> a Transaction Translator and the TT discards the packet (because of
> timeout on Int or by explicit SW request on Bulk) then not only is the
> particular packet lost because the device got its ACK and moved on, but
> I suspect the next packet will be dropped too due to toggle mismatch.

(Is it the TT that keeps track of the toggle state, or the host 
controller?  I don't remember and I'm too lazy to look up the answer.)

By "unusable", I meant that no more data could be transmitted through 
that endpoint until some sort of recovery action had been carried out 
(such as clear-halt, set-config, or device reset).

Yes, data loss is sometimes inevitable, and we shouldn't worry too much 
about things we can't prevent.

> Granted, EPROTO outside of disconnections is apparently rare enough
> that a horde of users demanding this to be fixed hasn't materialized.

The major example showed up just a few weeks ago.  It was an old system 
where some component (the PCI bus?) apparently could become saturated at 
high load, leading to communication failures because the host controller 
couldn't access the data needed to keep up with its scheduled work.

> I found simple ways to produce recoverable EPROTO at low/full speed,
> but no such luck with higher speeds so far. It would be helpful.
> 
> > > > I wonder if xhci-hcd couldn't be adjusted to behave the same way
> > > > (issuing its own clear-halt internally, if that is needed). That
> > > > would make things simpler.
> > > 
> > > It could be. But do you want a HCD to generate requests to endpoint
> > > 0 on its own without coordination with usbcore or drivers bound to
> > > interfaces of the device?
> 
> I believe the intent is to maintain the illusion that drivers can just
> resubmit the failed URB and have that become a proper retry.
> 
> It likely could be done, but it's still not the same thing as old HCDs
> are doing because it opens the possibility of double delivery, while
> closing the possibility of further packet loss speculated about above.
> 
> Question is if this illusion is worth fighting for, when
> 1. it hasn't been 100% reliable since at least USB 2.0 and TTs
> 2. USB-IF keeps creating problems for software trying to be like that
> 3. xhci-hcd has been this mess for 15 years without loud complaints
> 
> What should an HCD do if such request fails? Just refuse URBs?

Do nothing.  Log an error message, continue on, and hope for the best.

I'm not sure what sort of transfers will really want to go through the 
retry procedure.  With usbhid, we're talking about a stream of interrupt 
URBs.  If some data gets lost it's not a good thing, but the user can 
probably handle it -- provided the data stream manages to pick up more 
or less where it left off (and the shorter the recovery delay, the 
better).

For other types of transfers (i.e., not data streams), I do not have a 
clear idea of what requirements drivers will have.

> > Michal should be the person to answer these questions.  I guess that
> > a clear-halt is necessary for synchronizing an xHCI host controller
> > with the device's endpoint after an error, but I don't really know.
> > Maybe it's necessary only for USB-3 devices?
> 
> Strictly, it's the opposite - synchronize the device with the host
> having already zeroed its toggle or SuperSpeed sequence number.
> 
> Such toggle mismatch at USB2 speeds results in loss of next packet.

Just to be clear, are you saying there's no way for an xHC to restart a 
(host-side) halted non-SuperSpeed endpoint without setting the toggle 
back to 0?

> AFAIU, SuperSpeed sequence mismatch is covered by USB3 8.11, which says
> "discard with no response", so maybe we would keep getting EPROTO due
> to timeouts while wrong number is being retried, but not sure.
> 
> Resetting host sequence state is not mandatory, but the alternative is
> intended for retrying the same URB, not another identical one. That
> would be an edge case possibly exercised by no SW other than Linux.

It does seem like the USB-IF has not given much thought to this problem.

Alan Stern

* Re: correctly handling EPROTO
  2026-03-18 23:59                                     ` Thinh Nguyen
@ 2026-03-19  2:07                                       ` Alan Stern
  2026-03-19 23:16                                         ` Thinh Nguyen
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-19  2:07 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: Michal Pecio, Oliver Neukum, Bjørn Mork, USB list

On Wed, Mar 18, 2026 at 11:59:21PM +0000, Thinh Nguyen wrote:
> On Wed, Mar 18, 2026, Michal Pecio wrote:
> > On Wed, 18 Mar 2026 13:46:26 -0400, Alan Stern wrote:
> > > On Wed, Mar 18, 2026 at 10:54:20AM +0100, Oliver Neukum wrote:
> > > > On 17.03.26 19:03, Alan Stern wrote:
> > > > > You know, with a USB-2 host controller, transaction errors don't
> > > > > make an endpoint unusable, and clear-halt isn't necessary.
> > 
> > Depends on what you mean by "usable". If you get EPROTO reading from
> > a Transaction Translator and the TT discards the packet (because of
> > timeout on Int or by explicit SW request on Bulk) then not only is the
> > particular packet lost because the device got its ACK and moved on, but
> > I suspect the next packet will be dropped too due to toggle mismatch.
> > 
> > Granted, EPROTO outside of disconnections is apparently rare enough
> > that a horde of users demanding this to be fixed hasn't materialized.
> 
> I've seen Windows drivers handling UAS transaction error recovery
> through clear-halt and retry SCSI command, and it works well. I hope to
> see this type of recovery implemented for usb storage and uas devices in
> the future.

I don't know about uas, but usb-storage handles transaction error 
recovery in approximately the same way.  A clear-halt is issued only if 
the device sent a halt token -- but that's not considered a transaction 
error.  Otherwise, for things like -EPROTO, usb-storage does a device 
reset and the SCSI command is retried.  Possibly skipping some initial 
portion of the data if the transfer was partially successful.  (This 
might not work very well for things like tape drives, but disk drives 
have the convenient feature that reads and writes are idempotent.)
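
As a rough illustration of the policy just described, here is a minimal
userspace sketch. It is not usb-storage's real code: the enum and the
function name are made up for this example, and treating every unexpected
status like -EPROTO is a simplifying assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names, for illustration only; not usb-storage's real code. */
enum recovery_action {
	RECOVER_NONE,         /* transfer succeeded, nothing to do */
	RECOVER_CLEAR_HALT,   /* the device sent a halt (STALL) token */
	RECOVER_DEVICE_RESET  /* transaction error or anything unexpected */
};

/* status is an URB-style status: 0 or a negative errno value. */
static enum recovery_action pick_recovery(int status)
{
	assert(status <= 0);

	switch (status) {
	case 0:
		return RECOVER_NONE;
	case -EPIPE:
		return RECOVER_CLEAR_HALT;
	case -EPROTO:
	default:
		/* Assumption: unknown errors are handled like -EPROTO. */
		return RECOVER_DEVICE_RESET;
	}
}
```

After the reset, the SCSI command would be retried by the caller; that
part is outside this sketch.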

> The retrying of the URB or sending a new set of URBs should be a
> decision by the class driver where synchronization at the higher
> protocol may be needed. An URB failed with -EPROTO may mean some
> previous successful transfers need to be discarded and retried also.

That's a good point.  There's only so much we can expect the core to 
handle.

> What we do know is that an -EPROTO means that the xhci endpoint state
> was halted, the xhci would need to reset (not soft retry) the endpoint
> before it can be used again. Since the bulk sequence is reset from reset
> ep command, we'd need clear-halt to synchronize with the device side.
> The reset ep command behavior (and when to use it) is xhci specific, so
> IMHO, it should be up to the xhci driver to handle the clear-halt. Which
> URB(s) need to be retried should be a decision by the upper-layer drivers.

And for which drivers will we want to go to the trouble of adding this 
kind of error recovery?  Alternatives include doing just enough to make 
the endpoint start working again and ignoring any data loss, or 
declaring the whole device to be offline (which would need at least an 
unbind and maybe even a power cycle to recover from).

Alan Stern

* Re: correctly handling EPROTO
  2026-03-19  1:56                                     ` Alan Stern
@ 2026-03-19  8:40                                       ` Mathias Nyman
  2026-03-19 23:34                                         ` Thinh Nguyen
  2026-03-19  8:55                                       ` Michal Pecio
  1 sibling, 1 reply; 40+ messages in thread
From: Mathias Nyman @ 2026-03-19  8:40 UTC (permalink / raw)
  To: Alan Stern, Michal Pecio
  Cc: Oliver Neukum, Thinh Nguyen, Bjørn Mork, USB list

On 3/19/26 03:56, Alan Stern wrote:

> Just to be clear, are you saying there's no way for an xHC to restart a
> (host-side) halted non-SuperSpeed endpoint without setting the toggle
> back to 0?

There is.
A reset endpoint command with the TSP flag (transfer state preserve)
clears the host side halt and preserves the toggle state.

It's used for soft-retry purposes, retrying a transfer after
a transaction error. This is also the only use-case described in xHCI specification.
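
To make the two variants concrete, here is a toy userspace model of the
host-side state. It is only a sketch of the behavior described above, not
xhci-hcd code; the struct, its fields and the function name are invented
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of host-side endpoint state; all names are made up. */
struct ep_model {
	bool halted;      /* host controller refuses to run the endpoint */
	unsigned toggle;  /* host-side data toggle, 0 or 1 */
};

/*
 * Model of the Reset Endpoint command. With tsp (Transfer State Preserve)
 * set, the halt is cleared and the toggle is kept. Without it, the halt
 * is cleared and the toggle goes back to 0.
 */
static void model_reset_endpoint(struct ep_model *ep, bool tsp)
{
	assert(ep->toggle <= 1);

	ep->halted = false;
	if (!tsp)
		ep->toggle = 0;
}
```

The model says nothing about what the controller transmits next, which
is exactly the open question.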

It is unclear what happens if we clear the host side halt, preserving the toggle,
and then ask the host to move to the next URB.

It could be worth giving people a way to try it out somehow: maybe an option to
enable it via debugfs, maybe a quirk, or even just a patch, to see how different
xHC hosts behave.

-Mathias

* Re: correctly handling EPROTO
  2026-03-19  1:56                                     ` Alan Stern
  2026-03-19  8:40                                       ` Mathias Nyman
@ 2026-03-19  8:55                                       ` Michal Pecio
  2026-03-19 14:24                                         ` Alan Stern
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Pecio @ 2026-03-19  8:55 UTC (permalink / raw)
  To: Alan Stern; +Cc: Oliver Neukum, Thinh Nguyen, Bjørn Mork, USB list

On Wed, 18 Mar 2026 21:56:27 -0400, Alan Stern wrote:
> On Wed, Mar 18, 2026 at 10:38:51PM +0100, Michal Pecio wrote:
> > If you get EPROTO reading from a Transaction Translator and the TT
> > discards the packet (because of timeout on Int or by explicit SW
> > request on Bulk) then not only is the particular packet lost
> > because the device got its ACK and moved on, but I suspect the next
> > packet will be dropped too due to toggle mismatch.  
>
> (Is it the TT that keeps track of the toggle state, or the host 
> controller?  I don't remember and I'm too lazy to look up the answer.)

Good question. I skimmed USB2 chapters about TT (11.14+) and haven't
seen this spelled out clearly. However,
- I don't remember ever reading about a requirement for TT to keep
  separate toggle state upstream and downstream
- there is nothing about what CSPLIT response to send after device IN
  packet was discarded due to wrong toggle
- we would need a separate request to clear TT toggle when we clear
  host and device toggle, no such request seems to exist

So I still suspect that TT is just a dumb forwarder and that we get
toggle mismatch when a packet is lost on the HS bus, which sets us up
to lose the next valid packet too.

> By "unusable", I meant that no more data could be transmitted through 
> that endpoint until some sort of recovery action had been carried out 
> (such as clear-halt, set-config, or device reset).
>
> Yes, data loss is sometimes inevitable, and we shouldn't worry too
> much about things we can't prevent.

But an EP which loses data is not as usable as we might wish for.
Doing usb_clear_halt() prevents at least that second loss, which may
occur in the far future when we think the error has been solved already.
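
The suspected double loss, and how resetting both toggles avoids the
second one, can be sketched with a toy userspace simulation. This models
my suspicion, not verified TT behavior; all names are made up, and the
clear-halt is reduced to setting both toggles back to 0.

```c
#include <assert.h>
#include <stdbool.h>

struct in_pipe {
	unsigned dev_toggle;   /* toggle the device will send next */
	unsigned host_toggle;  /* toggle the host expects next */
	int delivered;         /* packets the host actually accepted */
};

/*
 * One IN transaction. lost_upstream means the device's packet was ACKed
 * on the downstream bus but never reached the host.
 */
static void in_transaction(struct in_pipe *p, bool lost_upstream)
{
	unsigned sent = p->dev_toggle;

	assert(p->dev_toggle <= 1 && p->host_toggle <= 1);

	p->dev_toggle ^= 1;		/* device saw an ACK and moves on */
	if (lost_upstream)
		return;			/* host saw an error, toggle unchanged */
	if (sent == p->host_toggle) {
		p->host_toggle ^= 1;	/* accepted as new data */
		p->delivered++;
	}
	/* else: looks like a stale retransmission, ACKed and dropped */
}

/* Returns how many of three packets arrive when the first one is lost. */
static int delivered_after_loss(bool clear_halt_after_error)
{
	struct in_pipe p = { 0, 0, 0 };

	in_transaction(&p, true);
	if (clear_halt_after_error)
		p.dev_toggle = p.host_toggle = 0;	/* both ends back to DATA0 */
	in_transaction(&p, false);
	in_transaction(&p, false);
	return p.delivered;
}
```

In this model only one of the three packets arrives without the
clear-halt, and two arrive with it.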

USB-IF doesn't seem to consider "pipes" a reliable transport and
expects class drivers to anticipate data loss and work around it.

HID, for example, appears to provide a mandatory control request to
poll for the current state of input controls. This could recover lost
data (except from mice) after resetting the pipe to avoid more loss.

> The major example showed up just a few weeks ago.  It was an old
> system where some component (the PCI bus?) apparently could become
> saturated at high load, leading to communication failures because the
> host controller couldn't access the data needed to keep up with its
> scheduled work.

I'm aware of this bug and it's an odd one, because I don't understand
why reducing retry delay seems to help.

> Just to be clear, are you saying there's no way for an xHC to restart
> a (host-side) halted non-SuperSpeed endpoint without setting the
> toggle back to 0?

To be clear, I'm saying that most xHCI people probably think so. You
have seen Thinh's reaction to my suggestion that we could do otherwise.

We need a Reset Endpoint command (xHCI 4.6.8) to clear the Halted flag
in xHC, otherwise it will refuse to run the endpoint to protect us from
race conditions (new submission while the error is being reported).

Clearing toggle/sequence state is optional. If we don't clear it then
"the endpoint shall continue execution by retrying the last transaction
[after restart] if no other commands have been issued to the endpoint."

We can command it to move to the next URB (possibly not submitted yet)
and only then restart the endpoint. But this is another weird thing
that Linux does, possibly nobody else does it, USB-IF doesn't expect
us to do it, and HW vendors may not test this edge case.

It's something that would have made sense to try 15 years ago, but
today I think you may meet resistance. Ask Mathias what he thinks.

Regards,
Michal

* Re: correctly handling EPROTO
  2026-03-19  8:55                                       ` Michal Pecio
@ 2026-03-19 14:24                                         ` Alan Stern
  0 siblings, 0 replies; 40+ messages in thread
From: Alan Stern @ 2026-03-19 14:24 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Oliver Neukum, Thinh Nguyen, Bjørn Mork, USB list

On Thu, Mar 19, 2026 at 09:55:06AM +0100, Michal Pecio wrote:
> On Wed, 18 Mar 2026 21:56:27 -0400, Alan Stern wrote:
> > On Wed, Mar 18, 2026 at 10:38:51PM +0100, Michal Pecio wrote:
> > > If you get EPROTO reading from a Transaction Translator and the TT
> > > discards the packet (because of timeout on Int or by explicit SW
> > > request on Bulk) then not only is the particular packet lost
> > > because the device got its ACK and moved on, but I suspect the next
> > > packet will be dropped too due to toggle mismatch.  
> >
> > (Is it the TT that keeps track of the toggle state, or the host 
> > controller?  I don't remember and I'm too lazy to look up the answer.)
> 
> Good question. I skimmed USB2 chapters about TT (11.14+) and haven't
> seen this spelled out clearly. However,
> - I don't remember ever reading about a requirement for TT to keep
>   separate toggle state upstream and downstream
> - there is nothing about what CSPLIT response to send after device IN
>   packet was discarded due to wrong toggle
> - we would need a separate request to clear TT toggle when we clear
>   host and device toggle, no such request seems to exist
> 
> So I still suspect that TT is just a dumb forwarder and that we get
> toggle mismatch when a packet is lost on the HS bus, which sets us up
> to lose the next valid packet too.

I overcame my laziness and checked the USB 2.0 spec.  You are right; the 
toggle control is in the host controller.  See Figures 11-48 and 11-51 
(bulk/control OUT and IN, respectively).

> > By "unusable", I meant that no more data could be transmitted through 
> > that endpoint until some sort of recovery action had been carried out 
> > (such as clear-halt, set-config, or device reset).
> >
> > Yes, data loss is sometimes inevitable, and we shouldn't worry too
> > much about things we can't prevent.
> 
> But an EP which loses data is not as usable as we might wish for.
> Doing usb_clear_halt() prevents at least that second loss, which may
> occur in the far future when we think the error has been solved already.

Granted.  However, we should be able to avoid issuing the clear-halt if 
the device is attached to a USB-2 controller.

> USB-IF doesn't seem to consider "pipes" a reliable transport and
> expects class drivers to anticipate data loss and work around it.

Yeah.  As far as I can tell, most don't bother to specify anything about 
this.

> HID, for example, appears to provide a mandatory control request to
> poll for the current state of input controls. This could recover lost
> data (except from mice) after resetting the pipe to avoid more loss.
>  
> > The major example showed up just a few weeks ago.  It was an old
> > system where some component (the PCI bus?) apparently could become
> > saturated at high load, leading to communication failures because the
> > host controller couldn't access the data needed to keep up with its
> > scheduled work.
> 
> I'm aware of this bug and it's an odd one, because I don't understand
> why reducing retry delay seems to help.

It doesn't help the recovery effort.  But it does improve the user 
experience by minimizing the number of lost keystrokes or mouse clicks.

> > Just to be clear, are you saying there's no way for an xHC to restart
> > a (host-side) halted non-SuperSpeed endpoint without setting the
> > toggle back to 0?
> 
> To be clear, I'm saying that most xHCI people probably think so. You
> have seen Thinh's reaction to my suggestion that we could do otherwise.
> 
> We need a Reset Endpoint command (xHCI 4.6.8) to clear the Halted flag
> in xHC, otherwise it will refuse to run the endpoint to protect us from
> race conditions (new submission while the error is being reported).
> 
> Clearing toggle/sequence state is optional. If we don't clear it then
> "the endpoint shall continue execution by retrying the last transaction
> [after restart] if no other commands have been issued to the endpoint."
> 
> We can command it to move to the next URB (possibly not submitted yet)
> and only then restart the endpoint. But this is another weird thing
> that Linux does, possibly nobody else does it, USB-IF doesn't expect
> us to do it, and HW vendors may not test this edge case.
> 
> It's something that would have made sense to try 15 years ago, but
> today I think you may meet resistance. Ask Mathias what he thinks.

I'm getting a definite feeling that we shouldn't try to depend on this.

Alan Stern

* Re: correctly handling EPROTO
  2026-03-19  2:07                                       ` Alan Stern
@ 2026-03-19 23:16                                         ` Thinh Nguyen
  2026-03-20  9:58                                           ` Michal Pecio
  2026-03-20 16:20                                           ` Alan Stern
  0 siblings, 2 replies; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-19 23:16 UTC (permalink / raw)
  To: Alan Stern
  Cc: Thinh Nguyen, Michal Pecio, Oliver Neukum, Bjørn Mork,
	USB list

On Wed, Mar 18, 2026, Alan Stern wrote:
> On Wed, Mar 18, 2026 at 11:59:21PM +0000, Thinh Nguyen wrote:
> > On Wed, Mar 18, 2026, Michal Pecio wrote:
> > > On Wed, 18 Mar 2026 13:46:26 -0400, Alan Stern wrote:
> > > > On Wed, Mar 18, 2026 at 10:54:20AM +0100, Oliver Neukum wrote:
> > > > > On 17.03.26 19:03, Alan Stern wrote:
> > > > > > You know, with a USB-2 host controller, transaction errors don't
> > > > > > make an endpoint unusable, and clear-halt isn't necessary.
> > > 
> > > Depends on what you mean by "usable". If you get EPROTO reading from
> > > a Transaction Translator and the TT discards the packet (because of
> > > timeout on Int or by explicit SW request on Bulk) then not only is the
> > > particular packet lost because the device got its ACK and moved on, but
> > > I suspect the next packet will be dropped too due to toggle mismatch.
> > > 
> > > Granted, EPROTO outside of disconnections is apparently rare enough
> > > > that a horde of users demanding this to be fixed hasn't materialized.
> > 
> > I've seen Windows drivers handling UAS transaction error recovery
> > through clear-halt and retry SCSI command, and it works well. I hope to
> > see this type of recovery implemented for usb storage and uas devices in
> > the future.
> 
> I don't know about uas, but usb-storage handles transaction error 
> recovery in approximately the same way.  A clear-halt is issued only if 
> the device sent a halt token -- but that's not considered a transaction 

That's -EPIPE right? With this, the storage driver knows that it needs to
perform clear-halt because the bulk endpoint is STALL, not -EPROTO.

> error.  Otherwise, for things like -EPROTO, usb-storage does a device 
> reset and the SCSI command is retried.  Possibly skipping some initial 

The recovery I'm thinking of doesn't involve a port reset. A port reset
is very disruptive and will impact performance greatly. I'm referring to
an intermediate recovery step at the usb storage driver without
delegating to the SCSI layer.

Currently we _have_ to do a port reset because the bulk sequence can be
out of sync and the xhci doesn't synchronize against the device for the
storage driver to retry the command directly.

> portion of the data if the transfer was partially successful.  (This 
> might not work very well for things like tape drives, but disk drives 
> have the convenient feature that reads and writes are idempotent.)
> 
> > The retrying of the URB or sending a new set of URBs should be a
> > decision by the class driver where synchronization at the higher
> > protocol may be needed. An URB failed with -EPROTO may mean some
> > previous successful transfers need to be discarded and retried also.
> 
> That's a good point.  There's only so much we can expect the core to 
> handle.

Right. Not sure what the core can do here.

> 
> > What we do know is that an -EPROTO means that the xhci endpoint state
> > was halted, the xhci would need to reset (not soft retry) the endpoint
> > before it can be used again. Since the bulk sequence is reset from reset
> > ep command, we'd need clear-halt to synchronize with the device side.
> > The reset ep command behavior (and when to use it) is xhci specific, so
> > > IMHO, it should be up to the xhci driver to handle the clear-halt. Which
> > > URB(s) need to be retried should be a decision by the upper-layer drivers.
> 
> And for which drivers will we want to go to the trouble of adding this 
> kind of error recovery?  Alternatives include doing just enough to make 
> the endpoint start working again and ignoring any data loss, or 
> declaring the whole device to be offline (which would need at least an 
> unbind and maybe even a power cycle to recover from).
> 

What I'd like to see is that the endpoint can be put in a state where
the class driver can submit a new URB without unbind/reset/power cycle.
With the current implementation of the xhci driver, we cannot do that
for bulk endpoints with -EPROTO error.

BR,
Thinh

* Re: correctly handling EPROTO
  2026-03-19  8:40                                       ` Mathias Nyman
@ 2026-03-19 23:34                                         ` Thinh Nguyen
  0 siblings, 0 replies; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-19 23:34 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Alan Stern, Michal Pecio, Oliver Neukum, Thinh Nguyen,
	Bjørn Mork, USB list

On Thu, Mar 19, 2026, Mathias Nyman wrote:
> On 3/19/26 03:56, Alan Stern wrote:
> 
> > Just to be clear, are you saying there's no way for an xHC to restart a
> > (host-side) halted non-SuperSpeed endpoint without setting the toggle
> > back to 0?
> 
> There is.
> A reset endpoint command with the TSP flag (transfer state preserve)
> clears the host side halt and preserves the toggle state.
> 
> It's used for soft-retry purposes, retrying a transfer after
> a transaction error. This is also the only use-case described in xHCI specification.

When the xhci gives back the URB with -EPROTO, that's when it gives up
soft-retrying. The TRBs corresponding to the URB are done. Once the
URB is given back, sending new ones should not revive or continue the
previous TRBs.

> 
> It is unclear what happens if we clear the host side halt, preserving the toggle,
> and then ask the host to move to the next URB.

For bulk, if we're trying to preserve the toggle, we can't move on
unless the soft retry succeeds and the transfer completes, right?

BR,
Thinh

> 
> It could be worth giving people a way to try it out somehow: maybe an option to
> enable it via debugfs, maybe a quirk, or even just a patch, to see how different
> xHC hosts behave.
> 
> -Mathias
> 
> 

* Re: correctly handling EPROTO
  2026-03-19 23:16                                         ` Thinh Nguyen
@ 2026-03-20  9:58                                           ` Michal Pecio
  2026-03-20 16:20                                           ` Alan Stern
  1 sibling, 0 replies; 40+ messages in thread
From: Michal Pecio @ 2026-03-20  9:58 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: Alan Stern, Oliver Neukum, Bjørn Mork, USB list

On Thu, 19 Mar 2026 23:16:22 +0000, Thinh Nguyen wrote:
> On Wed, Mar 18, 2026, Alan Stern wrote:
> > I don't know about uas, but usb-storage handles transaction error 
> > recovery in approximately the same way.  A clear-halt is issued
> > only if the device sent a halt token -- but that's not considered a
> > transaction error.
> 
> That's -EPIPE right? With this, the storage driver knows that it
> needs to perform clear-halt because the bulk endpoint is STALL, not
> -EPROTO.

To be exact, EPIPE only means that the host got a STALL handshake, but
not that the device originated it. Our good friend TT responds with
STALL after a Bulk/Control transaction error on the downstream bus.
Similar error on Int produces a distinct ERR handshake. Don't ask me.

On IN endpoint both sides agree that the transaction didn't happen.
On OUT the device may have accepted the data (figure A-14). If you
mindlessly clear halt and resubmit it will accept same data again.
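
A toy userspace simulation of the OUT case, as a sketch only. The names
are made up, the clear-halt is modeled as both toggles going back to 0,
and the host is assumed to keep its toggle when no clear-halt is done.

```c
#include <assert.h>
#include <stdbool.h>

struct out_pipe {
	unsigned dev_toggle;  /* toggle the device expects next */
	int accepted;         /* payloads the device has consumed */
};

/* Device side of one OUT transaction carrying the given toggle. */
static void device_receive(struct out_pipe *p, unsigned toggle)
{
	assert(toggle <= 1 && p->dev_toggle <= 1);

	if (toggle == p->dev_toggle) {
		p->dev_toggle ^= 1;
		p->accepted++;
	}
	/* else: duplicate, ACKed and ignored */
}

/*
 * The device accepts a payload but the host sees an error. The host then
 * resubmits the same payload, optionally after a clear-halt.
 * Returns how many times the device consumed the payload.
 */
static int copies_accepted(bool clear_halt_before_retry)
{
	struct out_pipe p = { 0, 0 };
	unsigned host_toggle = 0;

	device_receive(&p, host_toggle);	/* accepted, handshake lost */
	if (clear_halt_before_retry) {
		p.dev_toggle = 0;
		host_toggle = 0;
	}
	device_receive(&p, host_toggle);	/* retry of the same payload */
	return p.accepted;
}
```

Without the clear-halt the unchanged toggle lets the device recognize the
retry as a duplicate; with it, the same payload is consumed twice.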

Some HCDs also report ETIME and EILSEQ, supposedly similar to EPROTO.

On xhci-hcd there is no ETIME, and EILSEQ means that the HW considers
our "transfer descriptors" ill formed. We don't bother unlocking the
endpoint, as retrying seems futile. No reports in at least two years.
Maybe other status would be more appropriate? But nobody complains.

> > Otherwise, for things like -EPROTO, usb-storage does a device reset
> > and the SCSI command is retried.  Possibly skipping some initial   
> 
> The recovery I'm thinking of doesn't involve a port reset. A port
> reset is very disruptive and will impact performance greatly. I'm
> referring to an intermediate recovery step at the usb storage driver
> without delegating to the SCSI layer.

Device reset is slow no doubt, but it may be the reason why there are
no users screaming about filesystem corruption despite the apparent
widespread neglect of TT corner cases and xhci-hcd bugs.

UAS is another can of worms, for example xHCI seems to require a
guarantee that a stream is inactive in the device (by class specific
means) before its URBs can be unlinked. See 4.6.10, 4.12.

> What I'd like to see is that the endpoint can be put in a state where
> the class driver can submit a new URB without unbind/reset/power
> cycle. With the current implementation of the xhci driver, we cannot
> do that for bulk endpoints with -EPROTO error.

It can already be done with usb_clear_halt() and this should generally
work for drivers which don't queue multiple URBs in advance (those are
subject to race conditions due to BH giveback, and to xhci-hcd bugs).

Double delivery is possible on retries after usb_clear_halt(). Probably
less likely at SuperSpeed (32 instead of 2 sequence states).

If you don't reset the pipe then xhci-hcd resets one end of it behind
your back. I could write a test patch which changes this behavior for
people to play with, but you seemed skeptical.

Alternatively, URB API would need changes to support xHCI native retry.

As you work for an xHCI IP vendor, do you know something we don't? ;)
It seems to me (and Mathias apparently too) that Reset Endpoint with
TSP followed by Set TR Dequeue would trick HW into retrying the failed
USB transaction with the data buffer of the new or resubmitted URB.

Except of course if TTs are involved. Retrying transactions involving
those is "undefined behavior" per xHCI spec. I suspect that even
ehci-hcd may not support retry by resubmission in such cases properly.

Regards,
Michal

* Re: correctly handling EPROTO
  2026-03-19 23:16                                         ` Thinh Nguyen
  2026-03-20  9:58                                           ` Michal Pecio
@ 2026-03-20 16:20                                           ` Alan Stern
  2026-03-20 17:49                                             ` Oliver Neukum
  2026-03-21  2:14                                             ` Thinh Nguyen
  1 sibling, 2 replies; 40+ messages in thread
From: Alan Stern @ 2026-03-20 16:20 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: Michal Pecio, Oliver Neukum, Bjørn Mork, USB list

On Thu, Mar 19, 2026 at 11:16:22PM +0000, Thinh Nguyen wrote:
> On Wed, Mar 18, 2026, Alan Stern wrote:
> > On Wed, Mar 18, 2026 at 11:59:21PM +0000, Thinh Nguyen wrote:
> > > I've seen Windows drivers handling UAS transaction error recovery
> > > through clear-halt and retry SCSI command, and it works well. I hope to
> > > see this type of recovery implemented for usb storage and uas devices in
> > > the future.
> > 
> > I don't know about uas, but usb-storage handles transaction error 
> > recovery in approximately the same way.  A clear-halt is issued only if 
> > the device sent a halt token -- but that's not considered a transaction 
> 
> That's -EPIPE right? With this, the storage driver knows that it needs to
> perform clear-halt because the bulk endpoint is STALL, not -EPROTO.

Correct.  As for Michal's caveats regarding TTs, they don't really apply 
to USB mass-storage devices because almost all of them can connect at 
high speed or even at SuperSpeed.  Besides, even if there was a device 
behind a TT and the TT messed up recovery from a -EPIPE, usb-storage 
would simply proceed with a port reset.

(There may be a few oddball devices out there which can only run at full 
speed.  For instance, at one time Linus reported that his wife was 
having a problem with some low-end mass-storage device she used for 
knitting or crochet.  It turned out not to be any sort of protocol 
error; the problem was caused by the userspace utilities that probe each 
newly added disk looking for partition and LVM information.  This 
device was extremely slow and had a total storage capacity of something 
like a mere 100 KB (not MB!), and the utilities were trying to read all 
of it repeatedly, which took a long time.)

> > error.  Otherwise, for things like -EPROTO, usb-storage does a device 
> > reset and the SCSI command is retried.  Possibly skipping some initial 
> 
> The recovery I'm thinking of doesn't involve a port reset. A port reset
> is very disruptive and will impact performance greatly. I'm referring to
> an intermediate recovery step at the usb storage driver without
> delegating to the SCSI layer.

I don't know what other sort of intermediate reset could be carried out.  
The Bulk-Only-Transport protocol _does_ include a class-specific reset 
request, but usb-storage doesn't use it because experience has shown 
that practically no USB mass-storage devices implement it properly 
(which was probably a consequence of the fact that Windows did not use 
it).

In fact, the error recovery sequence used by usb-storage is as similar 
to what Windows does -- or did, since this goes back quite a few years 
-- as I could make it.

Naturally, UAS may be a totally different situation.

> Currently we _have_ to do a port reset because the bulk sequence can be
> out of sync and the xhci doesn't synchronize against the device for the
> storage driver to retry the command directly.

The same is true for EHCI.

> What I'd like to see is that the endpoint can be put in a state where
> the class driver can submit a new URB without unbind/reset/power cycle.
> With the current implementation of the xhci driver, we cannot do that
> for bulk endpoints with -EPROTO error.

Which means unlinking queued URBs (either automatically by the core or 
else by hand in the class driver), waiting for them to complete, and 
issuing a clear-halt.  Once that is done, submission of new URBs should 
work, if the cause of the error was transient and has gone away.
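
To make the ordering concrete, here is a minimal user-space C model of that 
sequence.  It is only a sketch: every name in it (model_ep, model_urb, 
model_recover and so on) is hypothetical and none of it is kernel API.  
Refusing submissions while the endpoint is stopped and using -ECONNRESET 
for the unlinked URBs are assumptions of the model, not statements about 
any host controller driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_QUEUE_LEN 8	/* arbitrary size for the sketch */

enum model_ep_state { MODEL_EP_RUNNING, MODEL_EP_HALTED };

struct model_urb {
	int status;	/* -EINPROGRESS while queued, else final status */
};

struct model_ep {
	enum model_ep_state state;
	struct model_urb *queue[MODEL_QUEUE_LEN];
	size_t queued;
};

/* Queue an URB. Refusing while halted is a choice of this model only. */
static int model_submit(struct model_ep *ep, struct model_urb *urb)
{
	if (ep->state == MODEL_EP_HALTED)
		return -EPIPE;
	if (ep->queued == MODEL_QUEUE_LEN)
		return -ENOMEM;
	urb->status = -EINPROGRESS;
	ep->queue[ep->queued++] = urb;
	return 0;
}

/* The URB at the head hits a transaction error and the endpoint stops. */
static void model_fail_head(struct model_ep *ep)
{
	assert(ep->queued > 0);
	ep->queue[0]->status = -EPROTO;
	for (size_t i = 1; i < ep->queued; i++)
		ep->queue[i - 1] = ep->queue[i];
	ep->queued--;
	ep->state = MODEL_EP_HALTED;
}

/* Step 1: unlink everything still queued and "wait" for the givebacks. */
static size_t model_unlink_all(struct model_ep *ep)
{
	size_t n = ep->queued;

	for (size_t i = 0; i < n; i++)
		ep->queue[i]->status = -ECONNRESET;
	ep->queued = 0;
	return n;
}

/* Step 2: clear-halt, only legal in this model once the queue is empty. */
static int model_clear_halt(struct model_ep *ep)
{
	if (ep->queued != 0)
		return -EBUSY;
	ep->state = MODEL_EP_RUNNING;
	return 0;
}

/* Steps 1-3 in order: unlink, clear-halt, then submit a retry or new URB. */
static int model_recover(struct model_ep *ep, struct model_urb *next)
{
	int ret;

	model_unlink_all(ep);
	ret = model_clear_halt(ep);
	if (ret)
		return ret;
	return model_submit(ep, next);
}
```

In this model the URB passed to model_recover() may be the one that failed 
or a completely different one; the sequence is the same either way.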

I don't make any distinction here between resubmitting the URB that 
failed (a retry) or submitting a new, completely different URB.  But 
Michal is right that under some circumstances, when communicating with a 
low- or full-speed device (which is common with HID), data may get lost 
or duplicated.  I don't see anything we can do about that.

Alan Stern


* Re: correctly handling EPROTO
  2026-03-20 16:20                                           ` Alan Stern
@ 2026-03-20 17:49                                             ` Oliver Neukum
  2026-03-21  2:14                                             ` Thinh Nguyen
  1 sibling, 0 replies; 40+ messages in thread
From: Oliver Neukum @ 2026-03-20 17:49 UTC (permalink / raw)
  To: Alan Stern, Thinh Nguyen
  Cc: Michal Pecio, Oliver Neukum, Bjørn Mork, USB list

On 20.03.26 17:20, Alan Stern wrote:
  
> In fact, the error recovery sequence used by usb-storage is as similar
> to what Windows does -- or did, since this goes back quite a few years
> -- as I could make it.
> 
> Naturally, UAS may be a totally different situation.

Sadly, no. In theory you could use all the TMF features.
In practice that does not work. Plus even if it did, we'd
be forced to reserve a tag for management.
I have an old patch set for UAS implementing the stuff, if
you want to play around with it. In practice you
only time out on TMF until you need to reset anyway.

	Regards
		Oliver



* Re: correctly handling EPROTO
  2026-03-20 16:20                                           ` Alan Stern
  2026-03-20 17:49                                             ` Oliver Neukum
@ 2026-03-21  2:14                                             ` Thinh Nguyen
  2026-03-21  5:54                                               ` Michal Pecio
  2026-03-23 10:26                                               ` Oliver Neukum
  1 sibling, 2 replies; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-21  2:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Thinh Nguyen, Michal Pecio, Oliver Neukum, Bjørn Mork,
	USB list

On Fri, Mar 20, 2026, Alan Stern wrote:
> On Thu, Mar 19, 2026 at 11:16:22PM +0000, Thinh Nguyen wrote:
> > On Wed, Mar 18, 2026, Alan Stern wrote:
> > > On Wed, Mar 18, 2026 at 11:59:21PM +0000, Thinh Nguyen wrote:
> > > > I've seen Windows drivers handling UAS transaction error recovery
> > > > through clear-halt and retry SCSI command, and it works well. I hope to
> > > > see this type of recovery implemented for usb storage and uas devices in
> > > > the future.
> > > 
> > > I don't know about uas, but usb-storage handles transaction error 
> > > recovery in approximately the same way.  A clear-halt is issued only if 
> > > the device sent a halt token -- but that's not considered a transaction 
> > 
> > That's -EPIPE right? With this, the storage driver knows that it needs to
> > perform clear-halt because the bulk endpoint is STALL, not -EPROTO.
> 
> Correct.  As for Michal's caveats regarding TTs, they don't really apply 
> to USB mass-storage devices because almost all of them can connect at 
> high speed or even at SuperSpeed.  Besides, even if there was a device 
> behind a TT and the TT messed up recovery from a -EPIPE, usb-storage 
> would simply proceed with a port reset.
> 
> (There may be a few oddball devices out there which can only run at full 
> speed.  For instance, Linus once reported that his wife was having a 
> problem with some low-end mass-storage device she used for knitting or 
> crochet.  It turned out not to be any sort of protocol 
> error; the problem was caused by the userspace utilities that probe each 
> newly added disk looking for partition and LVM information -- this 
> device was extremely slow and had a total storage capacity of something 
> like a mere 100 KB (not MB!), and the utilities were trying to read all 
> of it repeatedly, which took a long time.)

Yikes, with that device, -EPROTO is probably the least of the concerns.
It would be better off in a tech museum.

> 
> > > error.  Otherwise, for things like -EPROTO, usb-storage does a device 
> > > reset and the SCSI command is retried.  Possibly skipping some initial 
> > 
> > The recovery I'm thinking of doesn't involve a port reset. A port reset
> > is very disruptive and will impact performance greatly. I'm referring to
> > an intermediate recovery step at the usb storage driver without
> > delegating to the SCSI layer.
> 
> I don't know what other sort of intermediate reset could be carried out.  
> The Bulk-Only-Transport protocol _does_ include a class-specific reset 
> request, but usb-storage doesn't use it because experience has shown 
> that practically no USB mass-storage devices implement it properly 
> (which was probably a consequence of the fact that Windows did not use 
> it).
> 
> In fact, the error recovery sequence used by usb-storage is as similar 
> to what Windows does -- or did, since this goes back quite a few years 
> -- as I could make it.

Just to be clear, I'm not suggesting replacing the current recovery
mechanism, only that we could potentially introduce additional ones.

> 
> Naturally, UAS may be a totally different situation.

Windows has a clever way to handle this for UAS. It sends a command/task
with the same tag as the failing transfer on the command endpoint,
triggering an overlap tag response and causing the device side to cancel
the command along with the pending data transfer. This puts the host and
device in sync again.

All compliant UAS devices must support overlap tag detection.

(We can go into more detail should we want to pursue this)
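
As a rough illustration of the idea, here is a small user-space C model of
a device that tracks tags in flight. It is a sketch under assumptions: all
names (model_device, model_send_command, ...) are hypothetical, the table
size is arbitrary, and the behavior is modeled only on the description
above, not on the UAS specification text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_TAGS 16	/* arbitrary size for the sketch */

enum model_response {
	MODEL_ACCEPTED,
	MODEL_OVERLAPPED_TAG,
	MODEL_INVALID_TAG,
};

struct model_device {
	bool in_flight[MODEL_MAX_TAGS];	/* indexed by tag */
};

/*
 * Device side: a command whose tag is already in flight is not executed.
 * Instead the stale command is cancelled and an overlapped tag response
 * is returned, which leaves the tag free again.
 */
static enum model_response model_send_command(struct model_device *dev,
					      uint16_t tag)
{
	if (tag >= MODEL_MAX_TAGS)
		return MODEL_INVALID_TAG;
	if (dev->in_flight[tag]) {
		dev->in_flight[tag] = false;
		return MODEL_OVERLAPPED_TAG;
	}
	dev->in_flight[tag] = true;
	return MODEL_ACCEPTED;
}

/* Device side: normal completion releases the tag. */
static void model_complete(struct model_device *dev, uint16_t tag)
{
	assert(tag < MODEL_MAX_TAGS && dev->in_flight[tag]);
	dev->in_flight[tag] = false;
}

/*
 * Host side: after a transaction error on the transfer using 'tag', reuse
 * the tag on purpose to force the cancel, then send the real retry.
 * Returns true if host and device end up in sync with the retry accepted.
 */
static bool model_resync_and_retry(struct model_device *dev, uint16_t tag)
{
	if (model_send_command(dev, tag) != MODEL_OVERLAPPED_TAG)
		return false;
	return model_send_command(dev, tag) == MODEL_ACCEPTED;
}
```

The point of the model is that the deliberate tag collision needs no reset
of any kind; the device drops the stale command and the next command with
that tag starts from a clean state.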

> 
> > Currently we _have_ to do a port reset because the bulk sequence can be
> > out of sync and the xhci doesn't synchronize against the device for the
> > storage driver to retry the command directly.
> 
> The same is true for EHCI.
> 
> > What I'd like to see is that the endpoint can be put in a state where
> > the class driver can submit a new URB without unbind/reset/power cycle.
> > With the current implementation of the xhci driver, we cannot do that
> > for bulk endpoints with -EPROTO error.
> 
> Which means unlinking queued URBs (either automatically by the core or 
> else by hand in the class driver), waiting for them to complete, and 
> issuing a clear-halt.  Once that is done, submission of new URBs should 

The clear-halt doesn't have to be done after the unlinking of URBs. The
xhci endpoint is in stopped state after a reset ep command. As
long as the class driver doesn't submit a new URB to trigger a doorbell
ring, the xhci driver can send a clear-halt after a reset ep command and
no transfer will start.

> work, if the cause of the error was transient and has gone away.
> 
> I don't make any distinction here between resubmitting the URB that 
> failed (a retry) or submitting a new, completely different URB.  But 
> Michal is right that under some circumstances, when communicating with a 
> low- or full-speed device (which is common with HID), data may get lost 
> or duplicated.  I don't see anything we can do about that.
> 

Of course, as I mentioned, there needs to be a synchronization mechanism
at the class driver or higher layer.

The xhci spec actually suggests performing a clear-halt after a reset ep
command for this type of scenario. Whether any class driver will take
advantage of retrying without a port reset is a different matter.

I can see that this kind of recovery can be done, and it seems to be an
improvement. So I just want to put it out there.

BR,
Thinh


* Re: correctly handling EPROTO
  2026-03-21  2:14                                             ` Thinh Nguyen
@ 2026-03-21  5:54                                               ` Michal Pecio
  2026-03-21 15:58                                                 ` Alan Stern
  2026-03-23 10:26                                               ` Oliver Neukum
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Pecio @ 2026-03-21  5:54 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: Alan Stern, Oliver Neukum, Bjørn Mork, USB list

On Sat, 21 Mar 2026 02:14:46 +0000, Thinh Nguyen wrote:
> Windows has a clever way to handle this for UAS. It sends a
> command/task with the same tag as the failing transfer on the command
> endpoint, triggering an overlap tag response and causing the device
> side to cancel the command along with the pending data transfer. This
> puts the host and device in sync again.
> 
> All compliant UAS devices must support overlap tag detection.

Doing what Windows does may be a good idea. I have seen certain UAS
bridges cause problems on multiple xHCI controllers when data URBs are
unlinked after receiving an error status. I suspect those chips may
violate the USB3 spec with the current UAS driver, but I have no way to
debug this.

> The clear-halt doesn't have to be done after the unlinking of URBs.
> The xhci endpoint is in stopped state after a reset ep command. As
> long as the class driver doesn't submit a new URB to trigger a
> doorbell ring, the xhci driver can send a clear-halt after a reset ep
> command and no transfer will start.

Nope, for many years now, if not forever, xhci-hcd has been restarting
the endpoint after giving back the failed URB if its completion hasn't
unlinked all remaining URBs.

Changing this is one of the issues under discussion here. It would take
a few tweaks to the driver.

Per kerneldoc, you should unlink URBs before calling usb_clear_halt(),
and xHCI *requires* URBs to be unlinked in some situations, though you
wouldn't run into that if things worked the way you described.

It's another case where the old USB2 spec arbitrarily dictates how software
should conduct its business and xHCI later assumes that we do. Annoying
as it is, it seems safer to follow the spec, particularly if URBs need
to be unlinked anyway to retry or replace the failed one.

Regards,
Michal


* Re: correctly handling EPROTO
  2026-03-21  5:54                                               ` Michal Pecio
@ 2026-03-21 15:58                                                 ` Alan Stern
  2026-03-28 21:22                                                   ` Michal Pecio
  0 siblings, 1 reply; 40+ messages in thread
From: Alan Stern @ 2026-03-21 15:58 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Thinh Nguyen, Oliver Neukum, Bjørn Mork, USB list

On Sat, Mar 21, 2026 at 06:54:24AM +0100, Michal Pecio wrote:
> > The clear-halt doesn't have to be done after the unlinking of URBs.
> > The xhci endpoint is in stopped state after a reset ep command. As
> > long as the class driver doesn't submit a new URB to trigger a
> > doorbell ring, the xhci driver can send a clear-halt after a reset ep
> > command and no transfer will start.
> 
> Nope, for many years now, if not forever, xhci-hcd has been restarting
> the endpoint after giving back the failed URB if its completion hasn't
> unlinked all remaining URBs.

How can that work in the presence of BH givebacks?  xhci-hcd doesn't 
really know when the completion handler runs.

> Changing this is one of the issues under discussion here. It would take
> a few tweaks to the driver.

I think this is something that should be done in any case.

Alan Stern


* Re: correctly handling EPROTO
  2026-03-21  2:14                                             ` Thinh Nguyen
  2026-03-21  5:54                                               ` Michal Pecio
@ 2026-03-23 10:26                                               ` Oliver Neukum
  2026-03-24  1:06                                                 ` Thinh Nguyen
  1 sibling, 1 reply; 40+ messages in thread
From: Oliver Neukum @ 2026-03-23 10:26 UTC (permalink / raw)
  To: Thinh Nguyen, Alan Stern
  Cc: Michal Pecio, Oliver Neukum, Bjørn Mork, USB list



On 21.03.26 03:14, Thinh Nguyen wrote:
> On Fri, Mar 20, 2026, Alan Stern wrote:
>> On Thu, Mar 19, 2026 at 11:16:22PM +0000, Thinh Nguyen wrote:
>>> On Wed, Mar 18, 2026, Alan Stern wrote:

>> Naturally, UAS may be a totally different situation.
> 
> Windows has a clever way to handle this for UAS. It sends a command/task
> with the same tag as the failing transfer on the command endpoint,
> triggering an overlap tag response and causing the device side to cancel
> the command along with the pending data transfer. This puts the host and
> device in sync again.
> 
> All compliant UAS devices must support overlap tag detection.
> 
> (We can go into more detail should we want to pursue this)

_yes_

Do you have a trace?

>>> Currently we _have_ to do a port reset because the bulk sequence can be
>>> out of sync and the xhci doesn't synchronize against the device for the
>>> storage driver to retry the command directly.
>>
>> The same is true for EHCI.
>>
>>> What I'd like to see is that the endpoint can be put in a state where
>>> the class driver can submit a new URB without unbind/reset/power cycle.
>>> With the current implementation of the xhci driver, we cannot do that
>>> for bulk endpoints with -EPROTO error.
>>
>> Which means unlinking queued URBs (either automatically by the core or
>> else by hand in the class driver), waiting for them to complete, and
>> issuing a clear-halt.  Once that is done, submission of new URBs should
> 
> The clear-halt doesn't have to be done after the unlinking of URBs. The
> xhci endpoint is in stopped state after a reset ep command. As
> long as the class driver doesn't submit a new URB to trigger a doorbell
> ring, the xhci driver can send a clear-halt after a reset ep command and
> no transfer will start.

How do we tell drivers about that? If we give back the URB it must
be unlinked.

	Regards
		Oliver




* Re: correctly handling EPROTO
  2026-03-23 10:26                                               ` Oliver Neukum
@ 2026-03-24  1:06                                                 ` Thinh Nguyen
  2026-03-24  9:28                                                   ` Oliver Neukum
  0 siblings, 1 reply; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-24  1:06 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Thinh Nguyen, Alan Stern, Michal Pecio, Bjørn Mork, USB list

[-- Attachment #1: Type: text/plain, Size: 5448 bytes --]

On Mon, Mar 23, 2026, Oliver Neukum wrote:
> 
> 
> On 21.03.26 03:14, Thinh Nguyen wrote:
> > On Fri, Mar 20, 2026, Alan Stern wrote:
> > > On Thu, Mar 19, 2026 at 11:16:22PM +0000, Thinh Nguyen wrote:
> > > > On Wed, Mar 18, 2026, Alan Stern wrote:
> 
> > > Naturally, UAS may be a totally different situation.
> > 
> > Windows has a clever way to handle this for UAS. It sends a command/task
> > with the same tag as the failing transfer on the command endpoint,
> > triggering an overlap tag response and causing the device side to cancel
> > the command along with the pending data transfer. This puts the host and
> > device in sync again.
> > 
> > All compliant UAS devices must support overlap tag detection.
> > 
> > (We can go into more detail should we want to pursue this)
> 
> _yes_
> 
> Do you have a trace?
> 

I attached a couple of usb traffic sniffing traces. Review comments
below.

<snip>

> > 
> > The clear-halt doesn't have to be done after the unlinking of URBs. The
> > xhci endpoint is in stopped state after a reset ep command. As
> > long as the class driver doesn't submit a new URB to trigger a doorbell
> > ring, the xhci driver can send a clear-halt after a reset ep command and
> > no transfer will start.
> 
> How do we tell drivers about that? If we give back the URB it must
> be unlinked.
> 

Yes it must. I was responding to Alan's comment, noting that it can be
done before or after unlinking the URBs. But as Michal noted, that may
not be possible because we ring the doorbell right after giving back an
URB.

---

Now, about the attached traces. They were captured about 3 years ago on
a certain build of Windows 10. One trace shows the response to a
transaction error on the data IN endpoint, the other the response to a
transaction error on the data OUT endpoint. These are off-the-shelf
devices tested behind a hub.

win10_uasp_clear_halt_ep1in_T7.txt:
-----------------------------------
Unfortunately, this test was run with a single tag number, so it's not a
very good demonstration of the recovery. However, you can see a
transaction error on this SCSI READ(10), which stopped at 146432 bytes:

	_______|_______________________________________________________________________
	SCSI Op(85) ADDR(3) Tag(0x0002) SCSI CDB READ(10) 
	_______| Logical Block Addr(0x0928FC00) Data(146432 bytes) Status(Missing)-BAD 
	_______| Time(  2.681 ms) Time Stamp(10 . 006 968 662) Metrics #Xfers(2) 
	_______|_______________________________________________________________________

Then the host issues a clear-halt and sends a new command with the same
tag, causing an overlapped tag response:

	Transfer(289) Left("Left") G2(x1) Control(SET) ADDR(3) ENDP(0) 
	_______| bRequest(CLEAR_FEATURE) wValue(ENDPOINT_HALT) wLength(0) 
	_______| Time(166.322 us) Time Stamp(10 . 009 649 516) 
	_______|_______________________________________________________________________
	SCSI Op(99) ADDR(3) Tag(0x0002) SCSI CDB READ(10) 
	_______| Logical Block Addr(0x09290000) RESPONSE_CODE(OVERLAPPED TAG) 
	_______| Time(365.854 us) Time Stamp(10 . 009 815 838) Metrics #Xfers(2) 
	_______|_______________________________________________________________________

Then the host retries, starting from where the failing SCSI command was:

	SCSI Op(100) ADDR(3) Tag(0x0002) SCSI CDB READ(10) 
	_______| Logical Block Addr(0x09290400) STATUS(GOOD) Data(524288 bytes) 
	_______| Time(  1.012 sec) Time Stamp(10 . 010 181 692) Metrics #Xfers(3) 
	_______|_______________________________________________________________________
	SCSI Op(101) ADDR(3) Tag(0x0002) SCSI CDB READ(10) 
	_______| Logical Block Addr(0x0928FC00) STATUS(GOOD) Data(524288 bytes) 
	_______| Time(882.412 us) Time Stamp(11 . 022 469 104) Metrics #Xfers(3) 
	_______|_______________________________________________________________________
	SCSI Op(102) ADDR(3) Tag(0x0002) SCSI CDB READ(10) 
	_______| Logical Block Addr(0x09290000) STATUS(GOOD) Data(524288 bytes) 
	_______| Time(  1.060 ms) Time Stamp(11 . 023 351 516) Metrics #Xfers(3) 
	_______|_______________________________________________________________________
	SCSI Op(103) ADDR(3) Tag(0x0002) SCSI CDB READ(10) 
	_______| Logical Block Addr(0x09290800) STATUS(GOOD) Data(524288 bytes) 
	_______| Time(  1.013 ms) Time Stamp(11 . 024 411 510) Metrics #Xfers(3) 
	_______|_______________________________________________________________________

And recovery was successful and transfers resumed.

win10_uasp_clear_halt_ep1out_T7.txt:
-----------------------------------
This file is viewed at the Transfer level because it's difficult to see
at the SCSI Op level. This one has multiple transaction errors and triggers
multiple clear-halts from the host. After retrying the same transfer a few
times, the host gave up. The last thing you see in that
trace is the host performing a clear-halt due to a transaction error:

	_______|_______________________________________________________________________
	Transfer(293) Left("Left") G2(x1) Control(SET) ADDR(4) ENDP(0) Route String(3) 
	_______| bRequest(CLEAR_FEATURE) wValue(ENDPOINT_HALT) wLength(0) 
	_______| Time Stamp(21 . 540 631 880) 
	_______|_______________________________________________________________________

After a certain amount of time, from what I recall, the host will
perform port reset recovery due to the transfer timeout, similar to Linux.

BR,
Thinh

[-- Attachment #2: uas_clear_halt.tar.gz --]
[-- Type: application/x-tar-gz, Size: 2242 bytes --]


* Re: correctly handling EPROTO
  2026-03-24  1:06                                                 ` Thinh Nguyen
@ 2026-03-24  9:28                                                   ` Oliver Neukum
  2026-03-24 13:25                                                     ` Alan Stern
  2026-03-25  1:44                                                     ` Thinh Nguyen
  0 siblings, 2 replies; 40+ messages in thread
From: Oliver Neukum @ 2026-03-24  9:28 UTC (permalink / raw)
  To: Thinh Nguyen, Oliver Neukum
  Cc: Alan Stern, Michal Pecio, Bjørn Mork, USB list

On 24.03.26 02:06, Thinh Nguyen wrote:

> I attached a couple of usb traffic sniffing traces. Review comments
> below.

Thank you a whole lot. These are extremely educational. I am not sure
to what extent this discussion is on topic, though it makes me wonder
how we'd deal with an error in the last phase of the command. We'd
be unsure whether it has been completed.

> Yes it must. I was responding to Alan's comment, noting that it can
> be done before or after unlinking the URBs. But as Michal noted, that may
> not be possible because we ring the doorbell right after giving back an
> URB.

Very well. That raises a fundamental issue. Are we planning around the limits
of the existing API or according to the capabilities of the hardware? I see
two specific issues:

1) What do we do with the URBs queued after the URB that suffered a failure?
We cannot just execute them.
2) Do we need a second callback for an "undead" URB, which decides on how errors
are to be handled?

	Regards
		Oliver


* Re: correctly handling EPROTO
  2026-03-24  9:28                                                   ` Oliver Neukum
@ 2026-03-24 13:25                                                     ` Alan Stern
  2026-03-25  1:44                                                     ` Thinh Nguyen
  1 sibling, 0 replies; 40+ messages in thread
From: Alan Stern @ 2026-03-24 13:25 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Thinh Nguyen, Michal Pecio, Bjørn Mork, USB list

On Tue, Mar 24, 2026 at 10:28:01AM +0100, Oliver Neukum wrote:
> Very well. That raises a fundamental issue. Are we planning around the limits
> of the existing API or according to the capabilities of the hardware? I see
> two specific issues:
> 
> 1) What do we do with the URBs queued after the URB that suffered a failure?
> We cannot just execute them.

Indeed not.  That leaves only one choice: Give them back with a suitable 
error status.

> 2) Do we need a second callback for an "undead" URB, which decides on how errors
> are to be handled?

That would be too much complication.  The decisions on how to handle 
errors, whether to resubmit, and so on, are entirely up to the class 
driver.  Whatever the decision is, the driver can carry it out directly, 
with no need for an extra callback.
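
As a sketch of what "carrying it out directly" could look like, here is a 
user-space C model of the decision a completion handler might make from 
the URB status alone.  All names (model_policy, model_decide, the action 
enum) are hypothetical, and the retry bound of 3 is an arbitrary assumption 
chosen only so that repeated -EPROTO cannot turn into a livelock.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_EPROTO_RETRIES 3	/* arbitrary bound, an assumption */

enum model_action {
	MODEL_PROCESS_DATA,	/* transfer succeeded */
	MODEL_RESUBMIT,		/* transient error, try again */
	MODEL_CLEAR_HALT,	/* endpoint stalled */
	MODEL_RESET_DEVICE,	/* retries exhausted, escalate */
	MODEL_STOP,		/* unlinked or device gone */
};

struct model_policy {
	unsigned int eproto_count;	/* consecutive -EPROTO completions */
};

/* Map an URB completion status to the driver's next step. */
static enum model_action model_decide(struct model_policy *p, int status)
{
	assert(p != NULL);

	switch (status) {
	case 0:
		p->eproto_count = 0;
		return MODEL_PROCESS_DATA;
	case -EPROTO:
		/* Bounded retry: escalate instead of retrying forever. */
		if (++p->eproto_count > MODEL_MAX_EPROTO_RETRIES)
			return MODEL_RESET_DEVICE;
		return MODEL_RESUBMIT;
	case -EPIPE:
		return MODEL_CLEAR_HALT;
	case -ENOENT:
	case -ECONNRESET:
	case -ESHUTDOWN:
	default:
		/* Unlinked, shut down, or unknown: do not resubmit. */
		return MODEL_STOP;
	}
}
```

A successful completion resets the counter, so only an unbroken run of 
-EPROTO completions leads to the reset.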

Alan Stern


* Re: correctly handling EPROTO
  2026-03-24  9:28                                                   ` Oliver Neukum
  2026-03-24 13:25                                                     ` Alan Stern
@ 2026-03-25  1:44                                                     ` Thinh Nguyen
  1 sibling, 0 replies; 40+ messages in thread
From: Thinh Nguyen @ 2026-03-25  1:44 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Thinh Nguyen, Alan Stern, Michal Pecio, Bjørn Mork, USB list

On Tue, Mar 24, 2026, Oliver Neukum wrote:
> 
> 
> On 24.03.26 02:06, Thinh Nguyen wrote:
> 
> 
> > I attached a couple of usb traffic sniffing traces. Review comments
> > below.
> 
> Thank you a whole lot. These are extremely educational. I am not sure
> to which extent this discussion is on topic. Though it makes me wonder
> how we'd deal with an error in the last phase of the command. We'd
> be unsure whether it has been completed.
> 

The status endpoint is bulk IN. If there's a transaction error, the host
should not give a handshake indicating packet completion. If there's no
status completion, the device side should not complete the command. I
expect a new command with the same tag to cause the device to
cancel the stale status and replace it with an overlapped tag response when
the host requests a new status.

BR,
Thinh


* Re: correctly handling EPROTO
  2026-03-21 15:58                                                 ` Alan Stern
@ 2026-03-28 21:22                                                   ` Michal Pecio
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Pecio @ 2026-03-28 21:22 UTC (permalink / raw)
  To: Alan Stern; +Cc: Thinh Nguyen, Oliver Neukum, Bjørn Mork, USB list

On Sat, 21 Mar 2026 11:58:53 -0400, Alan Stern wrote:
> On Sat, Mar 21, 2026 at 06:54:24AM +0100, Michal Pecio wrote:
> > Nope, for many years now, if not forever, xhci-hcd has been
> > restarting the endpoint after giving back the failed URB if its
> > completion hasn't unlinked all remaining URBs.  
> 
> How can that work in the presence of BH givebacks?

Certainly not reliably and I started a similar thread two years ago
after coming to this exact realization.

Does anyone know of class drivers affected by this which could be used
to validate such changes? Writing a patch is one thing, knowing whether
it does any good is another. I recall that last time Mathias tried to
touch this logic it caused a regression by unearthing more issues.

I was reluctant to touch this mess in the absence of known impact. The race
is as old as BH giveback (2019) and automatic restarting is even older.
It could get awkward if users (or driver developers) learned to expect
this behavior.

But if somebody can point out serious issues like data loss in storage
then it's a different ball game.

Regards,
Michal


end of thread, other threads:[~2026-03-28 21:22 UTC | newest]

Thread overview: 40+ messages
2026-03-12 13:55 correctly handling EPROTO Oliver Neukum
2026-03-12 14:21 ` Alan Stern
2026-03-12 15:57   ` Oliver Neukum
2026-03-13  7:53     ` Michal Pecio
2026-03-13 10:33       ` Oliver Neukum
2026-03-13 15:28         ` Alan Stern
2026-03-13 22:45           ` Thinh Nguyen
2026-03-14  2:39             ` Alan Stern
2026-03-16 12:58               ` Oliver Neukum
2026-03-16 14:02                 ` Alan Stern
2026-03-16 14:47                   ` Oliver Neukum
2026-03-16 17:33                     ` Alan Stern
2026-03-16 19:32                       ` Oliver Neukum
2026-03-17  9:05                         ` Mathias Nyman
2026-03-17 14:31                         ` Alan Stern
2026-03-17 16:20                           ` Oliver Neukum
2026-03-17 18:03                             ` Alan Stern
2026-03-18  9:54                               ` Oliver Neukum
2026-03-18 17:46                                 ` Alan Stern
2026-03-18 21:38                                   ` Michal Pecio
2026-03-18 23:59                                     ` Thinh Nguyen
2026-03-19  2:07                                       ` Alan Stern
2026-03-19 23:16                                         ` Thinh Nguyen
2026-03-20  9:58                                           ` Michal Pecio
2026-03-20 16:20                                           ` Alan Stern
2026-03-20 17:49                                             ` Oliver Neukum
2026-03-21  2:14                                             ` Thinh Nguyen
2026-03-21  5:54                                               ` Michal Pecio
2026-03-21 15:58                                                 ` Alan Stern
2026-03-28 21:22                                                   ` Michal Pecio
2026-03-23 10:26                                               ` Oliver Neukum
2026-03-24  1:06                                                 ` Thinh Nguyen
2026-03-24  9:28                                                   ` Oliver Neukum
2026-03-24 13:25                                                     ` Alan Stern
2026-03-25  1:44                                                     ` Thinh Nguyen
2026-03-19  1:56                                     ` Alan Stern
2026-03-19  8:40                                       ` Mathias Nyman
2026-03-19 23:34                                         ` Thinh Nguyen
2026-03-19  8:55                                       ` Michal Pecio
2026-03-19 14:24                                         ` Alan Stern
