Problematic Set TR Deq error handling series in xhci-next

public inbox for linux-usb@vger.kernel.org
 help / color / mirror / Atom feed

* Problematic Set TR Deq error handling series in xhci-next
@ 2025-05-09  9:41 Michał Pecio
  2025-05-10 15:52 ` Alan Stern
  2025-05-12 13:02 ` Neronin, Niklas
  0 siblings, 2 replies; 8+ messages in thread
From: Michał Pecio @ 2025-05-09  9:41 UTC (permalink / raw)
  To: Niklas Neronin, Mathias Nyman; +Cc: linux-usb

Hi,

I noticed that xhci/for-usb-next now includes a series which tries
to handle Set TR Deq errors. It strikes me as a solution looking for
a problem, and frankly creating new problems where none existed ;)

I am aware of three cases leading to errors being handled here, and
none of them is addressed. One is harmless and easy to fix properly,
but this series appears to turn it into a "never give back the URB"
disaster. Tests pending, I hope to find some time this weekend.

There should be no need to handle these errors, they are prevented
by not queuing the command in wrong states. When the command fails,
it means the driver screwed up tracking endpoint state and other
things are on fire too, so the actual bug should be fixed instead.

The case of disabled endpoints is clear: no URBs are allowed, the
core is broken. It would be more productive to sanity-check core:
detect and nuke lingering URBs in places like endpoint_disable(),
drop_endpoint(), reset_device(), free_dev(). If Set Deq is already
pending at the time, give back the URB and let the command fail.

The case of "stopped" should outright never happen, we don't queue
Stop Endpoint if Set TR Deq is pending. It triggers a known HW bug,
so it's a bug. Note that this series already assumes that Stop EP
isn't pending when it queues a new one.

The case of "running" (or "halted", which "running" can morth into)
is possibly more useful, because we assume it's caused by illegal
state changes without driver's knowledge. But these are supposed to
be detected and fixed by handle_cmd_stop_ep(), because they cause
other problems too. It's unclear if retrying Stop Endpoint and Set
TR Deq again will solve any case not solved yet, but risk exists of
infinite retries on broken HW. (And HW is broken if we are here).

Queuing a Soft Retry and then doing Set TR Deq out of the halted TD
instead of restarting the ring is a weird thing to do and IDK if HW
will get it right. IIRC, some ASMedia had issues in similar cases:
it would retry the TD anyway, despite Set TR Deq.

The case of command TRB error looks wrong too. AFAIK, bad stream
ID only halts the endpoint if it's in a packet from the device. The
command just fails. And find_halted_td() will crash on Streams EPs.

TRB error also happens on ASMedia HCs under unclear circumstances.
It's either an HC bug, or (totally guessing) maybe a way the HC is
telling us that the target stream is the Current Stream, forbidden
by the spec. It only seems to happen with some UAS chips, and same
chips also have issues (but no TRB errors) on other HCs. If anyone
with access to the UAS spec (paywall) and a USB packet analyzer
would like to debug it, I can provide detailed repro instructions. 

Regards,
Michal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-09  9:41 Problematic Set TR Deq error handling series in xhci-next Michał Pecio
@ 2025-05-10 15:52 ` Alan Stern
  2025-05-11 14:04   ` Michał Pecio
  2025-05-12 13:02 ` Neronin, Niklas
  1 sibling, 1 reply; 8+ messages in thread
From: Alan Stern @ 2025-05-10 15:52 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Niklas Neronin, Mathias Nyman, linux-usb

On Fri, May 09, 2025 at 11:41:38AM +0200, Michał Pecio wrote:
> Hi,
> 
> I noticed that xhci/for-usb-next now includes a series which tries
> to handle Set TR Deq errors. It strikes me as a solution looking for
> a problem, and frankly creating new problems where none existed ;)
> 
> I am aware of three cases leading to errors being handled here, and
> none of them is addressed. One is harmless and easy to fix properly,
> but this series appears to turn it into a "never give back the URB"
> disaster. Tests pending, I hope to find some time this weekend.
> 
> There should be no need to handle these errors, they are prevented
> by not queuing the command in wrong states. When the command fails,
> it means the driver screwed up tracking endpoint state and other
> things are on fire too, so the actual bug should be fixed instead.
> 
> The case of disabled endpoints is clear: no URBs are allowed, the
> core is broken. It would be more productive to sanity-check core:
> detect and nuke lingering URBs in places like endpoint_disable(),
> drop_endpoint(), reset_device(), free_dev(). If Set Deq is already
> pending at the time, give back the URB and let the command fail.

The core already does this for endpoint_disable.  If the others have 
problems, could you provide a tracebacks so we can see the pathways 
where the problem occurs?

Alan Stern

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-10 15:52 ` Alan Stern
@ 2025-05-11 14:04   ` Michał Pecio
  2025-05-11 19:24     ` Alan Stern
  0 siblings, 1 reply; 8+ messages in thread
From: Michał Pecio @ 2025-05-11 14:04 UTC (permalink / raw)
  To: Alan Stern; +Cc: Niklas Neronin, Mathias Nyman, linux-usb

On Sat, 10 May 2025 11:52:49 -0400, Alan Stern wrote:
> On Fri, May 09, 2025 at 11:41:38AM +0200, Michał Pecio wrote:
> > Hi,
> > 
> > I noticed that xhci/for-usb-next now includes a series which tries
> > to handle Set TR Deq errors. It strikes me as a solution looking for
> > a problem, and frankly creating new problems where none existed ;)
> > 
> > I am aware of three cases leading to errors being handled here, and
> > none of them is addressed. One is harmless and easy to fix properly,
> > but this series appears to turn it into a "never give back the URB"
> > disaster. Tests pending, I hope to find some time this weekend.
> > 
> > There should be no need to handle these errors, they are prevented
> > by not queuing the command in wrong states. When the command fails,
> > it means the driver screwed up tracking endpoint state and other
> > things are on fire too, so the actual bug should be fixed instead.
> > 
> > The case of disabled endpoints is clear: no URBs are allowed, the
> > core is broken. It would be more productive to sanity-check core:
> > detect and nuke lingering URBs in places like endpoint_disable(),
> > drop_endpoint(), reset_device(), free_dev(). If Set Deq is already
> > pending at the time, give back the URB and let the command fail.  
> 
> The core already does this for endpoint_disable.  If the others have 
> problems, could you provide a tracebacks so we can see the pathways 
> where the problem occurs?

I'm not aware of problems, this paragraph was hypothetical: if someone
thinks that problems exist or should be monitored for, there are better
places to do it than handle_cmd_set_deq().

Today I patched those HCD methods locally to check for pending URBs.
Nothing caught so far, but I will leave this code running long term.

This was discussed before and you said that device reset should be OK.
Users of hub_free_dev() also appear to be OK (they call things which
call disable_endpoint() on EP0 or all EPs).

Mathias fixed usb_set_interface() a few years ago. Not sure if similar
use of usb_hcd_alloc_bandwidth() in usb_set_configuration() is safe?

Michal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-11 14:04   ` Michał Pecio
@ 2025-05-11 19:24     ` Alan Stern
  0 siblings, 0 replies; 8+ messages in thread
From: Alan Stern @ 2025-05-11 19:24 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Niklas Neronin, Mathias Nyman, linux-usb

On Sun, May 11, 2025 at 04:04:57PM +0200, Michał Pecio wrote:
> On Sat, 10 May 2025 11:52:49 -0400, Alan Stern wrote:
> > > The case of disabled endpoints is clear: no URBs are allowed, the
> > > core is broken. It would be more productive to sanity-check core:
> > > detect and nuke lingering URBs in places like endpoint_disable(),
> > > drop_endpoint(), reset_device(), free_dev(). If Set Deq is already
> > > pending at the time, give back the URB and let the command fail.  
> > 
> > The core already does this for endpoint_disable.  If the others have 
> > problems, could you provide a tracebacks so we can see the pathways 
> > where the problem occurs?
> 
> I'm not aware of problems, this paragraph was hypothetical: if someone
> thinks that problems exist or should be monitored for, there are better
> places to do it than handle_cmd_set_deq().
> 
> Today I patched those HCD methods locally to check for pending URBs.
> Nothing caught so far, but I will leave this code running long term.
> 
> This was discussed before and you said that device reset should be OK.
> Users of hub_free_dev() also appear to be OK (they call things which
> call disable_endpoint() on EP0 or all EPs).
> 
> Mathias fixed usb_set_interface() a few years ago. Not sure if similar
> use of usb_hcd_alloc_bandwidth() in usb_set_configuration() is safe?

One of the first things that usb_set_configuration() does is to call 
usb_disable_device(), which calls usb_disable_device_endpoints(), which 
calls usb_disable_endpoint() for all but ep0, which calls 
usb_hcd_flush_endpoint().  This all happens before 
usb_hcd_alloc_bandwidth() is called.

Alan Stern

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-09  9:41 Problematic Set TR Deq error handling series in xhci-next Michał Pecio
  2025-05-10 15:52 ` Alan Stern
@ 2025-05-12 13:02 ` Neronin, Niklas
  2025-05-13 10:29   ` Michał Pecio
  2025-05-14  9:34   ` Michał Pecio
  1 sibling, 2 replies; 8+ messages in thread
From: Neronin, Niklas @ 2025-05-12 13:02 UTC (permalink / raw)
  To: Michał Pecio, Mathias Nyman; +Cc: linux-usb



> I noticed that xhci/for-usb-next now includes a series which tries
> to handle Set TR Deq errors. It strikes me as a solution looking for
> a problem, and frankly creating new problems where none existed ;)

Hi,

The purpose of this series is to add some error handling and improve
warning messages. Fixing the root cause is another task altogether.

Before, the only difference between a Set TR Deq command failure
and success was that, in the case of failure, a warning was printed,
and the xhci driver ring dequeue pointer was not moved. Otherwise,
a Set TR Deq command failure behaved as if it were successful.

In my opinion, instead of ignoring the Set TR Deq errors it would be
better to retry the command if the error is due to something easily
fixed, such as wrong EP state or Slot state.
In the worst-case scenario, the xhci driver will still fail, but with
a more specific warning message.

> I am aware of three cases leading to errors being handled here, and
> none of them is addressed. One is harmless and easy to fix properly,
> but this series appears to turn it into a "never give back the URB"
> disaster. Tests pending, I hope to find some time this weekend.
> > ...
> 
> The case of "running" (or "halted", which "running" can morth into)
> is possibly more useful, because we assume it's caused by illegal
> state changes without driver's knowledge. But these are supposed to
> be detected and fixed by handle_cmd_stop_ep(), because they cause
> other problems too. It's unclear if retrying Stop Endpoint and Set
> TR Deq again will solve any case not solved yet, but risk exists of
> infinite retries on broken HW. (And HW is broken if we are here).
The infinite retry loop is a good point, did not consider this.
But wouldn't the Stop EP command fail first, otherwise the state is
correct for the Set TR Deq?

Do you still think these changes might result in more negative impacts
than positive ones?

Thank you for the review and comments.

Best Regards,
Niklas

> 
> Queuing a Soft Retry and then doing Set TR Deq out of the halted TD
> instead of restarting the ring is a weird thing to do and IDK if HW
> will get it right. IIRC, some ASMedia had issues in similar cases:
> it would retry the TD anyway, despite Set TR Deq.
> 
> ...
> 
> Regards,
> Michal


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-12 13:02 ` Neronin, Niklas
@ 2025-05-13 10:29   ` Michał Pecio
  2025-05-14  9:34   ` Michał Pecio
  1 sibling, 0 replies; 8+ messages in thread
From: Michał Pecio @ 2025-05-13 10:29 UTC (permalink / raw)
  To: Neronin, Niklas; +Cc: Mathias Nyman, linux-usb

On Mon, 12 May 2025 16:02:06 +0300, Neronin, Niklas wrote:
> The purpose of this series is to add some error handling and improve
> warning messages.

Adding more information to those messages is great and IMO it should
have been done long ago instead of an xhci_dbg() that nobody enables.
These errors are sporadically seen on bugzilla/linux-usb and it would
be nice to know why they are happening.

I have done it on my system and it helped me understand some cases.
I found nothing terribly concerning, by the way.

(That being said, I don't like multi-line messages: they reduce log
rotation period and mix with others. I added info to existing logs.)

> Fixing the root cause is another task altogether.

But if you don't know why something happens, can you really fix it?

In normal operation these errors just don't happen, and in abnormal
operation there is no guarantee that normal recovery will work. The
driver already tried to avoid error by normal means and it failed.

A code with no known trigger can't be tested, so it may have bugs.
And it adds maintainance burden because people fear the unknown.

> Before, the only difference between a Set TR Deq command failure
> and success was that, in the case of failure, a warning was printed,
> and the xhci driver ring dequeue pointer was not moved. Otherwise,
> a Set TR Deq command failure behaved as if it were successful.

This is adequate for two most common cases that I know. Fristly, the
mysterious command TRB errors on ASMedia. The endpoint is stuck (but
not halted) until UAS times out and resets the whole device. Same
stuck endpoint behavior, but no TRB error, is seen on other HCs.

Secondly, xhci_handle_halted_endpoint() doesn't queue Reset Endpoint
when a SuperSpeed device is unplugged from the root hub. The endpoint
stays halted forever, so Set TR Dequeue is unnecessary and it fails.

Simply ignoring the failure is not a bug. Waiting for the endpoint to
reset (which never happens) turns it into a bug.

(This series generally fails to check for errors of command queuing
functions, which are not guaranteed to succeed.)

> In my opinion, instead of ignoring the Set TR Deq errors it would be
> better to retry the command if the error is due to something easily
> fixed, such as wrong EP state or Slot state.

There is nothing easy about stopping an endpoint, last year we spent
a few weeks with Mathias making it work.

See commits fd9d55d190c0 and 42b758137601, in this order ;)

We won't call xhci_invalidate_cancelled_tds() unless we are strongly
convinced that the endpoint is in the Stopped state, or the HC seems
to be completely disfunctional. (And I just facepalmed writing this,
because it looks like 28a76fcc4c85 does more harm than good, I think
will need to revert it.)

There is one case which cannot be detected by handle_cmd_stop_ep():
Stop Endpoint completes sucessfully, and later the endpoint changes
state without a doorbell ring. ASMedia HCs actually do such things
when they manage to corrupt their internal endpoint state. There
are some cases when stopping and restarting a (periodic?) endpoint
appears to corrupt the control endpoint; it generates no events or
Transaction Errors for no reason, changes state together with other
endpoints, etc. Resetting the xHC is the only solution I know.

Michal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-12 13:02 ` Neronin, Niklas
  2025-05-13 10:29   ` Michał Pecio
@ 2025-05-14  9:34   ` Michał Pecio
  2025-05-15  7:11     ` Neronin, Niklas
  1 sibling, 1 reply; 8+ messages in thread
From: Michał Pecio @ 2025-05-14  9:34 UTC (permalink / raw)
  To: Neronin, Niklas; +Cc: Mathias Nyman, linux-usb

On Mon, 12 May 2025 16:02:06 +0300, Neronin, Niklas wrote:
> > The case of "running" (or "halted", which "running" can morth into)
> > is possibly more useful, because we assume it's caused by illegal
> > state changes without driver's knowledge. But these are supposed to
> > be detected and fixed by handle_cmd_stop_ep(), because they cause
> > other problems too. It's unclear if retrying Stop Endpoint and Set
> > TR Deq again will solve any case not solved yet, but risk exists of
> > infinite retries on broken HW. (And HW is broken if we are here).  
> The infinite retry loop is a good point, did not consider this.
> But wouldn't the Stop EP command fail first, otherwise the state is
> correct for the Set TR Deq?

When Set Deq fails with Context State Error there are no rules that
can be relied on. Either handle_cmd_stop_ep() queued Set Deq in a wrong
state, or the xHC changed the endpoint state for unknown reason.

Either way, reissuing Stop EP takes us back to square one. Maybe the
SW or HW won't screw up this time and a new Set Deq will succeed, or
maybe it will screw up again, perhaps for the same reason as before,
and Set Deq will fail again.


ASMedia HCs are so unreliable that even their bugs are difficult to
reproduce, but I was lucky enough to capture one successful retry.

The format of the log below is:
slot/ep (ep_state/ctx.state) [dequeue/hw_dequeue] log info
All completions (other than transfer success) and commands are logged.

# the xHC is already in a fubar state
# idle device resuming from autosuspend
[ 2724.482048] 2/0 (040/3) [ffe73960/ffe73961] ring_ep_doorbell stream 0
[ 2724.482066] 2/2 (000/3) [ffeb7000/ffeb7001] ring_ep_doorbell stream 0
[ 2724.482074] 2/6 (000/3) [fff03020/fff03021] ring_ep_doorbell stream 0
[ 2724.482081] 2/8 (000/3) [fff96000/fff96001] ring_ep_doorbell stream 0
# a control transfer is cancelled after 5s timeout
[ 2729.943197] 2/0 (044/1) [ffe73960/ffe73961] queue_stop_endpoint suspend 0
[ 2729.943218] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
# Stopped Length Invalid for ep 0
[ 2729.943393] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event comp_code 27 trb_dma 0x00000000ffe73960
[ 2729.943414] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event stream_id 0 trb_len 8 missing 0
# Stopped Length Invalid for ep 2, because why not?
[ 2729.943425] 2/2 (000/3) [ffeb7000/ffeb7001] handle_tx_event comp_code 27 trb_dma 0x00000000ffeb7000
[ 2729.943433] 2/2 (000/3) [ffeb7000/ffeb7001] handle_tx_event stream_id 0 trb_len 0 missing 0
# another Stopped for ep 0 and the command completion (Success)
[ 2729.943439] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event comp_code 26 trb_dma 0x00000000ffe73960
[ 2729.943444] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event stream_id 0 trb_len 8 missing 8
[ 2729.943452] 2/0 (044/3) [ffe73960/ffe73961] handle_cmd_completion cmd_type 15 comp_code 1
# Set TR Deq is queued on ep 0 in state Stopped(3)
[ 2729.943461] 2/0 (044/3) [ffe73960/ffe73961] queue_set_tr_deq stream 0 addr 0x0x00000000ffe73990 EOS
[ 2729.943468] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
# Context State Error, ep 0 changed state to Running(1)
[ 2729.943796] 2/0 (041/1) [ffe73960/ffe73961] handle_cmd_completion cmd_type 16 comp_code 19
[ 2729.943805] Set TR Deq error for TRB 0xffe73990 in slot 2 ep 0
[ 2729.943810] Set TR Deq failed, due to running endpoint
# Stop EP succeeds, hw_dequeue advanced by one TRB
[ 2729.943814] 2/0 (044/1) [ffe73960/ffe73961] queue_stop_endpoint suspend 0
[ 2729.943820] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
[ 2729.943989] 2/0 (044/3) [ffe73960/ffe73971] handle_cmd_completion cmd_type 15 comp_code 1
# we get lucky and ep 0 state doesn't change this time
[ 2729.943996] 2/0 (044/3) [ffe73960/ffe73971] queue_set_tr_deq stream 0 addr 0x0x00000000ffe73990 EOS
[ 2729.944003] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
[ 2729.944150] 2/0 (041/3) [ffe73960/ffe73991] handle_cmd_completion cmd_type 16 comp_code 1

Note that the endpoint appears to change state when Set TR Deq is
queued to it. This is not the known "late restart" bug, it's some
complete nonsense and I don't know why it happens. The xHC is FUBAR.

This is the sort of conditions which will cause this new code to run.

Either that, or handle_cmd_stop_ep() gives up retrying too early and
a "late restart" condition is ignored. In the latter case, one more
retry after Set Deq failure will likely stop the endpoint, but damage
may already be done. Such bugs should be avoided in the first place.

Regards,
Michal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Problematic Set TR Deq error handling series in xhci-next
  2025-05-14  9:34   ` Michał Pecio
@ 2025-05-15  7:11     ` Neronin, Niklas
  0 siblings, 0 replies; 8+ messages in thread
From: Neronin, Niklas @ 2025-05-15  7:11 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Mathias Nyman, linux-usb



On 14/05/2025 12.34, Michał Pecio wrote:
> On Mon, 12 May 2025 16:02:06 +0300, Neronin, Niklas wrote:
>>> The case of "running" (or "halted", which "running" can morth into)
>>> is possibly more useful, because we assume it's caused by illegal
>>> state changes without driver's knowledge. But these are supposed to
>>> be detected and fixed by handle_cmd_stop_ep(), because they cause
>>> other problems too. It's unclear if retrying Stop Endpoint and Set
>>> TR Deq again will solve any case not solved yet, but risk exists of
>>> infinite retries on broken HW. (And HW is broken if we are here).  
>> The infinite retry loop is a good point, did not consider this.
>> But wouldn't the Stop EP command fail first, otherwise the state is
>> correct for the Set TR Deq?
> 
> ...
> 
> Either that, or handle_cmd_stop_ep() gives up retrying too early and
> a "late restart" condition is ignored. In the latter case, one more
> retry after Set Deq failure will likely stop the endpoint, but damage
> may already be done. Such bugs should be avoided in the first place.
> 
Thank you for the testing and review.
I've decided to hold off on the Set TR Deq series for now.

Best Regards,
Niklas


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-05-15  7:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-09  9:41 Problematic Set TR Deq error handling series in xhci-next Michał Pecio
2025-05-10 15:52 ` Alan Stern
2025-05-11 14:04   ` Michał Pecio
2025-05-11 19:24     ` Alan Stern
2025-05-12 13:02 ` Neronin, Niklas
2025-05-13 10:29   ` Michał Pecio
2025-05-14  9:34   ` Michał Pecio
2025-05-15  7:11     ` Neronin, Niklas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox