public inbox for linux-usb@vger.kernel.org
 help / color / mirror / Atom feed
From: "Michał Pecio" <michal.pecio@gmail.com>
To: "Neronin, Niklas" <niklas.neronin@linux.intel.com>
Cc: Mathias Nyman <mathias.nyman@intel.com>, linux-usb@vger.kernel.org
Subject: Re: Problematic Set TR Deq error handling series in xhci-next
Date: Wed, 14 May 2025 11:34:31 +0200	[thread overview]
Message-ID: <20250514113431.632efb74@foxbook> (raw)
In-Reply-To: <c606689d-e680-4796-bb65-87424a157e98@linux.intel.com>

On Mon, 12 May 2025 16:02:06 +0300, Neronin, Niklas wrote:
> > The case of "running" (or "halted", which "running" can morth into)
> > is possibly more useful, because we assume it's caused by illegal
> > state changes without driver's knowledge. But these are supposed to
> > be detected and fixed by handle_cmd_stop_ep(), because they cause
> > other problems too. It's unclear if retrying Stop Endpoint and Set
> > TR Deq again will solve any case not solved yet, but risk exists of
> > infinite retries on broken HW. (And HW is broken if we are here).  
> The infinite retry loop is a good point, did not consider this.
> But wouldn't the Stop EP command fail first, otherwise the state is
> correct for the Set TR Deq?

When Set Deq fails with Context State Error there are no rules that
can be relied on. Either handle_cmd_stop_ep() queued Set Deq in a wrong
state, or the xHC changed the endpoint state for unknown reason.

Either way, reissuing Stop EP takes us back to square one. Maybe the
SW or HW won't screw up this time and a new Set Deq will succeed, or
maybe it will screw up again, perhaps for the same reason as before,
and Set Deq will fail again.


ASMedia HCs are so unreliable that even their bugs are difficult to
reproduce, but I was lucky enough to capture one successful retry.

The format of the log below is:
slot/ep (ep_state/ctx.state) [dequeue/hw_dequeue] log info
All completions (other than transfer success) and commands are logged.

# the xHC is already in a fubar state
# idle device resuming from autosuspend
[ 2724.482048] 2/0 (040/3) [ffe73960/ffe73961] ring_ep_doorbell stream 0
[ 2724.482066] 2/2 (000/3) [ffeb7000/ffeb7001] ring_ep_doorbell stream 0
[ 2724.482074] 2/6 (000/3) [fff03020/fff03021] ring_ep_doorbell stream 0
[ 2724.482081] 2/8 (000/3) [fff96000/fff96001] ring_ep_doorbell stream 0
# a control transfer is cancelled after 5s timeout
[ 2729.943197] 2/0 (044/1) [ffe73960/ffe73961] queue_stop_endpoint suspend 0
[ 2729.943218] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
# Stopped Length Invalid for ep 0
[ 2729.943393] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event comp_code 27 trb_dma 0x00000000ffe73960
[ 2729.943414] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event stream_id 0 trb_len 8 missing 0
# Stopped Length Invalid for ep 2, because why not?
[ 2729.943425] 2/2 (000/3) [ffeb7000/ffeb7001] handle_tx_event comp_code 27 trb_dma 0x00000000ffeb7000
[ 2729.943433] 2/2 (000/3) [ffeb7000/ffeb7001] handle_tx_event stream_id 0 trb_len 0 missing 0
# another Stopped for ep 0 and the command completion (Success)
[ 2729.943439] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event comp_code 26 trb_dma 0x00000000ffe73960
[ 2729.943444] 2/0 (044/3) [ffe73960/ffe73961] handle_tx_event stream_id 0 trb_len 8 missing 8
[ 2729.943452] 2/0 (044/3) [ffe73960/ffe73961] handle_cmd_completion cmd_type 15 comp_code 1
# Set TR Deq is queued on ep 0 in state Stopped(3)
[ 2729.943461] 2/0 (044/3) [ffe73960/ffe73961] queue_set_tr_deq stream 0 addr 0x0x00000000ffe73990 EOS
[ 2729.943468] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
# Context State Error, ep 0 changed state to Running(1)
[ 2729.943796] 2/0 (041/1) [ffe73960/ffe73961] handle_cmd_completion cmd_type 16 comp_code 19
[ 2729.943805] Set TR Deq error for TRB 0xffe73990 in slot 2 ep 0
[ 2729.943810] Set TR Deq failed, due to running endpoint
# Stop EP succeeds, hw_dequeue advanced by one TRB
[ 2729.943814] 2/0 (044/1) [ffe73960/ffe73961] queue_stop_endpoint suspend 0
[ 2729.943820] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
[ 2729.943989] 2/0 (044/3) [ffe73960/ffe73971] handle_cmd_completion cmd_type 15 comp_code 1
# we get lucky and ep 0 state doesn't change this time
[ 2729.943996] 2/0 (044/3) [ffe73960/ffe73971] queue_set_tr_deq stream 0 addr 0x0x00000000ffe73990 EOS
[ 2729.944003] 0/-1 (fff/f) [ffffffff/ffffffff] xhci_ring_cmd_db cmd_ring_state 1
[ 2729.944150] 2/0 (041/3) [ffe73960/ffe73991] handle_cmd_completion cmd_type 16 comp_code 1

Note that the endpoint appears to change state when Set TR Deq is
queued to it. This is not the known "late restart" bug, it's some
complete nonsense and I don't know why it happens. The xHC is FUBAR.

This is the sort of conditions which will cause this new code to run.

Either that, or handle_cmd_stop_ep() gives up retrying too early and
a "late restart" condition is ignored. In the latter case, one more
retry after Set Deq failure will likely stop the endpoint, but damage
may already be done. Such bugs should be avoided in the first place.

Regards,
Michal

  parent reply	other threads:[~2025-05-14  9:34 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-09  9:41 Problematic Set TR Deq error handling series in xhci-next Michał Pecio
2025-05-10 15:52 ` Alan Stern
2025-05-11 14:04   ` Michał Pecio
2025-05-11 19:24     ` Alan Stern
2025-05-12 13:02 ` Neronin, Niklas
2025-05-13 10:29   ` Michał Pecio
2025-05-14  9:34   ` Michał Pecio [this message]
2025-05-15  7:11     ` Neronin, Niklas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250514113431.632efb74@foxbook \
    --to=michal.pecio@gmail.com \
    --cc=linux-usb@vger.kernel.org \
    --cc=mathias.nyman@intel.com \
    --cc=niklas.neronin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox