[PATCH v4 0/3] xhci: Fix Stop Endpoint problems

linux-usb.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/3] xhci: Fix Stop Endpoint problems
@ 2024-11-04  9:32 Michal Pecio
  2024-11-04  9:33 ` [PATCH v4 1/3] usb: xhci: Limit Stop Endpoint retries Michal Pecio
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Michal Pecio @ 2024-11-04  9:32 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

v4 of the attempt to sort out Stop Endpoint retries and related issues.

General design:

1/3 guarantee no infinite retries under any failure conditions, ever.
2/3 fix bugs and make Set TR Dequeue handler call invalidate/giveback.
3/3 eliminate cases of pointless Stop EP commands, depends on 2/3.

Changes since v3:

2/3 no longer tries to add code cleanups (and bugs, as it turned out)
    to a bugfix patch, and became much shorter as a result.
3/3 dropped a desperate attempt at detecting a halt/unlink race, which
    I was unable to reproduce because it seems impossible - EP_HALTED
    or SET_DEQ_PENDING remains set until after the halted URB is given
    back to USB core, so we would never queue a pointless Stop command.

Also some commit message and comment cleanups in all three.
And I documented another bug which 2/3 appears to (partly) fix.

The state of other chips:

VIA VL805 and Etron EJ168A don't seem to have the "latent restart" bug
so they don't care about this code, unless it causes pointless retries
on an already stopped endpoint for 100ms, if it's enabled for them.

ASM1042 and ASM3142 have the bug, although not as common as on NEC. It
seems to be particularly triggered by multiple endpoints cancelling at
the same time, so for example, both loops below simultaneously:

while true; do timeout .3 yavta -c /dev/video0; done
while true; do timeout .1 cat /dev/ttyUSB0 >/dev/null; done

On ASM3142 this triggers the bug about once per minute, and prints:

WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.

The "mitigation" patch seems to be helping, because reverting it adds:

ERROR Transfer event TRB DMA ptr not part of current TD ep_index 6 comp_code 13
Looking for event-dma 0000000003715ec0 trb-start 0000000003715ed0 trb-end 0000000003715ed0 seg-start 0000000003715000 seg-end 0000000003715ff0

Enabling the NEC workaround for this chip appears to fix the problem
entirely and so far it only took one retry each time:

Stop ep completion ctx error, ctx_state 3

I have left this running overninght and nothing unexpected was logged
in dmesg. IOW, no undesirable side effects known so far. But I haven't
done much tests with bad cables yet, and ASMedia chips appear to be
quite buggy and prone to strange internal race conditions. Examples:

- the ugly ifconfig up/down lockup issue on 3142 (not tried 1042 yet)
- a pair of "stopped" events out of order preceding the above lockup
- on 1042 Set Deq on streams seems to be no-op (streams are disabled)
- Soft Retry preceding a Set Deq seems to undo the Set Deq's effect
- Stop Endpoint *success* and then a spontaneous restart (1042 only)
- spontaneous restart causes a halt, but transfer event is missing
- TRB 15 (Stop EP) on the event ring (seen only once on 1042)

ASM1042 still not tested for compatilibity with the workaround at all.

Michal Pecio (3):
  usb: xhci: Limit Stop Endpoint retries
  usb: xhci: Fix TD invalidation under pending Set TR Dequeue
  usb: xhci: Avoid queuing redundant Stop Endpoint commands

 drivers/usb/host/xhci-ring.c | 59 ++++++++++++++++++++++++++++++------
 drivers/usb/host/xhci.c      | 21 ++++++++++---
 drivers/usb/host/xhci.h      |  2 ++
 3 files changed, 69 insertions(+), 13 deletions(-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v4 1/3] usb: xhci: Limit Stop Endpoint retries
  2024-11-04  9:32 [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Michal Pecio
@ 2024-11-04  9:33 ` Michal Pecio
  2024-11-04  9:33 ` [PATCH v4 2/3] usb: xhci: Fix TD invalidation under pending Set TR Dequeue Michal Pecio
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Michal Pecio @ 2024-11-04  9:33 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

Some host controllers fail to atomically transition an endpoint to the
Running state on a doorbell ring and enter a hidden "Restarting" state,
which looks very much like Stopped, with the important difference that
it will spontaneously transition to Running anytime soon.

A Stop Endpoint command queued in the Restarting state typically fails
with Context State Error and the completion handler sees the Endpoint
Context State as either still Stopped or already Running. Even a case
of Halted was observed, when an error occurred right after the restart.

The Halted state is already recovered from by resetting the endpoint.
The Running state is handled by retrying Stop Endpoint.

The Stopped state was recognized as a problem on NEC controllers and
worked around also by retrying, because the endpoint soon restarts and
then stops for good. But there is a risk: the command may fail if the
endpoint is "stopped for good" already, and retries will fail forever.

The possibility of this was not realized at the time, but a number of
cases were discovered later and reproduced. Some proved difficult to
deal with, and it is outright impossible to predict if an endpoint may
fail to ever start at all due to a hardware bug. One such bug (albeit
on ASM3142, not on NEC) was found to be reliably triggered simply by
toggling an AX88179 NIC up/down in a tight loop for a few seconds.

An endless retries storm is quite nasty. Besides putting needless load
on the xHC and CPU, it causes URBs never to be given back, paralyzing
the device and connection/disconnection logic for the whole bus if the
device is unplugged. User processes waiting for URBs become unkillable,
drivers and kworker threads lock up and xhci_hcd cannot be reloaded.

For peace of mind, impose a timeout on Stop Endpoint retries in this
case. If they don't succeed in 100ms, consider the endpoint stopped
permanently for some reason and just give back the unlinked URBs. This
failure case is rare already and work is under way to make it rarer.

Start this work today by also handling one simple case of race with
Reset Endpoint, because it costs just two lines to implement.

Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
CC: stable@vger.kernel.org
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 28 ++++++++++++++++++++++++----
 drivers/usb/host/xhci.c      |  2 ++
 drivers/usb/host/xhci.h      |  1 +
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b6eb928e260f..98e12267bbf6 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -52,6 +52,7 @@
  *   endpoint rings; it generates events on the event ring for these.
  */

+#include <linux/jiffies.h>
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
 #include <linux/dma-mapping.h>
@@ -1151,16 +1152,35 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			return;
 		case EP_STATE_STOPPED:
 			/*
-			 * NEC uPD720200 sometimes sets this state and fails with
-			 * Context Error while continuing to process TRBs.
-			 * Be conservative and trust EP_CTX_STATE on other chips.
+			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
+			 * EP is a Context State Error, and EP stays Stopped.
+			 *
+			 * But maybe it failed on Halted, and somebody ran Reset
+			 * Endpoint later. EP state is now Stopped and EP_HALTED
+			 * still set because Reset EP handler will run after us.
+			 */
+			if (ep->ep_state & EP_HALTED)
+				break;
+			/*
+			 * On some HCs EP state remains Stopped for some tens of
+			 * us to a few ms or more after a doorbell ring, and any
+			 * new Stop Endpoint fails without aborting the restart.
+			 * This handler may run quickly enough to still see this
+			 * Stopped state, but it will soon change to Running.
+			 *
+			 * Assume this bug on unexpected Stop Endpoint failures.
+			 * Keep retrying until the EP starts and stops again, on
+			 * chips where this is known to help. Wait for 100ms.
 			 */
 			if (!(xhci->quirks & XHCI_NEC_HOST))
 				break;
+			if (time_is_before_jiffies(ep->stop_time + msecs_to_jiffies(100)))
+				break;
 			fallthrough;
 		case EP_STATE_RUNNING:
 			/* Race, HW handled stop ep cmd before ep was running */
-			xhci_dbg(xhci, "Stop ep completion ctx error, ep is running\n");
+			xhci_dbg(xhci, "Stop ep completion ctx error, ctx_state %d\n",
+					GET_EP_CTX_STATE(ep_ctx));

 			command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
 			if (!command) {
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 899c0effb5d3..1ed4dce7eeb7 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -8,6 +8,7 @@
  * Some code borrowed from the Linux EHCI driver.
  */

+#include <linux/jiffies.h>
 #include <linux/pci.h>
 #include <linux/iommu.h>
 #include <linux/iopoll.h>
@@ -1777,6 +1778,7 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 			ret = -ENOMEM;
 			goto done;
 		}
+		ep->stop_time = jiffies;
 		ep->ep_state |= EP_STOP_CMD_PENDING;
 		xhci_queue_stop_endpoint(xhci, command, urb->dev->slot_id,
 					 ep_index, 0);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index f0fb696d5619..0bef6c8b51cf 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -690,6 +690,7 @@ struct xhci_virt_ep {
 	/* Bandwidth checking storage */
 	struct xhci_bw_info	bw_info;
 	struct list_head	bw_endpoint_list;
+	unsigned long		stop_time;
 	/* Isoch Frame ID checking storage */
 	int			next_frame_id;
 	/* Use new Isoch TRB layout needed for extended TBC support */
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v4 2/3] usb: xhci: Fix TD invalidation under pending Set TR Dequeue
  2024-11-04  9:32 [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Michal Pecio
  2024-11-04  9:33 ` [PATCH v4 1/3] usb: xhci: Limit Stop Endpoint retries Michal Pecio
@ 2024-11-04  9:33 ` Michal Pecio
  2024-11-04  9:34 ` [PATCH v4 3/3] usb: xhci: Avoid queuing redundant Stop Endpoint commands Michal Pecio
  2024-11-05 11:05 ` [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Mathias Nyman
  3 siblings, 0 replies; 6+ messages in thread
From: Michal Pecio @ 2024-11-04  9:33 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

xhci_invalidate_cancelled_tds() may not work correctly if the hardware
is modifying endpoint or stream contexts at the same time by executing
a Set TR Dequeue command. And even if it worked, it would be unable to
queue Set TR Dequeue for the next stream, failing to clear xHC cache.

On stream endpoints, a chain of Set TR Dequeue commands may take some
time to execute and we may want to cancel more TDs during this time.
Currently this leads to Stop Endpoint completion handler calling this
function without testing for SET_DEQ_PENDING, which will trigger the
aforementioned problems when it happens.

On all endpoints, a halt condition causes Reset Endpoint to be queued
and an error status given to the class driver, which may unlink more
URBs in response. Stop Endpoint is queued and its handler may execute
concurrently with Set TR Dequeue queued by Reset Endpoint handler.

(Reset Endpoint handler calls this function too, but there seems to
be no possibility of it running concurrently with Set TR Dequeue).

Fix xhci_invalidate_cancelled_tds() to work correctly under a pending
Set TR Dequeue. Bail out of the function when SET_DEQ_PENDING is set,
then make the completion handler call the function again and also call
xhci_giveback_invalidated_tds(), which needs to be called next.

This seems to fix another potential bug, where the handler would call
xhci_invalidate_cancelled_tds(), which may clear some deferred TDs if
a sanity check fails, and the TDs wouldn't be given back promptly.

Said sanity check seems to be wrong and prone to false positives when
the endpoint halts, but fixing it is beyond the scope of this change,
besides ensuring that cleared TDs are given back properly.

Fixes: 5ceac4402f5d ("xhci: Handle TD clearing for multiple streams case")
CC: stable@vger.kernel.org
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 98e12267bbf6..30b392715f1e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -973,6 +973,13 @@ static int xhci_invalidate_cancelled_tds(struct xhci_virt_ep *ep)
 	unsigned int		slot_id = ep->vdev->slot_id;
 	int			err;

+	/*
+	 * This is not going to work if the hardware is changing its dequeue
+	 * pointers as we look at them. Completion handler will call us later.
+	 */
+	if (ep->ep_state & SET_DEQ_PENDING)
+		return 0;
+
 	xhci = ep->xhci;

 	list_for_each_entry_safe(td, tmp_td, &ep->cancelled_td_list, cancelled_td_list) {
@@ -1359,7 +1366,6 @@ static void xhci_handle_cmd_set_deq(struct xhci_hcd *xhci, int slot_id,
 	struct xhci_ep_ctx *ep_ctx;
 	struct xhci_slot_ctx *slot_ctx;
 	struct xhci_td *td, *tmp_td;
-	bool deferred = false;

 	ep_index = TRB_TO_EP_INDEX(le32_to_cpu(trb->generic.field[3]));
 	stream_id = TRB_TO_STREAM_ID(le32_to_cpu(trb->generic.field[2]));
@@ -1460,8 +1466,6 @@ static void xhci_handle_cmd_set_deq(struct xhci_hcd *xhci, int slot_id,
 			xhci_dbg(ep->xhci, "%s: Giveback cancelled URB %p TD\n",
 				 __func__, td->urb);
 			xhci_td_cleanup(ep->xhci, td, ep_ring, td->status);
-		} else if (td->cancel_status == TD_CLEARING_CACHE_DEFERRED) {
-			deferred = true;
 		} else {
 			xhci_dbg(ep->xhci, "%s: Keep cancelled URB %p TD as cancel_status is %d\n",
 				 __func__, td->urb, td->cancel_status);
@@ -1472,11 +1476,15 @@ static void xhci_handle_cmd_set_deq(struct xhci_hcd *xhci, int slot_id,
 	ep->queued_deq_seg = NULL;
 	ep->queued_deq_ptr = NULL;

-	if (deferred) {
-		/* We have more streams to clear */
+	/* Check for deferred or newly cancelled TDs */
+	if (!list_empty(&ep->cancelled_td_list)) {
 		xhci_dbg(ep->xhci, "%s: Pending TDs to clear, continuing with invalidation\n",
 			 __func__);
 		xhci_invalidate_cancelled_tds(ep);
+		/* Try to restart the endpoint if all is done */
+		ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
+		/* Start giving back any TDs invalidated above */
+		xhci_giveback_invalidated_tds(ep);
 	} else {
 		/* Restart any rings with pending URBs */
 		xhci_dbg(ep->xhci, "%s: All TDs cleared, ring doorbell\n", __func__);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v4 3/3] usb: xhci: Avoid queuing redundant Stop Endpoint commands
  2024-11-04  9:32 [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Michal Pecio
  2024-11-04  9:33 ` [PATCH v4 1/3] usb: xhci: Limit Stop Endpoint retries Michal Pecio
  2024-11-04  9:33 ` [PATCH v4 2/3] usb: xhci: Fix TD invalidation under pending Set TR Dequeue Michal Pecio
@ 2024-11-04  9:34 ` Michal Pecio
  2024-11-05 11:05 ` [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Mathias Nyman
  3 siblings, 0 replies; 6+ messages in thread
From: Michal Pecio @ 2024-11-04  9:34 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

Stop Endpoint command on an already stopped endpoint fails and may be
misinterpreted as a known hardware bug by the completion handler. This
results in an unnecessary delay with repeated retries of the command.

Avoid queuing this command when endpoint state flags indicate that it's
stopped or halted and the command will fail. If commands are pending on
the endpoint, their completion handlers will process cancelled TDs so
it's done. In case of waiting for external operations like clearing TT
buffer, the endpoint is stopped and cancelled TDs can be processed now.

This eliminates practically all unnecessary retries because an endpoint
with pending URBs is maintained in Running state by the driver, unless
aforementioned commands or other operations are pending on it. This is
guaranteed by xhci_ring_ep_doorbell() and by the fact that it is called
every time any of those operations completes.

The only known exceptions are hardware bugs (the endpoint never starts
at all) and Stream Protocol errors not associated with any TRB, which
cause an endpoint reset not followed by restart. Sounds like a bug.

Generally, these retries are only expected to happen when the endpoint
fails to start for unknown/no reason, which is a worse problem itself,
and fixing the bug eliminates the retries too.

All cases were tested and found to work as expected. SET_DEQ_PENDING
was produced by patching uvcvideo to unlink URBs in 100us intervals,
which then runs into this case very often. EP_HALTED was produced by
restarting 'cat /dev/ttyUSB0' on a serial dongle with broken cable.
EP_CLEARING_TT by the same, with the dongle on an external hub.

Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
CC: stable@vger.kernel.org
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 13 +++++++++++++
 drivers/usb/host/xhci.c      | 19 +++++++++++++++----
 drivers/usb/host/xhci.h      |  1 +
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 30b392715f1e..92f9b6c468c6 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1069,6 +1069,19 @@ static int xhci_invalidate_cancelled_tds(struct xhci_virt_ep *ep)
 	return 0;
 }

+/*
+ * Erase queued TDs from transfer ring(s) and give back those the xHC didn't
+ * stop on. If necessary, queue commands to move the xHC off cancelled TDs it
+ * stopped on. Those will be given back later when the commands complete.
+ *
+ * Call under xhci->lock on a stopped endpoint.
+ */
+void xhci_process_cancelled_tds(struct xhci_virt_ep *ep)
+{
+	xhci_invalidate_cancelled_tds(ep);
+	xhci_giveback_invalidated_tds(ep);
+}
+
 /*
  * Returns the TD the endpoint ring halted on.
  * Only call for non-running rings without streams.
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 1ed4dce7eeb7..695e29b0946c 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1769,10 +1769,21 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		}
 	}

-	/* Queue a stop endpoint command, but only if this is
-	 * the first cancellation to be handled.
-	 */
-	if (!(ep->ep_state & EP_STOP_CMD_PENDING)) {
+	/* These completion handlers will sort out cancelled TDs for us */
+	if (ep->ep_state & (EP_STOP_CMD_PENDING | EP_HALTED | SET_DEQ_PENDING)) {
+		xhci_dbg(xhci, "Not queuing Stop Endpoint on slot %d ep %d in state 0x%x\n",
+				urb->dev->slot_id, ep_index, ep->ep_state);
+		goto done;
+	}
+
+	/* In this case no commands are pending but the endpoint is stopped */
+	if (ep->ep_state & EP_CLEARING_TT) {
+		/* and cancelled TDs can be given back right away */
+		xhci_dbg(xhci, "Invalidating TDs instantly on slot %d ep %d in state 0x%x\n",
+				urb->dev->slot_id, ep_index, ep->ep_state);
+		xhci_process_cancelled_tds(ep);
+	} else {
+		/* Otherwise, queue a new Stop Endpoint command */
 		command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
 		if (!command) {
 			ret = -ENOMEM;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 0bef6c8b51cf..33e57f408e86 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1914,6 +1914,7 @@ void xhci_ring_doorbell_for_active_rings(struct xhci_hcd *xhci,
 void xhci_cleanup_command_queue(struct xhci_hcd *xhci);
 void inc_deq(struct xhci_hcd *xhci, struct xhci_ring *ring);
 unsigned int count_trbs(u64 addr, u64 len);
+void xhci_process_cancelled_tds(struct xhci_virt_ep *ep);

 /* xHCI roothub code */
 void xhci_set_link_state(struct xhci_hcd *xhci, struct xhci_port *port,
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 0/3] xhci: Fix Stop Endpoint problems
  2024-11-04  9:32 [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Michal Pecio
                   ` (2 preceding siblings ...)
  2024-11-04  9:34 ` [PATCH v4 3/3] usb: xhci: Avoid queuing redundant Stop Endpoint commands Michal Pecio
@ 2024-11-05 11:05 ` Mathias Nyman
  2024-11-05 12:18   ` Michał Pecio
  3 siblings, 1 reply; 6+ messages in thread
From: Mathias Nyman @ 2024-11-05 11:05 UTC (permalink / raw)
  To: Michal Pecio, Mathias Nyman; +Cc: linux-usb

On 4.11.2024 11.32, Michal Pecio wrote:
> v4 of the attempt to sort out Stop Endpoint retries and related issues.
> 
> General design:
> 
> 1/3 guarantee no infinite retries under any failure conditions, ever.
> 2/3 fix bugs and make Set TR Dequeue handler call invalidate/giveback.
> 3/3 eliminate cases of pointless Stop EP commands, depends on 2/3.
> 
> Changes since v3:
> 
> 2/3 no longer tries to add code cleanups (and bugs, as it turned out)
>      to a bugfix patch, and became much shorter as a result.
> 3/3 dropped a desperate attempt at detecting a halt/unlink race, which
>      I was unable to reproduce because it seems impossible - EP_HALTED
>      or SET_DEQ_PENDING remains set until after the halted URB is given
>      back to USB core, so we would never queue a pointless Stop command.
> 
> Also some commit message and comment cleanups in all three.
> And I documented another bug which 2/3 appears to (partly) fix.
> 
> 
> The state of other chips:
> 
> VIA VL805 and Etron EJ168A don't seem to have the "latent restart" bug
> so they don't care about this code, unless it causes pointless retries
> on an already stopped endpoint for 100ms, if it's enabled for them.
> 
> ASM1042 and ASM3142 have the bug, although not as common as on NEC. It
> seems to be particularly triggered by multiple endpoints cancelling at
> the same time, so for example, both loops below simultaneously:
> 
> while true; do timeout .3 yavta -c /dev/video0; done
> while true; do timeout .1 cat /dev/ttyUSB0 >/dev/null; done
> 
> On ASM3142 this triggers the bug about once per minute, and prints:
> 
> WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
> 
> The "mitigation" patch seems to be helping, because reverting it adds:
> 
> ERROR Transfer event TRB DMA ptr not part of current TD ep_index 6 comp_code 13
> Looking for event-dma 0000000003715ec0 trb-start 0000000003715ed0 trb-end 0000000003715ed0 seg-start 0000000003715000 seg-end 0000000003715ff0
> 
> Enabling the NEC workaround for this chip appears to fix the problem
> entirely and so far it only took one retry each time:
> 
> Stop ep completion ctx error, ctx_state 3
> 
> I have left this running overninght and nothing unexpected was logged
> in dmesg. IOW, no undesirable side effects known so far. But I haven't
> done much tests with bad cables yet, and ASMedia chips appear to be
> quite buggy and prone to strange internal race conditions. Examples:
> 
> - the ugly ifconfig up/down lockup issue on 3142 (not tried 1042 yet)
> - a pair of "stopped" events out of order preceding the above lockup
> - on 1042 Set Deq on streams seems to be no-op (streams are disabled)
> - Soft Retry preceding a Set Deq seems to undo the Set Deq's effect
> - Stop Endpoint *success* and then a spontaneous restart (1042 only)
> - spontaneous restart causes a halt, but transfer event is missing
> - TRB 15 (Stop EP) on the event ring (seen only once on 1042)
> 
> ASM1042 still not tested for compatilibity with the workaround at all.

Thanks for all the work you put into this series.
I added all patches to my for-usb-next branch and will send them to Greg shortly.

I think this code won't be NEC specific for very long as other vendors
have these similar issues.

Out of curiosity, what was the longest needed 'stop endpoint' retry time
you saw which still had a successful outcome?

Thanks
Mathias


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 0/3] xhci: Fix Stop Endpoint problems
  2024-11-05 11:05 ` [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Mathias Nyman
@ 2024-11-05 12:18   ` Michał Pecio
  0 siblings, 0 replies; 6+ messages in thread
From: Michał Pecio @ 2024-11-05 12:18 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Mathias Nyman, linux-usb

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

On Tue, 5 Nov 2024 13:05:19 +0200, Mathias Nyman wrote:
> Out of curiosity, what was the longest needed 'stop endpoint' retry
> time you saw which still had a successful outcome?

On NEC, periodic endpoints appear to take up to one ESIT. It's one or
two retries on fast isochronous EPs, but interrupt may take a while.

The longest I have seen was 24ms on a SuperSpeed EP with bInterval 11,
so it looks like this may be the chip's limit (bInterval 11 = 128ms),
but admittedly, I haven't tested this thoroughly with many devices.

No problems ever seen with bulk or control on NEC.

On ASM3142 it seems that one retry tends to be enough, and IIRC all
endpoint types can misbehave.


I attached a patch, based on the earlier "redundant" patch, which tries
to detect these bugs by looping until it sees EP Context state change.
Looping under spinlock sure is dodgy, but with these timeout values it
doesn't seem to completely break the driver and it finds these HC bugs.

And yes, I have seen the "absolutely fubar" case on ASM1042 with bad
cable (repeated halts, resets and stops). Some time later it "died".

Regards,
Michal

[-- Attachment #2: xhci-detect-stop-bugs.patch --]
[-- Type: text/x-patch, Size: 6897 bytes --]

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b6eb928e260f..85289f09ca18 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -487,6 +487,27 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci, unsigned long flags)
 	return 0;
 }
 
+static bool busywait(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, int timeout_ms)
+{
+	struct xhci_ep_ctx *ep_ctx = xhci_get_ep_ctx(xhci, ep->vdev->out_ctx, ep->ep_index);
+	u64 t2, t1 = ktime_get_ns();
+
+	do {
+		t2 = ktime_get_ns();
+		if (GET_EP_CTX_STATE(READ_ONCE(ep_ctx)) == EP_STATE_RUNNING) {
+			xhci_info(xhci, "slot %d ep %d busy wait found ctx_state RUNNING after %lldus\n",
+					ep->vdev->slot_id, ep->ep_index,
+					(t2 - t1) / 1000);
+			return true;
+		}
+	} while (t2 < t1 + timeout_ms * 1000000);
+
+	xhci_info(xhci, "slot %d ep %d busy wait gave up after %lldus\n",
+			ep->vdev->slot_id, ep->ep_index,
+			(t2 - t1) / 1000);
+	return false;
+}
+
 void xhci_ring_ep_doorbell(struct xhci_hcd *xhci,
 		unsigned int slot_id,
 		unsigned int ep_index,
@@ -1114,6 +1135,20 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 		return;
 
 	ep_ctx = xhci_get_ep_ctx(xhci, ep->vdev->out_ctx, ep_index);
+	int ctx_state = GET_EP_CTX_STATE(READ_ONCE(ep_ctx));
+
+	if (ep->stop_retry)
+		xhci_info(xhci, comp_code == COMP_SUCCESS ?
+				"it worked!\n" : "it failed %d\n", comp_code);
+	ep->stop_retry = false;
+
+	if (comp_code != COMP_SUCCESS && comp_code != COMP_CONTEXT_STATE_ERROR)
+		xhci_err(xhci, "slot %d ep %d WTF BUG in ep_state 0x%x\n",
+				slot_id, ep_index, ep->ep_state);
+
+	if (comp_code == COMP_SUCCESS && busywait(xhci, ep, 3))
+		xhci_err(xhci, "slot %d ep %d ABSOLUTELY FUBAR BUG in ep_state 0x%x ctx_state %d\n",
+				slot_id, ep_index, ep->ep_state, ctx_state);
 
 	trace_xhci_handle_cmd_stop_ep(ep_ctx);
 
@@ -1132,7 +1167,7 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	 * next transfer, which then will return -EPIPE, and device side stall is
 	 * noted and cleared by class driver.
 	 */
-		switch (GET_EP_CTX_STATE(ep_ctx)) {
+		switch (ctx_state) {
 		case EP_STATE_HALTED:
 			xhci_dbg(xhci, "Stop ep completion raced with stall, reset ep\n");
 			if (ep->ep_state & EP_HAS_STREAMS) {
@@ -1149,19 +1184,55 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 				break;
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
+
 		case EP_STATE_STOPPED:
+			bool predicted = true;
 			/*
-			 * NEC uPD720200 sometimes sets this state and fails with
-			 * Context Error while continuing to process TRBs.
-			 * Be conservative and trust EP_CTX_STATE on other chips.
+			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
+			 * EP is a Context State Error, and EP stays Stopped.
+			 * This outcome is valid if the command was redundant.
 			 */
-			if (!(xhci->quirks & XHCI_NEC_HOST))
-				break;
-			fallthrough;
+			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
+				predicted = false;
+			/*
+			 * Or it really failed on Halted, but somebody ran Reset
+			 * Endpoint later. EP state is now Stopped and EP_HALTED
+			 * still set because Reset EP handler will run after us.
+			 */
+			if (ep->ep_state & EP_HALTED)
+				predicted = false;
+			/*
+			 * On some HCs EP state remains Stopped for some tens of
+			 * us to a few ms or more after a doorbell ring, and any
+			 * new Stop Endpoint fails without aborting the restart.
+			 * This handler may run quickly enough to still see this
+			 * Stopped state, but it will soon change to Running.
+			 *
+			 * Assume this bug on unexpected Stop Endpoint failures.
+			 * Keep retrying until the EP starts and stops again, on
+			 * chips known to have the bug and to react positively.
+			 */
+			if (busywait(xhci, ep, predicted? 30: 3)) {
+				xhci_err(xhci, "slot %d ep %d %sSTOPPED BUG in ep_state 0x%x\n",
+						slot_id, ep_index, predicted? "": "UNPREDICTED ",
+						ep->ep_state);
+				goto retry;
+			}
+			if (predicted)
+				xhci_err(xhci, "slot %d ep %d MISPREDICTED STOPPED BUG in ep_state 0x%x (busywait too short?)\n",
+						slot_id, ep_index, ep->ep_state);
+			else
+				xhci_info(xhci, "slot %d ep %d not a bug predicted in ep_state 0x%x\n",
+						slot_id, ep_index, ep->ep_state);
+			break;
+
 		case EP_STATE_RUNNING:
 			/* Race, HW handled stop ep cmd before ep was running */
-			xhci_dbg(xhci, "Stop ep completion ctx error, ep is running\n");
-
+			xhci_err(xhci, "slot %d ep %d RUNNING BUG in ep_state %d\n",
+					slot_id, ep_index, ep->ep_state);
+retry:
+			xhci_info(xhci, "stop again and see what happens...\n");
+			ep->stop_retry = true;
 			command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
 			if (!command) {
 				ep->ep_state &= ~EP_STOP_CMD_PENDING;
@@ -4375,11 +4446,31 @@ int xhci_queue_evaluate_context(struct xhci_hcd *xhci, struct xhci_command *cmd,
 int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd,
 			     int slot_id, unsigned int ep_index, int suspend)
 {
+	struct xhci_virt_ep *ep;
 	u32 trb_slot_id = SLOT_ID_FOR_TRB(slot_id);
 	u32 trb_ep_index = EP_INDEX_FOR_TRB(ep_index);
 	u32 type = TRB_TYPE(TRB_STOP_RING);
 	u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+	/*
+	 * Any of that means the EP is or will be Stopped by the time this
+	 * command runs. Queue it anyway for its side effects like calling
+	 * our default handler or complete(). But our handler must know if
+	 * the command is redundant, so check it now. The handler can't do
+	 * it later because those operations may finish before it runs.
+	 */
+	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
+	if (ep) {
+		if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
+			ep->ep_state |= EP_STOP_CMD_REDUNDANT;
+		else
+			ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
+		if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
+			xhci_info(xhci, "slot %d ep %d queuing a redundant Stop Endpoint in ep_state 0x%x\n",
+					slot_id, ep_index, ep->ep_state);
+	}
+	/* else: don't care, the handler will do nothing anyway */
+
 	return queue_command(xhci, cmd, 0, 0, 0,
 			trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index f0fb696d5619..05555ca4f38a 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -670,6 +670,8 @@ struct xhci_virt_ep {
 #define EP_SOFT_CLEAR_TOGGLE	(1 << 7)
 /* usb_hub_clear_tt_buffer is in progress */
 #define EP_CLEARING_TT		(1 << 8)
+/* queued Stop Endpoint is expected to fail */
+#define EP_STOP_CMD_REDUNDANT	(1 << 9)
 	/* ----  Related to URB cancellation ---- */
 	struct list_head	cancelled_td_list;
 	struct xhci_hcd		*xhci;
@@ -694,6 +696,8 @@ struct xhci_virt_ep {
 	int			next_frame_id;
 	/* Use new Isoch TRB layout needed for extended TBC support */
 	bool			use_extended_tbc;
+
+	bool			stop_retry;
 };
 
 enum xhci_overhead_type {

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-11-05 12:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-04  9:32 [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Michal Pecio
2024-11-04  9:33 ` [PATCH v4 1/3] usb: xhci: Limit Stop Endpoint retries Michal Pecio
2024-11-04  9:33 ` [PATCH v4 2/3] usb: xhci: Fix TD invalidation under pending Set TR Dequeue Michal Pecio
2024-11-04  9:34 ` [PATCH v4 3/3] usb: xhci: Avoid queuing redundant Stop Endpoint commands Michal Pecio
2024-11-05 11:05 ` [PATCH v4 0/3] xhci: Fix Stop Endpoint problems Mathias Nyman
2024-11-05 12:18   ` Michał Pecio

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).