[PATCH 0/2] Fix the NEC stop bug workaround

public inbox for linux-usb@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] Fix the NEC stop bug workaround
@ 2024-10-14 19:08 Michal Pecio
  2024-10-14 19:10 ` [PATCH 1/2] usb: xhci: " Michal Pecio
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Michal Pecio @ 2024-10-14 19:08 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

Hi,

I found an unfortunate problem with my workaround for this hardware bug.

To recap, Stop Endpoint sometimes fails, the Endpoint Context says the
EP is Stopped, but cancelled TRBs are still executed. I found this bug
earlier this year and submitted a workaround, which retries the command
(sometimes a few times) and all is good.

This works fine for common cases, but what if the endpoint is really
stopped? Then Stop Endpoint is supposed to fail and fail it does. The
workaround code doesn't know that it happened and retries infinitely.

I have never seen it in normal use, but I devised a reliable repro.
The effect isn't pretty - no URBs can be cancelled, device gets stuck,
if unplugged it locks up connections/disconnections on the whole bus.

With some experimentation I found that the bug is a variant of the old
"stop after restart" issue - the doorbell ring is internally reordered
after the subsequent command. By busy-waiting I confirmed that EP state
which is initially seen as Stopped becomes Running some time later.

I came up with a few ways to deal with it, this patch implements #3
which I think is the lowest risk. But for completeness:

0. Do nothing, revert the old patch. Not great, we are back to those
races and DMA-after-free. Seems particularly dangereous on Int IN EPs,
which may take a few ms to start - plenty of time to reuse URB buffers.

1. Ring the doorbell before queuing another Stop Endpoint. This ensures
that the EP will restart even if it wasn't meant to yet. Then a retry
succeeds and we are done. Super-simple and seems to work 100% reliably,
but what if there are still more gotchas in this hardware?

2. Set a flag on doorbell ring, clear on Stop EP or Halt. Look at the
flag to decide if it's the bug or a legitimately stopped EP. But I
didn't like adding overhead to DB ring, even if almost unmeasurable.

3. Set a flag when we know that the command will fail. This appears to
actually be doable. Then we just look at the flag and retry only if it
wasn't supposed to fail.

And some further ideas, considered but not implemented yet:

4. If we know that the command will fail, don't queue it at all. Other
commands are pending in those cases, so modify their handlers to do
our work for us. A little more invasive than the simple fixes 1-3.

5. Maybe it would make sense to ensure that the command can't ever
retry infinitely. Just give up and call hc_died() after 5 seconds.

Regards,
Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] usb: xhci: Fix the NEC stop bug workaround
  2024-10-14 19:08 [PATCH 0/2] Fix the NEC stop bug workaround Michal Pecio
@ 2024-10-14 19:10 ` Michal Pecio
  2024-10-15 10:38   ` Greg KH
  2024-10-15 11:05   ` Mathias Nyman
  2024-10-14 19:11 ` [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs Michal Pecio
  2024-10-15 12:23 ` [PATCH 0/2] Fix the NEC stop bug workaround Mathias Nyman
  2 siblings, 2 replies; 12+ messages in thread
From: Michal Pecio @ 2024-10-14 19:10 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

The NEC uPD720200 has a bug, which prevents reliably stopping
an endpoint shortly after it has been restarted. This usually
happens when a driver kills many URBs in quick succession and
it results in concurrent execution and cancellation of TDs.

This is handled by stopping the endpoint again if in doubt.

This "doubt" turns out to be a problem, because Stop Endpoint
may be queued when the EP is already Stopped (for Set TR Deq
execution, for example) or becomes Stopped concurrently (by
Reset Endpoint, for example). If the EP is truly Stopped, the
command fails and further retries just keep failing forever.

This is easily triggered by modifying uvcvideo to unlink its
isochronous URBs in 100us intervals instead of poisoning them.
Any driver that unlinks URBs asynchronously may trigger this,
and any URB unlink during ongoing halt recovery also can.

Fix the problem by tracking redundant Stop Endpoint commands
which are sure to fail, and by not retrying them. It's easy,
because xhci_urb_dequeue() is the only user ever queuing the
command with the default handler and without ensuring that
the endpoint is Running and will not Halt before it Stops.
For this case, we assume that an endpoint with pending URBs
is always Running, unless certain operations are pending on
it which indicate known exceptions.

Note that we need to catch those exceptions when they occur,
because their flags may be cleared before our handler runs.

It's possible that other HCs have similar bugs (see also the
related "Running" case below), but the workaround is limited
to NEC because no such chips are currently known and tested.

Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
 drivers/usb/host/xhci.h      |  2 ++
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 4d664ba53fe9..c0efb4d34ab9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -911,6 +911,21 @@ static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
 	return ret;
 }
 
+/*
+ * A Stop Endpoint command is redundant if the EP is not in the Running state.
+ * It will fail with Context State Error. We sometimes queue redundant Stop EP
+ * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
+ * A pending Stop Endpoint command *becomes* redundant if the EP halts before
+ * its completion, and this flag needs to be updated in those cases too.
+ */
+static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
+{
+	if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
+		ep->ep_state |= EP_STOP_CMD_REDUNDANT;
+	else
+		ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
+}
+
 static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 				struct xhci_virt_ep *ep,
 				struct xhci_td *td,
@@ -946,6 +961,7 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
 		return err;
 
 	ep->ep_state |= EP_HALTED;
+	xhci_update_stop_cmd_redundant(ep);
 
 	xhci_ring_cmd_db(xhci);
 
@@ -1149,15 +1165,31 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 				break;
 			ep->ep_state &= ~EP_STOP_CMD_PENDING;
 			return;
+
 		case EP_STATE_STOPPED:
 			/*
-			 * NEC uPD720200 sometimes sets this state and fails with
-			 * Context Error while continuing to process TRBs.
-			 * Be conservative and trust EP_CTX_STATE on other chips.
+			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
+			 * EP is a Context State Error, and EP stays Stopped.
+			 * The EP could be stopped by some concurrent job, so
+			 * ignore this error when that's the case.
+			 */
+			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
+				break;
+			/*
+			 * NEC uPD720200 controllers have this bug: it takes one
+			 * service interval to restart a stopped EP, and in this
+			 * time its state remains Stopped. Any new Stop Endpoint
+			 * command fails and this handler may run quickly enough
+			 * to still observe the Stopped state, but it's bound to
+			 * soon change to Running, and all hell breaks loose.
+			 *
+			 * So keep retrying until the command clearly succeeds.
+			 * Not clear what to do if other HCs have similar bugs.
 			 */
 			if (!(xhci->quirks & XHCI_NEC_HOST))
 				break;
 			fallthrough;
+
 		case EP_STATE_RUNNING:
 			/* Race, HW handled stop ep cmd before ep was running */
 			xhci_dbg(xhci, "Stop ep completion ctx error, ep is running\n");
@@ -4383,11 +4415,17 @@ int xhci_queue_evaluate_context(struct xhci_hcd *xhci, struct xhci_command *cmd,
 int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd,
 			     int slot_id, unsigned int ep_index, int suspend)
 {
+	struct xhci_virt_ep *ep;
 	u32 trb_slot_id = SLOT_ID_FOR_TRB(slot_id);
 	u32 trb_ep_index = EP_INDEX_FOR_TRB(ep_index);
 	u32 type = TRB_TYPE(TRB_STOP_RING);
 	u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
+	if (ep)
+		xhci_update_stop_cmd_redundant(ep);
+	/* else: don't care, the handler will do nothing anyway */
+
 	return queue_command(xhci, cmd, 0, 0, 0,
 			trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 620502de971a..dd803b016c48 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -670,6 +670,8 @@ struct xhci_virt_ep {
 #define EP_SOFT_CLEAR_TOGGLE	(1 << 7)
 /* usb_hub_clear_tt_buffer is in progress */
 #define EP_CLEARING_TT		(1 << 8)
+/* the EP is kept Stopped or recovering from Halt/Error */
+#define EP_STOP_CMD_REDUNDANT	(1 << 9)
 	/* ----  Related to URB cancellation ---- */
 	struct list_head	cancelled_td_list;
 	struct xhci_hcd		*xhci;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs
  2024-10-14 19:08 [PATCH 0/2] Fix the NEC stop bug workaround Michal Pecio
  2024-10-14 19:10 ` [PATCH 1/2] usb: xhci: " Michal Pecio
@ 2024-10-14 19:11 ` Michal Pecio
  2024-10-15 10:40   ` Greg KH
  2024-10-15 12:23 ` [PATCH 0/2] Fix the NEC stop bug workaround Mathias Nyman
  2 siblings, 1 reply; 12+ messages in thread
From: Michal Pecio @ 2024-10-14 19:11 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

NEC controllers have a bug, where stopping an endpoint soon after it
has been restarted doesn't quite work as expected. This forces us to
track whether each Stop Endpoint command is expected to fail or not.

Reuse this infrastracture to warn about similar bugs on other chips,
if any are found.

Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
---
 drivers/usb/host/xhci-ring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index c0efb4d34ab9..c326b86d713c 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1186,8 +1186,11 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 			 * So keep retrying until the command clearly succeeds.
 			 * Not clear what to do if other HCs have similar bugs.
 			 */
-			if (!(xhci->quirks & XHCI_NEC_HOST))
+			if (!(xhci->quirks & XHCI_NEC_HOST)) {
+				xhci_warn(xhci, "Unhandled Stop Endpoint failure on slot %d ep_index %d\n",
+						slot_id, ep_index);
 				break;
+			}
 			fallthrough;
 
 		case EP_STATE_RUNNING:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] usb: xhci: Fix the NEC stop bug workaround
  2024-10-14 19:10 ` [PATCH 1/2] usb: xhci: " Michal Pecio
@ 2024-10-15 10:38   ` Greg KH
  2024-10-15 11:05   ` Mathias Nyman
  1 sibling, 0 replies; 12+ messages in thread
From: Greg KH @ 2024-10-15 10:38 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Mathias Nyman, linux-usb

On Mon, Oct 14, 2024 at 09:10:05PM +0200, Michal Pecio wrote:
> The NEC uPD720200 has a bug, which prevents reliably stopping
> an endpoint shortly after it has been restarted. This usually
> happens when a driver kills many URBs in quick succession and
> it results in concurrent execution and cancellation of TDs.
> 
> This is handled by stopping the endpoint again if in doubt.
> 
> This "doubt" turns out to be a problem, because Stop Endpoint
> may be queued when the EP is already Stopped (for Set TR Deq
> execution, for example) or becomes Stopped concurrently (by
> Reset Endpoint, for example). If the EP is truly Stopped, the
> command fails and further retries just keep failing forever.
> 
> This is easily triggered by modifying uvcvideo to unlink its
> isochronous URBs in 100us intervals instead of poisoning them.
> Any driver that unlinks URBs asynchronously may trigger this,
> and any URB unlink during ongoing halt recovery also can.
> 
> Fix the problem by tracking redundant Stop Endpoint commands
> which are sure to fail, and by not retrying them. It's easy,
> because xhci_urb_dequeue() is the only user ever queuing the
> command with the default handler and without ensuring that
> the endpoint is Running and will not Halt before it Stops.
> For this case, we assume that an endpoint with pending URBs
> is always Running, unless certain operations are pending on
> it which indicate known exceptions.
> 
> Note that we need to catch those exceptions when they occur,
> because their flags may be cleared before our handler runs.
> 
> It's possible that other HCs have similar bugs (see also the
> related "Running" case below), but the workaround is limited
> to NEC because no such chips are currently known and tested.
> 
> Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>  drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
>  drivers/usb/host/xhci.h      |  2 ++
>  2 files changed, 43 insertions(+), 3 deletions(-)
> 

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- You have marked a patch with a "Fixes:" tag for a commit that is in an
  older released kernel, yet you do not have a cc: stable line in the
  signed-off-by area at all, which means that the patch will not be
  applied to any older kernel releases.  To properly fix this, please
  follow the documented rules in the
  Documentation/process/stable-kernel-rules.rst file for how to resolve
  this.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs
  2024-10-14 19:11 ` [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs Michal Pecio
@ 2024-10-15 10:40   ` Greg KH
  2024-10-15 18:52     ` Michał Pecio
  0 siblings, 1 reply; 12+ messages in thread
From: Greg KH @ 2024-10-15 10:40 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Mathias Nyman, linux-usb

On Mon, Oct 14, 2024 at 09:11:22PM +0200, Michal Pecio wrote:
> NEC controllers have a bug, where stopping an endpoint soon after it
> has been restarted doesn't quite work as expected. This forces us to
> track whether each Stop Endpoint command is expected to fail or not.
> 
> Reuse this infrastracture to warn about similar bugs on other chips,
> if any are found.
> 
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>  drivers/usb/host/xhci-ring.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index c0efb4d34ab9..c326b86d713c 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1186,8 +1186,11 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
>  			 * So keep retrying until the command clearly succeeds.
>  			 * Not clear what to do if other HCs have similar bugs.
>  			 */
> -			if (!(xhci->quirks & XHCI_NEC_HOST))
> +			if (!(xhci->quirks & XHCI_NEC_HOST)) {
> +				xhci_warn(xhci, "Unhandled Stop Endpoint failure on slot %d ep_index %d\n",
> +						slot_id, ep_index);

If a user sees this, what are they supposed to do?  This is a hardware
bug, but with this we are going to get reports of "something broke in
the kernel", right?  Why not make it just more informative, like:
	xhci_info(xhci, "hardware can not deal with...

or something like that so that people know we know about the bug, and
are working around it, but that it's not our issue, but rather the
hardware that is at fault?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] usb: xhci: Fix the NEC stop bug workaround
  2024-10-14 19:10 ` [PATCH 1/2] usb: xhci: " Michal Pecio
  2024-10-15 10:38   ` Greg KH
@ 2024-10-15 11:05   ` Mathias Nyman
  2024-10-15 13:27     ` Michał Pecio
  1 sibling, 1 reply; 12+ messages in thread
From: Mathias Nyman @ 2024-10-15 11:05 UTC (permalink / raw)
  To: Michal Pecio, Mathias Nyman; +Cc: linux-usb

On 14.10.2024 22.10, Michal Pecio wrote:
> The NEC uPD720200 has a bug, which prevents reliably stopping
> an endpoint shortly after it has been restarted. This usually
> happens when a driver kills many URBs in quick succession and
> it results in concurrent execution and cancellation of TDs.
> 
> This is handled by stopping the endpoint again if in doubt.
> 
> This "doubt" turns out to be a problem, because Stop Endpoint
> may be queued when the EP is already Stopped (for Set TR Deq
> execution, for example) or becomes Stopped concurrently (by
> Reset Endpoint, for example). If the EP is truly Stopped, the
> command fails and further retries just keep failing forever.
> 
> This is easily triggered by modifying uvcvideo to unlink its
> isochronous URBs in 100us intervals instead of poisoning them.
> Any driver that unlinks URBs asynchronously may trigger this,
> and any URB unlink during ongoing halt recovery also can.
> 
> Fix the problem by tracking redundant Stop Endpoint commands
> which are sure to fail, and by not retrying them. It's easy,
> because xhci_urb_dequeue() is the only user ever queuing the
> command with the default handler and without ensuring that
> the endpoint is Running and will not Halt before it Stops.
> For this case, we assume that an endpoint with pending URBs
> is always Running, unless certain operations are pending on
> it which indicate known exceptions.
> 
> Note that we need to catch those exceptions when they occur,
> because their flags may be cleared before our handler runs.
> 
> It's possible that other HCs have similar bugs (see also the
> related "Running" case below), but the workaround is limited
> to NEC because no such chips are currently known and tested.
> 
> Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers")
> Signed-off-by: Michal Pecio <michal.pecio@gmail.com>
> ---
>   drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++---
>   drivers/usb/host/xhci.h      |  2 ++
>   2 files changed, 43 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 4d664ba53fe9..c0efb4d34ab9 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -911,6 +911,21 @@ static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
>   	return ret;
>   }
>   
> +/*
> + * A Stop Endpoint command is redundant if the EP is not in the Running state.
> + * It will fail with Context State Error. We sometimes queue redundant Stop EP
> + * commands when the EP is held Stopped for Set TR Deq execution, or Halted.
> + * A pending Stop Endpoint command *becomes* redundant if the EP halts before
> + * its completion, and this flag needs to be updated in those cases too.
> + */
> +static void xhci_update_stop_cmd_redundant(struct xhci_virt_ep *ep)
> +{
> +	if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT))
> +		ep->ep_state |= EP_STOP_CMD_REDUNDANT;
> +	else
> +		ep->ep_state &= ~EP_STOP_CMD_REDUNDANT;
> +}
> +
>   static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
>   				struct xhci_virt_ep *ep,
>   				struct xhci_td *td,
> @@ -946,6 +961,7 @@ static int xhci_handle_halted_endpoint(struct xhci_hcd *xhci,
>   		return err;
>   
>   	ep->ep_state |= EP_HALTED;
> +	xhci_update_stop_cmd_redundant(ep);
>   
>   	xhci_ring_cmd_db(xhci);
>   
> @@ -1149,15 +1165,31 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
>   				break;
>   			ep->ep_state &= ~EP_STOP_CMD_PENDING;
>   			return;
> +
>   		case EP_STATE_STOPPED:
>   			/*
> -			 * NEC uPD720200 sometimes sets this state and fails with
> -			 * Context Error while continuing to process TRBs.
> -			 * Be conservative and trust EP_CTX_STATE on other chips.
> +			 * Per xHCI 4.6.9, Stop Endpoint command on a Stopped
> +			 * EP is a Context State Error, and EP stays Stopped.
> +			 * The EP could be stopped by some concurrent job, so
> +			 * ignore this error when that's the case.
> +			 */
> +			if (ep->ep_state & EP_STOP_CMD_REDUNDANT)
> +				break;

Can we skip the new flag and just check for the correct flags here directly?

if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT)
	break;

Thanks
Mathias

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Fix the NEC stop bug workaround
  2024-10-14 19:08 [PATCH 0/2] Fix the NEC stop bug workaround Michal Pecio
  2024-10-14 19:10 ` [PATCH 1/2] usb: xhci: " Michal Pecio
  2024-10-14 19:11 ` [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs Michal Pecio
@ 2024-10-15 12:23 ` Mathias Nyman
  2024-10-15 14:51   ` Alan Stern
  2024-10-16  5:47   ` Michał Pecio
  2 siblings, 2 replies; 12+ messages in thread
From: Mathias Nyman @ 2024-10-15 12:23 UTC (permalink / raw)
  To: Michal Pecio, Mathias Nyman; +Cc: linux-usb

On 14.10.2024 22.08, Michal Pecio wrote:
> Hi,
> 
> I found an unfortunate problem with my workaround for this hardware bug.
> 
> To recap, Stop Endpoint sometimes fails, the Endpoint Context says the
> EP is Stopped, but cancelled TRBs are still executed. I found this bug
> earlier this year and submitted a workaround, which retries the command
> (sometimes a few times) and all is good.
> 
> This works fine for common cases, but what if the endpoint is really
> stopped? Then Stop Endpoint is supposed to fail and fail it does. The
> workaround code doesn't know that it happened and retries infinitely.
> 
> I have never seen it in normal use, but I devised a reliable repro.
> The effect isn't pretty - no URBs can be cancelled, device gets stuck,
> if unplugged it locks up connections/disconnections on the whole bus.
> 
> With some experimentation I found that the bug is a variant of the old
> "stop after restart" issue - the doorbell ring is internally reordered
> after the subsequent command. By busy-waiting I confirmed that EP state
> which is initially seen as Stopped becomes Running some time later.
> 

Seems host controllers aren't designed to stop, move dequeue, and restart
an endpoint in quick succession.

In addition to fixing this NEC case we could think about avoiding these
cases, some could be avoided by adding a new ".flush_endpoint()" callback to
the USB host side API. Usb core itself has a usb_hcd_flush_endpoint() function
that calls .urb_dequeue() in a loop for each queued URB, causing host to
issue the stop, move deq and ring doorbell for every URB.

If usbcore knows all URBs will be cancelled it could let host do it in one go.
i.e. stop endpoint once.

Thanks
Mathias

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] usb: xhci: Fix the NEC stop bug workaround
  2024-10-15 11:05   ` Mathias Nyman
@ 2024-10-15 13:27     ` Michał Pecio
  0 siblings, 0 replies; 12+ messages in thread
From: Michał Pecio @ 2024-10-15 13:27 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Mathias Nyman, linux-usb

> Can we skip the new flag and just check for the correct flags here
> directly?
> 
> if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT)
> 	break;

Unfortunately not, because those pending operations may (and usually
will) complete before our handler runs. They will not restart the EP
because we set EP_STOP_CMD_PENDING, but they will clear their flags.
So we know that Stop Endpoint is guaranteed to fail, but its handler
will not see those flags and will have no clue why it failed, hence
we store this one bit of knowledge specially for its use.

But you raise a valid point. If Stop EP fails on a Halted endpoint and
somebody else resets it before Stop EP handler runs, the handler will
see EP_HALTED, because Reset EP handler must run later if the commands
were queued and executed in this order.

So if Stop EP handler tests for EP_HALTED, nobody needs to worry about
updating EP_STOP_CMD_REDUNDANT for us. The helper function can go out,
the patch is shorter, and the solution more robust against any changes
to halt recovery code that anyone might do. All that "redundant" logic
becomes concentrated in queue/handle _stop_endpoint() functions.

I think I will do a v2.

By the way, is this list of conditions complete? There are other flags
like GETTING_STREAMS or CLEAR_TOGGLE, but I'm under impression that they
are valid only with no queued URBs, so nothing can be cancelled then.

Regards,
Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Fix the NEC stop bug workaround
  2024-10-15 12:23 ` [PATCH 0/2] Fix the NEC stop bug workaround Mathias Nyman
@ 2024-10-15 14:51   ` Alan Stern
  2024-10-16  5:47   ` Michał Pecio
  1 sibling, 0 replies; 12+ messages in thread
From: Alan Stern @ 2024-10-15 14:51 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Michal Pecio, Mathias Nyman, linux-usb

On Tue, Oct 15, 2024 at 03:23:23PM +0300, Mathias Nyman wrote:
> In addition to fixing this NEC case we could think about avoiding these
> cases, some could be avoided by adding a new ".flush_endpoint()" callback to
> the USB host side API. Usb core itself has a usb_hcd_flush_endpoint() function
> that calls .urb_dequeue() in a loop for each queued URB, causing host to
> issue the stop, move deq and ring doorbell for every URB.
> 
> If usbcore knows all URBs will be cancelled it could let host do it in one go.
> i.e. stop endpoint once.

Indeed, this makes a lot of sense, and I have long thought that the
API should have been designed this way from the beginning.  At least
for non-Control transfers, unlinking a single URB somewhere inside a
sequence of URBs seems pointless.  I doubt that it ever happens in the
kernel.

(On the other hand, it _is_ reasonable to do this for Control
transfers, because they can come from several different sources, not
just from the device's driver.  The source for a Control URB might
want to unlink it while not affecting the URBs from other sources.)

Furthermore, I suspect this is what Windows does and what the USBIF
originally had in mind for URB management.  (It's harder to tell what
they thought about Control transfers, though.)

Alan Stern

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs
  2024-10-15 10:40   ` Greg KH
@ 2024-10-15 18:52     ` Michał Pecio
  0 siblings, 0 replies; 12+ messages in thread
From: Michał Pecio @ 2024-10-15 18:52 UTC (permalink / raw)
  To: Greg KH; +Cc: Mathias Nyman, linux-usb

Hi,

> +			if (!(xhci->quirks & XHCI_NEC_HOST)) {
> > +				xhci_warn(xhci, "Unhandled Stop Endpoint failure on slot %d ep_index %d\n",
> > +						slot_id, ep_index);  
> 
> If a user sees this, what are they supposed to do?  This is a hardware
> bug, but with this we are going to get reports of "something broke in
> the kernel", right?  Why not make it just more informative, like:
> 	xhci_info(xhci, "hardware can not deal with...
> 
> or something like that so that people know we know about the bug, and
> are working around it, but that it's not our issue, but rather the
> hardware that is at fault?

Yes, the point is that ideally some users would report when they see it.
This is not a warning for a bug we know about and work around, it's for
hypothetical bugs that aren't known. If I'm adding code to handle a bug
of my HC, I might as well use it to sanity check other HCs for free.

Ideally they are OK and no one will ever see it. I tested several HCs
in January and found no similar issues. Maybe I will test them again.

I specifically worded it as something the kernel could do better rather
than the kernel just whining about something incomprehensibly. For the
latter, we already have the absolute classic "ERROR Transfer event TRB
DMA ptr not part of current TD", which would often follow this one, but
offer little clue about the root cause. It's kinda too late by then.

Regards,
Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Fix the NEC stop bug workaround
  2024-10-15 12:23 ` [PATCH 0/2] Fix the NEC stop bug workaround Mathias Nyman
  2024-10-15 14:51   ` Alan Stern
@ 2024-10-16  5:47   ` Michał Pecio
  2024-10-24 15:29     ` Mathias Nyman
  1 sibling, 1 reply; 12+ messages in thread
From: Michał Pecio @ 2024-10-16  5:47 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Mathias Nyman, linux-usb

> > With some experimentation I found that the bug is a variant of the
> > old "stop after restart" issue - the doorbell ring is internally
> > reordered after the subsequent command. By busy-waiting I confirmed
> > that EP state which is initially seen as Stopped becomes Running
> > some time later. 
> 
> Seems host controllers aren't designed to stop, move dequeue, and
> restart an endpoint in quick succession.

As it was you who added the Running case handling, do you know hardware
other than NEC which triggers this? Or could it be just a single vendor
who screwed up once 15 years ago and caused all the chaos?

NEC sometimes triggers the Running case too and it is obvious why. I'm
not sure how I missed it back in January and assumed it's some sort of
random failure for no reason.

BTW, the NEC problem appears to be limited to periodic endpoints. I am
unable to reproduce it on bulk. I thought that I reproduced it on bulk
back then, but on second thought it may have been interrupt, which that
device also has. Unfortunatel I wasn't printing endpoint numbers then.

Regards,
Michal

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] Fix the NEC stop bug workaround
  2024-10-16  5:47   ` Michał Pecio
@ 2024-10-24 15:29     ` Mathias Nyman
  0 siblings, 0 replies; 12+ messages in thread
From: Mathias Nyman @ 2024-10-24 15:29 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Mathias Nyman, linux-usb

On 16.10.2024 8.47, Michał Pecio wrote:
>>> With some experimentation I found that the bug is a variant of the
>>> old "stop after restart" issue - the doorbell ring is internally
>>> reordered after the subsequent command. By busy-waiting I confirmed
>>> that EP state which is initially seen as Stopped becomes Running
>>> some time later.
>>
>> Seems host controllers aren't designed to stop, move dequeue, and
>> restart an endpoint in quick succession.
> 
> As it was you who added the Running case handling, do you know hardware
> other than NEC which triggers this? Or could it be just a single vendor
> who screwed up once 15 years ago and caused all the chaos?
> 
> NEC sometimes triggers the Running case too and it is obvious why. I'm
> not sure how I missed it back in January and assumed it's some sort of
> random failure for no reason.
> 
> BTW, the NEC problem appears to be limited to periodic endpoints. I am
> unable to reproduce it on bulk. I thought that I reproduced it on bulk
> back then, but on second thought it may have been interrupt, which that
> device also has. Unfortunatel I wasn't printing endpoint numbers then.
> 
> Regards,
> Michal

Sorry about the reply delay.
I don't think this is a NEC only issue.

I was originally fixing halted endpoints at stop endpoint command completion,
did some stress testing, and was able to hit that running case on Intel
xHC controllers

See:
9ebf30007858 xhci: Fix halted endpoint at stop endpoint command completion
1174d44906d5 xhci: handle stop endpoint command completion with endpoint in running state.

I also just got a report off-list about an exactly similar case as yours, endpoint
stopped with ctx error, endpoint state was still stopped even if doorbell was
already rung.

This caused Set TR Deq command to fail with context error as endpoint was running
by the time this command was processed.

This was on a Intel host, se we need a generic solution to this.

Thanks
-Mathias

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-10-24 15:27 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-14 19:08 [PATCH 0/2] Fix the NEC stop bug workaround Michal Pecio
2024-10-14 19:10 ` [PATCH 1/2] usb: xhci: " Michal Pecio
2024-10-15 10:38   ` Greg KH
2024-10-15 11:05   ` Mathias Nyman
2024-10-15 13:27     ` Michał Pecio
2024-10-14 19:11 ` [PATCH 2/2] usb: xhci: Warn about suspected "start-stop" bugs in HCs Michal Pecio
2024-10-15 10:40   ` Greg KH
2024-10-15 18:52     ` Michał Pecio
2024-10-15 12:23 ` [PATCH 0/2] Fix the NEC stop bug workaround Mathias Nyman
2024-10-15 14:51   ` Alan Stern
2024-10-16  5:47   ` Michał Pecio
2024-10-24 15:29     ` Mathias Nyman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox