My transfer ring grew to 740 segments

public inbox for linux-usb@vger.kernel.org

* My transfer ring grew to 740 segments
@ 2025-03-11 22:41 Michał Pecio
  2025-03-12 13:37 ` Mathias Nyman
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Michał Pecio @ 2025-03-11 22:41 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

Hi,

This happened under a simple test meant to check if AMD "Promontory"
chipset (from ASMedia) has the delayed restart bug (it does, rarely).

Two pl2303 serial dongles were connected to a hub, loops were opening
and closing /dev/ttyUSBn to enqueue/dequeue some IN URBs which would
never complete with any data (nothing was fed to UART RX).

The test was running unattended for a few hours and it seems that at
some point the hub stopped working and transfers to downstream devices
were all returning Transaction Error. dmesg was full of this:

[102711.994235] xhci_hcd 0000:02:00.0: Event dma 0x00000000ffef4a50 for ep 6 status 4 not part of TD at 00000000eb22b790 - 00000000eb22b790
[102711.994243] xhci_hcd 0000:02:00.0: Ring seg 0 dma 0x00000000ffef4000
[102711.994246] xhci_hcd 0000:02:00.0: Ring seg 1 dma 0x00000000ffeee000
[102711.994249] xhci_hcd 0000:02:00.0: Ring seg 2 dma 0x00000000ffc4e000

[ ... 735 lines omitted for brevity ... ]

[102711.995935] xhci_hcd 0000:02:00.0: Ring seg 738 dma 0x00000000eb2e2000
[102711.995937] xhci_hcd 0000:02:00.0: Ring seg 739 dma 0x00000000eb22b000

Looking through debugfs, ffef4a50 is indeed a normal TD, apparently no
longer on td_list for some reason and hence the errors. The rest of the
ring is No-Ops.

Class driver enqueues its URBs, rings the doorbell and triggers this
error message. The endpoint halts, but that is ignored. Serial device
is closed, URBs are unlinked, Stop EP sees Halted, resets. No Set Deq
because HW Dequeue doesn't match any known TD. Rinse, repeat.

At some point end of the segment is reached, new segment is allocated
because ep_ring->dequeue is still in the first segment.
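
A minimal sketch of that growth mechanism, as a toy model rather than
the driver's real free-space check. It assumes 256-TRB segments whose
last slot is a link TRB, a software dequeue pointer that never leaves
segment 0, and a rule that a segment is inserted whenever the enqueue
pointer would otherwise enter the dequeue segment. All names here
(toy_ring, toy_queue_trb, toy_segments_after) are hypothetical.

```c
/* Assumption: 256 TRBs per segment, the last one being a link TRB. */
#define USABLE_TRBS 255u

struct toy_ring {
	unsigned int num_segs;	/* segments currently linked into the ring */
	unsigned int enq_seg;	/* segment holding the enqueue pointer */
	unsigned int enq_trb;	/* TRB index inside that segment */
	unsigned int deq_seg;	/* segment holding the stuck dequeue pointer */
};

/* Queue one TRB and grow the ring if enqueue would run into dequeue. */
static void toy_queue_trb(struct toy_ring *r)
{
	unsigned int next;

	if (++r->enq_trb < USABLE_TRBS)
		return;

	/* End of segment reached: follow the link to the next segment. */
	next = (r->enq_seg + 1) % r->num_segs;
	if (next == r->deq_seg) {
		/* Dequeue never moved, so insert a fresh segment instead. */
		if (r->deq_seg > r->enq_seg)
			r->deq_seg++;
		r->num_segs++;
		next = r->enq_seg + 1;
	}
	r->enq_seg = next;
	r->enq_trb = 0;
}

/* Start with two segments, both pointers in segment 0, queue 'trbs' TRBs. */
static unsigned int toy_segments_after(unsigned long trbs)
{
	struct toy_ring r = { .num_segs = 2, .enq_seg = 0,
			      .enq_trb = 0, .deq_seg = 0 };

	while (trbs--)
		toy_queue_trb(&r);
	return r.num_segs;
}
```

Under these assumptions the ring gains one segment per 255 queued TRBs
once the second segment is full, so roughly 255 * 739 = 188445 TRBs
would take a two-segment ring to 740 segments.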

So how does the driver enter this screwed up state? Apparently due to
a HW bug. More detailed debug log from a different run:

[39607.305224] xhci_hcd 0000:02:00.0: 2/6 (040/3) ring_ep_doorbell stream 0
[39607.305235] xhci_hcd 0000:02:00.0: 2/6 (040/3) ring_ep_doorbell stream 0
[39607.305413] xhci_hcd 0000:02:00.0: 2/6 (040/1) handle_tx_event comp_code 4 trb_dma 0x00000000ffa80050

The 1 in (040/1) is EP Ctx state, i.e. Running, despite Trans. Error.
It looks like finish_td() sees it, ignores the error and gives back
normally. EP Ctx is still wrong later when the next URB is unlinked:

[39607.398526] xhci_hcd 0000:02:00.0: 2/6 (040/1) xhci_urb_dequeue cancel TD at 0x00000000ffa80060 stream 0
[39607.398531] xhci_hcd 0000:02:00.0: 2/6 (044/1) queue_stop_endpoint suspend 0

But Stop EP fails and updates it properly to 2=Halted:

[39607.398655] xhci_hcd 0000:02:00.0: 2/6 (044/2) handle_cmd_completion cmd_type 15 comp_code 19

Then the EP is reset without Set Deq or clearing and ffa80050 becomes
"stuck and forgotten", initiating the above problem.

The fact that EP Ctx state is Running for >90ms suggests that it's
a bug. But a race could have similar effect, and I can't find any
guarantee in the spec that EP Ctx is updated before posting an error
transfer event. 4.8.3 guarantees that it becomes Running before normal
transfer events are posted, but suggests not to trust EP Ctx too much.

Maybe finish_td() should be more cautious?
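
For reading the logs above, a small sketch that names the EP Ctx state
and completion code values as numbered in the xHCI specification and
flags the combination seen here (Transaction Error reported while the
context still says Running). The helper names are hypothetical and the
first field of the "(040/1)" pair is deliberately left uninterpreted.

```c
#include <stdbool.h>

/* Endpoint Context "EP State" values. */
enum { CTX_DISABLED = 0, CTX_RUNNING = 1, CTX_HALTED = 2,
       CTX_STOPPED = 3, CTX_ERROR = 4 };

/* The completion codes that appear in the logs of this thread. */
enum { CC_SUCCESS = 1, CC_TRANS_ERROR = 4,
       CC_CTX_STATE_ERROR = 19, CC_STOPPED = 26 };

static const char *ep_ctx_state_name(unsigned int state)
{
	switch (state) {
	case CTX_DISABLED:	return "Disabled";
	case CTX_RUNNING:	return "Running";
	case CTX_HALTED:	return "Halted";
	case CTX_STOPPED:	return "Stopped";
	case CTX_ERROR:		return "Error";
	default:		return "Reserved";
	}
}

static const char *comp_code_name(unsigned int code)
{
	switch (code) {
	case CC_SUCCESS:		return "Success";
	case CC_TRANS_ERROR:		return "USB Transaction Error";
	case CC_CTX_STATE_ERROR:	return "Context State Error";
	case CC_STOPPED:		return "Stopped";
	default:			return "(not decoded in this sketch)";
	}
}

/* True for the suspicious case: a halting error, yet EP Ctx says Running. */
static bool event_contradicts_ep_ctx(unsigned int code, unsigned int state)
{
	return code == CC_TRANS_ERROR && state == CTX_RUNNING;
}
```

With these helpers, the handle_tx_event line (comp_code 4, state 1) is
the contradictory one, while the Stop EP completion (comp_code 19,
state 2) is consistent with an endpoint that was halted all along.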

Michal


* Re: My transfer ring grew to 740 segments
  2025-03-11 22:41 My transfer ring grew to 740 segments Michał Pecio
@ 2025-03-12 13:37 ` Mathias Nyman
  2025-03-13  7:54   ` Michał Pecio
  2025-03-13  8:46 ` Michał Pecio
  2025-03-14 19:15 ` David Laight
  2 siblings, 1 reply; 10+ messages in thread
From: Mathias Nyman @ 2025-03-12 13:37 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb

On 12.3.2025 0.41, Michał Pecio wrote:
> Hi,
> 
> This happened under a simple test meant to check if AMD "Promontory"
> chipset (from ASMedia) has the delayed restart bug (it does, rarely).
> 
> Two pl2303 serial dongles were connected to a hub, loops were opening
> and closing /dev/ttyUSBn to enqueue/dequeue some IN URBs which would
> never complete with any data (nothing was fed to UART RX).
> 
> The test was running unattended for a few hours and it seems that at
> some point the hub stopped working and transfers to downstream devices
> were all returning Transaction Error. dmesg was full of this:
> 
> [102711.994235] xhci_hcd 0000:02:00.0: Event dma 0x00000000ffef4a50 for ep 6 status 4 not part of TD at 00000000eb22b790 - 00000000eb22b790
> [102711.994243] xhci_hcd 0000:02:00.0: Ring seg 0 dma 0x00000000ffef4000
> [102711.994246] xhci_hcd 0000:02:00.0: Ring seg 1 dma 0x00000000ffeee000
> [102711.994249] xhci_hcd 0000:02:00.0: Ring seg 2 dma 0x00000000ffc4e000
> 
> [ ... 735 lines omitted for brevity ... ]
> 
> [102711.995935] xhci_hcd 0000:02:00.0: Ring seg 738 dma 0x00000000eb2e2000
> [102711.995937] xhci_hcd 0000:02:00.0: Ring seg 739 dma 0x00000000eb22b000
> 
> Looking through debugfs, ffef4a50 is indeed a normal TD, apparently no
> longer on td_list for some reason and hence the errors. The rest of the
> ring is No-Ops.
> 
> Class driver enqueues its URBs, rings the doorbell and triggers this
> error message. The endpoint halts, but that is ignored. Serial device
> is closed, URBs are unlinked, Stop EP sees Halted, resets. No Set Deq
> because HW Dequeue doesn't match any known TD. Rinse, repeat.

Ok, so this means endpoint does get reset, and once restarted it
tries to transfer the same TRB, which again fails with Transaction error.

> 
> At some point end of the segment is reached, new segment is allocated
> because ep_ring->dequeue is still in the first segment.
> 
> 
> So how does the driver enter this screwed up state? Apparently due to
> a HW bug. More detailed debug log from a different run:
> 
> [39607.305224] xhci_hcd 0000:02:00.0: 2/6 (040/3) ring_ep_doorbell stream 0
> [39607.305235] xhci_hcd 0000:02:00.0: 2/6 (040/3) ring_ep_doorbell stream 0
> [39607.305413] xhci_hcd 0000:02:00.0: 2/6 (040/1) handle_tx_event comp_code 4 trb_dma 0x00000000ffa80050
> 
> The 1 in (040/1) is EP Ctx state, i.e. Running, despite Trans. Error.
> It looks like finish_td() sees it, ignores the error and gives back
> normally. EP Ctx is still wrong later when the next URB is unlinked:
> 
> [39607.398526] xhci_hcd 0000:02:00.0: 2/6 (040/1) xhci_urb_dequeue cancel TD at 0x00000000ffa80060 stream 0
> [39607.398531] xhci_hcd 0000:02:00.0: 2/6 (044/1) queue_stop_endpoint suspend 0
> 
> But Stop EP fails and updates it properly to 2=Halted:
> 
> [39607.398655] xhci_hcd 0000:02:00.0: 2/6 (044/2) handle_cmd_completion cmd_type 15 comp_code 19
> 
> Then the EP is reset without Set Deq or clearing and ffa80050 becomes
> "stuck and forgotten", initiating the above problem.
> 
> 
> The fact that EP Ctx state is Running for >90ms suggests that it's
> a bug. But a race could have similar effect, and I can't find any
> guarantee in the spec that EP Ctx is updated before posting an error
> transfer event. 4.8.3 guarantees that it becomes Running before normal
> transfer events are posted, but suggests not to trust EP Ctx too much.
> 
> Maybe finish_td() should be more cautious?

Good point, finish_td() should probably trust ep_state flags set by driver
first.

Thanks
Mathias




* Re: My transfer ring grew to 740 segments
  2025-03-12 13:37 ` Mathias Nyman
@ 2025-03-13  7:54   ` Michał Pecio
  0 siblings, 0 replies; 10+ messages in thread
From: Michał Pecio @ 2025-03-13  7:54 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

On Wed, 12 Mar 2025 15:37:12 +0200, Mathias Nyman wrote:
> On 12.3.2025 0.41, Michał Pecio wrote:
> > Class driver enqueues its URBs, rings the doorbell and triggers this
> > error message. The endpoint halts, but that is ignored. Serial
> > device is closed, URBs are unlinked, Stop EP sees Halted, resets.
> > No Set Deq because HW Dequeue doesn't match any known TD. Rinse,
> > repeat.  
> 
> Ok, so this means endpoint does get reset, and once restarted it
> tries to transfer the same TRB, which again fails with Transaction
> error.

Yes. It makes me wonder whether it even makes sense to reset endpoints
in cases when the halted TD cannot be identified and skipped with Set
TR Dequeue. We don't know if it got td_to_noop(), and even if it did,
there is no guarantee that the HC flushes TRB cache before retrying.

If the connection is lost permanently, this situation is at least safe.
But if it's a temporary Transaction Error or Stall then a future URB
may cause this stale TD to execute, affecting device state without
class driver's knowledge and using a retired data buffer.

Since every known halted TD is cancelled rather than given back, having
a halted EP with no TD to blame appears to generally be a bug. In this
case, finish_td() failed to recognize and handle the halt. And papering
over this problem with a reset didn't make it go away.

> > Maybe finish_td() should be more cautious?  
> 
> Good point, finish_td() should probably trust ep_state flags set by
> driver first.

Actually, finish_td() is supposed to call xhci_handle_halted_endpoint()
which then sets EP_HALTED. It could do so more reliably by trusting the
spec and assuming that every Tr-Error or Babble halts the endpoint
(with exceptions for isochronous and babbling 0.95 control endpoints).

4.8.3 instructs to assume that EP is halted after these errors and
warns that EP Ctx may not always be up to date, although Promontory
seems to (randomly) never update it at all, even 90ms later.
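
The rule described above can be sketched as a standalone predicate.
This is only an illustration of the policy, not the code in
finish_td(): the enum, the function name and the BCD version constant
0x95 for xHCI 0.95 are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical endpoint type tag for this sketch. */
enum toy_ep_type { TOY_EP_CONTROL, TOY_EP_BULK, TOY_EP_INTERRUPT, TOY_EP_ISOC };

#define CC_BABBLE	3	/* Babble Detected Error */
#define CC_TRANS_ERROR	4	/* USB Transaction Error */

/*
 * Decide from the completion code alone whether the endpoint should be
 * treated as halted, without consulting the EP Ctx state.
 */
static bool error_implies_halt(unsigned int comp_code, enum toy_ep_type type,
			       unsigned int hci_version)
{
	if (comp_code != CC_BABBLE && comp_code != CC_TRANS_ERROR)
		return false;

	/* Exception 1: isochronous endpoints. */
	if (type == TOY_EP_ISOC)
		return false;

	/* Exception 2: babble on a control endpoint of a 0.95 controller. */
	if (comp_code == CC_BABBLE && type == TOY_EP_CONTROL &&
	    hci_version == 0x95)
		return false;

	return true;
}
```

The point of the design is that the decision depends only on what the
event says, so a stale "Running" in the endpoint context cannot cause
the halt to be missed.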

For now, I tried this simple hack and it solved the Promontory problem.
The message gets printed sometimes, but nothing worse happens.

@@ -2254,8 +2254,8 @@ static void finish_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
                                                 td->start_seg, td->start_trb));
                                return;
                        }
-                       /* endpoint not halted, don't reset it */
-                       break;
+                       xhci_info(xhci, "slot %d ep %d comp_code %d but not halted?\n",
+                                       ep->vdev->slot_id, ep->ep_index, trb_comp_code);
                }
                /* Almost same procedure as for STALL_ERROR below */
                xhci_clear_hub_tt_buffer(xhci, td, ep);

BTW, I'm reproducing this bug in a much simpler way, not involving any
dodgy hub. I use a full speed device (a PL2303 serial dongle) and
disconnect its D- line after enumeration. This breaks communications,
but disconnection is not reported because D+ line is still pulled up.


* Re: My transfer ring grew to 740 segments
  2025-03-11 22:41 My transfer ring grew to 740 segments Michał Pecio
  2025-03-12 13:37 ` Mathias Nyman
@ 2025-03-13  8:46 ` Michał Pecio
  2025-03-13  9:45   ` Neronin, Niklas
  2025-03-13 14:43   ` Mathias Nyman
  2025-03-14 19:15 ` David Laight
  2 siblings, 2 replies; 10+ messages in thread
From: Michał Pecio @ 2025-03-13  8:46 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb, niklas.neronin

On Tue, 11 Mar 2025 23:41:39 +0100, Michał Pecio wrote:
> [102711.994235] xhci_hcd 0000:02:00.0: Event dma 0x00000000ffef4a50 for ep 6 status 4 not part of TD at 00000000eb22b790 - 00000000eb22b790
> [102711.994243] xhci_hcd 0000:02:00.0: Ring seg 0 dma 0x00000000ffef4000
> [102711.994246] xhci_hcd 0000:02:00.0: Ring seg 1 dma 0x00000000ffeee000
> [102711.994249] xhci_hcd 0000:02:00.0: Ring seg 2 dma 0x00000000ffc4e000
> 
> [ ... 735 lines omitted for brevity ... ]
> 
> [102711.995935] xhci_hcd 0000:02:00.0: Ring seg 738 dma 0x00000000eb2e2000
> [102711.995937] xhci_hcd 0000:02:00.0: Ring seg 739 dma 0x00000000eb22b000

And what are your thoughts about this noise? It's absurd to print such
long debug dumps under failure conditions (and hold a spinlock for 2ms
to do so), and I would argue that it is pointless even normally:

1. Almost always exactly two segments exist, and either
  a. the event and the TD are in the same segment, so who cares where
     the other segment is
  b. they are in different segments, and you can deduce both segments
     from dma pointers so the dump tells you absolutely nothing new

2. With more segments, the dump can tell if there were other segments
   between the event and the TD, but is it really important?

3. It might help with finding out-of-ring events, but this is extremely
   rare and should be done automatically (xhci_dma_to_trb() or similar).
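
To illustrate point 1b, a short sketch of deducing the segment from a
DMA pointer alone. It assumes segments are 4 KiB in size and alignment
(consistent with every "Ring seg" address in the dump ending in 000)
and that a TRB is 16 bytes. The helper names are hypothetical.

```c
#include <stdint.h>

#define SEG_BYTES 4096u	/* assumption: 4 KiB, naturally aligned segments */
#define TRB_BYTES 16u	/* size of one TRB */

/* DMA address of the segment containing the given TRB address. */
static uint64_t seg_base(uint64_t dma)
{
	return dma & ~(uint64_t)(SEG_BYTES - 1);
}

/* Index of the TRB inside its segment. */
static unsigned int trb_index(uint64_t dma)
{
	return (unsigned int)((dma & (SEG_BYTES - 1)) / TRB_BYTES);
}

/* Nonzero if both addresses fall into the same segment. */
static int same_segment(uint64_t a, uint64_t b)
{
	return seg_base(a) == seg_base(b);
}
```

Applied to the first log line, the event at 0xffef4a50 is TRB 165 of
the segment at 0xffef4000 (seg 0) and the TD at 0xeb22b790 is TRB 121
of the segment at 0xeb22b000 (seg 739), all without the 740-line dump.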

Bottom line, the driver never printed it and no one other than Niklas
(Cc) seemed to really miss such a feature. 

I would be inclined to submit a small patch which removes this segment
dump, as I have already done so locally. Or at least make it xhci_dbg()
if somebody can present a convincing case for having it around.

Note that debugfs exists and provides this and much more information,
at least so long as the class driver doesn't disable the endpoint.

Regards,
Michal


* Re: My transfer ring grew to 740 segments
  2025-03-13  8:46 ` Michał Pecio
@ 2025-03-13  9:45   ` Neronin, Niklas
  2025-03-14  8:10     ` Michał Pecio
  2025-03-13 14:43   ` Mathias Nyman
  1 sibling, 1 reply; 10+ messages in thread
From: Neronin, Niklas @ 2025-03-13  9:45 UTC (permalink / raw)
  To: Michał Pecio, Mathias Nyman; +Cc: linux-usb



On 13/03/2025 10.46, Michał Pecio wrote:
> On Tue, 11 Mar 2025 23:41:39 +0100, Michał Pecio wrote:
>> [102711.994235] xhci_hcd 0000:02:00.0: Event dma 0x00000000ffef4a50 for ep 6 status 4 not part of TD at 00000000eb22b790 - 00000000eb22b790
>> [102711.994243] xhci_hcd 0000:02:00.0: Ring seg 0 dma 0x00000000ffef4000
>> [102711.994246] xhci_hcd 0000:02:00.0: Ring seg 1 dma 0x00000000ffeee000
>> [102711.994249] xhci_hcd 0000:02:00.0: Ring seg 2 dma 0x00000000ffc4e000
>>
>> [ ... 735 lines omitted for brevity ... ]
>>
>> [102711.995935] xhci_hcd 0000:02:00.0: Ring seg 738 dma 0x00000000eb2e2000
>> [102711.995937] xhci_hcd 0000:02:00.0: Ring seg 739 dma 0x00000000eb22b000
> 
> And what are your thoughts about this noise? It's absurd to print such
> long debug dumps under failure conditions (and hold a spinlock for 2ms
> to do so), and I would argue that it is pointless even normally:
> 
> 1. Almost always exactly two segments exist, and either
>   a. the event and the TD are in the same segment, so who cares where
>      the other segment is
>   b. they are in different segments, and you can deduce both segments
>      from dma pointers so the dump tells you absolutely nothing new
> 
> 2. With more segments, the dump can tell if there were other segments
>    between the event and the TD, but is it really important?
> 
> 3. It might help with finding out-of-ring events, but this is extremely
>    rare and should be done automatically (xhci_dma_to_trb() or similar).
> 
> 
> Bottom line, the driver never printed it and no one other than Niklas
> (Cc) seemed to really miss such a feature. 

IMO the driver used to print a long and repetitive debug message,
which is why I changed it.
Admittedly, my design does not handle hundreds of segments well.

Before:
  For each segment or until the segment containing the TD end TRB:
	"Looking for event-dma %016llx trb-start %016llx trb-end %016llx seg-start %016llx seg-end %016llx"

After:
  "Event dma %pad for ep %d status %d not part of TD at %016llx - %016llx"
  For each segment:
	"Ring seg %u dma %pad"

It probably would have been better to loop from TD start seg to end seg.

>
> I would be inclined to submit a small patch which removes this segment
> dump, as I have already done so locally. Or at least make it xhci_dbg()
> if somebody can present a convincing case for having it around.

My patch was only meant to move the debugging out of trb_in_td() and shorten it.
Before, trb_in_td() was called twice, once for its primary functionality and a
second time solely for debugging purposes. This was what I wanted to remove.

Otherwise, I don't object to modifying or removing the debugs.

Best Regards,
Niklas

> 
> Note that debugfs exists and provides this and much more information,
> at least so long as the class driver doesn't disable the endpoint.
> 



* Re: My transfer ring grew to 740 segments
  2025-03-13  8:46 ` Michał Pecio
  2025-03-13  9:45   ` Neronin, Niklas
@ 2025-03-13 14:43   ` Mathias Nyman
  1 sibling, 0 replies; 10+ messages in thread
From: Mathias Nyman @ 2025-03-13 14:43 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb, niklas.neronin

On 13.3.2025 10.46, Michał Pecio wrote:
> On Tue, 11 Mar 2025 23:41:39 +0100, Michał Pecio wrote:
>> [102711.994235] xhci_hcd 0000:02:00.0: Event dma 0x00000000ffef4a50 for ep 6 status 4 not part of TD at 00000000eb22b790 - 00000000eb22b790
>> [102711.994243] xhci_hcd 0000:02:00.0: Ring seg 0 dma 0x00000000ffef4000
>> [102711.994246] xhci_hcd 0000:02:00.0: Ring seg 1 dma 0x00000000ffeee000
>> [102711.994249] xhci_hcd 0000:02:00.0: Ring seg 2 dma 0x00000000ffc4e000
>>
>> [ ... 735 lines omitted for brevity ... ]
>>
>> [102711.995935] xhci_hcd 0000:02:00.0: Ring seg 738 dma 0x00000000eb2e2000
>> [102711.995937] xhci_hcd 0000:02:00.0: Ring seg 739 dma 0x00000000eb22b000
> 
> And what are your thoughts about this noise? It's absurd to print such
> long debug dumps under failure conditions (and hold a spinlock for 2ms
> to do so), and I would argue that it is pointless even normally:
> 
> 1. Almost always exactly two segments exist, and either
>    a. the event and the TD are in the same segment, so who cares where
>       the other segment is
>    b. they are in different segments, and you can deduce both segments
>       from dma pointers so the dump tells you absolutely nothing new
> 
> 2. With more segments, the dump can tell if there were other segments
>     between the event and the TD, but is it really important?
> 
> 3. It might help with finding out-of-ring events, but this is extremely
>     rare and should be done automatically (xhci_dma_to_trb() or similar).
> 
> 
> Bottom line, the driver never printed it and no one other than Niklas
> (Cc) seemed to really miss such a feature.
> 
> I would be inclined to submit a small patch which removes this segment
> dump, as I have already done so locally. Or at least make it xhci_dbg()
> if somebody can present a convincing case for having it around.

I don't object to that, we can get rid of it.

But to be fair, didn't it assist in detecting the ~700 segment ring expansion
you just found :)

Thanks
Mathias


* Re: My transfer ring grew to 740 segments
  2025-03-13  9:45   ` Neronin, Niklas
@ 2025-03-14  8:10     ` Michał Pecio
  0 siblings, 0 replies; 10+ messages in thread
From: Michał Pecio @ 2025-03-14  8:10 UTC (permalink / raw)
  To: Neronin, Niklas; +Cc: Mathias Nyman, linux-usb

On Thu, 13 Mar 2025 11:45:30 +0200, Neronin, Niklas wrote:

> IMO the driver used to print a long and repetitive debug message,
> which is why I changed it.
> Admittedly, my design does not handle hundreds of segments well.
> 
> Before:
>   For each segment or until the segment containing the TD end TRB:
> 	"Looking for event-dma %016llx trb-start %016llx trb-end %016llx seg-start %016llx seg-end %016llx"
> 
> After:
>   "Event dma %pad for ep %d status %d not part of TD at %016llx - %016llx"
>   For each segment:
> 	"Ring seg %u dma %pad"
> 
> Probably, would have been better to loop from TD start seg to end seg.

That's actually what the old code did, it only printed segments which
contained parts of the TD. Usually one, sometimes two.

New version always prints at least two lines.
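
For comparison, a sketch of walking only from the TD's start segment
to its end segment, which yields one entry for a TD inside a single
segment and two for a TD that crosses a link. The toy_seg structure
and td_segments() are hypothetical stand-ins, not the driver's types.

```c
#include <stddef.h>

/* Hypothetical ring segment: a number and a link to the next segment. */
struct toy_seg {
	unsigned int num;
	struct toy_seg *next;
};

/*
 * Collect the numbers of the segments a TD touches, walking the circular
 * list from 'start' to 'end' inclusive. 'max' bounds the walk so that a
 * corrupted list cannot make it loop forever. Returns the entry count.
 */
static size_t td_segments(const struct toy_seg *start,
			  const struct toy_seg *end,
			  unsigned int *out, size_t max)
{
	const struct toy_seg *seg = start;
	size_t n = 0;

	while (n < max) {
		out[n++] = seg->num;
		if (seg == end)
			break;
		seg = seg->next;
	}
	return n;
}
```

With such a walk the output size depends on the TD rather than on the
ring, so a 740-segment ring would not produce 740 lines.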

I thought that maybe you wanted it for some reason, but if it was only
a matter of preserving the old annoying behavior, I think it can go away :)


Regards,
Michal


* Re: My transfer ring grew to 740 segments
  2025-03-11 22:41 My transfer ring grew to 740 segments Michał Pecio
  2025-03-12 13:37 ` Mathias Nyman
  2025-03-13  8:46 ` Michał Pecio
@ 2025-03-14 19:15 ` David Laight
  2025-03-16 10:27   ` Michał Pecio
  2 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2025-03-14 19:15 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Mathias Nyman, linux-usb

On Tue, 11 Mar 2025 23:41:39 +0100
Michał Pecio <michal.pecio@gmail.com> wrote:

> Hi,
> 
> This happened under a simple test meant to check if AMD "Promontory"
> chipset (from ASMedia) has the delayed restart bug (it does, rarely).

Several years ago I found a bug in one of the asmedia chips: it only
processed one entry from the command ring each time the doorbell
was rung (the normal transfers were fine).
It would get 'out of step' so every time you sent a new command an old one
got executed instead - very confusing.
I don't remember seeing the bug 'worked around' while I was actively looking
at the changes - so it may still be present.
So setting up the ethernet interface I was using only worked most of the time.
Reproducible by adding two commands but only ringing the bell once.
I fixed it by ringing the doorbell again in the completion interrupt path.
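
A toy model of the behaviour described above, only to make the
workaround concrete. It assumes a controller that executes at most one
queued command per doorbell write and a completion path that can ring
the doorbell again. Nothing here is real driver code and all names are
hypothetical.

```c
#include <stdbool.h>

struct toy_cmd_ring {
	unsigned int queued;	/* commands written to the command ring */
	unsigned int completed;	/* commands the controller has executed */
	bool one_per_doorbell;	/* models the quirk described above */
};

/* Controller side: react to one doorbell write, return commands executed. */
static unsigned int toy_hc_doorbell(struct toy_cmd_ring *r)
{
	unsigned int run = r->queued - r->completed;

	if (r->one_per_doorbell && run > 1)
		run = 1;
	r->completed += run;
	return run;
}

/*
 * Driver side: ring the doorbell once. If 'rering' is set, the completion
 * path rings it again for as long as commands are still pending.
 * Returns the number of commands left stuck on the ring.
 */
static unsigned int toy_run(struct toy_cmd_ring *r, bool rering)
{
	unsigned int ran = toy_hc_doorbell(r);

	while (rering && ran > 0 && r->completed < r->queued)
		ran = toy_hc_doorbell(r);

	return r->queued - r->completed;
}
```

In this model two queued commands and a single doorbell leave one
command stuck on the quirky controller, and re-ringing from the
completion path drains the ring.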

	David


* Re: My transfer ring grew to 740 segments
  2025-03-14 19:15 ` David Laight
@ 2025-03-16 10:27   ` Michał Pecio
  2025-03-16 13:20     ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Michał Pecio @ 2025-03-16 10:27 UTC (permalink / raw)
  To: David Laight; +Cc: Mathias Nyman, linux-usb

On Fri, 14 Mar 2025 19:15:36 +0000, David Laight wrote:
> Several years ago I found a bug in one of the asmedia chips that it
> only processed one entry from the command ring each time the doorbell
> was rung (the normal transfers were fine).
> It would get 'out of step' so every time you sent a new command an
> old one got executed instead - very confusing.


Interesting, but it doesn't seem to reproduce here.
I tried Promontory, ASM3142, ASM1142, ASM1042.

I removed the check for running endpoint from xhci-hub.c stop_device()
so it queues a Stop EP for each endpoint (as was done before 2017) and
then rings the command doorbell once (as it always did).

This is called before autosuspend so I would expect autosuspend to be
broken by such a bug, particularly before 2017.

The worst I got was a Stopped event from ASM1042 for a command failing
with Context State Error, IIRC it's illegal. But both cmds completed:

[  +2,271097] xhci_hcd 0000:02:00.0: 1/6 (000/3) queue_stop_endpoint suspend 1
[  +0,000006] xhci_hcd 0000:02:00.0: 1/0 (040/1) queue_stop_endpoint suspend 1
[  +0,000003] xhci_hcd 0000:02:00.0: 0/-1 (fff/f) xhci_ring_cmd_db cmd_ring_state 1
[  +0,000047] xhci_hcd 0000:02:00.0: 1/6 (000/3) handle_tx_event comp_code 26 trb_dma 0x000000000341d010
[  +0,000036] xhci_hcd 0000:02:00.0: 1/6 (000/3) handle_cmd_completion cmd_type 15 comp_code 19
[  +0,000142] xhci_hcd 0000:02:00.0: 1/0 (040/1) handle_tx_event comp_code 26 trb_dma 0x0000000003415640
[  +0,000038] xhci_hcd 0000:02:00.0: 1/0 (040/3) handle_cmd_completion cmd_type 15 comp_code 1

Was it supposed to happen every time, or only randomly?

> I don't remember seeing the bug 'worked around' while I was actively
> looking at the changes - so it may still be present.
> So setting up the ethernet interface I was using only worked most of
> the time. Reproducible by adding two commands but only ringing the
> bell once. I fixed it by ringing the doorbell again in the completion
> interrupt path.

I don't see any evidence of such workaround today.

Regards,
Michal


* Re: My transfer ring grew to 740 segments
  2025-03-16 10:27   ` Michał Pecio
@ 2025-03-16 13:20     ` David Laight
  0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2025-03-16 13:20 UTC (permalink / raw)
  To: Michał Pecio; +Cc: Mathias Nyman, linux-usb

On Sun, 16 Mar 2025 11:27:44 +0100
Michał Pecio <michal.pecio@gmail.com> wrote:

> On Fri, 14 Mar 2025 19:15:36 +0000, David Laight wrote:
> > Several years ago I found a bug in one of the asmedia chips that it
> > only processed one entry from the command ring each time the doorbell
> > was rung (the normal transfers were fine).
> > It would get 'out of step' so every time you sent a new command an
> > old one got executed instead - very confusing.  
> 
> 
> Interesting, but it doesn't seem to reproduce here.
> I tried Promontory, ASM3142, ASM1142, ASM1042.

So it isn't what you are hitting.

> 
...
> Was it supposed to happen every time, or only randomly?

It happened whenever two commands got queued.
So the usb-net initialisation hit it.

> > I don't remember seeing the bug 'worked around' while I was actively
> > looking at the changes - so it may still be present.
> > So setting up the ethernet interface I was using only worked most of
> > the time. Reproducible by adding two commands but only ringing the
> > bell once. I fixed it by ringing the doorbell again in the completion
> > interrupt path.  
> 
> I don't see any evidence of such workaround today.

The machine that failed is 'no longer with us'.
Was an AMD piledriver (or similar vintage) with (IIRC) an asmedia USB3
controller.

The project I was working on got canned - so I stopped pursuing fixes.

	David

> 
> Regards,
> Michal


