Re: Inquiry about the f_tcm: Enhance UASP driver work

public inbox for linux-usb@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Inquiry about the f_tcm: Enhance UASP driver work
       [not found] <CAC7i41MN9LAxy7FZh7nbo9zQ_hvmkZtzpGpXatyCVLQTde=RZA@mail.gmail.com>
@ 2024-11-22  2:21 ` Thinh Nguyen
  2024-11-22  7:57   ` Michał Pecio
  0 siblings, 1 reply; 7+ messages in thread
From: Thinh Nguyen @ 2024-11-22  2:21 UTC (permalink / raw)
  To: Homura Akemi; +Cc: Thinh Nguyen, linux-usb@vger.kernel.org

++ linux-usb

Hi,

On Thu, Nov 21, 2024, Homura Akemi wrote:
> Hello Mr. Nguyen,
> 
> When I want to bringup a USB mass storage gadget with better performance, I
> came across your "f_tcm: Enhance UASP driver" PATCH series (and together with
> the target series).
> 
> I am very grateful for your work, the stream feature strongly increase the
> performance. Your work is very useful to those who need it.

I'm glad it works for you.

> 
> It seems still an unfinished job and long time no updates,   so I emailed to
> consult if this work still going or just abandoned ? 

It's not really abandoned. I just haven't really had the time to push
the new revision out again.

> 
> I tried to use it on dwc3, though I fix some other problems, the host side xhci
> (ubuntu client) using fio for stress testing, and I encountered the following
> error on host:
> [18836.092159] xhci_hcd 0000:00:0d.0: Transfer error for slot 3 ep 1 on
> endpoint
> [18836.092211] sd 0:0:0:0: [sda] tag#11 data cmplt err -71 uas-tag 1 inflight:
> CMD
> [18836.092213] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 02 5e 31 00 00 01
> 00 00
> .....more and mores....
> [18867.369118] scsi host0: uas_eh_device_reset_handler start [18867.453796] usb
> 2-3.2: reset SuperSpeed USB device number 4 using xhci_hcd
> and the gadget side is keep resetting configfs and printing wait_for_completion
> timeout (since dwc3 have )
> 
> I am not sure whether this is due to the stream exception of dwc3 or some
> logical in f_tcm and target.

The error is -71. This is transaction error (could be a CRC error). It
could be due to the host, device hardware, electrical interference, or
even the cable. No logical issue from software.

> 
> I only want to know the staus of the job, no intend to ask for help on the
> detailed issues.
> 
> I am new to kernel community, didn't emailed to the thread I worried that it's
> no longer valid. 
> Maybe I can help with it somehow.
> 
> If you feel bothered, please feel free to ignore me. Thank you very much!
> 
> Yours sincerely,
> H. Akemi

Perhaps I can try to send a new and revised series when the next kernel
release come out. You're not the only one who ping'ed me on this.
Hopefully you can add your Tested-by tag by then. It would help get the
series merged.

BR,
Thinh

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inquiry about the f_tcm: Enhance UASP driver work
  2024-11-22  2:21 ` Inquiry about the f_tcm: Enhance UASP driver work Thinh Nguyen
@ 2024-11-22  7:57   ` Michał Pecio
  2024-11-23  0:02     ` Thinh Nguyen
  0 siblings, 1 reply; 7+ messages in thread
From: Michał Pecio @ 2024-11-22  7:57 UTC (permalink / raw)
  To: thinh.nguyen; +Cc: a1134123566, linux-usb

Hi,

> > I tried to use it on dwc3, though I fix some other problems, the host side xhci
> > (ubuntu client) using fio for stress testing, and I encountered the following
> > error on host:
> > [18836.092159] xhci_hcd 0000:00:0d.0: Transfer error for slot 3 ep 1 on
> > endpoint
> > [18836.092211] sd 0:0:0:0: [sda] tag#11 data cmplt err -71 uas-tag 1 inflight:
> > CMD
> > [18836.092213] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 02 5e 31 00 00 01
> > 00 00
> > .....more and mores....
> > [18867.369118] scsi host0: uas_eh_device_reset_handler start [18867.453796] usb
> > 2-3.2: reset SuperSpeed USB device number 4 using xhci_hcd
> > and the gadget side is keep resetting configfs and printing wait_for_completion
> > timeout (since dwc3 have )
> > 
> > I am not sure whether this is due to the stream exception of dwc3 or some
> > logical in f_tcm and target.
> 
> The error is -71. This is transaction error (could be a CRC error). It
> could be due to the host, device hardware, electrical interference, or
> even the cable. No logical issue from software.

A transaction error is a transaction error, but waiting 30 seconds for
UAS to reset the device afterwards looks wrong. I seem to recall seeing
sporadic transaction errors which triggered the reset instantly.

Long delays I have seen mainly on some unfortunate pairings of HC and
device (HW bugs?) which trigger unusual error conditions poorly handled
by xhci_hcd. Try with dynamic debug on handle_transferless_tx_event(),
if your kernel is recent enough for that to be a separate function.

In those cases, UAS seems to wait for other streams to complete before
resetting, but the whole endpoint is stopped and nothing moves forward.
At least that's the impression I got, I was looking at other things.

If you aren't running into this case, I would say something may be wrong
with UAS implementation on one or the other side.

It looks like the transaction error was delivered to UAS by means of
-EPROTO status so xhci_hcd has done its job at least for this one URB.
No idea what happened later and why the device wasn't reset promptly.

Regards,
Michal

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inquiry about the f_tcm: Enhance UASP driver work
  2024-11-22  7:57   ` Michał Pecio
@ 2024-11-23  0:02     ` Thinh Nguyen
  2024-11-23  2:49       ` Alan Stern
  2024-11-23 14:25       ` Michał Pecio
  0 siblings, 2 replies; 7+ messages in thread
From: Thinh Nguyen @ 2024-11-23  0:02 UTC (permalink / raw)
  To: Michał Pecio
  Cc: Thinh Nguyen, a1134123566@gmail.com, linux-usb@vger.kernel.org

On Fri, Nov 22, 2024, Michał Pecio wrote:
> Hi,
> 
> > > I tried to use it on dwc3, though I fix some other problems, the host side xhci
> > > (ubuntu client) using fio for stress testing, and I encountered the following
> > > error on host:
> > > [18836.092159] xhci_hcd 0000:00:0d.0: Transfer error for slot 3 ep 1 on
> > > endpoint
> > > [18836.092211] sd 0:0:0:0: [sda] tag#11 data cmplt err -71 uas-tag 1 inflight:
> > > CMD
> > > [18836.092213] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 02 5e 31 00 00 01
> > > 00 00
> > > .....more and mores....
> > > [18867.369118] scsi host0: uas_eh_device_reset_handler start [18867.453796] usb
> > > 2-3.2: reset SuperSpeed USB device number 4 using xhci_hcd
> > > and the gadget side is keep resetting configfs and printing wait_for_completion
> > > timeout (since dwc3 have )
> > > 
> > > I am not sure whether this is due to the stream exception of dwc3 or some
> > > logical in f_tcm and target.
> > 
> > The error is -71. This is transaction error (could be a CRC error). It
> > could be due to the host, device hardware, electrical interference, or
> > even the cable. No logical issue from software.
> 
> A transaction error is a transaction error, but waiting 30 seconds for
> UAS to reset the device afterwards looks wrong. I seem to recall seeing
> sporadic transaction errors which triggered the reset instantly.

That's not what happening. I don't recall the storage class handles
transaction error as such. It just waits for the scsi command timeout.

> 
> Long delays I have seen mainly on some unfortunate pairings of HC and
> device (HW bugs?) which trigger unusual error conditions poorly handled
> by xhci_hcd. Try with dynamic debug on handle_transferless_tx_event(),
> if your kernel is recent enough for that to be a separate function.

No, this delay is not a HW bug. When there's transaction error, the xHCI
driver will reset the endpoint. The packet sequence number is reset and
out of sync with the device. The next packet cannot proceed until
there's some sort of recovery. There's no usb_clear_halt() or port reset
immediately after a -EPROTO. The only recovery (port reset) will happen
is after a timeout.

> 
> In those cases, UAS seems to wait for other streams to complete before
> resetting, but the whole endpoint is stopped and nothing moves forward.
> At least that's the impression I got, I was looking at other things.
> 
> If you aren't running into this case, I would say something may be wrong
> with UAS implementation on one or the other side.
> 
> It looks like the transaction error was delivered to UAS by means of
> -EPROTO status so xhci_hcd has done its job at least for this one URB.
> No idea what happened later and why the device wasn't reset promptly.
> 

The host doesn't tell the device to reset until after a timeout. There's
no sync'ing mechanism, so it wouldn't know how to recover. All it can
tell from the device side is it's waiting for the transfer to complete.
Perhaps this can be enhanced in the future in the storage class driver
regarding -EPROTO recovery.

BR,
Thinh

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inquiry about the f_tcm: Enhance UASP driver work
  2024-11-23  0:02     ` Thinh Nguyen
@ 2024-11-23  2:49       ` Alan Stern
  2024-11-25  1:58         ` Homura Akemi
  2024-11-23 14:25       ` Michał Pecio
  1 sibling, 1 reply; 7+ messages in thread
From: Alan Stern @ 2024-11-23  2:49 UTC (permalink / raw)
  To: Thinh Nguyen
  Cc: Michał Pecio, a1134123566@gmail.com,
	linux-usb@vger.kernel.org

On Sat, Nov 23, 2024 at 12:02:10AM +0000, Thinh Nguyen wrote:
> On Fri, Nov 22, 2024, Michał Pecio wrote:
> > Hi,
> > 
> > > > I tried to use it on dwc3, though I fix some other problems, the host side xhci
> > > > (ubuntu client) using fio for stress testing, and I encountered the following
> > > > error on host:
> > > > [18836.092159] xhci_hcd 0000:00:0d.0: Transfer error for slot 3 ep 1 on
> > > > endpoint
> > > > [18836.092211] sd 0:0:0:0: [sda] tag#11 data cmplt err -71 uas-tag 1 inflight:
> > > > CMD
> > > > [18836.092213] sd 0:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 02 5e 31 00 00 01
> > > > 00 00
> > > > .....more and mores....
> > > > [18867.369118] scsi host0: uas_eh_device_reset_handler start [18867.453796] usb
> > > > 2-3.2: reset SuperSpeed USB device number 4 using xhci_hcd
> > > > and the gadget side is keep resetting configfs and printing wait_for_completion
> > > > timeout (since dwc3 have )
> > > > 
> > > > I am not sure whether this is due to the stream exception of dwc3 or some
> > > > logical in f_tcm and target.

Homura Akemi: You can try using usbmon to record the transfers between 
the host and the f_tcm target.  Seeing that data might help explain 
what's going on.

> > > 
> > > The error is -71. This is transaction error (could be a CRC error). It
> > > could be due to the host, device hardware, electrical interference, or
> > > even the cable. No logical issue from software.
> > 
> > A transaction error is a transaction error, but waiting 30 seconds for
> > UAS to reset the device afterwards looks wrong. I seem to recall seeing
> > sporadic transaction errors which triggered the reset instantly.
> 
> That's not what happening. I don't recall the storage class handles
> transaction error as such. It just waits for the scsi command timeout.

For what it's worth, usb-storage handles transaction errors right away, 
without waiting for the SCSI timeout.  It initiates a USB port reset, 
which of course clears out all the messed-up state on both the device 
and the controller.

I don't know about uas, though.

Alan Stern

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inquiry about the f_tcm: Enhance UASP driver work
  2024-11-23  2:49       ` Alan Stern
@ 2024-11-25  1:58         ` Homura Akemi
  0 siblings, 0 replies; 7+ messages in thread
From: Homura Akemi @ 2024-11-25  1:58 UTC (permalink / raw)
  To: Alan Stern; +Cc: Thinh Nguyen, Michał Pecio, linux-usb@vger.kernel.org

On Fri, 22 Nov 2024 18:49:08 -0800 (PST), Alan Stern Wrote:

> Homura Akemi: You can try using usbmon to record the transfers between
> the host and the f_tcm target.  Seeing that data might help explain
> what's going on.

Good idea, I tried to use a small lecroy analyzer the other day, but
the reproduction
takes a long time for fio to run thus the buffer runs out soon and
never reaches the
scene. I will try usbmon now.

Thanks,
H. Akemi

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inquiry about the f_tcm: Enhance UASP driver work
  2024-11-23  0:02     ` Thinh Nguyen
  2024-11-23  2:49       ` Alan Stern
@ 2024-11-23 14:25       ` Michał Pecio
  2024-11-25 21:13         ` Thinh Nguyen
  1 sibling, 1 reply; 7+ messages in thread
From: Michał Pecio @ 2024-11-23 14:25 UTC (permalink / raw)
  To: Thinh Nguyen; +Cc: a1134123566@gmail.com, linux-usb@vger.kernel.org

On Sat, 23 Nov 2024 00:02:10 +0000, Thinh Nguyen wrote:
> > Long delays I have seen mainly on some unfortunate pairings of HC
> > and device (HW bugs?) which trigger unusual error conditions poorly
> > handled by xhci_hcd. Try with dynamic debug on
> > handle_transferless_tx_event(), if your kernel is recent enough for
> > that to be a separate function.  
> 
> No, this delay is not a HW bug. When there's transaction error, the
> xHCI driver will reset the endpoint. The packet sequence number is
> reset and out of sync with the device. The next packet cannot proceed
> until there's some sort of recovery. There's no usb_clear_halt() or
> port reset immediately after a -EPROTO. The only recovery (port
> reset) will happen is after a timeout.

I think you are right. I tried to repro and I got this:

[Nov23 14:01] xhci-pci-renesas 0000:03:00.0: Transfer error for slot 1 ep 6 on endpoint
[  +0.000380] xhci-pci-renesas 0000:03:00.0: Transfer error for slot 1 ep 6 on endpoint
[ +30.096820] sd 6:0:0:0: [sdb] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: IN 
[  +0.000006] sd 6:0:0:0: [sdb] tag#1 CDB: opcode=0x28 28 00 02 d0 30 08 00 02 00 00
[  +0.012009] scsi host6: uas_eh_device_reset_handler start
[  +0.114634] usb 13-2: reset SuperSpeed USB device number 6 using xhci-pci-renesas
[  +0.017603] scsi host6: uas_eh_device_reset_handler success
[  +0.000072] sd 6:0:0:0: [sdb] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=30s
[  +0.000003] sd 6:0:0:0: [sdb] tag#1 CDB: opcode=0x28 28 00 02 d0 30 08 00 02 00 00
[  +0.000001] I/O error, dev sdb, sector 47198216 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0

I will keep it running for a few more hours and if those timeouts
keep happening I will have to conclude that I remembered wrong.

> > before resetting, but the whole endpoint is stopped and nothing
> > moves forward. At least that's the impression I got, I was looking
> > at other things.

But a completely stopped endpoint is *also* possible if you encounter
COMP_INVALID_STREAM_ID. I see it after some command errors on this chip:

13fd:5910 Initio Corporation

> Perhaps this can be enhanced in the future in the storage class
> driver regarding -EPROTO recovery.

It's a universal problem with xhci_hcd, it always resets the host
sequence state on every error, which is against Linux convention,
so nobody expects it and nobody handles it. It's nuts.

One thing I'm going to try is patch it to stop doing this and see
what happens.

Regards,
Michal

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inquiry about the f_tcm: Enhance UASP driver work
  2024-11-23 14:25       ` Michał Pecio
@ 2024-11-25 21:13         ` Thinh Nguyen
  0 siblings, 0 replies; 7+ messages in thread
From: Thinh Nguyen @ 2024-11-25 21:13 UTC (permalink / raw)
  To: Michał Pecio
  Cc: Thinh Nguyen, a1134123566@gmail.com, linux-usb@vger.kernel.org

On Sat, Nov 23, 2024, Michał Pecio wrote:
> On Sat, 23 Nov 2024 00:02:10 +0000, Thinh Nguyen wrote:
> > > Long delays I have seen mainly on some unfortunate pairings of HC
> > > and device (HW bugs?) which trigger unusual error conditions poorly
> > > handled by xhci_hcd. Try with dynamic debug on
> > > handle_transferless_tx_event(), if your kernel is recent enough for
> > > that to be a separate function.  
> > 
> > No, this delay is not a HW bug. When there's transaction error, the
> > xHCI driver will reset the endpoint. The packet sequence number is
> > reset and out of sync with the device. The next packet cannot proceed
> > until there's some sort of recovery. There's no usb_clear_halt() or
> > port reset immediately after a -EPROTO. The only recovery (port
> > reset) will happen is after a timeout.
> 
> I think you are right. I tried to repro and I got this:
> 
> [Nov23 14:01] xhci-pci-renesas 0000:03:00.0: Transfer error for slot 1 ep 6 on endpoint
> [  +0.000380] xhci-pci-renesas 0000:03:00.0: Transfer error for slot 1 ep 6 on endpoint
> [ +30.096820] sd 6:0:0:0: [sdb] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: IN 
> [  +0.000006] sd 6:0:0:0: [sdb] tag#1 CDB: opcode=0x28 28 00 02 d0 30 08 00 02 00 00
> [  +0.012009] scsi host6: uas_eh_device_reset_handler start
> [  +0.114634] usb 13-2: reset SuperSpeed USB device number 6 using xhci-pci-renesas
> [  +0.017603] scsi host6: uas_eh_device_reset_handler success
> [  +0.000072] sd 6:0:0:0: [sdb] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=DRIVER_OK cmd_age=30s
> [  +0.000003] sd 6:0:0:0: [sdb] tag#1 CDB: opcode=0x28 28 00 02 d0 30 08 00 02 00 00
> [  +0.000001] I/O error, dev sdb, sector 47198216 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
> 
> I will keep it running for a few more hours and if those timeouts
> keep happening I will have to conclude that I remembered wrong.
> 
> > > before resetting, but the whole endpoint is stopped and nothing
> > > moves forward. At least that's the impression I got, I was looking
> > > at other things.
> 
> But a completely stopped endpoint is *also* possible if you encounter
> COMP_INVALID_STREAM_ID. I see it after some command errors on this chip:
> 
> 13fd:5910 Initio Corporation

Sure. I guess I always associate the -EPROTO error with transaction
error that I forget there are some other error conditions with that
error code also.

> 
> > Perhaps this can be enhanced in the future in the storage class
> > driver regarding -EPROTO recovery.
> 
> It's a universal problem with xhci_hcd, it always resets the host
> sequence state on every error, which is against Linux convention,
> so nobody expects it and nobody handles it. It's nuts.
> 
> One thing I'm going to try is patch it to stop doing this and see
> what happens.
> 

The xHC can halt the endpoint when it sees certain error. That's what
happens here. To recover from this halted state, the driver can either
try to soft reset to resume or give up and reset the endpoint and return
-EPROTO. We can't avoid the reset here. Perhaps we can associate this
type of error as -EPIPE instead. If not, we should update the error code
documentation under Documentation/driver-api/usb/error-codes.rst

BR,
Thinh

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-11-25 21:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CAC7i41MN9LAxy7FZh7nbo9zQ_hvmkZtzpGpXatyCVLQTde=RZA@mail.gmail.com>
2024-11-22  2:21 ` Inquiry about the f_tcm: Enhance UASP driver work Thinh Nguyen
2024-11-22  7:57   ` Michał Pecio
2024-11-23  0:02     ` Thinh Nguyen
2024-11-23  2:49       ` Alan Stern
2024-11-25  1:58         ` Homura Akemi
2024-11-23 14:25       ` Michał Pecio
2024-11-25 21:13         ` Thinh Nguyen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox