usb: dwc3: HC dies under high I/O load on Exynos5422

linux-usb.vger.kernel.org archive mirror

* usb: dwc3: HC dies under high I/O load on Exynos5422
@ 2023-06-16  3:11 Jakub Vaněk
  2023-06-16  9:26 ` Krzysztof Kozlowski
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jakub Vaněk @ 2023-06-16  3:11 UTC (permalink / raw)
  To: Thinh.Nguyen, krzysztof.kozlowski, mauro.ribeiro
  Cc: linux-usb, linux-samsung-soc

Hi all,

I've discovered that on recent kernels the xHCI controller on Odroid
HC2 dies when a USB-attached disk is put under a heavy I/O load.

The hardware in question is using a DWC3 2.00a IP within the Exynos5422
to provide two internal USB3 ports. One of them is connected to a
JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The bridge
is then connected to an Intel SSDSC2KG240G8 (firmware XCV10132).

The crash can be triggered by running a read-heavy workload. This
triggers it for me within tens of seconds:

$ fio --filename=/dev/sda --direct=1 --rw=read --bs=4k \
 --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
 --time_based --group_reporting --name=iops-test-job \
 --eta-newline=1 --readonly

FIO output then follows this pattern:

iops-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.16
Starting 4 processes
Jobs: 4 (f=4): [R(4)][2.5%][r=341MiB/s][r=87.2k IOPS][eta 01m:57s]
Jobs: 4 (f=4): [R(4)][4.2%][r=340MiB/s][r=87.1k IOPS][eta 01m:55s]
Jobs: 4 (f=4): [R(4)][5.8%][r=337MiB/s][r=86.2k IOPS][eta 01m:53s]
Jobs: 4 (f=4): [R(4)][7.5%][r=369MiB/s][r=94.5k IOPS][eta 01m:51s]
Jobs: 4 (f=4): [R(4)][9.2%][r=364MiB/s][r=93.2k IOPS][eta 01m:49s]
Jobs: 4 (f=4): [R(4)][10.8%][r=363MiB/s][r=92.9k IOPS][eta 01m:47s]
Jobs: 4 (f=4): [R(4)][12.5%][r=348MiB/s][r=88.0k IOPS][eta 01m:45s]
Jobs: 4 (f=4): [R(4)][14.2%][r=348MiB/s][r=88.0k IOPS][eta 01m:43s]
Jobs: 4 (f=4): [R(4)][15.8%][r=377MiB/s][r=96.4k IOPS][eta 01m:41s]
Jobs: 4 (f=4): [R(4)][17.5%][r=372MiB/s][r=95.2k IOPS][eta 01m:39s]
Jobs: 4 (f=4): [R(4)][18.3%][r=77.0MiB/s][r=19.0k IOPS][eta 01m:38s]
Jobs: 4 (f=4): [R(4)][20.0%][eta 01m:36s]
< line without progress repeated many times; xHC is now unresponsive >
Jobs: 4 (f=4): [R(4)][45.8%][eta 01m:05s]
fio: io_u error on file /dev/sda: No such device: read
offset=1820839936, buflen=4096
fio: pid=1863, err=19/file:io_u.c:1787, func=io_u error, error=No such
device
< and so on >

Dmesg contains the following output:

[ 266.310767] xhci-hcd xhci-hcd.8.auto: xHCI host controller not
responding, assume dead
[ 266.317388] xhci-hcd xhci-hcd.8.auto: HC died; cleaning up
[ 266.322710] usb 4-1: cmd cmplt err -108
[ 266.326497] usb 4-1: cmd cmplt err -108
[ 266.330313] usb 4-1: cmd cmplt err -108
[ 266.334096] usb 4-1: cmd cmplt err -108
[ 266.337942] usb 4-1: cmd cmplt err -108
[ 266.341746] usb 4-1: cmd cmplt err -108
[ 266.345561] usb 4-1: cmd cmplt err -108
[ 266.349372] usb 4-1: cmd cmplt err -108
[ 266.353187] usb 4-1: cmd cmplt err -108
[ 266.357000] usb 4-1: cmd cmplt err -108
[ 266.360809] usb 4-1: cmd cmplt err -108
[ 266.364626] usb 4-1: cmd cmplt err -108
[ 266.368439] usb 4-1: cmd cmplt err -108
[ 266.372248] usb 4-1: cmd cmplt err -108
[ 266.376063] usb 4-1: cmd cmplt err -108
[ 266.379876] usb 4-1: cmd cmplt err -108
[ 266.383688] usb 4-1: cmd cmplt err -108
[ 266.387500] usb 4-1: cmd cmplt err -108
[ 266.391314] usb 4-1: cmd cmplt err -108
[ 266.395127] usb 4-1: cmd cmplt err -108
[ 266.398943] usb 4-1: cmd cmplt err -108
[ 266.402753] usb 4-1: cmd cmplt err -108
[ 266.406565] usb 4-1: cmd cmplt err -108
[ 266.410379] usb 4-1: cmd cmplt err -108
[ 266.414165] usb 4-1: cmd cmplt err -108
[ 266.418003] usb 4-1: cmd cmplt err -108
[ 266.448629] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd
1, flush 0, corrupt 0, gen 0
< more FS errors follow >

The OS is then unable to recover (I have rootfs on that SSD too) and
the board must be manually restarted.

I can reproduce the problem on mainline 6.4-rc6 with multi_v7_defconfig
(+ CONFIG_BTRFS=y for the rootfs). I've bisected the problem a while
back and the first broken commit is b138e23d3dff ("usb: dwc3: core:
Enable AutoRetry feature in the controller"). Reverting this commit
locally makes my board stable again (FIO test above can run
for >10 minutes without any issues).

The crash is happening when the USB-SATA bridge is controlled by the
uas driver. I have not tested the usb-storage driver yet.

What do you think would be an appropriate fix here? One idea I had is
to add an Odroid-specific quirk to DWC3 to not enable AutoRetry here.
However, I'm not entirely sure this is isolated to Odroid boards.

Please let me know if you need me to do some more experiments.

Thank you,

Jakub Vanek


* Re: usb: dwc3: HC dies under high I/O load on Exynos5422
  2023-06-16  3:11 usb: dwc3: HC dies under high I/O load on Exynos5422 Jakub Vaněk
@ 2023-06-16  9:26 ` Krzysztof Kozlowski
  2023-06-16 10:44   ` Jakub Vaněk
  2023-06-17  8:55 ` Jakub Vaněk
  2023-06-22 22:33 ` Thinh Nguyen
  2 siblings, 1 reply; 7+ messages in thread
From: Krzysztof Kozlowski @ 2023-06-16  9:26 UTC (permalink / raw)
  To: Jakub Vaněk, Thinh.Nguyen, mauro.ribeiro
  Cc: linux-usb, linux-samsung-soc

On 16/06/2023 05:11, Jakub Vaněk wrote:
> Hi all,
> 
> I've discovered that on recent kernels the xHCI controller on Odroid
> HC2 dies when a USB-attached disk is put under a heavy I/O load.
> 
> The hardware in question is using a DWC3 2.00a IP within the Exynos5422
> to provide two internal USB3 ports. One of them is connected to a
> JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The bridge
> is then connected to an Intel SSDSC2KG240G8 (firmware XCV10132).
> 
> The crash can be triggered by running a read-heavy workload. This
> triggers it for me within tens of seconds:

multi_v7 has devfreq enabled. Does disabling ARM_EXYNOS_BUS_DEVFREQ
change anything here?

Best regards,
Krzysztof



* Re: usb: dwc3: HC dies under high I/O load on Exynos5422
  2023-06-16  9:26 ` Krzysztof Kozlowski
@ 2023-06-16 10:44   ` Jakub Vaněk
  0 siblings, 0 replies; 7+ messages in thread
From: Jakub Vaněk @ 2023-06-16 10:44 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Thinh.Nguyen, mauro.ribeiro
  Cc: linux-usb, linux-samsung-soc

Hi Krzysztof,

thank you for your quick reply!

On Fri, 2023-06-16 at 11:26 +0200, Krzysztof Kozlowski wrote:
> On 16/06/2023 05:11, Jakub Vaněk wrote:
> > Hi all,
> > 
> > I've discovered that on recent kernels the xHCI controller on
> > Odroid
> > HC2 dies when a USB-attached disk is put under a heavy I/O load.
> > 
> > The hardware in question is using a DWC3 2.00a IP within the
> > Exynos5422
> > to provide two internal USB3 ports. One of them is connected to a
> > JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The
> > bridge
> > is then connected to an Intel SSDSC2KG240G8 (firmware XCV10132).
> > 
> > The crash can be triggered by running a read-heavy workload. This
> > triggers it for me within tens of seconds:
> 
> multi_v7 has devfreq enabled. Does disabling ARM_EXYNOS_BUS_DEVFREQ
> change anything here?

Only slightly. The FIO test still makes the xHCI controller crash.
However, the timing seems to be slightly different -- I either get the
crash in ~10 seconds (most of the time) or only after a minute. Before
disabling ARM_EXYNOS_BUS_DEVFREQ it seemed to take about 20-40 seconds.
On the other hand, I have tried it only two or three times before, so
this data may not be conclusive.

Full kernel config: https://pastebin.com/iLSsYfBF
Full fio output: https://pastebin.com/9NehLhQr
Full-ish dmesg here: https://pastebin.com/1Zgd1gVg
All of the bus-* devfreq sysfs nodes disappeared in this configuration:
$ ls /sys/class/devfreq
10c20000.memory-controller  11800000.gpu

The memory controller driver prints some errors in this configuration.
Disabling it with CONFIG_EXYNOS5422_DMC=n doesn't seem to affect the
crash. I also tried to set the cpufreq governor to performance instead
of ondemand and that too didn't help.

> Best regards,
> Krzysztof

Best regards,
Jakub


* Re: usb: dwc3: HC dies under high I/O load on Exynos5422
  2023-06-16  3:11 usb: dwc3: HC dies under high I/O load on Exynos5422 Jakub Vaněk
  2023-06-16  9:26 ` Krzysztof Kozlowski
@ 2023-06-17  8:55 ` Jakub Vaněk
  2023-06-22 22:33 ` Thinh Nguyen
  2 siblings, 0 replies; 7+ messages in thread
From: Jakub Vaněk @ 2023-06-17  8:55 UTC (permalink / raw)
  To: Thinh.Nguyen, krzysztof.kozlowski, mauro.ribeiro
  Cc: linux-usb, linux-samsung-soc


Hi all,

I've done a few more tests. I'm also adding the required information
described in DWC3 documentation which I previously missed.

> The OS is then unable to recover (I have rootfs on that SSD too) and
> the board must be manually restarted.

I have resolved this by creating a fresh Ubuntu 20.04 rootfs on an SD
card. The system now survives the controller crash. The xHC can also be
brought up again by unbinding the dwc3 driver and then binding it back.
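
A minimal sketch of such a rebind, assuming the controller is the
platform device 12000000.usb and that the core driver directory is
/sys/bus/platform/drivers/dwc3 (both are assumptions about this board's
sysfs layout; the function and variable names are made up for
illustration). It has to be run as root:

```shell
#!/bin/sh
# Sketch only: rebind the dwc3 core driver to bring a dead xHC back up.
# DWC3_DEV and DWC3_DRV are assumed defaults; adjust them for your board.
DWC3_DEV="${DWC3_DEV:-12000000.usb}"
DWC3_DRV="${DWC3_DRV:-/sys/bus/platform/drivers/dwc3}"

dwc3_rebind() {
    # Detach the driver from the controller ...
    echo "$DWC3_DEV" > "$DWC3_DRV/unbind" || return 1
    # ... give it a moment to tear down (delay value is arbitrary) ...
    sleep "${REBIND_DELAY:-1}"
    # ... and attach it again, which re-probes the controller.
    echo "$DWC3_DEV" > "$DWC3_DRV/bind"
}
```

Calling dwc3_rebind after the "HC died" message should re-enumerate the
devices behind the controller.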

> Dmesg contains the following output:
> < ... >

It turns out that this was not the full relevant output. I was
collecting the logs from a serial console and I hadn't properly
enabled verbose printing. Hopefully the full dmesg is now linked below.

> The crash is happening when the USB-SATA bridge is controlled by the
> uas driver. I have not tested the usb-storage driver yet.

I tested this now. With usb-storage the controller is stable, but the
achievable throughput is lower (75 MB/s BOT vs 300 MB/s UAS).
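
For anyone repeating this comparison: one common way to keep the uas
driver away from a bridge is the usb-storage "quirks" parameter with
the "u" (ignore UAS) flag. This is a sketch and not necessarily how the
switch was done here. The 152d:0578 ID and the 4-1:1.0 interface name
are assumptions that should be verified with lsusb on the actual
board, and the helper names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: helpers for comparing uas and usb-storage on one bridge.
# USB_DEVICES is the usual sysfs location of USB devices and interfaces.
USB_DEVICES="${USB_DEVICES:-/sys/bus/usb/devices}"

# Build a usb-storage quirk entry that tells the kernel to ignore UAS
# for the given vendor and product ID (hex, without the 0x prefix).
uas_ignore_quirk() {
    printf '%s:%s:u\n' "$1" "$2"
}

# Print the name of the driver currently bound to a USB interface,
# e.g. "uas" or "usb-storage".
bound_driver() {
    basename "$(readlink "$USB_DEVICES/$1/driver")"
}
```

The string printed by "uas_ignore_quirk 152d 0578" can be passed on the
kernel command line as usb-storage.quirks=152d:0578:u, and
"bound_driver 4-1:1.0" then shows which driver ended up bound.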

---

With the rootfs on the SD card, I was able to capture a DWC3 event
trace & register dump. I am running clean 6.4-rc6 with a config similar
to multi_v7_defconfig (see below for details).

To capture the trace, I followed these steps:

 1. Unbind the DWC3 driver from the controller (12000000.usb).
 2. Enable DWC3 tracing.
 3. Bind the DWC3 driver back.
 4. Save the DWC3 register dump to "regdump-before-fio.txt".
 5. Run the FIO stress test from the first email. Once FIO stops
    printing IOPS, dump registers again to "regdump-during-freeze.txt"
 6. Once FIO exits and the kernel prints the "HC died" message,
    dump registers once more to "regdump-after-hc-died.txt".
 7. Save the current trace buffer to "trace.txt".
 8. Save the current kernel log to "dmesg.txt".

I had to do the DWC3 unbind-bind dance because I have no way of
unplugging the onboard JMS578 bridge from the main Exynos chip.
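
The steps above can be sketched as a small shell script. This is an
approximation and not the exact commands that were run: the tracefs
path, the debugfs regdump path and the driver directory are assumed
typical locations, and the function names are made up for
illustration. It needs root and a mounted debugfs:

```shell
#!/bin/sh
# Sketch only: capture DWC3 traces and register dumps around a hang.
# All paths are assumed defaults; override them to match your system.
DWC3_DEV="${DWC3_DEV:-12000000.usb}"
DWC3_DRV="${DWC3_DRV:-/sys/bus/platform/drivers/dwc3}"
TRACING="${TRACING:-/sys/kernel/debug/tracing}"
USB_DEBUGFS="${USB_DEBUGFS:-/sys/kernel/debug/usb}"
OUT_DIR="${OUT_DIR:-.}"

# Steps 4-6: copy the current register dump into the named file.
dump_regs() {
    cat "$USB_DEBUGFS/$DWC3_DEV/regdump" > "$OUT_DIR/$1"
}

# Steps 1-4: unbind, enable dwc3 trace events, bind back, first dump.
capture_start() {
    echo "$DWC3_DEV" > "$DWC3_DRV/unbind" || return 1
    echo 1 > "$TRACING/events/dwc3/enable" || return 1
    echo "$DWC3_DEV" > "$DWC3_DRV/bind" || return 1
    dump_regs regdump-before-fio.txt
}

# Step 7: save the trace buffer.
save_trace() {
    cat "$TRACING/trace" > "$OUT_DIR/trace.txt"
}
```

A run would then be: capture_start, start FIO, call
"dump_regs regdump-during-freeze.txt" once IOPS stop, call
"dump_regs regdump-after-hc-died.txt" after the "HC died" message, and
finish with save_trace and "dmesg > dmesg.txt".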

The resulting files can be found in the attached tarball including the
kernel config (I kept ARM_EXYNOS_BUS_DEVFREQ enabled this time).
Dmesg.txt is also available at https://pastebin.com/EkfXKMih .

I am not 100% sure this is not a hardware fault. However, there are a
few Exynos5422-based Odroid users experiencing a similar issue. Most of
them mention kernel 5.4, which does contain the bisected bad commit.
 - https://forum.odroid.com/viewtopic.php?t=42630 (my report,
   but other people there report the same issue)
 - https://forum.odroid.com/viewtopic.php?t=46409
 - https://forum.armbian.com/topic/20582-odroid-xu4-usb-sata-ssd-drive-random-disconnect/

Please let me know if you need more information.

Thank you,
Jakub Vanek

[-- Attachment #2: dwc3-logs.tar.gz --]
[-- Type: application/x-compressed-tar, Size: 73992 bytes --]


* Re: usb: dwc3: HC dies under high I/O load on Exynos5422
  2023-06-16  3:11 usb: dwc3: HC dies under high I/O load on Exynos5422 Jakub Vaněk
  2023-06-16  9:26 ` Krzysztof Kozlowski
  2023-06-17  8:55 ` Jakub Vaněk
@ 2023-06-22 22:33 ` Thinh Nguyen
  2023-06-24 15:58   ` Jakub Vaněk
  2 siblings, 1 reply; 7+ messages in thread
From: Thinh Nguyen @ 2023-06-22 22:33 UTC (permalink / raw)
  To: Jakub Vaněk
  Cc: Thinh Nguyen, krzysztof.kozlowski@linaro.org,
	mauro.ribeiro@hardkernel.com, linux-usb@vger.kernel.org,
	linux-samsung-soc@vger.kernel.org

Sorry for the delay in response. I was away.

On Fri, Jun 16, 2023, Jakub Vaněk wrote:
> Hi all,
> 
> I've discovered that on recent kernels the xHCI controller on Odroid
> HC2 dies when a USB-attached disk is put under a heavy I/O load.
> 
> The hardware in question is using a DWC3 2.00a IP within the Exynos5422

Just want to clarify, this is dwc_usb3 v2.00a and not dwc_usb31.

> to provide two internal USB3 ports. One of them is connected to a
> JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The bridge
> is then connected to an Intel SSDSC2KG240G8 (firmware XCV10132).
> 
> The crash can be triggered by running a read-heavy workload. This
> triggers it for me within tens of seconds:
> 
> $ fio --filename=/dev/sda --direct=1 --rw=read --bs=4k \
>  --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
>  --time_based --group_reporting --name=iops-test-job \
>  --eta-newline=1 --readonly
> 
> FIO output then follows this pattern:
> 
> iops-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> 4096B-4096B, ioengine=libaio, iodepth=256
> ...
> fio-3.16
> Starting 4 processes
> Jobs: 4 (f=4): [R(4)][2.5%][r=341MiB/s][r=87.2k IOPS][eta 01m:57s]
> Jobs: 4 (f=4): [R(4)][4.2%][r=340MiB/s][r=87.1k IOPS][eta 01m:55s]
> Jobs: 4 (f=4): [R(4)][5.8%][r=337MiB/s][r=86.2k IOPS][eta 01m:53s]
> Jobs: 4 (f=4): [R(4)][7.5%][r=369MiB/s][r=94.5k IOPS][eta 01m:51s]
> Jobs: 4 (f=4): [R(4)][9.2%][r=364MiB/s][r=93.2k IOPS][eta 01m:49s]
> Jobs: 4 (f=4): [R(4)][10.8%][r=363MiB/s][r=92.9k IOPS][eta 01m:47s]
> Jobs: 4 (f=4): [R(4)][12.5%][r=348MiB/s][r=88.0k IOPS][eta 01m:45s]
> Jobs: 4 (f=4): [R(4)][14.2%][r=348MiB/s][r=88.0k IOPS][eta 01m:43s]
> Jobs: 4 (f=4): [R(4)][15.8%][r=377MiB/s][r=96.4k IOPS][eta 01m:41s]
> Jobs: 4 (f=4): [R(4)][17.5%][r=372MiB/s][r=95.2k IOPS][eta 01m:39s]
> Jobs: 4 (f=4): [R(4)][18.3%][r=77.0MiB/s][r=19.0k IOPS][eta 01m:38s]
> Jobs: 4 (f=4): [R(4)][20.0%][eta 01m:36s]
> < line without progress repeated many times; xHC is now unresponsive >
> Jobs: 4 (f=4): [R(4)][45.8%][eta 01m:05s]
> fio: io_u error on file /dev/sda: No such device: read
> offset=1820839936, buflen=4096
> fio: pid=1863, err=19/file:io_u.c:1787, func=io_u error, error=No such
> device
> < and so on >
> 
> Dmesg contains the following output:
> 
> [ 266.310767] xhci-hcd xhci-hcd.8.auto: xHCI host controller not
> responding, assume dead
> [ 266.317388] xhci-hcd xhci-hcd.8.auto: HC died; cleaning up
> [ 266.322710] usb 4-1: cmd cmplt err -108
> [ 266.326497] usb 4-1: cmd cmplt err -108
> [ 266.330313] usb 4-1: cmd cmplt err -108
> [ 266.334096] usb 4-1: cmd cmplt err -108
> [ 266.337942] usb 4-1: cmd cmplt err -108
> [ 266.341746] usb 4-1: cmd cmplt err -108
> [ 266.345561] usb 4-1: cmd cmplt err -108
> [ 266.349372] usb 4-1: cmd cmplt err -108
> [ 266.353187] usb 4-1: cmd cmplt err -108
> [ 266.357000] usb 4-1: cmd cmplt err -108
> [ 266.360809] usb 4-1: cmd cmplt err -108
> [ 266.364626] usb 4-1: cmd cmplt err -108
> [ 266.368439] usb 4-1: cmd cmplt err -108
> [ 266.372248] usb 4-1: cmd cmplt err -108
> [ 266.376063] usb 4-1: cmd cmplt err -108
> [ 266.379876] usb 4-1: cmd cmplt err -108
> [ 266.383688] usb 4-1: cmd cmplt err -108
> [ 266.387500] usb 4-1: cmd cmplt err -108
> [ 266.391314] usb 4-1: cmd cmplt err -108
> [ 266.395127] usb 4-1: cmd cmplt err -108
> [ 266.398943] usb 4-1: cmd cmplt err -108
> [ 266.402753] usb 4-1: cmd cmplt err -108
> [ 266.406565] usb 4-1: cmd cmplt err -108
> [ 266.410379] usb 4-1: cmd cmplt err -108
> [ 266.414165] usb 4-1: cmd cmplt err -108
> [ 266.418003] usb 4-1: cmd cmplt err -108
> [ 266.448629] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd
> 1, flush 0, corrupt 0, gen 0
> < more FS errors follow >
> 
> The OS is then unable to recover (I have rootfs on that SSD too) and
> the board must be manually restarted.
> 
> I can reproduce the problem on mainline 6.4-rc6 with multi_v7_defconfig
> (+ CONFIG_BTRFS=y for the rootfs). I've bisected the problem a while
> back and the first broken commit is b138e23d3dff ("usb: dwc3: core:
> Enable AutoRetry feature in the controller"). Reverting this commit
> locally makes my board stable again (FIO test above can run
> for >10 minutes without any issues).

This info helps a lot.

> 
> The crash is happening when the USB-SATA bridge is controlled by the
> uas driver. I have not tested the usb-storage driver yet.
> 
> What do you think would be an appropriate fix here? One idea I had is
> to add an Odroid-specific quirk to DWC3 to not enable AutoRetry here.
> However, I'm not entirely sure this is isolated to Odroid boards.
> 
> Please let me know if you need me to do some more experiments.
> 

This failure indicates that whichever device you're testing against
could not retry with burst (NumP != 0) after a CRC error. After a period
of time, the host timed out and attempted to restore its operations by
stopping the active transfers with a Stop-ep command. However, for some
reason, the host doesn't respond to this command. The crash you observed
is probably a separate issue. The main issue is why the host doesn't
receive a command completion event. If you're our direct customer, you
can submit a STAR request for our support. I'm not aware of this type of
failure related to AutoRetry. However, given how old this controller
version is (over a decade ago), I can't be sure.

I think if you try to test against a different device, you may not
observe this same failure.

To resolve this, please ask our support team to investigate
further to see whether it's a setup issue. Otherwise, we can disable
this feature for dwc_usb3 v2.00a. Depending on how bad the CRC error
rate is (which should be low), this should not affect performance much.
I don't think this necessarily needs a new DT property.

Something like this:

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 0beaab932e7d..1bfd8b127240 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1209,8 +1209,9 @@ static int dwc3_core_init(struct dwc3 *dwc)
 		dwc3_writel(dwc->regs, DWC3_GUCTL1, reg);
 	}
 
-	if (dwc->dr_mode == USB_DR_MODE_HOST ||
-	    dwc->dr_mode == USB_DR_MODE_OTG) {
+	if (!DWC3_VER_IS(DWC3, 200A) &&
+	    (dwc->dr_mode == USB_DR_MODE_HOST ||
+	     dwc->dr_mode == USB_DR_MODE_OTG)) {
 		reg = dwc3_readl(dwc->regs, DWC3_GUCTL);
 
 		/*


Thanks,
Thinh


* Re: usb: dwc3: HC dies under high I/O load on Exynos5422
  2023-06-22 22:33 ` Thinh Nguyen
@ 2023-06-24 15:58   ` Jakub Vaněk
  2023-06-26 23:20     ` Thinh Nguyen
  0 siblings, 1 reply; 7+ messages in thread
From: Jakub Vaněk @ 2023-06-24 15:58 UTC (permalink / raw)
  To: Thinh Nguyen
  Cc: krzysztof.kozlowski@linaro.org, mauro.ribeiro@hardkernel.com,
	linux-usb@vger.kernel.org, linux-samsung-soc@vger.kernel.org

Hi Thinh,

thank you for your reply!

On Thu, 2023-06-22 at 22:33 +0000, Thinh Nguyen wrote:
> Sorry for the delay in response. I was away.
> 
> On Fri, Jun 16, 2023, Jakub Vaněk wrote:
> > Hi all,
> > 
> > I've discovered that on recent kernels the xHCI controller on Odroid
> > HC2 dies when a USB-attached disk is put under a heavy I/O load.
> > 
> > The hardware in question is using a DWC3 2.00a IP within the Exynos5422
> 
> Just want to clarify, this is dwc_usb3 v2.00a and not dwc_usb31.

Indeed, I forgot to add this.

> > to provide two internal USB3 ports. One of them is connected to a
> > JMS578 USB-to-SATA bridge (Odroid firmware v173.01.00.02). The bridge
> > is then connected to an Intel SSDSC2KG240G8 (firmware XCV10132).
> > 
> > The crash can be triggered by running a read-heavy workload. This
> > triggers it for me within tens of seconds:
> > 
> > $ fio --filename=/dev/sda --direct=1 --rw=read --bs=4k \
> >  --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=4 \
> >  --time_based --group_reporting --name=iops-test-job \
> >  --eta-newline=1 --readonly
> > 
> > FIO output then follows this pattern:
> > 
> > iops-test-job: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=libaio, iodepth=256
> > ...
> > fio-3.16
> > Starting 4 processes
> > Jobs: 4 (f=4): [R(4)][2.5%][r=341MiB/s][r=87.2k IOPS][eta 01m:57s]
> > Jobs: 4 (f=4): [R(4)][4.2%][r=340MiB/s][r=87.1k IOPS][eta 01m:55s]
> > Jobs: 4 (f=4): [R(4)][5.8%][r=337MiB/s][r=86.2k IOPS][eta 01m:53s]
> > Jobs: 4 (f=4): [R(4)][7.5%][r=369MiB/s][r=94.5k IOPS][eta 01m:51s]
> > Jobs: 4 (f=4): [R(4)][9.2%][r=364MiB/s][r=93.2k IOPS][eta 01m:49s]
> > Jobs: 4 (f=4): [R(4)][10.8%][r=363MiB/s][r=92.9k IOPS][eta 01m:47s]
> > Jobs: 4 (f=4): [R(4)][12.5%][r=348MiB/s][r=88.0k IOPS][eta 01m:45s]
> > Jobs: 4 (f=4): [R(4)][14.2%][r=348MiB/s][r=88.0k IOPS][eta 01m:43s]
> > Jobs: 4 (f=4): [R(4)][15.8%][r=377MiB/s][r=96.4k IOPS][eta 01m:41s]
> > Jobs: 4 (f=4): [R(4)][17.5%][r=372MiB/s][r=95.2k IOPS][eta 01m:39s]
> > Jobs: 4 (f=4): [R(4)][18.3%][r=77.0MiB/s][r=19.0k IOPS][eta 01m:38s]
> > Jobs: 4 (f=4): [R(4)][20.0%][eta 01m:36s]
> > < line without progress repeated many times; xHC is now unresponsive >
> > Jobs: 4 (f=4): [R(4)][45.8%][eta 01m:05s]
> > fio: io_u error on file /dev/sda: No such device: read
> > offset=1820839936, buflen=4096
> > fio: pid=1863, err=19/file:io_u.c:1787, func=io_u error, error=No such
> > device
> > < and so on >
> > 
> > Dmesg contains the following output:
> > 
> > [ 266.310767] xhci-hcd xhci-hcd.8.auto: xHCI host controller not
> > responding, assume dead
> > [ 266.317388] xhci-hcd xhci-hcd.8.auto: HC died; cleaning up
> > [ 266.322710] usb 4-1: cmd cmplt err -108
> > [ 266.326497] usb 4-1: cmd cmplt err -108
> > [ 266.330313] usb 4-1: cmd cmplt err -108
> > [ 266.334096] usb 4-1: cmd cmplt err -108
> > [ 266.337942] usb 4-1: cmd cmplt err -108
> > [ 266.341746] usb 4-1: cmd cmplt err -108
> > [ 266.345561] usb 4-1: cmd cmplt err -108
> > [ 266.349372] usb 4-1: cmd cmplt err -108
> > [ 266.353187] usb 4-1: cmd cmplt err -108
> > [ 266.357000] usb 4-1: cmd cmplt err -108
> > [ 266.360809] usb 4-1: cmd cmplt err -108
> > [ 266.364626] usb 4-1: cmd cmplt err -108
> > [ 266.368439] usb 4-1: cmd cmplt err -108
> > [ 266.372248] usb 4-1: cmd cmplt err -108
> > [ 266.376063] usb 4-1: cmd cmplt err -108
> > [ 266.379876] usb 4-1: cmd cmplt err -108
> > [ 266.383688] usb 4-1: cmd cmplt err -108
> > [ 266.387500] usb 4-1: cmd cmplt err -108
> > [ 266.391314] usb 4-1: cmd cmplt err -108
> > [ 266.395127] usb 4-1: cmd cmplt err -108
> > [ 266.398943] usb 4-1: cmd cmplt err -108
> > [ 266.402753] usb 4-1: cmd cmplt err -108
> > [ 266.406565] usb 4-1: cmd cmplt err -108
> > [ 266.410379] usb 4-1: cmd cmplt err -108
> > [ 266.414165] usb 4-1: cmd cmplt err -108
> > [ 266.418003] usb 4-1: cmd cmplt err -108
> > [ 266.448629] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd
> > 1, flush 0, corrupt 0, gen 0
> > < more FS errors follow >
> > 
> > The OS is then unable to recover (I have rootfs on that SSD too) and
> > the board must be manually restarted.
> > 
> > I can reproduce the problem on mainline 6.4-rc6 with multi_v7_defconfig
> > (+ CONFIG_BTRFS=y for the rootfs). I've bisected the problem a while
> > back and the first broken commit is b138e23d3dff ("usb: dwc3: core:
> > Enable AutoRetry feature in the controller"). Reverting this commit
> > locally makes my board stable again (FIO test above can run
> > for >10 minutes without any issues).
> 
> This info helps a lot.
> 
> > 
> > The crash is happening when the USB-SATA bridge is controlled by the
> > uas driver. I have not tested the usb-storage driver yet.
> > 
> > What do you think would be an appropriate fix here? One idea I had is
> > to add an Odroid-specific quirk to DWC3 to not enable AutoRetry here.
> > However, I'm not entirely sure this is isolated to Odroid boards.
> > 
> > Please let me know if you need me to do some more experiments.
> > 
> 
> This failure indicates that whichever device you're testing against
> could not retry with burst (NumP != 0) after a CRC error. After a period
> of time, the host timed out and attempted to restore its operations by
> stopping the active transfers with a Stop-ep command. However, for some
> reason, the host doesn't respond to this command. The crash you observed
> is probably a separate issue. The main issue is why the host doesn't
> receive a command completion event. If you're our direct customer, you
> can submit a STAR request for our support. I'm not aware of this type of
> failure related to AutoRetry. However, given how old this controller
> version is (over a decade ago), I can't be sure.

Thank you, this explanation makes sense to me.

> I think if you try to test against a different device, you may not
> observe this same failure.

I can partially confirm this. There is a USB3 to 1Gbit Ethernet bridge
onboard too and this peripheral appears to work reliably. I am unable
to test a different USB-to-SATA bridge though - there are no physical
USB3 ports on Odroid HC2. It would be possible to verify this on Odroid
XU4 which uses the same chip and does have physical USB ports. However,
I don't have one at hand now.

> To resolve this, please ask our support team to investigate
> further to see whether it's a setup issue. Otherwise, we can disable
> this feature for dwc_usb3 v2.00a. Depending on how bad the CRC error
> rate is (which should be low), this should not affect performance
> much.

I unfortunately have no relationship with either Synopsys, Samsung or
Hardkernel. Would you be OK with me submitting the proposed patch even
without further investigation? Also, can I submit this for backporting
to -stable?

> I don't think this necessarily needs a new DT property.
> 
> Something like this:
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index 0beaab932e7d..1bfd8b127240 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -1209,8 +1209,9 @@ static int dwc3_core_init(struct dwc3 *dwc)
>                 dwc3_writel(dwc->regs, DWC3_GUCTL1, reg);
>         }
>  
> -       if (dwc->dr_mode == USB_DR_MODE_HOST ||
> -           dwc->dr_mode == USB_DR_MODE_OTG) {
> +       if (!DWC3_VER_IS(DWC3, 200A) &&
> +           (dwc->dr_mode == USB_DR_MODE_HOST ||
> +            dwc->dr_mode == USB_DR_MODE_OTG)) {
>                 reg = dwc3_readl(dwc->regs, DWC3_GUCTL);
>  
>                 /*
> 

> Thanks,
> Thinh

Thank you,
Jakub


* Re: usb: dwc3: HC dies under high I/O load on Exynos5422
  2023-06-24 15:58   ` Jakub Vaněk
@ 2023-06-26 23:20     ` Thinh Nguyen
  0 siblings, 0 replies; 7+ messages in thread
From: Thinh Nguyen @ 2023-06-26 23:20 UTC (permalink / raw)
  To: Jakub Vaněk
  Cc: Thinh Nguyen, krzysztof.kozlowski@linaro.org,
	mauro.ribeiro@hardkernel.com, linux-usb@vger.kernel.org,
	linux-samsung-soc@vger.kernel.org

On Sat, Jun 24, 2023, Jakub Vaněk wrote:

<snip>

> > > 
> > > I can reproduce the problem on mainline 6.4-rc6 with multi_v7_defconfig
> > > (+ CONFIG_BTRFS=y for the rootfs). I've bisected the problem a while
> > > back and the first broken commit is b138e23d3dff ("usb: dwc3: core:
> > > Enable AutoRetry feature in the controller"). Reverting this commit
> > > locally makes my board stable again (FIO test above can run
> > > for >10 minutes without any issues).
> > 
> > This info helps a lot.
> > 
> > > 
> > > The crash is happening when the USB-SATA bridge is controlled by the
> > > uas driver. I have not tested the usb-storage driver yet.
> > > 
> > > What do you think would be an appropriate fix here? One idea I had is
> > > to add an Odroid-specific quirk to DWC3 to not enable AutoRetry here.
> > > However, I'm not entirely sure this is isolated to Odroid boards.
> > > 
> > > Please let me know if you need me to do some more experiments.
> > > 
> > 
> > This failure indicates that whichever device you're testing against
> > could not retry with burst (NumP != 0) after a CRC error. After a period
> > of time, the host timed out and attempted to restore its operations by
> > stopping the active transfers with a Stop-ep command. However, for some
> > reason, the host doesn't respond to this command. The crash you observed
> > is probably a separate issue. The main issue is why the host doesn't
> > receive a command completion event. If you're our direct customer, you
> > can submit a STAR request for our support. I'm not aware of this type of
> > failure related to AutoRetry. However, given how old this controller
> > version is (over a decade ago), I can't be sure.
> 
> Thank you, this explanation makes sense to me.
> 
> > I think if you try to test against a different device, you may not
> > observe this same failure.
> 
> I can partially confirm this. There is a USB3 to 1Gbit Ethernet bridge
> onboard too and this peripheral appears to work reliably. I am unable
> to test a different USB-to-SATA bridge though - there are no physical
> USB3 ports on Odroid HC2. It would be possible to verify this on Odroid
> XU4 which uses the same chip and does have physical USB ports. However,
> I don't have one at hand now.
> 
> > To resolve this, please ask our support team to investigate
> > further to see whether it's a setup issue. Otherwise, we can disable
> > this feature for dwc_usb3 v2.00a. Depending on how bad the CRC error
> > rate is (which should be low), this should not affect performance
> > much.
> 
> I unfortunately have no relationship with either Synopsys, Samsung or
> Hardkernel. Would you be OK with me submitting the proposed patch even
> without further investigation? Also, can I submit this for backporting
> to -stable?

Yes. Please summarize all the findings and explanation in the change
log.

Thanks,
Thinh

