The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling
@ 2026-07-01  4:03 Sergey Senozhatsky
  2026-07-01  5:38 ` Greg KH
  2026-07-01  8:28 ` [usb-storage] " Oliver Neukum
  0 siblings, 2 replies; 5+ messages in thread
From: Sergey Senozhatsky @ 2026-07-01  4:03 UTC (permalink / raw)
  To: Oliver Neukum, Alan Stern
  Cc: linux-usb, linux-scsi, usb-storage, linux-kernel, Tomasz Figa,
	Sergey Senozhatsky

When a UAS storage device experiences persistent wire or hardware IO
failures, commands time out and the SCSI error handler thread invokes
uas_eh_device_reset_handler().  If usb_reset_device() succeeds at the
USB hub level but the underlying drive remains unresponsive, the reset
handler returns SUCCESS. SCSI EH then requeues pending commands with
DID_RESET (ACTION_RETRY), causing them to time out again 30 seconds
later in an infinite loop.  This blocks block layer queues indefinitely:

[..]
 sd 0:0:0:0: [sda] tag#4 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
 sd 0:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
 sd 0:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
 sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
 scsi host0: uas_eh_device_reset_handler start
 usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
 scsi host0: uas_eh_device_reset_handler success
 sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
 sd 0:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
 scsi host0: uas_eh_device_reset_handler start
 sd 0:0:0:0: [sda] tag#1 uas_zap_pending 0 uas-tag 1 inflight: CMD
 sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
 sd 0:0:0:0: [sda] tag#2 uas_zap_pending 0 uas-tag 2 inflight: CMD
 sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
 usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
 scsi host0: uas_eh_device_reset_handler success
[..]

Introduce a runtime-configurable module parameter 'reset_limit' (default
3) and track consecutive resets in devinfo->reset_cnt.  When a productive
block layer command completes successfully (SUBMITTED_BY_BLOCK_LAYER),
reset the counter to zero.  If consecutive resets exceed reset_limit,
abort the loop by completing pending commands with DID_NO_CONNECT and
returning FAILED.  This allows SCSI EH to offline the unresponsive
device.

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 drivers/usb/storage/uas.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index 265162981269..a63c66c8bbad 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -32,6 +32,10 @@
 
 #define MAX_CMNDS 256
 
+static int uas_reset_limit = 3;
+module_param_named(reset_limit, uas_reset_limit, int, 0644);
+MODULE_PARM_DESC(reset_limit, "Maximum number of consecutive device resets during error handling before failing");
+
 struct uas_dev_info {
 	struct usb_interface *intf;
 	struct usb_device *udev;
@@ -40,6 +44,7 @@ struct uas_dev_info {
 	struct usb_anchor data_urbs;
 	u64 flags;
 	int qdepth, resetting;
+	int reset_cnt;
 	unsigned cmd_pipe, status_pipe, data_in_pipe, data_out_pipe;
 	unsigned use_streams:1;
 	unsigned shutdown:1;
@@ -255,6 +260,8 @@ static int uas_try_complete(struct scsi_cmnd *cmnd, const char *caller)
 		return -EBUSY;
 	devinfo->cmnd[cmdinfo->uas_tag - 1] = NULL;
 	uas_free_unsubmitted_urbs(cmnd);
+	if (cmnd->result == 0 && cmnd->submitter == SUBMITTED_BY_BLOCK_LAYER)
+		devinfo->reset_cnt = 0;
 	scsi_done(cmnd);
 	return 0;
 }
@@ -796,6 +803,21 @@ static int uas_eh_host_reset_handler(struct scsi_cmnd *cmnd)
 	usb_kill_anchored_urbs(&devinfo->cmd_urbs);
 	usb_kill_anchored_urbs(&devinfo->sense_urbs);
 	usb_kill_anchored_urbs(&devinfo->data_urbs);
+
+	spin_lock_irqsave(&devinfo->lock, flags);
+	if (uas_reset_limit > 0 && devinfo->reset_cnt >= uas_reset_limit) {
+		devinfo->resetting = 0;
+		spin_unlock_irqrestore(&devinfo->lock, flags);
+		uas_zap_pending(devinfo, DID_NO_CONNECT);
+		usb_unlock_device(udev);
+		shost_printk(KERN_ERR, sdev->host,
+			     "%s FAILED reset limit %d exceeded\n",
+			     __func__, uas_reset_limit);
+		return FAILED;
+	}
+	devinfo->reset_cnt++;
+	spin_unlock_irqrestore(&devinfo->lock, flags);
+
 	uas_zap_pending(devinfo, DID_RESET);
 
 	err = usb_reset_device(udev);
-- 
2.55.0.795.g602f6c329a-goog


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling
  2026-07-01  4:03 [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling Sergey Senozhatsky
@ 2026-07-01  5:38 ` Greg KH
  2026-07-01  5:57   ` Sergey Senozhatsky
  2026-07-01  8:28 ` [usb-storage] " Oliver Neukum
  1 sibling, 1 reply; 5+ messages in thread
From: Greg KH @ 2026-07-01  5:38 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Oliver Neukum, Alan Stern, linux-usb, linux-scsi, usb-storage,
	linux-kernel, Tomasz Figa

On Wed, Jul 01, 2026 at 01:03:21PM +0900, Sergey Senozhatsky wrote:
> When a UAS storage device experiences persistent wire or hardware IO
> failures, commands time out and the SCSI error handler thread invokes
> uas_eh_device_reset_handler().  If usb_reset_device() succeeds at the
> USB hub level but the underlying drive remains unresponsive, the reset
> handler returns SUCCESS. SCSI EH then requeues pending commands with
> DID_RESET (ACTION_RETRY), causing them to time out again 30 seconds
> later in an infinite loop.  This blocks block layer queues indefinitely:
> 
> [..]
>  sd 0:0:0:0: [sda] tag#4 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
>  sd 0:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
>  sd 0:0:0:0: [sda] tag#0 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
>  sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
>  scsi host0: uas_eh_device_reset_handler start
>  usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
>  scsi host0: uas_eh_device_reset_handler success
>  sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
>  sd 0:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
>  scsi host0: uas_eh_device_reset_handler start
>  sd 0:0:0:0: [sda] tag#1 uas_zap_pending 0 uas-tag 1 inflight: CMD
>  sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
>  sd 0:0:0:0: [sda] tag#2 uas_zap_pending 0 uas-tag 2 inflight: CMD
>  sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
>  usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
>  scsi host0: uas_eh_device_reset_handler success
> [..]
> 
> Introduce a runtime-configurable module parameter 'reset_limit' (default
> 3) and track consecutive resets in devinfo->reset_cnt.  When a productive
> block layer command completes successfully (SUBMITTED_BY_BLOCK_LAYER),
> reset the counter to zero.  If consecutive resets exceed reset_limit,
> abort the loop by completing pending commands with DID_NO_CONNECT and
> returning FAILED.  This allows SCSI EH to offline the unresponsive
> device.
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  drivers/usb/storage/uas.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
> index 265162981269..a63c66c8bbad 100644
> --- a/drivers/usb/storage/uas.c
> +++ b/drivers/usb/storage/uas.c
> @@ -32,6 +32,10 @@
>  
>  #define MAX_CMNDS 256
>  
> +static int uas_reset_limit = 3;
> +module_param_named(reset_limit, uas_reset_limit, int, 0644);
> +MODULE_PARM_DESC(reset_limit, "Maximum number of consecutive device resets during error handling before failing");

This is not the 1990's, we do not add module parameters for issues that
should be properly solved either automatically, or on a per-device
basis.

There's no way that ChromeOs wants to attempt to track this module
parameter as a bootline config option, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling
  2026-07-01  5:38 ` Greg KH
@ 2026-07-01  5:57   ` Sergey Senozhatsky
  2026-07-01  6:01     ` Sergey Senozhatsky
  0 siblings, 1 reply; 5+ messages in thread
From: Sergey Senozhatsky @ 2026-07-01  5:57 UTC (permalink / raw)
  To: Greg KH
  Cc: Sergey Senozhatsky, Oliver Neukum, Alan Stern, linux-usb,
	linux-scsi, usb-storage, linux-kernel, Tomasz Figa

On (26/07/01 07:38), Greg KH wrote:
> > +static int uas_reset_limit = 3;

This obviously wanted to be 0 by default (just a side note).

> > +module_param_named(reset_limit, uas_reset_limit, int, 0644);
> > +MODULE_PARM_DESC(reset_limit, "Maximum number of consecutive device resets during error handling before failing");
> 
> This is not the 1990's, we do not add module parameters for issues that
> should be properly solved either automatically, or on a per-device
> basis.
> 
> There's no way that ChromeOs wants to attempt to track this module
> parameter as a bootline config option, right?

Can you please elaborate on "properly solved either automatically,
or on a per-device basis".  I don't know how to break that endless
reset loop otherwise.  I'm open to any suggestions, the patch is RFC
for a reason.

That reset loop essentially panics the system (we have hung-task watchdog
and hung-task panic enabled).  The kernel does uas_eh_device_reset_handler()
every 30 seconds (which succeeds), while the block layer doesn't make progress
and keeps the tasks blocked, which eventually triggers the watchdog:

[..]
<6>[ 917.070222] sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
<6>[ 917.070318] sd 0:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
<6>[ 917.079434] scsi host0: uas_eh_device_reset_handler start
<6>[ 917.079994] sd 0:0:0:0: [sda] tag#1 uas_zap_pending 0 uas-tag 1 inflight: CMD
<6>[ 917.080046] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
<6>[ 917.080072] sd 0:0:0:0: [sda] tag#2 uas_zap_pending 0 uas-tag 2 inflight: CMD
<6>[ 917.080118] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
<6>[ 917.145714] usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
<6>[ 917.164988] scsi host0: uas_eh_device_reset_handler success
<6>[ 917.165155] sd 0:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=30s
<6>[ 917.165210] sd 0:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 00 00 00 00 20 00
<3>[ 917.165239] I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 3
<4>[ 934.971114] rq: tag=0 hctx=0 op=WRITE sector=13867016 len=524288 age=78231 ms pid=11900 comm=image_burner state=D
<4>[ 934.971169] rq: tag=1 hctx=0 op=WRITE sector=13868040 len=524288 age=78230 ms pid=11900 comm=image_burner state=D
<4>[ 934.971217] blk: queue stall on sda: 3 inflight, 2 stalled (threshold 60000 ms)
<6>[ 947.276222] sd 0:0:0:0: [sda] tag#3 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
<6>[ 947.276282] sd 0:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
<6>[ 947.280489] scsi host0: uas_eh_device_reset_handler start
<6>[ 947.281255] sd 0:0:0:0: [sda] tag#0 uas_zap_pending 0 uas-tag 1 inflight: CMD
<6>[ 947.281309] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
<6>[ 947.281391] sd 0:0:0:0: [sda] tag#1 uas_zap_pending 0 uas-tag 2 inflight: CMD
<6>[ 947.281468] sd 0:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
<6>[ 947.346666] usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
<6>[ 947.365031] scsi host0: uas_eh_device_reset_handler success
<6>[ 977.479406] scsi host0: uas_eh_device_reset_handler start
<6>[ 977.479963] sd 0:0:0:0: [sda] tag#2 uas_zap_pending 0 uas-tag 1 inflight: CMD
<6>[ 977.480015] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 00 d3 98 08 00 04 00 00
<6>[ 977.480042] sd 0:0:0:0: [sda] tag#3 uas_zap_pending 0 uas-tag 2 inflight: CMD
<6>[ 977.480089] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 00 d3 9c 08 00 04 00 00
<6>[ 977.480137] sd 0:0:0:0: [sda] tag#4 uas_zap_pending 0 uas-tag 3 inflight: CMD
<6>[ 977.480166] sd 0:0:0:0: [sda] tag#4 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
<6>[ 977.545653] usb 2-1.3: reset SuperSpeed Plus Gen 2x1 USB device number 4 using xhci_hcd
<6>[ 977.565338] scsi host0: uas_eh_device_reset_handler success
<3>[ 984.123394] INFO: task image_burner:11900 blocked for more than 122 seconds.
<3>[ 984.123501] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[ 984.123517] task:image_burner state:D stack:0 pid:11900 ppid:11893 flags:0x00004002
<6>[ 984.123577] Call Trace:
<6>[ 984.123603] <TASK>
<6>[ 984.123659] __schedule+0x5e8/0x14c0
<6>[ 984.123688] ? free_unref_page_list+0x25d/0x290
<6>[ 984.123715] ? lru_gen_update_size+0x1e3/0x220
<6>[ 984.123763] schedule+0x5e/0xa0
<6>[ 984.123788] io_schedule+0x47/0x70
<6>[ 984.123814] folio_wait_bit_common+0x18a/0x270
<6>[ 984.123863] ? __pfx_wake_page_function+0x10/0x10
<6>[ 984.123889] folio_wait_writeback+0x35/0x90
<6>[ 984.123937] __filemap_fdatawait_range+0x162/0x190
<6>[ 984.123965] file_write_and_wait_range+0xa3/0x120
<6>[ 984.124013] blkdev_fsync+0x39/0x60
<6>[ 984.124040] __x64_sys_fdatasync+0x4c/0x90
<6>[ 984.124067] do_syscall_64+0x6b/0xa0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling
  2026-07-01  5:57   ` Sergey Senozhatsky
@ 2026-07-01  6:01     ` Sergey Senozhatsky
  0 siblings, 0 replies; 5+ messages in thread
From: Sergey Senozhatsky @ 2026-07-01  6:01 UTC (permalink / raw)
  To: Greg KH
  Cc: Oliver Neukum, Alan Stern, linux-usb, linux-scsi, usb-storage,
	linux-kernel, Tomasz Figa, Sergey Senozhatsky

On (26/07/01 14:57), Sergey Senozhatsky wrote:
> On (26/07/01 07:38), Greg KH wrote:
> > > +static int uas_reset_limit = 3;
> 
> This obviously wanted to be 0 by default (just a side note).
> 
> > > +module_param_named(reset_limit, uas_reset_limit, int, 0644);
> > > +MODULE_PARM_DESC(reset_limit, "Maximum number of consecutive device resets during error handling before failing");
> > 
> > This is not the 1990's, we do not add module parameters for issues that
> > should be properly solved either automatically, or on a per-device
> > basis.
> > 
> > There's no way that ChromeOs wants to attempt to track this module
> > parameter as a bootline config option, right?
> 
> Can you please elaborate on "properly solved either automatically,
> or on a per-device basis".  I don't know how to break that endless
> reset loop otherwise.  I'm open to any suggestions, the patch is RFC
> for a reason.

I can imagine uas_reset_limit being auto-calculated based on SCSI
timeout (30 seconds) and HUNG_TASK_TIMEOUT (if set).  Will that
work?  I don't know if all those timeouts can be clearly exposed
to the UAS driver (or should they be in the first place).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [usb-storage] [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling
  2026-07-01  4:03 [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling Sergey Senozhatsky
  2026-07-01  5:38 ` Greg KH
@ 2026-07-01  8:28 ` Oliver Neukum
  1 sibling, 0 replies; 5+ messages in thread
From: Oliver Neukum @ 2026-07-01  8:28 UTC (permalink / raw)
  To: Sergey Senozhatsky, Oliver Neukum, Alan Stern
  Cc: linux-usb, linux-scsi, usb-storage, linux-kernel, Tomasz Figa

On 01.07.26 06:03, Sergey Senozhatsky wrote:
> When a UAS storage device experiences persistent wire or hardware IO
> failures, commands time out and the SCSI error handler thread invokes
> uas_eh_device_reset_handler().  If usb_reset_device() succeeds at the
> USB hub level but the underlying drive remains unresponsive, the reset

What exactly do you mean by unresponsive? Usbcore must at least
reassign the configuration (and the device address).

> handler returns SUCCESS. SCSI EH then requeues pending commands with
> DID_RESET (ACTION_RETRY), causing them to time out again 30 seconds
> later in an infinite loop.  This blocks block layer queues indefinitely:

Arguably this is a SCSI issue, not a UAS issue, but anyway.
[..]

> Introduce a runtime-configurable module parameter 'reset_limit' (default
> 3) and track consecutive resets in devinfo->reset_cnt.  When a productive
> block layer command completes successfully (SUBMITTED_BY_BLOCK_LAYER),
> reset the counter to zero.  If consecutive resets exceed reset_limit,
> abort the loop by completing pending commands with DID_NO_CONNECT and
> returning FAILED.  This allows SCSI EH to offline the unresponsive
> device.

Let us take a step back. What is the issue here? The device goes
into error handling. That is not a problem as such. A method
designed to remedy an error condition has not been effective but seems
to succeed.
That must not happen. So what do we do? It seems to me like we
ought to add a test for the effectiveness of the reset.
At first glance it looks like UAS should do a TEST UNIT READY
on its own after a reset.
Or are we looking at a command that reliably crashes the device and
is reissued by an upper layer? In that case either we need
a quirk or the SCSI layer ought to deduce that it is using commands
it shouldn't use.

Can we have more information about the scenario that triggered
the desire for this patch?

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-07-01  8:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-01  4:03 [RFC PATCH] usb: storage: uas: limit consecutive device resets in error handling Sergey Senozhatsky
2026-07-01  5:38 ` Greg KH
2026-07-01  5:57   ` Sergey Senozhatsky
2026-07-01  6:01     ` Sergey Senozhatsky
2026-07-01  8:28 ` [usb-storage] " Oliver Neukum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox