public inbox for linux-ide@vger.kernel.org
* [PATCH v2 0/2] ATA port deferred qc fixes
@ 2026-02-20 22:14 Damien Le Moal
  2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Damien Le Moal @ 2026-02-20 22:14 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel

The first patch addresses a use-after-free issue when a deferred qc
times out. The second patch avoids a call to a potentially sleeping
function while a port spinlock is held.

Changes from v1:
 - Corrected typo in patch 1 message, improved comment in code and added
   a WARN_ON_ONCE() call to verify that a timed out qc is not active.
 - Fixed patch 2 to not call ata_scsi_requeue_deferred_qc() without the
   port lock held. This call is in fact removed: it is not needed as
   ata_scsi_requeue_deferred_qc() is called in EH, which is always run
   when removing a port.

Damien Le Moal (2):
  ata: libata-eh: correctly handle deferred qc timeouts
  ata: libata-core: fix cancellation of a port deferred qc work

 drivers/ata/libata-core.c |  8 +++-----
 drivers/ata/libata-eh.c   | 22 +++++++++++++++++++---
 2 files changed, 22 insertions(+), 8 deletions(-)

-- 
2.53.0



* [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-02-20 22:14 [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
@ 2026-02-20 22:14 ` Damien Le Moal
  2026-02-23 12:09   ` Hannes Reinecke
                     ` (2 more replies)
  2026-02-20 22:14 ` [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work Damien Le Moal
  2026-02-24  0:39 ` [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
  2 siblings, 3 replies; 21+ messages in thread
From: Damien Le Moal @ 2026-02-20 22:14 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel

A deferred qc may timeout while waiting for the device queue to drain
to be submitted. In such case, since the qc is not active,
ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
which frees the qc. But as the port deferred_qc field still references
this finished/freed qc, the deferred qc work may eventually attempt to
call ata_qc_issue() against this invalid qc, leading to errors such as
reported by UBSAN (syzbot run):

UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
...
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
 __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
 ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
 ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
 process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
 process_scheduled_works kernel/workqueue.c:3358 [inline]
 worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
 kthread+0x370/0x450 kernel/kthread.c:467
 ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

Fix this by checking if the qc of a timed out SCSI command is a deferred
one, and in such case, clear the port deferred_qc field and finish the
SCSI command with DID_TIME_OUT.

Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 72a22b6c9682..b373cceb95d2 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
 		set_host_byte(scmd, DID_OK);
 
 		ata_qc_for_each_raw(ap, qc, i) {
-			if (qc->flags & ATA_QCFLAG_ACTIVE &&
-			    qc->scsicmd == scmd)
+			if (qc->scsicmd != scmd)
+				continue;
+			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
+			    qc == ap->deferred_qc)
 				break;
 		}
 
-		if (i < ATA_MAX_QUEUE) {
+		if (qc == ap->deferred_qc) {
+			/*
+			 * This is a deferred command that timed out while
+			 * waiting for the command queue to drain. Since the qc
+			 * is not active yet (deferred_qc is still set, so the
+			 * deferred qc work has not issued the command yet),
+			 * simply signal the timeout by finishing the SCSI
+			 * command and clear the deferred qc to prevent the
+			 * deferred qc work from issuing this qc.
+			 */
+			WARN_ON_ONCE(qc->flags & ATA_QCFLAG_ACTIVE);
+			ap->deferred_qc = NULL;
+			set_host_byte(scmd, DID_TIME_OUT);
+			scsi_eh_finish_cmd(scmd, &ap->eh_done_q);
+		} else if (i < ATA_MAX_QUEUE) {
 			/* the scmd has an associated qc */
 			if (!(qc->flags & ATA_QCFLAG_EH)) {
 				/* which hasn't failed yet, timeout */
-- 
2.53.0
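
The stale reference described in the commit message can be modeled outside the kernel. The following is a minimal user-space sketch, not libata code: `struct port`, `struct qc`, `struct cmd`, `finish_if_deferred()` and `deferred_work()` are hypothetical stand-ins for `struct ata_port`, `struct ata_queued_cmd`, `struct scsi_cmnd`, the new branch in ata_scsi_cmd_error_handler() and the deferred qc work. It only illustrates why the timeout path must clear the deferred reference before the command is finished.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for DID_OK / DID_TIME_OUT. */
enum { HOST_OK = 0, HOST_TIME_OUT = 1 };

struct cmd { int host_byte; };                 /* models struct scsi_cmnd */
struct qc { struct cmd *scmd; int issued; };   /* models a queued command */
struct port { struct qc *deferred_qc; };       /* models the port field */

/*
 * Timeout path: if the timed out command belongs to the deferred qc,
 * drop the port reference first, then report the timeout.
 * Returns 1 if the command was handled here, 0 otherwise.
 */
static int finish_if_deferred(struct port *ap, struct cmd *scmd)
{
	struct qc *qc = ap->deferred_qc;

	if (!qc || qc->scmd != scmd)
		return 0;

	ap->deferred_qc = NULL;            /* the work must not see this qc */
	scmd->host_byte = HOST_TIME_OUT;
	return 1;
}

/*
 * Deferred work: issue the parked qc, if there still is one.
 * Returns 1 if a qc was issued, 0 if nothing was deferred.
 */
static int deferred_work(struct port *ap)
{
	struct qc *qc = ap->deferred_qc;

	if (!qc)
		return 0;

	assert(!qc->issued);               /* a deferred qc is never active */
	ap->deferred_qc = NULL;
	qc->issued = 1;
	return 1;
}
```

In this model, calling deferred_work() after finish_if_deferred() finds no deferred qc and issues nothing, which is the behavior the patch aims for.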



* [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work
  2026-02-20 22:14 [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
  2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
@ 2026-02-20 22:14 ` Damien Le Moal
  2026-02-23 12:09   ` Hannes Reinecke
  2026-02-23 17:49   ` Igor Pylypiv
  2026-02-24  0:39 ` [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
  2 siblings, 2 replies; 21+ messages in thread
From: Damien Le Moal @ 2026-02-20 22:14 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel

cancel_work_sync() is a sleeping function so it cannot be called with
the spin lock of a port being held. Move the call to this function in
ata_port_detach() after EH completes, with the port lock released,
together with other work cancellation calls.

Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/ata/libata-core.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index beb6984b379a..d470b7bc92c7 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6269,10 +6269,6 @@ static void ata_port_detach(struct ata_port *ap)
 		}
 	}
 
-	/* Make sure the deferred qc work finished. */
-	cancel_work_sync(&ap->deferred_qc_work);
-	WARN_ON(ap->deferred_qc);
-
 	/* Tell EH to disable all devices */
 	ap->pflags |= ATA_PFLAG_UNLOADING;
 	ata_port_schedule_eh(ap);
@@ -6283,9 +6279,11 @@ static void ata_port_detach(struct ata_port *ap)
 	/* wait till EH commits suicide */
 	ata_port_wait_eh(ap);
 
-	/* it better be dead now */
+	/* It better be dead now and not have any remaining deferred qc. */
 	WARN_ON(!(ap->pflags & ATA_PFLAG_UNLOADED));
+	WARN_ON(ap->deferred_qc);
 
+	cancel_work_sync(&ap->deferred_qc_work);
 	cancel_delayed_work_sync(&ap->hotplug_task);
 	cancel_delayed_work_sync(&ap->scsi_rescan_task);
 
-- 
2.53.0
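
The ordering constraint can be illustrated with a user-space analogy. This is a rough sketch under stated assumptions: a pthread mutex stands in for the port spinlock, a thread stands in for the deferred qc work, and pthread_join() stands in for cancel_work_sync(). The names `struct port`, `port_start()` and `port_detach()` are hypothetical. In the kernel the problem is sleeping in atomic context; in this model the equivalent mistake (joining with the lock held) would deadlock, because the worker needs the same lock to notice that it should stop.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

struct port {
	pthread_mutex_t lock;   /* models the port lock */
	pthread_t worker;       /* models the deferred qc work */
	int unloading;          /* models ATA_PFLAG_UNLOADING */
	int work_runs;          /* number of worker iterations */
};

/* Worker: takes the lock on every pass and exits once unloading is set. */
static void *deferred_worker(void *arg)
{
	struct port *ap = arg;
	int stop;

	do {
		pthread_mutex_lock(&ap->lock);
		ap->work_runs++;
		stop = ap->unloading;
		pthread_mutex_unlock(&ap->lock);
		sched_yield();
	} while (!stop);

	return NULL;
}

static int port_start(struct port *ap)
{
	ap->unloading = 0;
	ap->work_runs = 0;
	if (pthread_mutex_init(&ap->lock, NULL))
		return -1;
	return pthread_create(&ap->worker, NULL, deferred_worker, ap) ? -1 : 0;
}

static int port_detach(struct port *ap)
{
	/* Flag the teardown under the lock... */
	pthread_mutex_lock(&ap->lock);
	ap->unloading = 1;
	pthread_mutex_unlock(&ap->lock);

	/* ...and only block on the worker once the lock is released. */
	if (pthread_join(ap->worker, NULL))
		return -1;

	pthread_mutex_destroy(&ap->lock);
	return 0;
}
```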



* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
@ 2026-02-23 12:09   ` Hannes Reinecke
  2026-02-23 17:48   ` Igor Pylypiv
  2026-03-05 17:59   ` Guenter Roeck
  2 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2026-02-23 12:09 UTC (permalink / raw)
  To: Damien Le Moal, linux-ide, Niklas Cassel

On 2/20/26 23:14, Damien Le Moal wrote:
> A deferred qc may timeout while waiting for the device queue to drain
> to be submitted. In such case, since the qc is not active,
> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
> which frees the qc. But as the port deferred_qc field still references
> this finished/freed qc, the deferred qc work may eventually attempt to
> call ata_qc_issue() against this invalid qc, leading to errors such as
> reported by UBSAN (syzbot run):
> 
> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> ...
> Call Trace:
>   <TASK>
>   __dump_stack lib/dump_stack.c:94 [inline]
>   dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>   ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>   __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>   ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>   ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>   process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>   process_scheduled_works kernel/workqueue.c:3358 [inline]
>   worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>   kthread+0x370/0x450 kernel/kthread.c:467
>   ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>   </TASK>
> 
> Fix this by checking if the qc of a timed out SCSI command is a deferred
> one, and in such case, clear the port deferred_qc field and finish the
> SCSI command with DID_TIME_OUT.
> 
> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>   1 file changed, 19 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work
  2026-02-20 22:14 ` [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work Damien Le Moal
@ 2026-02-23 12:09   ` Hannes Reinecke
  2026-02-23 17:49   ` Igor Pylypiv
  1 sibling, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2026-02-23 12:09 UTC (permalink / raw)
  To: Damien Le Moal, linux-ide, Niklas Cassel

On 2/20/26 23:14, Damien Le Moal wrote:
> cancel_work_sync() is a sleeping function so it cannot be called with
> the spin lock of a port being held. Move the call to this function in
> ata_port_detach() after EH completes, with the port lock released,
> together with other work cancellation calls.
> 
> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/ata/libata-core.c | 8 +++-----
>   1 file changed, 3 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
  2026-02-23 12:09   ` Hannes Reinecke
@ 2026-02-23 17:48   ` Igor Pylypiv
  2026-03-05 17:59   ` Guenter Roeck
  2 siblings, 0 replies; 21+ messages in thread
From: Igor Pylypiv @ 2026-02-23 17:48 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide, Niklas Cassel

On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
> A deferred qc may timeout while waiting for the device queue to drain
> to be submitted. In such case, since the qc is not active,
> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
> which frees the qc. But as the port deferred_qc field still references
> this finished/freed qc, the deferred qc work may eventually attempt to
> call ata_qc_issue() against this invalid qc, leading to errors such as
> reported by UBSAN (syzbot run):
> 
> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> ...
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:94 [inline]
>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>  process_scheduled_works kernel/workqueue.c:3358 [inline]
>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>  kthread+0x370/0x450 kernel/kthread.c:467
>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>  </TASK>
> 
> Fix this by checking if the qc of a timed out SCSI command is a deferred
> one, and in such case, clear the port deferred_qc field and finish the
> SCSI command with DID_TIME_OUT.
> 
> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

Reviewed-by: Igor Pylypiv <ipylypiv@google.com>

Thanks,
Igor


* Re: [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work
  2026-02-20 22:14 ` [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work Damien Le Moal
  2026-02-23 12:09   ` Hannes Reinecke
@ 2026-02-23 17:49   ` Igor Pylypiv
  1 sibling, 0 replies; 21+ messages in thread
From: Igor Pylypiv @ 2026-02-23 17:49 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide, Niklas Cassel

On Sat, Feb 21, 2026 at 07:14:39AM +0900, Damien Le Moal wrote:
> cancel_work_sync() is a sleeping function so it cannot be called with
> the spin lock of a port being held. Move the call to this function in
> ata_port_detach() after EH completes, with the port lock released,
> together with other work cancellation calls.
> 
> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

Reviewed-by: Igor Pylypiv <ipylypiv@google.com>

Thanks,
Igor


* Re: [PATCH v2 0/2] ATA port deferred qc fixes
  2026-02-20 22:14 [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
  2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
  2026-02-20 22:14 ` [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work Damien Le Moal
@ 2026-02-24  0:39 ` Damien Le Moal
  2 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2026-02-24  0:39 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel

On 2/21/26 7:14 AM, Damien Le Moal wrote:
> The first patch addresses a use-after-free issue when a deferred qc
> times out. The second patch avoids a call to a potentially sleeping
> function while a port spinlock is held.
> 
> Changes from v1:
>  - Corrected typo in patch 1 message, improved comment in code and added
>    a WARN_ON_ONCE() call to verify that a timed out qc is not active.
>  - Fixed patch 2 to not call ata_scsi_requeue_deferred_qc() without the
>    port lock held. This call is in fact removed: it is not needed as
>    ata_scsi_requeue_deferred_qc() is called in EH, which is always run
>    when removing a port.
> 
> Damien Le Moal (2):
>   ata: libata-eh: correctly handle deferred qc timeouts
>   ata: libata-core: fix cancellation of a port deferred qc work
> 
>  drivers/ata/libata-core.c |  8 +++-----
>  drivers/ata/libata-eh.c   | 22 +++++++++++++++++++---
>  2 files changed, 22 insertions(+), 8 deletions(-)

Applied to for-7.0-fixes.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
  2026-02-23 12:09   ` Hannes Reinecke
  2026-02-23 17:48   ` Igor Pylypiv
@ 2026-03-05 17:59   ` Guenter Roeck
  2026-03-05 23:27     ` Niklas Cassel
  2026-03-05 23:59     ` Damien Le Moal
  2 siblings, 2 replies; 21+ messages in thread
From: Guenter Roeck @ 2026-03-05 17:59 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide, Niklas Cassel

Hi,

On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
> A deferred qc may timeout while waiting for the device queue to drain
> to be submitted. In such case, since the qc is not active,
> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
> which frees the qc. But as the port deferred_qc field still references
> this finished/freed qc, the deferred qc work may eventually attempt to
> call ata_qc_issue() against this invalid qc, leading to errors such as
> reported by UBSAN (syzbot run):
> 
> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> ...
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:94 [inline]
>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>  process_scheduled_works kernel/workqueue.c:3358 [inline]
>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>  kthread+0x370/0x450 kernel/kthread.c:467
>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>  </TASK>
> 
> Fix this by checking if the qc of a timed out SCSI command is a deferred
> one, and in such case, clear the port deferred_qc field and finish the
> SCSI command with DID_TIME_OUT.
> 
> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
> ---
>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 72a22b6c9682..b373cceb95d2 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>  		set_host_byte(scmd, DID_OK);
>  
>  		ata_qc_for_each_raw(ap, qc, i) {
> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
> -			    qc->scsicmd == scmd)
> +			if (qc->scsicmd != scmd)
> +				continue;
> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
> +			    qc == ap->deferred_qc)
>  				break;
>  		}
>  
> -		if (i < ATA_MAX_QUEUE) {
> +		if (qc == ap->deferred_qc) {

An experimental AI code review agent tagged this patch with the following
comment.

  If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
  `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
  If this last element happens to be `ap->deferred_qc`, the condition
  `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
  match.

  Could this mistakenly intercept a command that completed normally after a SCSI
  timeout, returning a timeout error instead of success? Would this also
  incorrectly clear `ap->deferred_qc`, dropping the deferred command?
  Should we verify that the loop actually found a match, for instance by checking
  `if (i < ATA_MAX_QUEUE && qc == ap->deferred_qc)`?

It does seem to be a real problem to me, but I don't know the code well
enough to be sure. Please take a look and let me know if the problem is
real. If so, I'll be happy to submit a patch to fix it. If not, please let
me know what the agent is missing.

Thanks,
Guenter


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-05 17:59   ` Guenter Roeck
@ 2026-03-05 23:27     ` Niklas Cassel
  2026-03-06  0:11       ` Damien Le Moal
  2026-03-06  0:14       ` Guenter Roeck
  2026-03-05 23:59     ` Damien Le Moal
  1 sibling, 2 replies; 21+ messages in thread
From: Niklas Cassel @ 2026-03-05 23:27 UTC (permalink / raw)
  To: Guenter Roeck, Damien Le Moal; +Cc: linux-ide

On 5 March 2026 18:59:08 CET, Guenter Roeck <linux@roeck-us.net> wrote:
>Hi,
>
>On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
>> A deferred qc may timeout while waiting for the device queue to drain
>> to be submitted. In such case, since the qc is not active,
>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
>> which frees the qc. But as the port deferred_qc field still references
>> this finished/freed qc, the deferred qc work may eventually attempt to
>> call ata_qc_issue() against this invalid qc, leading to errors such as
>> reported by UBSAN (syzbot run):
>> 
>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
>> ...
>> Call Trace:
>>  <TASK>
>>  __dump_stack lib/dump_stack.c:94 [inline]
>>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>>  process_scheduled_works kernel/workqueue.c:3358 [inline]
>>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>>  kthread+0x370/0x450 kernel/kthread.c:467
>>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>>  </TASK>
>> 
>> Fix this by checking if the qc of a timed out SCSI command is a deferred
>> one, and in such case, clear the port deferred_qc field and finish the
>> SCSI command with DID_TIME_OUT.
>> 
>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
>> ---
>>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>>  1 file changed, 19 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>> index 72a22b6c9682..b373cceb95d2 100644
>> --- a/drivers/ata/libata-eh.c
>> +++ b/drivers/ata/libata-eh.c
>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>>  		set_host_byte(scmd, DID_OK);
>>  
>>  		ata_qc_for_each_raw(ap, qc, i) {
>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
>> -			    qc->scsicmd == scmd)
>> +			if (qc->scsicmd != scmd)
>> +				continue;
>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
>> +			    qc == ap->deferred_qc)
>>  				break;
>>  		}
>>  
>> -		if (i < ATA_MAX_QUEUE) {
>> +		if (qc == ap->deferred_qc) {
>
>An experimental AI code review agent tagged this patch with the following
>comment.
>
>  If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
>  `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).

I think the AI is wrong here.

The last element assigned to qc will be the one at index ATA_MAX_QUEUE - 1.


>  If this last element happens to be `ap->deferred_qc`, the condition
>  `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
>  match.
>
>  Could this mistakenly intercept a command that completed normally after a SCSI
>  timeout, returning a timeout error instead of success? Would this also
>  incorrectly clear `ap->deferred_qc`, dropping the deferred command?

I think the AI is partially wrong here.

If you read the comment below the if (), we know that ap->deferred_qc is only set until that command has been issued. So if it is set, that qc has not been issued, so it can't have successfully completed.

But... Since we don't verify that i < ATA_MAX_QUEUE, we might end up completing the deferred QC as a failed command, even though it did not time out...

On NCQ error, we complete the deferred QC as a failed command.

However, if there was a timeout of a command, which was not the deferred QC, but the deferred QC did not time out, I think it is wrong to complete the deferred QC as a failed command.

So... I actually think that the change suggested by the AI is something we want.
(Especially after Damien's commit queued in for-next where we will not invoke error_handler() if there were no timed out commands.)


Kind regards,
Niklas
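
The loop post-condition discussed above can be checked with a small user-space model. This is only a sketch of the C semantics, not libata code: `MAX_QUEUE`, `struct qc`, `struct port`, `scan()` and the two predicate helpers are hypothetical stand-ins, and whether the last slot can actually hold the deferred qc in libata is exactly the point debated in this thread. The model shows that, when no slot matches the timed out command, the index ends at MAX_QUEUE while the qc pointer still refers to slot MAX_QUEUE - 1, so an unguarded pointer comparison can report a match that the loop never found.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUE 4   /* small stand-in for ATA_MAX_QUEUE */

struct qc { const void *scmd; unsigned int active; };
struct port { struct qc slots[MAX_QUEUE]; struct qc *deferred_qc; };

/*
 * Scan the slots the way the patched loop does. The final loop index is
 * stored in *idx and the last qc examined is returned, whether or not
 * the loop stopped on a match.
 */
static struct qc *scan(struct port *ap, const void *scmd, int *idx)
{
	struct qc *qc = NULL;
	int i;

	for (i = 0; i < MAX_QUEUE; i++) {
		qc = &ap->slots[i];
		if (qc->scmd != scmd)
			continue;
		if (qc->active || qc == ap->deferred_qc)
			break;
	}

	*idx = i;
	return qc;
}

/* Check as in the posted patch: pointer comparison only. */
static int is_deferred_timeout_unguarded(struct port *ap, const void *scmd)
{
	int i;
	struct qc *qc = scan(ap, scmd, &i);

	return qc == ap->deferred_qc;
}

/* Check as suggested in the review: also require that the loop matched. */
static int is_deferred_timeout_guarded(struct port *ap, const void *scmd)
{
	int i;
	struct qc *qc = scan(ap, scmd, &i);

	return i < MAX_QUEUE && qc == ap->deferred_qc;
}
```

With the deferred qc parked in the last slot and a timed out command that matches no slot, the unguarded helper returns 1 while the guarded one returns 0. Both return 1 when the timed out command really is the deferred one.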


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-05 17:59   ` Guenter Roeck
  2026-03-05 23:27     ` Niklas Cassel
@ 2026-03-05 23:59     ` Damien Le Moal
  2026-03-06  0:32       ` Guenter Roeck
  1 sibling, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2026-03-05 23:59 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: linux-ide, Niklas Cassel

On 3/6/26 02:59, Guenter Roeck wrote:
> Hi,
> 
> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
>> A deferred qc may timeout while waiting for the device queue to drain
>> to be submitted. In such case, since the qc is not active,
>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
>> which frees the qc. But as the port deferred_qc field still references
>> this finished/freed qc, the deferred qc work may eventually attempt to
>> call ata_qc_issue() against this invalid qc, leading to errors such as
>> reported by UBSAN (syzbot run):
>>
>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
>> ...
>> Call Trace:
>>  <TASK>
>>  __dump_stack lib/dump_stack.c:94 [inline]
>>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>>  process_scheduled_works kernel/workqueue.c:3358 [inline]
>>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>>  kthread+0x370/0x450 kernel/kthread.c:467
>>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>>  </TASK>
>>
>> Fix this by checking if the qc of a timed out SCSI command is a deferred
>> one, and in such case, clear the port deferred_qc field and finish the
>> SCSI command with DID_TIME_OUT.
>>
>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
>> ---
>>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>>  1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>> index 72a22b6c9682..b373cceb95d2 100644
>> --- a/drivers/ata/libata-eh.c
>> +++ b/drivers/ata/libata-eh.c
>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>>  		set_host_byte(scmd, DID_OK);
>>  
>>  		ata_qc_for_each_raw(ap, qc, i) {
>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
>> -			    qc->scsicmd == scmd)
>> +			if (qc->scsicmd != scmd)
>> +				continue;
>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
>> +			    qc == ap->deferred_qc)
>>  				break;
>>  		}
>>  
>> -		if (i < ATA_MAX_QUEUE) {
>> +		if (qc == ap->deferred_qc) {
> 
> An experimental AI code review agent tagged this patch with the following
> comment.

Thanks for that.

>   If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
>   `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
>   If this last element happens to be `ap->deferred_qc`, the condition
>   `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
>   match.
> 
>   Could this mistakenly intercept a command that completed normally after a SCSI
>   timeout, returning a timeout error instead of success? Would this also
>   incorrectly clear `ap->deferred_qc`, dropping the deferred command?
>   Should we verify that the loop actually found a match, for instance by checking
>   `if (i < ATA_MAX_QUEUE && qc == ap->deferred_qc)`?

Yeah. Something like this. But that condition cannot actually happen. i ==
ATA_MAX_QUEUE corresponds to internal QCs, and these never go through the
deferred path.

> It does seem to be a real problem to me, but I don't know the code well
> enough to be sure. Please take a look and let me know if the problem is
> real. If so, I'll be happy to submit a patch to fix it. If not, please let
> me know what the agent is missing.

See above. That is not a real problem, but it would still be good to check. I
think your change is fine. Care to send a proper patch?

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-05 23:27     ` Niklas Cassel
@ 2026-03-06  0:11       ` Damien Le Moal
  2026-03-06  0:59         ` Damien Le Moal
  2026-03-06  0:14       ` Guenter Roeck
  1 sibling, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2026-03-06  0:11 UTC (permalink / raw)
  To: Niklas Cassel, Guenter Roeck; +Cc: linux-ide

On 3/6/26 08:27, Niklas Cassel wrote:
> On 5 March 2026 18:59:08 CET, Guenter Roeck <linux@roeck-us.net> wrote:
>> Hi,
>> 
>> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
>>> A deferred qc may timeout while waiting for the device queue to drain to
>>> be submitted. In such case, since the qc is not active, 
>>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(), which
>>> frees the qc. But as the port deferred_qc field still references this
>>> finished/freed qc, the deferred qc work may eventually attempt to call
>>> ata_qc_issue() against this invalid qc, leading to errors such as 
>>> reported by UBSAN (syzbot run):
>>> 
>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24 shift
>>> exponent 4210818301 is too large for 64-bit type 'long long unsigned
>>> int' ... Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] 
>>> dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120 ubsan_epilogue+0xa/0x30
>>> lib/ubsan.c:233 __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/
>>> ubsan.c:494 ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166 
>>> ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679 
>>> process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275 
>>> process_scheduled_works kernel/workqueue.c:3358 [inline] 
>>> worker_thread+0x5da/0xe40 kernel/workqueue.c:3439 kthread+0x370/0x450
>>> kernel/kthread.c:467 ret_from_fork+0x754/0xd80 arch/x86/kernel/
>>> process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 
>>> </TASK>
>>> 
>>> Fix this by checking if the qc of a timed out SCSI command is a deferred 
>>> one, and in such case, clear the port deferred_qc field and finish the 
>>> SCSI command with DID_TIME_OUT.
>>> 
>>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
>>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
>>> ---
>>>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>>>  1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>> index 72a22b6c9682..b373cceb95d2 100644
>>> --- a/drivers/ata/libata-eh.c
>>> +++ b/drivers/ata/libata-eh.c
>>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>>>  		set_host_byte(scmd, DID_OK);
>>>
>>>  		ata_qc_for_each_raw(ap, qc, i) {
>>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
>>> -			    qc->scsicmd == scmd)
>>> +			if (qc->scsicmd != scmd)
>>> +				continue;
>>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
>>> +			    qc == ap->deferred_qc)
>>>  				break;
>>>  		}
>>>
>>> -		if (i < ATA_MAX_QUEUE) {
>>> +		if (qc == ap->deferred_qc) {
>> 
>> An experimental AI code review agent tagged this patch with the following 
>> comment.
>> 
>> If the `ata_qc_for_each_raw()` loop finishes without finding a matching
>> `scmd`, `qc` will hold a pointer to the last element examined (`i ==
>> ATA_MAX_QUEUE`).
> 
> I think the AI is wrong here.
> 
> That last element assigned to QC will be ATA_MAX_QUEUE - 1.
> 
> 
>> If this last element happens to be `ap->deferred_qc`, the condition `qc ==
>> ap->deferred_qc` evaluates to true despite the loop not breaking on a 
>> match.
>> 
>> Could this mistakenly intercept a command that completed normally after a
>> SCSI timeout, returning a timeout error instead of success? Would this
>> also incorrectly clear `ap->deferred_qc`, dropping the deferred command?
> 
> I think the AI is partially wrong here.
> 
> If you read the comment below it if (), we know that ap->deferred_qc is only
> set until that command has been issued. So if it is set, that qc has not
> been issued, so it can't have successfully completed.

From the block layer perspective, the request for the qc/scsi command was
started, and so it can still time out. So this is all valid.

BUT ATA_MAX_QUEUE qc (last in the loop if there is no match) is the one reserved
for internal commands issued from EH. Internal QCs do not go through the
deferred issue path, so even without checking for the index when there is no
match, we have:
 - qc is still a valid pointer (the array of QCs is ATA_MAX_QUEUE + 1 sized)
 - We can never have qc == ap->deferred_qc.

As-is, the code is fine, but the above is not super clear :)

> But... Since we don't verify that i < ATA_MAX_QUEUE, we might end up
> completing the deferred QC as a failed command, even though it did not time
> out...

Nope. If there is no match, it means that we do not have a QC anymore and so
completing the command is OK (this is the race case with the ATA driver
signaling a completion with an IRQ).

> On NCQ error, we complete the deferred QC as a failed command.

No we do not, we requeue the deferred QC in EH. That qc has not been issued, so
the NCQ error did not affect it.

> However, if there was a timeout of a command, which was not the deferred QC,
> but the deferred QC did not timeout, I think it is wrong to complete the
> deferred QC as a failed command.

We do not complete it, we requeue it from EH.

> So... I actually think that the change suggested by the AI is something we
> want. (Especially after Damien commit queued in for-next where we will not
> invoke error_handler() if there were no timed out commands.)

Yes, I think the proposed fix is fine. When the deferred qc is not a match, that
restores the same behavior of this function as before we added the deferred qc
stuff.



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-05 23:27     ` Niklas Cassel
  2026-03-06  0:11       ` Damien Le Moal
@ 2026-03-06  0:14       ` Guenter Roeck
  2026-03-06  0:21         ` Damien Le Moal
  1 sibling, 1 reply; 21+ messages in thread
From: Guenter Roeck @ 2026-03-06  0:14 UTC (permalink / raw)
  To: Niklas Cassel; +Cc: Damien Le Moal, linux-ide

On Fri, Mar 06, 2026 at 12:27:34AM +0100, Niklas Cassel wrote:
> On 5 March 2026 18:59:08 CET, Guenter Roeck <linux@roeck-us.net> wrote:
> >Hi,
> >
> >On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
> >> A deferred qc may timeout while waiting for the device queue to drain
> >> to be submitted. In such case, since the qc is not active,
> >> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
> >> which frees the qc. But as the port deferred_qc field still references
> >> this finished/freed qc, the deferred qc work may eventually attempt to
> >> call ata_qc_issue() against this invalid qc, leading to errors such as
> >> reported by UBSAN (syzbot run):
> >> 
> >> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> >> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> >> ...
> >> Call Trace:
> >>  <TASK>
> >>  __dump_stack lib/dump_stack.c:94 [inline]
> >>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
> >>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
> >>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
> >>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
> >>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
> >>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
> >>  process_scheduled_works kernel/workqueue.c:3358 [inline]
> >>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
> >>  kthread+0x370/0x450 kernel/kthread.c:467
> >>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
> >>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >>  </TASK>
> >> 
> >> Fix this by checking if the qc of a timed out SCSI command is a deferred
> >> one, and in such case, clear the port deferred_qc field and finish the
> >> SCSI command with DID_TIME_OUT.
> >> 
> >> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
> >> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> >> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> >> Reviewed-by: Hannes Reinecke <hare@suse.de>
> >> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
> >> ---
> >>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
> >>  1 file changed, 19 insertions(+), 3 deletions(-)
> >> 
> >> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> >> index 72a22b6c9682..b373cceb95d2 100644
> >> --- a/drivers/ata/libata-eh.c
> >> +++ b/drivers/ata/libata-eh.c
> >> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
> >>  		set_host_byte(scmd, DID_OK);
> >>  
> >>  		ata_qc_for_each_raw(ap, qc, i) {
> >> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
> >> -			    qc->scsicmd == scmd)
> >> +			if (qc->scsicmd != scmd)
> >> +				continue;
> >> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
> >> +			    qc == ap->deferred_qc)
> >>  				break;
> >>  		}
> >>  
> >> -		if (i < ATA_MAX_QUEUE) {
> >> +		if (qc == ap->deferred_qc) {
> >
> >An experimental AI code review agent tagged this patch with the following
> >comment.
> >
> >  If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
> >  `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
> 
> I think the AI is wrong here.
> 
> That last element assigned to QC will be ATA_MAX_QUEUE - 1.
> 

I think that is what it means with "`qc` will hold a pointer to the
last element examined". The "(`i == ATA_MAX_QUEUE`)" part is a bit
confusing.

I think what it is trying to say is that if i == ATA_MAX_QUEUE,
qc would point to the last examined element, which would not
have ATA_QCFLAG_ACTIVE set because otherwise it would have
exited the loop. Yet, ap->deferred_qc could be set, and the
if statement would be true even though i == ATA_MAX_QUEUE
and there was no qc match.

> 
> >  If this last element happens to be `ap->deferred_qc`, the condition
> >  `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
> >  match.
> >

That is pretty much repeating what I said above, without
the confusing "(`i == ATA_MAX_QUEUE`)" part.

> >  Could this mistakenly intercept a command that completed normally after a SCSI
> >  timeout, returning a timeout error instead of success? Would this also
> >  incorrectly clear `ap->deferred_qc`, dropping the deferred command?
> 

This part is beyond my understanding, primarily because I don't know
what "ap->deferred_qc" actually refers to.

> I think the AI is partially wrong here.
> 
> If you read the comment below it if (), we know that ap->deferred_qc is only set until that command has been issued. So if it is set, that qc has not been issued, so it can't have successfully completed.
> 
> But... Since we don't verify that i < ATA_MAX_QUEUE, we might end up completing the deferred QC as a failed command, even though it did not time out...
> 
> On NCQ error, we complete the deferred QC as a failed command.
> 
> However, if there was a timeout of a command, which was not the deferred QC, but the deferred QC did not timeout, I think it is wrong to complete the deferred QC as a failed command.
> 
> So... I actually think that the change suggested by the AI is something we want.
> (Especially after Damien commit queued in for-next where we will not invoke error_handler() if there were no timed out commands.)
> 

So should I send a patch, or do you want to handle it ?
It might be better if you handle it since I don't know
how to exactly describe the problem differently than the AI.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:14       ` Guenter Roeck
@ 2026-03-06  0:21         ` Damien Le Moal
  2026-03-06  0:41           ` Guenter Roeck
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2026-03-06  0:21 UTC (permalink / raw)
  To: Guenter Roeck, Niklas Cassel; +Cc: linux-ide

On 3/6/26 09:14, Guenter Roeck wrote:
> On Fri, Mar 06, 2026 at 12:27:34AM +0100, Niklas Cassel wrote:
>> On 5 March 2026 18:59:08 CET, Guenter Roeck <linux@roeck-us.net> wrote:
>>> Hi,
>>>
>>> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
>>>> A deferred qc may timeout while waiting for the device queue to drain
>>>> to be submitted. In such case, since the qc is not active,
>>>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
>>>> which frees the qc. But as the port deferred_qc field still references
>>>> this finished/freed qc, the deferred qc work may eventually attempt to
>>>> call ata_qc_issue() against this invalid qc, leading to errors such as
>>>> reported by UBSAN (syzbot run):
>>>>
>>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
>>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
>>>> ...
>>>> Call Trace:
>>>>  <TASK>
>>>>  __dump_stack lib/dump_stack.c:94 [inline]
>>>>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>>>>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>>>>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>>>>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>>>>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>>>>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>>>>  process_scheduled_works kernel/workqueue.c:3358 [inline]
>>>>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>>>>  kthread+0x370/0x450 kernel/kthread.c:467
>>>>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>>>>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>>>>  </TASK>
>>>>
>>>> Fix this by checking if the qc of a timed out SCSI command is a deferred
>>>> one, and in such case, clear the port deferred_qc field and finish the
>>>> SCSI command with DID_TIME_OUT.
>>>>
>>>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
>>>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
>>>> ---
>>>>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>>>>  1 file changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>> index 72a22b6c9682..b373cceb95d2 100644
>>>> --- a/drivers/ata/libata-eh.c
>>>> +++ b/drivers/ata/libata-eh.c
>>>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>>>>  		set_host_byte(scmd, DID_OK);
>>>>  
>>>>  		ata_qc_for_each_raw(ap, qc, i) {
>>>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
>>>> -			    qc->scsicmd == scmd)
>>>> +			if (qc->scsicmd != scmd)
>>>> +				continue;
>>>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
>>>> +			    qc == ap->deferred_qc)
>>>>  				break;
>>>>  		}
>>>>  
>>>> -		if (i < ATA_MAX_QUEUE) {
>>>> +		if (qc == ap->deferred_qc) {
>>>
>>> An experimental AI code review agent tagged this patch with the following
>>> comment.
>>>
>>>  If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
>>>  `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
>>
>> I think the AI is wrong here.
>>
>> That last element assigned to QC will be ATA_MAX_QUEUE - 1.
>>
> 
> I think that is what it means with "`qc` will hold a pointer to the
> last element examined". The "(`i == ATA_MAX_QUEUE`) part is a bit
> confusing.
> 
> I think what it is trying to say is that if i == ATA_MAX_QUEUE,
> qc would point to the last examined element, which would not
> have ATA_QCFLAG_ACTIVE set because otherwise it would have
> exited the loop. Yet, ap->deferred_qc could be set, and the
> if statement would be true even though i == ATA_MAX_QUEUE
> and there was no qc match.
> 
>>
>>>  If this last element happens to be `ap->deferred_qc`, the condition
>>>  `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
>>>  match.
>>>
> 
> That is pretty much much repeating what I said above, without
> the confusing "(`i == ATA_MAX_QUEUE`)" part.
> 
>>>  Could this mistakenly intercept a command that completed normally after a SCSI
>>>  timeout, returning a timeout error instead of success? Would this also
>>>  incorrectly clear `ap->deferred_qc`, dropping the deferred command?
>>
> 
> This part is beyond my understanding, primarily because I don't know
> what "qc->deferred" actually refers to.

There are 2 types of ATA commands: queueable ones (NCQ == Native Command
Queueing) and non-queueable ones (legacy/old ATA commands). The 2 types cannot
be mixed. When NCQ commands are on-going, you cannot issue a non-NCQ command,
and vice-versa. This has always been handled with command requeueing in
libata-scsi (since forever), but with blk-mq introduction, there was a potential
command starvation issue for non-NCQ commands that has existed for a long time.

We fixed that recently by keeping on hand any non-NCQ command that must wait for
on-going NCQ commands to complete first. This is ap->deferred_qc.
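
As a rough illustration only, here is a toy model of that behavior. None of
the names below (model_port, model_submit(), model_complete_ncq(), the
MODEL_* results) exist in libata. The rule that new commands are pushed back
while a command is parked is an assumption of this sketch, and the real code
issues the deferred qc from a work item rather than directly on completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not libata types. */
struct model_cmd {
	bool ncq;			/* true: queueable (NCQ) command */
};

struct model_port {
	int ncq_inflight;		/* NCQ commands currently issued */
	bool non_ncq_inflight;		/* a non-NCQ command is issued */
	struct model_cmd *deferred;	/* plays the role of ap->deferred_qc */
};

enum model_result { MODEL_ISSUED, MODEL_DEFERRED, MODEL_REQUEUE };

/* Decide what happens to a newly submitted command. */
static enum model_result model_submit(struct model_port *ap,
				      struct model_cmd *cmd)
{
	/* Assumption: nothing new is accepted while a command is parked. */
	if (ap->deferred)
		return MODEL_REQUEUE;

	if (cmd->ncq) {
		/* NCQ and non-NCQ commands cannot be mixed. */
		if (ap->non_ncq_inflight)
			return MODEL_REQUEUE;
		ap->ncq_inflight++;
		return MODEL_ISSUED;
	}

	/* Non-NCQ command: park it until the NCQ commands drain. */
	if (ap->ncq_inflight > 0) {
		ap->deferred = cmd;
		return MODEL_DEFERRED;
	}
	if (ap->non_ncq_inflight)
		return MODEL_REQUEUE;
	ap->non_ncq_inflight = true;
	return MODEL_ISSUED;
}

/*
 * One NCQ command completed. Returns the parked command if the queue
 * drained and it was issued, NULL otherwise.
 */
static struct model_cmd *model_complete_ncq(struct model_port *ap)
{
	struct model_cmd *cmd = NULL;

	assert(ap->ncq_inflight > 0);
	ap->ncq_inflight--;

	if (ap->ncq_inflight == 0 && ap->deferred) {
		cmd = ap->deferred;
		ap->deferred = NULL;	/* the slot is only set until issue */
		ap->non_ncq_inflight = true;
	}
	return cmd;
}
```

In this model, a non-NCQ command submitted behind two NCQ commands is parked,
and it is issued only when the second NCQ completion drains the queue.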

> 
>> I think the AI is partially wrong here.
>>
>> If you read the comment below it if (), we know that ap->deferred_qc is only set until that command has been issued. So if it is set, that qc has not been issued, so it can't have successfully completed.
>>
>> But... Since we don't verify that i < ATA_MAX_QUEUE, we might end up completing the deferred QC as a failed command, even though it did not time out...
>>
>> On NCQ error, we complete the deferred QC as a failed command.
>>
>> However, if there was a timeout of a command, which was not the deferred QC, but the deferred QC did not timeout, I think it is wrong to complete the deferred QC as a failed command.
>>
>> So... I actually think that the change suggested by the AI is something we want.
>> (Especially after Damien commit queued in for-next where we will not invoke error_handler() if there were no timed out commands.)
>>
> 
> So should I send a patch, or do you want to handle it ?
> It might be better if you handle it since I don't know
> how to exactly describe the problem differently than the AI.

Send a patch. Write a commit message based on the information I sent in my
previous email. We can correct the commit message if needed.

> 
> Thanks,
> Guenter


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-05 23:59     ` Damien Le Moal
@ 2026-03-06  0:32       ` Guenter Roeck
  2026-03-06  0:50         ` Damien Le Moal
  0 siblings, 1 reply; 21+ messages in thread
From: Guenter Roeck @ 2026-03-06  0:32 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide, Niklas Cassel

On 3/5/26 15:59, Damien Le Moal wrote:
> On 3/6/26 02:59, Guenter Roeck wrote:
>> Hi,
>>
>> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
>>> A deferred qc may timeout while waiting for the device queue to drain
>>> to be submitted. In such case, since the qc is not active,
>>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
>>> which frees the qc. But as the port deferred_qc field still references
>>> this finished/freed qc, the deferred qc work may eventually attempt to
>>> call ata_qc_issue() against this invalid qc, leading to errors such as
>>> reported by UBSAN (syzbot run):
>>>
>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
>>> ...
>>> Call Trace:
>>>   <TASK>
>>>   __dump_stack lib/dump_stack.c:94 [inline]
>>>   dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>>>   ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>>>   __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>>>   ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>>>   ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>>>   process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>>>   process_scheduled_works kernel/workqueue.c:3358 [inline]
>>>   worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>>>   kthread+0x370/0x450 kernel/kthread.c:467
>>>   ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>>>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>>>   </TASK>
>>>
>>> Fix this by checking if the qc of a timed out SCSI command is a deferred
>>> one, and in such case, clear the port deferred_qc field and finish the
>>> SCSI command with DID_TIME_OUT.
>>>
>>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
>>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
>>> ---
>>>   drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>> index 72a22b6c9682..b373cceb95d2 100644
>>> --- a/drivers/ata/libata-eh.c
>>> +++ b/drivers/ata/libata-eh.c
>>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>>>   		set_host_byte(scmd, DID_OK);
>>>   
>>>   		ata_qc_for_each_raw(ap, qc, i) {
>>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
>>> -			    qc->scsicmd == scmd)
>>> +			if (qc->scsicmd != scmd)
>>> +				continue;
>>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
>>> +			    qc == ap->deferred_qc)
>>>   				break;
>>>   		}
>>>   
>>> -		if (i < ATA_MAX_QUEUE) {
>>> +		if (qc == ap->deferred_qc) {
>>
>> An experimental AI code review agent tagged this patch with the following
>> comment.
> 
> Thanks for that.
> 
>>    If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
>>    `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
>>    If this last element happens to be `ap->deferred_qc`, the condition
>>    `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
>>    match.
>>
>>    Could this mistakenly intercept a command that completed normally after a SCSI
>>    timeout, returning a timeout error instead of success? Would this also
>>    incorrectly clear `ap->deferred_qc`, dropping the deferred command?
>>    Should we verify that the loop actually found a match, for instance by checking
>>    `if (i < ATA_MAX_QUEUE && qc == ap->deferred_qc)`?
> 
> Yeah. Something like this. But that condition cannot actually happen. i ==
> ATA_MAX_QUEUE correspond to internal QCs, and these never go through the
> deferred path.
> 

Pardon my ignorance, but doesn't i == ATA_MAX_QUEUE mean that qc points
to the _previously_ examined qc, i.e., the one associated with i ==
ATA_MAX_QUEUE - 1 ? Are you saying that this qc will always be
internal and never match ap->deferred_qc ?

>> It does seem to be a real problem to me, but I don't know the code well
>> enough to be sure. Please take a look and let me know if the problem is
>> real. If so, I'll be happy to submit a patch to fix it. If not, please let
>> me know what the agent is missing.
> 
> See above. That is not a real problem, but it would still be good to check. I
> think your change is fine. Care to send a proper patch ?
> 

I can try, but if the AI analysis is missing the point I don't really know
how to describe/explain it. Can you give me some guidance ?

Thanks,
Guenter


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:21         ` Damien Le Moal
@ 2026-03-06  0:41           ` Guenter Roeck
  0 siblings, 0 replies; 21+ messages in thread
From: Guenter Roeck @ 2026-03-06  0:41 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Niklas Cassel, linux-ide

On Fri, Mar 06, 2026 at 09:21:26AM +0900, Damien Le Moal wrote:
> On 3/6/26 09:14, Guenter Roeck wrote:
> > On Fri, Mar 06, 2026 at 12:27:34AM +0100, Niklas Cassel wrote:
> >> On 5 March 2026 18:59:08 CET, Guenter Roeck <linux@roeck-us.net> wrote:
> >>> Hi,
> >>>
> >>> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
> >>>> A deferred qc may timeout while waiting for the device queue to drain
> >>>> to be submitted. In such case, since the qc is not active,
> >>>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
> >>>> which frees the qc. But as the port deferred_qc field still references
> >>>> this finished/freed qc, the deferred qc work may eventually attempt to
> >>>> call ata_qc_issue() against this invalid qc, leading to errors such as
> >>>> reported by UBSAN (syzbot run):
> >>>>
> >>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> >>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> >>>> ...
> >>>> Call Trace:
> >>>>  <TASK>
> >>>>  __dump_stack lib/dump_stack.c:94 [inline]
> >>>>  dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
> >>>>  ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
> >>>>  __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
> >>>>  ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
> >>>>  ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
> >>>>  process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
> >>>>  process_scheduled_works kernel/workqueue.c:3358 [inline]
> >>>>  worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
> >>>>  kthread+0x370/0x450 kernel/kthread.c:467
> >>>>  ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
> >>>>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >>>>  </TASK>
> >>>>
> >>>> Fix this by checking if the qc of a timed out SCSI command is a deferred
> >>>> one, and in such case, clear the port deferred_qc field and finish the
> >>>> SCSI command with DID_TIME_OUT.
> >>>>
> >>>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
> >>>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> >>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> >>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
> >>>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
> >>>> ---
> >>>>  drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
> >>>>  1 file changed, 19 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> >>>> index 72a22b6c9682..b373cceb95d2 100644
> >>>> --- a/drivers/ata/libata-eh.c
> >>>> +++ b/drivers/ata/libata-eh.c
> >>>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
> >>>>  		set_host_byte(scmd, DID_OK);
> >>>>  
> >>>>  		ata_qc_for_each_raw(ap, qc, i) {
> >>>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
> >>>> -			    qc->scsicmd == scmd)
> >>>> +			if (qc->scsicmd != scmd)
> >>>> +				continue;
> >>>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
> >>>> +			    qc == ap->deferred_qc)
> >>>>  				break;
> >>>>  		}
> >>>>  
> >>>> -		if (i < ATA_MAX_QUEUE) {
> >>>> +		if (qc == ap->deferred_qc) {
> >>>
> >>> An experimental AI code review agent tagged this patch with the following
> >>> comment.
> >>>
> >>>  If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
> >>>  `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
> >>
> >> I think the AI is wrong here.
> >>
> >> That last element assigned to QC will be ATA_MAX_QUEUE - 1.
> >>
> > 
> > I think that is what it means with "`qc` will hold a pointer to the
> > last element examined". The "(`i == ATA_MAX_QUEUE`) part is a bit
> > confusing.
> > 
> > I think what it is trying to say is that if i == ATA_MAX_QUEUE,
> > qc would point to the last examined element, which would not
> > have ATA_QCFLAG_ACTIVE set because otherwise it would have
> > exited the loop. Yet, ap->deferred_qc could be set, and the
> > if statement would be true even though i == ATA_MAX_QUEUE
> > and there was no qc match.
> > 
> >>
> >>>  If this last element happens to be `ap->deferred_qc`, the condition
> >>>  `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
> >>>  match.
> >>>
> > 
> > That is pretty much much repeating what I said above, without
> > the confusing "(`i == ATA_MAX_QUEUE`)" part.
> > 
> >>>  Could this mistakenly intercept a command that completed normally after a SCSI
> >>>  timeout, returning a timeout error instead of success? Would this also
> >>>  incorrectly clear `ap->deferred_qc`, dropping the deferred command?
> >>
> > 
> > This part is beyond my understanding, primarily because I don't know
> > what "qc->deferred" actually refers to.
> 
> There are 2 types of ATA commands: queueable ones (NCQ == Native Command
> Queueing) and non-queueable ones (legacy/old ATA commands). The 2 types cannot
> be mixed. When NCQ commands are on-going, you cannot issue a non-NCQ command,
> and vice-versa. This has always been handled with command requeueing in
> libata-scsi (since forever), but with blk-mq introduction, there was a potential
> command starvation issue for non-NCQ commands that has existed for a long time.
> 
> We fixed that recently by keeping on hand any non-NCQ command that must wait for
> on-going NCQ commands to complete first. This is ap->deferred_qc.
> 
> > 
> >> I think the AI is partially wrong here.
> >>
> >> If you read the comment below it if (), we know that ap->deferred_qc is only set until that command has been issued. So if it is set, that qc has not been issued, so it can't have successfully completed.
> >>
> >> But... Since we don't verify that i < ATA_MAX_QUEUE, we might end up completing the deferred QC as a failed command, even though it did not time out...
> >>
> >> On NCQ error, we complete the deferred QC as a failed command.
> >>
> >> However, if there was a timeout of a command, which was not the deferred QC, but the deferred QC did not timeout, I think it is wrong to complete the deferred QC as a failed command.
> >>
> >> So... I actually think that the change suggested by the AI is something we want.
> >> (Especially after Damien commit queued in for-next where we will not invoke error_handler() if there were no timed out commands.)
> >>
> > 
> > So should I send a patch, or do you want to handle it ?
> > It might be better if you handle it since I don't know
> > how to exactly describe the problem differently than the AI.
> 
> Send a patch. Write a commit message based on the information I sent in my
> previous email. We can correct the commit message if needed.
> 
I'll give it a try.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:32       ` Guenter Roeck
@ 2026-03-06  0:50         ` Damien Le Moal
  2026-03-06  1:31           ` Guenter Roeck
  2026-03-06  8:24           ` Niklas Cassel
  0 siblings, 2 replies; 21+ messages in thread
From: Damien Le Moal @ 2026-03-06  0:50 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: linux-ide, Niklas Cassel

On 3/6/26 09:32, Guenter Roeck wrote:
> On 3/5/26 15:59, Damien Le Moal wrote:
>> On 3/6/26 02:59, Guenter Roeck wrote:
>>> Hi,
>>>
>>> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
>>>> A deferred qc may timeout while waiting for the device queue to drain
>>>> to be submitted. In such case, since the qc is not active,
>>>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
>>>> which frees the qc. But as the port deferred_qc field still references
>>>> this finished/freed qc, the deferred qc work may eventually attempt to
>>>> call ata_qc_issue() against this invalid qc, leading to errors such as
>>>> reported by UBSAN (syzbot run):
>>>>
>>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
>>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
>>>> ...
>>>> Call Trace:
>>>>   <TASK>
>>>>   __dump_stack lib/dump_stack.c:94 [inline]
>>>>   dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
>>>>   ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>>>>   __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
>>>>   ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
>>>>   ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
>>>>   process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
>>>>   process_scheduled_works kernel/workqueue.c:3358 [inline]
>>>>   worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
>>>>   kthread+0x370/0x450 kernel/kthread.c:467
>>>>   ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
>>>>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>>>>   </TASK>
>>>>
>>>> Fix this by checking if the qc of a timed out SCSI command is a deferred
>>>> one, and in such case, clear the port deferred_qc field and finish the
>>>> SCSI command with DID_TIME_OUT.
>>>>
>>>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
>>>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
>>>> ---
>>>>   drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
>>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>> index 72a22b6c9682..b373cceb95d2 100644
>>>> --- a/drivers/ata/libata-eh.c
>>>> +++ b/drivers/ata/libata-eh.c
>>>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
>>>>   		set_host_byte(scmd, DID_OK);
>>>>   
>>>>   		ata_qc_for_each_raw(ap, qc, i) {
>>>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
>>>> -			    qc->scsicmd == scmd)
>>>> +			if (qc->scsicmd != scmd)
>>>> +				continue;
>>>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
>>>> +			    qc == ap->deferred_qc)
>>>>   				break;
>>>>   		}
>>>>   
>>>> -		if (i < ATA_MAX_QUEUE) {
>>>> +		if (qc == ap->deferred_qc) {
>>>
>>> An experimental AI code review agent tagged this patch with the following
>>> comment.
>>
>> Thanks for that.
>>
>>>    If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
>>>    `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
>>>    If this last element happens to be `ap->deferred_qc`, the condition
>>>    `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
>>>    match.
>>>
>>>    Could this mistakenly intercept a command that completed normally after a SCSI
>>>    timeout, returning a timeout error instead of success? Would this also
>>>    incorrectly clear `ap->deferred_qc`, dropping the deferred command?
>>>    Should we verify that the loop actually found a match, for instance by checking
>>>    `if (i < ATA_MAX_QUEUE && qc == ap->deferred_qc)`?
>>
>> Yeah. Something like this. But that condition cannot actually happen. i ==
>> ATA_MAX_QUEUE corresponds to internal QCs, and these never go through the
>> deferred path.
>>
> 
> Pardon my ignorance, but doesn't i == ATA_MAX_QUEUE mean that qc points
> to the _previously_ examined qc, i.e., the one associated with i ==
> ATA_MAX_QUEUE - 1 ? Are you saying that this qc will always be
> internal and never match ap->deferred_qc ?

Arg !!!! Yes ! I screwed up. Checking ata_qc_for_each_raw(), the tag is checked
before setting the qc pointer. So yes, you are absolutely correct, and we have a
problem.
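
To make the loop behaviour concrete, here is a small stand-alone C model of an
iterator that tests the tag bound before assigning qc, as described above for
ata_qc_for_each_raw(). It is only a sketch: the model_ names, the struct
layout and MODEL_MAX_QUEUE are assumptions made for illustration, not the
libata definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_QUEUE 32	/* hypothetical stand-in for ATA_MAX_QUEUE */

struct model_qc {
	int active;		/* stand-in for ATA_QCFLAG_ACTIVE */
	const void *scsicmd;	/* owning command, NULL if the slot is unused */
};

struct model_port {
	/* One extra slot for the internal command, as mentioned in the thread. */
	struct model_qc qcmd[MODEL_MAX_QUEUE + 1];
	struct model_qc *deferred_qc;
};

/* The tag bound is tested first; qc is only assigned if the bound holds. */
#define model_qc_for_each_raw(ap, qc, tag)				\
	for ((tag) = 0;							\
	     (tag) < MODEL_MAX_QUEUE && ((qc) = &(ap)->qcmd[(tag)], 1);	\
	     (tag)++)

/* Returns the final tag; *out receives wherever the loop left qc. */
static int model_find(struct model_port *ap, const void *scmd,
		      struct model_qc **out)
{
	struct model_qc *qc = NULL;
	int i;

	assert(ap && out);

	model_qc_for_each_raw(ap, qc, i) {
		if (qc->scsicmd != scmd)
			continue;
		if (qc->active || qc == ap->deferred_qc)
			break;
	}

	*out = qc;
	return i;
}
```

With no matching command, model_find() returns MODEL_MAX_QUEUE and leaves qc
at slot MODEL_MAX_QUEUE - 1, which is a regular slot and not the internal one.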

>>> It does seem to be a real problem to me, but I don't know the code well
>>> enough to be sure. Please take a look and let me know if the problem is
>>> real. If so, I'll be happy to submit a patch to fix it. If not, please let
>>> me know what the agent is missing.
>>
>> See above. That is not a real problem, but it would still be good to check. I
>> think your change is fine. Care to send a proper patch ?
>>
> 
> I can try, but if the AI analysis is missing the point I don't really know
> how to describe/explain it. Can you give me some guidance ?

I think the AI is correct. That "if" without the index check can lead to false
positives and we can end up timeout-failing a deferred qc that has not timed out yet.
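
The false positive can be illustrated with a tiny stand-alone C sketch that
compares the check as posted with the guarded variant suggested in the review.
The model_ names and the verdict enum are hypothetical, and the ordering of the
remaining branches is an assumption, since the quoted diff only shows the first
"if".

```c
#include <assert.h>

#define MODEL_MAX_QUEUE 32	/* hypothetical stand-in for ATA_MAX_QUEUE */

enum model_verdict {
	MODEL_NO_MATCH,		/* loop ran to the end without a match */
	MODEL_DEFERRED_TIMEOUT,	/* treated as a timed out deferred qc */
	MODEL_ACTIVE_MATCH,	/* loop stopped on an active qc */
};

/* Check as posted: only compares the pointer left behind by the loop. */
static enum model_verdict
model_verdict_unguarded(int i, const void *qc, const void *deferred_qc)
{
	assert(i >= 0 && i <= MODEL_MAX_QUEUE);

	if (qc == deferred_qc)
		return MODEL_DEFERRED_TIMEOUT;
	if (i < MODEL_MAX_QUEUE)
		return MODEL_ACTIVE_MATCH;
	return MODEL_NO_MATCH;
}

/* Guarded check: the pointer is trusted only if the loop actually broke. */
static enum model_verdict
model_verdict_guarded(int i, const void *qc, const void *deferred_qc)
{
	assert(i >= 0 && i <= MODEL_MAX_QUEUE);

	if (i < MODEL_MAX_QUEUE && qc == deferred_qc)
		return MODEL_DEFERRED_TIMEOUT;
	if (i < MODEL_MAX_QUEUE)
		return MODEL_ACTIVE_MATCH;
	return MODEL_NO_MATCH;
}
```

When the loop ends at i == MODEL_MAX_QUEUE while qc happens to equal the
deferred qc, the unguarded variant reports a deferred timeout and the guarded
one reports no match.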

Send a patch please. As mentioned, we can touch-up the commit message if needed.

> 
> Thanks,
> Guenter
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:11       ` Damien Le Moal
@ 2026-03-06  0:59         ` Damien Le Moal
  2026-03-06  8:23           ` Niklas Cassel
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2026-03-06  0:59 UTC (permalink / raw)
  To: Niklas Cassel, Guenter Roeck; +Cc: linux-ide

On 3/6/26 09:11, Damien Le Moal wrote:
>>> Could this mistakenly intercept a command that completed normally after a
>>> SCSI timeout, returning a timeout error instead of success? Would this
>>> also incorrectly clear `ap->deferred_qc`, dropping the deferred command?
>>
>> I think the AI is partially wrong here.
>>
>> If you read the comment below the if (), we know that ap->deferred_qc is only
>> set until that command has been issued. So if it is set, that qc has not
>> been issued, so it can't have successfully completed.
> 
> The request for the qc/scsi command was started from the block layer perspective
> and so can still timeout. So this is all valid.
> 
> BUT ATA_MAX_QUEUE qc (last in the loop if there is no match) is the one reserved
> for internal commands issued from EH. Internal QCs do not go through the
> deferred issue path, so even without checking for the index when there is no
> match, we have:
>  - qc is still a valid pointer (the array of QCs is ATA_MAX_QUEUE + 1 sized)
>  - We can never have qc == ap->deferred_qc.

This second part of my comment is obviously wrong. I need more coffee :)

The first part stands: a deferred qc that has not been issued can timeout.

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:50         ` Damien Le Moal
@ 2026-03-06  1:31           ` Guenter Roeck
  2026-03-06  8:24           ` Niklas Cassel
  1 sibling, 0 replies; 21+ messages in thread
From: Guenter Roeck @ 2026-03-06  1:31 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide, Niklas Cassel

On Fri, Mar 06, 2026 at 09:50:49AM +0900, Damien Le Moal wrote:
> On 3/6/26 09:32, Guenter Roeck wrote:
> > On 3/5/26 15:59, Damien Le Moal wrote:
> >> On 3/6/26 02:59, Guenter Roeck wrote:
> >>> Hi,
> >>>
> >>> On Sat, Feb 21, 2026 at 07:14:38AM +0900, Damien Le Moal wrote:
> >>>> A deferred qc may timeout while waiting for the device queue to drain
> >>>> to be submitted. In such case, since the qc is not active,
> >>>> ata_scsi_cmd_error_handler() ends up calling scsi_eh_finish_cmd(),
> >>>> which frees the qc. But as the port deferred_qc field still references
> >>>> this finished/freed qc, the deferred qc work may eventually attempt to
> >>>> call ata_qc_issue() against this invalid qc, leading to errors such as
> >>>> reported by UBSAN (syzbot run):
> >>>>
> >>>> UBSAN: shift-out-of-bounds in drivers/ata/libata-core.c:5166:24
> >>>> shift exponent 4210818301 is too large for 64-bit type 'long long unsigned int'
> >>>> ...
> >>>> Call Trace:
> >>>>   <TASK>
> >>>>   __dump_stack lib/dump_stack.c:94 [inline]
> >>>>   dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
> >>>>   ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
> >>>>   __ubsan_handle_shift_out_of_bounds+0x279/0x2a0 lib/ubsan.c:494
> >>>>   ata_qc_issue.cold+0x38/0x9f drivers/ata/libata-core.c:5166
> >>>>   ata_scsi_deferred_qc_work+0x154/0x1f0 drivers/ata/libata-scsi.c:1679
> >>>>   process_one_work+0x9d7/0x1920 kernel/workqueue.c:3275
> >>>>   process_scheduled_works kernel/workqueue.c:3358 [inline]
> >>>>   worker_thread+0x5da/0xe40 kernel/workqueue.c:3439
> >>>>   kthread+0x370/0x450 kernel/kthread.c:467
> >>>>   ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
> >>>>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >>>>   </TASK>
> >>>>
> >>>> Fix this by checking if the qc of a timed out SCSI command is a deferred
> >>>> one, and in such case, clear the port deferred_qc field and finish the
> >>>> SCSI command with DID_TIME_OUT.
> >>>>
> >>>> Reported-by: syzbot+1f77b8ca15336fff21ff@syzkaller.appspotmail.com
> >>>> Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation")
> >>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> >>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
> >>>> Reviewed-by: Igor Pylypiv <ipylypiv@google.com>
> >>>> ---
> >>>>   drivers/ata/libata-eh.c | 22 +++++++++++++++++++---
> >>>>   1 file changed, 19 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> >>>> index 72a22b6c9682..b373cceb95d2 100644
> >>>> --- a/drivers/ata/libata-eh.c
> >>>> +++ b/drivers/ata/libata-eh.c
> >>>> @@ -640,12 +640,28 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, struct ata_port *ap,
> >>>>   		set_host_byte(scmd, DID_OK);
> >>>>   
> >>>>   		ata_qc_for_each_raw(ap, qc, i) {
> >>>> -			if (qc->flags & ATA_QCFLAG_ACTIVE &&
> >>>> -			    qc->scsicmd == scmd)
> >>>> +			if (qc->scsicmd != scmd)
> >>>> +				continue;
> >>>> +			if ((qc->flags & ATA_QCFLAG_ACTIVE) ||
> >>>> +			    qc == ap->deferred_qc)
> >>>>   				break;
> >>>>   		}
> >>>>   
> >>>> -		if (i < ATA_MAX_QUEUE) {
> >>>> +		if (qc == ap->deferred_qc) {
> >>>
> >>> An experimental AI code review agent tagged this patch with the following
> >>> comment.
> >>
> >> Thanks for that.
> >>
> >>>    If the `ata_qc_for_each_raw()` loop finishes without finding a matching `scmd`,
> >>>    `qc` will hold a pointer to the last element examined (`i == ATA_MAX_QUEUE`).
> >>>    If this last element happens to be `ap->deferred_qc`, the condition
> >>>    `qc == ap->deferred_qc` evaluates to true despite the loop not breaking on a
> >>>    match.
> >>>
> >>>    Could this mistakenly intercept a command that completed normally after a SCSI
> >>>    timeout, returning a timeout error instead of success? Would this also
> >>>    incorrectly clear `ap->deferred_qc`, dropping the deferred command?
> >>>    Should we verify that the loop actually found a match, for instance by checking
> >>>    `if (i < ATA_MAX_QUEUE && qc == ap->deferred_qc)`?
> >>
> >> Yeah. Something like this. But that condition cannot actually happen. i ==
> >> ATA_MAX_QUEUE corresponds to internal QCs, and these never go through the
> >> deferred path.
> >>
> > 
> > Pardon my ignorance, but doesn't i == ATA_MAX_QUEUE mean that qc points
> > to the _previously_ examined qc, i.e., the one associated with i ==
> > ATA_MAX_QUEUE - 1 ? Are you saying that this qc will always be
> > internal and never match ap->deferred_qc ?
> 
> Arg !!!! Yes ! I screwed up. Checking ata_qc_for_each_raw(), the tag is checked
> before setting the qc pointer. So yes, you are absolutely correct, and we have a
> problem.
> 
> >>> It does seem to be a real problem to me, but I don't know the code well
> >>> enough to be sure. Please take a look and let me know if the problem is
> >>> real. If so, I'll be happy to submit a patch to fix it. If not, please let
> >>> me know what the agent is missing.
> >>
> >> See above. That is not a real problem, but it would still be good to check. I
> >> think your change is fine. Care to send a proper patch ?
> >>
> > 
> > I can try, but if the AI analysis is missing the point I don't really know
> > how to describe/explain it. Can you give me some guidance ?
> 
> I think the AI is correct. That "if" without the index check can lead to false
> positives and we can end up timeout-failing a deferred qc that has not timed out yet.
> 
> Send a patch please. As mentioned, we can touch-up the commit message if needed.
> 
Working on it.

Thanks,
Guenter


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:59         ` Damien Le Moal
@ 2026-03-06  8:23           ` Niklas Cassel
  0 siblings, 0 replies; 21+ messages in thread
From: Niklas Cassel @ 2026-03-06  8:23 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Guenter Roeck, linux-ide

On Fri, Mar 06, 2026 at 09:59:38AM +0900, Damien Le Moal wrote:
> On 3/6/26 09:11, Damien Le Moal wrote:
> >>> Could this mistakenly intercept a command that completed normally after a
> >>> SCSI timeout, returning a timeout error instead of success? Would this
> >>> also incorrectly clear `ap->deferred_qc`, dropping the deferred command?
> >>
> >> I think the AI is partially wrong here.
> >>
> >> If you read the comment below the if (), we know that ap->deferred_qc is only
> >> set until that command has been issued. So if it is set, that qc has not
> >> been issued, so it can't have successfully completed.
> > 
> > The request for the qc/scsi command was started from the block layer perspective
> > and so can still timeout. So this is all valid.
> > 
> > BUT ATA_MAX_QUEUE qc (last in the loop if there is no match) is the one reserved
> > for internal commands issued from EH. Internal QCs do not go through the
> > deferred issue path, so even without checking for the index when there is no
> > match, we have:
> >  - qc is still a valid pointer (the array of QCs is ATA_MAX_QUEUE + 1 sized)
> >  - We can never have qc == ap->deferred_qc.
> 
> This second part of my comment is obviously wrong. I need more coffee :)
> 
> The first part stands: a deferred qc that has not been issued can timeout.

Yes, did someone claim otherwise?


Kind regards,
Niklas


* Re: [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts
  2026-03-06  0:50         ` Damien Le Moal
  2026-03-06  1:31           ` Guenter Roeck
@ 2026-03-06  8:24           ` Niklas Cassel
  1 sibling, 0 replies; 21+ messages in thread
From: Niklas Cassel @ 2026-03-06  8:24 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Guenter Roeck, linux-ide

On Fri, Mar 06, 2026 at 09:50:49AM +0900, Damien Le Moal wrote:
> > 
> > I can try, but if the AI analysis is missing the point I don't really know
> > how to describe/explain it. Can you give me some guidance ?
> 
> I think the AI is correct. That "if" without the index check can lead to false
> positives and we can end up timeout-failing a deferred qc that has not timed out yet.

Which is exactly what I wrote yesterday:

>> However, if there was a timeout of a command, which was not the deferred QC,
>> but the deferred QC did not timeout, I think it is wrong to complete the
>> deferred QC as a failed command.
>>
>> So... I actually think that the change suggested by the AI is something we
>> want.

Glad that you came to the same conclusion after some coffee :)

(Yes, I did actually look at ata_qc_for_each_raw() before writing my reply
yesterday.)


Kind regards,
Niklas


end of thread, other threads:[~2026-03-06  8:24 UTC | newest]

Thread overview: 21+ messages
2026-02-20 22:14 [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
2026-02-20 22:14 ` [PATCH v2 1/2] ata: libata-eh: correctly handle deferred qc timeouts Damien Le Moal
2026-02-23 12:09   ` Hannes Reinecke
2026-02-23 17:48   ` Igor Pylypiv
2026-03-05 17:59   ` Guenter Roeck
2026-03-05 23:27     ` Niklas Cassel
2026-03-06  0:11       ` Damien Le Moal
2026-03-06  0:59         ` Damien Le Moal
2026-03-06  8:23           ` Niklas Cassel
2026-03-06  0:14       ` Guenter Roeck
2026-03-06  0:21         ` Damien Le Moal
2026-03-06  0:41           ` Guenter Roeck
2026-03-05 23:59     ` Damien Le Moal
2026-03-06  0:32       ` Guenter Roeck
2026-03-06  0:50         ` Damien Le Moal
2026-03-06  1:31           ` Guenter Roeck
2026-03-06  8:24           ` Niklas Cassel
2026-02-20 22:14 ` [PATCH v2 2/2] ata: libata-core: fix cancellation of a port deferred qc work Damien Le Moal
2026-02-23 12:09   ` Hannes Reinecke
2026-02-23 17:49   ` Igor Pylypiv
2026-02-24  0:39 ` [PATCH v2 0/2] ATA port deferred qc fixes Damien Le Moal
