public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Prevent non-NCQ command starvation
@ 2025-12-17 23:17 Damien Le Moal
  2025-12-17 23:17 ` [PATCH 1/2] ata: libata-scsi: refactor ata_scsi_translate() Damien Le Moal
  2025-12-17 23:17 ` [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation Damien Le Moal
  0 siblings, 2 replies; 7+ messages in thread
From: Damien Le Moal @ 2025-12-17 23:17 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel; +Cc: Igor Pylypiv, Xingui Yang

This small patch series addresses potential command starvation issues
with non-NCQ passthrough commands issued to a device accessed through
a multi-queue libsas HBA. In such setup, a non-NCQ passthrough command
issued by an application may suffer long delays and even complete
starvation if the device is under constant NCQ read/write workload.

This issue is avoided by limiting the requeing through the block layer
and instead explicitly waiting for the device command queue to drain
before issuing the non-NCQ command. This is done reliably by not trying
to second guess ata EH in case of error or device reset, by always
requeuing a deferred command whenever EH is scheduled for the device
port.

When a non-ncq command is deferred waiting for the device queue to
drain, all new incoming commands are always requeued through the regular
busy mechanism, thus avoiding the need to manage an internal command
queue.

With this, the long latencies observed when executing non-NCQ
passthrough commands under a heavy NCQ read/write workload at high queue
depth is resolved: the non-NCQ command latency only depends on how long
it takes for all in-flight commands to complete.

Damien Le Moal (2):
  ata: libata-scsi: refactor ata_scsi_translate()
  ata: libata-scsi: avoid passthrough command starvation

 drivers/ata/libata-core.c |   1 +
 drivers/ata/libata-eh.c   |   6 ++
 drivers/ata/libata-scsi.c | 155 ++++++++++++++++++++++++++++++++------
 drivers/ata/libata.h      |   2 +
 include/linux/libata.h    |   3 +
 5 files changed, 142 insertions(+), 25 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] ata: libata-scsi: refactor ata_scsi_translate()
  2025-12-17 23:17 [PATCH 0/2] Prevent non-NCQ command starvation Damien Le Moal
@ 2025-12-17 23:17 ` Damien Le Moal
  2025-12-18  9:23   ` Hannes Reinecke
  2025-12-17 23:17 ` [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation Damien Le Moal
  1 sibling, 1 reply; 7+ messages in thread
From: Damien Le Moal @ 2025-12-17 23:17 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel; +Cc: Igor Pylypiv, Xingui Yang

Factor out of ata_scsi_translate() the code handling queued command
deferral using the port qc_defer callback into the new function
ata_scsi_defer(), and simplify the goto used in ata_scsi_translate().
While at it, also add a lockdep annotation to check that the port lock
is held when ata_scsi_translate() is called.

No functional changes.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/ata/libata-scsi.c | 66 ++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 721d3f270c8e..42d103542525 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1691,6 +1691,30 @@ static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
 	ata_qc_done(qc);
 }
 
+static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
+{
+	int ret;
+
+	if (!ap->ops->qc_defer)
+		return 0;
+
+	ret = ap->ops->qc_defer(qc);
+	if (!ret)
+		return 0;
+
+	ata_qc_free(qc);
+
+	switch (ret) {
+	case ATA_DEFER_LINK:
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+	case ATA_DEFER_PORT:
+		return SCSI_MLQUEUE_HOST_BUSY;
+	default:
+		WARN_ON_ONCE(1);
+		return SCSI_MLQUEUE_HOST_BUSY;
+	}
+}
+
 /**
  *	ata_scsi_translate - Translate then issue SCSI command to ATA device
  *	@dev: ATA device to which the command is addressed
@@ -1714,8 +1738,8 @@ static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
  *	spin_lock_irqsave(host lock)
  *
  *	RETURNS:
- *	0 on success, SCSI_ML_QUEUE_DEVICE_BUSY if the command
- *	needs to be deferred.
+ *	0 on success, SCSI_ML_QUEUE_DEVICE_BUSY or SCSI_MLQUEUE_HOST_BUSY if the
+ *	command needs to be deferred.
  */
 static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
 			      ata_xlat_func_t xlat_func)
@@ -1724,56 +1748,46 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
 	struct ata_queued_cmd *qc;
 	int rc;
 
+	lockdep_assert_held(ap->lock);
+
+	/*
+	 * If we fail to allocate a qc, simply return without reporting an error
+	 * as scsi_done(cmd) is already called.
+	 */
 	qc = ata_scsi_qc_new(dev, cmd);
 	if (!qc)
-		goto err_mem;
+		return 0;
 
 	/* data is present; dma-map it */
 	if (cmd->sc_data_direction == DMA_FROM_DEVICE ||
 	    cmd->sc_data_direction == DMA_TO_DEVICE) {
 		if (unlikely(scsi_bufflen(cmd) < 1)) {
 			ata_dev_warn(dev, "WARNING: zero len r/w req\n");
-			goto err_did;
+			cmd->result = (DID_ERROR << 16);
+			goto done;
 		}
 
 		ata_sg_init(qc, scsi_sglist(cmd), scsi_sg_count(cmd));
-
 		qc->dma_dir = cmd->sc_data_direction;
 	}
 
 	qc->complete_fn = ata_scsi_qc_complete;
 
 	if (xlat_func(qc))
-		goto early_finish;
+		goto done;
 
-	if (ap->ops->qc_defer) {
-		if ((rc = ap->ops->qc_defer(qc)))
-			goto defer;
-	}
+	rc = ata_scsi_defer(ap, qc);
+	if (rc)
+		return rc;
 
-	/* select device, send command to hardware */
 	ata_qc_issue(qc);
 
 	return 0;
 
-early_finish:
-	ata_qc_free(qc);
-	scsi_done(cmd);
-	return 0;
-
-err_did:
+done:
 	ata_qc_free(qc);
-	cmd->result = (DID_ERROR << 16);
 	scsi_done(cmd);
-err_mem:
 	return 0;
-
-defer:
-	ata_qc_free(qc);
-	if (rc == ATA_DEFER_LINK)
-		return SCSI_MLQUEUE_DEVICE_BUSY;
-	else
-		return SCSI_MLQUEUE_HOST_BUSY;
 }
 
 /**
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation
  2025-12-17 23:17 [PATCH 0/2] Prevent non-NCQ command starvation Damien Le Moal
  2025-12-17 23:17 ` [PATCH 1/2] ata: libata-scsi: refactor ata_scsi_translate() Damien Le Moal
@ 2025-12-17 23:17 ` Damien Le Moal
  2025-12-18  9:26   ` Hannes Reinecke
  2025-12-19  0:08   ` Igor Pylypiv
  1 sibling, 2 replies; 7+ messages in thread
From: Damien Le Moal @ 2025-12-17 23:17 UTC (permalink / raw)
  To: linux-ide, Niklas Cassel; +Cc: Igor Pylypiv, Xingui Yang

When a non-NCQ passthrough command is issued while NCQ commands are
being executed, ata_scsi_defer() indicates to ata_scsi_translate() that
ata_qc_issue() should not be called for the passthrough command, and
instead returns SCSI_MLQUEUE_XXX_BUSY to defer the command execution.
This command deferring is correct and as mandated by the ACS
specifications since NCQ and non-NCQ commands cannot be mixed.

However, in the case of a host adapter using multiple submission queues,
when the target device is under a constant load of NCQ read or write
commands, there are no guarantees that requeueing the non-NCQ command
will lead to it not being deferred again repeatedly, since other
submission queues can constantly issue NCQ commands from different CPUs
ahead of the non-NCQ command. This can lead to very long delays for the
execution of non-NCQ passthrough commands, and even complete starvation
in the worst case scenario.

Since the block layer and the SCSI layer do not distinguish between
queuable (NCQ) and non queueable (non-NCQ) commands, libata-scsi SAT
implementation must ensure forward progress for non-NCQ commands in the
presence of NCQ command traffic. This is similar to what SAS HBAs with a
hardware/firmware based implementation do.

Implement such forward progress guarantee by limiting requeueing of
non-NCQ commands: when a non-NCQ command is received and NCQ commands
are in-flight, do not force a requeue of the non-NCQ command by
returning SCSI_MLQUEUE_XXX_BUSY in ata_scsi_translate() and instead
hold on to the qc using the new deferred_qc field of struct ata_port.

This deferred qc will be issued using the work item deferred_qc_work
running the function ata_scsi_deferred_qc_work() once all in-flight
commands complete, which is checked with the port qc_defer() callback
indicating that no further delay is necessary. This check is done using
the helper function ata_scsi_schedule_deferred_qc() which is called from
ata_scsi_qc_complete(). This thus excludes this mechanism from all
internal non-NCQ commands issued by ATA EH.

When a port deferred_qc is non NULL, that is, the port has a command
waiting for the device queue to drain, the issuing of all incoming
commands (both NCQ and non-NCQ) is deferred using the regular busy
mechanism. This simplifies the code and also avoids potential denial of
service problems if a user issues too many non-NCQ passthrough commands.

Finally, whenever ata EH is scheduled, regardless of the reason, a
deferred qc is always requeued so that it can be retried once EH
completes. This is done by calling the function
ata_scsi_requeue_deferred_qc() from ata_eh_set_pending(). This avoids
the need for any special processing for the deferred qc in case of NCQ
error, link or device reset, or device timeout.

Reported-by: Xingui Yang <yangxingui@huawei.com>
Reported-by: Igor Pylypiv <ipylypiv@google.com>
Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/ata/libata-core.c |  1 +
 drivers/ata/libata-eh.c   |  6 +++
 drivers/ata/libata-scsi.c | 93 ++++++++++++++++++++++++++++++++++++++-
 drivers/ata/libata.h      |  2 +
 include/linux/libata.h    |  3 ++
 5 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 0b24bd169d61..121f35115d33 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5558,6 +5558,7 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
 	mutex_init(&ap->scsi_scan_mutex);
 	INIT_DELAYED_WORK(&ap->hotplug_task, ata_scsi_hotplug);
 	INIT_DELAYED_WORK(&ap->scsi_rescan_task, ata_scsi_dev_rescan);
+	INIT_WORK(&ap->deferred_qc_work, ata_scsi_deferred_qc_work);
 	INIT_LIST_HEAD(&ap->eh_done_q);
 	init_waitqueue_head(&ap->eh_wait_q);
 	init_completion(&ap->park_req_pending);
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2586e77ebf45..b90b17f680f8 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -917,6 +917,12 @@ static void ata_eh_set_pending(struct ata_port *ap, bool fastdrain)
 
 	ap->pflags |= ATA_PFLAG_EH_PENDING;
 
+	/*
+	 * If we have a deferred qc, requeue it so that it is retried once EH
+	 * completes.
+	 */
+	ata_scsi_requeue_deferred_qc(ap);
+
 	if (!fastdrain)
 		return;
 
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 42d103542525..c5ebd2bab356 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1658,8 +1658,75 @@ static void ata_qc_done(struct ata_queued_cmd *qc)
 	done(cmd);
 }
 
+void ata_scsi_deferred_qc_work(struct work_struct *work)
+{
+	struct ata_port *ap =
+		container_of(work, struct ata_port, deferred_qc_work);
+	struct ata_queued_cmd *qc;
+	unsigned long flags;
+
+	spin_lock_irqsave(ap->lock, flags);
+
+	/*
+	 * If we still have a deferred QC and we are not in EH, issue it. In
+	 * such case, we should not need any more deferring the QC, so warn if
+	 * qc_defer() says otherwise.
+	 */
+	qc = ap->deferred_qc;
+	if (qc && !ata_port_eh_scheduled(ap)) {
+		WARN_ON_ONCE(ap->ops->qc_defer(qc));
+		ap->deferred_qc = NULL;
+		ata_qc_issue(qc);
+	}
+
+	spin_unlock_irqrestore(ap->lock, flags);
+}
+
+void ata_scsi_requeue_deferred_qc(struct ata_port *ap)
+{
+	struct ata_queued_cmd *qc = ap->deferred_qc;
+
+	lockdep_assert_held(ap->lock);
+
+	/*
+	 * If we have a differed QC when a reset occurs or NCQ commands fail, do
+	 * not try to be smart about what to do with this deferred command and
+	 * simply retry it by completing it with DID_SOFT_ERROR.
+	 */
+	if (qc) {
+		struct scsi_cmnd *scmd = qc->scsicmd;
+
+		ap->deferred_qc = NULL;
+		ata_qc_free(qc);
+		scmd->result = (DID_SOFT_ERROR << 16);
+		scsi_done(scmd);
+	}
+}
+
+static void ata_scsi_schedule_deferred_qc(struct ata_port *ap)
+{
+	struct ata_queued_cmd *qc = ap->deferred_qc;
+
+	lockdep_assert_held(ap->lock);
+
+	/*
+	 * If we have a differed QC, then qc_defer() is defined and we can use
+	 * this callback to determine if this QC is good to go, unless EH has
+	 * been scheduled.
+	 */
+	if (qc) {
+		if (ata_port_eh_scheduled(ap)) {
+			ata_scsi_requeue_deferred_qc(ap);
+			return;
+		}
+		if (!ap->ops->qc_defer(qc))
+			queue_work(system_highpri_wq, &ap->deferred_qc_work);
+	}
+}
+
 static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
 {
+	struct ata_port *ap = qc->ap;
 	struct scsi_cmnd *cmd = qc->scsicmd;
 	u8 *cdb = cmd->cmnd;
 	bool have_sense = qc->flags & ATA_QCFLAG_SENSE_VALID;
@@ -1689,12 +1756,22 @@ static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
 	}
 
 	ata_qc_done(qc);
+
+	ata_scsi_schedule_deferred_qc(ap);
 }
 
 static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
 {
 	int ret;
 
+	/*
+	 * If we already have a deferred QC, then rely on the SCSI layer to
+	 * defer and requeue all incoming commands until the deferred QC is
+	 * processed, once all on-going commands are completed.
+	 */
+	if (ap->deferred_qc)
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+
 	if (!ap->ops->qc_defer)
 		return 0;
 
@@ -1702,6 +1779,17 @@ static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
 	if (!ret)
 		return 0;
 
+	/*
+	 * We must defer this QC: if this is not an NCQ command, keep this QC as
+	 * a deferred one and wait for all on-going NCQ commands to complete
+	 * before issuing it with the deferred QC work.
+	 */
+	if (!ata_is_ncq(qc->tf.protocol)) {
+		ap->deferred_qc = qc;
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+	}
+
+	/* Use the SCSI layer to defer and requeue the command. */
 	ata_qc_free(qc);
 
 	switch (ret) {
@@ -1777,8 +1865,11 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
 		goto done;
 
 	rc = ata_scsi_defer(ap, qc);
-	if (rc)
+	if (rc) {
+		if (qc == ap->deferred_qc)
+			return 0;
 		return rc;
+	}
 
 	ata_qc_issue(qc);
 
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 0e7ecac73680..60a675df61dc 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -165,6 +165,8 @@ void ata_scsi_sdev_config(struct scsi_device *sdev);
 int ata_scsi_dev_config(struct scsi_device *sdev, struct queue_limits *lim,
 		struct ata_device *dev);
 int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev);
+void ata_scsi_deferred_qc_work(struct work_struct *work);
+void ata_scsi_requeue_deferred_qc(struct ata_port *ap);
 
 /* libata-eh.c */
 extern unsigned int ata_internal_cmd_timeout(struct ata_device *dev, u8 cmd);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 39534fafa36a..c5b27d97dfaf 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -903,6 +903,9 @@ struct ata_port {
 	u64			qc_active;
 	int			nr_active_links; /* #links with active qcs */
 
+	struct work_struct	deferred_qc_work;
+	struct ata_queued_cmd	*deferred_qc;
+
 	struct ata_link		link;		/* host default link */
 	struct ata_link		*slave_link;	/* see ata_slave_link_init() */
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] ata: libata-scsi: refactor ata_scsi_translate()
  2025-12-17 23:17 ` [PATCH 1/2] ata: libata-scsi: refactor ata_scsi_translate() Damien Le Moal
@ 2025-12-18  9:23   ` Hannes Reinecke
  0 siblings, 0 replies; 7+ messages in thread
From: Hannes Reinecke @ 2025-12-18  9:23 UTC (permalink / raw)
  To: Damien Le Moal, linux-ide, Niklas Cassel; +Cc: Igor Pylypiv, Xingui Yang

On 12/18/25 00:17, Damien Le Moal wrote:
> Factor out of ata_scsi_translate() the code handling queued command
> deferral using the port qc_defer callback into the new function
> ata_scsi_defer(), and simplify the goto used in ata_scsi_translate().
> While at it, also add a lockdep annotation to check that the port lock
> is held when ata_scsi_translate() is called.
> 
> No functional changes.
> 
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/ata/libata-scsi.c | 66 ++++++++++++++++++++++++---------------
>   1 file changed, 40 insertions(+), 26 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation
  2025-12-17 23:17 ` [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation Damien Le Moal
@ 2025-12-18  9:26   ` Hannes Reinecke
  2025-12-19  0:08   ` Igor Pylypiv
  1 sibling, 0 replies; 7+ messages in thread
From: Hannes Reinecke @ 2025-12-18  9:26 UTC (permalink / raw)
  To: Damien Le Moal, linux-ide, Niklas Cassel; +Cc: Igor Pylypiv, Xingui Yang

On 12/18/25 00:17, Damien Le Moal wrote:
> When a non-NCQ passthrough command is issued while NCQ commands are
> being executed, ata_scsi_defer() indicates to ata_scsi_translate() that
> ata_qc_issue() should not be called for the passthrough command, and
> instead returns SCSI_MLQUEUE_XXX_BUSY to defer the command execution.
> This command deferring is correct and as mandated by the ACS
> specifications since NCQ and non-NCQ commands cannot be mixed.
> 
> However, in the case of a host adapter using multiple submission queues,
> when the target device is under a constant load of NCQ read or write
> commands, there are no guarantees that requeueing the non-NCQ command
> will lead to it not being deferred again repeatedly, since other
> submission queues can constantly issue NCQ commands from different CPUs
> ahead of the non-NCQ command. This can lead to very long delays for the
> execution of non-NCQ passthrough commands, and even complete starvation
> in the worst case scenario.
> 
> Since the block layer and the SCSI layer do not distinguish between
> queuable (NCQ) and non queueable (non-NCQ) commands, libata-scsi SAT
> implementation must ensure forward progress for non-NCQ commands in the
> presence of NCQ command traffic. This is similar to what SAS HBAs with a
> hardware/firmware based implementation do.
> 
> Implement such forward progress guarantee by limiting requeueing of
> non-NCQ commands: when a non-NCQ command is received and NCQ commands
> are in-flight, do not force a requeue of the non-NCQ command by
> returning SCSI_MLQUEUE_XXX_BUSY in ata_scsi_translate() and instead
> hold on to the qc using the new deferred_qc field of struct ata_port.
> 
> This deferred qc will be issued using the work item deferred_qc_work
> running the function ata_scsi_deferred_qc_work() once all in-flight
> commands complete, which is checked with the port qc_defer() callback
> indicating that no further delay is necessary. This check is done using
> the helper function ata_scsi_schedule_deferred_qc() which is called from
> ata_scsi_qc_complete(). This thus excludes this mechanism from all
> internal non-NCQ commands issued by ATA EH.
> 
> When a port deferred_qc is non NULL, that is, the port has a command
> waiting for the device queue to drain, the issuing of all incoming
> commands (both NCQ and non-NCQ) is deferred using the regular busy
> mechanism. This simplifies the code and also avoids potential denial of
> service problems if a user issues too many non-NCQ passthrough commands.
> 
> Finally, whenever ata EH is scheduled, regardless of the reason, a
> deferred qc is always requeued so that it can be retried once EH
> completes. This is done by calling the function
> ata_scsi_requeue_deferred_qc() from ata_eh_set_pending(). This avoids
> the need for any special processing for the deferred qc in case of NCQ
> error, link or device reset, or device timeout.
> 
> Reported-by: Xingui Yang <yangxingui@huawei.com>
> Reported-by: Igor Pylypiv <ipylypiv@google.com>
> Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/ata/libata-core.c |  1 +
>   drivers/ata/libata-eh.c   |  6 +++
>   drivers/ata/libata-scsi.c | 93 ++++++++++++++++++++++++++++++++++++++-
>   drivers/ata/libata.h      |  2 +
>   include/linux/libata.h    |  3 ++
>   5 files changed, 104 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation
  2025-12-17 23:17 ` [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation Damien Le Moal
  2025-12-18  9:26   ` Hannes Reinecke
@ 2025-12-19  0:08   ` Igor Pylypiv
  2025-12-19  7:47     ` Damien Le Moal
  1 sibling, 1 reply; 7+ messages in thread
From: Igor Pylypiv @ 2025-12-19  0:08 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: linux-ide, Niklas Cassel, Xingui Yang

On Thu, Dec 18, 2025 at 08:17:12AM +0900, Damien Le Moal wrote:
> When a non-NCQ passthrough command is issued while NCQ commands are
> being executed, ata_scsi_defer() indicates to ata_scsi_translate() that
> ata_qc_issue() should not be called for the passthrough command, and
> instead returns SCSI_MLQUEUE_XXX_BUSY to defer the command execution.
> This command deferring is correct and as mandated by the ACS
> specifications since NCQ and non-NCQ commands cannot be mixed.
> 
> However, in the case of a host adapter using multiple submission queues,
> when the target device is under a constant load of NCQ read or write
> commands, there are no guarantees that requeueing the non-NCQ command
> will lead to it not being deferred again repeatedly, since other
> submission queues can constantly issue NCQ commands from different CPUs
> ahead of the non-NCQ command. This can lead to very long delays for the
> execution of non-NCQ passthrough commands, and even complete starvation
> in the worst case scenario.
> 
> Since the block layer and the SCSI layer do not distinguish between
> queuable (NCQ) and non queueable (non-NCQ) commands, libata-scsi SAT
> implementation must ensure forward progress for non-NCQ commands in the
> presence of NCQ command traffic. This is similar to what SAS HBAs with a
> hardware/firmware based implementation do.
> 
> Implement such forward progress guarantee by limiting requeueing of
> non-NCQ commands: when a non-NCQ command is received and NCQ commands
> are in-flight, do not force a requeue of the non-NCQ command by
> returning SCSI_MLQUEUE_XXX_BUSY in ata_scsi_translate() and instead
> hold on to the qc using the new deferred_qc field of struct ata_port.
> 
> This deferred qc will be issued using the work item deferred_qc_work
> running the function ata_scsi_deferred_qc_work() once all in-flight
> commands complete, which is checked with the port qc_defer() callback
> indicating that no further delay is necessary. This check is done using
> the helper function ata_scsi_schedule_deferred_qc() which is called from
> ata_scsi_qc_complete(). This thus excludes this mechanism from all
> internal non-NCQ commands issued by ATA EH.
> 
> When a port deferred_qc is non NULL, that is, the port has a command
> waiting for the device queue to drain, the issuing of all incoming
> commands (both NCQ and non-NCQ) is deferred using the regular busy
> mechanism. This simplifies the code and also avoids potential denial of
> service problems if a user issues too many non-NCQ passthrough commands.
> 
> Finally, whenever ata EH is scheduled, regardless of the reason, a
> deferred qc is always requeued so that it can be retried once EH
> completes. This is done by calling the function
> ata_scsi_requeue_deferred_qc() from ata_eh_set_pending(). This avoids
> the need for any special processing for the deferred qc in case of NCQ
> error, link or device reset, or device timeout.
> 

Thank you very much for fixing this, Damien! I'll run some tests and add
my Tested-by if everything looks good.

> Reported-by: Xingui Yang <yangxingui@huawei.com>
> Reported-by: Igor Pylypiv <ipylypiv@google.com>
> Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  drivers/ata/libata-core.c |  1 +
>  drivers/ata/libata-eh.c   |  6 +++
>  drivers/ata/libata-scsi.c | 93 ++++++++++++++++++++++++++++++++++++++-
>  drivers/ata/libata.h      |  2 +
>  include/linux/libata.h    |  3 ++
>  5 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 0b24bd169d61..121f35115d33 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -5558,6 +5558,7 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
>  	mutex_init(&ap->scsi_scan_mutex);
>  	INIT_DELAYED_WORK(&ap->hotplug_task, ata_scsi_hotplug);
>  	INIT_DELAYED_WORK(&ap->scsi_rescan_task, ata_scsi_dev_rescan);
> +	INIT_WORK(&ap->deferred_qc_work, ata_scsi_deferred_qc_work);
>  	INIT_LIST_HEAD(&ap->eh_done_q);
>  	init_waitqueue_head(&ap->eh_wait_q);
>  	init_completion(&ap->park_req_pending);
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 2586e77ebf45..b90b17f680f8 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -917,6 +917,12 @@ static void ata_eh_set_pending(struct ata_port *ap, bool fastdrain)
>  
>  	ap->pflags |= ATA_PFLAG_EH_PENDING;
>  
> +	/*
> +	 * If we have a deferred qc, requeue it so that it is retried once EH
> +	 * completes.
> +	 */
> +	ata_scsi_requeue_deferred_qc(ap);
> +
>  	if (!fastdrain)
>  		return;
>  
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 42d103542525..c5ebd2bab356 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1658,8 +1658,75 @@ static void ata_qc_done(struct ata_queued_cmd *qc)
>  	done(cmd);
>  }
>  
> +void ata_scsi_deferred_qc_work(struct work_struct *work)
> +{
> +	struct ata_port *ap =
> +		container_of(work, struct ata_port, deferred_qc_work);
> +	struct ata_queued_cmd *qc;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(ap->lock, flags);
> +
> +	/*
> +	 * If we still have a deferred QC and we are not in EH, issue it. In
> +	 * such case, we should not need any more deferring the QC, so warn if
> +	 * qc_defer() says otherwise.
> +	 */
> +	qc = ap->deferred_qc;
> +	if (qc && !ata_port_eh_scheduled(ap)) {
> +		WARN_ON_ONCE(ap->ops->qc_defer(qc));
> +		ap->deferred_qc = NULL;
> +		ata_qc_issue(qc);
> +	}
> +
> +	spin_unlock_irqrestore(ap->lock, flags);
> +}
> +
> +void ata_scsi_requeue_deferred_qc(struct ata_port *ap)
> +{
> +	struct ata_queued_cmd *qc = ap->deferred_qc;
> +
> +	lockdep_assert_held(ap->lock);
> +
> +	/*
> +	 * If we have a differed QC when a reset occurs or NCQ commands fail, do
> +	 * not try to be smart about what to do with this deferred command and
> +	 * simply retry it by completing it with DID_SOFT_ERROR.
> +	 */
> +	if (qc) {
> +		struct scsi_cmnd *scmd = qc->scsicmd;
> +
> +		ap->deferred_qc = NULL;
> +		ata_qc_free(qc);
> +		scmd->result = (DID_SOFT_ERROR << 16);
> +		scsi_done(scmd);
> +	}
> +}
> +
> +static void ata_scsi_schedule_deferred_qc(struct ata_port *ap)
> +{
> +	struct ata_queued_cmd *qc = ap->deferred_qc;
> +
> +	lockdep_assert_held(ap->lock);
> +
> +	/*
> +	 * If we have a differed QC, then qc_defer() is defined and we can use
> +	 * this callback to determine if this QC is good to go, unless EH has
> +	 * been scheduled.
> +	 */
> +	if (qc) {
> +		if (ata_port_eh_scheduled(ap)) {
> +			ata_scsi_requeue_deferred_qc(ap);
> +			return;
> +		}
> +		if (!ap->ops->qc_defer(qc))
> +			queue_work(system_highpri_wq, &ap->deferred_qc_work);
> +	}
> +}
> +
>  static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
>  {
> +	struct ata_port *ap = qc->ap;
>  	struct scsi_cmnd *cmd = qc->scsicmd;
>  	u8 *cdb = cmd->cmnd;
>  	bool have_sense = qc->flags & ATA_QCFLAG_SENSE_VALID;
> @@ -1689,12 +1756,22 @@ static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
>  	}
>  
>  	ata_qc_done(qc);
> +
> +	ata_scsi_schedule_deferred_qc(ap);
>  }
>  
>  static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
>  {
>  	int ret;
>  
> +	/*
> +	 * If we already have a deferred QC, then rely on the SCSI layer to
> +	 * defer and requeue all incoming commands until the deferred QC is
> +	 * processed, once all on-going commands are completed.
> +	 */
> +	if (ap->deferred_qc)
> +		return SCSI_MLQUEUE_DEVICE_BUSY;

Should we free qc here? ata_scsi_translate() does not free qc if
ata_scsi_defer() returns a non-zero value.

> +
>  	if (!ap->ops->qc_defer)
>  		return 0;
>  
> @@ -1702,6 +1779,17 @@ static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
>  	if (!ret)
>  		return 0;
>  
> +	/*
> +	 * We must defer this QC: if this is not an NCQ command, keep this QC as
> +	 * a deferred one and wait for all on-going NCQ commands to complete
> +	 * before issuing it with the deferred QC work.
> +	 */
> +	if (!ata_is_ncq(qc->tf.protocol)) {
> +		ap->deferred_qc = qc;
> +		return SCSI_MLQUEUE_DEVICE_BUSY;
> +	}
> +
> +	/* Use the SCSI layer to defer and requeue the command. */
>  	ata_qc_free(qc);
>  
>  	switch (ret) {
> @@ -1777,8 +1865,11 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
>  		goto done;
>  
>  	rc = ata_scsi_defer(ap, qc);
> -	if (rc)
> +	if (rc) {
> +		if (qc == ap->deferred_qc)
> +			return 0;
>  		return rc;
> +	}
>  
>  	ata_qc_issue(qc);
>  
> diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
> index 0e7ecac73680..60a675df61dc 100644
> --- a/drivers/ata/libata.h
> +++ b/drivers/ata/libata.h
> @@ -165,6 +165,8 @@ void ata_scsi_sdev_config(struct scsi_device *sdev);
>  int ata_scsi_dev_config(struct scsi_device *sdev, struct queue_limits *lim,
>  		struct ata_device *dev);
>  int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev);
> +void ata_scsi_deferred_qc_work(struct work_struct *work);
> +void ata_scsi_requeue_deferred_qc(struct ata_port *ap);
>  
>  /* libata-eh.c */
>  extern unsigned int ata_internal_cmd_timeout(struct ata_device *dev, u8 cmd);
> diff --git a/include/linux/libata.h b/include/linux/libata.h
> index 39534fafa36a..c5b27d97dfaf 100644
> --- a/include/linux/libata.h
> +++ b/include/linux/libata.h
> @@ -903,6 +903,9 @@ struct ata_port {
>  	u64			qc_active;
>  	int			nr_active_links; /* #links with active qcs */
>  
> +	struct work_struct	deferred_qc_work;
> +	struct ata_queued_cmd	*deferred_qc;
> +
>  	struct ata_link		link;		/* host default link */
>  	struct ata_link		*slave_link;	/* see ata_slave_link_init() */
>  
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation
  2025-12-19  0:08   ` Igor Pylypiv
@ 2025-12-19  7:47     ` Damien Le Moal
  0 siblings, 0 replies; 7+ messages in thread
From: Damien Le Moal @ 2025-12-19  7:47 UTC (permalink / raw)
  To: Igor Pylypiv; +Cc: linux-ide, Niklas Cassel, Xingui Yang

On 12/19/25 09:08, Igor Pylypiv wrote:
>>  static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
>>  {
>>  	int ret;
>>  
>> +	/*
>> +	 * If we already have a deferred QC, then rely on the SCSI layer to
>> +	 * defer and requeue all incoming commands until the deferred QC is
>> +	 * processed, once all on-going commands are completed.
>> +	 */
>> +	if (ap->deferred_qc)
>> +		return SCSI_MLQUEUE_DEVICE_BUSY;
> 
> Should we free qc here? ata_scsi_translate() does not free qc if
> ata_scsi_defer() returns a non-zero value.

Good catch ! Sending V2.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-12-19  7:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-17 23:17 [PATCH 0/2] Prevent non-NCQ command starvation Damien Le Moal
2025-12-17 23:17 ` [PATCH 1/2] ata: libata-scsi: refactor ata_scsi_translate() Damien Le Moal
2025-12-18  9:23   ` Hannes Reinecke
2025-12-17 23:17 ` [PATCH 2/2] ata: libata-scsi: avoid passthrough command starvation Damien Le Moal
2025-12-18  9:26   ` Hannes Reinecke
2025-12-19  0:08   ` Igor Pylypiv
2025-12-19  7:47     ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox