public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed
From: Damien Le Moal <dlemoal@kernel.org>
To: linux-ide@vger.kernel.org, Niklas Cassel <cassel@kernel.org>
Cc: Igor Pylypiv <ipylypiv@google.com>,
	Xingui Yang <yangxingui@huawei.com>,
	John Garry <john.g.garry@oracle.com>
Subject: [PATCH v3 2/2] ata: libata-scsi: avoid passthrough command starvation
Date: Sat, 20 Dec 2025 09:21:40 +0900	[thread overview]
Message-ID: <20251220002140.148854-3-dlemoal@kernel.org> (raw)
In-Reply-To: <20251220002140.148854-1-dlemoal@kernel.org>

When a non-NCQ passthrough command is issued while NCQ commands are
being executed, ata_scsi_defer() indicates to ata_scsi_translate() that
ata_qc_issue() should not be called for the passthrough command, and
instead returns SCSI_MLQUEUE_XXX_BUSY to defer the command execution.
This command deferring is correct and as mandated by the ACS
specifications since NCQ and non-NCQ commands cannot be mixed.

However, in the case of a host adapter using multiple submission queues,
when the target device is under a constant load of NCQ read or write
commands, there are no guarantees that requeueing the non-NCQ command
will lead to it not being deferred again repeatedly, since other
submission queues can constantly issue NCQ commands from different CPUs
ahead of the non-NCQ command. This can lead to very long delays for the
execution of non-NCQ passthrough commands, and even complete starvation
in the worst case scenario.

Since the block layer and the SCSI layer do not distinguish between
queuable (NCQ) and non queueable (non-NCQ) commands, libata-scsi SAT
implementation must ensure forward progress for non-NCQ commands in the
presence of NCQ command traffic. This is similar to what SAS HBAs with a
hardware/firmware based implementation do.

Implement such forward progress guarantee by limiting requeueing of
non-NCQ commands: when a non-NCQ command is received and NCQ commands
are in-flight, do not force a requeue of the non-NCQ command by
returning SCSI_MLQUEUE_XXX_BUSY in ata_scsi_translate() and instead
hold on to the qc using the new deferred_qc field of struct ata_port.

This deferred qc will be issued using the work item deferred_qc_work
running the function ata_scsi_deferred_qc_work() once all in-flight
commands complete, which is checked with the port qc_defer() callback
indicating that no further delay is necessary. This check is done using
the helper function ata_scsi_schedule_deferred_qc() which is called from
ata_scsi_qc_complete(). This thus excludes this mechanism from all
internal non-NCQ commands issued by ATA EH.

When a port deferred_qc is non NULL, that is, the port has a command
waiting for the device queue to drain, the issuing of all incoming
commands (both NCQ and non-NCQ) is deferred using the regular busy
mechanism. This simplifies the code and also avoids potential denial of
service problems if a user issues too many non-NCQ passthrough commands.

Finally, whenever ata EH is scheduled, regardless of the reason, a
deferred qc is always requeued so that it can be retried once EH
completes. This is done by calling the function
ata_scsi_requeue_deferred_qc() from ata_eh_set_pending(). This avoids
the need for any special processing for the deferred qc in case of NCQ
error, link or device reset, or device timeout.

Reported-by: Xingui Yang <yangxingui@huawei.com>
Reported-by: Igor Pylypiv <ipylypiv@google.com>
Fixes: 42f22fe36d51 ("scsi: pm8001: Expose hardware queues for pm80xx")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/ata/libata-core.c |  1 +
 drivers/ata/libata-eh.c   |  6 +++
 drivers/ata/libata-scsi.c | 93 ++++++++++++++++++++++++++++++++++++++-
 drivers/ata/libata.h      |  2 +
 include/linux/libata.h    |  3 ++
 5 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 0b24bd169d61..121f35115d33 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5558,6 +5558,7 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
 	mutex_init(&ap->scsi_scan_mutex);
 	INIT_DELAYED_WORK(&ap->hotplug_task, ata_scsi_hotplug);
 	INIT_DELAYED_WORK(&ap->scsi_rescan_task, ata_scsi_dev_rescan);
+	INIT_WORK(&ap->deferred_qc_work, ata_scsi_deferred_qc_work);
 	INIT_LIST_HEAD(&ap->eh_done_q);
 	init_waitqueue_head(&ap->eh_wait_q);
 	init_completion(&ap->park_req_pending);
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2586e77ebf45..b90b17f680f8 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -917,6 +917,12 @@ static void ata_eh_set_pending(struct ata_port *ap, bool fastdrain)
 
 	ap->pflags |= ATA_PFLAG_EH_PENDING;
 
+	/*
+	 * If we have a deferred qc, requeue it so that it is retried once EH
+	 * completes.
+	 */
+	ata_scsi_requeue_deferred_qc(ap);
+
 	if (!fastdrain)
 		return;
 
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 8e04fc173ea3..e39837846566 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1658,8 +1658,73 @@ static void ata_qc_done(struct ata_queued_cmd *qc)
 	done(cmd);
 }
 
+void ata_scsi_deferred_qc_work(struct work_struct *work)
+{
+	struct ata_port *ap =
+		container_of(work, struct ata_port, deferred_qc_work);
+	struct ata_queued_cmd *qc;
+	unsigned long flags;
+
+	spin_lock_irqsave(ap->lock, flags);
+
+	/*
+	 * If we still have a deferred qc and we are not in EH, issue it. In
+	 * such case, we should not need any more deferring the qc, so warn if
+	 * qc_defer() says otherwise.
+	 */
+	qc = ap->deferred_qc;
+	if (qc && !ata_port_eh_scheduled(ap)) {
+		WARN_ON_ONCE(ap->ops->qc_defer(qc));
+		ap->deferred_qc = NULL;
+		ata_qc_issue(qc);
+	}
+
+	spin_unlock_irqrestore(ap->lock, flags);
+}
+
+void ata_scsi_requeue_deferred_qc(struct ata_port *ap)
+{
+	struct ata_queued_cmd *qc = ap->deferred_qc;
+	struct scsi_cmnd *scmd;
+
+	/*
+	 * If we have a deferred qc when a reset occurs or NCQ commands fail,
+	 * do not try to be smart about what to do with this deferred command
+	 * and simply retry it by completing it with DID_SOFT_ERROR.
+	 */
+	if (!qc)
+		return;
+
+	scmd = qc->scsicmd;
+	ap->deferred_qc = NULL;
+	ata_qc_free(qc);
+	scmd->result = (DID_SOFT_ERROR << 16);
+	scsi_done(scmd);
+}
+
+static void ata_scsi_schedule_deferred_qc(struct ata_port *ap)
+{
+	struct ata_queued_cmd *qc = ap->deferred_qc;
+
+	/*
+	 * If we have a deferred qc, then qc_defer() is defined and we can use
+	 * this callback to determine if this qc is good to go, unless EH has
+	 * been scheduled.
+	 */
+	if (!qc)
+		return;
+
+	if (ata_port_eh_scheduled(ap)) {
+		ata_scsi_requeue_deferred_qc(ap);
+		return;
+	}
+	if (!ap->ops->qc_defer(qc))
+		queue_work(system_highpri_wq, &ap->deferred_qc_work);
+}
+
 static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
 {
+	struct ata_port *ap = qc->ap;
 	struct scsi_cmnd *cmd = qc->scsicmd;
 	u8 *cdb = cmd->cmnd;
 	bool have_sense = qc->flags & ATA_QCFLAG_SENSE_VALID;
@@ -1689,12 +1754,24 @@ static void ata_scsi_qc_complete(struct ata_queued_cmd *qc)
 	}
 
 	ata_qc_done(qc);
+
+	ata_scsi_schedule_deferred_qc(ap);
 }
 
 static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
 {
 	int ret;
 
+	/*
+	 * If we already have a deferred qc, then rely on the SCSI layer to
+	 * requeue and defer all incoming commands until the deferred qc is
+	 * processed, once all on-going commands are completed.
+	 */
+	if (ap->deferred_qc) {
+		ata_qc_free(qc);
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+	}
+
 	if (!ap->ops->qc_defer)
 		return 0;
 
@@ -1702,6 +1779,17 @@ static int ata_scsi_defer(struct ata_port *ap, struct ata_queued_cmd *qc)
 	if (!ret)
 		return 0;
 
+	/*
+	 * We must defer this qc: if this is not an NCQ command, keep this qc
+	 * as a deferred one and wait for all on-going NCQ commands to complete
+	 * before issuing it with the deferred qc work.
+	 */
+	if (!ata_is_ncq(qc->tf.protocol)) {
+		ap->deferred_qc = qc;
+		return SCSI_MLQUEUE_DEVICE_BUSY;
+	}
+
+	/* Use the SCSI layer to requeue and defer the command. */
 	ata_qc_free(qc);
 
 	switch (ret) {
@@ -1777,8 +1865,11 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
 		goto done;
 
 	rc = ata_scsi_defer(ap, qc);
-	if (rc)
+	if (rc) {
+		if (qc == ap->deferred_qc)
+			return 0;
 		return rc;
+	}
 
 	ata_qc_issue(qc);
 
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 0e7ecac73680..60a675df61dc 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -165,6 +165,8 @@ void ata_scsi_sdev_config(struct scsi_device *sdev);
 int ata_scsi_dev_config(struct scsi_device *sdev, struct queue_limits *lim,
 		struct ata_device *dev);
 int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev);
+void ata_scsi_deferred_qc_work(struct work_struct *work);
+void ata_scsi_requeue_deferred_qc(struct ata_port *ap);
 
 /* libata-eh.c */
 extern unsigned int ata_internal_cmd_timeout(struct ata_device *dev, u8 cmd);
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 39534fafa36a..c5b27d97dfaf 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -903,6 +903,9 @@ struct ata_port {
 	u64			qc_active;
 	int			nr_active_links; /* #links with active qcs */
 
+	struct work_struct	deferred_qc_work;
+	struct ata_queued_cmd	*deferred_qc;
+
 	struct ata_link		link;		/* host default link */
 	struct ata_link		*slave_link;	/* see ata_slave_link_init() */
 
-- 
2.52.0


  parent reply	other threads:[~2025-12-20  0:26 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-20  0:21 [PATCH v3 0/2] Prevent non-NCQ command starvation Damien Le Moal
2025-12-20  0:21 ` [PATCH v3 1/2] ata: libata-scsi: refactor ata_scsi_translate() Damien Le Moal
2025-12-22 21:19   ` Igor Pylypiv
2025-12-30 13:40   ` Niklas Cassel
2025-12-20  0:21 ` Damien Le Moal [this message]
2025-12-22 21:28   ` [PATCH v3 2/2] ata: libata-scsi: avoid passthrough command starvation Igor Pylypiv
2025-12-28  2:25     ` Damien Le Moal
2025-12-23 11:17   ` yangxingui
2025-12-31 11:17   ` Niklas Cassel
2026-01-02  0:47     ` Damien Le Moal
2026-01-02 10:24       ` Niklas Cassel
2026-01-02 11:00         ` Damien Le Moal
2025-12-31 11:39   ` Niklas Cassel
2026-01-02  1:10     ` Damien Le Moal
2026-01-02  9:46       ` Niklas Cassel
2026-01-02 11:01         ` Damien Le Moal
2025-12-31 10:04 ` [PATCH v3 0/2] Prevent non-NCQ " Niklas Cassel
2026-01-02  1:14   ` Damien Le Moal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251220002140.148854-3-dlemoal@kernel.org \
    --to=dlemoal@kernel.org \
    --cc=cassel@kernel.org \
    --cc=ipylypiv@google.com \
    --cc=john.g.garry@oracle.com \
    --cc=linux-ide@vger.kernel.org \
    --cc=yangxingui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox