public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed
From: Henry Tseng <henrytseng@qnap.com>
To: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <cassel@kernel.org>,
	linux-ide@vger.kernel.org, Kevin Ko <kevinko@qnap.com>,
	SW Chen <swchen@qnap.com>, Henry Tseng <henrytseng@qnap.com>
Subject: [PATCH v2] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
Date: Fri, 21 Nov 2025 18:14:07 +0800	[thread overview]
Message-ID: <20251121101407.978550-1-henrytseng@qnap.com> (raw)
In-Reply-To: <27a5386e-1cc9-438b-bf05-41f7dcbb3f77@kernel.org>

When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
switch, hot-unplugging the whole enclosure causes pciehp to tear down
the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
START STOP UNIT for the disks.

libata still queues these commands and the AHCI driver tries to access
the HBA registers even though the PCI channel is already offline. This
results in a series of timeouts and error recovery attempts, e.g.:

  [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
  [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
  [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
  [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
  ...
  [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
  [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
  [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
  [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

With this patch applied, the same hot-unplug looks like:

  [   53.905168] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [   54.069159] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_OK driverbyte=DRIVER_OK
  ...
  [   54.330226] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_OK driverbyte=DRIVER_OK

In this test setup with two disks, the hot-unplug sequence shrinks from
about 226 seconds (~3.8 minutes) between the Link Down event and the
last SYNCHRONIZE CACHE failure to under a second. Without this patch the
total delay grows roughly with the number of disks, because each disk
gets its own SYNCHRONIZE CACHE and qc timeout series.

If the underlying PCI device is already gone, these commands cannot
succeed anyway. Avoid issuing them by introducing
ata_adapter_is_online(), which checks pci_channel_offline() for
PCI-based hosts and is used from ata_qc_issue() to bail out before
touching the HBA registers.

Since such failures would otherwise trigger libata error handling, we
also consult ata_adapter_is_online() from ata_scsi_port_error_handler().
When the adapter is offline, libata skips ap->ops->error_handler(ap)
and directly finishes error handling using the existing unload path,
instead of running a full EH sequence against a dead adapter.

With this change, SYNCHRONIZE CACHE and START STOP UNIT commands
issued during hot-unplug fail quickly once the PCI channel is offline,
without qc timeout spam or long libata EH delays.

Signed-off-by: Henry Tseng <henrytseng@qnap.com>
---
v2:
 - Move the PCI adapter state check into ata_adapter_is_online()
   implemented in libata-core.c, following your suggestion, 
   and call it from ata_qc_issue().
 - Also consult ata_adapter_is_online() from ata_scsi_port_error_handler() so that libata can
   skip ap->ops->error_handler(ap) once the adapter is offline, using the existing unload/suicide path
   rather than invoking a long EH sequence on a removed AHCI adapter.
 - This patch focuses on reducing libata EH delays and qc timeout spam when a Thunderbolt-attached
   DAS enclosure is unplugged after the file systems have been unmounted.
   It does not attempt to address general surprise-removal safety.
 - As a side effect, the SCSI result for these SYNCHRONIZE CACHE and START STOP UNIT commands
   changes from hostbyte=DID_BAD_TARGET to hostbyte=DID_OK with sense key "Aborted Command".
   From the SCSI/block layer perspective, the commands still fail, and the devices are being removed.
   If preserving DID_BAD_TARGET is preferred, the error mapping can be updated accordingly.

 drivers/ata/libata-core.c | 21 +++++++++++++++++++++
 drivers/ata/libata-eh.c   |  3 ++-
 drivers/ata/libata.h      |  1 +
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2a210719c4ce..0116dac3b4ec 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2358,6 +2358,24 @@ static bool ata_dev_check_adapter(struct ata_device *dev,
 	return false;
 }
 
+bool ata_adapter_is_online(struct ata_port *ap)
+{
+	struct device *dev;
+
+	if (!ap || !ap->host)
+		return false;
+
+	dev = ap->host->dev;
+	if (!dev)
+		return false;
+
+	if (dev_is_pci(dev) &&
+	    pci_channel_offline(to_pci_dev(dev)))
+		return false;
+
+	return true;
+}
+
 static int ata_dev_config_ncq(struct ata_device *dev,
 			       char *desc, size_t desc_sz)
 {
@@ -5060,6 +5078,9 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
 		if (ata_sg_setup(qc))
 			goto sys_err;
 
+	if (!ata_adapter_is_online(ap))
+		goto sys_err;
+
 	/* if device is sleeping, schedule reset and abort the link */
 	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
 		link->eh_info.action |= ATA_EH_RESET;
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2586e77ebf45..f4c9541d1910 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -736,7 +736,8 @@ void ata_scsi_port_error_handler(struct Scsi_Host *host, struct ata_port *ap)
 	spin_unlock_irqrestore(ap->lock, flags);
 
 	/* invoke EH, skip if unloading or suspended */
-	if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)))
+	if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)) &&
+	    ata_adapter_is_online(ap))
 		ap->ops->error_handler(ap);
 	else {
 		/* if unloading, commence suicide */
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index e5b977a8d3e1..a14cd588c2d4 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -94,6 +94,7 @@ extern int atapi_check_dma(struct ata_queued_cmd *qc);
 extern void swap_buf_le16(u16 *buf, unsigned int buf_words);
 extern bool ata_phys_link_online(struct ata_link *link);
 extern bool ata_phys_link_offline(struct ata_link *link);
+bool ata_adapter_is_online(struct ata_port *ap);
 extern void ata_dev_init(struct ata_device *dev);
 extern void ata_link_init(struct ata_port *ap, struct ata_link *link, int pmp);
 extern int sata_link_init_spd(struct ata_link *link);
-- 
2.43.0


  reply	other threads:[~2025-11-21 10:14 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-20  7:35 [PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
2025-11-20  8:30 ` Damien Le Moal
2025-11-21 10:14   ` Henry Tseng [this message]
2025-11-26  7:50     ` [PATCH v2] " Damien Le Moal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251121101407.978550-1-henrytseng@qnap.com \
    --to=henrytseng@qnap.com \
    --cc=cassel@kernel.org \
    --cc=dlemoal@kernel.org \
    --cc=kevinko@qnap.com \
    --cc=linux-ide@vger.kernel.org \
    --cc=swchen@qnap.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox