[PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS

public inbox for linux-ide@vger.kernel.org

* [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
@ 2025-12-01  9:46 Henry Tseng
  2025-12-02  2:44 ` Damien Le Moal
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Henry Tseng @ 2025-12-01  9:46 UTC
  To: Damien Le Moal; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen, Henry Tseng

When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
switch, hot-unplugging the whole enclosure causes pciehp to tear down
the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
START STOP UNIT for the disks.

libata still queues these commands and the AHCI driver tries to access
the HBA registers even though the PCI channel is already offline. This
results in a series of timeouts and error recovery attempts, e.g.:

  [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
  [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
  [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
  [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
  ...
  [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
  [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
  [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
  [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

With this patch applied, the same hot-unplug looks like:

  [   59.965496] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [   60.002502] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
  ...
  [   60.103050] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

In this test setup with two disks, the hot-unplug sequence shrinks from
about 226 seconds (~3.8 minutes) between the Link Down event and the
last SYNCHRONIZE CACHE failure to under a second. Without this patch the
total delay grows roughly with the number of disks, because each disk
gets its own SYNCHRONIZE CACHE and qc timeout series.
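
As a rough check of the figures above, the following standalone C sketch
recomputes the elapsed times from the dmesg timestamps quoted in this
message. The helper name elapsed_s() is made up for this illustration and
is not part of the patch.

```c
#include <stdio.h>

/* Hypothetical helper: seconds between two dmesg timestamps. */
static double elapsed_s(double start, double end)
{
	return end - start;
}

/* Print both deltas; the timestamps are copied from the logs quoted above. */
static void print_unplug_times(void)
{
	/* Link Down to the last Synchronize Cache failure, without the patch. */
	double before = elapsed_s(824.778346, 1050.890035);
	/* Same interval with the patch applied. */
	double after = elapsed_s(59.965496, 60.103050);

	printf("without patch: %.1f s (%.1f min)\n", before, before / 60.0);
	printf("with patch:    %.3f s\n", after);
}
```

Calling print_unplug_times() from a small main() is enough to reproduce
the two intervals for this two-disk setup.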

If the underlying PCI device is already gone, these commands cannot
succeed anyway. Avoid issuing them by introducing
ata_adapter_is_online(), which checks pci_channel_offline() for
PCI-based hosts. It is used from ata_scsi_find_dev() to return NULL,
causing the SCSI layer to fail new commands with DID_BAD_TARGET
immediately, and from ata_qc_issue() to bail out before touching the
HBA registers.
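
The gating described above can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions: every struct and
function name prefixed with model_ is hypothetical, the two booleans stand
in for dev_is_pci() and pci_channel_offline(), and none of this is the
kernel code in the diff below.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for kernel objects; every name here is hypothetical. */
struct model_host {
	bool is_pci;          /* stands in for dev_is_pci() */
	bool channel_offline; /* stands in for pci_channel_offline() */
};

struct model_port {
	struct model_host *host;
};

struct model_dev {
	int id;
};

/* Same shape as the check in the patch: non-PCI hosts count as online. */
static bool model_adapter_is_online(const struct model_port *ap)
{
	if (!ap || !ap->host)
		return false;
	if (ap->host->is_pci && ap->host->channel_offline)
		return false;
	return true;
}

/* Lookup gate: an offline adapter makes the device look absent. */
static struct model_dev *model_find_dev(const struct model_port *ap,
					struct model_dev *dev)
{
	if (!model_adapter_is_online(ap))
		return NULL;
	return dev;
}
```

In this model a NULL result from the lookup is what lets the caller fail
the command right away instead of queueing it against a dead adapter.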

Since such failures would otherwise trigger libata error handling,
ata_adapter_is_online() is also consulted from ata_scsi_port_error_handler().
When the adapter is offline, libata skips ap->ops->error_handler(ap) and
completes error handling using the existing path, rather than running
a full EH sequence against a dead adapter.
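
The EH decision described above reduces to a small predicate. The sketch
below is a userspace illustration only: model_should_run_eh() and the
MODEL_PFLAG_* values are hypothetical and do not match the kernel's
ATA_PFLAG_* bits.

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch only. */
#define MODEL_PFLAG_UNLOADING (1u << 0)
#define MODEL_PFLAG_SUSPENDED (1u << 1)

/*
 * Run the port error handler only if the port is neither unloading nor
 * suspended, and the adapter is still reachable.
 */
static bool model_should_run_eh(unsigned int pflags, bool adapter_online)
{
	if (pflags & (MODEL_PFLAG_UNLOADING | MODEL_PFLAG_SUSPENDED))
		return false;
	return adapter_online;
}
```

When the predicate is false, the model corresponds to taking the existing
else branch, so no new completion path is needed.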

With this change, SYNCHRONIZE CACHE and START STOP UNIT commands
issued during hot-unplug fail quickly once the PCI channel is offline,
without qc timeout spam or long libata EH delays.

Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Henry Tseng <henrytseng@qnap.com>
---
Changes from v2:
 - Add Suggested-by tag.
 - In ata_qc_issue(), move the ata_adapter_is_online() check earlier and
   set the error mask to AC_ERR_HOST_BUS as suggested.
 - Add ata_adapter_is_online() check to ata_scsi_find_dev() to fail
   commands with DID_BAD_TARGET immediately when unplugged, as suggested.

Note: Thanks to Damien Le Moal for the detailed suggestions regarding the
check locations and error codes.

Changes from v1:
 - Move the PCI adapter state check into ata_adapter_is_online()
   implemented in libata-core.c.
 - Also consult ata_adapter_is_online() from ata_scsi_port_error_handler()
   to skip unnecessary error handling on dead adapters.

 drivers/ata/libata-core.c | 24 ++++++++++++++++++++++++
 drivers/ata/libata-eh.c   |  3 ++-
 drivers/ata/libata-scsi.c |  3 +++
 drivers/ata/libata.h      |  1 +
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2a210719c4ce..51090dbbe1e8 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2358,6 +2358,24 @@ static bool ata_dev_check_adapter(struct ata_device *dev,
 	return false;
 }
 
+bool ata_adapter_is_online(struct ata_port *ap)
+{
+	struct device *dev;
+
+	if (!ap || !ap->host)
+		return false;
+
+	dev = ap->host->dev;
+	if (!dev)
+		return false;
+
+	if (dev_is_pci(dev) &&
+	    pci_channel_offline(to_pci_dev(dev)))
+		return false;
+
+	return true;
+}
+
 static int ata_dev_config_ncq(struct ata_device *dev,
 			       char *desc, size_t desc_sz)
 {
@@ -5048,6 +5066,12 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
 	qc->flags |= ATA_QCFLAG_ACTIVE;
 	ap->qc_active |= 1ULL << qc->tag;
 
+	/* Make sure the device is still accessible. */
+	if (!ata_adapter_is_online(ap)) {
+		qc->err_mask |= AC_ERR_HOST_BUS;
+		goto sys_err;
+	}
+
 	/*
 	 * We guarantee to LLDs that they will have at least one
 	 * non-zero sg if the command is a data command.
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2586e77ebf45..f4c9541d1910 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -736,7 +736,8 @@ void ata_scsi_port_error_handler(struct Scsi_Host *host, struct ata_port *ap)
 	spin_unlock_irqrestore(ap->lock, flags);
 
 	/* invoke EH, skip if unloading or suspended */
-	if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)))
+	if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)) &&
+	    ata_adapter_is_online(ap))
 		ap->ops->error_handler(ap);
 	else {
 		/* if unloading, commence suicide */
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index b43a3196e2be..d8a43a680f93 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2974,6 +2974,9 @@ ata_scsi_find_dev(struct ata_port *ap, const struct scsi_device *scsidev)
 {
 	struct ata_device *dev = __ata_scsi_find_dev(ap, scsidev);
 
+	if (!ata_adapter_is_online(ap))
+		return NULL;
+
 	if (unlikely(!dev || !ata_dev_enabled(dev)))
 		return NULL;
 
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index e5b977a8d3e1..a14cd588c2d4 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -94,6 +94,7 @@ extern int atapi_check_dma(struct ata_queued_cmd *qc);
 extern void swap_buf_le16(u16 *buf, unsigned int buf_words);
 extern bool ata_phys_link_online(struct ata_link *link);
 extern bool ata_phys_link_offline(struct ata_link *link);
+bool ata_adapter_is_online(struct ata_port *ap);
 extern void ata_dev_init(struct ata_device *dev);
 extern void ata_link_init(struct ata_port *ap, struct ata_link *link, int pmp);
 extern int sata_link_init_spd(struct ata_link *link);
-- 
2.43.0



* Re: [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
  2025-12-01  9:46 [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
@ 2025-12-02  2:44 ` Damien Le Moal
  2025-12-02  7:45 ` Niklas Cassel
  2025-12-08  3:47 ` Damien Le Moal
  2 siblings, 0 replies; 4+ messages in thread
From: Damien Le Moal @ 2025-12-02  2:44 UTC
  To: Henry Tseng; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen

On 12/1/25 18:46, Henry Tseng wrote:
> Suggested-by: Damien Le Moal <dlemoal@kernel.org>
> Signed-off-by: Henry Tseng <henrytseng@qnap.com>

Looks good.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
  2025-12-01  9:46 [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
  2025-12-02  2:44 ` Damien Le Moal
@ 2025-12-02  7:45 ` Niklas Cassel
  2025-12-08  3:47 ` Damien Le Moal
  2 siblings, 0 replies; 4+ messages in thread
From: Niklas Cassel @ 2025-12-02  7:45 UTC
  To: Henry Tseng; +Cc: Damien Le Moal, linux-ide, Kevin Ko, SW Chen

On Mon, Dec 01, 2025 at 05:46:22PM +0800, Henry Tseng wrote:
> Suggested-by: Damien Le Moal <dlemoal@kernel.org>
> Signed-off-by: Henry Tseng <henrytseng@qnap.com>

Thanks for the patch!

Since the merge window is already open, and since this is not a strict bug
fix, this will be queued up after v6.19-rc1 is out (targeting v7.0 / v6.20,
whichever Linus decides to name it).


Kind regards,
Niklas


* Re: [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
  2025-12-01  9:46 [PATCH v3] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
  2025-12-02  2:44 ` Damien Le Moal
  2025-12-02  7:45 ` Niklas Cassel
@ 2025-12-08  3:47 ` Damien Le Moal
  2 siblings, 0 replies; 4+ messages in thread
From: Damien Le Moal @ 2025-12-08  3:47 UTC
  To: Henry Tseng; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen

On 12/1/25 6:46 PM, Henry Tseng wrote:
> Suggested-by: Damien Le Moal <dlemoal@kernel.org>
> Signed-off-by: Henry Tseng <henrytseng@qnap.com>

Applied to for-6.20. Thanks!

-- 
Damien Le Moal
Western Digital Research
