[PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS

public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
@ 2025-11-20  7:35 Henry Tseng
  2025-11-20  8:30 ` Damien Le Moal
  0 siblings, 1 reply; 4+ messages in thread
From: Henry Tseng @ 2025-11-20  7:35 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen, Henry Tseng

When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
switch, hot-unplugging the whole enclosure causes pciehp to tear down
the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
START STOP UNIT for the disks.

libata still queues these commands and the AHCI driver tries to access
the HBA registers even though the PCI channel is already offline. This
results in a series of timeouts and error recovery attempts, e.g.:

  [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
  [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
  [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
  [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
  ...
  [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
  [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
  [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
  [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

With this patch applied, the same hot-unplug looks like:

  [   55.452526] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [   55.496563] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
  [   55.617450] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

In this test setup with two disks, the hot-unplug sequence shrinks from
about 226 seconds (~3.8 minutes) between the Link Down event and the
last SYNCHRONIZE CACHE failure to under 200 ms. Without this patch the
total delay grows roughly with the number of disks, because each disk
gets its own SYNCHRONIZE CACHE and qc timeout series.

If the underlying PCI device is already gone, these commands cannot
succeed anyway. Avoid issuing them by introducing
ata_port_pci_channel_offline(), which checks pci_channel_offline() for
PCI-based hosts and is used in both ata_scsi_queuecmd() and
ata_qc_issue().

For SCSI-originated commands we now complete the request immediately
with DID_BAD_TARGET instead of queuing it. For internal ATA commands
we fail the qc with AC_ERR_SYSTEM early without touching the HBA
registers.

With this change, SYNCHRONIZE CACHE issued during hot-unplug fails
quickly with DID_BAD_TARGET, without qc timeout spam, and the whole
unplug completes much faster.

Signed-off-by: Henry Tseng <henrytseng@qnap.com>
---
 drivers/ata/libata-core.c |  3 +++
 drivers/ata/libata-scsi.c |  3 ++-
 drivers/ata/libata.h      | 16 ++++++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2a210719c4ce..49a59a90e8eb 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5060,6 +5060,9 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
 		if (ata_sg_setup(qc))
 			goto sys_err;
 
+	if (ata_port_pci_channel_offline(ap))
+		goto sys_err;
+
 	/* if device is sleeping, schedule reset and abort the link */
 	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
 		link->eh_info.action |= ATA_EH_RESET;
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index b43a3196e2be..a49fd4bbf321 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -4388,7 +4388,8 @@ int ata_scsi_queuecmd(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
 	spin_lock_irqsave(ap->lock, irq_flags);
 
 	dev = ata_scsi_find_dev(ap, scsidev);
-	if (likely(dev))
+
+	if (likely(dev) && !ata_port_pci_channel_offline(ap))
 		rc = __ata_scsi_queuecmd(cmd, dev);
 	else {
 		cmd->result = (DID_BAD_TARGET << 16);
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index e5b977a8d3e1..ff2eb824e51b 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -12,6 +12,8 @@
 #ifndef __LIBATA_H__
 #define __LIBATA_H__
 
+#include <linux/pci.h>
+
 #define DRV_NAME	"libata"
 #define DRV_VERSION	"3.00"	/* must be exactly four chars */
 
@@ -56,6 +58,20 @@ static inline bool ata_port_eh_scheduled(struct ata_port *ap)
 	return ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS);
 }
 
+static inline bool ata_port_pci_channel_offline(struct ata_port *ap)
+{
+	struct device *dev;
+
+	if (!ap || !ap->host)
+		return false;
+
+	dev = ap->host->dev;
+	if (!dev || !dev_is_pci(dev))
+		return false;
+
+	return pci_channel_offline(to_pci_dev(dev));
+}
+
 #ifdef CONFIG_ATA_FORCE
 extern void ata_force_cbl(struct ata_port *ap);
 #else
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
  2025-11-20  7:35 [PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
@ 2025-11-20  8:30 ` Damien Le Moal
  2025-11-21 10:14   ` [PATCH v2] " Henry Tseng
  0 siblings, 1 reply; 4+ messages in thread
From: Damien Le Moal @ 2025-11-20  8:30 UTC (permalink / raw)
  To: Henry Tseng; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen

On 11/20/25 4:35 PM, Henry Tseng wrote:
> When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
> switch, hot-unplugging the whole enclosure causes pciehp to tear down
> the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
> START STOP UNIT for the disks.
> 
> libata still queues these commands and the AHCI driver tries to access
> the HBA registers even though the PCI channel is already offline. This
> results in a series of timeouts and error recovery attempts, e.g.:
> 
>   [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
>   [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
>   [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
>   [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
>   [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>   ...
>   [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
>   [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
>   [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
>   [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> 
> With this patch applied, the same hot-unplug looks like:
> 
>   [   55.452526] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
>   [   55.496563] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>   [   55.617450] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> 
> In this test setup with two disks, the hot-unplug sequence shrinks from
> about 226 seconds (~3.8 minutes) between the Link Down event and the
> last SYNCHRONIZE CACHE failure to under 200 ms. Without this patch the
> total delay grows roughly with the number of disks, because each disk
> gets its own SYNCHRONIZE CACHE and qc timeout series.
> 
> If the underlying PCI device is already gone, these commands cannot
> succeed anyway. Avoid issuing them by introducing
> ata_port_pci_channel_offline(), which checks pci_channel_offline() for
> PCI-based hosts and is used in both ata_scsi_queuecmd() and
> ata_qc_issue().
> 
> For SCSI-originated commands we now complete the request immediately
> with DID_BAD_TARGET instead of queuing it. For internal ATA commands
> we fail the qc with AC_ERR_SYSTEM early without touching the HBA
> registers.
> 
> With this change, SYNCHRONIZE CACHE issued during hot-unplug fails
> quickly with DID_BAD_TARGET, without qc timeout spam, and the whole
> unplug completes much faster.

Wao... a hot-pluggable AHCI adapter. That is a new one...
And super dangerous with libata & AHCI driver as is: if the drives connected to
this adapter are not also declared as removabe devices and their write cache
disabled, random unplug of the drives will corrupt data/FS on the drive.

I understand what you are trying to fix here, but do not expect this setup to
be reliable without more patches.

See comments below.


> Signed-off-by: Henry Tseng <henrytseng@qnap.com>
> ---
>  drivers/ata/libata-core.c |  3 +++
>  drivers/ata/libata-scsi.c |  3 ++-
>  drivers/ata/libata.h      | 16 ++++++++++++++++
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 2a210719c4ce..49a59a90e8eb 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -5060,6 +5060,9 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
>  		if (ata_sg_setup(qc))
>  			goto sys_err;
>  
> +	if (ata_port_pci_channel_offline(ap))
> +		goto sys_err;
> +
>  	/* if device is sleeping, schedule reset and abort the link */
>  	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
>  		link->eh_info.action |= ATA_EH_RESET;
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index b43a3196e2be..a49fd4bbf321 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -4388,7 +4388,8 @@ int ata_scsi_queuecmd(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
>  	spin_lock_irqsave(ap->lock, irq_flags);
>  
>  	dev = ata_scsi_find_dev(ap, scsidev);
> -	if (likely(dev))
> +
> +	if (likely(dev) && !ata_port_pci_channel_offline(ap))

libata-scsi implements a SAT (SCSI-to-ATA translation). It should *not* be
bothered with checking the underlying adapter state. So this is the wrong place
to put this.

Strictly speaking, the check for the PCI channel offline should go into
libahci, or any other PCI ATA adapter driver. But since that would be repeating
the code, I think it is acceptable to put this check in libata-core.c, in
ata_qc_issue().

>  		rc = __ata_scsi_queuecmd(cmd, dev);
>  	else {
>  		cmd->result = (DID_BAD_TARGET << 16);
> diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
> index e5b977a8d3e1..ff2eb824e51b 100644
> --- a/drivers/ata/libata.h
> +++ b/drivers/ata/libata.h
> @@ -12,6 +12,8 @@
>  #ifndef __LIBATA_H__
>  #define __LIBATA_H__
>  
> +#include <linux/pci.h>

No. This should not be here. If you do as suggested above, libata-core.c
already includes linux/pci.h. It has functions such as ata_dev_check_adapter()
which already look at PCI devices. So your function
ata_port_pci_channel_offline() should be changed into something more generic,
like this:

static inline bool ata_adapter_is_online(struct ata_port *ap)
{
	struct device *dev;

	if (!ap || !ap->host)
		return false;

	dev = ap->host->dev;
	if (!dev)
		return false;

	if (dev_is_pci(dev) &&
	    pci_channel_offline(to_pci_dev(dev)))
		return false;

	return true;
}

and then, at the top of ata_qc_issue(), add something like:

	if (!ata_adapter_is_online(ap))
		goto sys_err;

> +
>  #define DRV_NAME	"libata"
>  #define DRV_VERSION	"3.00"	/* must be exactly four chars */
>  
> @@ -56,6 +58,20 @@ static inline bool ata_port_eh_scheduled(struct ata_port *ap)
>  	return ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS);
>  }
>  
> +static inline bool ata_port_pci_channel_offline(struct ata_port *ap)
> +{
> +	struct device *dev;
> +
> +	if (!ap || !ap->host)
> +		return false;
> +
> +	dev = ap->host->dev;
> +	if (!dev || !dev_is_pci(dev))
> +		return false;
> +
> +	return pci_channel_offline(to_pci_dev(dev));
> +}

This is too big a function to be inline. Make it a static function in
libata-core.c as suggested above.

> +
>  #ifdef CONFIG_ATA_FORCE
>  extern void ata_force_cbl(struct ata_port *ap);
>  #else


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
  2025-11-20  8:30 ` Damien Le Moal
@ 2025-11-21 10:14   ` Henry Tseng
  2025-11-26  7:50     ` Damien Le Moal
  0 siblings, 1 reply; 4+ messages in thread
From: Henry Tseng @ 2025-11-21 10:14 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen, Henry Tseng

When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
switch, hot-unplugging the whole enclosure causes pciehp to tear down
the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
START STOP UNIT for the disks.

libata still queues these commands and the AHCI driver tries to access
the HBA registers even though the PCI channel is already offline. This
results in a series of timeouts and error recovery attempts, e.g.:

  [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
  [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
  [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
  [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
  ...
  [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
  [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
  [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
  [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

With this patch applied, the same hot-unplug looks like:

  [   53.905168] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
  [   54.069159] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
      Result: hostbyte=DID_OK driverbyte=DRIVER_OK
  ...
  [   54.330226] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
      Result: hostbyte=DID_OK driverbyte=DRIVER_OK

In this test setup with two disks, the hot-unplug sequence shrinks from
about 226 seconds (~3.8 minutes) between the Link Down event and the
last SYNCHRONIZE CACHE failure to under a second. Without this patch the
total delay grows roughly with the number of disks, because each disk
gets its own SYNCHRONIZE CACHE and qc timeout series.

If the underlying PCI device is already gone, these commands cannot
succeed anyway. Avoid issuing them by introducing
ata_adapter_is_online(), which checks pci_channel_offline() for
PCI-based hosts and is used from ata_qc_issue() to bail out before
touching the HBA registers.

Since such failures would otherwise trigger libata error handling, we
also consult ata_adapter_is_online() from ata_scsi_port_error_handler().
When the adapter is offline, libata skips ap->ops->error_handler(ap)
and directly finishes error handling using the existing unload path,
instead of running a full EH sequence against a dead adapter.

With this change, SYNCHRONIZE CACHE and START STOP UNIT commands
issued during hot-unplug fail quickly once the PCI channel is offline,
without qc timeout spam or long libata EH delays.

Signed-off-by: Henry Tseng <henrytseng@qnap.com>
---
v2:
 - Move the PCI adapter state check into ata_adapter_is_online()
   implemented in libata-core.c, following your suggestion, 
   and call it from ata_qc_issue().
 - Also consult ata_adapter_is_online() from ata_scsi_port_error_handler() so that libata can
   skip ap->ops->error_handler(ap) once the adapter is offline, using the existing unload/suicide path
   rather than invoking a long EH sequence on a removed AHCI adapter.
 - This patch focuses on reducing libata EH delays and qc timeout spam when a Thunderbolt-attached
   DAS enclosure is unplugged after the file systems have been unmounted.
   It does not attempt to address general surprise-removal safety.
 - As a side effect, the SCSI result for these SYNCHRONIZE CACHE and START STOP UNIT commands
   changes from hostbyte=DID_BAD_TARGET to hostbyte=DID_OK with sense key "Aborted Command".
   From the SCSI/block layer perspective, the commands still fail, and the devices are being removed.
   If preserving DID_BAD_TARGET is preferred, the error mapping can be updated accordingly.

 drivers/ata/libata-core.c | 21 +++++++++++++++++++++
 drivers/ata/libata-eh.c   |  3 ++-
 drivers/ata/libata.h      |  1 +
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2a210719c4ce..0116dac3b4ec 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2358,6 +2358,24 @@ static bool ata_dev_check_adapter(struct ata_device *dev,
 	return false;
 }
 
+bool ata_adapter_is_online(struct ata_port *ap)
+{
+	struct device *dev;
+
+	if (!ap || !ap->host)
+		return false;
+
+	dev = ap->host->dev;
+	if (!dev)
+		return false;
+
+	if (dev_is_pci(dev) &&
+	    pci_channel_offline(to_pci_dev(dev)))
+		return false;
+
+	return true;
+}
+
 static int ata_dev_config_ncq(struct ata_device *dev,
 			       char *desc, size_t desc_sz)
 {
@@ -5060,6 +5078,9 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
 		if (ata_sg_setup(qc))
 			goto sys_err;
 
+	if (!ata_adapter_is_online(ap))
+		goto sys_err;
+
 	/* if device is sleeping, schedule reset and abort the link */
 	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
 		link->eh_info.action |= ATA_EH_RESET;
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2586e77ebf45..f4c9541d1910 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -736,7 +736,8 @@ void ata_scsi_port_error_handler(struct Scsi_Host *host, struct ata_port *ap)
 	spin_unlock_irqrestore(ap->lock, flags);
 
 	/* invoke EH, skip if unloading or suspended */
-	if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)))
+	if (!(ap->pflags & (ATA_PFLAG_UNLOADING | ATA_PFLAG_SUSPENDED)) &&
+	    ata_adapter_is_online(ap))
 		ap->ops->error_handler(ap);
 	else {
 		/* if unloading, commence suicide */
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index e5b977a8d3e1..a14cd588c2d4 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -94,6 +94,7 @@ extern int atapi_check_dma(struct ata_queued_cmd *qc);
 extern void swap_buf_le16(u16 *buf, unsigned int buf_words);
 extern bool ata_phys_link_online(struct ata_link *link);
 extern bool ata_phys_link_offline(struct ata_link *link);
+bool ata_adapter_is_online(struct ata_port *ap);
 extern void ata_dev_init(struct ata_device *dev);
 extern void ata_link_init(struct ata_port *ap, struct ata_link *link, int pmp);
 extern int sata_link_init_spd(struct ata_link *link);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
  2025-11-21 10:14   ` [PATCH v2] " Henry Tseng
@ 2025-11-26  7:50     ` Damien Le Moal
  0 siblings, 0 replies; 4+ messages in thread
From: Damien Le Moal @ 2025-11-26  7:50 UTC (permalink / raw)
  To: Henry Tseng; +Cc: Niklas Cassel, linux-ide, Kevin Ko, SW Chen

On 11/21/25 7:14 PM, Henry Tseng wrote:
> When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
> switch, hot-unplugging the whole enclosure causes pciehp to tear down
> the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
> START STOP UNIT for the disks.
> 
> libata still queues these commands and the AHCI driver tries to access
> the HBA registers even though the PCI channel is already offline. This
> results in a series of timeouts and error recovery attempts, e.g.:
> 
>   [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
>   [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
>   [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
>   [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
>   [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>   ...
>   [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
>   [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
>   [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
>   [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> 
> With this patch applied, the same hot-unplug looks like:
> 
>   [   53.905168] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
>   [   54.069159] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>   ...
>   [   54.330226] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_OK driverbyte=DRIVER_OK

So I had an M.2 AHCI adapter and a USB4/Thunderbolt USB-C/M.2 enclosure which
allowed me to recreate the problem, and also test your patch.
I confirm the above, but I am really not a big fan of the above DID_OK. Because
the host did not do OK at all since it is gone !

So on top of your patch, I applied this diff:

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index a80a7e657041..52312ea6eb32 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5076,6 +5076,12 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
        qc->flags |= ATA_QCFLAG_ACTIVE;
        ap->qc_active |= 1ULL << qc->tag;

+       /* Make sure the device is still accessible. */
+       if (!ata_adapter_is_online(ap)) {
+               qc->err_mask |= AC_ERR_HOST_BUS;
+               goto err;
+       }
+
        /*
         * We guarantee to LLDs that they will have at least one
         * non-zero sg if the command is a data command.
@@ -5088,9 +5094,6 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
                if (ata_sg_setup(qc))
                        goto sys_err;

-       if (!ata_adapter_is_online(ap))
-               goto sys_err;
-
        /* if device is sleeping, schedule reset and abort the link */
        if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
                link->eh_info.action |= ATA_EH_RESET;
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 434774e71fe6..c760732712c4 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2981,6 +2981,9 @@ ata_scsi_find_dev(struct ata_port *ap, const struct
scsi_device *scsidev)
 {
        struct ata_device *dev = __ata_scsi_find_dev(ap, scsidev);

+       if (!ata_adapter_is_online(ap))
+               return NULL;
+
        if (unlikely(!dev || !ata_dev_enabled(dev)))
                return NULL;

The first part (change in libata-core) only moves the check for
ata_adapter_is_online() earlier and sets AC_ERR_HOST_BUS for the err mask. The
more interesting change is the ata_adapter_is_online() added to
ata_scsi_find_dev(), which if the adapter is unplugged, will return NULL for
the device and prevent any command issuing, but also return DID_BAD_TARGET as
the status of the scsi command.

With this change, unplugging the adapter leads to quick cleanup. I see this:

[   95.198300] thunderbolt 0-0:3.1: retimer disconnected
[   95.201121] thunderbolt 0-3: device disconnected
[   95.216189] pcieport 0000:00:07.1: pciehp: pending interrupts 0x0108 from
Slot Status
[   95.216200] pcieport 0000:00:07.1: pciehp: Slot(13): Link Down
[   95.216202] pcieport 0000:00:07.1: pciehp: Slot(13): Card not present
[   95.216203] pcieport 0000:00:07.1: pciehp: pciehp_unconfigure_device:
domain:bus:dev = 0000:2b:00
[   95.216208] ahci 0000:2d:00.0: PME# disabled
[   95.259543] sd 2:0:0:0: [sda] Synchronizing SCSI cache
[   95.259580] sd 2:0:0:0: [sda] Synchronize Cache(10) failed: Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIV
ER_OK
[   95.259583] sd 2:0:0:0: [sda] Stopping disk
[   95.259589] sd 2:0:0:0: [sda] Start/Stop Unit failed: Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   95.259688] ahci 0000:2d:00.0: AHCI controller unavailable!
[   95.259697] ata3.00: Check power mode failed (err_mask=0x20)
[   95.396292] ahci 0000:2d:00.0: AHCI controller unavailable!
[   95.396297] ata1: failed to stop engine (-19)
...
[   95.396359] ahci 0000:2d:00.0: AHCI controller unavailable!
[   95.396359] ata32: failed to stop engine (-19)
[   95.903412] pcieport 0000:2c:00.0: PME# disabled
[   95.903645] pcieport 0000:2b:00.0: PME# disabled
[   95.903904] pci 0000:2d:00.0: device released
[   95.903910] pci_bus 0000:2d: busn_res: [bus 2d] is released
[   95.904211] pci 0000:2c:00.0: device released
[   95.904215] pci_bus 0000:2c: busn_res: [bus 2c-2d] is released
[   95.904583] pci 0000:2b:00.0: EDR: Notify handler removed
[   95.905017] pci 0000:2b:00.0: device released

So about 800ms to cleanup the adapter device (and this one is odd as it has
only 6 ports but shows 32 ports !).

So can you merge the diff above into your patch and send a V3 ?
Thanks.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-11-26  7:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-20  7:35 [PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
2025-11-20  8:30 ` Damien Le Moal
2025-11-21 10:14   ` [PATCH v2] " Henry Tseng
2025-11-26  7:50     ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox