public inbox for linux-ide@vger.kernel.org
 help / color / mirror / Atom feed
From: Damien Le Moal <dlemoal@kernel.org>
To: Henry Tseng <henrytseng@qnap.com>
Cc: Niklas Cassel <cassel@kernel.org>,
	linux-ide@vger.kernel.org, Kevin Ko <kevinko@qnap.com>,
	SW Chen <swchen@qnap.com>
Subject: Re: [PATCH v2] ata: libata: avoid long timeouts on hot-unplugged SATA DAS
Date: Wed, 26 Nov 2025 16:50:09 +0900	[thread overview]
Message-ID: <0c380e65-1425-4dda-b6ae-f3cf4f69f8e6@kernel.org> (raw)
In-Reply-To: <20251121101407.978550-1-henrytseng@qnap.com>

On 11/21/25 7:14 PM, Henry Tseng wrote:
> When a SATA DAS enclosure is connected behind a Thunderbolt PCIe
> switch, hot-unplugging the whole enclosure causes pciehp to tear down
> the PCI hierarchy before the SCSI layer issues SYNCHRONIZE CACHE and
> START STOP UNIT for the disks.
> 
> libata still queues these commands and the AHCI driver tries to access
> the HBA registers even though the PCI channel is already offline. This
> results in a series of timeouts and error recovery attempts, e.g.:
> 
>   [  824.778346] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
>   [  891.612720] ata8.00: qc timeout after 5000 msecs (cmd 0xec)
>   [  902.876501] ata8.00: qc timeout after 10000 msecs (cmd 0xec)
>   [  934.107998] ata8.00: qc timeout after 30000 msecs (cmd 0xec)
>   [  936.206431] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
>   ...
>   [ 1006.298356] ata1.00: qc timeout after 5000 msecs (cmd 0xec)
>   [ 1017.561926] ata1.00: qc timeout after 10000 msecs (cmd 0xec)
>   [ 1048.791790] ata1.00: qc timeout after 30000 msecs (cmd 0xec)
>   [ 1050.890035] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> 
> With this patch applied, the same hot-unplug looks like:
> 
>   [   53.905168] pcieport 0000:00:07.0: pciehp: Slot(14): Link Down
>   [   54.069159] sd 7:0:0:0: [sda] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>   ...
>   [   54.330226] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed:
>       Result: hostbyte=DID_OK driverbyte=DRIVER_OK

So I had an M.2 AHCI adapter and a USB4/Thunderbolt USB-C/M.2 enclosure which
allowed me to recreate the problem, and also test your patch.
I confirm the above, but I am really not a big fan of the above DID_OK. Because
the host did not do OK at all since it is gone !

So on top of your patch, I applied this diff:

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index a80a7e657041..52312ea6eb32 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5076,6 +5076,12 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
        qc->flags |= ATA_QCFLAG_ACTIVE;
        ap->qc_active |= 1ULL << qc->tag;

+       /* Make sure the device is still accessible. */
+       if (!ata_adapter_is_online(ap)) {
+               qc->err_mask |= AC_ERR_HOST_BUS;
+               goto err;
+       }
+
        /*
         * We guarantee to LLDs that they will have at least one
         * non-zero sg if the command is a data command.
@@ -5088,9 +5094,6 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
                if (ata_sg_setup(qc))
                        goto sys_err;

-       if (!ata_adapter_is_online(ap))
-               goto sys_err;
-
        /* if device is sleeping, schedule reset and abort the link */
        if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
                link->eh_info.action |= ATA_EH_RESET;
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 434774e71fe6..c760732712c4 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2981,6 +2981,9 @@ ata_scsi_find_dev(struct ata_port *ap, const struct
scsi_device *scsidev)
 {
        struct ata_device *dev = __ata_scsi_find_dev(ap, scsidev);

+       if (!ata_adapter_is_online(ap))
+               return NULL;
+
        if (unlikely(!dev || !ata_dev_enabled(dev)))
                return NULL;

The first part (change in libata-core) only moves the check for
ata_adapter_is_online() earlier and sets AC_ERR_HOST_BUS for the err mask. The
more interesting change is the ata_adapter_is_online() added to
ata_scsi_find_dev(), which if the adapter is unplugged, will return NULL for
the device and prevent any command issuing, but also return DID_BAD_TARGET as
the status of the scsi command.

With this change, unplugging the adapter leads to quick cleanup. I see this:

[   95.198300] thunderbolt 0-0:3.1: retimer disconnected
[   95.201121] thunderbolt 0-3: device disconnected
[   95.216189] pcieport 0000:00:07.1: pciehp: pending interrupts 0x0108 from
Slot Status
[   95.216200] pcieport 0000:00:07.1: pciehp: Slot(13): Link Down
[   95.216202] pcieport 0000:00:07.1: pciehp: Slot(13): Card not present
[   95.216203] pcieport 0000:00:07.1: pciehp: pciehp_unconfigure_device:
domain:bus:dev = 0000:2b:00
[   95.216208] ahci 0000:2d:00.0: PME# disabled
[   95.259543] sd 2:0:0:0: [sda] Synchronizing SCSI cache
[   95.259580] sd 2:0:0:0: [sda] Synchronize Cache(10) failed: Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIV
ER_OK
[   95.259583] sd 2:0:0:0: [sda] Stopping disk
[   95.259589] sd 2:0:0:0: [sda] Start/Stop Unit failed: Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   95.259688] ahci 0000:2d:00.0: AHCI controller unavailable!
[   95.259697] ata3.00: Check power mode failed (err_mask=0x20)
[   95.396292] ahci 0000:2d:00.0: AHCI controller unavailable!
[   95.396297] ata1: failed to stop engine (-19)
...
[   95.396359] ahci 0000:2d:00.0: AHCI controller unavailable!
[   95.396359] ata32: failed to stop engine (-19)
[   95.903412] pcieport 0000:2c:00.0: PME# disabled
[   95.903645] pcieport 0000:2b:00.0: PME# disabled
[   95.903904] pci 0000:2d:00.0: device released
[   95.903910] pci_bus 0000:2d: busn_res: [bus 2d] is released
[   95.904211] pci 0000:2c:00.0: device released
[   95.904215] pci_bus 0000:2c: busn_res: [bus 2c-2d] is released
[   95.904583] pci 0000:2b:00.0: EDR: Notify handler removed
[   95.905017] pci 0000:2b:00.0: device released

So about 800ms to cleanup the adapter device (and this one is odd as it has
only 6 ports but shows 32 ports !).

So can you merge the diff above into your patch and send a V3 ?
Thanks.

-- 
Damien Le Moal
Western Digital Research

      reply	other threads:[~2025-11-26  7:54 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-20  7:35 [PATCH] ata: libata: avoid long timeouts on hot-unplugged SATA DAS Henry Tseng
2025-11-20  8:30 ` Damien Le Moal
2025-11-21 10:14   ` [PATCH v2] " Henry Tseng
2025-11-26  7:50     ` Damien Le Moal [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0c380e65-1425-4dda-b6ae-f3cf4f69f8e6@kernel.org \
    --to=dlemoal@kernel.org \
    --cc=cassel@kernel.org \
    --cc=henrytseng@qnap.com \
    --cc=kevinko@qnap.com \
    --cc=linux-ide@vger.kernel.org \
    --cc=swchen@qnap.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox