Default IDENTIFY timeout is 5000ms which is too short for enterprise disks

public inbox for linux-ide@vger.kernel.org

* Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
@ 2026-04-09 10:21 AlanCui4080
  2026-04-09 11:55 ` Damien Le Moal
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: AlanCui4080 @ 2026-04-09 10:21 UTC
  To: linux-ide, dlemoal

Hi,

I have two ST4000NM000A-2HZ100 drives in my computer, which are from Seagate's
enterprise line. When I resume from suspend, the kernel complains as shown
below and the zpool kicks the disks off:

```
ata2: found unknown device (class 0)
ata4: found unknown device (class 0)
ata2: found unknown device (class 0)
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4: found unknown device (class 0)
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4.00: qc timeout after 5000 msecs (cmd 0xec)
ata4.00: qc timeout after 5000 msecs (cmd 0xec)
ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata4.00: revalidation failed (errno=-5)
ata2.00: qc timeout after 5000 msecs (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata2.00: configured for UDMA/133
ata4.00: configured for UDMA/133
```
I think this is caused by my disks spinning up too slowly.
After making libata wait longer, the warning disappeared.

```
# cat /proc/cmdline
libata.ata_probe_timeout=10
```
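
For readers unfamiliar with this knob: libata.ata_probe_timeout is given in seconds, and the libata source suggests that a non-zero value replaces the default per-command timeout for internal commands such as IDENTIFY. The following is a minimal userspace sketch of that selection, under that assumption; effective_timeout_ms() is a hypothetical helper for illustration, not a kernel function.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper (not a kernel function): pick the timeout used for an
 * internal command. Assumption: a non-zero probe timeout, given in seconds,
 * overrides the built-in default, which is given in milliseconds.
 */
static unsigned int effective_timeout_ms(unsigned int probe_timeout_s,
                                         unsigned int default_ms)
{
    assert(probe_timeout_s <= UINT_MAX / 1000);  /* avoid overflow */
    return probe_timeout_s ? probe_timeout_s * 1000 : default_ms;
}
```

Under that assumption, effective_timeout_ms(10, 5000) yields 10000, so the first IDENTIFY attempt waits 10 seconds instead of 5; with the parameter left at 0, the 5000 ms default applies.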

```
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
sd 1:0:0:0: [sda] Starting disk
ata2.00: configured for UDMA/133
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
sd 3:0:0:0: [sdb] Starting disk
ata4.00: configured for UDMA/133
```

Meanwhile, SeaChest reports that the recovery time from standby is 9 seconds,
which is longer than the default ATA IDENTIFY timeout.

```
/dev/sg0 - ST4000NM000A-2HZ100 - **** - TN04 - ATA

Standby Z : Recovery Time : 90 (in 100msecs)
```

```
static const unsigned int ata_eh_identify_timeouts[] = {
         5000,  /* covers > 99% of successes and not too boring on failures */
        10000,  /* combined time till here is enough even for media access */
        30000,  /* for true idiots */
        UINT_MAX,
};
```

I tested the hard drive, and as long as it's never set to STANDBY_Z (disk 
stops spinning, requiring 9 seconds to recover) and kept in IDLE_C (platter 
slows down, requiring 3.2 seconds to recover), this error never occurs.
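
To make the arithmetic explicit, here is a minimal userspace sketch (not kernel code) that converts recovery times from the 100 ms units SeaChest reports into milliseconds and finds the first entry of the timeout table that would cover them. The table values are copied from the ata_eh_identify_timeouts[] excerpt above; the helper names are hypothetical, and the sketch only compares numbers without modelling when spin-up actually starts relative to the command.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Values copied from the ata_eh_identify_timeouts[] excerpt; UINT_MAX ends the table. */
static const unsigned int identify_timeouts_ms[] = { 5000, 10000, 30000, UINT_MAX };

/* Hypothetical helper: convert a recovery time given in units of 100 ms. */
static unsigned int recovery_units_to_ms(unsigned int units_100ms)
{
    assert(units_100ms <= UINT_MAX / 100);  /* avoid overflow */
    return units_100ms * 100;
}

/*
 * Hypothetical helper: index of the first table entry that is at least as
 * long as the recovery time, or -1 if no entry is long enough.
 */
static int first_sufficient_timeout(unsigned int recovery_ms)
{
    for (size_t i = 0; identify_timeouts_ms[i] != UINT_MAX; i++)
        if (identify_timeouts_ms[i] >= recovery_ms)
            return (int)i;
    return -1;
}
```

With these helpers, recovery_units_to_ms(90) gives 9000 and first_sufficient_timeout(9000) returns 1 (the 10000 ms entry), while the 3.2 second IDLE_C recovery, first_sufficient_timeout(3200), returns 0 and fits within the initial 5000 ms.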

I have seen many users complaining about this elsewhere. Should we add a quirk
for such "heavy" disks, or print a warning explaining how to work around the
problem?

Alan


* Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
  2026-04-09 10:21 Default IDENTIFY timeout is 5000ms which is too short for enterprise disks AlanCui4080
@ 2026-04-09 11:55 ` Damien Le Moal
  2026-04-09 12:01 ` Damien Le Moal
       [not found] ` <14062658.dW097sEU6C@alanarchdesktop>
  2 siblings, 0 replies; 5+ messages in thread
From: Damien Le Moal @ 2026-04-09 11:55 UTC
  To: AlanCui4080, linux-ide

On 2026/04/09 12:21, AlanCui4080 wrote:
> I have seen many users complaining about this elsewhere. Should we add a quirk
> for such "heavy" disks, or print a warning explaining how to work around the
> problem?

Elsewhere ? I have not seen any complaints/problem reports on the linux-ide list
recently. So I do not know where "elsewhere" is.

And no, we should not quirk the disk but rather improve resume from suspend to
issue identify with increasing timeouts, like regular probe does, or to issue
identify only once we see the drive ready, which is a check that exists for
spun-down startups. That should solve the issue.




-- 
Damien Le Moal
Western Digital Research


* Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
  2026-04-09 10:21 Default IDENTIFY timeout is 5000ms which is too short for enterprise disks AlanCui4080
  2026-04-09 11:55 ` Damien Le Moal
@ 2026-04-09 12:01 ` Damien Le Moal
       [not found] ` <14062658.dW097sEU6C@alanarchdesktop>
  2 siblings, 0 replies; 5+ messages in thread
From: Damien Le Moal @ 2026-04-09 12:01 UTC
  To: AlanCui4080, linux-ide, Niklas Cassel

On 2026/04/09 12:21, AlanCui4080 wrote:
> Hi,
> 
> I have two ST4000NM000A-2HZ100 drives in my computer, which are from Seagate's
> enterprise line. When I resume from suspend, the kernel complains as shown
> below and the zpool kicks the disks off:

We do not deal with out of tree code. So mentioning something that ZFS does is
not helping. Please check with an upstream file system. E.g. XFS, ext4 or BTRFS.

> 
> ```
> ata2: found unknown device (class 0)
> ata4: found unknown device (class 0)
> ata2: found unknown device (class 0)
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata4: found unknown device (class 0)
> ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata4.00: qc timeout after 5000 msecs (cmd 0xec)
> ata4.00: qc timeout after 5000 msecs (cmd 0xec)
> ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> ata4.00: revalidation failed (errno=-5)
> ata2.00: qc timeout after 5000 msecs (cmd 0xec)
> ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
> ata2.00: revalidation failed (errno=-5)
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata2.00: configured for UDMA/133
> ata4.00: configured for UDMA/133
> ```
> I think this is caused by my disks spinning up too slowly.
> After making libata wait longer, the warning disappeared.

What kernel version is this ? Did you test with the latest mainline (7.0-rc7) ?

> 
> ```
> # cat /proc/cmdline
> libata.ata_probe_timeout=10
> ```
> 
> ```
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> sd 1:0:0:0: [sda] Starting disk
> ata2.00: configured for UDMA/133
> ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> sd 3:0:0:0: [sdb] Starting disk
> ata4.00: configured for UDMA/133
> ```
> 
> 
> Meanwhile, SeaChest reports that the recovery time from standby is 9 seconds,
> which is longer than the default ATA IDENTIFY timeout.

Your drive is very slow/old. Most modern drives can reply to identify even when
they are not fully spun up.

> 
> ```
> /dev/sg0 - ST4000NM000A-2HZ100 - **** - TN04 - ATA
> 
> Standby Z : Recovery Time : 90 (in 100msecs)
> ```
> 
> ```
> static const unsigned int ata_eh_identify_timeouts[] = {
>          5000,  /* covers > 99% of successes and not too boring on failures */
>         10000,  /* combined time till here is enough even for media access */
>         30000,  /* for true idiots */
>         UINT_MAX,
> };
> ```
> 
> I tested the hard drive, and as long as it's never set to STANDBY_Z (disk 
> stops spinning, requiring 9 seconds to recover) and kept in IDLE_C (platter 
> slows down, requiring 3.2 seconds to recover), this error never occurs.
> 
> I have seen many users complaining about this elsewhere. Should we add a quirk
> for such "heavy" disks, or print a warning explaining how to work around the
> problem?

Elsewhere ? That certainly was not on this list as we have seen no problem
reports recently.

And no, we should not introduce a quirk for this. Rather, we should use the same
3-step timeout for revalidation after a resume from suspend as a regular probe
does. Or add a check/wait for "drive ready" when resuming, similar to the PUIS
handling (power up in standby).
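
The escalating-timeout idea can be illustrated with a toy userspace model. This is only a sketch under stated assumptions, not libata code: the toy_* names are hypothetical, the device is reduced to "answers once a fixed spin-up time has elapsed", and the model deliberately ignores everything the error handler does between attempts (in the real driver a timed-out command freezes the port, so a reset is needed before the next attempt). The timeout values are those of the ata_eh_identify_timeouts[] table quoted above.

```c
#include <assert.h>
#include <limits.h>

/* Values taken from the quoted ata_eh_identify_timeouts[]; UINT_MAX ends the table. */
static const unsigned int identify_timeouts_ms[] = { 5000, 10000, 30000, UINT_MAX };

/* Toy device: answers IDENTIFY only once ready_at_ms has elapsed. */
struct toy_dev {
    unsigned int ready_at_ms;  /* time the drive needs to spin up */
    unsigned int now_ms;       /* simulated clock */
};

/* Simulated IDENTIFY: returns 0 on success, -1 on timeout. */
static int toy_identify(struct toy_dev *dev, unsigned int timeout_ms)
{
    assert(timeout_ms > 0);
    if (dev->ready_at_ms <= dev->now_ms + timeout_ms) {
        if (dev->now_ms < dev->ready_at_ms)
            dev->now_ms = dev->ready_at_ms;  /* waited until the drive was ready */
        return 0;
    }
    dev->now_ms += timeout_ms;               /* waited the full timeout in vain */
    return -1;
}

/* Retry with increasing timeouts; returns attempts used, or -1 if all failed. */
static int toy_revalidate(struct toy_dev *dev)
{
    for (int i = 0; identify_timeouts_ms[i] != UINT_MAX; i++)
        if (toy_identify(dev, identify_timeouts_ms[i]) == 0)
            return i + 1;
    return -1;
}
```

In this model a drive that needs 9000 ms fails the first 5000 ms attempt and succeeds on the second, so toy_revalidate() returns 2; a drive that needs 3200 ms succeeds on the first attempt.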


-- 
Damien Le Moal
Western Digital Research


[parent not found: <14062658.dW097sEU6C@alanarchdesktop>]

[parent not found: <4482b737-1454-48cb-a941-165aa84fb2eb@kernel.org>]

* Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
       [not found]   ` <4482b737-1454-48cb-a941-165aa84fb2eb@kernel.org>
@ 2026-04-10 11:24     ` AlanCui4080
  2026-04-10 12:14       ` AlanCui4080
  0 siblings, 1 reply; 5+ messages in thread
From: AlanCui4080 @ 2026-04-10 11:24 UTC
  To: Damien Le Moal; +Cc: linux-ide

On Friday, 10 April 2026 12:19, you wrote:
> I need to check the code again, but no, that's not it. Since on resume we
> revalidate the device, it is ata_dev_reread_id() that needs to be a bit more lax
> on timeouts and repeatedly call ata_dev_read_id() with an increasing timeout as
> defined by ata_eh_identify_timeouts[]. That should solve the IDENTIFY issue for
> drives that are slow to respond to that command on resume/while spinning up.
> 
> >> Or add a check/wait for "drive ready"
> >> when resuming, similar to the PUIS handling (power up in standby).
> > 
> > There is tried_spinup in ata_dev_read_id(), but it seems to require the device
> > to respond with at least an incomplete IDENTIFY. With a device that never
> > responds while spinning up, is it possible to implement this?
> 
> Ah, yes, forgot about that one. So it is not an option.
> 

Hi, I've tried this (plus an extra WARN_ONCE at the ata_port_is_frozen() check):

---

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 374993031895..0ac0daae33f9 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3902,7 +3902,15 @@ int ata_dev_reread_id(struct ata_device *dev, unsigned int readid_flags)
        int rc;
 
        /* read ID data */
-       rc = ata_dev_read_id(dev, &class, readid_flags, id);
+       int retry_read_id = 3;
+       do {
+               rc = ata_dev_read_id(dev, &class, readid_flags, id);
+               if (rc) {
+                       ata_dev_warn(dev, "retrying ata_dev_read_id(), %d times remainng",
+                               retry_read_id);
+               }
+               retry_read_id--;
+       } while (rc && retry_read_id > 0);
        if (rc)
                return rc;

--

But it reports:

```
[  119.260621] ata2: found unknown device (class 0)
[  119.264620] ata4: found unknown device (class 0)
[  119.415623] ata2: found unknown device (class 0)
[  119.415634] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  119.422627] ata4: found unknown device (class 0)
[  119.422636] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  124.646636] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
[  124.646646] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  124.646648] ata4.00: retrying ata_dev_read_id(), 3 times remainng
[  124.646657] ------------[ cut here ]------------
[  124.646659] ata_port_is_frozen(ap)
[  124.646660] WARNING: drivers/ata/libata-core.c:1549 at ata_exec_internal+0x4e4/0x590, CPU#0: scsi_eh_3/155
...
[  124.646793] Call Trace:
[  124.646795]  <TASK>
[  124.646799]  ata_dev_read_id+0x3b2/0x560
[  124.646805]  ata_dev_reread_id+0x50/0xf0
[  124.646808]  ata_dev_revalidate+0x64/0xd0
[  124.646811]  ata_eh_recover+0xa76/0xf90
[  124.646815]  ? update_load_avg+0x7b/0x740
[  124.646819]  ? __dequeue_entity+0x4f4/0x5d0
[  124.646823]  sata_pmp_error_handler+0x387/0x660
[  124.646827]  ? __flush_work+0x2b1/0x360
[  124.646832]  ahci_error_handler+0x42/0x80
[  124.646836]  ata_scsi_port_error_handler+0x71a/0x950
[  124.646840]  ata_scsi_error+0x95/0xd0
[  124.646843]  scsi_error_handler+0xd1/0x530
[  124.646848]  ? __pfx_scsi_error_handler+0x10/0x10
[  124.646851]  kthread+0xfc/0x240
[  124.646855]  ? __pfx_kthread+0x10/0x10
[  124.646858]  ret_from_fork+0x243/0x280
[  124.646862]  ? __pfx_kthread+0x10/0x10
[  124.646865]  ret_from_fork_asm+0x1a/0x30
[  124.646873]  </TASK>
[  124.646875] ---[ end trace 0000000000000000 ]---
[  124.646877] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[  124.646879] ata4.00: retrying ata_dev_read_id(), 2 times remainng
[  124.646886] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[  124.646888] ata4.00: retrying ata_dev_read_id(), 1 times remainng
[  124.646889] ata4.00: revalidation failed (errno=-5)
[  124.646919] ata2.00: qc timeout after 5000 msecs (cmd 0xec)
[  124.646927] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  124.646929] ata2.00: retrying ata_dev_read_id(), 3 times remainng
[  124.646937] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[  124.646939] ata2.00: retrying ata_dev_read_id(), 2 times remainng
[  124.646945] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[  124.646947] ata2.00: retrying ata_dev_read_id(), 1 times remainng
[  124.646948] ata2.00: revalidation failed (errno=-5)
[  125.110629] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  125.110649] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  125.146916] ata2.00: configured for UDMA/133
[  125.163102] ata4.00: configured for UDMA/133

```

And yes, libata freezes the port when the qc times out:

```
if (qc->flags & ATA_QCFLAG_ACTIVE) {
	qc->err_mask |= AC_ERR_TIMEOUT;
	ata_port_freeze(ap);
	ata_dev_warn(dev, "qc timeout after %u msecs (cmd 0x%x)\n",
		     timeout, command);
}
```

So, should the retry happen in ata_exec_internal()? No: ata_exec_internal() has
no path to cancel a command that has already been issued; it can only freeze and
reset the port. All we could do is keep waiting, increasing the timeout up to 3
times, before letting the port reset. I don't think that is a good idea.
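
The pattern in the log (err_mask=0x4 on the first attempt, then err_mask=0x40 on each immediate retry) can be reproduced with a toy userspace model. This is a sketch under stated assumptions: the AC_ERR_* bit values are as recalled from include/linux/libata.h and should be checked against the tree in use, the toy_* names are hypothetical, and the model assumes internal commands are refused with AC_ERR_SYSTEM while the port is frozen, which is consistent with the 0x40 in the log.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values, as recalled from include/linux/libata.h. */
enum {
    AC_ERR_TIMEOUT = 1 << 2,  /* 0x4: command timed out */
    AC_ERR_SYSTEM  = 1 << 6,  /* 0x40: system error, e.g. port frozen */
};

/* Toy port state: only what is needed to show the retry behaviour. */
struct toy_port {
    bool frozen;
    bool dev_ready;
};

/* Toy stand-in for issuing an internal command; returns an err_mask. */
static unsigned int toy_exec_internal(struct toy_port *ap)
{
    if (ap->frozen)
        return AC_ERR_SYSTEM;  /* refused immediately, nothing is sent */
    if (!ap->dev_ready) {
        ap->frozen = true;     /* a timeout freezes the port until reset */
        return AC_ERR_TIMEOUT;
    }
    return 0;
}

/* Naive retry without a reset in between, in the spirit of the patch above. */
static void toy_retry(struct toy_port *ap, unsigned int *errs, int tries)
{
    assert(tries > 0);
    for (int i = 0; i < tries; i++)
        errs[i] = toy_exec_internal(ap);
}
```

Running toy_retry() with three tries on a port whose device is not ready fills errs with 0x4, 0x40, 0x40: only the first attempt ever reaches the device, so retrying without a reset cannot help.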

Alan.




* Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
  2026-04-10 11:24     ` AlanCui4080
@ 2026-04-10 12:14       ` AlanCui4080
  0 siblings, 0 replies; 5+ messages in thread
From: AlanCui4080 @ 2026-04-10 12:14 UTC
  To: Damien Le Moal; +Cc: linux-ide

Hi,

As further information, I found that increasing the timeout only mitigates the
problem: over multiple wake-ups from S3, IDENTIFY still failed about 10% of the
time. Interestingly, after the failure, the port immediately regained the link
and then successfully configured the hard drive.

```
[  322.975526] ACPI: PM: Waking up from system sleep state S3
...
[  332.991862] ata4: found unknown device (class 0)
[  332.992863] ata2: found unknown device (class 0)
[  333.147890] ata2: found unknown device (class 0)
[  333.147899] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  333.147911] ata4: found unknown device (class 0)
[  333.147920] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  348.198232] ata4.00: qc timeout after 15000 msecs (cmd 0xec)
[  348.198242] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  348.198245] ata4.00: revalidation failed (errno=-5)
[  348.198259] ata2.00: qc timeout after 15000 msecs (cmd 0xec)
[  348.198269] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[  348.198272] ata2.00: revalidation failed (errno=-5)
[  348.662584] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  348.662610] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  348.699354] ata4.00: configured for UDMA/133
[  348.719825] ata2.00: configured for UDMA/133
```

The difference between a failed recovery and a successful one is that, in the
successful case, ata does not report "found unknown device". I then attached new
consumer-grade WD and Seagate drives and, as I expected, they spin up much faster
than the Exos drives and are never reported as failing revalidation:

2.5 inch WD Blue drive, 8 secs faster:
```
[ 1047.409533] ACPI: PM: Waking up from system sleep state S3
...
[ 1047.724415] ata5: SATA link down (SStatus 0 SControl 330)
[ 1047.724451] ata3: SATA link down (SStatus 0 SControl 300)
[ 1048.452451] ata6: SATA link down (SStatus 0 SControl 0)
[ 1049.204451] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1049.257864] sd 0:0:0:0: [sdc] Starting disk
...
[ 1051.916495] PM: suspend exit
[ 1052.728394] ata4: link is slow to respond, please be patient (ready=0)
[ 1052.733355] ata2: link is slow to respond, please be patient (ready=0)
[ 1054.840880] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1057.076309] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1057.116416] sd 1:0:0:0: [sda] Starting disk
[ 1057.134584] ata2.00: configured for UDMA/133
[ 1057.532325] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1057.576679] sd 3:0:0:0: [sdb] Starting disk
[ 1057.594743] ata4.00: configured for UDMA/13
```

3.5 inch Seagate BarraCuda drive, 6 secs faster:
```
[ 1484.056163] ACPI: PM: Waking up from system sleep state S3
[ 1484.371881] ata5: SATA link down (SStatus 0 SControl 330)
[ 1484.371917] ata3: SATA link down (SStatus 0 SControl 300)
[ 1485.099799] ata6: SATA link down (SStatus 0 SControl 0)
...
[ 1488.620192] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1488.621446] sd 0:0:0:0: [sdc] Starting disk
[ 1488.622941] ata1.00: configured for UDMA/133
...
[ 1488.633805] PM: suspend exit
[ 1489.374930] ata2: link is slow to respond, please be patient (ready=0)
[ 1489.374939] ata4: link is slow to respond, please be patient (ready=0)
[ 1491.563828] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1493.666523] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1493.713096] sd 1:0:0:0: [sda] Starting disk
[ 1493.731018] ata2.00: configured for UDMA/133
[ 1494.026490] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1494.083273] sd 3:0:0:0: [sdb] Starting disk
[ 1494.101513] ata4.00: configured for UDMA/133
```
Furthermore, I discovered that adding an extra hard drive to the system mitigates
the revalidation failure. This may indicate that the hard drive has not actually
restored the link when the kernel believes it has (the kernel reports that it
does not recognize the device on the link). The slight delay caused by the extra
hard drive then allows the command to actually be accepted by the drive, which
avoids the problem.

I'd also like to point out that the AMD B550 southbridge has only two native
SATA ports, so these six ports must be behind a port multiplier. Could this
cause issues? I've seen many B550 users report that the ASMedia IP cores used
for the southbridge SATA ports are not reliable enough.

Alan.
