Linux ATA/IDE development
 help / color / mirror / Atom feed
From: AlanCui4080 <me@alancui.cc>
To: Niklas Cassel <cassel@kernel.org>
Cc: linux-ide@vger.kernel.org
Subject: Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
Date: Thu, 16 Apr 2026 20:59:30 +0800	[thread overview]
Message-ID: <23071769.EfDdHjke4D@alanarchdesktop> (raw)
In-Reply-To: <ad-HTImFZxdtkD2J@ryzen>

Hi

On Wednesday, 15 April 2026 20:40,you wrote:

> Alan,
> could you please provide a full dmesg (dmesg that has not been cut)
> when reproducing your problem on kernel v7.0.

Here is a full dmesg before enter the suspend and recovery from it:

$ sudo systemctl suspend

[12951.183287] r8169 0000:07:00.0 enp7s0: Link is Down
[12951.209606] PM: suspend entry (deep)
[12951.253352] Filesystems sync: 0.043 seconds
[12955.016893] Freezing user space processes
[12955.018682] Freezing user space processes completed (elapsed 0.001 seconds)
[12955.018686] OOM killer disabled.
[12955.018688] Freezing remaining freezable tasks
[12955.019625] Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
[12955.019644] printk: Suspending console(s) (use no_console_suspend to debug)
[12955.027700] serial 00:04: disabled
[12955.036574] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[12955.037560] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[12955.194975] ACPI: PM: Preparing to enter system sleep state S3
[12955.701415] ACPI: PM: Saving platform NVS memory
[12955.701483] Disabling non-boot CPUs ...
[12955.703600] smpboot: CPU 15 is now offline
[12955.706609] smpboot: CPU 14 is now offline
[12955.709581] smpboot: CPU 13 is now offline
[12955.712508] smpboot: CPU 12 is now offline
[12955.715410] smpboot: CPU 11 is now offline
[12955.718240] smpboot: CPU 10 is now offline
[12955.721183] smpboot: CPU 9 is now offline
[12955.724271] smpboot: CPU 8 is now offline
[12955.725278] Spectre V2 : Update user space SMT mitigation: STIBP off
[12955.727089] smpboot: CPU 7 is now offline
[12955.729733] smpboot: CPU 6 is now offline
[12955.732059] smpboot: CPU 5 is now offline
[12955.734550] smpboot: CPU 4 is now offline
[12955.737036] smpboot: CPU 3 is now offline
[12955.739410] smpboot: CPU 2 is now offline
[12955.741822] smpboot: CPU 1 is now offline
[12955.743174] ACPI: PM: Low-level resume complete
[12955.743192] ACPI: PM: Restoring platform NVS memory
[12955.743331] LVT offset 0 assigned for vector 0x400
[12955.743893] Enabling non-boot CPUs ...
[12955.743930] smpboot: Booting Node 0 Processor 1 APIC 0x2
[12955.747020] CPU1 is up
[12955.747038] smpboot: Booting Node 0 Processor 2 APIC 0x4
[12955.750123] CPU2 is up
[12955.750147] smpboot: Booting Node 0 Processor 3 APIC 0x6
[12955.753787] CPU3 is up
[12955.753808] smpboot: Booting Node 0 Processor 4 APIC 0x8
[12955.757092] CPU4 is up
[12955.757112] smpboot: Booting Node 0 Processor 5 APIC 0xa
[12955.760262] CPU5 is up
[12955.760295] smpboot: Booting Node 0 Processor 6 APIC 0xc
[12955.763945] CPU6 is up
[12955.763969] smpboot: Booting Node 0 Processor 7 APIC 0xe
[12955.767190] CPU7 is up
[12955.767212] smpboot: Booting Node 0 Processor 8 APIC 0x1
[12955.770965] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[12955.770995] CPU8 is up
[12955.771014] smpboot: Booting Node 0 Processor 9 APIC 0x3
[12955.774292] CPU9 is up
[12955.774314] smpboot: Booting Node 0 Processor 10 APIC 0x5
[12955.778002] CPU10 is up
[12955.778025] smpboot: Booting Node 0 Processor 11 APIC 0x7
[12955.781355] CPU11 is up
[12955.781376] smpboot: Booting Node 0 Processor 12 APIC 0x9
[12955.785077] CPU12 is up
[12955.785106] smpboot: Booting Node 0 Processor 13 APIC 0xb
[12955.789075] CPU13 is up
[12955.789106] smpboot: Booting Node 0 Processor 14 APIC 0xd
[12955.792527] CPU14 is up
[12955.792550] smpboot: Booting Node 0 Processor 15 APIC 0xf
[12955.796288] CPU15 is up
[12955.797605] ACPI: PM: Waking up from system sleep state S3
[12955.800715] xhci_hcd 0000:02:00.0: xHC error in resume, USBSTS 0x401, Reinit
[12955.800718] usb usb1: root hub lost power or was reset
[12955.800720] usb usb2: root hub lost power or was reset
[12955.801742] serial 00:04: activated
[12955.858626] nvme nvme0: D3 entry latency set to 8 seconds
[12955.865177] nvme nvme1: 8/0/0 default/read/poll queues
[12955.874829] nvme nvme0: 16/0/0 default/read/poll queues
[12956.110891] ata3: SATA link down (SStatus 0 SControl 300)
[12956.110924] ata1: SATA link down (SStatus 0 SControl 300)
[12956.110955] ata5: SATA link down (SStatus 0 SControl 330)
[12956.260598] usb 1-9: reset low-speed USB device number 7 using xhci_hcd
[12956.617639] usb 1-6: WARN: invalid context state for evaluate context command.
[12956.790599] usb 1-6: reset full-speed USB device number 3 using xhci_hcd
[12956.841562] ata6: failed to resume link (SControl 0)
[12956.841577] ata6: SATA link down (SStatus 0 SControl 0)
[12957.058571] usb 1-1: WARN: invalid context state for evaluate context command.
[12957.231590] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
[12957.498554] usb 1-8: WARN: invalid context state for evaluate context command.
[12957.671531] usb 1-8: reset full-speed USB device number 5 using xhci_hcd
[12958.111579] usb 1-7: reset high-speed USB device number 4 using xhci_hcd
[12958.576614] usb 1-7.3: WARN: invalid context state for evaluate context command.
[12958.648571] usb 1-7.3: reset full-speed USB device number 6 using xhci_hcd
[12958.876493] OOM killer enabled.
[12958.876496] Restarting tasks: Starting
[12958.879178] Bluetooth: hci0: CSR: Setting up dongle with HCI ver=6 rev=22bb
[12958.879183] Bluetooth: hci0: LMP ver=6 subver=22bb; manufacturer=10
[12958.879299] Restarting tasks: Done
[12958.879306] efivarfs: resyncing variable state
[12958.886795] efivarfs: finished resyncing variable state
[12958.886811] random: crng reseeded on system resumption
[12959.047801] Bluetooth: MGMT ver 1.23
[12959.427062] NVRM: _kgspIsHeartbeatTimedOut: Heartbeat timed out, currentTimeMs 2367851435 heartbeat 0 heartbeatWithOffsetMs 0 diff 2367851435 timeout 5200
[12959.427065] NVRM: _kgspRpcRecvPoll: GSP RM heartbeat timed out
[12960.627839] PM: suspend exit
[12960.661209] Realtek Internal NBASE-T PHY r8169-0-700:00: attached PHY driver (mii_bus:phy_addr=r8169-0-700:00, irq=MAC)
[12960.820346] r8169 0000:07:00.0 enp7s0: Link is Down
[12961.113196] ata2: link is slow to respond, please be patient (ready=0)
[12961.114200] ata4: link is slow to respond, please be patient (ready=0)
[12963.428372] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx
[12965.816162] ata2: found unknown device (class 0)
[12965.816180] ata4: found unknown device (class 0)
[12965.969160] ata2: found unknown device (class 0)
[12965.969171] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12965.970159] ata4: found unknown device (class 0)
[12965.970167] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12981.369196] ata2.00: qc timeout after 15000 msecs (cmd 0xec)
[12981.369207] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[12981.369210] ata2.00: revalidation failed (errno=-5)
[12981.369226] ata4.00: qc timeout after 15000 msecs (cmd 0xec)
[12981.369236] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[12981.369238] ata4.00: revalidation failed (errno=-5)
[12981.833047] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12981.833134] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[12981.869506] ata2.00: configured for UDMA/133
[12981.879537] ata4.00: configured for UDMA/133

> And please explain your problem as detailed as you can, including which
> drive/port (ataX.YY) that you are having a problem with.

There is a soft RAID on ata2.00 and ata4.00. When i recovery from suspend,
the ata port will be failed at revalidation and "some software"
will see it as a failure, then kick the disk out of the RAID. So I started
to reproduce it.  (Those two are enterprise-grade disk, starts up really slowly)

And, This problem will only happened at recovery from suspend or when i put the
disk into "Standby_Z" mode. I guess, the datasheet of this disk said the recovery
from standby costs, at typical, 9 seconds or at maximum, 23 seconds, so that may
caused by the spinup time too long for those two disks. I enlong the timeout,
then this problem is being "relaxed", during 10 recoveries, only 2 times
the revalidation failed. Damien said a disk that properly implemented the ATA
specification should have response of ATA commands though they are spining up.
That is strange, you can see the revalidation on second time will immediately succeed,
and the total time (6s) is still short than 9 seconds. The difference between the
failed recovery to succeed one is that ata won't report "found unkown device" like
[  332.991862] ata4: found unknown device (class 0)

I attached more disk as test (they are just regular customer-grade), they
did not go wrong. And at most strange, the three new attached disk also will relaxed
the problem without adjustment on timeout (like 50 pct time failed, 50 pct time succeed).

So, at least now, I have no idea about why it happened.

> Then that timeout value will be used for each retry:
> https://github.com/torvalds/linux/blob/v7.0/drivers/ata/libata-core.c#L1612-L1617

> I.e. if you specify an explicit probe timeout value, you will not
> automatically get a larger timeout timeout for each retry.

But as far as i seen, the qc timeout will frezze the port, then reset it.
I just don't want the port failed explicitly, or without a port reset.

Alan




  reply	other threads:[~2026-04-16 13:05 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-09 10:21 Default IDENTIFY timeout is 5000ms which is too short for enterprise disks AlanCui4080
2026-04-09 11:55 ` Damien Le Moal
2026-04-09 12:01 ` Damien Le Moal
2026-04-15 12:40   ` Niklas Cassel
2026-04-16 12:59     ` AlanCui4080 [this message]
2026-04-20 16:27       ` Niklas Cassel
2026-04-23  9:18         ` AlanCui4080
2026-04-23 11:15           ` Niklas Cassel
2026-04-23 14:26             ` AlanCui4080
2026-04-23 16:17               ` Niklas Cassel
2026-05-08 20:48                 ` AlanCui4080
     [not found] ` <14062658.dW097sEU6C@alanarchdesktop>
     [not found]   ` <4482b737-1454-48cb-a941-165aa84fb2eb@kernel.org>
2026-04-10 11:24     ` AlanCui4080
2026-04-10 12:14       ` AlanCui4080

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=23071769.EfDdHjke4D@alanarchdesktop \
    --to=me@alancui.cc \
    --cc=cassel@kernel.org \
    --cc=linux-ide@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox