Linux ATA/IDE development
From: Niklas Cassel <cassel@kernel.org>
To: AlanCui4080 <me@alancui.cc>
Cc: linux-ide@vger.kernel.org, dlemoal@kernel.org
Subject: Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
Date: Thu, 23 Apr 2026 13:15:21 +0200
Message-ID: <aen_SQ-7fPfdAylr@ryzen>
In-Reply-To: <f8dAJyMVQ4yJA5_7X9Jscw@alancui.cc>

Hello Alan,

On Thu, Apr 23, 2026 at 05:18:24PM +0800, AlanCui4080 wrote:
> On Tuesday, 21 April 2026 00:27,you wrote:
> > From this it seems that it is simply the first IDENTIFY that times out.
> > On the second try, it seems that the IDENTIFY passes, otherwise we would
> > have seen more "revalidation failed (errno=-5)" prints for the same drive.
> > 
> > So, from this log alone, I don't see any problem. We will try to do IDENTIFY
> > up to three times, so just a single IDENTIFY failing should not be a problem.
> 
> So at your opinion, the error is caused by a hardware failure but not kernel, 
> so we should not add any quirk to relax or solve the problem, is that correct?
> (I just want to confirm that how kernel will deal with this error)

Like Damien said, the IDENTIFY DEVICE command is one of the few commands which
a device is required to execute without leaving the Standby state or requiring
a spin-up. A device is allowed to reply to IDENTIFY with the 'incomplete' bit
set:

37C8h - Device requires SET FEATURES subcommand to spin-up after power-up and
IDENTIFY DEVICE data is incomplete (see 4.19).
738Ch - Device requires SET FEATURES subcommand to spin-up after power-up and
IDENTIFY DEVICE data is complete (see 4.19).

8C73h - Device does not require SET FEATURES subcommand to spin-up after
power-up and IDENTIFY DEVICE data is incomplete (see 4.19).
C837h - Device does not require SET FEATURES subcommand to spin-up after
power-up and IDENTIFY DEVICE data is complete (see 4.19).

libata looks like it already handles this:
https://github.com/torvalds/linux/blob/v7.0/drivers/ata/libata-core.c#L1903-L1922
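
To make the four values above concrete, here is a minimal standalone sketch
that decodes them. It is not the libata code linked above; the struct and
function names are made up for illustration, and only the four hex values
come from the text quoted from the specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; these are not libata identifiers. */
struct id_word2_info {
	bool known;		/* value matched one of the four listed above */
	bool needs_spinup;	/* SET FEATURES subcommand required to spin up */
	bool data_complete;	/* IDENTIFY DEVICE data is complete */
};

/* Decode the specific configuration value reported by IDENTIFY DEVICE. */
static struct id_word2_info decode_id_word2(uint16_t value)
{
	struct id_word2_info info = { false, false, false };

	switch (value) {
	case 0x37C8:	/* spin-up required, data incomplete */
		info = (struct id_word2_info){ true, true, false };
		break;
	case 0x738C:	/* spin-up required, data complete */
		info = (struct id_word2_info){ true, true, true };
		break;
	case 0x8C73:	/* no spin-up required, data incomplete */
		info = (struct id_word2_info){ true, false, false };
		break;
	case 0xC837:	/* no spin-up required, data complete */
		info = (struct id_word2_info){ true, false, true };
		break;
	default:	/* any other value: leave everything false */
		break;
	}
	return info;
}
```

A caller would check .known first and treat any other value as
"no information", rather than guessing at a default.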



However, in your case you get a timeout, which means that the device does
not reply at all.

Before a system suspend, libata will send a spin-down/STANDBY IMMEDIATE
command to all drives.

After a system resume, libata will send a COMRESET to all devices, before
it sends the IDENTIFY, and after that it will send SET ACTIVE to spin-up
the drive.

It seems that occasionally, some of your drives hang in a weird state after
STANDBY + COMRESET + IDENTIFY. When we get a timeout, we will do another
COMRESET + IDENTIFY, and this time your drive does not hang.
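
The retry behaviour described here (reset, then IDENTIFY, up to three tries)
can be sketched as a small simulation. This is only a model of the flow, not
the libata implementation: the callback types, the fake drive and the
MAX_TRIES name are all assumptions made for the example.

```c
#include <stdbool.h>

#define MAX_TRIES 3	/* "up to three times", as discussed in this thread */

/* Hypothetical callbacks standing in for the real reset and IDENTIFY paths. */
typedef void (*reset_fn)(void *ctx);
typedef bool (*identify_fn)(void *ctx);	/* true on success, false on timeout */

/* Returns the number of tries used on success, or 0 if every try failed. */
static int identify_with_retries(reset_fn reset, identify_fn identify,
				 void *ctx)
{
	for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
		reset(ctx);		/* models the COMRESET */
		if (identify(ctx))	/* models the IDENTIFY */
			return attempt;
	}
	return 0;
}

/* Simulated drive that hangs on its first `hangs_left` IDENTIFY commands. */
struct fake_drive {
	int hangs_left;
	int resets;
};

static void fake_reset(void *ctx)
{
	struct fake_drive *d = ctx;

	d->resets++;
}

static bool fake_identify(void *ctx)
{
	struct fake_drive *d = ctx;

	if (d->hangs_left > 0) {
		d->hangs_left--;
		return false;	/* models a timed-out IDENTIFY */
	}
	return true;
}
```

With a fake drive that hangs once, identify_with_retries() succeeds on the
second try after two resets, which matches what the log in this thread
suggests is happening.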

My best guess is that it is a HDD firmware bug where the drive sometimes
hangs after a STANDBY + COMRESET + IDENTIFY, or claims to be ready before
it is actually ready.

It could of course also be a bug in e.g. ata_wait_ready(), where we send
the IDENTIFY command too quickly after the COMRESET, but if that were the case,
I think we would have seen way more bug reports from different vendors by now.



Anyway, from a user space perspective, we never remove the device (we only
do that if IDENTIFY fails three times), so the retries themselves should not
be visible to user space applications.

So if you disregard the error in the log, from a user space application
perspective, the only difference should be that it takes a few extra seconds
for the device to reply to commands after a system resume.


> 
> > So I think the question is, at this point, can you read from the drive?
> > 
> > E.g.:
> > # dd if=/dev/sda of=/dev/null iflag=direct bs=4K count=1
> 
> I will be blocked out of the shell for 5 secs unless the IDENTIFY succeed.

But as soon as you get a shell after a system resume, the above command
succeeds, right?


> 
> > 
> > If you can read from the device, then this seem like a problem with zpool
> > kicking the device off the RAID array (perhaps because it is taking longer
> > than some zpool defined timeout value?), rather than a libata problem.
> 
> But after the link re-established, the drive works normally.

My suggestion is to look at the zpool code to see how long it waits to find
all devices after a system resume before it kicks devices off the RAID array.

My initial feeling is that if your device is ready 5 seconds after a
system resume, then the timeout value for zpool to kick off a device must be
very low.


Kind regards,
Niklas
