Date: Thu, 23 Apr 2026 13:15:21 +0200
From: Niklas Cassel
To: AlanCui4080
Cc: linux-ide@vger.kernel.org, dlemoal@kernel.org
Subject: Re: Default IDENTIFY timeout is 5000ms which is too short for enterprise disks
References: <14015677.uLZWGnKmhe@alanarchdesktop> <23071769.EfDdHjke4D@alanarchdesktop>

Hello Alan,

On Thu, Apr 23, 2026 at 05:18:24PM +0800, AlanCui4080 wrote:
> On Tuesday, 21 April 2026 00:27,you wrote:
> > From this it seems that it is simply the first IDENTIFY that times out.
> > On the second try, it seems that the IDENTIFY passes, otherwise we would
> > have seen more "revalidation failed (errno=-5)" prints for the same drive.
> >
> > So, from this log alone, I don't see any problem. We will try to do IDENTIFY
> > up to three times, so just a single IDENTIFY failing should not be a problem.
> So at your opinion, the error is caused by a hardware failure but not kernel,
> so we should not add any quirk to relax or solve the problem, is that correct?
> (I just want to confirm that how kernel will deal with this error)

Like Damien said, the IDENTIFY DEVICE command is one of the few commands
which a device is required to execute without leaving the Standby state or
requiring a spin-up.

A device is allowed to reply to IDENTIFY with the 'incomplete' bit set:

37C8h - Device requires SET FEATURES subcommand to spin-up after power-up
        and IDENTIFY DEVICE data is incomplete (see 4.19).
738Ch - Device requires SET FEATURES subcommand to spin-up after power-up
        and IDENTIFY DEVICE data is complete (see 4.19).
8C73h - Device does not require SET FEATURES subcommand to spin-up after
        power-up and IDENTIFY DEVICE data is incomplete (see 4.19).
C837h - Device does not require SET FEATURES subcommand to spin-up after
        power-up and IDENTIFY DEVICE data is complete (see 4.19).

libata looks like it already handles this:
https://github.com/torvalds/linux/blob/v7.0/drivers/ata/libata-core.c#L1903-L1922

However, in your case you get a timeout, which means that the device does
not reply at all.

Before a system suspend, libata will send a spin-down/STANDBY IMMEDIATE
command to all drives.

After a system resume, libata will send a COMRESET to all devices, before
it sends the IDENTIFY, and after that it will send SET ACTIVE to spin-up
the drive.

It seems that occasionally, some of your drives hang in a weird state
after STANDBY + COMRESET + IDENTIFY.

When we get a timeout, we will do another COMRESET + IDENTIFY, and this
time your drive does not hang.

My best guess is that it is a HDD firmware bug where the drive sometimes
hangs after a STANDBY + COMRESET + IDENTIFY, or claims to be ready before
it is actually ready.

It could of course also be a bug in e.g. ata_wait_ready(), and we are
sending the IDENTIFY command too quickly after the COMRESET, but if that
was the case, I think we would have seen way more bug reports from
different vendors by now.

Anyway, from a user space perspective, we never remove the device (we
only do that if we fail IDENTIFY three times), so the retries themselves
should not be visible to user space applications.

So if you disregard the error in the log, from a user space application
perspective, the only difference should be that it takes a few extra
seconds for the device to reply to commands after a system resume.

>
> > So I think the question is, at this point, can you read from the drive?
> >
> > E.g.:
> > # dd if=/dev/sda of=/dev/null iflag=direct bs=4K count=1
>
> I will be blocked out of the shell for 5 secs unless the IDENTIFY succeed.

But as soon as you get a shell after a system resume, the above command
succeeds, right?

>
> >
> > If you can read from the device, then this seem like a problem with zpool
> > kicking the device off the RAID array (perhaps because it is taking longer
> > than some zpool defined timeout value?), rather than a libata problem.
>
> But after the link re-established, the drive works normally.

My suggestion is to look at the zpool code to see how long it waits to
find all devices after a system resume before it kicks devices off the
RAID array.

My initial feeling is that if your device is ready 5 seconds after a
system resume, then the timeout value for zpool to kick off a device must
be very low.


Kind regards,
Niklas
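
For reference, the four values quoted in this message (37C8h, 738Ch, 8C73h,
C837h) are the "specific configuration" values a device may report in
IDENTIFY DEVICE word 2. The following is a minimal sketch of how they map
onto the two properties discussed above. It is not libata code; the names
PowerUpState, WORD2_STATES and decode_word2 are hypothetical and exist only
for illustration.

```python
from typing import NamedTuple, Optional


class PowerUpState(NamedTuple):
    """Illustrative container, not a libata structure."""
    needs_spinup: bool    # device requires the SET FEATURES spin-up subcommand
    data_complete: bool   # IDENTIFY DEVICE data is complete


# IDENTIFY DEVICE word 2 values, taken from the spec text quoted in the message.
WORD2_STATES = {
    0x37C8: PowerUpState(needs_spinup=True, data_complete=False),
    0x738C: PowerUpState(needs_spinup=True, data_complete=True),
    0x8C73: PowerUpState(needs_spinup=False, data_complete=False),
    0xC837: PowerUpState(needs_spinup=False, data_complete=True),
}


def decode_word2(word2: int) -> Optional[PowerUpState]:
    """Return the power-up state for a word 2 value.

    Returns None when the value is not one of the four listed above.
    """
    return WORD2_STATES.get(word2)
```

Note that this only applies when the device actually answers IDENTIFY. In
the timeout case described in this thread no data is returned at all, so
there is no word 2 to decode.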