Re: scrub implies failing drive - smartctl blissfully unaware

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Phillip Susi <psusi@ubuntu.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: scrub implies failing drive - smartctl blissfully unaware
Date: Wed, 19 Nov 2014 11:07:43 -0500	[thread overview]
Message-ID: <546CC04F.6040207@ubuntu.com> (raw)
In-Reply-To: <pan$7c051$a7c25939$cafa59e$f63379c9@cox.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/18/2014 9:46 PM, Duncan wrote:
> I'm not sure about normal operation, but certainly, many drives
> take longer than 30 seconds to stabilize after power-on, and I
> routinely see resets during this time.

As far as I have seen, typical drive spin up time is on the order of
3-7 seconds.  Hell, I remember my pair of first generation seagate
cheetah 15,000 rpm drives seemed to take *forever* to spin up and that
still was maybe only 15 seconds.  If a drive takes longer than 30
seconds, then there is something wrong with it.  I figure there is a
reason why spin up time is tracked by SMART so it seems like long spin
up time is a sign of a sick drive.

> This doesn't happen on single-hardware-device block devices and 
> filesystems because in that case it's either up or down, if the
> device doesn't come up in time the resume simply fails entirely,
> instead of coming up with one or more devices there, but others
> missing as they didn't stabilize in time, as is unfortunately all
> too common in the multi- device scenario.

No, the resume doesn't "fail entirely".  The drive is reset, and the
IO request is retried, and by then it should succeed.

> I've seen this with both spinning rust and with SSDs, with mdraid
> and btrfs, with multiple mobos and device controllers, and with
> resume both from suspend to ram (if the machine powers down the
> storage devices in that case, as most modern ones do) and hibernate
> to permanent storage device, over several years worth of kernel
> series, so it's a reasonably widespread phenomena, at least among
> consumer-level SATA devices.  (My experience doesn't extend to
> enterprise-raid-level devices or proper SCSI, etc, so I simply
> don't know, there.)

If you are restoring from hibernation, then the drives are already
spun up before the kernel is loaded.

> While two minutes is getting a bit long, I think it's still within
> normal range, and some devices definitely take over a minute enough
> of the time to be both noticeable and irritating.

It certainly is not normal for a drive to take that long to spin up.
IIRC, the 30 second timeout comes from the ATA specs which state that
it can take up to 30 seconds for a drive to spin up.

> That said, I SHOULD say I'd be far *MORE* irritated if the device
> simply pretended it was stable and started reading/writing data
> before it really had stabilized, particularly with SSDs where that
> sort of behavior has been observed and is known to put some devices
> at risk of complete scrambling of either media or firmware, beyond
> recovery at times.  That of course is the risk of going the other
> direction, and I'd a WHOLE lot rather have devices play it safe for
> another 30 seconds or so after they / think/ they're stable and be
> SURE, than pretend to be just fine when voltages have NOT
> stabilized yet and thus end up scrambling things irrecoverably.
> I've never had that happen here tho I've never stress- tested for
> it, only done normal operation, but I've seen testing reports where
> the testers DID make it happen surprisingly easily, to a surprising
>  number of their test devices.

Power supply voltage is stable within milliseconds.  What takes HDDs
time to start up is mechanically bringing the spinning rust up to
speed.  On SSDs, I think you are confusing testing done on power
*cycling* ( i.e. yanking the power cord in the middle of a write )
with startup.

> So, umm... I suspect the 2-minute default is 2 minutes due to
> power-up stabilizing issues, where two minutes is a reasonable
> compromise between failing the boot most of the time if the timeout
> is too low, and taking excessively long for very little further
> gain.

The default is 30 seconds, not 2 minutes.

> sure whether it's even possible, without some specific hardware
> feature available to tell the kernel that it has in fact NOT been
> in power-saving mode for say 5-10 minutes, hopefully long enough
> that voltage readings really /are/ fully stabilized and a shorter
> timeout is possible.

Again, there is no several minute period where voltage stabilizes and
the drive takes longer to access.  This is a complete red herring.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUbMBPAAoJEI5FoCIzSKrwcV0H/20pv7O5+CDf2cRg5G5vt7PR
4J1NuVIBsboKwjwCj8qdxHQJHihvLYkTQKANqaqHv0+wx0u2DaQdPU/LRnqN71xA
jP7b9lx9X6rPnAnZUDBbxzAc8HLeutgQ8YD/WB0sE5IXlI1/XFGW4tXIZ4iYmtN9
GUdL+zcdtEiYE993xiGSMXF4UBrN8d/5buBRsUsPVivAZes6OHbf9bd72c1IXBuS
ADZ7cH7XGmLL3OXA+hm7d99429HFZYAgI7DjrLWp6Tb9ja5Gvhy+AVvrbU5ZWMwu
XUnNsLsBBhEGuZs5xpkotZgaQlmJpw4BFY4BKwC6PL+7ex7ud3hGCGeI6VDmI0U=
=DLHU
-----END PGP SIGNATURE-----

next prev parent reply	other threads:[~2014-11-19 16:07 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <E1XqYMg-0000YI-8y@watricky.valid.co.za>
2014-11-18  7:29 ` scrub implies failing drive - smartctl blissfully unaware Brendan Hide
2014-11-18  7:36   ` Roman Mamedov
2014-11-18 13:24     ` Brendan Hide
2014-11-18 15:16       ` Duncan
2014-11-18 12:08   ` Austin S Hemmelgarn
2014-11-18 13:25     ` Brendan Hide
2014-11-18 16:02     ` Phillip Susi
2014-11-18 15:35   ` Marc MERLIN
2014-11-18 16:04     ` Phillip Susi
2014-11-18 16:11       ` Marc MERLIN
2014-11-18 16:26         ` Phillip Susi
2014-11-18 18:57     ` Chris Murphy
2014-11-18 20:58       ` Phillip Susi
2014-11-19  2:40         ` Chris Murphy
2014-11-19 15:11           ` Phillip Susi
2014-11-20  0:05             ` Chris Murphy
2014-11-25 21:34               ` Phillip Susi
2014-11-25 23:13                 ` Chris Murphy
2014-11-26  1:53                   ` Rich Freeman
2014-12-01 19:10                   ` Phillip Susi
2014-11-28 15:02                 ` Patrik Lundquist
2014-11-19  2:46         ` Duncan
2014-11-19 16:07           ` Phillip Susi [this message]
2014-11-19 21:05             ` Robert White
2014-11-19 21:47               ` Phillip Susi
2014-11-19 22:25                 ` Robert White
2014-11-20 20:26                   ` Phillip Susi
2014-11-20 22:45                     ` Robert White
2014-11-21 15:11                       ` Phillip Susi
2014-11-21 21:12                         ` Robert White
2014-11-21 21:41                           ` Robert White
2014-11-22 22:06                           ` Phillip Susi
2014-11-19 22:33                 ` Robert White
2014-11-20 20:34                   ` Phillip Susi
2014-11-20 23:08                     ` Robert White
2014-11-21 15:27                       ` Phillip Susi
2014-11-20  0:25               ` Duncan
2014-11-20  2:08                 ` Robert White
2014-11-19 23:59             ` Duncan
2014-11-25 22:14               ` Phillip Susi
2014-11-28 15:55                 ` Patrik Lundquist
2014-11-21  4:58   ` Zygo Blaxell
2014-11-21  7:05     ` Brendan Hide
2014-11-21 12:55       ` Ian Armstrong
2014-11-21 17:45         ` Chris Murphy
2014-11-22  7:18           ` Ian Armstrong
2014-11-21 17:42       ` Zygo Blaxell
2014-11-21 18:06         ` Chris Murphy
2014-11-22  2:25           ` Zygo Blaxell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=546CC04F.6040207@ubuntu.com \
    --to=psusi@ubuntu.com \
    --cc=1i5t5.duncan@cox.net \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).