linux-btrfs.vger.kernel.org archive mirror
From: Chris Murphy <lists@colorremedies.com>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Wed, 6 Jul 2016 15:01:49 -0600	[thread overview]
Message-ID: <CAJCQCtTGUQxkOf7VhGS--APJ45BsyuduzqDJcGOM5ihSC3L50w@mail.gmail.com> (raw)
In-Reply-To: <07c35af5-5780-3659-48cc-63bff79548a4@gmail.com>

On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-07-06 14:45, Chris Murphy wrote:

>> I think it's statistically 0 people changing this from default. It's
>> people with drives that have no SCT ERC support, used in raid1+, who
>> happen to stumble upon this very obscure workaround to avoid link
>> resets in the face of media defects. Rare.
>
> Not as rare as you think; once someone has hit this issue, they usually put
> preventive measures in place on any system where it applies.  I'd be
> willing to bet that most sysadmins at big companies like Red Hat or Oracle
> are setting this.

SCT ERC yes. Changing the kernel's command timer? I think almost zero.



>> Well they have link resets and their file system presumably face
>> plants as a result of a pile of commands in the queue returning as
>> unsuccessful. So they have premature death of their system, rather
>> than it getting sluggish. This is a long standing indicator on Windows
>> to just reinstall the OS and restore data from backups -> the user has
>> an opportunity to freshen up user data backup, and the reinstallation
>> and restore from backup results in freshly written sectors which is
>> how bad sectors get fixed. The marginally bad sectors get new writes
>> and now read fast (or fast enough), and the persistently bad sectors
>> result in the drive firmware remapping to reserve sectors.
>>
>> The main thing in my opinion is less the extension of drive life than
>> that the user gets to use the system, albeit sluggishly, to make a
>> backup of their data rather than possibly losing it.
>
> The extension of the drive's lifetime is a nice benefit, but not what my
> point was here.  For people in this particular case, it will almost
> certainly only make things better (although at first it may make performance
> worse).

I'm not sure why it would make performance worse. The options are
slower reads versus a file system that almost certainly face-plants
upon a link reset.




>> Basically it's:
>>
>> For SATA and USB drives:
>>
>> if data redundant, then enable short SCT ERC time if supported, if not
>> supported then extend SCSI command timer to 200;
>>
>> if data not redundant, then disable SCT ERC if supported, and extend
>> SCSI command timer to 200.
>>
>> For SCSI (SAS most likely these days), keep things the same as now.
>> But that's only because this is a rare enough configuration now I
>> don't know if we really know the problems there. It may be that their
>> error recovery in 7 seconds is massively better and more reliable than
>> consumer drives over 180 seconds.
>
> I don't see why you would think this is not common.

I was not clear. Single device SAS is probably not common. They're
typically being used in arrays where data is redundant. Using such a
drive with short error recovery as a single boot drive? Probably not
that common.
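
The SATA/USB decision table quoted above can be sketched as a small
shell helper (a sketch only: `pick_policy` is a hypothetical name, and
the 7.0 s ERC value and 200 s timer are the figures from this thread):

```shell
#!/bin/sh
# Sketch of the per-drive policy described earlier in the thread.
# pick_policy REDUNDANT ERC_SUPPORTED prints what to configure:
#   "erc70"     -- set SCT ERC to 7.0s read/write
#   "erc0 t200" -- disable SCT ERC and raise the command timer to 200s
#   "t200"      -- just raise the SCSI command timer to 200s
pick_policy() {
    redundant="$1"; erc="$2"
    if [ "$redundant" = yes ]; then
        # redundancy above can repair; fail fast on bad sectors
        if [ "$erc" = yes ]; then echo "erc70"; else echo "t200"; fi
    else
        # no redundancy; let the drive retry as long as it wants
        if [ "$erc" = yes ]; then echo "erc0 t200"; else echo "t200"; fi
    fi
}
```

In practice the three outcomes would map to something like
`smartctl -l scterc,70,70 /dev/sdX`, `smartctl -l scterc,0,0 /dev/sdX`,
and `echo 200 > /sys/block/sdX/device/timeout`.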



> Separately, USB gets _really_ complicated if you want to cover everything,
> USB drives may or may not present as non-rotational, may or may not show up
> as SATA or SCSI bridges (there are some of the more expensive flash drives
> that actually use SSD controllers plus USB-SAT chips internally), if they do
> show up as such, may or may not support the required commands (most don't,
> but it's seemingly hit or miss which do).

Yup. Well, do what we can instead of just ignoring the problem? The
drives can still be polled for features including SCT ERC, and if it's
not supported or configurable, fall back to increasing the command
timer. I'm not sure what else can be done anyway.
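
Such polling could key off smartctl's own output. A minimal sketch,
assuming recent smartmontools prints "not supported" for drives that
lack SCT ERC (an assumption worth verifying against your own drive):

```shell
#!/bin/sh
# Reads `smartctl -l scterc /dev/sdX` output on stdin; exits 0 if the
# drive appears to support SCT ERC, nonzero otherwise.
erc_supported() {
    # succeed unless smartctl reported the feature as not supported
    ! grep -qi 'not supported'
}
```

Usage would then be along the lines of: if
`smartctl -l scterc /dev/sda | erc_supported` succeeds, set a short ERC
time; otherwise write 200 to /sys/block/sda/device/timeout.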

The main obstacle is squaring the device capability (low level) with
storage stack redundancy 0 or 1 (high level). Something has to be
aware of both to get every device ideally configured.



>> Yep it's imperfect unless there's the proper cross communication
>> between layers. There are some such things like hardware raid geometry
>> that optionally poke through (when supported by hardware raid drivers)
>> so that things like mkfs.xfs can automatically provide the right sunit
>> and swidth for an optimized layout, which the device mapper already does
>> automatically. So it could be done; it's just a matter of how big a
>> problem it is to build, versus just going with a new one-size-fits-all
>> default command timer.
>
> The other problem though is that the existing things pass through
> _read-only_ data, while this requires writable data to be passed through,
> which leads to all kinds of complicated issues potentially.

I'm aware. There are also plenty of bugs even if writes were passed
through. I've encountered more drives than not that accept only one
SCT ERC change per power-on; a second change causes the drive to go
offline and vanish from the bus. So no doubt this whole area is fragile
enough that not even the drive, controller, and enclosure vendors know
where all the bodies are buried.

What I think is fairly well established is that, at least on Windows,
the lower-level stack, including the kernel, tolerates these very high
recovery times. The OS just gets irritatingly slow but doesn't flip
out. Linux is flipping out. That's not directly Linux's fault (it lies
with the drive manufacturers), but Linux needs to adapt.



>>
>>
>> If it were always 200 instead of 30, the consequence is if there's a
>> link problem that is not related to media errors. But what the hell
>> takes that long to report an explicit error? Even cable problems
>> generate UDMA errors pretty much instantly.
>
> And that is more why I'd suggest changing the kernel default first before
> trying to use special heuristics or anything like that.  The caveat is that
> it would need to be for ATA disks only to not break SCSI (which works fine
> right now) and USB (which has its own unique issues).


I think you're probably right. Simpler is better.

Thing is, there will be consequences. In the software raid case where
a drive hangs on a media defect, right now this means a link reset at
30 seconds, which results in md reconstructing the data, so it goes
where needed pretty much 31 seconds after it was requested. If that
changes to 180 seconds, there will no doubt be some use cases that go,
WTF just happened? This used to always recover in 30 seconds at the
longest, and now it's causing the network stack to implode while
waiting.

So all kinds of other timeouts might get impacted.
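
One way to get the longer timer today, without waiting on a kernel
default change, is a udev rule. The path, the 200 s value, and the ATA
vendor match (which is how libata disks typically present) are all
illustrative:

```
# /etc/udev/rules.d/60-scsi-timeout.rules  (illustrative)
# Raise the SCSI command timer to 200s for ATA disks so that in-drive
# error recovery isn't cut short by a 30s link reset.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTRS{vendor}=="ATA*", \
  RUN+="/bin/sh -c 'echo 200 > /sys/block/%k/device/timeout'"
```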

I wonder if it makes sense to change the default SCSI command timer on
a distribution and see what happens - if e.g. Fedora or openSUSE would
volunteer to make the change for Rawhide or Tumbleweed. *shrug*
Statistically, users of those rolling releases may not hit a drive
with media defects plus a delay-intolerant workload for maybe
years...


-- 
Chris Murphy
