From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Wed, 6 Jul 2016 15:15:25 -0400
Message-ID: <07c35af5-5780-3659-48cc-63bff79548a4@gmail.com>
In-Reply-To: <CAJCQCtSpMwksiB=219xg0PVPX=9Qjz4D=T3_Ky3pea5-zN5ejQ@mail.gmail.com>
On 2016-07-06 14:45, Chris Murphy wrote:
> On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-07-06 12:43, Chris Murphy wrote:
>
>>> So does it make sense to just set the default to 180? Or is there a
>>> smarter way to do this? I don't know.
>>
>> Just thinking about this:
>> 1. People who are setting this somewhere will be functionally unaffected.
>
> I think it's statistically 0 people changing this from the default. It's
> people with drives that have no SCT ERC support, used in raid1+, who
> happen to stumble upon this very obscure workaround to avoid link
> resets in the face of media defects. Rare.
Not as much as you might think; once someone has hit this issue, they
usually put preventative measures in place on any system where it
applies. I'd be willing to bet that most sysadmins at big companies
like Red Hat or Oracle are setting this.
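For reference, the workaround being discussed is just bumping the
kernel's per-device command timer through sysfs, along these lines
(sdX standing in for whichever device applies):

    # Give the drive up to 3 minutes to answer before the kernel resets
    # the link; the value is in seconds, writing it needs root, and it
    # does not persist across reboots (hence the usual udev rule).
    echo 180 > /sys/block/sdX/device/timeout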
>
>
>> 2. People using single disks which have lots of errors may or may not see an
>> apparent degradation of performance, but will likely have the life
>> expectancy of their device extended.
>
> Well, they get link resets, and their file system presumably face
> plants as a result of a pile of commands in the queue returning as
> unsuccessful. So they have a premature death of their system, rather
> than it just getting sluggish. This is a long-standing indicator on
> Windows to just reinstall the OS and restore data from backups -> the
> user has an opportunity to freshen up their user data backup, and the
> reinstallation and restore from backup result in freshly written
> sectors, which is how bad sectors get fixed. The marginally bad
> sectors get new writes and now read fast (or fast enough), and the
> persistently bad sectors result in the drive firmware remapping them
> to reserve sectors.
>
> The main thing in my opinion is less the extension of drive life than
> that the user gets to keep using the system, albeit sluggishly, to
> make a backup of their data rather than possibly losing it.
The extension of the drive's lifetime is a nice benefit, but not what my
point was here. For people in this particular case, it will almost
certainly only make things better (although at first it may make
performance worse).
>
>
>> 3. Individuals who are not setting this but should be will on average be no
>> worse off than before other than seeing a bigger performance hit on a disk
>> error.
>> 4. People with single disks which are new will see no functional change
>> until the disk has an error.
>
> I follow.
>
>
>>
>> In an ideal situation, what I'd want to see is:
>> 1. If the device supports SCT ERC, set scsi_command_timer to a
>> reasonable percentage over that (probably something like 25%, which
>> would give roughly 10 seconds for the normal 7-second ERC timer).
>> 2. If the device is actually a SCSI device, keep the 30-second timer
>> (IIRC, this is reasonable for SCSI disks).
>> 3. Otherwise, set the timer to 200 (we need a slight buffer over the
>> expected disk timeout to account for things like latency outside of
>> the disk).
>
> Well, if it's a non-redundant configuration, you'd want those long
> recoveries permitted, rather than enabling SCT ERC. The drive has the
> ability to relocate sector data on a marginal (slow) read that's still
> successful. But clearly many manufacturers tolerate slow reads that
> don't result in immediate reallocation or overwrite, or we wouldn't be
> in this situation in the first place. I think this auto-reallocation
> is thwarted by enabling SCT ERC: the drive just flat out gives up and
> reports a read error. So it is still data loss in the non-redundant
> configuration, and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make
judgments based on userspace usage. Also, the first situation, while
not optimal, is still better than what happens now; at least there you
will get an I/O error in a reasonable amount of time (as opposed to
after a really long time, if ever).
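To make the heuristic above concrete, here's roughly what I have in
mind for a single device. This is only a sketch: the smartctl/lsblk
parsing is illustrative rather than robust, and the device name is an
example:

    dev=sdX                             # example device name, needs root
    erc=$(smartctl -l scterc "/dev/$dev" | awk '/Read:/ {print $2}')
    tran=$(lsblk -dno TRAN "/dev/$dev") # rough transport guess: sata/sas/usb
    case "$erc" in
      ''|*[!0-9]*)                      # no SCT ERC, or reported as Disabled
        if [ "$tran" = "sas" ]; then
          timeout=30                    # real SCSI: keep the current default
        else
          timeout=200                   # allow for worst-case internal recovery
        fi
        ;;
      *)                                # ERC is in deciseconds; add ~25% and
        timeout=$(( (erc + erc / 4 + 9) / 10 ))  # round up to whole seconds
        ;;
    esac
    echo "$timeout" > "/sys/block/$dev/device/timeout"

For a 7-second (70 decisecond) ERC timer that works out to a 9-second
command timer, which is in the "roughly 10 seconds" ballpark.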
>
> Basically it's:
>
> For SATA and USB drives:
>
> if data redundant, then enable short SCT ERC time if supported, if not
> supported then extend SCSI command timer to 200;
>
> if data not redundant, then disable SCT ERC if supported, and extend
> SCSI command timer to 200.
>
> For SCSI (SAS most likely these days), keep things the same as now.
> But that's only because this is a rare enough configuration now that I
> don't know if we really understand the problems there. It may be that
> their error recovery in 7 seconds is massively better and more
> reliable than that of consumer drives over 180 seconds.
I don't see why you would think this is not common. If you count just
by systems, then SCSI is absolutely outnumbered at least 100 to 1 by
regular ATA disks. If you look at individual disks, though, the
reverse is true, because people who use SCSI drives tend to use _lots_
of disks (think big data centers, NAS and SAN systems, and such).
OTOH, both are probably vastly outnumbered by stuff that doesn't use
either standard for storage...
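As for the decision tree itself, a rough sketch of what it might look
like in a boot script follows. The device name and the redundancy flag
are example values here; nothing at this layer can detect redundancy
by itself, which is part of the problem:

    dev=sdX; redundant=yes              # example values supplied by the admin
    if smartctl -l scterc "/dev/$dev" | grep -q 'Read:'; then
      erc_ok=yes                        # drive answers SCT ERC queries
    else
      erc_ok=no
    fi
    if [ "$redundant" = "yes" ] && [ "$erc_ok" = "yes" ]; then
      # Redundant data: fail fast (7.0s, given in deciseconds) and let
      # the upper layer reconstruct from a mirror or parity.
      smartctl -q errorsonly -l scterc,70,70 "/dev/$dev"
    elif [ "$redundant" = "yes" ]; then
      echo 200 > "/sys/block/$dev/device/timeout"  # no ERC: wait it out
    else
      # Non-redundant: the drive's own retries are the only hope, so
      # disable ERC (where present) and make the kernel wait for it.
      smartctl -q errorsonly -l scterc,0,0 "/dev/$dev" 2>/dev/null
      echo 200 > "/sys/block/$dev/device/timeout"
    fi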
Separately, USB gets _really_ complicated if you want to cover
everything. USB drives may or may not present as non-rotational, and
may or may not show up as SATA or SCSI bridges (some of the more
expensive flash drives actually use SSD controllers plus USB-SAT chips
internally); if they do show up as such, they may or may not support
the required commands (most don't, and it's seemingly hit or miss
which do).
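For what it's worth, the probe itself is at least cheap; whether a
given bridge passes the SCT commands through shows up immediately with
something like:

    # Probe the SCT ERC timers through a USB-SATA bridge; '-d sat'
    # asks smartctl to use SCSI-ATA Translation, which many bridges
    # need just to attempt ATA pass-through. On most bridges this
    # reports the command as unsupported rather than a real value.
    smartctl -d sat -l scterc /dev/sdX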
>
>
>>
>>>
>>>
>>>>> I suspect, but haven't tested, that ZFS On Linux would be equally
>>>>> affected, unless they're completely reimplementing their own block
>>>>> layer (?) So there are quite a few parties now negatively impacted by
>>>>> the current default behavior.
>>>>
>>>>
>>>> OTOH, I would not be surprised if the stance there is 'you get no
>>>> support if you're not using enterprise drives', not because of the
>>>> project itself, but because it's ZFS. Part of their minimum
>>>> recommended hardware requirements is ECC RAM, so it wouldn't
>>>> surprise me if enterprise storage devices are there too.
>>>
>>>
>>> http://open-zfs.org/wiki/Hardware
>>> "Consistent performance requires hard drives that support error
>>> recovery control. "
>>>
>>> "Drives that lack such functionality can be expected to have
>>> arbitrarily high limits. Several minutes is not impossible. Drives
>>> with this functionality typically default to 7 seconds. ZFS does not
>>> currently adjust this setting on drives. However, it is advisable to
>>> write a script to set the error recovery time to a low value, such as
>>> 0.1 seconds until ZFS is modified to control it. This must be done on
>>> every boot. "
>>>
>>> They do not explicitly require enterprise drives, but they clearly
>>> expect SCT ERC enabled to some sane value.
>>>
>>> At least for Btrfs and ZFS, the mkfs is in a position to know all
>>> parameters for properly setting SCT ERC and the SCSI command timer for
>>> every device. Maybe it could create the udev rule? Single and raid0
>>> profiles need to permit long recoveries, whereas raid1, 5, and 6 need
>>> to set things for very short recoveries.
>>>
>>> Possibly mdadm and lvm tools do the same thing.
>>
>> I"m pretty certain they don't create rules, or even try to check the drive
>> for SCT ERC support.
>
> They don't. That's a suggested change in behavior. Sorry, I meant
> "should do the same thing" instead of "do the same thing".
>
>
>> The problem with doing this is that you can't be certain whether your
>> underlying device is actually a physical storage device or not, so you
>> have to check more than just the SCT ERC commands. Also, many people
>> (myself included) don't like tools making persistent changes to the
>> functioning of their system that the tool itself is not intended to
>> make (and messing with block layer settings falls into that category
>> for a mkfs tool).
>
> Yep, it's imperfect unless there's proper cross-communication between
> layers. Some such things, like hardware RAID geometry, optionally poke
> through (when supported by hardware RAID drivers) so that things like
> mkfs.xfs can automatically provide the right sunit/swidth for an
> optimized layout, which the device mapper already does automatically.
> So it could be done; it's just a matter of how big a problem it is to
> build that, vs. just going with a new one-size-fits-all default
> command timer.
The other problem, though, is that the existing mechanisms pass through
_read-only_ data, while this would require writable data to be passed
through, which potentially leads to all kinds of complicated issues.
>
> If it were always 200 instead of 30, the only consequence would be if
> there's a link problem that is not related to media errors. But what
> the hell takes that long to report an explicit error? Even cable
> problems generate UDMA errors pretty much instantly.
And that is exactly why I'd suggest changing the kernel default first,
before trying to use special heuristics or anything like that. The
caveat is that it would need to apply to ATA disks only, so as not to
break SCSI (which works fine right now) or USB (which has its own
unique issues).
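Anyone wanting that effect today without kernel changes could probably
get most of the way there with a udev rule keyed on the transport; a
rough sketch (I'm assuming udev's ata_id sets ID_BUS=ata for ATA
disks, so verify that before relying on it):

    # e.g. /etc/udev/rules.d/60-ata-cmd-timeout.rules (illustrative):
    # raise the command timer only for ATA disks, leaving SCSI/SAS and
    # everything else at the kernel's 30-second default.
    ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
      ENV{ID_BUS}=="ata", RUN+="/bin/sh -c 'echo 180 > /sys%p/device/timeout'"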