From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Andrei Borzenkov <arvidjaar@gmail.com>,
	Hugo Mills <hugo@carfax.org.uk>,
	Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Mon, 27 Jun 2016 07:21:56 -0400	[thread overview]
Message-ID: <fd7d250c-0a5a-ea3e-9ea2-ec6e50e14169@gmail.com> (raw)
In-Reply-To: <CAJCQCtSqO4GNm8kBDuzUXEXYx+54zFgsD6=ARNsRgVUb53LQZw@mail.gmail.com>

On 2016-06-25 12:44, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> Well, the obvious major advantage that comes to mind for me of
>> checksumming parity is that it would let us scrub the parity data
>> itself and verify it.
>
> OK but hold on. During scrub, it should read data, compute checksums
> *and* parity, and compare those to what's on-disk: EXTENT_CSUM in
> the checksum tree, and the parity strip in the chunk tree. And if
> parity is wrong, then it should be replaced.
Except that's horribly inefficient.  With limited exceptions involving 
highly situational co-processors, computing a checksum of a parity block 
is always going to be faster than computing parity for the whole stripe.  
By using that checksum to verify the parity, we can safely speed up the 
common case of a scrub with near-zero errors by a pretty significant 
factor (some rough numbers below).
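As a back-of-envelope illustration -- the geometry here is my own 
assumption, not anything established in this thread -- the compute 
cost per stripe in the no-error case looks roughly like this:

    # Assumed geometry (mine): 6-device RAID5 with 64 KiB strips.
    STRIP_KIB = 64
    NDEV = 6

    # Recompute-and-compare (md-style): XOR together the N-1 data strips.
    recompute_kib = (NDEV - 1) * STRIP_KIB   # 320 KiB combined per stripe

    # Csum-and-compare: one checksum pass over the parity strip alone,
    # checked against a stored ~4-byte crc32c.
    csum_kib = STRIP_KIB                     # 64 KiB hashed per stripe

    print(recompute_kib / csum_kib)          # 5.0x less data to crunch

The scrub reads all the strips either way; the saving is purely in how 
much data has to go through the CPU per parity strip.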

The ideal situation that I'd like to see for scrub WRT parity is (a 
rough sketch of this flow follows the list):
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we 
just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity.  If the 
freshly computed parity matches what is already on disk, the checksum 
was bad and should be rewritten (and we should probably recompute the 
whole block of checksums it's in); otherwise, the parity was bad, so 
write out the new parity and update the checksum.
4. Have an option to skip the csum check on the parity and always 
compute it.
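A minimal sketch of that decision flow, in Python only for brevity.  
Every helper name on the hypothetical `stripe` handle is a stand-in of 
mine, not a btrfs interface, and zlib.crc32 stands in for the crc32c 
btrfs actually uses:

    import zlib

    def scrub_parity_strip(stripe, skip_parity_csum=False):
        """Steps 1-4 above; 'stripe' is a hypothetical handle."""
        on_disk = stripe.read_parity_strip()
        # Steps 2/4: trust a matching csum unless the caller opted out.
        if (not skip_parity_csum
                and zlib.crc32(on_disk) == stripe.stored_parity_csum()):
            return "ok"
        # Step 3: fall back to recomputing parity from the data strips.
        computed = stripe.compute_parity()
        if computed == on_disk:
            # Parity was fine; the stored csum was the bad half. Rewrite
            # it (and probably the whole csum block it lives in).
            stripe.write_parity_csum(zlib.crc32(on_disk))
            return "csum-repaired"
        # Parity itself was bad: write fresh parity plus its csum.
        stripe.write_parity_strip(computed)
        stripe.write_parity_csum(zlib.crc32(computed))
        return "parity-repaired"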
>
> Even `echo check > md/sync_action` does this. So, no pun intended,
> but Btrfs isn't even at parity with mdadm on data integrity if it
> doesn't check whether the parity matches the data.
Except that MD and LVM don't have checksums to verify anything outside 
of the very high-level metadata.  They have to compute the parity during 
a scrub because that's the _only_ way they have to check data integrity. 
Just because that's the only way for them to check it does not mean we 
have to follow their design, especially considering that we have other, 
faster ways to check it.
>
>
>> I'd personally much rather know my parity is bad before I need to use it
>> than after using it to reconstruct data and getting an error there, and I'd
>> be willing to bet that most seasoned sysadmins working for companies using
>> big storage arrays likely feel the same about it.
>
> That doesn't require parity csums though. It just requires computing
> parity during a scrub and comparing it to the parity on disk to make
> sure they're the same. If they aren't, assuming no other error for
> that full stripe read, then the parity block is replaced.
It does not require parity csums, but they can make the check 
significantly more efficient, and even a 1% efficiency gain is a huge 
difference on a big array (worked numbers below).
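To put a number on that, with figures that are purely my own 
assumptions rather than anything from this thread:

    # Assumed: a 100 TB array scrubbed end-to-end at 500 MB/s.
    array_bytes = 100e12
    scrub_rate = 500e6                          # bytes per second
    full_scrub_h = array_bytes / scrub_rate / 3600
    print(f"full scrub: {full_scrub_h:.1f} h")                 # ~55.6 h
    print(f"1% saved:   {full_scrub_h * 0.01 * 60:.0f} min")   # ~33 min

Half an hour shaved off every scrub adds up fast when you scrub on a 
schedule.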
>
> So that's also something to check in the code or poke a system with a
> stick and see what happens.
>
>> I could see it being
>> practical to have an option to turn this off for performance reasons or
>> similar, but again, I have a feeling that most people would rather be able
>> to check whether a rebuild will eat data before trying to rebuild.
>> (Depending on the situation, it will sometimes just make more sense to
>> nuke the array and restore from a backup instead of spending time
>> waiting for it to rebuild.)
>
> The much bigger problem we have right now, affecting Btrfs and
> LVM/mdadm md raid alike, is the silly bad default of non-enterprise
> drives having no configurable SCT ERC, with ensuing long recovery
> times, and the kernel SCSI command timer at 30 seconds - which
> also fucks over regular single-disk users, because it means they
> don't get the "benefit" of long recovery times, which is the whole
> g'd point of that feature. This itself causes so many problems where
> bad sectors just get worse and don't get fixed up because of all the
> link resets. So I still think it's a bullshit default kernel-side,
> because it affects pretty much the majority use case; it's only a
> non-problem with proprietary hardware raid, and with software raid
> using enterprise (or NAS-specific) drives that already have short
> recovery times by default.
On this, we can agree.
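
For anyone bitten by this in the meantime, here's a minimal sketch of 
the usual per-boot workaround, using the standard smartctl and sysfs 
knobs (the device names are examples, adjust to taste): tell the drive 
to give up quickly where SCT ERC is configurable, and give the kernel 
more patience where it isn't.

    import subprocess

    DISKS = ["sda", "sdb"]  # example device names

    for disk in DISKS:
        # Ask the drive to abort internal error recovery after 7.0
        # seconds (smartctl takes tenths of a second). Drives without
        # SCT ERC support will reject this.
        result = subprocess.run(
            ["smartctl", "-l", "scterc,70,70", f"/dev/{disk}"],
            capture_output=True,
        )
        if result.returncode != 0:
            # No configurable ERC: raise the kernel command timer well
            # above the 30 s default instead, so the drive's recovery
            # isn't cut short by link resets.
            with open(f"/sys/block/{disk}/device/timeout", "w") as f:
                f.write("180")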
