linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hugo Mills <hugo@carfax.org.uk>
To: Andrei Borzenkov <arvidjaar@gmail.com>
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Chris Murphy <lists@colorremedies.com>,
	kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Fri, 24 Jun 2016 10:16:58 +0000	[thread overview]
Message-ID: <20160624101658.GI3325@carfax.org.uk> (raw)
In-Reply-To: <CAA91j0Uqg0FafFTG3NhQt=p8KkYRhTeMU5Bd+JuUxDntP6g8Ng@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3672 bytes --]

On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> 24.06.2016 04:47, Zygo Blaxell пишет:
> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
> >> >>
> >> >> Yeah I'm kinda confused on this point.
> >> >>
> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >>
> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >
> >> > I think the opposite is more likely, as the layers above raid56
> >> > seem to check the data against sums before raid56 ever sees it.
> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> > probably good reason to do it that way).
> >> >
> >>
> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> is unaware of parity blocks at all; for all practical purposes they do
> >> not exist. What happens is approximately
> >>
> >> 1. logical extent is allocated and checksum computed
> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> be parity blocks
> >> 3. when these areas are written out, RAID56 parity is computed and filled in
> >>
> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >
> >    Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree. This allows the FS to determine where damage has occurred,
> > rather thansimply detecting that it has occurred (which would be the
> > case if the parity doesn't match the data, or if the two copies of a
> > RAID-1 array don't match).
> >
> 
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

   Eh? How do you come to that conclusion?

   For data, say you have n-1 good devices, with n-1 blocks on them.
Each block has a checksum in the metadata, so you can read that
checksum, read the blocks, and verify that they're not damaged. From
those n-1 known-good blocks (all data, or one parity and the rest
data) you can reconstruct the remaining block. That reconstructed
block won't be checked against the csum for the missing block -- it'll
just be written and a new csum for it written with it.

   Hugo.

> ...
> >
> >> > It looks like uncorrectable failures might occur because parity is
> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> > doesn't match even though data blindly reconstructed from the parity
> >> > *would* match the data.
> >> >
> >>
> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> should at least avoid silent corruption.
> >>

-- 
Hugo Mills             | Debugging is like hitting yourself in the head with
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: E2AB1DE4          |                                        PotatoEngineer

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

  reply	other threads:[~2016-06-24 10:17 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-20  3:44 Adventures in btrfs raid5 disk recovery Zygo Blaxell
2016-06-20 18:13 ` Roman Mamedov
2016-06-20 19:11   ` Zygo Blaxell
2016-06-20 19:30     ` Chris Murphy
2016-06-20 20:40       ` Zygo Blaxell
2016-06-20 21:27         ` Chris Murphy
2016-06-21  1:55           ` Zygo Blaxell
2016-06-21  3:53             ` Zygo Blaxell
2016-06-22 17:14             ` Chris Murphy
2016-06-22 20:35               ` Zygo Blaxell
2016-06-23 19:32                 ` Goffredo Baroncelli
2016-06-24  0:26                   ` Chris Murphy
2016-06-24  1:47                     ` Zygo Blaxell
2016-06-24  4:02                       ` Andrei Borzenkov
2016-06-24  8:50                         ` Hugo Mills
2016-06-24  9:52                           ` Andrei Borzenkov
2016-06-24 10:16                             ` Hugo Mills [this message]
2016-06-24 10:19                               ` Andrei Borzenkov
2016-06-24 10:59                                 ` Hugo Mills
2016-06-24 11:36                                   ` Austin S. Hemmelgarn
2016-06-24 17:40                               ` Chris Murphy
2016-06-24 18:06                                 ` Zygo Blaxell
2016-06-24 17:06                             ` Chris Murphy
2016-06-24 17:21                               ` Andrei Borzenkov
2016-06-24 17:52                                 ` Chris Murphy
2016-06-24 18:19                                   ` Austin S. Hemmelgarn
2016-06-25 16:44                                     ` Chris Murphy
2016-06-25 21:52                                       ` Chris Murphy
2016-06-26  7:54                                         ` Andrei Borzenkov
2016-06-26 15:03                                           ` Duncan
2016-06-26 19:30                                           ` Chris Murphy
2016-06-26 19:52                                             ` Zygo Blaxell
2016-06-27 11:21                                       ` Austin S. Hemmelgarn
2016-06-27 16:17                                         ` Chris Murphy
2016-06-27 20:54                                           ` Chris Murphy
2016-06-27 21:02                                           ` Henk Slager
2016-06-27 21:57                                           ` Zygo Blaxell
2016-06-27 22:30                                             ` Chris Murphy
2016-06-28  1:52                                               ` Zygo Blaxell
2016-06-28  2:39                                                 ` Chris Murphy
2016-06-28  3:17                                                   ` Zygo Blaxell
2016-06-28 11:23                                                     ` Austin S. Hemmelgarn
2016-06-28 12:05                                             ` Austin S. Hemmelgarn
2016-06-28 12:14                                               ` Steven Haigh
2016-06-28 12:25                                                 ` Austin S. Hemmelgarn
2016-06-28 16:40                                                   ` Steven Haigh
2016-06-28 18:01                                                     ` Chris Murphy
2016-06-28 18:17                                                       ` Steven Haigh
2016-07-05 23:05                                                         ` Chris Murphy
2016-07-06 11:51                                                           ` Austin S. Hemmelgarn
2016-07-06 16:43                                                             ` Chris Murphy
2016-07-06 17:18                                                               ` Austin S. Hemmelgarn
2016-07-06 18:45                                                                 ` Chris Murphy
2016-07-06 19:15                                                                   ` Austin S. Hemmelgarn
2016-07-06 21:01                                                                     ` Chris Murphy
2016-06-24 16:52                           ` Chris Murphy
2016-06-24 16:56                             ` Hugo Mills
2016-06-24 16:39                         ` Zygo Blaxell
2016-06-24  1:36                   ` Zygo Blaxell
2016-06-23 23:37               ` Chris Murphy
2016-06-24  2:07                 ` Zygo Blaxell
2016-06-24  5:20                   ` Chris Murphy
2016-06-24 10:16                     ` Andrei Borzenkov
2016-06-24 17:33                       ` Chris Murphy
2016-06-24 11:24                     ` Austin S. Hemmelgarn
2016-06-24 16:32                     ` Zygo Blaxell
2016-06-24  2:17                 ` Zygo Blaxell
2016-06-22  4:06 ` Adventures in btrfs raid5 disk recovery - update Zygo Blaxell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160624101658.GI3325@carfax.org.uk \
    --to=hugo@carfax.org.uk \
    --cc=arvidjaar@gmail.com \
    --cc=ce3g8jdj@umail.furryterror.org \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=lists@colorremedies.com \
    --cc=rm@romanrm.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).