From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Hugo Mills <hugo@carfax.org.uk>,
Andrei Borzenkov <arvidjaar@gmail.com>,
Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
Chris Murphy <lists@colorremedies.com>,
kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Fri, 24 Jun 2016 07:36:43 -0400 [thread overview]
Message-ID: <edab8dbd-60e7-fafa-7267-d1e65a897353@gmail.com> (raw)
In-Reply-To: <20160624105959.GJ3325@carfax.org.uk>
On 2016-06-24 06:59, Hugo Mills wrote:
> On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>>> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>>>>>> 24.06.2016 04:47, Zygo Blaxell пишет:
>>>>>>> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>>>>>>>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>>>>>>>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
>>>>>>>>
>>>>>>>> Yeah I'm kinda confused on this point.
>>>>>>>>
>>>>>>>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>>>>>>>
>>>>>>>> It says there is a write hole for Btrfs. But defines it in terms of
>>>>>>>> parity possibly being stale after a crash. I think the term comes not
>>>>>>>> from merely parity being wrong but parity being wrong *and* then being
>>>>>>>> used to wrongly reconstruct data because it's blindly trusted.
>>>>>>>
>>>>>>> I think the opposite is more likely, as the layers above raid56
>>>>>>> seem to check the data against sums before raid56 ever sees it.
>>>>>>> (If those layers seem inverted to you, I agree, but OTOH there are
>>>>>>> probably good reason to do it that way).
>>>>>>>
>>>>>>
>>>>>> Yes, that's how I read code as well. btrfs layer that does checksumming
>>>>>> is unaware of parity blocks at all; for all practical purposes they do
>>>>>> not exist. What happens is approximately
>>>>>>
>>>>>> 1. logical extent is allocated and checksum computed
>>>>>> 2. it is mapped to physical area(s) on disks, skipping over what would
>>>>>> be parity blocks
>>>>>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>>>>>
>>>>>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>>>>>
>>>>> Checksums are not parity, correct. However, every data block
>>>>> (including, I think, the parity) is checksummed and put into the csum
>>>>> tree. This allows the FS to determine where damage has occurred,
>>>>> rather thansimply detecting that it has occurred (which would be the
>>>>> case if the parity doesn't match the data, or if the two copies of a
>>>>> RAID-1 array don't match).
>>>>>
>>>>
>>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>>> degraded disk won't be able to reconstruct data on this degraded disk
>>>> because reconstructed extent content won't match checksum. Which kinda
>>>> makes RAID5 pointless.
>>>
>>> Eh? How do you come to that conclusion?
>>>
>>> For data, say you have n-1 good devices, with n-1 blocks on them.
>>> Each block has a checksum in the metadata, so you can read that
>>> checksum, read the blocks, and verify that they're not damaged. From
>>> those n-1 known-good blocks (all data, or one parity and the rest
>>
>> We do not know whether parity is good or not because as far as I can
>> tell parity is not checksummed.
>
> I was about to write a devastating rebuttal of this... then I
> actually tested it, and holy crap you're right.
>
> I've just closed the terminal in question by accident, so I can't
> copy-and-paste, but the way I checked was:
>
> # mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
> # mount /dev/loop0 foo
> # dd if=/dev/urandom of=foo/file bs=4k count=32
> # umount /dev/loop0
> # btrfs-debug-tree /dev/loop0
>
> then look at the csum tree:
>
> item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
> extent csum item
>
> There is a single csum item in it, of length 128. At 4 bytes per csum,
> that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
> nothing for the parity.
>
> This is fundamentally broken, and I think we need to change the
> wiki to indicate that the parity RAID implementation is not
> recommended, because it doesn't actually do the job it's meant to in a
> reliable way. :(
>
So item 4 now then, together with:
1. Rebuilds seemingly randomly decide based on the filesystem whether or
not to take an insanely long time (always happens on some arrays, never
happens on others, I have yet to see a report where it happens
intermittently).
2. Failed disks seem to occasionally cause irreversible data corruption.
3. Classic erasure-code write-hole, just slightly different because of COW.
TBH, as much as I hate to say this, it looks like the raid5/6 code needs
redone from scratch. At an absolute minimum, we need to put a warning
in mkfs for people using raid5/6 to tell them they shouldn't be using it
outside of testing.
next prev parent reply other threads:[~2016-06-24 11:36 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-20 3:44 Adventures in btrfs raid5 disk recovery Zygo Blaxell
2016-06-20 18:13 ` Roman Mamedov
2016-06-20 19:11 ` Zygo Blaxell
2016-06-20 19:30 ` Chris Murphy
2016-06-20 20:40 ` Zygo Blaxell
2016-06-20 21:27 ` Chris Murphy
2016-06-21 1:55 ` Zygo Blaxell
2016-06-21 3:53 ` Zygo Blaxell
2016-06-22 17:14 ` Chris Murphy
2016-06-22 20:35 ` Zygo Blaxell
2016-06-23 19:32 ` Goffredo Baroncelli
2016-06-24 0:26 ` Chris Murphy
2016-06-24 1:47 ` Zygo Blaxell
2016-06-24 4:02 ` Andrei Borzenkov
2016-06-24 8:50 ` Hugo Mills
2016-06-24 9:52 ` Andrei Borzenkov
2016-06-24 10:16 ` Hugo Mills
2016-06-24 10:19 ` Andrei Borzenkov
2016-06-24 10:59 ` Hugo Mills
2016-06-24 11:36 ` Austin S. Hemmelgarn [this message]
2016-06-24 17:40 ` Chris Murphy
2016-06-24 18:06 ` Zygo Blaxell
2016-06-24 17:06 ` Chris Murphy
2016-06-24 17:21 ` Andrei Borzenkov
2016-06-24 17:52 ` Chris Murphy
2016-06-24 18:19 ` Austin S. Hemmelgarn
2016-06-25 16:44 ` Chris Murphy
2016-06-25 21:52 ` Chris Murphy
2016-06-26 7:54 ` Andrei Borzenkov
2016-06-26 15:03 ` Duncan
2016-06-26 19:30 ` Chris Murphy
2016-06-26 19:52 ` Zygo Blaxell
2016-06-27 11:21 ` Austin S. Hemmelgarn
2016-06-27 16:17 ` Chris Murphy
2016-06-27 20:54 ` Chris Murphy
2016-06-27 21:02 ` Henk Slager
2016-06-27 21:57 ` Zygo Blaxell
2016-06-27 22:30 ` Chris Murphy
2016-06-28 1:52 ` Zygo Blaxell
2016-06-28 2:39 ` Chris Murphy
2016-06-28 3:17 ` Zygo Blaxell
2016-06-28 11:23 ` Austin S. Hemmelgarn
2016-06-28 12:05 ` Austin S. Hemmelgarn
2016-06-28 12:14 ` Steven Haigh
2016-06-28 12:25 ` Austin S. Hemmelgarn
2016-06-28 16:40 ` Steven Haigh
2016-06-28 18:01 ` Chris Murphy
2016-06-28 18:17 ` Steven Haigh
2016-07-05 23:05 ` Chris Murphy
2016-07-06 11:51 ` Austin S. Hemmelgarn
2016-07-06 16:43 ` Chris Murphy
2016-07-06 17:18 ` Austin S. Hemmelgarn
2016-07-06 18:45 ` Chris Murphy
2016-07-06 19:15 ` Austin S. Hemmelgarn
2016-07-06 21:01 ` Chris Murphy
2016-06-24 16:52 ` Chris Murphy
2016-06-24 16:56 ` Hugo Mills
2016-06-24 16:39 ` Zygo Blaxell
2016-06-24 1:36 ` Zygo Blaxell
2016-06-23 23:37 ` Chris Murphy
2016-06-24 2:07 ` Zygo Blaxell
2016-06-24 5:20 ` Chris Murphy
2016-06-24 10:16 ` Andrei Borzenkov
2016-06-24 17:33 ` Chris Murphy
2016-06-24 11:24 ` Austin S. Hemmelgarn
2016-06-24 16:32 ` Zygo Blaxell
2016-06-24 2:17 ` Zygo Blaxell
2016-06-22 4:06 ` Adventures in btrfs raid5 disk recovery - update Zygo Blaxell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=edab8dbd-60e7-fafa-7267-d1e65a897353@gmail.com \
--to=ahferroin7@gmail.com \
--cc=arvidjaar@gmail.com \
--cc=ce3g8jdj@umail.furryterror.org \
--cc=hugo@carfax.org.uk \
--cc=kreijack@inwind.it \
--cc=linux-btrfs@vger.kernel.org \
--cc=lists@colorremedies.com \
--cc=rm@romanrm.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).