From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Chris Murphy <lists@colorremedies.com>,
Hugo Mills <hugo@carfax.org.uk>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
Austin Hemmelgarn <ahferroin7@gmail.com>
Subject: Re: RAID system with adaption to changed number of disks
Date: Wed, 12 Oct 2016 09:32:17 +0800
Message-ID: <3da9a459-c63b-570c-5b42-c7186b3a74fd@cn.fujitsu.com>
In-Reply-To: <CAJCQCtSY2Y5AsW2FC5FGP3x3Vaz6Y10=EbAE-0FKFQAqg0oGkg@mail.gmail.com>
At 10/12/2016 07:58 AM, Chris Murphy wrote:
> https://btrfs.wiki.kernel.org/index.php/Status
> Scrub + RAID56 Unstable will verify but not repair
>
> This doesn't seem quite accurate. It does repair the vast majority of
> the time. On scrub though, there's maybe a 1 in 3 or 1 in 4 chance bad
> data strip results in a.) fixed up data strip from parity b.) wrong
> recomputation of replacement parity c.) good parity is overwritten
> with bad, silently, d.) if parity reconstruction is needed in the
> future e.g. device or sector failure, it results in EIO, a kind of
> data loss.
>
> Bad bug. For sure.
>
> But consider the identical scenario with md or LVM raid5, or any
> conventional hardware raid5. A scrub check simply reports a mismatch.
> It's unknown whether data or parity is bad, so the bad data strip is
> propagated upward to user space without error. On a scrub repair, the
> data strip is assumed to be good, and good parity is overwritten with
> bad.
Totally true.
The original RAID5/6 design only handles a missing device, not rotted bits.
>
> So while I agree in total that Btrfs raid56 isn't mature or tested
> enough to consider it production ready, I think that's because of the
> UNKNOWN causes for problems we've seen with raid56. Not the parity
> scrub bug which - yeah NOT good, not least of which is the data
> integrity guarantees Btrfs is purported to make are substantially
> negated by this bug. I think the bark is worse than the bite. It is
> not the bark we'd like Btrfs to have though, for sure.
>
The current btrfs RAID5/6 scrub problem is that we don't take full
advantage of the tree and data checksums.
Ideally, btrfs should detect which stripe is corrupted, and only commit a
data/parity recovery if the recovered data matches its checksum.
For example, take a very traditional RAID5 layout like the following:
   Disk 1    |   Disk 2    |   Disk 3    |
-----------------------------------------
   Data 1    |   Data 2    |   Parity    |
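As a rough illustration of the layout above, the parity strip is the byte-wise XOR of the data strips, and a missing strip can be rebuilt by XORing the survivors with parity. This is only a minimal sketch: the helper name `xor_parity` and the 4-byte strips are made up for the example and are not btrfs code.

```python
from functools import reduce

def xor_parity(strips):
    """Byte-wise XOR across equal-length strips (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

# Toy 4-byte strips standing in for Data 1 and Data 2.
data1 = b"\x01\x02\x03\x04"
data2 = b"\x10\x20\x30\x40"
parity = xor_parity([data1, data2])

# If Disk 1 goes missing, Data 1 is rebuilt from Data 2 and Parity.
rebuilt = xor_parity([data2, parity])
```

Note that the XOR relation alone only says whether the three strips are consistent. It cannot say which of them is wrong, which is why the checksums matter.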
Scrub should check data stripes 1 and 2 against their checksums first.
[All data extents have csum]
1) All csums match
   Good, then check parity.
   1.1) Parity matches
        Nothing wrong at all.
   1.2) Parity mismatches
        Just recalculate parity. The corruption may be in unused data
        space or in parity. Either way, recalculating parity is good
        enough.
2) One data stripe csum mismatches (missing), parity mismatches too
   We only know that one data stripe mismatches; we are not sure
   whether parity is OK.
   Try to recover that data stripe from parity, and recheck csum.
   2.1) Recovered data stripe matches csum
        That data stripe is corrupted and parity is OK.
        Recoverable.
   2.2) Recovered data stripe mismatches csum
        Both that data stripe and parity are corrupted.
3) Two data stripes mismatch csum, whether parity matches or not
   At least 2 stripes are screwed up. No fix anyway.
[Some data extents have no csum (nodatasum)]
4) Existing csums match (or there is no csum at all), parity matches
   Good, nothing to worry about.
5) An existing csum mismatches for one data stripe, parity mismatches
   Like 2), try to recover that data stripe, and recheck csum.
   5.1) Recovered data stripe matches csum
        At least we can recover the data covered by csum.
        Corrupted no-csum data is not our concern.
   5.2) Recovered data stripe mismatches csum
        Screwed up.
6) No csum at all, parity mismatches
   We are screwed up, just like traditional RAID5.
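The cases above can be sketched as one decision function. This is not the btrfs-progs implementation, only a toy model under loud assumptions: one checksum per whole data stripe (real btrfs checksums are per block), `zlib.crc32` as a stand-in for the real csum, `None` marking nodatasum data, and a hypothetical function name `scrub_full_stripe`. A stripe with matching csums, some no-csum data and a parity mismatch is folded into the "unverifiable" result here, which is an assumption of the sketch.

```python
import zlib
from functools import reduce

def xor(strips):
    """Byte-wise XOR across equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

def csum(buf):
    """Stand-in checksum for the sketch (plain CRC32)."""
    return zlib.crc32(buf)

def scrub_full_stripe(data, parity, csums):
    """data: list of data stripes; csums: expected csum per stripe, or None.

    Returns (verdict, data, parity) with repaired contents where possible.
    """
    bad = [i for i, (d, c) in enumerate(zip(data, csums))
           if c is not None and csum(d) != c]
    parity_ok = xor(data) == parity
    has_nocsum = any(c is None for c in csums)

    if len(bad) >= 2:                      # case 3
        return "unrecoverable", data, parity
    if len(bad) == 1:                      # cases 2 and 5
        i = bad[0]
        others = [d for j, d in enumerate(data) if j != i]
        candidate = xor(others + [parity])
        if csum(candidate) == csums[i]:    # 2.1 / 5.1: recheck passed
            return "data repaired", data[:i] + [candidate] + data[i + 1:], parity
        return "unrecoverable", data, parity   # 2.2 / 5.2
    if parity_ok:                          # 1.1 / 4
        return "ok", data, parity
    if not has_nocsum:                     # 1.2: all data verified, parity is bad
        return "parity repaired", data, xor(data)
    return "unverifiable", data, parity    # 6: cannot tell data from parity

d1, d2 = b"AAAA", b"BBBB"
verdict, fixed, _ = scrub_full_stripe([b"AAAX", d2], xor([d1, d2]),
                                      [csum(d1), csum(d2)])
```

In the usage lines, Data 1 is corrupted while parity is intact, so the recovered stripe passes the csum recheck and the call takes the 2.1 path.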
I'm coding the above cases in btrfs-progs to implement an
off-line scrub tool.
Currently it looks good, and it can already handle cases 1) to 3).
I tend to ignore any full stripe that lacks checksums and whose parity
mismatches.
But as you can see, there are so many things (csum exists, csum matches,
parity matches, missing devices) involved in btrfs RAID5 (RAID6 will be
more complex) that it's already much more complex than traditional
RAID5/6 or the current scrub implementation.
So what the current kernel scrub lacks is:
1) Detection of good/bad stripes
2) Recheck of recovery attempts
But traditional RAID5/6 lacks all of that too, unless there is some
hidden checksum like btrfs's that it can use.
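To make the two missing pieces concrete, here is a small sketch contrasting a repair that trusts the data with one that rechecks the recovery against a checksum. The strip contents are made up and `zlib.crc32` is only a stand-in for the real csum, so treat it as an illustration of the idea rather than of any real scrub code.

```python
import zlib

def xor(a, b):
    """Byte-wise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))

good1, data2 = b"good", b"DATA"
parity = xor(good1, data2)
stored_csum = zlib.crc32(good1)   # checksum recorded when Data 1 was written

rotted1 = b"g0od"                 # bit rot on Disk 1

# Repair without checksums: assume data is good and rewrite parity.
blind_parity = xor(rotted1, data2)
# Parity now agrees with the rotted strip, so a later rebuild returns the rot.
blind_rebuild = xor(data2, blind_parity)

# Checksum-guided repair: rebuild from the untouched parity, then recheck.
candidate = xor(data2, parity)
accepted = zlib.crc32(candidate) == stored_csum
```

The first path makes the corruption permanent without reporting anything, while the second restores the original strip and can prove it did so.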
Thanks,
Qu