From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Chris Murphy <lists@colorremedies.com>,
	Hugo Mills <hugo@carfax.org.uk>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	Austin Hemmelgarn <ahferroin7@gmail.com>
Subject: Re: RAID system with adaption to changed number of disks
Date: Wed, 12 Oct 2016 09:32:17 +0800	[thread overview]
Message-ID: <3da9a459-c63b-570c-5b42-c7186b3a74fd@cn.fujitsu.com> (raw)
In-Reply-To: <CAJCQCtSY2Y5AsW2FC5FGP3x3Vaz6Y10=EbAE-0FKFQAqg0oGkg@mail.gmail.com>



At 10/12/2016 07:58 AM, Chris Murphy wrote:
> https://btrfs.wiki.kernel.org/index.php/Status
> Scrub + RAID56 Unstable will verify but not repair
>
> This doesn't seem quite accurate. It does repair the vast majority of
> the time. On scrub, though, there's maybe a 1 in 3 or 1 in 4 chance
> that a bad data strip results in: a) the data strip being fixed up
> from parity, but b) the replacement parity being recomputed wrongly,
> so that c) good parity is silently overwritten with bad, and d) if
> parity reconstruction is needed in the future, e.g. on device or
> sector failure, it results in EIO, a kind of data loss.
>
> Bad bug. For sure.
>
> But consider the identical scenario with md or LVM raid5, or any
> conventional hardware raid5. A scrub check simply reports a mismatch.
> It's unknown whether data or parity is bad, so the bad data strip is
> propagated upward to user space without error. On a scrub repair, the
> data strip is assumed to be good, and good parity is overwritten with
> bad.

Totally true.

The original RAID5/6 design only handles missing devices, not rotted bits.

>
> So while I agree in total that Btrfs raid56 isn't mature or tested
> enough to consider it production ready, I think that's because of the
> UNKNOWN causes of the problems we've seen with raid56, not because of
> the parity scrub bug. That bug is NOT good, of course, not least
> because the data integrity guarantees Btrfs is purported to make are
> substantially negated by it. I think the bark is worse than the bite.
> It is not the bark we'd like Btrfs to have though, for sure.
>

The current btrfs RAID5/6 scrub problem is that we don't make full use
of the tree and data checksums.

In the ideal situation, btrfs should detect which stripe is corrupted,
and only accept recovered data/parity when the recovered data's
checksum matches.

For example, for a very traditional RAID5 layout like the following:

   Disk 1    |   Disk 2    |  Disk 3     |
-----------------------------------------
   Data 1    |   Data 2    |  Parity     |
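
To be concrete, the parity in this layout is just the byte-wise XOR of
the two data stripes, so any single lost element can be rebuilt by
XORing the two survivors. A minimal C sketch (the 64KiB element size
matches the btrfs default, but the function itself is purely
illustrative, not real btrfs code):

#include <stdint.h>
#include <stddef.h>

#define STRIPE_LEN (64 * 1024)	/* btrfs stripe element size (64KiB) */

/*
 * parity[i] = data1[i] ^ data2[i].  Rebuilding a lost data stripe is
 * the same operation, run on the surviving stripe and the parity.
 */
void raid5_xor_parity(const uint8_t *data1, const uint8_t *data2,
		      uint8_t *parity, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] = data1[i] ^ data2[i];
}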

Scrub should check data stripes 1 and 2 against their checksums first
(a rough C sketch of the decision flow follows the case list below):

[All data extents have csums]
1) All csums match
    Good, then check parity.
    1.1) Parity matches
         Nothing wrong at all.

    1.2) Parity mismatches
         Just recalculate the parity. The corruption may be in unused
         data space or in the parity itself; either way, recalculating
         the parity is good enough.

2) One data stripe's csum mismatches (or is missing), and parity
    mismatches too
    We only know that one data stripe mismatches; we can't be sure
    whether the parity is OK.
    Try to recover that data stripe from parity, and re-check its csum.

    2.1) Recovered data stripe matches its csum
         That data stripe was corrupted and the parity is OK.
         Recoverable.

    2.2) Recovered data stripe mismatches its csum
         Both that data stripe and the parity are corrupted.
         Unrecoverable.

3) Two data stripes' csums mismatch, no matter whether parity matches
    At least 2 stripes are screwed up; no fix is possible anyway.

[Some data extents have no csum (nodatasum)]
4) All existing csums match (or there are no csums at all), parity
    matches
    Good, nothing to worry about.

5) An existing csum mismatches for one data stripe, parity mismatches
    Like 2), try to recover that data stripe, and re-check its csum.

    5.1) Recovered data stripe matches its csum
         At least we can recover the data covered by csums.
         Corrupted no-csum data is not our concern.

    5.2) Recovered data stripe mismatches its csum
         Screwed up.

6) No csum at all, parity mismatches
    We're screwed, just like traditional RAID5.
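
To make the decision flow concrete, here is a rough C sketch of cases
1) to 3). Every type and helper in it is a hypothetical placeholder
for this sketch, not real kernel or btrfs-progs code:

#include <stdbool.h>

struct full_stripe;	/* one RAID5 full stripe: data stripes + parity */

/* Hypothetical helpers, assumed to exist only for this sketch: */
int  count_csum_mismatches(struct full_stripe *fs);
bool parity_matches(struct full_stripe *fs);
void recalc_parity(struct full_stripe *fs);
void rebuild_data_from_parity(struct full_stripe *fs);
bool rebuilt_data_csum_ok(struct full_stripe *fs);

enum scrub_result { SCRUB_OK, SCRUB_REPAIRED, SCRUB_UNRECOVERABLE };

enum scrub_result scrub_full_stripe(struct full_stripe *fs)
{
	int bad = count_csum_mismatches(fs);

	if (bad == 0) {
		if (parity_matches(fs))
			return SCRUB_OK;	/* case 1.1) */
		/* case 1.2): all data is trusted, so the parity can
		 * simply be recalculated and rewritten. */
		recalc_parity(fs);
		return SCRUB_REPAIRED;
	}

	if (bad == 1) {
		/* case 2): rebuild the bad stripe from parity, then
		 * re-check its csum before trusting the result. */
		rebuild_data_from_parity(fs);
		if (rebuilt_data_csum_ok(fs))
			return SCRUB_REPAIRED;	/* case 2.1) */
		return SCRUB_UNRECOVERABLE;	/* case 2.2) */
	}

	/* case 3): two or more bad data stripes, RAID5 can't help. */
	return SCRUB_UNRECOVERABLE;
}

The key point is the re-check in case 2): a rebuilt stripe is only
trusted after its csum verifies.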

I'm implementing the above cases in btrfs-progs as an off-line scrub
tool.

Currently it looks good, and it can already handle cases 1) to 3).
I tend to just ignore any full stripe that lacks csums and whose
parity mismatches.

But as you can see, there are so many factors involved in btrfs RAID5
(whether csums exist, whether they match, whether parity matches,
missing devices) that it's already much more complex than traditional
RAID5/6 or the current scrub implementation. RAID6 will be more
complex still.


So what the current kernel scrub lacks is:
1) Detection of good/bad stripes
2) Re-checking of recovery attempts

But that's exactly what traditional RAID5/6 lacks too, unless it has
some hidden checksum, like btrfs's, that it can use.

Thanks,
Qu



