Re: Status of RAID5/6 - Chris Murphy

linux-btrfs.vger.kernel.org archive mirror

From: Chris Murphy <lists@colorremedies.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Goffredo Baroncelli <kreijack@inwind.it>,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Status of RAID5/6
Date: Sun, 1 Apr 2018 15:11:04 -0600
Message-ID: <CAJCQCtTXQGbVjRLegC25DDokcv4Wph6O04s=C_rzp8n5jRpt5Q@mail.gmail.com>
In-Reply-To: <CAJCQCtSrcFD7jTbrqsWZFWrKUrMp4wW0QhkPApB-pgA-O3WksA@mail.gmail.com>

(I hate it when my palm rubs the trackpad and hits send prematurely...)

On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy <lists@colorremedies.com> wrote:

>> Users can run scrub immediately after _every_ unclean shutdown to
>> reduce the risk of inconsistent parity and unrecoverable data should
>> a disk fail later, but this can only prevent future write hole events,
>> not recover data lost during past events.
>
> Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> such a leaf containing EXTENT_CSUM means that EXTENT_CSUM

means that EXTENT_CSUM is assumed to be correct. But in fact it could
be stale. It's just as possible that the metadata and superblock
update is what's missing due to the interruption, while both the data
and parity strip writes succeeded. The window in which the data or
parity write can fail is much shorter than the window for the numerous
metadata writes followed by the superblock update. In such a case, the
old metadata is what's pointed to, including EXTENT_CSUM. Therefore
your scrub would always show a csum error, even if both data and
parity are correct. You'd have to init-csum in this case, I suppose.
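
A minimal sketch of that stale-csum case, under loud assumptions: two
data strips plus one XOR parity strip, 8-byte strips, and zlib.crc32
standing in for the real checksum. The names (committed_csums, on_disk)
are made up for illustration and are not Btrfs structures.

```python
import zlib


def csum(block: bytes) -> int:
    # Stand-in checksum for illustration only (assumption, not the Btrfs csum).
    return zlib.crc32(block)


def xor(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR, i.e. single-parity computation over two strips.
    return bytes(x ^ y for x, y in zip(a, b))


# Last committed state: the csums that the old metadata still points to.
d0_old, d1 = b"A" * 8, b"B" * 8
committed_csums = {"d0": csum(d0_old), "d1": csum(d1)}

# In-place RMW of d0: the data strip and the parity strip both reach disk...
d0_new = b"C" * 8
on_disk = {"d0": d0_new, "d1": d1, "p": xor(d0_new, d1)}

# ...but the crash hits before the new metadata and superblock are written,
# so committed_csums still describes the old contents of d0.
parity_ok = xor(on_disk["d0"], on_disk["d1"]) == on_disk["p"]
csum_ok = csum(on_disk["d0"]) == committed_csums["d0"]

print(parity_ok, csum_ok)  # prints: True False
```

Data and parity agree with each other, yet a csum-based check against
the old metadata still flags d0.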

Pretty much it's RMW with a (partial) stripe overwrite upending COW,
and therefore upending the atomicity, and thus consistency of Btrfs in
the raid56 case where any portion of the transaction is interrupted.
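
To make the RMW hazard concrete, here is a toy sketch of the opposite
interruption: the data strip is overwritten but the crash happens before
parity is updated. It assumes a two-data-strip, single XOR parity stripe
with 8-byte strips; it is a model of the idea, not of Btrfs code.

```python
def xor(a: bytes, b: bytes) -> bytes:
    # Byte-wise XOR, i.e. single-parity computation over two strips.
    return bytes(x ^ y for x, y in zip(a, b))


# Committed stripe: d1 belongs to some older, already committed extent.
d0 = b"A" * 8
d1_original = b"B" * 8
parity = xor(d0, d1_original)

# Partial stripe overwrite of d0 in place; crash before parity is rewritten.
d0 = b"C" * 8

# Later the device holding d1 fails and d1 is rebuilt from d0 and parity.
d1_rebuilt = xor(d0, parity)

print(d1_rebuilt == d1_original)  # prints: False
```

The strip that gets damaged is d1, which the interrupted write never
touched. That is the sense in which the overwrite defeats COW: data
committed by an earlier transaction is no longer recoverable.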

And this is amplified if metadata is also raid56.

ZFS avoids the problem at the expense of probably a ton of
fragmentation, by taking e.g. 4KiB RMW and writing a full length
stripe of 8KiB fully COW, rather than doing stripe modification with
an overwrite. And that's because it has dynamic stripe lengths. For
Btrfs to always do COW would mean that a 4KiB change goes into a new
full stripe, 64KiB * num devices, assuming no other changes are ready
at commit time.
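
The arithmetic above can be sketched as follows. The function names are
hypothetical, the Btrfs figure uses the "64KiB * num devices" formula
exactly as stated in this message, and the ZFS figure only models the
small single-parity write from the example (4KiB sectors, write fits in
one row), not the real allocator.

```python
KIB = 1024


def btrfs_full_stripe_bytes(num_devices: int, strip_kib: int = 64) -> int:
    # Fixed-width full stripe: 64KiB per device, as stated in the message.
    return strip_kib * KIB * num_devices


def zfs_dynamic_stripe_bytes(data_bytes: int, sector_bytes: int = 4 * KIB,
                             parity_sectors: int = 1) -> int:
    # Dynamic-width stripe: just enough data sectors plus parity.
    # Assumption: the write is small enough to fit in a single row.
    data_sectors = -(-data_bytes // sector_bytes)  # ceiling division
    return (data_sectors + parity_sectors) * sector_bytes


change = 4 * KIB
print(btrfs_full_stripe_bytes(4), zfs_dynamic_stripe_bytes(change))
# prints: 262144 8192
```

With four devices, a lone 4KiB change costs 256KiB as a fully COW Btrfs
stripe under these assumptions, versus 8KiB for the dynamic stripe.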

So yeah, avoiding the problem is best. But if it's going to be a
journal, it's going to make things pretty damn slow I'd think, unless
the journal can be explicitly placed on something faster than the
array, like an SSD/NVMe device. And that's what mdadm allows and expects.

-- 
Chris Murphy
