Re: RFC: raid with a variable stripe size

From: Zygo Blaxell <zblaxell@furryterror.org>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Chris Murphy <lists@colorremedies.com>,
	Goffredo Baroncelli <kreijack@inwind.it>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: RFC: raid with a variable stripe size
Date: Tue, 29 Nov 2016 17:54:00 -0500
Message-ID: <20161129225400.GT8685@hungrycats.org>
In-Reply-To: <cf02a446-f83d-0d4b-2b60-70a99e18ab39@cn.fujitsu.com>

On Tue, Nov 29, 2016 at 02:03:58PM +0800, Qu Wenruo wrote:
> At 11/29/2016 01:51 PM, Chris Murphy wrote:
> >On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> >>
> >>
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share them
> >>>hoping that they could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problems of
> >>>raid5/6 is the write hole. Today BTRFS is not able to address it.
> >>
> >>
> >>I'd say there is no need to address it yet, since current soft RAID5/6
> >>can't handle it yet either.
> >>
> >>Personally speaking, Btrfs should implement RAID56 support just like
> >>Btrfs on mdadm.
> >>See how badly the current RAID56 works?
> >>
> >>The marginal benefit of btrfs RAID56, scrubbing data better than
> >>traditional RAID56, is just a joke in the current code base.
> >
> >Btrfs is subject to the write hole problem on disk, but any read or
> >scrub that needs to reconstruct from parity that is corrupt results in
> >a checksum error and EIO. So corruption is not passed up to user
> >space. Recent versions of md/mdadm support a write journal to avoid
> >the write hole problem on disk in case of a crash.
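
The behaviour Chris describes can be sketched with a toy model.  This is
not btrfs code: the 3-device layout, the names xor_blocks and read_block,
and the use of plain CRC32 (zlib.crc32) as a stand-in checksum are all
assumptions made for illustration.  It only shows why a rebuild from
stale parity ends in EIO instead of bad data reaching user space.

```python
import zlib
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (single-parity, RAID5-style)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# One stripe on a hypothetical 3-device RAID5: two data blocks plus parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Checksums live outside the stripe (assumption: one CRC32 per data block).
csum = {"d1": zlib.crc32(d1)}

# Crash in the middle of an update: the new d0 reaches disk, but the
# matching parity update does not.  Parity is now stale (the write hole).
d0 = b"CCCC"

# Later the device holding d1 fails, so d1 is rebuilt from d0 and parity.
rebuilt_d1 = xor_blocks(d0, parity)


def read_block(name: str, data: bytes) -> bytes:
    """Return data only if it matches the stored checksum, else raise EIO."""
    if zlib.crc32(data) != csum[name]:
        raise OSError(5, "csum mismatch")  # errno 5 is EIO
    return data
```

In this sketch read_block("d1", rebuilt_d1) raises OSError with errno 5,
because the rebuilt block no longer matches the checksum recorded for d1.
The data is still lost, but the corruption is detected rather than
returned.
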
> 
> That's interesting.
> 
> So I think it's less worthwhile to support RAID56 in btrfs, especially
> considering the stability.
> 
> My wildest dream is that btrfs calls device mapper to build a micro
> RAID1/5/6/10 device for each chunk.
> That should save us tons of code and bugs.
> 
> And for better recovery, enhance device mapper to provide an interface to
> judge which block is correct.
> 
> Although that's just a dream anyway.

It would be nice to do that for balancing.  In many balance cases
(especially device delete and full balance after device add) it's not
necessary to rewrite the data in a block group, only copy it verbatim
to a different physical location (like pvmove does) and update the chunk
tree with the new address when it's done.  No need to rewrite the whole
extent tree.
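
A minimal sketch of that idea, as a toy model only: the ChunkMapping
class, relocate_chunk, read_logical and the bytearray "devices" are
hypothetical names invented for this example, not btrfs structures or
functions.  The point is that logical addresses stay the same, so nothing
that refers to data by logical address needs to change.

```python
from dataclasses import dataclass


@dataclass
class ChunkMapping:
    """Hypothetical chunk-tree entry: logical range -> device + physical offset."""
    logical: int
    length: int
    device: str
    physical: int


def relocate_chunk(chunk, devices, new_device, new_physical):
    """Copy the chunk's bytes verbatim, then repoint the mapping."""
    src = devices[chunk.device]
    dst = devices[new_device]
    dst[new_physical:new_physical + chunk.length] = \
        src[chunk.physical:chunk.physical + chunk.length]
    # Only the mapping changes; the logical address range is untouched.
    chunk.device, chunk.physical = new_device, new_physical


def read_logical(chunk, devices, logical, size):
    """Resolve a logical address through the mapping and read from the device."""
    start = chunk.physical + (logical - chunk.logical)
    return bytes(devices[chunk.device][start:start + size])


# Two toy devices; the chunk starts out on "sda" at physical offset 8.
devices = {"sda": bytearray(32), "sdb": bytearray(32)}
devices["sda"][8:20] = b"extent data!"
chunk = ChunkMapping(logical=1 << 20, length=12, device="sda", physical=8)

before = read_logical(chunk, devices, (1 << 20) + 7, 5)
relocate_chunk(chunk, devices, "sdb", 0)
after = read_logical(chunk, devices, (1 << 20) + 7, 5)
```

Here before and after are the same bytes read at the same logical
address, even though the chunk now lives on a different device.
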

> Thanks,
> Qu
> >
> >>>The problem is that the stripe size is bigger than the "sector size" (ok
> >>>sector is not the correct word, but I am referring to the basic unit of
> >>>writing on the disk, which is 4k or 16K in btrfs).
> >>>So when btrfs writes less data than the stripe, the stripe is not filled;
> >>>when it is filled by a subsequent write, an RMW of the parity is required.
> >>>
> >>>To the best of my understanding (which could be very wrong), ZFS tries to
> >>>solve this issue using a variable length stripe.
> >>
> >>
> >>Did you mean the ZFS record size?
> >>IIRC that's the minimum file extent size, and I don't see how that can
> >>handle the write hole problem.
> >>
> >>Or did ZFS handle the problem?
> >
> >ZFS isn't subject to the write hole. My understanding is they get
> >around this because all writes are COW; there is no RMW. But the
> >variable stripe size means they don't have to do the usual (fixed)
> >full stripe write for just, for example, a 4KiB change in data for a
> >single file. Conversely, Btrfs does do RMW in such a case.
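
A rough way to see the difference in I/O, as a toy count only.  This is
neither ZFS's nor btrfs's actual allocator; the two function names, the
single-parity assumption and the block-granular accounting are all
assumptions made for the example.

```python
def fixed_stripe_rmw(changed_blocks: int) -> dict:
    """Sub-stripe update of a fixed-width, single-parity stripe via RMW.

    Assumes the parity is updated from old data + old parity, so the old
    copies must be read first and the stripe is rewritten in place.
    """
    return {
        "reads": changed_blocks + 1,   # old data blocks + old parity
        "writes": changed_blocks + 1,  # new data blocks + new parity
        "in_place": True,              # live stripe is modified: write hole risk
    }


def variable_stripe_write(data_disks: int, changed_blocks: int) -> dict:
    """Write the changed blocks as new, possibly narrower, full stripes.

    Each row of up to data_disks blocks gets its own parity block and goes
    to free space, so no existing stripe is read or modified.
    """
    rows = -(-changed_blocks // data_disks)  # ceiling division
    return {
        "reads": 0,
        "writes": changed_blocks + rows,     # data blocks + one parity per row
        "in_place": False,
    }


# A single 4KiB block changed on an array with 4 data disks.
rmw = fixed_stripe_rmw(1)
var = variable_stripe_write(4, 1)
```

For the one-block case both approaches write two blocks, but only the
fixed-stripe path has to read old data and parity and then overwrite a
stripe that already holds committed data.
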
> >
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving both the extent allocator and
> >>the bg allocator will be bug prone.
> >
> >I tend to agree. I think the non-scalability of Btrfs raid10, which
> >makes it behave more like raid 0+1, is a higher priority because right
> >now it's misleading to say the least; and then the longer term goal
> >for scalable huge file systems is how Btrfs can shed irreparably
> >damaged parts of the file system (tree pruning) rather than
> >reconstruction.
> >
> >
> >
> 
> 
