Re: RFC: raid with a variable stripe size

From: Zygo Blaxell <zblaxell@furryterror.org>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: kreijack@inwind.it, linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: RFC: raid with a variable stripe size
Date: Tue, 29 Nov 2016 17:51:27 -0500
Message-ID: <20161129225127.GS8685@hungrycats.org>
In-Reply-To: <07d2e8cf-fb23-b2f1-cc69-f329d8347301@cn.fujitsu.com>

On Tue, Nov 29, 2016 at 01:49:09PM +0800, Qu Wenruo wrote:
> >>>My proposal requires only a modification to the extent allocator.
> >>>The behavior at the block group layer and scrub remains exactly the same.
> >>>We just need to adjust the allocator slightly to take the RAID5 CoW
> >>>constraints into account.
> >>
> >>Then, you'd need to allow btrfs to split large buffered/direct write into
> >>small extents(not 128M anymore).
> >>Not sure if we need to do extra work for DirectIO.
> >
> >Nope, that's not my proposal.  My proposal is to simply ignore free
> >space whenever it's inside a partially filled raid stripe (optimization:
> >...which was empty at the start of the current transaction).
> 
> Still have problems.
> 
> The allocator must correctly handle a filesystem undergoing device removal
> or profile conversion (from 4-disk raid5 to 5-disk raid5/6),
> which already seems complex to me.

Those would be allocations in separate block groups with different stripe
widths, which btrfs already handles.

> Furthermore, consider a filesystem with more devices, for example a
> 9-device RAID5.
> It would be a disaster to write just 4K of data and take up the whole
> 8 * 64K of space.
> It would definitely cause huge ENOSPC problems.

If you called fsync() after every 4K, yes; otherwise you can just batch
up small writes into full-size stripes.  The worst case isn't common
enough to be a serious problem for a lot of the common RAID5 use cases
(i.e. non-database workloads).  I wouldn't try running a database on
it--I'd use a RAID1 or RAID10 array for that instead, because the other
RAID5 performance issues would be deal-breakers.

On ZFS the same case degenerates into something like btrfs RAID1 over
the 9 disks, which burns over 50% of the space.  More efficient than 
wasting 99% of the space, but still wasteful.
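
To put rough numbers on the comparison above, here is a minimal sketch of
the arithmetic.  The function names and the simplified variable-stripe
model (one parity block per row of data blocks, no padding or metadata
overhead) are my own assumptions for illustration, not a description of
the actual btrfs or ZFS on-disk behavior.

```python
KIB = 1024

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def whole_stripe_efficiency(write_bytes, devices, strip_bytes=64 * KIB):
    """Fraction of raw space holding user data when every write claims
    whole RAID5 stripes and the rest of each stripe is left unused."""
    data_per_stripe = (devices - 1) * strip_bytes
    stripes = ceil_div(write_bytes, data_per_stripe)
    raw_bytes = stripes * devices * strip_bytes
    return write_bytes / raw_bytes

def variable_stripe_efficiency(write_bytes, devices, block_bytes=4 * KIB):
    """Simplified variable-stripe model (assumption): each row holds up to
    devices - 1 data blocks plus one parity block."""
    data_blocks = ceil_div(write_bytes, block_bytes)
    rows = ceil_div(data_blocks, devices - 1)
    raw_bytes = (data_blocks + rows) * block_bytes
    return write_bytes / raw_bytes

if __name__ == "__main__":
    for size in (4 * KIB, 64 * KIB, 512 * KIB):
        print(size // KIB, "KiB:",
              round(whole_stripe_efficiency(size, 9), 4),
              round(variable_stripe_efficiency(size, 9), 4))
```

Under these assumptions a lone 4K write on 9 devices uses 1/144 of the raw
space it claims in the whole-stripe case and 1/2 in the variable-stripe
case, while a batched 512K write reaches 8/9 in both.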

> If you really think it's easy, make a RFC patch, which should be easy if it
> is, then run fstest auto group on it.

I plan to when I get time; however, that could be some months in the
future and I don't want to "claim" the task and stop anyone else from
taking a crack at it in the meantime.

> Easy words won't turn emails into real patch.
> 
> >That avoids modifying a stripe with committed data and therefore plugs the
> >write hole.
> >
> >For nodatacow, prealloc (and maybe directio?) extents the behavior
> >wouldn't change (you'd have write hole, but only on data blocks not
> >metadata, and only on files that were already marked as explicitly not
> >requiring data integrity).
> >
> >>And in fact, you're going to support variant max file extent size.
> >
> >The existing extent sizing behavior is not changed *at all* in my proposal,
> >only the allocator's notion of what space is 'free'.
> >
> >We can write an extent across multiple RAID5 stripes so long as we
> >finish writing the entire extent before pointing committed metadata to
> >it.  btrfs does that already otherwise checksums wouldn't work.
> >
> >>This makes delalloc more complex (Wang enhanced delalloc support for variable
> >>file extent size, to fix ENOSPC problems for dedupe and compression).
> >>
> >>This is already much more complex than you expected.
> >
> >The complexity I anticipate is having to deal with two implementations
> >of the free space search, one for free space cache and one for free
> >space tree.
> >
> >It could be as simple as calling the existing allocation functions and
> >just filtering out anything that isn't suitably aligned inside a raid56
> >block group (at least for a proof of concept).
> >
> >>And this is the *BIGGEST* problem of current btrfs:
> >>No good enough(if there is any) *ISOLATION* for such a complex fs.
> >>
> >>So even "small" modification can lead to unexpected bugs.
> >>
> >>That's why I want to isolate the fix in RAID56 layer, not any layer upwards.
> >
> >I don't think the write hole is fixable in the current raid56 layer, at
> >least not without a nasty brute force solution like stripe update journal.
> >
> >Any of the fixes I'd want to use fix the problem from outside.
> >
> >>If not possible, I prefer not to do anything yet, until we are sure the very
> >>basic part of RAID56 is stable.
> >>
> >>Thanks,
> >>Qu
> >>
> >>>
> >>>It's not as efficient as the ZFS approach, but it doesn't require an
> >>>incompatible disk format change either.
> >>>
> >>>>>On BTRFS this could be achieved using several BGs (== block group or chunk), one for each stripe size.
> >>>>>
> >>>>>For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem should have three BGs:
> >>>>>BG #1,composed by two disks (1 data+ 1 parity)
> >>>>>BG #2 composed by three disks (2 data + 1 parity)
> >>>>>BG #3 composed by four disks (3 data + 1 parity).
> >>>>
> >>>>Too complicated bg layout and further extent allocator modification.
> >>>>
> >>>>More code means more bugs, and I'm pretty sure it will be bug prone.
> >>>>
> >>>>
> >>>>Although the idea of variable stripe size can somewhat reduce the problem
> >>>>under certain situation.
> >>>>
> >>>>For example, if sectorsize is 64K, and we make stripe len to 32K, and use 3
> >>>>disc RAID5, we can avoid such write hole problem.
> >>>>Without modification to the extent/chunk allocator.
> >>>>
> >>>>And I'd prefer to make stripe len mkfs time parameter, not possible to
> >>>>modify after mkfs. To make things easy.
> >>>>
> >>>>Thanks,
> >>>>Qu
> >>>>
> >>>>>
> >>>>>If the data to be written has a size of 4k, it will be allocated to the BG #1.
> >>>>>If the data to be written has a size of 8k, it will be allocated to the BG #2
> >>>>>If the data to be written has a size of 12k, it will be allocated to the BG #3
> >>>>>If the data to be written has a size greater than 12k, it will be allocated to BG #3 until the data fills whole stripes; then the remainder will be stored in BG #1 or BG #2.
> >>>>>
> >>>>>
> >>>>>To avoid unbalancing of the disk usage, each BG could use all the disks, even if a stripe uses less disks: i.e
> >>>>>
> >>>>>DISK1 DISK2 DISK3 DISK4
> >>>>>S1    S1    S1    S2
> >>>>>S2    S2    S3    S3
> >>>>>S3    S4    S4    S4
> >>>>>[....]
> >>>>>
> >>>>>Shown above is a BG which uses all four disks, but has a stripe which spans only 3 disks.
> >>>>>
> >>>>>
> >>>>>Pro:
> >>>>>- btrfs already is capable to handle different BG in the filesystem, only the allocator has to change
> >>>>>- no more RMW are required (== higher performance)
> >>>>>
> >>>>>Cons:
> >>>>>- the data will be more fragmented
> >>>>>- the filesystem will have more BGs; this will require a re-balance from time to time. But this is an issue which we already know (even if maybe not 100% addressed).
> >>>>>
> >>>>>
> >>>>>Thoughts ?
> >>>>>
> >>>>>BR
> >>>>>G.Baroncelli
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
> 
> 
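As a proof-of-concept illustration of the filtering idea quoted above
(call the existing allocator, then discard anything not suitably aligned
inside a raid56 block group), here is a minimal sketch.  It is not kernel
code: the function name, the (start, length) extent representation, and
the assumption that offsets are relative to the start of the block group
are all hypothetical.  It also ignores the optimization of reusing stripes
that were empty at the start of the current transaction.

```python
KIB = 1024

def full_stripe_free_ranges(free_extents, full_stripe_bytes):
    """Trim each free extent to the whole stripes it completely covers.

    free_extents: iterable of (start, length) byte ranges, with offsets
    relative to the start of the block group (assumption).
    Free space that shares a stripe with allocated data is dropped, so a
    new write never modifies a stripe that already holds data.
    """
    usable = []
    for start, length in free_extents:
        end = start + length
        # Round the start up and the end down to full-stripe boundaries.
        aligned_start = -(-start // full_stripe_bytes) * full_stripe_bytes
        aligned_end = (end // full_stripe_bytes) * full_stripe_bytes
        if aligned_end > aligned_start:
            usable.append((aligned_start, aligned_end - aligned_start))
    return usable

if __name__ == "__main__":
    # 4-disk RAID5: 3 data strips of 64K per full stripe.
    stripe = 3 * 64 * KIB
    free = [(0, 64 * KIB), (100 * KIB, 500 * KIB), (768 * KIB, 192 * KIB)]
    for start, length in full_stripe_free_ranges(free, stripe):
        print(start // KIB, "KiB +", length // KIB, "KiB")
```

In the example, the first free extent is discarded because it covers only
part of a stripe, the second is trimmed to the two whole stripes it
contains, and the third is already aligned and passes through unchanged.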
