From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
Cc: kreijack@inwind.it, Chris Murphy <lists@colorremedies.com>,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Status of RAID5/6
Date: Mon, 2 Apr 2018 20:31:11 -0400
Message-ID: <20180403003102.GI2446@hungrycats.org>
In-Reply-To: <20180402222250.GH2446@hungrycats.org>


On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BGs with different
> > > numbers of data disks.  E.g. supposing we have a raid6 system with
> > > 6 disks, where 2 are parity disks, we would allocate 3 BGs:
> > > 
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks
> > > BG #3: 4 data disks, 2 parity disks
> > > 
> > > For simplicity, the per-disk stripe length is assumed to be 4K.
> > > 
> > > So if you have a write with a length of 4KB, it should be placed in
> > > BG #1; if you have a write with a length of 3*4KB, the first 8KB
> > > should be placed in BG #2, then the remaining 4KB in BG #1.
> > > 
> > > This would avoid wasting space, even if fragmentation will increase
> > > (but does fragmentation matter with modern solid state disks?).
> 
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
remaining 2 blocks).  It also flips the usual order of "determine size
of extent, then allocate space for it" which might require major surgery
on the btrfs allocator to implement.
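
To make the split concrete, here is a minimal sketch (in Python, purely
illustrative; none of this is btrfs code) of the greedy placement
Goffredo describes, assuming 4K blocks and the three BG widths above:

    # Data-stripe widths of the three BGs in Goffredo's example:
    # BG #3 = 4 data disks, BG #2 = 2, BG #1 = 1 (each BG also has
    # 2 parity disks, not counted here).
    BG_WIDTHS = [4, 2, 1]

    def place_write(nr_blocks):
        """Greedily split a write of nr_blocks 4K blocks across the BGs.

        Returns a list of (bg_width, blocks_placed) pieces.  Every
        piece is exactly one full stripe in its BG, so no stripe is
        ever left partially filled and later rewritten (no write
        hole)."""
        pieces = []
        for width in BG_WIDTHS:
            while nr_blocks >= width:
                pieces.append((width, width))
                nr_blocks -= width
        return pieces

    # A 6-block write becomes one full stripe in BG #3 plus one in
    # BG #2, i.e. the extent is split into two pieces:
    #   place_write(6) -> [(4, 4), (2, 2)]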

If we round that write up to 8 blocks (so we can put both pieces in
BG #3), it degenerates into the "pretend partially filled RAID stripes
are completely full" case, something like what ssd_spread already does.
That trades less file fragmentation for more free space fragmentation.
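
For comparison, a sketch of the round-up policy under the same
assumptions (again illustrative only): the extent stays in one piece in
the widest BG, but the tail of the last stripe is treated as used, so
the waste shows up as allocated-but-unused space instead of an extra
extent.

    def place_write_rounded(nr_blocks, width=4):
        """Round a write up to whole stripes of the widest BG,
        keeping the extent in one piece.

        Returns (blocks_allocated, blocks_wasted); the wasted blocks
        are the free space fragmentation mentioned above."""
        stripes = -(-nr_blocks // width)      # ceiling division
        blocks_allocated = stripes * width
        return blocks_allocated, blocks_allocated - nr_blocks

    # A 6-block write stays a single extent but burns 2 blocks:
    #   place_write_rounded(6) -> (8, 2)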

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
> 
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
> 
> It does solve write hole (one transaction per RAID stripe).
> 
> > Also, you're still going to be wasting space; it's just that less space
> > will be wasted, and it will be wasted at the chunk level instead of the
> > block level, which opens up a whole new set of issues to deal with.  Most
> > significantly, it becomes functionally impossible, without brute-force
> > search techniques, to determine when you will hit the common case of
> > -ENOSPC due to being unable to allocate a new chunk.
> 
> Hopefully the allocator only keeps one small block group of each size
> around at a time.  The allocator can take significant shortcuts because
> the size of every extent in the small block groups is known (they are
> all the same size by definition).
> 
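A toy illustration of that shortcut (hypothetical, not the real btrfs
allocator): when every extent in a small BG has the same size, free
space tracking collapses into a free-slot list, so allocation needs no
best-fit search at all.

    class SmallBG:
        """Toy model of a small block group in which every extent is
        exactly one full stripe, i.e. all extents are the same size."""
        def __init__(self, nr_slots):
            # Each slot holds exactly one fixed-size extent.
            self.free_slots = list(range(nr_slots))

        def alloc(self):
            # No size matching and no fragmentation search: any free
            # slot fits, so just pop one.
            return self.free_slots.pop() if self.free_slots else None

        def free(self, slot):
            self.free_slots.append(slot)
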
> When a small block group fills up, the next one should occupy the
> most-empty subset of disks--which is the opposite of the usual RAID5/6
> allocation policy.  This will probably lead to "interesting" imbalances
> since there are now two allocators on the filesystem with different goals
> (though it is no worse than -draid5 -mraid1, and I had no problems with
> free space when I was running that).
> 
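A sketch of that placement policy (illustrative, with made-up
structures): pick the most-empty devices for the next small BG, rather
than the widest possible stripe.

    def pick_disks_for_small_bg(unallocated, nr_devices):
        """unallocated: {devid: unallocated bytes on that device}.
        Return the nr_devices most-empty devices (data + parity) to
        hold the next small block group."""
        by_free = sorted(unallocated, key=unallocated.get, reverse=True)
        return by_free[:nr_devices]
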
> There will be an increase in the amount of allocated but not usable space,
> though, because now the amount of free space depends on how much data
> is batched up before fsync() or sync().  Probably best to just not count
> any space in the small block groups as 'free' in statvfs terms at all.
> 
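For example (a sketch of the accounting policy only, not of any
existing btrfs code), statvfs "available" space would be computed from
the full-width BGs alone:

    def statvfs_avail(block_groups, full_width=4):
        """block_groups: list of dicts like
        {"data_disks": 4, "size": ..., "used": ...}.

        Only unused space in full-width BGs counts as available;
        space in the narrow-stripe BGs is ignored, since how much of
        it is usable depends on how writes get batched before each
        commit."""
        return sum(bg["size"] - bg["used"]
                   for bg in block_groups
                   if bg["data_disks"] == full_width)
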
> There are a lot of variables implied there.  Without running some
> simulations I have no idea if this is a good idea or not.
> 
> > > From time to time, a rebalance should be performed to empty BG #1
> > > and BG #2.  Otherwise a new BG should be allocated.
> 
> That shouldn't be _necessary_ (the filesystem should just allocate
> whatever BGs it needs), though it will improve storage efficiency if it
> is done.
> 
> > > The cost should be comparable to logging/journaling (each write
> > > shorter than a full stripe has to be written twice); the
> > > implementation should be quite easy, because btrfs already supports
> > > BGs with different sets of disks.
> 



