linux-btrfs.vger.kernel.org archive mirror
From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: kreijack@inwind.it, Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Chris Murphy <lists@colorremedies.com>
Cc: Christoph Anton Mitterer <calestyo@scientia.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Status of RAID5/6
Date: Mon, 2 Apr 2018 11:49:42 -0400
Message-ID: <7c76dae7-b38c-d514-4284-1cd093f5bcac@gmail.com>
In-Reply-To: <df74c8a6-b748-20c5-8bef-eb261b645b29@inwind.it>

On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> [...]
>> It is possible to combine writes from a single transaction into full
>> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
>> Any partially-filled stripe is effectively read-only and the space within
>> it is inaccessible until all data within the stripe is overwritten,
>> deleted, or relocated by balance.
>>
>> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
>> update, but that has a significant write magnification effect (and before
>> kernel 4.14, non-trivial CPU load as well).
>>
>> btrfs could also just allocate the full stripe to an extent, but emit
>> only extent ref items for the blocks that are in use.  No fragmentation
>> but lots of extra disk space used.  Also doesn't quite work the same
>> way for metadata pages.
>>
>> If btrfs adopted the ZFS approach, the extent allocator and all higher
>> layers of the filesystem would have to know about--and skip over--the
>> parity blocks embedded inside extents.  Making this change would mean
>> that some btrfs RAID profiles start interacting with stuff like balance
>> and compression which they currently do not.  It would create a new
>> block group type and require an incompatible on-disk format change for
>> both reads and writes.
> 
> I thought that a possible solution would be to create BGs with different numbers of data disks. E.g., suppose we have a RAID 6 system with 6 disks, of which 2 are parity disks; we would allocate 3 BGs:
> 
> BG #1: 1 data disk, 2 parity disks
> BG #2: 2 data disks, 2 parity disks
> BG #3: 4 data disks, 2 parity disks
> 
> For simplicity, the disk-stripe length is assumed to be 4 KB.
> 
> So if you have a write with a length of 4 KB, it should be placed in BG #1; if you have a write with a length of 4*3 KB, the first 8 KB should be placed in BG #2, and the rest in BG #1.
> 
> This would avoid wasting space, even if fragmentation would increase (but does fragmentation matter with modern solid-state disks?).
Yes, fragmentation _does_ matter even with storage devices that have a 
uniform seek latency (such as SSDs), because less fragmentation means 
fewer I/O requests have to be made to load the same amount of data. 
Contrary to popular belief, uniform seek-time devices still perform 
better doing purely sequential I/O than random I/O, because larger 
requests can be made. The difference is just small enough that it only 
matters if you're constantly using all the disk bandwidth.
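
As a rough illustration of the request-count argument, here is a toy 
model in Python. It assumes each extent needs at least one read request 
and that no request may exceed a fixed maximum size; the 512 KiB limit 
and the function name are made up for the example, and request merging 
and readahead are ignored.

```python
import math

# Hypothetical per-request size limit, chosen only for this example.
MAX_REQUEST = 512 * 1024


def io_requests(extent_sizes, max_request=MAX_REQUEST):
    """Count read requests under the toy model: each extent is read on
    its own and is split whenever it exceeds max_request bytes."""
    return sum(math.ceil(size / max_request) for size in extent_sizes)


one_mib = 1024 * 1024
contiguous = [one_mib]            # 1 MiB stored as a single extent
fragmented = [4096] * 256         # the same 1 MiB as 256 extents of 4 KiB

print(io_requests(contiguous))    # 2
print(io_requests(fragmented))    # 256
```

The same amount of data costs two requests in one case and 256 in the 
other, regardless of how cheap each seek is.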

Also, you're still going to be wasting space. Less space will be 
wasted, and it will be wasted at the chunk level instead of the block 
level, but that opens up a whole new set of issues to deal with. Most 
significantly, it becomes functionally impossible, without brute-force 
search techniques, to determine when you will hit the common case of 
-ENOSPC due to being unable to allocate a new chunk.
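
For reference, the placement rule in the quoted proposal can be 
sketched as a greedy split, widest block group first. This is only my 
reading of the example (4 KB stripe unit, BGs with 1, 2, and 4 data 
disks, 2 parity disks each); the function names are hypothetical and 
nothing here corresponds to existing btrfs code.

```python
STRIPE_UNIT = 4 * 1024   # per-disk stripe length assumed in the proposal
PARITY_DISKS = 2         # RAID 6

# Data-disk count of each block group in the quoted example.
BG_DATA_DISKS = {"BG#1": 1, "BG#2": 2, "BG#3": 4}


def split_write(length):
    """Split a write into full stripes, filling the widest BG first.

    Assumes the length is a positive multiple of STRIPE_UNIT.
    """
    if length <= 0 or length % STRIPE_UNIT:
        raise ValueError("length must be a positive multiple of STRIPE_UNIT")
    placements = []
    remaining = length
    widest_first = sorted(BG_DATA_DISKS.items(),
                          key=lambda kv: kv[1], reverse=True)
    for name, data_disks in widest_first:
        full_stripe = data_disks * STRIPE_UNIT
        while remaining >= full_stripe:
            placements.append((name, full_stripe))
            remaining -= full_stripe
    return placements


def parity_bytes(placements):
    """Parity written for the placements: two units per stripe touched."""
    return len(placements) * PARITY_DISKS * STRIPE_UNIT


parts = split_write(12 * 1024)
print(parts)                 # [('BG#2', 8192), ('BG#1', 4096)]
print(parity_bytes(parts))   # 16384
```

In this model a 12 KB write lands in two stripes and carries 16 KB of 
parity, while a 16 KB write fits in one BG#3 stripe with only 8 KB of 
parity, so the overhead moves around rather than disappearing.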
> 
> From time to time, a re-balance should be performed to empty BG #1 and BG #2. Otherwise a new BG should be allocated.
> 
> The cost should be comparable to logging/journaling (each piece of data shorter than a full stripe has to be written twice); the implementation should be quite easy, because btrfs already supports BGs with different sets of disks.

