From: Goffredo Baroncelli <kreijack@inwind.it>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: RFC: raid with a variable stripe size
Date: Sat, 19 Nov 2016 10:13:39 +0100
Message-ID: <80bfba90-2aab-16f7-83e6-00cc3f2b96b4@inwind.it>
In-Reply-To: <20161119082252.GU21290@hungrycats.org>
On 2016-11-19 09:22, Zygo Blaxell wrote:
[...]
>> If the data to be written has a size of 4k, it will be allocated to
>> BG #1. If the data to be written has a size of 8k, it will be
>> allocated to BG #2. If the data to be written has a size of 12k,
>> it will be allocated to BG #3. If the data to be written has a size
>> greater than 12k, it will be allocated to BG #3 until it has filled
>> whole stripes; the remainder will then be stored in BG #1 or BG #2.
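To make the rule quoted above concrete, here is a minimal sketch in Python. It assumes the 4k block size implied by the 4k/8k/12k example, and that BG #n holds n data blocks per stripe; the function name and constants are only illustrative, not btrfs code.

```python
BLOCK = 4 * 1024        # block size implied by the 4k/8k/12k example (assumption)
MAX_DATA_BLOCKS = 3     # data blocks in a full stripe of BG #3 (12k)

def split_write(size):
    """Return a list of (block_group_number, bytes) pairs for a write of `size` bytes."""
    if size <= 0:
        return []
    blocks = -(-size // BLOCK)                    # round up to whole blocks
    full, rest = divmod(blocks, MAX_DATA_BLOCKS)
    plan = [(3, MAX_DATA_BLOCKS * BLOCK)] * full  # full stripes go to BG #3
    if rest:
        plan.append((rest, rest * BLOCK))         # remainder goes to BG #1 or BG #2
    return plan
```

For example, a 20k write would be split into one full stripe in BG #3 plus an 8k remainder in BG #2.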
>
> OK I think I'm beginning to understand this idea better. Short writes
> degenerate to RAID1, and large writes behave more like RAID5. No disk
> format change is required because newer kernels would just allocate
> block groups and distribute data differently.
>
> That might be OK on SSD, but on spinning rust (where you're most likely
> to find a RAID5 array) it'd be really seeky. It'd also make 'df' output
> even less predictive of actual data capacity.
>
> Going back to the earlier example (but on 5 disks) we now have:
>
> block groups with 5 disks:
> D1 D2 D3 D4 P1
> F1 F2 F3 P2 F4
> F5 F6 P3 F7 F8
>
> block groups with 4 disks:
> E1 E2 E3 P4
> D5 D6 P5 D7
>
> block groups with 3 disks:
> (none)
>
> block groups with 2 disks:
> F9 P6
>
> Now every parity block contains data from only one transaction, but
> extents D and F are separated by up to 4GB of disk space.
>
[....]
>
> When the disk does get close to full, this would lead to some nasty
> early-ENOSPC issues. It's bad enough now with just two competing
> allocators (metadata and data)...imagine those problems multiplied by
> 10 on a big RAID5 array.
I am inclined to think that some of these problems would be reduced by developing a daemon which starts a balance automatically when needed (on the basis of the fragmentation). In any case, this is an issue which we should solve.
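As a rough sketch of what such a daemon could decide (in Python): the fragmentation metric, the threshold and the function name below are assumptions of mine, and how the daemon would obtain the free-space figures is left open. Only the `btrfs balance start -dusage=N <path>` command line is an existing tool.

```python
def balance_command(free_bytes, largest_free_run, mountpoint,
                    threshold=0.5, usage=50):
    """Return the argv of a balance to run if free space looks fragmented, else None.

    Fragmentation is estimated (hypothetical metric) as the share of free
    space that lies outside the largest contiguous free run.
    """
    if free_bytes <= 0:
        return None
    fragmentation = 1.0 - largest_free_run / free_bytes
    if fragmentation < threshold:
        return None
    # Only rewrite data block groups that are at most `usage` percent full.
    return ["btrfs", "balance", "start", f"-dusage={usage}", mountpoint]
```

The function only builds the command; a real daemon would still have to run it and decide how often to check.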
[...]
>
> I now realize there's no need for any "plug extent" to physically
> exist--the allocator can simply infer their existence on the fly by
> noticing where the RAID stripe boundaries are, and remembering which
> blocks it had allocated in the current uncommitted transaction.
This could even be a "simple" solution: when a write starts, the system would have to use only empty stripes...
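A minimal sketch of the "only empty stripes" rule, in Python. The data structure (one set of used block indexes per stripe) and the function name are hypothetical, chosen only to illustrate the idea; this is not how the btrfs allocator represents free space.

```python
def pick_empty_stripes(stripes, blocks_needed, data_blocks_per_stripe):
    """Choose completely empty stripes for a new write.

    stripes: list where stripes[i] is the set of block indexes already
             used in stripe i (an empty set means the stripe is empty).
    Returns the indexes of the stripes to use, or None if there are not
    enough empty stripes (the early-ENOSPC case discussed above).
    """
    chosen = []
    remaining = blocks_needed
    for idx, used in enumerate(stripes):
        if remaining <= 0:
            break
        if not used:                      # skip partially filled stripes
            chosen.append(idx)
            remaining -= data_blocks_per_stripe
    return chosen if remaining <= 0 else None
```

Partially filled stripes are never touched, so a parity block only ever covers data from one write; the price is the unused space left in those stripes until a balance reclaims it.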
>
>
> The tradeoff is that more balances would be required to avoid free space
> fragmentation; on the other hand, typical RAID5 use cases involve storing
> a lot of huge files, so the fragmentation won't be a very large percentage
> of total space. A few percent of disk capacity is a fair price to pay for
> data integrity.
Both methods would require a more aggressive balance. In this respect they are equal.
BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5