From: Goffredo Baroncelli <kreijack@inwind.it>
To: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
Duncan <1i5t5.duncan@cox.net>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/4] 3- and 4- copy RAID1
Date: Thu, 19 Jul 2018 19:29:51 +0200 [thread overview]
Message-ID: <d8071de6-7c02-c8cf-074e-9a52ed9f3fac@inwind.it> (raw)
In-Reply-To: <330020d0-c611-54f6-cbf2-fe9fa6bd3782@gmail.com>
On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote:
> On 2018-07-18 15:42, Goffredo Baroncelli wrote:
>> On 07/18/2018 09:20 AM, Duncan wrote:
>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>> excerpted:
>>>
>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>> excerpted:
>>>>>
[...]
>>>>
>>>> When I say orthogonal, I mean that these can be combined, i.e. you can
>>>> have:
>>>> - striping (RAID0)
>>>> - parity (?)
>>>> - striping + parity (e.g. RAID5/6)
>>>> - mirroring (RAID1)
>>>> - mirroring + striping (RAID10)
>>>>
>>>> However you can't have mirroring+parity; this means that a notation
>>>> with both 'C' (= number of copies) and 'P' (= number of parities) is
>>>> too verbose.
>>>
>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is
>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
>>> on top of raid0.
>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
>>
>> Seriously, of course you can combine a lot of different profiles; however the only ones that make sense are the ones above.
> No, there are cases where other configurations make sense.
>
> RAID05 and RAID06 are very widely used, especially on NAS systems where you have lots of disks. The RAID5/6 lower layer mitigates the data loss risk of RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of RAID5/6. In fact, this is pretty much the standard recommended configuration for large ZFS arrays that want to use parity RAID. This could be reasonably easily supported to a rudimentary degree in BTRFS by providing the ability to limit the stripe width for the parity profiles.
>
> Some people use RAID50 or RAID60, although they are strictly speaking inferior in almost all respects to RAID05 and RAID06.
>
> RAID01 is also used on occasion. It ends up having the same storage capacity as RAID10, but for some RAID implementations it has a different performance envelope and different rebuild characteristics. Usually, when it is used though, it's software RAID0 on top of hardware RAID1.
>
> RAID51 and RAID61 used to be used, but aren't much now. They provided an easy way to have proper data verification without always having the rebuild overhead of RAID5/6 and without needing to do checksumming. They are pretty much useless for BTRFS, as it can already tell which copy is correct.
So far you are repeating what I said: the only useful RAID profiles are
- striping
- mirroring
- striping+parity (even limiting the number of disks involved)
- striping+mirroring
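To make the "orthogonal parameters" point concrete, here is a small Python sketch. It is purely illustrative (the names and tuples are mine, not the actual btrfs structures): each useful profile is described by (copies, striped, parities), and the usable-capacity fraction falls out of those parameters.

```python
# Illustrative profile table: (copies, striped, parities).
# These are NOT btrfs's internal structures, just a model of the
# "orthogonal dimensions" argument above.
PROFILES = {
    "single": (1, False, 0),
    "raid0":  (1, True,  0),   # striping
    "raid1":  (2, False, 0),   # mirroring
    "raid10": (2, True,  0),   # mirroring + striping
    "raid5":  (1, True,  1),   # striping + 1 parity
    "raid6":  (1, True,  2),   # striping + 2 parities
}

def usable_fraction(name, ndisks):
    """Rough usable-capacity fraction on ndisks equally sized disks."""
    copies, striped, parities = PROFILES[name]
    if striped:
        # parity disks are lost per stripe; each byte is stored 'copies' times
        return (ndisks - parities) / ndisks / copies
    return 1 / copies
```

For example, on 4 disks this gives 1.0 for raid0, 0.5 for raid1 and raid10, and 0.75 for raid5.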
>
> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail.
If you need further redundancy, it would be easier to implement parity3 and parity4 RAID profiles than to stack raid6 on top of raid1.
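A quick capacity comparison supports this. The arithmetic below is illustrative only: "parity3" is the hypothetical 3-parity profile, and raid6+raid1 is modelled as two mirrored raid6 halves of n/2 disks each.

```python
def parity3_fraction(n):
    """Hypothetical 3-parity profile: n-3 data disks out of n."""
    return (n - 3) / n

def raid61_fraction(n):
    """raid1 over two raid6 halves: (n/2 - 2) data disks, stored twice."""
    half = n // 2
    return (half - 2) / n

# e.g. on 10 disks: parity3 keeps 70% of raw capacity usable,
# while raid6+raid1 keeps only 30%.
```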
>>
>> The fact that you can combine striping with mirroring (or parity) makes sense because you can get a speed gain (see below).
>> [....]
>>>>>
>>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>>> much works this way except with huge (gig size) chunks.
>>>>
>>>> As implemented in BTRFS, raid1 doesn't have striping.
>>>
>>> The argument is that because there's only two copies, on multi-device
>>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>>> alternate device pairs, it's effectively striped at the macro level, with
>>> the 1 GiB device-level chunks effectively being huge individual device
>>> strips of 1 GiB.
>>
>> The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
> That's not the only benefit of striping though. The other big one is that you now have one volume that's the combined size of both of the original devices. Striping is arguably better for this even if you're using a large stripe size because it better balances the wear across the devices than simple concatenation.
Striping means that the data is interleaved between the disks with a reasonably small "stripe unit". Otherwise, what would be the difference between btrfs-raid0 and btrfs-single?
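The distinction can be made concrete with a toy mapping function. This is an idealized round-robin model, not the actual btrfs chunk allocator:

```python
def raid0_map(offset, ndisks, stripe_unit):
    """Striping: map a logical byte offset to (disk, disk_offset)
    by interleaving stripe_unit-sized pieces round-robin."""
    stripe_no = offset // stripe_unit
    disk = stripe_no % ndisks
    disk_offset = (stripe_no // ndisks) * stripe_unit + offset % stripe_unit
    return disk, disk_offset

def single_map(offset, ndisks, disk_size):
    """Concatenation: fill one disk completely, then move to the next."""
    return offset // disk_size, offset % disk_size
```

With a 64 KiB stripe unit on 2 disks, logical offsets 0, 64 KiB, 128 KiB land on disks 0, 1, 0; under concatenation, everything below the first disk's size lands on disk 0.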
>
>> With a "stripe size" of 1GB, it is very unlikely that this would happen.
> That's a pretty big assumption. There are all kinds of access patterns that will still distribute the load reasonably evenly across the constituent devices, even if they don't parallelize things.
>
> If, for example, all your files are 64k or less, and you only read whole files, there's no functional difference between RAID0 with 1GB blocks and RAID0 with 64k blocks. Such a workload is not unusual on a very busy mail-server.
I fully agree that 64K may be too much for some workloads; however, I still find it difficult to imagine taking advantage of parallel reads from multiple disks with a 1GB stripe unit under a *common workload*. Note also that btrfs inlines small files in the metadata, so even if a file is smaller than 64k, a 64k read (or more) will be required to access it.
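As a toy illustration of the parallelism argument (again an idealized striping model, not btrfs's actual allocator), one can count how many disks a single contiguous read touches:

```python
def disks_touched(offset, length, ndisks, stripe_unit):
    """Set of disks a contiguous read [offset, offset+length) hits
    under simple round-robin striping (illustrative model only)."""
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return {stripe % ndisks for stripe in range(first, last + 1)}

MiB = 2**20
# A single 1 MiB read with a 64 KiB stripe unit spans both disks of a
# 2-disk array, so it can be serviced in parallel...
both = disks_touched(0, MiB, 2, 64 * 1024)
# ...while with a 1 GiB stripe unit the same read stays on one disk.
one = disks_touched(0, MiB, 2, 1024 * MiB)
```

So with a 1 GiB unit, only reads that happen to straddle a 1 GiB boundary ever involve a second disk.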
>>
>>
>>> At 1 GiB strip size it doesn't have the typical performance advantage of
>>> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
>>> strips/chunks.
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5