* RE: question about creating a raid10
2019-01-16 18:15 ` Chris Murphy
@ 2019-01-17 0:59 ` Paul Jones
2019-01-17 12:33 ` Austin S. Hemmelgarn
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Paul Jones @ 2019-01-17 0:59 UTC (permalink / raw)
To: Chris Murphy, Stefan K; +Cc: Linux Btrfs
> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org <linux-btrfs-
> owner@vger.kernel.org> On Behalf Of Chris Murphy
> Sent: Thursday, 17 January 2019 5:15 AM
> To: Stefan K <shadow_7@gmx.net>
> Cc: Linux Btrfs <linux-btrfs@vger.kernel.org>
> Subject: Re: question about creating a raid10
>
> On Wed, Jan 16, 2019 at 7:58 AM Stefan K <shadow_7@gmx.net> wrote:
> >
> > :(
> > that means when one JBOD fails there is no guarantee that it keeps
> > working, like in ZFS? Well, that sucks. Didn't anyone think to program it that way?
>
> The mirroring is a function of the block group, not the block device.
> And yes that's part of the intentional design and why it's so flexible.
> A real raid10 isn't as flexible, so to enforce the allocation of
> specific block group stripes to specific block devices would add
> complexity to the allocator while reducing flexibility. It's not
> impossible, it'd just come with caveats like no three device raid10
> like now; and you'd have to figure out what to do if the user adds one
> new device instead of two at a time, and what if any new device isn't
> the same size as existing devices or if you add two devices that aren't
> the same size. Do you refuse to add such devices? What limitations do
> we run into when rebalancing? It's way more complicated.
>
> Btrfs raid10 really should not be called raid10. It sets up the wrong
> user expectation entirely. It's more like raid0+1, except even that is
> deceptive, because with a legit raid0+1 you can in theory lose multiple
> drives on one side of the mirror (but not both sides); with Btrfs
> raid10 you really can't lose more than one drive. And therefore it does
> not scale. The probability of downtime increases as drives are added,
> whereas with a real raid10 downtime doesn't change.
>
> In your case you're better off with raid0'ing the two drives in each enclosure
> (whether it's a feature of the enclosure or doing it with mdadm or LVM). And
> then using Btrfs raid1 on top of the resulting virtual block devices. Or do
> mdadm/LVM raid10, and format it Btrfs. Or yeah, use ZFS.
What I've done is create separate LVM volumes and groups, and assign the LVM groups to separate physical storage so I don't accidentally get two btrfs mirrors on the same device.
It's a bit complicated (especially since I'm using caching with lvm) but it works very well.
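A stripped-down sketch of the idea, using the device names from the
output below (cache layer omitted):

  # one volume group per failure domain, so mirrors can never share hardware
  vgcreate a /dev/sdc1 /dev/sdf1
  vgcreate b /dev/sdd1 /dev/sdi1

  # one logical volume per group
  lvcreate -n storage-a -l 100%FREE a
  lvcreate -n storage-b -l 100%FREE b

  # btrfs raid1 across the two LVs - each copy lands in a different group
  mkfs.btrfs -d raid1 -m raid1 /dev/a/storage-a /dev/b/storage-b

The actual layout, cache and all: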
vm-server ~ # lvs
  LV        VG Attr       LSize  Pool              Origin            Data%  Meta%  Move Log Cpy%Sync Convert
  backup-a  a  Cwi-aoC--- <2.73t [cache-backup-a]  [backup-a_corig]  99.99  24.63           0.00
  lvol0     a  -wi-a----- 44.00m
  lvol1     a  -wi-a----- 44.00m
  storage-a a  Cwi-aoC--- <3.64t [cache-storage-a] [storage-a_corig] 99.98  24.76           0.00
  backup-b  b  Cwi-aoC--- <2.73t [cache-backup-b]  [backup-b_corig]  99.99  24.75           0.00
  storage-b b  Cwi-aoC--- <3.64t [cache-storage-b] [storage-b_corig] 99.99  24.66           0.00
  storage-c c  -wi-a----- <3.64t
vm-server ~ # vgs
  VG #PV #LV #SN Attr   VSize  VFree
  a    3   4   0 wz--n-  6.70t     0
  b    3   2   0 wz--n-  6.70t 8.00m
  c    1   1   0 wz--n- <3.64t     0
vm-server ~ # pvs
  PV        VG Fmt  Attr PSize    PFree
  /dev/sda4 a  lvm2 a--  <342.00g     0
  /dev/sdb4 b  lvm2 a--  <342.00g 8.00m
  /dev/sdc1 a  lvm2 a--    <2.73t     0
  /dev/sdd1 b  lvm2 a--    <2.73t     0
  /dev/sdf1 a  lvm2 a--    <3.64t     0
  /dev/sdh1 c  lvm2 a--    <3.64t     0
  /dev/sdi1 b  lvm2 a--    <3.64t     0
vm-server ~ # btrfs fi sh
Label: 'Root'  uuid: 58d27dbd-7c1e-4ef7-8d43-e93df1537b08
        Total devices 2 FS bytes used 55.26GiB
        devid   13 size 100.00GiB used 65.03GiB path /dev/sdb1
        devid   14 size 100.00GiB used 65.03GiB path /dev/sda1

Label: 'Boot'  uuid: 8f63cd03-67b2-47cd-85ce-ca355769c123
        Total devices 2 FS bytes used 66.11MiB
        devid    1 size 1.00GiB used 356.00MiB path /dev/sdb6
        devid    2 size 1.00GiB used 0.00B path /dev/sda6

Label: 'Storage'  uuid: 1438fdc5-8b2a-47b3-8a5b-eb74cde3df42
        Total devices 4 FS bytes used 2.85TiB
        devid    1 size 3.61TiB used 3.19TiB path /dev/mapper/b-storage--b
        devid    2 size 3.42TiB used 3.02TiB path /dev/mapper/a-storage--a
        devid    3 size 279.40GiB used 173.00GiB path /dev/sdg1
        devid    4 size 279.40GiB used 172.00GiB path /dev/sde1

Label: 'Backup'  uuid: 21e59d66-3e88-4fc9-806f-69bde58be6a3
        Total devices 2 FS bytes used 1.31TiB
        devid    1 size 2.73TiB used 1.31TiB path /dev/mapper/a-backup--a
        devid    2 size 2.73TiB used 1.31TiB path /dev/mapper/b-backup--b
* Re: question about creating a raid10
2019-01-16 18:15 ` Chris Murphy
2019-01-17 0:59 ` Paul Jones
@ 2019-01-17 12:33 ` Austin S. Hemmelgarn
2019-01-17 19:17 ` Andrei Borzenkov
2019-01-18 7:02 ` Stefan K
3 siblings, 0 replies; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2019-01-17 12:33 UTC (permalink / raw)
To: Stefan K; +Cc: Chris Murphy, Linux Btrfs
On 2019-01-16 13:15, Chris Murphy wrote:
> On Wed, Jan 16, 2019 at 7:58 AM Stefan K <shadow_7@gmx.net> wrote:
>>
>> :(
>> that means when one JBOD fails there is no guarantee that it keeps working, like in ZFS? Well, that sucks.
>> Didn't anyone think to program it that way?
>
> The mirroring is a function of the block group, not the block device.
> And yes that's part of the intentional design and why it's so
> flexible. A real raid10 isn't as flexible, so to enforce the
> allocation of specific block group stripes to specific block devices
> would add complexity to the allocator while reducing flexibility. It's
> not impossible, it'd just come with caveats like no three device
> raid10 like now; and you'd have to figure out what to do if the user
> adds one new device instead of two at a time, and what if any new
> device isn't the same size as existing devices or if you add two
> devices that aren't the same size. Do you refuse to add such devices?
> What limitations do we run into when rebalancing? It's way more
> complicated.
>
> Btrfs raid10 really should not be called raid10. It sets up the wrong
> user expectation entirely. It's more like raid0+1, except even that is
> deceptive, because with a legit raid0+1 you can in theory lose multiple
> drives on one side of the mirror (but not both sides); with Btrfs raid10
> you really can't lose more than one drive. And therefore it does not
> scale. The probability of downtime increases as drives are added,
> whereas with a real raid10 downtime doesn't change.
>
> In your case you're better off with raid0'ing the two drives in each
> enclosure (whether it's a feature of the enclosure or doing it with
> mdadm or LVM). And then using Btrfs raid1 on top of the resulting
> virtual block devices. Or do mdadm/LVM raid10, and format it Btrfs. Or
> yeah, use ZFS.
I was about to recommend the same BTRFS raid1 on top of MD or LVM RAID0
approach myself. Not only will it get you as close as possible with
BTRFS to the ZFS configuration you posted, it will also net you slightly
better performance than BTRFS in raid10 mode.
Realistically, it's not perfect (if you lose one of the JBOD arrays, you
have to rebuild that array completely and then replace the higher-level
device in BTRFS, instead of just replacing the disk at the lower level),
but the same approach can be extrapolated to cover a wide variety of
configurations in terms of required failure domains, and I can attest to
the fact that it works (I've used this configuration a lot myself, but
mostly for performance reasons, not reliability).
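For reference, a rough sketch of both halves of that, assuming
hypothetical device names (sda/sdb in one enclosure, sdc/sdd in the
other):

  # one RAID0 array per enclosure
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  # BTRFS raid1 across the two arrays
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1

and the recovery path after losing the second enclosure would be roughly:

  # recreate the RAID0 on the replacement disks (RAID0 has nothing to rebuild from)
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sde /dev/sdf
  # mount degraded, then let BTRFS re-mirror onto the new array
  # ('2' is the devid of the missing device, per 'btrfs fi show')
  mount -o degraded /dev/md0 /mnt
  btrfs replace start 2 /dev/md1 /mnt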
* Re: question about creating a raid10
2019-01-16 18:15 ` Chris Murphy
2019-01-17 0:59 ` Paul Jones
2019-01-17 12:33 ` Austin S. Hemmelgarn
@ 2019-01-17 19:17 ` Andrei Borzenkov
2019-01-18 7:02 ` Stefan K
3 siblings, 0 replies; 12+ messages in thread
From: Andrei Borzenkov @ 2019-01-17 19:17 UTC (permalink / raw)
To: Chris Murphy, Stefan K; +Cc: Linux Btrfs
On 16.01.2019 21:15, Chris Murphy wrote:
>
> Btrfs raid10 really should not be called raid10. It sets up the wrong
> user expectation entirely. It's more like raid0+1,
It is actually more like RAID-1E, which is supported by some hardware
RAID HBAs. The difference is that RAID-1E usually uses a strict
sequential block placement algorithm and assumes disks of equal size,
while btrfs raid10 is more flexible in selecting where the next mirror
pair is allocated.
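For illustration, the usual strict RAID-1E layout on three equal disks
looks roughly like this (each strip An is stored on two adjacent disks):

    disk1  disk2  disk3
     A1     A2     A3
     A3     A1     A2
     A4     A5     A6
     A6     A4     A5

btrfs raid10 keeps the same invariant (every strip exists on exactly two
devices) but picks the devices for each new block group dynamically,
based on free space, rather than by a fixed rotation.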
* Re: question about creating a raid10
2019-01-16 18:15 ` Chris Murphy
` (2 preceding siblings ...)
2019-01-17 19:17 ` Andrei Borzenkov
@ 2019-01-18 7:02 ` Stefan K
2019-01-18 13:30 ` Jukka Larja
2019-01-19 2:43 ` Chris Murphy
3 siblings, 2 replies; 12+ messages in thread
From: Stefan K @ 2019-01-18 7:02 UTC (permalink / raw)
To: Linux Btrfs
> Btrfs raid10 really should not be called raid10. It sets up the wrong
> user expectation entirely. It's more like raid0+1, except even that is
> deceptive, because with a legit raid0+1 you can in theory lose multiple
> drives on one side of the mirror (but not both sides); with Btrfs raid10
> you really can't lose more than one drive. And therefore it does not
> scale. The probability of downtime increases as drives are added,
> whereas with a real raid10 downtime doesn't change.
WTF?! Really, so with btrfs raid10 I can't lose more than one drive? That sucks; that's an advantage of raid10!
And the crazy thing is that's not documented, neither in the manpage nor the btrfs wiki, and that is very important.
That's unbelievable...
> In your case you're better off with raid0'ing the two drives in each
> enclosure (whether it's a feature of the enclosure or doing it with
> mdadm or LVM). And then using Btrfs raid1 on top of the resulting
> virtual block devices. Or do mdadm/LVM raid10, and format it Btrfs.
mdadm, lvm... btrfs is a reason not to use those programs in the first place, but since btrfs does not have a 'real' raid10, only a raid0+1, it does not fit our use case; plus I can't configure which disk is in which mirror...
On Wednesday, January 16, 2019 11:15:02 AM CET Chris Murphy wrote:
> On Wed, Jan 16, 2019 at 7:58 AM Stefan K <shadow_7@gmx.net> wrote:
> >
> > :(
> > that means when one JBOD fails there is no guarantee that it keeps working, like in ZFS? Well, that sucks.
> > Didn't anyone think to program it that way?
>
> The mirroring is a function of the block group, not the block device.
> And yes that's part of the intentional design and why it's so
> flexible. A real raid10 isn't as flexible, so to enforce the
> allocation of specific block group stripes to specific block devices
> would add complexity to the allocator while reducing flexibility. It's
> not impossible, it'd just come with caveats like no three device
> raid10 like now; and you'd have to figure out what to do if the user
> adds one new device instead of two at a time, and what if any new
> device isn't the same size as existing devices or if you add two
> devices that aren't the same size. Do you refuse to add such devices?
> What limitations do we run into when rebalancing? It's way more
> complicated.
>
> Btrfs raid10 really should not be called raid10. It sets up the wrong
> user expectation entirely. It's more like raid0+1, except even that is
> deceptive, because with a legit raid0+1 you can in theory lose multiple
> drives on one side of the mirror (but not both sides); with Btrfs raid10
> you really can't lose more than one drive. And therefore it does not
> scale. The probability of downtime increases as drives are added,
> whereas with a real raid10 downtime doesn't change.
>
> In your case you're better off with raid0'ing the two drives in each
> enclosure (whether it's a feature of the enclosure or doing it with
> mdadm or LVM). And then using Btrfs raid1 on top of the resulting
> virtual block devices. Or do mdadm/LVM raid10, and format it Btrfs. Or
> yeah, use ZFS.
>
>
* Re: question about creating a raid10
2019-01-18 7:02 ` Stefan K
@ 2019-01-18 13:30 ` Jukka Larja
2019-01-19 2:43 ` Chris Murphy
1 sibling, 0 replies; 12+ messages in thread
From: Jukka Larja @ 2019-01-18 13:30 UTC (permalink / raw)
To: linux-btrfs
Stefan K wrote on 18.1.2019 at 9.02:
> WTF?! Really, so with btrfs raid10 I can't lose more than one drive? That sucks; that's an advantage of raid10!
> And the crazy thing is that's not documented, neither in the manpage nor the btrfs wiki, and that is very important.
> That's unbelievable...
You should probably check at least
https://btrfs.wiki.kernel.org/index.php/SysadminGuide before planning any
further Btrfs usage.
--
...Activity foreign to life...
Jukka Larja, Roskakori@aarghimedes.fi
<saylan> I just set up port forwards to defense.gov
<saylan> anyone scanning me now will be scanning/attacking the DoD :D
<renderbod> O.o
<bolt> that's... not exactly how port forwarding works
<saylan> ?
- Quote Database, http://www.bash.org/?954232 -
* Re: question about creating a raid10
2019-01-18 7:02 ` Stefan K
2019-01-18 13:30 ` Jukka Larja
@ 2019-01-19 2:43 ` Chris Murphy
1 sibling, 0 replies; 12+ messages in thread
From: Chris Murphy @ 2019-01-19 2:43 UTC (permalink / raw)
To: Btrfs BTRFS, Stefan K
On Fri, Jan 18, 2019 at 12:02 AM Stefan K <shadow_7@gmx.net> wrote:
>
> > Btrfs raid10 really should not be called raid10. It sets up the wrong
> > user expectation entirely. It's more like raid0+1, except even that is
> > deceptive, because with a legit raid0+1 you can in theory lose multiple
> > drives on one side of the mirror (but not both sides); with Btrfs raid10
> > you really can't lose more than one drive. And therefore it does not
> > scale. The probability of downtime increases as drives are added,
> > whereas with a real raid10 downtime doesn't change.
> WTF?! Really, so with btrfs raid10 I can't lose more than one drive?
Correct.
> That sucks; that's an advantage of raid10!
> And the crazy thing is that's not documented, neither in the manpage nor the btrfs wiki, and that is very important.
> That's unbelievable...
Yep. It comes up from time to time and is discussed in the archives. I
suspect that if someone came up with a btrfs-progs patch to add a
warning note under the block group profiles grid (note 4, perhaps),
it'd get accepted.
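Something along these lines, say (just a sketch; exact wording and
placement would be up to whoever writes it):

  *4) raid10 tolerates only a single device failure, regardless of how
      many devices are in the filesystem.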
--
Chris Murphy