* "layout" of a six drive raid10 @ 2016-02-08 22:19 boli 2016-02-08 23:05 ` Hugo Mills 2016-02-09 1:42 ` Duncan 0 siblings, 2 replies; 6+ messages in thread From: boli @ 2016-02-08 22:19 UTC (permalink / raw) To: linux-btrfs Hi I'm trying to figure out what a six drive btrfs raid10 would look like. The example at <https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_differences_among_MD-RAID_.2F_device_mapper_.2F_btrfs_raid.3F> seems ambiguous to me. It could mean that stripes are split over two raid1 sets of three devices each. The sentence "Every stripe is split across to exactly 2 RAID-1 sets" would lead me to believe this. However, earlier it says for raid0 that "stripe[s are] split across as many devices as possible". Which for six drives would be: stripes are split over three raid1 sets of two devices each. Can anyone enlighten me as to which is correct? Reason I'm asking is that I'm deciding on a suitable raid level for a new DIY NAS box. I'd rather not use btrfs raid6 (for now). The first alternative I thought of was raid10. Later I learned how btrfs raid1 works and figured it might be better suited for my use case: Striping the data over multiple raid1 sets doesn't really help, as transfer from/to my box will be limited by gigabit ethernet anyway, and a single drive can saturate that. Thoughts on this would also be appreciated. As a bonus I was wondering how btrfs raid1 are layed out in general, in particular with even and odd numbers of drives. A pair is trivial. For three drives I think a "ring setup" with each drive sharing half of its data with another drive. But how is it with four drives – are they organized as two pairs, or four-way, or … Cheers, boli ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "layout" of a six drive raid10 2016-02-08 22:19 "layout" of a six drive raid10 boli @ 2016-02-08 23:05 ` Hugo Mills 2016-02-09 1:42 ` Duncan 1 sibling, 0 replies; 6+ messages in thread From: Hugo Mills @ 2016-02-08 23:05 UTC (permalink / raw) To: boli; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3205 bytes --] On Mon, Feb 08, 2016 at 11:19:52PM +0100, boli wrote: > Hi > > I'm trying to figure out what a six drive btrfs raid10 would look like. The example at <https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_differences_among_MD-RAID_.2F_device_mapper_.2F_btrfs_raid.3F> seems ambiguous to me. > > It could mean that stripes are split over two raid1 sets of three devices each. The sentence "Every stripe is split across to exactly 2 RAID-1 sets" would lead me to believe this. > > However, earlier it says for raid0 that "stripe[s are] split across as many devices as possible". Which for six drives would be: stripes are split over three raid1 sets of two devices each. > > Can anyone enlighten me as to which is correct? Both? :) You'll find that on six devices, you'll have six chunks allocated at the same time: A1, B1, C1, A2, B2, C2. The "2" chunks are duplicates of the corresponding "1" chunks. The "A", "B", "C" chunks are the alternate stripes. There's no hierarchy of RAID-0-then-RAID-1 or RAID-1-then-RAID-0. > Reason I'm asking is that I'm deciding on a suitable raid level for a new DIY NAS box. I'd rather not use btrfs raid6 (for now). The first alternative I thought of was raid10. Later I learned how btrfs raid1 works and figured it might be better suited for my use case: Striping the data over multiple raid1 sets doesn't really help, as transfer from/to my box will be limited by gigabit ethernet anyway, and a single drive can saturate that. > > Thoughts on this would also be appreciated. > As a bonus I was wondering how btrfs raid1 are layed out in general, > in particular with even and odd numbers of drives. A pair is > trivial. For three drives I think a "ring setup" with each drive > sharing half of its data with another drive. But how is it with four > drives – are they organized as two pairs, or four-way, or … The fundamental unit of space allocation at this level is the chunk -- a 1 GiB unit of storage on one device. (Or 256 MiB for metadata). Chunks are allocated in block groups to form the RAID behaviour of the FS. So, single mode will allocate one chunk in a block group. RAID-1 and -0 will allocate two chunks in a block group. RAID-10 will allocate N chunks in a block group, where N is the largest even number equal to or smaller than the number of devices [with space on]. RAID-5 and -6 will allocate N chunks, where N is the number of devices [with space on]. When chunks are to be allocated, they devices are ordered by the amount of free space on them. The chunks are allocated to devices in that order. So, if you have three equal devices, 1, 2, 3, RAID-1 chunks will be allocated to them as: 1+2, 3+1, 2+3, repeat. With one device larger than the others (say, device 3), it'll start as: 3+1, 3+2, 3+1, 3+2, repeating until all three devices have equal free space, and then going back to the pattern above. Hugo. -- Hugo Mills | Well, you don't get to be a kernel hacker simply by hugo@... carfax.org.uk | looking good in Speedos. http://carfax.org.uk/ | PGP: E2AB1DE4 | Rusty Russell [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "layout" of a six drive raid10 2016-02-08 22:19 "layout" of a six drive raid10 boli 2016-02-08 23:05 ` Hugo Mills @ 2016-02-09 1:42 ` Duncan 2016-02-09 7:02 ` Kai Krakow 1 sibling, 1 reply; 6+ messages in thread From: Duncan @ 2016-02-09 1:42 UTC (permalink / raw) To: linux-btrfs boli posted on Mon, 08 Feb 2016 23:19:52 +0100 as excerpted: > Hi > > I'm trying to figure out what a six drive btrfs raid10 would look like. > It could mean that stripes are split over two raid1 sets of three > devices each. The sentence "Every stripe is split across to exactly 2 > RAID-1 sets" would lead me to believe this. > > However, earlier it says for raid0 that "stripe[s are] split across as > many devices as possible". Which for six drives would be: stripes are > split over three raid1 sets of two devices each. > > Can anyone enlighten me as to which is correct? Hugo's correct, and this is pretty much restating what he did. Sometimes I find that reading things again in different words helps me better understand the concept, and this post is made with that in mind. At present, btrfs has only two-way mirroring, not N-way. So any raid level that includes mirroring will have exactly two copies, no matter the number of devices. (FWIW, N-way-mirroring is on the roadmap, but who knows when it'll come, and like raid56 mode, it will likely take some time to stabilize even once it does.) What that means for a six device raid1 or raid10 is, still exactly two copies of everything, with raid1 simply being three independent chunks, two copies each, and raid10 being two copies of a three-device stripe. > Reason I'm asking is that I'm deciding on a suitable raid level for a > new DIY NAS box. I'd rather not use btrfs raid6 (for now). Agreed and I think wise choice. =:^) I'd still be a bit cautious of btrfs raid56, as I don't think it's quite to the level of stability that other btrfs raid types are, just yet. I expect to be much more comfortable recommending it in another couple kernel cycles. > The first > alternative I thought of was raid10. Later I learned how btrfs raid1 > works and figured it might be better suited for my use case: Striping > the data over multiple raid1 sets doesn't really help, as transfer > from/to my box will be limited by gigabit ethernet anyway, and a single > drive can saturate that. > > Thoughts on this would also be appreciated. Agreed, again. =:^) Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1 on spinning rust will in practice fully saturate the gigabit Ethernet, particularly as it gets fragmented (which COW filesystems such as btrfs tend to do much more so than non-COW, unless you're using something like the autodefrag mount option from the get-go, as I do here, tho in that case, striping won't necessarily help a lot either). If you're concerned about getting the last bit of performance possible, I'd say raid10, tho over the gigabit ethernet, the difference isn't likely to be much. OTOH, if you're more concerned about ease of maintenance, replacing devices, etc, I believe raid1 is a bit less complex both in code terms (where less code complexity means less chance of bugs) and in administration, at least conceptually, tho in practice the administration is going to be very close to the same as well. 
So I'd tend to lean toward raid1 for a use-case thruput limited to gitabit Ethernet speeds, even on spinning rust, as I think there may be a bit of a difference in speed vs raid10, but I doubt it'll be much due to the gigabit thruput limit, and I'd consider the lower complexity of raid1 to offset that. > As a bonus I was wondering how btrfs raid1 are layed out in general, in > particular with even and odd numbers of drives. A pair is trivial. For > three drives I think a "ring setup" with each drive sharing half of its > data with another drive. But how is it with four drives – are they > organized as two pairs, or four-way, or … For raid1, allocation is done in pairs, with each allocation taking the device with the most space left, except that both copies can't be on a single device, even if for instance you have a 3 TB device and the rest are 1 TB or smaller. That case would result in one copy of each pair on the 3 TB device, one copy on whatever device has the most space left of the others. Which on a filesystem with all equal sized devices, tends to result in round-robin allocation, tho of course in the odd number of devices case, there will always be at least one device that has either more or less allocation by a one-chunk margin. (Tho it can be noted that metadata chunks are smaller than data chunks, and while Hugo noted the nominal 1 GiB data chunk size and 256 MiB metadata chunk size, at the 100 GiB plus per device scale, chunks can be larger, upto 10 GiB data chunk, and of course smaller on very small devices, so the 1GiB-data/256MiB-metadata values are indeed only nominal, but they still give you some idea of the relative size.) So a btrfs raid1 on four equally sized devices will indeed result in two pairs, but simply because of the most-space-available allocation rule, not because it's forced to pairs of pairs. And with unequally sized devices, the device with the most space will always get one of the two copies, until its space equalizes to that of at least one other device. Btrfs raid10 works similarly with the copy allocation, but stripe allocation works exactly opposite, prioritizing stripe width. So with an even number of equally sized devices, each stripe will be half the number of devices wide, with the second copy being the other half. If there's an odd number of devices, one will be left out on each allocation, but the one that's left out will change with each allocation, as the one left out in the previous allocation will now have more space available than the others so it'll be allocated first for one of the copies, leaving a different one to be left out on this allocation round. And with unequally sized devices, allocation will always be to an even number and always to at least four at once, of course favoring the device with the most space available, but stripes will always be half the available width, with a second copy of the stripe to the other half, so will use up space on all devices at once if it's an even number of devices with space left, all but one if it's an odd number with space left, since both copies can't be on the same device, which means that odd device can't be used for that allocation round, tho it will be for the next, and a different device left out instead. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 6+ messages in thread
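For the raid1 option Duncan leans toward, a companion sketch using the same placeholder devices; the balance-convert step is only illustrative of the fact that the profile choice isn't permanent (it isn't something discussed in the thread itself):

  # raid1 data and metadata across all six devices: chunks still spread
  # over every device over time, but only two copies and no striping
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
  mount /dev/sdb /mnt

  # if benchmarking later shows raid10 is worth it, the data and metadata
  # profiles can be converted in place with a balance
  btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt

  # and back again, if raid1's simplicity wins out
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt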
* Re: "layout" of a six drive raid10 2016-02-09 1:42 ` Duncan @ 2016-02-09 7:02 ` Kai Krakow 2016-02-09 7:19 ` Kai Krakow 2016-02-09 13:02 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 6+ messages in thread From: Kai Krakow @ 2016-02-09 7:02 UTC (permalink / raw) To: linux-btrfs Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC) schrieb Duncan <1i5t5.duncan@cox.net>: > Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1 > on spinning rust will in practice fully saturate the gigabit > Ethernet, particularly as it gets fragmented (which COW filesystems > such as btrfs tend to do much more so than non-COW, unless you're > using something like the autodefrag mount option from the get-go, as > I do here, tho in that case, striping won't necessarily help a lot > either). > > If you're concerned about getting the last bit of performance > possible, I'd say raid10, tho over the gigabit ethernet, the > difference isn't likely to be much. If performance is an issue, I suggest putting an SSD and bcache into the equation. I have very nice performance improvements with that, especially with writeback caching (random write go to bcache first, then to harddisk in background idle time). Apparently, afaik it's currently not possible to have native bcache redundandancy yet - so bcache can only be one SSD. It may be possible to use two bcaches and assign the btrfs members alternating to it - tho btrfs may decide to put two mirrors on the same bcache then. On the other side, you could put bcache on lvm oder mdraid - but I would not do it. On the bcache list, multiple people had problems with that including btrfs corruption beyond repair. On the other hand, you could simply go with bcache writearound caching (only reads become cached) or writethrough caching (writes go in parallel to bcache and btrfs). If the SSD dies, btrfs will still be perfectly safe in this case. If you are going with one of the latter options, the tuning knobs of bcache may help you actually cache not only random accesses to bcache but also linear accesses. It should help to saturate a gigabit link. Currently, SANdisk offers a pretty cheap (not top performance) drive with 500GB which should perfectly cover this usecase. Tho, I'm not sure how stable this drive works with bcache. I only checked Crucial MX100 and Samsung Evo 840 yet - both working very stable with latest kernel and discard enabled, no mdraid or lvm involved. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "layout" of a six drive raid10 2016-02-09 7:02 ` Kai Krakow @ 2016-02-09 7:19 ` Kai Krakow 2016-02-09 13:02 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 6+ messages in thread From: Kai Krakow @ 2016-02-09 7:19 UTC (permalink / raw) To: linux-btrfs Am Tue, 9 Feb 2016 08:02:58 +0100 schrieb Kai Krakow <hurikhan77@gmail.com>: > Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC) > schrieb Duncan <1i5t5.duncan@cox.net>: > > > Tho I'd consider benchmarking or testing, as I'm not sure btrfs > > raid1 on spinning rust will in practice fully saturate the gigabit > > Ethernet, particularly as it gets fragmented (which COW filesystems > > such as btrfs tend to do much more so than non-COW, unless you're > > using something like the autodefrag mount option from the get-go, as > > I do here, tho in that case, striping won't necessarily help a lot > > either). > > > > If you're concerned about getting the last bit of performance > > possible, I'd say raid10, tho over the gigabit ethernet, the > > difference isn't likely to be much. > > If performance is an issue, I suggest putting an SSD and bcache into > the equation. I have very nice performance improvements with that, > especially with writeback caching (random write go to bcache first, > then to harddisk in background idle time). > > Apparently, afaik it's currently not possible to have native bcache > redundandancy yet - so bcache can only be one SSD. It may be possible > to use two bcaches and assign the btrfs members alternating to it - > tho btrfs may decide to put two mirrors on the same bcache then. On > the other side, you could put bcache on lvm oder mdraid - but I would > not do it. On the bcache list, multiple people had problems with that > including btrfs corruption beyond repair. > > On the other hand, you could simply go with bcache writearound caching > (only reads become cached) or writethrough caching (writes go in > parallel to bcache and btrfs). If the SSD dies, btrfs will still be > perfectly safe in this case. > > If you are going with one of the latter options, the tuning knobs of > bcache may help you actually cache not only random accesses to bcache > but also linear accesses. It should help to saturate a gigabit link. > > Currently, SANdisk offers a pretty cheap (not top performance) drive > with 500GB which should perfectly cover this usecase. Tho, I'm not > sure how stable this drive works with bcache. I only checked Crucial > MX100 and Samsung Evo 840 yet - both working very stable with latest > kernel and discard enabled, no mdraid or lvm involved. BTW: If you are thinking about adding bcache later keep in mind that it is almost impossible to do that (requires reformatting) as bcache needs to add its own superblock to the backing storage devices (spinning rust). But it's perfectly okay to format with a bcache superblock even if you do not use bcache caching with SSD yet. It will work in passthru mode until you add the SSD later so it may be worth starting with a bcache superblock right from the beginning. It creates a sub device like this: /dev/sda [spinning disk] `- /dev/bcache0 /dev/sdb [spinning disk] `- /dev/bcache1 So, you put btrfs on /dev/bcache* then. If you later add the caching device, it will add the following to "lsblk": /dev/sdc [SSD, ex. 500GB] `- /dev/bcache0 [harddisk, ex. 2TB] `- /dev/bcache1 [harddisk, ex. 2TB] Access to bcache0 and bcache1 will then go thru /dev/sdc as the cache. 
Bcache is very good at turning random access patterns into linear access patterns, in turn reducing seeking noise from the harddisks to a minimum (you will actually hear the difference). So essentially it quite effectively reduces seeking which makes btrfs slow on spinning rust, in turn speeding it up noticeably. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 6+ messages in thread
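A minimal sketch of the formatting order Kai describes, reusing his example device names (/dev/sda and /dev/sdb as spinning disks, /dev/sdc as the SSD); the cache-set UUID is deliberately left as a placeholder:

  # give each spinning disk a bcache superblock up front; this is the
  # step that cannot be done later without reformatting
  make-bcache -B /dev/sda
  make-bcache -B /dev/sdb
  # -> /dev/bcache0 and /dev/bcache1 appear and work in passthrough mode

  # build btrfs on the bcache devices, not on the raw disks
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1

  # later, when an SSD is added: format it as a cache set ...
  make-bcache -C /dev/sdc

  # ... and attach both backing devices to it, using the cache-set UUID
  # printed by make-bcache (or shown by `bcache-super-show /dev/sdc`)
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach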
* Re: "layout" of a six drive raid10 2016-02-09 7:02 ` Kai Krakow 2016-02-09 7:19 ` Kai Krakow @ 2016-02-09 13:02 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 6+ messages in thread From: Austin S. Hemmelgarn @ 2016-02-09 13:02 UTC (permalink / raw) To: Kai Krakow, linux-btrfs On 2016-02-09 02:02, Kai Krakow wrote: > Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC) > schrieb Duncan <1i5t5.duncan@cox.net>: > >> Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1 >> on spinning rust will in practice fully saturate the gigabit >> Ethernet, particularly as it gets fragmented (which COW filesystems >> such as btrfs tend to do much more so than non-COW, unless you're >> using something like the autodefrag mount option from the get-go, as >> I do here, tho in that case, striping won't necessarily help a lot >> either). >> >> If you're concerned about getting the last bit of performance >> possible, I'd say raid10, tho over the gigabit ethernet, the >> difference isn't likely to be much. > > If performance is an issue, I suggest putting an SSD and bcache into > the equation. I have very nice performance improvements with that, > especially with writeback caching (random write go to bcache first, > then to harddisk in background idle time). > > Apparently, afaik it's currently not possible to have native bcache > redundandancy yet - so bcache can only be one SSD. It may be possible > to use two bcaches and assign the btrfs members alternating to it - tho > btrfs may decide to put two mirrors on the same bcache then. On the > other side, you could put bcache on lvm oder mdraid - but I would not > do it. On the bcache list, multiple people had problems with that > including btrfs corruption beyond repair. > > On the other hand, you could simply go with bcache writearound caching > (only reads become cached) or writethrough caching (writes go in > parallel to bcache and btrfs). If the SSD dies, btrfs will still be > perfectly safe in this case. > > If you are going with one of the latter options, the tuning knobs of > bcache may help you actually cache not only random accesses to bcache > but also linear accesses. It should help to saturate a gigabit link. > > Currently, SANdisk offers a pretty cheap (not top performance) drive > with 500GB which should perfectly cover this usecase. Tho, I'm not sure > how stable this drive works with bcache. I only checked Crucial MX100 > and Samsung Evo 840 yet - both working very stable with latest kernel > and discard enabled, no mdraid or lvm involved. > FWIW, the other option if you want good performance and don't want to get an SSD is to run BTRFS in raid1 mode on top of two LVM or MD-RAID RAID0 volumes. I do this regularly for VM's and see a roughly 25-30% performance increase compared to BTRFS raid10 for my workloads, and that's with things laid out such that each block in BTRFS (16k in my case) ends up entirely on one disk in the RAID0 volume (you could theoretically get better performance by sizing the stripes on the RAID0 volume such that a block from BTRFS gets spread across all the disks in the volume, but that is marginally less safe than forcing each to one). ^ permalink raw reply [flat|nested] 6+ messages in thread