* "layout" of a six drive raid10 @ 2016-02-08 22:19 boli 2016-02-08 23:05 ` Hugo Mills 2016-02-09 1:42 ` Duncan 0 siblings, 2 replies; 6+ messages in thread From: boli @ 2016-02-08 22:19 UTC (permalink / raw) To: linux-btrfs Hi I'm trying to figure out what a six drive btrfs raid10 would look like. The example at <https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_differences_among_MD-RAID_.2F_device_mapper_.2F_btrfs_raid.3F> seems ambiguous to me. It could mean that stripes are split over two raid1 sets of three devices each. The sentence "Every stripe is split across to exactly 2 RAID-1 sets" would lead me to believe this. However, earlier it says for raid0 that "stripe[s are] split across as many devices as possible". Which for six drives would be: stripes are split over three raid1 sets of two devices each. Can anyone enlighten me as to which is correct? Reason I'm asking is that I'm deciding on a suitable raid level for a new DIY NAS box. I'd rather not use btrfs raid6 (for now). The first alternative I thought of was raid10. Later I learned how btrfs raid1 works and figured it might be better suited for my use case: Striping the data over multiple raid1 sets doesn't really help, as transfer from/to my box will be limited by gigabit ethernet anyway, and a single drive can saturate that. Thoughts on this would also be appreciated. As a bonus I was wondering how btrfs raid1 are layed out in general, in particular with even and odd numbers of drives. A pair is trivial. For three drives I think a "ring setup" with each drive sharing half of its data with another drive. But how is it with four drives – are they organized as two pairs, or four-way, or … Cheers, boli ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "layout" of a six drive raid10 2016-02-08 22:19 "layout" of a six drive raid10 boli @ 2016-02-08 23:05 ` Hugo Mills 2016-02-09 1:42 ` Duncan 1 sibling, 0 replies; 6+ messages in thread From: Hugo Mills @ 2016-02-08 23:05 UTC (permalink / raw) To: boli; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3205 bytes --] On Mon, Feb 08, 2016 at 11:19:52PM +0100, boli wrote: > Hi > > I'm trying to figure out what a six drive btrfs raid10 would look like. The example at <https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_differences_among_MD-RAID_.2F_device_mapper_.2F_btrfs_raid.3F> seems ambiguous to me. > > It could mean that stripes are split over two raid1 sets of three devices each. The sentence "Every stripe is split across to exactly 2 RAID-1 sets" would lead me to believe this. > > However, earlier it says for raid0 that "stripe[s are] split across as many devices as possible". Which for six drives would be: stripes are split over three raid1 sets of two devices each. > > Can anyone enlighten me as to which is correct? Both? :) You'll find that on six devices, you'll have six chunks allocated at the same time: A1, B1, C1, A2, B2, C2. The "2" chunks are duplicates of the corresponding "1" chunks. The "A", "B", "C" chunks are the alternate stripes. There's no hierarchy of RAID-0-then-RAID-1 or RAID-1-then-RAID-0. > Reason I'm asking is that I'm deciding on a suitable raid level for a new DIY NAS box. I'd rather not use btrfs raid6 (for now). The first alternative I thought of was raid10. Later I learned how btrfs raid1 works and figured it might be better suited for my use case: Striping the data over multiple raid1 sets doesn't really help, as transfer from/to my box will be limited by gigabit ethernet anyway, and a single drive can saturate that. > > Thoughts on this would also be appreciated. > As a bonus I was wondering how btrfs raid1 are layed out in general, > in particular with even and odd numbers of drives. A pair is > trivial. For three drives I think a "ring setup" with each drive > sharing half of its data with another drive. But how is it with four > drives – are they organized as two pairs, or four-way, or … The fundamental unit of space allocation at this level is the chunk -- a 1 GiB unit of storage on one device. (Or 256 MiB for metadata). Chunks are allocated in block groups to form the RAID behaviour of the FS. So, single mode will allocate one chunk in a block group. RAID-1 and -0 will allocate two chunks in a block group. RAID-10 will allocate N chunks in a block group, where N is the largest even number equal to or smaller than the number of devices [with space on]. RAID-5 and -6 will allocate N chunks, where N is the number of devices [with space on]. When chunks are to be allocated, they devices are ordered by the amount of free space on them. The chunks are allocated to devices in that order. So, if you have three equal devices, 1, 2, 3, RAID-1 chunks will be allocated to them as: 1+2, 3+1, 2+3, repeat. With one device larger than the others (say, device 3), it'll start as: 3+1, 3+2, 3+1, 3+2, repeating until all three devices have equal free space, and then going back to the pattern above. Hugo. -- Hugo Mills | Well, you don't get to be a kernel hacker simply by hugo@... carfax.org.uk | looking good in Speedos. http://carfax.org.uk/ | PGP: E2AB1DE4 | Rusty Russell [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "layout" of a six drive raid10 2016-02-08 22:19 "layout" of a six drive raid10 boli 2016-02-08 23:05 ` Hugo Mills @ 2016-02-09 1:42 ` Duncan 2016-02-09 7:02 ` Kai Krakow 1 sibling, 1 reply; 6+ messages in thread From: Duncan @ 2016-02-09 1:42 UTC (permalink / raw) To: linux-btrfs boli posted on Mon, 08 Feb 2016 23:19:52 +0100 as excerpted: > Hi > > I'm trying to figure out what a six drive btrfs raid10 would look like. > It could mean that stripes are split over two raid1 sets of three > devices each. The sentence "Every stripe is split across to exactly 2 > RAID-1 sets" would lead me to believe this. > > However, earlier it says for raid0 that "stripe[s are] split across as > many devices as possible". Which for six drives would be: stripes are > split over three raid1 sets of two devices each. > > Can anyone enlighten me as to which is correct? Hugo's correct, and this is pretty much restating what he did. Sometimes I find that reading things again in different words helps me better understand the concept, and this post is made with that in mind. At present, btrfs has only two-way mirroring, not N-way. So any raid level that includes mirroring will have exactly two copies, no matter the number of devices. (FWIW, N-way-mirroring is on the roadmap, but who knows when it'll come, and like raid56 mode, it will likely take some time to stabilize even once it does.) What that means for a six device raid1 or raid10 is, still exactly two copies of everything, with raid1 simply being three independent chunks, two copies each, and raid10 being two copies of a three-device stripe. > Reason I'm asking is that I'm deciding on a suitable raid level for a > new DIY NAS box. I'd rather not use btrfs raid6 (for now). Agreed and I think wise choice. =:^) I'd still be a bit cautious of btrfs raid56, as I don't think it's quite to the level of stability that other btrfs raid types are, just yet. I expect to be much more comfortable recommending it in another couple kernel cycles. > The first > alternative I thought of was raid10. Later I learned how btrfs raid1 > works and figured it might be better suited for my use case: Striping > the data over multiple raid1 sets doesn't really help, as transfer > from/to my box will be limited by gigabit ethernet anyway, and a single > drive can saturate that. > > Thoughts on this would also be appreciated. Agreed, again. =:^) Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1 on spinning rust will in practice fully saturate the gigabit Ethernet, particularly as it gets fragmented (which COW filesystems such as btrfs tend to do much more so than non-COW, unless you're using something like the autodefrag mount option from the get-go, as I do here, tho in that case, striping won't necessarily help a lot either). If you're concerned about getting the last bit of performance possible, I'd say raid10, tho over the gigabit ethernet, the difference isn't likely to be much. OTOH, if you're more concerned about ease of maintenance, replacing devices, etc, I believe raid1 is a bit less complex both in code terms (where less code complexity means less chance of bugs) and in administration, at least conceptually, tho in practice the administration is going to be very close to the same as well. 
So I'd tend to lean toward raid1 for a use-case thruput limited to gitabit Ethernet speeds, even on spinning rust, as I think there may be a bit of a difference in speed vs raid10, but I doubt it'll be much due to the gigabit thruput limit, and I'd consider the lower complexity of raid1 to offset that. > As a bonus I was wondering how btrfs raid1 are layed out in general, in > particular with even and odd numbers of drives. A pair is trivial. For > three drives I think a "ring setup" with each drive sharing half of its > data with another drive. But how is it with four drives – are they > organized as two pairs, or four-way, or … For raid1, allocation is done in pairs, with each allocation taking the device with the most space left, except that both copies can't be on a single device, even if for instance you have a 3 TB device and the rest are 1 TB or smaller. That case would result in one copy of each pair on the 3 TB device, one copy on whatever device has the most space left of the others. Which on a filesystem with all equal sized devices, tends to result in round-robin allocation, tho of course in the odd number of devices case, there will always be at least one device that has either more or less allocation by a one-chunk margin. (Tho it can be noted that metadata chunks are smaller than data chunks, and while Hugo noted the nominal 1 GiB data chunk size and 256 MiB metadata chunk size, at the 100 GiB plus per device scale, chunks can be larger, upto 10 GiB data chunk, and of course smaller on very small devices, so the 1GiB-data/256MiB-metadata values are indeed only nominal, but they still give you some idea of the relative size.) So a btrfs raid1 on four equally sized devices will indeed result in two pairs, but simply because of the most-space-available allocation rule, not because it's forced to pairs of pairs. And with unequally sized devices, the device with the most space will always get one of the two copies, until its space equalizes to that of at least one other device. Btrfs raid10 works similarly with the copy allocation, but stripe allocation works exactly opposite, prioritizing stripe width. So with an even number of equally sized devices, each stripe will be half the number of devices wide, with the second copy being the other half. If there's an odd number of devices, one will be left out on each allocation, but the one that's left out will change with each allocation, as the one left out in the previous allocation will now have more space available than the others so it'll be allocated first for one of the copies, leaving a different one to be left out on this allocation round. And with unequally sized devices, allocation will always be to an even number and always to at least four at once, of course favoring the device with the most space available, but stripes will always be half the available width, with a second copy of the stripe to the other half, so will use up space on all devices at once if it's an even number of devices with space left, all but one if it's an odd number with space left, since both copies can't be on the same device, which means that odd device can't be used for that allocation round, tho it will be for the next, and a different device left out instead. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 6+ messages in thread
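For the raid1 option Duncan leans toward, a companion sketch using the same placeholder devices; the balance-convert step is only illustrative of the fact that the profile choice isn't permanent (it isn't something discussed in the thread itself):

  # raid1 data and metadata across all six devices: chunks still spread
  # over every device over time, but only two copies and no striping
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
  mount /dev/sdb /mnt

  # if benchmarking later shows raid10 is worth it, the data and metadata
  # profiles can be converted in place with a balance
  btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt

  # and back again, if raid1's simplicity wins out
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt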
* Re: "layout" of a six drive raid10 2016-02-09 1:42 ` Duncan @ 2016-02-09 7:02 ` Kai Krakow 2016-02-09 7:19 ` Kai Krakow 2016-02-09 13:02 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 6+ messages in thread From: Kai Krakow @ 2016-02-09 7:02 UTC (permalink / raw) To: linux-btrfs Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC) schrieb Duncan <1i5t5.duncan@cox.net>: > Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1 > on spinning rust will in practice fully saturate the gigabit > Ethernet, particularly as it gets fragmented (which COW filesystems > such as btrfs tend to do much more so than non-COW, unless you're > using something like the autodefrag mount option from the get-go, as > I do here, tho in that case, striping won't necessarily help a lot > either). > > If you're concerned about getting the last bit of performance > possible, I'd say raid10, tho over the gigabit ethernet, the > difference isn't likely to be much. If performance is an issue, I suggest putting an SSD and bcache into the equation. I have very nice performance improvements with that, especially with writeback caching (random write go to bcache first, then to harddisk in background idle time). Apparently, afaik it's currently not possible to have native bcache redundandancy yet - so bcache can only be one SSD. It may be possible to use two bcaches and assign the btrfs members alternating to it - tho btrfs may decide to put two mirrors on the same bcache then. On the other side, you could put bcache on lvm oder mdraid - but I would not do it. On the bcache list, multiple people had problems with that including btrfs corruption beyond repair. On the other hand, you could simply go with bcache writearound caching (only reads become cached) or writethrough caching (writes go in parallel to bcache and btrfs). If the SSD dies, btrfs will still be perfectly safe in this case. If you are going with one of the latter options, the tuning knobs of bcache may help you actually cache not only random accesses to bcache but also linear accesses. It should help to saturate a gigabit link. Currently, SANdisk offers a pretty cheap (not top performance) drive with 500GB which should perfectly cover this usecase. Tho, I'm not sure how stable this drive works with bcache. I only checked Crucial MX100 and Samsung Evo 840 yet - both working very stable with latest kernel and discard enabled, no mdraid or lvm involved. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: "layout" of a six drive raid10 2016-02-09 7:02 ` Kai Krakow @ 2016-02-09 7:19 ` Kai Krakow 2016-02-09 13:02 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 6+ messages in thread From: Kai Krakow @ 2016-02-09 7:19 UTC (permalink / raw) To: linux-btrfs Am Tue, 9 Feb 2016 08:02:58 +0100 schrieb Kai Krakow <hurikhan77@gmail.com>: > Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC) > schrieb Duncan <1i5t5.duncan@cox.net>: > > > Tho I'd consider benchmarking or testing, as I'm not sure btrfs > > raid1 on spinning rust will in practice fully saturate the gigabit > > Ethernet, particularly as it gets fragmented (which COW filesystems > > such as btrfs tend to do much more so than non-COW, unless you're > > using something like the autodefrag mount option from the get-go, as > > I do here, tho in that case, striping won't necessarily help a lot > > either). > > > > If you're concerned about getting the last bit of performance > > possible, I'd say raid10, tho over the gigabit ethernet, the > > difference isn't likely to be much. > > If performance is an issue, I suggest putting an SSD and bcache into > the equation. I have very nice performance improvements with that, > especially with writeback caching (random write go to bcache first, > then to harddisk in background idle time). > > Apparently, afaik it's currently not possible to have native bcache > redundandancy yet - so bcache can only be one SSD. It may be possible > to use two bcaches and assign the btrfs members alternating to it - > tho btrfs may decide to put two mirrors on the same bcache then. On > the other side, you could put bcache on lvm oder mdraid - but I would > not do it. On the bcache list, multiple people had problems with that > including btrfs corruption beyond repair. > > On the other hand, you could simply go with bcache writearound caching > (only reads become cached) or writethrough caching (writes go in > parallel to bcache and btrfs). If the SSD dies, btrfs will still be > perfectly safe in this case. > > If you are going with one of the latter options, the tuning knobs of > bcache may help you actually cache not only random accesses to bcache > but also linear accesses. It should help to saturate a gigabit link. > > Currently, SANdisk offers a pretty cheap (not top performance) drive > with 500GB which should perfectly cover this usecase. Tho, I'm not > sure how stable this drive works with bcache. I only checked Crucial > MX100 and Samsung Evo 840 yet - both working very stable with latest > kernel and discard enabled, no mdraid or lvm involved. BTW: If you are thinking about adding bcache later keep in mind that it is almost impossible to do that (requires reformatting) as bcache needs to add its own superblock to the backing storage devices (spinning rust). But it's perfectly okay to format with a bcache superblock even if you do not use bcache caching with SSD yet. It will work in passthru mode until you add the SSD later so it may be worth starting with a bcache superblock right from the beginning. It creates a sub device like this: /dev/sda [spinning disk] `- /dev/bcache0 /dev/sdb [spinning disk] `- /dev/bcache1 So, you put btrfs on /dev/bcache* then. If you later add the caching device, it will add the following to "lsblk": /dev/sdc [SSD, ex. 500GB] `- /dev/bcache0 [harddisk, ex. 2TB] `- /dev/bcache1 [harddisk, ex. 2TB] Access to bcache0 and bcache1 will then go thru /dev/sdc as the cache. 
Bcache is very good at turning random access patterns into linear access patterns, in turn reducing seeking noise from the harddisks to a minimum (you will actually hear the difference). So essentially it quite effectively reduces seeking which makes btrfs slow on spinning rust, in turn speeding it up noticeably. -- Regards, Kai Replies to list-only preferred. ^ permalink raw reply [flat|nested] 6+ messages in thread
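A minimal sketch of the formatting order Kai describes, reusing his example device names (/dev/sda and /dev/sdb as spinning disks, /dev/sdc as the SSD); the cache-set UUID is deliberately left as a placeholder:

  # give each spinning disk a bcache superblock up front; this is the
  # step that cannot be done later without reformatting
  make-bcache -B /dev/sda
  make-bcache -B /dev/sdb
  # -> /dev/bcache0 and /dev/bcache1 appear and work in passthrough mode

  # build btrfs on the bcache devices, not on the raw disks
  mkfs.btrfs -d raid1 -m raid1 /dev/bcache0 /dev/bcache1

  # later, when an SSD is added: format it as a cache set ...
  make-bcache -C /dev/sdc

  # ... and attach both backing devices to it, using the cache-set UUID
  # printed by make-bcache (or shown by `bcache-super-show /dev/sdc`)
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach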
* Re: "layout" of a six drive raid10 2016-02-09 7:02 ` Kai Krakow 2016-02-09 7:19 ` Kai Krakow @ 2016-02-09 13:02 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 6+ messages in thread From: Austin S. Hemmelgarn @ 2016-02-09 13:02 UTC (permalink / raw) To: Kai Krakow, linux-btrfs On 2016-02-09 02:02, Kai Krakow wrote: > Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC) > schrieb Duncan <1i5t5.duncan@cox.net>: > >> Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1 >> on spinning rust will in practice fully saturate the gigabit >> Ethernet, particularly as it gets fragmented (which COW filesystems >> such as btrfs tend to do much more so than non-COW, unless you're >> using something like the autodefrag mount option from the get-go, as >> I do here, tho in that case, striping won't necessarily help a lot >> either). >> >> If you're concerned about getting the last bit of performance >> possible, I'd say raid10, tho over the gigabit ethernet, the >> difference isn't likely to be much. > > If performance is an issue, I suggest putting an SSD and bcache into > the equation. I have very nice performance improvements with that, > especially with writeback caching (random write go to bcache first, > then to harddisk in background idle time). > > Apparently, afaik it's currently not possible to have native bcache > redundandancy yet - so bcache can only be one SSD. It may be possible > to use two bcaches and assign the btrfs members alternating to it - tho > btrfs may decide to put two mirrors on the same bcache then. On the > other side, you could put bcache on lvm oder mdraid - but I would not > do it. On the bcache list, multiple people had problems with that > including btrfs corruption beyond repair. > > On the other hand, you could simply go with bcache writearound caching > (only reads become cached) or writethrough caching (writes go in > parallel to bcache and btrfs). If the SSD dies, btrfs will still be > perfectly safe in this case. > > If you are going with one of the latter options, the tuning knobs of > bcache may help you actually cache not only random accesses to bcache > but also linear accesses. It should help to saturate a gigabit link. > > Currently, SANdisk offers a pretty cheap (not top performance) drive > with 500GB which should perfectly cover this usecase. Tho, I'm not sure > how stable this drive works with bcache. I only checked Crucial MX100 > and Samsung Evo 840 yet - both working very stable with latest kernel > and discard enabled, no mdraid or lvm involved. > FWIW, the other option if you want good performance and don't want to get an SSD is to run BTRFS in raid1 mode on top of two LVM or MD-RAID RAID0 volumes. I do this regularly for VM's and see a roughly 25-30% performance increase compared to BTRFS raid10 for my workloads, and that's with things laid out such that each block in BTRFS (16k in my case) ends up entirely on one disk in the RAID0 volume (you could theoretically get better performance by sizing the stripes on the RAID0 volume such that a block from BTRFS gets spread across all the disks in the volume, but that is marginally less safe than forcing each to one). ^ permalink raw reply [flat|nested] 6+ messages in thread