* RAID1, SSD+non-SSD
From: Brian B @ 2015-02-06 20:01 UTC
To: linux-btrfs
My laptop has two disks, an SSD and a traditional magnetic disk. I plan
to make a partition on the mag disk equal in size to the SSD and set up
BTRFS RAID1. This I know how to do.
The only reason I'm doing the RAID1 is for the self-healing. I realize
writing large amounts of data will be slower than the SSD alone, but
is it possible to set it up to only read from the magnetic drive if
there's an error reading from the SSD?
In other words, is there a way to tell it to only read from the faster
disk? Is that even necessary? Is there a better way to accomplish
this?

* Re: RAID1, SSD+non-SSD
From: Chris Murphy @ 2015-02-07 0:23 UTC
To: Btrfs BTRFS

On Fri, Feb 6, 2015 at 1:01 PM, Brian B <canis8585@gmail.com> wrote:
> My laptop has two disks, a SSD and a traditional magnetic disk. I plan
> to make a partition on the mag disk equal in size the SSD and set up
> BTRFS RAID1. This I know how to do.

There isn't a write mostly option in btrfs like there is with md raid,
so I don't know how Btrfs will tolerate one device being exceptionally
slower than the other. It may be most of the time it won't matter, but
I can imagine with a ton of IOPS backing up on the hard drive, having
completed on the SSD, could maybe be a problem. I'd test it unless
someone else who has pipes up.

> The only reason I'm doing the RAID1 is for the self-healing. I realize
> writing large amounts of data will be slower than the SSD alone, but
> is it possible to set it up to only read from the magnetic drive if
> there's an error reading from the SSD?

No.

> In other words, is there a way to tell it to only read from the faster
> disk? Is that even necessary? Is there a better way to accomplish
> this?

No. No. And maybe. In order.

If there is an error detected by either drive, or by Btrfs, Btrfs will
get the correct data from the other drive and fix the problem on the
original drive. You don't need to configure anything. The only concern
is the asymmetric performance.

I think the use case is better achieved with two HDD's + two SSD
partitions, configured either with bcache or dmcache. The result is
two logical devices using HDDs as backing device and SSD partitions as
cache, and then format them as Btrfs raid1. The question there of
course, is maturity of bcache vs dmcache and their interactions with
Btrfs. But at least that's supposed to work.

--
Chris Murphy

* Re: RAID1, SSD+non-SSD
From: Kai Krakow @ 2015-02-07 18:06 UTC
To: linux-btrfs

Chris Murphy <lists@colorremedies.com> wrote:
> [...]
> I think the use case is better achieved with two HDD's + two SSD
> partitions, configured either with bcache or dmcache. The result is
> two logical devices using HDDs as backing device and SSD partitions as
> cache, and then format them as Btrfs raid1. The question there of
> course, is maturity of bcache vs dmcache and their interactions with
> Btrfs. But at least that's supposed to work.

Bcache on multi-device btrfs works fine for me. No problems yet, even
after hard resets. I'm running a normal desktop workload: some Steam
games, some VMs, and some MySQL/PHP/Rails programming.

A single bcache cache partition can support many backing devices, so
there is no need to use multiple partitions in that case. In this
scenario, bcache{0,1,2} is my btrfs rootfs:

    $ lsblk /dev/sdb
    NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sdb           8:16   0 119,2G  0 disk
    ├─sdb1        8:17   0   512M  0 part
    ├─sdb2        8:18   0    20G  0 part
    ├─sdb3        8:19   0  79,5G  0 part
    │ ├─bcache0 252:0    0 925,5G  0 disk
    │ ├─bcache1 252:1    0 925,5G  0 disk
    │ └─bcache2 252:2    0 925,5G  0 disk /home
    └─sdb4        8:20   0  19,2G  0 part

sdb3 is the cache device, sdb4 is trimmed and left untouched for SSD
overprovisioning, sdb2 is a dedicated resume swap (traditional swapping
goes to all three HDDs), and sdb1 is my ESP to boot kernel and initramfs
from.

Bcache takes some time to warm up but is really fast afterwards. Boot
times (using systemd and readahead) went down from >60s (on spinning
rust) to ~30s on first boot (the system was migrated to bcache with dev
del/add, not reinstalled), then ~15s, 10s, and now they fluctuate
between 3 and 8s (mostly around 5s) for reaching graphical.target
(depending on whether I installed updates). KDE takes some time to load,
but I suppose most of that is due to its artificial 4 second delay
during initialization of kded and friends - I guess this will be fixed
in KDE5. This boot target is not stripped down: it includes network,
mysql, postfix and some other stuff that one either does not usually
need or could optimize away. The numbers are taken from
"systemd-analyze critical-chain". Bcache hit rate is usually between 60
and 80% using write-back.

So, all in all, I can generally recommend bcache. I don't know dmcache,
tho. But I really advise against using a mixed setup of SSD and HDD
partitions in btrfs RAID mode, especially since btrfs does not handle
different sized partitions that well. With bcache you can have your cake
and eat it too (read: big storage pool + fast access times).

BTW: Is there work in progress to let btrfs choose which device to read
from or write to other than using round-robin or pid mapping? Maybe it
would be interesting to watch the current read and write latencies of
all drives and choose the one with the lowest latency. Tho, I think it
won't make much sense when passing accesses through btrfs.

--
Replies to list only preferred.
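
The latency-based device choice suggested in the BTW paragraph above
can be illustrated with a small Python sketch. This is not btrfs code
and not an existing btrfs feature: the class name, the moving-average
smoothing and the sample latencies are all assumptions made purely for
illustration.

```python
class LatencyTracker:
    """Hypothetical helper: keeps a smoothed read latency per device."""

    def __init__(self, devices, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample
        self.avg_ms = {dev: None for dev in devices}

    def record(self, device, latency_ms):
        """Fold one observed latency into the device's moving average."""
        previous = self.avg_ms[device]
        if previous is None:
            self.avg_ms[device] = latency_ms
        else:
            self.avg_ms[device] = (
                self.alpha * latency_ms + (1 - self.alpha) * previous
            )

    def pick(self):
        """Return an unmeasured device if any, else the lowest average."""
        unmeasured = [d for d, avg in self.avg_ms.items() if avg is None]
        if unmeasured:
            return unmeasured[0]
        return min(self.avg_ms, key=self.avg_ms.get)


# Made-up sample latencies in milliseconds, not measurements.
tracker = LatencyTracker(["ssd", "hdd"])
tracker.record("ssd", 0.1)
tracker.record("hdd", 8.0)
print(tracker.pick())
```

Preferring unmeasured devices first is one possible design choice, so
that every device gets sampled at least once before the averages are
compared.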

* Re: RAID1, SSD+non-SSD
From: Duncan @ 2015-02-08 3:31 UTC
To: linux-btrfs

Kai Krakow posted on Sat, 07 Feb 2015 19:06:14 +0100 as excerpted:

> BTW: Is there work in progress to let btrfs choose which device to read
> from or write to other than using round-robin or pid mapping? Maybe it
> would be interesting to watch the current read and write latencies of
> all drives and choose the one with the lowest latency. Tho, I think it
> won't make much sense when passing accesses through btrfs.

There are several projects in that general area suggested on the wiki.
You'd need to look there for status (unclaimed, claimed, in progress,
etc) and to see if any of them match well enough to what you had in
mind, or if you might wish to add another.

There's definitely optimization planned, with the project ideas
mentioned above going beyond that. However, I'm not sure of the status
of the actually planned optimization either. There's certainly the
standard premature optimization thing to worry about, but arguably,
we're past the point at which it'd be premature now, and actually need
it, if for no other reason than that making a case for true btrfs
stability is rather difficult if such optimization is still being held
off as premature, or isn't being held off any longer, but that state is
so new the optimization simply hasn't been done yet.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

* Re: RAID1, SSD+non-SSD
From: Duncan @ 2015-02-07 6:39 UTC
To: linux-btrfs

Brian B posted on Fri, 06 Feb 2015 15:01:30 -0500 as excerpted:

> The only reason I'm doing the [btrfs] RAID1 is for the self-healing. I
> realize writing large amounts of data will be slower than the SSD
> alone, but is it possible to set it up to only read from the magnetic
> drive if there's an error reading from the SSD?

Chris Murphy is correct. Btrfs raid1 doesn't have the write-mostly
option that mdraid has.

I'll simply expand on what he mentioned with two points, #1 being the
more important for your case.

1) The btrfs raid1 read-mode device choice algorithm is known to be
sub-optimal, and the plan is to change and optimize it in the longer
term. Basically, it's an easy first implementation that's simple enough
to be reasonably bug-free and to stay out of the developers' way while
they work on other things, while still allowing easy testing of both
devices.

Specifically, it's a very simple even/odd parity assignment based on the
PID making the request. Thus, a single PID read task will consistently
read from the same device (unless the block checksum on that device is
bad, then it tries the other device), no matter how much there is to
read and how backed up that device might be, or how idle the other one
might be. Even a second read task from another PID, or a 10th, or the
100th, if they're all even or all odd parity PIDs, will all be assigned
to read from the same device, even if the other one is entirely idle.

Which ends up being worst-case for a multi-threaded heavy-read focused
task where all read threads happen to be even or odd, say if read and
compute threads are paired and always spawned in the same order, with
nothing else going on to throw the parity ordering off. But that's how
it's currently implemented. =:^(

And it /does/ make for easily repeatable test results, while being
simple enough to stay out of the way while development interest focuses
elsewhere, after all pretty important factors early in a project of this
scope. =:^)

Obviously, that's going to be bad news for you, too, unless your
use-case is specific enough that you can tune the read PIDs to favor the
parity that hits the SSD. =:^(

The claim is made that btrfs is stabilizing, and in fact, as a regular
here for some time, I can vouch for that. But I think it's reasonable to
argue that until this sort of read-scheduling algorithm is replaced with
something a bit more optimized, and of course that replacement well
tested, it's definitely premature to call btrfs fully stable. This sort
of mis-optimization, painfully bad in some cases, just doesn't fit with
stable, and regardless of how long it takes, until development quiets
down far enough that the devs can feel comfortable focusing on something
like this, it's extremely hard to argue that development has quieted
down enough to fairly call it stable in the first place.

Well, my opinion anyway.

So the short of it is, at least until btrfs optimizes this a bit better,
for SSD paired with spinning-rust raid1 optimization, as Chris Murphy
suggested, use some sort of caching mechanism, bcache or dmcache.

Tho you'll want to compare notes with someone who has already tried it,
as there were some issues with at least btrfs and bcache earlier. I
believe they're fixed now, but as explained above, btrfs itself isn't
really entirely stable yet, so I'd definitely recommend keeping backups,
and comparing notes with others who have tried it. (I know there's some
on the list, tho they may not see this. But hopefully they'll respond to
a new thread with bcache or dmcache in the title, if you decide to go
that way.)

2) While this doesn't make a significant difference in the two-device
btrfs raid1 case, it does with three or more devices in the btrfs raid1,
and with other raid forms the difference is even stronger. I noticed you
wrote RAID1 in ALL CAPS form. Btrfs' raid implementations aren't quite
like traditional RAID, and I recall a dev (Chris Mason, actually, IIRC)
pointing out that the choice to use small-letters raidX nomenclature was
deliberate, in order to remind people that there is a difference.

Specifically for btrfs raid1, as contrasted to, for instance, md/RAID-1,
at present btrfs raid1 is always pair-mirrored, regardless of the number
of devices (above two, of course). While a three-device md/RAID-1 will
have three mirrors and a four-device md/RAID-1 will have four, simply
adding redundant mirrors while maintaining capacity (in the simple
all-the-same-size case, anyway), a three-device btrfs raid1 will have
1.5x the capacity of a two-device btrfs raid1, and a four-device btrfs
raid1 will have twice the two-device capacity, while maintaining a
constant pair-mirroring regardless of the number of devices in the btrfs
raid1.

For btrfs raid10, the pair-mirroring is there, but for odd numbers of
devices there's also a difference of uneven striping, because of the odd
one out in the mirroring and the difference in chunk size between data
and metadata chunks.

And of course there's the difference that data and metadata are treated
separately in btrfs, and don't have to have the same raid levels, nor
are they the same by default. A filesystem agnostic raid such as mdraid
or dmraid will by definition treat data and metadata alike as it won't
be able to tell the difference -- if it did it wouldn't be filesystem
agnostic.

Now that btrfs raid56 mode is basically complete with kernel 3.19, the
next thing on the raid side of the roadmap is N-way-mirroring. I'm
really looking forward to that as I really like btrfs' self-repair
capacities as well, but for me the ideal balance is three-way-mirroring,
just in case two copies fail checksum. Tho the fact of the matter is,
btrfs only now is getting to the point where a third mirror has some
reasonable chance of being useful, as until now btrfs itself was
unstable enough that the chances of it having a bug were far higher than
of both devices going bad for a checksummed block at the same time. But
btrfs really is much more stable than it was, and it's stable enough now
that the possibility of a third mirror really should start making
statistical sense pretty soon, if it doesn't already.

But given the time raid56 took, I'm not holding my breath. I guess
they'll be focused on the remaining raid56 bugs thru 3.20, and figure
it'll be at least three kernel cycles later, so second half of the year
at best, before we see N-way-mirroring in mainstream. This time next
year would actually seem more reasonable, and 2H-2016 or into 2017
wouldn't surprise me in the least, again, given the time raid56 mode
took. Hopefully it'll be there before 2018...

Tho as I said, for the two-device case, if both data and metadata are
raid1 mode, those differences can for the most part be ignored. Thus,
this point is mostly for others reading, and for you in the future
should you end up working with a btrfs raid1 with more than two devices.
I mostly mentioned it due to seeing that all-caps RAID1.
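
The two points in the message above can be modelled in a few lines of
Python. This is only a sketch of the behavior as described there, not
kernel code: the function names are hypothetical, the PIDs are made up,
and the capacity helpers assume the simple all-the-same-size case with
two copies of everything on btrfs raid1.

```python
def pick_mirror(pid, num_copies=2):
    """Model of the even/odd assignment: the reader's PID picks the copy."""
    return pid % num_copies


def btrfs_raid1_capacity(num_devices, device_size):
    """Usable space when every chunk is stored exactly twice
    (equal-size devices assumed)."""
    if num_devices < 2:
        raise ValueError("raid1 needs at least two devices")
    return num_devices * device_size / 2


def md_raid1_capacity(num_devices, device_size):
    """Usable space when every device holds a full mirror."""
    if num_devices < 2:
        raise ValueError("raid1 needs at least two devices")
    return device_size


# Three readers that all happen to have even PIDs (made-up values)
# land on the same copy, leaving the other device idle.
print([pick_mirror(pid) for pid in (1000, 1002, 1004)])

# Capacity relative to one device, for 2, 3 and 4 equal devices.
for n in (2, 3, 4):
    print(n, btrfs_raid1_capacity(n, 1.0), md_raid1_capacity(n, 1.0))
```

With equal devices the btrfs figures come out as 1.0, 1.5 and 2.0 device
sizes, matching the 1.5x and 2x ratios given above, while the md/RAID-1
figure stays at one device size.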

* Re: RAID1, SSD+non-SSD (RAID 5/6 question)
From: Ed Tomlinson @ 2015-02-07 12:42 UTC
To: Duncan; +Cc: linux-btrfs

On Saturday, February 7, 2015 1:39:07 AM EST, Duncan wrote:

> The btrfs raid1 read-mode device choice algorithm

Duncan,

Very interesting stuff on the raid1 read select alg. What changes with
raid5/6? Is that alg 'smarter'?

TIA
Ed Tomlinson

* Re: RAID1, SSD+non-SSD (RAID 5/6 question)
From: Duncan @ 2015-02-08 3:18 UTC
To: linux-btrfs

Ed Tomlinson posted on Sat, 07 Feb 2015 07:42:50 -0500 as excerpted:

> On Saturday, February 7, 2015 1:39:07 AM EST, Duncan wrote:
>
>> The btrfs raid1 read-mode device choice algorithm
>
> Very interesting stuff on the raid1 read select alg. What changes with
> raid5/6? Is that alg 'smarter'?

I don't know as much about the raid56 (5/6) mode.

What I /do/ know about it is that until the still-in-testing 3.19 kernel
and similarly "now" userspace, raid56 mode mkfs worked, and normal
runtime worked, but scrub and the various repair modes were
code-incomplete. That made it effectively an inefficient raid0 in
practice -- the parity strips were calculated and written, but the tools
weren't there to properly recover from them should it be necessary, so
from an admin perspective it was like a raid0: if a device drops out,
say bye-bye to the entire filesystem. In practice there were certain
limited recovery steps that could be taken in some circumstances, but as
they couldn't be counted on, from an admin perspective, the best policy
really was to consider it a slow raid0, as that's the risk you were
taking, running it.

The difference was that if you set it up for raid5/6, once the tools
were complete and ready, you'd effectively get a "free" redundancy
upgrade, since it was actually running that way all along, it just
couldn't be recovered as such because the recovery tools weren't done
yet.

With kernel 3.19, in theory all the btrfs raid56 mode kernel pieces are
there now, altho in practice there's still bugs being worked out, so I'd
not (bleeding-edge) trust it until 3.20 at least, and I'd hesitate to
consider it as (relatively) stable as single/dup/raid0/1/10 modes for
another couple kernels after that, simply because they've been usable
for long enough to have had quite a few more bugs found and worked out
at this point.

I'm not exactly sure what the status is on the userspace side, but I
/think/ it's there in the current v3.18.x userspace release, and should
be usable by the time the kernelspace is usable, kernel 3.20 with
userspace 3.19.

But with ~9 week release cycles and with 3.19 very close to out now, if
we take that 3.20 bleeding-edge usable in say 10 weeks from now, and
call raid56 mode reasonably stable two kernel cycles or 18 weeks later,
that puts it 28 weeks out, say 6.5 months, for reasonably stable. Which
would be late August. Of course if you're willing to take a bit more
risk, it's more like six or seven weeks, say 3.20-rc4 or so, about the
end of March.

I'd really not recommend raid56 mode until then, unless you *ARE*
treating it exactly as you would a raid0, and are willing to call the
entire filesystem a complete loss if a device drops or there's any other
serious problem with it.

As for algorithm, AFAIK, operationally btrfs raid56 mode stripes data
similar to raid0, except that one or two devices of each stripe are of
course reserved for parity. So a three-way raid5 or a four-way raid6
will have a two-way-data-stripe, while a four-way raid5 or a five-way
raid6 will have a three-way-data-stripe. Since data chunks are nominally
1 GiB and the allocator will allocate a chunk on each device, then full
available width sub-chunk stripe with raid0/5/6, in theory at least,
performance should be very similar to a conventional raid0/5/6, at least
for single thread.

Which means writes are going to be the big bottleneck, just as they are
with conventional raid5/6, since they end up being read-modify-write for
any of the strips of the stripe not yet read into cache.

FWIW I actually ran md/RAID-6 here for awhile (general
desktop/workstation use-case, tho on gentoo, so call it developer's
workstation due to the building from source), and was rather
disappointed. I found a well-optimized raid1 implementation (as
md/RAID-1 is) to be much more efficient, even with four-way-mirroring!

Tho due to btrfs raid1 mode not yet being optimized, btrfs raid56 mode
even with a reasonable write load, might well actually be competitive or
even faster, at this point. I haven't even looked to see if there's any
benchmarks on that, yet. (Despite raid56 mode repair tools not being
complete, runtime worked, so it could have been benchmarked against
raid1 mode already. I just haven't checked to see if there's actually a
report of such on the wiki or wherever.)

But back to the SSD+spinning-rust combo, I don't expect btrfs raid56
mode to do particularly well on that, either, tho at least you wouldn't
have the potential worst-case of all reads getting assigned to the
spinning rust, as could well happen with btrfs' unoptimized raid1 mode,
at this point. Intuitively, I'd predict that read thruput would be
similar to that of reading just the spinning-rust share off the
spinning-rust device. IOW, when reading from both, the SSD would be done
so fast it wouldn't even show up in the results, while the speed of the
spinning rust would be what you'd be getting for data read off of it, so
where half the data is on spinning rust and half on ssd, you'd
effectively get twice the speed you'd get if it were all on spinning
rust, because half would show up at spinning rust speed, while the other
half would already be there by the time the spinning rust side finished.
But that's simply intuition, and simple intuition could be quite wrong.
You could of course test it.

The ideal, if you don't want to deal with a cache layer, as I didn't,
would be to simply declare the money to put it all on SSD worth it, and
just do that. Two SSDs in btrfs raid1 mode. That's actually what I'm
running here, tho I don't like all my data eggs in the same filesystem
basket, so I actually have both SSDs partitioned up similarly, and am
running multiple smaller independent btrfs, all (but for /boot) being
btrfs raid1, with each of the two devices for each btrfs raid1 being a
partition on one of the SSDs. That actually works quite well and I've
been very happy with it. =:^)

Particularly when doing a full balance/scrub/check on a filesystem takes
under 10 minutes, with some of them a minute or less, both because of
the speed of the SSDs, and because the filesystems are all under 50 GiB
each. It's **MUCH** easier to work with such filesystems when a scrub or
balance doesn't take the **DAYS** people often report for their
multi-terabyte spinning-rust based filesystems!
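
The stripe-width rule and the read-thruput intuition from the message
above can be checked with a short Python sketch. It is only a model
under stated assumptions: the function names are hypothetical, the two
devices are assumed to be read fully in parallel, and the 100 MB/s and
500 MB/s rates are made-up round numbers, not measurements.

```python
PARITY_STRIPS = {"raid5": 1, "raid6": 2}


def data_strips(profile, num_devices):
    """Data strips per stripe: device count minus the parity strips."""
    parity = PARITY_STRIPS[profile]
    if num_devices <= parity:
        raise ValueError("not enough devices for " + profile)
    return num_devices - parity


def combined_read_rate(total_mb, hdd_fraction, hdd_mb_s, ssd_mb_s):
    """Effective MB/s when both devices are read in parallel and the
    read finishes only when the slower side is done."""
    hdd_time = total_mb * hdd_fraction / hdd_mb_s
    ssd_time = total_mb * (1 - hdd_fraction) / ssd_mb_s
    return total_mb / max(hdd_time, ssd_time)


# Three-way raid5 and four-way raid6 both give a two-way data stripe.
print(data_strips("raid5", 3), data_strips("raid6", 4))

# Half the data on a 100 MB/s disk, half on a 500 MB/s SSD (assumed rates).
print(combined_read_rate(1000, 0.5, 100, 500))
```

Under these assumptions the effective rate comes out at twice the
spinning-rust rate, which is the intuition stated above. Real results
depend on seek patterns and queueing, which this model ignores.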

* Re: RAID1, SSD+non-SSD
From: Brian B @ 2015-02-08 2:41 UTC
To: linux-btrfs

On Sat, Feb 7, 2015 at 1:39 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> 1) The btrfs raid1 read-mode device choice algorithm is known to be
> sub-optimal, and the plan is to change and optimize it in the longer
> term.
> [...]

Thanks, very informative about the read alg. Sounds like it makes more
sense to simply do backups to the slower drive and manually restore from
those if I ever have a checksum error.

My main goal here was protection from undetectable sector corruption
("bitrot" etc.) without having to halve my SSD, but on btrfs I suppose
it's impossible for bitrot errors to creep into backups, because I'd get
a checksum error before that happened, right? Then I could just restore
it from a previous backup.
* Re: RAID1, SSD+non-SSD
2015-02-08 2:41 ` RAID1, SSD+non-SSD Brian B
@ 2015-02-08 3:51 ` Duncan
0 siblings, 0 replies; 10+ messages in thread
From: Duncan @ 2015-02-08 3:51 UTC (permalink / raw)
To: linux-btrfs

Brian B posted on Sat, 07 Feb 2015 21:41:08 -0500 as excerpted:

> Sounds like it makes more
> sense to simply do backups to the slower drive and manually restore from
> those if I ever have a checksum error.
>
> My main goal here was protection from undetectable sector corruption
> ("bitrot" etc.) without having to halve my SSD, but on btrfs I suppose
> it's impossible for bitrot errors to creep into backups, because I'd get
> a checksum error before that happened right? Then I could just restore
> it from a previous backup.

Well, /as/ the bitrot happened, but you couldn't really do a backup of
the bitrotted file, because you're correct, you'd get checksum errors due
to the bitrot and btrfs wouldn't even let you access the file to back it
up again, which in turn would mean it's time to restore at least that
file from backup...

So yes, that plan does make sense to me. =:^)

BTW, it's worth noting the btrfs send/receive feature. If both the ssd
and the spinning rust backup are btrfs, send/receive should be an
extremely efficient way to do the backups. =:^)

Tho it may be worth keeping a more conventionally maintained second-level
backup that's /not/ on btrfs as well, depending on how critical you
consider that data. While btrfs is stabilizing reasonably well now, it's
not entirely stable yet and probably won't be for, let's say another
year, and at least here, I really do sleep better knowing I have a
non-btrfs backup available as well. You could manually checksum it,
either in whole or in part, to be sure of detecting rot there, tho I've
not done so here, figuring if I could survive decades without it before
btrfs, I can survive another few years with it as a second backup.

Given the cost of SSD vs. spinning-rust, if all your data fits on the
SSD, you should be able to do multiple levels of backup on spinning rust
without breaking the bank.

(FWIW, altho as I mentioned earlier I have dual SSD btrfs raid1, I do
still keep my media on spinning rust, NOT on SSD. So I can't say all my
data fits on SSD, here, or rather, it might, but that's not how I've set
it up. But as it happens the media files are both larger and less
critical in terms of access speed, so spinning rust for them actually
works out very well for me.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

^ permalink raw reply [flat|nested] 10+ messages in thread
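The suggestion above to manually checksum a non-btrfs backup could look
something like the following. It is a minimal sketch and not something
taken from the thread: the choice of SHA-256 and the function names
file_sha256, build_manifest and changed_files are assumptions made for
illustration. The manifest is a plain dict mapping each file's relative
path to its digest, built right after a backup run and compared against
the backup later.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def changed_files(root, manifest):
    """Return relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != expected:
            bad.append(rel)
    return bad
```

The manifest can be saved with json.dump and is best kept somewhere
other than the backup disk itself. A mismatch only says that the content
changed since the manifest was built; whether that is rot or a
legitimate update has to be judged from when the backup was last
written.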
* Re: RAID1, SSD+non-SSD
2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
2015-02-07 0:23 ` Chris Murphy
2015-02-07 6:39 ` Duncan
@ 2015-02-07 17:28 ` Kyle Manna
2 siblings, 0 replies; 10+ messages in thread
From: Kyle Manna @ 2015-02-07 17:28 UTC (permalink / raw)
To: Brian B, linux-btrfs

On Fri Feb 06 2015 at 12:06:33 PM Brian B <canis8585@gmail.com> wrote:
>
> My laptop has two disks, a SSD and a traditional magnetic disk. I plan
> to make a partition on the mag disk equal in size the SSD and set up
> BTRFS RAID1. This I know how to do.
>
> The only reason I'm doing the RAID1 is for the self-healing. I realize
> writing large amounts of data will be slower than the SSD alone, but
> is it possible to set it up to only read from the magnetic drive if
> there's an error reading from the SSD?
>
> In other words, is there a way to tell it to only read from the faster
> disk? Is that even necessary? Is there a better way to accomplish
> this?

What you may want to look at is lvmcache + btrfs.

I've played with lvmcache (using ext4 on top) and btrfs independently,
but not together. Too many new technologies at the same time for my
taste. :)

The best documentation I've found on lvm cache is the man page:
http://man7.org/linux/man-pages/man7/lvmcache.7.html

LVM cache uses dm-cache behind the scenes and makes it much more
manageable (i.e. construction, manipulation, and teardown of devices).

An lvm cache won't help with redundancy; the blocks will exist on either
the caching device or the slower device. To remove the cache, you can
force a flush of the cached blocks out to the traditional HDD and then
use it without the cache, without having to recreate the file system.

^ permalink raw reply [flat|nested] 10+ messages in thread
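As a rough illustration of the construction and teardown steps mentioned
above, the following Python sketch only builds the LVM command lines and
prints them; it does not run anything. The command forms follow
lvmcache(7) as I understand it and should be checked against the man
page for the installed LVM version. The volume group, LV, device and
pool names are all hypothetical, and the sketch assumes the SSD
partition has already been added as a physical volume to the same
volume group as the slow LV.

```python
import shlex


def lvmcache_attach_commands(vg, origin_lv, fast_pv, cache_size, pool_name="cpool"):
    """Build the commands that put a cache pool in front of origin_lv."""
    return [
        # Create the cache pool on the fast (SSD) physical volume.
        ["lvcreate", "--type", "cache-pool", "-L", cache_size,
         "-n", pool_name, vg, fast_pv],
        # Attach the pool to the existing slow LV.
        ["lvconvert", "--type", "cache",
         "--cachepool", f"{vg}/{pool_name}", f"{vg}/{origin_lv}"],
    ]


def lvmcache_detach_command(vg, origin_lv):
    """Build the command that flushes the cache and removes the pool."""
    return ["lvconvert", "--uncache", f"{vg}/{origin_lv}"]


def render(argv):
    """Render an argument list as a shell-safe string for review."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical names: volume group "vg0", slow LV "data", SSD PV /dev/sda3.
    for argv in lvmcache_attach_commands("vg0", "data", "/dev/sda3", "20G"):
        print(render(argv))
    print(render(lvmcache_detach_command("vg0", "data")))
```

Printing rather than executing is deliberate here, since these commands
change on-disk LVM metadata and are worth reviewing by hand first. The
filesystem on the cached LV is not touched by either step, which matches
the point above that the cache can be removed without recreating it.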
end of thread, other threads:[~2015-02-08 3:52 UTC | newest]
Thread overview: 10+ messages
2015-02-06 20:01 RAID1, SSD+non-SSD Brian B
2015-02-07 0:23 ` Chris Murphy
2015-02-07 18:06 ` Kai Krakow
2015-02-08 3:31 ` Duncan
2015-02-07 6:39 ` Duncan
2015-02-07 12:42 ` RAID1, SSD+non-SSD (RAID 5/6 question) Ed Tomlinson
2015-02-08 3:18 ` Duncan
2015-02-08 2:41 ` RAID1, SSD+non-SSD Brian B
2015-02-08 3:51 ` Duncan
2015-02-07 17:28 ` Kyle Manna