* RAID1 3+ drives
@ 2014-06-28 0:30 Zack Coffey
  2014-06-28 0:51 ` Russell Coker
  0 siblings, 1 reply; 11+ messages in thread
From: Zack Coffey @ 2014-06-28 0:30 UTC (permalink / raw)
To: linux-btrfs

Can I get more protection by using more than 2 drives?

I had an onboard RAID a few years back that would let me use RAID1
across up to 4 drives.

Apologies if this has been covered already; I don't recall seeing
anything saying yea or nay.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: RAID1 3+ drives
From: Russell Coker @ 2014-06-28 0:51 UTC
To: Zack Coffey; +Cc: linux-btrfs

On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> Can I get more protection by using more than 2 drives?
>
> I had an onboard RAID a few years back that would let me use RAID1
> across up to 4 drives.

Currently the only RAID level that fully works in BTRFS is RAID-1 with
data on 2 disks. If you have 4 disks in the array then each block will be
on 2 of the disks.

RAID-5/6 code mostly works, but the last report I read indicated that some
situations for recovery and disk replacement didn't work. Presumably anyone
who's afraid of multiple disks failing isn't going to want to trust BTRFS
RAID-6 code at the moment.

If you want to have 4 disks in a fully redundant configuration (i.e. you
could lose 3 disks without losing any data) then the thing to do is to have
2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1 on top
of that.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
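The difference between the two layouts described in the message above can be checked by brute force. What follows is a minimal Python sketch, not anything btrfs or md actually runs. The disk names, the function names, and the worst-case assumption that every pair of disks in a 4-disk btrfs raid1 ends up holding at least one chunk are all illustrative assumptions.

```python
from itertools import combinations

DISKS = ("a", "b", "c", "d")  # hypothetical device names


def btrfs_raid1_copies(disks):
    """Assumed worst case for btrfs raid1: every pair of disks holds some chunk."""
    return [frozenset(pair) for pair in combinations(disks, 2)]


def md_pairs_under_btrfs_raid1(disks):
    """Two md RAID-1 pairs with btrfs raid1 on top: each block sits on all four disks."""
    return [frozenset(disks)]


def max_failures_tolerated(disks, placements):
    """Largest k such that ANY k failed disks still leave every block one live copy."""
    tolerated = 0
    for k in range(1, len(disks)):
        for failed in combinations(disks, k):
            # A block is lost when all disks holding its copies have failed.
            if any(copies <= set(failed) for copies in placements):
                return tolerated
        tolerated = k
    return tolerated


if __name__ == "__main__":
    print(max_failures_tolerated(DISKS, btrfs_raid1_copies(DISKS)))          # 1
    print(max_failures_tolerated(DISKS, md_pairs_under_btrfs_raid1(DISKS)))  # 3
```

Under these assumptions plain btrfs raid1 on four disks is only guaranteed to survive one arbitrary disk failure, while the stacked layout survives any three.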
* Re: RAID1 3+ drives 2014-06-28 0:51 ` Russell Coker @ 2014-06-28 4:26 ` Duncan 2014-06-28 6:28 ` Russell Coker 2014-06-28 10:13 ` Roman Mamedov 0 siblings, 2 replies; 11+ messages in thread From: Duncan @ 2014-06-28 4:26 UTC (permalink / raw) To: linux-btrfs Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted: > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote: >> Can I get more protection by using more than 2 drives? >> >> I had an onboard RAID a few years back that would let me use RAID1 >> across up to 4 drives. >> > Currently the only RAID level that fully works in BTRFS is RAID-1 with > data on 2 disks. Not /quite/ correct. Raid0 works, but of course that isn't exactly "RAID" as it's not "redundant". And raid10 works. But that's simply raid0 over raid1. So depending on whether you consider raid0 actually "RAID" or not, which in turn depends on how strict you are with the "redundant" part, there is or is not more than btrfs raid1 working. > If you have 4 disks in the array then each block will > be on 2 of the disks. Correct. FWIW I'm told that the paper that laid out the original definition of RAID (which was linked on this list in a similar discussion some months ago) defined RAID-1 as paired redundancy, no matter the number of devices. Various implementations (including Linux' own mdraid soft-raid, and I believe dmraid as well) feature multi-way-mirroring aka N-way- mirroring such that N devices equals N way mirroring, but that's an implementation extension and isn't actually necessary to claim RAID-1 support. So look for N-way-mirroring when you go RAID shopping, and no, btrfs does not have it at this time, altho it is roadmapped for implementation after completion of the raid5/6 code. FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for device redundancy, but to take full advantage of btrfs data integrity features, allowing to "scrub" a checksum-mismatch copy with the content of a checksum-validated copy if available. 
That's currently possible, but due to the pair-mirroring-only restriction, there's only one additional copy, and if it happens to be bad as well, there's no possibility of a third copy to scrub from. As it happens my personal sweet-spot between cost/performance and reliability would be 3-way mirroring, but once they code beyond N=2, N should go unlimited, so N=3, N=4, N=50 if you have a way to hook them all up... should all be possible. But... > RAID-5/6 code mostly works but the last report I > read indicated that some situations for recovery and disk replacement > didn't work - presumably anyone who's afraid of multiple disks failing > isn't going to want to trust BTRFS RAID-6 code at the moment. The raid5/6 code was on the list to be introduced in the next kernel or two something like two years ago, when I originally looked into it, and likely before that. Like many of the btrfs features, it actually took rather longer to cook than was in the original plan -- it's actually rather more complicated than anticipated, and additionally it has been put off a few times to work on bugfixing currently supported feature bugs. An incomplete raid56 implementation, normal runtime but not scrub or recovery, was introduced several kernels ago now, but it's still not complete. So N-way-mirroring, which is supposed to build on several bits of the raid5/6 implementation and therefore is roadmapped for after it, continues to look about the same 3-5 kernels off, after raid5/6, as it did two years ago. Except, having seen the raid5/6 timing, and having looked back at btrfs feature history going back rather longer, even if raid5/6 was declared finished for kernel 3.17 (since 3.16 is past the commit window), I'd guess it'd probably take another five kernels (a year's worth) or so, at /least/, for N-way-mirroring to properly cook. 
So in actuality I'd be surprised to see any N-way-mirroring code at all before next spring, and would /not/ be surprised at all to see it take all of next year to fully cook to "completion". Not that I'm complaining /too/ much. We work with what we have and btrfs as it is is quite beyond the features of most filesystems (just the data integrity and multi-device filesystem stuff at all, is great to work with, besides the stuff like subvolumes and snapshotting that doesn't fit my use-case that well =:^), even if it /is/ all presently limited to two- way-mirroring! =:^\ ). But it will sure be nice when I /can/ count on that third copy to scrub two bad copies, if two copies /do/ happen to be bad. > If you want to have 4 disks in a fully redundant configuration (IE you > could lose 3 disks without losing any data) then the thing to do is to > have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1 > on top of that. The caveat with that is that at least mdraid1/dmraid1 has no verified data integrity, and while mdraid5/6 does have 1/2-way-parity calculation, it's only used in recovery, NOT cross-verified in ordinary use. So it's not a proper substitute, tho I guess some big-money hardware raids might do it. In fact, with md/dmraid and its reasonable possibility of silent corruption since at that level any of the copies could be returned and there's no data integrity checking, if whatever md/dmraid level copy /is/ returned ends up being bad, then btrfs will consider that side of the pair bad, without any way to check additional copies at the underlying md/ dmraid level. Effectively you only have two verified copies no matter how many ways the dm/mdraid level is mirrored, since there's no verification at the dm/mdraid level at all. Tho if you ran a md/dmraid level scrub often enough, and then ran a btrfs scrub on top, one could be /reasonably/ assured of freedom from lower level corruption. 
But with both levels of scrub together very possibly taking a couple days, and various ongoing write activity in the mean time, by the time one run was done it'd be time to start the next one, so you'd effectively be running scrub at one level or the other *ALL* the time! So... I'd suggest either forgetting about data integrity for the time being and just running md/dmraid without worrying about it, or just running btrfs with pairs, and backing up to another btrfs of pairs. Btrfs send/receive could even be used as the primary syncing method between the main and backup set, altho I'd suggest having a fallback such as rsync setup and tested to work as well, in case there's a bug in send/ receive that stalls that method for awhile. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 11+ messages in thread
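The suggestion above (btrfs send/receive as the primary sync method, with a tested rsync fallback) could be wired together roughly as follows. This is only a sketch under stated assumptions: all paths and function names are hypothetical, the source snapshot is assumed to be a read-only btrfs snapshot, and error handling is reduced to "any non-zero exit status triggers the fallback".

```python
import subprocess


def send_receive_cmds(snapshot, dest_mount, parent=None):
    """argv lists for `btrfs send [-p parent] snapshot | btrfs receive dest_mount`."""
    send = ["btrfs", "send"]
    if parent is not None:
        send += ["-p", parent]  # incremental stream relative to an earlier snapshot
    send.append(snapshot)
    return send, ["btrfs", "receive", dest_mount]


def rsync_cmd(src_dir, dest_dir):
    """Fallback copy; the trailing slash makes rsync copy the directory contents."""
    return ["rsync", "-a", "--delete", src_dir.rstrip("/") + "/", dest_dir]


def sync_backup(snapshot, dest_mount, fallback_dest, parent=None):
    """Try send/receive first; fall back to rsync if either half of the pipe fails."""
    send, receive = send_receive_cmds(snapshot, dest_mount, parent)
    sender = subprocess.Popen(send, stdout=subprocess.PIPE)
    receiver = subprocess.run(receive, stdin=sender.stdout)
    sender.stdout.close()
    if sender.wait() == 0 and receiver.returncode == 0:
        return "send/receive"
    subprocess.run(rsync_cmd(snapshot, fallback_dest), check=True)
    return "rsync"
```

A call such as `sync_backup("/mnt/main/snap-new", "/mnt/backup", "/mnt/backup/rsync-copy", parent="/mnt/main/snap-old")` (hypothetical paths) would need root privileges and real btrfs filesystems, so the fallback path is worth exercising by hand before relying on it.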
* Re: RAID1 3+ drives 2014-06-28 4:26 ` Duncan @ 2014-06-28 6:28 ` Russell Coker 2014-06-28 7:38 ` Martin Steigerwald ` (2 more replies) 2014-06-28 10:13 ` Roman Mamedov 1 sibling, 3 replies; 11+ messages in thread From: Russell Coker @ 2014-06-28 6:28 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs On Sat, 28 Jun 2014 04:26:43 Duncan wrote: > Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted: > > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote: > >> Can I get more protection by using more than 2 drives? > >> > >> I had an onboard RAID a few years back that would let me use RAID1 > >> across up to 4 drives. > > > > Currently the only RAID level that fully works in BTRFS is RAID-1 with > > data on 2 disks. > > Not /quite/ correct. Raid0 works, but of course that isn't exactly > "RAID" as it's not "redundant". And raid10 works. But that's simply > raid0 over raid1. So depending on whether you consider raid0 actually http://en.wikipedia.org/wiki/Linux_MD_RAID_10#LINUX-MD-RAID-10 There are a number of ways of doing RAID-0 over RAID-1, but BTRFS doesn't do any of them. When you have more than 2 disks and tell BTRFS to do RAID-1 you get a result that might be somewhat comparable to Linux software RAID-10, except for the issue of having disks of different sizes and adding more disks after creating the "RAID". > "RAID" or not, which in turn depends on how strict you are with the > "redundant" part, there is or is not more than btrfs raid1 working. The way BTRFS, ZFS, and WAFL work is quite different to anything described in any of the original papers on RAID. One could make a case that what these filesystems do shouldn't be called RAID, but then we would be searching for another term for it. > > If you have 4 disks in the array then each block will > > be on 2 of the disks. > > Correct. 
> > FWIW I'm told that the paper that laid out the original definition of > RAID (which was linked on this list in a similar discussion some months > ago) defined RAID-1 as paired redundancy, no matter the number of > devices. Various implementations (including Linux' own mdraid soft-raid, > and I believe dmraid as well) feature multi-way-mirroring aka N-way- > mirroring such that N devices equals N way mirroring, but that's an > implementation extension and isn't actually necessary to claim RAID-1 > support. The paper is a little ambiguous as to whether a 3 disk mirror can be RAID-1. > So look for N-way-mirroring when you go RAID shopping, and no, btrfs does > not have it at this time, altho it is roadmapped for implementation after > completion of the raid5/6 code. > > FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for > device redundancy, but to take full advantage of btrfs data integrity > features, allowing to "scrub" a checksum-mismatch copy with the content > of a checksum-validated copy if available. That's currently possible, > but due to the pair-mirroring-only restriction, there's only one > additional copy, and if it happens to be bad as well, there's no > possibility of a third copy to scrub from. As it happens my personal > sweet-spot between cost/performance and reliability would be 3-way > mirroring, but once they code beyond N=2, N should go unlimited, so N=3, > N=4, N=50 if you have a way to hook them all up... should all be possible. What I want is the ZFS copies= feature. > > If you want to have 4 disks in a fully redundant configuration (IE you > > could lose 3 disks without losing any data) then the thing to do is to > > have 2 RAID-1 arrays with Linux software RAID and then run BTRFS RAID-1 > > on top of that. > > The caveat with that is that at least mdraid1/dmraid1 has no verified > data integrity, and while mdraid5/6 does have 1/2-way-parity calculation, > it's only used in recovery, NOT cross-verified in ordinary use. 
Linux Software RAID-6 only uses the parity when you have a hard read error. If you have a disk return bad data and say it's good then you just lose. That said the rate of disks returning such bad data is very low. If you had a hypothetical array of 4 disks as I suggested then to lose data you need to have one pair of disks entirely fail and another disk return corrupt data or have 2 disks in separate RAID-1 pairs return corrupt data on matching sectors (according to BTRFS data copies) such that Linux software RAID copies the corrupt data to the good disk. That sort of thing is much less likely than having a regular BTRFS RAID-1 array of 2 disks failing. Also if you were REALLY paranoid you could have 2 BTRFS RAID-1 filesystems that each contain a single large file. Those 2 large files could be run via losetup and used for another BTRFS RAID-1 filesystem. That gets you redundancy at both levels. Of course if you had 2 disks in one pair fail then the loopback BTRFS filesystem would still be OK. How does the BTRFS kernel code handle a loopback device read failure? > In fact, with md/dmraid and its reasonable possibility of silent > corruption since at that level any of the copies could be returned and > there's no data integrity checking, if whatever md/dmraid level copy /is/ > returned ends up being bad, then btrfs will consider that side of the > pair bad, without any way to check additional copies at the underlying md/ > dmraid level. Effectively you only have two verified copies no matter > how many ways the dm/mdraid level is mirrored, since there's no > verification at the dm/mdraid level at all. BTRFS doesn't consider a side of the pair to be bad, just the block that was read. Usually disk corruption is in the order of dozens of blocks and the rest of the disk will be good. > Tho if you ran a md/dmraid level scrub often enough, and then ran a btrfs > scrub on top, one could be /reasonably/ assured of freedom from lower > level corruption. Not at all. 
Linux software RAID scrub will copy data from one disk to the other. It may copy from the good disk to the bad or from the bad disk to the good - and it won't know which it's doing. Also last time I checked a scrub of Linux software RAID-1 still reported large multiples of 128 sectors mismatching in normal operation. So you won't even know if a disk is returning bogus data unless the bad data is copied to the good disk and exposed to BTRFS. > But with both levels of scrub together very possibly > taking a couple days, and various ongoing write activity in the mean > time, by the time one run was done it'd be time to start the next one, so > you'd effectively be running scrub at one level or the other *ALL* the > time! No. I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub every Sunday night. If I had an array of 4 disks then I could do scrubs on Saturday night as well. > So... I'd suggest either forgetting about data integrity for the time > being and just running md/dmraid without worrying about it, or just > running btrfs with pairs, and backing up to another btrfs of pairs. > Btrfs send/receive could even be used as the primary syncing method > between the main and backup set, altho I'd suggest having a fallback such > as rsync setup and tested to work as well, in case there's a bug in send/ > receive that stalls that method for awhile. One advantage of BTRFS backup is that you know if the data is corrupt. If you make several backups that end up with different blocks on disk then Linux knows which one has the correct file data. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 11+ messages in thread
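Whether a weekly scrub window is enough comes down to simple arithmetic. Here is a rough Python sketch, assuming a sustained read rate of 100 MB/s (an invented figure, not one from the thread), that a btrfs scrub reads only the used part of each device, and that an md check reads the whole device regardless of how full the filesystem is. The disk size and fill level are the ones quoted in the message above.

```python
def read_hours(terabytes, mb_per_s):
    """Hours needed to read `terabytes` (decimal TB) at a steady rate."""
    return terabytes * 1e12 / (mb_per_s * 1e6) / 3600


DISK_TB = 3.0      # from the thread: 3TB disks
FILL = 2.0 / 3.0   # from the thread: array is 2/3 full
RATE_MB_S = 100.0  # assumption: sustained sequential read rate

btrfs_scrub = read_hours(DISK_TB * FILL, RATE_MB_S)  # used data only
md_check = read_hours(DISK_TB, RATE_MB_S)            # whole device

if __name__ == "__main__":
    print(f"btrfs scrub: ~{btrfs_scrub:.1f} h per device")
    print(f"md check:    ~{md_check:.1f} h per device")
    print(f"both levels: ~{btrfs_scrub + md_check:.1f} h")
```

With these numbers a single level fits into one night, while scrubbing both levels back to back takes most of a day. A slower or busier array shifts the result accordingly.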
* Re: RAID1 3+ drives
From: Martin Steigerwald @ 2014-06-28 7:38 UTC
To: russell; +Cc: Duncan, linux-btrfs

On Saturday, 28 June 2014, 16:28:23, Russell Coker wrote:
> > So look for N-way-mirroring when you go RAID shopping, and no, btrfs does
> > not have it at this time, altho it is roadmapped for implementation after
> > completion of the raid5/6 code.
> >
> > FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for
> > device redundancy, but to take full advantage of btrfs data integrity
> > features, allowing to "scrub" a checksum-mismatch copy with the content
> > of a checksum-validated copy if available. That's currently possible,
> > but due to the pair-mirroring-only restriction, there's only one
> > additional copy, and if it happens to be bad as well, there's no
> > possibility of a third copy to scrub from. As it happens my personal
> > sweet-spot between cost/performance and reliability would be 3-way
> > mirroring, but once they code beyond N=2, N should go unlimited, so N=3,
> > N=4, N=50 if you have a way to hook them all up... should all be possible.
>
> What I want is the ZFS copies= feature.

Something like this, even more flexible, was planned to be added. There was
some discussion on how to specify complex redundancy patterns totally
flexibly, with exactly how much redundancy, how many spares and so on.

I haven't read anything about this in a long time. I wonder what happened
to this idea.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: RAID1 3+ drives
From: Hugo Mills @ 2014-06-28 7:43 UTC
To: Martin Steigerwald; +Cc: russell, Duncan, linux-btrfs

On Sat, Jun 28, 2014 at 09:38:00AM +0200, Martin Steigerwald wrote:
> On Saturday, 28 June 2014, 16:28:23, Russell Coker wrote:
> > > So look for N-way-mirroring when you go RAID shopping, and no, btrfs does
> > > not have it at this time, altho it is roadmapped for implementation after
> > > completion of the raid5/6 code.
> > >
> > > FWIW, N-way-mirroring is my #1 btrfs wish-list item too, not just for
> > > device redundancy, but to take full advantage of btrfs data integrity
> > > features, allowing to "scrub" a checksum-mismatch copy with the content
> > > of a checksum-validated copy if available. That's currently possible,
> > > but due to the pair-mirroring-only restriction, there's only one
> > > additional copy, and if it happens to be bad as well, there's no
> > > possibility of a third copy to scrub from. As it happens my personal
> > > sweet-spot between cost/performance and reliability would be 3-way
> > > mirroring, but once they code beyond N=2, N should go unlimited, so N=3,
> > > N=4, N=50 if you have a way to hook them all up... should all be possible.
> >
> > What I want is the ZFS copies= feature.
>
> Something like this, even more flexible, was planned to be added. There was
> some discussion on how to specify complex redundancy patterns totally
> flexibly, with exactly how much redundancy, how many spares and so on.
>
> I haven't read anything about this in a long time. I wonder what happened
> to this idea.

It's moving slowly in fits and starts. I haven't forgotten it.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- But people have always eaten people, / what else is there to ---
eat? / If the Juju had meant us not to eat people / he
wouldn't have made us of meat.
* Re: RAID1 3+ drives 2014-06-28 6:28 ` Russell Coker 2014-06-28 7:38 ` Martin Steigerwald @ 2014-06-28 11:38 ` Duncan 2014-06-28 13:40 ` Russell Coker 2014-06-28 18:15 ` Chris Murphy 2 siblings, 1 reply; 11+ messages in thread From: Duncan @ 2014-06-28 11:38 UTC (permalink / raw) To: linux-btrfs Russell Coker posted on Sat, 28 Jun 2014 16:28:23 +1000 as excerpted: > On Sat, 28 Jun 2014 04:26:43 Duncan wrote: >> Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted: >> > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote: >> >> Can I get more protection by using more than 2 drives? >> >> >> >> I had an onboard RAID a few years back that would let me use RAID1 >> >> across up to 4 drives. >> > >> > Currently the only RAID level that fully works in BTRFS is RAID-1 >> > with data on 2 disks. >> >> Not /quite/ correct. Raid0 works, but of course that isn't exactly >> "RAID" as it's not "redundant". And raid10 works. But that's simply >> raid0 over raid1. So depending on whether you consider raid0 actually > > http://en.wikipedia.org/wiki/Linux_MD_RAID_10#LINUX-MD-RAID-10 > > There are a number of ways of doing RAID-0 over RAID-1, Yes... > but BTRFS doesn't do any of them. It does... > When you have more than 2 disks and > tell BTRFS to do RAID-1 you get a result that might be somewhat > comparable to Linux software RAID-10, except for the issue of having > disks of different sizes and adding more disks after creating the > "RAID". What about when you tell btrfs to do raid10? Unless you're going to argue that btrfs raid10 mode isn't "real" raid10, or that like raid5/6 it's not complete, but you haven't mentioned it at all, so that doesn't seem to be what you're saying. Which was my point when I mentioned raid10 in the first place, it's there, and unlike raid5/6, I've never seen any indication that it's not complete or supported. 
>> "RAID" or not, which in turn depends on how strict you are with the >> "redundant" part, there is or is not more than btrfs raid1 working. > > The way BTRFS, ZFS, and WAFL work is quite different to anything > described in any of the original papers on RAID. One could make a case > that what these filesystems do shouldn't be called RAID, but then we > would be searching for another term for it. The FAQ admits that some people call it a layering violation... =8^0 Which in a way it is, as it combines a below-filesystem virtual device layer (where raid is normally found) with the filesystem layer. But the argument is, it's a /useful/ layering violation. Which it is, as that's what gives btrfs the ability to do what it does with some of its features. But the flip side of that is that since it includes so much that is normally strictly isolated into other layers, it's intensely complex, far more so than most other filesystems, which is why it's taking so horribly long to introduce some of these features, and why some of the scaling bugs in particular have been so nasty -- it's just /dealing/ with that much more than the ordinary filesystem. The nearest competitor that I'm aware of is zfs. But (1) zfs made some compromises that btrfs is trying to avoid, and (2) AFAIK, zfs had a LOT more real resources sunk into it. I'm sure there's people that know way more about its development than I do. And of course zfs isn't GPLv2 compatible, the reason it'll never be mainline Linux unless the zfs owners wish it so, but it's very obvious they wish it NOT so, which is why it remains as it is. That's not important to everyone, but it's a big reason I can't/won't seriously consider zfs here. > What I want is the ZFS copies= feature. As others have mentioned, the discussed idea is multi-axis configurability, N-mirror, S-stripe, P-parity (tho I don't believe that's the letters used). It's possible strip-size could be added to that as well. 
Hugo is the guy that has been working most directly on defining that. *BUT*, at this point that's all pie-in-the-sky for btrfs, while I guess zfs copies= "just works". If the licensing issues weren't there, I imagine I'd be using zfs today, and if btrfs took another decade or whatever to mature, no big deal. But the licensing issues are there and zfs is thus not an option for me, so... as I said earlier, we work with what we have. >> The caveat with that is that at least mdraid1/dmraid1 has no verified >> data integrity, and while mdraid5/6 does have 1/2-way-parity >> calculation, it's only used in recovery, NOT cross-verified in ordinary >> use. > > Linux Software RAID-6 only uses the parity when you have a hard read > error. If you have a disk return bad data and say it's good then you > just lose. Which is basically restating what I was saying. > That said the rate of disks returning such bad data is very low. If you > had a hypothetical array of 4 disks as I suggested then to lose data you > need to have one pair of disks entirely fail and another disk return > corrupt data or have 2 disks in separate RAID-1 pairs return corrupt > data on matching sectors (according to BTRFS data copies) such that > Linux software RAID copies the corrupt data to the good disk. Well, it's a bit more complex than that, and the details can definitely come back to bite you in certain corner cases, but I agree with the general idea. > That sort of thing is much less likely than having a regular BTRFS > RAID-1 array of 2 disks failing. The problem is that there's little or no control of it at the mdraid level. In md/raid1 mode, a "scrub" simply copies the data, good or bad, from the first device to the others. There's no data integrity checking and not even a majority vote, it simply dumbly copies what's on one device to the others, as long as what's on the first device is readable at all. 
In theory raid6 with its two-way-parity could be better, since it /does/
have the two-way-parity data it /could/ check, but the frustrating part of
it is that it /doesn't/! It only reads the data strip, not the entire
stripe, and doesn't do any cross-checking unless it has to make up for a
dropped device.

And with the size of disks we have today, the statistics on multiple whole
device reliability are NOT good to us! There's a VERY REAL chance, even
likelihood, that at least one block on the device is going to be bad, and
not be caught by its own error detection! There's some serious study and
work going into this, and it's why people working on modern filesystems are
pretty much all adding data integrity features, etc. Btrfs and zfs aren't
alone in that. And it's really because there's no choice. As TB scale to
PB, the chances are that there /will/ be one or possibly more
device-undetected errors somewhere on that device. One in a billion or
whatever (IDR the real number and I'm too lazy to do the math ATM) chance,
but once you have numbers nearing a billion...

> Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
> filesystems that each contain a single large file. Those 2 large files
> could be run via losetup and used for another BTRFS RAID-1 filesystem.
> That gets you redundancy at both levels. Of course if you had 2 disks
> in one pair fail then the loopback BTRFS filesystem would still be OK.

But the COW and fragmentation issues on the bottom level... OUCH! And you
can't simply set NOCOW, because that turns off the checksumming as well,
leaving you right back where you were without the integrity checking!

IOW, it might work for filesystems up to a quarter TiB or so, but don't
expect it to scale to TiB plus without getting MASSIVELY slow. I used to
mention that theoretical option too, but once I saw the problems btrfs has
with fragmentation on internal-write files, which is what loop-file would
be...
let's just say when I thought about mentioning it I shuddered and decided
to forget I even considered it. Tho for the sub-100-GiB filesystems I'm
dealing with here, on fast SSD with near 100% over-provisioning (hey, the
size I wanted wasn't available at a good price so I took what I could get,
and the overprovisioning certainly doesn't hurt!), it might actually be
somewhat practical...

> How does the BTRFS kernel code handle a loopback device read failure?
>
>> In fact, with md/dmraid and its reasonable possibility of silent
>> corruption since at that level any of the copies could be returned and
>> there's no data integrity checking, if whatever md/dmraid level copy
>> /is/ returned ends up being bad, then btrfs will consider that side of
>> the pair bad, without any way to check additional copies at the
>> underlying md/dmraid level. Effectively you only have two verified
>> copies no matter how many ways the dm/mdraid level is mirrored, since
>> there's no verification at the dm/mdraid level at all.
>
> BTRFS doesn't consider a side of the pair to be bad, just the block that
> was read. Usually disk corruption is in the order of dozens of blocks
> and the rest of the disk will be good.

I didn't word that well, primarily because I didn't even think of the
whole-device-bad case.

What I meant was that in the context of a btrfs scrub, btrfs will only be
aware of the two "sides" for every block, no matter how many devices the
underlying mdraid on that "side" is actually composed of. At the btrfs
level, then, it'll only have one chance to present good data, and the
mdraid level will effectively pick a candidate randomly. If the picked
candidate happens to return a block that fails the btrfs checksum, it'll
reject that block from that side, regardless of how many good copies there
might also be.

If it /does/ reject that block, you better *HOPE* that the copy it picks
from the mdraid on the /other/ side happens to be valid, because if it's
not...
If it's not, then btrfs will show both sides as failing the checksum, which means as far as btrfs is concerned that block (not the whole btrfs device "side", just that block, but that's bad enough) is dead, there's no good copies for it to use, regardless of the number of good copies on the other devices composing the underlying mdraids on each side. It's simply a matter of chance, over which the admin has very little control. That's the frustrating part, and the point I was trying to get across. But I agree (now that you made me aware of that read of what I wrote in the first place) that the way I wrote it did sound like I was saying that btrfs would drop that whole underlying mdraid, composing that "side". But while that's what I appeared to write, that's not what I had in mind... >> Tho if you ran a md/dmraid level scrub often enough, and then ran a >> btrfs scrub on top, one could be /reasonably/ assured of freedom from >> lower level corruption. > > Not at all. Linux software RAID scrub will copy data from one disk to > the other. It may copy from the good disk to the bad or from the bad > disk to the good - and it won't know which it's doing. Which was my point. But, assuming that you do an mdraid scrub and it finds and copies a bad version. At that point, if you've been both-layer scrubbing regularly, the chances of the /other/ side being bad are relatively low, so if as soon as you finish the mdraid scrub, you do a btrfs scrub, it should catch that bad copy and rewrite it from other, good copy at the btrfs level. The rewrite will then be propagated down to all the devices on the underlying mdraid on the bad side of the btrfs, and with a bit of luck, that will rewrite all the bad copies, or at least the bad copy on the first mdraid device so that the next mdraid scrub will propagate it to the bad device. 
If you constantly scrub the underlying mdraids and it sometimes propagates a bad block at that level, followed by a scrub at the btrfs level to (hopefully) force rewrites of any bad copies that the mdraid scrub propagated, then back to the mdraid level, then back to the btrfs level, basically constantly scrubbing at one level or the other, then in theory anyway, the chances of bitrot appearing on both sides of the btrfs at the same time are rather lowered... *BUT* at a cost of essentially *CONSTANT* scrubbing. Constant because at the multi-TBs we're talking, just completing a single scrub cycle could well take more than a standard 8-hour work-day, so by the time you finish, it's already about time to start the next scrub cycle. That sort of constant scrubbing is going to take its toll both on device life and on I/O thruput for whatever data you're actually storing on the device, since a good share of the time it's going to be scrubbing as well, slowing down the speed of the real I/O. And I just don't see that as realistic. At least not for spinning rust, which is where people talking about multi-TB capacities are likely to be at this point. For SSD it could be feasible as the scrubs should go fast enough that most of the time will be /between/ scrubs instead of /doing/ scrubs, and even during the scrubs, normal I/O shouldn't be /too/ held up on SSD, altho higher capacity I/O certainly would be, but of course SSD limits you to the lower capacities and higher costs of SSD. > Also last time I checked a scrub of Linux software RAID-1 still reported > large multiples of 128 sectors mismatching in normal operation. Ouch! That I hadn't even considered. > So you won't even know if a disk is returning bogus data unless the bad > data is copied to the good disk and exposed to BTRFS. 
>
>> But with both levels of scrub together very possibly taking a couple
>> days, and various ongoing write activity in the mean time, by the time
>> one run was done it'd be time to start the next one, so you'd
>> effectively be running scrub at one level or the other *ALL* the time!
>
> No. I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub
> every Sunday night. If I had an array of 4 disks then I could do scrubs
> on Saturday night as well.

But are you scrubbing at both the btrfs and the md/dmraid level? That'll effectively double the scrub-time.

And the idea was to scrub say every other day if not daily, so the chance of developing further bitrot and thus of getting it on both sides of the btrfs at the same time, is reduced as much as possible because the bitrot is caught and btrfs-scrub-corrected as soon as possible.

And while that might not take a full 24 hours, it's likely to take a significant enough portion of 24 hours, that if you're doing a full mdraid and btrfs level both scrub every two days, some significant fraction (say a third to a half) of the time will be spent scrubbing, during which normal I/O speeds will be significantly reduced, while also reducing device lifetime due to the relatively high duty cycle seek activity.

>> So... I'd suggest either forgetting about data integrity for the time
>> being and just running md/dmraid without worrying about it, or just
>> running btrfs with pairs, and backing up to another btrfs of pairs.
>> Btrfs send/receive could even be used as the primary syncing method
>> between the main and backup set, altho I'd suggest having a fallback
>> such as rsync setup and tested to work as well, in case there's a bug
>> in send/receive that stalls that method for awhile.
>
> One advantage of BTRFS backup is that you know if the data is corrupt.
> If you make several backups that end up with different blocks on disk
> then Linux knows which one has the correct file data.

Absolutely agreed.
=:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
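A minimal sketch of the failure mode discussed in the message above: a checksum-less mirror layer that may propagate a bad copy, underneath a checksumming layer that can only repair a block if at least one of its two sides still reads good. This is a toy model, not btrfs or md code. The function names are hypothetical, `zlib.crc32` is only a stand-in checksum, and the choice of which mirror member gets copied over the others (`source_index`) is an assumption made for illustration.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum for the toy model only.
    return zlib.crc32(block)


def blind_mirror_sync(members, source_index=0):
    """Checksum-less mirror 'repair': copy one member over all the others.

    The mirror cannot tell whether the source member is the good or the bad one.
    """
    return [members[source_index]] * len(members)


def checksummed_repair(sides, expected):
    """Rewrite every side from the first side whose checksum matches.

    Returns (sides_after_repair, recovered).
    """
    good = next((s for s in sides if checksum(s) == expected), None)
    if good is None:
        return list(sides), False  # both sides fail the checksum: block is dead
    return [good] * len(sides), True


GOOD = b"payload"
BAD = b"pay1oad"  # simulated bitrot
EXPECTED = checksum(GOOD)

# Case 1: side A's mirror propagated a bad copy, side B is intact.
side_a = blind_mirror_sync([BAD, GOOD])
side_b = blind_mirror_sync([GOOD, GOOD])
case1_sides, case1_ok = checksummed_repair([side_a[0], side_b[0]], EXPECTED)

# Case 2: both mirrors propagated a bad copy. A good copy existed on each
# mirror before the sync, but the upper layer never got to see it.
side_a = blind_mirror_sync([BAD, GOOD])
side_b = blind_mirror_sync([BAD, GOOD])
case2_sides, case2_ok = checksummed_repair([side_a[0], side_b[0]], EXPECTED)

print("one side bad   -> recovered:", case1_ok)
print("both sides bad -> recovered:", case2_ok)
```

In the first case the checksumming layer rewrites the bad side from the good one; in the second it reports the block as unrecoverable, which is the "matter of chance" described above.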
* Re: RAID1 3+ drives
  2014-06-28 11:38 ` Duncan
@ 2014-06-28 13:40 ` Russell Coker
From: Russell Coker @ 2014-06-28 13:40 UTC
To: Duncan; +Cc: linux-btrfs

On Sat, 28 Jun 2014 11:38:47 Duncan wrote:
> And with the size of disks we have today, the statistics on multiple
> whole device reliability are NOT good to us! There's a VERY REAL chance,
> even likelihood, that at least one block on the device is going to be
> bad, and not be caught by its own error detection!

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The above paper suggests that it's about 10% of SATA disks getting such errors per year and that typically a disk that has such a problem has it for ~50 sectors. The probability of having 2 disks randomly get such errors (if they are truly random and independent) would be something like 1% per year. The probability that the ~50 sectors on each of 2*3TB disks happen to match up is much lower.

> > Also if you were REALLY paranoid you could have 2 BTRFS RAID-1
> > filesystems that each contain a single large file. Those 2 large files
> > could be run via losetup and used for another BTRFS RAID-1 filesystem.
> > That gets you redundancy at both levels. Of course if you had 2 disks
> > in one pair fail then the loopback BTRFS filesystem would still be OK.
>
> But the COW and fragmentation issues on the bottom level... OUCH! And
> you can't simply set NOCOW, because that turns off the checksumming as
> well, leaving you right back where you were without the integrity
> checking!

It really depends on how much performance you need. I've got some virtual servers running BTRFS within BTRFS and with modern hardware and a light load it works OK.

> *BUT* at a cost of essentially *CONSTANT* scrubbing.
> Constant because at
> the multi-TBs we're talking, just completing a single scrub cycle could
> well take more than a standard 8-hour work-day, so by the time you
> finish, it's already about time to start the next scrub cycle.

Scrubbing my BTRFS RAID-1 filesystem with 2.4TB of data stored on a pair of 3TB disks takes 5 hours.

> That sort of constant scrubbing is going to take its toll both on device
> life and on I/O thruput for whatever data you're actually storing on the
> device, since a good share of the time it's going to be scrubbing as
> well, slowing down the speed of the real I/O.

Some years ago I asked an executive from a company that manufactured hard drives about this. The engineering manager who was directed to answer my question told me that the drives were designed to perform any sequence of legal operations continually for the warranty period. So if a disk had a 3 year warranty then it should be able to survive a scrubbing loop for 3 years.

But scrubbing a system that runs 24*7 is a problem. Hopefully we will get a speed limit feature for BTRFS scrubbing as there is for Linux software RAID rebuild/scrub.

> > No. I have a RAID-1 array of 3TB disks that is 2/3 full which I scrub
> > every Sunday night. If I had an array of 4 disks then I could do scrubs
> > on Saturday night as well.
>
> But are you scrubbing at both the btrfs and the md/dmraid level? That'll
> effectively double the scrub-time.

It's a BTRFS RAID-1; there is no mdadm on that system.

> And while that might not take a full 24 hours, it's likely to take a
> significant enough portion of 24 hours, that if you're doing a full mdraid
> and btrfs level both scrub every two days, some significant fraction (say
> a third to a half) of the time will be spent scrubbing, during which
> normal I/O speeds will be significantly reduced, while also reducing
> device lifetime due to the relatively high duty cycle seek activity.
When the expected error rate for SATA disks is ~10% of disks having errors per year, a scrub every second day seems rather paranoid. But if you are that paranoid then the wisc.edu paper suggests that you should be buying "enterprise" disks that have a much lower error rate.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
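The arithmetic in the message above can be worked through as a rough sketch. The 10% per-disk annual rate, the ~50 affected sectors, the 3TB disk size and the 5-hour scrub of 2.4TB are the figures quoted in the thread. Everything else is an assumption made for illustration: 512-byte sectors, each disk's bad sectors forming one contiguous run placed uniformly at random, both disks being read in parallel during a scrub, and an mdraid-level scrub taking as long again as the btrfs one.

```python
# Figures quoted in the message above.
P_DISK_PER_YEAR = 0.10      # chance that a given SATA disk develops such errors in a year
BAD_SECTORS = 50            # typical number of affected sectors on such a disk
DISK_BYTES = 3 * 10**12     # 3TB disk
SCRUB_BYTES = 2.4e12        # data read from each disk during one scrub
SCRUB_HOURS = 5             # reported btrfs scrub duration

# Assumptions, not taken from the thread.
SECTOR_BYTES = 512

# Two independent disks both developing errors within the same year.
p_both = P_DISK_PER_YEAR ** 2

# Chance that two uniformly placed runs of BAD_SECTORS sectors overlap:
# the runs overlap when their start offsets differ by less than the run
# length, which is about (2L - 1) offsets out of N.
sectors_per_disk = DISK_BYTES // SECTOR_BYTES
p_overlap = (2 * BAD_SECTORS - 1) / sectors_per_disk

# Per-disk read rate implied by the reported scrub time.
scrub_mb_per_s = SCRUB_BYTES / (SCRUB_HOURS * 3600) / 1e6


def duty_cycle(hours_per_scrub, hours_between_starts):
    """Fraction of wall-clock time spent scrubbing."""
    return hours_per_scrub / hours_between_starts


weekly = duty_cycle(SCRUB_HOURS, 7 * 24)
every_two_days = duty_cycle(SCRUB_HOURS, 48)
both_layers = duty_cycle(2 * SCRUB_HOURS, 48)  # assumes the md scrub doubles the time

print(f"both disks affected in a year: {p_both:.1%}")
print(f"affected runs overlapping:     {p_overlap:.1e}")
print(f"implied scrub read rate:       {scrub_mb_per_s:.0f} MB/s per disk")
print(f"scrub duty cycle: weekly {weekly:.1%}, "
      f"every 2 days {every_two_days:.1%}, both layers {both_layers:.1%}")
```

Under these assumptions a weekly 5-hour scrub occupies about 3% of the week, while scrubbing both layers every two days would occupy roughly a fifth of the time.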
* Re: RAID1 3+ drives
  2014-06-28 6:28 ` Russell Coker
  2014-06-28 7:38 ` Martin Steigerwald
  2014-06-28 11:38 ` Duncan
@ 2014-06-28 18:15 ` Chris Murphy
From: Chris Murphy @ 2014-06-28 18:15 UTC
To: Btrfs BTRFS

On Jun 28, 2014, at 12:28 AM, Russell Coker <russell@coker.com.au> wrote:
>
>> Tho if you ran a md/dmraid level scrub often enough, and then ran a btrfs
>> scrub on top, one could be /reasonably/ assured of freedom from lower
>> level corruption.
>
> Not at all. Linux software RAID scrub will copy data from one disk to the
> other.

md supports two kinds of scrub: check and repair. Check is the equivalent of a btrfs read-only scrub, i.e. scrub with the -r option.

> It may copy from the good disk to the bad or from the bad disk to the
> good - and it won't know which it's doing.

Yes.

> Also last time I checked a scrub of Linux software RAID-1 still reported large
> multiples of 128 sectors mismatching in normal operation. So you won't even
> know if a disk is returning bogus data unless the bad data is copied to the
> good disk and exposed to BTRFS.

For raid1 and raid10 you need to zero the drives or you will get mismatches. A swap partition or swap file on an md device will also cause mismatches. Mismatches on raid1 and raid10 are much less likely for other types of files, but man 4 md does say it's possible, so mismatch_cnt isn't perfectly reliable on raid1 and raid10.

Chris Murphy
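As a hedged illustration of the check/repair distinction described above, the sketch below drives an md scrub through the array's sysfs files, `sync_action` and `mismatch_cnt`, as documented in md(4). The helper names and the `sysfs_root` parameter are hypothetical and exist only so the functions can be pointed at a test directory. Writing to the real files requires root, and `md0` in the usage note is just an example device name.

```python
from pathlib import Path


def md_dir(device: str, sysfs_root: str = "/sys/block") -> Path:
    """Directory holding the md attributes for an array such as 'md0'."""
    return Path(sysfs_root) / device / "md"


def start_scrub(device: str, action: str = "check", sysfs_root: str = "/sys/block") -> None:
    """Request an md scrub.

    'check' only reads and counts mismatches; 'repair' also rewrites
    mismatched blocks, without knowing which copy is the good one.
    """
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported scrub action: {action!r}")
    (md_dir(device, sysfs_root) / "sync_action").write_text(action + "\n")


def mismatch_count(device: str, sysfs_root: str = "/sys/block") -> int:
    """Read the mismatch counter left behind by the last check or repair."""
    return int((md_dir(device, sysfs_root) / "mismatch_cnt").read_text())
```

A read-only pass would then be `start_scrub("md0", "check")` followed, once the check has finished, by `mismatch_count("md0")`. As noted above, a non-zero count on raid1 or raid10 does not by itself prove that a disk is returning bad data.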
* Re: RAID1 3+ drives
  2014-06-28 4:26 ` Duncan
  2014-06-28 6:28 ` Russell Coker
@ 2014-06-28 10:13 ` Roman Mamedov
  2014-06-29 2:30 ` Duncan
From: Roman Mamedov @ 2014-06-28 10:13 UTC
To: Duncan; +Cc: linux-btrfs

On Sat, 28 Jun 2014 04:26:43 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote:

> Russell Coker posted on Sat, 28 Jun 2014 10:51:00 +1000 as excerpted:
>
> > On Fri, 27 Jun 2014 20:30:32 Zack Coffey wrote:
> >> Can I get more protection by using more than 2 drives?
> >>
> >> I had an onboard RAID a few years back that would let me use RAID1
> >> across up to 4 drives.
> >>
> > Currently the only RAID level that fully works in BTRFS is RAID-1 with
> > data on 2 disks.
>
> Not /quite/ correct. Raid0 works, but of course that isn't exactly
> "RAID" as it's not "redundant". And raid10 works. But that's simply
> raid0 over raid1. So depending on whether you consider raid0 actually
> "RAID" or not, which in turn depends on how strict you are with the
> "redundant" part, there is or is not more than btrfs raid1 working.

Also, depending on what you consider "fully works", RAID1 may not qualify either, as neither the read-balancing nor the write-submission algorithms are ready for production use, performance-wise.

(RAID1 writes to two disks sequentially, not at the same time; and reads are satisfied from in effect a random device, not from the least-busy device).

--
With respect,
Roman
* Re: RAID1 3+ drives
  2014-06-28 10:13 ` Roman Mamedov
@ 2014-06-29 2:30 ` Duncan
From: Duncan @ 2014-06-29 2:30 UTC
To: linux-btrfs

Roman Mamedov posted on Sat, 28 Jun 2014 16:13:47 +0600 as excerpted:

> Also depending on what you consider "fully works", RAID1 may not qualify
> too,
> as neither the read-balancing, nor write-submission algorithms are ready
> for production use, performance-wise.
>
> (RAID1 writes to two disks sequentially, not at the same time; and reads
> are satisfied from in effect a random device, not from the least-busy
> device).

Good point. The current algorithms were designed as "good enough" stand-ins for testing. They were /not/ designed as highly efficient parallel-I/O on parallel devices and cores implementations, as that was to come later.

Of course part of /that/ problem is that often enough, the I/O channel is /not/ the bottleneck; the bottleneck is still the horrible scaling issues due to calculating the interplay between all those snapshots and quotas and massive internal-rewrite-pattern-VM-images, thus the reason we have snapshot-aware-defrag disabled ATM. So arguably focusing on the most efficient I/O queues algorithm at this point would be premature optimization, which would mean it's a /good/ thing they haven't focused on updating them yet.

Once these horrible scaling issues are addressed and snapshot-aware-defrag and the like can be enabled again without triggering week-going-and-it's-still-not-half-done issues, /then/ perhaps it's time to look at the parallel I/O queuing and balancing algorithms.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
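To make the read-balancing point concrete, here is a toy comparison of the two policies named in the thread: picking a mirror in effect at random versus picking the least-busy one. This is not btrfs code. The function names, the fixed per-request cost and the seeded random generator are all assumptions made for illustration, and the model ignores requests completing over time.

```python
import random


def assign(costs, pick, n_devices=2):
    """Assign each read request to a device; return per-device outstanding work."""
    loads = [0] * n_devices
    for cost in costs:
        loads[pick(loads)] += cost
    return loads


def make_random_pick(seed=0):
    """Policy that ignores load entirely (stand-in for 'in effect a random device')."""
    rng = random.Random(seed)
    return lambda loads: rng.randrange(len(loads))


def least_busy_pick(loads):
    """Policy that sends the request to the device with the least outstanding work."""
    return loads.index(min(loads))


costs = [4] * 101  # 101 equal-cost reads against a two-device mirror
random_loads = assign(costs, make_random_pick())
balanced_loads = assign(costs, least_busy_pick)

print("random pick:    ", random_loads)
print("least-busy pick:", balanced_loads)
```

With equal-cost requests the least-busy policy keeps the two devices within one request of each other, which no load-blind policy can beat; how much that matters in practice depends on whether the I/O channel is the bottleneck at all, as the reply above argues.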
Thread overview: 11+ messages
2014-06-28  0:30 RAID1 3+ drives Zack Coffey
2014-06-28  0:51 ` Russell Coker
2014-06-28  4:26   ` Duncan
2014-06-28  6:28     ` Russell Coker
2014-06-28  7:38       ` Martin Steigerwald
2014-06-28  7:43         ` Hugo Mills
2014-06-28 11:38       ` Duncan
2014-06-28 13:40         ` Russell Coker
2014-06-28 18:15       ` Chris Murphy
2014-06-28 10:13     ` Roman Mamedov
2014-06-29  2:30       ` Duncan