* mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
@ 2014-12-10 22:18 Robert White
2014-12-11 7:33 ` Duncan
2014-12-12 3:56 ` Zygo Blaxell
0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-10 22:18 UTC
To: Btrfs BTRFS
So I started looking at the mkfs.btrfs manual page with an eye towards
documenting some of the tidbits like metadata automatically switching
from dup to raid1 when more than one device is used.
In experimenting I ended up with some questions...
(1) why is the dup profile for data restricted to only one device and
only if it's mixed mode?
Gust t # mkfs.btrfs -f /dev/loop{0..1} -d dup
Error: unable to create FS with data profile 16 (have 2 devices)
Gust t # mkfs.btrfs -f /dev/loop0 -d dup
Error: dup for data is allowed only in mixed mode
(2) why is metadata dup profile restricted to only one device on
creation when it will run that way just fine after a device add?
Gust t # mkfs.btrfs -f /dev/loop{0..1} -m dup
Error: unable to create FS with metadata profile 32 (have 2 devices)
(3) why can I make a raid5 out of two devices? (I understand that we are
currently just making mirrors, but the standard requires three devices
in the geometry etc. So I would expect a two device RAID5 to be
considered degraded with all that entails. It just looks like it's asking
for trouble to allow this once the support is finalized as suddenly a
working RAID5 that's really a mirror would become something that can only
be mounted with the degraded flag.)
Gust t # mkfs.btrfs -f /dev/loop{0..1} -d raid5 -m raid5
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.
Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Turning ON incompat feature 'raid56': raid56 extended format
Performing full device TRIM (2.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 16384 leafsize 16384 sectorsize 4096 size 4.00GiB
(4) Same question for raid6 but with three drives instead of the
mandated four.
(5) If I can make a RAID5 or RAID6 device with one missing element, why
can't I make a RAID1 out of one drive, e.g. with one missing element?
(6) If I make a RAID1 out of three devices are there three copies of
every extent or are there always two copies that are semi-randomly
spread across three devices? (ibid for more than three).
---
It seems to me (very dangerous words in computer science, I know) that
we need a "failed" device designator so that a device can be in the
geometry (e.g. have a device ID) but not actually exist. Reads/writes to
the failed device would always be treated as error returns.
The failed device would be subject to replacement with "btrfs dev
replace", and could be the source of said replacement to drop a
problematic device out of an array.
EXAMPLE:
Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.
Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Processing explicitly missing device
adding device (failed) id 2 (phantom device)
mount /dev/loop0 /mountpoint
btrfs replace start 2 /dev/loop1 /mountpoint
(and so on)
Being able to "replace" a faulty device with a phantom "failed" device
would nicely disambiguate the whole device add/remove versus replace
mistake.
It would make the degraded status less mysterious.
A filesystem with an explicitly failed element would also make the
future roll-out of full RAID5/6 less confusing.
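
The limits probed in questions (1) through (5) can be restated as a small
table plus a check. This is only a sketch: the table and the function name
are hypothetical, and the numbers come solely from the mkfs.btrfs v3.17.1
behaviour shown in this message, not from the btrfs-progs source.

```python
# Device-count limits as observed in this message with mkfs.btrfs v3.17.1.
# Hypothetical table for illustration; not taken from btrfs-progs.
OBSERVED_LIMITS = {
    # profile: (minimum devices, maximum devices or None for "no limit seen")
    "dup": (1, 1),
    "raid1": (2, None),
    "raid5": (2, None),
    "raid6": (3, None),
}


def mkfs_would_accept(kind, profile, num_devices, mixed=False):
    """Return True if the observed behaviour suggests mkfs would accept this.

    kind is "data" or "metadata"; mixed models mixed block group mode.
    """
    low, high = OBSERVED_LIMITS[profile]
    if num_devices < low:
        return False
    if high is not None and num_devices > high:
        return False
    # "dup for data is allowed only in mixed mode"
    if kind == "data" and profile == "dup" and not mixed:
        return False
    return True


print(mkfs_would_accept("data", "dup", 2))       # two devices: rejected
print(mkfs_would_accept("metadata", "dup", 2))   # two devices: rejected
print(mkfs_would_accept("data", "raid5", 2))     # accepted, hence question (3)
```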
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
From: Duncan @ 2014-12-11 7:33 UTC
To: linux-btrfs

Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:

> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits like metadata automatically switching
> from dup to raid1 when more than one device is used.
>
> In experimenting I ended up with some questions...
>
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?

> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?

1 and 2 together since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard
for metadata in the single-device case, where it was made the default.
(The exception is SSDs, which default to single mode metadata on a
single-device filesystem: the FTL voids any guarantees on location, and
firmware such as SandForce compresses and dedups, so the
hardware/firmware subverts btrfs' efforts to do dup anyway.)

In the single-device case, two copies of data were considered simply not
worth the cost, due both to doubling the size (especially on SSD where
size is money!) and to the speed penalties on spinning rust due to seeks
between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two
different devices, was considered enough superior to make that the
default, since that provided device-loss resiliency for the all-important
metadata, thus enabling recovery of at least /some/ files even with a
device missing (single-mode data where the file's extents all happened to
be on available devices, plus of course raid1, etc, data). Further,
dup-mode metadata was considered a mistake it was better not to even have
available as an option, since loss of a single device would likely kill
the filesystem, which made dup mode little better than single mode, and
single mode at least avoids the doubled size cost. Further, on spinning
rust there'd again be the seek penalty, to little benefit since dup mode
provides no guarantees in case of device loss.

So multi-device defaults to raid1 metadata for safety, but single mode
metadata remains an option (along with raid0) if you really /don't/ care
about losing everything due to loss of a single device. Single-device
simply makes dup-mode available (and the default) for metadata, as a
poor-man's substitute for the safety of raid1, but single-device metadata
is the only case where that poor-man's-raid1-substitute is worth the
(considered extreme) cost, with usage of that option not even available
on multi-device as it'd be a near-certain mistake, certainly at the mkfs
level. And dup mode isn't ordinarily available for data even on
single-device, because it's considered not worth the cost.

As for dup-mode working after device-add, that's simply a necessary bit
in order for device add to work from a default-dup-mode single-device at
all. And it's only the existing metadata chunks on the original device
that will be dup-mode. Once a second device is added, additional metadata
chunks will be written in raid1 mode, forcing the two chunk copies to
different devices since there are multiple devices available to allow
that.

The clear intent and recommendation is to do a rebalance ASAP after a
device add, to spread usage to the new device as appropriate. And of
course that rebalance will use the new raid1 metadata defaults, unless
told otherwise, and I don't believe dup mode is available to tell it
otherwise there, either.

What all that original reasoning fails to account for, however, is the
btrfs data/metadata checksumming and integrity features and the very high
(which the original btrfs mode designers obviously considered extreme)
value some users (including me) place on them. A multi-device
dup-mode-metadata choice at mkfs is arguably still a mistake: it has the
cost of raid1 metadata without the benefit, and nearly the risk of single
metadata at double the size. But dup-mode data combined with btrfs
checksumming and data integrity features on a single device has strong
data integrity benefits that some would definitely consider worth it,
even at the additional cost in speed on spinning rust due to seeking, and
in size on expensive SSDs.

Meanwhile, mixed-bg-mode was an after-thought, added much later (after my
own btrfs journey began) in order to make working with small filesystems
reasonable. Before mixed-bg-mode, people attempting to use btrfs on
sub-GiB devices often found they couldn't use all available space (often
25-50% wasted!) as the separate data/metadata chunk allocation was simply
too large grained to properly deal with the small sizes involved.

And small filesystems really _was_ mixed-mode's _entire_ purpose. That it
could additionally be used to allow dup-data, using the ability to
specify mixed-bg-mode even on > 1 GiB filesystems where it wasn't the
default, was *ENTIRELY* an accident, not even considered until a user
figured it out, as confirmed by (I believe it was) Chris Mason when
directly asked at some point.

But now that mixed-mode is there and can be used to enable dup-mode data
too, for people that want it, and now that we know for sure such people
exist because we see mixed-bg mode being offered as a way to get exactly
that, dup-mode-data, there's little reason to remove the accidental
feature. =:^)

Meanwhile, now that demand is known to exist for dup-mode-data, I think
it probable that at some point code for that without having to force
mixed-bg-mode to get it will be made available and tested, much as other
features have been. But there's way more features left to implement than
time to implement them, at least with the current btrfs developer pool.
And given that mixed-bg-mode is available to deliver dup-mode-data for
those /really/ intent on having it, the priority of coding and testing
stand-alone-dup-mode-data is going to be relatively low, so I'd suggest
not expecting it any time soon -- maybe five years out, I don't see it
much sooner unless a dev (or dev sponsor) really gets that itch and
decides to priority scratch it.

> (3) why can I make a raid5 out of two devices?

> (4) Same question for raid6 but with three drives instead of the
> mandated four.
>
> (5) If I can make a RAID5 or RAID6 device with one missing element, why
> can't I make a RAID1 out of one drive, e.g. with one missing element?

AFAIK, the ability to mkfs raid56 modes with a missing device is a bug.
I'm not sure if it was known or not, tho I know there has been some
change in minimum number of devices over time and it might have gotten
caught in that, but I'd /guess/ that since raid56 isn't yet fully
supported, if the bug /was/ known, it had relatively low priority on the
fix-list compared to various other bugs with currently supported
features.

If it is a bug as I believe it to be, that nullifies most of the
secondary questions you had...

> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).

Currently btrfs raid1 is defined very specifically as exactly two
copies/mirrors, regardless of whether there are two or two hundred
devices in the filesystem. More devices gives you more room; number of
copies remains two. This is covered in the wiki.

The feature known as N-way-mirroring is however on the roadmap -- for
just after raid56, since the planned implementation depends on some of
the same code.

This is actually a bit of a personal sore spot for me, since it has long
been my most-wished-for feature. When I first investigated btrfs now
years ago, I was running quad-way-mdraid-1, and was very disappointed to
see that btrfs only offered paired-raid1, since I wanted (and still want)
very much to be able to fall back more than once to additional copies,
should the checksum fail on the first N-1 copies. And back then (kernel
3.5 era IIRC) it was already roadmapped immediately after raid56 modes,
which was to be introduced in another kernel cycle or two, so I figured
perhaps 3-4 cycles, maybe a year (~5 cycles) for N-way-mirroring.

But it seems as far out now as then, if not further since we know how
long raid56 is taking to complete, and two kernel cycles after that for
N-way-mirroring seems wildly optimistic, now. Maybe a year after... if
it's not too complicated. But it's definitely on the roadmap, next thing
to implement in fact, but it's still right after raid56, and raid56 has
of course been coming right up since kernel 3.6 or whatever, at least.

But I'm not a dev so I can't help in that regard, tho I do use btrfs in
pair-way raid1 mode now, and try to help on the list where my knowledge
as list regular and sysadmin using btrfs allow it. Someday that feature
will be available to play with...
but that doesn't mean I can't enjoy btrfs for what it has right now, nor
does it mean I can't help others with btrfs while I wait...

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
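
The "exactly two copies, more devices only gives more room" rule has a
simple consequence for usable space. The helper below is a hypothetical
sketch, assuming the allocator can always place the two copies of a chunk
on two different devices and ignoring metadata overhead and chunk
granularity; it is not how btrfs itself reports space.

```python
def raid1_usable(device_sizes):
    """Approximate usable space for two-copy raid1 over the given devices.

    Every byte is stored twice, on two different devices, so capacity is
    limited both by half the total and by what the other devices can
    mirror for the largest one.
    """
    if len(device_sizes) < 2:
        raise ValueError("two-copy raid1 needs at least two devices")
    total = sum(device_sizes)
    largest = max(device_sizes)
    return min(total / 2, total - largest)


# Sizes in TB; three equal devices versus one large and two small ones.
print(raid1_usable([1, 1, 1]))
print(raid1_usable([4, 1, 1]))
```

With three equal devices the limit is half the total; with one oversized
device the limit is whatever the remaining devices can mirror.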
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
From: Zygo Blaxell @ 2014-12-12 3:56 UTC
To: Robert White; +Cc: Btrfs BTRFS

On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
> (3) why can I make a raid5 out of two devices? (I understand that we
> are currently just making mirrors, but the standard requires three
> devices in the geometry etc. So I would expect a two device RAID5 to
> be considered degraded with all that entails. It just looks like it's
> asking for trouble to allow this once the support is finalized as
> suddenly a working RAID5 that's really a mirror would become
> something that can only be mounted with the degraded flag.)

RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).

> (4) Same question for raid6 but with three drives instead of the
> mandated four.

RAID6 with three devices should behave more or less like three-way RAID1,
except maybe the two parity disks might be different (I forget how the
function used to calculate the two parity stripes works, and whether
it can be defined such that F(disk1, disk2, disk3) == disk1).

> (5) If I can make a RAID5 or RAID6 device with one missing element,
> why can't I make a RAID1 out of one drive, e.g. with one missing
> element?

They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.

> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).

There are always two copies. RAID1 on 3x1TB disks gives you 1.5TB
of mirrored storage.

> ---
>
> It seems to me (very dangerous words in computer science, I know)
> that we need a "failed" device designator so that a device can be in
> the geometry (e.g. have a device ID) but not actually exist.
> Reads/writes to the failed device would always be treated as error
> returns.
>
> The failed device would be subject to replacement with "btrfs dev
> replace", and could be the source of said replacement to drop a
> problematic device out of an array.
>
> EXAMPLE:
> Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
> Btrfs v3.17.1
> See http://btrfs.wiki.kernel.org for more information.
>
> Performing full device TRIM (2.00GiB) ...
> Turning ON incompat feature 'extref': increased hardlink limit per
> file to 65536
> Processing explicitly missing device
> adding device (failed) id 2 (phantom device)
>
> mount /dev/loop0 /mountpoint
>
> btrfs replace start 2 /dev/loop1 /mountpoint
>
> (and so on)
>
> Being able to "replace" a faulty device with a phantom "failed"
> device would nicely disambiguate the whole device add/remove versus
> replace mistake.

It is a little odd that an array of 3 disks with one missing looks
like this:

Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
    Total devices 3 FS bytes used 256.00KiB
    devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
    devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
    devid 3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

In the above, "vgtest-d02" was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...

> It would make the degraded status less mysterious.

The 'degraded' status currently protects against some significant data
corruption risks. :-O

> A filesystem with an explicitly failed element would also make the
> future roll-out of full RAID5/6 less confusing.
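
The claim that single-parity RAID over two devices degenerates into a
mirror can be checked in a few lines. This is an illustrative sketch of
plain XOR parity only (the function name is made up for the example); it
says nothing about how the btrfs raid56 code actually lays out chunks.

```python
from functools import reduce


def xor_parity(blocks):
    """Bytewise XOR of a list of equal-length data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two-device "RAID5": each stripe holds one data block and one parity block.
data = b"btrfs"
parity = xor_parity([data])

# With a single data block the parity is a copy of the data,
# which is the disk1 ^ disk2 == 0 case described above.
assert parity == data
assert xor_parity([data, parity]) == bytes(len(data))
```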
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
From: Robert White @ 2014-12-12 6:01 UTC
To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
> On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
>> (3) why can I make a raid5 out of two devices? (I understand that we
>> are currently just making mirrors, but the standard requires three
>> devices in the geometry etc. So I would expect a two device RAID5 to
>> be considered degraded with all that entails. It just looks like it's
>> asking for trouble to allow this once the support is finalized as
>> suddenly a working RAID5 that's really a mirror would become
>> something that can only be mounted with the degraded flag.)
>
> RAID5 with even parity and two devices should be exactly the same as
> RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
> is irrelevant because there is no difference in disk contents so the
> disks are interchangeable), except with different behavior when more
> devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
> should start writing new chunks with N stripes instead of two).

That's not correct. A RAID5 with three elements presents two
_different_ sectors in each stripe. When one element is lost, it would
still present two different sectors, but the safety is gone.

I understand that the XOR collapses into a mirror if only two datum are
involved, but that's a mathematical fact that is irrelevant to the
definition of a RAID5 layout. When you take a wheel off of a tricycle it
doesn't just become a bike. And you can't make a bicycle into a trike by
just welding on a wheel somewhere. The infrastructure of the two is
completely different.

So RAID5 with three media M is

M     MM    MMM
D1    D2    P(a)
D3    P(b)  D4
P(c)  D5    D6

If MMM is lost D1, D2, D3, and D5 are intact
D4 and D6 can be recreated via D3^P(b) and P(c)^D5

M     MM    X
D1    D2    .
D3    P(b)  .
P(c)  D5    .

So under _no_ circumstances would a two-disk RAID5 be the same as a
RAID1 since a two disk RAID5 functionally implies disk three because
the _minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
protection because the minimum third element is a computational
phantom.

In short it is irrational to have a "two disk" RAID5 that is "not
degraded" in the same way you cannot have a two-wheeled tricycle
without scraping some part of something along the asphalt.

A RAID1 with two elements presents one sector along the "stripe".

I realize that what has been implemented is what you call a two drive
RAID5, and done so by really implementing a RAID1, but it's nonsense.

I mean I understand what you are saying you've done, but it makes no
sense according to the definitions of RAID5. There is no circumstance
where RAID5 falls back to mirroring. Trying to implement RAID5 as an
extension of a mirroring paradigm would involve a fundamental conflict
in definitions. Especially when you reached a failure mode.

This is so fundamental to the design that the "fast" way to assemble a
RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
elements, declare the raid valid-but-degraded using (N-1) of the
media, and then "replacing" the Nth phantom/missing/failed element
with the real disk and triggering a rebuild. This only works if you
don't need the initial contents of the array to have a specific value
like zero. (This involves fewest reads and the array is instantly
available while it builds.)

As soon as you start writing to the array, the stripes you write
"repair" the extents if the repair process hadn't gotten to them yet.

It's basically impossible to turn a mirror into a RAID5 if you _ever_
expect the code base to be able to recover an array that's lost an
element.

>> (4) Same question for raid6 but with three drives instead of the
>> mandated four.
>
> RAID6 with three devices should behave more or less like three-way RAID1,
> except maybe the two parity disks might be different (I forget how the
> function used to calculate the two parity stripes works, and whether
> it can be defined such that F(disk1, disk2, disk3) == disk1).

Uh, no. A raid 6 with three drives, or even two drives, is also
degraded because the minimum is four.

A   B   C   D
D1  D2  Pa  Qa
D3  Pb  Qb  D4
Pc  Qc  D5  D6
Qd  D7  D8  Pd

You can lose one or two media but the minimum stripe is again [X1,X2]
for any read (ABCD)(ABC.)(AB..)(A..D) etc.

Minimum arity for RAID6 is 4, maximum lost-but-functional
configuration is arity-minus-two.

>
>> (5) If I can make a RAID5 or RAID6 device with one missing element,
>> why can't I make a RAID1 out of one drive, e.g. with one missing
>> element?
>
> They're only missing if you believe the minimum number of RAID5 disks
> is not two and the minimum number of RAID6 disks is not three.

I do believe that, because that's what the terms are universally taken
to mean.

If what BTRFS is promising/planning as raid5 will run non-degraded on
two disks it's... something... but it's not RAID5.

If what BTRFS is promising/planning as raid6 will run non-degraded on
three disks it's... something... but it's not RAID6.

>
>> (6) If I make a RAID1 out of three devices are there three copies of
>> every extent or are there always two copies that are semi-randomly
>> spread across three devices? (ibid for more than three).
>
> There are always two copies. RAID1 on 3x1TBdisks gives you 1.5TB
> of mirrored storage.

>> ---
>>
>> It seems to me (very dangerous words in computer science, I know)
>> that we need a "failed" device designator so that a device can be in
>> the geometry (e.g.
>> have a device ID) but not actually exist.
>> Reads/writes to the failed device would always be treated as error
>> returns.
>>
>> The failed device would be subject to replacement with "btrfs dev
>> replace", and could be the source of said replacement to drop a
>> problematic device out of an array.
>>
>> EXAMPLE:
>> Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
>> Btrfs v3.17.1
>> See http://btrfs.wiki.kernel.org for more information.
>>
>> Performing full device TRIM (2.00GiB) ...
>> Turning ON incompat feature 'extref': increased hardlink limit per
>> file to 65536
>> Processing explicitly missing device
>> adding device (failed) id 2 (phantom device)
>>
>> mount /dev/loop0 /mountpoint
>>
>> btrfs replace start 2 /dev/loop1 /mountpoint
>>
>> (and so on)
>>
>> Being able to "replace" a faulty device with a phantom "failed"
>> device would nicely disambiguate the whole device add/remove versus
>> replace mistake.
>
> It is a little odd that an array of 3 disks with one missing looks
> like this:

It's correct for a three disk array with one "failed" (e.g. where
vgtester-d04 is present but bad); it's wrong for a _four_ disk array
where one disk (vgtester-d03) has been unplugged or is otherwise missing
(as opposed to "deleted").

The entire idea of "three disk array with one missing" doesn't match
your example below, which is in fact a three disk array with all
elements present. Your example below started out as a four disk array
and then you deleted one, making it a three disk array.

The point at issue would be a four-disk array with one missing. So
there'd be four lines. E.g. a four disk array with one missing _ought_
to look like:

> Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
> Total devices 3 FS bytes used 256.00KiB
> devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
> devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
devid 3 size 4.00GiB used 0.00B (missing) path ???
> devid 4 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

The problem here is that the concept of "missing" is, um, missing from
BTRFS statuses. For instance the same idea I presume you are going for
would be expressed in mdadm as with the little status array "UU.U" for
"up, up, missing, and up".

BTRFS _should_ (big words from a noob, I know) have and display the
arity of the array with the correct number of expected disks, filled
out with the information of the available disks.

Were this correct there would be a covariant line for 3 and
vgtester-d04 would be devid 4 like I did to it above.

>
> Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
> Total devices 3 FS bytes used 256.00KiB
> devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
> devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
> devid 3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

That's not odd at all. sort of...

(simplified to three lines and names changed because of word-wrap
here... )

An array of three disks with one missing should look like:

Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
    Total devices 3 FS bytes used 256.00KiB
    devid 1 size 4.00GiB used 847.12MiB path /dev/sda
    devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
    devid 3 size 4.00GiB used 0.00B (missing)

because, you know, it's like... missing...

An array of three disks with one _failed_ should look like:

Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
    Total devices 3 FS bytes used 256.00KiB
    devid 1 size 4.00GiB used 847.12MiB path /dev/sda
    devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
    devid 3 size 4.00GiB used 0.00B (failed) path /dev/sdc

An array of three disks with one freshly replacing a previously missing
or failed should look like:

Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
    Total devices 3 FS bytes used 256.00KiB
    devid 1 size 4.00GiB used 847.12MiB path /dev/sda
    devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
    devid 3 size 4.00GiB used x.xxxB (rebuilding) path /dev/sdc

With the used value growing with each subsequent repeat of the inquiry
until they all had the same numbers.

And I don't know what the status would look like during a "replace" but
there'd temporarily be a fourth disk in the list, one being a donor and
one being the new replacement.

That's _exactly_ what a RAID5 with a degraded, failed, or missing member
should look like. For any extent (A,B,C) any one column (A), (B), or (C)
can be missing -- shown as (.) such that for a chunk size X there is
always a return stripe [X1,X2] (e.g. the stripe size is _always_ the
arity minus one, and the minimum arity is three) returned by any legal
read (A,B,C) == (.,B,C) == (A,.,C) == (A,B,.); it is this property that
provides the redundancy.

So nominally, the above would result in all reads [X1,X2] being a
result of (A,B,.) or by device ID (1,2,.). And each read of (1,2,.)
would provide the opportunity to repair ID 3's chunk.

The subsequent activity, especially a balance/repair operation would be
repopulating /dev/mapper/vgtester-d04 to reestablish the parity.

Similarly all writes to a valid extent require two reads, and two
writes minimum.
If you have the parity and the target block in memory (that's the two
reads), you xor-out the original contents of target block, xor-in the
new contents of target block, then you have to write _both_ the target
block and the parity block (preferably in one transaction).

In a degraded RAID5, if you are writing to a "missing" block, you have
to read all blocks in the stripe, calculate the missing block, xor the
calculated block out of the parity block, then xor the new block into
the parity and write the parity block back out. If the replacement
drive is installed and active you can then also just write the new block
there as well, and the block stripe is now no longer degraded.

This is the core paradigm for RAID5.

And you need the "empty" device ID: in the missing case it would cause
no-op read and write events/errors but allow the spanning logic to
remain intact, and that logic is necessary for rational recovery when it
ends up being

1,2,3-is-bad
1,2,3-is-bad
1,2,3
1,2,3
1,2,3-is bad

As in this case stripes zero, one, and four are improper, still
missing, whatever and stripes two and three have been balanced/scrubbed
back into good order.

It's particularly important and valuable to have that device ID
allocated "failed" in a replace scenario where the logic is now ready
to keep the good stuff (the extents for tracks 2 and 3 for example) and
only recalculate the bad.

>
> In the above, "vgtest-d02" was a deleted LV and does not exist, but
> you'd never know that from the output of 'btrfs fi show'...

That would be because "deleted" and "failed" are two inherently
different conditions and BTRFS doesn't have the ID smarts for a
"failed" device to be present in the map.

>
>> It would make the degraded status less mysterious.
>
> The 'degraded' status currently protects against some significant data
> corruption risks. :-O
>
>> A filesystem with an explicitly failed element would also make the
>> future roll-out of full RAID5/6 less confusing.

I also still don't get why the RAID1 with arity greater than two was at
all hard to construct. It would have been my first step on the way to
RAID5/6

A
D1
D2
D3

A   B
D1  D1
D2  D2
D3  D3

A   B   C
D1  D1  D1
D2  D2  D2
D3  D3  D3

Is the logical progression right before

A   B   C
D1  D2  Pa
D3  Pb  D4
Pc  D5  D6

Until you have the code base and data structures to "search past B" in
a mirror of arbitrary arity, you just don't have the means to organize
the horizontal stripe-as-entity needed to record the arbitrarily
wide stripes you need to make a higher-order RAID.

And before _any_ of that you need to be able to explicitly account for
a missing drive such that you have a RAID1 of

A   x
D1  .
D2  .
D3  .

For all possible read and write events. Without that your rebuild of
any RAID is "iffy".

If you are not ready for

A   x   C
D1  .   D1
D2  .   D2
D3  .   D3

then

A   x   C
D1  .   Pa
D3  .   D4
Pc  .   D6

Is going to ruin your world.

I don't know how to turn this into proper BTRFS speak since I am still
new to the code base...
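
The read-modify-write and degraded-read steps described in this message
can be sketched with plain XOR. The function names below are
hypothetical and the blocks are short byte strings; this illustrates the
generic single-parity arithmetic only, not the btrfs raid56
implementation.

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(parity, old_block, new_block):
    """Read-modify-write: xor out the old contents, xor in the new."""
    return xor_blocks(xor_blocks(parity, old_block), new_block)


def rebuild_missing(surviving_blocks):
    """Recreate the one missing block of a stripe from all the others."""
    result = bytes(len(surviving_blocks[0]))
    for block in surviving_blocks:
        result = xor_blocks(result, block)
    return result


# One three-device stripe: two data blocks and their parity.
d1, d2 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks(d1, d2)

# Overwrite d1 using only the old block, the new block and the parity.
new_d1 = b"\xff\x00"
parity = update_parity(parity, d1, new_d1)
assert parity == xor_blocks(new_d1, d2)

# Lose the device holding d2 and recover it from what is left.
assert rebuild_missing([new_d1, parity]) == d2
```

Note that the update needs only the target block and the parity block,
which is the "two reads, two writes" minimum mentioned above.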
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
From: David Taylor @ 2014-12-12 9:06 UTC
To: Robert White; +Cc: Btrfs BTRFS

On Thu, 11 Dec 2014, Robert White wrote:
>On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
>>
>>RAID5 with even parity and two devices should be exactly the same as
>>RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
>>is irrelevant because there is no difference in disk contents so the
>>disks are interchangeable), except with different behavior when more
>>devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
>>should start writing new chunks with N stripes instead of two).
>
>That's not correct. A RAID5 with three elements presents two
>_different_ sectors in each stripe. When one element is lost, it would
>still present two different sectors, but the safety is gone.

The above quote is discussing two device RAID5, you are discussing
three device RAID5.

>I understand that the XOR collapses into a mirror if only two datum
>are involved, but that's a mathematical fact that is irrelevant to the
>definition of a RAID5 layout. When you take a wheel off of a tricycle
>it doesn't just become a bike. And you can't make a bicycle into a
>trike by just welding on a wheel somewhere. The infrastructure of the
>two is completely different.

True. A two-device RAID5 is not the same as a degraded three-device
RAID5.

>So RAID5 with three media M is
>
>M     MM    MMM
>D1    D2    P(a)
>D3    P(b)  D4
>P(c)  D5    D6
>
>If MMM is lost D1, D2, D3, and D5 are intact
>D4 and D6 can be recreated via D3^P(b) and P(c)^D5
>
>M     MM    X
>D1    D2    .
>D3    P(b)  .
>P(c)  D5    .

>So under _no_ circumstances would a two-disk RAID5 be the same as a
>RAID1 since a two disk RAID5 functionally implies disk three because
>the _minimum_ arity of a RAID5 is 3.
A two-disk RAID5 has _zero_ data >protection because the minimum third element is a computational >phantom. You again seem to be treating a "two disk RAID5" as synonymous with your degraded three disk RAID5 above. It is not. RAID5 with two media M would be: M MM D1 P(a) P(b) D2 D3 P(c) [and each P would be identical to its corresponding D] >In short it is irrational to have a "two disk" RAID5 that is "not >degraded" in the same way you cannot have a two-wheeled tricycle >without scraping some part of something along the asphalt. There is nothing irrational about it at all, except that it is exactly equivalent to two disk RAID1. >A RAID1 with two elements presents one sector along the "stripe". A RAID5 with N elements presents N-1 sectors along the "stripe", so I'm not sure what the problem is with setting N=2. >I realize that what has been implemented is what you call a two drive >RAID5, and done so by really implementing a RAID1, but it's nonsense. It's not really, it's merely an argument of semantics if you want to define it as nonsense. >I mean I understand what you are saying you've done, but it makes no >sense according to the definitions of RAID5. There is no circumstance >where RAID5 falls back to mirroring. Trying to implement RAID5 as an >extension of a mirroring paradigm would involve a fundamental conflict >in definitions. Especially when you reached a failure mode. I have no idea what you mean by "a fundamental conflict in definition". >This is so fundamental to the design that the "fast" way to assemble a >RAID5 of N-arity (minimum N being 3) is to just connect the first N-1 >elements, declare the raid valid-but-degraded using (N-1) of the >media, and then "replacing" the Nth phantom/missing/failed element >with the real disk and triggering a rebuild. This only works if you >don't need the initial contents of the array to have a specific value >like zero. (This involves fewest reads and the array is instantly >available while it builds.) 
There is no reason you could not do exactly this with N=2. >As soon as you start writing to the array, the stripes you write >"repair" the extents if the repair process hadn't gotten to them yet. > >Its basically impossible to turn a mirror into a RAID5 if you _ever_ >expect the code base to to be able to recover an array that's lost an >element. Again, I'm not really sure what you mean. >Uh, no. A raid 6 with three drives, or even two drives, is also >degraded because the minimum is four. You're doing your weird semantic dance again. Just because you define the minimum to be four does not mean that someone talking about a three device RAID6 is talking about a degraded four device RAID6, they're not. As above, a non-degraded three-device RAID6 can be perfectly sensibly defined. Once again, it has exactly the same failure properties as a three device RAID1 (any two of the devices can fail), so it's a bit pointless. But not "impossible"... > >A B C D >D1 D2 Pa Qa >D3 Pb Qb D4 >Pc Qc D5 D6 >Qd D7 D8 Pd > > >You can lose one or two media but the minimum stripe is again [X1,X2] >for any read (ABCD)(ABC.)(AB..)(A..D) etc. > >Minimum arity for RAID6 is 4, maximum lost-but-functional >configuration is arity-minus-two. A B C D1 Pa Qa Pb Qb D2 Qc D3 Pc D4 Pd Qd >>They're only missing if you believe the minimum number of RAID5 disks >>is not two and the minimum number of RAID6 disks is not three. > >I do believe that, because that's what the terms are universally taken >to mean. > Apparently not universally. -- David Taylor ^ permalink raw reply [flat|nested] 11+ messages in thread
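The two RAID5 diagrams in the message above (three media and two media) follow the same rotating-parity rule, so both can be generated by one function. The sketch below is an illustration only: `raid5_layout` is a hypothetical helper, and the rule "parity for row r sits in column (N - 1 - r) mod N" is simply read off the diagrams in the thread, not taken from any btrfs or mdadm source.

```python
from typing import List


def raid5_layout(devices: int, rows: int) -> List[List[str]]:
    """Build a rotating-parity table in the style of the diagrams in the thread.

    Parity for row r sits in column (devices - 1 - r) mod devices; the
    remaining columns hold consecutively numbered data blocks.
    """
    if devices < 2:
        raise ValueError("need at least one data column and one parity column")
    table = []
    next_data = 1
    for r in range(rows):
        parity_col = (devices - 1 - r) % devices
        row = []
        for col in range(devices):
            if col == parity_col:
                row.append("P(%s)" % chr(ord("a") + r))
            else:
                row.append("D%d" % next_data)
                next_data += 1
        table.append(row)
    return table


for n in (3, 2):
    print("%d devices:" % n)
    for row in raid5_layout(n, 3):
        print("  " + " ".join(row))
```

With `n = 2` every row holds exactly one data block and one parity block, which is the layout David Taylor describes as equivalent to a two-disk RAID1.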
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 9:06 ` David Taylor
@ 2014-12-12 11:16 ` Robert White
2014-12-12 13:29 ` Hugo Mills
2014-12-13 3:01 ` Duncan
0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-12 11:16 UTC (permalink / raw)
To: Btrfs BTRFS

On 12/12/2014 01:06 AM, David Taylor wrote:
> The above quote is discussing two device RAID5, you are discussing
> three device RAID5.

Heresy! (yes, some humor is required here.)

There is no such thing as a "two device RAID5". That's what RAID1 is for.

Saying "The above quote is discussing a two device RAID5" is exactly
like saying "The above quote is discussing a two wheeled tricycle".

You might as well be talking about three-octet IP addresses. That is,
you could make a network address out of three octets, but it wouldn't
be an IP address. It would be something else with the wrong name
attached.

I challenge you... nay I _defy_ you... to find a single authority on
disk storage anywhere on this planet (except, apparently, this list and
its directly attached people and materials) that discusses, describes,
or acknowledges the existence of a "two device RAID5" while not
discussing a system with an arity of 3 degraded by the absence of one
media.

All these words have standardized definitions.

[That's not hyperbole. I searched for several hours and could not find
_any_ reference anywhere to construction of a RAID5 array using only two
devices that did not involve arity-3 and a dummy/missing/failed pseudo
target. So if you can find any reference to doing this _anywhere_
outside of BTRFS I'd like to see it. Genuinely.]

THAT SAID...

I really can find no reason the math wouldn't work using only two
drives. It would be a terrific waste of CPU cycles and storage space to
construct the stripe buffers and do the XORs instead of just copying the
data, but the math would work.

So, um, "well I'll be damned".

Perhaps it's just a tautological belief that someone here didn't buy
into. Like how people keep partitioning drives into little slices for
things because that's the preserved wisdom from the early eighties.

I think constructing a non-degraded-mode two device thing and calling it
RAID5 will surprise virtually _everyone_ on the planet.

In every other system, and I do mean _every_ other system, if I had two
media and I put them under RAID-5 I'd be required to specify the third
drive as some sort of failed device (the block device equivalent of
/dev/null but that returns error results for all operations instead of
successes). See the reserved keyword "missing" in the mdadm
documentation etc.

That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a 2TiB
array with no actual redundancy. As in

mdadm --create md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc

the resulting array would be the same effective size as a stripe of the
two drives, but when the third was added later it would just slot in as
a replacement for the missing device and the arity-3 thing would
"reestablish" its redundancy. (This is actually what mdadm does
internally with a normal build: it blesses the first N-1 drives into an
array with a missing member, and adds the Nth drive as a "spare" and
then the spare is immediately adopted as a replacement for the "missing"
drive.)

The parity computation on a single value is just a nutty waste of time
though. "Backing it out" when the array is degraded is double-nuts.

Maybe everybody just decided it was too crazy to consider for the CPU
time penalty...?

So yea, semantics... apparently...
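The capacity expectation in the message above comes down to which arity the array is declared with, not how many members are physically present. A minimal arithmetic sketch, assuming equal-sized members and the usual "one column's worth of space goes to parity" rule (`raid5_usable` is a hypothetical helper, not an mdadm or btrfs function):

```python
TIB = 2**40  # bytes in one TiB


def raid5_usable(arity: int, member_size: int) -> int:
    """Usable bytes of a RAID5 declared with `arity` equal-sized columns.

    One column's worth of space holds parity, so (arity - 1) columns hold data.
    The result depends on the declared arity, not on how many members are
    currently present.
    """
    if arity < 2:
        raise ValueError("RAID5 needs at least one data and one parity column")
    return (arity - 1) * member_size


# Arity 3 with one member "missing": two 1 TiB disks present, no redundancy left.
print(raid5_usable(3, TIB) // TIB)  # 2

# Arity 2, as mkfs.btrfs allows: same usable space as a two-disk mirror.
print(raid5_usable(2, TIB) // TIB)  # 1
```

The same two disks therefore give 2 TiB without redundancy under the degraded arity-3 reading and 1 TiB with redundancy under the arity-2 reading, which is the gap between the two positions in this thread.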
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 11:16 ` Robert White
@ 2014-12-12 13:29 ` Hugo Mills
2014-12-13 3:01 ` Duncan
1 sibling, 0 replies; 11+ messages in thread
From: Hugo Mills @ 2014-12-12 13:29 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS

On Fri, Dec 12, 2014 at 03:16:03AM -0800, Robert White wrote:
> On 12/12/2014 01:06 AM, David Taylor wrote:
> >The above quote is discussing two device RAID5, you are discussing
> >three device RAID5.
>
> Heresy! (yes, some humor is required here.)
>
> There is no such thing as a "two device RAID5". That's what RAID1 is for.
>
> Saying "The above quote is discussing a two device RAID5" is exactly
> like saying "The above quote is discussing a two wheeled tricycle".
>
> You might as well be talking about three-octet IP addresses. That is,
> you could make a network address out of three octets, but it
> wouldn't be an IP address. It would be something else with the
> wrong name attached.

OK. Sounds like I need to dust off the change-of-nomenclature patch
again.

The argument here is about the 1c1s1p configuration. Is there a
problem with that?

Hugo.

> I challenge you... nay I _defy_ you... to find a single authority on
> disk storage anywhere on this planet (except, apparently, this list
> and its directly attached people and materials) that discusses,
> describes, or acknowledges the existence of a "two device RAID5"
> while not discussing a system with an arity of 3 degraded by the
> absence of one media.
>
> All these words have standardized definitions.
>
> [That's not hyperbole. I searched for several hours and could not
> find _any_ reference anywhere to construction of a RAID5 array using
> only two devices that did not involve arity-3 and a
> dummy/missing/failed pseudo target. So if you can find any reference
> to doing this _anywhere_ outside of BTRFS I'd like to see it.
> Genuinely.]
>
> THAT SAID...
>
> I really can find no reason the math wouldn't work using only two
> drives. It would be a terrific waste of CPU cycles and storage space
> to construct the stripe buffers and do the XORs instead of just
> copying the data, but the math would work.
>
> So, um, "well I'll be damned".
>
> Perhaps it's just a tautological belief that someone here didn't buy
> into. Like how people keep partitioning drives into little slices
> for things because that's the preserved wisdom from the early eighties.
>
> I think constructing a non-degraded-mode two device thing and
> calling it RAID5 will surprise virtually _everyone_ on the planet.
>
> In every other system, and I do mean _every_ other system, if I had
> two media and I put them under RAID-5 I'd be required to specify the
> third drive as some sort of failed device (the block device equivalent
> of /dev/null but that returns error results for all operations
> instead of successes). See the reserved keyword "missing" in the
> mdadm documentation etc.
>
> That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a
> 2TiB array with no actual redundancy. As in
>
> mdadm --create md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc
>
> the resulting array would be the same effective size as a stripe of
> the two drives, but when the third was added later it would just
> slot in as a replacement for the missing device and the arity-3
> thing would "reestablish" its redundancy. (This is actually what
> mdadm does internally with a normal build: it blesses the first N-1
> drives into an array with a missing member, and adds the Nth drive
> as a "spare" and then the spare is immediately adopted as a
> replacement for the "missing" drive.)
>
> The parity computation on a single value is just a nutty waste of time
> though. "Backing it out" when the array is degraded is double-nuts.
>
> Maybe everybody just decided it was too crazy to consider for the
> CPU time penalty...?
>
> So yea, semantics... apparently...

-- 
Hugo Mills             | There's an infinite number of monkeys outside who
hugo@... carfax.org.uk | want to talk to us about this new script for Hamlet
http://carfax.org.uk/  | they've worked out!
PGP: 65E74AC0          |                                         Arthur Dent
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 11:16 ` Robert White
2014-12-12 13:29 ` Hugo Mills
@ 2014-12-13 3:01 ` Duncan
1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2014-12-13 3:01 UTC (permalink / raw)
To: linux-btrfs

Robert White posted on Fri, 12 Dec 2014 03:16:03 -0800 as excerpted:

> Perhaps it's just a tautological belief that someone here didn't buy into.
> Like how people keep partitioning drives into little slices for things
> because that's the preserved wisdom from the early eighties.

While I absolutely agree with your raid5 sentiments (which is exactly
what I suppose they might be; I'm getting a bit of an education in that
regard myself, here)...

In the context of the 80s, or even the 90s, nothing about multi-gigabyte
could be considered "little"! =:^)

In fact, while it most assuredly dates me, it /still/ feels a bit odd
referring to the 1 GiB btrfs default threshold for mixed-bg-mode as
"small", given that I distinctly remember wondering how long it might
take me to fill my first 1 GB (not GiB, unfortunately) drive, tho by
that time I did have enough experience to know I'd eventually be dealing
with multi-gig as at the time I was dealing with multi-meg.

More to the point, however...

Those partitions have saved my a** quite a few times over the years.

Among other things, partitioning allows me to keep my (8 GiB) rootfs an
entirely separate filesystem that's mounted read-only by default, which
has kept it undamaged and the tools on it still available to help
recover my other filesystems, when /var/log and /home were damaged due
to a hard shutdown recently.

And some years ago I had an AC failure here in Phoenix in the middle of
the summer, resulting in a physical head-crash and loss of the operating
partitions on my disk in use at the time, while the backup partitions on
the same device remained intact, such that after cooldown I actually
continued to use that disk for some time, mounting the damaged
partitions only to recover the most recent copies of what I could,
updating the backups which were now promoted to operational.

Sure, technology such as LVM can do similar and is more flexible in some
ways, but unfortunately it requires userspace and thus an initr* in
order to handle a root on the same technology. Otherwise, root must be
treated differently, and then you have partitioning again.

Additionally, LVM is yet another layer of software that can and does go
wrong and itself need fixing. Partitioning is too, to some extent, but
in practice it has been pretty bullet-proof compared to technologies
such as LVM and btrfs-subvolumes. LVM has some way to go before it's as
robust as partitioning, and of course btrfs with its subvolumes isn't
really even completely stable yet.

Further, btrfs doesn't well limit damage of a subvolume to just that
subvolume (that head-crash scenario would have almost certainly been a
total loss on btrfs subvolumes), the way partitioning tends to do. And
LVM's very flexibility means it doesn't normally have that sort of
damage limitation either. It certainly can, but doing so severely
reduces its flexibility, making going back to regular partitions to
avoid the complexity and additional points of failure entirely a rather
viable and often better choice.

Meanwhile, technology such as EFI and GPT is breathing new life into
partitioning, making it more reliable (checksummed redundant partition
tables), more useful/flexible (killing the primary/secondary/logical
divisions and adding partition names/labels and a far larger typing
space), and creating yet more uses for partitioning in the first place,
due to separate reserved EFI and legacy-BIOS partition types.

Tho of course these days those partition "slices" are often tens or
hundreds of gigs, and are now sometimes "teras"[1], bringing up my
initial point once again; that's NOT actually so small!

But to each his own, of course, and I definitely do agree with you on
raid5, the larger point. FWIW, I still consider allowing a two-device
"raid5" or a three-device "raid6" a bug, particularly given that a
single-device "raid1" is /not/ allowed, nor is a 3-device "raid10".

---
[1] Hmm, K, megs, gigs, "ters", "teras", simply "T" to match K ???

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 6:01 ` Robert White
2014-12-12 9:06 ` David Taylor
@ 2014-12-12 16:45 ` Zygo Blaxell
2014-12-12 22:28 ` Robert White
1 sibling, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-12 16:45 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS

On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> So RAID5 with three media M is
>
> M    MM   MMM
> D1   D2   P(a)
> D3   P(b) D4
> P(c) D5   D6

RAID5 with two media is well defined, and looks like this:

    M    MM
    D1   P(a)
    P(b) D2
    D3   P(c)

With even parity and N disks

    P(a) ^ D1 [^ D2 ^ ... ^ DN] = 0

Simplifying for one data disk and one parity stripe:

    P(a) ^ D1 = 0

therefore

    P(a) = D1

which is effectively (and, in practice, literally) mirroring.
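The identity in the message above (with even parity and one data block per stripe, P(a) = D1) can be checked directly, alongside the three-media reconstruction D4 = D3 ^ P(b) quoted earlier in the thread. This is a small sketch over in-memory byte strings; `xor_blocks` and `parity` are names made up for the example, and nothing here reflects how btrfs computes parity internally.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity(data_blocks: List[bytes]) -> bytes:
    """Even parity: chosen so that parity ^ D1 ^ ... ^ DN == 0."""
    return xor_blocks(data_blocks)


# Two media: one data block per stripe, so the parity block is a copy of it.
d1 = b"hello"
print(parity([d1]) == d1)  # True

# Three media: lose D4, rebuild it from the surviving data block and parity.
d3, d4 = b"\x0f\xf0", b"\x55\xaa"
p_b = parity([d3, d4])
print(xor_blocks([d3, p_b]) == d4)  # True
```

In the two-media case the XOR pass does real work only to arrive at a byte-for-byte copy, which is the "waste of CPU cycles" objection raised elsewhere in the thread.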
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 16:45 ` Zygo Blaxell
@ 2014-12-12 22:28 ` Robert White
2014-12-13 4:28 ` Zygo Blaxell
0 siblings, 1 reply; 11+ messages in thread
From: Robert White @ 2014-12-12 22:28 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
>> So RAID5 with three media M is
>>
>> M    MM   MMM
>> D1   D2   P(a)
>> D3   P(b) D4
>> P(c) D5   D6
>
> RAID5 with two media is well defined, and looks like this:
>
> M    MM
> D1   P(a)
> P(b) D2
> D3   P(c)

Like I said in the other fork of this thread... I see (now) that the
math works but I can find no trace of anyone having ever implemented
this for the arity less than 3, RAID greater than one paradigm (outside
btrfs and its associated materials).

It's like talking about a two-wheeled tricycle. 8-)

I would _genuinely_ like to see any third party discussion of this. It
just isn't done (probably because, as you've shown, it's just a really
complicated and CPU intensive way to end up with a simple mirror).

I spent several hours looking. I can see the math works, and I
understand what you are doing (as I said at some length in the
grandparent message) but it "just isn't done".

The reason I use the tricycle example is that, while most people know
this instinctively, few are aware of the fact that going from two wheels
to three-or-more wheels reverses the steering paradigm. On a bike you
push-left lean-left and go-left. At the higher arity vehicles (including
adding a side-car to a bike) you push-right go left (you lean left too,
but that's just to keep from nosing over 8-).

I find that quite apt in the whole RAID1 vs RAID5 discussion since the
former is about copying one-or-more times and the latter is about
starting with a theoretically zeroed buffer and doing reversible
checksumming into it.

I doubt that I will be the last person to be confused by BTRFS'
implementation of a two-wheeled tricycle.

You're going to get a lot of mail over the years. 8-)

MEANWHILE

the system really needs to be able to explicitly express and support the
"missing" media paradigm.

M    x MMM
D1   . P(a)
D3   . D4
P(c) . D6

The correct logic here to "remove" (e.g. "replace with nothing" instead
of "delete") a media just doesn't seem to exist. And it's already
painfully missing in the RAID1 situation.

If I have a system with N SATA ports, and I have connected N drives, and
device M is starting to fail... I need to be able to disconnect M and
then connect M(new). Possibly with a non-trivial amount of time in
there. For all RAID levels greater than zero this is a natural operation
in a degraded mode. And for a nearly full filesystem the shrink
operation that is btrfs device delete would not work. And for any
nontrivially occupied filesystem it would be way slow, and need to be
reversed for another way-slow interval.

So I need to be able to "replace" a drive with a "nothing" so that the
number of active media becomes N-1 but the arity remains N.

mdadm has the "missing" keyword. The Device Mapper has the "zero"
target.

As near as I can tell btrfs has got nothing in this functional slot.

Imagine, if you will, a block device that is the anti-/dev/null. All
operations on this block device return EFAULT. Let's call it
/dev/nothing.

And let's say I have a /dev/sdc that has to come out immediately (and
all my stuff is RAID1/5/6). The operational chain would be

btrfs replace start /dev/sdc /dev/nothing /
(time passes, physical device is removed and replaced)
btrfs replace start /dev/nothing /dev/sdc /

Now that's good-ish, but really the first replace is pernicious. The
internal state for the filesystem should just be able to record that
device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this
example) is just gone. The replace-with-nothing becomes more-or-less
instant.

The first replace is also pernicious if it's the second media failure on
a fully RAID6 array since that would be trying to put the same kernel
level device in the array twice.

The restore operation, the replace of the nothing with the something,
remains fully elaborate.

The "nothing" devices need to show up in the device id tables for a
running array in their geographically correct positions and all that.

Without this "missing" status as a first-class part of the system,
dealing with failures and communicating about those failures with the
operator will become vexatious.

[The use of "device delete" and "device add" as changes in arity and
size, and its inapplicability to cases where failure is being dealt with
absent a change of arity, could be clearer in the documentation.]
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 22:28 ` Robert White
@ 2014-12-13 4:28 ` Zygo Blaxell
0 siblings, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-13 4:28 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS

On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
> On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> >On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> >>So RAID5 with three media M is
> >>
> >>M    MM   MMM
> >>D1   D2   P(a)
> >>D3   P(b) D4
> >>P(c) D5   D6
> >
> >RAID5 with two media is well defined, and looks like this:
> >
> >M    MM
> >D1   P(a)
> >P(b) D2
> >D3   P(c)
>
> Like I said in the other fork of this thread... I see (now) that the
> math works but I can find no trace of anyone having ever implemented
> this for arity less than 3 RAID greater than one paradigm (outside
> btrfs and its associated materials).

I've set up mdadm that way (though it does ask you to use '--force'
when you set it up). mdadm will also ask for --force if you try to set
up RAID1 with one disk.

I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way
to change a layout once created (usually because they shoot themselves
in the foot with bad choices early on, e.g. by picking odd parity for
RAID5).

The reason to allow it is future expansion: below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later. If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5. So you start with RAID5
on one or two disks so you can scale up without losing any data.

Also, mdadm can grow a two-disk RAID5, but if you try to grow a two-disk
mdadm RAID1 you just get a three-disk RAID1 (i.e. two redundant copies
with no additional capacity).

btrfs doesn't really need this capability for expansion, since it can
just create new RAID5 profile chunks whenever it wants to; however, I'd
expect a complete btrfs RAID5 implementation to borrow some ideas from
ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing. That would imply
btrfs-RAID56 profile chunks would have to be able to exist on two or
even one disk, if that was all that was available for writing at the
time. Simply using btrfs-RAID1 chunks wouldn't work since they'd behave
the wrong way when more disks were added later.

> MEANWHILE
>
> the system really needs to be able to explicitly express and support
> the "missing" media paradigm.
>
> M    x MMM
> D1   . P(a)
> D3   . D4
> P(c) . D6
>
> The correct logic here to "remove" (e.g. "replace with nothing"
> instead of "delete") a media just doesn't seem to exist. And it's
> already painfully missing in the RAID1 situation.

There are a number of permanent mistakes a naive admin can make when
dealing with a broken array. I've destroyed arrays (made them
permanently read-only beyond the ability of btrfs kernel or user tools
to recover) by getting "add" and "replace" confused, or by allowing an
offline drive to rejoin an array that had been mounted
read-write,degraded for some time.

The basic functionality works. btrfs does track missing devices and can
replace them relatively quickly (not as fast as mdadm, but less than an
order of magnitude slower) in RAID1. The reporting is full of
out-of-date cached data, but when a disk is really failing, there is
usually little doubt which one needs to be replaced.

> If I have a system with N SATA ports, and I have connected N drives,
> and device M is starting to fail... I need to be able to disconnect
> M and then connect M(new). Possibly with a non-trivial amount of
> time in there. For all RAID levels greater than zero this is a
> natural operation in a degraded mode. And for a nearly full
> filesystem the shrink operation that is btrfs device delete would
> not work. And for any nontrivially occupied filesystem it would be
> way slow, and need to be reversed for another way-slow interval.
>
> So I need to be able to "replace" a drive with a "nothing" so that
> the number of active media becomes N-1 but the arity remains N.

btrfs already does that, but it sucks. In a naive RAID5 implementation,
a write in degraded mode will corrupt your data if it is interrupted.
This is a general property of all RAID5 implementations that don't have
NVRAM journalling or some other way to solve the atomic update problem.

ZFS does this well: when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks in
non-degraded mode. If you have 5 disks, and one dies, your writes are
then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing). This
prevents the degraded mode write data integrity problem.

When the dead disk is replaced you would have the 3 data + parity
promoted to 4 data + parity, or you can elect not to replace the dead
disk and get 3 data + parity everywhere (with a loss of capacity).

btrfs could presumably do that by allocating chunks with different
raid56 parameters, although in this early stage of implementation I'm
not sure how much of any of that has been done yet.

> mdadm has the "missing" keyword. the Device Mapper has the "zero"
> target.

dm also has the "ioerror" target, which is much better for this ("zero"
would allow reads to succeed, which is incorrect). lvm2 uses "ioerror"
for missing pieces of broken LVs in partial mode.

> btrfs replace start /dev/sdc /dev/nothing /
> (time passes, physical device is removed and replaced)
> btrfs replace start /dev/nothing /dev/sdc /

Why wouldn't you just remove the physical device (say device #2) and
then run:

    btrfs replace start 2 /dev/sdc /

? The way it works now seems much less complicated than what you
propose.

Granted, I have a feature request here: we know the sizes of all the
missing disks, and we know the size of /dev/sdc, so why can't we just
write "missing" instead of "2" and have btrfs choose a missing device to
replace by itself?

> Now that's good-ish, but really the first replace is pernicious. The
> internal state for the filesystem should just be able to record that
> device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this
> example) is just gone. The replace-with-nothing becomes more-or-less
> instant.

To clarify: what is required here is the ability to quickly record that
the device's subuuid is no longer welcome in this filesystem, and never
will be. Should it reappear in the future, it has to be excluded from
the btrfs. The underlying physical device could return, but it would
have to be treated as a new empty device with a new subuuid, and its
data reconstructed by btrfs balance or btrfs replace.

This is because btrfs does really awful things when a filesystem gets
assembled out of mirrors of different vintages. Before allowing writes
on a subset of the disks in a multi-disk btrfs, the disks that are
written have to agree that they are now the only disks that are
currently members of the filesystem.

> [The use of "device delete" and "device add" as changes in arity and
> size, and its inapplicability to cases where failure is being dealt
> with absent a change of arity, could be clearer in the
> documentation.]

Yes. This is _not_ equivalent to a btrfs replace, although it is very
similar:

    btrfs device add /dev/sdc /
    btrfs device delete missing /

It can work--sometimes--but it needs a surprising amount of free space
(or multiple new drives).
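The degraded-mode write problem described in the message above can be shown with one stripe of a three-media RAID5 whose second data member is missing. The sketch below is a toy model over byte strings, not btrfs, ZFS or mdadm code; `xor` and `reconstruct_missing` are hypothetical helpers, and the "crash" is simulated simply by pairing the new data block with the stale parity block.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_missing(surviving_data: bytes, parity: bytes) -> bytes:
    """Rebuild the block on the missing member from the two survivors."""
    return xor(surviving_data, parity)


# One stripe: D1 and D2 plus parity P. The member holding D2 is missing.
d1_old, d2 = b"AAAA", b"BBBB"
p_old = xor(d1_old, d2)
print(reconstruct_missing(d1_old, p_old) == d2)  # True: degraded reads work

# Overwriting D1 needs two separate writes: the new D1 and the new parity.
d1_new = b"CCCC"
p_new = xor(d1_new, d2)

# Interrupted after the data write but before the parity write:
# the stripe now pairs d1_new with p_old, and D2 no longer reconstructs.
print(reconstruct_missing(d1_new, p_old) == d2)  # False: D2 is corrupted

# If both writes complete, D2 is still recoverable.
print(reconstruct_missing(d1_new, p_new) == d2)  # True
```

Note that the block that gets damaged is D2, which was never written to; that is why the message treats atomic stripe updates (NVRAM journalling, or writing new data at a narrower non-degraded width as ZFS does) as a requirement and not an optimisation.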