* mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
@ 2014-12-10 22:18 Robert White
2014-12-11 7:33 ` Duncan
2014-12-12 3:56 ` Zygo Blaxell
0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-10 22:18 UTC (permalink / raw)
To: Btrfs BTRFS
So I started looking at the mkfs.btrfs manual page with an eye towards
documenting some of the tidbits like metadata automatically switching
from dup to raid1 when more than one device is used.
In experimenting I ended up with some questions...
(1) why is the dup profile for data restricted to only one device and
only if it's mixed mode?
Gust t # mkfs.btrfs -f /dev/loop{0..1} -d dup
Error: unable to create FS with data profile 16 (have 2 devices)
Gust t # mkfs.btrfs -f /dev/loop0 -d dup
Error: dup for data is allowed only in mixed mode
(2) why is metadata dup profile restricted to only one device on
creation when it will run that way just fine after a device add?
Gust t # mkfs.btrfs -f /dev/loop{0..1} -m dup
Error: unable to create FS with metadata profile 32 (have 2 devices)
(3) why can I make a raid5 out of two devices? (I understand that we are
currently just making mirrors, but the standard requires three devices
in the geometry etc. So I would expect a two device RAID5 to be
considered degraded with all that entails. It just looks like it's asking
for trouble to allow this once the support is finalized as suddenly a
working RAID5 that's really a mirror would become something that can only
be mounted with the degraded flag.)
Gust t # mkfs.btrfs -f /dev/loop{0..1} -d raid5 -m raid5
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.
Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file
to 65536
Turning ON incompat feature 'raid56': raid56 extended format
Performing full device TRIM (2.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 16384 leafsize 16384 sectorsize 4096 size 4.00GiB
(4) Same question for raid6 but with three drives instead of the
mandated four.
(5) If I can make a RAID5 or RAID6 device with one missing element, why
can't I make a RAID1 out of one drive, e.g. with one missing element?
(6) If I make a RAID1 out of three devices are there three copies of
every extent or are there always two copies that are semi-randomly
spread across three devices? (ibid for more than three).
---
It seems to me (very dangerous words in computer science, I know) that
we need a "failed" device designator so that a device can be in the
geometry (e.g. have a device ID) but not actually exist. Reads/writes to
the failed device would always be treated as error returns.
The failed device would be subject to replacement with "btrfs dev
replace", and could be the source of said replacement to drop a
problematic device out of an array.
EXAMPLE:
Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.
Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file
to 65536
Processing explicitly missing device
adding device (failed) id 2 (phantom device)
mount /dev/loop0 /mountpoint
btrfs replace start 2 /dev/loop1 /mountpoint
(and so on)
Being able to "replace" a faulty device with a phantom "failed" device
would nicely disambiguate the whole device add/remove versus replace
mistake.
It would make the degraded status less mysterious.
A filesystem with an explicitly failed element would also make the
future roll-out of full RAID5/6 less confusing.
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-10 22:18 mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?] Robert White
@ 2014-12-11 7:33 ` Duncan
2014-12-12 3:56 ` Zygo Blaxell
1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2014-12-11 7:33 UTC (permalink / raw)
To: linux-btrfs
Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:
> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits like metadata automatically switching
> from dup to raid1 when more than one device is used.
>
> In experimenting I ended up with some questions...
>
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?
> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?
1 and 2 together since they both deal with dup mode...
Dup mode was apparently originally considered purely an extra safeguard
for metadata in the single-device case, where it was made the default
(except for SSDs, which default to single mode metadata on a single-
device filesystem, because the FTL voids any guarantees on location
anyway, and because firmware such as sandforce compresses and dedups
anyway, in which case the hardware/firmware is subverting btrfs' efforts
to do dup anyway).
In the single-device case, two copies of data was considered simply not
worth the cost, due both to doubling the size (especially on SSD where
size is money!) and to the speed penalties on spinning rust due to seeks
between one 1-GiB data-chunk and its dup.
With multi-device, raid1 metadata, forcing one copy to each of two
different devices, was considered enough superior to make that the
default, since that provided device-loss resiliency for the all-important
metadata, thus enabling recovery of at least /some/ files even with a
device missing (single-mode data where the file's extents all happened to
be on available devices, plus of course raid1, etc, data). Further, dup-
mode metadata was considered a mistake it was better not to even have
available as an option, since loss of a single device would likely kill
the filesystem, which made dup mode little better than single mode,
without the doubled-size-cost. Further, on spinning rust there'd again
be the seek penalty, to little benefit since dup mode provides no
guarantees in case of device loss.
So multi-device defaults to raid1 metadata for safety, but single mode
metadata remains an option (along with raid0) if you really /don't/ care
about losing everything due to loss of a single device. Single-device
simply makes dup-mode available (and the default) for metadata, as a poor-
man's substitute for the safety of raid1, but single-device-metadata is
the only case where that poor-man's-raid1-substitute is worth the
(considered extreme) cost, with usage of that option not even available
on multi-device as it'd be a near-certain mistake, certainly at the mkfs
level. And dup mode isn't ordinarily available for data even on single-
device, because it's considered not worth the cost.
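The mkfs-time rules described above can be summarised in a small sketch. This is only a model of the behaviour reported in this thread (Btrfs v3.17.1), not the btrfs-progs code; the function names and the `is_ssd` and `mixed` parameters are hypothetical.

```python
# Hypothetical model of the profile rules described in this thread.
# It is NOT taken from btrfs-progs; names and parameters are illustrative.

def default_metadata_profile(num_devices: int, is_ssd: bool = False) -> str:
    """Default metadata profile as described in the message above."""
    if num_devices > 1:
        return "raid1"  # one copy on each of two different devices
    # Single device: dup, except on SSDs where single is the default.
    return "single" if is_ssd else "dup"


def check_profiles(num_devices: int, data: str, metadata: str,
                   mixed: bool = False) -> None:
    """Raise ValueError for combinations the thread reports mkfs rejecting."""
    if num_devices > 1 and "dup" in (data, metadata):
        raise ValueError("dup is only accepted on a single device")
    if data == "dup" and not mixed:
        raise ValueError("dup for data is allowed only in mixed mode")
```

For instance, `check_profiles(1, "dup", "dup", mixed=True)` passes in this model, which corresponds to the accidental dup-data route through mixed-bg-mode discussed below.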
As for dup-mode working after device-add, that's simply a necessary bit
in order for device add to work from a default-dup-mode single-device
at all. And it's only the existing metadata chunks on the original
device that will be dup-mode. Once a second device is added, additional
metadata chunks will be written in raid1 mode, forcing the two chunk
copies to different devices since there's multiple devices available to
allow that. The clear intent and recommendation is to do a rebalance
ASAP after a device add, to spread usage to the new device as
appropriate. And of course that rebalance will use the new raid1
metadata defaults, unless told otherwise of course, and I don't believe
dup mode is available to tell it otherwise there, either.
What all that original reasoning fails to account for, however, is the
btrfs data/metadata checksumming and integrity features and the very high
(which the original btrfs mode designers obviously considered extreme)
value some users (including me) place on them. A multi-device dup-mode-
metadata choice at mkfs is arguably still a mistake: it carries the cost
of raid1 metadata without the benefit, and close to the risk of single
metadata at double the size. But dup-mode data combined with btrfs
checksumming and data integrity features on a single device has strong
data integrity benefits that some would definitely consider worth it,
even at the additional cost in speed on spinning rust due to seeking, and
in size on expensive SSDs.
Meanwhile, mixed-bg-mode was an after-thought, added much later (after my
own btrfs journey began) in order to make working with small
filesystems reasonable. Before mixed-bg-mode, people attempting to use
btrfs on sub-GiB devices often found they couldn't use all available
space (often 25-50% wasted!) as the separate data/metadata chunk
allocation was simply too large grained to properly deal with the small
sizes involved.
And small filesystems really _was_ mixed-mode's _entire_ purpose. That
it could additionally be used to allow dup-data, using the ability to
specify mixed-bg-mode even on > 1 GiB filesystems where it wasn't the
default to get dup-data, was *ENTIRELY* an accident, not even considered
until a user figured it out, as confirmed by I believe it was Chris Mason
when directly asked at some point.
But now that mixed-mode is there and can be used to enable dup-mode data
too, for people that want it, and now that we know for sure such people
exist because we see mixed-bg mode being offered as a way to get exactly
that, dup-mode-data, there's little reason to remove the accidental
feature. =:^)
Meanwhile, now that demand is known to exist for dup-mode-data, I think
it probable that at some point code for that without having to force
mixed-bg-mode to get it will be made available and tested, much as other
features have been. But there's way more features left to implement than
time to implement them, at least with the current btrfs developer pool.
And given that mixed-bg-mode is available to deliver dup-mode-data for
those /really/ intent on having it, the priority of coding and testing
stand-alone-dup-mode-data is going to be relatively low, so I'd suggest
not expecting it any time soon -- maybe five years out, I don't see it
much sooner unless a dev (or dev sponsor) really gets that itch and
decides to priority scratch it.
> (3) why can I make a raid5 out of two devices?
> (4) Same question for raid6 but with three drives instead of the
> mandated four.
>
> (5) If I can make a RAID5 or RAID6 device with one missing element, why
> can't I make a RAID1 out of one drive, e.g. with one missing element?
AFAIK, the ability to mkfs raid56 modes with a missing device is a bug.
I'm not sure if it was known or not, tho I know there has been some
change in minimum number of devices over time and it might have gotten
caught in that, but I'd /guess/ that since raid56 isn't yet fully
supported, if the bug /was/ known, it had relatively low priority on the
fix-list compared to various other bugs with currently supported features.
If it is a bug as I believe it to be, that nullifies most of the
secondary questions you had...
> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).
Currently btrfs raid1 is defined very specifically as exactly two copies/
mirrors, regardless of whether there are two or two hundred devices in
the filesystem. More devices gives you more room; number of copies
remains two. This is covered in the wiki.
The feature known as N-way-mirroring is however on the roadmap -- for
just after raid56, since the planned implementation depends on some of
the same code.
This is actually a bit of a personal sore spot for me, since it has long
been my most-wished-for feature. When I first investigated btrfs now
years ago, I was running quad-way-mdraid-1, and was very disappointed to
see that btrfs only offered paired-raid1, since I wanted (and still want)
very much to be able to fall back more than once to additional copies,
should the checksum fail on the first N-1 copies.
And back then (kernel 3.5 era IIRC) it was already roadmapped immediately
after raid56 modes, which was to be introduced in another kernel cycle or
two, so I figured perhaps 3-4 cycles, maybe a year (~5 cycles) for N-way-
mirroring. But it seems as far out now as then, if not further since we
know how long raid56 is taking to complete, and two kernel cycles after
that for N-way-mirroring seems wildly optimistic, now. Maybe a year
after... if it's not too complicated.
But it's definitely on the roadmap, next thing to implement in fact, but
it's still right after raid56, and raid56 has of course been coming right
up since kernel 3.6 or whatever, at least.
But I'm not a dev so I can't help in that regard, tho I do use btrfs in
pair-way raid1 mode now, and try to help on the list where my knowledge
as a list regular and sysadmin using btrfs allows it. Someday that feature
will be available to play with... but that doesn't mean I can't enjoy
btrfs for what it has right now, nor does it mean I can't help others
with btrfs while I wait...
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-10 22:18 mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?] Robert White
2014-12-11 7:33 ` Duncan
@ 2014-12-12 3:56 ` Zygo Blaxell
2014-12-12 6:01 ` Robert White
1 sibling, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-12 3:56 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS
On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
> (3) why can I make a raid5 out of two devices? (I understand that we
> are currently just making mirrors, but the standard requires three
> devices in the geometry etc. So I would expect a two device RAID5 to
> be considered degraded with all that entails. It just looks like it's
> asking for trouble to allow this once the support is finalized as
> suddenly a working RAID5 that's really a mirror would become
> something that can only be mounted with the degraded flag.)
RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).
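A minimal sketch of the parity argument above, assuming plain byte-wise XOR parity. The helper name `xor_parity` is made up for illustration and says nothing about how btrfs lays out its stripes.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equal-length data blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = b"btrfs"

# Two devices: each stripe has one data block and one parity block,
# so the parity block is simply a copy of the data block.
assert xor_parity([data]) == data

# Three devices: two data blocks per stripe, and the parity now
# differs from both of them.
assert xor_parity([b"\x0f", b"\xf0"]) == b"\xff"
```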
> (4) Same question for raid6 but with three drives instead of the
> mandated four.
RAID6 with three devices should behave more or less like three-way RAID1,
except maybe the two parity disks might be different (I forget how the
function used to calculate the two parity stripes works, and whether
it can be defined such that F(disk1, disk2, disk3) == disk1).
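The open question above can be explored with a sketch, assuming the conventional RAID6 P/Q syndromes over GF(2^8) with generator 2 and reduction polynomial 0x11d, as used by Linux md. Whether a three-device btrfs raid6 is laid out exactly this way is an assumption, and the function names are illustrative.

```python
def gf_mul2(x: int) -> int:
    """Multiply by the generator 2 in GF(2^8), reduction polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x


def pq_syndromes(data_blocks):
    """Return (P, Q) for equal-length data blocks D_0 .. D_{n-1}.

    P = D_0 ^ D_1 ^ ...              (plain XOR parity)
    Q = D_0 ^ 2*D_1 ^ 4*D_2 ^ ...    (products taken in GF(2^8))
    """
    length = len(data_blocks[0])
    p = bytearray(length)
    q = bytearray(length)
    # Horner's rule: start at the highest-index block and work down to D_0.
    for block in reversed(data_blocks):
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


# One data block per stripe (the three-device case): P and Q both equal it.
d0 = b"\x80\x01\x7f"
assert pq_syndromes([d0]) == (d0, d0)

# Two data blocks per stripe (the four-device case): P and Q differ.
p, q = pq_syndromes([b"\x01", b"\x80"])
assert p != q
```

Under these assumptions the coefficient of D_0 in Q is 1, so a stripe with a single data block yields three identical blocks.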
> (5) If I can make a RAID5 or RAID6 device with one missing element,
> why can't I make a RAID1 out of one drive, e.g. with one missing
> element?
They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.
> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).
There are always two copies. RAID1 on 3x1TB disks gives you 1.5TB
of mirrored storage.
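The 1.5TB figure follows from a simple idealised calculation, sketched here under the assumption that every chunk is stored exactly twice on two different devices. Metadata overhead and chunk granularity are ignored, and `raid1_usable` is a hypothetical helper, not a btrfs tool.

```python
def raid1_usable(device_sizes):
    """Idealised usable capacity for two-copy mirroring across devices."""
    total = sum(device_sizes)
    # Both copies of a chunk must sit on different devices, so the largest
    # device can hold no more than the other devices together can mirror.
    return min(total / 2, total - max(device_sizes))


print(raid1_usable([1, 1, 1]))  # three 1TB devices
print(raid1_usable([1, 1, 4]))  # one oversized device is only partly usable
```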
> ---
>
> It seems to me (very dangerous words in computer science, I know)
> that we need a "failed" device designator so that a device can be in
> the geometry (e.g. have a device ID) but not actually exist.
> Reads/writes to the failed device would always be treated as error
> returns.
>
> The failed device would be subject to replacement with "btrfs dev
> replace", and could be the source of said replacement to drop a
> problematic device out of an array.
>
> EXAMPLE:
> Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
> Btrfs v3.17.1
> See http://btrfs.wiki.kernel.org for more information.
>
> Performing full device TRIM (2.00GiB) ...
> Turning ON incompat feature 'extref': increased hardlink limit per
> file to 65536
> Processing explicitly missing device
> adding device (failed) id 2 (phantom device)
>
> mount /dev/loop0 /mountpoint
>
> btrfs replace start 2 /dev/loop1 /mountpoint
>
> (and so on)
>
> Being able to "replace" a faulty device with a phantom "failed"
> device would nicely disambiguate the whole device add/remove versus
> replace mistake.
It is a little odd that an array of 3 disks with one missing looks
like this:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
devid 3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
In the above, "vgtester-d02" was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...
> It would make the degraded status less mysterious.
The 'degraded' status currently protects against some significant data
corruption risks. :-O
> A filesystem with an explicitly failed element would also make the
> future roll-out of full RAID5/6 less confusing.
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 3:56 ` Zygo Blaxell
@ 2014-12-12 6:01 ` Robert White
2014-12-12 9:06 ` David Taylor
2014-12-12 16:45 ` Zygo Blaxell
0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-12 6:01 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS
On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
> On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
>> (3) why can I make a raid5 out of two devices? (I understand that we
>> are currently just making mirrors, but the standard requires three
>> devices in the geometry etc. So I would expect a two device RAID5 to
>> be considered degraded with all that entails. It just looks like it's
>> asking for trouble to allow this once the support is finalized as
>> suddenly a working RAID5 that's really a mirror would become
>> something that can only be mounted with the degraded flag.)
>
> RAID5 with even parity and two devices should be exactly the same as
> RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
> is irrelevant because there is no difference in disk contents so the
> disks are interchangeable), except with different behavior when more
> devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
> should start writing new chunks with N stripes instead of two).
That's not correct. A RAID5 with three elements presents two _different_
sectors in each stripe. When one element is lost, it would still present
two different sectors, but the safety is gone.
I understand that the XOR collapses into a mirror if only two data items are
involved, but that's a mathematical fact that is irrelevant to the
definition of a RAID5 layout. When you take a wheel off of a tricycle it
doesn't just become a bike. And you can't make a bicycle into a trike by
just welding on a wheel somewhere. The infrastructure of the two is
completely different.
So RAID5 with three media M is
M MM MMM
D1 D2 P(a)
D3 P(b) D4
P(c) D5 D6
If MMM is lost D1, D2, D3, and D5 are intact
D4 and D6 can be recreated via D3^P(b) and P(c)^D5
M MM X
D1 D2 .
D3 P(b) .
P(c) D5 .
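The recovery arithmetic for the three-device layout above can be sketched as follows. The block contents are made up, and the lists `M`, `MM` and `MMM` simply mirror the columns of the diagram.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


# Six made-up data blocks, laid out as in the diagram above.
D1, D2, D3, D4, D5, D6 = (bytes([n] * 4) for n in (0x11, 0x22, 0x33, 0x44, 0x55, 0x66))
Pa, Pb, Pc = xor(D1, D2), xor(D3, D4), xor(D5, D6)

M = [D1, D3, Pc]
MM = [D2, Pb, D5]
MMM = [Pa, D4, D6]  # this device is lost

# Rebuild the lost column row by row from the two surviving devices:
# row 1 gives P(a), row 2 gives D4 = D3 ^ P(b), row 3 gives D6 = P(c) ^ D5.
rebuilt = [xor(a, b) for a, b in zip(M, MM)]
assert rebuilt == MMM
```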
So under _no_ circumstances would a two-disk RAID5 be the same as a
RAID1 since a two disk RAID5 functionally implies disk three because the
_minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
protection because the minimum third element is a computational phantom.
In short it is irrational to have a "two disk" RAID5 that is "not
degraded" in the same way you cannot have a two-wheeled tricycle without
scraping some part of something along the asphalt.
A RAID1 with two elements presents one sector along the "stripe".
I realize that what has been implemented is what you call a two drive
RAID5, and done so by really implementing a RAID1, but it's nonsense.
I mean I understand what you are saying you've done, but it makes no
sense according to the definitions of RAID5. There is no circumstance
where RAID5 falls back to mirroring. Trying to implement RAID5 as an
extension of a mirroring paradigm would involve a fundamental conflict
in definitions. Especially when you reached a failure mode.
This is so fundamental to the design that the "fast" way to assemble a
RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
elements, declare the raid valid-but-degraded using (N-1) of the media,
and then "replacing" the Nth phantom/missing/failed element with the
real disk and triggering a rebuild. This only works if you don't need
the initial contents of the array to have a specific value like zero.
(This involves fewest reads and the array is instantly available while
it builds.)
As soon as you start writing to the array, the stripes you write
"repair" the extents if the repair process hadn't gotten to them yet.
It's basically impossible to turn a mirror into a RAID5 if you _ever_
expect the code base to be able to recover an array that's lost an
element.
>> (4) Same question for raid6 but with three drives instead of the
>> mandated four.
>
> RAID6 with three devices should behave more or less like three-way RAID1,
> except maybe the two parity disks might be different (I forget how the
> function used to calculate the two parity stripes works, and whether
> it can be defined such that F(disk1, disk2, disk3) == disk1).
Uh, no. A raid 6 with three drives, or even two drives, is also degraded
because the minimum is four.
A B C D
D1 D2 Pa Qa
D3 Pb Qb D4
Pc Qc D5 D6
Qd D7 D8 Pd
You can lose one or two media but the minimum stripe is again [X1,X2]
for any read (ABCD)(ABC.)(AB..)(A..D) etc.
Minimum arity for RAID6 is 4, maximum lost-but-functional configuration
is arity-minus-two.
>
>> (5) If I can make a RAID5 or RAID6 device with one missing element,
>> why can't I make a RAID1 out of one drive, e.g. with one missing
>> element?
>
> They're only missing if you believe the minimum number of RAID5 disks
> is not two and the minimum number of RAID6 disks is not three.
I do believe that, because that's what the terms are universally taken
to mean.
If what BTRFS is promising/planning as raid5 will run non-degraded on
two disks it's... something... but it's not RAID5.
If what BTRFS is promising/planning as raid6 will run non-degraded on
three disks it's... something... but it's not RAID6.
>
>> (6) If I make a RAID1 out of three devices are there three copies of
>> every extent or are there always two copies that are semi-randomly
>> spread across three devices? (ibid for more than three).
>
> There are always two copies. RAID1 on 3x1TB disks gives you 1.5TB
> of mirrored storage.
>> ---
>>
>> It seems to me (very dangerous words in computer science, I know)
>> that we need a "failed" device designator so that a device can be in
>> the geometry (e.g. have a device ID) but not actually exist.
>> Reads/writes to the failed device would always be treated as error
>> returns.
>>
>> The failed device would be subject to replacement with "btrfs dev
>> replace", and could be the source of said replacement to drop a
>> problematic device out of an array.
>>
>> EXAMPLE:
>> Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
>> Btrfs v3.17.1
>> See http://btrfs.wiki.kernel.org for more information.
>>
>> Performing full device TRIM (2.00GiB) ...
>> Turning ON incompat feature 'extref': increased hardlink limit per
>> file to 65536
>> Processing explicitly missing device
>> adding device (failed) id 2 (phantom device)
>>
>> mount /dev/loop0 /mountpoint
>>
>> btrfs replace start 2 /dev/loop1 /mountpoint
>>
>> (and so on)
>>
>> Being able to "replace" a faulty device with a phantom "failed"
>> device would nicely disambiguate the whole device add/remove versus
>> replace mistake.
>
> It is a little odd that an array of 3 disks with one missing looks
> like this:
It's correct for a three disk array with one "failed" (e.g. where
vgtester-d04 is present but bad); it's wrong for a _four_ disk array
where one disk (vgtester-d03) has been unplugged or is otherwise missing
(as opposed to "deleted").
The entire idea of "three disk array with one missing" doesn't match
your example below, which is in fact a three disk array with all
elements present. Your example below started out as a four disk array
and then you deleted one, making it a three disk array. The point at
issue would be a four-disk array with one missing. So there'd be four lines.
E.g. a four disk array with one missing _ought_ to look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
devid 3 size 4.00GiB used 0.00B (missing) path ???
devid 4 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
The problem here is that the concept of "missing" is, um, missing from
BTRFS statuses.
For instance the same idea I presume you are going for would be
expressed in mdadm as with the little status array "UU.U" for "up, up,
missing, and up".
BTRFS _should_ (big words from a noob, I know) have and display the
arity of the array with the correct number of expected disks, filled out
with the information of the available disks.
Were this correct there would be a covariant line for 3 and vgtester-d04
would be devid 4 like I did to it above.
>
> Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
> Total devices 3 FS bytes used 256.00KiB
> devid 1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
> devid 2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
> devid 3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04
That's not odd at all. sort of... (simplified to three lines and names
changed because of word-wrap here... )
An array of three disks with one missing should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used 0.00B (missing)
because, you know, it's like... missing...
An array of three disks with one _failed_ should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used 0.00B (failed) path /dev/sdc
An array of three disks with one freshly replacing a previously missing
or failed should look like:
Label: none uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
Total devices 3 FS bytes used 256.00KiB
devid 1 size 4.00GiB used 847.12MiB path /dev/sda
devid 2 size 4.00GiB used 827.12MiB path /dev/sdb
devid 3 size 4.00GiB used x.xxxB (rebuilding) path /dev/sdc
With the used value growing with each subsequent repeat of the inquiry
until they all had the same numbers.
And I don't know what the status would look like during a "replace" but
there'd temporarily be a fourth disk in the list, one being a donor and
one being the new replacement.
That's _exactly_ what a RAID5 with a degraded, failed, or missing member
should look like. For any extent (A,B,C) any one column (A), (B), or (C)
can be missing -- shown as (.) such that for a chunk size X there is
always a return stripe [X1,X2] (e.g. the stripe size is _always_ the
arity minus one, and the minimum arity is three) returned by any legal
read (A,B,C) == (.,B,C) == (A,.,C) == (A,B,.); it is this property that
provides the redundancy.
So nominally, the above would result in all reads [X1,X2] being a result
of (A,B,.) or by device ID (1,2,.). And each read of (1,2,.) would
provide the opportunity to repair ID 3's chunk.
The subsequent activity, especially a balance/repair operation would be
repopulating /dev/mapper/vgtester-d04 to reestablish the parity.
Similarly all writes to a valid extent require two reads, and two writes
minimum. If you have the parity and the target block in memory (that's
the two reads), you xor-out the original contents of target block,
xor-in the new contents of target block, then you have to write _both_
the target block and the parity block (preferably in one transaction).
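The read-modify-write update described above can be sketched like this, assuming plain XOR parity. `rmw_update` is an illustrative name, and the check at the end confirms that the shortcut matches a full parity recompute.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_update(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Return the new parity after overwriting one data block.

    xor-out the old contents of the target block, then xor-in the new ones.
    """
    return xor(xor(old_parity, old_data), new_data)


# One stripe of a three-device RAID5: two data blocks and their parity.
D1, D2 = b"\x11\x11", b"\x22\x22"
P = xor(D1, D2)

# Two reads (D1 and P) and two writes (new D1 and new P); D2 is never read.
new_D1 = b"\xab\xcd"
new_P = rmw_update(D1, P, new_D1)

# Same result as recomputing the parity from every data block.
assert new_P == xor(new_D1, D2)
```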
In a degraded RAID5, if you are writing to a "missing" block, you have
to read all blocks in the stripe, calculate the missing block, xor the
calculated block out of the parity block, then xor the new block into
the parity and write the parity block back out. If the replacement drive
is installed and active you can also then just write the new block there
as well and the block stripe is now no longer degraded.
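The degraded write path described above can be sketched in the same way, again assuming plain XOR parity. `degraded_write` and its parameters are hypothetical names used only for illustration.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def degraded_write(surviving_data, parity: bytes, new_block: bytes) -> bytes:
    """Return the new parity when the target block lives on a missing device."""
    # Reconstruct the old contents of the missing block from what survives.
    old_block = parity
    for block in surviving_data:
        old_block = xor(old_block, block)
    # xor the calculated block out of the parity, then xor the new block in.
    return xor(xor(parity, old_block), new_block)


# Three-device stripe (D1, D2, P) where the device holding D2 is missing.
D1, old_D2 = b"\x10\x20", b"\x0a\x0b"
P = xor(D1, old_D2)

new_D2 = b"\xfe\xed"
new_P = degraded_write([D1], P, new_D2)

# A later degraded read reconstructs the block that was just written.
assert xor(D1, new_P) == new_D2
```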
This is core paradigm for RAID5.
And you need the "empty" device ID: in the missing case it would cause no-op
read and write events/errors but allow the spanning logic to remain
intact, and that logic is necessary for rational recovery when it ends up
being
1,2,3-is-bad
1,2,3-is-bad
1,2,3
1,2,3
1,2,3-is-bad
As in this case stripes zero, one, and four are improper, still missing,
whatever and stripes two and three have been balanced/scrubbed back into
good order.
It's particularly important and valuable to have that device ID allocated
"failed" in a replace scenario where the logic is now ready to keep the
good stuff (the extents for tracks 2 and 3 for example) and only
recalculate the bad.
>
> In the above, "vgtest-d02" was a deleted LV and does not exist, but
> you'd never know that from the output of 'btrfs fi show'...
That would be because "deleted" and "failed" are two inherently
different conditions and BTRFS doesn't have the ID smarts for a "failed"
device to be present in the map.
>
>> It would make the degraded status less mysterious.
>
> The 'degraded' status currently protects against some significant data
> corruption risks. :-O
>
>> A filesystem with an explicitly failed element would also make the
>> future roll-out of full RAID5/6 less confusing.
I also still don't get why the RAID1 with arity greater than two was at
all hard to construct. It would have been my first step on the way to
RAID5/6
A
D1
D2
D3
A B
D1 D1
D2 D2
D3 D3
A B C
D1 D1 D1
D2 D2 D2
D3 D3 D3
Is the logical progression right before
A B C
D1 D2 Pa
D3 Pb D4
Pc D5 D6
Until you have the code base and data structures to "search past B" in a
mirror of arbitrary arity, you just don't have the means to organize the
horizontal stripe-as-entity needed to record the arbitrarily wide
stripes you need to make a higher-order RAID.
And before _any_ of that you need to be able to explicitly account for a
missing drive such that you have a RAID1 of
A x
D1 .
D2 .
D3 .
For all possible read and write events. Without that your rebuild of any
RAID is "iffy". If you are not ready for
A x C
D1 . D1
D2 . D2
D3 . D3
then
A x C
D1 . Pa
D3 . D4
Pc . D6
Is going to ruin your world.
I don't know how to turn this into proper BTRFS speak since I am still
new to the code base...
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 6:01 ` Robert White
@ 2014-12-12 9:06 ` David Taylor
2014-12-12 11:16 ` Robert White
2014-12-12 16:45 ` Zygo Blaxell
1 sibling, 1 reply; 11+ messages in thread
From: David Taylor @ 2014-12-12 9:06 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS
On Thu, 11 Dec 2014, Robert White wrote:
>On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
>>
>>RAID5 with even parity and two devices should be exactly the same as
>>RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
>>is irrelevant because there is no difference in disk contents so the
>>disks are interchangeable), except with different behavior when more
>>devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
>>should start writing new chunks with N stripes instead of two).
>
>That's not correct. A RAID5 with three elements presents two
>_different_ sectors in each stripe. When one element is lost, it would
>still present two different sectors, but the safety is gone.
The above quote is discussing two device RAID5, you are discussing
three device RAID5.
>I understand that the XOR collapses into a mirror if only two datum
>are involved, but that's a mathematical fact that is irrelevant to the
>definition of a RAID5 layout. When you take a wheel off of a tricycle
>it doesn't just become a bike. And you can't make a bicycle into a
>trike by just welding on a wheel somewhere. The infrastructure of the
>two is completely different.
True. A two-device RAID5 is not the same as a degraded three-device
RAID5.
>So RAID5 with three media M is
>
>M MM MMM
>D1 D2 P(a)
>D3 P(b) D4
>P(c) D5 D6
>
>If MMM is lost D1, D2, D3, and D5 are intact
>D4 and D6 can be recreated via D3^P(b) and P(c)^D5
>
>M MM X
>D1 D2 .
>D3 P(b) .
>P(c) D5 .
>So under _no_ circumstances would a two-disk RAID5 be the same as a
>RAID1 since a two disk RAID5 functionally implies disk three because
>the _minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
>protection because the minimum third element is a computational
>phantom.
You again seem to be treating a "two disk RAID5" as synonymous with your
degraded three disk RAID5 above. It is not.
RAID5 with two media M would be:
M MM
D1 P(a)
P(b) D2
D3 P(c)
[and each P would be identical to its corresponding D]
>In short it is irrational to have a "two disk" RAID5 that is "not
>degraded" in the same way you cannot have a two-wheeled tricycle
>without scraping some part of something along the asphalt.
There is nothing irrational about it at all, except that it is
exactly equivalent to two disk RAID1.
>A RAID1 with two elements presents one sector along the "stripe".
A RAID5 with N elements presents N-1 sectors along the "stripe",
so I'm not sure what the problem is with setting N=2.
>I realize that what has been implemented is what you call a two drive
>RAID5, and done so by really implementing a RAID1, but it's nonsense.
It's not really, it's merely an argument of semantics if you want
to define it as nonsense.
>I mean I understand what you are saying you've done, but it makes no
>sense according to the definitions of RAID5. There is no circumstance
>where RAID5 falls back to mirroring. Trying to implement RAID5 as an
>extension of a mirroring paradigm would involve a fundamental conflict
>in definitions. Especially when you reached a failure mode.
I have no idea what you mean by "a fundamental conflict in definition".
>This is so fundamental to the design that the "fast" way to assemble a
>RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
>elements, declare the raid valid-but-degraded using (N-1) of the
>media, and then "replacing" the Nth phantom/missing/failed element
>with the real disk and triggering a rebuild. This only works if you
>don't need the initial contents of the array to have a specific value
>like zero. (This involves fewest reads and the array is instantly
>available while it builds.)
There is no reason you could not do exactly this with N=2.
>As soon as you start writing to the array, the stripes you write
>"repair" the extents if the repair process hadn't gotten to them yet.
>
>It's basically impossible to turn a mirror into a RAID5 if you _ever_
>expect the code base to be able to recover an array that's lost an
>element.
Again, I'm not really sure what you mean.
>Uh, no. A raid 6 with three drives, or even two drives, is also
>degraded because the minimum is four.
You're doing your weird semantic dance again. Just because you
define the minimum to be four does not mean that someone talking
about a three device RAID6 is talking about a degraded four device
RAID6, they're not.
As above, a non-degraded three-device RAID6 can be perfectly
sensibly defined. Once again, it has exactly the same failure
properties as a three device RAID1 (any two of the devices can
fail), so it's a bit pointless. But not "impossible"...
>
>A B C D
>D1 D2 Pa Qa
>D3 Pb Qb D4
>Pc Qc D5 D6
>Qd D7 D8 Pd
>
>
>You can lose one or two media but the minimum stripe is again [X1,X2]
>for any read (ABCD)(ABC.)(AB..)(A..D) etc.
>
>Minimum arity for RAID6 is 4, maximum lost-but-functional
>configuration is arity-minus-two.
A B C
D1 Pa Qa
Pb Qb D2
Qc D3 Pc
D4 Pd Qd
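The layouts drawn in this thread all follow one rotation rule, which a short Python sketch can reproduce for any device count. The function name `layout` and the compact labels (`Pa` rather than `P(a)`) are assumptions for illustration; the rule was read off the diagrams here and is not taken from the btrfs or mdadm sources.

```python
def layout(devices, parity, rows):
    """Build a rotating-parity table like the ones drawn in this thread.

    parity=1 gives the RAID5 pictures, parity=2 the RAID6 ones. The parity
    cells start in the rightmost columns and move one column left per row,
    wrapping around; data fills the remaining cells left to right.
    """
    if not 1 <= parity <= 2 or devices <= parity:
        raise ValueError("need at least one data column per row")
    table = []
    next_data = 1
    for r in range(rows):
        tag = chr(ord("a") + r)              # row label: a, b, c, ...
        first = (devices - parity - r) % devices
        row = [None] * devices
        for k in range(parity):
            row[(first + k) % devices] = "PQ"[k] + tag
        for c in range(devices):
            if row[c] is None:
                row[c] = "D%d" % next_data
                next_data += 1
        table.append(row)
    return table


for row in layout(devices=3, parity=2, rows=4):
    print(" ".join(row))
```

Calling `layout(2, 1, 3)` gives the two-device RAID5 picture and `layout(4, 2, 4)` the four-device RAID6 one, so nothing in the rule itself singles out a minimum arity.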
>>They're only missing if you believe the minimum number of RAID5 disks
>>is not two and the minimum number of RAID6 disks is not three.
>
>I do believe that, because that's what the terms are universally taken
>to mean.
>
Apparently not universally.
--
David Taylor
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 9:06 ` David Taylor
@ 2014-12-12 11:16 ` Robert White
2014-12-12 13:29 ` Hugo Mills
2014-12-13 3:01 ` Duncan
0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-12 11:16 UTC (permalink / raw)
To: Btrfs BTRFS
On 12/12/2014 01:06 AM, David Taylor wrote:
> The above quote is discussing two device RAID5, you are discussing
> three device RAID5.
Heresy! (yes, some humor is required here.)
There is no such thing as a "two device RAID5". That's what RAID1 is for.
Saying "The above quote is discussing a two device RAID5" is exactly
like saying "The above quote is discussing a two wheeled tricycle".
You might as well be talking about three-octet IP addresses. That is, you
could make a network address out of three octets, but it wouldn't be an
IP address. It would be something else with the wrong name attached.
I challenge you... nay I _defy_ you... to find a single authority on
disk storage anywhere on this planet (except, apparently, this list and
its directly attached people and materials) that discusses, describes,
or acknowledges the existence of a "two device RAID5" while not
discussing a system with an arity of 3 degraded by the absence of one media.
All these words have standardized definitions.
[That's not hyperbole. I searched for several hours and could not find
_any_ reference anywhere to construction of a RAID5 array using only two
devices that did not involve arity-3 and a dummy/missing/failed pseudo
target. So if you can find any reference to doing this _anywhere_
outside of BTRFS I'd like to see it. Genuinely.]
THAT SAID...
I really can find no reason the math wouldn't work using only two
drives. It would be a terrific waste of CPU cycles and storage space to
construct the stripe buffers and do the XORs instead of just copying the
data, but the math would work.
So, um, "well I'll be damned".
Perhaps it is just a tautological belief that someone here didn't buy into.
Like how people keep partitioning drives into little slices for things
because that's the preserved wisdom from the early eighties.
I think constructing a non-degraded-mode two device thing and calling it
RAID5 will surprise virtually _everyone_ on the planet.
In every other system, and I do mean _every_ other system, if I had two
media and I put them under RAID-5 I'd be required to specify the third
drive as some sort of failed device (the block device equivalent of
/dev/null but that returns error results for all operations instead of
successes). See the reserved keyword "missing" in the mdadm
documentation etc.
That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a 2TiB
array with no actual redundancy. As in
mdadm --create md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc
the resulting array would be the same effective size as a stripe of the
two drives, but when the third was added later it would just slot in as
a replacement for the missing device and the arity-3 thing would
"reestablish" its redundancy. (This is actually what mdadm does
internally with a normal build: it blesses the first N-1 drives into an
array with a missing member, adds the Nth drive as a "spare", and
then the spare is immediately adopted as a replacement for the "missing"
drive.)
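The two things being argued about can be put side by side with a little arithmetic. This is a minimal Python sketch, assuming equal-sized members and the usual single-parity rule that one member's worth of space goes to parity; `raid5_summary` is a made-up helper, not an mdadm or btrfs interface.

```python
TIB = 2 ** 40


def raid5_summary(arity, missing, member_bytes):
    """Return (usable_bytes, further_failures_tolerated) for a RAID5.

    `arity` is the number of members the layout was built for and
    `missing` is how many of them are currently absent (0 or 1).
    """
    if arity < 2 or missing not in (0, 1):
        raise ValueError("single-parity arrays survive at most one missing member")
    usable = (arity - 1) * member_bytes   # one member's worth holds parity
    return usable, 1 - missing


# Two disks, arity 2, nothing missing: 1 TiB usable, one failure survivable.
print(raid5_summary(2, 0, TIB))
# Two disks, arity 3, one "missing": 2 TiB usable, no redundancy left.
print(raid5_summary(3, 1, TIB))
```

Both cases use the same two physical disks; what differs is the arity recorded for the array.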
The parity computation on a single value is just a nutty waste of time
though. "Backing it out" when the array is degraded is double-nuts.
Maybe everybody just decided it was too crazy to consider for the CPU
time penalty...?
So yeah, semantics... apparently...
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 11:16 ` Robert White
@ 2014-12-12 13:29 ` Hugo Mills
2014-12-13 3:01 ` Duncan
1 sibling, 0 replies; 11+ messages in thread
From: Hugo Mills @ 2014-12-12 13:29 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS
On Fri, Dec 12, 2014 at 03:16:03AM -0800, Robert White wrote:
> On 12/12/2014 01:06 AM, David Taylor wrote:
> >The above quote is discussing two device RAID5, you are discussing
> >three device RAID5.
>
> Heresy! (yes, some humor is required here.)
>
> There is no such thing as a "two device RAID5". That's what RAID1 is for.
>
> Saying "The above quote is discussing a two device RAID5" is exactly
> like saying "The above quote is discussing a two wheeled tricycle".
>
> You might as well be talking about three-octet IP addresses. That is,
> you could make a network address out of three octets, but it
> wouldn't be an IP address. It would be something else with the
> wrong name attached.
OK. Sounds like I need to dust off the change-of-nomenclature patch
again.
The argument here is about the 1c1s1p configuration. Is there a
problem with that?
Hugo.
> I challenge you... nay I _defy_ you... to find a single authority on
> disk storage anywhere on this planet (except, apparently, this list
> and its directly attached people and materials) that discusses,
> describes, or acknowledges the existence of a "two device RAID5"
> while not discussing a system with an arity of 3 degraded by the
> absence of one media.
>
> All these words have standardized definitions.
>
> [That's not hyperbole. I searched for several hours and could not
> find _any_ reference anywhere to construction of a RAID5 array using
> only two devices that did not involve arity-3 and a
> dummy/missing/failed pseudo target. So if you can find any reference
> to doing this _anywhere_ outside of BTRFS I'd like to see it.
> Genuinely.]
>
> THAT SAID...
>
> I really can find no reason the math wouldn't work using only two
> drives. It would be a terrific waste of CPU cycles and storage space
> to construct the stripe buffers and do the XORs instead of just
> copying the data, but the math would work.
>
> So, um, "well I'll be damned".
>
> Perhaps it is just a tautological belief that someone here didn't buy
> into. Like how people keep partitioning drives into little slices
> for things because that's the preserved wisdom from the early eighties.
>
> I think constructing a non-degraded-mode two device thing and
> calling it RAID5 will surprise virtually _everyone_ on the planet.
>
> In every other system, and I do mean _every_ other system, if I had
> two media and I put them under RAID-5 I'd be required to specify the
> third drive as some sort of failed device (the block device equivalent
> of /dev/null but that returns error results for all operations
> instead of successes). See the reserved keyword "missing" in the
> mdadm documentation etc.
>
> That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a
> 2TiB array with no actual redundancy. As in
>
> mdadm --create md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc
>
> the resulting array would be the same effective size as a stripe of
> the two drives, but when the third was added later it would just
> slot in as a replacement for the missing device and the arity-3
> thing would "reestablish" its redundancy. (This is actually what
> mdadm does internally with a normal build: it blesses the first N-1
> drives into an array with a missing member, adds the Nth drive
> as a "spare", and then the spare is immediately adopted as a
> replacement for the "missing" drive.)
>
> The parity computation on a single value is just a nutty waste of time
> though. "Backing it out" when the array is degraded is double-nuts.
>
> Maybe everybody just decided it was too crazy to consider for the
> CPU time penalty...?
>
> So yeah, semantics... apparently...
--
Hugo Mills | There's an infinite number of monkeys outside who
hugo@... carfax.org.uk | want to talk to us about this new script for Hamlet
http://carfax.org.uk/ | they've worked out!
PGP: 65E74AC0 | Arthur Dent
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 6:01 ` Robert White
2014-12-12 9:06 ` David Taylor
@ 2014-12-12 16:45 ` Zygo Blaxell
2014-12-12 22:28 ` Robert White
1 sibling, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-12 16:45 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS
On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> So RAID5 with three media M is
>
> M MM MMM
> D1 D2 P(a)
> D3 P(b) D4
> P(c) D5 D6
RAID5 with two media is well defined, and looks like this:
M MM
D1 P(a)
P(b) D2
D3 P(c)
With even parity and N disks
P(a) ^ D1 [^ D2 ^ ... ^ DN] = 0
Simplifying for one data disk and one parity stripe:
P(a) ^ D1 = 0
therefore
P(a) = D1
which is effectively (and, in practice, literally) mirroring.
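The algebra above can be checked directly. A minimal sketch in Python, assuming even (XOR) parity over equal-length byte blocks; the helper name `parity` and the sample bytes are made up for illustration.

```python
from functools import reduce


def parity(blocks):
    """Even (XOR) parity over a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One data block per stripe (two devices): the parity is a copy of the data.
d1 = b"\x0f\xf0\x55"
print(parity([d1]) == d1)

# Two data blocks per stripe (three devices): any one block can be rebuilt
# by XORing the two that survive.
a, b = b"\x01\x02\x03", b"\x10\x20\x30"
p = parity([a, b])
print(parity([a, p]) == b)
```

The same function covers both cases; with a single data block the XOR has nothing to combine, which is the collapse into mirroring described above.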
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 16:45 ` Zygo Blaxell
@ 2014-12-12 22:28 ` Robert White
2014-12-13 4:28 ` Zygo Blaxell
0 siblings, 1 reply; 11+ messages in thread
From: Robert White @ 2014-12-12 22:28 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS
On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
>> So RAID5 with three media M is
>>
>> M MM MMM
>> D1 D2 P(a)
>> D3 P(b) D4
>> P(c) D5 D6
>
> RAID5 with two media is well defined, and looks like this:
>
> M MM
> D1 P(a)
> P(b) D2
> D3 P(c)
Like I said in the other fork of this thread... I see (now) that the
math works, but I can find no trace of anyone having ever implemented
this for an arity less than 3 at a RAID level greater than one (outside
btrfs and its associated materials).
It's like talking about a two-wheeled tricycle. 8-)
I would _genuinely_ like to see any third party discussion of this. It
just isn't done (probably because, as you've shown, it's just a really
complicated and CPU intensive way to end up with a simple mirror). I
spent several hours looking. I can see the math works, and I understand
what you are doing (as I said at some length in the grandparent message)
but it "just isn't done".
The reason I use the tricycle example is that, while most people know
this instinctively, few are aware of the fact that going from two wheels
to three-or-more wheels reverses the steering paradigm. On a bike you
push-left, lean-left, and go-left. On the higher arity vehicles (including
a bike with a side-car added) you push-right to go left (you lean left too,
but that's just to keep from nosing over 8-). I find that quite apt in
the whole RAID1 vs RAID5 discussion since the former is about copying
one-or-more times and the latter is about starting with a theoretically
zeroed buffer and doing reversible checksumming into it.
I doubt that I will be the last person to be confused by BTRFS'
implementation of a two-wheeled tricycle.
You're going to get a lot of mail over the years. 8-)
MEANWHILE
the system really needs to be able to explicitly express and support the
"missing" media paradigm.
M x MMM
D1 . P(a)
D3 . D4
P(c) . D6
The correct logic here to "remove" (e.g. "replace with nothing" instead
of "delete") a media just doesn't seem to exist. And it's already
painfully missing in the RAID1 situation.
If I have a system with N SATA ports, and I have connected N drives, and
device M is starting to fail... I need to be able to disconnect M and
then connect M(new). Possibly with a non-trivial amount of time in
there. For all RAID levels greater than zero this is a natural operation
in a degraded mode. And for a nearly full filesystem the shrink
operation that is btrfs device delete would not work. And for any
nontrivially occupied filesystem it would be way slow, and need to be
reversed for another way-slow interval.
So I need to be able to "replace" a drive with a "nothing" so that the
number of active media becomes N-1 but the arity remains N.
mdadm has the "missing" keyword. The Device Mapper has the "zero"
target. As near as I can tell btrfs has got nothing in this functional slot.
Imagine, if you will, a block device that is the anti-/dev/null. All
operations on this block device return EFAULT. Let's call it
/dev/nothing. And let's say I have a /dev/sdc that has to come out
immediately (and all my stuff is RAID1/5/6). The operational chain would be
btrfs replace start /dev/sdc /dev/nothing /
(time passes, physical device is removed and replaced)
btrfs replace start /dev/nothing /dev/sdc /
Now that's good-ish, but really the first replace is pernicious. The
internal state for the filesystem should just be able to record that
device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this
example) is just gone. The replace-with-nothing becomes more-or-less
instant.
The first replace is also pernicious if it's the second media failure on
a full RAID6 array, since that would be trying to put the same kernel level
device in the array twice.
The restore operation, the replace of the nothing with the something,
remains fully elaborate.
The "nothing" devices need to show up in the device id tables for a
running array in their geographically correct positions and all that.
Without this "missing" status as a first-class part of the system,
dealing with failures and communicating about those failures with the
operator will become vexatious.
[The use of "device delete" and "device add" as changes in arity and
size, and its inapplicability to cases where failure is being dealt with
absent a change of arity, could be clearer in the documentation.]
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 11:16 ` Robert White
2014-12-12 13:29 ` Hugo Mills
@ 2014-12-13 3:01 ` Duncan
1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2014-12-13 3:01 UTC (permalink / raw)
To: linux-btrfs
Robert White posted on Fri, 12 Dec 2014 03:16:03 -0800 as excerpted:
> Perhaps it is just a tautological belief that someone here didn't buy into.
> Like how people keep partitioning drives into little slices for things
> because that's the preserved wisdom from the early eighties.
While I absolutely agree with your raid5 sentiments (which is exactly
what I suppose they might be; I'm getting a bit of an education in that
regard myself, here)...
In the context of the 80s, or even the 90s, nothing about multi-gigabyte
could be considered "little"! =:^)
In fact, while it most assuredly dates me, it /still/ feels a bit odd
referring to the 1 GiB btrfs default threshold for mixed-bg-mode as
"small", given that I distinctly remember wondering how long it might
take me to fill my first 1 GB (not GiB, unfortunately) drive, tho by that
time I did have enough experience to know I'd eventually be dealing with
multi-gig as at the time I was dealing with multi-meg.
More to the point, however...
Those partitions have saved my a** quite a few times over the years.
Among other things, partitioning allows me to keep my (8 GiB) rootfs an
entirely separate filesystem that's mounted read-only by default, which
has kept it undamaged and the tools on it still available to help recover
my other filesystems, when /var/log and /home were damaged due to a hard
shutdown recently.
And some years ago I had an AC failure here in Phoenix in the middle of
the summer, resulting in a physical head-crash and loss of the operating
partitions on my disk in use at the time, while the backup partitions on
the same device remained intact, such that after cooldown I actually
continued to use that disk for some time, mounting the damaged partitions
only to recover the most recent copies of what I could, updating the
backups which were now promoted to operational.
Sure, technology such as LVM can do similar and is more flexible in some
ways, but unfortunately it requires userspace and thus an initr* in
order to handle a root on the same technology. Otherwise, root must be
treated differently, and then you have partitioning again.
Additionally, LVM is yet another layer of software that can and does go
wrong and itself need fixing. Partitioning is too, to some extent, but in
practice it has been pretty bullet-proof compared to technologies such as
LVM and btrfs-subvolumes. LVM has some way to go before it's as robust
as partitioning, and of course btrfs with its subvolumes isn't really
even completely stable yet. Further, btrfs doesn't well limit damage of
a subvolume to just that subvolume (that head-crash scenario would have
almost certainly been a total loss on btrfs subvolumes), the way
partitioning tends to do. And LVM's very flexibility means it doesn't
normally have that sort of damage limitation either. It certainly can,
but doing so severely reduces its flexibility, making going back to
regular partitions to avoid the complexity and additional points of
failure entirely a rather viable and often better choice.
Meanwhile, technology such as EFI and GPT is breathing new life into
partitioning, making it more reliable (checksummed redundant partition
tables), more useful/flexible (killing the primary/secondary/logical
divisions and adding partition names/labels and a far larger typing
space), and creating yet more uses for partitioning in the first place,
due to separate reserved EFI and legacy-BIOS partition types.
Tho of course these days those partition "slices" are often tens or
hundreds of gigs, and are now sometimes "teras"[1], bringing up my
initial point once again; that's NOT actually so small!
But to each his own, of course, and I definitely do agree with you on
raid5, the larger point. FWIW, I still consider allowing a two-device
"raid5" or a three-device "raid6" a bug, particularly given that a single-
device "raid1" is /not/ allowed, nor is a 3-device "raid10".
---
[1] Hmm, K, megs, gigs, "ters", "teras", simply "T" to match K ???
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
2014-12-12 22:28 ` Robert White
@ 2014-12-13 4:28 ` Zygo Blaxell
0 siblings, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-13 4:28 UTC (permalink / raw)
To: Robert White; +Cc: Btrfs BTRFS
On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
> On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> >On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> >>So RAID5 with three media M is
> >>
> >>M MM MMM
> >>D1 D2 P(a)
> >>D3 P(b) D4
> >>P(c) D5 D6
> >
> >RAID5 with two media is well defined, and looks like this:
> >
> >M MM
> >D1 P(a)
> >P(b) D2
> >D3 P(c)
>
> Like I said in the other fork of this thread... I see (now) that the
> math works, but I can find no trace of anyone having ever implemented
> this for an arity less than 3 at a RAID level greater than one (outside
> btrfs and its associated materials).
I've set up mdadm that way (though it does ask you to use '--force'
when you set it up). mdadm will also ask for --force if you try to set
up RAID1 with one disk.
I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way to
change a layout once created (usually because they shoot themselves in
the foot with bad choices early on, e.g. by picking odd parity for RAID5).
The reason to allow it is future expansion: below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later. If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5. So you start with RAID5
on one or two disks so you can scale up without losing any data.
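The alignment concern can be illustrated with simple arithmetic. This is a rough Python sketch under the assumption that a RAID5 layout can only use whole chunks of each member, so whatever lies past the last full chunk is unusable after conversion; `tail_lost` and the sample device size are hypothetical, and real implementations also reserve metadata space that is ignored here.

```python
KIB = 1024
GIB = 1024 ** 3


def tail_lost(device_bytes, chunk_bytes=512 * KIB):
    """Bytes past the last whole chunk, which a chunk-aligned layout cannot use."""
    return device_bytes % chunk_bytes


# A linear or RAID1 array may have used every byte of a device that is not a
# multiple of the chunk size; converting later would strand the remainder.
device = 1 * GIB + 300 * KIB
print(tail_lost(device) // KIB, "KiB would fall outside the last full chunk")
```

Starting out as RAID5 means the array never hands out that tail in the first place, so nothing has to be dropped when more disks arrive.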
Also, mdadm can grow a two-disk RAID5, but if you try to grow a two-disk
mdadm RAID1 you just get a three-disk RAID1 (i.e. two redundant copies
with no additional capacity).
btrfs doesn't really need this capability for expansion, since it can
just create new RAID5 profile chunks whenever it wants to; however, I'd
expect a complete btrfs RAID5 implementation to borrow some ideas from
ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing. That would imply
btrfs-RAID56 profile chunks would have to be able to exist on two or even
one disk, if that was all that was available for writing at the time.
Simply using btrfs-RAID1 chunks wouldn't work since they'd behave the
wrong way when more disks were added later.
> MEANWHILE
>
> the system really needs to be able to explicitly express and support
> the "missing" media paradigm.
>
> M x MMM
> D1 . P(a)
> D3 . D4
> P(c) . D6
>
> The correct logic here to "remove" (e.g. "replace with nothing"
> instead of "delete") a media just doesn't seem to exist. And it's
> already painfully missing in the RAID1 situation.
There are a number of permanent mistakes a naive admin can make when
dealing with a broken array. I've destroyed arrays (made them permanently
read-only beyond the ability of btrfs kernel or user tools to recover)
by getting "add" and "replace" confused, or by allowing an offline drive
to rejoin an array that had been mounted read-write,degraded for some time.
The basic functionality works. btrfs does track missing devices and
can replace them relatively quickly (not as fast as mdadm, but less
than an order of magnitude slower) in RAID1. The reporting is full
of out-of-date cached data, but when a disk is really failing,
there is usually little doubt which one needs to be replaced.
> If I have a system with N SATA ports, and I have connected N drives,
> and device M is starting to fail... I need to be able to disconnect
> M and then connect M(new). Possibly with a non-trivial amount of
> time in there. For all RAID levels greater than zero this is a
> natural operation in a degraded mode. And for a nearly full
> filesystem the shrink operation that is btrfs device delete would
> not work. And for any nontrivially occupied filesystem it would be
> way slow, and need to be reversed for another way-slow interval.
>
> So I need to be able to "replace" a drive with a "nothing" so that
> the number of active media becomes N-1 but the arity remains N.
btrfs already does that, but it sucks. In a naive RAID5 implementation,
a write in degraded mode will corrupt your data if it is interrupted.
This is a general property of all RAID5 implementations that don't have
NVRAM journalling or some other way to solve the atomic update problem.
ZFS does this well: when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks
in non-degraded mode. If you have 5 disks, and one dies, your writes
are then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing). This prevents
the degraded mode write data integrity problem.
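The five-disk example works out as follows. This is a small Python sketch of the arithmetic only, assuming single parity by default; `write_geometry` and `surviving_members` are made-up names and do not correspond to any ZFS or btrfs interface.

```python
def write_geometry(total_disks, missing, parity=1):
    """Return (data, parity) members per stripe for NEW writes.

    New stripes are laid out only across the disks that are present, so
    they are complete (non-degraded) at the narrower width.
    """
    present = total_disks - missing
    if present <= parity:
        raise ValueError("not enough disks left for data plus parity")
    return present - parity, parity


def surviving_members(total_disks, missing):
    """Members of an OLD full-width stripe that can still be read."""
    return total_disks - missing


print(write_geometry(5, 1))      # new writes: 3 data + 1 parity
print(surviving_members(5, 1))   # old stripes: 4 of the original 5 members
```

Old stripes stay degraded until rebuilt, while every stripe written after the failure keeps its full parity protection.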
When the dead disk is replaced you would have the 3 data + parity promoted
to 4 data + parity, or you can elect not to replace the dead disk and
get 3 data + parity everywhere (with a loss of capacity). btrfs could
presumably do that by allocating chunks with different raid56 parameters,
although in this early stage of implementation I'm not sure how much of
any of that has been done yet.
> mdadm has the "missing" keyword. the Device Mapper has the "zero"
> target.
dm also has the "error" target, which is much better for this ("zero"
would allow reads to succeed, which is incorrect). lvm2 uses "error"
for missing pieces of broken LVs in partial mode.
> btrfs replace start /dev/sdc /dev/nothing /
> (time passes, physical device is removed and replaced)
> btrfs replace start /dev/nothing /dev/sdc /
Why wouldn't you just remove the physical device (say device #2) and
then run:
btrfs replace start 2 /dev/sdc /
? The way it works now seems much less complicated than what you propose.
Granted, I have a feature request here: we know the sizes of all the
missing disks, and we know the size of /dev/sdc, so why can't we just
write "missing" instead of "2" and have btrfs choose a missing device
to replace by itself?
> Now that's good-ish, but really the first replace is pernicious. The
> internal state for the filesystem should just be able to record that
> device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this
> example) is just gone. The replace-with-nothing becomes more-or-less
> instant.
To clarify: what is required here is the ability to quickly record that
the device's subuuid is no longer welcome in this filesystem, and never
will be. Should it reappear in the future, it has to be excluded from
the btrfs.
The underlying physical device could return, but it would have to
be treated as a new empty device with a new subuuid, and its data
reconstructed by btrfs balance or btrfs replace.
This is because btrfs does really awful things when a filesystem gets
assembled out of mirrors of different vintages. Before allowing writes
on a subset of the disks in a multi-disk btrfs, the disks that are written
have to agree that they are now the only disks that are currently members
of the filesystem.
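One way to picture the "different vintages" problem is a per-device counter that only the members present at write time get to advance. The Python sketch below assumes such a counter exists and is compared at assembly time; `Device`, `split_by_vintage`, and the sample numbers are hypothetical and say nothing about how btrfs actually stores or checks this.

```python
from collections import namedtuple

# Hypothetical view of a member: a name plus the last generation it committed.
Device = namedtuple("Device", "name generation")


def split_by_vintage(devices):
    """Separate members at the newest generation from stale ones.

    A stale member missed writes made while it was absent, so it must not
    be trusted as-is; it would have to rejoin as a new, empty device.
    """
    newest = max(d.generation for d in devices)
    current = [d for d in devices if d.generation == newest]
    stale = [d for d in devices if d.generation < newest]
    return current, stale


# sdc dropped out at generation 100; the others kept writing while degraded.
members = [Device("sda", 142), Device("sdb", 142), Device("sdc", 100)]
current, stale = split_by_vintage(members)
print([d.name for d in current], [d.name for d in stale])
```

In this model the returning disk is identified as out of date before any of its contents are read back into the filesystem.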
> [The use of "device delete" and "device add" as changes in arity and
> size, and its inapplicability to cases where failure is being dealt
> with absent a change of arity, could be clearer in the
> documentation.]
Yes. This is _not_ equivalent to a btrfs replace, although it is very
similar:
btrfs device add /dev/sdc /
btrfs device delete missing /
It can work--sometimes--but it needs a surprising amount of free space
(or multiple new drives).
Thread overview: 11+ messages
2014-12-10 22:18 mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?] Robert White
2014-12-11 7:33 ` Duncan
2014-12-12 3:56 ` Zygo Blaxell
2014-12-12 6:01 ` Robert White
2014-12-12 9:06 ` David Taylor
2014-12-12 11:16 ` Robert White
2014-12-12 13:29 ` Hugo Mills
2014-12-13 3:01 ` Duncan
2014-12-12 16:45 ` Zygo Blaxell
2014-12-12 22:28 ` Robert White
2014-12-13 4:28 ` Zygo Blaxell