mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

* mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
@ 2014-12-10 22:18 Robert White
  2014-12-11  7:33 ` Duncan
  2014-12-12  3:56 ` Zygo Blaxell
  0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-10 22:18 UTC (permalink / raw)
  To: Btrfs BTRFS

So I started looking at the mkfs.btrfs manual page with an eye towards 
documenting some of the tidbits like metadata automatically switching 
from dup to raid1 when more than one device is used.

In experimenting I ended up with some questions...

(1) why is the dup profile for data restricted to only one device and 
only if it's mixed mode?

Gust t # mkfs.btrfs -f /dev/loop{0..1} -d dup
Error: unable to create FS with data profile 16 (have 2 devices)

Gust t # mkfs.btrfs -f /dev/loop0 -d dup
Error: dup for data is allowed only in mixed mode

(2) why is metadata dup profile restricted to only one device on 
creation when it will run that way just fine after a device add?

Gust t # mkfs.btrfs -f /dev/loop{0..1} -m dup
Error: unable to create FS with metadata profile 32 (have 2 devices)

(3) why can I make a raid5 out of two devices? (I understand that we are 
currently just making mirrors, but the standard requires three devices 
in the geometry etc. So I would expect a two device RAID5 to be 
considered degraded with all that entails. It just looks like it's asking 
for trouble to allow this once the support is finalized as suddenly a 
working RAID5 that's really a mirror would become something that can only 
be mounted with the degraded flag.)

Gust t # mkfs.btrfs -f /dev/loop{0..1} -d raid5 -m raid5
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file 
to 65536
Turning ON incompat feature 'raid56': raid56 extended format
Performing full device TRIM (2.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
         nodesize 16384 leafsize 16384 sectorsize 4096 size 4.00GiB

(4) Same question for raid6 but with three drives instead of the 
mandated four.

(5) If I can make a RAID5 or RAID6 device with one missing element, why 
can't I make a RAID1 out of one drive, e.g. with one missing element?

(6) If I make a RAID1 out of three devices are there three copies of 
every extent or are there always two copies that are semi-randomly 
spread across three devices? (ibid for more than three).

---

It seems to me (very dangerous words in computer science, I know) that 
we need a "failed" device designator so that a device can be in the 
geometry (e.g. have a device ID) but not actually exist. Reads/writes to 
the failed device would always be treated as error returns.

The failed device would be subject to replacement with "btrfs dev 
replace", and could be the source of said replacement to drop a 
problematic device out of an array.

EXAMPLE:
Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file 
to 65536
Processing explicitly missing device
adding device (failed) id 2 (phantom device)

mount /dev/loop0 /mountpoint

btrfs replace start 2 /dev/loop1 /mountpoint

(and so on)

Being able to "replace" a faulty device with a phantom "failed" device 
would nicely disambiguate the whole device add/remove versus replace 
mistake.

It would make the degraded status less mysterious.

A filesystem with an explicitly failed element would also make the 
future roll-out of full RAID5/6 less confusing.


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-10 22:18 mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?] Robert White
@ 2014-12-11  7:33 ` Duncan
  2014-12-12  3:56 ` Zygo Blaxell
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2014-12-11  7:33 UTC (permalink / raw)
  To: linux-btrfs

Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:

> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits like metadata automatically switching
> from dup to raid1 when more than one device is used.
> 
> In experimenting I ended up with some questions...
> 
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?

> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?

1 and 2 together since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard 
for metadata in the single-device case, where it was made the default.  
The exception is SSDs, which default to single mode metadata on a 
single-device filesystem: the FTL voids any guarantees on location, and 
firmware such as sandforce compresses and dedups, so the 
hardware/firmware subverts btrfs' efforts to do dup anyway.

In the single-device case, two copies of data was considered simply not 
worth the cost, due both to doubling the size (especially on SSD where 
size is money!) and to the speed penalties on spinning rust due to seeks 
between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two 
different devices, was considered enough superior to make that the 
default, since that provided device-loss resiliency for the all-important 
metadata, thus enabling recovery of at least /some/ files even with a 
device missing (single-mode data where the file's extents all happened to 
be on available devices, plus of course raid1, etc, data).  Further, dup-
mode metadata was considered a mistake it was better not to even have 
available as an option, since loss of a single device would likely kill 
the filesystem, which made dup mode little better than single mode, 
without the doubled-size-cost.  Further, on spinning rust there'd again 
be the seek penalty, to little benefit since dup mode provides no 
guarantees in case of device loss.

So multi-device defaults to raid1 metadata for safety, but single mode 
metadata remains an option (along with raid0) if you really /don't/ care 
about losing everything due to loss of a single device.  Single-device 
simply makes dup-mode available (and the default) for metadata, as a poor-
man's substitute for the safety of raid1, but single-device-metadata is 
the only case where that poor-man's-raid1-substitute is worth the 
(considered extreme) cost, with usage of that option not even available 
on multi-device as it'd be a near-certain mistake, certainly at the mkfs 
level.  And dup mode isn't ordinarily available for data even on single-
device, because it's considered not worth the cost.

As for dup-mode working after device-add, that's simply a necessary bit 
in order for device add to work from a default-dup-mode single-device 
at all.  And it's only the existing metadata chunks on the original 
device that will be dup-mode.  Once a second device is added, additional 
metadata chunks will be written in raid1 mode, forcing the two chunk 
copies to different devices since there's multiple devices available to 
allow that.  The clear intent and recommendation is to do a rebalance 
ASAP after a device add, to spread usage to the new device as 
appropriate.  And of course that rebalance will use the new raid1 
metadata defaults, unless told otherwise of course, and I don't believe 
dup mode is available to tell it otherwise there, either.

What all that original reasoning fails to account for, however, is the 
btrfs data/metadata checksumming and integrity features and the very high 
value some users (including me) place on them, a value the original btrfs 
mode designers obviously considered extreme.  A multi-device 
dup-mode-metadata choice at mkfs is arguably still a mistake: it carries 
the cost of raid1 metadata without the benefit, at nearly the risk of 
single metadata but double the size.  But dup-mode data combined with 
btrfs checksumming and data integrity features on a single device has 
strong data integrity benefits that some would definitely consider worth 
it, even at the additional cost in speed on spinning rust due to seeking, 
and in size on expensive SSDs.

Meanwhile, mixed-bg-mode was an after-thought, added much later (after my 
own btrfs journey began) in order to make working with small 
filesystems reasonable.  Before mixed-bg-mode, people attempting to use 
btrfs on sub-GiB devices often found they couldn't use all available 
space (often 25-50% wasted!) as the separate data/metadata chunk 
allocation was simply too large grained to properly deal with the small 
sizes involved.

And small filesystems really _were_ mixed-mode's _entire_ purpose.  That 
it could additionally be used to allow dup-data, by specifying 
mixed-bg-mode even on > 1 GiB filesystems where it wasn't the default, 
was *ENTIRELY* an accident, not even considered until a user figured it 
out, as confirmed by (I believe) Chris Mason when directly asked at some 
point.

But now that mixed-mode is there and can be used to enable dup-mode data 
too, for people that want it, and now that we know for sure such people 
exist because we see mixed-bg mode being offered as a way to get exactly 
that, dup-mode-data, there's little reason to remove the accidental 
feature. =:^)

Meanwhile, now that demand is known to exist for dup-mode-data, I think 
it probable that at some point code for that without having to force 
mixed-bg-mode to get it will be made available and tested, much as other 
features have been.  But there's way more features left to implement than 
time to implement them, at least with the current btrfs developer pool.  
And given that mixed-bg-mode is available to deliver dup-mode-data for 
those /really/ intent on having it, the priority of coding and testing 
stand-alone-dup-mode-data is going to be relatively low, so I'd suggest 
not expecting it any time soon -- maybe five years out, I don't see it 
much sooner unless a dev (or dev sponsor) really gets that itch and 
decides to make scratching it a priority.

> (3) why can I make a raid5 out of two devices?

> (4) Same question for raid6 but with three drives instead of the
> mandated four.
> 
> (5) If I can make a RAID5 or RAID6 device with one missing element, why
> can't I make a RAID1 out of one drive, e.g. with one missing element?

AFAIK, the ability to mkfs raid56 modes with a missing device is a bug.  
I'm not sure if it was known or not, tho I know there has been some 
change in minimum number of devices over time and it might have gotten 
caught in that, but I'd /guess/ that since raid56 isn't yet fully 
supported, if the bug /was/ known, it had relatively low priority on the 
fix-list compared to various other bugs with currently supported features.

If it is a bug as I believe it to be, that nullifies most of the 
secondary questions you had...

> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).

Currently btrfs raid1 is defined very specifically as exactly two copies/
mirrors, regardless of whether there are two or two hundred devices in 
the filesystem.  More devices gives you more room; number of copies 
remains two.  This is covered in the wiki.

The feature known as N-way-mirroring is however on the roadmap -- for 
just after raid56, since the planned implementation depends on some of 
the same code.

This is actually a bit of a personal sore spot for me, since it has long 
been my most-wished-for feature.  When I first investigated btrfs now 
years ago, I was running quad-way-mdraid-1, and was very disappointed to 
see that btrfs only offered paired-raid1, since I wanted (and still want) 
very much to be able to fall back more than once to additional copies, 
should the checksum fail on the first N-1 copies.

And back then (kernel 3.5 era IIRC) it was already roadmapped immediately 
after raid56 modes, which was to be introduced in another kernel cycle or 
two, so I figured perhaps 3-4 cycles, maybe a year (~5 cycles) for N-way-
mirroring.  But it seems as far out now as then, if not further since we 
know how long raid56 is taking to complete, and two kernel cycles after 
that for N-way-mirroring seems wildly optimistic, now.  Maybe a year 
after... if it's not too complicated.

But it's definitely on the roadmap, next thing to implement in fact, but 
it's still right after raid56, and raid56 has of course been coming right 
up since kernel 3.6 or whatever, at least.

But I'm not a dev so I can't help in that regard, tho I do use btrfs in 
pair-way raid1 mode now, and try to help on the list where my knowledge 
as list regular and sysadmin using btrfs allows it.  Someday that feature 
will be available to play with... but that doesn't mean I can't enjoy 
btrfs for what it has right now, nor does it mean I can't help others 
with btrfs while I wait...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-10 22:18 mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?] Robert White
  2014-12-11  7:33 ` Duncan
@ 2014-12-12  3:56 ` Zygo Blaxell
  2014-12-12  6:01   ` Robert White
  1 sibling, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-12  3:56 UTC (permalink / raw)
  To: Robert White; +Cc: Btrfs BTRFS

On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
> (3) why can I make a raid5 out of two devices? (I understand that we
> are currently just making mirrors, but the standard requires three
> devices in the geometry etc. So I would expect a two device RAID5 to
> be considered degraded with all that entails. It just looks like it's
> asking for trouble to allow this once the support is finalized as
> suddenly a working RAID5 that's really a mirror would become
> something that can only be mounted with the degraded flag.)

RAID5 with even parity and two devices should be exactly the same as
RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
is irrelevant because there is no difference in disk contents so the
disks are interchangeable), except with different behavior when more
devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
should start writing new chunks with N stripes instead of two).
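
A minimal Python sketch of the XOR argument above, assuming plain
byte-wise XOR parity over one stripe. The helper name xor_parity is
hypothetical and is not taken from the btrfs code. With only one data
block per stripe, the parity block comes out identical to the data block:

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (illustrative helper, not btrfs code)."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("blocks must all be the same size")
    # zip(*blocks) walks the blocks column by column, one byte position at a time.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


data = b"btrfs"
parity = xor_parity([data])  # a two-device stripe: one data block, one parity block

# The parity device holds the same bytes as the data device.
assert parity == data
# Equivalently, disk1 ^ disk2 == 0.
assert xor_parity([data, parity]) == bytes(len(data))
```

This only illustrates the arithmetic; it says nothing about how such a
stripe should be classified (degraded or not), which is the point under
dispute in this thread.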

> (4) Same question for raid6 but with three drives instead of the
> mandated four.

RAID6 with three devices should behave more or less like three-way RAID1,
except maybe the two parity disks might be different (I forget how the
function used to calculate the two parity stripes works, and whether
it can be defined such that F(disk1, disk2, disk3) == disk1).

> (5) If I can make a RAID5 or RAID6 device with one missing element,
> why can't I make a RAID1 out of one drive, e.g. with one missing
> element?

They're only missing if you believe the minimum number of RAID5 disks
is not two and the minimum number of RAID6 disks is not three.

> (6) If I make a RAID1 out of three devices are there three copies of
> every extent or are there always two copies that are semi-randomly
> spread across three devices? (ibid for more than three).

There are always two copies.  RAID1 on 3x1TBdisks gives you 1.5TB
of mirrored storage.
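
The 1.5TB figure can be checked with a small Python sketch. It assumes an
idealized allocator that keeps exactly two copies of every chunk on two
different devices and ignores metadata and chunk-size overhead; the
function name raid1_usable is made up for illustration:

```python
def raid1_usable(sizes):
    """Idealized usable capacity for a two-copy mirror across devices.

    Assumption: each chunk has exactly two copies and they must live on
    different devices. Overheads are ignored.
    """
    if len(sizes) < 2:
        raise ValueError("two copies need at least two devices")
    total = sum(sizes)
    largest = max(sizes)
    # Half the raw space is the upper bound; the largest device can only
    # be used as far as the remaining devices can hold its second copies.
    return min(total // 2, total - largest)


# Three 1TB devices (sizes in GB): 1500 GB usable, i.e. 1.5TB.
assert raid1_usable([1000, 1000, 1000]) == 1500
# A lopsided pair is limited by the smaller device.
assert raid1_usable([4000, 1000]) == 1000
```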

> ---
> 
> It seems to me (very dangerous words in computer science, I know)
> that we need a "failed" device designator so that a device can be in
> the geometry (e.g. have a device ID) but not actually exist.
> Reads/writes to the failed device would always be treated as error
> returns.
> 
> The failed device would be subject to replacement with "btrfs dev
> replace", and could be the source of said replacement to drop a
> problematic device out of an array.
> 
> EXAMPLE:
> Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
> Btrfs v3.17.1
> See http://btrfs.wiki.kernel.org for more information.
> 
> Performing full device TRIM (2.00GiB) ...
> Turning ON incompat feature 'extref': increased hardlink limit per
> file to 65536
> Processing explicitly missing device
> adding device (failed) id 2 (phantom device)
> 
> mount /dev/loop0 /mountpoint
> 
> btrfs replace start 2 /dev/loop1 /mountpoint
> 
> (and so on)
> 
> Being able to "replace" a faulty device with a phantom "failed"
> device would nicely disambiguate the whole device add/remove versus
> replace mistake.

It is a little odd that an array of 3 disks with one missing looks
like this:

Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
        Total devices 3 FS bytes used 256.00KiB
        devid    1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
        devid    2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
        devid    3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

In the above, "vgtester-d02" was a deleted LV and does not exist, but
you'd never know that from the output of 'btrfs fi show'...

> It would make the degraded status less mysterious.

The 'degraded' status currently protects against some significant data
corruption risks.  :-O

> A filesystem with an explicitly failed element would also make the
> future roll-out of full RAID5/6 less confusing.


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12  3:56 ` Zygo Blaxell
@ 2014-12-12  6:01   ` Robert White
  2014-12-12  9:06     ` David Taylor
  2014-12-12 16:45     ` Zygo Blaxell
  0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-12  6:01 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
> On Wed, Dec 10, 2014 at 02:18:55PM -0800, Robert White wrote:
>> (3) why can I make a raid5 out of two devices? (I understand that we
>> are currently just making mirrors, but the standard requires three
>> devices in the geometry etc. So I would expect a two device RAID5 to
>> be considered degraded with all that entails. It just looks like it's
>> asking for trouble to allow this once the support is finalized as
>> suddenly a working RAID5 that's really a mirror would become
>> something that can only be mounted with the degraded flag.)
>
> RAID5 with even parity and two devices should be exactly the same as
> RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
> is irrelevant because there is no difference in disk contents so the
> disks are interchangeable), except with different behavior when more
> devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
> should start writing new chunks with N stripes instead of two).

That's not correct. A RAID5 with three elements presents two _different_ 
sectors in each stripe. When one element is lost, it would still present 
two different sectors, but the safety is gone.

I understand that the XOR collapses into a mirror if only two data items 
are involved, but that's a mathematical fact that is irrelevant to the 
definition of a RAID5 layout. When you take a wheel off of a tricycle it 
doesn't just become a bike. And you can't make a bicycle into a trike by 
just welding on a wheel somewhere. The infrastructure of the two is 
completely different.

So RAID5 with three media M is

M    MM   MMM
D1   D2   P(a)
D3   P(b) D4
P(c) D5   D6

If MMM is lost D1, D2, D3, and D5 are intact
D4 and D6 can be recreated via D3^P(b) and P(c)^D5

M    MM   X
D1   D2   .
D3   P(b) .
P(c) D5   .
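
The recovery step in the diagrams above (D4 from D3^P(b), D6 from
P(c)^D5) can be sketched in Python. This assumes simple byte-wise XOR
parity; the helper xor_blocks and the sample byte values are invented for
the illustration:

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


# Second stripe of the layout: D3 on M, P(b) on MM, D4 on MMM.
d3 = b"\x11\x22\x33"
d4 = b"\x0f\xf0\xaa"
p_b = xor_blocks(d3, d4)  # parity written while all three media were present

# Medium MMM is lost, so D4 is gone. Rebuild it from the two survivors.
rebuilt_d4 = xor_blocks(d3, p_b)
assert rebuilt_d4 == d4
```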

So under _no_ circumstances would a two-disk RAID5 be the same as a 
RAID1 since a two disk RAID5 functionally implies disk three because the 
_minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data 
protection because the minimum third element is a computational phantom.

In short it is irrational to have a "two disk" RAID5 that is "not 
degraded" in the same way you cannot have a two-wheeled tricycle without 
scraping some part of something along the asphalt.

A RAID1 with two elements presents one sector along the "stripe".

I realize that what has been implemented is what you call a two drive 
RAID5, and done so by really implementing a RAID1, but it's nonsense.

I mean I understand what you are saying you've done, but it makes no 
sense according to the definitions of RAID5. There is no circumstance 
where RAID5 falls back to mirroring. Trying to implement RAID5 as an 
extension of a mirroring paradigm would involve a fundamental conflict 
in definitions. Especially when you reached a failure mode.

This is so fundamental to the design that the "fast" way to assemble a 
RAID5 of N-arity (minimum N being 3) is to just connect the first N-1 
elements, declare the raid valid-but-degraded using (N-1) of the media, 
and then "replacing" the Nth phantom/missing/failed element with the 
real disk and triggering a rebuild. This only works if you don't need 
the initial contents of the array to have a specific value like zero. 
(This involves fewest reads and the array is instantly available while 
it builds.)

As soon as you start writing to the array, the stripes you write 
"repair" the extents if the repair process hadn't gotten to them yet.

It's basically impossible to turn a mirror into a RAID5 if you _ever_ 
expect the code base to be able to recover an array that's lost an 
element.

>> (4) Same question for raid6 but with three drives instead of the
>> mandated four.
>
> RAID6 with three devices should behave more or less like three-way RAID1,
> except maybe the two parity disks might be different (I forget how the
> function used to calculate the two parity stripes works, and whether
> it can be defined such that F(disk1, disk2, disk3) == disk1).

Uh, no. A raid 6 with three drives, or even two drives, is also degraded 
because the minimum is four.

A   B   C   D
D1  D2  Pa  Qa
D3  Pb  Qb  D4
Pc  Qc  D5  D6
Qd  D7  D8  Pd

You can lose one or two media but the minimum stripe is again [X1,X2] 
for any read (ABCD)(ABC.)(AB..)(A..D) etc.

Minimum arity for RAID6 is 4, maximum lost-but-functional configuration 
is arity-minus-two.

>
>> (5) If I can make a RAID5 or RAID6 device with one missing element,
>> why can't I make a RAID1 out of one drive, e.g. with one missing
>> element?
>
> They're only missing if you believe the minimum number of RAID5 disks
> is not two and the minimum number of RAID6 disks is not three.

I do believe that, because that's what the terms are universally taken 
to mean.

If what BTRFS is promising/planning as raid5 will run non-degraded on 
two disks it's... something... but it's not RAID5.

If what BTRFS is promising/planning as raid6 will run non-degraded on 
three disks it's... something... but it's not RAID6.

>
>> (6) If I make a RAID1 out of three devices are there three copies of
>> every extent or are there always two copies that are semi-randomly
>> spread across three devices? (ibid for more than three).
>
> There are always two copies.  RAID1 on 3x1TBdisks gives you 1.5TB
> of mirrored storage.

>> ---
>>
>> It seems to me (very dangerous words in computer science, I know)
>> that we need a "failed" device designator so that a device can be in
>> the geometry (e.g. have a device ID) but not actually exist.
>> Reads/writes to the failed device would always be treated as error
>> returns.
>>
>> The failed device would be subject to replacement with "btrfs dev
>> replace", and could be the source of said replacement to drop a
>> problematic device out of an array.
>>
>> EXAMPLE:
>> Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
>> Btrfs v3.17.1
>> See http://btrfs.wiki.kernel.org for more information.
>>
>> Performing full device TRIM (2.00GiB) ...
>> Turning ON incompat feature 'extref': increased hardlink limit per
>> file to 65536
>> Processing explicitly missing device
>> adding device (failed) id 2 (phantom device)
>>
>> mount /dev/loop0 /mountpoint
>>
>> btrfs replace start 2 /dev/loop1 /mountpoint
>>
>> (and so on)
>>
>> Being able to "replace" a faulty device with a phantom "failed"
>> device would nicely disambiguate the whole device add/remove versus
>> replace mistake.
>
> It is a little odd that an array of 3 disks with one missing looks
> like this:

It's correct for a three disk array with one "failed" (e.g. where 
vgtester-d04 is present but bad), it's wrong for a _four_ disk array 
where one disk (vgtester-d03) has been unplugged or is otherwise missing 
(as opposed to "deleted").

The entire idea of "three disk array with one missing" doesn't match 
your example below, which is in fact a three disk array with all 
elements present. Your example below started out as a four disk array 
and then you deleted one, making it a three disk array. The point at 
issue would be a four-disk array with one missing. So there'd be four lines.

E.g. a four disk array with one missing _ought_ to look like:

 > Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
 >          Total devices 3 FS bytes used 256.00KiB
 >          devid    1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
 >          devid    2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
            devid    3 size 4.00GiB used 0.00B (missing) path ???
 >          devid    4 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

The problem here is that the concept of "missing" is, um, missing from 
BTRFS statuses.

For instance the same idea I presume you are going for would be 
expressed in mdadm with the little status array "UU.U" for "up, up, 
missing, and up".

BTRFS _should_ (big words from a noob, I know) have and display the 
arity of the array with the correct number of expected disks, filled out 
with the information of the available disks.

Were this correct there would be a covariant line for 3 and vgtester-d04 
would be devid 4 like I did to it above.

>
> Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
>          Total devices 3 FS bytes used 256.00KiB
>          devid    1 size 4.00GiB used 847.12MiB path /dev/mapper/vgtester-d01
>          devid    2 size 4.00GiB used 827.12MiB path /dev/mapper/vgtester-d02
>          devid    3 size 4.00GiB used 0.00B path /dev/mapper/vgtester-d04

That's not odd at all. sort of... (simplified to three lines and names 
changed because of word-wrap here... )

An array of three disks with one missing should look like:

Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
          Total devices 3 FS bytes used 256.00KiB
          devid    1 size 4.00GiB used 847.12MiB path /dev/sda
          devid    2 size 4.00GiB used 827.12MiB path /dev/sdb
          devid    3 size 4.00GiB used 0.00B (missing)

because, you know, it's like... missing...

An array of three disks with one _failed_ should look like:

Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
          Total devices 3 FS bytes used 256.00KiB
          devid    1 size 4.00GiB used 847.12MiB path /dev/sda
          devid    2 size 4.00GiB used 827.12MiB path /dev/sdb
          devid    3 size 4.00GiB used 0.00B (failed) path /dev/sdc

An array of three disks with one freshly replacing a previously missing 
or failed should look like:

Label: none  uuid: dde7fa84-11ed-4281-8d90-3eb0d29b949b
          Total devices 3 FS bytes used 256.00KiB
          devid    1 size 4.00GiB used 847.12MiB path /dev/sda
          devid    2 size 4.00GiB used 827.12MiB path /dev/sdb
          devid    3 size 4.00GiB used x.xxxB (rebuilding) path /dev/sdc

With the used value growing with each subsequent repeat of the inquiry 
until they all had the same numbers.

And I don't know what the status would look like during a "replace" but 
there'd temporarily be a fourth disk in the list, one being a donor and 
one being the new replacement.

That's _exactly_ what a RAID5 with a degraded, failed, or missing member 
should look like. For any extent (A,B,C) any one column (A), (B), or (C) 
can be missing -- shown as (.) such that for a chunk size X there is 
always a return stripe [X1,X2] (e.g. the stripe size is _always_ the 
arity minus one, and the minimum arity is three) returned by any legal 
read (A,B,C) == (.,B,C) == (A,.,C) == (A,B,.); it is this property that 
provides the redundancy.

So nominally, the above would result in all reads [X1,X2] being a result 
of (A,B,.) or by device ID (1,2,.). And each read of (1,2,.) would 
provide the opportunity to repair ID 3's chunk.

The subsequent activity, especially a balance/repair operation would be 
repopulating /dev/mapper/vgtester-d04 to reestablish the parity.

Similarly all writes to a valid extent require two reads, and two writes 
minimum. If you have the parity and the target block in memory (that's 
the two reads), you xor-out the original contents of target block, 
xor-in the new contents of target block, then you have to write _both_ 
the target block and the parity block (preferably in one transaction).
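
The read-modify-write sequence just described can be sketched in Python,
assuming byte-wise XOR parity. The names xor_blocks and rmw_parity are
hypothetical, and the sketch leaves out the actual device I/O and any
transaction handling:

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_parity(old_data, old_parity, new_data):
    """Return the parity block to write alongside new_data.

    old_data and old_parity are the two reads; new_data and the returned
    parity are the two writes.
    """
    parity = xor_blocks(old_parity, old_data)  # xor-out the original contents
    return xor_blocks(parity, new_data)        # xor-in the new contents


# One stripe with data blocks d1 and d2 and their parity.
d1, d2 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks(d1, d2)

# Overwrite d1 without reading d2.
new_d1 = b"\xaa\xbb"
new_parity = rmw_parity(d1, parity, new_d1)

# The result matches a parity recomputed from scratch.
assert new_parity == xor_blocks(new_d1, d2)
```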

In a degraded RAID5, if you are writing to a "missing" block, you have 
to read all blocks in the stripe, calculate the missing block, xor the 
calculated block out of the parity block, then xor the new block into 
the parity and write the parity block back out. If the replacement drive 
is installed and active you can also then just write the new block there 
as well and the block stripe is now no longer degraded.
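
A Python sketch of that degraded write, again assuming byte-wise XOR
parity. The names xor_blocks and degraded_write_parity are hypothetical
and the sketch covers only the arithmetic, not the I/O:

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def degraded_write_parity(surviving_data, parity, new_block):
    """Return the new parity after writing new_block to the missing member."""
    # Calculate the missing block from the parity and every surviving block.
    missing = parity
    for block in surviving_data:
        missing = xor_blocks(missing, block)
    parity = xor_blocks(parity, missing)   # xor the calculated block out
    return xor_blocks(parity, new_block)   # xor the new block in


d1, d2 = b"\x01\x02", b"\x10\x20"
parity = xor_blocks(d1, d2)

# The device holding d2 is missing; a write of new_d2 targets it.
new_d2 = b"\x55\x66"
new_parity = degraded_write_parity([d1], parity, new_d2)

# A later rebuild recovers the new contents from d1 and the new parity.
assert xor_blocks(d1, new_parity) == new_d2
```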

This is the core paradigm for RAID5.

And you need the "empty" device ID in the missing case: it would cause 
no-op read and write events/errors but allow the spanning logic to remain 
intact, and that logic is necessary for rational recovery when it ends up 
being

1,2,3-is-bad
1,2,3-is-bad
1,2,3
1,2,3
1,2,3-is-bad

As in this case stripes zero, one, and four are improper, still missing, 
whatever and stripes two and three have been balanced/scrubbed back into 
good order.

It's particularly important and valuable to have that device ID allocated 
"failed" in a replace scenario where the logic is now ready to keep the 
good stuff (the extents for tracks 2 and 3 for example) and only 
recalculate the bad.

>
> In the above, "vgtester-d02" was a deleted LV and does not exist, but
> you'd never know that from the output of 'btrfs fi show'...

That would be because "deleted" and "failed" are two inherently 
different conditions and BTRFS doesn't have the ID smarts for a "failed" 
device to be present in the map.

>
>> It would make the degraded status less mysterious.
>
> The 'degraded' status currently protects against some significant data
> corruption risks.  :-O
>
>> A filesystem with an explicitly failed element would also make the
>> future roll-out of full RAID5/6 less confusing.

I also still don't get why the RAID1 with arity greater than two was at 
all hard to construct. It would have been my first step on the way to 
RAID5/6

A
D1
D2
D3

A   B
D1  D1
D2  D2
D3  D3

A   B   C
D1  D1  D1
D2  D2  D2
D3  D3  D3

Is the logical progression right before

A   B   C
D1  D2  Pa
D3  Pb  D4
Pc  D5  D6

Until you have the code base and data structures to "search past B" in a 
mirror of arbitrary arity, you just don't have the means to organize the 
horizontal stripe-as-entity needed to record the arbitrarily wide 
stripes you need to make a higher-order RAID.

And before _any_ of that you need to be able to explicitly account for a 
missing drive such that you have a RAID1 of

A   x
D1  .
D2  .
D3  .

For all possible read and write events. Without that your rebuild of any 
RAID is "iffy". If you are not ready for

A   x   C
D1  .   D1
D2  .   D2
D3  .   D3

then

A   x   C
D1  .   Pa
D3  .   D4
Pc  .   D6

Is going to ruin your world.
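
To make the "explicitly account for a missing drive" point concrete, here is a rough Python sketch of a read path that treats a missing column as a first-class case. It assumes a single XOR parity block per stripe; `read_stripe` and `xor_blocks` are illustrative names, not btrfs functions.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_stripe(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Return every block of a stripe, rebuilding at most one missing (None) block.

    With even XOR parity the XOR of all blocks in a stripe is zero,
    so a single missing block equals the XOR of the blocks that remain.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise IOError("more than one block missing; stripe is unrecoverable")
    if not missing:
        return list(stripe)
    rebuilt = xor_blocks([block for block in stripe if block is not None])
    return [rebuilt if block is None else block for block in stripe]


# First row of the "A x C" layout above: D1 present, D2 missing, Pa present.
d1, d2 = b"\x0f\xf0", b"\x33\x55"
pa = xor_blocks([d1, d2])
print(read_stripe([d1, None, pa]) == [d1, d2, pa])
```

The same function handles the degraded two-column mirror: with one surviving block, the "XOR of the rest" is just that block.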

I don't know how to turn this into proper BTRFS speak since I am still 
new to the code base...


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12  6:01   ` Robert White
@ 2014-12-12  9:06     ` David Taylor
  2014-12-12 11:16       ` Robert White
  2014-12-12 16:45     ` Zygo Blaxell
  1 sibling, 1 reply; 11+ messages in thread
From: David Taylor @ 2014-12-12  9:06 UTC (permalink / raw)
  To: Robert White; +Cc: Btrfs BTRFS

On Thu, 11 Dec 2014, Robert White wrote:

>On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
>>
>>RAID5 with even parity and two devices should be exactly the same as
>>RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
>>is irrelevant because there is no difference in disk contents so the
>>disks are interchangeable), except with different behavior when more
>>devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
>>should start writing new chunks with N stripes instead of two).
>
>That's not correct. A RAID5 with three elements presents two 
>_different_ sectors in each stripe. When one element is lost, it would 
>still present two different sectors, but the safety is gone.

The above quote is discussing two device RAID5, you are discussing
three device RAID5.

>I understand that the XOR collapses into a mirror if only two datum 
>are involved, but that's a mathematical fact that is irrelevant to the 
>definition of a RAID5 layout. When you take a wheel off of a tricycle 
>it doesn't just become a bike. And you can't make a bicycle into a 
>trike by just welding on a wheel somewhere. The infrastructure of the 
>two is completely different.

True.  A two-device RAID5 is not the same as a degraded three-device 
RAID5.

>So RAID5 with three media M is
>
>M    MM   MMM
>D1   D2   P(a)
>D3   P(b) D4
>P(c) D5   D6
>
>If MMM is lost D1, D2, D3, and D5 are intact
>D4 and D6 can be recreated via D3^P(b) and P(c)^D5
>
>M    MM   X
>D1   D2   .
>D3   P(b) .
>P(c) D5   .

>So under _no_ circumstances would a two-disk RAID5 be the same as a 
>RAID1 since a two disk RAID5 functionally implies disk three because 
>the _minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data 
>protection because the minimum third element is a computational 
>phantom.

You again seem to be treating a "two disk RAID5" as synonymous with your 
degraded three disk RAID5 above.  It is not.

RAID5 with two media M would be:

M    MM
D1   P(a)
P(b) D2
D3   P(c)

[and each P would be identical to its corresponding D]

>In short it is irrational to have a "two disk" RAID5 that is "not 
>degraded" in the same way you cannot have a two-wheeled tricycle 
>without scraping some part of something along the asphalt.

There is nothing irrational about it at all, except that it is
exactly equivalent to two disk RAID1.

>A RAID1 with two elements presents one sector along the "stripe".

A RAID5 with N elements presents N-1 sectors along the "stripe",
so I'm not sure what the problem is with setting N=2.
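
As a rough illustration (a sketch, not any real implementation's allocator), the rotating-parity diagrams in this thread can be generated for any N >= 2 by the same rule. The function name `raid5_layout` and the choice to start parity on the last device are assumptions made to match the diagrams above.

```python
from typing import List


def raid5_layout(n_devices: int, n_rows: int) -> List[List[str]]:
    """Label each block of a rotating-parity layout across n_devices.

    Every row holds N-1 data blocks and one parity block. The parity
    column starts on the last device and moves one device left per row.
    """
    if n_devices < 2:
        raise ValueError("need at least one data and one parity block per row")
    rows = []
    next_data = 1
    for row in range(n_rows):
        parity_col = (n_devices - 1 - row) % n_devices
        cells = []
        for col in range(n_devices):
            if col == parity_col:
                cells.append(f"P({chr(ord('a') + row)})")
            else:
                cells.append(f"D{next_data}")
                next_data += 1
        rows.append(cells)
    return rows


for n in (3, 2):
    for cells in raid5_layout(n, 3):
        print("  ".join(f"{c:<4}" for c in cells).rstrip())
    print()
```

Nothing in the rule breaks at N=2; each row simply carries one data block and one parity block.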

>I realize that what has been implemented is what you call a two drive 
>RAID5, and done so by really implementing a RAID1, but it's nonsense.

It's not really, it's merely an argument of semantics if you want
to define it as nonsense.

>I mean I understand what you are saying you've done, but it makes no 
>sense according to the definitions of RAID5. There is no circumstance 
>where RAID5 falls back to mirroring. Trying to implement RAID5 as an 
>extension of a mirroring paradigm would involve a fundamental conflict 
>in definitions. Especially when you reached a failure mode.

I have no idea what you mean by "a fundamental conflict in definition".

>This is so fundamental to the design that the "fast" way to assemble a 
>RAID5 of N-arity (minimum N being 3) is to just connect the first N-1 
>elements, declare the raid valid-but-degraded using (N-1) of the 
>media, and then "replacing" the Nth phantom/missing/failed element 
>with the real disk and triggering a rebuild. This only works if you 
>don't need the initial contents of the array to have a specific value 
>like zero. (This involves fewest reads and the array is instantly 
>available while it builds.)

There is no reason you could not do exactly this with N=2.

>As soon as you start writing to the array, the stripes you write 
>"repair" the extents if the repair process hadn't gotten to them yet.
>
>Its basically impossible to turn a mirror into a RAID5 if you _ever_ 
>expect the code base to to be able to recover an array that's lost an 
>element.

Again, I'm not really sure what you mean.

>Uh, no. A raid 6 with three drives, or even two drives, is also 
>degraded because the minimum is four.

You're doing your weird semantic dance again.  Just because you
define the minimum to be four does not mean that someone talking
about a three device RAID6 is talking about a degraded four device
RAID6, they're not.

As above, a non-degraded three-device RAID6 can be perfectly
sensibly defined.  Once again, it has exactly the same failure
properties as a three device RAID1 (any two of the devices can
fail), so it's a bit pointless.  But not "impossible"...

>
>A   B   C   D
>D1  D2  Pa  Qa
>D3  Pb  Qb  D4
>Pc  Qc  D5  D6
>Qd  D7  D8  Pd
>
>
>You can lose one or two media but the minimum stripe is again [X1,X2] 
>for any read (ABCD)(ABC.)(AB..)(A..D) etc.
>
>Minimum arity for RAID6 is 4, maximum lost-but-functional 
>configuration is arity-minus-two.

A   B   C
D1  Pa  Qa
Pb  Qb  D2
Qc  D3  Pc
D4  Pd  Qd
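
A small Python sketch of why the three-device case degenerates into a three-way mirror. It assumes the common RAID6 construction (P is plain XOR, Q is a Reed-Solomon style sum over GF(2^8) with generator 2 and polynomial 0x11d); whether a given implementation uses exactly these parameters is an assumption here, and the function names are made up for illustration.

```python
from typing import List, Tuple


def gf_mul2(x: int) -> int:
    """Multiply a byte by the generator 2 in GF(2^8), reducing by polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def raid6_pq(data_blocks: List[bytes]) -> Tuple[bytes, bytes]:
    """Compute the P and Q blocks for one stripe of equal-length data blocks.

    P = D0 ^ D1 ^ ...        Q = D0 ^ 2*D1 ^ 4*D2 ^ ...  (in GF(2^8))
    Q is evaluated with Horner's rule from the last data block down to D0.
    """
    length = len(data_blocks[0])
    p = bytearray(length)
    q = bytearray(length)
    for block in reversed(data_blocks):
        for i, byte in enumerate(block):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


d = b"\x12\xab\xff"
p, q = raid6_pq([d])
print(p == d and q == d)
```

With one data block per stripe the coefficient on that block is 1 in both sums, so D, P and Q hold identical bytes and any two of the three devices can fail.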

>>They're only missing if you believe the minimum number of RAID5 disks
>>is not two and the minimum number of RAID6 disks is not three.
>
>I do believe that, because that's what the terms are universally taken 
>to mean.
>

Apparently not universally.

-- 
David Taylor


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12  9:06     ` David Taylor
@ 2014-12-12 11:16       ` Robert White
  2014-12-12 13:29         ` Hugo Mills
  2014-12-13  3:01         ` Duncan
  0 siblings, 2 replies; 11+ messages in thread
From: Robert White @ 2014-12-12 11:16 UTC (permalink / raw)
  To: Btrfs BTRFS

On 12/12/2014 01:06 AM, David Taylor wrote:
> The above quote is discussing two device RAID5, you are discussing
> three device RAID5.

Heresy! (yes, some humor is required here.)

There is no such thing as a "two device RAID5". That's what RAID1 is for.

Saying "The above quote is discussing a two device RAID5" is exactly 
like saying "The above quote is discussing a two wheeled tricycle".

You might as well be talking about three-octet IP addresses. That is, you 
could make a network address out of three octets, but it wouldn't be an 
IP address. It would be something else with the wrong name attached.

I challenge you... nay I _defy_ you... to find a single authority on 
disk storage anywhere on this planet (except, apparently, this list and 
its directly attached people and materials) that discusses, describes, 
or acknowledges the existence of a "two device RAID5" while not 
discussing a system with an arity of 3 degraded by the absence of one media.

All these words have standardized definitions.

[That's not hyperbole. I searched for several hours and could not find 
_any_ reference anywhere to construction of a RAID5 array using only two 
devices that did not involve arity-3 and a dummy/missing/failed pseudo 
target. So if you can find any reference to doing this _anywhere_ 
outside of BTRFS I'd like to see it. Genuinely.]

THAT SAID...

I really can find no reason the math wouldn't work using only two 
drives. It would be a terrific waste of CPU cycles and storage space to 
construct the stripe buffers and do the XORs instead of just copying the 
data, but the math would work.

So, um, "well I'll be damned".

Perhaps it is just a tautological belief that someone here didn't buy into. 
Like how people keep partitioning drives into little slices for things 
because that's the preserved wisdom from the early eighties.

I think constructing a non-degraded-mode two device thing and calling it 
RAID5 will surprise virtually _everyone_ on the planet.

In every other system, and I do mean _every_ other system, if I had two 
media and I put them under RAID-5 I'd be required to specify the third 
drive as some sort of failed device (the block device equivalent of 
/dev/null but that returns error results for all operations instead of 
successes). See the reserved keyword "missing" in the mdadm 
documentation etc.

That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a 2TiB 
array with no actual redundancy. As in

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc

the resulting array would be the same effective size as a stripe of the 
two drives, but when the third was added later it would just slot in as 
a replacement for the missing device and the arity-3 thing would 
"reestablish" its redundancy. (This is actually what mdadm does 
internally with a normal build, it blesses the first N-1 drives into an 
array with a missing member, and adds the Nth drive as a "spare" and 
then the spare is immediately adopted as a replacement for the "missing" 
drive.)
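
The capacity arithmetic behind that expectation, as a small Python sketch. `raid5_usable` is a made-up helper and it assumes equal-sized members and ignores metadata overhead.

```python
def raid5_usable(member_size: int, arity: int) -> int:
    """Usable capacity of a RAID5 set: one member's worth of space holds parity."""
    if arity < 2:
        raise ValueError("RAID5 needs at least one data and one parity member")
    return (arity - 1) * member_size


TIB = 1 << 40
# Arity 3 built with one member "missing": the full 2 TiB, but no redundancy until it is filled.
print(raid5_usable(TIB, 3) // TIB)
# Arity 2 with both members present: 1 TiB, the same as a two-disk mirror.
print(raid5_usable(TIB, 2) // TIB)
```

So the two readings of "two disks under RAID5" differ by a factor of two in capacity, which is why the name matters to whoever is sizing the array.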

The parity computation on a single value is just a nutty waste of time 
though. "Backing it out" when the array is degraded is double-nuts.

Maybe everybody just decided it was too crazy to consider for the CPU 
time penalty...?

So yea, semantics... apparently...


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12 11:16       ` Robert White
@ 2014-12-12 13:29         ` Hugo Mills
  2014-12-13  3:01         ` Duncan
  1 sibling, 0 replies; 11+ messages in thread
From: Hugo Mills @ 2014-12-12 13:29 UTC (permalink / raw)
  To: Robert White; +Cc: Btrfs BTRFS


On Fri, Dec 12, 2014 at 03:16:03AM -0800, Robert White wrote:
> On 12/12/2014 01:06 AM, David Taylor wrote:
> >The above quote is discussing two device RAID5, you are discussing
> >three device RAID5.
> 
> Heresy! (yes, some humor is required here.)
> 
> There is no such thing as a "two device RAID5". That's what RAID1 is for.
> 
> Saying "The above quote is discussing a two device RAID5" is exactly
> like saying "The above quote is discussing a two wheeled tricycle".
> 
> You might as well be talking about three-octet IP addresses. That is
> you could make a network address out of three octets, but it
> wouldn't' be an IP address. It would be something else with the
> wrong name attached.

   OK. Sounds like I need to dust off the change-of-nomenclature patch
again.

   The argument here is about the 1c1s1p configuration. Is there a
problem with that?

   Hugo.

> I challenge you... nay I _defy_ you... to find a single authority on
> disk storage anywhere on this planet (except, apparently, this list
> and its directly attached people and materials) that discusses,
> describes, or acknowledges the existence of a "two device RAID5"
> while not discussing a system with an arity of 3 degraded by the
> absence of one media.
> 
> All these words have standardized definitions.
> 
> [That's not hyperbole. I searched for several hours and could not
> find _any_ reference anywhere to construction of a RAID5 array using
> only two devices that did not involve airity-3 and a
> dummy/missing/failed psudo target. So if you can find any reference
> to doing this _anywhere_ outside of BTRFS I'd like to see it.
> Genuinely.]
> 
> THAT SAID...
> 
> I really can find no reason the math wouldn't work using only two
> drives. It would be a terrific waste of CPU cycles and storage space
> to construct the stripe buffers and do the XORs instead of just
> copying the data, but the math would work.
> 
> So, um, "well I'll be damned".
> 
> Perhaps is just a tautological belief that someone here didn't buy
> into. Like how people keep partitioning drives into little slices
> for things because thats the preserved wisdom from early eighties.
> 
> I think constructing a non-degraded-mode two device thing and
> calling it RAID5 will surprise virtually _everyone_ on the planet.
> 
> In every other system. And I do mean _every_ other system, if I had
> two media and I put them under RAID-5 I'd be required to specify the
> third drive as some sort failed device (the block device equivalent
> of /dev/null but that returns error results for all operations
> instead of successes.) See the reserved keyword "missing" in the
> mdadm documentation etc.
> 
> That is, If I put two 1TiB disks into a RAID-5 I'd expect to get a
> 2TiB array with no actual redundancy. As in
> 
> mdadm --create md0 --level=r5 --raid-devices=3 /dev/sda missing /dev/sdc
> 
> the resulting array would be the same effective size as a stripe of
> the two drives, but when the third was added later it would just
> slot in as a replacement for the missing device and the airity-3
> thing would "reestablish" it's redundancy. (this is actually what
> mdadm does internally with a normal build, it blesses the first N-1
> drives into an array with a missing member, and adds the Nth drive
> as a "spare" and then the spare is immediately adopted as a
> replacement for the "missing" drive.)
> 
> The parity computation on a single value is just nutty waste of time
> though. "Backing it out" when the array is degraded is double-nuts.
> 
> Maybe everybody just decided it was too crazy to consider for the
> CPU time penalty...?
> 
> So yea, semantics... apparently...

-- 
Hugo Mills             | There's an infinite number of monkeys outside who
hugo@... carfax.org.uk | want to talk to us about this new script for Hamlet
http://carfax.org.uk/  | they've worked out!
PGP: 65E74AC0          |                                           Arthur Dent



* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12 11:16       ` Robert White
  2014-12-12 13:29         ` Hugo Mills
@ 2014-12-13  3:01         ` Duncan
  1 sibling, 0 replies; 11+ messages in thread
From: Duncan @ 2014-12-13  3:01 UTC (permalink / raw)
  To: linux-btrfs

Robert White posted on Fri, 12 Dec 2014 03:16:03 -0800 as excerpted:

> Perhaps is just a tautological belief that someone here didn't buy into.
> Like how people keep partitioning drives into little slices for things
> because thats the preserved wisdom from early eighties.

While I absolutely agree with your raid5 sentiments (which is exactly 
what I suppose they might be; I'm getting a bit of an education in that 
regard myself, here)...

In the context of the 80s, or even the 90s, nothing about multi-gigabyte 
could be considered "little"! =:^)

In fact, while it most assuredly dates me, it /still/ feels a bit odd 
referring to the 1 GiB btrfs default threshold for mixed-bg-mode as 
"small", given that I distinctly remember wondering how long it might 
take me to fill my first 1 GB (not GiB, unfortunately) drive, tho by that 
time I did have enough experience to know I'd eventually be dealing with 
multi-gig as at the time I was dealing with multi-meg.

More to the point, however...

Those partitions have saved my a** quite a few times over the years.  
Among other things, partitioning allows me to keep my (8 GiB) rootfs an 
entirely separate filesystem that's mounted read-only by default, which 
has kept it undamaged and the tools on it still available to help recover 
my other filesystems, when /var/log and /home were damaged due to a hard 
shutdown recently.

And some years ago I had an AC failure here in Phoenix in the middle of 
the summer, resulting in a physical head-crash and loss of the operating 
partitions on my disk in use at the time, while the backup partitions on 
the same device remained intact, such that after cooldown I actually 
continued to use that disk for some time, mounting the damaged partitions 
only to recover the most recent copies of what I could, updating the 
backups which were now promoted to operational.

Sure, technology such as LVM can do similar and is more flexible in some 
ways, but unfortunately it requires userspace and thus an initr* in 
order to handle a root on the same technology.  Otherwise, root must be 
treated differently, and then you have partitioning again.

Additionally, LVM is yet another layer of software that can and does go 
wrong and itself need fixing.  Partitioning is too, to some extent, but in 
practice it has been pretty bullet-proof compared to technologies such as 
LVM and btrfs-subvolumes.  LVM has some way to go before it's as robust 
as partitioning, and of course btrfs with its subvolumes isn't really 
even completely stable yet.  Further, btrfs doesn't well limit damage of 
a subvolume to just that subvolume (that head-crash scenario would have 
almost certainly been a total loss on btrfs subvolumes), the way 
partitioning tends to do.  And LVM's very flexibility means it doesn't 
normally have that sort of damage limitation either.  It certainly can, 
but doing so severely reduces its flexibility, making going back to 
regular partitions to avoid the complexity and additional points of 
failure entirely a rather viable and often better choice.

Meanwhile, technology such as EFI and GPT is breathing new life into 
partitioning, making it more reliable (checksummed redundant partition 
tables), more useful/flexible (killing the primary/secondary/logical 
divisions and adding partition names/labels and a far larger typing 
space), and creating yet more uses for partitioning in the first place, 
due to separate reserved EFI and legacy-BIOS partition types.

Tho of course these days those partition "slices" are often tens or 
hundreds of gigs, and are now sometimes "teras"[1], bringing up my 
initial point once again; that's NOT actually so small!

But to each his own, of course, and I definitely do agree with you on 
raid5, the larger point.  FWIW, I still consider allowing a two-device 
"raid5" or a three-device "raid6" a bug, particularly given that a single-
device "raid1" is /not/ allowed, nor is a 3-device "raid10".

---
[1] Hmm, K, megs, gigs, "ters", "teras", simply "T" to match K ???

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12  6:01   ` Robert White
  2014-12-12  9:06     ` David Taylor
@ 2014-12-12 16:45     ` Zygo Blaxell
  2014-12-12 22:28       ` Robert White
  1 sibling, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-12 16:45 UTC (permalink / raw)
  To: Robert White; +Cc: Btrfs BTRFS


On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> So RAID5 with three media M is
> 
> M    MM   MMM
> D1   D2   P(a)
> D3   P(b) D4
> P(c) D5   D6

RAID5 with two media is well defined, and looks like this:

M    MM
D1   P(a)
P(b) D2
D3   P(c)

With even parity and N disks

	P(a) ^ D1 [^ D2 ^ ... ^ DN] = 0

Simplifying for one data disk and one parity stripe:

	P(a) ^ D1 = 0

therefore

	P(a) = D1

which is effectively (and, in practice, literally) mirroring.
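
The same derivation can be checked mechanically. This is only a sketch in Python; `even_parity` is an illustrative name, and byte strings stand in for disk blocks.

```python
from functools import reduce
from typing import List


def even_parity(data_blocks: List[bytes]) -> bytes:
    """Return P such that P ^ D1 ^ ... ^ Dk == 0 at every byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks))


d1 = b"btrfs"
# One data block per stripe: the parity block is a byte-for-byte copy.
assert even_parity([d1]) == d1

d2 = b"raid5"
p = even_parity([d1, d2])
# Two data blocks per stripe: any one block is the XOR of the other two.
assert even_parity([p, d2]) == d1
assert even_parity([p, d1]) == d2
```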




* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12 16:45     ` Zygo Blaxell
@ 2014-12-12 22:28       ` Robert White
  2014-12-13  4:28         ` Zygo Blaxell
  0 siblings, 1 reply; 11+ messages in thread
From: Robert White @ 2014-12-12 22:28 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
>> So RAID5 with three media M is
>>
>> M    MM   MMM
>> D1   D2   P(a)
>> D3   P(b) D4
>> P(c) D5   D6
>
> RAID5 with two media is well defined, and looks like this:
>
> M    MM
> D1   P(a)
> P(b) D2
> D3   P(c)

Like I said in the other fork of this thread... I see (now) that the 
math works but I can find no trace of anyone having ever implemented 
this arity-less-than-3, RAID-level-greater-than-one paradigm (outside 
btrfs and its associated materials).

It's like talking about a two-wheeled tricycle. 8-)

I would _genuinely_ like to see any third party discussion of this. It 
just isn't done (probably because, as you've shown, it's just a really 
complicated and CPU intensive way to end up with a simple mirror). I 
spent several hours looking. I can see the math works, and I understand 
what you are doing (as I said at some length in the grandparent message) 
but it "just isn't done".

The reason I use the tricycle example is that, while most people know 
this instinctively, few are aware of the fact that going from two wheels 
to three-or-more wheels reverses the steering paradigm. On a bike you 
push-left lean-left and go-left. At the higher arity vehicles (including 
adding a side-car to a bike) you push-right go left (you lean left too, 
but that's just to keep from nosing over 8-). I find that quite apt in 
the whole RAID1 vs RAID5 discussion since the former is about copying 
one-or-more times and the latter is about starting with a theoretically 
zeroed buffer and doing reversible checksumming into it.

I doubt that I will be the last person to be confused by BTRFS' 
implementation of a two-wheeled tricycle.

You're going to get a lot of mail over the years. 8-)

MEANWHILE

the system really needs to be able to explicitly express and support the 
"missing" media paradigm.

  M     x    MMM
  D1    .    P(a)
  D3    .    D4
  P(c)  .    D6

The correct logic here to "remove" (e.g. "replace with nothing" instead 
of "delete") a media just doesn't seem to exist. And it's already 
painfully missing in the RAID1 situation.

If I have a system with N SATA ports, and I have connected N drives, and 
device M is starting to fail... I need to be able to disconnect M and 
then connect M(new). Possibly with a non-trivial amount of time in 
there. For all RAID levels greater than zero this is a natural operation 
in a degraded mode. And for a nearly full filesystem the shrink 
operation that is btrfs device delete would not work. And for any 
nontrivially occupied filesystem it would be way slow, and need to be 
reversed for another way-slow interval.

So I need to be able to "replace" a drive with a "nothing" so that the 
number of active media becomes N-1 but the arity remains N.

mdadm has the "missing" keyword. The Device Mapper has the "zero" 
target. As near as I can tell btrfs has got nothing in this functional slot.

Imagine, if you will, a block device that is the anti-/dev/null. All 
operations on this block device return EFAULT. Let's call it 
/dev/nothing. And let's say I have a /dev/sdc that has to come out 
immediately (and all my stuff is RAID1/5/6).  The operational chain would be

btrfs replace start /dev/sdc /dev/nothing /
(time passes, physical device is removed and replaced)
btrfs replace start /dev/nothing /dev/sdc /

Now that's good-ish, but really the first replace is pernicious. The 
internal state for the filesystem should just be able to record that 
device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this 
example) is just gone. The replace-with-nothing becomes more-or-less 
instant.

The first replace is also pernicious if it's the second media failure on 
a full RAID6 array, since that would be trying to put the same kernel-level 
device in the array twice.

The restore operation, the replace of the nothing with the something, 
remains fully elaborate.

The "nothing" devices need to show up in the device id tables for a 
running array in their geographically correct positions and all that.
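
To show what I mean by keeping the arity while dropping a member, here is a hypothetical device table in Python. `DeviceTable`, `Slot`, `mark_missing` and `replace` are invented for this sketch and do not correspond to btrfs structures or commands.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    devid: int
    path: Optional[str]   # None means the slot is explicitly "missing"


class DeviceTable:
    """Hypothetical device map: arity is the slot count, not the count of live devices."""

    def __init__(self, paths: List[str]) -> None:
        self.slots: Dict[int, Slot] = {
            devid: Slot(devid, path) for devid, path in enumerate(paths, start=1)
        }

    @property
    def arity(self) -> int:
        return len(self.slots)

    @property
    def active(self) -> int:
        return sum(1 for slot in self.slots.values() if slot.path is not None)

    def mark_missing(self, devid: int) -> None:
        """The near-instant 'replace with nothing': only the table changes."""
        self.slots[devid].path = None

    def replace(self, devid: int, new_path: str) -> None:
        """Fill a missing slot; a real system would then rebuild that device's data."""
        if self.slots[devid].path is not None:
            raise ValueError(f"devid {devid} is not missing")
        self.slots[devid].path = new_path


table = DeviceTable(["/dev/sda", "/dev/sdb", "/dev/sdc"])
table.mark_missing(3)
print(table.arity, table.active)
table.replace(3, "/dev/sdc")
print(table.arity, table.active)
```

Devid 3 keeps its position in the table the whole time, which is what lets the later replace be the only expensive step.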

Without this "missing" status as a first-class part of the system, 
dealing with failures and communicating about those failures with the 
operator will become vexatious.

[The use of "device delete" and "device add" as changes in arity and 
size, and its inapplicability to cases where failure is being dealt with 
absent a change of arity, could be clearer in the documentation.]


* Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
  2014-12-12 22:28       ` Robert White
@ 2014-12-13  4:28         ` Zygo Blaxell
  0 siblings, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2014-12-13  4:28 UTC (permalink / raw)
  To: Robert White; +Cc: Btrfs BTRFS


On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
> On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> >On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> >>So RAID5 with three media M is
> >>
> >>M    MM   MMM
> >>D1   D2   P(a)
> >>D3   P(b) D4
> >>P(c) D5   D6
> >
> >RAID5 with two media is well defined, and looks like this:
> >
> >M    MM
> >D1   P(a)
> >P(b) D2
> >D3   P(c)
> 
> Like I said in the other fork of this thread... I see (now) that the
> math works but I can find no trace of anyone having ever implemented
> this for arity less than 3 RAID greater than one paradigm (outside
> btrfs and its associated materials).

I've set up mdadm that way (though it does ask you to use '--force'
when you set it up).  mdadm will also ask for --force if you try to set
up RAID1 with one disk.

I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way to
change a layout once created (usually because they shoot themselves in
the foot with bad choices early on, e.g. by picking odd parity for RAID5).

The reason to allow it is future expansion:  below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later.  If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5.  So you start with RAID5
on one or two disks so you can scale up without losing any data.

Also, mdadm can grow a two-disk RAID5, but if you try to grow a two-disk
mdadm RAID1 you just get a three-disk RAID1 (i.e. two redundant copies
with no additional capacity).

btrfs doesn't really need this capability for expansion, since it can
just create new RAID5 profile chunks whenever it wants to; however, I'd
expect a complete btrfs RAID5 implementation to borrow some ideas from
ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing.  That would imply
btrfs-RAID56 profile chunks would have to be able to exist on two or even
one disk, if that was all that was available for writing at the time.
Simply using btrfs-RAID1 chunks wouldn't work since they'd behave the
wrong way when more disks were added later.

> MEANWHILE
> 
> the system really needs to be able to explicitly express and support
> the "missing" media paradigm.
> 
>  M     x    MMM
>  D1    .    P(a)
>  D3    .    D4
>  P(c)  .    D6
> 
> The correct logic here to "remove" (e.g. "replace with nothing"
> instead of "delete") a media just doesn't seem to exist. And it's
> already painfully missing in the RAID1 situation.

There are a number of permanent mistakes a naive admin can make when
dealing with a broken array.  I've destroyed arrays (made them permanently
read-only beyond the ability of btrfs kernel or user tools to recover)
by getting "add" and "replace" confused, or by allowing an offline drive
to rejoin an array that had been mounted read-write,degraded for some time.

The basic functionality works.  btrfs does track missing devices and
can replace them relatively quickly (not as fast as mdadm, but less
than an order of magnitude slower) in RAID1.  The reporting is full
of out-of-date cached data, but when a disk is really failing,
there is usually little doubt which one needs to be replaced.

> If I have a system with N SATA ports, and I have connected N drives,
> and device M is starting to fail... I need to be able to disconnect
> M and then connect M(new). Possibly with a non-trivial amount of
> time in there. For all RAID levels greater than zero this is a
> natural operation in a degraded mode. And for a nearly full
> filesystem the shrink operation that is btrfs device delete would
> not work. And for any nontrivially occupied fiesystem it would be
> way slow, and need to be reversed for another way-slow interval.
> 
> So I need to be able to "replace" a drive with a "nothing" so that
> the number of active media becomes N-1 but the arity remains N.

btrfs already does that, but it sucks.  In a naive RAID5 implementation,
a write in degraded mode will corrupt your data if it is interrupted.
This is a general property of all RAID5 implementations that don't have
NVRAM journalling or some other way to solve the atomic update problem.

ZFS does this well:  when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks
in non-degraded mode.  If you have 5 disks, and one dies, your writes
are then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing).  This prevents
the degraded mode write data integrity problem.

When the dead disk is replaced you would have the 3 data + parity promoted
to 4 data + parity, or you can elect not to replace the dead disk and
get 3 data + parity everywhere (with a loss of capacity).  btrfs could
presumably do that by allocating chunks with different raid56 parameters,
although in this early stage of implementation I'm not sure how much of
any of that has been done yet.
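
A toy Python sketch of the width selection described above, under the assumption that new writes always use every live device. `stripe_shape` is a hypothetical helper and does not describe what ZFS or btrfs actually execute.

```python
from typing import Tuple


def stripe_shape(available_devices: int, parity_blocks: int = 1) -> Tuple[int, int]:
    """Return (data_blocks, parity_blocks) for a new write across the live devices."""
    data_blocks = available_devices - parity_blocks
    if data_blocks < 1:
        raise ValueError("not enough devices for one data block plus parity")
    return data_blocks, parity_blocks


print(stripe_shape(5))  # all five disks healthy
print(stripe_shape(4))  # one disk dead: new writes are narrower but not degraded
print(stripe_shape(2))  # the two-device case argued about in this thread
```

The last call is the reason a complete implementation would need two-device RAID5 chunks to be legal: it is simply the narrowest shape the same rule can produce.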

> mdadm has the "missing" keyword. the Device Mapper has the "zero"
> target. 

dm also has the "ioerror" target, which is much better for this ("zero"
would allow reads to succeed, which is incorrect).  lvm2 uses "ioerror"
for missing pieces of broken LVs in partial mode.

> btrfs replace start /dev/sdc /dev/nothing /
> (time pases, physical device is removed and replace)
> btrfs replace start /dev/nothing /dev/sdc /

Why wouldn't you just remove the physical device (say device #2) and
then run:

	btrfs replace start 2 /dev/sdc /

?  The way it works now seems much less complicated than what you propose.

Granted, I have a feature request here:  we know the sizes of all the
missing disks, and we know the size of /dev/sdc, so why can't we just
write "missing" instead of "2" and have btrfs choose a missing device
to replace by itself?

> Now that's good-ish, but really the first replace is pernicious. The
> internal state for the filesystem should just be able to record that
> device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this
> example) is just gone. The replace-with-nothing becomes more-or-less
> instant.

To clarify:  what is required here is the ability to quickly record that
the device's subuuid is no longer welcome in this filesystem, and never
will be.  Should it reappear in the future, it has to be excluded from
the btrfs.

The underlying physical device could return, but it would have to
be treated as a new empty device with a new subuuid, and its data
reconstructed by btrfs balance or btrfs replace.

This is because btrfs does really awful things when a filesystem gets
assembled out of mirrors of different vintages.  Before allowing writes
on a subset of the disks in a multi-disk btrfs, the disks that are written
have to agree that they are now the only disks that are currently members
of the filesystem.
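
One way to picture that agreement, as a hedged Python sketch. It assumes each device records a generation counter and that the filesystem keeps a set of excluded sub-UUIDs; `admissible_devices` and both data structures are hypothetical and are not how btrfs assembles a filesystem today.

```python
from typing import Dict, Set


def admissible_devices(generations: Dict[str, int], banned: Set[str]) -> Set[str]:
    """Pick the device sub-UUIDs that may be assembled into a writable filesystem.

    A device is admitted only if it has not been excluded and its recorded
    generation matches the newest generation among the non-excluded devices.
    """
    candidates = {uuid: gen for uuid, gen in generations.items() if uuid not in banned}
    if not candidates:
        return set()
    newest = max(candidates.values())
    return {uuid for uuid, gen in candidates.items() if gen == newest}


# Device C was offline while A and B kept taking writes in degraded mode.
seen = {"A": 120, "B": 120, "C": 97}
print(sorted(admissible_devices(seen, banned=set())))
```

A stale or excluded device would then have to come back as a new, empty member and be refilled by replace or balance.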

> [The use of "device delete" and "device add" as changes in arity and
> size, and its inaplicability to cases where failure is being dealt
> with abent a change of arity, could be clearer in the
> documentation.]

Yes.  This is _not_ equivalent to a btrfs replace, although it is very
similar:

	btrfs device add /dev/sdc /
	btrfs device delete missing /

It can work--sometimes--but it needs a surprising amount of free space
(or multiple new drives).



end of thread, other threads:[~2014-12-13  4:28 UTC | newest]

Thread overview: 11+ messages
2014-12-10 22:18 mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?] Robert White
2014-12-11  7:33 ` Duncan
2014-12-12  3:56 ` Zygo Blaxell
2014-12-12  6:01   ` Robert White
2014-12-12  9:06     ` David Taylor
2014-12-12 11:16       ` Robert White
2014-12-12 13:29         ` Hugo Mills
2014-12-13  3:01         ` Duncan
2014-12-12 16:45     ` Zygo Blaxell
2014-12-12 22:28       ` Robert White
2014-12-13  4:28         ` Zygo Blaxell
