linux-btrfs.vger.kernel.org archive mirror
* Likelihood of read error, recover device failure raid10
@ 2016-08-13 15:39 Wolfgang Mader
  2016-08-13 20:15 ` Hugo Mills
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Wolfgang Mader @ 2016-08-13 15:39 UTC (permalink / raw)
  To: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

Hi,

I have two questions

1) Layout of raid10 in btrfs
btrfs pools all devices and then stripes and mirrors across this pool. Is it
therefore correct that a raid10 layout consisting of 4 devices a,b,c,d is
_not_

              raid0
       |---------------|
------------      -------------
|a|  |b|      |c|  |d|
   raid1            raid1

Rather, there is no clear distinction at the device level between two devices
which form a raid1 set which are then paired by raid0, but simply, each bit is
mirrored across two different devices. Is this correct?

2) Recover raid10 from a failed disk
Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10 from 
n devices, each bit is mirrored across two devices. Therefore, in order to 
restore a raid10 from a single failed device, I need to read this device's
worth of data from the remaining n-1 devices. In case the amount of data
on the failed disk is on the order of the number of bits for which I can
expect an unrecoverable read error from a device, I will most likely not be
able to recover from the disk failure. Is this conclusion correct, or am I
missing something here?

Thanks,
Wolfgang

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-13 15:39 Likelihood of read error, recover device failure raid10 Wolfgang Mader
@ 2016-08-13 20:15 ` Hugo Mills
  2016-08-14  1:07 ` Duncan
  2016-08-14 16:20 ` Chris Murphy
  2 siblings, 0 replies; 8+ messages in thread
From: Hugo Mills @ 2016-08-13 20:15 UTC (permalink / raw)
  To: Wolfgang Mader; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 2308 bytes --]

On Sat, Aug 13, 2016 at 05:39:18PM +0200, Wolfgang Mader wrote:
> Hi,
> 
> I have two questions
> 
> 1) Layout of raid10 in btrfs
> btrfs pools all devices and then stripes and mirrors across this pool. Is it
> therefore correct that a raid10 layout consisting of 4 devices a,b,c,d is
> _not_
> 
>               raid0
>        |---------------|
> ------------      -------------
> |a|  |b|      |c|  |d|
>    raid1            raid1
> 
> Rather, there is no clear distinction of device level between two devices 
> which form a raid1 set which are then paired by raid0, but simply, each bit is
> mirrored across two different devices. Is this correct?

   Correct. There's no clear hierarchy of RAID-1-then-RAID-0 vs
RAID-0-then-RAID-1. Instead, if you look at a single device (in a
4-device array), that will be one of two copies, and will be either
the "odd" stripes or the "even" stripes. That's all you get, within a
block group.

> 2) Recover raid10 from a failed disk
> Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10 from 
> n devices, each bit is mirrored across two devices. Therefore, in order to 
> restore a raid10 from a single failed device, I need to read the amount of 
> data worth this device from the remaining n-1 devices. In case, the amount of 
> data on the failed disk is in the order of the number of bits for which I can 
> expect an unrecoverable read error from a device, I will most likely not be 
> able to recover from the disk failure. Is this conclusion correct, or am I
> missing something here?

   That's right, but the unrecoverable bit rates quoted by the hard
drive manufacturers aren't necessarily reflected in the real-life
usage of the devices. I think that if you're doing those calculations,
you really need to find out what the values quoted by the manufacturer
actually mean, first. (i.e. if you read all the data once a month with
a scrub, and allow the drive to identify and correct any transient
errors which might indicate incipient failure, does the quoted BER
still apply?)
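
   For what it's worth, the usual back-of-envelope version of that
calculation (assuming a 4 TB device, the commonly quoted consumer
figure of one unrecoverable error per 1e14 bits read, and independent
errors -- all three assumptions are debatable) looks like this:

    bits read for a full-device rebuild:  4e12 * 8 = 3.2e13
    P(at least one URE) = 1 - (1 - 1e-14)^3.2e13
                       ~= 1 - exp(-0.32)
                       ~= 0.27

which is exactly why it matters so much what the quoted figure
actually means in practice.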

   Hugo.

-- 
Hugo Mills             | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4          |                                          Ford Prefect

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-13 15:39 Likelihood of read error, recover device failure raid10 Wolfgang Mader
  2016-08-13 20:15 ` Hugo Mills
@ 2016-08-14  1:07 ` Duncan
  2016-08-14 16:20 ` Chris Murphy
  2 siblings, 0 replies; 8+ messages in thread
From: Duncan @ 2016-08-14  1:07 UTC (permalink / raw)
  To: linux-btrfs

Wolfgang Mader posted on Sat, 13 Aug 2016 17:39:18 +0200 as excerpted:

> Hi,
> 
> I have two questions
> 
> 1) Layout of raid10 in btrfs btrfs pools all devices and then stripes
> and mirrors across this pool. Is it therefore correct that a raid10
> layout consisting of 4 devices a,b,c,d is _not_
> 
>               raid0
>        |---------------|
> ------------      ------------
> |a|      |b|      |c|      |d|
>    raid1             raid1
> 
> Rather, there is no clear distinction of device level between two
> devices which form a raid1 set which are then paired by raid0, but
> simply, each bit is mirrored across two different devices. Is this
> correct?

Not correct in detail, but you have the general idea, yes.

The key thing to remember with btrfs in this context is that it's chunk-
based raid, /not/ device-based (or for that matter, bit- or byte-based)
raid.  If the "each bit" in your last sentence above is substituted with
"each chunk", where chunks are nominally (that is, they can vary from
this) 1 GiB for data and 256 MiB for metadata, thus billions of times
your "each bit" size, /then/ your description gets much more accurate.
(Tho technically each strip is 64 KiB, I believe, with each strip
mirrored at the raid1 level and then combined with other strips at the
raid0 level to make a stripe, multiple stripes then composing a chunk,
and the device assignment variable at the chunk level.)

At the chunk level, mirroring and striping are as you indicate.  Chunks
are allocated on-demand from the available unallocated space, such that
the devices holding the two mirrors of each strip can vary from one chunk
to the next, which, if I'm not mistaken, was the point you were making.

The effect is that btrfs raid10 lacks the ability that per-device raid10
has to tolerate the loss of two devices as long as those two devices
belong to separate raid1 pairs underneath the raid0.  Once a decent
number of chunks have been allocated, there are no distinct raid1 pairs
at the btrfs device level, so the loss of any two devices is virtually
guaranteed to mean the loss of both mirrors of some strip of a chunk for
/some/ number of chunks.
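
To put a number on "virtually guaranteed": assuming (simplistically)
that the allocator pairs devices effectively at random and
independently for each chunk, a given chunk on a 4-device raid10
survives a particular 2-device loss only if the two dead devices ended
up in different mirror pairs, which is 2 of the 3 possible pairings:

    P(every chunk survives) = (2/3)^C   for C chunks

Even at C=50 chunks (roughly 50 GiB of data) that's about 1.6e-9, i.e.
effectively zero.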


Of course it remains possible, indeed quite viably so, to create a hybrid 
raid, btrfs raid1 on top of md- or dm-raid0, for instance.  Altho that's 
technically raid01 instead of raid10, btrfs raid1 has some distinctive 
advantages that make it the preferred top layer in this sort of hybrid, 
as opposed to btrfs raid0 on top of md/dm-raid1, the conventionally 
preferred raid10 arrangement.

Namely, btrfs raid1 has the file integrity feature in the form of
checksumming and checksum-validation-failure detection, and, for raid1,
repair from the mirror copy, assuming of course that the mirror copy
itself passes checksum validation.  Few raid schemes have that, and
it's enough of a feature leap to justify making the top layer btrfs
raid1 rather than btrfs raid0.  Btrfs raid0 would lack that automatic
error repair; it could still detect the error based on the checksums,
but even manual repair would be difficult, as you'd have to somehow
figure out which copy it read the bad data from, then check the other
copy and confirm it was good before overwriting the bad one.
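
A minimal sketch of that sort of hybrid, assuming four spare disks
sda-sdd (the device names here are placeholders, adjust to taste):

    # two md raid0 "legs", two disks each
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
    mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd

    # btrfs raid1 for both data and metadata across the two legs
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1

That keeps the checksum-and-repair layer (btrfs raid1) on top, with md
handling the striping underneath.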

> 2) Recover raid10 from a failed disk Raid10 inherits its redundancy from
> the raid1 scheme. If I build a raid10 from n devices, each bit is
> mirrored across two devices. Therefore, in order to restore a raid10
> from a single failed device, I need to read the amount of data worth
> this device from the remaining n-1 devices. In case, the amount of data
> on the failed disk is in the order of the number of bits for which I can
> expect an unrecoverable read error from a device, I will most likely not
> be able to recover from the disk failure. Is this conclusion correct, or
> am I missing something here?

Again, not each bit, but (each strip of) each chunk (with the strips 
being 64 KiB IIRC).

But your conclusion is generally correct: the problem would quite likely
be detected as a checksum verification failure, but if it were to occur
in the raid1 pair that was degraded, there would be no second copy to
fall back on for repair.

Of course that's a 50% chance, with the other possibility being that the 
IO read error occurs in the undegraded raid1, and thus can be corrected 
normally.

Which means that, given random read errors, if you try the recovery
enough times you should eventually succeed, because sooner or later
you'll get an attempt in which any read errors fall in the still
undegraded raid1 area.

Tho of course if the read error isn't random and it happens repeatedly in 
the degraded area, you're screwed, for whatever file or metadata covering 
multiple files it was in, at least.  You should still be able to recover 
the rest of the filesystem, however.

Which all goes to demonstrate once again that raid != backup, and there's 
no substitute for the latter, to whatever level the value of the data in 
question justifies.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-13 15:39 Likelihood of read error, recover device failure raid10 Wolfgang Mader
  2016-08-13 20:15 ` Hugo Mills
  2016-08-14  1:07 ` Duncan
@ 2016-08-14 16:20 ` Chris Murphy
  2016-08-14 18:04   ` Wolfgang Mader
                     ` (2 more replies)
  2 siblings, 3 replies; 8+ messages in thread
From: Chris Murphy @ 2016-08-14 16:20 UTC (permalink / raw)
  To: Wolfgang Mader; +Cc: Btrfs BTRFS

On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
<Wolfgang_Mader@brain-frog.de> wrote:
> Hi,
>
> I have two questions
>
> 1) Layout of raid10 in btrfs
> btrfs pools all devices and then stripes and mirrors across this pool. Is it
> therefore correct that a raid10 layout consisting of 4 devices a,b,c,d is
> _not_
>
>               raid0
>        |---------------|
> ------------      -------------
> |a|  |b|      |c|  |d|
>    raid1            raid1
>
> Rather, there is no clear distinction of device level between two devices
> which form a raid1 set which are then paired by raid0, but simply, each bit is
> mirrored across two different devices. Is this correct?

All of the profiles apply to block groups (chunks), and that includes
raid10. They only incidentally apply to devices since of course block
groups end up on those devices, but which stripe ends up on which
device is not consistent, and that ends up making Btrfs raid10 pretty
much only able to survive a single device loss.

I don't know if this is really thoroughly understood. I just did a
test and I kinda wonder if the reason for this inconsistent assignment
is a difference between the initial stripe->devid pairing at mkfs time
and the subsequent pairings done by kernel code. For example, I
get this from mkfs:

    item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715 itemsize 176
        chunk length 16777216 owner 2 stripe_len 65536
        type SYSTEM|RAID10 num_stripes 4
            stripe 0 devid 4 offset 1048576
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 1 devid 3 offset 1048576
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 2 offset 1048576
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 3 devid 1 offset 20971520
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
    item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539 itemsize 176
        chunk length 2147483648 owner 2 stripe_len 65536
        type METADATA|RAID10 num_stripes 4
            stripe 0 devid 4 offset 9437184
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 1 devid 3 offset 9437184
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 2 offset 9437184
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 3 devid 1 offset 29360128
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
    item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363 itemsize 176
        chunk length 2147483648 owner 2 stripe_len 65536
        type DATA|RAID10 num_stripes 4
            stripe 0 devid 4 offset 1083179008
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 1 devid 3 offset 1083179008
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 2 offset 1083179008
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 3 devid 1 offset 1103101952
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74

Here you can see every chunk type has the same stripe to devid
pairing. But once the kernel starts to allocate more data chunks, the
pairing is different from mkfs, yet always (so far) consistent for
each additional kernel allocated chunk.


    item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187 itemsize 176
        chunk length 2147483648 owner 2 stripe_len 65536
        type DATA|RAID10 num_stripes 4
            stripe 0 devid 2 offset 2156920832
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 1 devid 3 offset 2156920832
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 4 offset 2156920832
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 3 devid 1 offset 2176843776
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74

This volume now has about a dozen chunks created by kernel code, and
the stripe X to devid Y mapping is identical. Using dd and hexdump,
I'm finding that stripes 0 and 1 are a mirrored pair: they contain
identical information. Stripes 2 and 3 are likewise a mirrored pair. The
raid0 striping happens across 01 and 23, such that odd-numbered 64KiB
(default) stripe elements go on 01 and even-numbered stripe elements
go on 23. If the stripe to devid pairing were always consistent, I
could lose more than one device and still have a viable volume, just
like a conventional raid10. Of course you can't lose both of any
mirrored pair, but you could lose one of every mirrored pair. That's
why raid10 is considered scalable.
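
(For anyone who wants to reproduce chunk listings like the ones above:
they can be pulled from a device with something along these lines,
depending on the btrfs-progs version -- treat the exact options as an
approximation and check the man page:

    # older progs; tree 3 is the chunk tree
    btrfs-debug-tree -t 3 /dev/mapper/VG-1

    # newer progs
    btrfs inspect-internal dump-tree -t chunk /dev/mapper/VG-1

then grep for CHUNK_ITEM / stripe / devid.)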

But apparently the pairing is different between mkfs and kernel code.
And due to that I can't reliably lose more than one device. There is
an edge case where I could lose two:



mkfs-created chunks:
stripe 0 devid 4
stripe 1 devid 3
stripe 2 devid 2
stripe 3 devid 1

kernel-created chunks:
stripe 0 devid 2
stripe 1 devid 3
stripe 2 devid 4
stripe 3 devid 1


I could, in theory, lose devid 3 and devid 1 and still have one of
each stripe copies for all block groups, but kernel code doesn't
permit this:

[352467.557960] BTRFS warning (device dm-9): missing devices (2)
exceeds the limit (1), writeable mount is not allowed
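
(Worked out from the two pairings above, as a sketch: the mirror pairs
are {devid 4, devid 3} and {devid 2, devid 1} for the mkfs-created
chunks, and {devid 2, devid 3} and {devid 4, devid 1} for the
kernel-created chunks.  A two-device loss is survivable only if it
breaks no mirror pair in either set, which leaves exactly {devid 3,
devid 1} and {devid 4, devid 2} as candidates -- any other pair of
devices takes out both copies of something.)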



> 2) Recover raid10 from a failed disk
> Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10 from
> n devices, each bit is mirrored across two devices. Therefore, in order to
> restore a raid10 from a single failed device, I need to read the amount of
> data worth this device from the remaining n-1 devices.

Maybe? In a traditional raid10, rebuild of a faulty device means
reading 100% of its mirror device and that's it. For Btrfs the same
could be true; it just depends on where the block group copies are
located. They could all be on just one other device, or they could be
spread across more than one device. Also, Btrfs only copies used
extents rather than doing a sector-level rebuild, so it skips the
empty space.

>In case, the amount of
> data on the failed disk is in the order of the number of bits for which I can
> expect an unrecoverable read error from a device, I will most likely not be
> able to recover from the disk failure. Is this conclusion correct, or am I
> missing something here?

I think you're overestimating the probability of a URE. They're pretty
rare, and they're far less likely if you're doing regular scrubs.

I haven't actually tested this, but if a URE or even a checksum
mismatch were to happen on a data block group during a rebuild
following replacement of a failed device, I'd like to think Btrfs just
complains and doesn't stop the remainder of the rebuild. If it happens
on a metadata or system chunk, well, that's bad and could be fatal.


As an aside, I'm finding the size information for the data chunk in
'fi us' confusing...

The sample file system contains one file:
[root@f24s ~]# ls -lh /mnt/0
total 1.4G
-rw-r--r--. 1 root root 1.4G Aug 13 19:24
Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso


[root@f24s ~]# btrfs fi us /mnt/0
Overall:
    Device size:         400.00GiB
    Device allocated:           8.03GiB
    Device unallocated:         391.97GiB
    Device missing:             0.00B
    Used:               2.66GiB
    Free (estimated):         196.66GiB    (min: 196.66GiB)
    Data ratio:                  2.00
    Metadata ratio:              2.00
    Global reserve:          16.00MiB    (used: 0.00B)

## "Device size" is total volume or pool size, "Used" shows actual
usage accounting for the replication of raid1, and yet "Free" shows
1/2. This can't work long term as by the time I have 100GiB in the
volume, Used will report 200Gib while Free will report 100GiB for a
total of 300GiB which does not match the device size. So that's a bug
in my opinion.

Data,RAID10: Size:2.00GiB, Used:1.33GiB
   /dev/mapper/VG-1     512.00MiB
   /dev/mapper/VG-2     512.00MiB
   /dev/mapper/VG-3     512.00MiB
   /dev/mapper/VG-4     512.00MiB

## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.
And now in this area the user is somehow expected to know that all of
these values are 1/2 their actual value due to the RAID10. I don't
like this inconsistency for one. But it's made worse by using the
secret decoder ring method of usage when it comes to individual device
allocations. Very clearly Size is really 4GiB, and each device has a 1GiB
chunk. So why not say that? This is consistent with the earlier
"Device allocated" value of 8GiB.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-14 16:20 ` Chris Murphy
@ 2016-08-14 18:04   ` Wolfgang Mader
  2016-08-15  4:21     ` Wolfgang Mader
  2016-08-15  3:46   ` Andrei Borzenkov
  2016-08-15  5:51   ` Andrei Borzenkov
  2 siblings, 1 reply; 8+ messages in thread
From: Wolfgang Mader @ 2016-08-14 18:04 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 10436 bytes --]

On Sunday, August 14, 2016 10:20:39 AM CEST you wrote:
> On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
> 
> <Wolfgang_Mader@brain-frog.de> wrote:
> > Hi,
> > 
> > I have two questions
> > 
> > 1) Layout of raid10 in btrfs
> > btrfs pools all devices and then stripes and mirrors across this pool. Is
> > it therefore correct that a raid10 layout consisting of 4 devices
> > a,b,c,d is _not_
> > 
> >               raid0
> >        |---------------|
> > ------------      -------------
> > |a|  |b|      |c|  |d|
> >    raid1            raid1
> > 
> > Rather, there is no clear distinction of device level between two devices
> > which form a raid1 set which are then paired by raid0, but simply, each
> > bit is mirrored across two different devices. Is this correct?
> 
> All of the profiles apply to block groups (chunks), and that includes
> raid10. They only incidentally apply to devices since of course block
> groups end up on those devices, but which stripe ends up on which
> device is not consistent, and that ends up making Btrfs raid10 pretty
> much only able to survive a single device loss.
> 
> I don't know if this is really thoroughly understood. I just did a
> test and I kinda wonder if the reason for this inconsistent assignment
> is a difference between the initial stripe>devid pairing at mkfs time,
> compared to subsequent pairings done by kernel code. For example, I
> get this from mkfs:
> 
>     item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715 itemsize
> 176 chunk length 16777216 owner 2 stripe_len 65536
>         type SYSTEM|RAID10 num_stripes 4
>             stripe 0 devid 4 offset 1048576
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 1 devid 3 offset 1048576
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 2 offset 1048576
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 3 devid 1 offset 20971520
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
>     item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539 itemsize
> 176 chunk length 2147483648 owner 2 stripe_len 65536
>         type METADATA|RAID10 num_stripes 4
>             stripe 0 devid 4 offset 9437184
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 1 devid 3 offset 9437184
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 2 offset 9437184
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 3 devid 1 offset 29360128
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
>     item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363
> itemsize 176
>         chunk length 2147483648 owner 2 stripe_len 65536
>         type DATA|RAID10 num_stripes 4
>             stripe 0 devid 4 offset 1083179008
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 1 devid 3 offset 1083179008
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 2 offset 1083179008
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 3 devid 1 offset 1103101952
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> 
> Here you can see every chunk type has the same stripe to devid
> pairing. But once the kernel starts to allocate more data chunks, the
> pairing is different from mkfs, yet always (so far) consistent for
> each additional kernel allocated chunk.
> 
> 
>     item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187
> itemsize 176
>         chunk length 2147483648 owner 2 stripe_len 65536
>         type DATA|RAID10 num_stripes 4
>             stripe 0 devid 2 offset 2156920832
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 1 devid 3 offset 2156920832
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 4 offset 2156920832
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 3 devid 1 offset 2176843776
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> 
> This volume now has about a dozen chunks created by kernel code, and
> the stripe X to devid Y mapping is identical. Using dd and hexdump,
> I'm finding that stripe 0 and 1 are mirrored pairs, they contain
> identical information. And stripe 2 and 3 are mirrored pairs. And the
> raid0 striping happens across 01 and 23 such that odd-numbered 64KiB
> (default) stripe elements go on 01, and even-numbered stripe elements
> go on 23. If the stripe to devid pairing were always consistent, I
> could lose more than one device and still have a viable volume, just
> like a conventional raid10. Of course you can't lose both of any
> mirrored pair, but you could lose one of every mirrored pair. That's
> why raid10 is considered scalable.

Let me compare btrfs raid10 to a conventional raid5. Assume a raid5 across
n disks. Then, for each chunk (I don't know the exact unit of such a chunk)
on n-1 disks, a parity chunk is written to the remaining disk using xor.
Parity chunks are distributed across all disks. In case the data of a failed
disk has to be restored from the degraded array, the entirety of the n-1
remaining disks has to be read in order to reconstruct the data via xor. Is
this correct? Again, in order to restore a failed disk in raid5, all data on
all remaining disks is needed; otherwise the array cannot be restored.
Correct?

For btrfs raid10, I can only lose a single device, but in order to rebuild
it, I only need to read the amount of data which was stored on the failed
device, as mirroring is used rather than parity. Correct? Therefore, the
number of bits I need to read successfully for a rebuild is independent of
the number of devices included in the raid10, while the amount of data to
read scales with the number of devices in a raid5.
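
As a rough worked example of that difference (purely illustrative
numbers: 4 TB devices that are completely full, the often-quoted URE
rate of 1e-14 per bit, and independent errors):

    raid10 rebuild: reads ~4 TB = 3.2e13 bits
                    P(no URE) ~= (1 - 1e-14)^3.2e13 ~= exp(-0.32) ~= 0.73

    6-disk raid5 rebuild: reads ~5 * 4 TB = 1.6e14 bits
                    P(no URE) ~= exp(-1.6) ~= 0.20

So the raid5 rebuild's exposure grows with the device count, while the
raid10 rebuild's exposure stays roughly that of reading one device.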

Still, I think it is unfortunate that btrfs raid10 does not stick to a fixed
layout, as then the entire array (minus at most one device) must be
available. If you have your devices attached to more than one controller, in
more than one case powered by different power supplies etc., the probability
for their failure has to be summed up, as no component is allowed to fail.
Is work under way to change this, or is this something out of reach for
btrfs, as it is an implementation detail of the kernel?

> 
> But apparently the pairing is different between mkfs and kernel code.
> And due to that I can't reliably lose more than one device. There is
> an edge case where I could lose two:
> 
> 
> 
> stripe 0 devid 4
> stripe 1 devid 3
> stripe 2 devid 2
> stripe 3 devid 1
> 
> stripe 0 devid 2
> stripe 1 devid 3
> stripe 2 devid 4
> stripe 3 devid 1
> 
> 
> I could, in theory, lose devid 3 and devid 1 and still have one of
> each stripe copies for all block groups, but kernel code doesn't
> permit this:
> 
> [352467.557960] BTRFS warning (device dm-9): missing devices (2)
> exceeds the limit (1), writeable mount is not allowed
> 
> > 2) Recover raid10 from a failed disk
> > Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10
> > from n devices, each bit is mirrored across two devices. Therefore, in
> > order to restore a raid10 from a single failed device, I need to read the
> > amount of data worth this device from the remaining n-1 devices.
> 
> Maybe? In a traditional raid10, rebuild of a faulty device means
> reading 100% of its mirror device and that's it. For Btrfs the same
> could be true, it just depends on where the block group copies are
> located, they could all be on just one other device, or they could be
> spread across more than one device. Also for Btrfs it's only copying
> extents, it's not doing sector level rebuild, it'll skip the empty
> space.
> 
> >In case, the amount of
> >
> > data on the failed disk is in the order of the number of bits for which I
> > can expect an unrecoverable read error from a device, I will most likely
> > not be able to recover from the disk failure. Is this conclusion correct,
> > or am I missing something here?
> 
> I think you're overestimating the probability of a URE. They're pretty
> rare, and it's far less likely if you're doing regular scrubs.
> 
> I haven't actually tested this but if a URE or even a checksum
> mismatch were to happen on a data block group during rebuild following
> replacing a failed device, I'd like to think Btrfs just complains, it
> doesn't stop the remainder of the rebuild. If it happens on metadata
> or system chunk, well that's bad and could be fatal.
> 
> 
> As an aside, I'm finding the size information for the data chunk in
> 'fi us' confusing...
> 
> The sample file system contains one file:
> [root@f24s ~]# ls -lh /mnt/0
> total 1.4G
> -rw-r--r--. 1 root root 1.4G Aug 13 19:24
> Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso
> 
> 
> [root@f24s ~]# btrfs fi us /mnt/0
> Overall:
>     Device size:         400.00GiB
>     Device allocated:           8.03GiB
>     Device unallocated:         391.97GiB
>     Device missing:             0.00B
>     Used:               2.66GiB
>     Free (estimated):         196.66GiB    (min: 196.66GiB)
>     Data ratio:                  2.00
>     Metadata ratio:              2.00
>     Global reserve:          16.00MiB    (used: 0.00B)
> 
> ## "Device size" is total volume or pool size, "Used" shows actual
> usage accounting for the replication of raid1, and yet "Free" shows
> 1/2. This can't work long term as by the time I have 100GiB in the
> volume, Used will report 200GiB while Free will report 100GiB for a
> total of 300GiB which does not match the device size. So that's a bug
> in my opinion.
> 
> Data,RAID10: Size:2.00GiB, Used:1.33GiB
>    /dev/mapper/VG-1     512.00MiB
>    /dev/mapper/VG-2     512.00MiB
>    /dev/mapper/VG-3     512.00MiB
>    /dev/mapper/VG-4     512.00MiB
> 
> ## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.
> And now in this area the user is somehow expected to know that all of
> these values are 1/2 their actual value due to the RAID10. I don't
> like this inconsistency for one. But it's made worse by using the
> secret decoder ring method of usage when it comes to individual device
> allocations. Very clearly Size is really 4GiB, and each device has a 1GiB
> chunk. So why not say that? This is consistent with the earlier
> "Device allocated" value of 8GiB.


[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-14 16:20 ` Chris Murphy
  2016-08-14 18:04   ` Wolfgang Mader
@ 2016-08-15  3:46   ` Andrei Borzenkov
  2016-08-15  5:51   ` Andrei Borzenkov
  2 siblings, 0 replies; 8+ messages in thread
From: Andrei Borzenkov @ 2016-08-15  3:46 UTC (permalink / raw)
  To: Chris Murphy, Wolfgang Mader; +Cc: Btrfs BTRFS

14.08.2016 19:20, Chris Murphy wrote:
...
> 
> This volume now has about a dozen chunks created by kernel code, and
> the stripe X to devid Y mapping is identical. Using dd and hexdump,
> I'm finding that stripe 0 and 1 are mirrored pairs, they contain
> identical information. And stripe 2 and 3 are mirrored pairs. And the
> raid0 striping happens across 01 and 23 such that odd-numbered 64KiB
> (default) stripe elements go on 01, and even-numbered stripe elements
> go on 23. If the stripe to devid pairing were always consistent, I
> could lose more than one device and still have a viable volume, just
> like a conventional raid10. Of course you can't lose both of any
> mirrored pair, but you could lose one of every mirrored pair. That's
> why raid10 is considered scalable.
> 
> But apparently the pairing is different between mkfs and kernel code.

My understanding is that the chunk allocation code uses devices in the
order they have been discovered, which implies that the order can
change between reboots or even while the system is running if devices
are added or removed.

Also, I think the code may skip devices under some conditions (the most
obvious being not enough space, which may happen if you mix allocation
profiles).

So the only thing that is guaranteed right now is that every stripe
element will be on a different device.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-14 18:04   ` Wolfgang Mader
@ 2016-08-15  4:21     ` Wolfgang Mader
  0 siblings, 0 replies; 8+ messages in thread
From: Wolfgang Mader @ 2016-08-15  4:21 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11350 bytes --]

On Sunday, August 14, 2016 8:04:14 PM CEST you wrote:
> On Sunday, August 14, 2016 10:20:39 AM CEST you wrote:
> > On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
> > 
> > <Wolfgang_Mader@brain-frog.de> wrote:
> > > Hi,
> > > 
> > > I have two questions
> > > 
> > > 1) Layout of raid10 in btrfs
> > > btrfs pools all devices and then stripes and mirrors across this pool.
> > > Is
> > > it therefore correct that a raid10 layout consisting of 4 devices
> > > a,b,c,d is _not_
> > > 
> > >               raid0
> > >        |---------------|
> > > ------------      -------------
> > > |a|  |b|      |c|  |d|
> > >    raid1            raid1
> > > 
> > > Rather, there is no clear distinction of device level between two
> > > devices
> > > which form a raid1 set which are then paired by raid0, but simply, each
> > > bit is mirrored across two different devices. Is this correct?
> > 
> > All of the profiles apply to block groups (chunks), and that includes
> > raid10. They only incidentally apply to devices since of course block
> > groups end up on those devices, but which stripe ends up on which
> > device is not consistent, and that ends up making Btrfs raid10 pretty
> > much only able to survive a single device loss.
> > 
> > I don't know if this is really thoroughly understood. I just did a
> > test and I kinda wonder if the reason for this inconsistent assignment
> > is a difference between the initial stripe>devid pairing at mkfs time,
> > compared to subsequent pairings done by kernel code. For example, I
> > 
> > get this from mkfs:
> >     item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715
> >     itemsize
> > 
> > 176 chunk length 16777216 owner 2 stripe_len 65536
> > 
> >         type SYSTEM|RAID10 num_stripes 4
> >         
> >             stripe 0 devid 4 offset 1048576
> >             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >             stripe 1 devid 3 offset 1048576
> >             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >             stripe 2 devid 2 offset 1048576
> >             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >             stripe 3 devid 1 offset 20971520
> >             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >     
> >     item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539
> >     itemsize
> > 
> > 176 chunk length 2147483648 owner 2 stripe_len 65536
> > 
> >         type METADATA|RAID10 num_stripes 4
> >         
> >             stripe 0 devid 4 offset 9437184
> >             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >             stripe 1 devid 3 offset 9437184
> >             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >             stripe 2 devid 2 offset 9437184
> >             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >             stripe 3 devid 1 offset 29360128
> >             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >     
> >     item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363
> > 
> > itemsize 176
> > 
> >         chunk length 2147483648 owner 2 stripe_len 65536
> >         type DATA|RAID10 num_stripes 4
> >         
> >             stripe 0 devid 4 offset 1083179008
> >             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >             stripe 1 devid 3 offset 1083179008
> >             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >             stripe 2 devid 2 offset 1083179008
> >             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >             stripe 3 devid 1 offset 1103101952
> >             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> > 
> > Here you can see every chunk type has the same stripe to devid
> > pairing. But once the kernel starts to allocate more data chunks, the
> > pairing is different from mkfs, yet always (so far) consistent for
> > each additional kernel allocated chunk.
> > 
> >     item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187
> > 
> > itemsize 176
> > 
> >         chunk length 2147483648 owner 2 stripe_len 65536
> >         type DATA|RAID10 num_stripes 4
> >         
> >             stripe 0 devid 2 offset 2156920832
> >             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >             stripe 1 devid 3 offset 2156920832
> >             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >             stripe 2 devid 4 offset 2156920832
> >             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >             stripe 3 devid 1 offset 2176843776
> >             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> > 
> > This volume now has about a dozen chunks created by kernel code, and
> > the stripe X to devid Y mapping is identical. Using dd and hexdump,
> > I'm finding that stripe 0 and 1 are mirrored pairs, they contain
> > identical information. And stripe 2 and 3 are mirrored pairs. And the
> > raid0 striping happens across 01 and 23 such that odd-numbered 64KiB
> > (default) stripe elements go on 01, and even-numbered stripe elements
> > go on 23. If the stripe to devid pairing were always consistent, I
> > could lose more than one device and still have a viable volume, just
> > like a conventional raid10. Of course you can't lose both of any
> > mirrored pair, but you could lose one of every mirrored pair. That's
> > why raid10 is considered scalable.
> 
> Let me compare the btrfs raid10 to a conventional raid5. Assume a raid5
> across n disks. Then, for each chunk (don't know the unit of such a chunk)
> of n-1 disks, a parity chunk is written to the remaining disk using xor.
> Parity chunks are distributed across all disks. In case the data of a
> failed disk has to be restored from the degraded array, the entirety of n-1
> disks have to be read, in order to use xor to reconstruct the data. Is this
> correct? Again, in order to restore a failed disk in raid5, all data on all
> remaining disks is needed, otherwise the array can not be restored.
> Correct?
> 
> For btrfs raid10, I can only lose a single device, but in order to rebuild
> it, I only need to read the amount of data which was stored on the failed
> device, as no parity is used, but mirroring. Correct? Therefore, the amount
> of bits I need to read successfully for a rebuild is independent of the
> number of devices included in the raid10, while the amount of read data
> scales with the number of devices in a raid5.
> 
> Still, I think it is unfortunate, that btrfs raid10 does not stick to a
> fixed layout, as then the entire array must be available. If you have your
> devices attached by more than one controller, in more than one case powered
> by different power supplies etc., the probability for their failure has to
> be summed up,

This formulation might be a bit vague. For m devices of which none is
allowed to fail, the total failure probability should be
  p_tot = 1 - (1-p_f)^m
where p_f is the probability of failure for a single device, assuming p_f
is the same for all m devices.
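
As a quick numeric illustration (the per-device figure is made up): with
m = 4 devices and p_f = 3% over some period,

  p_tot = 1 - (1 - 0.03)^4 ~= 0.115

i.e. roughly an 11-12% chance that at least one of the four devices
fails in that period, noticeably worse than the 3% for any single one.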

> as no component is allowed to fail. Is work under way to
> change this, or is this something out of reach for btrfs as it is an
> implementation detail of the kernel?
> 
> > But apparently the pairing is different between mkfs and kernel code.
> > And due to that I can't reliably lose more than one device. There is
> > an edge case where I could lose two:
> > 
> > 
> > 
> > stripe 0 devid 4
> > stripe 1 devid 3
> > stripe 2 devid 2
> > stripe 3 devid 1
> > 
> > stripe 0 devid 2
> > stripe 1 devid 3
> > stripe 2 devid 4
> > stripe 3 devid 1
> > 
> > 
> > I could, in theory, lose devid 3 and devid 1 and still have one of
> > each stripe copies for all block groups, but kernel code doesn't
> > permit this:
> > 
> > [352467.557960] BTRFS warning (device dm-9): missing devices (2)
> > exceeds the limit (1), writeable mount is not allowed
> > 
> > > 2) Recover raid10 from a failed disk
> > > Raid10 inherits its redundancy from the raid1 scheme. If I build a
> > > raid10
> > > from n devices, each bit is mirrored across two devices. Therefore, in
> > > order to restore a raid10 from a single failed device, I need to read
> > > the
> > > amount of data worth this device from the remaining n-1 devices.
> > 
> > Maybe? In a traditional raid10, rebuild of a faulty device means
> > reading 100% of its mirror device and that's it. For Btrfs the same
> > could be true, it just depends on where the block group copies are
> > located, they could all be on just one other device, or they could be
> > spread across more than one device. Also for Btrfs it's only copying
> > extents, it's not doing sector level rebuild, it'll skip the empty
> > space.
> > 
> > >In case, the amount of
> > >
> > > data on the failed disk is in the order of the number of bits for which
> > > I
> > > can expect an unrecoverable read error from a device, I will most likely
> > > not be able to recover from the disk failure. Is this conclusion
> > > correct,
> > > or am I missing something here?
> > 
> > I think you're overestimating the probability of a URE. They're pretty
> > rare, and it's far less likely if you're doing regular scrubs.
> > 
> > I haven't actually tested this but if a URE or even a checksum
> > mismatch were to happen on a data block group during rebuild following
> > replacing a failed device, I'd like to think Btrfs just complains, it
> > doesn't stop the remainder of the rebuild. If it happens on metadata
> > or system chunk, well that's bad and could be fatal.
> > 
> > 
> > As an aside, I'm finding the size information for the data chunk in
> > 'fi us' confusing...
> > 
> > The sample file system contains one file:
> > [root@f24s ~]# ls -lh /mnt/0
> > total 1.4G
> > -rw-r--r--. 1 root root 1.4G Aug 13 19:24
> > Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso
> > 
> > 
> > [root@f24s ~]# btrfs fi us /mnt/0
> > 
> > Overall:
> >     Device size:         400.00GiB
> >     Device allocated:           8.03GiB
> >     Device unallocated:         391.97GiB
> >     Device missing:             0.00B
> >     Used:               2.66GiB
> >     Free (estimated):         196.66GiB    (min: 196.66GiB)
> >     Data ratio:                  2.00
> >     Metadata ratio:              2.00
> >     Global reserve:          16.00MiB    (used: 0.00B)
> > 
> > ## "Device size" is total volume or pool size, "Used" shows actual
> > usage accounting for the replication of raid1, and yet "Free" shows
> > 1/2. This can't work long term as by the time I have 100GiB in the
> > volume, Used will report 200GiB while Free will report 100GiB for a
> > total of 300GiB which does not match the device size. So that's a bug
> > in my opinion.
> > 
> > Data,RAID10: Size:2.00GiB, Used:1.33GiB
> > 
> >    /dev/mapper/VG-1     512.00MiB
> >    /dev/mapper/VG-2     512.00MiB
> >    /dev/mapper/VG-3     512.00MiB
> >    /dev/mapper/VG-4     512.00MiB
> > 
> > ## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.
> > And now in this area the user is somehow expected to know that all of
> > these values are 1/2 their actual value due to the RAID10. I don't
> > like this inconsistency for one. But it's made worse by using the
> > secret decoder ring method of usage when it comes to individual device
> > allocations. Very clearly Size is really 4GiB, and each device has a 1GiB
> > chunk. So why not say that? This is consistent with the earlier
> > "Device allocated" value of 8GiB.


[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Likelihood of read error, recover device failure raid10
  2016-08-14 16:20 ` Chris Murphy
  2016-08-14 18:04   ` Wolfgang Mader
  2016-08-15  3:46   ` Andrei Borzenkov
@ 2016-08-15  5:51   ` Andrei Borzenkov
  2 siblings, 0 replies; 8+ messages in thread
From: Andrei Borzenkov @ 2016-08-15  5:51 UTC (permalink / raw)
  To: Chris Murphy, Wolfgang Mader; +Cc: Btrfs BTRFS

14.08.2016 19:20, Chris Murphy wrote:
> 
> As an aside, I'm finding the size information for the data chunk in
> 'fi us' confusing...
> 
> The sample file system contains one file:
> [root@f24s ~]# ls -lh /mnt/0
> total 1.4G
> -rw-r--r--. 1 root root 1.4G Aug 13 19:24
> Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso
> 
> 
> [root@f24s ~]# btrfs fi us /mnt/0
> Overall:
>     Device size:         400.00GiB
>     Device allocated:           8.03GiB
>     Device unallocated:         391.97GiB
>     Device missing:             0.00B
>     Used:               2.66GiB
>     Free (estimated):         196.66GiB    (min: 196.66GiB)
>     Data ratio:                  2.00
>     Metadata ratio:              2.00
>     Global reserve:          16.00MiB    (used: 0.00B)
> 
> ## "Device size" is total volume or pool size, "Used" shows actual
> usage accounting for the replication of raid1, and yet "Free" shows
> 1/2. This can't work long term as by the time I have 100GiB in the
> volume, Used will report 200GiB while Free will report 100GiB for a
> total of 300GiB which does not match the device size. So that's a bug
> in my opinion.
> 

Well, it says "estimated". It shows how much you could possibly write
using current allocation profile(s). There is no way to predict actual
space usage if you mix allocation profiles.
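
As a rough sketch of where that 196.66GiB comes from (my reading of it;
the exact accounting may differ slightly):

  unallocated:            391.97 GiB
  data ratio (raid10):    2.00
  unused in data chunks:  2.00 GiB - 1.33 GiB = 0.67 GiB

  Free (estimated) ~= 391.97 / 2.00 + 0.67 ~= 196.66 GiB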

I agree that having a single field that refers to virtual capacity
among fields showing physical consumption is confusing.

> Data,RAID10: Size:2.00GiB, Used:1.33GiB
>    /dev/mapper/VG-1     512.00MiB
>    /dev/mapper/VG-2     512.00MiB
>    /dev/mapper/VG-3     512.00MiB
>    /dev/mapper/VG-4     512.00MiB
> 
> ## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.

I think this is the difference between the rounding done by ls and the
internal btrfs accounting. I bet if you show the size in KiB (or even
512-byte sectors) you will get a better match.
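
For instance (illustrative arithmetic): 1.33 GiB is about 1.43e9 bytes,
and if I remember the coreutils rounding right, "ls -lh" rounds up to
one decimal place, so anything just over 1.3 GiB up to 1.4 GiB prints
as "1.4G".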

> And now in this area the user is somehow expected to know that all of
> these values are 1/2 their actual value due to the RAID10. I don't
> like this inconsistency for one. But it's made worse by using the
> secret decoder ring method of usage when it comes to individual device
> allocations. Very clearly Size is really 4GiB, and each device has a 1GiB
> chunk. So why not say that? This is consistent with the earlier
> "Device allocated" value of 8GiB.
> 
> 

This looks like a bug in the RAID10 output. In RAID1 the output is
consistent, with Size showing the virtual size and each disk's
allocated size matching it. This is openSUSE Tumbleweed with
btrfsprogs 4.7 and kernel 4.7.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-08-15  5:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-08-13 15:39 Likelihood of read error, recover device failure raid10 Wolfgang Mader
2016-08-13 20:15 ` Hugo Mills
2016-08-14  1:07 ` Duncan
2016-08-14 16:20 ` Chris Murphy
2016-08-14 18:04   ` Wolfgang Mader
2016-08-15  4:21     ` Wolfgang Mader
2016-08-15  3:46   ` Andrei Borzenkov
2016-08-15  5:51   ` Andrei Borzenkov
