* Likelihood of read error, recover device failure raid10
@ 2016-08-13 15:39 Wolfgang Mader
From: Wolfgang Mader @ 2016-08-13 15:39 UTC (permalink / raw)
To: Btrfs BTRFS
Hi,
I have two questions:
1) Layout of raid10 in btrfs
btrfs pools all devices and then stripes and mirrors across this pool. Is it
therefore correct that a raid10 layout consisting of 4 devices a, b, c, d is
_not_
              raid0
      |-----------------|
   -------           -------
   |a| |b|           |c| |d|
    raid1             raid1
Rather, there is no fixed pairing at the device level of two devices that form
a raid1 set which is then combined by raid0; instead, each bit is simply
mirrored across two different devices. Is this correct?
2) Recover raid10 from a failed disk
Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10 from
n devices, each bit is mirrored across two devices. Therefore, in order to
restore a raid10 after a single device failure, I need to read that device's
worth of data from the remaining n-1 devices. If the amount of data on the
failed disk is on the order of the number of bits at which I can expect an
unrecoverable read error from a device, I will most likely not be able to
recover from the disk failure. Is this conclusion correct, or am I missing
something here?
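
As a rough illustration of the numbers I am worried about, here is a minimal
sketch. It assumes the 1e-14 unrecoverable-bit-error rate commonly printed on
consumer drive data sheets (the true figure for a given drive is exactly what
I am unsure about) and treats bit errors as independent:

  import math

  def p_at_least_one_ure(bytes_read, ber=1e-14):
      # P(at least one URE) when reading bytes_read bytes at ber errors per
      # bit, i.e. 1 - (1 - ber)**bits, computed stably for tiny ber.
      bits = bytes_read * 8
      return -math.expm1(bits * math.log1p(-ber))

  for tb in (1, 4, 8):
      print(f"reading {tb} TB: P(>=1 URE) ~ {p_at_least_one_ure(tb * 1e12):.1%}")

With these assumed numbers, reading 8 TB without a single unrecoverable error
is already close to a coin flip, which is what prompts the question.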
Thanks,
Wolfgang
* Re: Likelihood of read error, recover device failure raid10
From: Hugo Mills @ 2016-08-13 20:15 UTC (permalink / raw)
To: Wolfgang Mader; +Cc: Btrfs BTRFS

On Sat, Aug 13, 2016 at 05:39:18PM +0200, Wolfgang Mader wrote:
> Hi,
>
> I have two questions:
>
> 1) Layout of raid10 in btrfs
> btrfs pools all devices and then stripes and mirrors across this pool. Is it
> therefore correct that a raid10 layout consisting of 4 devices a, b, c, d is
> _not_
>
>               raid0
>       |-----------------|
>    -------           -------
>    |a| |b|           |c| |d|
>     raid1             raid1
>
> Rather, there is no fixed pairing at the device level of two devices that
> form a raid1 set which is then combined by raid0; instead, each bit is
> simply mirrored across two different devices. Is this correct?

   Correct. There's no clear hierarchy of RAID-1-then-RAID-0 vs
RAID-0-then-RAID-1. Instead, if you look at a single device (in a
4-device array), it will hold one of the two copies, and will hold
either the "odd" stripes or the "even" stripes. That's all you get,
within a block group.

> 2) Recover raid10 from a failed disk
> Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10
> from n devices, each bit is mirrored across two devices. Therefore, in order
> to restore a raid10 after a single device failure, I need to read that
> device's worth of data from the remaining n-1 devices. If the amount of data
> on the failed disk is on the order of the number of bits at which I can
> expect an unrecoverable read error from a device, I will most likely not be
> able to recover from the disk failure. Is this conclusion correct, or am I
> missing something here?

   That's right, but the unrecoverable bit error rates quoted by the
hard drive manufacturers aren't necessarily reflected in the real-life
usage of the devices. I think that if you're doing those calculations,
you really need to find out what the values quoted by the manufacturer
actually mean, first. (i.e. if you read all the data once a month with
a scrub, and allow the drive to identify and correct any transient
errors which might indicate incipient failure, does the quoted BER
still apply?)

   Hugo.

-- 
Hugo Mills             | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4          | Ford Prefect
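
To make the "odd"/"even" stripes description above concrete, here is a toy
model of a single btrfs raid10 block group on four device slots. The 64 KiB
stripe element size and the slot labels are illustrative assumptions only;
the real mapping is recorded in the chunk tree, and the slots are chosen per
block group rather than fixed per device.

  STRIPE_LEN = 64 * 1024                      # assumed 64 KiB stripe element
  MIRRORS = {0: ("slot A", "slot B"),         # sub-stripe 0 and its copy
             1: ("slot C", "slot D")}         # sub-stripe 1 and its copy

  def placement(logical_offset):
      # Which sub-stripe and which pair of device slots hold this offset,
      # inside a single raid10 block group of this toy model.
      element = logical_offset // STRIPE_LEN
      sub_stripe = element % 2
      return element, sub_stripe, MIRRORS[sub_stripe]

  for off in (0, 64 * 1024, 128 * 1024, 192 * 1024):
      print(off, placement(off))

The only point of the sketch is that the pairing is defined per block group,
so "slot A" in one block group need not be the same physical device as
"slot A" in the next, which is what the rest of the thread turns on.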
* Re: Likelihood of read error, recover device failure raid10
From: Duncan @ 2016-08-14 1:07 UTC (permalink / raw)
To: linux-btrfs

Wolfgang Mader posted on Sat, 13 Aug 2016 17:39:18 +0200 as excerpted:

> 1) Layout of raid10 in btrfs
> btrfs pools all devices and then stripes and mirrors across this pool.
> Is it therefore correct that a raid10 layout consisting of 4 devices
> a, b, c, d is _not_
>
>               raid0
>       |-----------------|
>    -------           -------
>    |a| |b|           |c| |d|
>     raid1             raid1
>
> Rather, there is no fixed pairing at the device level of two devices
> that form a raid1 set which is then combined by raid0; instead, each
> bit is simply mirrored across two different devices. Is this correct?

Not correct in detail, but you have the general idea, yes.

The key thing to remember with btrfs in this context is that it's
chunk-based raid, /not/ device-based (or for that matter, bit- or
byte-based) raid.

If the "each bit" in your last sentence above is substituted with "each
chunk", where chunks are nominally 1 GiB for data and 256 MiB for
metadata (they can vary from this), thus billions of times your "each
bit" size, /then/ your description gets much more accurate. Technically
each strip is 64 KiB, I believe, with each strip mirrored at the raid1
level and then combined with other strips at the raid0 level to make a
stripe, and multiple stripes then composing a chunk, with the device
assignment variable at the chunk level.

At the chunk level, mirroring and striping is as you indicate. Chunks
are allocated on demand from the available unallocated space, such that
the two mirrors of each strip can vary from one chunk to the next,
which, if I'm not mistaken, was the point you were making.

The effect is that btrfs raid10 doesn't have the ability, which
per-device raid10 has, to tolerate the loss of two devices as long as
the two devices are from separate raid1s underneath the raid0. Once a
decent number of chunks have been allocated, there are no distinct
raid1s at the btrfs device level, so the loss of any two devices is
virtually guaranteed to take out both mirrors of some strip of a chunk,
for /some/ number of chunks.

Of course it remains possible, indeed quite viable, to create a hybrid
raid: btrfs raid1 on top of md- or dm-raid0, for instance. Although
that's technically raid01 instead of raid10, btrfs raid1 has some
distinctive advantages that make it the preferred top layer in this
sort of hybrid, as opposed to btrfs raid0 on top of md/dm-raid1, the
conventionally preferred raid10 arrangement. Namely, btrfs raid1 has
the file-integrity feature in the form of checksumming and detection of
checksum validation failures, and, for raid1, repair of such failures
from the mirror copy, assuming of course that the mirror copy passes
checksum validation. Few raid schemes have that, and it's enough of a
feature leap that it justifies making the top layer btrfs raid1 rather
than btrfs raid0, which would lack the automatic repair feature. Btrfs
raid0 could still detect the error based on the checksums, but even
manual repair would be difficult, as you'd have to somehow figure out
which copy the bad read came from, then check the other copy and see
whether it was good before overwriting the bad copy.

> 2) Recover raid10 from a failed disk
> Raid10 inherits its redundancy from the raid1 scheme. If I build a
> raid10 from n devices, each bit is mirrored across two devices.
> Therefore, in order to restore a raid10 after a single device failure,
> I need to read that device's worth of data from the remaining n-1
> devices. If the amount of data on the failed disk is on the order of
> the number of bits at which I can expect an unrecoverable read error
> from a device, I will most likely not be able to recover from the disk
> failure. Is this conclusion correct, or am I missing something here?

Again, not each bit, but (each strip of) each chunk (with the strips
being 64 KiB, IIRC).

But your conclusion is generally correct. The problem would be quite
likely to be detected as a checksum verification failure, but if it
were to occur in the raid1 pair that was degraded, there would be no
second copy to fall back on for repair. Of course that's a 50% chance,
with the other possibility being that the IO read error occurs in the
undegraded raid1, and thus can be corrected normally.

Which means that, given a random read error, if you try the recovery
enough times you should eventually succeed, because eventually any
occurring read error will happen in the still-undegraded raid1 area.
Though of course if the read error isn't random and it happens
repeatedly in the degraded area, you're screwed, at least for whatever
file (or metadata covering multiple files) it was in. You should still
be able to recover the rest of the filesystem, however.

Which all goes to demonstrate once again that raid != backup, and
there's no substitute for the latter, to whatever level the value of
the data in question justifies.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
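
The "virtually guaranteed" claim above can be checked with a quick
simulation. This is only a sketch: it assumes each chunk partitions four
devices into two mirror pairs uniformly at random, which simplifies what the
real allocator does.

  import itertools, random

  def survives(chunk_pairs, lost):
      # The volume survives if every mirror pair in every chunk keeps a device.
      return all(any(d not in lost for d in pair)
                 for pairs in chunk_pairs for pair in pairs)

  def trial(n_chunks, devices=(1, 2, 3, 4)):
      chunk_pairs = []
      for _ in range(n_chunks):
          d = random.sample(devices, 4)          # random pairing per chunk
          chunk_pairs.append(((d[0], d[1]), (d[2], d[3])))
      losses = list(itertools.combinations(devices, 2))
      ok = sum(survives(chunk_pairs, set(l)) for l in losses)
      return ok, len(losses)

  random.seed(0)
  for n_chunks in (1, 4, 16, 64):
      ok, total = trial(n_chunks)
      print(f"{n_chunks:3d} chunks: {ok}/{total} two-device losses survivable")

With a single chunk, four of the six possible two-device losses are
survivable, exactly as in a conventional raid10; once a few dozen chunks with
independently chosen pairings exist, essentially none are.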
* Re: Likelihood of read error, recover device failure raid10
From: Chris Murphy @ 2016-08-14 16:20 UTC (permalink / raw)
To: Wolfgang Mader; +Cc: Btrfs BTRFS

On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
<Wolfgang_Mader@brain-frog.de> wrote:
> 1) Layout of raid10 in btrfs
> btrfs pools all devices and then stripes and mirrors across this pool. Is it
> therefore correct that a raid10 layout consisting of 4 devices a, b, c, d is
> _not_
>
>               raid0
>       |-----------------|
>    -------           -------
>    |a| |b|           |c| |d|
>     raid1             raid1
>
> Rather, there is no fixed pairing at the device level of two devices that
> form a raid1 set which is then combined by raid0; instead, each bit is
> simply mirrored across two different devices. Is this correct?

All of the profiles apply to block groups (chunks), and that includes
raid10. They only incidentally apply to devices, since of course block
groups end up on those devices, but which stripe ends up on which
device is not consistent, and that ends up making Btrfs raid10 pretty
much only able to survive a single device loss. I don't know if this
is really thoroughly understood.

I just did a test, and I kinda wonder if the reason for this
inconsistent assignment is a difference between the initial
stripe-to-devid pairing at mkfs time and the subsequent pairings done
by kernel code. For example, I get this from mkfs:

    item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715 itemsize 176
        chunk length 16777216 owner 2 stripe_len 65536
        type SYSTEM|RAID10 num_stripes 4
            stripe 0 devid 4 offset 1048576
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 1 devid 3 offset 1048576
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 2 offset 1048576
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 3 devid 1 offset 20971520
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
    item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539 itemsize 176
        chunk length 2147483648 owner 2 stripe_len 65536
        type METADATA|RAID10 num_stripes 4
            stripe 0 devid 4 offset 9437184
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 1 devid 3 offset 9437184
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 2 offset 9437184
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 3 devid 1 offset 29360128
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
    item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363 itemsize 176
        chunk length 2147483648 owner 2 stripe_len 65536
        type DATA|RAID10 num_stripes 4
            stripe 0 devid 4 offset 1083179008
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 1 devid 3 offset 1083179008
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 2 offset 1083179008
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 3 devid 1 offset 1103101952
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74

Here you can see every chunk type has the same stripe-to-devid pairing.
But once the kernel starts to allocate more data chunks, the pairing is
different from mkfs, yet always (so far) consistent for each additional
kernel-allocated chunk:

    item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187 itemsize 176
        chunk length 2147483648 owner 2 stripe_len 65536
        type DATA|RAID10 num_stripes 4
            stripe 0 devid 2 offset 2156920832
            dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
            stripe 1 devid 3 offset 2156920832
            dev uuid: af95126a-e674-425c-af01-2599d66d9d06
            stripe 2 devid 4 offset 2156920832
            dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
            stripe 3 devid 1 offset 2176843776
            dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74

This volume now has about a dozen chunks created by kernel code, and
the stripe X to devid Y mapping is identical for all of them. Using dd
and hexdump, I'm finding that stripe 0 and 1 are mirrored pairs: they
contain identical information. Stripe 2 and 3 are likewise mirrored
pairs. The raid0 striping happens across 01 and 23, such that
odd-numbered 64 KiB (default) stripe elements go on 01 and
even-numbered stripe elements go on 23.

If the stripe-to-devid pairing were always consistent, I could lose
more than one device and still have a viable volume, just like a
conventional raid10. Of course you can't lose both members of any
mirrored pair, but you could lose one member of every mirrored pair.
That's why raid10 is considered scalable.

But apparently the pairing is different between mkfs and kernel code,
and due to that I can't reliably lose more than one device. There is an
edge case where I could lose two:

    stripe 0 devid 4
    stripe 1 devid 3
    stripe 2 devid 2
    stripe 3 devid 1

    stripe 0 devid 2
    stripe 1 devid 3
    stripe 2 devid 4
    stripe 3 devid 1

I could, in theory, lose devid 3 and devid 1 and still have one copy of
each stripe for all block groups, but kernel code doesn't permit this:

    [352467.557960] BTRFS warning (device dm-9): missing devices (2)
    exceeds the limit (1), writeable mount is not allowed

> 2) Recover raid10 from a failed disk
> Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10
> from n devices, each bit is mirrored across two devices. Therefore, in order
> to restore a raid10 after a single device failure, I need to read that
> device's worth of data from the remaining n-1 devices.

Maybe? In a traditional raid10, rebuild of a faulty device means
reading 100% of its mirror device, and that's it. For Btrfs the same
could be true; it just depends on where the block group copies are
located. They could all be on just one other device, or they could be
spread across more than one device. Also, Btrfs only copies extents;
it's not doing a sector-level rebuild, so it will skip the empty space.

> If the amount of data on the failed disk is on the order of the number of
> bits at which I can expect an unrecoverable read error from a device, I
> will most likely not be able to recover from the disk failure. Is this
> conclusion correct, or am I missing something here?

I think you're overestimating the probability of a URE. They're pretty
rare, and far less likely if you're doing regular scrubs.

I haven't actually tested this, but if a URE or even a checksum
mismatch were to happen on a data block group during the rebuild that
follows replacing a failed device, I'd like to think Btrfs just
complains and doesn't stop the remainder of the rebuild. If it happens
on a metadata or system chunk, well, that's bad and could be fatal.

As an aside, I'm finding the size information for the data chunk in
'fi us' confusing...

The sample file system contains one file:

    [root@f24s ~]# ls -lh /mnt/0
    total 1.4G
    -rw-r--r--. 1 root root 1.4G Aug 13 19:24
    Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso

    [root@f24s ~]# btrfs fi us /mnt/0
    Overall:
        Device size:                 400.00GiB
        Device allocated:              8.03GiB
        Device unallocated:          391.97GiB
        Device missing:                  0.00B
        Used:                          2.66GiB
        Free (estimated):            196.66GiB  (min: 196.66GiB)
        Data ratio:                       2.00
        Metadata ratio:                   2.00
        Global reserve:               16.00MiB  (used: 0.00B)

## "Device size" is the total volume or pool size, "Used" shows actual
usage accounting for the replication of raid1, and yet "Free" shows
half. This can't work long term: by the time I have 100GiB in the
volume, Used will report 200GiB while Free will report 100GiB, for a
total of 300GiB, which does not match the device size. So that's a bug
in my opinion.

    Data,RAID10: Size:2.00GiB, Used:1.33GiB
       /dev/mapper/VG-1      512.00MiB
       /dev/mapper/VG-2      512.00MiB
       /dev/mapper/VG-3      512.00MiB
       /dev/mapper/VG-4      512.00MiB

## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.
And now in this area the user is somehow expected to know that all of
these values are half their actual value due to the RAID10. I don't
like this inconsistency, for one. But it's made worse by using the
secret-decoder-ring method of usage when it comes to individual device
allocations. Very clearly Size is really 4GiB, and each device has a
1GiB chunk. So why not say that? That would be consistent with the
earlier "Device allocated" value of 8GiB.

-- 
Chris Murphy
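
For anyone who wants to reason about that edge case mechanically, here is a
small sketch. The mirror-pair assumption (stripes 0+1 and 2+3) comes from the
dd/hexdump observation above, and the script says nothing about what the
kernel will actually let you mount; it only checks whether a copy of every
stripe would still exist.

  from itertools import combinations

  # stripe index -> devid, as seen in the two dumps above
  chunks = [
      {0: 4, 1: 3, 2: 2, 3: 1},    # mkfs-created chunks
      {0: 2, 1: 3, 2: 4, 3: 1},    # kernel-created chunks
  ]
  mirror_pairs = [(0, 1), (2, 3)]  # assumed from the dd/hexdump observation

  def survivable(lost):
      # True if every mirror pair in every mapping keeps at least one device.
      return all(chunk[a] not in lost or chunk[b] not in lost
                 for chunk in chunks for (a, b) in mirror_pairs)

  for lost in combinations((1, 2, 3, 4), 2):
      print(lost, "copies remain" if survivable(set(lost)) else "data lost")

Run against the two mappings above, it reports the loss of devids {1, 3}
(and also {2, 4}) as the only two-device losses that leave one copy of every
stripe.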
* Re: Likelihood of read error, recover device failure raid10
From: Wolfgang Mader @ 2016-08-14 18:04 UTC (permalink / raw)
To: linux-btrfs

On Sunday, August 14, 2016 10:20:39 AM CEST you wrote:
> On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
> <Wolfgang_Mader@brain-frog.de> wrote:
[...]
> This volume now has about a dozen chunks created by kernel code, and
> the stripe X to devid Y mapping is identical for all of them. Using dd
> and hexdump, I'm finding that stripe 0 and 1 are mirrored pairs: they
> contain identical information. Stripe 2 and 3 are likewise mirrored
> pairs. The raid0 striping happens across 01 and 23, such that
> odd-numbered 64 KiB (default) stripe elements go on 01 and
> even-numbered stripe elements go on 23. If the stripe-to-devid pairing
> were always consistent, I could lose more than one device and still
> have a viable volume, just like a conventional raid10. Of course you
> can't lose both members of any mirrored pair, but you could lose one
> member of every mirrored pair. That's why raid10 is considered
> scalable.

Let me compare the btrfs raid10 to a conventional raid5. Assume a raid5
across n disks. Then, for each chunk (I don't know the unit of such a
chunk) spread over n-1 disks, a parity chunk is written to the
remaining disk using xor, and the parity chunks are distributed across
all disks. If the data of a failed disk has to be restored from the
degraded array, the entirety of the n-1 remaining disks has to be read
in order to reconstruct the data via xor. Is this correct? In other
words, to restore a failed disk in raid5, all data on all remaining
disks is needed, otherwise the array cannot be restored. Correct?

For btrfs raid10, I can only lose a single device, but in order to
rebuild it, I only need to read the amount of data which was stored on
the failed device, since mirroring is used instead of parity. Correct?
Therefore, the number of bits I need to read successfully for a rebuild
is independent of the number of devices in the raid10, while the amount
of data to read scales with the number of devices in a raid5.

Still, I think it is unfortunate that btrfs raid10 does not stick to a
fixed layout, as then the entire array must remain available. If you
have your devices attached by more than one controller, housed in more
than one case, powered by different power supplies, etc., the failure
probabilities of all these components have to be summed up, as no
component is allowed to fail. Is work under way to change this, or is
this something out of reach for btrfs, as it is an implementation
detail of the kernel?

> But apparently the pairing is different between mkfs and kernel code,
> and due to that I can't reliably lose more than one device.
[...]
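
The scaling point in the comparison above can be put into rough numbers.
This is a minimal sketch under idealized assumptions: devices completely
full, no empty-space skipping, and exactly one failed device.

  def rebuild_read_tb(n_devices, device_tb, layout):
      # Idealized volume of data that must be read without error to rebuild
      # one failed device (devices assumed completely full).
      if layout == "raid5":
          return (n_devices - 1) * device_tb   # xor over all survivors
      if layout == "raid10":
          return device_tb                     # just the surviving mirror copies
      raise ValueError(layout)

  for n in (4, 6, 10):
      print(f"{n} x 4 TB devices: raid5 rebuild reads "
            f"{rebuild_read_tb(n, 4, 'raid5')} TB, "
            f"raid10 reads {rebuild_read_tb(n, 4, 'raid10')} TB")

In practice btrfs only copies allocated extents, so the raid10 figure is an
upper bound; the point is simply that it does not grow with n, while the
raid5 figure does.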
* Re: Likelihood of read error, recover device failure raid10
From: Wolfgang Mader @ 2016-08-15 4:21 UTC (permalink / raw)
To: linux-btrfs

On Sunday, August 14, 2016 8:04:14 PM CEST you wrote:
[...]
> Still, I think it is unfortunate that btrfs raid10 does not stick to a
> fixed layout, as then the entire array must remain available. If you
> have your devices attached by more than one controller, housed in more
> than one case, powered by different power supplies, etc., the failure
> probabilities of all these components have to be summed up, as no
> component is allowed to fail.

This formulation might be a bit vague. For m devices, none of which is
allowed to fail, the total failure probability should be

  p_tot = 1 - (1 - p_f)^m

where p_f is the probability of failure for a single device, assuming
p_f is the same for, and independent across, all m devices.

[...]
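
As a numerical illustration of the expression above, with an arbitrary
example value of p_f = 3% per device over the rebuild window (not a measured
figure):

  p_f = 0.03                    # example per-device failure probability
  for m in (2, 4, 8):
      p_tot = 1 - (1 - p_f) ** m
      print(f"m = {m}: p_tot = {p_tot:.1%}")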
* Re: Likelihood of read error, recover device failure raid10
From: Andrei Borzenkov @ 2016-08-15 3:46 UTC (permalink / raw)
To: Chris Murphy, Wolfgang Mader; +Cc: Btrfs BTRFS

On 14.08.2016 19:20, Chris Murphy wrote:
...
> This volume now has about a dozen chunks created by kernel code, and
> the stripe X to devid Y mapping is identical for all of them. Using dd
> and hexdump, I'm finding that stripe 0 and 1 are mirrored pairs: they
> contain identical information. Stripe 2 and 3 are likewise mirrored
> pairs. The raid0 striping happens across 01 and 23, such that
> odd-numbered 64 KiB (default) stripe elements go on 01 and
> even-numbered stripe elements go on 23. If the stripe-to-devid pairing
> were always consistent, I could lose more than one device and still
> have a viable volume, just like a conventional raid10. Of course you
> can't lose both members of any mirrored pair, but you could lose one
> member of every mirrored pair. That's why raid10 is considered
> scalable.
>
> But apparently the pairing is different between mkfs and kernel code.

My understanding is that the chunk allocation code uses devices in the
order they have been discovered, which implies that the order can
change between reboots, or even while the system is running if devices
are added or removed. I also think the code may skip devices under some
conditions (the most obvious being not enough space, which may happen
if you mix allocation profiles). So the only thing that is guaranteed
right now is that every stripe element will be on a different device.
* Re: Likelihood of read error, recover device failure raid10
From: Andrei Borzenkov @ 2016-08-15 5:51 UTC (permalink / raw)
To: Chris Murphy, Wolfgang Mader; +Cc: Btrfs BTRFS

On 14.08.2016 19:20, Chris Murphy wrote:
>
> As an aside, I'm finding the size information for the data chunk in
> 'fi us' confusing...
>
> The sample file system contains one file:
>
>     [root@f24s ~]# ls -lh /mnt/0
>     total 1.4G
>     -rw-r--r--. 1 root root 1.4G Aug 13 19:24
>     Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso
>
>     [root@f24s ~]# btrfs fi us /mnt/0
>     Overall:
>         Device size:                 400.00GiB
>         Device allocated:              8.03GiB
>         Device unallocated:          391.97GiB
>         Device missing:                  0.00B
>         Used:                          2.66GiB
>         Free (estimated):            196.66GiB  (min: 196.66GiB)
>         Data ratio:                       2.00
>         Metadata ratio:                   2.00
>         Global reserve:               16.00MiB  (used: 0.00B)
>
> ## "Device size" is the total volume or pool size, "Used" shows actual
> usage accounting for the replication of raid1, and yet "Free" shows
> half. This can't work long term: by the time I have 100GiB in the
> volume, Used will report 200GiB while Free will report 100GiB, for a
> total of 300GiB, which does not match the device size. So that's a bug
> in my opinion.

Well, it says "estimated". It shows how much you could possibly write
using the current allocation profile(s). There is no way to predict
actual space usage if you mix allocation profiles. I agree that having
a single field that refers to virtual capacity among fields showing
physical consumption is confusing.

>     Data,RAID10: Size:2.00GiB, Used:1.33GiB
>        /dev/mapper/VG-1      512.00MiB
>        /dev/mapper/VG-2      512.00MiB
>        /dev/mapper/VG-3      512.00MiB
>        /dev/mapper/VG-4      512.00MiB
>
> ## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.

I think this is the difference between the rounding done by ls and
btrfs's internal accounting. I bet that if you show the size in KiB (or
even 512-byte sectors) you will get a better match.

> And now in this area the user is somehow expected to know that all of
> these values are half their actual value due to the RAID10. I don't
> like this inconsistency, for one. But it's made worse by using the
> secret-decoder-ring method of usage when it comes to individual device
> allocations. Very clearly Size is really 4GiB, and each device has a
> 1GiB chunk. So why not say that? That would be consistent with the
> earlier "Device allocated" value of 8GiB.

This looks like a bug in the RAID10 output. With RAID1 the output is
consistent: Size shows the virtual size, and each disk's allocated size
matches it. This is openSUSE Tumbleweed with btrfsprogs 4.7 and kernel
4.7.