From: Wolfgang Mader <wolfgang.mader@fdm.uni-freiburg.de>
To: linux-btrfs@vger.kernel.org
Subject: Re: Likelihood of read error, recover device failure raid10
Date: Mon, 15 Aug 2016 06:21:39 +0200
Message-ID: <133750337.sc9iDRAKOZ@discus>
In-Reply-To: <1611335.C74ovpucO9@discus>
References: <2336793.AMaAIAxWk4@discus> <1611335.C74ovpucO9@discus>

On Sunday, August 14, 2016 8:04:14 PM CEST you wrote:
> On Sunday, August 14, 2016 10:20:39 AM CEST you wrote:
> > On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader wrote:
> > > Hi,
> > >
> > > I have two questions.
> > >
> > > 1) Layout of raid10 in btrfs
> > > btrfs pools all devices and then stripes and mirrors across this
> > > pool. Is it therefore correct that a raid10 layout consisting of
> > > four devices a, b, c, d is _not_
> > >
> > >              raid0
> > >                |
> > >        |---------------|
> > >        |               |
> > >      raid1           raid1
> > >        |               |
> > >    |a|   |b|       |c|   |d|
> > >
> > > Rather, there is no fixed assignment of two devices to a raid1 set
> > > which is then paired with another set by raid0; instead, each bit
> > > is simply mirrored across two different devices. Is this correct?
> >
> > All of the profiles apply to block groups (chunks), and that includes
> > raid10. They only incidentally apply to devices since of course block
> > groups end up on those devices, but which stripe ends up on which
> > device is not consistent, and that ends up making Btrfs raid10 pretty
> > much only able to survive a single device loss.
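To make sure I understand the chunk-level picture, I put it into a
little toy model (my own Python sketch, not btrfs code; the mirror
pairs (0,1)/(2,3) and the 64KiB element size are taken from your
dd/hexdump findings quoted further down):

# Toy model of one btrfs raid10 chunk: four stripes, where stripes 0+1
# and stripes 2+3 form mirrored pairs, and consecutive 64KiB stripe
# elements alternate between the two pairs. Illustrative sketch based
# on the observations quoted below, not actual btrfs code.

STRIPE_LEN = 64 * 1024  # default stripe element size

def devices_for_offset(stripes, offset):
    """Return the two devids holding the data at `offset` in a chunk.

    `stripes` maps stripe number 0..3 to a devid, exactly as printed
    by btrfs-debug-tree in the dumps below.
    """
    element = offset // STRIPE_LEN
    if element % 2 == 0:              # first pair of mirrors
        return stripes[0], stripes[1]
    return stripes[2], stripes[3]     # second pair of mirrors

# Every byte lives on exactly two devices, but *which* two depends on
# the per-chunk stripe map, not on a fixed raid1 pairing of devices:
print(devices_for_offset([4, 3, 2, 1], 0))   # -> (4, 3)
print(devices_for_offset([2, 3, 4, 1], 0))   # -> (2, 3)

So "each bit is mirrored across two devices" holds, but the two devices
can differ from chunk to chunk.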
> > I don't know if this is really thoroughly understood. I just did a
> > test, and I kinda wonder if the reason for this inconsistent
> > assignment is a difference between the initial stripe-to-devid
> > pairing at mkfs time, compared to subsequent pairings done by kernel
> > code. For example, I get this from mkfs:
> >
> > item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715 itemsize 176
> >     chunk length 16777216 owner 2 stripe_len 65536
> >     type SYSTEM|RAID10 num_stripes 4
> >         stripe 0 devid 4 offset 1048576
> >         dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >         stripe 1 devid 3 offset 1048576
> >         dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >         stripe 2 devid 2 offset 1048576
> >         dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >         stripe 3 devid 1 offset 20971520
> >         dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >
> > item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539 itemsize 176
> >     chunk length 2147483648 owner 2 stripe_len 65536
> >     type METADATA|RAID10 num_stripes 4
> >         stripe 0 devid 4 offset 9437184
> >         dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >         stripe 1 devid 3 offset 9437184
> >         dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >         stripe 2 devid 2 offset 9437184
> >         dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >         stripe 3 devid 1 offset 29360128
> >         dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >
> > item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363 itemsize 176
> >     chunk length 2147483648 owner 2 stripe_len 65536
> >     type DATA|RAID10 num_stripes 4
> >         stripe 0 devid 4 offset 1083179008
> >         dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >         stripe 1 devid 3 offset 1083179008
> >         dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >         stripe 2 devid 2 offset 1083179008
> >         dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >         stripe 3 devid 1 offset 1103101952
> >         dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >
> > Here you can see every chunk type has the same stripe-to-devid
> > pairing. But once the kernel starts to allocate more data chunks,
> > the pairing is different from mkfs, yet always (so far) consistent
> > for each additional kernel-allocated chunk.
> >
> > item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187 itemsize 176
> >     chunk length 2147483648 owner 2 stripe_len 65536
> >     type DATA|RAID10 num_stripes 4
> >         stripe 0 devid 2 offset 2156920832
> >         dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >         stripe 1 devid 3 offset 2156920832
> >         dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >         stripe 2 devid 4 offset 2156920832
> >         dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >         stripe 3 devid 1 offset 2176843776
> >         dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >
> > This volume now has about a dozen chunks created by kernel code, and
> > the stripe X to devid Y mapping is identical for all of them. Using
> > dd and hexdump, I'm finding that stripes 0 and 1 are a mirrored
> > pair; they contain identical information. Stripes 2 and 3 are the
> > other mirrored pair. The raid0 striping happens across 01 and 23
> > such that odd-numbered 64KiB (default) stripe elements go on 01, and
> > even-numbered stripe elements go on 23. If the stripe-to-devid
> > pairing were always consistent, I could lose more than one device
> > and still have a viable volume, just like a conventional raid10. Of
> > course you can't lose both devices of any mirrored pair, but you
> > could lose one of every mirrored pair. That's why raid10 is
> > considered scalable.
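Out of curiosity, here is a small brute-force check of that claim
(again just my own sketch; the stripe maps are copied from the dumps
above):

from itertools import combinations

# stripe number -> devid, copied from the chunk dumps above
mkfs_chunk   = [4, 3, 2, 1]   # items 4-6, allocated by mkfs
kernel_chunk = [2, 3, 4, 1]   # item 7 and later, allocated by the kernel

def chunk_survives(stripes, failed):
    """A raid10 chunk is intact if each mirrored pair, stripes (0,1)
    and stripes (2,3), still has at least one device present."""
    return not ({stripes[0], stripes[1]} <= failed or
                {stripes[2], stripes[3]} <= failed)

for failed in combinations([1, 2, 3, 4], 2):
    ok = all(chunk_survives(c, set(failed))
             for c in (mkfs_chunk, kernel_chunk))
    print(f"lose devids {failed}: {'still complete' if ok else 'data lost'}")

With these two pairings it reports only the losses {1,3} and {2,4} as
leaving every chunk complete, which matches the edge case discussed
below; with a single consistent pairing, losing any one device per
mirrored pair would survive.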
> Let me compare the btrfs raid10 to a conventional raid5. Assume a
> raid5 across n disks. Then, for each chunk (I don't know the proper
> unit of such a chunk) of data across n-1 disks, a parity chunk is
> written to the remaining disk using xor. Parity chunks are
> distributed across all disks. In case the data of a failed disk has
> to be restored from the degraded array, the entirety of the n-1
> remaining disks has to be read in order to reconstruct the data using
> xor. Is this correct? Again, in order to restore a failed disk in
> raid5, all data on all remaining disks is needed; otherwise the array
> cannot be restored. Correct?
>
> For btrfs raid10, I can only lose a single device, but in order to
> rebuild it, I only need to read the amount of data which was stored
> on the failed device, as no parity is used, but mirroring. Correct?
> Therefore, the number of bits I need to read successfully for a
> rebuild is independent of the number of devices included in the
> raid10, while the amount of read data scales with the number of
> devices in a raid5.
>
> Still, I think it is unfortunate that btrfs raid10 does not stick to
> a fixed layout, as then the entire array must be available. If you
> have your devices attached by more than one controller, in more than
> one case powered by different power supplies etc., the probability
> for their failure has to be summed up,

This formulation might be a bit vague. For m devices of which none is
allowed to fail, the probability that at least one of them fails is

    p_tot = 1 - (1 - p_f)^m

where p_f is the probability of failure for a single device, assuming
p_f is the same for all m devices.

> as no component is allowed to fail. Is work under way to change this,
> or is this something out of reach for btrfs, as it is an
> implementation detail of the kernel?

> > But apparently the pairing is different between mkfs and kernel
> > code, and due to that I can't reliably lose more than one device.
> > There is an edge case where I could lose two:
> >
> > stripe 0 devid 4
> > stripe 1 devid 3
> > stripe 2 devid 2
> > stripe 3 devid 1
> >
> > stripe 0 devid 2
> > stripe 1 devid 3
> > stripe 2 devid 4
> > stripe 3 devid 1
> >
> > I could, in theory, lose devid 3 and devid 1 and still have one
> > copy of each stripe for all block groups, but kernel code doesn't
> > permit this:
> >
> > [352467.557960] BTRFS warning (device dm-9): missing devices (2)
> > exceeds the limit (1), writeable mount is not allowed
>
> > > 2) Recover raid10 from a failed disk
> > > Raid10 inherits its redundancy from the raid1 scheme. If I build
> > > a raid10 from n devices, each bit is mirrored across two devices.
> > > Therefore, in order to restore a raid10 from a single failed
> > > device, I need to read that device's worth of data from the
> > > remaining n-1 devices.
> >
> > Maybe? In a traditional raid10, rebuild of a faulty device means
> > reading 100% of its mirror device, and that's it. For Btrfs the
> > same could be true; it just depends on where the block group copies
> > are located. They could all be on just one other device, or they
> > could be spread across more than one device. Also, Btrfs only
> > copies extents; it's not doing a sector-level rebuild, so it'll
> > skip the empty space.
>
> > > In case the amount of data on the failed disk is in the order of
> > > the number of bits for which I can expect an unrecoverable read
> > > error from a device, I will most likely not be able to recover
> > > from the disk failure. Is this conclusion correct, or am I
> > > missing something here?
> >
> > I think you're overestimating the probability of a URE. They're
> > pretty rare, and it's far less likely if you're doing regular
> > scrubs.
> >
> > I haven't actually tested this, but if a URE or even a checksum
> > mismatch were to happen on a data block group during a rebuild
> > following the replacement of a failed device, I'd like to think
> > Btrfs just complains and doesn't stop the remainder of the rebuild.
> > If it happens on metadata or the system chunk, that's bad and could
> > be fatal.
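To get a feeling for the numbers, here is the arithmetic as a short
Python sketch; the per-device failure probability, the rebuild size,
and the 1e-14 unrecoverable-bit-error rate are assumptions (the latter
is the figure commonly quoted on consumer drive spec sheets), not
measured values:

# 1) Probability that an array where no device may fail loses a device:
p_f = 0.03                       # assumed per-device failure probability
m = 4                            # devices that must all survive
p_tot = 1 - (1 - p_f) ** m
print(f"P(at least one of {m} devices fails) = {p_tot:.3f}")  # ~0.115

# 2) Probability of hitting at least one URE while reading back one
#    device's worth of data (a raid10 rebuild reads only the mirrored
#    data, independent of the total number of devices):
ber = 1e-14                      # assumed unrecoverable bit error rate
bytes_read = 2 * 10**12          # assume ~2 TB must be re-read
p_ure = 1 - (1 - ber) ** (8 * bytes_read)
print(f"P(URE while reading 2 TB) = {p_ure:.3f}")             # ~0.15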
> > As an aside, I'm finding the size information for the data chunk in
> > 'fi us' confusing...
> >
> > The sample file system contains one file:
> > [root@f24s ~]# ls -lh /mnt/0
> > total 1.4G
> > -rw-r--r--. 1 root root 1.4G Aug 13 19:24
> > Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso
> >
> > [root@f24s ~]# btrfs fi us /mnt/0
> > Overall:
> >     Device size:          400.00GiB
> >     Device allocated:       8.03GiB
> >     Device unallocated:   391.97GiB
> >     Device missing:           0.00B
> >     Used:                   2.66GiB
> >     Free (estimated):     196.66GiB  (min: 196.66GiB)
> >     Data ratio:                2.00
> >     Metadata ratio:            2.00
> >     Global reserve:        16.00MiB  (used: 0.00B)
> >
> > ## "Device size" is the total volume or pool size, "Used" shows
> > actual usage accounting for the raid10 replication, and yet "Free"
> > shows 1/2. This can't work long term: by the time I have 100GiB in
> > the volume, Used will report 200GiB while Free will report 100GiB,
> > for a total of 300GiB, which does not match the device size. So
> > that's a bug in my opinion.
> >
> > Data,RAID10: Size:2.00GiB, Used:1.33GiB
> >    /dev/mapper/VG-1  512.00MiB
> >    /dev/mapper/VG-2  512.00MiB
> >    /dev/mapper/VG-3  512.00MiB
> >    /dev/mapper/VG-4  512.00MiB
> >
> > ## The file is 1.4GiB but the Used reported is 1.33GiB? That's
> > weird. And now in this area the user is somehow expected to know
> > that all of these values are 1/2 their actual value due to the
> > RAID10. I don't like this inconsistency, for one. But it's made
> > worse by the secret-decoder-ring method of usage reporting when it
> > comes to individual device allocations. Very clearly Size is really
> > 4GiB, and each device has a 1GiB chunk. So why not say that? That
> > would be consistent with the earlier "Device allocated" value of
> > 8GiB.
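For what it's worth, the "Free (estimated)" value can be reproduced
from the other numbers; this is my reading of the apparent accounting
(a sketch, not the actual btrfs-progs code):

# Reproduce "Free (estimated)" from the btrfs fi us output above.
# Values in GiB; "raw" counts both raid10 copies.
unallocated_raw = 391.97   # raw space not yet allocated to any chunk
data_size       = 2.00     # logical size of the allocated data chunks
data_used       = 1.33     # logical data actually stored in them
data_ratio      = 2.00     # raid10 keeps two copies of everything

# Unallocated raw space can hold 1/ratio of its size in new data, plus
# whatever is still free inside the already-allocated data chunks:
free_estimated = unallocated_raw / data_ratio + (data_size - data_used)
print(f"Free (estimated) ~ {free_estimated:.2f} GiB")  # ~196.66 GiB

So "Free" is a logical (single-copy) number while "Used" is raw, which
is exactly the inconsistency complained about above.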