Date: Fri, 12 Dec 2014 23:28:24 -0500
From: Zygo Blaxell
To: Robert White
Cc: Btrfs BTRFS
Subject: Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
Message-ID: <20141213042824.GC25614@hungrycats.org>
In-Reply-To: <548B6BF6.2060306@pobox.com>
Sender: linux-btrfs-owner@vger.kernel.org

On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
> On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> >On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> >>So RAID5 with three media M is
> >>
> >>M     MM    MMM
> >>D1    D2    P(a)
> >>D3    P(b)  D4
> >>P(c)  D5    D6
> >
> >RAID5 with two media is well defined, and looks like this:
> >
> >M     MM
> >D1    P(a)
> >P(b)  D2
> >D3    P(c)
>
> Like I said in the other fork of this thread... I see (now) that the
> math works but I can find no trace of anyone having ever implemented
> this for arity less than 3 RAID greater than one paradigm (outside
> btrfs and its associated materials).

I've set up mdadm that way (though it does ask you to use '--force' when
you set it up).  mdadm will also ask for --force if you try to set up
RAID1 with one disk.
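To make the two-disk case concrete, here is a minimal sketch of the layouts in the diagrams above.  It is only an illustration: the names xor_blocks and raid5_layout are made up for this example and are not taken from mdadm or btrfs, and it assumes at least two disks and a data block count that fills whole stripes.  The point it shows is that with a single data block per stripe, the XOR parity is just a copy of that block, so two-disk RAID5 stores the same bytes as a mirror while keeping the RAID5 stripe layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_layout(data_blocks, ndisks):
    """Lay data blocks out in stripes of ndisks blocks each.

    Hypothetical helper: parity starts on the last disk and moves one
    disk to the left on every stripe, matching the diagrams above.
    Assumes ndisks >= 2 and len(data_blocks) divisible by ndisks - 1.
    """
    if ndisks < 2:
        raise ValueError("this sketch needs at least two disks")
    per_stripe = ndisks - 1
    stripes = []
    for start in range(0, len(data_blocks), per_stripe):
        chunk = list(data_blocks[start:start + per_stripe])
        parity = xor_blocks(chunk)
        parity_pos = (ndisks - 1 - len(stripes)) % ndisks
        chunk.insert(parity_pos, parity)
        stripes.append(chunk)
    return stripes


if __name__ == "__main__":
    for row in raid5_layout([b"D1D1", b"D2D2", b"D3D3"], 2):
        print(row)
```

On two disks every row contains the same block twice (once as data, once as parity); on three disks the same function produces the rotating D/P pattern from the first diagram.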
I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way to
change a layout once created (usually because they shoot themselves in
the foot with bad choices early on, e.g. by picking odd parity for RAID5).

The reason to allow it is future expansion: below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later.  If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5.  So you start with RAID5
on one or two disks so you can scale up without losing any data.

Also, mdadm can grow a two-disk RAID5, but if you try to grow a two-disk
mdadm RAID1 you just get a three-disk RAID1 (i.e. two redundant copies
with no additional capacity).

btrfs doesn't really need this capability for expansion, since it can
just create new RAID5 profile chunks whenever it wants to; however,
I'd expect a complete btrfs RAID5 implementation to borrow some ideas
from ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing.  That would imply
btrfs-RAID56 profile chunks would have to be able to exist on two or
even one disk, if that was all that was available for writing at the time.
Simply using btrfs-RAID1 chunks wouldn't work since they'd behave the
wrong way when more disks were added later.

> MEANWHILE
>
> the system really needs to be able to explicitly express and support
> the "missing" media paradigm.
>
> M     x  MMM
> D1    .  P(a)
> D3    .  D4
> P(c)  .  D6
>
> The correct logic here to "remove" (e.g. "replace with nothing"
> instead of "delete") a media just doesn't seem to exist. And it's
> already painfully missing in the RAID1 situation.

There are a number of permanent mistakes a naive admin can make when
dealing with a broken array.
I've destroyed arrays (made them permanently read-only beyond the ability
of btrfs kernel or user tools to recover) by getting "add" and "replace"
confused, or by allowing an offline drive to rejoin an array that had
been mounted read-write,degraded for some time.

The basic functionality works.  btrfs does track missing devices and
can replace them relatively quickly (not as fast as mdadm, but less than
an order of magnitude slower) in RAID1.  The reporting is full of
out-of-date cached data, but when a disk is really failing, there is
usually little doubt which one needs to be replaced.

> If I have a system with N SATA ports, and I have connected N drives,
> and device M is starting to fail... I need to be able to disconnect
> M and then connect M(new). Possibly with a non-trivial amount of
> time in there. For all RAID levels greater than zero this is a
> natural operation in a degraded mode. And for a nearly full
> filesystem the shrink operation that is btrfs device delete would
> not work. And for any nontrivially occupied filesystem it would be
> way slow, and need to be reversed for another way-slow interval.
>
> So I need to be able to "replace" a drive with a "nothing" so that
> the number of active media becomes N-1 but the arity remains N.

btrfs already does that, but it sucks.

In a naive RAID5 implementation, a write in degraded mode will corrupt
your data if it is interrupted.  This is a general property of all RAID5
implementations that don't have NVRAM journalling or some other way to
solve the atomic update problem.

ZFS does this well: when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks
in non-degraded mode.  If you have 5 disks, and one dies, your writes
are then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing).  This prevents
the degraded mode write data integrity problem.
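The degraded-write problem and the narrower-stripe workaround can be shown with a few lines of XOR arithmetic.  This is a toy model only: the block contents, the MISSING index and the helper names are invented for illustration, and it does not reflect how ZFS or btrfs actually lay out or commit data.  It assumes 5 disks with 4 data blocks plus parity per old stripe, and one data disk missing.

```python
def xor(*blocks):
    """XOR any number of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def reconstruct(data, parity, missing):
    """Rebuild the block on the missing disk from survivors plus parity."""
    survivors = [b for i, b in enumerate(data) if i != missing]
    return xor(parity, *survivors)


# Old stripe: 4 data + 1 parity on 5 disks; the disk holding data[3] is gone.
MISSING = 3
old_data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
old_parity = xor(*old_data)
intact = reconstruct(old_data, old_parity, MISSING)

# Naive degraded write, interrupted: data[0] is overwritten in place but
# the parity update never reaches disk.  The missing block is now rebuilt
# from inconsistent inputs, so data that was never written is damaged.
torn_data = [b"AAAA"] + old_data[1:]
damaged = reconstruct(torn_data, old_parity, MISSING)

# Narrower stripe instead: new data goes to a fresh 3 data + parity stripe
# on the 4 surviving disks.  The old stripe is not touched, so it still
# reconstructs correctly whether or not the new write completes.
new_data = [b"AAAA", b"eeee", b"ffff"]
new_parity = xor(*new_data)
still_intact = reconstruct(old_data, old_parity, MISSING)

if __name__ == "__main__":
    print("before write:      ", intact)
    print("after torn write:  ", damaged)
    print("old stripe, narrow:", still_intact)
```

In the torn case the reconstructed block no longer matches what was stored on the dead disk, even though that block was not the target of the write.  In the narrow-stripe case the new stripe has full redundancy on its own and the old stripe is left alone.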
When the dead disk is replaced you would have the 3 data + parity
promoted to 4 data + parity, or you can elect not to replace the dead
disk and get 3 data + parity everywhere (with a loss of capacity).

btrfs could presumably do that by allocating chunks with different raid56
parameters, although in this early stage of implementation I'm not sure
how much of any of that has been done yet.

> mdadm has the "missing" keyword. the Device Mapper has the "zero"
> target.

dm also has the "error" target, which is much better for this ("zero"
would allow reads to succeed, which is incorrect).  lvm2 uses "error"
for missing pieces of broken LVs in partial mode.

> btrfs replace start /dev/sdc /dev/nothing /
> (time passes, physical device is removed and replaced)
> btrfs replace start /dev/nothing /dev/sdc /

Why wouldn't you just remove the physical device (say device #2) and
then run:

	btrfs replace start 2 /dev/sdc /

?  The way it works now seems much less complicated than what you propose.

Granted, I have a feature request here: we know the sizes of all the
missing disks, and we know the size of /dev/sdc, so why can't we just
write "missing" instead of "2" and have btrfs choose a missing device
to replace by itself?

> Now that's good-ish, but really the first replace is pernicious. The
> internal state for the filesystem should just be able to record that
> device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this
> example) is just gone. The replace-with-nothing becomes more-or-less
> instant.

To clarify: what is required here is the ability to quickly record that
the device's subuuid is no longer welcome in this filesystem, and never
will be.  Should it reappear in the future, it has to be excluded from
the btrfs.  The underlying physical device could return, but it would
have to be treated as a new empty device with a new subuuid, and its
data reconstructed by btrfs balance or btrfs replace.
This is because btrfs does really awful things when a filesystem gets
assembled out of mirrors of different vintages.  Before allowing writes
on a subset of the disks in a multi-disk btrfs, the disks that are
written have to agree that they are now the only disks that are currently
members of the filesystem.

> [The use of "device delete" and "device add" as changes in arity and
> size, and its inapplicability to cases where failure is being dealt
> with absent a change of arity, could be clearer in the
> documentation.]

Yes.

This is _not_ equivalent to a btrfs replace, although it is very similar:

	btrfs device add /dev/sdc /
	btrfs device delete missing /

It can work--sometimes--but it needs a surprising amount of free space
(or multiple new drives).
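As a rough sketch of the membership rule described above (devices that take degraded writes agree they are the only members, and an absent device's subuuid is never welcome again), something like the following would do.  Everything here is hypothetical: Device, Filesystem, mount_degraded_rw, may_rejoin and the generation counter are invented for this illustration and are not btrfs data structures or behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    subuuid: str
    generation: int  # last membership generation this device recorded


@dataclass
class Filesystem:
    generation: int = 0
    members: set = field(default_factory=set)
    banned: set = field(default_factory=set)

    def mount_degraded_rw(self, present):
        """Before the first degraded write, drop every absent device.

        The present devices record a new generation, so a stale device
        can be recognised later even if its subuuid looks familiar.
        """
        absent = self.members - {dev.subuuid for dev in present}
        self.banned |= absent
        self.members -= absent
        self.generation += 1
        for dev in present:
            dev.generation = self.generation

    def may_rejoin(self, dev):
        """Only current members with the current generation are accepted."""
        return dev.subuuid in self.members and dev.generation == self.generation


if __name__ == "__main__":
    sda, sdb = Device("uuid-a", 0), Device("uuid-b", 0)
    fs = Filesystem(members={"uuid-a", "uuid-b"})
    fs.mount_degraded_rw([sda])      # sdb was offline during the writes
    print(fs.may_rejoin(sda))        # the surviving member
    print(fs.may_rejoin(sdb))        # the stale mirror
```

Under this model the returning disk is refused and would have to come back as a new, empty device with a fresh subuuid, with its contents rebuilt from the surviving members.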