From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Date: Fri, 3 Mar 2017 03:38:56 +0000 (UTC)
Message-ID: <pan$684e5$d327dc6a$2427124b$84079405@cox.net>
In-Reply-To: 22712.48434.400550.346157@tree.ty.sabi.co.uk
Peter Grandi posted on Fri, 03 Mar 2017 00:47:46 +0000 as excerpted:
>> [ ... ] Meanwhile, the problem as I understand it is that at the first
>> raid1 degraded writable mount, no single-mode chunks exist, but without
>> the second device, they are created. [ ... ]
>
> That does not make any sense, unless there is a fundamental mistake in
> the design of the 'raid1' profile, which this and other situations make
> me think is a possibility: that the category of "mirrored" 'raid1' chunk
> does not exist in the Btrfs chunk manager. That is, a chunk is either
> 'raid1' if it has a mirror, or if has no mirror it must be 'single'.
>
> If a member device of a 'raid1' profile multidevice volume disappears
> there will be "unmirrored" 'raid1' profile chunks and some code path
> must recognize them as such, but the logic of the code does not allow
> their creation. Question: how does the code know that a specific 'raid1'
> chunk is mirrored or not? The chunk must have a link (member, offset) to
> its mirror, do they?
The problem at the surface level is, raid1 chunks MUST be created with
two copies, one each on two different devices. It is (currently) not
allowed to create only a single copy of a raid1 chunk, and the two copies
must be on different devices, so once you have only a single device,
raid1 chunks cannot be created.
Which presents a problem when you're trying to recover, since you need a
writable mount in order to do a device replace or add/remove (with the
remove triggering a balance). Btrfs is COW, so any changes get written
to new locations, which requires chunk space that might not be
available in the currently allocated chunks.
To work around that, they allowed the chunk allocator to fall back to
single mode when it couldn't create raid1.
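
As a rough illustration of that fallback, here's a toy model in Python. It is not the kernel's allocator, and the Chunk type and allocate_chunk name are made up for the example; it only captures the rule that raid1 wants two different devices and that single is what you get otherwise:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    profile: str     # "raid1" or "single"
    devices: tuple   # ids of the devices holding a copy


def allocate_chunk(present_devices):
    """Toy allocator for a filesystem whose target profile is raid1."""
    devs = sorted(present_devices)
    if not devs:
        raise RuntimeError("no writable devices to allocate from")
    if len(devs) >= 2:
        # Normal case: one copy each on two different devices.
        return Chunk("raid1", (devs[0], devs[1]))
    # Degraded case: only one device left, so fall back to single mode.
    return Chunk("single", (devs[0],))
```

With devices {1, 2} present this hands back a raid1 chunk on both; with only {1} present it hands back a single chunk on device 1, which is exactly the sort of chunk that causes the trouble at the next mount.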
Which is fine as long as the recovery is completed in the same mount.
But if you unmount or crash and try to remount to complete the job after
those single-mode chunks have been created, oops! Single mode chunks on
a multi-device filesystem with a device missing, and the logic currently
isn't sophisticated enough to realize that all the chunks are actually
accounted for, so it forces read-only mounting to prevent further damage.
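
Here's a simplified Python model of that sort of filesystem-wide check. It's my sketch of the behavior as described, not code lifted from the kernel, and the function and table names are hypothetical. The point is that it only asks which profiles exist and how many devices are missing, never where any particular chunk lives:

```python
# How many missing devices each profile can survive (two-copy raid1, no-copy single).
TOLERATED_MISSING = {"single": 0, "raid1": 1}


def coarse_rw_mount_allowed(profiles_in_use, missing_device_count):
    """Filesystem-wide check: the weakest profile present decides for everyone."""
    weakest = min(TOLERATED_MISSING[p] for p in profiles_in_use)
    return missing_device_count <= weakest
```

So a pure raid1 filesystem with one device missing passes, but as soon as a single-mode chunk exists anywhere, the same filesystem with the same missing device fails, even if that single chunk sits safely on the surviving device.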
Which means you can copy off the files to a different filesystem as
they're still all available, including any written in single-mode, but
you can't fix the degraded filesystem any longer, as that requires a
writable mount you're not going to be able to get, at least not with
mainline.
At a lower level, the problem is that for raid1 (and I think raid10 as
well, tho I'm not sure on it), they made a mistake in the implementation.
For raid56, the minimum number of devices allowed for degraded write is
lower than the minimum for undegraded write, by the number of parity
devices. So raid5 needs two devices for undegraded write (one data, one
parity) but only one for degraded write, and raid6 needs three for
undegraded write (one data, two parity) but, again, only one for
degraded write.
But for raid1, both the degraded write minimum and the undegraded write
minimum are set to *two* devices, an implementation error since the
degraded write minimum should arguably be one device, without a mirror.
So the degrade to single-mode is a workaround for the real problem, which
is that degraded raid1 write (that is, chunk creation) isn't allowed.
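
To put those numbers in one place, here's a small Python table. The figures are just the ones given in this message, not read out of the kernel source, and the names (MINIMUMS, MINIMUMS_FIXED, can_write_degraded) are made up for illustration:

```python
# profile -> (undegraded write minimum, degraded write minimum), in devices,
# as described in this message.
MINIMUMS = {
    "raid5": (2, 1),
    "raid6": (3, 1),
    "raid1": (2, 2),   # the complaint: degraded minimum equals undegraded minimum
}

# What the raid1 entry arguably should be: one device, no mirror, when degraded.
MINIMUMS_FIXED = {**MINIMUMS, "raid1": (2, 1)}


def can_write_degraded(profile, writable_devices, minimums=MINIMUMS):
    """True if a chunk of this profile could be created on this many devices."""
    _undegraded_min, degraded_min = minimums[profile]
    return writable_devices >= degraded_min
```

With one writable device, raid5 and raid6 pass and raid1 doesn't; swap in MINIMUMS_FIXED and raid1 passes as well, with no need for the single-mode fallback at all.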
And all this is known and has been discussed right here on this list by
the devs, but nobody has actually bothered to properly fix it, either by
correctly setting the degraded raid1 write minimum to a single device, or
even by working around the single-mode workaround, by correctly checking
each chunk and allowing writable mount if all are accounted for, even if
there's a missing device.
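
The second option, checking each chunk, might look something like the following in Python. This is a sketch only: the names are hypothetical, chunks are reduced to (profile, device ids) pairs, and the patch that was submitted may well do it differently:

```python
# How many of its own copies each profile can lose and still be readable.
TOLERATED_LOST_COPIES = {"single": 0, "raid1": 1}


def per_chunk_rw_mount_allowed(chunks, missing_devices):
    """Allow a writable mount only if every chunk is still accounted for.

    chunks: iterable of (profile, device_ids) pairs
    missing_devices: set of device ids that are absent
    """
    for profile, device_ids in chunks:
        lost = sum(1 for dev in device_ids if dev in missing_devices)
        if lost > TOLERATED_LOST_COPIES[profile]:
            return False   # this chunk really has lost too many copies
    return True
```

Take a two-device raid1 with device 2 missing, plus a single chunk written to device 1 during the degraded mount: chunks = [("raid1", (1, 2)), ("single", (1,))]. Every chunk still has a readable copy, so this check allows the writable mount. Had the single chunk been on device 2, it would correctly refuse.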
Or rather, the workaround for the incomplete workaround has had a patch
submitted, but it got stuck in that long-running project and has been in
limbo ever since, and now I guess the patch has gone stale and doesn't
even properly apply any longer.
All of which is yet more demonstration of the fact that is stated time
and again on this list, that btrfs should be considered stabilizing, but
still under heavy development and not yet fully stable, and backups
should be kept updated and at-hand for any data you value higher than the
bother and resources necessary to make those backups.
Because if there's backups updated and at hand, then what happens to the
working copy doesn't matter, and in this particular case, even if the
backups aren't fully current, the fact that they're available means
there's space available to update them from the working copy should it go
into readonly mode as well, which means recovery from the read-only
formerly working copy is no big deal.
Either that, or by definition, the data wasn't of enough value to have
backups when storing it on a widely known to be still stabilizing and
under heavy development filesystem, where those backups are strongly
recommended for any data of value, so /losing/ that data, by definition
of failure to have that backup, can't be that big a deal either. If
actions, or failure to complete actions, speak louder than words, well,
that's the way it is.
> What makes me think that "unmirrored" 'raid1' profile chunks are "not a
> thing" is that it is impossible to remove explicitly a member device
> from a 'raid1' profile volume: first one has to 'convert' to 'single',
> and then the 'remove' copies back to the remaining devices the 'single'
> chunks that are on the explicitly 'remove'd device. Which to me seems
> absurd.
A device can indeed be removed from a raid1 without converting to single
first... as long as that raid1 had more than two devices before, and
there's enough space on the remaining two-plus devices to put at least
one copy each on two separate devices.
Of course if there's only two devices in the raid1 to begin with, then
yes, you can't remove one of the two devices while it's still raid1. And
of course if there's not enough room on the remaining two-plus devices
for what was on the device being removed, likewise. But you didn't
mention either one of those conditions.
> Going further in my speculation, I suspect that at the core of the Btrfs
> multidevice design there is a persistent "confusion" (to use en
> euphemism) between volumes having a profile, and merely chunks have a
> profile.
Well, in btrfs, it's always chunks having the profile. But there is
indeed a confusion, as explained above. It's just not quite the one you
described.
> My additional guess that the original design concept had multidevice
> volumes to be merely containers for chunks of whichever mixed profiles,
> so a subvolume could have 'raid1' profile metadata and 'raid0' profile
> data, and another could have 'raid10' profile metadata and data, but
> since handling this turned out to be too hard, this was compromised into
> volumes having all metadata chunks to have the same profile and all data
> of the same profile, which requires special-case handling of corner
> cases, like volumes being converted or missing member devices.
>
> So in the case of 'raid1', a volume with say a 'raid1' data profile
> should have all-'raid1' and fully mirrored profile chunks, and the lack
> of a member devices fails that aim in two ways.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman