Re: raid1 degraded mount still produce single chunks, writeable mount not allowed


From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Peter Grandi <pg@btrfs.list.sabi.co.UK>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Date: Fri, 3 Mar 2017 07:38:36 -0500
Message-ID: <284e9d01-3e69-705e-dd88-ebe047e2e4b8@gmail.com>
In-Reply-To: <22712.48434.400550.346157@tree.ty.sabi.co.uk>

On 2017-03-02 19:47, Peter Grandi wrote:
>> [ ... ] Meanwhile, the problem as I understand it is that at
>> the first raid1 degraded writable mount, no single-mode chunks
>> exist, but without the second device, they are created.  [
>> ... ]
>
> That does not make any sense, unless there is a fundamental
> mistake in the design of the 'raid1' profile, which this and
> other situations make me think is a possibility: that the
> category of "mirrored" 'raid1' chunk does not exist in the Btrfs
> chunk manager. That is, a chunk is either 'raid1' if it has a
> mirror, or if it has no mirror it must be 'single'.
>
> If a member device of a 'raid1' profile multidevice volume
> disappears there will be "unmirrored" 'raid1' profile chunks and
> some code path must recognize them as such, but the logic of the
> code does not allow their creation. Question: how does the code
> know that a specific 'raid1' chunk is mirrored or not? The chunk
> must have a link (member, offset) to its mirror, do they?
>
> What makes me think that "unmirrored" 'raid1' profile chunks are
> "not a thing" is that it is impossible to remove explicitly a
> member device from a 'raid1' profile volume: first one has to
> 'convert' to 'single', and then  the 'remove' copies back to the
> remaining devices the 'single' chunks that are on the explicitly
> 'remove'd device. Which to me seems absurd.
It is; there should be a way to do this as a single operation.  The
reason this is currently the case is simple, though: 'btrfs device
delete' is just a special instance of balance that prevents new chunks
from being allocated on the device being removed and relocates all the
chunks on that device so they end up on other devices.  It currently
does no profile conversion, but having that as an option would actually
be _very_ useful from a data safety perspective.
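To make the "special instance of balance" point concrete, here is a
toy sketch in Python.  It is not the kernel's code: the Chunk class,
the device_delete() function and the device names are all hypothetical,
and the placement rule (one copy per device) is my simplification.  It
only shows that relocation keeps each chunk's profile, so a two-device
'raid1' volume has nowhere to put the second copy.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    profile: str   # e.g. "raid1"; this toy delete never changes it
    devices: list  # hypothetical device names, one copy per device


def device_delete(chunks, devices, victim):
    """Toy model: move every chunk copy off `victim`, keeping the profile."""
    remaining = [d for d in devices if d != victim]
    for chunk in chunks:
        if victim not in chunk.devices:
            continue
        # Assumption: a copy may only go to a device not already holding one.
        candidates = [d for d in remaining if d not in chunk.devices]
        if not candidates:
            raise RuntimeError(
                f"cannot keep a {chunk.profile} chunk after removing {victim}"
            )
        chunk.devices[chunk.devices.index(victim)] = candidates[0]
    return remaining
```

With three devices the 'raid1' chunk simply moves its copy from the
removed device to the spare one.  With two devices the call raises,
which is the situation that currently forces a separate convert step.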
>
> Going further in my speculation, I suspect that at the core of
> the Btrfs multidevice design there is a persistent "confusion"
> (to use a euphemism) between volumes having a profile, and
> merely chunks having a profile.
There generally is.  The profile is entirely a property of the chunks 
(each chunk literally has a bit of metadata that says what profile it 
is), not the volume.  There's some metadata in the volume somewhere that 
says what profile to use for new chunks of each type (I think), but that 
doesn't dictate what chunk profiles there are on the volume.  This whole 
arrangement is actually pretty important for fault tolerance in general, 
since during a conversion you have _both_ profiles for that chunk type 
at the same time on the same filesystem (new chunks will get allocated 
with the new type though), and the kernel has to be able to handle a 
partially converted FS.
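As a rough illustration of "each chunk carries its own profile", here
is a small Python sketch that decodes per-chunk flag words.  The bit
values are what I recall from the kernel's btrfs_tree.h header and
should be checked against your kernel source before relying on them;
the describe() and profiles_in_use() helpers are hypothetical names of
my own.

```python
# Flag bits as recalled from btrfs_tree.h (an assumption; please verify).
BLOCK_GROUP_DATA = 1 << 0
BLOCK_GROUP_SYSTEM = 1 << 1
BLOCK_GROUP_METADATA = 1 << 2
BLOCK_GROUP_RAID1 = 1 << 4

TYPE_BITS = {
    BLOCK_GROUP_DATA: "data",
    BLOCK_GROUP_SYSTEM: "system",
    BLOCK_GROUP_METADATA: "metadata",
}
PROFILE_BITS = {
    1 << 3: "raid0",
    BLOCK_GROUP_RAID1: "raid1",
    1 << 5: "dup",
    1 << 6: "raid10",
    1 << 7: "raid5",
    1 << 8: "raid6",
}


def describe(flags):
    """Return (type, profile) for one chunk; no profile bit means 'single'."""
    ctype = "+".join(name for bit, name in TYPE_BITS.items() if flags & bit)
    profile = next(
        (name for bit, name in PROFILE_BITS.items() if flags & bit), "single"
    )
    return ctype, profile


def profiles_in_use(chunk_flags):
    """Map each chunk type to the set of profiles present on the volume."""
    seen = {}
    for flags in chunk_flags:
        ctype, profile = describe(flags)
        seen.setdefault(ctype, set()).add(profile)
    return seen
```

Feeding it the flags of a half-converted volume, say one 'raid1' data
chunk, one data chunk with no profile bit, and one 'raid1' metadata
chunk, reports both 'raid1' and 'single' for data.  Nothing at the
volume level has to agree with that for the chunks to be readable.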
>
> My additional guess that the original design concept had
> multidevice volumes to be merely containers for chunks of
> whichever mixed profiles, so a subvolume could have 'raid1'
> profile metadata and 'raid0' profile data, and another could
> have 'raid10' profile metadata and data, but since handling this
> turned out to be too hard, this was compromised into volumes
> having all metadata chunks to have the same profile and all data
> of the same profile, which requires special-case handling of
> corner cases, like volumes being converted or missing member
> devices.
Actually, the only bits missing that would be needed to do this are 
stuff to segregate the data of given subvolumes completely from each 
other (i.e., make sure they can't be in the same chunks at all).  Doing 
that is hard, so we don't have per-subvolume profiles yet.  It's fully 
possible to have a mix of profiles on a given volume though.  Some old 
versions of mkfs actually did this (you'd end up with a small single 
profile chunk of each type on a FS that used different profiles), and 
the filesystem is in exactly that state when converting between profiles 
for a given chunk type.  New chunks will only be generated with one 
profile, but you can have whatever other mix you want essentially (in 
fact, one of the handful of regression tests I run when I'm checking 
patches explicitly creates a filesystem with one data and one system 
chunk of every profile and makes sure the kernel can still access it 
correctly).
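
The "new chunks get one profile, existing chunks keep theirs" behaviour
can be sketched as a tiny state model.  This is only an illustration
under my own assumptions: the Volume class, its allocate() and
convert() methods and the `limit` parameter are hypothetical and do not
correspond to kernel structures.

```python
class Volume:
    def __init__(self, target):
        self.target = dict(target)  # chunk type -> profile for NEW chunks
        self.chunks = []            # list of (chunk type, profile) pairs

    def allocate(self, ctype):
        """New chunks always use the current target profile for their type."""
        chunk = (ctype, self.target[ctype])
        self.chunks.append(chunk)
        return chunk

    def convert(self, ctype, new_profile, limit=None):
        """Rewrite up to `limit` chunks of `ctype` (None means all of them)."""
        self.target[ctype] = new_profile
        done = 0
        for i, (t, p) in enumerate(self.chunks):
            if t != ctype or p == new_profile:
                continue
            if limit is not None and done >= limit:
                break  # models a conversion that stops part-way through
            self.chunks[i] = (t, new_profile)
            done += 1
        return done
```

Stopping a conversion after one chunk leaves old and new profiles side
by side, while any later allocation already uses the new profile.
That is the mixed state the kernel has to cope with.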
