linux-btrfs.vger.kernel.org archive mirror
From: Robert White <rwhite@pobox.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>,
	Goffredo Baroncelli <kreijack@inwind.it>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: BTRFS messes up snapshot LV with origin
Date: Fri, 28 Nov 2014 23:55:07 -0800
Message-ID: <54797BDB.1050905@pobox.com>
In-Reply-To: <20141129045924.GX17395@hungrycats.org>

On 11/28/2014 08:59 PM, Zygo Blaxell wrote:
> On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote:
>> On 11/27/2014 05:15 AM, Zygo Blaxell wrote:
>>> This is a weakness of the current udev and asynchronous device hotplug
>>> concept:  there is no notion of bus enumeration in progress, so we can be
>>> trying to assemble multi-device storage before we have all the devices
>>> visible.  Assembly of aggregate storage (whatever it is--btrfs, md,
>>> lvm2...) has to wait until all known storage buses are fully enumerated
>>> in order to detect if there are duplicates.
>>
>> It is more complex than that. Some devices may appear after the "1st" bus
>> enumeration.
>
> That case is well handled already--a new enumeration will start with the
> second (and all later) hotplug events.
>
> The problem arises when we try to assemble disk arrays before the
> known end of the "1st" (or any) enumeration.  There is no way for an
> enumerating agent to tell other agents "this is definitely not the
> complete list of devices yet, other devices may be inserted imminently"
> and defer all the multi-device assembly until the address space of the
> enumerating bus is fully covered.
>
mdadm has an "attached" but not "started" state for arrays that handles 
this condition during incremental assembly (see "mdadm --incremental 
/dev/whatever").

To slightly misuse the vocabulary: as each partition is encountered and 
submitted to the system, it's checked for a superblock. If one is found, 
it has the identity of an array encoded on it. If that array doesn't 
exist yet it is allocated; otherwise the device is added to the 
existing array. The array is only started once all the devices are 
accounted for, unless an option is added to allow earlier starts, and 
even then "enough" of the devices must be present to make sense (e.g. 
only one device missing from a RAID5, or a correct pair of devices for a 
RAID10, etc.).
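
As a rough illustration of that incremental flow (not mdadm's actual 
code), here is a minimal Python sketch. The Array class, the dict-shaped 
"superblock", and the allow_degraded flag are all hypothetical names made 
up for the example, and the degraded rule only covers the RAID5 case 
mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """Hypothetical in-memory record of a partially assembled array."""
    uuid: str
    level: str      # e.g. "raid5"
    expected: int   # member count recorded in the superblock
    members: set = field(default_factory=set)
    started: bool = False

    def missing(self):
        return self.expected - len(self.members)


def can_start(array, allow_degraded=False):
    """All members present, or (optionally) a RAID5 missing exactly one."""
    if array.missing() == 0:
        return True
    return allow_degraded and array.level == "raid5" and array.missing() == 1


def incremental_add(arrays, device, superblock, allow_degraded=False):
    """Attach one device; allocate its array on first sight, start when able."""
    if superblock is None:
        return None  # no superblock: not a member of anything we know about
    array = arrays.get(superblock["uuid"])
    if array is None:
        # First member seen: allocate the array in the "not started" state.
        array = Array(superblock["uuid"], superblock["level"],
                      superblock["expected"])
        arrays[array.uuid] = array
    array.members.add(device)
    if not array.started and can_start(array, allow_degraded):
        array.started = True
    return array
```

The point of the sketch is only that the array exists as an object 
before it is usable, which is the state btrfs currently lacks.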

So we'd need a "partially assembled but not started" state and some 
ioctls to do things like force-start or force-disown a filesystem that 
cannot be "finished" automatically.

That sort of thing is very easy to do with devices because devices don't 
have to be opened and can reject an open attempt, or at least the 
read/writes after an open and such.

Unfortunately a filesystem can really only exist as a mounted thing, and 
can really only be controlled by remounting thereafter. The most 
efficient way to do this would be to have an alternate file system 
operations structure that was filled mostly with dummy operations that 
would return ENOENT and friends. The remount that finally fulfilled 
the file system's requirements would then switch out that struct for the 
fully functional one. That remount would need an "adddev=" and some 
other such options (much like AUFS adds layers).
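
To make the "swap the operations structure" idea concrete, here is a toy 
Python model. Nothing here is kernel code: StubOps, RealOps, 
PendingMount and remount_adddev are hypothetical names, and the "adddev" 
remount is modelled as a plain method call.

```python
import errno


class StubOps:
    """Dummy operations: every lookup fails until the filesystem is complete."""

    def lookup(self, name):
        raise OSError(errno.ENOENT, "filesystem not assembled yet")


class RealOps:
    """Fully functional operations, backed here by a plain dict."""

    def __init__(self, entries):
        self.entries = entries

    def lookup(self, name):
        if name not in self.entries:
            raise OSError(errno.ENOENT, name)
        return self.entries[name]


class PendingMount:
    """A mount point that exists before all of its devices have shown up."""

    def __init__(self, needed_devices, entries):
        self.needed = set(needed_devices)
        self.present = set()
        self._entries = entries
        self.ops = StubOps()  # start with the dummy table

    def remount_adddev(self, device):
        """Hypothetical 'adddev=' remount: record the device, swap ops when complete."""
        if device in self.needed:
            self.present.add(device)
        complete = self.present == self.needed
        if complete:
            self.ops = RealOps(self._entries)  # switch to the real table
        return complete
```

Callers always go through mount.ops, so they see ENOENT until the last 
required device has been added and real answers afterwards.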

It's all doable. But it stretches the "mount" paradigm to near breaking. 
You would need an operation that looked like "mount -t btrfs -o 
do_we_need_this /dev/whatever /this/datum/means/nothing" to match and 
attach a device "wherever it goes", or you might end up needing to do the 
Cartesian product of trial attachments of each new device to all active 
filesystems to match it up, which is an ugly external scripting requirement.

As for waiting for the address space to be fully covered: meh. If a 
ready-or-not, or ready-enough, status is established in the file system, 
it would be undesirable for the file system to know anything about any 
other subsystem.

We don't care if enumeration is "done"; we only care if we have a 
rational set of storage, and whether that rational set is "enough" to be 
fully ready, enough to be only read-ready, or just plain not enough.
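
That three-way answer can be sketched as a small Python function. The 
rule used here is an assumption for illustration only: everything present 
means fully ready, missing no more than the profile tolerates means 
read-ready, anything worse means not enough. The names Readiness, 
classify and tolerated_missing are hypothetical.

```python
from enum import Enum


class Readiness(Enum):
    READY = "fully ready"
    READ_ONLY = "read-ready only"
    NOT_ENOUGH = "not enough"


def classify(expected, present, tolerated_missing):
    """Decide readiness from device counts alone, with no bus knowledge.

    expected          -- devices the filesystem says it has
    present           -- devices seen so far
    tolerated_missing -- how many may be absent and still be readable
                         (an assumed, per-profile number)
    """
    missing = expected - present
    if missing <= 0:
        return Readiness.READY
    if missing <= tolerated_missing:
        return Readiness.READ_ONLY
    return Readiness.NOT_ENOUGH
```

Note that the function never asks whether enumeration has finished; it 
only looks at what the filesystem itself knows.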

In theory, the idempotent mount command could be

mount -t btrfs some-uuid-instead-of-device /mount/point
mount -t btrfs some-other-uuid-here /other/mount/point

to create the zero-devices-involved entity, followed by

mount -t btrfs -o trydev /dev/something /this/bit/is/ignored

repeated for all possible somethings. /mount/point and 
/other/mount/point would be returning ENOENT for their contents until 
they were ready-enough.

In practice this is very impure compared to how mdadm has the /dev/md- 
namespace in which to build its devices before any actual mount is possible.
