Re: [RFC] shared subtrees - Mike Waychison

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Waychison <Michael.Waychison@Sun.COM>
To: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC] shared subtrees
Date: Thu, 13 Jan 2005 18:30:50 -0500	[thread overview]
Message-ID: <41E704AA.9000305@sun.com> (raw)
In-Reply-To: <20050113221851.GI26051@parcelfarce.linux.theplanet.co.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Just a few comments below.  Some of this will take time to digest ;)

Al Viro wrote:
> [apologies for delay - there'd been lots of unrelated crap lately]
> ======================================================================
> NOTE: as far as I'm concerned, that's a beginning of VFS-2.7 branch.
> All that work will stay in a separate tree, with gradual merge back
> into 2.6 once the things start settling down.
> ======================================================================
> 
> OK, here comes the first draft of proposed semantics for subtree
> sharing.  What we want is being able to propagate events between
> the parts of mount trees.  Below is a description of what I think
> might be a workable semantics; it does *NOT* describe the data
> structures I would consider final and there are considerable
> areas where we still need to figure out the right behaviour.
> 
> Let's start with introducing a notion of propagation node; I consider
> it only as a convenient way to describe the desired behaviour - it
> almost certainly won't be a data structure in the final variant.
> 
> 1) each p-node corresponds to a group of 1 or more vfsmounts.
> 2) there is at most 1 p-node containing a given vfsmount.
> 3) each p-node owns a possibly empty set of p-nodes and vfsmounts
> 4) no p-node or vfsmount can be owned by more than one p-node
> 5) only vfsmounts that are not contained in any p-nodes might be owned.
> 6) no p-node can own (directly or via intermediates) itself (i.e. the
> graph of p-node ownership is a forest).
> 
> These guys define propagation:
> 	a) if vfsmounts A and B are contained in the same p-node, events
> propagate from A to B
> 	b) if vfsmount A is contained in p-node p, vfsmount B is contained
> in p-node q and p owns q, events propagate from A to B
> 	c) if vfsmount A is contained in p-node p and vfsmount B is owned
> by p, events propagate from A to B

How is (c) different from (a)? Is there a distinction between
'containing' and 'owning' here?

> 	d) propagation is transitive: if events propagate from A to B and
> from B to C, they propagate from A to C.
> 
> In other words, members of the same p-node are equivalent and events anywhere
> in p-node are propagated to all its slaves.  Note that not any transitive
> relation can be represented that way; it has to satisfy the following
> condition:
> 	* A->C and B->C => A->B or B->A
> All propagation setups we are going to deal with will satisfy that condition.
> 
> 
> How do we set them up?
> 
> 	* we can mark a subtree sharable.  Every vfsmount in the subtree
> that is not already in some p-node gets a single-element p-node of its
> own.
> 	* we can mark a subtree slave.  That removes all vfsmounts in
> the subtree from their p-nodes and makes them owned by said p-nodes.
> p-nodes that became empty will disappear and everything they used to
> own will be repossessed by their owners (if any).

Would this be better read as "That removes each vfsmount A in the
subtree from its respective p-node p and makes it contained by a new
p-node p' (containing only A), and p' becomes 'owned' by p." ?


> 	* we can mark a subtree private.  Same as above, but followed
> by taking all vfsmounts in our subtree and making them *not* owned
> by anybody.
> 
> 
> Of course, namespace operations (clone, mount, etc.) affect that structure
> and are affected by it (that's what it's for, after all).
> 
> 	1. CLONE_NS
> 
> That one is simple - we copy vfsmounts as usual
> 	* if vfsmount A is contained in p-node p, then copy of A goes into
> the same p-node
> 	* if A is owned by p, then copy of A is also owned by p
> 	* no new p-nodes are created.
> 
> 	2. mount
> 
> We have a new vfsmount A and want to attach it to mountpoint somewhere in
> vfsmount B.  If B does not belong to any p-node, everything is as usual; A
> doesn't become a member or slave of any p-node and is simply attached to B.
> 
> If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events
> propagated from B and all p-nodes p1,...,pk that contain them.

By p1,...,pk, I assume you mean all p-nodes in the effective propagation
tree?  If so, the following looks okay.

> 	* A gets cloned into n copies and these copies (A1,...,An) are attached
> to corresponding points in B1,...,Bn.
> 	* k new p-nodes (q1,...,qk) are created
> 	* Ai is contained in qj <=> Bi is contained in qj
> 	* qi owns qj <=> pi owns pj
> 	* qi owns Aj <=> pi owns Bj
> 
> In other words, mount is propagated and propagation among the new vfsmounts
> mirrors the propagation between mountpoints.
> 
> 	3. bind
> 
> bind works almost identically to mount; new vfsmount is created for every
> place that gets propagation from mountpoint and propagation is set up to
> mirror that between the mountpoints.  However, there is a difference: unlike
> the case of mount, vfsmount we were going to attach (say it, A) has some
> history - it was created as a  copy of some pre-existing vfsmount V.  And
> that's where the things get interesting:
> 	* if V is contained in some p-node p, A is placed into the same
> p-node.  That may require merging one of the p-nodes we'd just created
> with p (that will be the counterpart of the p-node containing the mountpoint).
> 	* if V is owned by some p-node p, then A (or p-node containing A)
> becomes owned by p.

I don't follow this.  I still don't see the distinction between being
owned and being contained.  Also, for statements like 'A belongs to B',
which is it?

> 
> 	4. rbind
> rbind is recursive bind, so we just do binds for everything we had in
> a subtree we are binding in obvious order; everything is described
> by previous case.
> 
> 	5. umount
> umount everything that gets propagation from victim.
> 
> 	6. mount --move
> prohibited if what we are moving is in some p-node, otherwise we move
> as usual to intended mountpoint and create copies for everything that
> gets propagation from there (as we would do for rbind).
> 
> 	7. pivot_root
> similar to --move
> 
> 
> How to use all that stuff?
> 
> Example 1:
> 	mount --bind /floppy /floppy
> 	mount --make-shared /floppy
> 	mount --rbind / /jail
> 	<finish setting the jail up, umount whatever doesn't belong there,
> etc.>
> 	mount --make-slave /jail/floppy
> and we get /floppy in chroot jail slave to /floppy outside - if somebody
> (u)mounts stuff on it, that will get propagated to jail.
> 
> Example 2:
> 	same, but with the namespaces instead of chroots.
> 
> Example 3:
> 	same subtree visible (and kept in sync) in several places - just
> mark it shared and rbind; it will stay in sync
> 
> Example 4:
> 	have some daemon control the stuff in a subtree sharable with many
> namespaces, chroots, etc. without any magic:
> 	mark that subtree sharable
> 	clone with CLONE_NS
> 	parent marks that subtree slave
> 	child keeps working on the tree in its private namespace.
> 
> There's a lot more applications of the same idea, of course - AFS and its
> ilk, autofs-like stuff (with proper handling of MNT_EXPIRE and traps - see
> below), etc., etc.
> 
> 
> 
> 	Areas where we still have to figure things out:
> 
> * MNT_EXPIRE handling done right; there are some fun ideas in that area,
> but they still need to be done in more details (basically, lazy expire -
> mount in a slave expiring into a trap that would clone a copy from master
> when stepped upon).
> 
> * traps and their sharing.  What we want is an ability to use the master/slave
> mechanisms for *all* cross-namespace/cross-chroot issues in autofs, so that
> daemon would only need to work with the namespace of its own and no nothing
> about other instances.
> 
> * implementation ;-)  It certainly looks reasonably easy to do; memory
> demands are linear by number of vfsmounts involved and locking appears
> to be solvable.
> 
> * whatever issues that might come up from MVFS demands (and AFS, and...)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


- --
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE:  The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFB5wSpdQs4kOxk3/MRAvKoAJ9hpJhSZFSED6yLKvFL8VvwgZfJNwCZAe+x
Ibm55ty86r4EfPVd32OUkTw=
=V1jV
-----END PGP SIGNATURE-----

next prev parent reply	other threads:[~2005-01-13 23:43 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-01-13 22:18 [RFC] shared subtrees Al Viro
2005-01-13 23:30 ` Mike Waychison [this message]
2005-01-14  0:19   ` Al Viro
2005-01-14  1:11 ` Erez Zadok
2005-01-14  1:38   ` Al Viro
2005-01-16  0:46 ` J. Bruce Fields
2005-01-16  0:51   ` Al Viro
2005-01-16 16:02 ` J. Bruce Fields
2005-01-16 18:06   ` Al Viro
2005-01-16 18:42     ` J. Bruce Fields
2005-01-17  6:11       ` Al Viro
2005-01-17 17:32         ` J. Bruce Fields
2005-01-25 21:07           ` Ram
2005-01-25 21:47             ` Mike Waychison
2005-01-25 21:55               ` J. Bruce Fields
2005-01-25 23:56                 ` Mike Waychison
2005-01-25 22:02               ` Ram
2005-02-01 23:37                 ` J. Bruce Fields
2005-02-02  1:37                   ` J. Bruce Fields
2005-02-01 23:21             ` J. Bruce Fields
2005-02-02 18:36               ` Ram
2005-02-02 19:45                 ` Mike Waychison
2005-02-02 20:33                   ` Ram
2005-02-02 21:08                     ` Mike Waychison
2005-02-02 21:25                       ` J. Bruce Fields
2005-02-02 21:33                         ` Mike Waychison
2005-02-02 21:48                           ` J. Bruce Fields
2005-04-05  9:37         ` Ram
2005-01-17 18:31 ` Mike Waychison
2005-01-17 19:00   ` J. Bruce Fields
2005-01-17 19:30     ` Mike Waychison
2005-01-17 19:32       ` J. Bruce Fields
2005-01-17 20:11         ` Mike Waychison
2005-01-17 20:39           ` Al Viro
2005-01-18 19:44             ` Mike Waychison
2005-01-17 21:21           ` J. Bruce Fields
2005-01-28 22:31 ` Mike Waychison
2005-01-29  4:40   ` raven
2005-01-31 17:19     ` Mike Waychison
2005-02-01  1:31       ` Ian Kent
2005-02-01  2:28   ` Ram
2005-02-01  7:02     ` Mike Waychison
2005-02-01 19:27       ` Ram
2005-02-01 21:15         ` Mike Waychison
2005-02-01 23:33           ` Ram
2005-02-02  2:10           ` J. Bruce Fields

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=41E704AA.9000305@sun.com \
    --to=michael.waychison@sun.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@parcelfarce.linux.theplanet.co.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.