From: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
To: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Subject: [RFC] shared subtrees
Date: Thu, 13 Jan 2005 22:18:51 +0000 [thread overview]
Message-ID: <20050113221851.GI26051@parcelfarce.linux.theplanet.co.uk> (raw)
[apologies for delay - there'd been lots of unrelated crap lately]
======================================================================
NOTE: as far as I'm concerned, that's a beginning of VFS-2.7 branch.
All that work will stay in a separate tree, with gradual merge back
into 2.6 once the things start settling down.
======================================================================
OK, here comes the first draft of proposed semantics for subtree
sharing. What we want is being able to propagate events between
the parts of mount trees. Below is a description of what I think
might be a workable semantics; it does *NOT* describe the data
structures I would consider final and there are considerable
areas where we still need to figure out the right behaviour.
Let's start with introducing a notion of propagation node; I consider
it only as a convenient way to describe the desired behaviour - it
almost certainly won't be a data structure in the final variant.
1) each p-node corresponds to a group of 1 or more vfsmounts.
2) there is at most 1 p-node containing a given vfsmount.
3) each p-node owns a possibly empty set of p-nodes and vfsmounts
4) no p-node or vfsmount can be owned by more than one p-node
5) only vfsmounts that are not contained in any p-nodes might be owned.
6) no p-node can own (directly or via intermediates) itself (i.e. the
graph of p-node ownership is a forest).
These guys define propagation:
a) if vfsmounts A and B are contained in the same p-node, events
propagate from A to B
b) if vfsmount A is contained in p-node p, vfsmount B is contained
in p-node q and p owns q, events propagate from A to B
c) if vfsmount A is contained in p-node p and vfsmount B is owned
by p, events propagate from A to B
d) propagation is transitive: if events propagate from A to B and
from B to C, they propagate from A to C.
In other words, members of the same p-node are equivalent and events anywhere
in p-node are propagated to all its slaves. Note that not any transitive
relation can be represented that way; it has to satisfy the following
condition:
* A->C and B->C => A->B or B->A
All propagation setups we are going to deal with will satisfy that condition.
How do we set them up?
* we can mark a subtree sharable. Every vfsmount in the subtree
that is not already in some p-node gets a single-element p-node of its
own.
* we can mark a subtree slave. That removes all vfsmounts in
the subtree from their p-nodes and makes them owned by said p-nodes.
p-nodes that became empty will disappear and everything they used to
own will be repossessed by their owners (if any).
* we can mark a subtree private. Same as above, but followed
by taking all vfsmounts in our subtree and making them *not* owned
by anybody.
Of course, namespace operations (clone, mount, etc.) affect that structure
and are affected by it (that's what it's for, after all).
1. CLONE_NS
That one is simple - we copy vfsmounts as usual
* if vfsmount A is contained in p-node p, then copy of A goes into
the same p-node
* if A is owned by p, then copy of A is also owned by p
* no new p-nodes are created.
2. mount
We have a new vfsmount A and want to attach it to mountpoint somewhere in
vfsmount B. If B does not belong to any p-node, everything is as usual; A
doesn't become a member or slave of any p-node and is simply attached to B.
If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events
propagated from B and all p-nodes p1,...,pk that contain them.
* A gets cloned into n copies and these copies (A1,...,An) are attached
to corresponding points in B1,...,Bn.
* k new p-nodes (q1,...,qk) are created
* Ai is contained in qj <=> Bi is contained in qj
* qi owns qj <=> pi owns pj
* qi owns Aj <=> pi owns Bj
In other words, mount is propagated and propagation among the new vfsmounts
mirrors the propagation between mountpoints.
3. bind
bind works almost identically to mount; new vfsmount is created for every
place that gets propagation from mountpoint and propagation is set up to
mirror that between the mountpoints. However, there is a difference: unlike
the case of mount, vfsmount we were going to attach (say it, A) has some
history - it was created as a copy of some pre-existing vfsmount V. And
that's where the things get interesting:
* if V is contained in some p-node p, A is placed into the same
p-node. That may require merging one of the p-nodes we'd just created
with p (that will be the counterpart of the p-node containing the mountpoint).
* if V is owned by some p-node p, then A (or p-node containing A)
becomes owned by p.
4. rbind
rbind is recursive bind, so we just do binds for everything we had in
a subtree we are binding in obvious order; everything is described
by previous case.
5. umount
umount everything that gets propagation from victim.
6. mount --move
prohibited if what we are moving is in some p-node, otherwise we move
as usual to intended mountpoint and create copies for everything that
gets propagation from there (as we would do for rbind).
7. pivot_root
similar to --move
How to use all that stuff?
Example 1:
mount --bind /floppy /floppy
mount --make-shared /floppy
mount --rbind / /jail
<finish setting the jail up, umount whatever doesn't belong there,
etc.>
mount --make-slave /jail/floppy
and we get /floppy in chroot jail slave to /floppy outside - if somebody
(u)mounts stuff on it, that will get propagated to jail.
Example 2:
same, but with the namespaces instead of chroots.
Example 3:
same subtree visible (and kept in sync) in several places - just
mark it shared and rbind; it will stay in sync
Example 4:
have some daemon control the stuff in a subtree sharable with many
namespaces, chroots, etc. without any magic:
mark that subtree sharable
clone with CLONE_NS
parent marks that subtree slave
child keeps working on the tree in its private namespace.
There's a lot more applications of the same idea, of course - AFS and its
ilk, autofs-like stuff (with proper handling of MNT_EXPIRE and traps - see
below), etc., etc.
Areas where we still have to figure things out:
* MNT_EXPIRE handling done right; there are some fun ideas in that area,
but they still need to be done in more details (basically, lazy expire -
mount in a slave expiring into a trap that would clone a copy from master
when stepped upon).
* traps and their sharing. What we want is an ability to use the master/slave
mechanisms for *all* cross-namespace/cross-chroot issues in autofs, so that
daemon would only need to work with the namespace of its own and no nothing
about other instances.
* implementation ;-) It certainly looks reasonably easy to do; memory
demands are linear by number of vfsmounts involved and locking appears
to be solvable.
* whatever issues that might come up from MVFS demands (and AFS, and...)
next reply other threads:[~2005-01-13 22:18 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-01-13 22:18 Al Viro [this message]
2005-01-13 23:30 ` [RFC] shared subtrees Mike Waychison
2005-01-14 0:19 ` Al Viro
2005-01-14 1:11 ` Erez Zadok
2005-01-14 1:38 ` Al Viro
2005-01-16 0:46 ` J. Bruce Fields
2005-01-16 0:51 ` Al Viro
2005-01-16 16:02 ` J. Bruce Fields
2005-01-16 18:06 ` Al Viro
2005-01-16 18:42 ` J. Bruce Fields
2005-01-17 6:11 ` Al Viro
2005-01-17 17:32 ` J. Bruce Fields
2005-01-25 21:07 ` Ram
2005-01-25 21:47 ` Mike Waychison
2005-01-25 21:55 ` J. Bruce Fields
2005-01-25 23:56 ` Mike Waychison
2005-01-25 22:02 ` Ram
2005-02-01 23:37 ` J. Bruce Fields
2005-02-02 1:37 ` J. Bruce Fields
2005-02-01 23:21 ` J. Bruce Fields
2005-02-02 18:36 ` Ram
2005-02-02 19:45 ` Mike Waychison
2005-02-02 20:33 ` Ram
2005-02-02 21:08 ` Mike Waychison
2005-02-02 21:25 ` J. Bruce Fields
2005-02-02 21:33 ` Mike Waychison
2005-02-02 21:48 ` J. Bruce Fields
2005-04-05 9:37 ` Ram
2005-01-17 18:31 ` Mike Waychison
2005-01-17 19:00 ` J. Bruce Fields
2005-01-17 19:30 ` Mike Waychison
2005-01-17 19:32 ` J. Bruce Fields
2005-01-17 20:11 ` Mike Waychison
2005-01-17 20:39 ` Al Viro
2005-01-18 19:44 ` Mike Waychison
2005-01-17 21:21 ` J. Bruce Fields
2005-01-28 22:31 ` Mike Waychison
2005-01-29 4:40 ` raven
2005-01-31 17:19 ` Mike Waychison
2005-02-01 1:31 ` Ian Kent
2005-02-01 2:28 ` Ram
2005-02-01 7:02 ` Mike Waychison
2005-02-01 19:27 ` Ram
2005-02-01 21:15 ` Mike Waychison
2005-02-01 23:33 ` Ram
2005-02-02 2:10 ` J. Bruce Fields
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20050113221851.GI26051@parcelfarce.linux.theplanet.co.uk \
--to=viro@parcelfarce.linux.theplanet.co.uk \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).