[RFC PATCH 1/4] Union mount documentation.

From: Bharata B Rao <bharata@linux.vnet.ibm.com>
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Jan Blunck <j.blunck@tu-harburg.de>
Subject: [RFC PATCH 1/4] Union mount documentation.
Date: Wed, 20 Jun 2007 11:21:57 +0530
Message-ID: <20070620055157.GC4267@in.ibm.com>
In-Reply-To: <20070620055050.GB4267@in.ibm.com>

From: Bharata B Rao <bharata@linux.vnet.ibm.com>
Subject: Union mount documentation.

Adds union mount documentation.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 Documentation/union-mounts.txt |  232 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 232 insertions(+)

--- /dev/null
+++ b/Documentation/union-mounts.txt
@@ -0,0 +1,232 @@
+VFS BASED UNION MOUNT
+=====================
+
+1. What is Union Mount?
+2. Recap
+3. The new approach
+4. Union stack: building and traversal
+5. Union stack: destroying
+6. Directory listing
+7. What works and what doesn't?
+8. Usage
+9. References
+
+1. What is Union Mount?
+-----------------------
+Union mount allows mounting of two or more filesystems transparently on
+a single mount point. Contents (files or directories) of all the
+filesystems become visible at the mount point after a union mount. If
+there are files with the same name in multiple layers of the union, only
+the topmost files remain visible. Contents of directories with the same
+name are merged in turn to present a unified view at the subdirectory level.
+
+In this approach of filesystem namespace unification, the layering or
+stacking information of different components (filesystems) of the union
+mount is maintained at the VFS layer. Hence, this is referred to as a VFS
+based union mount.
+
+2. Recap
+--------
+Jan Blunck developed a version of VFS based union mount in 2003-2004.
+This version was cleaned up and ported to later kernels. Early in 2007,
+two iterations of these patches were posted for review (Ref 1, Ref 2).
+However, this approach had a few shortcomings:
+
+- It wasn't designed to work with shared subtree additions to mount.
+- It didn't work well when the same filesystem was mounted from different
+  namespaces, as it maintained the union stack at the dentry level.
+- It made dget() sleep.
+- The union stack was built using dentries and this was too fragile. This
+  made the code complex and the locking ugly.
+
+3. The new approach
+-------------------
+In this new approach, the way the union stack is built and traversed has been
+changed. Instead of dentry-to-dentry links forming the stack between
+different layers, we now have (vfsmount, dentry) pairs as the building
+blocks of the union stack. Since this (vfsmount, dentry) combination is
+unique across all namespaces, we should be able to maintain the union stack
+sanely even if the filesystem is union mounted privately in different
+namespaces or if it appears under different mounts due to various types
+of bind mounts.
+
+4. Union stack: building and traversal
+--------------------------------------
+The union stack needs to be built from two places: during an explicit union
+mount (or mount propagation) and during the lookup of a directory that
+appears in more than one layer of the union.
+
+The link between two layers of the union stack is maintained using the
+union_mount structure:
+
+struct union_mount {
+	/* vfsmount and dentry of this layer */
+	struct vfsmount *src_mnt;
+	struct dentry *src_dentry;
+
+	/* vfsmount and dentry of the next lower layer */
+	struct vfsmount *dst_mnt;
+	struct dentry *dst_dentry;
+
+	/*
+	 * This list_head hashes this union_mount based on this layer's
+	 * vfsmount and dentry. This is used to get to the next layer of
+	 * the stack (dst_mnt, dst_dentry) given the (src_mnt, src_dentry)
+	 * and is used for stack traversal.
+	 */
+	struct list_head hash;
+
+	/*
+	 * All union_mounts under a vfsmount(src_mnt) are linked together
+	 * at mnt->mnt_union using this list_head. This is needed to destroy
+	 * all the union_mounts when the mnt goes away.
+	 */
+	struct list_head list;
+};
+
+These union mount structures are stored in a hash table (union_mount_hashtable)
+which uses the same hash function as mount_hashtable, since both of them use
+(vfsmount, dentry) pairs to calculate the hash.
+
+During a new mount (or mount propagation), a new union_mount structure is
+created. A reference to the mountpoint's vfsmount and dentry is taken and
+stored in the union_mount (as dst_mnt, dst_dentry). This union_mount is
+then inserted into the union_mount_hashtable based on the hash generated
+from the mount root's vfsmount and dentry.
+
+A similar method is used to create a union stack during the first lookup
+of a common named directory within a union mount point. Here, however, the
+top level directory's vfsmount and dentry are hashed to get to the lower
+level directory's vfsmount and dentry.
+
+The insertion, deletion and lookup of union_mounts in the
+union_mount_hashtable are protected by vfsmount_lock. While traversing the
+stack, we hold this spinlock only briefly during lookup time and release
+it as soon as we get the next union stack member. The top level of the
+stack holds a reference to the next level (via the union_mount structure)
+and so on. Therefore, as long as we hold a reference to a union stack
+member, its lower layers can't go away. Since we don't do the complete
+traversal under any lock, it is possible for the stack to change above the
+level from where we started traversing. For example, while we traverse the
+stack downwards, a new filesystem can be mounted on top of it. When this
+happens, a user who holds a reference to the old top does not see the new
+top and continues as if the new top didn't exist.
+I believe this is fine as long as members of the stack don't go away from
+under us (CHECK). To be sure of this, we need to hold a reference to the
+level from where we start the traversal and continue to hold it till we
+are done with the traversal.
+
+5. Union stack: destroying
+--------------------------
+In addition to storing the union_mounts in a hash table for quick lookups,
+they are also stored as a list, headed at vfsmount->mnt_union. So, all
+union_mounts that occur under a vfsmount (starting from the mountpoint
+followed by the subdir unions) are stored within the vfsmount. During
+umount (specifically, during the last mntput()), this list is traversed
+to destroy all union stacks under this vfsmount.
+
+Hence, all union stacks under a vfsmount continue to exist until the
+vfsmount is unmounted. Note that the union_mount structure also holds a
+reference to the current dentry. Because of this, for subdir unions,
+both the top and bottom level dentries remain pinned until the upper
+layer filesystem is unmounted. Is this behaviour acceptable? Would this
+lead to a lot of pinned dentries over a period of time? (CHECK) If we
+don't do this, the top layer dentry might go out of the cache, at which
+point we have no means to release the corresponding union_mount, and
+the union_mount becomes stale. Would it be necessary and worthwhile to
+add intelligence to prune_dcache() to prune unused union_mounts, thereby
+releasing the dentries?
+
+As noted above, we hold a reference to the current dentry from union_mount
+but don't get a reference to the corresponding vfsmount. We depend on
+the user of the union stack to hold a reference to the topmost vfsmount
+until the stack traversal is done. Not holding a reference to the top
+vfsmount from within union_mount allows us to free all the union_mounts
+from the last mntput of the top vfsmount. Is this approach acceptable?
+
+NOTE: union_mount structures are part of two lists: the hash list for
+quick lookups and a linked list to aid the freeing of these structures
+during unmount.
+
+6. Directory listing
+--------------------
+The merged view of directories is obtained by reading the directory
+entries of all the layers (starting from topmost) and merging the result.
+To aid this, the directory entries are stored in a cache as and when they
+are read and the newly read entries are compared against this for duplicate
+elimination before being passed to user space. This cache is a simple linked
+list at the moment.
+
+If getdents() returns to user space before completely reading the directory,
+the state at which it left reading the union mounted directory is stored
+in the rdstate structure.
+
+struct rdstate {
+	/* vfsmount and dentry of the directory from which we were reading */
+	struct vfsmount *mnt;
+	struct dentry *dentry;
+
+	/* the file offset of directory file at which we stopped reading */
+	loff_t off;
+
+	/* cache of directory entries */
+	struct list_head dirent_cache;
+};
+
+A pointer to this structure is stored in the file structure for the topmost
+directory and initialized during the first readdir()/getdents() of this
+directory. This readdir state information is destroyed during the last
+fput() of the file. For every subsequent readdir()/getdents(), the file
+offset of the directory identified by rdstate->{mnt, dentry} is set to
+rdstate->off before continuing with readdir()/getdents() on that
+directory.
+
+Since readdir()/getdents() is issued on the topmost directory for union
+mounted directories, it is possible for the file->f_pos of the topmost
+directory to reach its end while we are still reading the contents of
+the stacked bottom directories. So, file->f_pos is not clearly defined
+for union mounted directories. Because of this, lseek doesn't work on
+them as it does on other directories. If this approach of directory
+listing is acceptable, we need to fix the meaning of file offset for
+union mounted directories and accordingly get lseek to behave sanely.
+
+7. What works and what doesn't?
+-------------------------------
+These work:
+	- mount/umount :)
+	- A simple case of union mount propagation to slave and shared
+	  mounts.
+	- /bin/ls on a union mounted directory.
+
+These don't:
+	- lseek on union mounted directory.
+
+Not tried:
+	- move mounts
+	- pivot_root
+	- Other cases of bind mounts, specifically recursive binds.
+	- etc :(
+
+Not yet implemented:
+	- copyup and whiteout features. So, as of now we can only
+	  do a union mount and directory listing on it. Other operations,
+	  specifically writes to a lower layer file, are not supported.
+
+8. Usage
+--------
+To union mount a device /dev/sda1 on a mount point /mnt, we do this:
+
+# mount --union /dev/sda1 /mnt
+
+This results in the union mount getting created at /mnt which will contain
+the merged view of /mnt's original content and the contents of /dev/sda1.
+
+The mount(8) command from util-linux has to be modified to support the
+--union option.
+
+9. References
+-------------
+1. http://lkml.org/lkml/2007/4/17/150 - First post of original union mount.
+2. http://lkml.org/lkml/2007/5/14/69 - Next (v1) post of original union mount.
+
+- June 2007
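
For readers who want a concrete picture of sections 4 and 6 above, the
following is a minimal user-space sketch. It is not code from this patch
series. It models the union_mount hash lookup keyed on (vfsmount, dentry)
and a merged directory listing that walks the stack from the top and drops
duplicate names. The stand-in vfsmount and dentry types, the union_hash(),
union_insert(), union_lookup() and union_list() helpers, the table size and
the singly linked hash chain are all assumptions made for illustration; the
kernel code uses list_head, takes vfsmount_lock around the table, and holds
references, none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel objects; only identity matters. */
struct vfsmount { const char *name; };
struct dentry {
	const char *name;
	const char **entries;	/* NULL-terminated directory contents */
};

struct union_mount {
	struct vfsmount *src_mnt;	/* this layer */
	struct dentry *src_dentry;
	struct vfsmount *dst_mnt;	/* next lower layer */
	struct dentry *dst_dentry;
	struct union_mount *next;	/* hash chain (assumption: singly linked) */
};

#define UNION_HASH_SIZE 16
static struct union_mount *union_hashtable[UNION_HASH_SIZE];

/* Illustrative hash over the (vfsmount, dentry) pair; not the kernel's. */
static unsigned int union_hash(struct vfsmount *mnt, struct dentry *dentry)
{
	uintptr_t h = (uintptr_t)mnt / 16 + (uintptr_t)dentry / 16;

	return (unsigned int)(h % UNION_HASH_SIZE);
}

/* Hash a union_mount by its upper layer, as done at mount or lookup time. */
static void union_insert(struct union_mount *um)
{
	unsigned int h;

	assert(um != NULL);
	h = union_hash(um->src_mnt, um->src_dentry);
	um->next = union_hashtable[h];
	union_hashtable[h] = um;
}

/* Find the link from (mnt, dentry) to its next lower layer, if any. */
static struct union_mount *union_lookup(struct vfsmount *mnt,
					struct dentry *dentry)
{
	struct union_mount *um = union_hashtable[union_hash(mnt, dentry)];

	for (; um; um = um->next)
		if (um->src_mnt == mnt && um->src_dentry == dentry)
			return um;
	return NULL;
}

/*
 * Walk the stack downwards from (mnt, dentry) and collect each name once.
 * The out[] array doubles as the duplicate-elimination cache, so an entry
 * in an upper layer hides a same-named entry below it. Returns the number
 * of names stored (at most max).
 */
static size_t union_list(struct vfsmount *mnt, struct dentry *dentry,
			 const char **out, size_t max)
{
	size_t n = 0;

	while (dentry) {
		struct union_mount *um;

		for (const char **e = dentry->entries; *e; e++) {
			size_t i;

			for (i = 0; i < n; i++)
				if (strcmp(out[i], *e) == 0)
					break;
			if (i == n && n < max)
				out[n++] = *e;
		}
		um = union_lookup(mnt, dentry);
		if (!um)
			break;
		mnt = um->dst_mnt;
		dentry = um->dst_dentry;
	}
	return n;
}
```

To try it, a caller would fill in one union_mount per pair of adjacent
layers, pass each to union_insert(), and then call union_list() on the
topmost (vfsmount, dentry). As in the document, the linear scan over the
cache is the simplest possible choice and would need replacing for large
directories.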
