[RFC][PATCH 0/15] VFS based Union Mount


* [RFC][PATCH  0/15] VFS based Union Mount
@ 2007-04-17 13:14 Bharata B Rao
  2007-04-17 13:16 ` [RFC][PATCH 1/15] Add union mount documentation Bharata B Rao
                   ` (15 more replies)
  0 siblings, 16 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

Hi,

Here is an attempt at a VFS based union mount implementation.
Union mount provides the filesystem namespace unification feature.
Unlike traditional mounts, which hide the contents of the mount point,
a union mount presents a merged view of the mount point and the
mounted filesystem.

These patches were originally developed for 2.6.11 by Jan Blunck and
lately we have been working together in taking this forward. The current
patchset applies against 2.6.21-rc6-mm1.

The code is in a highly experimental stage at the moment. We are posting
it now to get some initial feedback on the design and on how this
should be taken forward.

You can find more details about union mount in the documentation
included in the patchset.

Kindly review and let us know your comments.

Regards,
Bharata.


* [RFC][PATCH  1/15] Add union mount documentation
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
@ 2007-04-17 13:16 ` Bharata B Rao
  2007-04-17 13:17 ` [RFC][PATCH 2/15] Add a new mount flag (MNT_UNION) for union mount Bharata B Rao
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Bharata B Rao <bharata@linux.vnet.ibm.com>
Subject: Add union mount documentation.

This is an attempt to document some of the implementation details
and issues of union mount.

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 Documentation/union-mounts.txt |  489 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 489 insertions(+)

--- /dev/null
+++ b/Documentation/union-mounts.txt
@@ -0,0 +1,489 @@
+VFS BASED UNION MOUNT
+=====================
+
+1. Overview
+2. Union stack
+3. Lookup
+4. Readdir
+5. Copyup
+6. Whiteout
+	6.1. Creation and deletion
+	6.2. Whiteout filetype support
+	6.3. Directory renaming
+7. Usage
+8. State of the code
+9. Extracted mail comments
+
+1. Overview
+-----------
+Union mount allows mounting of two or more filesystems transparently on
+a single mount point. The contents (files or directories) of all the
+filesystems become visible at the mount point after a union mount. If
+there are files of the same name in multiple layers, only the topmost files
+remain visible in a union mount. However, directories with a common name are
+(currently) union-ed again to present a unified view at the subdir level.
+
+In this approach of unioning filesystems, the layering information of
+the different components of the union mount is maintained at the VFS layer.
+Hence we call this a VFS based union mount.
+
+2. Union stack
+--------------
+Union stack reflects the stacking of two or more filesystems of the
+union mount. The stacking or the layering information is maintained
+as part of dentry structures of the mountpoint and mount root.
+
+The union stack information in the dentry structure looks like this:
+
+struct dentry {
+	...
+
+#ifdef CONFIG_UNION_MOUNT
+	struct dentry *d_overlaid;	/* overlaid directory */
+	struct dentry *d_topmost;	/* topmost directory */
+	struct union_info *d_union;	/* union stack info */
+#endif
+	...
+};
+
+struct union_info {
+	struct mutex u_mutex;
+	atomic_t u_count;
+};
+
+There is one union_info shared by all dentries which are part of
+a union, and its u_count member holds the number of references to the union
+stack. When this count reaches zero, the union stack ceases to exist and
+the union_info is freed.
+
+The union stack is essentially a singly linked list of the dentries of the
+union, with d_topmost pointing to the head of the list and d_overlaid pointing
+to the next member of the stack. Walking the union stack is guarded by
+the u_mutex member.
+
+dget() references every dentry of the overlaid union stack to make sure
+that no dentry of the stack is discarded from memory while others are
+still in use. Since walking of union stack is protected by a mutex,
+dget() can now sleep.
+
+dput() also walks the union stack and releases references to all the
+dentries that are part of the union. If a dentry's reference count
+in a union stack reaches zero, it implies that the dentries above it
+in the stack must also be unused and the union stack can be safely
+destroyed at this point.
+
+Since dget() can sleep with union mount, many callers of dget() have to be
+fixed to release any spinlocks they are holding before the union lock (mutex)
+is acquired and to re-acquire them afterwards.
+
+3. Lookup
+---------
+With union mount, it becomes necessary to lookup pathnames not only
+in the topmost filesystem but also in the underlying filesystems.
+
+In case of looking up a filename, the lookup routines as a rule return
+the match from the topmost layer. However if the file is not found
+in the topmost layer, the lookup routines have been modified to
+find the file in the underlying filesystems of the union stack.
+
+When looking up a directory under a union mount point, the lookup
+code has been modified to build a union stack (if necessary).
+
+When looking up a name in a union directory, it is necessary to
+guarantee that the returned union stack remains valid. Hence
+concurrent lookups are prevented by obtaining the mutex lock during
+lookups.
+
+4. Readdir
+----------
+The core functionality of union mount, viz., the merged view of
+multiple directories is provided by the readdir()/getdents() routines.
+This is achieved by reading the contents of every directory of the union
+stack and by merging the result.
+
+The directory entries are read starting from the top layer and they
+are maintained in a cache. Subsequently when the entries from the bottom layers
+of the union stack are read they are checked for duplicates (in the cache)
+before being passed out to the user space. There can be multiple calls
+to readdir/getdents routines for reading the entries of a single directory.
+But the union directory cache is not maintained across these calls. Instead,
+for every call, the previously read entries are re-read into the cache
+and newly read entries are compared against these for duplicates before
+they are returned to user space. We are aware that this is not
+an ideal solution for merging the directory entries: this approach
+involves setting up the cache for every getdents() call, re-reading some
+of the entries into the cache and destroying the cache at the end
+of the call.
+
+But there is an even bigger problem. Since readdir() on the union directory
+returns contents of all the underlying directories, it is possible
+that the file position exceeds the inode size of the first directory.
+Therefore the file position is rearranged to select the correct directory
+in the union stack. This is done by subtracting the inode size if the
+file position exceeds it and selecting the next member of the union stack.
+
+This works well with filesystems like ext2/3 that use flat file directories.
+The directory entry offsets are arranged linearly and are always smaller than
+the inode size of the directory. Modern filesystems have implemented
+directories differently and just return special cookies as directory entry
+offsets which are unrelated to the position in the directory or the inode
+size. So the current approach of directory merging works only for
+filesystems like ext2 and ext3.
+
+5. Copyup
+---------
+In this implementation of union mount, only the files residing in
+the topmost layer are writable. With this restriction, when a file residing
+in a bottom layer is opened for writing, it is copied up to the topmost layer
+and the write is allowed there. The copyup is done by first creating the
+file in the topmost layer and then copying the contents of the file.
+
+If it becomes necessary to create a directory structure in the top layer
+while copying up a file, then it is done so.
+
+Every time a file is opened for writing, a check is made to see
+whether the file belongs to a union and, if so, whether it resides in a
+bottom layer of the union stack. Only then is the copyup operation performed.
+VFS routines are used directly to create the file in the topmost layer.
+However, to copy the contents of the file from within the kernel, splice
+routines are used.
+
+6. Whiteout
+-----------
+A whiteout file is a placeholder for a file that does not exist from a
+logical point of view. VFS returns -ENOENT for any reference to whiteouts.
+
+Typically whiteouts are created in the topmost layer when a file in
+the lower layer is deleted. The whiteout essentially masks out the file
+in the lower layer.
+
+6.1. Creation and deletion
+
+With union mount, a top layer whiteout is created in the following scenarios:
+- A file/directory which resides only in the bottom layer is removed.
+- A file/directory which resides in both the layers is removed.
+
+The VFS calls like unlink(), rename() and rmdir() have been modified to create
+a whiteout automatically when the above situation occurs.
+
+A whiteout is automatically deleted whenever a new file or directory
+with a corresponding name is created. This happens in calls like
+create(), mknod(), symlink(), link() and mkdir().
+
+There is a special case in mkdir(). When a whiteout is replaced by a
+directory, the directory is marked opaque (by using the new S_OPAQUE inode
+flag). Lookup does not descend into the lower directories if a directory
+is marked opaque. This is needed in the following scenario:
+
+# rm -rf dir/
+# mkdir dir
+
+The newly created dir/ has to be marked opaque, otherwise the contents
+of the union stack would become visible again, and one does not expect to
+find a non-empty directory immediately after its creation.
+
+6.2. Whiteout filetype support
+
+Creation or deletion of whiteouts is a persistent operation and hence it
+needs support from the underlying filesystem.
+
+Linux already defines DT_WHT (include/linux/fs.h) as the whiteout directory
+entry (file)type. In addition we need to define the whiteout filetype,
+for which we make use of an unused value in the filetype bitmask and
+define S_IFWHT (include/linux/stat.h).
+
+Filesystems which support the whiteout filetype should set the FS_WHT
+flag (include/linux/fs.h) in .fs_flags of their file_system_type structure.
+
+Additionally they have to implement the whiteout inode operation.
+
+int (*whiteout)(struct inode *dir, struct dentry *dentry);
+
+where 'dentry' is the negative dentry to be masked out under the parent 'dir'.
+
+In the current implementation, there is an inode for every whiteout in the
+filesystem. But since a whiteout doesn't have any usable attribute apart
+from its name (the name of the whiteout file is stored as a directory entry
+in the parent directory), it is an ideal candidate for being replaced by
+a singleton object. We have plans to explore this option at a later point
+in time.
+
+In ext2 and ext3 filesystems, whiteout is introduced as an incompatible
+feature and only readonly mounts are allowed without whiteout support.
+tune2fs(8) from e2fsprogs has been modified to add whiteout support to
+ext2/3.
+
+6.3. Directory renaming
+<TODO>
+
+7. Usage
+--------
+The way to union mount filesystems on two devices /dev/sda1 and /dev/sda2,
+on a mountpoint union/ is like this:
+
+- Mount the first filesystem normally and this becomes the lower layer
+of the union stack.
+# mount /dev/sda1 union/
+
+- Mount the second filesystem as a union on top of first
+# mount --union /dev/sda2 union/
+
+The mount(8) command from util-linux needs to be modified to make it
+interpret the --union option.
+
+After this the union/ will have the merged contents of /dev/sda1
+and /dev/sda2.
+
+8. State of the code
+--------------------
+The entire code is in a highly experimental stage at present.
+
+There are a number of (un)known issues/shortcomings:
+
+- Unstable, might crash any time. Hasn't undergone any decent levels
+  of testing.
+- We are touching some fastpaths in the lookup code and introducing the
+  latency of obtaining a mutex in dget() (only for union mount cases).
+  We haven't yet benchmarked this to check the (adverse) effects.
+- Known to union mount correctly only two filesystems. Not tried with more.
+- Unioning of subdirectories within a union mount is working, but is buggy.
+- Whiteout support in ext3 is not thoroughly analyzed/tested for correctness.
+- The side effects of union mount changes on other subsystems
+  (eg cpuset, aio, dnotify, inotify etc which are touched by union
+  mount changes) haven't been tested yet.
+- bind/move vs union mount not yet handled.
+- Readdir has issues as noted above.
+- Some lockdep warnings need to be addressed still.
+- In general some code cleanliness issues are yet to be handled.
+
+9. Extracted mail comments
+--------------------------
+
+These are some of the extracts from an old linux-fsdevel post.
+
+----
+Andries Brouwer wrote:
+>
+> On "union mounts".
+> We must first have a theory on what "union mount" means.
+> Union is a commutative operator, but here there is no symmetry
+> at all, so "union" is a misnomer. There is an order.
+>
+> One might consider partial orders, so that one obtains a tree of mounts,
+> but I do not know any applications, and there is the problem of naming.
+> So, for simplicity, maybe there is a linear order.
+>
+> Things happen in the top one. All others are read-only.
+>
+
+Yes, that is correct. This is natural, since the stacking of vfsmount objects
+has worked like this before.
+
+----
+
+Alexander Viro wrote:
+>
+> > Does not same thing apply also for common subdirectories?
+>
+> Not. union-mount != unionfs, it does not descend into subdirectories.
+> There is no way in hell to do that and permit sharing the union-mount
+> components between several mountpoints. unionfs is very different animal
+> and there the main point is that you are getting real, honest
+> copy-on-write, i.e. if you have foo/bar/baz on underlying filesystem than
+> any attempt to access foo will create a shadowing directory in the upper
+> layer, any attempt to access foo/bar will do the same for foo/bar and
+> attempt to write into the foo/bar/baz will lead to copying the thing into
+> the upper layer and changing it there. _Very_ useful when you have a
+> read-only fs and want to run make on it, for one thing - everything
+> new/modified gets into the covering layer, along with the accessed part of
+> directory tree. Very nice, but completely different - there are things
+> impossible for one and doable on another.
+>
+
+----
+
+Werner Almesberger wrote:
+>
+> Hmm, now I'm throughly confused :-( What is the "union" in here then ?
+> Is it that a lookup for a top-level component searches all file system
+> in that list, or does it simply mean that all the file systems are
+> internally linked to the same place, but only one of them is truly
+> visible ?
+>
+> E.g., given
+>
+> # mount /dev/a /mnt
+> # mkdir -p /mnt/foo/blah /mnt/bar
+> # umount /dev/a
+> # mount /dev/b /mnt
+> # mkdir -p /mnt/foo/zulu /mnt/baz
+> # mount -o union /dev/a /mnt
+>
+> # cd /mnt/foo/blah              works ?
+> # cd /mnt/foo/zulu              works too ? (no, I guess)
+> # cd /mnt/baz                   works ?
+> # cd /mnt/bar                   works too ?
+> # cd /mnt; touch file           works ? on which device is the file created ?
+> # cd /mnt/foo; touch file	  works ?
+> # cd /mnt/foo/blah; touch file  works ?
+> # cd /mnt/foo/zulu; touch file  works too ? (no, I guess)
+>
+
+# cd /mnt/foo/blah              works !
+# cd /mnt/foo/zulu              works !
+# cd /mnt/baz                   works !
+# cd /mnt/bar                   works !
+# cd /mnt; touch file           file created on /dev/a
+# cd /mnt/foo; touch file	file created on /dev/a
+# cd /mnt/foo/blah; touch file  file created on /dev/a
+# cd /mnt/foo/zulu; touch file  zulu copied to /dev/a and file created on it
+
+----
+
+Alexander Viro wrote:
+>
+> A) suppose we have a bunch of filesystems union-mounted on /foo/bar. We do
+>    chdir("/foo/bar"), what should become busy? Variants:
+>    mountpoint, first element, last element, all of them.
+> B) after the action in (A) we add another filesystem to the set. Again, what
+>    should happen to the busy/not busy status of the components?
+> C) we start with the normal mount and union-mount something else.
+>    Question: what is the desired result (almost definitely the set of old
+>    and new mounted stuff) and who should become busy?
+> D) In the cases above, what do we want to get from stat(2)?
+> E) What do we want to do if we do normal mount atop of the union-mount?
+>    Variants: try to replace, return -EBUSY. Doing replace (i.e. if
+>    everything can be umounted - do it and mount the new fs in place of the
+>    union) is attractive - we probably might treat the normal mount same way,
+>    which kills the "I've clicked in my point'n'drool krapplication ten times
+>    and it mounted CD ten times, waaaaaah" bug reports.
+>    Disadvantage: may need small fixes to mount(8) (basically, "if we already
+>    have mtab entry for this mountpoint and mount succeeds - discard the old
+>    one").
+>
+
+I don't regard the union mount as a set of mounts, because we also need a
+strict order to remove duplicate filenames from the directory
+listing. Therefore, after union mounting a filesystem, the mount point's
+filesystem is busy. A chdir() to the mount point makes the last mounted
+filesystem busy, since a lookup returns the root directory of the topmost
+filesystem.
+
+----
+
+Alexander Viro wrote:
+> >
+> > >     A) suppose we have a bunch of filesystems union-mounted on
+> > > /foo/bar. We do chdir("/foo/bar"), what should become busy? Variants:
+> > > mountpoint, first element, last element, all of them.
+> >
+> > I believe that all of them. Or, we can make alternative and mark
+> > none of them busy (together with Tigran yet-to-write force unmount) -
+> > if there is reason why cwd should make filesystem busy at all...
+>
+> Ouch. "All" means that we can't, e.g expire elements of union.
+>
+
+
+----
+
+Andries Brouwer wrote:
+>
+> > 	A) suppose we have a bunch of filesystems union-mounted on
+> > /foo/bar. We do chdir("/foo/bar"), what should become busy? Variants:
+> > mountpoint, first element, last element, all of them.
+>
+> Last element.
+>
+> > 	B) after the action in (A) we add another filesystem to the set.
+> > Again, what should happen to the busy/not busy status of the components?
+>
+> Previous top one has now become busy. All other were busy already.
+>
+> > 	C) we start with the normal mount and union-mount something else.
+> > Question: what is the desired result (almost definitely the set of old and
+> > new mounted stuff) and who should become busy?
+>
+> First element now is busy.
+>
+> > 	D) In the cases above, what do we want to get from stat(2)?
+>
+> stat(2)  on this directory looks at the top one
+>
+> > 	E) What do we want to do if we do normal mount atop of the
+> > union-mount? Variants: try to replace,
+>
+> No. Very strange semantics for a mount.
+>
+> > return -EBUSY.
+>
+> Yes, quite reasonable. But I would prefer the third: just succeed.
+> We have a file hierarchy, and do a mount - well, we already know what that
+>  means, and we just do it.
+>
+> [I would prefer to return -EBUSY only when the same filesystem was already
+> mounted (in the same way) on the same mount point.]
+>
+
+
+----
+
+Neil Brown wrote:
+>
+> A "mount" is an ordered list (pile) of directories.
+> One of these elements is the "mountpoint", and it is particularly
+> distiguished because ".." from the "mount" goes through ".." of the
+> "mountpoint".    ".." of all other directories is not accessable.
+>
+> Each directory in the pile has two flags (well, three if you count
+> IS_MOUNTPOINT):
+>
+>   IS_WRITABLE: You can create things in here.
+>   IS_VISIBLE: You can see inside this.
+>
+> Thus, a traditional mount has two directories in the pile.
+> The bottom one IS_MOUNTPOINT
+> The top one IS_WRITABLE|IS_VISIBLE
+>
+> With mount -o union, you can set what ever flags you like, though
+> having IS_WRITABLE and not IS_VISIBLE would be a problem.
+> However you can only have one IS_MOUNTPOINT directory.
+>
+> Now the rules:
+>
+> 1/ on "lookup", you do a lookup in each IS_VISIBLE directory from the
+>     top down until you find a match or you hit the bottom.
+>
+> 2/ If you decide to create something (*) then it goes in the uppermost
+>    IS_WRITABLE directory.
+>
+> 3/ "stat" (of ".") sees the IS_MOUNTPOINT directory if it IS_VISIBLE,
+>    otherwise the lowest IS_VISIBLE directory.
+>    Possibly n_links could be fiddled, but I don't know how important
+>    that is.
+>
+> 4/ The "mount" keeps only the IS_MOUNTPOINT directory busy.
+>
+> 5/ An open or cd to the mount makes the directory which "stat" sees
+>    busy.
+>
+> 6/ A mount is not allowed if it would change 'the directory which
+>    "stat" sees', and that directory is "busy".
+>
+> (*) It is unclear to me when creation should be allowed.
+>    If I say "mkdir fred", and fred does not exist in or above the
+>    uppermost IS_WRITABLE directory, but does exist is a lower
+>    IS_VISIBLE directory, should the create succeed or fail?
+>    Would that same be true for
+>      open("fred", O_CREAT)  which is "create if it doesn't exist"
+>    or open("fred", O_CREAT|O_EXCL) which is "create and it mustn't exist".
+>
+
+For the complete thread refer to:
+http://marc.theaimsgroup.com/?l=linux-fsdevel&m=96035682927821&w=2
+
+---
+- Bharata B Rao <bharata@linux.vnet.ibm.com>
+- Jan Blunck <j.blunck@tu-harburg.de>
+
+April 2007
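
The duplicate suppression described in section 4 (Readdir) of the
documentation above can be illustrated in user space. This is only a
minimal sketch and not the patchset's code: struct name_cache,
cache_contains(), add_layer() and merge_layers() are hypothetical names,
the cache is a fixed-size array searched linearly, and whiteouts and
opaque directories are ignored.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 64	/* arbitrary bound for this sketch */

/* Names handed out so far; stands in for the per-call readdir cache. */
struct name_cache {
	const char *names[MAX_NAMES];
	size_t count;
};

/* Return 1 if name is already in the cache, 0 otherwise. */
int cache_contains(const struct name_cache *c, const char *name)
{
	for (size_t i = 0; i < c->count; i++)
		if (strcmp(c->names[i], name) == 0)
			return 1;
	return 0;
}

/* Append the entries of one layer, skipping names seen in a higher layer. */
void add_layer(struct name_cache *c, const char *const *names, size_t n)
{
	for (size_t i = 0; i < n && c->count < MAX_NAMES; i++)
		if (!cache_contains(c, names[i]))
			c->names[c->count++] = names[i];
}

/* Merge top and bottom; returns the number of names in the merged view. */
size_t merge_layers(const char *const *top, size_t ntop,
		    const char *const *bottom, size_t nbottom,
		    struct name_cache *out)
{
	out->count = 0;
	add_layer(out, top, ntop);	/* topmost entries win */
	add_layer(out, bottom, nbottom);
	return out->count;
}
```

With a top layer holding "a" and "b" and a bottom layer holding "b" and
"c", the merged view holds three names and the "b" that is reported is
the one from the top layer.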

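The file position handling from section 4 (subtract the inode size of a
directory and move on to the next member of the union stack) can be
written down as a small worked example. This is a sketch under
assumptions: select_layer() is a hypothetical helper, the directory sizes
are passed in as a plain array with index 0 being the topmost layer, and
a position equal to the size is treated as belonging to the next layer
because entry offsets are said to be always smaller than the inode size.

```c
#include <stddef.h>

/*
 * Pick the layer that a merged file position falls into.
 * sizes[i] is the inode size of the directory in layer i (0 = topmost).
 * On return *pos holds the offset inside the selected layer.
 * Returns the layer index, or nlayers if the position lies past the
 * end of the last directory.
 */
size_t select_layer(const long long *sizes, size_t nlayers, long long *pos)
{
	size_t layer = 0;

	while (layer < nlayers && *pos >= sizes[layer]) {
		*pos -= sizes[layer];	/* skip this directory entirely */
		layer++;
	}
	return layer;
}
```

For two directories of 4096 and 8192 bytes, a merged position of 5000
selects the second directory at offset 904. The same arithmetic makes no
sense for filesystems that return opaque cookies as offsets, which is
the limitation the documentation points out.
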

* [RFC][PATCH  2/15] Add a new mount flag (MNT_UNION) for union mount
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
  2007-04-17 13:16 ` [RFC][PATCH 1/15] Add union mount documentation Bharata B Rao
@ 2007-04-17 13:17 ` Bharata B Rao
  2007-04-17 13:17 ` [RFC][PATCH 3/15] Add the whiteout file type Bharata B Rao
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Add a new mount flag (MNT_UNION) for union mount.

Introduce MNT_UNION, MS_UNION and FS_WHT flags. These are the flags needed
for doing

    mount /dev/hda3 /mnt -o union

You need additional patches for util-linux for that to work.
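
Without a patched mount(8), the flag can also be passed to mount(2)
directly. The following is a minimal sketch, assuming a kernel with this
patchset applied: union_flags() and union_mount() are hypothetical helper
names, and the value 256 for MS_UNION is taken from this patch. On a
kernel without the patchset that bit is either unassigned or may mean
something else, so this must not be run there.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Value taken from this patch; not provided by mainline headers. */
#ifndef MS_UNION
#define MS_UNION 256
#endif

/* Add the union flag to a set of mount(2) flags. */
unsigned long union_flags(unsigned long flags)
{
	return flags | MS_UNION;
}

/*
 * Union-mount dev (of filesystem type fstype) on top of dir.
 * Returns 0 on success and -1 on failure.
 */
int union_mount(const char *dev, const char *dir, const char *fstype)
{
	if (mount(dev, dir, fstype, union_flags(0), NULL) != 0) {
		perror("mount");
		return -1;
	}
	return 0;
}
```

A call such as union_mount("/dev/hda3", "/mnt", "ext2") would then
correspond to the mount command shown above; the filesystem type is an
assumption of this example.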

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/namespace.c        |   13 ++++++++++++-
 include/linux/fs.h    |    2 ++
 include/linux/mount.h |    1 +
 3 files changed, 15 insertions(+), 1 deletion(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -376,6 +376,7 @@ static int show_vfsmnt(struct seq_file *
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_UNION, ",union" },
 		{ 0, NULL }
 	};
 	struct proc_fs_info *fs_infop;
@@ -1127,6 +1128,14 @@ int do_add_mount(struct vfsmount *newmnt
 	if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
 		goto unlock;
 
+	/* Unions couldn't be writable if the filesystem
+	 * doesn't know about whiteouts */
+	err = -ENOTSUPP;
+	if ((mnt_flags & MNT_UNION) &&
+	    !(newmnt->mnt_sb->s_flags & MS_RDONLY) &&
+	    !(newmnt->mnt_sb->s_type->fs_flags & FS_WHT))
+		goto unlock;
+
 	newmnt->mnt_flags = mnt_flags;
 	if ((err = graft_tree(newmnt, nd)))
 		goto unlock;
@@ -1430,9 +1439,11 @@ long do_mount(char *dev_name, char *dir_
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_RELATIME)
 		mnt_flags |= MNT_RELATIME;
+	if (flags & MS_UNION)
+		mnt_flags |= MNT_UNION;
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
+		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_UNION);
 
 	/* ... and get the mountpoint */
 	retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -93,6 +93,7 @@ extern int dir_notify_enable;
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
+#define FS_WHT		8
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move()
 					 * during rename() internally.
@@ -109,6 +110,7 @@ extern int dir_notify_enable;
 #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
 #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+#define MS_UNION	256	/* Union mount */
 #define MS_NOATIME	1024	/* Do not update access times. */
 #define MS_NODIRATIME	2048	/* Do not update directory access times */
 #define MS_BIND		4096
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -34,6 +34,7 @@ struct mnt_namespace;
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
 #define MNT_PNODE_MASK	0x3000	/* propogation flag mask */
+#define MNT_UNION	0x4000	/* if the vfsmount is a union mount */
 
 struct vfsmount {
 	struct list_head mnt_hash;

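The check added to do_add_mount() above can be restated as a standalone
predicate, which makes the three conditions easier to see. This is a
user-space sketch, not kernel code: union_mount_allowed() is a
hypothetical name, the flag values are copied from this patch and from
the headers it touches, MS_RDONLY is 1 as in include/linux/fs.h, and
ENOTSUPP is defined locally with the kernel-internal value 524 because
user-space errno.h does not provide it.

```c
/* Flag values as used by this patch. */
#define MNT_UNION	0x4000	/* vfsmount is a union mount */
#define MS_RDONLY	1	/* superblock is mounted read-only */
#define FS_WHT		8	/* filesystem type supports whiteouts */

#ifndef ENOTSUPP
#define ENOTSUPP	524	/* kernel-internal error code */
#endif

/*
 * Returns 0 if the mount may proceed, or -ENOTSUPP if a writable union
 * is requested on a filesystem type without whiteout support.
 */
int union_mount_allowed(int mnt_flags, unsigned long sb_flags, int fs_flags)
{
	if ((mnt_flags & MNT_UNION) &&
	    !(sb_flags & MS_RDONLY) &&
	    !(fs_flags & FS_WHT))
		return -ENOTSUPP;
	return 0;
}
```

Only the combination "union, writable, no whiteout support" is refused;
a read-only union on such a filesystem and any non-union mount pass.
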

* [RFC][PATCH  3/15] Add the whiteout file type
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
  2007-04-17 13:16 ` [RFC][PATCH 1/15] Add union mount documentation Bharata B Rao
  2007-04-17 13:17 ` [RFC][PATCH 2/15] Add a new mount flag (MNT_UNION) for union mount Bharata B Rao
@ 2007-04-17 13:17 ` Bharata B Rao
  2007-04-17 13:18 ` [RFC][PATCH 4/15] Add config options for union mount Bharata B Rao
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Add the whiteout file type

A white-out stops the VFS from further lookups of the white-out's name and
makes it return -ENOENT. This is the same behaviour as if the filename had
not been found. This can be used in combination with union mounts to virtually
delete (white-out) files by creating a file with this file type.

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 include/linux/stat.h |    2 ++
 1 files changed, 2 insertions(+)

--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -10,6 +10,7 @@
 #if defined(__KERNEL__) || !defined(__GLIBC__) || (__GLIBC__ < 2)
 
 #define S_IFMT  00170000
+#define S_IFWHT  0160000	/* whiteout */
 #define S_IFSOCK 0140000
 #define S_IFLNK	 0120000
 #define S_IFREG  0100000
@@ -28,6 +29,7 @@
 #define S_ISBLK(m)	(((m) & S_IFMT) == S_IFBLK)
 #define S_ISFIFO(m)	(((m) & S_IFMT) == S_IFIFO)
 #define S_ISSOCK(m)	(((m) & S_IFMT) == S_IFSOCK)
+#define S_ISWHT(m)	(((m) & S_IFMT) == S_IFWHT)
 
 #define S_IRWXU 00700
 #define S_IRUSR 00400

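The new file type test can be tried out in user space. This is a small
sketch assuming the values from the hunk above: the macros are defined
locally (guarded, since libc headers may already provide S_IFMT) and
is_whiteout() is a hypothetical wrapper that only exists for this
example.

```c
/* Values from include/linux/stat.h as changed by this patch. */
#ifndef S_IFMT
#define S_IFMT	00170000
#endif
#ifndef S_IFWHT
#define S_IFWHT	0160000		/* whiteout */
#endif
#ifndef S_ISWHT
#define S_ISWHT(m)	(((m) & S_IFMT) == S_IFWHT)
#endif

/* Return 1 if mode describes a whiteout, 0 for any other file type. */
int is_whiteout(unsigned int mode)
{
	return S_ISWHT(mode) ? 1 : 0;
}
```

The permission bits do not influence the result, and the existing types
S_IFSOCK (0140000) and S_IFREG (0100000) are not mistaken for a whiteout.
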

* [RFC][PATCH  4/15] Add config options for union mount
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (2 preceding siblings ...)
  2007-04-17 13:17 ` [RFC][PATCH 3/15] Add the whiteout file type Bharata B Rao
@ 2007-04-17 13:18 ` Bharata B Rao
  2007-04-17 13:19 ` [RFC][PATCH 5/15] Introduce union stack Bharata B Rao
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Add config options for union mount

Introduces two new config options for union mount:

CONFIG_UNION_MOUNT - Enables union mount
CONFIG_UNION_MOUNT_DEBUG - Enables debugging support for union mount.

Also adds debugging routines.

FIXME: this needs some work. printk'ing isn't the right method for getting
good debugging output.

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/Kconfig                  |   16 +++++++++
 include/linux/union_debug.h |   76 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -551,6 +551,22 @@ config INOTIFY_USER
 
 	  If unsure, say Y.
 
+config UNION_MOUNT
+       bool "Union mount support (EXPERIMENTAL)"
+       depends on EXPERIMENTAL
+       ---help---
+         If you say Y here, you will be able to mount file systems as
+         union mount stacks. This is a VFS based implementation and
+         should work with all file systems. If unsure, say N.
+
+config UNION_MOUNT_DEBUG
+       bool "Union mount debugging output"
+       depends on UNION_MOUNT
+       ---help---
+         If you say Y here, the union mount debugging code will be
+         compiled in. You have to activate the appropriate UNION_MOUNT_DEBUG
+         flags in <file:include/linux/union.h>, too.
+
 config QUOTA
 	bool "Quota support"
 	help
--- /dev/null
+++ b/include/linux/union_debug.h
@@ -0,0 +1,76 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_DEBUG_H
+#define __LINUX_UNION_DEBUG_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_UNION_MOUNT_DEBUG
+
+#include <linux/sched.h>
+
+#ifndef UNION_MOUNT_DEBUG
+#define UNION_MOUNT_DEBUG 0
+#endif	/* UNION_MOUNT_DEBUG */
+#ifndef UNION_MOUNT_DEBUG_DCACHE
+#define UNION_MOUNT_DEBUG_DCACHE 0
+#endif	/* UNION_MOUNT_DEBUG_DCACHE */
+#ifndef UNION_MOUNT_DEBUG_LOCK
+#define UNION_MOUNT_DEBUG_LOCK 0
+#endif	/* UNION_MOUNT_DEBUG_LOCK */
+#ifndef UNION_MOUNT_DEBUG_READDIR
+#define UNION_MOUNT_DEBUG_READDIR 0
+#endif	/* UNION_MOUNT_DEBUG_READDIR */
+
+/*
+ * The really excessive debugging output is triggered by
+ * the user id (7777) which is accessing the union stack
+ */
+#define UM_DEBUG(fmt, args...)						\
+do {									\
+	if (UNION_MOUNT_DEBUG)						\
+		printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_UID(fmt, args...)					\
+do {									\
+	if (UNION_MOUNT_DEBUG && (current->uid == 7777))		\
+		printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_DCACHE(fmt, args...)					\
+do {									\
+	if (UNION_MOUNT_DEBUG_DCACHE && (current->uid == 7777))		\
+		printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_LOCK(fmt, args...)					\
+do {									\
+	if (UNION_MOUNT_DEBUG_LOCK && (current->uid == 7777))		\
+		printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+#define UM_DEBUG_READDIR(fmt, args...)					\
+do {									\
+	if (UNION_MOUNT_DEBUG_READDIR && (current->uid == 7777))	\
+		printk(KERN_DEBUG "%s: " fmt, __FUNCTION__, ## args);	\
+} while (0)
+
+#else	/* CONFIG_UNION_MOUNT_DEBUG */
+
+#define UM_DEBUG(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_UID(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_DCACHE(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_LOCK(fmt, args...) do { /* empty */ } while (0)
+#define UM_DEBUG_READDIR(fmt, args...) do { /* empty */ } while (0)
+
+#endif	/* CONFIG_UNION_MOUNT_DEBUG */
+
+#endif	/* __KERNEL__ */
+#endif	/*  __LINUX_UNION_DEBUG_H */


* [RFC][PATCH  5/15] Introduce union stack
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (3 preceding siblings ...)
  2007-04-17 13:18 ` [RFC][PATCH 4/15] Add config options for union mount Bharata B Rao
@ 2007-04-17 13:19 ` Bharata B Rao
  2007-04-17 22:08   ` Serge E. Hallyn
  2007-04-17 13:20 ` [RFC][PATCH 6/15] Union-mount dentry reference counting Bharata B Rao
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Introduce union stack.

Adds union stack infrastructure to the dentry structure and provides
locking routines to walk the union stack.
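
As a rough illustration of the union_info lifecycle described above, here
is a userspace sketch, not kernel code. It assumes C11 atomics and a
pthread mutex as stand-ins for atomic_t and struct mutex; the names uinfo,
uinfo_alloc(), uinfo_get() and uinfo_put() are hypothetical. Like
union_alloc() in this patch, the allocator hands the object back already
locked with a count of one. Unlike union_put() in this patch, which
decrements the counter and then reads it again, the sketch folds the
decrement and the zero test into one atomic operation; that is a choice
made for the sketch, not something the patch does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace stand-in for struct union_info (hypothetical names). */
struct uinfo {
	atomic_int count;	/* plays the role of u_count */
	pthread_mutex_t mutex;	/* plays the role of u_mutex */
};

/* Like union_alloc(): returns the object locked, holding one reference. */
static struct uinfo *uinfo_alloc(void)
{
	struct uinfo *info = malloc(sizeof(*info));

	if (!info)
		return NULL;
	pthread_mutex_init(&info->mutex, NULL);
	pthread_mutex_lock(&info->mutex);
	atomic_init(&info->count, 1);
	return info;
}

/* Take one more reference; the caller must already hold one. */
static struct uinfo *uinfo_get(struct uinfo *info)
{
	atomic_fetch_add(&info->count, 1);
	return info;
}

/*
 * Drop one reference. atomic_fetch_sub() returns the previous value, so
 * exactly one caller sees 1 and frees the object. The mutex must be
 * unlocked before the final put. Returns 1 if the object was freed.
 */
static int uinfo_put(struct uinfo *info)
{
	if (atomic_fetch_sub(&info->count, 1) == 1) {
		pthread_mutex_destroy(&info->mutex);
		free(info);
		return 1;
	}
	return 0;
}
```

The caller drops the mutex before its final put, which is the order
union_release() follows in this patch (mutex_unlock() and then
union_put()).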

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/Makefile                  |    2 
 fs/dcache.c                  |    5 
 fs/union.c                   |   53 +++++++++
 include/linux/dcache.h       |   11 +
 include/linux/dcache_union.h |  243 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 314 insertions(+)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.
 obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
 
+obj-$(CONFIG_UNION_MOUNT)	+= union.o
+
 obj-$(CONFIG_QUOTA)		+= dquot.o
 obj-$(CONFIG_QFMT_V1)		+= quota_v1.o
 obj-$(CONFIG_QFMT_V2)		+= quota_v2.o
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -936,6 +936,11 @@ struct dentry *d_alloc(struct dentry * p
 #ifdef CONFIG_PROFILING
 	dentry->d_cookie = NULL;
 #endif
+#ifdef CONFIG_UNION_MOUNT
+	dentry->d_overlaid = NULL;
+	dentry->d_topmost = NULL;
+	dentry->d_union = NULL;
+#endif
 	INIT_HLIST_NODE(&dentry->d_hash);
 	INIT_LIST_HEAD(&dentry->d_lru);
 	INIT_LIST_HEAD(&dentry->d_subdirs);
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,53 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/fs.h>
+
+struct union_info * union_alloc(void)
+{
+	struct union_info *info;
+
+	info = kmalloc(sizeof(*info), GFP_ATOMIC);
+	if (!info)
+		return NULL;
+
+	mutex_init(&info->u_mutex);
+	mutex_lock(&info->u_mutex);
+	atomic_set(&info->u_count, 1);
+	UM_DEBUG_LOCK("allocate union %p\n", info);
+	return info;
+}
+
+struct union_info * union_get(struct union_info *info)
+{
+	BUG_ON(!info);
+	BUG_ON(!atomic_read(&info->u_count));
+	atomic_inc(&info->u_count);
+	UM_DEBUG_LOCK("get union %p (count=%d)\n", info,
+		      atomic_read(&info->u_count));
+	return info;
+}
+
+void union_put(struct union_info *info)
+{
+	BUG_ON(!info);
+	UM_DEBUG_LOCK("put union %p (count=%d)\n", info,
+		      atomic_read(&info->u_count));
+	atomic_dec(&info->u_count);
+
+	if (!atomic_read(&info->u_count)) {
+		UM_DEBUG_LOCK("free union %p\n", info);
+		kfree(info);
+	}
+
+	return;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -93,6 +93,12 @@ struct dentry {
 	struct dentry *d_parent;	/* parent directory */
 	struct qstr d_name;
 
+#ifdef CONFIG_UNION_MOUNT
+	struct dentry *d_overlaid;	/* overlaid directory */
+	struct dentry *d_topmost;	/* topmost directory */
+	struct union_info *d_union;	/* union directory info */
+#endif
+
 	struct list_head d_lru;		/* LRU list */
 	/*
 	 * d_child and d_rcu can share memory
@@ -325,6 +331,11 @@ static inline struct dentry *dget(struct
 	return dentry;
 }
 
+/*
+ * Reference counting for union mounts
+ */
+#include <linux/dcache_union.h>
+
 extern struct dentry * dget_locked(struct dentry *);
 
 /**
--- /dev/null
+++ b/include/linux/dcache_union.h
@@ -0,0 +1,243 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_DCACHE_UNION_H
+#define __LINUX_DCACHE_UNION_H
+#ifdef __KERNEL__
+
+#include <linux/union_debug.h>
+#include <linux/fs_struct.h>
+#include <asm/atomic.h>
+#include <asm/semaphore.h>
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * This is the union info object, that describes general information about this
+ * union directory
+ *
+ * u_mutex protects the union stack against modification. You can reach it
+ * through the d_union field in struct dentry. Hold it when you are walking
+ * or modifying the union stack!
+ *
+ * NOTE: Read the remark for union_trylock() below!
+ */
+struct union_info {
+	atomic_t u_count;
+	struct mutex u_mutex;
+};
+
+/* allocate/de-allocate */
+extern struct union_info *union_alloc(void);
+extern struct union_info *union_get(struct union_info *);
+extern void union_put(struct union_info *);
+
+/*
+ * These are the functions for locking a dentry's union. When you
+ * want to acquire a dentry's union lock, use:
+ *
+ * - union_lock() when you can sleep,
+ * - union_lock_spinlock() when you are holding a spinlock (that
+ *   you CAN safely give up and reacquire)
+ * - union_lock_readlock() when you are holding a readlock (that
+ *   you CAN safely give up and reacquire)
+ *
+ * Otherwise get the union lock early before you enter your
+ * "no sleeping here" code.
+ */
+static inline void __union_lock(struct union_info *uinfo)
+{
+	BUG_ON(!atomic_read(&uinfo->u_count));
+	mutex_lock(&uinfo->u_mutex);
+}
+
+static inline void union_lock(struct dentry *dentry)
+{
+	if (unlikely(dentry && dentry->d_union)) {
+		struct union_info *ui = dentry->d_union;
+
+		UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
+			      dentry->d_name.name, ui,
+			      atomic_read(&ui->u_count));
+		__union_lock(dentry->d_union);
+	}
+}
+
+static inline void __union_unlock(struct union_info *uinfo)
+{
+	BUG_ON(!atomic_read(&uinfo->u_count));
+	mutex_unlock(&uinfo->u_mutex);
+}
+
+static inline void union_unlock(struct dentry *dentry)
+{
+	if (unlikely(dentry && dentry->d_union)) {
+		struct union_info *ui = dentry->d_union;
+
+		UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
+			      dentry->d_name.name, ui,
+			      atomic_read(&ui->u_count));
+		__union_unlock(dentry->d_union);
+	}
+}
+
+/*
+ * Two helpers for namespace.c
+ *
+ * FIXME: clean this up to get it right
+ */
+static inline struct union_info *union_alloc2(struct dentry * dentry)
+{
+	struct union_info *uinfo;
+
+	spin_lock(&dentry->d_lock);
+	if (!dentry->d_union) {
+		dentry->d_union = union_alloc();
+		uinfo = union_get(dentry->d_union);
+		spin_unlock(&dentry->d_lock);
+	} else {
+		uinfo = union_get(dentry->d_union);
+		spin_unlock(&dentry->d_lock);
+		union_lock(dentry);
+	}
+
+	return uinfo;
+}
+
+static inline struct union_info *union_get2(struct dentry * dentry)
+{
+	struct union_info *uinfo;
+
+	union_lock(dentry);
+	uinfo = union_get(dentry->d_union);
+	return uinfo;
+}
+
+static inline void union_release(struct union_info *uinfo)
+{
+	if (!uinfo)
+		return;
+
+	mutex_unlock(&uinfo->u_mutex);
+	union_put(uinfo);
+}
+
+/*
+ * Immediately return ZERO if the lock is contended, NON-ZERO if it's acquired.
+ */
+static inline int union_trylock(struct dentry *dentry)
+{
+	int locked = 1;
+
+	if (unlikely(dentry && dentry->d_union)) {
+		UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
+			      dentry->d_name.name, dentry->d_union,
+			      atomic_read(&dentry->d_union->u_count));
+		BUG_ON(!atomic_read(&dentry->d_union->u_count));
+		locked = mutex_trylock(&dentry->d_union->u_mutex);
+		UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
+			      dentry->d_union,
+			      locked ? "succeeded" : "failed");
+	}
+	return (locked ? 1 : 0);
+}
+
+/*
+ * The following functions are locking helpers to guarantee the locking order
+ * in some situations.
+ */
+
+static inline void union_lock_spinlock(struct dentry *dentry, spinlock_t *lock)
+{
+	while (!union_trylock(dentry)) {
+		spin_unlock(lock);
+		cpu_relax();
+		spin_lock(lock);
+	}
+}
+
+static inline void union_lock_readlock(struct dentry *dentry, rwlock_t *lock)
+{
+	while (!union_trylock(dentry)) {
+		read_unlock(lock);
+		cpu_relax();
+		read_lock(lock);
+	}
+}
+
+/*
+ * This is a *I can't get no sleep* helper which is called when we try
+ * to access the struct fs_struct *fs field of a struct task_struct.
+ *
+ * Yes, this is possibly starving but we have to change root, altroot
+ * or pwd in the frequency of this while loop. Don't think that this
+ * happens really often ;)
+ *
+ * This is called while holding the rwlock_t fs->lock
+ *
+ * TODO: Unlocking side of union_lock_fs() needs 3 union_unlock()s.
+ * Maybe introduce union_unlock_fs().
+ *
+ * FIXME: This routine is used when the caller wants to dget one or
+ * more of fs->[root, altroot, pwd]. When the caller doesn't want to
+ * dget _all_ of these, it is strictly not necessary to get union_locks
+ * on all of these. Check.
+ */
+static inline void union_lock_fs(struct fs_struct *fs)
+{
+	int locked;
+
+	while (fs) {
+		locked = union_trylock(fs->root);
+		if (!locked)
+			goto loop1;
+		locked = union_trylock(fs->altroot);
+		if (!locked)
+			goto loop2;
+		locked = union_trylock(fs->pwd);
+		if (!locked)
+			goto loop3;
+		break;
+	loop3:
+		union_unlock(fs->altroot);
+	loop2:
+		union_unlock(fs->root);
+	loop1:
+		read_unlock(&fs->lock);
+		UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
+		cpu_relax();
+		read_lock(&fs->lock);
+		continue;
+	}
+	BUG_ON(!fs);
+	return;
+}
+
+#define IS_UNION(dentry) ((dentry)->d_overlaid || (dentry)->d_topmost || \
+				(dentry)->d_overlaid)
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define union_lock(dentry) do { /* empty */ } while (0)
+#define union_trylock(dentry) ({ (1); })
+#define union_unlock(dentry) do { /* empty */ } while (0)
+#define union_lock_spinlock(dentry, lock) do { /* empty */ } while (0)
+#define union_lock_readlock(dentry, lock) do { /* empty */ } while (0)
+#define union_lock_fs(fs) do { /* empty */ } while (0)
+#define IS_UNION(dentry) ({ (0); })
+#define union_alloc2(x) ({ BUG(); (0); })
+#define union_get2(x) ({ BUG(); (0); })
+#define union_release(x) do { BUG(); } while (0)
+
+#endif	/* CONFIG_UNION_MOUNT */
+#endif	/* __KERNEL__ */
+#endif	/* __LINUX_DCACHE_UNION_H */
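
A note on the all-or-nothing acquisition in union_lock_fs() above: the
pattern can be sketched in userspace as follows. This is only an
illustration under stated assumptions. pthread mutexes stand in for both
the union mutexes and fs->lock (which is an rwlock held for reading in the
patch), sched_yield() stands in for cpu_relax(), and the helper names
trylock_all() and lock_all_with_backoff() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/*
 * Try to take all n mutexes. On the first failure, release the ones
 * already taken (in reverse order) and return 0. Return 1 when all n
 * are held.
 */
static int trylock_all(pthread_mutex_t **locks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (pthread_mutex_trylock(locks[i]) != 0) {
			while (i-- > 0)
				pthread_mutex_unlock(locks[i]);
			return 0;
		}
	}
	return 1;
}

/*
 * Called with `outer` held. Never blocks on any of `locks` while `outer`
 * is held: on contention it drops `outer`, yields, and starts over.
 */
static void lock_all_with_backoff(pthread_mutex_t **locks, size_t n,
				  pthread_mutex_t *outer)
{
	while (!trylock_all(locks, n)) {
		pthread_mutex_unlock(outer);
		sched_yield();
		pthread_mutex_lock(outer);
	}
}
```

The sketch assumes the mutexes in the array are distinct. If two entries
refer to the same mutex, the second trylock fails every time and the loop
does not terminate. Whether that can happen for fs->root, fs->altroot and
fs->pwd in the patch is not examined here.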


* [RFC][PATCH  6/15] Union-mount dentry reference counting
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (4 preceding siblings ...)
  2007-04-17 13:19 ` [RFC][PATCH 5/15] Introduce union stack Bharata B Rao
@ 2007-04-17 13:20 ` Bharata B Rao
  2007-04-17 13:20 ` [RFC][PATCH 7/15] Union-mount mounting Bharata B Rao
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Union-mount dentry reference counting

dget() is modified to walk the union stack, taking a reference on every
dentry that is part of the union stack. This is necessary to ensure that
parts of the union stack don't go away from under us. Since dget() takes a
mutex for walking the stack, dget() can now sleep.

dput also walks the union stack and releases references to all the
dentries that are part of the union.
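
The reference walk can be pictured with a small userspace sketch. The
struct layer type and the stack_get()/stack_put() helpers are hypothetical
and single-threaded. In the patch the equivalent walk runs with the union
lock held and goes through __dget_single() and __dput_single(), and
__dput_union() additionally unlinks dentries whose count is about to reach
zero, which the sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* One layer of a union stack (hypothetical userspace stand-in). */
struct layer {
	int count;		/* stands in for d_count */
	struct layer *overlaid;	/* next lower layer, like d_overlaid */
};

/* Take a reference on `top` and on every layer below it. */
static struct layer *stack_get(struct layer *top)
{
	for (struct layer *l = top; l; l = l->overlaid)
		l->count++;
	return top;
}

/* Drop the references taken by stack_get(), walking down the stack. */
static void stack_put(struct layer *top)
{
	struct layer *next;

	for (struct layer *l = top; l; l = next) {
		/* Read the link first: a real put may release the layer. */
		next = l->overlaid;
		l->count--;
	}
}
```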

Since dget() can now sleep, make sure that dget() doesn't go to sleep with
any spinlocks held while it tries to get the mutex.
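
The "never sleep with a spinlock held" rule is what union_lock_spinlock()
and union_lock_readlock() from patch 5/15 provide. A minimal userspace
sketch of that back-off loop follows, assuming pthread mutexes for both
locks and sched_yield() in place of cpu_relax(); the helper name
lock_with_backoff() is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/*
 * Acquire `sleeping` while `outer` is held, without ever blocking on
 * `sleeping` with `outer` held. On contention, `outer` is dropped and
 * retaken. Returns the number of back-off rounds that were needed.
 */
static int lock_with_backoff(pthread_mutex_t *sleeping,
			     pthread_mutex_t *outer)
{
	int retries = 0;

	while (pthread_mutex_trylock(sleeping) != 0) {
		pthread_mutex_unlock(outer);
		sched_yield();
		pthread_mutex_lock(outer);
		retries++;
	}
	return retries;
}
```

Because the outer lock may have been released inside the loop, anything
the caller read under it before the call has to be read again afterwards.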

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/dcache.c                  |   35 ++++--
 fs/dnotify.c                 |    5 
 fs/inotify.c                 |    8 +
 fs/namei.c                   |   42 +++++--
 fs/namespace.c               |   12 +-
 fs/proc/base.c               |   17 ++
 fs/union.c                   |  249 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h       |   64 ++++++++++-
 include/linux/dcache_union.h |   65 +++++++++++
 kernel/auditsc.c             |    4 
 kernel/cpuset.c              |    4 
 kernel/fork.c                |   10 +
 net/unix/af_unix.c           |    5 
 13 files changed, 484 insertions(+), 36 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -171,8 +171,7 @@ static struct dentry *d_kill(struct dent
  *
  * no dcache lock, please.
  */
-
-void dput(struct dentry *dentry)
+void __dput_single(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
@@ -190,6 +189,13 @@ repeat:
 		return;
 	}
 
+	if (!__dput_single_destroy_union(dentry)) {
+		atomic_inc(&dentry->d_count);
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_lock);
+		goto repeat;
+	}
+
 	/*
 	 * AV: ->d_delete() is _NOT_ allowed to block now.
 	 */
@@ -285,6 +291,15 @@ int d_invalidate(struct dentry * dentry)
 
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
+	/*
+	 * TODO: We come here with dcache_lock held and can't
+	 * afford to sleep now to acquire the union_lock. We should
+	 * change all the callers to acquire union_lock first using
+	 * the union_lock_spinlock() helper. Until that is done,
+	 * BUG() here.
+	 */
+	BUG_ON(IS_UNION(dentry));
+
 	atomic_inc(&dentry->d_count);
 	if (!list_empty(&dentry->d_lru)) {
 		dentry_stat.nr_unused--;
@@ -392,7 +407,7 @@ static void prune_one_dentry(struct dent
 	__d_drop(dentry);
 	dentry = d_kill(dentry);
 	if (!prune_parents) {
-		dput(dentry);
+		__dput_single(dentry);
 		spin_lock(&dcache_lock);
 		return;
 	}
@@ -947,7 +962,7 @@ struct dentry *d_alloc(struct dentry * p
 	INIT_LIST_HEAD(&dentry->d_alias);
 
 	if (parent) {
-		dentry->d_parent = dget(parent);
+		dentry->d_parent = __dget_single(parent);
 		dentry->d_sb = parent->d_sb;
 	} else {
 		INIT_LIST_HEAD(&dentry->d_u.d_child);
@@ -1908,8 +1923,10 @@ char *d_path(struct dentry *dentry, stru
 		return dentry->d_op->d_dname(dentry, buf, buflen);
 
 	read_lock(&current->fs->lock);
+	union_lock_readlock(current->fs->root, &current->fs->lock);
 	rootmnt = mntget(current->fs->rootmnt);
-	root = dget(current->fs->root);
+	root = __dget(current->fs->root);
+	union_unlock(current->fs->root);
 	read_unlock(&current->fs->lock);
 	res = __d_path(dentry, vfsmnt, root, rootmnt, buf, buflen, 0);
 	dput(root);
@@ -1967,10 +1984,14 @@ asmlinkage long sys_getcwd(char __user *
 		return -ENOMEM;
 
 	read_lock(&current->fs->lock);
+	union_lock_fs(current->fs);
 	pwdmnt = mntget(current->fs->pwdmnt);
-	pwd = dget(current->fs->pwd);
+	pwd = __dget(current->fs->pwd);
 	rootmnt = mntget(current->fs->rootmnt);
-	root = dget(current->fs->root);
+	root = __dget(current->fs->root);
+	union_unlock(current->fs->pwd);
+	union_unlock(current->fs->altroot);
+	union_unlock(current->fs->root);
 	read_unlock(&current->fs->lock);
 
 	cwd = __d_path(pwd, pwdmnt, root, rootmnt, page, PAGE_SIZE, 1);
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -161,13 +161,16 @@ void dnotify_parent(struct dentry *dentr
 		return;
 
 	spin_lock(&dentry->d_lock);
+	union_lock_spinlock(dentry->d_parent, &dentry->d_lock);
 	parent = dentry->d_parent;
 	if (parent->d_inode->i_dnotify_mask & event) {
-		dget(parent);
+		__dget(parent);
+		union_unlock(parent);
 		spin_unlock(&dentry->d_lock);
 		__inode_dir_notify(parent->d_inode, event);
 		dput(parent);
 	} else {
+		union_unlock(parent);
 		spin_unlock(&dentry->d_lock);
 	}
 }
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -325,17 +325,21 @@ void inotify_dentry_parent_queue_event(s
 		return;
 
 	spin_lock(&dentry->d_lock);
+	union_lock_spinlock(dentry->d_parent, &dentry->d_lock);
 	parent = dentry->d_parent;
 	inode = parent->d_inode;
 
 	if (inotify_inode_watched(inode)) {
-		dget(parent);
+		__dget(parent);
+		union_unlock(parent);
 		spin_unlock(&dentry->d_lock);
 		inotify_inode_queue_event(inode, mask, cookie, name,
 					  dentry->d_inode);
 		dput(parent);
-	} else
+	} else {
+		union_unlock(parent);
 		spin_unlock(&dentry->d_lock);
+	}
 }
 EXPORT_SYMBOL_GPL(inotify_dentry_parent_queue_event);
 
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -522,16 +522,23 @@ walk_init_root(const char *name, struct 
 	struct fs_struct *fs = current->fs;
 
 	read_lock(&fs->lock);
+	union_lock_fs(fs);
 	if (fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
 		nd->mnt = mntget(fs->altrootmnt);
-		nd->dentry = dget(fs->altroot);
+		nd->dentry = __dget(fs->altroot);
+		union_unlock(fs->pwd);
+		union_unlock(fs->altroot);
+		union_unlock(fs->root);
 		read_unlock(&fs->lock);
 		if (__emul_lookup_dentry(name,nd))
 			return 0;
 		read_lock(&fs->lock);
 	}
 	nd->mnt = mntget(fs->rootmnt);
-	nd->dentry = dget(fs->root);
+	nd->dentry = __dget(fs->root);
+	union_unlock(fs->pwd);
+	union_unlock(fs->altroot);
+	union_unlock(fs->root);
 	read_unlock(&fs->lock);
 	return 1;
 }
@@ -654,13 +661,16 @@ int follow_up(struct vfsmount **mnt, str
 	struct vfsmount *parent;
 	struct dentry *mountpoint;
 	spin_lock(&vfsmount_lock);
+	union_lock_spinlock((*mnt)->mnt_mountpoint, &vfsmount_lock);
 	parent=(*mnt)->mnt_parent;
 	if (parent == *mnt) {
+		union_unlock((*mnt)->mnt_mountpoint);
 		spin_unlock(&vfsmount_lock);
 		return 0;
 	}
 	mntget(parent);
-	mountpoint=dget((*mnt)->mnt_mountpoint);
+	mountpoint=__dget((*mnt)->mnt_mountpoint);
+	union_unlock((*mnt)->mnt_mountpoint);
 	spin_unlock(&vfsmount_lock);
 	dput(*dentry);
 	*dentry = mountpoint;
@@ -736,21 +746,27 @@ static __always_inline void follow_dotdo
 		}
                 read_unlock(&fs->lock);
 		spin_lock(&dcache_lock);
+		union_lock_spinlock(nd->dentry->d_parent, &dcache_lock);
 		if (nd->dentry != nd->mnt->mnt_root) {
-			nd->dentry = dget(nd->dentry->d_parent);
+			nd->dentry = __dget(nd->dentry->d_parent);
+			union_unlock(nd->dentry->d_parent);
 			spin_unlock(&dcache_lock);
 			dput(old);
 			break;
 		}
+		union_unlock(nd->dentry->d_parent);
 		spin_unlock(&dcache_lock);
 		spin_lock(&vfsmount_lock);
+		union_lock_spinlock(nd->mnt->mnt_mountpoint, &vfsmount_lock);
 		parent = nd->mnt->mnt_parent;
 		if (parent == nd->mnt) {
+			union_unlock(nd->mnt->mnt_mountpoint);
 			spin_unlock(&vfsmount_lock);
 			break;
 		}
 		mntget(parent);
-		nd->dentry = dget(nd->mnt->mnt_mountpoint);
+		nd->dentry = __dget(nd->mnt->mnt_mountpoint);
+		union_unlock(nd->mnt->mnt_mountpoint);
 		spin_unlock(&vfsmount_lock);
 		dput(old);
 		mntput(nd->mnt);
@@ -1050,8 +1066,10 @@ static int __emul_lookup_dentry(const ch
 		 */
 		nd->last_type = LAST_ROOT;
 		read_lock(&fs->lock);
+		union_lock_readlock(fs->root, &fs->lock);
 		nd->mnt = mntget(fs->rootmnt);
-		nd->dentry = dget(fs->root);
+		nd->dentry = __dget(fs->root);
+		union_unlock(fs->root);
 		read_unlock(&fs->lock);
 		if (path_walk(name, nd) == 0) {
 			if (nd->dentry->d_inode) {
@@ -1114,20 +1132,26 @@ static int fastcall do_path_lookup(int d
 	if (*name=='/') {
 		read_lock(&fs->lock);
 		if (fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
+			union_lock_readlock(fs->altroot, &fs->lock);
 			nd->mnt = mntget(fs->altrootmnt);
-			nd->dentry = dget(fs->altroot);
+			nd->dentry = __dget(fs->altroot);
+			union_unlock(fs->altroot);
 			read_unlock(&fs->lock);
 			if (__emul_lookup_dentry(name,nd))
 				goto out; /* found in altroot */
 			read_lock(&fs->lock);
 		}
+		union_lock_readlock(fs->root, &fs->lock);
 		nd->mnt = mntget(fs->rootmnt);
-		nd->dentry = dget(fs->root);
+		nd->dentry = __dget(fs->root);
+		union_unlock(fs->root);
 		read_unlock(&fs->lock);
 	} else if (dfd == AT_FDCWD) {
 		read_lock(&fs->lock);
+		union_lock_readlock(fs->pwd, &fs->lock);
 		nd->mnt = mntget(fs->pwdmnt);
-		nd->dentry = dget(fs->pwd);
+		nd->dentry = __dget(fs->pwd);
+		union_unlock(fs->pwd);
 		read_unlock(&fs->lock);
 	} else {
 		struct dentry *dentry;
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1610,12 +1610,14 @@ void set_fs_root(struct fs_struct *fs, s
 {
 	struct dentry *old_root;
 	struct vfsmount *old_rootmnt;
+	union_lock(dentry);
 	write_lock(&fs->lock);
 	old_root = fs->root;
 	old_rootmnt = fs->rootmnt;
 	fs->rootmnt = mntget(mnt);
-	fs->root = dget(dentry);
+	fs->root = __dget(dentry);
 	write_unlock(&fs->lock);
+	union_unlock(dentry);
 	if (old_root) {
 		dput(old_root);
 		mntput(old_rootmnt);
@@ -1632,12 +1634,14 @@ void set_fs_pwd(struct fs_struct *fs, st
 	struct dentry *old_pwd;
 	struct vfsmount *old_pwdmnt;
 
+	union_lock(dentry);
 	write_lock(&fs->lock);
 	old_pwd = fs->pwd;
 	old_pwdmnt = fs->pwdmnt;
 	fs->pwdmnt = mntget(mnt);
-	fs->pwd = dget(dentry);
+	fs->pwd = __dget(dentry);
 	write_unlock(&fs->lock);
+	union_unlock(dentry);
 
 	if (old_pwd) {
 		dput(old_pwd);
@@ -1726,8 +1730,10 @@ asmlinkage long sys_pivot_root(const cha
 	}
 
 	read_lock(&current->fs->lock);
+	union_lock_readlock(current->fs->root, &current->fs->lock);
 	user_nd.mnt = mntget(current->fs->rootmnt);
-	user_nd.dentry = dget(current->fs->root);
+	user_nd.dentry = __dget(current->fs->root);
+	union_unlock(current->fs->root);
 	read_unlock(&current->fs->lock);
 	down_write(&namespace_sem);
 	mutex_lock(&old_nd.dentry->d_inode->i_mutex);
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -203,8 +203,10 @@ static int proc_cwd_link(struct inode *i
 	}
 	if (fs) {
 		read_lock(&fs->lock);
+		union_lock_readlock(fs->pwd, &fs->lock);
 		*mnt = mntget(fs->pwdmnt);
-		*dentry = dget(fs->pwd);
+		*dentry = __dget(fs->pwd);
+		union_unlock(fs->pwd);
 		read_unlock(&fs->lock);
 		result = 0;
 		put_fs_struct(fs);
@@ -224,8 +226,10 @@ static int proc_root_link(struct inode *
 	}
 	if (fs) {
 		read_lock(&fs->lock);
+		union_lock_readlock(fs->root, &fs->lock);
 		*mnt = mntget(fs->rootmnt);
-		*dentry = dget(fs->root);
+		*dentry = __dget(fs->root);
+		union_unlock(fs->root);
 		read_unlock(&fs->lock);
 		result = 0;
 		put_fs_struct(fs);
@@ -1252,19 +1256,26 @@ static int proc_fd_info(struct inode *in
 		 * We are not taking a ref to the file structure, so we must
 		 * hold ->file_lock.
 		 */
+repeat:
 		spin_lock(&files->file_lock);
 		file = fcheck_files(files, fd);
 		if (file) {
+			if (!union_trylock(file->f_path.dentry)) {
+				spin_unlock(&files->file_lock);
+				cpu_relax();
+				goto repeat;
+			}
 			if (mnt)
 				*mnt = mntget(file->f_path.mnt);
 			if (dentry)
-				*dentry = dget(file->f_path.dentry);
+				*dentry = __dget(file->f_path.dentry);
 			if (info)
 				snprintf(info, PROC_FDINFO_MAX,
 					 "pos:\t%lli\n"
 					 "flags:\t0%o\n",
 					 (long long) file->f_pos,
 					 file->f_flags);
+			union_unlock(file->f_path.dentry);
 			spin_unlock(&files->file_lock);
 			put_files_struct(files);
 			return 0;
--- a/fs/union.c
+++ b/fs/union.c
@@ -11,6 +11,55 @@
  */
 
 #include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+
+void
+__union_check(struct dentry *dentry)
+{
+	if (likely(!(dentry->d_topmost || dentry->d_overlaid))) {
+		if (unlikely(dentry->d_union)) {
+			printk(KERN_ERR "%s: \"%s\" stale union reference\n" \
+			       "\tdentry=%p, inode=%p, count=%d, u_count=%d\n",
+			       __FUNCTION__,
+			       dentry->d_name.name,
+			       dentry,
+			       dentry->d_inode,
+			       atomic_read(&dentry->d_count),
+			       atomic_read(&dentry->d_union->u_count));
+			dump_stack();
+		}
+		return;
+	}
+
+	BUG_ON(!dentry->d_union);
+
+	if ((dentry == dentry->d_topmost) || (dentry == dentry->d_overlaid)) {
+		printk(KERN_ERR "%s: \"%s\" loop in union stack\n",
+		       __FUNCTION__, dentry->d_name.name);
+		BUG();
+	}
+
+	if (dentry->d_inode && !S_ISDIR(dentry->d_inode->i_mode)) {
+		printk(KERN_ERR "%s: \"%s\" isn't a directory!\n",
+		       __FUNCTION__, dentry->d_name.name);
+		BUG();
+	}
+
+	if (dentry->d_topmost && !dentry->d_topmost->d_inode) {
+		printk(KERN_ERR "%s: \"%s\" has a negative topmost dentry!\n",
+		       __FUNCTION__, dentry->d_name.name);
+		BUG();
+	}
+
+	if (!dentry->d_inode && !dentry->d_topmost) {
+		printk(KERN_ERR "%s: \"%s\" is a negative topmost dentry!\n",
+		       __FUNCTION__, dentry->d_name.name);
+		BUG();
+	}
+}
+EXPORT_SYMBOL_GPL(__union_check);
 
 struct union_info * union_alloc(void)
 {
@@ -51,3 +100,203 @@ void union_put(struct union_info *info)
 
 	return;
 }
+
+/*
+ * Check if the given @parent dentry is really a parent of @dentry
+ */
+static int union_is_parent(struct dentry *dentry, struct dentry *parent)
+{
+	struct dentry *tmp = dentry;
+
+	if (parent->d_sb != dentry->d_sb) {
+		UM_DEBUG("%s and %s have different superblocks\n",
+			 dentry->d_name.name, parent->d_name.name);
+		return 0;
+	}
+
+	do {
+		if (tmp == parent)
+			return 1;
+	} while (tmp != tmp->d_parent && (tmp = tmp->d_parent));
+
+	return 0;
+}
+
+/*
+ * Check if the @dentry is part of a union
+ */
+int union_is_member(struct dentry *dentry, struct vfsmount *mnt)
+{
+	struct list_head *tmp;
+	struct vfsmount *p, *m_tmp = mntget(mnt);
+	struct dentry *d_tmp = __dget(dentry);
+
+	UM_DEBUG_UID("dentry=%s\n", dentry->d_name.name);
+
+	do {
+		UM_DEBUG_UID("device=%s\n", mnt->mnt_devname);
+		list_for_each(tmp, &m_tmp->mnt_mounts) {
+			p = list_entry(tmp, struct vfsmount, mnt_child);
+			UM_DEBUG_UID("child=%s\n", p->mnt_devname);
+			if (p->mnt_flags & MNT_UNION) {
+				UM_DEBUG_UID("is union=%s\n", p->mnt_devname);
+				if (union_is_parent(d_tmp, p->mnt_mountpoint)) {
+					__dput(d_tmp);
+					mntput(m_tmp);
+					return 1;
+				}
+			}
+		}
+
+		__dput(d_tmp);
+		d_tmp = __dget(m_tmp->mnt_mountpoint);
+		p = mntget(m_tmp->mnt_parent);
+		mntput(m_tmp);
+		m_tmp = p;
+	} while (m_tmp != m_tmp->mnt_parent);
+
+	__dput(d_tmp);
+	mntput(m_tmp);
+	return 0;
+}
+
+int __destroy_union(struct dentry * dentry)
+{
+	struct dentry *next;
+	struct dentry *topmost;
+	struct union_info *uinfo;
+
+	if (!union_trylock(dentry))
+		return 0;
+
+	uinfo = union_get(dentry->d_union);
+
+	UM_DEBUG_DCACHE("destroying \"%s\" (%p) union stack %p\n",
+			dentry->d_name.name, dentry->d_inode, uinfo);
+
+	next = dentry->d_topmost ? dentry->d_topmost : dentry;
+	while (next) {
+		struct dentry *tmp = next;
+		next = next->d_overlaid;
+
+		UM_DEBUG_DCACHE("\"%s\", inode=%p, count=%d\n",
+				tmp->d_name.name, tmp->d_inode,
+				atomic_read(&tmp->d_count));
+		if (tmp != dentry)
+			spin_lock(&tmp->d_lock);
+		tmp->d_topmost = NULL;
+		tmp->d_overlaid = NULL;
+		union_put(tmp->d_union);
+		tmp->d_union = NULL;
+		if (tmp != dentry)
+			spin_unlock(&tmp->d_lock);
+		if (tmp == dentry)
+			goto rebuild_stack;
+	}
+
+	mutex_unlock(&uinfo->u_mutex);
+	union_put(uinfo);
+	return 1;
+
+rebuild_stack:
+	if (next) {
+		topmost = next;
+		UM_DEBUG_DCACHE("\"%s\", inode=%p, count=%d\n",
+				next->d_name.name, next->d_inode,
+				atomic_read(&next->d_count));
+		spin_lock(&next->d_lock);
+		next->d_topmost = NULL;
+		if (!next->d_overlaid) {
+			union_put(next->d_union);
+			next->d_union = NULL;
+		}
+		spin_unlock(&next->d_lock);
+		next = next->d_overlaid;
+	}
+
+	while (next) {
+		struct dentry *tmp = next;
+		next = next->d_overlaid;
+		UM_DEBUG_DCACHE("\"%s\", inode=%p, count=%d\n",
+				tmp->d_name.name, tmp->d_inode,
+				atomic_read(&tmp->d_count));
+		tmp->d_topmost = topmost;
+	}
+
+	mutex_unlock(&uinfo->u_mutex);
+	union_put(uinfo);
+	return 1;
+}
+
+static void __destroy_stack_part(struct dentry * first, struct dentry * last)
+{
+	struct dentry * next = first;
+
+	while (next) {
+		struct dentry * tmp = next;
+		next = next->d_overlaid;
+
+		spin_lock(&tmp->d_lock);
+		tmp->d_topmost = NULL;
+		tmp->d_overlaid = NULL;
+		union_put(tmp->d_union);
+		tmp->d_union = NULL;
+		spin_unlock(&tmp->d_lock);
+		if (tmp == last)
+			break;
+	}
+}
+
+/*
+ * This is union-mount dput(). For union mount dentries it walks DOWN
+ * the union stack and puts every dentry in it. If the usage count of one
+ * of the dentries reaches zero, it is removed from the stack.
+ */
+void __dput_union(struct dentry *dentry)
+{
+	struct dentry *topmost;		// the new topmost after dput()
+	struct dentry *next;
+
+	union_check(dentry);
+
+	if (dentry->d_topmost) {
+		UM_DEBUG_DCACHE("we are not the topmost dentry\n");
+		topmost = dentry->d_topmost;
+	} else
+		topmost = NULL;
+
+	next = dentry;
+	while (next) {
+		struct dentry *tmp = next;	// the dentry we dput now
+		next = next->d_overlaid;	// the dentry we dput next
+
+		UM_DEBUG_DCACHE("name=\"%s\", inode=%p, count=%d\n",
+				tmp->d_name.name, tmp->d_inode,
+				atomic_read(&tmp->d_count));
+
+		if (atomic_read(&tmp->d_count) < 2) {
+			__destroy_stack_part(topmost ? topmost : tmp, tmp);
+			topmost = NULL;
+		} else {
+			tmp->d_topmost = topmost;
+			if (!topmost)
+				topmost = tmp;
+		}
+
+		/* We are the last one using d_union */
+		spin_lock(&tmp->d_lock);
+		if (tmp->d_union
+		    && (atomic_read(&tmp->d_union->u_count) == 1)) {
+			BUG_ON(next);
+			tmp->d_overlaid = NULL;
+			tmp->d_topmost = NULL;
+			union_put(tmp->d_union);
+			tmp->d_union = NULL;
+		}
+		spin_unlock(&tmp->d_lock);
+
+		__dput_single(tmp);
+	}
+
+	return;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -322,7 +322,7 @@ extern char * d_path(struct dentry *, st
  *	and call dget_locked() instead of dget().
  */
  
-static inline struct dentry *dget(struct dentry *dentry)
+static inline struct dentry *__dget_single(struct dentry *dentry)
 {
 	if (dentry) {
 		BUG_ON(!atomic_read(&dentry->d_count));
@@ -331,13 +331,69 @@ static inline struct dentry *dget(struct
 	return dentry;
 }
 
+extern void __dput_single(struct dentry *);
+
 /*
  * Reference counting for union mounts
  */
 #include <linux/dcache_union.h>
 
+/*
+ * Called with dentry's union lock held
+ */
+static inline struct dentry * __dget(struct dentry *dentry)
+{
+	if (unlikely(IS_UNION(dentry)))
+		return __dget_union(dentry);
+	else
+		return __dget_single(dentry);
+}
+
+static inline struct dentry * dget(struct dentry *dentry)
+{
+	if (!dentry)
+		return dentry;
+
+	/*
+	 * Yes, dget() can sleep now, if the union struct isn't yet read
+	 * in completely. This is symmetric to dput() which can sleep too.
+	 */
+	might_sleep();
+
+	union_lock(dentry);
+	__dget(dentry);
+	union_unlock(dentry);
+	return dentry;
+}
+
 extern struct dentry * dget_locked(struct dentry *);
 
+/*
+ * Called with dentry's union lock held
+ */
+static inline void __dput(struct dentry *dentry)
+{
+	if (unlikely(IS_UNION(dentry)))
+		__dput_union(dentry);
+	else
+		__dput_single(dentry);
+}
+
+static inline void dput(struct dentry *dentry)
+{
+	if (!dentry)
+		return;
+
+	if (unlikely(IS_UNION(dentry))) {
+		struct union_info *uinfo;
+
+		uinfo = union_get2(dentry);
+		__dput_union(dentry);
+		union_release(uinfo);
+	} else
+		__dput_single(dentry);
+}
+
 /**
  *	d_unhashed -	is dentry hashed
  *	@dentry: entry to check
@@ -355,13 +411,13 @@ static inline struct dentry *dget_parent
 	struct dentry *ret;
 
 	spin_lock(&dentry->d_lock);
-	ret = dget(dentry->d_parent);
+	union_lock_spinlock(dentry->d_parent, &dentry->d_lock);
+	ret = __dget(dentry->d_parent);
+	union_unlock(dentry->d_parent);
 	spin_unlock(&dentry->d_lock);
 	return ret;
 }
 
-extern void dput(struct dentry *);
-
 static inline int d_mountpoint(struct dentry *dentry)
 {
 	return dentry->d_mounted;
--- a/include/linux/dcache_union.h
+++ b/include/linux/dcache_union.h
@@ -41,6 +41,17 @@ extern struct union_info *union_alloc(vo
 extern struct union_info *union_get(struct union_info *);
 extern void union_put(struct union_info *);
 
+
+extern void __union_check(struct dentry *);
+
+static inline void union_check(struct dentry *dentry)
+{
+	if (!dentry)
+		return;
+	if (unlikely(dentry->d_union))
+		__union_check(dentry);
+}
+
 /*
  * These are the functions for locking a dentrys union. When one
  * want to acquire a denties union lock, use:
@@ -68,6 +79,7 @@ static inline void union_lock(struct den
 		UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
 			      dentry->d_name.name, ui,
 			      atomic_read(&ui->u_count));
+		union_check(dentry);
 		__union_lock(dentry->d_union);
 	}
 }
@@ -86,6 +98,7 @@ static inline void union_unlock(struct d
 		UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
 			      dentry->d_name.name, ui,
 			      atomic_read(&ui->u_count));
+		union_check(dentry);
 		__union_unlock(dentry->d_union);
 	}
 }
@@ -142,6 +155,7 @@ static inline int union_trylock(struct d
 		UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
 			      dentry->d_name.name, dentry->d_union,
 			      atomic_read(&dentry->d_union->u_count));
+		union_check(dentry);
 		BUG_ON(!atomic_read(&dentry->d_union->u_count));
 		locked = mutex_trylock(&dentry->d_union->u_mutex);
 		UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
@@ -223,10 +237,57 @@ static inline void union_lock_fs(struct 
 }
 
 #define IS_UNION(dentry) ((dentry)->d_overlaid || (dentry)->d_topmost || \
-				(dentry)->d_overlaid)
+			 (dentry)->d_union)
+
+/* dentry reference counting */
+static inline struct dentry * __dget_union(struct dentry *dentry)
+{
+	if (!dentry)
+		return dentry;
+
+	union_check(dentry);
+	__dget_single(dentry);
+
+	if (likely(!dentry->d_overlaid && !dentry->d_topmost))
+		return dentry;
+
+	if (dentry->d_topmost)
+		UM_DEBUG_DCACHE("we are not the topmost dentry\n");
+
+	UM_DEBUG_DCACHE("name=\"%s\", inode=%p, count=%d\n",
+			dentry->d_name.name, dentry->d_inode,
+			atomic_read(&dentry->d_count));
+
+	if (dentry->d_overlaid) {
+		struct dentry * tmp = dentry->d_overlaid;
+
+		while (tmp) {
+			__dget_single(tmp);
+			UM_DEBUG_DCACHE("name=\"%s\", inode=%p, count=%d\n",
+					tmp->d_name.name, tmp->d_inode,
+					atomic_read(&tmp->d_count));
+			tmp = tmp->d_overlaid;
+		}
+	}
+
+	return dentry;
+}
+
+extern void __dput_union(struct dentry *);
+extern int __destroy_union(struct dentry *dentry);
+
+static inline int __dput_single_destroy_union(struct dentry *dentry)
+{
+	if (!dentry->d_union)
+		return 1;
+
+	return __destroy_union(dentry);
+}
 
 #else /* CONFIG_UNION_MOUNT */
 
+#define union_check(dentry) do { /* empty */ } while (0)
+
 #define union_lock(dentry) do { /* empty */ } while (0)
 #define union_trylock(dentry) ({ (1); })
 #define union_unlock(dentry) do { /* empty */ } while (0)
@@ -238,6 +299,8 @@ static inline void union_lock_fs(struct 
 #define union_get2(x) ({ BUG(); (0); })
 #define union_release(x) do { BUG(); } while (0)
 
+#define __dput_single_destroy_union(x) ({ (1); })
+
 #endif	/* CONFIG_UNION_MOUNT */
 #endif	/* __KERNEL__ */
 #endif	/* __LINUX_DCACHE_UNION_H */
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1229,8 +1229,10 @@ void __audit_getname(const char *name)
 	++context->name_count;
 	if (!context->pwd) {
 		read_lock(&current->fs->lock);
-		context->pwd = dget(current->fs->pwd);
+		union_lock_readlock(current->fs->pwd, &current->fs->lock);
+		context->pwd = __dget(current->fs->pwd);
 		context->pwdmnt = mntget(current->fs->pwdmnt);
+		union_unlock(current->fs->pwd);
 		read_unlock(&current->fs->lock);
 	}
 		
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1894,9 +1894,11 @@ static int cpuset_rmdir(struct inode *un
 	mutex_lock(&callback_mutex);
 	set_bit(CS_REMOVED, &cs->flags);
 	list_del(&cs->sibling);	/* delete my sibling from parent->children */
+	union_lock(cs->dentry);
 	spin_lock(&cs->dentry->d_lock);
-	d = dget(cs->dentry);
+	d = __dget(cs->dentry);
 	cs->dentry = NULL;
+	union_unlock(d);
 	spin_unlock(&d->d_lock);
 	cpuset_d_remove_dir(d);
 	dput(d);
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -581,17 +581,21 @@ static inline struct fs_struct *__copy_f
 		rwlock_init(&fs->lock);
 		fs->umask = old->umask;
 		read_lock(&old->lock);
+		union_lock_fs(old);
 		fs->rootmnt = mntget(old->rootmnt);
-		fs->root = dget(old->root);
+		fs->root = __dget(old->root);
 		fs->pwdmnt = mntget(old->pwdmnt);
-		fs->pwd = dget(old->pwd);
+		fs->pwd = __dget(old->pwd);
 		if (old->altroot) {
 			fs->altrootmnt = mntget(old->altrootmnt);
-			fs->altroot = dget(old->altroot);
+			fs->altroot = __dget(old->altroot);
 		} else {
 			fs->altrootmnt = NULL;
 			fs->altroot = NULL;
 		}
+		union_unlock(old->pwd);
+		union_unlock(old->altroot);
+		union_unlock(old->root);
 		read_unlock(&old->lock);
 	}
 	return fs;
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1082,8 +1082,11 @@ restart:
 		newu->addr = otheru->addr;
 	}
 	if (otheru->dentry) {
-		newu->dentry	= dget(otheru->dentry);
+		/* Is this safe here? I don't know ... */
+		union_lock_spinlock(otheru->dentry, &otheru->lock);
+		newu->dentry	= __dget(otheru->dentry);
 		newu->mnt	= mntget(otheru->mnt);
+		union_unlock(otheru->dentry);
 	}
 
 	/* Set credentials */


* [RFC][PATCH  7/15] Union-mount mounting
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (5 preceding siblings ...)
  2007-04-17 13:20 ` [RFC][PATCH 6/15] Union-mount dentry reference counting Bharata B Rao
@ 2007-04-17 13:20 ` Bharata B Rao
  2007-04-17 13:21 ` [RFC][PATCH 8/15] Union-mount lookup Bharata B Rao
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Union-mount mounting

Adds union mount support to mount() and umount() system calls.
Sets up the union stack during mount and destroys it during unmount.

TODO: bind and move mounts aren't yet supported with union mounts.
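
The stack bookkeeping done at mount and unmount time can be pictured with a
small user-space model. This is only a sketch, not the kernel code: struct
layer and the stack_attach()/stack_detach() helpers are hypothetical stand-ins
for the d_overlaid and d_topmost pointers handled by attach_mnt_union() and
detach_mnt_union(), and all reference counting and union_info locking is left
out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two stack pointers kept in struct dentry. */
struct layer {
	const char *name;
	struct layer *overlaid;	/* next lower layer, like d_overlaid */
	struct layer *topmost;	/* top of the stack, NULL on the top itself */
};

/* Mount time: put new_top above the stack that starts at mountpoint. */
static void stack_attach(struct layer *new_top, struct layer *mountpoint)
{
	struct layer *tmp;

	assert(new_top && mountpoint);
	new_top->overlaid = mountpoint;
	new_top->topmost = NULL;
	/* every lower layer now points at the new top */
	for (tmp = mountpoint; tmp; tmp = tmp->overlaid)
		tmp->topmost = new_top;
}

/* Unmount time: remove top; the old mount point becomes the top again. */
static void stack_detach(struct layer *top)
{
	struct layer *mountpoint, *tmp;

	assert(top && !top->topmost);
	mountpoint = top->overlaid;
	top->overlaid = NULL;
	if (!mountpoint)
		return;
	mountpoint->topmost = NULL;
	/* re-point the remaining lower layers at the old mount point */
	for (tmp = mountpoint->overlaid; tmp; tmp = tmp->overlaid)
		tmp->topmost = mountpoint;
}
```

In this model, attaching "c" over "b" over "a" leaves a.topmost and b.topmost
pointing at "c"; detaching "c" makes "b" the top again with a.topmost == b.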

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/namespace.c        |   79 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/union.c            |   65 +++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h    |    3 +
 include/linux/union.h |   33 ++++++++++++++++++++
 4 files changed, 173 insertions(+), 7 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -166,7 +166,7 @@ void mnt_set_mountpoint(struct vfsmount 
 			struct vfsmount *child_mnt)
 {
 	child_mnt->mnt_parent = mntget(mnt);
-	child_mnt->mnt_mountpoint = dget(dentry);
+	child_mnt->mnt_mountpoint = __dget(dentry);
 	dentry->d_mounted++;
 }
 
@@ -234,6 +234,10 @@ static struct vfsmount *clone_mnt(struct
 	struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
 
 	if (mnt) {
+		/*
+		 * As of now, cloning of union mounted mnt isn't permitted.
+		 */
+		BUG_ON(old->mnt_flags & MNT_UNION);
 		mnt->mnt_flags = old->mnt_flags;
 		atomic_inc(&sb->s_active);
 		mnt->mnt_sb = sb;
@@ -522,16 +526,20 @@ void release_mounts(struct list_head *he
 		mnt = list_entry(head->next, struct vfsmount, mnt_hash);
 		list_del_init(&mnt->mnt_hash);
 		if (mnt->mnt_parent != mnt) {
-			struct dentry *dentry;
-			struct vfsmount *m;
+			struct path old_nd;
 			spin_lock(&vfsmount_lock);
-			dentry = mnt->mnt_mountpoint;
-			m = mnt->mnt_parent;
+			old_nd.dentry = mnt->mnt_mountpoint;
+			old_nd.mnt = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
 			mnt->mnt_parent = mnt;
+			detach_mnt_union(mnt, &old_nd);
 			spin_unlock(&vfsmount_lock);
-			dput(dentry);
-			mntput(m);
+			if (mnt->mnt_flags & MNT_UNION) {
+				UM_DEBUG("shrink the mountpoint's dcache\n");
+				shrink_dcache_sb(old_nd.dentry->d_sb);
+			}
+			__dput(old_nd.dentry);
+			mntput(old_nd.mnt);
 		}
 		mntput(mnt);
 	}
@@ -564,6 +572,7 @@ static int do_umount(struct vfsmount *mn
 	struct super_block *sb = mnt->mnt_sb;
 	int retval;
 	LIST_HEAD(umount_list);
+	struct union_info *uinfo = NULL;
 
 	retval = security_sb_umount(mnt, flags);
 	if (retval)
@@ -628,6 +637,9 @@ static int do_umount(struct vfsmount *mn
 	}
 
 	down_write(&namespace_sem);
+	/* get the union lock which is released after release_mounts() */
+	if (mnt->mnt_flags & MNT_UNION)
+		uinfo = union_get2(mnt->mnt_root);
 	spin_lock(&vfsmount_lock);
 	event++;
 
@@ -642,6 +654,8 @@ static int do_umount(struct vfsmount *mn
 		security_sb_umount_busy(mnt);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
+	if (uinfo)
+		union_release(uinfo);
 	return retval;
 }
 
@@ -857,6 +871,7 @@ static int attach_recursive_mnt(struct v
 		touch_mnt_namespace(current->nsproxy->mnt_ns);
 	} else {
 		mnt_set_mountpoint(dest_mnt, dest_dentry, source_mnt);
+		attach_mnt_union(source_mnt, nd);
 		commit_tree(source_mnt);
 	}
 
@@ -871,6 +886,8 @@ static int attach_recursive_mnt(struct v
 static int graft_tree(struct vfsmount *mnt, struct nameidata *nd)
 {
 	int err;
+	struct union_info *uinfo = NULL;
+
 	if (mnt->mnt_sb->s_flags & MS_NOUSER)
 		return -EINVAL;
 
@@ -878,6 +895,9 @@ static int graft_tree(struct vfsmount *m
 	      S_ISDIR(mnt->mnt_root->d_inode->i_mode))
 		return -ENOTDIR;
 
+	if (mnt->mnt_flags & MNT_UNION)
+		uinfo = union_alloc2(nd->dentry);
+
 	err = -ENOENT;
 	mutex_lock(&nd->dentry->d_inode->i_mutex);
 	if (IS_DEADDIR(nd->dentry->d_inode))
@@ -894,6 +914,8 @@ out_unlock:
 	mutex_unlock(&nd->dentry->d_inode->i_mutex);
 	if (!err)
 		security_sb_post_addmount(mnt, nd);
+	if (uinfo)
+		union_release(uinfo);
 	return err;
 }
 
@@ -909,6 +931,12 @@ static int do_change_type(struct nameida
 	if (nd->dentry != nd->mnt->mnt_root)
 		return -EINVAL;
 
+	/*
+	 * Don't change the type of union mounts
+	 */
+	if (nd->mnt->mnt_flags & MNT_UNION)
+		return -EINVAL;
+
 	down_write(&namespace_sem);
 	spin_lock(&vfsmount_lock);
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
@@ -934,6 +962,15 @@ static int do_loopback(struct nameidata 
 	if (err)
 		return err;
 
+	/*
+	 * bind mounting to or from union mounts is not supported
+	 */
+	err = -EINVAL;
+	if (nd->mnt->mnt_flags & MNT_UNION)
+		goto out_unlocked;
+	if (old_nd.mnt->mnt_flags & MNT_UNION)
+		goto out_unlocked;
+
 	down_write(&namespace_sem);
 	err = -EINVAL;
 	if (IS_MNT_UNBINDABLE(old_nd.mnt))
@@ -962,6 +999,7 @@ static int do_loopback(struct nameidata 
 
 out:
 	up_write(&namespace_sem);
+out_unlocked:
 	path_release(&old_nd);
 	return err;
 }
@@ -1019,6 +1057,15 @@ static int do_move_mount(struct nameidat
 	if (err)
 		return err;
 
+	/*
+	 * moving to or from a union mount is not supported
+	 */
+	err = -EINVAL;
+	if (nd->mnt->mnt_flags & MNT_UNION)
+		goto exit;
+	if (old_nd.mnt->mnt_flags & MNT_UNION)
+		goto exit;
+
 	down_write(&namespace_sem);
 	while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
 		;
@@ -1074,6 +1121,7 @@ out:
 	up_write(&namespace_sem);
 	if (!err)
 		path_release(&parent_nd);
+exit:
 	path_release(&old_nd);
 	return err;
 }
@@ -1098,6 +1146,9 @@ static int do_new_mount(struct nameidata
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 
+	UM_DEBUG("dentry=%s, device=%s\n", nd->dentry->d_name.name,
+	       mnt->mnt_devname);
+
 	return do_add_mount(mnt, nd, mnt_flags, NULL);
 }
 
@@ -1128,6 +1179,12 @@ int do_add_mount(struct vfsmount *newmnt
 	if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
 		goto unlock;
 
+	/* Unions couldn't include shared mounts */
+	err = -EINVAL;
+	if ((mnt_flags & MNT_UNION) &&
+	    IS_MNT_SHARED(nd->mnt))
+		goto unlock;
+
 	/* Unions couldn't be writable if the filesystem
 	 * doesn't know about whiteouts */
 	err = -ENOTSUPP;
@@ -1146,6 +1203,14 @@ int do_add_mount(struct vfsmount *newmnt
 		list_add_tail(&newmnt->mnt_expire, fslist);
 		spin_unlock(&vfsmount_lock);
 	}
+
+	UM_DEBUG("mntpoint->d_count=%d/%p\n",
+		 atomic_read(&nd->dentry->d_count),
+		 &nd->dentry->d_count);
+	UM_DEBUG("mntroot->d_count=%d/%p\n",
+		 atomic_read(&newmnt->mnt_root->d_count),
+		 &newmnt->mnt_root->d_count);
+
 	up_write(&namespace_sem);
 	return 0;
 
--- a/fs/union.c
+++ b/fs/union.c
@@ -300,3 +300,68 @@ void __dput_union(struct dentry *dentry)
 
 	return;
 }
+
+void attach_mnt_union(struct vfsmount *mnt, struct nameidata *nd)
+{
+	struct dentry *tmp;
+
+	if (!(mnt->mnt_flags & MNT_UNION))
+		return;
+
+	UM_DEBUG("MNT_UNION set for dentry \"%s\", devname=%s\n",
+		 mnt->mnt_root->d_name.name, mnt->mnt_devname);
+	UM_DEBUG("mountpoint \"%s\", inode=%p\n",
+		 nd->dentry->d_name.name, nd->dentry->d_inode);
+
+	mnt->mnt_root->d_overlaid = __dget(nd->dentry);
+	mnt->mnt_root->d_topmost = NULL;
+	mnt->mnt_root->d_union = union_get(nd->dentry->d_union);
+
+	tmp = nd->dentry;
+	while (tmp) {
+		tmp->d_topmost = mnt->mnt_root;
+		tmp = tmp->d_overlaid;
+	}
+}
+
+void detach_mnt_union(struct vfsmount *mnt, struct path *path)
+{
+	struct dentry *tmp;
+
+	if (!(mnt->mnt_flags & MNT_UNION))
+		return;
+
+	UM_DEBUG("MNT_UNION set for dentry \"%s\", devname=%s\n",
+		 mnt->mnt_root->d_name.name, mnt->mnt_devname);
+	UM_DEBUG("mountpoint \"%s\", inode=%p\n",
+		 path->dentry->d_name.name, path->dentry->d_inode);
+	BUG_ON(mnt->mnt_root->d_topmost);
+
+	/* put reference to the underlying union stack */
+	__dput(mnt->mnt_root->d_overlaid);
+	mnt->mnt_root->d_overlaid = NULL;
+	union_put(mnt->mnt_root->d_union);
+	mnt->mnt_root->d_union = NULL;
+
+	/* rearrange the union stack */
+	path->dentry->d_topmost = NULL;
+	tmp = path->dentry->d_overlaid;
+	while (tmp) {
+		tmp->d_topmost = path->dentry;
+		tmp = tmp->d_overlaid;
+	}
+
+	/* If the mount point is the last component in the union,
+	 * put the reference to the union struct */
+	if (!path->dentry->d_overlaid) {
+		union_put(path->dentry->d_union);
+		path->dentry->d_union = NULL;
+	}
+
+	/* when we looked up the mountpoint to be unmounted
+	 * we dget() a union-mount dentry struct so we have
+	 * to dput() parts of it by hand before we remove the
+	 * topmost dentry (which is mnt->mnt_root) from the
+	 * union stack */
+	__dput(path->dentry);
+}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1976,6 +1976,9 @@ static inline ino_t parent_ino(struct de
 /* kernel/fork.c */
 extern int unshare_files(void);
 
+/* fs/union.c */
+#include <linux/union.h>
+
 /* Transaction based IO helpers */
 
 /*
--- /dev/null
+++ b/include/linux/union.h
@@ -0,0 +1,33 @@
+/*
+ * VFS based union mount for Linux
+ *
+ * Copyright © 2004-2007 IBM Corporation
+ *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+#ifndef __LINUX_UNION_H
+#define __LINUX_UNION_H
+#ifdef __KERNEL__
+
+#ifdef CONFIG_UNION_MOUNT
+
+#include <linux/fs_struct.h>
+
+/* namespace stuff used at mount time */
+extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
+extern void detach_mnt_union(struct vfsmount *, struct path *);
+
+#else	/* CONFIG_UNION_MOUNT */
+
+#define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
+#define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
+
+#endif	/* CONFIG_UNION_MOUNT */
+
+#endif	/* __KERNEL__ */
+#endif	/* __LINUX_UNION_H */


* [RFC][PATCH  8/15] Union-mount lookup
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (6 preceding siblings ...)
  2007-04-17 13:20 ` [RFC][PATCH 7/15] Union-mount mounting Bharata B Rao
@ 2007-04-17 13:21 ` Bharata B Rao
  2007-04-17 13:22 ` [RFC][PATCH 9/15] Simple union-mount readdir Bharata B Rao
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Union-mount lookup

Modifies the vfs lookup routines to work with union mounted directories.

The existing lookup routines generally look up a pathname only in the
topmost or given directory. The changed versions of the lookup routines
search for the pathname in the entire union mounted stack. They have also
been modified to set up the union stack during lookup from the dcache and
from real_lookup().
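
The search order can be sketched in user space as follows. This is a
simplified model under stated assumptions, not the patch's implementation:
struct entry, struct dir_layer, layer_find() and union_lookup() are
hypothetical names, directories are plain arrays, and locking, reference
counting and whiteouts are ignored. It only shows the rule the lookup code
follows: a layer without the name is skipped, stacked directories are
collected, and a non-directory ends the walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum kind { K_FILE, K_DIR };

struct entry {
	const char *name;
	enum kind kind;
};

/* One directory of the union; overlaid points at the next lower layer. */
struct dir_layer {
	const struct entry *entries;
	size_t nr;
	const struct dir_layer *overlaid;
};

/* Find name in a single layer, or return NULL (a "negative" result). */
static const struct entry *layer_find(const struct dir_layer *l,
				      const char *name)
{
	size_t i;

	for (i = 0; i < l->nr; i++)
		if (strcmp(l->entries[i].name, name) == 0)
			return &l->entries[i];
	return NULL;
}

/* Walk the stack from the top; returns how many entries were stored in out. */
static size_t union_lookup(const struct dir_layer *top, const char *name,
			   const struct entry **out, size_t max)
{
	const struct dir_layer *l;
	size_t n = 0;

	assert(top && name && out);
	for (l = top; l && n < max; l = l->overlaid) {
		const struct entry *e = layer_find(l, name);

		if (!e)
			continue;	/* negative in this layer, look lower */
		if (e->kind != K_DIR) {
			if (n == 0)
				out[n++] = e;	/* first hit is a file: it wins */
			break;	/* a non-directory hides everything below */
		}
		out[n++] = e;	/* directories stack on top of each other */
	}
	return n;
}
```

With a top layer holding "etc" (directory) and "motd" (file) over a lower
layer holding "etc", "motd" and "usr" (all directories), looking up "etc"
yields both directories, "motd" yields only the top file, and "usr" yields
the single lower directory.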

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/dcache.c            |   16 +
 fs/namei.c             |   76 +++++-
 fs/namespace.c         |   35 ++
 fs/union.c             |  597 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |   17 +
 include/linux/namei.h  |    4 
 include/linux/union.h  |   27 ++
 7 files changed, 761 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1278,7 +1278,7 @@ struct dentry * d_lookup(struct dentry *
 	return dentry;
 }
 
-struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
+struct dentry * __d_lookup_single(struct dentry *parent, struct qstr *name)
 {
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
@@ -1363,6 +1363,20 @@ out:
 	return dentry;
 }
 
+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
+{
+	struct dentry *dentry;
+	unsigned long seq;
+
+	do {
+		seq = read_seqbegin(&rename_lock);
+		dentry = __d_lookup_single(parent, name);
+		if (dentry)
+			break;
+	} while (read_seqretry(&rename_lock, seq));
+	return dentry;
+}
+
 /**
  * d_validate - verify dentry provided from insecure source
  * @dentry: The dentry alleged to be valid child of @dparent
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -374,6 +374,33 @@ void release_open_intent(struct nameidat
 }
 
 static inline struct dentry *
+do_revalidate_single(struct dentry *dentry, struct nameidata *nd)
+{
+	int status = dentry->d_op->d_revalidate(dentry, nd);
+	if (unlikely(status <= 0)) {
+		/*
+		 * The dentry failed validation.
+		 * If d_revalidate returned 0 attempt to invalidate
+		 * the dentry otherwise d_revalidate is asking us
+		 * to return a fail status.
+		 */
+		if (!status) {
+			if (!d_invalidate(dentry)) {
+				__dput_single(dentry);
+				dentry = NULL;
+			}
+		} else {
+			__dput_single(dentry);
+			dentry = ERR_PTR(status);
+		}
+	}
+	return dentry;
+}
+
+/*
+ * FIXME: We need a union aware revalidate here!
+ */
+static inline struct dentry *
 do_revalidate(struct dentry *dentry, struct nameidata *nd)
 {
 	int status = dentry->d_op->d_revalidate(dentry, nd);
@@ -403,16 +430,16 @@ do_revalidate(struct dentry *dentry, str
  */
 static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
 {
-	struct dentry * dentry = __d_lookup(parent, name);
+	struct dentry *dentry = __d_lookup_single(parent, name);
 
 	/* lockess __d_lookup may fail due to concurrent d_move() 
 	 * in some unrelated directory, so try with d_lookup
 	 */
 	if (!dentry)
-		dentry = d_lookup(parent, name);
+		dentry = d_lookup_single(parent, name);
 
 	if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
-		dentry = do_revalidate(dentry, nd);
+		dentry = do_revalidate_single(dentry, nd);
 
 	return dentry;
 }
@@ -465,7 +492,7 @@ ok:
  * make sure that nobody added the entry to the dcache in the meantime..
  * SMP-safe
  */
-static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, struct nameidata *nd)
+struct dentry * real_lookup_single(struct dentry *parent, struct qstr *name, struct nameidata *nd)
 {
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
@@ -485,7 +512,7 @@ static struct dentry * real_lookup(struc
 	 *
 	 * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
 	 */
-	result = d_lookup(parent, name);
+	result = d_lookup_single(parent, name);
 	if (!result) {
 		struct dentry * dentry = d_alloc(parent, name);
 		result = ERR_PTR(-ENOMEM);
@@ -506,7 +533,7 @@ static struct dentry * real_lookup(struc
 	 */
 	mutex_unlock(&dir->i_mutex);
 	if (result->d_op && result->d_op->d_revalidate) {
-		result = do_revalidate(result, nd);
+		result = do_revalidate_single(result, nd);
 		if (!result)
 			result = ERR_PTR(-ENOENT);
 	}
@@ -699,7 +726,7 @@ static int __follow_mount(struct path *p
 	return res;
 }
 
-static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
+void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
 {
 	while (d_mountpoint(*dentry)) {
 		struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
@@ -773,6 +800,7 @@ static __always_inline void follow_dotdo
 		nd->mnt = parent;
 	}
 	follow_mount(&nd->mnt, &nd->dentry);
+	follow_union_mount(&nd->mnt, &nd->dentry);
 }
 
 /*
@@ -784,7 +812,15 @@ static int do_lookup(struct nameidata *n
 		     struct path *path)
 {
 	struct vfsmount *mnt = nd->mnt;
-	struct dentry *dentry = __d_lookup(nd->dentry, name);
+	struct dentry *dentry;
+
+	UM_DEBUG_UID("lookup \"%s\" in \"%s\" (inode=%p,dev=%s)\n",
+		     name->name,
+		     nd->dentry->d_name.name,
+		     nd->dentry->d_inode,
+		     nd->mnt->mnt_devname);
+
+	dentry = __d_lookup(nd->dentry, name);
 
 	if (!dentry)
 		goto need_lookup;
@@ -793,7 +829,17 @@ static int do_lookup(struct nameidata *n
 done:
 	path->mnt = mnt;
 	path->dentry = dentry;
+
+	if (nd->dentry->d_sb != dentry->d_sb)
+		path->mnt = find_mnt(dentry);
+
 	__follow_mount(path);
+	follow_union_mount(&path->mnt, &path->dentry);
+
+	UM_DEBUG_UID("found \"%s\" (inode=%p,dev=%s)\n",
+		     path->dentry->d_name.name,
+		     path->dentry->d_inode,
+		     path->mnt->mnt_devname);
 	return 0;
 
 need_lookup:
@@ -838,6 +884,9 @@ static fastcall int __link_path_walk(con
 	if (nd->depth)
 		lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
 
+	UM_DEBUG_UID("begin walking for %s\n", name);
+	follow_union_mount(&nd->mnt, &nd->dentry);
+
 	/* At this point we know we have a real path component. */
 	for(;;) {
 		unsigned long hash;
@@ -931,6 +980,7 @@ static fastcall int __link_path_walk(con
 last_with_slashes:
 		lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
 last_component:
+		UM_DEBUG_UID("last component %s\n", this.name);
 		/* Clear LOOKUP_CONTINUE iff it was previously unset */
 		nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
 		if (lookup_flags & LOOKUP_PARENT)
@@ -1270,8 +1320,14 @@ int __user_path_lookup_open(const char _
  * Restricted form of lookup. Doesn't follow links, single-component only,
  * needs parent already locked. Doesn't follow mounts.
  * SMP-safe.
+ *
+ * NOTE: On union mounts it is important that the overlaid dentries are
+ * correct. Therefore we need to follow mounts. See
+ * __lookup_hash_union() for how this is done.
+ *
+ * Called with union already locked (before the parent inode is locked !!!)
  */
-static struct dentry * __lookup_hash(struct qstr *name, struct dentry * base, struct nameidata *nd)
+struct dentry * __lookup_hash_single(struct qstr *name, struct dentry *base, struct nameidata *nd)
 {
 	struct dentry * dentry;
 	struct inode *inode;
@@ -1307,6 +1363,8 @@ static struct dentry * __lookup_hash(str
 			dput(new);
 	}
 out:
+	UM_DEBUG_UID("name=\"%s\", inode=%p\n",
+		     dentry->d_name.name, dentry->d_inode);
 	return dentry;
 }
 
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -130,6 +130,41 @@ struct vfsmount *lookup_mnt(struct vfsmo
 	return child_mnt;
 }
 
+/*
+ * find_mnt - find a vfsmount struct
+ * @dentry: a dentry
+ *
+ * This searches the namespace for a given dentry's
+ * vfsmount struct. This is used by union-mount.
+ */
+struct vfsmount * find_mnt(struct dentry *dentry)
+{
+	struct list_head *tmp;
+	struct vfsmount *p, *mnt = NULL;
+
+	down_read(&namespace_sem);
+	spin_lock(&vfsmount_lock);
+	if (list_empty(&current->nsproxy->mnt_ns->list)) {
+		spin_unlock(&vfsmount_lock);
+		up_read(&namespace_sem);
+		return NULL;
+	}
+	list_for_each(tmp, &current->nsproxy->mnt_ns->list) {
+		p = list_entry(tmp, struct vfsmount, mnt_list);
+		if (dentry->d_sb == p->mnt_sb) {
+			mnt = mntget(p);
+			break;
+		}
+	}
+	spin_unlock(&vfsmount_lock);
+	up_read(&namespace_sem);
+
+	BUG_ON(!mnt);
+//	UM_DEBUG_UID("found %s/%p in %s\n", dentry->d_name.name,
+//		     dentry->d_inode, mnt->mnt_devname);
+	return mnt;
+}
+
 static inline int check_mnt(struct vfsmount *mnt)
 {
 	return mnt->mnt_ns == current->nsproxy->mnt_ns;
--- a/fs/union.c
+++ b/fs/union.c
@@ -365,3 +365,600 @@ void detach_mnt_union(struct vfsmount *m
 	 * union stack */
 	__dput(path->dentry);
 }
+
+static noinline int revalidate_union(struct dentry * dentry)
+{
+	union_check(dentry);
+
+	spin_lock(&dcache_lock);
+	spin_lock(&dentry->d_lock);
+	if (atomic_read(&dentry->d_count) < 2) {
+		UM_DEBUG_DCACHE("dentry unused, count=%d\n",
+			     atomic_read(&dentry->d_count));
+		__d_drop(dentry);
+		spin_unlock(&dentry->d_lock);
+		spin_unlock(&dcache_lock);
+		return 0;
+	}
+	spin_unlock(&dentry->d_lock);
+	spin_unlock(&dcache_lock);
+
+	return 1;
+}
+
+static noinline void replace_union_info(struct dentry *dentry,
+					struct union_info *lock)
+{
+	struct dentry *tmp = dentry;
+	struct union_info *old_lock = union_get(dentry->d_union);
+
+	BUG_ON(!lock);
+	BUG_ON(dentry->d_union == lock);
+
+	while (tmp) {
+		spin_lock(&tmp->d_lock);
+		union_put(tmp->d_union);
+		tmp->d_union = union_get(lock);
+		spin_unlock(&tmp->d_lock);
+		tmp = tmp->d_overlaid;
+	}
+
+	BUG_ON(atomic_read(&old_lock->u_count) != 1);
+	union_put(old_lock);
+	return;
+}
+
+static void __dput_from_to(struct dentry *from, struct dentry *to,
+			   struct union_info *lock)
+{
+	struct dentry *next = from;
+	struct union_info *mylock = union_get(from->d_union);
+
+	while (next) {
+		struct dentry *tmp = next;
+		next = next->d_overlaid;
+
+		UM_DEBUG_UID("dput_all dentry=\"%s\", inode=\"%p\"\n",
+			     tmp->d_name.name, tmp->d_inode);
+
+		if (lock) {
+			spin_lock(&tmp->d_lock);
+			tmp->d_topmost = NULL;
+			tmp->d_overlaid = NULL;
+			union_put(tmp->d_union);
+			tmp->d_union = NULL;
+			spin_unlock(&tmp->d_lock);
+		}
+
+		__dput_single(tmp);
+
+		if (tmp == to)
+			break;
+	}
+
+	UM_DEBUG_LOCK("\"??\" unlocking union %p\n", lock);
+	mutex_unlock(&mylock->u_mutex);
+	union_put(mylock);
+}
+
+/*
+ * Look up @name in the dentry cache. Look through the lower layers
+ * of the union stack and build a union stack if necessary.
+ *
+ * 1.) used dentry, negative dentry, NULL dentry: just return
+ * 2.) unused union-mount dentry: How do I know? Hmm, our parent is overlaid.
+ *     Some of our d_overlaid dentries COULD be out of cache but we don't know
+ *     for sure. So lets return NULL. Maybe we should also drop the dentry?
+ * 3.) used union-mount dentry: How do I know? Because we are used and our
+ *     d_overlaid isn't NULL. What should happen? Increment the d_count on
+ *     all our d_overlaid dentries and return.
+ *
+ * TODO: Consider breaking this function into smaller bits.
+ */
+struct dentry * __d_lookup_union(struct dentry *base, struct qstr *name)
+{
+	struct dentry *parent = base->d_overlaid;
+	struct dentry *dentry = NULL;
+	struct dentry *topmost;
+	struct dentry *last;
+	struct qstr this;
+	struct union_info *lock = NULL;
+	int err;
+
+	union_lock(base);
+	topmost = __d_lookup_single(base, name);
+	last = topmost;
+
+	if (!topmost || !base->d_overlaid)
+		goto out;
+
+	this.name = name->name;
+	this.len = name->len;
+	this.hash = name->hash;
+
+	if (topmost->d_inode)
+		goto lookup_union;
+
+	/*
+	 * look for the first non-negative dentry
+	 */
+	while (parent) {
+		if (parent->d_op && parent->d_op->d_hash) {
+			err = parent->d_op->d_hash(parent, &this);
+			if (err < 0) {
+				__dput_single(topmost);
+				topmost = NULL;
+				goto out;
+			}
+		}
+		dentry = __d_lookup_single(parent, &this);
+		/*
+		 * Force a real lookup if parts of the
+		 * union stack are not in the dcache
+		 */
+		if (!dentry) {
+			__dput_single(topmost);
+			topmost = NULL;
+			goto out;
+		}
+		if (dentry->d_inode)
+			break;
+		__dput_single(dentry);
+		dentry = NULL;
+		parent = parent->d_overlaid;
+	}
+
+	if (!dentry)
+		goto out;
+
+	__dput_single(topmost);
+	topmost = dentry;
+	last = dentry;
+
+lookup_union:
+	do {
+		struct vfsmount *mnt = find_mnt(topmost);
+		UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
+				topmost->d_name.name, topmost->d_inode,
+				mnt->mnt_devname);
+		mntput(mnt);
+	} while (0);
+
+	if (!S_ISDIR(topmost->d_inode->i_mode))
+		goto out;
+
+	if (!revalidate_union(topmost)) {
+		__dput_single(topmost);
+		topmost = NULL;
+		goto out;
+	}
+
+	spin_lock(&topmost->d_lock);
+	if (topmost->d_union) {
+		union_lock_spinlock(topmost, &topmost->d_lock);
+	}
+	spin_unlock(&topmost->d_lock);
+
+	parent = topmost->d_parent->d_overlaid;
+	while (parent) {
+		if (parent->d_op && parent->d_op->d_hash) {
+			err = parent->d_op->d_hash(parent, &this);
+			if (err < 0) {
+				UM_DEBUG("failed to hash the qstr\n");
+				goto dput_all;
+			}
+		}
+		dentry = __d_lookup_single(parent, &this);
+		if (!dentry) {
+			/* nothing to put here: the lookup returned no dentry */
+			goto dput_all;
+		}
+		if (!dentry->d_inode) {
+			__dput_single(dentry);
+			parent = parent->d_overlaid;
+			continue;
+		}
+		if (!S_ISDIR(dentry->d_inode->i_mode)) {
+			__dput_single(dentry);
+			break;
+		}
+		if (last->d_overlaid
+		    && (last->d_overlaid != dentry)) {
+			printk(KERN_ERR "%s: strange stack layout " \
+			       "(\"%s\" overlays \"%s\")\n",
+			       __FUNCTION__, last->d_name.name,
+			       dentry->d_name.name);
+			dump_stack();
+			__dput_single(dentry);
+			goto dput_all;
+		}
+		spin_lock(&topmost->d_lock);
+		if (!topmost->d_union) {
+			UM_DEBUG_LOCK("allocate union for \"%s\"\n",
+				      topmost->d_name.name);
+			topmost->d_union = union_alloc();
+			lock = topmost->d_union;
+		}
+		spin_unlock(&topmost->d_lock);
+		spin_lock(&dentry->d_lock);
+		if (!dentry->d_union)
+			dentry->d_union = union_get(topmost->d_union);
+		spin_unlock(&dentry->d_lock);
+		if (dentry->d_union != topmost->d_union) {
+			union_lock(dentry);
+			replace_union_info(topmost, dentry->d_union);
+		}
+		dentry->d_topmost = topmost;
+		last->d_overlaid = dentry;
+		last = dentry;
+		parent = parent->d_overlaid;
+	}
+
+	spin_lock(&topmost->d_lock);
+	if (topmost->d_union && atomic_read(&topmost->d_union->u_count) == 1) {
+		union_put(topmost->d_union);
+		topmost->d_union = NULL;
+	} else
+		union_unlock(topmost);
+	spin_unlock(&topmost->d_lock);
+out:
+	union_unlock(base);
+	return topmost;
+
+dput_all:
+	__dput_from_to(topmost, last, lock);
+	union_unlock(base);
+	return NULL;
+}
+
+/* "stack" is an already existing union stack, "new" is a dentry or
+ * a union stack which is overlaid by "stack". So the topmost dentry
+ * is in "stack". */
+static void append_to_stack(struct dentry *stack, struct dentry *new)
+{
+	struct dentry *topmost;
+	struct dentry *prev = stack;
+	struct dentry *next = new;
+
+	BUG_ON(!stack);
+	BUG_ON(!new);
+
+	while (prev->d_overlaid)
+		prev = prev->d_overlaid;
+
+	if (prev->d_topmost)
+		topmost = prev->d_topmost;
+	else
+		topmost = stack;
+
+	while (next) {
+		next->d_topmost = topmost;
+		prev->d_overlaid = next;
+		prev = next;
+		next = next->d_overlaid;
+	}
+
+	return;
+}
+
+/*
+ * FIXME: export this from fs/namei.c ???
+ */
+extern void follow_mount(struct vfsmount **, struct dentry **);
+extern struct dentry * __lookup_hash_single(struct qstr *, struct dentry *,
+					    struct nameidata *);
+extern struct dentry * real_lookup_single(struct dentry *, struct qstr *,
+					  struct nameidata *);
+
+/*
+ * This is called when a dentry's parent is union-mounted and we have
+ * to look up the overlaid dentries. The lookup starts at the parent's
+ * first overlaid dentry of the given dentry. Negative dentries are
+ * ignored and not included in the overlaid list.
+ *
+ * If we reach a dentry with restricted access, we just stop the lookup
+ * because we shouldn't see through that dentry. Same thing for dentry
+ * type mismatch and whiteouts.
+ *
+ * FIXME:
+ * - handle DT_WHT
+ * - handle union stacks in use
+ * - handle union stacks mounted upon union stacks
+ * - avoid unnecessary allocations of union locks
+ */
+static int __lookup_union(struct dentry *topmost, struct qstr *name,
+			  struct nameidata *__nd)
+{
+	struct dentry *parent;
+	struct dentry *last;
+	struct dentry *dentry;
+	unsigned int hash = name->hash;
+	struct nameidata nd;
+	int err;
+
+	/* we may also be called via lookup_hash
+	 * with a NULLed nd argument */
+	if (__nd) {
+		nd.last.name = NULL;	// handled in __link_path_walk
+		nd.last.len = 0;
+		nd.last.hash = 0;
+		nd.flags = __nd->flags;
+		nd.um_flags = 0;	// ditto
+		nd.last_type = -1;	// ditto
+		nd.depth = 0;		// handled in do_follow_link
+		memcpy(&nd.intent, &__nd->intent, sizeof(nd.intent));
+	}
+
+	spin_lock(&topmost->d_lock);
+	if (topmost->d_union) {
+		union_lock_spinlock(topmost, &topmost->d_lock);
+	}
+	spin_unlock(&topmost->d_lock);
+
+	parent = topmost->d_parent->d_overlaid;
+	last = topmost;
+
+	while (parent) {
+		/* the hash could be changed in the last __lookup_hash_single()
+		 * so we need to reset it here */
+		name->hash = hash;
+		nd.dentry = __dget(parent);
+		nd.mnt = find_mnt(parent);
+
+		mutex_lock(&parent->d_inode->i_mutex);
+		dentry = __lookup_hash_single(name, parent,
+					      __nd ? &nd : NULL);
+		mutex_unlock(&parent->d_inode->i_mutex);
+		if (IS_ERR(dentry)) {
+			err = PTR_ERR(dentry);
+			goto out;
+		}
+
+		if (!dentry->d_inode) {
+			__dput_single(dentry);
+			goto loop;
+		}
+
+		if (!S_ISDIR(dentry->d_inode->i_mode)) {
+			__dput_single(dentry);
+			err = 0;
+			goto out;
+		}
+
+		/* Now we know, we found something real */
+		follow_mount(&nd.mnt, &dentry);
+
+		do {
+			struct vfsmount *mnt = find_mnt(dentry);
+			UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
+				     dentry->d_name.name, dentry->d_inode,
+				     mnt->mnt_devname);
+			mntput(mnt);
+		} while (0);
+
+		if (last->d_overlaid && (last->d_overlaid != dentry)) {
+			printk(KERN_ERR "%s: strange stack layout " \
+			       "(\"%s\" overlays \"%s\")\n",
+			       __FUNCTION__, last->d_name.name,
+			       dentry->d_name.name);
+			dump_stack();
+			__dput_single(dentry);
+			/* lets try to make a clean ending */
+			last->d_overlaid = NULL;
+			err = -EFAULT;	// FIXME: something better?
+			goto out;
+		}
+
+		spin_lock(&topmost->d_lock);
+		if (!topmost->d_union) {
+			UM_DEBUG_LOCK("allocate union for \"%s\"\n",
+				      topmost->d_name.name);
+			topmost->d_union = union_alloc();
+		}
+		spin_unlock(&topmost->d_lock);
+
+		spin_lock(&dentry->d_lock);
+		if (!dentry->d_union)
+			dentry->d_union = union_get(topmost->d_union);
+		spin_unlock(&dentry->d_lock);
+
+		if (topmost->d_union != dentry->d_union) {
+			union_lock(dentry);
+			replace_union_info(topmost, dentry->d_union);
+		}
+
+		dentry->d_topmost = topmost;
+		last->d_overlaid = dentry;
+		last = dentry;
+	loop:
+		__dput(nd.dentry);
+		mntput(nd.mnt);
+		parent = parent->d_overlaid;
+	}
+
+	err = 0;
+	union_unlock(topmost);
+	return err;
+out:
+	__dput(nd.dentry);
+	mntput(nd.mnt);
+	union_unlock(topmost);
+	return err;
+}
+
+struct dentry * real_lookup_union(struct dentry *base, struct qstr *name,
+				  struct nameidata *__nd)
+{
+	struct dentry *parent;
+	struct dentry *topmost;
+	unsigned int hash = name->hash;
+	struct nameidata nd;
+	int err;
+
+	union_lock(base);
+	topmost = real_lookup_single(base, name, __nd);
+	if (IS_ERR(topmost))
+		goto out;
+
+	if (topmost->d_inode) {
+		parent = base;
+		goto lookup_union;
+	}
+
+	if (__nd) {
+		nd.last.name = NULL;	// handled in __link_path_walk
+		nd.last.len = 0;
+		nd.last.hash = 0;
+		nd.flags = __nd->flags;
+		nd.um_flags = 0;	// ditto
+		nd.last_type = -1;	// ditto
+		nd.depth = 0;		// handled in do_follow_link
+		memcpy(&nd.intent, &__nd->intent, sizeof(nd.intent));
+	}
+
+	parent = base->d_overlaid;
+	while (parent) {
+		struct dentry * dentry;
+
+		name->hash = hash;
+		nd.dentry = __dget(parent);
+		nd.mnt = find_mnt(parent);
+
+		dentry = real_lookup_single(nd.dentry, name, &nd);
+		__dput(nd.dentry);
+		mntput(nd.mnt);
+		if (IS_ERR(dentry))
+			goto out;
+
+		if (dentry->d_inode) {
+			__dput_single(topmost);
+			topmost = dentry;
+			goto lookup_union;
+		}
+		__dput_single(dentry);
+		parent = parent->d_overlaid;
+	}
+
+out:
+	union_unlock(base);
+	return topmost;
+
+lookup_union:
+	if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
+		goto out;
+
+	do {
+		struct vfsmount *mnt = find_mnt(topmost);
+		UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
+			     topmost->d_name.name, topmost->d_inode,
+			     mnt->mnt_devname);
+		mntput(mnt);
+	} while (0);
+
+	name->hash = hash;
+	err = __lookup_union(topmost, name, &nd);
+	if (err) {
+		dput(topmost);
+		topmost = ERR_PTR(err);
+	}
+
+	union_unlock(base);
+	return topmost;
+}
+
+struct dentry * __lookup_hash_union(struct qstr *name, struct dentry *parent,
+				    struct nameidata *__nd)
+{
+	struct dentry *topmost;
+	unsigned int hash = name->hash;
+	struct nameidata nd;
+	int err;
+
+	topmost =  __lookup_hash_single(name, parent, __nd);
+	if (IS_ERR(topmost))
+		goto out;
+
+	if (topmost->d_inode)
+		goto lookup_union;
+
+	if (__nd) {
+		nd.last.name = NULL;	// handled in __link_path_walk
+		nd.last.len = 0;
+		nd.last.hash = 0;
+		nd.flags = __nd->flags;
+		nd.um_flags = 0;	// ditto
+		nd.last_type = -1;	// ditto
+		nd.depth = 0;		// handled in do_follow_link
+		memcpy(&nd.intent, &__nd->intent, sizeof(nd.intent));
+	}
+
+	parent = parent->d_overlaid;
+	while (parent) {
+		struct dentry *dentry;
+
+		name->hash = hash;
+		nd.dentry = __dget(parent);
+		nd.mnt = find_mnt(parent);
+
+		mutex_lock(&parent->d_inode->i_mutex);
+		dentry = __lookup_hash_single(name, nd.dentry, &nd);
+		mutex_unlock(&parent->d_inode->i_mutex);
+		__dput(nd.dentry);
+		mntput(nd.mnt);
+		if (IS_ERR(dentry))
+			goto out;
+
+		if (dentry->d_inode) {
+			__dput_single(topmost);
+			topmost = dentry;
+			goto lookup_union;
+		}
+		__dput_single(dentry);
+		parent = parent->d_overlaid;
+	}
+
+out:
+	return topmost;
+
+lookup_union:
+	if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
+		goto out;
+
+	do {
+		struct vfsmount *mnt = find_mnt(topmost);
+		UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
+			     topmost->d_name.name, topmost->d_inode,
+			     mnt->mnt_devname);
+		mntput(mnt);
+	} while (0);
+
+	name->hash = hash;
+	err = __lookup_union(topmost, name, &nd);
+	if (err) {
+		dput(topmost);
+		topmost = ERR_PTR(err);
+	}
+
+	return topmost;
+}
+
+int follow_union_mount(struct vfsmount **mnt, struct dentry **dentry)
+{
+	int res = 0;
+
+	while ((*dentry)->d_topmost) {
+		struct dentry *d_tmp = dget((*dentry)->d_topmost);
+		struct vfsmount *m_tmp = find_mnt((*dentry)->d_topmost);
+
+		UM_DEBUG_UID("name=\"%s\", follow union from %s to %s\n",
+			     (*dentry)->d_name.name, (*mnt)->mnt_devname,
+			     m_tmp->mnt_devname);
+		mntput(*mnt);
+		*mnt = m_tmp;
+		dput(*dentry);
+		*dentry = d_tmp;
+		res = 1;
+	}
+
+	return res;
+}
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -294,9 +294,23 @@ extern void d_move(struct dentry *, stru
 
 /* appendix may either be NULL or be used for transname suffixes */
 extern struct dentry * d_lookup(struct dentry *, struct qstr *);
-extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
+extern struct dentry * d_lookup_single(struct dentry *, struct qstr *);
+extern struct dentry * __d_lookup_single(struct dentry *, struct qstr *);
 extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);
 
+#ifdef CONFIG_UNION_MOUNT
+extern struct dentry * __d_lookup_union(struct dentry *, struct qstr *);
+#endif
+
+static inline struct dentry * __d_lookup(struct dentry *parent, struct qstr *name)
+{
+#ifdef CONFIG_UNION_MOUNT
+       return __d_lookup_union(parent, name);
+#else
+       return __d_lookup_single(parent, name);
+#endif
+}
+
 /* validate "insecure" dentry pointer */
 extern int d_validate(struct dentry *, struct dentry *);
 
@@ -426,6 +440,7 @@ static inline int d_mountpoint(struct de
 extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *);
 extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
 extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);
+extern struct vfsmount *find_mnt(struct dentry *);
 
 extern int sysctl_vfs_cache_pressure;
 
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -19,6 +19,7 @@ struct nameidata {
 	struct vfsmount *mnt;
 	struct qstr	last;
 	unsigned int	flags;
+	unsigned int	um_flags;
 	int		last_type;
 	unsigned	depth;
 	char *saved_names[MAX_NESTED_LINKS + 1];
@@ -39,6 +40,9 @@ struct path {
  */
 enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 
+#define LAST_UNION             0x01
+#define LAST_LOWLEVEL          0x02
+
 /*
  * The bitmask for a lookup event:
  *  - follow links at the end
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -17,17 +17,44 @@
 #ifdef CONFIG_UNION_MOUNT
 
 #include <linux/fs_struct.h>
+#include <linux/dcache_union.h>
 
 /* namespace stuff used at mount time */
 extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
 extern void detach_mnt_union(struct vfsmount *, struct path *);
 
+/* lookup stuff */
+extern int follow_union_mount(struct vfsmount **, struct dentry **);
+extern struct dentry * real_lookup_union(struct dentry *, struct qstr *,
+					 struct nameidata *);
+extern struct dentry * __lookup_hash_union(struct qstr *, struct dentry *,
+					   struct nameidata *);
+
 #else	/* CONFIG_UNION_MOUNT */
 
 #define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
 #define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
+#define follow_union_mount(x,y) do { /* empty */ } while (0)
 
 #endif	/* CONFIG_UNION_MOUNT */
 
+static inline struct dentry * real_lookup(struct dentry *parent, struct qstr *name, struct nameidata *nd)
+{
+#ifdef CONFIG_UNION_MOUNT
+	return real_lookup_union(parent, name, nd);
+#else
+	return real_lookup_single(parent, name, nd);
+#endif
+}
+
+static inline struct dentry * __lookup_hash(struct qstr *name, struct dentry *base, struct nameidata *nd)
+{
+#ifdef CONFIG_UNION_MOUNT
+	return __lookup_hash_union(name, base, nd);
+#else
+	return __lookup_hash_single(name, base, nd);
+#endif
+}
+
 #endif	/* __KERNEL __ */
 #endif	/* __LINUX_UNION_H */


* [RFC][PATCH  9/15] Simple union-mount readdir
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (7 preceding siblings ...)
  2007-04-17 13:21 ` [RFC][PATCH 8/15] Union-mount lookup Bharata B Rao
@ 2007-04-17 13:22 ` Bharata B Rao
  2007-04-17 13:22 ` [RFC][PATCH 10/15] In-kernel file copy between union mounted filesystems Bharata B Rao
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:22 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <jblunck@suse.de>
Subject: Simple union-mount readdir

This is a very simple union mount readdir implementation. It modifies the
readdir routine to merge the entries of union mounted directories and
eliminate duplicates while walking the union stack.

FIXME: This patch needs to be reworked! At the moment it only works for
ext2/3 and tmpfs. Indexed directories of any kind that return d_off > i_size
do not work with it.
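
To illustrate the limitation: the readdir code in this patch appears to
treat the file position of the union directory as the concatenation of the
per-layer positions, using each layer's i_size as the length of that layer.
The following is a user-space sketch of that mapping under this assumption,
not kernel code; union_pos_to_layer() and its parameters are made up for
illustration. A filesystem whose d_off values exceed i_size would hand back
positions that such a mapping attributes to the next layer.

```c
#include <assert.h>

/*
 * Hypothetical helper: map a union-wide directory position to the layer
 * it falls into and the position local to that layer. sizes[0] is the
 * size of the topmost directory, sizes[1] the one below it, and so on.
 * Returns the layer index, or -1 if pos lies past the last layer.
 */
static int union_pos_to_layer(const long long *sizes, int nlayers,
			      long long pos, long long *local)
{
	long long base = 0;	/* sum of the sizes of all layers above */
	int i;

	assert(sizes && local);

	for (i = 0; i < nlayers; i++) {
		if (pos < base + sizes[i]) {
			*local = pos - base;
			return i;
		}
		base += sizes[i];
	}
	return -1;
}
```

With layer sizes 4096 and 1024, position 5000 maps to layer 1 at local
offset 904. An entry in the top layer that reports d_off = 5000 would be
misread in exactly this way.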

The directory entries are read starting from the top layer and are
kept in a cache. When the entries from the lower layers of the union
stack are read later, they are checked against the cache for duplicates
before being passed to user space. Reading a single directory can take
multiple readdir/getdents calls, but the union directory cache is not
maintained across these calls. Instead, on every call the previously read
entries are re-read into the cache, and newly read entries are compared
against them for duplicates before they are returned to user space.
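
The duplicate elimination described above can be sketched in user space as
follows. This is a simplified illustration, not the kernel code: the
struct layer type and the name_cache_*() and merge_layers() helpers are
hypothetical stand-ins for the patch's union cache and filldir wrappers,
and each directory is modelled as a plain array of names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one cached directory entry name. */
struct name_cache_entry {
	struct name_cache_entry *next;
	char *name;
};

/* One directory of the union stack, modelled as an array of names. */
struct layer {
	const char *const *names;
	int count;
};

static int name_cache_find(const struct name_cache_entry *head,
			   const char *name)
{
	for (; head; head = head->next)
		if (strcmp(head->name, name) == 0)
			return 1;
	return 0;
}

/* Prepend a copy of name; returns 0 on success, -1 if out of memory. */
static int name_cache_add(struct name_cache_entry **head, const char *name)
{
	struct name_cache_entry *e = malloc(sizeof(*e));
	size_t len = strlen(name) + 1;

	if (!e)
		return -1;
	e->name = malloc(len);
	if (!e->name) {
		free(e);
		return -1;
	}
	memcpy(e->name, name, len);
	e->next = *head;
	*head = e;
	return 0;
}

static void name_cache_free(struct name_cache_entry *head)
{
	while (head) {
		struct name_cache_entry *next = head->next;

		free(head->name);
		free(head);
		head = next;
	}
}

static int is_dot_or_dotdot(const char *name)
{
	return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/*
 * Merge the listings of all layers, layers[0] being the topmost one.
 * Returns the number of names stored in out[], or -1 on error
 * (out of memory or more than max_out distinct names).
 */
static int merge_layers(const struct layer *layers, int nlayers,
			const char **out, int max_out)
{
	struct name_cache_entry *cache = NULL;
	int n = 0, l, i;

	assert(layers && out);

	for (l = 0; l < nlayers; l++) {
		for (i = 0; i < layers[l].count; i++) {
			const char *name = layers[l].names[i];

			/* Lower layers never contribute "." and "..". */
			if (l > 0 && is_dot_or_dotdot(name))
				continue;
			/* Skip names an upper layer already returned. */
			if (l > 0 && name_cache_find(cache, name))
				continue;
			if (n == max_out || name_cache_add(&cache, name) < 0) {
				name_cache_free(cache);
				return -1;
			}
			out[n++] = name;
		}
	}
	name_cache_free(cache);
	return n;
}
```

For a top layer containing ".", "..", "a", "b" and a lower layer containing
".", "..", "b", "c", the merged listing is ".", "..", "a", "b", "c". Note
that the cache is built and freed within one call, which mirrors the
statement that it is not kept across readdir calls.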

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/aio.c              |    7 
 fs/file_table.c       |   14 +
 fs/readdir.c          |    2 
 fs/union.c            |  393 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/union.h |   17 ++
 5 files changed, 426 insertions(+), 7 deletions(-)

--- a/fs/aio.c
+++ b/fs/aio.c
@@ -21,6 +21,7 @@
 
 #include <linux/sched.h>
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
@@ -487,6 +488,12 @@ static void aio_fput_routine(struct work
 
 		/* Complete the fput */
 		__fput(req->ki_filp);
+		/*
+		 * __fput no longer releases the dentry and vfsmnt because of
+		 * union mount. Hence do this manually.
+		 */
+		dput(req->ki_filp->f_path.dentry);
+		mntput(req->ki_filp->f_path.mnt);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -141,8 +141,14 @@ EXPORT_SYMBOL(get_empty_filp);
 
 void fastcall fput(struct file *file)
 {
-	if (atomic_dec_and_test(&file->f_count))
+	struct dentry *dentry = file->f_path.dentry;
+	struct vfsmount *mnt = file->f_path.mnt;
+
+	if (atomic_dec_and_test(&file->f_count)) {
 		__fput(file);
+		dput(dentry);
+		mntput(mnt);
+	}
 }
 
 EXPORT_SYMBOL(fput);
@@ -152,9 +158,7 @@ EXPORT_SYMBOL(fput);
  */
 void fastcall __fput(struct file *file)
 {
-	struct dentry *dentry = file->f_path.dentry;
-	struct vfsmount *mnt = file->f_path.mnt;
-	struct inode *inode = dentry->d_inode;
+	struct inode *inode = file->f_path.dentry->d_inode;
 
 	might_sleep();
 
@@ -180,8 +184,6 @@ void fastcall __fput(struct file *file)
 	file->f_path.dentry = NULL;
 	file->f_path.mnt = NULL;
 	file_free(file);
-	dput(dentry);
-	mntput(mnt);
 }
 
 struct file fastcall *fget(unsigned int fd)
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -33,7 +33,7 @@ int vfs_readdir(struct file *file, filld
 	mutex_lock(&inode->i_mutex);
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
-		res = file->f_op->readdir(file, buf, filler);
+		res = do_readdir(file, buf, filler);
 		file_accessed(file);
 	}
 	mutex_unlock(&inode->i_mutex);
--- a/fs/union.c
+++ b/fs/union.c
@@ -14,6 +14,7 @@
 #include <linux/namei.h>
 #include <linux/module.h>
 #include <linux/mount.h>
+#include <linux/file.h>
 
 void
 __union_check(struct dentry *dentry)
@@ -962,3 +963,395 @@ int follow_union_mount(struct vfsmount *
 
 	return res;
 }
+
+
+/*
+ * Union mounts support for readdir.
+ */
+
+/* This is a copy from fs/readdir.c */
+struct getdents_callback {
+	struct linux_dirent __user *current_dir;
+	struct linux_dirent __user *previous;
+	int count;
+	int error;
+};
+
+/*
+ * The readdir union cache object
+ */
+struct union_cache_entry {
+	struct list_head list;
+	struct qstr name;
+};
+
+static int union_cache_add_entry(struct list_head *list,
+				 const char *name, int namelen)
+{
+	struct union_cache_entry *this;
+	char *tmp_name;
+
+	this = kmalloc(sizeof(*this), GFP_KERNEL);
+	if (!this) {
+		printk(KERN_CRIT
+		       "union_cache_add_entry(): out of kernel memory\n");
+		return -ENOMEM;
+	}
+
+	tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
+	if (!tmp_name) {
+		printk(KERN_CRIT
+		       "union_cache_add_entry(): out of kernel memory\n");
+		kfree(this);
+		return -ENOMEM;
+	}
+
+	this->name.name = tmp_name;
+	this->name.len = namelen;
+	this->name.hash = 0;
+	memcpy(tmp_name, name, namelen);
+	tmp_name[namelen] = 0;
+	INIT_LIST_HEAD(&this->list);
+	list_add(&this->list, list);
+	return 0;
+}
+
+static void union_cache_free(struct list_head *uc_list)
+{
+	struct list_head *p;
+	struct list_head *ptmp;
+	int count = 0;
+
+	list_for_each_safe(p, ptmp, uc_list) {
+		struct union_cache_entry *this;
+
+		this = list_entry(p, struct union_cache_entry, list);
+		list_del_init(&this->list);
+		kfree(this->name.name);
+		kfree(this);
+		count++;
+	}
+	UM_DEBUG_READDIR("freed %d entries\n", count);
+	return;
+}
+
+static int union_cache_find_entry(struct list_head *uc_list,
+				  const char *name, int namelen)
+{
+	struct union_cache_entry *p;
+	int ret = 0;
+
+	list_for_each_entry(p, uc_list, list) {
+		if (p->name.len != namelen)
+			continue;
+		if (strncmp(p->name.name, name, namelen) == 0) {
+			ret = 1;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Four filldir() wrappers are necessary for the union mount readdir
+ * implementation:
+ *
+ * - filldir_topmost(): fills the union's readdir cache and the user space
+ *			buffer. This is only used for the topmost directory
+ *			in the union stack.
+ * - filldir_topmost_cacheonly(): only fills the union's readdir cache.
+ *			This is only used for the topmost directory in the
+ *			union stack.
+ * - filldir_overlaid(): fills the union's readdir cache and the user space
+ *			buffer. This is only used for directories on the
+ *			stack's lower layers.
+ * - filldir_overlaid_cacheonly(): only fills the union's readdir cache.
+ *			This is only used for directories on the stack's
+ *			lower layers.
+ */
+
+struct union_cache_callback {
+	struct getdents_callback *buf;	/* original getdents_callback */
+	struct list_head list;		/* list of union cache entries */
+	filldir_t filler;		/* the filldir() we should call */
+	loff_t offset;			/* base offset of our dirents */
+	loff_t count;			/* maximum number of bytes to "read" */
+};
+
+static int filldir_topmost(void *_buf, const char *name, int namlen,
+			   loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct union_cache_callback *cb = _buf;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return cb->filler(cb->buf, name, namlen, cb->offset + offset, ino,
+			  d_type);
+}
+
+static int filldir_topmost_cacheonly(void *_buf, const char *name, int namlen,
+				     loff_t offset, u64 ino,
+				     unsigned int d_type)
+{
+	struct union_cache_callback *cb = _buf;
+
+	if (offset > cb->count)
+		return -EINVAL;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return 0;
+}
+
+static int filldir_overlaid(void *_buf, const char *name, int namlen,
+			    loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct union_cache_callback *cb = _buf;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (union_cache_find_entry(&cb->list, name, namlen))
+		return 0;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return cb->filler(cb->buf, name, namlen, cb->offset + offset, ino,
+			  d_type);
+}
+
+static int filldir_overlaid_cacheonly(void *_buf, const char *name, int namlen,
+				      loff_t offset, u64 ino,
+				      unsigned int d_type)
+{
+	struct union_cache_callback *cb = _buf;
+
+	if (offset > cb->count)
+		return -EINVAL;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (union_cache_find_entry(&cb->list, name, namlen))
+		return 0;
+
+	union_cache_add_entry(&cb->list, name, namlen);
+	return 0;
+}
+
+static void fastcall fput_union(struct file *file)
+{
+	struct dentry *dentry = file->f_dentry;
+	struct vfsmount *mnt = file->f_vfsmnt;
+
+	if (atomic_dec_and_test(&file->f_count)) {
+		__fput(file);
+		__dput(dentry);
+		mntput(mnt);
+	}
+}
+
+static struct file * __dentry_open_read(struct dentry *dentry,
+					struct vfsmount *mnt, int flags)
+{
+	struct file *f;
+	struct inode *inode;
+	int error;
+
+	error = -ENFILE;
+	f = get_empty_filp();
+	if (!f)
+		goto out;
+	f->f_flags = flags;
+	f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
+		FMODE_PREAD | FMODE_PWRITE;
+	inode = dentry->d_inode;
+	BUG_ON(f->f_mode & FMODE_WRITE);
+	f->f_mapping = inode->i_mapping;
+	f->f_dentry = dentry;
+	f->f_vfsmnt = mnt;
+	f->f_pos = 0;
+	f->f_op = fops_get(inode->i_fop);
+	file_move(f, &inode->i_sb->s_files);
+
+	if (f->f_op && f->f_op->open) {
+		error = f->f_op->open(inode,f);
+		if (error)
+			goto cleanup;
+	}
+	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+
+	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
+
+	/* NB: we're sure to have correct a_ops only after f_op->open */
+	if (f->f_flags & O_DIRECT) {
+		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO) {
+			fput_union(f);
+			f = ERR_PTR(-EINVAL);
+		}
+	}
+
+	return f;
+
+cleanup:
+	fops_put(f->f_op);
+	file_kill(f);
+	f->f_dentry = NULL;
+	f->f_vfsmnt = NULL;
+	put_filp(f);
+out:
+	return ERR_PTR(error);
+}
+
+/*
+ * readdir_union_cache - A helper to fill the readdir cache
+ */
+static int readdir_union_cache(struct file *file, void *_buf, filldir_t filler)
+{
+	struct union_cache_callback *cb = _buf;
+	int old_count;
+	loff_t old_pos;
+	int res;
+
+	old_count = cb->count;
+	cb->count = ((file->f_pos > i_size_read(file->f_dentry->d_inode)) ?
+		      i_size_read(file->f_dentry->d_inode) :
+		      file->f_pos) & INT_MAX;
+	old_pos = file->f_pos;
+	file->f_pos = 0;
+	UM_DEBUG_READDIR("count=%lld\n", cb->count);
+	res = file->f_op->readdir(file, _buf, filler);
+	file->f_pos = old_pos;
+	cb->count = old_count;
+	return res;
+}
+
+/**
+ * readdir_union - A wrapper around ->readdir()
+ *
+ * This is a wrapper around the filesystem's readdir() that walks
+ * the union stack and calls ->readdir() for every directory in the stack.
+ * The directory entries are read into the union mount's readdir cache to
+ * support whiteouts and duplicate removal.
+ */
+int readdir_union(struct file *file, void *_buf, filldir_t filler)
+{
+	int res = -ENOENT;
+	struct union_cache_callback cb;
+	struct dentry *dentry;
+	struct getdents_callback *buf = _buf;
+	loff_t offset = 0;
+
+	INIT_LIST_HEAD(&cb.list);
+	cb.buf = _buf;
+	cb.filler = filler;
+	cb.offset = 0;
+	offset = i_size_read(file->f_dentry->d_inode);
+	cb.count = file->f_pos;
+	UM_DEBUG_READDIR("file=%s, f_pos=%lld, i_size=%lld\n",
+			 file->f_dentry->d_name.name,
+			 file->f_pos,
+			 offset);
+
+	if (file->f_pos > 0) {
+		/* we have already read from this dir,
+		 * lets read that stuff to our union-cache
+		 * only */
+		res = readdir_union_cache(file, &cb,
+					  filldir_topmost_cacheonly);
+		if (res)
+			goto out;
+	}
+
+	if (file->f_pos < offset) {
+		res = file->f_op->readdir(file, &cb,
+					  filldir_topmost);
+		if (res)
+			goto out;
+	}
+
+	dentry = file->f_dentry->d_overlaid;
+	while (dentry) {
+		struct vfsmount *mnt;
+		struct file *ftmp;
+		struct inode *inode;
+
+		if (buf->count <= 0)
+			break;
+
+		mnt = find_mnt(dentry);
+		__dget(dentry);
+		ftmp = __dentry_open_read(dentry, mnt, file->f_flags);
+		if (IS_ERR(ftmp)) {
+			__dput(dentry);
+			mntput(mnt);
+			res = PTR_ERR(ftmp);
+			break;
+		}
+
+		inode = dentry->d_inode;
+		mutex_lock(&inode->i_mutex);
+
+		/* rearrange the file position */
+		cb.offset += offset;
+		offset = i_size_read(inode);
+		ftmp->f_pos = file->f_pos - cb.offset;
+		cb.count = ftmp->f_pos;
+		if (ftmp->f_pos < 0) {
+			mutex_unlock(&inode->i_mutex);
+			fput_union(ftmp);
+			break;
+		}
+
+		UM_DEBUG_READDIR("ftmp=%s, f_pos=%lld, i_size=%lld\n",
+				 ftmp->f_dentry->d_name.name,
+				 ftmp->f_pos,
+				 offset);
+
+		res = -ENOENT;
+		if (IS_DEADDIR(inode))
+			goto out_fput;
+
+		if (ftmp->f_pos > 0) {
+			/* we have already read from this dir,
+			 * lets read that stuff to our union-cache
+			 * only */
+			res = readdir_union_cache(ftmp, &cb,
+						  filldir_overlaid_cacheonly);
+			if (res)
+				goto out_fput;
+		}
+
+		if (ftmp->f_pos < offset) {
+			res = ftmp->f_op->readdir(ftmp, &cb,
+						  filldir_overlaid);
+			file->f_pos += ftmp->f_pos;
+		}
+
+		file_accessed(ftmp);
+
+	out_fput:
+		mutex_unlock(&inode->i_mutex);
+		fput_union(ftmp);
+
+		if (res)
+			break;
+
+		dentry = dentry->d_overlaid;
+	}
+out:
+	union_cache_free(&cb.list);
+	return res;
+}
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -29,6 +29,7 @@ extern struct dentry * real_lookup_union
 					 struct nameidata *);
 extern struct dentry * __lookup_hash_union(struct qstr *, struct dentry *,
 					   struct nameidata *);
+extern int readdir_union(struct file *, void *, filldir_t);
 
 #else	/* CONFIG_UNION_MOUNT */
 
@@ -56,5 +57,21 @@ static inline struct dentry * __lookup_h
 #endif
 }
 
+static inline int do_readdir(struct file *file, void *buf, filldir_t filler)
+{
+#ifdef CONFIG_UNION_MOUNT
+	int res;
+
+	if (file->f_dentry->d_overlaid) {
+		union_lock(file->f_dentry);
+		res = readdir_union(file, buf, filler);
+		union_unlock(file->f_dentry);
+	} else
+#endif
+		res = file->f_op->readdir(file, buf, filler);
+
+	return res;
+}
+
 #endif	/* __KERNEL __ */
 #endif	/* __LINUX_UNION_H */


* [RFC][PATCH 10/15] In-kernel file copy between union mounted filesystems
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (8 preceding siblings ...)
  2007-04-17 13:22 ` [RFC][PATCH 9/15] Simple union-mount readdir Bharata B Rao
@ 2007-04-17 13:22 ` Bharata B Rao
  2007-04-17 13:23 ` [RFC][PATCH 11/15] VFS whiteout handling Bharata B Rao
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:22 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: In-kernel file copy between union mounted filesystems

This patch introduces in-kernel file copy between union mounted
filesystems. When a file is opened for writing but resides on a lower (and
thus read-only) layer of the union stack, it is first copied to the topmost
union layer.

The in-kernel file copy is done with do_splice_direct().
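
As a rough user-space analogue of the copy-up behaviour described above,
consider the sketch below. It is only an illustration under stated
assumptions: union_open() and copy_up() are hypothetical names, the "union"
consists of exactly two explicit paths, and the data is moved with a plain
read()/write() loop rather than do_splice_direct(). Like the patch, it only
handles regular files.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the regular file at "lower" to a new file at
 * "upper", keeping its permission bits. Returns 0 on success, -1 on error.
 */
static int copy_up(const char *lower, const char *upper)
{
	char buf[4096];
	struct stat st;
	ssize_t n;
	int in, out, ret = -1;

	assert(lower && upper);

	in = open(lower, O_RDONLY);
	if (in < 0)
		return -1;
	if (fstat(in, &st) < 0 || !S_ISREG(st.st_mode))
		goto close_in;

	out = open(upper, O_WRONLY | O_CREAT | O_EXCL, st.st_mode & 07777);
	if (out < 0) {
		perror("copy_up: create upper file");
		goto close_in;
	}

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		/* write() may be partial, so loop until the chunk is out */
		while (off < n) {
			ssize_t w = write(out, buf + off, n - off);

			if (w < 0)
				goto close_out;
			off += w;
		}
	}
	if (n == 0)
		ret = 0;
close_out:
	close(out);
close_in:
	close(in);
	return ret;
}

/*
 * Hypothetical helper: open a file that may live on either layer. A file
 * found only on the lower layer is copied up before it is opened for
 * writing; read-only opens use the lower file directly.
 */
static int union_open(const char *upper, const char *lower, int flags)
{
	int wants_write = (flags & O_ACCMODE) != O_RDONLY;

	if (access(upper, F_OK) == 0)
		return open(upper, flags);
	if (!wants_write)
		return open(lower, flags);
	if (copy_up(lower, upper) < 0)
		return -1;
	return open(upper, flags);
}
```

Opening with O_RDONLY leaves the upper layer untouched, while opening with
O_WRONLY or O_RDWR creates the upper copy first and returns a descriptor
for it, so the lower file is never modified.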

Signed-off-by: Bharata B Rao <bharata@in.ibm.com>
Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
---
 fs/namei.c            |   46 +++++
 fs/union.c            |  384 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/namei.h |    2 
 include/linux/union.h |   14 +
 4 files changed, 445 insertions(+), 1 deletion(-)

--- a/fs/namei.c
+++ b/fs/namei.c
@@ -830,8 +830,17 @@ done:
 	path->mnt = mnt;
 	path->dentry = dentry;
 
-	if (nd->dentry->d_sb != dentry->d_sb)
+	/*
+	 * This should be checked after the following of unions.
+	 * Otherwise we might run into trouble creating directories
+	 * on mountpoints. :(
+	 * But maybe we shouldn't set the LAST_LOWLEVEL flag here
+	 * at all ... */
+	if (nd->dentry->d_sb != dentry->d_sb) {
 		path->mnt = find_mnt(dentry);
+		UM_DEBUG_UID("Setting LAST_LOWLEVEL for %s\n", name->name);
+		nd->um_flags |= LAST_LOWLEVEL;
+	}
 
 	__follow_mount(path);
 	follow_union_mount(&path->mnt, &path->dentry);
@@ -950,6 +959,14 @@ static fastcall int __link_path_walk(con
 		if (err)
 			break;
 
+		if ((nd->flags & LOOKUP_TOPMOST) &&
+		    (nd->um_flags & LAST_LOWLEVEL)) {
+			err = union_create_topdir(nd,&next.dentry,&next.mnt);
+			if (err)
+				goto out_dput;
+			nd->um_flags &= ~LAST_LOWLEVEL;
+		}
+
 		err = -ENOENT;
 		inode = next.dentry->d_inode;
 		if (!inode)
@@ -1005,6 +1022,15 @@ last_component:
 		err = do_lookup(nd, &this, &next);
 		if (err)
 			break;
+
+		if ((nd->flags & LOOKUP_TOPMOST) &&
+		    (nd->um_flags & LAST_LOWLEVEL)) {
+			err = union_create_topdir(nd,&next.dentry,&next.mnt);
+			if (err)
+				goto out_dput;
+			nd->um_flags &= ~LAST_LOWLEVEL;
+		}
+
 		inode = next.dentry->d_inode;
 		if ((lookup_flags & LOOKUP_FOLLOW)
 		    && inode && inode->i_op && inode->i_op->follow_link) {
@@ -1177,6 +1203,7 @@ static int fastcall do_path_lookup(int d
 
 	nd->last_type = LAST_ROOT; /* if there are only slashes... */
 	nd->flags = flags;
+	nd->um_flags = 0;
 	nd->depth = 0;
 
 	if (*name=='/') {
@@ -1721,9 +1748,18 @@ int open_namei(int dfd, const char *path
 					 nd, flag);
 		if (error)
 			return error;
+		/* test for WRONLY and RDWR - flag's special lower bits */
+		if (flag & 0x2) {
+			UM_DEBUG_UID("\"%s\" opened for writing\n", pathname);
+			error = union_copyup(nd, flag);
+			if (error)
+				return error;
+		}
 		goto ok;
 	}
 
+	UM_DEBUG_UID("open called with O_CREAT\n");
+
 	/*
 	 * Create - we need to know the parent.
 	 */
@@ -1740,6 +1776,8 @@ int open_namei(int dfd, const char *path
 	if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
 		goto exit;
 
+	UM_DEBUG_UID("do_last now\n");
+
 	dir = nd->dentry;
 	nd->flags &= ~LOOKUP_PARENT;
 	mutex_lock(&dir->d_inode->i_mutex);
@@ -1793,6 +1831,12 @@ do_last:
 	error = -EISDIR;
 	if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
 		goto exit;
+
+	if (flag & 0x2) {
+		error = union_copyup(nd, flag);
+		if (error)
+			goto exit;
+	}
 ok:
 	error = may_open(nd, acc_mode, flag);
 	if (error)
--- a/fs/union.c
+++ b/fs/union.c
@@ -15,6 +15,11 @@
 #include <linux/module.h>
 #include <linux/mount.h>
 #include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/quotaops.h>
+#include <linux/dnotify.h>
+#include <linux/security.h>
+#include <linux/pipe_fs_i.h>
 
 void
 __union_check(struct dentry *dentry)
@@ -302,6 +307,53 @@ void __dput_union(struct dentry *dentry)
 	return;
 }
 
+/*
+ * union_relookup_topmost - lookup and create the topmost path to dentry
+ * @nd: pointer to nameidata
+ * @flags: lookup flags
+ */
+int union_relookup_topmost(struct nameidata *nd, int flags)
+{
+	int err;
+	char *kbuf, *name;
+	struct nameidata this;
+
+	UM_DEBUG_UID("relookup the topmost dir for %s\n",
+		     nd->dentry->d_name.name);
+
+	kbuf = (char *)__get_free_page(GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	name = d_path(nd->dentry, nd->mnt, kbuf, PAGE_SIZE);
+	err = PTR_ERR(name);
+	if (IS_ERR(name))
+		goto free_page;
+
+	err = path_lookup(name, flags|LOOKUP_CREATE|LOOKUP_TOPMOST, &this);
+	if (err)
+		goto free_page;
+
+	path_release(nd);
+	nd->dentry = this.dentry;
+	nd->mnt = this.mnt;
+
+	/* If we are looking up the parent, copy the child details also */
+	if (flags & LOOKUP_PARENT) {
+		nd->last = this.last;
+		nd->last_type = this.last_type;
+	}
+
+	/*
+	 * the nd->flags should be unchanged
+	 */
+	BUG_ON(this.um_flags & LAST_LOWLEVEL);
+	nd->um_flags &= ~LAST_LOWLEVEL;
+ free_page:
+	free_page((unsigned long)kbuf);
+	return err;
+}
+
 void attach_mnt_union(struct vfsmount *mnt, struct nameidata *nd)
 {
 	struct dentry *tmp;
@@ -1355,3 +1407,335 @@ out:
 	union_cache_free(&cb.list);
 	return res;
 }
+
+/*
+ * Union mount copyup support
+ */
+
+/*
+ * Just do what vfs_create() would do, but the union mount way
+ */
+static struct dentry * union_create(struct dentry *parent, struct dentry *old,
+				    struct nameidata *nd)
+{
+	struct dentry *dentry;
+	int err, mode;
+
+	dentry = __lookup_hash_single(&old->d_name, parent, NULL);
+	if (IS_ERR(dentry))
+		goto exit;
+
+	err = -EEXIST;
+	if (dentry->d_inode) {
+		dput(dentry);
+		goto error;
+	}
+
+	err = -ENOENT;
+	if (IS_DEADDIR(parent->d_inode))
+		goto error;
+	err = -EACCES;	/* shouldn't it be ENOSYS? */
+	if (!parent->d_inode->i_op || !parent->d_inode->i_op->create)
+		goto error;
+
+	mode = old->d_inode->i_mode & S_IALLUGO;
+	mode |= S_IFREG;
+
+	err = security_inode_create(parent->d_inode, dentry, mode);
+	if (err)
+		goto error;
+
+	DQUOT_INIT(parent->d_inode);
+	err = parent->d_inode->i_op->create(parent->d_inode, dentry, mode, nd);
+	if (err)
+		goto error;
+
+	dentry->d_inode->i_uid = old->d_inode->i_uid;
+	dentry->d_inode->i_gid = old->d_inode->i_gid;
+	mark_inode_dirty(dentry->d_inode);
+exit:
+	return dentry;
+error:
+	return ERR_PTR(err);
+}
+
+/*
+ * Just do what vfs_mkdir() would do, but the union mount way
+ */
+static struct dentry * union_mkdir(struct dentry *parent, struct dentry *dir)
+{
+	struct dentry *dentry;
+	int err, mode;
+
+	dentry = __lookup_hash_single(&dir->d_name, parent, NULL);
+	if (IS_ERR(dentry))
+		goto exit;
+
+	err = -EEXIST;
+	if (dentry->d_inode) {
+		dput(dentry);
+		goto error;
+	}
+
+	err = -ENOENT;
+	if (IS_DEADDIR(parent->d_inode))
+		goto error;
+	err = -EPERM;
+	if (!parent->d_inode->i_op || !parent->d_inode->i_op->mkdir)
+		goto error;
+
+	mode = dir->d_inode->i_mode & (S_IRWXUGO|S_ISVTX);
+
+	err = security_inode_mkdir(parent->d_inode, dentry, mode);
+	if (err)
+		goto error;
+
+	DQUOT_INIT(parent->d_inode);
+	err = parent->d_inode->i_op->mkdir(parent->d_inode, dentry, mode);
+	if (err)
+		goto error;
+
+	dentry->d_inode->i_uid = dir->d_inode->i_uid;
+	dentry->d_inode->i_gid = dir->d_inode->i_gid;
+	mark_inode_dirty(dentry->d_inode);
+exit:
+	return dentry;
+error:
+	return ERR_PTR(err);
+}
+
+static void __update_fs_pwd(struct dentry *old, struct dentry *new)
+{
+	struct dentry *old_pwd = NULL;
+	struct vfsmount *old_pwdmnt = NULL;
+	struct vfsmount *new_pwdmnt = find_mnt(new);
+
+	write_lock(&current->fs->lock);
+	if (current->fs->pwd == old) {
+		old_pwd = current->fs->pwd;
+		old_pwdmnt = current->fs->pwdmnt;
+		current->fs->pwdmnt = mntget(new_pwdmnt);
+		current->fs->pwd = __dget(new);
+		UM_DEBUG_UID("replacing fs->pwd\n");
+		UM_DEBUG_UID("oldpwd: name=\"%s\", inode=%p, devname=%s\n",
+			     old_pwd->d_name.name, old_pwd->d_inode,
+			     old_pwdmnt->mnt_devname);
+		UM_DEBUG_UID("newpwd: name=\"%s\", inode=%p, devname=%s\n",
+			     new->d_name.name, new->d_inode,
+			     new_pwdmnt->mnt_devname);
+	}
+	write_unlock(&current->fs->lock);
+
+	if (old_pwd) {
+		__dput(old_pwd);
+		mntput(old_pwdmnt);
+	}
+
+	mntput(new_pwdmnt);
+
+	return;
+}
+
+struct dentry * union_create_topmost(struct nameidata *nd, struct dentry *old)
+{
+	struct dentry *dentry;
+	struct dentry *parent = nd->dentry;
+
+	UM_DEBUG_UID("dentry=%s\n", old->d_name.name);
+
+	BUG_ON(parent->d_sb == old->d_sb);
+	if (!S_ISREG(old->d_inode->i_mode)) {
+		UM_DEBUG("This filetype isn't supported!\n");
+		dentry = ERR_PTR(-EINVAL);
+		goto exit;
+	}
+
+	/*
+	 * Create the topmost regular file here.
+	 */
+	mutex_lock(&parent->d_inode->i_mutex);
+	dentry = union_create(parent, old, nd);
+	mutex_unlock(&parent->d_inode->i_mutex);
+	if (IS_ERR(dentry)) {
+		UM_DEBUG("some error occurred\n");
+		goto exit;
+	}
+
+exit:
+	return dentry;
+}
+
+int union_create_topdir(struct nameidata *nd,
+			struct dentry **dentry, struct vfsmount **mnt)
+{
+	struct dentry *topdir;
+	struct dentry *parent = nd->dentry;
+
+	UM_DEBUG_UID("dentry=%s\n", (*dentry)->d_name.name);
+
+	if (parent->d_sb == (*dentry)->d_sb)
+		return 0;
+
+	if (!S_ISDIR((*dentry)->d_inode->i_mode)) {
+		UM_DEBUG("Unsupported filetype!\n");
+		BUG();
+	}
+
+	/*
+	 * Create the topmost directory here.
+	 */
+	spin_lock(&(*dentry)->d_lock);
+	if (!(*dentry)->d_union) {
+		UM_DEBUG_LOCK("Allocate lock for \"%s\"\n",
+			      (*dentry)->d_name.name);
+		(*dentry)->d_union = union_alloc();
+		spin_unlock(&(*dentry)->d_lock);
+	} else {
+		spin_unlock(&(*dentry)->d_lock);
+		union_lock(*dentry);
+	}
+	mutex_lock(&parent->d_inode->i_mutex);
+	topdir = union_mkdir(parent, *dentry);
+	if (IS_ERR(topdir)) {
+		UM_DEBUG("some error occurred\n");
+		mutex_unlock(&parent->d_inode->i_mutex);
+		union_unlock(*dentry);
+		return PTR_ERR(topdir);
+	}
+
+	spin_lock(&topdir->d_lock);
+	if (topdir->d_union) {
+		UM_DEBUG("Aaargh! topdir \"%s\" already has a lock?!\n",
+			 topdir->d_name.name);
+		dump_stack();
+	}
+	topdir->d_union = union_get((*dentry)->d_union);
+	spin_unlock(&topdir->d_lock);
+	append_to_stack(topdir, *dentry);
+	__update_fs_pwd(*dentry, topdir);
+	*dentry = topdir;
+	mutex_unlock(&parent->d_inode->i_mutex);
+	union_unlock(*dentry);
+
+	if (nd->mnt != *mnt) {
+		mntput(*mnt);
+		*mnt = mntget(nd->mnt);
+	}
+
+	return 0;
+}
+
+int union_copy_file(struct dentry *old_dentry, struct vfsmount *old_mnt,
+		    struct dentry *new_dentry, struct vfsmount *new_mnt)
+{
+	int ret;
+	loff_t size;
+	loff_t offset;
+	struct file *old_file, *new_file;
+
+	dget(old_dentry);
+	mntget(old_mnt);
+	old_file = dentry_open(old_dentry, old_mnt, O_RDONLY);
+	if (IS_ERR(old_file))
+		return PTR_ERR(old_file);
+
+	dget(new_dentry);
+	mntget(new_mnt);
+	new_file = dentry_open(new_dentry, new_mnt, O_WRONLY);
+	ret = PTR_ERR(new_file);
+	if (IS_ERR(new_file))
+		goto fput_old;
+
+	size = i_size_read(old_file->f_path.dentry->d_inode);
+	if (((size_t)size != size) || ((ssize_t)size != size)) {
+		ret = -EFBIG;
+		goto fput_new;
+	}
+
+	offset = 0;
+	ret = do_splice_direct(old_file, &offset, new_file, size,
+			       SPLICE_F_MOVE);
+	if (ret >= 0)
+		ret = 0;
+ fput_new:
+	fput(new_file);
+ fput_old:
+	fput(old_file);
+	return ret;
+}
+
+/**
+ * union_copyup - copy a file to the topmost layer of the union stack
+ * @nd: nameidata pointer to the file
+ * @flags: flags given to open_namei
+ */
+int union_copyup(struct nameidata *nd, int flags)
+{
+	struct dentry *dir;
+	struct dentry *dentry;
+	int err;
+
+	if (!union_is_member(nd->dentry, nd->mnt))
+		return 0;
+	if (!S_ISREG(nd->dentry->d_inode->i_mode))
+		return 0;
+
+	err = union_relookup_topmost(nd, nd->flags|LOOKUP_PARENT);
+	if (err)
+		return err;
+
+	dir = nd->dentry;
+	nd->flags &= ~LOOKUP_PARENT;
+	union_lock(nd->dentry);
+	mutex_lock(&dir->d_inode->i_mutex);
+	dentry = __lookup_hash_union(&nd->last, nd->dentry, nd);
+	err = PTR_ERR(dentry);
+	if (IS_ERR(dentry)) {
+		mutex_unlock(&dir->d_inode->i_mutex);
+		union_unlock(nd->dentry);
+		return err;
+	}
+
+	mutex_unlock(&dir->d_inode->i_mutex);
+	union_unlock(nd->dentry);
+
+	err = -ENOENT;
+	if (!dentry->d_inode)
+		goto exit_dput;
+
+	follow_mount(&nd->mnt, &dentry);
+
+	err = -ENOENT;
+	if (!dentry->d_inode)
+		goto exit_dput;
+
+	if (dentry->d_parent != dir) {
+		struct dentry *tmp;
+		struct vfsmount *old_mnt;
+
+		UM_DEBUG_UID("already exists -> copy file\n");
+		tmp = union_create_topmost(nd, dentry);
+		if (IS_ERR(tmp))
+			goto exit_dput;
+
+		old_mnt = find_mnt(dentry);
+		err = union_copy_file(dentry, old_mnt, tmp, nd->mnt);
+		if (err) {
+			int ret = vfs_unlink(tmp->d_inode, tmp);
+			BUG_ON(ret);
+			/* FIXME: not sure if there are return values
+			 * we should not BUG() on */
+		}
+		dput(dentry);
+		dentry = tmp;
+		mntput(old_mnt);
+	}
+
+	dput(nd->dentry);
+	nd->dentry = dentry;
+	return 0;
+
+exit_dput:
+	dput(dentry);
+	return err;
+}
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -58,6 +58,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
 #define LOOKUP_PARENT		16
 #define LOOKUP_NOALT		32
 #define LOOKUP_REVAL		64
+#define LOOKUP_TOPMOST	       128
+#define LOOKUP_WHT	       256
 /*
  * Intent data
  */
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -31,11 +31,25 @@ extern struct dentry * __lookup_hash_uni
 					   struct nameidata *);
 extern int readdir_union(struct file *, void *, filldir_t);
 
+/* copy-up support */
+extern struct dentry * union_create_topmost(struct nameidata *, struct dentry *);
+extern int union_create_topdir(struct nameidata *, struct dentry **, struct vfsmount **);
+extern int union_is_member(struct dentry *, struct vfsmount *);
+extern int union_copy_file(struct dentry *, struct vfsmount *, struct dentry *, struct vfsmount *);
+extern int union_copyup(struct nameidata *, int);
+extern int union_relookup_topmost(struct nameidata *, int);
+
 #else	/* CONFIG_UNION_MOUNT */
 
 #define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
 #define detach_mnt_union(mnt,nd) do { /* empty */ } while (0)
 #define follow_union_mount(x,y) do { /* empty */ } while (0)
+#define union_create_topmost(x,y) ({ BUG(); ERR_PTR(-EINVAL); })
+#define union_create_topdir(x,y,z) ({ (0); })
+#define union_is_member(x,y) ({ (0); })
+#define union_copy_file(dentry1,mnt1,dentry2,mnt2) ({ (0); })
+#define union_copyup(x,y) ({ (0); })
+#define union_relookup_topmost(x,y) ({ (0); })
 
 #endif	/* CONFIG_UNION_MOUNT */
 

* [RFC][PATCH 11/15] VFS whiteout handling
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (9 preceding siblings ...)
  2007-04-17 13:22 ` [RFC][PATCH 10/15] In-kernel file copy between union mounted filesystems Bharata B Rao
@ 2007-04-17 13:23 ` Bharata B Rao
  2007-04-17 13:23 ` [RFC][PATCH 12/15] ext2 whiteout support Bharata B Rao
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: VFS whiteout handling

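Introduce whiteout handling in the VFS. Path lookup treats a whiteout as a
nonexistent name, the create-type vfs_*() calls replace an existing whiteout,
and unlink/rmdir on a union leave a whiteout behind when the name is also
present in a lower layer.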

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/inode.c            |   17 +
 fs/namei.c            |  476 ++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/readdir.c          |   10 +
 fs/union.c            |  104 ++++++++++
 include/linux/fs.h    |    4 
 include/linux/union.h |    6 
 6 files changed, 605 insertions(+), 12 deletions(-)
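
As an illustration only (not part of the patch), the whiteout and opaque
directory semantics introduced here can be modelled in user space. The
sketch below is a simplified model under stated assumptions: every um_*
name is hypothetical, each layer is a flat array of names, and the removal
decision only approximates what do_rmdir()/do_unlinkat() do through
present_in_lower() and do_whiteout().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry kinds for this sketch; not the kernel's types. */
enum um_kind { UM_NONE, UM_FILE, UM_WHITEOUT };

struct um_entry {
	const char *name;
	enum um_kind kind;
};

/* One directory as seen in one layer of the union stack. */
struct um_layer {
	const struct um_entry *entries;
	size_t nr;
	int opaque;	/* models S_OPAQUE: lower layers are not consulted */
};

static enum um_kind um_layer_find(const struct um_layer *l, const char *name)
{
	size_t i;

	for (i = 0; i < l->nr; i++)
		if (strcmp(l->entries[i].name, name) == 0)
			return l->entries[i].kind;
	return UM_NONE;
}

/*
 * Walk the stack top-down (layers[0] is the topmost layer). Returns the
 * index of the layer that provides the name, or -1 if the name is absent
 * or hidden by a whiteout or an opaque directory.
 */
static int um_lookup(const struct um_layer *layers, size_t nr,
		     const char *name)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		enum um_kind k = um_layer_find(&layers[i], name);

		if (k == UM_WHITEOUT)
			return -1;	/* a whiteout reads as "no such name" */
		if (k == UM_FILE)
			return (int)i;
		if (layers[i].opaque)
			break;		/* stop descending below an opaque dir */
	}
	return -1;
}

enum um_action {
	UM_ACT_ENOENT,			/* nothing visible to remove */
	UM_ACT_REMOVE,			/* plain unlink in the top layer */
	UM_ACT_REMOVE_AND_WHITEOUT,	/* unlink, then whiteout in its place */
	UM_ACT_WHITEOUT,		/* name lives below: whiteout on top */
};

/* Decide what removing a name from the union has to do. */
static enum um_action um_unlink_action(const struct um_layer *layers,
				       size_t nr, const char *name)
{
	int where = um_lookup(layers, nr, name);

	if (where < 0)
		return UM_ACT_ENOENT;
	if (where > 0)
		return UM_ACT_WHITEOUT;
	if (!layers[0].opaque && nr > 1 &&
	    um_lookup(layers + 1, nr - 1, name) >= 0)
		return UM_ACT_REMOVE_AND_WHITEOUT;
	return UM_ACT_REMOVE;
}
```

With a top layer holding "a" plus a whiteout for "b", stacked over a lower
layer holding "a", "b" and "c", um_lookup() finds "a" in layer 0 and "c" in
layer 1 and reports "b" as absent. Removing "a" then needs a whiteout as
well as the unlink, while removing "c" only needs a whiteout on top.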

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1418,6 +1418,21 @@ void __init inode_init(unsigned long mem
 		INIT_HLIST_HEAD(&inode_hashtable[loop]);
 }
 
+/*
+ * Dummy default file-operations:
+ * Never open a whiteout. This is always a bug.
+ */
+static int whiteout_no_open(struct inode *irrelevant, struct file *dontcare)
+{
+	printk(KERN_ERR "Attempt to open a whiteout!\n");
+	WARN_ON(1);
+	return -ENXIO;
+}
+
+static struct file_operations def_wht_fops = {
+	.open		= whiteout_no_open,
+};
+
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 {
 	inode->i_mode = mode;
@@ -1431,6 +1446,8 @@ void init_special_inode(struct inode *in
 		inode->i_fop = &def_fifo_fops;
 	else if (S_ISSOCK(mode))
 		inode->i_fop = &bad_sock_fops;
+	else if (S_ISWHT(mode))
+		inode->i_fop = &def_wht_fops;
 	else
 		printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o)\n",
 		       mode);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -969,7 +969,7 @@ static fastcall int __link_path_walk(con
 
 		err = -ENOENT;
 		inode = next.dentry->d_inode;
-		if (!inode)
+		if (!inode || S_ISWHT(inode->i_mode))
 			goto out_dput;
 		err = -ENOTDIR; 
 		if (!inode->i_op)
@@ -1043,6 +1043,12 @@ last_component:
 		err = -ENOENT;
 		if (!inode)
 			break;
+		if (S_ISWHT(inode->i_mode)) {
+			UM_DEBUG_UID("found a whiteout\n");
+			break;
+			//if (!(nd->flags & LOOKUP_WHT))
+			//    break;
+		}
 		if (lookup_flags & LOOKUP_DIRECTORY) {
 			err = -ENOTDIR; 
 			if (!inode->i_op || !inode->i_op->lookup)
@@ -1521,7 +1527,7 @@ static int may_delete(struct inode *dir,
 static inline int may_create(struct inode *dir, struct dentry *child,
 			     struct nameidata *nd)
 {
-	if (child->d_inode)
+	if (child->d_inode && !S_ISWHT(child->d_inode->i_mode))
 		return -EEXIST;
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
@@ -1588,6 +1594,82 @@ void unlock_rename(struct dentry *p1, st
 	}
 }
 
+/*
+ * __vfs_unlink_whiteout - Unlink a single whiteout from the system
+ * @dir: parent directory
+ * @dentry: the whiteout itself
+ *
+ * This is for unlinking a single whiteout. It does the same work as
+ * vfs_unlink() but deliberately skips the directory notifications.
+ */
+static int
+__vfs_unlink_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	int error = may_delete(dir, dentry, 0);
+
+	if (error)
+		return error;
+
+	if (!dir->i_op || !dir->i_op->unlink)
+		return -EPERM;
+
+	DQUOT_INIT(dir);
+
+	mutex_lock(&dentry->d_inode->i_mutex);
+	if (d_mountpoint(dentry))
+		error = -EBUSY;
+	else {
+		error = security_inode_unlink(dir, dentry);
+		if (!error)
+			error = dir->i_op->unlink(dir, dentry);
+	}
+	mutex_unlock(&dentry->d_inode->i_mutex);
+
+	/* We don't d_delete() NFS sillyrenamed files--they still exist. */
+	if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
+		d_delete(dentry);
+		//inode_dir_notify(dir, DN_DELETE);
+	}
+	return error;
+}
+
+/*
+ * vfs_unlink_whiteout - unlink and relookup the whiteout
+ *
+ * This is what you want to call from vfs_* functions to remove a whiteout. It
+ * unlinks the whiteout dentry and looks it up again afterwards.
+ */
+static int
+vfs_unlink_whiteout(struct inode *dir, struct dentry **dp)
+{
+	struct dentry *dentry = *dp;
+	struct dentry *parent = dentry->d_parent;
+	struct qstr name;
+	int error;
+
+	BUG_ON(dir != parent->d_inode);
+
+	error = -ENOMEM;
+	name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+	if (!name.name)
+		goto out;
+	strncpy((char *)name.name, dentry->d_name.name, dentry->d_name.len);
+	name.len = dentry->d_name.len;
+	name.hash = dentry->d_name.hash;
+
+	error = __vfs_unlink_whiteout(dir, dentry);
+	if (error)
+		goto out_freename;
+
+	__dput_single(dentry);
+	*dp = __lookup_hash_single(&name, parent, NULL);
+	BUG_ON(IS_ERR(*dp));	/* Hmm, very hard response here */
+out_freename:
+	kfree(name.name);
+out:
+	return error;
+}
+
 int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
 		struct nameidata *nd)
 {
@@ -1603,6 +1685,13 @@ int vfs_create(struct inode *dir, struct
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
+
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, &dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->create(dir, dentry, mode, nd);
 	if (!error)
@@ -1798,7 +1887,14 @@ do_last:
 	}
 
 	/* Negative dentry, just create the file */
-	if (!path.dentry->d_inode) {
+	if (!path.dentry->d_inode || S_ISWHT(path.dentry->d_inode->i_mode)) {
+		if (path.dentry->d_parent != dir) {
+			UM_DEBUG_UID("found a lower layer's whiteout\n");
+			dput(path.dentry);
+			path.dentry = __lookup_hash_single(&nd->last, dir, nd);
+			goto do_last;
+		}
+
 		error = open_namei_create(nd, &path, flag, mode);
 		if (error)
 			goto exit;
@@ -1914,6 +2010,17 @@ do_link:
 struct dentry *lookup_create(struct nameidata *nd, int is_dir)
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
+	int error;
+
+	if (union_is_member(nd->dentry, nd->mnt)) {
+		error = union_relookup_topmost(nd, nd->flags & ~LOOKUP_PARENT);
+		if (error) {
+			/* FIXME: This really sucks */
+			mutex_lock_nested(&nd->dentry->d_inode->i_mutex,
+					  I_MUTEX_PARENT);
+			goto fail;
+		}
+	}
 
 	mutex_lock_nested(&nd->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	/*
@@ -1933,6 +2040,15 @@ struct dentry *lookup_create(struct name
 	if (IS_ERR(dentry))
 		goto fail;
 
+	/* Special case - we found a whiteout */
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		if (dentry->d_parent != nd->dentry) {
+			UM_DEBUG_UID("found a lower layer's whiteout\n");
+			dput(dentry);
+			dentry = __lookup_hash_single(&nd->last,nd->dentry,nd);
+		}
+	}
+
 	/*
 	 * Special case - lookup gave negative, but... we had foo/bar/
 	 * From the vfs_mknod() POV we just have a negative dentry -
@@ -1967,6 +2083,12 @@ int vfs_mknod(struct inode *dir, struct 
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, &dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->mknod(dir, dentry, mode, dev);
 	if (!error)
@@ -2032,6 +2154,7 @@ asmlinkage long sys_mknod(const char __u
 int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	int error = may_create(dir, dentry, NULL);
+	int opaque;
 
 	if (error)
 		return error;
@@ -2044,10 +2167,23 @@ int vfs_mkdir(struct inode *dir, struct 
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, &dentry);
+		if (error)
+			return error;
+		opaque = 1;
+	} else
+		opaque = 0;
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->mkdir(dir, dentry, mode);
-	if (!error)
+	if (!error) {
 		fsnotify_mkdir(dir, dentry);
+#ifdef CONFIG_UNION_MOUNT
+		if (opaque && dentry->d_parent->d_overlaid)
+			dentry->d_inode->i_flags |= S_OPAQUE;
+#endif
+	}
 	return error;
 }
 
@@ -2089,6 +2225,225 @@ asmlinkage long sys_mkdir(const char __u
 	return sys_mkdirat(AT_FDCWD, pathname, mode);
 }
 
+/* Checks on the victim for whiteout */
+static inline int may_whiteout(struct dentry *victim, int isdir)
+{
+	if (!victim->d_inode || S_ISWHT(victim->d_inode->i_mode))
+		return -ENOENT;
+	if (IS_APPEND(victim->d_inode) || IS_IMMUTABLE(victim->d_inode))
+		return -EPERM;
+	if (isdir) {
+		if (!S_ISDIR(victim->d_inode->i_mode))
+			return -ENOTDIR;
+		if (IS_ROOT(victim))
+			return -EBUSY;
+		if (!union_dir_is_empty(victim))
+			return -ENOTEMPTY;
+	} else if (S_ISDIR(victim->d_inode->i_mode))
+		return -EISDIR;
+	if (victim->d_flags & DCACHE_NFSFS_RENAMED)
+		return -EBUSY;
+	return 0;
+}
+
+/*
+ * We try to whiteout a dentry. dir is the parent of the whiteout.
+ * Whiteouts can be vfs_unlink'ed.
+ */
+int vfs_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	int err;
+
+	BUG_ON(dentry->d_parent->d_inode != dir);
+
+	/* from may_create() */
+	if (dentry->d_inode)
+		return -EEXIST;
+	if (IS_DEADDIR(dir))
+		return -ENOENT;
+	err = permission(dir, MAY_WRITE | MAY_EXEC, NULL);
+	if (err)
+		return err;
+
+	/* from may_delete() */
+	if (IS_APPEND(dir))
+		return -EPERM;
+	/* We don't call check_sticky() here because d_inode == NULL */
+
+	if (!dir->i_op || !dir->i_op->whiteout)
+		return -EOPNOTSUPP;
+
+	err = dir->i_op->whiteout(dir, dentry);
+	/* Ignore quota and fsnotify */
+	return err;
+}
+
+/*
+ * do_whiteout - whiteout a dentry, either when removing or renaming
+ * @dentry: the dentry to whiteout
+ *
+ * This is called by the VFS when removing or renaming files on a union mount.
+ */
+static int do_whiteout(struct dentry *parent, struct dentry *dentry, int isdir)
+{
+	int err;
+	struct qstr name;
+
+	UM_DEBUG_UID("parent=\"%s\", dentry=\"%s\", isdir=%d\n",
+		     parent->d_name.name, dentry->d_name.name, isdir);
+
+	err = may_whiteout(dentry, isdir);
+	if (err)
+		goto out;
+
+	err = -ENOMEM;
+	name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+	if (!name.name)
+		goto out;
+	strncpy((char *)name.name, dentry->d_name.name, dentry->d_name.len);
+	name.len = dentry->d_name.len;
+	name.hash = dentry->d_name.hash;
+
+	/*
+	 * TODO: Should we BUG_ON(dentry->d_parent != parent) ?
+	 */
+	if (dentry->d_parent == parent) {
+		if (isdir)
+			err = vfs_rmdir(parent->d_inode, dentry);
+		else
+			err = vfs_unlink(parent->d_inode, dentry);
+		dput(dentry);
+		if (err)
+			goto out_freename;
+	}
+
+	/*
+	 * Relookup the dentry to whiteout now. By this time, the dentry is
+	 * dput'ed in vfs_rmdir or vfs_unlink and we should find a fresh
+	 * negative dentry.
+	 */
+	dentry = __lookup_hash_single(&name, parent, NULL);
+	err = PTR_ERR(dentry);
+	if (IS_ERR(dentry))
+		goto out_freename;
+
+	err = vfs_whiteout(parent->d_inode, dentry);
+	__dput_single(dentry);
+out_freename:
+	kfree(name.name);
+out:
+	return err;
+}
+
+static int
+__hash_one_len(const char *name, int len, struct qstr *this)
+{
+	unsigned long hash;
+	unsigned int c;
+
+	hash = init_name_hash();
+	while (len--) {
+		c = *(const unsigned char *)name++;
+		if (c == '/' || c == '\0')
+			return -EINVAL;
+		hash = partial_name_hash(c, hash);
+	}
+	this->hash = end_name_hash(hash);
+	return 0;
+}
+
+static int unlink_whiteouts_filldir(void *buf, const char *name, int namlen,
+			   loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct dentry *parent = buf;
+	struct dentry *dentry;
+	struct qstr this;
+	int res;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	UM_DEBUG_UID("name=\"%s\", d_type=%d\n", name, d_type);
+
+	if (d_type != DT_WHT)
+		return 0;
+
+	this.name = name;
+	this.len = namlen;
+	res = __hash_one_len(name, namlen, &this);
+	if (res)
+		return res;
+
+	dentry = __lookup_hash_single(&this, parent, NULL);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+
+	res = __vfs_unlink_whiteout(parent->d_inode, dentry);
+	__dput_single(dentry);
+	return res;
+}
+
+/*
+ * do_unlink_whiteouts - remove all whiteouts of an "empty" directory
+ * @dentry: the directory's dentry
+ *
+ * Before removing a directory from the file system, we have to make sure
+ * that there are no stale whiteouts in it. Therefore we call readdir() with
+ * a special filldir helper to remove all the whiteouts.
+ *
+ * XXX: Don't call any security and permission checks here (If we aren't
+ * allowed to go here, we shouldn't be here at all). Same with i_mutex, don't
+ * touch it here.
+ */
+static int do_unlink_whiteouts(struct dentry *dentry)
+{
+	struct file *file;
+	struct vfsmount *mnt;
+	struct inode *inode;
+	int res;
+
+	dget(dentry);
+	mnt = find_mnt(dentry);
+
+	/*
+	 * FIXME: This is bad, because we really don't want to open a new
+	 * file in the kernel but readdir needs a file pointer
+	 */
+	file = dentry_open(dentry, mnt, O_RDWR);
+	if (IS_ERR(file)) {
+		printk(KERN_ERR "%s: dentry_open failed (%ld)\n",
+		       __FUNCTION__, PTR_ERR(file));
+		return PTR_ERR(file);
+	}
+
+	inode = file->f_path.dentry->d_inode;
+
+	res = -ENOTDIR;
+	if (!file->f_op || !file->f_op->readdir)
+		goto out_fput;
+
+	res = -ENOENT;
+	if (!IS_DEADDIR(inode)) {
+		res = file->f_op->readdir(file, (void *)file->f_path.dentry,
+					  unlink_whiteouts_filldir);
+		file_accessed(file);
+	}
+out_fput:
+	fput(file);
+	if (unlikely(res))
+		printk(KERN_ERR "%s: readdir failed (%d)\n",
+		       __FUNCTION__, res);
+	return res;
+}
+
+
 /*
  * We try to drop the dentry early: we should have
  * a usage count of 2 if we're the only user of this
@@ -2118,8 +2473,12 @@ void dentry_unhash(struct dentry *dentry
 
 int vfs_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	int error = may_delete(dir, dentry, 1);
+	int error;
 
+	if (!dentry->d_inode || S_ISWHT(dentry->d_inode->i_mode))
+		return -ENOENT;
+
+	error = may_delete(dir, dentry, 1);
 	if (error)
 		return error;
 
@@ -2135,11 +2494,15 @@ int vfs_rmdir(struct inode *dir, struct 
 	else {
 		error = security_inode_rmdir(dir, dentry);
 		if (!error) {
+			error = do_unlink_whiteouts(dentry);
+			if (error)
+				goto out;
 			error = dir->i_op->rmdir(dir, dentry);
 			if (!error)
 				dentry->d_inode->i_flags |= S_DEAD;
 		}
 	}
+ out:
 	mutex_unlock(&dentry->d_inode->i_mutex);
 	if (!error) {
 		d_delete(dentry);
@@ -2180,8 +2543,41 @@ static long do_rmdir(int dfd, const char
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit2;
-	error = vfs_rmdir(nd.dentry->d_inode, dentry);
-	dput(dentry);
+
+	if (!union_is_member(nd.dentry, nd.mnt)) {
+		/* Not a member of union, normal removal */
+		error = vfs_rmdir(nd.dentry->d_inode, dentry);
+		dput(dentry);
+		goto exit2;
+	}
+
+	if (dentry->d_parent == nd.dentry) {
+		/*
+		 * Topmost dentry of the union. Check if there
+		 * is a dentry of the same name in the lower layers.
+		 * If so, remove it and leave a whiteout in its place.
+		 * Else normal removal.
+		 */
+		if (present_in_lower(dentry, &nd))
+			error = do_whiteout(nd.dentry, dentry, 1);
+		else {
+			error = vfs_rmdir(nd.dentry->d_inode, dentry);
+			dput(dentry);
+		}
+	} else {
+		/*
+		 * Lower layer dentry of the union. Relookup
+		 * the dentry in the top layer (which should return
+		 * a negative dentry) and create a whiteout there.
+		 */
+		dput(dentry);
+		dentry = __lookup_hash_single(&nd.last, nd.dentry, &nd);
+		error = PTR_ERR(dentry);
+		if (IS_ERR(dentry))
+			goto exit2;
+		error = vfs_whiteout(nd.dentry->d_inode, dentry);
+		__dput_single(dentry);
+	}
 exit2:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 exit1:
@@ -2260,10 +2656,44 @@ static long do_unlinkat(int dfd, const c
 		inode = dentry->d_inode;
 		if (inode)
 			atomic_inc(&inode->i_count);
-		error = vfs_unlink(nd.dentry->d_inode, dentry);
-	exit2:
-		dput(dentry);
+
+		if (!union_is_member(nd.dentry, nd.mnt)) {
+			/* Not a member of union, normal removal */
+			error = vfs_unlink(nd.dentry->d_inode, dentry);
+			dput(dentry);
+			goto exit2;
+		}
+
+		/* TODO: fix this code duplication with do_rmdir() */
+		if (dentry->d_parent == nd.dentry) {
+			/*
+			 * Topmost dentry of the union. Check if there
+			 * is a dentry of the same name in the lower layers.
+			 * If so, remove it and leave a whiteout in its place.
+			 * Else normal removal.
+			 */
+			if (present_in_lower(dentry, &nd))
+				error = do_whiteout(nd.dentry, dentry, 0);
+			else {
+				error = vfs_unlink(nd.dentry->d_inode, dentry);
+				dput(dentry);
+			}
+		} else {
+			/*
+			 * Lower layer dentry of the union. Relookup
+			 * the dentry in the top layer (which should return
+			 * a negative dentry) and create a whiteout there.
+			 */
+			dput(dentry);
+			dentry = __lookup_hash_single(&nd.last, nd.dentry, &nd);
+			error = PTR_ERR(dentry);
+			if (IS_ERR(dentry))
+				goto exit2;
+			error = vfs_whiteout(nd.dentry->d_inode, dentry);
+			__dput_single(dentry);
+		}
 	}
+exit2:
 	mutex_unlock(&nd.dentry->d_inode->i_mutex);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
@@ -2276,6 +2706,7 @@ exit:
 slashes:
 	error = !dentry->d_inode ? -ENOENT :
 		S_ISDIR(dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
+	dput(dentry);
 	goto exit2;
 }
 
@@ -2309,6 +2740,12 @@ int vfs_symlink(struct inode *dir, struc
 	if (error)
 		return error;
 
+	if (dentry->d_inode && S_ISWHT(dentry->d_inode->i_mode)) {
+		error = vfs_unlink_whiteout(dir, &dentry);
+		if (error)
+			return error;
+	}
+
 	DQUOT_INIT(dir);
 	error = dir->i_op->symlink(dir, dentry, oldname);
 	if (!error)
@@ -2363,7 +2800,7 @@ int vfs_link(struct dentry *old_dentry, 
 	struct inode *inode = old_dentry->d_inode;
 	int error;
 
-	if (!inode)
+	if (!inode || S_ISWHT(inode->i_mode))
 		return -ENOENT;
 
 	error = may_create(dir, new_dentry, NULL);
@@ -2639,7 +3076,7 @@ static int do_rename(int olddfd, const c
 		goto exit3;
 	/* source must exist */
 	error = -ENOENT;
-	if (!old_dentry->d_inode)
+	if (!old_dentry->d_inode || S_ISWHT(old_dentry->d_inode->i_mode))
 		goto exit4;
 	/* unless the source is a directory trailing slashes give -ENOTDIR */
 	if (!S_ISDIR(old_dentry->d_inode->i_mode)) {
@@ -2661,6 +3098,21 @@ static int do_rename(int olddfd, const c
 	error = -ENOTEMPTY;
 	if (new_dentry == trap)
 		goto exit5;
+	error = -EXDEV;
+	/* renaming of directories on unions isn't implemented yet */
+	if (union_is_member(old_dentry, oldnd.mnt)) {
+		error = -EOPNOTSUPP;
+		if (S_ISDIR(old_dentry->d_inode->i_mode))
+			goto exit5;
+		error = -EXDEV;
+		if (oldnd.um_flags & LAST_LOWLEVEL)
+			goto exit5;
+	}
+	if (union_is_member(new_dentry, newnd.mnt)) {
+		error = -EXDEV;
+		if (newnd.um_flags & LAST_LOWLEVEL)
+			goto exit5;
+	}
 
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -148,6 +148,11 @@ static int filldir(void * __buf, const c
 	unsigned long d_ino;
 	int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));
 
+#ifdef CONFIG_UNION_MOUNT
+	if (d_type == DT_WHT)
+		return 0;
+#endif /* CONFIG_UNION_MOUNT */
+
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
 		return -EINVAL;
@@ -233,6 +238,11 @@ static int filldir64(void * __buf, const
 	struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
 	int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));
 
+#ifdef CONFIG_UNION_MOUNT
+	if (d_type == DT_WHT)
+		return 0;
+#endif /* CONFIG_UNION_MOUNT */
+
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
 		return -EINVAL;
--- a/fs/union.c
+++ b/fs/union.c
@@ -580,6 +580,9 @@ lookup_union:
 	if (!S_ISDIR(topmost->d_inode->i_mode))
 		goto out;
 
+	if (IS_OPAQUE(topmost->d_inode))
+		goto out;
+
 	if (!revalidate_union(topmost)) {
 		__dput_single(topmost);
 		topmost = NULL;
@@ -644,6 +647,8 @@ lookup_union:
 		dentry->d_topmost = topmost;
 		last->d_overlaid = dentry;
 		last = dentry;
+		if (IS_OPAQUE(last->d_inode))
+			break;
 		parent = parent->d_overlaid;
 	}
 
@@ -826,6 +831,8 @@ static int __lookup_union(struct dentry 
 	loop:
 		__dput(nd.dentry);
 		mntput(nd.mnt);
+		if (IS_OPAQUE(last->d_inode))
+			break;
 		parent = parent->d_overlaid;
 	}
 
@@ -900,6 +907,9 @@ lookup_union:
 	if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
 		goto out;
 
+	if (IS_OPAQUE(topmost->d_inode))
+		goto out;
+
 	do {
 		struct vfsmount *mnt = find_mnt(topmost);
 		UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
@@ -977,6 +987,9 @@ lookup_union:
 	if (!parent->d_overlaid || !S_ISDIR(topmost->d_inode->i_mode))
 		goto out;
 
+	if (IS_OPAQUE(topmost->d_inode))
+		goto out;
+
 	do {
 		struct vfsmount *mnt = find_mnt(topmost);
 		UM_DEBUG_UID("name=\"%s\", inode=%p, device=%s\n",
@@ -1739,3 +1752,94 @@ exit_dput:
 	dput(dentry);
 	return err;
 }
+
+static int
+filldir_dummy(void *__buf, const char *name, int namlen, loff_t offset,
+	      u64 ino, unsigned int d_type)
+{
+	int *is_empty = (int *)__buf;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (d_type == DT_WHT)
+		return 0;
+
+	(*is_empty) = 0;
+	return 0;
+}
+
+int
+union_dir_is_empty(struct dentry *dentry)
+{
+	struct file *file;
+	struct vfsmount *mnt;
+	int err;
+	int is_empty = 1;
+
+	BUG_ON(!S_ISDIR(dentry->d_inode->i_mode));
+
+	dget(dentry);
+	mnt = find_mnt(dentry);
+
+	file = dentry_open(dentry, mnt, O_RDONLY);
+	if (IS_ERR(file))
+		return 0;
+
+	err = vfs_readdir(file, filldir_dummy, &is_empty);
+	UM_DEBUG("err=%d, is_empty=%d\n", err, is_empty);
+
+	fput(file);
+	return is_empty;
+}
+
+int present_in_lower(struct dentry *dentry, struct nameidata *nd)
+{
+	int err = 0;
+	struct dentry *parent = nd->dentry->d_overlaid;
+	struct dentry *tmp;
+	struct nameidata nd_tmp;
+	struct qstr this;
+
+	this.name = nd->last.name;
+	this.len = nd->last.len;
+
+	while (parent) {
+		this.hash = nd->last.hash;
+		nd_tmp.dentry = dget(parent);
+		nd_tmp.mnt = find_mnt(parent);
+		mutex_lock(&parent->d_inode->i_mutex);
+		tmp = __lookup_hash_single(&this, nd_tmp.dentry, &nd_tmp);
+		mutex_unlock(&parent->d_inode->i_mutex);
+		/*
+		 * If there is an error in lookup, we return 0 concluding
+		 * that this dentry is not present in lower layers.
+		 */
+		if (IS_ERR(tmp))
+			goto out;
+
+		if (tmp->d_inode) {
+			__dput_single(tmp);
+			err = 1;
+			goto out;
+		}
+
+		__dput_single(tmp);
+		mntput(nd_tmp.mnt);
+		dput(nd_tmp.dentry);
+		parent = parent->d_overlaid;
+	}
+
+	return err;
+out:
+	mntput(nd_tmp.mnt);
+	dput(nd_tmp.dentry);
+	return err;
+}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -151,6 +151,7 @@ extern int dir_notify_enable;
 #define S_NOCMTIME	128	/* Do not update file c/mtime */
 #define S_SWAPFILE	256	/* Do not truncate: swapon got its bmaps */
 #define S_PRIVATE	512	/* Inode is fs-internal */
+#define S_OPAQUE	1024	/* Directory is opaque */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -184,6 +185,7 @@ extern int dir_notify_enable;
 #define IS_NOCMTIME(inode)	((inode)->i_flags & S_NOCMTIME)
 #define IS_SWAPFILE(inode)	((inode)->i_flags & S_SWAPFILE)
 #define IS_PRIVATE(inode)	((inode)->i_flags & S_PRIVATE)
+#define IS_OPAQUE(inode)	((inode)->i_flags & S_OPAQUE)
 
 /* the read-only stuff doesn't really belong here, but any other place is
    probably as bad and I don't want to create yet another include file. */
@@ -1043,6 +1045,7 @@ extern int vfs_link(struct dentry *, str
 extern int vfs_rmdir(struct inode *, struct dentry *);
 extern int vfs_unlink(struct inode *, struct dentry *);
 extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
+extern int vfs_whiteout(struct inode *, struct dentry *);
 
 /*
  * VFS dentry helper functions.
@@ -1168,6 +1171,7 @@ struct inode_operations {
 	int (*mkdir) (struct inode *,struct dentry *,int);
 	int (*rmdir) (struct inode *,struct dentry *);
 	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+	int (*whiteout) (struct inode *, struct dentry *);
 	int (*rename) (struct inode *, struct dentry *,
 			struct inode *, struct dentry *);
 	int (*readlink) (struct dentry *, char __user *,int);
--- a/include/linux/union.h
+++ b/include/linux/union.h
@@ -39,6 +39,10 @@ extern int union_copy_file(struct dentry
 extern int union_copyup(struct nameidata *, int);
 extern int union_relookup_topmost(struct nameidata *, int);
 
+/* vfs whiteout support */
+extern int union_dir_is_empty(struct dentry *);
+extern int present_in_lower(struct dentry *, struct nameidata *);
+
 #else	/* CONFIG_UNION_MOUNT */
 
 #define attach_mnt_union(mnt,nd) do { /* empty */ } while (0)
@@ -50,6 +54,8 @@ extern int union_relookup_topmost(struct
 #define union_copy_file(dentry1,mnt1,dentry2,mnt2) ({ (0); })
 #define union_copyup(x,y) ({ (0); })
 #define union_relookup_topmost(x,y) ({ (0); })
+#define union_dir_is_empty(x) ({ (1); })
+#define present_in_lower(x, y)	({ (0); })
 
 #endif	/* CONFIG_UNION_MOUNT */
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 12/15] ext2 whiteout support
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (10 preceding siblings ...)
  2007-04-17 13:23 ` [RFC][PATCH 11/15] VFS whiteout handling Bharata B Rao
@ 2007-04-17 13:23 ` Bharata B Rao
  2007-04-17 13:24 ` [RFC][PATCH 13/15] ext3 " Bharata B Rao
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: ext2 whiteout support

Introduce whiteout support to ext2.

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/ext2/dir.c           |    2 ++
 fs/ext2/namei.c         |   17 +++++++++++++++++
 fs/ext2/super.c         |   11 ++++++++++-
 include/linux/ext2_fs.h |    4 ++++
 4 files changed, 33 insertions(+), 1 deletion(-)

--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -218,6 +218,7 @@ static unsigned char ext2_filetype_table
 	[EXT2_FT_FIFO]		= DT_FIFO,
 	[EXT2_FT_SOCK]		= DT_SOCK,
 	[EXT2_FT_SYMLINK]	= DT_LNK,
+	[EXT2_FT_WHT]		= DT_WHT,
 };
 
 #define S_SHIFT 12
@@ -229,6 +230,7 @@ static unsigned char ext2_type_by_mode[S
 	[S_IFIFO >> S_SHIFT]	= EXT2_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT]	= EXT2_FT_SOCK,
 	[S_IFLNK >> S_SHIFT]	= EXT2_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT]	= EXT2_FT_WHT,
 };
 
 static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -288,6 +288,22 @@ static int ext2_rmdir (struct inode * di
 	return err;
 }
 
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode;
+	int err;
+
+	inode = ext2_new_inode (dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	mark_inode_dirty(inode);
+	err = ext2_add_nondir(dentry, inode);
+out:
+	return err;
+}
+
 static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 	struct inode * new_dir,	struct dentry * new_dentry )
 {
@@ -382,6 +398,7 @@ const struct inode_operations ext2_dir_i
 	.mkdir		= ext2_mkdir,
 	.rmdir		= ext2_rmdir,
 	.mknod		= ext2_mknod,
+	.whiteout	= ext2_whiteout,
 	.rename		= ext2_rename,
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr	= generic_setxattr,
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -755,6 +755,15 @@ static int ext2_fill_super(struct super_
 	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
 				    EXT2_MOUNT_XIP if not */
 
+	if ((sb->s_flags & MS_UNION) && !(sb->s_flags & MS_RDONLY)) {
+		if (!EXT2_HAS_INCOMPAT_FEATURE(sb,
+				EXT2_FEATURE_INCOMPAT_WHITEOUT)) {
+			sb->s_flags |= MS_RDONLY;
+			ext2_warning(sb, __FUNCTION__,
+			"no whiteout support, mounting filesystem read-only");
+		}
+	}
+
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -1293,7 +1302,7 @@ static struct file_system_type ext2_fs_t
 	.name		= "ext2",
 	.get_sb		= ext2_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext2_fs(void)
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -61,6 +61,7 @@
 #define EXT2_ROOT_INO		 2	/* Root inode */
 #define EXT2_BOOT_LOADER_INO	 5	/* Boot loader inode */
 #define EXT2_UNDEL_DIR_INO	 6	/* Undelete directory inode */
+#define EXT2_WHT_INO		 7	/* Whiteout inode */
 
 /* First non-reserved inode for old ext2 filesystems */
 #define EXT2_GOOD_OLD_FIRST_INO	11
@@ -479,10 +480,12 @@ struct ext2_super_block {
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008
 #define EXT2_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT		0x0020
 #define EXT2_FEATURE_INCOMPAT_ANY		0xffffffff
 
 #define EXT2_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT2_FEATURE_INCOMPAT_SUPP	(EXT2_FEATURE_INCOMPAT_FILETYPE| \
+					 EXT2_FEATURE_INCOMPAT_WHITEOUT| \
 					 EXT2_FEATURE_INCOMPAT_META_BG)
 #define EXT2_FEATURE_RO_COMPAT_SUPP	(EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -549,6 +552,7 @@ enum {
 	EXT2_FT_FIFO,
 	EXT2_FT_SOCK,
 	EXT2_FT_SYMLINK,
+	EXT2_FT_WHT,
 	EXT2_FT_MAX
 };
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 13/15] ext3 whiteout support
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (11 preceding siblings ...)
  2007-04-17 13:23 ` [RFC][PATCH 12/15] ext2 whiteout support Bharata B Rao
@ 2007-04-17 13:24 ` Bharata B Rao
  2007-04-17 13:24 ` [RFC][PATCH 14/15] tmpfs " Bharata B Rao
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Bharata B Rao <bharata@linux.vnet.ibm.com>
Subject: ext3 whiteout support

Introduce whiteout support for ext3.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/ext3/dir.c           |    2 -
 fs/ext3/namei.c         |   65 ++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext3/super.c         |   11 +++++++-
 include/linux/ext3_fs.h |    5 ++-
 4 files changed, 75 insertions(+), 8 deletions(-)

--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -29,7 +29,7 @@
 #include <linux/rbtree.h>
 
 static unsigned char ext3_filetype_table[] = {
-	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK, DT_WHT
 };
 
 static int ext3_readdir(struct file *, void *, filldir_t);
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1071,6 +1071,7 @@ static unsigned char ext3_type_by_mode[S
 	[S_IFIFO >> S_SHIFT]	= EXT3_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT]	= EXT3_FT_SOCK,
 	[S_IFLNK >> S_SHIFT]	= EXT3_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT]	= EXT3_FT_WHT,
 };
 
 static inline void ext3_set_de_type(struct super_block *sb,
@@ -1786,7 +1787,7 @@ out_stop:
 /*
  * routine to check that the specified directory is empty (for rmdir)
  */
-static int empty_dir (struct inode * inode)
+static int empty_dir (handle_t *handle, struct inode * inode)
 {
 	unsigned long offset;
 	struct buffer_head * bh;
@@ -1848,8 +1849,28 @@ static int empty_dir (struct inode * ino
 			continue;
 		}
 		if (le32_to_cpu(de->inode)) {
-			brelse (bh);
-			return 0;
+			/* If this is a whiteout, remove it */
+			if (de->file_type == EXT3_FT_WHT) {
+				unsigned long ino = le32_to_cpu(de->inode);
+				struct inode *tmp_inode = iget(inode->i_sb, ino);
+				if (!tmp_inode) {
+					brelse (bh);
+					return 0;
+				}
+
+				if (ext3_delete_entry(handle, inode, de, bh)) {
+					iput(tmp_inode);
+					brelse (bh);
+					return 0;
+				}
+
+				tmp_inode->i_ctime = inode->i_ctime;
+				tmp_inode->i_nlink--;
+				iput(tmp_inode);
+			} else {
+				brelse (bh);
+				return 0;
+			}
 		}
 		offset += le16_to_cpu(de->rec_len);
 		de = (struct ext3_dir_entry_2 *)
@@ -2031,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
 		goto end_rmdir;
 
 	retval = -ENOTEMPTY;
-	if (!empty_dir (inode))
+	if (!empty_dir (handle, inode))
 		goto end_rmdir;
 
 	retval = ext3_delete_entry(handle, dir, de, bh);
@@ -2060,6 +2081,39 @@ end_rmdir:
 	return retval;
 }
 
+/*
+ * TODO: Not sure about the args to ext3_journal_start. Check.
+ */
+static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode * inode;
+	int err, retries = 0;
+	handle_t *handle;
+
+retry:
+	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
+					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+					2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (IS_DIRSYNC(dir))
+		handle->h_sync = 1;
+
+	inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out_stop;
+
+	err = ext3_add_nondir(handle, dentry, inode);
+
+out_stop:
+	ext3_journal_stop(handle);
+	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
+		goto retry;
+	return err;
+}
+
 static int ext3_unlink(struct inode * dir, struct dentry *dentry)
 {
 	int retval;
@@ -2261,7 +2315,7 @@ static int ext3_rename (struct inode * o
 	if (S_ISDIR(old_inode->i_mode)) {
 		if (new_inode) {
 			retval = -ENOTEMPTY;
-			if (!empty_dir (new_inode))
+			if (!empty_dir (handle, new_inode))
 				goto end_rename;
 		}
 		retval = -EIO;
@@ -2377,6 +2431,7 @@ const struct inode_operations ext3_dir_i
 	.mkdir		= ext3_mkdir,
 	.rmdir		= ext3_rmdir,
 	.mknod		= ext3_mknod,
+	.whiteout	= ext3_whiteout,
 	.rename		= ext3_rename,
 	.setattr	= ext3_setattr,
 #ifdef CONFIG_EXT3_FS_XATTR
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1493,6 +1493,15 @@ static int ext3_fill_super (struct super
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT3_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
 
+	if ((sb->s_flags & MS_UNION) && !(sb->s_flags & MS_RDONLY)) {
+		if (!EXT3_HAS_INCOMPAT_FEATURE(sb,
+				EXT3_FEATURE_INCOMPAT_WHITEOUT)) {
+			sb->s_flags |= MS_RDONLY;
+			ext3_warning(sb, __FUNCTION__,
+			"no whiteout support, mounting filesystem read-only");
+		}
+	}
+
 	if (le32_to_cpu(es->s_rev_level) == EXT3_GOOD_OLD_REV &&
 	    (EXT3_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT3_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -2749,7 +2758,7 @@ static struct file_system_type ext3_fs_t
 	.name		= "ext3",
 	.get_sb		= ext3_get_sb,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_WHT,
 };
 
 static int __init init_ext3_fs(void)
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -63,6 +63,7 @@
 #define EXT3_UNDEL_DIR_INO	 6	/* Undelete directory inode */
 #define EXT3_RESIZE_INO		 7	/* Reserved group descriptors inode */
 #define EXT3_JOURNAL_INO	 8	/* Journal inode */
+#define EXT3_WHT_INO		 9	/* Whiteout inode */
 
 /* First non-reserved inode for old ext3 filesystems */
 #define EXT3_GOOD_OLD_FIRST_INO	11
@@ -582,6 +583,7 @@ static inline int ext3_valid_inum(struct
 #define EXT3_FEATURE_INCOMPAT_RECOVER		0x0004 /* Needs recovery */
 #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV	0x0008 /* Journal device */
 #define EXT3_FEATURE_INCOMPAT_META_BG		0x0010
+#define EXT3_FEATURE_INCOMPAT_WHITEOUT		0x0020
 
 #define EXT3_FEATURE_COMPAT_SUPP	EXT2_FEATURE_COMPAT_EXT_ATTR
 #define EXT3_FEATURE_INCOMPAT_SUPP	(EXT3_FEATURE_INCOMPAT_FILETYPE| \
@@ -648,8 +650,9 @@ struct ext3_dir_entry_2 {
 #define EXT3_FT_FIFO		5
 #define EXT3_FT_SOCK		6
 #define EXT3_FT_SYMLINK		7
+#define EXT3_FT_WHT		8
 
-#define EXT3_FT_MAX		8
+#define EXT3_FT_MAX		9
 
 /*
  * EXT3_DIR_PAD defines the directory entries boundaries

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 14/15] tmpfs whiteout support
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (12 preceding siblings ...)
  2007-04-17 13:24 ` [RFC][PATCH 13/15] ext3 " Bharata B Rao
@ 2007-04-17 13:24 ` Bharata B Rao
  2007-04-17 13:25 ` [RFC][PATCH 15/15] Union-mount changes for NFS Bharata B Rao
  2007-04-17 14:35 ` [RFC][PATCH 0/15] VFS based Union Mount Shaya Potter
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: tmpfs whiteout support

Introduce whiteout support to tmpfs.

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 mm/shmem.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletion(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -74,7 +74,7 @@
 #define LATENCY_LIMIT	 64
 
 /* Pretend that each entry is of this size in directory's i_size */
-#define BOGO_DIRENT_SIZE 20
+#define BOGO_DIRENT_SIZE 1
 
 /* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
 enum sgp_type {
@@ -1772,6 +1772,11 @@ static int shmem_create(struct inode *di
 	return shmem_mknod(dir, dentry, mode | S_IFREG, 0);
 }
 
+static int shmem_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	return shmem_mknod(dir, dentry, S_IRUGO | S_IWUGO | S_IFWHT, 0);
+}
+
 /*
  * Link a file..
  */
@@ -2400,6 +2405,7 @@ static const struct inode_operations shm
 	.rmdir		= shmem_rmdir,
 	.mknod		= shmem_mknod,
 	.rename		= shmem_rename,
+	.whiteout	= shmem_whiteout,
 #endif
 #ifdef CONFIG_TMPFS_POSIX_ACL
 	.setattr	= shmem_notify_change,
@@ -2454,6 +2460,7 @@ static struct file_system_type tmpfs_fs_
 	.name		= "tmpfs",
 	.get_sb		= shmem_get_sb,
 	.kill_sb	= kill_litter_super,
+	.fs_flags	= FS_WHT,
 };
 static struct vfsmount *shm_mnt;
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 15/15] Union-mount changes for NFS
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (13 preceding siblings ...)
  2007-04-17 13:24 ` [RFC][PATCH 14/15] tmpfs " Bharata B Rao
@ 2007-04-17 13:25 ` Bharata B Rao
  2007-04-17 14:35 ` [RFC][PATCH 0/15] VFS based Union Mount Shaya Potter
  15 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, Jan Blunck

From: Jan Blunck <j.blunck@tu-harburg.de>
Subject: Union-mount changes for NFS

Changes necessary to mount a NFS volume into a union.

Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 fs/nfs/dir.c    |   41 +++++++++++++++++++++++++++--------------
 fs/nfs/inode.c  |    4 ++--
 fs/nfs/unlink.c |    4 ++--
 3 files changed, 31 insertions(+), 18 deletions(-)

--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -409,7 +409,7 @@ int nfs_do_filldir(nfs_readdir_descripto
 
 		/* Get a dentry if we have one */
 		if (dentry != NULL)
-			dput(dentry);
+			__dput_single(dentry);
 		dentry = nfs_readdir_lookup(desc);
 
 		/* Use readdirplus info */
@@ -435,7 +435,7 @@ int nfs_do_filldir(nfs_readdir_descripto
 	}
 	dir_page_release(desc);
 	if (dentry != NULL)
-		dput(dentry);
+		__dput_single(dentry);
 	dfprintk(DIRCACHE, "NFS: nfs_do_filldir() filling ended @ cookie %Lu; returning = %d\n",
 			(unsigned long long)*desc->dir_cookie, res);
 	return res;
@@ -721,6 +721,16 @@ int nfs_neg_need_reval(struct inode *dir
 	return !nfs_check_verifier(dir, dentry);
 }
 
+static struct dentry * __dget_parent_single(struct dentry *dentry)
+{
+	struct dentry *ret;
+
+	spin_lock(&dentry->d_lock);
+	ret = __dget_single(dentry->d_parent);
+	spin_unlock(&dentry->d_lock);
+	return ret;
+}
+
 /*
  * This is called every time the dcache has a lookup hit,
  * and we should check whether we can really trust that
@@ -742,7 +752,7 @@ static int nfs_lookup_revalidate(struct 
 	struct nfs_fattr fattr;
 	unsigned long verifier;
 
-	parent = dget_parent(dentry);
+	parent = __dget_parent_single(dentry);
 	lock_kernel();
 	dir = parent->d_inode;
 	nfs_inc_stats(dir, NFSIOS_DENTRYREVALIDATE);
@@ -788,7 +798,7 @@ static int nfs_lookup_revalidate(struct 
 	nfs_refresh_verifier(dentry, verifier);
  out_valid:
 	unlock_kernel();
-	dput(parent);
+	__dput_single(parent);
 	dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is valid\n",
 			__FUNCTION__, dentry->d_parent->d_name.name,
 			dentry->d_name.name);
@@ -807,7 +817,7 @@ out_zap_parent:
 	}
 	d_drop(dentry);
 	unlock_kernel();
-	dput(parent);
+	__dput_single(parent);
 	dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
 			__FUNCTION__, dentry->d_parent->d_name.name,
 			dentry->d_name.name);
@@ -1057,7 +1067,7 @@ static int nfs_open_revalidate(struct de
 	unsigned long verifier;
 	int openflags, ret = 0;
 
-	parent = dget_parent(dentry);
+	parent = __dget_parent_single(dentry);
 	dir = parent->d_inode;
 	if (!is_atomic_open(dir, nd))
 		goto no_open;
@@ -1088,18 +1098,21 @@ static int nfs_open_revalidate(struct de
 		nfs_refresh_verifier(dentry, verifier);
 	unlock_kernel();
 out:
-	dput(parent);
+	__dput_single(parent);
 	if (!ret)
 		d_drop(dentry);
 	return ret;
 no_open:
-	dput(parent);
+	__dput_single(parent);
 	if (inode != NULL && nfs_have_delegation(inode, FMODE_READ))
 		return 1;
 	return nfs_lookup_revalidate(dentry, nd);
 }
 #endif /* CONFIG_NFSV4 */
 
+/* For union mount we need this:
+ * - lookup the complete union if we found one
+ * - don't return lower layers dentries ... */
 static struct dentry *nfs_readdir_lookup(nfs_readdir_descriptor_t *desc)
 {
 	struct dentry *parent = desc->file->f_path.dentry;
@@ -1115,14 +1128,14 @@ static struct dentry *nfs_readdir_lookup
 	switch (name.len) {
 		case 2:
 			if (name.name[0] == '.' && name.name[1] == '.')
-				return dget_parent(parent);
+				return __dget_parent_single(parent);
 			break;
 		case 1:
 			if (name.name[0] == '.')
-				return dget(parent);
+				return __dget_single(parent);
 	}
 	name.hash = full_name_hash(name.name, name.len);
-	dentry = d_lookup(parent, &name);
+	dentry = d_lookup_single(parent, &name);
 	if (dentry != NULL) {
 		/* Is this a positive dentry that matches the readdir info? */
 		if (dentry->d_inode != NULL &&
@@ -1136,7 +1149,7 @@ static struct dentry *nfs_readdir_lookup
 		}
 		/* No, so d_drop to allow one to be created */
 		d_drop(dentry);
-		dput(dentry);
+		__dput_single(dentry);
 	}
 	if (!desc->plus || !(entry->fattr->valid & NFS_ATTR_FATTR))
 		return NULL;
@@ -1147,13 +1160,13 @@ static struct dentry *nfs_readdir_lookup
 	dentry->d_op = NFS_PROTO(dir)->dentry_ops;
 	inode = nfs_fhget(dentry->d_sb, entry->fh, entry->fattr);
 	if (IS_ERR(inode)) {
-		dput(dentry);
+		__dput_single(dentry);
 		return NULL;
 	}
 
 	alias = d_materialise_unique(dentry, inode);
 	if (alias != NULL) {
-		dput(dentry);
+		__dput_single(dentry);
 		if (IS_ERR(alias))
 			return NULL;
 		dentry = alias;
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -460,7 +460,7 @@ static struct nfs_open_context *alloc_nf
 	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
 	if (ctx != NULL) {
 		atomic_set(&ctx->count, 1);
-		ctx->dentry = dget(dentry);
+		ctx->dentry = __dget_single(dentry);
 		ctx->vfsmnt = mntget(mnt);
 		ctx->cred = get_rpccred(cred);
 		ctx->state = NULL;
@@ -491,7 +491,7 @@ void put_nfs_open_context(struct nfs_ope
 			nfs4_close_state(ctx->state, ctx->mode);
 		if (ctx->cred != NULL)
 			put_rpccred(ctx->cred);
-		dput(ctx->dentry);
+		__dput_single(ctx->dentry);
 		mntput(ctx->vfsmnt);
 		kfree(ctx);
 	}
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -129,7 +129,7 @@ static void nfs_async_unlink_done(struct
 		return;
 	put_rpccred(data->cred);
 	data->cred = NULL;
-	dput(dir);
+	__dput_single(dir);
 }
 
 /**
@@ -172,7 +172,7 @@ nfs_async_unlink(struct dentry *dentry)
 		status = PTR_ERR(data->cred);
 		goto out_free;
 	}
-	data->dir = dget(dir);
+	data->dir = __dget_single(dir);
 	data->dentry = dentry;
 
 	data->next = nfs_deletes;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH  0/15] VFS based Union Mount
  2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
                   ` (14 preceding siblings ...)
  2007-04-17 13:25 ` [RFC][PATCH 15/15] Union-mount changes for NFS Bharata B Rao
@ 2007-04-17 14:35 ` Shaya Potter
  2007-04-17 16:30   ` Bharata B Rao
  15 siblings, 1 reply; 22+ messages in thread
From: Shaya Potter @ 2007-04-17 14:35 UTC (permalink / raw)
  To: bharata; +Cc: linux-kernel, linux-fsdevel, Jan Blunck

Bharata B Rao wrote:
> Hi,
> 
> Here is an attempt towards vfs based union mount implementation.
> Union mount provides the filesystem namespace unification feature.
> Unlike the traditional mounts which hide the contents of the mount point,
> the union mount presents the merged view of the mount point and the
> mounted filesystem.

Does this approach allow one to add directories to the union and have it
behave normally?  Namely, imagine one has the situation

dir-b
dir-a/ (contains file foo)


If one unions this and deletes foo, that will create a whiteout entry in
dir-a.

Now, what happens if one does

dir-c
dir-b (now contains whiteout, from previous union)
dir-a (contains file foo)

Will one see foo or not?  I.e., are whiteouts only looked for in the
topmost dir, or in every dir?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH  0/15] VFS based Union Mount
  2007-04-17 14:35 ` [RFC][PATCH 0/15] VFS based Union Mount Shaya Potter
@ 2007-04-17 16:30   ` Bharata B Rao
  2007-04-17 16:56     ` Shaya Potter
  0 siblings, 1 reply; 22+ messages in thread
From: Bharata B Rao @ 2007-04-17 16:30 UTC (permalink / raw)
  To: Shaya Potter; +Cc: linux-kernel, linux-fsdevel, Jan Blunck

On Tue, Apr 17, 2007 at 10:35:50AM -0400, Shaya Potter wrote:
> Bharata B Rao wrote:
> >Hi,
> >
> >Here is an attempt towards vfs based union mount implementation.
> >Union mount provides the filesystem namespace unification feature.
> >Unlike the traditional mounts which hide the contents of the mount point,
> >the union mount presents the merged view of the mount point and the
> >mounted filesystem.
> 
> Does this approach allow one to add directories to the union and have it
> behave normally?  Namely, imagine one has the situation
> 
> dir-b
> dir-a/ (contains file foo)
> 
> 
> If one unions this and deletes foo, that will create a whiteout entry in
> dir-a.

(I guess you mean to say that this creates a whiteout in dir-b)

> 
> Now, what happens if one does
> 
> dir-c
> dir-b (now contains whiteout, from previous union)
> dir-a (contains file foo)
> 
> Will one see foo or not?  I.e., are whiteouts only looked for in the
> topmost dir, or in every dir?

No, foo is not visible. While looking up a file in a union-mounted
directory, the lookup starts from the topmost directory and proceeds
downwards if the file isn't present in the upper layers. If a whiteout is
found in any of the upper layers, the lookup is abandoned and -ENOENT
is returned. Thus, as long as a whiteout exists in an upper layer for
a file in a lower layer, the lower-layer file remains hidden; it becomes
visible again only when the whiteout is removed.

However, if dir-c itself contains foo, that foo (from dir-c) will be
visible after union mounting dir-c on top of dir-b and dir-a.
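
As a rough illustration of the lookup order described above, here is a
minimal user-space sketch in C. It is not code from the patchset: the
types sketch_entry and sketch_layer and the function union_lookup_sketch()
are hypothetical names made up for this example, and a layer is modelled
as a flat array of names rather than as dentries.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a directory entry in one layer. */
enum sketch_kind { SKETCH_FILE, SKETCH_WHITEOUT };

struct sketch_entry {
	const char *name;
	enum sketch_kind kind;
};

/* Hypothetical model of one layer of the union stack. */
struct sketch_layer {
	const char *label;
	const struct sketch_entry *entries;
	size_t nr_entries;
};

/* Find a name within a single layer; returns NULL if it is absent. */
static const struct sketch_entry *
sketch_layer_find(const struct sketch_layer *layer, const char *name)
{
	size_t i;

	for (i = 0; i < layer->nr_entries; i++)
		if (strcmp(layer->entries[i].name, name) == 0)
			return &layer->entries[i];
	return NULL;
}

/*
 * Walk the stack from the topmost layer (index 0) downwards.
 * Returns the index of the layer that provides the name, or -ENOENT
 * if a whiteout is hit first or no layer has the name.
 */
static int union_lookup_sketch(const struct sketch_layer *stack,
			       size_t nr_layers, const char *name)
{
	size_t i;

	for (i = 0; i < nr_layers; i++) {
		const struct sketch_entry *e;

		e = sketch_layer_find(&stack[i], name);
		if (!e)
			continue;	/* not here, try the next lower layer */
		if (e->kind == SKETCH_WHITEOUT)
			return -ENOENT;	/* whiteout hides all lower layers */
		return (int)i;		/* first real entry wins */
	}
	return -ENOENT;
}
```

With a stack of dir-c (empty), dir-b (whiteout for foo) and dir-a (foo),
the sketch reports -ENOENT for foo; once dir-c has its own foo, it
reports layer 0.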

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH  0/15] VFS based Union Mount
  2007-04-17 16:30   ` Bharata B Rao
@ 2007-04-17 16:56     ` Shaya Potter
  2007-04-18  7:19       ` Bharata B Rao
  0 siblings, 1 reply; 22+ messages in thread
From: Shaya Potter @ 2007-04-17 16:56 UTC (permalink / raw)
  To: bharata; +Cc: linux-kernel, linux-fsdevel, Jan Blunck

Bharata B Rao wrote:

> No, foo is not visible. While looking up a file in a union-mounted
> directory, the lookup starts from the topmost directory and proceeds
> downwards if the file isn't present in the upper layers. If a whiteout is
> found in any of the upper layers, the lookup is abandoned and -ENOENT
> is returned. Thus, as long as a whiteout exists in an upper layer for
> a file in a lower layer, the lower-layer file remains hidden; it becomes
> visible again only when the whiteout is removed.
> 
> However, if dir-c itself contains foo, that foo (from dir-c) will be
> visible after union mounting dir-c on top of dir-b and dir-a.

OK, so the major limitation of this approach is that the topmost layer
has to be ext2, ext3 or tmpfs (as of this patchset), and most likely not
NFS (the assumption being that NFS has no concept of a whiteout file
type).  One thing the unionfs people are doing with their ODF approach is
that, within the ODF fs, they have a special inode that is the "whiteout"
inode, and when they create a whiteout, they just create a hard link from
the dentry they want to white out to the "whiteout inode".  Could that be
a worthwhile approach instead of the whiteout file type?  (I.e. many
file systems support the concept of a hard link.)
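
To make the hard-link idea concrete, here is a small user-space sketch
in C. It only models the bookkeeping and is not taken from unionfs/ODF
or from this patchset: SKETCH_WHT_INO, struct sketch_dir and both helper
functions are hypothetical names, and the reserved inode number is an
arbitrary value picked for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_WHT_INO		7UL	/* hypothetical reserved inode number */
#define SKETCH_MAX_ENTRIES	8

struct sketch_dirent {
	const char *name;	/* caller keeps the string alive */
	unsigned long ino;
};

/* Hypothetical model of an upper-layer directory. */
struct sketch_dir {
	struct sketch_dirent entries[SKETCH_MAX_ENTRIES];
	size_t nr;
	unsigned int wht_nlink;	/* link count of the shared whiteout inode */
};

/*
 * Creating a whiteout is just adding one more name that points at the
 * shared whiteout inode, i.e. a hard link. Returns 0 on success, -1 if
 * the directory is full.
 */
static int sketch_whiteout(struct sketch_dir *dir, const char *name)
{
	if (dir->nr == SKETCH_MAX_ENTRIES)
		return -1;
	dir->entries[dir->nr].name = name;
	dir->entries[dir->nr].ino = SKETCH_WHT_INO;
	dir->nr++;
	dir->wht_nlink++;
	return 0;
}

/* A name is a whiteout if it resolves to the reserved inode. */
static int sketch_is_whiteout(const struct sketch_dir *dir, const char *name)
{
	size_t i;

	for (i = 0; i < dir->nr; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return dir->entries[i].ino == SKETCH_WHT_INO;
	return 0;
}
```

The point of the design is that the filesystem needs no new file type:
the whiteout test is an inode-number comparison, and the only on-disk
requirement is hard-link support.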

I ask with a diskless environment in mind.  Imagine PXE
booting a kernel/initramfs and then using a union to create a real root fs
(shared lower layer, private rw upper layer, a la live CDs).  Which
brings up a different point: with unionfs, one can pivot_root into it;
can one do the same for these "union mounts"?  I don't know enough about
the VFS to know if this should "just work" or might be a problem.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH  5/15] Introduce union stack
  2007-04-17 13:19 ` [RFC][PATCH 5/15] Introduce union stack Bharata B Rao
@ 2007-04-17 22:08   ` Serge E. Hallyn
  2007-04-18  3:27     ` Bharata B Rao
  0 siblings, 1 reply; 22+ messages in thread
From: Serge E. Hallyn @ 2007-04-17 22:08 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: linux-kernel, linux-fsdevel, Jan Blunck

Quoting Bharata B Rao (bharata@linux.vnet.ibm.com):
> From: Jan Blunck <j.blunck@tu-harburg.de>
> Subject: Introduce union stack.
> 
> Adds union stack infrastructure to the dentry structure and provides
> locking routines to walk the union stack.
> 
> Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  fs/Makefile                  |    2 
>  fs/dcache.c                  |    5 
>  fs/union.c                   |   53 +++++++++
>  include/linux/dcache.h       |   11 +
>  include/linux/dcache_union.h |  243 +++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 314 insertions(+)
> 
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.
>  obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
>  obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
> 
> +obj-$(CONFIG_UNION_MOUNT)	+= union.o
> +
>  obj-$(CONFIG_QUOTA)		+= dquot.o
>  obj-$(CONFIG_QFMT_V1)		+= quota_v1.o
>  obj-$(CONFIG_QFMT_V2)		+= quota_v2.o
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -936,6 +936,11 @@ struct dentry *d_alloc(struct dentry * p
>  #ifdef CONFIG_PROFILING
>  	dentry->d_cookie = NULL;
>  #endif
> +#ifdef CONFIG_UNION_MOUNT
> +	dentry->d_overlaid = NULL;
> +	dentry->d_topmost = NULL;
> +	dentry->d_union = NULL;
> +#endif
>  	INIT_HLIST_NODE(&dentry->d_hash);
>  	INIT_LIST_HEAD(&dentry->d_lru);
>  	INIT_LIST_HEAD(&dentry->d_subdirs);
> --- /dev/null
> +++ b/fs/union.c
> @@ -0,0 +1,53 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright © 2004-2007 IBM Corporation
> + *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + */
> +
> +#include <linux/fs.h>
> +
> +struct union_info * union_alloc(void)
> +{
> +	struct union_info *info;
> +
> +	info = kmalloc(sizeof(*info), GFP_ATOMIC);
> +	if (!info)
> +		return NULL;
> +
> +	mutex_init(&info->u_mutex);
> +	mutex_lock(&info->u_mutex);
> +	atomic_set(&info->u_count, 1);
> +	UM_DEBUG_LOCK("allocate union %p\n", info);
> +	return info;
> +}
> +
> +struct union_info * union_get(struct union_info *info)
> +{
> +	BUG_ON(!info);
> +	BUG_ON(!atomic_read(&info->u_count));
> +	atomic_inc(&info->u_count);
> +	UM_DEBUG_LOCK("get union %p (count=%d)\n", info,
> +		      atomic_read(&info->u_count));
> +	return info;
> +}

The locking here needs to be laid out.  It looks like union_get() needs
to be called under union_lock(), while union_get2() (horrible name)
grabs that lock itself, and returns with the lock held?

Similarly union_put clearly needs to be called under union_lock(), so
that should be commented here.
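
For reference, the reason union_put() as quoted just below depends on
the caller holding the lock is that it decrements the count and then
reads it back as two separate steps: with a count of 2 and two
unserialized callers, both can observe zero and both can free. A
lock-independent alternative is a single decrement-and-test; in the
kernel that would be atomic_dec_and_test(). The following user-space
sketch shows the pattern with C11 atomics. The struct and function
names are hypothetical and it is not a proposed replacement patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct union_info, reduced to the refcount. */
struct union_info_sketch {
	atomic_int u_count;
};

static struct union_info_sketch *sketch_alloc(void)
{
	struct union_info_sketch *info = malloc(sizeof(*info));

	if (!info)
		return NULL;
	atomic_init(&info->u_count, 1);	/* caller owns the first reference */
	return info;
}

static void sketch_get(struct union_info_sketch *info)
{
	atomic_fetch_add(&info->u_count, 1);
}

/*
 * Drop one reference. atomic_fetch_sub() returns the value before the
 * subtraction, so exactly one caller sees 1 and frees the object.
 * Returns true if this call released the last reference.
 */
static bool sketch_put(struct union_info_sketch *info)
{
	if (atomic_fetch_sub(&info->u_count, 1) == 1) {
		free(info);
		return true;
	}
	return false;
}
```

With this shape the put itself no longer relies on the mutex for
correctness of the free, though callers still must not touch the object
after dropping their own reference.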

> +void union_put(struct union_info *info)
> +{
> +	BUG_ON(!info);
> +	UM_DEBUG_LOCK("put union %p (count=%d)\n", info,
> +		      atomic_read(&info->u_count));
> +	atomic_dec(&info->u_count);
> +
> +	if (!atomic_read(&info->u_count)) {
> +		UM_DEBUG_LOCK("free union %p\n", info);
> +		kfree(info);
> +	}
> +
> +	return;
> +}
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -93,6 +93,12 @@ struct dentry {
>  	struct dentry *d_parent;	/* parent directory */
>  	struct qstr d_name;
> 
> +#ifdef CONFIG_UNION_MOUNT
> +	struct dentry *d_overlaid;	/* overlaid directory */
> +	struct dentry *d_topmost;	/* topmost directory */
> +	struct union_info *d_union;	/* union directory info */
> +#endif
> +
>  	struct list_head d_lru;		/* LRU list */
>  	/*
>  	 * d_child and d_rcu can share memory
> @@ -325,6 +331,11 @@ static inline struct dentry *dget(struct
>  	return dentry;
>  }
> 
> +/*
> + * Reference counting for union mounts
> + */
> +#include <linux/dcache_union.h>
> +
>  extern struct dentry * dget_locked(struct dentry *);
> 
>  /**
> --- /dev/null
> +++ b/include/linux/dcache_union.h
> @@ -0,0 +1,243 @@
> +/*
> + * VFS based union mount for Linux
> + *
> + * Copyright © 2004-2007 IBM Corporation
> + *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation; either version 2 of the License, or (at your option)
> + * any later version.
> + *
> + */
> +#ifndef __LINUX_DCACHE_UNION_H
> +#define __LINUX_DCACHE_UNION_H
> +#ifdef __KERNEL__
> +
> +#include <linux/union_debug.h>
> +#include <linux/fs_struct.h>
> +#include <asm/atomic.h>
> +#include <asm/semaphore.h>
> +
> +#ifdef CONFIG_UNION_MOUNT
> +
> +/*
> + * This is the union info object, that describes general information about this
> + * union directory
> + *
> + * u_mutex protects the union stack against modification. You can reach it
> + * through the d_union field in struct dentry. Hold it when you are walking
> + * or modifying the union stack!
> + *
> + * NOTE: Read the remark for union_trylock() below!
> + */
> +struct union_info {
> +	atomic_t u_count;
> +	struct mutex u_mutex;
> +};
> +
> +/* allocate/de-allocate */
> +extern struct union_info *union_alloc(void);
> +extern struct union_info *union_get(struct union_info *);
> +extern void union_put(struct union_info *);
> +
> +/*
> + * These are the functions for locking a dentry's union. When one
> + * wants to acquire a dentry's union lock, use:
> + *
> + * - union_lock() when you can sleep,
> + * - union_lock_spinlock() when you are holding a spinlock (that
> + *   you CAN safely give up and reacquire again)
> + * - union_lock_readlock() when you are holding a readlock (that
> + *   you CAN safely give up and reacquire again)
> + *
> + * Otherwise get the union lock early before you enter your
> + * "no sleeping here" code.
> + */
> +static inline void __union_lock(struct union_info *uinfo)
> +{
> +	BUG_ON(!atomic_read(&uinfo->u_count));
> +	mutex_lock(&uinfo->u_mutex);
> +}
> +
> +static inline void union_lock(struct dentry *dentry)
> +{
> +	if (unlikely(dentry && dentry->d_union)) {
> +		struct union_info *ui = dentry->d_union;
> +
> +		UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
> +			      dentry->d_name.name, ui,
> +			      atomic_read(&ui->u_count));
> +		__union_lock(dentry->d_union);
> +	}
> +}
> +
> +static inline void __union_unlock(struct union_info *uinfo)
> +{
> +	BUG_ON(!atomic_read(&uinfo->u_count));
> +	mutex_unlock(&uinfo->u_mutex);
> +}
> +
> +static inline void union_unlock(struct dentry *dentry)
> +{
> +	if (unlikely(dentry && dentry->d_union)) {
> +		struct union_info *ui = dentry->d_union;
> +
> +		UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
> +			      dentry->d_name.name, ui,
> +			      atomic_read(&ui->u_count));
> +		__union_unlock(dentry->d_union);
> +	}
> +}
> +
> +/*
> + * Two helpers for namespace.c
> + *
> + * FIXME: clean this up to get it right
> + */
> +static inline struct union_info *union_alloc2(struct dentry * dentry)
> +{
> +	struct union_info *uinfo;
> +
> +	spin_lock(&dentry->d_lock);
> +	if (!dentry->d_union) {
> +		dentry->d_union = union_alloc();
> +		uinfo = union_get(dentry->d_union);
> +		spin_unlock(&dentry->d_lock);
> +	} else {
> +		uinfo = union_get(dentry->d_union);
> +		spin_unlock(&dentry->d_lock);
> +		union_lock(dentry);
> +	}
> +
> +	return uinfo;
> +}
> +
> +static inline struct union_info *union_get2(struct dentry * dentry)
> +{
> +	struct union_info *uinfo;
> +
> +	union_lock(dentry);
> +	uinfo = union_get(dentry->d_union);
> +	return uinfo;
> +}
> +
> +static inline void union_release(struct union_info *uinfo)
> +{
> +	if (!uinfo)
> +		return;
> +
> +	mutex_unlock(&uinfo->u_mutex);
> +	union_put(uinfo);

Is it safe to do this - releasing the lock before doing the put?

> +}
> +
> +/*
> + * Immediately return ZERO if the lock is contended, NON-ZERO if it's acquired.
> + */
> +static inline int union_trylock(struct dentry *dentry)
> +{
> +	int locked = 1;
> +
> +	if (unlikely(dentry && dentry->d_union)) {
> +		UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
> +			      dentry->d_name.name, dentry->d_union,
> +			      atomic_read(&dentry->d_union->u_count));
> +		BUG_ON(!atomic_read(&dentry->d_union->u_count));
> +		locked = mutex_trylock(&dentry->d_union->u_mutex);
> +		UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
> +			      dentry->d_union,
> +			      locked ? "succeeded" : "failed");
> +	}
> +	return (locked ? 1 : 0);
> +}
> +
> +/*
> + * The following functions are locking helpers to guarantee the locking order
> + * in some situations.
> + */
> +
> +static inline void union_lock_spinlock(struct dentry *dentry, spinlock_t *lock)
> +{
> +	while (!union_trylock(dentry)) {
> +		spin_unlock(lock);
> +		cpu_relax();
> +		spin_lock(lock);
> +	}
> +}
> +
> +static inline void union_lock_readlock(struct dentry *dentry, rwlock_t *lock)
> +{
> +	while (!union_trylock(dentry)) {
> +		read_unlock(lock);
> +		cpu_relax();
> +		read_lock(lock);
> +	}
> +}
> +
> +/*
> + * This is a *I can't get no sleep* helper which is called when we try
> + * to access the struct fs_struct *fs field of a struct task_struct.
> + *
> + * Yes, this can starve, but only if root, altroot or pwd change
> + * at the frequency of this while loop. Don't think that this
> + * happens really often ;)
> + *
> + * This is called while holding the rwlock_t fs->lock
> + *
> + * TODO: Unlocking side of union_lock_fs() needs 3 union_unlock()s.
> + * Maybe introduce union_unlock_fs().
> + *
> + * FIXME: This routine is used when the caller wants to dget one or
> + * more of fs->[root, altroot, pwd]. When the caller doesn't want to
> + * dget _all_ of these, it is strictly not necessary to get union_locks
> + * on all of these. Check.
> + */
> +static inline void union_lock_fs(struct fs_struct *fs)
> +{
> +	int locked;
> +
> +	while (fs) {
> +		locked = union_trylock(fs->root);
> +		if (!locked)
> +			goto loop1;
> +		locked = union_trylock(fs->altroot);
> +		if (!locked)
> +			goto loop2;
> +		locked = union_trylock(fs->pwd);
> +		if (!locked)
> +			goto loop3;
> +		break;
> +	loop3:
> +		union_unlock(fs->altroot);
> +	loop2:
> +		union_unlock(fs->root);
> +	loop1:
> +		read_unlock(&fs->lock);
> +		UM_DEBUG_LOCK("Failed to get all semaphores in fs_struct!\n");
> +		cpu_relax();
> +		read_lock(&fs->lock);
> +		continue;
> +	}
> +	BUG_ON(!fs);
> +	return;
> +}
> +
> +#define IS_UNION(dentry) ((dentry)->d_overlaid || (dentry)->d_topmost || \
> +				(dentry)->d_union)
> +
> +#else /* CONFIG_UNION_MOUNT */
> +
> +#define union_lock(dentry) do { /* empty */ } while (0)
> +#define union_trylock(dentry) ({ (1); })
> +#define union_unlock(dentry) do { /* empty */ } while (0)
> +#define union_lock_spinlock(dentry, lock) do { /* empty */ } while (0)
> +#define union_lock_readlock(dentry, lock) do { /* empty */ } while (0)
> +#define union_lock_fs(fs) do { /* empty */ } while (0)
> +#define IS_UNION(dentry) ({ (0); })
> +#define union_alloc2(x) ({ BUG(); (0); })
> +#define union_get2(x) ({ BUG(); (0); })
> +#define union_release(x) do { BUG(); } while (0)
> +
> +#endif	/* CONFIG_UNION_MOUNT */
> +#endif	/* __KERNEL__ */
> +#endif	/* __LINUX_DCACHE_UNION_H */


* Re: [RFC][PATCH  5/15] Introduce union stack
  2007-04-17 22:08   ` Serge E. Hallyn
@ 2007-04-18  3:27     ` Bharata B Rao
  0 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-18  3:27 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: linux-kernel, linux-fsdevel, Jan Blunck

On Tue, Apr 17, 2007 at 05:08:48PM -0500, Serge E. Hallyn wrote:
> Quoting Bharata B Rao (bharata@linux.vnet.ibm.com):
> > From: Jan Blunck <j.blunck@tu-harburg.de>
> > Subject: Introduce union stack.
> > 
> > Adds union stack infrastructure to the dentry structure and provides
> > locking routines to walk the union stack.
> > 
> > Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > ---
> >  fs/Makefile                  |    2 
> >  fs/dcache.c                  |    5 
> >  fs/union.c                   |   53 +++++++++
> >  include/linux/dcache.h       |   11 +
> >  include/linux/dcache_union.h |  243 +++++++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 314 insertions(+)
> > 
> > --- a/fs/Makefile
> > +++ b/fs/Makefile
> > @@ -49,6 +49,8 @@ obj-$(CONFIG_FS_POSIX_ACL)	+= posix_acl.
> >  obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
> >  obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
> > 
> > +obj-$(CONFIG_UNION_MOUNT)	+= union.o
> > +
> >  obj-$(CONFIG_QUOTA)		+= dquot.o
> >  obj-$(CONFIG_QFMT_V1)		+= quota_v1.o
> >  obj-$(CONFIG_QFMT_V2)		+= quota_v2.o
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -936,6 +936,11 @@ struct dentry *d_alloc(struct dentry * p
> >  #ifdef CONFIG_PROFILING
> >  	dentry->d_cookie = NULL;
> >  #endif
> > +#ifdef CONFIG_UNION_MOUNT
> > +	dentry->d_overlaid = NULL;
> > +	dentry->d_topmost = NULL;
> > +	dentry->d_union = NULL;
> > +#endif
> >  	INIT_HLIST_NODE(&dentry->d_hash);
> >  	INIT_LIST_HEAD(&dentry->d_lru);
> >  	INIT_LIST_HEAD(&dentry->d_subdirs);
> > --- /dev/null
> > +++ b/fs/union.c
> > @@ -0,0 +1,53 @@
> > +/*
> > + * VFS based union mount for Linux
> > + *
> > + * Copyright © 2004-2007 IBM Corporation
> > + *   Author(s): Jan Blunck (j.blunck@tu-harburg.de)
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of the GNU General Public License as published by the Free
> > + * Software Foundation; either version 2 of the License, or (at your option)
> > + * any later version.
> > + */
> > +
> > +#include <linux/fs.h>
> > +
> > +struct union_info * union_alloc(void)
> > +{
> > +	struct union_info *info;
> > +
> > +	info = kmalloc(sizeof(*info), GFP_ATOMIC);
> > +	if (!info)
> > +		return NULL;
> > +
> > +	mutex_init(&info->u_mutex);
> > +	mutex_lock(&info->u_mutex);
> > +	atomic_set(&info->u_count, 1);
> > +	UM_DEBUG_LOCK("allocate union %p\n", info);
> > +	return info;
> > +}
> > +
> > +struct union_info * union_get(struct union_info *info)
> > +{
> > +	BUG_ON(!info);
> > +	BUG_ON(!atomic_read(&info->u_count));
> > +	atomic_inc(&info->u_count);
> > +	UM_DEBUG_LOCK("get union %p (count=%d)\n", info,
> > +		      atomic_read(&info->u_count));
> > +	return info;
> > +}
> 
> The locking here needs to be laid out.  It looks like union_get() needs
> to be called under union_lock(), while union_get2() (horrible name)
> grabs that lock itself, and returns with the lock held?
> 
> Similarly union_put clearly needs to be called under union_lock(), so
> that should be commented here.

Agreed. This whole thing needs a fresh look, and we need to establish
some rules and consistency here. Now that you have pointed this out, even
union_alloc2() looks suspect to me. The same goes for union_release().
Thanks for pointing this out.
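
To make the release-order question concrete, here is a minimal userspace
sketch, not the patch code. All names in it (uinfo_sketch, uinfo_get(),
uinfo_put(), uinfo_release(), uinfo_frees) are hypothetical stand-ins, and
an atomic_flag plays the role of u_mutex. It only illustrates one property:
if the lock is dropped before the put, the lock is never touched after the
object may have been freed. Whether the count itself must be changed under
the lock in the real code is a separate question that this sketch does not
answer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct union_info. */
struct uinfo_sketch {
	atomic_int count;	/* plays the role of u_count */
	atomic_flag locked;	/* plays the role of u_mutex (trylock only) */
};

/* Counts frees so that a caller can observe the final put. */
static int uinfo_frees;

static struct uinfo_sketch *uinfo_alloc(void)
{
	struct uinfo_sketch *ui = malloc(sizeof(*ui));

	if (!ui)
		return NULL;
	atomic_init(&ui->count, 1);
	atomic_flag_clear(&ui->locked);
	return ui;
}

/* Returns true if the lock was taken, false if it was already held. */
static bool uinfo_trylock(struct uinfo_sketch *ui)
{
	return !atomic_flag_test_and_set(&ui->locked);
}

static void uinfo_unlock(struct uinfo_sketch *ui)
{
	atomic_flag_clear(&ui->locked);
}

/* Takes an extra reference; the caller must already hold one. */
static struct uinfo_sketch *uinfo_get(struct uinfo_sketch *ui)
{
	assert(atomic_load(&ui->count) > 0);
	atomic_fetch_add(&ui->count, 1);
	return ui;
}

/* Drops one reference and frees the object when the last one goes away. */
static void uinfo_put(struct uinfo_sketch *ui)
{
	if (atomic_fetch_sub(&ui->count, 1) == 1) {
		free(ui);
		uinfo_frees++;
	}
}

/* Unlock first, then put: the lock is not touched after a possible free. */
static void uinfo_release(struct uinfo_sketch *ui)
{
	if (!ui)
		return;
	uinfo_unlock(ui);
	uinfo_put(ui);
}
```

With the opposite order (put, then unlock), the final put could free the
object and the unlock would then write to freed memory, so any rule we
settle on has to rule that order out.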

Regards,
Bharata.


* Re: [RFC][PATCH  0/15] VFS based Union Mount
  2007-04-17 16:56     ` Shaya Potter
@ 2007-04-18  7:19       ` Bharata B Rao
  0 siblings, 0 replies; 22+ messages in thread
From: Bharata B Rao @ 2007-04-18  7:19 UTC (permalink / raw)
  To: Shaya Potter; +Cc: linux-kernel, linux-fsdevel, Jan Blunck

On Tue, Apr 17, 2007 at 12:56:24PM -0400, Shaya Potter wrote:
> Bharata B Rao wrote:
> 
> >No. foo is not visible. While looking for a file in a union mounted
> >directory, the lookup starts from the topmost directory and proceeds
> >downwards if the file isn't present in the top layers. If a whiteout is
> >found in any of the top layers, the lookup is abandoned and -ENOENT
> >is returned. Thus as long as a whiteout exists in any upper layer for
> >a corresponding file in the lower layer, the lower layer file remains
> >hidden; it becomes visible again only when the whiteout is removed.
> >
> >However in the case of dir-c containing foo, the foo (from dir-c) will
> >become visible after union mounting dir-c on top of dir-b and dir-a.
> 
> ok, so the major limitation of this approach is that the topmost layer
> has to be either ext2, ext3 or tmpfs (in the patch), and most likely not
> NFS (the assumption being that NFS has no concept of the whiteout type
> of file).

I haven't played with union mounts with NFS, so I will let Jan answer
this. However, note that union mount provides a writable union only
if the filesystem supports the notion of whiteouts.
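
For reference, the lookup rule quoted above (start at the topmost layer,
walk downwards, and stop with -ENOENT at the first whiteout) can be
sketched in plain userspace C. This is only an illustration under
simplifying assumptions, not the code from the patchset: the types and the
function union_lookup_sketch() are hypothetical, and each layer is modelled
as a flat array of names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: a directory entry is a regular file or a whiteout. */
enum entry_kind { ENTRY_FILE, ENTRY_WHITEOUT };

struct entry {
	const char *name;
	enum entry_kind kind;
};

/* One layer of the union, modelled as a flat array of entries. */
struct layer {
	const struct entry *entries;
	size_t nr;
};

static const struct entry *layer_find(const struct layer *l, const char *name)
{
	for (size_t i = 0; i < l->nr; i++)
		if (strcmp(l->entries[i].name, name) == 0)
			return &l->entries[i];
	return NULL;
}

/*
 * layers[0] is the topmost layer. Returns the index of the layer that
 * provides the name, or -ENOENT if the name is whited out or not found.
 */
static int union_lookup_sketch(const struct layer *layers, size_t nr_layers,
			       const char *name)
{
	assert(layers != NULL && name != NULL);

	for (size_t i = 0; i < nr_layers; i++) {
		const struct entry *e = layer_find(&layers[i], name);

		if (!e)
			continue;	/* not in this layer, look further down */
		if (e->kind == ENTRY_WHITEOUT)
			return -ENOENT;	/* whiteout hides all lower layers */
		return (int)i;
	}
	return -ENOENT;
}
```

With dir-b (holding a whiteout for foo) on top of dir-a (holding foo), the
sketch returns -ENOENT for foo; putting a dir-c that contains foo on top
makes it return layer 0, which matches the dir-c case in the quoted text.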

> One thing the unionfs people are doing with their ODF approach is that,
> within the ODF fs, they have a special inode that is the "whiteout"
> inode, and when they create a whiteout, they just create a hardlink from
> the dentry they want to whiteout to the "whiteout inode".  Could that be
> a worthwhile approach instead of the whiteout file type?  (i.e. many
> file systems support the concept of a hard link).

We were thinking along similar lines, as noted in our documentation.
Right now we maintain one inode for every whiteout. We were planning to
have a single whiteout inode and have all whiteout dentries point to it.
But here again we were thinking of having every filesystem support
this whiteout inode type.
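
To illustrate the single-whiteout-inode idea, here is a rough userspace
sketch. It is not taken from our patches or from ODF; the structures and
helpers (inode_sketch, dentry_sketch, fs_sketch, whiteout_create(),
whiteout_remove()) are hypothetical. Every whiteout dentry points at one
shared inode, and the link count of that inode tracks how many whiteouts
exist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified inode and dentry. */
struct inode_sketch {
	unsigned long ino;
	unsigned int nlink;	/* number of dentries pointing here */
};

struct dentry_sketch {
	const char *name;
	struct inode_sketch *inode;
};

/* Assumption: each filesystem instance owns exactly one whiteout inode. */
struct fs_sketch {
	struct inode_sketch whiteout;
};

static void fs_init(struct fs_sketch *fs, unsigned long whiteout_ino)
{
	fs->whiteout.ino = whiteout_ino;
	fs->whiteout.nlink = 0;
}

/* Creating a whiteout is just one more link to the shared inode. */
static void whiteout_create(struct fs_sketch *fs, struct dentry_sketch *d,
			    const char *name)
{
	d->name = name;
	d->inode = &fs->whiteout;
	fs->whiteout.nlink++;
}

static int is_whiteout(const struct fs_sketch *fs,
		       const struct dentry_sketch *d)
{
	return d->inode == &fs->whiteout;
}

/* Removing a whiteout drops the link; the shared inode itself stays. */
static void whiteout_remove(struct fs_sketch *fs, struct dentry_sketch *d)
{
	assert(is_whiteout(fs, d));
	assert(fs->whiteout.nlink > 0);
	d->inode = NULL;
	fs->whiteout.nlink--;
}
```

Compared with one inode per whiteout, this keeps the inode cost constant
no matter how many files are whited out.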

Anyway I will have a look at ODF from unionfs to see how this is done.

>
> I ask because I am thinking of using union in a diskless environment.
> Imagine PXE booting a kernel/initramfs and then using union to create a
> real root fs (shared lower layer, private rw upper layer, a la live CDs).
> Which brings up a different point: with unionfs, one can pivot_root into
> it; can one do the same for these "union mounts"?  Don't know enough
> about the VFS to know if this should "just work" or might be a problem.

I would assume that it should 'just work', but right now it does not.
Our code is not yet ready to work correctly with move mounts. Since
pivot_root has semantics similar to move mounts, pivot_root does not work
either. chroot to a union mount point is also not working at the moment.
We will be working to get all of these right.

Regards,
Bharata.


end of thread, other threads:[~2007-04-18  7:12 UTC | newest]

Thread overview: 22+ messages
2007-04-17 13:14 [RFC][PATCH 0/15] VFS based Union Mount Bharata B Rao
2007-04-17 13:16 ` [RFC][PATCH 1/15] Add union mount documentation Bharata B Rao
2007-04-17 13:17 ` [RFC][PATCH 2/15] Add a new mount flag (MNT_UNION) for union mount Bharata B Rao
2007-04-17 13:17 ` [RFC][PATCH 3/15] Add the whiteout file type Bharata B Rao
2007-04-17 13:18 ` [RFC][PATCH 4/15] Add config options for union mount Bharata B Rao
2007-04-17 13:19 ` [RFC][PATCH 5/15] Introduce union stack Bharata B Rao
2007-04-17 22:08   ` Serge E. Hallyn
2007-04-18  3:27     ` Bharata B Rao
2007-04-17 13:20 ` [RFC][PATCH 6/15] Union-mount dentry reference counting Bharata B Rao
2007-04-17 13:20 ` [RFC][PATCH 7/15] Union-mount mounting Bharata B Rao
2007-04-17 13:21 ` [RFC][PATCH 8/15] Union-mount lookup Bharata B Rao
2007-04-17 13:22 ` [RFC][PATCH 9/15] Simple union-mount readdir Bharata B Rao
2007-04-17 13:22 ` [RFC][PATCH 10/15] In-kernel file copy between union mounted filesystems Bharata B Rao
2007-04-17 13:23 ` [RFC][PATCH 11/15] VFS whiteout handling Bharata B Rao
2007-04-17 13:23 ` [RFC][PATCH 12/15] ext2 whiteout support Bharata B Rao
2007-04-17 13:24 ` [RFC][PATCH 13/15] ext3 " Bharata B Rao
2007-04-17 13:24 ` [RFC][PATCH 14/15] tmpfs " Bharata B Rao
2007-04-17 13:25 ` [RFC][PATCH 15/15] Union-mount changes for NFS Bharata B Rao
2007-04-17 14:35 ` [RFC][PATCH 0/15] VFS based Union Mount Shaya Potter
2007-04-17 16:30   ` Bharata B Rao
2007-04-17 16:56     ` Shaya Potter
2007-04-18  7:19       ` Bharata B Rao
