* [PATCH] namespace.c: fix bind mount from foreign namespace @ 2005-05-13 10:44 Miklos Szeredi 2005-05-13 16:49 ` Ram 2005-05-13 17:06 ` Al Viro 0 siblings, 2 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-13 10:44 UTC (permalink / raw) To: akpm; +Cc: linux-kernel, linux-fsdevel Bind mount from a foreign namespace results in an un-removable mount. The reason is that mnt->mnt_namespace is copied from the old mount in clone_mnt(). Because of this, check_mnt() in sys_umount() will fail. The solution is to set mnt->mnt_namespace to current->namespace. clone_mnt() is either called from do_loopback() or copy_tree(). copy_tree() is called from do_loopback() or copy_namespace(). When called (directly or indirectly) from do_loopback(), always current->namspace is being modified: check_mnt(nd->mnt). So setting mnt->mnt_namespace to current->namspace is the right thing to do. When called from copy_namespace(), the setting of mnt_namespace is irrelevant, since mnt_namespace is reset later in that function for all copied mounts. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Index: linux/fs/namespace.c =================================================================== --- linux.orig/fs/namespace.c 2005-05-13 12:22:52.000000000 +0200 +++ linux/fs/namespace.c 2005-05-13 12:32:36.000000000 +0200 @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d mnt->mnt_root = dget(root); mnt->mnt_mountpoint = mnt->mnt_root; mnt->mnt_parent = mnt; - mnt->mnt_namespace = old->mnt_namespace; + mnt->mnt_namespace = current->namespace; /* stick the duplicate mount on the same expiry list * as the original if that was on one */ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 10:44 [PATCH] namespace.c: fix bind mount from foreign namespace Miklos Szeredi @ 2005-05-13 16:49 ` Ram 2005-05-13 17:06 ` Al Viro 1 sibling, 0 replies; 44+ messages in thread From: Ram @ 2005-05-13 16:49 UTC (permalink / raw) To: Miklos Szeredi; +Cc: Andrew Morton, linux-kernel, linux-fsdevel On Fri, 2005-05-13 at 03:44, Miklos Szeredi wrote: > Bind mount from a foreign namespace results in an un-removable mount. i wonder, should we even allow the ability to bind mount from a foreign namespace? The only time this is allowed is while creating a new namespace. RP > The reason is that mnt->mnt_namespace is copied from the old mount in > clone_mnt(). Because of this, check_mnt() in sys_umount() will fail. > > The solution is to set mnt->mnt_namespace to current->namespace. > > clone_mnt() is either called from do_loopback() or copy_tree(). > copy_tree() is called from do_loopback() or copy_namespace(). > > When called (directly or indirectly) from do_loopback(), always > current->namspace is being modified: check_mnt(nd->mnt). So setting > mnt->mnt_namespace to current->namspace is the right thing to do. > > When called from copy_namespace(), the setting of mnt_namespace is > irrelevant, since mnt_namespace is reset later in that function for > all copied mounts. > > Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> > > Index: linux/fs/namespace.c > =================================================================== > --- linux.orig/fs/namespace.c 2005-05-13 12:22:52.000000000 +0200 > +++ linux/fs/namespace.c 2005-05-13 12:32:36.000000000 +0200 > @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d > mnt->mnt_root = dget(root); > mnt->mnt_mountpoint = mnt->mnt_root; > mnt->mnt_parent = mnt; > - mnt->mnt_namespace = old->mnt_namespace; > + mnt->mnt_namespace = current->namespace; > > /* stick the duplicate mount on the same expiry list > * as the original if that was on one */ > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 10:44 [PATCH] namespace.c: fix bind mount from foreign namespace Miklos Szeredi 2005-05-13 16:49 ` Ram @ 2005-05-13 17:06 ` Al Viro 2005-05-13 17:17 ` Miklos Szeredi 1 sibling, 1 reply; 44+ messages in thread From: Al Viro @ 2005-05-13 17:06 UTC (permalink / raw) To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel On Fri, May 13, 2005 at 12:44:03PM +0200, Miklos Szeredi wrote: > Bind mount from a foreign namespace results in ... -EINVAL ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 17:06 ` Al Viro @ 2005-05-13 17:17 ` Miklos Szeredi 2005-05-13 17:25 ` Al Viro 2005-05-13 17:29 ` Ram 0 siblings, 2 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-13 17:17 UTC (permalink / raw) To: viro; +Cc: akpm, linux-kernel, linux-fsdevel > > Bind mount from a foreign namespace results in > > ... -EINVAL Wrong answer. Look again, you wrote the code, so you _should_ know ;) Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 17:17 ` Miklos Szeredi @ 2005-05-13 17:25 ` Al Viro 2005-05-13 17:34 ` Miklos Szeredi 2005-05-13 17:29 ` Ram 1 sibling, 1 reply; 44+ messages in thread From: Al Viro @ 2005-05-13 17:25 UTC (permalink / raw) To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel On Fri, May 13, 2005 at 07:17:39PM +0200, Miklos Szeredi wrote: > > > Bind mount from a foreign namespace results in > > > > ... -EINVAL > > Wrong answer. Look again, you wrote the code, so you _should_ know ;) static inline int check_mnt(struct vfsmount *mnt) { return mnt->mnt_namespace == current->namespace; } static int do_loopback(struct nameidata *nd, char *old_name, int recurse) { struct nameidata old_nd; struct vfsmount *mnt = NULL; /* no changes of mnt */ err = -EINVAL; if (check_mnt(nd->mnt) && ... ) { /* assigns to mnt */ } if (mnt) { /* assigns to err */ } up_write(¤t->namespace->sem); path_release(&old_nd); return err; } Care to explain how that would not give -EINVAL? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 17:25 ` Al Viro @ 2005-05-13 17:34 ` Miklos Szeredi 0 siblings, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-13 17:34 UTC (permalink / raw) To: viro; +Cc: akpm, linux-kernel, linux-fsdevel > static int do_loopback(struct nameidata *nd, char *old_name, int recurse) > { > struct nameidata old_nd; > struct vfsmount *mnt = NULL; > /* no changes of mnt */ > err = -EINVAL; > if (check_mnt(nd->mnt) && ... ) { > /* assigns to mnt */ > } > if (mnt) { > /* assigns to err */ > } > up_write(¤t->namespace->sem); > path_release(&old_nd); > return err; > } > > Care to explain how that would not give -EINVAL? Yeah, but that check_mnt() checks the _destination_ of the bind not the source. The source is only checked for recursive mounts, presumably because the source namespace is not locked, and so can change. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 17:17 ` Miklos Szeredi 2005-05-13 17:25 ` Al Viro @ 2005-05-13 17:29 ` Ram 2005-05-13 18:40 ` Miklos Szeredi 1 sibling, 1 reply; 44+ messages in thread From: Ram @ 2005-05-13 17:29 UTC (permalink / raw) To: Miklos Szeredi; +Cc: viro, Andrew Morton, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 578 bytes --] On Fri, 2005-05-13 at 10:17, Miklos Szeredi wrote: > > > Bind mount from a foreign namespace results in > > > > ... -EINVAL > > Wrong answer. Look again, you wrote the code, so you _should_ know ;) I guess Al agrees that bind mount from foreign namespace must be disallowed. Which means what Jamie pointed to was right. Attached the patch which fixes it. > > Miklos > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: bind.patch --] [-- Type: text/x-patch, Size: 478 bytes --] --- /home/linux/views/linux-2.6.12-rc4/fs/namespace.c 2005-05-06 23:22:29.000000000 -0700 +++ 2.6.12-rc4/fs/namespace.c 2005-05-13 10:17:19.000000000 -0700 @@ -633,7 +633,7 @@ static int do_loopback(struct nameidata down_write(¤t->namespace->sem); err = -EINVAL; - if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) { + if (check_mnt(nd->mnt) && check_mnt(old_nd.mnt)) { err = -ENOMEM; if (recurse) mnt = copy_tree(old_nd.mnt, old_nd.dentry); ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-13 17:29 ` Ram @ 2005-05-13 18:40 ` Miklos Szeredi [not found] ` <1116012287.6248.410.camel@localhost> 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-13 18:40 UTC (permalink / raw) To: linuxram; +Cc: viro, akpm, linux-kernel, linux-fsdevel > > > > Bind mount from a foreign namespace results in > > > > > > ... -EINVAL > > > > Wrong answer. Look again, you wrote the code, so you _should_ know ;) > > I guess Al agrees that bind mount from foreign namespace must be > disallowed. > > Which means what Jamie pointed to was right. Attached the patch which > fixes it. You are very quick fixing things which are not broken :) And BTW Jamie was saying, the checks should be removed, not that more checks should be added (as your patch does). Jamie Lokier wrote: > I agree about the bug (and it's why I think the current->namespace > checks in fs/namespace.c should be killed - the _only_ effect is to > make un-removable mounts like the above, and the checks are completely > redundant for "normal" namespace operations). The checks are actually not redundant, but only because of locking reasons, not because of security reasons. So I agree with Jamie, that in the long run it makes sense to relax those checks. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <1116012287.6248.410.camel@localhost>]
[parent not found: <E1DWfqJ-0004eP-00@dorka.pomaz.szeredi.hu>]
[parent not found: <1116013840.6248.429.camel@localhost>]
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace [not found] ` <1116013840.6248.429.camel@localhost> @ 2005-05-14 6:11 ` Miklos Szeredi 2005-05-16 15:11 ` Ram 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-14 6:11 UTC (permalink / raw) To: linuxram; +Cc: viro, akpm, linux-kernel, linux-fsdevel (CC restored, because I think this is interesting to others as well) > > > I dont get it... > > > > > > do you agree that bind mounts accross namespaces should be disallowed? > > > > No. I think it's a useful feature. It's actually been discussed at > > length earlier, that it makes sense if the user can copy private > > mounts from one session to the other (or even automate this with pam, > > and a daemon). Why should they be disallowed? > > I understand that feature and the way to do it is through shared > subtrees. I agree fully, that shared subtrees would be very useful. > I am against a mount in one namespace being bind mounted in > another namespace(which Al Viro has implied in his mail). I'd rather not speculate on what Al Viro was thinking, it may have been just a misunderstanding. > With shared subtree if a bind mount is done > in one namespace, the bind happens within the same namespace. Yes. But that's not fundamentally different from explicitly passing a mount (through a file descriptor) to a process in a different namespace, and allowing that process to bind mount it in it's native namespace. The end result can be exactly the same, only in the shared subtree case the binding is implicit, while in the other case it's explicit on both sides (which makes it perfectly secure even for unprivileged use) Please explain why you think it's wrong to be able to bind mount from a different namespace? > > However the operation is mirrored to other namespaces > that has the same heridity link to this namespace. > > probably I can give an example: > > if namespace n1 has the following tree > v11 > / \ > v12 v13 > > v1 is mark shared. (mount --make-shared v1) [ for simplicity I vxy > means yth vfsmount in xth namespace ] > > and than n2 is cloned out of n1, than in n2 we have > v21 > / \ > v22 v23 > > now a bind mount in n1 > mount --bind v12 v13 > > will first change the tree in n1 as follows: > > v11 > / \ > v12 v13 > \ > v14 > where v14 is a bind mount of v12 > > > and than due to propogation the tree in n2 will also change to > v21 > / \ > v22 v23 > \ > v24 > where v24 is a bind mount of v22 > > > Essentially there is no cross-contamination, as well as it meets > the requirement of per-user-namespace. What do you mean by cross contamination? Thanks, Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-14 6:11 ` Miklos Szeredi @ 2005-05-16 15:11 ` Ram 2005-05-16 8:44 ` Miklos Szeredi 2005-05-16 11:14 ` Jamie Lokier 0 siblings, 2 replies; 44+ messages in thread From: Ram @ 2005-05-16 15:11 UTC (permalink / raw) To: Miklos Szeredi; +Cc: viro, Andrew Morton, linux-kernel, linux-fsdevel On Fri, 2005-05-13 at 23:11, Miklos Szeredi wrote: > (CC restored, because I think this is interesting to others as well) > > > > > I dont get it... > > > > > > > > do you agree that bind mounts accross namespaces should be disallowed? > > > > > > No. I think it's a useful feature. It's actually been discussed at > > > length earlier, that it makes sense if the user can copy private > > > mounts from one session to the other (or even automate this with pam, > > > and a daemon). Why should they be disallowed? > > > > I understand that feature and the way to do it is through shared > > subtrees. > > I agree fully, that shared subtrees would be very useful. > > > I am against a mount in one namespace being bind mounted in > > another namespace(which Al Viro has implied in his mail). > > I'd rather not speculate on what Al Viro was thinking, it may have > been just a misunderstanding. Can somebody who know internals of Al Viro's thinking help here? > > > With shared subtree if a bind mount is done > > in one namespace, the bind happens within the same namespace. > > Yes. But that's not fundamentally different from explicitly passing a > mount (through a file descriptor) to a process in a different > namespace, and allowing that process to bind mount it in it's native > namespace. > > The end result can be exactly the same, only in the shared subtree > case the binding is implicit, while in the other case it's explicit on > both sides (which makes it perfectly secure even for unprivileged use) > > Please explain why you think it's wrong to be able to bind mount from > a different namespace? If It is allowed, the concept of namespaces itself becomes nebulous. one could bind mount the root vfsmount of all the other namespace in their own namespace and then it all becomes one big tree with all the other namespaces as a subtree. why would we need this feature? what extra advantage would this feature provide us? Is the advantage of this feature already discussed in this thread? (maybe i missed it). > > > > > However the operation is mirrored to other namespaces > > that has the same heridity link to this namespace. > > > > probably I can give an example: > > > > if namespace n1 has the following tree > > v11 > > / \ > > v12 v13 > > > > v1 is mark shared. (mount --make-shared v1) [ for simplicity I vxy > > means yth vfsmount in xth namespace ] > > > > and than n2 is cloned out of n1, than in n2 we have > > v21 > > / \ > > v22 v23 > > > > now a bind mount in n1 > > mount --bind v12 v13 > > > > will first change the tree in n1 as follows: > > > > v11 > > / \ > > v12 v13 > > \ > > v14 > > where v14 is a bind mount of v12 > > > > > > and than due to propogation the tree in n2 will also change to > > v21 > > / \ > > v22 v23 > > \ > > v24 > > where v24 is a bind mount of v22 > > > > > > Essentially there is no cross-contamination, as well as it meets > > the requirement of per-user-namespace. > > What do you mean by cross contamination? A vfsmount in one namespace bound to a mountpoint in another namespace. RP > > Thanks, > Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 15:11 ` Ram @ 2005-05-16 8:44 ` Miklos Szeredi 2005-05-16 8:59 ` Miklos Szeredi 2005-05-16 11:26 ` Jamie Lokier 2005-05-16 11:14 ` Jamie Lokier 1 sibling, 2 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-16 8:44 UTC (permalink / raw) To: linuxram; +Cc: viro, akpm, linux-kernel, linux-fsdevel > Can somebody who know internals of Al Viro's thinking help here? There's only one person... :) > > Please explain why you think it's wrong to be able to bind mount from > > a different namespace? > > > If It is allowed, the concept of namespaces itself becomes > nebulous. one could bind mount the root vfsmount of all the other > namespace in their own namespace and then it all becomes one big tree > with all the other namespaces as a subtree. 1) you need not recursively bind the whole tree of the private namespace. In fact you can only do that by hand, since the kernel won't do it (!recurse || check_mnt(old_nd.mnt) in do_loopback). 2) you won't see changes made in other namespace, they are still separate, they are just sharing some filesystems, just as after clone, or just as after propagation within a shared subtree. 3) this is not automatic, so no filesystems in the other namespace are visible unless there's an explicit action by processes both in the originating namespace and the destination namespace. 4) in fact, the process in the originating namespace can single out a mount and just send a file descriptor refering to that mount (e.g. by binding it to a temporary directory, opening the root, detaching from the mountpoint, and then sending the file descriptor to the receiving process). This way the receiving process will see no other mounts in the originating namespace, and can only bind from that single mount. > why would we need this feature? what extra advantage would this feature > provide us? Is the advantage of this feature already discussed in this > thread? (maybe i missed it). http://lkml.org/lkml/2005/4/25/47 It was suggested as a way of sharing mounts between sessions (as opposed to shared subtrees, which shares mounts between parent/child process). I'm not saying that that this is the way to do it. But it does seem to me a useful feature. And since it doesn't add _any_ complexity to the kernel, I think it would be rather stupid to remove it. > > What do you mean by cross contamination? > > A vfsmount in one namespace bound to a mountpoint in another > namespace. A vfsmount can only be in a single namespace at a time, since each mount tree is rooted in a single namespace. So what you are saying is impossible. This feature is about binding a mount (copying a vfsmount in clone_mnt()), which happens to be in a different namespace. After the vfsmount is cloned, and is attached to the process's native namespace it's in that namespace solely. The bug being discussed, is an administrative error, of setting mnt_namespace to the wrong value, which causes the mount to be un-removable. So there was never a question of "cross contamination". Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 8:44 ` Miklos Szeredi @ 2005-05-16 8:59 ` Miklos Szeredi 2005-05-16 11:26 ` Jamie Lokier 1 sibling, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-16 8:59 UTC (permalink / raw) To: miklos; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel > A vfsmount can only be in a single namespace at a time, since each > mount tree is rooted in a single namespace. So what you are saying is > impossible. To be more precise a vfsmount can only be in _at_ _most_ one namespace. When it is detached from a mount tree, it's no longer in _any_ namespace. There even seems to be some confusion about that in namespace.c. I think mnt_namespace should be set to NULL in detach_mnt() instead of __put_namespace(). Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 8:44 ` Miklos Szeredi 2005-05-16 8:59 ` Miklos Szeredi @ 2005-05-16 11:26 ` Jamie Lokier 2005-05-16 13:23 ` Miklos Szeredi 1 sibling, 1 reply; 44+ messages in thread From: Jamie Lokier @ 2005-05-16 11:26 UTC (permalink / raw) To: Miklos Szeredi; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi wrote: > 1) you need not recursively bind the whole tree of the private > namespace. In fact you can only do that by hand, since the kernel > won't do it (!recurse || check_mnt(old_nd.mnt) in do_loopback). That would be easy to change if it was desired though, by taking both namespace semaphores when two namespaces are involved. > 2) you won't see changes made in other namespace, they are still > separate, they are just sharing some filesystems, just as after > clone, or just as after propagation within a shared subtree. That's true. Let's not get confused between binding across namespaces, and chroot/chdir into an fd supplied by a process from another namespace. In the case of bind mount, that _won't_ see changes made in the other namespace. The /dentry/ visible in the other namespace is simply mounted here. The fact that it happens to come from another namespace is irrelevant: the other namespace is not used for single dentry bind mounts. In the case of chroot/chdir, that _will_ see changes made in the other namespace. Effectively, that transitions into the other namespace as a whole, which is exactly what we want in some cases (when userspace policy determines that a per-session namespace is wanted). > 4) in fact, the process in the originating namespace can single out a > mount and just send a file descriptor refering to that mount > (e.g. by binding it to a temporary directory, opening the root, > detaching from the mountpoint, and then sending the file descriptor > to the receiving process). This way the receiving process will see > no other mounts in the originating namespace, and can only bind > from that single mount. Nice. The process in the originating namespace can also bind a small, carefully selected tree of mounts to a tree in that temporary directory before passing it, so the recipient can chroot/chdir into the set of mounts and get only those explicitly authorised by the originating process. -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 11:26 ` Jamie Lokier @ 2005-05-16 13:23 ` Miklos Szeredi 0 siblings, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-16 13:23 UTC (permalink / raw) To: jamie; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel > > 1) you need not recursively bind the whole tree of the private > > namespace. In fact you can only do that by hand, since the kernel > > won't do it (!recurse || check_mnt(old_nd.mnt) in do_loopback). > > That would be easy to change if it was desired though, by taking both > namespace semaphores when two namespaces are involved. Yes. The other check_mnt() calls could be removed by taking nd.mnt->mnt_namespace->sem instead of current->namespace->sem in the relevant functions. It does make sense IMO, even if it won't be used very often, since only very little extra complexity is involved. > > 4) in fact, the process in the originating namespace can single out a > > mount and just send a file descriptor refering to that mount > > (e.g. by binding it to a temporary directory, opening the root, > > detaching from the mountpoint, and then sending the file descriptor > > to the receiving process). This way the receiving process will see > > no other mounts in the originating namespace, and can only bind > > from that single mount. > > Nice. The process in the originating namespace can also bind a small, > carefully selected tree of mounts to a tree in that temporary > directory before passing it, so the recipient can chroot/chdir into > the set of mounts and get only those explicitly authorised by the > originating process. That won't work, since detach (umount -l) will break up the tree, and the file descriptor will hold a reference to only one vfsmount/dentry. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 15:11 ` Ram 2005-05-16 8:44 ` Miklos Szeredi @ 2005-05-16 11:14 ` Jamie Lokier 2005-05-17 3:50 ` Ram 1 sibling, 1 reply; 44+ messages in thread From: Jamie Lokier @ 2005-05-16 11:14 UTC (permalink / raw) To: Ram; +Cc: Miklos Szeredi, viro, Andrew Morton, linux-kernel, linux-fsdevel Ram wrote: > > I'd rather not speculate on what Al Viro was thinking, it may have > > been just a misunderstanding. > > Can somebody who know internals of Al Viro's thinking help here? Presumably he wrote this line: if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) { Which /explicitly/ permits bind mounts between namespaces if it's not recursive. It's not accidental: that !recurse is blatantly making a point of allowing it. I take that to mean that /at least at one time/ Al chose to allow it. Then again, he also wrote this: > > Bind mount from a foreign namespace results in > > ... -EINVAL Which means that /at another time/ Al thought he'd disallowed it. This is a bit like arguing over what the Founding Fathers of the US Constitution meant. Does it matter? We really should ask what behaviour makes sense now. Should we add more explicit restrictions to the code, making the concept of namespaces more restrictive? Or remove the restrictions, on the grounds that they don't really add any security, it'd be useful to relax them, and the code would be simpler? -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 11:14 ` Jamie Lokier @ 2005-05-17 3:50 ` Ram 2005-05-16 20:15 ` Miklos Szeredi 2005-05-17 0:00 ` Jamie Lokier 0 siblings, 2 replies; 44+ messages in thread From: Ram @ 2005-05-17 3:50 UTC (permalink / raw) To: Jamie Lokier Cc: Miklos Szeredi, viro, Andrew Morton, linux-kernel, linux-fsdevel On Mon, 2005-05-16 at 04:14, Jamie Lokier wrote: > Ram wrote: > > > I'd rather not speculate on what Al Viro was thinking, it may have > > > been just a misunderstanding. > > > > Can somebody who know internals of Al Viro's thinking help here? > > Presumably he wrote this line: > > if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) { > > Which /explicitly/ permits bind mounts between namespaces if it's not > recursive. It's not accidental: that !recurse is blatantly making a > point of allowing it. > > I take that to mean that /at least at one time/ Al chose to allow it. > > Then again, he also wrote this: > > > > Bind mount from a foreign namespace results in > > > > ... -EINVAL > > Which means that /at another time/ Al thought he'd disallowed it. > > This is a bit like arguing over what the Founding Fathers of the US > Constitution meant. Does it matter? We really should ask what > behaviour makes sense now. Should we add more explicit restrictions > to the code, making the concept of namespaces more restrictive? Or > remove the restrictions, on the grounds that they don't really add any > security, it'd be useful to relax them, and the code would be simpler? > Ok. less restriction without compromising security is a good idea. Under the premise that bind mounts across namespace should be allowed; any insight why the "founding fathers" :) allowed only bind and not recursive bind? What issue would that create? One can easily workaround that restriction by manually binding recursively. So does the recursive bind restriction serve any purpose? I remember Miklos saying its not a security issue but a implementation/locking issue. That can be fixed aswell. RP > -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-17 3:50 ` Ram @ 2005-05-16 20:15 ` Miklos Szeredi 2005-05-17 1:28 ` Jamie Lokier 2005-05-17 18:48 ` [PATCH] namespace.c: fix bind mount from foreign namespace Ram 2005-05-17 0:00 ` Jamie Lokier 1 sibling, 2 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-16 20:15 UTC (permalink / raw) To: linuxram; +Cc: jamie, miklos, viro, akpm, linux-kernel, linux-fsdevel > Ok. less restriction without compromising security is a good idea. > > Under the premise that bind mounts across namespace should be allowed; > any insight why the "founding fathers" :) allowed only bind > and not recursive bind? What issue would that create? One can > easily workaround that restriction by manually binding recursively. > So does the recursive bind restriction serve any purpose? > > I remember Miklos saying its not a security issue but a > implementation/locking issue. That can be fixed aswell. Yes, as pointed out by Jamie, both namespaces need to be locked for this to work. Something like the attached should do it. Miklos Index: linux/fs/namespace.c =================================================================== --- linux.orig/fs/namespace.c 2005-05-16 22:02:36.000000000 +0200 +++ linux/fs/namespace.c 2005-05-16 22:13:30.000000000 +0200 @@ -622,6 +622,8 @@ out_unlock: static int do_loopback(struct nameidata *nd, char *old_name, int recurse) { struct nameidata old_nd; + struct namespace *ns1 = current->namespace; + struct namespace *ns2 = NULL; struct vfsmount *mnt = NULL; int err = mount_is_safe(nd); if (err) @@ -632,15 +634,30 @@ static int do_loopback(struct nameidata if (err) return err; - down_write(¤t->namespace->sem); err = -EINVAL; - if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) { - err = -ENOMEM; - if (recurse) - mnt = copy_tree(old_nd.mnt, old_nd.dentry); - else - mnt = clone_mnt(old_nd.mnt, old_nd.dentry); - } + if (!check_mnt(nd->mnt)) + goto out_path_release; + + if (recurse && old_nd.mnt->mnt_namespace != ns1) { + ns2 = old_nd.mnt->mnt_namespace; + if (ns1 < ns2) { + down_write(&ns1->sem); + down_write(&ns2->sem); + } else { + down_write(&ns2->sem); + down_write(&ns1->sem); + } + } else + down_write(&ns1->sem); + + err = -ENOMEM; + if (recurse) + mnt = copy_tree(old_nd.mnt, old_nd.dentry); + else + mnt = clone_mnt(old_nd.mnt, old_nd.dentry); + + if (ns2) + up_write(&ns2->sem); if (mnt) { /* stop bind mounts from expiring */ @@ -657,7 +674,8 @@ static int do_loopback(struct nameidata mntput(mnt); } - up_write(¤t->namespace->sem); + up_write(&ns1->sem); +out_path_release: path_release(&old_nd); return err; } ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 20:15 ` Miklos Szeredi @ 2005-05-17 1:28 ` Jamie Lokier 2005-05-17 5:34 ` Miklos Szeredi 2005-05-17 18:48 ` [PATCH] namespace.c: fix bind mount from foreign namespace Ram 1 sibling, 1 reply; 44+ messages in thread From: Jamie Lokier @ 2005-05-17 1:28 UTC (permalink / raw) To: Miklos Szeredi; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi wrote: > + if (ns1 < ns2) { > + down_write(&ns1->sem); > + down_write(&ns2->sem); > + } else { > + down_write(&ns2->sem); > + down_write(&ns1->sem); > + } That's a bit smaller (source and compiled) as: if (ns2 < ns1) down_write(&ns2->sem); down_write(&ns1->sem); if (ns2 > ns1) down_write(&ns2->sem); (And you'll notice that does the right thing if ns2==ns1 too, in case that gives you any ideas.) Otherwise, the patch looks convincing to me. -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-17 1:28 ` Jamie Lokier @ 2005-05-17 5:34 ` Miklos Szeredi [not found] ` <1116360352.24560.85.camel@localhost> 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-17 5:34 UTC (permalink / raw) To: jamie; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel > That's a bit smaller (source and compiled) as: > > if (ns2 < ns1) > down_write(&ns2->sem); > down_write(&ns1->sem); > if (ns2 > ns1) > down_write(&ns2->sem); > > (And you'll notice that does the right thing if ns2==ns1 too, in case > that gives you any ideas.) Nice. > Otherwise, the patch looks convincing to me. There's another problem with it: we don't hold a reference to old_nd.mnt->mnt_namespace, and the namespace going away could race with this function. So first obtain the necessary reference: spin_lock(&vfsmount_lock); ns2 = old_nd.mnt->mnt_namespace; if (ns2) get_namespace(ns2); spin_unlock(&vfsmount_lock); Then take the semaphores. Then recheck old_nd.mnt->mnt_namespace, because it might have been detached between the spin_unlock() and the down_write(&ns2->sem). It's starting to get a bit complex, and I'm wondering if it's worth it :) Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <1116360352.24560.85.camel@localhost>]
[parent not found: <E1DYI0m-0000K5-00@dorka.pomaz.szeredi.hu>]
[parent not found: <1116399887.24560.116.camel@localhost>]
[parent not found: <1116400118.24560.119.camel@localhost>]
* [PATCH] fix race in mark_mounts_for_expiry() [not found] ` <1116400118.24560.119.camel@localhost> @ 2005-05-18 9:51 ` Miklos Szeredi 2005-05-18 10:32 ` David Howells 1 sibling, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 9:51 UTC (permalink / raw) To: linuxram Cc: dhowells, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel > > > > I am seeing a locking issue with get_namespace() and put_namespace() > > > > > > > > lets say put_namespace() is called and it finds that it is the last > > > > user of the namespace is just about to call __put_namespace(). > > > > > > > > Simultaneously another thread calls get_namespace() and increments > > > > the count. > > > > > > > > At this point __put_namespace() goes ahead and cleans the namespace. > > > > > > I think you are right. > > > > > > There's one place in the existing code, which could experience this > > > race I think: > > > > > > > > > /* don't do anything if the namespace is dead - all the > > > * vfsmounts from it are going away anyway */ > > > namespace = mnt->mnt_namespace; > > > if (!namespace || atomic_read(&namespace->count) <= 0) > > > continue; > > > here ----> > > > get_namespace(namespace); > > > > > > spin_unlock(&vfsmount_lock); > > > > > > > > > Locking vfsmount_lock in put_namespace() would fix it. Any better ideas? > > > > > yes. it will. > > > > However I don't think this issue will trigger currently because for this > > to happen, a process should attempt a get_namespace() on a foreign > > namespace. If all tasks are operating in the same > > namespace, than the count will not go to zero for put namespace to > > clean it up, unless it is the last task. > > > > I think, it will trigger once we allow recursive binds from foreign > > namespace. > > No I am wrong. mark_mounts_for_expiry() is not called from a process > context. Its in a timer context. So the race could happen. How about this patch? It tries to solve this race without additional locking. If refcount is already zero, it will increment and decrement it. So be careful to only call grab_namespace() with vfsmount_lock held, otherwise it could race with itself. (vfsmount_lock is also needed in this case so that mnt->mnt_namespace, doesn't change, while grabbing the namespace) Compile tested only, please review carefully. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Index: linux/fs/namespace.c =================================================================== --- linux.orig/fs/namespace.c 2005-05-17 13:46:58.000000000 +0200 +++ linux/fs/namespace.c 2005-05-18 11:09:47.000000000 +0200 @@ -825,6 +825,16 @@ unlock: EXPORT_SYMBOL_GPL(do_add_mount); +/* Must be called with vfsmount_lock */ +static inline struct namespace *grab_namespace(struct namespace *n) +{ + if (n && atomic_add_return(1, &n->count) <= 1) { + atomic_dec(&n->count); + n = NULL; + } + return n; +} + /* * process a list of expirable mountpoints with the intent of discarding any * mountpoints that aren't in use and haven't been touched since last we came @@ -868,10 +878,9 @@ void mark_mounts_for_expiry(struct list_ /* don't do anything if the namespace is dead - all the * vfsmounts from it are going away anyway */ - namespace = mnt->mnt_namespace; - if (!namespace || atomic_read(&namespace->count) <= 0) + namespace = grab_namespace(mnt->mnt_namespace); + if (!namespace) continue; - get_namespace(namespace); spin_unlock(&vfsmount_lock); down_write(&namespace->sem); ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() [not found] ` <1116400118.24560.119.camel@localhost> 2005-05-18 9:51 ` [PATCH] fix race in mark_mounts_for_expiry() Miklos Szeredi @ 2005-05-18 10:32 ` David Howells 2005-05-18 10:37 ` Miklos Szeredi 2005-05-18 10:46 ` David Howells 1 sibling, 2 replies; 44+ messages in thread From: David Howells @ 2005-05-18 10:32 UTC (permalink / raw) To: Miklos Szeredi Cc: linuxram, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel Miklos Szeredi <miklos@szeredi.hu> wrote: > > How about this patch? It tries to solve this race without additional > locking. If refcount is already zero, it will increment and decrement > it. So be careful to only call grab_namespace() with vfsmount_lock > held, otherwise it could race with itself. (vfsmount_lock is also > needed in this case so that mnt->mnt_namespace, doesn't change, while > grabbing the namespace) How about using cmpxchg? David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:32 ` David Howells @ 2005-05-18 10:37 ` Miklos Szeredi 2005-05-18 10:46 ` David Howells 1 sibling, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 10:37 UTC (permalink / raw) To: dhowells; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > > How about this patch? It tries to solve this race without additional > > locking. If refcount is already zero, it will increment and decrement > > it. So be careful to only call grab_namespace() with vfsmount_lock > > held, otherwise it could race with itself. (vfsmount_lock is also > > needed in this case so that mnt->mnt_namespace, doesn't change, while > > grabbing the namespace) > > How about using cmpxchg? How? If the count is nonzero, an incremented count must be stored. You can't do that atomically with cmpxchg. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:32 ` David Howells 2005-05-18 10:37 ` Miklos Szeredi @ 2005-05-18 10:46 ` David Howells 2005-05-18 10:53 ` Miklos Szeredi ` (2 more replies) 1 sibling, 3 replies; 44+ messages in thread From: David Howells @ 2005-05-18 10:46 UTC (permalink / raw) To: Miklos Szeredi; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi <miklos@szeredi.hu> wrote: > > How about using cmpxchg? > > How? If the count is nonzero, an incremented count must be stored. > You can't do that atomically with cmpxchg. Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co with cmpxchg() or similar. Something like: static inline struct namespace *grab_namespace(struct namespace *n) { int old = atomic_read(&n->count); while (old > 0) { /* attempt to increment the counter */ old = cmpxchg(&n->count, old, old + 1); } return old > 0 ? n : NULL; } David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:46 ` David Howells @ 2005-05-18 10:53 ` Miklos Szeredi 2005-05-18 11:07 ` Trond Myklebust 2005-05-18 10:59 ` David Howells 2005-05-18 11:51 ` David Howells 2 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 10:53 UTC (permalink / raw) To: dhowells; +Cc: miklos, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co > with cmpxchg() or similar. > > Something like: > > static inline struct namespace *grab_namespace(struct namespace *n) > { > int old = atomic_read(&n->count); > > while (old > 0) { > /* attempt to increment the counter */ > old = cmpxchg(&n->count, old, old + 1); > } > > return old > 0 ? n : NULL; > } > Ahh OK :) There's still the problem of cmpxchg meddling in the internals of an atomic_t. Is that OK? Will that work on all archs? Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:53 ` Miklos Szeredi @ 2005-05-18 11:07 ` Trond Myklebust 2005-05-18 11:32 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Trond Myklebust @ 2005-05-18 11:07 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel on den 18.05.2005 Klokka 12:53 (+0200) skreiv Miklos Szeredi: > > Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co > > with cmpxchg() or similar. > > > > Something like: > > > > static inline struct namespace *grab_namespace(struct namespace *n) > > { > > int old = atomic_read(&n->count); > > > > while (old > 0) { > > /* attempt to increment the counter */ > > old = cmpxchg(&n->count, old, old + 1); > > } > > > > return old > 0 ? n : NULL; > > } > > > > Ahh OK :) > > There's still the problem of cmpxchg meddling in the internals of an > atomic_t. Is that OK? Will that work on all archs? Some archs already have an atomic_dec_if_positive() (see for instance the PPC). It won't take much work to convert that to an atomic_inc_if_positive(). For those arches that don't have that sort of thing, then writing a generic atomic_inc_if_positive() using cmpxchg() will often be possible, but there are exceptions (for instance the original 386 does not have a cmpxchg, so there you will have to use something else). Cheers, Trond ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 11:07 ` Trond Myklebust @ 2005-05-18 11:32 ` Miklos Szeredi 2005-05-18 12:50 ` Jamie Lokier 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 11:32 UTC (permalink / raw) To: trond.myklebust Cc: dhowells, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > Some archs already have an atomic_dec_if_positive() (see for instance > the PPC). It won't take much work to convert that to an > atomic_inc_if_positive(). > > For those arches that don't have that sort of thing, then writing a > generic atomic_inc_if_positive() using cmpxchg() will often be possible, > but there are exceptions (for instance the original 386 does not have a > cmpxchg, so there you will have to use something else). The problem with introducing architecture specific code, is that it's just asking for new bugs. If it's something used all over the kernel, than obviously it's OK, but for the sake of just one caller it's a bit crazy I think. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 11:32 ` Miklos Szeredi @ 2005-05-18 12:50 ` Jamie Lokier 2005-05-18 13:21 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Jamie Lokier @ 2005-05-18 12:50 UTC (permalink / raw) To: Miklos Szeredi Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi wrote: > > Some archs already have an atomic_dec_if_positive() (see for instance > > the PPC). It won't take much work to convert that to an > > atomic_inc_if_positive(). > > > > For those arches that don't have that sort of thing, then writing a > > generic atomic_inc_if_positive() using cmpxchg() will often be possible, > > but there are exceptions (for instance the original 386 does not have a > > cmpxchg, so there you will have to use something else). > > The problem with introducing architecture specific code, is that it's > just asking for new bugs. > > If it's something used all over the kernel, than obviously it's OK, > but for the sake of just one caller it's a bit crazy I think. I agree. And I think you're just adding to the case for removing mnt_namespace entirely. We'd still keep CLONE_NS, and users currently using namespaces (in the normal ways) would see no difference. mnt_namespace has these visible effects: - Prevents some tasks from mounting/umounting in a "foreign" namespace, even when they are granted access to the directory tree of the foreign namespace. It's not clear if the restriction is a useful security tool. - Causes every mount in a mount tree to be detached (independently), when last task associated with a namespace is destroyed. And this invisible effect: - More concurrency than a global mount lock would have. Is that all? Are any of these effects important enough to keep? -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 12:50 ` Jamie Lokier @ 2005-05-18 13:21 ` Miklos Szeredi 2005-05-18 17:34 ` Jamie Lokier 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 13:21 UTC (permalink / raw) To: jamie Cc: miklos, trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel, linux-fsdevel > And I think you're just adding to the case for removing mnt_namespace > entirely. We'd still keep CLONE_NS, and users currently using > namespaces (in the normal ways) would see no difference. > > mnt_namespace has these visible effects: > > - Prevents some tasks from mounting/umounting in a "foreign" > namespace, even when they are granted access to the directory > tree of the foreign namespace. > > It's not clear if the restriction is a useful security tool. > > - Causes every mount in a mount tree to be detached (independently), > when last task associated with a namespace is destroyed. I don't understand. The tree _has_ to be detached when no task uses the namespace. That is the main purpose of the namespace structure, to provide an anchor for the mount tree. > And this invisible effect: > > - More concurrency than a global mount lock would have. This is the key issue I think. It may even have security implications in the future if we want to allow unprivileged mounts : a user could DoS the system by just doing lots of mounts umounts in a private namespace. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 13:21 ` Miklos Szeredi @ 2005-05-18 17:34 ` Jamie Lokier 2005-05-18 19:05 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Jamie Lokier @ 2005-05-18 17:34 UTC (permalink / raw) To: Miklos Szeredi Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi wrote: > > - Causes every mount in a mount tree to be detached (independently), > > when last task associated with a namespace is destroyed. > > I don't understand. The tree _has_ to be detached when no task uses > the namespace. That is the main purpose of the namespace structure, > to provide an anchor for the mount tree. That makes less sense if we allow other tasks to be using a namespace through a passing a file descriptor, and then the last task which has current->namespace equal to that namespace exits. It makes no sense to me that the mount which is still accessible through the file descriptor is suddenly detached from it's parent and children mounts. Why is it not good enough to detach each vfsmnt when the last reference to each vfsmnt is dropped? In other words, simply when the vfsmnt becomes unreachable? That would detach whole vfsmnt trees when the last reference to the root of the tree is dropped, rather than when a task exits even though another tasks has a file descriptor still pointing to the tree. That would be a change of behaviour, but it seems like quite a sensible change in behaviour, and would not affect "normal" namespace users. There's one other purpose to mnt_namespace: - The list in /proc/mounts. But really it would *improve* security if the list in /proc/mounts was derived by walking the vfsmnt subtree starting at current->chroot. > > And this invisible effect: > > > > - More concurrency than a global mount lock would have. > > This is the key issue I think. It may even have security implications > in the future if we want to allow unprivileged mounts : a user could > DoS the system by just doing lots of mounts umounts in a private > namespace. Not an argument. If that's a problem, they could easily DoS the system by doing lots of mount/umounts in the initial namespace. The proper thing to do about concurrency is to make sure the slow operation (getting the superblock) is done outside any mount semaphore, and just do the tree traversal/splicing operations inside it. The code seems to do that already, so I think a global semaphore instead of mnt_namespace->sem would make no practical difference. But if it was a problem, we'd come up with something. Let's concentrate on the behaviour we want, rather than a minor detail like that. I think the common sense behaviour is to say that vfsmnts remain mounted as long as they are reachable - from a file descriptor, task's root, or task's cwd. -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 17:34 ` Jamie Lokier @ 2005-05-18 19:05 ` Miklos Szeredi 2005-05-18 19:52 ` Jamie Lokier 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 19:05 UTC (permalink / raw) To: jamie Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel, linux-fsdevel > That makes less sense if we allow other tasks to be using a namespace > through a passing a file descriptor, and then the last task which has > current->namespace equal to that namespace exits. It makes no sense > to me that the mount which is still accessible through the file > descriptor is suddenly detached from it's parent and children mounts. I see your point. I don't yet see a solution. Currently detach is an explicit action, not something automatic which happens when there are no more references to a vfsmount. > Why is it not good enough to detach each vfsmnt when the last > reference to each vfsmnt is dropped? In other words, simply when the > vfsmnt becomes unreachable? Define unreachable. Then define a mechanism, by which it can be detected. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 19:05 ` Miklos Szeredi @ 2005-05-18 19:52 ` Jamie Lokier 2005-05-19 12:41 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Jamie Lokier @ 2005-05-18 19:52 UTC (permalink / raw) To: Miklos Szeredi Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi wrote: > > That makes less sense if we allow other tasks to be using a namespace > > through a passing a file descriptor, and then the last task which has > > current->namespace equal to that namespace exits. It makes no sense > > to me that the mount which is still accessible through the file > > descriptor is suddenly detached from it's parent and children mounts. > > I see your point. I don't yet see a solution. > > Currently detach is an explicit action, not something automatic which > happens when there are no more references to a vfsmount. It's implicit, when the last task calls put_namespace: void __put_namespace(struct namespace *namespace) { [...] umount_tree(namespace->root); -> calls detach_mnt for each vfsmnt in namespace. [...] } > > Why is it not good enough to detach each vfsmnt when the last > > reference to each vfsmnt is dropped? In other words, simply when the > > vfsmnt becomes unreachable? > > Define unreachable. Unreachable as in no file descriptors (or chroot/cwd) refer to the vfsmnt, either directly or indirectly through a path traversal. > Then define a mechanism, by which it can be detected. There aren't any vfsmnt->vfsmnt cycles... They're a forest, vfsmnts don't move from one tree to another (bind mounts don't link them, they create new vfsmnts), and each tree can be referenced by a file descriptor at any point on the tree. It rather hinges on which of these behaviours you prefer: 1. A file descriptor/chroot/cwd reference to any point in a vfsmnt tree means the whole tree is retained. This means ".." remains always accessible: fchdir(fd); open("..") continues to access that whole tree as you still have fd. 2. A file descriptor/chroot/cwd reference to any point in a vfsmnt tree means the subtree from that point is retained, and parents may disappear if there are no references (not counting ".." as a reference). This behaviour is more sensible for chroots, where the parents should be inaccessible anyway. 3. A mixture, where current->root references only maintain the subtree rooted at that point, and other references, if outside the current->root subtree, retain the whole tree accessible from those references. The appropriate data structure / algorithm depends on which behaviour is preferred. So which is it? 1 Is best done with a mnt_namespace structure, but references to it counted when vfsmnts are referenced by file descriptors/root/cwd, _not_ references by tasks (no current->namespace). 2 is best done by simply reference counting vfsmnts. -- Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 19:52 ` Jamie Lokier @ 2005-05-19 12:41 ` Miklos Szeredi 0 siblings, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-19 12:41 UTC (permalink / raw) To: jamie Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel, linux-fsdevel > > Define unreachable. > > Unreachable as in no file descriptors (or chroot/cwd) refer to the > vfsmnt, either directly or indirectly through a path traversal. > > > Then define a mechanism, by which it can be detected. > > There aren't any vfsmnt->vfsmnt cycles... They're a forest, vfsmnts > don't move from one tree to another (bind mounts don't link them, they > create new vfsmnts), and each tree can be referenced by a file > descriptor at any point on the tree. > > It rather hinges on which of these behaviours you prefer: > > 1. A file descriptor/chroot/cwd reference to any point in a vfsmnt > tree means the whole tree is retained. This means ".." remains > always accessible: fchdir(fd); open("..") continues to access > that whole tree as you still have fd. > > 2. A file descriptor/chroot/cwd reference to any point in a vfsmnt > tree means the subtree from that point is retained, and parents > may disappear if there are no references (not counting ".." as a > reference). This behaviour is more sensible for chroots, where > the parents should be inaccessible anyway. > > 3. A mixture, where current->root references only maintain the > subtree rooted at that point, and other references, if outside > the current->root subtree, retain the whole tree accessible from > those references. 4. Everything stays the same, except chroot() changes current->namespace to current->fs->root_mnt->namespace. This would address your concern with chroot. > The appropriate data structure / algorithm depends on which > behaviour is preferred. So which is it? My preference is 1) or 4). 2) and 3) would have some pretty strange properties (in some situations you can enter a directory, but you cannot get out with 'cd ..' any more) > 1 Is best done with a mnt_namespace structure, but references to it > counted when vfsmnts are referenced by file descriptors/root/cwd, > _not_ references by tasks (no current->namespace). So you are proposing selective replacement of mntget(mnt); with mntget(mnt); get_namespace(mnt->mnt_namespace); and similar for mntput()? What do you do on unmount? If you set mnt->mnt_namespace to NULL, you'd have to do a put_namespace() for each file descriptor/root/cwd referencing the mount, but how do you do that? > 2 is best done by simply reference counting vfsmnts. It would need quite a revamp of the current reference counting. If I'm not mistaken, currently each vfsmount has an initial reference given to it in alloc_vfsmnt. And on attach, the parent namespace's count is also increased, and vice-versa for unmount. I'm not sure what purpose referencing the parent serves, but it won't work with 2). There you'd need referencing the other way round: parents holding a reference on their children, and have no initial reference. With 2) it's also questionable, how you copy the namespace in clone(). Do you just copy the tree under current->fs->root_mnt? Or do you copy tree under cwd as well (if it's disjunct from the root tree)? Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:46 ` David Howells 2005-05-18 10:53 ` Miklos Szeredi @ 2005-05-18 10:59 ` David Howells 2005-05-18 11:14 ` Miklos Szeredi 2005-05-18 11:51 ` David Howells 2 siblings, 1 reply; 44+ messages in thread From: David Howells @ 2005-05-18 10:59 UTC (permalink / raw) To: Miklos Szeredi; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi <miklos@szeredi.hu> wrote: > > There's still the problem of cmpxchg meddling in the internals of an > atomic_t. Is that OK? Will that work on all archs? Probably. It might be worth defining an atomic_cmpxchg() op to formalise it though: int atomic_cmpxchg(atomic_t *p, int old, int new); The main reason for this is that atomic_t is mostly a 32-bit value, I think, and cmpxchg() may sometimes enforce a 64-bit value. It can be made to work on PPC, PPC64, x86, x86_64, frv. I imagine it'll work on MIPS, sparc and alpha too without too much trouble. Dunno about the rest. David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:59 ` David Howells @ 2005-05-18 11:14 ` Miklos Szeredi 0 siblings, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 11:14 UTC (permalink / raw) To: dhowells; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > > > > There's still the problem of cmpxchg meddling in the internals of an > > atomic_t. Is that OK? Will that work on all archs? > > Probably. It might be worth defining an atomic_cmpxchg() op to formalise it > though: > > int atomic_cmpxchg(atomic_t *p, int old, int new); > > The main reason for this is that atomic_t is mostly a 32-bit value, I think, > and cmpxchg() may sometimes enforce a 64-bit value. > > It can be made to work on PPC, PPC64, x86, x86_64, frv. I imagine it'll work > on MIPS, sparc and alpha too without too much trouble. Dunno about the rest. Any takers? I'd rather not do this, if I don't have to :) Do you think my original fix is wrong, or is this just cosmetics? Thanks, Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 10:46 ` David Howells 2005-05-18 10:53 ` Miklos Szeredi 2005-05-18 10:59 ` David Howells @ 2005-05-18 11:51 ` David Howells 2005-05-18 12:08 ` Miklos Szeredi 2 siblings, 1 reply; 44+ messages in thread From: David Howells @ 2005-05-18 11:51 UTC (permalink / raw) To: Miklos Szeredi; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel Miklos Szeredi <miklos@szeredi.hu> wrote: > Do you think my original fix is wrong, or is this just cosmetics? What you're doing is tricky. It's asking for a race. Admittedly, it may not occur in the particular situation you're looking at, but can you always guarantee that? Remember, it may be a race against some piece of code that's not yet written, by an author who doesn't realise what _you_ are doing here because their changeset doesn't intersect with yours. Remember: you have, in effect, made the usage count on that structure non-atomic. I do something like that in rwsems and it's something I have to be very careful about. The main reason I can get away with it is that the actual implementation of rwsems is small and is located in a very restrictes set of places and it's not intermingled with other stuff. David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 11:51 ` David Howells @ 2005-05-18 12:08 ` Miklos Szeredi 2005-05-18 12:33 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 12:08 UTC (permalink / raw) To: dhowells; +Cc: miklos, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > What you're doing is tricky. It's asking for a race. I know. The comment above the function is there to make sure the user is aware of this. > Admittedly, it may not > occur in the particular situation you're looking at, but can you always > guarantee that? Yes, if it's always called under lock. > Remember, it may be a race against some piece of code that's not yet > written, by an author who doesn't realise what _you_ are doing here > because their changeset doesn't intersect with yours. > > Remember: you have, in effect, made the usage count on that structure > non-atomic. But _only after_ it's has gone to zero. When in fact there are no more references to it, so it shouldn't matter. The fact that it does matter and that mark_mounts_for_expiry() derefences mnt->mnt_namespace without actually having a proper reference to the namespace is the real culprit here. This is the third bug found by Jamie Lokier, Ram and me in the mnt_namespace change. So if we are looking at proper solutions I think that is what we should be examining. Thanks, Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 12:08 ` Miklos Szeredi @ 2005-05-18 12:33 ` Miklos Szeredi 2005-05-18 16:53 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 12:33 UTC (permalink / raw) To: miklos Cc: dhowells, miklos, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > But _only after_ it's has gone to zero. When in fact there are no > more references to it, so it shouldn't matter. > > The fact that it does matter and that mark_mounts_for_expiry() > derefences mnt->mnt_namespace without actually having a proper > reference to the namespace is the real culprit here. > > This is the third bug found by Jamie Lokier, Ram and me in the > mnt_namespace change. So if we are looking at proper solutions I > think that is what we should be examining. E.g. having a separate task count, which is incremented/decremented only by clone/exit. If the task count goes to zero, umount_tree is called on root, but namespace is not freed. And each mnt_namespace holds a proper reference to the namespace, so it's safe to dereference it anytime. When truly no more references remain, the namespace can go away. Hmm? Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 12:33 ` Miklos Szeredi @ 2005-05-18 16:53 ` Miklos Szeredi 2005-05-18 18:47 ` Ram 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 16:53 UTC (permalink / raw) To: miklos; +Cc: dhowells, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel > E.g. having a separate task count, which is incremented/decremented > only by clone/exit. If the task count goes to zero, umount_tree is > called on root, but namespace is not freed. > > And each mnt_namespace holds a proper reference to the namespace, so > it's safe to dereference it anytime. When truly no more references > remain, the namespace can go away. > > Hmm? Here's a patch (compile and boot tested). Now there are two solutions to one problem, the previous one is ugly, this one is buggy. Which should it be? Thanks, Miklos Index: linux/include/linux/namespace.h =================================================================== --- linux.orig/include/linux/namespace.h 2005-05-18 18:02:24.000000000 +0200 +++ linux/include/linux/namespace.h 2005-05-18 18:28:48.000000000 +0200 @@ -7,18 +7,26 @@ struct namespace { atomic_t count; + atomic_t task_count; /* how many tasks use this */ struct vfsmount * root; struct list_head list; struct rw_semaphore sem; }; extern int copy_namespace(int, struct task_struct *); -extern void __put_namespace(struct namespace *namespace); +extern void __exit_namespace(struct namespace *namespace); static inline void put_namespace(struct namespace *namespace) { - if (atomic_dec_and_test(&namespace->count)) - __put_namespace(namespace); + if (namespace && atomic_dec_and_test(&namespace->count)) + kfree(namespace); +} + +static inline struct namespace *get_namespace(struct namespace *namespace) +{ + if (namespace) + atomic_inc(&namespace->count); + return namespace; } static inline void exit_namespace(struct task_struct *p) @@ -28,14 +36,10 @@ static inline void exit_namespace(struct task_lock(p); p->namespace = NULL; task_unlock(p); - put_namespace(namespace); + if (atomic_dec_and_test(&p->namespace->task_count)) + __exit_namespace(p->namespace); } } -static inline void get_namespace(struct namespace *namespace) -{ - atomic_inc(&namespace->count); -} - #endif #endif Index: linux/fs/namespace.c =================================================================== --- linux.orig/fs/namespace.c 2005-05-18 17:42:46.000000000 +0200 +++ linux/fs/namespace.c 2005-05-18 18:32:29.000000000 +0200 @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d mnt->mnt_root = dget(root); mnt->mnt_mountpoint = mnt->mnt_root; mnt->mnt_parent = mnt; - mnt->mnt_namespace = current->namespace; + mnt->mnt_namespace = get_namespace(current->namespace); /* stick the duplicate mount on the same expiry list * as the original if that was on one */ @@ -345,13 +345,15 @@ static void umount_tree(struct vfsmount for (p = mnt; p; p = next_mnt(p, mnt)) { list_del(&p->mnt_list); list_add(&p->mnt_list, &kill); - p->mnt_namespace = NULL; } while (!list_empty(&kill)) { + struct namespace *n; mnt = list_entry(kill.next, struct vfsmount, mnt_list); list_del_init(&mnt->mnt_list); list_del_init(&mnt->mnt_fslink); + n = mnt->mnt_namespace; + mnt->mnt_namespace = NULL; if (mnt->mnt_parent == mnt) { spin_unlock(&vfsmount_lock); } else { @@ -361,6 +363,7 @@ static void umount_tree(struct vfsmount path_release(&old_nd); } mntput(mnt); + put_namespace(n); spin_lock(&vfsmount_lock); } } @@ -866,12 +869,9 @@ void mark_mounts_for_expiry(struct list_ mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink); list_del_init(&mnt->mnt_fslink); - /* don't do anything if the namespace is dead - all the - * vfsmounts from it are going away anyway */ - namespace = mnt->mnt_namespace; - if (!namespace || atomic_read(&namespace->count) <= 0) + namespace = get_namespace(mnt->mnt_namespace); + if (!namespace) continue; - get_namespace(namespace); spin_unlock(&vfsmount_lock); down_write(&namespace->sem); @@ -1073,8 +1073,10 @@ int copy_namespace(int flags, struct tas get_namespace(namespace); - if (!(flags & CLONE_NEWNS)) + if (!(flags & CLONE_NEWNS)) { + atomic_inc(&namespace->task_count); return 0; + } if (!capable(CAP_SYS_ADMIN)) { put_namespace(namespace); @@ -1086,6 +1088,7 @@ int copy_namespace(int flags, struct tas goto out; atomic_set(&new_ns->count, 1); + atomic_set(&new_ns->task_count, 1); init_rwsem(&new_ns->sem); INIT_LIST_HEAD(&new_ns->list); @@ -1109,7 +1112,9 @@ int copy_namespace(int flags, struct tas p = namespace->root; q = new_ns->root; while (p) { - q->mnt_namespace = new_ns; + struct namespace *oldns = q->mnt_namespace; + q->mnt_namespace = get_namespace(new_ns); + put_namespace(oldns); if (fs) { if (p == fs->rootmnt) { rootmnt = p; @@ -1381,16 +1386,17 @@ static void __init init_mount_tree(void) if (!namespace) panic("Can't allocate initial namespace"); atomic_set(&namespace->count, 1); + atomic_set(&namespace->task_count, 0); INIT_LIST_HEAD(&namespace->list); init_rwsem(&namespace->sem); list_add(&mnt->mnt_list, &namespace->list); namespace->root = mnt; - mnt->mnt_namespace = namespace; + mnt->mnt_namespace = get_namespace(namespace); init_task.namespace = namespace; read_lock(&tasklist_lock); do_each_thread(g, p) { - get_namespace(namespace); + atomic_inc(&namespace->task_count); p->namespace = namespace; } while_each_thread(g, p); read_unlock(&tasklist_lock); @@ -1448,12 +1454,13 @@ void __init mnt_init(unsigned long mempa init_mount_tree(); } -void __put_namespace(struct namespace *namespace) +void __exit_namespace(struct namespace *namespace) { down_write(&namespace->sem); spin_lock(&vfsmount_lock); umount_tree(namespace->root); + namespace->root = NULL; spin_unlock(&vfsmount_lock); up_write(&namespace->sem); - kfree(namespace); + put_namespace(namespace); } Index: linux/fs/super.c =================================================================== --- linux.orig/fs/super.c 2005-05-17 13:46:56.000000000 +0200 +++ linux/fs/super.c 2005-05-18 18:44:18.000000000 +0200 @@ -37,6 +37,7 @@ #include <linux/writeback.h> /* for the emergency remount stuff */ #include <linux/idr.h> #include <linux/kobject.h> +#include <linux/namespace.h> #include <asm/uaccess.h> @@ -842,7 +843,7 @@ do_kern_mount(const char *fstype, int fl mnt->mnt_root = dget(sb->s_root); mnt->mnt_mountpoint = sb->s_root; mnt->mnt_parent = mnt; - mnt->mnt_namespace = current->namespace; + mnt->mnt_namespace = get_namespace(current->namespace); up_write(&sb->s_umount); put_filesystem(type); return mnt; ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 16:53 ` Miklos Szeredi @ 2005-05-18 18:47 ` Ram 2005-05-18 19:19 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Ram @ 2005-05-18 18:47 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel On Wed, 2005-05-18 at 09:53, Miklos Szeredi wrote: > > E.g. having a separate task count, which is incremented/decremented > > only by clone/exit. If the task count goes to zero, umount_tree is > > called on root, but namespace is not freed. > > > > And each mnt_namespace holds a proper reference to the namespace, so > > it's safe to dereference it anytime. When truly no more references > > remain, the namespace can go away. > > > > Hmm? First of all the reason this race exists implies Al Viro may have had assertion in his mind that "All tasks that have access to a namespace has a refcount on that namespace". If that was what he was thinking, than the I would stick with that assertion being true. The overall idea I am thinking off is: 1) have the automounter hold a refcount on any new namespace created the mounts in which the automounter has interest in. 2) have a refcount on the namespace when a new task gains access to a namespace through the file descriptor or any other similar mechanisms and remove the reference once the fd gets closed. (bit tricky to implement) Do you agree with the overall idea? If so I will try to come up with a patch. RP > > Here's a patch (compile and boot tested). > > Now there are two solutions to one problem, the previous one is ugly, > this one is buggy. Which should it be? > > Thanks, > Miklos > > Index: linux/include/linux/namespace.h > =================================================================== > --- linux.orig/include/linux/namespace.h 2005-05-18 18:02:24.000000000 +0200 > +++ linux/include/linux/namespace.h 2005-05-18 18:28:48.000000000 +0200 > @@ -7,18 +7,26 @@ > > struct namespace { > atomic_t count; > + atomic_t task_count; /* how many tasks use this */ > struct vfsmount * root; > struct list_head list; > struct rw_semaphore sem; > }; > > extern int copy_namespace(int, struct task_struct *); > -extern void __put_namespace(struct namespace *namespace); > +extern void __exit_namespace(struct namespace *namespace); > > static inline void put_namespace(struct namespace *namespace) > { > - if (atomic_dec_and_test(&namespace->count)) > - __put_namespace(namespace); > + if (namespace && atomic_dec_and_test(&namespace->count)) > + kfree(namespace); > +} > + > +static inline struct namespace *get_namespace(struct namespace *namespace) > +{ > + if (namespace) > + atomic_inc(&namespace->count); > + return namespace; > } > > static inline void exit_namespace(struct task_struct *p) > @@ -28,14 +36,10 @@ static inline void exit_namespace(struct > task_lock(p); > p->namespace = NULL; > task_unlock(p); > - put_namespace(namespace); > + if (atomic_dec_and_test(&p->namespace->task_count)) > + __exit_namespace(p->namespace); > } > } > > -static inline void get_namespace(struct namespace *namespace) > -{ > - atomic_inc(&namespace->count); > -} > - > #endif > #endif > Index: linux/fs/namespace.c > =================================================================== > --- linux.orig/fs/namespace.c 2005-05-18 17:42:46.000000000 +0200 > +++ linux/fs/namespace.c 2005-05-18 18:32:29.000000000 +0200 > @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d > mnt->mnt_root = dget(root); > mnt->mnt_mountpoint = mnt->mnt_root; > mnt->mnt_parent = mnt; > - mnt->mnt_namespace = current->namespace; > + mnt->mnt_namespace = get_namespace(current->namespace); > > /* stick the duplicate mount on the same expiry list > * as the original if that was on one */ > @@ -345,13 +345,15 @@ static void umount_tree(struct vfsmount > for (p = mnt; p; p = next_mnt(p, mnt)) { > list_del(&p->mnt_list); > list_add(&p->mnt_list, &kill); > - p->mnt_namespace = NULL; > } > > while (!list_empty(&kill)) { > + struct namespace *n; > mnt = list_entry(kill.next, struct vfsmount, mnt_list); > list_del_init(&mnt->mnt_list); > list_del_init(&mnt->mnt_fslink); > + n = mnt->mnt_namespace; > + mnt->mnt_namespace = NULL; > if (mnt->mnt_parent == mnt) { > spin_unlock(&vfsmount_lock); > } else { > @@ -361,6 +363,7 @@ static void umount_tree(struct vfsmount > path_release(&old_nd); > } > mntput(mnt); > + put_namespace(n); > spin_lock(&vfsmount_lock); > } > } > @@ -866,12 +869,9 @@ void mark_mounts_for_expiry(struct list_ > mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink); > list_del_init(&mnt->mnt_fslink); > > - /* don't do anything if the namespace is dead - all the > - * vfsmounts from it are going away anyway */ > - namespace = mnt->mnt_namespace; > - if (!namespace || atomic_read(&namespace->count) <= 0) > + namespace = get_namespace(mnt->mnt_namespace); > + if (!namespace) > continue; > - get_namespace(namespace); > > spin_unlock(&vfsmount_lock); > down_write(&namespace->sem); > @@ -1073,8 +1073,10 @@ int copy_namespace(int flags, struct tas > > get_namespace(namespace); > > - if (!(flags & CLONE_NEWNS)) > + if (!(flags & CLONE_NEWNS)) { > + atomic_inc(&namespace->task_count); > return 0; > + } > > if (!capable(CAP_SYS_ADMIN)) { > put_namespace(namespace); > @@ -1086,6 +1088,7 @@ int copy_namespace(int flags, struct tas > goto out; > > atomic_set(&new_ns->count, 1); > + atomic_set(&new_ns->task_count, 1); > init_rwsem(&new_ns->sem); > INIT_LIST_HEAD(&new_ns->list); > > @@ -1109,7 +1112,9 @@ int copy_namespace(int flags, struct tas > p = namespace->root; > q = new_ns->root; > while (p) { > - q->mnt_namespace = new_ns; > + struct namespace *oldns = q->mnt_namespace; > + q->mnt_namespace = get_namespace(new_ns); > + put_namespace(oldns); > if (fs) { > if (p == fs->rootmnt) { > rootmnt = p; > @@ -1381,16 +1386,17 @@ static void __init init_mount_tree(void) > if (!namespace) > panic("Can't allocate initial namespace"); > atomic_set(&namespace->count, 1); > + atomic_set(&namespace->task_count, 0); > INIT_LIST_HEAD(&namespace->list); > init_rwsem(&namespace->sem); > list_add(&mnt->mnt_list, &namespace->list); > namespace->root = mnt; > - mnt->mnt_namespace = namespace; > + mnt->mnt_namespace = get_namespace(namespace); > > init_task.namespace = namespace; > read_lock(&tasklist_lock); > do_each_thread(g, p) { > - get_namespace(namespace); > + atomic_inc(&namespace->task_count); > p->namespace = namespace; > } while_each_thread(g, p); > read_unlock(&tasklist_lock); > @@ -1448,12 +1454,13 @@ void __init mnt_init(unsigned long mempa > init_mount_tree(); > } > > -void __put_namespace(struct namespace *namespace) > +void __exit_namespace(struct namespace *namespace) > { > down_write(&namespace->sem); > spin_lock(&vfsmount_lock); > umount_tree(namespace->root); > + namespace->root = NULL; > spin_unlock(&vfsmount_lock); > up_write(&namespace->sem); > - kfree(namespace); > + put_namespace(namespace); > } > Index: linux/fs/super.c > =================================================================== > --- linux.orig/fs/super.c 2005-05-17 13:46:56.000000000 +0200 > +++ linux/fs/super.c 2005-05-18 18:44:18.000000000 +0200 > @@ -37,6 +37,7 @@ > #include <linux/writeback.h> /* for the emergency remount stuff */ > #include <linux/idr.h> > #include <linux/kobject.h> > +#include <linux/namespace.h> > #include <asm/uaccess.h> > > > @@ -842,7 +843,7 @@ do_kern_mount(const char *fstype, int fl > mnt->mnt_root = dget(sb->s_root); > mnt->mnt_mountpoint = sb->s_root; > mnt->mnt_parent = mnt; > - mnt->mnt_namespace = current->namespace; > + mnt->mnt_namespace = get_namespace(current->namespace); > up_write(&sb->s_umount); > put_filesystem(type); > return mnt; ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 18:47 ` Ram @ 2005-05-18 19:19 ` Miklos Szeredi 2005-05-18 20:35 ` Ram 0 siblings, 1 reply; 44+ messages in thread From: Miklos Szeredi @ 2005-05-18 19:19 UTC (permalink / raw) To: linuxram; +Cc: dhowells, jamie, viro, akpm, linux-kernel, linux-fsdevel > First of all the reason this race exists implies Al Viro may have had > assertion in his mind that "All tasks that have access to a namespace > has a refcount on that namespace". If that was what he was thinking, > than the I would stick with that assertion being true. The overall idea > I am thinking off is: > > 1) have the automounter hold a refcount on any new namespace created > the mounts in which the automounter has interest in. > 2) have a refcount on the namespace when a new task gains access to > a namespace through the file descriptor or any other > similar mechanisms and remove the reference > once the fd gets closed. (bit tricky to implement) > > Do you agree with the overall idea? I don't really understand it. A reference count usually means, the number of references (pointers) to an object. You can sometimes get away with schemes that deviate from this in various ways, but it's usually asking for trouble. The usage in mark_mounts_for_expiry() deviated from it so much, that the result was a subtle race. Doing some tricky thing like what you propose will just likely introduce more subtle problems. Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 19:19 ` Miklos Szeredi @ 2005-05-18 20:35 ` Ram 2005-05-19 12:52 ` Miklos Szeredi 0 siblings, 1 reply; 44+ messages in thread From: Ram @ 2005-05-18 20:35 UTC (permalink / raw) To: Miklos Szeredi Cc: dhowells, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel On Wed, 2005-05-18 at 12:19, Miklos Szeredi wrote: > > First of all the reason this race exists implies Al Viro may have had > > assertion in his mind that "All tasks that have access to a namespace > > has a refcount on that namespace". If that was what he was thinking, > > than the I would stick with that assertion being true. The overall idea > > I am thinking off is: > > > > 1) have the automounter hold a refcount on any new namespace created > > the mounts in which the automounter has interest in. > > 2) have a refcount on the namespace when a new task gains access to > > a namespace through the file descriptor or any other > > similar mechanisms and remove the reference > > once the fd gets closed. (bit tricky to implement) > > > > Do you agree with the overall idea? > > I don't really understand it. > > A reference count usually means, the number of references (pointers) > to an object. You can sometimes get away with schemes that deviate > from this in various ways, but it's usually asking for trouble. Ok. One more attempt to be clear. All the places where get_namespace() is called currently except in mark_mounts_for_expiry() are safe because they are called in places where it is guaranteed that they will not race with __put_namespace(). For example in clone_namespace(), get_namespace() will not race because the task that called the clone has a refcount on the namespace and since that task is currently in the kernel, there is no chance for the task to go away decrementing the refcount on that namespace. But the case where the call to get_namespace() is buggy in mark_mounts_for_expiry() is because: it is called in a timer context, and the last process referring the namespace may just disapper right than. So what I am proposing is: in automouter, while the automount takes place in afs_mntpt_follow_link() increment the refcount of the namespace, by calling get_namespace(). This call will not race with __put_namespace because the process that is trying to access the mountpoint already has a refcount on the namespace and it won't be able to decrement the refcount currently. agree? Now later when the automounter tries to unmount the mount call put_namespace() after unmounting. I mean do it in mark_mounts_for_expiry(). Also delete the call to get_namespace()) So the race will not happen at all. Makes sense? That was the easiest part, but the difficult part is how to call get_namespace() on a forign namespace in do_loopback()? thinking about it, RP > > The usage in mark_mounts_for_expiry() deviated from it so much, that > the result was a subtle race. > > Doing some tricky thing like what you propose will just likely > introduce more subtle problems. > > Miklos > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] fix race in mark_mounts_for_expiry() 2005-05-18 20:35 ` Ram @ 2005-05-19 12:52 ` Miklos Szeredi 0 siblings, 0 replies; 44+ messages in thread From: Miklos Szeredi @ 2005-05-19 12:52 UTC (permalink / raw) To: linuxram; +Cc: miklos, dhowells, jamie, viro, akpm, linux-kernel, linux-fsdevel > Ok. One more attempt to be clear. > > All the places where get_namespace() is called currently except > in mark_mounts_for_expiry() are safe because they are called in > places where it is guaranteed that they will not race with > __put_namespace(). > > For example in clone_namespace(), get_namespace() will not race > because the task that called the clone has a refcount on the > namespace and since that task is currently in the kernel, there is > no chance for the task to go away decrementing the refcount > on that namespace. > > But the case where the call to get_namespace() is buggy in > mark_mounts_for_expiry() is because: > it is called in a timer context, and the last process referring > the namespace may just disapper right than. > So what I am proposing is: > in automouter, while the automount takes place in > afs_mntpt_follow_link() increment the refcount of the namespace, > by calling get_namespace(). This call will not race with __put_namespace > because the process that is trying to access the > mountpoint already has a refcount on the namespace and it won't be > able to decrement the refcount currently. agree? > > Now later when the automounter tries to unmount the mount > call put_namespace() after unmounting. I mean do it in > mark_mounts_for_expiry(). Also delete the call to get_namespace()) > > So the race will not happen at all. > > Makes sense? Well I imagine it could work, but again it's just a special case for the automounter stuff. It still won't solve the recursive mount problem that this discussion (and your discovery of the race) originated from. My second patch solves the problem generally (though I'm sure that it's full of bugs yet), by keeping a reference to the namespace from each vfsmount in that namespace. That way, until the vfsmount remains in the namespace (i.e. until it's unmounted) mnt_namespace can be safely used for whatever reason (recursive bind, atomount, etc). So why does it make sense to solve this problem only for the automounter? Miklos ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-16 20:15 ` Miklos Szeredi 2005-05-17 1:28 ` Jamie Lokier @ 2005-05-17 18:48 ` Ram 1 sibling, 0 replies; 44+ messages in thread From: Ram @ 2005-05-17 18:48 UTC (permalink / raw) To: Miklos Szeredi; +Cc: jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel On Mon, 2005-05-16 at 13:15, Miklos Szeredi wrote: > > Ok. less restriction without compromising security is a good idea. > > > > Under the premise that bind mounts across namespace should be allowed; > > any insight why the "founding fathers" :) allowed only bind > > and not recursive bind? What issue would that create? One can > > easily workaround that restriction by manually binding recursively. > > So does the recursive bind restriction serve any purpose? > > > > I remember Miklos saying its not a security issue but a > > implementation/locking issue. That can be fixed aswell. > > Yes, as pointed out by Jamie, both namespaces need to be locked for > this to work. Something like the attached should do it. > > Miklos > > > Index: linux/fs/namespace.c > =================================================================== > --- linux.orig/fs/namespace.c 2005-05-16 22:02:36.000000000 +0200 > +++ linux/fs/namespace.c 2005-05-16 22:13:30.000000000 +0200 > @@ -622,6 +622,8 @@ out_unlock: > static int do_loopback(struct nameidata *nd, char *old_name, int recurse) > { > struct nameidata old_nd; > + struct namespace *ns1 = current->namespace; > + struct namespace *ns2 = NULL; > struct vfsmount *mnt = NULL; > int err = mount_is_safe(nd); > if (err) > @@ -632,15 +634,30 @@ static int do_loopback(struct nameidata > if (err) > return err; > > - down_write(¤t->namespace->sem); > err = -EINVAL; > - if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) { > - err = -ENOMEM; > - if (recurse) > - mnt = copy_tree(old_nd.mnt, old_nd.dentry); > - else > - mnt = clone_mnt(old_nd.mnt, old_nd.dentry); > - } > + if (!check_mnt(nd->mnt)) > + goto out_path_release; This disallows bind mounts in foreign namespace. But allows bind mounts in current namespace from foreign namespace. Any reason? Both should be allowed. Infact both the namespaces operated on could be foriegn namespaces. This is based on the premise that the process has gained rights to operate on the namespaces in question. RP ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] namespace.c: fix bind mount from foreign namespace 2005-05-17 3:50 ` Ram 2005-05-16 20:15 ` Miklos Szeredi @ 2005-05-17 0:00 ` Jamie Lokier 1 sibling, 0 replies; 44+ messages in thread From: Jamie Lokier @ 2005-05-17 0:00 UTC (permalink / raw) To: Ram; +Cc: Miklos Szeredi, viro, Andrew Morton, linux-kernel, linux-fsdevel Ram wrote: > Under the premise that bind mounts across namespace should be allowed; > any insight why the "founding fathers" :) allowed only bind > and not recursive bind? What issue would that create? Recursive bind traverses the subtree of vfsmnts rooted at the source mount (following mnt->mnt_mounts, see copy_tree()). That requires the source mount's namespace semaphore to be held. > One can easily workaround that restriction by manually binding > recursively. Yes, if you know which mounts they are. > I remember Miklos saying its not a security issue but a > implementation/locking issue. That can be fixed aswell. Yes, by taking the source namespace semaphore while traversing the subtree. That involves taking _two_ semaphores, so they have to be ordered to avoid deadlock (see double-locking elsewhere in the kernel). - Jamie ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2005-05-19 12:54 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-05-13 10:44 [PATCH] namespace.c: fix bind mount from foreign namespace Miklos Szeredi
2005-05-13 16:49 ` Ram
2005-05-13 17:06 ` Al Viro
2005-05-13 17:17 ` Miklos Szeredi
2005-05-13 17:25 ` Al Viro
2005-05-13 17:34 ` Miklos Szeredi
2005-05-13 17:29 ` Ram
2005-05-13 18:40 ` Miklos Szeredi
[not found] ` <1116012287.6248.410.camel@localhost>
[not found] ` <E1DWfqJ-0004eP-00@dorka.pomaz.szeredi.hu>
[not found] ` <1116013840.6248.429.camel@localhost>
2005-05-14 6:11 ` Miklos Szeredi
2005-05-16 15:11 ` Ram
2005-05-16 8:44 ` Miklos Szeredi
2005-05-16 8:59 ` Miklos Szeredi
2005-05-16 11:26 ` Jamie Lokier
2005-05-16 13:23 ` Miklos Szeredi
2005-05-16 11:14 ` Jamie Lokier
2005-05-17 3:50 ` Ram
2005-05-16 20:15 ` Miklos Szeredi
2005-05-17 1:28 ` Jamie Lokier
2005-05-17 5:34 ` Miklos Szeredi
[not found] ` <1116360352.24560.85.camel@localhost>
[not found] ` <E1DYI0m-0000K5-00@dorka.pomaz.szeredi.hu>
[not found] ` <1116399887.24560.116.camel@localhost>
[not found] ` <1116400118.24560.119.camel@localhost>
2005-05-18 9:51 ` [PATCH] fix race in mark_mounts_for_expiry() Miklos Szeredi
2005-05-18 10:32 ` David Howells
2005-05-18 10:37 ` Miklos Szeredi
2005-05-18 10:46 ` David Howells
2005-05-18 10:53 ` Miklos Szeredi
2005-05-18 11:07 ` Trond Myklebust
2005-05-18 11:32 ` Miklos Szeredi
2005-05-18 12:50 ` Jamie Lokier
2005-05-18 13:21 ` Miklos Szeredi
2005-05-18 17:34 ` Jamie Lokier
2005-05-18 19:05 ` Miklos Szeredi
2005-05-18 19:52 ` Jamie Lokier
2005-05-19 12:41 ` Miklos Szeredi
2005-05-18 10:59 ` David Howells
2005-05-18 11:14 ` Miklos Szeredi
2005-05-18 11:51 ` David Howells
2005-05-18 12:08 ` Miklos Szeredi
2005-05-18 12:33 ` Miklos Szeredi
2005-05-18 16:53 ` Miklos Szeredi
2005-05-18 18:47 ` Ram
2005-05-18 19:19 ` Miklos Szeredi
2005-05-18 20:35 ` Ram
2005-05-19 12:52 ` Miklos Szeredi
2005-05-17 18:48 ` [PATCH] namespace.c: fix bind mount from foreign namespace Ram
2005-05-17 0:00 ` Jamie Lokier
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).