linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] namespace.c: fix bind mount from foreign namespace
@ 2005-05-13 10:44 Miklos Szeredi
  2005-05-13 16:49 ` Ram
  2005-05-13 17:06 ` Al Viro
  0 siblings, 2 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-13 10:44 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-fsdevel

Bind mount from a foreign namespace results in an un-removable mount.
The reason is that mnt->mnt_namespace is copied from the old mount in
clone_mnt().  Because of this, check_mnt() in sys_umount() will fail.

The solution is to set mnt->mnt_namespace to current->namespace.

clone_mnt() is either called from do_loopback() or copy_tree().
copy_tree() is called from do_loopback() or copy_namespace().

When called (directly or indirectly) from do_loopback(), always
current->namspace is being modified: check_mnt(nd->mnt).  So setting
mnt->mnt_namespace to current->namspace is the right thing to do.

When called from copy_namespace(), the setting of mnt_namespace is
irrelevant, since mnt_namespace is reset later in that function for
all copied mounts.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2005-05-13 12:22:52.000000000 +0200
+++ linux/fs/namespace.c	2005-05-13 12:32:36.000000000 +0200
@@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
 		mnt->mnt_root = dget(root);
 		mnt->mnt_mountpoint = mnt->mnt_root;
 		mnt->mnt_parent = mnt;
-		mnt->mnt_namespace = old->mnt_namespace;
+		mnt->mnt_namespace = current->namespace;
 
 		/* stick the duplicate mount on the same expiry list
 		 * as the original if that was on one */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 10:44 [PATCH] namespace.c: fix bind mount from foreign namespace Miklos Szeredi
@ 2005-05-13 16:49 ` Ram
  2005-05-13 17:06 ` Al Viro
  1 sibling, 0 replies; 46+ messages in thread
From: Ram @ 2005-05-13 16:49 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On Fri, 2005-05-13 at 03:44, Miklos Szeredi wrote:
> Bind mount from a foreign namespace results in an un-removable mount.

i wonder, should we even allow the ability to bind mount 
from a foreign namespace?  

The only time this is allowed is while creating a new namespace.

RP


> The reason is that mnt->mnt_namespace is copied from the old mount in
> clone_mnt().  Because of this, check_mnt() in sys_umount() will fail.
> 
> The solution is to set mnt->mnt_namespace to current->namespace.
> 
> clone_mnt() is either called from do_loopback() or copy_tree().
> copy_tree() is called from do_loopback() or copy_namespace().
> 
> When called (directly or indirectly) from do_loopback(), always
> current->namspace is being modified: check_mnt(nd->mnt).  So setting
> mnt->mnt_namespace to current->namspace is the right thing to do.
> 
> When called from copy_namespace(), the setting of mnt_namespace is
> irrelevant, since mnt_namespace is reset later in that function for
> all copied mounts.


> 
> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
> 
> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c	2005-05-13 12:22:52.000000000 +0200
> +++ linux/fs/namespace.c	2005-05-13 12:32:36.000000000 +0200
> @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
>  		mnt->mnt_root = dget(root);
>  		mnt->mnt_mountpoint = mnt->mnt_root;
>  		mnt->mnt_parent = mnt;
> -		mnt->mnt_namespace = old->mnt_namespace;
> +		mnt->mnt_namespace = current->namespace;
>  
>  		/* stick the duplicate mount on the same expiry list
>  		 * as the original if that was on one */
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 10:44 [PATCH] namespace.c: fix bind mount from foreign namespace Miklos Szeredi
  2005-05-13 16:49 ` Ram
@ 2005-05-13 17:06 ` Al Viro
  2005-05-13 17:17   ` Miklos Szeredi
  1 sibling, 1 reply; 46+ messages in thread
From: Al Viro @ 2005-05-13 17:06 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

On Fri, May 13, 2005 at 12:44:03PM +0200, Miklos Szeredi wrote:
> Bind mount from a foreign namespace results in

... -EINVAL

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 17:06 ` Al Viro
@ 2005-05-13 17:17   ` Miklos Szeredi
  2005-05-13 17:25     ` Al Viro
  2005-05-13 17:29     ` Ram
  0 siblings, 2 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-13 17:17 UTC (permalink / raw)
  To: viro; +Cc: akpm, linux-kernel, linux-fsdevel

> > Bind mount from a foreign namespace results in
> 
> ... -EINVAL

Wrong answer.  Look again, you wrote the code, so you _should_ know ;)

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 17:17   ` Miklos Szeredi
@ 2005-05-13 17:25     ` Al Viro
  2005-05-13 17:34       ` Miklos Szeredi
  2005-05-13 17:29     ` Ram
  1 sibling, 1 reply; 46+ messages in thread
From: Al Viro @ 2005-05-13 17:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, linux-kernel, linux-fsdevel

On Fri, May 13, 2005 at 07:17:39PM +0200, Miklos Szeredi wrote:
> > > Bind mount from a foreign namespace results in
> > 
> > ... -EINVAL
> 
> Wrong answer.  Look again, you wrote the code, so you _should_ know ;)

static inline int check_mnt(struct vfsmount *mnt)
{
        return mnt->mnt_namespace == current->namespace;
}

static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
{
        struct nameidata old_nd;
        struct vfsmount *mnt = NULL;
	/* no changes of mnt */
	err = -EINVAL;
	if (check_mnt(nd->mnt) && ... ) {
		/* assigns to mnt */
	}
	if (mnt) {
		/* assigns to err */
	}
        up_write(&current->namespace->sem);
        path_release(&old_nd);
        return err;
}

Care to explain how that would not give -EINVAL?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 17:17   ` Miklos Szeredi
  2005-05-13 17:25     ` Al Viro
@ 2005-05-13 17:29     ` Ram
  2005-05-13 18:40       ` Miklos Szeredi
  1 sibling, 1 reply; 46+ messages in thread
From: Ram @ 2005-05-13 17:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: viro, Andrew Morton, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 578 bytes --]

On Fri, 2005-05-13 at 10:17, Miklos Szeredi wrote:
> > > Bind mount from a foreign namespace results in
> > 
> > ... -EINVAL
> 
> Wrong answer.  Look again, you wrote the code, so you _should_ know ;)

I guess Al agrees that bind mount from foreign namespace must be
disallowed.

Which means what Jamie pointed to was right. Attached the patch which
fixes it.


> 
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: bind.patch --]
[-- Type: text/x-patch, Size: 478 bytes --]

--- /home/linux/views/linux-2.6.12-rc4/fs/namespace.c	2005-05-06 23:22:29.000000000 -0700
+++ 2.6.12-rc4/fs/namespace.c	2005-05-13 10:17:19.000000000 -0700
@@ -633,7 +633,7 @@ static int do_loopback(struct nameidata 
 
 	down_write(&current->namespace->sem);
 	err = -EINVAL;
-	if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
+	if (check_mnt(nd->mnt) && check_mnt(old_nd.mnt)) {
 		err = -ENOMEM;
 		if (recurse)
 			mnt = copy_tree(old_nd.mnt, old_nd.dentry);

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 17:25     ` Al Viro
@ 2005-05-13 17:34       ` Miklos Szeredi
  0 siblings, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-13 17:34 UTC (permalink / raw)
  To: viro; +Cc: akpm, linux-kernel, linux-fsdevel

> static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
> {
>         struct nameidata old_nd;
>         struct vfsmount *mnt = NULL;
> 	/* no changes of mnt */
> 	err = -EINVAL;
> 	if (check_mnt(nd->mnt) && ... ) {
> 		/* assigns to mnt */
> 	}
> 	if (mnt) {
> 		/* assigns to err */
> 	}
>         up_write(&current->namespace->sem);
>         path_release(&old_nd);
>         return err;
> }
> 
> Care to explain how that would not give -EINVAL?

Yeah, but that check_mnt() checks the _destination_ of the bind not
the source.  The source is only checked for recursive mounts,
presumably because the source namespace is not locked, and so can
change.

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-13 17:29     ` Ram
@ 2005-05-13 18:40       ` Miklos Szeredi
       [not found]         ` <1116012287.6248.410.camel@localhost>
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-13 18:40 UTC (permalink / raw)
  To: linuxram; +Cc: viro, akpm, linux-kernel, linux-fsdevel

> > > > Bind mount from a foreign namespace results in
> > > 
> > > ... -EINVAL
> > 
> > Wrong answer.  Look again, you wrote the code, so you _should_ know ;)
> 
> I guess Al agrees that bind mount from foreign namespace must be
> disallowed.
> 
> Which means what Jamie pointed to was right. Attached the patch which
> fixes it.

You are very quick fixing things which are not broken :)

And BTW Jamie was saying, the checks should be removed, not that more
checks should be added (as your patch does).

Jamie Lokier wrote:
> I agree about the bug (and it's why I think the current->namespace
> checks in fs/namespace.c should be killed - the _only_ effect is to
> make un-removable mounts like the above, and the checks are completely
> redundant for "normal" namespace operations).

The checks are actually not redundant, but only because of locking
reasons, not because of security reasons.  So I agree with Jamie, that
in the long run it makes sense to relax those checks.

Miklos


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
       [not found]             ` <1116013840.6248.429.camel@localhost>
@ 2005-05-14  6:11               ` Miklos Szeredi
  2005-05-16 15:11                 ` Ram
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-14  6:11 UTC (permalink / raw)
  To: linuxram; +Cc: viro, akpm, linux-kernel, linux-fsdevel

(CC restored, because I think this is interesting to others as well)

> > > I dont get it...
> > > 
> > > do you agree that bind mounts accross namespaces should be disallowed?
> > 
> > No.  I think it's a useful feature.  It's actually been discussed at
> > length earlier, that it makes sense if the user can copy private
> > mounts from one session to the other (or even automate this with pam,
> > and a daemon).  Why should they be disallowed?
> 
> I understand that feature and the way to do it is through shared
> subtrees.  

I agree fully, that shared subtrees would be very useful.

> I am against a mount in one namespace being bind mounted in
> another namespace(which Al Viro has implied in his mail). 

I'd rather not speculate on what Al Viro was thinking, it may have
been just a misunderstanding.

> With shared subtree if a bind mount is done
> in one namespace, the bind happens within the same namespace.

Yes.  But that's not fundamentally different from explicitly passing a
mount (through a file descriptor) to a process in a different
namespace, and allowing that process to bind mount it in it's native
namespace.

The end result can be exactly the same, only in the shared subtree
case the binding is implicit, while in the other case it's explicit on
both sides (which makes it perfectly secure even for unprivileged use)

Please explain why you think it's wrong to be able to bind mount from
a different namespace?

> 
> However the operation is mirrored to other namespaces
> that has the same heridity link to this namespace.
> 
> probably I can give an example:
> 
> if namespace  n1 has the following tree
>                    v11
>                   /  \
>                  v12  v13
> 
> v1 is mark shared. (mount --make-shared v1) [ for simplicity I vxy
> means  yth vfsmount in xth namespace ]
> 
> and than n2 is cloned out of n1, than in n2 we have 
>                       v21
>                       / \
>                     v22  v23
> 
> now a bind mount in n1 
>          mount --bind v12 v13
> 
>    will first change the tree in n1 as follows:
> 
>                  v11
>                 /   \
>               v12   v13
>                       \
>                       v14
> where v14 is a bind mount of v12
> 
> 
>   and than due to propogation the tree in n2 will also change to
>                     v21
>                    /   \
>                   v22   v23
>                           \
>                           v24
>   where v24 is a bind mount of v22
> 
> 
> Essentially there is no cross-contamination, as well as it meets
> the requirement of per-user-namespace.

What do you mean by cross contamination?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 15:11                 ` Ram
@ 2005-05-16  8:44                   ` Miklos Szeredi
  2005-05-16  8:59                     ` Miklos Szeredi
  2005-05-16 11:26                     ` Jamie Lokier
  2005-05-16 11:14                   ` Jamie Lokier
  1 sibling, 2 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-16  8:44 UTC (permalink / raw)
  To: linuxram; +Cc: viro, akpm, linux-kernel, linux-fsdevel

> Can somebody who know internals of Al Viro's thinking help here?

There's only one person... :)

> > Please explain why you think it's wrong to be able to bind mount from
> > a different namespace?
> 
> 
> If It is allowed, the concept of namespaces itself becomes
> nebulous.  one could bind mount the root vfsmount of all the other
> namespace in their own namespace and then it all becomes one big tree
> with all the other namespaces as a subtree.

1) you need not recursively bind the whole tree of the private
   namespace.  In fact you can only do that by hand, since the kernel
   won't do it (!recurse || check_mnt(old_nd.mnt) in do_loopback).

2) you won't see changes made in other namespace, they are still
   separate, they are just sharing some filesystems, just as after
   clone, or just as after propagation within a shared subtree.

3) this is not automatic, so no filesystems in the other namespace are
   visible unless there's an explicit action by processes both in the
   originating namespace and the destination namespace.

4) in fact, the process in the originating namespace can single out a
   mount and just send a file descriptor refering to that mount
   (e.g. by binding it to a temporary directory, opening the root,
   detaching from the mountpoint, and then sending the file descriptor
   to the receiving process).  This way the receiving process will see
   no other mounts in the originating namespace, and can only bind
   from that single mount.

> why would we need this feature? what extra advantage would this feature
> provide us? Is the advantage of this feature already discussed in this
> thread? (maybe i missed it).

http://lkml.org/lkml/2005/4/25/47

It was suggested as a way of sharing mounts between sessions (as
opposed to shared subtrees, which shares mounts between parent/child
process).

I'm not saying that that this is the way to do it.  But it does seem
to me a useful feature.

And since it doesn't add _any_ complexity to the kernel, I think it
would be rather stupid to remove it.

> > What do you mean by cross contamination?
> 
> A vfsmount in one namespace bound to a mountpoint in another 
> namespace.

A vfsmount can only be in a single namespace at a time, since each
mount tree is rooted in a single namespace.  So what you are saying is
impossible.

This feature is about binding a mount (copying a vfsmount in
clone_mnt()), which happens to be in a different namespace.  After the
vfsmount is cloned, and is attached to the process's native namespace
it's in that namespace solely.

The bug being discussed, is an administrative error, of setting
mnt_namespace to the wrong value, which causes the mount to be
un-removable.  So there was never a question of "cross contamination".

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16  8:44                   ` Miklos Szeredi
@ 2005-05-16  8:59                     ` Miklos Szeredi
  2005-05-16 11:26                     ` Jamie Lokier
  1 sibling, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-16  8:59 UTC (permalink / raw)
  To: miklos; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel

> A vfsmount can only be in a single namespace at a time, since each
> mount tree is rooted in a single namespace.  So what you are saying is
> impossible.

To be more precise a vfsmount can only be in _at_ _most_ one
namespace.  When it is detached from a mount tree, it's no longer in
_any_ namespace.

There even seems to be some confusion about that in namespace.c.  I
think mnt_namespace should be set to NULL in detach_mnt() instead of
__put_namespace().

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 15:11                 ` Ram
  2005-05-16  8:44                   ` Miklos Szeredi
@ 2005-05-16 11:14                   ` Jamie Lokier
  2005-05-17  3:50                     ` Ram
  1 sibling, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2005-05-16 11:14 UTC (permalink / raw)
  To: Ram; +Cc: Miklos Szeredi, viro, Andrew Morton, linux-kernel, linux-fsdevel

Ram wrote:
> > I'd rather not speculate on what Al Viro was thinking, it may have
> > been just a misunderstanding.
> 
> Can somebody who know internals of Al Viro's thinking help here?

Presumably he wrote this line:

	if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {

Which /explicitly/ permits bind mounts between namespaces if it's not
recursive.  It's not accidental: that !recurse is blatantly making a
point of allowing it.

I take that to mean that /at least at one time/ Al chose to allow it.

Then again, he also wrote this:

> > Bind mount from a foreign namespace results in
> 
> ... -EINVAL

Which means that /at another time/ Al thought he'd disallowed it.

This is a bit like arguing over what the Founding Fathers of the US
Constitution meant.  Does it matter?  We really should ask what
behaviour makes sense now.  Should we add more explicit restrictions
to the code, making the concept of namespaces more restrictive?  Or
remove the restrictions, on the grounds that they don't really add any
security, it'd be useful to relax them, and the code would be simpler?

-- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16  8:44                   ` Miklos Szeredi
  2005-05-16  8:59                     ` Miklos Szeredi
@ 2005-05-16 11:26                     ` Jamie Lokier
  2005-05-16 13:23                       ` Miklos Szeredi
  1 sibling, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2005-05-16 11:26 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
> 1) you need not recursively bind the whole tree of the private
>    namespace.  In fact you can only do that by hand, since the kernel
>    won't do it (!recurse || check_mnt(old_nd.mnt) in do_loopback).

That would be easy to change if it was desired though, by taking both
namespace semaphores when two namespaces are involved.

> 2) you won't see changes made in other namespace, they are still
>    separate, they are just sharing some filesystems, just as after
>    clone, or just as after propagation within a shared subtree.

That's true.

Let's not get confused between binding across namespaces, and
chroot/chdir into an fd supplied by a process from another namespace.

In the case of bind mount, that _won't_ see changes made in the other
namespace.  The /dentry/ visible in the other namespace is simply
mounted here.  The fact that it happens to come from another namespace
is irrelevant: the other namespace is not used for single dentry bind mounts.

In the case of chroot/chdir, that _will_ see changes made in the other
namespace.  Effectively, that transitions into the other namespace as
a whole, which is exactly what we want in some cases (when userspace
policy determines that a per-session namespace is wanted).

> 4) in fact, the process in the originating namespace can single out a
>    mount and just send a file descriptor refering to that mount
>    (e.g. by binding it to a temporary directory, opening the root,
>    detaching from the mountpoint, and then sending the file descriptor
>    to the receiving process).  This way the receiving process will see
>    no other mounts in the originating namespace, and can only bind
>    from that single mount.

Nice.  The process in the originating namespace can also bind a small,
carefully selected tree of mounts to a tree in that temporary
directory before passing it, so the recipient can chroot/chdir into
the set of mounts and get only those explicitly authorised by the
originating process.

-- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 11:26                     ` Jamie Lokier
@ 2005-05-16 13:23                       ` Miklos Szeredi
  0 siblings, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-16 13:23 UTC (permalink / raw)
  To: jamie; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel

> > 1) you need not recursively bind the whole tree of the private
> >    namespace.  In fact you can only do that by hand, since the kernel
> >    won't do it (!recurse || check_mnt(old_nd.mnt) in do_loopback).
> 
> That would be easy to change if it was desired though, by taking both
> namespace semaphores when two namespaces are involved.

Yes.

The other check_mnt() calls could be removed by taking
nd.mnt->mnt_namespace->sem instead of current->namespace->sem in the
relevant functions.

It does make sense IMO, even if it won't be used very often, since
only very little extra complexity is involved.

> > 4) in fact, the process in the originating namespace can single out a
> >    mount and just send a file descriptor refering to that mount
> >    (e.g. by binding it to a temporary directory, opening the root,
> >    detaching from the mountpoint, and then sending the file descriptor
> >    to the receiving process).  This way the receiving process will see
> >    no other mounts in the originating namespace, and can only bind
> >    from that single mount.
> 
> Nice.  The process in the originating namespace can also bind a small,
> carefully selected tree of mounts to a tree in that temporary
> directory before passing it, so the recipient can chroot/chdir into
> the set of mounts and get only those explicitly authorised by the
> originating process.

That won't work, since detach (umount -l) will break up the tree, and
the file descriptor will hold a reference to only one vfsmount/dentry.

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-14  6:11               ` Miklos Szeredi
@ 2005-05-16 15:11                 ` Ram
  2005-05-16  8:44                   ` Miklos Szeredi
  2005-05-16 11:14                   ` Jamie Lokier
  0 siblings, 2 replies; 46+ messages in thread
From: Ram @ 2005-05-16 15:11 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: viro, Andrew Morton, linux-kernel, linux-fsdevel

On Fri, 2005-05-13 at 23:11, Miklos Szeredi wrote:
> (CC restored, because I think this is interesting to others as well)
> 
> > > > I dont get it...
> > > > 
> > > > do you agree that bind mounts accross namespaces should be disallowed?
> > > 
> > > No.  I think it's a useful feature.  It's actually been discussed at
> > > length earlier, that it makes sense if the user can copy private
> > > mounts from one session to the other (or even automate this with pam,
> > > and a daemon).  Why should they be disallowed?
> > 
> > I understand that feature and the way to do it is through shared
> > subtrees.  
> 
> I agree fully, that shared subtrees would be very useful.
> 
> > I am against a mount in one namespace being bind mounted in
> > another namespace(which Al Viro has implied in his mail). 
> 
> I'd rather not speculate on what Al Viro was thinking, it may have
> been just a misunderstanding.

Can somebody who know internals of Al Viro's thinking help here?


> 
> > With shared subtree if a bind mount is done
> > in one namespace, the bind happens within the same namespace.
> 
> Yes.  But that's not fundamentally different from explicitly passing a
> mount (through a file descriptor) to a process in a different
> namespace, and allowing that process to bind mount it in it's native
> namespace.
> 
> The end result can be exactly the same, only in the shared subtree
> case the binding is implicit, while in the other case it's explicit on
> both sides (which makes it perfectly secure even for unprivileged use)
> 
> Please explain why you think it's wrong to be able to bind mount from
> a different namespace?


If It is allowed, the concept of namespaces itself becomes
nebulous.  one could bind mount the root vfsmount of all the other
namespace in their own namespace and then it all becomes one big tree
with all the other namespaces as a subtree.

why would we need this feature? what extra advantage would this feature
provide us? Is the advantage of this feature already discussed in this
thread? (maybe i missed it).


> 
> > 
> > However the operation is mirrored to other namespaces
> > that has the same heridity link to this namespace.
> > 
> > probably I can give an example:
> > 
> > if namespace  n1 has the following tree
> >                    v11
> >                   /  \
> >                  v12  v13
> > 
> > v1 is mark shared. (mount --make-shared v1) [ for simplicity I vxy
> > means  yth vfsmount in xth namespace ]
> > 
> > and than n2 is cloned out of n1, than in n2 we have 
> >                       v21
> >                       / \
> >                     v22  v23
> > 
> > now a bind mount in n1 
> >          mount --bind v12 v13
> > 
> >    will first change the tree in n1 as follows:
> > 
> >                  v11
> >                 /   \
> >               v12   v13
> >                       \
> >                       v14
> > where v14 is a bind mount of v12
> > 
> > 
> >   and than due to propogation the tree in n2 will also change to
> >                     v21
> >                    /   \
> >                   v22   v23
> >                           \
> >                           v24
> >   where v24 is a bind mount of v22
> > 
> > 
> > Essentially there is no cross-contamination, as well as it meets
> > the requirement of per-user-namespace.
> 
> What do you mean by cross contamination?

A vfsmount in one namespace bound to a mountpoint in another 
namespace.

RP



> 
> Thanks,
> Miklos


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH] namespace.c: fix bind mount from foreign namespace
@ 2005-05-16 19:51 Miklos Szeredi
  2005-05-17  1:23 ` Jamie Lokier
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-16 19:51 UTC (permalink / raw)
  To: akpm, viro; +Cc: linux-kernel, linux-fsdevel

I'm resending this patch, because I still believe it's the correct fix.

Tested before/after applying the patch with a test application
available from:

  http://www.inf.bme.hu/~mszeredi/nstest.c

Bind mount from a foreign namespace results in an un-removable mount.
The reason is that mnt->mnt_namespace is copied from the old mount in
clone_mnt().  Because of this check_mnt() in sys_umount() will fail.

The solution is to set mnt->mnt_namespace to current->namespace in
clone_mnt().  clone_mnt() is either called from do_loopback() or
copy_tree().  copy_tree() is called from do_loopback() or
copy_namespace().

When called (directly or indirectly) from do_loopback(), always
current->namspace is being modified: check_mnt(nd->mnt).  So setting
mnt->mnt_namespace to current->namspace is the right thing to do.

When called from copy_namespace(), the setting of mnt_namespace is
irrelevant, since mnt_namespace is reset later in that function for
all copied mounts.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2005-05-13 12:22:52.000000000 +0200
+++ linux/fs/namespace.c	2005-05-13 12:32:36.000000000 +0200
@@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
 		mnt->mnt_root = dget(root);
 		mnt->mnt_mountpoint = mnt->mnt_root;
 		mnt->mnt_parent = mnt;
-		mnt->mnt_namespace = old->mnt_namespace;
+		mnt->mnt_namespace = current->namespace;
 
 		/* stick the duplicate mount on the same expiry list
 		 * as the original if that was on one */



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-17  3:50                     ` Ram
@ 2005-05-16 20:15                       ` Miklos Szeredi
  2005-05-17  1:28                         ` Jamie Lokier
  2005-05-17 18:48                         ` [PATCH] namespace.c: fix bind mount from foreign namespace Ram
  2005-05-17  0:00                       ` Jamie Lokier
  1 sibling, 2 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-16 20:15 UTC (permalink / raw)
  To: linuxram; +Cc: jamie, miklos, viro, akpm, linux-kernel, linux-fsdevel

> Ok. less restriction without compromising security is a good idea.
> 
> Under the premise that bind mounts across namespace should be allowed;
> any insight why the "founding fathers" :) allowed only bind
> and not recursive bind?  What issue would that create? One can
> easily workaround that restriction by manually binding recursively.
> So does the recursive bind restriction serve any purpose?
> 
> I remember Miklos saying its not a security issue but a
> implementation/locking issue. That can be fixed aswell.

Yes, as pointed out by Jamie, both namespaces need to be locked for
this to work.  Something like the attached should do it.

Miklos


Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2005-05-16 22:02:36.000000000 +0200
+++ linux/fs/namespace.c	2005-05-16 22:13:30.000000000 +0200
@@ -622,6 +622,8 @@ out_unlock:
 static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
 {
 	struct nameidata old_nd;
+	struct namespace *ns1 = current->namespace;
+	struct namespace *ns2 = NULL;
 	struct vfsmount *mnt = NULL;
 	int err = mount_is_safe(nd);
 	if (err)
@@ -632,15 +634,30 @@ static int do_loopback(struct nameidata 
 	if (err)
 		return err;
 
-	down_write(&current->namespace->sem);
 	err = -EINVAL;
-	if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
-		err = -ENOMEM;
-		if (recurse)
-			mnt = copy_tree(old_nd.mnt, old_nd.dentry);
-		else
-			mnt = clone_mnt(old_nd.mnt, old_nd.dentry);
-	}
+	if (!check_mnt(nd->mnt))
+		goto out_path_release;
+
+	if (recurse && old_nd.mnt->mnt_namespace != ns1) {
+		ns2 = old_nd.mnt->mnt_namespace;
+		if (ns1 < ns2) {
+			down_write(&ns1->sem);
+			down_write(&ns2->sem);
+		} else {
+			down_write(&ns2->sem);
+			down_write(&ns1->sem);
+		}
+	} else
+		down_write(&ns1->sem);
+
+	err = -ENOMEM;
+	if (recurse)
+		mnt = copy_tree(old_nd.mnt, old_nd.dentry);
+	else
+		mnt = clone_mnt(old_nd.mnt, old_nd.dentry);
+	
+	if (ns2)
+		up_write(&ns2->sem);
 
 	if (mnt) {
 		/* stop bind mounts from expiring */
@@ -657,7 +674,8 @@ static int do_loopback(struct nameidata 
 			mntput(mnt);
 	}
 
-	up_write(&current->namespace->sem);
+	up_write(&ns1->sem);
+out_path_release:
 	path_release(&old_nd);
 	return err;
 }

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-17  3:50                     ` Ram
  2005-05-16 20:15                       ` Miklos Szeredi
@ 2005-05-17  0:00                       ` Jamie Lokier
  1 sibling, 0 replies; 46+ messages in thread
From: Jamie Lokier @ 2005-05-17  0:00 UTC (permalink / raw)
  To: Ram; +Cc: Miklos Szeredi, viro, Andrew Morton, linux-kernel, linux-fsdevel

Ram wrote:
> Under the premise that bind mounts across namespace should be allowed;
> any insight why the "founding fathers" :) allowed only bind
> and not recursive bind?  What issue would that create?

Recursive bind traverses the subtree of vfsmnts rooted at the source
mount (following mnt->mnt_mounts, see copy_tree()).  That requires the
source mount's namespace semaphore to be held.

> One can easily workaround that restriction by manually binding
> recursively.

Yes, if you know which mounts they are.

> I remember Miklos saying its not a security issue but a
> implementation/locking issue. That can be fixed aswell.

Yes, by taking the source namespace semaphore while traversing the
subtree.  That involves taking _two_ semaphores, so they have to be
ordered to avoid deadlock (see double-locking elsewhere in the kernel).

- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 19:51 Miklos Szeredi
@ 2005-05-17  1:23 ` Jamie Lokier
  0 siblings, 0 replies; 46+ messages in thread
From: Jamie Lokier @ 2005-05-17  1:23 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, viro, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
> Bind mount from a foreign namespace results in an un-removable mount.
> The reason is that mnt->mnt_namespace is copied from the old mount in
> clone_mnt().  Because of this check_mnt() in sys_umount() will fail.
> 
> The solution is to set mnt->mnt_namespace to current->namespace in
> clone_mnt().  clone_mnt() is either called from do_loopback() or
> copy_tree().  copy_tree() is called from do_loopback() or
> copy_namespace().
> 
> When called (directly or indirectly) from do_loopback(), always
> current->namspace is being modified: check_mnt(nd->mnt).  So setting
> mnt->mnt_namespace to current->namspace is the right thing to do.
> 
> When called from copy_namespace(), the setting of mnt_namespace is
> irrelevant, since mnt_namespace is reset later in that function for
> all copied mounts.
> 
> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>

This patch is correct.  The old code was buggy for more fundamental
and serious reason: it broke the invariant that a tree of vfsmnts all
have the same value of mnt_namespace (and the same for the mnt_list
list).

Signed-off-by: Jamie Lokier <jamie@shareable.org>

> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c	2005-05-13 12:22:52.000000000 +0200
> +++ linux/fs/namespace.c	2005-05-13 12:32:36.000000000 +0200
> @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
>  		mnt->mnt_root = dget(root);
>  		mnt->mnt_mountpoint = mnt->mnt_root;
>  		mnt->mnt_parent = mnt;
> -		mnt->mnt_namespace = old->mnt_namespace;
> +		mnt->mnt_namespace = current->namespace;
>  
>  		/* stick the duplicate mount on the same expiry list
>  		 * as the original if that was on one */

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 20:15                       ` Miklos Szeredi
@ 2005-05-17  1:28                         ` Jamie Lokier
  2005-05-17  5:34                           ` Miklos Szeredi
  2005-05-17 18:48                         ` [PATCH] namespace.c: fix bind mount from foreign namespace Ram
  1 sibling, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2005-05-17  1:28 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel

Miklos Szeredi wrote:
> +		if (ns1 < ns2) {
> +			down_write(&ns1->sem);
> +			down_write(&ns2->sem);
> +		} else {
> +			down_write(&ns2->sem);
> +			down_write(&ns1->sem);
> +		}

That's a bit smaller (source and compiled) as:

	if (ns2 < ns1)
		down_write(&ns2->sem);
	down_write(&ns1->sem);
	if (ns2 > ns1)
		down_write(&ns2->sem);

(And you'll notice that does the right thing if ns2==ns1 too, in case
that gives you any ideas.)

Otherwise, the patch looks convincing to me.

-- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 11:14                   ` Jamie Lokier
@ 2005-05-17  3:50                     ` Ram
  2005-05-16 20:15                       ` Miklos Szeredi
  2005-05-17  0:00                       ` Jamie Lokier
  0 siblings, 2 replies; 46+ messages in thread
From: Ram @ 2005-05-17  3:50 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Miklos Szeredi, viro, Andrew Morton, linux-kernel, linux-fsdevel

On Mon, 2005-05-16 at 04:14, Jamie Lokier wrote:
> Ram wrote:
> > > I'd rather not speculate on what Al Viro was thinking, it may have
> > > been just a misunderstanding.
> > 
> > Can somebody who know internals of Al Viro's thinking help here?
> 
> Presumably he wrote this line:
> 
> 	if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
> 
> Which /explicitly/ permits bind mounts between namespaces if it's not
> recursive.  It's not accidental: that !recurse is blatantly making a
> point of allowing it.
> 
> I take that to mean that /at least at one time/ Al chose to allow it.
> 
> Then again, he also wrote this:
> 
> > > Bind mount from a foreign namespace results in
> > 
> > ... -EINVAL
> 
> Which means that /at another time/ Al thought he'd disallowed it.
> 
> This is a bit like arguing over what the Founding Fathers of the US
> Constitution meant.  Does it matter?  We really should ask what
> behaviour makes sense now.  Should we add more explicit restrictions
> to the code, making the concept of namespaces more restrictive?  Or
> remove the restrictions, on the grounds that they don't really add any
> security, it'd be useful to relax them, and the code would be simpler?
> 

Ok. less restriction without compromising security is a good idea.

Under the premise that bind mounts across namespace should be allowed;
any insight why the "founding fathers" :) allowed only bind
and not recursive bind?  What issue would that create? One can
easily workaround that restriction by manually binding recursively.
So does the recursive bind restriction serve any purpose?

I remember Miklos saying its not a security issue but a
implementation/locking issue. That can be fixed aswell.

RP


> -- Jamie


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-17  1:28                         ` Jamie Lokier
@ 2005-05-17  5:34                           ` Miklos Szeredi
       [not found]                             ` <1116360352.24560.85.camel@localhost>
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-17  5:34 UTC (permalink / raw)
  To: jamie; +Cc: linuxram, viro, akpm, linux-kernel, linux-fsdevel

> That's a bit smaller (source and compiled) as:
> 
> 	if (ns2 < ns1)
> 		down_write(&ns2->sem);
> 	down_write(&ns1->sem);
> 	if (ns2 > ns1)
> 		down_write(&ns2->sem);
> 
> (And you'll notice that does the right thing if ns2==ns1 too, in case
> that gives you any ideas.)

Nice.

> Otherwise, the patch looks convincing to me.

There's another problem with it: we don't hold a reference to
old_nd.mnt->mnt_namespace, and the namespace going away could race
with this function.

So first obtain the necessary reference:

	spin_lock(&vfsmount_lock);
	ns2 = old_nd.mnt->mnt_namespace;
	if (ns2)
		get_namespace(ns2);
	spin_unlock(&vfsmount_lock);

Then take the semaphores.

Then recheck old_nd.mnt->mnt_namespace, because it might have been
detached between the spin_unlock() and the down_write(&ns2->sem).

It's starting to get a bit complex, and I'm wondering if it's worth it :)

Miklos


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] namespace.c: fix bind mount from foreign namespace
  2005-05-16 20:15                       ` Miklos Szeredi
  2005-05-17  1:28                         ` Jamie Lokier
@ 2005-05-17 18:48                         ` Ram
  1 sibling, 0 replies; 46+ messages in thread
From: Ram @ 2005-05-17 18:48 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel

On Mon, 2005-05-16 at 13:15, Miklos Szeredi wrote:
> > Ok. less restriction without compromising security is a good idea.
> > 
> > Under the premise that bind mounts across namespace should be allowed;
> > any insight why the "founding fathers" :) allowed only bind
> > and not recursive bind?  What issue would that create? One can
> > easily workaround that restriction by manually binding recursively.
> > So does the recursive bind restriction serve any purpose?
> > 
> > I remember Miklos saying its not a security issue but a
> > implementation/locking issue. That can be fixed aswell.
> 
> Yes, as pointed out by Jamie, both namespaces need to be locked for
> this to work.  Something like the attached should do it.
> 
> Miklos
> 
> 
> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c	2005-05-16 22:02:36.000000000 +0200
> +++ linux/fs/namespace.c	2005-05-16 22:13:30.000000000 +0200
> @@ -622,6 +622,8 @@ out_unlock:
>  static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
>  {
>  	struct nameidata old_nd;
> +	struct namespace *ns1 = current->namespace;
> +	struct namespace *ns2 = NULL;
>  	struct vfsmount *mnt = NULL;
>  	int err = mount_is_safe(nd);
>  	if (err)
> @@ -632,15 +634,30 @@ static int do_loopback(struct nameidata 
>  	if (err)
>  		return err;
>  
> -	down_write(&current->namespace->sem);
>  	err = -EINVAL;
> -	if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
> -		err = -ENOMEM;
> -		if (recurse)
> -			mnt = copy_tree(old_nd.mnt, old_nd.dentry);
> -		else
> -			mnt = clone_mnt(old_nd.mnt, old_nd.dentry);
> -	}
> +	if (!check_mnt(nd->mnt))
> +		goto out_path_release;

This disallows bind mounts in foreign namespace. 
But allows bind mounts in current namespace from foreign namespace. Any
reason? 
Both should be allowed. Infact both the namespaces  operated
on could be foriegn namespaces.

This is based on the premise that the process has gained rights 
to operate on the namespaces in question.

RP



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH] fix race in mark_mounts_for_expiry()
       [not found]                                   ` <1116400118.24560.119.camel@localhost>
@ 2005-05-18  9:51                                     ` Miklos Szeredi
  2005-05-18 10:32                                     ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18  9:51 UTC (permalink / raw)
  To: linuxram
  Cc: dhowells, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel

> > > > I am seeing a locking issue with get_namespace() and put_namespace()
> > > > 
> > > > lets say put_namespace() is called and it finds that it is the last
> > > > user of the namespace is just about to  call __put_namespace().
> > > > 
> > > > Simultaneously another thread calls get_namespace() and increments
> > > > the count.  
> > > > 
> > > > At this point __put_namespace() goes ahead and cleans the namespace.
> > > 
> > > I think you are right. 
> > > 
> > > There's one place in the existing code, which could experience this
> > > race I think:
> > > 
> > > 
> > > 		/* don't do anything if the namespace is dead - all the
> > > 		 * vfsmounts from it are going away anyway */
> > > 		namespace = mnt->mnt_namespace;
> > > 		if (!namespace || atomic_read(&namespace->count) <= 0)
> > > 			continue;
> > > here ----> 
> > > 		get_namespace(namespace);
> > > 
> > > 		spin_unlock(&vfsmount_lock);
> > > 
> > > 
> > > Locking vfsmount_lock in put_namespace() would fix it.  Any better ideas?
> > > 
> > yes. it will.
> > 
> > However I don't think this issue will trigger currently because for this
> > to happen, a process should attempt a get_namespace() on a foreign
> > namespace. If all tasks are operating in the same
> > namespace, than the count will not go to zero for put namespace to
> > clean it up, unless it is the last task.
> > 
> > I think, it will trigger once we allow recursive binds from foreign
> > namespace.
> 
> No I am wrong. mark_mounts_for_expiry() is not called from a process
> context. Its in a timer context. So the race could happen.

How about this patch?  It tries to solve this race without additional
locking.  If refcount is already zero, it will increment and decrement
it.  So be careful to only call grab_namespace() with vfsmount_lock
held, otherwise it could race with itself.  (vfsmount_lock is also
needed in this case so that mnt->mnt_namespace, doesn't change, while
grabbing the namespace)

Compile tested only, please review carefully.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2005-05-17 13:46:58.000000000 +0200
+++ linux/fs/namespace.c	2005-05-18 11:09:47.000000000 +0200
@@ -825,6 +825,16 @@ unlock:
 
 EXPORT_SYMBOL_GPL(do_add_mount);
 
+/* Must be called with vfsmount_lock */
+static inline struct namespace *grab_namespace(struct namespace *n)
+{
+	if (n && atomic_add_return(1, &n->count) <= 1) {
+		atomic_dec(&n->count);
+		n = NULL;
+	}
+	return n;
+}
+
 /*
  * process a list of expirable mountpoints with the intent of discarding any
  * mountpoints that aren't in use and haven't been touched since last we came
@@ -868,10 +878,9 @@ void mark_mounts_for_expiry(struct list_
 
 		/* don't do anything if the namespace is dead - all the
 		 * vfsmounts from it are going away anyway */
-		namespace = mnt->mnt_namespace;
-		if (!namespace || atomic_read(&namespace->count) <= 0)
+		namespace = grab_namespace(mnt->mnt_namespace);
+		if (!namespace)
 			continue;
-		get_namespace(namespace);
 
 		spin_unlock(&vfsmount_lock);
 		down_write(&namespace->sem);


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
       [not found]                                   ` <1116400118.24560.119.camel@localhost>
  2005-05-18  9:51                                     ` [PATCH] fix race in mark_mounts_for_expiry() Miklos Szeredi
@ 2005-05-18 10:32                                     ` David Howells
  2005-05-18 10:37                                       ` Miklos Szeredi
  2005-05-18 10:46                                       ` David Howells
  1 sibling, 2 replies; 46+ messages in thread
From: David Howells @ 2005-05-18 10:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: linuxram, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> 
> How about this patch?  It tries to solve this race without additional
> locking.  If refcount is already zero, it will increment and decrement
> it.  So be careful to only call grab_namespace() with vfsmount_lock
> held, otherwise it could race with itself.  (vfsmount_lock is also
> needed in this case so that mnt->mnt_namespace, doesn't change, while
> grabbing the namespace)

How about using cmpxchg?

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:32                                     ` David Howells
@ 2005-05-18 10:37                                       ` Miklos Szeredi
  2005-05-18 10:46                                       ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 10:37 UTC (permalink / raw)
  To: dhowells; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel

> > How about this patch?  It tries to solve this race without additional
> > locking.  If refcount is already zero, it will increment and decrement
> > it.  So be careful to only call grab_namespace() with vfsmount_lock
> > held, otherwise it could race with itself.  (vfsmount_lock is also
> > needed in this case so that mnt->mnt_namespace, doesn't change, while
> > grabbing the namespace)
> 
> How about using cmpxchg?

How?  If the count is nonzero, an incremented count must be stored.
You can't do that atomically with cmpxchg.

Miklos


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:32                                     ` David Howells
  2005-05-18 10:37                                       ` Miklos Szeredi
@ 2005-05-18 10:46                                       ` David Howells
  2005-05-18 10:53                                         ` Miklos Szeredi
                                                           ` (2 more replies)
  1 sibling, 3 replies; 46+ messages in thread
From: David Howells @ 2005-05-18 10:46 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> > How about using cmpxchg?
> 
> How?  If the count is nonzero, an incremented count must be stored.
> You can't do that atomically with cmpxchg.

Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co
with cmpxchg() or similar.

Something like:

	static inline struct namespace *grab_namespace(struct namespace *n)
	{
		int old = atomic_read(&n->count);

		while (old > 0) {
			/* attempt to increment the counter */
			old = cmpxchg(&n->count, old, old + 1);
		}

		return old > 0 ? n : NULL;
	}

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:46                                       ` David Howells
@ 2005-05-18 10:53                                         ` Miklos Szeredi
  2005-05-18 11:07                                           ` Trond Myklebust
  2005-05-18 10:59                                         ` David Howells
  2005-05-18 11:51                                         ` David Howells
  2 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 10:53 UTC (permalink / raw)
  To: dhowells; +Cc: miklos, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel

> Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co
> with cmpxchg() or similar.
> 
> Something like:
> 
> 	static inline struct namespace *grab_namespace(struct namespace *n)
> 	{
> 		int old = atomic_read(&n->count);
> 
> 		while (old > 0) {
> 			/* attempt to increment the counter */
> 			old = cmpxchg(&n->count, old, old + 1);
> 		}
> 
> 		return old > 0 ? n : NULL;
> 	}
> 

Ahh OK :)

There's still the problem of cmpxchg meddling in the internals of an
atomic_t.  Is that OK?  Will that work on all archs?

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:46                                       ` David Howells
  2005-05-18 10:53                                         ` Miklos Szeredi
@ 2005-05-18 10:59                                         ` David Howells
  2005-05-18 11:14                                           ` Miklos Szeredi
  2005-05-18 11:51                                         ` David Howells
  2 siblings, 1 reply; 46+ messages in thread
From: David Howells @ 2005-05-18 10:59 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel

Miklos Szeredi <miklos@szeredi.hu> wrote:

> 
> There's still the problem of cmpxchg meddling in the internals of an
> atomic_t.  Is that OK?  Will that work on all archs?

Probably. It might be worth defining an atomic_cmpxchg() op to formalise it
though:

	int atomic_cmpxchg(atomic_t *p, int old, int new);

The main reason for this is that atomic_t is mostly a 32-bit value, I think,
and cmpxchg() may sometimes enforce a 64-bit value.

It can be made to work on PPC, PPC64, x86, x86_64, frv. I imagine it'll work
on MIPS, sparc and alpha too without too much trouble. Dunno about the rest.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:53                                         ` Miklos Szeredi
@ 2005-05-18 11:07                                           ` Trond Myklebust
  2005-05-18 11:32                                             ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Trond Myklebust @ 2005-05-18 11:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, linuxram, jamie, viro, akpm, linux-kernel,
	linux-fsdevel

on den 18.05.2005 Klokka 12:53 (+0200) skreiv Miklos Szeredi:
> > Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co
> > with cmpxchg() or similar.
> > 
> > Something like:
> > 
> > 	static inline struct namespace *grab_namespace(struct namespace *n)
> > 	{
> > 		int old = atomic_read(&n->count);
> > 
> > 		while (old > 0) {
> > 			/* attempt to increment the counter */
> > 			old = cmpxchg(&n->count, old, old + 1);
> > 		}
> > 
> > 		return old > 0 ? n : NULL;
> > 	}
> > 
> 
> Ahh OK :)
> 
> There's still the problem of cmpxchg meddling in the internals of an
> atomic_t.  Is that OK?  Will that work on all archs?

Some archs already have an atomic_dec_if_positive() (see for instance
the PPC). It won't take much work to convert that to an
atomic_inc_if_positive().

For those arches that don't have that sort of thing, then writing a
generic atomic_inc_if_positive() using cmpxchg() will often be possible,
but there are exceptions (for instance the original 386 does not have a
cmpxchg, so there you will have to use something else).

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:59                                         ` David Howells
@ 2005-05-18 11:14                                           ` Miklos Szeredi
  0 siblings, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 11:14 UTC (permalink / raw)
  To: dhowells; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel

> > 
> > There's still the problem of cmpxchg meddling in the internals of an
> > atomic_t.  Is that OK?  Will that work on all archs?
> 
> Probably. It might be worth defining an atomic_cmpxchg() op to formalise it
> though:
> 
> 	int atomic_cmpxchg(atomic_t *p, int old, int new);
> 
> The main reason for this is that atomic_t is mostly a 32-bit value, I think,
> and cmpxchg() may sometimes enforce a 64-bit value.
> 
> It can be made to work on PPC, PPC64, x86, x86_64, frv. I imagine it'll work
> on MIPS, sparc and alpha too without too much trouble. Dunno about the rest.

Any takers?

I'd rather not do this, if I don't have to :)

Do you think my original fix is wrong, or is this just cosmetics?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 11:07                                           ` Trond Myklebust
@ 2005-05-18 11:32                                             ` Miklos Szeredi
  2005-05-18 12:50                                               ` Jamie Lokier
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 11:32 UTC (permalink / raw)
  To: trond.myklebust
  Cc: dhowells, linuxram, jamie, viro, akpm, linux-kernel,
	linux-fsdevel

> Some archs already have an atomic_dec_if_positive() (see for instance
> the PPC). It won't take much work to convert that to an
> atomic_inc_if_positive().
> 
> For those arches that don't have that sort of thing, then writing a
> generic atomic_inc_if_positive() using cmpxchg() will often be possible,
> but there are exceptions (for instance the original 386 does not have a
> cmpxchg, so there you will have to use something else).

The problem with introducing architecture specific code, is that it's
just asking for new bugs.

If it's something used all over the kernel, than obviously it's OK,
but for the sake of just one caller it's a bit crazy I think.

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 10:46                                       ` David Howells
  2005-05-18 10:53                                         ` Miklos Szeredi
  2005-05-18 10:59                                         ` David Howells
@ 2005-05-18 11:51                                         ` David Howells
  2005-05-18 12:08                                           ` Miklos Szeredi
  2 siblings, 1 reply; 46+ messages in thread
From: David Howells @ 2005-05-18 11:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel


Miklos Szeredi <miklos@szeredi.hu> wrote:

> Do you think my original fix is wrong, or is this just cosmetics?

What you're doing is tricky. It's asking for a race. Admittedly, it may not
occur in the particular situation you're looking at, but can you always
guarantee that? Remember, it may be a race against some piece of code that's
not yet written, by an author who doesn't realise what _you_ are doing here
because their changeset doesn't intersect with yours.

Remember: you have, in effect, made the usage count on that structure
non-atomic.

I do something like that in rwsems and it's something I have to be very
careful about. The main reason I can get away with it is that the actual
implementation of rwsems is small and is located in a very restrictes set of
places and it's not intermingled with other stuff.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 11:51                                         ` David Howells
@ 2005-05-18 12:08                                           ` Miklos Szeredi
  2005-05-18 12:33                                             ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 12:08 UTC (permalink / raw)
  To: dhowells; +Cc: miklos, linuxram, jamie, viro, akpm, linux-kernel, linux-fsdevel

> What you're doing is tricky. It's asking for a race.

I know.  The comment above the function is there to make sure the user
is aware of this.

> Admittedly, it may not
> occur in the particular situation you're looking at, but can you always
> guarantee that?

Yes, if it's always called under lock.

> Remember, it may be a race against some piece of code that's not yet
> written, by an author who doesn't realise what _you_ are doing here
> because their changeset doesn't intersect with yours.
> 
> Remember: you have, in effect, made the usage count on that structure
> non-atomic.

But _only after_ it's has gone to zero.  When in fact there are no
more references to it, so it shouldn't matter.

The fact that it does matter and that mark_mounts_for_expiry()
derefences mnt->mnt_namespace without actually having a proper
reference to the namespace is the real culprit here.

This is the third bug found by Jamie Lokier, Ram and me in the
mnt_namespace change.  So if we are looking at proper solutions I
think that is what we should be examining.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 12:08                                           ` Miklos Szeredi
@ 2005-05-18 12:33                                             ` Miklos Szeredi
  2005-05-18 16:53                                               ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 12:33 UTC (permalink / raw)
  To: miklos
  Cc: dhowells, miklos, linuxram, jamie, viro, akpm, linux-kernel,
	linux-fsdevel

> But _only after_ it's has gone to zero.  When in fact there are no
> more references to it, so it shouldn't matter.
> 
> The fact that it does matter and that mark_mounts_for_expiry()
> derefences mnt->mnt_namespace without actually having a proper
> reference to the namespace is the real culprit here.
> 
> This is the third bug found by Jamie Lokier, Ram and me in the
> mnt_namespace change.  So if we are looking at proper solutions I
> think that is what we should be examining.

E.g. having a separate task count, which is incremented/decremented
only by clone/exit.  If the task count goes to zero, umount_tree is
called on root, but namespace is not freed.

And each mnt_namespace holds a proper reference to the namespace, so
it's safe to dereference it anytime.  When truly no more references
remain, the namespace can go away.

Hmm?

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 11:32                                             ` Miklos Szeredi
@ 2005-05-18 12:50                                               ` Jamie Lokier
  2005-05-18 13:21                                                 ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2005-05-18 12:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel,
	linux-fsdevel

Miklos Szeredi wrote:
> > Some archs already have an atomic_dec_if_positive() (see for instance
> > the PPC). It won't take much work to convert that to an
> > atomic_inc_if_positive().
> > 
> > For those arches that don't have that sort of thing, then writing a
> > generic atomic_inc_if_positive() using cmpxchg() will often be possible,
> > but there are exceptions (for instance the original 386 does not have a
> > cmpxchg, so there you will have to use something else).
> 
> The problem with introducing architecture specific code, is that it's
> just asking for new bugs.
> 
> If it's something used all over the kernel, than obviously it's OK,
> but for the sake of just one caller it's a bit crazy I think.

I agree.

And I think you're just adding to the case for removing mnt_namespace
entirely.  We'd still keep CLONE_NS, and users currently using
namespaces (in the normal ways) would see no difference.

mnt_namespace has these visible effects:

    - Prevents some tasks from mounting/umounting in a "foreign"
      namespace, even when they are granted access to the directory
      tree of the foreign namespace.

      It's not clear if the restriction is a useful security tool.

    - Causes every mount in a mount tree to be detached (independently),
      when last task associated with a namespace is destroyed.

And this invisible effect:

    - More concurrency than a global mount lock would have.

Is that all?  Are any of these effects important enough to keep?

-- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 12:50                                               ` Jamie Lokier
@ 2005-05-18 13:21                                                 ` Miklos Szeredi
  2005-05-18 17:34                                                   ` Jamie Lokier
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 13:21 UTC (permalink / raw)
  To: jamie
  Cc: miklos, trond.myklebust, dhowells, linuxram, viro, akpm,
	linux-kernel, linux-fsdevel

> And I think you're just adding to the case for removing mnt_namespace
> entirely.  We'd still keep CLONE_NS, and users currently using
> namespaces (in the normal ways) would see no difference.
> 
> mnt_namespace has these visible effects:
> 
>     - Prevents some tasks from mounting/umounting in a "foreign"
>       namespace, even when they are granted access to the directory
>       tree of the foreign namespace.
> 
>       It's not clear if the restriction is a useful security tool.
> 
>     - Causes every mount in a mount tree to be detached (independently),
>       when last task associated with a namespace is destroyed.

I don't understand.  The tree _has_ to be detached when no task uses
the namespace.  That is the main purpose of the namespace structure,
to provide an anchor for the mount tree.

> And this invisible effect:
> 
>     - More concurrency than a global mount lock would have.

This is the key issue I think.  It may even have security implications
in the future if we want to allow unprivileged mounts : a user could
DoS the system by just doing lots of mounts umounts in a private
namespace.

Miklos




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 12:33                                             ` Miklos Szeredi
@ 2005-05-18 16:53                                               ` Miklos Szeredi
  2005-05-18 18:47                                                 ` Ram
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 16:53 UTC (permalink / raw)
  To: miklos; +Cc: dhowells, linuxram, jamie, viro, akpm, linux-kernel,
	linux-fsdevel

> E.g. having a separate task count, which is incremented/decremented
> only by clone/exit.  If the task count goes to zero, umount_tree is
> called on root, but namespace is not freed.
> 
> And each mnt_namespace holds a proper reference to the namespace, so
> it's safe to dereference it anytime.  When truly no more references
> remain, the namespace can go away.
> 
> Hmm?

Here's a patch (compile and boot tested). 

Now there are two solutions to one problem, the previous one is ugly,
this one is buggy.  Which should it be?

Thanks,
Miklos

Index: linux/include/linux/namespace.h
===================================================================
--- linux.orig/include/linux/namespace.h	2005-05-18 18:02:24.000000000 +0200
+++ linux/include/linux/namespace.h	2005-05-18 18:28:48.000000000 +0200
@@ -7,18 +7,26 @@
 
 struct namespace {
 	atomic_t		count;
+	atomic_t		task_count; /* how many tasks use this */
 	struct vfsmount *	root;
 	struct list_head	list;
 	struct rw_semaphore	sem;
 };
 
 extern int copy_namespace(int, struct task_struct *);
-extern void __put_namespace(struct namespace *namespace);
+extern void __exit_namespace(struct namespace *namespace);
 
 static inline void put_namespace(struct namespace *namespace)
 {
-	if (atomic_dec_and_test(&namespace->count))
-		__put_namespace(namespace);
+	if (namespace && atomic_dec_and_test(&namespace->count))
+		kfree(namespace);
+}
+
+static inline struct namespace *get_namespace(struct namespace *namespace)
+{
+	if (namespace)
+		atomic_inc(&namespace->count);
+	return namespace;
 }
 
 static inline void exit_namespace(struct task_struct *p)
@@ -28,14 +36,10 @@ static inline void exit_namespace(struct
 		task_lock(p);
 		p->namespace = NULL;
 		task_unlock(p);
-		put_namespace(namespace);
+		if (atomic_dec_and_test(&p->namespace->task_count))
+			__exit_namespace(p->namespace);
 	}
 }
 
-static inline void get_namespace(struct namespace *namespace)
-{
-	atomic_inc(&namespace->count);
-}
-
 #endif
 #endif
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c	2005-05-18 17:42:46.000000000 +0200
+++ linux/fs/namespace.c	2005-05-18 18:32:29.000000000 +0200
@@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
 		mnt->mnt_root = dget(root);
 		mnt->mnt_mountpoint = mnt->mnt_root;
 		mnt->mnt_parent = mnt;
-		mnt->mnt_namespace = current->namespace;
+		mnt->mnt_namespace = get_namespace(current->namespace);
 
 		/* stick the duplicate mount on the same expiry list
 		 * as the original if that was on one */
@@ -345,13 +345,15 @@ static void umount_tree(struct vfsmount 
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		list_del(&p->mnt_list);
 		list_add(&p->mnt_list, &kill);
-		p->mnt_namespace = NULL;
 	}
 
 	while (!list_empty(&kill)) {
+		struct namespace *n;
 		mnt = list_entry(kill.next, struct vfsmount, mnt_list);
 		list_del_init(&mnt->mnt_list);
 		list_del_init(&mnt->mnt_fslink);
+		n = mnt->mnt_namespace;
+		mnt->mnt_namespace = NULL;
 		if (mnt->mnt_parent == mnt) {
 			spin_unlock(&vfsmount_lock);
 		} else {
@@ -361,6 +363,7 @@ static void umount_tree(struct vfsmount 
 			path_release(&old_nd);
 		}
 		mntput(mnt);
+		put_namespace(n);
 		spin_lock(&vfsmount_lock);
 	}
 }
@@ -866,12 +869,9 @@ void mark_mounts_for_expiry(struct list_
 		mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink);
 		list_del_init(&mnt->mnt_fslink);
 
-		/* don't do anything if the namespace is dead - all the
-		 * vfsmounts from it are going away anyway */
-		namespace = mnt->mnt_namespace;
-		if (!namespace || atomic_read(&namespace->count) <= 0)
+		namespace = get_namespace(mnt->mnt_namespace);
+		if (!namespace)
 			continue;
-		get_namespace(namespace);
 
 		spin_unlock(&vfsmount_lock);
 		down_write(&namespace->sem);
@@ -1073,8 +1073,10 @@ int copy_namespace(int flags, struct tas
 
 	get_namespace(namespace);
 
-	if (!(flags & CLONE_NEWNS))
+	if (!(flags & CLONE_NEWNS)) {
+		atomic_inc(&namespace->task_count);
 		return 0;
+	}
 
 	if (!capable(CAP_SYS_ADMIN)) {
 		put_namespace(namespace);
@@ -1086,6 +1088,7 @@ int copy_namespace(int flags, struct tas
 		goto out;
 
 	atomic_set(&new_ns->count, 1);
+	atomic_set(&new_ns->task_count, 1);
 	init_rwsem(&new_ns->sem);
 	INIT_LIST_HEAD(&new_ns->list);
 
@@ -1109,7 +1112,9 @@ int copy_namespace(int flags, struct tas
 	p = namespace->root;
 	q = new_ns->root;
 	while (p) {
-		q->mnt_namespace = new_ns;
+		struct namespace *oldns = q->mnt_namespace;
+		q->mnt_namespace = get_namespace(new_ns);
+		put_namespace(oldns);
 		if (fs) {
 			if (p == fs->rootmnt) {
 				rootmnt = p;
@@ -1381,16 +1386,17 @@ static void __init init_mount_tree(void)
 	if (!namespace)
 		panic("Can't allocate initial namespace");
 	atomic_set(&namespace->count, 1);
+	atomic_set(&namespace->task_count, 0);
 	INIT_LIST_HEAD(&namespace->list);
 	init_rwsem(&namespace->sem);
 	list_add(&mnt->mnt_list, &namespace->list);
 	namespace->root = mnt;
-	mnt->mnt_namespace = namespace;
+	mnt->mnt_namespace = get_namespace(namespace);
 
 	init_task.namespace = namespace;
 	read_lock(&tasklist_lock);
 	do_each_thread(g, p) {
-		get_namespace(namespace);
+		atomic_inc(&namespace->task_count);
 		p->namespace = namespace;
 	} while_each_thread(g, p);
 	read_unlock(&tasklist_lock);
@@ -1448,12 +1454,13 @@ void __init mnt_init(unsigned long mempa
 	init_mount_tree();
 }
 
-void __put_namespace(struct namespace *namespace)
+void __exit_namespace(struct namespace *namespace)
 {
 	down_write(&namespace->sem);
 	spin_lock(&vfsmount_lock);
 	umount_tree(namespace->root);
+	namespace->root = NULL;
 	spin_unlock(&vfsmount_lock);
 	up_write(&namespace->sem);
-	kfree(namespace);
+	put_namespace(namespace);
 }
Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c	2005-05-17 13:46:56.000000000 +0200
+++ linux/fs/super.c	2005-05-18 18:44:18.000000000 +0200
@@ -37,6 +37,7 @@
 #include <linux/writeback.h>		/* for the emergency remount stuff */
 #include <linux/idr.h>
 #include <linux/kobject.h>
+#include <linux/namespace.h>
 #include <asm/uaccess.h>
 
 
@@ -842,7 +843,7 @@ do_kern_mount(const char *fstype, int fl
 	mnt->mnt_root = dget(sb->s_root);
 	mnt->mnt_mountpoint = sb->s_root;
 	mnt->mnt_parent = mnt;
-	mnt->mnt_namespace = current->namespace;
+	mnt->mnt_namespace = get_namespace(current->namespace);
 	up_write(&sb->s_umount);
 	put_filesystem(type);
 	return mnt;

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 13:21                                                 ` Miklos Szeredi
@ 2005-05-18 17:34                                                   ` Jamie Lokier
  2005-05-18 19:05                                                     ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2005-05-18 17:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel,
	linux-fsdevel

Miklos Szeredi wrote:
> >     - Causes every mount in a mount tree to be detached (independently),
> >       when last task associated with a namespace is destroyed.
> 
> I don't understand.  The tree _has_ to be detached when no task uses
> the namespace.  That is the main purpose of the namespace structure,
> to provide an anchor for the mount tree.

That makes less sense if we allow other tasks to be using a namespace
through a passing a file descriptor, and then the last task which has
current->namespace equal to that namespace exits.  It makes no sense
to me that the mount which is still accessible through the file
descriptor is suddenly detached from it's parent and children mounts.

Why is it not good enough to detach each vfsmnt when the last
reference to each vfsmnt is dropped?  In other words, simply when the
vfsmnt becomes unreachable?

That would detach whole vfsmnt trees when the last reference to the
root of the tree is dropped, rather than when a task exits even though
another tasks has a file descriptor still pointing to the tree.

That would be a change of behaviour, but it seems like quite a
sensible change in behaviour, and would not affect "normal" namespace
users.

There's one other purpose to mnt_namespace:

   - The list in /proc/mounts.

But really it would *improve* security if the list in /proc/mounts was
derived by walking the vfsmnt subtree starting at current->chroot.

> > And this invisible effect:
> > 
> >     - More concurrency than a global mount lock would have.
> 
> This is the key issue I think.  It may even have security implications
> in the future if we want to allow unprivileged mounts : a user could
> DoS the system by just doing lots of mounts umounts in a private
> namespace.

Not an argument.  If that's a problem, they could easily DoS the
system by doing lots of mount/umounts in the initial namespace.

The proper thing to do about concurrency is to make sure the slow
operation (getting the superblock) is done outside any mount
semaphore, and just do the tree traversal/splicing operations inside
it.  The code seems to do that already, so I think a global semaphore
instead of mnt_namespace->sem would make no practical difference.

But if it was a problem, we'd come up with something.  Let's
concentrate on the behaviour we want, rather than a minor detail like
that.

I think the common sense behaviour is to say that vfsmnts remain
mounted as long as they are reachable - from a file descriptor, task's
root, or task's cwd.

-- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 16:53                                               ` Miklos Szeredi
@ 2005-05-18 18:47                                                 ` Ram
  2005-05-18 19:19                                                   ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Ram @ 2005-05-18 18:47 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel

On Wed, 2005-05-18 at 09:53, Miklos Szeredi wrote:
> > E.g. having a separate task count, which is incremented/decremented
> > only by clone/exit.  If the task count goes to zero, umount_tree is
> > called on root, but namespace is not freed.
> > 
> > And each mnt_namespace holds a proper reference to the namespace, so
> > it's safe to dereference it anytime.  When truly no more references
> > remain, the namespace can go away.
> > 
> > Hmm?

First of all the reason this race exists implies Al Viro may have had
assertion in his mind that "All tasks that have access to a namespace
has a refcount on that namespace".  If that was what he was thinking,
than the I would stick with that assertion being true. The overall idea
I am thinking off is:

1) have the automounter hold a refcount on any new namespace created
    the mounts in which the automounter has interest in.
2) have a refcount on the namespace when a new task gains access to
   a namespace through the file descriptor or any other 
   similar mechanisms and remove the reference
   once the fd gets closed. (bit tricky to implement)

Do you agree with the overall idea? 
If so I will try to come up with a patch. 

RP

> 
> Here's a patch (compile and boot tested). 
> 
> Now there are two solutions to one problem, the previous one is ugly,
> this one is buggy.  Which should it be?
> 
> Thanks,
> Miklos
> 
> Index: linux/include/linux/namespace.h
> ===================================================================
> --- linux.orig/include/linux/namespace.h	2005-05-18 18:02:24.000000000 +0200
> +++ linux/include/linux/namespace.h	2005-05-18 18:28:48.000000000 +0200
> @@ -7,18 +7,26 @@
>  
>  struct namespace {
>  	atomic_t		count;
> +	atomic_t		task_count; /* how many tasks use this */
>  	struct vfsmount *	root;
>  	struct list_head	list;
>  	struct rw_semaphore	sem;
>  };
>  
>  extern int copy_namespace(int, struct task_struct *);
> -extern void __put_namespace(struct namespace *namespace);
> +extern void __exit_namespace(struct namespace *namespace);
>  
>  static inline void put_namespace(struct namespace *namespace)
>  {
> -	if (atomic_dec_and_test(&namespace->count))
> -		__put_namespace(namespace);
> +	if (namespace && atomic_dec_and_test(&namespace->count))
> +		kfree(namespace);
> +}
> +
> +static inline struct namespace *get_namespace(struct namespace *namespace)
> +{
> +	if (namespace)
> +		atomic_inc(&namespace->count);
> +	return namespace;
>  }
>  
>  static inline void exit_namespace(struct task_struct *p)
> @@ -28,14 +36,10 @@ static inline void exit_namespace(struct
>  		task_lock(p);
>  		p->namespace = NULL;
>  		task_unlock(p);
> -		put_namespace(namespace);
> +		if (atomic_dec_and_test(&p->namespace->task_count))
> +			__exit_namespace(p->namespace);
>  	}
>  }
>  
> -static inline void get_namespace(struct namespace *namespace)
> -{
> -	atomic_inc(&namespace->count);
> -}
> -
>  #endif
>  #endif
> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c	2005-05-18 17:42:46.000000000 +0200
> +++ linux/fs/namespace.c	2005-05-18 18:32:29.000000000 +0200
> @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
>  		mnt->mnt_root = dget(root);
>  		mnt->mnt_mountpoint = mnt->mnt_root;
>  		mnt->mnt_parent = mnt;
> -		mnt->mnt_namespace = current->namespace;
> +		mnt->mnt_namespace = get_namespace(current->namespace);
>  
>  		/* stick the duplicate mount on the same expiry list
>  		 * as the original if that was on one */
> @@ -345,13 +345,15 @@ static void umount_tree(struct vfsmount 
>  	for (p = mnt; p; p = next_mnt(p, mnt)) {
>  		list_del(&p->mnt_list);
>  		list_add(&p->mnt_list, &kill);
> -		p->mnt_namespace = NULL;
>  	}
>  
>  	while (!list_empty(&kill)) {
> +		struct namespace *n;
>  		mnt = list_entry(kill.next, struct vfsmount, mnt_list);
>  		list_del_init(&mnt->mnt_list);
>  		list_del_init(&mnt->mnt_fslink);
> +		n = mnt->mnt_namespace;
> +		mnt->mnt_namespace = NULL;
>  		if (mnt->mnt_parent == mnt) {
>  			spin_unlock(&vfsmount_lock);
>  		} else {
> @@ -361,6 +363,7 @@ static void umount_tree(struct vfsmount 
>  			path_release(&old_nd);
>  		}
>  		mntput(mnt);
> +		put_namespace(n);
>  		spin_lock(&vfsmount_lock);
>  	}
>  }
> @@ -866,12 +869,9 @@ void mark_mounts_for_expiry(struct list_
>  		mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink);
>  		list_del_init(&mnt->mnt_fslink);
>  
> -		/* don't do anything if the namespace is dead - all the
> -		 * vfsmounts from it are going away anyway */
> -		namespace = mnt->mnt_namespace;
> -		if (!namespace || atomic_read(&namespace->count) <= 0)
> +		namespace = get_namespace(mnt->mnt_namespace);
> +		if (!namespace)
>  			continue;
> -		get_namespace(namespace);
>  
>  		spin_unlock(&vfsmount_lock);
>  		down_write(&namespace->sem);
> @@ -1073,8 +1073,10 @@ int copy_namespace(int flags, struct tas
>  
>  	get_namespace(namespace);
>  
> -	if (!(flags & CLONE_NEWNS))
> +	if (!(flags & CLONE_NEWNS)) {
> +		atomic_inc(&namespace->task_count);
>  		return 0;
> +	}
>  
>  	if (!capable(CAP_SYS_ADMIN)) {
>  		put_namespace(namespace);
> @@ -1086,6 +1088,7 @@ int copy_namespace(int flags, struct tas
>  		goto out;
>  
>  	atomic_set(&new_ns->count, 1);
> +	atomic_set(&new_ns->task_count, 1);
>  	init_rwsem(&new_ns->sem);
>  	INIT_LIST_HEAD(&new_ns->list);
>  
> @@ -1109,7 +1112,9 @@ int copy_namespace(int flags, struct tas
>  	p = namespace->root;
>  	q = new_ns->root;
>  	while (p) {
> -		q->mnt_namespace = new_ns;
> +		struct namespace *oldns = q->mnt_namespace;
> +		q->mnt_namespace = get_namespace(new_ns);
> +		put_namespace(oldns);
>  		if (fs) {
>  			if (p == fs->rootmnt) {
>  				rootmnt = p;
> @@ -1381,16 +1386,17 @@ static void __init init_mount_tree(void)
>  	if (!namespace)
>  		panic("Can't allocate initial namespace");
>  	atomic_set(&namespace->count, 1);
> +	atomic_set(&namespace->task_count, 0);
>  	INIT_LIST_HEAD(&namespace->list);
>  	init_rwsem(&namespace->sem);
>  	list_add(&mnt->mnt_list, &namespace->list);
>  	namespace->root = mnt;
> -	mnt->mnt_namespace = namespace;
> +	mnt->mnt_namespace = get_namespace(namespace);
>  
>  	init_task.namespace = namespace;
>  	read_lock(&tasklist_lock);
>  	do_each_thread(g, p) {
> -		get_namespace(namespace);
> +		atomic_inc(&namespace->task_count);
>  		p->namespace = namespace;
>  	} while_each_thread(g, p);
>  	read_unlock(&tasklist_lock);
> @@ -1448,12 +1454,13 @@ void __init mnt_init(unsigned long mempa
>  	init_mount_tree();
>  }
>  
> -void __put_namespace(struct namespace *namespace)
> +void __exit_namespace(struct namespace *namespace)
>  {
>  	down_write(&namespace->sem);
>  	spin_lock(&vfsmount_lock);
>  	umount_tree(namespace->root);
> +	namespace->root = NULL;
>  	spin_unlock(&vfsmount_lock);
>  	up_write(&namespace->sem);
> -	kfree(namespace);
> +	put_namespace(namespace);
>  }
> Index: linux/fs/super.c
> ===================================================================
> --- linux.orig/fs/super.c	2005-05-17 13:46:56.000000000 +0200
> +++ linux/fs/super.c	2005-05-18 18:44:18.000000000 +0200
> @@ -37,6 +37,7 @@
>  #include <linux/writeback.h>		/* for the emergency remount stuff */
>  #include <linux/idr.h>
>  #include <linux/kobject.h>
> +#include <linux/namespace.h>
>  #include <asm/uaccess.h>
>  
> 
> @@ -842,7 +843,7 @@ do_kern_mount(const char *fstype, int fl
>  	mnt->mnt_root = dget(sb->s_root);
>  	mnt->mnt_mountpoint = sb->s_root;
>  	mnt->mnt_parent = mnt;
> -	mnt->mnt_namespace = current->namespace;
> +	mnt->mnt_namespace = get_namespace(current->namespace);
>  	up_write(&sb->s_umount);
>  	put_filesystem(type);
>  	return mnt;


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 17:34                                                   ` Jamie Lokier
@ 2005-05-18 19:05                                                     ` Miklos Szeredi
  2005-05-18 19:52                                                       ` Jamie Lokier
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 19:05 UTC (permalink / raw)
  To: jamie
  Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel,
	linux-fsdevel

> That makes less sense if we allow other tasks to be using a namespace
> through a passing a file descriptor, and then the last task which has
> current->namespace equal to that namespace exits.  It makes no sense
> to me that the mount which is still accessible through the file
> descriptor is suddenly detached from it's parent and children mounts.

I see your point.  I don't yet see a solution.

Currently detach is an explicit action, not something automatic which
happens when there are no more references to a vfsmount.

> Why is it not good enough to detach each vfsmnt when the last
> reference to each vfsmnt is dropped?  In other words, simply when the
> vfsmnt becomes unreachable?

Define unreachable.  Then define a mechanism, by which it can be
detected.

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 18:47                                                 ` Ram
@ 2005-05-18 19:19                                                   ` Miklos Szeredi
  2005-05-18 20:35                                                     ` Ram
  0 siblings, 1 reply; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-18 19:19 UTC (permalink / raw)
  To: linuxram; +Cc: dhowells, jamie, viro, akpm, linux-kernel, linux-fsdevel

> First of all the reason this race exists implies Al Viro may have had
> assertion in his mind that "All tasks that have access to a namespace
> has a refcount on that namespace".  If that was what he was thinking,
> than the I would stick with that assertion being true. The overall idea
> I am thinking off is:
> 
> 1) have the automounter hold a refcount on any new namespace created
>     the mounts in which the automounter has interest in.
> 2) have a refcount on the namespace when a new task gains access to
>    a namespace through the file descriptor or any other 
>    similar mechanisms and remove the reference
>    once the fd gets closed. (bit tricky to implement)
> 
> Do you agree with the overall idea? 

I don't really understand it.

A reference count usually means, the number of references (pointers)
to an object.  You can sometimes get away with schemes that deviate
from this in various ways, but it's usually asking for trouble.

The usage in mark_mounts_for_expiry() deviated from it so much, that
the result was a subtle race.

Doing some tricky thing like what you propose will just likely
introduce more subtle problems.

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 19:05                                                     ` Miklos Szeredi
@ 2005-05-18 19:52                                                       ` Jamie Lokier
  2005-05-19 12:41                                                         ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Jamie Lokier @ 2005-05-18 19:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel,
	linux-fsdevel

Miklos Szeredi wrote:
> > That makes less sense if we allow other tasks to be using a namespace
> > through a passing a file descriptor, and then the last task which has
> > current->namespace equal to that namespace exits.  It makes no sense
> > to me that the mount which is still accessible through the file
> > descriptor is suddenly detached from it's parent and children mounts.
> 
> I see your point.  I don't yet see a solution.
> 
> Currently detach is an explicit action, not something automatic which
> happens when there are no more references to a vfsmount.

It's implicit, when the last task calls put_namespace:

	void __put_namespace(struct namespace *namespace)
	{
[...]
		umount_tree(namespace->root);
		-> calls detach_mnt for each vfsmnt in namespace.
[...]
	}

> > Why is it not good enough to detach each vfsmnt when the last
> > reference to each vfsmnt is dropped?  In other words, simply when the
> > vfsmnt becomes unreachable?
> 
> Define unreachable.

Unreachable as in no file descriptors (or chroot/cwd) refer to the
vfsmnt, either directly or indirectly through a path traversal.

> Then define a mechanism, by which it can be detected.

There aren't any vfsmnt->vfsmnt cycles...  They're a forest, vfsmnts
don't move from one tree to another (bind mounts don't link them, they
create new vfsmnts), and each tree can be referenced by a file
descriptor at any point on the tree.

It rather hinges on which of these behaviours you prefer:

   1. A file descriptor/chroot/cwd reference to any point in a vfsmnt
      tree means the whole tree is retained.  This means ".." remains
      always accessible: fchdir(fd); open("..") continues to access
      that whole tree as you still have fd.

   2. A file descriptor/chroot/cwd reference to any point in a vfsmnt
      tree means the subtree from that point is retained, and parents
      may disappear if there are no references (not counting ".." as a
      reference).  This behaviour is more sensible for chroots, where
      the parents should be inaccessible anyway.

   3. A mixture, where current->root references only maintain the
      subtree rooted at that point, and other references, if outside
      the current->root subtree, retain the whole tree accessible from
      those references.

The appropriate data structure / algorithm depends on which behaviour
is preferred.  So which is it?  1 Is best done with a mnt_namespace
structure, but references to it counted when vfsmnts are referenced by
file descriptors/root/cwd, _not_ references by tasks (no
current->namespace).  2 is best done by simply reference counting
vfsmnts.

-- Jamie

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 19:19                                                   ` Miklos Szeredi
@ 2005-05-18 20:35                                                     ` Ram
  2005-05-19 12:52                                                       ` Miklos Szeredi
  0 siblings, 1 reply; 46+ messages in thread
From: Ram @ 2005-05-18 20:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, jamie, viro, Andrew Morton, linux-kernel, linux-fsdevel

On Wed, 2005-05-18 at 12:19, Miklos Szeredi wrote:
> > First of all the reason this race exists implies Al Viro may have had
> > assertion in his mind that "All tasks that have access to a namespace
> > has a refcount on that namespace".  If that was what he was thinking,
> > than the I would stick with that assertion being true. The overall idea
> > I am thinking off is:
> > 
> > 1) have the automounter hold a refcount on any new namespace created
> >     the mounts in which the automounter has interest in.
> > 2) have a refcount on the namespace when a new task gains access to
> >    a namespace through the file descriptor or any other 
> >    similar mechanisms and remove the reference
> >    once the fd gets closed. (bit tricky to implement)
> > 
> > Do you agree with the overall idea? 
> 
> I don't really understand it.
> 
> A reference count usually means, the number of references (pointers)
> to an object.  You can sometimes get away with schemes that deviate
> from this in various ways, but it's usually asking for trouble.

Ok. One more attempt to be clear.

All the places where get_namespace() is called currently except
in mark_mounts_for_expiry() are safe because they are called in
places where it is guaranteed that they will not race with 
__put_namespace(). 

For example in clone_namespace(), get_namespace() will not race
because the task that called the clone has a refcount on the
namespace and since that task is currently  in the kernel, there is
no chance for  the task  to go away  decrementing the refcount 
on that namespace. 

But the case where the call to get_namespace() is buggy in 
mark_mounts_for_expiry() is because:
it is called in a timer context, and the last process referring
the namespace may just disapper right than.  
 So what I am proposing is:
in automouter, while the automount takes place in 
afs_mntpt_follow_link() increment the refcount of the namespace,
by calling get_namespace(). This call will not race with __put_namespace
because the process that is trying to access the
mountpoint already has a refcount on the namespace and it won't be 
able to decrement the refcount currently. agree?

Now later when the automounter tries to unmount the mount 
call put_namespace() after unmounting. I mean do it in
mark_mounts_for_expiry(). Also delete the call to get_namespace()) 

So the race will not happen at all.

Makes sense? 

That was the easiest part, but the difficult part is how to call
get_namespace() on a forign namespace in do_loopback()?

thinking about it,
RP



> 
> The usage in mark_mounts_for_expiry() deviated from it so much, that
> the result was a subtle race.
> 
> Doing some tricky thing like what you propose will just likely
> introduce more subtle problems.
> 
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 19:52                                                       ` Jamie Lokier
@ 2005-05-19 12:41                                                         ` Miklos Szeredi
  0 siblings, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-19 12:41 UTC (permalink / raw)
  To: jamie
  Cc: trond.myklebust, dhowells, linuxram, viro, akpm, linux-kernel,
	linux-fsdevel

> > Define unreachable.
> 
> Unreachable as in no file descriptors (or chroot/cwd) refer to the
> vfsmnt, either directly or indirectly through a path traversal.
> 
> > Then define a mechanism, by which it can be detected.
> 
> There aren't any vfsmnt->vfsmnt cycles...  They're a forest, vfsmnts
> don't move from one tree to another (bind mounts don't link them, they
> create new vfsmnts), and each tree can be referenced by a file
> descriptor at any point on the tree.
> 
> It rather hinges on which of these behaviours you prefer:
> 
>    1. A file descriptor/chroot/cwd reference to any point in a vfsmnt
>       tree means the whole tree is retained.  This means ".." remains
>       always accessible: fchdir(fd); open("..") continues to access
>       that whole tree as you still have fd.
> 
>    2. A file descriptor/chroot/cwd reference to any point in a vfsmnt
>       tree means the subtree from that point is retained, and parents
>       may disappear if there are no references (not counting ".." as a
>       reference).  This behaviour is more sensible for chroots, where
>       the parents should be inaccessible anyway.
> 
>    3. A mixture, where current->root references only maintain the
>       subtree rooted at that point, and other references, if outside
>       the current->root subtree, retain the whole tree accessible from
>       those references.

4. Everything stays the same, except chroot() changes
current->namespace to current->fs->root_mnt->namespace.  This would
address your concern with chroot.

> The appropriate data structure / algorithm depends on which
> behaviour is preferred.  So which is it?

My preference is 1) or 4).

2) and 3) would have some pretty strange properties (in some
situations you can enter a directory, but you cannot get out with 'cd
..' any more)

>  1 Is best done with a mnt_namespace structure, but references to it
> counted when vfsmnts are referenced by file descriptors/root/cwd,
> _not_ references by tasks (no current->namespace).

So you are proposing selective replacement of 

  mntget(mnt);

with

  mntget(mnt);
  get_namespace(mnt->mnt_namespace);

and similar for mntput()?

What do you do on unmount?  If you set mnt->mnt_namespace to NULL,
you'd have to do a put_namespace() for each file descriptor/root/cwd
referencing the mount, but how do you do that?

> 2 is best done by simply reference counting vfsmnts.

It would need quite a revamp of the current reference counting.  If
I'm not mistaken, currently each vfsmount has an initial reference
given to it in alloc_vfsmnt.  And on attach, the parent namespace's
count is also increased, and vice-versa for unmount.

I'm not sure what purpose referencing the parent serves, but it won't
work with 2).  There you'd need referencing the other way round:
parents holding a reference on their children, and have no initial
reference.

With 2) it's also questionable, how you copy the namespace in clone().
Do you just copy the tree under current->fs->root_mnt?  Or do you copy
tree under cwd as well (if it's disjunct from the root tree)?

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH] fix race in mark_mounts_for_expiry()
  2005-05-18 20:35                                                     ` Ram
@ 2005-05-19 12:52                                                       ` Miklos Szeredi
  0 siblings, 0 replies; 46+ messages in thread
From: Miklos Szeredi @ 2005-05-19 12:52 UTC (permalink / raw)
  To: linuxram; +Cc: miklos, dhowells, jamie, viro, akpm, linux-kernel, linux-fsdevel

> Ok. One more attempt to be clear.
> 
> All the places where get_namespace() is called currently except
> in mark_mounts_for_expiry() are safe because they are called in
> places where it is guaranteed that they will not race with 
> __put_namespace(). 
> 
> For example in clone_namespace(), get_namespace() will not race
> because the task that called the clone has a refcount on the
> namespace and since that task is currently  in the kernel, there is
> no chance for  the task  to go away  decrementing the refcount 
> on that namespace. 
> 
> But the case where the call to get_namespace() is buggy in 
> mark_mounts_for_expiry() is because:
> it is called in a timer context, and the last process referring
> the namespace may just disapper right than.  
>  So what I am proposing is:
> in automouter, while the automount takes place in 
> afs_mntpt_follow_link() increment the refcount of the namespace,
> by calling get_namespace(). This call will not race with __put_namespace
> because the process that is trying to access the
> mountpoint already has a refcount on the namespace and it won't be 
> able to decrement the refcount currently. agree?
> 
> Now later when the automounter tries to unmount the mount 
> call put_namespace() after unmounting. I mean do it in
> mark_mounts_for_expiry(). Also delete the call to get_namespace()) 
> 
> So the race will not happen at all.
> 
> Makes sense? 

Well I imagine it could work, but again it's just a special case for
the automounter stuff.

It still won't solve the recursive mount problem that this discussion
(and your discovery of the race) originated from.

My second patch solves the problem generally (though I'm sure that
it's full of bugs yet), by keeping a reference to the namespace from
each vfsmount in that namespace.  That way, until the vfsmount remains
in the namespace (i.e. until it's unmounted) mnt_namespace can be
safely used for whatever reason (recursive bind, atomount, etc).

So why does it make sense to solve this problem only for the
automounter?

Miklos

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2005-05-19 12:54 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-05-13 10:44 [PATCH] namespace.c: fix bind mount from foreign namespace Miklos Szeredi
2005-05-13 16:49 ` Ram
2005-05-13 17:06 ` Al Viro
2005-05-13 17:17   ` Miklos Szeredi
2005-05-13 17:25     ` Al Viro
2005-05-13 17:34       ` Miklos Szeredi
2005-05-13 17:29     ` Ram
2005-05-13 18:40       ` Miklos Szeredi
     [not found]         ` <1116012287.6248.410.camel@localhost>
     [not found]           ` <E1DWfqJ-0004eP-00@dorka.pomaz.szeredi.hu>
     [not found]             ` <1116013840.6248.429.camel@localhost>
2005-05-14  6:11               ` Miklos Szeredi
2005-05-16 15:11                 ` Ram
2005-05-16  8:44                   ` Miklos Szeredi
2005-05-16  8:59                     ` Miklos Szeredi
2005-05-16 11:26                     ` Jamie Lokier
2005-05-16 13:23                       ` Miklos Szeredi
2005-05-16 11:14                   ` Jamie Lokier
2005-05-17  3:50                     ` Ram
2005-05-16 20:15                       ` Miklos Szeredi
2005-05-17  1:28                         ` Jamie Lokier
2005-05-17  5:34                           ` Miklos Szeredi
     [not found]                             ` <1116360352.24560.85.camel@localhost>
     [not found]                               ` <E1DYI0m-0000K5-00@dorka.pomaz.szeredi.hu>
     [not found]                                 ` <1116399887.24560.116.camel@localhost>
     [not found]                                   ` <1116400118.24560.119.camel@localhost>
2005-05-18  9:51                                     ` [PATCH] fix race in mark_mounts_for_expiry() Miklos Szeredi
2005-05-18 10:32                                     ` David Howells
2005-05-18 10:37                                       ` Miklos Szeredi
2005-05-18 10:46                                       ` David Howells
2005-05-18 10:53                                         ` Miklos Szeredi
2005-05-18 11:07                                           ` Trond Myklebust
2005-05-18 11:32                                             ` Miklos Szeredi
2005-05-18 12:50                                               ` Jamie Lokier
2005-05-18 13:21                                                 ` Miklos Szeredi
2005-05-18 17:34                                                   ` Jamie Lokier
2005-05-18 19:05                                                     ` Miklos Szeredi
2005-05-18 19:52                                                       ` Jamie Lokier
2005-05-19 12:41                                                         ` Miklos Szeredi
2005-05-18 10:59                                         ` David Howells
2005-05-18 11:14                                           ` Miklos Szeredi
2005-05-18 11:51                                         ` David Howells
2005-05-18 12:08                                           ` Miklos Szeredi
2005-05-18 12:33                                             ` Miklos Szeredi
2005-05-18 16:53                                               ` Miklos Szeredi
2005-05-18 18:47                                                 ` Ram
2005-05-18 19:19                                                   ` Miklos Szeredi
2005-05-18 20:35                                                     ` Ram
2005-05-19 12:52                                                       ` Miklos Szeredi
2005-05-17 18:48                         ` [PATCH] namespace.c: fix bind mount from foreign namespace Ram
2005-05-17  0:00                       ` Jamie Lokier
  -- strict thread matches above, loose matches on Subject: below --
2005-05-16 19:51 Miklos Szeredi
2005-05-17  1:23 ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).