[RFC] User CLONE_NEWNS permission and rlimits

linux-fsdevel.vger.kernel.org archive mirror

* [RFC] User CLONE_NEWNS permission and rlimits
@ 2005-04-20  1:24 Eric Van Hensbergen
  2005-04-20  1:50 ` Ram
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Van Hensbergen @ 2005-04-20  1:24 UTC (permalink / raw)
  To: linux-fsdevel, Al Viro

This is again related to the FUSE permission thread, but a slightly
different idea and without a slimy hack patch.

I really want to enable users to create private namespaces, but I
want to avoid creating a vulnerability by allowing them to abuse
system resources.  It looks like this can be done by adding
RLIMIT_NEWNS as a per-user resource limit and tracking the number of
private namespaces a user has in the user_struct.  Any time a user
creates a private namespace (via clone with CLONE_NEWNS or any other
method), this limit is checked and the per-user count is incremented
(in copy_namespace).  When namespaces are cleaned up (in
__put_namespace), the per-user count is decremented.
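
A rough user-space model of the accounting described above might look
like the following.  This is only a sketch: struct user_acct, its
fields, the helper names and the -EAGAIN return value are all
assumptions made for illustration, not existing kernel code, and a real
implementation would need atomic counters.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-user accounting that would live in
 * user_struct; the field names are assumptions for illustration. */
struct user_acct {
    unsigned long ns_count;  /* private namespaces currently charged */
    unsigned long ns_limit;  /* stand-in for the proposed RLIMIT_NEWNS */
};

/* Models the check proposed for copy_namespace(): refuse once the
 * limit is reached, otherwise charge one namespace to the user. */
int charge_namespace(struct user_acct *u)
{
    assert(u != NULL);
    if (u->ns_count >= u->ns_limit)
        return -EAGAIN;  /* assumed error code */
    u->ns_count++;
    return 0;
}

/* Models the release proposed for __put_namespace(). */
void uncharge_namespace(struct user_acct *u)
{
    assert(u != NULL);
    if (u->ns_count > 0)
        u->ns_count--;
}
```

With a limit of 2, the third charge_namespace() call fails until one of
the earlier namespaces has been released.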

Is this sufficient to cover any exposure?  What's the correct solution
for the shared sub-trees RFC?  Should there be something similar for
user mounts/binds?

         -eric


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  1:24 [RFC] User CLONE_NEWNS permission and rlimits Eric Van Hensbergen
@ 2005-04-20  1:50 ` Ram
  2005-04-20  3:02   ` Ritesh Kumar
  2005-04-20 12:47   ` Eric Van Hensbergen
  0 siblings, 2 replies; 10+ messages in thread
From: Ram @ 2005-04-20  1:50 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, Al Viro

On Tue, 2005-04-19 at 18:24, Eric Van Hensbergen wrote:
> This is again related to the FUSE permission thread, but a slightly
> different idea and without a slimy hack patch.
> 
> I really want to enable users to create private namespaces, but I
> want to avoid creating a vulnerability by allowing them to abuse
> system resources.  It looks like this can be done by adding
> RLIMIT_NEWNS as a per-user resource limit and tracking the number of
> private namespaces a user has in the user_struct.  Any time a user
> creates a private namespace (via clone with CLONE_NEWNS or any other
> method), this limit is checked and the per-user count is incremented
> (in copy_namespace).  When namespaces are cleaned up (in
> __put_namespace), the per-user count is decremented.
> 
> Is this sufficient to cover any exposure?  What's the correct solution
> for the shared sub-trees RFC?  Should there be something similar for
> user mounts/binds?

In a shared-subtree world, a new namespace can create as many mounts
or binds as there are private namespaces, depending on the number of
binds and mounts in the shared tree.

For example, if there were 10 shared vfsmounts in the original
namespace, a new private namespace will duplicate those 10, and any
mount or bind attempted in any of these vfsmounts will double the
number of mounts and binds.

Hence you probably want to keep tabs on the number of mounts and
binds a user does, instead of keeping tabs on the number of namespaces
a user creates.
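
The arithmetic behind that example can be sketched as below.  This is a
deliberately simplified model, assuming every cloned namespace keeps a
peer copy of every shared vfsmount and that each mount under a shared
vfsmount propagates to all peers; the function names are made up for
illustration.

```c
#include <assert.h>
#include <limits.h>

/* Total vfsmounts once `clones` private namespaces have been cloned
 * from an original namespace holding `shared` shared vfsmounts:
 * every clone duplicates all of them. */
unsigned long mounts_after_clone(unsigned long shared, unsigned long clones)
{
    assert(clones < ULONG_MAX);  /* clones + 1 must not wrap around */
    return shared * (clones + 1);
}

/* Number of vfsmounts created by ONE mount or bind performed under a
 * shared vfsmount: one in each namespace that holds a peer. */
unsigned long mounts_per_propagated_op(unsigned long clones)
{
    assert(clones < ULONG_MAX);
    return clones + 1;
}
```

With 10 shared vfsmounts and one private clone this gives 20 vfsmounts
in total, and each further mount under a shared vfsmount shows up
twice, which is why a namespace count alone says little about the
resources consumed.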

RP

> 
>          -eric



* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  1:50 ` Ram
@ 2005-04-20  3:02   ` Ritesh Kumar
  2005-04-20  3:20     ` Al Viro
  2005-04-20 18:03     ` Bryan Henderson
  2005-04-20 12:47   ` Eric Van Hensbergen
  1 sibling, 2 replies; 10+ messages in thread
From: Ritesh Kumar @ 2005-04-20  3:02 UTC (permalink / raw)
  To: Ram; +Cc: linux-fsdevel

I am new to the list so please bear with me :-)

I have also been thinking about filesystem namespaces which are
completely under the user's own control. I was also thinking of them
being inherited and changed along the process hierarchy. So a given
process is allowed to change its namespace any way it likes and map it
to its parent's namespace.
More importantly, I was thinking in terms of having this entire
capability in userspace itself. Instead of giving all the details
right here, let me redirect you to the page where I have set up the
prototype. You should be able to download the sample code (very small)
and browse through it to get an idea of what I had in mind. I also
have an article which explains what I was thinking. In essence, I was
thinking of splitting up the concepts of 1) accessing the filesystem on
the HDD/device and 2) setting up a namespace for accessing the files
into two separate concepts and bringing up 2) completely in
userspace.
What do you think? I would like to have feedback on the idea.

http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html

Ritesh

On 4/19/05, Ram <linuxram@us.ibm.com> wrote:
> On Tue, 2005-04-19 at 18:24, Eric Van Hensbergen wrote:
> > This is again related to the FUSE permission thread, but a slightly
> > different idea and without a slimy hack patch.
> >
> > I really want to enable users to create private namespaces, but I
> > want to avoid creating a vulnerability by allowing them to abuse
> > system resources.  It looks like this can be done by adding
> > RLIMIT_NEWNS as a per-user resource limit and tracking the number of
> > private namespaces a user has in the user_struct.  Any time a user
> > creates a private namespace (via clone with CLONE_NEWNS or any other
> > method), this limit is checked and the per-user count is incremented
> > (in copy_namespace).  When namespaces are cleaned up (in
> > __put_namespace), the per-user count is decremented.
> >
> > Is this sufficient to cover any exposure?  What's the correct solution
> > for the shared sub-trees RFC?  Should there be something similar for
> > user mounts/binds?
> 
> A new namespace in a shared subtree realm can create number-of-
> private-namespaces number of mounts or binds depending on the number of
> binds and mounts in the shared tree.
> 
> for example if  there were 10 shared vfsmounts in the original
> namespace, a new private namespace will duplicate 10 of these, and
> any mount or bind attempted in any of these vfsmounts will double the
> number of mounts and binds.
> 
> Hence probably you may want to keep a tab on the number mounts and
> binds a user does, instead of keeping a tab on the number of namespaces
> a user creates.
> 
> RP
> 
> >
> >          -eric


-- 
Rationality is the fundamental limitation to all human thought.


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  3:02   ` Ritesh Kumar
@ 2005-04-20  3:20     ` Al Viro
  2005-04-20  3:38       ` Ritesh Kumar
  2005-04-20 18:03     ` Bryan Henderson
  1 sibling, 1 reply; 10+ messages in thread
From: Al Viro @ 2005-04-20  3:20 UTC (permalink / raw)
  To: Ritesh Kumar; +Cc: Ram, linux-fsdevel

On Tue, Apr 19, 2005 at 11:02:53PM -0400, Ritesh Kumar wrote:
> I am new to the list so please bear with me :-)
> 
> I have also been thinking about filesystem namespaces which are
> completely under the user's own control.

	How do you deal with su(1) finding /etc/shadow in your namespace
and seeing an entry for root there - with no password?

> I was also thinking of them
> being inherited and changed along the process hierarchy.

	We have that already...

> So a given
> process is allowed to change its namespace any way it likes and map it
> to its parent's namespace.

	See above.

> More importantly, I was thinking in terms of having this entire
> capability in the userspace itself. Instead of giving all the details
> right here... let me redirect you to the page where I have set up the
> prototype. You should be able to download the sample code (very small)
> and browse through it to get an idea of what I had in mind. I also
> have an article which explains what I was thinking. In essence, I was
> thinking of splitting up the concepts of 1) accessing the filesystem on
> the HDD/device and 2) setting up a namespace for accessing the files
> into two separate concepts and bringing up 2) completely in the
> userspace.
> What do you think? I would like to have feedback on the idea.

	That your library will leave any suid program seeing hell knows
what.  Which gets very unpleasant when you are using it to do something
with your files...

	That's besides the issues with races when two tasks that share
namespace attempt to change it.

> http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  3:20     ` Al Viro
@ 2005-04-20  3:38       ` Ritesh Kumar
  2005-04-20  4:01         ` Al Viro
  0 siblings, 1 reply; 10+ messages in thread
From: Ritesh Kumar @ 2005-04-20  3:38 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel

You are right. A more privileged process running as a child of
another process should not be allowed to look at the same namespace as
its parent. However, there is at least one other example where
something like this exists and there is a countermeasure for it. We
can learn from that countermeasure.

Consider the LD_PRELOAD env variable. The dynamic linker ignores this
variable (well, there are suitable exceptions clearly defined in the
ld.so manpage) the moment it sees that the child is a setuid (e.g.
root-owned) process.
Thus, even though the parent had changed the effective behavior of
dynamic linking, the child doesn't suffer from the same.
In my prototype, currently, I am okay because the basis of the
functionality is an interposing library which is ignored if somebody
does an 'su'.
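
To make the LD_PRELOAD comparison concrete, here is a minimal sketch of
the kind of test involved.  It is a simplification: the real dynamic
linker decides this through its secure-execution mode, which covers
more than a real/effective ID mismatch, and the function names below
are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* True when real and effective IDs differ, which is what happens when
 * a set-user-ID or set-group-ID program such as su(1) is started. */
bool ids_indicate_secure_exec(uid_t ruid, uid_t euid, gid_t rgid, gid_t egid)
{
    return ruid != euid || rgid != egid;
}

/* Same test applied to the calling process.  An interposing library
 * that redefines the namespace would be skipped when this is true. */
bool preload_should_be_ignored(void)
{
    return ids_indicate_secure_exec(getuid(), geteuid(),
                                    getgid(), getegid());
}
```

An ordinary user (uid 1000) running a root-owned suid binary gets
euid 0, so the test is true and the interposing library is not loaded.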

Also, the access control for the filesystem is still in the kernel.
What we change in the userspace is just the namespace and nothing
else. If you are fundamentally denied access to a file (from the
kernel) then you cannot access it no matter how you access it using
userspace libraries.

Plus, it is not very clear to me what you mean by 'tasks'. If that
is processes, then the child will inherit a separate copy of the
namespace from the parent (copy-on-write of the data structs of the
user library probably... I'll have to think over this). So no race
conditions here. For multiple threads we will have to use mutual
exclusion on the 'userspace vfs' to keep race conditions out,
similar to many other things (like malloc et al).

Ritesh 

On 4/19/05, Al Viro <viro@parcelfarce.linux.theplanet.co.uk> wrote:
> On Tue, Apr 19, 2005 at 11:02:53PM -0400, Ritesh Kumar wrote:
> > I am new to the list so please bear with me :-)
> >
> > I have also been thinking about filesystem namespaces which are
> > completely under the user's own control.
> 
>         How do you deal with su(1) finding /etc/shadow in your namespace
> and seeing an entry for root there - with no password?
> 
> > I was also thinking of them
> > being inherited and changed along the process hierarchy.
> 
>         We have that already...
> 
> > So a given
> > process is allowed to change its namespace any way it likes and map it
> > to its parent's namespace.
> 
>         See above.
> 
> > More importantly, I was thinking in terms of having this entire
> > capability in the userspace itself. Instead of giving all the details
> > right here... let me redirect you to the page where I have set up the
> > prototype. You should be able to download the sample code (very small)
> > and browse through it to get an idea of what I had in mind. I also
> > have an article which explains what I was thinking. In essence, I was
> > thinking of splitting up the concepts of 1) accessing the filesystem on
> > the HDD/device and 2) setting up a namespace for accessing the files
> > into two separate concepts and bringing up 2) completely in the
> > userspace.
> > What do you think? I would like to have feedback on the idea.
> 
>         That your library will leave any suid program seeing hell knows
> what.  Which gets very unpleasant when you are using it to do something
> with your files...
> 
>         That's besides the issues with races when two tasks that share
> namespace attempt to change it.
> 
> > http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html
> 

-- 
Rationality is the fundamental limitation to all human thought.


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  3:38       ` Ritesh Kumar
@ 2005-04-20  4:01         ` Al Viro
  0 siblings, 0 replies; 10+ messages in thread
From: Al Viro @ 2005-04-20  4:01 UTC (permalink / raw)
  To: Ritesh Kumar; +Cc: linux-fsdevel

On Tue, Apr 19, 2005 at 11:38:21PM -0400, Ritesh Kumar wrote:
> You are right. A more privileged process running as a child of
> another process should not be allowed to look at the same namespace as
> its parent.

No go.  That immediately breaks any suid program that takes a pathname
as an argument and is supposed to do something to the file in question.
Or uses dotfiles for per-user config.  gpg(1) fits both, for example,
and that's not something rarely used.  Moreover, used in fsckload of
scripts that are entirely out of your control, so something like "OK,
use it on stdin, then" is not an answer (and it still doesn't address
the second issue - gpg *does* need access to keyring, after all).

> Also, the access control for the filesystem is still in the kernel.
> What we change in the userspace is just the namespace and nothing
> else. If you are fundamentally denied access to a file (from the
> kernel) then you cannot access it no matter how you access it using
> userspace libraries.

The issue is not with being able to see something you shouldn't see.
It's being able to trick a more privileged process into accepting your
data as something it trusts.  Or not being able to use suid programs
on your own files at all.  Neither is acceptable.

BTW, your references to Plan 9 completely miss one very important thing -
they manage to live without any suid stuff at all.  Which is certainly
very nice, but not useful in our case, unless you volunteer to rewrite
suid applications to the form that would not need suid.

> Plus, it is not very clear to me what you mean by 'tasks'. If that
> is processes, then the child will inherit a separate copy of the
> namespace from the parent (Copy-on-write of the data structs of the
> user library probably... I'll have to think over this). So no race
> conditions here.

... and no working mount(8) either.


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  3:02   ` Ritesh Kumar
  2005-04-20  3:20     ` Al Viro
@ 2005-04-20 18:03     ` Bryan Henderson
  2005-04-20 18:37       ` Ritesh Kumar
  1 sibling, 1 reply; 10+ messages in thread
From: Bryan Henderson @ 2005-04-20 18:03 UTC (permalink / raw)
  To: Ritesh Kumar; +Cc: linux-fsdevel, linuxram

>In essence, I was
>thinking of splitting up the concepts of 1) accessing the filesystem on
>the HDD/device and 2) setting up a namespace for accessing the files
>into two separate concepts

I've been crusading for years to get people to understand that a classic 
Unix mount is composed of these two parts, and they don't have to be 
married together.  (1) is called creating a filesystem image and (2) is 
called mounting a filesystem image.

(2) isn't actually "setting up" a namespace.  There's one namespace. 
Mounting is adding the names in a filesystem to that namespace, and 
thereby making the named filesystem objects accessible.

The two pieces have been slowly divorcing over the years.  We now have a 
little-used ability to have a filesystem image exist without being mounted 
at all (you get that by forcibly unmounting a filesystem image that has 
open files.  The unmount happens right away, but the filesystem image 
continues to exist until the last file is closed).  We also have the bind 
mounts that add to the namespace without creating a new filesystem image. 
I would like someday to see the ability to create a filesystem image 
without ever mounting it, and access a file in it without ever adding it 
to the master file namespace.
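
The lifetime rule described above (an image outlives its mount for as
long as files in it stay open, which is the behaviour Linux exposes as
a lazy unmount via umount2() with MNT_DETACH) can be modelled with two
counters.  This is a toy illustration, not kernel code; struct fs_image
and the helper names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a filesystem image: it is referenced both by the places
 * it is mounted and by the files currently open in it. */
struct fs_image {
    unsigned int mounts;      /* attachments to the namespace */
    unsigned int open_files;  /* files still open in the image */
};

/* The image exists as long as anything still refers to it. */
bool image_alive(const struct fs_image *img)
{
    assert(img != NULL);
    return img->mounts > 0 || img->open_files > 0;
}

/* Detach one mount immediately; open files keep the image alive. */
void image_detach(struct fs_image *img)
{
    assert(img != NULL);
    if (img->mounts > 0)
        img->mounts--;
}

/* Close one open file; closing the last one may end the image. */
void image_close_file(struct fs_image *img)
{
    assert(img != NULL);
    if (img->open_files > 0)
        img->open_files--;
}
```

An image mounted once with two open files is still alive right after
image_detach(), and goes away only when the second file is closed.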

>bringing up 2) completely in the userspace.

That part's another issue.  The user-controls-his-namespace aspect of it 
has been commented on at length in this and another current thread.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20 18:03     ` Bryan Henderson
@ 2005-04-20 18:37       ` Ritesh Kumar
  0 siblings, 0 replies; 10+ messages in thread
From: Ritesh Kumar @ 2005-04-20 18:37 UTC (permalink / raw)
  To: Bryan Henderson; +Cc: linux-fsdevel

On 4/20/05, Bryan Henderson <hbryan@us.ibm.com> wrote:
> >In essence, I was
> >thinking of splitting up the concepts of 1) accessing the filesystem on
> >the HDD/device and 2) setting up a namespace for accessing the files
> >into two separate concepts
> 
> I've been crusading for years to get people to understand that a classic
> Unix mount is composed of these two parts, and they don't have to be
> married together.  (1) is called creating a filesystem image and (2) is
> called mounting a filesystem image.
> 
> (2) isn't actually "setting up" a namespace.  There's one namespace.
> Mounting is adding the names in a filesystem to that namespace, and
> thereby making the named filesystem objects accessible.
> 
> The two pieces have been slowly divorcing over the years.  We now have a
> little-used ability to have a filesystem image exist without being mounted
> at all (you get that by forcibly unmounting a filesystem image that has
> open files.  The unmount happens right away, but the filesystem image
> continues to exist until the last file is closed).  We also have the bind
> mounts that add to the namespace without creating a new filesystem image.
> I would like someday to see the ability to create  a filesystem image
> without ever mounting it, and access a file in it without ever adding it
> to the master file namespace.
> 
> >bringing up 2) completely in the userspace.
> 
> That part's another issue.  The user-controls-his-namespace aspect of it
> has been commented on at length in this and another current thread.
> 
> --
> Bryan Henderson                          IBM Almaden Research Center
> San Jose CA                              Filesystems
> 
> 

I totally agree with you. I came across this problem mainly because
there are a lot of times when I want to install custom software in
userspace and there is a limit to what you can do. I don't know if you
have looked at the installation procedure of gentoo or not; however,
in gentoo you install software by chrooting into an 'install
directory' and doing all the installation there. However, chrooting
requires root privileges, and for installing and executing software in
userspace chroot just misses the point.
Till now I have been manually doing installs by passing
installation prefixes to the configure scripts, but having a
filesystem namespace associated with a process seems to be just so
right :-). When I was busy implementing the prototype I had also
thought of integrating completely userspace filesystems (like FUSE)
into the userspace namespace. My hope was to come up with an elegant
solution to mount and use a device without resorting to any suid
binaries or scripts or help from root. However, again, we should
keep the two concepts of 'namespaces' and the 'actual
filesystem' separate in the system so that the device can be shared
(probably using a different namespace) by another user/process.

I get the point that per-process filesystem namespaces the way I
presented them may have some problems with suid binaries and the way
many pieces of software work. I have not used gpg so I am not
qualified to say anything about gpg here. I should say that I don't have
an idea of the amount of backward compatibility we shall lose if we
use my approach. However, if I am getting things right, I believe
that the backward compatibility is lost just by introducing "user
changeable filesystem namespaces" (which in my opinion is a valuable
thing to have in an operating system) irrespective of the
implementation.

-- 
Rationality is the fundamental limitation to all human thought.


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20  1:50 ` Ram
  2005-04-20  3:02   ` Ritesh Kumar
@ 2005-04-20 12:47   ` Eric Van Hensbergen
  2005-04-20 17:07     ` Ram
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Van Hensbergen @ 2005-04-20 12:47 UTC (permalink / raw)
  To: Ram; +Cc: linux-fsdevel, Al Viro

On 4/19/05, Ram <linuxram@us.ibm.com> wrote:
> On Tue, 2005-04-19 at 18:24, Eric Van Hensbergen wrote:
> >
> > Is this sufficient to cover any exposure?  What's the correct solution
> > for the shared sub-trees RFC?  Should there be something similar for
> > user mounts/binds?
> 
> A new namespace in a shared subtree realm can create number-of-
> private-namespaces number of mounts or binds depending on the number of
> binds and mounts in the shared tree.
> 
> for example if  there were 10 shared vfsmounts in the original
> namespace, a new private namespace will duplicate 10 of these, and
> any mount or bind attempted in any of these vfsmounts will double the
> number of mounts and binds.
> 
> Hence probably you may want to keep a tab on the number mounts and
> binds a user does, instead of keeping a tab on the number of namespaces
> a user creates.
> 

Yeah, that does make a lot more sense.  I suppose in the worst case a
user is guaranteed to not have more namespaces than processes anyway.
So, should the count of mounts be inclusive of mounts the user
inherits, or only the ones he creates?  I suppose as a resource limit,
it should probably cover both.

         -eric


* Re: [RFC] User CLONE_NEWNS permission and rlimits
  2005-04-20 12:47   ` Eric Van Hensbergen
@ 2005-04-20 17:07     ` Ram
  0 siblings, 0 replies; 10+ messages in thread
From: Ram @ 2005-04-20 17:07 UTC (permalink / raw)
  To: Eric Van Hensbergen; +Cc: linux-fsdevel, Al Viro

On Wed, 2005-04-20 at 05:47, Eric Van Hensbergen wrote:
> On 4/19/05, Ram <linuxram@us.ibm.com> wrote:
> > On Tue, 2005-04-19 at 18:24, Eric Van Hensbergen wrote:
> > >
> > > Is this sufficient to cover any exposure?  What's the correct solution
> > > for the shared sub-trees RFC?  Should there be something similar for
> > > user mounts/binds?
> > 
> > A new namespace in a shared subtree realm can create number-of-
> > private-namespaces number of mounts or binds depending on the number of
> > binds and mounts in the shared tree.
> > 
> > for example if  there were 10 shared vfsmounts in the original
> > namespace, a new private namespace will duplicate 10 of these, and
> > any mount or bind attempted in any of these vfsmounts will double the
> > number of mounts and binds.
> > 
> > Hence probably you may want to keep a tab on the number mounts and
> > binds a user does, instead of keeping a tab on the number of namespaces
> > a user creates.
> > 
> 
> Yeah, that does make a lot more sense, I suppose in the worst case a
> user is guaranteed to not have more namespaces than processes anyways.
>  So, should the count of mounts be inclusive of mounts the user
> inherits, or only the ones he creates?  I suppose as a resource limit,
> it should probably cover both.

Yes, I think it should be both. It should be the sum total of all the
mounts that exist in all the user-created namespaces.

I would not count "the mounts that propagated to some other namespace
because of a mount in the user's namespace" towards the total, because
those mounts are for some other user/namespace.
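
As a sketch of that accounting rule: charge each mount to whoever
created the namespace it lives in, whether the mount was inherited or
made directly, and ignore who triggered it.  The record layout and the
function name are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-mount record used only for this illustration. */
struct mount_rec {
    uid_t ns_owner;  /* user who created the namespace holding the mount */
    uid_t mounter;   /* user whose mount call produced it, directly or
                        through propagation */
};

/* Count the mounts charged to `user`: everything sitting in namespaces
 * that user created.  Mounts that propagated into someone else's
 * namespace are charged to that namespace's owner instead. */
unsigned long mounts_charged_to(uid_t user, const struct mount_rec *m, size_t n)
{
    unsigned long total = 0;

    assert(m != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (m[i].ns_owner == user)
            total++;
    return total;
}
```

If user 1000 inherits one mount and adds another that also propagates
into user 2000's namespace, user 1000 is charged 2 and user 2000 is
charged 1.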

RP
> 
>          -eric


