* bind mounting namespace inodes for unprivileged users
@ 2016-05-03 18:20 James Bottomley
2016-05-03 21:22 ` Serge Hallyn
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: James Bottomley @ 2016-05-03 18:20 UTC
To: Linux Containers, util-linux
Right at the moment, unprivileged users cannot call mount --bind to
create a permanent copy of any of their namespaces. This is annoying
because it means that for entry to long running containers you have to
spawn an undying process and use nsenter via the /proc/<pid>/ns files.
The first question is: assuming we restrict it to bind mounting only
nsfs inodes, is there any reason an unprivileged user shouldn't be able
to bind a namespace they've created to a file they own in the initial
mount namespace?
Assuming the answer to this is no, then how to implement it becomes the
next problem. Right at the moment, util-linux/mount will deny a
non-root user the ability to use --bind. This check could be relaxed
and, since mount is setuid root, it could be modified to force the
binding as root, meaning this could be implemented entirely within the
util-linux package.
Doing this from within the kernel sys_mount is much more problematic:
non-root users are forbidden from calling any type of mount by the
may_mount() check, which makes sure you only have root capability in
the user_ns attached to the current mnt_ns. Overriding that simply to
allow nsfs binding looks like a recipe for introducing unexpected
security problems.
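
To see roughly which side of that check a given shell falls on, here is
a minimal sketch. It assumes CAP_SYS_ADMIN is bit 21 of the CapEff mask
reported in /proc/self/status; both function names are hypothetical.
CapEff is relative to the task's own user namespace, so this only
approximates what may_mount() evaluates against the user_ns that owns
the current mnt_ns.

```shell
# Hypothetical helper: succeed if the hex capability mask in $1 has
# CAP_SYS_ADMIN (assumed to be bit 21) set.
has_cap_sys_admin() {
    [ $(( (0x$1 >> 21) & 1 )) -eq 1 ]
}

# Hypothetical helper: print the effective capability mask. awk reads
# its own /proc/self/status, which it inherits from the calling shell.
current_cap_eff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

# Usage:
#   if has_cap_sys_admin "$(current_cap_eff)"; then
#       echo "CAP_SYS_ADMIN present in this user namespace"
#   fi
```

An unprivileged login shell normally reports an all-zero mask, which is
why a direct mount(2) call is refused there.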
So, does anyone have any strong (or even weak) opinions about this
before I start coding patches?
James
* Re: bind mounting namespace inodes for unprivileged users
2016-05-03 18:20 bind mounting namespace inodes for unprivileged users James Bottomley
@ 2016-05-03 21:22 ` Serge Hallyn
2016-05-04 11:15 ` James Bottomley
2016-05-04 8:44 ` Karel Zak
2016-05-04 14:38 ` Eric W. Biederman
2 siblings, 1 reply; 9+ messages in thread
From: Serge Hallyn @ 2016-05-03 21:22 UTC
To: James Bottomley; +Cc: Linux Containers, util-linux
Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> Right at the moment, unprivileged users cannot call mount --bind to
> create a permanent copy of any of their namespaces. This is annoying
> because it means that for entry to long running containers you have to
> spawn an undying process and use nsenter via the /proc/<pid>/ns files.
>
> The first question is: assuming we restrict it to bind mounting only
> nsfs inodes, is there any reason an unprivileged user shouldn't be able
> to bind a namespace they've created to a file they own in the initial
> mount namespace?
>
> Assuming the answer to this is no, then how to implement it becomes the
> next problem. Right at the moment, util-linux/mount will deny a
> non-root user the ability to use --bind. This check could be relaxed
> and, since mount is setuid root, it could be modified to force the
> binding as root, meaning this could be implemented entirely within the
> util-linux package.
>
> Doing this from within the kernel sys_mount is much more problematic:
> non-root users are forbidden from calling any type of mount by the
> may_mount() check, which makes sure you only have root capability in
> the user_ns attached to the current mnt_ns. Overriding that simply to
> allow nsfs binding looks like a recipe for introducing unexpected
> security problems.
>
> So, does anyone have any strong (or even weak) opinions about this
> before I start coding patches?
Hi,
so this is a bit scatterbrained, but it points to what I think is
a workable way to do this all unprivileged (well, besides the
privilege conferred by newuidmap/newgidmap). Assume you are
uid 1000 and have a /etc/sub{u,g}id entry joe:100000:65536.
Start by creating one container (namespace, whatever you want to
call it) which has uid 1000 mapped to container root, and all subuids
mapped into the container so that container root is privileged over
them. This container/namespace creates a private mntns which is
where you'll be keeping the persistent nsfs bind mounts. Let's
call this the 'factotum' for the duration of this email.
Now say you create a container with 100000 as container root and you
want to persist its user and network namespaces. The init task (which
you don't want to keep around) is pid 999. Uid 1000 cannot see under
/proc/999/ns, but a task in your factotum can. So it can open
/proc/999/ns/net and /proc/999/ns/user and bind mount them. Any time a
task (pid 1999) owned by 1000 on the host wants to use such an inode,
the factotum can open it, and task 1999 can open /proc/$(pidof
factotum)/fd/N, or the factotum could simply pass the open fds over a
unix socket. Any task spawned by uid 1000 should then be able to setns
using those fds.
This is something which could be done transparently by 'unshare'
and 'nsenter'.
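
A minimal sketch of the fd-passing half of this, under stated
assumptions: factotum_fd_path is a hypothetical helper, pid 999 and the
fd numbers 8 and 9 are only illustrative, and whether the open and the
subsequent setns succeed depends on the permission checks described
above, which this sketch does not verify.

```shell
# Hypothetical helper: path under which another task can reach file
# descriptor $2 held open by the factotum process $1.
factotum_fd_path() {
    printf '/proc/%s/fd/%s\n' "$1" "$2"
}

# Inside the factotum (assumption: pid 999 is the container's init task):
#   exec 8</proc/999/ns/user
#   exec 9</proc/999/ns/net
#
# From a task owned by uid 1000 on the host (assumption: the factotum's
# pid is in $factotum_pid):
#   nsenter --user="$(factotum_fd_path "$factotum_pid" 8)" \
#           --net="$(factotum_fd_path "$factotum_pid" 9)"
```

Once the factotum holds the descriptors open, the init task (pid 999)
can exit and the namespaces stay alive for as long as the fds do.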
-serge
* Re: bind mounting namespace inodes for unprivileged users
2016-05-03 18:20 bind mounting namespace inodes for unprivileged users James Bottomley
2016-05-03 21:22 ` Serge Hallyn
@ 2016-05-04 8:44 ` Karel Zak
2016-05-04 13:16 ` James Bottomley
2016-05-04 14:38 ` Eric W. Biederman
2 siblings, 1 reply; 9+ messages in thread
From: Karel Zak @ 2016-05-04 8:44 UTC
To: James Bottomley; +Cc: Linux Containers, util-linux
On Tue, May 03, 2016 at 02:20:56PM -0400, James Bottomley wrote:
> Right at the moment, unprivileged users cannot call mount --bind to
> create a permanent copy of any of their namespaces. This is annoying
> because it means that for entry to long running containers you have to
> spawn an undying process and use nsenter via the /proc/<pid>/ns files.
Well, unshare is able to create permanent namespaces and the bind
mounts and nsenter is able to follow these files, but you need root
permissions to create this stuff.
touch /home/kzak/ns
sudo unshare --uts=/home/kzak/ns
<exit namespace>
sudo nsenter --uts=/home/kzak/ns
it means you really do not need any process in the namespace.
Not sure about unprivileged users, it always sounds like a game with
Pandora's box ;-)
Karel
--
Karel Zak <kzak@redhat.com>
http://karelzak.blogspot.com
* Re: bind mounting namespace inodes for unprivileged users
2016-05-03 21:22 ` Serge Hallyn
@ 2016-05-04 11:15 ` James Bottomley
0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2016-05-04 11:15 UTC
To: Serge Hallyn; +Cc: Linux Containers, util-linux
On Tue, 2016-05-03 at 21:22 +0000, Serge Hallyn wrote:
> Quoting James Bottomley (James.Bottomley@HansenPartnership.com):
> > Right at the moment, unprivileged users cannot call mount --bind to
> > create a permanent copy of any of their namespaces. This is
> > annoying because it means that for entry to long running containers
> > you have to spawn an undying process and use nsenter via the
> > /proc/<pid>/ns files.
> >
> > The first question is: assuming we restrict it to bind mounting
> > only nsfs inodes, is there any reason an unprivileged user
> > shouldn't be able to bind a namespace they've created to a file
> > they own in the initial mount namespace?
> >
> > Assuming the answer to this is no, then how to implement it becomes
> > the next problem. Right at the moment, util-linux/mount will deny
> > a non-root user the ability to use --bind. This check could be
> > relaxed and, since mount is setuid root, it could be modified to
> > force the binding as root meaning this could be implemented
> > entirely within the util-linux package.
> >
> > Doing this from within the kernel sys_mount is much more
> > problematic: non-root users are forbidden from calling any type of
> > mount by the may_mount() check, which makes sure you only have root
> > capability in the user_ns attached to the current mnt_ns.
> > Overriding that simply to allow nsfs binding looks like a recipe
> > for introducing unexpected security problems.
> >
> > So, does anyone have any strong (or even weak) opinions about this
> > before I start coding patches?
>
> Hi,
>
> so this is a bit scatterbrained, but it points to what I think is
> a workable way to do this all unprivileged (well, besides the
> privilege conferred by newuidmap/newgidmap). Assume you are
> uid 1000 and have a /etc/sub{u,g}id entry joe:100000:65536.
>
> Start by creating one container (namespace, whatever you want to
> call it) which has uid 1000 mapped to container root, and all subuids
> mapped into the container so that container root is privileged over
> them. This container/namespace creates a private mntns which is
> where you'll be keeping the persistent nsfs bind mounts. Let's
> call this the 'factotum' for the duration of this email.
>
> Now say you create a container with 100000 as container root and you
> want to persist its user and network namespaces. The init task
> (which you don't want to keep around) is pid 999. Uid 1000 cannot
> see under /proc/999/ns, but a task in your factotum can.
Actually, the process that first created the userns is you in the
parent namespace. You need to call the newuidmap, newgidmap on a
different task for this process, so if you persist the original process
that first entered the namespace, you can use it to access the
container even though it has no uid mapping inside the namespace. That
means you can actually get away without using a factotum container at
all because nsenter enters the userns first, so even if your --user
points to the initial process and --mount points to some process you
don't have access to, you'll gain entry.
This is a script that demonstrates this:
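# Create a user namespace and keep it alive with a long-running sleep.
unshare --user sleep 365d &
userns=$!
ln -s /proc/$userns/ns/user myuserns
sleep 1 # need ns to be entered and started
# Map ids 0..999 in the namespace onto the subordinate range at 100000.
newuidmap $userns 0 100000 1000
newgidmap $userns 0 100000 1000
# Create a mount namespace owned by that user namespace.
nsenter --user=myuserns unshare --mount sleep 365d &
ln -s /proc/$!/ns/mnt mymntns
sleep 1 # wait for ns to be entered and started
# nsenter joins the user namespace first, then the mount namespace.
nsenter --user=myuserns --mount=mymntns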
> So it can open /proc/999/ns/net and /proc/999/ns/user and bind
> mount them. Any time a task (pid 1999) owned by 1000 on the host
> wants to use such an inode, the factotum can open it, and task 1999
> can open /proc/$(pidof factotum)/fd/N, or the factotum could simply
> pass the open fds over a unix socket. Any task spawned by uid 1000
> should then be able to setns using those fds.
>
> This is something which could be done transparently by 'unshare'
> and 'nsenter'.
Something like this is what I do today with architecture emulation
containers. The thing is that the factotum container still needs a
long running process to keep it around (I currently use sleep 365d),
plus you need to remember the pid and the fd for your other containers
rather than names if you use bind (although you can install symbolic
links where you would have installed the bind mount to help you
remember this, so it's a minor quibble).
But the question I still come back to is should the user be allowed to
bind mount this in the original mount namespace instead of using
symbolic links and having to persist a process inside the container.
The emulation containers I build naturally don't have any processes
inside them because you only enter them when you want to begin
emulating a different architecture. For me, the nice thing about bind
mounts is that the container is gone when I unmount them. Without bind
mounting, I find I still have a long procession of long running sleeps
keeping containers I don't want around after I've finished playing with
some new thing ... and if I don't kill them carefully (the mount sleeps
have to be killed from within the userns), I end up with inaccessible
unkillable containers.
So I'd still like to know whether there is a valid reason to deny
unprivileged users the ability to bind containers.
James
* Re: bind mounting namespace inodes for unprivileged users
2016-05-04 8:44 ` Karel Zak
@ 2016-05-04 13:16 ` James Bottomley
0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2016-05-04 13:16 UTC
To: Karel Zak; +Cc: Linux Containers, util-linux
On Wed, 2016-05-04 at 10:44 +0200, Karel Zak wrote:
> On Tue, May 03, 2016 at 02:20:56PM -0400, James Bottomley wrote:
> > Right at the moment, unprivileged users cannot call mount --bind to
> > create a permanent copy of any of their namespaces. This is
> > annoying because it means that for entry to long running containers
> > you have to spawn an undying process and use nsenter via the
> > /proc/<pid>/ns files.
>
> Well, unshare is able to create permanent namespaces and the bind
> mounts and nsenter is able to follow these files, but you need root
> permissions to create this stuff.
>
> touch /home/kzak/ns
> sudo unshare --uts=/home/kzak/ns
> <exit namespace>
>
> sudo nsenter --uts=/home/kzak/ns
>
> it means you really do not need any process in the namespace.
Yes, I do this when I'm root.
> Not sure about unprivileged users, it always sounds like a game with
> Pandora's box ;-)
But that's currently my specific problem: binding a container when I'm
an unprivileged user. I was thinking of persuading mount to do it, but
unshare could as well, provided it's setuid root. I'm leery of
proliferating setuid root binaries, which is why I was looking at
mount, but I could easily (more easily than mount) make unshare do it
if that's preferred.
James
* Re: bind mounting namespace inodes for unprivileged users
2016-05-03 18:20 bind mounting namespace inodes for unprivileged users James Bottomley
2016-05-03 21:22 ` Serge Hallyn
2016-05-04 8:44 ` Karel Zak
@ 2016-05-04 14:38 ` Eric W. Biederman
2016-05-04 17:28 ` James Bottomley
2 siblings, 1 reply; 9+ messages in thread
From: Eric W. Biederman @ 2016-05-04 14:38 UTC
To: James Bottomley; +Cc: Linux Containers, util-linux
James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> Right at the moment, unprivileged users cannot call mount --bind to
> create a permanent copy of any of their namespaces. This is annoying
> because it means that for entry to long running containers you have to
> spawn an undying process and use nsenter via the /proc/<pid>/ns files.
>
> The first question is: assuming we restrict it to bind mounting only
> nsfs inodes, is there any reason an unprivileged user shouldn't be able
> to bind a namespace they've created to a file they own in the initial
> mount namespace?
Own, have read/write and unlink privileges.
My big concern would be the fact that a bind mount today makes a file
immune from unlink. So it would mess up rm -rf.
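
A minimal sketch of how a script might look for such obstacles before
running rm -rf, assuming the mount point is field 5 of each line in
/proc/self/mountinfo. mounts_under is a hypothetical helper, and mount
points containing spaces (escaped as \040 in mountinfo) are not decoded.

```shell
# Hypothetical helper: print every mount point at or below directory $1.
# $2 optionally names a mountinfo-format file; the default is the
# caller's own /proc/self/mountinfo.
mounts_under() {
    awk -v dir="${1%/}" '
        # Match the directory itself or anything beneath it.
        $5 == dir || index($5, dir "/") == 1 { print $5 }
    ' "${2:-/proc/self/mountinfo}"
}

# Usage:
#   mounts_under "$HOME"    # anything listed would make rm -rf fail
```

Only mounts visible in the caller's mount namespace are reported.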
That might not be worse than what a setuid fuse mount binary allows
today.
I wonder if there might be a way to set up a user namespace and mount
namespace combination so users could manage mounts in their own login
shells, just like is allowed in Plan 9.
Long term I think that would be more satisfactory.
> So, does anyone have any strong (or even weak) opinions about this
> before I start coding patches?
The mount namespace is complex and getting it right is a pain in the
rear. So adding yet another path and piece in to the existing
complexity makes me cringe a little.
Eric
* Re: bind mounting namespace inodes for unprivileged users
2016-05-04 14:38 ` Eric W. Biederman
@ 2016-05-04 17:28 ` James Bottomley
2016-05-04 17:43 ` Eric W. Biederman
0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2016-05-04 17:28 UTC
To: Eric W. Biederman; +Cc: Linux Containers, util-linux
On Wed, 2016-05-04 at 09:38 -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>
> > Right at the moment, unprivileged users cannot call mount --bind to
> > create a permanent copy of any of their namespaces. This is
> > annoying because it means that for entry to long running containers
> > you have to spawn an undying process and use nsenter via the
> > /proc/<pid>/ns files.
> >
> > The first question is: assuming we restrict it to bind mounting
> > only nsfs inodes, is there any reason an unprivileged user
> > shouldn't be able to bind a namespace they've created to a file
> > they own in the initial mount namespace?
>
> Own, have read/write and unlink privileges.
>
> My big concern would be the fact that a bind mount today makes a file
> immune from unlink. So it would mess up rm -rf.
Yes, that's true. You have to unmount a bind mount, even of a file,
before you can remove it. The way we mostly cope with this today is to
install the bind mounts on a tmpfs ... however, the unprivileged user
can't mount a tmpfs either ...
However, when I experimented, it seems that the rm isn't hard and fast.
If I create a file outside the mount namespace, but then bind mount it
within the mount namespace, I can still remove it from the outside, in
which case the binding also disappears. The is_local_mountpoint() check
in vfs_unlink() returns false because the file isn't covered outside
the child mount namespace. It doesn't look like too much bother to
make unlink do the same for bind mounted files regardless of whether
the mount point is covered by another bind mounted file (although
obviously keeping the same semantics for directories).
> That might not be worse than what a setuid fuse mount binary allows
> today.
It's about the same: you can't remove the fuse mount point until it
gets unmounted. If you have gvfs, you can see this by looking at
/run/user/<uid>/gvfs
> I wonder if there might be a way to set up a user namespace and mount
> namespace combination so users could manage mounts in their own login
> shells, just like is allowed in plan 9. Long term I think that would
> be more satisfactory.
So I thought about this as well. However, you do want a single user
and mount namespace for all logins, which means it would have to be
managed by the login process itself. That seemed to be quite a large
thing to parametrise to login.
> > So, does anyone have any strong (or even weak) opinions about this
> > before I start coding patches?
>
> The mount namespace is complex and getting it right is a pain in the
> rear. So adding yet another path and piece in to the existing
> complexity makes me cringe a little.
Yes, well which is worse: having no way to bind unprivileged containers
without spawning a long running process or having a way to bind them
which may lead to unremovable files. Since I just use sudo mount
--bind anyway for my containers, I don't see the file removal argument
as too daunting.
James
* Re: bind mounting namespace inodes for unprivileged users
2016-05-04 17:28 ` James Bottomley
@ 2016-05-04 17:43 ` Eric W. Biederman
2016-05-04 18:00 ` James Bottomley
0 siblings, 1 reply; 9+ messages in thread
From: Eric W. Biederman @ 2016-05-04 17:43 UTC
To: James Bottomley; +Cc: Linux Containers, util-linux
James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> On Wed, 2016-05-04 at 09:38 -0500, Eric W. Biederman wrote:
>> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
>>
>> > Right at the moment, unprivileged users cannot call mount --bind to
>> > create a permanent copy of any of their namespaces. This is
>> > annoying because it means that for entry to long running containers
>> > you have to spawn an undying process and use nsenter via the
>> > /proc/<pid>/ns files.
>> >
>> > The first question is: assuming we restrict it to bind mounting
>> > only nsfs inodes, is there any reason an unprivileged user
>> > shouldn't be able to bind a namespace they've created to a file
>> > they own in the initial mount namespace?
>>
>> Own, have read/write and unlink privileges.
>>
>> My big concern would be the fact that a bind mount today makes a file
>> immune from unlink. So it would mess up rm -rf.
>
> Yes, that's true. You have to unmount a bind mount, even of a file,
> before you can remove it. The way we mostly cope with this today is to
> install the bind mounts on a tmpfs ... however, the unprivileged user
> can't mount a tmpfs either ...
>
> However, when I experimented, it seems that the rm isn't hard and fast.
> If I create a file outside the mount namespace, but then bind mount it
> within the mount namespace, I can still remove it from the outside, in
> which case the binding also disappears. The is_local_mountpoint() check
> in vfs_unlink() returns false because the file isn't covered outside
> the child mount namespace. It doesn't look like too much bother to
> make unlink do the same for bind mounted files regardless of whether
> the mount point is covered by another bind mounted file (although
> obviously keeping the same semantics for directories).
True, although that will be a potentially long conversation. The
existing semantics were a bug fix for security issues with user
namespaces and mount namespaces. I would have loved not to have
added is_local_mountpoint, but that was the compromise between fixing
the issues and remaining backwards compatible.
>> That might not be worse than what a setuid fuse mount binary allows
>> today.
>
> It's about the same: you can't remove the fuse mount point until it
> gets unmounted. If you have gvfs, you can see this by looking at
> /run/user/<uid>/gvfs
I don't have it handy, and GNOME and I parted ways several versions ago,
but yes. My point is that the unprivileged fuse case makes a good
precedent and example to follow.
>> I wonder if there might be a way to set up a user namespace and mount
>> namespace combination so users could manage mounts in their own login
>> shells, just like is allowed in plan 9. Long term I think that would
>> be more satisfactory.
>
> So I thought about this as well. However, you do want a single user
> and mount namespace for all logins, which means it would have to be
> managed by the login process itself. That seemed to be quite a large
> thing to parametrise to login.
No. This can be done with pam. Last I looked there was even a
pam_namespace plugin for dealing with the mount namespace. The only
real issue I can think of is that exec likes to drop capabilities
(unless your uid == 0).
I remember reviewing the kernel's namespace semantics with a nod towards
using them in a pam plugin several years ago, and it should be possible
to have a shared container for all of a person's logins if that is
desired, or a separate container per login if that is desired.
>> > So, does anyone have any strong (or even weak) opinions about this
>> > before I start coding patches?
>>
>> The mount namespace is complex and getting it right is a pain in the
>> rear. So adding yet another path and piece in to the existing
>> complexity makes me cringe a little.
>
> Yes, well which is worse: having no way to bind unprivileged containers
> without spawning a long running process or having a way to bind them
> which may lead to unremovable files. Since I just use sudo mount
> --bind anyway for my containers, I don't see the file removal argument
> as too daunting.
So far with setns support I haven't felt the need to bind mount
containers. So I am not certain it is an either or choice.
And of course the other side of the craziness is having a mount point on
a filesystem makes that filesystem unmountable (except for lazy
unmounts). So getting this wrong could affect clean shutdowns and
reboots. Which suggests it may be wise to limit this kind of thing
to a tmpfs like /run/user/<uid>/.
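
A minimal sketch of that restriction, under assumptions: that
XDG_RUNTIME_DIR (when set) points at the per-user runtime directory,
that GNU coreutils stat is available for its -f -c %T filesystem-type
output, and that the "ns" subdirectory name and both function names are
hypothetical.

```shell
# Hypothetical helper: directory in which to keep namespace bind mounts.
# Falls back to /run/user/<uid> when XDG_RUNTIME_DIR is unset.
ns_bind_dir() {
    printf '%s/ns\n' "${XDG_RUNTIME_DIR:-/run/user/$(id -u)}"
}

# Hypothetical helper: succeed only if path $1 lives on a tmpfs.
# Relies on GNU stat; a missing path yields no output and thus fails.
is_on_tmpfs() {
    [ "$(stat -f -c %T -- "$1" 2>/dev/null)" = tmpfs ]
}

# Usage:
#   dir=$(ns_bind_dir)
#   mkdir -p "$dir" && is_on_tmpfs "$dir" || echo "refusing: not tmpfs"
```

Because a tmpfs has no backing device to unmount cleanly, a stray
binding there should not get in the way of shutdown.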
Mostly this is my way of saying tread carefully because there be dragons
here.
Eric
* Re: bind mounting namespace inodes for unprivileged users
2016-05-04 17:43 ` Eric W. Biederman
@ 2016-05-04 18:00 ` James Bottomley
0 siblings, 0 replies; 9+ messages in thread
From: James Bottomley @ 2016-05-04 18:00 UTC
To: Eric W. Biederman; +Cc: Linux Containers, util-linux
On Wed, 2016-05-04 at 12:43 -0500, Eric W. Biederman wrote:
> James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> > On Wed, 2016-05-04 at 09:38 -0500, Eric W. Biederman wrote:
> > > James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> > > > So, does anyone have any strong (or even weak) opinions about
> > > > this before I start coding patches?
> > >
> > > The mount namespace is complex and getting it right is a pain in
> > > the rear. So adding yet another path and piece in to the
> > > existing complexity makes me cringe a little.
> >
> > Yes, well which is worse: having no way to bind unprivileged
> > containers without spawning a long running process or having a way
> > to bind them which may lead to unremovable files. Since I just use
> > sudo mount --bind anyway for my containers, I don't see the file
> > removal argument as too daunting.
>
> So far with setns support I haven't felt the need to bind mount
> containers. So I am not certain it is an either or choice.
>
> And of course the other side of the craziness is having a mount point
> on a filesystem makes that filesystem unmountable (except for lazy
> unmounts). So getting this wrong could affect clean shutdowns and
> reboots.
OK, I buy this argument a little. It could be circumvented by having
the shutdown script remove all container bindings, though. This seems
to work:
umount -t nsfs -a
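
Before running that from a shutdown script, it may help to see which
mounts it would touch. A minimal sketch, assuming the mountinfo layout
in which the filesystem type is the first field after the lone "-"
separator and the mount point is field 5; nsfs_mounts is a hypothetical
helper name.

```shell
# Hypothetical helper: print the mount point of every nsfs mount.
# $1 optionally names a mountinfo-format file; the default is the
# caller's own /proc/self/mountinfo.
nsfs_mounts() {
    awk '{
        # Fields 1-6 are fixed; optional fields may follow before "-".
        for (i = 7; i <= NF; i++)
            if ($i == "-") {
                if ($(i + 1) == "nsfs") print $5
                break
            }
    }' "${1:-/proc/self/mountinfo}"
}

# Usage:
#   nsfs_mounts    # list the namespace bindings currently in place
```

Only bindings visible in the caller's mount namespace are listed.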
> Which suggests it may be wise to limit this kind of thing
> to a tmpfs like /run/user/<uid>/.
>
> Mostly this is my way of saying tread carefully because there be dragons
> here.
Understood. Even though fixing the pinned filesystem issue can be
done, I do agree that it makes the problem knottier.
James
Thread overview: 9+ messages
2016-05-03 18:20 bind mounting namespace inodes for unprivileged users James Bottomley
2016-05-03 21:22 ` Serge Hallyn
2016-05-04 11:15 ` James Bottomley
2016-05-04 8:44 ` Karel Zak
2016-05-04 13:16 ` James Bottomley
2016-05-04 14:38 ` Eric W. Biederman
2016-05-04 17:28 ` James Bottomley
2016-05-04 17:43 ` Eric W. Biederman
2016-05-04 18:00 ` James Bottomley